New Source to Target experience
The new Source to Target experience is currently available only in private preview.
The "Source to Target Rivers" feature in Data Integration simplifies building robust data pipelines. It lets you extract data from a designated source and load it into a target system. This automation facilitates the detection of incoming data structures, and Data Integration generates the corresponding target tables and columns for data storage without manual intervention.
You can connect MySQL, PostgreSQL, or Oracle to Snowflake or BigQuery using the Data Integration Source to Target Rivers feature. This guide includes step-by-step instructions for setting up and configuring the source and target, letting you establish seamless data flows between your chosen platforms.
Before you begin
Navigate to the Create River section and select Source to Target River from the available options.
Step 1: set up the data source
In the Source step, select the data source from which to ingest data. You can customize the ingestion process using the available parameters.
Data Integration provides a comprehensive list of supported connectors, letting you choose the appropriate source for your data needs.
Creating a connection
After selecting MySQL, PostgreSQL, or Oracle, choose an existing connection or create a new one. If a connection already exists in the Connections tab, it appears in the drop-down menu. To create a new one, select Add new connection.
Test the connection to ensure it is valid before running the River.
Step 2: select data target
In the Target tab, select the Snowflake destination where the source data is loaded. Ensure that the selected Snowflake instance meets your data integration requirements. This step configures the destination for the data flow, ensuring that the source data is directed to the correct Snowflake environment for further processing or analysis.
Creating a target connection
Select an existing Target connection or create a new one. If the Target is a cloud data warehouse, you must specify the database, schema, and target table where the data is loaded.
Data loading settings
Define the database, schema, and target table where data from your selected source lands. Data Integration automatically detects the available databases and schemas for you to choose from.
In "Snowflake" and "BigQuery", the connection form includes a Default Pre-Populated Values option. You can specify the values you want to work with, and the connection automatically remembers and uses these as the default.
Advanced settings
Data Integration provides a variety of advanced settings specific to Snowflake, letting you manage and customize data handling with precision.
- Truncate Columns: Truncates any VARCHAR values that exceed the defined column length. This prevents data overflow and ensures consistency with the column definition.
- Replace Invalid UTF-8 Characters: Replaces invalid UTF-8 characters with the Unicode replacement character. This prevents invalid or corrupted characters from interfering with data processing.
- Add Data Integration metadata: When Data Integration metadata is turned on, additional metadata columns are added to the target table in Snowflake. These include:
  - River_last_update: Tracks the timestamp of the most recent update.
  - River_river_id: Identifies the specific River used for the data load.
  - River_run_id: Stores the unique ID of each run of the River.
  You can also use custom expressions to include additional metadata fields beyond the defaults. A sketch of how these columns might appear in the target table follows this list.
- Custom File Zone: Lets you specify a custom file zone for staging files before loading into Snowflake. This option helps manage file storage and properly handle large data files.
These advanced settings control how data is processed and stored in Snowflake, ensuring data integrity, compatibility, and enhanced metadata tracking.
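As a rough illustration of the Add Data Integration metadata option, the sketch below shows how the three metadata columns might sit alongside your own columns in the Snowflake target table. The table name, business columns, and column types are hypothetical assumptions; only the River_* column names come from the option described above.

```sql
-- Hypothetical Snowflake target table; "orders" and its business columns are
-- examples only. The River_* columns are the metadata columns added when
-- "Add Data Integration metadata" is turned on (their types are assumptions).
CREATE TABLE analytics.public.orders (
    order_id          NUMBER,
    customer_id       NUMBER,
    order_total       NUMBER(12, 2),
    created_at        TIMESTAMP_NTZ,
    River_last_update TIMESTAMP_NTZ,   -- timestamp of the most recent update
    River_river_id    VARCHAR,         -- ID of the River used for the data load
    River_run_id      VARCHAR          -- unique ID of each run of the River
);
```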
Step 3: configure schema
Choose an extraction mode that fits your requirements.
Extraction mode
Data Integration provides multiple extraction modes depending on the selected source. When using a source like an RDBMS (Relational Database Management System), you have two options for extracting data:
- CDC
- Standard Extraction
After making a selection, you can use the Extraction Mode option on the left side to switch between modes.
Change data capture
This mode tracks and extracts the data that has changed since the last extraction, making it efficient for high-volume data sources with frequent updates.
The Data Integration Change Data Capture (CDC) extraction mode monitors the logs or records generated by the source database to capture changes made to the source data. This change data is then collected, transformed, and loaded into the target database, keeping the target in sync with the source. To learn more, refer to the Database River Modes.
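Log-based CDC generally depends on the source database exposing its change log. The snippet below is only a generic sketch of that kind of source-side prerequisite, not the exact setup Data Integration requires; refer to the Database River Modes documentation for the authoritative steps.

```sql
-- PostgreSQL: log-based CDC tools typically read the write-ahead log, which must
-- be set to logical replication (takes effect after a server restart).
ALTER SYSTEM SET wal_level = 'logical';

-- MySQL: row-based binary logging is the usual prerequisite; it is normally set
-- in the server configuration (my.cnf) rather than at runtime, e.g.:
--   log_bin       = mysql-bin
--   binlog_format = ROW

-- Oracle: supplemental logging is commonly required, e.g.:
-- ALTER DATABASE ADD SUPPLEMENTAL LOG DATA;
```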

Standard extraction
This mode lets you map and transform data from multiple tables into a unified schema before loading it into the destination. You establish relationships between tables to link and load the data.
The Multi-Tables River mode in Data Integration uses SQL-based queries to perform data transformations. You can configure it to run on a defined schedule or trigger it manually. To learn more, refer to the Database River Modes.
After you select a table, its name appears in the Target configuration. You can then choose and configure the extraction method. Incremental extraction requires the specific settings described below.

Incremental extraction: "created_at" field
If you use the "Created_at" option in the incremental field, you must specify both a Start Date and an End Date to select the time range for data extraction. This ensures incremental load includes only records created within this period.
- Start date and end date configuration:
  - Start Date: Specify the beginning of the time range from which you want to pull data.
  - End Date: Leave this blank if you want the data extraction to continue up to the current River execution time.
- Automatic date updates:
  - After each river run, the Start Date is automatically updated to match the previous End Date. You can leave the End Date empty, letting the next run extract data from the end point of the previous run.
- Time zone offset:
  - Set the appropriate time zone offset to align data extraction with the local time when the River executes (if you set the End Date to empty).
- Days back:
  - Use the Days Back field to specify a number of days before the provided Start Date. This lets Data Integration pull data from a historical point relative to the Start Date.
  - The Start Date is not updated if a river run fails. To alter this behavior, navigate to More Options and select the checkbox to advance the Start Date even after a failed river run.
  - This is not recommended, as it may cause inconsistencies in your data extraction.
To learn more about selecting time periods, refer to the Source to Target River - general overview.
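Conceptually, the Start Date, End Date, and Days Back settings combine into a single time filter on the incremental field. The query below is an illustrative sketch of that filter using a hypothetical table and PostgreSQL-style date syntax; the query Data Integration actually generates may differ.

```sql
-- Hypothetical example: Start Date = 2024-01-01, Days Back = 3, End Date left
-- empty so extraction runs up to the current River execution time.
SELECT *
FROM orders  -- hypothetical source table
WHERE created_at >= TIMESTAMP '2024-01-01 00:00:00' - INTERVAL '3' DAY  -- Start Date minus Days Back
  AND created_at <  CURRENT_TIMESTAMP;                                  -- empty End Date = "now"
-- After a successful run, the Start Date advances to the previous run's end point.
```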
Definitions
These settings apply to all selected tables and modify schema definitions on the "Source" and "Target".
- Advanced Source Definitions: The options vary based on the selected extraction mode.
- Advanced Target Definitions: You can customize the loading mode depending on the target; each target provides its own set of supported loading modes. To learn more about loading modes, refer to the Targets.
Table settings
After selecting a specific table, a Settings page appears with three key options:
- Mapping
- Table Source Settings
- Table Target Settings
Mapping
All columns tab
In the All Columns tab under the Mapping section, you can view the Source and Target columns and their respective data Type and Mode. Click the arrow icon next to a column, update the settings as needed, and click Apply Changes to save each modification. You can also use the Calculated Column feature for advanced customizations. This feature lets you apply expressions, including mathematical and string operations, to data from the Source, which is useful for tailoring the output to specific business needs. For example, use functions to concatenate fields, perform arithmetic calculations, or transform data types. To learn more about using this feature, refer to the Targets.
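For instance, a calculated column could concatenate two source fields, derive a numeric value, or change a data type. The query below is a hypothetical sketch of such expressions (table and column names are invented); the exact expression syntax supported by the Calculated Column feature is described in the Targets documentation.

```sql
-- Hypothetical calculated-column expressions (table and column names are examples):
SELECT
    CONCAT(first_name, ' ', last_name) AS full_name,   -- concatenate string fields
    quantity * unit_price              AS line_total,  -- arithmetic calculation
    CAST(order_date AS DATE)           AS order_day    -- data type transformation
FROM orders;
```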
Match key
The Match Key section lets you define the keys used to match records during data loading. Use the arrow buttons to move the relevant columns to the left table. This step is essential when using the Upsert-Merge loading mode, as at least one match key is required to identify and merge records accurately.
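To see why a match key matters, the statement below sketches the kind of MERGE that Upsert-Merge loading conceptually performs in Snowflake, with a hypothetical order_id match key; the statement Data Integration actually issues may look different.

```sql
-- Conceptual Upsert-Merge: order_id acts as the match key (all names are hypothetical).
MERGE INTO analytics.public.orders AS tgt
USING staging.orders_batch AS src
    ON tgt.order_id = src.order_id            -- match key identifies existing records
WHEN MATCHED THEN UPDATE SET
    order_total = src.order_total,
    updated_at  = src.updated_at
WHEN NOT MATCHED THEN INSERT (order_id, order_total, updated_at)
    VALUES (src.order_id, src.order_total, src.updated_at);
```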
Cluster
The Cluster Key lets you organize data by selecting and moving the desired columns to the left table. These keys are arranged in descending order and help optimize performance when querying and managing large datasets.
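In Snowflake, the selected cluster keys correspond to the table's CLUSTER BY definition. The statement below is a hedged sketch with hypothetical table and column names, showing what choosing cluster keys maps to conceptually.

```sql
-- Hypothetical example: cluster the target table by event_date, then account_id.
ALTER TABLE analytics.public.events CLUSTER BY (event_date, account_id);
```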
Table source settings
The Table Source Settings section lets you define how data is extracted from the source. You can choose between Incremental or All data extraction methods, and configure Advanced Settings such as:
- Update Incremental Date Range on Failures: Determines if the date range should be updated even when an extraction fails.
- Interval Chunk Size: Defines the size of data chunks for extraction.
- Filter Expression: Enter a filter expression to control which data gets extracted.
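A filter expression is typically a condition evaluated against the source rows, effectively narrowing the extraction query. The example below is a hypothetical sketch (column names are invented); check the source documentation for the exact syntax Data Integration expects.

```sql
-- Hypothetical filter expression, limiting extraction to active 2024+ records:
--   status = 'active' AND created_at >= '2024-01-01'
-- Conceptually, it behaves like an extra WHERE condition on the source query:
SELECT * FROM customers
WHERE status = 'active' AND created_at >= '2024-01-01';
```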
Table target settings
This example is specific to Snowflake. Each Target has its own unique settings.
In the Table Target Settings, you can override the default target configuration to customize how the data is stored in the target system. Options include:
- Filter logical key duplication between files: Ensures that duplicate logical keys across files are filtered out.
- Enforce masking policy: Applies masking to sensitive data fields as per policy.
- Support escape character: Supports escape characters in data to handle special characters.
These options offer greater control over how data is handled and stored in the target.
Reload metadata
The Reload Metadata options let you refresh schema metadata within Data Integration. You can choose to:
- Reload Metadata for Selected Schema: This option refreshes the metadata for a specific schema, ensuring any recent changes are reflected.
- Reload Metadata for All Schemas: This option updates the metadata across all schemas within Data Integration. This is useful for applying widespread updates when working with multiple schemas.
Custom query mode
Data Integration provides a Custom Query mode that lets you load data into the platform using personalized SQL queries. Use this mode when your data import requires advanced flexibility and precise control.
Key aspects of the Custom Query mode include:
- User-defined queries: Write SQL queries to define the data to be loaded and determine the needed transformations.
- Data sources: The Custom Query mode supports loading data from various databases and data warehouses, making it suitable for different data environments.
- Automatic scheduling: Schedule queries to run data loads regularly or trigger them on demand, ensuring real-time access to the latest data.
Switching to Custom Query mode redirects you to the older version of Data Integration that supports this feature, and you cannot return to the previous mode.
To learn more about using Custom Query mode, refer to the Database River Modes.
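As an illustration of the kind of user-defined query Custom Query mode accepts, the statement below joins and aggregates two hypothetical source tables before loading; table and column names are invented.

```sql
-- Hypothetical custom query: join orders to customers and aggregate per customer.
SELECT
    c.customer_id,
    c.country,
    COUNT(o.order_id)  AS order_count,
    SUM(o.order_total) AS lifetime_value
FROM customers AS c
LEFT JOIN orders AS o
    ON o.customer_id = c.customer_id
GROUP BY c.customer_id, c.country;
```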
Step 4: schedule and settings
Scheduling the river
The River is scheduled to run automatically by default, which is the recommended setting. However, you can customize the schedule according to your preferences.
Timeout settings
You can specify timeout settings for River execution. Set a timeout duration to control how long the River runs before automatically terminating. This ensures that prolonged executions do not hinder system performance.
Notifications
Enable notifications to stay informed about the River's execution status. Enter your email address to receive notifications. To add multiple email addresses, separate them with a comma (,).
Additional river information
In this section, you can include any additional information relevant to the River to enhance clarity and maintain comprehensive documentation.
To learn more, refer to the Settings tab.
That topic describes an earlier UI version, but the feature remains unchanged.
River activation
After completing all configurations for your River, you can activate it. This step lets you verify everything works as expected.
- Activation process: Click Activation to monitor the status and ensure the River initializes correctly.
Sidebar features
Data Integration provides an additional sidebar with important options to enhance user experience.
- River Info: Access detailed information about the current river setup and configuration.
- Version history: Review the changes made to the River over time, including updates and modifications.
- Activities: Monitor and track all activities related to the River, such as data extraction and transformations.
- Variables: Manage and configure variables used within your river.
- Scheduling and notifications: Set up schedules for data runs and receive notifications based on the River's performance and status.
Editing and managing rivers
Only applicable to CDC extraction mode.
When you make changes to the River, you must reactivate it to restore full functionality. Reactivation ensures that updates or modifications are applied correctly, letting the River resume processing data.
Summary page
The Summary page (first page) provides a comprehensive overview of the River’s recent activity, performance metrics, and current configuration settings, letting you quickly review its status and key operational details.
Procedure
- Navigate to the Data Integration console.
- Click Rivers from the left-hand menu.
- Select the specific River you want to edit or review.
- The "Summary" page shows essential information about the River's current state and recent activity.
The Source, Target, and Schema tabs are identical to those used when creating a new River.

Deployments
Deploying configurations from one environment to another is supported for Rivers created using the new Source to Target experience.
The deployment process includes the following details:
- River type and status:
  - You can view the River type in the relevant column.
  - You can view the current status of the "Source Environment" and the intended status of the "Target Environment".
- Original status retention:
  - Rivers created with the new Source to Target experience retain their original status when deployed to the "Target Environment".
  - CDC Rivers are deployed with a Disabled status.
- Existing Rivers in the target environment:
  - If a River already exists in the "Target Environment", its configuration is updated without altering its status. For example, Active Rivers remain active, and their validation state does not reset.

Recommendation: While the process avoids re-validation, it is recommended to manually reactivate the Rivers to ensure proper functionality and integration in the "Target Environment".