Database river modes

Data Integration offers multiple database River modes to meet the needs of different use cases. Each River mode has its own advantages, and the choice of a River mode depends on the nature of the source data and the requirements of the target database.

River modes

Multi-Tables

Standard extraction

This mode lets you map and transform the data from multiple tables into a unified schema, before loading the data into the destination. The process involves defining the relationships between the tables so that data can be linked and loaded consistently. Multi-Tables River mode in Data Integration uses SQL-based queries to perform the data transformations, and can be configured to run on a schedule or triggered manually.

Standard extraction mode flowchart

Standard extraction mode flowchart

The steps

Define the source tables: Identify the tables you want to load into the destination and set them up as sources in Data Integration.
Establish relationships: Define the relationships between the source tables. For example, you may have a main table and several related tables that provide additional information about each record in the main table.
Map and transform data: Use SQL-based queries to map and transform the data from the source tables into a unified schema that is compatible with the destination. This may include operations such as merging columns from different tables, splitting columns into multiple fields, or aggregating data across multiple tables.
Load data into Target: Use Data Integration to load the transformed data into the destination. This may involve creating a new database or table, or updating existing data in an existing database or table.
Monitor and troubleshoot: Monitor the process to ensure that the data is loaded correctly, and troubleshoot any issues that may arise. For example, you may need to fix data that has been rejected due to data type mismatches or other errors.

Here are a few additional notes: Even though there is a default Loading Mode defined in the Target tab, any individual table's loading mode can be changed by selecting the Schema tab and then 'Edit' on the far right:

After choosing your tables, they are displayed with further editing possibilities. You can change the following on the Columns tab:

Check or uncheck the boxes next to the columns you want to include or exclude from the Target.
In the Target column, rename the field by clicking the name and typing a new one.
Double-click the current data type under Type to change the field to a new data type.
Change the field's mode under Mode.
By checking the Cluster Key box, you can make a field a Cluster Key, which is used for partitioning.
To make a field the key for Upsert-Merge, highlight it with the key to the left of the field name.
On the Table Settings tab, you can edit the following:
- You can extract data in two ways:
  - All (Default)
  - Incremental If you choose Incremental, you can specify which field will be used to define the increment.

note

Start Date is mandatory.
Data can be retrieved for the date range specified between the Start and End dates.
If you leave the end date blank, the data will be pulled at the current time of the River's run.
Date timezone: UTC.

Select Edit to change the Target table name.
Change the Upsert-Merge mode to any other Loading Mode.
Filter logical key duplication between files - This option removes duplications from the current source pull.
A custom expression can be added to any target column by clicking + Add Calculated Column.

When adding a comment to an Expressions query, use /* "comment" */ rather than two hyphens (--) to prevent commands from being mistakenly interpreted as comments, as the query is converted to a single line.

Change Data Capture (CDC)

Data Integration Change Data Capture (CDC) extraction mode works by constantly monitoring the logs or records generated by the source database and then capturing changes made to the source data. This change data is then collected, transformed, and loaded into the target database, ensuring the target is always in sync with the source. CDC extraction provides real-time data replication and ensures minimal data loss by facilitating low-latency data transfer.

CDC mode flowchart cdcmodeflowchart

The steps

Monitor Source logs: Data Integration monitors the logs generated by the source database to keep track of changes made to the data.
Capture changes made to Source data: Data Integration captures changes made to the source data, such as updates, inserts, and deletes, as they occur.
Collect change data: Data Integration collects the captured change data into a batch or stream for processing.
Transform change data: The change data is transformed to match the structure of the target database and to meet any other specific requirements.
Load transformed data into Target: The transformed change data is loaded into the target database, ensuring that it is always up-to-date with the source database.
Ensure Target is always in sync with Source: Data Integration continually monitors the source database logs and replicates changes to the target database, ensuring that it remains in sync with the source.

note

The CDC extraction mode can be used with the following Databases:

Custom query

The Custom Query River mode in Data Integration allows users to load data into the platform using their own SQL queries. This mode offers a high degree of flexibility and control over the data loaded, letting you define the exact data to be loaded and the specific transformations to be applied. The queries can be written in SQL, and data can be loaded from databases and data warehouses. Once the custom query is defined, the data can be automatically loaded into Data Integration on a regular schedule, or on demand, providing real-time access to the data.

note

Data Integration does not support multi-statements or SQL Script in the custom query field.

Custom Query Mode Flowchart

Custom Query Mode

The steps

Define the target data storage: The user needs to specify the target storage where they want to load the data. This could be a database, data warehouse, or any other data storage system supported by Data Integration.
Write the custom SQL query: The user writes a custom SQL query to select the data they want to load. The query should return the desired data in the expected format.
Execute the query: Data Integration executes the custom SQL query and retrieves the data.
Transform the data: Data Integration lets you perform transformations on the data if necessary. This could include operations like renaming columns, transforming data types, and more.
Load the data: Finally, Data Integration loads the transformed data into the target storage. The data is loaded in the desired format and structure.

Extract methods

Two methods are available for extracting data:

All
Incremental

All

This mode loads all data from a Source into the Target database, replacing all existing data in the Target database. It is useful when the Source data is complete and up-to-date, and you want to create a fresh copy of the source data in the target database.

Incremental

This mode loads only new or updated data from a source into the target database. It is useful when the Source data is frequently updated and you want to keep the Target database in sync with the Source data.

Incremental behavior

Data Integration works in millisecond increments, but users can store data in more granular microsecond or nanosecond increments.

When operating incremental runs based on timestamp, the most recent record may be missing.

Data Integration reads three numbers after the dot when there are six, so '2022-08-14 10:26:52.895132' becomes '2022-08-14 10:26:52.895' and the last record is missing from the current run when fetching the data.

Ensure that the following run will retrieve this record with updated information.

note

The extraction methods can be selected in all River modes. In Multi-Tables mode, you can locate the options by navigating to the Schema tab and selecting a table. From there, access the 'Table settings' option.

Additional notes on the Incremental extraction method:

Start Date/Value is mandatory.
Data can be retrieved for the date range specified between the Start and End dates.
If you leave the end date blank, the data will be pulled at the current time of the River's run.
Should we check the End value box? This question concerns whether the end value in a sequence should be included or excluded. The difference between including or excluding the last value is shown in 3 cases:

Case #1: If the start date/value is 01/01/2023 and the end date/value is also 01/01/2023:

If the ** Include End Value** toggle is checked, all data from 01/01/2023 will be pulled to the target table.
If the ** Include End Value** toggle is unchecked, no data will be pulled to the target table, and a warning is issued.

Case #2: If the start date/value is 01/01/2023 and the end date/value is 02/01/2023:

If the Include End Value toggle is checked, all data between 01/01/2023 and 02/01/2023 will be pulled to the target table, and the start date will be updated to 02/01/2023 after the run.
If the Include End Value toggle is unchecked, only the data from 01/01/2021 will be pulled to the target table.

Case #3: If the start date/value is 01/01/2023 and the end date/value is blank:

If the Include End Value toggle is checked, all data between 01/01/2023 and the current UTC will be pulled to the target table, and the start date will be updated to the current UTC after the run.
If the Include End Value toggle is unchecked, only the data from 01/01/2021 will be pulled to the target table.
Dates timezone: UTC.
Start/End Date format: DD/MM/YYYY (Day/Month/Year with leading zeros).