
SQL Server walkthrough

You can integrate data from a SQL Server database (on-prem or managed service) into a cloud target using Data Integration.

Prerequisites

You must grant the necessary permissions for extracting and reading data in the Standard Extraction loading mode:

  • SELECT: Required to read data from tables and views.
  • VIEW DEFINITION: Required to access table structures, metadata, and schema definitions.
  • EXECUTE: Needed if stored procedures or functions are used during ETL.

Example of how to grant these permissions:

USE [DatabaseName];
GRANT SELECT ON SCHEMA::[SchemaName] TO [Username];
GRANT EXECUTE ON SCHEMA::[SchemaName] TO [Username];
GRANT VIEW DEFINITION ON DATABASE::[DatabaseName] TO [Username];
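
If the database user does not exist yet, you can create it before granting these permissions. A minimal sketch, assuming a hypothetical login named di_reader and SQL Server authentication; adjust the names and authentication method to your environment:

-- Hypothetical login and database user for Data Integration
CREATE LOGIN [di_reader] WITH PASSWORD = 'StrongPasswordHere1!';
USE [DatabaseName];
CREATE USER [di_reader] FOR LOGIN [di_reader];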

Pulling SQL Server data into a target

Using Data Integration, you can pull data from your SQL Server tables and send that data into your Target database.

  • Find SQL Server in the list of data sources and select it.

  • Define a Source Connection (the connection created for this process). If you do not have a SQL Server connection in your Data Integration account, you can create a new connection by clicking Create New Connection.

  • Choose a River mode:

    • Multi-Tables: Load multiple tables simultaneously from SQL Server to your target.
    • Custom Query: Create a custom query and load its results into your target.
    • Legacy River: Choose a single source table to load into a single target.

Multi-Table mode

Load multiple tables simultaneously from SQL Server to your target. There are three Default Extraction Modes:

  • Standard Extraction

  • Change Data Capture (CDC)

  • Change Tracking

  • On the Table Settings tab, you can edit the following:

    • Change the loading mode.
    • Change the extraction method. If you select 'Incremental', you can then choose which field is used to define the increment.
    • Filter by an expression that is used as a WHERE clause to fetch the selected data from the table (see the example after this list).
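
The filter expression is applied as the WHERE clause of the extraction query. A minimal sketch, assuming a hypothetical orders table with status and updated_at columns:

-- Filter expression entered in Table Settings (hypothetical columns):
--   status = 'active' AND updated_at >= '2024-01-01'
-- Conceptually, the data is then fetched as:
SELECT *
FROM testing_schema.orders
WHERE status = 'active' AND updated_at >= '2024-01-01';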

Advanced options

When using Multi-Tables River mode with the Standard or Change Tracking extraction mode, the Advanced Options contain checkboxes that can enhance the usability of the data inserted into the Target table:

  • Invalid Characters
  • Hidden columns
  • Varbinary Values

Invalid characters

When this option is selected, the system automatically replaces any invalid characters with underscores.

Hidden columns

Automatic extraction of hidden columns is enabled; however, you can turn it off by clicking the "Ignore Hidden Columns" checkbox.

Varbinary values

This option lets you export the column's Varbinary data as a hexadecimal string, with a maximum length of 254 characters.
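
For reference, the hexadecimal form of a varbinary value can be inspected directly in T-SQL. A minimal sketch, assuming a hypothetical files table with a varbinary file_hash column (CONVERT style 2 renders the bytes as a hex string without the 0x prefix):

-- Render varbinary bytes as a hexadecimal string (hypothetical table and column)
SELECT file_id,
       CONVERT(VARCHAR(254), file_hash, 2) AS file_hash_hex
FROM testing_schema.files;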

Deleted rows

This option pulls back deleted rows. The row's key fields do not change, and all remaining fields are NULL.

note

This option is available only when using "Change Tracking" as the extraction mode and employing "Snowflake" or "Databricks" as the Target.
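
Using Change Tracking as the extraction mode requires that Change Tracking is enabled on the source database and on each tracked table. A minimal sketch of the standard T-SQL, with placeholder names and a retention period you should adjust to your own policy:

-- Enable Change Tracking at the database level (placeholder name and retention)
ALTER DATABASE [DatabaseName]
SET CHANGE_TRACKING = ON (CHANGE_RETENTION = 2 DAYS, AUTO_CLEANUP = ON);

-- Enable Change Tracking on a specific table
ALTER TABLE [SchemaName].[TableName]
ENABLE CHANGE_TRACKING WITH (TRACK_COLUMNS_UPDATED = ON);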

Replace newline characters

This replaces newline characters with a selected replacement value or a custom value of your choice. If this option is not enabled, newline characters are replaced with spaces by default.

Data conversion considerations

New lines

To minimize potential errors when loading data into the targets, Data Integration removes any CHAR(10) (LF) and CHAR(13) (CR) characters during the run for the following data types (see the sketch after this list):

  • XML
  • VARBINARY
  • TEXT
  • STRING
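
The removal is equivalent to stripping both characters in T-SQL. A minimal sketch, assuming a hypothetical NVARCHAR notes column:

-- Equivalent of removing LF (CHAR(10)) and CR (CHAR(13)) from a text field (hypothetical column)
SELECT REPLACE(REPLACE(notes, CHAR(13), ''), CHAR(10), '') AS notes_clean
FROM testing_schema.testing_table;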

Custom query

You can also use Data Integration's Custom Query River mode to define the data extraction precisely. When using this feature, you can use any query that is compatible with the Source.

Note that you should use a SELECT query without including any other statements. Data Integration does not support multi-statement queries or SQL scripts in the custom query field. Limit your Custom Query to selecting the data you need, without additional SQL commands or complex scripting.
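
For example, a valid custom query is a single SELECT statement, which may include joins and filters. A minimal sketch with hypothetical tables and columns:

-- A single SELECT statement; no variables, temp tables, or additional statements
SELECT o.order_id, o.order_date, c.customer_name
FROM testing_schema.orders AS o
JOIN testing_schema.customers AS c ON c.customer_id = o.customer_id
WHERE o.order_date >= '2024-01-01';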

Data Integration supports XML in the Custom Query. In such cases, ensure that you cast the XML fields accordingly.

Procedure for casting XML to string

  1. Clear Previous Mappings

    • In the Source Tab: Clear all previous mappings.
    • In the Schema Tab: Clear all previous mappings.
  2. Write SQL Command

    SELECT some_id, CAST([some_xml] AS NVARCHAR(MAX)) AS some_xml_stringified, some_date
    FROM testing_schema.testing_table
  3. Rerun Auto Mapping

    • In the Source Tab: Rerun auto-mapping and verify xml_field is mapped to STRING.
    • In the Schema Tab: Rerun auto-mapping and verify xml_field is mapped to STRING.
  4. Run the River
note

TIMESTAMP columns might not be returned in the same format as the target table's TIMESTAMP type and can fail to load into the target table as TIMESTAMP. If a TIMESTAMP column fails to load into the target table, convert its format in your custom query as follows:

convert(varchar, <DATE_COLUMN_NAME>, 127) as <DATE_COLUMN_NAME>

Ensure that this column is a TIMESTAMP type in the Source tab mapping.
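
For example, a minimal sketch of this conversion in context, reusing the hypothetical table from the casting procedure above:

SELECT some_id,
       CONVERT(VARCHAR, some_date, 127) AS some_date
FROM testing_schema.testing_table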

Extract method

Using Data Integration, you can pull your data incrementally or pull the complete data that exists in the table:

  • All: Fetch all data in the table using chunks.
  • Incremental: Gives you the option to run over a column in the table or in your custom query's SELECT. You can filter by running over it using Start and End dates or Epoch times. You can choose to run over the date using Daily, Weekly, Monthly, or Yearly intervals.

Define the incremental Field to be used in the Incremental Field section. After choosing the incremental field, select the Incremental Type and the dates/values you want to fetch.

Data Integration manages increments across runs using the maximum value in the data. Each run retrieves all data since the last run, which prevents data gaps. You only need to configure your river once.

Recommended: Define your incremental field in Data Integration over a field with an Index or Partitions key in the table.
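
Because the incremental filter runs against this field on every execution, an index helps the source database serve it efficiently. A minimal sketch, assuming a hypothetical updated_at incremental field on the testing table used earlier:

-- Index on a hypothetical incremental field to speed up incremental extraction
CREATE INDEX IX_testing_table_updated_at
ON testing_schema.testing_table (updated_at);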

note
  • Start Date is mandatory.
  • Data can be retrieved for the date range specified between the Start and End dates.
  • If you leave the End Date blank, data will be pulled up to the current time of the River's run.
  • Date timezone: UTC.
  • The Start Date won't be advanced if a River run is unsuccessful. If you don't want this default setting, click More Options and check the box to advance the start date even if the River run is unsuccessful (Not recommended).
  • Use the 'Last Days Back For Each Run' option to gather data from a specified number of days before the selected start date.

Limit and auto mapping

After defining the extract method, you may choose a limit of top N rows to fetch. Data Integration sets your Schema using the "Auto Mapping" feature. You can also select fields you want to fetch in the "Mapping table" and add fields on your own.
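
Conceptually, the row limit corresponds to a TOP clause in T-SQL. A minimal sketch with a hypothetical table and an arbitrary limit of 1,000 rows:

-- Fetch only the first 1,000 rows (hypothetical table and limit)
SELECT TOP 1000 *
FROM testing_schema.testing_table;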

Row version

Row Version is commonly used to version-stamp table rows. The storage size is 8 bytes. The row version data type is an incrementing integer that does not store dates or times.

Row Version can be used in two River modes:

  • Legacy River
  • Multi-Tables

Using row version

The following procedure uses a Legacy River, but the process is the same for Multi-Tables.

Procedure

  1. Choose your Source Connection.
  2. Select Schema and Table Name.
  3. Click on Extract Method and select Incremental.
  4. Type in Incremental Field Name.
  5. Choose Row Version in Incremental Type.
  6. Enter a zero as the Start Value.
note

The default value for 'Rows in Chunk' is 100,000.

  7. Click Run River.
  8. Connect to your Target. The Row Version is indicated in Hexadecimal values.
note

When you run a River with Row Version again, the Start Value will be updated to the last row's end value.
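
For reference, a rowversion column is an 8-byte binary value that displays as hexadecimal and increases with every insert or update. A minimal sketch with hypothetical table and column names, including a cast to BIGINT for readability:

-- Hypothetical table with a rowversion column
CREATE TABLE testing_schema.orders_rv (
    order_id INT PRIMARY KEY,
    status   NVARCHAR(50),
    row_ver  ROWVERSION
);

-- rowversion values display as hexadecimal (for example, 0x00000000000007D1)
SELECT order_id, status, row_ver,
       CAST(row_ver AS BIGINT) AS row_ver_numeric
FROM testing_schema.orders_rv
ORDER BY row_ver;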

Modifying the chunk size

The best practice for setting the Exporter Chunk Size in Data Integration involves balancing memory usage with query performance, which can vary based on workload, network latency, and database type.

When dealing with wide tables with a relatively large number of columns or tables that have large text fields (for example, JSON/XML/TEXT types), it is crucial to set the chunk size carefully, as these fields can lead to high memory usage and potentially slower performance or even out-of-memory (OOM) failures. In such cases, the best practice is to lower the chunk size. This reduces the amount of data loaded into memory in each batch, minimizing the risk of OOM failures for large fields.

For narrow tables, increasing the chunk size can yield higher throughput and improve read performance.

Since multiple factors are involved (for example, network latency, database type), the best practice for finding the optimal value for a specific table is to start with the default setting and then adjust incrementally, either up or down, until you find the setting that provides the best performance.
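
As a rough way to gauge how heavy a table's large text fields are before tuning the chunk size, you can measure their average stored size. A minimal sketch, assuming hypothetical notes and payload_xml columns on the testing table:

-- Average stored size of large text fields, in bytes (hypothetical columns)
SELECT AVG(CAST(DATALENGTH(notes) AS BIGINT) + CAST(DATALENGTH(payload_xml) AS BIGINT)) AS avg_large_field_bytes
FROM testing_schema.testing_table;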
