Automatic schema drift management

Schema drift refers to the gradual changes in the structure of data in a Source over time. These changes can have a significant impact on the integrity and accuracy of data replicated to a destination Target, such as a data warehouse.

Data Integration provides automated solutions for extracting, transforming, and loading data from various Sources into a Target. One of the key features is the ability to detect and handle schema changes in the Source.

Example of schema drift

Customers table: initial structure

ID	Name	Address	Phone Number
1	Jane	500 7th Avenue	5551235555
2	John	16 Great Chapel Street	5554565556

Customers table: after schema drift

ID	Name	Address	Phone Number	Email	Social Media Handle
1	Jane	500 7th Avenue	555-123-5555	Null	Null
2	John	16 Great Chapel Street	555-456-5556	Null	Null
3	Bob	14425 Falcon Head Blvd	(+1)555-789-5559	bob.smith@boomi.com	@bobsmith

The Customers database initially included four columns, as seen in this example: ID, Name, Address, and Phone Number. However, due to schema drift, the table now has two extra columns, Email and Social Media Handle, and the Phone Number column's data type has been changed from integer to string.

Handling schema changes

When a schema change is detected in the source, Data Integration automatically updates the destination target's schema to match the new structure. This process happens without disrupting data flow or requiring manual intervention, whether through Standard extraction or Change Data Capture (CDC).

Data Integration continuously compares the source's schema with its metadata—a record of the source's structure and attributes—to detect changes. On each run, Data Integration resyncs the metadata with the source, ensuring they are always aligned. The metadata then serves as the foundation for data processing.

New field detection and processing

When Data Integration detects a new field in the source schema, such as ExcludeInMDM, it appears in the Data Integration interface. However, for sources like SQL Server using Change Tracking, Data Integration only updates the target schema (For example, Snowflake) when processing new or modified data. The latest field does not reflect in Snowflake until it is included in new or updated data for the relevant table.

If the new field is created with a default value (For example, 0), this action is not considered a data modification by Change Tracking. Therefore, Data Integration does not recognize it as a change that would trigger a schema update in the target. The new field will remain pending in Data Integration’s schema until the river processes actual new or changed data. Running the river without such data will not trigger a schema change in the target, even if the process completes with warnings.

Column-specific schema changes

Dropping a Column in the Source Table: If a column is dropped in the source but still exists in the target, the corresponding field in the target table will be populated with null values.
Renaming a Column in the Source Table: When a column is renamed in the source, Data Integration detects the change during the next river run. This results in a new column being added to the target table with the updated name, while existing records are duplicated with null values in the new column. This approach maintains data integrity while accommodating schema changes.
Adding a Column in the Source Table: Upon detecting a new column in the source, Data Integration adds it to the target table during the next river run. Existing records in the target table, predating the introduction of the new column, will have null values in this new column.
Changing Data Type of an Existing Column: If the data type of a source column changes, Data Integration adjusts the target column to a wider data type. For instance, changing a field from an integer to a string in the source will result in the target column being set to a string. However, if a field is changed from a string to a date, it will remain a string in the target.

Column specific schema changes

Sources supporting schema drift

Schema drift is available for the following Sources:

BigQuery
MariaDB
Microsoft SQL Server
MySQL
Oracle
PostgreSQL
Redshift
Salesforce
Teradata
Vertica

MongoDB considerations

As a non-schema-based database, MongoDB operates differently from traditional databases. To simplify the mapping process when working with MongoDB collections, there are three powerful features. For more details, see streamlining the mapping process.