Debugging data discrepancies

Discrepancies with API sources

Discrepancy issues in aggregated API sources are resolved effectively when comparing exactly apples-to-apples from source to target. Some metrics in these sources are ‘unique’, which, when sliced at different levels of detail, can lead to misleading totals when comparing one system to another.

Debugging steps

Confirm apples-to-apples comparison - Check that the steps to produce data in the source system are at the same granularity as the configuration set in the river. For example, if the discrepancy is in Google Analytics data, and the screenshot shows totals by week, but the river has a time resolution of ‘daily’, then that is a good indicator of not comparing the same dimensions and what is causing the perceived discrepancy.
Narrow down the search - For example, select a single campaign on a single day and compare those metrics to the source. If that matches, work your way up to higher levels of aggregation to pinpoint the point where the data no longer matches.
Check your loading mode (set in the Target tab) - The river may be unexpectedly overwriting or appending data. If the loading mode is set to upsert-merge, confirm that the keys set in the river are the desired key(s) to define the level of detail of the dataset.
If all fails, create a support ticket by clicking the Support in the Data Integration console.

Discrepancies with database sources

More often, this is related to the configuration of the river itself. Certain aspects of database multi-table river configurations may cause confusion or contribute to potential data discrepancies, such as:

Extraction method - when set to ‘Incremental’, Data Integration requires an incremental field. This can either be a timestamp or a running number.

If set to a timestamp, it must be a timestamp field that updates in the source whenever a record is updated.
If you select a running number field, it must represent that a new record is produced for every single change.

There have been cases where this behavior was not followed in the source database, resulting in unexpected incremental performance because not every change was captured. Neither option (timestamp field or running number) should ever contain NULL values in the source data.

Loading mode - this setting often causes the most confusion when set to upsert-merge. In the multi-table rivers, the primary keys are automatically detected, but some tables may not have primary keys defined in the source database. In such cases, users select their own in the column mapping. No field chosen as a primary key should have NULL values in the source.

Schedule frequency - In some cases, discrepancies occur because the latest data has not yet been loaded. For example, if the schedule frequency is set to daily, the source database may already contain new records, but those records have not yet been loaded into the destination.