Debugging data discrepancies
Discrepancies with API sources
You can resolve discrepancy issues in aggregated API sources effectively by comparing exactly apples-to-apples from source to target. Some metrics in these sources are ‘unique’, which, when sliced at different levels of detail, can lead to misleading totals when comparing one system to another.
Debugging steps
- Confirm apples-to-apples comparison - Check that the steps to produce data in the source system are at the same granularity as the configuration set in the Data Flow. For example, if the discrepancy is in Google Analytics data, and the screenshot shows totals by week, but the Data Flow has a time resolution of ‘daily’, then that is a good indicator of not comparing the same dimensions and what is causing the perceived discrepancy.
- Narrow down the search - For example, select a single campaign on a single day and compare those metrics to the source. If that matches, work your way up to higher levels of aggregation to pinpoint the point where the data no longer matches.
- Check your loading mode (set in the Target tab) - The Data Flow may be unexpectedly overwriting or appending data. If the loading mode is upsert-merge, confirm that the keys set in the Data Flow are the desired key(s) to define the level of detail of the dataset.
- If all fails, create a support ticket by clicking the Support in the Data Integration console.
Discrepancies with database sources
More often, this relates to the configuration of the Data Flow itself. Certain aspects of database multi-table Data Flow configurations may cause confusion or contribute to potential data discrepancies, such as:
Extraction method - when set to ‘Incremental’, Data Integration requires an incremental field. This can either be a timestamp or a running number.
- If set to a timestamp, it must be a timestamp field that the source updates whenever a record changes.
- If you select a running number field, it must represent that the source produces a new record for every single change.
There have been cases where the source database did not follow this behavior, resulting in unexpected incremental performance because the platform did not capture every change. Neither option (timestamp field or running number) should ever contain NULL values in the source data.
Loading mode - this setting often causes the most confusion when set to upsert-merge. In the multi-table Data Flows, the primary keys are automatically detected, but some tables may not have primary keys defined in the source database. In such cases, users select their own in the column mapping. No field chosen as a primary key should have NULL values in the source.
Schedule frequency - In some cases, discrepancies occur because the system has not yet loaded the latest data. For example, if the schedule frequency is set to daily, the source database may already contain new records, but the system has not yet loaded those records into the destination.