Sources overview

Data Integration connects to your source, receives data, and writes it to your destination. It also gives customers the option of routing data through their own intermediate storage (Custom File Storage) before it is written to the specified destination, ensuring that no data is saved on the vendor's servers.

To learn how to build simple data pipelines to transfer data from a Source to a Target destination, refer to the Source to Target River - general overview.

Types of sources

Applications

The data provided by your company's software services offers greater insight than the data available on internal dashboards. These services include APIs (Application Programming Interfaces) that enable access to or extraction of data through a secure internet connection.

Data Integration creates connections to these applications and gathers the delta of changes at a cadence set by the customers.

Data extraction time variability in APIs

When extracting data from APIs in data sources such as Adobe, LinkedIn, or Facebook, the time required to retrieve a report can vary due to several factors:

  1. API rate limits and throttling

    • Rate limits: Many APIs enforce rate limits to protect their servers from overload. Exceeding these limits can slow response times or limit the number of results returned.
    • Throttling: In cases of throttling, the API may return partial data or take longer to complete a request, leading to delays.
  2. Server-Side data processing

    • Data processing: APIs process large volumes of data (for example, aggregating, filtering, or calculating metrics) before returning results. This increases response time, particularly with complex reports or large datasets.
    • Request queuing: Some APIs may queue requests based on system load, which can further delay data retrieval.
  3. Data availability and latency

    • Data timing: Requested data may not be immediately available, particularly if it relies on near-real-time processing. This can result in incomplete data or delays in reflecting the latest information.
    • Inherent latency: Certain data sources may have built-in latency, which affects the speed at which data becomes available through their APIs.
  4. Query complexity

    • Complex queries: Queries involving multiple filters, joins, or custom metrics can take longer to execute. This complexity may cause the API to time out or return a limited subset of the data.
    • API limitations: If the API struggles to process complex queries efficiently, it may result in slower response times or incomplete data.
  5. Network latency and connectivity

    • Geographic latency: Communicating with geographically distant servers adds network latency, which slows data retrieval.
    • Connectivity issues: Poor network connectivity or temporary disruptions can cause delays, partial responses, or incomplete data retrieval due to connection timeouts.

These factors contribute to the variability in the time required to extract data via APIs, and understanding them can help optimize the data retrieval process. A common client-side mitigation is to retry throttled or still-processing requests with exponential backoff, as sketched below.
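
The following is a minimal sketch of that backoff pattern, not part of the Data Integration product: it polls a report endpoint while tolerating rate limits and server-side processing delays. The URL, the "pending" status field, and the JSON shape are hypothetical assumptions for illustration.

```python
import time

import requests

# Hypothetical report endpoint; substitute the actual API of your data source.
REPORT_URL = "https://api.example.com/v1/reports/123"
MAX_ATTEMPTS = 6


def fetch_report() -> dict:
    """Poll a report endpoint, backing off when the API throttles or is still processing."""
    delay = 1.0
    with requests.Session() as session:
        for _ in range(MAX_ATTEMPTS):
            response = session.get(REPORT_URL, timeout=30)
            if response.status_code == 429:
                # Rate limited: honor Retry-After if provided, else back off exponentially.
                time.sleep(float(response.headers.get("Retry-After", delay)))
                delay *= 2
                continue
            response.raise_for_status()
            body = response.json()
            if body.get("status") == "pending":
                # Server-side processing is not finished yet; wait and retry.
                time.sleep(delay)
                delay *= 2
                continue
            return body
    raise TimeoutError(f"Report not ready after {MAX_ATTEMPTS} attempts")
```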

Optimizing data extraction and loading performance

To maximize the efficiency of data extraction and loading, fine-tuning the following settings is essential. Each setting plays a role in balancing performance with resource usage.

  • Exporter chunk size:

    1. The default value is 30,000, which is suitable for most situations.
    2. For wide tables with many columns or large text fields (JSON/XML/TEXT):
      • Lower the chunk size to reduce memory usage and avoid out-of-memory failures.
    3. For narrow tables:
      • Consider increasing the chunk size for potentially higher throughput.
  • Interval chunk size:

    1. This setting splits the data extraction into smaller pulls when covering long periods or large amounts of data (a sketch of interval splitting follows this list).
    2. Options include:
      • Don't Split (pull all data in one bulk)
      • Daily
      • Monthly
      • Yearly (less recommended)
  • Best practices

    • For exporter chunk size:
      • Test with different chunk sizes, starting at the default (30,000).
      • Reduce the size if you encounter memory-related errors.
      • Increase the size for narrow tables to maximize throughput.
    • For interval chunk size:
      • Use interval splitting for high-volume data or long time frames.
      • Balance performance with API rate limits by selecting an appropriate interval.
      • Ensure the date column is included in the extraction results to align data with interval boundaries.
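
As a rough illustration of interval splitting, the sketch below partitions a date range into daily or month-aligned chunks; each chunk would then be extracted with its own request. This is an assumption-level sketch of the technique, not Data Integration's internal implementation.

```python
from datetime import date, timedelta
from typing import Iterator, Tuple


def daily_intervals(start: date, end: date) -> Iterator[Tuple[date, date]]:
    """Yield one (interval_start, interval_end) pair per day covering [start, end]."""
    current = start
    while current <= end:
        yield current, current
        current += timedelta(days=1)


def monthly_intervals(start: date, end: date) -> Iterator[Tuple[date, date]]:
    """Yield month-aligned (interval_start, interval_end) pairs covering [start, end]."""
    current = start
    while current <= end:
        # First day of the next month: jump safely past month end via day 28 + 4 days.
        next_month = (current.replace(day=28) + timedelta(days=4)).replace(day=1)
        yield current, min(next_month - timedelta(days=1), end)
        current = next_month


if __name__ == "__main__":
    # Each pair below would become one extraction request filtered on the date column.
    for lo, hi in monthly_intervals(date(2024, 1, 15), date(2024, 3, 10)):
        print(lo, "->", hi)
```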

Databases

Customers can extract data from databases and move it to a Data Warehouse. Data Integration can connect to both on-premises and cloud databases, and clients can secure the connection using whitelisted IPs, VPNs, and SSH tunnels.

Our Incremental Data Capture method is dependable, secure, and cost-effective thanks to our exclusive Change Data Capture feature for databases.
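
The details of Data Integration's Change Data Capture feature are proprietary, but as a rough intuition for incremental extraction in general, the sketch below shows the simpler high-watermark style: each run pulls only rows updated since the last run. The table, column names, and SQLite backend are assumptions for illustration only.

```python
import sqlite3

# High-watermark incremental extraction: each run pulls only rows whose
# updated_at is newer than the last value seen. Table and column names are
# illustrative assumptions, not Data Integration's actual CDC mechanism.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL, updated_at TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 9.99, "2024-01-01T10:00:00"), (2, 25.00, "2024-01-02T12:30:00")],
)

last_watermark = "2024-01-01T23:59:59"  # persisted from the previous run

rows = conn.execute(
    "SELECT id, amount, updated_at FROM orders "
    "WHERE updated_at > ? ORDER BY updated_at",
    (last_watermark,),
).fetchall()
print(rows)  # only the row updated on 2024-01-02 is extracted

if rows:
    last_watermark = rows[-1][2]  # persist the new watermark for the next run
```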

Events

The Data Integration solution lets you collect data via a webhook: an HTTP callback triggered by a user-defined event on your website or application. When the event occurs, the source application sends a real-time HTTP notification, and the JSON payload arrives as an HTTP POST request at your webhook URL.
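
As a minimal sketch of the receiving side, a server that accepts JSON POST requests might look like the following. The port and handler are assumptions; in practice the webhook URL is typically generated for you by the platform.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


class WebhookHandler(BaseHTTPRequestHandler):
    """Accepts JSON payloads delivered as HTTP POST requests to the webhook URL."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        print("event received:", payload)  # hand the event off to your pipeline here
        self.send_response(200)
        self.end_headers()


if __name__ == "__main__":
    # Listens on port 8080; the event producer posts JSON to http://<host>:8080/.
    HTTPServer(("0.0.0.0", 8080), WebhookHandler).serve_forever()
```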

Files

Data Integration lets you sync files from on-premises and cloud storage sources.

REST API

You can connect to any API endpoint that includes an authentication flow. An Action River loads data into a target table in your cloud database using the REST API.
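
As an illustration of such an authentication flow, the sketch below obtains a token via a client-credentials exchange and then calls an endpoint with it. The URLs, credentials, and response shapes are hypothetical placeholders, not any specific vendor's API.

```python
import requests

# Hypothetical URLs and credentials; substitute the real authentication flow
# and endpoint of the API you are connecting to.
TOKEN_URL = "https://api.example.com/oauth/token"
ENDPOINT = "https://api.example.com/v1/records"

# Step 1: authenticate (a client-credentials exchange is shown as one example).
token = requests.post(
    TOKEN_URL,
    data={
        "grant_type": "client_credentials",
        "client_id": "YOUR_CLIENT_ID",
        "client_secret": "YOUR_CLIENT_SECRET",
    },
    timeout=30,
).json()["access_token"]

# Step 2: call the endpoint with the bearer token.
records = requests.get(
    ENDPOINT,
    headers={"Authorization": f"Bearer {token}"},
    timeout=30,
).json()
print(f"fetched {len(records)} records")  # assumes the endpoint returns a JSON array
```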

Stages of release

The connectors (Sources and Targets) are released in stages to ensure that users receive the highest-quality experience.

Beta: A controlled release in which the connector is live but may have limited capabilities. During the Beta stage, the primary objective is to test the connector in a range of uncommon and complex scenarios (edge cases) to ensure stability and readiness for a wider release.

Alpha: Designates connectors at a preliminary stage of development that are primarily undergoing testing. Interested users can manually request access.

Coming Soon: Denotes connectors in early planning that have been officially added to Data Integration's short-term roadmap. Users may request early access when it becomes available.

Sunset: Initiated when a connector is phased out following the data source vendor's announcement that the service has reached its end of life.

Extended execution time for large tables and API reports

Data Integration extends the execution time for handling large datasets, including RDBMS tables and API reports. This feature lets Rivers efficiently process and load extensive data, automatically extending the processing time up to 48 hours if necessary, ensuring successful data loading.

Key details

RDBMS Tables: Data Integration has an automatic mechanism that adjusts the execution time based on the size of the table or the number of rows. For large tables, Data Integration automatically switches to long-duration mode to complete the process without manual intervention.

API reports

Data Integration pre-defines specific API reports that typically return large datasets. These reports are automatically configured to run with extended execution time when necessary. You can adjust the default timeout value in your River’s settings to limit the run time per your preferences.

User control

You can adjust the default execution time in the River's settings tab to customize the timeout limit. If you set a custom timeout value, the process terminates when the selected time is exceeded.
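
To make the timeout semantics concrete, here is a minimal sketch, with stand-in numbers rather than Data Integration internals, of a run that terminates once a configured deadline is exceeded.

```python
import time

# Stand-in numbers: in Data Integration the limit is a River setting and the
# default behavior allows up to 48 hours. Here 2 seconds plays that role.
TIMEOUT_SECONDS = 2.0
deadline = time.monotonic() + TIMEOUT_SECONDS

total_chunks = 15  # stand-in for the remaining work
for chunk in range(total_chunks):
    if time.monotonic() >= deadline:
        raise TimeoutError(f"Terminated after {chunk} chunks: timeout exceeded")
    time.sleep(0.1)  # stand-in for processing one chunk of the table or report

print("All chunks loaded within the time limit")
```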

Automatic update to reports

Data Integration identifies new API reports that require extended execution and updates its list accordingly. Extended time is applied to these reports automatically, without user action.

note

Data Integration automatically processes large tables and reports for up to 48 hours, unless the user modifies the default timeout setting.
