MongoDB best practice for pulling large amounts of data

In historical data migration / large collection(s) cases, there is a need to pull a vast amount of data from MongoDB into any of the optional target DWH available in the Data Integration console.

This guide will walk you through the best practice and efficient way to connect your MongoDB to Data Integration.

Source configuration

Extract Method: We recommend using the Incremental extract mode for pulling data from your collection. This mode pulls data by an incremental value that performs as an Index, fetching only the data that changed since the last run. In the Incremental extract method there are additional options:

Incremental Attributes: Set the name of the incremental attribute to run by. For example, UpdateTime.
Incremental Type: Timestamp | Epoch. Choose the incremental attribute.
Date Range:
- In Timestamp, choose your date increment range to run by.
- In Epoch, choose the value increment range to run by.
- If you enter no End Date / Value, the default is the last date/value in the table. Data Integration manages the increment automatically.
Include End Value: Should the increment process take the end value of the run or not.
Interval Chunks: Define the chunk size for data retrieval. Split the data by minutes, hours, days, months, or years. Split your data into chunks when the source returns large amounts of data, to avoid a call timeout from the source side.
Max Rows In Batch: When parsing data in the file zone from BSON to a JSON file, add a batch max rows limit as best practice (the number of rows Data Integration reads in a single batch before moving to the next one). To determine the size of the rows, follow this principle:

When parsing data in the file zone from BSON screenshot

After every 50% reduction, test the Data Flow and check if it is running successfully, follow this process up until you set a size that enable the Data Flow to run as expected.

Filter (Optional): Filter the data to pull. Use the MongoDB Extended JSON format for the filter. for example {"account":{"$oid":"1234567890abcde"},"datasource": "google", "is_deleted": {"$ne": true}}.
Limit (Optional): Limit the rows to pull.
Auto Mapping: Data Integration determines your table / query metadata. You can choose the set of columns you want to fetch from the source, add a new column that hadn’t got in Data Integration determination or leave it as is. If you do not make any of Auto Mapping, Data Integration gets all the columns in the table.

Schema configuration

Converting entire key data As a STRING (Optional):
In some cases, the data is not as expected by a target, like key names starting with numbers, or flexible and inconsistent object data. A possible option to overcome this will be converting attributes to a STRING format by setting their data types in the mapping section as a STRING. An example for this can be the following key:value combination:

{
  "name": "myname",
  "id": 1,
  "sites": {"www.google.com": {"count": 3},
    "www.yahoo.com": {"count": 4},
    "www.amazon.com": {"count": 4}
  }
}

The mapping, in this case, would have the following structure:

The mapping, in this case, would have the following screenshot

As shown, the site’s key has inconsistent changing values, and the mapping uses the values themselves (sites) as columns. To deal with this situation, define the sites data type as a STRING and parse this column by using an open JSON function on the target’s side:

As shown, the site’s key has inconsistent changing values, screenshot

note

Conversion exists for any value under that key. Data Integration converts arrays and objects to JSON strings.

Enhancing data preservation and mapping

Preserve the raw data of the collection for Snowflake as the target platform. When dealing with JSON objects or arrays that feature schema-less keys, potentially changing per row or including special characters or blanks, Snowflake users can use its built-in capability to query complex JSON structures using the Semi-Structured Query Mechanism. We recommend reclassifying the entire column as a VARIANT column in such scenarios and then extract the necessary columns within Snowflake's console or through a Logic Flow using any of Snowflake's Open JSON functions.

To achieve this, activate the Preserve Raw Data toggle (located in the Target tab under file settings). This will automatically introduce an Array into the column mapping, which defaults to loading as a variant column. This approach helps maintain the row's granularity.

note

Data Integration does not support the Preserve Raw Data toggle in Change Streams.

The mapping outcome appears as shown below:

Additionally, you have the option to use the expression column mapping feature to extract a specific column from the raw JSON data. This extracted column can function as a key for the Upsert merge loading mode:

Additionally, you have the option to use the expression screenshot

Retrieve large collections by chunks and range queries

The MongoDB Log Based Mode can pull large collections by breaking them down into chunks. When the number of records in a collection exceeds the maximum number of rows specified in the Data Flow to retrieve the data by using a range query with the _id index instead of using skip for pagination.

This method sorts and increments the data, pulling it in batches using a range query. This technique is more efficient than the traditional method of using skip for pagination, especially for large collections, as it avoids the performance overhead associated with the use of skip.

By breaking down the collection into smaller chunks, the MongoDB Log Based Mode improves the speed and efficiency of data retrieval, making it a more reliable and effective tool for managing large datasets.

For more information on range queries, you can refer to the MongoDB - Using Range Queries topic.

Retrieve large collections by using concurrent requests

To enhance the extraction process of large Collections, consider raising the number of concurrent requests to your MongoDB. This enables Data Integration to retrieve each data chunk concurrently rather than sequentially.

note

The maximum number of concurrent requests permitted by your MongoDB depends on its memory configurations. Running a Data Flow with concurrent requests exceeding the capabilities of your MongoDB can result in Data Flow failure.

You can find this option in Multi-Table mode.

Indexing incremented fields for performance in MongoDB

When working with MongoDB and incrementing fields, you must index the incremented field properly. Failing to do so can lead to significant performance issues, especially when pulling large amounts of data. MongoDB does not automatically index datetime columns, so you must index them manually for optimal performance.

Why index incremented fields

Indexes in MongoDB significantly speed up query performance by allowing the database to locate specific documents more efficiently. When you increment a field without indexing it, MongoDB has to scan through the entire collection to find the relevant documents. This process can be time-consuming and resource-intensive, especially as the collection size grows.

Best practices for indexing incremented fields

Identify incremented fields: Determine which fields in your MongoDB documents are frequently incremented or used in queries involving increments.
Create indexes: For each identified field, create an index using the createIndex method in MongoDB. Specify the field you want to index and the desired index type (for example, ascending, descending).
Regular maintenance: Periodically check and optimize indexes to ensure they remain effective as data grows and usage patterns change. Use tools like MongoDB's explain method to analyze query performance and index usage.

Streamlining the mapping process

The Table Settings tab within the Schema section offers 3 powerful features to enhance the mapping process when working with a MongoDB collection. These features address the challenges of MongoDB's schemaless structure by offering greater flexibility and control over schema inference.

Number of records
- By default, the mapping process samples the last 100 documents from the collection to infer the schema. Users can adjust this behavior by:
  - Selecting a sample size of up to 10,000 records.
  - Sampling either the first or last records based on their preference.
- This allows you to define the number and sequence of rows sampled, enabling a more tailored approach to schema detection.
Why is this important?
MongoDB collections lack table metadata due to their schemaless structure, making schema inference challenging. Sampling an insufficient number of documents may result in incomplete mappings.
Document ID mapping
- You can map data using specific Document IDs, ensuring that the schema mapping includes only relevant records.
- Key behaviors:
  - If Data Integration finds no matches for the provided Document IDs, it displays an error notification.
  - If Data Integration finds at least one match, the mapping process continues, and you will receive a notification about any unmatched IDs in the collection.
- This feature provides a more targeted mapping approach, reducing ambiguity and improving the relevance of the inferred schema.
JSON file mapping
- Users can now upload a JSON file to define the mapping explicitly. This feature offers:
  - Greater precision in specifying the structure of the collection.
  - Enhanced control over the expected schema, minimizing errors caused by sampling variability.
- This method is ideal for advanced users who have pre-defined schema requirements or need to align with external data models.

Availability

The mapping features are accessible in both Legacy and Multi-Tables modes (excluding Change Streams) to ensure seamless integration across diverse data processing scenarios. You can find these options under the Table Settings tab in the Schema section.

Benefits

Flexibility: The ability to adjust sample size, use specific Document IDs, or define mapping through a JSON file ensures adaptability to multiple use cases.
Accuracy: These features minimize mapping errors, reducing the risk of incomplete or inconsistent schemas.
Control: This empowers users with more options to tailor the mapping process to their specific needs.

Recommended data types for incremental keys in MongoDB to ensure accurate filtering

When using incremental mechanisms in MongoDB, the behavior of comparisons depends on the field data type designated as the increment key.

Below is a detailed explanation based on common data types:

Date/ISODate/Timestamp fields

MongoDB compares fields with the data type Date, ISODate, or Timestamp chronologically.
The mechanism performs proper date-based comparisons, such as greater than (>) or less than (<).

Example:

MongoDB considers a record with a timestamp of 2024-01-02 greater than a record with a timestamp of 2024-01-01.
This ensures logical, time-sensitive ordering and filtering.

String fields

If the increment key field is a String, MongoDB bases comparisons on alphabetical order, regardless of whether the string looks like a date.
MongoDB still treats strings that resemble dates (for example, 12/31/2023 or 2024-01-01) as plain text during comparisons.

Example:

MongoDB considers the string 12/31/2023 greater than 1/1/2024 because string comparisons evaluate the characters alphabetically from left to right.

Key considerations

The data type, not the visual format of the field's values, determines the comparison logic.
Date/ISODate/Timestamp fields enable chronological comparisons, whereas String fields use alphabetical ordering.

Recommendation

To ensure accurate incremental updates based on time or date values:

Verify that the increment key field is of the correct data type (Date, ISODate, or Timestamp).
Avoid using strings for date-like values unless you require intentional string-based ordering.