Skip to main content
Feedback

MongoDB walkthrough

MongoDB is a flexible, document-based database that stores data in JSON-like format instead of traditional tables. To use it, you create a MongoDB instance, and perform basic database operations.

Procedure

Step 1: Establish a connection

Make sure you have appropriate credentials for establishing a connection with a MongoDB database server.

Step 2: Choose the Data Flow mode

Once you have successfully established a connection to MongoDB, select the Data Flow mode to load data into a Target. You can choose from two available options:

note

When you click on the highlighted modes, a relevant page that offers a comprehensive explanation and presents a Product Tour example showcasing the available features in MongoDB.

  • Multi-Tables : Simultaneously load multiple tables from MySQL to your desired Target. In this process, Data Integration conducts a full ELT process which extracts the full structure of all selected tables before filtering and writes only selected columns to your target. This can result in data transferred of a larger volume than data written to target.

    Select the method you would like to extract your data:

    • Standard Extraction - This Data Flow mode maps, transforms, and loads data from multiple tables into a unified schema. It uses SQL queries for transformations and you can schedule or manually trigger it.
    • Change Streams - This mode monitors Source database logs, capturing and transforming changes in real-time. This ensures minimal data loss and low latency transfer when loading the transformed data into the target database, keeping it in sync with the Source.
note
  • The Timestamp datatype in MongoDB returns the timestamp value without milliseconds, displaying as 'YYYY-MM-DD HH:MM:SS'

Step 3: Run the Data Flow

Once you have successfully configured the Data Flow mode, selected the extraction method, and set up the scheduling, you are now ready to start with the Data Flow and its operation.

MongoDB key features

  • Select connection(Required): select the connection that you’ve set before.
  • Collection name(Required): choose the collection by entering the collection name
info
  • You can also specify the collection name with the [database].[collection] format. This format can enable you create multiple Data Flows without changing the connection.
  • Make sure that the connection URI targets the authSource database. To enable this feature, the connection needs to target the database that manages authentication
  • Extract Method (Required):

    • All: pull the entire data in the table.
    • Incremental: pull data by incremental value. I.e: get only the data that changed since the last run. In the Incremental extract method:
      • Incremental Attributes: Required.  Set the name of the incremental attribute to run by. I.e: UpdateAt, UpdateTime .

      • Incremental Type: Required.  Timestamp | Epoch. Choose the type of incremental attribute.

      • Choose Range: Required.

        • In Timestamp, choose your date increment range to run by.
        • In Epoch, choose the value increment range to run by. 
        • If no  End Date or Value entered, the default is the last date/value available in the table. Data Integration manages the increment automatically.
        note

        The Start Date does not advance if a Data Flow run is unsuccessful. If you do not want this default setting, click More Options and check the box to advance the start date even if the Data Flow run is unsuccessful (Not recommended).

      • Include End Value: Should the increment process take the end value or not. The differences between Including the end value and discarding it are available in the following cases (#1,#2) -

    Assumptions: The collection's start date value is 01/01/2021. INC_KEY stands for the collection's incremental key.

    Case #1 - The collection's max updatedate field's value - 01/01/2021 (maxupdate is equal to the start date)

    Include End Value toggleIncrement logic condition being testedData Flow Outcome result
    checkedSELECT * FROM TABLE WHERE INC_KEY>=01/01/2021 AND INC_KEY<=01/01/2021Data Integration pulls all the data of the - 01/01/2021 to the target table
    uncheckedSELECT * FROM TABLE WHERE INC_KEY>=01/01/2021 AND INC_KEY<01/01/2021Done With Warning: Got No Results When Checking Increment

    Case #2 - The collection's max updatedate field's value - 02/01/2021 (maxupdate is grater than the start date)

    State of Include End Value toggleIncrement logic condition being testedData Flow Outcome result
    checkedSELECT * FROM TABLE WHERE INC_KEY>=01/01/2021 AND INC_KEY<=02/01/2021Data Integration pulls all the data between 01/01/2021 and 02/01/2021 to the target table; after the run, Data Integration updates the start date to 02/01/2021
    uncheckedSELECT * FROM TABLE WHERE INC_KEY>=01/01/2021 AND INC_KEY<02/01/2021Data Integration pulls only the data of the - 01/01/2021 to the target table
  • Interval chunks: (Optional) The chunks by which Data Integration pulls the data. Split the data by minutes, hours, days, months, or years, split your data into chunks for a large amount of returned data to avoid a call timeout from the source side.

  • Max rows in batch: (Optional) While parsing the data in the file zone from BSON to a JSON file, as best practice, make sure to add a batch max rows limit (the number of rows Data Integration reads in a single batch before moving to the next one). To determine the size of the rows, you can follow the next principle:

    mongo_best_1.png

After every 50% reduction, test the Data Flow and check if it is running successfully, follow this process up until you set a size that enable the Data Flow to run as expected.

  • Filter:  Optional.  filter the data to pull. The filter format will be a MongoDB Extended JSON.
  • Limit:  Optional. Limit the rows to pull.
  • Auto Mapping:  Optional. Data Integration will determine your table/query metadata. You can choose the set of columns you want to bring, add a new column that hadn’t got in Data Integration determination or leave it as it is. If you don’t make any of Auto Mapping, Data Integration will get all the columns in the table.

Converting entire key data as a string

In cases where the target does not receive the data in the expected format, such as key names starting with numbers, or flexible and inconsistent object data. You can convert attributes to a STRING format by setting their data types in the mapping section as STRING.

Conversion exists for any value under that key.  Data Integration converts arrays and objects to JSON strings.

Use cases

Here are few filtering examples:

{"account":{"$oid":"1234567890abcde"}, "datasource": "google", "is_deleted": {"$ne": true}}

or another filter using date variables:

date(MODIFY_DATE_START_COLUMN) >=date("2020-08-01")

Common errors

Loading to BQ Failed: Field XXXX is type RECORD but has no schema

This is relatively common generic error and the root cause is due to Schema settings that are not compatible with BQ. The error may vary depending on field and type. To fix this simply clear the mapping, redo automapping and convert the column into STRING as described in the Converting Entire Key Data As a STRING section.

Limitation

If you use Snowflake as a Target and your Collection contains hyphens, Data Integration automatically replaces those hyphens with underscores.

Consider the following scenario - if the original Collection is as follows: mongodb-collection-name

Data Integration automatically converts it to mongodb_collection_name

On this Page