MongoDB to Snowflake CDC data replication with Data Integration
Replicating data from MongoDB to a Snowflake data warehouse enables advanced analytics using standard SQL. While MongoDB is well-suited for fast read/write operations and supports flexible, evolving data structures, its NoSQL format and document data model can make it challenging for analysts to perform complex queries and extract meaningful insights directly.
Why replicate MongoDB Data to Snowflake?
- SQL Analytics: Most analysts are skilled in SQL, and many reporting and visualization tools rely on a SQL interface. Snowflake supports this, making it easier to analyze data efficiently.
- Data Consolidation: Snowflake enables you to combine MongoDB data with other data sources, helping create a unified and comprehensive view for business intelligence and decision-making.
- Performance: By offloading analytical workloads from MongoDB to Snowflake, you reduce the load on your operational database. This helps maintain optimal application performance while still supporting complex analytics.
Change Data Capture (CDC) with Data Integration
To ensure analytics are based on the most up-to-date data, we use Change Data Capture (CDC) to replicate changes as they happen. CDC is an efficient way to track, capture, and move updates made to a database.
This guide uses the Data Integration platform to create a data pipeline built on MongoDB Change Streams, MongoDB's built-in CDC feature. This approach reduces the load on the source database and also captures deleted records, which many scheduled extraction methods miss.
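To make the mechanism concrete, here is a minimal pymongo sketch of what a Change Streams listener sees; the connection string, database, and collection names are hypothetical, and Data Integration runs an equivalent listener for you. Note that Change Streams require the MongoDB deployment to be a replica set or sharded cluster.

```python
# Minimal illustration of MongoDB Change Streams, the CDC mechanism this guide relies on.
# All names below are hypothetical placeholders; Data Integration manages this for you.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/?replicaSet=rs0")
collection = client["appdb"]["orders"]

# full_document="updateLookup" asks MongoDB to include the current document on updates.
with collection.watch(full_document="updateLookup") as stream:
    for change in stream:
        op = change["operationType"]           # "insert", "update", "replace", "delete", ...
        doc_id = change["documentKey"]["_id"]
        if op == "delete":
            # Deletes are captured too, which scheduled batch extraction typically misses.
            print(f"delete _id={doc_id}")
        else:
            print(f"{op} _id={doc_id}:", change.get("fullDocument"))
```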
Prerequisites
- Access to a Snowflake account
- Access to a MongoDB database
Step 1: Set up a Data Integration account
If you already have a Data Integration account, you can skip this step.
The easiest way to connect to your Snowflake account is by using Snowflake Partner Connect.
- Log in to Snowsight and navigate to Data Products > Partner Connect.
- Search for Data Integration.
- Click the Data Integration tile. A Connect to Rivery screen appears, pre-populated with your Snowflake details.
- Click Connect and Activate. You will be redirected to the Data Integration console, where you will be asked to set a password for Data Integration and create your account.
Step 2: Configure Snowflake Target connection
If you created your Data Integration account through Snowflake’s Partner Connect, you can skip to Step 4.
Configure your Snowflake connection in Data Integration. You will need it later when you point your MongoDB data (the source) at Snowflake (the target). For more information, see Snowflake as a Target.
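If you configure the target connection manually, it can help to confirm the credentials first with a quick script. Below is a minimal sketch using the snowflake-connector-python package; the account, user, warehouse, database, and schema values are hypothetical placeholders for your own.

```python
# Sanity-check Snowflake credentials before entering them as the target connection.
# All connection values are hypothetical placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="xy12345.us-east-1",
    user="INTEGRATION_USER",
    password="********",
    warehouse="LOAD_WH",
    database="ANALYTICS",
    schema="RAW",
)
try:
    cur = conn.cursor()
    cur.execute("SELECT CURRENT_WAREHOUSE(), CURRENT_DATABASE(), CURRENT_SCHEMA()")
    print(cur.fetchone())
finally:
    conn.close()
```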
Step 3: Configure MongoDB Source connection
You need to set up the connection to your MongoDB database within Data Integration. For more information, see MongoDB basic connection.
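Likewise, you can verify the MongoDB credentials and check that the deployment is a replica set (required for Change Streams) before saving the connection. A small pymongo sketch with a hypothetical connection string:

```python
# Verify MongoDB connectivity and replica-set membership before configuring the source.
# The connection string is a hypothetical placeholder.
from pymongo import MongoClient

client = MongoClient("mongodb://cdc_reader:secret@mongo.example.com:27017/")
print(client.admin.command("ping"))            # {'ok': 1.0} on success
hello = client.admin.command("hello")          # replica-set info on recent MongoDB versions
print("replica set:", hello.get("setName", "none (Change Streams unavailable)"))
print("databases:", client.list_database_names())
```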
Step 4: Create a River
Now you will build a River that moves the data from MongoDB to Snowflake.
- From the Data Integration navigation menu, click Create River > Source to Target River.
- Source: Select MongoDB as your source and choose the MongoDB connection you created in the previous step.
- Extraction Mode: Select Change Streams as the Default Extraction Mode for near-real-time replication. If you're using an older MongoDB version or prefer a different method, you can choose Standard Extraction (SQL-based).
- Target: Click the Target tab and select Snowflake.
- Choose your Snowflake connection from the dropdown.
- Set the Database, Schema Name, Table Prefix, and Default Loading Mode for your data. The loading mode options include Upsert-Merge and Append Only (the sketch below illustrates the difference).
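To clarify the two modes: Append Only inserts every captured change as a new row, while Upsert-Merge applies changes to the target table so it mirrors the current state of the source collection. The following is a rough sketch of Upsert-Merge behavior expressed as a Snowflake MERGE run through the Python connector; the table, column, and credential values are hypothetical, and Data Integration generates its own statements internally.

```python
# Rough illustration of Upsert-Merge semantics: staged change rows are merged into the
# target table keyed on the document _id, and rows whose latest change is a delete are removed.
# Table, column, and credential values are hypothetical; the platform generates its own SQL.
import snowflake.connector

MERGE_SQL = """
MERGE INTO ANALYTICS.RAW.ORDERS AS t
USING ANALYTICS.RAW.ORDERS_STAGING AS s
  ON t._ID = s._ID
WHEN MATCHED AND s.OPERATION_TYPE = 'delete' THEN DELETE
WHEN MATCHED THEN UPDATE SET t.DOC = s.DOC, t.UPDATED_AT = s.CHANGE_TS
WHEN NOT MATCHED AND s.OPERATION_TYPE <> 'delete' THEN
  INSERT (_ID, DOC, UPDATED_AT) VALUES (s._ID, s.DOC, s.CHANGE_TS)
"""

conn = snowflake.connector.connect(
    account="xy12345.us-east-1", user="INTEGRATION_USER", password="********"
)
conn.cursor().execute(MERGE_SQL)
```

In Append Only mode, by contrast, each change simply becomes a new row in the target table, which preserves history but leaves deduplication to your queries.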
Step 5: Configure the pipeline schema
This step allows you to select the specific collections you want to replicate and define their target schema in Snowflake.
- Click the Schema tab.
- Check the box next to the MongoDB collections you want to extract. You can select multiple collections or check the box next to Source collections to select all.
- Click on the table name to customize the target table names and other settings.
- In the Table Settings tab, under Advanced Options, you can Filter logical key duplication between files and enforce a Snowflake masking policy on the ingested data if you have one defined (a sketch of defining such a policy follows this list).
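For the masking option to have any effect, the masking policy must already exist on the Snowflake side; the pipeline only enforces it. A minimal sketch of defining such a policy through the Python connector, with hypothetical object and role names:

```python
# Define a Snowflake masking policy that the pipeline's masking option can apply.
# Object, role, and credential values are hypothetical; adapt them to your environment.
import snowflake.connector

conn = snowflake.connector.connect(
    account="xy12345.us-east-1", user="INTEGRATION_USER", password="********"
)
conn.cursor().execute("""
CREATE MASKING POLICY IF NOT EXISTS ANALYTICS.RAW.EMAIL_MASK AS (val STRING)
RETURNS STRING ->
  CASE WHEN CURRENT_ROLE() IN ('ANALYST_PII') THEN val ELSE '***MASKED***' END
""")
```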
Step 6: Schedule and run the pipeline
Once your pipeline is configured, you can enable and schedule it for continuous operation.
- Click the Enable Stream toggle. This activates the Change Streams listener.
- Note: If you chose Standard Extraction in Step 4, you can set a schedule in the Settings tab.
- In the Settings tab, you can also configure alerts and notifications for the pipeline.
- Click Save and Run to execute the pipeline.
Step 7: Monitor your pipeline
Data Integration provides tools to monitor the performance and status of your data pipelines.
- Navigate to the Activities page from the left-hand navigation menu to see a list of all your pipeline runs.
- Click any pipeline run to get a detailed breakdown of its execution.
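Beyond the Activities page, a quick way to confirm the replication is working is to query the target tables in Snowflake directly. A small sketch using the Python connector; the table name, VARIANT column, and document fields are hypothetical and depend on your target settings and collections.

```python
# Spot-check replicated data in Snowflake after a pipeline run.
# Table, column, field, and credential values are hypothetical placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="xy12345.us-east-1", user="INTEGRATION_USER", password="********"
)
cur = conn.cursor()

# How many rows have landed in the target table?
cur.execute("SELECT COUNT(*) FROM ANALYTICS.RAW.ORDERS")
print("rows replicated:", cur.fetchone()[0])

# MongoDB documents typically land in a VARIANT column, queryable with standard SQL.
cur.execute(
    "SELECT DOC:customer.country::STRING AS country, COUNT(*) "
    "FROM ANALYTICS.RAW.ORDERS GROUP BY 1"
)
for country, n in cur.fetchall():
    print(country, n)
```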