Data Hub walkthrough
The Data Hub source connector is available as a limited availability (Beta) release.
Using the Data Hub connector in Data Integration, you can extract golden records from one or more Universe models in your Data Hub repository and load them into a supported target such as Snowflake or Databricks. For an overview of Data Hub concepts, refer to Boomi Data Hub.
Prerequisites
- A configured Data Hub connector connection
- At least one Universe model deployed in your Data Hub repository
- A supported target connector configured in Data Integration
Setting up a Data Hub Data Flow
Step 1: Setting up your data source
- Navigate to the Data Integration Console.
- Select Create New Data Flow > Source to Target Flow as your Data Flow type.
- Find and select Data Hub in the list of data sources.
- Under Selected Data Source, select Data Hub.
- Under Source Connection, select the connection you configured. To edit an existing connection or create a new one, click the edit icon next to the connection field.
- Click Test Connection to verify that Data Integration can reach your Data Hub repository.
- Click Next.
Step 2: Selecting your data target
- Under Selected Data Target, select your target connector.
- Under Target Connection, select your target connection. Click the edit icon to modify an existing connection or create a new one.
- Click Test Connection to verify the target connection.
- Under Data Loading Settings, enter the values for your target destination:
| Field | Description | Required |
|---|---|---|
| Database | The target database to load data into. | Yes |
| Schema | The target schema within the database. | Yes |
| Advanced Settings | Optional advanced loading configuration. Click to expand. | No |
- Click Next.
Step 3: Configuring the schema
The Configure Schema step is where you select Universe models, set their extraction methods, and configure mapping and loading settings.
Selecting models and configuring extraction
The Configure Schema step displays all Universe models available in the connected repository as rows in a table. Each row contains the following columns:
| Column | Description |
|---|---|
| Model | The Universe model name. Select the checkbox to include it in the Data Flow. Click the model name to open its detailed settings. |
| Target Table | The destination table name in the target. Auto-generated from the model name by default. |
| Extract Method | The extraction method for this model: All or Incremental. |
| Incremental Field | The field used as the cursor for incremental extraction. Required when Extract Method is Incremental. |
| Incremental Type | The data type of the incremental field. |
| Start Value | The start value for incremental extraction. |
| End Value | The optional end value for incremental extraction. Leave blank to extract until the current run time. |
| Loading Mode | The loading mode for this model. Inherited from Tables Definitions unless overridden per model. Default: Upsert Merge. |
To select and configure models:
- Select the checkbox next to each Universe model you want to include.
- In the Extract Method column, select All or Incremental for each model.
- If you selected Incremental, fill in the Incremental Field, Incremental Type, and Start Value columns for that model.
- Each extracted record includes an is_enddated column. Active records are tagged is_enddated = false. Soft-deleted records are tagged is_enddated = true. Use this column in your target to distinguish active from deleted records. To include end-dated records in the extraction, click the model name and enable Include End-Dated in the Table Source Settings tab.
- For large datasets, use Incremental extraction. It retrieves only records updated after the start value, reducing load time on each run.
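As an illustration of how the is_enddated column can be used downstream, the following minimal Python sketch splits loaded rows into active and soft-deleted sets. The row layout (id, name) and the function name split_by_end_dated are hypothetical and not part of the connector. Only the is_enddated column comes from the description above.

```python
from typing import Any, Dict, List, Tuple

Row = Dict[str, Any]


def split_by_end_dated(rows: List[Row]) -> Tuple[List[Row], List[Row]]:
    """Separate active rows from soft-deleted (end-dated) rows."""
    active: List[Row] = []
    end_dated: List[Row] = []
    for row in rows:
        # Assumption: a row without the flag is treated as active.
        if row.get("is_enddated", False):
            end_dated.append(row)
        else:
            active.append(row)
    return active, end_dated


# Hypothetical rows as they might look after loading into the target.
rows = [
    {"id": "1", "name": "Acme", "is_enddated": False},
    {"id": "2", "name": "Globex", "is_enddated": True},
]
active_rows, end_dated_rows = split_by_end_dated(rows)
```

In a warehouse target, the same distinction would typically be expressed as a filter on the is_enddated column in a query or view.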
Configuring Tables Definitions (optional)
Click Tables Definitions in the toolbar to apply settings across all models in the Data Flow.
| Field | Description | Required | Default |
|---|---|---|---|
| Table Prefix | A character or phrase added to the beginning of each target table name. | No | — |
| Default Loading Mode | The loading mode applied to all models unless overridden in Table Target Settings. | No | Upsert Merge |
| Merge Method | The merge strategy applied when Loading Mode is Upsert Merge. | No | Merge |
| Filter Logical Key Duplication Between Files | Filters out duplicate records in the current source pull. Use only when duplicates are expected in the source but not in the target table. | No | Off |
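
To make the Upsert Merge and duplicate-filter settings more concrete, here is a minimal Python sketch of generic upsert semantics. It is not the connector's implementation. The function names, the id key, and the choice to keep the last occurrence of a duplicated key are assumptions made for illustration.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def dedupe_by_key(batch: List[Row], key: str) -> List[Row]:
    """Drop duplicate keys within one source pull.

    Assumption: the last occurrence of a key wins.
    """
    latest: Dict[Any, Row] = {}
    for row in batch:
        latest[row[key]] = row
    return list(latest.values())


def upsert_merge(target: List[Row], batch: List[Row], key: str) -> List[Row]:
    """Update rows whose key already exists in the target; insert the rest."""
    merged: Dict[Any, Row] = {row[key]: row for row in target}
    for row in batch:
        merged[row[key]] = row
    return list(merged.values())


# Hypothetical target table contents and one source pull with a duplicate key.
target = [{"id": 1, "name": "A"}]
batch = [
    {"id": 1, "name": "B"},
    {"id": 2, "name": "C"},
    {"id": 2, "name": "D"},
]
clean_batch = dedupe_by_key(batch, "id")
result = upsert_merge(target, clean_batch, "id")
```

In this sketch, the existing row with id 1 is updated and a single row with id 2 is inserted, so the target ends up with one row per key.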
Applying bulk actions (optional)
Use Bulk Actions to apply extraction and loading settings across multiple Universe models at once, instead of configuring each model individually. Refer to Using bulk actions for more information.
Configuring model settings
Click a model name in the table to open its detailed settings. The model panel contains three tabs: Mapping, Table Source Settings, and Table Target Settings.
Mapping tab
The Mapping tab shows the column-level mapping between Data Hub source fields and target table columns.
- Use the Search field to find specific columns.
- Click Reload Model Metadata to refresh the schema from the Data Hub repository.
- Click Add Calculated Column to add a custom computed field to the mapping.
- Use the All Columns, Match Key, and Cluster tabs to filter the column view.
Each mapping row contains the following fields:
| Column | Description |
|---|---|
| Source Column Name / Expression | The field name as it appears in the Data Hub golden record. |
| Target Column Name | The field name in the destination table. Editable. |
| Type | The data type of the field (STRING, TIMESTAMP, and so on). |
| Mode | Whether the field accepts null values. Default: NULLABLE. |
| Cluster Key | Assigns this field as a cluster key in the target. Used for query optimization. |
Table Source Settings tab
The Table Source Settings tab controls how data is extracted from Data Hub for this model.
Enable Include End-Dated to include soft-deleted records in the extraction. When disabled, only active records (is_enddated = false) are extracted.
Under Extraction Method, select the extract method for this model:
| Option | Description |
|---|---|
| All | Retrieves all golden records for this model on every run. The connector maintains no state between runs. Use for initial loads or small datasets. |
| Incremental | Retrieves only records updated after the configured start value. Use for large datasets. |
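
The difference between the two extraction methods can be sketched in Python as follows. This is an illustration of the described behavior, not the connector's code. The updated_at field name, the extract function, and the exact boundary handling (exclusive start, inclusive end) are assumptions.

```python
from datetime import datetime, timezone
from typing import Any, Dict, List, Optional

Row = Dict[str, Any]


def extract(
    records: List[Row],
    method: str,
    incremental_field: Optional[str] = None,
    start_value: Optional[datetime] = None,
    end_value: Optional[datetime] = None,
) -> List[Row]:
    """Return the records one run would retrieve for the given method."""
    if method == "All":
        # Full extraction: every record, on every run, with no stored state.
        return list(records)
    if method != "Incremental":
        raise ValueError(f"Unknown extract method: {method}")
    if incremental_field is None or start_value is None:
        raise ValueError("Incremental extraction needs a field and a start value")
    # A blank end value means "extract until the current run time".
    upper = end_value if end_value is not None else datetime.now(timezone.utc)
    return [r for r in records if start_value < r[incremental_field] <= upper]


def utc(year: int, month: int, day: int) -> datetime:
    return datetime(year, month, day, tzinfo=timezone.utc)


# Hypothetical golden records with an assumed "updated_at" cursor field.
records = [
    {"id": "1", "updated_at": utc(2024, 1, 1)},
    {"id": "2", "updated_at": utc(2024, 2, 1)},
    {"id": "3", "updated_at": utc(2024, 3, 1)},
]
full = extract(records, "All")
recent = extract(records, "Incremental", "updated_at", utc(2024, 1, 15))
window = extract(records, "Incremental", "updated_at", utc(2024, 1, 15), utc(2024, 2, 15))
```

Here full contains all three records, recent contains the two records updated after January 15, 2024, and window contains only the record updated within the bounded range.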
Table Target Settings tab
The Table Target Settings tab controls how data is loaded into the target table for this model.
| Field | Description | Required | Default |
|---|---|---|---|
| Target Table Name | Overrides the target table name for this model only. | No | Inherited from model name |
| Override Default Target Settings | When enabled, allows per-model overrides of the loading mode and merge settings below. | No | Off |
| Table Loading Mode | The loading mode for this model. Available when Override Default Target Settings is enabled. Append Only is applied automatically if no key columns are selected. | Conditional | Upsert Merge |
| Merge Method | The merge strategy for this model. Available when Override Default Target Settings is enabled. | Conditional | Merge |
| Filter Logical Key Duplication Between Files | Filters out duplicate records in the current source pull for this model. Use only when duplicates are expected in the source but not in the target table. | No | Off |
| Enforce Masking Policy | Preserves the data masking policy applied at the column level in the target table. Requires copy permission on the masking policy and at least one column with an active masking policy. | No | Off |
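
The way the default and per-model loading modes interact can be summarized in a short Python sketch. The function name and parameters are hypothetical, and the precedence shown (no key columns first, then the per-model override, then the default) is an assumption based on the field descriptions above.

```python
from typing import List, Optional


def effective_loading_mode(
    default_mode: str,
    override_enabled: bool,
    table_mode: Optional[str],
    key_columns: List[str],
) -> str:
    """Resolve which loading mode applies to a single model."""
    if not key_columns:
        # Without key columns there is nothing to merge on.
        return "Append Only"
    if override_enabled and table_mode:
        # Per-model override from Table Target Settings.
        return table_mode
    # Otherwise inherit the Default Loading Mode from Tables Definitions.
    return default_mode


inherited = effective_loading_mode("Upsert Merge", False, None, ["id"])
overridden = effective_loading_mode("Upsert Merge", True, "Append Only", ["id"])
no_keys = effective_loading_mode("Upsert Merge", False, None, [])
```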
- Click Next.
If your repository contains 100,000 or more golden records, activate Accelerated Query in Boomi Data Hub before running extractions. Accelerated Query significantly improves Repository API query performance, which can reduce extraction time for large datasets. Refer to Activating accelerated query for golden records for more information.
Step 4: Scheduling and activating your data flow
- Under Schedule Data Flow, enable scheduling, then set the run frequency. All times are in UTC.
- Under Set Custom Timeout, enable the option if you want to set a custom timeout. By default, the timeout is set automatically based on table size and ranges from 12 hours to 7 days.
- Under Notifications, configure email alerts for data flow events:
| Option | Description |
|---|---|
| Failure | Sends an email when the data flow fails. |
| Warning | Sends an email when the data flow completes with warnings. |
| Run Threshold | Sends an email when a run exceeds a defined duration. |
- Under Data Flow Info, enter a name for the data flow. Optionally assign it to a group and add a description.
- Click Activate to save and activate the data flow.