Databricks as a target
Databricks is a cloud-native SQL analytics engine that provides high-performance, scalable analytics for data lakes, data warehouses, and other large-scale data use cases.
Prerequisites
Unity Catalog
Unity Catalog, a feature of the Databricks Unified Data Analytics Platform, lets organizations easily manage, discover, and collaborate on their data assets. By offering consolidated access control, auditing, lineage, and data exploration across Databricks workspaces, Unity Catalog simplifies the management of data assets.
Catalog prerequisites
- A Premium plan or a higher tier for your Databricks account.
- You must set up IAM roles, IAM policies, and S3 buckets in AWS.
- Create your first metastore and attach a Workspace.
- Enable the Unity Catalog toggle in your SQL Warehouse.
Create your first metastore and attach a workspace
To use Unity Catalog, you must first create a metastore. A metastore is a centralized repository for metadata management. This metadata includes information such as data types, file formats, and other characteristics that help users locate and use the data they need.
After creating the metastore, you can link it to a workspace. Connecting the metastore to a workspace lets you conveniently manage and access your data through Unity Catalog.
To create a metastore and link it to a workspace, follow the instructions below:
- Access the Databricks account console and log in.
- Click the Catalog icon.
- Select the option Create metastore.
- Name the metastore and select a deployment region. Place workspaces, metastores, and cloud storage in the same region to ensure optimal performance.
- Enter the S3 bucket path for the storage bucket and IAM role name.
You can exclude s3:// when entering the path.
- Click Create to complete the process.
- Once prompted, select the workspaces you want to associate with the metastore and click Assign.
Enabling the Unity Catalog toggle in your SQL warehouse
- Log in to your Databricks workspace.
- Click SQL Warehouses in the sidebar.
- Locate the specific warehouse, click the More (three dots) icon, and select Edit.
- In the Advanced options section, enable the Unity Catalog toggle.
Working with Unity Catalog
- Metastore admins can access all catalogs within the metastore. Other users can access only the catalogs they own or those granting them USAGE privileges.
- Without a specified catalog name, Data Integration loads data into the default hive_metastore catalog.
To work with Unity Catalog in Data Integration:
- Complete all prerequisites.
- Set up a valid Databricks connection in Data Integration.
- Enter the catalog name.
Schemas and tables
Databricks uses a hierarchy of schemas and tables for data warehousing. You can create and then populate tables manually, but this method requires significant technical overhead and setup time.
Data Integration eliminates this upfront table creation process. It automatically creates the necessary tables if they do not exist and populates them with the extracted data. Data Integration also handles all required changes and updates to your tables automatically. This simplifies your data integration process and helps you focus on extracting insights from your data.
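As an illustration of the automatic table creation described above, the sketch below builds a `CREATE TABLE IF NOT EXISTS` statement from a field mapping. The helper name and mapping are hypothetical, not part of Data Integration's actual code:

```python
# Illustrative sketch (hypothetical helper): how a target table could be
# created automatically from a column-name -> data-type mapping when the
# table does not already exist.

def build_create_table_ddl(schema: str, table: str, mapping: dict) -> str:
    """Build a CREATE TABLE IF NOT EXISTS statement from a mapping."""
    columns = ", ".join(f"`{name}` {dtype}" for name, dtype in mapping.items())
    return f"CREATE TABLE IF NOT EXISTS `{schema}`.`{table}` ({columns})"

ddl = build_create_table_ddl(
    "data_integration", "orders",
    {"order_id": "BIGINT", "amount": "DECIMAL(18,2)", "created_at": "TIMESTAMP"},
)
print(ddl)
```

Because the statement uses `IF NOT EXISTS`, rerunning the load against an existing table is a no-op rather than an error.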
Create a new schema
Set up a dedicated Schema for Data Integration on Databricks.
- Go to the Databricks workspace console, click on Queries, and then select Create Query.
- In the query editor, paste and run the following SQL command:
CREATE SCHEMA IF NOT EXISTS `data_integration`
COMMENT '{conKeyRefs.DataIntegration} Tables Schema';
Loading modes
Data Integration provides three loading modes when Databricks is the target.
Store in a custom location
Data Integration provides an option for storing external tables in DBFS (Databricks File System) or in an external location.
DBFS
DBFS (Databricks File System) provides a distributed file system optimized for heavy workloads. This architecture scales to store and manage data across file formats, including Parquet, CSV, and JSON. You can access data directly through Spark SQL commands.
External tables in Databricks enable you to work with data stored in one or more locations and file formats. By providing support for DBFS, it enables easy management and manipulation of large datasets, while offering the flexibility and scalability that data workloads require.
To specify a DBFS prefix, select the Store in a Custom Location checkbox and define the DBFS parameter.
Databricks automatically assigns the location within the abfss://<some_path>/<some_sub_path> path.
External location
Data Integration also provides an option for storing data in external tables. External tables in Databricks let you create and manage tables that exist in storage locations outside the Databricks file system and support direct data access. This provides greater flexibility and scalability, as it lets you work with data stored in different file formats and external sources.
To create an external table, select the Store in a Custom Location checkbox and set the External Location parameter. This configuration defines the external location prefix for the Delta table.
Databricks automatically assigns the location within the <storage>://<bucket_name>/<some_path> path.
This makes it easy to work with data stored in external locations, such as Amazon S3 or Azure Blob Storage, without having to move it into Databricks' file system.
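A minimal sketch of composing the `<storage>://<bucket_name>/<some_path>` prefix described above. The helper and the set of supported schemes are assumptions for illustration only:

```python
# Illustrative sketch (assumed helper, not part of the product): compose an
# external-location prefix in the <storage>://<bucket_name>/<some_path> form,
# e.g. for Amazon S3 or Azure Blob Storage.

def external_location_prefix(storage: str, bucket: str, path: str) -> str:
    if storage not in {"s3", "abfss", "gs"}:  # assumed scheme whitelist
        raise ValueError(f"unsupported storage scheme: {storage}")
    # Normalize stray slashes so the prefix joins cleanly.
    return f"{storage}://{bucket}/{path.strip('/')}"

print(external_location_prefix("s3", "my-landing-bucket", "/delta/orders/"))
# s3://my-landing-bucket/delta/orders
```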
Creating the target database if it does not exist
Data Integration verifies that the database used in the process exists in the Databricks workspace. If the database does not exist, Data Integration creates it for you.
Your personal access token must include permissions to create new databases. Otherwise, the source to target process fails. Ask your Databricks account manager or admin for a token that has CREATE DATABASE permissions in Databricks.
Default database name
Without a specified database in the process, Data Integration targets the default database.
Staging table loading
Staging table loading involves the following steps:
- Creating a table with the tmp_ prefix and the correct metadata.
- Loading the data into the tmp_ table using the COPY command.
  - CSV: Creates the tmp_ table using the fields in the mapping (type-converted).
  - JSON: Creates a single-column table using the string data type.
- Flattening the data (if the source type is JSON) using JSON extract functions and casting with the cast([column] as [dataType]) as [alias] syntax.
- Auto-scaling integer columns to the right integer data type based on their data length (SMALLINT->INTEGER->BIGINT->NUMERIC).
- Dropping the old tmp_ table with the json_data field (if it exists).
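One way to implement the integer auto-scaling step above is to pick the smallest SQL integer type whose range fits the observed values, widening through SMALLINT, INTEGER, BIGINT, and NUMERIC. This sketch uses the standard ranges for those types and is illustrative, not the connector's actual code:

```python
# Illustrative sketch of the SMALLINT->INTEGER->BIGINT->NUMERIC auto-scaling
# rule: choose the smallest integer type that can hold every value.

def scale_integer_type(values) -> str:
    m = max(abs(v) for v in values)
    if m <= 32_767:                       # 16-bit range
        return "SMALLINT"
    if m <= 2_147_483_647:                # 32-bit range
        return "INTEGER"
    if m <= 9_223_372_036_854_775_807:    # 64-bit range
        return "BIGINT"
    return "NUMERIC"                      # arbitrary precision fallback

print(scale_integer_type([1, 42, 30_000]))      # SMALLINT
print(scale_integer_type([1, 5_000_000_000]))   # BIGINT
```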
Load from STG table to target table
In this stage, Data Integration performs all the following steps in one transaction (based on the ability to run DDL and DML inside one transaction):
- Defining and updating any metadata changes on the target table, excluding dropping columns.
- If the target table does not exist, creating it using a CREATE AS SELECT clause.
- If the loading mode is append, cloning the target table aside and adding/dropping/altering columns in the clone table.
- Loading the data into the target table using the selected loading mode (as described above).
- Using DEEP CLONE to clone the tmp_ table into the target table.
- Dropping the tmp_ tables used in the process.
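The statement sequence for this stage can be sketched as below. Table names, the helper, and the exact statement shapes are assumptions for illustration; the `CREATE OR REPLACE TABLE ... DEEP CLONE ...` form follows Delta Lake's clone syntax:

```python
# Illustrative sketch (hypothetical helper): the statements the
# load-from-staging stage could issue, per the steps above.

def staged_load_statements(target: str, tmp_table: str, target_exists: bool):
    stmts = []
    if not target_exists:
        # Missing target: create it with CREATE ... AS SELECT.
        stmts.append(f"CREATE TABLE {target} AS SELECT * FROM {tmp_table}")
    else:
        # Existing target: DEEP CLONE the staging table into it.
        stmts.append(f"CREATE OR REPLACE TABLE {target} DEEP CLONE {tmp_table}")
    # Finally, drop the tmp_ table used in the process.
    stmts.append(f"DROP TABLE IF EXISTS {tmp_table}")
    return stmts

for s in staged_load_statements("analytics.orders", "analytics.tmp_orders", False):
    print(s)
```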
Data Integration creates the target schema if it does not exist in the Databricks workspace. Without a specified schema name, the system targets the default schema.
The Databricks personal access token requires CREATE SCHEMA permissions. Without these permissions, the Source to Target process fails. Contact a Databricks administrator to grant a token with CREATE SCHEMA privileges.
Databricks Spark configuration
To ensure the Databricks Target connector can successfully write data to your landing storage, you must configure the Spark parameters based on your cloud provider.
Step 1: Storage implementation (cloud-specific)
Choose the configuration block that matches your Databricks deployment. Add these settings to the Spark Configuration section of your Databricks cluster or job.
- Azure Databricks (ADLS Gen2)
For Azure, do not include the S3 parameters. Use the following Azure Blob File System (ABFS) configurations to authenticate with your storage account:
Remove all spark.hadoop.fs.s3... parameters and add the following Azure Storage Authentication settings:
spark.hadoop.fs.azure.account.auth.type.<storage-account>.dfs.core.windows.net OAuth
spark.hadoop.fs.azure.account.oauth.provider.type.<storage-account>.dfs.core.windows.net org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider
spark.hadoop.fs.azure.account.oauth2.client.id.<storage-account>.dfs.core.windows.net <application-id>
spark.hadoop.fs.azure.account.oauth2.client.secret.<storage-account>.dfs.core.windows.net {{secrets/<scope-name>/<secret-name>}}
spark.hadoop.fs.azure.account.oauth2.client.endpoint.<storage-account>.dfs.core.windows.net https://login.microsoftonline.com/<tenant-id>/oauth2/token
- AWS Databricks (S3)
For AWS, you must explicitly map the S3A implementation to the Databricks-optimized filesystem:
spark.hadoop.fs.s3a.impl shaded.databricks.org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3n.impl shaded.databricks.org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3.impl shaded.databricks.org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3a.impl.disable.cache true
spark.hadoop.fs.s3n.impl.disable.cache true
spark.hadoop.fs.s3.impl.disable.cache true
Step 2: ANSI compatibility (Global or Session control)
By default, modern Databricks Runtime versions (10.x and higher) enforce strict ANSI SQL compliance, which can cause data loads to fail when source and target data types do not match. Because Data Integration does not support ANSI mode, set it to false using one of the following methods:
- Per cluster (Recommended)
This isolates the setting to the cluster used by this connector.
- In Databricks, go to Compute > [Your cluster] > Edit.
- Expand Advanced Options and select the Spark tab.
- Add the following line to the Spark Config:
spark.sql.ansi.enabled false
- Per session
If you want to apply the setting only during a specific job execution, run the following command as your first SQL statement:
SET ANSI_MODE = false;
Is ANSI_MODE=false mandatory?