
Creating a connection to Databricks

Databricks provides a managed service for data processing and transformation on a data lake. It uses Delta Lake, an open source solution developed by Databricks, which enables the creation, management, and processing of data using the Lakehouse architecture.

This guide will show you how to set up the required credentials and configurations for using Databricks with Data Integration.

info

When establishing a connection, you can instead use the Databricks Partner Connect guide. Following that guide automatically generates a fully functional Databricks connection in Data Integration.

Prerequisites

A valid Databricks Admin Account and Workspace.

Create a SQL warehouse

To use Databricks as a target, operations are performed against an existing SQL Warehouse. To create a new SQL Warehouse, follow the steps below (a scripted alternative using the Databricks REST API is sketched after the steps):

  1. Log in to your Databricks workspace.
  2. Go to the SQL console.
  3. Click SQL Warehouses, then click Create SQL Warehouse in the top right corner.
  4. In the new SQL warehouse modal, name your endpoint (for example, "RiverySQLEndpoint"), choose the appropriate Cluster Size, and set Auto Stop to at least 120 minutes of inactivity. Click Create.
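
If you prefer to script this step, the same warehouse can be created through the Databricks SQL Warehouses REST API. This is a minimal sketch, assuming you already have a personal access token and that <databricks-instance>, the warehouse name, and the cluster size are placeholders you replace with your own values:

    # Create a SQL warehouse via the REST API (name, size, and placeholders are assumptions).
    curl -X POST \
      https://<databricks-instance>/api/2.0/sql/warehouses \
      -H "Authorization: Bearer <personal-access-token>" \
      -H "Content-Type: application/json" \
      -d '{
            "name": "RiverySQLEndpoint",
            "cluster_size": "Small",
            "auto_stop_mins": 120
          }'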

Configure data access for the SQL warehouse

  1. Navigate to the Admin Settings.
  2. Click the SQL Warehouse Settings tab.
  3. If you are using an Instance Profile, select the specific one that you wish to use.
  4. Copy and paste the following configuration into the Data Access Configuration textbox:
    spark.hadoop.fs.s3a.impl shaded.databricks.org.apache.hadoop.fs.s3a.S3AFileSystem
    spark.hadoop.fs.s3n.impl shaded.databricks.org.apache.hadoop.fs.s3a.S3AFileSystem
    spark.hadoop.fs.s3n.impl.disable.cache true
    spark.hadoop.fs.s3.impl.disable.cache true
    spark.hadoop.fs.s3a.impl.disable.cache true
    spark.hadoop.fs.s3.impl shaded.databricks.org.apache.hadoop.fs.s3a.S3AFileSystem
  5. In the SQL Configuration Parameters textbox, configure the following settings:
    ANSI_MODE false
  6. Click Save Changes. The same settings can also be applied programmatically, as sketched below.
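
For automation, these values can also be set through the Databricks workspace-level SQL warehouse configuration API. This is a minimal sketch, assuming the PUT /api/2.0/sql/config/warehouses endpoint and field names match your workspace's API version; note that this call replaces the existing warehouse configuration rather than merging into it, so verify it against the Databricks API reference before running it:

    # Hedged sketch: set the workspace SQL warehouse data access configuration via the API.
    curl -X PUT \
      https://<databricks-instance>/api/2.0/sql/config/warehouses \
      -H "Authorization: Bearer <personal-access-token>" \
      -H "Content-Type: application/json" \
      -d '{
            "data_access_config": [
              {"key": "spark.hadoop.fs.s3a.impl", "value": "shaded.databricks.org.apache.hadoop.fs.s3a.S3AFileSystem"},
              {"key": "spark.hadoop.fs.s3n.impl", "value": "shaded.databricks.org.apache.hadoop.fs.s3a.S3AFileSystem"},
              {"key": "spark.hadoop.fs.s3.impl", "value": "shaded.databricks.org.apache.hadoop.fs.s3a.S3AFileSystem"},
              {"key": "spark.hadoop.fs.s3n.impl.disable.cache", "value": "true"},
              {"key": "spark.hadoop.fs.s3.impl.disable.cache", "value": "true"},
              {"key": "spark.hadoop.fs.s3a.impl.disable.cache", "value": "true"}
            ],
            "sql_configuration_parameters": {
              "configuration_pairs": [
                {"key": "ANSI_MODE", "value": "false"}
              ]
            }
          }'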

Get the SQL warehouse credentials

To connect from Data Integration, credentials are required for each SQL Warehouse configured in the Databricks console. In Data Integration, each connection typically corresponds to a single SQL Warehouse. (The same details can also be retrieved through the API, as sketched after the steps below.)

  1. Return to the SQL Warehouses section and select the warehouse that was recently created.
  2. Access the Connection Details tab and copy the following parameters:
    • Server Hostname
    • Port
    • HTTP Path
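
The same connection details are returned by the SQL Warehouses REST API under odbc_params. A minimal sketch, assuming a personal access token and that <warehouse-id> is the ID of the warehouse you created:

    # Fetch hostname, port, and HTTP path for a warehouse (jq is optional, used only for readability).
    curl -s -X GET \
      https://<databricks-instance>/api/2.0/sql/warehouses/<warehouse-id> \
      -H "Authorization: Bearer <personal-access-token>" \
      | jq '.odbc_params'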

Create a new personal access token

To establish a connection, you must generate a Personal Access Token associated with the user. To create a personal access token, follow the instructions below (an API-based alternative is sketched after the steps):

  1. Navigate to User Settings.
  2. Click the Access Tokens tab and select Generate New Token.
  3. In the opened modal, provide a name for your token (such as "Data Integration") and set the expiration lifetime to a duration long enough to guarantee consistent, dependable operation. In this example, the expiration lifetime is set to 1825 days (5 years).
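
Tokens can also be created through the Token REST API. A minimal sketch, assuming you already have an existing token or other credentials to authenticate the API call, with the 1825-day lifetime expressed in seconds:

    # Create a personal access token valid for 1825 days (1825 * 86400 = 157680000 seconds).
    curl -X POST \
      https://<databricks-instance>/api/2.0/token/create \
      -H "Authorization: Bearer <existing-personal-access-token>" \
      -H "Content-Type: application/json" \
      -d '{"comment": "Data Integration", "lifetime_seconds": 157680000}'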

Alternatively, instead of using a personal access token, you can authenticate your Databricks connection using OAuth 2.0. For detailed steps on setting up OAuth in Databricks, refer to Enable custom OAuth applications using the Databricks UI in the Databricks documentation.
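
Before configuring the connection, you can check that your workspace's OAuth endpoints are reachable by querying its OAuth discovery document. This is a hedged sketch; the /oidc/.well-known/oauth-authorization-server path is the discovery endpoint Databricks documents for workspace-level OAuth, but verify it against your deployment:

    # Print the workspace's OAuth authorization and token endpoints.
    curl -s https://<databricks-instance>/oidc/.well-known/oauth-authorization-server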

Configure the OAuth 2.0 Connection in Data Integration

  1. Go to Connections, and click + New Connection.

  2. Select Databricks.

  3. Enter the following details:

    • Connection Name
    • Server Hostname from your Databricks SQL Warehouse connection details.
    • Port from your SQL Warehouse connection details.
    • HTTP Path from your SQL Warehouse connection details.
    • For Authentication Type, select the OAuth2 option.
    • Client ID from Databricks.
    • Client Secret from Databricks.
  4. Enable the Custom File Zone toggle if required.

  5. (Optional) You can also specify a Default Catalog and Default Schema to control where data loads within Databricks.

  6. Click Connect with Databricks to authorize the connection. If the connection is successful, you see Connected with Databricks and Test Connection Passed! Otherwise, an error message appears.

Configure Databricks to allow communication from Data Integration IPs (Optional)

If your Databricks workspace has IP restrictions, you must allow the Data Integration IPs to ensure that operations from Data Integration run successfully.

To allow the Data Integration IPs, follow these steps:

  1. Review the Databricks documentation on IP access lists for the full range of IP access list operations.
  2. Add the Data Integration IPs to an allow list by submitting the following POST request to the API of your Databricks workspace (calls to enable the feature and verify the result are sketched after the request):
    curl -X POST -n \
      https://<databricks-instance>/api/2.0/ip-access-lists \
      -d '{
            "label": "Data Integration",
            "list_type": "ALLOW",
            "ip_addresses": [
              "52.14.86.20/32",
              "13.58.140.165/32",
              "52.14.192.86/32",
              "34.254.56.182/32"
            ]
          }'
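
IP access lists only take effect once the feature is enabled on the workspace. The following is a hedged sketch of enabling it and then verifying that the new ALLOW list was created; it assumes a personal access token and uses the workspace-conf key documented by Databricks:

    # Enable the IP access list feature (no effect if it is already enabled).
    curl -X PATCH \
      https://<databricks-instance>/api/2.0/workspace-conf \
      -H "Authorization: Bearer <personal-access-token>" \
      -H "Content-Type: application/json" \
      -d '{"enableIpAccessLists": "true"}'

    # List the workspace's IP access lists to confirm the new ALLOW entry.
    curl -s -X GET \
      https://<databricks-instance>/api/2.0/ip-access-lists \
      -H "Authorization: Bearer <personal-access-token>"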

Custom file zone

A Custom File Zone is a data storage setup that lets organizations manage and store their data flexibly. Data Integration's out-of-the-box Managed File Zone provides this functionality effortlessly, while the Custom File Zone gives organizations greater control over the specifics of data storage, although it requires setup.

For additional details on the setup process, refer to our documentation on the Custom File Zone.
