
Creating a connection to Databricks

Databricks provides a managed service for data processing and transformation on a data lake. It uses Delta Lake, an open source solution developed by Databricks, which enables the creation, management, and processing of data using the Lakehouse architecture.

This guide will show you how to set up the required credentials and configurations for using Databricks with Data Integration.

info

When establishing a connection, you can instead use the Databricks Partner Connect guide. Following that guide automatically generates a fully functional Databricks connection in Data Integration.

Prerequisites

A valid Databricks Admin Account and Workspace.

Create a SQL warehouse

To use Databricks as a target, operations are performed against an existing SQL Warehouse. To create a new SQL Warehouse, follow the steps below (a scripted alternative using the Databricks REST API is sketched after the steps):

  1. Log in to your Databricks workspace.
  2. Go to the SQL console.
  3. Click SQL Warehouses, then click Create SQL Warehouse in the top right corner.
  4. In the new SQL warehouse modal, name your endpoint (for example, "RiverySQLEndpoint"), choose the appropriate Cluster Size, and set Auto Stop to at least 120 minutes of inactivity. Click Create.
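
If you prefer to script this step, the same warehouse can be created through the Databricks SQL Warehouses REST API. This is a minimal sketch, assuming you already have a personal access token and that <databricks-instance>, the warehouse name, and the cluster size are placeholders you replace with your own values:

    # Create a SQL warehouse via the REST API (name, size, and placeholders are assumptions).
    curl -X POST \
      https://<databricks-instance>/api/2.0/sql/warehouses \
      -H "Authorization: Bearer <personal-access-token>" \
      -H "Content-Type: application/json" \
      -d '{
            "name": "RiverySQLEndpoint",
            "cluster_size": "Small",
            "auto_stop_mins": 120
          }'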

Configure data access for the SQL warehouse

  1. Navigate to the Admin Settings.
  2. Click the SQL Warehouse Settings tab.
  3. If you are using an Instance Profile, select the specific one that you wish to use.
  4. Copy and paste the following configuration into the Data Access Configuration textbox:
    spark.hadoop.fs.s3a.impl shaded.databricks.org.apache.hadoop.fs.s3a.S3AFileSystem
    spark.hadoop.fs.s3n.impl shaded.databricks.org.apache.hadoop.fs.s3a.S3AFileSystem
    spark.hadoop.fs.s3n.impl.disable.cache true
    spark.hadoop.fs.s3.impl.disable.cache true
    spark.hadoop.fs.s3a.impl.disable.cache true
    spark.hadoop.fs.s3.impl shaded.databricks.org.apache.hadoop.fs.s3a.S3AFileSystem
  5. In the SQL Configuration Parameters textbox, configure the following settings:
    ANSI_MODE false
  6. Click Save Changes. The same settings can also be applied programmatically, as sketched below.
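
For automation, these values can also be set through the Databricks workspace-level SQL warehouse configuration API. This is a minimal sketch, assuming the PUT /api/2.0/sql/config/warehouses endpoint and field names match your workspace's API version; note that this call replaces the existing warehouse configuration rather than merging into it, so verify it against the Databricks API reference before running it:

    # Hedged sketch: set the workspace SQL warehouse data access configuration via the API.
    curl -X PUT \
      https://<databricks-instance>/api/2.0/sql/config/warehouses \
      -H "Authorization: Bearer <personal-access-token>" \
      -H "Content-Type: application/json" \
      -d '{
            "data_access_config": [
              {"key": "spark.hadoop.fs.s3a.impl", "value": "shaded.databricks.org.apache.hadoop.fs.s3a.S3AFileSystem"},
              {"key": "spark.hadoop.fs.s3n.impl", "value": "shaded.databricks.org.apache.hadoop.fs.s3a.S3AFileSystem"},
              {"key": "spark.hadoop.fs.s3.impl", "value": "shaded.databricks.org.apache.hadoop.fs.s3a.S3AFileSystem"},
              {"key": "spark.hadoop.fs.s3n.impl.disable.cache", "value": "true"},
              {"key": "spark.hadoop.fs.s3.impl.disable.cache", "value": "true"},
              {"key": "spark.hadoop.fs.s3a.impl.disable.cache", "value": "true"}
            ],
            "sql_configuration_parameters": {
              "configuration_pairs": [
                {"key": "ANSI_MODE", "value": "false"}
              ]
            }
          }'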

Get the SQL warehouse credentials

To connect from Data Integration, credentials are required for each SQL Warehouse configured in the Databricks console. In Data Integration, each connection typically corresponds to a single SQL Warehouse. (The same details can also be retrieved through the API, as sketched after the steps below.)

  1. Return to the SQL Warehouses section and select the warehouse that was recently created.
  2. Access the Connection Details tab and copy the following parameters:
    • Server Hostname
    • Port
    • HTTP Path
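
The same connection details are returned by the SQL Warehouses REST API under odbc_params. A minimal sketch, assuming a personal access token and that <warehouse-id> is the ID of the warehouse you created:

    # Fetch hostname, port, and HTTP path for a warehouse (jq is optional, used only for readability).
    curl -s -X GET \
      https://<databricks-instance>/api/2.0/sql/warehouses/<warehouse-id> \
      -H "Authorization: Bearer <personal-access-token>" \
      | jq '.odbc_params'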

Create a new personal access token

To establish a connection, you must generate a Personal Access Token associated with the user. To create a personal access token, follow the instructions below (an API-based alternative is sketched after the steps):

  1. Navigate to User Settings.
  2. Click the Access Tokens tab and select Generate New Token.
  3. In the opened modal, provide a name for your token (such as "Data Integration") and set the expiration lifetime to a duration long enough to guarantee consistent, dependable operation. In this example, the expiration lifetime is set to 1825 days (5 years).
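
Tokens can also be created through the Token REST API. A minimal sketch, assuming you already have an existing token or other credentials to authenticate the API call, with the 1825-day lifetime expressed in seconds:

    # Create a personal access token valid for 1825 days (1825 * 86400 = 157680000 seconds).
    curl -X POST \
      https://<databricks-instance>/api/2.0/token/create \
      -H "Authorization: Bearer <existing-personal-access-token>" \
      -H "Content-Type: application/json" \
      -d '{"comment": "Data Integration", "lifetime_seconds": 157680000}'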

Alternatively, instead of using a personal access token, you can authenticate your Databricks connection using OAuth 2.0. For detailed steps on setting up OAuth in Databricks, refer to Enable custom OAuth applications using the Databricks UI in the Databricks documentation.
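
Before configuring the connection, you can check that your workspace's OAuth endpoints are reachable by querying its OAuth discovery document. This is a hedged sketch; the /oidc/.well-known/oauth-authorization-server path is the discovery endpoint Databricks documents for workspace-level OAuth, but verify it against your deployment:

    # Print the workspace's OAuth authorization and token endpoints.
    curl -s https://<databricks-instance>/oidc/.well-known/oauth-authorization-server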

Configure the OAuth 2.0 Connection in Data Integration

  1. Go to Connections, and click + New Connection.

  2. Select Databricks.

  3. Enter the following details:

    • Connection Name
    • Server Hostname from your Databricks SQL Warehouse connection details.
    • Port from your SQL Warehouse connection details.
    • HTTP Path from your SQL Warehouse connection details.
    • For Authentication Type, select the OAuth2 option.
    • Client ID from Databricks.
    • Client Secret from Databricks.
  4. Enable the Custom File Zone toggle if required.

  5. (Optional) You can also specify a Default Catalog and Default Schema to control where data loads within Databricks.

  6. Click Connect with Databricks to authorize the connection. If the connection is successful, you see Connected with Databricks and Test Connection Passed! Otherwise, an error message appears.

Configure Databricks to allow communication from Data Integration IPs (Optional)

If your Databricks workspace has IP restrictions, you must allow the Data Integration IPs to ensure that operations from Data Integration run successfully.

To allow the Data Integration IPs, follow these steps:

  1. Review the Databricks documentation on IP access lists for the full range of IP access list operations.
  2. Add the Data Integration IPs to an allow list by submitting the following POST request to the API of your Databricks workspace (calls to enable the feature and verify the result are sketched after the request):
    curl -X POST -n \
      https://<databricks-instance>/api/2.0/ip-access-lists \
      -d '{
            "label": "Data Integration",
            "list_type": "ALLOW",
            "ip_addresses": [
              "52.14.86.20/32",
              "13.58.140.165/32",
              "52.14.192.86/32",
              "34.254.56.182/32"
            ]
          }'
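
IP access lists only take effect once the feature is enabled on the workspace. The following is a hedged sketch of enabling it and then verifying that the new ALLOW list was created; it assumes a personal access token and uses the workspace-conf key documented by Databricks:

    # Enable the IP access list feature (no effect if it is already enabled).
    curl -X PATCH \
      https://<databricks-instance>/api/2.0/workspace-conf \
      -H "Authorization: Bearer <personal-access-token>" \
      -H "Content-Type: application/json" \
      -d '{"enableIpAccessLists": "true"}'

    # List the workspace's IP access lists to confirm the new ALLOW entry.
    curl -s -X GET \
      https://<databricks-instance>/api/2.0/ip-access-lists \
      -H "Authorization: Bearer <personal-access-token>"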

Custom file zone

A Custom File Zone is a data storage setup that lets organizations manage and store their data flexibly. Data Integration's out-of-the-box Managed File Zone provides this functionality effortlessly, while the Custom File Zone gives organizations greater control over the specifics of data storage, although it requires setup.

For additional details on the setup process, refer to our documentation on the Custom File Zone.
