Custom file zone
Data Integration offers flexible data management options. You can securely store your data in your own custom file zone, giving you control over your designated storage, or use the Data Integration default Managed File Zone, which requires no setup and retains data for a fixed period of 48 hours.
A custom file zone enables you to stage data in your own cloud environment (for example, Amazon S3, Google Cloud Storage, or Azure Blob Storage) before loading it into a data warehouse, keeping intermediary data within your organization's control.
In this setup, the custom file zone acts as a data lake, storing raw data before it is transferred to the data warehouse. You can define your own data retention policies, which is beneficial for handling Personally Identifiable Information (PII), HIPAA-regulated data, and other sensitive datasets.
Key benefits of using a custom file zone
- Data storage control: Store your data in your own designated file zones rather than the Data Integration Managed File Zone, giving you greater control over data storage.
- Custom retention policies: Set your own data retention policies for the custom file zone, unlike the Managed File Zone, which retains data for a fixed 48 hours (see the retention sketch after this list).
- Data management: Files stored in a custom file zone remain in your specified AWS, Google Cloud Storage, or Azure buckets and are managed by you.
- Configuring a custom file zone requires some setup, while the Managed File Zone requires no additional setup.
- Data Integration support does not have access to files stored in a custom file zone; you must provide access or share the necessary information.
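For illustration, if your custom file zone is an Amazon S3 bucket, one way to implement a retention policy is an S3 lifecycle rule. The following is a minimal sketch using the AWS SDK for Python (boto3); the bucket name and the 30-day window are assumptions, not values required by Data Integration.

```python
# Hypothetical retention policy for a custom file zone bucket: expire staged
# objects after 30 days. Bucket name and retention period are placeholders.
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="my-data-integration-file-zone",   # your custom file zone bucket
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-staged-files",
                "Status": "Enabled",
                "Filter": {"Prefix": ""},      # apply to all staged objects
                "Expiration": {"Days": 30},    # delete objects after 30 days
            }
        ]
    },
)
```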
Custom file zone support for data warehouses and cloud storage
| Data Warehouse | Amazon S3 | Azure Blob Storage | Google Cloud Storage Buckets |
|---|---|---|---|
| Amazon Redshift | Yes | No | No |
| BigQuery | No | No | Yes |
| Snowflake | Yes | Yes | No |
| Azure Synapse Analytics | No | Yes | No |
| Amazon RDS/Aurora for PostgreSQL | Yes | No | No |
| Databricks | Yes | Yes | No |
| Amazon Athena | Yes | No | No |
| Azure SQL | No | Yes | No |
Amazon S3 bucket
If you are new to Amazon S3 Buckets, refer to the Amazon S3 documentation.
Creating a bucket
A bucket is an object container. To store data in Amazon S3, you must create a bucket and specify a bucket name and an AWS Region. Then you upload your data as objects to that bucket in Amazon S3. Each object has a key (or key name) that serves as the object's unique identifier within the bucket. You can start by logging into AWS and searching for Buckets.
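If you prefer to create the bucket programmatically rather than through the console, a minimal sketch with boto3 looks like this; the bucket name and region are examples only.

```python
# Create an S3 bucket to serve as the custom file zone.
# For us-east-1, omit CreateBucketConfiguration entirely.
import boto3

region = "eu-west-1"
s3 = boto3.client("s3", region_name=region)

s3.create_bucket(
    Bucket="my-data-integration-file-zone",
    CreateBucketConfiguration={"LocationConstraint": region},
)
```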
Adding a policy
After creating the bucket, create a policy that grants the necessary permissions on the bucket and the objects it contains; this policy can then be attached to the Data Integration user created in a later step.
Make sure to replace <RiveryFileZoneBucket> with the name of your S3 bucket.
Policy code:
```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "RiveryManageFZBucket",
      "Effect": "Allow",
      "Action": [
        "s3:GetBucketCORS",
        "s3:ListBucket",
        "s3:GetBucketLocation"
      ],
      "Resource": "arn:aws:s3:::<RiveryFileZoneBucket>"
    },
    {
      "Sid": "RiveryManageFZObjects",
      "Effect": "Allow",
      "Action": [
        "s3:ReplicateObject",
        "s3:PutObject",
        "s3:GetObjectAcl",
        "s3:GetObject",
        "s3:PutObjectVersionAcl",
        "s3:PutObjectAcl",
        "s3:ListMultipartUploadParts"
      ],
      "Resource": "arn:aws:s3:::<RiveryFileZoneBucket>/*"
    },
    {
      "Sid": "RiveryHeadBucketsAndGetLists",
      "Effect": "Allow",
      "Action": "s3:ListAllMyBuckets",
      "Resource": "*"
    }
  ]
}
```
Create a Data Integration user in AWS
To connect to the Amazon S3 source and target in the Data Integration console, you must create a Data Integration user in AWS.
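The exact procedure is covered in the linked documentation; purely as an illustration, a programmatic equivalent with boto3 might look like the sketch below. The user name, policy name, and the assumption that the policy from the previous section is saved locally (with the bucket name substituted) as file_zone_policy.json are all hypothetical.

```python
# Create an IAM user for Data Integration, attach the file zone policy to it,
# and issue access keys for the S3 connection. Names and file path are examples.
import boto3

iam = boto3.client("iam")

with open("file_zone_policy.json") as f:      # policy from the previous section
    policy_document = f.read()

iam.create_user(UserName="data-integration-user")

iam.put_user_policy(
    UserName="data-integration-user",
    PolicyName="RiveryFileZoneAccess",
    PolicyDocument=policy_document,
)

keys = iam.create_access_key(UserName="data-integration-user")
print(keys["AccessKey"]["AccessKeyId"])       # use this key pair in the S3 connection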
Connecting to Amazon S3
To connect to Amazon S3, refer to the Amazon S3 connection documentation.
After creating the bucket and connecting to Amazon S3 in Data Integration, continue to Configure custom file zone in Data Integration.
Azure Blob Storage container
If you are new to Azure Blob Storage, start with the Microsoft documentation.
Obtaining keys from Azure Blob Storage
- Follow the Microsoft documentation to create a Standard Azure storage account.
Only Standard accounts can use Azure Blob Storage containers (custom file zones) with Data Integration, so make sure to choose "Standard" in the Performance section.
- Make sure all the settings are correct before clicking Create. Creating the storage account may take a few minutes.
- Click Go to resource.
- Choose Containers (You can also scroll down the main menu to Blob Service and select Containers).
- In the upper left corner, click +Containers.
- Enter the container name.
- From the Public access level drop-down menu, select Container.
- Click Ok.
- Go to Access Keys in the storage account menu.
- Click Copy and save your keys; they are used when connecting to Azure Blob Storage in Data Integration.
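As an optional sanity check, you can confirm that the account name and key you copied work before entering them in Data Integration. Below is a minimal sketch with the Azure SDK for Python (azure-storage-blob); the account name and key are placeholders.

```python
# List containers in the storage account to verify the copied account key.
from azure.storage.blob import BlobServiceClient

ACCOUNT_NAME = "mystorageaccount"              # placeholder storage account name
ACCOUNT_KEY = "<paste the key copied above>"   # placeholder account key

service = BlobServiceClient(
    account_url=f"https://{ACCOUNT_NAME}.blob.core.windows.net",
    credential=ACCOUNT_KEY,
)

for container in service.list_containers():
    print(container.name)                      # the custom file zone container should appear
```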
Connecting to Azure Blob Storage in Data Integration
In the Data Integration console, select Connections from the main menu and find 'Azure Blob Storage'.
Procedure
- Enter your Connection Name.
- Fill out the Account Name and Account Key.
- Enter your SAS Token.
This is mandatory when using Blob Storage as a custom file zone (it is optional only when Blob Storage is used as a source). To create a SAS Token, refer to the Microsoft documentation; a sketch of generating one programmatically follows this procedure.
- Click Test Connection to verify your connection. If the connection succeeds, you can use this connection in Data Integration.
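The Microsoft documentation is the authoritative reference for SAS tokens; as a rough sketch, an account-level SAS can also be generated with azure-storage-blob. The permissions, expiry, and account details below are assumptions; grant only what your setup actually needs.

```python
# Generate an account SAS token to paste into the SAS Token field.
from datetime import datetime, timedelta, timezone

from azure.storage.blob import (
    AccountSasPermissions,
    ResourceTypes,
    generate_account_sas,
)

sas_token = generate_account_sas(
    account_name="mystorageaccount",          # placeholder storage account name
    account_key="<account key>",              # placeholder account key
    resource_types=ResourceTypes(container=True, object=True),
    permission=AccountSasPermissions(
        read=True, write=True, delete=True, list=True, add=True, create=True
    ),
    expiry=datetime.now(timezone.utc) + timedelta(days=365),  # example expiry
)
print(sas_token)
```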
After creating the container and connecting to Azure Blob Storage in Data Integration, continue to 'Configure custom file zone in Data Integration'.
Google Cloud Storage buckets
If you are new to Google Cloud Storage Buckets, start with the Google documentation.
- Enable necessary APIs:
- Navigate to the Google Cloud console and access the "APIs and Services" section.
- Click on "Library."
- Look up "Google Cloud Storage JSON API" and enable it.
- Find the "BigQuery API" and enable it.
- Grant permissions:
- Data Integration automatically generates a dedicated service account and Google Cloud Storage (GCS) account folder. This service account only has access to the created folder.
- To grant permissions, sign in to the "Google Cloud Platform" console and ensure you are in the desired project.
- Go to "IAM & Admin" and then click "IAM".
- Click on "+GRANT ACCESS" under the "VIEW BY PRINCIPALS" section.
- Add the Data Integration service account under the "New Principals" section.
- Assign the "BigQuery Data Viewer" and "BigQuery Job User" roles and save the settings.
- Create a Google Cloud service account
The custom file zone can use either an existing service account or a newly created one. To create a new service account in Google Cloud Platform, confirm that your user has the "Service Account Admin" role, then follow these steps:
- Sign in to the Google Cloud Platform console.
- Go to "IAM & Admin", then "Service accounts", and click "+CREATE SERVICE ACCOUNT".
- In the service account creation pop-up:
- Specify the Service Account name (for example, Data_Integration User).
- Click "CREATE AND CONTINUE".
- Grant the service account project access by selecting the BigQuery Data Viewer and BigQuery Job User roles, then click CONTINUE.
- Optionally, grant users access to this service account. To finish, click DONE.
- After creation, the service account is displayed in the Google Cloud console.
- Provide access for the service account: grant the 'storage.buckets.list' permission at the project level by assigning the service account a role that includes this permission.
This permission is essential for retrieving the service account's buckets and adding them to the connection list.
- Create a Google Cloud Storage bucket:
- Sign in to the Google Cloud Platform Console.
- Go to "Cloud Storage" and then "Buckets" and click "+CREATE".
- In the Bucket Creation Wizard:
- Set Bucket Name (for example, project_name_data_integration_file_zone).
- Choose a Region for the bucket.
- Select a storage class.
- Configure the object access for your Bucket as Uniform, and make sure to select the option "Enforce public access prevention on this bucket."
- Click "CREATE".
- Provide access to the dedicated bucket for the service account:
- Navigate to "Cloud Storage" and click "Buckets".
- Select the intended bucket (designated for custom file zone).
- Within the "Permissions" section, click on the "+GRANT ACCESS" option.
- In the "Add Principals" area, include your service account.
- For role assignment, designate the Storage Admin role for the specified custom file zone bucket.
- Complete the process by clicking SAVE.
- Configure the BigQuery source custom connection in Data Integration:
- Follow the same steps as described under step 3 in "Default Service Account and Bucket Configuration."
- You can now enable the "Custom File Zone" toggle; this is where the Custom Service Account and Bucket Configuration comes into play.
- Provide the Service Account email.
- Submit your Service Account Private Key (JSON) file. To create the JSON file for the service account, follow these steps:
- Log in to the Google Cloud Platform console.
- Navigate to "IAM & Admin" and click "Service accounts".
- Select the relevant service account from the menu and open the "Manage Keys" option.
- Click the dropdown for adding a key and choose "Create new key".
- Choose the JSON key format and click "Create".
- A JSON file is generated and downloaded to your local device.
- Set the Service Account Private Key (JSON); the Service Account Email and Project ID fields are populated automatically.
- Set the Default Bucket to the one created earlier, ensuring Region consistency.
- Test the connection and save the settings.
Project ID and custom file zone association
The Project ID specified in the attached Service Account Private Key (JSON) file determines the custom file zone. The Project ID is extracted from the key file and dictates which buckets (for example, "Data Integration-Bucket") are available for use as the Custom File Zone.
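To confirm that the storage.buckets.list permission is in place and to see which buckets the key's project exposes, a quick check with the google-cloud-storage Python client can help; the key file name below is a placeholder for the JSON file downloaded earlier.

```python
# List the buckets visible to the service account; only these can be chosen
# as the Custom File Zone, and they all belong to the key's Project ID.
from google.cloud import storage

client = storage.Client.from_service_account_json("service-account-key.json")

print("Project ID:", client.project)
for bucket in client.list_buckets():    # requires storage.buckets.list
    print(bucket.name)
```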
Configure custom file zone in Data Integration
- Go to Connections -> +New Connection and search for your Target warehouse.
- Type in your Connection Details and Credentials.
- Toggle the Custom File Zone to true.
- Click File Zone Connection and select the previously configured File Zone connection.
- Choose a Default Bucket (Container) from the drop-down list.
- Use the Test Connection function to verify your connection. If the connection is successful, click Save.
When working with Databricks, you must add specific configuration entries for Azure Blob Storage to function as a custom file zone. Add the following configuration to the SQL Warehouse in Databricks (the rivery segment in the property names is the storage account name):
```
spark.hadoop.fs.azure.account.auth.type.rivery.dfs.core.windows.net SAS
spark.hadoop.fs.azure.sas.token.provider.type.rivery.dfs.core.windows.net org.apache.hadoop.fs.azurebfs.sas.FixedSASTokenProvider
```
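Depending on your environment, the FixedSASTokenProvider may also require the SAS token itself, supplied through a spark.hadoop.fs.azure.sas.fixed.token.rivery.dfs.core.windows.net entry; check the Databricks and hadoop-azure documentation for the exact settings your workspace requires.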