Custom file zone
Data Integration offers flexible data management options. You can securely store your data in your own custom file zone, giving you control over your designated storage, or use the Data Integration default Managed File Zone, which requires no setup and retains data for a fixed period of 48 hours.
A custom file zone enables you to stage data in your own cloud environment (for example, Amazon S3, Google Cloud Storage, or Azure Blob Storage) before loading it into a data warehouse, keeping intermediary data within your organization's control.
In this setup, the custom file zone acts as a data lake, storing raw data before it is transferred to the data warehouse. You can define your own data retention policies, which is beneficial for handling Personally Identifiable Information (PII), HIPAA-regulated data, and other sensitive datasets.
Key benefits of using a custom file zone
- Data storage control: Store your data in your own designated file zones rather than the Data Integration Managed File Zone, giving you greater control over data storage.
- Custom retention policies: Set your own data retention policies for the custom file zone, unlike the Managed File Zone, which retains data for a fixed 48 hours (see the retention sketch after this list).
- Data management: Files stored in a custom file zone remain in your specified AWS, Google Cloud Storage, or Azure buckets and are managed by you.
- Configuring a custom file zone requires some setup, while the Managed File Zone requires no additional setup.
- Data Integration support does not have access to files stored in a custom file zone; you must provide access or share the necessary information.
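For illustration, if your custom file zone is an Amazon S3 bucket, one way to implement a retention policy is an S3 lifecycle rule. The following is a minimal sketch using the AWS SDK for Python (boto3); the bucket name and the 30-day window are assumptions, not values required by Data Integration.

```python
# Hypothetical retention policy for a custom file zone bucket: expire staged
# objects after 30 days. Bucket name and retention period are placeholders.
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="my-data-integration-file-zone",   # your custom file zone bucket
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-staged-files",
                "Status": "Enabled",
                "Filter": {"Prefix": ""},      # apply to all staged objects
                "Expiration": {"Days": 30},    # delete objects after 30 days
            }
        ]
    },
)
```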
Custom file zone support for data warehouses and cloud storage
| Data Warehouse | Amazon S3 | Azure Blob Storage | Google Cloud Storage Buckets |
|---|---|---|---|
| Amazon Redshift | Yes | No | No |
| BigQuery | No | No | Yes |
| Snowflake | Yes | Yes | No |
| Azure Synapse Analytics | No | Yes | No |
| Amazon RDS/Aurora for PostgreSQL | Yes | No | No |
| Databricks | Yes | Yes | No |
| Amazon Athena | Yes | No | No |
| Azure SQL | No | Yes | No |
Amazon S3 bucket
If you are new to Amazon S3 Buckets, refer to the Amazon S3 documentation.
Creating a bucket
A bucket is an object container. To store data in Amazon S3, you must create a bucket and specify a bucket name and an AWS Region. Then you upload your data as objects to that bucket in Amazon S3. Each object has a key (or key name) that serves as the object's unique identifier within the bucket. You can start by logging into AWS and searching for Buckets.
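If you prefer to create the bucket programmatically rather than through the console, a minimal sketch with boto3 looks like this; the bucket name and region are examples only.

```python
# Create an S3 bucket to serve as the custom file zone.
# For us-east-1, omit CreateBucketConfiguration entirely.
import boto3

region = "eu-west-1"
s3 = boto3.client("s3", region_name=region)

s3.create_bucket(
    Bucket="my-data-integration-file-zone",
    CreateBucketConfiguration={"LocationConstraint": region},
)
```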
Adding a policy
After creating the bucket, create a policy that grants the necessary permissions on the bucket and the objects it contains; this policy can then be attached to the Data Integration user created in a later step.
Make sure to replace <RiveryFileZoneBucket> with the name of your S3 bucket.
Policy code:
```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "RiveryManageFZBucket",
      "Effect": "Allow",
      "Action": [
        "s3:GetBucketCORS",
        "s3:ListBucket",
        "s3:GetBucketLocation"
      ],
      "Resource": "arn:aws:s3:::<RiveryFileZoneBucket>"
    },
    {
      "Sid": "RiveryManageFZObjects",
      "Effect": "Allow",
      "Action": [
        "s3:ReplicateObject",
        "s3:PutObject",
        "s3:GetObjectAcl",
        "s3:GetObject",
        "s3:PutObjectVersionAcl",
        "s3:PutObjectAcl",
        "s3:ListMultipartUploadParts"
      ],
      "Resource": "arn:aws:s3:::<RiveryFileZoneBucket>/*"
    },
    {
      "Sid": "RiveryHeadBucketsAndGetLists",
      "Effect": "Allow",
      "Action": "s3:ListAllMyBuckets",
      "Resource": "*"
    }
  ]
}
```
Create a Data Integration user in AWS
To connect to the Amazon S3 source and target in the Data Integration console, you must create a Data Integration user in AWS.
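The exact procedure is covered in the linked documentation; purely as an illustration, a programmatic equivalent with boto3 might look like the sketch below. The user name, policy name, and the assumption that the policy from the previous section is saved locally (with the bucket name substituted) as file_zone_policy.json are all hypothetical.

```python
# Create an IAM user for Data Integration, attach the file zone policy to it,
# and issue access keys for the S3 connection. Names and file path are examples.
import boto3

iam = boto3.client("iam")

with open("file_zone_policy.json") as f:      # policy from the previous section
    policy_document = f.read()

iam.create_user(UserName="data-integration-user")

iam.put_user_policy(
    UserName="data-integration-user",
    PolicyName="RiveryFileZoneAccess",
    PolicyDocument=policy_document,
)

keys = iam.create_access_key(UserName="data-integration-user")
print(keys["AccessKey"]["AccessKeyId"])       # use this key pair in the S3 connection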
Connecting to Amazon S3
To connect to Amazon S3, refer to the Amazon S3 connection documentation.
After creating the bucket and connecting to Amazon S3 in Data Integration, continue to Configure custom file zone in Data Integration.
Azure Blob Storage container
If you are new to Azure Blob Storage, start with the Microsoft documentation.
Obtaining keys from Azure Blob Storage
- Follow the Microsoft documentation to create a Standard Azure storage account.
Only Standard accounts can use Azure Blob Storage containers (custom file zones) with Data Integration, so make sure to choose "Standard" in the Performance section.
- Make sure all the settings are correct before clicking Create. Creating the storage account may take a few minutes.
- Click Go to resource.
- Choose Containers (You can also scroll down the main menu to Blob Service and select Containers).
- In the upper left corner, click +Containers.
- Enter the container name.
- From the Public access level drop-down menu, select Container.
- Click Ok.
- Go to Access Keys in the storage account menu.
- Click Copy and save your keys; they are used when connecting to Azure Blob Storage in Data Integration.
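As an optional sanity check, you can confirm that the account name and key you copied work before entering them in Data Integration. Below is a minimal sketch with the Azure SDK for Python (azure-storage-blob); the account name and key are placeholders.

```python
# List containers in the storage account to verify the copied account key.
from azure.storage.blob import BlobServiceClient

ACCOUNT_NAME = "mystorageaccount"              # placeholder storage account name
ACCOUNT_KEY = "<paste the key copied above>"   # placeholder account key

service = BlobServiceClient(
    account_url=f"https://{ACCOUNT_NAME}.blob.core.windows.net",
    credential=ACCOUNT_KEY,
)

for container in service.list_containers():
    print(container.name)                      # the custom file zone container should appear
```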
Connecting to Azure Blob Storage in Data Integration
In the Data Integration console, select Connections from the main menu and find 'Azure Blob Storage'.
Procedure
- Enter your Connection Name.
- Fill out the Account Name and Account Key.
- Enter your SAS Token.
This is mandatory when using Blob Storage as a custom file zone (it is optional only when Blob Storage is used as a source). To create a SAS Token, refer to the Microsoft documentation; a sketch of generating one programmatically follows this procedure.
- Click Test Connection to verify your connection. If the connection succeeds, you can use this connection in Data Integration.
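The Microsoft documentation is the authoritative reference for SAS tokens; as a rough sketch, an account-level SAS can also be generated with azure-storage-blob. The permissions, expiry, and account details below are assumptions; grant only what your setup actually needs.

```python
# Generate an account SAS token to paste into the SAS Token field.
from datetime import datetime, timedelta, timezone

from azure.storage.blob import (
    AccountSasPermissions,
    ResourceTypes,
    generate_account_sas,
)

sas_token = generate_account_sas(
    account_name="mystorageaccount",          # placeholder storage account name
    account_key="<account key>",              # placeholder account key
    resource_types=ResourceTypes(container=True, object=True),
    permission=AccountSasPermissions(
        read=True, write=True, delete=True, list=True, add=True, create=True
    ),
    expiry=datetime.now(timezone.utc) + timedelta(days=365),  # example expiry
)
print(sas_token)
```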
After creating the container and connecting to Azure Blob Storage in Data Integration, continue to 'Configure custom file zone in Data Integration'.
Google Cloud Storage buckets
If you are new to Google Cloud Storage Buckets, start with the Google documentation.
- Enable necessary APIs:
- Navigate to the Google Cloud console and access the "APIs and Services" section.
- Click on "Library."
- Look up "Google Cloud Storage JSON API" and enable it.
- Find the "BigQuery API" and enable it.
- Grant permissions:
- Data Integration automatically generates a dedicated service account and Google Cloud Storage (GCS) account folder. This service account only has access to the created folder.
- To grant permissions, sign in to the "Google Cloud Platform" console and ensure you are in the desired project.
- Go to "IAM & Admin" and then click "IAM".
- Click on "+GRANT ACCESS" under the "VIEW BY PRINCIPALS" section.
- Add the Data Integration service account under the "New Principals" section.
- Assign the "BigQuery Data Viewer" and "BigQuery Job User" roles and save the settings.
- Create a Google Cloud service account
The custom file zone can use either an existing service account or a newly created one. To create a new service account in Google Cloud Platform, confirm that your user has the "Service Account Admin" role, then follow these steps:
- Sign in to the Google Cloud Platform console.
- Go to "IAM & Admin", then "Service accounts", and click "+CREATE SERVICE ACCOUNT".
- In the service account creation pop-up:
- Specify the Service Account name (for example, Data_Integration User).
- Click "CREATE AND CONTINUE".
- Grant the service account project access by selecting the BigQuery Data Viewer and BigQuery Job User roles, then click CONTINUE.
- Optionally, grant users access to this service account. To finish, click DONE.
- After creation, the service account is displayed in the Google Cloud console.
- Provide access for the service account: grant the 'storage.buckets.list' permission at the project level by assigning the service account a role that includes this permission.
This permission is essential for retrieving the service account's buckets and adding them to the connection list.
- Create a Google Cloud Storage bucket:
- Sign in to the Google Cloud Platform Console.
- Go to "Cloud Storage" and then "Buckets" and click "+CREATE".
- In the Bucket Creation Wizard:
- Set Bucket Name (for example, project_name_data_integration_file_zone).
- Choose a Region for the bucket.
- Select a storage class.
- Configure the object access for your Bucket as Uniform, and make sure to select the option "Enforce public access prevention on this bucket."
- Click "CREATE".
- Provide access to the dedicated bucket for the service account:
- Navigate to "Cloud Storage" and click "Buckets".
- Select the intended bucket (designated for custom file zone).
- Within the "Permissions" section, click on the "+GRANT ACCESS" option.
- In the "Add Principals" area, include your service account.
- For role assignment, designate the Storage Admin role for the specified custom file zone bucket.
- Complete the process by clicking SAVE.
- Configure the BigQuery source custom connection in Data Integration:
- Follow the same steps as described under step 3 in "Default Service Account and Bucket Configuration."
- You can now enable the "Custom File Zone" toggle; this is where the Custom Service Account and Bucket Configuration comes into play.
- Provide the Service Account email.
- Submit your Service Account Private Key (JSON) file. To create the JSON file for the service account, follow these steps:
- Log in to the Google Cloud Platform console.
- Navigate to "IAM & Admin" and click "Service accounts".
- Select the relevant service account from the menu and open the "Manage Keys" option.
- Click the dropdown for adding a key and choose "Create new key".
- Choose the JSON key format and click "Create".
- A JSON file is generated and downloaded to your local device.
- Set the Service Account Private Key (JSON); the Service Account Email and Project ID fields are populated automatically.
- Set the Default Bucket to the one created earlier, ensuring Region consistency.
- Test the connection and save the settings.
Project ID and custom file zone association
The Project ID specified in the attached Service Account Private Key (JSON) file determines the custom file zone. The Project ID is extracted from the key file and dictates which buckets (for example, "Data Integration-Bucket") are available for use as the Custom File Zone.
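To confirm that the storage.buckets.list permission is in place and to see which buckets the key's project exposes, a quick check with the google-cloud-storage Python client can help; the key file name below is a placeholder for the JSON file downloaded earlier.

```python
# List the buckets visible to the service account; only these can be chosen
# as the Custom File Zone, and they all belong to the key's Project ID.
from google.cloud import storage

client = storage.Client.from_service_account_json("service-account-key.json")

print("Project ID:", client.project)
for bucket in client.list_buckets():    # requires storage.buckets.list
    print(bucket.name)
```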
Configure custom file zone in Data Integration
- Go to Connections -> +New Connection and search for your Target warehouse.
- Type in your Connection Details and Credentials.
- Toggle the Custom File Zone to true.
- Click File Zone Connection and select the previously configured File Zone connection.
- Choose a Default Bucket (Container) from the drop-down list.
- Use the Test Connection function to verify your connection. If the connection is successful, click Save.
When working with Databricks, you must add specific configuration entries for Azure Blob Storage to function as a custom file zone. Add the following configuration to the SQL Warehouse in Databricks (the rivery segment in the property names is the storage account name):
```
spark.hadoop.fs.azure.account.auth.type.rivery.dfs.core.windows.net SAS
spark.hadoop.fs.azure.sas.token.provider.type.rivery.dfs.core.windows.net org.apache.hadoop.fs.azurebfs.sas.FixedSASTokenProvider
```
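Depending on your environment, the FixedSASTokenProvider may also require the SAS token itself, supplied through a spark.hadoop.fs.azure.sas.fixed.token.rivery.dfs.core.windows.net entry; check the Databricks and hadoop-azure documentation for the exact settings your workspace requires.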