Converting a CSV file to Parquet
You can use Data Integration to convert a CSV file to Parquet in Amazon S3.
Prerequisites
- An Amazon S3 bucket
- A bucket policy
- A Data Integration user in AWS
Creating a bucket
A bucket is an object container. To store data in Amazon S3, you must first create a bucket and specify a bucket name and an AWS Region. You then upload your data as objects to that bucket. Each object has a key (or key name) that serves as the object's unique identifier within the bucket. Begin by logging in to AWS and searching for Buckets.
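When choosing a bucket name, it helps to know the core S3 naming rules up front, since the console rejects invalid names. The sketch below checks the main constraints (3-63 characters; lowercase letters, digits, dots, and hyphens; must start and end with a letter or digit). It is a simplified check, not the full AWS rule set (AWS also rejects IP-address-style names, for example), and the sample names are made up.

```python
import re

# Simplified S3 bucket naming check: 3-63 chars, lowercase letters,
# digits, dots, and hyphens; first and last chars must be a letter
# or digit. AWS applies additional rules not covered here.
BUCKET_NAME_RE = re.compile(r"^[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]$")

def is_valid_bucket_name(name: str) -> bool:
    return bool(BUCKET_NAME_RE.match(name))

print(is_valid_bucket_name("my-data-integration-bucket"))  # lowercase with hyphens: valid
print(is_valid_bucket_name("MyBucket"))                    # uppercase letters: invalid
print(is_valid_bucket_name("ab"))                          # shorter than 3 chars: invalid
```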
Policy
A bucket policy is a resource-based policy that allows you to grant access permissions to your bucket and the objects contained within it. Now that you have created a bucket, create a policy to grant the necessary permissions:
Policy code:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "RiveryManageFZBucket",
      "Effect": "Allow",
      "Action": [
        "s3:GetBucketCORS",
        "s3:ListBucket",
        "s3:GetBucketLocation"
      ],
      "Resource": "arn:aws:s3:::<RiveryFileZoneBucket>"
    },
    {
      "Sid": "RiveryManageFZObjects",
      "Effect": "Allow",
      "Action": [
        "s3:ReplicateObject",
        "s3:PutObject",
        "s3:GetObjectAcl",
        "s3:GetObject",
        "s3:PutObjectVersionAcl",
        "s3:PutObjectAcl",
        "s3:ListMultipartUploadParts"
      ],
      "Resource": "arn:aws:s3:::<RiveryFileZoneBucket>/*"
    },
    {
      "Sid": "RiveryHeadBucketsAndGetLists",
      "Effect": "Allow",
      "Action": "s3:ListAllMyBuckets",
      "Resource": "*"
    }
  ]
}
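Before pasting a policy like the one above into the S3 console, it can be worth sanity-checking it as JSON, since the console rejects malformed documents (trailing commas, unquoted keys). A minimal sketch, using only the standard library and an abbreviated copy of the first statement with a placeholder bucket name:

```python
import json

# Abbreviated copy of the bucket policy; <RiveryFileZoneBucket> is a
# placeholder to be replaced with the real bucket name.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "RiveryManageFZBucket",
            "Effect": "Allow",
            "Action": ["s3:GetBucketCORS", "s3:ListBucket", "s3:GetBucketLocation"],
            "Resource": "arn:aws:s3:::<RiveryFileZoneBucket>",
        },
    ],
}

# Round-trip through the JSON encoder/decoder to confirm the document
# is structurally valid, then check each statement has the fields the
# policy above uses.
decoded = json.loads(json.dumps(policy))
for stmt in decoded["Statement"]:
    assert {"Sid", "Effect", "Action", "Resource"} <= set(stmt)
print("policy OK, version:", decoded["Version"])
```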
Data Integration user in AWS
To connect to the Amazon S3 source and target (described in the following section) in the Data Integration console, you must create an AWS Data Integration user.
Converting with Data Integration
After completing the AWS configuration, create a Data Integration Account to connect to the Data Integration console. You can then use the Data Integration feature to convert the CSV file to Parquet.
To convert a CSV file to Parquet using the Data Integration console, create a Source-to-Target River. The conversion is performed by setting the target storage format.
Step 1: Create the River
- Navigate to Rivers and click + New River.
- Select Source to Target River.
Step 2: Choose your source
In the Source tab, define the parameters for your incoming CSV data:
- Select your source connection (for example, Amazon S3, Google Cloud Storage, or Azure Blob).
- File Path Prefix: Enter the directory path that contains your CSV files.
- File Pattern: Use *.csv to pick up all CSV files, or enter a specific filename.
- File Type: Select CSV.
- Delimiter: Ensure this matches your file (typically a comma, ,).
- Header Rows To Skip: Typically set to 0 if you want to include the first row.
- After Pulling: Choose whether the file should Remain in original place, be Deleted, or be Archived.
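The File Pattern field works like a filename glob: *.csv selects every file whose name ends in .csv under the File Path Prefix. The sketch below illustrates that matching behavior with Python's standard fnmatch module; the file names are made up, and the actual Data Integration matcher may differ in details such as case sensitivity.

```python
from fnmatch import fnmatchcase

# Hypothetical file names under the File Path Prefix.
files = ["orders_2024.csv", "orders_2024.csv.gz", "readme.txt", "daily.CSV"]

# Glob pattern as entered in the File Pattern field. fnmatchcase is
# case-sensitive, so "daily.CSV" is not picked up by "*.csv".
pattern = "*.csv"
matched = [f for f in files if fnmatchcase(f, pattern)]
print(matched)  # ['orders_2024.csv']
```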
Step 3: Define the target settings
In the Target tab of your River, define the destination for your Parquet file:
- Select your Target (for example, Amazon S3).
- Bucket Name: Enter or select the destination bucket (for example, {aws_file_zone}).
- Path Selection: Choose between Auto-Period or Custom. If using Auto-Period, define the File Zone Path and the File Zone Folders Period Partition (for example, Day or Hour).
- Original File Type: Select the format of your source data; in this case, CSV.
- Turn on the Convert file to Parquet toggle.
This toggle is the primary engine for the conversion. When enabled, the Data Integration Console will transform the source CSV data into a columnar Parquet format before writing it to the target bucket.
Step 4: Schema mapping
- Go to the Mapping tab.
- Click Auto-Detect and ensure the data types are correct.
Step 5: Run and verify
- Click Save and then Run.
- Once the River completes, navigate to your target bucket.
- You should see a new file with the .parquet extension.