Amazon S3 as a source walkthrough
Data Integration connects to Amazon S3 and pulls data into your preferred Target, letting you unlock valuable insights from your datasets.
This walkthrough shows how to create a new River, choose an extraction mode, and manage your data storage so that you can pull data from Amazon S3 and integrate it seamlessly into your data ecosystem.
Prerequisite
Before you begin pulling data from Amazon S3 using Data Integration, configure an Amazon S3 connection. After creating the connection, you can set up and execute a data extraction River.
Extracting data from Amazon S3 to your chosen target
The following steps walk you through pulling data from Amazon S3 using Data Integration and the considerations involved along the way.
Procedure
- Navigate to the Data Integration Account.
- Create a new River: Click Create River in the top right corner of the page.
- Choose River Type: Select Source to Target as the River type.
- Locate Amazon S3 in the list of available data sources under the Storage section.
- Click Amazon S3.
- Enter a River Name.
- Under Source Connection, select an existing connection or create a new one.
- Select Bucket: Choose the desired bucket name from the list of available buckets.
- Choose an Extract method:
- All: Loads all data from the Source into the Target database, replacing all existing data in the Target regardless of the time period. This method is useful when the Source data is complete and up to date and you want a fresh copy of the source data in the Target database.
- Incremental load by file modified timestamp: Lets you control the date range of the data that is loaded (see the sketch after the notes below).
Note:
- Start Date is mandatory.
- In the case of a modified timestamp load, the next Start Date is calculated by adding 1 millisecond to the last file date. This ensures that the last file is not reloaded in the next run, and it captures any newly added data.
- Data can be retrieved for the date range specified between the start and end dates.
- If you leave the End Date blank, data is pulled up to the time of the River's run.
- Date timezone: UTC.
- Use the Last Days Back For Each Run option to extend the starting date and retrieve data from a specified number of days before the selected start date.
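To make the incremental logic concrete, here is a minimal sketch of an incremental run by file modified timestamp, written directly against boto3. The bucket name, prefix, and start date are placeholders; Data Integration performs the actual filtering, and the sketch only illustrates the LastModified comparison and the plus-one-millisecond advance described above.

```python
from datetime import datetime, timedelta, timezone

import boto3

# Illustrative values only; in the River these come from the UI.
BUCKET = "my_bucket"
PREFIX = "folder1/folder2/"
start_date = datetime(2024, 1, 1, tzinfo=timezone.utc)  # Start Date is mandatory (UTC)

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

new_files = []
for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
    for obj in page.get("Contents", []):
        # LastModified is a timezone-aware UTC datetime.
        if obj["LastModified"] >= start_date:
            new_files.append(obj)

if new_files:
    last_modified = max(obj["LastModified"] for obj in new_files)
    # The next run starts 1 millisecond after the newest file,
    # so the last file is not reloaded and new data is still captured.
    next_start_date = last_modified + timedelta(milliseconds=1)
```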
- Incremental run by template: Templates let you run over folders and load files in the order in which they were created. Choose a template type (Timestamp or Epoch time) and a date range (see the sketch after this list).
- Timestamp Template: Use {} and the appropriate timestamp components to establish the folder format.
- Epoch Time Template: Incorporate {e} (for epoch) or {ee} (for epoch in milliseconds) fields to define the folder structure for running by epoch time. Enter the desired starting value and an optional ending value. This approach applies to the entire library and is not applicable to individual files.
- Start Date is mandatory.
- When the template load includes days and hours, the Start Date for the next run is determined by adding 1 hour to the last file date, as hours serve as the minimal differentiator. This ensures that the last file is not reloaded on subsequent runs, while also preventing the omission of any new data.
- Data can be retrieved for the date range specified between the start and end dates.
- If you leave the End Date blank, data is pulled up to the time of the River's run.
- Date timezone: UTC.
- Interval Chunk Size - Use this field when you intend to retrieve data over extended timeframes.
- When the Load Last Folder checkbox is enabled, the system loads the most recently accessed folder. This option is not recommended when folders are created daily, because an incomplete folder may be loaded.
The start date does not advance if a River run is unsuccessful. To advance the start date even if the run fails, click More Options and select the corresponding checkbox (not recommended).
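The sketch below illustrates one way to picture a template run: a timestamp template is expanded into one folder prefix per interval between the start and end dates, and an epoch template substitutes {e} or {ee}. The tokens {yyyy}, {mm}, {dd}, and {hh} are illustrative assumptions; only {e} and {ee} come from the description above, and Data Integration builds the actual prefixes itself.

```python
from datetime import datetime, timedelta, timezone

# Assumed template tokens for illustration; the real timestamp components
# are chosen when you build the template in the UI.
TIMESTAMP_TEMPLATE = "events/{yyyy}/{mm}/{dd}/{hh}/"
EPOCH_TEMPLATE = "events/{e}/"  # {e} = epoch seconds, {ee} = epoch milliseconds

def timestamp_prefixes(start, end, step=timedelta(hours=1)):
    """Yield one folder prefix per interval. With hours as the smallest component,
    the next run's Start Date is the last folder date plus 1 hour."""
    current = start
    while current <= end:
        yield (TIMESTAMP_TEMPLATE
               .replace("{yyyy}", f"{current.year:04d}")
               .replace("{mm}", f"{current.month:02d}")
               .replace("{dd}", f"{current.day:02d}")
               .replace("{hh}", f"{current.hour:02d}"))
        current += step

def epoch_prefix(moment, milliseconds=False):
    """Build the folder prefix for an epoch-time template."""
    value = int(moment.timestamp() * (1000 if milliseconds else 1))
    return EPOCH_TEMPLATE.replace("{e}", str(value))

start = datetime(2024, 1, 1, 0, tzinfo=timezone.utc)
end = datetime(2024, 1, 1, 3, tzinfo=timezone.utc)
for prefix in timestamp_prefixes(start, end):
    print(prefix)           # events/2024/01/01/00/ ... events/2024/01/01/03/
print(epoch_prefix(start))  # events/1704067200/
```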
- File path prefix and pattern: For the chosen extract method, specify the file path prefix and the file pattern to filter by (see the sketch below).
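As a rough illustration of how a prefix and a file pattern narrow down the files, the sketch below lists objects under a prefix with boto3 and keeps only keys whose file name matches a glob-style pattern. The bucket, prefix, and pattern values are assumptions; the actual matching is applied by Data Integration.

```python
from fnmatch import fnmatch

import boto3

s3 = boto3.client("s3")

def matching_keys(bucket, prefix, pattern):
    """Yield object keys under `prefix` whose file name matches `pattern`."""
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            filename = obj["Key"].rsplit("/", 1)[-1]
            if fnmatch(filename, pattern):
                yield obj["Key"]

# Example with illustrative values: every CSV file under folder1/folder2/
for key in matching_keys("my_bucket", "folder1/folder2/", "*.csv"):
    print(key)
```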
- Select after-pull action:
- Retain the original location.
- Transfer to the archive path: Select the container name and specify the optional archived folder path.
- Delete.
- Choose the Number of files to pull (leave empty for all).
- Select the desired file type: CSV, Excel, JSON, or Other.
For each type, you can load a compressed file; if you do, make sure the Is Compressed checkbox is selected.
- If you use .zip files with a Custom File Zone, Data Integration first ingests the files through its Default File Zone and then forwards them to your designated Custom File Zone.
File types
CSV
CSV lets you select the Delimiter, Quote character, and Newline Delimiter according to your preference.
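As an illustration of what these three settings control, the sketch below parses a small pipe-delimited sample with Python's csv module; the delimiter and quote character mirror the Delimiter and Quote character fields, and the line endings correspond to the Newline Delimiter. The sample data is made up.

```python
import csv
import io

# Illustrative sample: pipe delimiter, double-quote quoting, "\n" newlines.
sample = 'id|name|comment\n1|"Ada"|"says ""hi"""\n2|"Grace"|"ok"\n'

reader = csv.reader(io.StringIO(sample), delimiter="|", quotechar='"')
for row in reader:
    print(row)
# ['id', 'name', 'comment']
# ['1', 'Ada', 'says "hi"']
# ['2', 'Grace', 'ok']
```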
Excel
For Excel files, specify sheets by their position, using a comma separator for multiple sheets; an empty list means all sheets are included. The first row typically serves as the header, but you can choose a different row. If your dataset begins after that row, set "start loading rows from row number" to the row immediately following the header.
You can select the Auto Mapping option at the bottom to visualize your file's structure.
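The pandas sketch below mirrors these Excel options under stated assumptions: the workbook name is made up, sheets are selected by position (pandas counts from 0, whereas the UI may count from 1), and `header=2` stands in for a header that sits on the third row, with data loaded from the row that follows it.

```python
import pandas as pd  # requires an Excel engine such as openpyxl

# Hypothetical workbook; sheet positions and header row are illustrative.
frames = pd.read_excel(
    "report.xlsx",
    sheet_name=[0, 2],  # first and third sheets; None reads every sheet
    header=2,           # column names on the third row, data starts right after
)
for position, frame in frames.items():
    print(position, frame.columns.tolist(), len(frame))
```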
JSON
Only the JSON Lines (jsonlines) format is supported.
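For reference, JSON Lines means one complete JSON object per line, with no enclosing array. A minimal sketch of reading such content in Python, using made-up records:

```python
import io
import json

# Two records in JSON Lines form: one self-contained object per line.
sample = io.StringIO(
    '{"id": 1, "name": "Ada"}\n'
    '{"id": 2, "name": "Grace"}\n'
)

records = [json.loads(line) for line in sample if line.strip()]
print(records)  # [{'id': 1, 'name': 'Ada'}, {'id': 2, 'name': 'Grace'}]
```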
Other
When opting for Other files, Data Integration accepts them in their original state, without any data transformation, and limits your Target selection to Storage.
- Navigate to the Target tab and choose a Target warehouse.
- For each file type, you can choose to include a compressed version by selecting the Is Compressed checkbox.
- When referencing a file within another file, add the correct file extension in the prefix field; otherwise, Data Integration cannot identify the correct file.
- If two files share the same name, one compressed and one not, marking the file as compressed during execution can cause Data Integration to select the uncompressed version, leading to an error. To avoid this, use a prefix that includes the file format or provide the full file name along with its format.
Limitations
- For large datasets, a suffix such as "000", "001", and so on, is added for every 5 gigabytes of data.
- The maximum mapping field size is restricted to 9.2 billion characters.
- Direct loads cannot delete files after they are pulled; the data is transferred directly from the customer's Source to the Target without any intermediate file-deletion step.
- The path structure should be as follows: my_bucket/folder1/folder2/file.csv*
Here is the breakdown:
- Bucket: my_bucket
- PREFIX: folder1/folder2/file.csv
- FILENAME: *
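To make the breakdown concrete, here is a small sketch that splits a full path into those three parts, treating everything before the first wildcard as the PREFIX and the wildcard expression as the FILENAME. The helper is purely illustrative and not part of the product.

```python
def split_path(path):
    """Split 'bucket/...' into (Bucket, PREFIX, FILENAME) as described above."""
    bucket, _, remainder = path.partition("/")
    star = remainder.find("*")
    if star == -1:
        return bucket, remainder, ""
    return bucket, remainder[:star], remainder[star:]

print(split_path("my_bucket/folder1/folder2/file.csv*"))
# ('my_bucket', 'folder1/folder2/file.csv', '*')
```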