SFTP walkthrough

Data Integration provides a streamlined solution for establishing a secure SFTP connection and transferring data to your specified target, enabling you to uncover valuable insights within your datasets.

Create a new data pipeline within Data Integration, choose extraction techniques, and manage your data storage settings as needed.

Use Data Integration features to retrieve data from an SFTP server and integrate it into your data ecosystem securely and reliably.

Prerequisite

Before you can start extracting data from an SFTP server using Data Integration, ensure that you have a configured SFTP server connection.
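
If you want to confirm that the server is reachable before setting up the connection in the UI, a quick check from your own machine can help. The following is a minimal sketch using the third-party paramiko library; the host, port, and credentials are placeholders, not values from Data Integration.

```python
# Minimal connectivity check against the SFTP server you plan to extract from.
# Host, port, and credentials below are placeholders.
import paramiko

HOST = "sftp.example.com"        # placeholder host
PORT = 22
USERNAME = "integration_user"    # placeholder user
PASSWORD = "********"            # or authenticate with a private key instead

transport = paramiko.Transport((HOST, PORT))
transport.connect(username=USERNAME, password=PASSWORD)
sftp = paramiko.SFTPClient.from_transport(transport)

# List the directories you expect the River to read from.
print(sftp.listdir("."))

sftp.close()
transport.close()
```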

Extracting data from SFTP to your chosen target

Procedure

  1. Navigate to the Data Integration Account.

  2. Create a New River: Choose Create River from the top-left corner of the page.

  3. Choose River Type: Select Source to Target as the River type.

  4. Locate SFTP Server in the list of available data sources under the Storage & Files section.

  5. Click SFTP Server.

  6. Provide a name for your River.

  7. Under Source Connection, select an existing connection or create a new one.

  8. Select Folder: Choose the desired folder from the list of available directories on the SFTP server.

  9. Choose Extraction Method.

  • All - This method retrieves all data from the source and replaces all existing data in the target, regardless of time periods. It is useful when the source data is complete and up-to-date, and you want to create a fresh copy of the source data in the target.
  • Incremental load by file modified timestamp - This method lets you control the date range of your data (a conceptual sketch follows this list).

    Note:
      • Start Date is mandatory.
      • Data can be retrieved for the date range specified between the start and end dates.
      • If you leave the end date blank, data is pulled up to the current time of the River's run.
      • Dates are in UTC time.

  • Incremental run by template - Templates let you run over folders and load files in the order in which they were created. Choose a template type (Timestamp or Epoch time) and a data date range.

    • Timestamp Template: Use {} with the appropriate timestamp components to define the folder format.

    • Epoch Time Template: Use {e} (for epoch) or {ee} (for epoch in milliseconds) fields to define the folder structure for running by epoch time. Enter the desired starting value and an optional ending value. This approach applies to the entire directory and not to individual files.

    Note:
      • Start Date is mandatory.
      • Data can be retrieved for the date range specified between the start and end dates.
      • Start Date is automatically incremented with each scheduled run.
      • If you leave the end date blank, data is pulled up to the current time of the River's run.
      • Dates are in UTC time.
      • Use the Interval Chunk Size field when you intend to retrieve data over extended time frames.
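
Conceptually, the incremental methods above decide which files on the server fall inside the run's date window. The sketch below illustrates the incremental load by file modified timestamp logic, assuming a paramiko-style SFTP client; the folder and dates are placeholders, and the actual selection is performed by Data Integration.

```python
# Sketch: keep only files whose modified timestamp falls inside a UTC date range.
from datetime import datetime, timezone

FOLDER = "/exports/ads"                             # placeholder folder
START = datetime(2024, 1, 1, tzinfo=timezone.utc)   # Start Date is mandatory
END = datetime.now(timezone.utc)                    # blank end date -> time of the run

def files_in_range(sftp, folder, start, end):
    """Return file names whose modification time is within [start, end]."""
    selected = []
    for attr in sftp.listdir_attr(folder):
        modified = datetime.fromtimestamp(attr.st_mtime, tz=timezone.utc)
        if start <= modified <= end:
            selected.append(attr.filename)
    return selected
```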
  10. Define the file path prefix and file pattern for filtering. A file path prefix, such as {social_dims_source_path}/snap/ads/, identifies a group of files for retrieval: supplying it directs the system to gather all files within that directory structure that match the specified filtering criteria.

If you append *.json as the file pattern, the system scans the designated directory structure and retrieves only the files that match that pattern.
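
As an illustration of how the prefix and pattern work together, here is a small sketch; the file names are invented, and the matching inside Data Integration may differ in detail.

```python
# Sketch: a file path prefix narrows the directory, and the file pattern
# filters by name within it. File names below are made up for illustration.
from fnmatch import fnmatch

prefix = "{social_dims_source_path}/snap/ads/"   # prefix from the example above
pattern = "*.json"                               # file pattern

candidate_files = [
    "{social_dims_source_path}/snap/ads/2024-01-01.json",
    "{social_dims_source_path}/snap/ads/2024-01-01.csv",
    "{social_dims_source_path}/snap/organic/2024-01-01.json",
]

matches = [
    path for path in candidate_files
    if path.startswith(prefix) and fnmatch(path.rsplit("/", 1)[-1], pattern)
]
print(matches)   # only the .json file under .../snap/ads/ is selected
```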

  11. Select After-Pull Action:

    • Retain the original location.
    • Transfer to the archive path: Select the container name and specify the optional archived folder path.
    • Delete.
Note:

This applies to both successful and failed Rivers. For example, even if the upload to the Target fails, the file is still moved from the Source to the specified archive path.
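
For reference, the three after-pull actions amount to leaving, moving, or deleting the pulled files on the server. A rough sketch, assuming a paramiko-style SFTP client and placeholder paths; Data Integration performs the equivalent steps itself after the run.

```python
# Sketch of the three after-pull actions on the source server.
import posixpath

def after_pull(sftp, file_path, action, archive_path=None):
    if action == "retain":
        return                                    # leave the file where it is
    if action == "archive":
        target = posixpath.join(archive_path, posixpath.basename(file_path))
        sftp.rename(file_path, target)            # move the file to the archive path
    elif action == "delete":
        sftp.remove(file_path)                    # delete the file from the source
```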

  12. Choose the number of files to pull (leave empty for all).
  13. Select the desired file type: CSV, Excel, JSON, or Other.

File types

CSV

CSV lets you select the Delimiter and Quote character according to your preference.
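
To see what these two options control, consider a semicolon-delimited file with single-quoted fields. The values below are made up; this is only an illustration of delimiter and quote handling, not the product's parser.

```python
# Illustration: parsing with a custom delimiter (;) and quote character (').
import csv
import io

raw = "id;name\n1;'Smith; John'\n2;'Doe; Jane'\n"

reader = csv.reader(io.StringIO(raw), delimiter=";", quotechar="'")
for row in reader:
    print(row)   # semicolons inside quoted fields stay within the field
```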

Excel

In Excel, you can specify sheets by their position, using a comma separator for multiple sheets; an empty list means all sheets are included. The first row typically serves as the header, but you can choose a different row. If your dataset begins after that header row, set the start loading rows from row number to the row that follows the header.
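
As a rough analogy for these options (not the product's actual engine), the pandas snippet below reads sheets by position and treats a later row as the header; the workbook name and layout are placeholders.

```python
# Analogy for the Excel options: sheets by position, header on a chosen row.
import pandas as pd

frames = pd.read_excel(
    "report.xlsx",        # placeholder workbook
    sheet_name=[0, 2],    # sheet positions (comma-separated in the UI)
    header=3,             # header sits on the fourth row; data loads after it
)

# With a list of positions, pandas returns a dict of DataFrames keyed by position.
for position, frame in frames.items():
    print(position, frame.shape)
```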

When you scroll to the bottom of this section, you can perform an Auto Mapping that lets you visualize your file's structure.

JSON

Only the JSON Lines (jsonlines) format is supported.
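
JSON Lines means one JSON object per line rather than a single JSON array. A small made-up sample and how it is read line by line:

```python
# JSON Lines: each line is an independent JSON object.
import json
import io

raw = '{"id": 1, "name": "alpha"}\n{"id": 2, "name": "beta"}\n'

records = [json.loads(line) for line in io.StringIO(raw) if line.strip()]
print(records)   # [{'id': 1, 'name': 'alpha'}, {'id': 2, 'name': 'beta'}]
```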

Other

When opting for Other files, Data Integration accepts them in their original state, without any data transformation, and limits your Target selection to Storage.

For each file type, you can choose to include a compressed version by selecting the Is Compressed checkbox.

  • When referencing a file within another file, be sure to add the correct file extension in the prefix field; otherwise, Data Integration cannot identify the correct file.
  • If two files share the same name, one compressed and one not, marking a file as compressed during execution causes Data Integration to select the uncompressed version, leading to an error.
  • To avoid this, use a prefix that includes the file format, or provide the full file name along with its format (illustrated below).
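
The naming pitfall above is easy to see with two invented file names that differ only in their extension:

```python
# Sketch: a prefix without the extension matches both files, while the full
# name (including the format) pins down the compressed one unambiguously.
files_on_server = ["daily_report.csv", "daily_report.csv.gz"]

ambiguous_prefix = "daily_report"        # matches both the compressed and uncompressed file
explicit_name = "daily_report.csv.gz"    # identifies the compressed file exactly

print([f for f in files_on_server if f.startswith(ambiguous_prefix)])
print([f for f in files_on_server if f == explicit_name])
```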
  14. Navigate to the Target tab and choose a Target warehouse.