
Using Python DataFrames

info

Python is currently available by default for Enterprise, Pro Plus, and Professional customers only. If you have a Trial or Starter plan and want to use it, contact us.

A DataFrame is a data structure that organizes data into a two-dimensional table of rows and columns. DataFrames are widely used in modern data analytics because they provide a dynamic and intuitive way to store and interact with data. Data Integration's DataFrames solution offers a user-friendly way to manipulate data that takes advantage of machine learning and other data techniques implemented in Python.
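
For illustration, here is a minimal pandas sketch of the row-and-column structure described above; the column names and values are examples only:

import pandas as pd

# Each dictionary key becomes a column; each list entry becomes a row
df = pd.DataFrame({
    'name': ['cobra', 'viper', 'sidewinder'],
    'max_speed': [1, 4, 7],
})
print(df)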

Use cases

You can use DataFrames in a variety of cases:

  • Use a custom landing zone.
  • Use DataFrames as a Source.
  • Use DataFrames as a Target.
  • Manipulate the DataFrame using Python.
  • Use pandas to replace an existing DataFrame.
  • Set up a DataFrame using data from another river.

Working with DataFrames

Configure DataFrames

Select DataFrames from the menu in the top right-hand corner and follow the instructions:

note

The +DataFrame button has been renamed +Add DataFrame and relocated to the right-hand corner.

Custom landing zone

Data Integration lets you use a custom FileZone (Landing Zone) to manage your data in your S3 bucket.

  • A bucket is a container that holds objects.
  • An object is a file and any metadata that describes that file.
note

If you do not use a Custom Landing Zone, a Data Integration Bucket is used by default, which stores the DataFrame until it is deleted using the delete() Python function.
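
Below is a minimal sketch of removing a stored DataFrame with the delete() function mentioned above; it assumes the function is called on the imported DataFrame object:

from rivery_dataframes import EXISTING_DATAFRAME

# Remove the stored DataFrame from the default Data Integration Bucket
# (assumption: delete() is invoked as a method of the imported object)
EXISTING_DATAFRAME.delete()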

Prerequisites

Using a custom landing zone

To use a custom landing zone, point Data Integration at your own S3 bucket and make sure the required bucket policy is in place (see Required permissions below).

Bucket stored objects

Your stored objects are saved as Parquet files. Parquet is an open-source file format for flat columnar data storage that works well with large amounts of complex data.


note

Include the proper policy so your cloud storage can work with DataFrames (see Required permissions below).

DataFrames as a source

Procedure

  1. Navigate to the Data Integration console.
  2. Click the River tab from the left-hand menu.
  3. Click ADD River and choose Logic River, or select an existing river from the list.
  4. Select the SQL / DB Transformations Step type.
  5. Choose your Connection or create one.
  6. Create new DataFrames or select pre-configured DataFrames from other rivers. Use these DataFrames as a Source within Logic River.
note

When using DataFrames as a Source, only a table can be selected as the Target.

DataFrames as a target

Select the SQL / DB Transformations Step type and your Connection first, then select SQL Query as the Source. You can then either create new DataFrames or use pre-configured DataFrames from other rivers as the Target in Logic River.

  • You can only select a SQL Query as a source when targeting a DataFrame.
  • Each time you run a River, the target DataFrame is overwritten with new data pulled via the SQL query.

Using Python to manipulate a DataFrame

You can manipulate your DataFrame using the code samples below or by applying pandas functions.

DataFrames menu

From the DataFrames menu you can edit any new or pre-configured DataFrame, download it to your local computer, clear its values, or delete it.

DataFrames and environments

DataFrames are associated with the environment rather than a specific river. Any updates or deletions of a DataFrame in a particular River are also reflected in the Environment of which the River is part.

Deploying DataFrames in different environments

Data Integration lets you migrate pre-configured DataFrames between Environments. Go to the Environment View, pick the Deployment tab, and then follow the instructions in the Deployment section.

note
  • Ensure that the DataFrame names in the Source and Target environments differ.
  • Encrypted variables are not copied from one environment to another when deploying DataFrames between Environments.
  • Two settings were added to DataFrames in the Deployment procedure's Settings tab.

Consider a common use case for Python with DataFrames in Data Integration: pulling data from a database into a target DataFrame using a SQL query, manipulating the DataFrame in Python, and then pushing the result back into the database.

The following use case is performed in 3 Logic Steps:

Using a SQL query, pull data from your database into a DataFrame:

  1. Navigate to the Data Integration console.
  2. Click the River tab from the left-hand menu.
  3. Click ADD River and choose Logic River, or select an existing river from the list.
  4. Select SQL / DB Transformations as the Logic Step type.
  5. In the Database Type, choose your database.
  6. Select Connection Name.
  7. Enter your SQL query in the Source field.
  8. In the Target field, choose DataFrame and specify your DataFrame.

Manipulate the DataFrame using Python:

  1. Navigate to the Data Integration console.
  2. Click the River tab from the left-hand menu.
  3. Click ADD River and choose Logic River, or select an existing river from the list.
  4. Select Python as the Logic Step type.
  5. Check if the necessary Python packages are installed.
  6. Choose a resource based on the data you intend to consume.
  7. Import your Python script into the Script field to manipulate the DataFrame, as in the sketch below.
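
The following is a minimal sketch of such a script. It assumes the previous step loaded the query results into a DataFrame named SOURCE_DF (a hypothetical name) and that the imported object behaves like a pandas object, as described in the Code samples section:

from rivery_dataframes import SOURCE_DF  # hypothetical DataFrame name

# Clean the data: drop rows with missing values and lowercase the column names
clean_df = SOURCE_DF.dropna().rename(lambda x: x.lower(), axis='columns')

# Persist the result so the next Logic Step can push it to the database
SOURCE_DF.save(clean_df)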

Push the DataFrame's data to your database:

  1. Navigate to the Data Integration console.
  2. Click the River tab from the left-hand menu.
  3. Click ADD River and choose Logic River, or select an existing river from the list.
  4. Select Python as the Logic Step type.
  5. Select the required DataFrame in the Source field.
  6. In the Target field, choose Table and enter the load information.
  7. Click Run.

Code samples

note

Once you import a DataFrame, it serves as a Pandas object, providing access to all the features the Pandas library offers.

Basic import

from rivery_dataframes import DATAFRAME_NAME
print(DATAFRAME_NAME)

Save

from rivery_dataframes import EXISTING_DATAFRAME
# Pass the data you want to store, e.g. a pandas DataFrame
EXISTING_DATAFRAME.save(your_input)

Using pandas to create a new DataFrame

import pandas as pd
from rivery_dataframes import TempDF

# Create an empty pandas DataFrame and store it under the TempDF DataFrame
new_dataframe = pd.DataFrame()
TempDF.save(new_dataframe)

To learn more about the pandas DataFrame, refer to the pandas documentation.

Empty

from rivery_dataframes import EXISTING_DATAFRAME
# Clear the stored values of the DataFrame
EXISTING_DATAFRAME.empty()

Using chunks

Chunked imports are useful for large amounts of data.

from rivery_dataframes import DATAFRAME_NAME

# Read the DataFrame in chunks instead of loading everything at once
DATAFRAME_NAME.chuncked = NUMBER_OF_CHUNCKS
for df in DATAFRAME_NAME:
    print(df)

Extract and save .csv file

import io

import requests
import pandas as pd
from rivery_dataframes import NEW_DATAFRAME

# CSV_URL should point to the CSV file you want to import
print(f"Download csv from {CSV_URL}")
response = requests.get(CSV_URL)
print("Convert csv to dataframe")
new_dataframe = pd.read_csv(io.StringIO(response.text))
NEW_DATAFRAME.save(new_dataframe)

Fetch a specific column from a dataframe

from rivery_dataframes import EXISTING_DATAFRAME

# Given the DataFrame structure:
#             max_speed  shield
# cobra               1       2
# viper               4       5
# sidewinder          7       8

# To fetch the 'max_speed' column in its entirety, use this command:
EXISTING_DATAFRAME.loc[:, 'max_speed']

DataFrame usage notes

  • In the Data Integration Python implementation, DataFrame usage with [ ] is not supported.

    Incorrect usage:

    A_DATAFRAME['key']

    Correct usage: DataFrames should only be used with functions, for example:

    A_DATAFRAME.function()

  • If you encounter any script-related errors, follow these steps:
    • Print the DataFrame types using:
      print(A_DATAFRAME.types)
    • Cast columns to the expected type, for example:
      A_DATAFRAME.astype({'col1': 'int32'})
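
The snippet below ties these notes together, reusing the 'max_speed' column from the earlier example; the type cast is illustrative only:

from rivery_dataframes import A_DATAFRAME

# Select a column with .loc instead of bracket indexing on the DataFrame object
speeds = A_DATAFRAME.loc[:, 'max_speed']

# Inspect the column types and cast where a script error indicates a mismatch
print(A_DATAFRAME.types)
fixed = A_DATAFRAME.astype({'max_speed': 'int32'})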

Required permissions

Custom file zone

For DataFrames to work correctly with your cloud storage, make sure to include the following policy in the AWS IAM console:

  • For using Snowflake, copy this permission policy:

    {
      "Sid": "RiveryManageFZObjects",
      "Effect": "Allow",
      "Action": [
        "s3:DeleteObject",
        "s3:ListObjectsV2",
        "s3:ReplicateObject",
        "s3:PutObject",
        "s3:GetObjectAcl",
        "s3:GetObject",
        "s3:GetBucketLocation",
        "s3:ListMultipartUploadParts"
      ],
      "Resource": "arn:aws:s3:::<RiveryFileZoneBucket>/*"
    }
  • For using Redshift, copy this permission policy:

    {
      "Sid": "RiveryManageFZObjects",
      "Effect": "Allow",
      "Action": [
        "s3:ListObjectsV2",
        "s3:ReplicateObject",
        "s3:PutObject",
        "s3:GetObjectAcl",
        "s3:GetObject",
        "s3:DeleteObject",
        "s3:PutObjectVersionAcl",
        "s3:PutObjectAcl",
        "s3:ListMultipartUploadParts"
      ],
      "Resource": "arn:aws:s3:::<RiveryFileZoneBucket>/*"
    }

Once you have copied the relevant policy, go to the AWS IAM console. Click Roles in the menu and select the role you created for Data Integration. Then scroll down to the Permissions section and paste the policy.

Using Snowflake as a target

To ensure the information from DataFrames transfers successfully to Snowflake (when using Snowflake as a Target), follow these steps:

  1. Copy the following query:

     GRANT USAGE ON SCHEMA public TO <Rivery_User>;
     GRANT SELECT ON ALL TABLES IN SCHEMA public TO <Rivery_User>;

  2. Enter the Snowflake console.
  3. Select the checkbox next to "All Queries".
  4. Paste the query into the console's main page.


note

You can choose the database and schema for Data Integration to use in managing the generated DataFrame. If not selected, Data Integration will default to using the first database on the list and the 'PUBLIC' schema.

Ensure that the permissions mentioned are configured for the chosen schema/database.

Limitations

  • When naming a DataFrame, avoid using the following characters:
    • Space: " "
    • Slash: "/" and "\"
    • Period: "."
    • Any type of brackets: "(", "[", and "{"
    • Hyphen: "-"
  • When designating column names within a DataFrame, only characters matching the following pattern are supported: ^([A-Za-z])([A-Za-z0-9_])* Use Regex101 for testing and troubleshooting expression patterns.
  • Snowflake does not allow header columns that contain the character ".". If your script includes "." in the header columns, use the following command to fix the issue:
NEW_DF = DATAFRAME_NAME.rename(lambda x: x.replace('.', '_'), axis='columns')
DATAFRAME_NAME.save(NEW_DF)
  • If you are using Redshift, create a "Custom Landing Zone" and ensure it is in the same region as your Redshift cluster.
  • Python reserved words cannot be used in Data Integration, as they cause syntax errors and conflicts during River runs.


Temporary limitations

  • Kits/Environments do not support DataFrames.
  • DataFrames are supported only for Amazon Redshift and Snowflake connections.
  • DataFrames on Redshift connections are supported only for the following locations:
    • For the Data Integration US console: a Redshift cluster located in us-east-1
    • For the Data Integration EU console: a Redshift cluster located in eu-west-1