Partitioning and clustering in BigQuery
A _partitioned_ table is a table divided into sections by partitions. Dividing a large table into smaller partitions enables improved performance and reduced costs by controlling the amount of data retrieved from a query.
Clustering sorts the data based on one or more columns in the table. The order of the clustered columns determines the sort order of the data. Clustering can enhance the performance of specific query types, including those that utilize filter clauses and those that involve data aggregation.
You can select up to 4 fields to cluster from and only a single field to partition upon. When you partition on a field, you can select the partition granularity, which includes YEAR, MONTH, and DAY for columns of date type, and the additional HOUR for timestamp columns. The granularity determines the boundaries of the partitions; for example, if you choose day granularity, each day is a separate partition in BigQuery.
When you cluster a table using multiple columns, the order of the specified columns is crucial. The order of the selected columns determines the order in which the data is sorted.
For Source to target rivers, selecting the SQL dialect and partition settings is done in the Target step under Advanced options.
In the mapping section, you can specify the partitioned and clustered fields and their order.
Multi-table rivers have a general setting for the SQL dialect in the target section, and each table has its own setting similar to Data Source to Target rivers, where you can decide the dialect for that table and whether to set a partition field. In each table mapping, you can select the cluster and partition accordingly.
Switching between legacy SQL and standard SQL in Data Integration
There are differences between legacy SQL and Standard SQL that are taken into account when switching between dialects. For instance, not all functions and features are supported, and there are syntax differences.
For example, in Standard SQL, we query tables like this:
bigquery-public-data.samples.shakespeare
While in legacy SQL, it is done in this manner:
[bigquery-public-data:samples.shakespeare]
For more information on differences between legacy SQL and StandardSQL, refer to BIgQuery documentation.
Ensure your river is aligned when you make the switch.
With Standard SQL, the system flattens columns of RECORD type, but for legacy SQL, it does not. When switching from legacy SQL to Standard SQL, if you have columns of the RECORD type, expect the final table in Standard SQL mode to have multiple new flattened columns, rather than a single one.
If your target table already exists, it is recommended NOT to switch back to legacy SQL mode.
Append or merge functions may not be properly supported, which could cause data loss. This is due to the differences in how records are handled between the two SQL dialects.
Additionally, if the table has clusters or partitions, the river fails because it is not possible to query clustered or partitioned tables using legacy SQL syntax.
Logic River handles records with StandardSQL in the following manner:
- If the logic step is not mapped and the flatten results checkbox is not selected, the records will be loaded as is.
- If the logic step is not mapped and the flatten results checkbox is selected - you will be prompted to map the step.
- If the logic step is mapped and the flatten results checkbox is selected, the records will be flattened.
- If the logic step is mapped and the flatten results checkbox is not selected, you will be prompted to either remove the mapping, remove the column from the mapping, or turn off the flatten results checkbox
| Logic step status\flatten results checkbox status | flatten results checkbox is not checked | flatten results checkbox is checked |
|---|---|---|
| Logic step not mapped | you will be prompted to map the step | |
| Logic step mapped | You will be prompted to either remove the mapping, remove the column from the mapping, or uncheck the flatten results checkbox | the records will be flattened |
Creating or altering partitions and/or clusters on an existing table
The process of creating or altering partitions on an existing table requires the table to be dropped and recreated with the desired settings.
Procedure
- Create a new temporary table with the existing data of the target table. This is when the clustering and partitioning are specified
PL/SQL
CREATE TABLE < dataset_name > . < tmp_table_name >
PARTITION BY TIMESTAMP_TRUNC( < timestamp_column > , `{ DAY | HOUR | MONTH | YEAR }`)
CLUSTER BY < column1 > , < column2 > AS
SELECT *
FROM < dataset_name > . < table_name >
Specific example:
CREATE TABLE `dwh.temp_users`
PARTITION BY TIMESTAMP_TRUNC(inserted_at, DAY)
CLUSTER BY user_id AS
SELECT *
FROM `dwh.users`
PL/SQL
- Now that the data is safely stored in the new temporary table, you can drop the target table. Do note that until the process is completed, you will not have access to the data in the target table.
PL/SQL
DROP TABLE < dataset_name >.< table_name >
Specific example:
DROP TABLE dwh.users
-
Use the BigQuery console UI to copy the temporary table to the target table. This duplicates the temp table and renames it. Then, the temp table will be dropped.
-
Select the table you want to copy from the Resources pane.
-
Below the Query editor, click Copy table
-
In the Copy table dialog, under Destination:
-
For Dataset name, choose the < dataset_name > or in our case "dwh"
-
For Table name, choose the < table_name > or in our case "users".
-
Click on copy table.
-
-
Drop the temporary table.
PL/SQL
DROP TABLE < dataset_name >.< tmp_table_name >
Specific example:
DROP TABLE dwh.temp_users