Partitioning and clustering in BigQuery
Partitioning and clustering are BigQuery features that improve query performance and reduce costs by limiting the amount of data scanned during queries. The BigQuery connector supports both features to optimize loading operations, particularly when you use the Merge loading method.
Partitioning
A partitioned table is divided into smaller segments called partitions. Partitioning large tables improves performance and reduces costs because queries scan only the relevant partitions instead of the entire table.
You can partition a table using only one column.
Clustering
Clustering organizes data within a table based on the values of specific columns. BigQuery sorts the data on disk according to these columns:
- You can cluster a table using up to four columns.
- Column order matters: BigQuery sorts data by the first clustering column, then the second, and so on.
- Strategic column ordering improves performance for queries that use filters or aggregations.
Partition granularity
When partitioning on a column, you must select a granularity level to determine how BigQuery groups the data. Each partition represents a discrete range of values, for example, one day per partition.
The connector supports the following granularities:
- DATE columns:
YEAR,MONTH,DAY - TIMESTAMP columns:
YEAR,MONTH,DAY,HOUR
Loading methods
Configure loading behavior in the connector under Target > Loading Mode. Merge is the default and recommended method for optimized loading.
| Method | Description | Recommended for |
|---|---|---|
| Merge (Default) | Optimized incremental loading using the BigQuery MERGE command. Updates only the columns provided by the staging area to enable Change Data Capture (CDC). | Partitioned or clustered tables; large datasets. |
| Switch-Merge | Matches the previous backend behavior (formerly Upsert-Merge). Rebuilds the table by selecting non-matching target rows and adding new data. | Tables without partitions or clustering. |
| Append | Adds new records to the end of the table without checking for existing keys. | Immutable logs or event streams. |
| Overwrite | Replaces the entire target table with the staging data. | Full refreshes of smaller tables. |
Defining Match Keys
You must define at least one Match Key to use Merge or Switch-Merge methods. If you do not define match keys, the system blocks river activation to prevent data correctness issues.
Enabling partitioning on existing tables
BigQuery cannot alter a non-partitioned table into a partitioned table. If you enable partitioning for a table that already exists in BigQuery without partitions, the following rules apply:
- Table recreation: The target table is recreated if required partitions are missing or if the existing partition definition differs from the selected configuration.
- Initial run: The initial run requires a full table scan during the table creation process.
- UI confirmation : When you select a partition column for the first time, a warning appears. You must select Confirm to proceed.
Match key behavior with partition columns
Selecting a partition column does not automatically make it a match key. The partition column can be designated as a match key to improve performance and reduce processing costs; however:
- If the partition column does not form a unique key when combined with other match keys, duplicate records can be written.
- To prevent duplicates:
- Leave the partition column unselected as a match key, or
- Use the partition column as the only match key.
Partition-aware MERGE logic
When partitioning (DATE or TIMESTAMP) or clustering is enabled, the connector optimizes the BigQuery MERGE command.
Including the partition column in the ON condition, BigQuery can perform partition pruning, which significantly reduces the amount of data scanned and lowers query costs.
Example MERGE statement
MERGE target_table AS T
USING staging_table AS S
ON T.id = S.id
AND T.partition_date = S.partition_date -- Enables partition pruning
WHEN MATCHED THEN UPDATE SET
T.column_a = S.column_a -- Partial-column update
WHEN NOT MATCHED THEN INSERT (id, partition_date, column_a)
VALUES (S.id, S.partition_date, S.column_a);