Partitioning and clustering in BigQuery

Partitioning and clustering are BigQuery features that improve query performance and reduce costs by limiting the amount of data scanned during queries. The BigQuery connector supports both features to optimize loading operations, particularly when you use the Merge loading method.

Partitioning

A partitioned table is divided into smaller segments called partitions. Partitioning large tables improves performance and reduces costs because queries scan only the relevant partitions instead of the entire table.

note

You can partition a table using only one column.

Clustering

Clustering organizes data within a table based on the values of specific columns. BigQuery sorts the data on disk according to these columns:

You can cluster a table using up to four columns.
Column order matters: BigQuery sorts data by the first clustering column, then the second, and so on.
Strategic column ordering improves performance for queries that use filters or aggregations.

Partition granularity

When partitioning on a column, you must select a granularity level to determine how BigQuery groups the data. Each partition represents a discrete range of values, for example, one day per partition.

The connector supports the following granularities:

DATE columns: YEAR, MONTH, DAY
TIMESTAMP columns: YEAR, MONTH, DAY, HOUR

Loading methods

Configure loading behavior in the connector under Target > Loading Mode. Merge is the default and recommended method for optimized loading.

Method	Description	Recommended for
Merge (Default)	Optimized incremental loading using the BigQuery `MERGE` command. Updates only the columns provided by the staging area to enable Change Data Capture (CDC).	Partitioned or clustered tables; large datasets.
Switch-Merge	Matches the previous backend behavior (formerly Upsert-Merge). Rebuilds the table by selecting non-matching target rows and adding new data.	Tables without partitions or clustering.
Append	Adds new records to the end of the table without checking for existing keys.	Immutable logs or event streams.
Overwrite	Replaces the entire target table with the staging data.	Full refreshes of smaller tables.

Defining Match Keys

You must define at least one Match Key to use Merge or Switch-Merge methods. If you do not define match keys, the system blocks river activation to prevent data correctness issues.

Enabling partitioning on existing tables

BigQuery cannot alter a non-partitioned table into a partitioned table. If you enable partitioning for a table that already exists in BigQuery without partitions, the following rules apply:

Table recreation: The target table is recreated if required partitions are missing or if the existing partition definition differs from the selected configuration.
Initial run: The initial run requires a full table scan during the table creation process.
UI confirmation : When you select a partition column for the first time, a warning appears. You must select Confirm to proceed.

Match key behavior with partition columns

Selecting a partition column does not automatically make it a match key. The partition column can be designated as a match key to improve performance and reduce processing costs; however:

If the partition column does not form a unique key when combined with other match keys, duplicate records can be written.
To prevent duplicates:
- Leave the partition column unselected as a match key, or
- Use the partition column as the only match key.

Partition-aware MERGE logic

When partitioning (DATE or TIMESTAMP) or clustering is enabled, the connector optimizes the BigQuery MERGE command.

Including the partition column in the ON condition, BigQuery can perform partition pruning, which significantly reduces the amount of data scanned and lowers query costs.

Example MERGE statement

MERGE target_table AS T
USING staging_table AS S
ON T.id = S.id
AND T.partition_date = S.partition_date   -- Enables partition pruning
WHEN MATCHED THEN UPDATE SET
    T.column_a = S.column_a               -- Partial-column update
WHEN NOT MATCHED THEN INSERT (id, partition_date, column_a)
VALUES (S.id, S.partition_date, S.column_a);