Configuring PostgreSQL CDC replication slots and publications
Replication is a key feature in database management systems that synchronizes data between multiple database instances. Database teams use replication for high availability, disaster recovery, and data distribution. In some replication setups, publication slots allow you to control the flow of changes between the primary and replica databases.
- Replication slot and publication names in PostgreSQL are case-sensitive. To access data from a replication slot, you must have superuser privileges. On Amazon RDS, this requires full superuser access. On Google Cloud Platform, you must have the
cloudsqlsuperuserrole to retrieve the necessary REPLICATION permission. - To create a publication, ensure you have superuser (or
cloudsqlsuperuser) privileges, which grants you the required CREATE permission.
Replication slots configuration
Replication slots enable data replication in PostgreSQL by reserving space on the server to store replicated data before it is consumed. To configure replication slots, you must grant the Data Integration user the REPLICATION role.
If the server does not have a REPLICATION role, you can create one using the following SQL commands:
CREATE ROLE `<replication_role_name>` REPLICATION LOGIN;
Replace <replication_role_name> with a role name of your choice.
Once the role has been created, run the following SQL command to create a new login for Data Integration Log-Based extraction if it does not already exist:
CREATE USER data_integration WITH `<replication_role_name>`;
If the Data Integration user already exists, run the following SQL command to assign the new role:
ALTER USER data_integration WITH `<replication_role_name>`;
Configure replica identity for selected tables
Each table's REPLICA IDENTITY controls what previous-row data PostgreSQL writes to the WAL for UPDATE and DELETE events. This setting determines whether Data Integration can replicate the table correctly.
| Setting | Behavior |
|---|---|
DEFAULT | Logs primary key columns only. Default for every table. |
INDEX | Logs columns of a specified unique index. |
FULL | Logs every column of the previous row. |
NOTHING | Logs nothing. UPDATE and DELETE events will not replicate. |
For more information on DELETE events, refer to Databases overview.
When FULL is required
Data Integration requires REPLICA IDENTITY FULL on a selected table when either condition is true:
- The table has TOAST storage. PostgreSQL stores oversized values (large
TEXT,JSONB,BYTEA) out of line. When anUPDATEdoes not modify a TOASTed column, PostgreSQL omits its value from the WAL. WithoutFULL, that value cannot be recovered and the column is written asNULLat the destination, overwriting real data. - The table has no primary key or unique index. Without an identifying key in the WAL,
UPDATEandDELETEevents cannot be matched to a target row.
Tables with a primary key and no TOAST-eligible columns can remain on DEFAULT. Set FULL only where required.
Identify which tables need FULL
Run the following query to check a table's TOAST storage and replica identity setting:
SELECT relname,
reltoastrelid, -- non-zero means TOAST is linked
relreplident -- 'd'=Default, 'f'=Full, 'i'=Index, 'n'=Nothing
FROM pg_class
WHERE relname = '<table_name>';
A selected table must have relreplident = 'f' if reltoastrelid is non-zero or it has no primary key or unique index.
Apply the setting
ALTER TABLE <schema>.<table> REPLICA IDENTITY FULL;
What happens if you skip it
When a Data Flow is activated, Data Integration validates replica identity on every selected table. Activation fails for any qualifying table not set to FULL, with the offending table named in the error.
This prevents silent data loss. Without FULL, an UPDATE to a non-TOAST column on a row with a TOASTed column produces a WAL event missing the TOAST value. The destination row is then written with NULL in place of real data, with no error to indicate the loss.
Performance considerations
FULL increases WAL volume because the entire previous row is logged on every UPDATE and DELETE. On high-throughput tables, this can increase WAL size and replication lag. Apply FULL only to tables where it is technically required, not globally.
Publication slots configuration
Use the Create Replication Slot command in PostgreSQL to establish a slot dedicated to publishing data. This command reserves space in the primary database for the publication slot, enabling you to publish data and keep other databases in sync.
The basic syntax to create a publication slot is as follows:
SELECT pg_create_logical_replication_slot('slot_name', 'plugin');
Replace 'slot_name' with a unique name for the publication slot, and 'plugin' with the name of the logical decoding plugin to use for replication.
Custom configurations
Once you have granted the appropriate permissions, Data Integration automatically generates replication slots and publications in a PostgreSQL database as part of its default configuration. You can also create dedicated replication slots or publications for specific Data Flows.
Custom replication slots
With Data Integration, you can set a custom replication slot name. This name identifies the PostgreSQL logical decoding slot that streams changes from a particular plug-in for a specific database or schema.
If you do not provide a replication slot, a default one will be automatically created: Data_Integration_<account id>_<{conKeyRefs.DataFlow} id>_source.
Run the following command to view the replication slots and their associated data:
SELECT * FROM pg_replication_slots;
To create a replication slot:
- Access the PostgreSQL client by logging in with superuser credentials.
- Run the following command:
SELECT pg_create_logical_replication_slot('data_integration_pgoutput', 'pgoutput'); - Assign the replication slot name to the Data Flow.
Custom publication slots
You can create a publication for specific tables so that Data Integration replicates only changes from those tables. You must have CREATE privileges to perform this action.
If a specific name is not provided, the default name for the following publication will be created: data_integration_<account id>_<{conKeyRefs.DataFlow} id>_source.
In such cases, Data Integration automatically executes the following command during the initial run and creates a publication for ALL TABLES:
CREATE PUBLICATION data_integration_<account id>_<{conKeyRefs.DataFlow} id>_source FOR ALL TABLES;
To create a publication for specific tables:
- Log in to a PostgreSQL client as a superuser.
- Create publication for your tables.
To remove or update the list of tables under a publication:
CREATE PUBLICATION data_integration_publication FOR TABLE movies, films, actors;Optionally, you can create a publication for specific operations on particular tables:ALTER PUBLICATION data_integration_publication ADD/DROP TABLE my_table;CREATE PUBLICATION data_integration_insert_only FOR TABLE movies WITH (publish = 'INSERT'); - Ensure that your tables appear in the publication.
SELECT * FROM pg_publication_tables; - Once you have created your publication, assign its name to the Data Flow.
- To verify the publication is configured correctly, run the following command:
SELECT pubname, puballtables, pubinsert, pubupdate, pubdelete
FROM pg_publication
WHERE pubname = ‘data_integration_insert_only’;
Partitioned tables in PostgreSQL
Data Integration supports PostgreSQL partitioned tables in both Standard Extraction and Change Data Capture (CDC).