What is CDC?
Change Data Capture (CDC) turns database changes into a stream of events. Every time a row is inserted, updated, or deleted, a corresponding event is emitted. This enables real-time data synchronization between systems without polling or batch ETL.
PostgreSQL → CDC → StreamHouse → Downstream Systems
├── Elasticsearch (search index)
├── Redis (cache invalidation)
├── Data warehouse (analytics)
└── Other microservices
Why StreamHouse for CDC?
CDC generates high-volume, ordered event streams — exactly what StreamHouse is built for. Key advantages:
- Ordered delivery: CDC events must be processed in order per key. StreamHouse's partition-key ordering guarantees this.
- Replay: If a downstream system needs to rebuild its state, consumers can replay from the beginning of the topic.
- Fan-out: Multiple downstream systems can independently consume the same CDC stream.
- Cost: CDC data is often high-volume but rarely queried. S3 storage at roughly $0.023/GB-month makes long retention cheap.
- SQL processing: Filter, transform, and aggregate CDC events using SQL before they reach downstream systems.
Setting Up CDC with Debezium
Debezium is the standard open-source CDC tool. It captures changes from PostgreSQL's WAL (Write-Ahead Log) and produces them to a streaming platform.
Step 1: Configure PostgreSQL for Logical Replication
# In postgresql.conf
wal_level = logical
max_replication_slots = 4
max_wal_senders = 4
-- Create a replication user
CREATE ROLE replicator WITH REPLICATION LOGIN PASSWORD 'secret';
GRANT SELECT ON ALL TABLES IN SCHEMA public TO replicator;
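Before deploying the connector, it is worth verifying these settings on the running server. A minimal sketch of that check in plain Python — the `REQUIRED` dict mirrors the values above; in practice you would fetch the live values with `SHOW wal_level;` and friends over a database connection:

```python
# Required postgresql.conf settings for logical replication, as shown above.
REQUIRED = {
    "wal_level": "logical",
    "max_replication_slots": 4,
    "max_wal_senders": 4,
}

def check_replication_settings(current: dict) -> list[str]:
    """Return a list of human-readable problems; empty means ready for CDC."""
    problems = []
    for name, wanted in REQUIRED.items():
        have = current.get(name)
        if name == "wal_level":
            if have != wanted:
                problems.append(f"wal_level is {have!r}, must be 'logical'")
        elif int(have or 0) < wanted:
            problems.append(f"{name} is {have}, need at least {wanted}")
    return problems

print(check_replication_settings(
    {"wal_level": "replica", "max_replication_slots": 1, "max_wal_senders": 4}))
```

Note that changing wal_level requires a PostgreSQL restart, so this is worth checking before you register the connector rather than after.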
Step 2: Create StreamHouse Topics
Create topics that mirror your database tables:
# One topic per table, keyed by primary key
streamctl topic create --name cdc.public.users --partitions 6 --retention 30d
streamctl topic create --name cdc.public.orders --partitions 12 --retention 30d
streamctl topic create --name cdc.public.products --partitions 6 --retention 30d
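The naming convention — the connector's topic prefix plus the schema-qualified table name — can be captured in a small helper so producers and consumers agree on topic names. A sketch, assuming the `cdc` prefix used throughout this guide:

```python
def cdc_topic(table: str, prefix: str = "cdc") -> str:
    """Map a schema-qualified table name to its CDC topic,
    e.g. 'public.users' -> 'cdc.public.users'."""
    schema, _, name = table.partition(".")
    if not name:
        raise ValueError("expected a schema-qualified name like 'public.users'")
    return f"{prefix}.{schema}.{name}"

print(cdc_topic("public.orders"))  # -> cdc.public.orders
```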
Step 3: Configure Debezium
{
  "name": "pg-source",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "postgres",
    "database.port": "5432",
    "database.user": "replicator",
    "database.password": "secret",
    "database.dbname": "myapp",
    "topic.prefix": "cdc",
    "plugin.name": "pgoutput",
    "publication.autocreate.mode": "filtered",
    "table.include.list": "public.users,public.orders,public.products"
  }
}
Step 4: Consume CDC Events
CDC events include the full row data plus metadata:
{
  "op": "u",
  "before": {"id": 1, "name": "Alice", "email": "alice@old.com"},
  "after": {"id": 1, "name": "Alice", "email": "alice@new.com"},
  "source": {
    "table": "users",
    "lsn": 123456789,
    "txId": 42
  },
  "ts_ms": 1706000000000
}
The op field indicates the operation: c (create), u (update), d (delete), r (snapshot read).
Use Case 1: Search Index Sync
Keep Elasticsearch in sync with your database:
-- Create a continuous query that transforms CDC events for Elasticsearch
CREATE CONTINUOUS QUERY user_search_updates AS
SELECT
    after->>'id' AS user_id,
    after->>'name' AS name,
    after->>'email' AS email,
    after->>'bio' AS bio,
    CASE op
        WHEN 'd' THEN 'delete'
        ELSE 'upsert'
    END AS action
FROM cdc_users
WHERE op IN ('c', 'u', 'd')
OUTPUT TO 'search-user-updates';
A downstream consumer reads from search-user-updates and applies changes to Elasticsearch.
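That consumer can translate each event straight into Elasticsearch bulk-API lines. A sketch assuming the event shape produced by the query above (index name `users` is illustrative; the actual bulk HTTP call is elided):

```python
def to_bulk_action(event: dict, index: str = "users") -> list[dict]:
    """Translate one search-user-updates event into Elasticsearch bulk-API
    lines: a delete is a single action line; an upsert is action + document."""
    if event["action"] == "delete":
        return [{"delete": {"_index": index, "_id": event["user_id"]}}]
    doc = {k: event[k] for k in ("name", "email", "bio")}
    return [{"index": {"_index": index, "_id": event["user_id"]}}, doc]

lines = to_bulk_action({"action": "upsert", "user_id": "1",
                        "name": "Alice", "email": "alice@new.com", "bio": ""})
print(lines)
```

Batching many events into one bulk request amortizes the indexing overhead; since the stream is ordered per key, the last write for a given user always wins.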
Use Case 2: Cache Invalidation
Invalidate Redis cache entries when database rows change:
CREATE CONTINUOUS QUERY cache_invalidations AS
SELECT
    'product:' || (after->>'id') AS cache_key,
    op AS operation
FROM cdc_products
OUTPUT TO 'cache-invalidations';
A consumer reads from cache-invalidations and calls DEL on the corresponding Redis keys. This is faster and more reliable than TTL-based cache expiry.
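Deduplicating keys within each consumed batch avoids issuing repeated DELs for a hot row. A sketch of that batching logic, assuming the event shape produced by the query above (the Redis call itself is elided):

```python
def keys_to_invalidate(events: list[dict]) -> list[str]:
    """Collect the Redis keys to DEL for one batch of cache-invalidations
    events, deduplicating while preserving first-seen order."""
    seen: dict[str, None] = {}
    for ev in events:
        seen.setdefault(ev["cache_key"], None)
    return list(seen)

batch = [{"cache_key": "product:1", "operation": "u"},
         {"cache_key": "product:2", "operation": "d"},
         {"cache_key": "product:1", "operation": "u"}]
print(keys_to_invalidate(batch))  # ['product:1', 'product:2']
```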
Use Case 3: Real-Time Analytics
Stream database changes to your data warehouse without batch ETL:
-- Aggregate order metrics in real-time
CREATE CONTINUOUS QUERY order_metrics AS
SELECT
    count(*) FILTER (WHERE op = 'c') AS new_orders,
    sum(CAST(after->>'amount' AS DOUBLE)) FILTER (WHERE op = 'c') AS total_revenue,
    window_start
FROM TUMBLE(cdc_orders, ts_ms, INTERVAL '1 minute')
GROUP BY window_start, window_end
OUTPUT TO 'order-metrics-per-minute';
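To make the windowing explicit, the same aggregation can be sketched in plain Python: a tumbling window simply buckets each event by its timestamp rounded down to the window size. Event shape follows the Debezium envelope shown earlier; this is an illustration of the semantics, not how StreamHouse executes the query:

```python
from collections import defaultdict

WINDOW_MS = 60_000  # one-minute tumbling windows

def order_metrics(events):
    """Bucket 'c' (create) events into tumbling windows keyed by window
    start, counting orders and summing revenue like the query above."""
    buckets = defaultdict(lambda: {"new_orders": 0, "total_revenue": 0.0})
    for ev in events:
        if ev["op"] != "c":
            continue  # the FILTER clause: only count inserts
        window_start = ev["ts_ms"] - ev["ts_ms"] % WINDOW_MS
        b = buckets[window_start]
        b["new_orders"] += 1
        b["total_revenue"] += float(ev["after"]["amount"])
    return dict(buckets)

events = [
    {"op": "c", "ts_ms": 1706000000000, "after": {"amount": "19.99"}},
    {"op": "c", "ts_ms": 1706000005000, "after": {"amount": "5.00"}},
    {"op": "u", "ts_ms": 1706000007000, "after": {"amount": "5.00"}},
]
print(order_metrics(events))
```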
Ordering Guarantees
CDC events must be processed in order for each row (primary key). StreamHouse guarantees this because:
- Debezium produces events with the primary key as the message key
- StreamHouse routes all events with the same key to the same partition
- Within a partition, events are strictly ordered
- Consumer groups assign entire partitions (not individual keys) to consumers
This means all changes to user_id=123 are processed by the same consumer in the exact order they occurred in the database.
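The routing step can be illustrated with a stable hash of the message key modulo the partition count. StreamHouse's actual hash function may differ; the property that matters is that equal keys always map to the same partition:

```python
import hashlib

def partition_for(key: bytes, num_partitions: int) -> int:
    """Illustrative key->partition routing: stable hash mod partition count.
    Equal keys always land in the same partition, preserving per-key order."""
    digest = hashlib.md5(key).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# Every change to user_id=123 lands in one partition, hence one consumer.
p = partition_for(b"123", 6)
assert all(partition_for(b"123", 6) == p for _ in range(1000))
print("user 123 ->", p)
```

Note the flip side: per-key ordering holds only as long as the partition count stays fixed, since repartitioning changes which partition a key maps to.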
Handling Schema Changes
When your database schema evolves (ALTER TABLE), Debezium emits events with the new schema. StreamHouse's Schema Registry can validate these changes:
# Set backward compatibility on CDC topics
curl -X PUT http://localhost:8081/config/cdc.public.users-value \
-H "Content-Type: application/json" \
-d '{"compatibility": "BACKWARD"}'
Adding a nullable column (a field with a default) is backward compatible. Incompatible changes — such as changing a column's type or adding a required column without a default — are rejected by the compatibility check before any bad data reaches consumers.
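A toy version of the BACKWARD rule makes the logic concrete: readers on the new schema must still be able to decode rows written with the old one. This sketch works over a simplified {column: (type, nullable)} dict; a real Schema Registry reasons over Avro or Protobuf schemas, not this form:

```python
def is_backward_compatible(old: dict, new: dict) -> bool:
    """Toy BACKWARD check over {column: (type, nullable)} schemas."""
    for col, (old_type, _) in old.items():
        if col in new and new[col][0] != old_type:
            return False  # changed a column's type: old rows won't decode
    for col, (_, nullable) in new.items():
        if col not in old and not nullable:
            return False  # added a required column: old rows lack a value
    return True

old = {"id": ("int", False), "email": ("text", False)}
print(is_backward_compatible(old, {**old, "bio": ("text", True)}))   # True
print(is_backward_compatible(old, {**old, "id": ("text", False)}))   # False
```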
Getting Started
# 1. Deploy StreamHouse + Debezium
docker compose up -d
# 2. Register the Debezium connector
curl -X POST http://debezium:8083/connectors \
  -H "Content-Type: application/json" \
  -d @connector.json
# 3. Watch CDC events flow
streamctl consume --topic cdc.public.users --from-beginning --format json
CDC with StreamHouse gives you a real-time, replayable, SQL-queryable mirror of every change in your database. Build search indexes, invalidate caches, feed analytics, and sync microservices — all from a single, ordered event stream.