Backup & Recovery

Disaster recovery strategies for StreamHouse.


Backup Strategy

StreamHouse's disaggregated architecture simplifies backup and recovery. Event data in S3 is inherently durable (11 nines). The critical component to back up is the metadata store (PostgreSQL), which contains topic definitions, partition state, segment locations, and consumer group offsets.

Metadata Store Backup

Regular PostgreSQL backups ensure you can recover topic and consumer state.

```bash
# Automated daily backup with pg_dump (custom format, suitable for pg_restore)
pg_dump -h localhost -U streamhouse -d streamhouse -F c -f backup_$(date +%Y%m%d).dump

# For managed databases (RDS/Aurora), enable automated backups
# with point-in-time recovery (PITR) for the strongest protection

# Restore from a backup (-c drops existing objects before recreating them)
pg_restore -h localhost -U streamhouse -d streamhouse -c backup_20260115.dump
```
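Daily dumps accumulate, so a small retention helper keeps the backup directory bounded. A minimal sketch, assuming dumps follow the `backup_YYYYMMDD.dump` naming above; the directory and the 14-day window are illustrative:

```shell
# prune_backups: delete metadata dumps older than a retention window.
#   $1 - directory holding the dumps
#   $2 - retention window in days
prune_backups() {
  dir="$1"
  days="$2"
  # -mtime +N matches files last modified more than N days ago.
  find "$dir" -name 'backup_*.dump' -type f -mtime +"$days" -delete
}

# Example: prune_backups /var/backups/streamhouse 14
```

Run it from the same cron job that invokes `pg_dump`, after the new dump has been written.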

Recovery Procedures

Recovery procedures depend on the type of failure.

  • Agent failure: Automatic. Other agents detect the failure via lease expiry and take over its work. No data loss.
  • S3 outage: Agents buffer writes in memory and retry. Consumers may see increased latency. Recovery is automatic once S3 is available again.
  • Metadata store failure: Agents cannot accept new writes or update consumer offsets. Restore from backup or fail over to a replica.
  • Full disaster recovery: Restore metadata from backup, point agents at the same S3 bucket, and restart. All event data is recovered, and consumers resume from their last committed offsets.
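The full disaster recovery path above can be sketched as a runbook script. The hostnames, dump file name, and agent service name are assumptions for illustration; set `DRY_RUN=1` to print the steps instead of executing them:

```shell
# Full disaster recovery sketch: restore metadata, then restart agents
# against the same S3 bucket. All names below are illustrative.
recover_streamhouse() {
  run() {
    # With DRY_RUN=1, print each step instead of executing it.
    if [ "${DRY_RUN:-0}" = "1" ]; then echo "+ $*"; else "$@"; fi
  }

  # 1. Restore the metadata store from the most recent dump.
  run pg_restore -h db.recovery.internal -U streamhouse -d streamhouse -c backup_latest.dump

  # 2. Agents read segment data directly from S3, so no event data needs
  #    restoring -- they only need the restored metadata and the existing
  #    bucket name (set in agent configuration, not shown here).

  # 3. Restart the agent fleet.
  run systemctl restart streamhouse-agent
}

# Preview the steps without executing anything:
DRY_RUN=1 recover_streamhouse
```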

Cross-Region Replication

For multi-region disaster recovery, enable S3 Cross-Region Replication (CRR) on your data bucket and set up a standby PostgreSQL replica in the target region. In a regional failure, promote the standby replica and point agents at the replicated S3 bucket.
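Concretely, the CRR setup might look like the following config fragment; the bucket names, IAM role ARN, and account ID are placeholders. Note that versioning must be enabled on both buckets before S3 will accept a replication configuration:

```shell
# Versioning is a prerequisite for CRR on both source and destination.
aws s3api put-bucket-versioning --bucket streamhouse-events \
  --versioning-configuration Status=Enabled
aws s3api put-bucket-versioning --bucket streamhouse-events-dr \
  --versioning-configuration Status=Enabled

# Replication rule: copy every object to the DR-region bucket.
cat > replication.json <<'EOF'
{
  "Role": "arn:aws:iam::123456789012:role/streamhouse-replication",
  "Rules": [
    {
      "ID": "dr-replication",
      "Status": "Enabled",
      "Priority": 1,
      "Filter": {},
      "DeleteMarkerReplication": { "Status": "Disabled" },
      "Destination": { "Bucket": "arn:aws:s3:::streamhouse-events-dr" }
    }
  ]
}
EOF

aws s3api put-bucket-replication --bucket streamhouse-events \
  --replication-configuration file://replication.json
```

Pair this with a PostgreSQL physical or logical replica in the target region so that both halves of the system (events and metadata) survive a regional failure.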