Architecture Overview

Understand StreamHouse's disaggregated architecture and how components interact.


Design Philosophy

StreamHouse follows a disaggregated architecture in which compute, storage, and metadata are separated into independent layers. This design enables independent scaling of each layer, fast recovery from failures (there is no local state to rebuild), and a dramatic cost reduction by storing data in S3 at $0.023/GB/month, compared to $0.10-0.30/GB/month for the attached disks Kafka brokers require.
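To make the cost claim concrete, here is a back-of-the-envelope comparison for a hypothetical workload retaining 10 TB. The retained size and the 3x replication factor are illustrative assumptions (3 is a typical Kafka replication setting, not something StreamHouse configures); S3 replicates internally at no extra per-GB charge.

```python
# Hypothetical monthly storage cost comparison for 10 TB of retained data.
S3_PRICE = 0.023          # $/GB/month (S3 standard, as quoted above)
DISK_PRICE = 0.10         # $/GB/month (low end of the attached-disk range)
KAFKA_REPLICATION = 3     # assumed replication factor on broker disks

retained_gb = 10_000
s3_cost = retained_gb * S3_PRICE                         # S3 replicates internally
kafka_cost = retained_gb * DISK_PRICE * KAFKA_REPLICATION

print(f"S3:    ${s3_cost:,.0f}/month")    # $230/month
print(f"Kafka: ${kafka_cost:,.0f}/month") # $3,000/month at the cheapest disk tier
```

Even at the low end of the disk price range, replication multiplies Kafka's raw storage bill by the replication factor, while S3's quoted price already includes durability.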

Core Components

StreamHouse consists of three main components that work together to provide a complete streaming platform.

  • Stateless Agents: Handle produce and consume requests, buffer writes in memory, and flush segments to S3. Agents can be added or removed without data rebalancing.
  • S3 Storage: All event data is stored as immutable segments in S3 with 99.999999999% durability. Segments are compressed with LZ4 and typically 64MB in size.
  • Metadata Store: PostgreSQL stores topic definitions, partition assignments, segment locations, and consumer group offsets. This is the only stateful component.
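The metadata store is the glue between the other two layers: every flushed segment becomes one row mapping an offset range to an S3 object. The field names below are illustrative, not StreamHouse's actual schema; the sketch shows the lookup a consumer read implies.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SegmentRecord:
    """Illustrative shape of one metadata-store row for an immutable S3 segment."""
    topic: str
    partition: int
    start_offset: int    # first offset contained in the segment
    end_offset: int      # last offset contained in the segment (inclusive)
    s3_key: str          # object location (key format here is hypothetical)
    size_bytes: int      # compressed size, typically up to ~64 MB

def covers(seg: SegmentRecord, offset: int) -> bool:
    # A consumer fetch reduces to: which segment row contains this offset?
    return seg.start_offset <= offset <= seg.end_offset
```

Because segments are immutable, these rows are write-once: agents only ever insert new rows and query existing ones, which keeps the PostgreSQL workload simple.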

Data Flow

When a producer sends a message, it reaches a StreamHouse agent via gRPC or HTTP. The agent buffers the message in an in-memory write pool organized by partition. When the buffer reaches 64MB or a time threshold, the agent compresses the segment with LZ4 and uploads it to S3. The segment's location and offset range are then registered in the metadata store. When a consumer requests messages, the agent looks up which segments contain the requested offset range, fetches and decompresses them from S3 (with local caching), and streams the records back to the consumer.
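The produce path above can be sketched as a per-partition buffer that flushes when it hits the size limit or a time threshold. This is a simplified model, not the agent's actual code: the flush interval is an assumed value, zlib stands in for LZ4 so the sketch needs no third-party package, and the S3 upload and metadata registration are passed in as callables.

```python
import time
import zlib  # stand-in for LZ4 in this sketch; real agents compress with LZ4

SEGMENT_BYTES = 64 * 1024 * 1024   # flush when the partition buffer reaches 64 MB
FLUSH_INTERVAL_S = 0.5             # ...or after this (assumed) time threshold

class PartitionBuffer:
    """Sketch of an agent's in-memory write buffer for one partition."""

    def __init__(self):
        self.records: list[bytes] = []
        self.size = 0
        self.opened_at = time.monotonic()

    def append(self, payload: bytes) -> None:
        # Producers' messages accumulate in memory, organized by partition.
        self.records.append(payload)
        self.size += len(payload)

    def should_flush(self) -> bool:
        return (self.size >= SEGMENT_BYTES
                or time.monotonic() - self.opened_at >= FLUSH_INTERVAL_S)

    def flush(self, upload, register) -> None:
        # Compress the buffered records into one immutable segment,
        # upload it to S3, then record its location in the metadata store.
        segment = zlib.compress(b"".join(self.records))
        s3_key = upload(segment)
        register(s3_key, len(self.records))
        self.records, self.size = [], 0
        self.opened_at = time.monotonic()
```

Note that registration happens only after the upload succeeds, so a segment is never visible to consumers before it is durable in S3.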

Caching Architecture

StreamHouse uses a multi-level caching strategy to minimize latency and reduce load on external dependencies.

  • Metadata Cache: LRU cache with topic metadata (5 min TTL) and partition info (30s TTL) to reduce PostgreSQL queries
  • Segment Cache: Local disk and memory cache for recently accessed S3 segments, critical for consumer tail reads
  • Schema Cache: In-memory cache for schema registry lookups (10,000 schemas, 1 hour TTL)
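All three caches follow the same pattern: bounded LRU eviction plus a per-entry TTL. Here is a minimal sketch of that pattern under assumed semantics (expired entries are dropped lazily on read); it is not StreamHouse's cache implementation.

```python
import time
from collections import OrderedDict

class TTLCache:
    """Minimal LRU cache with a per-entry TTL, in the spirit of the metadata cache."""

    def __init__(self, capacity: int, ttl: float):
        self.capacity, self.ttl = capacity, ttl
        self._data: OrderedDict = OrderedDict()  # key -> (value, stored_at)

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, stored_at = entry
        if time.monotonic() - stored_at > self.ttl:
            del self._data[key]              # expired: drop lazily and miss
            return None
        self._data.move_to_end(key)          # refresh LRU position on hit
        return value

    def put(self, key, value):
        if key in self._data:
            self._data.move_to_end(key)
        self._data[key] = (value, time.monotonic())
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)   # evict the least-recently-used entry
```

The short partition-info TTL (30s) bounds how stale a routing decision can be, while the longer topic-metadata TTL (5 min) keeps PostgreSQL query volume low for data that rarely changes.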

Comparison with Apache Kafka

Unlike Apache Kafka, which couples compute and storage on broker nodes with local disks, StreamHouse separates these concerns. Kafka brokers must replicate data across multiple brokers for durability, requiring expensive attached storage and careful partition rebalancing when scaling. StreamHouse eliminates this by writing directly to S3, which handles durability and replication transparently. The trade-off is higher produce latency (50-100ms versus Kafka's 5-10ms) in exchange for much lower cost and operational complexity.