The Log Abstraction
At its core, StreamHouse is a distributed log. Every event you produce is appended to an ordered, immutable sequence. Understanding the three building blocks — topics, partitions, and offsets — is key to using StreamHouse effectively.
Topics: Named Event Streams
A topic is a named category for events. Think of it like a database table, but append-only. When you create a topic, you're declaring: "this is where events of type X go."
streamctl topic create --name user-events --partitions 6 --retention 30d --compression lz4
Topics have configuration that controls their behavior:
- Partition count: How many parallel lanes the topic has (more partitions = more parallelism)
- Retention: How long events are kept (time-based, size-based, or both)
- Compression: LZ4 (fast) or Zstd (compact)
- Schema: Optional schema validation via the Schema Registry
Unlike traditional message queues, events in a topic are persistent and replayable. A consumer can read from the beginning, the end, or any point in between. Multiple consumer groups can independently process the same topic at different speeds.
Partitions: Parallel Ordered Lanes
Each topic is split into one or more partitions. A partition is a single ordered sequence of events — within a partition, events have a strict total order.
Why partitions? Two reasons:
1. Parallelism: Each partition can be processed by a different consumer. A topic with 6 partitions can have up to 6 consumers processing simultaneously.
2. Ordering guarantees: Events with the same key always go to the same partition, so they're always processed in order.
How Keys Map to Partitions
When a producer sends a message with a key, StreamHouse uses murmur2 hashing (consistent with Kafka) to determine the target partition:
partition = murmur2(key) % num_partitions
This means all events for user-123 always land in the same partition, preserving order for that user. Events without a key are distributed round-robin.
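The mapping above can be sketched in Python. This follows Kafka's murmur2 variant (seed `0x9747b28c`, sign bit masked before the modulo, so the doc's `murmur2(key) % num_partitions` is shorthand for the same idea) — but treat it as an illustration and verify against your client library before relying on it for cross-language producers.

```python
def murmur2(data: bytes) -> int:
    """32-bit murmur2, matching Kafka's Java implementation (seed 0x9747b28c)."""
    length = len(data)
    m, r = 0x5BD1E995, 24
    h = (0x9747B28C ^ length) & 0xFFFFFFFF
    i = 0
    # Mix four bytes at a time (little-endian, as in Kafka)
    while i + 4 <= length:
        k = int.from_bytes(data[i:i + 4], "little")
        k = (k * m) & 0xFFFFFFFF
        k ^= k >> r
        k = (k * m) & 0xFFFFFFFF
        h = (h * m) & 0xFFFFFFFF
        h ^= k
        i += 4
    # Fold in the remaining 1-3 bytes
    left = length - i
    if left == 3:
        h ^= data[i + 2] << 16
    if left >= 2:
        h ^= data[i + 1] << 8
    if left >= 1:
        h ^= data[i]
        h = (h * m) & 0xFFFFFFFF
    # Final avalanche
    h ^= h >> 13
    h = (h * m) & 0xFFFFFFFF
    h ^= h >> 15
    return h

def partition_for(key: bytes, num_partitions: int) -> int:
    # Kafka masks the sign bit so the modulo is always non-negative
    return (murmur2(key) & 0x7FFFFFFF) % num_partitions
```

Run twice with the same key and you always get the same partition; that determinism is what preserves per-key ordering.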
# These always go to the same partition
streamctl produce --topic user-events --key "user-123" --message '{"event": "login"}'
streamctl produce --topic user-events --key "user-123" --message '{"event": "purchase"}'
# These are spread across partitions
streamctl produce --topic user-events --message '{"event": "heartbeat"}'
Choosing the Right Partition Count
This is one of the most common questions. Here's our guidance:
- Low throughput (<10K events/sec): 3-6 partitions
- Medium throughput (10K-100K events/sec): 6-12 partitions
- High throughput (100K+ events/sec): 12-64 partitions
The upper limit is practical, not technical. More partitions mean more S3 segments, more metadata entries, and more consumer coordination. Start small and increase if you hit throughput limits.
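The bands above are easy to encode as a quick sizing helper — purely illustrative, not a StreamHouse API:

```python
def suggested_partitions(events_per_sec: int) -> range:
    """Map expected throughput to the rule-of-thumb partition bands above."""
    if events_per_sec < 10_000:
        return range(3, 7)    # low throughput: 3-6 partitions
    if events_per_sec < 100_000:
        return range(6, 13)   # medium throughput: 6-12 partitions
    return range(12, 65)      # high throughput: 12-64 partitions
```

Pick from the low end of the returned band first; you can add partitions later, but removing them is disruptive.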
Offsets: Your Place in the Stream
Every event within a partition has an offset — a monotonically increasing 64-bit integer that uniquely identifies it. Offsets start at 0 and never go backwards.
Partition 0:
Offset 0: {"user": "alice", "event": "signup"}
Offset 1: {"user": "alice", "event": "login"}
Offset 2: {"user": "bob", "event": "signup"}
Offset 3: {"user": "alice", "event": "purchase"}
...
Offsets serve two critical purposes:
1. Position tracking: A consumer's offset tells StreamHouse exactly where to resume reading after a restart.
2. Delivery guarantees: By committing offsets only after successful processing, consumers get at-least-once delivery — a crash between processing and committing means those events are redelivered, not lost. Pairing this with idempotent writes yields effectively exactly-once results.
The High Watermark
Each partition tracks a high watermark — the offset of the most recently committed (flushed to S3) event. Consumers can read up to the high watermark. Events still buffering in agent memory are not yet visible to consumers using committed reads.
Partition 0 state:
First offset: 0
High watermark: 15,234
Log end offset: 15,289 (includes unflushed buffer)
The gap between the high watermark and log end offset represents events that are in the agent's write buffer but not yet flushed to S3.
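That relationship is simple arithmetic, sketched here with the numbers from the example (the field names are illustrative, not StreamHouse's actual schema):

```python
from dataclasses import dataclass

@dataclass
class PartitionState:
    first_offset: int
    high_watermark: int   # last offset durably flushed to S3
    log_end_offset: int   # includes events still in the agent's buffer

    @property
    def unflushed(self) -> int:
        # Events accepted by the agent but not yet visible to committed reads
        return self.log_end_offset - self.high_watermark

state = PartitionState(first_offset=0,
                       high_watermark=15_234,
                       log_end_offset=15_289)
```

Here `state.unflushed` is 55: the events sitting in the write buffer awaiting the next flush.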
Consumer Groups: Coordinated Processing
A consumer group is a set of consumers that cooperatively process a topic. StreamHouse ensures each partition is assigned to exactly one consumer in the group:
Topic: user-events (6 partitions)
Consumer Group: analytics-pipeline
Consumer A → Partition 0, 1
Consumer B → Partition 2, 3
Consumer C → Partition 4, 5
If Consumer B crashes, its partitions are redistributed:
Consumer A → Partition 0, 1, 2
Consumer C → Partition 3, 4, 5
Offset Commits
Consumer groups track their progress by committing offsets to the metadata store. Two strategies:
Auto-commit: Offsets are committed every 5 seconds automatically. Simple, but on a crash it can cause duplicate processing (events processed but not yet committed) or skipped events (offsets committed before processing finished), depending on when the commit fired.
streamctl consume --topic user-events --group analytics --auto-commit
Manual commit: Your application commits offsets explicitly after processing. More complex but enables exactly-once processing when combined with idempotent writes.
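The manual-commit pattern reduces to: write idempotently, then commit. This toy simulation (no real client involved) shows why replays after a crash are harmless when the sink dedupes:

```python
processed = {}          # idempotent sink, keyed by (partition, offset)
committed_offset = 0    # tracked per partition in a real consumer

def handle(batch):
    """Process a batch of (partition, offset, event), then commit."""
    global committed_offset
    for partition, offset, event in batch:
        key = (partition, offset)
        if key not in processed:   # idempotent write: replays are no-ops
            processed[key] = event
    # Commit only after every event in the batch is durably processed
    committed_offset = batch[-1][1] + 1

batch = [(0, 0, "signup"), (0, 1, "login")]
handle(batch)
handle(batch)   # a crash/restart replays the batch; the sink absorbs it
```

After both calls the sink holds exactly two events and the committed offset is 2 — the replay changed nothing, which is the "effectively exactly-once" property.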
How It All Fits Together in S3
Here's the physical reality of a topic with 3 partitions:
PostgreSQL (metadata):
topics: {name: "user-events", partitions: 3}
partitions: [{id: 0, hwm: 15234}, {id: 1, hwm: 12001}, {id: 2, hwm: 18442}]
segments: [{partition: 0, start: 0, end: 999, path: "s3://..."}, ...]
S3 (data):
topics/user-events/partitions/0/segments/00000000-00000999.seg (64MB)
topics/user-events/partitions/0/segments/00001000-00001999.seg (64MB)
topics/user-events/partitions/1/segments/00000000-00000799.seg (64MB)
topics/user-events/partitions/2/segments/00000000-00001199.seg (64MB)
Metadata stays small (kilobytes). Data stays in S3 (terabytes). Agents are stateless in between.
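The segment naming in the listing above is mechanical enough to sketch — assuming, from the example paths, that offsets are zero-padded to eight digits:

```python
def segment_path(topic: str, partition: int, start: int, end: int) -> str:
    """Build the S3 key for a segment file, following the layout shown above."""
    return (f"topics/{topic}/partitions/{partition}/"
            f"segments/{start:08d}-{end:08d}.seg")
```

Because the file name embeds the offset range, locating the segment for a given offset is a metadata lookup, not an S3 scan.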
Key Takeaways
- Topics are named, persistent event streams with configurable retention and compression
- Partitions enable parallel processing while preserving per-key ordering
- Offsets are your consumer's bookmark — commit them wisely
- Consumer groups coordinate multiple consumers to share the load
- The entire model is Kafka-compatible, so existing mental models and patterns transfer directly
Understanding these primitives is the foundation for everything else in StreamHouse — from building real-time pipelines to debugging consumer lag.