The Log Abstraction
At its core, StreamHouse is a distributed log. Every event you produce is appended to an ordered, immutable sequence. Understanding the three building blocks — topics, partitions, and offsets — is key to using StreamHouse effectively.
Topics: Named Event Streams
A topic is a named category for events. Think of it like a database table, but append-only. When you create a topic, you're declaring: "this is where events of type X go."
streamctl topic create --name user-events --partitions 6 --retention 30d --compression lz4
Topics have configuration that controls their behavior:
- Partition count: How many parallel lanes the topic has (more partitions = more parallelism)
- Retention: How long events are kept (time-based, size-based, or both)
- Compression: LZ4 (fast) or Zstd (compact)
- Schema: Optional schema validation via the Schema Registry
Unlike traditional message queues, events in a topic are persistent and replayable. A consumer can read from the beginning, the end, or any point in between. Multiple consumer groups can independently process the same topic at different speeds.
Partitions: Parallel Ordered Lanes
Each topic is split into one or more partitions. A partition is a single ordered sequence of events — within a partition, events have a strict total order.
Why partitions? Two reasons:
1. Parallelism: Each partition can be processed by a different consumer. A topic with 6 partitions can have up to 6 consumers processing simultaneously.
2. Ordering guarantees: Events with the same key always go to the same partition, so they're always processed in order.
How Keys Map to Partitions
When a producer sends a message with a key, StreamHouse uses murmur2 hashing (consistent with Kafka) to determine the target partition:
partition = murmur2(key) % num_partitions
This means all events for user-123 always land in the same partition, preserving order for that user. Events without a key are distributed round-robin.
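The mapping above can be sketched in Python. This follows Kafka's murmur2 variant (seed `0x9747b28c`, sign bit masked before the modulo, so the doc's `murmur2(key) % num_partitions` is shorthand for the same idea) — but treat it as an illustration and verify against your client library before relying on it for cross-language producers.

```python
def murmur2(data: bytes) -> int:
    """32-bit murmur2, matching Kafka's Java implementation (seed 0x9747b28c)."""
    length = len(data)
    m, r = 0x5BD1E995, 24
    h = (0x9747B28C ^ length) & 0xFFFFFFFF
    i = 0
    # Mix four bytes at a time (little-endian, as in Kafka)
    while i + 4 <= length:
        k = int.from_bytes(data[i:i + 4], "little")
        k = (k * m) & 0xFFFFFFFF
        k ^= k >> r
        k = (k * m) & 0xFFFFFFFF
        h = (h * m) & 0xFFFFFFFF
        h ^= k
        i += 4
    # Fold in the remaining 1-3 bytes
    left = length - i
    if left == 3:
        h ^= data[i + 2] << 16
    if left >= 2:
        h ^= data[i + 1] << 8
    if left >= 1:
        h ^= data[i]
        h = (h * m) & 0xFFFFFFFF
    # Final avalanche
    h ^= h >> 13
    h = (h * m) & 0xFFFFFFFF
    h ^= h >> 15
    return h

def partition_for(key: bytes, num_partitions: int) -> int:
    # Kafka masks the sign bit so the modulo is always non-negative
    return (murmur2(key) & 0x7FFFFFFF) % num_partitions
```

Run twice with the same key and you always get the same partition; that determinism is what preserves per-key ordering.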
# These always go to the same partition
streamctl produce --topic user-events --key "user-123" --message '{"event": "login"}'
streamctl produce --topic user-events --key "user-123" --message '{"event": "purchase"}'
# These are spread across partitions
streamctl produce --topic user-events --message '{"event": "heartbeat"}'
Choosing the Right Partition Count
This is one of the most common questions. Here's our guidance:
- Low throughput (<10K events/sec): 3-6 partitions
- Medium throughput (10K-100K events/sec): 6-12 partitions
- High throughput (100K+ events/sec): 12-64 partitions
The upper limit is practical, not technical. More partitions mean more S3 segments, more metadata entries, and more consumer coordination. Start small and increase if you hit throughput limits.
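The bands above are easy to encode as a quick sizing helper — purely illustrative, not a StreamHouse API:

```python
def suggested_partitions(events_per_sec: int) -> range:
    """Map expected throughput to the rule-of-thumb partition bands above."""
    if events_per_sec < 10_000:
        return range(3, 7)    # low throughput: 3-6 partitions
    if events_per_sec < 100_000:
        return range(6, 13)   # medium throughput: 6-12 partitions
    return range(12, 65)      # high throughput: 12-64 partitions
```

Pick from the low end of the returned band first; you can add partitions later, but removing them is disruptive.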
Offsets: Your Place in the Stream
Every event within a partition has an offset — a monotonically increasing 64-bit integer that uniquely identifies it. Offsets start at 0 and never go backwards.
Partition 0:
Offset 0: {"user": "alice", "event": "signup"}
Offset 1: {"user": "alice", "event": "login"}
Offset 2: {"user": "bob", "event": "signup"}
Offset 3: {"user": "alice", "event": "purchase"}
...
Offsets serve two critical purposes:
1. Position tracking: A consumer's offset tells StreamHouse exactly where to resume reading after a restart.
2. Delivery guarantees: By committing offsets only after successful processing, consumers get at-least-once delivery — a crash between processing and committing means those events are redelivered, not lost. Pairing this with idempotent writes yields effectively exactly-once results.
The High Watermark
Each partition tracks a high watermark — the offset of the most recently committed (flushed to S3) event. Consumers can read up to the high watermark. Events still buffering in agent memory are not yet visible to consumers using committed reads.
Partition 0 state:
First offset: 0
High watermark: 15,234
Log end offset: 15,289 (includes unflushed buffer)
The gap between the high watermark and log end offset represents events that are in the agent's write buffer but not yet flushed to S3.
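That relationship is simple arithmetic, sketched here with the numbers from the example (the field names are illustrative, not StreamHouse's actual schema):

```python
from dataclasses import dataclass

@dataclass
class PartitionState:
    first_offset: int
    high_watermark: int   # last offset durably flushed to S3
    log_end_offset: int   # includes events still in the agent's buffer

    @property
    def unflushed(self) -> int:
        # Events accepted by the agent but not yet visible to committed reads
        return self.log_end_offset - self.high_watermark

state = PartitionState(first_offset=0,
                       high_watermark=15_234,
                       log_end_offset=15_289)
```

Here `state.unflushed` is 55: the events sitting in the write buffer awaiting the next flush.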
Consumer Groups: Coordinated Processing
A consumer group is a set of consumers that cooperatively process a topic. StreamHouse ensures each partition is assigned to exactly one consumer in the group:
Topic: user-events (6 partitions)
Consumer Group: analytics-pipeline
Consumer A → Partition 0, 1
Consumer B → Partition 2, 3
Consumer C → Partition 4, 5
If Consumer B crashes, its partitions are redistributed:
Consumer A → Partition 0, 1, 2
Consumer C → Partition 3, 4, 5
Offset Commits
Consumer groups track their progress by committing offsets to the metadata store. Two strategies:
Auto-commit: Offsets are committed every 5 seconds automatically. Simple, but on a crash it can cause duplicate processing (events processed but not yet committed) or skipped events (offsets committed before processing finished), depending on when the commit fired.
streamctl consume --topic user-events --group analytics --auto-commit
Manual commit: Your application commits offsets explicitly after processing. More complex but enables exactly-once processing when combined with idempotent writes.
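The manual-commit pattern reduces to: write idempotently, then commit. This toy simulation (no real client involved) shows why replays after a crash are harmless when the sink dedupes:

```python
processed = {}          # idempotent sink, keyed by (partition, offset)
committed_offset = 0    # tracked per partition in a real consumer

def handle(batch):
    """Process a batch of (partition, offset, event), then commit."""
    global committed_offset
    for partition, offset, event in batch:
        key = (partition, offset)
        if key not in processed:   # idempotent write: replays are no-ops
            processed[key] = event
    # Commit only after every event in the batch is durably processed
    committed_offset = batch[-1][1] + 1

batch = [(0, 0, "signup"), (0, 1, "login")]
handle(batch)
handle(batch)   # a crash/restart replays the batch; the sink absorbs it
```

After both calls the sink holds exactly two events and the committed offset is 2 — the replay changed nothing, which is the "effectively exactly-once" property.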
How It All Fits Together in S3
Here's the physical reality of a topic with 3 partitions:
PostgreSQL (metadata):
topics: {name: "user-events", partitions: 3}
partitions: [{id: 0, hwm: 15234}, {id: 1, hwm: 12001}, {id: 2, hwm: 18442}]
segments: [{partition: 0, start: 0, end: 999, path: "s3://..."}, ...]
S3 (data):
topics/user-events/partitions/0/segments/00000000-00000999.seg (64MB)
topics/user-events/partitions/0/segments/00001000-00001999.seg (64MB)
topics/user-events/partitions/1/segments/00000000-00000799.seg (64MB)
topics/user-events/partitions/2/segments/00000000-00001199.seg (64MB)
Metadata stays small (kilobytes). Data stays in S3 (terabytes). Agents are stateless in between.
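The segment naming in the listing above is mechanical enough to sketch — assuming, from the example paths, that offsets are zero-padded to eight digits:

```python
def segment_path(topic: str, partition: int, start: int, end: int) -> str:
    """Build the S3 key for a segment file, following the layout shown above."""
    return (f"topics/{topic}/partitions/{partition}/"
            f"segments/{start:08d}-{end:08d}.seg")
```

Because the file name embeds the offset range, locating the segment for a given offset is a metadata lookup, not an S3 scan.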
Key Takeaways
- Topics are named, persistent event streams with configurable retention and compression
- Partitions enable parallel processing while preserving per-key ordering
- Offsets are your consumer's bookmark — commit them wisely
- Consumer groups coordinate multiple consumers to share the load
- The entire model is Kafka-compatible, so existing mental models and patterns transfer directly
Understanding these primitives is the foundation for everything else in StreamHouse — from building real-time pipelines to debugging consumer lag.