Topics & Partitions

How data is organized and distributed in StreamHouse.


Topics

A topic is a named, append-only log that stores a sequence of events. Topics are the primary unit of organization in StreamHouse. Each topic has a unique name and can be configured with retention policies, compression settings, and partition counts. Topics are created through the API, CLI, or web console.

Partitions

Each topic is divided into one or more partitions, which enable parallel processing and horizontal scaling. Each partition is an independent, ordered sequence of messages, and a message's position within its partition is called its offset. Messages with the same key are always routed to the same partition via consistent hashing, so related events keep their relative order. Ordering is guaranteed only within a partition, not across the partitions of a topic.
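
To make key-based routing concrete, here is a minimal sketch in Python. It is not StreamHouse's actual implementation: the hashing scheme is not specified here, so the sketch substitutes a stable hash taken modulo the partition count, and the function name `partition_for_key` is hypothetical. It only illustrates the property described above, that equal keys always land on the same partition.

```python
import hashlib


def partition_for_key(key: bytes, num_partitions: int) -> int:
    """Map a message key to a partition index in [0, num_partitions).

    Stand-in scheme (assumption): stable hash modulo partition count.
    """
    if num_partitions <= 0:
        raise ValueError("num_partitions must be positive")
    # Use a stable digest: Python's built-in hash() is salted per process
    # for bytes and str, so it would not route consistently across restarts.
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


# The same key always yields the same partition index.
for key in (b"user-42", b"user-42", b"user-7"):
    print(key.decode(), "->", partition_for_key(key, 6))
```

Note that with a plain modulo scheme like this one, changing the partition count remaps most keys, which is one reason a system might prefer a consistent-hashing variant.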

Segments

Within each partition, messages are grouped into segments. A segment is an immutable file stored in S3, typically 64 MB in size. Segments are compressed with LZ4 and contain a sequence of records along with an index for fast offset lookups. When the active segment reaches its target size, it is sealed and uploaded to S3, and a new segment begins.
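
The seal-and-roll behavior can be sketched as follows. This is an illustrative model under stated assumptions, not StreamHouse code: the `Segment` and `PartitionWriter` names are hypothetical, size is counted on uncompressed record bytes, and LZ4 compression and the S3 upload are reduced to a comment.

```python
from dataclasses import dataclass, field

TARGET_SEGMENT_BYTES = 64 * 1024 * 1024  # 64 MB target size from the text


@dataclass
class Segment:
    start_offset: int
    records: list = field(default_factory=list)
    size_bytes: int = 0
    sealed: bool = False

    @property
    def end_offset(self) -> int:
        # Offset of the last record held by this segment.
        return self.start_offset + len(self.records) - 1


class PartitionWriter:
    """Hypothetical writer for a single partition."""

    def __init__(self, target_bytes: int = TARGET_SEGMENT_BYTES):
        self.target_bytes = target_bytes
        self.sealed_segments = []
        self.active = Segment(start_offset=0)

    def append(self, record: bytes) -> int:
        """Append a record and return the offset assigned to it."""
        offset = self.active.start_offset + len(self.active.records)
        self.active.records.append(record)
        self.active.size_bytes += len(record)
        if self.active.size_bytes >= self.target_bytes:
            self._roll()
        return offset

    def _roll(self) -> None:
        # Seal the active segment; a real system would compress it and
        # upload it to S3 at this point (omitted in this sketch).
        self.active.sealed = True
        self.sealed_segments.append(self.active)
        self.active = Segment(start_offset=self.active.end_offset + 1)
```

With a small target such as `PartitionWriter(target_bytes=10)`, appending a few 4-byte records is enough to watch a segment seal and the next one start at the following offset.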

text
# Segment path format in S3
s3://streamhouse-data/topics/{topic}/partitions/{partition}/segments/{start_offset}-{end_offset}.seg

# Example
s3://streamhouse-data/topics/user-events/partitions/0/segments/00000000-00000999.seg
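
As a small illustration of the path format above, the following Python sketch builds a segment path and locates the segment that contains a given offset. The helper names `segment_path` and `find_segment` are hypothetical, and the 8-digit zero padding is an assumption inferred from the example path rather than a documented rule.

```python
import bisect

BUCKET = "streamhouse-data"  # bucket name taken from the example path
OFFSET_WIDTH = 8             # assumption: padding width inferred from the example


def segment_path(topic: str, partition: int, start_offset: int, end_offset: int) -> str:
    """Build the S3 path of a segment from its topic, partition, and offset range."""
    return (
        f"s3://{BUCKET}/topics/{topic}/partitions/{partition}/segments/"
        f"{start_offset:0{OFFSET_WIDTH}d}-{end_offset:0{OFFSET_WIDTH}d}.seg"
    )


def find_segment(segments, offset):
    """Return the (start, end) range containing `offset`, or None.

    `segments` must be a list of (start_offset, end_offset) tuples
    sorted by start_offset.
    """
    starts = [start for start, _ in segments]
    i = bisect.bisect_right(starts, offset) - 1
    if i >= 0 and segments[i][0] <= offset <= segments[i][1]:
        return segments[i]
    return None


print(segment_path("user-events", 0, 0, 999))
# s3://streamhouse-data/topics/user-events/partitions/0/segments/00000000-00000999.seg
```

Because each file name carries both the start and end offset, a reader that has listed a partition's segments can pick the right file with a binary search before fetching anything.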

Replication & Durability

StreamHouse relies on S3 for data durability rather than Kafka-style partition replication. S3 is designed for 99.999999999% (11 nines) durability and automatically replicates data across multiple availability zones. This eliminates the need for ISR (in-sync replica) management and partition reassignment, both of which add operational complexity in traditional Kafka deployments.