Segment Format

Internal format of StreamHouse data segments.


Segment Structure

A segment is an immutable, compressed file stored in S3 that contains a sequence of records for a single partition. Each segment has a fixed-size header, a data section containing the compressed records, a sparse offset index for fast lookups, and a fixed-size footer that records where the index starts.
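
As a rough illustration of how an offset index speeds up lookups, the sketch below finds where a reader would start scanning for a given record offset. It assumes the index has already been loaded into memory as sorted `(record_offset, byte_position)` pairs; the names `SparseIndex` and `find_scan_start` and the sample values are hypothetical, and the on-disk encoding of index entries is not covered here.

```python
from bisect import bisect_right

# Hypothetical in-memory form of the index: (record_offset, byte_position)
# pairs sorted by record_offset. The on-disk encoding is an open assumption.
SparseIndex = list[tuple[int, int]]


def find_scan_start(index: SparseIndex, target_offset: int) -> tuple[int, int]:
    """Return the last index entry at or before target_offset.

    A reader would seek to the returned byte position and then scan forward
    record by record until it reaches target_offset.
    """
    if not index or target_offset < index[0][0]:
        raise KeyError(f"offset {target_offset} precedes this segment")
    offsets = [record_offset for record_offset, _ in index]
    # bisect_right returns the insertion point after any equal entry, so the
    # entry just before it is the greatest one <= target_offset.
    i = bisect_right(offsets, target_offset) - 1
    return index[i]


# Made-up sample index with one entry per 1000 records.
index = [(5000, 0), (6000, 48_211), (7000, 97_530)]
record_offset, byte_position = find_scan_start(index, 6420)
print(record_offset, byte_position, 6420 - record_offset)
```

Because the index is sparse, the lookup only narrows the search to a window of records; the final step is always a short sequential scan from the returned position.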

Binary Format

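The segment binary format is designed for fast reads and compact storage. The header and footer are fixed-size, so a reader can find the index from the footer without scanning the data section.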

┌─────────────────────────────────┐
│ Header (32 bytes)               │
│  - Magic bytes: "STRM"          │
│  - Version: u16                 │
│  - Compression: u8 (LZ4=1)      │
│  - Record count: u64            │
│  - Start offset: u64            │
│  - End offset: u64              │
├─────────────────────────────────┤
│ Data Section (variable)         │
│  - LZ4-compressed records       │
│  - Each record:                 │
│    - Key length (u32)           │
│    - Key bytes                  │
│    - Value length (u32)         │
│    - Value bytes                │
│    - Timestamp (i64)            │
│    - Headers count (u16)        │
│    - Header entries             │
├─────────────────────────────────┤
│ Index Section                   │
│  - Sparse offset index          │
│  - Every 1000th record offset   │
│  - Byte position in data        │
├─────────────────────────────────┤
│ Footer (16 bytes)               │
│  - Index offset: u64            │
│  - CRC32: u32                   │
│  - Magic bytes: "ENDS"          │
└─────────────────────────────────┘
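
The fixed-size header and footer can be decoded with a few lines of Python. This is a minimal sketch, not the StreamHouse reader: it assumes little-endian integers, fields packed in the order listed, and one reserved pad byte after the compression field (the listed header fields add up to 31 bytes, so something must account for the 32nd). The function names and sample values are hypothetical, and the sketch does not verify the CRC32 because the layout does not say which bytes it covers.

```python
import struct

# Assumptions, not stated in the layout: little-endian byte order, fields in
# listed order, and one pad byte ("x") after the compression field to bring
# the 31 bytes of listed fields up to the documented 32.
HEADER = struct.Struct("<4sHBxQQQ")  # magic, version, compression, count, start, end
FOOTER = struct.Struct("<QI4s")      # index offset, CRC32, magic

COMPRESSION_LZ4 = 1


def parse_header(buf: bytes) -> dict:
    """Decode the header from the first 32 bytes of a segment."""
    magic, version, compression, count, start, end = HEADER.unpack_from(buf, 0)
    if magic != b"STRM":
        raise ValueError(f"bad header magic: {magic!r}")
    return {
        "version": version,
        "compression": compression,
        "record_count": count,
        "start_offset": start,
        "end_offset": end,
    }


def parse_footer(buf: bytes) -> dict:
    """Decode the footer from the last 16 bytes of a segment."""
    if len(buf) < FOOTER.size:
        raise ValueError("buffer is shorter than a footer")
    index_offset, crc32, magic = FOOTER.unpack_from(buf, len(buf) - FOOTER.size)
    if magic != b"ENDS":
        raise ValueError(f"bad footer magic: {magic!r}")
    return {"index_offset": index_offset, "crc32": crc32}


# Build a tiny segment with empty data and index sections. Sample values only.
header = HEADER.pack(b"STRM", 1, COMPRESSION_LZ4, 1000, 5000, 5999)
footer = FOOTER.pack(len(header), 0, b"ENDS")
segment = header + footer

print(HEADER.size, FOOTER.size)
print(parse_header(segment))
print(parse_footer(segment))
```

Since the footer sits at a known distance from the end of the file, a reader working against S3 could fetch just the trailing 16 bytes with a ranged read, then fetch the index starting at the reported offset.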

Segment Sizing

The target segment size is 64 MB (configurable). Smaller segments increase metadata overhead, because there are more objects to track, but reduce read amplification. Larger segments store data more efficiently but raise the minimum amount of data a read has to fetch. For most workloads, 64 MB provides a good balance.
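
To make the trade-off concrete, the sketch below estimates how many segments a partition produces and how many sparse index entries each segment carries at different target sizes. It is back-of-the-envelope arithmetic under stated assumptions: "64MB" is read as 64 MiB of stored data, the average record size is the size as stored, and the function name `estimate` and the 10 GiB and 1 KiB inputs are illustrative only.

```python
TARGET_SEGMENT_BYTES = 64 * 1024 * 1024  # assumption: "64MB" read as 64 MiB
INDEX_INTERVAL = 1000  # one sparse index entry per 1000 records


def ceil_div(a: int, b: int) -> int:
    """Integer division rounded up."""
    return (a + b - 1) // b


def estimate(partition_bytes: int, avg_record_bytes: int,
             segment_bytes: int = TARGET_SEGMENT_BYTES) -> dict:
    """Rough segment and index counts for one partition."""
    records_per_segment = segment_bytes // avg_record_bytes
    return {
        "segments": ceil_div(partition_bytes, segment_bytes),
        "records_per_segment": records_per_segment,
        "index_entries_per_segment": ceil_div(records_per_segment, INDEX_INTERVAL),
    }


# Compare three target sizes for a 10 GiB partition of ~1 KiB records.
ten_gib = 10 * 1024**3
for size_mib in (16, 64, 256):
    print(size_mib, estimate(ten_gib, 1024, size_mib * 1024 * 1024))
```

Under these assumptions, dropping the target from 64 MiB to 16 MiB quadruples the number of segments to track, while raising it to 256 MiB cuts the count to a quarter at the cost of larger objects per read.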