Deep Dive · January 22, 2024 · 9 min read

The Stateless Agent: How StreamHouse Scales Without Disks

StreamHouse agents hold no persistent state — no disks, no replication, no rebalancing. Here's how lease coordination, failure detection, and S3 make this possible.

Why Stateless?

The single biggest operational burden in running Apache Kafka is managing broker state. Each broker owns partitions, stores replicas on local disk, and must coordinate leader election with other brokers. Adding or removing a broker triggers a complex rebalancing process that can take hours.

StreamHouse agents are stateless. You can start one, stop one, or replace one without any data migration. This post explains how.

What an Agent Actually Does

A StreamHouse agent is a Rust process that handles three things:

  1. Accept produce requests: Buffer records in memory, flush segments to S3
  2. Serve consume requests: Find segments in S3 (or cache), decompress, return records
  3. Coordinate with peers: Acquire leases to avoid duplicate work

That's it. The agent doesn't own partitions. It doesn't replicate data. It doesn't store anything permanently. When it restarts, it starts fresh.
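To make step 1 concrete, here is a minimal sketch of a per-partition write buffer, assuming a simple size-based flush trigger; the struct, method names, and threshold are illustrative, and the actual S3 upload is omitted:

```rust
/// In-memory buffer for a single partition (illustrative sketch,
/// not StreamHouse's actual types).
struct WriteBuffer {
    records: Vec<Vec<u8>>,
    bytes: usize,
    flush_threshold: usize,
}

impl WriteBuffer {
    fn new(flush_threshold: usize) -> Self {
        Self { records: Vec::new(), bytes: 0, flush_threshold }
    }

    /// Buffer one record. When the size threshold is crossed, the
    /// accumulated records are handed back as a segment to upload,
    /// and the buffer resets.
    fn append(&mut self, record: Vec<u8>) -> Option<Vec<Vec<u8>>> {
        self.bytes += record.len();
        self.records.push(record);
        if self.bytes >= self.flush_threshold {
            self.bytes = 0;
            Some(std::mem::take(&mut self.records))
        } else {
            None
        }
    }
}
```

A production buffer would likely also flush on a time interval, so low-traffic partitions don't sit in memory indefinitely.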

┌─────────────────────────────────────────┐
│ StreamHouse Agent                       │
│                                         │
│  ┌──────────────────────────────────┐   │
│  │ Writer Pool                      │   │
│  │  Partition 0: [records...] 23MB  │   │
│  │  Partition 3: [records...] 45MB  │   │
│  │  Partition 7: [records...] 12MB  │   │
│  └──────────────────────────────────┘   │
│                                         │
│  ┌──────────────────────────────────┐   │
│  │ Segment Cache (LRU)              │   │
│  │  512MB of recently read segments │   │
│  └──────────────────────────────────┘   │
│                                         │
│  ┌──────────────────────────────────┐   │
│  │ Metadata Cache (LRU)             │   │
│  │  Topics: 5 min TTL               │   │
│  │  Partitions: 30s TTL             │   │
│  └──────────────────────────────────┘   │
│                                         │
│  State: NONE on disk                    │
└─────────────────────────────────────────┘

Lease-Based Coordination

The challenge with stateless agents is coordination: if any agent can write to any partition, how do you prevent two agents from creating conflicting segments for the same partition?

The answer is leases. Before an agent flushes a segment for a partition, it acquires a lease from PostgreSQL:

-- Simplified lease acquisition
SELECT * FROM partition_leases
WHERE partition_id = $1 AND topic_id = $2
FOR UPDATE SKIP LOCKED;

-- If no active lease, acquire one
INSERT INTO partition_leases (partition_id, topic_id, agent_id, expires_at)
VALUES ($1, $2, $3, NOW() + INTERVAL '30 seconds')
ON CONFLICT (partition_id, topic_id)
DO UPDATE SET agent_id = $3, expires_at = NOW() + INTERVAL '30 seconds'
WHERE partition_leases.expires_at < NOW();

Leases have a 30-second TTL and are renewed every 10 seconds via heartbeat. If an agent crashes, its leases expire and other agents can take over.
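The acquire-or-take-over rule can be sketched as pure logic, leaving out the PostgreSQL round trip so the expiry semantics are visible; `Lease` and `try_acquire` are illustrative names, not StreamHouse's actual API:

```rust
use std::time::{Duration, Instant};

const LEASE_TTL: Duration = Duration::from_secs(30);

struct Lease {
    agent_id: String,
    expires_at: Instant,
}

/// Try to acquire the lease for a partition. Succeeds when there is no
/// lease, the existing lease has expired, or we already hold it (renewal);
/// a live lease held by another agent blocks acquisition.
fn try_acquire(current: Option<&Lease>, agent_id: &str, now: Instant) -> Option<Lease> {
    match current {
        // Live lease held by a peer: back off.
        Some(l) if l.expires_at > now && l.agent_id != agent_id => None,
        // Free, expired, or our own lease: (re)acquire with a fresh TTL.
        _ => Some(Lease {
            agent_id: agent_id.to_string(),
            expires_at: now + LEASE_TTL,
        }),
    }
}
```

In the real system this decision runs atomically inside PostgreSQL (the `ON CONFLICT ... WHERE expires_at < NOW()` clause above), so two agents can never both see the lease as free.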

Failure Detection

StreamHouse uses multiple mechanisms to detect agent failures:

1. Lease expiration: If an agent stops heartbeating, its leases expire in 30 seconds. Other agents detect this on their next coordination cycle.

2. Health endpoints: Kubernetes liveness and readiness probes check agent health every 10 seconds.

livenessProbe:
  httpGet:
    path: /health/live
    port: 8080
  periodSeconds: 10
  failureThreshold: 3

readinessProbe:
  httpGet:
    path: /health/ready
    port: 8080
  periodSeconds: 5

3. Connection monitoring: The agent monitors its PostgreSQL connection. If metadata queries fail for 3 consecutive attempts, the agent marks itself as unhealthy.

The maximum detection latency is bounded by the lease TTL: 30 seconds in the worst case. In practice, Kubernetes restarts unhealthy agents in under 30 seconds.
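Mechanism 3 reduces to a small state machine: consecutive failures accumulate, and any success resets the count. A minimal sketch, with illustrative names rather than StreamHouse's actual API:

```rust
/// Tracks the health of the agent's PostgreSQL connection.
struct ConnectionHealth {
    consecutive_failures: u32,
    threshold: u32,
}

impl ConnectionHealth {
    fn new(threshold: u32) -> Self {
        Self { consecutive_failures: 0, threshold }
    }

    /// Record the outcome of one metadata query.
    fn record(&mut self, ok: bool) {
        if ok {
            self.consecutive_failures = 0;
        } else {
            self.consecutive_failures += 1;
        }
    }

    /// What a readiness probe would report: healthy until the
    /// failure threshold (3 in the text above) is reached.
    fn is_healthy(&self) -> bool {
        self.consecutive_failures < self.threshold
    }
}
```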

Scaling: Just Add Agents

Because agents are stateless, scaling is trivial:

# Scale from 3 to 10 agents
kubectl scale deployment streamhouse --replicas=10

That command is the entire procedure. There is no:

  • Partition reassignment
  • Data rebalancing
  • Leader election
  • Replica synchronization
Every new agent immediately starts accepting requests. The load balancer distributes traffic evenly. Each agent independently acquires leases for the partitions it writes to.

Auto-Scaling

For production deployments, use Kubernetes HPA based on CPU or custom metrics:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: streamhouse-agents
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: streamhouse
  minReplicas: 3
  maxReplicas: 50
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70

The Metadata Cache

Agents cache frequently accessed metadata to minimize PostgreSQL load:

  • Topic metadata: Cached for 5 minutes (topic config rarely changes)
  • Partition info: Cached for 30 seconds (high watermarks change frequently)
  • Segment index: In-memory BTreeMap for offset lookups (100x faster than DB queries)

This caching layer achieves 90%+ hit rates, reducing PostgreSQL load by 10-20x. Entries are invalidated when their TTL expires, and writes go through the cache so it stays consistent with PostgreSQL.
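The segment index in the last bullet shows why an ordered map fits: finding the segment that contains a consumer's target offset is a single range query for the greatest base offset at or below it. A sketch, with illustrative key/value types:

```rust
use std::collections::BTreeMap;

/// Given an index mapping each segment's base offset to its location,
/// return the segment that would contain `offset`: the entry with the
/// greatest base offset <= `offset`. One O(log n) in-memory lookup,
/// versus a round trip to PostgreSQL.
fn find_segment(index: &BTreeMap<u64, String>, offset: u64) -> Option<&String> {
    index.range(..=offset).next_back().map(|(_, location)| location)
}
```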

Performance Characteristics

A single StreamHouse agent (4 CPU, 4GB RAM):

  • Produce throughput: 50,000+ records/sec
  • Consume throughput: 200,000+ records/sec
  • Produce latency: <1ms P50, <5ms P99 (buffered, not including S3 flush)
  • Consume latency: 5ms P50, 10ms P99 (cached segments)
  • Memory usage: ~2GB (write buffers + segment cache)

Three agents behind a load balancer comfortably handle 150,000 produces/sec and 600,000 consumes/sec.

What This Means for Operations

Running StreamHouse agents is fundamentally different from running Kafka brokers:

| | Kafka Brokers | StreamHouse Agents |
|---|---|---|
| Scaling | Partition rebalancing (hours) | kubectl scale (seconds) |
| Failure recovery | ISR catch-up (minutes) | Lease expiry (30 seconds) |
| Disk management | Monitor IOPS, plan capacity | No disks needed |
| Upgrades | Rolling restart with care | Rolling restart, any order |
| Cost | Provisioned for peak | Scale to actual load |

The stateless agent architecture is the reason StreamHouse is operationally simpler than Kafka. No state means no state to lose, no state to replicate, and no state to rebalance.

Tags: agents, architecture, scaling, kubernetes, operations
