Alerting

Configure alerts for StreamHouse operational issues.

Alert Strategy

Effective alerting focuses on symptoms (what users experience) rather than causes. Alert on high latency, rising consumer lag, and elevated error rates, then use dashboards and logs to investigate causes once you have been paged.

Recommended Alert Rules

Start with these Prometheus alerting rules for a production StreamHouse deployment.

yaml
groups:
  - name: streamhouse
    rules:
      - alert: HighProduceLatency
        expr: histogram_quantile(0.99, rate(streamhouse_produce_latency_seconds_bucket[5m])) > 0.5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "P99 produce latency above 500ms"

      - alert: ConsumerLagRising
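        # delta() rather than increase(): consumer lag is assumed to be a gauge, and increase() is only meaningful for counters
        expr: delta(streamhouse_consumer_lag[10m]) > 10000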
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Consumer lag rising rapidly"

      - alert: S3ErrorRate
        expr: rate(streamhouse_s3_errors_total[5m]) > 0.01
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "S3 error rate above 0.01 errors/s"

      - alert: AgentDown
        expr: up{job="streamhouse"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "StreamHouse agent is down"
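
Rules like these can be checked before deployment with `promtool test rules`. The following is a minimal sketch of a unit test for the AgentDown rule. It assumes the rules above are saved as `alerts.yml` next to the test file; that file name and the `agent-1:9090` instance label are illustrative, not part of StreamHouse.

```yaml
# agent_down_test.yml (hypothetical file name)
# Run with: promtool test rules agent_down_test.yml
rule_files:
  - alerts.yml            # assumed location of the rules shown above

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # The target is up for two samples, then down from minute 2 onward.
      - series: 'up{job="streamhouse", instance="agent-1:9090"}'
        values: '1 1 0 0 0 0'
    alert_rule_test:
      # By minute 5 the target has been down longer than the 1m "for" window,
      # so the alert is expected to be firing.
      - eval_time: 5m
        alertname: AgentDown
        exp_alerts:
          - exp_labels:
              severity: critical
              job: streamhouse
              instance: agent-1:9090
            exp_annotations:
              summary: "StreamHouse agent is down"
```

A similar test for the latency or lag rules would need synthetic histogram or gauge series, which is more verbose but follows the same structure.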

Runbook Links

Each alert should link to a runbook with investigation steps.

  • HighProduceLatency: Check S3 upload times, agent CPU, and metadata query latency. May need to scale agents.
  • ConsumerLagRising: Check if consumers are healthy, if processing is slow, or if produce rate has spiked.
  • S3ErrorRate: Check AWS Service Health Dashboard, verify IAM permissions, and check for S3 throttling (503 Slow Down responses).
  • AgentDown: Check container/pod status, review logs for crash reasons, verify network connectivity.
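
One common way to attach a runbook to an alert is an extra annotation on the rule. The sketch below adds a `runbook_url` annotation to the AgentDown rule. The annotation name is a widely used convention rather than something Prometheus interprets itself, and the URL is a placeholder, so both should be adapted to your notification templates and to wherever your runbooks are hosted.

```yaml
groups:
  - name: streamhouse
    rules:
      - alert: AgentDown
        expr: up{job="streamhouse"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "StreamHouse agent is down"
          # Hypothetical URL; replace with the real location of your runbook.
          runbook_url: "https://runbooks.example.com/streamhouse/agent-down"
```

Annotations are passed through to Alertmanager, so a notification template can render the link alongside the summary.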