Alerting
Configure alerts for StreamHouse operational issues.
Alert Strategy
Effective alerting focuses on symptoms (what users experience) rather than causes. Alert on high latency, rising consumer lag, and error rates. Investigate causes using dashboards and logs after being paged.
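One way to apply this strategy is to page only on the most severe symptoms and send everything else to a non-paging channel. The sketch below is an Alertmanager routing fragment that assumes alerts carry a severity label of warning or critical, as the rules in this guide do. The receiver names team-chat and on-call-pager are hypothetical, and the notification integrations for each receiver are left out.

```yaml
# Alertmanager routing sketch (receiver names are hypothetical).
route:
  receiver: team-chat            # default: non-paging destination
  group_by: ['alertname']
  routes:
    - matchers:
        - severity="critical"    # only critical alerts page the on-call engineer
      receiver: on-call-pager

receivers:
  - name: team-chat              # add your chat integration config here
  - name: on-call-pager          # add your paging integration config here
```

With this layout, warning alerts stay visible without waking anyone, and only critical alerts interrupt the on-call engineer.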
Recommended Alert Rules
Start with these Prometheus alerting rules for a production StreamHouse deployment.
yaml
groups:
  - name: streamhouse
    rules:
      - alert: HighProduceLatency
        expr: histogram_quantile(0.99, rate(streamhouse_produce_latency_seconds_bucket[5m])) > 0.5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "P99 produce latency above 500ms"
      - alert: ConsumerLagRising
        # Lag is a gauge, so use delta(); increase() is only valid for counters.
        expr: delta(streamhouse_consumer_lag[10m]) > 10000
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Consumer lag rising rapidly"
      - alert: S3ErrorRate
        # rate() yields errors per second, not a percentage of requests.
        expr: rate(streamhouse_s3_errors_total[5m]) > 0.01
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "S3 errors above 0.01 per second"
      - alert: AgentDown
        expr: up{job="streamhouse"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "StreamHouse agent is down"
Runbook Links
Each alert should link to a runbook with investigation steps.
- HighProduceLatency: Check S3 upload times, agent CPU, and metadata query latency. May need to scale agents.
- ConsumerLagRising: Check if consumers are healthy, if processing is slow, or if produce rate has spiked.
- S3ErrorRate: Check AWS Service Health Dashboard, verify IAM permissions, and check for S3 throttling (503 Slow Down responses).
- AgentDown: Check container/pod status, review logs for crash reasons, verify network connectivity.
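A common way to attach a runbook is an extra annotation on the rule, so the link travels with every notification. The sketch below adds one to the AgentDown rule. The annotation name runbook_url is a widely used convention rather than something Prometheus interprets, and the URL is a hypothetical placeholder.

```yaml
groups:
  - name: streamhouse
    rules:
      - alert: AgentDown
        expr: up{job="streamhouse"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "StreamHouse agent is down"
          # Hypothetical URL: point this at wherever your runbooks live.
          runbook_url: "https://runbooks.example.com/streamhouse/agent-down"
```

The same annotation can be added to the other rules, with each URL pointing at the investigation steps listed above.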