Health & Metrics
Monitor agent health and performance with Prometheus metrics.
Health Endpoints
Each agent exposes health check endpoints for liveness and readiness probes, compatible with Kubernetes health checks.
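As an illustration of how a deployment script or custom probe might consume the readiness endpoint documented in this section, here is a minimal Python sketch. It assumes that every field other than "status" in the /health/ready response names a dependency whose healthy value is "connected", which is inferred from the sample response rather than from a published schema. The function names and the base URL are hypothetical.

```python
import json
import urllib.request


def is_ready(payload: dict) -> bool:
    """Return True when status is "ready" and every other field is "connected".

    Assumption: all non-"status" fields are dependency states (as in the
    sample response); this is not a documented schema.
    """
    if payload.get("status") != "ready":
        return False
    dependencies = {k: v for k, v in payload.items() if k != "status"}
    return all(state == "connected" for state in dependencies.values())


def fetch_readiness(base_url: str, timeout: float = 2.0) -> bool:
    """GET <base_url>/health/ready; any network or parse error counts as not ready."""
    try:
        with urllib.request.urlopen(f"{base_url}/health/ready", timeout=timeout) as resp:
            return resp.status == 200 and is_ready(json.load(resp))
    except (OSError, ValueError):
        # URLError/HTTPError derive from OSError; JSON decode errors from ValueError.
        return False
```

For example, fetch_readiness("http://localhost:8080") would return True only if the agent answers with a ready status and all dependencies connected. The host and port here are placeholders; use whatever address your agent listens on.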
# Liveness probe - agent is running
GET /health/live
# Response: {"status": "ok"}
# Readiness probe - agent can serve requests
GET /health/ready
# Response: {"status": "ready", "metadata": "connected", "storage": "connected"}
# Detailed health with metrics
GET /api/health
# Response: {"status": "healthy", "version": "0.1.0", "uptime_seconds": 3600}
Prometheus Metrics
Agents export Prometheus metrics at the /metrics endpoint. These metrics cover request rates, latencies, storage operations, and cache hit rates.
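For quick checks without a full Prometheus server, the /metrics output can be read directly, since it uses the Prometheus text exposition format. The sketch below is a deliberately small parser, not a complete implementation of the format: it skips comment lines, ignores timestamps, and keys each sample by its full series string. The sample payload, the topic label, and the 0.9 threshold (which assumes the ratio gauges report a fraction between 0 and 1) are illustrative assumptions.

```python
def parse_samples(text: str) -> dict[str, float]:
    """Parse Prometheus text exposition lines into {series: value}.

    Minimal sketch: "# HELP"/"# TYPE" lines are skipped and an optional
    trailing timestamp is ignored.
    """
    samples: dict[str, float] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if "}" in line:
            # Series is the metric name plus its {labels} block.
            end = line.rindex("}") + 1
            series, rest = line[:end], line[end:]
        else:
            series, _, rest = line.partition(" ")
        fields = rest.split()
        if fields:
            samples[series] = float(fields[0])
    return samples


def cache_below_target(samples: dict[str, float], target: float = 0.9) -> bool:
    """True if the segment cache hit ratio is reported and below the target.

    Assumption: the gauge is a fraction in [0, 1], so 0.9 means 90%.
    """
    ratio = samples.get("streamhouse_segment_cache_hit_ratio")
    return ratio is not None and ratio < target


# Made-up payload for illustration; real output will differ.
SAMPLE = """\
# TYPE streamhouse_active_connections gauge
streamhouse_active_connections 42
streamhouse_segment_cache_hit_ratio 0.87
streamhouse_produce_requests_total{topic="orders"} 1027
"""

samples = parse_samples(SAMPLE)
print(samples["streamhouse_active_connections"], cache_below_target(samples))
```

In production, a maintained parser or a Prometheus scrape job is the safer choice; this sketch is only meant to show the shape of the data.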
# Key metrics to monitor
streamhouse_produce_requests_total # Total produce requests
streamhouse_produce_latency_seconds # Produce latency histogram
streamhouse_consume_requests_total # Total consume requests
streamhouse_consume_latency_seconds # Consume latency histogram
streamhouse_s3_upload_bytes_total # Total bytes uploaded to S3
streamhouse_s3_download_bytes_total # Total bytes downloaded from S3
streamhouse_segment_cache_hit_ratio # Segment cache hit rate (aim for >90%)
streamhouse_metadata_cache_hit_ratio # Metadata cache hit rate
streamhouse_active_connections # Current active connections
streamhouse_writer_buffer_bytes # Current write buffer usage
Grafana Dashboard
StreamHouse includes pre-built Grafana dashboards for monitoring. Import the dashboard JSON from the repository, or use the web console's built-in monitoring page.
- Overview: Request rate, error rate, latency P50/P95/P99
- Producers: Produce throughput, batch sizes, flush rates
- Consumers: Consumer lag, fetch latency, partition assignments
- Storage: S3 upload/download rates, segment sizes, cache hit ratios
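The P50/P95/P99 latency panels are derived from histogram metrics such as streamhouse_produce_latency_seconds. The queries used by the bundled dashboards are not shown here, so the following Python sketch only illustrates the general technique: estimating a quantile from cumulative histogram buckets by linear interpolation within a bucket, in the spirit of PromQL's histogram_quantile. The bucket boundaries and counts are invented for the example.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate the q-quantile from cumulative buckets.

    buckets: (upper_bound, cumulative_count) pairs sorted by upper bound,
    ending with the +Inf bucket. The lowest bucket is assumed to start at 0.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Quantile falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            # Interpolate linearly between the bucket's lower and upper bounds.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical latency buckets in seconds: (le, cumulative count).
BUCKETS = [(0.005, 50), (0.01, 80), (0.05, 95), (0.1, 100), (math.inf, 100)]

for q in (0.5, 0.95, 0.99):
    print(f"P{round(q * 100)}: {histogram_quantile(q, BUCKETS):.4f}s")
```

Because the estimate is interpolated, its accuracy depends on how finely the buckets are spaced around the latencies of interest.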