Use Case · March 5, 2024 · 8 min read

Replacing Your Log Pipeline: StreamHouse for Centralized Log Aggregation

Elasticsearch is expensive. Kafka + Fluentd is complex. Here's how StreamHouse replaces your entire log pipeline with a single system — ingest, store, query, and alert on logs with SQL.

The Log Pipeline Problem

Most companies run a log pipeline that looks like this:

Apps → Fluentd/Vector → Kafka → Logstash → Elasticsearch → Kibana

That's five systems to manage, each with its own configuration, scaling concerns, and failure modes. Elasticsearch alone can cost $10,000+/month for moderate log volumes, and operating it requires dedicated expertise.

What if you could replace this entire stack with one system?

StreamHouse as a Log Platform

StreamHouse's combination of cheap S3 storage, built-in SQL, and high-throughput ingestion makes it a natural fit for log aggregation:

Apps → StreamHouse → SQL Queries + Grafana

Why It Works

Cost: Logs are stored in S3 at $0.023/GB/month with LZ4 compression (typically ~8x for text logs). 1TB of raw logs becomes ~125GB on S3, costing ~$3/month. The same terabyte in Elasticsearch, replicated and on hot storage, typically runs to hundreds of dollars per month.
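The arithmetic is easy to verify. A quick back-of-the-envelope check (S3 Standard pricing and the ~8x compression figure from above):

```python
# Sanity-check the storage math quoted above.
raw_gb = 1000                    # 1 TB of raw logs
compression_ratio = 8            # ~8x, the figure quoted for LZ4 on text logs
s3_price_per_gb_month = 0.023    # S3 Standard pricing, USD per GB-month

compressed_gb = raw_gb / compression_ratio        # ~125 GB lands on S3
monthly_cost = compressed_gb * s3_price_per_gb_month

print(f"~{compressed_gb:.0f} GB on S3, ~${monthly_cost:.2f}/month")
```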

Throughput: A single agent ingests 50,000+ log lines/sec. Three agents handle 150,000+ logs/sec — enough for most organizations.

Retention: Store months or years of logs cheaply. Set different retention for different log topics:

# Debug logs: keep 7 days
streamctl topic create --name logs-debug --partitions 6 --retention 7d --compression zstd

# Application logs: keep 30 days
streamctl topic create --name logs-app --partitions 12 --retention 30d --compression lz4

# Audit logs: keep forever
streamctl topic create --name logs-audit --partitions 6 --retention infinite --compression zstd

Queryability: SQL. No query DSL to learn.

Setting It Up

Step 1: Ship Logs to StreamHouse

Use Vector (our recommended log collector) or any HTTP/gRPC client:

# vector.toml
[sources.app_logs]
type = "file"
include = ["/var/log/app/*.log"]

[transforms.parse]
type = "remap"
inputs = ["app_logs"]
source = '''
. = parse_json!(.message)
.timestamp = now()
.hostname = get_hostname!()
'''

[sinks.streamhouse]
type = "http"
inputs = ["parse"]
uri = "http://streamhouse:8080/api/topics/logs-app/produce"
encoding.codec = "json"
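Vector is the easiest path, but any HTTP client works. Here's a minimal Python sketch; the endpoint URI is copied from the Vector sink above, while `build_log_record` and `ship` are illustrative helpers, not part of any official client:

```python
import json
import socket
from datetime import datetime, timezone
from urllib import request

def build_log_record(service: str, level: str, message: str, **fields) -> dict:
    """Assemble a JSON record matching the app_logs schema defined in Step 2."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "hostname": socket.gethostname(),
        "service": service,
        "level": level,
        "message": message,
        **fields,
    }

def ship(record: dict,
         url: str = "http://streamhouse:8080/api/topics/logs-app/produce") -> None:
    """POST one record to the produce endpoint used by the Vector sink above."""
    req = request.Request(url, data=json.dumps(record).encode(),
                          headers={"Content-Type": "application/json"})
    request.urlopen(req)  # raises on non-2xx responses

record = build_log_record("checkout", "error", "payment timeout", duration_ms=1450.0)
# ship(record)  # requires a running StreamHouse endpoint
```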

Step 2: Create a SQL Stream

CREATE STREAM app_logs (
  timestamp TIMESTAMP,
  hostname VARCHAR,
  service VARCHAR,
  level VARCHAR,
  message VARCHAR,
  trace_id VARCHAR,
  duration_ms DOUBLE
) WITH (topic = 'logs-app', format = 'json');

Step 3: Query Your Logs

-- Find errors in the last hour
SELECT timestamp, service, message
FROM app_logs
WHERE level = 'error'
  AND timestamp > NOW() - INTERVAL '1 hour'
ORDER BY timestamp DESC
LIMIT 100;

-- Error rate per service per 5 minutes
SELECT
  service,
  count(*) FILTER (WHERE level = 'error') as errors,
  count(*) as total,
  round(100.0 * count(*) FILTER (WHERE level = 'error') / count(*), 2) as error_rate_pct,
  window_start
FROM TUMBLE(app_logs, timestamp, INTERVAL '5 minutes')
GROUP BY service, window_start, window_end
ORDER BY error_rate_pct DESC;

-- Slow requests by endpoint
SELECT
  message,
  avg(duration_ms) as avg_duration,
  max(duration_ms) as max_duration,
  count(*) as request_count
FROM app_logs
WHERE duration_ms > 1000
GROUP BY message
ORDER BY avg_duration DESC;

Step 4: Set Up Alerts

Create continuous queries that detect anomalies and write to an alerts topic:

CREATE CONTINUOUS QUERY error_spike_detector AS
  SELECT
    service,
    count(*) as error_count,
    window_start
  FROM TUMBLE(app_logs, timestamp, INTERVAL '1 minute')
  WHERE level = 'error'
  GROUP BY service, window_start, window_end
  HAVING count(*) > 100
  OUTPUT TO 'alerts-error-spikes';

Connect the alerts topic to PagerDuty, Slack, or any webhook endpoint.
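A small forwarder process can relay each event on the alerts topic to a webhook. This is a sketch, not a built-in feature: the event fields come from the continuous query above, but `format_alert`, the Slack-style payload shape, and how you consume the topic are all up to you:

```python
import json
from urllib import request

def format_alert(event: dict) -> dict:
    """Turn an error-spike event from 'alerts-error-spikes' into a Slack-style payload."""
    return {"text": (f":rotating_light: {event['service']}: "
                     f"{event['error_count']} errors in the window starting "
                     f"{event['window_start']}")}

def forward(event: dict, webhook_url: str) -> None:
    """POST the formatted alert to a Slack/PagerDuty-compatible webhook."""
    body = json.dumps(format_alert(event)).encode()
    req = request.Request(webhook_url, data=body,
                          headers={"Content-Type": "application/json"})
    request.urlopen(req)

msg = format_alert({"service": "checkout", "error_count": 132,
                    "window_start": "2024-03-05T10:00:00Z"})
# forward(event, "https://hooks.slack.com/services/...")  # requires a real webhook URL
```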

Cost Comparison

For 500GB/day of raw logs (a mid-size company):

| Component | ELK Stack | StreamHouse |
|---|---|---|
| Ingestion | Kafka + Logstash: $800/mo | Included |
| Storage (30d) | Elasticsearch: $7,500/mo | S3: $45/mo |
| Compute | ES nodes: $2,000/mo | 3 agents: $300/mo |
| Total | $10,300/mo | $345/mo |

That's a 97% cost reduction. Even accounting for S3 API costs (PUT/GET), StreamHouse comes in at under $500/month for this workload.
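The table's totals and the headline reduction check out:

```python
# Monthly figures from the cost comparison table, in USD.
elk = {"ingestion": 800, "storage_30d": 7500, "compute": 2000}
streamhouse = {"ingestion": 0, "storage_30d": 45, "compute": 300}

elk_total = sum(elk.values())                # 10,300
sh_total = sum(streamhouse.values())         # 345
reduction = 100 * (1 - sh_total / elk_total)

print(f"${elk_total}/mo vs ${sh_total}/mo: {reduction:.1f}% cheaper")
```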

Trade-offs

StreamHouse isn't a full-text search engine. It won't give you Elasticsearch's relevance ranking or fuzzy matching. For most log use cases — finding errors, tracking requests, computing metrics — SQL is more than sufficient.

If you need full-text search on a subset of logs, consider routing high-value logs to a smaller Elasticsearch cluster while sending everything to StreamHouse for cheap, long-term retention.
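One way to implement that split is a Vector `route` transform on top of the config from Step 1. A sketch, assuming the `parse` transform defined earlier; the route condition, sink names, and Elasticsearch settings are placeholders to adapt:

```toml
[transforms.split]
type = "route"
inputs = ["parse"]
route.high_value = '.level == "error" || exists(.trace_id)'

# Everything goes to StreamHouse for cheap long-term retention
[sinks.streamhouse_all]
type = "http"
inputs = ["parse"]
uri = "http://streamhouse:8080/api/topics/logs-app/produce"
encoding.codec = "json"

# Only high-value logs go to the smaller Elasticsearch cluster
[sinks.es_search]
type = "elasticsearch"
inputs = ["split.high_value"]
endpoints = ["http://elasticsearch:9200"]
bulk.index = "logs-high-value"
```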

Getting Started

# 1. Deploy StreamHouse
docker compose up -d

# 2. Create log topics
streamctl topic create --name logs-app --partitions 12 --retention 30d --compression lz4

# 3. Point your log shipper at StreamHouse
# 4. Query with SQL
streamctl sql --interactive

Your logs deserve better than a $10,000/month Elasticsearch bill.

Tags: logs, observability, use-case, cost-savings, sql
