Use Case · March 5, 2024 · 8 min read

Replacing Your Log Pipeline: StreamHouse for Centralized Log Aggregation

Elasticsearch is expensive. Kafka + Fluentd is complex. Here's how StreamHouse replaces your entire log pipeline with a single system — ingest, store, query, and alert on logs with SQL.

The Log Pipeline Problem

Most companies run a log pipeline that looks like this:

Apps → Fluentd/Vector → Kafka → Logstash → Elasticsearch → Kibana

That's five systems to manage, each with its own configuration, scaling concerns, and failure modes. Elasticsearch alone can cost $10,000+/month for moderate log volumes, and operating it requires dedicated expertise.

What if you could replace this entire stack with one system?

StreamHouse as a Log Platform

StreamHouse's combination of cheap S3 storage, built-in SQL, and high-throughput ingestion makes it a natural fit for log aggregation:

Apps → StreamHouse → SQL Queries + Grafana

Why It Works

Cost: Logs are stored in S3 at $0.023/GB/month with LZ4 compression (typically ~8x for text logs). 1TB of raw logs becomes ~125GB on S3, costing ~$3/month. The same terabyte in Elasticsearch, replicated and on hot storage, typically runs to hundreds of dollars per month.
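The arithmetic is easy to verify. A quick back-of-the-envelope check (S3 Standard pricing and the ~8x compression figure from above):

```python
# Sanity-check the storage math quoted above.
raw_gb = 1000                    # 1 TB of raw logs
compression_ratio = 8            # ~8x, the figure quoted for LZ4 on text logs
s3_price_per_gb_month = 0.023    # S3 Standard pricing, USD per GB-month

compressed_gb = raw_gb / compression_ratio        # ~125 GB lands on S3
monthly_cost = compressed_gb * s3_price_per_gb_month

print(f"~{compressed_gb:.0f} GB on S3, ~${monthly_cost:.2f}/month")
```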

Throughput: A single agent ingests 50,000+ log lines/sec. Three agents handle 150,000+ logs/sec — enough for most organizations.

Retention: Store months or years of logs cheaply. Set different retention for different log topics:

# Debug logs: keep 7 days
streamctl topic create --name logs-debug --partitions 6 --retention 7d --compression zstd

# Application logs: keep 30 days
streamctl topic create --name logs-app --partitions 12 --retention 30d --compression lz4

# Audit logs: keep forever
streamctl topic create --name logs-audit --partitions 6 --retention infinite --compression zstd

Queryability: SQL. No query DSL to learn.

Setting It Up

Step 1: Ship Logs to StreamHouse

Use Vector (our recommended log collector) or any HTTP/gRPC client:

# vector.toml
[sources.app_logs]
type = "file"
include = ["/var/log/app/*.log"]

[transforms.parse]
type = "remap"
inputs = ["app_logs"]
source = '''
. = parse_json!(.message)
.timestamp = now()
.hostname = get_hostname!()
'''

[sinks.streamhouse]
type = "http"
inputs = ["parse"]
uri = "http://streamhouse:8080/api/topics/logs-app/produce"
encoding.codec = "json"
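Vector is the easiest path, but any HTTP client works. Here's a minimal Python sketch; the endpoint URI is copied from the Vector sink above, while `build_log_record` and `ship` are illustrative helpers, not part of any official client:

```python
import json
import socket
from datetime import datetime, timezone
from urllib import request

def build_log_record(service: str, level: str, message: str, **fields) -> dict:
    """Assemble a JSON record matching the app_logs schema defined in Step 2."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "hostname": socket.gethostname(),
        "service": service,
        "level": level,
        "message": message,
        **fields,
    }

def ship(record: dict,
         url: str = "http://streamhouse:8080/api/topics/logs-app/produce") -> None:
    """POST one record to the produce endpoint used by the Vector sink above."""
    req = request.Request(url, data=json.dumps(record).encode(),
                          headers={"Content-Type": "application/json"})
    request.urlopen(req)  # raises on non-2xx responses

record = build_log_record("checkout", "error", "payment timeout", duration_ms=1450.0)
# ship(record)  # requires a running StreamHouse endpoint
```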

Step 2: Create a SQL Stream

CREATE STREAM app_logs (
  timestamp TIMESTAMP,
  hostname VARCHAR,
  service VARCHAR,
  level VARCHAR,
  message VARCHAR,
  trace_id VARCHAR,
  duration_ms DOUBLE
) WITH (topic = 'logs-app', format = 'json');

Step 3: Query Your Logs

-- Find errors in the last hour
SELECT timestamp, service, message
FROM app_logs
WHERE level = 'error'
  AND timestamp > NOW() - INTERVAL '1 hour'
ORDER BY timestamp DESC
LIMIT 100;

-- Error rate per service per 5 minutes
SELECT
  service,
  count(*) FILTER (WHERE level = 'error') as errors,
  count(*) as total,
  round(100.0 * count(*) FILTER (WHERE level = 'error') / count(*), 2) as error_rate_pct,
  window_start
FROM TUMBLE(app_logs, timestamp, INTERVAL '5 minutes')
GROUP BY service, window_start, window_end
ORDER BY error_rate_pct DESC;

-- Slow requests by endpoint
SELECT
  message,
  avg(duration_ms) as avg_duration,
  max(duration_ms) as max_duration,
  count(*) as request_count
FROM app_logs
WHERE duration_ms > 1000
GROUP BY message
ORDER BY avg_duration DESC;

Step 4: Set Up Alerts

Create continuous queries that detect anomalies and write to an alerts topic:

CREATE CONTINUOUS QUERY error_spike_detector AS
  SELECT
    service,
    count(*) as error_count,
    window_start
  FROM TUMBLE(app_logs, timestamp, INTERVAL '1 minute')
  WHERE level = 'error'
  GROUP BY service, window_start, window_end
  HAVING count(*) > 100
  OUTPUT TO 'alerts-error-spikes';

Connect the alerts topic to PagerDuty, Slack, or any webhook endpoint.
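A small forwarder process can relay each event on the alerts topic to a webhook. This is a sketch, not a built-in feature: the event fields come from the continuous query above, but `format_alert`, the Slack-style payload shape, and how you consume the topic are all up to you:

```python
import json
from urllib import request

def format_alert(event: dict) -> dict:
    """Turn an error-spike event from 'alerts-error-spikes' into a Slack-style payload."""
    return {"text": (f":rotating_light: {event['service']}: "
                     f"{event['error_count']} errors in the window starting "
                     f"{event['window_start']}")}

def forward(event: dict, webhook_url: str) -> None:
    """POST the formatted alert to a Slack/PagerDuty-compatible webhook."""
    body = json.dumps(format_alert(event)).encode()
    req = request.Request(webhook_url, data=body,
                          headers={"Content-Type": "application/json"})
    request.urlopen(req)

msg = format_alert({"service": "checkout", "error_count": 132,
                    "window_start": "2024-03-05T10:00:00Z"})
# forward(event, "https://hooks.slack.com/services/...")  # requires a real webhook URL
```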

Cost Comparison

For 500GB/day of raw logs (a mid-size company):

| Component | ELK Stack | StreamHouse |
|---|---|---|
| Ingestion | Kafka + Logstash: $800/mo | Included |
| Storage (30d) | Elasticsearch: $7,500/mo | S3: $45/mo |
| Compute | ES nodes: $2,000/mo | 3 agents: $300/mo |
| Total | $10,300/mo | $345/mo |

That's a 97% cost reduction. Even accounting for S3 API costs (PUT/GET), StreamHouse comes in at under $500/month for this workload.
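The table's totals and the headline reduction check out:

```python
# Monthly figures from the cost comparison table, in USD.
elk = {"ingestion": 800, "storage_30d": 7500, "compute": 2000}
streamhouse = {"ingestion": 0, "storage_30d": 45, "compute": 300}

elk_total = sum(elk.values())                # 10,300
sh_total = sum(streamhouse.values())         # 345
reduction = 100 * (1 - sh_total / elk_total)

print(f"${elk_total}/mo vs ${sh_total}/mo: {reduction:.1f}% cheaper")
```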

Trade-offs

StreamHouse isn't a full-text search engine. It won't give you Elasticsearch's relevance ranking or fuzzy matching. For most log use cases — finding errors, tracking requests, computing metrics — SQL is more than sufficient.

If you need full-text search on a subset of logs, consider routing high-value logs to a smaller Elasticsearch cluster while sending everything to StreamHouse for cheap, long-term retention.
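One way to implement that split is a Vector `route` transform on top of the config from Step 1. A sketch, assuming the `parse` transform defined earlier; the route condition, sink names, and Elasticsearch settings are placeholders to adapt:

```toml
[transforms.split]
type = "route"
inputs = ["parse"]
route.high_value = '.level == "error" || exists(.trace_id)'

# Everything goes to StreamHouse for cheap long-term retention
[sinks.streamhouse_all]
type = "http"
inputs = ["parse"]
uri = "http://streamhouse:8080/api/topics/logs-app/produce"
encoding.codec = "json"

# Only high-value logs go to the smaller Elasticsearch cluster
[sinks.es_search]
type = "elasticsearch"
inputs = ["split.high_value"]
endpoints = ["http://elasticsearch:9200"]
bulk.index = "logs-high-value"
```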

Getting Started

# 1. Deploy StreamHouse
docker compose up -d

# 2. Create log topics
streamctl topic create --name logs-app --partitions 12 --retention 30d --compression lz4

# 3. Point your log shipper at StreamHouse
# 4. Query with SQL
streamctl sql --interactive

Your logs deserve better than a $10,000/month Elasticsearch bill.

Tags: logs, observability, use-case, cost-savings, sql
