The Log Pipeline Problem
Most companies run a log pipeline that looks like this:
Apps → Fluentd/Vector → Kafka → Logstash → Elasticsearch → Kibana
That's five systems to manage, each with its own configuration, scaling concerns, and failure modes. Elasticsearch alone can cost $10,000+/month for moderate log volumes, and operating it requires dedicated expertise.
What if you could replace this entire stack with one system?
StreamHouse as a Log Platform
StreamHouse's combination of cheap S3 storage, built-in SQL, and high-throughput ingestion makes it a natural fit for log aggregation:
Apps → StreamHouse → SQL Queries + Grafana
Why It Works
Cost: Logs are stored in S3 at $0.023/GB/month, and LZ4 compression typically achieves an 8:1 ratio on text logs. 1TB of raw logs becomes ~125GB on S3, costing ~$3/month. Compare that to Elasticsearch at $50-100/GB/month.
Throughput: A single agent ingests 50,000+ log lines/sec. Three agents handle 150,000+ logs/sec — enough for most organizations.
Retention: Store months or years of logs cheaply. Set different retention for different log topics:
```bash
# Debug logs: keep 7 days
streamctl topic create --name logs-debug --partitions 6 --retention 7d --compression zstd

# Application logs: keep 30 days
streamctl topic create --name logs-app --partitions 12 --retention 30d --compression lz4

# Audit logs: keep forever
streamctl topic create --name logs-audit --partitions 6 --retention infinite --compression zstd
```
Queryability: SQL. No query DSL to learn.
Setting It Up
Step 1: Ship Logs to StreamHouse
Use Vector (our recommended log collector) or any HTTP/gRPC client:
```toml
# vector.toml
[sources.app_logs]
type = "file"
include = ["/var/log/app/*.log"]

[transforms.parse]
type = "remap"
inputs = ["app_logs"]
source = '''
. = parse_json!(.message)
.timestamp = now()
.hostname = get_hostname!()
'''

[sinks.streamhouse]
type = "http"
inputs = ["parse"]
uri = "http://streamhouse:8080/api/topics/logs-app/produce"
encoding.codec = "json"
```
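Vector isn't required; the produce endpoint is plain HTTP. Here's a minimal Python sketch using only the standard library. The payload shape (a JSON array of records) is an assumption; check your deployment's produce API for the exact format:

```python
import json
import urllib.request

STREAMHOUSE_URL = "http://streamhouse:8080/api/topics/logs-app/produce"

def build_batch(records):
    """Serialize a list of log records into a JSON request body."""
    return json.dumps(records).encode("utf-8")

def ship_logs(records, url=STREAMHOUSE_URL):
    """POST a batch of log records to the StreamHouse produce endpoint."""
    req = urllib.request.Request(
        url,
        data=build_batch(records),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status

# Example (requires a running StreamHouse endpoint):
# ship_logs([{"level": "error", "service": "checkout", "message": "payment timeout"}])
```

Batching matters at high volume: one POST per log line will bottleneck on HTTP overhead long before the agent does.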
Step 2: Create a SQL Stream
```sql
CREATE STREAM app_logs (
  timestamp   TIMESTAMP,
  hostname    VARCHAR,
  service     VARCHAR,
  level       VARCHAR,
  message     VARCHAR,
  trace_id    VARCHAR,
  duration_ms DOUBLE
) WITH (topic = 'logs-app', format = 'json');
```
Step 3: Query Your Logs
```sql
-- Find errors in the last hour
SELECT timestamp, service, message
FROM app_logs
WHERE level = 'error'
  AND timestamp > NOW() - INTERVAL '1 hour'
ORDER BY timestamp DESC
LIMIT 100;

-- Error rate per service per 5 minutes
SELECT
  service,
  count(*) FILTER (WHERE level = 'error') AS errors,
  count(*) AS total,
  round(100.0 * count(*) FILTER (WHERE level = 'error') / count(*), 2) AS error_rate_pct,
  window_start
FROM TUMBLE(app_logs, timestamp, INTERVAL '5 minutes')
GROUP BY service, window_start, window_end
ORDER BY error_rate_pct DESC;

-- Slow requests by endpoint
SELECT
  message,
  avg(duration_ms) AS avg_duration,
  max(duration_ms) AS max_duration,
  count(*) AS request_count
FROM app_logs
WHERE duration_ms > 1000
GROUP BY message
ORDER BY avg_duration DESC;
```
Step 4: Set Up Alerts
Create continuous queries that detect anomalies and write to an alerts topic:
```sql
CREATE CONTINUOUS QUERY error_spike_detector AS
SELECT
  service,
  count(*) AS error_count,
  window_start
FROM TUMBLE(app_logs, timestamp, INTERVAL '1 minute')
WHERE level = 'error'
GROUP BY service, window_start, window_end
HAVING count(*) > 100
OUTPUT TO 'alerts-error-spikes';
```
Connect the alerts topic to PagerDuty, Slack, or any webhook endpoint.
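For Slack, a thin forwarder is enough. The sketch below is hedged: the `/consume` endpoint is hypothetical (substitute whatever consumer API your deployment exposes), and the webhook URL is a placeholder.

```python
import json
import urllib.request

# Hypothetical consume endpoint; replace with your deployment's consumer API.
CONSUME_URL = "http://streamhouse:8080/api/topics/alerts-error-spikes/consume"
# Placeholder; use your own Slack incoming-webhook URL.
SLACK_WEBHOOK = "https://hooks.slack.com/services/XXX/YYY/ZZZ"

def format_alert(alert):
    """Render one error-spike record as a Slack message payload."""
    text = (
        f":rotating_light: {alert['service']}: {alert['error_count']} errors "
        f"in the minute starting {alert['window_start']}"
    )
    return {"text": text}

def post_json(url, payload):
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status

def forward_alerts():
    """Drain the alerts topic and post each record to Slack."""
    with urllib.request.urlopen(CONSUME_URL) as resp:
        alerts = json.load(resp)
    for alert in alerts:
        post_json(SLACK_WEBHOOK, format_alert(alert))

# Run forward_alerts() on a schedule (cron, a loop with sleep, etc.).
```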
Cost Comparison
For 500GB/day of raw logs (a mid-size company):
| Component | ELK Stack | StreamHouse |
|---|---|---|
| Ingestion | Kafka + Logstash: $800/mo | Included |
| Storage (30d) | Elasticsearch: $7,500/mo | S3: $45/mo |
| Compute | ES nodes: $2,000/mo | 3 agents: $300/mo |
| Total | $10,300/mo | $345/mo |
That's a 97% cost reduction. Even accounting for S3 API costs (PUT/GET), StreamHouse comes in at under $500/month for this workload.
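The S3 figure falls out of the compression ratio and the S3 list price. A quick sanity check in Python, assuming the ~8x LZ4 ratio quoted earlier:

```python
S3_PRICE_PER_GB_MONTH = 0.023   # S3 standard storage list price
COMPRESSION_RATIO = 8           # typical LZ4 ratio for text logs

def monthly_s3_cost(raw_gb_per_day, retention_days):
    """S3 storage cost for compressed logs kept for retention_days."""
    stored_gb = raw_gb_per_day * retention_days / COMPRESSION_RATIO
    return stored_gb * S3_PRICE_PER_GB_MONTH

# 500 GB/day retained 30 days -> 1,875 GB stored -> ~$43/month,
# in line with the ~$45 storage line in the table above.
cost = monthly_s3_cost(500, 30)
```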
Trade-offs
StreamHouse isn't a full-text search engine. It won't give you Elasticsearch's relevance ranking or fuzzy matching. For most log use cases — finding errors, tracking requests, computing metrics — SQL is more than sufficient.
If you need full-text search on a subset of logs, consider routing high-value logs to a smaller Elasticsearch cluster while sending everything to StreamHouse for cheap, long-term retention.
Getting Started
```bash
# 1. Deploy StreamHouse
docker compose up -d

# 2. Create log topics
streamctl topic create --name logs-app --partitions 12 --retention 30d --compression lz4

# 3. Point your log shipper at StreamHouse

# 4. Query with SQL
streamctl sql --interactive
```
Your logs deserve better than a $10,000/month Elasticsearch bill.