Why SQL?
Every data engineer knows SQL. But stream processing has traditionally required learning specialized frameworks: Flink's DataStream API, Spark Structured Streaming's DataFrame API, or ksqlDB's custom dialect.
We wanted StreamHouse users to write plain SQL and have it work on streaming data. No new APIs. No new deployment infrastructure. Just SQL.
Why DataFusion?
Apache DataFusion is a query engine written in Rust, built on Apache Arrow for columnar in-memory processing. It gives us:
- SQL parser and planner: broad, standards-based SQL support
- Query optimizer: Predicate pushdown, projection pruning, join reordering
- Columnar execution: Arrow RecordBatches for vectorized processing
- Extensibility: Custom table providers, functions, and execution plans
DataFusion was designed for batch queries, but its architecture is flexible enough for streaming. Here's how we adapted it.
The Streaming Table Provider
In standard DataFusion, a table is a static set of files or partitions. In StreamHouse, a table is a live stream of events. We implemented a custom StreamTableProvider that:
- Registers topic schema: Maps topic fields to Arrow data types
- Provides a streaming execution plan: Instead of scanning files, it continuously polls for new records
- Tracks watermarks: Knows how far into the stream we've processed
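The provider's behavior can be sketched in miniature. This is an illustrative Python model, not StreamHouse's actual API (the real implementation is Rust against DataFusion's table provider interface); all names here are hypothetical:

```python
# Sketch of a streaming "table": poll the topic for new records and
# track a watermark (how far into the event-time stream we've processed).

class StreamSource:
    """Stand-in for a topic: records arrive as (event_time, payload)."""
    def __init__(self, records):
        self._records = list(records)
        self._offset = 0

    def poll(self, max_records=100):
        batch = self._records[self._offset:self._offset + max_records]
        self._offset += len(batch)
        return batch

class StreamTableProvider:
    def __init__(self, source, allowed_lateness=0):
        self.source = source
        self.allowed_lateness = allowed_lateness
        self.watermark = float("-inf")

    def next_batch(self):
        batch = self.source.poll()
        for event_time, _ in batch:
            # Watermark = max event time seen, minus an allowed-lateness slack.
            self.watermark = max(self.watermark, event_time - self.allowed_lateness)
        return batch

source = StreamSource([(100, "a"), (105, "b"), (103, "c")])
provider = StreamTableProvider(source, allowed_lateness=2)
provider.next_batch()
print(provider.watermark)  # 103: max event time 105, minus 2s of lateness
```

The watermark is what lets downstream operators (windows, joins) know when a time range is complete enough to emit.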
-- Register a topic as a SQL table
CREATE STREAM clickstream (
user_id VARCHAR,
page VARCHAR,
referrer VARCHAR,
event_time TIMESTAMP
) WITH (
topic = 'clicks',
format = 'json',
timestamp_field = 'event_time'
);
-- Query it like any table
SELECT page, count(*) as views
FROM clickstream
WHERE event_time > NOW() - INTERVAL '1 hour'
GROUP BY page
ORDER BY views DESC;
Windowed Aggregations
Aggregations over unbounded streams need windows — finite time boundaries that define "group this set of events together." StreamHouse supports three window types:
Tumbling Windows
Fixed-size, non-overlapping. Every event belongs to exactly one window.
SELECT
page,
count(*) as views,
window_start,
window_end
FROM TUMBLE(clickstream, event_time, INTERVAL '5 minutes')
GROUP BY page, window_start, window_end;
Under the hood, the windowed execution plan:
- Reads batches from the stream
- Assigns each record to a window based on its timestamp
- Maintains per-window accumulators (count, sum, etc.)
- Emits completed windows when the watermark advances past the window end
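The steps above can be sketched in a few lines of Python. This is an illustrative model of the algorithm, not StreamHouse code; tumbling assignment is just integer arithmetic on the timestamp:

```python
from collections import defaultdict

WINDOW = 300  # 5 minutes, in seconds

def window_start(ts):
    # Tumbling windows: each timestamp maps to exactly one window.
    return ts - (ts % WINDOW)

counts = defaultdict(int)       # (page, window_start) -> running count
watermark = float("-inf")
emitted = []

def process(ts, page):
    global watermark
    counts[(page, window_start(ts))] += 1
    watermark = max(watermark, ts)
    # Emit any window whose end is now behind the watermark.
    for (p, ws) in list(counts):
        if ws + WINDOW <= watermark:
            emitted.append((p, ws, counts.pop((p, ws))))

for ts, page in [(10, "/home"), (20, "/home"), (40, "/pricing"), (310, "/home")]:
    process(ts, page)

print(emitted)  # the [0, 300) windows emit once the watermark passes 300
```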
Hopping Windows
Fixed-size with a smaller advance interval. Events belong to multiple overlapping windows.
-- 10-minute windows that advance every 2 minutes
SELECT
page,
count(*) as views,
avg(load_time_ms) as avg_load
FROM HOP(clickstream, event_time, INTERVAL '2 minutes', INTERVAL '10 minutes')
GROUP BY page, window_start, window_end;
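Because windows overlap, each event belongs to size ÷ advance windows (five, for the query above). A hedged sketch of the membership computation (Python, illustrative only; assumes time starts at epoch 0):

```python
def hop_windows(ts, advance, size):
    """Return the start times of every hopping window containing ts."""
    # The latest window containing ts starts at the aligned boundary at or
    # before ts; earlier candidates start `advance` apart, as long as their
    # end (start + size) is still after ts.
    last_start = ts - (ts % advance)
    starts = []
    start = last_start
    while start + size > ts and start >= 0:
        starts.append(start)
        start -= advance
    return list(reversed(starts))

# 10-minute windows advancing every 2 minutes (in seconds):
print(hop_windows(ts=660, advance=120, size=600))  # five overlapping windows
```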
Session Windows
Dynamic windows that group events close together in time, separated by gaps of inactivity. Perfect for user sessions.
SELECT
user_id,
count(*) as clicks,
min(event_time) as session_start,
max(event_time) as session_end
FROM SESSION(clickstream, event_time, INTERVAL '30 minutes')
GROUP BY user_id, session_start, session_end;
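Unlike tumbling and hopping windows, session boundaries are data-driven: a new session starts whenever the gap since the previous event exceeds the inactivity threshold. A minimal sketch of the grouping logic (Python, illustrative only):

```python
def sessionize(event_times, gap):
    """Group timestamps into sessions separated by more than `gap` of inactivity."""
    sessions = []
    for ts in sorted(event_times):
        if sessions and ts - sessions[-1][-1] <= gap:
            sessions[-1].append(ts)   # within the gap: extend current session
        else:
            sessions.append([ts])     # gap exceeded: start a new session
    return [(s[0], s[-1], len(s)) for s in sessions]  # (start, end, clicks)

# 30-minute gap (1800 s): these five clicks form two sessions
print(sessionize([0, 600, 1200, 5000, 5300], gap=1800))
```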
Continuous Queries
One-off queries are useful for exploration. Continuous queries are useful for production. A continuous query runs indefinitely, processing each new event and writing results to an output topic.
CREATE CONTINUOUS QUERY page_view_counts AS
SELECT
page,
count(*) as views,
window_start
FROM TUMBLE(clickstream, event_time, INTERVAL '1 minute')
GROUP BY page, window_start, window_end
OUTPUT TO 'page-views-per-minute';
The output topic (page-views-per-minute) is a regular StreamHouse topic. Other consumers, SQL queries, or external systems can read from it.
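The shape of a continuous query is a long-running loop: poll the source, advance the query's window state, and publish whatever completed. A toy skeleton (Python, illustrative; real continuous queries never terminate):

```python
def run_continuous(read_batch, step, write):
    """Skeleton of a continuous query: poll, process, publish, repeat.
    `step` updates operator state and yields any completed results."""
    while True:
        batch = read_batch()
        if batch is None:        # source closed; a real query runs forever
            break
        for result in step(batch):
            write(result)        # publish to the output topic

# Demo with in-memory stand-ins for the topic and the output:
inputs = iter([[1, 2], [3], None])
out = []
run_continuous(lambda: next(inputs),
               lambda batch: [sum(batch)],   # toy per-batch "aggregation"
               out.append)
print(out)  # [3, 3]
```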
Stream-Table Joins
The most powerful pattern: joining a real-time stream with a "table" (a compacted topic that represents the latest state per key).
-- Compacted topic as a table
CREATE TABLE customers (
customer_id VARCHAR PRIMARY KEY,
name VARCHAR,
plan VARCHAR,
mrr DECIMAL
) WITH (topic = 'customers', format = 'json');
-- Enrich orders with customer data
SELECT
o.order_id,
o.amount,
c.name,
c.plan,
c.mrr
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id;
The table is backed by a compacted topic — StreamHouse maintains the latest value for each key in memory. When an order arrives, each join lookup is an O(1) hash-map access.
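The materialized table is conceptually just a map from key to latest record, with later records overwriting earlier ones. A sketch (Python, hypothetical field names):

```python
# Compacted topic as an in-memory table: keep only the latest value per key,
# so each join probe is a single hash-map lookup.
customers = {}  # customer_id -> latest record

def apply_compacted(key, value):
    customers[key] = value  # a later record for the same key overwrites

apply_compacted("c1", {"name": "Acme", "plan": "pro", "mrr": 500})
apply_compacted("c1", {"name": "Acme", "plan": "enterprise", "mrr": 2000})  # update wins

def enrich(order):
    c = customers.get(order["customer_id"])  # O(1) lookup
    return {**order, **(c or {})}

print(enrich({"order_id": "o1", "customer_id": "c1", "amount": 99}))
```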
Performance
SQL processing runs directly within the StreamHouse agent process. No external clusters, no network hops, no serialization overhead.
Benchmarks on a single agent (4 CPU, 4GB RAM):
- Simple filter + project: 500,000 events/sec
- Tumbling window aggregation: 200,000 events/sec
- Stream-table join: 150,000 events/sec
- Complex multi-join: 50,000 events/sec
Processing latency adds <1ms for simple queries and <5ms for windowed aggregations.
When to Use SQL vs. External Processing
StreamHouse SQL is the right tool for:
- Filtering and transforming events
- Windowed aggregations (counts, sums, averages)
- Enrichment joins with lookup tables
- Real-time dashboards and alerting
For complex event processing (CEP), machine learning inference, or stateful computations that don't map to SQL, external processors like Flink remain the better choice. StreamHouse integrates cleanly as the transport layer in those architectures.
Getting Started
# Interactive SQL shell
streamctl sql --interactive
# Run a query from the CLI
streamctl sql "SELECT count(*) FROM clickstream WHERE page = '/pricing'"
# Create a continuous query via the REST API
curl -X POST http://localhost:8080/api/sql \
-H "Content-Type: application/json" \
-d '{"query": "CREATE CONTINUOUS QUERY ..."}'
Or use the web console's SQL Workbench for a visual query editor with auto-complete and result visualization.
SQL on streams isn't the future — it's available now in StreamHouse.