Why SQL?
Every data engineer knows SQL. But stream processing has traditionally required learning specialized frameworks: Flink's DataStream API, Spark Structured Streaming's DataFrame API, or ksqlDB's custom dialect.
We wanted StreamHouse users to write plain SQL and have it work on streaming data. No new APIs. No new deployment infrastructure. Just SQL.
Why DataFusion?
Apache DataFusion is a query engine written in Rust, built on Apache Arrow for columnar in-memory processing. It gives us:
- SQL parser and planner: broad, standards-based SQL support
- Query optimizer: Predicate pushdown, projection pruning, join reordering
- Columnar execution: Arrow RecordBatches for vectorized processing
- Extensibility: Custom table providers, functions, and execution plans
DataFusion was designed for batch queries, but its architecture is flexible enough for streaming. Here's how we adapted it.
The Streaming Table Provider
In standard DataFusion, a table is a static set of files or partitions. In StreamHouse, a table is a live stream of events. We implemented a custom StreamTableProvider that:
- Registers topic schema: Maps topic fields to Arrow data types
- Provides a streaming execution plan: Instead of scanning files, it continuously polls for new records
- Tracks watermarks: Knows how far into the stream we've processed
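The provider's behavior can be sketched in miniature. This is an illustrative Python model, not StreamHouse's actual API (the real implementation is Rust against DataFusion's table provider interface); all names here are hypothetical:

```python
# Sketch of a streaming "table": poll the topic for new records and
# track a watermark (how far into the event-time stream we've processed).

class StreamSource:
    """Stand-in for a topic: records arrive as (event_time, payload)."""
    def __init__(self, records):
        self._records = list(records)
        self._offset = 0

    def poll(self, max_records=100):
        batch = self._records[self._offset:self._offset + max_records]
        self._offset += len(batch)
        return batch

class StreamTableProvider:
    def __init__(self, source, allowed_lateness=0):
        self.source = source
        self.allowed_lateness = allowed_lateness
        self.watermark = float("-inf")

    def next_batch(self):
        batch = self.source.poll()
        for event_time, _ in batch:
            # Watermark = max event time seen, minus an allowed-lateness slack.
            self.watermark = max(self.watermark, event_time - self.allowed_lateness)
        return batch

source = StreamSource([(100, "a"), (105, "b"), (103, "c")])
provider = StreamTableProvider(source, allowed_lateness=2)
provider.next_batch()
print(provider.watermark)  # 103: max event time 105, minus 2s of lateness
```

The watermark is what lets downstream operators (windows, joins) know when a time range is complete enough to emit.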
-- Register a topic as a SQL table
CREATE STREAM clickstream (
user_id VARCHAR,
page VARCHAR,
referrer VARCHAR,
event_time TIMESTAMP
) WITH (
topic = 'clicks',
format = 'json',
timestamp_field = 'event_time'
);
-- Query it like any table
SELECT page, count(*) as views
FROM clickstream
WHERE event_time > NOW() - INTERVAL '1 hour'
GROUP BY page
ORDER BY views DESC;
Windowed Aggregations
Aggregations over unbounded streams need windows — finite time boundaries that define "group this set of events together." StreamHouse supports three window types:
Tumbling Windows
Fixed-size, non-overlapping. Every event belongs to exactly one window.
SELECT
page,
count(*) as views,
window_start,
window_end
FROM TUMBLE(clickstream, event_time, INTERVAL '5 minutes')
GROUP BY page, window_start, window_end;
Under the hood, the windowed execution plan:
- Reads batches from the stream
- Assigns each record to a window based on its timestamp
- Maintains per-window accumulators (count, sum, etc.)
- Emits completed windows when the watermark advances past the window end
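The steps above can be sketched in a few lines of Python. This is an illustrative model of the algorithm, not StreamHouse code; tumbling assignment is just integer arithmetic on the timestamp:

```python
from collections import defaultdict

WINDOW = 300  # 5 minutes, in seconds

def window_start(ts):
    # Tumbling windows: each timestamp maps to exactly one window.
    return ts - (ts % WINDOW)

counts = defaultdict(int)       # (page, window_start) -> running count
watermark = float("-inf")
emitted = []

def process(ts, page):
    global watermark
    counts[(page, window_start(ts))] += 1
    watermark = max(watermark, ts)
    # Emit any window whose end is now behind the watermark.
    for (p, ws) in list(counts):
        if ws + WINDOW <= watermark:
            emitted.append((p, ws, counts.pop((p, ws))))

for ts, page in [(10, "/home"), (20, "/home"), (40, "/pricing"), (310, "/home")]:
    process(ts, page)

print(emitted)  # the [0, 300) windows emit once the watermark passes 300
```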
Hopping Windows
Fixed-size with a smaller advance interval. Events belong to multiple overlapping windows.
-- 10-minute windows that advance every 2 minutes
SELECT
page,
count(*) as views,
avg(load_time_ms) as avg_load
FROM HOP(clickstream, event_time, INTERVAL '2 minutes', INTERVAL '10 minutes')
GROUP BY page, window_start, window_end;
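Because windows overlap, each event belongs to size ÷ advance windows (five, for the query above). A hedged sketch of the membership computation (Python, illustrative only; assumes time starts at epoch 0):

```python
def hop_windows(ts, advance, size):
    """Return the start times of every hopping window containing ts."""
    # The latest window containing ts starts at the aligned boundary at or
    # before ts; earlier candidates start `advance` apart, as long as their
    # end (start + size) is still after ts.
    last_start = ts - (ts % advance)
    starts = []
    start = last_start
    while start + size > ts and start >= 0:
        starts.append(start)
        start -= advance
    return list(reversed(starts))

# 10-minute windows advancing every 2 minutes (in seconds):
print(hop_windows(ts=660, advance=120, size=600))  # five overlapping windows
```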
Session Windows
Dynamic windows that group events close together in time, separated by gaps of inactivity. Perfect for user sessions.
SELECT
user_id,
count(*) as clicks,
min(event_time) as session_start,
max(event_time) as session_end
FROM SESSION(clickstream, event_time, INTERVAL '30 minutes')
GROUP BY user_id, session_start, session_end;
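Unlike tumbling and hopping windows, session boundaries are data-driven: a new session starts whenever the gap since the previous event exceeds the inactivity threshold. A minimal sketch of the grouping logic (Python, illustrative only):

```python
def sessionize(event_times, gap):
    """Group timestamps into sessions separated by more than `gap` of inactivity."""
    sessions = []
    for ts in sorted(event_times):
        if sessions and ts - sessions[-1][-1] <= gap:
            sessions[-1].append(ts)   # within the gap: extend current session
        else:
            sessions.append([ts])     # gap exceeded: start a new session
    return [(s[0], s[-1], len(s)) for s in sessions]  # (start, end, clicks)

# 30-minute gap (1800 s): these five clicks form two sessions
print(sessionize([0, 600, 1200, 5000, 5300], gap=1800))
```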
Continuous Queries
One-off queries are useful for exploration. Continuous queries are useful for production. A continuous query runs indefinitely, processing each new event and writing results to an output topic.
CREATE CONTINUOUS QUERY page_view_counts AS
SELECT
page,
count(*) as views,
window_start
FROM TUMBLE(clickstream, event_time, INTERVAL '1 minute')
GROUP BY page, window_start, window_end
OUTPUT TO 'page-views-per-minute';
The output topic (page-views-per-minute) is a regular StreamHouse topic. Other consumers, SQL queries, or external systems can read from it.
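The shape of a continuous query is a long-running loop: poll the source, advance the query's window state, and publish whatever completed. A toy skeleton (Python, illustrative; real continuous queries never terminate):

```python
def run_continuous(read_batch, step, write):
    """Skeleton of a continuous query: poll, process, publish, repeat.
    `step` updates operator state and yields any completed results."""
    while True:
        batch = read_batch()
        if batch is None:        # source closed; a real query runs forever
            break
        for result in step(batch):
            write(result)        # publish to the output topic

# Demo with in-memory stand-ins for the topic and the output:
inputs = iter([[1, 2], [3], None])
out = []
run_continuous(lambda: next(inputs),
               lambda batch: [sum(batch)],   # toy per-batch "aggregation"
               out.append)
print(out)  # [3, 3]
```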
Stream-Table Joins
The most powerful pattern: joining a real-time stream with a "table" (a compacted topic that represents the latest state per key).
-- Compacted topic as a table
CREATE TABLE customers (
customer_id VARCHAR PRIMARY KEY,
name VARCHAR,
plan VARCHAR,
mrr DECIMAL
) WITH (topic = 'customers', format = 'json');
-- Enrich orders with customer data
SELECT
o.order_id,
o.amount,
c.name,
c.plan,
c.mrr
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id;
The table is backed by a compacted topic — StreamHouse maintains the latest value for each key in memory. When an order arrives, each join lookup is an O(1) hash-map access.
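The materialized table is conceptually just a map from key to latest record, with later records overwriting earlier ones. A sketch (Python, hypothetical field names):

```python
# Compacted topic as an in-memory table: keep only the latest value per key,
# so each join probe is a single hash-map lookup.
customers = {}  # customer_id -> latest record

def apply_compacted(key, value):
    customers[key] = value  # a later record for the same key overwrites

apply_compacted("c1", {"name": "Acme", "plan": "pro", "mrr": 500})
apply_compacted("c1", {"name": "Acme", "plan": "enterprise", "mrr": 2000})  # update wins

def enrich(order):
    c = customers.get(order["customer_id"])  # O(1) lookup
    return {**order, **(c or {})}

print(enrich({"order_id": "o1", "customer_id": "c1", "amount": 99}))
```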
Performance
SQL processing runs directly within the StreamHouse agent process. No external clusters, no network hops, no serialization overhead.
Benchmarks on a single agent (4 CPU, 4GB RAM):
- Simple filter + project: 500,000 events/sec
- Tumbling window aggregation: 200,000 events/sec
- Stream-table join: 150,000 events/sec
- Complex multi-join: 50,000 events/sec
Processing latency adds <1ms for simple queries and <5ms for windowed aggregations.
When to Use SQL vs. External Processing
StreamHouse SQL is the right tool for:
- Filtering and transforming events
- Windowed aggregations (counts, sums, averages)
- Enrichment joins with lookup tables
- Real-time dashboards and alerting
For complex event processing (CEP), machine learning inference, or stateful computations that don't map to SQL, external processors like Flink remain the better choice. StreamHouse integrates cleanly as the transport layer in those architectures.
Getting Started
# Interactive SQL shell
streamctl sql --interactive
# Run a query from the CLI
streamctl sql "SELECT count(*) FROM clickstream WHERE page = '/pricing'"
# Create a continuous query via the REST API
curl -X POST http://localhost:8080/api/sql \
-H "Content-Type: application/json" \
-d '{"query": "CREATE CONTINUOUS QUERY ..."}'
Or use the web console's SQL Workbench for a visual query editor with auto-complete and result visualization.
SQL on streams isn't the future — it's available now in StreamHouse.