Deep Dive · March 12, 2024 · 9 min read

Schema Evolution Without the Pain: StreamHouse's Built-In Schema Registry

Schemas change. Producers and consumers deploy at different times. Here's how StreamHouse's Schema Registry handles Avro, Protobuf, and JSON Schema evolution with automatic compatibility checking.

The Schema Problem

You have a topic called user-events. A producer writes events with fields user_id, event, and timestamp. Six months later, you add a country field. A year later, you rename event to event_type.

What happens to consumers still reading old data? What happens when a producer accidentally drops a required field?

Without schema management, the answer is: silent data corruption, broken pipelines, and 3am pages.

StreamHouse Schema Registry

StreamHouse includes a built-in Schema Registry that provides:

  • Schema versioning: Every schema change gets a version number
  • Compatibility checking: Changes are validated against configurable rules before registration
  • Producer validation: Messages are validated against the schema at write time
  • Multi-format support: Avro, Protobuf, and JSON Schema

The registry is API-compatible with the Confluent Schema Registry, so existing tools and client libraries work out of the box.
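Concretely, a Confluent-compatible client registers a schema by POSTing to `/subjects/{subject}/versions` with the schema JSON-encoded as a string inside the request body. A minimal sketch of that request construction (`build_register_request` is an illustrative helper, not a StreamHouse or Confluent API; real client libraries handle this for you):

```python
import json

def build_register_request(base_url, subject, schema_dict, schema_type="JSON"):
    """Build the URL and body for the Confluent-style
    POST /subjects/{subject}/versions endpoint."""
    url = f"{base_url}/subjects/{subject}/versions"
    # The schema itself travels as a JSON-encoded *string* inside the body,
    # which is why the curl examples below escape every inner quote.
    body = json.dumps({
        "schemaType": schema_type,
        "schema": json.dumps(schema_dict),
    })
    return url, body

url, body = build_register_request(
    "http://localhost:8081",
    "user-events-value",
    {"type": "object", "required": ["user_id"],
     "properties": {"user_id": {"type": "string"}}},
)
```

The double `json.dumps` is the part people trip over: the outer call builds the request body, the inner call turns the schema into the string the registry expects.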

Registering a Schema

# Register a JSON Schema for user-events
# (the schema is a JSON-encoded string, so it must be a single line
# with escaped quotes — raw newlines inside a JSON string are invalid)
curl -X POST http://localhost:8081/subjects/user-events-value/versions \
  -H "Content-Type: application/json" \
  -d '{
    "schemaType": "JSON",
    "schema": "{\"type\": \"object\", \"required\": [\"user_id\", \"event\", \"timestamp\"], \"properties\": {\"user_id\": {\"type\": \"string\"}, \"event\": {\"type\": \"string\"}, \"timestamp\": {\"type\": \"integer\"}}}"
  }'

# Response: {"id": 1}

The schema gets an ID (1) that's embedded in every message produced with this schema. Consumers use the ID to fetch the schema for deserialization.
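One common way an ID travels with a message is the Confluent wire format: a magic byte, the schema ID as a 4-byte big-endian integer, then the serialized payload. The source doesn't spell out StreamHouse's exact framing, so treat this as a sketch of the Confluent-compatible convention:

```python
import struct

MAGIC_BYTE = 0  # Confluent wire-format marker

def frame(schema_id: int, payload: bytes) -> bytes:
    """Prefix a serialized payload with the magic byte and schema ID."""
    return struct.pack(">bI", MAGIC_BYTE, schema_id) + payload

def unframe(message: bytes) -> tuple[int, bytes]:
    """Recover the schema ID and payload from a framed message."""
    magic, schema_id = struct.unpack(">bI", message[:5])
    if magic != MAGIC_BYTE:
        raise ValueError(f"unexpected magic byte: {magic}")
    return schema_id, message[5:]

msg = frame(1, b'{"user_id": "123", "event": "login", "timestamp": 1706000000}')
schema_id, payload = unframe(msg)
```

Five bytes of overhead per message buys every consumer an unambiguous pointer to the exact schema the producer used.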

Compatibility Modes

The key feature of any schema registry is compatibility checking. When you register a new version of a schema, the registry validates it against the previous version(s):

Backward Compatible (Default)

Consumers using the new schema can still read data written with the old schema.

  • Allowed: add optional fields (with defaults), remove fields
  • Not allowed: add required fields, change field types
// Version 1
{"user_id": "string", "event": "string", "timestamp": "integer"}

// Version 2 (backward compatible - added optional field)
{"user_id": "string", "event": "string", "timestamp": "integer", "country": "string (optional)"}

Forward Compatible

Consumers using the old schema can still read data written with the new schema.

  • Allowed: add fields, remove optional fields
  • Not allowed: remove required fields, change field types

Full Compatible

Both backward and forward compatible. The safest mode but the most restrictive.
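To make the rules concrete, here's a toy compatibility check for flat JSON Schemas like the ones above (a real registry implements far more of the spec; these helpers are illustrative, not StreamHouse APIs):

```python
def is_backward_compatible(old: dict, new: dict) -> bool:
    """New-schema consumers must be able to read old data: the new schema
    may not require fields the old one didn't, and shared fields keep types."""
    old_props = old.get("properties", {})
    if not set(new.get("required", [])) <= set(old.get("required", [])):
        return False  # adding a required field breaks old data
    return all(
        old_props[name]["type"] == spec["type"]
        for name, spec in new.get("properties", {}).items()
        if name in old_props
    )

def is_forward_compatible(old: dict, new: dict) -> bool:
    # Old-schema consumers must read new data: the symmetric check.
    return is_backward_compatible(new, old)

def is_full_compatible(old: dict, new: dict) -> bool:
    return is_backward_compatible(old, new) and is_forward_compatible(old, new)

v1 = {"required": ["user_id"], "properties": {"user_id": {"type": "string"}}}
v2 = {"required": ["user_id"],
      "properties": {"user_id": {"type": "string"},
                     "country": {"type": "string"}}}  # optional addition
```

Adding an optional `country` field passes both directions, which is why it's the canonical "safe" evolution.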

# Set compatibility mode for a subject
curl -X PUT http://localhost:8081/config/user-events-value \
  -H "Content-Type: application/json" \
  -d '{"compatibility": "FULL"}'

Producer-Side Validation

When a producer sends a message with a schema ID, the StreamHouse agent validates the payload before accepting it:

# Produce with schema validation
streamctl produce --topic user-events \
  --schema-id 1 \
  --key "user-123" \
  --message '{"user_id": "123", "event": "login", "timestamp": 1706000000}'
# Success

# This will be rejected — missing required field
streamctl produce --topic user-events \
  --schema-id 1 \
  --key "user-123" \
  --message '{"user_id": "123"}'
# Error: Schema validation failed: missing required field "event"

Catching bad data at the producer is far cheaper than debugging it downstream.
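The agent-side check boils down to validating the payload against the registered schema before acknowledging the write. A minimal sketch of that required-field and type check (a production implementation would use a full JSON Schema validator):

```python
import json

TYPE_MAP = {"string": str, "integer": int, "number": float, "object": dict}

def validate(schema: dict, message: str) -> None:
    """Raise ValueError when the message violates the schema, mirroring
    the 'missing required field' error shown above."""
    payload = json.loads(message)
    for field in schema.get("required", []):
        if field not in payload:
            raise ValueError(
                f'Schema validation failed: missing required field "{field}"')
    for name, spec in schema.get("properties", {}).items():
        if name in payload and not isinstance(payload[name], TYPE_MAP[spec["type"]]):
            raise ValueError(
                f'Schema validation failed: field "{name}" is not a {spec["type"]}')

schema = {"required": ["user_id", "event", "timestamp"],
          "properties": {"user_id": {"type": "string"},
                         "event": {"type": "string"},
                         "timestamp": {"type": "integer"}}}
validate(schema, '{"user_id": "123", "event": "login", "timestamp": 1706000000}')
```

The same function rejects the truncated `{"user_id": "123"}` payload with the error the CLI example shows.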

Schema Caching

The Schema Registry caches schemas in memory for fast lookups. With a 10,000-schema cache and 1-hour TTL, the cache achieves ~99% hit rates in production. A cache miss costs a single PostgreSQL query (~2ms).

Schema lookup path:
  1. Check in-memory cache → ~100ns (99% of requests)
  2. Query PostgreSQL → ~2ms (1% of requests)
  3. Return schema + update cache
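The lookup path above is an ordinary read-through TTL cache. A sketch using the quoted numbers (the `fetch_from_db` hook stands in for the PostgreSQL query; the eviction policy is a guess, since the source doesn't specify one):

```python
import time

class SchemaCache:
    """Read-through cache: O(1) hits, database fallback on miss."""
    def __init__(self, fetch_from_db, max_size=10_000, ttl_seconds=3600):
        self.fetch_from_db = fetch_from_db
        self.max_size = max_size
        self.ttl = ttl_seconds
        self._entries = {}  # schema_id -> (schema, expires_at)

    def get(self, schema_id):
        entry = self._entries.get(schema_id)
        if entry and entry[1] > time.monotonic():
            return entry[0]                     # hit: the ~100ns path
        schema = self.fetch_from_db(schema_id)  # miss: the ~2ms query
        if len(self._entries) >= self.max_size:
            self._entries.pop(next(iter(self._entries)))  # evict oldest entry
        self._entries[schema_id] = (schema, time.monotonic() + self.ttl)
        return schema

db_calls = []
cache = SchemaCache(lambda sid: db_calls.append(sid) or {"id": sid})
cache.get(1)
cache.get(1)  # second lookup is served from the cache
```

Because schemas are immutable once registered, a long TTL is safe: a cached entry can go stale in age but never in content.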

Avro and Protobuf Support

While JSON Schema is the simplest to get started with, Avro and Protobuf offer better performance for production workloads:

Avro: Binary format. Container files embed the writer's schema in the header; with the registry, messages carry only a schema ID. Excellent for evolving schemas because readers can resolve the writer's schema against their own during deserialization.

Protobuf: Google's binary format. Best for gRPC-heavy environments and when you need language-neutral schema definitions.
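That "readers use the writer's schema" property is Avro schema resolution: decode with the writer's schema, then project the record onto the reader's schema, filling reader-side defaults for fields the writer didn't have. A toy version of the resolution step (real Avro libraries also handle type promotion, aliases, and unions):

```python
def resolve(record: dict, reader_fields: list[dict]) -> dict:
    """Project a decoded writer record onto the reader's schema,
    applying reader defaults for fields missing from the writer's data."""
    out = {}
    for field in reader_fields:
        name = field["name"]
        if name in record:
            out[name] = record[name]
        elif "default" in field:
            out[name] = field["default"]
        else:
            raise ValueError(f"field {name!r} missing and has no default")
    return out

# Writer produced with an older Order schema (no currency field);
# the reader's schema supplies the default, as in the Avro example below.
reader_fields = [
    {"name": "order_id", "type": "string"},
    {"name": "amount", "type": "double"},
    {"name": "currency", "type": "string", "default": "USD"},
]
order = resolve({"order_id": "o-1", "amount": 9.99}, reader_fields)
```

This is why the `"default": "USD"` on `currency` matters: it's what lets new readers consume records written before the field existed.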

# Register an Avro schema (again, the schema string must be a single
# line — raw newlines inside a JSON string are invalid)
curl -X POST http://localhost:8081/subjects/orders-value/versions \
  -H "Content-Type: application/json" \
  -d '{
    "schemaType": "AVRO",
    "schema": "{\"type\": \"record\", \"name\": \"Order\", \"fields\": [{\"name\": \"order_id\", \"type\": \"string\"}, {\"name\": \"amount\", \"type\": \"double\"}, {\"name\": \"currency\", \"type\": \"string\", \"default\": \"USD\"}]}"
  }'

Best Practices

  1. Always use backward compatibility for consumer-first deployments (most common)
  2. Use full compatibility for topics shared across many teams
  3. Register schemas before producing — don't rely on auto-registration in production
  4. Version your schemas in source control alongside your application code
  5. Monitor compatibility check failures — they indicate coordination problems between teams

The Bottom Line

Schema management isn't optional at scale. StreamHouse's built-in Schema Registry makes it free to adopt — no additional infrastructure, no additional service to operate. Register schemas, enforce compatibility, and catch data quality issues before they reach your consumers.

# Get started in 30 seconds
curl http://localhost:8081/subjects  # List all registered schemas

Tags: schema-registry, avro, data-quality, compatibility, evolution
