Introducing Supermetal
Data replication that just works
Supermetal is a data replication platform that syncs transactional databases to data warehouses and other databases. It is purpose-built for batch, real-time change data capture (CDC), and ELT workloads at any scale.
- Single binary: No Kafka, no JVM, no Debezium, no complex container orchestration. Ships with a built-in UI, management APIs, and metrics (OpenTelemetry).
- Built in Rust and Apache Arrow: Compute-efficient architecture with a rich type system that preserves data accuracy across sources and targets.
- Object store native: Uses S3, Azure Blob Storage, or local NVMe as durable buffer storage, decoupling source and target systems.
Get started right away.
macOS/Linux:

```sh
curl -fsSL https://trial.supermetal.io/install.sh | sh
```

Windows (PowerShell):

```powershell
iwr -useb https://trial.supermetal.io/install.ps1 | iex
```

Why Supermetal?
Data replication should be straightforward: read changes from a source database and write them to a target. In practice, it has turned into an infrastructure project.
The current state of data replication follows a multi-hop architecture: source database → Debezium → Kafka → consumers → target.
This design provides durability and decoupling. Kafka acts as a buffer between sources and targets while Debezium handles the change data capture. Consumers can be custom code, Kafka Connect sinks or stream processors. While Debezium can run embedded, most production deployments use Kafka for durability.
However, this data path introduces overhead. Changes move through multiple serialization cycles as Debezium reads from the database log, serializes to Avro or JSON and writes to Kafka. Consumers deserialize, potentially transform, then reserialize (often to Parquet for warehouses) before loading. Each conversion adds latency. Each component boundary adds operational complexity.
The costs add up:
- Operational overhead: Each component (Debezium, Kafka, consumers) needs deployment, monitoring, and maintenance. Configuration changes require coordinating across multiple systems.
- Serialization tax: Multiple encode/decode cycles as data moves through the pipeline. Each serialization step burns CPU cycles and adds latency.
- Type system collapse: Avro and JSON flatten database types, losing precision and metadata (a small illustration follows this list).
- Slow snapshots: Even when using multiple threads, existing tools process tables sequentially or don't chunk large tables effectively. This leaves CPU cores underutilized during initial synchronization, so snapshots of large databases take hours or days.
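The precision loss is easy to reproduce. The sketch below is plain Rust and not Supermetal code; it simply forces a high-precision DECIMAL value through a double, which is what happens when a change event is emitted as a bare JSON number.

```rust
fn main() {
    // A DECIMAL(20,2) value as stored in the source database.
    let original = "12345678901234567.89";

    // JSON numbers are typically decoded as IEEE-754 doubles (~15-17
    // significant digits), so the value is silently rounded.
    let through_json: f64 = original.parse().unwrap();

    println!("source : {original}");
    println!("target : {through_json}"); // prints 12345678901234568
}
```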
Our key insight: a single-process architecture eliminates most of this complexity. Data streams from source to target within one process using Apache Arrow, removing the need for intermediate serialization and inter-process communication. Object storage is cheap and durable, and it replaces Kafka.1
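To make the single-process idea concrete, here is a minimal sketch (not Supermetal's implementation) using the arrow crate: a reader thread produces Arrow record batches and a writer thread consumes them over an in-process channel, so nothing on the path is re-encoded to Avro or JSON.

```rust
use std::sync::{mpsc, Arc};
use std::thread;

use arrow::array::Int64Array;
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;

fn main() {
    let schema = Arc::new(Schema::new(vec![Field::new("id", DataType::Int64, false)]));
    let (tx, rx) = mpsc::sync_channel::<RecordBatch>(8);

    let reader_schema = schema.clone();
    let reader = thread::spawn(move || {
        for chunk in 0..4 {
            // In a real pipeline this batch would come from the database's
            // replication stream; here it is synthesized.
            let ids = Int64Array::from_iter_values(chunk * 1000..(chunk + 1) * 1000);
            let batch = RecordBatch::try_new(reader_schema.clone(), vec![Arc::new(ids)]).unwrap();
            tx.send(batch).unwrap(); // moves the batch across threads; no re-encoding
        }
    });

    let writer = thread::spawn(move || {
        let mut rows = 0;
        while let Ok(batch) = rx.recv() {
            // A real target would append this batch to Parquet/S3 or a warehouse;
            // the columnar buffers are used as-is.
            rows += batch.num_rows();
        }
        println!("wrote {rows} rows");
    });

    reader.join().unwrap();
    writer.join().unwrap();
}
```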
The architecture reflects this:
- Single-process pipeline: Data flows from source to target using Arrow record batches, eliminating IPC and serialization overhead.
- Low latency: No multi-hop pipeline. For latency-sensitive workloads, buffer to local NVMe or S3 Express One Zone2.
- Type preservation: Arrow's type system maintains precision, scale, and signedness end to end. Schema evolution is handled seamlessly as table structures change.
- Parallel snapshots: Both inter-table and intra-table parallelization. Large tables are automatically chunked and processed across all available CPU cores, maximizing hardware utilization (a rough sketch follows this list).
- Transactional consistency: Transaction boundaries are preserved from source to target.
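As a rough illustration of the intra-table parallelism mentioned above (a sketch, not Supermetal's connector code): split a table's primary-key range into one chunk per core and snapshot each chunk independently.

```rust
use std::thread;

fn main() {
    // Bounds a real connector would get from MIN(id)/MAX(id) on the table.
    let (min_id, max_id): (i64, i64) = (1, 10_000_000);

    // One worker per available core, with a ceiling-divided chunk size.
    let workers = thread::available_parallelism().map(|n| n.get()).unwrap_or(4) as i64;
    let chunk = (max_id - min_id + 1 + workers - 1) / workers;

    let handles: Vec<_> = (0..workers)
        .map(|w| {
            let lo = min_id + w * chunk;
            let hi = (lo + chunk).min(max_id + 1);
            thread::spawn(move || {
                // A real worker would stream rows with id in [lo, hi) into
                // Arrow batches here; this sketch just reports its range.
                println!("worker {w}: rows with id in [{lo}, {hi})");
            })
        })
        .collect();

    for h in handles {
        h.join().unwrap();
    }
}
```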
The single-binary architecture makes it a natural fit for our upcoming bring-your-own-cloud (BYOC) deployments. No Kubernetes clusters to provision or manage, no complex orchestration. The control plane is managed by Supermetal while the data plane runs entirely inside your VPC as a single container. Your data never leaves your infrastructure.
Questions? Check out our FAQ or reach out to us.
Footnotes
1. For more on the shift away from Kafka-centric architectures, see Kafka: End of Beginning.
2. See S3 Express is All You Need for details on latency/cost tradeoffs with object storage.