Announcing Apache Doris Target
Direct CDC from operational databases to VeloDB Cloud and Apache Doris with native Merge-on-Write.
Supermetal now replicates into Apache Doris and VeloDB Cloud from every supported source: Postgres, MySQL, MongoDB, SQL Server, and Oracle. CDC updates and deletes flow through Doris's native Merge-on-Write.
A single Rust binary handles both snapshot and CDC, loading into Doris via the S3 TVF or Stream Load.
Single Process CDC
A typical Doris CDC stack runs four components in sequence:
┌───────────┐     ┌───────────┐     ┌───────────┐     ┌─────────────┐     ┌──────────────┐
│ Source DB │ ──► │ Debezium  │ ──► │   Kafka   │ ──► │ Flink/Spark │ ──► │ Apache Doris │
└───────────┘     └───────────┘     └───────────┘     └─────────────┘     └──────────────┘
                 (Source Conn.)   (Message Broker)   (Compute Cluster)
Debezium decodes the change log to row-oriented Avro or JSON for Kafka. Flink decodes from Kafka, transforms, and re-encodes for Stream Load. Every hop pays per-row encode/decode.
Apache Doris is built for high-throughput, low-latency ingestion, but the multi-hop pipeline caps throughput and adds latency at every hop. Diagnosing a failure means tracing it through three systems.
Supermetal runs as a single process, deployed directly in your infrastructure:
┌───────────┐     ┌────────────┐     ┌──────────────┐     ┌─────────────────┐
│ Source DB │ ──► │ Supermetal │ ──► │ Object Store │ ──► │ VeloDB Cloud /  │
└───────────┘     └────────────┘     │  (optional)  │     │ Apache Doris    │
                                     └──────────────┘     └─────────────────┘
                                        S3 / Azure
Supermetal encodes rows once into Arrow at the source. They stay columnar through Parquet into Doris. With an object store buffer, Supermetal writes Parquet files to the object store and Doris pulls them via the S3 TVF. Without one, Supermetal writes Parquet to local disk and sends it to Doris via Stream Load.
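The choice between the two load paths can be sketched as follows. This is an illustration, not Supermetal's implementation: the function names, the `orders` table, the `demo` database, the `doris-fe` host, and the endpoint and region are all hypothetical, and the credentials are placeholders. The S3 TVF property names and the Stream Load URL shape follow the Doris documentation as we understand it; check them against your Doris version. Nothing is sent over the network; the code only builds the statement or request.

```python
from typing import Optional, Tuple, Union


def s3_tvf_insert(table: str, parquet_uri: str, endpoint: str, region: str) -> str:
    """Build an INSERT ... SELECT over Doris's S3 table-valued function."""
    return (
        f"INSERT INTO {table} "
        f'SELECT * FROM S3("uri" = "{parquet_uri}", '
        f'"format" = "parquet", '
        f'"s3.endpoint" = "{endpoint}", "s3.region" = "{region}", '
        f'"s3.access_key" = "<access_key>", "s3.secret_key" = "<secret_key>")'
    )


def stream_load_request(table: str, local_path: str, fe_host: str,
                        db: str, label: str) -> dict:
    """Describe the HTTP PUT used by Stream Load (built here, not sent)."""
    return {
        "method": "PUT",
        # 8030 is the default FE HTTP port; adjust for your cluster.
        "url": f"http://{fe_host}:8030/api/{db}/{table}/_stream_load",
        "headers": {"format": "parquet", "label": label, "Expect": "100-continue"},
        "body_file": local_path,
    }


def plan_load(table: str, local_path: str,
              object_store_uri: Optional[str]) -> Tuple[str, Union[str, dict]]:
    """Pick the S3 TVF path when an object store buffer exists, else Stream Load."""
    if object_store_uri is not None:
        sql = s3_tvf_insert(table, object_store_uri,
                            "s3.us-east-1.amazonaws.com", "us-east-1")
        return "s3_tvf", sql
    return "stream_load", stream_load_request(
        table, local_path, "doris-fe", "demo", "batch-0001")
```

With an object store URI the planner returns SQL for Doris to execute, so Doris pulls the file itself. Without one, it returns a request that pushes the local Parquet file to Doris.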
Updates and Deletes
Supermetal creates Doris Unique Key tables with Merge-on-Write for every CDC target.
With a source primary key, the Unique Key uses those columns. _sm_version, derived from the source's transaction-log position, is the sequence column, so merges resolve to the newest change even when batches are retried. Deletes set Doris's hidden __DORIS_DELETE_SIGN__ column.
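To make the merge ordering concrete, here is a simplified in-memory model of what a sequence column buys under retries. It is not Doris code: the `Change` type and `merge` function are hypothetical stand-ins, and the rule assumed here is that an incoming row replaces the stored one when its sequence value is greater than or equal to the stored value. In Doris DDL the corresponding table properties are `enable_unique_key_merge_on_write` and `function_column.sequence_col`.

```python
from typing import Dict, List, NamedTuple


class Change(NamedTuple):
    key: int        # source primary key value
    version: int    # stand-in for _sm_version
    deleted: bool   # stand-in for __DORIS_DELETE_SIGN__
    payload: str    # remaining column values, collapsed to one string


def merge(changes: List[Change]) -> Dict[int, str]:
    """Keep the highest-version change per key, then hide deleted keys."""
    latest: Dict[int, Change] = {}
    for change in changes:
        current = latest.get(change.key)
        # An incoming change wins only if its version is not older.
        if current is None or change.version >= current.version:
            latest[change.key] = change
    return {k: c.payload for k, c in latest.items() if not c.deleted}


log = [
    Change(1, 100, False, "a"),
    Change(1, 101, False, "b"),
    Change(2, 102, False, "x"),
    Change(2, 103, True, ""),
]

in_order = merge(log)
replayed = merge(log + log[:2])         # a retry resends the first two changes
reordered = merge(list(reversed(log)))  # older versions arrive late
```

All three calls produce the same visible state: key 1 holds its newest payload and key 2 is gone. Resent or late changes do not overwrite newer data.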
Without a source primary key, the Unique Key uses _sm_id, a row-content hash. Replays and retries dedupe against the hash, keeping inserts idempotent. _sm_version and _sm_deleted are regular columns. Schema changes (column drops or renames) invalidate the hash and require resync.
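The idea behind a row-content hash key can be sketched as below. The hash function (SHA-256) and the JSON canonicalization are assumptions chosen for the example; the post does not specify how Supermetal computes _sm_id. `row_id` and `apply_inserts` are hypothetical names.

```python
import hashlib
import json


def row_id(row: dict) -> str:
    """Hash the full row content; key order must not affect the result."""
    canonical = json.dumps(row, sort_keys=True, separators=(",", ":"), default=str)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def apply_inserts(table: dict, rows: list) -> None:
    """Upsert rows keyed by content hash, so a replayed row overwrites itself."""
    for row in rows:
        table[row_id(row)] = row


table: dict = {}
batch = [
    {"name": "ada", "city": "london"},
    {"name": "alan", "city": "bletchley"},
]
apply_inserts(table, batch)
apply_inserts(table, batch)  # retry of the same batch adds nothing new

# Dropping a column changes the content, and therefore the hash.
before_drop = row_id({"name": "ada", "city": "london"})
after_drop = row_id({"name": "ada"})
```

The second `apply_inserts` call leaves the table at two rows, which is the idempotence property. The last two lines show why a column drop or rename needs a resync: the same logical row hashes to a different key, so it can no longer match rows written earlier.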
Performance
The benchmark replicates Postgres to VeloDB Cloud on the TPC-H dataset. The snapshot covers SF10–SF50. CDC runs at 5K–50K ops/sec.
SF50 (433M rows) loads in 6m 11s. SF10 (86.6M rows) finishes in 1m 16s. Supermetal reads all 8 tables in parallel at a sustained ~1.5M rows/sec (~290 MB/sec), writing Parquet to object storage. Doris pulls those files via the S3 TVF while the source read is still running.
End-to-end p100 latency stays at 7–9s from 5K to 25K ops/sec. The 5s floor is the default flush interval (configurable), and the Doris write takes under 2s at every tier.
Throughput matches the target through 30K ops/sec, slipping to 96% at 40K. Postgres logical decoding is single-threaded and saturates around 40K rows/sec on this RDS instance. Latency growth above that point comes from the source, not Supermetal. The Breakdown View shows read latency climbing at higher tiers.
Get started in minutes
macOS / Linux:
curl -fsSL https://trial.supermetal.io/install.sh | sh
Windows (PowerShell):
iwr -useb https://trial.supermetal.io/install.ps1 | iex