What is Oxia?


Oxia: a scalable metadata store and coordination system. Original image credited to xkcd.com/2347, alterations by Qiang Zhao.

Oxia is a metadata store and coordination system with the same programming model as ZooKeeper, Etcd, or Chubby — linearizable keys, sessions, ephemerals, notifications, compare-and-swap — but sharded across many nodes. Throughput and capacity grow with the cluster instead of hitting a single-leader ceiling. It is open source and a CNCF Sandbox project.
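
To make the programming model concrete, here is a minimal Go sketch of the read-modify-write pattern that compare-and-swap enables. The `Client` interface below is a hypothetical stand-in for the shape of the API; the real Go client's method names, option types, and return values may differ, so treat this as an illustration rather than the client's actual interface.

```go
// A minimal sketch of the programming model described above, written against a
// hypothetical client interface (the real Go client's methods and options may
// be shaped differently).
package main

import (
	"context"
	"errors"
	"fmt"
)

// Version carries the per-record version identifier returned by every write;
// it is what compare-and-swap operations check against.
type Version struct {
	VersionID int64
}

// ErrVersionConflict models a failed compare-and-swap.
var ErrVersionConflict = errors.New("expected version does not match")

// Client is a hypothetical slice of the programming model: linearizable
// per-key reads and writes, plus conditional (CAS) updates.
type Client interface {
	Get(ctx context.Context, key string) (value []byte, ver Version, err error)
	Put(ctx context.Context, key string, value []byte) (Version, error)
	// PutIfVersion succeeds only if the record's current version matches expected.
	PutIfVersion(ctx context.Context, key string, value []byte, expected Version) (Version, error)
}

// bumpConfig shows read-modify-write with CAS: read a record, change it, and
// write back only if nobody else wrote in between.
func bumpConfig(ctx context.Context, c Client, key string, update func([]byte) []byte) error {
	for {
		old, ver, err := c.Get(ctx, key)
		if err != nil {
			return err
		}
		if _, err := c.PutIfVersion(ctx, key, update(old), ver); err == nil {
			return nil
		} else if !errors.Is(err, ErrVersionConflict) {
			return err
		}
		// Lost the race; re-read and retry.
	}
}

func main() {
	fmt.Println("sketch only; wire up a real Oxia client to run")
}
```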

The problem

ZooKeeper, Etcd, and Consul all share one architecture: a single replicated state machine driven by Paxos or Raft. Every node holds the full dataset, every write passes through the leader, and periodic snapshots of the full state machine stall I/O. At scale that means:

  • A hard ceiling of a few gigabytes of metadata — full-dataset snapshots become prohibitive past that.
  • Write throughput capped by one leader, regardless of cluster size.
  • Long tail latencies from snapshot I/O and GC pressure.

Apache Pulsar hit this head-on: cursor positions had to live in BookKeeper ledgers — separate from the rest of its metadata — because ZooKeeper couldn’t sustain the write rate. That split metadata across two systems and doubled the operational surface area.

Sharded key–value stores (FoundationDB, TiKV, CockroachDB, Redis Cluster) solve the scaling problem but lack the coordination primitives — sessions, ephemerals, ordered notifications, atomic sequencing — that distributed systems need to operate. Teams end up either rebuilding those primitives on top of a generic KV store, or running a separate coordination service alongside it and paying to operate both.

Oxia’s approach

Oxia partitions each namespace into independently replicated shards. Each shard has its own leader, followers, write-ahead log, and LSM-backed state (Pebble). Shards are assigned and re-elected by a lightweight coordinator that is off the data path; the coordinator stores its own state in Kubernetes (as a ConfigMap) or in a built-in Raft group when Kubernetes is not available.
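
As a rough illustration of what "independently replicated shards" means for routing, the sketch below assumes hash-range partitioning: a key is hashed, and the resulting value falls into exactly one shard's range. The hash function and range layout here are placeholders chosen for the example, not Oxia's actual ones.

```go
// Illustrative sketch of hash-range sharding: each shard owns a contiguous
// slice of a 32-bit hash space, and a key is routed to the shard whose range
// contains its hash. The hash function below is a placeholder.
package main

import (
	"fmt"
	"hash/fnv"
)

// shard describes one partition of a namespace: an id and the hash range it owns.
type shard struct {
	id      int64
	minHash uint32 // inclusive
	maxHash uint32 // inclusive
}

// route returns the shard whose hash range contains the key's hash. Each shard
// has its own leader and log, so writes to different shards never contend.
func route(key string, shards []shard) shard {
	h := fnv.New32a()
	h.Write([]byte(key))
	kh := h.Sum32()
	for _, s := range shards {
		if kh >= s.minHash && kh <= s.maxHash {
			return s
		}
	}
	return shards[len(shards)-1] // ranges cover the full space, so this is unreachable
}

func main() {
	shards := []shard{
		{id: 0, minHash: 0, maxHash: 1<<31 - 1},
		{id: 1, minHash: 1 << 31, maxHash: 1<<32 - 1},
	}
	fmt.Println(route("/brokers/broker-1", shards).id)
}
```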

Each shard runs a log-replication protocol that is deliberately not Raft. Raft fuses election logic and log matching into the data-path code, which makes the hot path more complex than it needs to be. Oxia instead separates log replication (optimized for throughput) from leader election and recovery (optimized for correctness) — the same principle that Apache BookKeeper uses. Safety is enforced by the coordinator: it alone advances each shard’s term, fences stale leaders before electing new ones, and drives recovery. Split-brain is prevented by construction, not by vote counting.
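
The sketch below illustrates the fencing rule in isolation: the coordinator bumps a node's term, and from then on the node refuses replication traffic from any leader elected at a lower term. It is a toy model of the idea, not Oxia's actual replication RPCs or data structures.

```go
// A simplified model of term-based fencing: every replication request carries
// the term under which the sender was elected, and a node fenced at a higher
// term by the coordinator rejects it.
package main

import (
	"errors"
	"fmt"
	"sync"
)

var ErrStaleTerm = errors.New("request term is older than the node's fenced term")

// follower holds the highest term this node has observed. The coordinator is
// the only actor that increases it (by fencing the node before a new election).
type follower struct {
	mu   sync.Mutex
	term int64
}

// Fence is invoked by the coordinator: after it returns, any leader elected at
// a lower term can no longer make progress on this node.
func (f *follower) Fence(newTerm int64) {
	f.mu.Lock()
	defer f.mu.Unlock()
	if newTerm > f.term {
		f.term = newTerm
	}
}

// Append models a replication request from a leader elected at leaderTerm.
func (f *follower) Append(leaderTerm int64, entry []byte) error {
	f.mu.Lock()
	defer f.mu.Unlock()
	if leaderTerm < f.term {
		return ErrStaleTerm // the old leader has been fenced; drop its writes
	}
	// ... append entry to the write-ahead log ...
	return nil
}

func main() {
	f := &follower{}
	_ = f.Append(1, []byte("x"))          // accepted: term 1 is current
	f.Fence(2)                            // coordinator fences before electing a new leader
	fmt.Println(f.Append(1, []byte("y"))) // rejected: stale term
}
```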

On top of that sharded substrate, Oxia exposes coordination primitives that generic KV stores don’t have: sessions with ephemeral records, reliably ordered notifications, and atomic multi-dimensional sequence assignment fused with a Put. These are the building blocks of leader election, fencing, service discovery, cache coherence, and streaming offset assignment.
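
For example, ephemeral records plus notifications are enough to build service discovery: each broker registers itself under a session-owned key, and consumers watch the notification stream to keep a live membership view. The `Registry` interface and notification types below are hypothetical stand-ins for the client's actual API.

```go
// A sketch of service discovery built from sessions, ephemerals, and
// notifications. The Registry interface below is hypothetical.
package main

import "context"

// NotificationType distinguishes creations, modifications, and deletions.
type NotificationType int

const (
	KeyCreated NotificationType = iota
	KeyModified
	KeyDeleted
)

// Notification describes one change to a key, delivered in order.
type Notification struct {
	Type NotificationType
	Key  string
}

// Registry is a hypothetical slice of the client: ephemeral puts and notifications.
type Registry interface {
	// PutEphemeral writes a record that is deleted automatically when the
	// owning session expires (e.g. the broker crashes).
	PutEphemeral(ctx context.Context, key string, value []byte) error
	// Notifications streams ordered change events for keys in the namespace.
	Notifications(ctx context.Context) (<-chan Notification, error)
}

// register announces a broker under a well-known prefix; if the broker dies,
// its session expires and the record disappears without any explicit cleanup.
func register(ctx context.Context, r Registry, brokerID, addr string) error {
	return r.PutEphemeral(ctx, "/brokers/"+brokerID, []byte(addr))
}

// watchBrokers maintains a live view of broker keys from the notification stream.
func watchBrokers(ctx context.Context, r Registry, onChange func(key string, alive bool)) error {
	ch, err := r.Notifications(ctx)
	if err != nil {
		return err
	}
	for n := range ch {
		switch n.Type {
		case KeyCreated, KeyModified:
			onChange(n.Key, true)
		case KeyDeleted:
			onChange(n.Key, false)
		}
	}
	return nil
}

func main() {}
```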

What you get

  • Per-key linearizability, with read-after-write and durable-after-ack semantics.
  • Horizontal scale: throughput and capacity grow by adding storage nodes. Hundreds of gigabytes of metadata per cluster, no full-dataset snapshots.
  • Coordination primitives: sessions, ephemerals, compare-and-swap, range scans, and reliable notifications that clients can resume from an acknowledged offset.
  • Atomic sequences with arbitrary +N deltas and multi-dimensional counters, fused with value storage — the primitive that lets StreamNative’s Ursa assign offsets from any availability zone without per-partition leaders (see the sketch after this list).
  • Cloud-native operation: deploy with a single Helm chart on Kubernetes, or run self-contained with a built-in Raft coordinator on bare metal. Shards split and merge automatically as load changes.
  • Observability: OpenTelemetry metrics out of the box; Grafana dashboards in the repo.
  • Formally verified: the single-shard protocol is modeled in TLA+; the production code is continuously validated with Maelstrom (Jepsen-style linearizability checks) and Chaos Mesh failure injection.
  • Clients in Go, Java, Python, and Node.js.
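
The sketch below (referenced from the atomic-sequences bullet above) shows the shape of a multi-dimensional sequenced put used for offset assignment: one call stores the batch and advances both a record-offset counter and a byte-position counter by caller-supplied deltas. The `SequencePut` signature is an assumption made for illustration, not the real client API.

```go
// A sketch of the multi-dimensional sequence primitive: a single put stores a
// value and atomically advances several named counters by caller-chosen
// deltas, returning the values assigned to this write. The SequencePut type is
// a hypothetical stand-in for the real API.
package main

import (
	"context"
	"fmt"
)

// SequencePut writes `value` under `keyPrefix`, advancing each named counter
// by its delta in the same atomic, replicated operation, and returns the
// counter values assigned to this write.
type SequencePut func(ctx context.Context, keyPrefix string, value []byte, deltas map[string]int64) (assigned map[string]int64, err error)

// appendBatch shows how a leaderless broker could assign Kafka-style offsets:
// any broker in any zone calls this for the same partition, and the store
// hands back a unique, monotonically increasing (offset, byte position) pair.
func appendBatch(ctx context.Context, put SequencePut, partition string, batch []byte, records int64) (offset, bytePos int64, err error) {
	assigned, err := put(ctx, "/partitions/"+partition+"/batches", batch, map[string]int64{
		"offset": records,           // advance the record offset by the batch's record count
		"bytes":  int64(len(batch)), // advance the byte position by the batch's size
	})
	if err != nil {
		return 0, 0, err
	}
	return assigned["offset"], assigned["bytes"], nil
}

func main() {
	fmt.Println("sketch only; plug in a real sequenced-put implementation to run")
}
```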

What Oxia is not

  • Not globally linearizable across shards. Each shard is linearizable on its own; cross-shard ordering is the application’s responsibility. For coordination workloads that’s what you want anyway — fencing, leader election, and offset assignment all operate on a single key or key range.
  • Not a transactional database. There are no multi-key transactions that span shards.

How Oxia compares

|  | ZooKeeper / Etcd | Sharded KV (FoundationDB, TiKV, …) | Oxia |
|---|---|---|---|
| Scaling model | Single RSM | Sharded | Sharded |
| Write throughput ceiling | Bounded by one leader | Scales horizontally | Scales horizontally |
| Metadata capacity | ~GBs (snapshot-bounded) | Scales horizontally | Hundreds of GBs, LSM-backed |
| Consensus | Paxos / Raft | Multi-range Raft | Per-shard log replication |
| Sessions / ephemerals | Yes | No | Yes |
| Ordered notifications | Watches (single-fire) | No | Yes — reliable and resumable |
| Atomic sequencing | +1 znodes only | No | Arbitrary +N, multi-dimensional, fused with Put |
| Full-dataset snapshots | Yes (I/O stalls) | No | No (WAL + LSM) |

Performance

On YCSB-style benchmarks Oxia sustains throughput on the order of 10× that of a 3-node ZooKeeper or Etcd on comparable hardware, and scales near-linearly as storage nodes are added. Skewed (Zipf) workloads do not collapse the cluster — sharding spreads hot keys across shard leaders, so a few popular records don’t bottleneck the system the way they do in a single-RSM design. Full benchmark numbers will be published with the forthcoming paper.

Who is using Oxia?



| Project | Usage |
|---|---|
| Apache Pulsar | Metadata substrate, replacing ZooKeeper for topic ownership, partition routing, segment metadata, broker registration, and — unlike the ZooKeeper-based architecture — cursor positions stored directly in Oxia instead of a separate BookKeeper path. Pulsar 4.2 ships an official ZooKeeper-to-Oxia migration framework (PIP-454). In production for 2+ years. |
| Apache BookKeeper | Metadata storage. |
| StreamNative Ursa | Metadata substrate and atomic offset assignment. Ursa is a Kafka-compatible, lakehouse-native streaming engine that operates without per-partition leaders: any broker in any availability zone can ingest data for a given partition. Ursa’s leaderless, geo-distributed architecture depends on Oxia’s atomic multi-dimensional sequences to assign offsets and byte positions in a single replicated operation. |

Next steps

  • Getting started — run Oxia locally and try your first operations.
  • Logical architecture — shards, namespaces, and the leader-based design in more depth.
  • Design goals — the reasoning behind the replication protocol and operational model.