
Metrics Reference

Oxia exposes OpenTelemetry metrics on every process. They are collected via the Prometheus registry by default and scraped at /metrics on the metrics bind address — 0.0.0.0:8080 out of the box. This page enumerates the metrics each process publishes and what each one means. Pre-built Grafana dashboards that use these metrics are in the Oxia repository under deploy/dashboards.

Conventions

  • All metric names are prefixed with oxia_. Client SDK metrics use oxia_client_*, storage-node metrics use oxia_server_* / oxia_dataserver_*, and coordinator metrics use oxia_coordinator_*.
  • Latency histograms are in milliseconds. Byte counters and gauges are in bytes; plain counters are dimensionless (count).
  • Most storage-node metrics carry two labels: oxia_namespace and shard.
  • Histograms are published with Prometheus’s standard _bucket / _sum / _count suffixes; counters that wrap a timer (e.g. oxia_client_op) publish _sum and _count.
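Because _sum and _count are cumulative, an average over a time window comes from the difference between two scrapes. The following is a minimal sketch of that arithmetic, not code from Oxia; the helper names histogramSeries and meanOverWindow and the sample values are hypothetical.

```go
package main

import "fmt"

// histogramSeries lists the series a histogram publishes, following the
// _bucket / _sum / _count convention described above.
func histogramSeries(name string) []string {
	return []string{name + "_bucket", name + "_sum", name + "_count"}
}

// meanOverWindow returns the average observation between two scrapes of the
// cumulative _sum and _count series. ok is false when nothing was recorded
// in the window or when a counter reset made a delta negative.
func meanOverWindow(sumPrev, countPrev, sumNow, countNow float64) (mean float64, ok bool) {
	dCount := countNow - countPrev
	dSum := sumNow - sumPrev
	if dCount <= 0 || dSum < 0 {
		return 0, false
	}
	return dSum / dCount, true
}

func main() {
	fmt.Println(histogramSeries("oxia_server_db_get_latency"))

	// Hypothetical values from two consecutive scrapes:
	// 600 ms accumulated over 200 operations.
	mean, ok := meanOverWindow(1200, 300, 1800, 500)
	fmt.Println(mean, ok)
}
```

For latency histograms the result is in milliseconds; for a timer such as oxia_client_op the same calculation applies to its _sum and _count series.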

Client metrics

Emitted by the Go SDK when a MeterProvider is configured (WithMeterProvider or WithGlobalMeterProvider). The metrics carry a type label identifying the operation (put, delete, delete_range, get) and a success label.

Name | Type | Unit | Purpose
oxia_client_op | timer | ms | Time for a single operation (by type).
oxia_client_op_value | histogram | bytes | Size of the value on put / get operations.
oxia_client_batch_total | timer | ms | Total time for a batched request, including batch linger.
oxia_client_batch_exec | timer | ms | Server-side execution time for a batch.
oxia_client_batch_request | histogram | count | Number of operations per batched request.
oxia_client_batch_value | histogram | bytes | Total payload size of a batched request.

Storage-node metrics

Published at /metrics on every oxia server (and the standalone binary).

Role counters

Name | Type | Unit | Purpose
oxia_server_leaders_count | up-down counter | count | Shards this node is currently leading.
oxia_server_followers_count | up-down counter | count | Shards this node is currently following.

Request counts and latencies (logical DB layer)

Name | Type | Unit | Purpose
oxia_server_db_puts | counter | count | Put operations applied.
oxia_server_db_deletes | counter | count | Delete operations applied.
oxia_server_db_delete_ranges | counter | count | DeleteRange operations applied.
oxia_server_db_gets | counter | count | Get operations served.
oxia_server_db_lists | counter | count | List operations served.
oxia_server_db_range_scans | counter | count | RangeScan operations served.
oxia_server_db_get_sequence_updates | counter | count | GetSequenceUpdates subscriptions opened.
oxia_server_db_batch_write_latency | histogram | ms | Time to apply a write batch to the DB.
oxia_server_db_get_latency | histogram | ms | Time to serve a Get.
oxia_server_db_list_latency | histogram | ms | Time to serve a List.

Replication (leader side)

Name | Type | Unit | Purpose
oxia_server_leader_write_latency | histogram | ms | Time to replicate and commit a write as leader.
oxia_server_leader_head_offset | gauge | offset | Head (latest uncommitted) WAL offset of the shard.
oxia_server_leader_commit_offset | gauge | offset | Commit (durable-on-quorum) WAL offset of the shard.
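One way to read the two offset gauges together is as a backlog: the distance between head and commit is what the leader has appended but not yet committed on a quorum. The sketch below assumes offsets advance by one per WAL entry, which this page does not state; the function name uncommitted and the sample readings are hypothetical.

```go
package main

import "fmt"

// uncommitted returns the distance between the head offset and the commit
// offset of one shard. Assumption: offsets advance by one per WAL entry, so
// the result can be read as a number of entries. Negative differences (for
// example from gauges sampled at slightly different instants) are clamped
// to zero.
func uncommitted(headOffset, commitOffset int64) int64 {
	if headOffset < commitOffset {
		return 0
	}
	return headOffset - commitOffset
}

func main() {
	// Hypothetical gauge readings for a single shard.
	fmt.Println(uncommitted(1050, 1042))
}
```

A value that stays high or keeps growing for a shard would suggest that followers are slow to acknowledge writes.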

Replication (follower / observer side)

Name | Type | Unit | Purpose
oxia_server_follower_write_latency | histogram | ms | Time to apply replicated writes on a follower.
oxia_server_follower_ack_offset | gauge | offset | Per-follower ack offset. Labelled by follower identity.
oxia_server_observer_ack_offset | gauge | offset | Per-observer ack offset.

Snapshots

Emitted when a new replica catches up via a full DB snapshot.

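Name | Type | Unit | Purpose
oxia_server_snapshots_started | counter | count | Full-snapshot transfers initiated (to followers).
oxia_server_snapshots_completed | counter | count | Full-snapshot transfers completed.
oxia_server_snapshots_failed | counter | count | Full-snapshot transfers that failed.
oxia_server_snapshots_sent | counter | bytes | Data sent during snapshot transfer.
oxia_server_snapshots_transfer_time | histogram | ms | Duration of a snapshot transfer.
oxia_server_observer_snapshots_started | counter | count | Same as oxia_server_snapshots_started, for observer replicas.
oxia_server_observer_snapshots_completed | counter | count | Same as oxia_server_snapshots_completed, for observer replicas.
oxia_server_observer_snapshots_failed | counter | count | Same as oxia_server_snapshots_failed, for observer replicas.
oxia_server_observer_snapshots_sent | counter | bytes | Same as oxia_server_snapshots_sent, for observer replicas.
oxia_server_observer_snapshots_transfer_time | histogram | ms | Same as oxia_server_snapshots_transfer_time, for observer replicas.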

WAL

Name | Type | Unit | Purpose
oxia_server_wal_append | counter | bytes | Bytes appended to the WAL.
oxia_server_wal_append_latency | histogram | ms | Append latency (excluding fsync).
oxia_server_wal_sync_latency | histogram | ms | fsync latency on the WAL file.
oxia_server_wal_read | counter | bytes | Bytes read from the WAL (replication, recovery).
oxia_server_wal_read_latency | histogram | ms | Read latency.
oxia_server_wal_trim | counter | count | Retention-driven trim operations.
oxia_server_wal_read_errors | counter | count | I/O errors on WAL reads.
oxia_server_wal_write_errors | counter | count | I/O errors on WAL writes.
oxia_server_wal_entries | gauge | count | Currently active entries in the WAL.

KV store (low-level Pebble batching)

Name | Type | Unit | Purpose
oxia_server_kv_batch_commit_latency | histogram | ms | Time to commit a Pebble write batch.
oxia_server_kv_read_latency | histogram | ms | Pebble read latency.
oxia_server_kv_write | counter | bytes | Bytes written to Pebble.
oxia_server_kv_read | counter | bytes | Bytes read from Pebble.
oxia_server_kv_write_ops | counter | count | Pebble write ops.
oxia_server_kv_write_errors | counter | count | Write errors.
oxia_server_kv_read_errors | counter | count | Read errors.
oxia_server_kv_batch_size | histogram | bytes | Size of each Pebble write batch.
oxia_server_kv_batch_count | histogram | count | Operations per Pebble write batch.

Pebble internals

Pebble exposes its own internal counters. Oxia re-publishes them under the oxia_server_kv_pebble_* prefix.

Name | Type | Unit | Purpose
oxia_server_kv_pebble_max_cache_size | gauge | bytes | Block-cache capacity.
oxia_server_kv_pebble_block_cache_used | gauge | bytes | Block-cache in-use bytes.
oxia_server_kv_pebble_block_cache_hits | gauge | count | Cumulative hits.
oxia_server_kv_pebble_block_cache_misses | gauge | count | Cumulative misses.
oxia_server_kv_pebble_read_iterators | gauge | count | Iterators currently open.
oxia_server_kv_pebble_compactions_total | gauge | count | Compactions performed.
oxia_server_kv_pebble_compaction_debt | gauge | bytes | Estimated bytes still to compact.
oxia_server_kv_pebble_flush_total | gauge | count | Memtable flushes.
oxia_server_kv_pebble_flush | gauge | bytes | Bytes flushed.
oxia_server_kv_pebble_memtable_size | gauge | bytes | Memtable size.
oxia_server_kv_pebble_disk_space | gauge | bytes | Total size of all DB files.
oxia_server_kv_pebble_num_files_total | gauge | count | Total SST files.
oxia_server_kv_pebble_read | gauge | bytes | Bytes read (LSM level).
oxia_server_kv_pebble_write_amplification_percent | gauge | count | Write amplification percentage.
oxia_server_kv_pebble_per_level_num_files | gauge | count | Files per LSM level (labelled level).
oxia_server_kv_pebble_per_level_size | gauge | bytes | Size per level.
oxia_server_kv_pebble_per_level_read | gauge | bytes | Bytes read per level.
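The cumulative hit and miss gauges are most useful combined into a hit ratio. The sketch below shows the calculation under the assumption that both values come from the same scrape of the same shard; hitRatio is a hypothetical helper and the inputs are made-up numbers.

```go
package main

import "fmt"

// hitRatio returns hits / (hits + misses) for the Pebble block cache, using
// the values of oxia_server_kv_pebble_block_cache_hits and
// oxia_server_kv_pebble_block_cache_misses. ok is false when the cache has
// not been accessed yet, so the ratio is undefined.
func hitRatio(hits, misses float64) (ratio float64, ok bool) {
	total := hits + misses
	if total <= 0 {
		return 0, false
	}
	return hits / total, true
}

func main() {
	// Hypothetical readings: 750 hits and 250 misses since process start.
	ratio, ok := hitRatio(750, 250)
	fmt.Println(ratio, ok)
}
```

Since both gauges are cumulative, passing the deltas between two scrapes instead of the raw values gives the ratio for that window rather than since process start.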

Checksums

Name | Type | Unit | Purpose
oxia_dataserver_db_checksum | gauge | count | Current DB checksum, used by the checksum scheduler to detect replica divergence.
oxia_dataserver_wal_checksum | gauge | count | Current WAL checksum.

Notifications

Name | Type | Unit | Purpose
oxia_server_notifications_read | counter | count | Notification events read by clients.
oxia_server_notifications_read_batches | counter | count | Notification batches delivered.

Sessions

Name | Type | Unit | Purpose
oxia_server_sessions_created | counter | count | Sessions created.
oxia_server_sessions_closed | counter | count | Sessions closed explicitly.
oxia_server_sessions_expired | counter | count | Sessions expired (missed heartbeats / partition).
oxia_server_session_active | gauge | count | Sessions currently active.

Shard assignments

Name | Type | Unit | Purpose
oxia_server_shards_assignments_active_clients | gauge | count | Clients currently subscribed to the GetShardAssignments stream.

Coordinator metrics

Published at /metrics on the oxia coordinator process.

Leader election

Name | Type | Unit | Purpose
oxia_coordinator_leader_election_latency | histogram | ms | End-to-end leader-election duration.
oxia_coordinator_leader_election_failed | counter | count | Failed leader elections.
oxia_coordinator_new_term_quorum_latency | histogram | ms | Time to advance the ensemble to a new term.
oxia_coordinator_become_leader_latency | histogram | ms | Time from election to the new leader serving traffic.
oxia_coordinator_term | gauge | count | Current term of each shard.

Node health

Name | Type | Unit | Purpose
oxia_coordinator_node_health_checks_failed | counter | count | Health-check failures observed against a data server.
oxia_coordinator_node_running | gauge | count | 1 if the coordinator considers the node alive, 0 otherwise.

Metadata persistence

Name | Type | Unit | Purpose
oxia_coordinator_metadata_get_latency | histogram | ms | Time to read coordinator metadata from the backing provider.
oxia_coordinator_metadata_store_latency | histogram | ms | Time to store coordinator metadata.
oxia_coordinator_metadata_size | gauge | bytes | Size of the coordinator metadata.

Scraping

Every Oxia process exposes a Prometheus /metrics endpoint on its metrics bind address. Defaults:

  • Storage node: 0.0.0.0:8080 (override with -m, --metrics-addr or observability.metric.bindAddress).
  • Coordinator: same.
  • Standalone: same.

On Kubernetes the Helm chart ships a ServiceMonitor for the Prometheus operator; see Kubernetes resources.

Source of truth

Metric registrations are spread across the relevant subsystems in the Oxia source.
