ConfigServer Metrics

Name Description Unit

configserver.requests

Number of requests processed request

configserver.failedRequests

Number of requests that failed request

configserver.latency

Time to complete requests millisecond

configserver.cacheConfigElems

Number of config elements in the cache item

configserver.cacheChecksumElems

Number of checksum elements in the cache item

configserver.hosts

The number of nodes being served configuration from the config server cluster node

configserver.tenants

The number of tenants being served configuration from the config server cluster instance

configserver.applications

The number of applications being served configuration from the config server cluster instance

configserver.delayedResponses

Number of delayed responses response

configserver.sessionChangeErrors

Number of session change errors session

configserver.unknownHostRequests

Config requests from unknown hosts request

configserver.newSessions

New config sessions session

configserver.preparedSessions

Prepared config sessions session

configserver.activeSessions

Active config sessions session

configserver.inactiveSessions

Inactive config sessions session

configserver.addedSessions

Added config sessions session

configserver.removedSessions

Removed config sessions session

configserver.rpcServerWorkQueueSize

Number of elements in the RPC server work queue item

maintenanceDeployment.transientFailure

Number of maintenance deployments that failed with a transient failure operation

maintenanceDeployment.failure

Number of maintenance deployments that failed with a permanent failure operation

maintenance.successFactorDeviation

Config server maintenance success factor deviation fraction

maintenance.duration

Config server maintenance duration millisecond

configserver.zkConnectionLost

Number of ZooKeeper connections lost connection

configserver.zkReconnected

Number of ZooKeeper reconnections connection

configserver.zkConnected

Number of ZooKeeper nodes connected node

configserver.zkSuspended

Number of ZooKeeper nodes suspended node

configserver.zkZNodes

Number of ZooKeeper nodes present node

configserver.zkAvgLatency

Average latency for ZooKeeper requests millisecond

configserver.zkMaxLatency

Max latency for ZooKeeper requests millisecond

configserver.zkConnections

Number of ZooKeeper connections connection

configserver.zkOutstandingRequests

Number of ZooKeeper requests in flight request

orchestrator.lock.acquire-latency

Time to acquire the ZooKeeper lock second

orchestrator.lock.acquire-success

Number of times the ZooKeeper lock has been acquired successfully operation

orchestrator.lock.acquire-timedout

Number of times the ZooKeeper lock could not be acquired within the timeout operation

orchestrator.lock.acquire

Number of attempts to acquire the ZooKeeper lock operation

orchestrator.lock.acquired

Number of times the ZooKeeper lock was acquired operation

orchestrator.lock.hold-latency

Time the ZooKeeper lock was held before it was released second

nodes.active

The number of active nodes in a cluster node

nodes.nonActive

The number of non-active nodes in a cluster node

nodes.nonActiveFraction

The fraction of non-active nodes vs total nodes in a cluster node

nodes.exclusiveSwitchFraction

The fraction of nodes in a cluster on exclusive network switches fraction

nodes.emptyExclusive

The number of exclusive hosts that do not have any nodes allocated to them node

nodes.expired.deprovisioned

The number of deprovisioned nodes that have expired node

nodes.expired.dirty

The number of dirty nodes that have expired node

nodes.expired.inactive

The number of inactive nodes that have expired node

nodes.expired.provisioned

The number of provisioned nodes that have expired node

nodes.expired.reserved

The number of reserved nodes that have expired node

cluster.cost

The cost of the nodes allocated to a certain cluster, in $/hr dollar_per_hour

cluster.load.ideal.cpu

The ideal cpu load of a certain cluster fraction

cluster.load.ideal.memory

The ideal memory load of a certain cluster fraction

cluster.load.ideal.disk

The ideal disk load of a certain cluster fraction

cluster.load.peak.cpu

The peak cpu load in the period considered of a certain cluster fraction

cluster.load.peak.memory

The peak memory load in the period considered of a certain cluster fraction

cluster.load.peak.disk

The peak disk load in the period considered of a certain cluster fraction

zone.working

The value 1 if zone is considered healthy, 0 if not. This is decided by considering the number of non-active nodes vs the number of active nodes in a zone binary

cache.nodeObject.hitRate

The fraction of cache hits vs cache lookups for the node object cache fraction

cache.nodeObject.evictionCount

The number of cache elements evicted from the node object cache item

cache.nodeObject.size

The number of cache elements in the node object cache item

cache.curator.hitRate

The fraction of cache hits vs cache lookups for the curator cache fraction

cache.curator.evictionCount

The number of cache elements evicted from the curator cache item

cache.curator.size

The number of cache elements in the curator cache item

wantedRestartGeneration

Wanted restart generation for tenant node generation

currentRestartGeneration

Current restart generation for tenant node generation

wantToRestart

One if node wants to restart, zero if not binary

wantedRebootGeneration

Wanted reboot generation for tenant node generation

currentRebootGeneration

Current reboot generation for tenant node generation

wantToReboot

One if node wants to reboot, zero if not binary

retired

One if node is retired, zero if not binary

wantedVespaVersion

Wanted Vespa version for the node. Major version is not included here version

currentVespaVersion

Current Vespa version for the node. Major version is not included here version

wantToChangeVespaVersion

One if node wants to change Vespa version, zero if not binary

hasWireguardKey

One if node has a WireGuard key, zero if not binary

wantToRetire

One if node wants to retire, zero if not binary

wantToDeprovision

One if node wants to be deprovisioned, zero if not binary

failReport

One if there is a fail report for the node, zero if not binary

suspended

One if the node is suspended, zero if not binary

suspendedSeconds

The number of seconds the node has been suspended second

activeSeconds

The number of seconds the node has been active second

numberOfServicesUp

The number of services confirmed to be running on a node instance

numberOfServicesNotChecked

The number of services supposed to run on a node that have not been checked instance

numberOfServicesDown

The number of services confirmed to not be running on a node instance

someServicesDown

One if one or more services have been confirmed to not be running on a node, zero if not binary

numberOfServicesUnknown

The number of services on a node whose running state is unknown to the config server instance

nodeFailerBadNode

One if the node is failed due to being bad, zero if not binary

downInNodeRepo

One if the node is registered as being down in the node repository, zero if not binary

numberOfServices

Number of services supposed to run on a node instance

lockAttempt.acquireMaxActiveLatency

Maximum duration for keeping a lock, ending during the metrics snapshot, or still being kept at the end of this snapshot period second

lockAttempt.acquireHz

Average number of locks acquired per second during the snapshot period operation_per_second

lockAttempt.acquireLoad

Average number of locks held concurrently during the snapshot period operation

lockAttempt.lockedLatency

Longest lock duration in the snapshot period second

lockAttempt.lockedLoad

Average number of locks held concurrently during the snapshot period operation

lockAttempt.acquireTimedOut

Number of locking attempts that timed out during the snapshot period operation

lockAttempt.deadlock

Number of lock grab deadlocks detected during the snapshot period operation

lockAttempt.errors

Number of other lock-related errors detected during the snapshot period operation

hostedVespa.docker.totalCapacityCpu

Total number of VCPUs on tenant hosts managed by hosted Vespa in a zone vcpu

hostedVespa.docker.totalCapacityMem

Total amount of memory on tenant hosts managed by hosted Vespa in a zone gigabyte

hostedVespa.docker.totalCapacityDisk

Total amount of disk space on tenant hosts managed by hosted Vespa in a zone gigabyte

hostedVespa.docker.freeCapacityCpu

Total number of free VCPUs on tenant hosts managed by hosted Vespa in a zone vcpu

hostedVespa.docker.freeCapacityMem

Total amount of free memory on tenant hosts managed by hosted Vespa in a zone gigabyte

hostedVespa.docker.freeCapacityDisk

Total amount of free disk space on tenant hosts managed by hosted Vespa in a zone gigabyte

hostedVespa.docker.allocatedCapacityCpu

Total number of allocated VCPUs on tenant hosts managed by hosted Vespa in a zone vcpu

hostedVespa.docker.allocatedCapacityMem

Total amount of allocated memory on tenant hosts managed by hosted Vespa in a zone gigabyte

hostedVespa.docker.allocatedCapacityDisk

Total amount of allocated disk space on tenant hosts managed by hosted Vespa in a zone gigabyte

hostedVespa.pendingRedeployments

The number of hosted Vespa re-deployments pending task

hostedVespa.docker.skew

A number in the range 0..1 indicating how well allocated resources are balanced with availability on hosts fraction

hostedVespa.activeHosts

The number of managed hosts that are in state "active" host

hostedVespa.breakfixedHosts

The number of managed hosts that are in state "breakfixed" host

hostedVespa.deprovisionedHosts

The number of managed hosts that are in state "deprovisioned" host

hostedVespa.dirtyHosts

The number of managed hosts that are in state "dirty" host

hostedVespa.failedHosts

The number of managed hosts that are in state "failed" host

hostedVespa.inactiveHosts

The number of managed hosts that are in state "inactive" host

hostedVespa.parkedHosts

The number of managed hosts that are in state "parked" host

hostedVespa.provisionedHosts

The number of managed hosts that are in state "provisioned" host

hostedVespa.readyHosts

The number of managed hosts that are in state "ready" host

hostedVespa.reservedHosts

The number of managed hosts that are in state "reserved" host

hostedVespa.activeNodes

The number of managed nodes that are in state "active" host

hostedVespa.breakfixedNodes

The number of managed nodes that are in state "breakfixed" host

hostedVespa.deprovisionedNodes

The number of managed nodes that are in state "deprovisioned" host

hostedVespa.dirtyNodes

The number of managed nodes that are in state "dirty" host

hostedVespa.failedNodes

The number of managed nodes that are in state "failed" host

hostedVespa.inactiveNodes

The number of managed nodes that are in state "inactive" host

hostedVespa.parkedNodes

The number of managed nodes that are in state "parked" host

hostedVespa.provisionedNodes

The number of managed nodes that are in state "provisioned" host

hostedVespa.readyNodes

The number of managed nodes that are in state "ready" host

hostedVespa.reservedNodes

The number of managed nodes that are in state "reserved" host

overcommittedHosts

The number of hosts with over-committed resources host

spareHostCapacity

The number of spare hosts host

throttledHostFailures

Number of host failures stopped due to throttling host

throttledNodeFailures

Number of node failures stopped due to throttling host

nodeFailThrottling

Metric indicating when node failure throttling is active. The value 1 means active, 0 means inactive binary

clusterAutoscaled

Number of times a cluster has been rescaled by the autoscaler operation

deployment.prepareMillis

Duration of deployment preparations millisecond

deployment.activateMillis

Duration of deployment activations millisecond

throttledHostProvisioning

Value 1 if host provisioning is throttled, 0 if not binary
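
Several metrics in the table are described as fractions of one count against another, for example nodes.nonActiveFraction (non-active nodes vs total nodes) and cache.nodeObject.hitRate (cache hits vs cache lookups). The sketch below illustrates that arithmetic only. The function names and the choice to return 0.0 when the denominator is zero are assumptions made for this example, not the config server's actual implementation.

```python
def non_active_fraction(active: int, non_active: int) -> float:
    """Fraction of non-active nodes out of all nodes (active + non-active)."""
    total = active + non_active
    # Assumption: report 0.0 for an empty cluster instead of dividing by zero.
    return non_active / total if total else 0.0


def hit_rate(hits: int, lookups: int) -> float:
    """Fraction of cache lookups that were hits."""
    # Assumption: report 0.0 when no lookups have been made.
    return hits / lookups if lookups else 0.0


if __name__ == "__main__":
    # 2 non-active nodes out of 20 total
    print(non_active_fraction(active=18, non_active=2))  # 0.1
    # 75 hits out of 100 lookups
    print(hit_rate(hits=75, lookups=100))  # 0.75
```

Both values fall in the range 0..1, which matches the "fraction" unit used for metrics such as the cache hit rates.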