ConfigServer Metrics

Name Unit Description

configserver.requests

request Number of requests processed

configserver.failedRequests

request Number of requests that failed

configserver.latency

millisecond Time to complete requests

configserver.cacheConfigElems

item Number of config elements in the cache

configserver.cacheChecksumElems

item Number of checksum elements in the cache

configserver.hosts

node The number of nodes being served configuration from the config server cluster

configserver.tenants

instance The number of tenants being served configuration from the config server cluster

configserver.applications

instance The number of applications being served configuration from the config server cluster

configserver.delayedResponses

response Number of delayed responses

configserver.sessionChangeErrors

session Number of session change errors

configserver.unknownHostRequests

request Config requests from unknown hosts

configserver.newSessions

session New config sessions

configserver.preparedSessions

session Prepared config sessions

configserver.activeSessions

session Active config sessions

configserver.inactiveSessions

session Inactive config sessions

configserver.addedSessions

session Added config sessions

configserver.removedSessions

session Removed config sessions

configserver.rpcServerWorkQueueSize

item Number of elements in the RPC server work queue

maintenanceDeployment.transientFailure

operation Number of maintenance deployments that failed with a transient failure

maintenanceDeployment.failure

operation Number of maintenance deployments that failed with a permanent failure

maintenance.successFactorDeviation

fraction Configserver: Maintenance Success Factor Deviation

maintenance.duration

millisecond Configserver: Maintenance Duration

configserver.zkConnectionLost

connection Number of ZooKeeper connections lost

configserver.zkReconnected

connection Number of ZooKeeper reconnections

configserver.zkConnected

node Number of ZooKeeper nodes connected

configserver.zkSuspended

node Number of ZooKeeper nodes suspended

configserver.zkZNodes

node Number of ZooKeeper nodes present

configserver.zkAvgLatency

millisecond Average latency for ZooKeeper requests

configserver.zkMaxLatency

millisecond Max latency for ZooKeeper requests

configserver.zkConnections

connection Number of ZooKeeper connections

configserver.zkOutstandingRequests

request Number of ZooKeeper requests in flight

orchestrator.lock.acquire-latency

second Time to acquire ZooKeeper lock

orchestrator.lock.acquire-success

operation Number of times ZooKeeper lock has been acquired successfully

orchestrator.lock.acquire-timedout

operation Number of times ZooKeeper lock couldn't be acquired within timeout

orchestrator.lock.acquire

operation Number of attempts to acquire ZooKeeper lock

orchestrator.lock.acquired

operation Number of times ZooKeeper lock was acquired

orchestrator.lock.hold-latency

second Time ZooKeeper lock was held before it was released

nodes.active

node The number of active nodes in a cluster

nodes.nonActive

node The number of non-active nodes in a cluster

nodes.nonActiveFraction

node The fraction of non-active nodes vs total nodes in a cluster

nodes.exclusiveSwitchFraction

fraction The fraction of nodes in a cluster on exclusive network switches

nodes.emptyExclusive

node The number of exclusive hosts that do not have any nodes allocated to them

nodes.expired.deprovisioned

node The number of deprovisioned nodes that have expired

nodes.expired.dirty

node The number of dirty nodes that have expired

nodes.expired.inactive

node The number of inactive nodes that have expired

nodes.expired.provisioned

node The number of provisioned nodes that have expired

nodes.expired.reserved

node The number of reserved nodes that have expired

cluster.cost

dollar_per_hour The cost of the nodes allocated to a certain cluster, in $/hr

cluster.load.ideal.cpu

fraction The ideal CPU load of a certain cluster

cluster.load.ideal.memory

fraction The ideal memory load of a certain cluster

cluster.load.ideal.disk

fraction The ideal disk load of a certain cluster

cluster.load.peak.cpu

fraction The peak CPU load in the period considered of a certain cluster

cluster.load.peak.memory

fraction The peak memory load in the period considered of a certain cluster

cluster.load.peak.disk

fraction The peak disk load in the period considered of a certain cluster

zone.working

binary The value 1 if zone is considered healthy, 0 if not. This is decided by considering the number of non-active nodes vs the number of active nodes in a zone

cache.nodeObject.hitRate

fraction The fraction of cache hits vs cache lookups for the node object cache

cache.nodeObject.evictionCount

item The number of cache elements evicted from the node object cache

cache.nodeObject.size

item The number of cache elements in the node object cache

cache.curator.hitRate

fraction The fraction of cache hits vs cache lookups for the curator cache

cache.curator.evictionCount

item The number of cache elements evicted from the curator cache

cache.curator.size

item The number of cache elements in the curator cache

wantedRestartGeneration

generation Wanted restart generation for tenant node

currentRestartGeneration

generation Current restart generation for tenant node

wantToRestart

binary One if node wants to restart, zero if not

wantedRebootGeneration

generation Wanted reboot generation for tenant node

currentRebootGeneration

generation Current reboot generation for tenant node

wantToReboot

binary One if node wants to reboot, zero if not

retired

binary One if node is retired, zero if not

wantedVespaVersion

version Wanted Vespa version for the node, in the form MINOR.PATCH. Major version is not included here

currentVespaVersion

version Current Vespa version for the node, in the form MINOR.PATCH. Major version is not included here

wantToChangeVespaVersion

binary One if node wants to change Vespa version, zero if not

hasWireguardKey

binary One if node has a WireGuard key, zero if not

wantToRetire

binary One if node wants to retire, zero if not

wantToDeprovision

binary One if node wants to be deprovisioned, zero if not

failReport

binary One if there is a fail report for the node, zero if not

suspended

binary One if the node is suspended, zero if not

suspendedSeconds

second The number of seconds the node has been suspended

activeSeconds

second The number of seconds the node has been active

numberOfServicesUp

instance The number of services confirmed to be running on a node

numberOfServicesNotChecked

instance The number of services supposed to run on a node that have not been checked

numberOfServicesDown

instance The number of services confirmed to not be running on a node

someServicesDown

binary One if one or more services have been confirmed to not be running on a node, zero if not

numberOfServicesUnknown

instance The number of services on a node whose running state is unknown to the config server

nodeFailerBadNode

binary One if the node is failed due to being bad, zero if not

downInNodeRepo

binary One if the node is registered as being down in the node repository, zero if not

numberOfServices

instance Number of services supposed to run on a node

lockAttempt.acquireMaxActiveLatency

second Maximum duration for keeping a lock, ending during the metrics snapshot, or still being kept at the end of this snapshot period

lockAttempt.acquireHz

operation_per_second Average number of locks acquired per second during the snapshot period

lockAttempt.acquireLoad

operation Average number of locks held concurrently during the snapshot period

lockAttempt.lockedLatency

second Longest lock duration in the snapshot period

lockAttempt.lockedLoad

operation Average number of locks held concurrently during the snapshot period

lockAttempt.acquireTimedOut

operation Number of locking attempts that timed out during the snapshot period

lockAttempt.deadlock

operation Number of lock grab deadlocks detected during the snapshot period

lockAttempt.errors

operation Number of other lock related errors detected during the snapshot period

hostedVespa.docker.totalCapacityCpu

vcpu Total number of VCPUs on tenant hosts managed by hosted Vespa in a zone

hostedVespa.docker.totalCapacityMem

gigabyte Total amount of memory on tenant hosts managed by hosted Vespa in a zone

hostedVespa.docker.totalCapacityDisk

gigabyte Total amount of disk space on tenant hosts managed by hosted Vespa in a zone

hostedVespa.docker.freeCapacityCpu

vcpu Total number of free VCPUs on tenant hosts managed by hosted Vespa in a zone

hostedVespa.docker.freeCapacityMem

gigabyte Total amount of free memory on tenant hosts managed by hosted Vespa in a zone

hostedVespa.docker.freeCapacityDisk

gigabyte Total amount of free disk space on tenant hosts managed by hosted Vespa in a zone

hostedVespa.docker.allocatedCapacityCpu

vcpu Total number of allocated VCPUs on tenant hosts managed by hosted Vespa in a zone

hostedVespa.docker.allocatedCapacityMem

gigabyte Total amount of allocated memory on tenant hosts managed by hosted Vespa in a zone

hostedVespa.docker.allocatedCapacityDisk

gigabyte Total amount of allocated disk space on tenant hosts managed by hosted Vespa in a zone

hostedVespa.pendingRedeployments

task The number of hosted Vespa re-deployments pending

hostedVespa.docker.skew

fraction A number in the range 0..1 indicating how well allocated resources are balanced with availability on hosts

hostedVespa.activeHosts

host The number of managed hosts that are in state "active"

hostedVespa.breakfixedHosts

host The number of managed hosts that are in state "breakfixed"

hostedVespa.deprovisionedHosts

host The number of managed hosts that are in state "deprovisioned"

hostedVespa.dirtyHosts

host The number of managed hosts that are in state "dirty"

hostedVespa.failedHosts

host The number of managed hosts that are in state "failed"

hostedVespa.inactiveHosts

host The number of managed hosts that are in state "inactive"

hostedVespa.parkedHosts

host The number of managed hosts that are in state "parked"

hostedVespa.provisionedHosts

host The number of managed hosts that are in state "provisioned"

hostedVespa.readyHosts

host The number of managed hosts that are in state "ready"

hostedVespa.reservedHosts

host The number of managed hosts that are in state "reserved"

hostedVespa.activeNodes

host The number of managed nodes that are in state "active"

hostedVespa.breakfixedNodes

host The number of managed nodes that are in state "breakfixed"

hostedVespa.deprovisionedNodes

host The number of managed nodes that are in state "deprovisioned"

hostedVespa.dirtyNodes

host The number of managed nodes that are in state "dirty"

hostedVespa.failedNodes

host The number of managed nodes that are in state "failed"

hostedVespa.inactiveNodes

host The number of managed nodes that are in state "inactive"

hostedVespa.parkedNodes

host The number of managed nodes that are in state "parked"

hostedVespa.provisionedNodes

host The number of managed nodes that are in state "provisioned"

hostedVespa.readyNodes

host The number of managed nodes that are in state "ready"

hostedVespa.reservedNodes

host The number of managed nodes that are in state "reserved"

overcommittedHosts

host The number of hosts with over-committed resources

spareHostCapacity

host The number of spare hosts

throttledHostFailures

host Number of host failures stopped due to throttling

throttledNodeFailures

host Number of node failures stopped due to throttling

nodeFailThrottling

binary Metric indicating when node failure throttling is active. The value 1 means active, 0 means inactive

clusterAutoscaled

operation Number of times a cluster has been rescaled by the autoscaler

deployment.prepareMillis

millisecond Duration of deployment preparations

deployment.activateMillis

millisecond Duration of deployment activations

throttledHostProvisioning

binary Value 1 if host provisioning is throttled, 0 if not
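
Several metrics in this table are ratios of other counts, for example nodes.nonActiveFraction (non-active nodes vs total nodes) and the cache hitRate metrics (cache hits vs cache lookups). The sketch below only illustrates how such fractions can be read. The function names are hypothetical and are not part of Vespa, and it assumes that the total node count equals active plus non-active nodes and that an empty denominator is reported as 0.0; the config server may compute these values differently.

```python
def non_active_fraction(active: int, non_active: int) -> float:
    """Fraction of non-active nodes vs total nodes in a cluster.

    Assumption: total nodes = active + non-active nodes.
    """
    total = active + non_active
    # Assumption: report 0.0 when there are no nodes at all.
    return non_active / total if total else 0.0


def hit_rate(hits: int, lookups: int) -> float:
    """Fraction of cache hits vs cache lookups."""
    # Assumption: report 0.0 when no lookups have been made.
    return hits / lookups if lookups else 0.0
```

With these assumptions, a cluster with 8 active and 2 non-active nodes has a non-active fraction of 0.2, and a cache with 3 hits out of 4 lookups has a hit rate of 0.75.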