Name | Description | Unit |
---|---|---|
configserver.requests | Number of requests processed | request |
configserver.failedRequests | Number of requests that failed | request |
configserver.latency | Time to complete requests | millisecond |
configserver.cacheConfigElems | Number of config elements in the cache | item |
configserver.cacheChecksumElems | Number of checksum elements in the cache | item |
configserver.hosts | The number of nodes being served configuration from the config server cluster | node |
configserver.tenants | The number of tenants being served configuration from the config server cluster | instance |
configserver.applications | The number of applications being served configuration from the config server cluster | instance |
configserver.delayedResponses | Number of delayed responses | response |
configserver.sessionChangeErrors | Number of session change errors | session |
configserver.unknownHostRequests | Config requests from unknown hosts | request |
configserver.newSessions | New config sessions | session |
configserver.preparedSessions | Prepared config sessions | session |
configserver.activeSessions | Active config sessions | session |
configserver.inactiveSessions | Inactive config sessions | session |
configserver.addedSessions | Added config sessions | session |
configserver.removedSessions | Removed config sessions | session |
configserver.rpcServerWorkQueueSize | Number of elements in the RPC server work queue | item |
maintenanceDeployment.transientFailure | Number of maintenance deployments that failed with a transient failure | operation |
maintenanceDeployment.failure | Number of maintenance deployments that failed with a permanent failure | operation |
maintenance.successFactorDeviation | Configserver: Maintenance Success Factor Deviation | fraction |
maintenance.duration | Configserver: Maintenance Duration | millisecond |
configserver.zkConnectionLost | Number of ZooKeeper connections lost | connection |
configserver.zkReconnected | Number of ZooKeeper reconnections | connection |
configserver.zkConnected | Number of ZooKeeper nodes connected | node |
configserver.zkSuspended | Number of ZooKeeper nodes suspended | node |
configserver.zkZNodes | Number of ZooKeeper nodes present | node |
configserver.zkAvgLatency | Average latency for ZooKeeper requests | millisecond |
configserver.zkMaxLatency | Max latency for ZooKeeper requests | millisecond |
configserver.zkConnections | Number of ZooKeeper connections | connection |
configserver.zkOutstandingRequests | Number of ZooKeeper requests in flight | request |
orchestrator.lock.acquire-latency | Time to acquire ZooKeeper lock | second |
orchestrator.lock.acquire-success | Number of times ZooKeeper lock has been acquired successfully | operation |
orchestrator.lock.acquire-timedout | Number of times ZooKeeper lock couldn't be acquired within timeout | operation |
orchestrator.lock.acquire | Number of attempts to acquire ZooKeeper lock | operation |
orchestrator.lock.acquired | Number of times ZooKeeper lock was acquired | operation |
orchestrator.lock.hold-latency | Time ZooKeeper lock was held before it was released | second |
nodes.active | The number of active nodes in a cluster | node |
nodes.nonActive | The number of non-active nodes in a cluster | node |
nodes.nonActiveFraction | The fraction of non-active nodes vs total nodes in a cluster | node |
nodes.exclusiveSwitchFraction | The fraction of nodes in a cluster on exclusive network switches | fraction |
nodes.emptyExclusive | The number of exclusive hosts that do not have any nodes allocated to them | node |
nodes.expired.deprovisioned | The number of deprovisioned nodes that have expired | node |
nodes.expired.dirty | The number of dirty nodes that have expired | node |
nodes.expired.inactive | The number of inactive nodes that have expired | node |
nodes.expired.provisioned | The number of provisioned nodes that have expired | node |
nodes.expired.reserved | The number of reserved nodes that have expired | node |
cluster.cost | The cost of the nodes allocated to a certain cluster, in $/hr | dollar_per_hour |
cluster.load.ideal.cpu | The ideal cpu load of a certain cluster | fraction |
cluster.load.ideal.memory | The ideal memory load of a certain cluster | fraction |
cluster.load.ideal.disk | The ideal disk load of a certain cluster | fraction |
cluster.load.peak.cpu | The peak cpu load in the period considered of a certain cluster | fraction |
cluster.load.peak.memory | The peak memory load in the period considered of a certain cluster | fraction |
cluster.load.peak.disk | The peak disk load in the period considered of a certain cluster | fraction |
zone.working | The value 1 if zone is considered healthy, 0 if not. This is decided by considering the number of non-active nodes vs the number of active nodes in a zone | binary |
cache.nodeObject.hitRate | The fraction of cache hits vs cache lookups for the node object cache | fraction |
cache.nodeObject.evictionCount | The number of cache elements evicted from the node object cache | item |
cache.nodeObject.size | The number of cache elements in the node object cache | item |
cache.curator.hitRate | The fraction of cache hits vs cache lookups for the curator cache | fraction |
cache.curator.evictionCount | The number of cache elements evicted from the curator cache | item |
cache.curator.size | The number of cache elements in the curator cache | item |
wantedRestartGeneration | Wanted restart generation for tenant node | generation |
currentRestartGeneration | Current restart generation for tenant node | generation |
wantToRestart | One if node wants to restart, zero if not | binary |
wantedRebootGeneration | Wanted reboot generation for tenant node | generation |
currentRebootGeneration | Current reboot generation for tenant node | generation |
wantToReboot | One if node wants to reboot, zero if not | binary |
retired | One if node is retired, zero if not | binary |
wantedVespaVersion | Wanted Vespa version for the node, in the form MINOR.PATCH. Major version is not included here | version |
currentVespaVersion | Current Vespa version for the node, in the form MINOR.PATCH. Major version is not included here | version |
wantToChangeVespaVersion | One if node wants to change Vespa version, zero if not | binary |
hasWireguardKey | One if node has a WireGuard key, zero if not | binary |
wantToRetire | One if node wants to retire, zero if not | binary |
wantToDeprovision | One if node wants to be deprovisioned, zero if not | binary |
failReport | One if there is a fail report for the node, zero if not | binary |
suspended | One if the node is suspended, zero if not | binary |
suspendedSeconds | The number of seconds the node has been suspended | second |
activeSeconds | The number of seconds the node has been active | second |
numberOfServicesUp | The number of services confirmed to be running on a node | instance |
numberOfServicesNotChecked | The number of services supposed to run on a node that have not been checked | instance |
numberOfServicesDown | The number of services confirmed to not be running on a node | instance |
someServicesDown | One if one or more services have been confirmed to not be running on a node, zero if not | binary |
numberOfServicesUnknown | The number of services on a node whose status is unknown to the config server | instance |
nodeFailerBadNode | One if the node is failed due to being bad, zero if not | binary |
downInNodeRepo | One if the node is registered as being down in the node repository, zero if not | binary |
numberOfServices | Number of services supposed to run on a node | instance |
lockAttempt.acquireMaxActiveLatency | Maximum duration for keeping a lock, ending during the metrics snapshot, or still being kept at the end of this snapshot period | second |
lockAttempt.acquireHz | Average number of locks acquired per second during the snapshot period | operation_per_second |
lockAttempt.acquireLoad | Average number of locks held concurrently during the snapshot period | operation |
lockAttempt.lockedLatency | Longest lock duration in the snapshot period | second |
lockAttempt.lockedLoad | Average number of locks held concurrently during the snapshot period | operation |
lockAttempt.acquireTimedOut | Number of locking attempts that timed out during the snapshot period | operation |
lockAttempt.deadlock | Number of lock grab deadlocks detected during the snapshot period | operation |
lockAttempt.errors | Number of other lock related errors detected during the snapshot period | operation |
hostedVespa.docker.totalCapacityCpu | Total number of VCPUs on tenant hosts managed by hosted Vespa in a zone | vcpu |
hostedVespa.docker.totalCapacityMem | Total amount of memory on tenant hosts managed by hosted Vespa in a zone | gigabyte |
hostedVespa.docker.totalCapacityDisk | Total amount of disk space on tenant hosts managed by hosted Vespa in a zone | gigabyte |
hostedVespa.docker.freeCapacityCpu | Total number of free VCPUs on tenant hosts managed by hosted Vespa in a zone | vcpu |
hostedVespa.docker.freeCapacityMem | Total amount of free memory on tenant hosts managed by hosted Vespa in a zone | gigabyte |
hostedVespa.docker.freeCapacityDisk | Total amount of free disk space on tenant hosts managed by hosted Vespa in a zone | gigabyte |
hostedVespa.docker.allocatedCapacityCpu | Total number of allocated VCPUs on tenant hosts managed by hosted Vespa in a zone | vcpu |
hostedVespa.docker.allocatedCapacityMem | Total amount of allocated memory on tenant hosts managed by hosted Vespa in a zone | gigabyte |
hostedVespa.docker.allocatedCapacityDisk | Total amount of allocated disk space on tenant hosts managed by hosted Vespa in a zone | gigabyte |
hostedVespa.pendingRedeployments | The number of hosted Vespa re-deployments pending | task |
hostedVespa.docker.skew | A number in the range 0..1 indicating how well allocated resources are balanced with availability on hosts | fraction |
hostedVespa.activeHosts | The number of managed hosts that are in state "active" | host |
hostedVespa.breakfixedHosts | The number of managed hosts that are in state "breakfixed" | host |
hostedVespa.deprovisionedHosts | The number of managed hosts that are in state "deprovisioned" | host |
hostedVespa.dirtyHosts | The number of managed hosts that are in state "dirty" | host |
hostedVespa.failedHosts | The number of managed hosts that are in state "failed" | host |
hostedVespa.inactiveHosts | The number of managed hosts that are in state "inactive" | host |
hostedVespa.parkedHosts | The number of managed hosts that are in state "parked" | host |
hostedVespa.provisionedHosts | The number of managed hosts that are in state "provisioned" | host |
hostedVespa.readyHosts | The number of managed hosts that are in state "ready" | host |
hostedVespa.reservedHosts | The number of managed hosts that are in state "reserved" | host |
hostedVespa.activeNodes | The number of managed nodes that are in state "active" | host |
hostedVespa.breakfixedNodes | The number of managed nodes that are in state "breakfixed" | host |
hostedVespa.deprovisionedNodes | The number of managed nodes that are in state "deprovisioned" | host |
hostedVespa.dirtyNodes | The number of managed nodes that are in state "dirty" | host |
hostedVespa.failedNodes | The number of managed nodes that are in state "failed" | host |
hostedVespa.inactiveNodes | The number of managed nodes that are in state "inactive" | host |
hostedVespa.parkedNodes | The number of managed nodes that are in state "parked" | host |
hostedVespa.provisionedNodes | The number of managed nodes that are in state "provisioned" | host |
hostedVespa.readyNodes | The number of managed nodes that are in state "ready" | host |
hostedVespa.reservedNodes | The number of managed nodes that are in state "reserved" | host |
overcommittedHosts | The number of hosts with over-committed resources | host |
spareHostCapacity | The number of spare hosts | host |
throttledHostFailures | Number of host failures stopped due to throttling | host |
throttledNodeFailures | Number of node failures stopped due to throttling | host |
nodeFailThrottling | Metric indicating when node failure throttling is active. The value 1 means active, 0 means inactive | binary |
clusterAutoscaled | Number of times a cluster has been rescaled by the autoscaler | operation |
deployment.prepareMillis | Duration of deployment preparations | millisecond |
deployment.activateMillis | Duration of deployment activations | millisecond |
throttledHostProvisioning | Value 1 if host provisioning is throttled, 0 if not | binary |
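
Several of the fraction-valued metrics in the table relate directly to the count metrics next to them. The sketch below illustrates that relationship for `nodes.nonActiveFraction`, assuming that "total nodes" means `nodes.active` plus `nodes.nonActive`; that assumption, the `snapshot` dictionary, and the helper name `non_active_fraction` are hypothetical and are not taken from the config server itself.

```python
def non_active_fraction(active: int, non_active: int) -> float:
    """Fraction of non-active nodes vs total nodes in a cluster.

    Assumption: total = active + non_active. Returns 0.0 for an empty cluster.
    """
    total = active + non_active
    return non_active / total if total else 0.0


# Hypothetical metric snapshot keyed by the metric names from the table.
snapshot = {"nodes.active": 8, "nodes.nonActive": 2}

fraction = non_active_fraction(snapshot["nodes.active"], snapshot["nodes.nonActive"])
print(f"nodes.nonActiveFraction (derived): {fraction:.2f}")
```

A derived value like this can serve as a cross-check against the reported metric, but the reported value remains authoritative if the two differ.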