Metrics and Health API

Vespa processes expose APIs for metrics and health. The default port is 8080. The Health API is used for heartbeating. Use the Metrics API to integrate with graphing and alerting services.

To find metrics ports, run vespa-model-inspect services to list the running services in a cluster, then run vespa-model-inspect service <service name> to list the ports for a given service (e.g. searchnode).

Health API

Health status is found at http://host:port/state/v1/health

Example:

{
  "status" : {
    "code" : "up",
    "message" : "Everything ok here"
  }
}

The status code is either up or down. Status up means that the service is fully up and ready to serve traffic. If the page cannot be downloaded, a state of down is typically assumed. The message part is optional: it is typically empty when the service is up, and gives a textual reason when the service is unavailable.
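
The rules above can be sketched as a small heartbeat check. This is a minimal sketch, not part of Vespa: the function names interpret_health and check_health, the timeout value, and the choice to treat a missing code as down are all assumptions made for illustration.

```python
import json
import urllib.request


def interpret_health(body: str) -> tuple[str, str]:
    """Return (code, message) from a /state/v1/health response body."""
    status = json.loads(body).get("status", {})
    # The message is optional, so default to an empty string.
    # Assumption: a response without a code is treated as down.
    return status.get("code", "down"), status.get("message", "")


def check_health(url: str, timeout: float = 2.0) -> tuple[str, str]:
    """Fetch the health page; network and JSON parse errors count as down."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return interpret_health(response.read().decode("utf-8"))
    except (OSError, ValueError) as error:
        # URLError and timeouts are OSError; malformed JSON is ValueError.
        return "down", str(error)
```

A caller would pass the full health URL, for example check_health("http://host:port/state/v1/health") with the real host and port filled in, and serve traffic only when the returned code is up.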

Metrics API

Metrics are found at http://host:port/state/v1/metrics

Metrics are reported in snapshots, where each snapshot specifies the time window the metrics were gathered from. Typically, the service aggregates metrics as they are reported; after each snapshot period, a snapshot is taken of the current values and the values are reset. This approach tracks min and max values, and makes it possible to report values like the 95th percentile for each complete snapshot period.
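
The aggregate-then-reset cycle described above can be illustrated with a small sketch. The ValueMetric class below is hypothetical and is not how Vespa implements metrics; it only tracks count, sum, min and max, and leaves out percentiles, which would need the individual values or a histogram.

```python
import time


class ValueMetric:
    """Hypothetical aggregator for one value metric (not a Vespa class)."""

    def __init__(self) -> None:
        self._reset(time.time())

    def _reset(self, now: float) -> None:
        # Start a new snapshot period with empty aggregates.
        self.start = now
        self.count = 0
        self.total = 0.0
        self.min = None
        self.max = None

    def add(self, value: float) -> None:
        """Aggregate one reported value into the current period."""
        self.count += 1
        self.total += value
        self.min = value if self.min is None else min(self.min, value)
        self.max = value if self.max is None else max(self.max, value)

    def snapshot(self) -> dict:
        """Return the aggregates for the finished period, then reset."""
        now = time.time()
        values = {"count": self.count, "sum": self.total,
                  "min": self.min, "max": self.max}
        if self.count:
            values["average"] = self.total / self.count
        result = {"snapshot": {"from": self.start, "to": now}, "values": values}
        self._reset(now)
        return result
```

Because snapshot() resets the aggregates, each returned snapshot covers only the values added since the previous one.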

The from and to times are specified in seconds since the Unix epoch (1970-01-01 UTC). Milliseconds or microseconds can be added as decimals.

Vespa supports custom metrics.

Example:

{
  "status" : {
    "code" : "up",
    "message" : "Everything ok here"
  },
  "metrics" : {
    "snapshot" : {
      "from" : 1334134640.089,
      "to" : 1334134700.088,
      "to" : 1334134700.088
    },
    "values" : [
      {
        "name" : "queries",
        "description" : "Number of queries executed during snapshot interval",
        "values" : {
          "count" : 28,
          "rate" : 0.4667
        },
        "dimensions" : {
          "searcherid" : "x"
        }
      },
      {
        "name" : "query_hits",
        "description" : "Number of documents matched per query during snapshot interval",
        "values" : {
          "count" : 28,
          "rate" : 0.4667,
          "average" : 128.3,
          "min" : 0,
          "max" : 10000,
          "sum" : 3584,
          "median" : 124.0,
          "std_deviation": 5.43
        },
        "dimensions" : {
          "searcherid" : "x"
        }
      }
    ]
  }
}

A flat list of metrics is returned. Each metric value reported by a component should be a separate metric. For related metrics, prefix metric names with common parts and separate the names with dots, e.g. memory.free and memory.virtual. Each metric has one or more values set. Valid values:

count: Number of times the metric has been set. For a count metric that counts operations, this is the number of operations added during the snapshot period. For a value metric, such as one recording operation latency, this is the number of times a latency was added to the metric.
average: The average of all values received during the snapshot period. Typically sum divided by count.
rate: count per second.
min: The smallest value seen in this snapshot period.
max: The largest value seen in this snapshot period.
sum: The sum of all values seen in this snapshot period.
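
The relationships between these values can be checked against a metrics response. The sketch below is illustrative only: the derive function is a hypothetical helper, the sample dictionary is a trimmed copy of the example response above, and it assumes that rate is count divided by the snapshot window length and that average is sum divided by count, as the definitions suggest.

```python
def derive(response: dict) -> list[dict]:
    """Recompute rate and average for each metric in a metrics response."""
    metrics = response["metrics"]
    # Length of the snapshot window in seconds.
    window = metrics["snapshot"]["to"] - metrics["snapshot"]["from"]
    rows = []
    for metric in metrics["values"]:
        values = metric["values"]
        count = values.get("count", 0)
        row = {
            "name": metric["name"],
            "dimensions": metric.get("dimensions", {}),
            # Assumption: rate is count per second over the window.
            "rate": count / window if window > 0 else None,
        }
        # An average only makes sense when a sum was reported.
        if "sum" in values and count:
            row["average"] = values["sum"] / count
        rows.append(row)
    return rows


# Trimmed copy of the example response, used as sample input.
sample = {
    "metrics": {
        "snapshot": {"from": 1334134640.089, "to": 1334134700.088},
        "values": [
            {"name": "queries",
             "values": {"count": 28, "rate": 0.4667},
             "dimensions": {"searcherid": "x"}},
            {"name": "query_hits",
             "values": {"count": 28, "sum": 3584, "min": 0, "max": 10000},
             "dimensions": {"searcherid": "x"}},
        ],
    }
}

for row in derive(sample):
    print(row)
```

For the queries metric, 28 operations over a window of roughly 60 seconds gives a rate close to the reported 0.4667. For query_hits, sum divided by count gives 128.0, which differs slightly from the 128.3 shown in the example; the example figures appear to be illustrative rather than mutually consistent.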