Reindexing

When the indexing pipeline of a Vespa application changes, Vespa may automatically refeed stored data such that the index is updated according to the new specification. Changes in the indexing pipeline may be due to changes in external libraries, e.g., for linguistics, or due to changes in the configuration done by the user, such as the indexing script in a document's schema, or the indexing mode of a document type in a content cluster. Reindexing can be triggered for an application's full corpus, for only certain content clusters, or for only certain document types in certain clusters, using the reindex endpoint, and inspected at the reindexing endpoint.

When to reindex

When deployment results in a change in the indexing pipeline of an application, this is discovered by the config server. (See the prepare endpoint for details.) If the change is to be deployed, a validation override must be added to the application package. Deployment will then list the reindex actions required to make the index reflect the new indexing pipeline. Use the reindex endpoint to trigger reindexing of affected document types, but only after the new indexing pipeline is successfully deployed, i.e., when the application has converged on the config generation that introduced the change. Reindexing then commences with the next deployment of the application.
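
As an illustration of scoping a reindex request to certain clusters and document types, here is a minimal Python sketch that only builds the request URL. The path (default tenant, application and instance names on a self-hosted config server), the `clusterId` and `documentType` parameter names, and the helper name `reindex_url` are assumptions made for this example; check them against the reindex endpoint reference before use.

```python
from urllib.parse import urlencode

# Assumed path for an application deployed with default names; adjust as needed.
REINDEX_PATH = (
    "/application/v2/tenant/default/application/default"
    "/environment/default/region/default/instance/default/reindex"
)


def reindex_url(config_server, cluster_ids=(), document_types=()):
    """Build the URL for a reindex request (hypothetical helper).

    With no clusters or document types given, no query string is added,
    which corresponds to requesting reindexing of the full corpus.
    """
    params = {}
    if cluster_ids:
        params["clusterId"] = ",".join(cluster_ids)  # assumed parameter name
    if document_types:
        params["documentType"] = ",".join(document_types)  # assumed parameter name
    query = urlencode(params, safe=",")  # keep commas readable in the URL
    return config_server + REINDEX_PATH + ("?" + query if query else "")
```

The returned URL would then be sent as an HTTP POST to the config server, for example `reindex_url("http://localhost:19071", ["music"], ["song"])`, where the address and names are placeholders.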

Reindexing progress

Reindexing is done by a component in each content cluster that visits all documents of the indicated types, and refeeds these through the indexing chain of the cluster. (Note that only the document fields are re-fed; all derived fields, produced by the indexing pipeline, are recomputed.) The reindexing process avoids write races with concurrent feed by locking small subsets of the corpus when reindexing them; this may cause elevated write latencies for a fraction of concurrent write operations, but does not impact general throughput. Moreover, since reindexing can be both lengthy and resource-consuming, depending on the corpus, the process is tuned to yield resources to other tasks, such as external feed and serving, and is generally safe to run in the background.

Reindexing is done for one document type at a time, in parallel across content clusters. Detailed progress can be found at the reindexing endpoint. If state is failed, reindexing will not be retried for that document type until triggered again. State pending indicates reindexing will start, or resume, when the cluster is ready, while running means it's currently progressing. Finally, successful means all documents of that type were successfully reindexed.
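
The four states can be handled along these lines. This is a minimal Python sketch, assuming the per-type states have already been extracted from the reindexing endpoint response into a flat mapping; that mapping and the helper names `summarize` and `is_complete` are hypothetical and do not mirror the actual response format.

```python
KNOWN_STATES = ("pending", "running", "successful", "failed")


def summarize(states):
    """Group document types by reindexing state.

    `states` maps document type -> state string (a simplified, assumed shape).
    """
    summary = {state: [] for state in KNOWN_STATES}
    for doc_type, state in sorted(states.items()):
        if state not in summary:
            raise ValueError(f"unknown state {state!r} for {doc_type}")
        summary[state].append(doc_type)
    return summary


def is_complete(states):
    """True only when at least one type is listed and all are successful."""
    return bool(states) and all(s == "successful" for s in states.values())
```

Document types listed under `failed` are the ones that must be triggered again, since reindexing is not retried for them automatically.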

Use cases

Below are sample changes to the schema for different use cases, or examples of operational steps for data manipulation.

clear field

To clear a field, do a partial update of all documents, assigning the field the cleared value, say an empty string.
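
A minimal Python sketch of generating such partial updates in the JSON feed format follows. The document IDs, the field name `artist` and the helper name `clear_field_updates` are placeholders for illustration; how the operations are fed (feed client or HTTP API) is left out.

```python
import json


def clear_field_updates(document_ids, field, cleared_value=""):
    """Yield one JSON partial-update operation per document ID (hypothetical helper)."""
    for doc_id in document_ids:
        operation = {
            "update": doc_id,
            # "assign" replaces the field's current value with the cleared value.
            "fields": {field: {"assign": cleared_value}},
        }
        yield json.dumps(operation)
```

Each yielded line is one update operation, so the output can be written to a file and fed as-is.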

It is also possible to use reindexing, but there is a twist - intuitively, this would work:

field artist type string {
    indexing: "" | summary | index
}

However, the reset only works for synthetic fields, i.e., fields declared outside of the document definition.

A solution is to deploy a document processor that empties the field to the default indexing chain, then trigger reindexing so that all documents pass through it.

change indexing settings

As reindexing takes time, some documents will have the field indexed with the old settings and others with the new settings while it runs, whereas queries use the most current settings immediately. This is OK for many changes and applications.

It is possible to reindex to a new field for a more atomic change. Add a synthetic field outside of the document definition and pipe the content of the current field to it:

search mydocs {

    field title_non_stemmed type string {
        indexing: input title | index | summary
        stemming: none
    }

    document mydocs {
        field title type string {
            indexing: index | summary
        }
    }
}

Once reindexing is completed, switch queries to use the new field. This solution naturally increases memory and disk requirements during the transition.

Relevant pointers: