When the indexing pipeline of a Vespa application changes, Vespa may automatically refeed stored data such that the index is updated according to the new specification. Changes in the indexing pipeline may be due to changes in external libraries, e.g. for linguistics, or due to changes in the configuration done by the user, such as the indexing script in a document's schema, or the indexing mode of a document type in a content cluster. Reindexing can be triggered for an application's full corpus, for only certain content clusters, or for only certain document types in certain clusters, using the reindex endpoint, and inspected at the reindexing endpoint.
When deployment results in a change in the indexing pipeline of an application, this is discovered by the config server (see the prepare endpoint for details). If the change is to be deployed, a validation override must be added to the application package. Deployment will then list the reindex actions required to make the index reflect the new indexing pipeline. Use the reindex endpoint to trigger reindexing of affected document types, but only after the new indexing pipeline is successfully deployed, i.e. when the application has converged on the config generation that introduced the change. Reindexing then commences with the next deployment of the application.
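The trigger and inspection steps above can be sketched as URL construction against the config server. This is a minimal sketch, not a definitive client. The path layout under `/application/v2`, the query parameter names `clusterId`, `documentType` and `speed`, and the config server address used in the usage note are assumptions that should be checked against the deploy API reference. The helper names are hypothetical.

```python
from urllib.parse import urlencode

# Assumed application path for a self-hosted deployment; verify against
# the deploy API reference before relying on it.
APP_PATH = (
    "/application/v2/tenant/default/application/default"
    "/environment/default/region/default/instance/default"
)


def reindex_url(base, clusters=(), document_types=(), speed=None):
    """Build the URL to POST to in order to trigger reindexing (hypothetical helper).

    With no clusters or document types given, the URL addresses the full corpus.
    """
    params = {}
    if clusters:
        # Assumed parameter name; restricts reindexing to these content clusters.
        params["clusterId"] = ",".join(clusters)
    if document_types:
        # Assumed parameter name; restricts reindexing to these document types.
        params["documentType"] = ",".join(document_types)
    if speed is not None:
        # Assumed parameter name; 0.0 halts the reindexing.
        params["speed"] = str(speed)
    query = "?" + urlencode(params, safe=",") if params else ""
    return f"{base}{APP_PATH}/reindex{query}"


def reindexing_status_url(base):
    """Build the URL to GET in order to inspect reindexing status (hypothetical helper)."""
    return f"{base}{APP_PATH}/reindexing"
```

For example, `reindex_url("http://localhost:19071", clusters=["music"], document_types=["song"])` addresses a single document type in a single cluster. Remember that a triggered reindexing only starts with the next deployment of the application.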
Reindexing is done by a component in each content cluster that visits all documents of the indicated types, and re-feeds these through the indexing chain of the cluster. (Note that only the document fields are re-fed; all derived fields, produced by the indexing pipeline, are recomputed.) The reindexing process avoids write races with concurrent feed by locking small subsets of the corpus when reindexing them; this may cause elevated write latencies for a fraction of concurrent write operations, but does not impact general throughput. Moreover, since reindexing can be both lengthy and resource-consuming, depending on the corpus, the process is tuned to yield resources to other tasks, such as external feed and serving, and is generally safe to run in the background.
Reindexing is done for one document type at a time, in parallel across content clusters.
Detailed progress can be found at the reindexing endpoint. If state is failed, reindexing attempts to resume from the position where it failed after a grace period of some minutes. State pending indicates reindexing will start, or resume, when the cluster is ready, while running means it's currently progressing. Finally, successful means all documents of that type were successfully reindexed. Additionally, if the speed of a reindexing is 0.0 (set by users), that reindexing is halted until the speed is either set to a positive value again, or it is replaced by a new reindexing of that document type.
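The state and speed rules above can be summarized in a small lookup. This is a sketch under stated assumptions: the function name `describe` is hypothetical, the state names are taken from the text, and the choice to treat a speed of 0.0 as irrelevant once a reindexing is successful is an assumption, not documented behavior.

```python
# What each reported state means, as described in the text above.
NEXT_STEP = {
    "pending": "will start, or resume, when the cluster is ready",
    "running": "currently progressing",
    "failed": "will try to resume from the failure position after a grace period",
    "successful": "all documents of this type were reindexed",
}

HALTED = "halted until the speed is positive again or a new reindexing replaces it"


def describe(state, speed=1.0):
    """Explain a reindexing status for one document type (hypothetical helper)."""
    if state not in NEXT_STEP:
        raise ValueError(f"unknown reindexing state: {state!r}")
    # Assumption: a completed reindexing is reported as successful
    # regardless of its speed setting.
    if speed == 0.0 and state != "successful":
        return HALTED
    return NEXT_STEP[state]
```

A monitoring script could call `describe` for each document type reported by the reindexing endpoint and alert only on types that stay failed or halted.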
Refer to schema changes for a procedure to test the reindexing feature, and for tools to validate the data.
Below are sample changes to the schema for different use cases, or examples of operational steps for data manipulation.
Use case | Description |
---|---|
clear field | To clear a field, partially update all documents with the new value, say an empty string. It is also possible to use reindexing, but there is a twist. Intuitively, this would work: `field artist type string { indexing: "" \| summary \| index }` However, the reset only works for synthetic fields. A solution is to deploy a document processor that empties the field to the default indexing chain, then trigger a reprocessing. |
change indexing settings | As reindexing takes time, some of a field's data is in the old state and some in the new while it runs, whereas queries against the field always use the most current settings. This is OK for many changes and applications. It is possible to reindex to a new field for a more atomic change. Add a synthetic field outside of the document definition and pipe the content of the current field to it: `search mydocs { field title_non_stemmed type string { indexing: input title \| index \| summary stemming: none } document mydocs { field title type string { indexing: index \| summary } } }` Once reindexing is completed, switch queries to use the new field. This solution naturally increases memory and disk requirements during the transition. |
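
The partial update mentioned for the clear field use case can be sketched as building one update request per document. This is a minimal sketch, assuming the `/document/v1` API with an `assign` update operation. The namespace, document type and document id in the usage note are hypothetical, and the helper name is not part of any Vespa client.

```python
import json
from urllib.parse import quote


def clear_field_update(namespace, doc_type, doc_id, field, empty_value=""):
    """Return (path, body) for a partial update that overwrites one field.

    Hypothetical helper; the request would be sent as an HTTP PUT to a
    container node. Only the named field is touched, other fields are kept.
    """
    path = "/document/v1/{}/{}/docid/{}".format(
        quote(namespace, safe=""),
        quote(doc_type, safe=""),
        quote(doc_id, safe=""),
    )
    # "assign" replaces the current field value with the given one.
    body = json.dumps({"fields": {field: {"assign": empty_value}}})
    return path, body
```

Calling `clear_field_update("mynamespace", "music", "1", "artist")` yields the path and JSON body for emptying the `artist` field of one document. Clearing the field for the whole corpus means repeating this for every document id.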
Relevant pointers: