Changing between index and attribute is a common field change operation to optimize performance.
Use the reindexing feature to safely migrate data to/from index structures.
Changing from attribute to index can be seen as "drop attribute" and "add index". When the attribute aspect of a field is removed, the field's data is not queryable after deployment. The reindexing process will populate the field's index structure, but this takes time, depending on corpus size.
Another approach is to run with both attribute and index in the transition, keeping data available for queries.
The gist of this procedure is to add index, run a reindex, then remove the attribute aspect:
# field configuration at start
field artist type string {
    indexing: summary | attribute
}
->
# intermediate step to populate index structure, keeping the data in the attribute
field artist type string {
    indexing: summary | attribute | index
    match: word
    stemming: none
}
->
# final configuration, migrated to index
field artist type string {
    indexing: summary | index
    match: word
    stemming: none
}
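The three configurations differ only in the indexing statement. As a rough illustration of the sequence, the sketch below derives the intermediate and final statements from the starting one. It is plain string manipulation, not part of Vespa, and the function name `migration_steps` is hypothetical.

```python
def migration_steps(indexing):
    """Return the indexing statements for an attribute -> index migration.

    Hypothetical helper for illustration only: it treats the
    pipe-separated indexing statement as plain text.
    """
    parts = [p.strip() for p in indexing.split("|")]
    if "attribute" not in parts:
        raise ValueError("field has no attribute aspect to migrate from")
    if "index" in parts:
        raise ValueError("field already has an index aspect")
    start = " | ".join(parts)
    # Intermediate step: keep the attribute while the index is populated.
    both = " | ".join(parts + ["index"])
    # Final step: drop the attribute aspect once reindexing has completed.
    final = " | ".join(p for p in parts + ["index"] if p != "attribute")
    return [start, both, final]


steps = migration_steps("summary | attribute")
for step in steps:
    print(f"indexing: {step}")
```

The match and stemming settings are not handled here; they are added together with the index aspect, as in the configurations above.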
rank: filter, see example in feature-tuning.
Test this using the quick-start, changing the artist field to an attribute before running.
Also add a validation override in src/main/application/validation-overrides.xml:
<validation-overrides>
    <allow until="2021-08-30">indexing-change</allow>
</validation-overrides>
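The until date in the example is fixed, so it needs updating when these steps are repeated later. A minimal sketch for generating the file body with a date relative to today follows. The helper name `validation_overrides_xml` and the default of 7 days are assumptions for illustration; check the validation overrides reference for how far ahead the date may be.

```python
from datetime import date, timedelta


def validation_overrides_xml(validation_id, days_ahead=7, today=None):
    """Render a validation-overrides.xml body with a near-future until date.

    days_ahead=7 is an assumed, illustrative default.
    """
    until = (today or date.today()) + timedelta(days=days_ahead)
    return (
        "<validation-overrides>\n"
        f'    <allow until="{until.isoformat()}">{validation_id}</allow>\n'
        "</validation-overrides>\n"
    )


# A fixed "today" keeps the example reproducible; omit it to use the current date.
xml_body = validation_overrides_xml("indexing-change", today=date(2021, 8, 6))
print(xml_body)
```

Write the returned text to src/main/application/validation-overrides.xml before deploying.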
Run the quick start, stop after feeding documents. Run a query to validate data can be queried:
$ curl "http://localhost:8080/search/?ranking=rank_albums&yql=select%20%2A%20from%20sources%20%2A%20where%20artist%20contains%20%22Coldplay%22"
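The yql parameter in the command above is the percent-encoded form of select * from sources * where artist contains "Coldplay". As a sketch, the same URL can be built from the readable query with the Python standard library; the function name `search_url` is hypothetical, and the endpoint and rank profile are the ones used in this example.

```python
from urllib.parse import quote


def search_url(yql, ranking="rank_albums",
               endpoint="http://localhost:8080/search/"):
    """Build a query URL, percent-encoding the YQL string."""
    # safe="" encodes every reserved character, including "*" and quotes.
    return f"{endpoint}?ranking={ranking}&yql={quote(yql, safe='')}"


url = search_url('select * from sources * where artist contains "Coldplay"')
print(url)
```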
One can also dump the current index structures (see Notes) to confirm that artist is stored as an attribute.
Add the index aspect and match/stemming settings to the field, deploy, and observe the output:
field artist type string {
    indexing: summary | attribute | index
    match : word
    stemming : none
}
$ (cd src/main/application && zip -r - .) | curl --header Content-Type:application/zip --data-binary @- \
localhost:19071/application/v2/tenant/default/prepareandactivate
{
"log": [
{
"time": 1628239290150,
"level": "WARNING",
"message": "Change(s) between active and new application that may require re-index:\nindexing-change:
Consider re-indexing document type 'music' in cluster 'music' because:\n
1) Document type 'music': Field 'artist' changed: add index aspect,
indexing script: '{ input artist | summary artist | attribute artist; }' ->
'{ input artist | exact | summary artist | attribute artist | index artist; }'\n"
}
],
"tenant": "default",
"url": "http://localhost:19071/application/v2/tenant/default/application/default/environment/prod/region/default/instance/default",
"message": "Session 3 for tenant 'default' prepared and activated.",
"configChangeActions": {
"restart": [],
"refeed": [],
"reindex": [
{
"name": "indexing-change",
"documentType": "music",
"clusterName": "music",
"messages": [
"Document type 'music': Field 'artist' changed:
add index aspect, indexing script:
'{ input artist | summary artist | attribute artist; }'
->
'{ input artist | exact | summary artist | attribute artist | index artist; }'"
],
"services": [
{
"serviceName": "searchnode",
"serviceType": "searchnode",
"configId": "music/search/cluster.music/0",
"hostName": "vespa-container"
}
]
}
]
}
}
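The part of the response that matters for this procedure is the reindex list under configChangeActions. A small sketch for pulling it out of a saved response is shown below; the function name `reindex_actions` is hypothetical and the sample is a trimmed copy of the output above.

```python
import json


def reindex_actions(deploy_response):
    """Return (documentType, clusterName, name) for each reindex action."""
    actions = deploy_response.get("configChangeActions", {}).get("reindex", [])
    return [(a["documentType"], a["clusterName"], a["name"]) for a in actions]


# Trimmed copy of the deploy response, keeping only the fields used here.
sample = json.loads("""
{
  "tenant": "default",
  "configChangeActions": {
    "restart": [],
    "refeed": [],
    "reindex": [
      {
        "name": "indexing-change",
        "documentType": "music",
        "clusterName": "music"
      }
    ]
  }
}
""")

for doctype, cluster, reason in reindex_actions(sample):
    print(f"reindex {doctype} in cluster {cluster}: {reason}")
```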
Wait for the new configuration generation to be activated on the config server(s); this is normally near-immediate. After that, allow up to 3 minutes for the config servers to mark reindexing as ready. Track this using the reindexing endpoint:
$ while true; do
    curl http://localhost:19071/application/v2/tenant/default/application/default/environment/prod/region/default/instance/default/reindexing | jq .
    sleep 10
  done
{
  "enabled": true,
  "clusters": {
    "music": {
      "pending": {
        "music": 3
      },
      "ready": {
        "music": {}
      }
    }
  }
}
...
{
  "enabled": true,
  "clusters": {
    "music": {
      "pending": {},
      "ready": {
        "music": {
          "readyMillis": 1628665589516
        }
      }
    }
  }
}
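The two responses differ in where the document type appears: first under pending, then under ready with a readyMillis timestamp. A sketch of that check, based only on the response shapes shown here (the function name `reindexing_phase` is hypothetical, and other shapes may exist):

```python
def reindexing_phase(status, cluster, doctype):
    """Classify a reindexing status response as 'pending', 'ready' or 'unknown'."""
    info = status.get("clusters", {}).get(cluster, {})
    if doctype in info.get("pending", {}):
        return "pending"
    if "readyMillis" in info.get("ready", {}).get(doctype, {}):
        return "ready"
    return "unknown"


# The two responses from the polling loop, as Python dictionaries.
waiting_status = {"enabled": True, "clusters": {"music": {
    "pending": {"music": 3}, "ready": {"music": {}}}}}
ready_status = {"enabled": True, "clusters": {"music": {
    "pending": {}, "ready": {"music": {"readyMillis": 1628665589516}}}}}

print(reindexing_phase(waiting_status, "music", "music"))  # pending
print(reindexing_phase(ready_status, "music", "music"))    # ready
```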
When ready, deploy again to start reindexing, then wait for it to complete (use the loop from the previous step):
$ (cd src/main/application && zip -r - .) | curl --header Content-Type:application/zip --data-binary @- \
  localhost:19071/application/v2/tenant/default/prepareandactivate
...
{
  "enabled": true,
  "clusters": {
    "music": {
      "pending": {},
      "ready": {
        "music": {
          "readyMillis": 1628665589516
        }
      }
    }
  }
}
...
{
  "enabled": true,
  "clusters": {
    "music": {
      "pending": {},
      "ready": {
        "music": {
          "readyMillis": 1628665589516,
          "startedMillis": 1628668739973,
          "endedMillis": 1628668740536,
          "state": "successful"
        }
      }
    }
  }
}
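Reindexing is complete once the entry carries a state and both startedMillis and endedMillis. The sketch below reads these fields and computes the elapsed time; the function name `reindexing_result` is hypothetical and it relies only on the fields visible in the response above.

```python
def reindexing_result(status, cluster, doctype):
    """Return (state, duration_ms); each is None until the field is reported."""
    entry = (status.get("clusters", {})
                   .get(cluster, {})
                   .get("ready", {})
                   .get(doctype, {}))
    started = entry.get("startedMillis")
    ended = entry.get("endedMillis")
    duration = ended - started if started is not None and ended is not None else None
    return entry.get("state"), duration


# Final response from the run above, as a Python dictionary.
done_status = {"enabled": True, "clusters": {"music": {"pending": {}, "ready": {
    "music": {"readyMillis": 1628665589516,
              "startedMillis": 1628668739973,
              "endedMillis": 1628668740536,
              "state": "successful"}}}}}

state, duration_ms = reindexing_result(done_status, "music", "music")
print(state, duration_ms)  # successful 563
```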
Dumping the index structures now shows artist both in index and attribute, and there is an entry in vespa.log. Verify the query still works:
$ docker exec vespa /usr/bin/sh -c 'vespa-logfmt | grep Reindexer'
[2021-08-11 07:59:00.535] INFO : container-clustercontroller Container.ai.vespa.reindexing.Reindexer Completed reindexing of datatype music (code: 1412693671) after PT0.558683S

$ curl "http://localhost:8080/search/?ranking=rank_albums&yql=select%20%2A%20from%20sources%20%2A%20where%20artist%20contains%20%22Coldplay%22"
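When checking many log lines, the document type and duration can be extracted from the Reindexer entry. This is a sketch matched against the single log line shown here. The pattern and the function name `parse_completion` are assumptions, and the pattern only handles seconds-only durations such as PT0.558683S, so longer runs that log minutes or hours will not match.

```python
import re

# The log line from the example above.
LOG_LINE = ("[2021-08-11 07:59:00.535] INFO : container-clustercontroller "
            "Container.ai.vespa.reindexing.Reindexer Completed reindexing of "
            "datatype music (code: 1412693671) after PT0.558683S")

# Assumed pattern, derived from that one line only.
PATTERN = re.compile(
    r"Completed reindexing of datatype (\S+) \(code: -?\d+\) after PT([0-9.]+)S"
)


def parse_completion(line):
    """Return (document_type, seconds), or None if the line does not match."""
    match = PATTERN.search(line)
    if match is None:
        return None
    return match.group(1), float(match.group(2))


print(parse_completion(LOG_LINE))
```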
As data is now reindexed into the index data structures, deploy without the attribute aspect (observe the changes to the index files: artist is now in index only). Test the query after restart:
field artist type string {
    indexing: summary | index
    match : word
    stemming : none
}

$ (cd src/main/application && zip -r - .) | curl --header Content-Type:application/zip --data-binary @- \
  localhost:19071/application/v2/tenant/default/prepareandactivate

$ curl "http://localhost:8080/search/?ranking=rank_albums&yql=select%20%2A%20from%20sources%20%2A%20where%20artist%20contains%20%22Coldplay%22"
Optional: restart Vespa - a restart will reclaim memory from the attribute:
$ docker exec vespa sh -c 'vespa-stop-services && vespa-start-services'
Notes:
To inspect attribute and index data (can be useful when troubleshooting), use vespa-proton-cmd, then list files:
$ docker exec vespa vespa-proton-cmd --local triggerFlush
$ docker exec vespa find /opt/vespa/var/db/vespa/search/cluster.music/n0/documents/music/0.ready