Indexing is the process of routing document writes to indexing processors, processing (indexing) documents and writing the documents to content clusters.
Refer to the overview. The primary index configuration is the schema. services.xml configures how indexing is distributed to the nodes.
This article documents the default indexing, how to configure indexing for different clusters and how to add custom document processing.
See #13193 for a discussion on using default as a name.
document-processing is an example of custom document processing, and useful for testing routing.
A normal Vespa configuration has container and content cluster(s), with one or more document types defined in schemas. Routing document writes means routing documents to the indexing container cluster, then the right content cluster.
The indexing cluster is a container cluster - see multiple container clusters for variants. Add the document-api feed endpoint to this cluster. The mapping from document type to content cluster is in document in the content cluster. From album-recommendation:
<services version="1.0">
    <container id="container" version="1.0">
        <document-api />
        <search />
        <nodes>
            <node hostalias="node1" />
        </nodes>
    </container>
    <content id="music" version="1.0">
        <redundancy>1</redundancy>
        <documents>
            <document type="music" mode="index" />
        </documents>
        <nodes>
            <node hostalias="node1" distribution-key="0" />
        </nodes>
    </content>
</services>
Given this configuration, Vespa knows which container cluster is used for indexing, and which content cluster stores the music document type. Use vespa-route to display the routing generated from this configuration:
$ vespa-route
There are 6 route(s):
    1. default
    2. default-get
    3. music
    4. music-direct
    5. music-index
    6. storage/cluster.music
There are 2 hop(s):
    1. container/chain.indexing
    2. indexing
Note the default route. This route is auto-generated by Vespa, and is used when no other route is specified when using /document/v1. default points to indexing:
$ vespa-route --route default
The route 'default' has 1 hop(s):
    1. indexing
$ vespa-route --hop indexing
The hop 'indexing' has selector:
    [DocumentRouteSelector]
And 1 recipient(s):
    1. music
$ vespa-route --route music
The route 'music' has 1 hop(s):
    1. [MessageType:music]
In short, the default route handles documents of type music. Vespa will route to the container cluster with document-api - note the chain.indexing above. This is a set of built-in document processors that does the indexing (below).
Refer to the trace appendix for routing details.
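The resolution steps above can be mimicked with a small lookup table. The following Python sketch is a toy model, not Vespa's message bus implementation: the entries are copied from the vespa-route output and the trace appendix, while the `ROUTES` table and the `resolve` function are hypothetical names introduced for illustration only.

```python
# Toy model of route resolution. The names come from the vespa-route output and
# the trace; the table and the function are illustrative, not a Vespa API.
ROUTES = {
    "default": ["indexing"],
    "indexing": ["music"],                    # DocumentRouteSelector picks a recipient
    "music": ["[MessageType:music]"],
    "[MessageType:music]": ["music-index"],   # MessageType policy picks the put route
    "music-index": ["container/chain.indexing", "[Content:cluster=music]"],
}


def resolve(name):
    """Expand a route name recursively until no hop can be expanded further."""
    hops = []
    for hop in ROUTES.get(name, []):
        if hop in ROUTES:
            hops.extend(resolve(hop))
        else:
            hops.append(hop)
    return hops


print(resolve("default"))
```

Under these assumptions, resolving default ends at the indexing chain on the container cluster followed by the music content cluster, which matches the order of hops seen in the trace.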
This indexing chain is set up on the container once a content cluster has mode="index".
The IndexingProcessor annotates the document based on the indexing script generated from the schema. Example:
$ vespa-get-config -n vespa.configdefinition.ilscripts \
    -i container/docprocchains/chain/indexing/component/com.yahoo.docprocs.indexing.IndexingProcessor
maxtermoccurrences 100
fieldmatchmaxlength 1000000
ilscript[0].doctype "music"
ilscript[0].docfield[0] "artist"
ilscript[0].docfield[1] "artistId"
ilscript[0].docfield[2] "title"
ilscript[0].docfield[3] "album"
ilscript[0].docfield[4] "duration"
ilscript[0].docfield[5] "year"
ilscript[0].docfield[6] "popularity"
ilscript[0].content[0] "clear_state | guard { input artist | tokenize normalize stem:"BEST" | summary artist | index artist; }"
ilscript[0].content[1] "clear_state | guard { input artistId | summary artistId | attribute artistId; }"
ilscript[0].content[2] "clear_state | guard { input title | tokenize normalize stem:"BEST" | summary title | index title; }"
ilscript[0].content[3] "clear_state | guard { input album | tokenize normalize stem:"BEST" | index album; }"
ilscript[0].content[4] "clear_state | guard { input duration | summary duration; }"
ilscript[0].content[5] "clear_state | guard { input year | summary year | attribute year; }"
ilscript[0].content[6] "clear_state | guard { input popularity | summary popularity | attribute popularity; }"
Refer to linguistics for more details.
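Each ilscript content entry above is a pipeline of stages separated by |. As a rough illustration of how to read one, the Python sketch below splits a statement into its stages. `pipeline_stages` is a hypothetical helper, and its regular expression only assumes the `clear_state | guard { ...; }` shape shown in the example output; it is not a parser for the full indexing language.

```python
import re


def pipeline_stages(statement):
    """Split one indexing statement into its '|'-separated stages."""
    # Keep only the part between 'guard {' and the closing ';' when present.
    match = re.search(r"guard\s*\{\s*(.*?);\s*\}", statement)
    body = match.group(1) if match else statement
    return [stage.strip() for stage in body.split("|")]


# Statement copied from the example config for the album field.
statement = 'clear_state | guard { input album | tokenize normalize stem:"BEST" | index album; }'
for stage in pipeline_stages(statement):
    print(stage)
```

For the album statement this lists an input stage, a tokenize stage and an index stage, with no summary stage, which is a quick way to see which fields are written to the index, to summaries, or to attributes.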
By default, the indexing chain is set up on the first container cluster in services.xml. With multiple container clusters, configure this explicitly - see multiple container clusters.
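The rule in the previous paragraph can be sketched as a lookup over services.xml. This is a simplified model based only on what this article states (an explicit document-processing cluster attribute in the content cluster wins, otherwise the first container cluster is used). `indexing_cluster` is a hypothetical helper, not part of any Vespa tool, and the real config model may apply further rules.

```python
import xml.etree.ElementTree as ET


def indexing_cluster(services_xml, content_id):
    """Return the id of the container cluster assumed to run the indexing chain."""
    root = ET.fromstring(services_xml)
    for content in root.findall("content"):
        if content.get("id") != content_id:
            continue
        # An explicit <document-processing cluster="..."/> takes precedence.
        explicit = content.find("documents/document-processing")
        if explicit is not None and explicit.get("cluster"):
            return explicit.get("cluster")
    # Otherwise fall back to the first container cluster in document order.
    first = root.find("container")
    return first.get("id") if first is not None else None


SERVICES = """
<services version="1.0">
    <container id="container-search" version="1.0"/>
    <container id="container-indexing" version="1.0"/>
    <content id="music" version="1.0">
        <documents>
            <document type="music" mode="index"/>
            <document-processing cluster="container-indexing"/>
        </documents>
    </content>
</services>
"""
print(indexing_cluster(SERVICES, "music"))
```

Removing the document-processing element from the sample makes the function fall back to container-search, which illustrates why explicit configuration is recommended when there is more than one container cluster.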
The document can have a selection string, normally used to expire documents. This is also evaluated during feeding, so documents that would immediately expire are dropped. This is not an error, and the document API will report 200, but it can be confusing.
The evaluation is done in the DocumentRouteSelector at the feeding endpoint, before any processing/indexing. That is, the document is first evaluated against the selection string to decide whether to drop it, then routed based on its document type.
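The order of these two steps can be sketched in a few lines of Python. This is an illustrative model only: a selection is represented as a plain Python predicate rather than a real Vespa selection expression, and `SELECTIONS`, `RECIPIENTS` and `route_selector` are hypothetical names, not Vespa APIs.

```python
# Selection per document type, modelled as a predicate over the document fields.
SELECTIONS = {
    "music": lambda fields: fields.get("album") == "thisstringwillnotmatch",
}
# Document type -> content cluster.
RECIPIENTS = {"music": "music"}


def route_selector(doc_type, fields):
    """Return the content cluster to route to, or None when the document is dropped."""
    matches = SELECTIONS.get(doc_type, lambda _fields: True)
    if not matches(fields):
        return None  # dropped before any processing; feeding still reports success
    return RECIPIENTS.get(doc_type)


print(route_selector("music", {"album": "Bad"}))
```

With this selection the document with album "Bad" is dropped and the function returns None; a document type without a selection always passes the first step and is routed by type.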
Example: the selection is configured to not match the document being fed:
<content id="music" version="1.0">
<redundancy>1</redundancy>
<documents>
<document type="music" mode="index" selection='music.album == "thisstringwillnotmatch"'/>
$ vespa-feeder --trace 6 doc.json
<trace>
[1564576570.693] Source session accepted a 4096 byte message. 1 message(s) now pending.
[1564576570.713] Sequencer sending message with sequence id '-1163801147'.
[1564576570.721] Recognized 'default' as route 'indexing'.
[1564576570.727] Recognized 'indexing' as HopBlueprint(selector = { '[DocumentRouteSelector]' }, recipients = { 'music' }, ignoreResult = false).
[1564576570.811] Running routing policy 'DocumentRouteSelector'.
[1564576570.822] Policy 'DocumentRouteSelector' assigned a reply to this branch.
[1564576570.828] Sequencer received reply with sequence id '-1163801147'.
[1564576570.828] Source session received reply. 0 message(s) now pending.
</trace>
Messages sent to vespa (route default) :
----------------------------------------
PutDocument: ok: 0 msgs/sec: 0.00 failed: 0 ignored: 1 latency(min, max, avg): 9223372036854775807, -9223372036854775808, 0
Without the selection (i.e. everything matches):
$ vespa-feeder --trace 6 doc.json
<trace>
[1564576637.147] Source session accepted a 4096 byte message. 1 message(s) now pending.
[1564576637.168] Sequencer sending message with sequence id '-1163801147'.
[1564576637.176] Recognized 'default' as route 'indexing'.
[1564576637.180] Recognized 'indexing' as HopBlueprint(selector = { '[DocumentRouteSelector]' }, recipients = { 'music' }, ignoreResult = false).
[1564576637.256] Running routing policy 'DocumentRouteSelector'.
[1564576637.268] Component '[MessageType:music]' selected by policy 'DocumentRouteSelector'.
...
</trace>
Messages sent to vespa (route default) :
----------------------------------------
PutDocument: ok: 1 msgs/sec: 1.05 failed: 0 ignored: 0 latency(min, max, avg): 845, 845, 845
In the last case, the DocumentRouteSelector routing policy found that the document matched the selection string (or that there was no selection string), and the document was forwarded to the next hop in the route.
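Because a dropped document is not reported as a failure, it can help to check the counters in the feeder summary line. The Python sketch below extracts them from the PutDocument line shown in the two examples above. `feed_counts` is a hypothetical helper, and the line format is assumed to match those examples.

```python
import re


def feed_counts(summary_line):
    """Extract the ok, failed and ignored counters from a feeder summary line."""
    pairs = re.findall(r"\b(ok|failed|ignored): (\d+)", summary_line)
    return {key: int(value) for key, value in pairs}


# Summary line copied from the example where the selection does not match.
line = ("PutDocument: ok: 0 msgs/sec: 0.00 failed: 0 ignored: 1 "
        "latency(min, max, avg): 9223372036854775807, -9223372036854775808, 0")
counts = feed_counts(line)
if counts["ignored"] > 0:
    print("ignored documents:", counts["ignored"])
```

In the first example the dropped document is counted under ignored, while ok and failed both stay at 0.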
Add custom processing of documents using document processing. The normal use case is to add document processors in the default route, before indexing. Example:
<services version="1.0">
    <container id="container" version="1.0">
        <document-api />
        <search />
        <document-processing>
            <chain id="default">
                <documentprocessor id="com.mydomain.example.Rot13DocumentProcessor"
                                   bundle="album-recommendation-docproc" />
            </chain>
        </document-processing>
        <nodes>
            <node hostalias="node1" />
        </nodes>
    </container>
    <content id="music" version="1.0">
        <redundancy>1</redundancy>
        <documents>
            <document type="music" mode="index" />
        </documents>
        <nodes>
            <node hostalias="node1" distribution-key="0" />
        </nodes>
    </content>
</services>
Note that a new hop default/chain.default is added, and the default route is changed to include this:
$ vespa-route
There are 6 route(s):
    1. default
    2. default-get
    3. music
    4. music-direct
    5. music-index
    6. storage/cluster.music
There are 3 hop(s):
    1. default/chain.default
    2. default/chain.indexing
    3. indexing
$ vespa-route --route default
The route 'default' has 2 hop(s):
    1. default/chain.default
    2. indexing
Note that the document processing chain must be called default to automatically be included in the default route.
An alternative to the above is inheriting the indexing chain - use this when getting this error:
Indexing cluster 'XX' specifies the chain 'default' as indexing chain. As the 'default' chain is run by default, using it as the indexing chain will run it twice. Use a different name for the indexing chain.
Call the chain something other than default, and let it inherit indexing:
<services version="1.0">
    <container id="container" version="1.0">
        <document-api />
        <search />
        <document-processing>
            <chain id="offer-processing" inherits="indexing">
                <documentprocessor id="processor.OfferDocumentProcessor"/>
            </chain>
        </document-processing>
        <nodes>
            <node hostalias="node1" />
        </nodes>
    </container>
    <content id="music" version="1.0">
        <redundancy>1</redundancy>
        <documents>
            <document type="offer" mode="index"/>
            <document-processing cluster="default" chain="offer-processing"/>
        </documents>
        <nodes>
            <node hostalias="node1" distribution-key="0" />
        </nodes>
    </content>
</services>
See #13193 for details.
Vespa can be configured to use more than one container cluster. Use cases include separating search from document processing, or running separate document processing clusters because of capacity constraints or dependencies. Example with separate search and feeding/indexing container clusters:
<services version="1.0">
    <container id="container-search" version="1.0">
        <search />
        <nodes>
            <node hostalias="node1" />
        </nodes>
    </container>
    <container id="container-indexing" version="1.0">
        <http>
            <server id="httpServer2" port="8081" />
        </http>
        <document-api />
        <document-processing />
        <nodes>
            <node hostalias="node1" />
        </nodes>
    </container>
    <content id="music" version="1.0">
        <redundancy>1</redundancy>
        <documents>
            <document type="music" mode="index" />
            <document-processing cluster="container-indexing" />
        </documents>
        <nodes>
            <node hostalias="node1" distribution-key="0" />
        </nodes>
    </content>
</services>
Notes:
Observe the container-indexing/chain.indexing hop, which shows that the indexing chain is set up on the container-indexing cluster:
$ vespa-route
There are 6 route(s):
    1. default
    2. default-get
    3. music
    4. music-direct
    5. music-index
    6. storage/cluster.music
There are 2 hop(s):
    1. container-indexing/chain.indexing
    2. indexing
$ curl -s http://localhost:8081 | python -m json.tool | grep -C 3 chain.indexing
        {
            "bundle": "container-disc:7.0.0",
            "class": "com.yahoo.messagebus.jdisc.MbusClient",
            "id": "chain.indexing@MbusClient",
            "serverBindings": []
        },
        {
--
            "class": "com.yahoo.docproc.jdisc.DocumentProcessingHandler",
            "id": "com.yahoo.docproc.jdisc.DocumentProcessingHandler",
            "serverBindings": [
                "mbus://*/chain.indexing"
            ]
        },
        {
Below is a trace example, no selection string:
$ cat doc.json
[
    {
        "put": "id:mynamespace:music::123",
        "fields": {
            "album": "Bad",
            "artist": "Michael Jackson",
            "title": "Bad",
            "year": 1987,
            "duration": 247
        }
    }
]
$ vespa-feeder --trace 6 doc.json
<trace>
[1564571762.403] Source session accepted a 4096 byte message. 1 message(s) now pending.
[1564571762.420] Sequencer sending message with sequence id '-1163801147'.
[1564571762.426] Recognized 'default' as route 'indexing'.
[1564571762.429] Recognized 'indexing' as HopBlueprint(selector = { '[DocumentRouteSelector]' }, recipients = { 'music' }, ignoreResult = false).
[1564571762.489] Running routing policy 'DocumentRouteSelector'.
[1564571762.493] Component '[MessageType:music]' selected by policy 'DocumentRouteSelector'.
[1564571762.493] Resolving '[MessageType:music]'.
[1564571762.520] Running routing policy 'MessageType'.
[1564571762.520] Component 'music-index' selected by policy 'MessageType'.
[1564571762.520] Resolving 'music-index'.
[1564571762.520] Recognized 'music-index' as route 'container/chain.indexing [Content:cluster=music]'.
[1564571762.520] Recognized 'container/chain.indexing' as HopBlueprint(selector = { '[LoadBalancer:cluster=container;session=chain.indexing]' }, recipients = { }, ignoreResult = false).
[1564571762.526] Running routing policy 'LoadBalancer'.
[1564571762.538] Component 'tcp/vespa-container:19101/chain.indexing' selected by policy 'LoadBalancer'.
[1564571762.538] Resolving 'tcp/vespa-container:19101/chain.indexing [Content:cluster=music]'.
[1564571762.580] Sending message (version 7.83.27) from client to 'tcp/vespa-container:19101/chain.indexing' with 179.853 seconds timeout.
[1564571762.581] Message (type 100004) received at 'container/container.0' for session 'chain.indexing'.
[1564571762.581] Message received by MbusServer.
[1564571762.582] Request received by MbusClient.
[1564571762.582] Running routing policy 'Content'.
[1564571762.582] Selecting route
[1564571762.582] No cluster state cached. Sending to random distributor.
[1564571762.582] Too few nodes seen up in state. Sending totally random.
[1564571762.582] Component 'tcp/vespa-container:19114/default' selected by policy 'Content'.
[1564571762.582] Resolving 'tcp/vespa-container:19114/default'.
[1564571762.586] Sending message (version 7.83.27) from 'container/container.0' to 'tcp/vespa-container:19114/default' with 179.995 seconds timeout.
[1564571762.587181] Message (type 100004) received at 'storage/cluster.music/distributor/0' for session 'default'.
[1564571762.587245] music/distributor/0 CommunicationManager: Received message from message bus
[1564571762.587510] Communication manager: Sending Put(BucketId(0x2000000000000020), id:mynamespace:music::123, timestamp 1564571762000000, size 275)
[1564571762.587529] Communication manager: Passing message to source session
[1564571762.587547] Source session accepted a 1 byte message. 1 message(s) now pending.
[1564571762.587681] Sending message (version 7.83.27) from 'storage/cluster.music/distributor/0' to 'storage/cluster.music/storage/0/default' with 180.00 seconds timeout.
[1564571762.587960] Message (type 10) received at 'storage/cluster.music/storage/0' for session 'default'.
[1564571762.588052] music/storage/0 CommunicationManager: Received message from message bus
[1564571762.588263] PersistenceThread: Processing message in persistence layer
[1564571762.588953] Communication manager: Sending PutReply(id:mynamespace:music::123, BucketId(0x2000000000000020), timestamp 1564571762000000)
[1564571762.589023] Sending reply (version 7.83.27) from 'storage/cluster.music/storage/0'.
[1564571762.589332] Reply (type 11) received at 'storage/cluster.music/distributor/0'.
[1564571762.589448] Source session received reply. 0 message(s) now pending.
[1564571762.589459] music/distributor/0Communication manager: Received reply from message bus
[1564571762.589679] Communication manager: Sending PutReply(id:music:music::123, BucketId(0x0000000000000000), timestamp 1564571762000000)
[1564571762.589807] Sending reply (version 7.83.27) from 'storage/cluster.music/distributor/0'.
[1564571762.590] Reply (type 200004) received at 'container/container.0'.
[1564571762.590] Routing policy 'Content' merging replies.
[1564571762.590] Reply received by MbusClient.
[1564571762.590] Sending reply from MbusServer.
[1564571762.590] Sending reply (version 7.83.27) from 'container/container.0'.
[1564571762.612] Reply (type 200004) received at client.
[1564571762.613] Routing policy 'LoadBalancer' merging replies.
[1564571762.613] Routing policy 'MessageType' merging replies.
[1564571762.615] Routing policy 'DocumentRouteSelector' merging replies.
[1564571762.622] Sequencer received reply with sequence id '-1163801147'.
[1564571762.622] Source session received reply. 0 message(s) now pending.
</trace>
Messages sent to vespa (route default) :
----------------------------------------
PutDocument: ok: 1 msgs/sec: 3.30 failed: 0 ignored: 0 latency(min, max, avg): 225, 225, 225
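A trace like the one above is long, and the routing decisions are easier to follow when only the policy selections are listed. The Python sketch below pulls them out of the trace text. `policy_selections` is a hypothetical helper, and it assumes the "Component '...' selected by policy '...'" wording seen in this trace.

```python
import re


def policy_selections(trace_text):
    """Return (policy, component) pairs in the order they appear in a trace."""
    pattern = r"Component '([^']+)' selected by policy '([^']+)'"
    return [(policy, component) for component, policy in re.findall(pattern, trace_text)]


# Three lines copied from the trace above.
sample = """
[1564571762.493] Component '[MessageType:music]' selected by policy 'DocumentRouteSelector'.
[1564571762.520] Component 'music-index' selected by policy 'MessageType'.
[1564571762.538] Component 'tcp/vespa-container:19101/chain.indexing' selected by policy 'LoadBalancer'.
"""
for policy, component in policy_selections(sample):
    print(policy, "->", component)
```

Applied to a trace where the selection string drops the document, the same function returns an empty list, since the DocumentRouteSelector policy assigns a reply instead of selecting a component.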