Indexing

Indexing is the process of routing document writes to indexing processors, processing (indexing) documents and writing the documents to content clusters.

Refer to the overview. The primary index configuration is the search definition - services.xml configures how indexing is distributed to the nodes.

This article documents the default indexing, how to configure indexing for different clusters, the different index and streaming modes and finally how to add custom document processing.

Routing

A normal Vespa configuration has container and content cluster(s), with one or more document types defined in search definitions. Routing document writes hence means routing documents to the indexing container cluster, then to the content cluster.

The indexing cluster is the container cluster with document-api. The mapping from document type to content cluster is in the document element in the content cluster. From album-recommendation-selfhosted:

<services version="1.0">

  <container id="container" version="1.0">
    <document-api />
    <search />
    <nodes>
      <node hostalias="node1" />
    </nodes>
  </container>

  <content id="music" version="1.0">
    <redundancy>1</redundancy>
    <documents>
      <document type="music" mode="index" />
    </documents>
    <nodes>
      <node hostalias="node1" distribution-key="0" />
    </nodes>
  </content>

</services>
Given this configuration, Vespa knows which container cluster is used for indexing, and which content cluster stores the music document type. Use vespa-route to display the routing generated from this configuration:
$ vespa-route
There are 6 route(s):
    1. default
    2. default-get
    3. music
    4. music-direct
    5. music-index
    6. storage/cluster.music

There are 2 hop(s):
    1. container/chain.indexing
    2. indexing
Note the default route. This route is auto-generated by Vespa, and is used when no other route is specified when using the Document API. default points to indexing:
$ vespa-route --route default
The route 'default' has 1 hop(s):
    1. indexing
$ vespa-route --hop indexing
The hop 'indexing' has selector:
       [DocumentRouteSelector]
And 1 recipient(s):
    1. music
$ vespa-route --route music
The route 'music' has 1 hop(s):
    1. [MessageType:music]
In short, the default route handles documents of type music. Vespa will route to the container cluster with document-api - note the chain.indexing above. This is a set of built-in document processors that do the indexing (below).

Refer to the trace appendix for routing details.
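
To illustrate how the document element maps document types to content clusters, here is a minimal sketch with two content clusters, each storing one document type. The books document type, the books cluster id and the node2 host alias are hypothetical and not part of the sample application. The fragment is meant to sit inside services, with the container cluster from the example above unchanged.

```xml
<!-- Sketch only: "books" and "node2" are hypothetical names -->
<content id="music" version="1.0">
  <redundancy>1</redundancy>
  <documents>
    <!-- music documents are stored in the music cluster -->
    <document type="music" mode="index" />
  </documents>
  <nodes>
    <node hostalias="node1" distribution-key="0" />
  </nodes>
</content>

<content id="books" version="1.0">
  <redundancy>1</redundancy>
  <documents>
    <!-- books documents are stored in the books cluster -->
    <document type="books" mode="index" />
  </documents>
  <nodes>
    <node hostalias="node2" distribution-key="0" />
  </nodes>
</content>
```

With a configuration like this, the type-to-cluster mapping is what lets the routing pick a content cluster for each document written.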

chain.indexing

This indexing chain is set up on the container once a content cluster has mode="index" - as opposed to mode="streaming".

The IndexingProcessor annotates the document based on the indexing script generated from the search definition. Example:

$ vespa-get-config -n vespa.configdefinition.ilscripts \
  -i container/docprocchains/chain/indexing/component/com.yahoo.docprocs.indexing.IndexingProcessor
  
maxtermoccurrences 100
fieldmatchmaxlength 1000000
ilscript[0].doctype "music"
ilscript[0].docfield[0] "artist"
ilscript[0].docfield[1] "artistId"
ilscript[0].docfield[2] "title"
ilscript[0].docfield[3] "album"
ilscript[0].docfield[4] "duration"
ilscript[0].docfield[5] "year"
ilscript[0].docfield[6] "popularity"
ilscript[0].content[0] "clear_state | guard { input artist | tokenize normalize stem:"BEST" | summary artist | index artist; }"
ilscript[0].content[1] "clear_state | guard { input artistId | summary artistId | attribute artistId; }"
ilscript[0].content[2] "clear_state | guard { input title | tokenize normalize stem:"BEST" | summary title | index title; }"
ilscript[0].content[3] "clear_state | guard { input album | tokenize normalize stem:"BEST" | index album; }"
ilscript[0].content[4] "clear_state | guard { input duration | summary duration; }"
ilscript[0].content[5] "clear_state | guard { input year | summary year | attribute year; }"
ilscript[0].content[6] "clear_state | guard { input popularity | summary popularity | attribute popularity; }"
Refer to linguistics for more details.

By default, the indexing chain is set up on the first container cluster in services.xml. With multiple container clusters, it is recommended to configure this explicitly, see multiple container clusters.
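
For the single-cluster example above, making the indexing cluster explicit might look like the following sketch. It reuses the document-processing element shown under multiple container clusters, and assumes the container cluster id is container, as in the first example.

```xml
<content id="music" version="1.0">
  <redundancy>1</redundancy>
  <documents>
    <document type="music" mode="index" />
    <!-- Run the indexing chain on the container cluster with id "container" -->
    <document-processing cluster="container" />
  </documents>
  <nodes>
    <node hostalias="node1" distribution-key="0" />
  </nodes>
</content>
```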

Document selection

The document can have a selection string, normally used to expire documents. The selection is also evaluated during feeding, so documents that would immediately expire are dropped. This is not an error, and the document API will report 200, but it can be confusing.

The evaluation is done in the DocumentRouteSelector at the feeding endpoint, before any processing or indexing. I.e., the document is first evaluated against the selection string (drop it or not), then routed based on its document type.

Example: the selection is configured to not match the document being fed:

<content id="music" version="1.0">
  <redundancy>1</redundancy>
  <documents>
    <document type="music" mode="index" selection='music.album == "thisstringwillnotmatch"'/>
$ vespa-feeder --trace 6 doc.json

<trace>
    [1564576570.693] Source session accepted a 4096 byte message. 1 message(s) now pending.
    [1564576570.713] Sequencer sending message with sequence id '-1163801147'.
    [1564576570.721] Recognized 'default' as route 'indexing'.
    [1564576570.727] Recognized 'indexing' as HopBlueprint(selector = { '[DocumentRouteSelector]' }, recipients = { 'music' }, ignoreResult = false).
    [1564576570.811] Running routing policy 'DocumentRouteSelector'.
    [1564576570.822] Policy 'DocumentRouteSelector' assigned a reply to this branch.
    [1564576570.828] Sequencer received reply with sequence id '-1163801147'.
    [1564576570.828] Source session received reply. 0 message(s) now pending.
</trace>

Messages sent to vespa (route default) :
----------------------------------------
PutDocument:	ok: 0 msgs/sec: 0.00 failed: 0 ignored: 1 latency(min, max, avg): 9223372036854775807, -9223372036854775808, 0
Without the selection (i.e. everything matches):
$ vespa-feeder --trace 6 doc.json

<trace>
    [1564576637.147] Source session accepted a 4096 byte message. 1 message(s) now pending.
    [1564576637.168] Sequencer sending message with sequence id '-1163801147'.
    [1564576637.176] Recognized 'default' as route 'indexing'.
    [1564576637.180] Recognized 'indexing' as HopBlueprint(selector = { '[DocumentRouteSelector]' }, recipients = { 'music' }, ignoreResult = false).
    [1564576637.256] Running routing policy 'DocumentRouteSelector'.
    [1564576637.268] Component '[MessageType:music]' selected by policy 'DocumentRouteSelector'.
    ...
</trace>
    
Messages sent to vespa (route default) :
----------------------------------------
PutDocument:	ok: 1 msgs/sec: 1.05 failed: 0 ignored: 0 latency(min, max, avg): 845, 845, 845
In the last case, the document matched the selection string (or there was no selection string) in the DocumentRouteSelector routing policy, and the document was forwarded to the next hop in the route.
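
A more typical selection expresses an expiry rule rather than a deliberate mismatch. The following is a sketch, assuming the year field from the music document type above. The threshold 1999 is arbitrary, and the garbage-collection attribute is assumed here to enable periodic removal of stored documents that no longer match.

```xml
<content id="music" version="1.0">
  <redundancy>1</redundancy>
  <!-- garbage-collection: assumed to remove stored documents that stop matching -->
  <documents garbage-collection="true">
    <!-- Keep only music documents with year after 1999 (greater-than is XML-escaped) -->
    <document type="music" mode="index" selection="music.year &gt; 1999" />
  </documents>
  <nodes>
    <node hostalias="node1" distribution-key="0" />
  </nodes>
</content>
```

With such a selection, a put of a document with year 1987, like doc.json in the trace appendix, would be expected to be reported as ignored by vespa-feeder, as in the first trace above.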

Document processing

Add custom processing of documents using document processing. The normal use case is to add document processors in the default route, before indexing. Example:

<services version="1.0">

  <container id="container" version="1.0">
    <document-api />
    <search />
    <document-processing>
        <chain id="default">
            <documentprocessor id="com.mydomain.example.Rot13DocumentProcessor" bundle="album-recommendation-java" />
        </chain>
    </document-processing>
    <nodes>
      <node hostalias="node1" />
    </nodes>
  </container>

  <content id="music" version="1.0">
    <redundancy>1</redundancy>
    <documents>
      <document type="music" mode="index" />
    </documents>
    <nodes>
      <node hostalias="node1" distribution-key="0" />
    </nodes>
  </content>

</services>
Note that a new hop default/chain.default is added, and the default route is changed to include this:
$ vespa-route

There are 6 route(s):
    1. default
    2. default-get
    3. music
    4. music-direct
    5. music-index
    6. storage/cluster.music

There are 3 hop(s):
    1. default/chain.default
    2. default/chain.indexing
    3. indexing
$ vespa-route --route default

The route 'default' has 2 hop(s):
    1. default/chain.default
    2. indexing
Note that the document processing chain must be called default to automatically be included in the default route.

Streaming search does not generate posting lists, so the routing configuration differs from indexed search. Indexed search:

$ vespa-route
There are 6 route(s):
    1. default
    2. default-get
    3. music
    4. music-direct
    5. music-index
    6. storage/cluster.music

There are 2 hop(s):
    1. container/chain.indexing
    2. indexing
Streaming search:
$ vespa-route
There are 4 route(s):
    1. default
    2. default-get
    3. music
    4. storage/cluster.music

There are 1 hop(s):
    1. indexing
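
The streaming output above comes from a content cluster using mode="streaming". A minimal sketch, assuming the music cluster from the first example with only the mode changed:

```xml
<content id="music" version="1.0">
  <redundancy>1</redundancy>
  <documents>
    <!-- streaming: no posting lists are generated, and no chain.indexing hop appears -->
    <document type="music" mode="streaming" />
  </documents>
  <nodes>
    <node hostalias="node1" distribution-key="0" />
  </nodes>
</content>
```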
Trace from feeding using indexed search:
    [1564571762.403] Source session accepted a 4096 byte message. 1 message(s) now pending.
    [1564571762.420] Sequencer sending message with sequence id '-1163801147'.
    [1564571762.426] Recognized 'default' as route 'indexing'.
    [1564571762.429] Recognized 'indexing' as HopBlueprint(selector = { '[DocumentRouteSelector]' }, recipients = { 'music' }, ignoreResult = false).
    [1564571762.489] Running routing policy 'DocumentRouteSelector'.
    [1564571762.493] Component '[MessageType:music]' selected by policy 'DocumentRouteSelector'.
    [1564571762.493] Resolving '[MessageType:music]'.
    [1564571762.520] Running routing policy 'MessageType'.
    [1564571762.520] Component 'music-index' selected by policy 'MessageType'.
    [1564571762.520] Resolving 'music-index'.
    [1564571762.520] Recognized 'music-index' as route 'container/chain.indexing [Content:cluster=music]'.
    [1564571762.520] Recognized 'container/chain.indexing' as HopBlueprint(selector = { '[LoadBalancer:cluster=container;session=chain.indexing]' }, recipients = {  }, ignoreResult = false).
    [1564571762.526] Running routing policy 'LoadBalancer'.
    [1564571762.538] Component 'tcp/vespa-container:19101/chain.indexing' selected by policy 'LoadBalancer'.
    [1564571762.538] Resolving 'tcp/vespa-container:19101/chain.indexing [Content:cluster=music]'.
    [1564571762.580] Sending message (version 7.83.27) from client to 'tcp/vespa-container:19101/chain.indexing' with 179.853 seconds timeout.
    [1564571762.581] Message (type 100004) received at 'container/container.0' for session 'chain.indexing'.
    [1564571762.581] Message received by MbusServer.
    [1564571762.582] Request received by MbusClient.
    [1564571762.582] Running routing policy 'Content'.
    [1564571762.582] Selecting route
Streaming search:
    [1564578828.735] Source session accepted a 4096 byte message. 1 message(s) now pending.
    [1564578828.752] Sequencer sending message with sequence id '-1163801147'.
    [1564578828.759] Recognized 'default' as route 'indexing'.
    [1564578828.763] Recognized 'indexing' as HopBlueprint(selector = { '[DocumentRouteSelector]' }, recipients = { 'music' }, ignoreResult = false).
    [1564578828.810] Running routing policy 'DocumentRouteSelector'.
    [1564578828.814] Component '[Content:cluster=music]' selected by policy 'DocumentRouteSelector'.
    [1564578828.814] Resolving '[Content:cluster=music]'.
    [1564578828.870] Running routing policy 'Content'.
    [1564578828.870] Selecting route
Observe that the DocumentRouteSelector selects different routing policies: MessageType for indexed search, and Content directly for streaming search.

Multiple container clusters

Vespa can be configured to use more than one container cluster. Use cases include separating search from document processing, or running different document processing clusters due to capacity constraints, dependencies, etc. Example with separate search and feeding/indexing container clusters:

<services version="1.0">
 
  <container id="container-search" version="1.0">
    <search />
    <nodes>
      <node hostalias="node1" />
    </nodes>
  </container>
  
  <container id="container-indexing" version="1.0">
    <http>
      <server id="httpServer2" port="8081" />
    </http>
    <document-api />
    <document-processing />
    <nodes>
      <node hostalias="node1" />
    </nodes>
  </container>

  <content id="music" version="1.0">
    <redundancy>1</redundancy>
    <documents>
      <document type="music" mode="index" />
      <document-processing cluster="container-indexing" />
    </documents>
    <nodes>
      <node hostalias="node1" distribution-key="0" />
    </nodes>
  </content>

</services>
Notes:
  • The indexing route is made explicit using the document-processing element, which points from the content cluster to the container cluster
  • Set up document-api on the same cluster as indexing to avoid a network hop from the feed endpoint to the indexing processors
Observe the container-indexing/chain.indexing hop, and that the indexing chain is set up on the container-indexing cluster:
$ vespa-route

There are 6 route(s):
    1. default
    2. default-get
    3. music
    4. music-direct
    5. music-index
    6. storage/cluster.music

There are 2 hop(s):
    1. container-indexing/chain.indexing
    2. indexing
$ curl -s http://localhost:8081 | python -m json.tool | grep -C 3 chain.indexing

        {
            "bundle": "container-disc:7.0.0",
            "class": "com.yahoo.messagebus.jdisc.MbusClient",
            "id": "chain.indexing@MbusClient",
            "serverBindings": []
        },
        {
--
            "class": "com.yahoo.docproc.jdisc.DocumentProcessingHandler",
            "id": "com.yahoo.docproc.jdisc.DocumentProcessingHandler",
            "serverBindings": [
                "mbus://*/chain.indexing"
            ]
        },
        {

Appendix: trace

Below is a trace example from feeding to indexed search, no selection string:

$ cat doc.json
[
{
    "put": "id:mynamespace:music::123",
    "fields": {
         "album": "Bad",
         "artist": "Michael Jackson",
         "title": "Bad",
         "year": 1987,
         "duration": 247
    }
}
]

$ vespa-feeder --trace 6 doc.json
<trace>
    [1564571762.403] Source session accepted a 4096 byte message. 1 message(s) now pending.
    [1564571762.420] Sequencer sending message with sequence id '-1163801147'.
    [1564571762.426] Recognized 'default' as route 'indexing'.
    [1564571762.429] Recognized 'indexing' as HopBlueprint(selector = { '[DocumentRouteSelector]' }, recipients = { 'music' }, ignoreResult = false).
    [1564571762.489] Running routing policy 'DocumentRouteSelector'.
    [1564571762.493] Component '[MessageType:music]' selected by policy 'DocumentRouteSelector'.
    [1564571762.493] Resolving '[MessageType:music]'.
    [1564571762.520] Running routing policy 'MessageType'.
    [1564571762.520] Component 'music-index' selected by policy 'MessageType'.
    [1564571762.520] Resolving 'music-index'.
    [1564571762.520] Recognized 'music-index' as route 'container/chain.indexing [Content:cluster=music]'.
    [1564571762.520] Recognized 'container/chain.indexing' as HopBlueprint(selector = { '[LoadBalancer:cluster=container;session=chain.indexing]' }, recipients = {  }, ignoreResult = false).
    [1564571762.526] Running routing policy 'LoadBalancer'.
    [1564571762.538] Component 'tcp/vespa-container:19101/chain.indexing' selected by policy 'LoadBalancer'.
    [1564571762.538] Resolving 'tcp/vespa-container:19101/chain.indexing [Content:cluster=music]'.
    [1564571762.580] Sending message (version 7.83.27) from client to 'tcp/vespa-container:19101/chain.indexing' with 179.853 seconds timeout.
    [1564571762.581] Message (type 100004) received at 'container/container.0' for session 'chain.indexing'.
    [1564571762.581] Message received by MbusServer.
    [1564571762.582] Request received by MbusClient.
    [1564571762.582] Running routing policy 'Content'.
    [1564571762.582] Selecting route
    [1564571762.582] No cluster state cached. Sending to random distributor.
    [1564571762.582] Too few nodes seen up in state. Sending totally random.
    [1564571762.582] Component 'tcp/vespa-container:19114/default' selected by policy 'Content'.
    [1564571762.582] Resolving 'tcp/vespa-container:19114/default'.
    [1564571762.586] Sending message (version 7.83.27) from 'container/container.0' to 'tcp/vespa-container:19114/default' with 179.995 seconds timeout.
    [1564571762.587181] Message (type 100004) received at 'storage/cluster.music/distributor/0' for session 'default'.
    [1564571762.587245] music/distributor/0 CommunicationManager: Received message from message bus
    [1564571762.587510] Communication manager: Sending Put(BucketId(0x2000000000000020), id:mynamespace:music::123, timestamp 1564571762000000, size 275)
    [1564571762.587529] Communication manager: Passing message to source session
    [1564571762.587547] Source session accepted a 1 byte message. 1 message(s) now pending.
    [1564571762.587681] Sending message (version 7.83.27) from 'storage/cluster.music/distributor/0' to 'storage/cluster.music/storage/0/default' with 180.00 seconds timeout.
    [1564571762.587960] Message (type 10) received at 'storage/cluster.music/storage/0' for session 'default'.
    [1564571762.588052] music/storage/0 CommunicationManager: Received message from message bus
    [1564571762.588263] PersistenceThread: Processing message in persistence layer
    [1564571762.588953] Communication manager: Sending PutReply(id:mynamespace:music::123, BucketId(0x2000000000000020), timestamp 1564571762000000)
    [1564571762.589023] Sending reply (version 7.83.27) from 'storage/cluster.music/storage/0'.
    [1564571762.589332] Reply (type 11) received at 'storage/cluster.music/distributor/0'.
    [1564571762.589448] Source session received reply. 0 message(s) now pending.
    [1564571762.589459] music/distributor/0Communication manager: Received reply from message bus
    [1564571762.589679] Communication manager: Sending PutReply(id:music:music::123, BucketId(0x0000000000000000), timestamp 1564571762000000)
    [1564571762.589807] Sending reply (version 7.83.27) from 'storage/cluster.music/distributor/0'.
    [1564571762.590] Reply (type 200004) received at 'container/container.0'.
    [1564571762.590] Routing policy 'Content' merging replies.
    [1564571762.590] Reply received by MbusClient.
    [1564571762.590] Sending reply from MbusServer.
    [1564571762.590] Sending reply (version 7.83.27) from 'container/container.0'.
    [1564571762.612] Reply (type 200004) received at client.
    [1564571762.613] Routing policy 'LoadBalancer' merging replies.
    [1564571762.613] Routing policy 'MessageType' merging replies.
    [1564571762.615] Routing policy 'DocumentRouteSelector' merging replies.
    [1564571762.622] Sequencer received reply with sequence id '-1163801147'.
    [1564571762.622] Source session received reply. 0 message(s) now pending.
</trace>

Messages sent to vespa (route default) :
----------------------------------------
PutDocument:	ok: 1 msgs/sec: 3.30 failed: 0 ignored: 0 latency(min, max, avg): 225, 225, 225