Streaming Search

A search engine normally implements indexing structures like reverse indexes to reduce query latency. It does indexing up-front, so later matching and ranking is quick. It also normally keeps a copy of the original document for later retrieval / use in search summaries.

Simplified, the engine keeps the original data plus auxiliary data structures to reduce query latency. This induces both extra work - indexing - as compared to only store the raw data, and extra static resource usage - disk, memory - to keep these structures.

Streaming search is an alternative to indexed search. It is useful in cases where the document corpus is statically split into many subsets and all searches go to just one (or a few) of the small subsets. The canonical example being personal indexes where a user only searches his own data. Read more on document identifier schemes to learn how to specify subsets.

In streaming mode, only the raw data of the documents is stored, in the document store. Only data structures for document IDs are in memory, not attributes. It matches documents to queries by streaming through them, similar to a grep. This is too costly for a global search but works fine for searching small subsets of the data. This means Vespa can avoid the overhead of maintaining indexes. Streaming mode is suitable when subsets are on average very small compared to the entire corpus. Vespa maintains low latency also for the occasional large subset (say, users with huge amounts of data) by automatically sharding the data over many content nodes, searched in parallel.

Streaming search uses the same implementation of most features in Vespa, including ranking, matching and grouping, and supports the same features. However, streaming search supports neither stemming, nor is the indexing script evaluated when documents are fed — only index and summary are considered; instead, it supports a wider range of term matching options, which can be specified either at query time or at configuration time. Summary:

  • Streaming search has low latency if the data searched per node is small. Total data volume can be huge as data searched is limited by a predicate
  • Streaming search is highly flexible as it does not create precomputed indexes, and hence supports more matching options
  • Streaming search uses less disk space and memory, and zero CPU for indexing. It uses more CPU for search
  • Streaming search does not have linguistic features like stemming and normalization, but does case-insensitive match

These are the minimal steps to get started using streaming search, based on the sample apps:

  1. Set indexing mode to streaming:
    <content id="mycluster" version="1.0">
      <documents>
        <document type="music" mode="streaming" />
    
  2. Document IDs must have a numeric id or string for the set of documents to search - numeric example:
    $ curl -H "Content-Type:application/json" \
      --data-binary @feedfile.json \
      http://localhost:8080/document/v1/mynamespace/music/docid/1
    
    $ curl -H "Content-Type:application/json" \
      --data-binary @feedfile.json \
      http://localhost:8080/document/v1/mynamespace/music/number/1234/1
    
  3. Specify the subset to search using the query request attribute streaming.groupname or streaming.userid. Example (URL decoded for readability):
    http://localhost:8080/search/?yql=select * from sources * where artist contains "coldplay";&streaming.userid=1234
    
Streaming mode is search in document store data. Changing between "index" and "streaming" (or "store-only") mode hence requires refeeding all documents. Use vespa-remove-index to drop documents on nodes before changing mode.

Match mode

Next step is setting correct match mode for fields - example using default string tokenized matching:

field artist type string {
    indexing: summary | index
}
To find "Coldplay" or "Billie Eilish":
select * from sources * where artist contains "coldplay";
select * from sources * where artist contains "billie";
Without changing schema, one can do substring matching in tokens using annotations - this matches "Coldplay":
select * from sources * where artist contains ([{"prefix":true}]"col");
select * from sources * where artist contains ([{"substring":true}]"old");
select * from sources * where artist contains ([{"suffix":true}]"play");
Instead of annotating query terms, enable prefix matching as default, and find that this query now also matches "Coldplay":
field artist type string {
    indexing: summary | index
    match   : prefix
}

select * from sources * where artist contains "col";
To match a field exactly:
field artist type string {
    indexing: summary | index
    match   : exact
}

select * from sources * where artist contains "billie eilish";
Observe that the full string field value is now required for match.

See the appendix for how to find match configuration per field.

Disk sizing

Disk sizing for streaming search is:

Example:
$ du -sh $VESPA_HOME/var/db/vespa/search/cluster.mystream/n1/documents/doctype/0.ready/*
  4.0K	attribute
  216M	documentmetastore
  4.0K	index
  1.5G	summary
Run triggerFlush if documentmetastore is empty.

Both scale linearly with number of documents - document meta store with approx 30 bytes per document, document store depending on document size. Hence, to estimate disk used, feed X% of corpus and extrapolate.

Memory sizing

Two data structures are loaded into memory in a streaming search:

This is the memory used to keep documents searchable - adding to this is the memory used per search. Applications with a low query rate can optimize for static memory use by presetting the document meta store ensuring it is never re-sized - example setting 5M documents per node:
<tuning>
  <searchnode>
    <resizing>
      <initialdocumentcount>5000000</initialdocumentcount>
Note: This is a hard limit if the node does not have memory to keep more than one attribute in memory!

Streaming search query tuning

Streaming search is a visit operation. Parallelism is configured using persistence-threads:

<persistence-threads count='8'/>
<visitors thread-count='8'/>
Note: on Vespa Cloud, this is auto-generated based on number of VCPUs set in resources. To increase node performance, increase VCPU as long as query latency decreases - at some point, the application will be IO bound.

Summary store: Direct IO and cache

For better control of memory usage, use direct IO for reads when summary cache is enabled - this makes the OS buffer cache size smaller and more predictable performance. The summary cache will cache recent entries and increase performance for users or groups which does repeated accesses. Below setting sets aside 1GB for summary cache.

<engine>
  <proton>
    <tuning>
      <searchnode>
        <summary>
          <io>
            <write>directio</write>
            <read>directio</read>
          </io>
          <store>
            <cache>
              <maxsize>1073741824</maxsize>
            </cache>

Searchable copies

Vespa has a concept of searchable and ready copies for indexed search. In short, indices are generated for replicas used in search - other replicas do not have the indices generated. This does not apply for streaming search, where the point is not having indices. When nodes stop, replicas transfer to the active database - for streaming, disable this by setting searchable copies to the same level as redundancy:

  <content id="mycluster" version="1.0">
    <redundancy>2</redundancy>
    <engine>
      <proton>
        <searchable-copies>2</searchable-copies>
The effect of not setting the same number is higher load on nodes and hence worse latency during state transitions (i.e. nodes going up and down).

When redundancy = searchable copies, all documents are found in the 0.ready database.

Grouping

Grouping works for streaming search just as indexed search. In streaming search, all documents matching the selection string are streamed. Streaming search hence has one grouping extension as document data is in memory already during search: Also group documents that are not hits when using where(true). See the grouping reference for details. Example queries (urldecoded) - the first query results to the left, without where(true):

/search/?&streaming.selection=true&hits=0&yql=select * from sources * where a contains "a1" |
  all(group(a) each(output(count())));

/search/?&streaming.selection=true&hits=0&yql=select * from sources * where a contains "a1" |
  all(where(true) all(group(a) each(output(count()))));
{
  "root": {
    "id": "toplevel",
    "relevance": 1,
    "fields": {
      "totalCount": 10
    },
    "coverage": {
      "coverage": 100,
      "documents": 28,
      "full": true,
      "nodes": 1,
      "results": 1,
      "resultsFull": 1
    },
    "children": [
      {
        "id": "group:root:0",
        "relevance": 1,
        "continuation": {
          "this": ""
        },
        "children": [
          {
            "id": "grouplist:a",
            "relevance": 1,
            "label": "a",
            "children": [
              {
                "id": "group:string:a1",
                "relevance": 123.4,
                "value": "a1",
                "fields": {
                  "count()": 10
                }
              }
            ]
          }
        ]
      }
    ]
  }
}
















{
  "root": {
    "id": "toplevel",
    "relevance": 1,
    "fields": {
      "totalCount": 10
    },
    "coverage": {
      "coverage": 100,
      "documents": 28,
      "full": true,
      "nodes": 0,
      "results": 1,
      "resultsFull": 1
    },
    "children": [
      {
        "id": "group:root:0",
        "relevance": 1,
        "continuation": {
          "this": ""
        },
        "children": [
          {
            "id": "grouplist:a",
            "relevance": 1,
            "label": "a",
            "children": [
              {
                "id": "group:string:a1",
                "relevance": 0,
                "value": "a1",
                "fields": {
                  "count()": 10
                }
              },
              {
                "id": "group:string:a2",
                "relevance": 0,
                "value": "a2",
                "fields": {
                  "count()": 9
                }
              },
              {
                "id": "group:string:a3",
                "relevance": 0,
                "value": "a3",
                "fields": {
                  "count()": 9
                }
              }
            ]
          }
        ]
      }
    ]
  }
}
Oberserve:
  • where(true) includes groups for a2 and a3 even though these do not match the query
  • there is no relevance score for groups when using where(true)
This kind of grouping is useful when using grouping to evaluate the corpus in the selection string. One example is computing a checksum of all documents to validate correctness during search.