Streaming Search

A search engine normally builds index structures such as inverted indexes to reduce query latency. It indexes up front, so that later matching and ranking are quick. It also normally keeps a copy of the original document for later retrieval and for use in search summaries.

Simplified, the engine keeps the original data plus auxiliary data structures to reduce query latency. This induces both extra work - indexing - compared to storing only the raw data, and extra static resource usage - disk, memory - to keep these structures.

Streaming search is an alternative to indexed search. It is useful when the document corpus is statically split into many subsets and each search goes to just one (or a few) of the small subsets. The canonical example is personal indexes, where each user searches only their own data. Read more on document identifier schemes to learn how to specify subsets.

In streaming mode, only the raw data of the documents is stored, in the document store. Only data structures for document IDs are in memory, not attributes. It matches documents to queries by streaming through them, similar to a grep. This is too costly for a global search but works fine for searching small subsets of the data. This means Vespa can avoid the overhead of maintaining indexes. Streaming mode is suitable when subsets are on average small compared to the entire corpus. Vespa maintains low latency also for the occasional large subset (say, users with huge amounts of data) by automatically sharding the data over many content nodes, searched in parallel.
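The grep-like model described above can be illustrated with a toy sketch in Python. This is not Vespa's implementation: the DocumentStore class and its methods are hypothetical names invented for illustration. The store keeps only raw documents grouped by user id, with no inverted index, and answers a query by scanning one user's documents:

```python
from collections import defaultdict


class DocumentStore:
    """Toy store: raw documents only, grouped by user id (no inverted index)."""

    def __init__(self):
        self._docs_by_user = defaultdict(list)

    def put(self, user_id, doc):
        self._docs_by_user[user_id].append(doc)

    def stream_search(self, user_id, field, term):
        """Scan one user's documents and yield those with a matching token."""
        needle = term.lower()
        for doc in self._docs_by_user.get(user_id, []):
            # Case-insensitive token match, evaluated at query time.
            if needle in doc.get(field, "").lower().split():
                yield doc


store = DocumentStore()
store.put(1234, {"artist": "Coldplay", "title": "Yellow"})
store.put(1234, {"artist": "Billie Eilish", "title": "Bad Guy"})
store.put(5678, {"artist": "Coldplay", "title": "Clocks"})

hits = list(store.stream_search(1234, "artist", "coldplay"))
print(hits)  # [{'artist': 'Coldplay', 'title': 'Yellow'}]
```

The cost of a query is proportional to the size of the subset scanned, not to the size of the corpus, which is why this only works well when subsets are small.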

Streaming search uses the same implementation of most features in Vespa, including ranking, matching and grouping, and supports the same features. However, streaming search does not support stemming, and the indexing script is not evaluated when documents are fed - only index and summary are considered.

Streaming search supports a wider range of matching options (e.g., substring), which can be specified either at query time or at configuration time.

Streaming mode does not support Tensor fields, or tensor computations in ranking.

Summary:

• Streaming search has low latency if the data searched per node is small. Total data volume can be huge as data searched is limited by a predicate
• Streaming search is highly flexible as it does not create precomputed indexes, and hence supports more matching options
• Streaming search uses less disk space and memory, and zero CPU for indexing. It uses more CPU for search
• Streaming search does not have linguistic features like stemming and normalization, but does case-insensitive match

These are the minimal steps to get started using streaming search, based on the sample apps:

1. Set indexing mode to streaming:
<content id="mycluster" version="1.0">
    <documents>
        <document type="music" mode="streaming" />
    </documents>
</content>

2. Document IDs must have a numeric id or string for the set of documents to search - numeric example:
$ curl -H "Content-Type:application/json" \
    --data-binary @feedfile.json \
    http://localhost:8080/document/v1/mynamespace/music/number/1234/1

3. Specify the subset to search using the query request attribute streaming.groupname or streaming.userid. Example (URL decoded for readability):
http://localhost:8080/search/?yql=select * from sources * where artist contains "coldplay"&streaming.userid=1234

Streaming mode is search in document store data. Changing between "index" and "streaming" (or "store-only") mode hence requires refeeding all documents. Use vespa-remove-index to drop documents on nodes before changing mode.

Match mode

Next step is setting correct match mode for fields - example using default string tokenized matching:

field artist type string {
    indexing: summary | index
}

To find "Coldplay" or "Billie Eilish":

select * from sources * where artist contains "coldplay"
select * from sources * where artist contains "billie"

Without changing schema, one can do substring matching in tokens using annotations - this matches "Coldplay":

select * from sources * where artist contains ({prefix:true}"col")
select * from sources * where artist contains ({substring:true}"old")
select * from sources * where artist contains ({suffix:true}"play")

Instead of annotating query terms, enable prefix matching as default, and find that this query now also matches "Coldplay":

field artist type string {
    indexing: summary | index
    match   : prefix
}

select * from sources * where artist contains "col"

To match a field exactly:

field artist type string {
    indexing: summary | index
    match   : exact
}

select * from sources * where artist contains "billie eilish"

Observe that the full string field value is now required for match.

Find match configuration per field - the first example is using default (i.e. string tokenized) matching, the artist field has default matching, so arg1 is empty. The second example uses match: exact:

$ vespa-get-config -n vespa.config.search.vsm.vsmfields -i music/search/cluster.music.music | egrep 'name|arg1'
fieldspec[0].name "artist"
fieldspec[0].arg1 ""

$ vespa-get-config -n vespa.config.search.vsm.vsmfields -i music/search/cluster.music.music | egrep 'name|arg1'
fieldspec[0].name "artist"
fieldspec[0].arg1 "exact"


Use vespa-configproxy-cmd to find the value for the -i argument above.
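Steps 2 and 3 of the getting-started list can be wrapped in a small client-side helper. The sketch below only builds the URLs and sends nothing; the function names document_path and search_url are hypothetical, and the host, namespace and document type are the ones used in the examples above:

```python
from urllib.parse import quote, urlencode

BASE = "http://localhost:8080"  # endpoint used in the examples above


def document_path(namespace, doctype, user_id, local_id):
    """Document API path for a document in the numeric group `user_id`."""
    return (
        f"/document/v1/{quote(namespace)}/{quote(doctype)}"
        f"/number/{int(user_id)}/{quote(str(local_id))}"
    )


def search_url(yql, user_id):
    """Query URL restricted to one user's documents via streaming.userid."""
    params = {"yql": yql, "streaming.userid": int(user_id)}
    return f"{BASE}/search/?{urlencode(params)}"


feed_url = BASE + document_path("mynamespace", "music", 1234, 1)
query_url = search_url('select * from sources * where artist contains "coldplay"', 1234)
print(feed_url)   # http://localhost:8080/document/v1/mynamespace/music/number/1234/1
print(query_url)  # URL-encoded form of the query in step 3
```

Using the same numeric id in the feed path and in streaming.userid is what ties a query to the subset of documents it streams through.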

Disk sizing

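Disk usage for streaming search is dominated by two components: the document store (the summary directory below) and the document meta store (the documentmetastore directory).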

Example:
$ du -sh $VESPA_HOME/var/db/vespa/search/cluster.mystream/n1/documents/doctype/0.ready/*
4.0K	attribute
216M	documentmetastore
4.0K	index
1.5G	summary


Run triggerFlush if documentmetastore is empty.

Both scale linearly with number of documents - document meta store with approx 30 bytes per document, document store depending on document size. Hence, to estimate disk used, feed X% of corpus and extrapolate.
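The extrapolation can be written down as a few lines of Python. This is a back-of-the-envelope sketch, not a Vespa tool: the function names are hypothetical, the sample measurement is invented, and the 30 bytes per document figure is the approximation quoted above:

```python
def estimate_disk_bytes(sample_bytes, sample_fraction):
    """Linearly extrapolate disk use from a sample fed as a fraction of the corpus."""
    if not 0 < sample_fraction <= 1:
        raise ValueError("sample_fraction must be in (0, 1]")
    return sample_bytes / sample_fraction


def meta_store_bytes(num_documents, bytes_per_document=30):
    """Approximate document meta store size at ~30 bytes per document."""
    return num_documents * bytes_per_document


# Hypothetical measurement: feeding 25% of the corpus used 2 GiB on disk.
sample = 2 * 1024**3
print(estimate_disk_bytes(sample, 0.25) / 1024**3)  # 8.0 (GiB for the full corpus)
print(meta_store_bytes(10_000_000))                 # 300000000 (bytes)
```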

Memory sizing

Two data structures are loaded into memory in a streaming search:

As a rule of thumb, assume 50 bytes memory usage per document for streaming search.
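Applied to a node, the rule of thumb is a single multiplication. The sketch below uses a hypothetical function name and a hypothetical document count per node:

```python
def estimate_memory_bytes(num_documents, bytes_per_document=50):
    """Memory estimate for streaming search at ~50 bytes per document."""
    return num_documents * bytes_per_document


docs_per_node = 100_000_000  # hypothetical number of documents on one node
print(estimate_memory_bytes(docs_per_node) / 1e9)  # 5.0 (GB)
```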

Streaming search query tuning

Streaming search is a visit operation. Parallelism is configured using persistence-threads:

<persistence-threads count='8'/>


Note: on Vespa Cloud, this is auto-generated based on number of VCPUs set in resources. To increase node performance, increase VCPU as long as query latency decreases - at some point, the application will be IO bound.

Summary store: Direct IO and cache

For better control of memory usage, use direct IO for reads when the summary cache is enabled - this keeps the OS buffer cache smaller and makes performance more predictable. The summary cache holds recent entries and speeds up repeated accesses by the same users or groups. The configuration below sets aside 1 GB for the summary cache:

<engine>
    <proton>
        <tuning>
            <searchnode>
                <summary>
                    <io>
                        <read>directio</read>
                    </io>
                    <store>
                        <cache>
                            <maxsize>1073741824</maxsize>
                        </cache>
                    </store>
                </summary>
            </searchnode>
        </tuning>
    </proton>
</engine>


Searchable copies

Vespa has a concept of searchable and ready copies for indexed search. In short, indices are generated for replicas used in search - other replicas do not have the indices generated. This does not apply for streaming search, where the point is not having indices. When nodes stop, replicas transfer to the active database - for streaming, disable this by setting searchable copies to the same level as redundancy:

<content id="mycluster" version="1.0">
    <redundancy>2</redundancy>
    <engine>
        <proton>
            <searchable-copies>2</searchable-copies>
        </proton>
    </engine>
</content>


The effect of not setting the same number is higher load on nodes and so worse latency during state transitions (i.e. nodes going up and down).

When redundancy = searchable copies, all documents are found in the 0.ready database.

Grouping

Grouping works for streaming search just as for indexed search. In streaming search, all documents matching the selection string are streamed, so document data is already in memory during search. Streaming search hence has one grouping extension: with where(true), documents that are not hits are grouped too:

all( where(true) all(group(myfield) each(output(count()))) )

When using where(true), relevancy is not calculated for groups, as only matched hits have relevance.

Example queries (URL decoded) - the first query, without where(true), produces the first of the two results listed after the queries:

/search/?&streaming.selection=true&hits=0&yql=select * from sources * where a contains "a1" |
all(group(a) each(output(count())))

/search/?&streaming.selection=true&hits=0&yql=select * from sources * where a contains "a1" |
all(where(true) all(group(a) each(output(count()))))

{
    "root": {
        "id": "toplevel",
        "relevance": 1,
        "fields": { "totalCount": 10 },
        "coverage": { "coverage": 100, "documents": 28, "full": true, "nodes": 1, "results": 1, "resultsFull": 1 },
        "children": [
            {
                "id": "group:root:0",
                "relevance": 1,
                "continuation": { "this": "" },
                "children": [
                    {
                        "id": "grouplist:a",
                        "relevance": 1,
                        "label": "a",
                        "children": [
                            { "id": "group:string:a1", "relevance": 123.4, "value": "a1", "fields": { "count()": 10 } }
                        ]
                    }
                ]
            }
        ]
    }
}

{
    "root": {
        "id": "toplevel",
        "relevance": 1,
        "fields": { "totalCount": 10 },
        "coverage": { "coverage": 100, "documents": 28, "full": true, "nodes": 0, "results": 1, "resultsFull": 1 },
        "children": [
            {
                "id": "group:root:0",
                "relevance": 1,
                "continuation": { "this": "" },
                "children": [
                    {
                        "id": "grouplist:a",
                        "relevance": 1,
                        "label": "a",
                        "children": [
                            { "id": "group:string:a1", "relevance": 0, "value": "a1", "fields": { "count()": 10 } },
                            { "id": "group:string:a2", "relevance": 0, "value": "a2", "fields": { "count()": 9 } },
                            { "id": "group:string:a3", "relevance": 0, "value": "a3", "fields": { "count()": 9 } }
                        ]
                    }
                ]
            }
        ]
    }
}

Observe:

• where(true) includes groups for a2 and a3 even though these do not match the query
• there is no relevance score for groups when using where(true)

This kind of grouping is useful when using grouping to evaluate the corpus in the selection string. One example is computing a checksum of all documents to validate correctness during search.
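The difference between the two grouping requests can be mimicked in a few lines of Python. This is a toy model rather than Vespa code: the corpus is a hypothetical one shaped like the example (28 streamed documents with field a set to a1, a2 or a3), and matches and group_counts are names invented for the sketch:

```python
from collections import Counter

# Hypothetical corpus mirroring the example: 28 documents in the selection,
# with field "a" set to a1 (10 docs), a2 (9 docs) or a3 (9 docs).
streamed = [{"a": "a1"}] * 10 + [{"a": "a2"}] * 9 + [{"a": "a3"}] * 9


def matches(doc):
    """Stand-in for the query `a contains "a1"`."""
    return doc["a"] == "a1"


def group_counts(docs, include_non_hits):
    """Count documents per value of `a`, optionally including non-hits."""
    selected = docs if include_non_hits else [d for d in docs if matches(d)]
    return dict(Counter(d["a"] for d in selected))


print(group_counts(streamed, include_non_hits=False))  # {'a1': 10}
print(group_counts(streamed, include_non_hits=True))   # {'a1': 10, 'a2': 9, 'a3': 9}
```

With include_non_hits set, the counts cover every streamed document, which is the behavior where(true) adds.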

docidnsspecific

docidnsspecific returns the docid without namespace. Applies only to streaming search:

all( group(docidnsspecific()) each(output(count())) )

Routing

Streaming search does not generate posting lists, and so the routing configuration is different too - indexed search:

$ vespa-route
There are 6 route(s):
1. default
2. default-get
3. music
4. music-direct
5. music-index
6. storage/cluster.music

There are 2 hop(s):
1. container/chain.indexing
2. indexing

Streaming search:

$ vespa-route
There are 4 route(s):
1. default
2. default-get
3. music
4. storage/cluster.music

There are 1 hop(s):
1. indexing


Trace from feeding using indexed search:

[1564571762.403] Source session accepted a 4096 byte message. 1 message(s) now pending.
[1564571762.420] Sequencer sending message with sequence id '-1163801147'.
[1564571762.426] Recognized 'default' as route 'indexing'.
[1564571762.429] Recognized 'indexing' as HopBlueprint(selector = { '[DocumentRouteSelector]' }, recipients = { 'music' }, ignoreResult = false).
[1564571762.489] Running routing policy 'DocumentRouteSelector'.
[1564571762.493] Component '[MessageType:music]' selected by policy 'DocumentRouteSelector'.
[1564571762.493] Resolving '[MessageType:music]'.
[1564571762.520] Running routing policy 'MessageType'.
[1564571762.520] Component 'music-index' selected by policy 'MessageType'.
[1564571762.520] Resolving 'music-index'.
[1564571762.520] Recognized 'music-index' as route 'container/chain.indexing [Content:cluster=music]'.
[1564571762.520] Recognized 'container/chain.indexing' as HopBlueprint(selector = { '[LoadBalancer:cluster=container;session=chain.indexing]' }, recipients = {  }, ignoreResult = false).
[1564571762.538] Component 'tcp/vespa-container:19101/chain.indexing' selected by policy 'LoadBalancer'.
[1564571762.538] Resolving 'tcp/vespa-container:19101/chain.indexing [Content:cluster=music]'.
[1564571762.580] Sending message (version 7.83.27) from client to 'tcp/vespa-container:19101/chain.indexing' with 179.853 seconds timeout.
[1564571762.581] Message (type 100004) received at 'container/container.0' for session 'chain.indexing'.
[1564571762.582] Running routing policy 'Content'.
[1564571762.582] Selecting route


Streaming search:

[1564578828.735] Source session accepted a 4096 byte message. 1 message(s) now pending.
[1564578828.752] Sequencer sending message with sequence id '-1163801147'.
[1564578828.759] Recognized 'default' as route 'indexing'.
[1564578828.763] Recognized 'indexing' as HopBlueprint(selector = { '[DocumentRouteSelector]' }, recipients = { 'music' }, ignoreResult = false).
[1564578828.810] Running routing policy 'DocumentRouteSelector'.
[1564578828.814] Component '[Content:cluster=music]' selected by policy 'DocumentRouteSelector'.
[1564578828.814] Resolving '[Content:cluster=music]'.
[1564578828.870] Running routing policy 'Content'.
[1564578828.870] Selecting route


Observe that the DocumentRouteSelector selects different routing policies.

Linguistics

• Tokenization is not used in streaming search.
• Normalization is not used in streaming search.
• Stemming is not applicable to streaming search.

Query API reference

The features in this section apply to streaming search only.

streaming.userid

Alias: (none)
Values: An integer in decimal notation in the range [0, 2^64)
Default: None

Restricts streaming search to only stream through documents whose document IDs have the n=<number> modifier and whose userid part matches the supplied value. This can be used for grouping documents on a 64-bit integer.

streaming.groupname

Alias: (none)
Values: A string
Default: None

Restricts streaming search to only stream through documents whose document IDs have the g=<groupname> modifier and whose groupname part matches the supplied value. This can be used for grouping documents on a string.
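How the n= and g= modifiers select a subset can be sketched by parsing document IDs in Python. This is an illustration only: it assumes IDs of the form id:<namespace>:<doctype>:<modifier>:<user-specific>, with group names that contain no colon, and DocId, parse_doc_id and in_subset are hypothetical names:

```python
from typing import NamedTuple, Optional


class DocId(NamedTuple):
    namespace: str
    doctype: str
    number: Optional[int]
    group: Optional[str]
    user_specific: str


def parse_doc_id(doc_id: str) -> DocId:
    """Split 'id:<namespace>:<doctype>:<n=..|g=..|empty>:<user-specific>'."""
    scheme, namespace, doctype, modifier, user_specific = doc_id.split(":", 4)
    if scheme != "id":
        raise ValueError(f"unsupported document id: {doc_id!r}")
    number = int(modifier[2:]) if modifier.startswith("n=") else None
    group = modifier[2:] if modifier.startswith("g=") else None
    return DocId(namespace, doctype, number, group, user_specific)


def in_subset(doc_id, userid=None, groupname=None):
    """True if the document belongs to the subset named by userid or groupname."""
    parsed = parse_doc_id(doc_id)
    if userid is not None:
        return parsed.number == userid
    if groupname is not None:
        return parsed.group == groupname
    return False


print(in_subset("id:mynamespace:music:n=1234:1", userid=1234))        # True
print(in_subset("id:mynamespace:music:g=jazz:42", groupname="rock"))  # False
```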

streaming.selection

Alias: (none)
Values: A string
Default: None

Restricts streaming search using a document selection. This can be used for selecting a subset of documents based on an advanced expression.

streaming.maxbucketspervisitor

Alias: (none)
Values: int
Default: 1 if ordering is set, otherwise infinite

If set, visit only this many buckets at a time. Combine with ordering to reduce visiting time for large users/groups.

Query language reference

Annotation: substring
Default: false
Values: boolean
Description: Do substring matching for this word if available in the index (search for "*word*"). Only supported for streaming search.

Visiting the document store

A streaming search is a visit, by buckets, in sequence or (semi)parallel. Access to the document store is by local ID - LID. In proton, a bucket is just a property on the document's ID. As documents are added to the file pairs in insertion order, a scan of all documents in a bucket is hence a set of random file accesses, unless some kind of bucket localization is done. The set of document IDs for a given bucket is easily generated from memory structures and assumed to take little resources.
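The access pattern can be illustrated with a toy model in Python. It is a sketch under simplifying assumptions, not proton's data layout: the store is a plain list whose positions play the role of LIDs, and the feed order, bucket names and visit_bucket function are hypothetical:

```python
from collections import defaultdict

# Hypothetical feed order: (bucket, document) pairs interleaved across buckets.
feed = [("b1", "d0"), ("b2", "d1"), ("b1", "d2"),
        ("b3", "d3"), ("b2", "d4"), ("b1", "d5")]

store = []                          # position in this list is the LID
lids_by_bucket = defaultdict(list)  # in-memory structure: bucket -> LIDs

for bucket, doc in feed:
    lids_by_bucket[bucket].append(len(store))  # next free LID
    store.append(doc)                          # stored in insertion order


def visit_bucket(bucket):
    """Read all documents in a bucket; each LID is a separate access."""
    return [store[lid] for lid in lids_by_bucket.get(bucket, [])]


print(lids_by_bucket["b1"])  # [0, 2, 5]
print(visit_bucket("b1"))    # ['d0', 'd2', 'd5']
```

The LIDs of one bucket are scattered across the store, which is why a bucket scan turns into random accesses unless documents are localized by bucket.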

Schema

On match:

Property: substring
Valid with: Streaming
Description: Set default match mode to substring for the field. Only available in streaming search. As the data structures in streaming search support substring searches, one can always set substring matching in the query, without setting the field to substring default. Also see regular expressions.

Property: suffix
Valid with: Streaming
Description: Like substring (above).

On struct-field:

Contained in field or struct-field. Defines how this struct field (a subfield of a struct) should be stored, indexed, searched, presented and how it should influence ranking. The field in which this struct field is contained must be of type struct or a collection of type struct. Note that struct fields are supported differently in indexed search and streaming search:

struct-field [name] {
    [body]
}

The body of a struct field is optional and may contain the following elements:
Name: indexing
Description: The indexing statements used to create index structure additions from this field. For indexed search only attribute is supported, which makes the struct field a searchable in-memory attribute. For streaming search only index and summary are supported.
Supported in: Indexed and streaming
Occurrence: Zero to one

Name: attribute
Description: Specifies an attribute setting.
Supported in: Indexed
Occurrence: Zero to many

Name: match
Description: Set the matching type to use for this field.
Supported in: Streaming
Occurrence: Zero to one

Name: query-command
Description: Specifies a command which can be received by a plugin searcher in the Search Container.
Supported in: Streaming
Occurrence: Zero to many

Name: struct-field
Description: A subfield of a field of type struct. The struct must have been defined to contain this subfield in the struct definition. If you want the subfield to be handled differently from the rest of the struct, you may specify it within the body of the struct-field.
Supported in: Streaming
Occurrence: Zero to many

Name: summary
Description: Sets a summary setting of this field, set to dynamic to make a dynamic summary.
Supported in: Streaming
Occurrence: Zero to many

Name: summary-to
Description: The list of document summary names this should be included in.
Supported in: Streaming
Occurrence: Zero to one

If this struct field is of type struct (i.e. a nested struct), only indexing, match and query-command may be specified.

Notes

searchable-copies does not apply to streaming search as this does not use index structures.

Ref parent/child: References and imported fields are not supported in streaming mode.

The bm25(field) rank feature is not supported when using streaming search.

In streaming search the second phase ranking is run on all hits. Therefore, put all the rank calculation in the first phase ranking expression and just skip second phase.

Refer to #1829 for the PR with schema reference changes for streaming search.