A search engine normally implements indexing structures like reverse indexes to reduce query latency. It does indexing up-front, so later matching and ranking is quick. It also normally keeps a copy of the original document for later retrieval / use in search summaries.
Simplified, the engine keeps the original data plus auxiliary data structures to reduce query latency. This induces both extra work (indexing) compared to only storing the raw data, and extra static resource usage (disk, memory) to keep these structures.
Streaming search is an alternative to indexed search. It is useful in cases where the document corpus is statically split into many subsets and all searches go to just one (or a few) of the small subsets. The canonical example is personal indexes, where a user only searches their own data. Read more on document identifier schemes to learn how to specify subsets.
In streaming mode, only the raw data of the documents is stored, in the document store. Only data structures for document IDs are kept in memory, not attributes. Queries are matched by streaming through the documents, similar to grep. This is too costly for a global search but works fine for searching small subsets of the data, and it means Vespa can avoid the overhead of maintaining indexes. Streaming mode is suitable when subsets are on average small compared to the entire corpus. Vespa also maintains low latency for the occasional large subset (say, users with huge amounts of data) by automatically sharding the data over many content nodes, which are searched in parallel.
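The idea of matching by streaming through one small subset can be illustrated with a toy model. This is a minimal sketch and not how Vespa is implemented: the `StreamingStore` class, its methods, and the whitespace tokenization are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


class StreamingStore:
    """Toy document store: raw documents grouped by user ID, with no index."""

    def __init__(self):
        # Hypothetical layout: user ID -> list of raw documents (dicts).
        self._docs_by_user = defaultdict(list)

    def put(self, userid, doc):
        self._docs_by_user[userid].append(doc)

    def search(self, userid, field, term):
        """Scan only the documents of one user, grep-style."""
        term = term.lower()
        hits = []
        for doc in self._docs_by_user.get(userid, []):
            # Assumption: naive lowercasing and whitespace tokenization.
            tokens = doc.get(field, "").lower().split()
            if term in tokens:
                hits.append(doc)
        return hits


store = StreamingStore()
store.put(1234, {"artist": "Coldplay"})
store.put(1234, {"artist": "Billie Eilish"})
store.put(5678, {"artist": "Coldplay"})
print(store.search(1234, "artist", "coldplay"))
```

The cost of a query is proportional to the size of the selected subset, not the whole corpus, which is why this approach only pays off when subsets are small.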
Streaming search uses the same implementation of most features in Vespa, including ranking, matching and grouping, and supports the same features. However, streaming search does not support stemming, and the indexing script is not evaluated when documents are fed: only index and summary are considered.
Streaming search supports a wider range of matching options (e.g., substring), which can be specified either at query time or at configuration time.
Streaming mode does not support Tensor fields, or tensor computations in ranking.
Summary:
These are the minimal steps to get started using streaming search, based on the sample apps:
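Set streaming mode for the document type in services.xml:

<content id="mycluster" version="1.0">
  <documents>
    <document type="music" mode="streaming" />
  </documents>
</content>

Feed documents using a document ID with a numeric user ID (here 1234):

$ curl -H "Content-Type:application/json" \
    --data-binary @feedfile.json \
    http://localhost:8080/document/v1/mynamespace/music/number/1234/1

Query, restricting the search to that user with streaming.userid:

http://localhost:8080/search/?yql=select * from sources * where artist contains "coldplay"&streaming.userid=1234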
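The feed and query URLs above contain characters that need encoding when sent from a program. The following is a minimal sketch of building them with the Python standard library; the helper names `document_v1_url` and `streaming_query_url` are hypothetical, and the host, namespace and document type are simply taken from the examples above.

```python
from urllib.parse import parse_qs, quote, urlencode, urlsplit

BASE_URL = "http://localhost:8080"  # assumption: local container, as in the examples above


def document_v1_url(namespace, doctype, userid, local_id, base_url=BASE_URL):
    """Build a /document/v1 URL for a document with a numeric user ID."""
    return (
        f"{base_url}/document/v1/{quote(namespace, safe='')}/{quote(doctype, safe='')}"
        f"/number/{int(userid)}/{quote(str(local_id), safe='')}"
    )


def streaming_query_url(yql, userid, base_url=BASE_URL):
    """Build a query URL restricted to one user's documents."""
    params = {"yql": yql, "streaming.userid": int(userid)}
    return f"{base_url}/search/?{urlencode(params)}"


feed_url = document_v1_url("mynamespace", "music", 1234, 1)
query_url = streaming_query_url(
    'select * from sources * where artist contains "coldplay"', 1234
)

# Decode the query string again to check that nothing was lost in encoding.
decoded = parse_qs(urlsplit(query_url).query)
print(feed_url)
print(decoded["streaming.userid"], decoded["yql"])
```

Using the same user ID in the document ID and in streaming.userid is what ties a query to the subset of documents fed for that user.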
Streaming mode searches the data in the document store. Changing between "index" and "streaming" (or "store-only") mode hence requires refeeding all documents. Use vespa-remove-index to drop documents on nodes before changing mode.
The next step is setting the correct match mode for fields. This example uses the default, tokenized string matching:

field artist type string {
    indexing: summary | index
}

To find "Coldplay" or "Billie Eilish":

select * from sources * where artist contains "coldplay"
select * from sources * where artist contains "billie"

Without changing the schema, one can do substring matching in tokens using annotations. Each of these matches "Coldplay":

select * from sources * where artist contains ({prefix:true}"col")
select * from sources * where artist contains ({substring:true}"old")
select * from sources * where artist contains ({suffix:true}"play")

Instead of annotating query terms, enable prefix matching as the default, and find that this query now also matches "Coldplay":

field artist type string {
    indexing: summary | index
    match: prefix
}

select * from sources * where artist contains "col"

To match a field exactly:

field artist type string {
    indexing: summary | index
    match: exact
}

select * from sources * where artist contains "billie eilish"
Observe that the full string field value is now required for match.
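The difference between the match modes above can be summarized with a small model. This is an illustrative sketch only: the function names are hypothetical, and the lowercasing and whitespace tokenization are simplifying assumptions, not a description of Vespa's actual linguistics processing.

```python
def token_matches(token, term, mode):
    """Compare one token against a query term under a given match mode."""
    token, term = token.lower(), term.lower()
    if mode == "prefix":
        return token.startswith(term)
    if mode == "substring":
        return term in token
    if mode == "suffix":
        return token.endswith(term)
    # Default tokenized matching: the whole token must equal the term.
    return token == term


def field_matches(value, term, mode="token"):
    """Return True if the field value matches the term under the mode."""
    if mode == "exact":
        # Exact matching compares the full field value, not single tokens.
        return value.lower() == term.lower()
    # Assumption: tokens are split on whitespace.
    return any(token_matches(token, term, mode) for token in value.split())


print(field_matches("Coldplay", "col", "prefix"))
print(field_matches("Billie Eilish", "billie", "exact"))
```

With this model, "billie" matches "Billie Eilish" under tokenized matching but not under exact matching, mirroring the queries above.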
To find the match configuration per field, inspect the vsmfields config. In the first example, the artist field uses default (i.e. tokenized string) matching, so arg1 is empty. The second example uses match: exact:

$ vespa-get-config -n vespa.config.search.vsm.vsmfields -i music/search/cluster.music.music | egrep 'name|arg1'
fieldspec[0].name "artist"
fieldspec[0].arg1 ""

$ vespa-get-config -n vespa.config.search.vsm.vsmfields -i music/search/cluster.music.music | egrep 'name|arg1'
fieldspec[0].name "artist"
fieldspec[0].arg1 "exact"
Use vespa-configproxy-cmd to find the value for the -i argument above.
Disk sizing for streaming search is:
$ du -sh $VESPA_HOME/var/db/vespa/search/cluster.mystream/n1/documents/doctype/0.ready/*
4.0K    attribute
216M    documentmetastore
4.0K    index
1.5G    summary
Run triggerFlush if documentmetastore is empty.
Both the document meta store and the document store scale linearly with the number of documents: the document meta store uses approximately 30 bytes per document, while the document store depends on document size. Hence, to estimate disk usage, feed X% of the corpus and extrapolate.
Two data structures are loaded into memory in a streaming search:
As a rule of thumb, assume 50 bytes memory usage per document for streaming search.
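The sizing rules above amount to simple linear arithmetic. The sketch below encodes them as helper functions; the function names are hypothetical, and the 30 and 50 bytes per document figures are the approximations quoted in the text, not measured values.

```python
def extrapolate_disk_bytes(sample_disk_bytes, sample_percent):
    """Scale disk usage measured on a sample (X% of the corpus) to 100%."""
    if not 0 < sample_percent <= 100:
        raise ValueError("sample_percent must be in (0, 100]")
    return sample_disk_bytes * 100 / sample_percent


def estimate_metastore_disk_bytes(num_documents, bytes_per_document=30):
    """Document meta store on disk, using the approximate figure from the text."""
    return num_documents * bytes_per_document


def estimate_memory_bytes(num_documents, bytes_per_document=50):
    """Memory usage, using the rule of thumb from the text."""
    return num_documents * bytes_per_document


# Example: 1.5 GB of document store measured after feeding 10% of the corpus.
print(extrapolate_disk_bytes(1_500_000_000, 10))
print(estimate_memory_bytes(100_000_000))
```

Because usage is linear in the number of documents, a small sample is normally enough, provided the sampled documents are representative in size.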
Streaming search is a visit operation. Parallelism is configured using persistence-threads:
<persistence-threads count='8'/>
<visitors thread-count='8'/>
Note: on Vespa Cloud, this is auto-generated based on the number of vCPUs set in resources. To increase node performance, increase the vCPU count as long as query latency decreases. At some point, the application will be IO bound.
For better control of memory usage, use direct IO for reads when the summary cache is enabled. This keeps the OS buffer cache smaller and makes performance more predictable. The summary cache will cache recent entries and increase performance for users or groups doing repeated accesses. The configuration below sets aside 1 GB for the summary cache:
<engine>
  <proton>
    <tuning>
      <searchnode>
        <summary>
          <io>
            <write>directio</write>
            <read>directio</read>
          </io>
          <store>
            <cache>
              <maxsize>1073741824</maxsize>
            </cache>
          </store>
        </summary>
      </searchnode>
    </tuning>
  </proton>
</engine>
Vespa has a concept of searchable and ready copies for indexed search. In short, indices are generated for replicas used in search - other replicas do not have the indices generated. This does not apply for streaming search, where the point is not having indices. When nodes stop, replicas transfer to the active database - for streaming, disable this by setting searchable copies to the same level as redundancy:
<content id="mycluster" version="1.0">
  <redundancy>2</redundancy>
  <engine>
    <proton>
      <searchable-copies>2</searchable-copies>
    </proton>
  </engine>
</content>
The effect of not setting the same number is higher load on nodes and so worse latency during state transitions (i.e. nodes going up and down).
When redundancy = searchable copies, all documents are found in the 0.ready database.
Grouping works for streaming search just as for indexed search. In streaming search, all documents matching the selection string are streamed. As the document data is already in memory during search, streaming search has one grouping extension: using where(true), documents that are not hits can also be grouped:
all( where(true) all(group(myfield) each(output(count()))) )
When using where(true), relevancy is not calculated for groups, as only matched hits have relevance.
Example queries (urldecoded). The first query, without where(true), and its result:

/search/?&streaming.selection=true&hits=0&yql=select * from sources * where a contains "a1" | all(group(a) each(output(count())))

{ "root": { "id": "toplevel", "relevance": 1, "fields": { "totalCount": 10 }, "coverage": { "coverage": 100, "documents": 28, "full": true, "nodes": 1, "results": 1, "resultsFull": 1 }, "children": [ { "id": "group:root:0", "relevance": 1, "continuation": { "this": "" }, "children": [ { "id": "grouplist:a", "relevance": 1, "label": "a", "children": [ { "id": "group:string:a1", "relevance": 123.4, "value": "a1", "fields": { "count()": 10 } } ] } ] } ] } }

The second query, with where(true), and its result:

/search/?&streaming.selection=true&hits=0&yql=select * from sources * where a contains "a1" | all(where(true) all(group(a) each(output(count()))))

{ "root": { "id": "toplevel", "relevance": 1, "fields": { "totalCount": 10 }, "coverage": { "coverage": 100, "documents": 28, "full": true, "nodes": 0, "results": 1, "resultsFull": 1 }, "children": [ { "id": "group:root:0", "relevance": 1, "continuation": { "this": "" }, "children": [ { "id": "grouplist:a", "relevance": 1, "label": "a", "children": [ { "id": "group:string:a1", "relevance": 0, "value": "a1", "fields": { "count()": 10 } }, { "id": "group:string:a2", "relevance": 0, "value": "a2", "fields": { "count()": 9 } }, { "id": "group:string:a3", "relevance": 0, "value": "a3", "fields": { "count()": 9 } } ] } ] } ] } }
Observe that where(true) includes groups for a2 and a3 even though these do not match the query.
This kind of grouping is useful when using grouping to evaluate the corpus in the selection string. One example is computing a checksum of all documents to validate correctness during search.
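The effect of where(true) can be mimicked in a few lines. The sketch below is a toy model, not Vespa's grouping engine: the `group_counts` function and the synthetic documents are hypothetical, constructed so that the counts agree with the example results above (28 streamed documents, 10 of which match).

```python
from collections import Counter


def group_counts(docs, field, matches, include_non_hits=False):
    """Count documents per group value.

    With include_non_hits=True, every streamed document is grouped,
    which models where(true). Otherwise only hits are grouped.
    """
    selected = docs if include_non_hits else [doc for doc in docs if matches(doc)]
    return dict(Counter(doc[field] for doc in selected))


# Synthetic corpus matching the example: 10 x a1, 9 x a2, 9 x a3.
docs = [{"a": "a1"}] * 10 + [{"a": "a2"}] * 9 + [{"a": "a3"}] * 9


def is_hit(doc):
    # Models the query term: a contains "a1".
    return doc["a"] == "a1"


print(group_counts(docs, "a", is_hit))
print(group_counts(docs, "a", is_hit, include_non_hits=True))
```

Grouping over all streamed documents is cheap here for the same reason as in streaming search: the documents are already being read, whether or not they match.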
docidnsspecific returns the docid without namespace. It applies only to streaming search:
all( group(docidnsspecific()) each(output(count())) )
Streaming search does not generate posting lists, and so the routing configuration is different too - indexed search:
$ vespa-route
There are 6 route(s):
    1. default
    2. default-get
    3. music
    4. music-direct
    5. music-index
    6. storage/cluster.music

There are 2 hop(s):
    1. container/chain.indexing
    2. indexing
Streaming search:
$ vespa-route
There are 4 route(s):
    1. default
    2. default-get
    3. music
    4. storage/cluster.music

There are 1 hop(s):
    1. indexing
Trace from feeding using indexed search:
[1564571762.403] Source session accepted a 4096 byte message. 1 message(s) now pending.
[1564571762.420] Sequencer sending message with sequence id '-1163801147'.
[1564571762.426] Recognized 'default' as route 'indexing'.
[1564571762.429] Recognized 'indexing' as HopBlueprint(selector = { '[DocumentRouteSelector]' }, recipients = { 'music' }, ignoreResult = false).
[1564571762.489] Running routing policy 'DocumentRouteSelector'.
[1564571762.493] Component '[MessageType:music]' selected by policy 'DocumentRouteSelector'.
[1564571762.493] Resolving '[MessageType:music]'.
[1564571762.520] Running routing policy 'MessageType'.
[1564571762.520] Component 'music-index' selected by policy 'MessageType'.
[1564571762.520] Resolving 'music-index'.
[1564571762.520] Recognized 'music-index' as route 'container/chain.indexing [Content:cluster=music]'.
[1564571762.520] Recognized 'container/chain.indexing' as HopBlueprint(selector = { '[LoadBalancer:cluster=container;session=chain.indexing]' }, recipients = { }, ignoreResult = false).
[1564571762.526] Running routing policy 'LoadBalancer'.
[1564571762.538] Component 'tcp/vespa-container:19101/chain.indexing' selected by policy 'LoadBalancer'.
[1564571762.538] Resolving 'tcp/vespa-container:19101/chain.indexing [Content:cluster=music]'.
[1564571762.580] Sending message (version 7.83.27) from client to 'tcp/vespa-container:19101/chain.indexing' with 179.853 seconds timeout.
[1564571762.581] Message (type 100004) received at 'container/container.0' for session 'chain.indexing'.
[1564571762.581] Message received by MbusServer.
[1564571762.582] Request received by MbusClient.
[1564571762.582] Running routing policy 'Content'.
[1564571762.582] Selecting route
Streaming search:
[1564578828.735] Source session accepted a 4096 byte message. 1 message(s) now pending.
[1564578828.752] Sequencer sending message with sequence id '-1163801147'.
[1564578828.759] Recognized 'default' as route 'indexing'.
[1564578828.763] Recognized 'indexing' as HopBlueprint(selector = { '[DocumentRouteSelector]' }, recipients = { 'music' }, ignoreResult = false).
[1564578828.810] Running routing policy 'DocumentRouteSelector'.
[1564578828.814] Component '[Content:cluster=music]' selected by policy 'DocumentRouteSelector'.
[1564578828.814] Resolving '[Content:cluster=music]'.
[1564578828.870] Running routing policy 'Content'.
[1564578828.870] Selecting route
Observe that the DocumentRouteSelector selects different routing policies.
The features in this section apply to streaming search only.
Alias | |
Values | An integer in decimal notation in the range [0, 2^64) |
Default | None |
Restricts streaming search to only stream through documents whose document IDs have the n=<number> modifier and whose userid part matches the supplied value. This can be used for grouping documents on a 64-bit integer.
Alias | |
Values | A string |
Default | None |
Restricts streaming search to only stream through documents whose document IDs have the g=<groupname> modifier and whose groupname part matches the supplied value. This can be used for grouping documents on a string.
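As an illustration of the n= and g= modifiers, the sketch below formats document IDs following the Vespa document identifier scheme (id:namespace:doctype:modifier:local-id). The helper names are hypothetical, and the range check simply mirrors the value range stated for the numeric user ID.

```python
def user_document_id(namespace, doctype, userid, local_id):
    """Document ID with the n=<number> modifier (64-bit unsigned user ID)."""
    if not 0 <= userid < 2**64:
        raise ValueError("userid must be in the range [0, 2^64)")
    return f"id:{namespace}:{doctype}:n={userid}:{local_id}"


def group_document_id(namespace, doctype, groupname, local_id):
    """Document ID with the g=<groupname> modifier (string group)."""
    if not groupname:
        raise ValueError("groupname must be a non-empty string")
    return f"id:{namespace}:{doctype}:g={groupname}:{local_id}"


print(user_document_id("mynamespace", "music", 1234, "1"))
print(group_document_id("mynamespace", "music", "mygroup", "1"))
```

A query restricted to user 1234 or to group "mygroup" would then only stream through documents whose IDs carry the corresponding modifier value.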
Alias | |
Values | A string |
Default | None |
Restricts streaming search using a document selection. This can be used for selecting a subset of documents based on an advanced expression.
Alias | |
Values | int |
Default | 1 (if ordering is set), or infinite |
If set, visit only this many buckets at a time. Combine with ordering to reduce visiting time for large users/groups.
Annotation | Default | Values | Description |
---|---|---|---|
substring | false | boolean | Do substring matching for this word if available in the index (search for "*word*"). Only supported for streaming search. |
A streaming search is a visit, by buckets, in sequence or (semi)parallel. Access to the document store is by local ID - LID. In proton, a bucket is just a property on the document's ID. As documents are added to the file pairs in insertion order, a scan of all documents in a bucket is hence a set of random file accesses, unless some kind of bucket localization is done. The set of document IDs for a given bucket is easily generated from memory structures and assumed to take little resources.
On match:
Property | Valid with | Description |
---|---|---|
substring | Streaming | Set default match mode to substring for the field. Only available in streaming search. As the data structures in streaming search support substring searches, one can always set substring matching in the query, without setting the field to substring default. Also see regular expressions. |
suffix | Streaming | Like substring (above). |
On struct-field:
Contained in field or struct-field.
Defines how this struct field (a subfield of a struct) should be stored,
indexed, searched, presented and how it should influence ranking.
The field in which this struct field is contained must be of
type struct or a collection of type struct.
Note that struct fields are supported differently in indexed search and streaming search.

struct-field [name] { [body] }

The body of a struct field is optional and may contain the following elements:
Name | Description | Supported in | Occurrence |
---|---|---|---|
indexing | The indexing statements used to create index structure additions from this field. For indexed search only attribute is supported, which makes the struct field a searchable in-memory attribute. For streaming search only index and summary are supported. | Indexed and streaming | Zero to one |
attribute | Specifies an attribute setting. | Indexed | Zero to many |
match | Set the matching type to use for this field. | Streaming | Zero to one |
query-command | Specifies a command which can be received by a plugin searcher in the Search Container. | Streaming | Zero to many |
struct-field | A subfield of a field of type struct. The struct must have been defined to contain this subfield in the struct definition. If you want the subfield to be handled differently from the rest of the struct, you may specify it within the body of the struct-field. | Streaming | Zero to many |
summary | Sets a summary setting of this field, set to dynamic to make a dynamic summary. | Streaming | Zero to many |
summary-to | Deprecated: Use document-summary instead. The list of document summary names this should be included in. | Streaming | Zero to one |
If this struct field is of type struct (i.e. a nested struct), only indexing, match and query-command may be specified.
searchable-copies does not apply to streaming search as this does not use index structures.
Ref parent/child: References and imported fields are not supported in streaming mode.
The bm25(field) rank feature is not supported when using streaming search.
In streaming search the second phase ranking is run on all hits. Therefore, put all the rank calculation in the first phase ranking expression and just skip second phase.
Refer to #1829 for the PR with schema reference changes for streaming search.