In short, a visit operation iterates over all, or a subset of, buckets
and sends their documents to one or more targets.
The target is normally the visit client
(like vespa-visit),
but can be a set of targets that act as sinks for the documents -
see vespa-visit-target.
Typical visit use cases are not time sensitive, like data reprocessing
and document dumps for backup and cluster cloning -
cloning applications and data
is a good read for more details.
Note:
Due to the bucket iteration, visiting is normally a high-latency operation.
Even an empty content cluster is pre-partitioned into many data buckets to enable scaling.
Dumping just one document requires iteration over all data buckets,
unless location-specific selections are used.
To test visiting performance, use larger data sets; do not extrapolate from small ones.
Use the query API for low latency operations on small result sets.
Tools
vespa-visit is used to run a visit operation.
By default, vespa-visit retrieves the visited documents and emits them to stdout.
However, a visitor target can be specified,
turning the tool into a driver for reprocessing or migration.
It supports keeping a progress file on disk,
so the operation can be restarted from where it left off if it fails.
vespa-visit-target
is a tool to set up an endpoint for visiting data.
It binds to a socket or to a slobrok address,
which is specified as a target in the visit client.
If co-localization of documents has been used, and the selection specifies location criteria,
the visitor client may be able to identify one or a few buckets
that are guaranteed to include all documents that will match the expression.
In all other cases, visiting needs to visit all buckets in the cluster.
Note that streaming search typically relies on this behavior to ensure low latency,
but the visitor client will only be able to map visitors to a bucket subset
if the document selection is simple, i.e. maps to a location.
If streaming search is slow,
verify that the selection is actually simple enough
for the visitor client to map it to a small bucket set.
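For illustration, a location-restricted visit, like the id.user example later in this document,
only needs to touch the buckets for that user:
$ vespa-visit -s id.user=1234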
Fields
To select which fields to output, use a
fieldset.
See the examples below - a common use case is a comma-separated list of fields,
or the [document] / [all] shorthand.
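For example, a sketch assuming the --fieldset option of vespa-visit and a music document type:
$ vespa-visit --fieldset 'music:title,artist'
$ vespa-visit --fieldset 'music:[document]'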
Data export / import
Export data to stdout.
As a full data dump takes time / space / resources, refine as needed:
-p progress-file saves progress, allowing the visitor to resume at next startup.
Always remove the progress file to run the visiting operation from the start.
-s 'id.hash().abs() % 100 = 0' dumps 1% of the corpus - see selection.
Or see next section, routing output directly to another Vespa cluster.
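Combining the options above, a sampled, resumable export to file could look like this
(file names are illustrative):
$ vespa-visit -p visit-progress -s 'id.hash().abs() % 100 = 0' > docs-sample.json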
Data migration / synchronization
To migrate a set of documents from one cluster to another, use visiting -
the data is transferred directly from the source nodes to the targets
using a compact serialization format,
which is optimal for performance (data is not piped through the visit client).
Implement backups this way, or dump to file.
Search node recovery: Feed the documents directly to a search cluster.
Example, selecting documents of type music:
$ vespa-visit --selection music --datahandler indexing
This feeds from the source into the search cluster in the same application.
Note that a simultaneous feed can cause updates to be lost.
Include remove-entries
in the visit operation using --visitremove - this dumps the tombstones
of documents recently removed.
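For example, a sketch of a recovery run for the music type that also transfers the remove-entries:
$ vespa-visit --visitremove --selection music --datahandler indexing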
The content policy can be configured to
use a set of configuration servers from another cluster to configure
itself. This is specified with the config parameter. As
an example, the following route sends documents to the content cluster
mycluster, using a configuration server on
myconfigserver.mydomain.com:12345:
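[Content:config=tcp/myconfigserver.mydomain.com:12345;cluster=mycluster]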
The following examples illustrate how to copy all data from a source cluster to another cluster using vespa-visit:
# Copies all data in the local cluster, routing it to the remote mycluster
$ vespa-visit --datahandler '[Content:config=tcp/myconfigserver.mydomain.com:12345;cluster=mycluster]'
# Limit to 'music' documents only
$ vespa-visit --datahandler '[Content:config=tcp/myconfigserver.mydomain.com:19070;cluster=mycluster]' \
--selection music
# Limit to all documents for user '1234'
$ vespa-visit --datahandler '[Content:config=tcp/myconfigserver.mydomain.com:12345;cluster=mycluster]' \
--selection id.user=1234
Reprocessing
Reprocessing is used to solve these use cases:
To change the document type used in ways that will not be
backward compatible, define a new type, and reprocess to change all
the existing documents.
Document identifiers can be changed to a new scheme.
Search documents can be reindexed after indexing steps have been changed.
This example illustrates how one can identify a subset of the
documents in a content cluster, reprocess these, and write them back.
It is assumed that a Vespa cluster is set up, with data.
1. Set up a document reprocessing cluster
This example document processor:
deletes documents with an artist field whose value contains Metallica
uppercases title field values of all other documents
import com.yahoo.docproc.DocumentProcessor;
import com.yahoo.docproc.Processing;
import com.yahoo.document.Document;
import com.yahoo.document.DocumentOperation;
import com.yahoo.document.DocumentPut;

import java.util.Iterator;

/**
 * Example of a document processor that will modify and/or delete
 * documents in the context of a reprocessing use case.
 */
public class ReProcessor extends DocumentProcessor {

    private String deleteFieldName;
    private String deleteRegex;
    private String uppercaseFieldName;

    public ReProcessor() {
        deleteFieldName = "artist";
        deleteRegex = ".*Metallica.*";
        uppercaseFieldName = "title";
    }

    public Progress process(Processing processing) {
        Iterator<DocumentOperation> it = processing.getDocumentOperations().iterator();
        while (it.hasNext()) {
            DocumentOperation op = it.next();
            if (op instanceof DocumentPut) {
                Document doc = ((DocumentPut) op).getDocument();

                // Delete the current document if it matches:
                String deleteValue = (String) doc.getValue(deleteFieldName);
                if (deleteValue != null && deleteValue.matches(deleteRegex)) {
                    it.remove();
                    continue;
                }

                // Uppercase the other field:
                Object uppercaseValue = doc.getValue(uppercaseFieldName);
                if (uppercaseValue != null) {
                    doc.setValue(uppercaseFieldName, uppercaseValue.toString().toUpperCase());
                }
            }
        }
        return Progress.DONE;
    }
}
To compile this processor, see the Developer Guide.
For more information on document processing, refer
to Document processor Development.
After changing the Vespa setup, deploy to reload config:
$ vespa-deploy prepare music
$ vespa-deploy activate
Restart nodes as well to activate.
2. Select documents
Define a selection criterion for the documents to be reprocessed
(to reprocess all documents, skip this step).
For this example, assume all documents where the field year is greater than 1995.
The selection string music.year > 1995 does this.
3. Set route
The route selects where the visited documents are sent (set with --datahandler) - options:
default - Documents are sent to the default route.
indexing - Documents are sent to indexing.
<clustername>/chain.<chainname>:
Documents are sent to the document processing chain chainname running in
cluster clustername.
Assume you have a container cluster with id reprocessing
containing a docproc chain with id reprocessing-chain.
This example route sends documents from the content node,
into the document reprocessing chain, and ultimately, into indexing:
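A sketch of the corresponding visit command, combining the selection from step 2 with this route:
$ vespa-visit -v --selection 'music.year > 1995' \
  --datahandler 'reprocessing/chain.reprocessing-chain indexing'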
The '-v' option emits progress information on standard error.
Notes
With multiple search clusters with one document type each,
and one content cluster storing all document types,
there are a few adjustments that must be made.
A visitor on a content node reads documents from disk,
and sends batches of documents in messages over the network.
If a visitor is set to visit all document types,
one message may thus contain more than one document type.
Since indexing is done in a separate indexing cluster for each search cluster,
these indexing clusters can only handle documents of their own type.
In the case where all documents regardless of document type are visited,
some documents will thus end up in the wrong indexing cluster, and indexing will fail.
In this use case, the solution is simple -
use a selection string to reprocess one document type at a time.
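For example, with the music type and a hypothetical second type books,
run one reprocessing visit per type:
$ vespa-visit --selection music --datahandler indexing
$ vespa-visit --selection books --datahandler indexing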
Request handling
Client
If the selection criteria managed to map the visitor to a specific set of buckets,
these will be used when sending distributor visit requests.
If not, the visit client will iterate over the entire bucket space,
typically at the minimum split level required to decide the correct distributor.
Each distributor receives such requests and looks at which buckets it has
that are contained in the requested bucket.
If more than one, the distributor will only start a limited number of bucket visitors at the same time.
Once it has processed the first ones,
it will reply to the visitor client with the last bucket processed.
As all buckets have a natural order, the client can use the returned bucket as a progress counter.
Thus, after a distributor request has returned, the client knows one of the following:
All buckets contained within the bucket sent have been visited
All buckets contained within the bucket sent,
up to this specific bucket in the order, have been visited
No buckets existed that were contained within the requested bucket
The client can decide whether to visit in strict order,
allowing only one bucket to be pending at a time,
or whether to start visiting many buckets at a time, allowing greater throughput.
Distributor
The distributors receive visit requests from clients for a given bucket,
which may map to none, one or many buckets within the distributor.
It picks one or more of the first buckets in the order,
picks out one content node copy of each
and passes the request on to the content nodes.
Once all the content node requests have been responded to,
the distributor will reply to the client with the last bucket visited,
to be used as a progress counter.
Subsequent client requests to the distributor will have the progress counter set,
letting the distributor ignore all the buckets prior to that point in the ordering.
Bucket splitting and joining does not alter the ordering,
and thus does not affect visiting much as long as the buckets are consistently split.
If two buckets are joined, where the first one has already been visited,
a visit request has to be sent to the joined bucket.
The content layer uses the progress counter to avoid revisiting documents already processed in the bucket.
If the distributor only starts one bucket visitor at a time,
it can ensure the visitor order is kept.
Starting multiple buckets at a time may improve throughput and decrease latency,
but progress tracking will be less fine-grained,
so more documents may need to be revisited when resuming after a failure.
Content node
The content node receives visit requests for single buckets for which it stores documents.
It may receive many in parallel, but their execution is independent of each other.
The visitor layer in the content node picks up the visit requests.
Each request is assigned a visitor thread,
and an instance of the processor type is created for that visitor.
The processor will set up an iterator in the backend
and send one or more requests for documents to the backend.
The document selection specified in the visitor client is sent through to the backend,
allowing it to filter out unwanted data at the lowest level possible.
Once documents are retrieved from the backend and passed up to the visitor layer,
the visitor processor will process the data.
By default, one iterator request is pending to the backend at any time.
By sending many small iterator requests, having several pending at a time,
the processing may occur in parallel with the document fetching.
Visitor processor types
Processor Type
Description
Dump visitor
The most commonly used visitor processor type is the dump visitor.
All it does is send the read documents on to the external target
specified by the visitor client.
Using the command line tool vespa-visit
the default is to just send the documents back to the client,
and have them printed to stdout.
The dump visitor is also used to implement reprocessing,
typically using a message bus route
that sends the documents through the document processing cluster
and then back to the content cluster.
Migration of documents from one cluster to another
is also implemented using a dump visitor.
Streaming search visitor
The streaming search visitor runs in the Vespa container,
making it transparent whether search results were created from streaming or indexed search -
see indexing mode.
Visitor targets
Requests sent from the visitor processor are sent to a visitor target - types:
Target Type
Description
Message bus routes
You can specify a message bus route name directly,
and this route will be used to send the results.
This is typically used when doing reprocessing or migration.
Message bus routes are set up in the application package.
In addition, some routes may have been auto-generated in simple setups;
for instance, a route called default is generated
if your setup is simple enough for the config model
to guess where you want to send your data.
Slobrok address
You can also specify a slobrok address for data to be sent to.
A slobrok address is a slash-separated path,
where an asterisk can be used to match any element within the path.
For instance, if you have a docproc cluster called mydpcluster,
it will have registered its nodes with slobrok names like
docproc/cluster.mydpcluster/docproc/0/feed_processor,
where the 0 here indicates the first node in the cluster.
You can thus specify to send visit data to this docproc cluster
by stating a slobrok address of docproc/cluster.mydpcluster/docproc/*/feed_processor.
Note that this will not send all the data to one or all the nodes.
The data sent from the visitor will be distributed among the matching nodes,
but each message will just be sent to one node.
Slobrok names can be used when using
vespa-visit-target
to retrieve the data at some location.
If you start vespa-visit-target on two nodes,
listening to slobrok names mynode/0/visit-destination
and mynode/1/visit-destination,
you can send the results to these nodes by specifying
mynode/*/visit-destination as the data handler.
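The visit client side of such a setup could then be (a sketch):
$ vespa-visit --datahandler 'mynode/*/visit-destination'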
vespa-destination
is similar to vespa-visit-target in that it can receive messages
from messagebus and print the contents to stdout.
It can be useful in situations where you want to debug a route or a document processor,
by using vespa-destination as the endpoint of your route.
TCP socket
TCP sockets can also be specified directly.
This requires that the endpoint speaks FNET RPC.
This is typically done either by using the vespa-visit-target tool,
or by setting up a visitor destination programmatically using a utility class in the document API.
A socket address looks like the following:
tcp/hostname:port/servicename.
For instance, an address generated by vespa-visit-target
might look like: tcp/myhost.mydomain.com:12345/visit-destination
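For example, a sketch using this address format as the data handler:
$ vespa-visit --datahandler tcp/myhost.mydomain.com:12345/visit-destination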
Troubleshooting
Normally, all documents share the same bucket space -
documents for multiple schemas are co-located.
When using parent/child, global documents are stored in a separate bucket space.
Use the bucketSpace parameter
to visit the default or global space.
Missing this is a common cause of a lower-than-expected document count when dumping all documents.
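A sketch, assuming the bucket space is selected with the --bucketspace option of vespa-visit:
$ vespa-visit --bucketspace global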
A visit operation might stall or hang if the content cluster is in an inconsistent
state - that is, replicas are still merging between nodes.
Find more details at
visitinconsistentbuckets.