Document API

This document introduces how to build and compile Vespa clients using the Document API, a high-level interface that gives access to data in Vespa content clusters. Use it to feed, update, retrieve and remove documents. See also the Java reference.

Documents

All data fed, indexed and searched in Vespa are instances of the Document class. A document is a composite object that consists of:

  • A DocumentType that defines the set of fields that can exist in a document. A document can only have a single document type, but document types can inherit the content of another. All fields of an inherited type are available in all its descendants. The document type definition is derived from the search definition, which is converted into a configuration file to be read by the DocumentManager.

    All registered document types are instantiated and stored within the document manager. A reference to these objects can be retrieved using the getDocumentType() method by supplying the name and the version of the wanted document type.

    DocumentManager initialization is done automatically by the Document API by subscribing to the appropriate configuration.

  • A DocumentId which is a unique document identifier. The document distribution uses the document identifier, see the reference for details.

  • A set of (Field, FieldValue) pairs, or "fields" for short. The Field class has methods for getting its name, data type and internal identifier. The field object for a given field name can be retrieved using the getField(<fieldname>) method in the DocumentType.

    For most data types, simply assign a value object directly to a document. For the complex data type DataType.WEIGHTEDSET there are special classes that must be used to wrap the data.

To construct a document one must first retrieve the document type from the document manager, construct a unique identifier, and then pass both of those objects to the constructor of the Document class:

DocumentAccess access = DocumentAccess.createDefault();
DocumentType type = access.getDocumentTypeManager().getDocumentType("music");
DocumentId id = new DocumentId("id:music:music::0");
Document document = new Document(type, id);

To get and set the value of a field in a document, use the getFieldValue() and setFieldValue() methods:

document.setFieldValue(document.getType().getField("myIntField"), 100);
FieldValue value = document.getFieldValue(document.getType().getField("myIntField"));
int val = ((IntegerFieldValue) value).getInteger();

DataType.WEIGHTEDSET

This type is used to hold a set of other field values of any one given data type with an associated integer weight. This is implemented as a generic so that any other data type can be contained in it:

WeightedSetDataType dataType =
        (WeightedSetDataType) document.getType().getField("myweightedset").getDataType();
WeightedSet<String> val = new WeightedSet<String>(dataType);
val.put("foo", 100);
val.put("bar", 101);
assert (val.get("foo") == 100);

Document updates

A document update is a request to modify a document. The update contains a document id to identify which document to update, a list of field updates, and a list of field path updates.

Field updates

Each field update contains a field id to identify which field of the document to update (this id is retrieved from the document type using getField(<name>).getId()), and a list of value updates to perform. A value update can be any of the following:

AddValueUpdate Adds its content to the target field value.
ArithmeticValueUpdate

An arithmetic value update is an update to a numerical field value. This can also be used in combination with the MapValueUpdate below to modify weights in weighted sets. This update has an operator (available by the getOperator() and setOperator() methods) and a numerical operand (available by the getOperand() and setOperand() methods).

Note: Vespa provides at-least-once semantics. Because arithmetic updates are not idempotent, anything that causes a message to be re-sent can leave the updated document field with a value that differs from the expected one.

AssignValueUpdate Assigns its content to the target field value.
ClearValueUpdate Clears the target field value.
MapValueUpdate

Since a value update modifies only a single field value, it must be able to indicate which value to modify in complex field data types such as weighted sets. This is the purpose of the MapValueUpdate class. It contains a value target (available by the setValue() and getValue() methods), which is the identifier of the value to update, and a value update (available by the getUpdate() and setUpdate() methods), which is the update to perform.

For simplicity, the FieldUpdate class contains factory methods to create the most common maps. For example, to assign 100 as the weight of the “foo” key in the document field “myStrWSet” (a weighted set of strings), the field update would be created as follows:

DocumentAccess access = DocumentAccess.createDefault();
DocumentType type = access.getDocumentTypeManager().getDocumentType("music");
DocumentId id = new DocumentId("id:music:music::0");
DocumentUpdate upd = new DocumentUpdate(type, id);

upd.addFieldUpdate(FieldUpdate.createMap(type.getField("myStrWSet"), "foo", new AssignValueUpdate(100)));

RemoveValueUpdate Removes the target field value.
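
The at-least-once note under ArithmeticValueUpdate can be illustrated without any Vespa dependency. The following is a minimal plain-Java sketch, not Vespa code: the class RedeliveryDemo and the method applyAll are hypothetical names, and a HashMap stands in for the weights of a weighted set. It only shows why a re-sent assign is harmless while a re-sent arithmetic update is not.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.IntUnaryOperator;

/** Plain-Java model of weighted set weights; not part of the Vespa Document API. */
final class RedeliveryDemo {

    /** Applies each delivered update, in order, to the weight stored under the given key. */
    static int applyAll(int initialWeight, String key, List<IntUnaryOperator> deliveries) {
        Map<String, Integer> weights = new HashMap<>();
        weights.put(key, initialWeight);
        for (IntUnaryOperator update : deliveries) {
            weights.computeIfPresent(key, (k, w) -> update.applyAsInt(w));
        }
        return weights.get(key);
    }

    public static void main(String[] args) {
        IntUnaryOperator assign100 = w -> 100;  // models assigning 100 as the weight
        IntUnaryOperator addOne = w -> w + 1;   // models an arithmetic "add 1" on the weight

        // Each list models one message that was delivered twice because it was re-sent.
        System.out.println(applyAll(5, "foo", List.of(assign100, assign100))); // prints 100
        System.out.println(applyAll(5, "foo", List.of(addOne, addOne)));       // prints 7
    }
}
```

With a single delivery the increment would have produced 6. Where the exact final value matters, an assign of a value computed by the client may be the safer choice.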

Field Path Updates

Field path updates are used to simplify updates to complex data structures, using maps, structs, arrays, and so on. A field path update uses a field path to designate what part of the document is to be changed. This can be combined with a where clause, which is a document selection expression. The where clause decides whether the document is to be changed at all, and also sets variables to use in the field path when deciding which parts of the document to change. There are three different field path updates:

AssignFieldPathUpdate Modify a value in any part of the document, or add values to maps or weighted sets
AddFieldPathUpdate Add values to arrays
RemoveFieldPathUpdate Remove values from the document

Update reply semantics

Sending an update for which the system cannot find a corresponding document is not considered an error. Such updates return a successful status code (assuming that no actual error occurred during update processing). To find out whether the update was actually applied to a document, use the boolean UpdateDocumentReply.wasFound() method.

If the update returns with an error reply, the update may or may not have been applied, depending on where in the platform stack the error occurred.
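
The three cases described above can be summarized as a small decision function. This is a sketch in plain Java under stated assumptions: UpdateOutcomes, Outcome and classify are hypothetical names, and the two booleans stand for the reply's status and for the value that UpdateDocumentReply.wasFound() would report. It does not call the Document API.

```java
/** Hypothetical helper that models update reply handling; not a Vespa class. */
final class UpdateOutcomes {

    enum Outcome { APPLIED, NO_SUCH_DOCUMENT, UNKNOWN }

    /**
     * @param success  whether the reply carried a successful status code
     * @param wasFound the value that UpdateDocumentReply.wasFound() would report
     */
    static Outcome classify(boolean success, boolean wasFound) {
        if (!success) {
            // An error reply does not say whether the update was applied.
            return Outcome.UNKNOWN;
        }
        return wasFound ? Outcome.APPLIED : Outcome.NO_SUCH_DOCUMENT;
    }

    public static void main(String[] args) {
        System.out.println(classify(true, true));   // prints APPLIED
        System.out.println(classify(true, false));  // prints NO_SUCH_DOCUMENT
        System.out.println(classify(false, false)); // prints UNKNOWN
    }
}
```

A client that treats UNKNOWN as a reason to retry should keep in mind that the first attempt may already have been applied.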

Document Access

The starting point for passing documents and updates to Vespa is the DocumentAccess class. This is a singleton (see get() method) session factory (see createXSession() methods) that provides three distinct access types:

  • Synchronous random access: provided by the class SyncSession. Suitable for low-throughput proof-of-concept applications.
  • Asynchronous random access: provided by the class AsyncSession. It allows for document repository writes and random access with high throughput.
  • Visiting: provided by the class VisitorSession. Allows a set of documents to be accessed in order decided by the document repository, which gives higher read throughput than random access.

AsyncSession

This class represents a session for asynchronous access to a document repository. It is created by calling myDocumentAccess.createAsyncSession(myAsyncSessionParams), and provides document repository writes and random access with high throughput. The usage pattern for an asynchronous session is as follows:

  1. put(), update(), get() or remove() is invoked on the session, and it returns a synchronous Result object that indicates whether or not the request was successful. The Result object also contains a request identifier.
  2. The client polls the session for a Response through its getNext() method. Any operation accepted by an asynchronous session will produce exactly one response within the configured timeout.
  3. Once a response is available, it is matched to the request by inspecting the response's request identifier. The response may also contain data, either a retrieved document or a failed document put or update that needs to be handled.

Example:

import com.yahoo.document.*;
import com.yahoo.documentapi.*;

public class MyClient {

    private final DocumentAccess access = DocumentAccess.createDefault();
    private final AsyncSession session = access.createAsyncSession(new AsyncParameters());
    private boolean abort = false;
    private int numPending = 0;

    /**
     * Implements application entry point.
     *
     * @param args Command line arguments.
     */
    public static void main(String[] args) {
        MyClient app = null;
        try {
            app = new MyClient();
            app.run();
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            if (app != null) {
                app.shutdown();
            }
        }
        if (app == null || app.abort) {
            System.exit(1);
        }
    }

    /**
     * This is the main entry point of the client. This method will not return until all available documents
     * have been fed and their responses have been returned, or something signaled an abort.
     */
    public void run() {
        System.out.println("client started");
        while (!abort) {
            flushResponseQueue();

            Document doc = getNextDocument();
            if (doc == null) {
                System.out.println("no more documents to put");
                break;
            }
            System.out.println("sending doc " + doc);

            while (!abort) {
                Result res = session.put(doc);
                if (res.isSuccess()) {
                    System.out.println("put has request id " + res.getRequestId());
                    ++numPending;
                    break; // step to next doc.
                } else if (res.getType() == Result.ResultType.TRANSIENT_ERROR) {
                    System.out.println("send queue full, waiting for some response");
                    processNext(9999);
                } else {
                    res.getError().printStackTrace();
                    abort = true; // this is a fatal error
                }
            }
        }
        if (!abort) {
            waitForPending();
        }
        System.out.println("client stopped");
    }

    /**
     * Shutdown the underlying api objects.
     */
    public void shutdown() {
        System.out.println("shutting down document api");
        session.destroy();
        access.shutdown();
    }

    /**
     * Returns the next document to feed to Vespa. This method should only return null when the end of the
     * document stream has been reached, as returning null terminates the client. This is the point at which
     * your application logic should block if it knows more documents will eventually become available.
     *
     * @return The next document to put, or null to terminate.
     */
    public Document getNextDocument() {
        return null; // TODO: Implement at your discretion.
    }

    /**
     * Processes all immediately available responses.
     */
    void flushResponseQueue() {
        System.out.println("flushing response queue");
        while (processNext(0)) {
            // empty
        }
    }

    /**
     * Wait indefinitely for the responses of all sent operations to return. This method will only return
     * early if the abort flag is set.
     */
    void waitForPending() {
        while (numPending != 0) {
            if (abort) {
                System.out.println("waiting aborted, " + numPending + " still pending");
                break;
            }
            System.out.println("waiting for " + numPending + " responses");
            processNext(9999);
        }
    }

    /**
     * Retrieves and processes the next response available from the underlying asynchronous session. If no
     * response becomes available within the given timeout, this method returns false.
     *
     * @param timeout The maximum number of seconds to wait for a response.
     * @return True if a response was processed, false otherwise.
     */
    boolean processNext(int timeout) {
        Response res;
        try {
            res = session.getNext(timeout);
        } catch (InterruptedException e) {
            e.printStackTrace();
            abort = true;
            return false;
        }
        if (res == null) {
            return false;
        }
        System.out.println("got response for request id " + res.getRequestId());
        --numPending;
        if (!res.isSuccess()) {
            System.err.println(res.getTextMessage());
            abort = true;
            return false;
        }
        return true;
    }
}
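
The client above only counts pending operations. Step 3 of the usage pattern, matching each response to its request by request identifier, could be sketched as follows. This is a stand-alone model under stated assumptions, not Vespa code: PendingTracker and FakeResponse are hypothetical types, and a LinkedBlockingQueue stands in for the session's response queue.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

/** Stand-in model of request/response matching; none of these types are Vespa classes. */
final class PendingTracker {

    /** Minimal stand-in for a response: the request id it answers and whether it succeeded. */
    static final class FakeResponse {
        final long requestId;
        final boolean success;

        FakeResponse(long requestId, boolean success) {
            this.requestId = requestId;
            this.success = success;
        }
    }

    private final Map<Long, String> pending = new HashMap<>(); // request id -> document id
    private final List<String> failed = new ArrayList<>();
    private final BlockingQueue<FakeResponse> responses = new LinkedBlockingQueue<>();

    /** Step 1: remember which document an accepted request belongs to. */
    void sent(long requestId, String documentId) {
        pending.put(requestId, documentId);
    }

    /** Simulates the session making a response available. */
    void deliver(long requestId, boolean success) {
        responses.add(new FakeResponse(requestId, success));
    }

    /** Steps 2 and 3: poll for one response and match it to its request. */
    boolean processNext(long timeoutMillis) {
        FakeResponse res;
        try {
            res = responses.poll(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            return false;
        }
        if (res == null) {
            return false; // nothing arrived within the timeout
        }
        String documentId = pending.remove(res.requestId);
        if (documentId != null && !res.success) {
            failed.add(documentId); // candidate for retry or error reporting
        }
        return true;
    }

    int numPending() {
        return pending.size();
    }

    List<String> failedDocuments() {
        return failed;
    }
}
```

Keeping a map from request identifier to document lets the client report or retry exactly the operations that failed, even when responses arrive in a different order than the requests were sent.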

VisitorSession

This class represents a session for sequentially visiting documents with high throughput.

A visitor is started when creating the VisitorSession through a call to createVisitorSession. A visitor target, that is a receiver of visitor data, can be created through a call to createVisitorDestinationSession. The VisitorSession is a receiver of visitor data. See visiting reference for details. The VisitorSession:

  • Controls the operation of the visiting process
  • Handles the data resulting from visiting data in the system

These two tasks may be handled by a VisitorControlHandler and a VisitorDataHandler respectively. The handlers may be supplied to the VisitorSession in the VisitorParameters object, together with a set of other parameters for visiting. For example, to increase performance, let several separate visitor destinations handle visitor data by specifying the addresses of remote data handlers.

The default VisitorDataHandler used by the VisitorSession returned from DocumentAccess is VisitorDataQueue which queues up incoming documents and implements a polling API. The documents can be extracted by calls to the session's getNext() methods and can be ack-ed by the ack() method. The default VisitorControlHandler can be accessed through the session's getProgress(), isDone(), and waitUntilDone() methods.

Implement custom VisitorControlHandler and VisitorDataHandler by subclassing them and supplying these to the VisitorParameters object.

The VisitorParameters object controls how and what data will be visited - refer to the javadoc. Configure the document selection string to select what data to visit - the default is all data.

For improved performance, dump a subset of the document fields - control which fields are returned by using fieldSet - see Document field sets.

Example:

import com.yahoo.document.Document;
import com.yahoo.document.DocumentId;
import com.yahoo.documentapi.DocumentAccess;
import com.yahoo.documentapi.DumpVisitorDataHandler;
import com.yahoo.documentapi.ProgressToken;
import com.yahoo.documentapi.VisitorControlHandler;
import com.yahoo.documentapi.VisitorParameters;
import com.yahoo.documentapi.VisitorSession;

import java.util.concurrent.TimeoutException;

public class MyClient {

    public static void main(String[] args) throws Exception {
        VisitorParameters params = new VisitorParameters("true");
        params.setLocalDataHandler(new DumpVisitorDataHandler() {

            @Override
            public void onDocument(Document doc, long timeStamp) {
                System.out.print(doc.toXML(""));
            }

            @Override
            public void onRemove(DocumentId id) {
                System.out.println("id=" + id);
            }
        });
        params.setControlHandler(new VisitorControlHandler() {

            @Override
            public void onProgress(ProgressToken token) {
                System.err.format("%.1f %% finished.\n", token.percentFinished());
                super.onProgress(token);
            }

            @Override
            public void onDone(CompletionCode code, String message) {
                System.err.println("Completed visitation, code " + code + ": " + message);
                super.onDone(code, message);
            }
        });
        params.setRoute(args.length > 0 ? args[0] : "[Storage:cluster=storage;clusterconfigid=storage]");
        params.setFieldSet(args.length > 1 ? args[1] : "[all]");

        DocumentAccess access = DocumentAccess.createDefault();
        VisitorSession session = access.createVisitorSession(params);
        if (!session.waitUntilDone(0)) {
            throw new TimeoutException();
        }
        session.destroy();
        access.shutdown();
    }
}

The first optional argument to this client is the route of the cluster to visit. The second is the field set to retrieve.

Compiling and linking

To compile Java applications using the Document API, the library $VESPA_HOME/lib/jars/documentapi-jar-with-dependencies.jar needs to be included in the class path. Build and run the class MyClient from the file MyClient.java:

$ javac -classpath $VESPA_HOME/lib/jars/documentapi-jar-with-dependencies.jar MyClient.java
$ java -enableassertions -classpath .:$VESPA_HOME/lib/jars/documentapi-jar-with-dependencies.jar MyClient

To feed from a non-Vespa node, set the VESPA_CONFIG_SOURCES environment variable. See the ports and config server for details.