Documents

Vespa models data as documents. A document has a string identifier, set by the application, that is unique across all documents. A document is a set of key-value pairs. Each document has a type, defined in a schema.

When configuring clusters, a documents element sets which document types a cluster is to store. This configuration is used to configure the garbage collector if it is enabled. It is also used to define default routes for documents sent into the application: by default, a document is sent to all clusters that have its document type defined. Refer to routing for details.

Vespa uses the document ID to distribute documents to nodes. From the document identifier, the content layer calculates a numeric location. A bucket contains all documents whose locations share the same value in a given number of least significant bits. This property is used to enable co-localized storage of documents - read more in buckets and elastic Vespa.
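The bucket rule above can be illustrated with a minimal Python sketch. It assumes the numeric location is already known; how Vespa derives the location from the document identifier is not covered here. The names bucket_key and group_by_bucket, and the example locations, are hypothetical and chosen for illustration only.

```python
from collections import defaultdict


def bucket_key(location: int, used_bits: int) -> int:
    """Keep only the `used_bits` least significant bits of a location."""
    return location & ((1 << used_bits) - 1)


def group_by_bucket(locations: dict[str, int], used_bits: int) -> dict[int, list[str]]:
    """Group document names by the low bits of their (assumed) locations."""
    buckets = defaultdict(list)
    for doc_name, location in locations.items():
        buckets[bucket_key(location, used_bits)].append(doc_name)
    return dict(buckets)


# Hypothetical locations, not values produced by Vespa.
example_locations = {
    "doc-a": 0b101101,
    "doc-b": 0b011101,  # same four low bits as doc-a
    "doc-c": 0b101100,  # differs in the lowest bit
}
buckets = group_by_bucket(example_locations, used_bits=4)
```

With four bits in use, doc-a and doc-b end up in the same bucket because their locations agree in the four least significant bits, while doc-c lands in a different one.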

Documents can be global, see parent/child.

Document IDs

The document identifiers are URIs, represented by a string, which must conform to a defined URI scheme for document identifiers. The document identifier string may only contain text characters, as defined by isTextCharacter in com.yahoo.text.Text. Schemes have two parts:

  • Namespace: Intended to be used to distinguish data from users who share the same Vespa cluster and/or distinguish between different document types in search. It is hence possible for various applications to use the same Vespa installation, ensuring they do not create document identifier collisions.
  • User specified: Application specific.

id scheme

Vespa has defined one scheme, the id scheme. Format: id:<namespace>:<document-type>:<key/value-pairs>:<user-specified>

namespace (required): See above.
document-type (required): Document type as defined in services.xml and the schema.
key/value-pairs (optional): Modifiers to the id scheme, used to configure document distribution. Used in streaming search to limit the search space. Read more about document to bucket distribution. With no modifiers, the id scheme distributes all documents uniformly.
Beware that using these modifiers in non-streaming search makes document distribution non-uniform, which has many caveats with normal indexed search. If your queries correlate with the group, you might see uneven load and latencies on your content nodes. You will most likely need to set top-k-probability to the conservative value of 1.0 to get enough hits, especially when using a non-zero offset.
The <key/value-pairs> field contains a comma-separated list of lexicographically sorted key/value pairs. n and g are mutually exclusive:
  • n=<number>: All documents with the same number will be stored close to each other. The number must be in the range [0,2^63-1].
  • g=<groupname>: Just like n=, but with a string instead of a number.
user-specified (required): A unique ID string.
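Find examples in album-recommendation. In most cases the Vespa instance is not shared, so there is no use for a namespace, and it can simply be set to the same value as the document type. The examples below use mynamespace to keep the namespace and the document type visibly distinct: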
  • Uniform distribution: id:mynamespace:music::mydocid-123
  • Data access is grouped, e.g. personal data (each user has a numeric user id): id:mynamespace:music:n=12345:mydocid-123
  • Using a string identifier to group data: id:mynamespace:music:g=mymusicsite.com:mydocid-123
Access documents using /document/v1/:
$ curl http://hostname:8080/document/v1/mynamespace/music/docid/mydocid-123

$ curl http://hostname:8080/document/v1/mynamespace/music/number/12345/mydocid-123

$ curl http://hostname:8080/document/v1/mynamespace/music/group/mymusicsite.com/mydocid-123
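The mapping between the three example IDs and the three /document/v1/ paths above can be sketched in Python. This is an illustrative helper, not part of any Vespa client: the function name document_v1_path is hypothetical, the percent-encoding of path segments is an assumption, and only a single n= or g= modifier is handled.

```python
from urllib.parse import quote


def document_v1_path(doc_id: str) -> str:
    """Map an id-scheme document ID to a /document/v1/ path (sketch)."""
    # Split into at most five parts so colons in the user-specified part survive.
    parts = doc_id.split(":", 4)
    if len(parts) != 5 or parts[0] != "id":
        raise ValueError(f"not an id-scheme document ID: {doc_id!r}")
    _, namespace, doc_type, modifiers, user_specified = parts

    if modifiers == "":
        selector = "docid"
    elif modifiers.startswith("n="):
        selector = "number/" + quote(modifiers[2:], safe="")
    elif modifiers.startswith("g="):
        selector = "group/" + quote(modifiers[2:], safe="")
    else:
        raise ValueError(f"unsupported modifier: {modifiers!r}")

    # Assumption: each path segment is percent-encoded on its own.
    return "/document/v1/{}/{}/{}/{}".format(
        quote(namespace, safe=""),
        quote(doc_type, safe=""),
        selector,
        quote(user_specified, safe=""),
    )


uniform_path = document_v1_path("id:mynamespace:music::mydocid-123")
number_path = document_v1_path("id:mynamespace:music:n=12345:mydocid-123")
group_path = document_v1_path("id:mynamespace:music:g=mymusicsite.com:mydocid-123")
```

Prefixing each result with http://hostname:8080 gives the URLs used in the curl commands above.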

Fieldsets

Use fieldset to limit the fields that are returned from a read operation, like get or visit. Fieldsets should be considered hints to Vespa, used to optimize. It should not be considered an error if Vespa returns more fields than specified.

Note: Document field sets are not the same as searchable fieldsets.

There are two options for specifying a field set:

  • Built-in field set
  • Comma-separated list of fields
Built-in field sets:
  • [all] - Returns all fields in the document, including the document ID.
  • [none] - Returns no fields at all, not even the document ID. Internal, do not use.
  • [id] - Returns only the document ID.
  • <document type>:[document] - Returns only the original document fields (generated fields not included) together with the document ID. Supported for indexed search only.
If a built-in field set is not used, a list of fields can be specified. Syntax:
<document type>:field1,field2,…
Example:
music:title,artist
Also find examples in visiting.
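As a rough illustration of the syntax above, the following Python sketch builds a field set string from a document type and a list of field names, then attaches it to a read request URL. The helper names field_set and get_url are hypothetical, and passing the value in a query parameter named fieldSet on /document/v1/ is an assumption to check against the API reference.

```python
from urllib.parse import urlencode


def field_set(doc_type: str, fields: list[str]) -> str:
    """Build '<document type>:field1,field2,...' from a list of field names."""
    if not fields:
        raise ValueError("at least one field is required")
    return f"{doc_type}:{','.join(fields)}"


def get_url(base: str, path: str, fieldset: str) -> str:
    """Append the field set as a query parameter (parameter name is assumed)."""
    return f"{base}{path}?{urlencode({'fieldSet': fieldset})}"


music_fields = field_set("music", ["title", "artist"])
url = get_url(
    "http://hostname:8080",
    "/document/v1/mynamespace/music/docid/mydocid-123",
    music_fields,
)
```

Here music_fields is the string music:title,artist from the example above; urlencode percent-encodes the colon and the comma when the value is placed in the query string.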

Document expiry

To auto-expire documents, use a selection with now. For example, to keep music documents for one day, using a field called timestamp:

<documents garbage-collection="true">
  <document type="music" selection="music.timestamp &gt; now() - 86400" />
</documents>
The timestamp field must hold values in seconds since the Unix epoch:
field timestamp type long {
    indexing: attribute
    attribute {
        fast-access
    }
}
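To make the arithmetic in the selection concrete, here is a small Python sketch that mirrors the expression music.timestamp > now() - 86400 on the client side. It does not call Vespa; the function name is_retained and the sample values are hypothetical. It can serve as a sanity check on timestamps before feeding.

```python
import time

ONE_DAY_SECONDS = 86400


def is_retained(timestamp, now=None, ttl=ONE_DAY_SECONDS):
    """Mirror the selection: keep documents with timestamp > now - ttl."""
    if now is None:
        # Seconds since the epoch, the unit the timestamp field must use.
        now = int(time.time())
    return timestamp > now - ttl


# Fixed "now" so the examples do not depend on the wall clock.
fixed_now = 1_700_000_000
fresh = is_retained(fixed_now, now=fixed_now)                        # just written
borderline = is_retained(fixed_now - ONE_DAY_SECONDS, now=fixed_now)  # exactly one day old
stale = is_retained(fixed_now - 2 * ONE_DAY_SECONDS, now=fixed_now)   # two days old
```

Because the comparison is strict, a document whose timestamp is exactly one day old is no longer matched by the selection.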
Notes:
  • Using a selection with now can have side effects when re-feeding or re-processing documents, as timestamps can be stale. A common problem is feeding with too old timestamps, resulting in no documents being indexed.
  • Deploying a configuration where the selection string selects no documents will cause all documents to be garbage collected. Use visit to test the selection string. Do not expect garbage-collected documents to be recoverable.
  • The fields that are referenced in the selection expression should be attributes. Also, either the fields should be set with "fast-access" or the number of searchable copies in the content cluster should be the same as the redundancy. Otherwise, the document selection maintenance will be slow and have a major performance impact on the system.
  • Imported fields can be used in the selection string to expire documents, but special care needs to be taken when using these. See using imported fields in selections for more information and restrictions.
To batch remove, set a selection that matches no documents, like "not music".

Use vespa-visit to test the selection. Dump the IDs of all documents that would be preserved:

$ vespa-visit -i -p progress-file -s 'music.timestamp > now() - 86400' > ids.json

Negate the expression by wrapping it in a not to dump the IDs of all the documents that would be removed during GC:

$ vespa-visit -i -p progress-file -s 'not (music.timestamp > now() - 86400)' > ids.json