# Documents

Vespa models data as documents. A document has a string identifier, set by the application, that is unique across all documents. A document is a set of key-value pairs, and its type is defined in the schema.

When configuring clusters, a documents element sets which document types a cluster stores. This configuration is used to configure the garbage collector if it is enabled. It is also used to define default routes for documents sent into the application. By default, a document is sent to all clusters that have the document type defined. Refer to routing for details.

Vespa uses the document ID to distribute documents to nodes. From the document identifier, the content layer calculates a numeric location. A bucket contains all the documents whose locations have the same value in a given number of least-significant bits. This property is used to enable co-localized storage of documents. Read more in buckets and content cluster elasticity.
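
The grouping of locations into buckets can be sketched as a bit mask. This is a minimal illustration that assumes the numeric location is already known. The function name `bucket_key` and the sample locations are hypothetical, and the way Vespa derives a location from a document ID is not reproduced here.

```python
def bucket_key(location: int, used_bits: int) -> int:
    """Return the used_bits least-significant bits of a location."""
    if used_bits < 0:
        raise ValueError("used_bits must be non-negative")
    return location & ((1 << used_bits) - 1)


# Hypothetical locations; real values are calculated by the content layer.
locations = [0b1011_0110, 0b0101_0110, 0b1111_0001]

# With 4 bits in use, the first two locations share the low bits 0110
# and would land in the same bucket; the third (low bits 0001) would not.
keys = [bucket_key(loc, 4) for loc in locations]
```

Documents whose locations produce the same key for the current number of bits belong together, which is what makes co-localized storage possible.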

Documents can be global, see parent/child.

## Document IDs

The document identifiers are URIs, represented by a string, which must conform to a defined URI scheme for document identifiers. The document identifier string may only contain text characters, as defined by isTextCharacter in com.yahoo.text.Text.

### id scheme

Vespa has defined one scheme, the id scheme: id:<namespace>:<document-type>:<key/value-pairs>:<user-specified>

Find examples in the /document/v1/ guide.

| Part | Required | Description |
|------|----------|-------------|
| namespace | Yes | See below. |
| document-type | Yes | Document type as defined in services.xml and the schema. |
| key/value-pairs | Optional | Modifiers to the id scheme, used to configure document distribution to buckets. With no modifiers, the id scheme distributes all documents uniformly. The key/value-pairs field contains a comma-separated list of lexicographically sorted key/value pairs. n and g are mutually exclusive. `n=<number>`: Number in the range [0,2^63-1]. `g=<groupname>`: Just like n=, the string is hashed to a number. See streaming search. Using modifiers for regular indexed documents will cause unpredictable feeding performance. In addition, search dispatch does not support limiting the search to modifiers/buckets. |
| user-specified | Yes | A unique ID string. |
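
The id scheme above can be taken apart with plain string handling. The following is a rough sketch, not Vespa's own parser: the names `DocumentId` and `parse_document_id` are hypothetical, and it assumes that any colons after the fourth separator belong to the user-specified part.

```python
from typing import Dict, NamedTuple


class DocumentId(NamedTuple):
    namespace: str
    document_type: str
    modifiers: Dict[str, str]
    user_specified: str


def parse_document_id(doc_id: str) -> DocumentId:
    # Split into at most five parts so that colons inside the
    # user-specified part are kept (an assumption of this sketch).
    parts = doc_id.split(":", 4)
    if len(parts) != 5 or parts[0] != "id":
        raise ValueError(f"not an id scheme document ID: {doc_id!r}")
    _, namespace, document_type, key_values, user_specified = parts
    if not namespace or not document_type or not user_specified:
        raise ValueError(f"missing required part in: {doc_id!r}")

    modifiers: Dict[str, str] = {}
    if key_values:
        for pair in key_values.split(","):
            key, sep, value = pair.partition("=")
            if not sep or key not in ("n", "g"):
                raise ValueError(f"unknown modifier: {pair!r}")
            modifiers[key] = value
    if "n" in modifiers and "g" in modifiers:
        raise ValueError("n and g are mutually exclusive")
    return DocumentId(namespace, document_type, modifiers, user_specified)


plain = parse_document_id("id:first_namespace:my_doc_type::shakespeare")
grouped = parse_document_id("id:ns:music:g=alice:song:1")
```

Here `plain` has no modifiers, while `grouped` carries the modifier `g=alice` and the user-specified part `song:1`.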

### Namespace

Example - if feeding

• document A by curl -X POST https:.../document/v1/first_namespace/my_doc_type/docid/shakespeare
• document B by curl -X POST https:.../document/v1/second_namespace/my_doc_type/docid/shakespeare

then those will be separate documents, both searchable, with different document IDs. The document ID differs not in the user specified part (this is shakespeare for both documents), but in the namespace part (first_namespace vs second_namespace). The full document ID for document A is id:first_namespace:my_doc_type::shakespeare.
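
The mapping from a /document/v1/ path to the full document ID in this example can be sketched as below. The helper name `document_id_from_path` is hypothetical, and the sketch only covers the plain `docid` form shown here, without modifiers or URL decoding.

```python
def document_id_from_path(path: str) -> str:
    """Map /document/v1/<namespace>/<document-type>/docid/<user-specified>
    to the corresponding id scheme string."""
    segments = path.strip("/").split("/")
    if (
        len(segments) != 6
        or segments[:2] != ["document", "v1"]
        or segments[4] != "docid"
    ):
        raise ValueError(f"unsupported path: {path!r}")
    namespace, document_type, user_specified = segments[2], segments[3], segments[5]
    return f"id:{namespace}:{document_type}::{user_specified}"


doc_a = document_id_from_path("/document/v1/first_namespace/my_doc_type/docid/shakespeare")
doc_b = document_id_from_path("/document/v1/second_namespace/my_doc_type/docid/shakespeare")
```

The two results differ only in the namespace part, which is why A and B are stored as separate documents.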

The namespace has no relation to other configuration elsewhere, like in services.xml or in schemas. It is just like the user specified part of each document ID in that sense. Namespace can not be used in queries, other than as part of the full document ID. However, it can be used for document selection, where id.namespace can be accessed and compared to a given string, for instance. An example use case is visiting a subset of documents.

## Fields

Documents can have fields, see the schema reference.

A field can not be defined with a default value. Use a document processor to assign a default to document put/update operations.

## Fieldsets

Use a fieldset to limit the fields that are returned from a read operation, like get or visit. Fieldsets should be considered hints to Vespa, used for optimization. It should not be considered an error if Vespa returns more fields than specified.

Note: Document fieldsets are different from searchable fieldsets.

There are two options for specifying a fieldset:

• Built-in fieldset
• Name of a document type, then a colon ":", followed by a comma-separated list of fields (for example music:artist,song to fetch two fields declared in music.sd)

Built-in fieldsets:

| Fieldset | Description |
|----------|-------------|
| `[all]` | Returns all fields in the schema (generated fields included) and the document ID. |
| `[document]` | Returns original fields in the document, including the document ID. |
| `[none]` | Returns no fields at all, not even the document ID. Internal, do not use. |
| `[id]` | Returns only the document ID. |
| `<document type>:[document]` | Same as the `[document]` fieldset above: returns only the original document fields (generated fields not included) together with the document ID. |

If a built-in field set is not used, a list of fields can be specified. Syntax:

<document type>:field1,field2,…


Example:

music:title,artist


Also find examples in visiting.
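
When fieldset strings are assembled in client code, a small helper keeps the syntax consistent. This is a minimal sketch: the name `build_fieldset` is hypothetical, and it only formats the string without checking that the fields exist in the schema.

```python
from typing import Iterable


def build_fieldset(document_type: str, fields: Iterable[str]) -> str:
    """Format '<document type>:field1,field2,...' from its parts."""
    names = list(fields)
    if not document_type or not names:
        raise ValueError("a document type and at least one field are required")
    for name in names:
        # Commas and colons are separators in the fieldset syntax.
        if not name or "," in name or ":" in name:
            raise ValueError(f"invalid field name: {name!r}")
    return f"{document_type}:{','.join(names)}"


fieldset = build_fieldset("music", ["title", "artist"])
```

The resulting string is `music:title,artist`, matching the example above.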

## Document expiry

To auto-expire documents, use a selection with now. For example, to keep music documents for a day, using a field called timestamp:

    <documents garbage-collection="true">
        <document type="music" selection="music.timestamp &gt; now() - 86400" />
    </documents>

The timestamp field must have values in seconds since the epoch:

    field timestamp type long {
        indexing: attribute
        attribute {
            fast-access
        }
    }
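
The effect of the selection on individual timestamp values can be checked with a few lines of arithmetic. This sketch only mirrors the comparison `music.timestamp > now() - 86400`. The function name `is_kept` is hypothetical, and Vespa evaluates the real selection itself.

```python
import time

DAY_SECONDS = 86400


def is_kept(timestamp: int, now: int) -> bool:
    """Mirror of the selection: timestamp > now() - 86400."""
    return timestamp > now - DAY_SECONDS


now = int(time.time())          # whole seconds since the epoch
fresh = now - 3600              # written an hour ago: kept
stale = now - 2 * DAY_SECONDS   # written two days ago: expired
millis = now * 1000             # milliseconds by mistake: always "kept"

results = {
    "fresh": is_kept(fresh, now),
    "stale": is_kept(stale, now),
    "millis": is_kept(millis, now),
}
```

The `millis` case shows why the unit matters: a value in milliseconds is always far larger than `now() - 86400`, so such a document would not expire.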


When garbage-collection="true", Vespa iterates over the document space to purge expired documents. Vespa will invoke the configured GC selection for each stored document at most once every garbage-collection-interval seconds.

• Using a selection with now can have side effects when re-feeding or re-processing documents, as timestamps can be stale. A common problem is feeding with too old timestamps, resulting in no documents being indexed.
• Normally, documents that are already expired at write time are not persisted. When using create (create if nonexistent), it is possible to create documents that are already expired and will be removed in the next cycle.
• Deploying a configuration where the selection string selects no documents will cause all documents to be garbage collected. Use visit to test the selection string. Garbage-collected documents should not be expected to be recoverable.
• The fields that are referenced in the selection expression should be attributes. Also, either the fields should be set with "fast-access" or the number of searchable copies in the content cluster should be the same as the redundancy. Otherwise, the document selection maintenance will be slow and have a major performance impact on the system.
• Imported fields can be used in the selection string to expire documents, but special care needs to be taken when using these. See using imported fields in selections for more information and restrictions.
• Document garbage collection is a low-priority background operation that runs continuously unless preempted by higher-priority operations. If the cluster is too heavily loaded by client feed operations, there is a risk of starving GC. To verify that garbage collection is not starved, check the vds.idealstate.max_observed_time_since_last_gc_sec.average distributor metric. If it significantly exceeds garbage-collection-interval, GC is likely being starved.
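
The starvation check in the last point can be expressed as a simple comparison. This is a sketch under stated assumptions: the function name `gc_looks_starved` is hypothetical, and the `tolerance` factor of 2.0 is an arbitrary stand-in for "significantly exceeds", to be tuned per installation. Fetching the metric value is left out.

```python
def gc_looks_starved(
    max_time_since_last_gc_sec: float,
    gc_interval_sec: float,
    tolerance: float = 2.0,  # assumed threshold, not a Vespa default
) -> bool:
    """Return True if the observed time since the last GC run exceeds
    the configured interval by more than the tolerance factor."""
    if gc_interval_sec <= 0:
        raise ValueError("gc_interval_sec must be positive")
    return max_time_since_last_gc_sec > tolerance * gc_interval_sec


# Example values in seconds; the first argument stands for the value of
# vds.idealstate.max_observed_time_since_last_gc_sec.average.
starved = gc_looks_starved(10000, 3600)
healthy = gc_looks_starved(4000, 3600)
```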

To batch remove, set a selection that matches no documents, like "not music".

Use vespa-visit to test the selection. Dump the IDs of all documents that would be preserved:

    $ vespa-visit -i -s 'music.timestamp > now() - 86400' > ids.json

Negate the expression by wrapping it in a not to dump the IDs of all the documents that would be removed during GC:

    $ vespa-visit -i -s 'not (music.timestamp > now() - 86400)' > ids.json

## Processing documents

To process documents, use Document processing. Examples are enriching documents (looking up data from other sources), transforming content (like linguistic transformations and tokenization), filtering data, and triggering other events based on the input data.

See the sample app album-recommendation-docproc for use of Vespa APIs like:

• Document API - work on documents and fields in documents, and create unit tests using the Application framework
• Document Processing - chain independent processors with ordering constraints

The sample app vespa-documentation-search has examples of processing PUTs or UPDATEs (using create-if-nonexistent) of documents in OutLinksDocumentProcessor. It is also an introduction to using multivalued fields like arrays, maps and tensors. Use the VespaDocSystemTest to build code that feeds and tests an instance in the Vespa Developer Cloud or a local Docker instance.

Both sample apps also use the Document API to GET/PUT/UPDATE other documents as part of processing, using asynchronous DocumentAccess. Use this as a starting point for applications that enrich data when writing.