Vespa models data as documents. A document has a string identifier, set by the application, unique across all documents. A document is a set of key-value pairs. Each document has a type, defined in its schema.
When configuring clusters, a documents element sets what document types a cluster is to store. This configuration is used to configure the garbage collector if it is enabled. Additionally, it is used to define default routes for documents sent into the application. By default, a document will be sent to all clusters having the document type defined. Refer to routing for details.
Vespa uses the document ID to distribute documents to nodes. From the document identifier, the content layer calculates a numeric location. A bucket contains all documents whose locations share the same value in a given number of least-significant bits. This property is used to enable co-localized storage of documents - read more in buckets and content cluster elasticity.
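The relationship between a location and its bucket can be sketched in a few lines. This is only an illustration of the bit-masking idea: the location function below (the first 8 bytes of an MD5 digest) and the default of 16 used bits are assumptions made for the example, not Vespa's actual implementation.

```python
import hashlib


def location(doc_id: str) -> int:
    """Hypothetical stand-in for the content layer's location function."""
    digest = hashlib.md5(doc_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes of the digest as a 64-bit number.
    return int.from_bytes(digest[:8], "little")


def bucket(doc_id: str, used_bits: int = 16) -> int:
    """Keep only the used_bits least-significant bits of the location."""
    return location(doc_id) & ((1 << used_bits) - 1)
```

Two documents land in the same bucket exactly when `bucket()` returns the same value for both, that is, when their locations agree in the lowest `used_bits` bits.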
Documents can be global, see parent/child.
The document identifiers are URIs, represented by a string, which must conform to a defined URI scheme for document identifiers.
The document identifier string may only contain text characters, as defined by isTextCharacter in com.yahoo.text.Text.
Vespa currently has only one defined scheme, the id scheme:
id:<namespace>:<document-type>:<key/value-pair>:<user-specified>
For example, the document ID id:mynamespace:mydoctype::user-defined-id maps to the /document/v1/ path /document/v1/mynamespace/mydoctype/docid/user-defined-id.
Find examples and tools in troubleshooting.
Find examples in the /document/v1/ guide.
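As a minimal sketch of the id scheme and the path mapping above, the following splits an ID into its parts and builds the corresponding /document/v1/ path. The names `DocumentId`, `parse_document_id` and `to_document_v1_path` are hypothetical helpers, not part of any Vespa client library. Percent-encoding the path segments is an assumption, and IDs with a key/value-pair modifier are deliberately rejected because the source only shows the mapping without modifiers.

```python
from typing import NamedTuple
from urllib.parse import quote


class DocumentId(NamedTuple):
    namespace: str
    document_type: str
    key_value: str  # empty string when no modifier is present
    user_specified: str


def parse_document_id(doc_id: str) -> DocumentId:
    """Split id:<namespace>:<document-type>:<key/value-pair>:<user-specified>."""
    # maxsplit=4 keeps any colons inside the user-specified part intact.
    parts = doc_id.split(":", 4)
    if len(parts) != 5 or parts[0] != "id":
        raise ValueError(f"not an id-scheme document ID: {doc_id!r}")
    _, namespace, document_type, key_value, user_specified = parts
    return DocumentId(namespace, document_type, key_value, user_specified)


def to_document_v1_path(doc_id: str) -> str:
    """Map an ID without modifiers to its /document/v1/ path."""
    parsed = parse_document_id(doc_id)
    if parsed.key_value:
        raise ValueError("IDs with modifiers are not handled in this sketch")
    return "/document/v1/{}/{}/docid/{}".format(
        quote(parsed.namespace, safe=""),
        quote(parsed.document_type, safe=""),
        quote(parsed.user_specified, safe=""),
    )
```

Calling `to_document_v1_path("id:mynamespace:mydoctype::user-defined-id")` returns `/document/v1/mynamespace/mydoctype/docid/user-defined-id`, matching the mapping described above.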
Part | Required | Description |
---|---|---|
namespace | Yes | Not used by Vespa, see below. |
document-type | Yes | Document type as defined in services.xml and the schema. |
key/value-pair | Optional | Modifiers to the id scheme, used to configure document distribution to buckets. With no modifiers, the id scheme distributes all documents uniformly. The key/value-pair field contains one of two possible key/value pairs, keyed by n or g; n and g are mutually exclusive. Important: This is only useful for document types with mode=streaming or mode=store-only. Do not use modifiers for regular indexed document types, see streaming search. Using modifiers for regular indexed documents causes unpredictable feeding performance, and search dispatch does not support limiting the search to modifiers/buckets. |
user-specified | Yes | A unique ID string. |
The full Document ID (as a string) will often contain redundant information and be quite long; a typical value may look like "id:mynamespace:mydoctype::user-specified-identifier" where only the last part is useful outside Vespa. The Document ID is therefore not stored in memory, and it is not always present in search results. It is therefore recommended to put your own unique identifier (usually the "user-specified-identifier" above) in a document field, typically named "myid" or "shortid" or similar:
field shortid type string { indexing: attribute | summary }
This enables using a document-summary with only in-memory fields while still getting the identifier you actually care about. If the "user-specified-identifier" is just a simple number you could even use "type int" for this field for minimal memory overhead.
The namespace in document IDs is useful when you have multiple document collections and want to be sure they never end up with the same document ID. It has no function in Vespa beyond this, and can be set to any short constant value, for example "doc". Consider also letting synthetic documents used for testing use the namespace "test", so it is easy to detect and remove them if they are present outside the test by mistake.
Example - if feeding
curl -X POST https:.../document/v1/first_namespace/my_doc_type/docid/shakespeare
curl -X POST https:.../document/v1/second_namespace/my_doc_type/docid/shakespeare
then those will be separate documents, both searchable, with different document IDs.
The document IDs differ not in the user-specified part (shakespeare for both documents), but in the namespace part (first_namespace vs second_namespace).
The full document ID of the first document is id:first_namespace:my_doc_type::shakespeare.
The namespace has no relation to other configuration elsewhere, like in services.xml or in schemas.
It is just like the user specified part of each document ID in that sense.
Namespace cannot be used in queries, other than as part of the full document ID.
However, it can be used for document selection, where id.namespace can be accessed and compared to a given string, for instance.
An example use case is visiting a subset of documents.
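The following sketch shows what such a comparison amounts to. `namespace_selection` builds a selection expression that compares id.namespace to a string, and `select_by_namespace` applies the same comparison locally to a list of IDs. Both function names are hypothetical, and the local filter only mimics the effect of the selection; it does not call Vespa.

```python
from typing import Iterable, List


def namespace_of(doc_id: str) -> str:
    """Return the namespace part of an id-scheme document ID."""
    parts = doc_id.split(":", 4)
    if len(parts) != 5 or parts[0] != "id":
        raise ValueError(f"not an id-scheme document ID: {doc_id!r}")
    return parts[1]


def namespace_selection(namespace: str) -> str:
    """Build a selection expression comparing id.namespace to a string."""
    return f'id.namespace == "{namespace}"'


def select_by_namespace(doc_ids: Iterable[str], namespace: str) -> List[str]:
    """Local illustration: keep only the IDs in the given namespace."""
    return [d for d in doc_ids if namespace_of(d) == namespace]
```

With the two shakespeare documents from the example above, selecting on first_namespace keeps only id:first_namespace:my_doc_type::shakespeare.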
Documents can have fields, see the schema reference.
A field can not be defined with a default value. Use a document processor to assign a default to document put/update operations.
Use fieldset to limit the fields that are returned from a read operation, like get or visit - see examples. Vespa may return more fields than specified if this does not impact performance.
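There are two options for specifying a fieldset: use a built-in fieldset, or give the name of a document type followed by a colon and a comma-separated list of fields (for example, music:artist,song to fetch two fields declared in music.sd).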
Built-in fieldsets:
Fieldset | Description |
---|---|
[all] | Returns all fields in the schema (generated fields included) and the document ID. |
[document] | Returns original fields in the document, including the document ID. |
[none] | Returns no fields at all, not even the document ID. Internal, do not use |
[id] | Returns only the document ID |
<document type>:[document] |
Deprecated: use the [document] fieldset above instead. Same as [document]: returns only the original document fields (generated fields not included) together with the document ID. |
If a built-in field set is not used, a list of fields can be specified. Syntax:
<document type>:field1,field2,…
Example:
music:title,artist
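The syntax above is simple enough to build programmatically. The helper below is a hypothetical convenience function, shown only to make the format concrete.

```python
def fieldset(document_type: str, *fields: str) -> str:
    """Build a fieldset string of the form <document type>:field1,field2,..."""
    if not fields:
        raise ValueError("at least one field is required")
    return f"{document_type}:{','.join(fields)}"
```

For example, `fieldset("music", "title", "artist")` produces `music:title,artist`.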
To auto-expire documents, use a selection with now. For example, a time-to-live (TTL) of one day for music documents can be set using a field called timestamp.
Note: The selection expression says which documents to keep, not which ones to delete.
The timestamp field must have a value in seconds since EPOCH:
field timestamp type long {
    indexing: attribute
    attribute {
        fast-access
    }
}
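To make the "keep, not delete" semantics concrete, here is a small sketch of a one-day TTL. The selection string in `KEEP_SELECTION` is an assumed expression for this TTL, and `is_kept` is a hypothetical local mirror of it; neither is taken from a deployed configuration.

```python
import time
from typing import Optional

ONE_DAY_SECONDS = 24 * 60 * 60

# Assumed selection for a one-day TTL: documents matching it are KEPT.
KEEP_SELECTION = f"music.timestamp > now() - {ONE_DAY_SECONDS}"


def is_kept(timestamp: int, now: Optional[int] = None) -> bool:
    """Local mirror of the selection: True means keep, False means expired.

    timestamp and now are both in seconds since the epoch.
    """
    current = int(time.time()) if now is None else now
    return timestamp > current - ONE_DAY_SECONDS
```

A document written at second 1000 is kept while `now` is below 1000 + 86400 and becomes eligible for removal from that point on.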
When garbage-collection="true", Vespa iterates over the document space to purge expired documents.
Vespa will invoke the configured GC selection for each stored document once every garbage-collection-interval seconds.
It is unspecified when a particular document will be processed within the configured interval.
If the observed time between GC passes significantly exceeds garbage-collection-interval, it is an indication that GC is starved.
To batch remove, set a selection that matches no documents, like "not music".
Use vespa visit to test the selection. Dump the IDs of all documents that would be preserved:
Negate the expression by wrapping it in a not to dump the IDs of all the documents that would be removed during GC:
To process documents, use Document processing. Examples are enriching documents (looking up data from other sources), transforming content (like linguistic transformations and tokenization), filtering data, and triggering other events based on the input data.
See the sample app album-recommendation-docproc for use of Vespa APIs like:
The sample app vespa-documentation-search has examples of processing PUTs or UPDATEs (using create-if-nonexistent) of documents in OutLinksDocumentProcessor. It is also an introduction to using multivalued fields like arrays, maps and tensors. Use the VespaDocSystemTest to build code that feeds and tests an instance in the Vespa Developer Cloud or a local Docker instance.
Both sample apps also use the Document API to GET/PUT/UPDATE other documents as part of processing, using asynchronous DocumentAccess. Use this as a starting point for applications that enrich data when writing.