A schema defines a document type and the computations to run over it, the rank-profiles.
Each schema is stored in a file named after the schema, with the extension ".sd" (for schema definition),
in the schemas/ directory of the application package.
Refer to the schema reference.
Document types, rank profiles and document summaries in schemas can be inherited.
Schema example:
schema music {
    document music {
        field artist type string {
            indexing: summary | index
        }
        field artistId type string {
            indexing: summary | attribute
            match: word
            rank: filter
        }
        field title type string {
            indexing: summary | index
        }
        field album type string {
            indexing: index
        }
        field duration type int {
            indexing: summary
        }
        field year type int {
            indexing: summary | attribute
        }
        field popularity type int {
            indexing: summary | attribute
        }
    }
    fieldset default {
        fields: artist, title, album
    }
    rank-profile song inherits default {
        first-phase {
            expression {
                nativeRank(artist,title) +
                if(isNan(attribute(popularity)) == 1, 0, attribute(popularity))
            }
        }
    }
}
Schemas can also be placed in the searchdefinitions/ directory and use search instead of schema as the top-level tag. This is deprecated.
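The first-phase expression of the song rank profile in the schema example can be read as "text score plus popularity, where a missing popularity counts as zero". The following is a minimal plain-Python sketch of that arithmetic, not Vespa code. It assumes that a missing popularity value reaches the expression as NaN, which is what the isNan guard suggests; the function name song_first_phase is hypothetical.

```python
import math


def song_first_phase(native_rank: float, popularity: float) -> float:
    """Mirror of: nativeRank(artist,title) + if(isNan(popularity) == 1, 0, popularity)."""
    # Assumption: a document without a popularity value yields NaN here.
    popularity_term = 0.0 if math.isnan(popularity) else popularity
    return native_rank + popularity_term


print(song_first_phase(0.25, 10.0))          # popularity present
print(song_first_phase(0.25, float("nan")))  # popularity missing
```

Without the guard, a NaN popularity would turn the whole score into NaN, which is why the expression replaces it with 0.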
A document is the unit the rank-profile evaluates, and is what is returned in query results. Documents have fields - reads and writes operate on full documents or on some fields of a document. Refer to the schema reference.
Documents can have relations, field values can be imported from parent documents.
Note that the document id is not a field of the document - add this explicitly if needed.
A field has a type, see field reference for a full list.
A field contained in a document can be written to, read from and queried - this is the normal field use. A field can also be generated (i.e. a synthetic field) - in this case, the field definition is outside the document. See reindexing for examples.
A field can be single value, like a string, or multivalue, like an array of strings - see the field type list. Most multivalue fields can be used in grouping. Accessing attributes in maps and arrays of struct in ranking is not possible.
The rank feature attribute(name).count can be used in ranking to rank based on number of elements in a multivalue attribute. To filter based on number of elements, create a strict tiering rank function combined with a rank-score-drop-limit. Then use a query variable for number of elements. Note that doing this filtering is more expensive to evaluate than just having a separate field for the count.
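The filtering idea above (a strict tiering function combined with a rank-score-drop-limit and a query variable) can be simulated in plain Python. This is only a sketch of the logic, not Vespa code. It assumes that hits whose score is less than or equal to the drop limit are removed; the names tiered_score and filter_by_count are hypothetical.

```python
from typing import Dict, List


def tiered_score(element_count: int, min_elements: int) -> float:
    """Strict tiering: 1.0 when the field has enough elements, else 0.0."""
    return 1.0 if element_count >= min_elements else 0.0


def filter_by_count(docs: Dict[str, List[str]], min_elements: int,
                    drop_limit: float = 0.0) -> List[str]:
    """Return ids of documents whose tiered score is above the drop limit.

    docs maps a document id to the values of one multivalue field;
    min_elements plays the role of the query variable.
    """
    return [doc_id for doc_id, values in docs.items()
            if tiered_score(len(values), min_elements) > drop_limit]


docs = {"a": ["x", "y", "z"], "b": ["x"], "c": []}
print(filter_by_count(docs, min_elements=2))
```

The sketch also shows why a separate count field is cheaper: here the score must be computed for every candidate document before any can be dropped.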
There is no general setting for maximum field size in bytes. Examples of fields with potentially large values include string and raw fields. Other large values include multivalue fields with many elements, like an array, weightedset or tensor. This is relevant when the field is returned in query responses - large result sets and parallel queries require the Container with the query endpoint to keep many field instances in memory simultaneously. Use a summary class to tune which fields to return in query responses, and keep result sets smaller using limit or hits.
Vespa requires that a document can be loaded into memory in serialized form. A document in JSON format is serialized in the Container hosting the document-api endpoint, and persisted in the content node document store.
A text field is capped at max-length characters when indexing. Increase this to index all terms in large string fields, example:
match { max-length: 15000000 }
A struct is contained in a document and groups one or more fields into a composite type that can be accessed like a single field.
Example:
struct email {
    field sender type string {}
    field recipient type string {}
    field subject type string {}
    field content type string {}
}
field emails type array<email> {}
In this example the struct is part of an array. A struct can also be used in a map.
A struct-field defines how a given field in a struct should be indexed and searched.
Note that though a struct-field refers to a field in a struct, the struct-field itself is defined inside a field.
Using the email struct defined previously (see struct), we can define indexing for a specific field, like content:
field emails type array<email> {
    indexing: summary
    struct-field content {
        indexing: attribute
        attribute: fast-search
    }
}
The equivalent code (including the struct definition) in Pyvespa is as follows:
from vespa.package import Document, Field, Schema, Struct, StructField

# Struct type with four string fields
email_struct = Struct(
    name="email",
    fields=[
        Field(name="sender", type="string"),
        Field(name="recipient", type="string"),
        Field(name="subject", type="string"),
        Field(name="content", type="string"),
    ],
)

# Array-of-struct field; only the content struct-field is made an attribute
emails_field = Field(
    name="emails",
    type="array<email>",
    indexing=["summary"],
    struct_fields=[
        StructField(name="content", indexing=["attribute"], attribute=["fast-search"])
    ],
)

schema = Schema(name="schema", document=Document())
schema.add_fields(emails_field)
schema.document.add_structs(email_struct)
indexing configures how the data of a field is processed during indexing. The most important instructions are:
index | For unstructured text: Create a text index for this field. Text matching and all text ranking features become available. Indexes are disk backed and do not need to fit in memory. Reference / index details |
attribute | For structured data: Keep this field in memory in a forward structure. This makes the field available for grouping, sorting and ranking. Attributes may also be searched by complete match (word or exact), or (for numerical fields) by range. Optionally a B-tree in memory can also be created by adding the fast-search option - this improves performance if the attribute is a strong criterion in queries (i.e. filters out many documents). Reference / attribute details |
summary | Include this field in the document summary in search result sets. Reference / document summary details |
Indexing instructions have pipeline semantics similar to unix shell commands, with data flowing from left to right. They can perform complex transformations on field values, or just send the field value unchanged to the next sections of the index structure. Example: The data is first added to the document summary, then added as an in-memory attribute and finally indexed:
indexing: summary | attribute | index
If both attribute and index are set on a field, queries to this field use index mode. The normal case for setting both is to run queries (using index) with grouping (which requires attribute).
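The left-to-right pipeline reading of an indexing statement, and the rule that index wins when both index and attribute are set, can be sketched in a few lines of Python. This is a toy parser for illustration only, not how Vespa processes schemas. The names parse_indexing and query_mode are hypothetical, and treating an attribute-only field as "attribute" mode is an assumption inferred from the text.

```python
from typing import List, Optional


def parse_indexing(statement: str) -> List[str]:
    """Split 'indexing: a | b | c' into its steps, in execution order."""
    _, _, pipeline = statement.partition(":")
    return [step.strip() for step in pipeline.split("|") if step.strip()]


def query_mode(steps: List[str]) -> Optional[str]:
    """Mode used by queries against the field, or None if it is not searchable."""
    if "index" in steps:
        return "index"  # index takes precedence when both are set
    if "attribute" in steps:
        return "attribute"
    return None


steps = parse_indexing("indexing: summary | attribute | index")
print(steps)
print(query_mode(steps))
```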
The match mode configures how query items are matched to fields (e.g. exact or prefix matching), and is tightly coupled with indexing. Find more details in text matching.
When searching in array or map of struct, sameElement() is a useful query operator to restrict matches to same struct element (e.g. first_name contains 'Joe', last_name contains 'Smith' - both must match in the same field value). Note that the document summary will not contain which element(s) matched.
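The difference between matching conditions anywhere in an array of structs and matching them in the same element can be illustrated in plain Python. This is a sketch of the semantics only, not of the Vespa implementation. It simplifies contains to exact string equality, and the function names are hypothetical.

```python
from typing import Dict, List

Element = Dict[str, str]


def matches_across_elements(elements: List[Element], conditions: Dict[str, str]) -> bool:
    """Each condition may be satisfied by a different element."""
    return all(any(e.get(field) == value for e in elements)
               for field, value in conditions.items())


def matches_same_element(elements: List[Element], conditions: Dict[str, str]) -> bool:
    """All conditions must be satisfied by one and the same element."""
    return any(all(e.get(field) == value for field, value in conditions.items())
               for e in elements)


persons = [
    {"first_name": "Joe", "last_name": "Doe"},
    {"first_name": "Jane", "last_name": "Smith"},
]
wanted = {"first_name": "Joe", "last_name": "Smith"}
print(matches_across_elements(persons, wanted))  # True: Joe and Smith both occur
print(matches_same_element(persons, wanted))     # False: no single Joe Smith
```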
A fieldset creates a group of fields which can be queried as one:
fieldset myset {
    fields: artist, title, album
}
$ vespa query "select * from sources * where myset contains 'bob' and title contains 'best'"
Each term searching a fieldset is only tokenized once, so all the fields in a fieldset should have compatible types and match settings.
If you want to let some user text search multiple fields with different match settings, repeat the userInput query operator multiple times in the query:
select * from sources * where ({defaultIndex: 'fieldsetOrField1'}userInput(@query)) or ({defaultIndex: 'fieldsetOrField2'}userInput(@query))
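When the list of fields or fieldsets is not fixed, the repeated userInput pattern above can be generated. The helper below is a small sketch that only builds the YQL string; the function name multi_field_user_input and the identifier check are assumptions made for this example, not part of any Vespa client library.

```python
import re
from typing import List


def multi_field_user_input(targets: List[str], param: str = "query") -> str:
    """Build YQL that runs userInput(@param) once per field or fieldset."""
    if not targets:
        raise ValueError("at least one field or fieldset is required")
    for target in targets:
        # Reject anything that is not a plain identifier before embedding it in YQL.
        if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", target):
            raise ValueError("not a valid field or fieldset name: " + repr(target))
    clauses = [f"({{defaultIndex: '{t}'}}userInput(@{param}))" for t in targets]
    return "select * from sources * where " + " or ".join(clauses)


print(multi_field_user_input(["fieldsetOrField1", "fieldsetOrField2"]))
```

The user text itself is still passed separately as the query request parameter, so it is tokenized once per target according to that target's match settings.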
The rank profile defines the computation to be made over documents of this type when matching a query. Learn more in getting started with ranking.
If you use IntelliJ, you can install the Vespa IntelliJ plugin to simplify working with schema files.
Vespa is built for safe schema modifications, like adding a field or changing indexing or match modes.
A new version of the schema is deployed in an application package.
As some changes are potentially destructive (e.g. changing a field's index settings),
the deploy command will by default not accept such changes.
Example output from deploy (change from index to attribute):
Invalid application package: Error loading default.default: indexing-change: Document type 'music': Field 'artist' changed: remove index aspect, matching: 'text' -> 'word', stemming: 'best' -> 'none', normalizing: 'ACCENT' -> 'LOWERCASE', summary field 'artist' transform: 'none' -> 'attribute', indexing script: '{ input artist | tokenize normalize stem:\"BEST\" | summary artist | index artist; }' -> '{ input artist | summary artist | attribute artist; }' To allow this add <allow until='yyyy-mm-dd'>indexing-change</allow> to validation-overrides.xml
To accept such changes, add a validation-override:
<validation-overrides>
    <allow until="2021-08-30">indexing-change</allow>
</validation-overrides>
Because destructive changes are blocked by default, schema evolution is safe and easy to automate. Many schema changes are non-destructive and do not require a validation override, like adding a field. Read more in modifying-schemas.
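In an automated deployment pipeline, the validation override shown above can be generated rather than hand-edited. The following is a minimal sketch using only the Python standard library. The function name validation_overrides is hypothetical, and the caller is assumed to pick an until date that the deploy command accepts.

```python
import xml.etree.ElementTree as ET
from datetime import date


def validation_overrides(override_id: str, until: date) -> str:
    """Render a validation-overrides document allowing one change type until a date."""
    root = ET.Element("validation-overrides")
    allow = ET.SubElement(root, "allow", {"until": until.isoformat()})
    allow.text = override_id
    return ET.tostring(root, encoding="unicode")


print(validation_overrides("indexing-change", date(2021, 8, 30)))
```

Keeping the date explicit in the call makes the override expire on its own, so a forgotten override does not silently allow later destructive changes.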
Refer to procedure to change from attribute to index.
Renaming a field is not directly supported. Options:
Also try using an alias.
An application can define multiple document types, each in their own schema. Multiple schemas can either be mapped to a single content cluster, or one can define separate content clusters for schemas to be able to scale differently for the document types. A single container cluster can be used to query all the document types in both these configurations.
In an application with multiple document types, a query can restrict which document types are searched. By default, Vespa queries all document types and all clusters in parallel, and blends results based on score - find details in federation.
To limit a query to a subset of the document types, set restrict to a comma-separated list of schema names:
$ vespa query 'select * from sources * where title contains "bob"' restrict=music,books
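The same restriction can be expressed as a plain HTTP request. The sketch below only builds the request URL with the standard library. The endpoint http://localhost:8080/search/ is an assumption for a local deployment, and search_url is a hypothetical helper name.

```python
from typing import List
from urllib.parse import urlencode


def search_url(yql: str, restrict: List[str],
               endpoint: str = "http://localhost:8080/search/") -> str:
    """Build a query URL, optionally limited to the given schema names."""
    params = {"yql": yql}
    if restrict:
        # restrict takes a comma-separated list of schema names
        params["restrict"] = ",".join(restrict)
    return endpoint + "?" + urlencode(params)


print(search_url('select * from sources * where title contains "bob"',
                 ["music", "books"]))
```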
Schemas can be thought of as tables in a database. Most applications start off with one schema, adding schemas as more content types are needed. Queries can hit one, some, or all schemas, using the restrict query parameter or selecting in YQL.
Content nodes can hold multiple schemas:
One or more schemas can be deployed in separate content clusters:
The evolution can be illustrated like:
The optimal mapping from schema to content cluster is application dependent:
Sizing search is a good read for how to optimize content clusters.
To limit a query to a subset of the content clusters, set from sources to a comma-separated list of content cluster ids, e.g.:
$ vespa query 'select * from sources items, news where title contains "bob"'
The restrict request parameter and the from sources clause can be combined to search a subset of both document types and content clusters.
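A sketch of how the two scoping mechanisms combine into one set of request parameters: content clusters go into the from sources clause of the YQL, document types go into restrict. The helper name scoped_query is hypothetical; the cluster ids and schema names are taken from the examples above.

```python
from typing import Dict, List, Optional


def scoped_query(where: str,
                 clusters: Optional[List[str]] = None,
                 doc_types: Optional[List[str]] = None) -> Dict[str, str]:
    """Return request parameters limited to some clusters and document types."""
    # No clusters given means all sources, written as * in YQL.
    sources = ", ".join(clusters) if clusters else "*"
    params = {"yql": f"select * from sources {sources} where {where}"}
    if doc_types:
        params["restrict"] = ",".join(doc_types)
    return params


print(scoped_query('title contains "bob"', clusters=["items", "news"], doc_types=["music"]))
```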
Both document types and full schemas can be inherited to make it easy to design a structured application package with little duplication. Document type inheritance defines a type hierarchy which is also useful for applications that federate queries, as queries can be written to the common supertype. This guide covers the different elements in the schema that support inheritance:
A schema that inherits another gets all the content of the parent schema as if it was defined inside the inheriting schema. A schema that inherits another must also (explicitly) inherit its document type:
schema books inherits items {
    document books inherits items {
        field author type string {
            indexing: summary | index
        }
    }
}
A document type can inherit another document type. This includes all fields, also fields declared outside the document block in the schema. Rank-profiles defined in the super-schema can then be inherited in the schema of this document, see Rank profile inheritance below.
Both schemas music and books have the title field through inheritance:
my-app/schemas/items.sd:
document items {
    field title type string {
        indexing: summary | index
    }
}
my-app/schemas/books.sd:
schema books {
    document books inherits items {
        field author type string {
            indexing: summary | index
        }
    }
}
my-app/schemas/music.sd:
schema music {
    document music inherits items {
        field artist type string {
            indexing: summary | index
        }
    }
}
This is equivalent to:
my-app/schemas/books.sd:
schema books {
    document books {
        field title type string {
            indexing: summary | index
        }
        field author type string {
            indexing: summary | index
        }
    }
}
my-app/schemas/music.sd:
schema music {
    document music {
        field title type string {
            indexing: summary | index
        }
        field artist type string {
            indexing: summary | index
        }
    }
}
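The equivalence above can be checked with a small model of document type inheritance: a type's effective fields are its parent's effective fields followed by its own. This is an illustrative sketch in plain Python, not how Vespa resolves schemas. The DOC_TYPES structure and the function effective_fields are hypothetical.

```python
from typing import Dict, List

# Hypothetical in-memory model of the three document types from the example.
DOC_TYPES = {
    "items": {"inherits": None, "fields": ["title"]},
    "books": {"inherits": "items", "fields": ["author"]},
    "music": {"inherits": "items", "fields": ["artist"]},
}


def effective_fields(name: str, doc_types: Dict[str, dict]) -> List[str]:
    """Fields of a document type, inherited fields first."""
    spec = doc_types[name]
    parent = spec["inherits"]
    inherited = effective_fields(parent, doc_types) if parent else []
    # Own fields are appended; names already inherited are not repeated.
    return inherited + [f for f in spec["fields"] if f not in inherited]


print(effective_fields("books", DOC_TYPES))
print(effective_fields("music", DOC_TYPES))
```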
Notes:
Where fields define the document types, rank profiles define the computations over the documents. A rank profile can inherit a rank profile defined in the same schema, or one defined in another schema when this schema's document type inherits the document type of that schema:
my-app/schemas/items.sd:
schema items {
    document items {
        field title type string {
            indexing: summary | index
        }
    }
    rank-profile items_ranking_base {
        function title_score() {
            expression: fieldLength(title)
        }
        first-phase {
            expression: title_score
        }
        summary-features {
            title_score
        }
    }
}
my-app/schemas/books.sd:
schema books {
    document books inherits items {
        field author type string {
            indexing: summary | index
        }
    }
    rank-profile items_ranking inherits items_ranking_base {}
    rank-profile items_subschema_ranking inherits items_ranking_base {
        first-phase {
            expression: title_score + fieldMatch(author)
        }
        summary-features inherits items_ranking_base {
            fieldMatch(author)
        }
    }
}
my-app/schemas/music.sd:
schema music {
    document music inherits items {
        field artist type string {
            indexing: summary | index
        }
    }
    rank-profile items_ranking inherits items_ranking_base {}
    rank-profile items_subschema_ranking inherits items_ranking_base {
        first-phase {
            expression: title_score + fieldMatch(artist)
        }
        summary-features inherits items_ranking_base {
            fieldMatch(artist)
        }
    }
}
items_ranking can be considered the "base" ranking. Pro-tip: Set this as the default rank profile by modifying the default query profile:
my-app/search/query-profiles/default.xml:
<query-profile id="default">
    <field name="ranking.profile">items_ranking</field>
</query-profile>
Queries using ranking.profile=default will then use the first-phase ranking defined in items.sd.
Another way to inherit behavior is to override the first-phase ranking in the sub-schemas, still using functions defined in the super-schema (e.g. title_score).
Summary-features and match-features are rank features computed during ranking, to be included in results. These features can be inherited - the above will include scores from features in super- and sub-schema - example:
"summaryfeatures": { "fieldMatch(author)": 0, "rankingExpression(title_score)": 4 }
Here, both the books and music schemas implement rank profiles with the same names (e.g. items_subschema_ranking), so they can be used in queries spanning both. If a query's rank profile cannot be found in a given schema, Vespa's default rank profile nativerank is used.
Inputs to a rank profile are automatically inherited from the parent rank profile. If a new inputs block is defined in a child rank profile, those inputs will be added cumulatively to those defined in the parent.
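The cumulative behavior of inputs can be modelled as a dictionary merge along the inheritance chain. The sketch below is plain Python for illustration, not Vespa code. The input names query(user_weight) and query(author_boost) are hypothetical and do not appear in the rank profiles above.

```python
from typing import Dict

# Hypothetical model: each profile lists only the inputs it declares itself.
PROFILES = {
    "items_ranking_base": {
        "inherits": None,
        "inputs": {"query(user_weight)": "double"},
    },
    "items_subschema_ranking": {
        "inherits": "items_ranking_base",
        "inputs": {"query(author_boost)": "double"},
    },
}


def effective_inputs(name: str, profiles: Dict[str, dict]) -> Dict[str, str]:
    """Inputs visible in a rank profile: the parent's inputs plus its own."""
    spec = profiles[name]
    merged: Dict[str, str] = {}
    if spec["inherits"]:
        merged.update(effective_inputs(spec["inherits"], profiles))
    merged.update(spec["inputs"])  # the child's inputs are added to the parent's
    return merged


print(effective_inputs("items_subschema_ranking", PROFILES))
```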
Document summaries can inherit others defined in the same or an inherited schema.
my-app/schemas/books.sd:
schema books {
    document books {
        field title type string {
            indexing: summary | index
        }
        field author type string {
            indexing: summary | index
        }
    }
    document-summary items_summary_tiny {
        summary title {}
    }
    document-summary items_summary_full inherits items_summary_tiny {
        summary author {}
    }
}