Search Definitions

Data in Vespa is modeled as Documents. Search Definitions configure the document types and how they should be stored, indexed, ranked, searched and presented. A search definition is roughly analogous to a database schema. Documents adhere strictly to a configured document type. A Vespa system can have multiple document types, and each document type is defined in a search definition.

Search definition files have the suffix .sd, and the file name before the suffix must be the same as the search name declared inside the file (music.sd declares search music). A Vespa application must have at least one search definition in its application package - the search definition is a required file.
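
The naming rule above can be checked before deploying. The following is a minimal sketch, not part of Vespa: the helper names are hypothetical, and the regular expression is an assumption about what a search name looks like rather than Vespa's actual grammar.

```python
import re
from pathlib import Path
from typing import Optional

# Matches the opening "search <name> {" statement of a search definition.
# The identifier pattern is an assumption, not Vespa's official grammar.
SEARCH_DECLARATION = re.compile(
    r"^\s*search\s+([A-Za-z_][A-Za-z0-9_]*)\s*\{", re.MULTILINE
)


def declared_search_name(sd_text: str) -> Optional[str]:
    """Return the name in the first 'search <name> {' statement, or None."""
    match = SEARCH_DECLARATION.search(sd_text)
    return match.group(1) if match else None


def file_name_matches(file_name: str, sd_text: str) -> bool:
    """True if the file has suffix .sd and its prefix equals the search name."""
    path = Path(file_name)
    return path.suffix == ".sd" and path.stem == declared_search_name(sd_text)


example = "search music {\n    document music {\n    }\n}\n"
print(file_name_matches("music.sd", example))  # True
print(file_name_matches("songs.sd", example))  # False
```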

Search definitions can, with some restrictions, be modified without restarting or re-indexing. Refer to the reference for details, and the application package for how to deploy.

Example music.sd:

search music {
    document music {
        field artist type string {
            indexing: summary | index
        }

        field artistId type string {
            indexing: summary | attribute
        }

        field title type string {
            indexing: summary | index
        }

        field album type string {
            indexing: index
        }

        field duration type int {
            indexing: summary
        }

        field year type int {
            indexing: summary | attribute
        }

        field popularity type int {
            indexing: summary | attribute
        }
    }

    fieldset default {
        fields: artist, title, album
    }

    rank-profile song inherits default {
        first-phase {
            expression: nativeRank(artist,title,album) + if(isNan(attribute(popularity)) == 1, 0, attribute(popularity))
        }
    }
}

field

A field has a type, like string or double - see field reference for a full list.

Documents can have relations, and field values can be imported from parent documents.

Multivalue fields

A field can be singlevalue, like a string, or multivalue, like an array of strings - see the field type list.

Most multivalue fields can be used in grouping.

When searching an array or map of struct, the sameElement() query operator restricts matches to the same struct element. Note that the document summary will not show which element(s) matched.
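
As an illustration of sameElement(), the sketch below builds a YQL query string and the corresponding request path. The field names persons, first_name and last_name are hypothetical, the helper is not part of any Vespa client library, and values are inserted without escaping, so it is only suitable for simple words.

```python
from urllib.parse import urlencode


def same_element_yql(field: str, conditions: dict) -> str:
    """Build a YQL query where all conditions must match in one struct element."""
    clauses = ", ".join(
        f'{sub_field} contains "{value}"' for sub_field, value in conditions.items()
    )
    return f"select * from sources * where {field} contains sameElement({clauses});"


# "persons", "first_name" and "last_name" are hypothetical names.
yql = same_element_yql("persons", {"first_name": "Joe", "last_name": "Smith"})
print(yql)
print("/search/?" + urlencode({"yql": yql}))
```

Without sameElement(), a document with one element for "Joe Doe" and another for "Jane Smith" could match both conditions through different elements.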

Accessing attributes in maps and arrays of struct in ranking is not possible.

The rank feature attribute(name).count can be used in ranking to rank based on the number of elements in a multivalue attribute. To filter based on the number of elements, create a strict tiering rank function that places documents with too few elements in a tier that a rank-score-drop-limit removes, and pass the required number of elements as a query variable. Note that this filtering is more expensive to evaluate than keeping the count in a separate field.
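
The filtering idea can be simulated outside Vespa. This is a sketch of the logic only, not a rank profile: the function names and the limit value are hypothetical, normal scores are assumed to be non-negative, and hits scoring at or below the limit are assumed to be the ones dropped.

```python
# Scores at or below this limit are dropped (assumed drop-limit semantics).
DROP_LIMIT = -1.0


def tiered_score(element_count: int, min_count: int, base_score: float) -> float:
    """Strict tiering: too few elements puts the hit in the tier that is dropped."""
    return base_score if element_count >= min_count else DROP_LIMIT


def filter_and_rank(hits, min_count):
    """hits: (doc_id, element_count, base_score) tuples; base_score assumed >= 0."""
    scored = [
        (doc_id, tiered_score(count, min_count, score))
        for doc_id, count, score in hits
    ]
    kept = [hit for hit in scored if hit[1] > DROP_LIMIT]
    return sorted(kept, key=lambda hit: hit[1], reverse=True)


hits = [("a", 3, 0.5), ("b", 1, 0.9), ("c", 2, 0.7)]
print(filter_and_rank(hits, min_count=2))  # [('c', 0.7), ('a', 0.5)]
```

Here min_count plays the role of the query variable, and every hit must still be scored before it can be dropped, which is why a separate count field is cheaper.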

indexing

The indexing statement configures how the data of a field is processed during indexing - the most important instructions are:

index: For unstructured text. Create a text index for this field. Text matching and all text ranking features become available. Indexes are disk backed and do not need to fit in memory. See the reference and the index details.
attribute: For structured data. Keep this field in memory in a forward structure. This makes the field available for grouping, sorting and ranking. Attributes may also be searched by complete match (word or exact), or (for numerical fields) by range. Optionally, a B-tree in memory can also be created by adding the fast-search option. This improves performance if the attribute is a strong criterion in queries (i.e. it filters out many documents). See the reference and the attribute details.
summary: Include this field in the document summary in search result sets. See the reference and the document summary details.
Indexing instructions have pipeline semantics similar to unix shell commands, with data flowing from left to right. They can perform complex transformations on field values, or just pass the field value unchanged to the next stage of the pipeline - example:
indexing: summary | attribute | index
The data is first added to the document summary, then added as an in-memory attribute and finally indexed. The indexing language offers more functionality than this, such as filtering field values, combining field values, and selecting among different values. Learn more in the indexing language reference.

fieldset

A fieldset groups fields together for searching - example:

/search/?query=title:sometitle default:someword
This query returns documents having sometitle in the field title, and someword in one or more of the fields in the fieldset default. If no field or fieldset name is given for a search term, the fieldset named default is searched. Find details in the fieldset reference.
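
When such a query is sent over HTTP, the colon and space must be URL-encoded. The sketch below builds the request path with Python's standard library; the helper name is hypothetical and only the query parameter from the example above is used.

```python
from urllib.parse import urlencode


def fielded_query_url(terms) -> str:
    """terms: (field_or_fieldset, word) pairs; use None to search 'default'."""
    query = " ".join(f"{name}:{word}" if name else word for name, word in terms)
    return "/search/?" + urlencode({"query": query})


print(fielded_query_url([("title", "sometitle"), ("default", "someword")]))
# /search/?query=title%3Asometitle+default%3Asomeword
print(fielded_query_url([("title", "sometitle"), (None, "someword")]))
# /search/?query=title%3Asometitle+someword
```

The two requests are intended to be equivalent, since a term without a field or fieldset name is searched in the fieldset named default.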

rank-profile

Vespa has built-in rank profiles, and custom rank profiles can be configured, by hand or using machine learning. Read more in the ranking documentation.

Multiple Search Definitions

Some applications need to search more than one kind of data, each described by its own search definition. There are two ways to do this:

  1. Deploy multiple search definitions to one indexed content cluster
  2. Deploy each search definition in a separate indexed content cluster
In both solutions, the search container is used to blend results for searches that query multiple search definitions. The pros and cons of each approach, and how to configure them, are described below.

It is possible to combine the two methods, having some clusters serving one large search definition each, in combination with other clusters serving many small search definitions.

Multiple Search Definitions in one indexed content cluster

Vespa is able to serve multiple search definitions (document types) from the same indexed content cluster (this is not supported for streaming indexing mode). This is done inside the Vespa search core, but it still behaves in most ways as a collection of clusters each serving one document type; so a search that queries multiple search definitions will still be split before it is sent from the container, even if the same search core process will handle several types.

  • You need to manage only one instance of the set of search processes regardless of the number of search definitions.
  • You can no longer start, stop, or otherwise manage the serving for the different search definitions independently, since they are running in the same process.
  • Even if you want different physical clusters at runtime it may be convenient to use this feature during development as it makes it easier to develop and test on a single node.
To enable, add the .sd-file and configure the document type (example below) - then deploy:
<documents>
  <document mode="index" type="book" />
  <document mode="index" type="pc" />
  <document mode="index" type="sock" />
</documents>

Multiple indexed content clusters

Vespa can query multiple indexed content clusters for each search and do customizable blending of the results at the container level. Each indexed content cluster can be configured with its own search definition (document type). The consequences of this, compared to the method above, are as follows:

  • You will need to manage one instance of the entire set of search processes for each kind of search definition.
  • You can separate different types of load onto different machines (say one search definition has big documents and needs lots of disk space, while another has many small documents with attributes that just need memory).
  • You can manage the processes separately (so you can stop serving one document type and delete the indexes for it without impacting the other types).
To use this feature, configure multiple indexed content clusters, each having one search definition and selecting one kind of document. Below is an example of a Vespa setup configuration using three nodes to set up three indexed content clusters:
<content version="1.0" id="book">
  <redundancy>1</redundancy>
  <documents>
    <document mode="index" type="book"/>
  </documents>
  <group>
    <node distribution-key="0" hostalias="HOST1"/>
  </group>
</content>
<content version="1.0" id="pc">
  <redundancy>1</redundancy>
  <documents>
    <document mode="index" type="pc"/>
  </documents>
  <group>
    <node distribution-key="0" hostalias="HOST2"/>
  </group>
</content>
<content version="1.0" id="sock">
  <redundancy>1</redundancy>
  <documents>
    <document mode="index" type="sock"/>
  </documents>
  <group>
    <node distribution-key="0" hostalias="HOST3"/>
  </group>
</content>
No restarts are needed when adding clusters (assuming no content host overlap) - also refer to admin procedures.

Document Inheritance

Regardless of how you choose to deploy your search definitions, it is likely that some of them contain fields that are common across some or all of the documents. Vespa supports document inheritance to collect those fields in one or more super-documents. To search across different search definitions with one search, you should define common fields in super-documents. The result is not well defined when you search a field that is defined independently by two different search definitions.

To let a document inherit another, just add inherits [document-name] after the document name. Multiple inheritance is also supported, for example:

document cod inherits food, fish {
    …
}
Multiple levels of inheritance work as well: fish may inherit animal, and so on.

Overriding Super-Fields

Overriding super-fields is not allowed, but the same effect can be achieved in another way. Keep in mind that only documents are inherited. The recommended way is to separate the definition of the field from its search-specific settings: create an external field which takes the physical field as input and applies the search-specific settings. This can be done in both the base type and the inherited type, which leaves full freedom to configure each one independently.

Sharing Super-Documents Across Indexed Content Clusters

If you have multiple indexed content clusters, you may want to have search definitions containing documents which should be available for inheriting in multiple indexed content clusters. To do this, you need to list them in the documents tag, along with the other search definitions. For example:

<content version="1.0" id="myid">
  <documents>
    <document mode="index" type="base" />
  </documents>
  …
</content>

Searching multiple document types

In an application with multiple types of data, decide which data to search in each query. Read up on federation before continuing. The following always applies:

  • Vespa will by default search all document types and all clusters in parallel, and blend results based on relevancy. A query may end up with just hits of one type, or some mixture of different data, depending on which type had the most relevant results. It is possible to define different ranking for each search definition, and Vespa will apply the correct ranking for each hit, depending on the document type.
  • To limit the search to a subset of the types inside indexed content clusters, set restrict to a comma-separated list of search definition (document type) names. This is typically used when having multiple document types in one cluster (method 1). Example:
    /search/?query=lotr&restrict=music,book
    
  • To limit the search to a subset of the indexed content clusters, set sources to a comma-separated list of indexed content cluster names. This is typically used when having multiple clusters with one document type in each (method 2). Example:
    /search/?query=lotr&sources=music_cluster,book_cluster
    
  • To specify the number of results to return of each type, instead of letting this be decided by relevancy, either submit one request per type, or write a searcher.
  • The blending of hits from different sources is limited to simple relevancy blending. For more sophisticated blending, e.g. include a minimum number of hits of each type, prefer some type of results over others for certain queries and so on, write a searcher.
  • Searches to indexes that are only present in some of the sources will not return results from other sources (that is, this works as expected).
  • The behavior of Vespa is undefined if searching for an index which has the same name, but different attributes across multiple search definitions (one will not get entirely correct results). It is legal to change relevancy boosts and relevancy type.
  • In addition to blending the hits, Vespa will also unify grouping information about the same field from multiple document types.
The above is true regardless of whether using multiple document types in one cluster (method 1) or multiple clusters (method 2). A search in a Vespa installation having multiple clusters is dispatched to all (selected) clusters in parallel. When multiple types are deployed on one cluster (method 1), it behaves the same way: the search is dispatched in parallel multiple times (once for each selected document type), now to the same indexed content cluster.
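
The restrict and sources parameters described above can be combined with a query when building the request path. This is a minimal sketch using Python's standard library; the helper name is hypothetical, and the type and cluster names are the ones from the examples above.

```python
from urllib.parse import urlencode


def federated_search_url(query: str, restrict=(), sources=()) -> str:
    """Build a search path, optionally limited to document types or clusters."""
    params = {"query": query}
    if restrict:
        # Comma-separated search definition (document type) names.
        params["restrict"] = ",".join(restrict)
    if sources:
        # Comma-separated indexed content cluster names.
        params["sources"] = ",".join(sources)
    # Keep commas unescaped so the path matches the examples above.
    return "/search/?" + urlencode(params, safe=",")


print(federated_search_url("lotr", restrict=["music", "book"]))
# /search/?query=lotr&restrict=music,book
print(federated_search_url("lotr", sources=["music_cluster", "book_cluster"]))
# /search/?query=lotr&sources=music_cluster,book_cluster
```

Leaving both arguments empty produces a request that searches all document types and all clusters, which is the default behavior described above.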