Search Definitions

Data in Vespa is modeled as Documents. Search Definitions configure the document types and how they should be stored, indexed, ranked, searched and presented. A search definition is roughly analogous to a schema. Documents adhere strictly to a configured document type. A Vespa system can have multiple document types, and each document type is defined in a search definition. The values in each field have an associated data type like string, long, array - see the search definitions reference for a complete list.

Some commonly used field properties are described in this document - there are other properties that provide finer control over ranking, linguistics and other advanced features. Read more in the search definition reference.

Search definition files have the suffix .sd, and the file name without the suffix must be the same as the search name declared inside the file. Vespa applications must have at least one search definition in the application package - the search definition is a required file.
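The naming rule above can be checked before deploying. The following is a minimal sketch, not part of Vespa: the helper names are hypothetical, and the regular expression assumes the file declares its name on a line of the form "search <name> {", as in the examples in this document.

```python
import re
from pathlib import PurePath
from typing import Optional


def declared_search_name(sd_text: str) -> Optional[str]:
    """Return the name from the first 'search <name> {' line, or None."""
    match = re.search(r"^\s*search\s+(\w+)\s*\{", sd_text, flags=re.MULTILINE)
    return match.group(1) if match else None


def file_name_matches(path: str, sd_text: str) -> bool:
    """True when the file is '<name>.sd' and <name> is the declared search name."""
    p = PurePath(path)
    return p.suffix == ".sd" and p.stem == declared_search_name(sd_text)


# Assumed sample content, shortened from the music.sd example.
MUSIC_SD = "search music {\n    document music {\n    }\n}\n"
```

For example, file_name_matches("music.sd", MUSIC_SD) is true, while the same content stored as songs.sd would fail the check.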

A document has a unique Document Id. Documents can have relations - read more about document references.

Search definitions can be modified without restarts.

Example music.sd:

search music {
    document music {
        field artist type string {
            indexing: summary | index
        }

        field artistId type string {
            indexing: summary | attribute
        }

        field title type string {
            indexing: summary | index
        }

        field album type string {
            indexing: index
        }

        field duration type int {
            indexing: summary
        }

        field year type int {
            indexing: summary | attribute
        }

        field popularity type int {
            indexing: summary | attribute
        }
    }

    fieldset default {
        fields: artist, title, album
    }

    rank-profile song inherits default {
        first-phase {
            expression: nativeRank(artist,title,album) + if(isNan(attribute(popularity)) == 1, 0, attribute(popularity))
        }
    }
}
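The first-phase expression in the song rank profile adds the popularity attribute to the text match score, and falls back to 0 when popularity has no value. As a rough illustration of that arithmetic only, here is a Python sketch. The nativeRank score is taken as a plain input number, and modeling a missing attribute as NaN is an assumption drawn from the isNan check in the expression.

```python
import math


def first_phase(native_rank: float, popularity: float) -> float:
    """Mirror of: nativeRank(...) + if(isNan(popularity) == 1, 0, popularity)."""
    # A document without a popularity value contributes no boost.
    boost = 0.0 if math.isnan(popularity) else popularity
    return native_rank + boost


with_popularity = first_phase(0.25, 10)
without_popularity = first_phase(0.25, float("nan"))
```

Without the isNan guard, adding NaN would turn the whole score into NaN, which is why the expression substitutes 0.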

indexing

indexing configures how to process the data of a field during indexing - the most important indexing statements are:

  • index: Make this field's content part of a searchable index. See the reference.
  • attribute: Create an in-memory attribute to enable sorting or grouping. Note some match methods are not available for attributes. See the reference and the attribute details.
  • summary: Include this field in the document summary in search result sets. See the reference.

Indexing instructions have pipeline semantics similar to unix shell commands, with data flowing from left to right. They can perform complex transformations on field values, or just send the field value unchanged to the next sections of the index structure - example:
indexing: summary | attribute | index
The data is first added to the document summary, then added as an in-memory attribute and finally indexed. The indexing language offers more functionality than this, such as filtering field values, combining field values and selecting on different values. Learn more in the indexing language reference.

fieldset

A fieldset groups fields together for searching - example:

search/?query=title:sometitle default:someword
This query returns documents having sometitle in the field title, and someword in one or more of the fields in the fieldset default. If no field/field set name is given for a search term, the fieldset named default is searched. Find details in the fieldset reference.
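How a search term maps to fields can be sketched as a small lookup. This is a toy model of the rule described above, not Vespa's query parser: the function name is hypothetical, and the fieldset contents are copied from the music.sd example.

```python
from typing import Dict, List

# Fieldsets as configured in the music.sd example.
FIELDSETS: Dict[str, List[str]] = {"default": ["artist", "title", "album"]}


def fields_for_term(term: str, fieldsets: Dict[str, List[str]] = FIELDSETS) -> List[str]:
    """Return the fields a term like 'title:sometitle' or 'someword' is matched against."""
    name, separator, _ = term.partition(":")
    if not separator:
        # No field or fieldset name given: the fieldset named default is searched.
        name = "default"
    # A fieldset name expands to its fields; any other name is a single field.
    return fieldsets.get(name, [name])


title_fields = fields_for_term("title:sometitle")
default_fields = fields_for_term("someword")
```

Here title_fields contains only title, while both "someword" and "default:someword" expand to artist, title and album.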

rank-profile

Vespa has built-in rank profiles, and custom rank profiles can be configured, by hand or using machine learning. Read more in the ranking documentation.

Document references

Documents can refer to other documents using references to global documents. Using a reference, fields can be imported from parent types into the child's search definition and used for matching, ranking, grouping and sorting.

Note: This feature is ready for production for applications with a data distribution where all documents are on each node.

Example:

search campaign {
    document campaign {
        field budget type int { indexing: attribute }
    }
}
[
  { "put": "id:test:campaign::thebest", "fields": { "budget": 20 } },
  { "put": "id:test:campaign::nextbest", "fields": { "budget": 10 } }
]
search salesperson {
    document salesperson {
        field name type string { indexing: attribute }
    }
}
[
  { "put": "id:test:salesperson::johndoe", "fields": { "name": "John Doe" } }
]
search ad {
    document ad {
        field campaign_ref type reference<campaign> {
            indexing: attribute
        }
        field other_campaign_ref type reference<campaign> {
            indexing: attribute
        }
        field salesperson_ref type reference<salesperson> {
            indexing: attribute
        }
    }
        
    import field campaign_ref.budget as budget {}
    import field salesperson_ref.name as salesperson_name {}

    document-summary my_summary {
        summary budget type int {}
        summary salesperson_name type string {}
    }
}
[
  { "put": "id:test:ad::1", "fields": {
      "campaign_ref": "id:test:campaign::thebest",
      "other_campaign_ref": "id:test:campaign::nextbest",
      "salesperson_ref": "id:test:salesperson::johndoe" }
  }
]
Document type ad has two references to campaign (via campaign_ref and other_campaign_ref) and one reference to salesperson (via salesperson_ref). The budget field from campaign is imported into the ad search definition (via campaign_ref) and given the name budget. Similarly, the name of salesperson is imported as salesperson_name. To use the imported fields in summary we have created a document summary my_summary containing these fields.
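The effect of the imports can be pictured as a lookup through the reference at read time. The sketch below is a simplified model using plain dictionaries, not how Vespa stores or resolves references. The document ids and field names come from the example above, while the function and variable names are hypothetical.

```python
# Parent documents (campaign and salesperson), keyed by document id.
PARENTS = {
    "id:test:campaign::thebest": {"budget": 20},
    "id:test:campaign::nextbest": {"budget": 10},
    "id:test:salesperson::johndoe": {"name": "John Doe"},
}

# The child document id:test:ad::1 only stores references.
AD = {
    "campaign_ref": "id:test:campaign::thebest",
    "other_campaign_ref": "id:test:campaign::nextbest",
    "salesperson_ref": "id:test:salesperson::johndoe",
}

# import field <reference field>.<parent field> as <name>
IMPORTS = {
    "budget": ("campaign_ref", "budget"),
    "salesperson_name": ("salesperson_ref", "name"),
}


def my_summary(ad_fields, parents):
    """Resolve each imported field through its reference field."""
    summary = {}
    for name, (ref_field, parent_field) in IMPORTS.items():
        parent = parents.get(ad_fields.get(ref_field), {})
        summary[name] = parent.get(parent_field)
    return summary


before = my_summary(AD, PARENTS)
# One update to the parent is visible through every child that refers to it.
PARENTS["id:test:campaign::thebest"]["budget"] = 25
after = my_summary(AD, PARENTS)
```

Note that other_campaign_ref is stored but not imported, so the budget of nextbest never appears in the summary.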

Using the parent-child relationship, data does not have to be normalized, as fields from parents are imported into children. When a field's value is shared between many documents, store it in the parent and update it there to limit the number of updates. This also limits the resources (memory / disk) required to store and handle documents on content nodes.

As all global documents are distributed to all nodes, node capacity will limit the number of such documents. Global documents are hence called parents - there are usually fewer parents than children. Note that parents can have parents.

The following types of references are not supported:

  • Self-reference.
  • Transitive reference. A document type child cannot import fields from its grandparent via a reference to parent (and its reference to grandparent). In this case child needs an explicit reference to grandparent.
  • Cyclic reference. If document type foo has a reference to bar, then bar cannot have a reference to foo.
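The self-reference and cyclic reference rules can be checked on a map from each document type to the types it references. This is an illustrative sketch with hypothetical names, not a Vespa tool. It only covers the two cases stated above (a type referencing itself, and two types referencing each other) and says nothing about longer cycles or transitive imports.

```python
from typing import Dict, List, Set


def unsupported_references(refs: Dict[str, Set[str]]) -> List[str]:
    """List self-references and mutual references in a type -> referenced types map."""
    problems = []
    for doc_type, targets in sorted(refs.items()):
        for target in sorted(targets):
            if target == doc_type:
                problems.append(f"self-reference: {doc_type}")
            elif doc_type in refs.get(target, set()) and doc_type < target:
                # Report each mutual pair once, from its alphabetically first member.
                problems.append(f"cyclic reference: {doc_type} <-> {target}")
    return problems


# The ad example from this section has no unsupported references.
ad_refs = {"ad": {"campaign", "salesperson"}, "campaign": set(), "salesperson": set()}
ad_problems = unsupported_references(ad_refs)
cyclic_problems = unsupported_references({"foo": {"bar"}, "bar": {"foo"}})
```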

Document expiry

To auto-expire documents, use a selection with now. Example, keep music-documents for a day:

<documents garbage-collection="true">
  <document type="music" selection="music.timestamp > now() - 86400" />
</documents>
The timestamp field must have values in seconds since the Unix epoch:
field timestamp type long {
    indexing: attribute
    attribute {
        fast-access
    }
}
Notes:
  • Using a selection with now can have side effects when re-feeding or re-processing documents, as timestamps can be stale. A common problem is feeding with too old timestamps, resulting in no documents being indexed.
  • Deploying a configuration where the selection string selects no documents will cause all documents to be garbage collected. Use visit to test the selection string. Garbage collected documents should not be expected to be recoverable.
  • When using this feature, searchable copies in a content cluster should be the same as the redundancy and the selection expression should only reference fields that are attribute vectors. Otherwise, the document selection maintenance will be slow and have a major performance impact on the system.
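To make the arithmetic in the selection concrete, the following sketch evaluates the same comparison, timestamp > now - 86400, in Python. It only illustrates the expression and is not how Vespa evaluates selections. The function name and the optional now parameter are assumptions added so the example is deterministic.

```python
import time
from typing import Optional

DAY_SECONDS = 86400  # one day, as in the selection above


def is_kept(timestamp: int, now: Optional[float] = None) -> bool:
    """True when a document with this timestamp (epoch seconds) matches the selection."""
    current = time.time() if now is None else now
    return timestamp > current - DAY_SECONDS


fed_at = 1_000_000
kept_after_an_hour = is_kept(fed_at, now=fed_at + 3600)
kept_after_a_day = is_kept(fed_at, now=fed_at + DAY_SECONDS)
```

A document that is re-fed with its original timestamp after more than a day fails the comparison at once, which is the stale timestamp problem described in the notes.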

To batch remove, set a selection that matches no documents, like "!music"

Multiple Search Definitions

Some applications need to search more than one kind of data, each described by its own search definition. There are two ways to do this:

  1. Deploy multiple search definitions to one indexed content cluster
  2. Deploy each search definition in a separate indexed content cluster
In both solutions, the search container is used to blend results for searches that query multiple search definitions. The pros and cons of each approach, and how to configure it, are described below.

It is possible to combine the two methods, having some clusters serving one large search definition each, in combination with other clusters serving many small search definitions.
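Blending results by relevancy amounts to merging several result lists that are already sorted by score. The sketch below shows that idea only and is not the container's implementation. The function name, the (relevance, document id) tuple layout and the sample hits are all made up for illustration.

```python
import heapq
from itertools import islice
from typing import Iterable, List, Tuple

Hit = Tuple[float, str]  # (relevance, document id)


def blend(result_sets: Iterable[List[Hit]], hits: int = 10) -> List[Hit]:
    """Merge result lists, each sorted by descending relevance, and keep the top hits."""
    merged = heapq.merge(*result_sets, key=lambda hit: hit[0], reverse=True)
    return list(islice(merged, hits))


# Hypothetical results from two document types.
books = [(0.9, "book:1"), (0.4, "book:2")]
music = [(0.7, "music:1"), (0.1, "music:2")]
top3 = blend([books, music], hits=3)
```

With purely relevancy-based blending, the mix of types in the result depends only on the scores, so one type can fill the whole result.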

Multiple Search Definitions in one indexed content cluster

Vespa is able to serve multiple search definitions (document types) from the same indexed content cluster (this is not supported for streaming indexing mode). This is done inside the Vespa search core, but it still behaves in most ways as a collection of clusters each serving one document type; so a search that queries multiple search definitions will still be split before it is sent from the container, even if the same search core process will handle several types.

  • You need to manage only one instance of the set of search processes regardless of the number of search definitions.
  • You can no longer start, stop, or otherwise manage the serving for the different search definitions independently, since they are running in the same process.
  • Even if you want different physical clusters at runtime it may be convenient to use this feature during development as it makes it easier to develop and test on a single node.
To enable, add the .sd-file and configure the document type (example below) - then deploy:
<documents>
  <document mode="index" type="book" />
  <document mode="index" type="pc" />
  <document mode="index" type="sock" />
</documents>

Multiple indexed content clusters

Vespa can query multiple indexed content clusters for each search and do customizable blending of the results at the container level. Each indexed content cluster can be configured with its own search definition (document type). The consequences of this, compared to the method above, are as follows:

  • You will need to manage one instance of the entire set of search processes for each kind of search definition.
  • You can separate different types of loads on different machines (say one search definition has big documents and needs lots of disk space, another has many small documents with attributes that just need memory).
  • You can manage the processes separately (so you can stop serving one document type and delete the indexes for it without impacting the other types).
To use this feature, configure multiple indexed content clusters, each having one search definition and selecting one kind of document. Below is an example of a Vespa setup configuration using three nodes to set up three indexed content clusters:
<content version="1.0" id="book">
  <redundancy>1</redundancy>
  <documents>
    <document mode="index" type="book"/>
  </documents>
  <group>
    <node distribution-key="0" hostalias="HOST1"/>
  </group>
</content>
<content version="1.0" id="pc">
  <redundancy>1</redundancy>
  <documents>
    <document mode="index" type="pc"/>
  </documents>
  <group>
    <node distribution-key="0" hostalias="HOST2"/>
  </group>
</content>
<content version="1.0" id="sock">
  <redundancy>1</redundancy>
  <documents>
    <document mode="index" type="sock"/>
  </documents>
  <group>
    <node distribution-key="0" hostalias="HOST3"/>
  </group>
</content>
No restarts are needed when adding clusters (assuming no content host overlap) - also refer to admin procedures.

Document Inheritance

Regardless of how you choose to deploy your search definitions, it is likely that some of them contain fields that are common across some or all of the documents. Vespa supports document inheritance to collect those fields in one or more super-documents. To search across different search definitions with one search, define the common fields in super-documents. The result is not well defined when you search a field that is defined independently by two different search definitions.

To let a document inherit another, just add inherits [document-name] after the document name. Multiple inheritance is also supported, for example:

  document cod inherits food, fish {
    …
  }
Multiple levels of inheritance work as well: fish may inherit animal, and so on.
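The set of fields a document type ends up with under multiple and multi-level inheritance can be sketched as a recursive merge. This is a simplified model, not Vespa's implementation. The document type names follow the cod, food, fish and animal example, but the field names and types are invented for illustration.

```python
from typing import Dict, List

# Hypothetical fields declared directly in each document type.
OWN_FIELDS: Dict[str, Dict[str, str]] = {
    "animal": {"species": "string"},
    "fish": {"habitat": "string"},
    "food": {"calories": "int"},
    "cod": {"catch_area": "string"},
}

# document cod inherits food, fish; document fish inherits animal
INHERITS: Dict[str, List[str]] = {"cod": ["food", "fish"], "fish": ["animal"]}


def all_fields(doc_type: str) -> Dict[str, str]:
    """Collect inherited fields first, then the fields declared in the type itself."""
    fields: Dict[str, str] = {}
    for parent in INHERITS.get(doc_type, []):
        fields.update(all_fields(parent))
    fields.update(OWN_FIELDS.get(doc_type, {}))
    return fields


cod_fields = sorted(all_fields("cod"))
```

This sketch does not check for redefined fields. It simply merges them, so it does not model any rule about overriding.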

Overriding Super-Fields

Overriding super-fields is not allowed, but the same effect can be achieved in other ways. Keep in mind that only documents are inherited. The recommended way is to separate the definition of the field from its search-specific settings: create an external field which takes the physical field as input and applies the search-specific settings. This can be done in both the base and the inheriting type, so each type can configure the field independently.

Sharing Super-Documents Across Indexed Content Clusters

If you have multiple indexed content clusters, you may want to have search definitions containing documents which should be available for inheriting in multiple indexed content clusters. To do this, you need to list them in the documents tag, along with the other search definitions. For example:

<content version="1.0" id="myid">
  <documents>
    <document mode="index" type="base" />
  </documents>
  …
</content>

Searching multiple document types

In an application with multiple types of data, decide which data to search in each query. Read up on federation before continuing. The following always apply:

  • Vespa will by default search all document types and all clusters in parallel, and blend results based on relevancy. A query may end up with just hits of one type, or some mixture of different data, depending on which type had the most relevant results. It is possible to define different ranking for each search definition, and Vespa will apply the correct ranking for each hit, depending on the document type.
  • To limit the search to a subset of the types inside indexed content clusters, set restrict to a comma-separated list of search definition (document type) names. This is typically used when having multiple document types in one cluster (method 1). Example:
    /search/?query=lotr&restrict=music,book
    
  • To limit the search to a subset of the indexed content clusters, set sources to a comma-separated list of indexed content cluster names. This is typically used when having multiple clusters with one document type in each (method 2). Example:
    /search/?query=lotr&sources=music_cluster,book_cluster
    
  • To specify the number of results to return of each type, instead of letting this be decided by relevancy, either submit one request per type, or write a searcher.
  • The blending of hits from different sources is limited to simple relevancy blending. For more sophisticated blending, e.g. include a minimum number of hits of each type, prefer some type of results over others for certain queries and so on, write a searcher.
  • Searches to indexes that are only present in some of the sources will not return results from other sources (that is, this works as expected).
  • The behavior of Vespa is undefined if searching for an index which has the same name, but different attributes across multiple search definitions (one will not get entirely correct results). It is legal to change relevancy boosts and relevancy type.
  • In addition to blending the hits, Vespa will also unify grouping information about the same field from multiple document types.
The above is true regardless of whether using multiple document types in one cluster (method 1) or multiple clusters (method 2). A search in a Vespa installation having multiple clusters is dispatched to all (selected) clusters in parallel. When multiple types are deployed on one cluster (method 1), it behaves just the same: the search is dispatched in parallel multiple times (once for each selected document type), now to the same indexed content cluster.
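The restrict and sources parameters are plain comma-separated lists in the query string, so a client can assemble them with the standard library. This is a minimal sketch with a hypothetical helper name. The /search/ path and the parameter names are taken from the examples above.

```python
from typing import Iterable, Optional
from urllib.parse import urlencode


def search_url(query: str,
               restrict: Optional[Iterable[str]] = None,
               sources: Optional[Iterable[str]] = None) -> str:
    """Build a search URL limited to some document types and/or content clusters."""
    params = {"query": query}
    if restrict:
        params["restrict"] = ",".join(restrict)  # document type names
    if sources:
        params["sources"] = ",".join(sources)    # content cluster names
    # Keep commas unescaped so the lists stay readable in the URL.
    return "/search/?" + urlencode(params, safe=",")


by_type = search_url("lotr", restrict=["music", "book"])
by_cluster = search_url("lotr", sources=["music_cluster", "book_cluster"])
```

The two calls reproduce the example URLs shown in the list above.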

Modify Search Definitions

This section describes how a search definition in a live application can be modified - categories:

  1. Valid changes without restart or re-feed
  2. Changes that require restart but not re-feed
  3. Changes that require re-feed
When running vespa-deploy prepare on a new application package, the changes in the search definition files are compared with the files in the current active package. If some of the changes require restart or re-feed, the output from vespa-deploy prepare specifies which actions are needed.

Note: If there are changes to perform on a live system that are not covered by this document and no output is given from vespa-deploy prepare, their impact is undefined and in no way guaranteed to allow a system to stay live until re-feeding. Changes not related to the search definition are discussed in admin procedures.

It is best practice to try changes in a staging system first.

Valid changes without restart or re-feed

Procedure:

  1. Run vespa-deploy prepare on the changed application.
  2. Run vespa-deploy activate. The changes will take effect immediately.
Changes:
  • Add a new document field: Add a new document field as index, attribute, summary or any combination of these. Existing documents will implicitly get the new field with no content. Documents fed after the change can specify the new field. If the field has existed with the same type earlier, old content may or may not reappear.
  • Remove a document field: Existing documents will no longer see the removed field, but the field data is not completely removed from the search node.
  • Add or remove an existing document field from document summary: Add an existing field to summary or any number of summary classes, and remove an existing field from summary or any number of summary classes.
  • Remove the attribute aspect from a field that is also an index field: This is the only scenario of changing the attribute aspect of a document field that is allowed without restart.
  • Add, change or remove field sets: Change fieldsets used to group fields together for searching.
  • Change the alias or sorting attribute settings for an attribute field
  • Add, change or remove rank profiles
  • Change document field weights
  • Add, change or remove field aliases
  • Add, change or remove rank settings for a field. Example: Adding rank: filter to a field.
  • Add or remove a search definition: Removing a search definition file will make proton drop all documents of that type - subsequently releasing memory and disk.

Changes that require restart but not re-feed

Procedure:

  1. Run vespa-deploy prepare on the changed application. Output specifies which restart actions are needed.
  2. Run vespa-deploy activate.
  3. Restart the services specified in the prepare output.
Changes:
  • Change the attribute aspect of a document field: Add or remove a field as attribute
  • Change the attribute settings for an attribute field: Change the following attribute settings: fast-search, fast-access, huge
Example: Given an indexed content cluster search:
search test {
  document test {
    field f1 type string { indexing: summary }
  }
}
Then add field f1 as an attribute:
search test {
  document test {
    field f1 type string { indexing: attribute | summary }
  }
}
The following is output from vespa-deploy prepare - which restart actions are needed:
WARNING: Change(s) between active and new application that require restart:
In cluster 'basicsearch' of type 'search':
    Restart services of type 'searchnode' because:
        1) Document type 'test': Field 'f1' changed: add attribute aspect

Changes that require re-feed

All of the changes listed below require re-feeding of all documents. Unless a change is listed in the above sections, treat it as if it were listed here. Until re-feed is complete, affected fields will be empty or have potentially wrong annotations not matching the query processing. Procedure:

  1. Run vespa-deploy prepare on the changed application. Output specifies which re-feed actions are needed.
  2. Stop feeding and wait until it is done.
  3. Run vespa-deploy activate.
  4. Re-feed all documents.
Changes:
  • Change the data type or collection type of a document field: Existing documents will no longer have any content for this field. To populate the field, re-feed the existing documents using the new type for this field. There will be no automatic conversion from old to new field type.
    NOTE: If not re-feeding after such a change, serving works, but searching this field will not give any results.
  • Change index aspect of a document field: This changes the document processing pipeline before documents arrive in the backend. Only documents fed after index aspect was added will have annotations and be present in the reverse index. Only documents fed after index aspect was removed will avoid disk bloat due to unneeded annotations.
  • Change fields from static to dynamic summary, or vice versa
  • Switch stemming/normalizing on or off: This changes the document processing pipeline before documents arrive in the backend, and what annotations are made for an indexed field.
    NOTE: If not re-feeding after such a change, serving works, but recall is undefined as the index has been produced using a different setting than the one used when doing stemming/normalizing of the query terms.
  • Switch bolding on or off
  • Add, change or remove match settings for a field. Example: Adding match: word to a field. This changes the document processing pipeline before documents arrive in the backend, and what annotations are made for an indexed field.
    NOTE: If not re-feeding after such a change, serving works, but recall is undefined as the index has been produced using one match mode while run-time is using a different match mode.
  • Change the tensor type of a tensor attribute
Example: Given an indexed content cluster search:
search test {
  document test {
    field f1 type string { indexing: summary }
  }
}
Then add field f1 as an index:
search test {
  document test {
    field f1 type string { indexing: index | summary }
  }
}
The following is output from vespa-deploy prepare - which re-feed actions are needed:
WARNING: Change(s) between active and new application that require re-feed:
Re-feed document type 'test' in cluster 'basicsearch' because:
    1) Document type 'test': Field 'f1' changed: add index aspect, indexing script: '{ input f1 | summary f1; }' -> '{ input f1 | tokenize normalize stem:"SHORTEST" | index f1 | summary f1; }'
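A deployment script may want to stop before activating when the prepare step reports required actions. The sketch below scans captured output for the two warning headers shown in this document. It is an assumption that these headers are stable enough to match on, since only the two samples above are known here, and the function name is hypothetical.

```python
import re
from typing import List

# Matches the warning headers shown in the sample outputs above.
WARNING_RE = re.compile(r"require (restart|re-feed):")


def required_actions(prepare_output: str) -> List[str]:
    """Return the sorted, de-duplicated actions named in the prepare output."""
    return sorted(set(WARNING_RE.findall(prepare_output)))


RESTART_OUTPUT = (
    "WARNING: Change(s) between active and new application that require restart:\n"
    "In cluster 'basicsearch' of type 'search':\n"
    "    Restart services of type 'searchnode' because:\n"
    "        1) Document type 'test': Field 'f1' changed: add attribute aspect\n"
)
REFEED_OUTPUT = (
    "WARNING: Change(s) between active and new application that require re-feed:\n"
    "Re-feed document type 'test' in cluster 'basicsearch' because:\n"
)

restart_actions = required_actions(RESTART_OUTPUT)
both_actions = required_actions(RESTART_OUTPUT + REFEED_OUTPUT)
```

An empty list means no restart or re-feed warning was found in the text that was scanned. It does not prove that a change is safe.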