Document Summaries

Use document summaries to configure which fields to include in results. The default summary contains all fields that are possible to include in summaries; all other summaries will contain a subset of the fields included in the default summary.

Vespa keeps attribute type fields in memory and fetches those fields from memory when requested as part of document summaries. This means summaries are memory-only operations if all fields are attributes. The other document fields are stored as blobs/records in the document store. This record is used when processing summary requests that include fields in this record, and as needed during visiting or re-distribution of content to handle elasticity.

The default summary class always accesses the document store, because it includes the document ID, which is stored there. To include the document ID in a custom summary class, add a field for the ID and include it in the summary class.

When using additional summary classes to increase performance, only the network data size is changed - the data read from storage is unchanged. Having "debug" fields with summary enabled will hence also affect the amount of information that needs to be read from disk.

Use dynamic to generate dynamic abstracts of fields, based on search keywords.

Defining summary sets in the search definition

Define additional summary sets as described in the searchdefinition reference.

Example: the title and year fields are included in the titleyear summary.

# A basic search definition - called music, should be saved to music.sd
search music {

  # It contains one document type only - called music as well
  document music {

    field title type string {
      indexing: summary | index   
    }

    field artist type string {
      indexing: summary | attribute | index
    }

    field year type int {
      indexing: summary | attribute
    }

    field popularity type int {
      indexing: summary | attribute
    }

    field url type uri {
      indexing: summary | index
    }

  }

  document-summary titleyear {
    summary title type string {
      source: title
    }

    summary year type int {
      source: year
    }
  }
}

For more details on summary properties, see Search Definitions: Summary.
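
The memory-only rule described earlier can be checked against the music example above. The following is a minimal sketch in Python, not Vespa code: the function name needs_document_store and the artistyear summary are hypothetical, introduced only for illustration, and the model ignores the document ID and any other detail of how proton resolves fields.

```python
# Fields from the music example whose indexing statement includes "attribute".
ATTRIBUTE_FIELDS = {"artist", "year", "popularity"}


def needs_document_store(summary_fields, attribute_fields=ATTRIBUTE_FIELDS):
    """Return True if any field in the summary is not an attribute.

    Simplified model (an assumption): attribute fields are served from
    memory, all other fields are read from the document store.
    """
    return any(field not in attribute_fields for field in summary_fields)


# titleyear includes title, which is not an attribute.
titleyear_hits_store = needs_document_store(["title", "year"])

# Hypothetical summary class containing attribute fields only.
artistyear_hits_store = needs_document_store(["artist", "year"])
```

Under this model, titleyear still touches the document store because of the title field, while a summary restricted to attribute fields would not.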

Using summaries in queries

Use presentation.summary=[summary name] in the search request to choose the summary class to use (the default one is called default). See Search API. Example:

/search/?yql=select+*+from+sources+*+where+default+contains+"best"%3B&presentation.summary=titleyear

The select statement in YQL lists a set of fields to return. Vespa in general makes a best effort to return those fields, and only those fields, unless a wildcard ("*") is given as argument. The wildcard implies returning the full set of fields included in the given summary class.

When combined with a YQL statement, the summary argument defines the set of fields from which the YQL select then chooses a subset.

In other words, if the YQL expression is "select * …", and the summary argument is titleyear, all the fields in the summary class titleyear will be returned. If the select statement lists one or more fields (and summary is titleyear), the summary class titleyear is fetched, and the fields not listed in the select statement will be stripped away.
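
The interaction between select and the summary argument can be sketched in Python. This is an illustrative model only, not how Vespa implements it: build_search_path and returned_fields are hypothetical helper names, and the model ignores the best-effort behavior mentioned above.

```python
from urllib.parse import urlencode


def build_search_path(yql, summary="default"):
    """Build the /search/ path with URL-encoded yql and presentation.summary."""
    return "/search/?" + urlencode({"yql": yql, "presentation.summary": summary})


def returned_fields(summary_class_fields, select_fields=None):
    """Model which fields come back for a hit.

    select_fields=None stands for "select *": every field in the summary
    class is returned. Otherwise only the listed fields that are also in
    the summary class survive; the rest are stripped.
    """
    if select_fields is None:
        return list(summary_class_fields)
    wanted = set(select_fields)
    return [field for field in summary_class_fields if field in wanted]


path = build_search_path(
    'select * from sources * where default contains "best";', "titleyear"
)
all_fields = returned_fields(["title", "year"])
only_title = returned_fields(["title", "year"], ["title"])
```

Note that urlencode percent-encodes more characters than the hand-written URL above (for example the asterisk and the quotes); the decoded query is the same.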

Document Store Details

Documents are stored in the document store in proton. Put, update and remove operations are persisted in the transaction log before the document is updated in the document store. The operation is then acknowledged to the client, and its result is immediately visible in search results.

Note: If visibility-delay is set to non-zero, writes are batched (for better write performance) and delayed in search results.

Files in the document store are written sequentially, and occur in pairs - example:

-rw-r--r-- 1 owner users 4133380096 Aug 10 13:36 1467957947689211000.dat
-rw-r--r-- 1 owner users   71192112 Aug 10 13:36 1467957947689211000.idx

The maximum size (in bytes) per .dat file on disk is set using maxfilesize:

<content id="mycluster" version="1.0">
  <engine>
    <proton>
      <tuning>
        <searchnode>
          <summary>
            <store>
              <logstore>
                <maxfilesize>8000000000</maxfilesize>
              </logstore>
            </store>
          </summary>
        </searchnode>
      </tuning>
    </proton>
  </engine>
</content>

Notes:
  • The files are written in sequence. proton starts with one pair and grows it until it reaches maxfilesize. Once it is full, a new pair is started.
  • All pairs are hence immutable, except the last pair, which is the one being written to.
  • Documents for a given group or user (i.e. bucket) are therefore distributed across multiple pairs, depending on insertion order.
  • A streaming search hence potentially hits multiple files per bucket searched. This impacts search latency if disk access time is high.
  • Documents exist in multiple versions in multiple files. Older versions are compacted away when a pair is scheduled to become the new active pair: obsolete versions are removed, leaving only the active version of each document in a new file pair, which becomes the new active pair.
  • Read more on the implications of setting maxfilesize in proton maintenance jobs.
  • Files are written in chunks, using the configured compression settings.
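
The effect of insertion order on bucket spread, described in the notes above, can be illustrated with a toy model in Python. It is a sketch under simplifying assumptions, not proton's implementation: documents have a fixed size, there is no compaction, and the function names are hypothetical.

```python
def assign_to_files(doc_sizes, max_file_size):
    """Append documents in insertion order; start a new file when the current one is full.

    Returns the file index each document lands in.
    """
    placements = []
    file_index, used = 0, 0
    for size in doc_sizes:
        if used > 0 and used + size > max_file_size:
            file_index += 1  # current pair is full, start a new one
            used = 0
        placements.append(file_index)
        used += size
    return placements


def files_per_bucket(buckets, placements):
    """Count how many distinct files hold documents for each bucket."""
    spread = {}
    for bucket, file_index in zip(buckets, placements):
        spread.setdefault(bucket, set()).add(file_index)
    return {bucket: len(files) for bucket, files in spread.items()}


# Two buckets fed in interleaved order, 40 bytes per document, 100 bytes per file.
buckets = ["a", "b", "a", "b", "a", "b"]
placements = assign_to_files([40] * 6, max_file_size=100)
spread = files_per_bucket(buckets, placements)
```

In this run each bucket ends up in all three files, so visiting one bucket touches every file.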

Visiting the document store

A streaming search is a visit, by buckets, in sequence or (semi)parallel. Access to the document store is by local ID (LID). In proton, a bucket is just a property of the document's ID. As documents are added to the file pairs in insertion order, a scan of all documents in a bucket is a set of random file accesses, unless some kind of bucket localization is done. The set of document IDs for a given bucket is easily generated from memory structures and is assumed to consume few resources.

Defragmentation within files

Document store compaction (defragmentation) does two things:

  1. Removes stale versions of documents (i.e. old versions of updated documents). This is triggered when the disk bloat of the document store is larger than the total disk usage of the document store times diskbloatfactor. Refer to summary tuning for details.
  2. Co-locates documents by bucket. As streaming accesses are organized per bucket, this cuts latencies.
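
The trigger condition for removing stale versions can be written as a one-line check. The sketch below is in Python and only restates the rule from the text; the function name should_compact is hypothetical, and 0.25 is an illustrative diskbloatfactor, not necessarily the default.

```python
def should_compact(disk_bloat_bytes, disk_usage_bytes, diskbloatfactor):
    """Return True when disk bloat exceeds total disk usage times diskbloatfactor."""
    return disk_bloat_bytes > disk_usage_bytes * diskbloatfactor


# Illustrative numbers: 1000 bytes of total disk usage, factor 0.25.
triggered = should_compact(300, 1000, 0.25)      # 300 > 250
not_triggered = should_compact(100, 1000, 0.25)  # 100 <= 250
```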

Defragmentation status is hence best observed by tracking max_bucket_spread over time. A sawtooth pattern is normal for corpora that change over time. The document_store_compact metric tracks when proton is running compaction jobs. If the compaction settings are too tight, the metric is always, or close to, 1.

When benchmarking, it is hence important to set the correct compaction settings, to make sure that proton has completed compaction beforehand (this can take hours), and to verify that it is not actively compacting (document_store_compact should be 0 most of the time).

Defragmentation across files

There is no bucket-compaction across files - documents will not move between files.

Optimized reads using chunks

As documents are clustered within the .dat file, proton optimizes reads by reading larger chunks when accessing documents. When visiting (as in streaming search), documents are read in bucket order. This is the same order that the defragmentation jobs use.

The first document read in a visit operation for a bucket reads a chunk from the .dat file into memory. Subsequent document accesses are served by a memory lookup only. The chunk size is configured by maxsize:

<engine>
  <proton>
    <tuning>
      <searchnode>
        <summary>
          <store>
            <logstore>
              <chunk>
                <maxsize>16384</maxsize>
              </chunk>
            </logstore>
          </store>
        </summary>
      </searchnode>
    </tuning>
  </proton>
</engine>

There can be at most 2^22 = 4M chunks per file. This sets a minimum chunk size based on maxfilesize: for example, an 8G file has a minimum chunk size of 2k. Finally, bucket size is configured by setting bucket-splitting:

<content id="imagepersonal" version="1.0">
  <tuning>
    <bucket-splitting max-documents="1024"/>
  </tuning>
</content>

The following are hence the relevant sizing units:

  • .dat file size - maxfilesize. Larger files give fewer files and hence better locality, but compaction requires more memory and more time to complete.
  • chunk size - maxsize. Smaller chunks give fewer wasted IO bytes but more IO operations.
  • bucket size - bucket-splitting. Larger buckets give fewer buckets and hence better locality to nodes and files, but also higher streaming search latency per bucket.
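
The minimum chunk size implied by the 2^22 chunk limit can be worked out with a few lines of Python. This is plain arithmetic on the numbers given above; the function name min_chunk_size is hypothetical, and reading "8G" as 8 GiB is an assumption.

```python
MAX_CHUNKS_PER_FILE = 2 ** 22  # 4M chunks, as stated in the text


def min_chunk_size(maxfilesize_bytes):
    """Smallest chunk size in bytes that keeps a full file within the chunk limit."""
    return -(-maxfilesize_bytes // MAX_CHUNKS_PER_FILE)  # ceiling division


eight_gib = min_chunk_size(8 * 2 ** 30)         # 2048 bytes, i.e. 2k
example_config = min_chunk_size(8_000_000_000)  # maxfilesize from the example: 1908 bytes
```

Both results are well below the maxsize of 16384 used in the chunk example.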

Document store memory usage

The document store keeps a mapping in memory from local ID (LID) to position in a document store file (.dat). Part of this mapping is persisted in the .idx file paired with the .dat file. The memory used by the document store grows linearly with the number of documents and the updates to them.

The metric content.proton.documentdb.ready.document_store.memory_usage.allocated_bytes gives the size in memory - use the metric API to find it. A rule of thumb is 12 bytes per document.
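
As a rough sizing aid, the rule of thumb can be turned into a small calculation. The Python sketch below only multiplies the 12 bytes per document from the text by a document count; the function name and the one billion documents are hypothetical, and the real value should be read from the metric.

```python
BYTES_PER_DOCUMENT = 12  # rule of thumb from the text, not a measured value


def estimate_document_store_memory(num_documents, bytes_per_document=BYTES_PER_DOCUMENT):
    """Estimate the in-memory size of the LID mapping in bytes."""
    return num_documents * bytes_per_document


# Hypothetical node holding one billion documents.
estimate_bytes = estimate_document_store_memory(1_000_000_000)
estimate_gb = estimate_bytes / 1e9  # 12.0
```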