Document Summaries

Use multiple document summaries to search the same document type, but present different subsets of fields in different situations.

The default summary always contains all fields that are possible to include in summaries; all other summaries will contain a subset of the fields included in the default summary.

  • When additional summaries are configured or used for performance reasons, a summary with fewer fields than the default only reduces the amount of network bandwidth used. 'Debug' fields with summary enabled will hence still affect the amount of information that needs to be read from disk, even for the production summaries.
  • Vespa keeps attribute type fields in memory and fetches those fields from memory when requested as part of document summaries. This means summaries are memory-only operations if all fields are attributes.
  • The remaining document fields are stored as blobs/records in the document store, possibly compressed, possibly along with a number of other documents in order to have sufficient data to achieve a reasonable compression ratio. This record is used when processing summary requests that include fields in this record, and as needed during visiting or re-distribution of content to handle elasticity.
  • The default summary class will always access the document store because it includes the document ID which is stored here.
  • To include the document ID in a custom summary class it has to be included in an explicit summary definition.
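The rules in the list above can be sketched as a small Python model. It is illustrative only and is not Vespa code: the function name needs_document_store, the field name documentid and the set of attribute fields (borrowed from the music example in this document) are assumptions made for this sketch.

```python
# Illustrative model only; not part of Vespa.
# Assumption: these are the fields declared as attributes in the schema.
ATTRIBUTE_FIELDS = {"artist", "year", "popularity"}


def needs_document_store(summary_fields, attribute_fields=ATTRIBUTE_FIELDS):
    """Return True if the summary must read the document store.

    A summary is memory-only when every field is an attribute.
    The document ID (assumed name: "documentid") lives in the document store.
    """
    return any(
        field == "documentid" or field not in attribute_fields
        for field in summary_fields
    )


print(needs_document_store(["year", "popularity"]))   # False
print(needs_document_store(["title", "year"]))        # True
print(needs_document_store(["documentid", "year"]))   # True
```

Under these assumptions, a summary class limited to attribute fields avoids the document store entirely, while adding a single non-attribute field or the document ID brings the disk read back.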

Summary classes in the search definition

Define additional summary classes as described in the search definition reference. This can be done with an implicit definition or with an explicit definition.

In an implicit definition, name one or more summary classes that should contain the field on the field definition itself:

field [name] type [type] {
  …
  summary-to: [summary name], [summary name]  # The names of the doc summaries which will include this field
}

Example: the title and year fields are included in the titleyear summary. Note that both year and title will also be part of the default summary, even though it is not mentioned in the summary-to statement.

# A basic search definition - called music, should be saved to music.sd
search music {

  # It contains one document type only - called music as well
  document music {

    field title type string {
      indexing: summary | index   # How this field should be indexed
      summary-to: titleyear
    }

    field artist type string {
      indexing: summary | attribute | index
    }

    field year type int {
      indexing: summary | attribute
      summary-to: titleyear
    }

    field popularity type int {
      indexing: summary | attribute
    }

    field url type uri {
      indexing: summary | index
    }

  }

  fieldset default {
    fields: title, artist
  }

  rank-profile default inherits default {
    first-phase {
      expression: nativeRank(title,artist) + attribute(popularity)
    }
  }

  rank-profile textmatch inherits default {
    first-phase {
      expression: nativeRank(title,artist)
    }
  }

}
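How summary-to populates summary classes can be sketched in Python. This is a simplified model of the behavior described above, not Vespa's implementation; the FIELDS table mirrors the music example and the function name summary_classes is made up for this sketch.

```python
# Illustrative model only; field data copied from the music example.
# Each field maps to the summary classes named in its summary-to statement.
FIELDS = {
    "title": ["titleyear"],
    "artist": [],
    "year": ["titleyear"],
    "popularity": [],
    "url": [],
}


def summary_classes(fields):
    """Map each summary class name to the list of fields it contains."""
    # The default summary always contains every summary field.
    classes = {"default": list(fields)}
    for name, targets in fields.items():
        for target in targets:
            members = classes.setdefault(target, [])
            if name not in members:
                members.append(name)
    return classes


classes = summary_classes(FIELDS)
print(classes["titleyear"])  # ['title', 'year']
print(classes["default"])    # ['title', 'artist', 'year', 'popularity', 'url']
```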

In an explicit definition, name a summary class with the list of fields to include:

search [name] {

 document [name] {
   …
 }

 document-summary [name] {

   summary [field name] type [type] {
     source: [source field name]
   }

   summary [field name] type [type] {
     source: [source field name], [source field name]
   }

 }
}

Example: Equivalent to the above example

# A basic search definition - called music, should be saved to music.sd
search music {

  # It contains one document type only - called music as well
  document music {

    field title type string {
      indexing: summary | index   # How this field should be indexed
    }

    field artist type string {
      indexing: summary | attribute | index
    }

    field year type int {
      indexing: summary | attribute
    }

    field popularity type int {
      indexing: summary | attribute
    }

    field url type uri {
      indexing: summary | index
    }

  }

  fieldset default {
    fields: title, artist
  }

  rank-profile default inherits default {
    first-phase {
      expression: nativeRank(title,artist) + attribute(popularity)
    }
  }

  rank-profile textmatch inherits default {
    first-phase {
      expression: nativeRank(title,artist)
    }
  }

  document-summary titleyear {

    summary title type string {
      source: title
    }

    summary year type int {
      source: year
    }
  }
}

It is also possible to combine implicit and explicit definition of summary classes. For more details on summary properties, see Search Definitions: Summary.

Summary classes in queries

Use summary=[summary name] in the query to choose the summary class to use (the default one is called default). See Search API. Example:

/search/?yql=select+*+from+sources+*+where+default+contains+"best"%3B&summary=titleyear

The select statement in YQL lists a set of fields to return. Vespa in general makes a best effort to return those fields, and only those fields, unless a wildcard ("*") is given as argument. The wildcard implies returning the full set of fields included in the given summary class.

In conjunction with YQL statements, the summary argument operates like a definition of the set which YQL select then chooses a subset of fields from.

In other words, if the YQL expression is "select * …", and the summary argument is titleyear, all the fields in the summary class titleyear will be returned. If the select statement lists one or more fields (and summary is titleyear), the summary class titleyear is fetched, and the fields not listed in the select statement will be stripped away.
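The interplay between the summary argument and the YQL select list can be sketched as follows. This is a simplified model and not Vespa code: build_query and returned_fields are hypothetical helper names, the SUMMARY_CLASSES table mirrors the music example, and the model simply drops selected fields that are not in the summary class, whereas Vespa is described only as making a best effort.

```python
from urllib.parse import urlencode

# Assumption: summary classes as defined in the music example.
SUMMARY_CLASSES = {
    "default": ["title", "artist", "year", "popularity", "url"],
    "titleyear": ["title", "year"],
}


def build_query(yql, summary="default"):
    """Build a search URL path with URL-encoded yql and summary parameters."""
    return "/search/?" + urlencode({"yql": yql, "summary": summary})


def returned_fields(select, summary="default"):
    """Model which fields come back for a select list and a summary class."""
    available = SUMMARY_CLASSES[summary]
    if select == ["*"]:
        # Wildcard: the full set of fields in the summary class.
        return list(available)
    # Otherwise the summary class is fetched and unlisted fields are stripped.
    return [field for field in available if field in select]


url = build_query('select * from sources * where default contains "best";',
                  summary="titleyear")
print(url)
print(returned_fields(["*"], summary="titleyear"))      # ['title', 'year']
print(returned_fields(["title"], summary="titleyear"))  # ['title']
```

Note that urlencode percent-encodes more characters than the hand-written URL above (for example the quotes and the asterisks); both forms decode to the same query.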

Document store

Documents are stored in the document store in proton. Put, update and remove operations are persisted in the transaction log server (TLS) before updating the document in the document store. The operation is ack'ed to the client and the result of the operation is immediately seen in search results.

Note: If visibility-delay is set to non-zero, writes are batched (for better write performance) and delayed in search results.

Files in the document store are written sequentially, and occur in pairs - example:

-rw-r--r-- 1 owner users 4133380096 Aug 10 13:36 1467957947689211000.dat
-rw-r--r-- 1 owner users   71192112 Aug 10 13:36 1467957947689211000.idx

The maximum size (in bytes) per .dat file on disk can be set using the following:

<content id="mycluster" version="1.0">
  <engine>
    <proton>
      <tuning>
        <searchnode>
          <summary>
            <store>
              <logstore>
                <maxfilesize>8000000000</maxfilesize>
              </logstore>
            </store>
          </summary>
        </searchnode>
      </tuning>
    </proton>
  </engine>
</content>

Notes:
  • The files are written in sequence. proton starts with one pair and grows it until maxfilesize. Once full, a new pair is started.
  • This means that every pair is immutable except the last one, which is still being written to.
  • This also implies that documents for a given group or user (i.e. bucket) will be distributed to multiple such pairs, depending on insertion order.
  • This also means that a streaming search potentially hits multiple files per bucket searched. This impacts search latency if disk access time is high.
  • Documents exist in multiple versions in multiple files. Older versions are compacted away when a pair is scheduled to become the new active pair: obsolete versions are removed, leaving only the active document versions in a new file pair, which then becomes the new active pair.
  • Read more on implications of setting maxfilesize in proton maintenance jobs.
  • Files are written in chunks, using compression settings.
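The notes above can be illustrated with a toy simulation of sequentially written file pairs. This is a sketch under stated assumptions, not proton's implementation: the class LogStoreModel and its methods are hypothetical, sizes are arbitrary units, and the rollover rule (start a new pair when the next write would exceed the limit) is assumed.

```python
# Illustrative simulation only; this is not how proton is implemented.
class LogStoreModel:
    def __init__(self, max_file_size):
        self.max_file_size = max_file_size
        self.files = [[]]    # one list of (doc_id, size) entries per file pair
        self.live = {}       # doc_id -> (file index, position) of newest version

    def put(self, doc_id, size):
        active = self.files[-1]
        used = sum(entry_size for _, entry_size in active)
        # Assumption: roll over when the next write would exceed the limit.
        if active and used + size > self.max_file_size:
            active = []
            self.files.append(active)
        active.append((doc_id, size))
        self.live[doc_id] = (len(self.files) - 1, len(active) - 1)

    def stale_bytes(self):
        """Bytes occupied by versions that have been superseded."""
        return sum(
            size
            for file_index, entries in enumerate(self.files)
            for position, (doc_id, size) in enumerate(entries)
            if self.live[doc_id] != (file_index, position)
        )

    def files_holding(self, doc_ids):
        """Indexes of the file pairs holding the newest versions of doc_ids."""
        return sorted({self.live[doc_id][0] for doc_id in doc_ids})


store = LogStoreModel(max_file_size=100)
for doc_id in ["a", "b", "c", "a", "d"]:
    store.put(doc_id, 40)

print(len(store.files))                      # 3
print(store.stale_bytes())                   # 40
print(store.files_holding(["a", "b", "d"]))  # [0, 1, 2]
```

In this toy run the update to document a leaves a stale version behind in the first file, and three documents that might share a bucket end up in three different files, which is the spread the notes describe.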

Visiting the document store

A streaming search is a visit, by buckets, in sequence or (semi)parallel. Access to the document store is by local ID (LID). In proton, a bucket is just a property of the document's ID. As documents are added to the file pairs in insertion order, a scan of all documents in a bucket is a set of random file accesses, unless some kind of bucket localization is done. The set of document IDs for a given bucket is easily generated from memory structures and is assumed to take few resources.

Defragmentation within files

Document store compaction (defragmentation) does two things:

  1. Removes stale versions of documents (i.e. old versions of updated documents). Triggered when the disk bloat of the document store is larger than the total disk usage of the document store times diskbloatfactor.
  2. Co-locates documents by bucket. As streaming accesses are organized per bucket, this cuts latencies.

Refer to summary tuning for details.
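The trigger condition for removing stale versions can be written as a one-line check. The sketch below is only a restatement of the rule in the text: the function name should_compact is made up, and the factor 0.25 and the byte counts are hypothetical values chosen for illustration.

```python
def should_compact(disk_bloat_bytes, disk_usage_bytes, disk_bloat_factor):
    """Return True when bloat exceeds total disk usage times diskbloatfactor."""
    return disk_bloat_bytes > disk_usage_bytes * disk_bloat_factor


# Hypothetical numbers: 4 GB document store, factor 0.25 -> threshold 1 GB.
print(should_compact(1_200_000_000, 4_000_000_000, 0.25))  # True
print(should_compact(800_000_000, 4_000_000_000, 0.25))    # False
```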

Defragmentation status is hence best observed by tracking max_bucket_spread over time; a sawtooth pattern is normal for corpora that change over time. The document_store_compact metric tracks when proton is running compaction jobs. If compaction settings are set too tight, this metric is always at, or close to, 1.

When benchmarking, it is hence important to set the correct compaction settings, and also to make sure that proton has finished compacting files (this can take hours) and is not actively compacting (document_store_compact should be 0 most of the time).

Defragmentation across files

There is no bucket-compaction across files - documents will not move between files.

Optimized reads using chunks

As documents are clustered within the .dat file, proton optimizes reads by reading larger chunks when accessing documents. When visiting (as in streaming search), documents are read in bucket order. This is the same order that the defragmentation job uses.

The first document read in a visit operation for a bucket will read a chunk from the .dat file into memory. Subsequent document accesses are served by a memory lookup only. The chunk size is configured by maxsize:

<engine>
  <proton>
    <tuning>
      <searchnode>
        <summary>
          <store>
            <logstore>
              <chunk>
                <maxsize>16384</maxsize>
              </chunk>
            </logstore>
          </store>
        </summary>
      </searchnode>
    </tuning>
  </proton>
</engine>

There can be 2^22=4M chunks. This sets a minimum chunk size based on maxfilesize - e.g. an 8G file has a minimum chunk size of about 2k. Finally, bucket size is configured by setting bucket-splitting:

<content id="imagepersonal" version="1.0">
  <tuning>
    <bucket-splitting max-documents="1024"/>
  </tuning>
</content>

The following are hence the relevant sizing units:

  • .dat file size - maxfilesize. Larger files give fewer files and hence better locality, but compaction requires more memory and more time to complete.
  • chunk size - maxsize. Smaller chunks give fewer wasted IO bytes but more IO operations.
  • bucket size - bucket-splitting. Larger buckets give fewer buckets and hence better locality to nodes and files. Larger buckets also mean higher streaming search latency per bucket.
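The relationship between maxfilesize and the limit of 2^22 chunks can be checked with a quick calculation. This is a back-of-the-envelope sketch: the function name min_chunk_size is hypothetical, and rounding up to whole bytes is an assumption of this example.

```python
MAX_CHUNKS = 2 ** 22  # 4M chunks, as stated in the text


def min_chunk_size(max_file_size):
    """Smallest chunk size (bytes) that fits max_file_size into MAX_CHUNKS chunks."""
    # Ceiling division, so that MAX_CHUNKS chunks always cover the file.
    return -(-max_file_size // MAX_CHUNKS)


print(min_chunk_size(8_000_000_000))           # 1908
print(16384 >= min_chunk_size(8_000_000_000))  # True
```

With the example values used in this document, a maxfilesize of 8000000000 gives a minimum of roughly 2k bytes per chunk, so the configured maxsize of 16384 is comfortably above that minimum.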

Document store memory usage

The document store has a mapping in memory from local ID (LID) to position in a document store file (.dat). Part of this mapping is persisted in the .idx file paired with the .dat file. The memory used by the document store is linear in the number of documents and updates to these.

The metric content.proton.documentdb.ready.document_store.memory_usage.allocated_bytes gives the size in memory - use the metric API to find it. A rule of thumb is 12 bytes per document.
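The rule of thumb can be turned into a rough sizing estimate. The sketch below only multiplies the two numbers; the function name and the document count are hypothetical, and the result should be checked against the metric, since updates also contribute to memory use.

```python
BYTES_PER_DOCUMENT = 12  # rule of thumb from the text


def estimate_document_store_memory(document_count,
                                   bytes_per_document=BYTES_PER_DOCUMENT):
    """Rough estimate (bytes) of the in-memory LID mapping."""
    return document_count * bytes_per_document


# Hypothetical corpus of 100 million documents.
print(estimate_document_store_memory(100_000_000))  # 1200000000
```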