Attribute sizing

Attributes are field-level, in-memory data structures that enable functionality like sorting, grouping, and ranking. Because attributes are stored in memory, a node needs enough memory to avoid swapping and general unresponsiveness. Attribute structures are regularly optimized, which causes temporary resource usage - read more in Proton maintenance jobs.

Data types

The memory footprint of an attribute depends on a few factors, data type being the most important:

  • Numeric types (int, long, byte, and double) - fixed length and fixed cost per document
  • String type - the footprint depends on the length of the strings and the number of unique strings that need to be stored

Collection types like array and weighted set increase the memory usage somewhat, but the main factor is the average number of values per document. String attributes are typically the largest attributes and require the most memory during initialization - use numeric types where possible.

Example

search foo {
    document bar {
        field titles type array<string> {
            indexing: summary | attribute
        }
    }
}

Refer to the formulas below. Assume an average of 5 values per document and an average string length of 10. Usage is then 5*(10 + 32) = 210 bytes per document during initialization; with 10 million documents that becomes 2,100,000,000 bytes, or roughly 2GB of attribute data. Doubling the average number of values per document to 10 also doubles the memory footprint during initialization (4GB). The steady state attribute footprint will be lower, but when doing capacity planning, keep in mind the maximum footprint, which occurs during initialization. For the steady state footprint, the number of unique values is very important for string attributes.
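The arithmetic in this example can be reproduced with a few lines of Python. This is only a sketch: the function name and parameter names are made up for illustration, and the 32-byte per-value overhead is simply the constant used in the text above.

```python
def init_footprint_bytes(documents, values_per_document, avg_string_length,
                         per_value_overhead=32):
    """Rough initialization footprint for a multi-value string attribute.

    per_value_overhead is the 32-byte constant from the example text;
    it is an assumption here, not a value read from a running system.
    """
    bytes_per_document = values_per_document * (avg_string_length + per_value_overhead)
    return documents * bytes_per_document


baseline = init_footprint_bytes(10_000_000, 5, 10)
doubled = init_footprint_bytes(10_000_000, 10, 10)

print(baseline)  # 2100000000 bytes, roughly 2GB
print(doubled)   # 4200000000 bytes, roughly 4GB
```

Because the estimate is linear in the number of values per document, doubling that average doubles the result, as the text states.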

Check the Example attribute sizing spreadsheet, with various data types and collection types for a simple search application. It also contains estimates for how many documents a 16GB RAM node can hold.

Attributes can be configured with fast-search - this impacts memory footprint:

  • Setting fast-search is not recommended unless the attribute is queried without any other, more restrictive terms that are indexed
  • fast-search increases steady state memory usage for all attribute types and also adds initialization overhead for numeric types

Sizing

Attribute sizing is an approximation rather than an exact science, because attributes vary in size: the number of documents, the number of values per document, and the uniqueness of the values all vary. The concepts used to describe the attribute components that occupy memory are listed below:

Abbreviation | Concept | Comment
D | Number of documents | Number of documents on the node, or rather the maximum number of documentids allocated
V | Average number of values per document | Only applicable for arrays and weighted sets
U | Number of unique values | Only applies if fast-search is set
FW | Fixed data width | sizeof(T) for numerics, 1 byte for strings
WW | Weight width | Width of the weight in a weighted set, 4 bytes
EW | Enum index width | Width of the enum index, 4 bytes. Used by all strings, and by other attributes if fast-search is set
VW | Variable data width | strlen(s) for strings, 0 bytes for the rest
PW | Posting width | Width of a posting list entry for the attribute: 4 bytes with fast-search, (4+4) bytes for array/weighted set
IW | Index width | Width of the index: 4 bytes, 8 bytes if huge is set
ROF | Resize overhead factor | Default is 6/5. This is the average overhead in any dynamic vector due to the resizing strategy. The resize strategy is 50%, so a structure is 5/6 full on average
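The 6/5 value for ROF can be reproduced with a little arithmetic. The sketch below assumes that "resize strategy is 50%" means capacity grows by 50% on each resize, and that the average fill level lies midway between the fill right after a resize and the fill right before the next one. Both are interpretations of the comment above, not statements taken from the implementation.

```python
from fractions import Fraction

# Assumption: each resize grows capacity by 50%.
growth = Fraction(1, 2)

min_fill = 1 / (1 + growth)           # fill right after a resize: 2/3
max_fill = Fraction(1)                # fill right before the next resize
avg_fill = (min_fill + max_fill) / 2  # midpoint of the two extremes
rof = 1 / avg_fill                    # memory allocated per unit of memory used

print(avg_fill)  # 5/6
print(rof)       # 6/5
```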

Components

Component | Formula | Approx. factor | Applies to
Document vector | D * ((FW or EW) or IW) | ROF | FW for single-value numeric attributes and IW for multi-value attributes. EW for single-value string attributes or single-value fast-search attributes
Multi-value mapping | D * V * (FW or EW) | ROF | Applicable only for arrays or weighted sets. EW if string or fast-search
Enum store | U * (FW + VW + 4) | ROF | Applicable for strings or if fast-search is set
Posting list | D * V * PW | ROF | Applicable for strings or if fast-search is set

Variants

Type | Components | Formula
Numeric single-value plain | Document vector | D * FW * ROF
Numeric multi-value plain | Document vector, Multi-value mapping | D * IW * ROF + D * V * FW * ROF
Numeric single-value fast-search | Document vector, Enum store, Posting list | D * EW * ROF + D * PW * ROF + U * (FW+4) * ROF
Numeric multi-value fast-search | Document vector, Multi-value mapping, Enum store, Posting list | D * IW * ROF + D * V * EW * ROF + U * (FW+4) * ROF + D * V * PW * ROF
Single-value string plain | Document vector, Enum store | D * EW * ROF + U * (FW+VW+4) * ROF
Single-value string fast-search | Document vector, Enum store, Posting list | D * EW * ROF + U * (FW+VW+4) * ROF + D * PW * ROF
Multi-value string plain | Document vector, Multi-value mapping, Enum store | D * IW * ROF + D * V * EW * ROF + U * (FW+VW+4) * ROF
Multi-value string fast-search | Document vector, Multi-value mapping, Enum store, Posting list | D * IW * ROF + D * V * EW * ROF + U * (FW+VW+4) * ROF + D * V * PW * ROF
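The variant formulas above can be folded into one small estimator. The following is a sketch, not an official sizing tool: the function name and its parameters are hypothetical, IW is fixed at 4 bytes (huge not set), and PW is taken as 4 bytes for single-value and 4+4 bytes for multi-value attributes, which is one reading of the concepts table. The number of unique values used in the usage example (1 million) is an invented figure for illustration.

```python
ROF = 6 / 5  # resize overhead factor (default)
EW = 4       # enum index width in bytes
IW = 4       # index width in bytes (assumes huge is not set)


def attribute_bytes(d, fw, *, vw=0, v=None, u=0, is_string=False, fast_search=False):
    """Estimate attribute memory in bytes from the variant formulas.

    d: number of documents, fw: fixed data width, vw: variable data width,
    v: average values per document (None means single value),
    u: number of unique values (only used when an enum store applies).
    """
    uses_enum = is_string or fast_search
    width = EW if uses_enum else fw
    if v is None:
        total = d * width               # document vector
        values, pw = 1, 4
    else:
        total = d * IW + d * v * width  # document vector + multi-value mapping
        values, pw = v, 8
    if uses_enum:
        total += u * (fw + vw + 4)      # enum store
    if fast_search:
        total += d * values * pw        # posting list
    return total * ROF


# The array<string> "titles" field from the example: 10 million documents,
# 5 values per document, average string length 10, and an ASSUMED
# 1 million unique strings.
plain = attribute_bytes(10_000_000, 1, vw=10, v=5, u=1_000_000, is_string=True)
fast = attribute_bytes(10_000_000, 1, vw=10, v=5, u=1_000_000,
                       is_string=True, fast_search=True)

print(round(plain))  # 306000000
print(round(fast))   # 786000000
```

Under these assumptions, enabling fast-search more than doubles the estimate for this field, since the posting list term D * V * PW dominates.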