The bm25 rank feature implements the Okapi BM25 ranking function, which estimates the relevance of a text document given a search query. It is a pure text ranking feature that operates over an indexed string field. The feature is cheap to compute, about 3-4 times faster than nativeRank, while still providing good ranking quality. This makes it a good candidate for a first-phase ranking function when ranking text documents.
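The bm25 feature calculates a score for how well a query with terms $q_1, \ldots, q_n$ matches an indexed string field t in a document D. The score is calculated as follows:

$$ \mathrm{bm25}(t) = \sum_{i=1}^{n} \mathrm{IDF}(q_i) \cdot \frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot \left(1 - b + b \cdot \frac{\mathrm{field\_len}}{\mathrm{avg\_field\_len}}\right)} $$

Where the components in the function are:

$\mathrm{IDF}(q_i)$: The inverse document frequency (IDF) of query term i in field t. This is calculated as:

$$ \mathrm{IDF}(q_i) = \log\left(1 + \frac{N - n(q_i) + 0.5}{n(q_i) + 0.5}\right) $$

$N$ is the total number of documents on the content node. $n(q_i)$ is the number of documents containing query term i for field t, which is calculated per index existing for that field. The max value among the indexes is used in the calculation, which typically comes from the largest disk index.

The remaining symbols follow the standard Okapi BM25 definition: $f(q_i, D)$ is the number of occurrences of query term i in field t of document D, $\mathrm{field\_len}$ is the length of field t in D, $\mathrm{avg\_field\_len}$ is the average length of field t across documents, and $k_1$ and $b$ are tuning parameters.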
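The scoring can be sketched in Python as a rough illustration. This is not the engine's implementation: it assumes the standard Okapi BM25 form, an IDF of the form log(1 + (N - n + 0.5) / (n + 0.5)), and the commonly used parameter values k1 = 1.2 and b = 0.75. The function names, the token-list representation of a field, and the `doc_freq` mapping are hypothetical choices made for the example.

```python
import math


def idf(total_docs: int, docs_with_term: int) -> float:
    """IDF in the form log(1 + (N - n + 0.5) / (n + 0.5))."""
    return math.log(1.0 + (total_docs - docs_with_term + 0.5) / (docs_with_term + 0.5))


def bm25(query_terms, field_tokens, total_docs, doc_freq, avg_field_len,
         k1=1.2, b=0.75):
    """Score one field (a list of tokens) against a list of query terms.

    doc_freq maps a term to the number of documents containing it
    (an assumed input; a real engine reads this from its index).
    """
    field_len = len(field_tokens)
    # Length normalization: long fields are penalized relative to the average.
    norm = k1 * (1.0 - b + b * field_len / avg_field_len)
    score = 0.0
    for term in query_terms:
        tf = field_tokens.count(term)  # f(q_i, D)
        if tf == 0:
            continue  # terms absent from the field contribute nothing
        score += idf(total_docs, doc_freq.get(term, 0)) * tf * (k1 + 1.0) / (tf + norm)
    return score
```

For instance, `bm25(["cat"], ["cat", "sat", "cat"], 100, {"cat": 10}, 3.0)` scores a three-token field in which the query term occurs twice. Rarer terms (smaller `doc_freq` values) yield a larger IDF and therefore contribute more to the score.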
As the IDF is calculated per content node and index, slight variations might occur. To use the same IDF across all content nodes, set it as the significance on each query term using annotations.
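As a hedged sketch of that approach, the snippet below computes one IDF value per term from globally known document counts and attaches it to each term with a significance annotation in a YQL string. It assumes the `({significance:...}"term")` annotation form, that terms are plain words needing no escaping, and that global counts are available from some external source; the helper names and the `example` source name are hypothetical.

```python
import math


def global_idf(total_docs: int, docs_with_term: int) -> float:
    """Same IDF form as used for bm25, but computed from global counts."""
    return math.log(1.0 + (total_docs - docs_with_term + 0.5) / (docs_with_term + 0.5))


def yql_with_significance(field, terms, total_docs, doc_freq, source="example"):
    """Build a YQL string where every term carries an explicit significance.

    doc_freq maps each term to its global document count (assumed input).
    Terms are assumed to be plain words without quotes or backslashes.
    """
    clauses = []
    for term in terms:
        sig = global_idf(total_docs, doc_freq[term])
        clauses.append(f'{field} contains ({{significance:{sig:.4f}}}"{term}")')
    return f"select * from {source} where " + " or ".join(clauses)
```

Because every content node receives the same annotated value, the per-node and per-index variation in IDF no longer affects the score.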
In the following example we have an indexed string field content, and a rank profile using the bm25 rank feature. Note that the field must be enabled for usage with the bm25 feature by setting the enable-bm25 flag in the index section of the field definition.
schema example {
    document example {
        field content type string {
            indexing: index | summary
            index: enable-bm25
        }
    }
    rank-profile default {
        first-phase {
            expression {
                bm25(content)
            }
        }
    }
}
If the enable-bm25 flag is turned on after documents are already fed, then proton performs a memory index flush followed by a disk index fusion to prepare the posting lists for use with bm25. Use the custom component state API on each content node and examine pending_urgent_flush to determine if the preparation is still ongoing:
/state/v1/custom/component/documentdb/mydoctype/subdb/ready/index
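A minimal Python sketch of such a check follows. The exact layout of the JSON response is not described here, so the code makes no assumption about where pending_urgent_flush sits and simply searches the whole response for that key. The base URL (host and state port of the content node), the function names, and the document type argument are placeholders to be supplied by the caller.

```python
import json
import urllib.request


def find_key(node, key):
    """Depth-first search for `key` in nested dicts/lists.

    Returns the first value found, or None if the key is absent.
    """
    if isinstance(node, dict):
        if key in node:
            return node[key]
        children = node.values()
    elif isinstance(node, list):
        children = node
    else:
        return None
    for child in children:
        found = find_key(child, key)
        if found is not None:
            return found
    return None


def preparation_ongoing(base_url: str, doctype: str) -> bool:
    """Query one content node and report whether an urgent flush is pending.

    base_url is a placeholder such as "http://<host>:<state-port>".
    """
    url = f"{base_url}/state/v1/custom/component/documentdb/{doctype}/subdb/ready/index"
    with urllib.request.urlopen(url, timeout=5) as resp:
        payload = json.load(resp)
    return bool(find_key(payload, "pending_urgent_flush"))
```

The check has to be repeated for each content node, since each node prepares its own posting lists.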