Vespa supports importing Gradient Boosting Decision Tree (GBDT) models trained with XGBoost.
Vespa supports importing XGBoost's JSON model dump, e.g. via the Python API xgboost.Booster.dump_model. When dumping the trained model, XGBoost allows users to set the dump_format to json, and users can specify the feature names to be used via fmap.
Here is an example of an XGBoost JSON model dump with 2 trees and maximum depth 1:
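A representative dump looks like this (the feature names match the feature-map example below; split conditions and leaf values are illustrative):

```json
[
  {
    "nodeid": 0, "depth": 0,
    "split": "fieldMatch(title).completeness",
    "split_condition": 0.772132, "yes": 1, "no": 2, "missing": 1,
    "children": [
      { "nodeid": 1, "leaf": 0.673938 },
      { "nodeid": 2, "leaf": 0.791882 }
    ]
  },
  {
    "nodeid": 0, "depth": 0,
    "split": "fieldMatch(title).importance",
    "split_condition": 0.606320, "yes": 1, "no": 2, "missing": 1,
    "children": [
      { "nodeid": 1, "leaf": 0.469388 },
      { "nodeid": 2, "leaf": 0.553730 }
    ]
  }
]
```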
Notice the split attribute, which holds the Vespa feature name. Each split feature must resolve to a Vespa rank feature defined in the document schema. The feature can also be a user-defined feature (for example, a function).
The above model JSON was produced using the XGBoost Python API with a regression objective:
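A sketch of the training and dump calls (the file names are placeholders):

```python
import xgboost as xgb

# Training data in LibSVM text format; feature-map.txt maps feature
# indexes to the Vespa feature names written into the JSON dump.
dtrain = xgb.DMatrix('training-vectors.txt')
params = {'base_score': 0, 'objective': 'reg:squarederror'}
bst = xgb.train(params, dtrain, num_boost_round=2)
bst.dump_model('trained-model.json', fmap='feature-map.txt',
               with_stats=False, dump_format='json')
```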
The training data is represented using the LibSVM text format. See also a complete XGBoost training notebook using a ranking objective.
XGBoost is trained on arrays or array-like data structures, where features are named by their index in the array. To convert the XGBoost features, we need to map feature indexes to actual Vespa features (native features or custom-defined features):
```
$ cat feature-map.txt | egrep "fieldMatch\(title\).completeness|fieldMatch\(title\).importance"
36 fieldMatch(title).completeness q
39 fieldMatch(title).importance q
```
In the feature mapping example, feature at index 36 maps to fieldMatch(title).completeness and index 39 maps to fieldMatch(title).importance. The feature mapping format is not well described in the XGBoost documentation, but the sample demo for binary classification writes:
Format of feature-map.txt: <featureid> <featurename> <q or i or int>\n:

- i means the feature is a binary indicator feature
- q means the feature is a quantitative value, such as age or time, and can be missing
- int means the feature is an integer value (when int is hinted, the decision boundary will be an integer)
When using pandas DataFrames with column names, one does not need to provide feature mappings.
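For example, a minimal sketch where the DataFrame column names are the Vespa feature names (the data values are made up):

```python
import pandas as pd
import xgboost as xgb

# Column names become the feature names in the JSON dump,
# so no fmap file is needed.
df = pd.DataFrame({
    'fieldMatch(title).completeness': [0.8, 0.2, 0.9, 0.1],
    'fieldMatch(title).importance':   [0.6, 0.4, 0.7, 0.3],
})
dtrain = xgb.DMatrix(df, label=[1, 0, 1, 0])
bst = xgb.train({'base_score': 0, 'objective': 'reg:squarederror'},
                dtrain, num_boost_round=2)
bst.dump_model('trained-model.json', dump_format='json')
```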
See also a complete example of how to train a ranking function, using learning to rank with ranking losses, in this notebook.
To import the XGBoost model to Vespa, add the directory containing the model to your application package under a specific directory named models. For instance, if you would like to call the model above as my_model, you would add it to the application package, resulting in a directory structure like this:
```
├── models
│   └── my_model.json
├── schemas
│   └── main.sd
└── services.xml
```
An application package can have multiple models.
To download models during deployment, see deploying remote models.
Vespa has an xgboost ranking feature, which specifies the model to use in a ranking expression. Consider the following example:
```
schema xgboost {
    rank-profile prediction inherits default {
        first-phase {
            expression: nativeRank
        }
        second-phase {
            expression: xgboost("my_model.json")
        }
    }
}
```
Here, we specify that the model my_model.json is applied to the documents ranked highest by the first-phase ranking expression. The query request must specify prediction as the ranking.profile.
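For example, a query request could look like this (shown unencoded for readability; the YQL and query string are illustrative):

```
/search/?yql=select * from sources * where userQuery()&query=vespa ranking&ranking.profile=prediction
```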
See also Phased ranking on how to control the number of documents exposed to the model.
Generally, the run time complexity is determined by the number of documents exposed to the model and by the cost of the rank features used: fieldMatch features are roughly 100x more expensive than nativeFieldMatch/nativeRank. Serving latency can be brought down by using multiple threads per query request, as sketched below.
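For example, a rank profile can request more threads per search with the num-threads-per-search setting (a sketch; the thread count must be tuned per deployment):

```
rank-profile prediction inherits default {
    num-threads-per-search: 4
    ...
}
```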
There are six different objective types that Vespa supports:

- reg:squarederror / reg:logistic
- binary:logistic
- rank:pairwise, rank:ndcg and rank:map
For reg:logistic and binary:logistic, the raw margin tree sum (the sum of all tree outputs) needs to be passed through the sigmoid function to represent the probability of class 1.
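That is, the probability of class 1 is sigmoid(margin) = 1 / (1 + exp(-margin)), where margin is the sum of the individual tree outputs.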
For regular regression, the model can be directly imported, but base_score should be set to 0, as the base_score used during the training phase is not dumped with the model.
An example model using the sklearn toy datasets is given below:
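A minimal sketch using the scikit-learn breast cancer dataset (n_estimators is an arbitrary choice; the dumped file name matches the schema below):

```python
from sklearn import datasets
import numpy as np
import xgboost as xgb

# Train a binary classifier and dump the trees as JSON for Vespa.
breast_cancer = datasets.load_breast_cancer()
clf = xgb.XGBClassifier(n_estimators=20, objective='binary:logistic')
clf.fit(breast_cancer.data, breast_cancer.target)
clf.get_booster().dump_model('binary_breast_cancer.json', dump_format='json')

# predict_proba for class 1 equals the sigmoid of the raw margin
# (the tree sum), which is what sigmoid(xgboost(...)) computes in Vespa.
margin = clf.predict(breast_cancer.data, output_margin=True)
proba = clf.predict_proba(breast_cancer.data)[:, 1]
assert np.allclose(proba, 1.0 / (1.0 + np.exp(-margin)), atol=1e-6)
```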
To represent the predict_proba function of XGBoost for the binary classifier in Vespa, we need to use the sigmoid function:
```
schema xgboost {
    rank-profile prediction-binary inherits default {
        first-phase {
            expression: sigmoid(xgboost("binary_breast_cancer.json"))
        }
    }
}
```
Note that the JSON model dump contains only the trees: settings applied at prediction time, such as the base_score or the optimal number of trees if trained with early stopping, are not part of the dump. XGBoost also has different predict functions (e.g. predict/predict_proba).
The following XGBoost System Test demonstrates how to represent different types of XGBoost models in Vespa.

When debugging differences between offline XGBoost predictions and online Vespa evaluation:

- Dump features using match-features or summary-features so that the features from offline training match the online Vespa-computed features (see the sketch after this list). Dumping features can also help debug differences by zooming into specific query, document pairs using the recall parameter.
- Vespa computes rank features in double precision. If the training routine rounds features to float or other more compact floating-point representations, feature split decisions might differ between Vespa and XGBoost.
- nativeRank, nativeFieldMatch, bm25 and fieldMatch might differ depending on which node produced the hit. The reason is that all these features use term(n).significance, which is computed locally from the indexed corpus. The term(n).significance feature is related to Inverse Document Frequency (IDF). It should be set by a searcher in the container for global correctness, as each node will otherwise estimate significance values from its local corpus.
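A minimal sketch of dumping the model's input features with match-features (the feature list is illustrative and must match the features the model was trained on):

```
rank-profile prediction inherits default {
    ...
    match-features {
        fieldMatch(title).completeness
        fieldMatch(title).importance
    }
}
```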