Vespa supports importing Gradient Boosting Decision Tree (GBDT) models trained with
XGBoost.
Exporting models from XGBoost
Vespa supports two XGBoost model formats: UBJ (recommended) and JSON (legacy).
UBJ format (recommended)
Support for the UBJ format was added in Vespa 8.656.31.
The recommended way to export an XGBoost model for Vespa is using
save_model()
with the .ubj (Universal Binary JSON) extension.
UBJ has been the default XGBoost model format since XGBoost 2.1.0 and
preserves all model information: tree structure, base_score, feature names, and objective.
import xgboost as xgb
import numpy as np

# Train a model
dtrain = xgb.DMatrix(
    np.random.rand(100, 2),
    label=np.random.randint(2, size=100),
    feature_names=["feature_1", "feature_2"],
)
param = {"max_depth": 2, "objective": "binary:logistic"}
model = xgb.train(param, dtrain, num_boost_round=10)

# Export as UBJ
model.save_model("my_model.ubj")
Warning:
Do not use save_model("model.json") — this produces a different JSON structure
(with a learner wrapper) that Vespa cannot parse. Only dump_model() with dump_format="json" is supported for the JSON path.
Since the UBJ format preserves the objective and base_score, Vespa applies the correct
base_score transformation automatically (logit for logistic objectives), so the raw model output
matches XGBoost's margin. Wrapping the expression in sigmoid() is only needed when a calibrated
probability is required (see Objective types).
JSON format (legacy)
Vespa also supports importing XGBoost's JSON model dump via
dump_model()
with dump_format="json".
The split attribute represents the Vespa feature name and must resolve to a Vespa
rank feature defined in the document schema,
or a user-defined function.
Warning: dump_model() JSON does not preserve base_score.
Set base_score=0 during training, or accept that Vespa predictions will be offset.
For logistic objectives, you must manually wrap the expression in sigmoid() (see Objective types).
Feature mappings from XGBoost to Vespa
Model feature names must map to Vespa rank features.
The mapping method depends on the model format.
UBJ feature mapping
For UBJ models, place a features file named <model_name>-features.txt alongside the .ubj file
in the models directory. The file contains one feature name per line, matching the training column order:
feature_1
feature_2
feature_3
For a model file named my_model.ubj, the features file must be named my_model-features.txt.
Then define rank profile functions
that match the feature names and map them to Vespa document attributes or query features:
schema my_app {
document my_app {
field price type double {
indexing: summary | attribute
}
field popularity type double {
indexing: summary | attribute
}
}
rank-profile my_rank_profile inherits default {
function feature_1() {
expression: attribute(price)
}
function feature_2() {
expression: attribute(popularity)
}
function feature_3() {
expression: query(user_context)
}
first-phase {
expression: xgboost("my_model.ubj")
}
}
}
If the model was trained with feature names that are valid Vespa rank features
(e.g. attribute(price)), the functions are not needed — Vespa resolves them directly.
JSON feature mapping
When using dump_model(), XGBoost names features by array index (f0, f1, …) unless a feature map file (fmap) is provided.
The fmap maps feature indices to named Vespa features:
Format of feature-map.txt: <featureid> <featurename> <q or i or int>, one feature per line:
Feature ids must run from 0 to the number of features, in sorted order
i means the feature is a binary indicator feature
q means the feature is a quantitative value, such as age or time, and can be missing
int means the feature is an integer value (when int is hinted, the decision boundary will be an integer)
When using Pandas DataFrames with column names, the feature names are embedded directly in the JSON dump
and a feature map file is not needed.
Importing XGBoost models
To import an XGBoost model, add the model file to your application package
under the models directory.
For UBJ models, also include the corresponding -features.txt file:
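For example, the application package could be laid out as follows (the schema file name follows the my_app schema example; services.xml is the standard deployment descriptor):

```
models/my_model.ubj
models/my_model-features.txt
schemas/my_app.sd
services.xml
```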
Vespa has an xgboost ranking feature.
This ranking feature specifies the model to use in a ranking expression.
Both UBJ and JSON models use the same ranking feature:
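A minimal sketch of using the ranking feature in a rank profile (the profile name prediction matches the query example below; the model file path is relative to the models directory):

```
rank-profile prediction inherits default {
    first-phase {
        expression: xgboost("my_model.ubj")
    }
}
```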
Here, we specify that the model my_model.ubj is evaluated by the first-phase ranking expression.
The query request must then specify prediction as the ranking.profile.
See also Phased ranking for how to control the number of documents exposed to the model.
Generally, run-time complexity is determined by:
The number of documents evaluated per thread / number of nodes, and the query filter
The cost of computing features. For example, fieldMatch features are about 100x more expensive than nativeFieldMatch/nativeRank.
The number of XGBoost trees and the maximum depth per tree
Warning:
Vespa does not support XGBoost's native categorical splits
(enable_categorical=True). Deploying a model with native categorical splits will silently produce
wrong predictions — Vespa interprets the categorical split condition as a numerical threshold.
To use categorical features with XGBoost models in Vespa, integer-encode them before training:
import xgboost as xgb
import pandas as pd

# Integer-encode categorical features
category_map = {"small": 0, "medium": 1, "large": 2}
df["size"] = df["size_raw"].map(category_map).astype(float)

# Train without enable_categorical — XGBoost uses numerical splits on the integers
dtrain = xgb.DMatrix(df[feature_cols], label=targets)
param = {"max_depth": 4, "objective": "binary:logistic"}
model = xgb.train(param, dtrain, num_boost_round=100)
model.save_model("my_model.ubj")
In the Vespa schema, store integer-encoded categoricals as int attributes
and map them via rank profile functions like any other numerical feature.
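A corresponding schema sketch (the field name size comes from the encoding example above; the function name must match the model's feature name, here assumed to be the DataFrame column size):

```
document my_app {
    field size type int {
        indexing: summary | attribute
    }
}
rank-profile my_rank_profile inherits default {
    function size() {
        expression: attribute(size)
    }
}
```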
Note: Vespa's LightGBM importer does support native categorical splits.
XGBoost objective types
Vespa can import XGBoost models trained with any
objective.
Common objectives include:
Regression: reg:squarederror / reg:logistic
Classification: binary:logistic
Ranking: rank:pairwise, rank:ndcg and rank:map
Vespa evaluates XGBoost models by summing the tree outputs.
The only objective-specific behavior is for logistic objectives (reg:logistic and binary:logistic),
where the raw tree sum must be passed through a sigmoid function to produce a probability.
UBJ models
For UBJ models, Vespa reads the objective from the model file.
For logistic objectives, the stored base_score is transformed (logit) automatically,
so the raw tree sum matches XGBoost's margin output without manual adjustment.
Note that UBJ does not automatically apply a sigmoid to the final output.
For logistic objectives, wrap the expression in sigmoid() if you need a probability:
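A sketch of wrapping the model output in sigmoid() to obtain a probability (profile and model names are illustrative):

```
rank-profile prediction inherits default {
    first-phase {
        expression: sigmoid(xgboost("my_model.ubj"))
    }
}
```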
For ranking objectives and reg:squarederror, the raw tree sum can be used directly.
JSON models
For JSON models exported with dump_model(), the objective and base_score are not preserved.
For reg:logistic and binary:logistic, the raw margin tree sum
needs to be passed through the sigmoid function
to represent the probability of class 1.
For regression, the model can be directly imported
but base_score should be set to 0 during training as it is not included in the dump.
Replace 0.5 with the actual base_score used during training.
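For example, a regression model trained with XGBoost's default base_score of 0.5 could add the offset back in the ranking expression (a sketch; the model name is illustrative):

```
rank-profile prediction inherits default {
    first-phase {
        expression: xgboost("my_model.json") + 0.5
    }
}
```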
See the XGBoost System Test for a complete working example.
Debugging Vespa inference score versus XGBoost predict score
For JSON models, the base_score and optimal number of trees (if trained with early stopping) are lost in the dump.
UBJ models preserve this information.
XGBoost also has different predict functions (e.g. predict/predict_proba).
The following XGBoost System Test
demonstrates how to represent different types of XGBoost models in Vespa.
For training, features should be scraped from Vespa using either match-features or summary-features, so
that the features used in offline training match the features Vespa computes online.
Dumping features can also help debug differences by zooming in on specific query/document pairs
using the recall parameter.
In a distributed setting where multiple nodes use the model, text matching features such as nativeRank, nativeFieldMatch, bm25 and fieldMatch
might differ depending on which node produced the hit. The reason is that these features use term(n).significance, which is computed from the locally indexed corpus
and is related to Inverse Document Frequency (IDF). For global correctness, term(n).significance should be set by a searcher in the container, as each node otherwise estimates significance from its local corpus.