Vespa supports using GPUs to evaluate ONNX models, as part of its stateless model evaluation feature. When running Vespa inside a container engine such as Docker or Podman, special configuration is required to make GPU(s) available inside the container.
The following guide explains how to do this for Nvidia GPUs, using Podman on RHEL 8. For other platforms and container engines, see the Nvidia container toolkit installation guide.
Ensure that Nvidia drivers are installed on your host. On RHEL 8 this can be done as follows:
dnf config-manager \
  --add-repo=https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/cuda-rhel8.repo
dnf module install -y --enablerepo cuda-rhel8-x86_64 nvidia-driver:latest
Install nvidia-container-toolkit. This grants the container engine access to your GPU device(s). On RHEL 8 this can be done as follows:
dnf config-manager \
  --add-repo=https://nvidia.github.io/libnvidia-container/rhel8.6/libnvidia-container.repo
dnf install -y --enablerepo libnvidia-container nvidia-container-toolkit
Generate a "Container Device Interface" config:
nvidia-ctk cdi generate --device-name-strategy=type-index --format=json > /etc/cdi/nvidia.json
Verify that the GPU device is exposed to the container:
podman run --rm -it --device nvidia.com/gpu=all docker.io/nvidia/cuda:11.6.2-base-ubuntu20.04 nvidia-smi
This should print details about your GPU(s) if everything is configured correctly.
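If the verification command fails, it can help to check each host-side prerequisite separately. The following is a minimal sketch, not part of Vespa or the Nvidia tooling: the function names `check_cmd` and `preflight` are made up for illustration, and the default spec path assumes the CDI config was written to /etc/cdi/nvidia.json as in the step above.

```shell
#!/bin/sh
# Hypothetical helpers for checking the host-side prerequisites.

# check_cmd NAME: report whether a command is available on PATH.
check_cmd() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "ok: $1"
    else
        echo "missing: $1"
        return 1
    fi
}

# preflight [SPEC]: check the tools used in this guide and the CDI spec file.
# SPEC defaults to /etc/cdi/nvidia.json (the path assumed from the step above).
preflight() {
    spec="${1:-/etc/cdi/nvidia.json}"
    status=0
    for cmd in nvidia-smi nvidia-ctk podman; do
        check_cmd "$cmd" || status=1
    done
    # The spec must exist and be non-empty.
    if [ -s "$spec" ]; then
        echo "ok: $spec"
    else
        echo "missing: $spec"
        status=1
    fi
    return "$status"
}
```

Calling `preflight` with no arguments prints one line per prerequisite and returns non-zero if anything is missing. When the toolkit is installed, `nvidia-ctk cdi list` can additionally be used to print the device names defined in the generated spec.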
Start the Vespa container with the --device option:
podman run --detach --name vespa --hostname vespa-container \
  --publish 8080:8080 --publish 19071:19071 \
  --device nvidia.com/gpu=all \
  vespaengine/vespa
The vespaengine/vespa image does not currently include the necessary CUDA libraries by default, due to their large size. These libraries must be installed inside the container manually:
podman exec -it vespa /bin/bash
dnf config-manager \
  --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/cuda-rhel8.repo
dnf -y install vespa-onnxruntime-cuda
The GPU device(s) should now be available inside the container as /dev/nvidiaN, where N is the device index.
See stateless model evaluation for how to configure the ONNX runtime to use a GPU for computation.
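To confirm that the /dev/nvidiaN device nodes are actually present, a small helper can list them. This is a sketch under assumptions: `list_gpu_nodes` is a hypothetical function name, and it only looks at file names, so it does not prove that the driver or the CUDA libraries work.

```shell
#!/bin/sh
# list_gpu_nodes [DIR]: print entries named nvidia<N> under DIR (default /dev).
# Control nodes such as nvidiactl are skipped because the name must
# continue with a digit.
list_gpu_nodes() {
    dir="${1:-/dev}"
    for node in "$dir"/nvidia[0-9]*; do
        # An unmatched glob stays literal, so check that the path exists.
        if [ -e "$node" ]; then
            echo "$node"
        fi
    done
    return 0
}
```

Inside the container, the same check can be run without the helper, for example with `podman exec vespa sh -c 'ls /dev/nvidia[0-9]*'`, assuming the container is named vespa as in the `podman run` command above.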