Please refer to Large Language Models in Vespa for an introduction to using LLMs in Vespa.
Vespa provides a client for integration with OpenAI-compatible APIs. This includes, but is not limited to, OpenAI, Google Gemini, Anthropic, Cohere and Together.ai. You can also host your own OpenAI-compatible server using, for example, vLLM or llama-cpp-server.
To set up a connection to an LLM service such as OpenAI's ChatGPT, you need to define a component in your application's services.xml:
<services version="1.0">
  <container id="default" version="1.0">
    ...
    <component id="openai" class="ai.vespa.llm.clients.OpenAI">
      <!-- Optional configuration: -->
      <config name="ai.vespa.llm.clients.llm-client">
        <apiKeySecretName> ... </apiKeySecretName>
        <endpoint> ... </endpoint>
      </config>
    </component>
    ...
  </container>
</services>
To see the full list of available configuration parameters, refer to the llm-client config definition file.
This sets up a client component that can be used in a searcher or a document processor.
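For illustration, the sketch below wires the client into a search chain using the LLMSearcher described in Large Language Models in Vespa. The chain id is just an example, and the providerId value must match the component id defined above; this goes inside the same container element.

<search>
  <chain id="llm" inherits="vespa">
    <!-- providerId must match the id of the OpenAI client component defined above -->
    <searcher id="ai.vespa.search.llm.LLMSearcher">
      <config name="ai.vespa.search.llm.llm-searcher">
        <providerId>openai</providerId>
      </config>
    </searcher>
  </chain>
</search>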
Vespa provides several options to configure the API key used by the client:

- apiKeySecretName: set this configuration parameter to the name of the secret in the secret store. This is the recommended way for Vespa Cloud users.
- X-LLM-API-KEY: pass the API key in this HTTP header of the Vespa query.

You can set up multiple connections with different settings. For instance, you might want to run different LLMs for different tasks. To distinguish between the connections, modify the id attribute in the component specification. We will see below how this is used to control which LLM is used for which task.
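For example, a minimal sketch with one connection per task; the component ids and secret names below are placeholders:

<container id="default" version="1.0">
  ...
  <!-- Placeholder ids and secret names: one connection per task -->
  <component id="openai-summarize" class="ai.vespa.llm.clients.OpenAI">
    <config name="ai.vespa.llm.clients.llm-client">
      <apiKeySecretName>openai-summarize-key</apiKeySecretName>
    </config>
  </component>
  <component id="openai-querywrite" class="ai.vespa.llm.clients.OpenAI">
    <config name="ai.vespa.llm.clients.llm-client">
      <apiKeySecretName>openai-querywrite-key</apiKeySecretName>
    </config>
  </component>
  ...
</container>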
As a reminder, Vespa also has the option of running custom LLMs locally. Please refer to running LLMs in your application for more information.
Please refer to the general discussion in LLM parameters for setting inference parameters.
The OpenAI client also supports the following inference parameters, which can be sent along with the query:
Parameter (Vespa) | Parameter (OpenAI) | Description |
---|---|---|
maxTokens | max_completion_tokens | Maximum number of tokens that can be generated in the chat completion. |
temperature | temperature | Number between 0 and 2. Higher values like 0.8 make output more random, while lower values like 0.2 make it more focused and deterministic. |
topP | top_p | An alternative to temperature sampling. The model considers tokens with top_p probability mass (0-1). A value of 0.1 means only tokens comprising the top 10% probability mass are considered. |
seed | seed | If specified, the system will attempt to sample deterministically, so repeated requests with the same seed should return similar results. Determinism is not guaranteed. |
npredict | n | How many chat completion choices to generate for each input message. Note that you will be charged based on the number of generated tokens across all choices. |
frequencypenalty | frequency_penalty | Number between -2.0 and 2.0. Positive values penalize new tokens based on their frequency in the text so far, decreasing the likelihood of repetition. Negative values encourage repetition. |
presencepenalty | presence_penalty | Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics. Negative values encourage repeating content from the prompt. |
Any parameter sent with the query will override the configuration specified for the client component in services.xml.
Note that if you are not using OpenAI's API, the parameters may be handled differently than described above.
By default, this particular client connects to the OpenAI service, but it can be used against any OpenAI chat completion compatible API by changing the endpoint configuration parameter.
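For example, to point the client at a self-hosted OpenAI-compatible server such as vLLM or llama-cpp-server, set the endpoint in the client configuration. The URL below is a placeholder; see the llm-client config definition for the expected endpoint format.

<component id="local-llm" class="ai.vespa.llm.clients.OpenAI">
  <config name="ai.vespa.llm.clients.llm-client">
    <!-- Placeholder URL for a self-hosted OpenAI-compatible server -->
    <endpoint>http://localhost:8000/v1</endpoint>
  </config>
</component>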