Large Language Models in Vespa

Large Language Models (LLMs) are AI systems that generate human-like text, supporting a variety of applications like chatbots and content generation. In Vespa, LLMs can enhance search relevance, create dynamic content based on search results, and understand natural language by integrating into Vespa's processing chain structure, which handles querying and data ingestion. This allows Vespa to apply LLMs' deep linguistic and semantic capabilities across different stages, improving tasks such as document enrichment, query comprehension, summarization and question-answering.

Vespa is well suited for retrieval-augmented generation (RAG). This technique gives an LLM access to relevant, up-to-date information beyond its training data at query time, so that the generated output is informed by the context Vespa retrieves. For more information, refer to Retrieval-Augmented Generation in Vespa.

Setting up a client connection to an LLM from within your Vespa application, rather than calling the LLM API from your client after Vespa has returned its response, eliminates an extra network hop and thus lowers latency for end users. This matters even more when you make multiple LLM calls, for example in agentic applications or for reranking.

Vespa supports LLMs in three ways:

External LLMs: Vespa can connect to any external LLM provider that serves an OpenAI-compatible API.
Local LLMs: Vespa can run LLMs within the Vespa application itself. This allows for customized models and avoids sending data outside the application. This is particularly useful for applications with strict data privacy requirements or those needing specific model configurations.
Custom language models: Vespa can be extended to support any language model, including those not based on OpenAI's API. This allows for flexibility in integrating various LLMs into Vespa applications.

This document will focus on features that are common to both external and local LLMs. For more information on configuration details for each type, please refer to the respective sections.

For a quick start, check out the RAG sample app, which demonstrates setting up Vespa for RAG, using either an external LLM service or a local LLM.

Using LLMs

Note: This feature is available in Vespa versions >= 8.327

Vespa distinguishes between the clients used to connect to LLMs and the components that use these clients. You can, for instance, set up a single client connection to an LLM, and use this connection for both document enrichment and retrieval-augmented generation (RAG).

LLM/RAG searcher

After adding a client connection to your services.xml, you can use the same client for various tasks such as retrieval-augmented generation. To do this, you need to set up the searchers or document processors that will use it. An example of a simple searcher that uses the client component is the LLMSearcher, which can be set up like this:

<services version="1.0">
  <container id="default" version="1.0">

    ...

    <component id="openai" class="ai.vespa.llm.clients.OpenAI">
      <!-- Configure as required -->
    </component>

    <search>
      <chain id="llm" inherits="vespa">
        <searcher id="ai.vespa.search.llm.LLMSearcher">
          <config name="ai.vespa.search.llm.llm-searcher">
            <providerId>openai</providerId>
          </config>
        </searcher>
      </chain>
    </search>

    ...

  </container>
</services>

This sets up a new search chain which includes an LLMSearcher. This searcher has the responsibility of calling out to the LLM connection using some prompt that has been sent along with the query.

Note the providerId configuration parameter: this must match the id given in the component specification. Using this, one can set up as many clients and searchers and combinations of these as one needs. If you do not specify a providerId, the searcher will use the first available LLM connection.

This particular searcher doesn't provide a lot of functionality: it only calls out to the LLM service using a prompt sent along with the query. The searcher expects the prompt to be passed in the query parameter prompt. For instance, using the Vespa CLI:

$ vespa query \
    --header="X-LLM-API-KEY:..." \
    searchChain=llm \
    prompt="what was the manhattan project?"

Here, we first pass along the API key to the OpenAI API. You need to provide your own OpenAI key for this. The searchChain parameter selects the llm chain set up in services.xml. Finally, the prompt parameter determines what is sent to the language model.

Note that if the prompt query parameter is not provided, the LLMSearcher falls back to the query parameter named query.
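The same request can be issued over HTTP from any client. The following is a minimal Java sketch under stated assumptions: the endpoint http://localhost:8080/search/ and the class name LlmQueryRequest are illustrative and not taken from this document, and the request is only built, not sent.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

// Sketch only: builds (but does not send) a GET request carrying the same
// header and query parameters as the CLI example.
class LlmQueryRequest {

    // Assumption: a local Vespa container answering queries at this URL.
    static final String ENDPOINT = "http://localhost:8080/search/";

    static HttpRequest build(String apiKey, String prompt) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("searchChain", "llm"); // chain id from services.xml
        params.put("prompt", prompt);     // text sent to the language model

        // URL-encode each value and join the pairs into a query string.
        String queryString = params.entrySet().stream()
                .map(e -> e.getKey() + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8))
                .collect(Collectors.joining("&"));

        return HttpRequest.newBuilder(URI.create(ENDPOINT + "?" + queryString))
                .header("X-LLM-API-KEY", apiKey) // same header as in the CLI example
                .GET()
                .build();
    }

    public static void main(String[] args) {
        HttpRequest request = build("my-key", "what was the manhattan project?");
        System.out.println(request.uri());
    }
}
```

The request can then be handed to a java.net.http.HttpClient. Keeping the construction separate from the sending makes it easy to inspect the URL before anything goes over the network.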

By running the above command you will get something like the following:

{
 "root": {
   "id": "token_stream",
   "relevance": 1.0,
   "fields": {
     "totalCount": 0
   },
   "children": [
     {
       "id": "event_stream",
       "relevance": 1.0,
       "children": [
         {
           "id": "1",
           "relevance": 1.0,
           "fields": {
             "token": "The"
           }
         },
         {
           "id": "2",
           "relevance": 1.0,
           "fields": {
             "token": " Manhattan"
           }
         },
         {
           "id": "3",
           "relevance": 1.0,
           "fields": {
             "token": " Project"
           }
         },
         {
           "id": "4",
           "relevance": 1.0,
           "fields": {
             "token": " was"
           }
         },
         ...
       ]
     }
   ]
 }
}

Streaming with Server-Sent Events

With the command above, you have to wait until the entire response has been generated by the underlying LLM. This can take a while, as LLMs generate one token at a time. To stream the tokens as they arrive, use the sse (Server-Sent Events) renderer by adding the format query parameter:

$ vespa query \
    --header="X-LLM-API-KEY:..." \
    searchChain=llm \
    prompt="what was the manhattan project?" \
    format=sse

The Manhattan Project was a research and development project during World War II that produced the first nuclear weapons. It was led by the United States with the support of the United Kingdom and Canada, and aimed to develop the technology necessary to build an atomic bomb. The project culminated in the bombings of the Japanese cities of Hiroshima and Nagasaki in August 1945.

The Vespa CLI understands this format and streams the tokens as they arrive. The underlying format is Server-Sent Events, and the raw output from Vespa looks like this:

$ vespa query \
    --format=plain \
    --header="X-LLM-API-KEY:..." \
    searchChain=llm \
    prompt="what was the manhattan project?" \
    format=sse

event: token
data: {"token":"The"}

event: token
data: {"token":" Manhattan"}

event: token
data: {"token":" Project"}

event: token
data: {"token":" was"}

event: token
data: {"token":" a"}

...

Notice the use of --format=plain in the Vespa CLI here, which outputs exactly what is sent from Vespa.

These events can be consumed with an EventSource as described in the HTML specification, or with any other client, as the format is simple. Each data element contains a small JSON object that must be parsed; it holds a single token element with the actual token.
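As an illustration of how simple the format is to consume, here is a minimal Java sketch that reads such a stream and concatenates the tokens. The class TokenStreamReader is hypothetical, and to stay dependency-free it extracts the token with a regular expression and handles only basic escapes (no \uXXXX sequences); a real client would more likely use a proper JSON parser.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch only: reads "event:" / "data:" lines and joins the streamed tokens.
class TokenStreamReader {

    // Simplification: matches the string value of the "token" field.
    private static final Pattern TOKEN =
            Pattern.compile("\"token\"\\s*:\\s*\"((?:[^\"\\\\]|\\\\.)*)\"");

    static String readAnswer(BufferedReader reader) {
        StringBuilder answer = new StringBuilder();
        StringBuilder data = new StringBuilder();
        String event = null;
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.startsWith("event:")) {
                    event = line.substring("event:".length()).trim();
                } else if (line.startsWith("data:")) {
                    data.append(line.substring("data:".length()).trim());
                } else if (line.isEmpty()) {
                    // A blank line terminates the current event.
                    dispatch(event, data.toString(), answer);
                    event = null;
                    data.setLength(0);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        dispatch(event, data.toString(), answer); // stream may end without a blank line
        return answer.toString();
    }

    private static void dispatch(String event, String data, StringBuilder answer) {
        if (event == null) return;
        if (event.equals("error")) throw new IllegalStateException("LLM error event: " + data);
        if (!event.equals("token")) return; // ignore event types this sketch does not know
        Matcher m = TOKEN.matcher(data);
        if (m.find()) answer.append(unescape(m.group(1)));
    }

    // Resolves backslash escapes; \n and \t are mapped, any other escaped
    // character (such as \" or \\) is taken literally.
    static String unescape(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '\\' && i + 1 < s.length()) {
                char next = s.charAt(++i);
                if (next == 'n') out.append('\n');
                else if (next == 't') out.append('\t');
                else out.append(next);
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String sse = "event: token\ndata: {\"token\":\"The\"}\n\n"
                + "event: token\ndata: {\"token\":\" Manhattan\"}\n\n";
        System.out.println(readAnswer(new BufferedReader(new StringReader(sse))));
    }
}
```

In practice the reader would wrap the HTTP response body instead of a string, so tokens can be shown to the user as they arrive rather than after the stream ends.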

Errors are also sent in such events:

$ vespa query \
    --header="X-LLM-API-KEY: banana" \
    prompt="what was the manhattan project?" \
    searchChain=llm \
    format=sse

event: error
data: {
    "source": "openai",
    "error": 401,
    "message": "{    \"error\": {        \"message\": \"Incorrect API key provided: banana. You can find your API key at https://platform.openai.com/account/api-keys.\",        \"type\": \"invalid_request_error\",        \"param\": null,        \"code\": \"invalid_api_key\"    }}"
}

LLM parameters

The LLM service typically has a set of inference parameters that can be set. These include parameters such as:

model - for OpenAI can be any valid model such as gpt-4o or gpt-4o-mini etc.
temperature - for setting the model temperature
maxTokens - for setting the maximum number of tokens to produce

Note that these parameters are common to both Local LLMs and External LLMs, but each of them also supports additional inference parameters. See the respective sections for more details on these.

To provide inference parameters, you pass these along with the query:

$ vespa query \
    --header="X-LLM-API-KEY: ..." \
    prompt="what was the manhattan project?" \
    searchChain=llm \
    format=sse \
    llm.model=gpt-4 \
    llm.maxTokens=10

Note that these parameters are prefixed with the property prefix llm and a dot, as in llm.maxTokens. This is so that you can have multiple LLM searchers and control them independently by setting them up with different property prefixes in services.xml. For instance:

<chain id="rag" inherits="vespa">
  <searcher id="ai.vespa.search.llm.RAGSearcher">
    <config name="ai.vespa.search.llm.llm-searcher">
      <providerId>openai</providerId>
      <propertyPrefix>rag</propertyPrefix>
    </config>
  </searcher>
  <searcher id="ai.vespa.search.llm.LLMSearcher">
    <config name="ai.vespa.search.llm.llm-searcher">
      <providerId>openai</providerId>
      <propertyPrefix>llm</propertyPrefix>
    </config>
  </searcher>
</chain>

Here, we have set up a chain with two LLM searchers configured with different propertyPrefix values. Each searcher uses its prefix to look up its own properties, including the prompt. The prompt for the first searcher would thus be rag.prompt and the prompt for the second would be llm.prompt.

Note that if this propertyPrefix is not set, the default is llm and all LLM searchers would share the same parameters.

Also note that prompt does not need to be prefixed in the query, whereas the other parameters do.
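To make the naming convention concrete, the following small Java sketch derives the query parameter names for a given propertyPrefix. The class PrefixedParams is hypothetical and only illustrates how names such as rag.prompt or llm.maxTokens are formed; it is not part of Vespa.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch only: prepends "<propertyPrefix>." to each parameter name.
class PrefixedParams {

    static Map<String, String> withPrefix(String propertyPrefix, Map<String, String> params) {
        Map<String, String> prefixed = new LinkedHashMap<>();
        params.forEach((name, value) -> prefixed.put(propertyPrefix + "." + name, value));
        return prefixed;
    }

    public static void main(String[] args) {
        Map<String, String> ragParams = new LinkedHashMap<>();
        ragParams.put("prompt", "summarize the results");
        ragParams.put("maxTokens", "100");

        // Parameters addressed to the searcher configured with propertyPrefix "rag".
        withPrefix("rag", ragParams)
                .forEach((name, value) -> System.out.println(name + "=" + value));
    }
}
```

A client talking to the chain above would build one such parameter set per searcher and merge them into a single query.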

If you are using different LLM services, you can also distinguish between the API keys sent along with the query by prefixing them with the propertyPrefix as well.

Retrieval-Augmented Generation (RAG)

Above we used the LLMSearcher to call out to LLMs using a pre-specified prompt. Vespa provides the RAGSearcher to construct a prompt based on search results. This enables a flexible way of first searching for content in Vespa, and using the results to generate a response.

Please refer to RAG in Vespa for more details.

Structured output

Both the OpenAI and LocalLLM clients in Vespa can also be configured to return structured output. This is done by providing an llm.json_schema in the query (assuming you are using the LLMSearcher or RAGSearcher with propertyPrefix=llm).

This can be useful for different use cases. Examples include applying moderation of the output or providing the response in different styles and/or languages. For instance, the following schema requests a short answer in three variants:

{
    "type": "object",
    "properties": {
        "answer-short": {
            "type": "string"
        },
        "answer-short-french": {
            "type": "string",
            "description": "exact translation of short answer in French language"
        },
        "answer-short-eli5": {
            "type": "string",
            "description": "explain the answer like I am 5 years old"
        }
    },
    "required": [
        "answer-short",
        "answer-short-french",
        "answer-short-eli5"
    ],
    "additionalProperties": false
}

The json_schema can be passed with the query using the llm.json_schema parameter:

$ vespa query \
    --timeout 60 \
    --header="X-LLM-API-KEY:<YOUR_API_KEY>" \
    query="what was the manhattan project?" \
    hits=5 \
    searchChain=openai \
    format=sse \
    llm.json_schema="{\"type\":\"object\",\"properties\":{\"answer-short\":{\"type\":\"string\"},\"answer-short-french\":{\"type\":\"string\",\"description\":\"exact translation of short answer in French language\"},\"answer-short-eli5\":{\"type\":\"string\",\"description\":\"explain the answer like I am 5 years old\"}},\"required\":[\"answer-short\",\"answer-short-french\",\"answer-short-eli5\"],\"additionalProperties\":false}" \
    traceLevel=1

With gpt-4o-mini, for example, this returns:

{
    "answer-short": "The Manhattan Project was a World War II research and development program that produced the first atomic bombs, led by the United States with help from the UK and Canada, overseen by Major General Leslie Groves and physicist Robert Oppenheimer.",
    "answer-short-french": "Le Projet Manhattan était un programme de recherche et développement de la Seconde Guerre mondiale qui a produit les premières bombes atomiques, dirigé par les États-Unis avec l'aide du Royaume-Uni et du Canada, sous la supervision du général Leslie Groves et du physicien Robert Oppenheimer.",
    "answer-short-eli5": "The Manhattan Project was a secret and important project during World War II where scientists worked together to make the first big bombs that could make huge explosions, which changed the world."
}
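Escaping the schema by hand for the shell, as in the command above, is error-prone. When the query is built programmatically, the schema can be URL-encoded instead. The following is a minimal Java sketch; the class JsonSchemaParam is hypothetical, and the shortened schema is only for illustration.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Sketch only: produces "<propertyPrefix>.json_schema=<url-encoded schema>".
class JsonSchemaParam {

    static String encode(String propertyPrefix, String schemaJson) {
        return propertyPrefix + ".json_schema="
                + URLEncoder.encode(schemaJson, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Shortened version of the schema shown earlier in this section.
        String schema = "{\"type\":\"object\","
                + "\"properties\":{\"answer-short\":{\"type\":\"string\"}},"
                + "\"required\":[\"answer-short\"],"
                + "\"additionalProperties\":false}";
        System.out.println(encode("llm", schema));
    }
}
```

The resulting string can be appended to the query string of the request alongside the other parameters.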

This can also be leveraged for automated Document Enrichment during ingestion. With this approach, the json_schema is automatically generated based on the Vespa schema (and your prompt).

Query profiles

In all the examples above, parameters were sent along with each query. Vespa also supports query profiles, which are named collections of search parameters. This frees the client from having to manage and send a large number of parameters, and enables the request parameters for a use case to be changed without having to change the client.

Custom language model components

Vespa also allows you to create your own language model components. This is useful in cases where you want to use a language model that is not supported as a local LLM through llama.cpp, or if you want to use an external LLM service that is incompatible with the OpenAI API.

To create your own language model component, you need to implement the ai.vespa.llm.LanguageModel interface. A minimal example is shown below:

// Copyright Vespa.ai. Licensed under the terms of the Apache 2.0 license. See LICENSE in the project root.
package ai.vespa.test;

import ai.vespa.llm.InferenceParameters;
import ai.vespa.llm.completion.Completion;
import ai.vespa.llm.completion.Prompt;

import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.function.Consumer;

public class MockLanguageModel implements ai.vespa.llm.LanguageModel {
    private final MockLanguageModelConfig config;

    public MockLanguageModel(MockLanguageModelConfig config) {
        this.config = config;
    }

    @Override
    public List<Completion> complete(Prompt prompt, InferenceParameters params) {
        var stringBuilder = new StringBuilder();

        for (int i = 0; i < config.repetitions(); i++) {
            stringBuilder.append(prompt.asString());
            
            if (i < config.repetitions() - 1) {
                stringBuilder.append(" ");
            }
        }

        return List.of(Completion.from(stringBuilder.toString().trim()));
    }

    @Override
    public CompletableFuture<Completion.FinishReason> completeAsync(Prompt prompt,
                                                                    InferenceParameters params,
                                                                    Consumer<Completion> consumer) {
        throw new UnsupportedOperationException();
    }
}

You can also create a config definition that will make your component configurable through the services.xml file.

Example of a minimal config definition:

# Copyright Vespa.ai. Licensed under the terms of the Apache 2.0 license. See LICENSE in the project root.
namespace=ai.vespa.test
package=ai.vespa.test

repetitions int default=1

See also the developer guide for more information on how to create your own components.

Creating your own searchers in Java

The examples above use the LLMSearcher class. You can create your own LLM searcher in Java by either explicitly injecting the connection component or subclassing the LLMSearcher. Please refer to Searcher Development or Document Processor Development for more information on creating your own components.

Note that it should not be necessary to create your own components in Java to use this functionality.