Please refer to Large Language Models in Vespa for an introduction to using LLMs in Vespa.
Retrieval-Augmented Generation (RAG) is a technique that merges retrieval systems with generative models to enhance language model outputs. It works by first using a retrieval system like Vespa to fetch relevant documents based on an input query, and then a generative model, like an LLM, to generate more contextually relevant responses. This method allows language models to access up-to-date or specific domain knowledge beyond their training, improving performance in tasks such as question answering and dynamic content creation.
In Vespa, the RAGSearcher first performs the query as specified by the user, creates a prompt based on the results, and queries the language model to generate a response.
For a quick start, check out the RAG sample app which demonstrates using either an external LLM service or a local LLM.
In services.xml, specify your LLM connection and the RAGSearcher:
<services version="1.0">
    <container id="default" version="1.0">
        ...
        <component id="openai" class="ai.vespa.llm.clients.OpenAI">
            <!-- Configure as required -->
        </component>
        <search>
            <chain id="rag" inherits="vespa">
                <searcher id="ai.vespa.search.llm.RAGSearcher">
                    <config name="ai.vespa.search.llm.llm-searcher">
                        <providerId>openai</providerId>
                    </config>
                </searcher>
            </chain>
        </search>
        ...
    </container>
</services>
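The RAGSearcher is not tied to the OpenAI client; the providerId only has to match the id of an LLM component defined in services.xml. As a sketch, pointing it at a locally running model instead could look like this, assuming the LocalLLM client class described in LLMs in Vespa (see that page for its exact configuration options):
<component id="local" class="ai.vespa.llm.clients.LocalLLM">
    <!-- model configuration goes here, see LLMs in Vespa -->
</component>
...
<config name="ai.vespa.search.llm.llm-searcher">
    <providerId>local</providerId>
</config>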
As mentioned in LLMs in Vespa, you can call this chain using the Vespa CLI:
$ vespa query \
--header="X-LLM-API-KEY:..." \
query="what was the manhattan project?" \
searchChain=rag \
format=sse
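If you are not using the Vespa CLI, a roughly equivalent plain-HTTP request against the query API is sketched below, assuming Vespa is running locally on port 8080:
$ curl -X POST \
    -H "Content-Type: application/json" \
    -H "X-LLM-API-KEY: ..." \
    -d '{
          "query": "what was the manhattan project?",
          "searchChain": "rag",
          "format": "sse"
        }' \
    http://localhost:8080/search/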
Note the use of the query parameter here. In LLMs in Vespa, we used a prompt parameter to set up the prompt to send to the LLM. You can do that with the RAGSearcher as well, but then no actual query is run in Vespa. For Vespa to run a search, you need to specify a yql or query parameter. When you use query, the text is used both as the query text for document retrieval and in the prompt sent to the LLM, as we will see below.
Indeed, with the RAGSearcher you can use any type of search in Vespa, including text search based on BM25 and advanced approximate vector search. This makes the retrieval part of RAG very flexible.
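For example, a hybrid query that combines BM25 text matching with approximate nearest neighbor search could look like the sketch below. This assumes the schema has a dense tensor field named embedding, that an embedder is configured in services.xml, and that a rank profile named hybrid exists; none of these are part of the text search tutorial used in the examples that follow:
$ vespa query \
    --header="X-LLM-API-KEY:..." \
    yql="select title,body from msmarco where userQuery() or ({targetHits:10}nearestNeighbor(embedding, e))" \
    "input.query(e)=embed(@query)" \
    query="what was the manhattan project?" \
    ranking=hybrid \
    searchChain=rag \
    format=sse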
Based on the query, Vespa will retrieve a set of documents. The RAGSearcher will create a context from these documents looking like this:
field1: ...
field2: ...
field3: ...
field1: ...
field2: ...
field3: ...
...
Here, field1 and so on are the actual fields as returned from the search. For instance, the text search tutorial defines a document schema consisting of the fields id, title, url, and body. If you only want to include the title and body fields for use in the context, you can issue a query like this:
$ vespa query \
--header="X-LLM-API-KEY:..." \
yql="select title,body from msmarco where userQuery()" \
query="what was the manhattan project?" \
searchChain=rag \
format=sse
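The fields in the yql select clause determine which fields are available for the context. For reference, a minimal sketch of what such a schema could look like (the actual schema in the text search tutorial may differ in its details):
schema msmarco {
    document msmarco {
        field id type string {
            indexing: summary | attribute
        }
        field title type string {
            indexing: index | summary
            index: enable-bm25
        }
        field url type string {
            indexing: index | summary
        }
        field body type string {
            indexing: index | summary
            index: enable-bm25
        }
    }
    fieldset default {
        fields: title, body
    }
}
With a default fieldset like this, userQuery() matches the query text against the title and body fields.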
The actual prompt that will be sent to the LLM will, by default, look like this:
{context}
{@prompt or @query}
where {context} is as given above, @prompt is replaced with the prompt query parameter if given, and @query is replaced with the user query if given. This means you can customize the actual prompt by passing in a prompt parameter, and thus distinguish between what is searched for in Vespa and what is asked of the LLM.
For instance:
$ vespa query \
--header="X-LLM-API-KEY:..." \
yql="select title,body from msmarco where userQuery()" \
query="what was the manhattan project?" \
prompt="{context} @query Be as concise as possible." \
searchChain=rag \
format=sse
will result in a prompt like this:
title: <title of first document>
body: <body of first document>
title: <title of second document>
body: <body of second document>
<rest of documents>
what was the manhattan project? Be as concise as possible.
Note that if your prompt does not contain {context}, the context will automatically be prepended to your prompt. However, if @query is not found in the prompt, it will not automatically be added to the prompt.
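For example, the following prompt contains @query but not {context}; the retrieved documents are then prepended automatically, while @query is replaced by the query text:
$ vespa query \
    --header="X-LLM-API-KEY:..." \
    yql="select title,body from msmarco where userQuery()" \
    query="what was the manhattan project?" \
    prompt="Answer the following question using only the information above: @query" \
    searchChain=rag \
    format=sse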
Please be advised that all documents returned by Vespa will be used in the context. Most LLMs have some form of limit on how large the prompt can be, and LLM services typically charge per query based on the number of tokens in both input and output. To reduce context size, it is important to control the number of results by using the hits query parameter, and, as in the query above, to limit the fields to only what is strictly required.
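For example, to build the context from only the three best hits, and only their title and body fields:
$ vespa query \
    --header="X-LLM-API-KEY:..." \
    yql="select title,body from msmarco where userQuery()" \
    query="what was the manhattan project?" \
    hits=3 \
    searchChain=rag \
    format=sse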
To debug the prompt, i.e. what is actually sent to the LLM, you can use the traceLevel query parameter and set it to a value larger than 0:
$ vespa query \
--header="X-LLM-API-KEY:..." \
query="what was the manhattan project?" \
searchChain=rag \
format=sse \
traceLevel=1
event: prompt
data: {"prompt":"<the actual prompt sent to the LLM>"}
event: token
data: {"token":"<first token of response>"}
...