Federation

The Vespa Container allows multiple sources of data to be federated to a common search service. The sources may be search clusters belonging to the same application or external services, backed by Vespa or any other kind of service. The container may be used as a pure federation platform by setting up a system consisting solely of container nodes federating to external services. This document gives a short introduction to federation, explains how to create an application package that does federation, and shows what support is available for choosing the sources for a given query and for producing the final result from the query and the source-specific results.

Federation allows users to access data from multiple sources of various kinds through one interface. This is useful to:

  • enrich the results returned from an application with auxiliary data, like finding appropriate images to accompany news articles.
  • provide more comprehensive results by finding data from alternative sources in the cases where the application has none, like backfilling web results.
  • create applications whose main purpose is not to provide access to some data set but to provide users or frontend applications a single starting point to access many kinds of data from various sources. Examples are browse pages created dynamically for any topic by pulling together data from external sources.

The main tasks in creating a federation solution are:

  1. creating connectors to the various sources
  2. selecting the data sources which will receive a given query
  3. rewriting the received request to an executable query returning the desired data from each source
  4. creating the final result by selecting from, organizing and combining the returned data from each selected source

The container aids with these tasks by providing a way to organize a federated execution as a set of search chains which can be configured through the application package. Read the Container intro and Chained components before proceeding. See also Searching multiple document types and the com.yahoo.search.federation Javadoc.

Configuring Providers

A provider is a search chain that is responsible for the actual connection to a physical service. Create a new provider called yahoo-search as follows:

<search>
  <provider id="yahoo-search" />
</search>

As expected, this gives you an empty search chain called yahoo-search. However, an empty search chain always returns an empty result set.

So how can we set up a provider that actually talks to an external service? We have to specify two pieces of information:

  1. Which nodes it should talk to.
  2. How it should talk to them.

We specify which nodes it should talk to by giving a list of nodes, and how to talk to them by specifying their type:

<search>
  <provider id="yahoo-search" type="vespa">
    <nodes>
      <node host="[host]" port="[port]" />
    </nodes>
  </provider>
</search>

Behind the scenes, this adds one or more searchers to this search chain that know how to talk to a Vespa cluster. Use type=local to talk to a search cluster which is part of the same application. If no type is given, the search chain must contain a bundled searcher which talks to the backend service.

Configuring Sources

A provider might be able to give us different types of results. A source is a set of search chains that represents a way to get some particular kind of results from a set of providers.

Suppose that we want to retrieve two kinds of results from yahoo-search: web results and Java API documentation:

<search>
  <provider id="yahoo-search" type="vespa">
    <nodes>
      <node host="[host]" port="[port]"/>
    </nodes>

    <source id="web" />
    <source id="java-api">
      <searcher id="com.yahoo.example.JavaApiSearcher" />
    </source>
  </provider>
</search>

This results in two source search chains being created, web@yahoo-search and java-api@yahoo-search. Each of them constitutes a source, namely web and java-api respectively. As the example suggests, these search chains are named after the source and the enclosing provider. The at sign in the name should be read as "in", so web@yahoo-search should be read as "web in yahoo-search".

The JavaApiSearcher is responsible for modifying the query so that we only get hits from the Java API documentation. We added this searcher directly inside the source element. Recall that source search chains and providers are themselves search chains, so all the options for configuring regular search chains are also available for them.

How do the web@yahoo-search and java-api@yahoo-search source search chains use the yahoo-search provider to send queries to the external service? Internally, the source search chains inherit from the enclosing provider. Since the provider contains searchers that know how to talk to the external service, the sources will also contain the same searchers. As an example, consider the web@yahoo-search search chain: it contains exactly the same searchers as the yahoo-search search chain.

The provider search chain yahoo-search is not modified by adding sources. To verify this, try to send queries to the three search chains yahoo-search, web@yahoo-search and java-api@yahoo-search.
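
As a rough illustration of the naming and inheritance rules above, the following plain-Java sketch derives the search chains that result from one provider and its sources. It does not use the Vespa API: the class SourceChainModel, its methods, and the searcher name VespaBackendSearcher used in the usage note are hypothetical and exist only for this example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical model of provider and source search chains; not part of the Vespa API. */
final class SourceChainModel {

    /** Names a source search chain after the source and its enclosing provider. */
    static String chainName(String sourceId, String providerId) {
        return sourceId + "@" + providerId;
    }

    /**
     * Maps each resulting chain name to the searchers it contains.
     * A source chain inherits the provider's searchers and adds its own.
     * The order of searchers within a list carries no meaning in this sketch.
     */
    static Map<String, List<String>> chains(String providerId,
                                            List<String> providerSearchers,
                                            Map<String, List<String>> sourceSearchers) {
        Map<String, List<String>> chains = new LinkedHashMap<>();
        // The provider chain itself is not modified by adding sources.
        chains.put(providerId, new ArrayList<>(providerSearchers));
        for (Map.Entry<String, List<String>> source : sourceSearchers.entrySet()) {
            List<String> searchers = new ArrayList<>(source.getValue());
            searchers.addAll(providerSearchers);
            chains.put(chainName(source.getKey(), providerId), searchers);
        }
        return chains;
    }
}
```

Called with the provider yahoo-search (holding a single VespaBackendSearcher), a web source without searchers and a java-api source with JavaApiSearcher, this model yields the three chains yahoo-search, web@yahoo-search and java-api@yahoo-search, where web@yahoo-search holds the same searchers as the provider.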

Multiple Providers per Source

You can create a source that consists of source search chains from several providers. Effectively, this lets you vary which provider should be used to satisfy each request to the source.

<search>
  <provider id="yahoo-search" type="vespa">
    <nodes>
      <node host="[host]" port="[port]"/>
    </nodes>
    <source id="common-search" />
  </provider>

  <provider id="news-search" type="vespa">
    <nodes>
      <node host="[host]" port="[port]"/>
    </nodes>
    <source idref="common-search" />
  </provider>
</search>

Here, the two source search chains common-search@news-search and common-search@yahoo-search constitute a single source, common-search. The source search chains using the idref attribute are called participants, while the ones using the id attribute are called leaders. Each source must consist of a single leader and zero or more participants.

By default, only the leader search chain is used when federating to a source. To use one of the participants instead, set the sources and source.[source].provider query parameters:

http://[host]:[port]/?sources=common-search&source.common-search.provider=news-search
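
The leader and participant rule can be sketched in plain Java as follows. This is only a model of the behavior described above, not the Vespa implementation: the class SourceProviderResolver and its methods are hypothetical, and rejecting a provider that does not take part in the source is an assumption of this sketch.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical model of leader/participant resolution; not part of the Vespa API. */
final class SourceProviderResolver {

    private final Map<String, String> leaders = new HashMap<>();           // source -> leader provider
    private final Map<String, Set<String>> participants = new HashMap<>(); // source -> participant providers

    /** Registers a provider that declares the source with the id attribute. */
    void addLeader(String source, String provider) {
        leaders.put(source, provider);
    }

    /** Registers a provider that declares the source with the idref attribute. */
    void addParticipant(String source, String provider) {
        participants.computeIfAbsent(source, key -> new HashSet<>()).add(provider);
    }

    /** Returns the name of the source search chain to use for the given request parameters. */
    String resolve(String source, Map<String, String> requestParameters) {
        String leader = leaders.get(source);
        if (leader == null) {
            throw new IllegalArgumentException("Unknown source: " + source);
        }
        String requested = requestParameters.get("source." + source + ".provider");
        if (requested == null || requested.equals(leader)) {
            return source + "@" + leader; // default: the leader is used
        }
        if (!participants.getOrDefault(source, Set.of()).contains(requested)) {
            // Assumption of this sketch: a provider outside the source is rejected.
            throw new IllegalArgumentException(requested + " does not provide " + source);
        }
        return source + "@" + requested;
    }
}
```

With yahoo-search as leader and news-search as participant of common-search, a request without the provider parameter resolves to common-search@yahoo-search, while the request shown above resolves to common-search@news-search.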

Federation

Now we can search both the web and the Java API documentation at the same time, and get a combined result set back. We achieve this by setting up a federation searcher:

<search>
  <provider id="yahoo-search" type="vespa">
    <nodes>
      <node host="[host]" port="[port]"/>
    </nodes>
    <source id="web" />
    <source id="java-api">
      <searcher id="com.yahoo.example.JavaApiSearcher" />
    </source>
  </provider>
  <chain id="combined">
    <federation id="combinator">
      <source idref="web" />
      <source idref="java-api" />
    </federation>
  </chain>
</search>

Inside the federation element, we list the sources we want to use. Do not let the name source fool you: if it behaves like a source, you can use it as a source (that is, all types of search chains, including providers, are accepted). As an example, try replacing the web reference with yahoo-search.

When searching, select a subset of the sources specified in the federation element by specifying the sources query parameter.
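
A minimal plain-Java sketch of this selection step is given below. It is not the Vespa implementation: the class SourcesParameter is hypothetical, the parameter value is assumed to be a comma-separated list of names, and names that are not configured in the federation element are simply ignored here, which may differ from the real behavior.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical model of the sources query parameter; not part of the Vespa API. */
final class SourcesParameter {

    /**
     * Returns the configured sources to query for a request.
     * A null or blank parameter value selects every configured source.
     */
    static List<String> select(List<String> configuredSources, String sourcesParameter) {
        if (sourcesParameter == null || sourcesParameter.trim().isEmpty()) {
            return new ArrayList<>(configuredSources);
        }
        // Assumption: the value is a comma-separated list of source names.
        List<String> requested = Arrays.asList(sourcesParameter.trim().split("\\s*,\\s*"));
        List<String> selected = new ArrayList<>();
        for (String source : configuredSources) {
            if (requested.contains(source)) {
                selected.add(source); // keep the configured order
            }
        }
        return selected;
    }
}
```

For the combinator example above, a request with sources=java-api would query only the java-api source in this model.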

Built-in Federation

To get you started, the built-in search chains native and vespa contain a federation searcher named federation. This searcher has been configured to federate to:

  • All sources
  • All providers that do not contain a source

If configuring your own federation searcher, you are not limited to a subset of these sources: you can use any provider, source or search chain.

Inheriting default Sources

To get the same sources as the built-in federation searcher, inherit the default source set:

<search>
  <chain id="my-chain">
    <federation id="combinator">
      <source-set inherits="default" />
      ...
    </federation>
  </chain>
</search>

Timeout behavior

What if we want to limit how much time a provider is allowed to use to answer a query? Set a timeout in the provider's federationoptions:

<search>
  <provider id="yahoo-search" type="vespa">
    <federationoptions timeout="100 ms" />
    <nodes>
      <node host="[host]" port="[port]"/>
    </nodes>
    <source id="web" />
    <source id="java-api">
      <searcher id="com.yahoo.example.JavaApiSearcher" />
    </source>
  </provider>
</search>

The provider search chain is then limited to 100 ms for executing each query. The federation layer allows all providers to continue until the non-optional provider with the longest timeout is finished or canceled.

In some cases it is useful to be able to keep executing the request to a provider longer than we are willing to wait for it in that particular query. This allows us to populate caches inside sources which can only meet the timeout after caches are populated. To use this option, specify a request timeout for the provider:

<search>
  <provider id="yahoo-search" type="vespa">
    <federationoptions timeout="100 ms" requestTimeout="10000 ms" />
    ...
  </provider>
</search>

Non-essential Providers

Now let us add a provider that retrieves ads:

<search>
  <provider id="ads">
    <nodes>
      <node host="[host]" port="[port]" />
    </nodes>
  </provider>
</search>

Suppose that it is more important to return the result to the user as fast as possible than to retrieve ads. To signal this, we mark the ads provider as optional:

<search>
  <provider id="ads">
    <federationoptions optional="true" />
    <nodes>
      <node host="[host]" port="[port]" />
    </nodes>
  </provider>
</search>

The federation searcher will then only wait for ads as long as it waits for mandatory providers. If the ads are available in time, they are used; otherwise they are dropped.

If only optional providers are selected for Federation, they will all be treated as mandatory. Otherwise, they would not get a chance to return any results.
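
The interaction between timeouts and optional providers can be explored with the deterministic plain-Java model below. It is a simplified interpretation of the rules in this section, not the Vespa implementation: the class FederationWaitModel is hypothetical, response times are simulated numbers rather than real calls, and the model assumes that federation waits for each mandatory provider until it answers or its own timeout expires.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical, simplified model of how long federation waits; not part of the Vespa API. */
final class FederationWaitModel {

    /** A selected provider with its federation options and a simulated response time. */
    static final class Provider {
        final String name;
        final long timeoutMs;
        final boolean optional;
        final long latencyMs;

        Provider(String name, long timeoutMs, boolean optional, long latencyMs) {
            this.name = name;
            this.timeoutMs = timeoutMs;
            this.optional = optional;
            this.latencyMs = latencyMs;
        }
    }

    /** Returns the names of the providers whose results make it into the final result. */
    static List<String> providersInResult(List<Provider> selected) {
        // If only optional providers are selected, all of them are treated as mandatory.
        boolean onlyOptional = selected.stream().allMatch(p -> p.optional);

        // Assumption: wait for each mandatory provider until it answers or its timeout expires.
        long waitMs = 0;
        for (Provider p : selected) {
            if (onlyOptional || !p.optional) {
                waitMs = Math.max(waitMs, Math.min(p.latencyMs, p.timeoutMs));
            }
        }

        List<String> included = new ArrayList<>();
        for (Provider p : selected) {
            // A provider contributes only if it answered within its own timeout
            // and before federation stopped waiting.
            boolean answeredInTime = p.latencyMs <= p.timeoutMs && p.latencyMs <= waitMs;
            if (answeredInTime) {
                included.add(p.name);
            }
        }
        return included;
    }
}
```

In this model, an optional ads provider answering after 90 ms is dropped when the only mandatory provider answers after 80 ms, but it is kept when it is the only provider selected.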

Federation options inheritance

Sources automatically use the same federation options as the enclosing provider. To change this, override one or more of the federation options in the source:

<search>
  <provider id="yahoo-search" type="vespa">
    <federationoptions timeout="100 ms" />
    <nodes>
      <node host="[host]" port="[port]"/>
    </nodes>
    <source id="web">
      <federationoptions timeout="200 ms" />
    </source>
    <source id="java-api">
      <searcher id="com.yahoo.example.JavaApiSearcher" />
    </source>
  </provider>
</search>

You can use a single source in different federation searchers. If you send queries with different costs to the same source from different federation searchers, you might also want to override the federation options where the source is referenced:

<search>
  <provider id="yahoo-search" type="vespa">
    <federationoptions timeout="100 ms" />
    <nodes>
      <node host="[host]" port="[port]"/>
    </nodes>
    <source id="web">
      <federationoptions timeout="200 ms" />
    </source>
    <source id="java-api">
      <searcher id="com.yahoo.example.JavaApiSearcher" />
    </source>
  </provider>
  <chain id="combined">
    <federation id="combinator">
      <source idref="web">
        <federationoptions timeout="2.5 s" />
      </source>
      <source idref="java-api" />
    </federation>
  </chain>
</search>
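
The override order described in this section (the reference in a federation searcher, then the source, then the provider) can be sketched in plain Java as follows. The class FederationTimeouts is hypothetical and not part of the Vespa API, and the parser only covers the two duration formats that appear in the examples here.

```java
/** Hypothetical model of federation option overrides; not part of the Vespa API. */
final class FederationTimeouts {

    /** Parses the two duration formats used in the examples: "<number> ms" and "<number> s". */
    static long toMillis(String duration) {
        String[] parts = duration.trim().split("\\s+");
        if (parts.length != 2) {
            throw new IllegalArgumentException("Expected '<number> <unit>': " + duration);
        }
        double value = Double.parseDouble(parts[0]);
        switch (parts[1]) {
            case "ms": return Math.round(value);
            case "s":  return Math.round(value * 1000);
            default:   throw new IllegalArgumentException("Unsupported unit: " + parts[1]);
        }
    }

    /**
     * Picks the most specific timeout that is set.
     * Pass null for the source or reference level when no override is configured there;
     * this sketch requires the provider level to be set.
     */
    static long effectiveTimeoutMs(String providerTimeout, String sourceTimeout, String referenceTimeout) {
        if (referenceTimeout != null) return toMillis(referenceTimeout);
        if (sourceTimeout != null) return toMillis(sourceTimeout);
        return toMillis(providerTimeout);
    }
}
```

Applied to the configuration above, the web source gets 2500 ms when queried through combinator and 200 ms elsewhere, while java-api keeps the provider's 100 ms.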

Caching

Most provider types support caching the results retrieved from the external service. To configure the caches for each provider, use cachesize:

<search>
  <provider id="yahoo-search" type="vespa" cachesize="200M">
    <nodes>
      <node host="[host]" port="[port]" />
    </nodes>
  </provider>
</search>

Here we have reserved 200 megabytes of cache for this provider. The actual cache implementation depends on the provider type.

Disabling redirects

Prevent the vespa federation searcher from following redirects using followRedirects:

<search>
  <provider id="yahoo-search" type="vespa">
    <nodes>
      <node host="[host]" port="[port]"/>
    </nodes>
    <config name="search.federation.provider">
      <followRedirects>false</followRedirects>
    </config>
  </provider>
</search>

Selecting Search Chains programmatically

If we have complicated rules for when a search chain should be used, we can select search chains programmatically instead of setting up sources under federation in services.xml. The selection code is implemented as a TargetSelector. This TargetSelector is used by registering it on a federation searcher.

package com.yahoo.example;

import com.google.common.base.Preconditions;
import com.yahoo.component.chain.Chain;
import com.yahoo.processing.execution.chain.ChainRegistry;
import com.yahoo.search.Query;
import com.yahoo.search.Result;
import com.yahoo.search.result.Hit;
import com.yahoo.search.Searcher;
import com.yahoo.search.federation.selection.FederationTarget;
import com.yahoo.search.federation.selection.TargetSelector;
import com.yahoo.search.searchchain.model.federation.FederationOptions;

import java.util.Arrays;
import java.util.Collection;

public class MyTargetSelector implements TargetSelector<Object> {
    @Override
    public Collection<FederationTarget<Object>> getTargets(Query query, ChainRegistry<Searcher> searcherChainRegistry) {
        Chain<Searcher> searchChain = searcherChainRegistry.getComponent("my-chain");
        Preconditions.checkNotNull(searchChain, "No search chain named 'my-chain' exists in services.xml");

        return Arrays.asList(
                new FederationTarget<>(searchChain, new FederationOptions(), null));
    }

    @Override
    public void modifyTargetQuery(FederationTarget<Object> target, Query query) {
        query.setHits(10);
    }

    @Override
    public void modifyTargetResult(FederationTarget<Object> target, Result result) {
        for (Hit hit: result.hits()) {
            hit.setField("my-field", "hello-world");
        }
    }
}

The target selector chooses search chains for the federation searcher. In this example, MyTargetSelector.getTargets returns a single chain named "my-chain" that has been set up in services.xml.

Before executing each search chain, the federation searcher allows the target selector to modify the query by calling modifyTargetQuery. In the example, the number of hits to retrieve is set to 10.

After the search chain has been executed, the federation searcher allows the target selector to modify the result by calling modifyTargetResult. In the example, each hit gets a field called "my-field" with the value "hello-world".

Configure a federation searcher to use a target selector in services.xml. Only a single target selector is supported.

<search>
  <chain id="my-chain">
    <federation id="my-federation">
      <target-selector id="com.yahoo.example.MyTargetSelector" bundle="MyBundle" />
    </federation>
  </chain>
</search>

We can also set up both a target-selector and normal sources. The federation searcher will then send queries both to the programmatically selected sources and to those that would normally be selected without the target selector:

<search>
  <chain id="my-chain">
    <federation id="my-federation">
      <target-selector id="com.yahoo.example.MyTargetSelector" bundle="MyBundle" />
      <source idref="java-api" />
      <source-set inherits="default" />
      ...
    </federation>
  </chain>
</search>

Example: Setting up a Federated Service

A federation application is created by providing custom searcher components performing the basic federation tasks and combining these into chains in a federation setup in services.xml. For example, this is a complete configuration which sets up a cluster of container nodes (consisting of one node) which federates to another Vespa service (news) and to some web service:

<?xml version="1.0" encoding="utf-8" ?>
<services version="1.0">
    <admin version="2.0">
        <adminserver hostalias="node1"/>
    </admin>

    <container id="test" version="1.0">
        <nodes>
            <node hostalias="node1"/>
        </nodes>
        <search>
            <provider id="news" type="vespa">
                <nodes>
                    <node host="[host]" port="[port]"/>
                </nodes>
                <searcher id="com.yahoo.example.NewsCustomerIdSearcher"/>
            </provider>
            <provider id="webService">
                <nodes>
                    <node host="[host]" port="[port]" />
                </nodes>
                <searcher id="com.yahoo.example.ExampleProviderSearcher" />
            </provider>
        </search>
    </container>
</services>

This creates the following configuration of search chains: each provider is a search chain ending in a Searcher that forwards the query to a remote service. In addition there is a main chain (included by default) ending in a FederationSearcher, which by default forwards the query to all the providers in parallel. The provider chains return their results upwards to the federation searcher, which merges them into a complete result that is returned up the main chain.

This services file, an implementation of the example classes (see below), and a hosts.xml listing the container nodes are all that is needed to set up and deploy an application federating to multiple sources. For a reference to these XML sections, see the chains reference.

The following sections outline how this can be elaborated into a solution producing more user-friendly federated results.

Creating a Provider

The container includes a provider searcher which can talk to a remote Vespa instance. Include it in the provider chain by setting type=vespa on the provider as shown above. Use type=local to access a search cluster which is part of the same application.

To talk to an external service which is not a Vespa instance, create a custom provider implementation. This example uses some included utility classes which simplify making HTTP clients:

package com.yahoo.example;

import com.yahoo.component.ComponentId;
import com.yahoo.search.*;
import com.yahoo.search.federation.http.*;
import com.yahoo.search.result.Hit;
import com.yahoo.component.chain.dependencies.After;
import com.yahoo.statistics.Statistics;
import com.yahoo.search.federation.ProviderConfig;
import com.yahoo.search.cache.QrBinaryCacheConfig;
import com.yahoo.search.cache.QrBinaryCacheRegionConfig;

import java.io.*;
import java.util.*;

@After("*")
public class ExampleProviderSearcher extends ConfiguredHTTPProviderSearcher {

    public ExampleProviderSearcher(ComponentId id, ProviderConfig config, QrBinaryCacheConfig c, QrBinaryCacheRegionConfig r, Statistics statistics) {
        super(id, config, c, r, statistics);
    }

    /** Convenience for testing */
    public ExampleProviderSearcher(String idString, String host, int port, String path, Statistics statistics) {
        super(idString, host, port, path, statistics);
    }

    @Override
    public void unmarshal(InputStream inputStream, long contentLength, Result result) throws IOException {
        BufferedReader hitReader = new BufferedReader(
                new InputStreamReader(inputStream, "UTF-8"));

        String hit;
        while ((hit = hitReader.readLine()) != null) {
            result.hits().add(unmarshalHit(hit));
        }
    }

    private Hit unmarshalHit(String hitString) {
        String[] fields = hitString.split("\t");

        Hit hit = new Hit(fields[0]);
        hit.setField("exampleField", fields[1]);
        return hit;
    }

    @Override
    public Map<String, String> getCacheKey(Query q) {
        Map<String, String> queryMap = new HashMap<String, String>();
        queryMap.put("query",q.getModel().getQueryString());
        return queryMap;
    }

}

This should be built as a bundle and placed in components/ in the application package.

Selecting Sources

To do the best possible job of bringing relevant data to the user, we would send every query to all sources and decide what data to include once all the results are available and we have as much information as possible at hand. In general this is not advisable because of the resource cost involved, so we must select a subset of sources based on information in the query. This is best viewed as a probabilistic optimization problem: the selected sources should be the ones with a high enough probability of being useful to offset the cost of querying them.
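
One way to make this cost/benefit view concrete is the plain-Java sketch below. It is purely illustrative and not something Vespa provides: the class SourceSelectionSketch, the probability estimates, the cost figures and the benefitIfUseful weight are all hypothetical inputs that a real application would have to derive from the query.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical cost/benefit source selection; not part of the Vespa API. */
final class SourceSelectionSketch {

    /** A candidate source with an estimated usefulness probability and a relative query cost. */
    static final class Candidate {
        final String source;
        final double usefulnessProbability;
        final double cost;

        Candidate(String source, double usefulnessProbability, double cost) {
            this.source = source;
            this.usefulnessProbability = usefulnessProbability;
            this.cost = cost;
        }
    }

    /** Selects the sources whose expected benefit exceeds the cost of querying them. */
    static List<String> select(List<Candidate> candidates, double benefitIfUseful) {
        List<String> selected = new ArrayList<>();
        for (Candidate candidate : candidates) {
            double expectedBenefit = candidate.usefulnessProbability * benefitIfUseful;
            if (expectedBenefit > candidate.cost) {
                selected.add(candidate.source);
            }
        }
        return selected;
    }
}
```

With a benefit of 1.0, a news source estimated at probability 0.8 and cost 0.2 is selected, while a web service estimated at probability 0.1 and cost 0.3 is skipped.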

Any Searcher which is involved in selecting sources or processing the entire result should be added to the main search chain, which was created implicitly in the examples above. To do this, the main chain should be created explicitly:

<?xml version="1.0" encoding="utf-8" ?>
<services version="1.0">
    <admin version="2.0">
        <adminserver hostalias="node1"/>
    </admin>
    <container id="test" version="1.0">
        <nodes>
            <node hostalias="node1"/>
        </nodes>
        <search>
            <chain id="default" inherits="native">
                <searcher id="com.yahoo.example.ResultBlender"/>
            </chain>
            <provider id="news" type="vespa">
                <nodes>
                    <node host="[host]" port="[port]"/>
                </nodes>
                <searcher id="com.yahoo.example.NewsCustomerIdSearcher"/>
            </provider>
            <provider id="webService">
                <nodes>
                    <node host="[host]" port="[port]"/>
                </nodes>
                <searcher id="com.yahoo.example.ExampleProviderSearcher" />
            </provider>
        </search>
    </container>
</services>

This adds an explicit main chain to the configuration which has one searcher, ResultBlender, in addition to those inherited from the native chain, which include the FederationSearcher. Note that if the full Vespa functionality is needed, the vespa chain should be inherited rather than native.

The chain called default will be invoked if no searchChain parameter is given in the query.

To learn more about creating Searcher components, see searcher development.

Rewriting Queries to Individual Providers

The provider searchers are responsible for accepting a Query object, translating it to a suitable request to the backend in question and deserializing the response into a Result object. There is often a need to modify the query to match the particulars of a provider before passing it on:

  • To get results from the provider which matches the determined interpretation and intent as well as possible, the query may need to be rewritten using detailed information about the provider
  • Parameters beyond the basic ones supported by each provider searcher may need to be translated to the provider
  • There may be a need for provider specific business rules

These query changes may range in complexity from setting a query parameter, through applying some source-specific information to the query, to transferring all the relevant query state into a new object representation which is consumed by the provider searcher.

This example shows a searcher adding a customer id to the news request:

package com.yahoo.example;

import com.yahoo.search.searchchain.Execution;
import com.yahoo.search.*;

public class NewsCustomerIdSearcher extends Searcher {

    public @Override Result search(Query query,Execution execution) {
        String customerId="provider.news.custid";
        if (query.properties().get(customerId)==null)
            query.properties().set(customerId,"yahoo/test");
        if (query.getTraceLevel()>=3)
            query.trace("News provider: Will use " + customerId + "=" +
                    query.properties().get(customerId),false,3);
        return execution.search(query);
    }

}

This searcher should be added to the news provider chain as shown above.

You may have noticed that we have referred to the search chains talking to a service as a provider while referring to selection of sources. The reason for making this distinction is that it is sometimes useful to treat different kinds of processing of queries and results to/from the same service as different sources. Hence, it is possible to create source search chains in addition to the provider chains in services.xml. Each such source will refer to a provider (by inheriting the provider chain) but include some searchers specific to that source. Selection and routing of the query from the federation searchers is always to sources, not providers. By default, if no source tags are added in the provider, each provider implicitly creates a source by the same name.

Processing Results

When we have selected the sources, created queries fitting to get results from each source and executed those queries, we have produced a result which contains a HitGroup per source containing the list of hits from that source. These results may be returned in XML as is, preserving the structure as XML, by requesting the page result format:

http://[host]:[port]/search/?query=test&presentation.format=page

However, this is not suitable for presenting to the user in most cases. What we want is to select the subset of hits with the highest probable utility to the user, organized in a way that maximizes the user experience. This is not an easy task, and we will not attempt to solve it here, other than noting that any solution should make use of both the information in the intent model and the information within the results from each source. It is also a highly connected optimization problem, because the utility of including some data in the result depends on what other data is included.

Here we will just use a searcher which shows how this is done in principle. It flattens the news and web service hit groups into a single list of hits, where only the highest ranked news hits are included:

package com.yahoo.example;

import com.yahoo.search.*;
import com.yahoo.search.result.*;
import com.yahoo.search.searchchain.Execution;

public class ResultBlender extends Searcher {

    public Result search(Query query,Execution execution) {
        Result result=execution.search(query);
        HitGroup news=(HitGroup)result.hits().remove("source:news");
        HitGroup webService=(HitGroup)result.hits().remove("source:webService");
        if (webService==null) return result;
        result.hits().addAll(webService.asList());
        if (news==null) return result;
        for (Hit hit : news.asList())
            if (shouldIncludeNewsHit(hit))
                result.hits().add(hit);
        return result;
    }

    private boolean shouldIncludeNewsHit(Hit hit) {
        if (hit.isMeta()) return true;
        if (hit.getRelevance().getScore()>0.7) return true;
        return false;
    }

}

The optimal result to return to the user is not necessarily one flattened list. In some cases it may be better to keep the source organization, or to pick some other organization. The page result format requested in the query above is able to represent any hierarchical organization as XML. A more realistic version of this searcher will use that to choose between some predefined layouts which the frontend in question knows how to handle, and choose some way of grouping the available hits suitable for the selected layout.

This searcher should be added to the main (default) search chain in services.xml, together with any searcher that selects sources (the order does not matter).

Unit Testing the Result Processor

Unit test example for the Searcher above:

package com.yahoo.search.example.test;

import com.yahoo.component.chain.Chain;
import com.yahoo.example.ResultBlender;
import com.yahoo.search.searchchain.*;
import com.yahoo.search.*;
import com.yahoo.search.result.*;

public class ResultBlenderTestCase extends junit.framework.TestCase {

    public void testBlending() {
        Chain<Searcher> chain=new Chain<Searcher>(new ResultBlender(),new MockBackend());
        Execution.Context context = Execution.Context.createContextStub(null);
        Result result=new Execution(chain, context).search(new Query("?query=test"));
        assertEquals(4,result.hits().size());
        assertEquals("webService:1",result.hits().get(0).getId().toString());
        assertEquals("news:1",result.hits().get(1).getId().toString());
        assertEquals("webService:2",result.hits().get(2).getId().toString());
        assertEquals("webService:3",result.hits().get(3).getId().toString());
    }

    private static class MockBackend extends Searcher {

        public @Override Result search(Query query,Execution execution) {
            Result result=new Result(query);
            HitGroup webService=new HitGroup("source:webService");
            webService.add(new Hit("webService:1",0.9));
            webService.add(new Hit("webService:2",0.7));
            webService.add(new Hit("webService:3",0.5));
            result.hits().add(webService);
            HitGroup news=new HitGroup("source:news");
            news.add(new Hit("news:1",0.8));
            news.add(new Hit("news:2",0.6));
            news.add(new Hit("news:3",0.4));
            result.hits().add(news);
            return result;
        }
    }
}

This shows how a search chain can be created programmatically, with a mock backend producing results suitable for exercising the functionality of the searcher being tested.

Passing Arguments to Web Services

Pass arguments to services from a query profile or request parameter. They are passed on in the remote HTTP request with the first two components of the name removed. The provided HTTPSearcher utility classes pick up properties following this naming scheme:

  • source.[source].[property]: the value set is passed as [property] in the request to providers invoked as this source
  • provider.[provider].[property]: the value set is passed as [property] in the request to this provider, regardless of source
  • service.[service].[property]: the value set is passed as [property] to services called from a client, e.g. when calling a service to add information to the query rather than producing a result
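
The name translation for the source and provider schemes can be sketched in plain Java as shown below. This is a model of the rule, not the code in the HTTPSearcher utility classes: the class ServiceParameters is hypothetical, the service scheme is left out, and which value wins when both schemes set the same property is not defined by this sketch.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical model of the property naming scheme; not part of the Vespa API. */
final class ServiceParameters {

    /** Returns the parameters to send to the given provider when it is invoked as the given source. */
    static Map<String, String> forRequest(Map<String, String> properties, String source, String provider) {
        String sourcePrefix = "source." + source + ".";
        String providerPrefix = "provider." + provider + ".";
        Map<String, String> parameters = new LinkedHashMap<>();
        for (Map.Entry<String, String> property : properties.entrySet()) {
            String name = property.getKey();
            // Strip the first two name components and pass the rest on as the parameter name.
            if (name.startsWith(sourcePrefix)) {
                parameters.put(name.substring(sourcePrefix.length()), property.getValue());
            } else if (name.startsWith(providerPrefix)) {
                parameters.put(name.substring(providerPrefix.length()), property.getValue());
            }
        }
        return parameters;
    }
}
```

For example, the property provider.news.custid set by NewsCustomerIdSearcher would reach the news provider as custid in this model, while properties addressed to other providers or without a matching prefix are not passed on.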