Adding support for special tokens

This document describes some additional features for doing extra processing of text during indexing and query processing.

Document Processing

Before indexing the input documents, it is possible to modify the data using the document processing framework. It is highly recommended that you look at this if you need to reformat, strip, or split data, or apply any other kind of transform. The framework handles load balancing and error handling, and makes it easy to do the processing at the document level.

Specialtoken support

Normally, each searchable token is found by considering blocks of consecutive word characters. For some applications it may be desirable to allow a few extra tokens containing non-word characters. By enabling Specialtoken Support, you will be able to both index and search for such tokens.

In addition, for some tokens there may be several different ways of writing essentially the same thing. Using the Specialtoken Support, you may add replacements to the tokens, allowing you to map them to arbitrary strings, for instance a common base form or a form without non-word characters.

Configuration

Add a specialtokens config to services.xml. The config should specify a token list called "default", containing a list of tokens. For each token you may also specify an optional replacement, which is used to translate the specified token to the actual query term. Check specialtokens.def for details. Example configuration:

<?xml version='1.0' encoding='UTF-8'?>
<services version='1.0'>
  <config name='vespa.configdefinition.specialtokens'>
     <tokenlist>
       <item>
         <name>default</name>
         <tokens>
           <item>
             <token>c++</token>
           </item>
           <item>
             <token>wal-mart</token>
             <replace>walmart</replace>
           </item>
           <item>
             <token>.net</token>
           </item>
         </tokens>
       </item>
    </tokenlist>
  </config>

  ...

</services>
Note that all special tokens must be lower-cased.

Using Special Tokens

When a specialtokens config is present, it is used behind the scenes during both indexing and query processing. There is no need to enable it for particular fields, or to indicate the need for special token handling in the query input.
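
The effect can be illustrated with a small Python sketch. This is not Vespa's implementation; it is a simplified model that assumes text is lower-cased before matching and that the longest special token wins. The names SPECIAL_TOKENS and tokenize are hypothetical, and the token list mirrors the example configuration above.

```python
import re

# Hypothetical in-memory copy of the "default" token list from the example
# configuration: token -> replacement (None means "keep the token as-is").
SPECIAL_TOKENS = {
    "c++": None,
    "wal-mart": "walmart",
    ".net": None,
}


def tokenize(text, special_tokens=SPECIAL_TOKENS):
    """Split text into lower-cased tokens, recognizing special tokens first.

    Simplified model (an assumption, not Vespa's algorithm): at each position
    a special token is preferred over a plain run of word characters.
    """
    # Try longer special tokens first so overlapping entries resolve to the
    # longest match.
    specials = sorted(special_tokens, key=len, reverse=True)
    alternatives = [re.escape(t) for t in specials] + [r"\w+"]
    pattern = re.compile("|".join(alternatives))

    tokens = []
    for match in pattern.finditer(text.lower()):
        token = match.group(0)
        replacement = special_tokens.get(token)
        # Emit the replacement when one is configured, else the token itself.
        tokens.append(replacement if replacement else token)
    return tokens


# The same function stands in for both the indexing side and the query side.
indexed_terms = tokenize("I use C++ and .NET at Wal-Mart")
query_terms = tokenize("walmart")
```

In this sketch, the document text yields the terms i, use, c++, and, .net, at, walmart, so a query for either "wal-mart" or "walmart" produces a term that matches the indexed one. With an empty token list, "C++" would be reduced to the single word token c.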

Query Phrasing

Another option for processing text is to use query phrasing, which makes it possible to search for phrases, e.g. New York and Rolling Stones. This is described in query phrasing.