Vespa Document Processing

Vespa document processing is a flexible framework for creating reusable components, called document processors, that read and modify document operations, and for composing chains of such components to carry out the total processing needs of an application.

This has many use cases. The most common case is where a property has some input data that is to be stored in Vespa Storage, or indexed in Vespa Search, or both.

The input source, typically a crawler, a stream of incoming mail, data generated from user actions, or anything else, splits the input data into logical units called documents. A feeder application then sends these documents into a document processing chain.

This chain is an ordered list of document processors, where each processor has one specific task. Examples range from language detection, HTML removal and natural language processing to mail attachment processing, character set transcoding and image thumbnailing.

At the end of the processing chain, extracted data will typically be set in some fields in the document, which will continue into Vespa Search or Vespa Storage.
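
The idea of a chain as an ordered list of single-task processors can be sketched in plain Java. This is a simplified model, not the Vespa API: the names ChainSketch, FieldProcessor, HtmlRemover, WordCounter and runChain are hypothetical, and a document is assumed to be just a map of named fields.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Simplified model of a processing chain. NOT the Vespa API; all names are hypothetical. */
final class ChainSketch {

    /** Stand-in for a document processor: each implementation has one task. */
    interface FieldProcessor {
        void process(Map<String, Object> document);
    }

    /** Strips anything that looks like an HTML tag from the "body" field. */
    static final class HtmlRemover implements FieldProcessor {
        @Override
        public void process(Map<String, Object> document) {
            String body = (String) document.get("body");
            document.put("body", body.replaceAll("<[^>]*>", "").trim());
        }
    }

    /** Stores the number of whitespace-separated words in a new "wordCount" field. */
    static final class WordCounter implements FieldProcessor {
        @Override
        public void process(Map<String, Object> document) {
            String body = (String) document.get("body");
            int words = body.isEmpty() ? 0 : body.split("\\s+").length;
            document.put("wordCount", words);
        }
    }

    /** Runs every processor in list order against the same document. */
    static Map<String, Object> runChain(List<FieldProcessor> chain, Map<String, Object> document) {
        for (FieldProcessor processor : chain) {
            processor.process(document);
        }
        return document;
    }

    public static void main(String[] args) {
        Map<String, Object> document = new LinkedHashMap<>();
        document.put("body", "<p>Hello <b>Vespa</b> world</p>");
        // Order matters: the word count is computed on the tag-free text.
        runChain(List.of(new HtmlRemover(), new WordCounter()), document);
        System.out.println(document);
    }
}
```

Because each processor only reads and writes fields, the same processor can be reused in several chains, and the extracted values travel onward with the document.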

Read more on developing document processors. More details can be found in the javadoc.

Design goals

The framework is designed to meet these goals:

  • Developer friendliness. It must be so simple to write a simple processor that anybody can do it. This is a complete processor:
    import com.yahoo.document.*;
    import com.yahoo.docproc.*;
    
    public class ExampleDocumentProcessor extends DocumentProcessor {
    
        public Progress process(Processing processing) {
            for (DocumentOperation op : processing.getDocumentOperations()) {
                if (op instanceof DocumentPut) {
                    Document document = ((DocumentPut) op).getDocument();
                    //TODO do something to 'document' here
                } else if (op instanceof DocumentUpdate) {
                    DocumentUpdate update = (DocumentUpdate) op;
                    //TODO do something to 'update' here
                } else if (op instanceof DocumentRemove) {
                    DocumentRemove remove = (DocumentRemove) op;
                    //TODO do something to 'remove' here
                }
            }
            return Progress.DONE;
        }
    }
    
  • Gradual learning. More advanced concepts, needed for processing multiple documents or document updates, making control decisions and so on, build on and naturally extend the basic concepts learned when doing simple processing.
  • Scaling. The framework must allow cheap scaling by document size, document count and document processing clock time. The document processing framework uses a completely asynchronous architecture to allow scaling in several of these dimensions at the same time.
  • Simplicity. The core framework consists of fewer than ten classes totaling fewer than a thousand lines of source. It has one dependency: a minimal document model (a few more classes) that models documents as a named map of fields.
  • Embedding. The framework does not make any assumptions about the context in which it will run. It can be embedded in another application which handles the thread management, configuration and so on.
  • Vespa integration. The framework may also run as a Vespa service. Configuration through Vespa, logging to Vespa and remote capabilities are handled by optional add-on packages.
  • Plugin support. Document processors developed by applications can be deployed and un-deployed in the framework, run without compromising the framework instance even if they contain errors, and be binary compatible across different versions of the framework. This is handled by developing the framework in Java and loading the document processors as OSGi components.

Core Features

The framework core supports asynchronous processing, processing of one or multiple documents or document updates at the same time, document processors that make dynamic decisions about the processing flow, and passing of information between processors outside the document or document update:

  • One or more named Docproc Services may be created. One of the services is the default.
  • A service accepts subclasses of DocumentOperation for processing, which currently means DocumentPuts, DocumentUpdates and DocumentRemoves. It has a Call Stack which lists the calls to make to various Document Processors to process each DocumentOperation handed to the service.
  • Call Stacks consist of Calls, which refer to the Document Processor instance to call.
  • DocumentPuts and DocumentUpdates are processed asynchronously. The state is kept in a Processing for its duration (instead of in a thread or process). A Document Processor may make some asynchronous calls (typically to remote services) and signal to the framework that it should be called again later for the same Processing to handle the outcome of the calls.
  • A Processing contains its own copy of the Call Stack of the Docproc Service to keep track of what to call next. Document Processors may modify this Call Stack to dynamically decide the processing steps required to process a DocumentOperation.
  • A Processing may contain one or more DocumentOperations to be processed as a unit.
  • A Processing has a context, which is a Map of named values which can be used to pass arguments between processors.
  • Processings are prepared to be stored to disk to allow a very high number of ongoing long-term processings per node.
Document processing core class diagram
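
The interplay between a Processing, its private copy of the Call Stack, the context map and asynchronous progress can be sketched as follows. This is a simplified stand-alone model under stated assumptions, not the Vespa classes themselves: the types reuse names from the text for readability, while advance, RemoteLookup, Tagger and the context keys are hypothetical, and operations are represented as plain strings.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Simplified model of the core concepts. NOT the Vespa classes of the same names. */
final class ProcessingSketch {

    enum Progress { DONE, LATER }

    /** Hypothetical processor interface for this sketch. */
    interface Processor {
        Progress process(Processing processing);
    }

    /** Holds all state for one unit of work, so no thread is tied up while waiting. */
    static final class Processing {
        final Deque<Processor> callStack;                    // private copy of the service's call stack
        final Map<String, Object> context = new HashMap<>(); // named values passed between processors
        final List<String> operations = new ArrayList<>();   // stand-ins for DocumentOperations

        Processing(List<? extends Processor> serviceCallStack, List<String> operations) {
            this.callStack = new ArrayDeque<>(serviceCallStack);
            this.operations.addAll(operations);
        }
    }

    /** Runs calls until the stack is empty (returns true) or a processor answers LATER (returns false). */
    static boolean advance(Processing processing) {
        while (!processing.callStack.isEmpty()) {
            Processor next = processing.callStack.pollFirst();
            if (next.process(processing) == Progress.LATER) {
                processing.callStack.addFirst(next); // call the same processor again next time
                return false;
            }
        }
        return true;
    }

    /** Pretends to call a remote service: LATER on the first call, DONE on the second. */
    static final class RemoteLookup implements Processor {
        @Override
        public Progress process(Processing processing) {
            if (!processing.context.containsKey("lookupStarted")) {
                processing.context.put("lookupStarted", true);
                return Progress.LATER;
            }
            processing.context.put("language", "en"); // simulated outcome of the remote call
            processing.callStack.addFirst(new Tagger()); // dynamic decision: add one more step
            return Progress.DONE;
        }
    }

    /** Reads the value an earlier processor left in the context and tags every operation. */
    static final class Tagger implements Processor {
        @Override
        public Progress process(Processing processing) {
            String language = (String) processing.context.get("language");
            processing.operations.replaceAll(op -> op + " [" + language + "]");
            return Progress.DONE;
        }
    }

    public static void main(String[] args) {
        Processing processing = new Processing(List.of(new RemoteLookup()), List.of("put doc-1"));
        System.out.println(advance(processing)); // stops at the pending lookup
        System.out.println(advance(processing)); // lookup finishes, Tagger runs
        System.out.println(processing.operations);
    }
}
```

Keeping the pending state in the Processing rather than in a blocked thread is what lets a caller park it, or in principle serialize it, between the two advance calls.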

Additional Features

Outside the common core, there are optional packages which provide the following:

  • Configuration using Vespa Configuration. Docproc Services with Call Stacks can be configured from file or from a configuration server using the Vespa Configuration System. Document processors may also easily get their own sub Call Stacks configured using the configuration system.