This document describes the Vespa Annotations API: its purpose and use cases, along with some usage examples.
Imagine a use case where one wants to add some metadata to a chunk of text, where various parts of the text have some semantics that we want to express.
This can be done by marking up the text with spans - where a span is identified by a start character index, and a length, and grouping these spans together to form a span tree:
In the illustration above, we have a span tree called "html", with a root node that holds references to the spans we have created over the text. To do this using the Annotations API, use the following code:
StringFieldValue text = new StringFieldValue("<html><head><title>Diary</title></head><body>I live in San Francisco</body></html>");

SpanList root = new SpanList();
root.add(new Span(0, 19))
    .add(new Span(19, 5))
    .add(new Span(24, 21))
    .add(new Span(45, 23))
    .add(new Span(68, 14));

SpanTree tree = new SpanTree("html", root);
text.setSpanTree(tree);
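Before building a tree, it can help to check which part of the text each (start, length) pair actually covers. The following is a minimal stand-alone sketch that uses only the JDK and no Vespa classes; `SpanOffsetCheck` and `covered` are hypothetical helper names introduced here for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-alone sketch (plain JDK, no Vespa classes): shows which substring
// each {start, length} pair from the example above covers.
final class SpanOffsetCheck {

    // Returns the substring of text covered by each {start, length} pair.
    static List<String> covered(String text, int[][] spans) {
        List<String> parts = new ArrayList<>();
        for (int[] span : spans) {
            int start = span[0];
            int length = span[1];
            parts.add(text.substring(start, start + length));
        }
        return parts;
    }

    public static void main(String[] args) {
        String text = "<html><head><title>Diary</title></head><body>I live in San Francisco</body></html>";
        int[][] spans = { {0, 19}, {19, 5}, {24, 21}, {45, 23}, {68, 14} };
        for (String part : covered(text, spans)) {
            System.out.println(part);
        }
    }
}
```

Since the five spans are adjacent and together cover the whole string, joining the returned parts gives back the original text.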
Now for each of the spans over the text, we can add an arbitrary number of annotations. An annotation is a piece of information associated with a span. For now, think of it as a label:
Annotations are kept by the span tree in a global list. The annotations in the list have references to their respective spans. To do this using the Annotations API, first declare the annotation types in the schema:
schema example {
    annotation text { }
    annotation markup { }
}
Then, use the declared types and annotate the spans:
// The following line works inside process(Processing) in a DocumentProcessor
AnnotationTypeRegistry atr = processing.getService().getDocumentTypeManager().getAnnotationTypeRegistry();

StringFieldValue text = new StringFieldValue("<html><head><title>Diary</title></head><body>I live in San Francisco</body></html>");

AnnotationType textType = atr.getType("text");
AnnotationType markup = atr.getType("markup");

SpanList root = new SpanList();
SpanTree tree = new SpanTree("html", root);

Span span1 = new Span(0, 19);
root.add(span1);
tree.annotate(span1, markup);

Span span2 = new Span(19, 5);
root.add(span2);
tree.annotate(span2, textType);

Span span3 = new Span(24, 21);
root.add(span3);
tree.annotate(span3, markup);

Span span4 = new Span(45, 23);
root.add(span4);
tree.annotate(span4, textType);

Span span5 = new Span(68, 14);
root.add(span5);
tree.annotate(span5, markup);

text.setSpanTree(tree);
Note that in the above code, we have used a convenience method, SpanTree.annotate(SpanNode node, AnnotationType at). This is equivalent to:
AnnotationType markupType = new AnnotationType("markup");
Annotation a = new Annotation(markupType);
tree.annotate(span, a);
The annotated spans shown above might be fine for the simple cases where one wants to just annotate some text. However, let's imagine that one wants to not only identify markup from text, but also create a structure over the markup.
In such a case, we can build a tree of spans using SpanLists. A SpanList is simply a node in the tree that can have children—the children can be spans, or SpanLists themselves. And of course, SpanLists can be annotated as well. Henceforth, we will refer to both Spans and SpanLists as SpanNodes, which is in fact their common superclass.
Here, we no longer have a simple two-level structure of Spans with labels on them, but instead a tree of SpanNodes, each having zero or more annotations.
To do this using the Annotations API, first declare the annotation types in the schema:
schema example {
    annotation text { }
    annotation begintag { }
    annotation endtag { }
    annotation body { }
    annotation header { }
}
Then, use the declared types and annotate the spans:
// The following line works inside process(Processing) in a DocumentProcessor
AnnotationTypeRegistry atr = processing.getService().getDocumentTypeManager().getAnnotationTypeRegistry();

StringFieldValue text = new StringFieldValue("<html><head><title>Diary</title></head><body>I live in San Francisco</body></html>");

SpanList root = new SpanList();
SpanTree tree = new SpanTree("html", root);

AnnotationType textType = atr.getType("text");
AnnotationType beginTag = atr.getType("begintag");
AnnotationType endTag = atr.getType("endtag");
AnnotationType bodyType = atr.getType("body");
AnnotationType headerType = atr.getType("header");

SpanList header = new SpanList();
{
    Span span1 = new Span(6, 6);
    Span span2 = new Span(12, 7);
    Span span3 = new Span(19, 5);
    Span span4 = new Span(24, 8);
    Span span5 = new Span(32, 7);
    header.add(span1)
          .add(span2)
          .add(span3)
          .add(span4)
          .add(span5);
    tree.annotate(span1, beginTag)
        .annotate(span2, beginTag)
        .annotate(span3, textType)
        .annotate(span4, endTag)
        .annotate(span5, endTag)
        .annotate(header, headerType);
}

SpanList body = new SpanList();
{
    Span span1 = new Span(39, 6);
    Span span2 = new Span(45, 23);
    Span span3 = new Span(68, 7);
    body.add(span1)
        .add(span2)
        .add(span3);
    tree.annotate(span1, beginTag)
        .annotate(span2, textType)
        .annotate(span3, endTag)
        .annotate(body, bodyType);
}

{
    Span span1 = new Span(0, 6);
    Span span2 = new Span(75, 7);
    root.add(span1)
        .add(header)
        .add(body)
        .add(span2);
    tree.annotate(span1, beginTag)
        .annotate(span2, endTag);
}

text.setSpanTree(tree);
But what if we need to attach more information to a SpanNode than just a label? Imagine that we want to annotate "San Francisco" in the text above with not only "city", but also include its latitude and longitude. This can be done, since annotations can also have values.
Every annotation in the tree is of a declared annotation type, where an annotation type is declared with a name and a possible data type for its optional value. Up until now, our annotation types have only had names, and no data type.
For the case of "San Francisco", we can let our annotation type have two data fields:
schema example {
annotation text {
}
annotation begintag {
}
annotation endtag {
}
annotation body {
}
annotation header {
}
annotation city {
field latitude type double {}
field longitude type double {}
}
}
By deploying the schema above, a struct data type named annotation.city is implicitly created, having the two fields declared. The annotation type city is set to use this data type. For more on struct types, see the schema reference.
We can then create an annotation holding the latitude and longitude of San Francisco on this SpanNode.
To do this using the Annotations API:
// The following line works inside process(Processing) in a DocumentProcessor
AnnotationTypeRegistry atr = processing.getService().getDocumentTypeManager().getAnnotationTypeRegistry();

StringFieldValue text = new StringFieldValue("<html><head><title>Diary</title></head><body>I live in San Francisco</body></html>");

SpanList root = new SpanList();
SpanTree tree = new SpanTree("html", root);

AnnotationType textType = atr.getType("text");
AnnotationType beginTag = atr.getType("begintag");
AnnotationType endTag = atr.getType("endtag");
AnnotationType bodyType = atr.getType("body");
AnnotationType headerType = atr.getType("header");
AnnotationType cityType = atr.getType("city");

// The value of the city annotation: a struct with latitude and longitude
Struct position = (Struct) cityType.getDataType().createFieldValue();
position.setValue("latitude", 37.774929);
position.setValue("longitude", -122.419415);
Annotation city = new Annotation(cityType, position);

SpanList header = new SpanList();
{
    Span span1 = new Span(6, 6);
    Span span2 = new Span(12, 7);
    Span span3 = new Span(19, 5);
    Span span4 = new Span(24, 8);
    Span span5 = new Span(32, 7);
    header.add(span1)
          .add(span2)
          .add(span3)
          .add(span4)
          .add(span5);
    tree.annotate(span1, beginTag)
        .annotate(span2, beginTag)
        .annotate(span3, textType)
        .annotate(span4, endTag)
        .annotate(span5, endTag)
        .annotate(header, headerType);
}

SpanList textNode = new SpanList();
{
    Span span1 = new Span(45, 10);   // "I live in "
    Span span2 = new Span(55, 13);   // "San Francisco"
    textNode.add(span1)
            .add(span2);
    tree.annotate(span2, city)
        .annotate(textNode, textType);
}

SpanList body = new SpanList();
{
    Span span1 = new Span(39, 6);
    Span span2 = new Span(68, 7);
    body.add(span1)
        .add(textNode)
        .add(span2);
    tree.annotate(span1, beginTag)
        .annotate(span2, endTag)
        .annotate(body, bodyType);
}

{
    Span span1 = new Span(0, 6);
    Span span2 = new Span(75, 7);
    root.add(span1)
        .add(header)
        .add(body)
        .add(span2);
    tree.annotate(span1, beginTag)
        .annotate(span2, endTag);
}

text.setSpanTree(tree);
For the examples above, the purpose of the annotator has been to express the structure of the original HTML document, as well as adding some semantics to the tree. The HTML structure is fairly unambiguous (let's assume valid HTML for now). However, there are many other use cases where the source text allows for multiple interpretations, i.e. where there is not one unambiguous tree. Natural language processing is one such use case.
As an example, review the following sentence: "I saw the girl with the boy"
Most humans would read this as "the boy is accompanying the girl, and I saw them both". There is an alternate interpretation: that "boy" is an instrument that could be used to see the girl, as in "I saw the girl with the telescope", i.e. "I saw the girl using the telescope". NLP parsers would likely identify both these interpretations.
We can express more than one interpretation in one span tree, using an AlternateSpanList. As opposed to a regular SpanList, which holds a single list of child SpanNodes, an AlternateSpanList can have an arbitrary number of subtrees, each with its own probability. In the analysis of longer and more complex passages of text, this is a great advantage, as we don't have to copy the entire tree to express differing interpretations. We just insert an AlternateSpanList at the point in the tree where the interpretations differ, and attach suitable probabilities to them, if possible.
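The idea of one node holding several weighted interpretations can be modelled in a few lines. The following is a minimal stand-alone sketch using only the JDK; it is not the Vespa AlternateSpanList class, and the names `AlternativesSketch`, `add`, and `mostLikely`, as well as the probabilities 0.9 and 0.1, are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-alone model (plain JDK, not the Vespa AlternateSpanList class):
// one node holding several alternative readings, each with a probability.
final class AlternativesSketch {

    static final class Alternative {
        final String reading;      // label for this interpretation
        final double probability;  // likelihood assigned by a (hypothetical) parser

        Alternative(String reading, double probability) {
            this.reading = reading;
            this.probability = probability;
        }
    }

    private final List<Alternative> alternatives = new ArrayList<>();

    // Adds one interpretation and returns this node to allow chaining.
    AlternativesSketch add(String reading, double probability) {
        alternatives.add(new Alternative(reading, probability));
        return this;
    }

    // Returns the reading with the highest probability, or null if there are none.
    String mostLikely() {
        Alternative best = null;
        for (Alternative a : alternatives) {
            if (best == null || a.probability > best.probability) {
                best = a;
            }
        }
        return best == null ? null : best.reading;
    }

    public static void main(String[] args) {
        AlternativesSketch node = new AlternativesSketch()
                .add("the boy accompanies the girl", 0.9)   // illustrative probability
                .add("the boy is the instrument", 0.1);     // illustrative probability
        System.out.println(node.mostLikely());
    }
}
```

A consumer that only needs one interpretation can pick the most likely one, while the rest of the tree is shared between all interpretations.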
Annotations can in fact have references to other annotations in the tree, that is, have an Annotation reference as its value.
Review the example below - an HTML structure where "San" and "Francisco" do not have a common supernode:
We can see that in the HTML structure, "I live in San" is one paragraph, while "Francisco" continues on the next line. Consequently, "San" and "Francisco" do not have a SpanList as their immediate common supernode. On a higher semantic level, though, it is clear that "San Francisco" should be annotated as a city, as in the previous example. This can be achieved by using an annotation reference:
Note that the annotation "city" is not annotating a span node. It is present in the global list of annotations, and has references to other annotations in the same list. To create the structure as shown above, declare the struct position, and change the fields of annotation type city.
schema example {
annotation text {
}
annotation begintag {
}
annotation endtag {
}
annotation body {
}
annotation header {
}
annotation paragraph {
}
annotation city {
field position type position {}
field references type array<annotationreference<text>> {}
}
struct position {
field latitude type double {}
field longitude type double {}
}
}
To do this using the Annotations API:
// The following two lines work inside process(Processing) in a DocumentProcessor
DocumentTypeManager dtm = processing.getService().getDocumentTypeManager();
AnnotationTypeRegistry atr = dtm.getAnnotationTypeRegistry();

StringFieldValue text = new StringFieldValue("<body><p>I live in San </p>Francisco</body>");

SpanList root = new SpanList();
SpanTree tree = new SpanTree("html", root);

StructDataType positionType = (StructDataType) dtm.getDataType("position");

AnnotationType textType = atr.getType("text");
AnnotationType beginTag = atr.getType("begintag");
AnnotationType endTag = atr.getType("endtag");
AnnotationType bodyType = atr.getType("body");
AnnotationType paragraphType = atr.getType("paragraph");
AnnotationType cityType = atr.getType("city");

Struct position = new Struct(positionType);
position.setValue("latitude", 37.774929);
position.setValue("longitude", -122.419415);

// The two annotations that the city annotation will refer to
Annotation sanAnnotation = new Annotation(textType);
Annotation franciscoAnnotation = new Annotation(textType);

Struct positionWithRef = (Struct) cityType.getDataType().createFieldValue();
positionWithRef.setValue("position", position);

Field referencesField = ((StructDataType) cityType.getDataType()).getField("references");
Array<FieldValue> refList = new Array<FieldValue>(referencesField.getDataType());
AnnotationReferenceDataType annRefType =
        (AnnotationReferenceDataType) ((ArrayDataType) referencesField.getDataType()).getNestedType();
refList.add(new AnnotationReference(annRefType, sanAnnotation));
refList.add(new AnnotationReference(annRefType, franciscoAnnotation));
positionWithRef.set(referencesField, refList);

Annotation city = new Annotation(cityType, positionWithRef);

SpanList paragraph = new SpanList();
{
    Span span1 = new Span(6, 3);     // "<p>"
    Span span2 = new Span(9, 10);    // "I live in "
    Span span3 = new Span(19, 4);    // "San "
    Span span4 = new Span(23, 4);    // "</p>"
    paragraph.add(span1)
             .add(span2)
             .add(span3)
             .add(span4);
    tree.annotate(span1, beginTag)
        .annotate(span2, textType)
        .annotate(span3, sanAnnotation)
        .annotate(span4, endTag)
        .annotate(paragraph, paragraphType);
}

{
    Span span1 = new Span(0, 6);     // "<body>"
    Span span2 = new Span(27, 9);    // "Francisco"
    Span span3 = new Span(36, 7);    // "</body>"
    root.add(span1)
        .add(paragraph)
        .add(span2)
        .add(span3);
    tree.annotate(span1, beginTag)
        .annotate(span2, franciscoAnnotation)
        .annotate(span3, endTag)
        .annotate(root, bodyType)
        .annotate(city);             // not attached to any span node
}

text.setSpanTree(tree);
The above example shows that when using annotation references, building the span tree, and overlaying annotations (which now form an annotation graph), becomes quite complex. However, it enables annotators from various contexts to cooperate on one single annotation graph.
In the above example, we are mixing two semantically different trees into one tree. The first tree models the HTML representation of the input document. The second tree tries to find entities (like "San Francisco"), and creates a structure on a higher semantic level.
Note that in some cases, it would be wiser to create two span trees, and annotate these separately.
Recall that on the last line in all the above code samples, we have set the tree on the StringFieldValue using StringFieldValue.setSpanTree(SpanTree tree). The string given when constructing the SpanTree ("html" in the samples) is an arbitrary name for this tree. Creating two trees is then trivial (and is left as an exercise to the reader).
The previous section focused mainly on building a span tree over an input string. In many cases though, like when using the docproc framework, a document processor reads a span tree created by some previous process, manipulates it, and passes it on.
A typical use case is to iterate over all SpanNodes (that have an Annotation of a certain type), and manipulate these. As an example, imagine that the text at the start is the output of one document processor and the annotated text is the output of another. The second document processor would typically iterate over all nodes that have an annotation of type "markup", and replace them with spans that have annotations of type "begintag" and "endtag".
To do this using the Annotations API:
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.ListIterator;

import com.yahoo.document.annotation.Annotation;
import com.yahoo.document.annotation.AnnotationType;
import com.yahoo.document.annotation.Span;
import com.yahoo.document.annotation.SpanList;
import com.yahoo.document.annotation.SpanNode;
import com.yahoo.document.annotation.SpanTree;
import com.yahoo.document.datatypes.StringFieldValue;

public void example() {
    StringFieldValue text = new StringFieldValue("<html><head><title>Diary</title></head><body>I live in San Francisco</body></html>");

    SpanTree tree = text.getSpanTree("html");
    SpanList root = (SpanList) tree.getRoot();
    //TODO: Note that the above could have been a Span or an AlternateSpanList!

    ListIterator<SpanNode> nodeIt = root.childIterator();

    AnnotationType beginTag = new AnnotationType("begintag");
    AnnotationType endTag = new AnnotationType("endtag");

    while (nodeIt.hasNext()) {
        SpanNode node = nodeIt.next();
        boolean nodeHadMarkupAnnotation = removeMarkupAnnotation(tree, node);
        if (nodeHadMarkupAnnotation) {
            // replace the markup node with one span per tag:
            nodeIt.remove();
            List<Span> replacementNodes = analyzeMarkup(tree, node, text, beginTag, endTag);
            for (SpanNode repl : replacementNodes) {
                nodeIt.add(repl);
            }
        }
    }
}

/**
 * Removes annotations of type 'markup' from the given node.
 *
 * @param tree the tree to remove annotations from
 * @param node the node to remove annotations of type 'markup' from
 * @return true if the given node had 'markup' annotations, false otherwise
 */
private boolean removeMarkupAnnotation(SpanTree tree, SpanNode node) {
    //get iterator over all annotations on this node:
    Iterator<Annotation> annotationIt = tree.iterator(node);

    while (annotationIt.hasNext()) {
        Annotation annotation = annotationIt.next();
        if (annotation.getType().getName().equals("markup")) {
            //this node has an annotation of type markup, remove it:
            annotationIt.remove();
            //return true, this node had a markup annotation:
            return true;
        }
    }
    //this node did not have a markup annotation:
    return false;
}

/**
 * NOTE: This method is provided only for completeness. It analyzes spans annotated with "markup",
 * and splits them into several shorter spans annotated with "begintag" and "endtag".
 *
 * @param tree       the span tree to annotate into
 * @param input      a SpanNode that is annotated with "markup".
 * @param text       the text that the SpanNode covers
 * @param beginTag   the type to use for begintag annotations
 * @param endTagType the type to use for endtag annotations
 * @return a list of new spans to replace the input
 */
private List<Span> analyzeMarkup(SpanTree tree, SpanNode input, StringFieldValue text,
                                 AnnotationType beginTag, AnnotationType endTagType) {
    //we know that this node is annotated with "markup"
    String coveredText = input.getText(text.getString()).toString();
    int spanOffset = input.getFrom();
    int tagStart = -1;
    boolean endTag = false;
    List<Span> tags = new ArrayList<Span>();
    for (int i = 0; i < coveredText.length(); i++) {
        if (coveredText.charAt(i) == '<') {
            //we're in a tag
            tagStart = i;
            continue;
        }
        if (coveredText.charAt(i) == '>' && tagStart > -1) {
            //the tag ends here; create a span covering it and annotate it
            Span span = new Span(spanOffset + tagStart, (i + 1) - tagStart);
            tags.add(span);
            if (endTag) {
                tree.annotate(span, endTagType);
            } else {
                tree.annotate(span, beginTag);
            }
            tagStart = -1;
        }
        if (tagStart > -1 && i == (tagStart + 1)) {
            //the character right after '<' decides whether this is an end tag
            if (coveredText.charAt(i) == '/') {
                endTag = true;
            } else {
                endTag = false;
            }
        }
    }
    return tags;
}
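The tag-splitting loop can also be tried out on a plain string, without a span tree. The following is a simplified stand-alone sketch using only the JDK and no Vespa classes; `MarkupSplitSketch` and `splitTags` are hypothetical names, and each result is encoded as an int triple instead of an annotated Span.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-alone sketch (plain JDK, no Vespa classes) of the tag-splitting loop.
// Each result is {from, length, isEndTag ? 1 : 0}.
final class MarkupSplitSketch {

    // Finds every "<...>" tag in coveredText; offsets are shifted by spanOffset
    // so that they refer to positions in the full text.
    static List<int[]> splitTags(String coveredText, int spanOffset) {
        List<int[]> tags = new ArrayList<>();
        int tagStart = -1;
        for (int i = 0; i < coveredText.length(); i++) {
            char c = coveredText.charAt(i);
            if (c == '<') {
                tagStart = i;
            } else if (c == '>' && tagStart > -1) {
                // An end tag has '/' right after the opening '<'.
                boolean endTag = tagStart + 1 < i && coveredText.charAt(tagStart + 1) == '/';
                tags.add(new int[] { spanOffset + tagStart, (i + 1) - tagStart, endTag ? 1 : 0 });
                tagStart = -1;
            }
        }
        return tags;
    }

    public static void main(String[] args) {
        // The third "markup" span from the earlier example starts at offset 24.
        for (int[] tag : splitTags("</title></head><body>", 24)) {
            System.out.println(tag[0] + "," + tag[1] + "," + (tag[2] == 1 ? "endtag" : "begintag"));
        }
    }
}
```

For the markup span starting at offset 24, this yields the same (from, length) pairs that were written by hand in the earlier tree-building examples: (24, 8), (32, 7) and (39, 6).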
One may also traverse the global list of annotations, as opposed to iterating over SpanNodes. Imagine a use case where some annotator wants to find and remove all annotations of type "markup".
To do this using the Annotations API:
StringFieldValue text = new StringFieldValue("<html><head><title>Diary</title></head><body>I live in San Francisco</body></html>");

SpanTree tree = text.getSpanTree("html");

// Iterate over the global list of annotations in the tree
Iterator<Annotation> annotationIt = tree.iterator();

while (annotationIt.hasNext()) {
    Annotation annotation = annotationIt.next();
    if (annotation.getType().getName().equals("markup")) {
        //we have an annotation of type markup, remove it:
        annotationIt.remove();
    }
}
Annotation types can inherit from each other. This is particularly useful when given e.g. a document processor (along with its configuration of annotation types and document types) from some external entity, and one wants to extend these annotation types with some additional information. Review the below example:
schema example {
    annotation person {
        field birthdate type int { }
        field firstname type string { }
        field lastname type string { }
    }
}
This annotation type, person, comes from some legacy code that we have gotten from some external entity. We want to leave this code and this configuration as-is, but we are writing document processors that rely on these types and extend them:
schema example2 {
    annotation employee inherits person {
        field employeeid type int { }
    }
}
The type employee behaves just like a person, and can be used anywhere that a person can appear. It has inherited the three fields defined in person, and has one field of its own in addition.
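The inheritance rules described here can be illustrated with a small model. The following is a minimal stand-alone sketch using only the JDK; it is not the Vespa AnnotationType class, and `TypeInheritanceSketch`, `allFields`, and `isA` are hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Stand-alone model (plain JDK, not the Vespa AnnotationType class) of how an
// inheriting annotation type sees its parent's fields plus its own.
final class TypeInheritanceSketch {
    final String name;
    final TypeInheritanceSketch parent;   // null when the type inherits nothing
    final List<String> ownFields;

    TypeInheritanceSketch(String name, TypeInheritanceSketch parent, List<String> ownFields) {
        this.name = name;
        this.parent = parent;
        this.ownFields = new ArrayList<>(ownFields);
    }

    // Inherited fields first, then the fields declared on this type.
    List<String> allFields() {
        List<String> fields = parent == null ? new ArrayList<String>() : parent.allFields();
        fields.addAll(ownFields);
        return fields;
    }

    // True if this type is the given type or inherits from it, directly or indirectly.
    boolean isA(TypeInheritanceSketch other) {
        for (TypeInheritanceSketch t = this; t != null; t = t.parent) {
            if (t == other) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        TypeInheritanceSketch person = new TypeInheritanceSketch(
                "person", null, Arrays.asList("birthdate", "firstname", "lastname"));
        TypeInheritanceSketch employee = new TypeInheritanceSketch(
                "employee", person, Arrays.asList("employeeid"));
        System.out.println(employee.allFields());
        System.out.println(employee.isA(person));
    }
}
```

In this model, employee reports all four fields and counts as a person, while person does not count as an employee.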