The rationale for predicate fields is described in the predicate fields document. Vespa also has a standalone Java library for searching predicate fields, for boolean matching tightly integrated with a Java program, e.g. running on a grid. The performance is similar to predicate search in Vespa. Find API documentation in the javadoc. Get started - add a dependency in pom.xml:
<dependency> <groupId>com.yahoo.vespa</groupId> <artifactId>predicate-search</artifactId> </dependency>
Unlike Vespa predicate fields, which have dynamic indexes,
the Java library requires the entire index to be built before any searches are run.
Once built, the index cannot be changed.
Build an index using an instance of PredicateIndexBuilder
.
Use indexDocument(id, predicate)
to index documents.
This method takes two arguments, a 32-bit document id and the document itself
(in form of a Predicate
object).
Once all documents are indexed, create the index by invoking build()
.
This method returns a PredicateIndex
object.
Use Predicate.fromString() to parse predicate expressions from strings. The expressions use the predicate syntax.
Just as with Vespa predicate fields, specify the arity and
upper- and lower bounds for the index,
to make the index more efficient and trade index size for query performance.
This is set when creating a PredicateIndexBuilder
object,
and cannot be changed without creating a new PredicateIndexBuilder
.
Check the Config
class for more information on other configuration parameters.
The predicate index supports serialization.
Use the PredicateIndex.writeToOutputStream(out)
to serialize the index to an output stream,
and the PredicateIndex.fromInputStream(in)
to deserialize an index from an input stream.
Deserializing an index is significantly faster than creating a new index
through PredicateIndexBuilder
.
PredicateIndex
has a method called searcher()
, which creates a new searcher object.
The searcher exposes one method, search(query)
, which searches the index.
The index itself is thread-safe, but a searcher is not. When searching from multiple threads,
make sure to create a separate searcher object for each thread.
A query is represented as a PredicateQuery
object.
The PredicateQuery
object contains a set of features with String
values,
and a set of range features with long
values.
Each feature in the query may have a 64-bit sub-query bitmap set.
The search()
method on PredicateIndex.Searcher
returns a stream object which lazily evaluates the query when the results are needed.
Each Hit
in the result stream contains the document id for the hit,
as well as a sub-query bitmap indicating which sub-queries the hit was a match for.
The hits are returned in the order they were indexed.
As with Vespa, the arity and upper- and lower bound configuration impacts on the performance and index size. E.g, increasing the arity increases QPS at a cost of a larger index.
The library may cache common posting lists to increase performance.
The cache consists of the most expensive posting lists based on size and their prevalence in queries.
The cache is disabled by default and must be build manually using
PredicateIndex.rebuildPostingListCache()
.
It is recommended to rebuild the cache regularly for optimal performance,
for instance every 1 millionth query or every 30 minutes.
The cache rebuild is thread-safe and can be executed safely during concurrent search operations.
package com.yahoo.example; import com.yahoo.document.predicate.Predicate; import com.yahoo.search.predicate.Config; import com.yahoo.search.predicate.Hit; import com.yahoo.search.predicate.PredicateIndex; import com.yahoo.search.predicate.PredicateIndexBuilder; import com.yahoo.search.predicate.PredicateQuery; import java.io.ByteArrayInputStream; import java.io.ByteArrayOutputStream; import java.io.DataInputStream; import java.io.DataOutputStream; import java.io.IOException; import java.util.List; import java.util.stream.Stream; import static com.yahoo.document.predicate.Predicates.and; import static com.yahoo.document.predicate.Predicates.feature; import static java.util.stream.Collectors.toList; public class App { public static void main( String[] args ) throws IOException { // Create index configuration Config config = new Config.Builder() .setArity(10) .setLowerBound(0) // Minimum value for 'age' range feature .setUpperBound(150) // Maximum value for 'age' range feature .build(); // Create index builder PredicateIndexBuilder indexBuilder = new PredicateIndexBuilder(config); // Pass document id and predicate. 'age' is a range feature, while 'gender' is a normal feature indexBuilder.indexDocument(1, Predicate.fromString("age in [20..40] and gender in ['male', 'female']")); indexBuilder.indexDocument(2, and(feature("age").inRange(40, 60), feature("gender").inSet("male"))); // Create index from builder PredicateIndex index = indexBuilder.build(); // Create query1 PredicateQuery query1 = new PredicateQuery(); query1.addFeature("gender", "male", 0b01); // Subquery 0 query1.addFeature("gender", "female", 0b10); // Subquery 1 query1.addRangeFeature("age", 30, 0b11); // Subquery 0 and 1 // Run queries using multiple threads Runnable searchRunnable = () -> { // Create a searcher // Note: PredicateIndex.Searcher is not thread safe, so each thread has to use a separate Searcher PredicateIndex.Searcher searcher = index.searcher(); // Search index. A stream of hits is returned Stream<Hit> hitStream = searcher.search(query1); // Prints document id and subquery bitmap ('[0, 0x3]'). Bitmap is 0b11 as document matches both subqueries System.out.println("Hit: " + hitStream.findFirst().get()); }; new Thread(searchRunnable).start(); new Thread(searchRunnable).start(); // Rebuild posting list cache to improve performance index.rebuildPostingListCache(); new Thread(searchRunnable).start(); new Thread(searchRunnable).start(); // Serialized index to a byte array ByteArrayOutputStream baos = new ByteArrayOutputStream(); index.writeToOutputStream(new DataOutputStream(baos)); byte[] serializedIndex = baos.toByteArray(); // Load the index from byte array PredicateIndex deserializedIndex = PredicateIndex.fromInputStream( new DataInputStream( new ByteArrayInputStream(serializedIndex))); // Create new query (which is not using subqueries this time) PredicateQuery query2 = new PredicateQuery(); query2.addFeature("gender", "male"); query2.addRangeFeature("age", 40); // Search using deserialized index List<Hit> hits = deserializedIndex.searcher().search(query2).collect(toList()); // Prints the id for both documents. No subquery bitmap printed System.out.println("Hits from deserialized index: " + hits); } }