Geo search

A field can have type position. This enables users to limit hits to those within an area, or to use the distance from a particular point as a criterion in relevancy calculation. Sample search definition and document:

search local {
    document local {
        field title type string {
            indexing: index
        }
        field latlong type position {
            indexing: attribute
        }
    }
    fieldset default {
        fields: title
    }
}
<document documenttype="local" documentid="id:local:local::abc123">
    <title>some random pizza place</title>
    <latlong>N37.401;W121.996</latlong>
</document>
To search for a geographical position, use pos.ll:
search/?query=pizza&pos.ll=N37.416383%3BW122.024683
By default, this will limit the hits otherwise returned by the given query to those having a position within 50km of the Yahoo office in Sunnyvale. Since 50km is probably too far to go just to get some pizza, a more realistic example would add a radius of e.g. 5 miles:
search/?query=pizza&pos.ll=N37.416383%3BW122.024683&pos.radius=5mi
The pos.radius parameter accepts distances in kilometers (km), miles (mi), and meters (m).
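
The query strings above can be assembled programmatically. The following is a minimal Python sketch, not part of Vespa: the helper names format_position and build_geo_query are hypothetical, and the six-decimal formatting is an assumption chosen to match the millionth-of-a-degree resolution described in this document.

```python
from urllib.parse import urlencode


def format_position(lat: float, lon: float) -> str:
    """Format signed decimal degrees in the 'N37.401;W121.996' style."""
    ns = "N" if lat >= 0 else "S"
    ew = "E" if lon >= 0 else "W"
    # Six decimals: one unit per millionth of a degree (assumed sufficient).
    return f"{ns}{abs(lat):.6f};{ew}{abs(lon):.6f}"


def build_geo_query(query: str, lat: float, lon: float, radius: str = "") -> str:
    """Build a search URL path with pos.ll and an optional pos.radius."""
    params = {"query": query, "pos.ll": format_position(lat, lon)}
    if radius:
        params["pos.radius"] = radius
    # urlencode percent-encodes the ';' separator as %3B.
    return "search/?" + urlencode(params)


print(build_geo_query("pizza", 37.416383, -122.024683, "5mi"))
```

With these inputs the sketch reproduces the 5-mile example query shown above.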

Each search result will have an additional summary entry that contains the distance from the given geographical position, in millionths of a degree - about 10 cm. The new entry gets the name of the search definition field and the suffix “distance”: <fieldname>.distance. In the examples above, the name would be latlong.distance. For documents with multiple positions in the attribute, the distance to the nearest position will be returned.

The corresponding rank features for this example would be distance(latlong) or closeness(latlong) or closeness(latlong).logscale - the last is probably the most useful when combining distance ranking with textual relevance ranking.

Summary fields

Note (2017-01-20): if the field uses summary | attribute instead of attribute, the position is returned in the result as:

"mypos": {
  "x": -123988700,
  "y": -22453200
},

When querying with a position, one can get up to three summary fields. If the search definition looks like:

field mygeo type position {
    indexing: attribute | summary
}
the output with default rendering looks like:
<field name="mygeo">
  <struct-field name="y">37374821</struct-field>
  <struct-field name="x">-122057174</struct-field>
</field>
<field name="mygeo.position"><position x="-122057174" y="37374821" latlong="N37.374821;W122.057174" /></field>
<field name="mygeo.distance">48921</field>
The first summary field here ("mygeo") is only generated if you specify "summary" in the indexing statement. It is mostly intended for programmatic use in a result processor (a Searcher).

The second summary field is generated as XML in the back-end, with XML attributes for x/y and also a "latlong" attribute with the same format as the input (feeding) format.

Note that the numbers used for "x" and "y" in the above two fields are integers - it's a direct representation of the internal x/y struct used for position. These are in millionths of a degree, so it's quite easy to convert. Also note which is which of "x" and "y":

It's just putting a normal coordinate system on top of the world map, so "x" is the longitude (east-west) and "y" the latitude (north-south).
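
As an illustration of the x/y convention, here is a small Python sketch that converts the integers from the summary back to decimal degrees. The function name position_to_degrees is hypothetical and only serves this example.

```python
from typing import Tuple

MICRODEGREES_PER_DEGREE = 1_000_000


def position_to_degrees(x: int, y: int) -> Tuple[float, float]:
    """Return (latitude, longitude) in decimal degrees.

    x is the longitude and y the latitude, both in millionths of a degree.
    """
    latitude = y / MICRODEGREES_PER_DEGREE
    longitude = x / MICRODEGREES_PER_DEGREE
    return latitude, longitude


# Values taken from the "mygeo" summary example above.
lat, lon = position_to_degrees(x=-122057174, y=37374821)
print(lat, lon)
```

Note that the return value is ordered latitude first, the opposite of the x/y order, which is an easy place to introduce a swap.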

The third summary field is the distance, also as an integer, and also in millionths of degrees. When converting meters to degrees, the Earth polar radius is used, so degrees = 180.0 * meters / (Math.PI * 6356752.0); is the basic conversion formula. Multiply the result by one million to get internal units.
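
The conversion can be sketched in Python as follows. The function names are hypothetical, and rounding to the nearest integer is an assumption of this sketch, not a statement about how Vespa rounds internally.

```python
import math

EARTH_POLAR_RADIUS_M = 6356752.0  # polar radius used in the formula above


def meters_to_internal(meters: float) -> int:
    """Convert meters to internal units (millionths of a degree)."""
    degrees = 180.0 * meters / (math.pi * EARTH_POLAR_RADIUS_M)
    return round(degrees * 1_000_000)  # rounding is an assumption


def internal_to_meters(units: int) -> float:
    """Inverse conversion: internal units back to meters."""
    degrees = units / 1_000_000
    return degrees * math.pi * EARTH_POLAR_RADIUS_M / 180.0


# One kilometer is a little over 9000 internal units.
print(meters_to_internal(1000.0))
# The mygeo.distance value from the example above, expressed in meters.
print(internal_to_meters(48921))
```

This can be used in a Searcher-like component to show distances to users in meters instead of internal units.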

Bounding box

When searching for all documents that have a location in a given map view, it is most convenient to specify a bounding box instead of searching for all documents within a radius from a point. This can be achieved using the pos.bb parameter in the query:

search/?query=pizza&pos.bb=n=37.44899,s=37.3323,e=-121.98241,w=-122.06566
Example result:
<boundingBox>
  <southWest>
    <latitude>40.183868</latitude>
    <longitude>-74.819519</longitude>
  </southWest>
  <northEast>
    <latitude>40.248291</latitude>
    <longitude>-74.728798</longitude>
  </northEast>
</boundingBox>
Here the south/west corner comes first, followed by the north/east corner, which is simple to use in a query like this:
search/?query=pizza&pos.bb=s=40.183868,w=-74.819519,n=40.248291,e=-74.728798
The directions S/W/N/E can appear in any order, and be specified in lower or upper case, but you must have N>=S and E>=W.
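
A minimal Python sketch of building the pos.bb value from the two corners follows. The function name bounding_box_param is hypothetical; the ordering checks mirror the N>=S and E>=W requirement stated above.

```python
def bounding_box_param(south: float, west: float, north: float, east: float) -> str:
    """Build the value of the pos.bb parameter from corner coordinates."""
    if north < south:
        raise ValueError("north must be >= south")
    if east < west:
        raise ValueError("east must be >= west")
    # The directions may appear in any order; n,s,e,w is used here.
    return f"n={north},s={south},e={east},w={west}"


# Corners from the example bounding box above.
bb = bounding_box_param(south=40.183868, west=-74.819519,
                        north=40.248291, east=-74.728798)
print("search/?query=pizza&pos.bb=" + bb)
```

Validating the corners on the client side gives a clearer error than sending an invalid box to the search service.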

Using multiple position fields

For some applications, it can be useful to have several position attributes that may be searched. For example, an address book application could use positions for home address and work address. This can be declared without any special considerations in the search definition file, but needs some extra handling on the query side. A single query can only search in one of the position attributes, and must specify which attribute with a pos.attribute query parameter. If you want some searches to span several fields, you must make a combined field (outside Vespa or in a document processor before indexing) that holds them all. Example:

search address {
    document address {
        field homeaddress type string {
            indexing: summary | index
        }
        field homelatlong type position {
            indexing: attribute
        }
        field workaddress type string {
            indexing: summary | index
        }
        field worklatlong type position {
            indexing: attribute
        }
        field bothlatlong type array<position> {
            indexing: attribute
        }
    }
    field bothaddress type string {
        indexing: input homeaddress . " " . input workaddress | index
    }
}
Here we assume that the home fields will contain the address and position of your house, the work fields the address and position of your workplace, while the "bothlatlong" field is assumed to be filled with the positions of both house and workplace somehow. In a query it's then possible to say
search/?query=homeaddress:sunnyvale&pos.attribute=homelatlong&pos.ll=N37.416383%3BW122.024683&pos.radius=5km
which is unlikely to give very many hits, since it's mostly a business district around Yahoo! headquarters, while
search/?query=workaddress:sunnyvale&pos.attribute=worklatlong&pos.ll=N37.416383%3BW122.024683&pos.radius=5km
would show lots of people working in Sunnyvale; use pos.attribute=bothlatlong for cases where it's uncertain if home address or work address position was wanted.

Distance to path

This example provides an overview of the DistanceToPath rank feature. This feature matches document locations to a path given in the query. Not only does this feature return the closest distance from each document to the path, it also includes the length traveled along the path before reaching the closest point, or intersection. This feature has been nicknamed the gas feature because of its obvious use case of finding gas stations along a planned trip.

In this example we have been traveling from the US to Bangalore, and we are now planning our trip back. We have decided to rent a car in Bangalore that we are to return upon arrival at the airport in Chennai. We are already quite hungry and wish to stop for a meal once we are outside of town. To avoid having to pay an additional fueling premium, we also wish to refuel just before reaching the airport. We need to figure out what roads to take, what restaurants are available outside of Bangalore, and what fuel stations are available once we get close to Chennai. In figure 1 we have plotted our trip from Bangalore to the airport:

If we search for restaurants along the path, we only see a small subset of all restaurants present in the window of our quite large map. In figure 2 you see how the most relevant results are actually all in Bangalore or Chennai:

To find the best results, move the map window to just about where we expect to be eating, and redo the search:

This has to be done similarly for finding a gas station near the airport. This illustrates searching for restaurants in a smaller window along the planned trip without DistanceToPath. Next, we outline how DistanceToPath can be used to quickly and easily improve this type of planning to be more convenient for the user.

The nature of this feature requires that the search corpus contains documents with position data. A searcher component needs to be written that is able to pass paths with the queries, and these paths must lie in the same coordinate space as the searchable documents. Finally, a rank-profile needs to be defined that scores documents according to how well they match some target distance traveled while also lying close "enough" to the path.

Query Syntax

This document does not describe how to write a searcher plugin for the JDisc Container; refer to the container documentation for that. However, let us review the syntax expected by DistanceToPath. As noted in the rank features reference, the path is supplied as a query parameter by name of the feature and the path keyword:

query=(…)&rankproperty.distanceToPath(name).path=(x1,y1,x2,y2,…,xN,yN)
Here name has to match the name of the position attribute that holds the positions data.

The path itself is parsed as a list of N coordinate pairs that together form N-1 line segments:

$$(x_1,y_1) \rightarrow (x_2,y_2), (x_2,y_2) \rightarrow (x_3,y_3), (…), (x_{N-1},y_{N-1}) \rightarrow (x_N,y_N)$$

The path is not in a readable (longitude, latitude) format, but is a list of integer pairs in the internal format (degrees multiplied by 1 million). If a transform from geographic coordinates is required, the searcher plugin must do it; note that the first number in each pair (the "x") is longitude (degrees East or West) while the second ("y") is latitude (degrees North or South). This corresponds to the usual orientation for maps, and is opposite to the usual latitude/longitude order.
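
The transform can be sketched as follows. This is an illustrative Python sketch, not the Java Searcher a real deployment would use; the function name path_property is hypothetical, the waypoint coordinates are arbitrary sample values, and the check for at least two waypoints is this sketch's own guard (a single point forms no line segment).

```python
from typing import Iterable, Tuple


def path_property(name: str, waypoints: Iterable[Tuple[float, float]]) -> Tuple[str, str]:
    """Return (parameter name, parameter value) for a distanceToPath path.

    waypoints are (latitude, longitude) pairs in decimal degrees.
    """
    coords = []
    for lat, lon in waypoints:
        # Internal format: x (longitude) first, then y (latitude),
        # both as integers in millionths of a degree.
        coords.append(str(round(lon * 1_000_000)))
        coords.append(str(round(lat * 1_000_000)))
    if len(coords) < 4:
        raise ValueError("a path needs at least two waypoints")
    key = f"rankproperty.distanceToPath({name}).path"
    return key, "(" + ",".join(coords) + ")"


# Arbitrary sample waypoints, given as (latitude, longitude).
key, value = path_property("ll", [(12.97, 77.59), (12.99, 80.17)])
print(key + "=" + value)
```

The value still needs to be URL-encoded if it is placed directly in a query string.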

Rank profile

If we were to disregard our scenario for a few moments, we could suggest the following rank profile:

rank-profile default {
    first-phase {
        expression: nativeRank
    }
    second-phase {
        expression: firstPhase * if (distanceToPath(ll).distance < 10000, 1, 0)
    }
}
This profile first ranks all documents according to Vespa's nativeRank feature, and then does a second pass over the top 100 results and orders these based on their distance to our path. It is very simple: if a document lies within 10000 internal units of our path (roughly 1 km, at approximately 10 units per meter) it retains its relevancy, otherwise its relevancy is set to 0. Such a rank profile would indeed solve the current problem, but Vespa's ranking model allows us to take this a lot further.

The following is a very simple rank profile that ranks documents according to a query-specified target distance to path and distance traveled:

rank-profile default {
    first-phase {
        expression {
            max(0,    $distance - distanceToPath(ll).distance) *
            (1 - fabs($traveled - distanceToPath(ll).traveled))
        }
    }
}
The expression has two components. The first determines a rank based on the document's distance to the given path as compared to the query parameter $distance; if the allowed distance is exceeded, this component's contribution is 0. The distance contribution is then multiplied by one minus the absolute difference between the actual distance traveled and the query parameter $traveled, so documents near the target point along the path keep most of their score. In short, this profile will include all documents that lie close enough to the path, ranked according to their actual distance and traveled measure.
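
To see how the two components interact, the expression can be evaluated outside Vespa. The Python sketch below only mirrors the arithmetic of the rank expression; the function name path_score and the sample document values are made up for illustration.

```python
def path_score(target_distance: float, target_traveled: float,
               doc_distance: float, doc_traveled: float) -> float:
    """Mirror of the rank expression above, for experimentation only.

    doc_distance and doc_traveled stand in for distanceToPath(ll).distance
    and distanceToPath(ll).traveled.
    """
    distance_part = max(0.0, target_distance - doc_distance)
    traveled_part = 1.0 - abs(target_traveled - doc_traveled)
    return distance_part * traveled_part


# Close to the path and exactly at the target point along it.
print(path_score(1000, 0.1, doc_distance=200, doc_traveled=0.1))
# Same distance to the path, but halfway further along the trip.
print(path_score(1000, 0.1, doc_distance=200, doc_traveled=0.6))
# Too far from the path: the distance component is clamped to 0.
print(path_score(1000, 0.1, doc_distance=1500, doc_traveled=0.1))
```

The first document keeps its full distance contribution, the second loses half of it, and the third scores 0 regardless of where along the path it lies.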

DistanceToPath is only compatible with 2D coordinates because pathing in 1 dimension makes no sense.

Results

For the sake of this example, assume that we have implemented a custom path searcher that passes the path found by the user's initial directions query to Vespa in the syntax described above. Two more parameters must then be supplied by the user: $distance and $traveled. Vespa expects these parameters in a scale compatible with the feature's output, so they should probably also be mapped by the container plugin. The feature's “distance” output is given in Vespa's internal resolution, which is approximately 10 units per meter. The “traveled” output is a normalized number between 0 and 1, where 0 represents the beginning of the path and 1 the end of the path.

This illustrates how these parameters can be used to return the most appropriate hits for our scenario. Note that the figures only show the top hit for each query:

a) Searching for restaurants with the DistanceToPath feature. $distance = 1000, $traveled = 0.1
b) Searching for gas stations with the DistanceToPath feature. $distance = 1000, $traveled = 0.9