A system tests suite is an invaluable tool both when developing and maintaining a complex Vespa application. These are functional tests which are run against a deployment of the application package to verify, and use its HTTP APIs to execute feed and query operations which are compared to expected outcomes. Vespa provides two formalizations of this:
These two frameworks also includes an upgrade—or staging—test construct for scenarios where the application is upgraded, and state in the backend depends on the old application configuration; as well as a production verification test—basically a health check for production deployments. For system and staging tests, the frameworks provide an easy way to perform HTTP request against a designated test deployment, separating the tests from the deployment and configuration of the test clusters.
This document describes how each of these test categories can be run as part of an imagined CI/CD system for safely deploying changes to a Vespa application in a continuous manner.
Finally, find a section on A/B-testing / bucket tests using feature switches.
System tests are just functional tests that verify a deployed Vespa application behaves as expected when fed and queried. Running a system test is as simple as making a separate deployment with the application package to test, and then running the system test suite, or one or a few or those tests.
For the most part, system tests must be updated due to changes in the application package. Rarely, an upgrade of the Vespa version may also lead to changed functionality, but within major versions, this should only be new features and bug fixes. In any case, it is a good idea to always run system tests against a dedicated test deployment—both before upgrading the Vespa platform, and the application package—before deploying the change to production.
The Vespa CLI makes it easy to set up a test deployment, and run system and staging tests. To run a system test, first set up a test deployment:
Run the basic HTTP tests (prefer using this test suite for regular tests) - also see the example application:
Example Java API tests (use for complex test cases) - also see the example application:
The test config file used by the test runner in the maven-plugin defines the endpoints for each of the clusters in
services.xml as fields under a
localEndpoints JSON object:
feed-service is the endpoint of the container cluster with
query-service is the endpoint of the container cluster with
The goal of staging (upgrade) tests is not to ensure the new deployment satisfies its functional specifications, as that should be covered by system tests; rather, it is to ensure the upgrade of the application package and/or Vespa platform does not break the application, and is compatible with the behavior expected by existing clients.
As an example, consider a change in how documents are indexed, e.g., adding new document processor. A system test would test verify this new behavior by feeding a document, and then verifying the document processor modified the document, or perhaps did something else. A staging test, on the other hand, would feed the document before the document processor was added, and querying for the document after the upgrade could give different results from what the system test would expect.
Many such changes, which require additional action post-deployment, are also guarded by validation overrides, but the staging test is then a great way of figuring out what the exact consequences of the change are, and how to deal with it.
As opposed to system tests, staging tests are not self-contained, as the state change during upgrade is precisely what is tested. Instead, execution order of any staging tests that modify state, particularly after upgrade, must be controlled. Indeed, some changes will require refeeding data, and this should then be part of the staging test code. Finally, it is also good to verify the expected state prior to upgrade.
The clients of a Vespa application should be compatible with both the system and staging test expectations, and this dictates the workflow when deploying a breaking change - steps:
Again, it is a good idea to always run staging tests before deployment of every change—be it a change in the application package, or an upgrade of the Vespa platform.
See system tests above for links to example applications. Steps:
Example using JSON-tests:
Example using Java tests (see system tests for test-config.json):
With continuous deployment, it is not practical to hold off releasing a feature until it is done, test it manually until convinced it works and then release it to production. What to do instead? The answer is feature switches: release new features to production as they are developed, but include logic which keeps them deactivated until they are ready, or until they have been verified in production with a subset of users.
Bucket tests is the practice of systematically testing new features or behavior for a controlled subset of users. This is common practice when releasing new science models, as they are difficult to verify in test, but can also be used for other features.
To test new behavior in Vespa, use a combination of search chains and rank profiles, controlled by query profiles, where one query profile corresponds to one bucket. These features support inheritance to make it easy to express variation without repetition.
Sometimes a new feature requires incompatible changes to a data field. To be able to CD such changes, it is necessary to create a new field containing the new version of the data. This costs extra resources but less than the alternative: standing up a new system copy with the new data. New fields can be added and populated while the system is live.
One way to reduce the need for incompatible changes can be decreased by making the semantics of the fields more precise. E.g., if a field is defined as the "quality" of a document, where a higher number means higher quality, a new algorithm which produces a different range and distribution will typically be an incompatible change. However, if the field is defined more precisely as the average time spent on the document once it is clicked, then a new algorithm which produces better estimates of this value will not be an incompatible change. Using precise semantics also have the advantage of making it easier to understand if the use of the data and its statistical properties are reasonable.