
Batch delete

Options for batch deleting documents:

  1. Find documents using a query, delete them, and repeat until the query returns nothing. Pseudocode:
    while true:
        query and read document ids; exit if none are returned
        delete the document ids using /document/v1
        wait a second  # optional, reduces load while deleting
    
  2. Like 1, but use the Vespa feed client. Instead of deleting documents one by one, stream remove operations to the API (write a Java program for this), or append the remove operations to a file and feed it with the binary:
    $ vespa-feed-client --file deletes.json --endpoint my-endpoint
    
  3. Use a document selection to expire documents. Garbage collection deletes all documents that do not match the expression. Parent documents and imported fields can be used to expire a whole document set. The content node iterates over the corpus and deletes the non-matching documents, which are later compacted out:
    <documents garbage-collection="true">
        <document type="mytype" selection="mytype.version > 4" />
    </documents>
  4. Use /document/v1 to delete the documents identified by a document selection. This example drops all documents from the my_doctype schema:
    $ curl -X DELETE \
      "$ENDPOINT/document/v1/my_namespace/my_doctype/docid?selection=true&cluster=my_cluster"
    
  5. It is possible to drop a schema, with all its content, by removing the mapping to the content cluster. To understand what is happening, here is the status before the procedure:

    # ls $VESPA_HOME/var/db/vespa/search/cluster.music/n0/documents
    
    drwxr-xr-x 6 vespa vespa 4096 Oct 25 16:59 books
    drwxr-xr-x 6 vespa vespa 4096 Oct 25 12:47 music
    

    Remove the schema from configuration:

    <documents>
        <document type="music" mode="index" />
        <!--document type="books" mode="index" /-->
    </documents>
    

    The schema file itself does not need to be removed. It is, however, required to add a schema-removal entry to validation-overrides.xml:

    <validation-overrides>
        <allow until="2022-10-31">schema-removal</allow>
    </validation-overrides>
    

    Deploy the application package. This will reconfigure the content node processes, and the directory with the schema data is removed:

    # ls $VESPA_HOME/var/db/vespa/search/cluster.music/n0/documents
    
    drwxr-xr-x 6 vespa vespa 4096 Oct 25 12:47 music
    

    Add the mapping back and redeploy. The cluster now has a books schema with zero documents:

    # ls $VESPA_HOME/var/db/vespa/search/cluster.music/n0/documents
    
    drwxr-xr-x 6 vespa vespa 4096 Oct 25 17:06 books
    drwxr-xr-x 6 vespa vespa 4096 Oct 25 12:47 music
    

    Use the Custom Component State API to inspect document count per schema.

    Deploying first without and then with the schema is an efficient way to drop all documents of that schema. After the procedure, it is good practice to remove validation-overrides.xml, or the schema-removal element inside it, to avoid accidental data loss later. The directory listings above are for illustration only.
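
The query, delete, repeat loop from option 1 can be sketched in Python as follows. This is a minimal illustration, not an official client. It assumes an unauthenticated endpoint such as http://localhost:8080, that each query hit carries the full document id in its "id" field, and that ids have the plain id:namespace:doctype::key form. The function names are made up for this example. The last helper writes the same ids as remove operations in a JSON array, which is one way to build the file fed in option 2.

```python
import json
import time
import urllib.parse
import urllib.request


def document_v1_path(doc_id):
    """Map 'id:<namespace>:<doctype>::<key>' to its /document/v1 path."""
    scheme, namespace, doctype, modifiers, key = doc_id.split(":", 4)
    if scheme != "id" or modifiers:
        # Ids with n= or g= modifiers use number/group paths; not handled here.
        raise ValueError(f"unsupported document id: {doc_id}")
    quoted_key = urllib.parse.quote(key, safe="")
    return f"/document/v1/{namespace}/{doctype}/docid/{quoted_key}"


def query_ids(endpoint, yql, hits=100):
    """Return the ids of up to `hits` documents matching `yql`."""
    params = urllib.parse.urlencode({"yql": yql, "hits": hits})
    with urllib.request.urlopen(f"{endpoint}/search/?{params}") as response:
        root = json.load(response)["root"]
    # Assumption: every hit exposes the full document id in its "id" field.
    return [hit["id"] for hit in root.get("children", [])]


def delete_document(endpoint, doc_id):
    """Issue a single DELETE against /document/v1 for one document."""
    url = endpoint + document_v1_path(doc_id)
    request = urllib.request.Request(url, method="DELETE")
    with urllib.request.urlopen(request) as response:
        response.read()


def batch_delete(endpoint, yql, pause_seconds=1.0,
                 query=query_ids, delete=delete_document):
    """Query, delete, repeat until no ids are returned; return the delete count."""
    deleted = 0
    while True:
        ids = query(endpoint, yql)
        if not ids:
            return deleted
        for doc_id in ids:
            delete(endpoint, doc_id)
            deleted += 1
        # Optional pause to reduce load on the cluster while deleting.
        time.sleep(pause_seconds)


def write_remove_feed(doc_ids, path):
    """Write one remove operation per id as a JSON array (feed file for option 2)."""
    with open(path, "w", encoding="utf-8") as feed:
        json.dump([{"remove": doc_id} for doc_id in doc_ids], feed, indent=2)
```

A call could look like `batch_delete("http://localhost:8080", "select * from music where year < 2000")`, where the endpoint, schema, and field are placeholders. The `query` and `delete` parameters exist only so the loop can be exercised with stand-ins instead of a live cluster.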