Cassandra Backup & Restore

Launchpad blueprint:

https://blueprints.launchpad.net/trove/+spec/cassandra-backup-restore

Problem Description

The Cassandra datastore currently does not support any backup/restore strategy.

Proposed Change

The patch set will implement full backup/restore of a single instance using the Nodetool [1] utility for Cassandra 2.1 [3].

Configuration

The following Cassandra configuration options will be updated:
  • backup_namespace and restore_namespace
  • backup_strategy
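The options above live in the datastore's section of the guest agent configuration. The following is a minimal sketch of how such a section could look and be read with the standard library. The strategy name NodetoolSnapshot and both module paths are assumptions chosen for illustration; this spec does not fix their values.

```python
import configparser

# Hypothetical [cassandra] section: the option names match the list above,
# but the strategy name and namespace module paths are assumptions.
TROVE_GUESTAGENT_CONF = """
[cassandra]
backup_strategy = NodetoolSnapshot
backup_namespace = trove.guestagent.strategies.backup.experimental.cassandra_impl
restore_namespace = trove.guestagent.strategies.restore.experimental.cassandra_impl
"""


def load_backup_options(text):
    """Return the Cassandra backup/restore options as a plain dict."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    section = parser["cassandra"]
    return {
        "strategy": section.get("backup_strategy"),
        "backup_namespace": section.get("backup_namespace"),
        "restore_namespace": section.get("restore_namespace"),
    }
```

The strategy name is resolved inside the backup namespace, and the restore namespace is expected to provide the matching restore strategy.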

Database

None

Public API

None

Public API Security

None

Python API

None

CLI (python-troveclient)

The following Trove CLI commands (upon completion) will be fully functional with Cassandra:

  • backup-create
  • backup-delete
  • backup-list
  • create --backup

Internal API

None

Guest Agent

We are implementing full backups using node snapshots, following the procedure outlined in the Nodetool manual [2]. Nodetool can take a snapshot of one or more keyspaces. A snapshot can be restored by moving all *.db files from a snapshot directory into the corresponding table directory of its keyspace, overwriting any existing files.
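The move-based restore of a single table can be sketched as follows. The sketch assumes the on-disk layout <data_dir>/<keyspace>/<table>/snapshots/<snapshot_name>/, and the function name restore_table_from_snapshot is hypothetical.

```python
import glob
import os


def restore_table_from_snapshot(table_dir, snapshot_name):
    """Move every *.db file of one snapshot back into its table directory.

    Assumes the layout <data_dir>/<keyspace>/<table>/snapshots/<snapshot_name>/.
    Files with the same name already in table_dir are overwritten.
    Returns the sorted list of restored file names.
    """
    snapshot_dir = os.path.join(table_dir, "snapshots", snapshot_name)
    restored = []
    for src in sorted(glob.glob(os.path.join(glob.escape(snapshot_dir), "*.db"))):
        name = os.path.basename(src)
        # os.replace overwrites an existing destination; the snapshot and the
        # table directory share one filesystem, so this is a cheap rename.
        os.replace(src, os.path.join(table_dir, name))
        restored.append(name)
    return restored
```

Non-*.db files in the snapshot directory (for example a manifest) are left where they are.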

When a snapshot is taken, Cassandra keeps the snapshotted data files in the state they were in at that moment and writes all subsequent changes into new data files. The data storage must therefore have enough capacity to accommodate the backlog of all changes made for the duration of the backup operation, until the snapshots are cleaned up.
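One way to act on this capacity requirement is a pre-flight check before the snapshot is taken. The following is a hypothetical sketch: the function names and the headroom factor are assumptions, not something Nodetool or this spec defines. It conservatively requires free space of at least headroom times the current data size.

```python
import os
import shutil


def directory_size(path):
    """Total size in bytes of regular files under path (symlinks skipped).

    Existing snapshot hard links are counted again, which only makes the
    check below more conservative.
    """
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            if not os.path.islink(full):
                total += os.path.getsize(full)
    return total


def has_room_for_snapshot(data_dir, headroom=1.0):
    """Heuristic check that the volume can absorb changes during a backup.

    Assumption: in the worst case every snapshotted file is superseded by new
    data files while the backup runs, so we require
    free space >= headroom * current data size.
    """
    free = shutil.disk_usage(data_dir).free
    return free >= headroom * directory_size(data_dir)
```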

Backups are streamed to and from remote storage as TAR archives. The general procedure for creating and restoring such an archive is outlined below.
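Streaming matters because the archive never has to be fully materialized on the instance's disk. The following is a minimal sketch using Python's tarfile stream modes. Here out_stream stands in for the upload to Swift; the Swift client side and encryption are not shown, and the function name stream_tar is hypothetical.

```python
import tarfile


def stream_tar(files, out_stream, compress=True):
    """Write (source_path, archive_name) pairs to out_stream as a TAR stream.

    The "w|gz" / "w|" modes write sequentially without seeking, so
    out_stream can be a pipe or an upload body. An encrypting wrapper could
    be placed around out_stream; it is not shown here.
    """
    mode = "w|gz" if compress else "w|"
    # tarfile does not close a caller-supplied fileobj.
    with tarfile.open(fileobj=out_stream, mode=mode) as tar:
        for source_path, archive_name in files:
            tar.add(source_path, arcname=archive_name)
```

The restore side reads the same stream sequentially with mode "r|gz".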

Unique backup IDs will be used as snapshot names to avoid collisions between concurrent backups.

The Backup Procedure:

  1. Make sure the database is up and running.

  2. Clear any existing snapshots with the same name as the one about to be created (nodetool clearsnapshot).

  3. Take a snapshot of all keyspaces (nodetool snapshot).

  4. Collect all *.db files from the snapshot directories.

  5. Package the snapshot files into a single TAR archive (compressing and/or encrypting as required) while streaming the output to Swift storage under the database_backups container.

    Transform the paths so that the backup can be restored simply by extracting the archive directly into an existing data directory. This ensures that an old backup can always be restored, even if the standard guest agent data directory changes.

  6. Clear the created snapshots as in step 2.
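Steps 2 to 6 above can be sketched as two helpers: one that builds the nodetool invocations for a given snapshot name, and one that collects the snapshot files and computes their transformed archive paths. The function names are hypothetical, and the layout <data_dir>/<keyspace>/<table>/snapshots/<name>/ is assumed. The pairs it returns are in the (source_path, archive_name) form a streaming TAR writer can consume.

```python
import glob
import os


def snapshot_commands(snapshot_name):
    """nodetool invocations for steps 2, 3 and 6, as argument lists."""
    clear = ["nodetool", "clearsnapshot", "-t", snapshot_name]
    take = ["nodetool", "snapshot", "-t", snapshot_name]
    return clear, take, clear


def collect_snapshot_files(data_dir, snapshot_name):
    """Step 4 plus the step 5 path transformation.

    Maps <data_dir>/<ks>/<table>/snapshots/<name>/<file>.db to the archive
    path <ks>/<table>/<file>.db, so that extracting the archive into any data
    directory puts each file back into its table directory.
    """
    pattern = os.path.join(glob.escape(data_dir), "*", "*", "snapshots",
                           glob.escape(snapshot_name), "*.db")
    pairs = []
    for path in sorted(glob.glob(pattern)):
        rel = os.path.relpath(path, data_dir)
        keyspace, table, _snapshots, _name, filename = rel.split(os.sep)
        pairs.append((path, "/".join([keyspace, table, filename])))
    return pairs
```

Each command list could be run with subprocess.check_call; the same clear command appears twice because step 6 repeats step 2.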

The Restore Procedure:

  1. Stop the database if it is running and clear any files generated in the system keyspace.
  2. Create a new data directory.
  3. Read the backup from storage, unpacking it into the data directory.
  4. Update ownership of the restored files to the Cassandra user.
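Steps 2 to 4 of the restore can be sketched as a single function that consumes the downloaded TAR stream. The function name restore_archive is hypothetical, and the assumption that the Cassandra group shares the user's name is labeled in the code. The owner parameter defaults to None so the sketch can run without root privileges.

```python
import os
import shutil
import tarfile


def restore_archive(in_stream, data_dir, owner=None):
    """Create data_dir, unpack a gzip TAR stream into it, then fix ownership."""
    # Step 2: the data directory is expected to be new.
    os.makedirs(data_dir, exist_ok=False)
    # Step 3: read sequentially; use the "data" extraction filter where this
    # Python version provides it, to reject unsafe member paths.
    extract_kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
    with tarfile.open(fileobj=in_stream, mode="r|gz") as tar:
        tar.extractall(path=data_dir, **extract_kwargs)
    # Step 4: assumes a group with the same name as the Cassandra user.
    if owner is not None:
        shutil.chown(data_dir, user=owner, group=owner)
        for root, dirs, files in os.walk(data_dir):
            for name in dirs + files:
                shutil.chown(os.path.join(root, name), user=owner, group=owner)
```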

Additional Considerations:

Instances are created as single-node clusters, so a restored instance should belong to its own cluster as well. The original cluster name therefore has to be reset to match the new unique ID of the restored instance. This ensures that the restored instance forms a new single-node cluster rather than joining one with the original node or with other instances restored from the same backup. The cluster name is stored in the database and must match the value in the configuration file; otherwise Cassandra fails to start.

A general ‘cluster_name’ reset procedure is:

  1. Update the name in the system keyspace (the system.local table).
  2. Update the name in the configuration file.
  3. Restart the service.
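Steps 1 and 2 of this procedure can be sketched as follows. The CQL targets the cluster_name column of system.local and uses a driver-style %s placeholder for the new name. The configuration edit is a plain-text substitution on cassandra.yaml, which is an assumption made to stay within the standard library (a YAML-aware editor would be more robust). Both function names are hypothetical.

```python
import re


def cluster_name_cql():
    """Step 1: CQL that rewrites the cluster name stored in system.local.

    The new name is meant to be passed as a bound parameter rather than
    interpolated into the statement.
    """
    return "UPDATE system.local SET cluster_name = %s WHERE key = 'local'"


def set_cluster_name_in_yaml(yaml_text, new_name):
    """Step 2: replace the top-level cluster_name entry in cassandra.yaml text."""
    new_text, count = re.subn(
        r"^cluster_name:.*$",
        lambda _match: "cluster_name: '%s'" % new_name,
        yaml_text,
        flags=re.MULTILINE,
    )
    if count != 1:
        raise ValueError("expected exactly one top-level cluster_name entry")
    return new_text
```

The new name would be the restored instance's unique ID, followed by a service restart (step 3).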

The ‘superuser’ (“root”) password stored in the system_auth keyspace also needs to be reset before we can start up with the restored data.

A general password reset procedure is:

  1. Disable user authentication and remote access.
  2. Restart the service.
  3. Update the password in the ‘system_auth.credentials’ table.
  4. Re-enable authentication and make the host reachable.
  5. Restart the service.
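The five steps can be sketched as an ordered plan that a guest agent could execute. In this hypothetical sketch, authentication is disabled by switching authenticator to AllowAllAuthenticator, and remote client access is disabled by binding rpc_address to 127.0.0.1. The plan format and function names are assumptions. The salted hash must be computed by the caller in the format PasswordAuthenticator expects (bcrypt in Cassandra 2.1); hashing is outside the scope of the sketch.

```python
import re


def _set_option(yaml_text, key, value):
    """Replace a top-level 'key: value' line in cassandra.yaml text."""
    new_text, count = re.subn(
        r"^%s:.*$" % re.escape(key),
        lambda _match: "%s: %s" % (key, value),
        yaml_text,
        flags=re.MULTILINE,
    )
    if count != 1:
        raise ValueError("expected exactly one top-level %r entry" % key)
    return new_text


def password_reset_plan(yaml_text, user, salted_hash):
    """Return the ordered (action, argument) pairs of the reset procedure."""
    # Step 1: no authentication, client connections from localhost only.
    open_config = _set_option(yaml_text, "authenticator", "AllowAllAuthenticator")
    open_config = _set_option(open_config, "rpc_address", "127.0.0.1")
    # Step 3: statement plus bound parameters for the credentials table.
    cql = ("UPDATE system_auth.credentials SET salted_hash = %s "
           "WHERE username = %s", (salted_hash, user))
    return [
        ("write_config", open_config),  # 1. disable auth and remote access
        ("restart", None),              # 2.
        ("execute_cql", cql),           # 3. store the new password hash
        ("write_config", yaml_text),    # 4. restore the original settings
        ("restart", None),              # 5.
    ]
```

Restoring the original configuration text in step 4 assumes that it already had password authentication enabled and the externally reachable rpc_address.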

Alternatives

None

Implementation

Assignee(s)

Petr Malik <pmalik@tesora.com>

Milestones

Liberty-1

Work Items

  1. Implement the functionality needed for resetting the cluster name and the superuser password.
  2. Implement backup/restore API calls.

Upgrade Implications

None

Dependencies

The patch set will build on functionality implemented in the blueprints cassandra-database-user-functions [4] and cassandra-configuration-groups [5].

Testing

Unit tests will be added to validate the implemented functions and non-trivial code paths. Functional tests are not implemented as part of this patch set.

Documentation Impact

The datastore documentation should be updated to reflect the enabled features. It should also describe the new Cassandra configuration options: backup_namespace, restore_namespace and backup_strategy.