Offline migration process: Apache Cassandra to Amazon Keyspaces - Amazon Keyspaces (for Apache Cassandra)

Offline migration process: Apache Cassandra to Amazon Keyspaces

Offline migrations are suitable when you can afford downtime to perform the migration. Enterprises commonly schedule maintenance windows for patching, large releases, hardware upgrades, or major upgrades. An offline migration can use such a window to copy data and switch application traffic from Apache Cassandra to Amazon Keyspaces. Offline migration reduces modifications to the application because the application doesn't need to communicate with both Cassandra and Amazon Keyspaces simultaneously. Additionally, with the data flow paused, you can copy the exact state of the data without having to track mutations that occur during the copy.

In this example, you use Amazon Simple Storage Service (Amazon S3) as a staging area for data during the offline migration to minimize downtime. You can automatically import the data that you stored in Amazon S3 in Parquet format into an Amazon Keyspaces table using the Spark Cassandra connector and Amazon Glue. The following section gives a high-level overview of the process. You can find code examples for this process on GitHub.

The offline migration process from Apache Cassandra to Amazon Keyspaces using Amazon S3 and Amazon Glue requires the following Amazon Glue jobs.

  1. An ETL job that extracts and transforms CQL data and stores it in an Amazon S3 bucket.

  2. A second job that imports the data from the bucket to Amazon Keyspaces.

  3. A third job to import incremental data.
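
The export and import jobs both go through the Spark Cassandra connector, so much of the difference between them lies in the connection settings. The following Python sketch builds those settings as plain dictionaries. It is an illustration under stated assumptions, not the configuration used by the examples on GitHub: the helper names, the private IP address, and the endpoint hostname are hypothetical, and authentication settings are omitted. The property keys are standard Spark Cassandra connector options, and the target values reflect that Amazon Keyspaces accepts TLS connections on port 9142 and writes at LOCAL_QUORUM consistency.

```python
def cassandra_source_conf(host, port=9042):
    """Connector settings for reading from the self-managed Cassandra cluster.

    `host` is a placeholder for the address of a Cassandra node that is
    reachable from the Glue job inside the VPC.
    """
    return {
        "spark.cassandra.connection.host": host,
        "spark.cassandra.connection.port": str(port),
    }


def keyspaces_target_conf(endpoint):
    """Connector settings for writing to Amazon Keyspaces.

    `endpoint` is a placeholder for the service endpoint of your Region.
    Credentials are intentionally left out of this sketch.
    """
    return {
        "spark.cassandra.connection.host": endpoint,
        "spark.cassandra.connection.port": "9142",          # TLS port
        "spark.cassandra.connection.ssl.enabled": "true",   # TLS is required
        "spark.cassandra.output.consistency.level": "LOCAL_QUORUM",
    }


# Hypothetical values for illustration only.
source = cassandra_source_conf("10.0.0.12")
target = keyspaces_target_conf("cassandra.us-east-1.amazonaws.com")

for name, conf in (("source", source), ("target", target)):
    for key, value in sorted(conf.items()):
        print(f"{name}: {key}={value}")
```

Keeping the two sets of settings separate makes it harder to point the import job at the source cluster by mistake. In a Spark job, each key and value pair would typically be applied to the session configuration before any read or write is issued.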

How to perform an offline migration to Amazon Keyspaces from Cassandra running on Amazon EC2 in an Amazon Virtual Private Cloud
  1. First, use Amazon Glue to export table data from Cassandra in Parquet format and save it to an Amazon S3 bucket. To do this, run an Amazon Glue job that uses an Amazon Glue connector to the VPC where the Amazon EC2 instance running Cassandra resides. Then, using the Amazon S3 private endpoint, you can save the data to the Amazon S3 bucket. The following diagram illustrates these steps.

    Migrating Apache Cassandra data from Amazon EC2 running in a VPC to an Amazon S3 bucket using Amazon Glue.
  2. Shuffle the data in the Amazon S3 bucket to improve data randomization. Importing randomized data distributes write traffic more evenly across the partitions of the target table. This step is required when exporting data from Cassandra with large partitions (partitions with more than 1000 rows) to avoid hot key patterns when inserting the data into Amazon Keyspaces. Hot key issues cause WriteThrottleEvents in Amazon Keyspaces and result in increased load time.

    An Amazon Glue job shuffles data from an Amazon S3 bucket and writes it to another Amazon S3 bucket.
  3. Use another Amazon Glue job to import data from the Amazon S3 bucket into Amazon Keyspaces. The shuffled data in the Amazon S3 bucket is stored in Parquet format.

    The Amazon Glue import job takes shuffled data from the Amazon S3 bucket and moves it into an Amazon Keyspaces table.
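
To see why the shuffle in step 2 matters, consider the order in which rows reach the target table. The following Python sketch is a small in-memory illustration, not the Amazon Glue job itself. The row layout, the `pk` and `ck` field names, and both helper functions are hypothetical. It measures the longest streak of consecutive rows that share a partition key before and after shuffling, which is a rough proxy for how long a single partition would receive back-to-back writes.

```python
import random
from itertools import groupby


def longest_partition_run(rows, key="pk"):
    """Return the longest streak of consecutive rows sharing a partition key."""
    runs = (sum(1 for _ in group)
            for _, group in groupby(rows, key=lambda row: row[key]))
    return max(runs, default=0)


def shuffle_rows(rows, seed=None):
    """Return a new list with the rows in random order; the input is unchanged."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    return shuffled


# Hypothetical export: 4 large partitions of 1,500 rows each, written in
# partition order, as a table scan of the source would typically produce.
exported = [{"pk": f"device-{p}", "ck": c}
            for p in range(4) for c in range(1500)]
shuffled = shuffle_rows(exported, seed=42)

print(longest_partition_run(exported))  # 1500: every write in a row hits one partition
print(longest_partition_run(shuffled))  # far shorter once the rows are interleaved
```

In a Spark job, a comparable effect is often obtained by ordering the DataFrame on a random column before writing it back to Amazon S3. That is an assumption about one possible implementation, not a description of the code on GitHub.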