How to select the right tool for bulk uploading or migrating data to Amazon Keyspaces

In this section, you can review the different tools that you can use to bulk upload or migrate data to Amazon Keyspaces, and learn how to select the right tool based on your needs. This section also provides an overview of, and use cases for, the available step-by-step tutorials that demonstrate how to import data into Amazon Keyspaces.

To review the available strategies to migrate workloads from Apache Cassandra to Amazon Keyspaces, see Create a migration plan for migrating from Apache Cassandra to Amazon Keyspaces.

  • Migration tools

    • With the pricing calculator for Amazon Keyspaces (for Apache Cassandra) available on GitHub, you can estimate your monthly costs for Amazon Keyspaces based on your existing Apache Cassandra workload. Enter metrics from your Cassandra nodetool status output and your intended serverless configuration for Amazon Keyspaces to compare direct costs between the two solutions (a nodetool example follows this list). Note that this calculator covers only the operational costs of Amazon Keyspaces compared to your existing Cassandra deployment. It doesn't include total cost of ownership (TCO) factors such as infrastructure maintenance, operational overhead, or support costs for Cassandra.

    • ZDM Dual Write Proxy for Amazon Keyspaces Migration – The ZDM Dual Write Proxy, available on GitHub, supports zero-downtime migration from Apache Cassandra to Amazon Keyspaces.

    • CQLReplicator – CQLReplicator is an open source utility available on GitHub that helps you migrate data from Apache Cassandra to Amazon Keyspaces in near real time.

      For more information, see Migrate data using CQLReplicator.

    • To learn more about how to use Amazon Managed Streaming for Apache Kafka to implement an online migration process with dual-writes, see Guidance for continuous data migration from Apache Cassandra to Amazon Keyspaces.

    • For large migrations, consider using an extract, transform, and load (ETL) tool. You can use Amazon Glue to perform the data transformation and migration quickly and effectively. For more information, see Offline migration process: Apache Cassandra to Amazon Keyspaces.

    • To learn how to use the Apache Cassandra Spark connector to write data to Amazon Keyspaces, see Tutorial: Integrate with Apache Spark to import or export data.

    • Get started quickly with loading data into Amazon Keyspaces by using the cqlsh COPY FROM command. cqlsh is included with Apache Cassandra and is best suited for loading small datasets or test data (an example follows this list). For step-by-step instructions, see Tutorial: Loading data into Amazon Keyspaces using cqlsh.

    • You can also use the DataStax Bulk Loader for Apache Cassandra to load data into Amazon Keyspaces with the dsbulk command (an example follows this list). DSBulk provides more robust import capabilities than cqlsh and is available from the GitHub repository. For step-by-step instructions, see Tutorial: Loading data into Amazon Keyspaces using DSBulk.
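
The following examples sketch how some of the tools above are invoked. For the pricing calculator, you collect sizing metrics from your existing cluster with nodetool and enter them into the calculator. This is a minimal sketch that assumes you can run nodetool on a Cassandra node; the output file names are only illustrations.

  # Capture cluster size and per-node data load as input for the pricing calculator
  nodetool status > nodetool-status.txt

  # Optionally capture table-level statistics to help estimate per-table throughput
  nodetool tablestats my_keyspace > tablestats-my_keyspace.txt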
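
To write data with the Spark Cassandra Connector, you typically pass the connector package and the Amazon Keyspaces service endpoint to your Spark job. The connector version, Region endpoint, and script name below are assumptions, and the Spark tutorial linked above covers the full SSL and authentication configuration that a real job needs.

  spark-submit \
    --packages com.datastax.spark:spark-cassandra-connector_2.12:3.1.0 \
    --conf spark.cassandra.connection.host=cassandra.us-east-1.amazonaws.com \
    --conf spark.cassandra.connection.port=9142 \
    --conf spark.cassandra.connection.ssl.enabled=true \
    my_import_job.py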
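
The cqlsh COPY FROM command loads rows from a CSV file into an existing table. The keyspace, table, column, and file names below are placeholders; a minimal example entered at the cqlsh prompt might look like the following.

  -- Load a CSV file with a header row into an existing table (run inside cqlsh)
  COPY catalog.book_awards (award, year, category, rank, author, book_title, publisher)
  FROM './book_awards.csv' WITH HEADER=true;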
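
Similarly, a basic dsbulk load invocation maps a CSV file to a target keyspace and table. The file, keyspace, and table names are placeholders, and the dsbulk_keyspaces.conf configuration file stands in for the SSL and authentication settings described in the DSBulk tutorial.

  dsbulk load -f ./dsbulk_keyspaces.conf \
    -url ./book_awards.csv \
    -k catalog -t book_awards \
    -header true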

General considerations for data uploads to Amazon Keyspaces

  • Break the data upload down into smaller components.

    Consider the following units of migration and their potential footprint in terms of raw data size. Uploading smaller amounts of data in one or more phases may help simplify your migration.

    • By cluster – Migrate all of your Cassandra data at once. This approach may be fine for smaller clusters.

    • By keyspace or table – Break up your migration into groups of keyspaces or tables. This approach can help you migrate data in phases based on your requirements for each workload (a table-by-table sketch follows this list).

    • By data – Consider migrating data for a specific group of users or products, to bring the size of data down even more.

  • Prioritize what data to upload first based on simplicity.

    Consider whether some data could be migrated more easily first, for example data that doesn't change during specific periods, data from nightly batch jobs, data that isn't used during offline hours, or data from internal apps.
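
As an illustration of migrating by keyspace or table, the following sketch unloads one table at a time from the source Cassandra cluster with DSBulk and loads it into Amazon Keyspaces. The host name, keyspace, table list, and configuration file are assumptions; adjust them to your environment.

  # Migrate a keyspace one table at a time (placeholder names throughout)
  for table in book_awards authors publishers; do
    # Unload from the source Cassandra cluster to a local directory of CSV files
    dsbulk unload -h cassandra-source.example.com \
      -k catalog -t "$table" -url "./export/$table"

    # Load into Amazon Keyspaces using a config file with SSL and auth settings
    dsbulk load -f ./dsbulk_keyspaces.conf \
      -k catalog -t "$table" -url "./export/$table"
  done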