Estimate row size in Amazon Keyspaces - Amazon Keyspaces (for Apache Cassandra)
Estimate row size in Amazon Keyspaces

Amazon Keyspaces provides fully managed storage that offers single-digit millisecond read and write performance and stores data durably across multiple Amazon Availability Zones. Amazon Keyspaces attaches metadata to all rows and primary key columns to support efficient data access and high availability.

This section provides details about how to estimate the encoded size of rows in Amazon Keyspaces. The encoded row size is used when calculating your bill and quota use. You should also use the encoded row size when calculating provisioned throughput capacity requirements for tables. To calculate the encoded size of rows in Amazon Keyspaces, you can use the following guidelines.

  • For regular columns, which are columns that aren't primary keys, clustering columns, or STATIC columns, use the raw size of the cell data based on the data type and add the required metadata. For more information about the data types supported in Amazon Keyspaces, see Data types. Some key differences in how Amazon Keyspaces stores data type values and metadata are listed below.

  • Each column name is stored as a column identifier, and the size of that identifier is added to each data value stored in the column. The size of the column identifier depends on the total number of columns in your table:

    • 1–62 columns: 1 byte

    • 63–124 columns: 2 bytes

    • 125–186 columns: 3 bytes

    For each additional 62 columns add 1 byte. Note that in Amazon Keyspaces, up to 225 regular columns can be modified with a single INSERT or UPDATE statement. For more information, see Amazon Keyspaces service quotas.

  • Partition keys can contain up to 2048 bytes of data. Each key column in the partition key requires up to 3 bytes of metadata. When calculating the size of your row, you should assume each partition key column uses the full 3 bytes of metadata.

  • Clustering columns can store up to 850 bytes of data. In addition to the size of the data value, each clustering column requires up to 20% of the data value size for metadata. When calculating the size of your row, you should add 1 byte of metadata for each 5 bytes of clustering column data value.

  • Amazon Keyspaces stores the data value of each partition key and clustering key column twice. The extra overhead is used for efficient querying and built-in indexing.

  • Cassandra ASCII, TEXT, and VARCHAR string data types are all stored in Amazon Keyspaces using Unicode with UTF-8 binary encoding. The size of a string in Amazon Keyspaces equals the number of UTF-8 encoded bytes.

  • Cassandra INT, BIGINT, SMALLINT, and TINYINT data types are stored in Amazon Keyspaces as data values with variable length, with up to 38 significant digits. Leading and trailing zeroes are trimmed. The size of any of these data types is approximately 1 byte per two significant digits + 1 byte.

  • A BLOB in Amazon Keyspaces is stored with the value's raw byte length.

  • The size of a Null value or a Boolean value is 1 byte.

  • A column that stores collection data types like LIST or MAP requires 3 bytes of metadata, regardless of its contents. The size of a LIST or MAP is (column id) + sum (size of nested elements) + (3 bytes). The size of an empty LIST or MAP is (column id) + (3 bytes). Each individual LIST or MAP element also requires 1 byte of metadata.

  • STATIC column data doesn't count towards the maximum row size of 1 MB. To calculate the data size of static columns, see Calculate the static column size per logical partition in Amazon Keyspaces.

  • Client-side timestamps are stored for every column in each row when the feature is turned on. These timestamps take up approximately 20–40 bytes (depending on your data), and contribute to the storage and throughput cost for the row. For more information, see Client-side timestamps in Amazon Keyspaces.

  • Add 100 bytes to the size of each row for row metadata.
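Several of the per-value guidelines above can be sketched in Python. This is an illustrative approximation only: the function names are hypothetical and not part of any Amazon Keyspaces API, zero is assumed to count as one significant digit, and `collection_bytes` assumes the caller passes the size of every nested element that incurs the 1-byte element metadata.

```python
def column_id_bytes(total_columns: int) -> int:
    """Column identifier size: 1 byte for 1-62 columns, plus 1 byte per additional 62."""
    if total_columns < 1:
        raise ValueError("a table has at least one column")
    return (total_columns + 61) // 62  # integer ceiling of total_columns / 62


def integer_bytes(value: int) -> int:
    """Approximate size of an INT, BIGINT, SMALLINT, or TINYINT value.

    Roughly 1 byte per two significant digits, plus 1 byte.
    Leading and trailing zeroes are trimmed before counting digits.
    """
    digits = str(abs(value)).strip("0")
    significant = max(len(digits), 1)  # assumption: zero counts as one digit
    return (significant + 1) // 2 + 1


def text_bytes(value: str) -> int:
    """ASCII, TEXT, and VARCHAR values are sized by their UTF-8 byte length."""
    return len(value.encode("utf-8"))


def collection_bytes(element_sizes: list[int], column_id: int) -> int:
    """LIST or MAP size: column id + nested elements (1 metadata byte each) + 3 bytes."""
    return column_id + sum(size + 1 for size in element_sizes) + 3


print(column_id_bytes(5))       # 1
print(integer_bytes(1))         # 2
print(text_bytes("héllo"))      # 6
print(collection_bytes([], 1))  # 4
```

Because the integer guideline itself is approximate, treat the result of `integer_bytes` as an estimate rather than an exact on-disk size.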

The total size of an encoded row of data is based on the following formula:

partition key columns + clustering columns + regular columns + row metadata = total encoded size of row

Important

All column metadata, for example column ids, partition key metadata, clustering column metadata, as well as client-side timestamps and row metadata count towards the maximum row size of 1 MB.
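The formula can also be written out term by term. The following is a sketch derived from the guidelines above, not an official formula: the symbols are introduced here for illustration, with d the encoded size in bytes of a column's data value and c the size of the column identifier. It assumes clustering column metadata is rounded up to whole bytes, and it leaves out client-side timestamps and STATIC columns.

```latex
\[
S_{\text{row}} =
\sum_{i \in \text{PK}} \left( 2 d_i + c + 3 \right)
+ \sum_{j \in \text{CK}} \left( 2 d_j + \left\lceil \tfrac{d_j}{5} \right\rceil + c \right)
+ \sum_{k \in \text{REG}} \left( d_k + c \right)
+ 100
\]
```

Here PK, CK, and REG are the sets of partition key columns, clustering columns, and regular columns. The factor of 2 reflects that key column values are stored twice, and the constant 100 is the row metadata.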

Consider the following example of a table where all columns are of type integer. The table has two partition key columns, two clustering columns, and one regular column. Because this table has five columns, the space required for the column name identifier is 1 byte.

CREATE TABLE mykeyspace.mytable(pk_col1 int, pk_col2 int, ck_col1 int, ck_col2 int, reg_col1 int, primary key((pk_col1, pk_col2),ck_col1, ck_col2));

In this example, we calculate the size of data when we write a row to the table as shown in the following statement:

INSERT INTO mykeyspace.mytable (pk_col1, pk_col2, ck_col1, ck_col2, reg_col1) values(1,2,3,4,5);

To estimate the total bytes required by this write operation, you can use the following steps.

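  1. Calculate the size of each partition key column by adding the bytes for the data type stored in the column and the metadata bytes. Because Amazon Keyspaces stores the data value of each key column twice, multiply the data bytes by 2. Repeat this for all partition key columns.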

    1. Calculate the size of the first column of the partition key (pk_col1):

      (2 bytes for the integer data type) x 2 + 1 byte for the column id + 3 bytes for partition key metadata = 8 bytes
    2. Calculate the size of the second column of the partition key (pk_col2):

      (2 bytes for the integer data type) x 2 + 1 byte for the column id + 3 bytes for partition key metadata = 8 bytes
    3. Add both columns to get the total estimated size of the partition key columns:

      8 bytes + 8 bytes = 16 bytes for the partition key columns
  2. Calculate the size of each clustering column by adding the bytes for the data type stored in the column and the metadata bytes. Because key column values are stored twice, the data bytes are multiplied by 2. Repeat this for all clustering columns.

    1. Calculate the size of the first clustering column (ck_col1):

      (2 bytes for the integer data type) x 2 + 1 byte for clustering column metadata (20% of the 2-byte data value, rounded up) + 1 byte for the column id = 6 bytes
    2. Calculate the size of the second clustering column (ck_col2):

      (2 bytes for the integer data type) x 2 + 1 byte for clustering column metadata (20% of the 2-byte data value, rounded up) + 1 byte for the column id = 6 bytes
    3. Add both columns to get the total estimated size of the clustering columns:

      6 bytes + 6 bytes = 12 bytes for the clustering columns
  3. Add the size of the regular columns. In this example there is only one regular column. It stores a single-digit integer, which requires 2 bytes for the data value plus 1 byte for the column id, for a total of 3 bytes.

  4. Finally, to get the total encoded row size, add up the bytes for all columns and add the additional 100 bytes for row metadata:

    16 bytes for the partition key columns + 12 bytes for clustering columns + 3 bytes for the regular column + 100 bytes for row metadata = 131 bytes.
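
The steps above can be reproduced with a short Python sketch. The function names are hypothetical and chosen for illustration; the sketch assumes clustering column metadata is rounded up to 1 byte per started 5 bytes of data, and it ignores client-side timestamps and STATIC columns.

```python
ROW_METADATA_BYTES = 100
PARTITION_KEY_METADATA_BYTES = 3


def partition_key_column_bytes(data_bytes: int, column_id: int) -> int:
    # Key values are stored twice; add the column id and 3 bytes of metadata.
    return data_bytes * 2 + column_id + PARTITION_KEY_METADATA_BYTES


def clustering_column_bytes(data_bytes: int, column_id: int) -> int:
    # Key values are stored twice; metadata is 1 byte per 5 bytes of data, rounded up.
    metadata = (data_bytes + 4) // 5
    return data_bytes * 2 + metadata + column_id


def regular_column_bytes(data_bytes: int, column_id: int) -> int:
    # Regular columns store the value once, plus the column id.
    return data_bytes + column_id


def encoded_row_bytes(pk_sizes, ck_sizes, regular_sizes, column_id=1):
    """Estimate the encoded row size from the data sizes (in bytes) of each column."""
    total = sum(partition_key_column_bytes(d, column_id) for d in pk_sizes)
    total += sum(clustering_column_bytes(d, column_id) for d in ck_sizes)
    total += sum(regular_column_bytes(d, column_id) for d in regular_sizes)
    return total + ROW_METADATA_BYTES


# The example row: five single-digit integers of 2 bytes each, 1-byte column id.
print(encoded_row_bytes(pk_sizes=[2, 2], ck_sizes=[2, 2], regular_sizes=[2]))  # 131
```

Passing the per-column data sizes as lists keeps the sketch independent of any particular data type, so the same function can be reused once the size of each value has been estimated.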

To learn how to monitor serverless resources with Amazon CloudWatch, see Monitoring Amazon Keyspaces with Amazon CloudWatch.