Types of updates - Amazon Athena
Services or capabilities described in Amazon Web Services documentation might vary by Region. To see the differences applicable to the China Regions, see Getting Started with Amazon Web Services in China (PDF).

Types of updates

This topic describes some of the changes that you can make to the schema in CREATE TABLE statements without actually altering your data. We review each type of schema update and specify which data formats allow them in Athena. To update a schema, you can in some cases use an ALTER TABLE command, but in other cases you do not actually modify an existing table. Instead, you create a table with a new name that modifies the schema that you used in your original CREATE TABLE statement.

Depending on how you expect your schemas to evolve, to continue using Athena queries, choose a compatible data format.

Consider an application that reads orders information from an orders table that exists in two formats: CSV and Parquet.

The following example creates a table in Parquet:

CREATE EXTERNAL TABLE orders_parquet ( `orderkey` int, `orderstatus` string, `totalprice` double, `orderdate` string, `orderpriority` string, `clerk` string, `shippriority` int ) STORED AS PARQUET LOCATION 's3://DOC-EXAMPLE-BUCKET/orders_ parquet/';

The following example creates the same table in CSV:

CREATE EXTERNAL TABLE orders_csv ( `orderkey` int, `orderstatus` string, `totalprice` double, `orderdate` string, `orderpriority` string, `clerk` string, `shippriority` int ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LOCATION 's3://DOC-EXAMPLE-BUCKET/orders_csv/';

In the following sections, we review how updates to these tables affect Athena queries.

Adding columns at the beginning or in the middle of the table

Adding columns is one of the most frequent schema changes. For example, you may add a new column to enrich the table with new data. Or, you may add a new column if the source for an existing column has changed, and keep the previous version of this column, to adjust applications that depend on them.

To add columns at the beginning or in the middle of the table, and continue running queries against existing tables, use AVRO, JSON, and Parquet and ORC if their SerDe property is set to read by name. For information, see Index access in ORC and parquet.

Do not add columns at the beginning or in the middle of the table in CSV and TSV, as these formats depend on ordering. Adding a column in such cases will lead to schema mismatch errors when the schema of partitions changes.

The following example creates a new table that adds an o_comment column in the middle of a table based on JSON data.

CREATE EXTERNAL TABLE orders_json_column_addition ( `o_orderkey` int, `o_custkey` int, `o_orderstatus` string, `o_comment` string, `o_totalprice` double, `o_orderdate` string, `o_orderpriority` string, `o_clerk` string, `o_shippriority` int, ) ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe' LOCATION 's3://DOC-EXAMPLE-BUCKET/orders_json/';

Adding columns at the end of the table

If you create tables in any of the formats that Athena supports, such as Parquet, ORC, Avro, JSON, CSV, and TSV, you can use the ALTER TABLE ADD COLUMNS statement to add columns after existing columns but before partition columns.

The following example adds a comment column at the end of the orders_parquet table before any partition columns:

ALTER TABLE orders_parquet ADD COLUMNS (comment string)
Note

To see a new table column in the Athena Query Editor after you run ALTER TABLE ADD COLUMNS, manually refresh the table list in the editor, and then expand the table again.

Removing columns

You may need to remove columns from tables if they no longer contain data, or to restrict access to the data in them.

  • You can remove columns from tables in JSON, Avro, and in Parquet and ORC if they are read by name. For information, see Index access in ORC and parquet.

  • We do not recommend removing columns from tables in CSV and TSV if you want to retain the tables you have already created in Athena. Removing a column breaks the schema and requires that you recreate the table without the removed column.

In this example, remove a column `totalprice` from a table in Parquet and run a query. In Athena, Parquet is read by name by default, this is why we omit the SERDEPROPERTIES configuration that specifies reading by name. Notice that the following query succeeds, even though you changed the schema:

CREATE EXTERNAL TABLE orders_parquet_column_removed ( `o_orderkey` int, `o_custkey` int, `o_orderstatus` string, `o_orderdate` string, `o_orderpriority` string, `o_clerk` string, `o_shippriority` int, `o_comment` string ) STORED AS PARQUET LOCATION 's3://DOC-EXAMPLE-BUCKET/orders_parquet/';

Renaming columns

You may want to rename columns in your tables to correct spelling, make column names more descriptive, or to reuse an existing column to avoid column reordering.

You can rename columns if you store your data in CSV and TSV, or in Parquet and ORC that are configured to read by index. For information, see Index access in ORC and parquet.

Athena reads data in CSV and TSV in the order of the columns in the schema and returns them in the same order. It does not use column names for mapping data to a column, which is why you can rename columns in CSV or TSV without breaking Athena queries.

One strategy for renaming columns is to create a new table based on the same underlying data, but using new column names. The following example creates a new orders_parquet table called orders_parquet_column_renamed. The example changes the column `o_totalprice` name to `o_total_price` and then runs a query in Athena:

CREATE EXTERNAL TABLE orders_parquet_column_renamed ( `o_orderkey` int, `o_custkey` int, `o_orderstatus` string, `o_total_price` double, `o_orderdate` string, `o_orderpriority` string, `o_clerk` string, `o_shippriority` int, `o_comment` string ) STORED AS PARQUET LOCATION 's3://DOC-EXAMPLE-BUCKET/orders_parquet/';

In the Parquet table case, the following query runs, but the renamed column does not show data because the column was being accessed by name (a default in Parquet) rather than by index:

SELECT * FROM orders_parquet_column_renamed;

A query with a table in CSV looks similar:

CREATE EXTERNAL TABLE orders_csv_column_renamed ( `o_orderkey` int, `o_custkey` int, `o_orderstatus` string, `o_total_price` double, `o_orderdate` string, `o_orderpriority` string, `o_clerk` string, `o_shippriority` int, `o_comment` string ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LOCATION 's3://DOC-EXAMPLE-BUCKET/orders_csv/';

In the CSV table case, the following query runs and the data displays in all columns, including the one that was renamed:

SELECT * FROM orders_csv_column_renamed;

Reordering columns

You can reorder columns only for tables with data in formats that read by name, such as JSON or Parquet, which reads by name by default. You can also make ORC read by name, if needed. For information, see Index access in ORC and parquet.

The following example creates a new table with the columns in a different order:

CREATE EXTERNAL TABLE orders_parquet_columns_reordered ( `o_comment` string, `o_orderkey` int, `o_custkey` int, `o_orderpriority` string, `o_orderstatus` string, `o_clerk` string, `o_shippriority` int, `o_orderdate` string ) STORED AS PARQUET LOCATION 's3://DOC-EXAMPLE-BUCKET/orders_parquet/';

Changing a column's data type

You might want to use a different column type when the existing type can no longer hold the amount of information required. For example, an ID column's values might exceed the size of the INT data type and require the use of the BIGINT data type.

When planning to use a different data type for a column, consider the following points:

  • In most cases, you cannot change the data type of a column directly. Instead, you re-create the Athena table and define the column with the new data type.

  • Only certain data types can be read as other data types. See the table in this section for data types that can be so treated.

  • For data in Parquet and ORC, you cannot use a different data type for a column if the table is not partitioned.

  • For partitioned tables in Parquet and ORC, a partition's column type can be different from another partition's column type, and Athena will CAST to the desired type, if possible. For information, see Avoiding schema mismatch errors for tables with partitions.

  • For tables created using the LazySimpleSerDe only, it is possible to use the ALTER TABLE REPLACE COLUMNS statement to replace existing columns with a different data type, but all existing columns that you want to keep must also be redefined in the statement, or they will be dropped. For more information, see ALTER TABLE REPLACE COLUMNS.

  • For Apache Iceberg tables only, you can use the ALTER TABLE CHANGE COLUMN statement to change the data type of a column. ALTER TABLE REPLACE COLUMNS is not supported for Iceberg tables. For more information, see Evolving Iceberg table schema.

Important

We strongly suggest that you test and verify your queries before performing data type translations. If Athena cannot use the target data type, the CREATE TABLE query may fail.

The following table lists data types that be treated as other data types:

Compatible data types
Original data type Available target data types
STRING BYTE, TINYINT, SMALLINT, INT, BIGINT
BYTE TINYINT, SMALLINT, INT, BIGINT
TINYINT SMALLINT, INT, BIGINT
SMALLINT INT, BIGINT
INT BIGINT
FLOAT DOUBLE

The following example uses the CREATE TABLE statement for the original orders_json table to create a new table called orders_json_bigint. The new table uses BIGINT instead of INT as the data type for the `o_shippriority` column.

CREATE EXTERNAL TABLE orders_json_bigint ( `o_orderkey` int, `o_custkey` int, `o_orderstatus` string, `o_totalprice` double, `o_orderdate` string, `o_orderpriority` string, `o_clerk` string, `o_shippriority` BIGINT ) ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe' LOCATION 's3://DOC-EXAMPLE-BUCKET/orders_json';

The following query runs successfully, similar to the original SELECT query, before the data type change:

Select * from orders_json LIMIT 10;