Using the cost-based optimizer - Amazon Athena
Services or capabilities described in Amazon Web Services documentation might vary by Region. To see the differences applicable to the China Regions, see Getting Started with Amazon Web Services in China (PDF).

Using the cost-based optimizer

You can use the cost-based optimizer (CBO) feature in Athena SQL to optimize your queries. You can optionally request that Athena gather table or column-level statistics for one of your tables in Amazon Glue. If all of the tables in your query have statistics, Athena uses the statistics to create an execution plan that it determines to be the most performant. The query optimizer calculates alternative plans based on a statistical model and then selects the one that will likely be the fastest to run the query.

Statistics on Amazon Glue tables are collected and stored in the Amazon Glue Data Catalog and made available to Athena for improved query planning and execution. These statistics are column-level statistics such as number of distinct, number of null, max, and min values on file types such as Parquet, ORC, JSON, ION, CSV, and XML. Amazon Athena uses these statistics to optimize queries by applying the most restrictive filters as early as possible in query processing. This filtering limits memory usage and the number of records that must be read to deliver the query results.

In conjunction with CBO, Athena uses a feature called the rule-based optimizer (RBO). RBO mechanically applies rules that are expected to improve query performance. RBO is generally beneficial because its transformations aim to simplify the query plan. However, because RBO does not perform cost calculations or plan comparisons, more complicated queries make it difficult for RBO to create an optimal plan.

For this reason, Athena uses both RBO and CBO to optimize your queries. After Athena identifies opportunities to improve query execution, it creates an optimal plan. For information about execution plan details, see Viewing execution plans for SQL queries. For a detailed discussion of how CBO works, see the Amazon Big Data cost-based optimizer blog post.

To generate statistics for Amazon Glue Catalog tables, you can use the Athena console, the Amazon Glue Console, or Amazon Glue APIs. Because Athena is integrated with Amazon Glue Catalog, you automatically get the corresponding query performance improvements when you run queries from Amazon Athena.

Considerations and limitations

  • Table types – Currently, the CBO feature in Athena supports only Hive tables that are in the Amazon Glue Data Catalog.

  • Athena for Spark – The CBO feature is not available in Athena for Spark.

  • Pricing – For pricing information, visit the Amazon Glue pricing page.

Generating table statistics using the Athena console

This section describes how to use the Athena console to generate table or column-level statistics for a table in Amazon Glue. For information on using Amazon Glue to generate table statistics, see Working with column statistics in the Amazon Glue Developer Guide.

To generate statistics for a table using the Athena console
  1. Open the Athena console at https://console.amazonaws.cn/athena/.

  2. In the Athena query editor Tables list, choose the three vertical dots for the table that you want, and then choose Generate statistics.

    Context menu for a table in the Athena query editor.
  3. In the Generate statistics dialog box, choose All columns to generate statistics for all columns in the table, or choose Selected columns to select specific columns. All columns is the default setting.

    The generate statistics dialog box.
  4. For Amazon Glue service role, create or select an existing service role to give permission to Amazon Glue to generate statistics. The Amazon Glue service role also requires S3:GetObject permissions to the Amazon S3 bucket that contains the table's data.

    Choosing a Amazon Glue service role.
  5. Choose Generate statistics. A Generating statistics for table_name notification banner displays the task status.

    The Generating statistics notification banner.
  6. To view details in the Amazon Glue console, choose View in Glue.

    For information about viewing statistics in the Amazon Glue console, see Viewing column statistics in the Amazon Glue Developer Guide.

  7. After statistics have been generated, the tables and columns that have statistics show the word Statistics in parentheses, as in the following image.

    A table showing statistics icons in the Athena query editor.

Now when you run your queries, Athena will perform cost-based optimization on the tables and columns for which statistics were generated.

Additional resources

For additional information, see the following resource.