Use the cost-based optimizer
You can use the cost-based optimizer (CBO) feature in Athena SQL to optimize your queries. You can optionally request that Athena gather table or column-level statistics for one of your tables in Amazon Glue. If all of the tables in your query have statistics, Athena uses the statistics to create an execution plan that it determines to be the most performant. The query optimizer calculates alternative plans based on a statistical model and then selects the one that will likely be the fastest to run the query.
Statistics on Amazon Glue tables are collected and stored in the Amazon Glue Data Catalog and made available to Athena for improved query planning and execution. These statistics are column-level statistics such as number of distinct, number of null, max, and min values on file types such as Parquet, ORC, JSON, ION, CSV, and XML. Amazon Athena uses these statistics to optimize queries by applying the most restrictive filters as early as possible in query processing. This filtering limits memory usage and the number of records that must be read to deliver the query results.
In conjunction with CBO, Athena uses a feature called the rule-based optimizer (RBO). RBO mechanically applies rules that are expected to improve query performance. RBO is generally beneficial because its transformations aim to simplify the query plan. However, because RBO does not perform cost calculations or plan comparisons, more complicated queries make it difficult for RBO to create an optimal plan.
For this reason, Athena uses both RBO and CBO to optimize your queries. After Athena
identifies opportunities to improve query execution, it creates an optimal plan. For
information about execution plan details, see View execution plans for SQL queries. For a detailed discussion of how CBO works, see Speed up queries with the cost-based optimizer in Amazon Athena
To generate statistics for Amazon Glue Catalog tables, you can use the Athena console, the Amazon Glue Console, or Amazon Glue APIs. Because Athena is integrated with Amazon Glue Catalog, you automatically get the corresponding query performance improvements when you run queries from Amazon Athena.
Considerations and limitations
-
Table types – Currently, the CBO feature in Athena supports only Hive tables that are in the Amazon Glue Data Catalog.
-
Athena for Spark – The CBO feature is not available in Athena for Spark.
-
Pricing – For pricing information, visit the Amazon Glue pricing page
.
Generate table statistics using the Athena console
This section describes how to use the Athena console to generate table or column-level statistics for a table in Amazon Glue. For information on using Amazon Glue to generate table statistics, see Working with column statistics in the Amazon Glue Developer Guide.
To generate statistics for a table using the Athena console
Open the Athena console at https://console.amazonaws.cn/athena/
. -
In the Athena query editor Tables list, choose the three vertical dots for the table that you want, and then choose Generate statistics.
-
In the Generate statistics dialog box, choose All columns to generate statistics for all columns in the table, or choose Selected columns to select specific columns. All columns is the default setting.
-
For Amazon Glue service role, create or select an existing service role to give permission to Amazon Glue to generate statistics. The Amazon Glue service role also requires
S3:GetObject
permissions to the Amazon S3 bucket that contains the table's data. -
Choose Generate statistics. A Generating statistics for
table_name
notification banner displays the task status. -
To view details in the Amazon Glue console, choose View in Glue.
For information about viewing statistics in the Amazon Glue console, see Viewing column statistics in the Amazon Glue Developer Guide.
-
After statistics have been generated, the tables and columns that have statistics show the word Statistics in parentheses, as in the following image.
Now when you run your queries, Athena will perform cost-based optimization on the tables and columns for which statistics were generated.
Additional resources
For additional information, see the following resource.