Optimizing query performance using column statistics - Amazon Glue
Services or capabilities described in Amazon Web Services documentation might vary by Region. To see the differences applicable to the China Regions, see Getting Started with Amazon Web Services in China (PDF).

Optimizing query performance using column statistics

You can compute column-level statistics for Amazon Glue Data Catalog tables in data formats such as Parquet, ORC, JSON, ION, CSV, and XML without setting up additional data pipelines. Column statistics help you to understand data profiles by getting insights about values within a column.

Data Catalog supports generating statistics for column values such as minimum value, maximum value, total null values, total distinct values, average length of values, and total occurrences of true values. Amazon analytical services such as Amazon Redshift and Amazon Athena can use these column statistics to generate query execution plans, and choose the optimal plan that improves query performance.

There are two scenarios for generating column statistics:

Scheduled

Amazon Glue supports scheduling column statistics generation so that it can be run automatically on a recurring schedule.

With scheduled statistics computation, the column statistics task updates the overall table-level statistics, such as min, max, and avg with the new statistics, providing query engines with accurate and up-to-date statistics to optimize query execution.

On-demand

Use this option to generate column statistics on-demand whenever needed. This is useful for ad-hoc analysis or when statistics need to be computed immediately.

You can configure to run column statistics generation task using Amazon Glue console, Amazon CLI, and Amazon Glue API operations. When you initiate the process, Amazon Glue starts a Spark job in the background and updates the Amazon Glue table metadata in the Data Catalog. You can view column statistics using Amazon Glue console or Amazon CLI or by calling the GetColumnStatisticsForTable API operation.

Note

If you're using Lake Formation permissions to control access to the table, the role assumed by the column statistics task requires full table access to generate statistics.

The following video demonstrates how to enhance query performance using column statistics.