Optimizing query performance using column statistics - Amazon Glue
Services or capabilities described in Amazon Web Services documentation might vary by Region. To see the differences applicable to the China Regions, see Getting Started with Amazon Web Services in China (PDF).

Optimizing query performance using column statistics

You can compute column-level statistics for Amazon Glue Data Catalog tables in data formats such as Parquet, ORC, JSON, ION, CSV, and XML without setting up additional data pipelines. Column statistics help you to understand data profiles by getting insights about values within a column. Data Catalog supports generating statistics for column values such as minimum value, maximum value, total null values, total distinct values, average length of values, and total occurrences of true values.

Amazon analytical services such as Amazon Redshift and Amazon Athena can use these column statistics to generate query execution plans, and choose the optimal plan that improves query performance.

You can configure to run column statistics generation task using Amazon Glue console or Amazon CLI. When you initiate the process, Amazon Glue starts a Spark job in the background and updates the Amazon Glue table metadata in the Data Catalog. You can view column statistics using Amazon Glue console or Amazon CLI or by calling the GetColumnStatisticsForTable API operation.

Note

If you're using Lake Formation permissions to control access to the table, the role assumed by the column statistics task requires full table access to generate statistics.

The following video demonstrates how to enhance query performance using column statistics.