Optimizing query performance using column statistics

You can compute column-level statistics for Amazon Glue Data Catalog tables in data formats such as Parquet, ORC, JSON, ION, CSV, and XML without setting up additional data pipelines. Column statistics help you to understand data profiles by getting insights about values within a column.

Data Catalog supports generating statistics for column values such as minimum value, maximum value, total null values, total distinct values, average length of values, and total occurrences of true values. Amazon analytical services such as Amazon Redshift and Amazon Athena can use these column statistics to generate query execution plans, and choose the optimal plan that improves query performance.

There are three scenarios for generating column statistics:

Auto

Amazon Glue supports automatic column statistics generation at the catalog-level so that it can automatically generate statistics for new tables in the Amazon Glue Data Catalog.

Scheduled

Amazon Glue supports scheduling column statistics generation so that it can be run automatically on a recurring schedule.

With scheduled statistics computation, the column statistics task updates the overall table-level statistics, such as min, max, and avg with the new statistics, providing query engines with accurate and up-to-date statistics to optimize query execution.

On-demand

Use this option to generate column statistics on-demand whenever needed. This is useful for ad-hoc analysis or when statistics need to be computed immediately.

You can configure to run column statistics generation task using Amazon Glue console, Amazon CLI, and Amazon Glue API operations. When you initiate the process, Amazon Glue starts a Spark job in the background and updates the Amazon Glue table metadata in the Data Catalog. You can view column statistics using Amazon Glue console or Amazon CLI or by calling the GetColumnStatisticsForTable API operation.

Note

If you're using Lake Formation permissions to control access to the table, the role assumed by the column statistics task requires full table access to generate statistics.

The following video demonstrates how to enhance query performance using column statistics.

Topics

Javascript is disabled or is unavailable in your browser.

To use the Amazon Web Services Documentation, Javascript must be enabled. Please refer to your browser's Help pages for instructions.

Integrating with MongoDB

Prerequisites