Optimizing query performance for Iceberg tables - Amazon Glue

Optimizing query performance for Iceberg tables

Apache Iceberg is a high-performance open table format for huge analytic datasets. Amazon Glue supports calculating and updating the number of distinct values (NDVs) for each column in Iceberg tables. Query engines can use these statistics to produce better query plans, which helps data engineers and scientists who work with large-scale datasets improve query optimization, data management, and performance efficiency.

Amazon Glue estimates the number of distinct values in each column of the Iceberg table and stores the estimates in Puffin files on Amazon S3 that are associated with Iceberg table snapshots. Puffin is an Iceberg file format designed to store metadata such as indexes, statistics, and sketches. Storing sketches in Puffin files tied to snapshots ensures transactional consistency and freshness of the NDV statistics.
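To make the idea of estimating distinct values from a sketch more concrete, the following is a minimal illustration of one well-known technique, a K-minimum-values (KMV) sketch. This page does not state which algorithm Amazon Glue uses, so treat this only as a conceptual sketch and not as the Amazon Glue implementation. The function names `_unit_hash` and `estimate_ndv` and the default `k=512` are assumptions chosen for the example.

```python
import hashlib
import heapq
from typing import Any, Iterable


def _unit_hash(value: Any) -> float:
    """Map a value to a pseudo-random point in [0, 1) using SHA-256."""
    digest = hashlib.sha256(repr(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def estimate_ndv(values: Iterable[Any], k: int = 512) -> float:
    """Estimate the number of distinct values with a K-minimum-values sketch.

    Illustrative only: this is NOT the algorithm Amazon Glue is documented
    to use. It keeps the k smallest distinct hashes seen so far.
    """
    heap = []      # max-heap (stored as negated hashes) of the k smallest hashes
    kept = set()   # the same hashes, for fast duplicate checks
    for value in values:
        h = _unit_hash(value)
        if h in kept:
            continue  # this distinct value is already in the sketch
        if len(heap) < k:
            heapq.heappush(heap, -h)
            kept.add(h)
        elif h < -heap[0]:
            # Replace the current largest kept hash with the smaller one.
            evicted = -heapq.heappushpop(heap, -h)
            kept.discard(evicted)
            kept.add(h)
    if len(heap) < k:
        # Fewer than k distinct hashes were seen, so the count is exact.
        return float(len(heap))
    # The k-th smallest hash sits near k / NDV, so invert it to estimate.
    return (k - 1) / -heap[0]
```

The sketch uses a fixed amount of memory regardless of table size, which is the general reason approximate NDV statistics are practical to compute and store for very large tables.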

You can configure and run the column statistics generation task using the Amazon Glue console or the Amazon CLI. When you initiate the process, Amazon Glue starts a Spark job in the background and updates the Amazon Glue table metadata in the Data Catalog. You can view column statistics using the Amazon Glue console or the Amazon CLI, or by calling the GetColumnStatisticsForTable API operation.
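The workflow above can also be driven from the SDK for Python (Boto3). The following is a minimal sketch, assuming the Boto3 Glue client methods `start_column_statistics_task_run` and `get_column_statistics_for_table`. The helper names (`start_stats_task`, `extract_ndvs`, `fetch_ndvs`) and any database, table, column, and role values you pass in are hypothetical placeholders, not names from this guide. The helpers accept the client as a parameter so that they can be exercised without Amazon credentials.

```python
from typing import Any, Dict, List


def start_stats_task(glue_client: Any, database: str, table: str, role_arn: str) -> str:
    """Start a column statistics task run and return its run ID.

    role_arn is the role the task assumes; see the Lake Formation note
    in this topic about the table access that role needs.
    """
    response = glue_client.start_column_statistics_task_run(
        DatabaseName=database,
        TableName=table,
        Role=role_arn,
    )
    return response["ColumnStatisticsTaskRunId"]


def extract_ndvs(response: Dict[str, Any]) -> Dict[str, int]:
    """Pull NumberOfDistinctValues per column out of a
    GetColumnStatisticsForTable response."""
    ndvs: Dict[str, int] = {}
    for column in response.get("ColumnStatisticsList", []):
        # StatisticsData holds a "Type" string plus one type-specific dict.
        for payload in column.get("StatisticsData", {}).values():
            if isinstance(payload, dict) and "NumberOfDistinctValues" in payload:
                ndvs[column["ColumnName"]] = payload["NumberOfDistinctValues"]
    return ndvs


def fetch_ndvs(glue_client: Any, database: str, table: str, columns: List[str]) -> Dict[str, int]:
    """Call GetColumnStatisticsForTable and return {column name: NDV}."""
    response = glue_client.get_column_statistics_for_table(
        DatabaseName=database,
        TableName=table,
        ColumnNames=columns,
    )
    return extract_ndvs(response)
```

To use these helpers against a real account, create the client with `boto3.client("glue")` and pass it as `glue_client`. Columns whose statistics type carries no distinct-value count (for example, Boolean columns) are omitted from the returned dictionary.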

Note

If you're using Amazon Lake Formation permissions to control access to the table, the role assumed by the column statistics task requires full table access to generate statistics.
