
Compaction management

Amazon S3 data lakes that use open table formats such as Apache Iceberg store table data as S3 objects. When a table accumulates thousands of small S3 objects, metadata overhead increases and read performance degrades. The Amazon Glue Data Catalog provides managed compaction for Iceberg tables, combining small objects into larger ones to improve read performance for Amazon analytics services such as Amazon Athena and Amazon EMR, and for Amazon Glue ETL jobs. The Data Catalog performs compaction without interfering with concurrent queries, and it supports compaction only for tables in Parquet format.
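
For reference, the following is a minimal sketch of enabling managed compaction programmatically through the Glue CreateTableOptimizer API with boto3. The account ID, database name, table name, and IAM role ARN are placeholders; the role must have the permissions required for managed compaction.

import boto3

glue = boto3.client("glue")

# Enable the compaction optimizer on an existing Iceberg table in the Data Catalog.
# The catalog ID (account ID), database, table, and role ARN are illustrative placeholders.
glue.create_table_optimizer(
    CatalogId="111122223333",
    DatabaseName="my_iceberg_db",
    TableName="my_iceberg_table",
    Type="compaction",
    TableOptimizerConfiguration={
        "roleArn": "arn:aws:iam::111122223333:role/GlueCompactionRole",
        "enabled": True,
    },
)

# Retrieve the optimizer configuration and last run details for the table.
response = glue.get_table_optimizer(
    CatalogId="111122223333",
    DatabaseName="my_iceberg_db",
    TableName="my_iceberg_table",
    Type="compaction",
)
print(response)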

The table optimizer continuously monitors table partitions and starts the compaction process when the thresholds for the number of files and file sizes are exceeded. An Iceberg table qualifies for compaction if the target file size specified in its write.target-file-size-bytes property is within the range of 128MB to 512MB. In the Data Catalog, compaction starts when the table contains more than five files, each smaller than 75% of the write.target-file-size-bytes value.
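
As an illustration only (not the Data Catalog's actual implementation), the following sketch expresses the trigger rule described above: the target file size must fall within the supported range, and more than five files must each be smaller than 75% of that target.

# Illustrative sketch of the compaction trigger rule described on this page.
MIN_TARGET = 128 * 1024 * 1024   # 128MB lower bound for write.target-file-size-bytes
MAX_TARGET = 512 * 1024 * 1024   # 512MB upper bound
SMALL_FILE_RATIO = 0.75          # a file counts as "small" if under 75% of the target
MIN_SMALL_FILES = 5              # compaction starts when more than five small files exist

def compaction_would_trigger(target_file_size_bytes: int, file_sizes_bytes: list[int]) -> bool:
    """Return True if the rule described above would start compaction."""
    if not (MIN_TARGET <= target_file_size_bytes <= MAX_TARGET):
        return False  # table does not qualify: target size outside the 128MB-512MB range
    small_file_cutoff = SMALL_FILE_RATIO * target_file_size_bytes
    small_files = sum(1 for size in file_sizes_bytes if size < small_file_cutoff)
    return small_files > MIN_SMALL_FILES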

For example, suppose a table's write.target-file-size-bytes property is set to 512MB (within the prescribed range of 128MB to 512MB) and the table contains 10 files. If 6 of the 10 files are each smaller than 384MB (0.75 × 512MB), the Data Catalog triggers compaction.
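
Applying the sketch above to the same hypothetical numbers:

# 512MB target, 10 files: 6 files under 384MB (75% of 512MB) and 4 files at or above it.
MB = 1024 * 1024
target = 512 * MB
files = [100 * MB] * 6 + [500 * MB] * 4
print(compaction_would_trigger(target, files))  # True: more than five files are below 384MB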

For supported data types, compression formats, and limitations, see Supported formats and limitations for managed data compaction.