Services or capabilities described in Amazon Web Services documentation might vary by Region. To see the differences applicable to the China Regions,
see Getting Started with Amazon Web Services in China (PDF).
Enabling snapshot retention optimizer
You can use the Amazon Glue console, the Amazon CLI, or the Amazon API to enable the snapshot retention optimizer for your Apache Iceberg tables in the Data Catalog.
For new tables, you can choose Apache Iceberg as the table format and enable the snapshot retention optimizer when you create the table.
Snapshot retention is disabled by default for new tables.
- Console
-
To enable the snapshot retention optimizer
-
Open the Amazon Glue console at https://console.amazonaws.cn/glue/ and sign in as a data lake administrator, the table creator, or a user who has been granted
the glue:UpdateTable and lakeformation:GetDataAccess permissions on the table.
-
In the navigation pane, under Data Catalog, choose Tables.
On the Tables page, choose the Iceberg table for which you want to enable
the snapshot retention optimizer. Then, on the Actions menu,
choose Enable under Optimization.
You can also enable optimization by selecting the table and opening the
Table details page. Choose the Table
optimization tab in the lower section of the page, and choose
Enable snapshot retention.
-
On the Enable optimization page, under
Optimization configuration, choose one of two options:
Use default setting or Customize
settings. If you use the default settings, Amazon Glue uses
the properties defined in the Iceberg table configuration to determine the
snapshot retention period and the number of snapshots to retain. If the
table has no such configuration, Amazon Glue retains one snapshot for five days and
deletes the files associated with the expired snapshots.
-
Next, choose an IAM role that Amazon Glue can assume on your behalf to run the optimizer. For details about the permissions required for the IAM role,
see the Table optimization prerequisites section.
Follow the steps below to update an existing IAM role:
-
To update the permissions policy for the IAM role, in the IAM console, go to the IAM role that is used to run the snapshot retention optimizer.
-
In the Add permissions section, choose Create policy. In the newly opened browser window, create a new policy to use with your role.
On the Create policy page, choose the JSON tab. Copy the JSON code shown in the Prerequisites into the policy editor field.
-
If you prefer to set the values for the Snapshot retention configuration manually, choose Customize settings.
-
Select the Apply the selected IAM role to the selected
optimizers check box to use a single IAM role for all of the
optimizers that you enable.
-
If you have security policy configurations where the Iceberg table
optimizer needs to access Amazon S3 buckets from a specific Virtual Private
Cloud (VPC), create an Amazon Glue network connection or use an existing
one.
If you don't have an Amazon Glue VPC Connection set up already,
create a new one by following the steps in the Creating connections for connectors section using the Amazon Glue console or the Amazon CLI/SDK.
Next, under Snapshot retention configuration, either choose to use the
values specified in the Iceberg table configuration, or specify custom values for snapshot
retention period (history.expire.max-snapshot-age-ms) and minimum number of
snapshots (history.expire.min-snapshots-to-keep) to retain.
-
Choose Delete associated files to delete underlying files when the table optimizer deletes old snapshots from the table metadata.
If you don't choose this option, when older snapshots are removed from the table metadata, their associated files will remain in the storage as orphaned files.
-
Next, read the caution statement, and choose I acknowledge
to proceed.
In the Data Catalog, the snapshot retention optimizer honors the lifecycle that
is controlled by branch-level and tag-level retention policies. For more information,
see the Branching and tagging section in the Iceberg documentation.
-
Review the configuration and choose Enable
optimization.
Wait a few minutes for the retention optimizer to run and expire old snapshots
based on the configuration.
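The fallback behavior described in the console steps can be illustrated with a short Python sketch. It only mirrors the rule stated above (use the Iceberg table properties when present, otherwise keep one snapshot for five days). The function names `days_to_ms` and `effective_retention` are hypothetical helpers written for this illustration, and the sketch is not a description of how Amazon Glue resolves the values internally.

```python
MS_PER_DAY = 24 * 60 * 60 * 1000

# Fallback values taken from the text above: one snapshot kept for five days.
DEFAULT_RETENTION_DAYS = 5
DEFAULT_MIN_SNAPSHOTS = 1


def days_to_ms(days):
    """Convert a retention period in days to milliseconds."""
    return int(days * MS_PER_DAY)


def effective_retention(table_properties):
    """Return (max_snapshot_age_ms, min_snapshots_to_keep).

    Reads the two Iceberg table properties named in the console steps and
    falls back to the defaults when a property is missing.
    """
    max_age_ms = int(table_properties.get(
        "history.expire.max-snapshot-age-ms",
        days_to_ms(DEFAULT_RETENTION_DAYS)))
    min_snapshots = int(table_properties.get(
        "history.expire.min-snapshots-to-keep",
        DEFAULT_MIN_SNAPSHOTS))
    return max_age_ms, min_snapshots


# A table without retention properties falls back to 5 days / 1 snapshot.
print(effective_retention({}))

# A table that sets both properties (7 days, at least 3 snapshots).
print(effective_retention({
    "history.expire.max-snapshot-age-ms": str(days_to_ms(7)),
    "history.expire.min-snapshots-to-keep": "3",
}))
```

Iceberg expresses the snapshot age in milliseconds, so a seven-day retention period corresponds to 604800000 in `history.expire.max-snapshot-age-ms`.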
- Amazon CLI
-
To enable snapshot retention for new Iceberg tables in Amazon Glue,
create a table optimizer of type retention and set the enabled field to true
in the table-optimizer-configuration.
You can do this with the Amazon CLI command create-table-optimizer or update-table-optimizer.
You also need to specify retention configuration fields such as snapshotRetentionPeriodInDays
and numberOfSnapshotsToRetain based on your requirements.
The following example shows how to enable the snapshot retention optimizer.
Replace the account ID with a valid Amazon account ID. Replace the database name and
table name with the names of your database and Iceberg table. Replace the
roleArn value with the Amazon Resource Name (ARN) of the IAM role
that has the required permissions to run the snapshot retention
optimizer.
aws glue create-table-optimizer \
  --catalog-id 123456789012 \
  --database-name iceberg_db \
  --table-name iceberg_table \
  --table-optimizer-configuration '{"roleArn":"arn:aws:iam::123456789012:role/optimizer_role","enabled":true,"vpcConfiguration":{"glueConnectionName":"glue_connection_name"},"retentionConfiguration":{"icebergConfiguration":{"snapshotRetentionPeriodInDays":7,"numberOfSnapshotsToRetain":3,"cleanExpiredFiles":true}}}' \
  --type retention
This command creates a retention optimizer for the specified Iceberg table in the given catalog, database, and Region.
The table-optimizer-configuration specifies the IAM role ARN to use, enables the optimizer, and sets the retention configuration.
In this example, it retains snapshots for 7 days, keeps a minimum of 3 snapshots, and cleans expired files.
-
snapshotRetentionPeriodInDays – The number of days to retain snapshots
before expiring them. The default value is 5.
-
numberOfSnapshotsToRetain – The minimum number of snapshots to keep, even
if they are older than the retention period. The default value is 1.
-
cleanExpiredFiles – A Boolean indicating whether to delete expired data
files after expiring snapshots. The default value is true.
When set to true, older snapshots are removed from table metadata, and their
underlying files are deleted. If this parameter is set to false, older snapshots
are removed from table metadata but their underlying files remain in the storage
as orphan files.
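Hand-writing the JSON for --table-optimizer-configuration is error-prone, especially around shell quoting and Boolean values. The following Python sketch builds the same structure as the CLI example and prints a safely quoted argument. The helper names `build_retention_config` and `cli_argument` are hypothetical and exist only for this illustration. The field names come from the example above.

```python
import json
import shlex


def build_retention_config(role_arn, retention_days=5, snapshots_to_retain=1,
                           clean_expired_files=True, glue_connection_name=None):
    """Build the table optimizer configuration as a Python dict.

    The defaults mirror the documented defaults: 5 days, 1 snapshot,
    cleanExpiredFiles set to true.
    """
    config = {
        "roleArn": role_arn,
        "enabled": True,
        "retentionConfiguration": {
            "icebergConfiguration": {
                "snapshotRetentionPeriodInDays": retention_days,
                "numberOfSnapshotsToRetain": snapshots_to_retain,
                "cleanExpiredFiles": clean_expired_files,
            }
        },
    }
    # The VPC configuration is only needed when the optimizer must reach
    # Amazon S3 through a specific Amazon Glue network connection.
    if glue_connection_name:
        config["vpcConfiguration"] = {"glueConnectionName": glue_connection_name}
    return config


def cli_argument(config):
    """Serialize the config to compact JSON and quote it for a POSIX shell."""
    return shlex.quote(json.dumps(config, separators=(",", ":")))


config = build_retention_config(
    "arn:aws:iam::123456789012:role/optimizer_role",
    retention_days=7,
    snapshots_to_retain=3,
    glue_connection_name="glue_connection_name",
)
print("--table-optimizer-configuration", cli_argument(config))
```

Serializing with `json.dumps` emits JSON Booleans as `true` and `false` without extra quoting, so the value can be pasted into the command as a single argument.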
- Amazon API
-
Call the CreateTableOptimizer operation to enable the snapshot retention optimizer for a table.
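As a minimal sketch, the same request can be issued from Python with the boto3 Glue client, whose `create_table_optimizer` method corresponds to the CreateTableOptimizer operation. The wrapper function `enable_snapshot_retention` is a hypothetical helper for this illustration. It takes the client as a parameter, so it assumes you have already created one with credentials and a Region that suit your account.

```python
def enable_snapshot_retention(glue_client, catalog_id, database_name, table_name,
                              role_arn, retention_days=5, snapshots_to_retain=1,
                              clean_expired_files=True):
    """Create an enabled table optimizer of type "retention" for one table.

    glue_client is expected to be a boto3 Glue client, for example
    boto3.client("glue"). The retention defaults mirror the documented ones.
    """
    return glue_client.create_table_optimizer(
        CatalogId=catalog_id,
        DatabaseName=database_name,
        TableName=table_name,
        Type="retention",
        TableOptimizerConfiguration={
            "roleArn": role_arn,
            "enabled": True,
            "retentionConfiguration": {
                "icebergConfiguration": {
                    "snapshotRetentionPeriodInDays": retention_days,
                    "numberOfSnapshotsToRetain": snapshots_to_retain,
                    "cleanExpiredFiles": clean_expired_files,
                }
            },
        },
    )


# Example usage (requires boto3 and valid credentials, so it is left commented out):
#
# import boto3
# glue = boto3.client("glue")
# enable_snapshot_retention(
#     glue, "123456789012", "iceberg_db", "iceberg_table",
#     "arn:aws:iam::123456789012:role/optimizer_role",
#     retention_days=7, snapshots_to_retain=3,
# )
```

Passing the client in, rather than creating it inside the function, keeps the helper easy to exercise with a stub before you run it against a real account.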
After you enable the snapshot retention optimizer, the Table optimization tab shows the
following details about the optimizer run (after approximately 15-20 minutes):
- Start time
-
The time at which the snapshot retention optimizer started. The value is a timestamp in UTC time.
- Run time
-
How long the optimizer takes to complete the task. The value is a
timestamp in UTC time.
- Status
-
The status of the optimizer run. Values are success or fail.
- Data files deleted
-
Total number of files deleted.
- Manifest files deleted
-
Total number of manifest files deleted.
- Manifest lists deleted
-
Total number of manifest lists deleted.
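The same run details can be read programmatically. The following Python sketch calls the boto3 Glue client method `get_table_optimizer` and flattens the last run into a small summary. The function `summarize_last_retention_run` is a hypothetical helper, and the response field names it reads (`lastRun`, `retentionMetrics`, `IcebergMetrics`, and the `NumberOf...Deleted` counters) are assumptions based on the Glue API reference that you should verify against the SDK version you use. Missing fields are returned as None rather than raising an error.

```python
def summarize_last_retention_run(glue_client, catalog_id, database_name, table_name):
    """Return a summary dict for the most recent retention optimizer run.

    glue_client is expected to be a boto3 Glue client. The response keys
    read here are assumptions; absent keys yield None values.
    """
    response = glue_client.get_table_optimizer(
        CatalogId=catalog_id,
        DatabaseName=database_name,
        TableName=table_name,
        Type="retention",
    )
    last_run = response.get("TableOptimizer", {}).get("lastRun", {})
    metrics = last_run.get("retentionMetrics", {}).get("IcebergMetrics", {})

    start = last_run.get("startTimestamp")
    end = last_run.get("endTimestamp")

    return {
        "start_time": start,
        # Elapsed time is only computable once the run has both timestamps.
        "run_time": (end - start) if start and end else None,
        "status": last_run.get("eventType"),
        "data_files_deleted": metrics.get("NumberOfDataFilesDeleted"),
        "manifest_files_deleted": metrics.get("NumberOfManifestFilesDeleted"),
        "manifest_lists_deleted": metrics.get("NumberOfManifestListsDeleted"),
    }
```

Because the optimizer needs some time before its first run completes, a summary taken immediately after enabling it may contain only None values.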