Excluding Amazon S3 storage classes - Amazon Glue

Excluding Amazon S3 storage classes

If you're running Amazon Glue ETL jobs that read files or partitions from Amazon Simple Storage Service (Amazon S3), you can exclude objects that are stored in specific Amazon S3 storage classes.

The following storage classes are available in Amazon S3:

  • STANDARD: For general-purpose storage of frequently accessed data.

  • INTELLIGENT_TIERING: For data with unknown or changing access patterns.

  • STANDARD_IA and ONEZONE_IA: For long-lived, but less frequently accessed data.

  • GLACIER and DEEP_ARCHIVE: For long-term archive and digital preservation.

  • REDUCED_REDUNDANCY: For noncritical, reproducible data stored at a lower level of redundancy than STANDARD.

For more information, see Amazon S3 Storage Classes in the Amazon S3 Developer Guide.

The examples in this section show how to exclude the GLACIER and DEEP_ARCHIVE storage classes. Objects in these classes can be listed, but they can't be read until they are restored. (For more information, see Restoring Archived Objects in the Amazon S3 Developer Guide.)

By using storage class exclusions, you can ensure that your Amazon Glue jobs will work on tables that have partitions across these storage class tiers. Without exclusions, jobs that read data from these tiers fail with the following error: AmazonS3Exception: The operation is not valid for the object's storage class.
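
Conceptually, an exclusion is a filter applied to the Amazon S3 listing before any object is read. The following Python sketch illustrates that idea only; it is not the Amazon Glue lister implementation. It assumes listing entries shaped like the items under "Contents" in an S3 ListObjectsV2 response, and the function name and object keys are hypothetical.

```python
# Illustration only: this is NOT how Amazon Glue is implemented internally.
# It shows the idea of dropping objects by storage class before reading them.

EXCLUDED_CLASSES = {"GLACIER", "DEEP_ARCHIVE"}


def filter_readable(objects, excluded=EXCLUDED_CLASSES):
    """Return the listing entries whose storage class is not excluded.

    Each entry is a dict shaped like an item of "Contents" in an
    S3 ListObjectsV2 response (it has "Key" and "StorageClass").
    """
    # Assumption of this sketch: an entry without "StorageClass" is STANDARD.
    return [
        obj for obj in objects
        if obj.get("StorageClass", "STANDARD") not in excluded
    ]


# Hypothetical listing for a table partitioned by year.
listing = [
    {"Key": "sales/year=2023/part-0.parquet", "StorageClass": "STANDARD"},
    {"Key": "sales/year=2015/part-0.parquet", "StorageClass": "GLACIER"},
    {"Key": "sales/year=2012/part-0.parquet", "StorageClass": "DEEP_ARCHIVE"},
]

readable_keys = [obj["Key"] for obj in filter_readable(listing)]
print(readable_keys)
```

Only the STANDARD object remains after filtering, so a reader that processes `readable_keys` never touches an archived object and never triggers the storage class error.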

There are different ways that you can filter Amazon S3 storage classes in Amazon Glue.

Excluding Amazon S3 storage classes when creating a Dynamic Frame

To exclude Amazon S3 storage classes while creating a dynamic frame, use excludeStorageClasses in additionalOptions (additional_options in Python). Amazon Glue then uses its own Amazon S3 lister implementation to list the files and skip those that are in the specified storage classes.

The following Python and Scala examples show how to exclude the GLACIER and DEEP_ARCHIVE storage classes when creating a dynamic frame.

Python example:

    # Skip objects in archival storage classes when reading the table's files
    glueContext.create_dynamic_frame.from_catalog(
        database = "my_database",
        table_name = "my_table_name",
        redshift_tmp_dir = "",
        transformation_ctx = "my_transformation_context",
        additional_options = {
            "excludeStorageClasses" : ["GLACIER", "DEEP_ARCHIVE"]
        }
    )

Scala example:

    val df = glueContext.getCatalogSource(
      nameSpace,
      tableName,
      "",
      "my_transformation_context",
      additionalOptions = JsonOptions(
        Map("excludeStorageClasses" -> List("GLACIER", "DEEP_ARCHIVE"))
      )
    ).getDynamicFrame()

Excluding Amazon S3 storage classes on a Data Catalog table

You can specify storage class exclusions to be used by an Amazon Glue ETL job as a table parameter in the Amazon Glue Data Catalog. You can include this parameter in the CreateTable operation using the Amazon Command Line Interface (Amazon CLI) or programmatically using the API. For more information, see Table Structure and CreateTable.
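
As a rough illustration of the programmatic route, the following Python sketch builds a TableInput structure that carries the parameter and passes it to the boto3 Glue client's create_table call. Several things here are assumptions rather than documented behavior: the parameter value is written as a compact JSON array string, the StorageDescriptor is abbreviated to a location only, and the bucket, prefix, and helper function names are hypothetical. A real table definition would also need columns and format details.

```python
import json


def build_table_input(table_name, location, excluded_classes):
    """Build a TableInput dict for the Glue CreateTable operation."""
    return {
        "Name": table_name,
        # Abbreviated for the sketch; real tables also define columns, formats, etc.
        "StorageDescriptor": {"Location": location},
        "Parameters": {
            # Assumption: table parameters are strings, so the list is stored
            # as a compact JSON array string.
            "excludeStorageClasses": json.dumps(
                list(excluded_classes), separators=(",", ":")
            ),
        },
    }


def create_table(database, table_input):
    """Send the CreateTable request (needs boto3 and AWS credentials)."""
    import boto3  # imported here so the builder above works without boto3

    glue = boto3.client("glue")
    glue.create_table(DatabaseName=database, TableInput=table_input)


# Hypothetical names; build the request body without calling the service.
table_input = build_table_input(
    "my_table_name", "s3://my-bucket/my-prefix/", ["GLACIER", "DEEP_ARCHIVE"]
)
print(table_input["Parameters"]["excludeStorageClasses"])
```

Calling `create_table("my_database", table_input)` would then submit the request. Keeping the builder separate from the service call makes it possible to inspect the request body before anything is sent.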

You can also specify excluded storage classes on the Amazon Glue console.

To exclude Amazon S3 storage classes (console)
  1. Sign in to the Amazon Web Services Management Console and open the Amazon Glue console at https://console.amazonaws.cn/glue/.

  2. In the navigation pane on the left, choose Tables.

  3. Choose the table name in the list, and then choose Edit table.

  4. In Table properties, add excludeStorageClasses as a key and [\"GLACIER\",\"DEEP_ARCHIVE\"] as a value.

  5. Choose Apply.
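
The console steps above could also be approximated in code for a table that already exists. The following Python sketch reads the table with the boto3 Glue client's get_table call, adds the parameter, and writes the table back with update_table. It is a sketch under assumptions: the set of keys copied into TableInput is this example's own list (get_table returns read-only fields that update_table does not accept), the parameter value is written as a compact JSON array string, and the helper names are hypothetical.

```python
import json

# Keys this sketch assumes are accepted in TableInput. Read-only fields
# returned by get_table (DatabaseName, CreateTime, ...) are dropped.
TABLE_INPUT_KEYS = {
    "Name", "Description", "Owner", "LastAccessTime", "LastAnalyzedTime",
    "Retention", "StorageDescriptor", "PartitionKeys", "ViewOriginalText",
    "ViewExpandedText", "TableType", "Parameters", "TargetTable",
}


def with_excluded_classes(table, excluded_classes):
    """Return a TableInput dict based on `table` with the exclusion added."""
    table_input = {k: v for k, v in table.items() if k in TABLE_INPUT_KEYS}
    # Copy the parameters so the original table dict is left unchanged.
    params = dict(table_input.get("Parameters", {}))
    params["excludeStorageClasses"] = json.dumps(
        list(excluded_classes), separators=(",", ":")
    )
    table_input["Parameters"] = params
    return table_input


def apply_to_catalog(database, table_name, excluded_classes):
    """Fetch, modify, and update the table (needs boto3 and AWS credentials)."""
    import boto3  # imported here so the helper above works without boto3

    glue = boto3.client("glue")
    table = glue.get_table(DatabaseName=database, Name=table_name)["Table"]
    glue.update_table(
        DatabaseName=database,
        TableInput=with_excluded_classes(table, excluded_classes),
    )


# Hypothetical, heavily trimmed stand-in for a get_table "Table" result.
existing = {
    "Name": "my_table_name",
    "DatabaseName": "my_database",
    "Parameters": {"classification": "parquet"},
}
updated = with_excluded_classes(existing, ["GLACIER", "DEEP_ARCHIVE"])
print(updated)
```

Existing table parameters are preserved and only the excludeStorageClasses key is added or overwritten, which mirrors adding a single key in the Table properties editor.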