FAQ: Upgrading to the Amazon Glue Data Catalog - Amazon Athena

If you created databases and tables using Athena in a Region before Amazon Glue was available in that Region, the metadata is stored in an Athena-managed data catalog, which only Athena and Amazon Redshift Spectrum can access. To use Amazon Glue with Athena and Redshift Spectrum, you must upgrade to the Amazon Glue Data Catalog.

Why should I upgrade to the Amazon Glue Data Catalog?

Amazon Glue is a fully managed extract, transform, and load (ETL) service. It has three main components:

  • An Amazon Glue crawler can automatically scan your data sources, identify data formats, and infer schema.

  • A fully managed ETL service allows you to transform and move data to various destinations.

  • The Amazon Glue Data Catalog stores metadata about databases and tables and points to the underlying data store, either Amazon S3 or a JDBC-compliant data store.

For more information, see Amazon Glue Concepts.

Upgrading to the Amazon Glue Data Catalog has the following benefits.

Unified metadata repository

The Amazon Glue Data Catalog provides a unified metadata repository across a variety of data sources and data formats. It provides out-of-the-box integration with Amazon Simple Storage Service (Amazon S3), Amazon Relational Database Service (Amazon RDS), Amazon Redshift, Amazon Redshift Spectrum, Athena, Amazon EMR, and any application compatible with the Apache Hive metastore. You can create your table definitions one time and query across engines.

For more information, see Populating the Amazon Glue Data Catalog.

Automatic schema and partition recognition

Amazon Glue crawlers automatically crawl your data sources, identify data formats, and suggest schema and transformations. Crawlers can help automate table creation and automatic loading of partitions that you can query using Athena, Amazon EMR, and Redshift Spectrum. You can also create tables and partitions directly using the Amazon Glue API, SDKs, and the Amazon CLI.

For more information, see Cataloging Tables with a Crawler.

Easy-to-build pipelines

The Amazon Glue ETL engine generates Python code that is entirely customizable, reusable, and portable. You can edit the code using your favorite IDE or notebook and share it with others using GitHub. After your ETL job is ready, you can schedule it to run on the fully managed, scale-out Spark infrastructure of Amazon Glue. Amazon Glue handles provisioning, configuration, and scaling of the resources required to run your ETL jobs, allowing you to tightly integrate ETL with your workflow.

For more information, see Authoring Amazon Glue Jobs in the Amazon Glue Developer Guide.

Are there separate charges for Amazon Glue?

Yes. With Amazon Glue, you pay a monthly rate for storing and accessing the metadata in the Amazon Glue Data Catalog, an hourly rate billed per second for ETL job and crawler runtime, and an hourly rate billed per second for each provisioned development endpoint. The Amazon Glue Data Catalog allows you to store up to a million objects at no charge. If you store more than a million objects, you are charged USD 1 per month for each 100,000 objects over a million. An object in the Amazon Glue Data Catalog is a table, a partition, or a database. For more information, see Amazon Glue Pricing.
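
As a rough illustration of the storage arithmetic above, the following Python sketch estimates the catalog storage charge from an object count. The function and constant names are hypothetical. The sketch also assumes that a partial block of 100,000 objects is charged proportionally, which this page does not state; pass round_up_blocks=True to charge whole blocks instead. Confirm the actual rates and rounding on the Amazon Glue Pricing page.

```python
import math

# Figures quoted in this FAQ; verify current values on the pricing page.
FREE_OBJECTS = 1_000_000      # objects stored at no charge
BLOCK_SIZE = 100_000          # billing increment above the free tier
RATE_PER_BLOCK_USD = 1.00     # USD per 100,000 objects over a million


def estimate_storage_charge(object_count: int, round_up_blocks: bool = False) -> float:
    """Estimate the storage charge in USD for a number of catalog objects.

    An object is a table, a partition, or a database.
    ASSUMPTION: partial blocks are prorated unless round_up_blocks is True.
    """
    if object_count < 0:
        raise ValueError("object_count must be non-negative")
    excess = max(0, object_count - FREE_OBJECTS)
    blocks = excess / BLOCK_SIZE
    if round_up_blocks:
        blocks = math.ceil(blocks)
    return blocks * RATE_PER_BLOCK_USD


# 1,250,000 objects are 250,000 over the free tier, or 2.5 blocks.
print(estimate_storage_charge(1_250_000))                        # 2.5
print(estimate_storage_charge(1_250_000, round_up_blocks=True))  # 3.0
```

This estimate covers object storage only. Access requests, ETL job and crawler runtime, and development endpoints are billed separately.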

Upgrade process FAQ

Who can perform the upgrade?

The user who performs the upgrade must have a customer-managed IAM policy attached that includes a policy statement allowing the upgrade action. This extra check prevents someone from accidentally upgrading the catalog for the entire account. For more information, see Step 1 - Allow a User to Perform the Upgrade.
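
The following Python sketch shows one way such a policy document could be assembled and printed as JSON. This page does not name the upgrade action; the sketch assumes it is glue:ImportCatalogToGlue, so confirm the action name and resource scope against Step 1 before using it. The function name build_upgrade_policy is hypothetical.

```python
import json


def build_upgrade_policy(action: str = "glue:ImportCatalogToGlue") -> dict:
    """Return an IAM policy document that allows a single upgrade action.

    ASSUMPTION: the default action name is not taken from this page;
    check it against the step-by-step upgrade instructions.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [action],
                "Resource": ["*"],
            }
        ],
    }


# Print the document so that it can be pasted into a customer-managed policy.
print(json.dumps(build_upgrade_policy(), indent=2))
```

Attach the resulting policy only to the user who performs the upgrade, and consider detaching it afterward.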

My users use a managed policy with Athena and Redshift Spectrum. What steps do I need to take to upgrade?

The Athena managed policy has been automatically updated with new policy actions that allow Athena users to access Amazon Glue. However, you must still explicitly allow the upgrade action for the user who performs the upgrade. To prevent accidental upgrades, the managed policy does not allow this action.

What happens if I don’t upgrade?

If you don’t upgrade, you cannot use Amazon Glue features with the databases and tables that you create in Athena, or vice versa. You can still use the two services independently. Until you upgrade, Athena and Amazon Glue each prevent you from creating a database or table whose name already exists in the other data catalog. This prevents name collisions when you do upgrade.

Why do I need to add Amazon Glue policies to Athena users?

Before you upgrade, Athena manages the data catalog, so Athena actions must be allowed for your users to perform queries. After you upgrade to the Amazon Glue Data Catalog, Amazon Glue actions must be allowed for your users. Remember, the managed policy for Athena has already been updated to allow the required Amazon Glue actions, so no action is required if you use the managed policy.

What happens if I don’t allow Amazon Glue policies for Athena users?

If you upgrade to the Amazon Glue Data Catalog and don't update a user's customer-managed or inline IAM policies, Athena queries fail because the user won't be allowed to perform actions in Amazon Glue. For the specific actions to allow, see Step 2 - Update Customer-Managed/Inline Policies Associated with Athena Users.
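
Before you upgrade, it can help to check which customer-managed or inline policies would leave users without Amazon Glue access. The following Python sketch reports which required actions a policy document does not allow. The list of required actions is an illustrative subset chosen for this example, not the authoritative list from Step 2, and the function names are hypothetical. The check is deliberately simplified: it looks only at Allow statements and ignores Deny statements, NotAction, resources, and conditions.

```python
from fnmatch import fnmatchcase

# ASSUMPTION: illustrative subset of read actions; use the list in Step 2.
ILLUSTRATIVE_REQUIRED_ACTIONS = [
    "glue:GetDatabase",
    "glue:GetDatabases",
    "glue:GetTable",
    "glue:GetTables",
    "glue:GetPartition",
    "glue:GetPartitions",
]


def allowed_action_patterns(policy: dict) -> list:
    """Collect the action patterns from every Allow statement in a policy."""
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):      # a single statement may be a dict
        statements = [statements]
    patterns = []
    for statement in statements:
        if statement.get("Effect") != "Allow":
            continue
        actions = statement.get("Action", [])
        if isinstance(actions, str):      # a single action may be a string
            actions = [actions]
        patterns.extend(actions)
    return patterns


def missing_actions(policy: dict, required=ILLUSTRATIVE_REQUIRED_ACTIONS) -> list:
    """Return the required actions that no Allow pattern matches.

    Action names are compared case-insensitively, and "*" and "?" in a
    pattern are treated as wildcards.
    """
    patterns = [p.lower() for p in allowed_action_patterns(policy)]
    return [
        action
        for action in required
        if not any(fnmatchcase(action.lower(), p) for p in patterns)
    ]


example_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Action": ["athena:*", "glue:GetTable*"], "Resource": "*"}
    ],
}
print(missing_actions(example_policy))
```

An empty result means that every action in the list is matched by at least one Allow pattern. It does not prove that a query succeeds, because the parts of the policy that the sketch ignores can still block access.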

Is there risk of data loss during the upgrade?

No.

Is my data also moved during this upgrade?

No. The upgrade affects only metadata.