Federating into external data sources in the Amazon Glue Data Catalog - Amazon Lake Formation
Federating into external data sources in the Amazon Glue Data Catalog

You can use Amazon Glue connections to connect the Amazon Glue Data Catalog (Data Catalog) to data warehouses such as Amazon Redshift and Snowflake, cloud databases such as Amazon RDS, Amazon DynamoDB, and Oracle, streaming services such as Amazon MSK, and on-premises systems such as Teradata. These connections are stored in the Amazon Glue Data Catalog and registered with Amazon Lake Formation, allowing you to create a federated catalog for each available data source.

A federated catalog is a top-level container that points to a database in an external data system. It enables you to query the data directly in the external data system without an extract, transform, and load (ETL) process.

For more information about Amazon Glue connections, see Connecting to data in the Amazon Glue Developer Guide.

Data lake administrators can create federated catalogs using Amazon SageMaker Lakehouse or Amazon Athena.

Data lake administrators can then grant fine-grained permissions on the objects within the catalog using Lake Formation, controlling access at various levels such as catalog, database, table, column, row, or cell. Data analysts can discover and query the cataloged data sources using Athena, with Lake Formation enforcing the defined access policies. Analysts can join data across multiple sources in a single query without needing to connect to each source individually.
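As an illustration of joining data across sources in a single query, the following minimal sketch builds a SQL statement that joins tables from two federated catalogs, then wraps it in keyword arguments of the kind accepted by the Athena `StartQueryExecution` API. The catalog, database, table, column, and workgroup names are hypothetical, and the three-part `"catalog"."database"."table"` naming is an assumption about how your federated catalogs are referenced in Athena.

```python
def qualified(catalog, database, table):
    """Return a double-quoted catalog.database.table identifier."""
    for part in (catalog, database, table):
        if '"' in part:
            raise ValueError(f"unsupported character in identifier: {part!r}")
    return ".".join(f'"{part}"' for part in (catalog, database, table))


def cross_source_join_sql(left, right, key):
    """Build a join between two tables that live in different catalogs.

    left and right are (catalog, database, table) tuples; key is the
    column name shared by both tables.
    """
    if '"' in key:
        raise ValueError(f"unsupported character in column name: {key!r}")
    return (
        f"SELECT l.*, r.* FROM {qualified(*left)} AS l "
        f"JOIN {qualified(*right)} AS r "
        f'ON l."{key}" = r."{key}"'
    )


def athena_request(sql, workgroup):
    """Keyword arguments intended for an Athena StartQueryExecution call."""
    return {"QueryString": sql, "WorkGroup": workgroup}


# Hypothetical names, used only for illustration.
sql = cross_source_join_sql(
    ("redshift_catalog", "sales", "orders"),
    ("dynamodb_catalog", "default", "customers"),
    "customer_id",
)
request = athena_request(sql, "example-workgroup")
```

With boto3 installed and credentials configured, `request` could be passed as `athena_client.start_query_execution(**request)`. Lake Formation permissions would still determine which rows and columns the query can return.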

Workflow

A data lake administrator or a user with the required permissions completes the following steps to connect the Amazon Glue Data Catalog to an external data source.

  1. Creates an Amazon Glue connection to the data source. The IAM role used to register the connection must have access to the Lambda function and the Amazon S3 spill bucket location.

  2. Registers the connection with Lake Formation.

  3. Creates a federated catalog in the Data Catalog using an Amazon Glue connection to connect to the available data sources. The databases, tables, and views are automatically cataloged in the Data Catalog and registered with Lake Formation.

  4. Grants access to specific catalogs, databases, and tables to data analysts using Lake Formation permissions. Fine-grained access control policies can be defined across data lakes, warehouses, and OLTP sources using Lake Formation, enabling row-level and column-level security filters.

    Data analysts can then access all data through the Data Catalog using SQL queries in Athena, without needing separate connections or data source credentials. Analysts can run federated SQL queries that scan data from multiple sources, joining data in-place without complex data pipelines.
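The four workflow steps can be sketched as a sequence of API request payloads. The following example only builds the keyword arguments and makes no calls to Amazon Web Services. It assumes the Amazon Glue `CreateConnection` and `CreateCatalog` operations and the Lake Formation `RegisterResource` and `GrantPermissions` operations. All names, ARNs, the `DYNAMODB` connection type, the key names inside `AthenaProperties`, the `Identifier` value, and the `account:catalog` form of `CatalogId` are assumptions made for illustration, so check them against the API reference before use.

```python
# Hypothetical placeholders; none of these values come from the documentation.
PARTITION = "aws"  # use "aws-cn" in the China Regions
REGION = "us-east-1"
ACCOUNT_ID = "111122223333"
CONNECTION_NAME = "example-dynamodb-connection"
CATALOG_NAME = "example_federated_catalog"


def connection_request(lambda_arn, spill_bucket):
    """Step 1: arguments intended for the Glue CreateConnection operation."""
    return {
        "ConnectionInput": {
            "Name": CONNECTION_NAME,
            "ConnectionType": "DYNAMODB",  # assumed source type
            "ConnectionProperties": {},
            # Assumed key names for the Lambda function and spill bucket.
            "AthenaProperties": {
                "lambda_function_arn": lambda_arn,
                "spill_bucket": spill_bucket,
            },
        }
    }


def register_request(role_arn):
    """Step 2: arguments intended for Lake Formation RegisterResource."""
    connection_arn = (
        f"arn:{PARTITION}:glue:{REGION}:{ACCOUNT_ID}:connection/{CONNECTION_NAME}"
    )
    return {"ResourceArn": connection_arn, "RoleArn": role_arn, "WithFederation": True}


def catalog_request(source_identifier):
    """Step 3: arguments intended for the Glue CreateCatalog operation."""
    return {
        "Name": CATALOG_NAME,
        "CatalogInput": {
            "FederatedCatalog": {
                "Identifier": source_identifier,  # assumed: external database name
                "ConnectionName": CONNECTION_NAME,
            }
        },
    }


def grant_request(principal_arn, database, table, columns):
    """Step 4: arguments intended for Lake Formation GrantPermissions."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {
            "TableWithColumns": {
                "CatalogId": f"{ACCOUNT_ID}:{CATALOG_NAME}",  # assumed format
                "DatabaseName": database,
                "Name": table,
                "ColumnNames": list(columns),
            }
        },
        "Permissions": ["SELECT"],
    }


def workflow_requests():
    """Return (service, operation, arguments) tuples in workflow order."""
    role = f"arn:{PARTITION}:iam::{ACCOUNT_ID}:role/example-federation-role"
    analyst = f"arn:{PARTITION}:iam::{ACCOUNT_ID}:role/example-analyst-role"
    lambda_arn = (
        f"arn:{PARTITION}:lambda:{REGION}:{ACCOUNT_ID}:function:example-connector"
    )
    return [
        ("glue", "create_connection", connection_request(lambda_arn, "example-spill-bucket")),
        ("lakeformation", "register_resource", register_request(role)),
        ("glue", "create_catalog", catalog_request("example_source_database")),
        ("lakeformation", "grant_permissions",
         grant_request(analyst, "default", "customers", ["customer_id", "region"])),
    ]
```

With boto3 installed, each tuple could be dispatched as `getattr(boto3.client(service), operation)(**arguments)`. Keeping the payloads as plain dictionaries makes it possible to review them before anything is created.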