Services or capabilities described in Amazon Web Services documentation might vary by Region. To see the differences applicable to the China Regions, see Getting Started with Amazon Web Services in China (PDF).

Amazon Glue components

Amazon Glue provides a console and API operations to set up and manage your extract, transform, and load (ETL) workload. You can use API operations through several language-specific SDKs and the Amazon Command Line Interface (Amazon CLI). For information about using the Amazon CLI, see Amazon CLI Command Reference.

Amazon Glue uses the Amazon Glue Data Catalog to store metadata about data sources, transforms, and targets. The Data Catalog is a drop-in replacement for the Apache Hive Metastore. The Amazon Glue Jobs system provides a managed infrastructure for defining, scheduling, and running ETL operations on your data. For more information about the Amazon Glue API, see Amazon Glue API.

Amazon Glue console

You use the Amazon Glue console to define and orchestrate your ETL workflow. The console calls several API operations in the Amazon Glue Data Catalog and Amazon Glue Jobs system to perform the following tasks:

  • Define Amazon Glue objects such as jobs, tables, crawlers, and connections.

  • Schedule when crawlers run.

  • Define events or schedules for job triggers.

  • Search and filter lists of Amazon Glue objects.

  • Edit transformation scripts.

Amazon Glue Data Catalog

The Amazon Glue Data Catalog is your persistent technical metadata store in the Amazon Cloud.

Each Amazon account has one Amazon Glue Data Catalog per Amazon Region. Each Data Catalog is a highly scalable collection of tables organized into databases. A table is a metadata representation of a collection of structured or semi-structured data stored in sources such as Amazon RDS, the Apache Hadoop Distributed File System, Amazon OpenSearch Service, and others. The Amazon Glue Data Catalog provides a uniform repository where disparate systems can store and find metadata to keep track of data in data silos. You can then use the metadata to query and transform that data in a consistent manner across a wide variety of applications.
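
As a sketch of what table metadata in the Data Catalog looks like, the following builds a hypothetical table definition for a CSV dataset in Amazon S3. All names, paths, and the boto3 call in the trailing comment are assumptions for illustration, not values from this page:

```python
# Hypothetical table metadata: a Data Catalog table stores the schema
# and location of the data, not the data itself.
table_input = {
    "Name": "orders",
    "StorageDescriptor": {
        "Columns": [
            {"Name": "order_id", "Type": "bigint"},
            {"Name": "customer_id", "Type": "bigint"},
            {"Name": "order_date", "Type": "date"},
            {"Name": "total", "Type": "decimal(10,2)"},
        ],
        "Location": "s3://example-bucket/orders/",
        "SerdeInfo": {
            # Hive SerDe commonly used for delimited text.
            "SerializationLibrary": "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe",
            "Parameters": {"field.delim": ","},
        },
    },
    "PartitionKeys": [{"Name": "region", "Type": "string"}],
}

# To register it, you would call (requires credentials):
# boto3.client("glue").create_table(DatabaseName="sales", TableInput=table_input)
column_names = [c["Name"] for c in table_input["StorageDescriptor"]["Columns"]]
```

Because the table is only metadata, any engine that understands the Data Catalog can locate and read the same underlying data.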

You use the Data Catalog together with Amazon Identity and Access Management policies and Lake Formation to control access to the tables and databases. By doing this, you can allow different groups in your enterprise to safely publish data to the wider organization while protecting sensitive information in a highly granular fashion.

The Data Catalog, along with CloudTrail and Lake Formation, also provides you with comprehensive audit and governance capabilities, with schema change tracking and data access controls. This helps ensure that data is not inappropriately modified or inadvertently shared.

For information about securing and auditing the Amazon Glue Data Catalog, see Security in Amazon Glue.

Other Amazon services and open-source projects, such as Amazon Athena, Amazon Redshift Spectrum, and Amazon EMR, also use the Amazon Glue Data Catalog.

Amazon Glue crawlers and classifiers

Amazon Glue also lets you set up crawlers that can scan data in all kinds of repositories, classify it, extract schema information from it, and store the metadata automatically in the Amazon Glue Data Catalog. The Amazon Glue Data Catalog can then be used to guide ETL operations.
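
As an illustrative sketch, a crawler definition might look like the following. Every name, ARN, and path here is a hypothetical example, and the boto3 calls in the comments show where such a definition would be used:

```python
# Hypothetical crawler that scans an S3 prefix nightly and writes the
# discovered table metadata into the "sales" Data Catalog database.
crawler_params = {
    "Name": "orders-crawler",
    "Role": "arn:aws-cn:iam::123456789012:role/GlueCrawlerRole",  # example ARN
    "DatabaseName": "sales",
    "Targets": {"S3Targets": [{"Path": "s3://example-bucket/orders/"}]},
    # Run at 02:00 UTC every day (cron syntax used by Glue schedules).
    "Schedule": "cron(0 2 * * ? *)",
}

# With credentials configured, you would create and run it like this:
# glue = boto3.client("glue")
# glue.create_crawler(**crawler_params)
# glue.start_crawler(Name=crawler_params["Name"])
```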

For information about how to set up crawlers and classifiers, see Defining crawlers in Amazon Glue. For information about how to program crawlers and classifiers using the Amazon Glue API, see Crawlers and classifiers API.

Amazon Glue ETL operations

Using the metadata in the Data Catalog, Amazon Glue can automatically generate Scala or PySpark (the Python API for Apache Spark) scripts with Amazon Glue extensions that you can use and modify to perform various ETL operations. For example, you can extract, clean, and transform raw data, and then store the result in a different repository, where it can be queried and analyzed. Such a script might convert a CSV file into a relational form and save it in Amazon Redshift.
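
The scripts that Amazon Glue generates are Spark code, but the core idea can be sketched in plain Python: parse raw CSV and emit typed rows suitable for loading into a relational table. The field names and sample data below are invented for illustration:

```python
import csv
import io
from datetime import date

# Raw CSV as it might arrive in S3 (hypothetical sample data).
raw = """order_id,order_date,total
1001,2023-05-01,49.90
1002,2023-05-02,15.00
"""

def to_relational(csv_text):
    """Clean and convert CSV rows into typed tuples for a SQL load."""
    rows = []
    for rec in csv.DictReader(io.StringIO(csv_text)):
        rows.append((
            int(rec["order_id"]),
            date.fromisoformat(rec["order_date"]),
            round(float(rec["total"]), 2),
        ))
    return rows

rows = to_relational(raw)
print(rows[0])  # (1001, datetime.date(2023, 5, 1), 49.9)
```

In a generated Glue script, the same transformation would be expressed with Spark DataFrames and Amazon Glue extensions so it scales across a cluster.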

For more information about how to use Amazon Glue ETL capabilities, see Programming Spark scripts.

Streaming ETL in Amazon Glue

Amazon Glue enables you to perform ETL operations on streaming data using continuously running jobs. Amazon Glue streaming ETL is built on the Apache Spark Structured Streaming engine, and can ingest streams from Amazon Kinesis Data Streams, Apache Kafka, and Amazon Managed Streaming for Apache Kafka (Amazon MSK). Streaming ETL can clean and transform streaming data and load it into Amazon S3 or JDBC data stores. Use streaming ETL in Amazon Glue to process event data like IoT streams, clickstreams, and network logs.

If you know the schema of the streaming data source, you can specify it in a Data Catalog table. If not, you can enable schema detection in the streaming ETL job. The job then automatically determines the schema from the incoming data.
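
Conceptually, schema detection samples incoming records and derives field names and types from them. The following is a minimal stdlib sketch of that idea, not the actual Glue algorithm, and the event payloads are invented:

```python
import json

# Hypothetical sample of streaming events whose schema is unknown.
events = [
    '{"device_id": "a1", "temp": 21.5, "ok": true}',
    '{"device_id": "b2", "temp": 19.0, "ok": false, "battery": 87}',
]

def infer_schema(raw_events):
    """Map each field seen in any event to a simple type name."""
    type_names = {bool: "boolean", int: "int", float: "double", str: "string"}
    schema = {}
    for raw in raw_events:
        for field, value in json.loads(raw).items():
            # Keep the first type observed for each field.
            schema.setdefault(field, type_names[type(value)])
    return schema

detected = infer_schema(events)
print(detected)
# {'device_id': 'string', 'temp': 'double', 'ok': 'boolean', 'battery': 'int'}
```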

The streaming ETL job can use both Amazon Glue built-in transforms and transforms that are native to Apache Spark Structured Streaming. For more information, see Operations on streaming DataFrames/Datasets on the Apache Spark website.

For more information, see Streaming ETL jobs in Amazon Glue.

The Amazon Glue jobs system

The Amazon Glue Jobs system provides managed infrastructure to orchestrate your ETL workflow. You can create jobs in Amazon Glue that automate the scripts you use to extract, transform, and transfer data to different locations. Jobs can be scheduled and chained, or they can be triggered by events such as the arrival of new data.
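
Chaining, for instance, can be expressed with a conditional trigger that starts one job when another succeeds. A hedged sketch, with invented job and trigger names and the boto3 call shown in a comment:

```python
# Hypothetical trigger: run "load-warehouse" only after the
# "clean-raw-data" job finishes successfully.
trigger_params = {
    "Name": "after-clean-succeeds",
    "Type": "CONDITIONAL",
    "Predicate": {
        "Conditions": [{
            "LogicalOperator": "EQUALS",
            "JobName": "clean-raw-data",
            "State": "SUCCEEDED",
        }]
    },
    "Actions": [{"JobName": "load-warehouse"}],
    "StartOnCreation": True,
}

# With credentials configured:
# boto3.client("glue").create_trigger(**trigger_params)
```

A scheduled trigger would instead use `"Type": "SCHEDULED"` with a cron expression, and an on-demand trigger uses `"Type": "ON_DEMAND"`.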

For more information about using the Amazon Glue Jobs system, see Monitoring Amazon Glue. For information about programming using the Amazon Glue Jobs system API, see Jobs API.