
Amazon Glue: How it works

Amazon Glue uses other Amazon services to orchestrate your ETL (extract, transform, and load) jobs to build data warehouses and data lakes and generate output streams. Amazon Glue calls API operations to transform your data, create runtime logs, store your job logic, and create notifications to help you monitor your job runs. The Amazon Glue console connects these services into a managed application, so you can focus on creating and monitoring your ETL work. The console performs administrative and job development operations on your behalf. You supply credentials and other properties to Amazon Glue to access your data sources and write to your data targets.

Amazon Glue provisions and manages the resources that are required to run your workload, so you don't need to create the infrastructure for an ETL tool. To reduce startup time, Amazon Glue runs your workload on an instance from its warm pool of instances whenever resources are required.

With Amazon Glue, you create jobs using table definitions in your Data Catalog. Jobs consist of scripts that contain the programming logic that performs the transformation. You use triggers to initiate jobs either on a schedule or as a result of a specified event. You determine where your target data resides and which source data populates your target. With your input, Amazon Glue generates the code that's required to transform your data from source to target. You can also provide scripts in the Amazon Glue console or API to process your data.
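As a rough illustration of how a job and a scheduled trigger relate, the following sketch builds the request parameters that could be passed to the `create_job` and `create_trigger` operations of the boto3 Glue client. The job name, role ARN, script location, and schedule are hypothetical placeholders, and the sketch only builds dictionaries; it does not call any service.

```python
def build_job_request(name, role_arn, script_location):
    """Return keyword arguments suitable for glue_client.create_job(**request)."""
    return {
        "Name": name,
        "Role": role_arn,  # IAM role the job assumes to reach sources and targets
        "Command": {
            "Name": "glueetl",  # Spark ETL job type
            "ScriptLocation": script_location,  # where the job script is stored
            "PythonVersion": "3",
        },
    }


def build_schedule_trigger_request(name, job_name, cron_expression):
    """Return keyword arguments suitable for glue_client.create_trigger(**request)."""
    return {
        "Name": name,
        "Type": "SCHEDULED",
        "Schedule": f"cron({cron_expression})",
        "Actions": [{"JobName": job_name}],  # jobs started when the trigger fires
        "StartOnCreation": True,
    }


# Hypothetical values, for illustration only.
job_request = build_job_request(
    name="example-orders-etl",
    role_arn="arn:aws:iam::123456789012:role/ExampleGlueRole",
    script_location="s3://example-bucket/scripts/orders_etl.py",
)
trigger_request = build_schedule_trigger_request(
    name="example-nightly-trigger",
    job_name=job_request["Name"],
    cron_expression="0 2 * * ? *",  # every day at 02:00 UTC
)
```

With boto3 installed and credentials configured, these dictionaries could be passed as `boto3.client("glue").create_job(**job_request)` and `boto3.client("glue").create_trigger(**trigger_request)`.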

Data sources and destinations

Amazon Glue can read data from and write data to multiple systems and databases, including:

  • Amazon S3

  • Amazon DynamoDB

  • Amazon Redshift

  • Amazon Relational Database Service (Amazon RDS)

  • Third-party JDBC-accessible databases

  • MongoDB and Amazon DocumentDB (with MongoDB compatibility)

  • Other marketplace connectors and Apache Spark plugins
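Inside a job script, sources like those listed above are typically selected with a connection type and a set of connection options. The sketch below builds the keyword arguments one might pass to `glueContext.create_dynamic_frame.from_options(...)` for two of the listed systems. The bucket path and table name are hypothetical, and the helper functions are illustrative, not part of the Amazon Glue libraries.

```python
def s3_source_options(path, data_format="json"):
    """Keyword arguments for reading files under an Amazon S3 path."""
    return {
        "connection_type": "s3",
        "connection_options": {"paths": [path]},
        "format": data_format,
    }


def dynamodb_source_options(table_name):
    """Keyword arguments for reading an Amazon DynamoDB table."""
    return {
        "connection_type": "dynamodb",
        "connection_options": {"dynamodb.input.tableName": table_name},
    }


# Hypothetical locations, for illustration only.
s3_kwargs = s3_source_options("s3://example-bucket/raw/orders/")
ddb_kwargs = dynamodb_source_options("ExampleOrders")

# Inside a Glue job script, where glueContext is a GlueContext instance:
#   orders = glueContext.create_dynamic_frame.from_options(**s3_kwargs)
```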

Data streams

Amazon Glue can stream data from the following systems:

  • Amazon Kinesis Data Streams

  • Apache Kafka

Amazon Glue is available in several Amazon Regions. For more information, see Amazon Regions and Endpoints in the Amazon Web Services General Reference.

Serverless ETL jobs run in isolation

Amazon Glue runs your ETL jobs in an Apache Spark serverless environment. Amazon Glue runs these jobs on virtual resources that it provisions and manages in its own service account.

Amazon Glue is designed to do the following:

  • Segregate customer data.

  • Protect customer data in transit and at rest.

  • Access customer data only as needed in response to customer requests, using temporary, scoped-down credentials or, with the customer's consent, IAM roles in the customer's account.

During provisioning of an ETL job, you provide input data sources and output data targets in your virtual private cloud (VPC). In addition, you provide the IAM role, VPC ID, subnet ID, and security group that are needed to access data sources and targets. For each tuple (customer account ID, IAM role, subnet ID, and security group), Amazon Glue creates a new Spark environment that is isolated at the network and management level from all other Spark environments inside the Amazon Glue service account.
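One way to picture the per-tuple isolation is as a lookup keyed by the four values named above. The following is a conceptual model only, not a description of how Amazon Glue is implemented: the class names and environment labels are hypothetical, and the reuse of an environment for an identical tuple is an assumption made for the sake of the illustration.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class EnvironmentKey:
    """The tuple that identifies one isolated Spark environment."""
    account_id: str
    iam_role: str
    subnet_id: str
    security_group: str


class EnvironmentRegistry:
    """Toy registry: one environment label per distinct key (conceptual only)."""

    def __init__(self) -> None:
        self._environments: Dict[EnvironmentKey, str] = {}

    def environment_for(self, key: EnvironmentKey) -> str:
        # A key that has not been seen before gets its own environment.
        if key not in self._environments:
            self._environments[key] = f"spark-env-{len(self._environments) + 1}"
        return self._environments[key]


registry = EnvironmentRegistry()
key_a = EnvironmentKey("123456789012", "ExampleGlueRole", "subnet-aaaa", "sg-1111")
key_b = EnvironmentKey("123456789012", "ExampleGlueRole", "subnet-aaaa", "sg-2222")

env_a = registry.environment_for(key_a)
env_a_again = registry.environment_for(key_a)
env_b = registry.environment_for(key_b)  # differs only in security group
```

In this model, changing any one element of the tuple, such as the security group, yields a different environment.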

Amazon Glue creates elastic network interfaces in your subnet using private IP addresses. Spark jobs use these elastic network interfaces to access your data sources and data targets. Traffic into, out of, and within the Spark environment is governed by your VPC and networking policies, with one exception: calls made to Amazon Glue libraries can proxy traffic to Amazon Glue API operations through the Amazon Glue VPC. All Amazon Glue API calls are logged, so data owners can audit API access by enabling Amazon CloudTrail, which delivers audit logs to your account.
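As a hedged sketch of what such an audit might look like, the code below builds parameters for the CloudTrail `lookup_events` operation (available on the boto3 CloudTrail client) and summarizes a response by event name. The event source value `glue.amazonaws.com` is an assumption and may differ by Region, and the sample response is fabricated purely to exercise the helper; it is not real log data.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone


def build_lookup_request(hours_back=24, now=None):
    """Keyword arguments for cloudtrail_client.lookup_events(**request)."""
    end = now or datetime.now(timezone.utc)
    return {
        "LookupAttributes": [
            # Assumed event source for Glue API calls; verify for your Region.
            {"AttributeKey": "EventSource", "AttributeValue": "glue.amazonaws.com"}
        ],
        "StartTime": end - timedelta(hours=hours_back),
        "EndTime": end,
    }


def count_events_by_name(response):
    """Count how often each API operation appears in a lookup_events response."""
    return Counter(event["EventName"] for event in response.get("Events", []))


# Made-up response shaped like lookup_events output, for illustration only.
sample_response = {
    "Events": [
        {"EventName": "StartJobRun", "Username": "example-user"},
        {"EventName": "GetTable", "Username": "example-user"},
        {"EventName": "StartJobRun", "Username": "another-user"},
    ]
}

fixed_now = datetime(2024, 1, 2, tzinfo=timezone.utc)
lookup_request = build_lookup_request(hours_back=24, now=fixed_now)
event_counts = count_events_by_name(sample_response)
```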

The Spark environments that Amazon Glue manages to run your ETL jobs are protected with the same security practices followed by other Amazon services. For an overview of the practices and shared security responsibilities, see the Introduction to Amazon Security Processes whitepaper.