Services or capabilities described in Amazon Web Services documentation might vary by Region. To see the differences applicable to the China Regions, see Getting Started with Amazon Web Services in China.

What is Amazon Glue?

Amazon Glue is a serverless data integration service that makes it easy for analytics users to discover, prepare, move, and integrate data from multiple sources. You can use it for analytics, machine learning, and application development. It also includes additional productivity and data ops tooling for authoring, running jobs, and implementing business workflows.

With Amazon Glue, you can discover and connect to more than 70 diverse data sources and manage your data in a centralized data catalog. You can visually create, run, and monitor extract, transform, and load (ETL) pipelines to load data into your data lakes. Also, you can immediately search and query cataloged data using Amazon Athena, Amazon EMR, and Amazon Redshift Spectrum.

Amazon Glue consolidates major data integration capabilities into a single service. These include data discovery, modern ETL, cleansing, transforming, and centralized cataloging. It’s also serverless, which means there’s no infrastructure to manage. With flexible support for ETL, ELT, and streaming workloads in one service, Amazon Glue serves many types of users.

Also, Amazon Glue makes it easy to integrate data across your architecture. It integrates with Amazon analytics services and Amazon S3 data lakes. Amazon Glue has integration interfaces and job-authoring tools that are easy to use for all users, from developers to business users, with tailored solutions for varied technical skill sets.

With the ability to scale on demand, Amazon Glue helps you focus on high-value activities that maximize the value of your data. It scales for any data size, and supports all data types and schema variances. To increase agility and optimize costs, Amazon Glue provides built-in high availability and pay-as-you-go billing.

For pricing information, see Amazon Glue pricing.

Amazon Glue Studio

Amazon Glue Studio is a graphical interface that makes it easy to create, run, and monitor data integration jobs in Amazon Glue. You can visually compose data transformation workflows and seamlessly run them on the Apache Spark–based serverless ETL engine in Amazon Glue. For more information, see What is Amazon Glue Studio.

With Amazon Glue Studio, you can create and manage jobs that gather, transform, and clean data. You can also use Amazon Glue Studio to troubleshoot and edit job scripts.

Amazon Glue features

Amazon Glue features fall into three major categories:

  • Discover and organize data

  • Transform, prepare, and clean data for analysis

  • Build and monitor data pipelines

Discover and organize data

  • Unify and search across multiple data stores – Store, index, and search across multiple data sources and sinks by cataloging all your data in Amazon Web Services.

  • Automatically discover data – Use Amazon Glue crawlers to automatically infer schema information and integrate it into your Amazon Glue Data Catalog.

  • Manage schemas and permissions – Validate and control access to your databases and tables.

  • Connect to a wide variety of data sources – Tap into multiple data sources, both on premises and on Amazon Web Services, using Amazon Glue connections to build your data lake.
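
As a minimal sketch of the crawler bullet above, the following Python builds a request for the CreateCrawler operation as exposed by boto3 (`create_crawler`), then starts the crawler with `start_crawler`. The crawler name, IAM role name, database name, and S3 path are hypothetical placeholders. The client is passed in as a parameter so the function can be read and exercised without network access; in practice it would typically come from `boto3.client("glue")`.

```python
def build_crawler_request(name, role, database, s3_path):
    """Return keyword arguments for the Glue CreateCrawler operation."""
    return {
        "Name": name,
        "Role": role,  # IAM role name or ARN that the crawler assumes
        "DatabaseName": database,  # Data Catalog database that receives the tables
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }


def create_and_start_crawler(glue_client, request):
    """Create the crawler, then start it so it can populate the Data Catalog."""
    glue_client.create_crawler(**request)
    glue_client.start_crawler(Name=request["Name"])


# Hypothetical values, for illustration only.
request = build_crawler_request(
    name="sales-crawler",
    role="ExampleGlueCrawlerRole",
    database="sales_db",
    s3_path="s3://example-bucket/sales/",
)
```

Keeping the request construction separate from the API call makes the parameters easy to review before anything is created in your account.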

Transform, prepare, and clean data for analysis

  • Visually transform data with a drag-and-drop interface – Define your ETL process in the drag-and-drop job editor and automatically generate the code to extract, transform, and load your data.

  • Build complex ETL pipelines with simple job scheduling – Invoke Amazon Glue jobs on a schedule, on demand, or based on an event.

  • Clean and transform streaming data in transit – Enable continuous data consumption, and clean and transform it in transit. This makes it available for analysis in seconds in your target data store.

  • Deduplicate and cleanse data with built-in machine learning – Clean and prepare your data for analysis without becoming a machine learning expert by using the FindMatches feature. This feature deduplicates and finds records that are imperfect matches for each other.

  • Built-in job notebooks – Amazon Glue Studio job notebooks are serverless notebooks that need minimal setup, so you can get started quickly.

  • Edit, debug, and test ETL code – With Amazon Glue interactive sessions, you can explore, experiment on, and process data interactively using the IDE or notebook of your choice.

  • Define, detect, and remediate sensitive data – Amazon Glue sensitive data detection lets you define, identify, and process sensitive data in your data pipeline and in your data lake.
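
The job scheduling bullet above can be sketched as a request for the CreateTrigger operation (`create_trigger` in boto3). This is an illustrative sketch, not a complete scheduling setup: the trigger and job names are hypothetical, and only the scheduled and on-demand trigger types are shown.

```python
def build_trigger_request(name, job_name, schedule=None):
    """Return keyword arguments for the Glue CreateTrigger operation.

    With a cron expression the trigger is SCHEDULED; without one it is
    ON_DEMAND and must be started explicitly.
    """
    request = {
        "Name": name,
        "Type": "SCHEDULED" if schedule else "ON_DEMAND",
        "Actions": [{"JobName": job_name}],
    }
    if schedule:
        request["Schedule"] = schedule
        # Activate the schedule as soon as the trigger is created.
        request["StartOnCreation"] = True
    return request


# Hypothetical names; the cron expression means 02:00 UTC every day.
nightly = build_trigger_request(
    "nightly-etl", "example-etl-job", schedule="cron(0 2 * * ? *)"
)
manual = build_trigger_request("manual-etl", "example-etl-job")
```

Either dictionary could then be passed to a boto3 Glue client, for example `glue_client.create_trigger(**nightly)`.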

Build and monitor data pipelines

  • Automatically scale based on workload – Dynamically scale resources up and down based on workload. This assigns workers to jobs only when needed.

  • Automate jobs with event-based triggers – Start crawlers or Amazon Glue jobs with event-based triggers, and design a chain of dependent jobs and crawlers.

  • Run and monitor jobs – Run your Amazon Glue jobs, and then monitor them with automated monitoring tools, the Apache Spark UI, Amazon Glue job run insights, and Amazon CloudTrail.

  • Define workflows for ETL and integration activities – Orchestrate multiple crawlers, jobs, and triggers as a single workflow.
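
As a rough sketch of running and monitoring a job programmatically, the function below starts a job run with `start_job_run` and polls `get_job_run` until the run reaches a terminal state. The job name in the usage note and the polling interval are assumptions, and the client is passed in so that the logic does not depend on a live connection.

```python
import time

# Job run states after which the run no longer changes.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT", "ERROR", "EXPIRED"}


def run_job_and_wait(glue_client, job_name, poll_seconds=30, sleep=time.sleep):
    """Start a Glue job run and block until it finishes; return the final state."""
    run_id = glue_client.start_job_run(JobName=job_name)["JobRunId"]
    while True:
        run = glue_client.get_job_run(JobName=job_name, RunId=run_id)["JobRun"]
        state = run["JobRunState"]
        if state in TERMINAL_STATES:
            return state
        # Still starting or running: wait before asking again.
        sleep(poll_seconds)
```

A call might look like `run_job_and_wait(boto3.client("glue"), "example-etl-job")`, where `example-etl-job` is a hypothetical job name. For production pipelines, event-based triggers or workflows are usually a better fit than a polling loop.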

Getting started with Amazon Glue

We recommend that you start with the following sections:

Accessing Amazon Glue

You can create, view, and manage your Amazon Glue jobs using the following interfaces:

  • Amazon Glue console – Provides a web interface for you to create, view, and manage your Amazon Glue jobs. To access the console, see Amazon Glue console.

  • Amazon Glue Studio – Provides a graphical interface for you to create and edit your Amazon Glue jobs visually. For more information, see What is Amazon Glue Studio.

  • Amazon Glue section of the Amazon CLI Reference – Provides Amazon CLI commands that you can use with Amazon Glue. For more information, see Amazon CLI Reference for Amazon Glue.

  • Amazon Glue API – Provides a complete API reference for developers. For more information, see Amazon Glue API.
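
As one illustration of programmatic access, the sketch below lists the table names in a Data Catalog database by following the `NextToken` pagination of the GetTables operation (`get_tables` in boto3). The database name in the usage note is hypothetical, and the client is again passed in rather than created inside the function.

```python
def list_table_names(glue_client, database):
    """Return the names of all tables in one Data Catalog database."""
    names = []
    kwargs = {"DatabaseName": database}
    while True:
        page = glue_client.get_tables(**kwargs)
        names.extend(table["Name"] for table in page["TableList"])
        token = page.get("NextToken")
        if not token:
            return names
        # More results remain: request the next page.
        kwargs["NextToken"] = token
```

For example, `list_table_names(boto3.client("glue"), "sales_db")` would return the tables that a crawler registered in a database named `sales_db`.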

Users of Amazon Glue also use:

  • Amazon Lake Formation – An authorization layer that provides fine-grained access control to resources in the Amazon Glue Data Catalog.

  • Amazon Glue DataBrew – A visual data preparation tool that you can use to clean and normalize data without writing any code.