Amazon Glue Data Quality

Amazon Glue Data Quality lets you measure and monitor the quality of your data so that you can make sound business decisions. Built on top of the open-source Deequ framework, Amazon Glue Data Quality provides a managed, serverless experience. It works with Data Quality Definition Language (DQDL), a domain-specific language that you use to define data quality rules. To learn more about DQDL and supported rule types, see Data Quality Definition Language (DQDL) reference.
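
As a rough illustration of what a DQDL ruleset looks like, the following Python sketch assembles one as plain text. The helper name `build_ruleset` and the column names are hypothetical and chosen only for this example. The rule types shown (IsComplete, IsUnique, ColumnValues, RowCount) are DQDL rule types, but consult the DQDL reference for the authoritative syntax.

```python
def build_ruleset(rules):
    """Join individual DQDL rule expressions into a `Rules = [...]` block."""
    body = ",\n".join(f"    {rule}" for rule in rules)
    return f"Rules = [\n{body}\n]"


# Hypothetical rules for an illustrative "orders" table.
ruleset = build_ruleset([
    'IsComplete "order_id"',        # no NULL values in the column
    'IsUnique "order_id"',          # every value appears once
    'ColumnValues "quantity" > 0',  # each value satisfies the expression
    "RowCount > 100",               # the dataset has more than 100 rows
])
print(ruleset)
```

Keeping the ruleset as text means the same string can be saved against a Data Catalog table or passed to an ETL job.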

For additional product details and pricing, see the service page for Amazon Glue Data Quality.

Benefits and key features

Benefits and key features of Amazon Glue Data Quality include:

  • Serverless – There is no installation, patching, or maintenance.

  • Get started quickly – Amazon Glue Data Quality quickly analyzes your data and creates data quality rules for you. You can get started with two clicks: “Create Data Quality Rules → Recommend rules”.

  • Detect data quality issues – Use machine learning (ML) to detect anomalies and hard-to-detect data quality issues.

  • Customize your rules – With 25+ out-of-the-box data quality rules to start from, you can create rules that suit your specific needs.

  • Evaluate quality and make confident business decisions – After you evaluate the rules, you get a data quality score that provides an overview of the health of your data. Use the data quality score to make confident business decisions.

  • Zero in on bad data – Amazon Glue Data Quality helps you identify the exact records that caused your quality scores to go down, so that you can quarantine and fix them.

  • Pay as you go – You don't need an annual license to use Amazon Glue Data Quality.

  • No lock-in – Amazon Glue Data Quality is built on open-source Deequ, so the rules you author stay in an open language.

  • Data quality checks – You can enforce data quality checks on the Data Catalog and in Amazon Glue ETL pipelines, which lets you manage data quality at rest and in transit.

How it works

There are two entry points for Amazon Glue Data Quality: the Amazon Glue Data Catalog and Amazon Glue ETL jobs. This section provides an overview of the use cases and Amazon Glue features that each entry point supports.

Data quality for the Amazon Glue Data Catalog

Amazon Glue Data Quality evaluates objects that are stored in the Amazon Glue Data Catalog. It offers non-coders, such as data stewards and business analysts, an easy way to set up data quality rules.

You might choose this option for the following use cases:

  • You want to perform data quality tasks on data sets that you've already cataloged in the Amazon Glue Data Catalog.

  • You work on data governance and need to identify or evaluate data quality issues in your data lake on an ongoing basis.

You can manage data quality for the Data Catalog using the following interfaces:

  • The Amazon Glue management console

  • Amazon Glue APIs

To get started with Amazon Glue Data Quality for the Amazon Glue Data Catalog, see Getting started with Amazon Glue Data Quality for the Data Catalog.
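
For the API route, the following Python sketch shows one way to save a ruleset against a cataloged table and start an evaluation run. It assumes a boto3 Glue client and uses the `create_data_quality_ruleset` and `start_data_quality_ruleset_evaluation_run` operations. The helper names, database, table, and role values are hypothetical, and the request shapes should be checked against the current API reference.

```python
def ruleset_request(name, ruleset, database, table):
    """Build keyword arguments for glue.create_data_quality_ruleset."""
    return {
        "Name": name,
        "Ruleset": ruleset,  # DQDL text
        "TargetTable": {"DatabaseName": database, "TableName": table},
    }


def evaluation_run_request(ruleset_name, database, table, role_arn):
    """Build keyword arguments for glue.start_data_quality_ruleset_evaluation_run."""
    return {
        "DataSource": {
            "GlueTable": {"DatabaseName": database, "TableName": table}
        },
        "Role": role_arn,  # IAM role that the evaluation run assumes
        "RulesetNames": [ruleset_name],
    }


def create_and_evaluate(glue, name, ruleset, database, table, role_arn):
    """Save a ruleset for a cataloged table, start a run, and return its RunId."""
    glue.create_data_quality_ruleset(
        **ruleset_request(name, ruleset, database, table)
    )
    run = glue.start_data_quality_ruleset_evaluation_run(
        **evaluation_run_request(name, database, table, role_arn)
    )
    return run["RunId"]


# Example usage (requires boto3 and AWS credentials; all values are placeholders):
#   import boto3
#   glue = boto3.client("glue")
#   run_id = create_and_evaluate(
#       glue, "orders_rules", 'Rules = [ IsComplete "order_id" ]',
#       "sales_db", "orders", "arn:aws:iam::111122223333:role/GlueDQRole")
```

Separating the request builders from the client calls keeps the sketch easy to inspect without touching an AWS account.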

Data quality for Amazon Glue ETL jobs

Amazon Glue Data Quality for Amazon Glue ETL jobs lets you perform proactive data quality tasks. Proactive tasks help you identify and filter out bad data before you load a data set into your data lake.

You might choose data quality for ETL jobs for the following use cases:

  • You want to incorporate data quality tasks into your ETL jobs

  • You want to write code that defines data quality tasks in ETL scripts

  • You want to manage the quality of data that flows in your visual data pipelines

You can manage data quality for ETL jobs using the following interfaces:

  • Amazon Glue Studio, Amazon Glue Studio notebooks, and Amazon Glue interactive sessions

  • Amazon Glue libraries for ETL scripting

  • Amazon Glue APIs

To get started with data quality for ETL jobs, see Tutorial: Getting started with Data Quality in the Amazon Glue Studio User Guide.
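
For ETL scripting, the following Python sketch shows how a data quality check might be wired into a job script. It assumes the `EvaluateDataQuality` transform from the `awsgluedq.transforms` module, which is available only inside the Amazon Glue job runtime. The helper names and the default context string are hypothetical, and the option keys follow the names that Amazon Glue Studio generates, so verify them against the ETL library reference before relying on them.

```python
def publishing_options(context, results_s3_prefix=None):
    """Build the publishing options passed to the data quality transform."""
    options = {
        "dataQualityEvaluationContext": context,     # label for this evaluation
        "enableDataQualityCloudWatchMetrics": True,  # emit metrics
        "enableDataQualityResultsPublishing": True,  # publish results to Glue
    }
    if results_s3_prefix:
        # Assumed key for also writing results to Amazon S3.
        options["resultsS3Prefix"] = results_s3_prefix
    return options


def evaluate_frame(frame, ruleset, context="orders_dq_check"):
    """Evaluate a DQDL ruleset against a DynamicFrame inside a Glue job."""
    # Imported lazily because awsgluedq exists only in the Glue runtime.
    from awsgluedq.transforms import EvaluateDataQuality

    return EvaluateDataQuality.apply(
        frame=frame,
        ruleset=ruleset,
        publishing_options=publishing_options(context),
    )
```

In a job script, `evaluate_frame` would be called on the DynamicFrame produced by an earlier step, before the data is written to the data lake.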

Comparing data quality for the Data Catalog to data quality for ETL jobs

This table provides an overview of features that each entry point for Amazon Glue Data Quality supports.

| Feature | Data quality for the Data Catalog | Data quality for ETL jobs |
| --- | --- | --- |
| Data sources | Amazon S3, Amazon Redshift, JDBC sources compatible with the Data Catalog, and transactional data lake formats such as Apache Iceberg, Apache Hudi, and Delta Lake. If tables are managed by Amazon Lake Formation, Iceberg, Delta Lake, and Hudi tables are not supported. Amazon Athena views that are cataloged in the Amazon Glue Data Catalog are not supported. | All data sources supported by Amazon Glue, including custom connectors and third-party connectors. |
| Data quality rule recommendations | Supported | Not supported |
| Author and run DQDL rules | Supported | Supported |
| Auto scaling | Not supported | Supported |
| Amazon Glue Flex support | Not supported | Supported |
| Scheduling | Supported when evaluating data quality rules and via Step Functions. | Supported when using Step Functions and workflows. |
| Identifying records that failed data quality checks | Not supported | Supported |
| Integration with Amazon EventBridge | Supported | Supported |
| Integration with Amazon CloudWatch | Supported | Supported |
| Writing data quality results to Amazon S3 | Supported | Supported |
| Incremental data quality | Supported via pushdown predicates | Supported via Amazon Glue bookmarks |
| Amazon CloudFormation support | Supported | Supported |
| ML-based anomaly detection | Not supported | Preview |
| Dynamic rules | Not supported | Supported |

Considerations

Consider the following items before you use Amazon Glue Data Quality:

Terminology

The following list defines terms that are related to Amazon Glue Data Quality.

Data Quality Definition Language (DQDL)

A domain-specific language that you can use to write Amazon Glue Data Quality rules.

To learn more about DQDL, see the Data Quality Definition Language (DQDL) reference guide.

data quality

Describes how well a dataset serves its specific purpose. Amazon Glue Data Quality evaluates rules against a dataset to measure data quality. Each rule checks for particular characteristics like data freshness or integrity. To quantify data quality, you can use a data quality score.

data quality score

The percentage of data quality rules that pass (result in true) when you evaluate a ruleset with Amazon Glue Data Quality.

rule

A DQDL expression that checks your data for a specific characteristic and returns a Boolean value. For more information, see Rule structure.

analyzer

A DQDL expression that gathers data statistics. An analyzer gathers data statistics that can be used by ML algorithms to detect anomalies and hard-to-detect data quality issues over time.

ruleset

An Amazon Glue resource that comprises a set of data quality rules. A ruleset must be associated with a table in the Amazon Glue Data Catalog. When you save a ruleset, Amazon Glue assigns an Amazon Resource Name (ARN) to the ruleset.

observation

An unconfirmed insight generated by Amazon Glue by analyzing data statistics gathered from rules and analyzers over time.
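
The data quality score defined in this list is a simple ratio. As an illustration only, the following Python sketch computes it from a set of rule outcomes. The function name and the sample rule results are hypothetical and are not output from the service.

```python
def data_quality_score(rule_outcomes):
    """Return the percentage of rules that passed (evaluated to True)."""
    if not rule_outcomes:
        raise ValueError("at least one rule outcome is required")
    passed = sum(1 for ok in rule_outcomes.values() if ok)
    return 100.0 * passed / len(rule_outcomes)


# Hypothetical outcomes for a four-rule ruleset: three pass, one fails.
outcomes = {
    'IsComplete "order_id"': True,
    'IsUnique "order_id"': True,
    'ColumnValues "quantity" > 0': False,
    "RowCount > 100": True,
}
score = data_quality_score(outcomes)
print(f"{score:.1f}%")  # prints "75.0%"
```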

Release notes for Amazon Glue Data Quality

This topic describes features introduced in Amazon Glue Data Quality.

General availability: new features

The following new features are available with the general availability of Amazon Glue Data Quality:

  • The ability to identify which records failed data quality checks is now supported in Amazon Glue Studio

  • New data quality rule types, such as validating referential integrity between two datasets, comparing data between two datasets, and data type checks

  • Improved user experience in the Amazon Glue Data Catalog

  • Support for Apache Iceberg, Apache Hudi and Delta Lake

  • Support for Amazon Redshift

  • Simplified notification with Amazon EventBridge

  • Amazon CloudFormation support for creating rulesets

  • Performance improvements: caching option in ETL and Amazon Glue Studio for faster performance when evaluating data quality

Nov 27, 2023 (Preview)

Mar 12, 2024

  • Support for keywords such as NULL, BLANKS, and WHITESPACES_ONLY

  • Bug fix: ColumnValues now fails when rows have NULL values

  • Option to evaluate composite rules