Amazon Glue Data Quality

Amazon Glue Data Quality lets you measure and monitor the quality of your data so that you can make good business decisions. Built on top of the open-source Deequ framework, Amazon Glue Data Quality provides a managed, serverless experience. Amazon Glue Data Quality works with Data Quality Definition Language (DQDL), a domain-specific language that you use to define data quality rules. To learn more about DQDL and supported rule types, see Data Quality Definition Language (DQDL) reference.

For additional product details and pricing, see the service page for Amazon Glue Data Quality.

Benefits and key features

Benefits and key features of Amazon Glue Data Quality include:

Serverless – There is no installation, patching or maintenance.
Get started quickly – Amazon Glue Data Quality quickly analyzes your data and creates data quality rules for you. You can get started with two clicks: “Create Data Quality Rules → Recommend rules”.
Detect data quality issues – Use machine learning (ML) to detect anomalies and hard-to-detect data quality issues.
Customize your rules – With more than 25 out-of-the-box data quality rules to start from, you can create rules that suit your specific needs.
Evaluate quality and make confident business decisions – After you evaluate the rules, you get a Data Quality score that provides an overview of the health of your data. Use the Data Quality score to make confident business decisions.
Zero in on bad data – Amazon Glue Data Quality helps you identify the exact records that caused your quality scores to go down, so that you can quarantine and fix them.
Pay as you go – You don't need an annual license to use Amazon Glue Data Quality.
No lock-in – Amazon Glue Data Quality is built on open-source Deequ, allowing you to keep the rules you are authoring in an open language.
Data quality checks – You can enforce data quality checks on the Data Catalog and Amazon Glue ETL pipelines, allowing you to manage data quality at rest and in transit.
ML-based data quality detection – Use machine learning (ML) to detect anomalies and hard-to-detect data quality issues.
Open language to express rules – An open language ensures that data quality rules are authored consistently and simply. Business users can express data quality rules in a straightforward language that they can understand. For engineers, this language provides the flexibility to generate code, implement consistent version control, and automate deployments.

How it works

There are two entry points for Amazon Glue Data Quality: the Amazon Glue Data Catalog and Amazon Glue ETL jobs. This section provides an overview of the use cases and Amazon Glue features that each entry point supports.

Data quality for the Amazon Glue Data Catalog

Amazon Glue Data Quality evaluates objects that are stored in the Amazon Glue Data Catalog. It offers non-coders, such as data stewards and business analysts, an easy way to set up data quality rules.

You might choose this option for the following use cases:

You want to perform data quality tasks on data sets that you've already cataloged in the Amazon Glue Data Catalog.
You work on data governance and need to identify or evaluate data quality issues in your data lake on an ongoing basis.

You can manage data quality for the Data Catalog using the following interfaces:

The Amazon Glue management console
Amazon Glue APIs

To get started with Amazon Glue Data Quality for the Amazon Glue Data Catalog, see Getting started with Amazon Glue Data Quality for the Data Catalog.
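
As a rough illustration of the API interface listed above, the following Python sketch assembles the request parameters for two Amazon Glue API operations as exposed by boto3: `create_data_quality_ruleset` and `start_data_quality_ruleset_evaluation_run`. The database, table, ruleset, and role names are hypothetical placeholders, and the DQDL rules are examples only. The service calls sit in a separate function so that the parameter-building code runs without credentials.

```python
def build_ruleset_request(name, database, table, rules):
    """Return keyword arguments for glue.create_data_quality_ruleset."""
    # A DQDL ruleset is a comma-separated list of rules inside "Rules = [ ... ]".
    dqdl = "Rules = [ " + ", ".join(rules) + " ]"
    return {
        "Name": name,
        "Ruleset": dqdl,
        "TargetTable": {"DatabaseName": database, "TableName": table},
    }


def build_evaluation_request(ruleset_name, database, table, role_arn):
    """Return keyword arguments for glue.start_data_quality_ruleset_evaluation_run."""
    return {
        "DataSource": {"GlueTable": {"DatabaseName": database, "TableName": table}},
        "Role": role_arn,
        "RulesetNames": [ruleset_name],
    }


def submit(ruleset_request, evaluation_request):
    """Send both requests. Requires boto3 and valid credentials."""
    import boto3  # imported lazily so the builders above work without boto3

    glue = boto3.client("glue")
    glue.create_data_quality_ruleset(**ruleset_request)
    run = glue.start_data_quality_ruleset_evaluation_run(**evaluation_request)
    return run["RunId"]


# Hypothetical names, for illustration only.
ruleset_request = build_ruleset_request(
    name="orders_ruleset",
    database="sales_db",
    table="orders",
    rules=['IsComplete "order_id"', 'IsUnique "order_id"', "RowCount > 0"],
)
evaluation_request = build_evaluation_request(
    "orders_ruleset",
    "sales_db",
    "orders",
    "arn:aws:iam::111122223333:role/ExampleGlueDataQualityRole",  # placeholder ARN
)
print(ruleset_request["Ruleset"])
```

Calling `submit(ruleset_request, evaluation_request)` would create the ruleset and start an evaluation run against the cataloged table, assuming the role has the required permissions.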

Data quality for Amazon Glue ETL jobs

Amazon Glue Data Quality for Amazon Glue ETL jobs lets you perform proactive data quality tasks. Proactive tasks help you identify and filter out bad data before you load a data set into your data lake.

You might choose data quality for ETL jobs for the following use cases:

You want to incorporate data quality tasks into your ETL jobs.
You want to write code that defines data quality tasks in ETL scripts.
You want to manage the quality of data that flows in your visual data pipelines.

You can manage data quality for ETL jobs using the following interfaces:

Amazon Glue Studio, Amazon Glue Studio notebooks, and Amazon Glue interactive sessions
Amazon Glue libraries for ETL scripting
Amazon Glue APIs

To get started with data quality for ETL jobs, see Tutorial: Getting started with Data Quality in the Amazon Glue Studio User Guide.
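
For the ETL scripting interface, a minimal sketch of evaluating a ruleset inside a job script might look like the following. It assumes the `EvaluateDataQuality` transform from the `awsgluedq` library that is available in the Amazon Glue job runtime. The column names and the evaluation context name are hypothetical, and `frame` is assumed to be a DynamicFrame that the job has already read.

```python
# Example DQDL ruleset; the column names are hypothetical.
RULESET = """
Rules = [
    IsComplete "order_id",
    ColumnValues "quantity" > 0
]
"""


def evaluate_quality(frame, context_name="orders_quality_check"):
    """Evaluate RULESET against a DynamicFrame inside an Amazon Glue ETL job."""
    # awsgluedq exists only in the Glue job runtime, so it is imported lazily.
    from awsgluedq.transforms import EvaluateDataQuality

    return EvaluateDataQuality.apply(
        frame=frame,
        ruleset=RULESET,
        publishing_options={
            "dataQualityEvaluationContext": context_name,
            "enableDataQualityCloudWatchMetrics": True,
            "enableDataQualityResultsPublishing": True,
        },
    )
```

The function only runs inside a Glue job or interactive session. Outside that environment the import fails, which is why it is kept out of module scope.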

Comparing data quality for the Data Catalog to data quality for ETL jobs

This table provides an overview of features that each entry point for Amazon Glue Data Quality supports.

Feature	Data quality for the Data Catalog	Data quality for ETL jobs
Data sources	Amazon S3, Amazon Redshift, JDBC sources compatible with the Data Catalog, and transactional data lake formats such as Apache Iceberg, Apache Hudi, and Delta Lake. Iceberg, Delta Lake, and Hudi tables that are managed by Amazon Lake Formation are not supported. Amazon Athena views that are cataloged in the Amazon Glue Data Catalog are not supported.	All data sources supported by Amazon Glue, including custom connectors and third-party connectors.
Data Quality rule recommendations	Supported	Not supported
Author and run DQDL rules	Supported	Supported
Auto scaling	Not supported	Supported
Amazon Glue Flex support	Not supported	Supported
Scheduling	Supported when evaluating Data Quality rules and via Step Functions.	Supported when using Step Functions and workflows.
Identifying records that failed data quality checks	Not supported	Supported
Integration with Amazon EventBridge	Supported	Supported
Integration with Amazon CloudWatch	Supported	Supported
Writing data quality results to Amazon S3	Supported	Supported
Incremental data quality	Supported via pushdown predicates	Supported via Amazon Glue bookmarks
Amazon CloudFormation support	Supported	Supported
ML-based anomaly detection	Not supported	Supported
Dynamic rules	Not supported	Supported

Considerations

Consider the following items before you use Amazon Glue Data Quality:

Data quality rules can't evaluate nested or list-type data sources. See Flatten nested structs.
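
To show what flattening means here, the following sketch turns nested fields of a plain Python record into top-level columns. It is an illustration of the idea only, not the method described in Flatten nested structs. The `flatten` helper, the underscore separator, and the sample record are assumptions made for this example. List-type fields are left untouched and would need separate handling.

```python
def flatten(record, parent_key="", sep="_"):
    """Return a copy of record with nested dicts promoted to top-level keys."""
    flat = {}
    for key, value in record.items():
        name = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into the nested struct and prefix its keys.
            flat.update(flatten(value, name, sep))
        else:
            flat[name] = value
    return flat


record = {"id": 7, "customer": {"name": "Ana", "address": {"city": "Lima"}}}
print(flatten(record))
# {'id': 7, 'customer_name': 'Ana', 'customer_address_city': 'Lima'}
```

After flattening, a rule can reference a column such as `customer_address_city` directly.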

Terminology

The following list defines terms that are related to Amazon Glue Data Quality.

Data Quality Definition Language (DQDL)

A domain-specific language that you can use to write Amazon Glue Data Quality rules.

To learn more about DQDL, see the Data Quality Definition Language (DQDL) reference guide.

data quality

Describes how well a dataset serves its specific purpose. Amazon Glue Data Quality evaluates rules against a dataset to measure data quality. Each rule checks for particular characteristics like data freshness or integrity. To quantify data quality, you can use a data quality score.

data quality score

The percentage of data quality rules that pass (result in true) when you evaluate a ruleset with Amazon Glue Data Quality.
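
The definition above amounts to simple arithmetic, sketched below. The `data_quality_score` function and the sample results are hypothetical; the service computes the score for you.

```python
def data_quality_score(rule_results):
    """Return the percentage of rules that passed.

    rule_results maps a rule (as a string) to True (pass) or False (fail).
    """
    if not rule_results:
        raise ValueError("at least one rule result is required")
    passed = sum(1 for outcome in rule_results.values() if outcome)
    return 100.0 * passed / len(rule_results)


# Hypothetical outcomes for a four-rule ruleset.
results = {
    'IsComplete "order_id"': True,
    'IsUnique "order_id"': True,
    "RowCount > 0": True,
    'ColumnValues "quantity" > 0': False,
}
print(data_quality_score(results))  # 75.0
```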

rule

A DQDL expression that checks your data for a specific characteristic and returns a Boolean value. For more information, see Rule structure.

analyzer

A DQDL expression that gathers data statistics. An analyzer gathers data statistics that can be used by ML algorithms to detect anomalies and hard-to-detect data quality issues over time.

ruleset

An Amazon Glue resource that comprises a set of data quality rules. A ruleset must be associated with a table in the Amazon Glue Data Catalog. When you save a ruleset, Amazon Glue assigns an Amazon Resource Name (ARN) to the ruleset.

observation

An unconfirmed insight generated by Amazon Glue by analyzing data statistics gathered from rules and analyzers over time.

Limits

Amazon Glue Data Quality service limits:

A ruleset can contain up to 2,000 rules. If you need more rules, we recommend splitting them across multiple rulesets.
The maximum size of a ruleset is 65 KB. If your ruleset is larger, we recommend splitting it into multiple rulesets.
Amazon Glue Data Quality collects statistics when you create a rule or analyzer. There is no cost associated with storing these statistics. However, there is a limit of 100,000 statistics per account, and these statistics will be retained for a maximum of two years.
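
One way to follow the splitting recommendation is to pack rules greedily into rulesets that stay under both limits. The sketch below is a client-side helper, not part of the service. It assumes that 65 KB means 65 × 1024 bytes of UTF-8 text and that a ruleset is written as `Rules = [ ... ]`.

```python
MAX_RULES = 2000
MAX_BYTES = 65 * 1024  # assumption: 65 KB counted as 65 * 1024 bytes


def render(rules):
    """Render a list of DQDL rules as a single ruleset string."""
    return "Rules = [ " + ", ".join(rules) + " ]"


def split_rules(rules, max_rules=MAX_RULES, max_bytes=MAX_BYTES):
    """Greedily pack rules into rulesets that respect both limits."""
    chunks, current = [], []
    for rule in rules:
        candidate = current + [rule]
        fits = (
            len(candidate) <= max_rules
            and len(render(candidate).encode("utf-8")) <= max_bytes
        )
        if fits or not current:
            # A single oversized rule still gets its own ruleset.
            current = candidate
        else:
            chunks.append(render(current))
            current = [rule]
    if current:
        chunks.append(render(current))
    return chunks


# 4,500 hypothetical rules end up in three rulesets (2,000 + 2,000 + 500).
rules = [f'IsComplete "col_{i}"' for i in range(4500)]
rulesets = split_rules(rules)
print(len(rulesets))  # 3
```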

Release notes for Amazon Glue Data Quality

This topic describes features introduced in Amazon Glue Data Quality.

General availability: new features

The following new features are available with the general availability of Amazon Glue Data Quality:

The ability to identify which records failed data quality checks is now supported in Amazon Glue Studio
New data quality rule types, such as validating referential integrity between two datasets, comparing data between two datasets, and checking data types
Improved user experience in the Amazon Glue Data Catalog
Support for Apache Iceberg, Apache Hudi and Delta Lake
Support for Amazon Redshift
Simplified notification with Amazon EventBridge
Amazon CloudFormation support for creating rulesets
Performance improvements: caching option in ETL and Amazon Glue Studio for faster performance when evaluating data quality

Nov 27, 2023 (Preview)

ML-powered anomaly detection capabilities are now available in Amazon Glue ETL and Amazon Glue Studio. With this, you can detect anomalies and hard-to-detect data quality issues.
Dynamic rules allow you to provide dynamic thresholds (for example, RowCount > avg(last(10))).
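
To make the example threshold concrete, the sketch below mimics what a rule such as `RowCount > avg(last(10))` expresses: the current row count is compared with the average of the most recent runs. This illustrates the semantics only and is not the service's implementation. The function name, the sample history, and the choice to raise an error when no history exists are assumptions.

```python
from statistics import mean


def row_count_exceeds_recent_average(history, current, window=10):
    """Return True if current is greater than the mean of the last `window` counts."""
    recent = history[-window:]
    if not recent:
        # Assumption of this sketch: without history there is nothing to compare.
        raise ValueError("no historical row counts available")
    return current > mean(recent)


# Hypothetical row counts from four earlier runs; their average is 115.
history = [100, 120, 110, 130]
print(row_count_exceeds_recent_average(history, 125))  # True
print(row_count_exceeds_recent_average(history, 100))  # False
```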

Mar 12, 2024

DQDL improvements

June 26, 2024

DQDL improvements
DQDL now supports a where clause so that you can filter data before applying data quality rules

August 7, 2024

Anomaly Detection and Dynamic Rules are now generally available

Nov 22, 2024

Complex composite rules allow you to author more complex business rules with nested support
New rule types for managing data quality for your files
Default data quality checks in Visual ETL jobs

Dec 6, 2024

Amazon Glue Data Quality now supports Amazon SageMaker AI LakeHouse tables and Amazon Lake Formation managed Iceberg, Delta Lake, and Hudi tables in Amazon Glue ETL 5.0.

Jun 30, 2025

Aggregate metrics – You can now get aggregated metrics, such as the number of records that passed or failed, at the API level for data quality jobs based on Amazon Glue ETL.
