
Anomaly detection in Amazon Glue Data Quality

Note

Amazon Glue Data Quality is available in preview in the following regions:

  • US East (Ohio, N. Virginia)

  • US West (Oregon)

  • Asia Pacific (Tokyo)

  • Europe (Ireland)

Amazon Glue Data Quality anomaly detection applies machine learning (ML) algorithms to data statistics collected over time to detect abnormal patterns and hidden data quality issues that are hard to catch with rules. At present, anomaly detection is available only for Amazon Glue 4.0, and only in Amazon Glue Studio Visual ETL and Amazon Glue ETL. This capability doesn't work in Amazon Glue Studio Notebooks, the Amazon Glue Data Catalog, Amazon Glue Interactive Sessions, or Amazon Glue Data Previews.

How it works

When evaluating Data Quality rules, Amazon Glue captures the data statistics needed to determine whether the data conforms to the rules. For example, Data Quality computes the number of distinct values in a dataset and then compares that value to the expectation.

The Data Quality rule engine compares each statistic value with the defined thresholds and evaluates whether your quality requirements are met. Because these statistics are collected over time, you can enable anomaly detection on your ETL pipelines so that Amazon Glue learns from past statistics and reports hidden patterns as Observations. Observations are unconfirmed insights that Amazon Glue's ML algorithm identifies. They come with recommended Data Quality rules that you can apply to your ruleset to monitor the discovered pattern. We recommend running jobs on a regular schedule (for example, hourly or daily); irregular runs might produce poor insights.

The following diagram shows the data quality anomaly detection process.
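To make this concrete, the following is a minimal sketch of an Amazon Glue 4.0 ETL script that evaluates a ruleset with the Evaluate Data Quality transform. The ruleset contents, the catalog database and table names, the column name, and the evaluation context are illustrative assumptions, and the publishing and tuning options shown may differ from what your job needs; see Configuring Anomaly detection and generating insights for the supported configuration.

import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsgluedq.transforms import EvaluateDataQuality

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the source data. The database and table names are placeholders.
source_frame = glue_context.create_dynamic_frame.from_catalog(
    database="sales_db", table_name="daily_transactions"
)

# Each rule makes Glue compute the statistic it needs (for example, RowCount or
# DistinctValuesCount) and compare it against the stated expectation.
ruleset = """
Rules = [
    RowCount > 0,
    DistinctValuesCount "state" <= 50
]
"""

# Evaluate the ruleset. The statistics captured on each scheduled run are what
# anomaly detection learns from over time.
dq_results = EvaluateDataQuality().process_rows(
    frame=source_frame,
    ruleset=ruleset,
    publishing_options={
        "dataQualityEvaluationContext": "how_it_works_example",
        "enableDataQualityCloudWatchMetrics": True,
        "enableDataQualityResultsPublishing": True,
    },
    additional_options={"performanceTuning.caching": "CACHE_NOTHING"},
)

job.commit()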

Using analyzers to inspect your data

Sometimes, you might not have the time to author data quality rules. This is where analyzers come in handy. Analyzers are part of your ruleset and are very simple to configure. For example, you can write this in your ruleset:

Analyzers = [ RowCount, Completeness "AllColumns" ]

This will gather the following statistics:

  • Row Count for the entire dataset

  • Completeness of every column in your dataset

We recommend using analyzers because you don't have to worry about thresholds. You can run your data pipelines, and after three runs Amazon Glue Data Quality starts generating observations and rule recommendations when it notices anomalies. You can review the observations and the associated statistics, and you can easily incorporate the rule recommendations into your ruleset. To get started, see Configuring Anomaly detection and generating insights. Note that analyzers don't impact your data quality scores; they generate statistics that can be analyzed over time to produce observations.
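As an illustration, the sketch below shows a ruleset that combines one explicit rule with the analyzers above; the "order_id" column is a hypothetical placeholder. Passed to the same Evaluate Data Quality transform shown earlier, the analyzers only gather statistics and do not affect the score.

# A ruleset can hold both Rules and Analyzers. Only the rule contributes to the
# data quality score; the analyzers just record statistics that anomaly
# detection can learn from across runs.
ruleset = """
Rules = [
    IsComplete "order_id"
]
Analyzers = [
    RowCount,
    Completeness "AllColumns"
]
"""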

Using the DetectAnomaly Rule

Sometimes, you want your jobs to fail when they detect anomalies. To enforce a constraint, you must configure a rule; analyzers won't stop a job. Instead, they gather statistics and analyze the data. Configuring the DetectAnomaly rule in the Rules section of the ruleset ensures that, when an anomaly is detected, the DQ scan reports that the job failed to pass all the rules in the scan.
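The following sketch assumes the same job setup and source_frame as the earlier example. The statistic passed to DetectAnomaly, the ruleOutcomes key, and the Outcome column name reflect our reading of the transform's multi-frame output and should be treated as assumptions; refer to the anomaly detection documentation for the authoritative syntax.

from awsglue.transforms import SelectFromCollection
from awsgluedq.transforms import EvaluateDataQuality

# DetectAnomaly turns anomaly detection into an enforced rule: the rule fails
# when the monitored statistic (here, RowCount) looks anomalous.
ruleset = """
Rules = [
    DetectAnomaly "RowCount"
]
"""

dq_results = EvaluateDataQuality().process_rows(
    frame=source_frame,  # the DynamicFrame read in the earlier sketch
    ruleset=ruleset,
    publishing_options={"dataQualityEvaluationContext": "detect_anomaly_example"},
)

# Inspect the rule outcomes and stop the job when any rule, including
# DetectAnomaly, did not pass. The "Outcome" column name is an assumption.
rule_outcomes = SelectFromCollection.apply(dfc=dq_results, key="ruleOutcomes").toDF()
if rule_outcomes.filter(rule_outcomes["Outcome"] == "Failed").count() > 0:
    raise Exception("Data quality scan failed: at least one rule did not pass")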

Benefits and use cases of Anomaly Detection

Engineers may manage hundreds of data pipelines at any given time. Because each pipeline can extract data from a different source and load it into the data lake, it is difficult to get immediate feedback on the data: whether its shape has changed significantly, or whether it has diverged from existing trends.

In the past, upstream data sources have changed without warning to data engineering teams, introducing hard-to-track “data bugs” into this process. Adding Data Quality nodes to jobs makes life much easier, because jobs fail when issues are spotted. However, this doesn't remove all the failure modes that data teams worry about, which keeps the door open for other data bugs.

One failure mode involves data volume. As a company’s data store grows over time, the number of records produced by data pipelines may grow exponentially. Every week, data teams may need to manually update their ETL jobs to raise the limit that each Data Quality rule places on the number of rows ingested.

Another failure mode is that some data quality rule limits must be very wide to accommodate the fact that transaction volume varies by day of the week. On weekends there are almost no transactions, and on Mondays there are about three times more transactions than on other weekdays. Data teams have two options: either implement logic to change the ruleset on the fly depending on the day, or set a very wide expectation.
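To illustrate that trade-off, the two rulesets below contrast a deliberately wide static threshold with the DetectAnomaly rule described earlier; the numbers are made up for the example.

# Static option: a threshold wide enough to cover both weekend lows and Monday
# peaks, which also lets many real problems slip through unnoticed.
static_ruleset = 'Rules = [ RowCount between 100 and 500000 ]'

# Anomaly detection option: let Glue learn the weekly pattern from past runs
# instead of hard-coding a range.
anomaly_ruleset = 'Rules = [ DetectAnomaly "RowCount" ]'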

Finally, data teams are also concerned with less well-defined data bugs. Models are trained on data with specific characteristics, and if those characteristics start skewing in unexpected ways, the team wants to know. For example, in February a company may expand to Montana, so transactions containing the “MT” code start appearing more frequently. This may break ML inference, and as a result the models might falsely predict that every single Montana transaction is fraudulent.

Data Quality anomaly detection can help solve these problems. Some of its benefits include:

  • Scanning of data on a scheduled, event-driven, or manual basis.

  • Detection of anomalies that can be indicative of an unintended event, seasonality, or statistical abnormality.

  • Rule recommendations to take action on observations found by Data Quality anomaly detection.

This is useful if you:

  • want to detect anomalies on your data automatically without the need to write data quality rules.

  • want to catch potential problems in your data that data quality rules alone can't find.

  • want to automate tasks that evolve over time, such as updating the limit on the number of rows ingested for data quality monitoring.