
Evaluating data quality with Amazon Glue Studio

Amazon Glue Data Quality is in open preview release for Amazon Glue Studio and is subject to change. The preview is already enabled for your account in the following Regions:
  • US East (Ohio)

  • US East (N. Virginia)

  • US West (Oregon)

  • Asia Pacific (Tokyo)

  • Europe (Ireland)

  • South America (São Paulo)

Amazon Glue Data Quality evaluates and monitors the quality of your data against rules that you define, which makes it easier to identify data that needs action. In Amazon Glue Studio, you can add data quality nodes to a visual job to create data quality rules for tables in your Data Catalog. You can then monitor and evaluate changes to your datasets as they evolve over time.
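
As a rough illustration of what "rules that you define" can look like, the following sketch assembles a small ruleset in DQDL (Data Quality Definition Language) from Python. The rule types shown (RowCount, IsComplete, IsUnique) are drawn from DQDL. The order_id column and the build_ruleset helper are hypothetical and exist only for this example. Because the feature is in preview, check the DQDL reference for the rule types available to you.

```python
def build_ruleset(rules):
    """Join individual DQDL rule strings into a `Rules = [...]` block.

    `build_ruleset` is a hypothetical helper, not part of any Amazon Glue API.
    """
    if not rules:
        raise ValueError("a ruleset needs at least one rule")
    body = ",\n".join(f"    {rule}" for rule in rules)
    return f"Rules = [\n{body}\n]"


# Hypothetical column on a hypothetical orders table.
ruleset = build_ruleset([
    "RowCount > 0",             # the table is not empty
    'IsComplete "order_id"',    # no missing values in order_id
    'IsUnique "order_id"',      # no duplicate values in order_id
])
print(ruleset)
```

The resulting text is what you would paste into, or compare against, the ruleset that the DQDL builder produces in the visual editor.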

Working with Amazon Glue Data Quality involves the following high-level steps:

  1. Create data quality rules – Build a ruleset in the DQDL (Data Quality Definition Language) builder by choosing built-in rule types and configuring them.

  2. Configure a data quality job – Define the actions to take based on the data quality results, and choose the output options.

  3. Save and run a job with data quality – Create and run the job. Saving the job also saves the rulesets you created for it.

  4. Monitor and review the data quality results – Review the data quality results after the job run is complete. Optionally, schedule the job for a future date.
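
Steps 2 and 4 above revolve around acting on rule outcomes. The following sketch models that decision in plain Python. It does not call any Amazon Glue API: the RuleOutcome type, the summarize and decide_action functions, and the "fail_job" and "continue" labels are hypothetical names chosen for illustration, and the sample outcomes are made up rather than taken from a real job run.

```python
from dataclasses import dataclass


@dataclass
class RuleOutcome:
    """Hypothetical record of one evaluated rule (not a Glue type)."""
    rule: str
    passed: bool


def summarize(outcomes):
    """Count passed rules and collect the text of the failed ones."""
    failed = [o.rule for o in outcomes if not o.passed]
    return {
        "total": len(outcomes),
        "passed": len(outcomes) - len(failed),
        "failed": failed,
    }


def decide_action(outcomes, fail_on_error=True):
    """Return "fail_job" if any rule failed and failures are fatal, else "continue"."""
    summary = summarize(outcomes)
    if summary["failed"] and fail_on_error:
        return "fail_job"
    return "continue"


# Made-up outcomes for illustration only.
outcomes = [
    RuleOutcome('IsComplete "order_id"', True),
    RuleOutcome('IsUnique "order_id"', False),
]
print(summarize(outcomes))
print(decide_action(outcomes))
```

Keeping the summary separate from the decision mirrors the split between reviewing results and configuring actions: the same outcomes can be reported in full even when the job is configured to continue.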

Benefits

Data analysts, data engineers, and data scientists can use the Evaluate Data Quality node in Amazon Glue Studio to analyze, configure, monitor, and improve the quality of data from the visual job editor. The benefits of using the data quality node include:

  • Detect data quality issues – Create rules that check characteristics of your datasets.

  • Easy to get started – Start with pre-built rules and actions.

  • Tight integration – Data quality nodes are available in Amazon Glue Studio because Amazon Glue Data Quality runs on top of the Amazon Glue Data Catalog.