Tutorial: Getting started with Data Quality - Amazon Glue Studio
Services or capabilities described in Amazon Web Services documentation might vary by Region. To see the differences applicable to the China Regions, see Getting Started with Amazon Web Services in China (PDF).

Tutorial: Getting started with Data Quality

Through this getting started tutorial, you will learn how to:

  • Create rules using the DQDL rule builder.

  • Specify data quality actions, data to output, and the output location of the data quality results.

  • Review data quality results.

Step 1: Add the Evaluate Data Quality node to the visual job

In this step, you add the Evaluate Data Quality node to the visual job editor.

To add the data quality node
  1. From the Amazon Glue Studio console, choose Visual with a source and target from the Create job section. Choose Create.

  2. From the Amazon Glue Studio console, choose a node on which to apply the data quality transform. Typically, this will be a transform node or a data source.

  3. From the menu, choose Action, then Evaluate Data Quality from the drop-down menu. You can also type in data quality in the search bar and then choose Evaluate Data Quality from the search results.

    
                        The screenshot shows the Action menu open with the Evaluate Data Quality transform highlighted.
  4. The visual job editor will show the Evaluate Data Quality transform node branching from the node you selected. On the right side of the console, the Transform tab is automatically opened. If you need to change the parent node, choose the Node properties tab, then choose the node parent from the drop-down menu.

    
                        The screenshot shows the Node properties tab with the parent node drop-down selected.

    When you choose a new node parent, a new connection is made between the parent node and the Evaluate Data Quality node. Remove any unwanted parent nodes. Only one parent node can be connected to one Evaluate Data Quality node.

Step 2: Create a rule using DQDL

In this step, you create a rule using DQDL. For this tutorial, you will create a single rule using the Completeness rule type. This rule type Checks the percentage of complete (non-null) values in a column against a given expression. For more information on using DQDL, see DQDL.

  1. In the Transform tab, add a Rule type by clicking on the insert button. This adds the rule type to the rule editor where you can enter the parameters for the rule.

    Note

    When you insert a rule type, it is inserted between the brackets [ ]. When editing rules, ensure that rules are within the brackets and ensure the rules are separated by commas. For example, a complete rule expression will look like:

    Rules= [ Completeness "year">0.8, Completeness "month">0.8 ]

    This example specifies the parameter for completness for the columns named 'year' and 'month'. These columns must be greater than 80% 'complete', or have data in over 80% of instances for each respective column, in order for the rule to pass.

    In this example, search for and insert the Completeness rule type. This will add the rule type to the rule editor. This rule type has the following syntax: Completeness <COL_NAME> <EXPRESSION>. Most rule types require that you provide an expression as a parameter in order to create a boolean response. For more information on supported DQDL expressions, see DQDL Expressions. Next, you will add the column name.

  2. In the DQDL rule builder, click on the Schema tab. Use the search bar to locate the column name in the input schema. The input schema displays the column name and data type.

  3. In the rule editor, click to the right of the rule type to insert the cursor where the column will be inserted. Alternately, you can type in the name of the column in the rule.

    For example, from the list of columns in the input schema list, click the insert button next to the column (in this example, year). This will add the column to the rule.

  4. Then, in the rule editor, add an expression to evaluate the rule. Since the Completeness rule type checks the percentage of complete (non-null) values in a column against a given expression, enter an expression such as > 0.8. This rule will check the column if it is greater than 80% complete (non-null) values.

    
                        The screenshot shows the rule editor with a complete rule using the Completeness rule type.

Step 3: Configure Data Quality actions and output

After creating data quality rules, you can select additional options:

  1. In Data quality actions, click the checkbox for Fail job when data quality fails. By default, this action is not selected and the job will complete its run even if data quality rules fail.

  2. In Data quality transform output, choose from the following options:

    • Original data — Choose to output original input data. This option is ideal if you want to stop the job when quality issues are detected.

    • Data quality results — Choose to output configured rules and their pass or fail status. This option is useful if you want to take a custom action.

  3. (Optional) - Click Data quality output settings to reveal the Data quality result location field. Then, click Browse to search for an Amazon S3 location to set as the data quality output target.

Step 4: View data quality results

After running the job, view the data quality results by clicking on the Data quality tab.

  1. For each job run, view the Data quality results. Each node displays a data quality status and status detail. Click on a node to view all rules and the status of each rule.

  2. Click Download results to download a CSV file that contains information about the job run and data quality results.

  3. If you have more than one job run with data quality results, you can filter the results by date and time range. Click Filter by a date and time range to expand the filter window.

    
                        The screenshot shows the date and time range filter open.
  4. Choose a relative range or absolute range. For absolute ranges, use the calendar to select a date and enter values for start time and end time. When done, choose Apply.