Data Quality Definition Language (DQDL) reference
Data Quality Definition Language (DQDL) is a domain-specific language that you use to define rules for Amazon Glue Data Quality.
This guide introduces key DQDL concepts to help you understand the language. It also provides a reference for DQDL rule types with syntax and examples. Before you use this guide, we recommend that you have familiarity with Amazon Glue Data Quality. For more information, see Amazon Glue Data Quality.
Note
Dynamic rules are only supported in Amazon Glue ETL.
Contents
- DQDL syntax
- DQDL rule type reference
- AggregateMatch
- ColumnCorrelation
- ColumnCount
- ColumnDataType
- ColumnExists
- ColumnLength
- ColumnNamesMatchPattern
- ColumnValues
- Completeness
- CustomSQL
- DataFreshness
- DatasetMatch
- DistinctValuesCount
- Entropy
- IsComplete
- IsPrimaryKey
- IsUnique
- Mean
- ReferentialIntegrity
- RowCount
- RowCountMatch
- StandardDeviation
- Sum
- SchemaMatch
- Uniqueness
- UniqueValueRatio
- DetectAnomalies
- FileFreshness
- FileMatch
- FileUniqueness
- FileSize
DQDL syntax
A DQDL document is case sensitive and contains a ruleset, which groups individual data quality rules together. To construct a ruleset, you must create a list named Rules (capitalized), delimited by a pair of square brackets. The list should contain one or more comma-separated DQDL rules like the following example.
Rules = [ IsComplete "order-id", IsUnique "order-id" ]
Rule structure
The structure of a DQDL rule depends on the rule type. However, DQDL rules generally fit the following format.
<RuleType> <Parameter> <Parameter> <Expression>
RuleType is the case-sensitive name of the rule type that you want to configure, for example IsComplete, IsUnique, or CustomSql. Rule parameters differ for each rule type. For a complete reference of DQDL rule types and their parameters, see DQDL rule type reference.
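For example, the following rules (with placeholder column names) follow this format: a rule type, one or two column-name parameters, and an expression.

ColumnValues "order-id" between 1 and 2000

ColumnCorrelation "price" "num_items" > 0.5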
Composite rules
DQDL supports the following logical operators that you can use to combine rules. These rules are called Composite Rules.
- and

  The logical and operator results in true if and only if all of the rules that it connects are true. Otherwise, the combined rule results in false. Each rule that you connect with the and operator must be surrounded by parentheses.

  The following example uses the and operator to combine two DQDL rules.

  (IsComplete "id") and (IsUnique "id")

- or

  The logical or operator results in true if and only if one or more of the rules that it connects are true. Each rule that you connect with the or operator must be surrounded by parentheses.

  The following example uses the or operator to combine two DQDL rules.

  (RowCount "id" > 100) or (IsPrimaryKey "id")
You can use the same operator to connect multiple rules, so the following rule combination is allowed.
(Mean "Star_Rating" > 3) and (Mean "Order_Total" > 500) and (IsComplete "Order_Id")
You can combine the logical operators into a single expression. For example:
(Mean "Star_Rating" > 3) and ((Mean "Order_Total" > 500) or (IsComplete "Order_Id"))
You can also author more complex, nested rules.
(RowCount > 0) or ((IsComplete "colA") and (IsUnique "colA"))
How Composite rules work
By default, Composite Rules are evaluated as individual rules across the entire dataset or table and then the results are combined. In other words, it evaluates the entire column first and then applies the operator. This default behaviour is explained below with an example:
# Dataset
+------+------+
|myCol1|myCol2|
+------+------+
|     2|     1|
|     0|     3|
+------+------+

# Overall outcome
+----------------------------------------------------------+-------+
|Rule                                                      |Outcome|
+----------------------------------------------------------+-------+
|(ColumnValues "myCol1" > 1) OR (ColumnValues "myCol2" > 2)|Failed |
+----------------------------------------------------------+-------+
In the above example, Amazon Glue Data Quality first evaluates (ColumnValues "myCol1" > 1), which results in a failure. Then it evaluates (ColumnValues "myCol2" > 2), which also fails. The combination of both results is noted as FAILED.

However, if you prefer an SQL-like behaviour, where the entire row must be evaluated, you have to explicitly set the compositeRuleEvaluation.method option, as shown in additionalOptions in the code snippet below.
object GlueApp {
  val datasource = glueContext.getCatalogSource(
    database="<db>",
    tableName="<table>",
    transformationContext="datasource"
  ).getDynamicFrame()

  val ruleset = """
    Rules = [
      (ColumnValues "age" >= 26) OR (ColumnLength "name" >= 4)
    ]
  """

  val dq_results = EvaluateDataQuality.processRows(
    frame=datasource,
    ruleset=ruleset,
    additionalOptions=JsonOptions(
      """
      {
        "compositeRuleEvaluation.method":"ROW"
      }
      """
    )
  )
}
In Amazon Glue Data Catalog, you can configure this option in the user interface.
Once set, the composite rules will behave as a single rule evaluating the entire row. The following example illustrates this behaviour.
# Row Level outcome
+------+------+------------------------------------------------------------+---------------------------+
|myCol1|myCol2|DataQualityRulesPass                                        |DataQualityEvaluationResult|
+------+------+------------------------------------------------------------+---------------------------+
|2     |1     |[(ColumnValues "myCol1" > 1) OR (ColumnValues "myCol2" > 2)]|Passed                     |
|0     |3     |[(ColumnValues "myCol1" > 1) OR (ColumnValues "myCol2" > 2)]|Passed                     |
+------+------+------------------------------------------------------------+---------------------------+
Some rules are not supported by this feature because their overall outcome relies on thresholds or ratios. They are listed below.

Rules relying on ratios:

- Completeness
- DatasetMatch
- ReferentialIntegrity
- Uniqueness

Rules dependent on thresholds:

When the following rules include with threshold, they are not supported. However, rules that do not involve with threshold remain supported; see the example after this list.

- ColumnDataType
- ColumnValues
- CustomSQL
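For example, with "compositeRuleEvaluation.method" = "ROW" (column names are placeholders), the first composite rule below can be evaluated row by row, but the second cannot, because one of its parts uses with threshold.

(ColumnValues "colA" in ["a", "b"]) or (IsComplete "colA")

(ColumnValues "colA" in ["a", "b"] with threshold > 0.9) or (IsComplete "colA")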
Expressions
If a rule type doesn't produce a Boolean response, you must provide an expression as a parameter in order to create a Boolean response. For example, the following rule checks the mean (average) of all the values in a column against an expression to return a true or false result.
Mean "colA" between 80 and 100
Some rule types, such as IsUnique and IsComplete, already return a Boolean response.
The following table lists expressions that you can use in DQDL rules.
Expression | Description |
---|---|
= x | Resolves to true if the rule type response is equal to x. |
!= x | Resolves to true if the rule type response is not equal to x. |
> x | Resolves to true if the rule type response is greater than x. |
< x | Resolves to true if the rule type response is less than x. |
>= x | Resolves to true if the rule type response is greater than or equal to x. |
<= x | Resolves to true if the rule type response is less than or equal to x. |
between x and y | Resolves to true if the rule type response falls in a specified range (exclusive). Only use this expression type for numeric and date types. |
not between x and y | Resolves to true if the rule type response does not fall in a specified range (inclusive). Only use this expression type for numeric and date types. |
in [a, b, c, ...] | Resolves to true if the rule type response is in the specified set. |
not in [a, b, c, ...] | Resolves to true if the rule type response is not in the specified set. |
matches /ab+c/i | Resolves to true if the rule type response matches a regular expression. |
not matches /ab+c/i | Resolves to true if the rule type response does not match a regular expression. |
now() | Works only with the ColumnValues rule type to create a date expression. |
matches/in [...]/not matches/not in [...] with threshold | Specifies the percentage of values that match the rule conditions. Works only with the ColumnValues, ColumnDataType, and CustomSQL rule types. |
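The following ruleset (illustrative only, with placeholder column names and values) shows some of these expressions in use.

Rules = [
    RowCount between 10 and 1000000,
    Mean "Star_Rating" > 3,
    ColumnValues "color" in ["red", "green", "blue"],
    ColumnValues "status" in ["PAID", "SHIPPED"] with threshold > 0.9
]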
Keywords for NULL, EMPTY and WHITESPACES_ONLY
If you want to validate whether a string column has a null value, an empty string, or a string containing only whitespace, you can use the following keywords:

- NULL / null – This keyword resolves to true for a null value in a string column.

  ColumnValues "colA" != NULL with threshold > 0.5 would return true if more than 50% of your data does not have null values.

  (ColumnValues "colA" = NULL) or (ColumnLength "colA" > 5) would return true for all rows which either have a null value or have a length greater than 5. Note that this requires the "compositeRuleEvaluation.method" = "ROW" option.

- EMPTY / empty – This keyword resolves to true for an empty string ("") value in a string column. Some data formats transform nulls in a string column to empty strings. This keyword helps filter out empty strings in your data.

  (ColumnValues "colA" = EMPTY) or (ColumnValues "colA" in ["a", "b"]) would return true if a row is either empty, "a", or "b". Note that this requires the "compositeRuleEvaluation.method" = "ROW" option.

- WHITESPACES_ONLY / whitespaces_only – This keyword resolves to true for a string containing only whitespace (" ") in a string column.

  ColumnValues "colA" not in ["a", "b", WHITESPACES_ONLY] would return true if a row is neither "a" nor "b" nor just whitespace.

For a numeric or date-based expression, if you want to validate whether a column has a null value, you can use the following keyword.

- NULL / null – This keyword resolves to true for a null value in a numeric or date column.

  ColumnValues "colA" in [NULL, "2023-01-01"] would return true if the dates in your column are either 2023-01-01 or null.

  (ColumnValues "colA" = NULL) or (ColumnValues "colA" between 1 and 9) would return true for all rows which either have a null value or have values between 1 and 9. Note that this requires the "compositeRuleEvaluation.method" = "ROW" option.
Filtering with Where Clause
Note
Where Clause is only supported in Amazon Glue 4.0.
You can filter your data when authoring rules. This is helpful when you want to apply conditional rules.
<DQDL Rule> where "<valid SparkSQL where clause>"

The filter must be specified with the where keyword, followed by a valid SparkSQL statement enclosed in quotes ("").

If you wish to add the where clause to a rule with a threshold, the where clause should be specified before the threshold condition.

<DQDL Rule> where "<valid SparkSQL statement>" with threshold <threshold condition>
With this syntax you can write rules like the following.
Completeness "colA" > 0.5 where "colB = 10" ColumnValues "colB" in ["A", "B"] where "colC is not null" with threshold > 0.9 ColumnLength "colC" > 10 where "colD != Concat(colE, colF)"
We will validate that the SparkSQL statement provided is valid. If it is invalid, the rule evaluation will fail and we will throw an IllegalArgumentException with the following format:
Rule <DQDL Rule> where "<invalid SparkSQL>" has provided an invalid where clause : <SparkSQL Error>
Where clause behaviour when Row level error record identification is turned on
With Amazon Glue Data Quality, you can identify the specific records that failed. When applying a where clause to rules that support row-level results, we will label the rows that are filtered out by the where clause as Passed.

If you prefer to label the filtered-out rows as SKIPPED instead, you can set the following additionalOptions for the ETL job.
object GlueApp {
  val datasource = glueContext.getCatalogSource(
    database="<db>",
    tableName="<table>",
    transformationContext="datasource"
  ).getDynamicFrame()

  val ruleset = """
    Rules = [
      IsComplete "att2" where "att1 = 'a'"
    ]
  """

  val dq_results = EvaluateDataQuality.processRows(
    frame=datasource,
    ruleset=ruleset,
    additionalOptions=JsonOptions(
      """
      {
        "rowLevelConfiguration.filteredRowLabel":"SKIPPED"
      }
      """
    )
  )
}
As an example, refer to the following rule and dataframe:
IsComplete "att2" where "att1 = 'a'"
id | att1 | att2 | Row-level Results (Default) | Row-level Results (Skipped Option) | Comments |
---|---|---|---|---|---|
1 | a | f | PASSED | PASSED | |
2 | b | d | PASSED | SKIPPED | Row is filtered out, since att1 is not "a" |
3 | a | null | FAILED | FAILED | |
4 | a | f | PASSED | PASSED | |
5 | b | null | PASSED | SKIPPED | Row is filtered out, since att1 is not "a" |
6 | a | f | PASSED | PASSED | |
Dynamic rules
Note
Dynamic rules are only supported in Amazon Glue ETL and are not supported in Amazon Glue Data Catalog.
You can author dynamic rules to compare current metrics produced by your rules with their historical values. These historical comparisons are enabled by using the last() operator in expressions. For example, the rule RowCount > last() will succeed when the number of rows in the current run is greater than the most recent prior row count for the same dataset. last() takes an optional natural number argument describing how many prior metrics to consider; last(k) where k >= 1 will reference the last k metrics.
- If no data points are available, last(k) will return the default value 0.0.
- If fewer than k metrics are available, last(k) will return all prior metrics.
To form valid expressions using last(k) where k > 1, an aggregation function is required to reduce multiple historical results to a single number. For example, RowCount > avg(last(5)) will check whether the current dataset's row count is strictly greater than the average of the last five row counts for the same dataset. RowCount > last(5) will produce an error because the current dataset row count can't be meaningfully compared to a list.
Supported aggregation functions:

- avg
- median
- max
- min
- sum
- std (standard deviation)
- abs (absolute value)
- index(last(k), i) allows for selecting the ith most recent value out of the last k. i is zero-indexed, so index(last(3), 0) will return the most recent data point, and index(last(3), 3) will result in an error because there are only three data points and we attempt to index the fourth most recent one.
Sample expressions
ColumnCorrelation
ColumnCorrelation "colA" "colB" < avg(last(10))
DistinctValuesCount
DistinctValuesCount "colA" between min(last(10))-1 and max(last(10))+1
Most rule types with numeric conditions or thresholds support dynamic rules; see the provided table, Analyzers and Rules, to determine whether dynamic rules are supported for your rule type.
Excluding statistics from dynamic rules
Sometimes, you will need to exclude data statistics from your dynamic rule calculations. Let's say you did a historical data load and you don't want it to affect your averages. To exclude statistics, open the job in Amazon Glue ETL, choose the Data Quality tab, and then choose Statistics. You will see a trend chart along with a table of statistics. Select the values you want to exclude and choose Exclude Statistics. The excluded statistics will no longer be included in the dynamic rule calculations.
Analyzers
Note
Analyzers are not supported in Amazon Glue Data Catalog.
DQDL rules use functions called analyzers to gather information about your data. This information is employed by a rule's Boolean expression to determine whether the rule should succeed or fail. For example, the RowCount rule RowCount > 5 will use a row count analyzer to discover the number of rows in your dataset, and compare that count with the expression > 5 to check whether more than five rows exist in the current dataset.

Sometimes, instead of authoring rules, we recommend creating analyzers and having them generate statistics that can be used to detect anomalies. Analyzers differ from rules in the following ways.
Characteristic | Analyzers | Rules |
---|---|---|
Part of ruleset | Yes | Yes |
Generates statistics | Yes | Yes |
Generates observations | Yes | Yes |
Can evaluate and assert a condition | No | Yes |
You can configure actions such as stop the jobs on failure, continue processing job | No | Yes |
Analyzers can independently exist without rules, so you can quickly configure them and progressively build data quality rules.
Some rule types can be input in the Analyzers block of your ruleset to run the analysis that those rules require and gather information without applying checks for any condition. Some analyzers aren't associated with rules and can only be input in the Analyzers block.

The following table indicates whether each item is supported as a rule or a standalone analyzer, along with additional details for each rule type.
Example Ruleset with Analyzer
The following ruleset uses:

- a dynamic rule to check whether a dataset is growing over its trailing average for the last three job runs
- a DistinctValuesCount analyzer to record the number of distinct values in the dataset's Name column
- a ColumnLength analyzer to track minimum and maximum Name size over time
Analyzer metric results can be viewed in the Data Quality tab for your job run.
Rules = [
    RowCount > avg(last(3))
]

Analyzers = [
    DistinctValuesCount "Name",
    ColumnLength "Name"
]
Amazon Glue Data Quality supports the following analyzers.
Analyzer name | Functionality |
---|---|
RowCount | Computes row counts for a dataset |
Completeness | Computes the completeness percentage of a column |
Uniqueness | Computes the uniqueness percentage of a column |
Mean | Computes the mean of a numeric column |
Sum | Computes the sum of a numeric column |
StandardDeviation | Computes the standard deviation of a numeric column |
Entropy | Computes the entropy of a numeric column |
DistinctValuesCount | Computes the number of distinct values in a column |
UniqueValueRatio | Computes the unique value ratio in a column |
ColumnCount | Computes the number of columns in a dataset |
ColumnLength | Computes the length of a column |
ColumnValues | Computes the minimum and maximum for numeric columns. Computes the minimum ColumnLength and maximum ColumnLength for non-numeric columns |
ColumnCorrelation | Computes column correlations for the given columns |
CustomSql | Computes statistics returned by the CustomSQL |
AllStatistics | Computes all of the preceding statistics |
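For example, a ruleset might register analyzers only, without any rules, to collect these statistics over time (column names are placeholders).

Analyzers = [
    Completeness "colA",
    Uniqueness "colB",
    Mean "colC",
    ColumnCorrelation "colA" "colC"
]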
Comments
You can use the '#' character to add a comment to your DQDL document. Anything after the '#' character and until the end of the line is ignored by DQDL.
Rules = [
    # More items should generally mean a higher price, so correlation should be positive
    ColumnCorrelation "price" "num_items" > 0
]