EvaluateDataQuality class
Evaluates a data quality ruleset against a DynamicFrame and returns a new DynamicFrame with the results of the evaluation.
Example
The following example code demonstrates how to evaluate data quality for a DynamicFrame and then view the data quality results.
from awsglue.transforms import *
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsgluedq.transforms import EvaluateDataQuality

# Create Glue context
sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)

# Define DynamicFrame
legislatorsAreas = glueContext.create_dynamic_frame.from_catalog(
    database="legislators", table_name="areas_json")

# Create data quality ruleset
ruleset = """Rules = [ColumnExists "id", IsComplete "id"]"""

# Evaluate data quality
dqResults = EvaluateDataQuality.apply(
    frame=legislatorsAreas,
    ruleset=ruleset,
    publishing_options={
        "dataQualityEvaluationContext": "legislatorsAreas",
        "enableDataQualityCloudWatchMetrics": True,
        "enableDataQualityResultsPublishing": True,
        "resultsS3Prefix": "amzn-s3-demo-bucket1",
    },
)

# Inspect data quality results
dqResults.printSchema()
dqResults.toDF().show()
root
 |-- Rule: string
 |-- Outcome: string
 |-- FailureReason: string
 |-- EvaluatedMetrics: map
 |    |-- keyType: string
 |    |-- valueType: double

+-----------------------+-------+-------------+---------------------------------------+
|Rule                   |Outcome|FailureReason|EvaluatedMetrics                       |
+-----------------------+-------+-------------+---------------------------------------+
|ColumnExists "id"      |Passed |null         |{}                                     |
|IsComplete "id"        |Passed |null         |{Column.first_name.Completeness -> 1.0}|
+-----------------------+-------+-------------+---------------------------------------+
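Because the returned results are an ordinary DynamicFrame, you can convert them to a Spark DataFrame and react to failures programmatically, for example by failing the job when any rule does not pass. Below is a minimal sketch of that pattern; it uses plain dictionaries to stand in for rows collected from the driver, and the failure data shown is hypothetical:

```python
# In a Glue job you would collect the evaluation rows with:
#   rows = dqResults.toDF().collect()
# Here we use hypothetical dictionaries with the same keys.
rows = [
    {"Rule": 'ColumnExists "id"', "Outcome": "Passed", "FailureReason": None},
    {"Rule": 'IsComplete "id"', "Outcome": "Failed",
     "FailureReason": "hypothetical: null values found"},
]

# Collect the names of every rule that did not pass
failed = [r["Rule"] for r in rows if r["Outcome"] == "Failed"]

if failed:
    # In a real job you might raise an exception here to fail the run
    print("Failed rules:", failed)
```

Collecting results to the driver is reasonable here because a ruleset produces one row per rule, which is a small amount of data.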
Methods
__call__(frame, ruleset, publishing_options = {})

- frame – The DynamicFrame that you want to evaluate the data quality of.

- ruleset – A Data Quality Definition Language (DQDL) ruleset in string format. To learn more about DQDL, see the Data Quality Definition Language (DQDL) reference guide.

- publishing_options – A dictionary that specifies the following options for publishing evaluation results and metrics:

  - dataQualityEvaluationContext – A string that specifies the namespace under which Amazon Glue should publish Amazon CloudWatch metrics and the data quality results. The aggregated metrics appear in CloudWatch, while the full results appear in the Amazon Glue Studio interface.

    Required: No

    Default value: default_context

  - enableDataQualityCloudWatchMetrics – Specifies whether the results of the data quality evaluation should be published to CloudWatch. You specify a namespace for the metrics using the dataQualityEvaluationContext option.

    Required: No

    Default value: False

  - enableDataQualityResultsPublishing – Specifies whether the data quality results should be visible on the Data Quality tab in the Amazon Glue Studio interface.

    Required: No

    Default value: True

  - resultsS3Prefix – Specifies the Amazon S3 location where Amazon Glue can write the data quality evaluation results.

    Required: No

    Default value: "" (empty string)
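Because a DQDL ruleset is passed as a plain string of the form Rules = [rule1, rule2, ...], it can be assembled programmatically when the set of rules varies by table. The helper below is a sketch under that assumption; build_ruleset is a hypothetical name, not part of the Glue API:

```python
def build_ruleset(rules):
    """Assemble a DQDL ruleset string from a list of individual rule strings."""
    return "Rules = [" + ", ".join(rules) + "]"

# ColumnExists, IsComplete, and RowCount are DQDL rule types
ruleset = build_ruleset(['ColumnExists "id"', 'IsComplete "id"', 'RowCount > 0'])
print(ruleset)
# Rules = [ColumnExists "id", IsComplete "id", RowCount > 0]
```

The resulting string can be passed directly as the ruleset argument of EvaluateDataQuality.apply.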
apply(cls, *args, **kwargs)
Inherited from GlueTransform apply.

name(cls)
Inherited from GlueTransform name.

describeArgs(cls)
Inherited from GlueTransform describeArgs.

describeReturn(cls)
Inherited from GlueTransform describeReturn.

describeTransform(cls)
Inherited from GlueTransform describeTransform.

describeErrors(cls)
Inherited from GlueTransform describeErrors.

describe(cls)
Inherited from GlueTransform describe.