FileUniqueness - Amazon Glue
Services or capabilities described in Amazon Web Services documentation might vary by Region. To see the differences applicable to the China Regions, see Getting Started with Amazon Web Services in China (PDF).

FileUniqueness

File Uniqueness allows you to ensure that there are no duplicate files in the data you have received from your data producers.

It gathers the following data statistics

  1. Total number of files in the folder

  2. The Uniqueness Ratio of the files

Dataset.*.FileUniquenessRatio: 1.00, Dataset.*.FileCount: 8.00

Find duplicate files in a folder:

FileUniqueness "s3://bucket/" > 0.5 FileUniqueness "s3://bucket/folder/" = 1

Inferring folder names directly from data frames to detect duplicates:

You don't always have to provide a file path. For instance, when you are authoring the rule in the Amazon Glue Data Catalog, it may be hard to find which folders the catalog tables are using. Amazon Glue Data Quality can find the specific folders or files used to populate your data frame.

FileUniqueness > 0.5 FileUniqueness with threshold = 1

There are a few considerations:

  1. In Amazon Glue ETL, you must have the EvaluateDataQuality Transform immediately after an Amazon S3 or Amazon Glue Data Catalog transform.

  2. This rule will not work in Amazon Glue Interactive Sessions.