FileSize - Amazon Glue

FileSize

The FileSize rule type lets you verify that files meet a given size criterion. This is useful for the following use cases:

  1. Ensure that producers are not sending empty or substantially smaller files for processing.

  2. Ensure that your target buckets don’t contain small files, which can lead to performance issues.

FileSize gathers the following metrics:

  1. Compliance: the share of files that meet the rule threshold you have established, reported as a fraction (1.00 means 100%)

  2. File Count: number of files processed

  3. Minimum file size in bytes

  4. Maximum file size in bytes

Dataset.*.FileSize.Compliance: 1.00, Dataset.*.FileCount: 8.00, Dataset.*.MaximumFileSize: 327413121.00, Dataset.*.MinimumFileSize: 204558920.00

Anomaly detection is not supported for these metrics.
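
As an illustration of how the four metrics relate to each other, here is a minimal Python sketch that derives them from a list of file sizes. It is not the Amazon Glue implementation. The function name file_size_metrics, the meets_rule callback, and the sample sizes are hypothetical, and the sketch assumes Compliance is simply the number of compliant files divided by the file count.

```python
from typing import Callable, Dict, List


def file_size_metrics(
    sizes_in_bytes: List[int],
    meets_rule: Callable[[int], bool],
) -> Dict[str, float]:
    """Compute illustrative FileSize-style metrics (hypothetical helper)."""
    if not sizes_in_bytes:
        raise ValueError("at least one file size is required")
    # Count the files whose size satisfies the rule expression.
    compliant = sum(1 for size in sizes_in_bytes if meets_rule(size))
    return {
        # Assumption: compliance is the fraction of compliant files.
        "Compliance": compliant / len(sizes_in_bytes),
        "FileCount": float(len(sizes_in_bytes)),
        "MinimumFileSize": float(min(sizes_in_bytes)),
        "MaximumFileSize": float(max(sizes_in_bytes)),
    }


# Hypothetical file sizes in bytes, checked against "larger than 2,000,000 bytes".
sizes = [204_558_920, 250_000_000, 327_413_121]
metrics = file_size_metrics(sizes, lambda size: size > 2_000_000)
print(metrics)
```

Every sample file here exceeds the limit, so Compliance comes out as 1.0, matching the scale used in the sample metrics above.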

Validate size of files

This rule will pass when file.dat is greater than 2 MB.

FileSize "s3://bucket/file.dat" > 2 MB

The supported units are B (bytes), MB (megabytes), GB (gigabytes), and TB (terabytes).

Validate size of files in folders

FileSize "s3://bucket/" > 5 B

FileSize "s3://bucket/" < 2 GB

This rule will pass if more than 70% of the files in s3://bucket/ are between 2 GB and 1 TB in size.

FileSize "s3://bucket/" between 2 GB and 1 TB with threshold > 0.7
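
To make the threshold semantics concrete, the following Python sketch mimics how a "between ... with threshold" rule could be evaluated. It is a sketch under stated assumptions, not the service's actual logic. The names UNIT_BYTES, to_bytes, and passes_between are hypothetical. It assumes decimal units (1 MB = 10^6 bytes) and that both bounds of "between" are exclusive; the text above specifies neither.

```python
from typing import List

# Assumption: decimal multipliers. The source does not say whether
# units are decimal (10^6) or binary (2^20).
UNIT_BYTES = {"B": 1, "MB": 10**6, "GB": 10**9, "TB": 10**12}


def to_bytes(value: int, unit: str) -> int:
    """Convert a size such as (2, "GB") to bytes (hypothetical helper)."""
    return value * UNIT_BYTES[unit]


def passes_between(sizes: List[int], low: int, high: int, threshold: float) -> bool:
    """Return True if the fraction of files strictly inside (low, high)
    is greater than the threshold."""
    if not sizes:
        raise ValueError("at least one file size is required")
    # Assumption: both bounds are exclusive.
    in_range = sum(1 for size in sizes if low < size < high)
    return in_range / len(sizes) > threshold


low, high = to_bytes(2, "GB"), to_bytes(1, "TB")

# Three of four hypothetical files are in range: 0.75 > 0.7, so the rule passes.
four_files = [to_bytes(3, "GB"), to_bytes(10, "GB"), to_bytes(500, "GB"), to_bytes(1, "GB")]
result_pass = passes_between(four_files, low, high, 0.7)

# Two of three files are in range: about 0.67, which is not > 0.7, so the rule fails.
three_files = [to_bytes(3, "GB"), to_bytes(10, "GB"), to_bytes(1, "GB")]
result_fail = passes_between(three_files, low, high, 0.7)

print(result_pass, result_fail)
```

Because the comparison is "> 0.7", a folder in which exactly 70% of the files are in range would not pass in this sketch.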

Inferring file names directly from data frames

You don't always have to provide a file path. For instance, when you are authoring the rule in the Data Catalog, it may be hard to find which folders the catalog tables are using. Amazon Glue Data Quality can find the specific folders or files used to populate your data frame.

FileSize < 10 MB with threshold > 0.7

There are a few considerations:

  1. In Amazon Glue ETL, the Evaluate Data Quality transform must come immediately after the Amazon S3 or Data Catalog transform.

  2. This rule will not work in Amazon Glue Interactive Sessions.