FileFreshness
Note
For all File-based rules, you must run the job in the same region as your bucket. If you’re attempting to parse an Amazon S3 folder, that folder must exist in Amazon S3.
FileFreshness ensures your data files are fresh based on the condition you provide. It uses your files' last modified time to ensure that data files or the entire folder is up-to-date.
This rule gathers two metrics:
- FileFreshness compliance based on the rule you set up
- Number of files that were modified for the day
{"Dataset.*.FileFreshness.Compliance":1,"Dataset.*.FileCount":1}
Anomaly detection does not consider these metrics.
Checking file freshness
The following rule ensures that tickets.parquet was created or modified in the past 24 hours.
FileFreshness "s3://bucket/artifacts/file/tickets/tickets.parquet" > (now() - 24 hours)
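As a rough illustration of the comparison this rule expresses, the sketch below checks a last modified time against a 24-hour window. It is not how Amazon Glue Data Quality is implemented; the function name is_fresh and the sample timestamps are hypothetical and exist only for this example.

```python
from datetime import datetime, timedelta, timezone


def is_fresh(last_modified: datetime, now: datetime, max_age: timedelta) -> bool:
    """Return True when last_modified is strictly later than (now - max_age)."""
    return last_modified > now - max_age


# Fixed reference time so the example is reproducible (hypothetical values).
now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
recent = datetime(2024, 1, 2, 3, 0, tzinfo=timezone.utc)  # 9 hours before now
stale = datetime(2024, 1, 1, 11, 0, tzinfo=timezone.utc)  # 25 hours before now

print(is_fresh(recent, now, timedelta(hours=24)))  # True
print(is_fresh(stale, now, timedelta(hours=24)))   # False
```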
Checking folder freshness
The following rules pass if all files in the folder were created or modified in the past 24 hours.
FileFreshness "s3://bucket/" >= (now() - 1 days)
FileFreshness "s3://bucket/artifacts/file/tickets/" >= (now() - 24 hours)
Checking folder or file freshness with threshold
The following rule passes if more than 10% of the files in the folder "tickets" have a last modified time earlier than 10 days ago.
FileFreshness "s3://bucket/artifacts/file/tickets/" < (now() - 10 days) with threshold > 0.1
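One way to read the threshold is as a comparison on the fraction of files in the folder that satisfy the time condition. The sketch below mirrors the operators in the rule above under that assumption; the helper compliance_ratio and the sample last modified times are hypothetical, and the handling of an empty folder is a choice made for this example only.

```python
from datetime import datetime, timedelta, timezone


def compliance_ratio(last_modified_times, condition):
    """Fraction of files whose last modified time satisfies the condition."""
    times = list(last_modified_times)
    if not times:
        return 0.0  # assumption: an empty folder counts as zero compliance
    return sum(1 for t in times if condition(t)) / len(times)


# Fixed reference time so the example is reproducible (hypothetical values).
now = datetime(2024, 1, 31, tzinfo=timezone.utc)
cutoff = now - timedelta(days=10)

# Hypothetical last modified times for four files in the folder.
files = [
    now - timedelta(days=1),
    now - timedelta(days=3),
    now - timedelta(days=12),
    now - timedelta(days=30),
]

# Same operators as the rule: "< (now() - 10 days) with threshold > 0.1".
ratio = compliance_ratio(files, lambda t: t < cutoff)
passes = ratio > 0.1

print(ratio, passes)  # 0.5 True
```

Without a threshold, the equivalent check in this sketch would require the ratio to be 1.0, meaning every file satisfies the condition.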
Checking files or folders with specific dates
You can check file freshness against specific dates.
FileFreshness "s3://bucket/artifacts/file/tickets/" > "2020-01-01"
FileFreshness "s3://bucket/artifacts/file/tickets/" between "2023-01-01" and "2024-01-01"
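The date comparisons can be sketched in the same spirit. This example assumes the comparison is made at date granularity and that the bounds of "between" are exclusive; neither detail is stated in this section, so treat both as assumptions. The function names modified_after and modified_between are hypothetical.

```python
from datetime import date, datetime, timezone


def modified_after(last_modified: datetime, start: str) -> bool:
    """True when the file's last modified date is later than start."""
    return last_modified.date() > date.fromisoformat(start)


def modified_between(last_modified: datetime, start: str, end: str) -> bool:
    """True when the last modified date lies strictly between start and end.

    Exclusive bounds are an assumption made for this sketch.
    """
    d = last_modified.date()
    return date.fromisoformat(start) < d < date.fromisoformat(end)


# Hypothetical last modified times.
in_range = datetime(2023, 6, 15, 9, 30, tzinfo=timezone.utc)
out_of_range = datetime(2024, 3, 1, 0, 0, tzinfo=timezone.utc)

print(modified_after(in_range, "2020-01-01"))                        # True
print(modified_between(in_range, "2023-01-01", "2024-01-01"))        # True
print(modified_between(out_of_range, "2023-01-01", "2024-01-01"))    # False
```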
Inferring file names directly from data frames
You don't always have to provide a file path. For instance, when you are authoring the rule in the Amazon Glue Data Catalog, it may be hard to find which folders the catalog tables are using. Amazon Glue Data Quality can find the specific folders or files used to populate your dataframe and can detect if they are fresh.
FileFreshness > (now() - 24 hours)
This rule will find the folder path or files that are used to populate the dynamic frame or data frame. This works for Amazon S3 paths or Amazon S3 based Amazon Glue Data Catalog tables. There are a few considerations:
- In Amazon Glue ETL, you must have the EvaluateDataQuality Transform immediately after an Amazon S3 or Amazon Glue Data Catalog transform.
- This rule will not work in Amazon Glue Interactive Sessions.
If you attempt to use the rule in either of these cases, or if Amazon Glue can’t find the files, it throws the following error:
“Unable to parse file path from DataFrame”