Example metadata table queries
The following examples show how you can get different types of information from your S3 Metadata tables by using standard SQL queries.
Remember when using these examples:
- The examples are written to work with Amazon Athena. You might have to modify the examples to work with a different query engine.
- Make sure that you understand how to optimize your queries.
- Replace b_general-purpose-bucket-name with the name of your namespace.
- For a full list of supported columns, see the S3 Metadata journal tables schema and S3 Metadata live inventory tables schema.
Journal table example queries
You can use the following example queries to query your journal tables.
Finding objects by file extension
The following query returns objects with a specific file extension (.jpg in this case):

SELECT key FROM "s3tablescatalog/aws-s3"."b_general-purpose-bucket-name"."journal" WHERE key LIKE '%.jpg' AND record_type = 'CREATE';
Listing object deletions
The following query returns object deletion events, including the Amazon Web Services account ID or Amazon service principal that made the request:
SELECT DISTINCT bucket, key, sequence_number, record_type, record_timestamp, requester, source_ip_address, version_id FROM "s3tablescatalog/aws-s3"."b_general-purpose-bucket-name"."journal" WHERE record_type = 'DELETE';
Listing Amazon KMS encryption keys used by your objects
The following query returns the ARNs of the Amazon Key Management Service (Amazon KMS) keys encrypting your objects:
SELECT DISTINCT kms_key_arn FROM "s3tablescatalog/aws-s3"."b_general-purpose-bucket-name"."journal";
Listing objects that don't use KMS keys
The following query returns objects that aren't encrypted with Amazon KMS keys:
SELECT DISTINCT key, encryption_status FROM "s3tablescatalog/aws-s3"."b_general-purpose-bucket-name"."journal" WHERE encryption_status NOT IN ('SSE-KMS', 'DSSE-KMS') AND record_type = 'CREATE';
Listing Amazon KMS encryption keys used for PUT operations in the last 7 days
The following query returns the ARNs of the Amazon Key Management Service (Amazon KMS) keys that were used to encrypt objects uploaded in the last 7 days:

SELECT DISTINCT kms_key_arn FROM "s3tablescatalog/aws-s3"."b_general-purpose-bucket-name"."journal" WHERE record_timestamp > (current_date - interval '7' day) AND kms_key_arn IS NOT NULL;
Listing objects deleted in the last 24 hours by S3 Lifecycle
The following query lists the objects that were expired in the last day by S3 Lifecycle:

SELECT bucket, key, version_id, last_modified_date, record_timestamp, requester FROM "s3tablescatalog/aws-s3"."b_general-purpose-bucket-name"."journal" WHERE requester = 's3.amazonaws.com' AND record_type = 'DELETE' AND record_timestamp > (current_date - interval '1' day);
Viewing metadata provided by Amazon Bedrock
Some Amazon services (such as Amazon Bedrock) upload objects to Amazon S3. You can query the object metadata provided by these services. For example, the following query uses the user_metadata column to find objects that were uploaded by Amazon Bedrock to a general purpose bucket:

SELECT DISTINCT bucket, key, sequence_number, record_type, record_timestamp, user_metadata FROM "s3tablescatalog/aws-s3"."b_general-purpose-bucket-name"."journal" WHERE record_type = 'CREATE' AND user_metadata['content-source'] = 'AmazonBedrock';
If Amazon Bedrock uploaded an object to your bucket, the user_metadata column displays the following metadata associated with the object in the query result:
user_metadata {content-additional-params -> requestid="CVK8FWYRW0M9JW65", signedContentSHA384="38b060a751ac96384cd9327eb1b1e36a21fdb71114be07434c0cc7bf63f6e1da274edebfe76f65fbd51ad2f14898b95b", content-model-id -> bedrock-model-arn, content-source -> AmazonBedrock}
Understanding the current state of your objects
The following query can help you determine the current state of your objects. The query identifies the most recent version of each object, filters out deleted objects, and marks the latest version of each object based on sequence numbers. Results are ordered by the bucket, key, and sequence_number columns.

WITH records_of_interest as (
    -- Start with a query that can narrow down the records of interest.
    SELECT * from "s3tablescatalog/aws-s3"."b_general-purpose-bucket-name"."journal"
),
version_stacks as (
    SELECT *,
        -- Introduce a column called 'next_sequence_number', which is the next larger
        -- sequence_number for the same key version_id in sorted order.
        LEAD(sequence_number, 1) over (
            partition by (bucket, key, coalesce(version_id, ''))
            order by sequence_number ASC
        ) as next_sequence_number
    from records_of_interest
),
-- Pick the 'tip' of each version stack triple: (bucket, key, version_id).
-- The tip of the version stack is the row of that triple with the largest sequencer.
-- Selecting only the tip filters out any row duplicates.
-- This isn't typical, but some events can be delivered more than once to the table
-- and include rows that might no longer exist in the bucket (since the
-- table contains rows for both extant and extinct objects).
-- In the next subquery, eliminate the rows that contain deleted objects.
current_versions as (
    SELECT * from version_stacks where next_sequence_number is NULL
),
-- Eliminate the rows that are extinct from the bucket by filtering with
-- record_type. An object version has been deleted from the bucket if its tip is
-- record_type==DELETE.
existing_current_versions as (
    SELECT * from current_versions
    where not (record_type = 'DELETE' and is_delete_marker = FALSE)
),
-- Optionally, to determine which of several object versions is the 'latest',
-- you can compare their sequence numbers. A version_id is the latest if its
-- tip's sequencer is the largest among all other tips in the same key.
with_is_latest as (
    SELECT *,
        -- Determine if the sequence_number of this row is the same as the
        -- largest sequencer for the key that still exists.
        sequence_number = (MAX(sequence_number) over (partition by (bucket, key))) as is_latest_version
    FROM existing_current_versions
)
SELECT * from with_is_latest
ORDER BY bucket, key, sequence_number;
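To see how the version-stack logic in this query behaves, you can run the same window functions against a toy journal table in SQLite. This is an illustrative sketch only: the table schema below is a simplified stand-in for the real S3 Metadata journal schema, and SQLite is used in place of Athena (both support LEAD and MAX as window functions).

```python
import sqlite3

# Toy journal rows: (bucket, key, version_id, sequence_number, record_type, is_delete_marker)
rows = [
    ("b1", "a.txt", "v1", 1, "CREATE", 0),
    ("b1", "a.txt", "v1", 1, "CREATE", 0),  # duplicate delivery of the same event
    ("b1", "a.txt", "v2", 2, "CREATE", 0),
    ("b1", "b.txt", "v1", 3, "CREATE", 0),
    ("b1", "b.txt", "v1", 4, "DELETE", 0),  # permanent delete: b.txt drops out
]

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE journal (
    bucket TEXT, key TEXT, version_id TEXT,
    sequence_number INTEGER, record_type TEXT, is_delete_marker INTEGER)""")
conn.executemany("INSERT INTO journal VALUES (?, ?, ?, ?, ?, ?)", rows)

query = """
WITH version_stacks AS (
    -- next_sequence_number is NULL only at the tip of each
    -- (bucket, key, version_id) stack, which also deduplicates rows.
    SELECT *,
        LEAD(sequence_number, 1) OVER (
            PARTITION BY bucket, key, COALESCE(version_id, '')
            ORDER BY sequence_number ASC) AS next_sequence_number
    FROM journal
),
current_versions AS (
    SELECT * FROM version_stacks WHERE next_sequence_number IS NULL
),
existing_current_versions AS (
    -- Drop versions whose tip is a permanent delete.
    SELECT * FROM current_versions
    WHERE NOT (record_type = 'DELETE' AND is_delete_marker = 0)
)
SELECT key, version_id,
    sequence_number = MAX(sequence_number) OVER (PARTITION BY bucket, key)
        AS is_latest_version
FROM existing_current_versions
ORDER BY key, sequence_number
"""
results = list(conn.execute(query))
print(results)
```

The duplicate CREATE event collapses to one row, the permanently deleted b.txt disappears, and only the v2 version of a.txt is flagged as latest.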
Inventory table example queries
You can use the following example queries to query your inventory tables.
Discovering datasets that use specific tags
The following query returns the datasets that use the specified tags:

SELECT * FROM "s3tablescatalog/aws-s3"."b_general-purpose-bucket-name"."inventory" WHERE object_tags['key1'] = 'value1' AND object_tags['key2'] = 'value2';
Listing objects not encrypted with SSE-KMS
The following query returns objects that aren't encrypted with SSE-KMS:
SELECT key, encryption_status FROM "s3tablescatalog/aws-s3"."b_general-purpose-bucket-name"."inventory" WHERE encryption_status != 'SSE-KMS';
Listing objects generated by Amazon Bedrock
The following query lists objects that were generated by Amazon Bedrock:
SELECT DISTINCT bucket, key, sequence_number, user_metadata FROM "s3tablescatalog/aws-s3"."b_general-purpose-bucket-name"."inventory" WHERE user_metadata['content-source'] = 'AmazonBedrock';
Generating the latest inventory table
The following query generates the most recent inventory. It requires your metadata configuration, including both the journal and inventory tables, to be in the Active status. If your inventory table isn't up to date because of permission or other issues, this query might not work.
We recommend using this query for general purpose buckets with fewer than a billion objects.
This query reconciles the contents of the inventory table with the recent events in the journal table. When your journal table becomes up to date with all the changes that occurred in your bucket, the query result matches the contents of the bucket.
This example restricts the output to only those keys ending with '.txt'. To query a different subset, you can adjust the common table expression named "working_set_of_interest".
WITH inventory_time_cte AS (
    -- Reveal the extent of the journal table that has not yet been
    -- reflected in the inventory table.
    SELECT COALESCE(inventory_time_from_property, inventory_time_default) AS inventory_time
    FROM (
        SELECT *
        FROM
            -- The fallback default includes the entirety of the journal table.
            (VALUES (TIMESTAMP '2024-12-01 00:00')) AS T (inventory_time_default)
        LEFT OUTER JOIN
            -- This side queries the Iceberg table property and loads it up in
            -- a column. If the property doesn't exist, then you get 0 rows.
            (
                SELECT from_unixtime(CAST(value AS BIGINT)) AS inventory_time_from_property
                FROM "journal$properties"
                WHERE key = 'aws.s3metadata.oldest-uncoalesced-record-timestamp'
                LIMIT 1
            )
            -- Force an unequivocal join.
            ON TRUE
    )
),
-- Select only those journal table events not yet reflected in the inventory table.
my_new_events AS (
    SELECT journal.*
    FROM (
        journal JOIN inventory_time_cte
        -- Include only those rows that have yet to be merged with the inventory table.
        -- Allow some overlap to account for clock skew.
        ON record_timestamp > (inventory_time - interval '1' hour)
    )
),
-- Bring the "journal" and "inventory" table rows to a common inventory schema.
working_set AS (
    (
        SELECT
            -- Keep the inventory table columns, but drop these journal table columns:
            -- "record_type", "requester", "source_ip_address", "request_id"
            bucket, key, sequence_number, version_id, is_delete_marker, size,
            COALESCE(last_modified_date, record_timestamp) AS last_modified_date,
            e_tag, storage_class, is_multipart, encryption_status,
            is_bucket_key_enabled, kms_key_arn, checksum_algorithm,
            object_tags, user_metadata,
            -- Temporary columns required to align the two tables.
            record_timestamp AS _log_ts,
            (record_type = 'DELETE' AND NOT COALESCE(is_delete_marker, FALSE)) AS _is_perm_delete
        FROM my_new_events
    )
    UNION
    (
        SELECT *,
            last_modified_date as _log_ts,
            FALSE AS _is_perm_delete
        FROM "inventory"
    )
),
-- You can apply a filter over key, tags, or metadata here to restrict your
-- view to a subset of all keys.
working_set_of_interest AS (
    SELECT * FROM working_set WHERE key LIKE '%.txt'
),
most_recent_changes AS (
    -- For each (bucket, key, version_id) stack, find the event that should have
    -- been the ultimate to arrive in the journal table, and confine the results to the
    -- 1-hour window of events (for that key) that preceded that arrival.
    --
    -- This gives preferential treatment to events that arrived later in the journal table
    -- order, and handles cases with uploads that were completed much later after they were
    -- initiated.
    SELECT *
    FROM (
        SELECT *,
            -- Do not confuse this MAX() with the aggregate function. This is the MAX window function.
            MAX(_log_ts) OVER (PARTITION BY bucket, key, version_id) AS _supremum_ts
        FROM working_set_of_interest
    )
    WHERE _log_ts >= (_supremum_ts - interval '1' hour)
),
-- Among each "1-hour window of most recent mutations" for a given key, identify
-- the one that is reflected in the general purpose bucket.
updated_inventory AS (
    SELECT *
    FROM (
        SELECT *,
            MAX(sequence_number) OVER (PARTITION BY bucket, key, version_id) AS _supremum_sn
        FROM most_recent_changes
    )
    WHERE sequence_number = _supremum_sn
    -- Again here, use QUALIFY clause if your planner supports it.
)
-- Finally, project the resulting rows onto the inventory table schema.
SELECT
    bucket, key, sequence_number, version_id, is_delete_marker, size,
    last_modified_date, e_tag, storage_class, is_multipart, encryption_status,
    is_bucket_key_enabled, kms_key_arn, checksum_algorithm, object_tags, user_metadata
FROM updated_inventory
WHERE NOT _is_perm_delete
ORDER BY bucket, key ASC, sequence_number ASC
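The core reconciliation step of this query, unioning fresh journal events with the inventory snapshot and keeping the row with the highest sequence number per key, can be demonstrated on toy data in SQLite. This is a simplified sketch: the schemas below are illustrative stand-ins (only a few columns, no version_id or timestamp windows), not the real S3 Metadata table schemas.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE inventory (bucket TEXT, key TEXT, sequence_number INTEGER,
                        size INTEGER, _is_perm_delete INTEGER);
CREATE TABLE journal (bucket TEXT, key TEXT, sequence_number INTEGER,
                      size INTEGER, record_type TEXT, is_delete_marker INTEGER);
""")
# Baseline inventory snapshot.
conn.executemany("INSERT INTO inventory VALUES (?, ?, ?, ?, 0)", [
    ("b1", "a.txt", 1, 100),
    ("b1", "b.txt", 2, 200),
])
# Newer journal events not yet coalesced into the snapshot:
# a.txt was overwritten, b.txt was permanently deleted.
conn.executemany("INSERT INTO journal VALUES (?, ?, ?, ?, ?, ?)", [
    ("b1", "a.txt", 3, 150, "CREATE", 0),
    ("b1", "b.txt", 4, 200, "DELETE", 0),
])

query = """
WITH working_set AS (
    -- Bring both tables to a common schema, flagging permanent deletes.
    SELECT bucket, key, sequence_number, size,
           (record_type = 'DELETE' AND is_delete_marker = 0) AS _is_perm_delete
    FROM journal
    UNION
    SELECT bucket, key, sequence_number, size, _is_perm_delete FROM inventory
),
updated_inventory AS (
    -- Keep only the row with the largest sequence number for each key.
    SELECT * FROM (
        SELECT *,
            MAX(sequence_number) OVER (PARTITION BY bucket, key) AS _supremum_sn
        FROM working_set
    )
    WHERE sequence_number = _supremum_sn
)
SELECT bucket, key, sequence_number, size FROM updated_inventory
WHERE NOT _is_perm_delete
ORDER BY bucket, key
"""
results = list(conn.execute(query))
print(results)
```

The overwritten a.txt survives with its new size, while b.txt is dropped because its most recent event is a permanent delete, which mirrors how the full query reconciles the snapshot with recent journal activity.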