Querying Amazon Glue Data Catalog
Because Amazon Glue Data Catalog is used by many Amazon Web Services as their central metadata repository, you might want to query Data Catalog metadata. To do so, you can use SQL queries in Athena. You can use Athena to query Amazon Glue catalog metadata like databases, tables, partitions, and columns.
To obtain Amazon Glue Catalog metadata, you query the information_schema
database
on the Athena backend. The example queries in this topic show how to use Athena to query Amazon Glue
Catalog metadata for common use cases.
Topics
Considerations and limitations
-
Instead of querying the
information_schema
database, it is possible to use individual Apache Hive DDL commands to extract metadata information for specific databases, tables, views, partitions, and columns from Athena. However, the output is in a non-tabular format. -
Querying
information_schema
is most performant if you have a small to moderate amount of Amazon Glue metadata. If you have a large amount of metadata, errors can occur. -
You cannot use
CREATE VIEW
to create a view on theinformation_schema
database.
Listing databases and searching a specified database
The examples in this section show how to list the databases in metadata by schema name.
Example – Listing databases
The following example query lists the databases from the
information_schema.schemata
table.
SELECT schema_name FROM information_schema.schemata LIMIT 10;
The following table shows sample results.
6 | alb-databas1 |
7 | alb_original_cust |
8 | alblogsdatabase |
9 | athena_db_test |
10 | athena_ddl_db |
Example – Searching a specified database
In the following example query, rdspostgresql
is a sample
database.
SELECT schema_name FROM information_schema.schemata WHERE schema_name = 'rdspostgresql'
The following table shows sample results.
schema_name | |
---|---|
1 | rdspostgresql |
Listing tables in a specified database and searching for a table by name
To list metadata for tables, you can query by table schema or by table name.
Example – Listing tables by schema
The following query lists tables that use the rdspostgresql
table
schema.
SELECT table_schema, table_name, table_type FROM information_schema.tables WHERE table_schema = 'rdspostgresql'
The following table shows a sample result.
table_schema | table_name | table_type | |
---|---|---|---|
1 | rdspostgresql | rdspostgresqldb1_public_account | BASE TABLE |
Example – Searching for a table by name
The following query obtains metadata information for the table
athena1
.
SELECT table_schema, table_name, table_type FROM information_schema.tables WHERE table_name = 'athena1'
The following table shows a sample result.
table_schema | table_name | table_type | |
---|---|---|---|
1 | default | athena1 | BASE TABLE |
Listing partitions for a specific table
You can use SHOW PARTITIONS
to list
the partitions for a specified table, as in the following example.table_name
SHOW PARTITIONS cloudtrail_logs_test2
You can also use a $partitions
metadata query to list the partition
numbers and partition values for a specific table.
Example – Querying the partitions for a table using the $partitions syntax
The following example query lists the partitions for the table
cloudtrail_logs_test2
using the $partitions
syntax.
SELECT * FROM default."cloudtrail_logs_test2$partitions" ORDER BY partition_number
The following table shows sample results.
table_catalog | table_schema | table_name | Year | Month | Day | |
---|---|---|---|---|---|---|
1 | awsdatacatalog | default | cloudtrail_logs_test2 | 2020 | 08 | 10 |
2 | awsdatacatalog | default | cloudtrail_logs_test2 | 2020 | 08 | 11 |
3 | awsdatacatalog | default | cloudtrail_logs_test2 | 2020 | 08 | 12 |
Listing all columns for all tables
You can list all columns for all tables in AwsDataCatalog
or for all
tables in a specific database in AwsDataCatalog
.
-
To list all columns for all databases in
AwsDataCatalog
, use the querySELECT * FROM information_schema.columns
. -
To restrict the results to a specific database, use
table_schema='
in thedatabase_name
'WHERE
clause.
Example – Listing all columns for all tables in a specific database
The following example query lists all columns for all tables in the database
webdata
.
SELECT * FROM information_schema.columns WHERE table_schema = 'webdata'
Listing or searching columns for a specified table or view
You can list all columns for a table, all columns for a view, or search for a column by name in a specified database and table.
To list the columns, use a SELECT *
query. In the FROM
clause, specify information_schema.columns
. In the WHERE
clause, use table_schema='
to
specify the database and database_name
'table_name =
'
to specify the table or view that has
the columns that you want to list.table_name
'
Example – Listing all columns for a specified table
The following example query lists all columns for the table
rdspostgresqldb1_public_account
.
SELECT * FROM information_schema.columns WHERE table_schema = 'rdspostgresql' AND table_name = 'rdspostgresqldb1_public_account'
The following table shows sample results.
table_catalog | table_schema | table_name | column_name | ordinal_position | column_default | is_nullable | data_type | comment | extra_info | |
---|---|---|---|---|---|---|---|---|---|---|
1 | awsdatacatalog | rdspostgresql | rdspostgresqldb1_public_account | password | 1 | YES | varchar | |||
2 | awsdatacatalog | rdspostgresql | rdspostgresqldb1_public_account | user_id | 2 | YES | integer | |||
3 | awsdatacatalog | rdspostgresql | rdspostgresqldb1_public_account | created_on | 3 | YES | timestamp | |||
4 | awsdatacatalog | rdspostgresql | rdspostgresqldb1_public_account | last_login | 4 | YES | timestamp | |||
5 | awsdatacatalog | rdspostgresql | rdspostgresqldb1_public_account | 5 | YES | varchar | ||||
6 | awsdatacatalog | rdspostgresql | rdspostgresqldb1_public_account | username | 6 | YES | varchar |
Example – Listing the columns for a specified view
The following example query lists all the columns in the default
database for the view arrayview
.
SELECT * FROM information_schema.columns WHERE table_schema = 'default' AND table_name = 'arrayview'
The following table shows sample results.
table_catalog | table_schema | table_name | column_name | ordinal_position | column_default | is_nullable | data_type | comment | extra_info | |
---|---|---|---|---|---|---|---|---|---|---|
1 | awsdatacatalog | default | arrayview | searchdate | 1 | YES | varchar | |||
2 | awsdatacatalog | default | arrayview | sid | 2 | YES | varchar | |||
3 | awsdatacatalog | default | arrayview | btid | 3 | YES | varchar | |||
4 | awsdatacatalog | default | arrayview | p | 4 | YES | varchar | |||
5 | awsdatacatalog | default | arrayview | infantprice | 5 | YES | varchar | |||
6 | awsdatacatalog | default | arrayview | sump | 6 | YES | varchar | |||
7 | awsdatacatalog | default | arrayview | journeymaparray | 7 | YES | array(varchar) |
Example – Searching for a column by name in a specified database and table
The following example query searches for metadata for the sid
column
in the arrayview
view of the default
database.
SELECT * FROM information_schema.columns WHERE table_schema = 'default' AND table_name = 'arrayview' AND column_name='sid'
The following table shows a sample result.
table_catalog | table_schema | table_name | column_name | ordinal_position | column_default | is_nullable | data_type | comment | extra_info | |
---|---|---|---|---|---|---|---|---|---|---|
1 | awsdatacatalog | default | arrayview | sid | 2 | YES | varchar |