Creating tables - Amazon Glue

Creating tables

Even though running a crawler is the recommended method to take inventory of the data in your data stores, you can add metadata tables to the Amazon Glue Data Catalog manually. This approach gives you more control over the metadata definitions and lets you customize them according to your specific requirements.

You can add tables to the Data Catalog manually by using the Amazon Glue console or an API.

When you define a table manually using the console or an API, you specify the table schema and the value of a classification field that indicates the type and format of the data in the data source. If a crawler creates the table, the data format and schema are determined by either a built-in classifier or a custom classifier. For more information about creating a table using the Amazon Glue console, see Working with tables on the Amazon Glue console.
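
As an illustration of defining a table through an API, the following Python sketch builds the kind of request body that the Glue CreateTable operation accepts: a schema plus a classification value. It is a minimal sketch, not a definitive implementation. The helper name build_csv_table_input, the table name, the column names, and the database name are hypothetical, and the CSV input format, output format, and SerDe values are assumptions that suit plain comma-separated text.

```python
def build_csv_table_input(name, columns, location, partition_keys=()):
    """Build a TableInput dict for a CSV table stored in Amazon S3.

    columns and partition_keys are sequences of (name, type) pairs.
    """
    def to_columns(pairs):
        return [{"Name": col_name, "Type": col_type} for col_name, col_type in pairs]

    return {
        "Name": name,
        "TableType": "EXTERNAL_TABLE",
        # The classification field indicates the type and format of the data.
        "Parameters": {"classification": "csv"},
        "PartitionKeys": to_columns(partition_keys),
        "StorageDescriptor": {
            "Columns": to_columns(columns),
            "Location": location,
            # Assumed Hive text formats and SerDe for comma-separated files.
            "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe",
                "Parameters": {"field.delim": ","},
            },
        },
    }


# Hypothetical table and column names, used only for illustration.
table_input = build_csv_table_input(
    name="sales",
    columns=[("order_id", "string"), ("amount", "double")],
    location="s3://my-app-bucket/Sales/",
    partition_keys=[("year", "string"), ("month", "string"), ("day", "string")],
)

# With boto3 installed and credentials configured, the call would look like:
#   import boto3
#   boto3.client("glue").create_table(
#       DatabaseName="my_database", TableInput=table_input)
```

Keeping the request construction separate from the API call makes the definition easy to review or store in version control before anything is written to the Data Catalog.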

Table partitions

An Amazon Glue table definition of an Amazon Simple Storage Service (Amazon S3) folder can describe a partitioned table. For example, to improve query performance, a partitioned table might separate monthly data into different files using the name of the month as a key. In Amazon Glue, table definitions include the partitioning key of a table. When Amazon Glue evaluates the data in Amazon S3 folders to catalog a table, it determines whether an individual table or a partitioned table is added.

You can create partition indexes on a table to fetch a subset of the partitions instead of loading all the partitions in the table. For information about working with partition indexes, see Working with partition indexes in Amazon Glue.
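
To make the idea of fetching a subset of partitions concrete, the following Python sketch builds two request fragments: one for the Glue CreatePartitionIndex operation and one filter expression for GetPartitions. It is a sketch under stated assumptions. The helper names, the index name, and the database and table names are hypothetical, and the filter builder does no escaping of quote characters in values.

```python
def build_partition_index_request(database, table, index_name, keys):
    """Build keyword arguments for the CreatePartitionIndex operation."""
    if not keys:
        raise ValueError("a partition index needs at least one key")
    return {
        "DatabaseName": database,
        "TableName": table,
        "PartitionIndex": {"IndexName": index_name, "Keys": list(keys)},
    }


def build_partition_filter(**key_values):
    """Build a SQL-like filter such as "year='2017' AND month='feb'".

    Values are quoted as strings; quote characters inside values are
    not escaped in this sketch.
    """
    return " AND ".join(f"{key}='{value}'" for key, value in key_values.items())


# Hypothetical database, table, and index names.
index_request = build_partition_index_request(
    "my_database", "sales", "year_month_idx", ["year", "month"])
expression = build_partition_filter(year="2017", month="feb")

# With boto3 installed and credentials configured:
#   import boto3
#   glue = boto3.client("glue")
#   glue.create_partition_index(**index_request)
#   glue.get_partitions(DatabaseName="my_database", TableName="sales",
#                       Expression=expression)
```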

All the following conditions must be true for Amazon Glue to create a partitioned table for an Amazon S3 folder:

  • The schemas of the files are similar, as determined by Amazon Glue.

  • The data format of the files is the same.

  • The compression format of the files is the same.

For example, you might own an Amazon S3 bucket named my-app-bucket, where you store both iOS and Android app sales data. The data is partitioned by year, month, and day. The data files for iOS and Android sales have the same schema, data format, and compression format. In the Amazon Glue Data Catalog, the Amazon Glue crawler creates one table definition with partitioning keys for year, month, and day.

The following Amazon S3 listing of my-app-bucket shows some of the partitions. The = symbol is used to assign partition key values.

    my-app-bucket/Sales/year=2010/month=feb/day=1/iOS.csv
    my-app-bucket/Sales/year=2010/month=feb/day=1/Android.csv
    my-app-bucket/Sales/year=2010/month=feb/day=2/iOS.csv
    my-app-bucket/Sales/year=2010/month=feb/day=2/Android.csv
    ...
    my-app-bucket/Sales/year=2017/month=feb/day=4/iOS.csv
    my-app-bucket/Sales/year=2017/month=feb/day=4/Android.csv
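
The key=value folder convention in the listing above can be read mechanically. The following Python sketch extracts the partition key values from one object path. The function name parse_partitions is hypothetical, and the sketch only illustrates the naming convention; it is not how Amazon Glue itself detects partitions.

```python
def parse_partitions(key):
    """Return (partition values, file name) for a key=value style path."""
    *folders, file_name = key.split("/")
    partitions = {}
    for folder in folders:
        # Only folders of the form name=value assign a partition key value.
        if "=" in folder:
            name, value = folder.split("=", 1)
            partitions[name] = value
    return partitions, file_name


partitions, file_name = parse_partitions(
    "my-app-bucket/Sales/year=2010/month=feb/day=1/iOS.csv")
print(partitions)
print(file_name)
```

Folders without an = symbol, such as Sales, are skipped, so only year, month, and day are reported as partition keys for this path.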

Table resource links

Note: The Amazon Glue console was recently updated. The current version of the console does not support table resource links.

The Data Catalog can also contain resource links to tables. A table resource link is a link to a local or shared table. Currently, you can create resource links only in Amazon Lake Formation. After you create a resource link to a table, you can use the resource link name wherever you would use the table name. Along with tables that you own or that are shared with you, table resource links are returned by glue:GetTables() and appear as entries on the Tables page of the Amazon Glue console.
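
Because resource links come back from GetTables alongside ordinary tables, client code may want to tell them apart. The following Python sketch does so on a response-shaped list of dictionaries. It assumes that a resource link entry carries a TargetTable field that identifies the linked table, which is how the Glue API's Table structure is understood here. The function names and the sample entries are hypothetical.

```python
def split_resource_links(table_list):
    """Separate resource links from regular tables in a GetTables result."""
    links, tables = [], []
    for table in table_list:
        # Assumption: only resource links carry a TargetTable entry.
        (links if "TargetTable" in table else tables).append(table)
    return links, tables


def resolve_target(table):
    """Return the (database, table) pair that a catalog entry refers to."""
    target = table.get("TargetTable")
    if target is None:
        return table.get("DatabaseName"), table["Name"]
    return target["DatabaseName"], target["Name"]


# Hypothetical entries shaped like the TableList of a GetTables response.
sample_table_list = [
    {"Name": "sales", "DatabaseName": "my_database"},
    {
        "Name": "sales_link",
        "DatabaseName": "my_database",
        "TargetTable": {
            "CatalogId": "111122223333",
            "DatabaseName": "shared_database",
            "Name": "sales",
        },
    },
]

links, tables = split_resource_links(sample_table_list)
```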

The Data Catalog can also contain database resource links.

For more information about resource links, see Creating Resource Links in the Amazon Lake Formation Developer Guide.

Updating manually created Data Catalog tables using crawlers

You might want to create Amazon Glue Data Catalog tables manually and then keep them updated with Amazon Glue crawlers. Crawlers running on a schedule can add new partitions and update the tables with any schema changes. This also applies to tables migrated from an Apache Hive metastore.

To do this, when you define a crawler, instead of specifying one or more data stores as the source of a crawl, you specify one or more existing Data Catalog tables. The crawler then crawls the data stores specified by the catalog tables. In this case, no new tables are created; instead, your manually created tables are updated.
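
A crawler definition of this kind can be sketched as a request for the Glue CreateCrawler operation in which the targets are catalog tables instead of data stores. The following Python example is a minimal sketch. The helper name, crawler name, role name, database name, and schedule are hypothetical, and the schema change policy shown (update in place, log deletions) is an assumption about a suitable setting for catalog targets, not a confirmed requirement from this page.

```python
def build_catalog_crawler_request(name, role, database, tables, schedule=None):
    """Build keyword arguments for CreateCrawler with catalog tables as source."""
    request = {
        "Name": name,
        "Role": role,
        # Existing Data Catalog tables are the source; no data store paths
        # are listed here.
        "Targets": {
            "CatalogTargets": [
                {"DatabaseName": database, "Tables": list(tables)}
            ]
        },
        # Assumed policy: update existing tables, only log deleted objects.
        "SchemaChangePolicy": {
            "UpdateBehavior": "UPDATE_IN_DATABASE",
            "DeleteBehavior": "LOG",
        },
    }
    if schedule:
        request["Schedule"] = schedule
    return request


# Hypothetical names; the schedule runs the crawler daily at 12:00 UTC.
crawler_request = build_catalog_crawler_request(
    name="sales-table-updater",
    role="MyGlueCrawlerRole",
    database="my_database",
    tables=["sales"],
    schedule="cron(0 12 * * ? *)",
)

# With boto3 installed and credentials configured:
#   import boto3
#   boto3.client("glue").create_crawler(**crawler_request)
```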

The following are other reasons why you might want to manually create catalog tables and specify catalog tables as the crawler source:

  • You want to choose the catalog table name and not rely on the catalog table naming algorithm.

  • You want to prevent new tables from being created when files with a format that could disrupt partition detection are mistakenly saved in the data source path.

For more information, see Step 2: Choose data sources and classifiers.

Data Catalog table properties

Table properties, or parameters, as they are known in the Amazon CLI, are unvalidated key and value strings. You can set your own properties on the table to support uses of the Data Catalog outside of Amazon Glue. Other services using the Data Catalog may do so as well. Amazon Glue sets some table properties when running jobs or crawlers. Unless otherwise described, these properties are for internal use. We do not guarantee that they will continue to exist in their current form, and we do not support product behavior if these properties are manually changed.
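
As an illustration of setting your own properties, the following Python sketch merges custom key and value strings into a table definition while leaving existing properties untouched. It assumes the usual read-modify-write pattern: a table returned by GetTable is trimmed to the fields that UpdateTable accepts in TableInput. The helper name, the property name owner_team, and the sample table are hypothetical, and the list of accepted fields reflects the Glue API as understood here.

```python
# Fields accepted in TableInput (assumed list); a GetTable result contains
# additional read-only fields such as DatabaseName and CreateTime.
TABLE_INPUT_KEYS = {
    "Name", "Description", "Owner", "LastAccessTime", "LastAnalyzedTime",
    "Retention", "StorageDescriptor", "PartitionKeys", "ViewOriginalText",
    "ViewExpandedText", "TableType", "Parameters", "TargetTable",
}


def with_custom_properties(table, properties):
    """Return a TableInput dict with custom properties merged in."""
    table_input = {k: v for k, v in table.items() if k in TABLE_INPUT_KEYS}
    # Keep existing properties, including any set by jobs or crawlers.
    merged = dict(table.get("Parameters", {}))
    # Properties are plain strings, so coerce keys and values.
    merged.update({str(k): str(v) for k, v in properties.items()})
    table_input["Parameters"] = merged
    return table_input


# Hypothetical table shaped like the Table element of a GetTable response.
existing_table = {
    "Name": "sales",
    "DatabaseName": "my_database",
    "CreateTime": "2017-02-04T00:00:00Z",
    "Parameters": {"classification": "csv"},
}

updated_input = with_custom_properties(existing_table, {"owner_team": "analytics"})

# With boto3 installed and credentials configured:
#   import boto3
#   boto3.client("glue").update_table(
#       DatabaseName="my_database", TableInput=updated_input)
```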

For more information about table properties set by Amazon Glue crawlers, see Parameters set on Data Catalog tables by crawler.