
Customizing crawler behavior

When a crawler runs, it might encounter changes to your data store that result in a schema or partition that is different from a previous crawl. You can use the Amazon Web Services Management Console or the Amazon Glue API to configure how your crawler processes certain types of changes.

Console

When you define a crawler using the Amazon Glue console, you have several options for configuring the behavior of your crawler. For more information about using the Amazon Glue console to add a crawler, see Configuring a crawler.

When a crawler runs against a previously crawled data store, it might discover that a schema has changed or that some objects in the data store have been deleted. The crawler logs changes to a schema. Depending on the source type for the crawler, new tables and partitions might be created regardless of the schema change policy.

To specify what the crawler does when it finds changes in the schema, you can choose one of the following actions on the console:

  • Update the table definition in the Data Catalog – Add new columns, remove missing columns, and modify the definitions of existing columns in the Amazon Glue Data Catalog. Remove any metadata that is not set by the crawler. This is the default setting.

  • Add new columns only – For tables that map to an Amazon S3 data store, add new columns as they are discovered, but don't remove or change the type of existing columns in the Data Catalog. Choose this option when the current columns in the Data Catalog are correct and you don't want the crawler to remove or change the type of the existing columns. If a fundamental Amazon S3 table attribute changes, such as classification, compression type, or CSV delimiter, mark the table as deprecated. Maintain input format and output format as they exist in the Data Catalog. Update SerDe parameters only if the parameter is one that is set by the crawler. For all other data stores, modify existing column definitions.

  • Ignore the change and don't update the table in the Data Catalog – Only new tables and partitions are created.

    This is the default setting for incremental crawls.

A crawler might also discover new or changed partitions. By default, new partitions are added and existing partitions are updated if they have changed. In addition, you can set a crawler configuration option to Update all new and existing partitions with metadata from the table on the Amazon Glue console. When this option is set, partitions inherit metadata properties—such as their classification, input format, output format, SerDe information, and schema—from their parent table. Any changes to these properties in a table are propagated to its partitions. When this configuration option is set on an existing crawler, existing partitions are updated to match the properties of their parent table the next time the crawler runs.

To specify what the crawler does when it finds a deleted object in the data store, choose one of the following actions:

  • Delete tables and partitions from the Data Catalog

  • Ignore the change and don't update the table in the Data Catalog

    This is the default setting for incremental crawls.

  • Mark the table as deprecated in the Data Catalog – This is the default setting.

Amazon CLI
aws glue create-crawler \
    --name "your-crawler-name" \
    --role "your-iam-role-arn" \
    --database-name "your-database-name" \
    --targets 'S3Targets=[{Path="s3://your-bucket-name/path-to-data"}]' \
    --configuration '{"Version": 1.0, "CrawlerOutput": {"Partitions": {"AddOrUpdateBehavior": "InheritFromTable"}, "Tables": {"AddOrUpdateBehavior": "MergeNewColumns"}}}'
API

When you define a crawler using the Amazon Glue API, you can choose from several fields to configure your crawler. The SchemaChangePolicy in the crawler API determines what the crawler does when it discovers a changed schema or a deleted object. The crawler logs schema changes as it runs.

Sample Python code showing the crawler configuration options

import boto3
import json

# Initialize a boto3 client for AWS Glue
glue_client = boto3.client('glue', region_name='us-east-1')  # Replace 'us-east-1' with your desired Amazon Region

# Define the crawler configuration
crawler_configuration = {
    "Version": 1.0,
    "CrawlerOutput": {
        "Partitions": {
            "AddOrUpdateBehavior": "InheritFromTable"
        },
        "Tables": {
            "AddOrUpdateBehavior": "MergeNewColumns"
        }
    }
}
configuration_json = json.dumps(crawler_configuration)

# Create the crawler with the specified configuration
response = glue_client.create_crawler(
    Name='your-crawler-name',       # Replace with your desired crawler name
    Role='crawler-test-role',       # Replace with the ARN of your IAM role for Glue
    DatabaseName='default',         # Replace with your target Glue database name
    Targets={
        'S3Targets': [
            {
                'Path': "s3://your-bucket-name/path/",  # Replace with your S3 path to the data
            },
        ],
        # Include other target types like 'JdbcTargets' if needed
    },
    Configuration=configuration_json,
    # Include other parameters like Schedule, Classifiers, TablePrefix, SchemaChangePolicy, etc., as needed
)
print(response)
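To confirm that the configuration string was stored as intended, you can read the crawler definition back with get_crawler. The following is a minimal sketch that assumes the hypothetical crawler name used above and that a Configuration string has been set:

import boto3
import json

glue_client = boto3.client('glue', region_name='us-east-1')  # Replace with your Amazon Region

# Retrieve the crawler definition; 'Configuration' is returned as a JSON string when it has been set
response = glue_client.get_crawler(Name='your-crawler-name')  # Hypothetical crawler name from the example above
configuration = json.loads(response['Crawler']['Configuration'])

print(configuration['CrawlerOutput'])
# Prints the Partitions and Tables behaviors supplied at creation time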

When a crawler runs, new tables and partitions are always created regardless of the schema change policy. You can choose one of the following actions in the UpdateBehavior field in the SchemaChangePolicy structure to determine what the crawler does when it finds a changed table schema:

  • UPDATE_IN_DATABASE – Update the table in the Amazon Glue Data Catalog. Add new columns, remove missing columns, and modify the definitions of existing columns. Remove any metadata that is not set by the crawler.

  • LOG – Ignore the changes, and don't update the table in the Data Catalog.

    This is the default setting for incremental crawls.
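For example, a minimal sketch of supplying the UpdateBehavior through the SchemaChangePolicy parameter when creating a crawler with boto3 might look like the following (the crawler name, role, database, and S3 path are placeholders):

import boto3

glue_client = boto3.client('glue', region_name='us-east-1')  # Replace with your Amazon Region

# Create a crawler that updates table definitions in the Data Catalog when schemas change
glue_client.create_crawler(
    Name='your-crawler-name',            # Placeholder crawler name
    Role='your-iam-role-arn',            # Placeholder IAM role for Glue
    DatabaseName='your-database-name',   # Placeholder target database
    Targets={'S3Targets': [{'Path': 's3://your-bucket-name/path-to-data'}]},
    SchemaChangePolicy={
        'UpdateBehavior': 'UPDATE_IN_DATABASE'  # Or 'LOG' to leave existing table definitions unchanged
    }
)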

You can also override the SchemaChangePolicy structure using a JSON object supplied in the crawler API Configuration field. This JSON object can contain a key-value pair to set the policy to not update existing columns and only add new columns. For example, provide the following JSON object as a string:

{ "Version": 1.0, "CrawlerOutput": { "Tables": { "AddOrUpdateBehavior": "MergeNewColumns" } } }

This option corresponds to the Add new columns only option on the Amazon Glue console. It overrides the SchemaChangePolicy structure for tables that result from crawling Amazon S3 data stores only. Choose this option if you want to maintain the metadata as it exists in the Data Catalog (the source of truth). New columns are added as they are encountered, including nested data types. But existing columns are not removed, and their type is not changed. If an Amazon S3 table attribute changes significantly, mark the table as deprecated, and log a warning that an incompatible attribute needs to be resolved. This option is not applicable to incremental crawls.

When a crawler runs against a previously crawled data store, it might discover new or changed partitions. By default, new partitions are added and existing partitions are updated if they have changed. In addition, you can set a crawler configuration option to InheritFromTable (corresponding to the Update all new and existing partitions with metadata from the table option on the Amazon Glue console). When this option is set, partitions inherit metadata properties from their parent table, such as their classification, input format, output format, SerDe information, and schema. Any property changes to the parent table are propagated to its partitions.

When this configuration option is set on an existing crawler, existing partitions are updated to match the properties of their parent table the next time the crawler runs. This behavior is set in the crawler API Configuration field. For example, provide the following JSON object as a string:

{ "Version": 1.0, "CrawlerOutput": { "Partitions": { "AddOrUpdateBehavior": "InheritFromTable" } } }

The crawler API Configuration field can set multiple configuration options. For example, to configure the crawler output for both partitions and tables, you can provide a string representation of the following JSON object:

{ "Version": 1.0, "CrawlerOutput": { "Partitions": { "AddOrUpdateBehavior": "InheritFromTable" }, "Tables": {"AddOrUpdateBehavior": "MergeNewColumns" } } }

You can choose one of the following actions to determine what the crawler does when it finds a deleted object in the data store. The DeleteBehavior field in the SchemaChangePolicy structure in the crawler API sets the behavior of the crawler when it discovers a deleted object.

  • DELETE_FROM_DATABASE – Delete tables and partitions from the Data Catalog.

  • LOG – Ignore the change. Don't update the Data Catalog. Write a log message instead.

  • DEPRECATE_IN_DATABASE – Mark the table as deprecated in the Data Catalog. This is the default setting.
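As an illustrative sketch (the crawler name is a placeholder), the DeleteBehavior can be set on an existing crawler through the SchemaChangePolicy parameter of update_crawler:

import boto3

glue_client = boto3.client('glue', region_name='us-east-1')  # Replace with your Amazon Region

# Remove tables and partitions from the Data Catalog when their source objects are deleted
glue_client.update_crawler(
    Name='your-crawler-name',  # Placeholder name of an existing crawler
    SchemaChangePolicy={
        'DeleteBehavior': 'DELETE_FROM_DATABASE'  # Or 'LOG', or the default 'DEPRECATE_IN_DATABASE'
    }
)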