Services or capabilities described in Amazon Web Services documentation might vary by Region. To see the differences applicable to the China Regions, see Getting Started with Amazon Web Services in China (PDF).

Incremental crawls for adding new partitions in Amazon Glue

The crawler provides an option to add only new partitions, which results in faster crawls for incremental datasets with a stable table schema. The typical use case is a scheduled crawler that adds new partitions on each run. When this option is turned on, the crawler first runs a complete crawl of the target dataset so that it can record the initial schema and partition structure. On each recrawl, it adds new partitions to existing tables only when their schemas are compatible. After the first crawl run, the crawler makes no schema changes and adds no new tables to the Data Catalog.

You can use this option when setting up an Amazon S3 data source. In the CreateCrawler API, set the RecrawlPolicy with RecrawlBehavior as CRAWL_NEW_FOLDERS_ONLY. In the console, set Subsequent crawler runs to Crawl new sub-folders only.
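As a minimal sketch of the API route, the following Python builds the arguments for a CreateCrawler call with the recrawl policy set to crawl new folders only (the API enum value for this behavior is CRAWL_NEW_FOLDERS_ONLY). The helper names, crawler name, role ARN, database, and bucket path are hypothetical placeholders, and the boto3 SDK is an assumption about your tooling. The update and delete behaviors are set to LOG explicitly to match the behavior described under Notes and restrictions.

```python
def build_incremental_crawler_request(name, role_arn, database, s3_path, schedule=None):
    """Build keyword arguments for a Glue CreateCrawler call (hypothetical helper)."""
    request = {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        # Incremental crawls apply to Amazon S3 data sources.
        "Targets": {"S3Targets": [{"Path": s3_path}]},
        # After the first full crawl, only new folders are crawled.
        "RecrawlPolicy": {"RecrawlBehavior": "CRAWL_NEW_FOLDERS_ONLY"},
        # With this option, update and delete behavior are LOG.
        "SchemaChangePolicy": {"UpdateBehavior": "LOG", "DeleteBehavior": "LOG"},
    }
    if schedule is not None:
        # Optional cron expression, for example "cron(0 12 * * ? *)".
        request["Schedule"] = schedule
    return request


def create_incremental_crawler(**kwargs):
    """Send the request to Glue. Requires boto3 and valid AWS credentials."""
    import boto3  # imported lazily so the builder above works without the SDK

    glue = boto3.client("glue")
    return glue.create_crawler(**build_incremental_crawler_request(**kwargs))
```

A call might look like create_incremental_crawler(name="sales-incremental", role_arn="arn:aws:iam::123456789012:role/ExampleGlueRole", database="sales_db", s3_path="s3://example-bucket/sales/"), where every value is a placeholder to replace with your own.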

Continuing with the example in How does a crawler determine when to create partitions?, the following diagram shows that files for the month of March have been added.



If you set RecrawlBehavior to CRAWL_NEW_FOLDERS_ONLY, only the new folder, month=Mar, is crawled.
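To illustrate the effect, the following Python sketch compares the folders recorded by an earlier crawl with a current folder listing and returns only the ones that are new. It is an illustration of the outcome, not the crawler's actual implementation; the function name and the folder paths are hypothetical.

```python
def new_partition_folders(previously_crawled, current_listing):
    """Return folders in current_listing that were not seen in earlier crawls."""
    seen = set(previously_crawled)
    # Preserve the listing order and skip anything already crawled.
    return [folder for folder in current_listing if folder not in seen]


# Hypothetical folder layout: Jan and Feb were crawled earlier, Mar is new.
crawled = ["sales/year=2019/month=Jan/", "sales/year=2019/month=Feb/"]
listing = crawled + ["sales/year=2019/month=Mar/"]

print(new_partition_folders(crawled, listing))
# prints ['sales/year=2019/month=Mar/']
```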

Notes and restrictions

When this option is turned on, you can't change the Amazon S3 target data stores when editing the crawler. The option also affects certain crawler configuration settings: it forces the update behavior and the delete behavior of the crawler to LOG. This means that:

  • If the crawler discovers objects whose schemas are not compatible, it does not add those objects to the Data Catalog, and it records this detail in CloudWatch Logs.

  • The crawler does not update deleted objects in the Data Catalog.
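The LOG restriction can be checked before a crawler definition is submitted. The following Python sketch, which assumes a configuration dictionary shaped like the CreateCrawler request parameters, reports any update or delete behavior that conflicts with incremental crawls. The function name and message wording are hypothetical.

```python
def check_incremental_crawl_config(crawler_config):
    """Return a list of conflicts between the recrawl policy and schema change policy."""
    problems = []
    recrawl = crawler_config.get("RecrawlPolicy", {}).get("RecrawlBehavior")
    if recrawl != "CRAWL_NEW_FOLDERS_ONLY":
        # The restriction only applies when incremental crawls are turned on.
        return problems

    policy = crawler_config.get("SchemaChangePolicy", {})
    for key in ("UpdateBehavior", "DeleteBehavior"):
        value = policy.get(key)
        # An unset value is left alone; anything other than LOG conflicts.
        if value is not None and value != "LOG":
            problems.append(f"{key} is {value}, expected LOG")
    return problems


config = {
    "RecrawlPolicy": {"RecrawlBehavior": "CRAWL_NEW_FOLDERS_ONLY"},
    "SchemaChangePolicy": {
        "UpdateBehavior": "UPDATE_IN_DATABASE",
        "DeleteBehavior": "LOG",
    },
}
print(check_incremental_crawl_config(config))
# prints ['UpdateBehavior is UPDATE_IN_DATABASE, expected LOG']
```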

For more information, see Setting crawler configuration options.