Incremental crawls in Amazon Glue
For an Amazon Simple Storage Service (Amazon S3) data source, incremental crawls crawl only the folders that were added since the last crawler run. Without this option, the crawler crawls the entire dataset. Incremental crawls can save significant time and cost. To perform an incremental crawl, set the Crawl new folders only option in the Amazon Glue console, or set the RecrawlPolicy property in the CreateCrawler request in the API.
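As a sketch, a CreateCrawler request that turns on incremental crawls might look like the following, using the boto3 Glue client. The crawler name, IAM role, database name, and S3 path are placeholders; substitute your own values.

```python
def build_crawler_request() -> dict:
    """Build a CreateCrawler request that enables incremental crawls.

    All names and paths below are hypothetical placeholders.
    """
    return {
        "Name": "sales-incremental-crawler",
        "Role": "arn:aws:iam::111122223333:role/GlueCrawlerRole",
        "DatabaseName": "sales_db",
        "Targets": {"S3Targets": [{"Path": "s3://example-bucket/sales/"}]},
        # CRAWL_NEW_FOLDERS_ONLY enables incremental crawls;
        # the default behavior is CRAWL_EVERYTHING.
        "RecrawlPolicy": {"RecrawlBehavior": "CRAWL_NEW_FOLDERS_ONLY"},
    }


# To create the crawler (requires boto3 and AWS credentials):
# import boto3
# glue = boto3.client("glue")
# glue.create_crawler(**build_crawler_request())
```

The same RecrawlPolicy structure is accepted by the console's underlying API, so the Crawl new folders only checkbox and this request are equivalent.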
Incremental crawls are best suited to incremental datasets with a stable table schema. The typical use case is a scheduled crawler that adds new partitions during each crawl; note that incremental crawls add new partitions but never new tables. Continuing with the example in How does a crawler determine when to create partitions?, the following diagram shows that files for the month of March have been added.

If you set the Crawl new folders only option, only the new folder, month=Mar, is crawled.
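Continuing that hypothetical layout, the dataset might look like the following after the March files arrive; an incremental crawl visits only the last folder (bucket and prefix names are illustrative):

```
s3://example-bucket/sales/year=2019/month=Jan/   (already crawled)
s3://example-bucket/sales/year=2019/month=Feb/   (already crawled)
s3://example-bucket/sales/year=2019/month=Mar/   (new; the only folder crawled)
```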
Notes and Restrictions for Incremental Crawls
Keep in mind the following additional information about incremental crawls:
- The best practice for incremental crawls is to first run a complete crawl on the target dataset to enable the crawler to record the initial schema and partition structure.
- When this option is turned on, you can't change the Amazon S3 target data stores when editing the crawler.
- This option affects certain crawler configuration settings. When it is turned on, it forces the update behavior and delete behavior of the crawler to LOG. This means that:
  - If an incremental crawl discovers objects whose schemas differ enough from the schema recorded in the Data Catalog that the crawler cannot create new partitions, the crawler ignores those objects and records the event in CloudWatch Logs.
  - If an incremental crawl discovers deleted objects, it ignores them and doesn't update the Data Catalog.

  For more information, see Setting crawler configuration options.
- If an incremental crawl discovers multiple new partitions or folders, the majority of them must match the schema recorded in the Data Catalog for the crawler to add them successfully. Otherwise, the crawler might fail to add the partitions because there are too many schema varieties.
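Following the best practice above, one workflow is to create the crawler with the default full-crawl behavior, run it once, and then switch it to incremental mode. A minimal sketch using boto3's UpdateCrawler, which accepts the same RecrawlPolicy field (the crawler name is a placeholder):

```python
def build_incremental_switch(crawler_name: str) -> dict:
    """Build an UpdateCrawler request that switches an existing crawler
    from full crawls to incremental crawls.

    Intended for use after the initial complete crawl has finished,
    so the schema and partition structure are already recorded.
    """
    return {
        "Name": crawler_name,  # hypothetical name; pass your real crawler
        "RecrawlPolicy": {"RecrawlBehavior": "CRAWL_NEW_FOLDERS_ONLY"},
    }


# With boto3 and AWS credentials configured:
# import boto3
# glue = boto3.client("glue")
# glue.update_crawler(**build_incremental_switch("sales-incremental-crawler"))
# glue.start_crawler(Name="sales-incremental-crawler")
```

After this switch, each scheduled run crawls only folders added since the previous run, subject to the LOG update and delete behavior described above.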