

# Accelerating crawls using Amazon S3 event notifications


Instead of listing the objects from an Amazon S3 or Data Catalog target, you can configure the crawler to use Amazon S3 events to find any changes. This feature improves recrawl time by using Amazon S3 events to identify the changes between two crawls: instead of listing the full Amazon S3 or Data Catalog target, the crawler lists only the files in the subfolders that triggered the events.

The first crawl lists all Amazon S3 objects from the target. After the first successful crawl, you can choose to recrawl manually or on a set schedule. Subsequent crawls list only the objects referenced by the events instead of listing all objects.

When the target is a Data Catalog table, the crawler updates the existing tables in the Data Catalog with changes (for example, extra partitions in a table).

The advantages of moving to an Amazon S3 event based crawler are:
+ Faster recrawls, because the crawler lists only the specific folders where objects were added or deleted rather than listing every object in the target.
+ A reduction in overall crawl cost, because only those specific folders are listed.

The Amazon S3 event crawl runs by consuming Amazon S3 events from the SQS queue based on the crawler schedule. There is no cost if there are no events in the queue. Amazon S3 events can be configured to go directly to the SQS queue or, in cases where multiple consumers need the same event, through a combination of SNS and SQS. For more information, see [Setting up your account for Amazon S3 event notifications](#crawler-s3-event-notifications-setup).
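Each message the crawler consumes follows the standard Amazon S3 event notification format, which records the bucket and object key for every change. As an illustration of how a consumer can derive the folders to re-list from such a message, here is a minimal Python sketch; the `extract_prefixes` helper is hypothetical and not part of Amazon Glue:

```python
import json
import posixpath

def extract_prefixes(message_body: str) -> set[tuple[str, str]]:
    """Return the set of (bucket, parent-folder) pairs affected by an
    S3 event notification message -- the folders a consumer would
    re-list instead of scanning the whole target."""
    event = json.loads(message_body)
    prefixes = set()
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        # The parent "folder" of the object that changed.
        folder = posixpath.dirname(key)
        prefixes.add((bucket, folder + "/" if folder else ""))
    return prefixes

# A trimmed-down S3 ObjectCreated event with a single record.
sample = json.dumps({
    "Records": [{
        "eventName": "ObjectCreated:Put",
        "s3": {
            "bucket": {"name": "amzn-s3-demo-bucket"},
            "object": {"key": "sales/year=2023/part-0001.parquet"},
        },
    }]
})

print(extract_prefixes(sample))  # {('amzn-s3-demo-bucket', 'sales/year=2023/')}
```

Re-listing only `sales/year=2023/` instead of the whole bucket is what makes the event-based recrawl faster and cheaper than a full listing.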

After creating and configuring the crawler in event mode, the first crawl runs in listing mode by performing a full listing of the Amazon S3 or Data Catalog target. The following log message confirms that the crawl is consuming Amazon S3 events after the first successful crawl: "The crawl is running by consuming Amazon S3 events."

If, after creating the Amazon S3 event crawler, you update crawler properties that may impact the crawl, the next crawl operates in list mode and the following log message is added: "Crawl is not running in S3 event mode".

**Note**  
The maximum number of messages to consume is 100,000 messages per crawl.

## Considerations and limitations


The following considerations and limitations apply when you configure a crawler to use Amazon S3 event notifications to find any changes. 
+  **Important behavior with deleted partitions** 

  When using Amazon S3 event crawlers with Data Catalog tables:
  +  If you delete a partition using the `DeletePartition` API call, you must also delete all S3 objects under that partition, and select **All object removal events** when you configure your S3 event notifications. If deletion events are not configured, the crawler recreates the deleted partition during its next run. 
+ The crawler supports only a single target, whether for Amazon S3 or Data Catalog targets.
+ SQS on private VPC is not supported.
+ Amazon S3 sampling is not supported.
+ The crawler target should be a folder for an Amazon S3 target, or one or more Amazon Glue Data Catalog tables for a Data Catalog target.
+ The 'everything' path wildcard is not supported: `s3://%`
+ For a Data Catalog target, all catalog tables should point to the same Amazon S3 bucket for Amazon S3 event mode.
+ For a Data Catalog target, a catalog table should not point to an Amazon S3 location in the Delta Lake format (containing `_symlink` folders, or checking the catalog table's `InputFormat`).

**Topics**
+ [Considerations and limitations](#s3event-crawler-limitations)
+ [Setting up your account for Amazon S3 event notifications](#crawler-s3-event-notifications-setup)
+ [Setting up a crawler for Amazon S3 event notifications for an Amazon S3 target](crawler-s3-event-notifications-setup-console-s3-target.md)
+ [Setting up a crawler for Amazon S3 event notifications for a Data Catalog table](crawler-s3-event-notifications-setup-console-catalog-target.md)

## Setting up your account for Amazon S3 event notifications


Complete the following setup tasks. Note that the values in parentheses reference the configurable settings from the script.

1. Set up event notifications for your Amazon S3 bucket.

   For more information, see [Amazon S3 event notifications](https://docs.amazonaws.cn/AmazonS3/latest/userguide/EventNotifications.html).

1. To use the Amazon S3 event based crawler, enable event notifications on the Amazon S3 bucket, filtered on a prefix that is the same as the S3 target, and deliver them to SQS. You can set up SQS and event notifications through the console by following the steps in [Walkthrough: Configuring a bucket for notifications](https://docs.amazonaws.cn/AmazonS3/latest/userguide/ways-to-add-notification-config-to-bucket.html).
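   As an illustration, a notification configuration like the following sends all object create and remove events under the crawled prefix to the SQS queue; the queue ARN, bucket prefix, and configuration ID are placeholders. It could be applied with `aws s3api put-bucket-notification-configuration`:

   ```
   {
     "QueueConfigurations": [
       {
         "Id": "glue-crawler-events",
         "QueueArn": "arn:aws-cn:sqs:us-east-1:111122223333:cfn-sqs-queue",
         "Events": [
           "s3:ObjectCreated:*",
           "s3:ObjectRemoved:*"
         ],
         "Filter": {
           "Key": {
             "FilterRules": [
               { "Name": "prefix", "Value": "sales/" }
             ]
           }
         }
       }
     ]
   }
   ```

   Including the `s3:ObjectRemoved:*` events matters for the deleted-partition behavior described under Considerations and limitations.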

1. Add the following SQS policy to the role used by the crawler. 

------
#### [ JSON ]


   ```
   {
     "Version": "2012-10-17",
     "Statement": [
       {
         "Sid": "VisualEditor0",
         "Effect": "Allow",
         "Action": [
           "sqs:DeleteMessage",
           "sqs:GetQueueUrl",
           "sqs:ListDeadLetterSourceQueues",
           "sqs:ReceiveMessage",
           "sqs:GetQueueAttributes",
           "sqs:ListQueueTags",
           "sqs:SetQueueAttributes",
           "sqs:PurgeQueue"
         ],
         "Resource": "arn:aws-cn:sqs:us-east-1:111122223333:cfn-sqs-queue"
       }
     ]
   }
   ```

------
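The policy above is attached to the crawler's IAM role so the crawler can read from the queue. Separately, the queue itself needs an access policy that allows Amazon S3 to deliver events to it. A minimal sketch follows; the queue ARN and bucket ARN are placeholders that should match your own resources:

```
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "s3.amazonaws.com" },
      "Action": "SQS:SendMessage",
      "Resource": "arn:aws-cn:sqs:us-east-1:111122223333:cfn-sqs-queue",
      "Condition": {
        "ArnLike": { "aws:SourceArn": "arn:aws-cn:s3:::amzn-s3-demo-bucket" }
      }
    }
  ]
}
```

The `aws:SourceArn` condition restricts delivery to notifications from your bucket, so other accounts cannot send messages to the queue.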

# Setting up a crawler for Amazon S3 event notifications for an Amazon S3 target


Follow these steps to set up a crawler for Amazon S3 event notifications for an Amazon S3 target using the Amazon Web Services Management Console or Amazon CLI.

------
#### [ Amazon Web Services Management Console ]

1. Sign in to the Amazon Web Services Management Console and open the Amazon Glue console at [https://console.amazonaws.cn/glue/](https://console.amazonaws.cn/glue/).

1.  Set your crawler properties. For more information, see [ Setting Crawler Configuration Options on the Amazon Glue console ](https://docs.amazonaws.cn/glue/latest/dg/crawler-configuration.html#crawler-configure-changes-console). 

1.  In the section **Data source configuration**, you are asked *Is your data already mapped to Amazon Glue tables?*

    By default, **Not yet** is selected. Leave this as the default because you are using an Amazon S3 data source and the data is not already mapped to Amazon Glue tables.

1.  In the section **Data sources**, choose **Add a data source**.   
![\[Data source configuration interface with options to select or add data sources for crawling.\]](http://docs.amazonaws.cn/en_us/glue/latest/dg/images/crawler-s3-event-console1.png)

1.  In the **Add data source** modal, configure the Amazon S3 data source: 
   +  **Data source**: By default, Amazon S3 is selected. 
   +  **Network connection** (Optional): Choose **Add new connection**. 
   +  **Location of Amazon S3 data**: By default, **In this account** is selected. 
   +  **Amazon S3 path**: Specify the Amazon S3 path where folders and files are crawled. 
   +  **Subsequent crawler runs**: Choose **Crawl based on events** to use Amazon S3 event notifications for your crawler. 
   +  **Include SQS ARN**: Specify the data store parameters, including a valid SQS ARN (for example, `arn:aws:sqs:region:account:sqs`). 
   +  **Include dead-letter SQS ARN** (Optional): Specify a valid Amazon dead-letter SQS ARN. (For example, `arn:aws:sqs:region:account:deadLetterQueue`). 
   +  Choose **Add an Amazon S3 data source**.   
![\[Add data source dialog for S3, showing options for network connection and crawl settings.\]](http://docs.amazonaws.cn/en_us/glue/latest/dg/images/crawler-s3-event-console2.png)

------
#### [ Amazon CLI ]

 The following is an example Amazon CLI call to configure a crawler to use event notifications to crawl an Amazon S3 target bucket. 

```
aws glue update-crawler \
    --name myCrawler \
    --recrawl-policy RecrawlBehavior=CRAWL_EVENT_MODE \
    --schema-change-policy UpdateBehavior=UPDATE_IN_DATABASE,DeleteBehavior=LOG \
    --targets '{"S3Targets": [{"Path": "s3://amzn-s3-demo-bucket/", "EventQueueArn": "arn:aws:sqs:us-east-1:012345678910:MyQueue"}]}'
```
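
If the crawler does not exist yet, a similar `aws glue create-crawler` call can create it directly in event mode. The role ARN, database name, and queue ARN below are placeholders, not values from this walkthrough:

```
aws glue create-crawler \
    --name myCrawler \
    --role arn:aws-cn:iam::111122223333:role/MyGlueCrawlerRole \
    --database-name my_database \
    --recrawl-policy RecrawlBehavior=CRAWL_EVENT_MODE \
    --targets '{"S3Targets": [{"Path": "s3://amzn-s3-demo-bucket/", "EventQueueArn": "arn:aws:sqs:us-east-1:012345678910:MyQueue"}]}'
```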

------

# Setting up a crawler for Amazon S3 event notifications for a Data Catalog table


When you have a Data Catalog table, set up a crawler for Amazon S3 event notifications using the Amazon Glue console:

1.  Set your crawler properties. For more information, see [ Setting Crawler Configuration Options on the Amazon Glue console ](https://docs.amazonaws.cn/glue/latest/dg/crawler-configuration.html#crawler-configure-changes-console). 

1.  In the section **Data source configuration**, you are asked *Is your data already mapped to Amazon Glue tables?*

    Select **Yes** to choose existing tables from your Data Catalog as your data source.

1.  In the section **Glue tables**, choose **Add tables**.   
![\[Data source configuration interface with options to select existing Glue tables or add new ones.\]](http://docs.amazonaws.cn/en_us/glue/latest/dg/images/crawler-s3-event-console1-cat.png)

1.  In the **Add table** modal, configure the database and tables: 
   +  **Network connection** (Optional): Choose **Add new connection**. 
   +  **Database**: Select a database in the Data Catalog. 
   +  **Tables**: Select one or more tables from that database in the Data Catalog. 
   +  **Subsequent crawler runs**: Choose **Crawl based on events** to use Amazon S3 event notifications for your crawler. 
   +  **Include SQS ARN**: Specify the data store parameters, including a valid SQS ARN (for example, `arn:aws:sqs:region:account:sqs`). 
   +  **Include dead-letter SQS ARN** (Optional): Specify a valid Amazon dead-letter SQS ARN. (For example, `arn:aws:sqs:region:account:deadLetterQueue`). 
   +  Choose **Confirm**.   
![\[Add Glue tables dialog with network, database, tables, and crawler options.\]](http://docs.amazonaws.cn/en_us/glue/latest/dg/images/crawler-s3-event-console2-cat.png)