Configuring a crawler to populate the Amazon Glue Data Catalog
Amazon Glue crawlers crawl data stores to populate tables in the Amazon Glue Data Catalog. In this procedure, you create and run an Amazon Glue crawler for your S3 bucket that contains exported asset data. The crawler creates a table for asset property updates and a table for asset metadata. Then, you can perform SQL queries on these tables with Athena. For more information, see Populating the Amazon Glue Data Catalog and Defining crawlers in the Amazon Glue Developer Guide.
To create an Amazon Glue crawler

1. Navigate to the Amazon Glue console.
2. In the navigation pane, choose Crawlers.
3. Choose Add crawler.
4. On the Add crawler page, do the following:
   1. Enter a name for your crawler, such as IoTSiteWiseDataCrawler, and then choose Next.
   2. For Crawler source type, choose Data stores, and then choose Next.
5. On the Add a data store page, do the following:
   1. For Choose a data store, choose S3.
   2. In Include path, enter s3://DOC-EXAMPLE-BUCKET1 to add your asset data bucket as a data store. Replace DOC-EXAMPLE-BUCKET1 with the bucket name that you chose when you created the stack.
   3. Choose Next.
6. On the Add another data store page, choose No, and then choose Next.
7. On the Choose an IAM role page, do the following:
   1. To create a new service role that allows Amazon Glue to access the S3 bucket, choose Create an IAM role.
   2. Enter a suffix for your role's name, such as IoTSiteWiseDataCrawler.
   3. Choose Next.
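If you prefer to create the service role yourself rather than let the console do it, the role needs a trust policy that lets the Amazon Glue service assume it. Permissions to read the bucket (for example, the AWSGlueServiceRole managed policy plus read access to your bucket) are attached separately. A minimal sketch of that trust policy:

```python
import json

# Trust policy allowing the Amazon Glue service to assume the crawler's role.
# S3 read permissions for your bucket must be attached to the role separately.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "glue.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}
print(json.dumps(trust_policy, indent=2))
```

With boto3, you would pass `json.dumps(trust_policy)` as the `AssumeRolePolicyDocument` parameter of `iam.create_role`.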
8. For Frequency, choose Hourly, and then choose Next. The crawler updates the tables with new data each time it runs, so you can choose any frequency that fits your use case.
9. On the Configure the crawler's output page, do the following:
   1. Choose Add database to create an Amazon Glue database for your asset data.
   2. Enter a name for the database, such as iot_sitewise_asset_database.
   3. Choose Create.
   4. Choose Next.
10. Review the crawler details, and then choose Finish.

By default, your new crawler doesn't immediately run. You must manually run it or wait until it runs on its configured schedule.
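The console choices above map to a single Glue CreateCrawler API call. As a sketch, these are the parameters you would pass to `boto3.client("glue").create_crawler(**params)`; the bucket and database names are the examples from this procedure, and the role name assumes the prefix the console adds when it creates the role for you:

```python
import json

# Parameters mirroring the console choices above. With boto3 installed and
# credentials configured, you would pass them to
# boto3.client("glue").create_crawler(**params).
params = {
    "Name": "IoTSiteWiseDataCrawler",
    # Assumes the console-created role name; substitute your actual role.
    "Role": "AWSGlueServiceRole-IoTSiteWiseDataCrawler",
    "DatabaseName": "iot_sitewise_asset_database",
    "Targets": {"S3Targets": [{"Path": "s3://DOC-EXAMPLE-BUCKET1"}]},
    # Hourly, at minute 0. Glue schedules use six-field cron expressions.
    "Schedule": "cron(0 * * * ? *)",
}
print(json.dumps(params, indent=2))
```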
To run a crawler

1. On the Crawlers page, select the check box for your new crawler, and then choose Run crawler.
2. Wait until the crawler finishes and has a status of Ready. The crawler can take several minutes to run, and its status updates automatically.
3. In the navigation pane, choose Tables. You should see two new tables: asset_metadata and asset_property_updates.
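The run-and-wait steps can also be scripted. With boto3, you would start the crawler with `glue.start_crawler(Name="IoTSiteWiseDataCrawler")` and then poll `glue.get_crawler` until the crawler's State returns to READY. The stand-in `get_state` callable in the demo below is hypothetical so the sketch runs without AWS credentials:

```python
import time

def wait_for_crawler(get_state, poll_seconds=15, timeout_seconds=900):
    """Poll until the crawler's state returns to READY.

    get_state is any callable returning the crawler state string; with
    boto3 it would be:
        lambda: glue.get_crawler(Name="IoTSiteWiseDataCrawler")["Crawler"]["State"]
    """
    deadline = time.time() + timeout_seconds
    while time.time() < deadline:
        state = get_state()
        if state == "READY":
            return state
        time.sleep(poll_seconds)
    raise TimeoutError("crawler did not reach READY in time")

# Demo with a stand-in for the Glue API: RUNNING, then STOPPING, then READY.
states = iter(["RUNNING", "STOPPING", "READY"])
print(wait_for_crawler(lambda: next(states), poll_seconds=0))  # prints: READY
```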