Setup required when the crawler and registered Amazon S3 location or Data Catalog table reside in the same account (in-account crawling)

Configuring a crawler to use Lake Formation credentials

You can configure a crawler to use Amazon Lake Formation credentials to access an Amazon S3 data store or a Data Catalog table with an underlying Amazon S3 location within the same Amazon Web Services account or another Amazon Web Services account. You can configure an existing Data Catalog table as a crawler's target, if the crawler and the Data Catalog table reside in the same account. Currently, only a single catalog target with a single catalog table is allowed when using a Data Catalog table as a crawler’s target.

Note

When you are defining a Data Catalog table as a crawler target, make sure that the underlying location of the Data Catalog table is an Amazon S3 location. Crawlers that use Lake Formation credentials only support Data Catalog targets with underlying Amazon S3 locations.
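The single-target, single-table restriction described above can be checked before a crawler is created. The following is a minimal sketch, not an official utility: the helper names and the table name `sales_events` are hypothetical, and the `CatalogTargets`, `DatabaseName`, and `Tables` keys are assumed to follow the Glue `CreateCrawler` request shape.

```python
def catalog_targets(database_name, table_name):
    """Build a crawler Targets value that points at one Data Catalog table."""
    return {"CatalogTargets": [{"DatabaseName": database_name, "Tables": [table_name]}]}


def check_single_catalog_table(targets):
    """Raise ValueError unless targets holds exactly one catalog target with one table."""
    catalog = targets.get("CatalogTargets", [])
    if len(catalog) != 1:
        raise ValueError("expected exactly one catalog target, got %d" % len(catalog))
    tables = catalog[0].get("Tables", [])
    if len(tables) != 1:
        raise ValueError("expected exactly one table, got %d" % len(tables))
    return targets


# Hypothetical table name; the database name is borrowed from the CLI example on this page.
targets = check_single_catalog_table(catalog_targets("prod-run-db", "sales_events"))
print(targets)
```

The check covers only the count restriction. Whether the table's underlying location is an Amazon S3 path still has to be confirmed against the Data Catalog.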

Setup required when the crawler and registered Amazon S3 location or Data Catalog table reside in the same account (in-account crawling)

To allow the crawler to access a data store or Data Catalog table by using Lake Formation credentials, you must register the data location with Lake Formation. The crawler's IAM role must also have permission to read data from the registered Amazon S3 location.

You can complete the following configuration steps using the Amazon Web Services Management Console or Amazon Command Line Interface (Amazon CLI).

Amazon Web Services Management Console

1. Before configuring a crawler to access the crawler source, register the data location of the data store or the Data Catalog with Lake Formation. In the Lake Formation console (https://console.amazonaws.cn/lakeformation/), register an Amazon S3 location as the root location of your data lake in the Amazon Web Services account where the crawler is defined. For more information, see Registering an Amazon S3 location.
2. Grant Data location permissions to the IAM role that's used for the crawler run so that the crawler can read the data from the destination in Lake Formation. For more information, see Granting data location permissions (same account).
3. Grant the crawler role access permissions (Create) on the database that is specified as the output database. For more information, see Granting database permissions using the Lake Formation console and the named resource method.
4. In the IAM console (https://console.amazonaws.cn/iam/), create an IAM role for the crawler. Attach a policy to the role that allows the lakeformation:GetDataAccess action.
5. In the Amazon Glue console (https://console.amazonaws.cn/glue/), while configuring the crawler, select the option Use Lake Formation credentials for crawling Amazon S3 data source.
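The console steps above have rough API equivalents. The sketch below only builds the request bodies and prints them as JSON; it does not call any service. It assumes the Lake Formation `RegisterResource` and `GrantPermissions` request shapes, assumes that the console's Create permission corresponds to `CREATE_TABLE`, and derives the bucket ARN from the example bucket name used on this page. The `arn:aws` partition follows the role ARN in the CLI example and may need to be `arn:aws-cn` in your account.

```python
import json

# Values borrowed from the CLI example on this page; the bucket ARN is derived, not quoted.
BUCKET_ARN = "arn:aws:s3:::amzn-s3-demo-bucket"
CRAWLER_ROLE_ARN = (
    "arn:aws:iam::111122223333:role/service-role/AWSGlueServiceRole-prod-test-run-role"
)
OUTPUT_DATABASE = "prod-run-db"


def register_resource_request(bucket_arn):
    # Register the S3 location. Using the service-linked role is an assumption;
    # a custom registration role could be passed as RoleArn instead.
    return {"ResourceArn": bucket_arn, "UseServiceLinkedRole": True}


def data_location_grant(role_arn, bucket_arn):
    # Grant data location permissions to the crawler role.
    return {
        "Principal": {"DataLakePrincipalIdentifier": role_arn},
        "Resource": {"DataLocation": {"ResourceArn": bucket_arn}},
        "Permissions": ["DATA_LOCATION_ACCESS"],
    }


def database_grant(role_arn, database_name):
    # Let the crawler role create tables in the output database
    # (assumed mapping of the console's "Create" permission).
    return {
        "Principal": {"DataLakePrincipalIdentifier": role_arn},
        "Resource": {"Database": {"Name": database_name}},
        "Permissions": ["CREATE_TABLE"],
    }


def get_data_access_policy():
    # IAM policy document that allows the lakeformation:GetDataAccess action.
    return {
        "Version": "2012-10-17",
        "Statement": [
            {"Effect": "Allow", "Action": "lakeformation:GetDataAccess", "Resource": "*"}
        ],
    }


payloads = {
    "register-resource": register_resource_request(BUCKET_ARN),
    "grant-permissions (data location)": data_location_grant(CRAWLER_ROLE_ARN, BUCKET_ARN),
    "grant-permissions (database)": database_grant(CRAWLER_ROLE_ARN, OUTPUT_DATABASE),
    "role policy document": get_data_access_policy(),
}
for name, body in payloads.items():
    print(name)
    print(json.dumps(body, indent=2))
```

Under these assumptions, the first three documents could be passed to `aws lakeformation register-resource` and `aws lakeformation grant-permissions` with `--cli-input-json`, and the last one to `aws iam put-role-policy` as the `--policy-document` value.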

Note
The accountId field is optional for in-account crawling.
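Because the account ID is optional for in-account crawling, the `LakeFormationConfiguration` block can be built with or without it. A minimal sketch, assuming the key names used in the CLI example on this page; the helper name is hypothetical.

```python
def lake_formation_configuration(account_id=None):
    """Build LakeFormationConfiguration; AccountId is left out when not supplied."""
    config = {"UseLakeFormationCredentials": True}
    if account_id is not None:
        config["AccountId"] = account_id
    return config


# In-account crawling: the account ID can be omitted.
print(lake_formation_configuration())
# It can also be stated explicitly.
print(lake_formation_configuration("111122223333"))
```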

Amazon CLI


aws glue --profile demo create-crawler --debug --cli-input-json '{
  "Name": "prod-test-crawler",
  "Role": "arn:aws:iam::111122223333:role/service-role/AWSGlueServiceRole-prod-test-run-role",
  "DatabaseName": "prod-run-db",
  "Description": "",
  "Targets": {
    "S3Targets": [
      {
        "Path": "s3://amzn-s3-demo-bucket"
      }
    ]
  },
  "SchemaChangePolicy": {
    "UpdateBehavior": "LOG",
    "DeleteBehavior": "LOG"
  },
  "RecrawlPolicy": {
    "RecrawlBehavior": "CRAWL_EVERYTHING"
  },
  "LineageConfiguration": {
    "CrawlerLineageSettings": "DISABLE"
  },
  "LakeFormationConfiguration": {
    "UseLakeFormationCredentials": true,
    "AccountId": "111122223333"
  },
  "Configuration": "{\"Version\": 1.0, \"CrawlerOutput\": {\"Partitions\": {\"AddOrUpdateBehavior\": \"InheritFromTable\"}, \"Tables\": {\"AddOrUpdateBehavior\": \"MergeNewColumns\"}}, \"Grouping\": {\"TableGroupingPolicy\": \"CombineCompatibleSchemas\"}}",
  "CrawlerSecurityConfiguration": "",
  "Tags": {
    "KeyName": ""
  }
}'
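The Glue API reference describes the crawler `Configuration` parameter as a string that contains JSON, so the nested settings have to be serialized before they are placed in the request. Escaping that string by hand is error-prone. The sketch below, which reuses the values from the command above, generates the `--cli-input-json` argument with Python's standard library. It prints the command rather than running it, and it includes only a subset of the fields for brevity.

```python
import json
import shlex

# Nested crawler settings, written as a normal Python dict.
crawler_configuration = {
    "Version": 1.0,
    "CrawlerOutput": {
        "Partitions": {"AddOrUpdateBehavior": "InheritFromTable"},
        "Tables": {"AddOrUpdateBehavior": "MergeNewColumns"},
    },
    "Grouping": {"TableGroupingPolicy": "CombineCompatibleSchemas"},
}

crawler_input = {
    "Name": "prod-test-crawler",
    "Role": "arn:aws:iam::111122223333:role/service-role/AWSGlueServiceRole-prod-test-run-role",
    "DatabaseName": "prod-run-db",
    "Targets": {"S3Targets": [{"Path": "s3://amzn-s3-demo-bucket"}]},
    "LakeFormationConfiguration": {
        "UseLakeFormationCredentials": True,
        "AccountId": "111122223333",
    },
    # Configuration is serialized to a JSON string, not embedded as an object.
    "Configuration": json.dumps(crawler_configuration),
}

# json.dumps escapes the inner quotes; shlex.quote makes the result safe for a POSIX shell.
cli_json = json.dumps(crawler_input)
print("aws glue create-crawler --cli-input-json " + shlex.quote(cli_json))
```

The printed command can be reviewed and then pasted into a shell, with `--profile` or other global options added as needed.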
