AWS::Glue::Crawler

The AWS::Glue::Crawler resource specifies an Amazon Glue crawler. For more information, see Cataloging Tables with a Crawler and Crawler Structure in the Amazon Glue Developer Guide.

Syntax

To declare this entity in your Amazon CloudFormation template, use the following syntax:

JSON


{
  "Type" : "AWS::Glue::Crawler",
  "Properties" : {
      "Classifiers" : [ String, ... ],
      "Configuration" : String,
      "CrawlerSecurityConfiguration" : String,
      "DatabaseName" : String,
      "Description" : String,
      "LakeFormationConfiguration" : LakeFormationConfiguration,
      "Name" : String,
      "RecrawlPolicy" : RecrawlPolicy,
      "Role" : String,
      "Schedule" : Schedule,
      "SchemaChangePolicy" : SchemaChangePolicy,
      "TablePrefix" : String,
      "Tags" : [ Tag, ... ],
      "Targets" : Targets
    }
}

YAML


Type: AWS::Glue::Crawler
Properties:
  Classifiers: 
    - String
  Configuration: String
  CrawlerSecurityConfiguration: String
  DatabaseName: String
  Description: String
  LakeFormationConfiguration: 
    LakeFormationConfiguration
  Name: String
  RecrawlPolicy: 
    RecrawlPolicy
  Role: String
  Schedule: 
    Schedule
  SchemaChangePolicy: 
    SchemaChangePolicy
  TablePrefix: String
  Tags: 
    - Tag
  Targets: 
    Targets

Properties

Classifiers

A list of UTF-8 strings that specify the names of custom classifiers that are associated with the crawler.

Required: No

Type: Array of String

Update requires: No interruption

Configuration

Crawler configuration information. This versioned JSON string allows users to specify aspects of a crawler's behavior. For more information, see Configuring a Crawler.

Required: No

Type: String

Update requires: No interruption
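
Because Configuration is a JSON document passed as a string, escaping it inline can be hard to read. As a minimal sketch, a YAML template could use a literal block scalar to keep the JSON legible. The settings reuse the CrawlerOutput values from the Crawler Configuration example on this page. This sketch assumes the service parses the string as ordinary JSON, so the added whitespace is harmless.

```yaml
# Fragment of the Properties section of an AWS::Glue::Crawler resource.
# The "|" block scalar passes the JSON below to the crawler as one string.
Configuration: |
  {
    "Version": 1.0,
    "CrawlerOutput": {
      "Partitions": { "AddOrUpdateBehavior": "InheritFromTable" },
      "Tables": { "AddOrUpdateBehavior": "MergeNewColumns" }
    }
  }
```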

CrawlerSecurityConfiguration

The name of the SecurityConfiguration structure to be used by this crawler.

Required: No

Type: String

Minimum: 0

Maximum: 128

Update requires: No interruption

DatabaseName

The name of the database in which the crawler's output is stored.

Required: No

Type: String

Update requires: No interruption

Description

A description of the crawler.

Required: No

Type: String

Pattern: [\u0020-\uD7FF\uE000-\uFFFD\uD800\uDC00-\uDBFF\uDFFF\r\n\t]*

Minimum: 0

Maximum: 2048

Update requires: No interruption

LakeFormationConfiguration

Specifies whether the crawler should use Amazon Lake Formation credentials instead of the IAM role credentials.

Required: No

Type: LakeFormationConfiguration

Update requires: No interruption
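
As an illustration, the fragment below sketches how this property might be set. The subproperty names UseLakeFormationCredentials and AccountId come from the LakeFormationConfiguration property type rather than from this page, and the account ID is a placeholder.

```yaml
# Fragment of the Properties section of an AWS::Glue::Crawler resource.
LakeFormationConfiguration:
  # Use Lake Formation credentials instead of the IAM role credentials.
  UseLakeFormationCredentials: true
  # Placeholder account ID; typically only relevant for cross-account crawls.
  AccountId: "111122223333"
```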

Name

The name of the crawler.

Required: No

Type: String

Pattern: [\u0020-\uD7FF\uE000-\uFFFD\uD800\uDC00-\uDBFF\uDFFF\t]*

Minimum: 1

Maximum: 255

Update requires: Replacement

RecrawlPolicy

A policy that specifies whether to crawl the entire dataset again, or to crawl only folders that were added since the last crawler run.

Required: No

Type: RecrawlPolicy

Update requires: No interruption
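
A hedged sketch of an incremental crawl follows. The RecrawlBehavior subproperty and its values (CRAWL_EVERYTHING, CRAWL_NEW_FOLDERS_ONLY) are taken from the RecrawlPolicy property type, not from this page. The accompanying SchemaChangePolicy is set to LOG on the assumption that incremental crawls require it; check the Amazon Glue Developer Guide before relying on that.

```yaml
# Fragment of the Properties section of an AWS::Glue::Crawler resource.
RecrawlPolicy:
  # CRAWL_EVERYTHING recrawls the entire dataset;
  # CRAWL_NEW_FOLDERS_ONLY crawls only folders added since the last run.
  RecrawlBehavior: CRAWL_NEW_FOLDERS_ONLY
SchemaChangePolicy:
  # Assumed requirement for incremental crawls: log changes only.
  UpdateBehavior: LOG
  DeleteBehavior: LOG
```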

Role

The Amazon Resource Name (ARN) of an IAM role that's used to access customer resources, such as Amazon Simple Storage Service (Amazon S3) data.

Required: Yes

Type: String

Update requires: No interruption

Schedule

For scheduled crawlers, the schedule on which the crawler runs.

Required: No

Type: Schedule

Update requires: No interruption

SchemaChangePolicy

The policy that specifies update and delete behaviors for the crawler. The policy tells the crawler what to do when it detects a change in a table that already exists in the customer's database at the time of the crawl. The SchemaChangePolicy does not affect whether or how new tables and partitions are added. New tables and partitions are always created regardless of the SchemaChangePolicy on a crawler.

The SchemaChangePolicy consists of two components, UpdateBehavior and DeleteBehavior.

Required: No

Type: SchemaChangePolicy

Update requires: No interruption
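
The two components can be sketched as follows. The value lists in the comments come from the SchemaChangePolicy property type rather than from this page, and the combination shown is only one possible choice.

```yaml
# Fragment of the Properties section of an AWS::Glue::Crawler resource.
SchemaChangePolicy:
  # What to do when an existing table's schema changes:
  # LOG or UPDATE_IN_DATABASE.
  UpdateBehavior: UPDATE_IN_DATABASE
  # What to do when a crawled object no longer exists:
  # LOG, DELETE_FROM_DATABASE, or DEPRECATE_IN_DATABASE.
  DeleteBehavior: DEPRECATE_IN_DATABASE
```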

TablePrefix

The prefix added to the names of tables that are created.

Required: No

Type: String

Minimum: 0

Maximum: 128

Update requires: No interruption

Tags

The tags to use with this crawler.

Required: No

Type: Array of Tag

Update requires: No interruption

Targets

A collection of targets to crawl.

Required: Yes

Type: Targets

Update requires: No interruption
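
For illustration, a Targets value with one Amazon S3 target and one JDBC target might look like the sketch below. The subproperty names come from the Targets, S3Target, and JdbcTarget property types; the bucket path, exclusion pattern, connection name, and database path are all hypothetical placeholders.

```yaml
# Fragment of the Properties section of an AWS::Glue::Crawler resource.
Targets:
  S3Targets:
    # Placeholder bucket and prefix.
    - Path: "s3://amzn-s3-demo-bucket/logs/"
      # Glob patterns for objects the crawler should skip.
      Exclusions:
        - "**.tmp"
  JdbcTargets:
    # Hypothetical name of an existing Glue connection.
    - ConnectionName: my-jdbc-connection
      # Placeholder database path; "%" matches all tables.
      Path: "salesdb/%"
```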

Return values

Ref

When you pass the logical ID of this resource to the intrinsic Ref function, Ref returns the crawler name.

For more information about using the Ref function, see Ref.
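
As a small sketch, the crawler name returned by Ref can be exposed as a stack output. MyCrawler2 is the logical ID used in the example template on this page; the output name CrawlerName is an arbitrary choice.

```yaml
Outputs:
  CrawlerName:
    Description: Name of the crawler, as returned by Ref
    # Ref on an AWS::Glue::Crawler resource returns the crawler name.
    Value: !Ref MyCrawler2
```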

Examples

Create a crawler

The following example creates a crawler for an Amazon S3 target.

JSON


{
    "Description": "Amazon Glue crawler test",
    "Resources": {
        "MyRole": {
            "Type": "AWS::IAM::Role",
            "Properties": {
                "AssumeRolePolicyDocument": {
                    "Version": "2012-10-17",
                    "Statement": [
                        {
                            "Effect": "Allow",
                            "Principal": {
                                "Service": [
                                    "glue.amazonaws.com"
                                ]
                            },
                            "Action": [
                                "sts:AssumeRole"
                            ]
                        }
                    ]
                },
                "Path": "/",
                "ManagedPolicyArns": ["arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole"],
                "Policies": [
                    {
                        "PolicyName": "S3BucketAccessPolicy",
                        "PolicyDocument": {
                            "Version": "2012-10-17",
                            "Statement": [
                                {
                                    "Effect": "Allow",
                                    "Action": [
                                        "s3:GetObject",
                                        "s3:PutObject"
                                    ],
                                    "Resource": {
                                        "Fn::Join": [
                                            "", 
                                            [
                                                {
                                                    "Fn::GetAtt": ["MyS3Bucket", "Arn"]
                                                },
                                                "*"
                                            ]
                                        ]
                                    }
                                }
                            ]
                        }
                    }
                ]
            }
        },
        "MyDatabase": {
            "Type": "AWS::Glue::Database",
            "Properties": {
                "CatalogId": {
                    "Ref": "AWS::AccountId"
                },
                "DatabaseInput": {
                    "Name": "dbcrawler",
                    "Description": "TestDatabaseDescription",
                    "LocationUri": "TestLocationUri",
                    "Parameters": {
                        "key1": "value1",
                        "key2": "value2"
                    }
                }
            }
        },
        "MyClassifier": {
            "Type": "AWS::Glue::Classifier",
            "Properties": {
                "GrokClassifier": {
                    "Name": "CrawlerClassifier",
                    "Classification": "wikiData",
                    "GrokPattern": "%{NOTSPACE:language} %{NOTSPACE:page_title} %{NUMBER:hits:long} %{NUMBER:retrieved_size:long}"
                }
            }
        },
        "MyS3Bucket": {
            "Type": "AWS::S3::Bucket",
            "Properties": {
                "BucketName": "crawlertesttarget",
                "AccessControl": "BucketOwnerFullControl"
            }
        },
        "MyCrawler2": {
            "Type": "AWS::Glue::Crawler",
            "Properties": {
                "Name": "testcrawler1",
                "Role": {
                    "Fn::GetAtt": [
                        "MyRole",
                        "Arn"
                    ]
                },
                "DatabaseName": {
                    "Ref": "MyDatabase"
                },
                "Classifiers": [
                    {
                        "Ref": "MyClassifier"
                    }
                ],
                "Targets": {
                    "S3Targets": [
                        {
                            "Path": {
                                "Ref": "MyS3Bucket"
                            }
                        }
                    ]
                },
                "SchemaChangePolicy": {
                    "UpdateBehavior": "UPDATE_IN_DATABASE",
                    "DeleteBehavior": "LOG"
                },
                "Tags": {
                    "key1": "value1"
                },
                "Schedule": {
                    "ScheduleExpression": "cron(0/10 * ? * MON-FRI *)"
                }
            }
        }
    }
}

YAML


Description: "Amazon Glue crawler test"
Resources:
  MyRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: "2012-10-17"
        Statement:
          -
            Effect: "Allow"
            Principal:
              Service:
                - "glue.amazonaws.com"
            Action:
              - "sts:AssumeRole"
      Path: "/"
      ManagedPolicyArns:
        - "arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole"
      Policies:
        -
          PolicyName: "S3BucketAccessPolicy"
          PolicyDocument:
            Version: "2012-10-17"
            Statement:
              -
                Effect: "Allow"
                Action: 
                  - "s3:GetObject"
                  - "s3:PutObject"
                Resource: 
                  !Join
                    - ''
                    - - !GetAtt MyS3Bucket.Arn
                      - "*"
 
  MyDatabase:
    Type: AWS::Glue::Database
    Properties:
      CatalogId: !Ref AWS::AccountId
      DatabaseInput:
        Name: "dbcrawler"
        Description: "TestDatabaseDescription"
        LocationUri: "TestLocationUri"
        Parameters:
          key1 : "value1"
          key2 : "value2"
 
  MyClassifier:
    Type: AWS::Glue::Classifier
    Properties:
      GrokClassifier:
        Name: "CrawlerClassifier"
        Classification: "wikiData"
        GrokPattern: "%{NOTSPACE:language} %{NOTSPACE:page_title} %{NUMBER:hits:long} %{NUMBER:retrieved_size:long}"
 
  MyS3Bucket:
    Type: AWS::S3::Bucket
    Properties:
      BucketName: "crawlertesttarget"
      AccessControl: "BucketOwnerFullControl"
 
  MyCrawler2:
    Type: AWS::Glue::Crawler
    Properties:
      Name: "testcrawler1"
      Role: !GetAtt MyRole.Arn
      DatabaseName: !Ref MyDatabase
      Classifiers:
        - !Ref MyClassifier
      Targets:
        S3Targets:
          - Path: !Ref MyS3Bucket
      SchemaChangePolicy:
        UpdateBehavior: "UPDATE_IN_DATABASE"
        DeleteBehavior: "LOG"
      Tags:
        key1: "value1"
      Schedule:
        ScheduleExpression: "cron(0/10 * ? * MON-FRI *)"

Crawler Configuration

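The following example specifies a configuration that controls a crawler's behavior. The Role, DatabaseName, and Targets values are placeholders; the Amazon S3 path reuses the bucket name from the previous example.

JSON


{
    "Type": "AWS::Glue::Crawler",
    "Properties": {
        "Role": "role1",
        "Classifiers": [],
        "Description": "example crawler",
        "SchemaChangePolicy": {
            "UpdateBehavior": "UPDATE_IN_DATABASE",
            "DeleteBehavior": "LOG"
        },
        "Schedule": {
            "ScheduleExpression": "cron(0/10 * ? * MON-FRI *)"
        },
        "DatabaseName": "test",
        "Targets": {
            "S3Targets": [
                {
                    "Path": "s3://crawlertesttarget"
                }
            ]
        },
        "TablePrefix": "test-",
        "Name": "my-crawler",
        "Configuration": "{\"Version\":1.0,\"CrawlerOutput\":{\"Partitions\":{\"AddOrUpdateBehavior\":\"InheritFromTable\"},\"Tables\":{\"AddOrUpdateBehavior\":\"MergeNewColumns\"}}}"
    }
}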

YAML


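Type: AWS::Glue::Crawler
Properties:
  Role: role1
  Classifiers: []
  Description: example crawler
  SchemaChangePolicy:
    UpdateBehavior: UPDATE_IN_DATABASE
    DeleteBehavior: LOG
  Schedule:
    ScheduleExpression: "cron(0/10 * ? * MON-FRI *)"
  DatabaseName: test
  Targets:
    S3Targets:
      # Placeholder path; reuses the bucket name from the previous example.
      - Path: "s3://crawlertesttarget"
  TablePrefix: test-
  Name: my-crawler
  Configuration: "{\"Version\":1.0,\"CrawlerOutput\":{\"Partitions\":{\"AddOrUpdateBehavior\":\"InheritFromTable\"},\"Tables\":{\"AddOrUpdateBehavior\":\"MergeNewColumns\"}}}"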
