AWS::Glue::Crawler - AWS CloudFormation

AWS::Glue::Crawler

The AWS::Glue::Crawler resource specifies an AWS Glue crawler. For more information, see Cataloging Tables with a Crawler and Crawler Structure in the AWS Glue Developer Guide.

Syntax

To declare this entity in your AWS CloudFormation template, use the following syntax:

JSON

{
  "Type" : "AWS::Glue::Crawler",
  "Properties" : {
      "Classifiers" : [ String, ... ],
      "Configuration" : String,
      "CrawlerSecurityConfiguration" : String,
      "DatabaseName" : String,
      "Description" : String,
      "Name" : String,
      "Role" : String,
      "Schedule" : Schedule,
      "SchemaChangePolicy" : SchemaChangePolicy,
      "TablePrefix" : String,
      "Tags" : Json,
      "Targets" : Targets
    }
}

YAML

Type: AWS::Glue::Crawler
Properties:
  Classifiers:
    - String
  Configuration: String
  CrawlerSecurityConfiguration: String
  DatabaseName: String
  Description: String
  Name: String
  Role: String
  Schedule: Schedule
  SchemaChangePolicy: SchemaChangePolicy
  TablePrefix: String
  Tags: Json
  Targets: Targets

Properties

Classifiers

A list of UTF-8 strings that specify the custom classifiers associated with the crawler.

Required: No

Type: List of String

Update requires: No interruption

Configuration

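Crawler configuration information. This versioned JSON string lets you specify aspects of a crawler's behavior. For more information, see Configuring a Crawler.

Required: No

Type: String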

Update requires: No interruption
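Because Configuration is a JSON document passed as a string, writing the escaped form by hand is error-prone. The following is a minimal sketch, assuming you generate the template (or just this value) from Python. The settings are the ones used in the crawler configuration example on this page; the variable names are hypothetical.

```python
import json

# Behavior settings copied from the crawler configuration example on this page.
crawler_config = {
    "Version": 1.0,
    "CrawlerOutput": {
        "Partitions": {"AddOrUpdateBehavior": "InheritFromTable"},
        "Tables": {"AddOrUpdateBehavior": "MergeNewColumns"},
    },
}

# Configuration expects a string, so serialize the dict to compact JSON.
configuration_string = json.dumps(crawler_config, separators=(",", ":"))

# Hypothetical property fragment; the other crawler properties are omitted.
crawler_properties = {"Configuration": configuration_string}

# Serializing the fragment again produces the backslash-escaped form
# that appears in a JSON template.
print(json.dumps(crawler_properties, indent=2))
```

Serializing in two steps keeps the inner document readable in source control and leaves the escaping to the JSON library.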

CrawlerSecurityConfiguration

The name of the SecurityConfiguration structure to be used by this crawler.

Required: No

Type: String

Update requires: No interruption

DatabaseName

The name of the database in which the crawler's output is stored.

Required: No

Type: String

Update requires: No interruption

Description

A description of the crawler.

Required: No

Type: String

Update requires: No interruption

Name

The name of the crawler.

Required: No

Type: String

Update requires: Replacement

Role

The Amazon Resource Name (ARN) of an IAM role that is used to access customer resources, such as Amazon Simple Storage Service (Amazon S3) data.

Required: Yes

Type: String

Update requires: No interruption

Schedule

For scheduled crawlers, the schedule on which the crawler runs.

Required: No

Type: Schedule

Update requires: No interruption
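The example later on this page sets the schedule through a ScheduleExpression such as cron(0/10 * ? * MON-FRI *). The following is a rough sketch of a pre-deployment shape check, assuming the expression is wrapped in cron(...) and has six space-separated fields (minutes, hours, day of month, month, day of week, year). The function names are hypothetical, and the check does not validate the field values themselves.

```python
import re

# Assumed shape: cron(<minutes> <hours> <day-of-month> <month> <day-of-week> <year>)
_CRON_WRAPPER = re.compile(r"^cron\((.*)\)$")


def looks_like_glue_cron(expression: str) -> bool:
    """Return True if the expression is wrapped in cron(...) and has six fields."""
    match = _CRON_WRAPPER.match(expression.strip())
    if match is None:
        return False
    return len(match.group(1).split()) == 6


def make_schedule(expression: str) -> dict:
    """Build a Schedule property value, rejecting obviously malformed input."""
    if not looks_like_glue_cron(expression):
        raise ValueError(f"not a six-field cron(...) expression: {expression!r}")
    return {"ScheduleExpression": expression}


schedule = make_schedule("cron(0/10 * ? * MON-FRI *)")
```

A check like this only catches structural mistakes, such as a missing cron( wrapper or a five-field Unix-style expression, before the template is submitted.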

SchemaChangePolicy

The policy that specifies update and delete behaviors for the crawler.

Required: No

Type: SchemaChangePolicy

Update requires: No interruption

TablePrefix

The prefix added to the names of tables that are created.

Required: No

Type: String

Update requires: No interruption

Tags

The tags to use with this crawler.

Required: No

Type: Json

Update requires: No interruption

Targets

A collection of targets to crawl.

Required: Yes

Type: Targets

Update requires: No interruption
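As an illustration of the Targets shape, the following sketch builds a value with S3Targets entries, the only target kind that appears in the examples on this page. The helper name s3_targets is hypothetical, and rejecting an empty list is a choice made for this sketch, not a documented rule.

```python
def s3_targets(paths):
    """Build a Targets value containing one S3Targets entry per path.

    Each path may be a plain string or a CloudFormation intrinsic
    such as {"Ref": "MyS3Bucket"}.
    """
    paths = list(paths)
    if not paths:
        # Sketch-level guard: a crawler with nothing to crawl is likely a mistake.
        raise ValueError("at least one path is needed")
    return {"S3Targets": [{"Path": path} for path in paths]}


# Mirrors the Targets block of the Amazon S3 example on this page.
targets = s3_targets([{"Ref": "MyS3Bucket"}])
```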

Return values

Ref

When you pass the logical ID of this resource to the intrinsic Ref function, Ref returns the crawler name.

For more information about using the Ref function, see Ref.
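One common use of this return value is to expose the crawler name as a stack output. The sketch below assumes a template assembled in Python. The output key CrawlerName and the helper name are hypothetical; MyCrawler2 is the logical ID used in the example on this page.

```python
import json


def crawler_name_output(logical_id: str) -> dict:
    """Return an Outputs entry whose value is Ref of the crawler's logical ID."""
    return {
        "CrawlerName": {  # hypothetical output key
            "Description": "Name of the AWS Glue crawler",
            "Value": {"Ref": logical_id},
        }
    }


template_fragment = {"Outputs": crawler_name_output("MyCrawler2")}
print(json.dumps(template_fragment, indent=2))
```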

Examples

The following example creates a crawler for an Amazon S3 target.

JSON

{
    "Description": "AWS Glue Crawler Test",
    "Resources": {
        "MyRole": {
            "Type": "AWS::IAM::Role",
            "Properties": {
                "AssumeRolePolicyDocument": {
                    "Version": "2012-10-17",
                    "Statement": [
                        {
                            "Effect": "Allow",
                            "Principal": {
                                "Service": [
                                    "glue.amazonaws.com"
                                ]
                            },
                            "Action": [
                                "sts:AssumeRole"
                            ]
                        }
                    ]
                },
                "Path": "/",
                "Policies": [
                    {
                        "PolicyName": "root",
                        "PolicyDocument": {
                            "Version": "2012-10-17",
                            "Statement": [
                                {
                                    "Effect": "Allow",
                                    "Action": "*",
                                    "Resource": "*"
                                }
                            ]
                        }
                    }
                ]
            }
        },
        "MyDatabase": {
            "Type": "AWS::Glue::Database",
            "Properties": {
                "CatalogId": {
                    "Ref": "AWS::AccountId"
                },
                "DatabaseInput": {
                    "Name": "dbCrawler",
                    "Description": "TestDatabaseDescription",
                    "LocationUri": "TestLocationUri",
                    "Parameters": {
                        "key1": "value1",
                        "key2": "value2"
                    }
                }
            }
        },
        "MyClassifier": {
            "Type": "AWS::Glue::Classifier",
            "Properties": {
                "GrokClassifier": {
                    "Name": "CrawlerClassifier",
                    "Classification": "wikiData",
                    "GrokPattern": "%{NOTSPACE:language} %{NOTSPACE:page_title} %{NUMBER:hits:long} %{NUMBER:retrieved_size:long}"
                }
            }
        },
        "MyS3Bucket": {
            "Type": "AWS::S3::Bucket",
            "Properties": {
                "BucketName": "crawlertesttarget",
                "AccessControl": "BucketOwnerFullControl"
            }
        },
        "MyCrawler2": {
            "Type": "AWS::Glue::Crawler",
            "Properties": {
                "Name": "testcrawler1",
                "Role": {
                    "Fn::GetAtt": [
                        "MyRole",
                        "Arn"
                    ]
                },
                "DatabaseName": {
                    "Ref": "MyDatabase"
                },
                "Classifiers": [
                    {
                        "Ref": "MyClassifier"
                    }
                ],
                "Targets": {
                    "S3Targets": [
                        {
                            "Path": {
                                "Ref": "MyS3Bucket"
                            }
                        }
                    ]
                },
                "SchemaChangePolicy": {
                    "UpdateBehavior": "UPDATE_IN_DATABASE",
                    "DeleteBehavior": "LOG"
                },
                "Schedule": {
                    "ScheduleExpression": "cron(0/10 * ? * MON-FRI *)"
                }
            }
        }
    }
}

YAML

Resources:
  MyRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: "Allow"
            Principal:
              Service:
                - "glue.amazonaws.com"
            Action:
              - "sts:AssumeRole"
      Path: "/"
      Policies:
        - PolicyName: "root"
          PolicyDocument:
            Version: "2012-10-17"
            Statement:
              - Effect: "Allow"
                Action: "*"
                Resource: "*"
  MyDatabase:
    Type: AWS::Glue::Database
    Properties:
      CatalogId: !Ref AWS::AccountId
      DatabaseInput:
        Name: "dbCrawler"
        Description: "TestDatabaseDescription"
        LocationUri: "TestLocationUri"
        Parameters:
          key1: "value1"
          key2: "value2"
  MyClassifier:
    Type: AWS::Glue::Classifier
    Properties:
      GrokClassifier:
        Name: "CrawlerClassifier"
        Classification: "wikiData"
        GrokPattern: "%{NOTSPACE:language} %{NOTSPACE:page_title} %{NUMBER:hits:long} %{NUMBER:retrieved_size:long}"
  MyS3Bucket:
    Type: AWS::S3::Bucket
    Properties:
      BucketName: "crawlertesttarget"
      AccessControl: "BucketOwnerFullControl"
  MyCrawler2:
    Type: AWS::Glue::Crawler
    Properties:
      Name: "testcrawler1"
      Role: !GetAtt MyRole.Arn
      DatabaseName: !Ref MyDatabase
      Classifiers:
        - !Ref MyClassifier
      Targets:
        S3Targets:
          - Path: !Ref MyS3Bucket
      SchemaChangePolicy:
        UpdateBehavior: "UPDATE_IN_DATABASE"
        DeleteBehavior: "LOG"
      Schedule:
        ScheduleExpression: "cron(0/10 * ? * MON-FRI *)"
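According to the property reference on this page, Role and Targets are the only required properties of a crawler. The following sketch shows one way to catch a missing required property before submitting a template that has been loaded into a Python dict. The function name and the REQUIRED_PROPERTIES constant are hypothetical, and the check looks only at presence, not at value types.

```python
# Required properties as listed in the property reference on this page.
REQUIRED_PROPERTIES = ("Role", "Targets")


def missing_required(properties: dict) -> list:
    """Return the required crawler property names absent from `properties`."""
    return [name for name in REQUIRED_PROPERTIES if name not in properties]


# Properties of MyCrawler2 from the example, abbreviated to a few entries.
example_properties = {
    "Name": "testcrawler1",
    "Role": {"Fn::GetAtt": ["MyRole", "Arn"]},
    "DatabaseName": {"Ref": "MyDatabase"},
    "Targets": {"S3Targets": [{"Path": {"Ref": "MyS3Bucket"}}]},
}

problems = missing_required(example_properties)
if problems:
    raise SystemExit(f"missing required properties: {problems}")
```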

Crawler configuration

The following example specifies a configuration that controls a crawler's behavior.

JSON

{
    "Type": "AWS::Glue::Crawler",
    "Properties": {
        "Role": "role1",
        "Classifiers": [],
        "Description": "example classifier",
        "SchemaChangePolicy": "",
        "Schedule": "Schedule",
        "DatabaseName": "test",
        "Targets": [],
        "TablePrefix": "test-",
        "Name": "my-crawler",
        "Configuration": "{\"Version\":1.0,\"CrawlerOutput\":{\"Partitions\":{\"AddOrUpdateBehavior\":\"InheritFromTable\"},\"Tables\":{\"AddOrUpdateBehavior\":\"MergeNewColumns\"}}}"
    }
}

YAML

Type: AWS::Glue::Crawler
Properties:
  Role: role1
  Classifiers:
    - ''
  Description: example classifier
  SchemaChangePolicy: ''
  Schedule: Schedule
  DatabaseName: test
  Targets:
    - ''
  TablePrefix: test-
  Name: my-crawler
  Configuration: "{\"Version\":1.0,\"CrawlerOutput\":{\"Partitions\":{\"AddOrUpdateBehavior\":\"InheritFromTable\"},\"Tables\":{\"AddOrUpdateBehavior\":\"MergeNewColumns\"}}}"