AWS CloudFormation
User Guide (API Version 2010-05-15)

AWS::Glue::Crawler

The AWS::Glue::Crawler resource specifies an AWS Glue crawler. For more information, see Cataloging Tables with a Crawler and Crawler Structure in the AWS Glue Developer Guide.

Syntax

To declare this entity in your AWS CloudFormation template, use the following syntax:

JSON

{
  "Type" : "AWS::Glue::Crawler",
  "Properties" : {
    "Role" : String,
    "Classifiers" : [ String, ... ],
    "Description" : String,
    "SchemaChangePolicy" : SchemaChangePolicy,
    "Schedule" : Schedule,
    "DatabaseName" : String,
    "Targets" : Targets,
    "TablePrefix" : String,
    "Name" : String
  }
}

YAML

Type: "AWS::Glue::Crawler"
Properties:
  Role: String
  Classifiers:
    - String
  Description: String
  SchemaChangePolicy: SchemaChangePolicy
  Schedule: Schedule
  DatabaseName: String
  Targets: Targets
  TablePrefix: String
  Name: String
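
As a minimal sketch, the declaration below sets only the three properties that this page marks as required: Role, DatabaseName, and Targets. The logical IDs MyCrawler, MyRole, MyDatabase, and MyS3Bucket are hypothetical placeholders that are assumed to be declared elsewhere in the same template. The S3Targets shape is borrowed from the example later on this page.

```yaml
Resources:
  MyCrawler:                          # hypothetical logical ID
    Type: AWS::Glue::Crawler
    Properties:
      Role: !GetAtt MyRole.Arn        # assumes an AWS::IAM::Role named MyRole
      DatabaseName: !Ref MyDatabase   # assumes an AWS::Glue::Database named MyDatabase
      Targets:
        S3Targets:
          - Path: !Ref MyS3Bucket     # assumes an AWS::S3::Bucket named MyS3Bucket
```

The optional properties (Classifiers, Description, SchemaChangePolicy, Schedule, TablePrefix, and Name) can be added to the same Properties block as needed.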

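Properties

Role

The Amazon Resource Name (ARN) of an IAM role that is used to access customer resources, such as Amazon S3 data.

Required: Yes

Type: String

Update requires: No interruption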

Classifiers

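A list of UTF-8 strings that specify the custom classifiers associated with the crawler.

Required: No

Type: List of String values

Update requires: No interruption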

Description

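A description of the crawler and where it should be used. It must match the URI address multi-line string pattern: [\u0020-\uD7FF\uE000-\uFFFD\uD800\uDC00-\uDBFF\uDFFF\r\n\t]*

Required: No

Type: String

Update requires: No interruption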

SchemaChangePolicy

The policy that specifies the update and delete behaviors for the crawler.

Required: No

Type: AWS Glue Crawler SchemaChangePolicy

Update requires: No interruption

Schedule

The schedule for the crawler.

Required: No

Type: AWS Glue Crawler Schedule

Update requires: No interruption

DatabaseName

The name of the database in which the crawler's output is stored.

Required: Yes

Type: String

Update requires: No interruption

Targets

The crawler targets.

Required: Yes

Type: AWS Glue Crawler Targets

Update requires: No interruption

TablePrefix

The table prefix used for the catalog tables that are created.

Required: No

Type: String

Update requires: No interruption

Name

The name of the crawler. It must match the single-line string pattern: [\u0020-\uD7FF\uE000-\uFFFD\uD800\uDC00-\uDBFF\uDFFF\t]*

Required: No

Type: String

Update requires: Replacement

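Return Values

Ref

When the logical ID of this resource is provided to the Ref intrinsic function, Ref returns the resource name.

For more information about using the Ref function, see Ref.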
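
As a sketch of how this return value might be used, the Outputs fragment below exposes the crawler's name as a stack output. It assumes the crawler is declared under the logical ID MyCrawler2, as in the example on this page. The output name CrawlerName is hypothetical.

```yaml
Outputs:
  CrawlerName:                  # hypothetical output name
    Description: Name of the AWS Glue crawler
    Value: !Ref MyCrawler2      # Ref on an AWS::Glue::Crawler returns the resource name
```

With the Name property set as in the example template, this output would be expected to resolve to testcrawler1.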

Example

The following example creates a crawler for an Amazon S3 target.

JSON

{
  "Description": "AWS Glue Crawler Test",
  "Resources": {
    "MyRole": {
      "Type": "AWS::IAM::Role",
      "Properties": {
        "AssumeRolePolicyDocument": {
          "Version": "2012-10-17",
          "Statement": [
            {
              "Effect": "Allow",
              "Principal": {
                "Service": [
                  "glue.amazonaws.com"
                ]
              },
              "Action": [
                "sts:AssumeRole"
              ]
            }
          ]
        },
        "Path": "/",
        "Policies": [
          {
            "PolicyName": "root",
            "PolicyDocument": {
              "Version": "2012-10-17",
              "Statement": [
                {
                  "Effect": "Allow",
                  "Action": "*",
                  "Resource": "*"
                }
              ]
            }
          }
        ]
      }
    },
    "MyDatabase": {
      "Type": "AWS::Glue::Database",
      "Properties": {
        "CatalogId": {
          "Ref": "AWS::AccountId"
        },
        "DatabaseInput": {
          "Name": "dbCrawler",
          "Description": "TestDatabaseDescription",
          "LocationUri": "TestLocationUri",
          "Parameters": {
            "key1": "value1",
            "key2": "value2"
          }
        }
      }
    },
    "MyClassifier": {
      "Type": "AWS::Glue::Classifier",
      "Properties": {
        "GrokClassifier": {
          "Name": "CrawlerClassifier",
          "Classification": "wikiData",
          "GrokPattern": "%{NOTSPACE:language} %{NOTSPACE:page_title} %{NUMBER:hits:long} %{NUMBER:retrieved_size:long}"
        }
      }
    },
    "MyS3Bucket": {
      "Type": "AWS::S3::Bucket",
      "Properties": {
        "BucketName": "crawlertesttarget",
        "AccessControl": "BucketOwnerFullControl"
      }
    },
    "MyCrawler2": {
      "Type": "AWS::Glue::Crawler",
      "Properties": {
        "Name": "testcrawler1",
        "Role": {
          "Fn::GetAtt": [
            "MyRole",
            "Arn"
          ]
        },
        "DatabaseName": {
          "Ref": "MyDatabase"
        },
        "Classifiers": [
          {
            "Ref": "MyClassifier"
          }
        ],
        "Targets": {
          "S3Targets": [
            {
              "Path": {
                "Ref": "MyS3Bucket"
              }
            }
          ]
        },
        "SchemaChangePolicy": {
          "UpdateBehavior": "UPDATE_IN_DATABASE",
          "DeleteBehavior": "LOG"
        },
        "Schedule": {
          "ScheduleExpression": "cron(0/10 * ? * MON-FRI *)"
        }
      }
    }
  }
}

YAML

Resources:
  MyRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: "Allow"
            Principal:
              Service:
                - "glue.amazonaws.com"
            Action:
              - "sts:AssumeRole"
      Path: "/"
      Policies:
        - PolicyName: "root"
          PolicyDocument:
            Version: "2012-10-17"
            Statement:
              - Effect: "Allow"
                Action: "*"
                Resource: "*"
  MyDatabase:
    Type: AWS::Glue::Database
    Properties:
      CatalogId: !Ref AWS::AccountId
      DatabaseInput:
        Name: "dbCrawler"
        Description: "TestDatabaseDescription"
        LocationUri: "TestLocationUri"
        Parameters:
          key1: "value1"
          key2: "value2"
  MyClassifier:
    Type: AWS::Glue::Classifier
    Properties:
      GrokClassifier:
        Name: "CrawlerClassifier"
        Classification: "wikiData"
        GrokPattern: "%{NOTSPACE:language} %{NOTSPACE:page_title} %{NUMBER:hits:long} %{NUMBER:retrieved_size:long}"
  MyS3Bucket:
    Type: AWS::S3::Bucket
    Properties:
      BucketName: "crawlertesttarget"
      AccessControl: "BucketOwnerFullControl"
  MyCrawler2:
    Type: AWS::Glue::Crawler
    Properties:
      Name: "testcrawler1"
      Role: !GetAtt MyRole.Arn
      DatabaseName: !Ref MyDatabase
      Classifiers:
        - !Ref MyClassifier
      Targets:
        S3Targets:
          - Path: !Ref MyS3Bucket
      SchemaChangePolicy:
        UpdateBehavior: "UPDATE_IN_DATABASE"
        DeleteBehavior: "LOG"
      Schedule:
        ScheduleExpression: "cron(0/10 * ? * MON-FRI *)"
