Creating Amazon Glue resources using Amazon CloudFormation templates

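Amazon CloudFormation is a service that can create many Amazon resources. Amazon Glue provides API operations to create objects in the Amazon Glue Data Catalog. However, it might be more convenient to define and create Amazon Glue objects and other related Amazon resource objects in an Amazon CloudFormation template file. You can then automate the process of creating the objects.

Amazon CloudFormation provides a simplified syntax, either JSON (JavaScript Object Notation) or YAML (YAML Ain't Markup Language), to express the creation of Amazon resources. You can use Amazon CloudFormation templates to define Data Catalog objects such as databases, tables, partitions, crawlers, classifiers, and connections. You can also define ETL objects such as jobs, triggers, and development endpoints. You create a template that describes all the Amazon resources you want, and Amazon CloudFormation provisions and configures those resources for you.

For more information, see What Is Amazon CloudFormation? and Working with Amazon CloudFormation Templates in the Amazon CloudFormation User Guide.

If you plan to use Amazon CloudFormation templates that are compatible with Amazon Glue, you, as an administrator, must grant access to Amazon CloudFormation and to the Amazon services and actions on which it depends. To grant permissions to create Amazon CloudFormation resources, attach the following policy to the users who work with Amazon CloudFormation: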

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "cloudformation:*"
      ],
      "Resource": "*"
    }
  ]
}

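The following table contains the actions that an Amazon CloudFormation template can perform on your behalf. It includes links to information about the Amazon resource types and their property types that you can add to an Amazon CloudFormation template.

To start, use the following sample templates and customize them with your own metadata. Then use the Amazon CloudFormation console to create an Amazon CloudFormation stack that adds the objects to Amazon Glue and any associated services. Many fields in an Amazon Glue object are optional. These templates illustrate the fields that are required or that are necessary for a working Amazon Glue object.

An Amazon CloudFormation template can be in either JSON or YAML format. These examples use YAML for easier readability. The examples contain comments (#) that describe the values defined in the templates.

Amazon CloudFormation templates can include a Parameters section. You can change this section in the sample text or when you submit the YAML file to the Amazon CloudFormation console to create a stack. The Resources section of the template contains the definitions of Amazon Glue and related objects. Amazon CloudFormation template syntax definitions might contain properties that include more detailed property syntax. Not all properties are necessarily required to create an Amazon Glue object. These samples show example values for the properties commonly used to create an Amazon Glue object.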
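As a rough sketch of overriding Parameters values outside the console: the CloudFormation CreateStack API accepts parameter overrides as a list of ParameterKey/ParameterValue pairs. The helper name build_stack_request, the stack name, and the override value below are hypothetical, and CFNDatabaseName is the parameter used by the database sample later in this topic. The boto3 call is shown only as a comment and assumes boto3 is installed and credentials are configured.

```python
def build_stack_request(stack_name, template_body, overrides):
    """Assemble keyword arguments for CloudFormation's CreateStack call.

    `overrides` maps names from the template's Parameters section to the
    values that should replace their defaults.
    """
    return {
        "StackName": stack_name,
        "TemplateBody": template_body,
        "Parameters": [
            {"ParameterKey": key, "ParameterValue": value}
            for key, value in sorted(overrides.items())
        ],
    }


# Hypothetical stand-in for the template text; in practice this would be
# the full contents of the YAML file.
template_text = "AWSTemplateFormatVersion: '2010-09-09'"

request = build_stack_request(
    "glue-database-stack",                       # hypothetical stack name
    template_text,
    {"CFNDatabaseName": "my-flights-database"},  # hypothetical override value
)
print(request["Parameters"])

# With boto3 installed and credentials configured, the request could be sent as:
#   import boto3
#   boto3.client("cloudformation").create_stack(**request)
```

Parameters that are not listed in the overrides keep the Default value declared in the template.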

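Sample Amazon CloudFormation template for an Amazon Glue database

An Amazon Glue database in the Data Catalog contains metadata tables. The database has very few properties and can be created in the Data Catalog with an Amazon CloudFormation template. The following sample template is provided to get you started and to illustrate the use of Amazon CloudFormation stacks with Amazon Glue. The only resource created by the sample template is a database named cfn-mysampledatabase. You can change the name by editing the text of the sample or by changing the value on the Amazon CloudFormation console when you submit the YAML.

The following shows example values for the properties commonly used to create an Amazon Glue database. For more information about the Amazon CloudFormation database template for Amazon Glue, see AWS::Glue::Database.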

---
AWSTemplateFormatVersion: '2010-09-09'
# Sample CloudFormation template in YAML to demonstrate creating a database named mysampledatabase
# The metadata created in the Data Catalog points to the flights public S3 bucket
#
# Parameters section contains names that are substituted in the Resources section
# These parameters are the names the resources created in the Data Catalog
Parameters:
  CFNDatabaseName:
    Type: String
    Default: cfn-mysampledatabase

# Resources section defines metadata for the Data Catalog
Resources:
  # Create an Amazon Glue database
  CFNDatabaseFlights:
    Type: AWS::Glue::Database
    Properties:
      # The database is created in the Data Catalog for your account
      CatalogId: !Ref AWS::AccountId
      DatabaseInput:
        # The name of the database is defined in the Parameters section above
        Name: !Ref CFNDatabaseName
        Description: Database to hold tables for flights data
        LocationUri: s3://crawler-public-us-east-1/flight/2016/csv/
        #Parameters: Leave Amazon database parameters blank

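Sample Amazon CloudFormation template for an Amazon Glue database, table, and partition

An Amazon Glue table contains the metadata that defines the structure and location of the data that you want to process with your ETL scripts. Within a table, you can define partitions to parallelize the processing of your data. A partition is a chunk of data that you define with a key. For example, using month as the key, all the data for January is contained in the same partition. In Amazon Glue, databases can contain tables, and tables can contain partitions.

The following sample shows how to populate a database, a table, and partitions using an Amazon CloudFormation template. The base data format is csv, delimited by a comma (,). Because a database must exist before it can contain a table, and a table must exist before partitions can be created, the template uses the DependsOn statement to define the dependencies between these objects when they are created.

The values in this sample define a table that contains flight data from a publicly available Amazon S3 bucket. For illustration, only a few columns of the data and one partition key are defined. Four partitions are also defined in the Data Catalog. Some fields that describe the storage of the base data are also shown in the StorageDescriptor fields.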

---
AWSTemplateFormatVersion: '2010-09-09'
# Sample CloudFormation template in YAML to demonstrate creating a database, a table, and partitions
# The metadata created in the Data Catalog points to the flights public S3 bucket
#
# Parameters substituted in the Resources section
# These parameters are names of the resources created in the Data Catalog
Parameters:
  CFNDatabaseName:
    Type: String
    Default: cfn-database-flights-1
  CFNTableName1:
    Type: String
    Default: cfn-manual-table-flights-1
# Resources to create metadata in the Data Catalog
Resources:
  ###
  # Create an Amazon Glue database
  CFNDatabaseFlights:
    Type: AWS::Glue::Database
    Properties:
      CatalogId: !Ref AWS::AccountId
      DatabaseInput:
        Name: !Ref CFNDatabaseName
        Description: Database to hold tables for flights data
  ###
  # Create an Amazon Glue table
  CFNTableFlights:
    # Creating the table waits for the database to be created
    DependsOn: CFNDatabaseFlights
    Type: AWS::Glue::Table
    Properties:
      CatalogId: !Ref AWS::AccountId
      DatabaseName: !Ref CFNDatabaseName
      TableInput:
        Name: !Ref CFNTableName1
        Description: Define the first few columns of the flights table
        TableType: EXTERNAL_TABLE
        Parameters:
          classification: csv
        # ViewExpandedText: String
        PartitionKeys:
          # Data is partitioned by month
          - Name: mon
            Type: bigint
        StorageDescriptor:
          OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
          Columns:
            - Name: year
              Type: bigint
            - Name: quarter
              Type: bigint
            - Name: month
              Type: bigint
            - Name: day_of_month
              Type: bigint
          InputFormat: org.apache.hadoop.mapred.TextInputFormat
          Location: s3://crawler-public-us-east-1/flight/2016/csv/
          SerdeInfo:
            Parameters:
              field.delim: ","
            SerializationLibrary: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
  # Partition 1
  # Create an Amazon Glue partition
  CFNPartitionMon1:
    DependsOn: CFNTableFlights
    Type: AWS::Glue::Partition
    Properties:
      CatalogId: !Ref AWS::AccountId
      DatabaseName: !Ref CFNDatabaseName
      TableName: !Ref CFNTableName1
      PartitionInput:
        Values:
          - 1
        StorageDescriptor:
          OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
          Columns:
            - Name: mon
              Type: bigint
          InputFormat: org.apache.hadoop.mapred.TextInputFormat
          Location: s3://crawler-public-us-east-1/flight/2016/csv/mon=1/
          SerdeInfo:
            Parameters:
              field.delim: ","
            SerializationLibrary: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
  # Partition 2
  # Create an Amazon Glue partition
  CFNPartitionMon2:
    DependsOn: CFNTableFlights
    Type: AWS::Glue::Partition
    Properties:
      CatalogId: !Ref AWS::AccountId
      DatabaseName: !Ref CFNDatabaseName
      TableName: !Ref CFNTableName1
      PartitionInput:
        Values:
          - 2
        StorageDescriptor:
          OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
          Columns:
            - Name: mon
              Type: bigint
          InputFormat: org.apache.hadoop.mapred.TextInputFormat
          Location: s3://crawler-public-us-east-1/flight/2016/csv/mon=2/
          SerdeInfo:
            Parameters:
              field.delim: ","
            SerializationLibrary: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
  # Partition 3
  # Create an Amazon Glue partition
  CFNPartitionMon3:
    DependsOn: CFNTableFlights
    Type: AWS::Glue::Partition
    Properties:
      CatalogId: !Ref AWS::AccountId
      DatabaseName: !Ref CFNDatabaseName
      TableName: !Ref CFNTableName1
      PartitionInput:
        Values:
          - 3
        StorageDescriptor:
          OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
          Columns:
            - Name: mon
              Type: bigint
          InputFormat: org.apache.hadoop.mapred.TextInputFormat
          Location: s3://crawler-public-us-east-1/flight/2016/csv/mon=3/
          SerdeInfo:
            Parameters:
              field.delim: ","
            SerializationLibrary: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
  # Partition 4
  # Create an Amazon Glue partition
  CFNPartitionMon4:
    DependsOn: CFNTableFlights
    Type: AWS::Glue::Partition
    Properties:
      CatalogId: !Ref AWS::AccountId
      DatabaseName: !Ref CFNDatabaseName
      TableName: !Ref CFNTableName1
      PartitionInput:
        Values:
          - 4
        StorageDescriptor:
          OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
          Columns:
            - Name: mon
              Type: bigint
          InputFormat: org.apache.hadoop.mapred.TextInputFormat
          Location: s3://crawler-public-us-east-1/flight/2016/csv/mon=4/
          SerdeInfo:
            Parameters:
              field.delim: ","
            SerializationLibrary: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
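The four partition resources in this sample differ only in the month value and the Amazon S3 location. Because Amazon CloudFormation also accepts JSON templates, one way to avoid the repetition is to generate those entries with a script. The following is a minimal sketch of that idea, not part of the sample itself: partition_resource is a hypothetical helper, Ref is written in its long JSON form, and the partition values are written as strings.

```python
import json

BASE_LOCATION = "s3://crawler-public-us-east-1/flight/2016/csv/"


def partition_resource(month):
    """Return one AWS::Glue::Partition resource for the given month value."""
    return {
        "DependsOn": "CFNTableFlights",
        "Type": "AWS::Glue::Partition",
        "Properties": {
            "CatalogId": {"Ref": "AWS::AccountId"},
            "DatabaseName": {"Ref": "CFNDatabaseName"},
            "TableName": {"Ref": "CFNTableName1"},
            "PartitionInput": {
                "Values": [str(month)],
                "StorageDescriptor": {
                    "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
                    "Columns": [{"Name": "mon", "Type": "bigint"}],
                    "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
                    "Location": f"{BASE_LOCATION}mon={month}/",
                    "SerdeInfo": {
                        "Parameters": {"field.delim": ","},
                        "SerializationLibrary": "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe",
                    },
                },
            },
        },
    }


# Build the entries for months 1 through 4, keyed by logical resource name.
partitions = {f"CFNPartitionMon{m}": partition_resource(m) for m in range(1, 5)}
print(json.dumps(partitions, indent=2))
```

The generated entries would still need to be merged into the Resources section of a complete template, alongside the database and table definitions.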

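Sample Amazon CloudFormation template for an Amazon Glue grok classifier

An Amazon Glue classifier determines the schema of your data. One type of custom classifier uses a grok pattern to match your data. If the pattern matches, the custom classifier is used to create your table's schema and to set the classification to the value set in the classifier definition.

This sample creates a classifier that creates a schema with one column named message and sets the classification to greedy.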

---
AWSTemplateFormatVersion: '2010-09-09'
# Sample CFN YAML to demonstrate creating a classifier
#
# Parameters section contains names that are substituted in the Resources section
# These parameters are the names the resources created in the Data Catalog
Parameters:
  # The name of the classifier to be created
  CFNClassifierName:
    Type: String
    Default: cfn-classifier-grok-one-column-1
#
#
# Resources section defines metadata for the Data Catalog
Resources:
  # Create classifier that uses grok pattern to put all data in one column and classifies it as "greedy".
  CFNClassifierFlights:
    Type: AWS::Glue::Classifier
    Properties:
      GrokClassifier:
        #Grok classifier that puts all data in one column
        Name: !Ref CFNClassifierName
        Classification: greedy
        GrokPattern: "%{GREEDYDATA:message}"
        #CustomPatterns: none
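To make the effect of the pattern concrete: in the standard grok pattern set, GREEDYDATA is defined as the regular expression `.*`, so `%{GREEDYDATA:message}` behaves like a single named capture group that takes the whole line. The sketch below imitates that with Python's re module; it is an illustration of the pattern's behavior rather than how Amazon Glue evaluates it, and the input line and the classify_line helper are made up for the example.

```python
import re

# GREEDYDATA corresponds to ".*", so the grok pattern maps to one named group.
GREEDY_MESSAGE = re.compile(r"^(?P<message>.*)$")


def classify_line(line):
    """Return the one-column record extracted from a single line of text.

    The input is assumed to be one line without embedded newlines.
    """
    match = GREEDY_MESSAGE.match(line)
    if match is None:
        return None
    return {"message": match.group("message")}


# Hypothetical line of comma-separated data.
row = classify_line("2016,1,1,14,AA,JFK,LAX")
print(row)
```

Every line ends up in the single message column, which is why the resulting schema has exactly one column.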

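Sample Amazon CloudFormation template for an Amazon Glue JSON classifier

An Amazon Glue classifier determines the schema of your data. One type of custom classifier uses a JsonPath string that defines the JSON data for the classifier to classify. Amazon Glue supports a subset of the operators for JsonPath, as described in Writing JsonPath Custom Classifiers.

If the pattern matches, the custom classifier is used to create your table's schema.

This sample creates a classifier that creates a schema in which each record is an element of the Records3 array in an object.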

---
AWSTemplateFormatVersion: '2010-09-09'
# Sample CFN YAML to demonstrate creating a JSON classifier
#
# Parameters section contains names that are substituted in the Resources section
# These parameters are the names the resources created in the Data Catalog
Parameters:
  # The name of the classifier to be created
  CFNClassifierName:
    Type: String
    Default: cfn-classifier-json-one-column-1
#
#
# Resources section defines metadata for the Data Catalog
Resources:
  # Create classifier that uses a JSON pattern.
  CFNClassifierFlights:
    Type: AWS::Glue::Classifier
    Properties:
      JSONClassifier:
        #JSON classifier
        Name: !Ref CFNClassifierName
        JsonPath: $.Records3[*]
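The JsonPath `$.Records3[*]` selects every element of the Records3 array at the root of the document. A minimal sketch of that selection in plain Python follows; the input document and the select_records helper are hypothetical, and only the Records3 key comes from the sample.

```python
import json

# Hypothetical input document; only the "Records3" key comes from the sample.
document = json.loads("""
{
  "Records3": [
    {"year": 2016, "month": 1},
    {"year": 2016, "month": 2}
  ]
}
""")


def select_records(doc, array_key="Records3"):
    """Mimic the JsonPath $.Records3[*]: return each element of the array."""
    return list(doc[array_key])


records = select_records(document)
print(len(records), records[0])
```

Each selected element corresponds to one record, and its keys would become the columns of the schema.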

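Sample Amazon CloudFormation template for an Amazon Glue XML classifier

An Amazon Glue classifier determines the schema of your data. One type of custom classifier specifies an XML tag to designate the element that contains each record in the XML document being parsed. If the pattern matches, the custom classifier is used to create your table's schema and to set the classification to the value set in the classifier definition.

This sample creates a classifier that creates a schema with each record in the Record tag and sets the classification to XML.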

---
AWSTemplateFormatVersion: '2010-09-09'
# Sample CFN YAML to demonstrate creating an XML classifier
#
# Parameters section contains names that are substituted in the Resources section
# These parameters are the names the resources created in the Data Catalog
Parameters:
  # The name of the classifier to be created
  CFNClassifierName:
    Type: String
    Default: cfn-classifier-xml-one-column-1
#
#
# Resources section defines metadata for the Data Catalog
Resources:
  # Create classifier that uses the XML pattern and classifies it as "XML".
  CFNClassifierFlights:
    Type: AWS::Glue::Classifier
    Properties:
      XMLClassifier:
        #XML classifier
        Name: !Ref CFNClassifierName
        Classification: XML
        RowTag: <Records>
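The idea of a row tag can be sketched with Python's standard XML parser: every element with the designated tag name is treated as one record, and its child elements become the fields. The XML input, the tag name passed to the helper, and the rows_for_tag function are all hypothetical and chosen only for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML input; the element names are assumptions for illustration.
XML_TEXT = """
<Flights>
  <Record><year>2016</year><month>1</month></Record>
  <Record><year>2016</year><month>2</month></Record>
</Flights>
"""


def rows_for_tag(xml_text, row_tag):
    """Return one dict per element named `row_tag`, built from its children."""
    root = ET.fromstring(xml_text.strip())
    return [
        {child.tag: child.text for child in row}
        for row in root.iter(row_tag)
    ]


rows = rows_for_tag(XML_TEXT, "Record")
print(rows)
```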

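Sample Amazon CloudFormation template for an Amazon Glue crawler for Amazon S3

An Amazon Glue crawler creates metadata tables in your Data Catalog that correspond to your data. You can then use these table definitions as sources and targets in your ETL jobs.

This sample creates a crawler, the required IAM role, and an Amazon Glue database in the Data Catalog. When this crawler runs, it assumes the IAM role and creates a table in the database for the public flights data. The table is created with the prefix "cfn_sample_1_". The IAM role created by this template allows global permissions; you might want to create a custom role. This crawler does not define any custom classifiers. Amazon Glue built-in classifiers are used by default.

When you submit this sample to the Amazon CloudFormation console, you must confirm that you want to create the IAM role.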
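Outside the console, this confirmation corresponds to passing the CAPABILITY_IAM capability when the stack is created, for example through the Capabilities parameter of the CreateStack API. The sketch below checks a template, represented as a Python dict, for IAM resources to decide whether that capability is needed. The needs_iam_acknowledgement helper is hypothetical, and the dict is a trimmed-down stand-in that keeps only the resource names and types used in this sample.

```python
def needs_iam_acknowledgement(template):
    """Return True if any resource in the template dict is an IAM resource."""
    resources = template.get("Resources", {})
    return any(
        resource.get("Type", "").startswith("AWS::IAM::")
        for resource in resources.values()
    )


# Trimmed-down stand-in for the crawler template; properties are omitted.
crawler_template = {
    "Resources": {
        "CFNRoleFlights": {"Type": "AWS::IAM::Role"},
        "CFNDatabaseFlights": {"Type": "AWS::Glue::Database"},
        "CFNCrawlerFlights": {"Type": "AWS::Glue::Crawler"},
    }
}

capabilities = (
    ["CAPABILITY_IAM"] if needs_iam_acknowledgement(crawler_template) else []
)
print(capabilities)
```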

---
AWSTemplateFormatVersion: '2010-09-09'
# Sample CFN YAML to demonstrate creating a crawler
#
# Parameters section contains names that are substituted in the Resources section
# These parameters are the names the resources created in the Data Catalog
Parameters:
  # The name of the crawler to be created
  CFNCrawlerName:
    Type: String
    Default: cfn-crawler-flights-1
  CFNDatabaseName:
    Type: String
    Default: cfn-database-flights-1
  CFNTablePrefixName:
    Type: String
    Default: cfn_sample_1_
#
#
# Resources section defines metadata for the Data Catalog
Resources:
  #Create IAM Role assumed by the crawler. For demonstration, this role is given all permissions.
  CFNRoleFlights:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: "Allow"
            Principal:
              Service:
                - "glue.amazonaws.com"
            Action:
              - "sts:AssumeRole"
      Path: "/"
      Policies:
        - PolicyName: "root"
          PolicyDocument:
            Version: "2012-10-17"
            Statement:
              - Effect: "Allow"
                Action: "*"
                Resource: "*"
  # Create a database to contain tables created by the crawler
  CFNDatabaseFlights:
    Type: AWS::Glue::Database
    Properties:
      CatalogId: !Ref AWS::AccountId
      DatabaseInput:
        Name: !Ref CFNDatabaseName
        Description: "AWS Glue container to hold metadata tables for the flights crawler"
  #Create a crawler to crawl the flights data on a public S3 bucket
  CFNCrawlerFlights:
    Type: AWS::Glue::Crawler
    Properties:
      Name: !Ref CFNCrawlerName
      Role: !GetAtt CFNRoleFlights.Arn
      #Classifiers: none, use the default classifier
      Description: Amazon Glue crawler to crawl flights data
      #Schedule: none, use default run-on-demand
      DatabaseName: !Ref CFNDatabaseName
      Targets:
        S3Targets:
          # Public S3 bucket with the flights data
          - Path: "s3://crawler-public-us-east-1/flight/2016/csv"
      TablePrefix: !Ref CFNTablePrefixName
      SchemaChangePolicy:
        UpdateBehavior: "UPDATE_IN_DATABASE"
        DeleteBehavior: "LOG"
      Configuration: "{\"Version\":1.0,\"CrawlerOutput\":{\"Partitions\":{\"AddOrUpdateBehavior\":\"InheritFromTable\"},\"Tables\":{\"AddOrUpdateBehavior\":\"MergeNewColumns\"}}}"

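Sample Amazon CloudFormation template for an Amazon Glue connection

An Amazon Glue connection contains the JDBC and network information that is required to connect to a JDBC database. This information is used when you connect to a JDBC database to crawl it or to run ETL jobs.

This sample creates a connection to a MySQL database named devdb. When this connection is used, an IAM role, database credentials, and network connection values must also be supplied. See the template for details of the required fields.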

---
AWSTemplateFormatVersion: '2010-09-09'
# Sample CFN YAML to demonstrate creating a connection
#
# Parameters section contains names that are substituted in the Resources section
# These parameters are the names the resources created in the Data Catalog
Parameters:
  # The name of the connection to be created
  CFNConnectionName:
    Type: String
    Default: cfn-connection-mysql-flights-1
  CFNJDBCString:
    Type: String
    Default: "jdbc:mysql://xxx-mysql.yyyyyyyyyyyyyy.us-east-1.rds.amazonaws.com:3306/devdb"
  CFNJDBCUser:
    Type: String
    Default: "master"
  CFNJDBCPassword:
    Type: String
    Default: "12345678"
    NoEcho: true
#
#
# Resources section defines metadata for the Data Catalog
Resources:
  CFNConnectionMySQL:
    Type: AWS::Glue::Connection
    Properties:
      CatalogId: !Ref AWS::AccountId
      ConnectionInput:
        Description: "Connect to MySQL database."
        ConnectionType: "JDBC"
        #MatchCriteria: none
        PhysicalConnectionRequirements:
          AvailabilityZone: "us-east-1d"
          SecurityGroupIdList:
            - "sg-7d52b812"
          SubnetId: "subnet-84f326ee"
        ConnectionProperties:
          JDBC_CONNECTION_URL: !Ref CFNJDBCString
          USERNAME: !Ref CFNJDBCUser
          PASSWORD: !Ref CFNJDBCPassword
        Name: !Ref CFNConnectionName

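Sample Amazon CloudFormation template for an Amazon Glue crawler for JDBC

An Amazon Glue crawler creates metadata tables in your Data Catalog that correspond to your data. You can then use these table definitions as sources and targets in your ETL jobs.

This sample creates a crawler, the required IAM role, and an Amazon Glue database in the Data Catalog. When this crawler runs, it assumes the IAM role and creates a table in the database for the public flights data that is stored in a MySQL database. The table is created with the prefix "cfn_jdbc_1_". The IAM role created by this template allows global permissions; you might want to create a custom role. Custom classifiers cannot be defined for JDBC data. Amazon Glue built-in classifiers are used by default.

When you submit this sample to the Amazon CloudFormation console, you must confirm that you want to create the IAM role.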

---
AWSTemplateFormatVersion: '2010-09-09'
# Sample CFN YAML to demonstrate creating a crawler
#
# Parameters section contains names that are substituted in the Resources section
# These parameters are the names the resources created in the Data Catalog
Parameters:
  # The name of the crawler to be created
  CFNCrawlerName:
    Type: String
    Default: cfn-crawler-jdbc-flights-1
  # The name of the database to be created to contain tables
  CFNDatabaseName:
    Type: String
    Default: cfn-database-jdbc-flights-1
  # The prefix for all tables crawled and created
  CFNTablePrefixName:
    Type: String
    Default: cfn_jdbc_1_
  # The name of the existing connection to the MySQL database
  CFNConnectionName:
    Type: String
    Default: cfn-connection-mysql-flights-1
  # The name of the JDBC path (database/schema/table) with wildcard (%) to crawl
  CFNJDBCPath:
    Type: String
    Default: saldev/%
#
#
# Resources section defines metadata for the Data Catalog
Resources:
  #Create IAM Role assumed by the crawler. For demonstration, this role is given all permissions.
  CFNRoleFlights:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: "Allow"
            Principal:
              Service:
                - "glue.amazonaws.com"
            Action:
              - "sts:AssumeRole"
      Path: "/"
      Policies:
        - PolicyName: "root"
          PolicyDocument:
            Version: "2012-10-17"
            Statement:
              - Effect: "Allow"
                Action: "*"
                Resource: "*"
  # Create a database to contain tables created by the crawler
  CFNDatabaseFlights:
    Type: AWS::Glue::Database
    Properties:
      CatalogId: !Ref AWS::AccountId
      DatabaseInput:
        Name: !Ref CFNDatabaseName
        Description: "AWS Glue container to hold metadata tables for the flights crawler"
  #Create a crawler to crawl the flights data in MySQL database
  CFNCrawlerFlights:
    Type: AWS::Glue::Crawler
    Properties:
      Name: !Ref CFNCrawlerName
      Role: !GetAtt CFNRoleFlights.Arn
      #Classifiers: none, use the default classifier
      Description: Amazon Glue crawler to crawl flights data
      #Schedule: none, use default run-on-demand
      DatabaseName: !Ref CFNDatabaseName
      Targets:
        JdbcTargets:
          # JDBC MySQL database with the flights data
          - ConnectionName: !Ref CFNConnectionName
            Path: !Ref CFNJDBCPath
            #Exclusions: none
      TablePrefix: !Ref CFNTablePrefixName
      SchemaChangePolicy:
        UpdateBehavior: "UPDATE_IN_DATABASE"
        DeleteBehavior: "LOG"
      Configuration: "{\"Version\":1.0,\"CrawlerOutput\":{\"Partitions\":{\"AddOrUpdateBehavior\":\"InheritFromTable\"},\"Tables\":{\"AddOrUpdateBehavior\":\"MergeNewColumns\"}}}"
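The Configuration value in the crawler templates is a JSON document embedded as a string, which is why every inner quotation mark is escaped. Writing such a string by hand is error-prone, so one option is to build it from a Python dict, as in the sketch below. The variable names are illustrative; the keys and values are the ones used in the sample.

```python
import json

# The crawler configuration as a regular Python structure.
crawler_configuration = {
    "Version": 1.0,
    "CrawlerOutput": {
        "Partitions": {"AddOrUpdateBehavior": "InheritFromTable"},
        "Tables": {"AddOrUpdateBehavior": "MergeNewColumns"},
    },
}

# Compact separators reproduce the single-line form used in the template.
configuration_string = json.dumps(crawler_configuration, separators=(",", ":"))
print(configuration_string)

# A second json.dumps adds the surrounding quotation marks and the backslash
# escapes needed to embed the string as a double-quoted YAML or JSON value.
print(json.dumps(configuration_string))
```

Parsing the string back with json.loads is a quick way to confirm that the embedded document is still valid JSON after editing.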

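Sample Amazon CloudFormation template for an Amazon Glue job for Amazon S3 to Amazon S3

An Amazon Glue job contains the parameter values that are required to run a script in Amazon Glue.

This sample creates a job that reads flight data from an Amazon S3 bucket in csv format and writes it to an Amazon S3 Parquet file. The script that this job runs must already exist. You can use the Amazon Glue console to generate an ETL script for your environment. When this job is run, an IAM role with the correct permissions must also be supplied.

Common parameter values are shown in the template. For example, AllocatedCapacity (DPU) defaults to 5.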

---
AWSTemplateFormatVersion: '2010-09-09'
# Sample CFN YAML to demonstrate creating a job using the public flights S3 table in a public bucket
#
# Parameters section contains names that are substituted in the Resources section
# These parameters are the names the resources created in the Data Catalog
Parameters:
  # The name of the job to be created
  CFNJobName:
    Type: String
    Default: cfn-job-S3-to-S3-2
  # The name of the IAM role that the job assumes. It must have access to data, script, temporary directory
  CFNIAMRoleName:
    Type: String
    Default: AWSGlueServiceRoleGA
  # The S3 path where the script for this job is located
  CFNScriptLocation:
    Type: String
    Default: s3://aws-glue-scripts-123456789012-us-east-1/myid/sal-job-test2
#
#
# Resources section defines metadata for the Data Catalog
Resources:
  # Create job to run script which accesses flightscsv table and write to S3 file as parquet.
  # The script already exists and is called by this job
  CFNJobFlights:
    Type: AWS::Glue::Job
    Properties:
      Role: !Ref CFNIAMRoleName
      #DefaultArguments: JSON object
      # If script written in Scala, then set DefaultArguments={'--job-language': 'scala', '--class': 'your scala class'}
      #Connections: No connection needed for S3 to S3 job
      # ConnectionsList
      #MaxRetries: Double
      Description: Job created with CloudFormation
      #LogUri: String
      Command:
        Name: glueetl
        ScriptLocation: !Ref CFNScriptLocation
        # for access to directories use proper IAM role with permission to buckets and folders that begin with "aws-glue-"
        # script uses temp directory from job definition if required (temp directory not used S3 to S3)
        # script defines target for output as s3://aws-glue-target/sal
      AllocatedCapacity: 5
      ExecutionProperty:
        MaxConcurrentRuns: 1
      Name: !Ref CFNJobName

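Sample Amazon CloudFormation template for an Amazon Glue job for JDBC to Amazon S3

An Amazon Glue job contains the parameter values that are required to run a script in Amazon Glue.

This sample creates a job that reads flight data from a MySQL JDBC database, as defined by the connection named cfn-connection-mysql-flights-1, and writes it to an Amazon S3 Parquet file. The script that this job runs must already exist. You can use the Amazon Glue console to generate an ETL script for your environment. When this job is run, an IAM role with the correct permissions must also be supplied.

Common parameter values are shown in the template. For example, AllocatedCapacity (DPU) defaults to 5.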

---
AWSTemplateFormatVersion: '2010-09-09'
# Sample CFN YAML to demonstrate creating a job using a MySQL JDBC DB with the flights data to an S3 file
#
# Parameters section contains names that are substituted in the Resources section
# These parameters are the names the resources created in the Data Catalog
Parameters:
  # The name of the job to be created
  CFNJobName:
    Type: String
    Default: cfn-job-JDBC-to-S3-1
  # The name of the IAM role that the job assumes. It must have access to data, script, temporary directory
  CFNIAMRoleName:
    Type: String
    Default: AWSGlueServiceRoleGA
  # The S3 path where the script for this job is located
  CFNScriptLocation:
    Type: String
    Default: s3://aws-glue-scripts-123456789012-us-east-1/myid/sal-job-dec4a
  # The name of the connection used for JDBC data source
  CFNConnectionName:
    Type: String
    Default: cfn-connection-mysql-flights-1
#
#
# Resources section defines metadata for the Data Catalog
Resources:
  # Create job to run script which accesses JDBC flights table via a connection and write to S3 file as parquet.
  # The script already exists and is called by this job
  CFNJobFlights:
    Type: AWS::Glue::Job
    Properties:
      Role: !Ref CFNIAMRoleName
      #DefaultArguments: JSON object
      # For example, if required by script, set temporary directory as DefaultArguments={'--TempDir': 's3://aws-glue-temporary-xyc/sal'}
      Connections:
        Connections:
          - !Ref CFNConnectionName
      #MaxRetries: Double
      Description: Job created with CloudFormation using existing script
      #LogUri: String
      Command:
        Name: glueetl
        ScriptLocation: !Ref CFNScriptLocation
        # for access to directories use proper IAM role with permission to buckets and folders that begin with "aws-glue-"
        # if required, script defines temp directory as argument TempDir and used in script like redshift_tmp_dir = args["TempDir"]
        # script defines target for output as s3://aws-glue-target/sal
      AllocatedCapacity: 5
      ExecutionProperty:
        MaxConcurrentRuns: 1
      Name: !Ref CFNJobName

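Sample Amazon CloudFormation template for an Amazon Glue on-demand trigger

An Amazon Glue trigger contains the parameter values that are required to start a job run when the trigger fires. An on-demand trigger fires when you enable it.

This sample creates an on-demand trigger that starts one job named cfn-job-S3-to-S3-1.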

---
AWSTemplateFormatVersion: '2010-09-09'
# Sample CFN YAML to demonstrate creating an on-demand trigger
#
# Parameters section contains names that are substituted in the Resources section
# These parameters are the names the resources created in the Data Catalog
Parameters:
  # The existing job to be started by this trigger
  CFNJobName:
    Type: String
    Default: cfn-job-S3-to-S3-1
  # The name of the trigger to be created
  CFNTriggerName:
    Type: String
    Default: cfn-trigger-ondemand-flights-1
#
# Resources section defines metadata for the Data Catalog
# Sample CFN YAML to demonstrate creating an on-demand trigger for a job
Resources:
  # Create trigger to run an existing job (CFNJobName) on an on-demand schedule.
  CFNTriggerSample:
    Type: AWS::Glue::Trigger
    Properties:
      Name: !Ref CFNTriggerName
      Description: Trigger created with CloudFormation
      Type: ON_DEMAND
      Actions:
        - JobName: !Ref CFNJobName
          # Arguments: JSON object
      #Schedule:
      #Predicate:

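Sample Amazon CloudFormation template for an Amazon Glue scheduled trigger

An Amazon Glue trigger contains the parameter values that are required to start a job run when the trigger fires. A scheduled trigger fires when it is enabled and its cron timer elapses.

This sample creates a scheduled trigger that starts one job named cfn-job-S3-to-S3-1. The timer is a cron expression that runs the job every 10 minutes on weekdays.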

---
AWSTemplateFormatVersion: '2010-09-09'
# Sample CFN YAML to demonstrate creating a scheduled trigger
#
# Parameters section contains names that are substituted in the Resources section
# These parameters are the names the resources created in the Data Catalog
Parameters:
  # The existing job to be started by this trigger
  CFNJobName:
    Type: String
    Default: cfn-job-S3-to-S3-1
  # The name of the trigger to be created
  CFNTriggerName:
    Type: String
    Default: cfn-trigger-scheduled-flights-1
#
# Resources section defines metadata for the Data Catalog
# Sample CFN YAML to demonstrate creating a scheduled trigger for a job
#
Resources:
  # Create trigger to run an existing job (CFNJobName) on a cron schedule.
  TriggerSample1CFN:
    Type: AWS::Glue::Trigger
    Properties:
      Name: !Ref CFNTriggerName
      Description: Trigger created with CloudFormation
      Type: SCHEDULED
      Actions:
        - JobName: !Ref CFNJobName
          # Arguments: JSON object
      # # Run the trigger every 10 minutes on Monday to Friday
      Schedule: cron(0/10 * ? * MON-FRI *)
      #Predicate:
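The schedule string has six space-separated fields inside cron(...): minutes, hours, day of month, month, day of week, and year. As a small aid for reading such expressions, the sketch below splits the sample's schedule into named fields. The describe_cron helper is hypothetical and only labels the fields; it does not validate their values or compute run times.

```python
CRON_FIELDS = ("minutes", "hours", "day_of_month", "month", "day_of_week", "year")


def describe_cron(expression):
    """Split a cron(...) schedule string into its six named fields."""
    if not (expression.startswith("cron(") and expression.endswith(")")):
        raise ValueError("expected a string of the form cron(...)")
    values = expression[len("cron("):-1].split()
    if len(values) != len(CRON_FIELDS):
        raise ValueError("expected six space-separated fields")
    return dict(zip(CRON_FIELDS, values))


fields = describe_cron("cron(0/10 * ? * MON-FRI *)")
for name, value in fields.items():
    print(f"{name:>12}: {value}")
```

Read this way, 0/10 in the minutes field means every 10 minutes starting at minute 0, MON-FRI restricts runs to weekdays, and the ? in the day-of-month field leaves that field unspecified because the day of week is given.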

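Sample Amazon CloudFormation template for an Amazon Glue conditional trigger

An Amazon Glue trigger contains the parameter values that are required to start a job run when the trigger fires. A conditional trigger fires when it is enabled and its conditions are met, such as a job completing successfully.

This sample creates a conditional trigger that starts one job named cfn-job-S3-to-S3-1. This job starts when the job named cfn-job-S3-to-S3-2 completes successfully.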

---
AWSTemplateFormatVersion: '2010-09-09'
# Sample CFN YAML to demonstrate creating a conditional trigger for a job, which starts when another job completes
#
# Parameters section contains names that are substituted in the Resources section
# These parameters are the names the resources created in the Data Catalog
Parameters:
  # The existing job to be started by this trigger
  CFNJobName:
    Type: String
    Default: cfn-job-S3-to-S3-1
  # The existing job that when it finishes causes trigger to fire
  CFNJobName2:
    Type: String
    Default: cfn-job-S3-to-S3-2
  # The name of the trigger to be created
  CFNTriggerName:
    Type: String
    Default: cfn-trigger-conditional-1
#
Resources:
  # Create trigger to run an existing job (CFNJobName) when another job completes (CFNJobName2).
  CFNTriggerSample:
    Type: AWS::Glue::Trigger
    Properties:
      Name: !Ref CFNTriggerName
      Description: Trigger created with CloudFormation
      Type: CONDITIONAL
      Actions:
        - JobName: !Ref CFNJobName
          # Arguments: JSON object
      #Schedule: none
      Predicate:
        #Value for Logical is required if more than 1 job listed in Conditions
        Logical: AND
        Conditions:
          - LogicalOperator: EQUALS
            JobName: !Ref CFNJobName2
            State: SUCCEEDED
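The Predicate can be read as a small rule: each condition compares the state of a watched job with an expected state, and Logical says how the conditions combine. The sketch below illustrates that reading; it is not how Amazon Glue evaluates triggers internally. The predicate_satisfied helper and the job_states mapping are hypothetical, and the sketch assumes every condition uses the EQUALS operator.

```python
def predicate_satisfied(predicate, job_states):
    """Evaluate a trigger Predicate against the latest state of each job.

    `job_states` maps a job name to its most recent run state. Every
    condition is assumed to use the EQUALS operator.
    """
    outcomes = [
        job_states.get(condition["JobName"]) == condition["State"]
        for condition in predicate["Conditions"]
    ]
    if not outcomes:
        return False
    # AND requires every condition to hold; ANY requires at least one.
    if predicate.get("Logical", "AND") == "AND":
        return all(outcomes)
    return any(outcomes)


predicate = {
    "Logical": "AND",
    "Conditions": [
        {
            "LogicalOperator": "EQUALS",
            "JobName": "cfn-job-S3-to-S3-2",
            "State": "SUCCEEDED",
        }
    ],
}

print(predicate_satisfied(predicate, {"cfn-job-S3-to-S3-2": "SUCCEEDED"}))
print(predicate_satisfied(predicate, {"cfn-job-S3-to-S3-2": "FAILED"}))
```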

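Sample Amazon CloudFormation template for an Amazon Glue development endpoint

An Amazon Glue development endpoint is an environment that you can use to develop and test your Amazon Glue scripts.

This sample creates a development endpoint with the minimal network parameter values required to create it successfully. For more information about the parameters that you need to set up a development endpoint, see Setting Up Your Environment for Development Endpoints.

You must provide an existing IAM role ARN (Amazon Resource Name) to create the development endpoint. If you plan to create a notebook server on the development endpoint, supply a valid RSA public key and keep the corresponding private key available.

Note

You manage any notebook servers that you create and associate with a development endpoint. Therefore, if you delete the development endpoint, you must delete the Amazon CloudFormation stack on the Amazon CloudFormation console to delete the notebook server.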

---
AWSTemplateFormatVersion: '2010-09-09'
# Sample CFN YAML to demonstrate creating a development endpoint
#
# Parameters section contains names that are substituted in the Resources section
# These parameters are the names the resources created in the Data Catalog
Parameters:
  # The name of the development endpoint to be created
  CFNEndpointName:
    Type: String
    Default: cfn-devendpoint-1
  CFNIAMRoleArn:
    Type: String
    Default: arn:aws:iam::123456789012:role/AWSGlueServiceRoleGA
#
#
# Resources section defines metadata for the Data Catalog
Resources:
  CFNDevEndpoint:
    Type: AWS::Glue::DevEndpoint
    Properties:
      EndpointName: !Ref CFNEndpointName
      #ExtraJarsS3Path: String
      #ExtraPythonLibsS3Path: String
      NumberOfNodes: 5
      PublicKey: ssh-rsa public.....key myuserid-key
      RoleArn: !Ref CFNIAMRoleArn
      SecurityGroupIds:
        - sg-64986c0b
      SubnetId: subnet-c67cccac
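Because the role is passed as an ARN, a quick format check before submitting the template can catch typing mistakes. An IAM role ARN has the form arn:partition:iam::account-id:role/role-name, with colons between the segments and a slash only after role. The sketch below is a simplified check under that assumption; the pattern and the parse_role_arn helper are illustrative and are not an official validator.

```python
import re

# arn:<partition>:iam::<12-digit account ID>:role/<optional path and role name>
ROLE_ARN_PATTERN = re.compile(
    r"^arn:(?P<partition>aws|aws-cn|aws-us-gov):iam::"
    r"(?P<account>\d{12}):role/(?P<name>[\w+=,.@/-]+)$"
)


def parse_role_arn(arn):
    """Return (account_id, role_name_with_path), or None if the ARN is malformed."""
    match = ROLE_ARN_PATTERN.match(arn)
    if match is None:
        return None
    return match.group("account"), match.group("name")


# A well-formed ARN is split into its account ID and role name.
print(parse_role_arn("arn:aws:iam::123456789012:role/AWSGlueServiceRoleGA"))
# A slash in place of the colon before "role" is rejected.
print(parse_role_arn("arn:aws:iam::123456789012/role/AWSGlueServiceRoleGA"))
```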