
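Amazon CloudFormation for Amazon Glue

Amazon CloudFormation is a service that can create many Amazon resources. Amazon Glue provides API operations to create objects in the Amazon Glue Data Catalog. However, it might be more convenient to define and create Amazon Glue objects and other related Amazon resource objects in an Amazon CloudFormation template file. You can then automate the process of creating the objects.

Amazon CloudFormation provides a simplified syntax, either JSON (JavaScript Object Notation) or YAML (YAML Ain't Markup Language), to express the creation of Amazon resources. You can use Amazon CloudFormation templates to define Data Catalog objects such as databases, tables, partitions, crawlers, classifiers, and connections. You can also define ETL objects such as jobs, triggers, and development endpoints. You create a template that describes all the Amazon resources you want, and Amazon CloudFormation provisions and configures those resources for you.

For more information, see What Is Amazon CloudFormation? and Working with Amazon CloudFormation Templates in the Amazon CloudFormation User Guide.

If you plan to use Amazon CloudFormation templates that are compatible with Amazon Glue, then as an administrator you must grant access to Amazon CloudFormation and to the Amazon services and actions on which it depends. To grant permissions to create Amazon CloudFormation resources, attach the following policy to the users who work with Amazon CloudFormation: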


{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [                
        "cloudformation:*"        
      ],
      "Resource": "*"
    }
  ]
}

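The following table lists the actions that an Amazon CloudFormation template can perform on your behalf. It includes links to information about the Amazon resource types and their property types that you can add to an Amazon CloudFormation template.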

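Amazon Glue resource	Amazon CloudFormation template	Amazon Glue samples
Classifier	AWS::Glue::Classifier	Grok classifier, JSON classifier, XML classifier
Connection	AWS::Glue::Connection	MySQL connection
Crawler	AWS::Glue::Crawler	Amazon S3 crawler, MySQL crawler
Database	AWS::Glue::Database	Empty database, database with tables
Development endpoint	AWS::Glue::DevEndpoint	Development endpoint
Job	AWS::Glue::Job	Amazon S3 job, JDBC job
Machine learning transform	AWS::Glue::MLTransform	Machine learning transform
Data quality ruleset	AWS::Glue::DataQualityRuleset	Data quality ruleset, data quality ruleset with EventBridge scheduler
Partition	AWS::Glue::Partition	Partitions of a table
Table	AWS::Glue::Table	Table in a database
Trigger	AWS::Glue::Trigger	On-demand trigger, scheduled trigger, conditional trigger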

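To get started, use the following sample templates and customize them with your own metadata. Then use the Amazon CloudFormation console to create an Amazon CloudFormation stack that adds objects to Amazon Glue and any associated services. Many fields in an Amazon Glue object are optional. These templates illustrate the fields that are required or that are necessary for a working Amazon Glue object.

An Amazon CloudFormation template can be in either JSON or YAML format. These samples use YAML because it is easier to read. The samples contain comments (#) that describe the values defined in the templates.

Amazon CloudFormation templates can include a Parameters section. You can change this section in the sample text or when you submit the YAML file to the Amazon CloudFormation console to create a stack. The Resources section of the template contains the definitions of Amazon Glue and related objects. Amazon CloudFormation template syntax definitions might contain properties that include more detailed property syntax. Not all properties are necessarily required to create an Amazon Glue object. These samples show example values for properties that are commonly used to create an Amazon Glue object.

Sample Amazon CloudFormation template for an Amazon Glue database

An Amazon Glue database in the Data Catalog contains metadata tables. A database has very few properties and can be created in the Data Catalog with an Amazon CloudFormation template. The following sample template is provided to get you started and to illustrate how to use Amazon CloudFormation stacks with Amazon Glue. The only resource that the sample template creates is a database named cfn-mysampledatabase. You can change the name by editing the text of the sample or by changing the value on the Amazon CloudFormation console when you submit the YAML.

The following shows example values for properties that are commonly used to create an Amazon Glue database. For more information about the Amazon CloudFormation database template for Amazon Glue, see AWS::Glue::Database.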



---
AWSTemplateFormatVersion: '2010-09-09'
# Sample CloudFormation template in YAML to demonstrate creating a database named mysampledatabase
# The metadata created in the Data Catalog points to the flights public S3 bucket
#
# Parameters section contains names that are substituted in the Resources section
# These parameters are the names of the resources created in the Data Catalog
Parameters:
  CFNDatabaseName:
    Type: String
    Default: cfn-mysampledatabase

# Resources section defines metadata for the Data Catalog
Resources:
# Create an Amazon Glue database
  CFNDatabaseFlights:
    Type: AWS::Glue::Database
    Properties:
      # The database is created in the Data Catalog for your account
      CatalogId: !Ref AWS::AccountId   
      DatabaseInput:
        # The name of the database is defined in the Parameters section above
        Name: !Ref CFNDatabaseName	
        Description: Database to hold tables for flights data
        LocationUri: s3://crawler-public-us-east-1/flight/2016/csv/
        #Parameters: Leave Amazon database parameters blank

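Sample Amazon CloudFormation template for an Amazon Glue database, table, and partition

An Amazon Glue table contains the metadata that defines the structure and location of the data that you want to process with your ETL scripts. Within a table, you can define partitions to parallelize the processing of your data. A partition is a chunk of data that you define with a key. For example, using month as a key, all the data for January is contained in the same partition. In Amazon Glue, databases can contain tables, and tables can contain partitions.

The following sample shows how to populate a database, a table, and partitions using an Amazon CloudFormation template. The base data format is csv, delimited by a comma (,). Because a database must exist before it can contain a table, and a table must exist before its partitions can be created, the template uses DependsOn statements to define the dependencies among these objects when they are created.

The values in this sample define a table that contains flight data from a publicly available Amazon S3 bucket. For illustration, only a few columns of the data and one partition key are defined. Four partitions are also defined in the Data Catalog. The StorageDescriptor fields show some of the fields that describe the storage of the base data.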



---
AWSTemplateFormatVersion: '2010-09-09'
# Sample CloudFormation template in YAML to demonstrate creating a database, a table, and partitions
# The metadata created in the Data Catalog points to the flights public S3 bucket
#
# Parameters substituted in the Resources section
# These parameters are names of the resources created in the Data Catalog
Parameters:
  CFNDatabaseName:
    Type: String
    Default: cfn-database-flights-1
  CFNTableName1:
    Type: String
    Default: cfn-manual-table-flights-1
# Resources to create metadata in the Data Catalog
Resources:
###
# Create an Amazon Glue database
  CFNDatabaseFlights:
    Type: AWS::Glue::Database
    Properties:
      CatalogId: !Ref AWS::AccountId
      DatabaseInput:
        Name: !Ref CFNDatabaseName	
        Description: Database to hold tables for flights data
###
# Create an Amazon Glue table
  CFNTableFlights:
    # Creating the table waits for the database to be created
    DependsOn: CFNDatabaseFlights
    Type: AWS::Glue::Table
    Properties:
      CatalogId: !Ref AWS::AccountId
      DatabaseName: !Ref CFNDatabaseName
      TableInput:
        Name: !Ref CFNTableName1
        Description: Define the first few columns of the flights table
        TableType: EXTERNAL_TABLE
        Parameters:
          classification: csv
#       ViewExpandedText: String
        PartitionKeys:
        # Data is partitioned by month
        - Name: mon
          Type: bigint
        StorageDescriptor:
          OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
          Columns:
          - Name: year
            Type: bigint
          - Name: quarter
            Type: bigint
          - Name: month
            Type: bigint
          - Name: day_of_month
            Type: bigint			
          InputFormat: org.apache.hadoop.mapred.TextInputFormat
          Location: s3://crawler-public-us-east-1/flight/2016/csv/
          SerdeInfo:
            Parameters:
              field.delim: ","
            SerializationLibrary: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
# Partition 1
# Create an Amazon Glue partition  
  CFNPartitionMon1:
    DependsOn: CFNTableFlights
    Type: AWS::Glue::Partition
    Properties:
      CatalogId: !Ref AWS::AccountId
      DatabaseName: !Ref CFNDatabaseName
      TableName: !Ref CFNTableName1
      PartitionInput:
        Values:
        - 1
        StorageDescriptor:
          OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
          Columns:
          - Name: mon
            Type: bigint
          InputFormat: org.apache.hadoop.mapred.TextInputFormat
          Location: s3://crawler-public-us-east-1/flight/2016/csv/mon=1/
          SerdeInfo:
            Parameters:
              field.delim: ","
            SerializationLibrary: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
# Partition 2
# Create an Amazon Glue partition 
  CFNPartitionMon2:
    DependsOn: CFNTableFlights
    Type: AWS::Glue::Partition
    Properties:
      CatalogId: !Ref AWS::AccountId
      DatabaseName: !Ref CFNDatabaseName
      TableName: !Ref CFNTableName1
      PartitionInput:
        Values:
        - 2
        StorageDescriptor:
          OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
          Columns:
          - Name: mon
            Type: bigint
          InputFormat: org.apache.hadoop.mapred.TextInputFormat
          Location: s3://crawler-public-us-east-1/flight/2016/csv/mon=2/
          SerdeInfo:
            Parameters:
              field.delim: ","
            SerializationLibrary: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
# Partition 3
# Create an Amazon Glue partition 
  CFNPartitionMon3:
    DependsOn: CFNTableFlights
    Type: AWS::Glue::Partition
    Properties:
      CatalogId: !Ref AWS::AccountId
      DatabaseName: !Ref CFNDatabaseName
      TableName: !Ref CFNTableName1
      PartitionInput:
        Values:
        - 3
        StorageDescriptor:
          OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
          Columns:
          - Name: mon
            Type: bigint
          InputFormat: org.apache.hadoop.mapred.TextInputFormat
          Location: s3://crawler-public-us-east-1/flight/2016/csv/mon=3/
          SerdeInfo:
            Parameters:
              field.delim: ","
            SerializationLibrary: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
# Partition 4
# Create an Amazon Glue partition 
  CFNPartitionMon4:
    DependsOn: CFNTableFlights
    Type: AWS::Glue::Partition
    Properties:
      CatalogId: !Ref AWS::AccountId
      DatabaseName: !Ref CFNDatabaseName
      TableName: !Ref CFNTableName1
      PartitionInput:
        Values:
        - 4
        StorageDescriptor:
          OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
          Columns:
          - Name: mon
            Type: bigint
          InputFormat: org.apache.hadoop.mapred.TextInputFormat
          Location: s3://crawler-public-us-east-1/flight/2016/csv/mon=4/
          SerdeInfo:
            Parameters:
              field.delim: ","
            SerializationLibrary: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
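
As an alternative to the explicit DependsOn statements used above, a resource that refers to another resource with !Ref is created after it automatically. The fragment below is a minimal sketch of that idea and is not part of the original sample. It reuses the sample's parameter names and logical IDs, and it assumes that Ref on an AWS::Glue::Database resource resolves to the database name.

```yaml
# Sketch only: a variant of the database and table resources from the sample above
Parameters:
  CFNDatabaseName:
    Type: String
    Default: cfn-database-flights-1
  CFNTableName1:
    Type: String
    Default: cfn-manual-table-flights-1
Resources:
  CFNDatabaseFlights:
    Type: AWS::Glue::Database
    Properties:
      CatalogId: !Ref AWS::AccountId
      DatabaseInput:
        Name: !Ref CFNDatabaseName
  CFNTableFlights:
    # No DependsOn here: the !Ref to CFNDatabaseFlights below orders the two resources
    Type: AWS::Glue::Table
    Properties:
      CatalogId: !Ref AWS::AccountId
      # Assumption: Ref on the database resource resolves to the database name
      DatabaseName: !Ref CFNDatabaseFlights
      TableInput:
        Name: !Ref CFNTableName1
        TableType: EXTERNAL_TABLE
```

The explicit DependsOn form in the sample remains the safer choice when a property refers to the database through the CFNDatabaseName parameter, because a parameter reference does not create any ordering between resources.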

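Sample Amazon CloudFormation template for an Amazon Glue grok classifier

An Amazon Glue classifier determines the schema of your data. One type of custom classifier uses a grok pattern to match your data. If the pattern matches, the custom classifier is used to create your table's schema and to set the classification to the value set in the classifier definition.

This sample creates a classifier that creates a schema with one column named message and sets the classification to greedy.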



---
AWSTemplateFormatVersion: '2010-09-09'
# Sample CFN YAML to demonstrate creating a classifier
#
# Parameters section contains names that are substituted in the Resources section
# These parameters are the names of the resources created in the Data Catalog
Parameters:                                                                                                       
# The name of the classifier to be created
  CFNClassifierName:  
    Type: String
    Default: cfn-classifier-grok-one-column-1                                                               	
#
#
# Resources section defines metadata for the Data Catalog
Resources:
# Create classifier that uses grok pattern to put all data in one column and classifies it as "greedy".	
  CFNClassifierFlights:
    Type: AWS::Glue::Classifier   
    Properties:
      GrokClassifier:
        #Grok classifier that puts all data in one column		
        Name: !Ref CFNClassifierName
        Classification: greedy                                                        	   
        GrokPattern: "%{GREEDYDATA:message}"
        #CustomPatterns: none

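Sample Amazon CloudFormation template for an Amazon Glue JSON classifier

An Amazon Glue classifier determines the schema of your data. One type of custom classifier uses a JsonPath string that defines the JSON data for the classifier to classify. Amazon Glue supports a subset of the operators for JsonPath, as described in Writing JsonPath Custom Classifiers.

If the pattern matches, the custom classifier is used to create your table's schema.

This sample creates a classifier that creates a schema in which each record is an element of the Records3 array in an object.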



---
AWSTemplateFormatVersion: '2010-09-09'
# Sample CFN YAML to demonstrate creating a JSON classifier
#
# Parameters section contains names that are substituted in the Resources section
# These parameters are the names of the resources created in the Data Catalog
Parameters:                                                                                                       
# The name of the classifier to be created
  CFNClassifierName:  
    Type: String
    Default: cfn-classifier-json-one-column-1                                                               	
#
#
# Resources section defines metadata for the Data Catalog
Resources:
# Create classifier that uses a JSON pattern.	
  CFNClassifierFlights:
    Type: AWS::Glue::Classifier   
    Properties:
      JsonClassifier:
        #JSON classifier		
        Name: !Ref CFNClassifierName
        JsonPath: $.Records3[*]

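Sample Amazon CloudFormation template for an Amazon Glue XML classifier

An Amazon Glue classifier determines the schema of your data. One type of custom classifier specifies an XML tag to designate the element that contains each record in the XML document being parsed. If the pattern matches, the custom classifier is used to create your table's schema and to set the classification to the value set in the classifier definition.

This sample creates a classifier that creates a schema with each record in the Records tag (the RowTag value in the template) and sets the classification to XML.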



---
AWSTemplateFormatVersion: '2010-09-09'
# Sample CFN YAML to demonstrate creating an XML classifier
#
# Parameters section contains names that are substituted in the Resources section
# These parameters are the names of the resources created in the Data Catalog
Parameters:                                                                                                       
# The name of the classifier to be created
  CFNClassifierName:  
    Type: String
    Default: cfn-classifier-xml-one-column-1                                                               	
#
#
# Resources section defines metadata for the Data Catalog
Resources:
# Create classifier that uses the XML pattern and classifies it as "XML".	
  CFNClassifierFlights:
    Type: AWS::Glue::Classifier   
    Properties:
      XMLClassifier:
        #XML classifier		
        Name: !Ref CFNClassifierName
        Classification: XML   
        RowTag: <Records>
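
The classifier samples above only create the classifiers. To have a crawler try a custom classifier before the built-in ones, the crawler can list it under Classifiers. The fragment below is a sketch under stated assumptions, not part of the original samples: it is meant to be added to a template that already defines CFNClassifierFlights, it assumes that Ref on an AWS::Glue::Classifier resource resolves to the classifier name, and the parameters CFNCrawlerRole, CFNTargetDatabase, and CFNDataPath as well as the logical ID CFNCrawlerWithClassifier are hypothetical.

```yaml
# Sketch only: merge into a template that defines the CFNClassifierFlights resource
Parameters:
  CFNCrawlerRole:        # hypothetical: name or ARN of an existing IAM role for the crawler
    Type: String
  CFNTargetDatabase:     # hypothetical: existing Data Catalog database for the crawler output
    Type: String
  CFNDataPath:           # hypothetical: S3 path that holds the data to classify
    Type: String
Resources:
  CFNCrawlerWithClassifier:
    Type: AWS::Glue::Crawler
    Properties:
      Role: !Ref CFNCrawlerRole
      DatabaseName: !Ref CFNTargetDatabase
      # Custom classifiers listed here are tried before the built-in classifiers
      Classifiers:
        - !Ref CFNClassifierFlights
      Targets:
        S3Targets:
          - Path: !Ref CFNDataPath
```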

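Sample Amazon CloudFormation template for an Amazon Glue crawler for Amazon S3

An Amazon Glue crawler creates metadata tables in your Data Catalog that correspond to your data. You can then use these table definitions as sources and targets in your ETL jobs.

This sample creates a crawler, the required IAM role, and an Amazon Glue database in the Data Catalog. When this crawler runs, it assumes the IAM role and creates a table in the database for the public flights data. The table is created with the prefix "cfn_sample_1_". The IAM role created by this template allows global permissions; you might want to create a custom role. This crawler does not define any custom classifiers. Amazon Glue built-in classifiers are used by default.

When you submit this sample to the Amazon CloudFormation console, you must confirm that you want to create the IAM role.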



---
AWSTemplateFormatVersion: '2010-09-09'
# Sample CFN YAML to demonstrate creating a crawler
#
# Parameters section contains names that are substituted in the Resources section
# These parameters are the names of the resources created in the Data Catalog
Parameters:                                                                                                       
# The name of the crawler to be created
  CFNCrawlerName:  
    Type: String
    Default: cfn-crawler-flights-1
  CFNDatabaseName:
    Type: String
    Default: cfn-database-flights-1
  CFNTablePrefixName:
    Type: String
    Default: cfn_sample_1_	
#
#
# Resources section defines metadata for the Data Catalog
Resources:
#Create IAM Role assumed by the crawler. For demonstration, this role is given all permissions.
  CFNRoleFlights:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: "2012-10-17"
        Statement:
          -
            Effect: "Allow"
            Principal:
              Service:
                - "glue.amazonaws.com"
            Action:
              - "sts:AssumeRole"
      Path: "/"
      Policies:
        -
          PolicyName: "root"
          PolicyDocument:
            Version: "2012-10-17"
            Statement:
              -
                Effect: "Allow"
                Action: "*"
                Resource: "*"
 # Create a database to contain tables created by the crawler
  CFNDatabaseFlights:
    Type: AWS::Glue::Database
    Properties:
      CatalogId: !Ref AWS::AccountId
      DatabaseInput:
        Name: !Ref CFNDatabaseName
        Description: "AWS Glue container to hold metadata tables for the flights crawler"
 #Create a crawler to crawl the flights data on a public S3 bucket
  CFNCrawlerFlights:
    Type: AWS::Glue::Crawler
    Properties:
      Name: !Ref CFNCrawlerName
      Role: !GetAtt CFNRoleFlights.Arn
      #Classifiers: none, use the default classifier
      Description: Amazon Glue crawler to crawl flights data
      #Schedule: none, use default run-on-demand
      DatabaseName: !Ref CFNDatabaseName
      Targets:
        S3Targets:
          # Public S3 bucket with the flights data
          - Path: "s3://crawler-public-us-east-1/flight/2016/csv"
      TablePrefix: !Ref CFNTablePrefixName
      SchemaChangePolicy:
        UpdateBehavior: "UPDATE_IN_DATABASE"
        DeleteBehavior: "LOG"
      Configuration: "{\"Version\":1.0,\"CrawlerOutput\":{\"Partitions\":{\"AddOrUpdateBehavior\":\"InheritFromTable\"},\"Tables\":{\"AddOrUpdateBehavior\":\"MergeNewColumns\"}}}"
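
The sample role above allows every action on every resource, which is convenient for a demonstration only. The fragment below is a rough sketch of a narrower custom role, not part of the original sample. It assumes that the AWSGlueServiceRole managed policy covers the baseline Amazon Glue permissions and that read access to the flights prefix is all the crawler needs beyond that. The logical ID CFNRoleFlightsScoped and the policy name ReadFlightsData are hypothetical.

```yaml
# Sketch only: a possible replacement for the CFNRoleFlights resource
Resources:
  CFNRoleFlightsScoped:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: Allow
            Principal:
              Service:
                - glue.amazonaws.com
            Action:
              - sts:AssumeRole
      ManagedPolicyArns:
        # Assumption: this managed policy supplies the baseline Glue permissions
        - !Sub "arn:${AWS::Partition}:iam::aws:policy/service-role/AWSGlueServiceRole"
      Policies:
        - PolicyName: ReadFlightsData
          PolicyDocument:
            Version: "2012-10-17"
            Statement:
              # Read the objects under the crawled prefix
              - Effect: Allow
                Action:
                  - s3:GetObject
                Resource: !Sub "arn:${AWS::Partition}:s3:::crawler-public-us-east-1/flight/2016/csv/*"
              # List the bucket so the crawler can discover the objects
              - Effect: Allow
                Action:
                  - s3:ListBucket
                Resource: !Sub "arn:${AWS::Partition}:s3:::crawler-public-us-east-1"
```

With this variant, the crawler's Role property would point at `!GetAtt CFNRoleFlightsScoped.Arn` instead of the all-permissions role.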

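Sample Amazon CloudFormation template for an Amazon Glue connection

An Amazon Glue connection in the Data Catalog contains the JDBC and network information that is required to connect to a JDBC database. This information is used when you connect to a JDBC database to crawl it or to run ETL jobs.

This sample creates a connection to an Amazon RDS MySQL database named devdb. When this connection is used, an IAM role, database credentials, and network connection values must also be supplied. See the details of the required fields in the template.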



---
AWSTemplateFormatVersion: '2010-09-09'
# Sample CFN YAML to demonstrate creating a connection
#
# Parameters section contains names that are substituted in the Resources section
# These parameters are the names of the resources created in the Data Catalog
Parameters:                                                                                                       
# The name of the connection to be created
  CFNConnectionName:  
    Type: String
    Default: cfn-connection-mysql-flights-1
  CFNJDBCString:  
    Type: String
    Default: "jdbc:mysql://xxx-mysql.yyyyyyyyyyyyyy.us-east-1.rds.amazonaws.com:3306/devdb"
  CFNJDBCUser:  
    Type: String
    Default: "master"
  CFNJDBCPassword:  
    Type: String
    Default: "12345678"
    NoEcho: true
#
#
# Resources section defines metadata for the Data Catalog
Resources:
  CFNConnectionMySQL:
    Type: AWS::Glue::Connection
    Properties:
      CatalogId: !Ref AWS::AccountId
      ConnectionInput: 
        Description: "Connect to MySQL database."
        ConnectionType: "JDBC"
        #MatchCriteria: none		
        PhysicalConnectionRequirements:
          AvailabilityZone: "us-east-1d"
          SecurityGroupIdList: 
           - "sg-7d52b812"
          SubnetId: "subnet-84f326ee" 
        ConnectionProperties:
          JDBC_CONNECTION_URL: !Ref CFNJDBCString
          USERNAME: !Ref CFNJDBCUser
          PASSWORD: !Ref CFNJDBCPassword
        Name: !Ref CFNConnectionName
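
The sample passes the database credentials as template parameters with default values. One possible alternative, sketched below and not part of the original sample, is a CloudFormation dynamic reference that resolves the credentials from Secrets Manager when the stack is created. The secret name my-glue-mysql-secret and its JSON keys username and password are hypothetical, and the sketch assumes that dynamic references are resolved for the ConnectionProperties values. The network settings from the sample are omitted for brevity.

```yaml
# Sketch only: variant of the connection sample without credential parameters
Parameters:
  CFNConnectionName:
    Type: String
    Default: cfn-connection-mysql-flights-1
  CFNJDBCString:
    Type: String
    Default: "jdbc:mysql://xxx-mysql.yyyyyyyyyyyyyy.us-east-1.rds.amazonaws.com:3306/devdb"
Resources:
  CFNConnectionMySQL:
    Type: AWS::Glue::Connection
    Properties:
      CatalogId: !Ref AWS::AccountId
      ConnectionInput:
        Name: !Ref CFNConnectionName
        Description: "Connect to MySQL database."
        ConnectionType: "JDBC"
        # PhysicalConnectionRequirements from the sample above would go here
        ConnectionProperties:
          JDBC_CONNECTION_URL: !Ref CFNJDBCString
          # Hypothetical secret and JSON keys; the secret must exist before stack creation
          USERNAME: "{{resolve:secretsmanager:my-glue-mysql-secret:SecretString:username}}"
          PASSWORD: "{{resolve:secretsmanager:my-glue-mysql-secret:SecretString:password}}"
```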

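Sample Amazon CloudFormation template for an Amazon Glue crawler for JDBC

An Amazon Glue crawler creates metadata tables in your Data Catalog that correspond to your data. You can then use these table definitions as sources and targets in your ETL jobs.

This sample creates a crawler, the required IAM role, and an Amazon Glue database in the Data Catalog. When this crawler runs, it assumes the IAM role and creates a table in the database for the public flights data that has been stored in a MySQL database. The table is created with the prefix "cfn_jdbc_1_". The IAM role created by this template allows global permissions; you might want to create a custom role. Custom classifiers can't be defined for JDBC data. Amazon Glue built-in classifiers are used by default.

When you submit this sample to the Amazon CloudFormation console, you must confirm that you want to create the IAM role.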



---
AWSTemplateFormatVersion: '2010-09-09'
# Sample CFN YAML to demonstrate creating a crawler
#
# Parameters section contains names that are substituted in the Resources section
# These parameters are the names of the resources created in the Data Catalog
Parameters:                                                                                                       
# The name of the crawler to be created
  CFNCrawlerName:  
    Type: String
    Default: cfn-crawler-jdbc-flights-1
# The name of the database to be created to contain tables	
  CFNDatabaseName:
    Type: String
    Default: cfn-database-jdbc-flights-1
# The prefix for all tables crawled and created	
  CFNTablePrefixName:
    Type: String
    Default: cfn_jdbc_1_
# The name of the existing connection to the MySQL database
  CFNConnectionName:  
    Type: String
    Default: cfn-connection-mysql-flights-1
# The name of the JDBC path (database/schema/table) with wildcard (%) to crawl	
  CFNJDBCPath:  
    Type: String
    Default: saldev/%		
#
#
# Resources section defines metadata for the Data Catalog
Resources:
#Create IAM Role assumed by the crawler. For demonstration, this role is given all permissions.
  CFNRoleFlights:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: "2012-10-17"
        Statement:
          -
            Effect: "Allow"
            Principal:
              Service:
                - "glue.amazonaws.com"
            Action:
              - "sts:AssumeRole"
      Path: "/"
      Policies:
        -
          PolicyName: "root"
          PolicyDocument:
            Version: "2012-10-17"
            Statement:
              -
                Effect: "Allow"
                Action: "*"
                Resource: "*"
 # Create a database to contain tables created by the crawler
  CFNDatabaseFlights:
    Type: AWS::Glue::Database
    Properties:
      CatalogId: !Ref AWS::AccountId
      DatabaseInput:
        Name: !Ref CFNDatabaseName
        Description: "AWS Glue container to hold metadata tables for the flights crawler"
 #Create a crawler to crawl the flights data in MySQL database
  CFNCrawlerFlights:
    Type: AWS::Glue::Crawler
    Properties:
      Name: !Ref CFNCrawlerName
      Role: !GetAtt CFNRoleFlights.Arn
      #Classifiers: none, use the default classifier
      Description: Amazon Glue crawler to crawl flights data
      #Schedule: none, use default run-on-demand
      DatabaseName: !Ref CFNDatabaseName
      Targets:
        JdbcTargets:
          # JDBC MySQL database with the flights data
          - ConnectionName: !Ref CFNConnectionName
            Path: !Ref CFNJDBCPath
          #Exclusions: none
      TablePrefix: !Ref CFNTablePrefixName
      SchemaChangePolicy:
        UpdateBehavior: "UPDATE_IN_DATABASE"
        DeleteBehavior: "LOG"
      Configuration: "{\"Version\":1.0,\"CrawlerOutput\":{\"Partitions\":{\"AddOrUpdateBehavior\":\"InheritFromTable\"},\"Tables\":{\"AddOrUpdateBehavior\":\"MergeNewColumns\"}}}"

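Sample Amazon CloudFormation template for an Amazon Glue job for Amazon S3 to Amazon S3

An Amazon Glue job in the Data Catalog contains the parameter values that are required to run a script in Amazon Glue.

This sample creates a job that reads flight data in csv format from an Amazon S3 bucket and writes it to an Amazon S3 Parquet file. The script that this job runs must already exist. You can generate an ETL script for your environment with the Amazon Glue console. When this job is run, an IAM role with the correct permissions must also be supplied.

Common parameter values are shown in the template. For example, the template sets AllocatedCapacity (DPUs) to 5.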



---
AWSTemplateFormatVersion: '2010-09-09'
# Sample CFN YAML to demonstrate creating a job using the public flights S3 table in a public bucket
#
# Parameters section contains names that are substituted in the Resources section
# These parameters are the names of the resources created in the Data Catalog
Parameters:                                                                                                       
# The name of the job to be created
  CFNJobName:  
    Type: String
    Default: cfn-job-S3-to-S3-2
# The name of the IAM role that the job assumes. It must have access to data, script, temporary directory
  CFNIAMRoleName:  
    Type: String
    Default: AWSGlueServiceRoleGA
# The S3 path where the script for this job is located
  CFNScriptLocation:  
    Type: String
    Default: s3://aws-glue-scripts-123456789012-us-east-1/myid/sal-job-test2	
#
#
# Resources section defines metadata for the Data Catalog
Resources:                                      
# Create job to run script which accesses flightscsv table and write to S3 file as parquet.
# The script already exists and is called by this job	
  CFNJobFlights:
    Type: AWS::Glue::Job   
    Properties:
      Role: !Ref CFNIAMRoleName  
      #DefaultArguments: JSON object 
      # If the script is written in Scala, then set DefaultArguments={'--job-language': 'scala', '--class': 'your scala class'}
      #Connections:  No connection needed for S3 to S3 job 
      #  ConnectionsList  
      #MaxRetries: Double  
      Description: Job created with CloudFormation  
      #LogUri: String  
      Command:   
        Name: glueetl  
        ScriptLocation: !Ref CFNScriptLocation
             # for access to directories use proper IAM role with permission to buckets and folders that begin with "aws-glue-"					 
             # script uses temp directory from job definition if required (temp directory not used S3 to S3)
             # script defines target for output as s3://aws-glue-target/sal    			 
      AllocatedCapacity: 5  
      ExecutionProperty:   
        MaxConcurrentRuns: 1  
      Name: !Ref CFNJobName
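
The commented-out DefaultArguments property in the sample takes a map of argument names to values. The fragment below is a minimal sketch of how that map might look in YAML for a Scala script, not part of the original sample. The class name MyScalaJob, the temporary directory bucket, and the logical ID CFNJobFlightsScala are hypothetical placeholders.

```yaml
# Sketch only: variant of the job sample with DefaultArguments filled in
Parameters:
  CFNIAMRoleName:
    Type: String
    Default: AWSGlueServiceRoleGA
  CFNScriptLocation:       # S3 path of an existing Scala script
    Type: String
Resources:
  CFNJobFlightsScala:
    Type: AWS::Glue::Job
    Properties:
      Role: !Ref CFNIAMRoleName
      Command:
        Name: glueetl
        ScriptLocation: !Ref CFNScriptLocation
      # Keys start with "--" and are quoted so YAML reads them as plain strings
      DefaultArguments:
        "--job-language": scala
        "--class": MyScalaJob                          # hypothetical entry-point class
        "--TempDir": s3://aws-glue-temporary-example/tmp   # hypothetical bucket
      AllocatedCapacity: 5
      ExecutionProperty:
        MaxConcurrentRuns: 1
```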

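Sample Amazon CloudFormation template for an Amazon Glue job for JDBC to Amazon S3

An Amazon Glue job in the Data Catalog contains the parameter values that are required to run a script in Amazon Glue.

This sample creates a job that reads flight data from a MySQL JDBC database, as defined by the connection named cfn-connection-mysql-flights-1, and writes it to an Amazon S3 Parquet file. The script that this job runs must already exist. You can generate an ETL script for your environment with the Amazon Glue console. When this job is run, an IAM role with the correct permissions must also be supplied.

Common parameter values are shown in the template. For example, the template sets AllocatedCapacity (DPUs) to 5.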



---
AWSTemplateFormatVersion: '2010-09-09'
# Sample CFN YAML to demonstrate creating a job using a MySQL JDBC DB with the flights data to an S3 file
#
# Parameters section contains names that are substituted in the Resources section
# These parameters are the names of the resources created in the Data Catalog
Parameters:                                                                                                       
# The name of the job to be created
  CFNJobName:  
    Type: String
    Default: cfn-job-JDBC-to-S3-1
# The name of the IAM role that the job assumes. It must have access to data, script, temporary directory
  CFNIAMRoleName:  
    Type: String
    Default: AWSGlueServiceRoleGA
# The S3 path where the script for this job is located
  CFNScriptLocation:  
    Type: String
    Default: s3://aws-glue-scripts-123456789012-us-east-1/myid/sal-job-dec4a	
# The name of the connection used for JDBC data source
  CFNConnectionName:  
    Type: String
    Default: cfn-connection-mysql-flights-1
#
#
# Resources section defines metadata for the Data Catalog
Resources:                                      
# Create job to run script which accesses JDBC flights table via a connection and write to S3 file as parquet.
# The script already exists and is called by this job	
  CFNJobFlights:
    Type: AWS::Glue::Job   
    Properties:
      Role: !Ref CFNIAMRoleName  
      #DefaultArguments: JSON object  
      # For example, if required by the script, set the temporary directory as DefaultArguments={'--TempDir': 's3://aws-glue-temporary-xyc/sal'}
      Connections:
        Connections:
        - !Ref CFNConnectionName 
      #MaxRetries: Double  
      Description: Job created with CloudFormation using existing script
      #LogUri: String  
      Command:   
        Name: glueetl  
        ScriptLocation: !Ref CFNScriptLocation
             # for access to directories use proper IAM role with permission to buckets and folders that begin with "aws-glue-"					 
             # if required, script defines temp directory as argument TempDir and used in script like redshift_tmp_dir = args["TempDir"] 
             # script defines target for output as s3://aws-glue-target/sal    			 
      AllocatedCapacity: 5  
      ExecutionProperty:   
        MaxConcurrentRuns: 1  
      Name: !Ref CFNJobName

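Sample Amazon CloudFormation template for an Amazon Glue on-demand trigger

An Amazon Glue trigger in the Data Catalog contains the parameter values that are required to start a job run when the trigger fires. An on-demand trigger fires when you enable it.

This sample creates an on-demand trigger that starts one job named cfn-job-S3-to-S3-1.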



---
AWSTemplateFormatVersion: '2010-09-09'
# Sample CFN YAML to demonstrate creating an on-demand trigger
#
# Parameters section contains names that are substituted in the Resources section
# These parameters are the names of the resources created in the Data Catalog
Parameters:
  # The existing job to be started by this trigger 
  CFNJobName:
    Type: String
    Default: cfn-job-S3-to-S3-1
  # The name of the trigger to be created
  CFNTriggerName:
    Type: String
    Default: cfn-trigger-ondemand-flights-1	
#
# Resources section defines metadata for the Data Catalog
# Sample CFN YAML to demonstrate creating an on-demand trigger for a job	
Resources:                                      
# Create trigger to run an existing job (CFNJobName) on an on-demand schedule.	
  CFNTriggerSample:
    Type: AWS::Glue::Trigger   
    Properties:
      Name:
        Ref: CFNTriggerName		
      Description: Trigger created with CloudFormation
      Type: ON_DEMAND                                                        	   
      Actions:
        - JobName: !Ref CFNJobName                	  
        # Arguments: JSON object
      #Schedule: 
      #Predicate:

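Sample Amazon CloudFormation template for an Amazon Glue scheduled trigger

An Amazon Glue trigger in the Data Catalog contains the parameter values that are required to start a job run when the trigger fires. A scheduled trigger fires when it is enabled and the cron timer pops.

This sample creates a scheduled trigger that starts one job named cfn-job-S3-to-S3-1. The timer is a cron expression that runs the job every 10 minutes on weekdays.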



---
AWSTemplateFormatVersion: '2010-09-09'
# Sample CFN YAML to demonstrate creating a scheduled trigger
#
# Parameters section contains names that are substituted in the Resources section
# These parameters are the names of the resources created in the Data Catalog
Parameters:
  # The existing job to be started by this trigger 
  CFNJobName:
    Type: String
    Default: cfn-job-S3-to-S3-1
  # The name of the trigger to be created
  CFNTriggerName:
    Type: String
    Default: cfn-trigger-scheduled-flights-1	
#
# Resources section defines metadata for the Data Catalog
# Sample CFN YAML to demonstrate creating a scheduled trigger for a job
#	
Resources:                                      
# Create trigger to run an existing job (CFNJobName) on a cron schedule.	
  TriggerSample1CFN:
    Type: AWS::Glue::Trigger   
    Properties:
      Name:
        Ref: CFNTriggerName		
      Description: Trigger created with CloudFormation
      Type: SCHEDULED                                                        	   
      Actions:
        - JobName: !Ref CFNJobName                	  
        # Arguments: JSON object
      # Run the trigger every 10 minutes on Monday to Friday
      Schedule: cron(0/10 * ? * MON-FRI *) 
      #Predicate:
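
The cron expression in the sample has six fields: minutes, hours, day of month, month, day of week, and year. The fragment below is a sketch of a second scheduled trigger with a daily schedule, not part of the original sample. The trigger name and the logical ID TriggerSampleDaily are hypothetical, and StartOnCreation is included on the assumption that the trigger should be enabled as soon as the stack creates it.

```yaml
# Sketch only: a daily variant of the scheduled trigger sample
Parameters:
  CFNJobName:              # the existing job to be started by this trigger
    Type: String
    Default: cfn-job-S3-to-S3-1
Resources:
  TriggerSampleDaily:
    Type: AWS::Glue::Trigger
    Properties:
      Name: cfn-trigger-scheduled-daily-1   # hypothetical name
      Description: Trigger created with CloudFormation
      Type: SCHEDULED
      # Fields: minutes hours day-of-month month day-of-week year
      # Runs every day at 12:15 UTC
      Schedule: cron(15 12 * * ? *)
      # Enable the trigger when it is created
      StartOnCreation: true
      Actions:
        - JobName: !Ref CFNJobName
```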

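Sample Amazon CloudFormation template for an Amazon Glue conditional trigger

An Amazon Glue trigger in the Data Catalog contains the parameter values that are required to start a job run when the trigger fires. A conditional trigger fires when it is enabled and its conditions are met, such as a job completing successfully.

This sample creates a conditional trigger that starts one job named cfn-job-S3-to-S3-1. This job starts when the job named cfn-job-S3-to-S3-2 completes successfully.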



---
AWSTemplateFormatVersion: '2010-09-09'
# Sample CFN YAML to demonstrate creating a conditional trigger for a job, which starts when another job completes
#
# Parameters section contains names that are substituted in the Resources section
# These parameters are the names of the resources created in the Data Catalog
Parameters:
  # The existing job to be started by this trigger 
  CFNJobName:
    Type: String
    Default: cfn-job-S3-to-S3-1
  # The existing job that when it finishes causes trigger to fire
  CFNJobName2:
    Type: String
    Default: cfn-job-S3-to-S3-2	
  # The name of the trigger to be created
  CFNTriggerName:
    Type: String
    Default: cfn-trigger-conditional-1	
#	
Resources:                                      
# Create trigger to run an existing job (CFNJobName) when another job completes (CFNJobName2).	
  CFNTriggerSample:
    Type: AWS::Glue::Trigger   
    Properties:
      Name:
        Ref: CFNTriggerName		
      Description: Trigger created with CloudFormation
      Type: CONDITIONAL                                                        	   
      Actions:
        - JobName: !Ref CFNJobName                	  
        # Arguments: JSON object
      #Schedule: none 
      Predicate:
        #Value for Logical is required if more than 1 job listed in Conditions	  
        Logical: AND
        Conditions:
          - LogicalOperator: EQUALS	
            JobName: !Ref CFNJobName2
            State: SUCCEEDED
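
The comment in the sample notes that Logical is required when more than one job is listed in Conditions. The fragment below is a sketch of that case, not part of the original sample: the trigger fires only after two watched jobs have both succeeded. The parameter CFNJobName3, its default value, and the trigger name are hypothetical.

```yaml
# Sketch only: conditional trigger that waits for two jobs
Parameters:
  CFNJobName:              # the existing job to be started by this trigger
    Type: String
    Default: cfn-job-S3-to-S3-1
  CFNJobName2:             # first job that must finish successfully
    Type: String
    Default: cfn-job-S3-to-S3-2
  CFNJobName3:             # hypothetical second job that must finish successfully
    Type: String
    Default: cfn-job-S3-to-S3-3
Resources:
  CFNTriggerSample:
    Type: AWS::Glue::Trigger
    Properties:
      Name: cfn-trigger-conditional-2   # hypothetical name
      Type: CONDITIONAL
      Actions:
        - JobName: !Ref CFNJobName
      Predicate:
        # AND requires every condition to hold; ANY would fire on the first match
        Logical: AND
        Conditions:
          - LogicalOperator: EQUALS
            JobName: !Ref CFNJobName2
            State: SUCCEEDED
          - LogicalOperator: EQUALS
            JobName: !Ref CFNJobName3
            State: SUCCEEDED
```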

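Sample Amazon CloudFormation template for an Amazon Glue machine learning transform

An Amazon Glue machine learning transform is a custom transform to cleanse your data. There is currently one available transform, named FindMatches. The FindMatches transform enables you to identify duplicate or matching records in your dataset, even when the records do not have a common unique identifier and no fields match exactly.

This sample creates a machine learning transform. For more information about the parameters that you need to create a machine learning transform, see Matching records with Amazon Lake Formation FindMatches.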



---
AWSTemplateFormatVersion: '2010-09-09'
# Sample CFN YAML to demonstrate creating a machine learning transform
#
# Resources section defines metadata for the machine learning transform
Resources:
  MyMLTransform:
    Type: "AWS::Glue::MLTransform"
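    # The condition isGlueMLGARegion must be defined in a Conditions section, which this sample does not show
    Condition: "isGlueMLGARegion"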
    Properties:
      Name: !Sub "MyTransform"
      Description: "The bestest transform ever"
      Role: !ImportValue MyMLTransformUserRole
      GlueVersion: "1.0"
      WorkerType: "Standard"
      NumberOfWorkers: 5
      Timeout: 120
      MaxRetries: 1
      InputRecordTables:
        GlueTables:
          - DatabaseName: !ImportValue MyMLTransformDatabase
            TableName: !ImportValue MyMLTransformTable
      TransformParameters:
        TransformType: "FIND_MATCHES"
        FindMatchesParameters:
          PrimaryKeyColumnName: "testcolumn"
          PrecisionRecallTradeoff: 0.5
          AccuracyCostTradeoff: 0.5
          EnforceProvidedLabels: True
      Tags:
        key1: "value1"
        key2: "value2"
      TransformEncryption:
        TaskRunSecurityConfigurationName: !ImportValue MyMLTransformSecurityConfiguration
        MLUserDataEncryption:
          MLUserDataEncryptionMode: "SSE-KMS"
          KmsKeyId: !ImportValue MyMLTransformEncryptionKey
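
The template above attaches the condition isGlueMLGARegion to the transform but does not define it, and its !ImportValue references assume that another stack already exports those values. A minimal sketch of a Conditions section that would satisfy the condition reference is shown here. The Region names are placeholders chosen for illustration, not a statement about where the feature is available.

```yaml
Conditions:
  # Hypothetical Region list: the transform is created only when the stack
  # is deployed in one of these Regions. Replace with the Regions you use.
  isGlueMLGARegion: !Or
    - !Equals [!Ref "AWS::Region", us-east-1]
    - !Equals [!Ref "AWS::Region", us-west-2]
```

If the condition evaluates to false, CloudFormation skips the MyMLTransform resource instead of failing the stack.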

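Sample Amazon CloudFormation template for an Amazon Glue Data Quality ruleset

An Amazon Glue Data Quality ruleset contains rules that can be evaluated on a table in the Data Catalog. After the ruleset is attached to a target table, you can go to the Data Catalog and run an evaluation that checks your data against the rules in the ruleset. Rules can range from evaluating row counts to evaluating the referential integrity of your data.

The following example is a CloudFormation template that creates a ruleset with a variety of rules on the specified target table.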


AWSTemplateFormatVersion: '2010-09-09'
# Sample CFN YAML to demonstrate creating a DataQualityRuleset
#
# Parameters section contains names that are substituted in the Resources section
# These parameters are the names of the resources created in the Data Catalog
Parameters:                                                                                                       
  # The name of the ruleset to be created
  RulesetName:  
    Type: String
    Default: "CFNRulesetName"
  RulesetDescription:  
    Type: String
    Default: "CFN DataQualityRuleset"
  # Rules that will be associated with this ruleset
  Rules:  
    Type: String
    Default: 'Rules = [
        RowCount > 100,
        IsUnique "id",
        IsComplete "nametype"
        ]'
  # Name of the database and table within the Data Catalog that the ruleset
  # will be applied to
  DatabaseName:  
    Type: String
    Default: "ExampleDatabaseName"
  TableName:  
    Type: String
    Default: "ExampleTableName"

# Resources section defines metadata for the Data Catalog
Resources:
  # Creates a Data Quality ruleset under specified rules 
  DQRuleset:
    Type: AWS::Glue::DataQualityRuleset
    Properties:
      Name: !Ref RulesetName
      Description: !Ref RulesetDescription
      # The String within rules must be formatted in DQDL, a language 
      # used specifically to make rules
      Ruleset: !Ref Rules
      # The targeted table must exist within Data Catalog alongside 
      # the correct database
      TargetTable:
        DatabaseName: !Ref DatabaseName
        TableName: !Ref TableName
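
The Rules parameter is a DQDL string, so it can mix simple checks with threshold-based ones. The fragment below is a sketch of an alternative default that uses a YAML folded scalar to keep the rule list readable. The column names are reused from the template above, and the thresholds are arbitrary values chosen for illustration.

```yaml
Parameters:
  Rules:
    Type: String
    # ">-" folds the lines below into one space-separated string
    Default: >-
      Rules = [
      RowCount > 100,
      IsComplete "id",
      Uniqueness "id" > 0.99,
      Completeness "nametype" > 0.95
      ]
```

Unlike IsUnique and IsComplete, which require every row to pass, Uniqueness and Completeness accept a ratio, which can be useful when a small share of bad rows is tolerable.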

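Sample Amazon CloudFormation template for an Amazon Glue Data Quality ruleset with an EventBridge scheduler

An Amazon Glue Data Quality ruleset contains rules that can be evaluated on a table in the Data Catalog. After the ruleset is attached to a target table, you can go to the Data Catalog and run an evaluation that checks your data against the rules in the ruleset. Instead of going to the Data Catalog to evaluate the ruleset manually, you can add an EventBridge scheduler to the CloudFormation template to schedule ruleset evaluations at a timed interval.

The following example is a CloudFormation template that creates a Data Quality ruleset and an EventBridge scheduler that evaluates that ruleset every five minutes.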


AWSTemplateFormatVersion: '2010-09-09'
# Sample CFN YAML to demonstrate creating a DataQualityRuleset
#
# Parameters section contains names that are substituted in the Resources section
# These parameters are the names of the resources created in the Data Catalog
Parameters:                                                                                                       
  # The name of the ruleset to be created
  RulesetName:  
    Type: String
    Default: "CFNRulesetName"
  # Rules that will be associated with this Ruleset
  Rules:  
    Type: String
    Default: 'Rules = [
        RowCount > 100,
        IsUnique "id",
        IsComplete "nametype"
        ]'
  # The name of the Schedule to be created  
  ScheduleName:  
    Type: String
    Default: "ScheduleDQRulsetEvaluation"
  # This expression determines the rate at which the Schedule will evaluate
  # your data using the above ruleset
  ScheduleRate:
    Type: String
    Default: "rate(5 minutes)"
  # The request being sent must match the details of the Data Quality ruleset
  ScheduleRequest:
    Type: String
    Default: '
        { "DataSource": { "GlueTable": { "DatabaseName": "ExampleDatabaseName",
         "TableName": "ExampleTableName" } },
         "Role": "role/AWSGlueServiceRoleDefault",
          "RulesetNames": [ "CFNRulesetName" ] }
        '

# Resources section defines metadata for the Data Catalog
Resources:
  # Creates a Data Quality ruleset under specified rules 
  DQRuleset:
    Type: AWS::Glue::DataQualityRuleset
    Properties:
      Name: !Ref RulesetName
      Description: "CFN DataQualityRuleset"
      # The String within rules must be formatted in DQDL, a language 
      # used specifically to make rules
      Ruleset: !Ref Rules
      # The targeted table must exist within Data Catalog alongside 
      # the correct database
      TargetTable:
        DatabaseName: "ExampleDatabaseName"
        TableName: "ExampleTableName"
  # Create a Scheduler to schedule evaluation runs on the above ruleset
  ScheduleDQEval:
    Type: AWS::Scheduler::Schedule
    Properties: 
      Name: !Ref ScheduleName
      Description: "Schedule DataQualityRuleset Evaluations"
      FlexibleTimeWindow: 
        Mode: "OFF"
      ScheduleExpression: !Ref ScheduleRate
      ScheduleExpressionTimezone: "America/New_York"
      State: "ENABLED"
      Target: 
        # The ARN identifies the API operation that the schedule runs. To
        # evaluate the ruleset, use this specific ARN.
        Arn: "arn:aws:scheduler:::aws-sdk:glue:startDataQualityRulesetEvaluationRun"
        # The role in RoleArn must have permission to run the scheduled operation
        RoleArn: "arn:aws:iam::123456789012:role/AWSGlueServiceRoleDefault"
        # This is the request that is sent to the Arn. Its values must match
        # the ruleset name and target table defined in this template.
        Input: '
          { "DataSource": { "GlueTable": { "DatabaseName": "ExampleDatabaseName", "TableName": "ExampleTableName" } },
            "Role": "role/AWSGlueServiceRoleDefault",
            "RulesetNames": [ "CFNRulesetName" ] }
          '
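
The request that the scheduler sends has to name the same database, table, and ruleset that the template creates. One way to keep them in sync is to build the request from the template parameters with !Sub instead of repeating the names. The fragment below is a sketch under that assumption. DatabaseName, TableName, and RoleName are hypothetical parameters that are not part of the template above, and the "role/..." form of the Role value simply mirrors the original request.

```yaml
Parameters:
  RulesetName:
    Type: String
    Default: "CFNRulesetName"
  DatabaseName:            # hypothetical parameter
    Type: String
    Default: "ExampleDatabaseName"
  TableName:               # hypothetical parameter
    Type: String
    Default: "ExampleTableName"
  RoleName:                # hypothetical parameter: an existing IAM role
    Type: String
    Default: "AWSGlueServiceRoleDefault"

Resources:
  ScheduleDQEval:
    Type: AWS::Scheduler::Schedule
    Properties:
      FlexibleTimeWindow:
        Mode: "OFF"
      ScheduleExpression: "rate(5 minutes)"
      Target:
        Arn: "arn:aws:scheduler:::aws-sdk:glue:startDataQualityRulesetEvaluationRun"
        # Build the role ARN from pseudo parameters instead of hard-coding
        # the account ID
        RoleArn: !Sub "arn:${AWS::Partition}:iam::${AWS::AccountId}:role/${RoleName}"
        # Each ${...} is replaced with the parameter value at deploy time
        Input: !Sub |
          {
            "DataSource": {
              "GlueTable": {
                "DatabaseName": "${DatabaseName}",
                "TableName": "${TableName}"
              }
            },
            "Role": "role/${RoleName}",
            "RulesetNames": ["${RulesetName}"]
          }
```

With this layout, renaming the ruleset or pointing it at another table requires changing only the parameter values.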

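Sample Amazon CloudFormation template for an Amazon Glue development endpoint

An Amazon Glue development endpoint is an environment that you can use to develop and test your Amazon Glue scripts.

This example creates a development endpoint with the minimal network parameter values required to create it successfully. For more information about the parameters that you need to set up a development endpoint, see Setting up networking for development for Amazon Glue.

You provide an existing IAM role ARN (Amazon Resource Name) to create the development endpoint. If you plan to create a notebook server on the development endpoint, supply a valid RSA public key and keep the corresponding private key available.

Note

You manage any notebook server that you create that is associated with a development endpoint. Therefore, if you delete the development endpoint, you must also delete the Amazon CloudFormation stack on the Amazon CloudFormation console to delete the notebook server.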



---
AWSTemplateFormatVersion: '2010-09-09'
# Sample CFN YAML to demonstrate creating a development endpoint
#
# Parameters section contains names that are substituted in the Resources section
# These parameters are the names of the resources created in the Data Catalog
Parameters:
  # The name of the development endpoint to be created
  CFNEndpointName:  
    Type: String
    Default: cfn-devendpoint-1
  CFNIAMRoleArn:
    Type: String
    Default: arn:aws:iam::123456789012:role/AWSGlueServiceRoleGA
#
#
# Resources section defines metadata for the Data Catalog
Resources:
  CFNDevEndpoint:
    Type: AWS::Glue::DevEndpoint
    Properties:
      EndpointName: !Ref CFNEndpointName
      #ExtraJarsS3Path: String
      #ExtraPythonLibsS3Path: String
      NumberOfNodes: 5
      PublicKey: ssh-rsa public.....key myuserid-key
      RoleArn: !Ref CFNIAMRoleArn
      SecurityGroupIds: 
        - sg-64986c0b
      SubnetId: subnet-c67cccac
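
The subnet and security group IDs above are hard-coded, so the template works only in the account and VPC they belong to. A sketch of the same resource with the network values passed in as typed parameters follows. CFNSubnetId and CFNSecurityGroupIds are hypothetical parameter names, and the other values are copied from the example above.

```yaml
Parameters:
  # Hypothetical parameters; CloudFormation checks that the supplied IDs
  # exist in the account and Region before it creates the stack
  CFNSubnetId:
    Type: AWS::EC2::Subnet::Id
  CFNSecurityGroupIds:
    Type: List<AWS::EC2::SecurityGroup::Id>

Resources:
  CFNDevEndpoint:
    Type: AWS::Glue::DevEndpoint
    Properties:
      EndpointName: cfn-devendpoint-1
      NumberOfNodes: 5
      PublicKey: ssh-rsa public.....key myuserid-key
      RoleArn: arn:aws:iam::123456789012:role/AWSGlueServiceRoleGA
      # !Ref on a List<...> parameter returns the list of IDs
      SecurityGroupIds: !Ref CFNSecurityGroupIds
      SubnetId: !Ref CFNSubnetId
```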
