Query large datasets (Amazon Athena, Amazon S3, Amazon Glue, Amazon SNS) - Amazon Step Functions

Query large datasets (Amazon Athena, Amazon S3, Amazon Glue, Amazon SNS)

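This sample project demonstrates how to ingest a large dataset in Amazon S3, partition it with Amazon Glue crawlers, and then run Amazon Athena queries against that partition. Deploying this sample project creates an Amazon Step Functions state machine, an Amazon S3 bucket, an Amazon Glue crawler, and an Amazon SNS topic.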

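In this project, the Step Functions state machine invokes an Amazon Glue crawler that partitions a large dataset in Amazon S3. Once the Amazon Glue crawler returns a success message, the workflow runs Athena queries against that partition. After the query completes successfully, the workflow publishes the query results to the Amazon SNS topic.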
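
The control flow described above can be modeled in a few lines of plain Python. This is only an illustrative sketch, not how Step Functions executes the workflow: the four callables (`start_crawler`, `get_crawler_state`, `run_query`, `publish`) are hypothetical stand-ins for the Glue, Athena, and SNS calls, and the 30-second default mirrors the Wait state in the definition later on this page.

```python
import time
from typing import Callable, List


def run_workflow(
    start_crawler: Callable[[], None],
    get_crawler_state: Callable[[], str],
    run_query: Callable[[], List[dict]],
    publish: Callable[[List[dict]], None],
    poll_seconds: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Model of the sample workflow; returns how many times it waited."""
    start_crawler()
    waits = 0
    # Keep polling while the crawler reports RUNNING; any other
    # state falls through to the query step.
    while get_crawler_state() == "RUNNING":
        sleep(poll_seconds)
        waits += 1
    rows = run_query()
    publish(rows)
    return waits
```

Passing the `sleep` function as a parameter is a design choice for this sketch: it lets the loop be exercised with fake callables and no real waiting.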

Create the state machine and provision resources

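  1. Open the Amazon Athena console at https://console.aws.amazon.com/athena/.

  2. In the left navigation pane, open the page that shows the "Query large datasets" tile.

  3. In the "Query large datasets" tile, choose "Get started".

  4. In the "Get started" dialog box, choose "Deploy a sample project", and then choose "Continue".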

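  5. You are redirected to the "Review workflow" page of the Step Functions console. Review the Amazon States Language definition that was generated for the sample project.

    The state machine definition and the visual workflow are displayed.

    Figure: Query large datasets workflow.

  6. Choose "Next".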

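    The "Deploy and run" page is displayed, listing the resources that will be created. This sample project creates the following resources:

    • Amazon Athena

    • A Lambda function

    • An Amazon S3 bucket

    • An Amazon SNS topic

    • An Amazon Glue database

  7. Choose "Deploy and run".

    Note

    Creating these resources and the related IAM permissions can take some time. While the "Deploy and run" page is displayed, you can open the Stack ID link to see which resources are being provisioned.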

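Start a new execution

  1. On the "New execution" page, optionally enter an execution name, and then choose "Start Execution".

  2. (Optional) To identify your execution, you can specify a name for it in the "Name" box. By default, Step Functions automatically generates a unique execution name.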

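    Note

    Step Functions allows you to create names for state machines, executions, activities, and labels that contain non-ASCII characters. These non-ASCII names don't work with Amazon CloudWatch. To ensure that you can track CloudWatch metrics, choose a name that uses only ASCII characters.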

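  3. (Optional) You can go to the newly created state machine on the Step Functions dashboard, and then choose "New execution".

  4. When the execution completes, you can select states on the "Visual workflow" and browse the "Input" and "Output" under "Step details".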
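
The same step can be scripted. The sketch below is a minimal, hedged example rather than part of the sample project: `start_named_execution` is a hypothetical helper, and `sfn_client` is assumed to be any object exposing the `start_execution(stateMachineArn=..., name=..., input=...)` call that the boto3 Step Functions client provides. It also enforces the ASCII-only naming advice from the note above.

```python
import json
from typing import Any, Optional


def start_named_execution(
    sfn_client: Any,
    state_machine_arn: str,
    name: Optional[str] = None,
    payload: Optional[dict] = None,
) -> dict:
    """Start an execution, rejecting names CloudWatch cannot track."""
    kwargs = {
        "stateMachineArn": state_machine_arn,
        "input": json.dumps(payload or {}),
    }
    if name is not None:
        if not name.isascii():
            raise ValueError(f"execution name must be ASCII only: {name!r}")
        kwargs["name"] = name
    # When no name is passed, Step Functions generates a unique one.
    return sfn_client.start_execution(**kwargs)
```

With boto3 installed and credentials configured, `sfn_client` would typically be `boto3.client("stepfunctions")`, and `state_machine_arn` would be the ARN of the state machine that the sample project deployed.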

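Example state machine code

The state machine in this sample project integrates with Amazon S3, Amazon Glue, Amazon Athena, and Amazon SNS by passing parameters directly to those resources.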

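Browse through this example state machine to see how Step Functions controls Amazon S3, Amazon Glue, Amazon Athena, and Amazon SNS by connecting to the Amazon Resource Name (ARN) in the Resource field and by passing Parameters to the service API.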
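
To make the Resource field easier to read, the following sketch splits a service integration ARN into its parts. It assumes the two shapes that appear in this sample, `arn:<partition>:states:::<service>:<action>[.<pattern>]` and `arn:<partition>:states:::aws-sdk:<service>:<action>`; the `Integration` type and `parse_integration_arn` function are hypothetical names introduced only for illustration.

```python
from typing import NamedTuple, Optional


class Integration(NamedTuple):
    kind: str               # "optimized" or "aws-sdk"
    service: str
    action: str
    pattern: Optional[str]  # e.g. "sync", or None when no suffix is given


def parse_integration_arn(arn: str) -> Integration:
    """Split a Step Functions service integration ARN into its parts."""
    parts = arn.split(":")
    # Expected shape: arn:<partition>:states:::<target...>
    well_formed = (
        len(parts) >= 7
        and parts[0] == "arn"
        and parts[2] == "states"
        and parts[3] == ""
        and parts[4] == ""
    )
    if not well_formed:
        raise ValueError(f"not a service integration ARN: {arn!r}")
    target = parts[5:]
    kind = "optimized"
    if target[0] == "aws-sdk":
        kind, target = "aws-sdk", target[1:]
    if len(target) != 2:
        raise ValueError(f"unexpected target in ARN: {arn!r}")
    service, call = target
    action, _, pattern = call.partition(".")
    return Integration(kind, service, action, pattern or None)
```

For example, `arn:aws:states:::athena:startQueryExecution.sync` parses to the `athena` service, the `startQueryExecution` action, and the `sync` pattern, while a string that does not match either shape raises `ValueError`.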

For more information about how Amazon Step Functions can control other Amazon services, see Using Amazon Step Functions with other services.

{
  "Comment": "An example demonstrates how to ingest a large data set in Amazon S3 and partition it through aws Glue Crawlers, then execute Amazon Athena queries against that partition.",
  "StartAt": "Start Crawler",
  "States": {
    "Start Crawler": {
      "Type": "Task",
      "Next": "Get Crawler status",
      "Parameters": {
        "Name": "<GLUE_CRAWLER_NAME>"
      },
      "Resource": "arn:aws:states:::aws-sdk:glue:startCrawler"
    },
    "Get Crawler status": {
      "Type": "Task",
      "Parameters": {
        "Name": "<GLUE_CRAWLER_NAME>"
      },
      "Resource": "arn:aws:states:::aws-sdk:glue:getCrawler",
      "Next": "Check Crawler status"
    },
    "Check Crawler status": {
      "Type": "Choice",
      "Choices": [
        {
          "Variable": "$.Crawler.State",
          "StringEquals": "RUNNING",
          "Next": "Wait"
        }
      ],
      "Default": "Start an Athena query"
    },
    "Wait": {
      "Type": "Wait",
      "Seconds": 30,
      "Next": "Get Crawler status"
    },
    "Start an Athena query": {
      "Resource": "arn:aws:states:::athena:startQueryExecution.sync",
      "Parameters": {
        "QueryString": "<ATHENA_QUERYSTRING>",
        "WorkGroup": "<ATHENA_WORKGROUP>"
      },
      "Type": "Task",
      "Next": "Get query results"
    },
    "Get query results": {
      "Resource": "arn:aws:states:::athena:getQueryResults",
      "Parameters": {
        "QueryExecutionId.$": "$.QueryExecution.QueryExecutionId"
      },
      "Type": "Task",
      "Next": "Send query results"
    },
    "Send query results": {
      "Resource": "arn:aws:states:::sns:publish",
      "Parameters": {
        "TopicArn": "<SNS_TOPIC_ARN>",
        "Message": {
          "Input.$": "$.ResultSet.Rows"
        }
      },
      "Type": "Task",
      "End": true
    }
  }
}
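
When editing a definition like the one above by hand, it is easy to mistype a state name. The following is a rough sketch of a local check, not an official validator: `find_broken_transitions` is a hypothetical helper that only inspects top-level `StartAt`, `Next`, `Default`, and `Choices[].Next` fields, and it ignores `Catch` clauses and states nested inside Parallel or Map states.

```python
import json
from typing import List


def find_broken_transitions(definition_json: str) -> List[str]:
    """Return messages for transitions that point at undefined states."""
    definition = json.loads(definition_json)
    states = definition["States"]
    problems = []
    if definition["StartAt"] not in states:
        problems.append(f"StartAt -> {definition['StartAt']}")
    for name, state in states.items():
        # Collect every state name this state can hand control to.
        targets = [state.get("Next"), state.get("Default")]
        targets += [rule.get("Next") for rule in state.get("Choices", [])]
        for target in targets:
            if target is not None and target not in states:
                problems.append(f"{name} -> {target}")
    return problems
```

An empty list means every transition the helper looked at resolves to a defined state; it says nothing about whether the Resource ARNs or Parameters are valid.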

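IAM examples

These example Amazon Identity and Access Management (IAM) policies generated by the sample project include the least privilege necessary to execute the state machine and its related resources. We recommend that you include only the necessary permissions in your IAM policies.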

AthenaGetQueryResults

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "athena:getQueryResults"
      ],
      "Resource": [
        "arn:aws:athena:us-east-2:123456789012:workgroup/*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject"
      ],
      "Resource": [
        "arn:aws:s3:::*"
      ]
    }
  ]
}

AthenaStartQueryExecution

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "athena:startQueryExecution",
        "athena:stopQueryExecution",
        "athena:getQueryExecution",
        "athena:getDataCatalog"
      ],
      "Resource": [
        "arn:aws:athena:us-east-2:123456789012:workgroup/stepfunctions-athena-sample-project-workgroup-8v7bshiv70",
        "arn:aws:athena:us-east-2:123456789012:datacatalog/*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetBucketLocation",
        "s3:GetObject",
        "s3:ListBucket",
        "s3:ListBucketMultipartUploads",
        "s3:ListMultipartUploadParts",
        "s3:AbortMultipartUpload",
        "s3:CreateBucket",
        "s3:PutObject"
      ],
      "Resource": [
        "arn:aws:s3:::*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "glue:CreateDatabase",
        "glue:GetDatabase",
        "glue:GetDatabases",
        "glue:UpdateDatabase",
        "glue:DeleteDatabase",
        "glue:CreateTable",
        "glue:UpdateTable",
        "glue:GetTable",
        "glue:GetTables",
        "glue:DeleteTable",
        "glue:BatchDeleteTable",
        "glue:BatchCreatePartition",
        "glue:CreatePartition",
        "glue:UpdatePartition",
        "glue:GetPartition",
        "glue:GetPartitions",
        "glue:BatchGetPartition",
        "glue:DeletePartition",
        "glue:BatchDeletePartition"
      ],
      "Resource": [
        "arn:aws:glue:us-east-2:123456789012:catalog",
        "arn:aws:glue:us-east-2:123456789012:database/*",
        "arn:aws:glue:us-east-2:123456789012:table/*",
        "arn:aws:glue:us-east-2:123456789012:userDefinedFunction/*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "lakeformation:GetDataAccess"
      ],
      "Resource": [
        "*"
      ]
    }
  ]
}

snsPublish

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "sns:Publish"
      ],
      "Resource": [
        "arn:aws:sns:us-east-2:123456789012:StepFunctionsSample-AthenaIngestLargeDataset92bc4949-abf8-4a1e-9236-5b7c81b3efa3-SNSTopic-8Y5ZLI5AASXV"
      ]
    }
  ]
}
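
When adapting policies like the ones above, one simple review aid is to list every statement whose resource contains a wildcard, so that each can be checked against the recommendation to grant only necessary permissions. The sketch below is an assumption-laden illustration, not an AWS tool: `wildcard_statements` is a hypothetical helper, and a wildcard it reports is something to review, not necessarily a mistake.

```python
import json
from typing import List, Tuple


def wildcard_statements(policy_json: str) -> List[Tuple[int, str]]:
    """List (statement index, resource) pairs whose resource contains '*'."""
    policy = json.loads(policy_json)
    statements = policy["Statement"]
    if isinstance(statements, dict):  # a single statement may be unwrapped
        statements = [statements]
    hits = []
    for index, statement in enumerate(statements):
        resources = statement.get("Resource", [])
        if isinstance(resources, str):  # a single resource may be a string
            resources = [resources]
        hits.extend((index, r) for r in resources if "*" in r)
    return hits
```

Run against the snsPublish policy, this helper returns an empty list, because that policy names one specific topic ARN.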

For information about how to configure IAM when using Step Functions with other Amazon services, see IAM Policies for integrated services.