Query large datasets (Amazon Athena, Amazon S3, Amazon Glue, Amazon SNS) - Amazon Step Functions
Services or capabilities described in Amazon Web Services documentation might vary by Region. To see the differences applicable to the China Regions, see Getting Started with Amazon Web Services in China (PDF).

Query large datasets (Amazon Athena, Amazon S3, Amazon Glue, Amazon SNS)

This sample project demonstrates how to ingest a large dataset in Amazon S3 and partition it through Amazon Glue crawlers, then execute Amazon Athena queries against that partition. Deploying this sample project creates an Amazon Step Functions state machine, an Amazon S3 bucket, an Amazon Glue crawler, and an Amazon SNS topic.

In this project, the Step Functions state machine invokes an Amazon Glue crawler that partitions a large dataset in Amazon S3. After the crawler finishes running, the workflow executes Athena queries against that partition. After the query completes successfully, the workflow publishes the query results to an Amazon SNS topic.

Create the State Machine and Provision Resources

  1. Open the Amazon Athena console at https://console.amazonaws.cn/athena/.

  2. In the left navigation pane, choose Workflows.

  3. In the Query large datasets tile, choose Get started.

  4. In the Get started dialog box, choose Deploy a sample project, and then choose Continue.

  5. You're redirected to the Review workflow page of the Step Functions console. Review the Amazon States Language definition automatically generated for the sample project.

    The state machine Workflow definition and Visual Workflow are displayed.

    
    Figure: Query large datasets workflow.

  6. Choose Next.

    The Deploy and run page is displayed, listing the resources that will be created. This sample project creates the following resources:

    • Amazon Athena queries

    • A Lambda function

    • An Amazon S3 bucket

    • An Amazon SNS topic

    • An Amazon Glue database

  7. Choose Deploy and run.

    Note

    It can take up to 10 minutes for these resources and related IAM permissions to be created. While the Deploy and run page is displayed, you can open the Stack ID link to see which resources are being provisioned.

Start a New Execution

  1. On the New execution page, optionally enter a name for the execution in the Name box. By default, Step Functions automatically generates a unique execution name.

    Note

    Step Functions allows you to create names for state machines, executions, activities, and labels that contain non-ASCII characters. These non-ASCII names don't work with Amazon CloudWatch. To ensure that you can track CloudWatch metrics, choose a name that uses only ASCII characters.

  2. Choose Start Execution.

  3. (Optional) You can also go to the newly created state machine on the Step Functions Dashboard, and then choose New execution.

  4. When an execution is complete, you can select states on the Visual workflow and browse the Input and Output under Step details.
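
The console steps above can also be scripted. The following is a minimal sketch, assuming the AWS SDK for Python (boto3) and its Step Functions `start_execution` call. The helper name `start_named_execution` is hypothetical and not part of the sample project. It rejects non-ASCII names up front, following the note about CloudWatch, and omits the name entirely when none is given so that Step Functions generates one.

```python
import json


def start_named_execution(sfn_client, state_machine_arn, name=None, payload=None):
    """Start an execution and return its ARN.

    sfn_client is assumed to be a boto3 Step Functions client, for example
    boto3.client("stepfunctions"). Any object with a compatible
    start_execution method also works, which keeps the helper testable.
    """
    kwargs = {
        "stateMachineArn": state_machine_arn,
        # The execution input must be passed as a JSON string.
        "input": json.dumps(payload or {}),
    }
    if name is not None:
        # Non-ASCII names are accepted by Step Functions but do not work
        # with CloudWatch, so refuse them here.
        if not name.isascii():
            raise ValueError(f"execution name {name!r} contains non-ASCII characters")
        kwargs["name"] = name
    response = sfn_client.start_execution(**kwargs)
    return response["executionArn"]
```

With a real client, a call might look like `start_named_execution(boto3.client("stepfunctions"), "<STATE_MACHINE_ARN>", name="first-run")`, where the ARN placeholder is the state machine created by the sample project.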

Example State Machine Code

The state machine in this sample project integrates with Amazon S3, Amazon Glue, Amazon Athena and Amazon SNS by passing parameters directly to those resources.

Browse through this example state machine to see how Step Functions controls Amazon S3, Amazon Glue, Amazon Athena and Amazon SNS by connecting to the Amazon Resource Name (ARN) in the Resource field, and by passing Parameters to the service API.

For more information about how Amazon Step Functions can control other Amazon services, see Using Amazon Step Functions with other services.

{
  "Comment": "An example demonstrates how to ingest a large data set in Amazon S3 and partition it through aws Glue Crawlers, then execute Amazon Athena queries against that partition.",
  "StartAt": "Start Crawler",
  "States": {
    "Start Crawler": {
      "Type": "Task",
      "Next": "Get Crawler status",
      "Parameters": {
        "Name": "<GLUE_CRAWLER_NAME>"
      },
      "Resource": "arn:aws-cn:states:::aws-sdk:glue:startCrawler"
    },
    "Get Crawler status": {
      "Type": "Task",
      "Parameters": {
        "Name": "<GLUE_CRAWLER_NAME>"
      },
      "Resource": "arn:aws-cn:states:::aws-sdk:glue:getCrawler",
      "Next": "Check Crawler status"
    },
    "Check Crawler status": {
      "Type": "Choice",
      "Choices": [
        {
          "Variable": "$.Crawler.State",
          "StringEquals": "RUNNING",
          "Next": "Wait"
        }
      ],
      "Default": "Start an Athena query"
    },
    "Wait": {
      "Type": "Wait",
      "Seconds": 30,
      "Next": "Get Crawler status"
    },
    "Start an Athena query": {
      "Resource": "arn:aws-cn:states:::athena:startQueryExecution.sync",
      "Parameters": {
        "QueryString": "<ATHENA_QUERYSTRING>",
        "WorkGroup": "<ATHENA_WORKGROUP>"
      },
      "Type": "Task",
      "Next": "Get query results"
    },
    "Get query results": {
      "Resource": "arn:aws-cn:states:::athena:getQueryResults",
      "Parameters": {
        "QueryExecutionId.$": "$.QueryExecution.QueryExecutionId"
      },
      "Type": "Task",
      "Next": "Send query results"
    },
    "Send query results": {
      "Resource": "arn:aws-cn:states:::sns:publish",
      "Parameters": {
        "TopicArn": "<SNS_TOPIC_ARN>",
        "Message": {
          "Input.$": "$.ResultSet.Rows"
        }
      },
      "Type": "Task",
      "End": true
    }
  }
}
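
To make the control flow easier to follow, here is a rough Python sketch of the same sequence of steps, written against boto3-style Glue, Athena, and SNS clients. It is an illustration only, not how Step Functions runs the workflow. The function name `run_workflow` and its parameters are hypothetical, and the Athena polling loop stands in for the `.sync` integration, which waits for the query to finish on your behalf. As in the Choice state, any crawler state other than `RUNNING` is treated as finished.

```python
import json
import time


def run_workflow(glue, athena, sns, crawler_name, query, workgroup, topic_arn,
                 poll_seconds=30, sleep=time.sleep):
    """Mirror the sample state machine step by step and return the result rows."""
    # "Start Crawler"
    glue.start_crawler(Name=crawler_name)

    # "Get Crawler status" -> "Check Crawler status" -> "Wait" (30 seconds)
    while glue.get_crawler(Name=crawler_name)["Crawler"]["State"] == "RUNNING":
        sleep(poll_seconds)

    # "Start an Athena query": the .sync integration waits for completion,
    # so this sketch polls the query state explicitly.
    query_id = athena.start_query_execution(
        QueryString=query, WorkGroup=workgroup
    )["QueryExecutionId"]
    while True:
        execution = athena.get_query_execution(QueryExecutionId=query_id)
        state = execution["QueryExecution"]["Status"]["State"]
        if state not in ("QUEUED", "RUNNING"):
            break
        sleep(poll_seconds)
    if state != "SUCCEEDED":
        raise RuntimeError(f"Athena query ended in state {state}")

    # "Get query results"
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]

    # "Send query results": the state machine publishes {"Input": rows}
    sns.publish(TopicArn=topic_arn, Message=json.dumps({"Input": rows}))
    return rows
```

The `sleep` parameter is injectable so the loop can be exercised with stub clients without actually waiting.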

IAM Examples

These example Amazon Identity and Access Management (IAM) policies generated by the sample project include the least privilege necessary to execute the state machine and related resources. We recommend that you include only those permissions that are necessary in your IAM policies.

AthenaGetQueryResults

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "athena:getQueryResults"
      ],
      "Resource": [
        "arn:aws-cn:athena:us-west-2:123456789012:workgroup/*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject"
      ],
      "Resource": [
        "arn:aws-cn:s3:::*"
      ]
    }
  ]
}

AthenaStartQueryExecution

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "athena:startQueryExecution",
        "athena:stopQueryExecution",
        "athena:getQueryExecution",
        "athena:getDataCatalog"
      ],
      "Resource": [
        "arn:aws-cn:athena:us-west-2:123456789012:workgroup/stepfunctions-athena-sample-project-workgroup-8v7bshiv70",
        "arn:aws-cn:athena:us-west-2:123456789012:datacatalog/*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetBucketLocation",
        "s3:GetObject",
        "s3:ListBucket",
        "s3:ListBucketMultipartUploads",
        "s3:ListMultipartUploadParts",
        "s3:AbortMultipartUpload",
        "s3:CreateBucket",
        "s3:PutObject"
      ],
      "Resource": [
        "arn:aws-cn:s3:::*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "glue:CreateDatabase",
        "glue:GetDatabase",
        "glue:GetDatabases",
        "glue:UpdateDatabase",
        "glue:DeleteDatabase",
        "glue:CreateTable",
        "glue:UpdateTable",
        "glue:GetTable",
        "glue:GetTables",
        "glue:DeleteTable",
        "glue:BatchDeleteTable",
        "glue:BatchCreatePartition",
        "glue:CreatePartition",
        "glue:UpdatePartition",
        "glue:GetPartition",
        "glue:GetPartitions",
        "glue:BatchGetPartition",
        "glue:DeletePartition",
        "glue:BatchDeletePartition"
      ],
      "Resource": [
        "arn:aws-cn:glue:us-west-2:123456789012:catalog",
        "arn:aws-cn:glue:us-west-2:123456789012:database/*",
        "arn:aws-cn:glue:us-west-2:123456789012:table/*",
        "arn:aws-cn:glue:us-west-2:123456789012:userDefinedFunction/*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "lakeformation:GetDataAccess"
      ],
      "Resource": [
        "*"
      ]
    }
  ]
}

SNSPublish

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "sns:Publish"
      ],
      "Resource": [
        "arn:aws-cn:sns:us-west-2:123456789012:StepFunctionsSample-AthenaIngestLargeDataset92bc4949-abf8-4a1e-9236-5b7c81b3efa3-SNSTopic-8Y5ZLI5AASXV"
      ]
    }
  ]
}
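
One way to review policies like these against the least-privilege recommendation is to list the actions each statement grants and flag resources that contain a wildcard. The following is a minimal sketch using only the Python standard library. The function name `summarize_policy` is hypothetical, and the check is deliberately simple: it does not evaluate conditions or `Deny` statements the way IAM does.

```python
import json


def _as_list(value):
    # IAM policy JSON allows a single string wherever a list is accepted.
    return [value] if isinstance(value, str) else list(value)


def summarize_policy(policy_json):
    """Return (action, resource, has_wildcard) tuples for each Allow statement."""
    policy = json.loads(policy_json)
    statements = policy["Statement"]
    if isinstance(statements, dict):
        # A policy may also hold one statement object instead of a list.
        statements = [statements]
    summary = []
    for statement in statements:
        if statement.get("Effect") != "Allow":
            continue
        for action in _as_list(statement.get("Action", [])):
            for resource in _as_list(statement.get("Resource", [])):
                summary.append((action, resource, "*" in resource))
    return summary
```

Run against the AthenaGetQueryResults policy, both statements would be flagged, because the workgroup and Amazon S3 resources each end in a wildcard. Whether those can be narrowed depends on your own workgroup and bucket names.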

For information about how to configure IAM when using Step Functions with other Amazon services, see IAM Policies for integrated services.