Process data in an Amazon S3 bucket with Distributed Map - Amazon Step Functions
Services or capabilities described in Amazon Web Services documentation might vary by Region. To see the differences applicable to the China Regions, see Getting Started with Amazon Web Services in China (PDF).

Process data in an Amazon S3 bucket with Distributed Map

This sample project demonstrates how you can use the Distributed Map state to process large-scale data, for example, analyze historical weather data and identify the weather station that has the higest average temperature on the planet each month. The weather data is recorded in over 12,000 CSV files, which in turn are stored in an Amazon S3 bucket.

This sample project includes two Distributed Map states named Distributed S3 copy NOA Data and ProcessNOAAData. Distributed S3 copy NOA Data iterates over the CSV files in a public Amazon S3 bucket named noaa-gsod-pds and copies them to an Amazon S3 bucket in your Amazon Web Services account. ProcessNOAAData iterates over the copied files and includes a Lambda function that performs the temperature analysis.

The sample project first checks the contents of the Amazon S3 bucket with a call to the ListObjectsV2 API action. Based on the number of keys returned in response to this call, the sample project takes one of the following decisions:

  • If the key count is more than or equal to 1, the project transitions to the ProcessNOAAData state. This Distributed Map state includes a Lambda function named TemperatureFunction that finds the weather station that had the highest average temperature each month. This function returns a dictionary with year-month as the key and a dictionary that contains information about the weather station as the value.

  • If the returned key count doesn't exceed 1, the Distributed S3 copy NOA Data state lists all objects from the public bucket noaa-gsod-pds and iteratively copies the individual objects to another bucket in your account in batches of 100. An Inline Map performs the iterative copying of the objects.

    After all objects are copied, the project transitions to the ProcessNOAAData state for processing the weather data.

The sample project finally transitions to a reducer Lambda function that performs a final aggregation of the results returned by the TemperatureFunction function and writes the results to an Amazon DynamoDB table.

With Distributed Map, you can run up to 10,000 parallel child workflow executions at a time. In this sample project, the maximum concurrency of ProcessNOAAData Distributed Map is set at 3000 that limits it to 3000 parallel child workflow executions.

This sample project creates the state machine, the supporting Amazon resources, and configures the related IAM permissions. Explore this sample project to learn about using the Distributed Map for orchestrating large-scale, parallel workloads, or use it as a starting point for your own projects.

Important

This sample project is only available in the US East (N. Virginia) Region.

Amazon CloudFormation template and additional resources

You use a CloudFormation template to deploy this sample project. This template creates the following resources in your Amazon Web Services account:

  • A Step Functions state machine.

  • Execution role for the state machine. This role grants the permissions that your state machine needs to access other Amazon Web Services and resources such as the Lambda function's Invoke action.

  • An Amazon S3 bucket named NOAADataBucket. This bucket contains the CSV files with weather data.

  • A Lambda function named ReducerFunction that performs a final aggregation of the weather data and writes the results to an Amazon DynamoDB table.

  • Execution role for the reducer Lambda function. This role grants the function permission to access other Amazon Web Services.

  • An Amazon S3 output bucket named ResultsBucket to store the weather analysis results.

  • A DynamoDB table named ResultsDynamoDBTable that contains the results returned by the ReducerFunction.

  • A Lambda function named TemperatureFunction that finds the highest monthly average temperature.

  • Execution role for the Lambda function. This role grants the function permission to access other Amazon Web Services.

  • A CloudWatch log group that stores information related to the state machine’s execution history.

Important

Standard charges apply for each service.

Step 1: Create the state machine and provision resources

  1. Open the Step Functions console and choose Create state machine.

  2. Type Distributed Map to process files in S3 in the search box, and then choose Distributed Map to process files in S3 from the search results that are returned.

  3. Choose Next to continue.

  4. Step Functions lists the Amazon Web Services used in the sample project you selected. It also shows a workflow graph for the sample project. Deploy this project to your Amazon Web Services account or use it as a starting point for building your own projects. Based on how you want to proceed, choose Run a demo or Build on it.

    For information about the resources that will be created for this sample project, see Amazon CloudFormation template and additional resources.

    The following image shows the workflow graph for the Distributed Map to process files in S3 sample project:

    
                        Workflow graph of the Distributed Map to process files in S3 sample project.
  5. Choose Use template to continue with your selection.

  6. Do one of the following:

    • If you selected Build on it, Step Functions creates the workflow prototype for the sample project you selected. Step Functions doesn't deploy the resources listed in the workflow definition.

      In Workflow Studio's Design mode, drag and drop states from the States browser to continue building your workflow protoype. Or switch to the Code mode that provides an integrated code editor similar to VS Code for updating the Amazon States Language (ASL) definition of your state machine within the Step Functions console. For more information about using Workflow Studio to build your state machines, see Using Workflow Studio.

      Important

      Remember to update the placeholder Amazon Resource Name (ARN) for the resources used in the sample project before you run your workflow.

    • If you selected Run a demo, Step Functions creates a read-only sample project which uses an Amazon CloudFormation template to deploy the Amazon resources listed in that template to your Amazon Web Services account.

      Tip

      To view the state machine definition of the sample project, choose Code.

      When you're ready, choose Deploy and run to deploy the sample project and create the resources.

      It can take up to 10 minutes for these resources and related IAM permissions to be created. While your resources are being deployed, you can open the CloudFormation Stack ID link to see which resources are being provisioned.

      After all the resources in the sample project are created, you can see the new sample project listed on the State machines page.

      Important

      Standard charges may apply for each service used in the CloudFormation template.

Step 2: Run the state machine

After all the resources are provisioned and deployed, you can run the state machine.

  1. On the State machines page, choose your sample project.

  2. On the sample project page, choose Start execution.

  3. In the Start execution dialog box, do the following:

    1. (Optional) Enter input values in JSON format to run your sample project.

      If you chose to Run a demo, you need not provide any execution input.

      Note

      If the demo project you deployed contains prepopulated execution input data, use that input to run the state machine.

    2. Choose Start execution.

    3. (Optional) The Step Functions console directs you to a page that's titled with your execution ID. This page is known as the Execution Details page. On this page, you can review the execution results as the execution progresses or after it's complete.

      After the execution is complete, choose individual states on the Graph view, and then choose the individual tabs on the Step details pane to view each state's details including input, output, and definition respectively.

    4. (Optional) Review the execution results exported to the Amazon S3 bucket. These results include data, such as execution input and output, ARN, and execution status. For more information, see ResultWriter.