Best practices for Step Functions - Amazon Step Functions
Services or capabilities described in Amazon Web Services documentation might vary by Region. To see the differences applicable to the China Regions, see Getting Started with Amazon Web Services in China (PDF).

Best practices for Step Functions

The following topics are best practices to help you manage and optimize your Step Functions workflows.

Optimizing costs using Express Workflows

Step Functions determines pricing for Standard and Express workflows based on the workflow type you use to build your state machines. To optimize the cost of your serverless workflows, you can follow either or both of the following recommendations:

For information about how choosing a Standard or Express workflow type affects billing, see Amazon Step Functions Pricing.

Nest Express workflows inside Standard workflows

Step Functions runs workflows that have a finite duration and number of steps. Some workflows may complete execution within a short period of time. Others may require a combination of both long-running and high-event-rate workflows. With Step Functions, you can build large, complex workflows out of multiple smaller, simpler workflows.

For example, to build an order processing workflow, you can include all non-idempotent actions into a Standard workflow. This could include actions, such as approving order through human interaction and processing payments. You can then combine a series of idempotent actions, such as sending payment notifications and updating product inventory, in an Express workflow. You can nest this Express workflow within the Standard workflow. In this example, the Standard workflow is known as the parent state machine. The nested Express workflow is known as a child state machine.

Convert Standard workflows into Express workflows

You can convert your existing Standard workflows into Express workflows if they meet the following requirements:

  • The workflow must complete its execution within five minutes.

  • The workflow conforms to an at-least-once execution model. This means that each step in the workflow may run more than exactly once.

  • The workflow doesn't use the .waitForTaskToken or .sync service integration patterns.

Important

Express workflows use Amazon CloudWatch Logs to record execution histories. You will incur additional costs when using CloudWatch Logs.

To convert a Standard workflow into an Express workflow using the console
  1. Open the Step Functions console.

  2. On the State machines page, choose a Standard type state machine to open it.

    Tip

    From the Any type dropdown list, choose Standard to filter the state machines list and view only Standard workflows.

  3. Choose Copy to new.

    Workflow Studio opens in Design mode displaying workflow of the state machine you selected.

  4. (Optional) Update the workflow design.

  5. Specify a name for your state machine. To do this, choose the edit icon next to the default state machine name of MyStateMachine. Then, in State machine configuration, specify a name in the State machine name box.

  6. (Optional) In State machine configuration, specify other workflow settings, such as state machine type and its execution role.

    Make sure that for Type, you choose Express. Keep all the other default selections on State machine settings.

    Note

    If you're converting a Standard workflow previously defined in Amazon CDK or Amazon SAM, you must change the value of Type and Resource name.

  7. In the Confirm role creation dialog box, choose Confirm to continue.

    You can also choose View role settings to go back to State machine configuration.

    Note

    If you delete the IAM role that Step Functions creates, Step Functions can't recreate it later. Similarly, if you modify the role (for example, by removing Step Functions from the principals in the IAM policy), Step Functions can't restore its original settings later.

For more information about best practices and guidelines when you manage cost-optimization for your workflows, see Building cost-effective Amazon Step Functions workflows.

Tagging state machines and activities in Step Functions

Amazon Step Functions supports tagging state machines (both Standard and Express) and activities. Tags can help you track and manage your resources and provide better security in your Amazon Identity and Access Management (IAM) policies. After tagging Step Functions resources, you can manage them with Amazon Resource Groups. To learn how, see the Amazon Resource Groups User Guide.

For tag-based authorization, state machine execution resources as shown in the following example inherit the tags associated with a state machine.

arn:<partition>:states:<Region>:<account-id>:execution:<StateMachineName>:<ExecutionId>

When you call DescribeExecution or other APIs in which you specify the execution resource ARN, Step Functions uses tags associated with the state machine to accept or deny the request while performing tag-based authorization. This helps you allow or deny access to state machine executions at the state machine level.

To review the restrictions related to resource tagging, see Restrictions related to tagging.

Tagging for Cost Allocation

You can use cost allocation tags to identify the purpose of a state machine and reflect that organization in your Amazon bill. Sign up to get your Amazon account bill to include the tag keys and values. See Setting Up a Monthly Cost Allocation Report in the Amazon Billing User Guide for details on setting up reports.

For example, you could add tags that represent your cost center and purpose of your Step Functions resources, as follows.

Resource Key Value
StateMachine1 Cost Center 34567
Application Image processing
StateMachine2 Cost Center 34567
Application Rekognition processing

Tagging for Security

IAM supports controlling access to resources based on tags. To control access based on tags, provide information about your resource tags in the condition element of an IAM policy.

For example, you could restrict access to all Step Functions resources that include a tag with the key environment and the value production.

{ "Version": "2012-10-17", "Statement": [ { "Effect": "Deny", "Action": [ "states:TagResource", "states:DeleteActivity", "states:DeleteStateMachine", "states:StopExecution" ], "Resource": "*", "Condition": { "StringEquals": {"aws:ResourceTag/environment": "production"} } } ] }

For more information, see Controlling Access Using Tags in the IAM User Guide.

Managing tags in the Step Functions console

You can view and manage tags for your state machines in the Step Functions console. From the Details page of a state machine, select Tags.

Managing tags with Step Functions API Actions

To manage tags using the Step Functions API, use the following API actions:

Using timeouts to avoid stuck Step Functions workflow executions

By default, the Amazon States Language doesn't specify timeouts for state machine definitions. Without an explicit timeout, Step Functions often relies solely on a response from an activity worker to know that a task is complete. If something goes wrong and the TimeoutSeconds field isn't specified for an Activity or Task state, an execution is stuck waiting for a response that will never come.

To avoid this situation, specify a reasonable timeout when you create a Task in your state machine. For example:

"ActivityState": { "Type": "Task", "Resource": "arn:aws-cn:states:us-east-1:123456789012:activity:HelloWorld", "TimeoutSeconds": 300, "Next": "NextState" }

If you use a callback with a task token (.waitForTaskToken), we recommend that you use heartbeats and add the HeartbeatSeconds field in your Task state definition. You can set HeartbeatSeconds to be less than the task timeout so if your workflow fails with a heartbeat error then you know it's because of the task failure instead of the task taking a long time to complete.

{ "StartAt": "Push to SQS", "States": { "Push to SQS": { "Type": "Task", "Resource": "arn:aws-cn:states:::sqs:sendMessage.waitForTaskToken", "HeartbeatSeconds": 600, "Parameters": { "MessageBody": { "myTaskToken.$": "$$.Task.Token" }, "QueueUrl": "https://sqs.us-east-1.amazonaws.com/123456789012/push-based-queue" }, "ResultPath": "$.SQS", "End": true } } }

For more information, see Task workflow state in the Amazon States Language documentation.

Note

You can set a timeout for your state machine using the TimeoutSeconds field in your Amazon States Language definition. For more information, see State machine structure in Amazon States Language for Step Functions workflows.

Using Amazon S3 ARNs instead of passing large payloads in Step Functions

Executions that pass large payloads of data between states can be terminated. If the data you are passing between states might grow to over 256 KB, use Amazon Simple Storage Service (Amazon S3) to store the data, and parse the Amazon Resource Name (ARN) of the bucket in the Payload parameter to get the bucket name and key value. Alternatively, adjust your implementation so that you pass smaller payloads in your executions.

In the following example, a state machine passes input to an Amazon Lambda function, which processes a JSON file in an Amazon S3 bucket. After you run this state machine, the Lambda function reads the contents of the JSON file, and returns the file contents as output.

Create the Lambda function

The following Lambda function named pass-large-payload reads the contents of a JSON file stored in a specific Amazon S3 bucket.

Note

After you create this Lambda function, make sure you provide its IAM role the appropriate permission to read from an Amazon S3 bucket. For example, attach the AmazonS3ReadOnlyAccess permission to the Lambda function's role.

import json import boto3 import io import os s3 = boto3.client('s3') def lambda_handler(event, context): event = event['Input'] final_json = str() s3 = boto3.resource('s3') bucket = event['bucket'].split(':')[-1] filename = event['key'] directory = "/tmp/{}".format(filename) s3.Bucket(bucket).download_file(filename, directory) with open(directory, "r") as jsonfile: final_json = json.load(jsonfile) os.popen("rm -rf /tmp") return final_json
Create the state machine

The following state machine invokes the Lambda function you previously created.

{ "StartAt":"Invoke Lambda function", "States":{ "Invoke Lambda function":{ "Type":"Task", "Resource":"arn:aws:states:::lambda:invoke", "Parameters":{ "FunctionName":"arn:aws-cn:lambda:us-west-2:123456789012:function:pass-large-payload", "Payload":{ "Input.$":"$" } }, "OutputPath": "$.Payload", "End":true } } }

Rather than pass a large amount of data in the input, you could save that data in an Amazon S3 bucket, and pass the Amazon Resource Name (ARN) of the bucket in the Payload parameter to get the bucket name and key value. Your Lambda function can then use that ARN to access the data directly. The following is example input for the state machine execution, where the data is stored in data.json in an Amazon S3 bucket named amzn-s3-demo-large-payload-json.

{ "key": "data.json", "bucket": "arn:aws-cn:s3:::amzn-s3-demo-large-payload-json" }

Starting new executions to avoid reaching the history quota in Step Functions

Amazon Step Functions has a hard quota of 25,000 entries in the execution event history. When an execution reaches 24,999 events, it waits for the next event to happen.

  • If the event number 25,000 is ExecutionSucceeded, the execution finishes successfully.

  • If the event number 25,000 isn't ExecutionSucceeded, the ExecutionFailed event is logged and the state machine execution fails because of reaching the history limit

To avoid reaching this quota for long-running executions, you can try one of the following workarounds:

Handle transient Lambda service exceptions

Amazon Lambda can occasionally experience transient service errors. In this case, invoking Lambda results in a 500 error, such as ClientExecutionTimeoutException, ServiceException, AWSLambdaException, or SdkClientException. As a best practice, proactively handle these exceptions in your state machine to Retry invoking your Lambda function, or to Catch the error.

Lambda errors are reported as Lambda.ErrorName. To retry a Lambda service exception error, you could use the following Retry code.

"Retry": [ { "ErrorEquals": [ "Lambda.ClientExecutionTimeoutException", "Lambda.ServiceException", "Lambda.AWSLambdaException", "Lambda.SdkClientException"], "IntervalSeconds": 2, "MaxAttempts": 6, "BackoffRate": 2 } ]
Note

Unhandled errors in Lambda are reported as Lambda.Unknown in the error output. These include out-of-memory errors and function timeouts. You can match on Lambda.Unknown, States.ALL, or States.TaskFailed to handle these errors. When Lambda hits the maximum number of invocations, the error is Lambda.TooManyRequestsException. For more information about Lambda Handled and Unhandled errors, see FunctionError in the Amazon Lambda Developer Guide.

For more information, see the following:

Avoiding latency when polling for activity tasks

The GetActivityTask API is designed to provide a taskToken exactly once. If a taskToken is dropped while communicating with an activity worker, a number of GetActivityTask requests can be blocked for 60 seconds waiting for a response until GetActivityTask times out.

If you only have a small number of polls waiting for a response, it's possible that all requests will queue up behind the blocked request and stop. However, if you have a large number of outstanding polls for each activity Amazon Resource Name (ARN), and some percentage of your requests are stuck waiting, there will be many more that can still get a taskToken and begin to process work.

For production systems, we recommend at least 100 open polls per activity ARN's at each point in time. If one poll gets blocked, and a portion of those polls queue up behind it, there are still many more requests that will receive a taskToken to process work while the GetActivityTask request is blocked.

To avoid these kinds of latency problems when polling for tasks:

  • Implement your pollers as separate threads from the work in your activity worker implementation.

  • Have at least 100 open polls per activity ARN at each point in time.

    Note

    Scaling to 100 open polls per ARN can be expensive. For example, 100 Lambda functions polling per ARN is 100 times more expensive than having a single Lambda function with 100 polling threads. To both reduce latency and minimize cost, use a language that has asynchronous I/O, and implement multiple polling threads per worker. For an example activity worker where the poller threads are separate from the work threads, see Example: Activity Worker in Ruby.

For more information on activities and activity workers see Learn about Activities in Step Functions.

CloudWatch Logs resource policy size limits

When you create a state machine with logging, or update an existing state machine to enable logging, Step Functions must update your CloudWatch Logs resource policy with the log group that you specify. CloudWatch Logs resource policies are limited to 5,120 characters.

When CloudWatch Logs detects that a policy approaches the size limit, CloudWatch Logs automatically enables logging for log groups that start with /aws/vendedlogs/.

You can prefix your CloudWatch Logs log group names with /aws/vendedlogs/ to avoid the CloudWatch Logs resource policy size limit. If you create a log group in the Step Functions console, the suggested log group name will already be prefixed with /aws/vendedlogs/states.

CloudWatch Logs also has a quota of 10 resource policies per region, per account. If you try to enable logging on a state machine that already has 10 CloudWatch Logs resource policies in a region for an account, the state machine will not be created or updated. For more information about logging quotes, see CloudWatch Logs quotas.

If you are having trouble sending logs to CloudWatch Logs, see Troubleshooting state machine logging to CloudWatch Logs. To learn more about logging in general, see Enable logging from Amazon services.