Monitoring Step Functions metrics using Amazon CloudWatch - Amazon Step Functions
Services or capabilities described in Amazon Web Services documentation might vary by Region. To see the differences applicable to the China Regions, see Getting Started with Amazon Web Services in China (PDF).

Monitoring Step Functions metrics using Amazon CloudWatch

Monitoring is an important part of maintaining the reliability, availability, and performance of Amazon Step Functions and your Amazon solutions. You should collect as much monitoring data from the Amazon services that you use so that you can debug multi-point failures.

Before you start monitoring Step Functions, you should create a monitoring plan that answers the following questions:

  • What are your monitoring goals?

  • What resources will you monitor?

  • How often will you monitor these resources?

  • What monitoring tools will you use?

  • Who will perform the monitoring tasks?

  • Who should be notified when something goes wrong?

The next step is to establish a baseline for normal performance in your environment. To do this, measure performance at various times and under different load conditions. As you monitor Step Functions, consider storing historical monitoring data. Such data can give you a baseline to compare against current performance data, to identify normal performance patterns and performance anomalies, and to devise ways to address issues.

For example, with Step Functions, you can monitor how many activities or Amazon Lambda tasks fail due to a heartbeat timeout. When performance falls outside your established baseline, you might have to change your heartbeat interval.

To establish a baseline you should, at a minimum, monitor the following metrics:

  • ActivitiesStarted

  • ActivitiesTimedOut

  • ExecutionsStarted

  • ExecutionsTimedOut

  • LambdaFunctionsStarted

  • LambdaFunctionsTimedOut