Monitoring Step Functions Using CloudWatch
Monitoring is an important part of maintaining the reliability, availability, and performance of Amazon Step Functions and your Amazon solutions. You should collect as much monitoring data as possible from the Amazon services that you use so that you can more easily debug any multi-point failures. Before you start monitoring Step Functions, you should create a monitoring plan that answers the following questions:
- What are your monitoring goals?
- What resources will you monitor?
- How often will you monitor these resources?
- What monitoring tools will you use?
- Who will perform the monitoring tasks?
- Who should be notified when something goes wrong?
The next step is to establish a baseline for normal Step Functions performance in your environment. To do this, measure performance at various times and under different load conditions. As you monitor Step Functions, consider storing historical monitoring data. Such data can give you a baseline to compare against current performance data, to identify normal performance patterns and performance anomalies, and to devise ways to address issues.
For example, with Step Functions, you can monitor how many activities or Amazon Lambda tasks fail due to a heartbeat timeout. When performance falls outside your established baseline, you might have to change your heartbeat interval.
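As a minimal sketch of the baseline comparison described above, the following Python function flags a current value that falls far from the historical mean. The function name `is_outside_baseline`, the three-standard-deviation tolerance, and the sample values are assumptions made for illustration; they are not part of Step Functions or CloudWatch.

```python
from statistics import mean, stdev


def is_outside_baseline(history, current, tolerance=3.0):
    """Return True when `current` deviates from the historical mean
    by more than `tolerance` sample standard deviations."""
    if len(history) < 2:
        raise ValueError("need at least two historical samples")
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        # A perfectly flat history: any different value is an anomaly.
        return current != baseline
    return abs(current - baseline) > tolerance * spread


# Hypothetical hourly sums of heartbeat timeouts collected earlier.
history = [2, 3, 1, 2, 4, 3, 2]
print(is_outside_baseline(history, 3))   # within the usual range
print(is_outside_baseline(history, 12))  # well above the usual range
```

If a check like this fires repeatedly for heartbeat timeouts, that may be a signal to revisit the heartbeat interval.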
To establish a baseline you should, at a minimum, monitor the following metrics:
- ActivitiesStarted
- ActivitiesTimedOut
- ExecutionsStarted
- ExecutionsTimedOut
- LambdaFunctionsStarted
- LambdaFunctionsTimedOut
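One way to collect the baseline metrics listed above is a single CloudWatch `GetMetricData` request. The sketch below only builds the query list in the shape that API expects; the helper name `build_baseline_queries`, the five-minute period, and the example ARN are assumptions for illustration. Dimensions are supplied by the caller because each metric family uses its own dimension.

```python
BASELINE_METRICS = [
    "ActivitiesStarted",
    "ActivitiesTimedOut",
    "ExecutionsStarted",
    "ExecutionsTimedOut",
    "LambdaFunctionsStarted",
    "LambdaFunctionsTimedOut",
]


def build_baseline_queries(dimensions_by_metric=None, period_seconds=300):
    """Build one MetricDataQueries entry per baseline metric.

    `dimensions_by_metric` maps a metric name to its CloudWatch
    dimension list; metrics without an entry get no dimensions.
    """
    dimensions_by_metric = dimensions_by_metric or {}
    queries = []
    for index, name in enumerate(BASELINE_METRICS):
        queries.append({
            "Id": f"m{index}",  # query ids must start with a lowercase letter
            "Label": name,
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/States",
                    "MetricName": name,
                    "Dimensions": dimensions_by_metric.get(name, []),
                },
                "Period": period_seconds,
                "Stat": "Sum",  # all six baseline metrics are counts
            },
            "ReturnData": True,
        })
    return queries


# Hypothetical ARN, used only for illustration.
arn = "arn:aws-cn:states:cn-north-1:123456789012:stateMachine:OrderFlow"
execution_dims = [{"Name": "StateMachineArn", "Value": arn}]
queries = build_baseline_queries({
    "ExecutionsStarted": execution_dims,
    "ExecutionsTimedOut": execution_dims,
})
print(len(queries))
```

With boto3 installed and credentials configured, the list could be passed as `MetricDataQueries` to the CloudWatch client's `get_metric_data` call, together with `StartTime` and `EndTime`.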
The following sections describe metrics that Step Functions provides to Amazon CloudWatch. You can use these metrics to track your state machines and activities and to set alarms on threshold values. You can view metrics using the Amazon Web Services Management Console.
Metrics That Report a Time Interval
Some of the Step Functions CloudWatch metrics are time intervals, always measured in milliseconds. These metrics have descriptive names and generally correspond to stages of your execution for which you can set state machine, activity, and Lambda function timeouts.
For example, the ActivityRunTime metric measures the time it takes for an activity to complete after it begins to execute. You can set a timeout value for the same time period.
In the CloudWatch console, you can get the best results if you choose average as the display statistic for time interval metrics.
Metrics That Report a Count
Some of the Step Functions CloudWatch metrics report results as a count. For example, ExecutionsFailed records the number of failed state machine executions.
In the CloudWatch console, you can get the best results if you choose sum as the display statistic for count metrics.
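The same choice of statistic applies when you query metrics programmatically. The sketch below picks Average for time interval metrics and Sum for count metrics, then builds the parameters for a CloudWatch `GetMetricStatistics` request. The name-suffix rule in `display_statistic` is a heuristic assumed here for illustration; it covers only the interval and count metrics and should not be relied on for other metric types.

```python
from datetime import datetime, timedelta, timezone


def display_statistic(metric_name):
    """Heuristic (an assumption): names ending in 'Time' are intervals,
    everything else is treated as a count."""
    return "Average" if metric_name.endswith("Time") else "Sum"


def statistics_request(metric_name, dimensions, hours=1, period_seconds=300):
    """Build keyword arguments for a GetMetricStatistics request."""
    end = datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/States",
        "MetricName": metric_name,
        "Dimensions": dimensions,
        "StartTime": end - timedelta(hours=hours),
        "EndTime": end,
        "Period": period_seconds,
        "Statistics": [display_statistic(metric_name)],
    }


print(display_statistic("ActivityRunTime"))
print(display_statistic("ExecutionsFailed"))
```

With boto3, the returned dictionary could be unpacked into the CloudWatch client's `get_metric_statistics` call.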
Execution Metrics
The AWS/States namespace includes the following metrics for Step Functions executions.
Metric | Description |
---|---|
ExecutionTime | The interval, in milliseconds, between the time the execution starts and the time it closes. |
ExecutionThrottled | The number of StateEntered events and retries that have been throttled. This is related to StateTransition throttling. For more information, see Quotas related to state throttling. |
ExecutionsAborted | The number of aborted or terminated executions. |
ExecutionsFailed | The number of failed executions. |
ExecutionsStarted | The number of started executions. |
ExecutionsSucceeded | The number of successfully completed executions. |
ExecutionsTimedOut | The number of executions that time out for any reason. |
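The count metrics in this table can be combined into outcome rates. The following sketch assumes that every closed execution is counted in exactly one of ExecutionsSucceeded, ExecutionsFailed, ExecutionsAborted, or ExecutionsTimedOut over the same period; that assumption, the function name, and the sample numbers are illustrative only.

```python
def execution_outcome_rates(succeeded, failed, aborted, timed_out):
    """Return each outcome as a fraction of all closed executions.

    Inputs are Sum statistics for the same time window.
    """
    closed = succeeded + failed + aborted + timed_out
    if closed == 0:
        return {}  # nothing closed in this window, so no rates to report
    return {
        "succeeded": succeeded / closed,
        "failed": failed / closed,
        "aborted": aborted / closed,
        "timed_out": timed_out / closed,
    }


# Hypothetical sums for a one-hour window.
rates = execution_outcome_rates(succeeded=90, failed=5, aborted=2, timed_out=3)
print(rates)
```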
Execution Metrics for Express Workflows
The AWS/States namespace includes the following metrics for Step Functions Express Workflows' executions.
Metric | Description |
---|---|
ExpressExecutionMemory | The total memory consumed by an Express Workflow. |
ExpressExecutionBilledDuration | The duration for which an Express Workflow is charged. |
ExpressExecutionBilledMemory | The amount of consumed memory for which an Express Workflow is charged. |
Dimension for Step Functions Execution Metrics
Dimension | Description |
---|---|
StateMachineArn | The Amazon Resource Name (ARN) of the state machine for the execution in question. |
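To show how the StateMachineArn dimension scopes a metric to one state machine, the sketch below builds the parameters for a CloudWatch alarm on ExecutionsFailed. The helper name, alarm naming scheme, threshold, and example ARN are assumptions chosen for illustration; the keys are those accepted by the CloudWatch `PutMetricAlarm` API.

```python
def failed_executions_alarm(state_machine_arn, threshold=1, period_seconds=300):
    """Build PutMetricAlarm parameters for failed executions
    of a single state machine."""
    # The last ARN segment is the state machine name.
    name = state_machine_arn.rsplit(":", 1)[-1]
    return {
        "AlarmName": f"sfn-executions-failed-{name}",
        "Namespace": "AWS/States",
        "MetricName": "ExecutionsFailed",
        "Dimensions": [
            {"Name": "StateMachineArn", "Value": state_machine_arn},
        ],
        "Statistic": "Sum",  # count metric, so sum over the period
        "Period": period_seconds,
        "EvaluationPeriods": 1,
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        # No data points means no failures were reported.
        "TreatMissingData": "notBreaching",
    }


# Hypothetical ARN, used only for illustration.
params = failed_executions_alarm(
    "arn:aws-cn:states:cn-north-1:123456789012:stateMachine:OrderFlow"
)
print(params["AlarmName"])
```

With boto3, these parameters could be unpacked into the CloudWatch client's `put_metric_alarm` call, typically with an `AlarmActions` entry added for notification.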
Activity Metrics
The AWS/States namespace includes the following metrics for Step Functions activities.
Metric | Description |
---|---|
ActivityRunTime | The interval, in milliseconds, between the time the activity starts and the time it closes. |
ActivityScheduleTime | The interval, in milliseconds, for which the activity stays in the schedule state. |
ActivityTime | The interval, in milliseconds, between the time the activity is scheduled and the time it closes. |
ActivitiesFailed | The number of failed activities. |
ActivitiesHeartbeatTimedOut | The number of activities that time out due to a heartbeat timeout. |
ActivitiesScheduled | The number of scheduled activities. |
ActivitiesStarted | The number of started activities. |
ActivitiesSucceeded | The number of successfully completed activities. |
ActivitiesTimedOut | The number of activities that time out on close. |
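The three interval metrics above can be read as segments of one timeline. The sketch below derives them from three timestamps, assuming that the schedule state lasts from the moment the activity is scheduled until it starts. That reading of the descriptions, the function name, and the timestamps are assumptions for illustration, not a statement of how the service computes the metrics.

```python
from datetime import datetime, timedelta, timezone

ONE_MS = timedelta(milliseconds=1)


def activity_intervals_ms(scheduled, started, closed):
    """Split one activity's lifetime into the three interval metrics."""
    return {
        "ActivityScheduleTime": (started - scheduled) // ONE_MS,
        "ActivityRunTime": (closed - started) // ONE_MS,
        "ActivityTime": (closed - scheduled) // ONE_MS,
    }


# Hypothetical timestamps for a single activity.
t0 = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
intervals = activity_intervals_ms(
    scheduled=t0,
    started=t0 + timedelta(milliseconds=1500),
    closed=t0 + timedelta(milliseconds=4000),
)
print(intervals)
```

Under this assumption, ActivityTime equals the sum of ActivityScheduleTime and ActivityRunTime, which can help show whether time is spent waiting for a worker or doing the work.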
Dimension for Step Functions Activity Metrics
Dimension | Description |
---|---|
|
The ARN of the activity. |
Lambda Function Metrics
The AWS/States namespace includes the following metrics for Step Functions Lambda functions.
Metric | Description |
---|---|
LambdaFunctionRunTime | The interval, in milliseconds, between the time the Lambda function starts and the time it closes. |
LambdaFunctionScheduleTime | The interval, in milliseconds, for which the Lambda function stays in the schedule state. |
LambdaFunctionTime | The interval, in milliseconds, between the time the Lambda function is scheduled and the time it closes. |
LambdaFunctionsFailed | The number of failed Lambda functions. |
LambdaFunctionsScheduled | The number of scheduled Lambda functions. |
LambdaFunctionsStarted | The number of started Lambda functions. |
LambdaFunctionsSucceeded | The number of successfully completed Lambda functions. |
LambdaFunctionsTimedOut | The number of Lambda functions that time out on close. |
Dimension for Step Functions Lambda Function Metrics
Dimension | Description |
---|---|
|
The ARN of the Lambda function. |
Note
Lambda Function Metrics are emitted for Task states that specify the Lambda function ARN in the Resource field. Task states that use "Resource": "arn:aws:states:::lambda:invoke" emit Service Integration Metrics instead. For more information, see Invoke Lambda with Step Functions.
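The distinction in this note can be checked mechanically when you review a state machine definition. The sketch below inspects a Task state's Resource value and reports which metric family to look at. The function name and the parsing rules are assumptions for illustration; in particular, mapping activity ARNs to Activity Metrics is inferred from the ARN format and is not stated in the note.

```python
def metric_family_for_resource(resource):
    """Guess which Step Functions metric family a Task Resource emits."""
    parts = resource.split(":")
    if len(parts) < 6 or parts[0] != "arn":
        raise ValueError(f"not an ARN: {resource!r}")
    service, region, account, kind = parts[2], parts[3], parts[4], parts[5]
    if service == "lambda":
        # Resource is the Lambda function ARN itself.
        return "Lambda Function Metrics"
    if service == "states" and region == "" and account == "":
        # Integration ARNs such as arn:aws:states:::lambda:invoke
        # leave the region and account fields empty.
        return "Service Integration Metrics"
    if service == "states" and kind == "activity":
        return "Activity Metrics"  # assumption, see lead-in
    raise ValueError(f"unrecognized Task resource: {resource!r}")


# Hypothetical function ARN, then the integration ARN from the note.
print(metric_family_for_resource(
    "arn:aws:lambda:us-east-1:123456789012:function:my-fn"))
print(metric_family_for_resource("arn:aws:states:::lambda:invoke"))
```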
Service Integration Metrics
The AWS/States namespace includes the following metrics for Step Functions service integrations. For more information, see Using Amazon Step Functions with other services.
Metric | Description |
---|---|
ServiceIntegrationRunTime | The interval, in milliseconds, between the time the Service Task starts and the time it closes. |
ServiceIntegrationScheduleTime | The interval, in milliseconds, for which the Service Task stays in the schedule state. |
ServiceIntegrationTime | The interval, in milliseconds, between the time the Service Task is scheduled and the time it closes. |
ServiceIntegrationsFailed | The number of failed Service Tasks. |
ServiceIntegrationsScheduled | The number of scheduled Service Tasks. |
ServiceIntegrationsStarted | The number of started Service Tasks. |
ServiceIntegrationsSucceeded | The number of successfully completed Service Tasks. |
ServiceIntegrationsTimedOut | The number of Service Tasks that time out on close. |
Dimension for Step Functions Service Integration Metrics
Dimension | Description |
---|---|
|
The resource ARN of the integrated service. |
Service Metrics
The AWS/States namespace includes the following metrics for the Step Functions service.
Metric | Description |
---|---|
ThrottledEvents | The count of requests that have been throttled. |
ProvisionedBucketSize | The count of available requests per second. |
ProvisionedRefillRate | The count of requests per second that are allowed into the bucket. |
ConsumedCapacity | The count of requests per second. |
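The bucket size and refill rate in this table suggest a token bucket model of throttling. The sketch below is a generic token bucket simulation, offered only to build intuition for how these four metrics might relate; the class, its one-second tick, and the sample traffic are assumptions, and it does not reproduce the actual Step Functions implementation.

```python
class TokenBucket:
    """Generic token bucket advanced in one-second ticks."""

    def __init__(self, bucket_size, refill_rate):
        self.bucket_size = bucket_size  # compare: ProvisionedBucketSize
        self.refill_rate = refill_rate  # compare: ProvisionedRefillRate
        self.tokens = bucket_size       # the bucket starts full
        self.throttled = 0              # compare: ThrottledEvents

    def tick(self, requests):
        """Refill for one second, then serve as many requests as
        tokens allow. Returns the number served (compare:
        ConsumedCapacity)."""
        self.tokens = min(self.bucket_size, self.tokens + self.refill_rate)
        served = min(requests, self.tokens)
        self.tokens -= served
        self.throttled += requests - served
        return served


# Hypothetical traffic: eight requests per second for three seconds.
bucket = TokenBucket(bucket_size=10, refill_rate=2)
served = [bucket.tick(8) for _ in range(3)]
print(served, bucket.throttled)
```

In this model a burst is absorbed while tokens remain, after which sustained traffic is limited to the refill rate and the excess is throttled.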
Dimension for Step Functions Service Metrics
Dimension | Description |
---|---|
|
Filters data to show State Transitions metrics. |
API Metrics
The AWS/States namespace includes the following metrics for the Step Functions API.
Metric | Description |
---|---|
ThrottledEvents | The count of requests that have been throttled. |
ProvisionedBucketSize | The count of available requests per second. |
ProvisionedRefillRate | The count of requests per second that are allowed into the bucket. |
ConsumedCapacity | The count of requests per second. |
Dimension for Step Functions API Metrics
Dimension | Description |
---|---|
|
Filters data to an API of the specified API name. |