Amazon SWF Metrics for CloudWatch
Amazon SWF now provides metrics for CloudWatch that you can use to track your workflows and activities and set alarms on threshold values that you choose. You can view metrics using the Amazon Web Services Management Console. For more information, see Viewing Amazon SWF Metrics for CloudWatch using the Amazon Web Services Management Console.
Reporting Units for Amazon SWF Metrics
Metrics that Report a Time Interval
Some of the Amazon SWF metrics for CloudWatch are time intervals, always measured in milliseconds. The CloudWatch unit is reported as Time. These metrics generally correspond to stages of your workflow execution for which you can set workflow and activity timeouts, and have similar names.
For example, the DecisionTaskStartToCloseTime metric measures the time it took for the decision task to complete after it began executing, which is the same time period for which you can set a DecisionTaskStartToCloseTimeout value.
For a diagram of each of these workflow stages and to learn when they occur over the workflow and activity lifecycles, see Amazon SWF Timeout Types.
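Because these metrics are reported in milliseconds, while Amazon SWF timeout values are specified in seconds, comparing a metric with its matching timeout needs a unit conversion. The following is a minimal sketch of that comparison. The function name `timeout_fraction_used` and the sample values are hypothetical and are not part of Amazon SWF.

```python
def timeout_fraction_used(metric_ms: float, timeout_seconds: float) -> float:
    """Return the share of a timeout consumed by a measured interval.

    metric_ms is a time-interval metric value in milliseconds;
    timeout_seconds is the configured timeout in seconds.
    """
    if timeout_seconds <= 0:
        raise ValueError("timeout_seconds must be positive")
    return metric_ms / (timeout_seconds * 1000)


# Hypothetical values: a DecisionTaskStartToCloseTime datapoint of 4500 ms
# measured against a DecisionTaskStartToCloseTimeout of 10 seconds.
fraction = timeout_fraction_used(4500, 10)
print(f"{fraction:.0%} of the timeout was used")
```

A value that stays close to 1.0 suggests that tasks are finishing only just inside the timeout.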
Metrics that Report a Count
Some of the Amazon SWF metrics for CloudWatch report results as a count. For example, WorkflowsCanceled records a result as either one or zero, indicating whether or not the workflow was canceled. A value of zero doesn't indicate that the metric was not reported, only that the condition described by the metric did not occur.
Some of the Amazon SWF metrics for CloudWatch that report a Count in CloudWatch are a count per second. For instance, ProvisionedRefillRate, which is reported as a Count in CloudWatch, represents a rate of the Count of requests per second.
For count metrics, minimum and maximum will always be either zero or one, but average will be a value ranging from zero to one.
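To see why the statistics behave this way, consider a period in which each datapoint is either one (the condition occurred) or zero (it did not). The following sketch uses made-up datapoints. The helper `summarize` is a hypothetical name for illustration, not a CloudWatch API.

```python
def summarize(datapoints):
    """Return (minimum, maximum, average, total) for a list of 0/1 datapoints."""
    if not datapoints:
        raise ValueError("no datapoints")
    total = sum(datapoints)
    return min(datapoints), max(datapoints), total / len(datapoints), total


# Hypothetical WorkflowsCanceled datapoints: 1 = canceled, 0 = not canceled.
samples = [0, 1, 0, 0, 1]
minimum, maximum, average, total = summarize(samples)
print(minimum, maximum, average, total)
```

The minimum and maximum can only be zero or one, the average is the fraction of datapoints for which the condition occurred, and the total is the number of occurrences.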
API and Decision Event Metrics
You can monitor both API and Decision events in CloudWatch to provide insight into your usage and capacity. See deciders in the How Amazon SWF Works section, and the Decision topic in the Amazon Simple Workflow Service API Reference.
You can also monitor these limits to alarm when you are approaching your Amazon SWF throttling limits. See Amazon SWF throttling quotas for a description of these limits and their default settings. These limits are designed to prevent incorrect workflows from consuming excessive system resources. To request an increase to your limits, see Requesting a quota increase.
As a best practice, you should configure CloudWatch alarms at around 60% of your API or decision events capacity. This will allow you to either adjust your workflow, or request a service limit increase, before Amazon SWF throttling is enabled. Depending on the burstiness of your traffic:
- If your traffic has significant spikes, set an alarm at 60% of your ProvisionedBucketSize limits.
- If your calls have a relatively steady rate, set an alarm at 60% of your ProvisionedRefillRate limit for your related API and decision events.
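The choice between the two limits can be sketched as a small calculation. This is an illustration only. The function name `alarm_threshold` and the sample limit values are hypothetical, so use the limits that apply to your own account.

```python
def alarm_threshold(bucket_size, refill_rate, bursty, percent=60):
    """Return an alarm threshold as a percentage of the relevant limit.

    bucket_size: the ProvisionedBucketSize limit
    refill_rate: the ProvisionedRefillRate limit (requests per second)
    bursty:      True if traffic has significant spikes, False if steady
    """
    limit = bucket_size if bursty else refill_rate
    return limit * percent / 100


# Hypothetical limits, for illustration only.
print(alarm_threshold(bucket_size=200, refill_rate=50, bursty=True))
print(alarm_threshold(bucket_size=200, refill_rate=50, bursty=False))
```

The resulting number is the value you would enter as the threshold when you create the CloudWatch alarm.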
Amazon SWF Metrics
The following metrics are available for Amazon SWF:
| Metric | Description |
|---|---|
| | The time interval, in milliseconds, between the time that the decision task was scheduled and when it was picked up by a worker and started. CloudWatch Units: Dimensions: Valid statistics: |
| | The time interval, in milliseconds, between the time that the decision task was started and when it closed. CloudWatch Units: Dimensions: Valid statistics: |
| | The count of decision tasks that have been completed. CloudWatch Units: Dimensions: Valid statistics: |
| PendingTasks | The count of pending tasks in a 1 minute interval for a specific Task List. CloudWatch Units: Dimensions: Valid statistics: |
| | The count of decision tasks that started but timed out on closing. CloudWatch Units: Dimensions: Valid statistics: |
| | The time, in milliseconds, between the time the workflow started and when it closed. CloudWatch Units: Dimensions: Valid statistics: |
| | The count of workflows that were canceled. CloudWatch Units: Dimensions: Valid statistics: |
| | The count of workflows that completed. CloudWatch Units: Dimensions: Valid statistics: |
| | The count of workflows that continued as new. CloudWatch Units: Dimensions: Valid statistics: |
| | The count of workflows that failed. CloudWatch Units: Dimensions: Valid statistics: |
| | The count of workflows that were terminated. CloudWatch Units: Dimensions: Valid statistics: |
| | The count of workflows that timed out, for any reason. CloudWatch Units: Dimensions: Valid statistics: |
| | The time interval, in milliseconds, between the time when the activity was scheduled and when it closed. CloudWatch Units: Dimensions: Valid statistics: |
| | The time interval, in milliseconds, between the time when the activity task was scheduled and when it started. CloudWatch Units: Dimensions: Valid statistics: |
| | The time interval, in milliseconds, between the time when the activity task started and when it closed. CloudWatch Units: Dimensions: Valid statistics: |
| | The count of activity tasks that were canceled. CloudWatch Units: Dimensions: Valid statistics: |
| | The count of activity tasks that completed. CloudWatch Units: Dimensions: Valid statistics: |
| | The count of activity tasks that failed. CloudWatch Units: Dimensions: Valid statistics: |
| | The count of activity tasks that were scheduled but timed out on close. CloudWatch Units: Dimensions: Valid statistics: |
| | The count of activity tasks that were scheduled but timed out on start. CloudWatch Units: Dimensions: Valid statistics: |
| | The count of activity tasks that were started but timed out on close. CloudWatch Units: Dimensions: Valid statistics: |
| | The count of activity tasks that were started but timed out due to a heartbeat timeout. CloudWatch Units: Dimensions: Valid statistics: |
| | The count of requests that have been throttled. CloudWatch Units: Dimensions: Valid statistics: |
| | The count of available requests per second. Dimensions: Valid statistics: |
| | The count of requests per second. CloudWatch Units: Dimensions: Valid statistics: |
| ConsumedLimit | The amount of general limit that has been consumed. Dimensions: |
| | The count of requests per second that are allowed into the bucket. Dimensions: Valid statistics: |
| ProvisionedLimit | The amount of general limit that is provisioned to the account. Dimensions: |
| Dimension | Description |
|---|---|
| | Filters data to the Amazon SWF domain that the workflow or activity is running in. |
| | Filters data to the name of the activity type. |
| | Filters data to the version of the activity type. |
| | Filters data to the name of the workflow type for this workflow execution. |
| | Filters data to the version of the workflow type for this workflow execution. |
| | Filters data to an API of the specified API name. |
| | Filters data to the specified Decision name. |
| | Filters data to the specified Task List name. |
| | Filters data to the classification of the task list. Value is "D" for Decision Task Lists and "A" for Activity Task Lists. |
| | Filters data to the specified throttling scope. Value is "Account" when exceeding account-level quota, or "Workflow" when exceeding workflow-level quota. |
Amazon SWF non-ASCII resource names and CloudWatch dimensions
Amazon SWF allows non-ASCII characters in resource names such as TaskList and DomainName. However, the dimension values of CloudWatch metrics can only contain printable ASCII characters. To ensure that Amazon SWF uses dimension values that are compatible with CloudWatch requirements, Amazon SWF resource names that do not meet these requirements are converted and will have a checksum appended as follows:
- Any non-ASCII character is replaced with ?.
- The input string or converted string will, if necessary, be truncated. This ensures that when the checksum is appended, the new string length will not exceed the CloudWatch maximum.
- Since any non-ASCII characters are converted to ?, some CloudWatch metric dimension values that were different before conversion may appear to be the same after conversion. To help differentiate between them, an underscore (_) followed by the first 16 characters of the SHA256 checksum of the original resource name is appended to the resource name.
Conversion examples:
- test àpple would be converted to test ?pple_82cc5b8e3a771d12.
- àòà would be converted to ???_2fec5edbb2c05c22.
- The TaskList names àpplé and âpplè would both be converted to ?ppl?, and would be identical. Appending the checksum returns distinct values, ?ppl?_f39a36df9d85a69d and ?ppl?_da3efb4f11dd0f7f.
Tip
You can generate your own SHA256 checksum. For example, to use the shasum command line tool:
echo -n "<the original resource name>" | shasum -a 256 | cut -c1-16
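The conversion steps described in this section can also be approximated in Python. This is a sketch under stated assumptions, not the Amazon SWF implementation. It assumes that the checksum is computed over the UTF-8 bytes of the original name, that names containing only ASCII characters are left unchanged, and that the maximum length is 1024 characters. The function name `to_dimension_value` and the `max_len` parameter are hypothetical. The sketch does not handle non-printable ASCII characters.

```python
import hashlib


def to_dimension_value(name: str, max_len: int = 1024) -> str:
    """Approximate the non-ASCII resource name conversion.

    max_len is an assumed maximum length for a dimension value.
    """
    if name.isascii():
        # Assumption: names that need no conversion get no checksum.
        return name
    # First 16 hex characters of the SHA256 checksum of the original name
    # (assumed to be UTF-8 encoded).
    checksum = hashlib.sha256(name.encode("utf-8")).hexdigest()[:16]
    # Replace every non-ASCII character with "?".
    converted = "".join(ch if ch.isascii() else "?" for ch in name)
    suffix = "_" + checksum
    # Truncate so that the result, including the suffix, fits in max_len.
    return converted[: max_len - len(suffix)] + suffix


print(to_dimension_value("àpplé"))
print(to_dimension_value("âpplè"))
```

Both names share the prefix `?ppl?_` after conversion, and the appended checksums keep the two values distinct. The exact checksum depends on how the original name is encoded, so compare the results with the values that appear in CloudWatch before relying on them.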