Service level objectives (SLOs)
You can use Application Signals to create service level objectives for the services for your critical business operations. By creating SLOs on these services, you will be able to track them on the SLO dashboard, giving you an at-a-glance view of your most important operations.
In addition to creating a quick view your operators can use to see the current status of critical operations, you can use SLOs to track the longer-term performance of your services, to ensure that they are meeting your expectations. If you have service level agreements with customers, SLOs are a great tool to ensure that they are met.
Assessing your services' health with SLOs starts with setting clear, measurable objectives based on key performance metrics, called service level indicators (SLIs). An SLO tracks the SLI performance against the threshold and goal that you set, and reports how far or how close your application performance is to the threshold.
Application Signals helps you set SLOs on your key performance metrics. Application Signals automatically collects Latency and Availability metrics for every service and operation that it discovers, and these metrics are often ideal to use as SLIs. With the SLO creation wizard, you can use these metrics for your SLOs. You can then track the status of all of your SLOs with the Application Signals dashboards.
You can set SLOs on specific operations that your service calls or uses. You can use any CloudWatch metric or metric expression as an SLI, in addition to using the Latency and Availability metrics.
Creating SLOs is very important for getting the most benefit from CloudWatch Application Signals. After you create SLOs, you can view their status in the Application Signals console to quickly see which of your critical services and operations are performing well and which are unhealthy. Having SLOs to track provides the following major benefits:
It is easier for your service operators to see the current operational health of critical services as measured against the SLI. Then they can quickly triage and identify unhealthy services and operations.
You can track your service performance against measurable business goals over longer periods of time.
By choosing what to set SLOs on, you are prioritizing what is important to you. The Application Signals dashboards automatically present information about what you have prioritized.
When you create an SLO, you can also choose to create CloudWatch alarms at the same time to monitor the SLOs. You can set alarms that monitor for breaches of the threshold, and also for warning levels. These alarms can automatically notify you if the SLO metrics are breaching the threshold that you set, or if they are nearing a warning threshold. For example, an SLO nearing its warning threshold can let you know that your team might need to slow down churn in the application to make sure that long-term performance goals are met.
SLO concepts
An SLO includes the following components:
A service level indicator (SLI), which is a key performance metric that you specify. It represents the desired level of performance for your application. Application Signals automatically collects the key metrics Latency and Availability for the services and operations that it discovers, and these can often be ideal metrics to set SLOs for.
You choose the threshold to use for your SLI. For example, 200 ms for latency.
A goal or attainment goal, which is the percentage of time or requests that the SLI is expected to meet the threshold over each time interval. The time intervals can be as short as hours or as long as a year.
Intervals can be either calendar intervals or rolling intervals.
Calendar intervals are aligned with the calendar, such as an SLO that is tracked per month. CloudWatch automatically adjusts health, budget, and attainment numbers based on the number of days in a month. Calendar intervals are better suited for business goals that are measured on a calendar-aligned basis.
Rolling intervals are calculated on a rolling basis. Rolling intervals are better suited for tracking recent user experience of your application.
The period is a shorter length of time, and many periods make up an interval. The application's performance is compared to the SLI during each period within the interval. For each period, the application is determined to have either achieved or not achieved the necessary performance.
For example, a goal of 99% with a calendar interval of one day and a period of 1 minute means that the application must meet or achieve the success threshold during 99% of the 1-minute periods during the day. If it does, then the SLO is met for that day. The next day is a new evaluation interval, and the application must meet or achieve the success threshold during 99% of the 1-minute periods during the second day to meet the SLO for that second day.
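The example above can be checked with a few lines of arithmetic. The following is a minimal sketch, not part of Application Signals; the function names `periods_in_interval` and `slo_met` are hypothetical and exist only for illustration.

```python
def periods_in_interval(interval_minutes: int, period_minutes: int) -> int:
    """Number of whole periods that fit in one evaluation interval."""
    return interval_minutes // period_minutes


def slo_met(healthy_periods: int, total_periods: int, goal_percent: float) -> bool:
    """True when the share of healthy periods reaches the attainment goal."""
    # Cross-multiplied form of: healthy_periods / total_periods >= goal_percent / 100
    return healthy_periods * 100 >= goal_percent * total_periods


# One calendar day evaluated in 1-minute periods, as in the example above.
total = periods_in_interval(interval_minutes=24 * 60, period_minutes=1)
print(total)                        # 1440 periods in the day
print(slo_met(1426, total, 99.0))   # 14 unhealthy minutes: goal still met
print(slo_met(1425, total, 99.0))   # 15 unhealthy minutes: goal missed
```

Under these assumptions, a 99% goal over one day tolerates at most 14 unhealthy 1-minute periods.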
An SLI can be based on one of the new standard application metrics collected by Application Signals. Alternatively, it can be any CloudWatch metric or metric expression. The standard application metrics that you can use for an SLI are Latency and Availability. Availability represents the successful responses divided by the total requests. It is calculated as (1 - Fault Rate)*100, where Fault responses are 5XX errors. Success responses are responses without a 5XX error. 4XX responses are treated as successful.
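As an illustration of the Availability formula, the following sketch computes (1 - Fault Rate)*100 from a list of HTTP status codes. It is not how Application Signals collects the metric; the helper names and the sample data are assumptions made for the example.

```python
def count_faults(status_codes):
    """Count fault responses: only 5XX status codes are faults."""
    return sum(1 for code in status_codes if 500 <= code <= 599)


def availability_percent(status_codes):
    """Availability = (1 - fault rate) * 100; 4XX responses count as successful."""
    if not status_codes:
        raise ValueError("availability is undefined with no requests")
    fault_rate = count_faults(status_codes) / len(status_codes)
    return (1 - fault_rate) * 100


# Hypothetical sample: 95 OK responses, 3 client errors, 2 server faults.
sample = [200] * 95 + [404] * 3 + [500, 503]
print(count_faults(sample))                    # 2 faults; the 404s are not counted
print(round(availability_percent(sample), 2))  # 98.0
```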
Calculate error budget and attainment for period-based SLOs
When you view information about an SLO, you see its current health status and its error budget. The error budget is the amount of time within the interval that can breach the threshold but still let the SLO be met. The total error budget is the total amount of breaching time that can be tolerated through the entire interval. The remaining error budget is the remaining amount of breaching time that can be tolerated during the current interval. This is after the amount of breaching time that has already happened has been subtracted from the total error budget.
The following figure illustrates the attainment and error budget concepts for a goal with a 30-day interval, 1-minute periods, and a 99% attainment goal. 30 days includes 43,200 1-minute periods. 99% of 43,200 is 42,768, so 42,768 minutes during the month must be healthy for the SLO to be met. So far in the current interval, 130 of the 1-minute periods were unhealthy.
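The numbers in this example can be reproduced with the sketch below. It assumes the total error budget is the number of periods in the interval minus the number that must be healthy; the function name `period_error_budget` and the returned field names are hypothetical.

```python
import math


def period_error_budget(interval_days, period_minutes, goal_percent, unhealthy_periods):
    """Return period counts and error budget for a period-based SLO."""
    total_periods = interval_days * 24 * 60 // period_minutes
    # Round before taking the ceiling to avoid floating-point noise.
    required_healthy = math.ceil(round(total_periods * goal_percent / 100, 9))
    total_budget = total_periods - required_healthy
    return {
        "total_periods": total_periods,
        "required_healthy": required_healthy,
        "total_budget": total_budget,
        "remaining_budget": total_budget - unhealthy_periods,
    }


# 30-day interval, 1-minute periods, 99% goal, 130 unhealthy periods so far.
budget = period_error_budget(30, 1, 99.0, 130)
print(budget)
```

With these inputs the total error budget is 432 one-minute periods, and 302 remain after the 130 unhealthy periods are subtracted.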
Determine success within each period
Within each period, the SLI data is aggregated into a single data point based on the statistic used for the SLI. This data point represents the entire length of the period. That single data point is compared to the SLI threshold to determine if the period is healthy. Seeing unhealthy periods during the current time range on the dashboard can alert your service operators that the service needs to be triaged.
If the period is determined to be unhealthy, the entire length of the period is counted as failed against the error budget. Tracking the error budget lets you know whether the service is achieving the performance that you want over a longer period of time.
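The per-period evaluation described above can be sketched as follows. The choice of the average as the statistic, the less-than-or-equal comparison, and the sample latencies are all assumptions for illustration; an actual SLO uses whatever statistic, comparison operator, and threshold you configure.

```python
from statistics import mean


def period_is_healthy(samples_ms, threshold_ms, statistic=mean):
    """Aggregate one period's samples into a single data point and compare it."""
    data_point = statistic(samples_ms)
    return data_point <= threshold_ms


# Three hypothetical 1-minute periods of latency samples, in milliseconds.
periods = [
    [120, 180, 150],   # average 150 ms
    [190, 450, 260],   # average 300 ms
    [90, 110],         # average 100 ms
]
health = [period_is_healthy(p, threshold_ms=200) for p in periods]
failed_minutes = health.count(False) * 1  # each unhealthy period counts in full

print(health)
print(failed_minutes)
```

Note that the second period is counted as one full failed minute even though only some of its samples exceeded the threshold.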
Calculate error budget and attainment for request-based SLOs
After you have created an SLO, you can retrieve error budget reports for it. An error budget is the number of requests that can be non-compliant with the SLO's goal while your application still meets the goal. For a request-based SLO, the remaining error budget is dynamic and can increase or decrease, depending on the ratio of good requests to total requests.
The following table illustrates the calculation for a request-based SLO with an interval of 5 days and 85% attainment goal. In this example, we assume there is no traffic before Day 1. The SLO did not meet the goal on Day 10.
Time | Total requests | Bad requests | Cumulative total requests in last 5 days | Cumulative good requests in last 5 days | Request-based attainment | Total budget requests | Remaining budget requests |
---|---|---|---|---|---|---|---|
Day 1 | 10 | 1 | 10 | 9 | 9/10 = 90% | 1.5 | 0.5 |
Day 2 | 5 | 1 | 15 | 13 | 13/15 = 86% | 2.3 | 0.3 |
Day 3 | 1 | 1 | 16 | 13 | 13/16 = 81% | 2.4 | -0.6 |
Day 4 | 24 | 0 | 40 | 37 | 37/40 = 92% | 6.0 | 3.0 |
Day 5 | 20 | 5 | 60 | 52 | 52/60 = 87% | 9.0 | 1.0 |
Day 6 | 6 | 2 | 56 | 47 | 47/56 = 84% | 8.4 | -0.6 |
Day 7 | 10 | 3 | 61 | 50 | 50/61 = 82% | 9.2 | -1.8 |
Day 8 | 15 | 6 | 75 | 59 | 59/75 = 79% | 11.3 | -4.7 |
Day 9 | 12 | 1 | 63 | 46 | 46/63 = 73% | 9.5 | -7.5 |
Day 10 | 14 | 5 | 57 | 40 | 40/57 = 70% | 8.5 | -8.5 |
Final attainment for last 5 days | | | | | 70% | | |
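The table can be reproduced with a short rolling-window calculation. This is a sketch that assumes the total budget is the cumulative total requests multiplied by (100% - goal) and that the remaining budget is the total budget minus the bad requests in the window. The Day 10 daily counts (14 total, 5 bad) are inferred from the cumulative columns. The function and field names are hypothetical.

```python
# (total requests, bad requests) per day, Day 1 through Day 10.
DAILY = [(10, 1), (5, 1), (1, 1), (24, 0), (20, 5),
         (6, 2), (10, 3), (15, 6), (12, 1), (14, 5)]


def rolling_report(daily, window_days, goal_percent):
    """Compute request-based attainment and error budget for each day."""
    rows = []
    for i in range(len(daily)):
        window = daily[max(0, i - window_days + 1): i + 1]
        total = sum(t for t, _ in window)
        bad = sum(b for _, b in window)
        total_budget = total * (100 - goal_percent) / 100
        rows.append({
            "day": i + 1,
            "total": total,
            "good": total - bad,
            "attainment": (total - bad) / total * 100,
            "total_budget": total_budget,
            "remaining_budget": total_budget - bad,
        })
    return rows


report = rolling_report(DAILY, window_days=5, goal_percent=85)
for row in report:
    print(f"Day {row['day']:>2}: {row['good']}/{row['total']} "
          f"attainment={row['attainment']:.1f}% "
          f"remaining={row['remaining_budget']:.2f}")
```

The printed values are unrounded, so they can differ from the table in the last digit. A negative remaining budget corresponds to attainment below the 85% goal.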
Create an SLO
We recommend that you set both latency and availability SLOs on your critical applications. These metrics collected by Application Signals align with common business goals.
You can also set SLOs on any CloudWatch metric or any metric math expression that results in a single time series.
The first time that you create an SLO in your account, CloudWatch automatically creates the AWSServiceRoleForCloudWatchApplicationSignals service-linked role in your account, if it doesn't already exist. This service-linked role allows CloudWatch to collect CloudWatch Logs data, X-Ray trace data, CloudWatch metrics data, and tagging data from applications in your account. For more information about CloudWatch service-linked roles, see Using service-linked roles for CloudWatch.
When you create an SLO, you specify whether it is a period-based SLO or a request-based SLO. Each type of SLO has a different way of evaluating your application's performance against its attainment goal.
A period-based SLO uses defined periods of time within a specified total time interval. For each period of time, Application Signals determines whether the application met its goal. The attainment rate is calculated as the number of good periods/number of total periods. For example, for a period-based SLO, meeting an attainment goal of 99.9% means that within your interval, your application must meet its performance goal during at least 99.9% of the time periods.
A request-based SLO doesn't use pre-defined periods of time. Instead, the SLO measures number of good requests/number of total requests during the interval. At any time, you can find the ratio of good requests to total requests for the interval up to the time stamp that you specify, and measure that ratio against the goal set in your SLO.
Create a period-based SLO
Use the following procedure to create a period-based SLO.
To create a period-based SLO
Open the CloudWatch console at https://console.amazonaws.cn/cloudwatch/.
In the navigation pane, choose Service Level Objectives (SLO).
Choose Create SLO.
Enter a name for the SLO. Including the name of a service or operation, along with appropriate keywords such as latency or availability, will help you quickly identify what the SLO status indicates during triage.
For Set Service Level Indicator (SLI), do one of the following:
To set the SLO on either of the standard application metrics Latency or Availability:
Choose Service Operation.
Select the service that this SLO will monitor.
Select the operation that this SLO will monitor.
For Select a calculation method, choose Periods.
The Select service and Select operation drop-downs are populated by services and operations that have been active within the past 24 hours.
Choose either Availability or Latency and then set the threshold.
To set the SLO on any CloudWatch metric or a CloudWatch metric math expression:
Choose CloudWatch Metric.
Choose Select CloudWatch metric.
The Select metric screen appears. Use the Browse or Query tabs to find the metric you want, or create a metric math expression.
After you select the metric that you want, choose the Graphed metrics tab and select the Statistic and Period to use for the SLO. Then choose Select metric.
For more information about these screens, see Graph a metric and Add a math expression to a CloudWatch graph.
For Select a calculation method, choose Periods.
For Set condition, select a comparison operator and threshold for the SLO to use as the indicator of success.
If you selected Service Operation in step 5, you can optionally choose Additional settings and then adjust the period length for this SLO.
Set the interval and attainment goal for the SLO. For more information about intervals and attainment goals and how they work together, see SLO concepts.
(Optional) Set one or more CloudWatch alarms or a warning threshold for the SLO.
CloudWatch alarms can use Amazon SNS to proactively notify you if an application is unhealthy based on its SLI performance.
To create an alarm, select one of the alarm check boxes and enter or create the Amazon SNS topic to use for notifications when the alarm goes into ALARM state. For more information about CloudWatch alarms, see Using Amazon CloudWatch alarms. Creating alarms incurs charges. For more information about CloudWatch pricing, see Amazon CloudWatch Pricing. If you set a warning threshold, it appears in Application Signals screens to help you identify SLOs that are in danger of being unmet, even if they're currently healthy.
To set a warning threshold, enter the threshold value in Warning threshold. When the SLO's error budget is lower than the warning threshold, the SLO is marked with Warning in several Application Signals screens. Warning thresholds also appear on error budget graphs. You can also create an SLO warning alarm that's based on the warning threshold.
To add tags to this SLO, choose the Tags tab and then choose Add new tag. Tags can help you manage, identify, organize, search for, and filter resources. For more information about tagging, see Tagging your Amazon resources.
Note
If the application this SLO is related to is registered in Amazon Service Catalog AppRegistry, you can use the awsApplication tag to associate this SLO with that application in AppRegistry. For more information, see What is AppRegistry?
Choose Create SLO. If you also chose to create one or more alarms, the button name changes to reflect this.
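The console procedure above can also be expressed as an API request. The sketch below only builds a request body. The field names follow the CreateServiceLevelObjective API as best understood here and should be checked against the current API reference before use. The function name and the service, environment, and operation values are hypothetical, and the boto3 call in the comment is shown for orientation only.

```python
def build_period_slo_request(name, service_name, environment, operation,
                             threshold_ms, goal_percent, rolling_days):
    """Build a request body for a period-based latency SLO (field names assumed)."""
    return {
        "Name": name,
        "SliConfig": {
            "SliMetricConfig": {
                # Identifies the service discovered by Application Signals.
                "KeyAttributes": {
                    "Type": "Service",
                    "Name": service_name,
                    "Environment": environment,
                },
                "OperationName": operation,
                "MetricType": "LATENCY",
                "PeriodSeconds": 60,  # 1-minute periods
            },
            "MetricThreshold": threshold_ms,
            "ComparisonOperator": "LessThanOrEqualTo",
        },
        "Goal": {
            "Interval": {
                "RollingInterval": {"DurationUnit": "DAY", "Duration": rolling_days}
            },
            "AttainmentGoal": goal_percent,
        },
    }


request = build_period_slo_request(
    name="checkout-latency",            # hypothetical SLO name
    service_name="checkout-service",    # hypothetical service
    environment="eks:prod/default",     # hypothetical environment
    operation="POST /checkout",         # hypothetical operation
    threshold_ms=200,
    goal_percent=99.9,
    rolling_days=30,
)
print(request["Goal"])

# Assumed usage, not executed here:
#   import boto3
#   boto3.client("application-signals").create_service_level_objective(**request)
```

Building the request as plain data keeps it easy to review and test before it is sent with real credentials.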
Create a request-based SLO
Use the following procedure to create a request-based SLO.
To create a request-based SLO
Open the CloudWatch console at https://console.amazonaws.cn/cloudwatch/.
In the navigation pane, choose Service Level Objectives (SLO).
Choose Create SLO.
Enter a name for the SLO. Including the name of a service or operation, along with appropriate keywords such as latency or availability, will help you quickly identify what the SLO status indicates during triage.
For Set Service Level Indicator (SLI), do one of the following:
To set the SLO on either of the standard application metrics Latency or Availability:
Choose Service Operation.
Select the service that this SLO will monitor.
Select the operation that this SLO will monitor.
For Select a calculation method, choose Requests.
The Select service and Select operation drop-downs are populated by services and operations that have been active within the past 24 hours.
Choose either Availability or Latency. If you choose Latency, set the threshold.
To set the SLO on any CloudWatch metric or a CloudWatch metric math expression:
Choose CloudWatch Metric.
For Define target requests, do the following:
Choose whether you want to measure Good Requests or Bad Requests.
Choose Select CloudWatch metric. This metric will be the numerator of the ratio of target requests to total requests. If you use a latency metric, use the Trimmed count (TC) statistics. If the threshold is 9 ms and you're using the less than (<) comparison operator, then use threshold TC (:threshold - 1). For more information about TC, see Syntax.
The Select metric screen appears. Use the Browse or Query tabs to find the metric you want, or create a metric math expression.
For Define total requests, choose the CloudWatch metric that you want to use for the source. This metric will be the denominator of the ratio of target requests to total requests.
The Select metric screen appears. Use the Browse or Query tabs to find the metric you want, or create a metric math expression.
After you select the metric that you want, choose the Graphed metrics tab and select the Statistic and Period to use for the SLO. Then choose Select metric.
If you use a latency metric which emits one data point per request, use the Sample count statistics to count the number of total requests.
For more information about these screens, see Graph a metric and Add a math expression to a CloudWatch graph.
Set the interval and attainment goal for the SLO. For more information about intervals and attainment goals and how they work together, see SLO concepts.
(Optional) Set one or more CloudWatch alarms or a warning threshold for the SLO.
CloudWatch alarms can use Amazon SNS to proactively notify you if an application is unhealthy based on its SLI performance.
To create an alarm, select one of the alarm check boxes and enter or create the Amazon SNS topic to use for notifications when the alarm goes into ALARM state. For more information about CloudWatch alarms, see Using Amazon CloudWatch alarms. Creating alarms incurs charges. For more information about CloudWatch pricing, see Amazon CloudWatch Pricing. If you set a warning threshold, it appears in Application Signals screens to help you identify SLOs that are in danger of being unmet, even if they're currently healthy.
To set a warning threshold, enter the threshold value in Warning threshold. When the SLO's error budget is lower than the warning threshold, the SLO is marked with Warning in several Application Signals screens. Warning thresholds also appear on error budget graphs. You can also create an SLO warning alarm that's based on the warning threshold.
To add tags to this SLO, choose the Tags tab and then choose Add new tag. Tags can help you manage, identify, organize, search for, and filter resources. For more information about tagging, see Tagging your Amazon resources.
Note
If the application this SLO is related to is registered in Amazon Service Catalog AppRegistry, you can use the awsApplication tag to associate this SLO with that application in AppRegistry. For more information, see What is AppRegistry?
Choose Create SLO. If you also chose to create one or more alarms, the button name changes to reflect this.
View and triage SLO status
You can quickly see the health of your SLOs using either the Service Level Objectives or the Services options in the CloudWatch console. The Services view provides an at-a-glance view of the ratio of unhealthy services, calculated based on SLOs that you have set. For more information about using the Services option, see Monitor the operational health of your applications with Application Signals.
The Service Level Objectives view provides a macro view of your organization. You can see the met and unmet SLOs as a whole. This gives you a view of how many of your services and operations are performing to your expectations over longer periods of time, according to the SLIs that you chose.
To view all of your SLOs using the Service Level Objectives view
Open the CloudWatch console at https://console.amazonaws.cn/cloudwatch/.
In the navigation pane, choose Service Level Objectives (SLO).
The Service Level Objectives (SLO) list appears.
You can quickly see the current status of your SLOs in the SLI status column. To sort the SLOs so that all the unhealthy ones are at the top of the list, choose the SLI status column until the unhealthy SLOs are all at the top.
The SLO table has the following default columns. You can adjust which columns are displayed by choosing the gear icon above the list. For more information about goals, SLIs, attainment, and intervals, see SLO concepts.
The name of the SLO.
The Goal column displays the percentage of periods during each interval that must successfully meet the SLI threshold for the SLO goal to be met. It also displays the interval length for the SLO.
The SLI status displays whether the current operational state of the application is healthy or not. If any period during the currently selected time range was unhealthy for the SLO, then the SLI status displays Unhealthy.
The Ending attainment is the attainment level achieved as of the end of the selected time range. Sort by this column to see the SLOs that are most in danger of not being met.
The Attainment delta is the difference in attainment level between the start and end of the selected time range. A negative delta means that the metric is trending in a downward direction. Sort by this column to see the latest trends of the SLOs.
The Ending error budget (%) is the percentage of total time in the interval that can have unhealthy periods and still have the SLO be achieved successfully. For example, if this value is 5%, and the SLI is unhealthy in 5% or fewer of the remaining periods in the interval, the SLO is still achieved successfully.
The Error budget delta is the difference in error budget between the start and end of the selected time range. A negative delta means that the metric is trending in a failing direction.
The Ending error budget (time) is the amount of actual time in the interval that can be unhealthy and still have the SLO be achieved successfully. For example, if this is 14 minutes, then if the SLI is unhealthy for fewer than 14 minutes during the remaining interval, the SLO will still be achieved successfully.
The Ending error budget (requests) is the number of requests in the interval that can be unhealthy and still have the SLO be achieved successfully. For request-based SLOs, this value is dynamic and can fluctuate as the cumulative total number of requests changes over time.
The Service, Operation, and Type columns display information about what service and operation this SLO is set for.
To see the attainment and error budget graphs for an SLO, choose the radio button next to the SLO name.
The graphs at the top of the page display the SLO attainment and Error budget status. A graph about the SLI metric associated with this SLO is also displayed.
To further triage an SLO that is not meeting its goal, choose the service name or operation name associated with that SLO. You are taken to the details page where you can triage further. For more information, see View detailed service activity and operational health with the service detail page.
To change the time range of the charts and tables on the page, choose a new time range near the top of the screen.
Edit an existing SLO
Follow these steps to edit an existing SLO. When you edit an SLO, you can change only the threshold, interval, attainment goal, and tags. To change other aspects such as service, operation, or metric, create a new SLO instead of editing an existing one.
Changing part of an SLO core configuration, such as period or threshold, invalidates all the previous data points and assessments about attainment and health. It effectively deletes and re-creates the SLO.
Note
If you edit an SLO, alarms associated with that SLO are not automatically updated. You might need to update the alarms to keep them in sync with the SLO.
To edit an existing SLO
Open the CloudWatch console at https://console.amazonaws.cn/cloudwatch/.
In the navigation pane, choose Service Level Objectives (SLO).
Choose the radio button next to the SLO that you want to edit, and choose Actions, Edit SLO.
Make your changes, then choose Save changes.
Delete an SLO
Follow these steps to delete an existing SLO.
Note
When you delete an SLO, alarms associated with that SLO are not automatically deleted. You'll need to delete them yourself. For more information, see Managing alarms.
To delete an SLO
Open the CloudWatch console at https://console.amazonaws.cn/cloudwatch/.
In the navigation pane, choose Service Level Objectives (SLO).
Choose the radio button next to the SLO that you want to delete, and choose Actions, Delete SLO.
Choose Confirm.