How zonal autoshift and practice runs work - Amazon Route 53 Application Recovery Controller

How zonal autoshift and practice runs work

The zonal autoshift capability in Amazon Route 53 Application Recovery Controller allows Amazon to shift traffic for a resource away from an Availability Zone, on your behalf, when Amazon determines that there's an impairment that could potentially affect customers in the Availability Zone. Zonal autoshift is designed for a resource that is pre-scaled in all Availability Zones in an Amazon Web Services Region, so that an application can operate normally with the loss of one Availability Zone.

With zonal autoshift, you are required to configure practice runs, where Route 53 ARC regularly shifts traffic for the resource away from one Availability Zone. Route 53 ARC schedules a practice run approximately once a week for each resource that has a practice run configuration associated with it. Practice runs for each resource are scheduled independently.

For each practice run, Route 53 ARC records an outcome. If a practice run is interrupted by a blocking condition, the practice run outcome is not marked as successful. For more information about practice run outcomes, see Outcomes for practice runs.

You can configure Amazon EventBridge notifications to send you information about autoshifts and practice runs. For more information, see Using zonal autoshift with Amazon EventBridge.


When Amazon starts and stops autoshifts

When you enable zonal autoshift for a resource, you authorize Amazon to shift away resource traffic for an application from an Availability Zone during events, on your behalf, to help reduce time to recovery.

To achieve this, zonal autoshift uses Amazon telemetry to detect, as early as possible, an Availability Zone impairment that could potentially impact customers. When Amazon starts an autoshift, traffic to configured resources immediately starts shifting away from the impaired Availability Zone.

Zonal autoshift is a capability designed for customers who have pre-scaled their application resources for all Availability Zones in an Amazon Web Services Region. You should not rely on scaling on demand when an autoshift or practice run starts.

Amazon ends an autoshift when it determines that the Availability Zone has recovered.

When Route 53 ARC schedules, starts, and ends practice runs

Route 53 ARC schedules a practice run for a resource weekly, for about 30 minutes. Route 53 ARC schedules, starts, and manages practice runs for each resource independently. Route 53 ARC does not batch together practice runs for resources in the same account.

When a practice run continues for the expected duration, without interruption, it is marked with an outcome of SUCCEEDED. There are several other possible outcomes: FAILED, INTERRUPTED, and PENDING. Outcome values and descriptions are included in the Outcomes for practice runs section.

There are some scenarios when Route 53 ARC interrupts a practice run and ends it. For example, if an autoshift starts during a practice run, Route 53 ARC interrupts the practice run and ends it. As another example, say that the resource has an adverse response to a practice run and causes an alarm that you've specified to monitor the practice run to go into an ALARM state. In this scenario, Route 53 ARC also interrupts the practice run and ends it.

In addition, there are several scenarios in which Route 53 ARC does not start a scheduled practice run for a resource.

In response to interrupted and blocked practice runs for a resource, Route 53 ARC does the following:

  • If a practice run for a resource is interrupted while it's in progress, Route 53 ARC considers the weekly practice run to be over, and schedules a new practice run for the resource for the next week. The weekly practice run outcome is INTERRUPTED in this scenario, not FAILED. The practice run outcome is set to FAILED only when the outcome alarm that monitors the practice run goes into an ALARM state during the practice run.

  • If there is a blocking constraint when a practice run for a resource is scheduled to be started, Route 53 ARC does not start the practice run. Route 53 ARC continues regular monitoring, to determine if there are still one or more blocking constraints. When there aren't any blocking constraints, Route 53 ARC starts the practice run for the resource.

The following are examples of blocking constraints that stop Route 53 ARC from starting, or continuing, a practice run for a resource:

  • Route 53 ARC does not start or continue practice runs when there is an Amazon Fault Injection Service experiment in progress. If an Amazon FIS event is active when Route 53 ARC has scheduled a practice run to start, Route 53 ARC does not start the practice run. Route 53 ARC monitors throughout practice runs for blocking constraints, including an Amazon FIS event. If an Amazon FIS event starts while a practice run is active, Route 53 ARC ends the practice run and doesn't attempt to start another one until the next regularly scheduled practice run for the resource.

  • If there is a current Amazon event in a Region, Route 53 ARC does not start practice runs for resources, and ends active practice runs, in the Region.

When a practice run finishes without being interrupted, Route 53 ARC schedules the next practice run in a week, as usual. If a practice run isn't started because of a blocking constraint, such as an Amazon FIS experiment or a blocked time window that you've specified, Route 53 ARC continues to attempt to start a practice run until the practice run can be started.
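The outcome rules described in this section can be summarized in a short sketch. The function and parameter names are hypothetical and are not part of any Route 53 ARC API; only the four outcome values come from this page. The meaning given to PENDING here (a run with no final outcome yet) is an assumption.

```python
from enum import Enum


class PracticeRunOutcome(Enum):
    PENDING = "PENDING"
    SUCCEEDED = "SUCCEEDED"
    INTERRUPTED = "INTERRUPTED"
    FAILED = "FAILED"


def classify_practice_run(finished, outcome_alarm_fired, interrupted):
    """Map the rules on this page to an outcome (hypothetical helper)."""
    # The outcome alarm entering ALARM during the run is the only path to FAILED.
    if outcome_alarm_fired:
        return PracticeRunOutcome.FAILED
    # Any other interruption (an autoshift, a blocking constraint) is INTERRUPTED.
    if interrupted:
        return PracticeRunOutcome.INTERRUPTED
    # The run lasted the expected duration without interruption.
    if finished:
        return PracticeRunOutcome.SUCCEEDED
    # Assumption: a run that has not finished has no final outcome yet.
    return PracticeRunOutcome.PENDING
```

For example, a run that is cut short because an autoshift starts would be classified as INTERRUPTED, not FAILED, because the outcome alarm did not fire.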

Notifications for practice runs and autoshifts

You can choose to be notified about practice runs and autoshifts for your resource by setting up Amazon EventBridge notifications. You can also set up EventBridge notifications when you haven't enabled zonal autoshift for any resources. This option is known as autoshift observer notification. With autoshift observer notification, you are notified about all autoshifts that Route 53 ARC starts when an Availability Zone is potentially impaired. Note that you must configure this option in each Amazon Web Services Region that you want to receive notifications about.

To see the steps for enabling autoshift observer notification, see Enabling and working with zonal autoshift. To learn more about notification options and how to configure them in EventBridge, see Using zonal autoshift with Amazon EventBridge.

Precedence for zonal shifts, practice runs, and autoshifts

There can be no more than one traffic shift for a resource that is in effect at once—that is, only one practice run zonal shift, customer-initiated zonal shift, or autoshift for the resource. When there is more than one traffic shift in progress, Route 53 ARC follows a precedence to determine which traffic shift is in effect for a resource.

The overall principle for precedence is that zonal shifts that you start as a customer take precedence over autoshifts, which take precedence over practice runs. That is, customer-initiated zonal shifts > autoshifts > practice run zonal shifts.

To illustrate this, the following is how precedence works for a few example scenarios:

  • If there is an active autoshift and you start a zonal shift for a resource that has autoshift enabled, the zonal shift that you start is APPLIED. The resource is now shifted away from the Availability Zone that the zonal shift applies to. If the zonal shift ends before Amazon ends the autoshift, then the autoshift becomes the APPLIED shift. So, the resource is shifted away from the Availability Zone where Amazon has the autoshift in progress.

  • If there's an active zonal shift that you've started for a resource that has autoshift enabled, and Amazon starts an autoshift, the autoshift exists for the resource. However, the zonal shift is set to APPLIED and the autoshift is set to NOT APPLIED until the zonal shift ends. Then, the status for the autoshift is updated to APPLIED and the autoshift shifts traffic away for the resource until Amazon ends the autoshift.

  • If there's an active practice run for a resource and you start a zonal shift for the resource that shifts traffic away from the same Availability Zone, the practice run is interrupted. If you start a zonal shift that shifts traffic away from a different Availability Zone, the practice run continues as usual.

  • If there's an active zonal shift for a resource and Route 53 ARC is scheduled to start a practice run, the practice run is deferred for an hour. Then Route 53 ARC attempts again to start the practice run. Route 53 ARC continues to check hourly until a practice run can be started.

The traffic shift that is currently in effect for the resource has an applied zonal shift status set to APPLIED. Only one shift is set to APPLIED at any time. Other shifts that are in progress for the resource are set to NOT APPLIED.
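As a rough illustration of the precedence order, the following sketch picks which of several in-progress shifts is applied. The `Shift` class, the `kind` labels, and the `applied_statuses` function are hypothetical names used only for this example. The sketch models the ordering alone (customer-initiated zonal shift, then autoshift, then practice run) and does not model how practice runs are interrupted or deferred.

```python
from dataclasses import dataclass

# Lower number = higher precedence, following the order stated on this page:
# customer-initiated zonal shift > autoshift > practice run zonal shift.
PRECEDENCE = {"ZONAL_SHIFT": 0, "AUTOSHIFT": 1, "PRACTICE_RUN": 2}


@dataclass
class Shift:
    kind: str       # one of the PRECEDENCE keys (hypothetical labels)
    away_from: str  # Availability Zone the shift moves traffic away from


def applied_statuses(shifts):
    """Return one status per shift; only the highest-precedence shift is APPLIED."""
    if not shifts:
        return []
    # Index of the shift with the highest precedence (first one wins a tie).
    winner = min(range(len(shifts)), key=lambda i: PRECEDENCE[shifts[i].kind])
    return ["APPLIED" if i == winner else "NOT_APPLIED" for i in range(len(shifts))]
```

With an autoshift and a customer-initiated zonal shift both in progress, the zonal shift is the applied one. When the zonal shift ends and only the autoshift remains, the autoshift becomes the applied shift, as in the first two scenarios above.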

Stopping an active autoshift or practice run for a resource

To stop an in-progress autoshift for a resource, disable zonal autoshift for the resource.

When you disable zonal autoshift, the practice run configuration for the resource is not affected. Regular practice runs still take place for the resource, on the same schedule. If you want to stop practice runs in addition to disabling autoshifts, you must delete the practice run configuration associated with the resource.

When you delete a practice run configuration, Amazon stops performing practice runs that shift traffic for the resource away from an Availability Zone each week. In addition, because zonal autoshift requires practice runs, when you delete a practice run configuration using the Route 53 ARC console, this action also disables zonal autoshift for the resource. However, note that if you use the zonal autoshift API to delete a practice run configuration, you must first disable zonal autoshift for the resource.

To stop an active practice run, cancel the practice run zonal shift. For more information, see Canceling a practice run zonal shift.
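When you use the API instead of the console, the order of operations matters. The following is a minimal sketch that assumes `client` is an AWS SDK for Python (boto3) client for the `arc-zonal-shift` service. The two helper function names are hypothetical, and the resource ARN and zonal shift ID are values that you supply.

```python
def stop_autoshift_and_practice_runs(client, resource_arn):
    """Disable zonal autoshift, then delete the practice run configuration."""
    # Step 1: disabling zonal autoshift also stops an in-progress autoshift.
    client.update_zonal_autoshift_configuration(
        resourceIdentifier=resource_arn,
        zonalAutoshiftStatus="DISABLED",
    )
    # Step 2: with the API, autoshift must already be disabled before this call.
    client.delete_practice_run_configuration(resourceIdentifier=resource_arn)


def cancel_practice_run(client, zonal_shift_id):
    """Cancel the zonal shift that backs an active practice run."""
    client.cancel_zonal_shift(zonalShiftId=zonal_shift_id)
```

With boto3 installed and credentials configured, the client could be created with `boto3.client("arc-zonal-shift")` in the Region where the resource is located. To disable autoshifts but keep practice runs, make only the first call.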

How traffic is shifted away

For autoshifts and for practice run zonal shifts, traffic is shifted away from an Availability Zone using the same mechanism that Route 53 ARC uses for customer-initiated zonal shifts. To shift traffic away from an Availability Zone for load balancers that have cross-zone load balancing turned off, Route 53 ARC sets the load balancer health check for the Availability Zone to unhealthy, so that it fails its health check. An unhealthy health check, in turn, results in Amazon Route 53 withdrawing the corresponding IP addresses for the resource from DNS, so that traffic is redirected from the Availability Zone. New connections are now routed to other Availability Zones in the Amazon Web Services Region instead.

With an autoshift, when an Availability Zone recovers and Amazon decides to end the autoshift, Route 53 ARC reverses the health check process, requesting the Route 53 health checks to be reverted. Then, the original zonal IP addresses are restored and, if the health checks continue to be healthy, the Availability Zone is included in the load balancer's routing again.

It's important to be aware that autoshifts are not based on health checks that monitor the underlying health of load balancers or applications. Route 53 ARC uses health checks to move traffic away from Availability Zones, by requesting health checks to be set to unhealthy, and then restores health checks to normal again when it ends an autoshift or zonal shift.

Alarms for practice runs

You can specify two CloudWatch alarms for practice runs in zonal autoshift. The first alarm, the outcome alarm, is required. You should configure the outcome alarm to monitor the health of your application when traffic is shifted away from an Availability Zone during each 30-minute practice run.

For a practice run to be effective, specify as an outcome alarm a CloudWatch alarm that monitors metrics for the resource, or your application, and that goes into an ALARM state when your application is adversely affected by the loss of one Availability Zone. For more information, see the Alarms that you specify for practice runs section in Best practices when you configure zonal autoshift.

The outcome alarm also provides information for the practice run result that Route 53 ARC reports for each practice run. If the alarm enters an ALARM state, the practice run is ended and the practice run outcome is returned as FAILED. If the practice run completes the 30-minute scheduled test period and the outcome alarm does not enter an ALARM state, the outcome is returned as SUCCEEDED. A list of all outcome values, with descriptions, is provided in the Outcomes for practice runs section.

Optionally, you can specify a second alarm, the blocking alarm. When the blocking alarm is in an ALARM state, it blocks practice run traffic shifts from starting and stops any practice run that is in progress.

For example, in a large architecture with multiple microservices, when one microservice is experiencing a problem, you typically want to stop all other changes in the application environment, which would include blocking practice runs.
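To show how the two alarms fit together, the following sketch builds the arguments for a practice run configuration request. It assumes the request shape of the `create_practice_run_configuration` operation in the AWS SDK for Python (boto3) `arc-zonal-shift` client. The helper name `practice_run_request` is hypothetical, and the alarm ARNs are values that you supply.

```python
def practice_run_request(resource_arn, outcome_alarm_arn, blocking_alarm_arn=None):
    """Build keyword arguments for a create_practice_run_configuration call."""
    request = {
        "resourceIdentifier": resource_arn,
        # The outcome alarm is required.
        "outcomeAlarms": [
            {"type": "CLOUDWATCH", "alarmIdentifier": outcome_alarm_arn}
        ],
    }
    # The blocking alarm is optional, so include it only when one is given.
    if blocking_alarm_arn is not None:
        request["blockingAlarms"] = [
            {"type": "CLOUDWATCH", "alarmIdentifier": blocking_alarm_arn}
        ]
    return request
```

The result could then be passed to the client as `client.create_practice_run_configuration(**request)`. Building the request separately makes it easy to inspect before you send it.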

Blocked dates and blocked windows (UTC)

You have the option to block practice runs for specific calendar dates, or for specific time windows, that is, days and times, in UTC.

For example, if you have an application update scheduled to launch on May 1, 2024, and you don't want practice runs to shift traffic away at that time, you could set a blocked date for 2024-05-01.

Or, say you run business report summaries three days a week. For this scenario, you might set the following recurring days and times as blocked windows, for example, in UTC: MON-20:30-21:30 WED-20:30-21:30 FRI-20:30-21:30.
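To check how blocked dates and blocked windows would apply to a given moment, you could use a sketch like the following. It assumes the window format shown in the example above (DAY-HH:MM-HH:MM, in UTC) and blocked dates in YYYY-MM-DD form. The function names are hypothetical, and the format that the API accepts might differ, so verify it against the API reference before you rely on this.

```python
from datetime import datetime, time, timezone

# Order matches datetime.weekday(), where Monday is 0.
DAYS = ["MON", "TUE", "WED", "THU", "FRI", "SAT", "SUN"]


def parse_window(window):
    """Parse 'MON-20:30-21:30' into (weekday index, start time, end time)."""
    day, start, end = window.split("-")
    return DAYS.index(day.upper()), time.fromisoformat(start), time.fromisoformat(end)


def is_blocked(moment, blocked_dates, blocked_windows):
    """Return True if a timezone-aware datetime is on a blocked date or in a blocked window."""
    # Blocked dates and windows are expressed in UTC, so convert first.
    moment = moment.astimezone(timezone.utc)
    if moment.date().isoformat() in blocked_dates:
        return True
    for window in blocked_windows:
        weekday, start, end = parse_window(window)
        if moment.weekday() == weekday and start <= moment.time() <= end:
            return True
    return False
```

For example, with the blocked date 2024-05-01 and the three windows above, any time on May 1, 2024 (UTC) is blocked, and so is 21:00 UTC on any Monday, Wednesday, or Friday.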