
How Amazon Route 53 Application Recovery Controller works

Amazon Route 53 Application Recovery Controller (Route 53 ARC) helps you prepare for and quickly mitigate impairments to applications that run on Amazon. It provides the following capabilities:

  • A readiness check continually audits Amazon resource capacity, configuration, Amazon quotas, and routing policies for an application, and provides information that you can use to help recover successfully from an application failure. Readiness checks help ensure that your recovery environment is scaled and configured so that you can fail over to it when needed.

  • Routing controls enable you to rebalance traffic across application replicas during failures, to ensure that your application is available. You can also pair routing controls with safety rules that you create to help avoid unintended consequences. For example, you might want to prevent inadvertently turning off all of the routing controls for an application, which would result in a fail-open scenario.

  • A zonal shift temporarily moves traffic for a resource away from an Availability Zone (AZ), to enable you to quickly and reliably recover from issues for multi-AZ applications. Currently supported resources are Network Load Balancers and Application Load Balancers with cross-zone load balancing turned off.

Learn more about how Route 53 ARC works in the following sections.

Monitoring your application replica with readiness checks

Route 53 ARC audits your application replicas by using readiness checks to ensure that each one has the same configuration setup and the same runtime state.

To be prepared for recovery, for example, you must maintain sufficient spare capacity at all times to absorb failover traffic from another Availability Zone or Region. Route 53 ARC continually (once a minute) inspects your application to ensure that your provisioned capacity matches across all Availability Zones or Regions. The capacity that Route 53 ARC inspects includes, for example, Amazon EC2 instance counts, Aurora read and write capacity units, and Amazon EBS volume size. If you scale up capacity in your primary replica but forget to increase the corresponding values in your standby replica, Route 53 ARC detects the mismatch so that you can increase the values in the standby.
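The capacity comparison described above can be pictured with a small sketch. This is not how Route 53 ARC is implemented. The replica values, the resource keys, and the `find_capacity_mismatches` helper are all hypothetical, and the code only illustrates the idea of flagging resources whose provisioned values differ between a primary and a standby replica.

```python
from typing import Dict, List, Tuple

# Hypothetical capacity snapshot: resource name -> provisioned value.
Capacity = Dict[str, int]


def find_capacity_mismatches(
    primary: Capacity, standby: Capacity
) -> List[Tuple[str, int, int]]:
    """Return (resource, primary_value, standby_value) for every resource
    whose standby value differs from the primary value."""
    mismatches = []
    for resource, primary_value in sorted(primary.items()):
        standby_value = standby.get(resource, 0)  # a missing resource counts as zero
        if standby_value != primary_value:
            mismatches.append((resource, primary_value, standby_value))
    return mismatches


# Illustrative values only: the primary was scaled up, the standby was not.
primary = {"ec2_instance_count": 12, "ebs_volume_size_gib": 500}
standby = {"ec2_instance_count": 8, "ebs_volume_size_gib": 500}

for resource, want, have in find_capacity_mismatches(primary, standby):
    print(f"{resource}: primary={want}, standby={have}")
```

In this example only the instance count is reported, because the volume size already matches in both replicas.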

Important

Readiness checks are most useful for verifying, on an ongoing basis, that application replica configurations and runtime states are aligned. Readiness checks shouldn't be used to indicate whether your production replica is healthy, nor should you rely on readiness checks as a primary trigger for failover during a disaster event.

In an active-standby configuration, you should make decisions about whether to fail away from or to a cell based on your monitoring and health check systems, and consider readiness checks as a complementary service to those systems. Route 53 ARC readiness checks are not highly available, so you should not depend on the checks being accessible during an outage. In addition, the resources that are checked might also not be available during a disaster event.

You can monitor the readiness status for your application's resources in specific cells (Amazon Regions or Availability Zones) or for your overall application. You can be notified when a readiness check status changes, for example, to Not ready, by creating rules in EventBridge. For more information, see Using Route 53 ARC with Amazon EventBridge. You can also view readiness status in the Amazon Web Services Management Console, or by using API operations, such as get-recovery-readiness. For more information, see Recovery readiness (readiness check) actions.

Rerouting traffic for recovery with routing control

A Route 53 ARC routing control is an on/off switch that changes the state of a Route 53 ARC health check, which can then be associated with a DNS record that redirects traffic, for example, from a primary to a standby deployment replica.

If there's an application failure or latency issue, you can update routing control states to shift traffic from your primary replica to, for example, a standby replica. By using the highly reliable Route 53 ARC data plane API operations to make routing control queries and routing control state updates, you can rely on Route 53 ARC for failover during disaster recovery scenarios. For more information, see Getting and updating routing control states using the Route 53 ARC API (recommended).

Route 53 ARC maintains routing control states in a cluster, which is a set of five redundant Regional endpoints. Route 53 ARC propagates routing control state changes across the cluster, which is located in an Amazon EC2 fleet, to reach a quorum across five Amazon Regions. After propagation, when you query Route 53 ARC for a routing control state by using the API and the highly reliable data plane, it returns the consensus view.

You can interact with any one of the five cluster endpoints to update the state of a routing control from, for example, Off to On. Then Route 53 ARC propagates the update across the five Regions of the cluster.
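Because any of the five endpoints can accept an update, a client can try the endpoints in turn until one succeeds. The following is a minimal sketch of that pattern, not an official implementation. The `update_state_on_any_endpoint` function, the `make_client` factory, the `StubClient` class, and the endpoint values are hypothetical. The method name `update_routing_control_state` and its two parameters follow the Routing Control (Recovery Cluster) API as exposed by the SDK for Python, but treat that as an assumption to verify against the SDK reference.

```python
import random
from typing import Any, Callable, Iterable, Mapping, Optional


def update_state_on_any_endpoint(
    endpoints: Iterable[Mapping[str, str]],
    make_client: Callable[[Mapping[str, str]], Any],
    routing_control_arn: str,
    state: str,
) -> str:
    """Try cluster endpoints in random order until one accepts the update.

    Returns the Region of the endpoint that accepted the change.
    """
    candidates = list(endpoints)
    random.shuffle(candidates)  # spread requests instead of favoring one endpoint
    last_error: Optional[Exception] = None
    for endpoint in candidates:
        try:
            client = make_client(endpoint)
            client.update_routing_control_state(
                RoutingControlArn=routing_control_arn,
                RoutingControlState=state,
            )
            return endpoint["Region"]
        except Exception as error:  # sketch only; real code should catch narrower errors
            last_error = error
    raise RuntimeError("no cluster endpoint accepted the update") from last_error


class StubClient:
    """Hypothetical stand-in for an SDK client, used only for this demo."""

    def __init__(self, reachable: bool) -> None:
        self.reachable = reachable
        self.calls = []

    def update_routing_control_state(self, **kwargs: str) -> None:
        if not self.reachable:
            raise ConnectionError("endpoint unreachable")
        self.calls.append(kwargs)


# Hypothetical endpoints; only the one in region-b is reachable in this demo.
demo_endpoints = [
    {"Region": "region-a", "Endpoint": "https://a.example"},
    {"Region": "region-b", "Endpoint": "https://b.example"},
]
demo_clients = {"region-a": StubClient(False), "region-b": StubClient(True)}

used = update_state_on_any_endpoint(
    demo_endpoints,
    lambda endpoint: demo_clients[endpoint["Region"]],
    "arn:example:routing-control",  # placeholder, not a real ARN
    "On",
)
print(used)  # region-b
```

Passing the client factory as a parameter keeps the retry logic independent of any SDK. In real use, the factory would build a client for each endpoint's URL and Region.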

Data consistency across all five cluster endpoints is achieved within 5 seconds on average, and within 15 seconds at most.

Route 53 ARC offers extreme reliability with its data plane for you to manually fail over your application across cells. Route 53 ARC ensures that at least three out of the five cluster endpoints are always accessible to you to perform routing control state changes. Note that each Route 53 ARC cluster is single-tenant, to ensure that you're not affected by "noisy neighbors" that might slow down your access patterns.

When you make changes to routing control states, you rely on the following three criteria, which are highly unlikely to fail:

  • At least three of your five endpoints are available and take part in the quorum.

  • You have working IAM credentials and can authenticate against a working Regional cluster endpoint.

  • The Route 53 data plane is healthy (this data plane is designed to meet a 100% availability SLA).
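The first criterion, a quorum of three out of five endpoints, can be illustrated with a simplified majority check. This sketch is an assumption-laden model, not the consensus protocol that Route 53 ARC actually runs. The endpoint names and the `consensus_state` helper are hypothetical.

```python
from typing import Dict, Optional

CLUSTER_SIZE = 5
QUORUM = 3  # majority of five, matching the first criterion above


def consensus_state(reported: Dict[str, Optional[str]]) -> Optional[str]:
    """Return the state reported by at least QUORUM endpoints, else None.

    `reported` maps an endpoint name to "On", "Off", or None if unreachable.
    """
    counts: Dict[str, int] = {}
    for state in reported.values():
        if state is not None:
            counts[state] = counts.get(state, 0) + 1
    for state, count in counts.items():
        if count >= QUORUM:
            return state
    return None


# Two endpoints unreachable: the remaining three still form a quorum.
two_down = {"ep1": "On", "ep2": "On", "ep3": "On", "ep4": None, "ep5": None}
print(consensus_state(two_down))  # On

# Three endpoints unreachable: no quorum is possible.
three_down = {"ep1": "On", "ep2": "On", "ep3": None, "ep4": None, "ep5": None}
print(consensus_state(three_down))  # None
```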

Resilience in Route 53 ARC

Here's an example of incorporating routing controls into your failover strategy, to improve the resilience and availability of your applications on Amazon.

You can support highly available applications on Amazon by running multiple (typically three) redundant replicas across Regions. Then you can use Amazon Route 53 routing control to route traffic to the appropriate replica.

For example, you can set up one application replica to be active and serve application traffic, while another is a standby replica. When your active replica has failures, you can reroute user traffic to the standby replica to restore availability to your application. Readiness checks can help you make sure that a standby replica matches your production replica on an ongoing basis. However, you should decide whether to fail away from or to a replica based on information from your monitoring and health check systems, and consider readiness checks as a complementary service to those systems.
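An active-standby failover can be modeled as flipping two on/off switches while a safety rule guards against turning everything off. The sketch below is illustrative only. The replica names, the `set_state` and `fail_over` helpers, and the choice to turn the standby on before turning the active replica off are assumptions, not behavior documented here.

```python
from typing import Dict


class SafetyRuleViolation(Exception):
    """Raised when a change would leave too few routing controls On."""


def set_state(controls: Dict[str, bool], name: str, on: bool, min_on: int = 1) -> None:
    """Apply one routing control change, rejecting it if fewer than
    `min_on` controls would remain On afterwards."""
    if name not in controls:
        raise KeyError(name)
    proposed = dict(controls)
    proposed[name] = on
    if sum(proposed.values()) < min_on:
        raise SafetyRuleViolation(
            f"changing {name} would leave fewer than {min_on} control(s) On"
        )
    controls[name] = on


def fail_over(controls: Dict[str, bool], active: str, standby: str) -> None:
    """Shift traffic from the active replica to the standby replica."""
    set_state(controls, standby, True)   # open the standby first
    set_state(controls, active, False)   # then close the impaired replica


controls = {"primary": True, "standby": False}
fail_over(controls, active="primary", standby="standby")
print(controls)  # {'primary': False, 'standby': True}

try:
    set_state(controls, "standby", False)  # would turn every control off
except SafetyRuleViolation as error:
    print(f"blocked: {error}")
```

Turning the standby on first means that at least one control is On at every step, so the failover itself never trips the rule.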

If you want to enable faster recoveries, another option that you can choose for your architecture is an active-active implementation. With this approach, all of your replicas are active at the same time. This means that you can recover from failures by moving users away from your impaired application replica by just rerouting traffic to another active replica.

Moving traffic away from an Availability Zone with zonal shift

With zonal shift, you can move traffic for a load balancing resource away from an Availability Zone (AZ), so that you can continue operating your application normally in the other AZs in an Amazon Region. At this time, you can start a zonal shift for Network Load Balancers and Application Load Balancers with cross-zone load balancing turned off.

When you deploy and run applications on load balancers in multiple (typically three) AZs in a Region, you can quickly recover from issues in an impaired AZ by starting a zonal shift. Shifting your application traffic to other AZs reduces the duration and severity of impact caused by power outages, or hardware or software failures in the impaired AZ.

When you start a zonal shift for an AZ, Route 53 ARC sets Amazon Route 53 health checks to unhealthy for the corresponding IP addresses for the load balancer resource, so that traffic for the resource is no longer directed to the AZ. When the zonal shift expires or you cancel it, Route 53 ARC sets the Route 53 health checks to healthy again and the original zonal IP addresses are restored.

A zonal shift must have an expiration, which is when the shift ends and traffic returns to the AZ. You can initially set a zonal shift to expire in a maximum of three days (72 hours). You can update a zonal shift at any time to set a new expiration, which is also limited to a maximum of three days. You can also cancel a zonal shift before it expires, if you're ready to restore traffic to the AZ earlier.
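The expiration limit can be expressed as simple date arithmetic. The following sketch is hypothetical. The `compute_expiry` helper is not part of any API, and it assumes that the three-day limit is measured from the moment the shift is started or updated.

```python
from datetime import datetime, timedelta, timezone

MAX_SHIFT_DURATION = timedelta(hours=72)  # three-day limit stated above


def compute_expiry(now: datetime, expires_in: timedelta) -> datetime:
    """Return the expiry time for a shift started or updated at `now`."""
    if expires_in <= timedelta(0):
        raise ValueError("expiration must be in the future")
    if expires_in > MAX_SHIFT_DURATION:
        raise ValueError("expiration cannot exceed 72 hours")
    return now + expires_in


started = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
expiry = compute_expiry(started, timedelta(hours=72))
print(expiry.isoformat())  # 2024-01-04T12:00:00+00:00

# Two days in, update the shift with a new expiration measured from the update time.
updated_at = started + timedelta(hours=48)
expiry = compute_expiry(updated_at, timedelta(hours=72))
print(expiry.isoformat())  # 2024-01-06T12:00:00+00:00
```

Under this assumption, repeated updates can keep a shift in place longer than three days in total, even though no single expiration is more than 72 hours away.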

In a few specific scenarios, zonal shift does not shift traffic away from the AZ. For example, if the load balancer target groups in the AZs don't have any instances, or if all of the instances are unhealthy, then the load balancer is in a fail-open state. If you start a zonal shift for a load balancer in this scenario, the zonal shift does not change which AZs the load balancer uses, because the load balancer is already failing open. This is expected behavior: zonal shift cannot force one AZ to be unhealthy and shift traffic to the other AZs in a Region if all AZs are failing open (unhealthy).

A second scenario is an Application Load Balancer that is an endpoint for an accelerator in Amazon Global Accelerator. Zonal shift isn't supported for Application Load Balancers that are endpoints of accelerators in Global Accelerator.
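The fail-open case can be captured in a deliberately simplified model. This sketch is an assumption, not load balancer logic. The `zones_in_rotation` helper and the AZ names are hypothetical, and the model covers only the two situations described above: a normal zonal shift, and a load balancer whose AZs have no healthy instances at all.

```python
from typing import Dict, Optional, Set


def zones_in_rotation(
    healthy_targets: Dict[str, int], shifted_away: Optional[str] = None
) -> Set[str]:
    """Return the AZs that still receive traffic.

    `healthy_targets` maps each AZ to its number of healthy instances.
    """
    all_zones = set(healthy_targets)
    if all(count == 0 for count in healthy_targets.values()):
        # Fail open: no AZ has healthy instances, so every AZ stays in
        # rotation and a zonal shift has no effect.
        return all_zones
    if shifted_away is None:
        return all_zones
    return all_zones - {shifted_away}


healthy = {"az-1": 4, "az-2": 3, "az-3": 5}
print(sorted(zones_in_rotation(healthy, shifted_away="az-1")))
# ['az-2', 'az-3']

failing_open = {"az-1": 0, "az-2": 0, "az-3": 0}
print(sorted(zones_in_rotation(failing_open, shifted_away="az-1")))
# ['az-1', 'az-2', 'az-3']
```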

To learn more about starting a zonal shift, see Zonal shift in Amazon Route 53 Application Recovery Controller.

Data and control planes for Route 53 ARC

As you plan for failover and disaster recovery, it's important to consider how resilient your failover mechanisms are and make sure that the mechanisms that you depend on are highly available, so that you can use them when you need them in a disaster scenario. Typically you should use data plane functions for your mechanisms when you can, for the greatest reliability and fault tolerance. With that in mind, it's important to understand how the functionality of a service is divided between control planes and data planes, and when you can rely on an expectation of extreme reliability with a service's data plane.

Route 53 ARC includes three sets of functionality: zonal shift, readiness checks, and routing control for recovery. As with most Amazon services, the Route 53 ARC functionality is supported by control planes and data planes. While both types are built to be reliable, a control plane is optimized for data consistency, while a data plane is optimized for availability. A data plane is designed for resilience so that it can maintain availability even during disruptive events, when a control plane might become unavailable. Because of this, we recommend that you use data plane operations when availability is important, for example, when you need to reroute traffic to a standby replica during an outage.

In general, a control plane enables you to do basic management functions, such as create, update, and delete resources in the service. A data plane provides a service's core functionality.

For Route 53 ARC, the control planes and data planes are divided as follows:

  • For zonal shifts, supported resources are automatically registered with Route 53 ARC. When a resource is registered, it becomes a managed resource for zonal shifts in Route 53 ARC. Route 53 ARC has a data plane in each Amazon Region that provides API operations to get, list, create, and update zonal shifts for managed resources. The zonal shift data plane is highly available.

  • For readiness checks, there is a single API, the Recovery Readiness API, for both the control plane and data plane. Readiness checks and readiness resources are only in the US West (Oregon) Region (us-west-2). The readiness checks control plane and data plane are not highly available.

  • For routing control, the control plane API is the Recovery Control Configuration API, supported in the US West (Oregon) Region (us-west-2). You use these API operations or the Amazon Web Services Management Console to create or delete clusters, control panels, and routing controls, to help prepare for a disaster recovery event when you might need to reroute traffic for your application. The routing control configuration control plane is not highly available.

  • The routing control data plane in Route 53 ARC is a dedicated cluster across five geographically isolated Amazon Regions. Each customer creates one or more clusters by using the routing control configuration control plane. The cluster hosts control panels and routing controls. Then you use the Routing Control (Recovery Cluster) API to get, list, and update routing control states when you want to reroute traffic for your application. The routing control data plane is highly available.
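The division above can be summarized as a small lookup that a runbook might use to decide which operations to depend on during an outage. The `Plane` type, the feature labels, and the `safe_during_outage` helper are hypothetical. The availability flags simply restate the four items in the list.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Plane:
    feature: str
    plane: str
    highly_available: bool


# One entry per item in the list above.
PLANES = [
    Plane("zonal shift", "data plane", True),
    Plane("readiness check", "control plane and data plane", False),
    Plane("routing control configuration", "control plane", False),
    Plane("routing control state", "data plane", True),
]


def safe_during_outage(planes: Iterable[Plane]) -> List[str]:
    """Return the features whose operations are described as highly available."""
    return [p.feature for p in planes if p.highly_available]


print(safe_during_outage(PLANES))
# ['zonal shift', 'routing control state']
```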

To learn more about recovery readiness and preparing for failover with Route 53 ARC, see Best practices for Amazon Route 53 Application Recovery Controller.

For more information about data planes, control planes, and how Amazon builds services to meet high availability targets, see the Static stability using Availability Zones paper in the Amazon Builders' Library.