Best practices for Amazon Route 53 Application Recovery Controller

To minimize disruption and help provide for operational continuity, follow best practices to plan for and execute disaster recovery with Amazon Route 53 Application Recovery Controller. Review the guidelines in this chapter to learn more.

Best practices for recovery in Route 53 ARC

We recommend the following best practices for recovery and failover preparedness in Amazon Route 53 Application Recovery Controller.

Keep purpose-built, long-lived Amazon credentials secure and always accessible

In a disaster recovery (DR) scenario, keep system dependencies to a minimum by using a simple approach to accessing Amazon and performing recovery tasks. Create long-lived IAM credentials specifically for DR tasks, and store them securely, in an on-premises physical safe or a virtual vault, so that they are available when you need them. With IAM, you can centrally manage security credentials, such as access keys, and permissions for access to Amazon resources. For non-DR tasks, we recommend that you continue to use federated access, using Amazon services such as Amazon Single Sign-On.

To perform failover tasks in Route 53 ARC with the recovery cluster data plane API, you can attach a Route 53 ARC IAM policy to your user. To learn more, see Identity-based policy examples for Amazon Route 53 Application Recovery Controller.

Choose lower TTL values for DNS records involved in failover

For DNS records that you might need to change as part of your failover mechanism, especially records that are health checked, using lower TTL values is appropriate. Setting a TTL of 60 or 120 seconds is a common choice for this scenario.

The DNS TTL (time to live) setting tells DNS resolvers how long to cache a record before requesting a new one. When you choose a TTL, you trade off latency and reliability against responsiveness to change. A longer TTL means that more queries are answered from cache, which is faster and less dependent on the authoritative servers. A shorter TTL means that DNS resolvers notice updates to the record more quickly, because they must query for it more frequently.
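The trade-off can be made concrete with some simple arithmetic. The following is a minimal sketch, not a Route 53 API: the function name `ttl_tradeoff` is hypothetical, and it assumes a single caching resolver that honors the TTL and is busy enough to re-query as soon as its cached copy expires.

```python
SECONDS_PER_DAY = 86_400


def ttl_tradeoff(ttl_seconds: int) -> dict:
    """Summarize the TTL trade-off for one caching resolver and one record."""
    if ttl_seconds <= 0:
        raise ValueError("ttl_seconds must be positive")
    return {
        "ttl_seconds": ttl_seconds,
        # A resolver that cached the record just before a change can keep
        # serving the old answer for up to one full TTL.
        "max_staleness_seconds": ttl_seconds,
        # Upper bound on how often this resolver re-queries the record.
        "max_queries_per_day": SECONDS_PER_DAY // ttl_seconds,
    }


for ttl in (60, 120, 3600):
    print(ttl_tradeoff(ttl))
```

Under these assumptions, a 60-second TTL bounds staleness after a failover at about one minute, at the cost of up to 1,440 queries per resolver per day, compared with 24 for a one-hour TTL.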

For more information, see Choosing TTL values for DNS records in Best practices for Amazon Route 53 DNS.

Best practices for zonal shifts in Route 53 ARC

We recommend the following best practices for using zonal shifts for multi-AZ recovery in Route 53 ARC. Zonal shifts typically remove capacity from a live application, so it's important to be careful when you use them in production.

Capacity planning and pre-scaling

Ensure that you have planned for, and either pre-scaled or can auto-scale, sufficient capacity to accommodate the extra load imposed on Availability Zones when you start a zonal shift. With a recovery-oriented architecture, a typical recommendation is to pre-scale compute capacity to include enough headroom to serve your peak traffic when one of your (typically) three replicas is offline.

When you start a zonal shift for a single load balancer resource, for example, the capacity of one Availability Zone is temporarily removed from behind the load balancer. Depending on the zonal shifts that you start and how your load balancers are configured, you must make sure that you've carefully planned for managing the increased load on the remaining Availability Zones.
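As a rough illustration of the headroom calculation, the sketch below sizes a fleet so that the remaining Availability Zones can serve peak traffic when one is shifted away. The function name, the parameters, and the example numbers are hypothetical, and the sketch assumes that traffic is spread evenly across Availability Zones and that capacity scales linearly with instance count.

```python
import math


def prescale_plan(peak_load: float, per_instance_capacity: float,
                  az_count: int = 3) -> dict:
    """Size each AZ so that (az_count - 1) AZs can carry the full peak load."""
    if az_count < 2:
        raise ValueError("need at least two Availability Zones to shift away from one")
    surviving = az_count - 1
    # Load that each remaining AZ must absorb while one AZ is offline.
    load_per_surviving_az = peak_load / surviving
    instances_per_az = math.ceil(load_per_surviving_az / per_instance_capacity)
    return {
        "instances_per_az": instances_per_az,
        "total_instances": instances_per_az * az_count,
        # Ratio of provisioned capacity to what peak load alone requires.
        "headroom_factor": az_count / surviving,
    }


# Example: 3000 requests/second at peak, 100 requests/second per instance.
print(prescale_plan(3000, 100, az_count=3))
```

With three Availability Zones, this works out to 50% more capacity than peak load alone would need (45 instances instead of 30 in the example).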

Test starting zonal shifts in advance

Regularly test moving traffic away from Availability Zones for your application by starting zonal shifts. Plan for and execute starting zonal shifts, preferably in both test and production environments, as part of regular failover testing for recovering your applications in the event of a disaster. Regular testing is a critical part of ensuring that you're ready for and have the confidence to mitigate issues when an operational event occurs.

Ensure that all Availability Zones are healthy and taking traffic

Zonal shifts work by marking a resource, that is, an application replica, as unhealthy in an Availability Zone, so the remaining Availability Zones must absorb its traffic. This makes it critical to ensure that the targets in the load balancers for your applications are generally healthy and actively taking traffic in the Availability Zones in a Region. We recommend that you have dashboards to track this, including, for example, Elastic Load Balancing metrics for unhealthy targets (UnHealthyHostCount) and processed bytes (ProcessedBytes) per Availability Zone.

Consider monitoring health of your resources from a second, adjacent Region. Advantages of this approach are that it can be more representative of your end users' experience, and it also reduces the risk of both your application and your monitoring being impacted by the same disaster at the same time ("shared fate").

Use data plane API operations for disaster recovery

To start a zonal shift when you need to recover an application quickly, with few dependencies, we recommend using the Amazon Command Line Interface or the zonal shift API actions, with pre-stored credentials if possible. You can also start zonal shifts in the Amazon Web Services Management Console, for ease of use, but when fast, reliable recovery is critical, data plane operations are a better choice. For more information, see Zonal Shift API Reference Guide.
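A minimal sketch of starting a zonal shift through the API with the SDK for Python (boto3) might look like the following. The wrapper function, its default expiry and comment, and the placeholder values in the usage notes are assumptions for illustration. The client is passed in so that the call can be exercised without network access.

```python
def start_zonal_shift(client, resource_arn: str, away_from_az_id: str,
                      expires_in: str = "2h",
                      comment: str = "DR: shifting away from impaired AZ") -> str:
    """Start a zonal shift and return its ID.

    `client` is expected to be a boto3 "arc-zonal-shift" client.
    `away_from_az_id` is an Availability Zone ID, and `expires_in` is a
    duration string in minutes or hours, such as "120m" or "2h".
    """
    response = client.start_zonal_shift(
        resourceIdentifier=resource_arn,
        awayFrom=away_from_az_id,
        expiresIn=expires_in,
        comment=comment,
    )
    return response["zonalShiftId"]


# Usage (requires boto3 and the pre-stored DR credentials; the values in
# angle brackets are placeholders, not real identifiers):
#   import boto3
#   client = boto3.client("arc-zonal-shift", region_name="<Region>")
#   shift_id = start_zonal_shift(client, "<load balancer ARN>", "<AZ ID>")
```

Setting an explicit, short expiry keeps the shift temporary by default even if nobody remembers to cancel it.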

Move traffic with a zonal shift only temporarily

A zonal shift moves traffic away from an Availability Zone on a temporary basis, to mitigate an impairment. Return the resource for the application to service, by canceling the zonal shift or letting it expire, as soon as you've taken action to correct the problem. This ensures that your overall application is restored to its original fully redundant, resilient state.
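Ending a shift early can also be scripted. The following sketch, which assumes a boto3 "arc-zonal-shift" client is passed in, cancels every active zonal shift for one resource. The function name is hypothetical.

```python
def cancel_active_shifts(client, resource_arn: str) -> list:
    """Cancel all ACTIVE zonal shifts for a resource; return the canceled IDs."""
    canceled = []
    kwargs = {"status": "ACTIVE"}
    while True:
        page = client.list_zonal_shifts(**kwargs)
        for shift in page.get("items", []):
            if shift["resourceIdentifier"] == resource_arn:
                client.cancel_zonal_shift(zonalShiftId=shift["zonalShiftId"])
                canceled.append(shift["zonalShiftId"])
        # Follow pagination until the service stops returning a token.
        token = page.get("nextToken")
        if not token:
            break
        kwargs["nextToken"] = token
    return canceled
```

Confirm that the underlying problem is resolved before you cancel the shift, because canceling sends traffic back to the Availability Zone.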

Best practices for zonal autoshift in Route 53 ARC

We recommend the following best practices for enabling zonal autoshift for multi-AZ recovery in Route 53 ARC. Practice runs and autoshifts with zonal autoshift remove capacity from a live application, so it's important to be careful when you use or enable these capabilities in production.

Capacity planning and pre-scaling

When you plan to configure zonal autoshift for a resource, make sure to pre-scale capacity for your application. Then, start one or more zonal shifts for the resource, to shift traffic away from an Availability Zone, and verify that your application continues to operate normally with the loss of one Availability Zone. When you configure zonal autoshift, Route 53 ARC regularly starts practice run zonal shifts for your resource, to help you to confirm that you can operate your application normally with the loss of one Availability Zone.

Create targeted CloudWatch alarms for practice runs

For practice runs in zonal autoshift, you specify a CloudWatch alarm to monitor the health of your application when traffic is shifted away from an Availability Zone during a practice run. Configure the thresholds for the CloudWatch alarm so that a practice run stops before your application performance degrades, and your clients can continue to use the application normally. For more information, see the Alarms that you specify for practice runs section in Considerations when you configure zonal autoshift.
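One way to make the alarm trip early is to set its threshold at a fraction of the latency that you consider degraded. The sketch below builds parameters for the CloudWatch PutMetricAlarm operation. The alarm name, the choice of the Application Load Balancer TargetResponseTime metric, the 80% safety margin, and the evaluation settings are all assumptions chosen for illustration, not recommended values.

```python
def practice_run_alarm_params(load_balancer_dimension: str,
                              degraded_latency_seconds: float,
                              safety_margin: float = 0.8) -> dict:
    """Build PutMetricAlarm parameters that trip before latency degrades."""
    if not 0 < safety_margin <= 1:
        raise ValueError("safety_margin must be in (0, 1]")
    return {
        "AlarmName": "zonal-autoshift-practice-run-latency",  # hypothetical name
        "Namespace": "AWS/ApplicationELB",
        "MetricName": "TargetResponseTime",
        "Dimensions": [{"Name": "LoadBalancer", "Value": load_balancer_dimension}],
        "Statistic": "Average",
        "Period": 60,
        "EvaluationPeriods": 2,
        # Alarm below the degraded level so the practice run stops early.
        "Threshold": degraded_latency_seconds * safety_margin,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",
    }


# Usage (requires boto3 and credentials; the value in angle brackets is a
# placeholder):
#   import boto3
#   params = practice_run_alarm_params("<load balancer dimension value>", 0.5)
#   boto3.client("cloudwatch").put_metric_alarm(**params)
```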

Best practices for readiness checks and routing controls in Route 53 ARC

We recommend the following best practices for recovery readiness and failover preparedness when you set up and use Route 53 ARC with readiness checks and routing control, for example, for Regional failover.

Bookmark or hard code your five Regional cluster endpoints and routing control ARNs

We recommend that you keep a local copy of your Route 53 ARC Regional cluster endpoints, in bookmarks or saved in automation code that you use to retry your endpoints. During a failure event, you might not be able to access some API operations, including Route 53 ARC API operations that are not hosted on the extremely reliable data plane cluster. You can list the endpoints for your Route 53 ARC clusters by using the DescribeCluster API operation.
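To capture the endpoints ahead of time, you could call DescribeCluster once and save the result locally. The sketch below assumes the response shape returned by the boto3 "route53-recovery-control-config" client. The helper names and the file path in the usage notes are hypothetical.

```python
import json


def extract_cluster_endpoints(describe_cluster_response: dict) -> list:
    """Pull the endpoint URL and Region pairs out of a DescribeCluster response."""
    endpoints = describe_cluster_response["Cluster"]["ClusterEndpoints"]
    return [{"Endpoint": e["Endpoint"], "Region": e["Region"]} for e in endpoints]


def save_endpoints(endpoints: list, path: str) -> None:
    """Write the endpoints to a local JSON file for use during an event."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(endpoints, f, indent=2)


# Usage (run this during normal operations, not during a failure; values
# in angle brackets are placeholders):
#   import boto3
#   config = boto3.client("route53-recovery-control-config",
#                         region_name="<control plane Region>")
#   response = config.describe_cluster(ClusterArn="<cluster ARN>")
#   save_endpoints(extract_cluster_endpoints(response), "arc_endpoints.json")
```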

Choose one of your endpoints at random to update your routing control states

We recommend that when you need to fail over, you update (and retrieve) routing control states using a random endpoint from your five Regional cluster endpoints. If that endpoint fails, then retry each of your other Regional endpoints. For information about using code examples with the Amazon SDK, including examples for trying cluster endpoints, see Code examples for Application Recovery Controller using Amazon SDKs.
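The random-then-retry pattern can be sketched as follows. This is an illustrative outline, not the official sample code: the function names are hypothetical, and the client factory is injected so that the retry logic can be tested without network access.

```python
import random


def update_state_with_retry(endpoints, routing_control_arn, state, make_client):
    """Try cluster endpoints in random order until one accepts the update.

    `endpoints` is a list of {"Endpoint": url, "Region": region} dicts.
    Returns the endpoint URL that succeeded.
    """
    candidates = list(endpoints)
    random.shuffle(candidates)  # random first choice, then the rest in turn
    errors = []
    for ep in candidates:
        try:
            client = make_client(ep["Endpoint"], ep["Region"])
            client.update_routing_control_state(
                RoutingControlArn=routing_control_arn,
                RoutingControlState=state,
            )
            return ep["Endpoint"]
        except Exception as exc:  # any failure: move on to the next endpoint
            errors.append((ep["Endpoint"], repr(exc)))
    raise RuntimeError(f"all {len(candidates)} cluster endpoints failed: {errors}")


def make_boto3_client(endpoint_url, region):
    """Create a data plane client for one Regional cluster endpoint."""
    import boto3  # imported lazily so the retry logic has no hard dependency
    return boto3.client("route53-recovery-cluster",
                        endpoint_url=endpoint_url, region_name=region)
```

A call such as `update_state_with_retry(saved_endpoints, "<routing control ARN>", "On", make_boto3_client)` would then use the endpoints that you stored locally. The ARN shown is a placeholder.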

Use the extremely reliable data plane API to list and update routing control states, not the console

Using the Route 53 ARC data plane API, view your routing controls and states with the ListRoutingControls operation and update routing control states to redirect traffic for failover with the UpdateRoutingControlState operation. You can use the Amazon CLI or code that you write using one of the Amazon SDKs. Route 53 ARC offers extreme reliability with the API in the data plane to fail over traffic. We recommend using the API instead of changing routing control states in the Amazon Web Services Management Console.

Connect to one of your Regional cluster endpoints for Route 53 ARC to use the data plane API. If the endpoint is unavailable, try connecting to another cluster endpoint.

If a safety rule blocks a routing control state update, you can bypass it to make the update and fail over traffic. For more information, see Overriding safety rules to reroute traffic.
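The list, update, and override operations described above might be wrapped as in the following sketch, which assumes a boto3 "route53-recovery-cluster" client that is already connected to one of your Regional cluster endpoints. The function names are hypothetical, and the listing helper reads only the first page of results.

```python
def list_states(client, control_panel_arn: str) -> dict:
    """Return {routing control name: "On" or "Off"} for one control panel."""
    response = client.list_routing_controls(ControlPanelArn=control_panel_arn)
    return {rc["RoutingControlName"]: rc["RoutingControlState"]
            for rc in response["RoutingControls"]}


def set_state(client, routing_control_arn: str, state: str,
              safety_rules_to_override=None) -> dict:
    """Update one routing control, optionally bypassing named safety rules."""
    if state not in ("On", "Off"):
        raise ValueError('state must be "On" or "Off"')
    kwargs = {"RoutingControlArn": routing_control_arn,
              "RoutingControlState": state}
    if safety_rules_to_override:
        # Only pass safety rule ARNs here when you intend to bypass them.
        kwargs["SafetyRulesToOverride"] = list(safety_rules_to_override)
    client.update_routing_control_state(**kwargs)
    return kwargs
```

Keeping the override as an explicit, optional argument makes a bypass a deliberate choice rather than a default.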

Test failover with Route 53 ARC

Test failover regularly with Route 53 ARC routing control, to fail over from your primary application stack to a secondary application stack. It's important to make sure that the Route 53 ARC structures that you've added are aligned with the correct resources in your stack, and that everything works as you expect it to. Test this after you set up Route 53 ARC for your environment, and continue to test periodically. That way, your failover environment is prepared before you experience a failure in which you need your secondary system to be up and running quickly to avoid downtime for your users.

Add notifications for readiness status changes

Set a rule in Amazon EventBridge to send a notification whenever a readiness check status changes, for example, from READY to NOT READY. When you receive a notification, you can investigate and address the issue, to make sure that your application and resources are ready for failover when you expect them to be.

You can set EventBridge rules to send notifications for several readiness check status changes, including for your recovery group (for your application), for a cell (such as an Amazon Region), or for a readiness check for a resource set.
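As a starting point, a rule that matches readiness events could be built as in the sketch below. It matches on the event source only, and the source string shown is an assumption to verify against the EventBridge guide referenced in this section, along with any narrower filters on detail type or status. The rule name and the target in the usage notes are hypothetical.

```python
import json


def readiness_rule_params(rule_name: str = "arc-readiness-status-change") -> dict:
    """Build PutRule parameters for a rule matching Route 53 ARC readiness events."""
    # Assumed event source for readiness checks; confirm before relying on it.
    pattern = {"source": ["aws.route53-recovery-readiness"]}
    return {
        "Name": rule_name,
        "EventPattern": json.dumps(pattern),
        "State": "ENABLED",
    }


# Usage (requires boto3 and credentials; the value in angle brackets is a
# placeholder):
#   import boto3
#   events = boto3.client("events")
#   params = readiness_rule_params()
#   events.put_rule(**params)
#   events.put_targets(Rule=params["Name"],
#                      Targets=[{"Id": "notify", "Arn": "<SNS topic ARN>"}])
```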

For more information, see Using Route 53 ARC with Amazon EventBridge.