Using failover in an Amazon Aurora global database - Amazon Aurora
Services or capabilities described in Amazon Web Services documentation might vary by Region. To see the differences applicable to the China Regions, see Getting Started with Amazon Web Services in China (PDF).

Using failover in an Amazon Aurora global database

An Aurora global database provides more comprehensive failover capabilities than the failover provided by a default Aurora DB cluster. By using an Aurora global database, you can plan for and recover from disaster fairly quickly. Recovery from disaster is typically measured using values for RTO and RPO.

  • Recovery time objective (RTO) – The time it takes a system to return to a working state after a disaster. In other words, RTO measures downtime. For an Aurora global database, RTO can be in the order of minutes.

  • Recovery point objective (RPO) – The amount of data that can be lost (measured in time). For an Aurora global database, RPO is typically measured in seconds. With an Aurora PostgreSQL–based global database, you can use the rds.global_db_rpo parameter to set and track the upper bound on RPO, but doing so might affect transaction processing on the primary cluster's writer node. For more information, see Managing RPOs for Aurora PostgreSQL–based global databases.

With an Aurora global database, there are two different approaches to failover depending on the scenario.

  • Manual unplanned failover ("detach and promote") – To recover from an unplanned outage or to do disaster recovery (DR) testing, perform a cross-Region failover to one of the secondaries in your Aurora global database. The RTO for this manual process depends on how quickly you can perform the tasks listed in Recovering an Amazon Aurora global database from an unplanned outage. The RPO is typically measured in seconds, but this depends on the Aurora storage replication lag across the network at the time of the failure.

  • Managed planned failover – This feature is intended for controlled environments, such as operational maintenance and other planned operational procedures. By using managed planned failover, you can relocate the primary DB cluster of your Aurora global database to one of the secondary Regions. Because this feature synchronizes secondary DB clusters with the primary before making any other changes, RPO is 0 (no data loss). To learn more, see Performing managed planned failovers for Amazon Aurora global databases.

Note

You can only perform a managed cross-Region database failover on an Aurora global database if the primary and secondary DB clusters have the same major, minor, and patch level engine versions. However, the patch levels can be different, depending on the minor engine version. For more information, see Version compatibility for failover.

Recovering an Amazon Aurora global database from an unplanned outage

On very rare occasions, your Aurora global database might experience an unexpected outage in its primary Amazon Web Services Region. If this happens, your primary Aurora DB cluster and its writer node aren't available, and the replication between the primary cluster and the secondaries ceases. To minimize both downtime (RTO) and data loss (RPO), you can work quickly to perform a cross-Region failover and reconstruct your Aurora global database.

Tip

We recommend that you understand this process before using it. Have a plan ready to quickly proceed at the first sign of a Region-wide issue. Be ready to identify the secondary Region with the least lag time. Use Amazon CloudWatch regularly to track lag times for the secondary clusters. Make sure to test your plan to check that your procedures are complete and accurate, and that staff are trained to perform a DR failover before it really happens.

To fail over to a secondary cluster after an unplanned outage in the primary Region
  1. Stop issuing DML statements and other write operations to the primary Aurora DB cluster in the Amazon Web Services Region with the outage.

  2. Identify an Aurora DB cluster from a secondary Amazon Web Services Region to use as a new primary DB cluster. If you have two or more secondary Amazon Web Services Regions in your Aurora global database, choose the secondary cluster that has the least lag time.

  3. Detach your chosen secondary DB cluster from the Aurora global database.

    Removing a secondary DB cluster from an Aurora global database immediately stops the replication from the primary to this secondary and promotes it to a standalone provisioned Aurora DB cluster with full read/write capabilities. Any other secondary Aurora DB clusters associated with the primary cluster in the Region with the outage are still available and can accept calls from your application. They also consume resources. Because you're recreating the Aurora global database, remove the other secondary DB clusters before creating the new Aurora global database in the following steps. Doing this avoids data inconsistencies among the DB clusters in the Aurora global database (split-brain issues).

    For detailed steps for detaching, see Removing a cluster from an Amazon Aurora global database.

  4. Reconfigure your application to send all write operations to this now standalone Aurora DB cluster using its new endpoint. If you accepted the provided names when you created the Aurora global database, you can change the endpoint by removing the -ro from the cluster's endpoint string in your application.

    For example, the secondary cluster's endpoint my-global.cluster-ro-aaaaaabbbbbb.us-west-1.rds.amazonaws.com becomes my-global.cluster-aaaaaabbbbbb.us-west-1.rds.amazonaws.com when that cluster is detached from the Aurora global database.

    This Aurora DB cluster becomes the primary cluster of a new Aurora global database when you start adding Regions to it in the next step.

    If you are using RDS Proxy, make sure to redirect your application's write operations to the appropriate read/write endpoint of the proxy that's associated with the new primary cluster. This proxy endpoint might be the default endpoint or a custom read/write endpoint. For more information see How RDS Proxy endpoints work with global databases.

  5. Add an Amazon Web Services Region to the DB cluster. When you do this, the replication process from primary to secondary begins. For detailed steps to add a Region, see Adding an Amazon Web Services Region to an Amazon Aurora global database.

  6. Add more Amazon Web Services Regions as needed to recreate the topology needed to support your application.

Make sure that application writes are sent to the correct Aurora DB cluster before, during, and after making these changes. Doing this avoids data inconsistencies among the DB clusters in the Aurora global database (split-brain issues).

If you reconfigured in response to an outage in an Amazon Web Services Region, you can make that Amazon Web Services Region the primary again after the outage is resolved. To do so, you add the old Amazon Web Services Region to your new global database, and then use the managed planned failover process to switch its role. Your Aurora global database must use a version of Aurora PostgreSQL or Aurora MySQL that supports managed planned failovers. For more information, see Performing managed planned failovers for Amazon Aurora global databases.

Performing managed planned failovers for Amazon Aurora global databases

By using managed planned failovers, you can relocate the primary cluster of your Aurora global database to a different Amazon Web Services Region on a routine basis. This approach is intended for controlled environments, such as operational maintenance and other planned operational procedures.

As an example, say a financial institution headquartered in New York has branch offices located in San Francisco, the UK, and Europe. The organization's core business applications use an Aurora global database. Its primary cluster runs in the US East (Ohio) Region. It has secondary clusters running in the US West (N. California) Region, Europe (London) Region, and the Europe (Frankfurt) Region. Every quarter, it relocates the primary cluster from the (current) primary Amazon Web Services Region to the secondary Region designated for that rotation.

Note

Managed planned failover is designed to be used on a healthy Aurora global database. To recover from an unplanned outage or to do disaster recovery (DR) testing, follow the "detach and promote" process detailed in Recovering an Amazon Aurora global database from an unplanned outage.

During a managed planned failover, your primary cluster is failed over to your choice of secondary Region while your Aurora global database's existing replication topology is maintained. Before the managed planned failover process begins, Aurora global database synchronizes all secondary clusters with its primary cluster. After ensuring that all clusters are synchronized, the managed planned failover begins. The DB cluster in the primary Region becomes read-only. The chosen secondary cluster promotes one of its read-only nodes to full writer status, thus allowing the cluster to assume the role of primary cluster. Because all secondary clusters were synchronized with the primary at the beginning of the process, the new primary continues operations for the Aurora global database without losing any data. Your database is unavailable for a short time while the primary and selected secondary clusters are assuming their new roles.

To optimize application availability, we recommend that you do the following before using this feature:

  • Perform this operation during nonpeak hours or at another time when writes to the primary DB cluster are minimal.

  • Take applications offline to prevent writes from being sent to the primary cluster of Aurora global database.

  • Check lag times for all secondary Aurora DB clusters in the Aurora global database. Choose the secondary with the least overall lag time for the managed planned failover. Use Amazon CloudWatch to view the AuroraGlobalDBReplicationLag metric for all secondaries. This metric tells you how far behind (in milliseconds) a secondary is to the primary DB cluster. Its value is directly proportional to the time it'll take for Aurora to complete failover. In other words, the larger the lag value, the longer the outage, so choose the Region with the least lag.

    For more information about CloudWatch metrics for Aurora, see Cluster-level metrics for Amazon Aurora.

During a managed planned failover, the chosen secondary DB cluster is promoted to its new role as primary. However, it doesn't inherit the various configuration options of the primary DB cluster. A mismatch in configuration can lead to performance issues, workload incompatibilities, and other anomalous behavior. To avoid such issues, we recommend that you resolve differences between your Aurora global database clusters for the following:

  • Configure Aurora DB cluster parameter group for the new primary, if necessary – You can configure your Aurora DB cluster parameter groups independently for each Aurora cluster in your Aurora global database. That means that when you promote a secondary DB cluster to take over the primary role, the parameter group from the secondary might be configured differently than for the primary. If so, modify the promoted secondary DB cluster's parameter group to conform to your primary cluster's settings. To learn how, see Modifying parameters for an Aurora global database.

  • Configure monitoring tools and options, such as Amazon CloudWatch Events and alarms – Configure the promoted DB cluster with the same logging ability, alarms, and so on as needed for the global database. As with parameter groups, configuration for these features isn't inherited from the primary during the failover process. For more information about Aurora DB clusters and monitoring, see Overview of monitoring Amazon Aurora.

  • Configure integrations with other Amazon services – If your Aurora global database integrates with Amazon services, such as Amazon Secrets Manager, Amazon Identity and Access Management, Amazon S3, and Amazon Lambda, you need to make sure these are configured as needed. For more information about integrating Aurora global databases with IAM, Amazon S3 and Lambda, see Using Amazon Aurora global databases with other Amazon services. To learn more about Secrets Manager, see How to automate replication of secrets in Amazon Secrets Manager across Amazon Web Services Regions.

When the failover process completes, the promoted Aurora DB cluster can handle write operations for the Aurora global database. Make sure to change the endpoint for your application to use the new endpoint. If you accepted the provided names when you created the Aurora global database, you can change the endpoint by removing the -ro from the promoted cluster's endpoint string in your application.

For example, the secondary cluster's endpoint my-global.cluster-ro-aaaaaabbbbbb.us-west-1.rds.amazonaws.com becomes my-global.cluster-aaaaaabbbbbb.us-west-1.rds.amazonaws.com when that cluster is promoted to primary.

If you are using RDS Proxy, make sure to redirect your application's write operations to the appropriate read/write endpoint of the proxy that's associated with the new primary cluster. This proxy endpoint might be the default endpoint or a custom read/write endpoint. For more information see How RDS Proxy endpoints work with global databases.

You can fail over your Aurora global database using the Amazon Web Services Management Console, the Amazon CLI, or the RDS API.

To start the failover process on your Aurora global database
  1. Sign in to the Amazon Web Services Management Console and open the Amazon RDS console at https://console.amazonaws.cn/rds/.

  2. Choose Databases and find the Aurora global database you want to fail over.

  3. Choose Fail over global database in the Actions menu. The failover process doesn't begin until after you choose the failover target in the next step. At this point, the failover is pending.

    
                 Starting the failover process for an Aurora global database
    1. Choose the secondary Aurora DB cluster that you want to promote to primary. The secondary DB cluster must be available. If you have more than one secondary DB cluster, you can compare the lag amount for all secondaries and choose the one with the smallest amount of lag.

      
                 Choosing the secondary DB cluster to promote
    2. Choose Fail over global database to confirm your choice of secondary DB cluster and begin the failover process.

      Tip

      The failover process can take some time to complete. You can cancel once the process is underway, but it can take some time to return your Aurora global database to its original configuration.

      
                 Secondary cluster selector and confirmation card for the failover process

      The Status column of the Databases list shows the state of each Aurora DB instance and Aurora DB cluster during the failover process.

      
                 Failover in action. Status column shows state of process.

      The status bar at the top of the Console displays progress and provides a Cancel failover option. If you choose Cancel failover, you're given the option to proceed with the failover or to cancel the failover process.

      
                 Cancel failover alert

      Choose Cancel failover if you want to cancel the failover.

    3. Choose Close to continue failing over and dismiss the prompt.

When the failover completes, you can see the Aurora DB clusters and their current state in the Databases list, as shown following.


             Failover in action. Status column shows state of process.

To fail over an Aurora global database

Use the failover-global-cluster CLI command to fail over your Aurora global database. With the command, pass values for the following parameters.

  • --region – Specify the Amazon Web Services Region where the primary DB cluster of the Aurora global database is running.

  • --global-cluster-identifier – Specify the name of your Aurora global database.

  • --target-db-cluster-identifier – Specify the Amazon Resource Name (ARN) of the Aurora DB cluster that you want to promote to be the primary for the Aurora global database.

For Linux, macOS, or Unix:

aws rds --region aws-Region \ failover-global-cluster --global-cluster-identifier global_database_id \ --target-db-cluster-identifier ARN-of-secondary-to-promote

For Windows:

aws rds --region aws-Region ^ failover-global-cluster --global-cluster-identifier global_database_id ^ --target-db-cluster-identifier ARN-of-secondary-to-promote

To fail over an Aurora global database, run the FailoverGlobalCluster API operation.

Managing RPOs for Aurora PostgreSQL–based global databases

With an Aurora PostgreSQL–based global database, you can manage the recovery point objective (RPO) for your Aurora global database by using PostgreSQL's rds.global_db_rpo parameter. RPO represents maximum amount of data that can be lost in the event of an outage.

When you set an RPO for your Aurora PostgreSQL–based global database, Aurora monitors the RPO lag time of all secondary clusters to make sure that at least one secondary cluster stays within the target RPO window. RPO lag time is another time-based metric.

The RPO is used when your database resumes operations in a new Amazon Web Services Region after a failover. Aurora evaluates RPO and RPO lag times to commit (or block) transactions on the primary as follows:

  • Commits the transaction if at least one secondary DB cluster has an RPO lag time less than the RPO.

  • Blocks the transaction if all secondary DB clusters have RPO lag times that are larger than the RPO. It also logs the event to the PostgreSQL log file and emits "wait" events that show the blocked sessions.

In other words, if all secondary clusters are behind the target RPO, Aurora pauses transactions on the primary cluster until at least one of the secondary clusters catches up. Paused transactions are resumed and committed as soon as the lag time of at least one secondary DB cluster becomes less than the RPO. The result is that no transactions can commit until the RPO is met.

If you set this parameter as outlined in the following, you can then also monitor the metrics that it generates. You can do so by using psql or another tool to query the Aurora global database's primary DB cluster and obtain detailed information about your Aurora PostgreSQL–based global database's operations. To learn how, see Monitoring Aurora PostgreSQL-based Aurora global databases.

Setting the recovery point objective

The rds.global_db_rpo parameter controls the RPO setting for a PostgreSQL database. This parameter is supported by Aurora PostgreSQL. Valid values for rds.global_db_rpo range from 20 seconds to 2,147,483,647 seconds (68 years). Choose a realistic value to meet your business need and use case. For example, you might want to allow up to 10 minutes for your RPO, in which case you set the value to 600.

You can set this value for your Aurora PostgreSQL–based global database by using the Amazon Web Services Management Console, the Amazon CLI, or the RDS API.

To set the RPO
  1. Sign in to the Amazon Web Services Management Console and open the Amazon RDS console at https://console.amazonaws.cn/rds/.

  2. Choose the primary cluster of your Aurora global database and open the Configuration tab to find its DB cluster parameter group. For example, the default parameter group for a primary DB cluster running Aurora PostgreSQL 11.7 is default.aurora-postgresql11.

    Parameter groups can't be edited directly. Instead, you do the following:

    • Create a custom DB cluster parameter group using the appropriate default parameter group as the starting point. For example, create a custom DB cluster parameter group based on the default.aurora-postgresql11.

    • On your custom DB parameter group, set the value of the rds.global_db_rpo parameter to meet your use case. Valid values range from 20 seconds up to the maximum integer value of 2,147,483,647 (68 years).

    • Apply the modified DB cluster parameter group to your Aurora DB cluster.

For more information, see Modifying parameters in a DB cluster parameter group.

To set the rds.global_db_rpo parameter, use the modify-db-cluster-parameter-group CLI command. In the command, specify the name of your primary cluster's parameter group and values for RPO parameter.

The following example sets the RPO to 600 seconds (10 minutes) for the primary DB cluster's parameter group named my_custom_global_parameter_group.

For Linux, macOS, or Unix:

aws rds modify-db-cluster-parameter-group \ --db-cluster-parameter-group-name my_custom_global_parameter_group \ --parameters "ParameterName=rds.global_db_rpo,ParameterValue=600,ApplyMethod=immediate"

For Windows:

aws rds modify-db-cluster-parameter-group ^ --db-cluster-parameter-group-name my_custom_global_parameter_group ^ --parameters "ParameterName=rds.global_db_rpo,ParameterValue=600,ApplyMethod=immediate"

To modify the rds.global_db_rpo parameter, use the Amazon RDS ModifyDBClusterParameterGroup API operation.

Viewing the recovery point objective

The recovery point objective (RPO) of a global database is stored in the rds.global_db_rpo parameter for each DB cluster. You can connect to the endpoint for the secondary cluster you want to view and use psql to query the instance for this value.

db-name=>show rds.global_db_rpo;

If this parameter isn't set, the query returns the following:

rds.global_db_rpo ------------------- -1 (1 row)

This next response is from a secondary DB cluster that has 1 minute RPO setting.

rds.global_db_rpo ------------------- 60 (1 row)

You can also use the CLI to get values for find out if rds.global_db_rpo is active on any of the Aurora DB clusters by using the CLI to get values of all user parameters for the cluster.

For Linux, macOS, or Unix:

aws rds describe-db-cluster-parameters \ --db-cluster-parameter-group-name lab-test-apg-global \ --source user

For Windows:

aws rds describe-db-cluster-parameters ^ --db-cluster-parameter-group-name lab-test-apg-global * --source user

The command returns output similar to the following for all user parameters. that aren't default-engine or system DB cluster parameters.

{ "Parameters": [ { "ParameterName": "rds.global_db_rpo", "ParameterValue": "60", "Description": "(s) Recovery point objective threshold, in seconds, that blocks user commits when it is violated.", "Source": "user", "ApplyType": "dynamic", "DataType": "integer", "AllowedValues": "20-2147483647", "IsModifiable": true, "ApplyMethod": "immediate", "SupportedEngineModes": [ "provisioned" ] } ] }

To learn more about viewing parameters of the cluster parameter group, see Viewing parameter values for a DB cluster parameter group.

Disabling the recovery point objective

To disable the RPO, reset the rds.global_db_rpo parameter. You can reset parameters using the Amazon Web Services Management Console, the Amazon CLI, or the RDS API.

To disable the RPO
  1. Sign in to the Amazon Web Services Management Console and open the Amazon RDS console at https://console.amazonaws.cn/rds/.

  2. In the navigation pane, choose Parameter groups.

  3. In the list, choose your primary DB cluster parameter group.

  4. Choose Edit parameters.

  5. Choose the box next to the rds.global_db_rpo parameter.

  6. Choose Reset.

  7. When the screen shows Reset parameters in DB parameter group, choose Reset parameters.

For more information on how to reset a parameter with the console, see Modifying parameters in a DB cluster parameter group.

To reset the rds.global_db_rpo parameter, use the reset-db-cluster-parameter-group command.

For Linux, macOS, or Unix:

aws rds reset-db-cluster-parameter-group \ --db-cluster-parameter-group-name global_db_cluster_parameter_group \ --parameters "ParameterName=rds.global_db_rpo,ApplyMethod=immediate"

For Windows:

aws rds reset-db-cluster-parameter-group ^ --db-cluster-parameter-group-name global_db_cluster_parameter_group ^ --parameters "ParameterName=rds.global_db_rpo,ApplyMethod=immediate"

To reset the rds.global_db_rpo parameter, use the Amazon RDS API ResetDBClusterParameterGroup operation.