Perform failback to the primary Amazon Region
You can fail back to the primary Amazon Region after the service event in that Region has ended.
If you’re using the Identical topic name replication configuration, follow these steps:
Create a new MSK Replicator with your secondary cluster as the source and your primary cluster as the target, with the starting position set to Earliest and Identical topic name replication selected (Keep the same topics name in the console).
This starts copying all data that was written to the secondary cluster after failover back to the primary Region.
Monitor the MessageLag metric on the new Replicator in Amazon CloudWatch until it reaches 0, which indicates that all data has been replicated from secondary to primary.
After all data has been replicated, stop all producers connecting to the secondary cluster and start producers connecting to the primary cluster.
Wait for the MaxOffsetLag metric for your consumers connecting to the secondary cluster to become 0 to ensure they have processed all the data. See Monitor consumer lags.
Once all data has been processed, stop consumers in the secondary Region and start consumers connecting to the primary cluster to complete the failback.
Delete the Replicator you created in the first step that is replicating data from your secondary cluster to primary.
Verify that your existing Replicator copying data from primary to secondary cluster has status “RUNNING” and ReplicatorThroughput metric in Amazon CloudWatch 0.
Note that when you create a new Replicator with the starting position set to Earliest for failback, it starts reading all data in your secondary cluster’s topics. Depending on your data retention settings, your topics may contain data that came from your source cluster. While MSK Replicator automatically filters out those messages, you still incur data processing and transfer charges for all the data in your secondary cluster. You can track the total data processed by the Replicator using ReplicatorBytesInPerSec. See MSK Replicator metrics.
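The lag checks in the steps above, and the data-volume tracking mentioned in the note, could be scripted roughly as follows. This is a minimal sketch, not an official procedure. The function names are hypothetical, the `AWS/Kafka` namespace and the dimensions you pass in are assumptions to confirm against MSK Replicator metrics, and requiring three consecutive zero datapoints is a conservative choice made for this example. The byte total is only an approximation that assumes each datapoint is an average rate over its period.

```python
from datetime import datetime, timedelta, timezone


def lag_has_drained(datapoints, required_consecutive=3):
    """Return True when the newest datapoints all report a lag of 0.

    `datapoints` uses the shape returned by CloudWatch GetMetricStatistics:
    dicts with a "Timestamp" and, here, a "Maximum" value.
    """
    ordered = sorted(datapoints, key=lambda d: d["Timestamp"])
    recent = ordered[-required_consecutive:]
    return len(recent) == required_consecutive and all(
        d["Maximum"] == 0 for d in recent
    )


def estimate_total_bytes(datapoints, period_seconds):
    """Approximate total bytes from per-period average bytes/sec datapoints."""
    return sum(d["Average"] * period_seconds for d in datapoints)


def fetch_datapoints(cloudwatch, metric_name, dimensions, statistic,
                     minutes=15, period_seconds=60):
    """Fetch recent datapoints using a boto3-style CloudWatch client."""
    end = datetime.now(timezone.utc)
    response = cloudwatch.get_metric_statistics(
        Namespace="AWS/Kafka",  # assumption: verify in the metrics reference
        MetricName=metric_name,  # for example "MessageLag" or "MaxOffsetLag"
        Dimensions=dimensions,  # assumption: supplied by the caller
        StartTime=end - timedelta(minutes=minutes),
        EndTime=end,
        Period=period_seconds,
        Statistics=[statistic],
    )
    return response["Datapoints"]
```

With boto3 installed, `cloudwatch` would be `boto3.client("cloudwatch")`. You might call `fetch_datapoints(cloudwatch, "MessageLag", dimensions, "Maximum")` on a schedule and move on to the producer switch only once `lag_has_drained` returns True.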
If you’re using the Prefixed topic name configuration, follow these steps:
You should initiate failback steps only after replication from the cluster in the secondary Region to the cluster in the primary Region has caught up and the MessageLag metric in Amazon CloudWatch is close to 0. A planned failback should not result in any data loss.
Shut down all producers and consumers connecting to the MSK cluster in the secondary Region.
For active-passive topology, delete the Replicator that is replicating data from cluster in the secondary Region to primary Region. You do not need to delete the Replicator for active-active topology.
Start producers connecting to the MSK cluster in the primary Region.
Depending on your application’s message ordering requirements, follow the steps in one of the following tabs.
Verify that the existing Replicator from the cluster in the primary Region to the cluster in the secondary Region is in the RUNNING state and working as expected, using the ReplicatorThroughput and latency metrics.
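The final verification step above could be partly automated. The following Python sketch polls a Replicator's state through a boto3-style Kafka client until it reports RUNNING. The helper names, retry count, and delay are hypothetical choices for illustration. The sketch checks only the state, so you would still review the throughput and latency metrics in Amazon CloudWatch separately.

```python
import time


def replicator_is_running(kafka_client, replicator_arn):
    """Return True if DescribeReplicator reports the RUNNING state."""
    response = kafka_client.describe_replicator(ReplicatorArn=replicator_arn)
    return response.get("ReplicatorState") == "RUNNING"


def wait_until_running(kafka_client, replicator_arn, attempts=10,
                       delay_seconds=30, sleep=time.sleep):
    """Poll the Replicator state, pausing between attempts.

    Returns True as soon as the Replicator is RUNNING, or False if it
    never reaches that state within `attempts` checks.
    """
    for attempt in range(attempts):
        if replicator_is_running(kafka_client, replicator_arn):
            return True
        if attempt < attempts - 1:
            sleep(delay_seconds)  # injectable so tests need not wait
    return False
```

With boto3 installed, `kafka_client` would be `boto3.client("kafka")` and `replicator_arn` the ARN of your existing primary-to-secondary Replicator.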