Simulating a cluster network failure

Description — Simulate a network failure to test the cluster behavior in case of a split brain.

Run node — Can be run on any node. In this test case, this is done on node B.

Run steps:

Drop all traffic coming from and going to the subnet of the secondary node using the following command. This stops traffic on both the primary and the secondary ring.

iptables -A INPUT -s <<Subnet_CIDR>> -j DROP; iptables -A OUTPUT -d <<Subnet_CIDR>> -j DROP

sechana:~  crm status
Stack: corosync
Current DC: prihana (version 1.1.18+20180430.b12c320f5-3.24.1-b12c320f5) - partition with quorum
Last updated: Fri Jan 22 02:16:28 2021
Last change: Fri Jan 22 02:16:27 2021 by root via crm_attribute on sechana
2 nodes configured
6 resources configured
Online: [ prihana sechana ]
Full list of resources:
 res_AWS_STONITH        (stonith:external/ec2): Started prihana
 res_AWS_IP     (ocf::suse:aws-vpc-move-ip):    Started sechana
 Clone Set: cln_SAPHanaTopology_HDB_HDB00 [rsc_SAPHanaTopology_HDB_HDB00]
     Started: [ prihana sechana ]
 Master/Slave Set: msl_SAPHana_HDB_HDB00 [rsc_SAPHana_HDB_HDB00]
     Masters: [ prihana ]
     Slaves: [ sechana ]
sechana:~  iptables -A INPUT -s 11.0.1.132 -j DROP; iptables -A OUTPUT -d 11.0.1.132 -j DROP
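
The pair of rules used in the run steps can be sketched as a small helper. This is a minimal sketch, not part of the documented procedure: the function names and the address 192.0.2.10 are placeholders, and the script only prints the commands so they can be reviewed before being run as root. The second function is an assumption about how the rules could be taken back out later; iptables -D deletes a rule that matches the same specification used with -A.

```shell
#!/bin/sh
# Sketch: print the iptables commands that isolate a peer, and the
# matching commands that delete those rules again.
# Nothing is executed here; review the output, then run it as root.
# The peer argument is a placeholder (a CIDR block or a single IP).

isolate_rules() {
    peer="$1"
    echo "iptables -A INPUT -s $peer -j DROP"
    echo "iptables -A OUTPUT -d $peer -j DROP"
}

restore_rules() {
    peer="$1"
    # -D removes the rule whose specification matches the one added with -A
    echo "iptables -D INPUT -s $peer -j DROP"
    echo "iptables -D OUTPUT -d $peer -j DROP"
}

# Placeholder address from the documentation range, not from the test case
isolate_rules "${PEER:-192.0.2.10}"
restore_rules "${PEER:-192.0.2.10}"
```

Printing instead of executing keeps the sketch safe to try on any machine; piping the reviewed output to sh on the test node applies it.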

Expected result:

The cluster detects the network failure and fences node 1. It promotes the secondary SAP HANA database (on node 2) to primary without entering a split-brain situation.

sechana:~  crm status
Stack: corosync
Current DC: prihana (version 1.1.18+20180430.b12c320f5-3.24.1-b12c320f5) - partition with quorum
Last updated: Fri Jan 22 17:08:09 2021
Last change: Fri Jan 22 17:07:46 2021 by root via crm_attribute on sechana

2 nodes configured
6 resources configured

Online: [ prihana sechana ]

Full list of resources:

 res_AWS_STONITH        (stonith:external/ec2): Started prihana
 res_AWS_IP     (ocf::suse:aws-vpc-move-ip):    Started sechana
 Clone Set: cln_SAPHanaTopology_HDB_HDB00 [rsc_SAPHanaTopology_HDB_HDB00]
     rsc_SAPHanaTopology_HDB_HDB00      (ocf::suse:SAPHanaTopology):    Started prihana (Monitoring)
     Started: [ sechana ]
 Master/Slave Set: msl_SAPHana_HDB_HDB00 [rsc_SAPHana_HDB_HDB00]
     Masters: [ sechana ]
     Stopped: [ prihana ]

Failed Actions:
* rsc_SAPHanaTopology_HDB_HDB00_monitor_10000 on prihana 'unknown error' (1): call=317, status=Timed Out, exitreason='',
    last-rc-change='Fri Jan 22 16:58:19 2021', queued=0ms, exec=300001ms
* rsc_SAPHana_HDB_HDB00_start_0 on prihana 'unknown error' (1): call=28, status=Timed Out, exitreason='',
    last-rc-change='Fri Jan 22 02:40:38 2021', queued=0ms, exec=3600001ms
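
To confirm the takeover without reading the whole status output, the Masters line can be extracted from it. The following is a minimal sketch that assumes the status text has the layout shown above; masters_from_status is a hypothetical helper name, not a cluster command.

```shell
#!/bin/sh
# Sketch: read `crm status` text on stdin and print the node name(s)
# listed on the "Masters: [ ... ]" line.
masters_from_status() {
    sed -n 's/^[[:space:]]*Masters: \[ \(.*\) ][[:space:]]*$/\1/p'
}

# Example using a fragment of the status output after the failover
printf '%s\n' \
    ' Master/Slave Set: msl_SAPHana_HDB_HDB00 [rsc_SAPHana_HDB_HDB00]' \
    '     Masters: [ sechana ]' \
    '     Stopped: [ prihana ]' | masters_from_status
# prints: sechana
```

On a cluster node, the same function could be fed live output with crm status | masters_from_status.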

Recovery procedure:

Clean up the cluster "failed actions".
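
One way to clear the failed actions is the crmsh command crm resource cleanup, given a resource and optionally a node. The sketch below only prints the commands for the two resources that appear under "Failed Actions" in the status output. The helper name cleanup_commands is hypothetical, and cleaning up the individual resources rather than the enclosing clone or master/slave set is an assumption.

```shell
#!/bin/sh
# Sketch: print one `crm resource cleanup` command per failed resource.
# Commands are printed, not executed; run them as root on a cluster node.
cleanup_commands() {
    node="$1"
    shift
    for rsc in "$@"; do
        echo "crm resource cleanup $rsc $node"
    done
}

# Resource and node names taken from the Failed Actions list
cleanup_commands prihana rsc_SAPHanaTopology_HDB_HDB00 rsc_SAPHana_HDB_HDB00
```

After the cleanup, crm status should no longer list the failed actions if the underlying problem has been resolved.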
