Simulating a cluster network failure - SAP HANA on Amazon

Simulating a cluster network failure

Description: Simulate a network failure to test the cluster's behavior in a potential split-brain situation.

Run node: This test can be run on any node. In this test case, it is run on node B.

Run steps:

  • Drop all traffic to and from node A with the following command:

    iptables -A INPUT -s <<Primary IP address of Node A>> -j DROP; iptables -A OUTPUT -d <<Primary IP address of Node A>> -j DROP
    [root@sechana ~]# pcs status
    Cluster name: rhelhanaha
    Stack: corosync
    Current DC: prihana (version 1.1.19-8.el7_6.5-c3c624ea3d) - partition with quorum
    Last updated: Fri Jan 22 14:45:24 2021
    Last change: Fri Jan 22 14:45:11 2021 by hacluster via crmd on sechana

    2 nodes configured
    6 resources configured

    Online: [ prihana sechana ]

    Full list of resources:

     clusterfence   (stonith:fence_aws):    Started prihana
     Clone Set: SAPHanaTopology_DRL_00-clone [SAPHanaTopology_DRL_00]
         Started: [ prihana sechana ]
     Master/Slave Set: SAPHana_DRL_00-master [SAPHana_DRL_00]
         Masters: [ prihana ]
         Slaves: [ sechana ]
     hana-oip       (ocf::heartbeat:aws-vpc-move-ip):       Started prihana

    Daemon Status:
      corosync: active/enabled
      pacemaker: active/enabled
      pcsd: active/enabled
    [root@sechana ~]# iptables -A INPUT -s xxx.xxx.xxx.xxx -j DROP; iptables -A OUTPUT -d xxx.xxx.xxx.xxx -j DROP
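
    The two rules can also be applied and checked from one small script. This is a minimal dry-run sketch, not part of the official procedure: NODE_A_IP uses the same xxx.xxx.xxx.xxx placeholder as above (substitute node A's primary IP address), and DRY_RUN=echo only prints the commands; clear it to execute them for real as root on node B.

    ```shell
    NODE_A_IP="xxx.xxx.xxx.xxx"   # placeholder -- substitute node A's primary IP
    DRY_RUN=echo                  # set to "" to actually install the rules

    # Block all traffic to and from node A.
    ${DRY_RUN} iptables -A INPUT -s "${NODE_A_IP}" -j DROP
    ${DRY_RUN} iptables -A OUTPUT -d "${NODE_A_IP}" -j DROP

    # Verify: both DROP rules should appear in the rule listing.
    ${DRY_RUN} iptables -S
    ```

    Keeping both rules in one invocation, as the documented one-liner does, ensures traffic is cut in both directions at effectively the same moment.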

Expected result:

  • The cluster detects the network failure and fences node A. The cluster promotes the secondary SAP HANA database (on node B) to take over as primary without entering a split-brain situation.

    [root@sechana ~]# pcs status
    Cluster name: rhelhanaha
    Stack: corosync
    Current DC: sechana (version 1.1.19-8.el7_6.5-c3c624ea3d) - partition with quorum
    Last updated: Fri Jan 22 15:11:43 2021
    Last change: Fri Jan 22 15:10:48 2021 by root via crm_attribute on sechana

    2 nodes configured
    6 resources configured

    Online: [ sechana ]
    OFFLINE: [ prihana ]

    Full list of resources:

     clusterfence   (stonith:fence_aws):    Started sechana
     Clone Set: SAPHanaTopology_DRL_00-clone [SAPHanaTopology_DRL_00]
         Started: [ sechana ]
         Stopped: [ prihana ]
     Master/Slave Set: SAPHana_DRL_00-master [SAPHana_DRL_00]
         Masters: [ sechana ]
         Stopped: [ prihana ]
     hana-oip       (ocf::heartbeat:aws-vpc-move-ip):       Started sechana

    Failed Actions:
    * clusterfence_monitor_60000 on sechana 'unknown error' (1): call=-1, status=Timed Out, exitreason='',
        last-rc-change='Fri Jan 22 14:59:14 2021', queued=0ms, exec=0ms

    Daemon Status:
      corosync: active/enabled
      pacemaker: active/enabled
      pcsd: active/enabled
    [root@sechana ~]#

Recovery procedure:

  • Clean up the cluster “failed actions”.
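
    The recovery can be sketched as the following dry run. This is an assumption-laden outline, not the official procedure: NODE_A_IP uses the same xxx.xxx.xxx.xxx placeholder as above, and DRY_RUN=echo only prints the commands; clear it to execute them for real as root once node A has rebooted and rejoined the cluster.

    ```shell
    NODE_A_IP="xxx.xxx.xxx.xxx"   # placeholder -- substitute node A's primary IP
    DRY_RUN=echo                  # set to "" to actually run the commands

    # On node B: remove the DROP rules that simulated the network failure.
    ${DRY_RUN} iptables -D INPUT -s "${NODE_A_IP}" -j DROP
    ${DRY_RUN} iptables -D OUTPUT -d "${NODE_A_IP}" -j DROP

    # On either node: clear the failed actions recorded by the cluster.
    ${DRY_RUN} pcs resource cleanup
    ```

    `pcs resource cleanup` clears the failure history, such as the clusterfence monitor error shown in the expected result, so the cluster re-evaluates resource state from scratch.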