Simulating a cluster network failure - SAP HANA on Amazon

Simulating a cluster network failure

Description: Simulate a network failure to test how the cluster behaves in a potential split-brain situation.

Run node: This test can be run on any node. In this test case, it is run on node B.

Run steps:

  • Drop all traffic coming from and going to node A with the following commands:

    iptables -A INPUT -s <<Primary IP address of Node A>> -j DROP;
    iptables -A OUTPUT -d <<Primary IP address of Node A>> -j DROP
    [root@sechana ~] pcs status
    Cluster name: rhelhanaha
    Stack: corosync
    Current DC: prihana (version 1.1.19-8.el7_6.5-c3c624ea3d) - partition with quorum
    Last updated: Fri Jan 22 14:45:24 2021
    Last change: Fri Jan 22 14:45:11 2021 by hacluster via crmd on sechana
    2 nodes configured
    6 resources configured
    Online: [ prihana sechana ]
    Full list of resources:
     clusterfence   (stonith:fence_aws):    Started prihana
     Clone Set: SAPHanaTopology_DRL_00-clone [SAPHanaTopology_DRL_00]
         Started: [ prihana sechana ]
     Master/Slave Set: SAPHana_DRL_00-master [SAPHana_DRL_00]
         Masters: [ prihana ]
         Slaves: [ sechana ]
     hana-oip       (ocf::heartbeat:aws-vpc-move-ip):       Started prihana
    Daemon Status:
      corosync: active/enabled
      pacemaker: active/enabled
      pcsd: active/enabled
    [root@sechana ~] iptables -A INPUT -s xxx.xxx.xxx.xxx -j DROP;
    iptables -A OUTPUT -d xxx.xxx.xxx.xxx -j DROP
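
    Before observing the failover, you can confirm on node B that the two DROP rules are active. This is a minimal sketch using standard iptables options; it must run as root on node B:

    # List the INPUT and OUTPUT chains with numeric addresses and rule
    # numbers, and show only the DROP rules added for the test.
    iptables -L INPUT -n --line-numbers | grep DROP
    iptables -L OUTPUT -n --line-numbers | grep DROP

    Each command should print one rule matching the primary IP address of node A; if not, the simulated network failure is not in effect.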

Expected result:

  • The cluster detects the network failure and fences node A (prihana). The cluster promotes the secondary SAP HANA database on node B (sechana) to take over as primary, without entering a split-brain situation.

    [root@sechana ~] pcs status
    Cluster name: rhelhanaha
    Stack: corosync
    Current DC: sechana (version 1.1.19-8.el7_6.5-c3c624ea3d) - partition with quorum
    Last updated: Fri Jan 22 15:11:43 2021
    Last change: Fri Jan 22 15:10:48 2021 by root via crm_attribute on sechana
    2 nodes configured
    6 resources configured
    Online: [ sechana ]
    OFFLINE: [ prihana ]
    Full list of resources:
     clusterfence   (stonith:fence_aws):    Started sechana
     Clone Set: SAPHanaTopology_DRL_00-clone [SAPHanaTopology_DRL_00]
         Started: [ sechana ]
         Stopped: [ prihana ]
     Master/Slave Set: SAPHana_DRL_00-master [SAPHana_DRL_00]
         Masters: [ sechana ]
         Stopped: [ prihana ]
     hana-oip       (ocf::heartbeat:aws-vpc-move-ip):       Started sechana
    Failed Actions:
    * clusterfence_monitor_60000 on sechana 'unknown error' (1): call=-1,
    status=Timed Out, exitreason='',
        last-rc-change='Fri Jan 22 14:59:14 2021', queued=0ms, exec=0ms
    Daemon Status:
      corosync: active/enabled
      pacemaker: active/enabled
      pcsd: active/enabled
    [root@sechana ~]

Recovery procedure:

  • Clean up the cluster "failed actions".
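
    The recovery can be sketched as follows. This assumes the DROP rules from the test are still present on node B and that node A should rejoin the cluster after being fenced; the site name SITEA and the drladm user are assumptions derived from the example SID DRL and instance number 00, so adjust them to your landscape. Depending on the AUTOMATED_REGISTER setting of the SAPHana resource, re-registration of the old primary may also happen automatically:

    # On node B: remove the DROP rules that simulated the network failure.
    iptables -D INPUT -s <<Primary IP address of Node A>> -j DROP
    iptables -D OUTPUT -d <<Primary IP address of Node A>> -j DROP

    # On node A, after it boots from the fencing action: start the cluster stack.
    pcs cluster start

    # On node A, as the <sid>adm user: re-register the old primary as the new
    # secondary (site name SITEA is an assumption).
    su - drladm -c "hdbnsutil -sr_register --name=SITEA \
      --remoteHost=sechana --remoteInstance=00 \
      --replicationMode=sync --operationMode=logreplay"

    # On either node: clear the failed fencing-monitor action and verify.
    pcs resource cleanup clusterfence
    pcs status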