Active/active Amazon IoT Greengrass V2 component

In this setup, you manage an Amazon IoT Greengrass V2 component with Pacemaker using a custom OCF (Open Cluster Framework) resource agent. This allows Pacemaker to monitor the health of an Amazon IoT Greengrass V2 component and trigger recovery actions when the component enters a broken state.

Important

Complete all steps in Prerequisites and cluster setup before proceeding, except for the DRBD setup steps. This setup does not use DRBD. Install Amazon IoT Greengrass V2 to a local path on each instance instead. Amazon IoT Greengrass V2 must be provisioned and running on all instances. This tutorial assumes Amazon IoT Greengrass V2 is installed at /greengrass/v2. If you chose a different path, update the GG_CLI variable in the OCF script accordingly.
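Before continuing, it can help to confirm on each instance that the CLI and the service are where this tutorial expects them. The following is a minimal sketch, not part of the official procedure: the function names `gg_cli_present` and `gg_service_active` are hypothetical helpers, and the `greengrass` unit name is the one the agent script in this tutorial uses.

```shell
#!/bin/bash
# Pre-flight sketch; the function names are illustrative, not part of Greengrass.

# Return 0 if the Greengrass CLI is executable under the given install root.
gg_cli_present() {
  local root="${1:-/greengrass/v2}"
  [ -x "$root/bin/greengrass-cli" ]
}

# Return 0 if systemd reports the greengrass unit as active.
gg_service_active() {
  systemctl is-active --quiet greengrass
}

# Example usage on a cluster node:
#   gg_cli_present /greengrass/v2 && gg_service_active && echo "ready"
```

If you installed to a different path, pass that path as the first argument instead of `/greengrass/v2`.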

Create a custom OCF resource agent

Create the custom resource agent directory and script on all instances. This example manages a component named PythonWebServer.


```
sudo mkdir -p /usr/lib/ocf/resource.d/custom
```

Create the resource agent script at /usr/lib/ocf/resource.d/custom/gg-webserver with the following content.


```
#!/bin/bash
# OCF resource agent for the Greengrass web server component
. /usr/lib/ocf/lib/heartbeat/ocf-shellfuncs

GG_CLI="/greengrass/v2/bin/greengrass-cli"
COMPONENT="PythonWebServer"
STATE_FILE="/run/gg-webserver.ocf-state"

case "$1" in
  meta-data)
    cat <<EOF
<?xml version="1.0"?>
<resource-agent name="gg-webserver">
  <version>1.0</version>
  <longdesc lang="en">Greengrass webserver component agent</longdesc>
  <shortdesc lang="en">GG Webserver</shortdesc>
  <parameters>
  </parameters>
  <actions>
    <action name="start" timeout="60"/>
    <action name="stop" timeout="10"/>
    <action name="monitor" timeout="5" interval="10"/>
    <action name="meta-data" timeout="5"/>
  </actions>
</resource-agent>
EOF
    exit $OCF_SUCCESS
    ;;
  start)
    # Mark the resource as started, then restart the whole Greengrass service
    touch "$STATE_FILE"
    if systemctl restart greengrass; then
      exit $OCF_SUCCESS
    else
      rm -f "$STATE_FILE"
      exit $OCF_ERR_GENERIC
    fi
    ;;
  stop)
    # Only clear the state file; systemd owns the service lifecycle
    rm -f "$STATE_FILE"
    exit $OCF_SUCCESS
    ;;
  monitor)
    # Check the state file first: if it is absent, the resource is stopped
    [ ! -f "$STATE_FILE" ] && exit $OCF_NOT_RUNNING
    # Check whether the Greengrass service is running
    if ! systemctl is-active --quiet greengrass; then
      exit $OCF_NOT_RUNNING
    fi
    # Extract the component state from the CLI output
    STATE=$("$GG_CLI" component details -n="$COMPONENT" 2>/dev/null | grep '^[[:space:]]*State:' | awk '{print $2}')
    if [[ -z "$STATE" ]]; then
      ocf_log warn "Component $COMPONENT state is empty; component may not be deployed"
      exit $OCF_SUCCESS
    elif [[ "$STATE" == "BROKEN" ]]; then
      exit $OCF_ERR_GENERIC
    else
      exit $OCF_SUCCESS
    fi
    ;;
  *)
    echo "Usage: $0 {start|stop|monitor|meta-data}"
    exit $OCF_ERR_UNIMPLEMENTED
    ;;
esac
```
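To make the monitor decision easier to reason about, the parsing and exit-code mapping can be pulled out into standalone functions and exercised without a Greengrass installation. This is a sketch under assumptions: `parse_component_state` and `state_to_ocf_rc` are hypothetical names, and the sample input is invented, modeled only on the `State:` line that the agent searches for. The numeric values are the standard OCF codes (0 for `OCF_SUCCESS`, 1 for `OCF_ERR_GENERIC`).

```shell
#!/bin/bash
# Sketch: the monitor decision logic as standalone, testable functions.

# Read `component details` output on stdin and print the State value.
parse_component_state() {
  awk '/^[ \t]*State:/ {print $2; exit}'
}

# Map a component state to the return code the monitor action uses.
state_to_ocf_rc() {
  case "$1" in
    BROKEN) echo 1 ;;   # OCF_ERR_GENERIC: Pacemaker starts recovery
    *)      echo 0 ;;   # OCF_SUCCESS: includes the empty (not deployed) case
  esac
}

# Hypothetical sample input, not captured from a real device.
sample=$'Component Name: PythonWebServer\n    State: BROKEN'
state=$(printf '%s\n' "$sample" | parse_component_state)
echo "state=$state rc=$(state_to_ocf_rc "$state")"
```

Note that only `BROKEN` is treated as a failure. Any other state, including an empty one, is reported as healthy, which matches the agent's behavior.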

Make the script executable.


```
sudo chmod +x /usr/lib/ocf/resource.d/custom/gg-webserver
```
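Before handing the agent to Pacemaker, you might want to call its actions by hand and inspect the exit codes. The following sketch assumes the OCF root is `/usr/lib/ocf`, as in the script above. `ocf_rc_name` and `run_agent_action` are hypothetical helpers, and only a few common OCF codes are translated.

```shell
#!/bin/bash
# Translate common OCF exit codes into their symbolic names.
ocf_rc_name() {
  case "$1" in
    0) echo OCF_SUCCESS ;;
    1) echo OCF_ERR_GENERIC ;;
    3) echo OCF_ERR_UNIMPLEMENTED ;;
    7) echo OCF_NOT_RUNNING ;;
    *) echo "UNKNOWN($1)" ;;
  esac
}

# Run one action of an agent and report its exit code by name.
# Arguments: path to the agent, action name (start, stop, monitor, meta-data).
run_agent_action() {
  local agent="$1" action="$2" rc
  OCF_ROOT=/usr/lib/ocf "$agent" "$action" >/dev/null 2>&1
  rc=$?
  echo "$action -> $(ocf_rc_name "$rc")"
  return "$rc"
}

# Example usage (run as root on a cluster node):
#   run_agent_action /usr/lib/ocf/resource.d/custom/gg-webserver monitor
```

On a node where the resource has not been started yet, the monitor action should report `OCF_NOT_RUNNING`, because the state file does not exist.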

Note

The start action restarts the entire Amazon IoT Greengrass V2 service, which restarts all components on the instance, not just PythonWebServer. This is the only practical recovery path because Amazon IoT Greengrass V2 does not support restarting individual components. The stop action intentionally leaves the service running and only removes the agent's state file, because this agent is a monitoring wrapper: the Amazon IoT Greengrass V2 service lifecycle is managed by systemd, not by this agent. If a component remains persistently BROKEN (for example, due to a bad deployment), Pacemaker retries up to migration-threshold times, then bans the resource on that node until failure-timeout expires. You must fix the root cause (for example, redeploy a valid component version) to stop the retry cycle.

Attach the resource

Disable STONITH for this tutorial, then create the Pacemaker resource using the custom OCF agent.


```
sudo pcs property set stonith-enabled=false
```

Warning

STONITH is disabled here to simplify this tutorial. In a production environment, you must enable STONITH and configure a fencing agent (for example, fence_aws for Amazon EC2 instances) to prevent split-brain and data corruption.


```
sudo pcs resource create gg-webserver ocf:custom:gg-webserver \
  op monitor interval=30s \
  op start timeout=60s \
  meta migration-threshold=3 failure-timeout=60s \
  clone
```
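Because migration-threshold and failure-timeout govern how often Pacemaker retries, it is useful to see the clone status and the accumulated fail count for the resource. The following is a minimal sketch: `show_resource_health` is a hypothetical wrapper, and it assumes a pcs version that provides the `resource failcount show` and `resource cleanup` subcommands.

```shell
#!/bin/bash
# Sketch: print clone status and the fail count for one resource.
# Argument: resource name (defaults to gg-webserver).
show_resource_health() {
  local resource="${1:-gg-webserver}"
  if ! command -v pcs >/dev/null 2>&1; then
    echo "pcs not found" >&2
    return 127
  fi
  pcs status resources
  pcs resource failcount show "$resource"
}

# After fixing the root cause of a persistent failure, you can clear the
# failure history instead of waiting for failure-timeout to expire:
#   sudo pcs resource cleanup gg-webserver
```

Run the function as root on any cluster node. A fail count that keeps climbing toward 3 on one node suggests the component is stuck in BROKEN there.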

Verify recovery

Check the initial state. Verify that the Amazon IoT Greengrass V2 component is running and healthy on all instances.
```
sudo pcs status
```
Simulate component failure. Kill the component's process to simulate a transient failure. Amazon IoT Greengrass V2 might attempt internal recovery first. If the component enters a BROKEN state, Pacemaker detects it and triggers a service restart. If Amazon IoT Greengrass V2 recovers the component internally, Pacemaker takes no action.
```
sudo pkill -f "PythonWebServer"

# Wait 30-60 seconds, then check the component state
sudo /greengrass/v2/bin/greengrass-cli component details -n=PythonWebServer
```
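Instead of waiting a fixed 30-60 seconds, you could poll the component state until it reaches the value you expect. The sketch below reuses the CLI path and component name from the agent script. `get_component_state` and `wait_for_state` are hypothetical helpers, and the default timeout and interval are arbitrary choices.

```shell
#!/bin/bash
GG_CLI="/greengrass/v2/bin/greengrass-cli"
COMPONENT="PythonWebServer"

# Print the component's current state (empty if it cannot be read).
get_component_state() {
  "$GG_CLI" component details -n="$COMPONENT" 2>/dev/null |
    awk '/^[ \t]*State:/ {print $2; exit}'
}

# Poll until the component reports the wanted state or the timeout elapses.
# Arguments: wanted state, timeout in seconds, polling interval in seconds.
# Returns 0 if the state was reached, 1 on timeout.
wait_for_state() {
  local want="$1" timeout="${2:-60}" interval="${3:-5}" elapsed=0 state
  while :; do
    state=$(get_component_state)
    echo "t=${elapsed}s state=${state:-unknown}"
    [ "$state" = "$want" ] && return 0
    [ "$elapsed" -ge "$timeout" ] && return 1
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
}

# Example usage (run as root):
#   wait_for_state RUNNING 60 5
```

Watching the printed states also shows whether the component passed through BROKEN, which is the condition that makes Pacemaker restart the service.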
Verify recovery. Pacemaker detects that the component is unhealthy and performs recovery steps as defined in the custom OCF script. No failover is needed — Pacemaker restarts the service on the same instance.

HAProxy continues to operate normally on all instances, and Amazon IoT Greengrass V2 keeps running on the unaffected instances. Because this is an active/active setup, the application on those instances continues to take requests without interruption.
```
sudo pcs status
```
When the recovered instance comes back up, the load balancer identifies it and distributes client requests as needed.
