Amazon EMR cluster error: Deny-listed nodes
The NodeManager daemon is responsible for launching and managing containers on core and task nodes. Containers are allocated to the NodeManager daemon by the ResourceManager daemon, which runs on the master node. The ResourceManager monitors each NodeManager through a heartbeat.
The ResourceManager daemon deny lists a NodeManager, removing it from the pool of nodes available to process tasks, in two situations:
- The NodeManager has not sent a heartbeat to the ResourceManager daemon in the past 10 minutes (600,000 milliseconds). This interval can be configured using the yarn.nm.liveness-monitor.expiry-interval-ms setting. For more information about changing YARN configuration settings, see Configuring applications in the Amazon EMR Release Guide.
- The NodeManager checks the health of the disks specified by yarn.nodemanager.local-dirs and yarn.nodemanager.log-dirs. The checks include permissions and disk utilization (by default, each disk must be less than 90% full). If a disk fails the check, the NodeManager stops using that particular disk but still reports the node status as healthy. If enough disks fail the check, the node is reported as unhealthy to the ResourceManager and new containers are not assigned to the node.
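As a sketch of how such a change might look, the following EMR configuration classification raises the heartbeat expiry interval from the 10-minute default to 20 minutes. The value 1200000 is purely illustrative; yarn-site is the standard EMR classification for yarn-site.xml properties.

```json
[
  {
    "Classification": "yarn-site",
    "Properties": {
      "yarn.nm.liveness-monitor.expiry-interval-ms": "1200000"
    }
  }
]
```

You can supply this JSON when creating the cluster, for example through the Configurations field in the console or the --configurations option of the AWS CLI.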
The application master can also deny list a NodeManager node if the node has more than three failed tasks. You can raise this limit using the mapreduce.job.maxtaskfailures.per.tracker configuration parameter. Other configuration settings control how many times a task is attempted before it is marked as failed: mapreduce.map.maxattempts for map tasks and mapreduce.reduce.maxattempts for reduce tasks. For more information about changing configuration settings, see Configuring applications in the Amazon EMR Release Guide.
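A sketch of these MapReduce settings as an EMR configuration classification follows; the specific values (5 failures per node, 6 attempts per task) are example choices, not recommendations. mapred-site is the standard EMR classification for mapred-site.xml properties.

```json
[
  {
    "Classification": "mapred-site",
    "Properties": {
      "mapreduce.job.maxtaskfailures.per.tracker": "5",
      "mapreduce.map.maxattempts": "6",
      "mapreduce.reduce.maxattempts": "6"
    }
  }
]
```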