Alert Name: rm_liveness

Alert Condition: avg(last_5m):avg:yarn.resourcemanager.liveness{*} by {host} == 0

Alert Explanation: The alert fires when the liveness metric reported for a host has averaged 0 over the last 5 minutes, which indicates that the ResourceManager on that host has not been live during that window.
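To make the condition concrete, the sketch below averages a window of liveness samples and reports whether the alert would fire. It assumes the metric is reported as 1 when the ResourceManager is live and 0 otherwise, which this runbook does not state. The function name would_fire and the sample values are hypothetical.

```shell
# Hypothetical helper: reads one liveness sample per line on stdin
# (assumed to be 1 = live, 0 = not live) and prints ALERT when the
# window average is exactly 0, OK otherwise, NO_DATA for an empty window.
would_fire() {
    awk '{ sum += $1; n++ }
         END {
             if (n == 0) { print "NO_DATA"; exit }
             if (sum / n == 0) print "ALERT"; else print "OK"
         }'
}

# Every sample in the window is 0, so the average is 0.
printf '0\n0\n0\n0\n0\n' | would_fire
# A single live sample lifts the average above 0.
printf '0\n0\n1\n0\n0\n' | would_fire
```

Under that 0/1 assumption, an average of exactly 0 means every sample in the 5-minute window was 0, so a ResourceManager that was live even briefly would not trigger the alert.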


Step 1

The ResourceManager daemon is monitored through monit. Run sudo monit summary on the coordinator node to see the status of the ResourceManager.

Step 2

If monit reports the status "execution failed" for the ResourceManager, monit has failed to restart the process. Run monit restart resourcemanager to restart it manually.
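Steps 1 and 2 can be combined into a small script, sketched here under a few assumptions. The function names are hypothetical, the status text is matched case-insensitively because the layout of the monit summary output may differ between monit versions, and sudo is used for the restart to match Step 1 even though Step 2 gives the command without it.

```shell
# Hypothetical helper: reads `monit summary` output on stdin and succeeds
# (exit status 0) only if a line naming the ResourceManager also contains
# "execution failed". The match is case-insensitive (assumption about the
# output format, which is not shown in this runbook).
rm_execution_failed() {
    grep -i 'resourcemanager' | grep -qi 'execution failed'
}

# Sketch of Steps 1 and 2 together; intended to run on the coordinator node.
# sudo on the restart is an assumption carried over from Step 1.
check_and_restart_rm() {
    if sudo monit summary | rm_execution_failed; then
        echo "monit reports 'execution failed'; restarting the ResourceManager"
        sudo monit restart resourcemanager
    else
        echo "monit does not report 'execution failed' for the ResourceManager"
    fi
}
```

Calling check_and_restart_rm restarts the process only when monit has already given up on it. Any other status is left for manual inspection.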

Step 3

Check the ResourceManager logs (/media/ephemeral0/logs/yarn/*) for any other errors.
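A quick way to scan those logs is sketched below. It assumes the logs use the usual ERROR and FATAL level markers. The function name rm_log_errors and the limit of 20 lines are arbitrary choices for illustration.

```shell
# Hypothetical helper: prints the last 20 lines containing ERROR or FATAL
# from the log files given as arguments. With more than one file, grep
# prefixes each match with the name of the file it came from.
rm_log_errors() {
    grep -E 'ERROR|FATAL' "$@" | tail -n 20
}

# Usage on the coordinator node (path taken from Step 3):
#   rm_log_errors /media/ephemeral0/logs/yarn/*
```

The output is only a starting point. Stack traces usually follow the matching line, so open the file at that position to read the full error.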