nn_liveness

Alert Name: nn_liveness

Alert Condition: The alert triggers when avg(last_5m):avg:dfs.namenode.liveness{*} by {host} == 0.

Alert Explanation: The alert indicates that the NameNode has not been live over the last 5 minutes: for at least one host, the dfs.namenode.liveness metric averaged over that 5-minute window is 0.
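The condition can be read as a small per-host computation. The sketch below is a simplified model, not the monitoring system's actual evaluator. It assumes the liveness metric reports 1 while the NameNode is live and 0 otherwise, and that the input samples are already limited to the last 5 minutes. The function name hosts_triggering_nn_liveness is hypothetical.

```python
from collections import defaultdict
from statistics import mean


def hosts_triggering_nn_liveness(samples):
    """Return the hosts whose average liveness over the window equals 0.

    samples: iterable of (host, value) pairs for dfs.namenode.liveness,
    assumed to be pre-filtered to the last 5 minutes (the avg(last_5m) part).
    Simplification: all series for a host are pooled into one list.
    """
    by_host = defaultdict(list)
    for host, value in samples:
        # "by {host}": evaluate each host separately.
        by_host[host].append(value)
    # "== 0": with non-negative 0/1 values, an average of 0 means
    # every sample in the window reported the NameNode as not live.
    return sorted(host for host, values in by_host.items() if mean(values) == 0)
```

For example, a host that reported 1 then 0 averages 0.5 and does not trigger the alert; only a host that reported 0 for the entire window does.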

Resolution:

Step 1

The NameNode daemon is monitored through monit. Run sudo monit summary on the coordinator node to see the status of the NameNode.
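If you want to script this check, the sketch below runs the same command and picks out the NameNode's line from the output. It assumes the monit service is named namenode (the name used in Step 2) and only looks for that name on a line, because the layout of monit summary output differs between monit versions. Both function names are hypothetical.

```python
import re
import subprocess


def fetch_monit_summary():
    """Run `sudo monit summary` (Step 1) and return its stdout.

    Must be run on the coordinator node.
    """
    result = subprocess.run(
        ["sudo", "monit", "summary"], capture_output=True, text=True, check=True
    )
    return result.stdout


def namenode_summary_line(summary_text, service="namenode"):
    """Return the first summary line naming the service, or None.

    Word boundaries keep e.g. 'secondarynamenode' from matching 'namenode'.
    """
    pattern = re.compile(rf"\b{re.escape(service)}\b", re.IGNORECASE)
    for line in summary_text.splitlines():
        if pattern.search(line):
            return line.strip()
    return None
```

Usage on the coordinator node: print(namenode_summary_line(fetch_monit_summary())).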

Step 2

If monit reports the status Execution failed for the NameNode, monit has failed to restart the process on its own. Restart it manually by running monit restart namenode.
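The decision in this step can be sketched as below. It is a minimal illustration, assuming the status text comes from the NameNode's line in monit summary output; the function names and the dry_run switch are hypothetical. Depending on how monit is configured, the restart command may need sudo, as in Step 1.

```python
import subprocess


def needs_manual_restart(status_line):
    """True if monit reports 'Execution failed' for the service."""
    return status_line is not None and "execution failed" in status_line.lower()


def restart_namenode(dry_run=True):
    """Restart the NameNode through monit (the command from Step 2).

    With dry_run=True the command is only returned, not executed.
    Prefix with "sudo" if your monit setup requires root (assumption).
    """
    cmd = ["monit", "restart", "namenode"]
    if not dry_run:
        subprocess.run(cmd, check=True)
    return cmd
```

Keeping dry_run on by default makes it harder to restart the NameNode by accident while testing the script.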

Step 3

Check the HDFS logs (/media/ephemeral0/logs/hdfs/*) for any other errors that explain why the NameNode is down.
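A quick way to surface candidate errors is to scan the log files for error lines. The sketch below assumes the logs are plain text in Hadoop's usual log4j layout, where the level (ERROR, FATAL) appears as its own word and stack traces contain "Exception". Compressed rotated logs are not decompressed. The function name find_hdfs_errors and the max_hits limit are hypothetical.

```python
import glob
import os
import re

# Assumption: log4j-style lines with the level as a separate word.
ERROR_PATTERN = re.compile(r"\b(ERROR|FATAL)\b|Exception")


def find_hdfs_errors(log_dir="/media/ephemeral0/logs/hdfs", max_hits=50):
    """Return (path, line_number, line) tuples for suspicious log lines.

    Mirrors the /media/ephemeral0/logs/hdfs/* pattern: only files directly
    inside log_dir are read; subdirectories are skipped.
    """
    hits = []
    for path in sorted(glob.glob(os.path.join(log_dir, "*"))):
        if not os.path.isfile(path):
            continue
        with open(path, errors="replace") as f:
            for lineno, line in enumerate(f, start=1):
                if ERROR_PATTERN.search(line):
                    hits.append((path, lineno, line.rstrip()))
                    if len(hits) >= max_hits:
                        return hits
    return hits
```

Start with the most recent NameNode log, since the first ERROR or FATAL line near the time the alert fired usually points at the cause.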