nn_liveness
Alert Name: nn_liveness
Alert Condition: The condition that triggers the alert is avg(last_5m):avg:dfs.namenode.liveness{*} by {host} == 0.
Alert Explanation: The alert indicates that the NameNode on a host has not been live, on average, over the last 5 minutes.
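To make the condition concrete, the following sketch averages a window of liveness samples and reports whether the average is exactly 0. It assumes, which the source does not state, that dfs.namenode.liveness is reported as 1 when the NameNode is live and 0 when it is not. The function name alert_fires and the sample values are hypothetical.

```shell
# Hypothetical helper: reads one liveness sample per line on stdin and
# prints ALERT if the window average is exactly 0, otherwise OK.
# Assumption: samples are 1 (live) or 0 (not live).
alert_fires() {
  awk '{ sum += $1; n++ }
       END { if (n > 0 && sum / n == 0) print "ALERT"; else print "OK" }'
}

# Every sample in the window is 0, so the average is 0.
printf '0\n0\n0\n0\n0\n' | alert_fires

# A single live sample lifts the average above 0.
printf '0\n0\n1\n0\n0\n' | alert_fires
```

Under this assumption, the alert fires only when every sample in the 5-minute window is 0; a brief interruption shorter than the window does not trigger it.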
Resolution:
Step 1
The NameNode daemon is monitored through monit. Run sudo monit summary on the coordinator node to see the status of the NameNode.
Step 2
If monit displays the status message execution failed, monit has failed to restart the process. Run monit restart namenode to restart the process manually.
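Steps 1 and 2 can be combined into a small script, sketched below. The function names needs_restart and check_and_restart_namenode are hypothetical, and the sketch assumes only that monit summary prints the service name and the execution failed status on the same line; the exact layout varies between monit versions. Using sudo for the restart is also an assumption, carried over from Step 1.

```shell
# Hypothetical helper: succeeds if the monit summary text on stdin has a line
# that mentions the given service together with "execution failed"
# (case-insensitive).
needs_restart() {
  grep -i -- "$1" | grep -qi 'execution failed'
}

# Sketch of Steps 1 and 2; intended to be run on the coordinator node.
check_and_restart_namenode() {
  if sudo monit summary | needs_restart namenode; then
    # Assumption: sudo is required here, as it is for monit summary.
    sudo monit restart namenode
  else
    echo "namenode is not in the 'execution failed' state; no restart issued"
  fi
}
```

The block only defines the functions; call check_and_restart_namenode explicitly to run the check.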
Step 3
Check the HDFS logs (/media/ephemeral0/logs/hdfs/*) for any other errors.
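One way to scan the logs from Step 3 is sketched below. It assumes the HDFS logs use the usual log4j level names ERROR and FATAL, which may not match every error in these files. The function name scan_hdfs_logs and the default of 20 lines are hypothetical.

```shell
# Hypothetical helper: prints the last N matching lines (default 20) that
# contain ERROR or FATAL, prefixed with the file name, from the files in
# the given log directory (default: the path from Step 3).
scan_hdfs_logs() {
  log_dir="${1:-/media/ephemeral0/logs/hdfs}"
  max_lines="${2:-20}"
  # Unreadable entries and subdirectories are skipped silently.
  grep -HE 'ERROR|FATAL' "$log_dir"/* 2>/dev/null | tail -n "$max_lines"
}
```

Calling scan_hdfs_logs with no arguments reads the default directory; pass a different directory or line count to narrow the search.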