Health Checks for Clusters

This section explains the various health checks configured for the clusters.

Cluster HDFS Disk Utilization

This alert checks the free space allotted to HDFS and sends an alert if the free space is lower than a configurable limit.

Node Disk Utilization

This alert checks the free space allotted to HDFS on each node of the cluster and sends an alert if the free space is lower than a configurable limit.

Simple Hadoop Job Probe

This alert probes a simple end-to-end hadoop job in the cluster to check the overall health of the cluster.

Describing Cluster Health Data from the UI

The Cluster health data is available (on AWS only) for you only when the cluster is up.

The Cluster Health data displays the status of the services and metrics. The services are displayed under the Service Status section in binary values (red and green). The green color indicates that the service is running properly whereas the red color denotes that it is not running in an optimal state. Under the Metrics section, the status of the metrics are displayed in percentage (%). The percentage bar becomes red when the CPU and Disk Usage metrics become 90% or more.

Metrics and Services Available on Clusters

Note

YARN-based metrics are only available when Ganglia is enabled on the cluster.

Metrics/Service	Available On Cluster Type
Binary Metrics (Services)
Hive Metastore	All
Name Node	Hive, Spark
Resource Manager	Hive, Spark
HS2	Hive (HS2 enabled on coordinator)
Zeppelin	Spark, Presto
Presto	Presto
Bar Metrics (Float)
CPU Usage	All (coordinator node’s CPU usage)
Coordinator Disk Usage	All
Spot nodes lost count (Integer)	All
Heap Information (All heap metrics are calculated from jstat command)
Hive Metastore Heap	All
HS2 Heap	Hive (HS2 enabled on Coordinator)
Presto Heap	Presto
Zeppelin Heap	Presto, Spark