Health Checks for Clusters

This section describes the health checks that are configured for clusters.

Cluster HDFS Disk Utilization

This alert checks the free space allotted to HDFS and sends an alert if the free space is lower than a configurable limit.

Node Disk Utilization

This alert checks the free space allotted to HDFS on each node of the cluster and sends an alert if the free space is lower than a configurable limit.
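Both disk utilization checks above reduce to comparing a free-space percentage against a limit, either for the cluster as a whole or for each node. The following is a minimal sketch of that comparison, not the product's implementation. The function names, the `usage` mapping of node name to (free bytes, total bytes), and the `min_free_percent` parameter are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical input shape: node name -> (free_bytes, total_bytes) for HDFS.
Usage = Dict[str, Tuple[int, int]]


def free_percent(free_bytes: int, total_bytes: int) -> float:
    """Return free space as a percentage of total capacity."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return 100.0 * free_bytes / total_bytes


def nodes_below_limit(usage: Usage, min_free_percent: float) -> List[str]:
    """Per-node check: sorted names of nodes whose free space is below the limit."""
    return sorted(
        node
        for node, (free, total) in usage.items()
        if free_percent(free, total) < min_free_percent
    )


def cluster_below_limit(usage: Usage, min_free_percent: float) -> bool:
    """Cluster-wide check: compare aggregate free space against the limit."""
    total_free = sum(free for free, _ in usage.values())
    total_capacity = sum(total for _, total in usage.values())
    return free_percent(total_free, total_capacity) < min_free_percent
```

With this shape, a single nearly full node can trigger the per-node alert while the cluster-wide alert stays quiet, which is why the two checks are configured separately.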

Simple Hadoop Job Probe

This alert runs a simple end-to-end Hadoop job on the cluster to check the overall health of the cluster.
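A job probe of this kind can be sketched as running one small command and treating a non-zero exit status, a timeout, or a failure to launch as an unhealthy result. This is only an illustration under assumptions: `probe_job` and `timeout_seconds` are hypothetical names, and the actual job that the alert submits is not specified here.

```python
import subprocess
import sys
from typing import Sequence


def probe_job(command: Sequence[str], timeout_seconds: float = 600.0) -> bool:
    """Run the probe command; return True only if it exits with status 0 in time."""
    try:
        result = subprocess.run(
            command,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_seconds,
        )
    except (subprocess.TimeoutExpired, OSError):
        # A hung job or a command that cannot be launched counts as a failed probe.
        return False
    return result.returncode == 0


if __name__ == "__main__":
    # Stand-in command so the sketch runs anywhere; on a cluster this would be
    # whatever small Hadoop job the probe is meant to submit.
    healthy = probe_job([sys.executable, "-c", "pass"], timeout_seconds=30.0)
    print("healthy" if healthy else "unhealthy")
```

Bounding the probe with a timeout matters because a job that never finishes is itself a sign of an unhealthy cluster.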

Describing Cluster Health Data from the UI

Cluster health data is available on AWS only, and only while the cluster is up.

The cluster health data shows the status of services and metrics. Services are listed under the Service Status section as binary values, shown in red or green. Green indicates that the service is running properly, whereas red indicates that it is not running in an optimal state.

Metrics are listed under the Metrics section as percentages (%). The percentage bar for the CPU Usage and Disk Usage metrics turns red when the metric reaches 90% or more.

Figure: Cluster health window (../../_images/cluster_health_window.png)
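The display rules described above can be written as two small functions: one maps a service's binary state to red or green, and one decides whether a metric's percentage bar is red. This is a sketch of the stated rules only. The names `service_color`, `is_bar_red`, and `RED_THRESHOLD_PERCENT` are hypothetical and are not taken from the UI code.

```python
# Threshold stated in the text: the bar is red at 90% or more.
RED_THRESHOLD_PERCENT = 90.0


def service_color(is_running_properly: bool) -> str:
    """Binary service status: green when running properly, red otherwise."""
    return "green" if is_running_properly else "red"


def is_bar_red(percent: float) -> bool:
    """Return True when a CPU or disk usage percentage reaches the red threshold."""
    if not 0.0 <= percent <= 100.0:
        raise ValueError("percent must be between 0 and 100")
    return percent >= RED_THRESHOLD_PERCENT
```

Note that the comparison is inclusive, so a reading of exactly 90% already counts as red.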

Metrics and Services Available on Clusters

Note: YARN-based metrics are only available when Ganglia is enabled on the cluster.

Metrics/Service                     Available On Cluster Type
----------------------------------  ----------------------------------
Binary Metrics (Services)
  Hive Metastore                    All
  Name Node                         Hive, Spark
  Resource Manager                  Hive, Spark
  HS2                               Hive (HS2 enabled on coordinator)
  Zeppelin                          Spark, Presto
  Presto                            Presto
Bar Metrics (Float)
  CPU Usage                         All (coordinator node’s CPU usage)
  Coordinator Disk Usage            All
  Spot nodes lost count (Integer)   All
Heap Information (All heap metrics are calculated from jstat command)
  Hive Metastore Heap               All
  HS2 Heap                          Hive (HS2 enabled on coordinator)
  Presto Heap                       Presto
  Zeppelin Heap                     Presto, Spark
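The heap metrics are said to be calculated from the jstat command, but the exact option and formula are not given. As one plausible illustration, the sketch below parses the output of `jstat -gc <pid>` and computes heap usage as used survivor, eden, and old-generation space over their combined capacity. The choice of `-gc`, the formula, and the function names are assumptions, not the documented method.

```python
from typing import Dict


def parse_jstat_gc(output: str) -> Dict[str, float]:
    """Parse the header row and first data row of `jstat -gc` output."""
    lines = [line for line in output.strip().splitlines() if line.strip()]
    if len(lines) < 2:
        raise ValueError("expected a header row and a data row")
    header = lines[0].split()
    values = lines[1].split()
    if len(header) != len(values):
        raise ValueError("header and data row have different column counts")
    stats: Dict[str, float] = {}
    for name, value in zip(header, values):
        try:
            stats[name] = float(value)
        except ValueError:
            # Some JDKs print "-" for columns that do not apply; skip those.
            continue
    return stats


def heap_used_percent(stats: Dict[str, float]) -> float:
    """Assumed formula: (S0U + S1U + EU + OU) / (S0C + S1C + EC + OC) * 100."""
    used = stats["S0U"] + stats["S1U"] + stats["EU"] + stats["OU"]
    capacity = stats["S0C"] + stats["S1C"] + stats["EC"] + stats["OC"]
    if capacity <= 0:
        raise ValueError("heap capacity must be positive")
    return 100.0 * used / capacity


# Made-up sample in the `jstat -gc` column layout (sizes in KB).
SAMPLE = """
 S0C    S1C    S0U    S1U      EC       EU        OC         OU       MC     MU    CCSC   CCSU   YGC     YGCT    FGC    FGCT     GCT
1024.0 1024.0  0.0   512.0  8192.0   4096.0   20480.0    10240.0  4864.0 2500.0 512.0  300.0      5    0.050   1      0.030    0.080
"""

if __name__ == "__main__":
    print(round(heap_used_percent(parse_jstat_gc(SAMPLE)), 2))
```

Metaspace columns (MC, MU) are left out of the sum because they are not part of the Java heap.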