Understanding the YARN and HDFS Metrics for Monitoring (AWS)

Hadoop 2 (Hive) and Spark clusters support the Datadog monitoring service.

You can configure the Datadog monitoring service at the cluster level as described in Advanced configuration: Modifying Cluster Monitoring Settings.

For more information on configuring the Datadog monitoring service at the account level in Control Panel > Account Settings, see Configuring your Access Settings using IAM Keys or Managing Roles.

Qubole also provides a default Datadog dashboard and default alerts for monitoring Hadoop 2 (Hive) clusters. Default Dashboard for YARN and HDFS Metrics describes a sample default dashboard.

You can customize the alert thresholds or create alerts for other metrics. For information on how to create alerts and configure email notifications, see the Datadog Alerts documentation.
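Alerts can also be created programmatically. The following is a minimal sketch, not a Qubole-provided tool, that builds a Datadog "metric alert" monitor for one of the metrics listed in this section and sends it to the Datadog Monitors API (`POST /api/v1/monitor`). The helper names `build_monitor_payload` and `create_monitor`, the 5-minute window, the `{*}` scope, and the example email address are assumptions; replace `{*}` with a tag that identifies your cluster if your metrics carry one, and use the API host for your Datadog site if it is not US1.

```python
import json
import urllib.request

# Monitors endpoint on the Datadog US1 site (assumption: adjust the host for other sites).
DATADOG_MONITOR_URL = "https://api.datadoghq.com/api/v1/monitor"


def build_monitor_payload(metric, threshold, email, window="last_5m"):
    """Build a metric-alert monitor that triggers when `metric` exceeds `threshold`.

    The `{*}` scope matches all sources of the metric; narrow it if needed.
    """
    # Doubled braces in the f-string produce a literal "{*}" in the query.
    query = f"max({window}):max:{metric}{{*}} > {threshold}"
    return {
        "name": f"{metric} above {threshold}",
        "type": "metric alert",
        "query": query,
        # Mentioning @<email address> in the message sends a notification to it.
        "message": f"{metric} is above {threshold}. @{email}",
        # The critical threshold must match the value used in the query.
        "options": {"thresholds": {"critical": threshold}},
    }


def create_monitor(payload, api_key, app_key, url=DATADOG_MONITOR_URL):
    """POST the monitor definition to Datadog and return the parsed JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "DD-API-KEY": api_key,
            "DD-APPLICATION-KEY": app_key,
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Example: alert as soon as HDFS reports any missing block.
    payload = build_monitor_payload("dfs.FSNamesystem.MissingBlocks", 0, "ops@example.com")
    print(json.dumps(payload, indent=2))
    # create_monitor(payload, api_key="<DD_API_KEY>", app_key="<DD_APP_KEY>")
```

Building the payload separately from sending it lets you review or version-control monitor definitions before creating them.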

This section describes:

  • YARN Metrics

  • HDFS Metrics

  • Default Dashboard for YARN and HDFS Metrics

YARN Metrics

This table describes the YARN metrics that are sent to Datadog. Log in to your Datadog account to see these metrics.

Metric                                       Description
yarn.QueueMetrics.AppsCompleted              Number of completed applications.
yarn.QueueMetrics.AppsPending                Number of pending applications.
yarn.QueueMetrics.AppsRunning                Number of running applications.
yarn.QueueMetrics.AppsFailed                 Number of failed applications.
yarn.QueueMetrics.AppsKilled                 Number of killed applications.
yarn.QueueMetrics.ReservedMB                 Reserved memory, in mebibytes (MiB).
yarn.QueueMetrics.AvailableMB                Available memory, in mebibytes (MiB).
yarn.QueueMetrics.AllocatedMB                Allocated memory, in mebibytes (MiB).
yarn.QueueMetrics.ReservedVCores             Number of reserved virtual cores.
yarn.QueueMetrics.AvailableVCores            Number of available virtual cores.
yarn.QueueMetrics.AllocatedVCores            Number of allocated virtual cores.
yarn.NodeManagerMetrics.ContainersFailed     Number of failed containers.
yarn.NodeManagerMetrics.ContainersRunning    Number of running containers.
yarn.NodeManagerMetrics.ContainersKilled     Number of killed containers.
yarn.NodeManagerMetrics.ContainersCompleted  Number of completed containers.
yarn.QueueMetrics.AllocatedContainers        Number of allocated containers.
yarn.QueueMetrics.ReservedContainers         Number of reserved containers.
yarn.ClusterMetrics.NumActiveNMs             Number of active NodeManagers.
yarn.ClusterMetrics.NumDecommissionedNMs     Number of decommissioned NodeManagers.
yarn.ClusterMetrics.NumDecommissioningNMs    Number of NodeManagers being decommissioned.
yarn.ClusterMetrics.NumLostNMs               Number of lost NodeManagers.
yarn.ClusterMetrics.NumRebootedNMs           Number of rebooted NodeManagers.
yarn.ClusterMetrics.NumUnhealthyNMs          Number of unhealthy NodeManagers.
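Several of these metrics are most useful in combination. The sketch below shows one way to derive a memory-utilization percentage and a few warnings from a single snapshot of YARN metric values. It is illustrative only: the function names, the idea of a snapshot dict keyed by metric name, and the warning rules are assumptions, and it assumes that AllocatedMB + AvailableMB approximates total YARN memory (reserved memory is ignored). How you fetch the values (for example, from Datadog) is not shown.

```python
def yarn_memory_utilization(allocated_mb, available_mb):
    """Return the percentage of YARN memory in use.

    Assumes AllocatedMB + AvailableMB approximates total memory.
    """
    total = allocated_mb + available_mb
    if total <= 0:
        return 0.0
    return 100.0 * allocated_mb / total


def yarn_warnings(snapshot):
    """Return warnings for a snapshot dict keyed by YARN metric name."""
    warnings = []
    # Lost or unhealthy NodeManagers reduce the capacity available to applications.
    if snapshot.get("yarn.ClusterMetrics.NumLostNMs", 0) > 0:
        warnings.append("lost NodeManagers")
    if snapshot.get("yarn.ClusterMetrics.NumUnhealthyNMs", 0) > 0:
        warnings.append("unhealthy NodeManagers")
    # Pending applications with no free virtual cores suggest the cluster is saturated.
    if (snapshot.get("yarn.QueueMetrics.AppsPending", 0) > 0
            and snapshot.get("yarn.QueueMetrics.AvailableVCores", 0) == 0):
        warnings.append("applications pending with no available virtual cores")
    return warnings


snapshot = {
    "yarn.QueueMetrics.AllocatedMB": 3072,
    "yarn.QueueMetrics.AvailableMB": 1024,
    "yarn.QueueMetrics.AppsPending": 2,
    "yarn.QueueMetrics.AvailableVCores": 0,
    "yarn.ClusterMetrics.NumLostNMs": 0,
    "yarn.ClusterMetrics.NumUnhealthyNMs": 1,
}
print(yarn_memory_utilization(snapshot["yarn.QueueMetrics.AllocatedMB"],
                              snapshot["yarn.QueueMetrics.AvailableMB"]))  # prints 75.0
print(yarn_warnings(snapshot))
```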

HDFS Metrics

This table describes the HDFS metrics that are sent to Datadog.

Metric                                       Description
dfs.FSNamesystem.CapacityTotal               Total disk capacity, in bytes.
dfs.FSNamesystem.CapacityUsed                Used disk capacity, in bytes.
dfs.FSNamesystem.CapacityRemaining           Remaining disk capacity, in bytes.
dfs.FSNamesystem.CapacityUsedGB              Used disk capacity, in gigabytes.
dfs.FSNamesystem.CapacityTotalGB             Total disk capacity, in gigabytes.
dfs.FSNamesystem.TotalLoad                   Total load on the file system (the current number of DataNode connections).
dfs.FSNamesystem.BlocksTotal                 Total number of blocks.
dfs.FSNamesystem.FilesTotal                  Total number of files and directories.
dfs.FSNamesystem.MissingBlocks               Number of missing blocks.
dfs.FSNamesystem.CorruptBlocks               Number of corrupt blocks.
dfs.FSNamesystem.PendingReplicationBlocks    Number of blocks pending replication.
dfs.FSNamesystem.UnderReplicatedBlocks       Number of under-replicated blocks.
dfs.FSNamesystem.ScheduledReplicationBlocks  Number of blocks scheduled for replication.
dfs.FSNamesystem.PendingDeletionBlocks       Number of blocks pending deletion.
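As with the YARN metrics, the HDFS capacity and block metrics are easier to act on when combined. This hypothetical helper (`hdfs_summary` and the snapshot dict are illustrative assumptions, not part of QDS or Datadog) computes the DFS used percentage from CapacityUsed and CapacityTotal, converts the used bytes to gibibytes (1024^3 bytes), and flags any non-zero missing or corrupt block count.

```python
GIB = 1024 ** 3  # bytes per gibibyte


def hdfs_summary(snapshot):
    """Summarize DFS usage and block problems from a snapshot keyed by metric name."""
    total = snapshot.get("dfs.FSNamesystem.CapacityTotal", 0)
    used = snapshot.get("dfs.FSNamesystem.CapacityUsed", 0)
    used_percent = 100.0 * used / total if total else 0.0
    # Missing or corrupt blocks put data at risk, so flag any non-zero count.
    problems = [
        name
        for name in ("dfs.FSNamesystem.MissingBlocks", "dfs.FSNamesystem.CorruptBlocks")
        if snapshot.get(name, 0) > 0
    ]
    return {
        "used_percent": round(used_percent, 1),
        "used_gib": round(used / GIB, 1),
        "problems": problems,
    }


example = {
    "dfs.FSNamesystem.CapacityTotal": 1000 * GIB,
    "dfs.FSNamesystem.CapacityUsed": 250 * GIB,
    "dfs.FSNamesystem.MissingBlocks": 0,
    "dfs.FSNamesystem.CorruptBlocks": 2,
}
print(hdfs_summary(example))
# prints {'used_percent': 25.0, 'used_gib': 250.0, 'problems': ['dfs.FSNamesystem.CorruptBlocks']}
```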

Default Dashboard for YARN and HDFS Metrics

QDS provides a default dashboard with these metrics:

  • Apps

  • Containers

  • DFS Used Capacity
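If you want a customized copy of such a dashboard, the sketch below builds a Datadog dashboard definition with one timeseries widget for each of the three areas listed above. The grouping of metrics into widgets is an assumption made for illustration; it is not Qubole's actual default dashboard definition. The resulting JSON body can be sent to the Datadog Dashboards API (`POST /api/v1/dashboard`) with your Datadog API and application keys.

```python
import json

# Hypothetical mapping of widget titles to metrics; adjust to your needs.
WIDGETS = {
    "Apps": ["yarn.QueueMetrics.AppsRunning", "yarn.QueueMetrics.AppsPending"],
    "Containers": ["yarn.QueueMetrics.AllocatedContainers",
                   "yarn.QueueMetrics.ReservedContainers"],
    "DFS Used Capacity": ["dfs.FSNamesystem.CapacityUsedGB",
                          "dfs.FSNamesystem.CapacityTotalGB"],
}


def build_dashboard(title, widgets=WIDGETS, scope="*"):
    """Build a Datadog dashboard body with one timeseries widget per metric group."""
    return {
        "title": title,
        "layout_type": "ordered",
        "widgets": [
            {
                "definition": {
                    "type": "timeseries",
                    "title": widget_title,
                    # One line per metric; "{{{scope}}}" renders as "{*}" by default.
                    "requests": [
                        {"q": f"avg:{metric}{{{scope}}}", "display_type": "line"}
                        for metric in metrics
                    ],
                }
            }
            for widget_title, metrics in widgets.items()
        ],
    }


if __name__ == "__main__":
    print(json.dumps(build_dashboard("Custom YARN and HDFS metrics"), indent=2))
```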

Here is a sample default dashboard that contains YARN/HDFS metrics.

[Figure: Sample default dashboard with YARN and HDFS metrics]