Understanding the YARN and HDFS Metrics for Monitoring (AWS)

Hadoop 2 (Hive) and Spark clusters support the Datadog monitoring service.

You can configure the Datadog monitoring service at the cluster level as described in Advanced configuration: Modifying Cluster Monitoring Settings.

For more information on configuring the Datadog monitoring service at the account level in Control Panel > Account Settings, see Configuring your Access Settings using IAM Keys or Managing Roles.

Qubole also provides a default dashboard and alerts on Datadog for monitoring Hadoop 2 (Hive) clusters. Default Dashboard for YARN and HDFS Metrics describes a sample default dashboard.

To customize the threshold values of the alerts, or to receive alerts on other metrics, create or edit the alerts in Datadog. For information on how to create alerts and configure email notifications, see the Datadog Alerts description.
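As a rough illustration of what a custom threshold alert could look like, the following sketch builds a Datadog-style metric-alert definition as a plain Python dictionary. The threshold of 50, the `{*}` scope, the 5-minute window, and the `ops@example.com` recipient are all hypothetical values chosen for the example, and the helper name `build_threshold_monitor` is not part of any library. Nothing is sent to Datadog here; submitting such a definition (for example, through the Datadog monitor API) and confirming the exact metric name in your account are left to you.

```python
import json


def build_threshold_monitor(metric, threshold, notify_email, window="last_5m"):
    """Return a Datadog-style metric-alert definition as a plain dict.

    All argument values are supplied by the caller; nothing is sent anywhere.
    """
    # Query form: avg(<window>):avg:<metric>{<scope>} > <threshold>
    # The {*} scope (all hosts/tags) is an assumption for this example.
    query = f"avg({window}):avg:{metric}{{*}} > {threshold}"
    return {
        "name": f"{metric} above {threshold}",
        "type": "metric alert",
        "query": query,
        # An @-mention of an email address in the message requests an email notification.
        "message": f"{metric} exceeded {threshold}. @{notify_email}",
        "options": {"thresholds": {"critical": threshold}},
    }


# Hypothetical example: alert when more than 50 applications are pending.
monitor = build_threshold_monitor(
    "yarn.QueueMetrics.AppsPending", 50, "ops@example.com"
)
print(json.dumps(monitor, indent=2))
```

Keeping the definition as data makes it easy to review the query string and threshold before creating the alert in the Datadog UI or through its API.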

This section describes:

  • YARN Metrics
  • HDFS Metrics
  • Default Dashboard for YARN and HDFS Metrics

YARN Metrics

The following table describes the YARN metrics that are sent to Datadog. Log in to your Datadog account to see these metrics.

Metric Description
yarn.QueueMetrics.AppsCompleted It denotes the number of completed applications.
yarn.QueueMetrics.AppsPending It denotes the number of pending applications.
yarn.QueueMetrics.AppsRunning It denotes the number of running applications.
yarn.QueueMetrics.AppsFailed It denotes the number of failed applications.
yarn.QueueMetrics.AppsKilled It denotes the number of killed applications.
yarn.QueueMetrics.ReservedMB It denotes the size of the reserved memory in Mebibytes.
yarn.QueueMetrics.AvailableMB It denotes the size of the available memory in Mebibytes.
yarn.QueueMetrics.AllocatedMB It denotes the size of the allocated memory in Mebibytes.
yarn.QueueMetrics.ReservedVCores It denotes the number of reserved virtual cores.
yarn.QueueMetrics.AvailableVCores It denotes the number of available virtual cores.
yarn.QueueMetrics.AllocatedVCores It denotes the number of allocated virtual cores.
yarn.NodeManagerMetrics.ContainersFailed It denotes the number of containers that have failed.
yarn.NodeManagerMetrics.ContainersRunning It denotes the number of running containers.
yarn.NodeManagerMetrics.ContainersKilled It denotes the number of containers that have been killed.
yarn.NodeManagerMetrics.ContainersCompleted It denotes the number of containers that have completed.
yarn.QueueMetrics.AllocatedContainers It denotes the number of allocated containers.
yarn.QueueMetrics.ReservedContainers It denotes the number of reserved containers.
yarn.ClusterMetrics.NumActiveNMs It denotes the number of active NodeManagers.
yarn.ClusterMetrics.NumDecommissionedNM It denotes the number of decommissioned NodeManagers.
yarn.ClusterMetrics.NumDecommissioningNMs It denotes the number of decommissioning NodeManagers.
yarn.ClusterMetrics.NumLostNMs It denotes the number of NodeManagers that are lost.
yarn.ClusterMetrics.NumRebootedNMs It denotes the number of rebooted NodeManagers.
yarn.ClusterMetrics.NumUnhealthyNMs It denotes the number of unhealthy NodeManagers.
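
The metrics in this table can be combined into derived values. The following is a minimal sketch, assuming you have already fetched one sample of the metrics into a dictionary keyed by metric name. The function name `yarn_summary` and the sample values are hypothetical, and treating allocated plus available memory as the total is an approximation made for this example, not a documented formula.

```python
def yarn_summary(sample):
    """Summarize one sample of YARN metrics (a dict keyed by metric name)."""
    allocated_mb = sample["yarn.QueueMetrics.AllocatedMB"]
    available_mb = sample["yarn.QueueMetrics.AvailableMB"]

    # Assumption: allocated + available approximates the total queue memory.
    total_mb = allocated_mb + available_mb
    memory_used_pct = 100.0 * allocated_mb / total_mb if total_mb else 0.0

    # Collect NodeManager counters that indicate trouble (non-zero values only).
    problem_nodes = {
        name: sample.get(name, 0)
        for name in (
            "yarn.ClusterMetrics.NumLostNMs",
            "yarn.ClusterMetrics.NumUnhealthyNMs",
        )
        if sample.get(name, 0) > 0
    }
    return {
        "memory_used_pct": round(memory_used_pct, 1),
        "problem_nodes": problem_nodes,
    }


# Hypothetical sample values, for illustration only.
sample = {
    "yarn.QueueMetrics.AllocatedMB": 6144,
    "yarn.QueueMetrics.AvailableMB": 2048,
    "yarn.ClusterMetrics.NumLostNMs": 0,
    "yarn.ClusterMetrics.NumUnhealthyNMs": 1,
}
summary = yarn_summary(sample)
print(summary)
```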

HDFS Metrics

Metric Description
dfs.FSNamesystem.CapacityTotal It denotes the total disk capacity in bytes.
dfs.FSNamesystem.CapacityUsed It denotes the disk usage in bytes.
dfs.FSNamesystem.CapacityRemaining It denotes the remaining disk space in bytes.
dfs.FSNamesystem.CapacityUsedGB It denotes the disk usage in Gigabytes.
dfs.FSNamesystem.CapacityTotalGB It denotes the total disk capacity in Gigabytes.
dfs.FSNamesystem.TotalLoad It denotes the total load on the file system.
dfs.FSNamesystem.BlocksTotal It denotes the total number of blocks.
dfs.FSNamesystem.FilesTotal It denotes the total number of files.
dfs.FSNamesystem.MissingBlocks It denotes the number of missing blocks.
dfs.FSNamesystem.CorruptBlocks It denotes the number of corrupt blocks.
dfs.FSNamesystem.PendingReplicationBlocks It denotes the number of blocks pending replication.
dfs.FSNamesystem.UnderReplicatedBlocks It denotes the number of under-replicated blocks.
dfs.FSNamesystem.ScheduledReplicationBlocks It denotes the number of blocks scheduled for replication.
dfs.FSNamesystem.PendingDeletionBlocks It denotes the number of blocks pending deletion.
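
As with the YARN metrics, a few of these values are most useful in combination. The sketch below, which assumes a sample of the metrics is already available as a dictionary, computes the percentage of DFS capacity in use and flags missing or corrupt blocks. The function name `hdfs_health`, the 80% limit, and the sample values are hypothetical choices for illustration.

```python
def hdfs_health(sample, used_pct_limit=80.0):
    """Return (used_pct, warnings) for one sample of HDFS metrics."""
    total = sample["dfs.FSNamesystem.CapacityTotal"]
    used = sample["dfs.FSNamesystem.CapacityUsed"]

    # Both values are in bytes, so the ratio is unit-free.
    used_pct = round(100.0 * used / total, 1) if total else 0.0

    warnings = []
    if used_pct > used_pct_limit:
        warnings.append(f"DFS used {used_pct}% exceeds {used_pct_limit}%")

    # Any missing or corrupt block is worth a look.
    for name in ("dfs.FSNamesystem.MissingBlocks", "dfs.FSNamesystem.CorruptBlocks"):
        count = sample.get(name, 0)
        if count > 0:
            warnings.append(f"{name} = {count}")
    return used_pct, warnings


# Hypothetical sample values, for illustration only.
sample = {
    "dfs.FSNamesystem.CapacityTotal": 1_000_000_000_000,
    "dfs.FSNamesystem.CapacityUsed": 850_000_000_000,
    "dfs.FSNamesystem.MissingBlocks": 0,
    "dfs.FSNamesystem.CorruptBlocks": 2,
}
used_pct, warnings = hdfs_health(sample)
print(used_pct, warnings)
```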

Default Dashboard for YARN and HDFS Metrics

QDS provides a default dashboard with these metrics:

  • Apps
  • Containers
  • DFS Used Capacity

Here is a sample default dashboard that contains YARN/HDFS metrics.

[Figure: sample default dashboard with YARN and HDFS metrics (HadoopDashB.png)]