Understanding the System Metrics for Monitoring (AWS)

Qubole clusters support Datadog monitoring, which you can enable at the QDS account level. For more information on enabling Datadog in Control Panel > Account Settings, see Configuring your Access Settings using IAM Keys or Managing Roles.

The following table lists the system metrics that are published to the Datadog account.

System Metric                Definition
disk_free                    Total free disk space.
disk_total                   Total disk space.
part_max_used                Maximum percentage used on any single disk partition.
load_one                     Load average over 1 minute.
load_five                    Load average over 5 minutes.
load_fifteen                 Load average over 15 minutes.
cpu_user                     Percentage of CPU utilization while executing at the user level.
cpu_system                   Percentage of CPU utilization while executing at the system level.
cpu_wio                      Percentage of CPU time spent waiting on I/O.
cpu_nice                     Percentage of CPU cycles spent on nice processes.
cpu_steal                    Stolen time, which is the time spent in other operating systems when running in a virtualized environment.
cpu_aidle                    Percentage of CPU cycles spent idle since last boot.
cpu_idle                     Percentage of CPU idle time.
cpu_report                   Aggregate report of CPU utilization percentage.
mem_report                   Aggregate report of memory usage in bytes.
load_report                  Aggregate report with current load, number of running processes, nodes, and CPU count.
network_report               Aggregate report with network traffic in and out of the cluster nodes.
cluster-addnodefailure       Node addition metric, used to monitor the autoscaling feature.
cluster-removenodefailure    Node removal metric, used to monitor downscaling/autoscaling events in a cluster.
system-rootdiskfullmaster    Disk space in the master node’s root partition.
system-ephemeral0fullmaster  Disk space in the master node’s ephemeral0 partition.
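
To illustrate how a few of these metrics relate to one another, the following Python sketch derives some commonly watched values from a single snapshot of metric readings. It does not call Datadog or QDS. The NodeSample class, its field values, and the cpu_count input are hypothetical, and the sketch assumes that disk_free and disk_total are reported in the same unit, which the table does not state.

```python
from dataclasses import dataclass


@dataclass
class NodeSample:
    """Hypothetical snapshot of metric readings for one node."""
    disk_free: float   # assumed to use the same unit as disk_total
    disk_total: float
    cpu_idle: float    # percentage, 0-100
    load_one: float    # 1-minute load average
    cpu_count: int     # supplied by the caller; not a metric listed above


def disk_used_percent(sample: NodeSample) -> float:
    """Percentage of total disk space in use."""
    if sample.disk_total <= 0:
        raise ValueError("disk_total must be positive")
    return 100.0 * (sample.disk_total - sample.disk_free) / sample.disk_total


def cpu_busy_percent(sample: NodeSample) -> float:
    """Everything that is not idle time (user, system, wait I/O, nice, steal)."""
    return 100.0 - sample.cpu_idle


def load_per_cpu(sample: NodeSample) -> float:
    """1-minute load average normalized by the number of CPUs."""
    if sample.cpu_count <= 0:
        raise ValueError("cpu_count must be positive")
    return sample.load_one / sample.cpu_count


if __name__ == "__main__":
    # Made-up readings, for illustration only.
    sample = NodeSample(disk_free=25.0, disk_total=100.0,
                        cpu_idle=80.0, load_one=2.0, cpu_count=4)
    print("disk used %:", disk_used_percent(sample))
    print("cpu busy %:", cpu_busy_percent(sample))
    print("load per CPU:", load_per_cpu(sample))
```

A normalized load that stays above 1.0 generally means more runnable processes than CPUs, which is one possible signal to pair with the load_one, load_five, and load_fifteen metrics when setting alert thresholds.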