Understanding the System Metrics for Monitoring (AWS)

Qubole clusters support Datadog monitoring, which you can enable at the QDS account level. For more information on enabling Datadog in Control Panel > Account Settings, see Datadog Settings.

The following table lists the system metrics that are published to the Datadog account.

System Metric                 Definition
disk_free                     Total free disk space.
disk_total                    Total disk space.
part_max_used                 Maximum percentage used on any single disk partition.
load_one                      Load average over 1 minute.
load_five                     Load average over 5 minutes.
load_fifteen                  Load average over 15 minutes.
cpu_user                      Percentage of CPU utilization while executing at the user level.
cpu_system                    Percentage of CPU utilization while executing at the system level.
cpu_wio                       Percentage of CPU time spent waiting on I/O.
cpu_nice                      Percentage of CPU cycles spent on nice processes.
cpu_steal                     Stolen time, which is the time spent in other operating systems when running in a virtualized environment.
cpu_aidle                     Percentage of CPU cycles spent idle since last boot.
cpu_idle                      Percentage of CPU idle time.
cpu_report                    Aggregate report of CPU utilization percentage.
mem_report                    Aggregate report of memory usage in bytes.
load_report                   Aggregate report with current load, number of running processes, nodes, and CPU count.
network_report                Aggregate report with network traffic in and out of the cluster nodes.
cluster-addnodefailure        Node addition metric for monitoring the autoscaling feature.
cluster-removenodefailure     Node removal metric for monitoring the downscaling/autoscaling events in a cluster.
qubole.cluster_size           Displays the minimum size of a cluster.
qubole.max_cluster_size       Displays the maximum size of a cluster.
system-rootdiskfullmaster     Displays the disk space in the coordinator node’s root partition.
system-ephemeral0fullmaster   Displays the disk space in the coordinator node’s ephemeral0 partition.
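
Several of the metrics above are easier to read once combined. The following Python sketch shows one possible way to derive a disk usage percentage from disk_free and disk_total, and a non-idle CPU percentage from cpu_idle. It is an illustration only and not part of QDS or Datadog: the function names and sample values are hypothetical, it assumes disk_free and disk_total are reported in the same unit, and it assumes the values have already been retrieved from the Datadog account.

```python
from typing import Dict


def disk_used_percent(disk_free: float, disk_total: float) -> float:
    """Percentage of disk space in use.

    Assumes disk_free and disk_total are expressed in the same unit.
    """
    if disk_total <= 0:
        raise ValueError("disk_total must be positive")
    return 100.0 * (disk_total - disk_free) / disk_total


def cpu_non_idle_percent(cpu_idle: float) -> float:
    """Percentage of CPU time that was not idle.

    This is simply 100 minus cpu_idle, so it covers every non-idle
    category together (user, system, nice, wait I/O, steal).
    """
    return 100.0 - cpu_idle


def summarize(sample: Dict[str, float]) -> Dict[str, float]:
    """Build a small summary from one sample of the published metrics."""
    return {
        "disk_used_pct": disk_used_percent(
            sample["disk_free"], sample["disk_total"]
        ),
        "cpu_non_idle_pct": cpu_non_idle_percent(sample["cpu_idle"]),
    }


# Made-up sample values, for illustration only.
sample = {"disk_free": 250.0, "disk_total": 1000.0, "cpu_idle": 80.0}
summary = summarize(sample)
print(summary)  # {'disk_used_pct': 75.0, 'cpu_non_idle_pct': 20.0}
```

A derived value such as disk_used_pct can be compared against part_max_used: a low overall percentage combined with a high part_max_used suggests that a single partition is filling up while the rest of the disk space is free.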