Understanding the System Metrics for Monitoring (AWS)

Qubole clusters support Datadog monitoring, which you can enable at the QDS account level. For more information on enabling Datadog in Control Panel > Account Settings, see Datadog Settings.

The following table lists the system metrics that are published to the Datadog account.

System Metrics	Metrics Definition
disk_free	Total free disk space.
disk_total	Total disk space.
part_max_used	Maximum percent used on any single disk partition.
load_one	Load average over 1 minute.
load_five	Load average over 5 minutes.
load_fifteen	Load average over 15 minutes.
cpu_user	Percentage of CPU utilization while executing at the user level.
cpu_system	Percentage of CPU utilization while executing at the system level.
cpu_wio	Percentage of CPU time spent waiting for I/O to complete.
cpu_nice	Percentage of CPU cycles spent on nice processes.
cpu_steal	Stolen time, which is the time spent in other operating systems when running in a virtualized environment.
cpu_aidle	Percentage of CPU cycles spent idle since the last boot.
cpu_idle	Percentage of CPU idle time.
cpu_report	Aggregate report of CPU utilization percentage.
mem_report	Aggregate report of memory usage in bytes.
load_report	Aggregate report with current load, number of running processes, nodes, and CPU count.
network_report	Aggregate report with network traffic in and out of the cluster nodes.
cluster-addnodefailure	The node addition metric to monitor the autoscaling feature.
cluster-removenodefailure	The node removal metric to monitor the downscaling/autoscaling events in a cluster.
qubole.cluster_size	The metric displays the minimum size of a cluster.
qubole.max_cluster_size	The metric displays the maximum size of a cluster.
system-rootdiskfullmaster	The metric displays the disk space in the coordinator node’s root partition.
system-ephemeral0fullmaster	The metric displays the disk space in the coordinator node’s ephemeral0 partition.
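
Some useful values are not published directly but can be derived from the metrics in the table. The following is a minimal sketch, not part of QDS or Datadog, that shows two such derivations: the percentage of disk space in use (from disk_free and disk_total) and the load average per CPU (from load_one, load_five, or load_fifteen and a CPU count). The function names are hypothetical, and the sketch assumes that disk_free and disk_total are reported in the same unit and that you have already retrieved the metric values from your Datadog account.

```python
def disk_used_percent(disk_free: float, disk_total: float) -> float:
    """Return the percentage of disk space in use.

    Assumption: disk_free and disk_total are in the same unit,
    so the ratio does not depend on what that unit is.
    """
    if disk_total <= 0:
        raise ValueError("disk_total must be positive")
    if not 0 <= disk_free <= disk_total:
        raise ValueError("disk_free must be between 0 and disk_total")
    return 100.0 * (disk_total - disk_free) / disk_total


def load_per_cpu(load_average: float, cpu_count: int) -> float:
    """Return a load average (for example, load_one) divided by the CPU count.

    A value near or above 1.0 suggests that the CPUs are saturated.
    """
    if cpu_count <= 0:
        raise ValueError("cpu_count must be positive")
    return load_average / cpu_count


if __name__ == "__main__":
    # Illustrative values only; these are not real cluster readings.
    print(disk_used_percent(disk_free=250.0, disk_total=1000.0))  # 75.0
    print(load_per_cpu(load_average=6.0, cpu_count=8))            # 0.75
```

Dividing the load average by the CPU count makes the value comparable across nodes with different instance types, which is why it is often a better alerting signal than the raw load average.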