Understanding the System Metrics for Monitoring (AWS)

Qubole clusters support Datadog monitoring, which you can enable at the QDS account level. For more information on enabling Datadog in Control Panel > Account Settings, see Datadog Settings.

The following table lists the system metrics that are published to the Datadog account.

System Metrics

Metrics Definition

disk_free

Total free disk space

disk_total

Total disk space

part_max_used

Maximum percent used on any single disk partition.

load_one

Load Average over 1 minute

load_five

Load Average over 5 minutes

load_fifteen

Load Average over 15 minutes

cpu_user

Percentage of CPU utilization while executing at the user level.

cpu_system

Percentage of CPU utilization while executing at the system level.

cpu_wio

Percentage of CPU time spent waiting for I/O to complete.

cpu_nice

Percentage of CPU cycles spent on nice processes.

cpu_steal

Stolen time, which is the time spent in other operating systems when running in a virtualized environment.

cpu_aidle

Percentage of CPU cycles spent idle since last boot.

cpu_idle

Percentage of CPU idle time.

cpu_report

Aggregate report of CPU utilization percentage.

mem_report

Aggregate report of memory usage in bytes.

load_report

Aggregate report with current load, number of running processes, nodes, and CPU count.

network_report

Aggregate report with network traffic in and out of the cluster nodes.

cluster-addnodefailure

The node addition metric to monitor the autoscaling feature.

cluster-removenodefailure

The node removal metric to monitor the downscaling/autoscaling events in a cluster.

qubole.cluster_size

The metric displays the minimum size of a cluster.

qubole.max_cluster_size

The metric displays the maximum size of a cluster.

system-rootdiskfullmaster

The metric displays the disk space in the coordinator node’s root partition.

system-ephemeral0fullmaster

The metric displays the disk space in the coordinator node’s ephemeral0 partition.
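
To make the percentage and load definitions above more concrete, the following Python sketch shows one common way such host-level values are derived on a Linux node. It is an illustration only and does not describe how QDS itself collects or publishes these metrics. The function names (cpu_percentages, part_max_used, load_per_cpu), the sample counter values, and the /media/ephemeral0 mount point are assumptions made for this example.

```python
import os
import shutil


def cpu_percentages(prev, curr):
    """Turn two snapshots of cumulative CPU counters into percentages.

    Both arguments map a state name (user, system, ...) to a tick count.
    """
    deltas = {name: curr[name] - prev[name] for name in curr}
    total = sum(deltas.values())
    if total <= 0:
        return {name: 0.0 for name in deltas}
    return {name: 100.0 * delta / total for name, delta in deltas.items()}


def part_max_used(partitions):
    """Return the highest percent-used value across partitions.

    `partitions` maps a mount point to a (used_bytes, total_bytes) pair.
    """
    return max(
        (100.0 * used / total for used, total in partitions.values() if total > 0),
        default=0.0,
    )


def load_per_cpu(load_average, cpu_count):
    """Normalize a load average by the number of CPUs."""
    return load_average / cpu_count


if __name__ == "__main__":
    # Hypothetical counter snapshots taken a few seconds apart.
    prev = {"user": 100, "nice": 0, "system": 50, "idle": 800, "wio": 40, "steal": 10}
    curr = {"user": 160, "nice": 0, "system": 70, "idle": 900, "wio": 50, "steal": 20}
    print(cpu_percentages(prev, curr))

    # Hypothetical partition sizes; the fuller partition determines the result.
    print(part_max_used({"/": (40, 100), "/media/ephemeral0": (90, 120)}))

    # Live values from the local host (os.getloadavg is Unix only).
    root = shutil.disk_usage("/")
    print(part_max_used({"/": (root.used, root.total)}))
    one, five, fifteen = os.getloadavg()
    print(load_per_cpu(one, os.cpu_count() or 1))
```

In this sketch the CPU percentages are computed over an interval, which corresponds to metrics such as cpu_user and cpu_idle. A value measured since boot, as in cpu_aidle, would use the raw cumulative counters instead of the difference between two snapshots. Dividing a load average by the CPU count is a common way to judge whether a load_one value is high for a given node size.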