Understanding the Spark Metrics for Monitoring (AWS)

Spark clusters support Datadog monitoring when Datadog is enabled at the QDS account level. You can override the Datadog settings for an individual Spark cluster as described in Advanced configuration: Modifying Cluster Monitoring Settings.

For more information on enabling Datadog in Control Panel > Account Settings, see Configuring your Access Settings using IAM Keys or Managing Roles.

The following table lists the Spark metrics that are displayed in the Datadog account.

All metrics are aggregated across all the Spark applications running on the cluster.

JVM Heap Usage

``driver.jvm.heap.used``, ``driver.jvm.heap.max``, ``driver.jvm.heap.committed``
    The driver JVM's current memory usage of the heap that is used for object allocation. The heap consists of one or more memory pools; the used and committed sizes are the sums of those values across all heap memory pools. The used value is the amount of memory occupied by both live objects and garbage objects that have not yet been collected, if any.

``executor_id.jvm.*``
    The executor JVM's heap usage, reported per executor.

``driver.jvm.total.*``
    Sum of the corresponding metric values from ``jvm.heap`` and ``jvm.non-heap``.
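The ``jvm.heap.*`` values above mirror the JVM's own heap accounting. As an illustrative sketch (not QDS or Spark source code), the same three numbers can be read from any JVM via ``MemoryMXBean``, which sums the used and committed sizes across all heap memory pools:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

public class HeapUsageSketch {
    public static void main(String[] args) {
        // MemoryMXBean aggregates usage across all heap memory pools,
        // matching the used/committed semantics described above.
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        long used = heap.getUsed();           // live objects + uncollected garbage
        long committed = heap.getCommitted(); // memory guaranteed available to the JVM
        long max = heap.getMax();             // upper bound; -1 if undefined
        System.out.printf("heap.used=%d heap.committed=%d heap.max=%d%n",
                used, committed, max);
    }
}
```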
BlockManager Memory Status

``maxMem_MB``
    Sum of the maximum memory available across all the block managers.

``memUsed_MB``
    Sum of the memory used for caching RDDs plus non-RDD memory storage across all the block managers. When the ratio ``memUsed_MB / maxMem_MB`` exceeds ``spark.qubole.autoscaling.memorythreshold``, it is used to upscale ``driver.ExecutorAllocationManager.executors.memoryBasedExecutorsNeeded``.

``remainingMem_MB``
    Sum of (``maxMem`` - ``memUsed``) across all the block managers.

``diskSpaceUsed_MB``
    Sum of the disk space used by RDDs plus non-RDD disk storage across all the block managers.
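The memory-based upscaling condition above can be sketched as a small predicate. This is an assumption-labeled illustration of the described threshold check, not QDS autoscaler internals; the method name and signature are hypothetical:

```java
public class MemoryUpscaleSketch {
    // Hypothetical helper: returns true when aggregate BlockManager cache usage
    // crosses the configured threshold, i.e.
    //   memUsed_MB / maxMem_MB > spark.qubole.autoscaling.memorythreshold
    static boolean memoryBasedUpscaleNeeded(double memUsedMb, double maxMemMb,
                                            double memoryThreshold) {
        if (maxMemMb <= 0) {
            return false; // no block manager memory reported yet
        }
        return (memUsedMb / maxMemMb) > memoryThreshold;
    }

    public static void main(String[] args) {
        // 9 GB cached out of 10 GB available with a 0.8 threshold -> upscale
        System.out.println(memoryBasedUpscaleNeeded(9216, 10240, 0.8)); // true
        System.out.println(memoryBasedUpscaleNeeded(4096, 10240, 0.8)); // false
    }
}
```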
Executor Status

``numExecutorsRunning``
    Number of executors currently known to be running.

``totalExecutorsTarget``
    Target total number of executors, calculated from the dynamic allocation strategy as: needed = max(jobProgressExecutorsNeeded, memoryBasedExecutorsNeeded) + executors already running.

``executorsPendingToRemove``
    Number of executors that have been requested to be killed.

``allJobs``
    Total number of job IDs that were submitted for an application.

``activeJobs``
    Number of job IDs for which an ActiveJob was created successfully; that is, the job had valid RDDs and partitions to work on, a nonzero number of tasks to run, and its stages (the ResultStage and its parent ShuffleMapStage dependencies) were created successfully.

``failedStages``
    Stages that should be resubmitted due to fetch failures.

``runningStages``
    Stages that are currently running.

``waitingStages``
    Stages that have yet to run and are waiting for their parent stages to complete.

``activeStages``
    Stages that have been submitted successfully.

``numCompletedStages``
    Total number of stages that have completed.

``pendingStages``
    Total number of stages waiting to be submitted.

``numTotalTasks``
    Sum of tasks across all the active stages.

``numActiveTasks``
    Sum of active tasks from the latest attempt of all active stages.

``numCompletedTasks``
    Sum of completed tasks from the latest attempt of all active stages.

``numPendingTasks``
    Sum of pending tasks, computed as totalTasks - (completedTasks + activeTasks), from the latest attempt of all active stages.

``numSQLRunning``
    Total number of SQL tasks that are running.
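The task and executor accounting identities above can be sketched as plain arithmetic. This is an illustrative sketch of the stated formulas only, not Spark source code; the method names are hypothetical:

```java
public class TaskAccountingSketch {
    // numPendingTasks = totalTasks - (completedTasks + activeTasks)
    static int numPendingTasks(int totalTasks, int completedTasks, int activeTasks) {
        return totalTasks - (completedTasks + activeTasks);
    }

    // totalExecutorsTarget: needed = max(jobProgressExecutorsNeeded,
    //   memoryBasedExecutorsNeeded) + executors already running
    static int totalExecutorsTarget(int jobProgressExecutorsNeeded,
                                    int memoryBasedExecutorsNeeded,
                                    int executorsRunning) {
        return Math.max(jobProgressExecutorsNeeded, memoryBasedExecutorsNeeded)
                + executorsRunning;
    }

    public static void main(String[] args) {
        // 200 total tasks, 120 completed, 30 active -> 50 pending
        System.out.println(numPendingTasks(200, 120, 30));  // 50
        // job progress wants 6, memory pressure wants 4, 10 running -> target 16
        System.out.println(totalExecutorsTarget(6, 4, 10)); // 16
    }
}
```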