Understanding the Spark Metrics for Monitoring (AWS)
Spark clusters support Datadog monitoring when Datadog monitoring is enabled at the QDS account level. You can configure Datadog settings for a Spark cluster at the cluster level, as described in Advanced configuration: Modifying Cluster Monitoring Settings.
For more information on enabling Datadog in Control Panel > Account Settings, see Configuring your Access Settings using IAM Keys or Managing Roles.
The following tables list the Spark metrics that are displayed in the Datadog account.
Note
All metrics are aggregated across all the Spark applications running on the cluster.
**JVM Heap Usage**

| Spark Metrics | Metric Definition |
|---|---|
| driver.jvm.heap.used, driver.jvm.heap.max, driver.jvm.heap.committed | The driver JVM's current usage of the heap that is used for object allocation. The heap consists of one or more memory pools; the used and committed sizes are the sums of those values across all heap memory pools. The used value is the amount of memory occupied by both live objects and garbage objects that have not yet been collected. |
| executor_id.jvm.* | Heap usage of each executor's JVM, reported per executor ID. |
| driver.jvm.total.* | Sum of the corresponding metric values from jvm.heap and jvm.non-heap. |
**BlockManager Memory Status**

| Spark Metrics | Metric Definition |
|---|---|
| maxMem_MB | Sum of the maximum memory available across all block managers. |
| memUsed_MB | Sum of the memory used by caching RDDs plus non-RDD memory storage across all block managers. |
| remainingMem_MB | Sum of (maxMem - memUsed) across all block managers. |
| diskSpaceUsed_MB | Sum of the disk space used by RDDs plus non-RDD disk storage across all block managers. |
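The relationship between these BlockManager values can be sketched as follows; the per-block-manager snapshot data below is illustrative only and not produced by any QDS or Spark API:

```python
# Hypothetical per-block-manager memory snapshots, in MB (illustrative data).
block_managers = [
    {"maxMem_MB": 4096, "memUsed_MB": 1024},
    {"maxMem_MB": 4096, "memUsed_MB": 2048},
]

# maxMem_MB and memUsed_MB are sums across all block managers.
maxMem_MB = sum(bm["maxMem_MB"] for bm in block_managers)
memUsed_MB = sum(bm["memUsed_MB"] for bm in block_managers)

# remainingMem_MB is the sum of (maxMem - memUsed) per block manager,
# which equals the difference of the two aggregates above.
remainingMem_MB = maxMem_MB - memUsed_MB

print(maxMem_MB, memUsed_MB, remainingMem_MB)  # -> 8192 3072 5120
```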
**Executor Status**

| Spark Metrics | Metric Definition |
|---|---|
| numExecutorsRunning | Number of executors known to be running. |
| totalExecutorsTarget | Total number of executors targeted by the dynamic allocation strategy, calculated as: needed = max(jobProgressExecutorsNeeded, memoryBasedExecutorsNeeded) + executors already running. |
| executorsPendingToRemove | Number of executors that have been requested to be killed. |
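The totalExecutorsTarget calculation can be sketched as a small function; the function name and argument names are illustrative, not part of any QDS or Spark API:

```python
def total_executors_target(job_progress_needed, memory_based_needed, executors_running):
    """Sketch of the dynamic-allocation target described above:
    needed = max(jobProgressExecutorsNeeded, memoryBasedExecutorsNeeded)
             + executors already running
    """
    return max(job_progress_needed, memory_based_needed) + executors_running

# For example, 8 executors needed by job progress, 5 by memory pressure,
# and 3 already running:
print(total_executors_target(8, 5, 3))  # -> 11
```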
**Jobs**

| Spark Metrics | Metric Definition |
|---|---|
| allJobs | Total number of job IDs submitted for an application. |
| activeJobs | Number of job IDs for which an ActiveJob was created successfully, that is, the job had valid RDDs and partitions to work on and a nonzero number of tasks to run, and its stages (the ResultStage and its parent ShuffleMapStage dependencies) were created successfully. |
**Stages**

| Spark Metrics | Metric Definition |
|---|---|
| failedStages | Stages that must be resubmitted due to fetch failures. |
| runningStages | Stages that are currently running. |
| waitingStages | Stages that have yet to run and are waiting for their parent stages to complete. |
| activeStages | Stages that have been submitted successfully. |
| numCompletedStages | Total number of stages that have completed. |
| pendingStages | Total number of stages waiting to be submitted. |
**Tasks**

| Spark Metrics | Metric Definition |
|---|---|
| numTotalTasks | Total number of tasks across all active stages. |
| numActiveTasks | Number of active tasks from the latest attempt of each active stage. |
| numCompletedTasks | Number of completed tasks from the latest attempt of each active stage. |
| numPendingTasks | Number of pending tasks, calculated as totalTasks - (completedTasks + activeTasks) from the latest attempt of each active stage. |
| numSQLRunning | Number of SQL tasks that are running. |
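The numPendingTasks relationship can be sketched for a single active stage; the function name is illustrative:

```python
def num_pending_tasks(total_tasks, completed_tasks, active_tasks):
    # numPendingTasks = totalTasks - (completedTasks + activeTasks),
    # taken from the latest attempt of each active stage and summed.
    return total_tasks - (completed_tasks + active_tasks)

# A stage with 100 total tasks, 60 completed and 25 active leaves 15 pending.
print(num_pending_tasks(100, 60, 25))  # -> 15
```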