Understanding the Spark Metrics for Monitoring (AWS)

Spark clusters support Datadog monitoring when Datadog monitoring is enabled at the QDS account level. You can configure Datadog settings at the cluster level for a Spark cluster as described in Advanced configuration: Modifying Cluster Monitoring Settings.

For more information on enabling Datadog in Control Panel > Account Settings, see Configuring your Access Settings using IAM Keys or Managing Roles.

The following table lists the different Spark metrics that are displayed in the Datadog account.

Note

All metrics are aggregated across all the Spark applications that are running on the cluster.

Spark Metrics


JVM Heap Usage

driver.jvm.heap.used, driver.jvm.heap.max, driver.jvm.heap.committed

Driver JVM's current memory usage of the heap that is used for object allocation. The heap consists of one or more memory pools; the used and committed sizes are the sums of those values across all heap memory pools. The used value is the amount of memory occupied by both live objects and garbage objects that have not yet been collected, if any.

executor_id.jvm.*

Executor JVM's heap usage.

driver.jvm.total.*

Sum of the corresponding metric values from jvm.heap and jvm.non-heap.

BlockManager Memory Status

maxMem_MB

Sum of maximum memory available for all the block managers.

memUsed_MB

Sum of memory used by caching RDDs plus non-RDD memory storage across all the block managers. When memUsed_MB / maxMem_MB exceeds ``spark.qubole.autoscaling.memorythreshold``, ``driver.ExecutorAllocationManager.executors.memoryBasedExecutorsNeeded`` is increased to upscale the cluster.

remainingMem_MB

Sum of (maxMem - memUsed) across all the block managers.

diskSpaceUsed_MB

Sum of disk space used by RDDs plus non-RDD disk storage across all the block managers.
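The BlockManager aggregation above can be sketched as follows. This is an illustrative reconstruction, not the actual Spark implementation; the function and parameter names are hypothetical.

```python
def block_manager_totals(block_managers, memory_threshold):
    """Aggregate per-block-manager memory figures, as the metrics above describe.

    block_managers: list of (max_mem_mb, mem_used_mb) pairs, one per block manager.
    memory_threshold: assumed value of spark.qubole.autoscaling.memorythreshold.
    """
    max_mem_mb = sum(bm[0] for bm in block_managers)        # maxMem_MB
    mem_used_mb = sum(bm[1] for bm in block_managers)       # memUsed_MB
    remaining_mem_mb = max_mem_mb - mem_used_mb             # remainingMem_MB
    # memUsed_MB / maxMem_MB above the threshold triggers memory-based upscaling.
    upscale = max_mem_mb > 0 and (mem_used_mb / max_mem_mb) > memory_threshold
    return max_mem_mb, mem_used_mb, remaining_mem_mb, upscale
```

For example, two block managers with (1000, 900) and (500, 200) MB yield totals of 1500 MB available and 1100 MB used, leaving 400 MB; at a threshold of 0.7 the usage ratio (about 0.73) would trigger upscaling.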

Executor Status

numExecutorsRunning

Total number of known executors that are running.

totalExecutorsTarget

Target number of executors, calculated based on the dynamic allocation strategy: needed = max(jobProgressExecutorsNeeded, memoryBasedExecutorsNeeded) + executors already running.

executorsPendingToRemove

Number of executors that are requested to be killed.
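The totalExecutorsTarget calculation above can be expressed as a short sketch. The helper and parameter names here are hypothetical, not Spark's internal API.

```python
def total_executors_target(job_progress_needed: int,
                           memory_based_needed: int,
                           executors_running: int) -> int:
    # needed = max(jobProgressExecutorsNeeded, memoryBasedExecutorsNeeded)
    #          + executors already running
    return max(job_progress_needed, memory_based_needed) + executors_running
```

For example, if job progress calls for 4 executors, memory pressure calls for 2, and 3 are already running, the target is max(4, 2) + 3 = 7.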

Jobs

allJobs

Total number of job IDs that were submitted for an application.

activeJobs

Total number of job IDs for which an ActiveJob was created successfully, that is, the job had valid RDDs and partitions to work on and a non-zero number of tasks to run, and its stages (the ResultStage and its parent ShuffleMapStage dependencies) were created successfully.

Stages

failedStages

Stages that should be resubmitted due to fetch failures.

runningStages

Stages that are running.

waitingStages

Stages that are yet to run and waiting for the parent stages to complete.

activeStages

Stages that are successfully submitted.

numCompletedStages

Total number of stages that have completed.

pendingStages

Total number of stages waiting to be submitted.

Tasks

numTotalTasks

Sum of tasks across all the active stages.

numActiveTasks

Sum of active tasks from latest attempt of all active stages.

numCompletedTasks

Sum of completed tasks from latest attempt of all active stages.

numPendingTasks

Sum of pending tasks, where pendingTasks = totalTasks - (completedTasks + activeTasks), from the latest attempt of all active stages.

numSQLRunning

Number of SQL queries that are currently running.
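The task-count relationship above can be verified with a small worked sketch; the function name and values are illustrative only.

```python
def num_pending_tasks(total_tasks: int, completed_tasks: int, active_tasks: int) -> int:
    """numPendingTasks = totalTasks - (completedTasks + activeTasks),
    computed from the latest attempt of all active stages."""
    return total_tasks - (completed_tasks + active_tasks)
```

For example, an application with 100 total tasks across its active stages, of which 60 have completed and 10 are active, reports 30 pending tasks.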