Understanding the Spark Metrics for Monitoring (AWS)
Spark clusters support Datadog monitoring when Datadog monitoring is enabled at the QDS account level. You can configure Datadog settings for a Spark cluster at the cluster level, as described in Advanced configuration: Modifying Cluster Monitoring Settings.
For more information on enabling Datadog in Control Panel > Account Settings, see Configuring your Access Settings using IAM Keys or Managing Roles.
The following tables list the Spark metrics that are displayed in the Datadog account.
Note
All metrics are aggregated across all the Spark applications running on the cluster.
**JVM Heap Usage**

| Spark Metrics | Metric Definition |
|---|---|
| driver.jvm.heap.used, driver.jvm.heap.max, driver.jvm.heap.committed | The driver JVM's current usage of the heap that is used for object allocation. The heap consists of one or more memory pools; the used and committed sizes are the sums of those values across all heap memory pools. The used value is the amount of memory occupied by both live objects and garbage objects that have not yet been collected. |
| executor_id.jvm.* | Heap usage of each executor's JVM, reported per executor ID. |
| driver.jvm.total.* | Sum of the corresponding metric values from jvm.heap and jvm.non-heap. |
**BlockManager Memory Status**

| Spark Metrics | Metric Definition |
|---|---|
| maxMem_MB | Sum of the maximum memory available across all block managers. |
| memUsed_MB | Sum of the memory used by caching RDDs plus non-RDD memory storage across all block managers. |
| remainingMem_MB | Sum of (maxMem - memUsed) across all block managers. |
| diskSpaceUsed_MB | Sum of the disk space used by RDDs plus non-RDD disk storage across all block managers. |
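The relationship between these BlockManager values can be sketched as follows; the per-block-manager snapshot data below is illustrative only and not produced by any QDS or Spark API:

```python
# Hypothetical per-block-manager memory snapshots, in MB (illustrative data).
block_managers = [
    {"maxMem_MB": 4096, "memUsed_MB": 1024},
    {"maxMem_MB": 4096, "memUsed_MB": 2048},
]

# maxMem_MB and memUsed_MB are sums across all block managers.
maxMem_MB = sum(bm["maxMem_MB"] for bm in block_managers)
memUsed_MB = sum(bm["memUsed_MB"] for bm in block_managers)

# remainingMem_MB is the sum of (maxMem - memUsed) per block manager,
# which equals the difference of the two aggregates above.
remainingMem_MB = maxMem_MB - memUsed_MB

print(maxMem_MB, memUsed_MB, remainingMem_MB)  # -> 8192 3072 5120
```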
**Executor Status**

| Spark Metrics | Metric Definition |
|---|---|
| numExecutorsRunning | Number of executors known to be running. |
| totalExecutorsTarget | Total number of executors targeted by the dynamic allocation strategy, calculated as: needed = max(jobProgressExecutorsNeeded, memoryBasedExecutorsNeeded) + executors already running. |
| executorsPendingToRemove | Number of executors that have been requested to be killed. |
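The totalExecutorsTarget calculation can be sketched as a small function; the function name and argument names are illustrative, not part of any QDS or Spark API:

```python
def total_executors_target(job_progress_needed, memory_based_needed, executors_running):
    """Sketch of the dynamic-allocation target described above:
    needed = max(jobProgressExecutorsNeeded, memoryBasedExecutorsNeeded)
             + executors already running
    """
    return max(job_progress_needed, memory_based_needed) + executors_running

# For example, 8 executors needed by job progress, 5 by memory pressure,
# and 3 already running:
print(total_executors_target(8, 5, 3))  # -> 11
```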
**Jobs**

| Spark Metrics | Metric Definition |
|---|---|
| allJobs | Total number of job IDs submitted for an application. |
| activeJobs | Number of job IDs for which an ActiveJob was created successfully, that is, the job had valid RDDs and partitions to work on and a nonzero number of tasks to run, and its stages (the ResultStage and its parent ShuffleMapStage dependencies) were created successfully. |
**Stages**

| Spark Metrics | Metric Definition |
|---|---|
| failedStages | Stages that must be resubmitted due to fetch failures. |
| runningStages | Stages that are currently running. |
| waitingStages | Stages that have yet to run and are waiting for their parent stages to complete. |
| activeStages | Stages that have been submitted successfully. |
| numCompletedStages | Total number of stages that have completed. |
| pendingStages | Total number of stages waiting to be submitted. |
**Tasks**

| Spark Metrics | Metric Definition |
|---|---|
| numTotalTasks | Total number of tasks across all active stages. |
| numActiveTasks | Number of active tasks from the latest attempt of each active stage. |
| numCompletedTasks | Number of completed tasks from the latest attempt of each active stage. |
| numPendingTasks | Number of pending tasks, calculated as totalTasks - (completedTasks + activeTasks) from the latest attempt of each active stage. |
| numSQLRunning | Number of SQL tasks that are running. |
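The numPendingTasks relationship can be sketched for a single active stage; the function name is illustrative:

```python
def num_pending_tasks(total_tasks, completed_tasks, active_tasks):
    # numPendingTasks = totalTasks - (completedTasks + activeTasks),
    # taken from the latest attempt of each active stage and summed.
    return total_tasks - (completed_tasks + active_tasks)

# A stage with 100 total tasks, 60 completed and 25 active leaves 15 pending.
print(num_pending_tasks(100, 60, 25))  # -> 15
```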