Understanding the Presto Metrics for Monitoring
Presto clusters support the Datadog monitoring service. You can configure the Datadog monitoring service at the cluster level as described in Advanced configuration: Modifying Cluster Monitoring Settings, or at the account level. (For more information on configuring the Datadog monitoring service at the account level on AWS, see Configuring your Access Settings using IAM Keys or Managing Roles.)
In addition to the default Presto metrics that Qubole sends to Datadog, you can also send other Presto metrics.
Qubole's Datadog integration uses Datadog's JMX agent through the jmx.yaml configuration file, with 8097 as the JMX port. This enhancement is available as a beta feature; to enable it, create a ticket with Qubole Support.
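For reference, a configuration for Datadog's JMX agent generally follows the shape sketched below. This is a minimal, hypothetical example rather than the contents of Qubole's actual jmx.yaml: the MBean (com.facebook.presto.execution:name=QueryManager), the RunningQueries attribute, and the alias are illustrative assumptions.

```yaml
# Minimal sketch of a Datadog JMX-agent configuration (jmx.yaml).
# The bean, attribute, and alias below are illustrative assumptions,
# not the contents of Qubole's shipped integration file.
init_config:
  is_jmx: true

instances:
  - host: localhost
    port: 8097                                   # JMX port used by the integration
    conf:
      - include:
          domain: com.facebook.presto.execution
          bean: com.facebook.presto.execution:name=QueryManager
          attribute:
            RunningQueries:
              alias: presto.jmx.running_queries  # Datadog metric name
              metric_type: gauge
```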
The following sections describe the Presto metrics and the system utilization metrics.
Presto Metrics
The following table lists the Presto metrics that are displayed in the Datadog account, the abnormalities they can indicate, and the actions that you can take to address the cause of errors.
| Presto Metric | Metric Definition | Abnormalities Indicated in the Metric | Actions |
|---|---|---|---|
| presto.jmx.qubole.workers | Number of worker nodes that are part of the Presto cluster and registered with the Presto Coordinator | | Perform these actions: |
| presto.jmx.gc.minor_collection_time | Maximum time spent in YoungGen garbage collection (GC) across all nodes of the cluster, in milliseconds | A sudden spike in the value indicates a problem. | GC problems typically happen when the cluster is under heavy load. Reduce them through better workload management using Queues or Resource Groups (see the sketch after this table). |
| presto.jmx.gc.minor_collection_count | Maximum number of YoungGen GC events across all nodes of the cluster | A sudden increase in values can point to a problem, but a correlation with GC time usually gives a better idea. | GC problems typically happen when the cluster is under heavy load. Reduce them through better workload management using Queues or Resource Groups. |
| presto.jmx.gc.major_collection_time | Maximum time spent in OldGen GC across all nodes of the cluster, in milliseconds | A sudden increase in values can point to a problem, but a correlation with GC time usually gives a better idea. | GC problems typically happen when the cluster is under heavy load. Reduce them through better workload management using Queues or Resource Groups. |
| presto.jmx.gc.major_collection_count | Maximum number of OldGen GC events across all nodes of the cluster | A sudden increase in values can point to a problem, but a correlation with GC time usually gives a better idea. | GC problems typically happen when the cluster is under heavy load. Reduce them through better workload management using Queues or Resource Groups. |
| presto.jmx.avg_planning_time | Average time spent in the planning phase of queries, in milliseconds | These are the possible abnormalities: | Perform these actions: |
| presto.jmx.qubole.request_failures | Number of requests that failed at the coordinator while contacting worker nodes during task execution | A few of these errors can occur due to network congestion, but a consistent increase in the value indicates a problem. | Perform these actions: |
| presto.jmx.execution_time | Query execution latency over the last 5 minutes, in milliseconds | Not Applicable | Not Applicable |
| presto.jmx.internal_failures | Number of failed queries (internal) in one minute | Not Applicable | Not Applicable |
| presto.jmx.running_queries | Number of running queries in the cluster at a given point in time | Not Applicable | Not Applicable |
| presto.jmx.completed_queries | Number of finished queries | Not Applicable | Not Applicable |
| presto.jmx.external_failures | Number of failed queries (external) in one minute | Not Applicable | Not Applicable |
| presto.jmx.failed_queries | Number of failed queries in the last one minute | Not Applicable | Not Applicable |
| presto.jmx.insufficient_resources_failures | Number of queries that failed due to insufficient resources in one minute | Not Applicable | Not Applicable |
| presto.jmx.started_queries | Number of queries started on the cluster | Not Applicable | Not Applicable |
| presto.jmx.queued_queries | Number of queued queries at a given point in time | Not Applicable | Not Applicable |
| presto.jmx.cancelled_queries | Number of canceled queries | Not Applicable | Not Applicable |
| presto.jmx.abandoned_queries | Number of queries that the Presto server cancels when the client has not polled it for results within the configured timeout | Not Applicable | Not Applicable |
| presto.jmx.submitted_queries | Number of submitted queries | Not Applicable | Not Applicable |
| presto.jmx.user_error_failures | Number of queries that failed due to user errors in the last one minute | Not Applicable | Not Applicable |
| presto.jmx.wall_input_bytes_rate | Input data rate over the last 5 minutes, in bytes/sec | Not Applicable | Not Applicable |
| presto.jmx.qubole.spot_loss_notifications | Total number of Spot-loss notifications, populated when the cluster encounters Spot losses | A sudden spike in the value indicates that a node is about to be lost due to Spot node termination. | Not Applicable |
| presto.jmx.input_data_size | Total input data size over the last 5 minutes, in bytes | Not Applicable | Not Applicable |
| presto.jmx.input_positions | Total input positions (input rows of tasks) over the last 5 minutes | Not Applicable | Not Applicable |
| presto.jmx.queries_killed_due_to_out_of_memory | Cumulative number of queries killed due to out-of-memory issues | Not Applicable | Not Applicable |
| presto.jmx.executor.active_count | Number of active query executors | Not Applicable | Not Applicable |
| presto.jmx.executor.shutdown | Number of query executors that are shut down | Not Applicable | Not Applicable |
| presto.jmx.executor.terminated | Number of query executors that are terminated | Not Applicable | Not Applicable |
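Where the actions above recommend Resource Groups, the workload limits are defined in Presto's file-based resource-group configuration. The following is a minimal, hypothetical sketch: the group names, limits, and selector regex are assumptions, and the file must be wired up through etc/resource-groups.properties (resource-groups.configuration-manager=file and resource-groups.config-file pointing at this JSON file).

```json
{
  "rootGroups": [
    {
      "name": "etl",
      "softMemoryLimit": "50%",
      "hardConcurrencyLimit": 5,
      "maxQueued": 100
    },
    {
      "name": "adhoc",
      "softMemoryLimit": "30%",
      "hardConcurrencyLimit": 10,
      "maxQueued": 50
    }
  ],
  "selectors": [
    { "user": "etl-.*", "group": "etl" },
    { "group": "adhoc" }
  ]
}
```

Capping concurrency and memory per group keeps heavy workloads from saturating the JVM heap, which is what drives the GC spikes flagged in the table above.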
System Utilization Metrics
| System Utilization Metric | Metric Definition |
|---|---|
| presto.jmx.consumed_cpu_time_secs | Consumed CPU time, in seconds |
| presto.jmx.qubole.avg_total_milli_vcores | Moving average of the total milli virtual cores of all nodes in the cluster |
| presto.jmx.qubole.avg_used_milli_vcores | Moving average of the used milli virtual cores of all nodes in the cluster |
| presto.jmx.qubole.avg_per_node_max_used_milli_vcores | Moving average of the maximum milli virtual cores used per node |
| presto.jmx.qubole.avg_per_node_min_used_milli_vcores | Moving average of the minimum milli virtual cores used per node |
| presto.jmx.qubole.avg_total_memory_mb | Moving average of the total memory (in MB) of all nodes in the cluster |
| presto.jmx.qubole.avg_used_memory_mb | Moving average of the used memory (in MB) of all nodes in the cluster |
| presto.jmx.heap_memory_usage_used | Used heap memory of the Presto Coordinator, in bytes |
| presto.jmx.non_heap_memory_usage_used | Used non-heap memory of the Presto Coordinator, in bytes |
| presto.jmx.general.free_bytes | Free bytes in the General memory pool |
| presto.jmx.general.max_bytes | Maximum bytes in the General memory pool |
| presto.jmx.general.reserved_bytes | Reserved bytes in the General memory pool |
| presto.jmx.reserved.free_bytes | Free bytes in the Reserved memory pool |
| presto.jmx.reserved.max_bytes | Maximum bytes in the Reserved memory pool |
| presto.jmx.reserved.reserved_bytes | Reserved bytes in the Reserved memory pool |
| presto.jmx.qubole.avg_per_node_max_used_memory_mb | Moving average of the maximum used memory (in Presto's memory pool) per worker node over the last one minute. It helps in detecting consistent skew in the cluster's memory usage (see the monitor sketch after this table). |
| presto.jmx.qubole.avg_per_node_min_used_memory_mb | Moving average of the minimum used memory per node. It helps in detecting consistent skew in the cluster's memory usage. |
| presto.jmx.qubole.simple_sizer.current_size | Number of worker nodes that are currently running |
| presto.jmx.qubole.simple_sizer.optimal_size | Optimal number of worker nodes that are required |
| presto.jmx.qubole.cluster_state_machine.running | Number of nodes that are used to process queries at a given point in time |
| presto.jmx.qubole.cluster_state_machine.unknown | Number of worker nodes that are coming up |
| presto.jmx.qubole.cluster_state_machine.removed | Number of removed nodes |
| presto.jmx.qubole.cluster_state_machine.quiesced | Number of nodes in the quiesced state (nodes taken away as part of downscaling) |
| presto.jmx.qubole.cluster_state_machine.quiesced_requested | Number of nodes for which quiescing has been requested |
| presto.jmx.qubole.cluster_state_machine.forced_quiesced | Number of nodes in the forced-quiesced state |
| presto.jmx.qubole.cluster_state_machine.forced_quiesced_requested | Number of nodes for which forced quiescing has been requested |
| presto.jmx.qubole.cluster_state_machine.to_be_lost | Number of worker nodes about to be lost due to Spot node interruption |
| presto.jmx.user_based_max_cluster_size | Current maximum cluster size, calculated dynamically when user-based cluster sizing is enabled |
| presto.jmx.active_resource_groups_count | Number of resource groups that are currently active |
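Since the per-node max/min memory metrics above exist to surface skew, one way to act on them is a Datadog metric-alert monitor that alarms when the per-node maximum diverges from the per-node minimum. The sketch below uses the monitor JSON shape accepted by Datadog's Monitors API; the 15-minute window, the ratio threshold of 3, and the notification handle are illustrative assumptions, not recommended values.

```json
{
  "name": "Presto: per-node memory skew",
  "type": "metric alert",
  "query": "avg(last_15m):avg:presto.jmx.qubole.avg_per_node_max_used_memory_mb{*} / avg:presto.jmx.qubole.avg_per_node_min_used_memory_mb{*} > 3",
  "message": "Per-node peak memory is far above the per-node minimum; check for skewed joins or partitions. @pagerduty-presto",
  "options": {
    "thresholds": { "critical": 3 },
    "notify_no_data": false
  }
}
```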