Understanding the Presto Metrics for Monitoring

Presto clusters support the Datadog monitoring service. You can configure the Datadog monitoring service at the cluster level as described in Advanced configuration: Modifying Cluster Monitoring Settings, or at the account level. (For more information on configuring the Datadog monitoring service at the account level on AWS, see Configuring your Access Settings using IAM Keys or Managing Roles.)

In addition to the default Presto metrics that Qubole sends to Datadog, you can also send other Presto metrics to Datadog. Qubole uses Datadog’s JMX agent, configured through the jmx.yaml file in its Datadog integration, and uses 8097 as the JMX port. This enhancement is available as a beta feature and can be enabled by creating a ticket with Qubole Support.
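The exact contents of Qubole’s jmx.yaml are managed by the integration, but as a rough sketch, a Datadog JMX check entry that exports an additional Presto coordinator MBean attribute could look like the following. The MBean domain, attribute, and alias shown here are illustrative assumptions, not the entries that Qubole ships.

    # jmx.yaml (sketch): a Datadog JMX check entry for one Presto MBean attribute.
    # The domain, attribute, and alias are illustrative; adjust them to the MBeans you need.
    init_config:
      is_jmx: true
      conf:
        - include:
            domain: presto.execution
            attribute:
              RunningQueries:
                alias: presto.jmx.running_queries
                metric_type: gauge
    instances:
      - host: localhost
        port: 8097    # JMX port used by the Qubole Presto Datadog integration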

The following sections describe the Presto metrics and the system utilization metrics that are displayed in Datadog.

Presto Metrics

These are the Presto metrics that are displayed in the Datadog account. Each entry lists the metric definition, the abnormalities that the metric can indicate, and the actions that you can take to resolve the underlying errors.

presto.jmx.qubole.workers

Definition: Number of worker nodes in the Presto cluster that are registered with the Presto coordinator.

Abnormality: The value is less than the configured minimum number of nodes.

Perform these actions:

  1. Check if there is a Spot node loss. Use the presto.jmx.qubole.spot_loss_notifications metric to confirm. If there is a Spot node loss, this is expected and the cluster scales up again after some time.
  2. If there is an increase in the presto.jmx.qubole.request_failures metric, it can point to Presto failing on a worker node. The most common reason for this is GC pauses on a worker node. To prevent it, manage workloads better through Queues or Resource Groups (see the resource group sketch after this list).
presto.jmx.gc.minor_collection_time

Definition: Maximum time (in milliseconds) spent in YoungGen garbage collection (GC) across all nodes of the cluster.

Abnormality: A sudden spike in the value indicates a problem.

Action: GC problems typically happen when the cluster is under heavy load. Reduce the load through better workload management by using Queues or Resource Groups (see the resource group sketch after this list).

presto.jmx.gc.minor_collection_count

Definition: Maximum number of YoungGen GC events across all nodes of the cluster.

Abnormality: A sudden increase in values can point to a problem, but correlating it with the GC time usually gives a better idea.

Action: GC problems typically happen when the cluster is under heavy load. Reduce the load through better workload management by using Queues or Resource Groups.

presto.jmx.gc.major_collection_time

Definition: Maximum time (in milliseconds) spent in OldGen GC across all nodes of the cluster.

Abnormality: A sudden increase in values can point to a problem, but correlating it with the other GC metrics usually gives a better idea.

Action: GC problems typically happen when the cluster is under heavy load. Reduce the load through better workload management by using Queues or Resource Groups.

presto.jmx.gc.major_collection_count

Definition: Maximum number of OldGen GC events across all nodes of the cluster.

Abnormality: A sudden increase in values can point to a problem, but correlating it with the GC time usually gives a better idea.

Action: GC problems typically happen when the cluster is under heavy load. Reduce the load through better workload management by using Queues or Resource Groups.
presto.jmx.avg_planning_time

Definition: Average time (in milliseconds) spent in the planning phase of queries.

These are the possible abnormalities:

  1. Sudden spikes in values are expected when a query runs for the first time on a table, because it triggers the metastore cache warmup.
  2. With metastore caching disabled, planning time that is consistently high (in the tens of seconds) indicates a problem.

Perform these actions:

  1. If this metric’s value does not decrease even after several queries have run on the cluster, check the metastore cache settings and ensure that caching is enabled and that the TTL values are high enough to allow cached values to be used (see the metastore cache sketch after this list).
  2. Consistently high planning time can also mean a problem in the metastore or the Hive Metastore server. Resolution:
    1. First, verify that the metastore is not under heavy load. If it is, upgrade the metastore or take other measures to bring down the load.
    2. If the metastore is not heavily loaded, the issue might be with the Hive Metastore server. To resolve this, create a ticket with Qubole Support.
presto.jmx.qubole.request_failures

Definition: Number of requests that failed at the coordinator while contacting worker nodes during task execution.

Abnormality: A few of these errors can occur due to network congestion, but a consistent increase in the value indicates a problem.

Perform these actions:

  1. This can happen if a node is lost. Check the presto.jmx.qubole.workers metric to confirm.
  2. This can also happen if a node is stuck in GC. Use the GC-related metrics to confirm.
No specific abnormalities or actions are associated with the following metrics, except where noted:

presto.jmx.execution_time Query execution latency over a 5-minute window, in milliseconds.
presto.jmx.internal_failures Number of failed queries (internal) in the last one minute.
presto.jmx.running_queries Number of running queries in the cluster at a given point in time.
presto.jmx.completed_queries Number of finished queries.
presto.jmx.external_failures Number of failed queries (external) in the last one minute.
presto.jmx.failed_queries Number of failed queries in the last one minute.
presto.jmx.insufficient_resources_failures Number of queries that failed due to insufficient resources in the last one minute.
presto.jmx.started_queries Number of queries started on the cluster.
presto.jmx.queued_queries Number of queued queries at a given point in time.
presto.jmx.cancelled_queries Number of canceled queries.
presto.jmx.abandoned_queries Number of queries that the Presto server cancels because the client has not polled the server for results within the configured query.client.timeout, which defaults to 5 minutes. It is a count over the last one minute.
presto.jmx.submitted_queries Number of submitted queries.
presto.jmx.user_error_failures Number of queries that failed due to user errors in the last one minute.
presto.jmx.wall_input_bytes_rate Input data rate over a 5-minute window, in bytes/second.
presto.jmx.qubole.spot_loss_notifications Total number of Spot-loss notifications, populated when Spot losses occur in the cluster. A sudden spike in the value indicates that a node is about to be lost due to Spot node termination.
presto.jmx.input_data_size Total input data size over a 5-minute window, in bytes.
presto.jmx.input_positions Total input positions (input rows of tasks) over a 5-minute window.
presto.jmx.queries_killed_due_to_out_of_memory Cumulative number of queries killed due to out-of-memory issues.
presto.jmx.executor.active_count Number of active query executors.
presto.jmx.executor.shutdown Number of query executors that are shut down.
presto.jmx.executor.terminated Number of query executors that are terminated.
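Several of the actions above recommend better workload management through Queues or Resource Groups. As an illustration only, a file-based Presto resource group setup uses a resource-groups.properties file that points to a JSON definition; the group names, limits, and selectors below are example assumptions, not Qubole defaults.

    # etc/resource-groups.properties (sketch): enable the file-based resource group manager
    resource-groups.configuration-manager=file
    resource-groups.config-file=etc/resource_groups.json

The JSON file then defines the groups and how queries are routed to them:

    {
      "rootGroups": [
        {
          "name": "global",
          "softMemoryLimit": "80%",
          "hardConcurrencyLimit": 20,
          "maxQueued": 100,
          "subGroups": [
            { "name": "etl",   "softMemoryLimit": "50%", "hardConcurrencyLimit": 10, "maxQueued": 50 },
            { "name": "adhoc", "softMemoryLimit": "30%", "hardConcurrencyLimit": 5,  "maxQueued": 20 }
          ]
        }
      ],
      "selectors": [
        { "source": "etl-.*", "group": "global.etl" },
        { "group": "global.adhoc" }
      ]
    }

Capping concurrency and memory per group in this way keeps a single heavy workload from saturating the cluster and triggering the GC spikes described above.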
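The planning-time actions also mention verifying the metastore cache settings and TTLs. As a sketch only, Presto’s Hive connector exposes caching properties along these lines; the values are illustrative, and on Qubole such settings are normally applied through the cluster’s Presto configuration overrides rather than by editing catalog files directly.

    # Hive catalog properties (sketch): metastore caching with illustrative values
    hive.metastore-cache-ttl=20m
    hive.metastore-refresh-interval=1m
    hive.metastore-cache-maximum-size=10000

Higher TTL values keep metastore responses cached longer, which lowers presto.jmx.avg_planning_time for repeated queries against the same tables.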

System Utilization Metrics

These are the system utilization metrics and their definitions.
presto.jmx.consumed_cpu_time_secs Consumed CPU time in seconds
presto.jmx.qubole.avg_total_milli_vcores Moving average of the total milli virtual cores of all nodes in the cluster, computed as Runtime.getRuntime().availableProcessors() * number of worker nodes * 1000 (see the worked example after this table).
presto.jmx.qubole.avg_used_milli_vcores Moving average of the used milli virtual cores of all nodes in the cluster, computed as system CPU load * presto.jmx.qubole.avg_total_milli_vcores.
presto.jmx.qubole.avg_per_node_max_used_milli_vcores Moving average of milli virtual cores maximum used per node
presto.jmx.qubole.avg_per_node_min_used_milli_vcores Moving average of milli virtual cores minimum used per node
presto.jmx.qubole.avg_total_memory_mb Moving average total memory in MB of all nodes in the cluster
presto.jmx.qubole.avg_used_memory_mb Moving average of the used memory (total memory - generalpool_freememory - reservedpool_freememory) of all nodes in the cluster
presto.jmx.heap_memory_usage_used Used heap memory of Presto Coordinator. Its unit is bytes.
presto.jmx.non_heap_memory_usage_used Used non-heap memory of Presto Coordinator. Its unit is bytes.
presto.jmx.general.free_bytes Free bytes in General Memory Pool
presto.jmx.general.max_bytes Maximum bytes in General Memory Pool
presto.jmx.general.reserved_bytes Reserved bytes in General Memory Pool
presto.jmx.reserved.free_bytes Free bytes in Reserved Memory Pool
presto.jmx.reserved.max_bytes Maximum bytes in Reserved Memory Pool
presto.jmx.reserved.reserved_bytes Reserved bytes in Reserved Memory Pool
presto.jmx.qubole.avg_per_node_max_used_memory_mb Moving average of max_used_memory (in Presto’s memory pool) per worker node over the last one minute. It helps in detecting consistent skew in the cluster’s memory usage.
presto.jmx.qubole.avg_per_node_min_used_memory_mb Moving average of min_used_memory per node. It helps in detecting consistent skew in the cluster’s memory usage.
presto.jmx.qubole.simple_sizer.current_size Number of worker nodes that are currently running
presto.jmx.qubole.simple_sizer.optimal_size Optimal number of worker nodes that are required
presto.jmx.qubole.cluster_state_machine.running Number of nodes that are used to process queries at a given point in time
presto.jmx.qubole.cluster_state_machine.unknown Number of worker nodes that are coming up
presto.jmx.qubole.cluster_state_machine.removed Number of removed nodes
presto.jmx.qubole.cluster_state_machine.quiesced Number of nodes in quiesced state (nodes taken away as part of downscaling)
presto.jmx.qubole.cluster_state_machine.quiesced_requested Number of nodes in the quiesced_requested state (nodes that will be taken away as part of downscaling once the scheduled tasks complete)
presto.jmx.qubole.cluster_state_machine.forced_quiesced Number of nodes in forced_quiesced state (nodes that are forcefully terminated)
presto.jmx.qubole.cluster_state_machine.forced_quiesced_requested Number of nodes in forced_quiesced_requested state (nodes that will be terminated after scheduled tasks on them complete)
presto.jmx.qubole.cluster_state_machine.to_be_lost Number of worker nodes about to be lost due to Spot node interruption
presto.jmx.user_based_max_cluster_size Current maximum cluster size calculated dynamically when resource-groups.user-scaling-limits-enabled is set to true
presto.jmx.active_resource_groups_count Number of resource groups that are currently active
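For the milli-vcore metrics above, the following small worked example (with an assumed processor count, worker count, and CPU load) shows how the two formulas combine:

    public class MilliVcoresExample {
        public static void main(String[] args) {
            // Per-node processor count, exactly as in the avg_total_milli_vcores definition;
            // assume it returns 8 for this example.
            int processorsPerNode = Runtime.getRuntime().availableProcessors();
            int workerNodes = 4;        // illustrative number of worker nodes
            double systemCpuLoad = 0.5; // illustrative system CPU load (fraction between 0.0 and 1.0)

            long totalMilliVcores = (long) processorsPerNode * workerNodes * 1000;
            long usedMilliVcores = Math.round(systemCpuLoad * totalMilliVcores);

            // With 8 processors per node and 4 workers: 8 * 4 * 1000 = 32000 total milli vcores;
            // at 50% system CPU load, about 16000 milli vcores are in use.
            System.out.println("total=" + totalMilliVcores + ", used=" + usedMilliVcores);
        }
    }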