Understanding the Presto Metrics for Monitoring

Presto clusters support the Datadog monitoring service. You can configure the Datadog monitoring service at the cluster level as described in Advanced configuration: Modifying Cluster Monitoring Settings, or at the account level. (For more information on configuring the Datadog monitoring service at the account level on AWS, see Configuring your Access Settings using IAM Keys or Managing Roles.)

In addition to the default Presto metrics that Qubole sends to Datadog, you can also send other Presto metrics to Datadog. Qubole uses Datadog’s JMX agent, configured through the jmx.yaml file in its Datadog integration, and uses 8097 as the JMX port. This enhancement is available as a beta feature and can be enabled by creating a ticket with Qubole Support.
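The exact contents of Qubole’s jmx.yaml are managed by the integration, but as a rough sketch, a Datadog JMX check entry that exports an additional Presto coordinator MBean attribute could look like the following. The MBean domain, attribute, and alias shown here are illustrative assumptions, not the entries that Qubole ships.

    # jmx.yaml (sketch): a Datadog JMX check entry for one Presto MBean attribute.
    # The domain, attribute, and alias are illustrative; adjust them to the MBeans you need.
    init_config:
      is_jmx: true
      conf:
        - include:
            domain: presto.execution
            attribute:
              RunningQueries:
                alias: presto.jmx.running_queries
                metric_type: gauge
    instances:
      - host: localhost
        port: 8097    # JMX port used by the Qubole Presto Datadog integration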

The following sections describe the Presto metrics and the system utilization metrics that are displayed in Datadog.

Presto Metrics

These are the Presto metrics that are displayed in the Datadog account. Each entry lists the metric definition, the abnormalities that the metric can indicate, and the actions that you can take to resolve the underlying errors.

presto.jmx.qubole.workers

Definition: Number of worker nodes in the Presto cluster that are registered with the Presto coordinator.

Abnormality: The value is less than the configured minimum number of nodes.

Perform these actions:

  1. Check if there is a Spot node loss. Use the presto.jmx.qubole.spot_loss_notifications metric to confirm. If there is a Spot node loss, this is expected and the cluster scales up again after some time.
  2. If there is an increase in the presto.jmx.qubole.request_failures metric, it can point to Presto failing on a worker node. The most common reason for this is GC pauses on a worker node. To prevent it, manage workloads better through Queues or Resource Groups (see the resource group sketch after this list).
presto.jmx.gc.minor_collection_time

Definition: Maximum time (in milliseconds) spent in YoungGen garbage collection (GC) across all nodes of the cluster.

Abnormality: A sudden spike in the value indicates a problem.

Action: GC problems typically happen when the cluster is under heavy load. Reduce the load through better workload management by using Queues or Resource Groups (see the resource group sketch after this list).

presto.jmx.gc.minor_collection_count

Definition: Maximum number of YoungGen GC events across all nodes of the cluster.

Abnormality: A sudden increase in values can point to a problem, but correlating it with the GC time usually gives a better idea.

Action: GC problems typically happen when the cluster is under heavy load. Reduce the load through better workload management by using Queues or Resource Groups.

presto.jmx.gc.major_collection_time

Definition: Maximum time (in milliseconds) spent in OldGen GC across all nodes of the cluster.

Abnormality: A sudden increase in values can point to a problem, but correlating it with the other GC metrics usually gives a better idea.

Action: GC problems typically happen when the cluster is under heavy load. Reduce the load through better workload management by using Queues or Resource Groups.

presto.jmx.gc.major_collection_count

Definition: Maximum number of OldGen GC events across all nodes of the cluster.

Abnormality: A sudden increase in values can point to a problem, but correlating it with the GC time usually gives a better idea.

Action: GC problems typically happen when the cluster is under heavy load. Reduce the load through better workload management by using Queues or Resource Groups.
presto.jmx.avg_planning_time

Definition: Average time (in milliseconds) spent in the planning phase of queries.

These are the possible abnormalities:

  1. Sudden spikes in values are expected when a query runs for the first time on a table, because it triggers the metastore cache warmup.
  2. With metastore caching disabled, planning time that is consistently high (in the tens of seconds) indicates a problem.

Perform these actions:

  1. If this metric’s value does not decrease even after several queries have run on the cluster, check the metastore cache settings and ensure that caching is enabled and that the TTL values are high enough to allow cached values to be used (see the metastore cache sketch after this list).
  2. Consistently high planning time can also mean a problem in the metastore or the Hive Metastore server. Resolution:
    1. First, verify that the metastore is not under heavy load. If it is, upgrade the metastore or take other measures to bring down the load.
    2. If the metastore is not heavily loaded, the issue might be with the Hive Metastore server. To resolve this, create a ticket with Qubole Support.
presto.jmx.qubole.request_failures

Definition: Number of requests that failed at the coordinator while contacting worker nodes during task execution.

Abnormality: A few of these errors can occur due to network congestion, but a consistent increase in the value indicates a problem.

Perform these actions:

  1. This can happen if a node is lost. Check the presto.jmx.qubole.workers metric to confirm.
  2. This can also happen if a node is stuck in GC. Use the GC-related metrics to confirm.
No specific abnormalities or actions are associated with the following metrics, except where noted:

presto.jmx.execution_time Query execution latency over a 5-minute window, in milliseconds.
presto.jmx.internal_failures Number of failed queries (internal) in the last one minute.
presto.jmx.running_queries Number of running queries in the cluster at a given point in time.
presto.jmx.completed_queries Number of finished queries.
presto.jmx.external_failures Number of failed queries (external) in the last one minute.
presto.jmx.failed_queries Number of failed queries in the last one minute.
presto.jmx.insufficient_resources_failures Number of queries that failed due to insufficient resources in the last one minute.
presto.jmx.started_queries Number of queries started on the cluster.
presto.jmx.queued_queries Number of queued queries at a given point in time.
presto.jmx.cancelled_queries Number of canceled queries.
presto.jmx.abandoned_queries Number of queries that the Presto server cancels because the client has not polled the server for results within the configured query.client.timeout, which defaults to 5 minutes. It is a count over the last one minute.
presto.jmx.submitted_queries Number of submitted queries.
presto.jmx.user_error_failures Number of queries that failed due to user errors in the last one minute.
presto.jmx.wall_input_bytes_rate Input data rate over a 5-minute window, in bytes/second.
presto.jmx.qubole.spot_loss_notifications Total number of Spot-loss notifications, populated when Spot losses occur in the cluster. A sudden spike in the value indicates that a node is about to be lost due to Spot node termination.
presto.jmx.input_data_size Total input data size over a 5-minute window, in bytes.
presto.jmx.input_positions Total input positions (input rows of tasks) over a 5-minute window.
presto.jmx.queries_killed_due_to_out_of_memory Cumulative number of queries killed due to out-of-memory issues.
presto.jmx.executor.active_count Number of active query executors.
presto.jmx.executor.shutdown Number of query executors that are shut down.
presto.jmx.executor.terminated Number of query executors that are terminated.
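Several of the actions above recommend better workload management through Queues or Resource Groups. As an illustration only, a file-based Presto resource group setup uses a resource-groups.properties file that points to a JSON definition; the group names, limits, and selectors below are example assumptions, not Qubole defaults.

    # etc/resource-groups.properties (sketch): enable the file-based resource group manager
    resource-groups.configuration-manager=file
    resource-groups.config-file=etc/resource_groups.json

The JSON file then defines the groups and how queries are routed to them:

    {
      "rootGroups": [
        {
          "name": "global",
          "softMemoryLimit": "80%",
          "hardConcurrencyLimit": 20,
          "maxQueued": 100,
          "subGroups": [
            { "name": "etl",   "softMemoryLimit": "50%", "hardConcurrencyLimit": 10, "maxQueued": 50 },
            { "name": "adhoc", "softMemoryLimit": "30%", "hardConcurrencyLimit": 5,  "maxQueued": 20 }
          ]
        }
      ],
      "selectors": [
        { "source": "etl-.*", "group": "global.etl" },
        { "group": "global.adhoc" }
      ]
    }

Capping concurrency and memory per group in this way keeps a single heavy workload from saturating the cluster and triggering the GC spikes described above.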
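The planning-time actions also mention verifying the metastore cache settings and TTLs. As a sketch only, Presto’s Hive connector exposes caching properties along these lines; the values are illustrative, and on Qubole such settings are normally applied through the cluster’s Presto configuration overrides rather than by editing catalog files directly.

    # Hive catalog properties (sketch): metastore caching with illustrative values
    hive.metastore-cache-ttl=20m
    hive.metastore-refresh-interval=1m
    hive.metastore-cache-maximum-size=10000

Higher TTL values keep metastore responses cached longer, which lowers presto.jmx.avg_planning_time for repeated queries against the same tables.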

System Utilization Metrics

These are the system utilization metrics and their definitions.
presto.jmx.consumed_cpu_time_secs Consumed CPU time in seconds
presto.jmx.qubole.avg_total_milli_vcores Moving average of the total milli virtual cores of all nodes in the cluster, computed as Runtime.getRuntime().availableProcessors() * number of worker nodes * 1000 (see the worked example after this table).
presto.jmx.qubole.avg_used_milli_vcores Moving average of the used milli virtual cores of all nodes in the cluster, computed as system CPU load * presto.jmx.qubole.avg_total_milli_vcores.
presto.jmx.qubole.avg_per_node_max_used_milli_vcores Moving average of milli virtual cores maximum used per node
presto.jmx.qubole.avg_per_node_min_used_milli_vcores Moving average of milli virtual cores minimum used per node
presto.jmx.qubole.avg_total_memory_mb Moving average total memory in MB of all nodes in the cluster
presto.jmx.qubole.avg_used_memory_mb Moving average of the used memory (total memory - generalpool_freememory - reservedpool_freememory) of all nodes in the cluster
presto.jmx.heap_memory_usage_used Used heap memory of Presto Coordinator. Its unit is bytes.
presto.jmx.non_heap_memory_usage_used Used non-heap memory of Presto Coordinator. Its unit is bytes.
presto.jmx.general.free_bytes Free bytes in General Memory Pool
presto.jmx.general.max_bytes Maximum bytes in General Memory Pool
presto.jmx.general.reserved_bytes Reserved bytes in General Memory Pool
presto.jmx.reserved.free_bytes Free bytes in Reserved Memory Pool
presto.jmx.reserved.max_bytes Maximum bytes in Reserved Memory Pool
presto.jmx.reserved.reserved_bytes Reserved bytes in Reserved Memory Pool
presto.jmx.qubole.avg_per_node_max_used_memory_mb Moving average of max_used_memory (in Presto’s memory pool) per worker node over the last one minute. It helps in detecting consistent skew in the cluster’s memory usage.
presto.jmx.qubole.avg_per_node_min_used_memory_mb Moving average of min_used_memory per node. It helps in detecting consistent skew in the cluster’s memory usage.
presto.jmx.qubole.simple_sizer.current_size Number of worker nodes that are currently running
presto.jmx.qubole.simple_sizer.optimal_size Optimal number of worker nodes that are required
presto.jmx.qubole.cluster_state_machine.running Number of nodes that are used to process queries at a given point in time
presto.jmx.qubole.cluster_state_machine.unknown Number of worker nodes that are coming up
presto.jmx.qubole.cluster_state_machine.removed Number of removed nodes
presto.jmx.qubole.cluster_state_machine.quiesced Number of nodes in quiesced state (nodes taken away as part of downscaling)
presto.jmx.qubole.cluster_state_machine.quiesced_requested Number of nodes in the quiesced_requested state (nodes that will be taken away as part of downscaling once the scheduled tasks complete)
presto.jmx.qubole.cluster_state_machine.forced_quiesced Number of nodes in forced_quiesced state (nodes that are forcefully terminated)
presto.jmx.qubole.cluster_state_machine.forced_quiesced_requested Number of nodes in forced_quiesced_requested state (nodes that will be terminated after scheduled tasks on them complete)
presto.jmx.qubole.cluster_state_machine.to_be_lost Number of worker nodes about to be lost due to Spot node interruption
presto.jmx.user_based_max_cluster_size Current maximum cluster size calculated dynamically when resource-groups.user-scaling-limits-enabled is set to true
presto.jmx.active_resource_groups_count Number of resource groups that are currently active
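For the milli-vcore metrics above, the following small worked example (with an assumed processor count, worker count, and CPU load) shows how the two formulas combine:

    public class MilliVcoresExample {
        public static void main(String[] args) {
            // Per-node processor count, exactly as in the avg_total_milli_vcores definition;
            // assume it returns 8 for this example.
            int processorsPerNode = Runtime.getRuntime().availableProcessors();
            int workerNodes = 4;        // illustrative number of worker nodes
            double systemCpuLoad = 0.5; // illustrative system CPU load (fraction between 0.0 and 1.0)

            long totalMilliVcores = (long) processorsPerNode * workerNodes * 1000;
            long usedMilliVcores = Math.round(systemCpuLoad * totalMilliVcores);

            // With 8 processors per node and 4 workers: 8 * 4 * 1000 = 32000 total milli vcores;
            // at 50% system CPU load, about 16000 milli vcores are in use.
            System.out.println("total=" + totalMilliVcores + ", used=" + usedMilliVcores);
        }
    }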