Understanding the Presto Metrics for Monitoring

Presto clusters support the Datadog monitoring service. You can configure it at the cluster level, as described in Advanced configuration: Modifying Cluster Monitoring Settings, or at the account level. (For more information on configuring it at the account level on AWS, see Configuring your Access Settings using IAM Keys or Managing Roles.)

In addition to the default Presto metrics that Qubole sends to Datadog, you can also send other Presto metrics to Datadog. Qubole's Datadog integration uses Datadog's JMX agent, configured through the jmx.yaml file, and uses 8097 as the JMX port. This enhancement is available as a beta; to enable it, create a ticket with Qubole Support.
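The jmx.yaml that Qubole ships is not reproduced here; as a rough sketch of the mechanism it describes, a Datadog JMX check that scrapes one additional Presto MBean over port 8097 might look like the following. The domain, bean, and attribute names come from open-source Presto and are illustrative; they are not Qubole's shipped configuration.

    # jmx.yaml -- illustrative Datadog JMX check for Presto (not Qubole's generated file)
    init_config:
      is_jmx: true                   # run this check through the Datadog Agent's JMX fetcher
      collect_default_metrics: false

    instances:
      - host: localhost
        port: 8097                   # JMX port used by the Qubole Presto integration
        conf:
          - include:
              domain: presto.execution
              bean: presto.execution:name=QueryManager   # open-source Presto's QueryManager MBean
              attribute:
                RunningQueries:
                  alias: presto.jmx.running_queries      # metric name as it appears in Datadog
                  metric_type: gauge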

The following sections describe the Presto metrics and the system utilization metrics.

Presto Metrics

These are the Presto metrics displayed in the Datadog account, along with the abnormalities they can indicate and the actions you can take to remove the cause of errors. Metrics listed with only a definition are informational; no specific abnormality or action applies to them.

presto.jmx.qubole.workers

  Definition: Number of worker nodes in the Presto cluster that are registered with the Presto coordinator.

  Abnormality: The value drops below the configured minimum number of worker nodes.

  Actions:

    1. Check whether a Spot node was lost; use the presto.jmx.qubole.spot_loss_notifications metric to confirm. If so, the drop is expected and the cluster scales back up after some time.

    2. An accompanying increase in the presto.jmx.qubole.request_failures metric can indicate that the Presto process on a worker node is failing, most commonly because of GC pauses. Prevent this through better workload management with Queues or Resource Groups (see the sketch after the GC metrics below).
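To catch the abnormality above automatically, you could back this metric with a Datadog metric monitor. A sketch of such a monitor query, with a hypothetical cluster tag and threshold:

    min(last_5m):avg:presto.jmx.qubole.workers{cluster:presto-prod} < 4

This alerts when the average worker count over the last five minutes stays below four on the tagged cluster.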

presto.jmx.gc.minor_collection_time

  Definition: Maximum time spent in YoungGen garbage collection (GC) across all nodes of the cluster, in milliseconds.

  Abnormality: A sudden spike in the value indicates a problem.

  Actions: GC problems typically occur when the cluster is under heavy load. Reduce the load through better workload management with Queues or Resource Groups.

presto.jmx.gc.minor_collection_count

  Definition: Maximum number of YoungGen GC events across all nodes of the cluster.

  Abnormality: A sudden increase can point to a problem; correlating it with presto.jmx.gc.minor_collection_time gives a clearer picture.

  Actions: GC problems typically occur when the cluster is under heavy load. Reduce the load through better workload management with Queues or Resource Groups.

presto.jmx.gc.major_collection_time

  Definition: Maximum time spent in OldGen GC across all nodes of the cluster, in milliseconds.

  Abnormality: A sudden increase can point to a problem; correlating it with presto.jmx.gc.major_collection_count gives a clearer picture.

  Actions: GC problems typically occur when the cluster is under heavy load. Reduce the load through better workload management with Queues or Resource Groups.

presto.jmx.gc.major_collection_count

  Definition: Maximum number of OldGen GC events across all nodes of the cluster.

  Abnormality: A sudden increase can point to a problem; correlating it with presto.jmx.gc.major_collection_time gives a clearer picture.

  Actions: GC problems typically occur when the cluster is under heavy load. Reduce the load through better workload management with Queues or Resource Groups.
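The GC-related actions above all recommend workload management with Queues or Resource Groups. As a minimal sketch of the latter, assuming open-source Presto's file-based resource group manager (group names and limits are illustrative), an etc/resource-groups.json that caps concurrent ad hoc queries might look like this:

    {
      "rootGroups": [
        {
          "name": "global",
          "softMemoryLimit": "80%",
          "hardConcurrencyLimit": 20,
          "maxQueued": 100,
          "subGroups": [
            {
              "name": "adhoc",
              "softMemoryLimit": "40%",
              "hardConcurrencyLimit": 5,
              "maxQueued": 20
            }
          ]
        }
      ],
      "selectors": [
        {
          "group": "global.adhoc"
        }
      ]
    }

Capping hardConcurrencyLimit bounds how many queries compete for worker memory at once, which in turn bounds GC pressure.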

presto.jmx.avg_planning_time

  Definition: Average time spent in the planning phase of queries, in milliseconds.

  Abnormality: These are the possible abnormalities:

    1. Sudden spikes are expected when a query runs against a table for the first time, because the metastore cache must warm up.

    2. With metastore caching disabled, consistently high planning time (in the tens of seconds) indicates a problem.

  Actions:

    1. If the value does not decrease even after several queries have run on the cluster, check the metastore cache settings and ensure that caching is enabled and that the TTL values are high enough for cached values to be used (see the sketch after this entry).

    2. Consistently high planning time can indicate a problem with the metastore or the running Hive Metastore server. To resolve it:

      1. First, verify that the metastore is not under heavy load. If it is, upgrade the metastore or take other measures to bring the load down.

      2. If the metastore is not heavily loaded, the issue might be with the Hive Metastore server itself. To resolve this, create a ticket with Qubole Support.
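For step 1, the relevant knobs in open-source Presto's Hive connector are catalog properties such as the following; the property names come from open-source Presto, the values are illustrative, and the exact settings exposed on Qubole clusters may differ:

    # etc/catalog/hive.properties (illustrative values)
    hive.metastore-cache-ttl=20m          # how long cached metastore entries stay valid
    hive.metastore-refresh-interval=10m   # refresh cached entries in the background before they expire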

presto.jmx.qubole.request_failures

  Definition: Number of requests that failed at the coordinator while contacting worker nodes during task execution.

  Abnormality: A few of these errors can occur due to network congestion, but a consistent increase in the value indicates a problem.

  Actions:

    1. This can happen if a node is lost. Check the presto.jmx.qubole.workers metric to confirm.

    2. This can also happen if a node is stuck in GC. Use the GC-related metrics to confirm.

presto.jmx.execution_time

  Definition: Query execution latency over the last five minutes, in milliseconds.

presto.jmx.internal_failures

  Definition: Number of queries that failed due to internal errors in the last one minute.

presto.jmx.running_queries

  Definition: Number of running queries in the cluster at a given point in time.

presto.jmx.completed_queries

  Definition: Number of finished queries.

presto.jmx.external_failures

  Definition: Number of queries that failed due to external errors in the last one minute.

presto.jmx.failed_queries

  Definition: Number of failed queries in the last one minute.

presto.jmx.insufficient_resources_failures

  Definition: Number of queries that failed due to insufficient resources in the last one minute.

presto.jmx.started_queries

  Definition: Number of queries started on the cluster.

presto.jmx.queued_queries

  Definition: Number of queued queries at a given point in time.

presto.jmx.cancelled_queries

  Definition: Number of canceled queries.

presto.jmx.abandoned_queries

  Definition: Number of queries that the Presto server cancels because the client did not poll for results within the configured query.client.timeout, which defaults to 5 minutes. It is a count over the last one minute.
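query.client.timeout is a coordinator property in Presto's config.properties. A minimal sketch of raising it for clients that poll infrequently (the value shown is illustrative):

    # etc/config.properties (illustrative value)
    query.client.timeout=10m    # cancel queries whose client has not polled for results within this window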

presto.jmx.submitted_queries

  Definition: Number of submitted queries.

presto.jmx.user_error_failures

  Definition: Number of queries that failed due to user errors in the last one minute.

presto.jmx.wall_input_bytes_rate

  Definition: Input data rate over the last five minutes, in bytes/sec.

presto.jmx.qubole.spot_loss_notifications

  Definition: Total number of Spot-loss notifications; populated when the cluster encounters Spot losses.

  Abnormality: A sudden spike in the value indicates that a node is about to be lost due to Spot node termination.

presto.jmx.input_data_size

  Definition: Total input data size over the last five minutes, in bytes.

presto.jmx.input_positions

  Definition: Total input positions (input rows of tasks) over the last five minutes.

presto.jmx.queries_killed_due_to_out_of_memory

  Definition: Cumulative number of queries killed due to out-of-memory issues.

presto.jmx.executor.active_count

  Definition: Number of active query executors.

presto.jmx.executor.shutdown

  Definition: Number of query executors that are shut down.

presto.jmx.executor.terminated

  Definition: Number of query executors that are terminated.

System Utilization Metrics

These are the system utilization metrics and their definitions.

presto.jmx.consumed_cpu_time_secs

  Definition: Consumed CPU time, in seconds.

presto.jmx.qubole.avg_total_milli_vcores

  Definition: Moving average of the total milli virtual cores across all nodes in the cluster, computed as Runtime.getRuntime().availableProcessors() * number of worker nodes * 1000.

presto.jmx.qubole.avg_used_milli_vcores

  Definition: Moving average of the used milli virtual cores across all nodes in the cluster (system CPU load * presto.jmx.qubole.avg_total_milli_vcores).
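For example, on a hypothetical cluster of 10 worker nodes with 8 available processors each, the total is 8 * 10 * 1000 = 80,000 milli virtual cores; at a system CPU load of 0.5, the used value is about 0.5 * 80,000 = 40,000 milli virtual cores.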

presto.jmx.qubole.avg_per_node_max_used_milli_vcores

  Definition: Moving average of the maximum milli virtual cores used per node.

presto.jmx.qubole.avg_per_node_min_used_milli_vcores

  Definition: Moving average of the minimum milli virtual cores used per node.

presto.jmx.qubole.avg_total_memory_mb

  Definition: Moving average of the total memory, in MB, across all nodes in the cluster.

presto.jmx.qubole.avg_used_memory_mb

  Definition: Moving average of the used memory (total memory - general pool free memory - reserved pool free memory) across all nodes in the cluster.

presto.jmx.heap_memory_usage_used

  Definition: Used heap memory of the Presto coordinator, in bytes.

presto.jmx.non_heap_memory_usage_used

  Definition: Used non-heap memory of the Presto coordinator, in bytes.

presto.jmx.general.free_bytes

  Definition: Free bytes in the General Memory Pool.

presto.jmx.general.max_bytes

  Definition: Maximum bytes in the General Memory Pool.

presto.jmx.general.reserved_bytes

  Definition: Reserved bytes in the General Memory Pool.

presto.jmx.reserved.free_bytes

  Definition: Free bytes in the Reserved Memory Pool.

presto.jmx.reserved.max_bytes

  Definition: Maximum bytes in the Reserved Memory Pool.

presto.jmx.reserved.reserved_bytes

  Definition: Reserved bytes in the Reserved Memory Pool.

presto.jmx.qubole.avg_per_node_max_used_memory_mb

  Definition: Moving average of the maximum used memory (in Presto's memory pool) per worker node over the last one minute. It helps detect consistent skew in the cluster's memory usage.

presto.jmx.qubole.avg_per_node_min_used_memory_mb

  Definition: Moving average of the minimum used memory per node. It helps detect consistent skew in the cluster's memory usage.

presto.jmx.qubole.simple_sizer.current_size

  Definition: Number of worker nodes that are currently running.

presto.jmx.qubole.simple_sizer.optimal_size

  Definition: Optimal number of worker nodes that are required.

presto.jmx.qubole.cluster_state_machine.running

  Definition: Number of nodes that are processing queries at a given point in time.

presto.jmx.qubole.cluster_state_machine.unknown

  Definition: Number of worker nodes that are coming up.

presto.jmx.qubole.cluster_state_machine.removed

  Definition: Number of removed nodes.

presto.jmx.qubole.cluster_state_machine.quiesced

  Definition: Number of nodes in the quiesced state (nodes taken away as part of downscaling).

presto.jmx.qubole.cluster_state_machine.quiesced_requested

  Definition: Number of nodes in the quiesced_requested state (nodes that will be taken away as part of downscaling once their scheduled tasks complete).

presto.jmx.qubole.cluster_state_machine.forced_quiesced

  Definition: Number of nodes in the forced_quiesced state (nodes that are forcefully terminated).

presto.jmx.qubole.cluster_state_machine.forced_quiesced_requested

  Definition: Number of nodes in the forced_quiesced_requested state (nodes that will be terminated after the scheduled tasks on them complete).

presto.jmx.qubole.cluster_state_machine.to_be_lost

  Definition: Number of worker nodes about to be lost due to Spot node interruption.

presto.jmx.user_based_max_cluster_size

  Definition: Current maximum cluster size, calculated dynamically when resource-groups.user-scaling-limits-enabled is set to true.
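A sketch of enabling the dynamic limit that drives this metric; the property name comes from the definition above, and placing it in Presto's config.properties is an assumption:

    # etc/config.properties (assumed location)
    resource-groups.user-scaling-limits-enabled=true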

presto.jmx.active_resource_groups_count

  Definition: Number of resource groups that are currently active.