Understanding the Presto Metrics for Monitoring

Presto clusters support the Datadog monitoring service. You can configure the Datadog monitoring service at the cluster level as described in Advanced configuration: Modifying Cluster Monitoring Settings, or at the account level. (For more information on configuring the Datadog monitoring service at the account level on AWS, see Configuring your Access Settings using IAM Keys or Managing Roles.)

In addition to the default Presto metrics that Qubole sends to Datadog, you can also send other Presto metrics to Datadog. Qubole uses Datadog's JMX agent through the jmx.yaml configuration file in its Datadog integration, and it uses 8097 as the JMX port. This enhancement is available as a beta feature and can be enabled by creating a ticket with Qubole Support.
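
For reference, a minimal sketch of a Datadog JMX check configuration in the jmx.yaml format is shown below. It assumes the Agent's standard JMX check layout; port 8097 is the JMX port mentioned above, while the MBean filter and metric aliases are illustrative placeholders that depend on which additional Presto metrics you want to collect.

  init_config:

  instances:
    - host: localhost
      # 8097 is the JMX port used in Qubole's Presto Datadog integration.
      port: 8097
      conf:
        # Illustrative filter: collect garbage collection MBean attributes.
        - include:
            domain: java.lang
            type: GarbageCollector
            attribute:
              CollectionCount:
                alias: presto.jmx.gc.collection_count
                metric_type: counter
              CollectionTime:
                alias: presto.jmx.gc.collection_time
                metric_type: counter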

These are the Presto metrics that are displayed in the Datadog account, along with each metric's definition, the abnormalities it can indicate, and the actions that you can take to resolve the cause of errors.

presto.Workers
  Metric Definition: Number of workers that are part of the Presto cluster.
  Abnormalities indicated in the Metric: presto.Workers is less than the configured minimum number of nodes (a sample Datadog monitor query for this condition is sketched after this entry).
  Actions:
    1. Check whether there is a Spot node loss. Use the presto.SpotLossNotification metric to confirm. If there is a Spot node loss, this is expected and the cluster scales up again after some time.
    2. If there is an increase in the presto.requestFailures metric, it can indicate that Presto on a worker node is failing. The most common reason for this is GC pauses on a worker node. To prevent this, manage the workload better through Queues or ResourceGroups.
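
As an illustration, you can create a Datadog metric monitor that alerts when the worker count stays below the configured minimum. The query below is only a sketch: the threshold and the tag used to scope the monitor to a single cluster are placeholders that depend on how your metrics are tagged.

  min(last_15m):avg:presto.Workers{cluster_id:<your-cluster-id>} < 3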

presto.MaxYoungGenGC-Time
  Metric Definition: Maximum time spent in YoungGen Garbage Collection (GC) across all nodes of the cluster.
  Abnormalities indicated in the Metric: A sudden spike in the value indicates a problem.
  Actions: GC problems typically happen when the cluster is under heavy load. This can be reduced through better workload management by using Queues or ResourceGroups (a sample ResourceGroups configuration is sketched after these GC entries).

presto.MaxYoungGenGC-Count
  Metric Definition: Maximum number of YoungGen GC events across all nodes of the cluster.
  Abnormalities indicated in the Metric: A sudden increase in values can point to a problem, but correlating the GC count and GC time metrics usually gives a better idea.
  Actions: GC problems typically happen when the cluster is under heavy load. This can be reduced through better workload management by using Queues or ResourceGroups.

presto.MaxOldGenGC-Time
  Metric Definition: Maximum time spent in OldGen GC across all nodes of the cluster.
  Abnormalities indicated in the Metric: A sudden increase in values can point to a problem, but correlating the GC count and GC time metrics usually gives a better idea.
  Actions: GC problems typically happen when the cluster is under heavy load. This can be reduced through better workload management by using Queues or ResourceGroups.

presto.MaxOldGenGC-Count
  Metric Definition: Maximum number of OldGen GC events across all nodes of the cluster.
  Abnormalities indicated in the Metric: A sudden increase in values can point to a problem, but correlating the GC count and GC time metrics usually gives a better idea.
  Actions: GC problems typically happen when the cluster is under heavy load. This can be reduced through better workload management by using Queues or ResourceGroups.
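
The GC-related actions above refer to workload management through Queues or ResourceGroups. As a rough illustration of the ResourceGroups approach, open-source Presto reads a JSON resource groups configuration similar to the sketch below; the group names, limits, and selector values here are placeholder assumptions, and the way this is enabled on a Qubole cluster may differ, so treat it only as an outline of the idea.

  {
    "rootGroups": [
      {
        "name": "etl",
        "softMemoryLimit": "60%",
        "hardConcurrencyLimit": 10,
        "maxQueued": 50
      },
      {
        "name": "adhoc",
        "softMemoryLimit": "40%",
        "hardConcurrencyLimit": 5,
        "maxQueued": 20
      }
    ],
    "selectors": [
      { "source": "etl-pipeline", "group": "etl" },
      { "group": "adhoc" }
    ]
  }

In this sketch, queries whose client source matches etl-pipeline are admitted to the etl group, and everything else falls through to adhoc, which caps the concurrency and memory that ad hoc queries can take from the cluster.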

presto.AveragePlanningTime
  Metric Definition: Average planning time (in milliseconds) of the planning phase of queries.
  Abnormalities indicated in the Metric:
    1. Sudden spikes in values can be expected when a query runs for the first time on a table, causing the metastore cache to warm up.
    2. With metastore caching disabled, if the planning time is consistently high, that is, in tens of seconds, it indicates a problem.
  Actions:
    1. If this metric's value does not decrease even after several queries have run on the cluster, check the metastore cache settings and ensure that caching is enabled and that the TTL values are high enough to allow using cached values (a sample cache configuration is sketched after this entry).
    2. Consistently high planning time can also indicate a problem with the metastore or the running Hive Metastore server. Resolution:
      1. First, verify that the metastore is not under a heavy load. If it is, upgrade the metastore or take other measures to bring down the load.
      2. If the metastore is not heavily loaded, it might be an issue with the Hive Metastore server. To resolve this, create a ticket with Qubole Support.
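
For the metastore cache check in the first action above, the open-source Presto Hive connector exposes catalog properties along the lines of the sketch below. The property names follow the open-source connector and the values are illustrative, so verify the equivalent settings and defaults for your Qubole Presto version before relying on them.

  # Hive connector catalog properties (for example, etc/catalog/hive.properties).
  # Longer TTLs keep metastore responses cached longer between refreshes.
  hive.metastore-cache-ttl=20m
  hive.metastore-refresh-interval=5m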

presto.requestFailures
  Metric Definition: Number of requests that failed at the master while contacting worker nodes during task execution.
  Abnormalities indicated in the Metric: A few of these errors can occur due to network congestion, but a consistent increase in the value indicates a problem.
  Actions:
    1. This can happen if a node is lost. Check the presto.Workers metric to confirm.
    2. This can also happen if a node is stuck in GC. Use the GC-related metrics to confirm.

presto.RUNNING-Queries
  Metric Definition: Number of running queries in the cluster.
  Abnormalities indicated in the Metric: Not applicable.
  Actions: Not applicable.

presto.FINISHED-Queries
  Metric Definition: Number of finished queries.
  Abnormalities indicated in the Metric: Not applicable.
  Actions: Not applicable.

presto.FAILED-Queries
  Metric Definition: Number of failed queries.
  Abnormalities indicated in the Metric: Not applicable.
  Actions: Not applicable.

presto.bytesReadPerSecondPerQuery
  Metric Definition: Bytes read per second per query. This metric considers only running queries in its calculation; if there are no running queries, no data is reported.
  Abnormalities indicated in the Metric: A value trending towards 0 indicates that there is an issue. The absence of a value does not indicate an issue.
  Actions: This is most probably due to read operators getting stuck while reading from the cloud object store. The most probable reason for this is a network issue, which you can check manually from the nodes (see the example after this entry). If you can manually reach the cloud object store, create a ticket with Qubole Support.
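
As a simple illustration of that manual check on AWS, you can try listing the relevant bucket from one of the cluster nodes with the AWS CLI; the bucket name below is a placeholder, and on other clouds the equivalent object store CLI applies.

  # Run from a cluster node; replace the placeholder with a bucket your queries read from.
  aws s3 ls s3://<your-bucket>/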