Presto Server and Cluster Issues

This section describes issues related to the Presto server and cluster, along with their solutions:

Handling Presto Server Connection Issues

If you get the following error message while trying to connect to a Presto cluster:

Error running command: Server refused connection:

One possible workaround is to ensure that you have provided access to the Qubole public buckets so that the Presto cluster can boot up.

Trace the Presto logs, which are at the locations below:

  • On the cluster, logs are at: /media/ephemeral0/presto/var/log or /usr/lib/presto/logs
  • On AWS S3, logs are at: s3://<DefLoc>/logs/presto/<cluster_id>/<cluster start time>/

You can go to the log location on the cluster using these commands; a sketch for fetching the logs from S3 follows the listing.

[ec2-user@ip-XX-XXX-XX-XX logs]$ cd /media/ephemeral0/presto/var/log
[ec2-user@ip-XX-XXX-XX-XX log]$ pwd
/media/ephemeral0/presto/var/log
[ec2-user@ip-XX-XXX-XX-XX log]$ ls -ltr
total 692

-rw-r--r-- 1 root root 231541 Dec 18 07:10 gc.log
-rw-r--r-- 1 root root 248166 Dec 18 07:10 launcher.log
-rw-r--r-- 1 root root 160394 Dec 18 07:10 server.log
-rw-r--r-- 1 root root  40822 Dec 18 07:10 http-request.log
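
If the cluster is no longer running, you can pull the same logs from the S3 location instead. The following is a minimal sketch using the AWS CLI; it assumes the CLI is configured with access to the account's default location, and <DefLoc>, <cluster_id>, and <cluster start time> are the placeholders from the locations listed above.

# List the Presto log folders archived for a cluster.
aws s3 ls s3://<DefLoc>/logs/presto/<cluster_id>/

# Copy the logs of a specific cluster run locally for inspection.
aws s3 cp --recursive s3://<DefLoc>/logs/presto/<cluster_id>/<cluster start time>/ ./presto-logs/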

The different types of logs are described below (a sketch of quick ways to scan them follows the list):

  • server.log: For any job failure in Presto, check the Presto server log, which contains error stack traces, warning messages, and so on.
  • launcher.log: A Python process starts the Presto process, and the logs for that Python process go to launcher.log. If you do not find anything in server.log, check launcher.log next.
  • gc.log: This log is helpful in analyzing the cause of a long-running job or a stuck query. It is quite verbose, so it is useful for examining Garbage Collection (GC) pauses resulting from minor and full GC.
  • http-request.log: This log records the incoming requests to the Presto server and the responses from it.
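
A quick way to scan these logs on a cluster node is sketched below. It assumes the on-cluster log location mentioned above, and the search strings (ERROR, WARN, Full GC) are only approximations of typical log content, not exact Presto messages.

cd /media/ephemeral0/presto/var/log

# Look for recent errors and warnings in the server log.
grep -n -E "ERROR|WARN" server.log | tail -n 50

# Check the launcher log if the server log has nothing useful.
tail -n 100 launcher.log

# Look for long GC pauses (the exact format depends on the JVM GC log settings).
grep -n "Full GC" gc.log | tail -n 20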

Handling the Exception - Encountered too many errors talking to a worker node

This can be a generic error message, so you must check the logs. Handling Presto Server Connection Issues mentions the logs’ location on the cluster and S3.

Here are a few common causes of the error; a diagnostic sketch follows the list:

  • The node may have run out of memory; this shows up in the launcher.log of the worker node.
  • A high Garbage Collection (GC) pause occurred on the node; this shows up in the gc.log of the worker node.
  • A Spot instance was lost; this shows up in the server.log of the coordinator node.
  • The coordinator node is too busy to receive the heartbeat from the node; this shows up in the server.log of the coordinator node.
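
A minimal diagnostic sketch for these causes is shown below. It assumes you can SSH to the worker and coordinator nodes, that the logs are in the on-cluster location mentioned earlier, and that the search strings (OutOfMemory, spot, heartbeat, Full GC) are approximations of the actual messages.

# On the suspect worker node: check for an out-of-memory condition.
grep -n -i "OutOfMemory" /media/ephemeral0/presto/var/log/launcher.log

# On the same worker: look for long GC pauses around the failure time.
grep -n "Full GC" /media/ephemeral0/presto/var/log/gc.log | tail -n 20

# On the coordinator: look for Spot loss or missed heartbeats around the failure time.
grep -n -i -E "spot|heartbeat" /media/ephemeral0/presto/var/log/server.log | tail -n 50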

Handling Query Failures due to an Exceeded Memory Limit

A query failure due to an exceeded maximum memory limit may be the result of incorrect property values that are overridden on the cluster. The overridden values may not be required, or the property names may be mistyped or incorrectly entered.
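
One way to check is to verify which memory-related properties actually reached the node. The following is a minimal sketch; the path /usr/lib/presto/etc/config.properties is an assumption about where the installation keeps its configuration, so adjust it to your cluster's layout.

# Show the memory-related properties in effect on a node (path is an assumption).
grep -n -E "^query\.max-memory" /usr/lib/presto/etc/config.properties

# Compare the output with the overrides set in the cluster's Override Presto Configuration;
# remove overrides that are not needed and fix any mistyped property names.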

Handling the Exception - Server did not reply

When you get the Server did not reply exception, check the logs and look for the phrase SERVER STARTED. If the phrase is not in the logs, there may be an error in the overridden Presto configuration on the cluster.

Handling Presto Server Connection Issues mentions the logs’ location on the cluster and S3.
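
A minimal check on a cluster node is sketched below; it assumes the on-cluster log location mentioned above.

# The phrase should appear once the Presto server has come up successfully.
grep -n "SERVER STARTED" /media/ephemeral0/presto/var/log/server.log

# If it is absent, review the tail of the server log for configuration errors.
tail -n 100 /media/ephemeral0/presto/var/log/server.log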

Investigating Ganglia Reports

Ganglia is a monitoring system for distributed systems. You can access the Ganglia Monitoring page by navigating to Control Panel > Clusters. Under the Resources column for the running cluster in question, there is a Ganglia Metrics link. If that link does not exist, an administrator needs to enable it for the cluster in question.

For more information on how to enable Ganglia monitoring, see Performance Monitoring with Ganglia.

Ganglia provides visibility into many detailed metrics, such as presto-jvm.metrics, disk metrics, CPU metrics, memory metrics, and network metrics. It is crucial for understanding system resource utilization during specific windows of time and for troubleshooting performance issues.

Investigating Datadog Metrics

Understanding the Presto Metrics for Monitoring describes the list of metrics that can be seen on the Datadog monitoring service. It also describes abnormalities and the actions that you can take to handle them.

Handling the query.max-memory-per-node configuration

The maximum memory a query can take up on a node is defined by the query.max-memory-per-node configuration property. Its value only applies to the worker nodes and does not apply to the cluster’s coordinator node.

If the value of query.max-memory-per-node is set to more than 42% of the physical memory, cluster failures occur. For more information, see the query execution properties table under Presto Configuration Properties.

If the queries are failing with the maximum memory limit exceeded exception, then reduce the value of query.max-memory-per-node by overriding it in the cluster’s Override Presto Configuration. You can also try reducing the worker node size.
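
As a rough illustration of the 42% guideline, the sketch below reads the node's physical memory and prints the corresponding ceiling for query.max-memory-per-node. It is only an aid for choosing a value; the property itself is still set through the cluster's Override Presto Configuration.

# Compute 42% of the node's physical memory in GB (rounded down) as the upper bound
# for query.max-memory-per-node.
total_kb=$(awk '/MemTotal/ {print $2}' /proc/meminfo)
limit_gb=$(( total_kb * 42 / 100 / 1024 / 1024 ))
echo "query.max-memory-per-node should not exceed ${limit_gb}GB on this node"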

Handling Presto Query Failures due to an Abnormal Server Shutdown

Sometimes, when you run node bootstrap scripts, Presto queries intermittently fail with the following error.

2017-09-13T23:05:19.309Z    ERROR    remote-task-callback-828    com.facebook.presto.execution.StageStateMachine    Stage 20170913_230512_00045_9tvic.21 failed
com.facebook.presto.spi.PrestoException: Server is shutting down. Task 20170913_230512_00045_9tvic.21.8 has been canceled
at com.facebook.presto.execution.SqlTaskManager.close(SqlTaskManager.java:227)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at io.airlift.bootstrap.LifeCycleManager.stop(LifeCycleManager.java:135)
at io.airlift.bootstrap.LifeCycleManager$1.run(LifeCycleManager.java:101)

Solution: This error occurs when the cluster’s node bootstrap scripts contain the presto server stop command, or when the scripts otherwise cause the Presto server to shut down abnormally.

To resolve or avoid this error, apply Presto-related changes through the Qubole Presto Server Bootstrap, which is an alternative to the node bootstrap. For more information, see Using the Qubole Presto Server Bootstrap.

For other changes, you may still have to use the node bootstrap script.
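
As an illustration of the split, here is a minimal, hypothetical node bootstrap fragment; the yum command is only a placeholder for a non-Presto change, and the key point is that the node bootstrap must not stop or restart the Presto server.

# Node bootstrap (runs on every node): non-Presto setup only.
# Do NOT include commands such as "presto server stop" here; they shut the server down
# abnormally and cause the intermittent query failures shown above.
sudo yum install -y jq        # placeholder for a non-Presto change

# Presto-related changes (for example, additional configuration) belong in the Qubole
# Presto Server Bootstrap instead; see Using the Qubole Presto Server Bootstrap.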