Presto Server and Cluster Issues¶
This section describes common issues related to the Presto server and cluster, along with their solutions:
- Handling Presto Server Connection Issues
- Handling the Exception - Encountered too many errors talking to a worker node
- Handling Query Failures due to an Exceeded Memory Limit
- Handling the Exception - Server did not reply
- Investigating Ganglia Reports
- Investigating Datadog Metrics
- Handling Presto Query Failures due to the Abnormal Server Shutdown
Handling Presto Server Connection Issues¶
If you get this error message while trying to connect to a Presto cluster:
Error running command: Server refused connection:
One possible workaround is to ensure that you have granted access to the Qubole public buckets, which is required for the Presto cluster to boot up.
Trace the Presto logs at the locations below:
- On the cluster, logs are at:
/media/ephemeral0/presto/var/log or /usr/lib/presto/logs
- On AWS S3, logs are at:
s3://<DefLoc>/logs/presto/<cluster_id>/<cluster start time>/
You can navigate to the log location on the cluster using these commands:
[user@host ~]$ cd /media/ephemeral0/presto/var/log
[user@host log]$ pwd
/media/ephemeral0/presto/var/log
[user@host log]$ ls -ltr
total 692
-rw-r--r-- 1 root root 231541 Dec 18 07:10 gc.log
-rw-r--r-- 1 root root 248166 Dec 18 07:10 launcher.log
-rw-r--r-- 1 root root 160394 Dec 18 07:10 server.log
-rw-r--r-- 1 root root  40822 Dec 18 07:10 http-request.log
The different types of logs are:
server.log: For any job failure in Presto, check the Presto server log first; it contains error stack traces, warning messages, and so on.
launcher.log: A Python process starts the Presto process, and that Python process logs to launcher.log. If you do not find anything in server.log, check launcher.log next.
gc.log: This log helps you analyze the cause of a long-running or stuck query. It is quite verbose, which makes it useful for examining Garbage Collection (GC) pauses resulting from minor and full GCs.
http-request.log: This log records incoming requests to the Presto server and the responses it returns.
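For a quick first pass over these logs, a small shell helper like the one below can surface recent errors. This is only a sketch: the `presto_log_errors` function name is my own, and the default path is the on-cluster location mentioned above.

```shell
# Sketch: surface recent errors/warnings from the Presto server log.
# The default path is the on-cluster log directory; pass another
# directory (for example, a copy downloaded from S3) as the argument.
presto_log_errors() {
  local log_dir="${1:-/media/ephemeral0/presto/var/log}"
  # Line-numbered ERROR/WARN matches; stack traces for failed
  # queries usually appear right around these lines.
  grep -nE "ERROR|WARN" "$log_dir/server.log" | tail -n 20
}
```

For example, `presto_log_errors /tmp/presto-logs` scans a copy of the logs fetched from S3.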
Handling the Exception - Encountered too many errors talking to a worker node¶
This is a generic error message, so you must check the logs. Handling Presto Server Connection Issues lists the log locations on the cluster and on S3.
Here are a few common causes of the error:
- The node may have run out of memory; this shows up in the launcher.log of the worker node.
- A high Garbage Collection (GC) pause occurred on the node; this shows up in the gc.log of the worker node.
- A Spot instance was lost; this shows up in the server.log of the coordinator node.
- The coordinator node is too busy to receive heartbeats from the node; this shows up in the server.log of the coordinator node.
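To check the GC-pause cause quickly, you can scan a worker's gc.log for full-GC events. This is a sketch: `gc_pauses` is a hypothetical helper, and the grep pattern depends on your JVM version and GC logging flags, so adjust it for your format.

```shell
# Sketch: list recent full-GC / pause lines from a worker's gc.log.
# GC log formats vary by JVM version and flags; adjust the pattern.
gc_pauses() {
  local log_dir="${1:-/media/ephemeral0/presto/var/log}"
  grep -E "Full GC|pause" "$log_dir/gc.log" | tail -n 20
}
```

Long or frequent full GCs in this output point to the node-level pauses described above.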
Handling Query Failures due to an Exceeded Memory Limit¶
A query failure due to an exceeded maximum memory limit may result from incorrect property values overridden on the cluster. The overrides may be unnecessary, or the property names may be mistyped.
Handling the Exception - Server did not reply¶
When you get the Server did not reply exception, check the logs for the phrase SERVER STARTED. If the phrase is not in the logs, there may be an error in the overridden Presto configuration on the cluster.
Handling Presto Server Connection Issues lists the log locations on the cluster and on S3.
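The check described above can be sketched as a small helper. The `check_server_started` name is my own, and the default path is the on-cluster log location mentioned earlier.

```shell
# Sketch: verify the Presto server actually started by looking for
# the SERVER STARTED marker in server.log.
check_server_started() {
  local log_dir="${1:-/media/ephemeral0/presto/var/log}"
  if grep -q "SERVER STARTED" "$log_dir/server.log"; then
    echo "server started"
  else
    echo "no SERVER STARTED marker: check the overridden Presto configuration"
  fi
}
```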
Investigating Ganglia Reports¶
Ganglia is a monitoring system for distributed systems. You can access the Ganglia Monitoring page by navigating to Control Panel > Clusters. Under the Resources column for the running cluster in question, there is a Ganglia Metrics link. If that link does not exist, an administrator needs to enable it for the cluster in question.
For more information on how to enable Ganglia monitoring, see Performance Monitoring with Ganglia.
Ganglia provides visibility into many detailed metrics such as presto-jvm.metrics, disk metrics, CPU metrics, memory metrics, and network metrics. It is crucial for understanding system resource utilization during specific windows of time and for troubleshooting performance issues.
Investigating Datadog Metrics¶
Understanding the Presto Metrics for Monitoring describes the list of metrics that can be seen on the Datadog monitoring service. It also describes the abnormalities and actions that you can perform to handle abnormalities.
Handling the query.max-memory-per-node configuration¶
The maximum memory a query can take up on a node is defined by the
query.max-memory-per-node configuration property.
Its value only applies to the worker nodes and does not apply to the cluster’s coordinator node.
If the value of
query.max-memory-per-node is set to more than 42% of the node's physical memory, cluster failures occur. For more
information, see the query execution properties table under Presto Configuration Properties.
If queries are failing with the maximum memory limit exceeded exception, reduce the value of
query.max-memory-per-node by overriding it in the cluster's Override Presto Configuration. You can also try reducing the worker node size.
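As a rough sanity check of the 42% rule above, you can compute the per-node ceiling for a given worker size. This is a sketch: the helper name is my own, and the result is rounded down to whole GB.

```shell
# Sketch: floor(physical_memory_gb * 42%) as a rough upper bound
# for query.max-memory-per-node on a worker of the given size.
max_memory_per_node_gb() {
  local physical_gb="$1"
  echo $(( physical_gb * 42 / 100 ))
}
```

For example, a worker with 122 GB of physical memory gives 51, so query.max-memory-per-node should stay at or below roughly 51GB on that node type.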
Handling Presto Query Failures due to the Abnormal Server Shutdown¶
Sometimes, when you run node bootstrap scripts, Presto queries may intermittently fail with the following error.
2017-09-13T23:05:19.309Z ERROR remote-task-callback-828 com.facebook.presto.execution.StageStateMachine Stage 20170913_230512_00045_9tvic.21 failed
com.facebook.presto.spi.PrestoException: Server is shutting down. Task 20170913_230512_00045_9tvic.21.8 has been canceled
    at com.facebook.presto.execution.SqlTaskManager.close(SqlTaskManager.java:227)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:497)
    at io.airlift.bootstrap.LifeCycleManager.stop(LifeCycleManager.java:135)
    at io.airlift.bootstrap.LifeCycleManager$1.run(LifeCycleManager.java:101)
Solution: This error occurs when the cluster's node bootstrap scripts contain the
presto server stop command, or when the scripts otherwise cause the Presto server to shut down abnormally.
To resolve or avoid this error, run a node bootstrap script for Presto changes using the Qubole Presto Server bootstrap, which is an alternative to the node bootstrap. For more information, see Using the Qubole Presto Server Bootstrap.
For other changes, you may still have to use the node bootstrap script.