Autoscaling in Presto Clusters¶
Here’s how autoscaling works on a Presto cluster:
- The Presto Server (running on the master node) keeps track of the launch time of each node.
- At regular intervals (10 seconds by default), the Presto Server takes a snapshot of the state of running queries,
compares it with the previous snapshot, and estimates the time required to finish the queries. If this time exceeds a
threshold value (one minute by default, configurable through
ascm.bds.target-latency), the Presto Server adds more nodes to the cluster. For more information on
ascm.bds.target-latency and other autoscaling properties, see Presto Configuration Properties.
- If QDS determines that the cluster is running more nodes than it needs to complete the running queries within the threshold value, it begins to decommission the excess nodes.
Because all processing in Presto is in-memory and no intermediate data is written to HDFS, the HDFS-related decommissioning tasks are not needed in a Presto cluster.
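The snapshot-based scale-up decision described above can be sketched roughly as follows. This is a simplified illustration, not Qubole's actual implementation: the helper names and progress numbers are hypothetical, and only the comparison against the target latency mirrors the documented behavior.

```python
# Simplified sketch of the Presto autoscaling scale-up decision.
# All names and numbers here are illustrative, not Qubole's actual code.

TARGET_LATENCY_SECONDS = 60  # ascm.bds.target-latency default (1 minute)

def estimate_finish_seconds(prev_done_mb, curr_done_mb,
                            total_mb, interval_seconds):
    """Estimate time to finish from two successive progress snapshots."""
    rate = (curr_done_mb - prev_done_mb) / interval_seconds
    if rate <= 0:
        return float("inf")  # no progress since last snapshot: worst case
    return (total_mb - curr_done_mb) / rate

def should_upscale(prev_done_mb, curr_done_mb, total_mb, interval_seconds=10):
    """Add nodes when the estimated completion time exceeds the target."""
    estimate = estimate_finish_seconds(prev_done_mb, curr_done_mb,
                                       total_mb, interval_seconds)
    return estimate > TARGET_LATENCY_SECONDS

# Example: 100 MB processed in the last 10 s, 1000 MB still remaining
# -> ~100 s to finish, which exceeds the 60 s target, so upscale.
print(should_upscale(prev_done_mb=0, curr_done_mb=100, total_mb=1100))  # True
```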
After new nodes are added, you may notice that they sometimes are not being used by the queries already in progress. This is
because new nodes are used by queries in progress only for certain operations such as TableScans and Partial Aggregations.
You can run
EXPLAIN (TYPE DISTRIBUTED) (see EXPLAIN)
to see which of a running query's operations can use the new nodes: look for operations that are part of Fragments.
If no query in progress requires any of these operations, the new nodes remain unused initially. But all new queries started after the nodes are added can make use of the new nodes (regardless of the types of operations in those queries).
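As a rough illustration of what to look for in the plan output, the sketch below scans a distributed plan's text for scan and partial-aggregation operators. The plan snippet and the exact operator spellings are assumptions about what EXPLAIN (TYPE DISTRIBUTED) prints, not a parser for Presto's actual plan format; check the output of your Presto version.

```python
# Illustrative check of a distributed plan for operations that can use
# newly added nodes. The plan text and operator names below are made-up
# examples, not Presto's exact EXPLAIN (TYPE DISTRIBUTED) format.

NEW_NODE_OPERATORS = ("TableScan", "PartialAggregation")

def fragments_using_new_nodes(plan_text):
    """Return fragment headers whose section mentions a qualifying operator."""
    result = []
    current = None
    for line in plan_text.splitlines():
        if line.startswith("Fragment"):
            current = line.strip()
        elif current and any(op in line for op in NEW_NODE_OPERATORS):
            if current not in result:
                result.append(current)
    return result

plan = """\
Fragment 0 [SINGLE]
    Output
Fragment 1 [SOURCE]
    PartialAggregation
    TableScan[hive:default:orders]
"""
print(fragments_using_new_nodes(plan))  # ['Fragment 1 [SOURCE]']
```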
Presto is not currently supported on all Cloud platforms; see QDS Components: Supported Versions and Cloud Platforms.
Configuring the Required Number of Worker Nodes¶
You can configure
query-manager.required-workers as a cluster override to set the number of worker nodes that
must be present in the cluster before a query is scheduled to run on the cluster. This allows cluster admins to reduce
the minimum size of Presto clusters to 1 without causing query failures due to limited resources. Qubole Presto's
autoscaling feature ensures that the number of worker nodes denoted by
query-manager.required-workers is brought up before
the query is scheduled to run on the cluster. While nodes are being requested from the Cloud provider and added to the
cluster, queries are queued on the Presto coordinator node. These queries are visible in the Waiting for resources
state in the Presto web UI.
This enhancement is only supported with Presto 0.193 and later versions.
A query waits for a maximum of
query-manager.required-workers-max-wait (with a default value of 5 minutes)
before being scheduled on the cluster, even if the required number of nodes (configured through
query-manager.required-workers) could not be provisioned. The nodes brought up to match the value of
query-manager.required-workers are
autoscaling nodes and adhere to the existing spot ratio configuration. This allows higher usage of spot nodes in the cluster.
Queries that do not require multiple worker nodes to execute, such as queries on the JMX, system, and information schema connectors, or queries such as
SELECT 1,
are executed on the cluster immediately. While queries are running
on the cluster, the required number of worker nodes set through
query-manager.required-workers is maintained in the
cluster. The cluster downscales back to the minimum configured size when there are no active queries.
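A cluster override enabling this behavior might look like the following. The values shown are example settings for illustration, not defaults or recommendations.

```
query-manager.required-workers=4
query-manager.required-workers-max-wait=5m
```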
Controlling the Nodes Downscaling Velocity¶
The autoscaling service for Presto removes
ascm.downscaling.group-size nodes (default: 5)
at each configured
ascm.decision.interval (default: 10s) if it calculates the optimal size of the
cluster to be less than the current cluster size continuously for the configured
Cool down period. This results in a
downscaling profile where no nodes are removed during the Cool down period, after which nodes are removed very aggressively until
the cluster reaches its optimal size.
This figure illustrates the downscaling profile of cluster nodes.
To control the node downscaling velocity, Qubole provides a Presto cluster configuration override.
When you override it on the cluster, every time a downscaling action is triggered, the Cool down period (default: 5 minutes) is reset.
The next downscaling action is not triggered by the autoscaling service until it calculates
the optimal size of the cluster to be less than the current cluster size continuously for the configured Cool down period.
This results in a more gradual downscaling profile where
ascm.downscaling.group-size nodes are removed in each
Cool down period until the cluster reaches its optimal size.
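The downscaling knobs named above can be tuned with cluster overrides such as the following. The values shown are the documented defaults, repeated here only as an illustration of the override syntax.

```
ascm.decision.interval=10s
ascm.downscaling.group-size=5
```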
For a better understanding, consider these two examples.
Example 1: Consider a cluster without the downscaling-velocity override. The configured Cool down period is
10m, the current cluster size is 12, the optimal size is 2, and
ascm.downscaling.group-size is 2.
In this case, no nodes are removed until the Cool down period elapses. After that, 2 nodes are removed every 10 seconds until the cluster size reaches 2.
The total time taken to get to the optimal size is cool down period + ((current - optimal)/group_size) * 10s = 10 minutes and 50 seconds.
Example 2: Consider a cluster with the downscaling-velocity override. The configured Cool down period is
2m, the current cluster size is 12, the optimal size is 2, and
ascm.downscaling.group-size is 2.
In this case, 2 nodes are removed every 2 minutes until the cluster size reaches 2.
The total time taken to get to the optimal size is ((current - optimal)/group_size) * cool down period = 10 minutes.
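The arithmetic in the two examples can be checked with a short computation; the figures plugged in come straight from the examples above.

```python
# Time (in seconds) to downscale from `current` to `optimal` nodes,
# removing `group_size` nodes per interval, after an initial wait.
# Figures match the two worked examples above.

def downscale_seconds(current, optimal, group_size, interval, initial_wait):
    steps = (current - optimal) // group_size
    return initial_wait + steps * interval

# Example 1: no override -> wait out one 10-minute Cool down period,
# then remove 2 nodes every 10 seconds.
ex1 = downscale_seconds(12, 2, group_size=2, interval=10, initial_wait=600)
print(ex1)  # 650 seconds = 10 minutes 50 seconds

# Example 2: with the override -> 2 nodes removed at the end of each
# 2-minute Cool down period, with no separate initial wait.
ex2 = downscale_seconds(12, 2, group_size=2, interval=120, initial_wait=0)
print(ex2)  # 600 seconds = 10 minutes
```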
In addition, Presto supports resource-groups-based dynamic cluster sizing at the cluster and account levels, as described in Resource Groups based Dynamic Cluster Sizing in Presto.
Decommissioning a Worker Node¶
Qubole allows you to gracefully decommission a worker node during autoscaling through the cluster's master node. If you find a problematic worker node, you can manually decommission it.
To get the instance ID of the worker node that you want to decommission, run the API that returns a list of all node instance IDs from the cluster's master node.
From the list in the API response, note the instance ID of the worker node that you want to decommission and use it in the command below.
From the cluster master node, run this command to decommission the worker node:
curl -X DELETE localhost:8081/v1/cm/worker/instance-id
where instance-id is the worker node's instance ID.
This puts the specified worker node into a decommissioning state; the node completes its ongoing work (if any) and only then shuts down gracefully. The node is not used to execute any new queries submitted after you call the node-decommissioning API.