Autoscaling in Presto Clusters

Here’s how autoscaling works on a Presto cluster:

Note

Presto is not currently supported on all Cloud platforms; see QDS Components: Supported Versions and Cloud Platforms.

  • The Presto Server (running on the master node) keeps track of the launch time of each node.
  • At regular intervals (10 seconds by default), the Presto Server takes a snapshot of the state of running queries, compares it with the previous snapshot, and estimates the time required to finish the queries. If this time exceeds a threshold value (one minute by default, configurable through ascm.bds.target-latency), the Presto Server adds more nodes to the cluster. For more information on ascm.bds.target-latency and other autoscaling properties, see Presto Configuration Properties; a sample override is shown after this list.
  • If QDS determines that the cluster is running more nodes than it needs to complete the running queries within the threshold value, it begins to decommission the excess nodes.
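
For example, if your queries routinely take longer than a minute and you want autoscaling to be less eager to add nodes, you can raise the target latency through a cluster-level override. The following is a sketch that assumes the property is placed under the config.properties section of the cluster's Presto configuration overrides; the value shown is illustrative:

config.properties:
ascm.bds.target-latency=2m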

Note

Because all processing in Presto is in-memory and no intermediate data is written to HDFS, the HDFS-related decommissioning tasks are not needed in a Presto cluster.

After new nodes are added, you may notice that queries already in progress sometimes do not use them. This is because in-progress queries can use new nodes only for certain operations, such as TableScans and Partial Aggregations. You can run EXPLAIN (TYPE DISTRIBUTED) (see EXPLAIN) to see which of a running query’s operations can use the new nodes: look for operations in fragments that appear as [SOURCE].
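
For example, the following statement (the table and column names are hypothetical) shows the distributed plan of a query:

EXPLAIN (TYPE DISTRIBUTED)
SELECT country, COUNT(*)
FROM hive.default.orders
GROUP BY country;

In this plan, the fragment containing the TableScan would typically appear as [SOURCE]; that is the work newly added nodes can pick up.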

If no query in progress requires any of these operations, the new nodes remain unused initially. But all new queries started after the nodes are added can make use of the new nodes (irrespective of the types of operation in those queries).

Configuring the Required Number of Worker Nodes

You can configure query-manager.required-workers as a cluster override to set the number of worker nodes that must be present in the cluster before a query is scheduled to run on it. This allows cluster admins to reduce the minimum size of Presto clusters to 1 without causing query failures due to limited resources. Qubole Presto’s autoscaling ensures that the number of worker nodes set through query-manager.required-workers is brought up before a query is scheduled to run on the cluster. While nodes are being requested from the cloud provider and added to the cluster, queries are queued on the Presto coordinator node. These queries are visible in the Waiting for resources state in the Presto web UI.

Note

This enhancement is only supported with Presto 0.193 and later versions.

A query waits for a maximum of query-manager.required-workers-max-wait (default: 5 minutes) before being scheduled on the cluster if the required number of nodes (configured through query-manager.required-workers) cannot be provisioned. The nodes brought up to meet query-manager.required-workers are autoscaling nodes and adhere to the existing spot ratio configuration; this allows higher usage of spot nodes in the cluster. Queries that do not require multiple worker nodes to execute, such as queries on the JMX, system, and information schema connectors, or queries such as SELECT 1 and SHOW CATALOGS, are executed immediately. While queries are running on the cluster, the number of worker nodes set through query-manager.required-workers is maintained in the cluster. The cluster downscales back to the minimum configured size when there are no active queries.
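
For example, a cluster admin might keep the minimum cluster size at 1 but require five workers, waiting at most three minutes for them, before queries are scheduled. This is a sketch that assumes the properties go under the config.properties section of the cluster's Presto configuration overrides; the values are illustrative:

config.properties:
query-manager.required-workers=5
query-manager.required-workers-max-wait=3m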

Controlling the Nodes Downscaling Velocity

During each configured ascm.decision.interval (default: 10s), the Presto autoscaling service removes ascm.downscaling.group-size nodes (default: 5) if it has calculated the optimal size of the cluster to be less than the current cluster size continuously for the configured Cool down period. This results in a downscaling profile in which no nodes are removed during the Cool down period, after which nodes are removed very aggressively until the cluster reaches its optimal size.

This figure illustrates the downscaling profile of cluster nodes.

../../_images/Downscaling-profile.png
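
The two properties that govern this aggressive downscaling can be set as cluster overrides. The following is a sketch that assumes the config.properties section; the values shown are the defaults described above, so you would only set them to change the behavior:

config.properties:
ascm.decision.interval=10s
ascm.downscaling.group-size=5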

To control the node downscaling velocity, Qubole provides a Presto cluster configuration override, ascm.downscaling.staggered=true. When you set this override on the cluster, every time a downscaling action is triggered, the Cool down period (default: 5 minutes) is reset. The autoscaling service does not trigger the next downscaling action until it calculates the optimal size of the cluster to be less than the current cluster size continuously for the configured Cool down period. This results in a more gradual downscaling profile in which ascm.downscaling.group-size nodes are removed in each Cool down period until the cluster reaches its optimal size.
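
A sketch of the staggered-downscaling override, again assuming the config.properties section; the group-size value is illustrative and matches the examples that follow:

config.properties:
ascm.downscaling.staggered=true
ascm.downscaling.group-size=2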

For a better understanding, consider the following two examples.

Example 1: Consider a cluster without ascm.downscaling.staggered enabled.

The configured Cool down period is 10m. The current cluster size is 12 and the optimal size is 2, with ascm.downscaling.group-size=2.

In this case, no nodes are removed for 10 minutes, that is, while the Cool down period lasts. After that, 2 nodes are removed every 10 seconds until the cluster size is 2.

The total time taken to reach the optimal size is cool down period + ((current - optimal) / group_size) * 10s = 10 minutes and 50 seconds.

Example 2: Consider a cluster with ascm.downscaling.staggered enabled.

The configured Cool down period is 2m. The current cluster size is 12 and the optimal size is 2, with ascm.downscaling.group-size=2.

In this case, 2 nodes are removed every 2 minutes until the cluster size is 2.

The total time taken to reach the optimal size is ((current - optimal) / group_size) * cool down period = 10 minutes.

In addition, Presto supports resource groups-based dynamic cluster sizing at the cluster and account levels, as described in Resource Groups based Dynamic Cluster Sizing in Presto.

Decommissioning a Worker Node

Qubole allows you to gracefully decommission a worker node during autoscaling through the cluster’s master node. If you find a problematic worker node, then you can manually decommission it.

To get the instance ID of the worker node that you want to decommission, run this API call from the cluster’s master node; it returns a list of all node instance IDs.

curl localhost:8081/v1/cm/list/

From the list in the API response, note the instance ID of the worker node that you want to decommission and use it in the API call below.

From the cluster’s master node, run this command to decommission the worker node:

curl -X DELETE localhost:8081/v1/cm/worker/instance-id

Where instance-id is the worker node’s instance ID.
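
For example, if the list call returned a worker whose instance ID is i-0123456789abcdef0 (a hypothetical value), the command would be:

curl -X DELETE localhost:8081/v1/cm/worker/i-0123456789abcdef0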

This puts the specified worker node into a decommissioning state; it completes its ongoing work (if any) and only then shuts down gracefully. The node is not used to execute any new queries submitted after you call the node-decommissioning API.