Auto-scaling in Qubole Clusters

What Auto-scaling Is

Auto-scaling is a mechanism built into QDS that adds and removes cluster nodes while the cluster is running, to keep just the right number of nodes running to handle the workload. Auto-scaling automatically adds resources when computing or storage demand increases, while also keeping the number of nodes at the minimum needed to meet your processing needs efficiently.

How Auto-scaling Works

When you configure a cluster, you choose the minimum and maximum number of nodes the cluster will contain (Minimum Slave Count and Maximum Slave Count, respectively). While the cluster is running, QDS continuously monitors the progress of tasks and the number of active nodes to make sure that:

  • Tasks are being performed efficiently (so that they complete within an amount of time set by a configurable default).
  • No more nodes are active than are needed to complete the tasks.

If the first criterion is at risk, and adding nodes will correct the problem, QDS adds as many nodes as are needed, up to the Maximum Slave Count. This is called upscaling.

If the second criterion is not being met, QDS removes idle nodes, down to the Minimum Slave Count. This is called downscaling, or decommissioning.
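
The overall bookkeeping can be pictured with a short sketch. This is not QDS code; the function and its inputs are hypothetical stand-ins for the checks QDS performs:

    # Illustrative sketch only; the names and decision shape are assumptions,
    # not QDS internals.
    def autoscale_decision(active_nodes, min_nodes, max_nodes,
                           nodes_needed_for_sla, idle_nodes):
        """Positive result: nodes to add; negative: nodes to remove; 0: no change."""
        if nodes_needed_for_sla > 0:
            # Upscaling: add what is needed, but never exceed the maximum.
            return min(nodes_needed_for_sla, max_nodes - active_nodes)
        if idle_nodes > 0:
            # Downscaling: remove idle nodes, but never drop below the minimum.
            return -min(idle_nodes, active_nodes - min_nodes)
        return 0

    # 6 nodes running, bounds 2..10: 7 more wanted, but only 4 allowed.
    print(autoscale_decision(6, 2, 10, 7, 0))   # 4
    # 6 nodes running, 5 idle: only 4 can go before hitting the minimum of 2.
    print(autoscale_decision(6, 2, 10, 0, 5))   # -4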

The topics that follow provide details:

Note

You can improve auto-scaling efficiency by enabling container packing.

Types of Nodes

Auto-scaling affects Slave nodes only, and operates only on the nodes that make up the difference between the Minimum Slave Count and the Maximum Slave Count (the values you specified in the QDS Cluster UI when you configured the cluster). These are referred to as auto-scaling nodes.

The Master Node(s), and the nodes comprising the Minimum Slave Count, are the stable core of the cluster; they normally remain running as long as the cluster itself is running. These are called core nodes.

Stable and Volatile Nodes on AWS

In AWS clusters, core nodes are stable instances – normally AWS On-Demand or high-priced Spot instances – while auto-scaling nodes can be stable or volatile instances. Volatile instances are Spot instances that can be lost at any time; stable instances are unlikely to be lost. See AWS Settings for a more detailed explanation, and Spot-based Auto-scaling in AWS for further discussion of how QDS manages Spot instances when auto-scaling a cluster.

Upscaling

Launching a Cluster

QDS launches clusters automatically when applications need them. If the application needs a cluster that is not running, QDS launches it with the minimum number of nodes, and scales up as needed toward the maximum.

Upscaling Criteria

QDS bases upscaling decisions on three factors:

  • The rate of progress of the jobs that are running.
  • Whether faster throughput can be achieved by adding nodes.
  • Whether the HDFS storage currently configured for the cluster will be sufficient for all tasks to complete.

Note

This storage criterion currently applies only in Hadoop MRv1 clusters. These are not supported on all Cloud platforms.

Assuming the cluster is running fewer than the configured maximum number of nodes, QDS activates more nodes if, and only if, the configured SLA (Service Level Agreement) will not be met at the current rate of progress, and adding the nodes will improve the rate.

Even if the SLA is not being met, QDS does not add nodes if the workload cannot be distributed more efficiently across more nodes. For example, if three tasks are distributed across three nodes, and progress is slow because the tasks are large and resource-intensive, adding more nodes will not help because the tasks cannot be broken down any further. On the other hand, if tasks are waiting to start because the existing nodes do not have the capacity to run them, then QDS will add nodes.
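
The rule can be summarized in a hedged sketch (the names and signature are invented for illustration, not QDS internals):

    # Illustrative only: the two questions QDS asks before upscaling.
    def should_upscale(projected_finish_secs, sla_secs,
                       waiting_tasks, running_nodes, max_nodes):
        sla_missed = projected_finish_secs > sla_secs
        # Extra nodes help only if tasks are waiting for capacity; a few large,
        # indivisible tasks cannot be spread any thinner.
        more_parallelism = waiting_tasks > 0
        room_to_grow = running_nodes < max_nodes
        return sla_missed and more_parallelism and room_to_grow

    # Three large tasks already spread over three nodes: adding nodes will not help.
    print(should_upscale(600, 120, waiting_tasks=0, running_nodes=3, max_nodes=10))  # False
    # Tasks queued because the existing nodes are out of capacity: upscale.
    print(should_upscale(600, 120, waiting_tasks=8, running_nodes=3, max_nodes=10))  # True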

Downscaling

QDS bases downscaling decisions on the following factors.

Note

In Hadoop 2 MRv2, you can control the maximum number of nodes that can be downscaled simultaneously through mapred.hustler.downscaling.nodes.max.request. Its default value is 500.

Downscaling Criteria

A node is a candidate for decommissioning only if all of the following are true (a sketch of these checks follows the list):

  • The cluster is larger than its configured minimum size. But note this exception: Hadoop 2 and Spark clusters do not downscale to a single slave node once they have been upscaled. When Minimum Slave Count is set to 1, the cluster starts with a single slave node, but once upscaled it never goes back to fewer than two slave nodes; that is, after upscaling, a Hadoop 2 or Spark cluster always has at least two slave nodes, or the configured minimum number of slave nodes, whichever is greater. This is because decommissioning slows down immensely if only one usable node is left for HDFS, so nodes doing no work may be left running, waiting to be decommissioned. You can override this behavior by setting mapred.allow.single.worker.node to true; changing this setting requires a cluster restart.

  • The node is approaching its billing boundary (for example, a one-hour boundary in the case of AWS).

  • No tasks are running.

  • The node is not storing any shuffle data (data from Map tasks for Reduce tasks that are still running).

  • Enough cluster storage will be left after shutting down the node to hold the data that must be kept (including HDFS replicas).

    Note

    This storage consideration does not apply in Presto clusters.
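
A minimal sketch of these checks, using invented parameter names rather than QDS internals (as noted above, the storage check does not apply to Presto):

    # Illustrative only: the decommissioning checks listed above.
    def is_decommission_candidate(cluster_size, min_size, near_billing_boundary,
                                  running_tasks, shuffle_bytes, storage_ok,
                                  is_presto=False):
        return (cluster_size > min_size            # above the configured minimum
                and near_billing_boundary          # e.g. close to an hourly boundary
                and running_tasks == 0             # no tasks in flight
                and shuffle_bytes == 0             # no Map output still needed by Reducers
                and (is_presto or storage_ok))     # storage check does not apply to Presto

    # An idle node near its billing boundary, with enough storage left elsewhere:
    print(is_decommission_candidate(6, 2, True, 0, 0, storage_ok=True))   # True
    # Same node, but removing it would leave too little room for HDFS replicas:
    print(is_decommission_candidate(6, 2, True, 0, 0, storage_ok=False))  # False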

Graceful Shutdown of a Node

If all of the downscaling criteria are met, QDS starts decommissioning the node. QDS ensures a graceful shutdown by:

  • Waiting for all tasks to complete.
  • Ensuring that the node does not accept any new tasks.
  • Transferring HDFS block replicas to other nodes.

Note

Data transfer is not needed in Presto clusters.

Shutting Down an Idle Cluster

By default, QDS shuts the cluster down completely if all of these are true:

  • There have been no jobs in the cluster over a configurable period.
  • At least one node is close to its billing boundary.

You can change this by disabling automatic cluster termination, but Qubole recommends that you leave it enabled – inadvertently allowing an idle cluster to keep running can become an expensive mistake.

Recommissioning a Node

If more jobs enter the pipeline while a node is being decommissioned, and the remaining nodes cannot handle them, the node is recommissioned – the decommissioning process stops and the node is reactivated as a member of the cluster and starts accepting tasks again.

Recommissioning takes precedence over launching new instances: when handling an upscaling request, QDS launches new nodes only if the need cannot be met by recommissioning nodes that are being decommissioned.

Recommissioning is preferable to starting a new instance because:

  • It is more efficient, avoiding bootstrapping a new node.
  • It is cheaper if the node is not yet at its billing boundary (and it can never cost more than provisioning a new node).
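
A hedged sketch of this precedence (hypothetical names only):

    # Illustrative only: recommissioning is tried before launching new instances.
    def satisfy_upscaling_request(nodes_wanted, decommissioning_nodes):
        recommissioned = min(nodes_wanted, decommissioning_nodes)
        new_instances = nodes_wanted - recommissioned
        return recommissioned, new_instances

    # Five more nodes are needed while three are still being decommissioned:
    print(satisfy_upscaling_request(5, 3))  # (3, 2) -> reuse 3, launch only 2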

Note

Recommissioning is not currently supported in Presto clusters.

Spot-based Auto-scaling in AWS

You can specify that a portion of the cluster’s auto-scaling nodes may be potentially volatile AWS Spot instances (as opposed to higher-priced, more stable Spot instances, or On-Demand instances, which are stable but still more expensive). You do this via the Spot Instances Percentage field that appears when you choose the Autoscaling Node Purchasing Option on the Add Cluster and Cluster Settings screens in the QDS Cluster UI. The number you enter in that field specifies the maximum percentage of auto-scaling nodes that QDS can launch as volatile Spot instances.

For example, if you specify a Minimum Slave Count of 2 and a Maximum Slave Count of 10, and you choose Spot Instance as the Autoscaling Node Purchasing Option, you might specify a Spot Instances Percentage of 50. This would allow a maximum of 4 nodes in the cluster (50% of the difference between 2 and 10) to be volatile Spot instances. The non-auto-scaling nodes (those represented by the Minimum Slave Count, 2 in this case) are always stable instances (On-Demand or high-priced Spot instances).
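
The arithmetic in this example can be expressed as a small sketch (the function name is illustrative):

    # Maximum number of auto-scaling nodes that may run as volatile Spot instances.
    def max_volatile_spot_nodes(min_slave_count, max_slave_count, spot_percentage):
        autoscaling_nodes = max_slave_count - min_slave_count
        return int(autoscaling_nodes * spot_percentage / 100)

    # Minimum 2, maximum 10, Spot Instances Percentage 50:
    print(max_volatile_spot_nodes(2, 10, 50))  # 4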

Opting for volatile Spot instances has implications for the working of your cluster, because Spot instances can be lost at any time. To minimize the impact of losing a node in this way, Qubole implements the Qubole Placement Policy, which is in effect by default and makes a best effort to place one copy of each data block on a stable node. You can change the default in the QDS Cluster UI on the Add Cluster and Cluster Settings screens, but Qubole recommends that you leave the policy enabled (Use Qubole Placement Policy checked in the Cluster Composition section). For more information about this policy, and Qubole’s implementation of Spot instances in general, see the blog post Riding the Spotted Elephant.

Spot Rebalancing

Fluctuations in the market may mean that QDS cannot always obtain as many Spot instances as your cluster specification calls for. (QDS tries to obtain the Spot instances for a configurable number of minutes before giving up.) To continue the example in the previous section, suppose your cluster needs to scale up by four additional nodes, but only two Spot instances that meet your requirements (out of the maximum of four you specified) are available. In this case, QDS launches the two Spot instances and (by default) makes up the shortfall by also launching two On-Demand instances, meaning that you will pay more than you had planned for those two instances. (You can change this default behavior in the QDS Cluster UI on the Add Cluster and Cluster Settings screens, by un-checking Fallback to on demand in the Cluster Composition section.)

Whenever the cluster is running a greater proportion of On-Demand instances than you have specified, QDS works to remedy the situation by monitoring the Spot market, and replacing the On-Demand nodes with Spot instances as soon as suitable instances become available. This is called Spot Rebalancing.
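
Putting fallback and rebalancing together, a hedged sketch might look like this (names are illustrative; the real logic also honors the configurable Spot request timeout mentioned above):

    # Illustrative only: fall back to On-Demand, then rebalance back to Spot later.
    def acquire_nodes(spot_wanted, spot_available, fallback_to_on_demand=True):
        spot = min(spot_wanted, spot_available)
        on_demand = (spot_wanted - spot) if fallback_to_on_demand else 0
        return spot, on_demand

    def rebalance(on_demand_excess, spot_now_available):
        # Replace excess On-Demand nodes with Spot instances as they become available.
        return min(on_demand_excess, spot_now_available)

    # Four Spot nodes wanted, only two available: launch 2 Spot + 2 On-Demand.
    print(acquire_nodes(4, 2))  # (2, 2)
    # Later the Spot market recovers: both excess On-Demand nodes can be replaced.
    print(rebalance(2, 5))      # 2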

Note

Spot rebalancing is supported in Hadoop MRv1, Hadoop MRv2, and Spark clusters only.

For a more detailed discussion of this topic, see Rebalancing Hadoop Clusters for Higher Spot Utilization.

HDFS-based Auto-scaling in Hadoop MRv1 Clusters

You can enable HDFS-based auto-scaling by setting

mapred.hustler.dfs.autoscale.enable = true

in the Override Hadoop Configuration Variables field on the Add Cluster and Cluster Settings screens in the Cluster UI, or in your cluster’s mapred-site.xml file.

If HDFS-based auto-scaling is enabled, QDS continuously monitors the cluster’s HDFS storage to ensure that the available capacity remains sufficient for the jobs that are running to complete, and launches more nodes if necessary.

There are two triggers for launching more nodes; these are discussed under Rate-based Auto-scaling and Threshold-based Auto-scaling below.

Rate-based Auto-scaling

As Hadoop jobs are running, QDS monitors the rate at which they are using up storage, and from this computes when to launch additional nodes (taking into account how long it is taking for newly launched nodes to become operational).

Bringing up additional instances is the most efficient and reliable way to increase storage, but it is expensive, so QDS does this only when necessary, and as late as possible. Nodes are launched so that they, and their storage volumes, come online just in time to prevent the cluster from running out of storage, and no sooner.

Threshold-based Auto-Scaling

In addition to monitoring the rate at which the storage is filling up, QDS also compares current usage against a configurable threshold. If usage exceeds this threshold, QDS launches more nodes until usage is below the threshold.
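
A hedged sketch of the two triggers, with invented parameter names (the actual thresholds and internals may differ):

    # Illustrative only: the two HDFS-based upscaling triggers described above.
    def hdfs_upscale_needed(used_gb, capacity_gb, fill_rate_gb_per_min,
                            node_launch_minutes, usage_threshold_pct):
        free_gb = capacity_gb - used_gb
        # Rate-based: will the cluster run out of space before a new node
        # (and its storage) can come online?
        minutes_until_full = (free_gb / fill_rate_gb_per_min
                              if fill_rate_gb_per_min > 0 else float("inf"))
        rate_trigger = minutes_until_full <= node_launch_minutes
        # Threshold-based: has usage already crossed the configured percentage?
        threshold_trigger = used_gb / capacity_gb * 100 > usage_threshold_pct
        return rate_trigger or threshold_trigger

    # 800 of 1000 GB used, filling at 40 GB/min, new nodes take 6 minutes to come up:
    print(hdfs_upscale_needed(800, 1000, 40, 6, usage_threshold_pct=90))  # True (rate)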

HDFS-based Auto-scaling in Hadoop MRv2 AWS Clusters

Note

In Hadoop 2 MRv2, you can control the maximum number of nodes that can be downscaled simultaneously through mapred.hustler.downscaling.nodes.max.request. Its default value is 500.

You can enable HDFS-based auto-scaling on Hadoop MRv2 clusters running on Amazon AWS; these include Spark and Tez clusters, as well as those running MapReduce jobs. To distinguish this type of auto-scaling from the MRv1 case, we’ll refer to it as EBS upscaling.

The following conditions and restrictions apply:

  • EBS upscaling is not enabled by default; use the Clusters API to enable it in a new or existing cluster. For the required EC2 permissions, see Sample Policy for EBS Upscaling.

    You can also enable it on the Spark/Hadoop 2 cluster UI.

    To enable it on the UI, set EBS Volume Count to a value greater than 0 in the drop-down list. Only then can you select Enable EBS Upscaling. Once you select it on the cluster configuration UI, the EBS upscaling settings are displayed with their default values, as shown in the figure below.

    [Figure: EBS Upscaling default configuration]

    You can configure:

    • The maximum number of EBS volumes QDS can add to an instance in Maximum EBS Volume Count.
    • The free-space threshold (in percentage) below which more storage is added in Free Space Threshold. The default is 15%.
    • The free-space capacity (in GB) above which upscaling does not occur in Absolute Free Space Threshold. The default is 100 GB.
    • The frequency (in seconds) at which the capacity of the logical volume is sampled in Sampling Interval. The default is 30 seconds.
    • The number of sampling intervals over which Qubole evaluates the rate of increase of used capacity in Sampling Window. The default is 5.
  • EBS upscaling is available only on instance types on which Qubole supports adding EBS storage.

  • QDS auto-scaling adds EBS storage but does not remove it directly. Downscaling occurs on a node-by-node basis; the storage is removed when the node is decommissioned.

EBS upscaling works as follows (a sketch of this logic follows the list).

  • The underlying mechanism is Logical Volume Manager (LVM).
  • When you configure EBS storage for a node, QDS configures the EBS volumes into an LVM volume group comprising a single logical volume.
  • When free space on that volume falls below the free-space threshold, QDS adds EBS volumes to the logical volume and resizes the file system.
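
A hedged sketch of this decision, using the settings listed above (the function, and the default Maximum EBS Volume Count shown here, are illustrative; the real implementation also drives LVM and the filesystem resize):

    # Illustrative only: decide whether to attach another EBS volume to the
    # node's LVM logical volume, based on the settings described above.
    def ebs_volume_needed(free_gb, total_gb, current_volumes,
                          max_volume_count=3,            # Maximum EBS Volume Count (assumed)
                          free_space_threshold_pct=15,   # Free Space Threshold default
                          absolute_free_space_gb=100):   # Absolute Free Space Threshold default
        if current_volumes >= max_volume_count:
            return False                                 # nothing more can be added
        if free_gb >= absolute_free_space_gb:
            return False                                 # enough room in absolute terms
        return free_gb / total_gb * 100 < free_space_threshold_pct

    # 60 GB free on a 500 GB logical volume (12%), one of three volumes attached:
    print(ebs_volume_needed(60, 500, current_volumes=1))  # True -> add a volume, resize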

Aggressive Downscaling

Aggressive downscaling is triggered when you reduce the maximum size of a cluster while it’s running. The subsections that follow explain how it works. First you need to understand what happens when you decrease (or increase) the size of a running cluster.

Note

Aggressive downscaling is not supported in Spark or Presto clusters.

In Hadoop 2 MRv2, you can control the maximum number of nodes that can be downscaled simultaneously through mapred.hustler.downscaling.nodes.max.request. Its default value is 500.

Effects of Changing Slave Count Variables while the Cluster is Running

You can change the Minimum Slave Count and Maximum Slave Count while the cluster is running. You do this via the Cluster Settings screen in the QDS UI, just as you would if the cluster were down.

To force the change to take effect dynamically (while the cluster is running) you must push it, as described here. Exactly what happens then depends on the current state of the cluster and configuration settings. Here are the details for both variables.

Minimum Slave Count

An increase or reduction in the minimum count takes effect dynamically by default (mapred.refresh.min.cluster.size is set to true by default only on a Hadoop 2 cluster).

Maximum Slave Count

  • An increase in the maximum count always takes effect dynamically.
  • A reduction in the maximum count produces the following behavior:
    • If the current cluster size is smaller than the new maximum, the change takes effect dynamically. For example, if the maximum is 15, 10 nodes are currently running, and you reduce the maximum count to 12, then 12 is the maximum from now on.
    • If the current cluster size is greater than the new maximum, QDS begins reducing the cluster to the new maximum, and subsequent upscaling will not exceed the new maximum. In this case, the default behavior for reducing the number of running nodes is aggressive downscaling.

How Aggressive Downscaling Works

By default, mapred.aggressive.downscale.enable is set to true only on a Hadoop 2 (Hive) cluster. It is disabled by default on a Hadoop 1 cluster. To enable aggressive downscaling on a Hadoop 1 cluster, set mapred.aggressive.downscale.enable to true as an override in the Override Hadoop Configuration Variables field (in the Hadoop Cluster Settings section of the cluster configuration UI).

If you decrease the Maximum Slave Count while the cluster is running, and more than the new maximum number of nodes are actually running, then QDS begins aggressive downscaling.

If aggressive downscaling is triggered, QDS selects the nodes that are:

  • closest to completing their tasks
  • closest to their billing boundary
  • (in the case of Hadoop MRv2 clusters) closest to the time limit for their containers.

Once selected, these nodes stop accepting new jobs and QDS shuts them down gracefully until the cluster is at its new maximum size (or the maximum needed for the current workload, whichever is smaller).
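
A hedged sketch of this selection (the field names and the relative weighting of the three criteria are assumptions, not QDS internals):

    # Illustrative only: pick the nodes to shut down first during aggressive downscaling.
    def select_nodes_to_remove(nodes, current_size, new_max):
        excess = max(0, current_size - new_max)
        # Prefer nodes closest to finishing their tasks, closest to their billing
        # boundary, and (for MRv2) closest to their containers' time limit.
        ranked = sorted(nodes, key=lambda n: (n["secs_to_task_completion"],
                                              n["secs_to_billing_boundary"],
                                              n["secs_to_container_limit"]))
        return ranked[:excess]

    nodes = [
        {"name": "a", "secs_to_task_completion": 30, "secs_to_billing_boundary": 600, "secs_to_container_limit": 120},
        {"name": "b", "secs_to_task_completion": 300, "secs_to_billing_boundary": 60, "secs_to_container_limit": 900},
        {"name": "c", "secs_to_task_completion": 30, "secs_to_billing_boundary": 60, "secs_to_container_limit": 300},
    ]
    # 12 nodes running, new maximum of 10: remove the two best candidates.
    print([n["name"] for n in select_nodes_to_remove(nodes, current_size=12, new_max=10)])  # ['c', 'a']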

You can use the QDS Cluster UI to disable aggressive downscaling by setting mapred.aggressive.downscale.enable to false in the Override Hadoop Configuration Variables field (in the Hadoop Cluster Settings section) when you add or modify the cluster. In this case, nodes continue to accept new jobs, and the cluster is reduced to the new size by normal downscaling.

Offloading

Normally MapReduce requires a Mapper’s hosting cluster node, which stores the Mapper output, to remain up until the Reducer finishes and the job completes. Only then can the Mapper process quit. This means that you may be paying for Cloud instances that are idle and merely waiting for Reducers to finish.

QDS can optionally remove this bottleneck, and reduce your costs, by offloading the Mapper data either to HDFS or to Cloud storage outside the cluster. When you enable offloading, a Mapper needs to run only until its data has been offloaded; once the data has been copied, and all other criteria are met, the Mapper node can be decommissioned.

To enable offloading, use the Cluster Configuration UI to set qubole.offload.enable to true in the Override Hadoop Configuration Variables field in the Hadoop Cluster Settings section when you add or modify the cluster.

By default, the data is offloaded to a Cloud storage location (s3.log.location for AWS). You can use the Override Hadoop Configuration Variables field to change this to:

  • a different Cloud storage location, by setting the path in qubole.offload.path, or
  • an HDFS path, by setting qubole.offload.protocol to hdfs:// and qubole.offload.path to the HDFS path.

Note

Offloading is supported in Hadoop MRv1 clusters only. These are not supported on all Cloud platforms.

How Auto-scaling Works in Practice

Hadoop MRv1

Here’s how auto-scaling works in a Hadoop MRv1 cluster:

  • Each node in the cluster reports its launch time to the JobTracker, which keeps track of how long each node has been running.
  • The JobTracker continuously monitors the pending and current work in the system and computes the amount of time required to finish it. If that time exceeds the pre-configured threshold (for example, two minutes) and there is sufficient parallelism in the workload, then the JobTracker adds more nodes to the cluster.
  • Whenever a node approaches its billing boundary (for example, each hour), the JobTracker checks to see if there is enough work in the system to justify continuing to run this node. If not, and the criteria described above under Downscaling Criteria are met, the JobTracker decommissions the node.

Note

Hadoop MRv1 clusters are not supported on all Cloud platforms.

Hadoop MRv2

Here’s how auto-scaling works in a Hadoop MRv2 cluster:

Note

Offloading is not supported in Hadoop MRv2 clusters.

In Hadoop 2 MRv2, you can control the maximum number of nodes that can be downscaled simultaneously through mapred.hustler.downscaling.nodes.max.request. Its default value is 500.

  • Each node in the cluster reports its launch time to the ResourceManager, which keeps track of how long each node has been running.
  • YARN ApplicationMasters request YARN resources (containers) for each Mapper and Reducer task. If the cluster does not have enough resources to meet these requests, the requests remain pending.
  • On the basis of a pre-configured threshold for completing tasks (for example, two minutes), and the number of pending requests, ApplicationMasters create special auto-scaling container requests.
  • The ResourceManager sums the ApplicationMasters’ auto-scaling container requests, and on that basis adds more nodes (up to the configured Maximum Slave Count).
  • Whenever a node approaches its billing boundary (for example, each hour), the ResourceManager checks to see if any task or shuffle process is running on this node. If not, the ResourceManager decommissions the node.
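
The flow above can be condensed into a hedged sketch (the containers-per-node conversion and the names are assumptions made for illustration):

    # Illustrative only: ResourceManager-side view of MRv2 upscaling.
    import math

    def mrv2_nodes_to_add(pending_autoscale_containers, containers_per_node,
                          active_nodes, max_nodes):
        # Sum of the ApplicationMasters' auto-scaling container requests,
        # converted to nodes and capped at the configured maximum.
        nodes_wanted = math.ceil(pending_autoscale_containers / containers_per_node)
        return min(nodes_wanted, max_nodes - active_nodes)

    # 35 auto-scaling containers requested, 8 fit per node,
    # 6 of a maximum of 10 nodes running: add min(5, 4) = 4 nodes.
    print(mrv2_nodes_to_add(35, 8, active_nodes=6, max_nodes=10))  # 4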

Note

You can improve auto-scaling efficiency by enabling container packing.

Presto

Here’s how auto-scaling works in a Presto cluster:

Note

Spot Rebalancing, HDFS-based Auto-scaling, Aggressive Downscaling, and Offloading are not supported in Presto clusters. Presto is not supported on all Cloud platforms.

  • The Presto Server (running on the master node) keeps track of the launch time of each node.
  • At regular intervals (10 seconds by default) the Presto Server takes a snapshot of the state of running queries, compares it with the previous snapshot, and estimates the time required to finish the queries. If this time exceeds a threshold value (set to one minute and currently not configurable), the Server adds more nodes to the cluster.
  • Whenever a node approaches its billing boundary (for example, each hour), the Presto Server checks to see if there is enough work in the system to justify continuing to run this node. If not, and the cluster is running more than the configured minimum number of nodes, the Server removes the node from the cluster.
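
A hedged sketch of the Presto Server’s rate-based estimate from successive snapshots (the progress measure and names are invented for illustration):

    # Illustrative only: estimate time-to-finish from two snapshots of query progress.
    def presto_upscale_needed(prev_done_pct, curr_done_pct,
                              snapshot_interval_secs=10, threshold_secs=60):
        progress_rate = (curr_done_pct - prev_done_pct) / snapshot_interval_secs
        if progress_rate <= 0:
            return True                      # no visible progress: add nodes
        remaining_secs = (100 - curr_done_pct) / progress_rate
        return remaining_secs > threshold_secs

    # Queries moved from 40% to 45% done in the last 10 seconds:
    # (100 - 45) / 0.5 = 110 seconds remaining, above the one-minute threshold.
    print(presto_upscale_needed(40, 45))     # True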

Note

Because all processing in Presto is in-memory and no intermediate data is written to HDFS, the HDFS-related decommissioning tasks are not needed in a Presto cluster.

Sometimes, while tracking new nodes added during upscaling in the Query Tracker, you may notice that the new nodes are not being used by the running query. This is because new nodes are used only if certain operations, such as TableScans and Partial Aggregations, are in progress after the new nodes are added. If the query has no such operations in progress, the new nodes remain unused. However, new queries started after the nodes are added can use the new nodes regardless of the operation types in those queries.

To find out which operations in a running query can use nodes added by upscaling, run EXPLAIN (TYPE DISTRIBUTED). Operations that are part of Fragments and appear as [SOURCE] can use new nodes while the query is being executed.

Spark

Here’s how auto-scaling works in a Spark cluster:

Note

Aggressive Downscaling and Offloading are not supported in Spark clusters.

  • You can configure Spark auto-scaling at the cluster level and at the job level.
  • Spark applications consume YARN resources (containers); QDS monitors container usage and launches new nodes (up to the configured Maximum Slave Count) as needed.
  • If you have enabled job-level auto-scaling, QDS monitors the running jobs and their rate of progress, and launches new executors as needed (and hence new nodes if necessary).
  • As jobs complete, QDS selects candidates for downscaling and initiates Graceful Shutdown of those nodes that meet the criteria.

For a detailed discussion and instructions, see Autoscaling in Spark.

Note

You can improve auto-scaling efficiency by enabling container packing.

Tez

Here’s how auto-scaling works in a Tez cluster:

Note

Offloading is not supported in Tez clusters. Tez is not supported on all Cloud platforms.

  • Each node in the cluster reports its launch time to the ResourceManager, which keeps track of how long each node has been running.
  • YARN ApplicationMasters request YARN resources (containers) for each Mapper and Reducer task. If the cluster does not have enough resources to meet these requests, the requests remain pending.
  • ApplicationMasters monitor the progress of the DAG (on the Mapper nodes) and calculate how long it will take to finish their tasks at the current rate.
  • On the basis of a pre-configured threshold for completing tasks (for example, two minutes), and the number of pending requests, ApplicationMasters create special auto-scaling container requests.
  • The ResourceManager sums the ApplicationMasters’ auto-scaling container requests, and on that basis adds more nodes (up to the configured Maximum Slave Count).
  • Whenever a node approaches its billing boundary (for example, each hour), the ResourceManager checks to see if any task or shuffle process is running on this node. If not, the ResourceManager decommissions the node.

For More Information

For more information about configuring and managing QDS clusters, see: