Auto-scaling in Qubole Clusters¶
What Auto-scaling Is¶
Auto-scaling is a mechanism built into QDS that adds and removes cluster nodes while the cluster is running, keeping just the right number of nodes active to handle the workload. Auto-scaling automatically adds resources when computing or storage demand increases, while keeping the number of nodes at the minimum needed to meet your processing needs efficiently.
How Auto-scaling Works¶
When you configure a cluster, you choose the minimum and maximum number of nodes the cluster will contain (Minimum Slave Count and Maximum Slave Count, respectively). While the cluster is running, QDS continuously monitors the progress of tasks and the number of active nodes to make sure that:
- Tasks are being performed efficiently (completing within a configurable default amount of time).
- No more nodes are active than are needed to complete the tasks.
If the first criterion is at risk, and adding nodes will correct the problem, QDS adds as many nodes as are needed, up to the Maximum Slave Count. This is called upscaling.
If the second criterion is not being met, QDS removes idle nodes, down to the Minimum Slave Count. This is called downscaling, or decommissioning.
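These two rules can be sketched as a simple sizing function. This is a minimal illustration with hypothetical names (`nodes_needed`, `idle_nodes`); the actual QDS logic is internal and considerably more involved:

```python
def target_node_count(current, minimum, maximum, nodes_needed, idle_nodes):
    """Node count the auto-scaler would aim for, per the two rules above.

    nodes_needed: nodes required to finish the workload within the SLA.
    idle_nodes:   nodes currently doing no work.
    """
    if nodes_needed > current:
        # Upscaling: add only as many nodes as needed, capped at the maximum.
        return min(nodes_needed, maximum)
    if idle_nodes > 0:
        # Downscaling: shed idle nodes, but never drop below the minimum.
        return max(current - idle_nodes, minimum)
    return current
```

For example, with a minimum of 2 and a maximum of 10, a cluster of 5 nodes that needs 8 to meet the SLA is scaled to 8, while a cluster of 8 nodes with 4 idle is scaled down to 4.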
The topics that follow provide details:
- Types of Nodes
- Downscaling (Decommissioning)
- Spot-based Auto-scaling in AWS
- HDFS-based Auto-scaling in Hadoop MRv1 Clusters
- HDFS-based Auto-scaling in Hadoop MRv2 AWS Clusters
- Aggressive Downscaling
- How Auto-scaling Works in Practice
Types of Nodes¶
Auto-scaling operates only on the nodes that comprise the difference between the Minimum Slave Count and Maximum Slave Count (the values you specified in the QDS Control Panel when you configured the cluster), and affects Slave nodes only; these are referred to as auto-scaling nodes.
The Master node(s), and the nodes comprising the Minimum Slave Count, form the stable core of the cluster; they normally remain running as long as the cluster itself is running. These are called core nodes.
Stable and Volatile Nodes on AWS¶
In AWS clusters, core nodes are stable instances – normally AWS On-Demand or high-priced Spot instances – while auto-scaling nodes can be stable or volatile instances. Volatile instances are Spot instances that can be lost at any time; stable instances are unlikely to be lost. See AWS Settings for a more detailed explanation, and Spot-based Auto-scaling in AWS for further discussion of how QDS manages Spot instances when auto-scaling a cluster.
Launching a Cluster
QDS launches clusters automatically when applications need them. If the application needs a cluster that is not running, QDS launches it with the minimum number of nodes, and scales up as needed toward the maximum.
QDS bases upscaling decisions on three factors:
- The rate of progress of the jobs that are running.
- Whether faster throughput can be achieved by adding nodes.
- Whether the HDFS storage currently configured for the cluster will be sufficient for all tasks to complete.
Assuming the cluster is running fewer than the configured maximum number of nodes, QDS activates more nodes if, and only if, the configured SLA (Service Level Agreement) will not be met at the current rate of progress, and adding the nodes will improve the rate.
Even if the SLA is not being met, QDS does not add nodes if the workload cannot be distributed more efficiently across more nodes. For example, if three tasks are distributed across three nodes, and progress is slow because the tasks are large and resource-intensive, adding more nodes will not help because the tasks cannot be broken down any further. On the other hand, if tasks are waiting to start because the existing nodes do not have the capacity to run them, then QDS will add nodes.
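The upscaling decision described above can be sketched as follows. This is an illustrative model with hypothetical parameter names; the real QDS calculation also weighs the observed rate of progress:

```python
def nodes_to_add(pending_tasks, free_slots, current_nodes, max_nodes,
                 slots_per_node, sla_at_risk):
    """Nodes QDS would add: only when the SLA is at risk AND tasks are
    actually waiting for capacity, so that extra nodes would help."""
    if not sla_at_risk:
        return 0
    waiting = max(pending_tasks - free_slots, 0)
    if waiting == 0:
        # Running tasks are just slow and cannot be subdivided further;
        # adding nodes would not improve the rate of progress.
        return 0
    needed = -(-waiting // slots_per_node)  # ceiling division
    return min(needed, max_nodes - current_nodes)
```

With three large tasks already running on three nodes and nothing waiting (`pending_tasks == free_slots == 0` beyond the running work), the function returns 0, matching the example in the text.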
QDS bases downscaling decisions on the following factors.
A node is a candidate for decommissioning only if:
- The cluster is larger than its configured minimum size. However, Hadoop 2/Spark clusters do not downscale to one node once they have upscaled: when the minimum size is 1, the cluster starts with a single node, but after upscaling it never returns to one node; a minimum of two slave nodes, or the configured minimum number of slave nodes, whichever is greater, is maintained. This is because decommissioning slows down immensely if only one usable node is left for HDFS, potentially leaving several nodes hanging around, doing no work. You can override this behaviour by setting the relevant configuration parameter to true; this change requires a cluster restart.
- The node is approaching its billing boundary (for example, a one-hour boundary in the case of AWS).
- No tasks are running.
- The node is not storing any shuffle data (data from Map tasks for Reduce tasks that are still running).
- Enough cluster storage will be left after shutting down the node to hold the data that must be kept (including HDFS replicas).
This storage consideration does not apply in Presto clusters.
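The downscaling criteria listed above combine into a single candidate check. The sketch below uses hypothetical field names for illustration:

```python
def is_decommission_candidate(cluster_size, min_size, near_billing_boundary,
                              running_tasks, holds_shuffle_data,
                              node_hdfs_bytes, remaining_free_bytes, is_presto):
    """True only if a node meets every downscaling criterion."""
    if cluster_size <= min_size or not near_billing_boundary:
        return False
    if running_tasks > 0 or holds_shuffle_data:
        return False
    # Storage criterion: the rest of the cluster must be able to absorb
    # this node's HDFS data (including replicas). Not applicable to Presto.
    if not is_presto and remaining_free_bytes < node_hdfs_bytes:
        return False
    return True
```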
Graceful Shutdown of a Node
If all of the downscaling criteria are met, QDS starts decommissioning the node. QDS ensures a graceful shutdown by:
- Waiting for all tasks to complete.
- Ensuring that the node does not accept any new tasks.
- Transferring HDFS block replicas to other nodes.
Data transfer is not needed in Presto clusters.
Shutting Down an Idle Cluster
By default, QDS shuts the cluster down completely if all of these are true:
- There have been no jobs in the cluster over a configurable period.
- At least one node is close to its billing boundary.
You can change this by disabling automatic cluster termination, but Qubole recommends that you leave it enabled – inadvertently allowing an idle cluster to keep running can become an expensive mistake.
Recommissioning a Node
If more jobs enter the pipeline while a node is being decommissioned, and the remaining nodes cannot handle them, the node is recommissioned – the decommissioning process stops and the node is reactivated as a member of the cluster and starts accepting tasks again.
Recommissioning takes precedence over launching new instances: when handling an upscaling request, QDS launches new nodes only if the need cannot be met by recommissioning nodes that are being decommissioned.
Recommissioning is preferable to starting a new instance because:
- It is more efficient, avoiding bootstrapping a new node.
- It is cheaper, if the node is not yet at its billing boundary. (And it can never cost more than provisioning a new node.)
Recommissioning is not currently supported in Presto clusters.
Spot-based Auto-scaling in AWS¶
You can specify that a portion of the cluster’s auto-scaling nodes may be volatile AWS Spot instances (as opposed to higher-priced, more stable Spot instances, or On-Demand instances, which are stable but more expensive). You do this via the Spot Instances Percentage field that appears when you choose the Autoscaling Node Purchasing Option on the Add Cluster and Cluster Settings screens in the QDS Control Panel. The number you enter in that field specifies the maximum percentage of auto-scaling nodes that QDS can launch as volatile Spot instances.
For example, if you specify a Minimum Slave Count of 2 and a Maximum Slave Count of 10, and you choose Spot Instance as the Autoscaling Node Purchasing Option, you might specify a Spot Instances Percentage of 50. This would allow a maximum of 4 nodes in the cluster (50% of the difference between 2 and 10) to be volatile Spot instances. The non-auto-scaling nodes (those represented by the Minimum Slave Count, 2 in this case) are always stable instances (On-Demand or high-priced Spot instances).
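The arithmetic in this example is simply a percentage of the auto-scaling range:

```python
def max_volatile_spot_nodes(min_slaves, max_slaves, spot_percentage):
    """Maximum number of auto-scaling nodes that may be volatile Spot
    instances: a percentage of the difference between the configured
    Maximum and Minimum Slave Counts."""
    autoscaling_nodes = max_slaves - min_slaves
    return int(autoscaling_nodes * spot_percentage / 100)
```

With the values from the text (minimum 2, maximum 10, 50%), this yields 4 volatile Spot nodes at most.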
Opting for volatile Spot instances has implications for the working of your cluster, because Spot instances can be lost at any time. To minimize the impact of losing a node in this way, Qubole implements the Qubole Placement Policy, which is in effect by default and makes a best effort to place one copy of each data block on a stable node. You can change the default in the QDS Control Panel on the Add Cluster and Cluster Settings screens, but Qubole recommends that you leave the policy enabled (Use Qubole Placement Policy checked in the Cluster Composition section). For more information about this policy, and Qubole’s implementation of Spot instances in general, see the blog post Riding the Spotted Elephant.
Fluctuations in the market may mean that QDS cannot always obtain as many Spot instances as your cluster specification calls for. (QDS tries to obtain the Spot instances for a configurable number of minutes before giving up.) To pursue the example in the previous section, suppose your cluster needs to scale up by four additional nodes, but only two Spot instances that meet your requirements (out of the maximum of four you specified) are available. In this case, QDS will launch the two Spot instances, and (by default) make up the shortfall by also launching two On-Demand instances, meaning that you will be paying more than you had hoped in the case of those two instances. (You can change this default behavior in the QDS Control Panel on the Add Cluster and Cluster Settings screens, by un-checking Fallback to on demand in the Cluster Composition section.)
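The fallback behavior described above amounts to splitting an upscaling request between Spot and On-Demand capacity. A minimal sketch, with hypothetical names:

```python
def plan_upscale(nodes_needed, spot_available, fallback_to_on_demand=True):
    """Split an upscaling request into (spot, on_demand) node counts.

    If fewer Spot instances are available than needed, the shortfall is
    covered by On-Demand instances unless fallback is disabled (the
    'Fallback to on demand' checkbox in the Cluster Composition section).
    """
    spot = min(nodes_needed, spot_available)
    shortfall = nodes_needed - spot
    on_demand = shortfall if fallback_to_on_demand else 0
    return spot, on_demand
```

In the example from the text, a request for 4 nodes with only 2 suitable Spot instances available yields 2 Spot plus 2 On-Demand nodes; with fallback disabled, only the 2 Spot nodes are launched.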
Whenever the cluster is running a greater proportion of On-Demand instances than you have specified, QDS works to remedy the situation by monitoring the Spot market, and replacing the On-Demand nodes with Spot instances as soon as suitable instances become available. This is called Spot Rebalancing.
For a more detailed discussion of this topic, see Rebalancing Hadoop Clusters for Higher Spot Utilization.
HDFS-based Auto-scaling in Hadoop MRv1 Clusters¶
You can enable HDFS-based auto-scaling by setting
mapred.hustler.dfs.autoscale.enable = true
in the Override Hadoop Configuration Variables field on the Add Cluster and Cluster Settings screens in the QDS Control Panel.
If HDFS-based auto-scaling is enabled, QDS continuously monitors the cluster’s HDFS storage to ensure that the available capacity remains sufficient for the jobs that are running to complete, and launches more nodes if necessary.
There are two triggers for launching more nodes; these are discussed under Rate-based Auto-scaling and Threshold-based Auto-scaling below.
Hadoop MRv1 is not supported on all Cloud platforms. See also HDFS-based Auto-scaling in Hadoop MRv2 AWS Clusters.
Rate-based Auto-scaling¶
As Hadoop jobs are running, QDS monitors the rate at which they are using up storage, and from this computes when to launch additional nodes (taking into account how long it is taking for newly launched nodes to become operational).
Bringing up additional instances is the most efficient and reliable way to increase storage, but it is expensive, so QDS ensures that this is done only if it has to be, and is done as late as possible. Nodes are launched so they, and their storage volumes, come online just in time to prevent the cluster from running out of storage, and no sooner.
Threshold-based Auto-scaling¶
In addition to monitoring the rate at which the storage is filling up, QDS also compares current usage against a configurable threshold. If usage exceeds this threshold, QDS launches more nodes until usage is below the threshold.
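The two triggers can be combined into one decision, sketched below. The names and the 75% threshold are illustrative assumptions, not QDS defaults:

```python
def should_launch_now(free_bytes, fill_rate_bytes_per_sec, node_spinup_sec,
                      usage_ratio, usage_threshold=0.75):
    """Launch nodes when storage would run out before a new node could come
    online (rate-based), or when usage crosses a threshold (threshold-based)."""
    if usage_ratio > usage_threshold:
        return True  # threshold-based trigger
    if fill_rate_bytes_per_sec <= 0:
        return False  # storage is not filling up
    # Rate-based trigger: launch "just in time", so that the new node and
    # its volumes come online right before storage would be exhausted.
    seconds_until_full = free_bytes / fill_rate_bytes_per_sec
    return seconds_until_full <= node_spinup_sec
```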
HDFS-based Auto-scaling in Hadoop MRv2 AWS Clusters¶
You can enable HDFS-based auto-scaling on Hadoop MRv2 clusters running on Amazon AWS; these include Spark and Tez clusters, as well as those running MapReduce jobs. To distinguish this type of auto-scaling from the MRv1 case, we’ll refer to it as EBS upscaling.
The following conditions and restrictions apply:
EBS upscaling is not enabled by default; use the Clusters API to enable it in a new or existing cluster.
You can also enable it on the Spark/Hadoop 2 cluster UI.
To enable it in the UI, set EBS Volume Count to a value greater than 0 in the drop-down list; only then can you select Enable EBS Upscaling. Once you select it on the cluster configuration UI, the EBS upscaling settings are displayed with their default values.
You can configure:
- The maximum number of EBS volumes QDS can add to an instance in Maximum EBS Volume Count.
- The free-space threshold (in percentage) below which more storage is added in Free Space Threshold. The default is 15%.
- The free-space capacity (in GB) above which upscaling does not occur in Absolute Free Space Threshold. The default is 100 GB.
- The frequency (in seconds) at which the capacity of the logical volume is sampled in Sampling Interval. The default is 30 seconds.
- The number of sampling intervals over which Qubole evaluates the rate of increase of used capacity in Sampling Window. The default is 5.
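These settings combine into a per-sample decision roughly like the following sketch. The function name is hypothetical; the defaults mirror the values given above (15% free-space threshold, 100 GB absolute threshold):

```python
def should_add_ebs_volume(free_pct, free_gb, volumes_attached,
                          max_volume_count, free_space_threshold_pct=15,
                          absolute_free_space_gb=100):
    """Decide whether to attach another EBS volume to a node's logical volume."""
    if volumes_attached >= max_volume_count:
        return False  # Maximum EBS Volume Count reached
    if free_gb > absolute_free_space_gb:
        return False  # ample free space in absolute terms; no upscaling
    # Upscale when the free-space percentage drops below the threshold.
    return free_pct < free_space_threshold_pct
```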
EBS upscaling is available only on instance types on which Qubole supports adding EBS storage.
QDS auto-scaling adds EBS storage but does not remove it directly. Downscaling occurs on a node-by-node basis; the storage is removed when the node is decommissioned.
EBS upscaling works as follows.
- The underlying mechanism is Logical Volume Manager (LVM).
- When you configure EBS storage for a node, QDS configures the EBS volumes into an LVM volume group comprising a single logical volume.
- When free space on that volume falls below the free-space threshold, QDS adds EBS volumes to the logical volume and resizes the file system.
Aggressive Downscaling¶
Aggressive downscaling is triggered when you reduce the maximum size of a cluster while it is running. The subsections that follow explain how it works. First you need to understand what happens when you decrease (or increase) the size of a running cluster.
Effects of Changing Slave Count Variables while the Cluster is Running¶
You can change the Minimum Slave Count and Maximum Slave Count while the cluster is running. You do this via the Cluster Settings screen in the QDS Control Panel, just as you would if the cluster were down.
To force the change to take effect dynamically (while the cluster is running) you must push it, as described here. Exactly what happens then depends on the current state of the cluster and configuration settings. Here are the details for both variables.
Minimum Slave Count
An increase or reduction in the minimum count takes effect dynamically by default (the configuration parameter controlling this behavior is set to true by default).
Maximum Slave Count
- An increase in the maximum count always takes effect dynamically.
- A reduction in the maximum count produces the following behavior:
- If the current cluster size is smaller than the new maximum, the change takes effect dynamically. For example, if the maximum is 15, 10 nodes are currently running and you reduce the maximum count to 12, 12 will be the maximum from now on.
- If the current cluster size is greater than the new maximum, QDS begins reducing the cluster to the new maximum, and subsequent upscaling will not exceed the new maximum. In this case, the default behavior for reducing the number of running nodes is aggressive downscaling.
How Aggressive Downscaling Works¶
Aggressive downscaling is enabled by default (mapred.aggressive.downscale.enable is set to true). If you decrease the Maximum Slave Count while the cluster is running, and more than the new maximum number of nodes are actually running, then QDS begins aggressive downscaling.
If aggressive downscaling is triggered, QDS selects the nodes that are:
- closest to completing their tasks
- closest to their billing boundary
- (in the case of Hadoop MRv2 clusters) closest to the time limit for their containers.
Once selected, these nodes stop accepting new jobs and QDS shuts them down gracefully until the cluster is at its new maximum size (or the maximum needed for the current workload, whichever is smaller).
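The selection order above can be modeled as a multi-key sort. The dictionary field names below are hypothetical, chosen only to mirror the three criteria:

```python
def pick_nodes_to_remove(nodes, target_size):
    """Select the excess nodes to decommission, preferring those closest to
    finishing their tasks, then to their billing boundary, then (for Hadoop
    MRv2) to their container time limit."""
    excess = max(len(nodes) - target_size, 0)
    ranked = sorted(nodes, key=lambda n: (n['secs_to_task_completion'],
                                          n['secs_to_billing_boundary'],
                                          n['secs_to_container_limit']))
    return ranked[:excess]
```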
You can use the QDS Control Panel to disable aggressive downscaling by setting mapred.aggressive.downscale.enable to false in the Override Hadoop Configuration Variables field (in the Hadoop Cluster Settings section) when you add or modify the cluster. In this case, nodes continue to accept new jobs, and the cluster is reduced to the new size by normal downscaling.
Normally MapReduce requires a Mapper’s hosting cluster node, which stores the Mapper output, to remain up until the Reducer finishes and the job completes. Only then can the Mapper process quit. This means that you may be paying for Cloud instances that are idle and merely waiting for Reducers to finish.
QDS can optionally remove this bottleneck, and reduce your costs, by offloading the Mapper data either to HDFS or to Cloud storage outside the cluster. When you enable offloading, a Mapper needs to run only until its data has been offloaded; once the data has been copied, and all other criteria are met, the Mapper node can be decommissioned.
To enable offloading, use the Control Panel to set the offload configuration parameter to true in the Override Hadoop Configuration Variables field in the Hadoop Cluster Settings section when you add or modify the cluster.
By default, the data is offloaded to a Cloud storage location (the location specified by s3.log.location for AWS). You can use the Override Hadoop Configuration Variables field to change this to:
- a different Cloud storage location, by setting the path in qubole.offload.path.
- an HDFS path, by setting qubole.offload.path to the HDFS path.
How Auto-scaling Works in Practice¶
Here’s how auto-scaling works in a Hadoop MRv1 cluster:
- Each node in the cluster reports its launch time to the JobTracker, which keeps track of how long each node has been running.
- The JobTracker continuously monitors the pending and current work in the system and computes the amount of time required to finish it. If that time exceeds the pre-configured threshold (for example, two minutes) and there is sufficient parallelism in the workload, then the JobTracker adds more nodes to the cluster.
- Whenever a node approaches its billing boundary (for example, each hour), the JobTracker checks to see if there is enough work in the system to justify continuing to run this node. If not, and the criteria described above under Downscaling Criteria are met, the JobTracker decommissions the node.
Hadoop MRv1 clusters are not supported on all Cloud platforms.
Here’s how auto-scaling works in a Hadoop MRv2 cluster:
Offloading is not supported in Hadoop MRv2 clusters.
- Each node in the cluster reports its launch time to the ResourceManager, which keeps track of how long each node has been running.
- YARN ApplicationMasters request YARN resources (containers) for each Mapper and Reducer task. If the cluster does not have enough resources to meet these requests, the requests remain pending.
- On the basis of a pre-configured threshold for completing tasks (for example, two minutes), and the number of pending requests, ApplicationMasters create special auto-scaling container requests.
- The ResourceManager sums the ApplicationMasters’ auto-scaling container requests, and on that basis adds more nodes (up to the configured Maximum Slave Count).
- You can improve auto-scaling efficiency by enabling container packing.
- Whenever a node approaches its billing boundary (for example, each hour), the ResourceManager checks to see if any task or shuffle process is running on this node. If not, the ResourceManager decommissions the node.
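The ResourceManager step of summing auto-scaling container requests and converting them into nodes can be sketched as follows (hypothetical names and container-per-node ratio):

```python
def nodes_for_autoscaling_requests(autoscaling_container_requests,
                                   containers_per_node, current_nodes,
                                   max_slaves):
    """Sum the ApplicationMasters' auto-scaling container requests and
    translate them into additional nodes, capped at Maximum Slave Count."""
    total_containers = sum(autoscaling_container_requests)
    needed = -(-total_containers // containers_per_node)  # ceiling division
    return min(needed, max_slaves - current_nodes)
```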
Here’s how auto-scaling works in a Presto cluster:
- The Presto Server (running on the master node) keeps track of the launch time of each node.
- At regular intervals (10 seconds by default) the Presto Server takes a snapshot of the state of running queries, compares it with the previous snapshot, and estimates the time required to finish the queries. If this time exceeds a threshold value (set to one minute and currently not configurable), the Server adds more nodes to the cluster.
- Whenever a node approaches its billing boundary (for example, each hour), the Presto Server checks to see if there is enough work in the system to justify continuing to run this node. If not, and the cluster is running more than the minimum of core nodes, it removes the node from the cluster.
Because all processing in Presto is in-memory and no intermediate data is written to HDFS, the HDFS-related decommissioning tasks are not needed in a Presto cluster.
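The Presto Server's snapshot comparison amounts to estimating a completion time from the progress made between two samples. A sketch under simplified assumptions (progress measured in abstract work units, hypothetical function names):

```python
def estimate_finish_seconds(prev_done, curr_done, total_work, interval_sec=10):
    """Estimate time to finish all queries from two successive progress
    snapshots taken interval_sec apart (10 seconds by default)."""
    rate = (curr_done - prev_done) / interval_sec
    if rate <= 0:
        return float('inf')  # no measurable progress between snapshots
    return (total_work - curr_done) / rate

def presto_should_upscale(prev_done, curr_done, total_work, threshold_sec=60):
    """Add nodes when the estimated finish time exceeds the one-minute
    threshold described in the text."""
    return estimate_finish_seconds(prev_done, curr_done, total_work) > threshold_sec
```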
Here’s how auto-scaling works in a Spark cluster:
- You can configure Spark auto-scaling at the cluster level and at the job level.
- Spark applications consume YARN resources (containers); QDS monitors container usage and launches new nodes (up to the configured Maximum Slave Count) as needed.
- If you have enabled job-level auto-scaling, QDS monitors the running jobs and their rate of progress, and launches new executors as needed (and hence new nodes if necessary).
- As jobs complete, QDS selects candidates for downscaling and initiates Graceful Shutdown of those nodes that meet the criteria.
For a detailed discussion and instructions, see Autoscaling in Spark.
Here’s how auto-scaling works in a Tez cluster:
- Each node in the cluster reports its launch time to the ResourceManager, which keeps track of how long each node has been running.
- YARN ApplicationMasters request YARN resources (containers) for each Mapper and Reducer task. If the cluster does not have enough resources to meet these requests, the requests remain pending.
- ApplicationMasters monitor the progress of the DAG (on the Mapper nodes) and calculate how long it will take to finish their tasks at the current rate.
- On the basis of a pre-configured threshold for completing tasks (for example, two minutes), and the number of pending requests, ApplicationMasters create special auto-scaling container requests.
- The ResourceManager sums the ApplicationMasters’ auto-scaling container requests, and on that basis adds more nodes (up to the configured Maximum Slave Count).
- Whenever a node approaches its billing boundary (for example, each hour), the ResourceManager checks to see if any task or shuffle process is running on this node. If not, the ResourceManager decommissions the node.