Auto-scaling in Qubole Clusters

What Auto-scaling Is

Auto-scaling is a mechanism built into QDS that adds and removes cluster nodes while the cluster is running, to keep just the right number of nodes running to handle the workload. Auto-scaling automatically adds resources when computing or storage demand increases, while also keeping the number of nodes at the minimum needed to meet your processing needs efficiently.

How Auto-scaling Works

When you configure a cluster, you choose the minimum and maximum number of nodes the cluster will contain (Minimum Slave Count and Maximum Slave Count, respectively). While the cluster is running, QDS continuously monitors the progress of tasks and the number of active nodes to make sure that:

  • Tasks are being performed efficiently (so that they complete within an amount of time set by a configurable default).
  • No more nodes are active than are needed to complete the tasks.

If the first criterion is at risk, and adding nodes will correct the problem, QDS adds as many nodes as are needed, up to the Maximum Slave Count. This is called upscaling.

If the second criterion is not being met, QDS removes idle nodes, down to the Minimum Slave Count. This is called downscaling, or decommissioning.
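
The overall bookkeeping can be pictured with a short sketch. This is not QDS code; the function and its inputs are hypothetical stand-ins for the checks QDS performs:

    # Illustrative sketch only; the names and decision shape are assumptions,
    # not QDS internals.
    def autoscale_decision(active_nodes, min_nodes, max_nodes,
                           nodes_needed_for_sla, idle_nodes):
        """Positive result: nodes to add; negative: nodes to remove; 0: no change."""
        if nodes_needed_for_sla > 0:
            # Upscaling: add what is needed, but never exceed the maximum.
            return min(nodes_needed_for_sla, max_nodes - active_nodes)
        if idle_nodes > 0:
            # Downscaling: remove idle nodes, but never drop below the minimum.
            return -min(idle_nodes, active_nodes - min_nodes)
        return 0

    # 6 nodes running, bounds 2..10: 7 more wanted, but only 4 allowed.
    print(autoscale_decision(6, 2, 10, 7, 0))   # 4
    # 6 nodes running, 5 idle: only 4 can go before hitting the minimum of 2.
    print(autoscale_decision(6, 2, 10, 0, 5))   # -4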

The topics that follow provide details:

Note

You can improve auto-scaling efficiency by enabling container packing.

Types of Nodes

Auto-scaling affects Slave nodes only, and operates only on the nodes that make up the difference between the Minimum Slave Count and the Maximum Slave Count (the values you specified in the QDS Cluster UI when you configured the cluster). These are referred to as auto-scaling nodes.

The Master Node(s), and the nodes comprising the Minimum Slave Count, are the stable core of the cluster; they normally remain running as long as the cluster itself is running. These are called core nodes.

Stable and Volatile Nodes on AWS

In AWS clusters, core nodes are stable instances – normally AWS On-Demand or high-priced Spot instances – while auto-scaling nodes can be stable or volatile instances. Volatile instances are Spot instances that can be lost at any time; stable instances are unlikely to be lost. See AWS Settings for a more detailed explanation, and Spot-based Auto-scaling in AWS for further discussion of how QDS manages Spot instances when auto-scaling a cluster.

Upscaling

Launching a Cluster

QDS launches clusters automatically when applications need them. If the application needs a cluster that is not running, QDS launches it with the minimum number of nodes, and scales up as needed toward the maximum.

Upscaling Criteria

QDS bases upscaling decisions on three factors:

  • The rate of progress of the jobs that are running.
  • Whether faster throughput can be achieved by adding nodes.
  • Whether the HDFS storage currently configured for the cluster will be sufficient for all tasks to complete.

Note

This storage criterion currently applies only in Hadoop MRv1 clusters. These are not supported on all Cloud platforms.

Assuming the cluster is running fewer than the configured maximum number of nodes, QDS activates more nodes if, and only if, the configured SLA (Service Level Agreement) will not be met at the current rate of progress, and adding the nodes will improve the rate.

Even if the SLA is not being met, QDS does not add nodes if the workload cannot be distributed more efficiently across more nodes. For example, if three tasks are distributed across three nodes, and progress is slow because the tasks are large and resource-intensive, adding more nodes will not help because the tasks cannot be broken down any further. On the other hand, if tasks are waiting to start because the existing nodes do not have the capacity to run them, then QDS will add nodes.
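
The rule can be summarized in a hedged sketch (the names and signature are invented for illustration, not QDS internals):

    # Illustrative only: the two questions QDS asks before upscaling.
    def should_upscale(projected_finish_secs, sla_secs,
                       waiting_tasks, running_nodes, max_nodes):
        sla_missed = projected_finish_secs > sla_secs
        # Extra nodes help only if tasks are waiting for capacity; a few large,
        # indivisible tasks cannot be spread any thinner.
        more_parallelism = waiting_tasks > 0
        room_to_grow = running_nodes < max_nodes
        return sla_missed and more_parallelism and room_to_grow

    # Three large tasks already spread over three nodes: adding nodes will not help.
    print(should_upscale(600, 120, waiting_tasks=0, running_nodes=3, max_nodes=10))  # False
    # Tasks queued because the existing nodes are out of capacity: upscale.
    print(should_upscale(600, 120, waiting_tasks=8, running_nodes=3, max_nodes=10))  # True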

Downscaling

QDS bases downscaling decisions on the following factors.

Note

In Hadoop 2 MRv2, you can control the maximum number of nodes that can be downscaled simultaneously through mapred.hustler.downscaling.nodes.max.request. Its default value is 500.

Downscaling Criteria

A node is a candidate for decommissioning only if all of the following are true (a sketch of these checks follows the list):

  • The cluster is larger than its configured minimum size. But note this exception: Hadoop 2 and Spark clusters do not downscale to a single slave node once they have been upscaled. When Minimum Slave Count is set to 1, the cluster starts with a single slave node, but once upscaled it never goes back to fewer than two slave nodes; that is, after upscaling, a Hadoop 2 or Spark cluster always has at least two slave nodes, or the configured minimum number of slave nodes, whichever is greater. This is because decommissioning slows down immensely if only one usable node is left for HDFS, so nodes doing no work may be left running, waiting to be decommissioned. You can override this behavior by setting mapred.allow.single.worker.node to true; changing this setting requires a cluster restart.

  • The node is approaching its billing boundary (for example, a one-hour boundary in the case of AWS).

  • No tasks are running.

  • The node is not storing any shuffle data (data from Map tasks for Reduce tasks that are still running).

  • Enough cluster storage will be left after shutting down the node to hold the data that must be kept (including HDFS replicas).

    Note

    This storage consideration does not apply in Presto clusters.
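
A minimal sketch of these checks, using invented parameter names rather than QDS internals (as noted above, the storage check does not apply to Presto):

    # Illustrative only: the decommissioning checks listed above.
    def is_decommission_candidate(cluster_size, min_size, near_billing_boundary,
                                  running_tasks, shuffle_bytes, storage_ok,
                                  is_presto=False):
        return (cluster_size > min_size            # above the configured minimum
                and near_billing_boundary          # e.g. close to an hourly boundary
                and running_tasks == 0             # no tasks in flight
                and shuffle_bytes == 0             # no Map output still needed by Reducers
                and (is_presto or storage_ok))     # storage check does not apply to Presto

    # An idle node near its billing boundary, with enough storage left elsewhere:
    print(is_decommission_candidate(6, 2, True, 0, 0, storage_ok=True))   # True
    # Same node, but removing it would leave too little room for HDFS replicas:
    print(is_decommission_candidate(6, 2, True, 0, 0, storage_ok=False))  # False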

Graceful Shutdown of a Node

If all of the downscaling criteria are met, QDS starts decommissioning the node. QDS ensures a graceful shutdown by:

  • Waiting for all tasks to complete.
  • Ensuring that the node does not accept any new tasks.
  • Transferring HDFS block replicas to other nodes.

Note

Data transfer is not needed in Presto clusters.

Shutting Down an Idle Cluster

By default, QDS shuts the cluster down completely if all of these are true:

  • There have been no jobs in the cluster over a configurable period.
  • At least one node is close to its billing boundary.

You can change this by disabling automatic cluster termination, but Qubole recommends that you leave it enabled – inadvertently allowing an idle cluster to keep running can become an expensive mistake.

Recommissioning a Node

If more jobs enter the pipeline while a node is being decommissioned, and the remaining nodes cannot handle them, the node is recommissioned – the decommissioning process stops and the node is reactivated as a member of the cluster and starts accepting tasks again.

Recommissioning takes precedence over launching new instances: when handling an upscaling request, QDS launches new nodes only if the need cannot be met by recommissioning nodes that are being decommissioned.

Recommissioning is preferable to starting a new instance because:

  • It is more efficient, avoiding bootstrapping a new node.
  • It is cheaper if the node is not yet at its billing boundary (and it can never cost more than provisioning a new node).
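
A hedged sketch of this precedence (hypothetical names only):

    # Illustrative only: recommissioning is tried before launching new instances.
    def satisfy_upscaling_request(nodes_wanted, decommissioning_nodes):
        recommissioned = min(nodes_wanted, decommissioning_nodes)
        new_instances = nodes_wanted - recommissioned
        return recommissioned, new_instances

    # Five more nodes are needed while three are still being decommissioned:
    print(satisfy_upscaling_request(5, 3))  # (3, 2) -> reuse 3, launch only 2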

Note

Recommissioning is not currently supported in Presto clusters.

Spot-based Auto-scaling in AWS

You can specify that a portion of the cluster’s auto-scaling nodes may be potentially volatile AWS Spot instances (as opposed to higher-priced, more stable Spot instances, or On-Demand instances, which are stable but still more expensive). You do this via the Spot Instances Percentage field that appears when you choose the Autoscaling Node Purchasing Option on the Add Cluster and Cluster Settings screens in the QDS Cluster UI. The number you enter in that field specifies the maximum percentage of auto-scaling nodes that QDS can launch as volatile Spot instances.

For example, if you specify a Minimum Slave Count of 2 and a Maximum Slave Count of 10, and you choose Spot Instance as the Autoscaling Node Purchasing Option, you might specify a Spot Instances Percentage of 50. This would allow a maximum of 4 nodes in the cluster (50% of the difference between 2 and 10) to be volatile Spot instances. The non-auto-scaling nodes (those represented by the Minimum Slave Count, 2 in this case) are always stable instances (On-Demand or high-priced Spot instances).
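
The arithmetic in this example can be expressed as a small sketch (the function name is illustrative):

    # Maximum number of auto-scaling nodes that may run as volatile Spot instances.
    def max_volatile_spot_nodes(min_slave_count, max_slave_count, spot_percentage):
        autoscaling_nodes = max_slave_count - min_slave_count
        return int(autoscaling_nodes * spot_percentage / 100)

    # Minimum 2, maximum 10, Spot Instances Percentage 50:
    print(max_volatile_spot_nodes(2, 10, 50))  # 4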

Opting for volatile Spot instances has implications for the working of your cluster, because Spot instances can be lost at any time. To minimize the impact of losing a node in this way, Qubole implements the Qubole Placement Policy, which is in effect by default and makes a best effort to place one copy of each data block on a stable node. You can change the default in the QDS Cluster UI on the Add Cluster and Cluster Settings screens, but Qubole recommends that you leave the policy enabled (Use Qubole Placement Policy checked in the Cluster Composition section). For more information about this policy, and Qubole’s implementation of Spot instances in general, see the blog post Riding the Spotted Elephant.

Spot Rebalancing

Fluctuations in the market may mean that QDS cannot always obtain as many Spot instances as your cluster specification calls for. (QDS tries to obtain the Spot instances for a configurable number of minutes before giving up.) To continue the example in the previous section, suppose your cluster needs to scale up by four additional nodes, but only two Spot instances that meet your requirements (out of the maximum of four you specified) are available. In this case, QDS launches the two Spot instances and (by default) makes up the shortfall by also launching two On-Demand instances, meaning that you will pay more than you had planned for those two instances. (You can change this default behavior in the QDS Cluster UI on the Add Cluster and Cluster Settings screens, by un-checking Fallback to on demand in the Cluster Composition section.)

Whenever the cluster is running a greater proportion of On-Demand instances than you have specified, QDS works to remedy the situation by monitoring the Spot market, and replacing the On-Demand nodes with Spot instances as soon as suitable instances become available. This is called Spot Rebalancing.
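
Putting fallback and rebalancing together, a hedged sketch might look like this (names are illustrative; the real logic also honors the configurable Spot request timeout mentioned above):

    # Illustrative only: fall back to On-Demand, then rebalance back to Spot later.
    def acquire_nodes(spot_wanted, spot_available, fallback_to_on_demand=True):
        spot = min(spot_wanted, spot_available)
        on_demand = (spot_wanted - spot) if fallback_to_on_demand else 0
        return spot, on_demand

    def rebalance(on_demand_excess, spot_now_available):
        # Replace excess On-Demand nodes with Spot instances as they become available.
        return min(on_demand_excess, spot_now_available)

    # Four Spot nodes wanted, only two available: launch 2 Spot + 2 On-Demand.
    print(acquire_nodes(4, 2))  # (2, 2)
    # Later the Spot market recovers: both excess On-Demand nodes can be replaced.
    print(rebalance(2, 5))      # 2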

Note

Spot rebalancing is supported in Hadoop MRv1, Hadoop MRv2, and Spark clusters only.

For a more detailed discussion of this topic, see Rebalancing Hadoop Clusters for Higher Spot Utilization.

HDFS-based Auto-scaling in Hadoop MRv1 Clusters

You can enable HDFS-based auto-scaling by setting

mapred.hustler.dfs.autoscale.enable = true

in the Override Hadoop Configuration Variables field on the Add Cluster and Cluster Settings screens in the Cluster UI, or in your cluster’s mapred-site.xml file.

If HDFS-based auto-scaling is enabled, QDS continuously monitors the cluster’s HDFS storage to ensure that the available capacity remains sufficient for the jobs that are running to complete, and launches more nodes if necessary.

There are two triggers for launching more nodes; these are discussed under Rate-based Auto-scaling and Threshold-based Auto-scaling below.

Rate-based Auto-scaling

As Hadoop jobs are running, QDS monitors the rate at which they are using up storage, and from this computes when to launch additional nodes (taking into account how long it is taking for newly launched nodes to become operational).

Bringing up additional instances is the most efficient and reliable way to increase storage, but it is expensive, so QDS does this only when necessary, and as late as possible. Nodes are launched so that they, and their storage volumes, come online just in time to prevent the cluster from running out of storage, and no sooner.

Threshold-based Auto-Scaling

In addition to monitoring the rate at which the storage is filling up, QDS also compares current usage against a configurable threshold. If usage exceeds this threshold, QDS launches more nodes until usage is below the threshold.
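
A hedged sketch of the two triggers, with invented parameter names (the actual thresholds and internals may differ):

    # Illustrative only: the two HDFS-based upscaling triggers described above.
    def hdfs_upscale_needed(used_gb, capacity_gb, fill_rate_gb_per_min,
                            node_launch_minutes, usage_threshold_pct):
        free_gb = capacity_gb - used_gb
        # Rate-based: will the cluster run out of space before a new node
        # (and its storage) can come online?
        minutes_until_full = (free_gb / fill_rate_gb_per_min
                              if fill_rate_gb_per_min > 0 else float("inf"))
        rate_trigger = minutes_until_full <= node_launch_minutes
        # Threshold-based: has usage already crossed the configured percentage?
        threshold_trigger = used_gb / capacity_gb * 100 > usage_threshold_pct
        return rate_trigger or threshold_trigger

    # 800 of 1000 GB used, filling at 40 GB/min, new nodes take 6 minutes to come up:
    print(hdfs_upscale_needed(800, 1000, 40, 6, usage_threshold_pct=90))  # True (rate)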

HDFS-based Auto-scaling in Hadoop MRv2 AWS Clusters

Note

In Hadoop 2 MRv2, you can control the maximum number of nodes that can be downscaled simultaneously through mapred.hustler.downscaling.nodes.max.request. Its default value is 500.

You can enable HDFS-based auto-scaling on Hadoop MRv2 clusters running on Amazon AWS; these include Spark and Tez clusters, as well as those running MapReduce jobs. To distinguish this type of auto-scaling from the MRv1 case, we’ll refer to it as EBS upscaling.

The following conditions and restrictions apply:

  • EBS upscaling is not enabled by default; use the Clusters API to enable it in a new or existing cluster. For the required EC2 permissions, see Sample Policy for EBS Upscaling.

    You can also enable it on the Spark/Hadoop 2 cluster UI.

    To enable it on the UI, set EBS Volume Count to a value greater than 0 in the drop-down list. Only then can you select Enable EBS Upscaling. Once you select it on the cluster configuration UI, the EBS upscaling settings are displayed with their default values, as shown in the figure below.

    [Figure: EBS Upscaling default configuration]

    You can configure:

    • The maximum number of EBS volumes QDS can add to an instance in Maximum EBS Volume Count.
    • The free-space threshold (in percentage) below which more storage is added in Free Space Threshold. The default is 15%.
    • The free-space capacity (in GB) above which upscaling does not occur in Absolute Free Space Threshold. The default is 100 GB.
    • The frequency (in seconds) at which the capacity of the logical volume is sampled in Sampling Interval. The default is 30 seconds.
    • The number of sampling intervals over which Qubole evaluates the rate of increase of used capacity in Sampling Window. The default is 5.
  • EBS upscaling is available only on instance types on which Qubole supports adding EBS storage.

  • QDS auto-scaling adds EBS storage but does not remove it directly. Downscaling occurs on a node-by-node basis; the storage is removed when the node is decommissioned.

EBS upscaling works as follows (a sketch of this logic follows the list).

  • The underlying mechanism is Logical Volume Manager (LVM).
  • When you configure EBS storage for a node, QDS configures the EBS volumes into an LVM volume group comprising a single logical volume.
  • When free space on that volume falls below the free-space threshold, QDS adds EBS volumes to the logical volume and resizes the file system.
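
A hedged sketch of this decision, using the settings listed above (the function, and the default Maximum EBS Volume Count shown here, are illustrative; the real implementation also drives LVM and the filesystem resize):

    # Illustrative only: decide whether to attach another EBS volume to the
    # node's LVM logical volume, based on the settings described above.
    def ebs_volume_needed(free_gb, total_gb, current_volumes,
                          max_volume_count=3,            # Maximum EBS Volume Count (assumed)
                          free_space_threshold_pct=15,   # Free Space Threshold default
                          absolute_free_space_gb=100):   # Absolute Free Space Threshold default
        if current_volumes >= max_volume_count:
            return False                                 # nothing more can be added
        if free_gb >= absolute_free_space_gb:
            return False                                 # enough room in absolute terms
        return free_gb / total_gb * 100 < free_space_threshold_pct

    # 60 GB free on a 500 GB logical volume (12%), one of three volumes attached:
    print(ebs_volume_needed(60, 500, current_volumes=1))  # True -> add a volume, resize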

Aggressive Downscaling

Aggressive downscaling is triggered when you reduce the maximum size of a cluster while it’s running. The subsections that follow explain how it works. First you need to understand what happens when you decrease (or increase) the size of a running cluster.

Note

Aggressive downscaling is not supported in Spark or Presto clusters.

In Hadoop 2 MRv2, you can control the maximum number of nodes that can be downscaled simultaneously through mapred.hustler.downscaling.nodes.max.request. Its default value is 500.

Effects of Changing Slave Count Variables while the Cluster is Running

You can change the Minimum Slave Count and Maximum Slave Count while the cluster is running. You do this via the Cluster Settings screen in the QDS UI, just as you would if the cluster were down.

To force the change to take effect dynamically (while the cluster is running) you must push it, as described here. Exactly what happens then depends on the current state of the cluster and configuration settings. Here are the details for both variables.

Minimum Slave Count

An increase or reduction in the minimum count takes effect dynamically by default (mapred.refresh.min.cluster.size is set to true by default only on a Hadoop 2 cluster).

Maximum Slave Count

  • An increase in the maximum count always takes effect dynamically.
  • A reduction in the maximum count produces the following behavior:
    • If the current cluster size is smaller than the new maximum, the change takes effect dynamically. For example, if the maximum is 15, 10 nodes are currently running, and you reduce the maximum count to 12, then 12 is the maximum from now on.
    • If the current cluster size is greater than the new maximum, QDS begins reducing the cluster to the new maximum, and subsequent upscaling will not exceed the new maximum. In this case, the default behavior for reducing the number of running nodes is aggressive downscaling.

How Aggressive Downscaling Works

By default, mapred.aggressive.downscale.enable is set to true only on a Hadoop 2 (Hive) cluster. It is disabled by default on a Hadoop 1 cluster. To enable aggressive downscaling on a Hadoop 1 cluster, set mapred.aggressive.downscale.enable to true as an override in the Override Hadoop Configuration Variables field (in the Hadoop Cluster Settings section of the cluster configuration UI).

If you decrease the Maximum Slave Count while the cluster is running, and more than the new maximum number of nodes are actually running, then QDS begins aggressive downscaling.

If aggressive downscaling is triggered, QDS selects the nodes that are:

  • closest to completing their tasks
  • closest to their billing boundary
  • (in the case of Hadoop MRv2 clusters) closest to the time limit for their containers.

Once selected, these nodes stop accepting new jobs and QDS shuts them down gracefully until the cluster is at its new maximum size (or the maximum needed for the current workload, whichever is smaller).
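
A hedged sketch of this selection (the field names and the relative weighting of the three criteria are assumptions, not QDS internals):

    # Illustrative only: pick the nodes to shut down first during aggressive downscaling.
    def select_nodes_to_remove(nodes, current_size, new_max):
        excess = max(0, current_size - new_max)
        # Prefer nodes closest to finishing their tasks, closest to their billing
        # boundary, and (for MRv2) closest to their containers' time limit.
        ranked = sorted(nodes, key=lambda n: (n["secs_to_task_completion"],
                                              n["secs_to_billing_boundary"],
                                              n["secs_to_container_limit"]))
        return ranked[:excess]

    nodes = [
        {"name": "a", "secs_to_task_completion": 30, "secs_to_billing_boundary": 600, "secs_to_container_limit": 120},
        {"name": "b", "secs_to_task_completion": 300, "secs_to_billing_boundary": 60, "secs_to_container_limit": 900},
        {"name": "c", "secs_to_task_completion": 30, "secs_to_billing_boundary": 60, "secs_to_container_limit": 300},
    ]
    # 12 nodes running, new maximum of 10: remove the two best candidates.
    print([n["name"] for n in select_nodes_to_remove(nodes, current_size=12, new_max=10)])  # ['c', 'a']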

You can use the QDS Cluster UI to disable aggressive downscaling by setting mapred.aggressive.downscale.enable to false in the Override Hadoop Configuration Variables field (in the Hadoop Cluster Settings section) when you add or modify the cluster. In this case, nodes continue to accept new jobs, and the cluster is reduced to the new size by normal downscaling.

Offloading

Normally MapReduce requires a Mapper’s hosting cluster node, which stores the Mapper output, to remain up until the Reducer finishes and the job completes. Only then can the Mapper process quit. This means that you may be paying for Cloud instances that are idle and merely waiting for Reducers to finish.

QDS can optionally remove this bottleneck, and reduce your costs, by offloading the Mapper data either to HDFS or to Cloud storage outside the cluster. When you enable offloading, a Mapper needs to run only until its data has been offloaded; once the data has been copied, and all other criteria are met, the Mapper node can be decommissioned.

To enable offloading, use the Cluster Configuration UI to set qubole.offload.enable to true in the Override Hadoop Configuration Variables field in the Hadoop Cluster Settings section when you add or modify the cluster.

By default, the data is offloaded to a Cloud storage location (s3.log.location for AWS). You can use the Override Hadoop Configuration Variables field to change this to:

  • a different Cloud storage location, by setting the path in qubole.offload.path, or
  • an HDFS path, by setting qubole.offload.protocol to hdfs:// and qubole.offload.path to the HDFS path.

Note

Offloading is supported in Hadoop MRv1 clusters only. These are not supported on all Cloud platforms.

How Auto-scaling Works in Practice

Hadoop MRv1

Here’s how auto-scaling works in a Hadoop MRv1 cluster:

  • Each node in the cluster reports its launch time to the JobTracker, which keeps track of how long each node has been running.
  • The JobTracker continuously monitors the pending and current work in the system and computes the amount of time required to finish it. If that time exceeds the pre-configured threshold (for example, two minutes) and there is sufficient parallelism in the workload, then the JobTracker adds more nodes to the cluster.
  • Whenever a node approaches its billing boundary (for example, each hour), the JobTracker checks to see if there is enough work in the system to justify continuing to run this node. If not, and the criteria described above under Downscaling Criteria are met, the JobTracker decommissions the node.

Note

Hadoop MRv1 clusters are not supported on all Cloud platforms.

Hadoop MRv2

Here’s how auto-scaling works in a Hadoop MRv2 cluster:

Note

Offloading is not supported in Hadoop MRv2 clusters.

In Hadoop 2 MRv2, you can control the maximum number of nodes that can be downscaled simultaneously through mapred.hustler.downscaling.nodes.max.request. Its default value is 500.

  • Each node in the cluster reports its launch time to the ResourceManager, which keeps track of how long each node has been running.
  • YARN ApplicationMasters request YARN resources (containers) for each Mapper and Reducer task. If the cluster does not have enough resources to meet these requests, the requests remain pending.
  • On the basis of a pre-configured threshold for completing tasks (for example, two minutes), and the number of pending requests, ApplicationMasters create special auto-scaling container requests.
  • The ResourceManager sums the ApplicationMasters’ auto-scaling container requests, and on that basis adds more nodes (up to the configured Maximum Slave Count).
  • Whenever a node approaches its billing boundary (for example, each hour), the ResourceManager checks to see if any task or shuffle process is running on this node. If not, the ResourceManager decommissions the node.
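
The flow above can be condensed into a hedged sketch (the containers-per-node conversion and the names are assumptions made for illustration):

    # Illustrative only: ResourceManager-side view of MRv2 upscaling.
    import math

    def mrv2_nodes_to_add(pending_autoscale_containers, containers_per_node,
                          active_nodes, max_nodes):
        # Sum of the ApplicationMasters' auto-scaling container requests,
        # converted to nodes and capped at the configured maximum.
        nodes_wanted = math.ceil(pending_autoscale_containers / containers_per_node)
        return min(nodes_wanted, max_nodes - active_nodes)

    # 35 auto-scaling containers requested, 8 fit per node,
    # 6 of a maximum of 10 nodes running: add min(5, 4) = 4 nodes.
    print(mrv2_nodes_to_add(35, 8, active_nodes=6, max_nodes=10))  # 4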

Note

You can improve auto-scaling efficiency by enabling container packing.

Presto

Here’s how auto-scaling works in a Presto cluster:

Note

Spot Rebalancing, HDFS-based Auto-scaling, Aggressive Downscaling, and Offloading are not supported in Presto clusters. Presto is not supported on all Cloud platforms.

  • The Presto Server (running on the master node) keeps track of the launch time of each node.
  • At regular intervals (10 seconds by default) the Presto Server takes a snapshot of the state of running queries, compares it with the previous snapshot, and estimates the time required to finish the queries. If this time exceeds a threshold value (set to one minute and currently not configurable), the Server adds more nodes to the cluster.
  • Whenever a node approaches its billing boundary (for example, each hour), the Presto Server checks to see if there is enough work in the system to justify continuing to run this node. If not, and the cluster is running more than the configured minimum number of nodes, the Server removes the node from the cluster.
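
A hedged sketch of the Presto Server’s rate-based estimate from successive snapshots (the progress measure and names are invented for illustration):

    # Illustrative only: estimate time-to-finish from two snapshots of query progress.
    def presto_upscale_needed(prev_done_pct, curr_done_pct,
                              snapshot_interval_secs=10, threshold_secs=60):
        progress_rate = (curr_done_pct - prev_done_pct) / snapshot_interval_secs
        if progress_rate <= 0:
            return True                      # no visible progress: add nodes
        remaining_secs = (100 - curr_done_pct) / progress_rate
        return remaining_secs > threshold_secs

    # Queries moved from 40% to 45% done in the last 10 seconds:
    # (100 - 45) / 0.5 = 110 seconds remaining, above the one-minute threshold.
    print(presto_upscale_needed(40, 45))     # True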

Note

Because all processing in Presto is in-memory and no intermediate data is written to HDFS, the HDFS-related decommissioning tasks are not needed in a Presto cluster.

Sometimes, while tracking new nodes added during upscaling in the Query Tracker, you may notice that the new nodes are not being used by the running query. This is because new nodes are used only if certain operations, such as TableScans and Partial Aggregations, are in progress after the new nodes are added. If the query has no such operations in progress, the new nodes remain unused. However, new queries started after the nodes are added can use the new nodes regardless of the operation types in those queries.

To find out which operations in a running query can use nodes added by upscaling, run EXPLAIN (TYPE DISTRIBUTED). Operations that are part of Fragments and appear as [SOURCE] can use new nodes while the query is being executed.

Spark

Here’s how auto-scaling works in a Spark cluster:

Note

Aggressive Downscaling and Offloading are not supported in Spark clusters.

  • You can configure Spark auto-scaling at the cluster level and at the job level.
  • Spark applications consume YARN resources (containers); QDS monitors container usage and launches new nodes (up to the configured Maximum Slave Count) as needed.
  • If you have enabled job-level auto-scaling, QDS monitors the running jobs and their rate of progress, and launches new executors as needed (and hence new nodes if necessary).
  • As jobs complete, QDS selects candidates for downscaling and initiates Graceful Shutdown of those nodes that meet the criteria.

For a detailed discussion and instructions, see Autoscaling in Spark.

Note

You can improve auto-scaling efficiency by enabling container packing.

Tez

Here’s how auto-scaling works in a Tez cluster:

Note

Offloading is not supported in Tez clusters. Tez is not supported on all Cloud platforms.

  • Each node in the cluster reports its launch time to the ResourceManager, which keeps track of how long each node has been running.
  • YARN ApplicationMasters request YARN resources (containers) for each Mapper and Reducer task. If the cluster does not have enough resources to meet these requests, the requests remain pending.
  • ApplicationMasters monitor the progress of the DAG (on the Mapper nodes) and calculate how long it will take to finish their tasks at the current rate.
  • On the basis of a pre-configured threshold for completing tasks (for example, two minutes), and the number of pending requests, ApplicationMasters create special auto-scaling container requests.
  • The ResourceManager sums the ApplicationMasters’ auto-scaling container requests, and on that basis adds more nodes (up to the configured Maximum Slave Count).
  • Whenever a node approaches its billing boundary (for example, each hour), the ResourceManager checks to see if any task or shuffle process is running on this node. If not, the ResourceManager decommissions the node.

For More Information

For more information about configuring and managing QDS clusters, see: