Aggressive Downscaling (Azure)
Aggressive Downscaling refers to a set of QDS capabilities that allow idle clusters and cluster nodes to be shut down as quickly and efficiently as possible. It comprises the capabilities described in the sub-sections below.
Read these sub-sections in conjunction with the Downscaling section of Autoscaling in Qubole Clusters. See also Understanding the QDS Cluster Lifecycle.
Note
Aggressive Downscaling is not in effect by default; to enable it for your QDS account, create a Qubole Support ticket.
Faster Cluster Termination
QDS waits for a configurable period after the last command executes before terminating a cluster. This period is referred to as the Idle Cluster Timeout in the QDS UI. By default this is configurable in multiples of one hour; Aggressive Downscaling allows you to configure it in increments of a minute. You can configure this value at both the account level and the cluster level. If you set it at the cluster level, that value overrides the account-level value, which defaults to two hours. You can change a cluster’s Idle Cluster Timeout setting without restarting the cluster.
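The override behavior described above can be sketched as a simple resolution rule (the function and parameter names here are illustrative assumptions, not Qubole code):

```python
def idle_cluster_timeout_min(account_timeout_min=120, cluster_timeout_min=None):
    """Resolve the effective Idle Cluster Timeout in minutes.

    A cluster-level setting, if present, overrides the account-level
    value, which defaults to two hours (120 minutes).
    """
    if cluster_timeout_min is not None:
        return cluster_timeout_min
    return account_timeout_min

print(idle_cluster_timeout_min())                        # -> 120 (account default)
print(idle_cluster_timeout_min(cluster_timeout_min=30))  # -> 30 (cluster override)
```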
Note
QDS monitors the cluster every 5 minutes to see if it is eligible for shutdown. This can mean that a cluster is idle longer than the timeout you set. For example, if you set the Idle Cluster Timeout to five minutes, and QDS checks the cluster four minutes after the last command has completed, QDS will not shut down the cluster. If no further commands have executed by the next checkpoint, five minutes later, QDS will shut the cluster down. In this case the cluster has been idle nine minutes in all.
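The interaction between the timeout and the 5-minute monitoring cycle can be sketched as follows (a simplified model, not Qubole internals; times are minutes relative to the last command's completion):

```python
CHECK_INTERVAL_MIN = 5  # QDS checks the cluster for shutdown eligibility every 5 minutes

def shutdown_minute(idle_timeout_min, first_check_min):
    """Return the minute, counted from the last command's completion, at
    which a checkpoint first finds the cluster idle past its timeout.

    Checks occur at first_check_min, then every 5 minutes thereafter.
    """
    check = first_check_min
    while check < idle_timeout_min:
        check += CHECK_INTERVAL_MIN
    return check

# Example from the text: Idle Cluster Timeout of 5 minutes, first check
# 4 minutes after the last command completes. The 4-minute check is too
# early, so shutdown happens at the next checkpoint, 9 minutes in all.
print(shutdown_minute(idle_timeout_min=5, first_check_min=4))  # -> 9
```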
Exception for Spark Notebooks
Spark notebook interpreters have a separate timeout parameter (spark.qubole.idle.timeout) that defaults to one hour. A cluster will not shut down while an interpreter is running, so you should reduce the value of spark.qubole.idle.timeout if it is greater than the Idle Cluster Timeout.
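The practical effect is that a running interpreter keeps the cluster alive, so the effective wait before shutdown is bounded below by the interpreter timeout. A minimal sketch (the function name is an assumption for illustration):

```python
def effective_idle_wait_min(idle_cluster_timeout_min, interpreter_timeout_min):
    """The cluster cannot shut down while the notebook interpreter is
    alive, so the effective idle wait is at least the larger of the two
    timeouts."""
    return max(idle_cluster_timeout_min, interpreter_timeout_min)

# Defaults: spark.qubole.idle.timeout of 60 minutes dominates a
# 10-minute Idle Cluster Timeout unless you reduce it.
print(effective_idle_wait_min(10, 60))  # -> 60
print(effective_idle_wait_min(10, 10))  # -> 10
```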
Faster Node Termination
The Downscaling section of Autoscaling in Qubole Clusters explains the conditions under which QDS decommissions a node and removes it from a running cluster. By default, these conditions include the concept of an hour boundary: if a node meets all other downscaling criteria, it becomes eligible for shutdown as it approaches an hourly increment of up-time. Aggressive Downscaling does away with this criterion: after you enable Aggressive Downscaling and restart the cluster, its nodes will be decommissioned as soon as they meet all of the other downscaling criteria.
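The difference the hour-boundary criterion makes can be sketched like this (a hypothetical model of the eligibility check, not Qubole internals; the size of the "approaching the hour" window is an assumption):

```python
def eligible_default(uptime_min, meets_other_criteria, boundary_window_min=10):
    """Default behavior: besides the other downscaling criteria, the node
    must be approaching an hourly boundary of up-time (the window size
    here is an assumed value for illustration)."""
    minutes_into_hour = uptime_min % 60
    near_hour_boundary = minutes_into_hour >= 60 - boundary_window_min
    return meets_other_criteria and near_hour_boundary

def eligible_aggressive(uptime_min, meets_other_criteria):
    """Aggressive Downscaling: the hour-boundary criterion is dropped."""
    return meets_other_criteria

# A node 75 minutes old that meets all other criteria:
print(eligible_default(75, True))     # -> False (only 15 minutes into the hour)
print(eligible_aggressive(75, True))  # -> True (shut down immediately)
```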
Note
For Hadoop 2 (Hive) and Spark clusters, see also Container Packing in Hadoop 2 and Spark.
For Presto clusters, Aggressive Downscaling does not affect the way individual nodes are decommissioned; see the Presto section of Autoscaling in Qubole Clusters and Cool-Down Period.
Cool-Down Period
Faster node termination could cause the cluster size to fluctuate too rapidly, so that nodes spend a disproportionate amount of time booting and shutting down, and users may have to wait unnecessarily for new nodes to start and run their commands. The Cool Down Period is designed to prevent this; it allows you to configure how long QDS waits before terminating a cluster node after it becomes idle.
When a node enters its Cool Down Period, QDS initiates graceful shutdown on that node, allowing the node to be either recommissioned or shut down, depending on the cluster workload.
The default value is 10 minutes for Hadoop (Hive) and Spark clusters, and 5 minutes for Presto clusters. The minimum value you should set in all cases is 5 minutes; a lower value may be shorter than the time it takes to decommission a node.
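The defaults and the 5-minute floor can be summarized in a small sketch (the names and structure are illustrative assumptions, not a Qubole API):

```python
# Per-engine Cool Down Period defaults, in minutes, as described above.
DEFAULT_COOL_DOWN_MIN = {"hadoop2": 10, "spark": 10, "presto": 5}
MIN_COOL_DOWN_MIN = 5  # recommended floor: below this, decommissioning may take longer

def effective_cool_down(engine, requested_min=None):
    """Return the requested Cool Down Period, falling back to the
    engine's default and clamping to the recommended 5-minute minimum."""
    value = requested_min if requested_min is not None else DEFAULT_COOL_DOWN_MIN[engine]
    return max(value, MIN_COOL_DOWN_MIN)

print(effective_cool_down("presto"))    # -> 5 (Presto default)
print(effective_cool_down("spark", 3))  # -> 5 (clamped to the minimum)
```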
Note
For Presto clusters, the Cool Down Period does not apply to individual nodes, but to the cluster as a whole: QDS starts to decommission Presto nodes only if it determines that the cluster has been underutilized throughout the Cool Down Period.
Configuring the Cool-Down Period
To change a cluster’s Cool Down Period to something other than the default, navigate to the Configuration tab of the Clusters page in the QDS UI, and set the value to 5 minutes or longer.
Note
If you set the Idle Cluster Timeout to a lower value than the Cool Down Period, the Idle Cluster Timeout takes precedence.
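The precedence rule can be sketched as follows (an illustrative model, not Qubole code): when the Idle Cluster Timeout is the lower value, the whole cluster terminates before individual nodes finish their Cool Down Period.

```python
def effective_wait_min(idle_cluster_timeout_min, cool_down_min):
    """Minutes an idle node can remain up: the Idle Cluster Timeout takes
    precedence when it is lower than the Cool Down Period, because the
    entire cluster shuts down first."""
    if idle_cluster_timeout_min < cool_down_min:
        return idle_cluster_timeout_min
    return cool_down_min

print(effective_wait_min(5, 10))  # -> 5 (the cluster terminates first)
print(effective_wait_min(60, 10)) # -> 10 (nodes cool down as usual)
```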