Using AWS Spot Instances and Spot Blocks in Qubole Clusters
Amazon Web Services (AWS) offers three major types of instances (computing hosts) that are suitable for use as Qubole cluster nodes:
- On-Demand instances
- Spot instances
- Spot Blocks
On-Demand instances are offered at a fixed price, whereas the price for Spot instances varies with market demand and availability, and may well vary over the lifetime of a Qubole cluster. Spot Blocks are Spot instances that run continuously for a finite duration (1 to 6 hours).
On-Demand versus Spot Instances versus Spot Blocks: Advantages and Disadvantages
The advantages of On-Demand instances are stability and speed of launching. An On-Demand instance comes up very quickly and is likely to remain available for the life of the cluster, helping to ensure that the cluster performs its work smoothly and reliably. The disadvantage is cost: a cluster composed entirely or mostly of On-Demand nodes may be many times more expensive than a similarly-sized cluster composed of Spot nodes.
The obvious advantage of Spot instances is cost, and the potential disadvantage is instability. A Spot instance can cost only a small fraction of the price of an On-Demand instance, but it may take a long time to become available (or may never become available), and AWS may reclaim it at any time, either because the Spot price rises above your maximum bid or because the supply simply runs out. This in turn can put running jobs in jeopardy, though you can improve the stability of Spot instances by setting your Maximum Price Percentage high. For a detailed discussion, see the Qubole blog post Riding the Spotted Elephant.
Spot Blocks are 30 to 45 percent cheaper on average than On-Demand instances running for the same amount of time. They are more stable than Spot nodes because AWS will not reclaim them during the configured period (up to six hours), though they will be reclaimed after that. For more information, see AWS spot blocks.
QDS ensures that Spot Block instances are acquired at a lower price than On-Demand nodes. It also ensures that instances acquired during upscaling are acquired only for the remaining lifetime of the original Spot Block instances (the Master and Minimum Worker Nodes with which the cluster was launched), that is, for the remainder of the Spot Block Duration you configured for the Master and Minimum Worker Nodes when you created or modified the cluster. For example, if the original Spot Block instances were acquired for five hours, and nodes need to be added after the cluster has been running for two hours, the new Spot Block instances are acquired for three hours, and are proportionately cheaper than the original instances.
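The arithmetic in the example above can be sketched as follows. This is only an illustration: the function name is hypothetical, and how QDS maps the remaining time onto the durations AWS offers is not covered here.

```python
def remaining_block_minutes(block_duration_minutes: int, elapsed_minutes: int) -> int:
    """Minutes left in the original Spot Block window (never negative)."""
    return max(block_duration_minutes - elapsed_minutes, 0)


# Original Spot Block instances acquired for five hours;
# nodes are added after the cluster has run for two hours.
remaining = remaining_block_minutes(5 * 60, 2 * 60)
print(f"New Spot Block instances are needed for {remaining // 60} hours")
```

Clamping at zero reflects the point made in the next paragraph: once the Spot Block Duration has expired, there is no remaining window to acquire instances for.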
When the Spot Block Duration expires, AWS reclaims the instances, halting the cluster. This behavior overrides the normal QDS cluster-runtime controls such as Cluster Idle Timeout.
Cluster Composition Choices
You can choose to create a cluster in any of the following configurations:
- On-Demand nodes only
- Spot nodes only
- A mix of Spot and On-Demand nodes
- Spot Blocks only (see Configuring Spot Blocks)
- Spot Blocks for the Master node and minimum number of nodes, and Spot nodes for autoscaling
- Spot Blocks for the Master node and minimum number of nodes, and On-Demand nodes for autoscaling
You can find out about configuring each type of cluster here.
For most purposes, the third option is the best, because it provides a good balance between cost and stability. The fourth option of using Spot Blocks is also advantageous, as the cost is lower than for On-Demand instances and availability is fixed for a specific amount of time.
The remainder of this section focuses on the settings and mechanisms Qubole provides to help safeguard the overall functioning of a cluster that includes Spot nodes.
How You Configure Spot Instances into a Qubole Cluster
The critical items when you configure Spot instances into a cluster are the Request Timeout, the Maximum Bid Price, the Qubole Placement Policy option, the Fallback to on demand option, and the Spot Instances Percentage.
- The Request Timeout specifies how many minutes Qubole should keep trying to obtain Spot instances when launching the cluster or adding nodes.
- The Maximum Bid Price is the maximum price you are willing to pay for the instances at launch time, or at any time after that as the Spot price fluctuates. The price is expressed as a percentage of the current On-Demand price. Bear in mind that the bid is the maximum you are offering to pay; your cluster will obtain any suitable instance at or below this price.
- The Qubole Placement Policy option, if selected, causes Qubole to make a best effort to store one replica of each HDFS block on a stable node (normally an On-Demand node, except in the case of a Spot-only cluster). Qubole recommends you select this option to prevent job failures that could occur if all replicas were lost as a result of AWS reclaiming many Spot instances at once.
- The Fallback to on demand option is discussed here.
- The Spot Instances Percentage:
  - In a mixed cluster this specifies the maximum percentage of auto-scaling nodes that can be Spot instances. Auto-scaling nodes are those that comprise the difference between the Minimum Worker Nodes and the Maximum Worker Nodes; Qubole adds and removes these nodes according to the cluster workload, as explained in detail here.
  - In a Spot-only cluster, this is always set to 100.
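To make the idea behind the Qubole Placement Policy concrete, here is a minimal sketch of "best effort to keep one replica on a stable node". It is not Qubole's implementation: the `Node` type, the `stable` flag, and `choose_replica_nodes` are hypothetical names introduced only for this illustration, and the default replication factor of 3 is an assumption.

```python
from typing import List, NamedTuple


class Node(NamedTuple):
    name: str
    stable: bool  # hypothetical flag: True for a stable (normally On-Demand) node


def choose_replica_nodes(nodes: List[Node], replication: int = 3) -> List[Node]:
    """Pick replica targets, putting one stable node first when one exists."""
    stable = [n for n in nodes if n.stable]
    chosen = stable[:1]  # best effort: stays empty if there is no stable node
    for n in nodes:
        if len(chosen) >= replication:
            break
        if n not in chosen:
            chosen.append(n)
    return chosen


nodes = [
    Node("spot-1", False),
    Node("spot-2", False),
    Node("od-1", True),
    Node("spot-3", False),
]
targets = choose_replica_nodes(nodes)
print([n.name for n in targets])
```

With one replica on `od-1`, the block survives even if AWS reclaims every Spot node in the list at once, which is the failure the option is meant to guard against.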
The AWS Availability Zone (AZ) is also important, but in general you should let Qubole select it rather than specifying it yourself.
The following strategies are all viable; choose according to your priorities:
- Bid very low and set a large Request Timeout. This minimizes cost, ensuring that the cluster obtains instances only when the price is low.
- Bid at about 100% (or just above) and achieve good general cost reduction combined with reliability, using mainly Spot instances but occasionally falling back to more-expensive On-Demand instances.
- Bid very high (say 200%) and be almost sure of getting instances even when they are in short supply.
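Because the Maximum Bid Price is a percentage of the On-Demand price, it can help to translate each strategy into an hourly price ceiling. The sketch below does that; the prices are invented for illustration, and the function names are hypothetical rather than part of any Qubole or AWS API.

```python
def max_bid_price(on_demand_price: float, bid_percentage: float) -> float:
    """Convert a bid percentage into an hourly price ceiling."""
    return on_demand_price * bid_percentage / 100.0


def bid_is_met(spot_price: float, on_demand_price: float, bid_percentage: float) -> bool:
    """True when the current Spot price is at or below the bid ceiling."""
    return spot_price <= max_bid_price(on_demand_price, bid_percentage)


ON_DEMAND = 0.40  # hypothetical On-Demand price, USD per hour
SPOT_NOW = 0.12   # hypothetical current Spot price, USD per hour

for label, pct in [("very low", 25), ("about On-Demand", 100), ("very high", 200)]:
    ceiling = max_bid_price(ON_DEMAND, pct)
    met = bid_is_met(SPOT_NOW, ON_DEMAND, pct)
    print(f"{label}: bid {pct}% -> ceiling ${ceiling:.2f}/h, met at current price: {met}")
```

A bid above 100% does not mean you pay that much; as noted above, the bid is only a ceiling, and the cluster obtains instances at or below it.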
Configuring a Spot-Only Cluster (Not Recommended)
You can configure a Spot-only cluster by choosing Spot Nodes when you create the cluster. This forces the Spot Instances Percentage to 100, meaning all the cluster nodes will be Spot instances, including the core nodes (the Master Node and the nodes comprising the Minimum Worker Nodes).
Causes for Poor Performance in a Spot-only Cluster
A cluster configured with 100% Spot nodes can perform poorly for two main reasons:
- If Fallback to On-Demand is disabled, upscaling slows down greatly and can even stop completely, so the workload may run on a very small cluster for a long time.
- You can lose a majority of the processing nodes at any time, potentially losing the work they have done up to that point and having to repeat it. (This problem is almost equally likely in a cluster configured with 95% Spot nodes.)
To mitigate the poor performance that 100% Spot nodes can cause, Qubole recommends that you:
- Use heterogeneous clusters, because it is less likely that all instance types will experience a Spot-price surge at once.
- Use a larger minimum cluster size.
- Use a smaller Spot percentage.
- Keep Fallback to On-Demand Nodes enabled in the cluster settings.
Configuring a Mixed Cluster (On-Demand and Spot Nodes)
You configure a mixed cluster by doing all of the following:
- Setting the Autoscaling Node Purchasing Option to Spot Instance.
- Setting the Spot Instances Percentage to a number less than 100.
- Leaving Use Stable Spot Nodes unchecked.
This configures a cluster in which the core nodes (the Master Node and the nodes comprising the Minimum Worker Nodes) are On-Demand instances, and a percentage of the auto-scaling nodes are Spot instances as specified by the Spot Instances Percentage.
For example, if the Minimum Worker Nodes is 2 and the Maximum Worker Nodes is 10, and you set the Spot Instances Percentage to 50, the resulting cluster will have, at any given time:
- A minimum of 3 nodes: the Master Node plus the Minimum Worker Nodes, all of them On-Demand instances (the core nodes).
- (Usually) a maximum of 11 nodes, of which up to 4 (50% of the difference between 2 and 10) will be Spot instances, and the remainder On-Demand instances. (The cluster size can occasionally rise above the maximum for brief periods while the cluster is auto-scaling.)
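The numbers in this example can be reproduced with a short calculation. The function and key names below are hypothetical, and rounding the Spot count down when the percentage does not divide evenly is an assumption; the source does not say how Qubole rounds.

```python
def mixed_cluster_limits(min_workers: int, max_workers: int, spot_percentage: int) -> dict:
    """Node-count limits for a mixed cluster with one Master Node."""
    autoscaling_nodes = max_workers - min_workers
    # Assumption: round down when the percentage does not divide evenly.
    max_spot_nodes = autoscaling_nodes * spot_percentage // 100
    return {
        "core_on_demand_nodes": 1 + min_workers,  # Master plus Minimum Worker Nodes
        "max_total_nodes": 1 + max_workers,       # Master plus Maximum Worker Nodes
        "autoscaling_nodes": autoscaling_nodes,
        "max_spot_nodes": max_spot_nodes,
    }


limits = mixed_cluster_limits(min_workers=2, max_workers=10, spot_percentage=50)
print(limits)
```

For the values in the example this gives 3 core nodes, a usual maximum of 11 nodes, and at most 4 Spot instances among the 8 auto-scaling nodes.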
Fallback to On-Demand
In addition to the above settings, you should normally choose Fallback to on demand (check the box). This option, if selected, causes Qubole to launch On-Demand instances if it cannot obtain enough Spot or Spot Block instances when adding nodes during autoscaling. This means that the cluster could at times consist entirely of On-Demand nodes, if no Spot nodes are available at or below your maximum bid price. But unless cost is all-important, this is a sensible option to choose because it allows the cluster to do its work even if Spot nodes are not available.
Qubole also falls back to On-Demand nodes when the cluster composition for the Master node and minimum number of nodes is Spot nodes.
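The effect of the option on a single upscaling request can be sketched as below. This is a simplified model rather than Qubole's actual logic: `plan_upscale` and `UpscalePlan` are hypothetical names, and the sketch ignores the Request Timeout and the Spot Instances Percentage.

```python
from typing import NamedTuple


class UpscalePlan(NamedTuple):
    spot: int       # nodes obtained as Spot instances
    on_demand: int  # nodes launched as On-Demand because of fallback
    unfilled: int   # nodes that could not be added at all


def plan_upscale(nodes_needed: int, spot_available: int, fallback_to_on_demand: bool) -> UpscalePlan:
    """Fill the request with Spot nodes first, then fall back if allowed."""
    spot = min(nodes_needed, spot_available)
    shortfall = nodes_needed - spot
    on_demand = shortfall if fallback_to_on_demand else 0
    return UpscalePlan(spot, on_demand, shortfall - on_demand)


print(plan_upscale(4, 1, fallback_to_on_demand=True))
print(plan_upscale(4, 1, fallback_to_on_demand=False))
```

With fallback enabled the request is always filled, at higher cost; with it disabled, the shortfall stays unfilled and the cluster runs smaller than the workload needs.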
How Qubole Manages the Spot Nodes While the Cluster is Running
Qubole’s primary goal in managing cluster resources is productivity: making sure that the work you need to do gets done as efficiently and reliably as possible, and at the lowest cost that is consistent with that goal.
Qubole uses the following mechanisms to help ensure maximum productivity in running clusters that deploy Spot instances:
- The Fallback to on demand option described above.
- Spot Rebalancing. This works in conjunction with Fallback to on demand, ensuring that the cluster conforms to your original specification as much of the time as possible, by swapping out On-Demand nodes in favor of Spot nodes as soon as possible after the Spot price returns to your bid price or below. For a more detailed discussion, see Spot Rebalancing.
- The Qubole Placement Policy described above.
- Intelligent Availability Zone (AZ) Selection. Unless you specify a particular AZ when you configure the cluster, Qubole can automatically select the AZ with the lowest Spot prices for the region and instance type you’ve specified. This capability is supported for non-VPC (Virtual Private Cloud) clusters only at present, and is not enabled by default; you can create a ticket with Qubole Support to enable it for your account.
- Auto-scaling. Auto-scaling ensures that the cluster remains at just the right size for maximum productivity and efficiency, as explained in detail here.
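As an illustration of the Availability Zone selection described in the list above, the sketch below picks the zone with the lowest Spot price unless a zone has been pinned. The zone names and prices are invented, and `select_az` is a hypothetical helper, not a Qubole or AWS call.

```python
from typing import Dict, Optional


def select_az(spot_prices: Dict[str, float], pinned_az: Optional[str] = None) -> str:
    """Return the pinned AZ if one is given, otherwise the cheapest AZ."""
    if pinned_az is not None:
        return pinned_az
    return min(spot_prices, key=spot_prices.get)


# Hypothetical Spot prices (USD per hour) for one instance type in one region.
prices = {"us-east-1a": 0.14, "us-east-1b": 0.11, "us-east-1c": 0.19}

print(select_az(prices))
print(select_az(prices, pinned_az="us-east-1c"))
```

Pinning a zone bypasses the price comparison entirely, which is one reason the earlier advice is to let Qubole choose the AZ.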
Configuring Spot Blocks
If you configure AWS Spot Blocks for the Master node and minimum number of Worker nodes, you can configure:
- Spot blocks for auto-scaling (the Master nodes and minimum Worker nodes must be Spot blocks); or
- Spot nodes for auto-scaling; or
- On-Demand nodes for auto-scaling.