Qubole believes in making Big Data analysis easy. Every new account is configured with some clusters by default and these are sufficient to run small test workloads. This section describes the configuration of Qubole clusters. A good understanding of the trade-offs and capabilities in this area can help you achieve higher performance and stability at lower costs.
The following sections cover these sub-topics:
- Cluster Settings Page
- General Cluster Configuration
- AWS Settings
- Azure Settings
- Oracle OCI Settings
- Oracle OCI Classic Settings
- Hadoop-specific Options
Cluster Settings Page¶
To modify a cluster, navigate to the Clusters page and click Edit against that cluster.
See Managing Clusters for more information.
The following sections explain the different options available on the Cluster Settings page.
General Cluster Configuration¶
Many of the cluster configuration options are common across different types of clusters. Let us cover them first by going over some of the most important categories.
As explained in Cluster Labels, each cluster has one or more labels that are used to route Qubole commands. In the first form entry, you can assign one or more comma-separated labels to a cluster.
QDS supports the following cluster types:
Airflow (not configured by default).
Airflow is a platform for programmatically authoring, scheduling, and monitoring workflows. It supports integration with third-party platforms. You can author complex directed acyclic graphs (DAGs) of tasks inside Airflow. It comes packaged with a rich feature set, which is essential to the ETL world. The rich user interface and command-line utilities make it easy to visualize pipelines running in production, monitor progress, and troubleshoot issues as required.
Hadoop 1 (one Hadoop 1 cluster is configured by default for AWS only).
This is Qubole’s tried and tested Mapreduce framework running a version of Hadoop API compatible with Apache Hadoop 1.0. Hadoop is suitable for reasonably long-running jobs and batch-processing workloads. You can also use this for Hive, Pig and MapReduce applications of all types (including Cascading).
Hadoop 2 (one Hadoop 2 cluster is configured by default in all cases).
Hadoop 2 clusters run a version of Hadoop API compatible with Apache Hadoop 2.6. Hadoop 2 clusters use Apache YARN cluster manager and are tuned for running MapReduce and other applications as with Hadoop 1.
Presto (one Presto cluster is configured by default on Cloud platforms that support Presto; see QDS Components: Supported Versions and Cloud Platforms).
Presto is the fast in-memory query processing engine from Facebook. This is useful for near-real-time query processing for less complex queries that can be mostly performed in-memory.
Spark (one Spark cluster is configured by default in all cases).
Spark clusters allow you to run applications based on supported Apache Spark versions. Spark has a fast in-memory processing engine that is ideally suited for iterative applications like machine learning. Qubole’s offering integrates Spark with the YARN cluster manager.
Cluster Size and Instance Types¶
From a performance standpoint, this is one of the most critical sets of parameters:
- Set a Minimum and Maximum Slave Count for a cluster (in addition to one fixed Master node).
All Qubole clusters auto-scale up and down automatically within the minimum and maximum range set in this section.
Master and Slave Node Type
Select the Slave Node Type according to the characteristics of the application. A memory-intensive application would benefit from memory-rich nodes (such as r3 node types in AWS, or E2-64 V3 in Azure), while a CPU-intensive application would benefit from instances with higher compute power (such as the c3 types in AWS or E2-64 V3 in Azure).
The Master Node Type is usually determined by the size of the cluster. For smaller clusters and workloads, small instances (such as m1.large in AWS or Standard_A6 in Azure) suffice. But for extremely large clusters (or for running a large number of concurrent applications), Qubole recommends large-memory machines (such as 8xlarge machines in AWS, or larger Azure instances such as Standard_A8).
Qubole uses Linux instances as cluster nodes.
Node Bootstrap File¶
This field provides the location of a Bash script used for installing custom software packages on cluster nodes.
Advanced applications often require custom software to be installed as a prerequisite. A Hadoop Mapper Python script may require access to SciPy/NumPy, for example, and this is often best arranged by simply installing these packages (using yum for example) by means of the node bootstrap script. See Understanding a Node Bootstrap Script for more information.
The account’s storage credentials are used to read the script, which runs with root privileges on both master and slave nodes; on slave nodes, make it runs before any task is launched on behalf of the application.
Qubole does not check the exit status of the script. If software installation fails and it is unsafe to run user applications this case, you should shut the machine down from the bootstrap script.
Qubole recommends installing/updating custom Python libraries after activating Qubole’s Virtual Environment and installing libraries in it.
See Running Node Bootstrap and Adhoc Scripts on a Cluster for more information on running node bootstrap scripts.
Disable Automatic Cluster Termination:
By default, an idle cluster is automatically terminated. This option allows you to override this behavior. If the option is enabled, the cluster can still be downscaled to minimum slave count but it is not terminated automatically. It must be terminated explicitly by you.
Use this option with extreme caution because it can significantly increase running costs.
Use the following glossary to help you configure the settings below; see also Using AWS Spot Instances and Spot Blocks in Qubole Clusters.
- Core nodes refers to the Master node and minimum number of Slave nodes (together comprising the minimum cluster size).
- Auto-scaled or Non-core nodes refers to nodes that are added in addition to the core nodes, up to the Maximum Slave Count. The auto-scaled nodes can be On-Demand or Spot instances depending on the Spot Instances Percentage.
Qubole recommends using On-Demand instances for core nodes.
Configure the following in the Cluster Composition section:
Autoscaling Node Purchasing Option: Specifies the AWS purchasing model for Slave nodes added to the cluster during auto-scaling. The values that this option can have are:
Qubole recommends this setting for workloads that are not cost sensitive or are extremely time sensitive.
When this option is selected, QDS bids for Spot instances when provisioning auto-scaling nodes.
Use Qubole Placement Policy:
When Slave nodes are Spot instances, this policy makes a best effort to ensure that at least one replica of each HDFS block is placed on on a stable instance. Qubole recommends you leave this enabled.
Fallback to on-demand:
When upscaling the cluster, QDS may not be able to procure Spot instances because of low availability or high price. This option specifies that in this case QDS should fall back to procuring On-Demand instances. This increases the cost of running the cluster but ensures that the job completes relatively quickly. Enable this option if processing time is important. QDS attempts to replace such On-Demand instances with Spot instances when Spot instance prices and cluster utilization permit.
Maximum Bid Price:
Specifies the maximum bid price for the auto-scaling Spot nodes, as a percentage of On-Demand Instance price.
Specifies the time in minutes for which the bid is valid.
Spot Nodes %:
Specifies the maximum percentage of auto-scaling nodes that can be Spot instances. For example, if the cluster has a minimum size of 2 nodes, a maximum size of 10, and a Spot percentage of 50%, then while scaling the node from 2 to 10 nodes, QDS tries to use 4 On-Demand and 4 Spot nodes.
AWS EC2 Settings¶
See also Modifying Security Settings (AWS).
Qubole Public Key cannot be changed. QDS uses this key to access the cluster and run commands on it.
Persistent Security Groups - This option overrides the account-level security group settings. By default, this option is not set; it inherits the account-level persistent security group, if any. Use this option if you want to apply additional access permissions to cluster nodes.
AWS Region and Availability Zone can be selected for nodes in the cluster.
By default, the Region is set to us-east-1. If it is set to No Preference, the best Availability Zone based on system health and available capacity is selected. Note that all cluster nodes are in the same availability zone. You can select an Availability Zone in which instances are reserved to benefit from the AWS Reserved Instances.
VPC and Subnet settings can be used to launch this cluster within a specific VPC.
If they are not specified, cluster nodes are placed in a Qubole created Security Group in EC2-Classic or in the default VPC for EC2-VPC accounts.
Custom EC2 Tags allow you to tag instances of a cluster to get that tag on AWS. They are useful in billing across teams as you get the AWS cost per tag, which helps to calculate the AWS costs of different teams in a company. This setting is under the Advanced Settings tab.
AWS EBS Volumes¶
EBS Volume settings allow you to add additional EBS drives to AWS instances when they are launched.
These options are displayed only for instances with limited instance-store capacity (such as c3 instance types).
Additional drives may not be required if jobs do not generate a lot of output, so use this option only if your testing shows you need it.
QDS uses instance-store drives in preference to EBS volumes even if the latter are configured. This is only valid to magnetic EBS volumes on Hadoop 1 clusters. This reduces AWS cost and helps to achieve better and more predictable I/O performance.
Qubole has the following recommendations for using EBS volumes:
If workflows use HDFS intensively, it is recommended to use
st1volumes. EBS upscaling may also be enabled in such cases to avoid out of space errors.
For instance types that do not have any instance store volumes (
r4), it is recommended to use at least two EBS volumes as one disk would also be used for logs and other frequently-accessed data and it may thus experience reduced bandwidth for HDFS.
In other cases, small SSD volumes should suffice.
It is not recommended to use older generation magnetic (standard) volumes as they are comparatively poor in performance and unlike other volume types, there is an additional charge by AWS on I/O to these volumes.
sc1volumes operate on a burst capacity model as described here. Hence, if the desired disk capacity is between 2 and 12.5 TB, it is preferable to use multiple smaller disks (of 2 TB or above) since you would get more burst capacity. However if you are using less than that you are better off with a single
st1volume. Between 12.5 and 16 TB, you do not get any benefit of the burst capacity. Similarly for SSD volumes up to 1 TB, it is better to use multiple smaller volumes so you can make use of more burst capacity. However, it is not recommended to use too small a volume size since the baseline IOPS would be very low. Also, the burst capacity would be exhausted quite quickly and would take a longer time to refill. Above 1 TB, you would not get the benefit of burst capacity, so it is better to use a single volume so you can use the higher baseline throughput.
If you need more than 1 TB HDFS capacity per node, it would be better to use
st1volumes instead of
ssdvolumes for better throughput and cheaper cost.
See Modifying the Configuration (AWS) for more information on the supported EBS Volume Type and the size range.
Configuring EBS Upscaling in AWS Hadoop and Spark Clusters¶
EBS upscaling is supported on YARN-based clusters: Hadoop 2 (Hive) and Spark clusters. QDS dynamically scales cluster storage (independent of compute capacity) to suit the workload by adding EBS volumes to EC2 instances that have limited storage and are close to full capacity. EBS upscaling is available only on instance types on which Qubole supports adding EBS storage. See EBS Upscaling on Hadoop MRv2 AWS Clusters for more information.
EBS upscaling is not enabled by default; you can enable it for Spark and Hadoop 2 clusters in the Clusters section of the QDS UI, or use the Clusters API. For the required EC2 permissions, see Sample Policy for EBS Upscaling..
To enable EBS upscaling in the QDS UI, choose a value greater than 0 from the drop-down list for EBS Volume Count. Then you can select Enable EBS Upscaling. Once you select it, the EBS upscaling configuration is displayed with the default configuration as shown in this figure.
You can configure:
The maximum number of EBS volumes QDS can add to an instance: Maximum EBS Volume Count. It must be more than the EBS volume count for the upscaling to work.
The percentage of free space on the logical volume as a whole at which disks should be added: Free Space Threshold. The default is 15%, which means new disks are added when the EBS volume is at least 85% full. The percentage threshold changes as the size of the logical volume increases. For example, if you start with a threshold of 15% and a single disk of 100GB, the disk would upscale when it has less than 15GB free capacity. On addition of a new node, the total capacity of the logical volume becomes 200GB and it would upscale when the free capacity falls below 30GB.
The free-space capacity (in GB) above which upscaling does not occur: Absolute Free Space Threshold. The default is 100 GB. If you prefer to upscale only when free capacity is below a fixed value, use the Absolute Free Space Threshold. The default value is 100, meaning that if the logical volume has at least 100GB of capacity, QDS does not add more EBS volumes.
The frequency (in seconds) at which the capacity of the logical volume is sampled: Sampling Interval. The default is 30 seconds.
The number of sampling intervals over which Qubole evaluates the rate of increase of used capacity: Sampling Window. The default is 5. It is the number of Sampling Intervals over which Qubole evaluates the rate of increase of used capacity. Its default value is 5. This means that the rate is evaluated over 150 (30 * 5) seconds by default. To disable upscaling based on rate and use only thresholds, this value may be set to 1. When the rate-based upscaling is set to 1, then Absolute Free Space Threshold is monitored at Sampling Interval.
The logical volume is upscaled if, at the current rate of usage, it will fill up in (Sampling Interval + 120) seconds (the additional 120 seconds is because the addition of a new EBS volume to a heavily loaded volume group has been observed to take up to 120 seconds).
Here is an example of how the free space threshold decreases with respect to the Sampling Window and Sampling Interval. Assuming the default value of Sampling Interval (30 seconds) and Sampling Window (5), this is how EBS Upscaling occurs:
- 0th second: 100% free space
- 30th second: 80% free space
- 60th second: 60% free space
- 90th second: 40% free space
- 120th second: 20% free space
- 150th second: 0% free space
ebs_upscaling_config provides a list of API parameters related to EBS upscaling configuration.
Miscellaneous AWS Settings¶
Provides an option to encrypt the data at rest on the node ephemeral (local) storage. If this option is selected, HDFS along with the intermediate output generated by Hadoop is stored in an encrypted form on the underlying storage device. Block device encryption is setup for ephemeral drives before the node joins the cluster. As a side effect, the cluster might take additional time (depending on the instance type selected) before it becomes operational. Upscaling a cluster may also take additional time.
Enable Ganglia Monitoring:
This provides an option to monitor the state of cluster with Ganglia. When you enable Ganglia monitoring for a cluster, you can view the performance of the cluster as a whole, and as well inspect the performance of individual node instances. You can also view various CPU and disk metrics as well as detailed Hadoop metrics. For more information, see Performance Monitoring with Ganglia.
To add or modify an Azure cluster, choose Clusters from the drop-down list on the QDS main menu, then choose New and the cluster type and click on Create, or choose Edit to change the configuration of an existing cluster. Configure the cluster as described under Modifying Cluster Settings for Azure.
Oracle OCI Settings¶
To add or modify an Oracle OCI cluster, choose Clusters from the drop-down list on the QDS main menu, then choose New, or choose Edit to change the configuration of an existing cluster. Then follow the directions you used when you first updated your clusters.
Oracle OCI Classic Settings¶
To add or modify an Oracle OCI Classic cluster, choose Clusters from the drop-down list on the QDS main menu, then choose New, or choose Edit to change the configuration of an existing cluster. Then follow the directions you used when you first updated your clusters.
- Override Hadoop Configuration Variables: Allows you to override the default Hadoop cluster and job configurations. Enter one variable, with its value, per line.
- Recommended Configuration: The recommended configuration variable values, used by default.
- Fair Scheduler Configuration: Provides the fair scheduler configuration values.
- Default Fair Scheduler Pool: All jobs run on the cluster default to this pool (can be overridden at job submission time).