Create a New Cluster¶
POST /api/v1.3/clusters/
Creates a new cluster with the given configuration.
Use this API to create a cluster for a workload that must run in parallel with your existing workloads.
For example, you might create a new cluster to run workloads in a different region or on different instance types.
QDS supports defining account-level default cluster tags through the UI and plans to provide API support shortly. For more information, see Adding Account and User level Default Cluster Tags (AWS).
Note
Qubole supports options to control the query runtime, which are described in Configuring Query Runtime Settings.
Required Role¶
The following users can make this API call:
- Users who belong to the system-user or system-admin group.
- Users who belong to a group associated with a role that allows creating a cluster. See Managing Groups and Managing Roles for more information.
Parameters¶
Note
Parameters marked in bold below are mandatory. Others are optional and have default values. See Create a New Airflow Cluster for more information on the parameters supported by an Airflow cluster and an example. Presto is not currently supported on all Cloud platforms; see QDS Components: Supported Versions and Cloud Platforms.
engine_config for Enabling HiveServer2 on a Hadoop 2 (Hive) Cluster describes the additional configuration for enabling HiveServer2 on a Hadoop 2 (Hive) cluster. For details on configuring multi-instance HS2 through the REST API, see Choosing Multi-instance as an option for running HiveServer2 on Hadoop (Hive) Clusters.
Parameter | Description |
---|---|
label | A list of labels that identify the cluster. At least one label must be provided when creating a cluster. |
presto_version | It is mandatory and only applicable to a Presto cluster. The supported values are:
|
spark_version | It is mandatory and only applicable to a Spark cluster. The supported values are: 1.6.2 , 2.0.2 , 2.1.1 , 2.2.0 , 2.2.1 , 2.3.1 , and 2.4.0 .
For more information, see QDS Components: Supported Versions and Cloud Platforms. Deprecated versions: 1.3.1 , 1.4.0 , 1.4.1 , 1.5.0 , 1.5.1 , 1.6.0 , 1.6.1 , 2.0.0 , and 2.1.0 . |
zeppelin_interpreter_mode | This parameter is only applicable to a Spark cluster. The default mode is legacy. Set it to user for user-level cluster-resource management on notebooks. See Using the User Interpreter Mode for Spark Notebooks for more information. |
ec2_settings | Amazon EC2 Settings. The default values are considered if the settings are not configured. |
node_configuration | Cluster node instances type and other settings. |
hadoop_settings | Hadoop cluster settings that also contains the configuration description to enable Spark on the cluster. |
security_settings | Instance security settings. |
presto_settings | Presto cluster settings. |
spark_settings | Spark cluster settings. |
datadog_settings | Datadog cloud monitoring settings. Qubole supports the Datadog cloud monitoring service on Hadoop 2 (Hive) clusters. |
disallow_cluster_termination | Prevents auto-termination of the cluster after a prolonged period of disuse. The default value is false. |
enable_ganglia_monitoring | Enables Ganglia monitoring for the cluster. The default value is false. |
node_bootstrap_file | A file that is executed on every node of the cluster at boot time. Use it to customize the cluster nodes, for example by setting environment variables or installing required packages. The default value is node_bootstrap.sh. |
ec2_settings¶
Parameter | Description |
---|---|
compute_access_key | The EC2 Access Key. (Note: This field is not visible to non-admin users.) |
compute_secret_key | The EC2 Secret Key. (Note: this field is not visible to non-admin users.) |
aws_region | The AWS region in which to create the cluster.
The default value is us-east-1. Valid values are us-east-1, us-east-2, us-west-1, us-west-2,
eu-west-1, eu-west-2, sa-east-1, ap-south-1, ap-southeast-1, ap-northeast-1, ap-northeast-2,
and ca-central-1. |
aws_preferred_availability_zone | The preferred availability zone (AZ) in which the cluster must be created. The default value is Any . However, if the
cluster is in a VPC, then you cannot set the AZ. |
vpc_id | The ID of the Virtual Private Cloud (VPC) in which the cluster is created.
In this VPC, the enableDnsHostnames parameter must be set to true. |
subnet_id | The ID of the subnet in which the cluster is created. The subnet must belong to the VPC specified in vpc_id and can be
public or private. Qubole supports multiple subnets. Specify multiple subnets in this format:
"subnet_id": "subnet-id1, subnet-id2, ...., subnet-idn" . |
master_elastic_ip | The Elastic IP address to attach to the cluster coordinator. For more information, see this documentation. |
bastion_node_public_dns | The public DNS name of the Bastion host. Specify it if the cluster is in a private subnet of a VPC. Do not specify this value for a public subnet. |
bastion_node_port | The port of the Bastion node. The default value is 22. Specify a non-default port if you want to access a cluster that is in a VPC with a private subnet. |
bastion_node_user | The Bastion node user, which is ec2-user by default. Use this option to specify a non-default user. |
role_instance_profile | It is a user-defined IAM Role name that you can use in a dual-IAM role configuration. This Role overrides the account-level IAM Role and only you (and not even Qubole) can access this IAM Role and thus it provides more security. For more information, see Creating Dual IAM Roles for your Account. |
use_account_compute_creds | Set it to true to use the account’s compute credentials for all clusters of the account. The default value is false .
This option is not supported on clusters of an IAM-Role-based account. |
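The constraints in the ec2_settings table can be sketched in Python. `build_ec2_settings` is a hypothetical helper written for illustration and is not part of any Qubole SDK. It assumes that a subnet is only meaningful together with a VPC, joins multiple subnets into the comma-separated string format shown in the table, and rejects an availability zone for a cluster in a VPC.

```python
def build_ec2_settings(aws_region="us-east-1", vpc_id=None, subnet_ids=(),
                       availability_zone="Any", bastion_public_dns=None):
    """Assemble an ec2_settings dict (hypothetical helper, for illustration only)."""
    if vpc_id and availability_zone != "Any":
        # The table states that the AZ cannot be set for a cluster in a VPC.
        raise ValueError("aws_preferred_availability_zone cannot be set with vpc_id")
    if subnet_ids and not vpc_id:
        # Assumption: the subnet must belong to a VPC, so vpc_id is required.
        raise ValueError("subnet_id requires vpc_id")

    settings = {"aws_region": aws_region}
    if vpc_id:
        settings["vpc_id"] = vpc_id
        if subnet_ids:
            # Multiple subnets are passed as a single comma-separated string.
            settings["subnet_id"] = ", ".join(subnet_ids)
        if bastion_public_dns:
            # Only needed when the cluster is in a private subnet.
            settings["bastion_node_public_dns"] = bastion_public_dns
    else:
        settings["aws_preferred_availability_zone"] = availability_zone
    return settings


# The VPC and subnet IDs below are placeholders.
vpc_cluster = build_ec2_settings(
    aws_region="us-west-2",
    vpc_id="vpc-0example",
    subnet_ids=["subnet-id1", "subnet-id2"],
)
print(vpc_cluster["subnet_id"])  # subnet-id1, subnet-id2
```

The resulting dict can be placed under the `ec2_settings` key of the request body.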
node_configuration¶
Parameter | Description |
---|---|
master_instance_type | The instance type to use for a cluster coordinator node. The default value is m1.large for Hadoop-1, Hadoop-2, and
Presto clusters. The default value is m3.xlarge for a Spark cluster. |
slave_instance_type | The instance type to use for cluster worker nodes. The default value is m1.xlarge for Hadoop-1, Hadoop-2, and Presto
clusters. The default value is m3.2xlarge for a Spark cluster. |
heterogeneous_instance_config | Qubole supports configuring heterogeneous nodes in Hadoop 2 and Spark clusters. It implies that worker nodes can be of different instance types. For more information, see heterogeneous_instance_config and An Overview of Heterogeneous Nodes in Clusters. |
initial_nodes | The number of nodes to start the cluster with. The default value is 2 . |
max_nodes | The maximum number of nodes up to which the cluster can be autoscaled. The default value is 2 . |
slave_request_type | The request type for the autoscaled worker instances. The default value is Qubole allows you to set Note The feature to set Spot blocks as autoscaling nodes even when the coordinator node and minimum worker nodes are On-Demand nodes, is available for a beta access and it is only applicable to Hadoop 2 (Hive) clusters. Create a ticket with Qubole Support to enable it on the account. For more information, see Configuring Spot Blocks. |
spot_instance_settings | The purchase options for autoscaling worker Spot instances. These options do not apply to the minimum number of nodes, which
is set by initial_nodes . |
stable_spot_instance_settings | Purchases both coordinator node(s) and worker node(s) as Spot Instances only. The bid price is given using
the stable_spot_instance_settings. The coordinator node and minimum worker node request type depends on whether or not the
stable_spot_instance_settings are passed. To know more, see Coordinator and Minimum Number of Nodes in a Cluster. |
spot_block_settings | Spot Blocks are Spot instances that run continuously for a finite duration (1 to 6 hours). They are 30 to 45 percent cheaper than On-Demand instances based on the requested duration. For more information, see spot_block_settings. QDS ensures that Spot blocks are acquired at a price lower than On-Demand nodes. It also ensures that autoscaled nodes are acquired for the remaining duration of the cluster. For example, if the duration of a Spot block cluster is 5 hours and there is a need to autoscale at the 2nd hour, Spot blocks are acquired for 3 hours. |
fallback_to_ondemand | Falls back to On-Demand nodes if Spot nodes cannot be obtained when adding nodes during autoscaling. It is valid only if
the worker request type is spot . The default value is false if slave_request_type is spot . Qubole also falls back to
On-Demand nodes when the coordinator and minimum number of nodes are configured as Spot nodes. |
ebs_volume_type | The default EBS volume type is Note For recommendations on using EBS volumes, see AWS EBS Volumes. |
ebs_volume_size | The default EBS volume size is 100 GB for Magnetic/General Purpose SSD volume types and 500 GB for Throughput Optimized HDD/Cold HDD volume type. The supported value range is 100 GB/500 GB to 16 TB. The minimum and maximum volume size varies for each EBS volume type and are mentioned below:
Note For recommendations on using EBS volumes, see AWS EBS Volumes. |
ebs_volume_count | The number of EBS volumes to attach to each cluster instance. The default value is 0. |
ebs_upscaling_config | Hadoop 2 and Spark clusters that use EBS volumes can dynamically expand their storage capacity. This relies on Logical Volume Management (LVM). When enabled, a logical volume is created on a volume group that contains the EBS volumes. When the logical volume approaches full capacity, additional EBS volumes are attached to the instance and added to the logical volume, and the file system is resized to use the additional capacity. This is not enabled by default. Storage-capacity upscaling also supports upscaling based on the rate of increase of used capacity. Note: For the required EC2 permissions, see Sample Policy for EBS Upscaling. Here is an example: "node_configuration" : {
"ebs_upscaling_config": {
"max_ebs_volume_count":5,
"percent_free_space_threshold":20.0,
"absolute_free_space_threshold":100,
"sampling_interval":40,
"sampling_window":8
}
}
See ebs_upscaling_config for information on the configuration options. |
custom_ec2_tags | An optional set of tags to apply to the AWS instances created for the cluster and to the EBS volumes attached to these instances. The custom tags are also applied to the Qubole-created security groups (if any). Specify the tags as a JSON object of tag-value pairs, for example, {"project": "webportal", "owner": "john@example.com"}. Tags and values must have alphanumeric characters and can contain only these special characters: + (plus sign), . (period), - (hyphen), @ (at sign), = (equal sign), / (forward slash), : (colon), and _ (underscore). The tags Qubole and alias are reserved for use by Qubole (see Qubole Cluster EC2 Tags (AWS)). Tags beginning with aws- are reserved for use by Amazon. Qubole supports defining user-level EC2 tags. For more information, see Adding Account and User level Default Cluster Tags (AWS). |
idle_cluster_timeout | The default cluster timeout is 2 hours. Optionally, you can set it to a value from 0 to 6 hours. The only supported unit of
time is hours. If the timeout is set at the account level, it applies to all clusters
within that account. However, you can override the timeout at the cluster level. The timeout takes effect after
all queries on the cluster have completed. Qubole terminates a cluster at an hour boundary. For example, when idle_cluster_timeout
is 0 and any node in the cluster is near its hour boundary (that is, it has been running for 50-60 minutes and is
idle after all queries have executed), Qubole terminates that cluster. |
idle_cluster_timeout_in_secs | After enabling the aggressive downscaling feature on the QDS account, the Cluster Idle Timeout can be configured in
seconds. Its minimum configurable value is Note This feature is only available on a request. Contact the account team to enable this feature on the QDS account. |
node_base_cooldown_period | With the aggressive downscaling feature enabled on the QDS account, it is the cool down period set in minutes for On-Demand nodes on a Hadoop 2 or a Spark cluster. The default value is 10 minutes. For more information, see Understanding Aggressive Downscaling in Clusters (AWS). Note This feature is only available on a request. Contact the account team to enable this feature on the QDS account.
You must not set the Cool Down Period to a value lower than |
With the aggressive downscaling feature enabled on the QDS account, it is the cool down period set in minutes for
cluster nodes on a Presto cluster. The default value is Note This feature is only available on a request. Contact the account team to enable this feature on the QDS account. |
|
node_spot_cooldown_period | With the aggressive downscaling feature enabled on the QDS account, it is the cool down period set in minutes for
Spot nodes on a Hadoop 2 or a Spark cluster. The default value is 15 minutes. For more information, see
Understanding Aggressive Downscaling in Clusters (AWS). It is not applicable to Presto clusters as Note This feature is only available on a request. Contact the account team to enable this feature on the QDS account.
You must not set the Cool Down Period to a value lower than |
root_volume_size | Use this parameter to configure the root volume of cluster instances. The supported range for the root volume size is
90 - 2047 . |
env_settings | Use this parameter to specify the Python and R versions. Possible values are: Supported combinations of Python and R versions:
Note For the last two combinations, contact Qubole Support. Sample code: "env_settings":{"python_version":"3.7","r_version":"3.5"}
|
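Some of the limits stated in the node_configuration table can be checked before a request is sent. The sketch below is a hypothetical client-side check, not behavior of the API itself. It covers only three rules that the documentation states explicitly: the 0-6 hour range for idle_cluster_timeout, the 90-2047 range for root_volume_size, and the requirement that max_ebs_volume_count be greater than ebs_volume_count.

```python
def check_node_configuration(cfg):
    """Return a list of problems found in a node_configuration dict.

    Hypothetical helper: it only checks limits stated in the parameter tables.
    """
    problems = []

    timeout = cfg.get("idle_cluster_timeout")
    if timeout is not None and not 0 <= timeout <= 6:
        problems.append("idle_cluster_timeout must be between 0 and 6 hours")

    root_size = cfg.get("root_volume_size")
    if root_size is not None and not 90 <= root_size <= 2047:
        problems.append("root_volume_size must be between 90 and 2047")

    upscaling = cfg.get("ebs_upscaling_config")
    if upscaling is not None and "max_ebs_volume_count" in upscaling:
        # ebs_volume_count defaults to 0 according to the table.
        if upscaling["max_ebs_volume_count"] <= cfg.get("ebs_volume_count", 0):
            problems.append("max_ebs_volume_count must exceed ebs_volume_count")

    return problems


print(check_node_configuration({"idle_cluster_timeout": 7, "root_volume_size": 100}))
```

An empty list means that none of these three rules is violated; it does not guarantee that the API accepts the configuration.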
Coordinator and Minimum Number of Nodes in a Cluster¶
To add the coordinator and minimum number of nodes in a cluster, you can use On-Demand nodes, Stable Spot Instances, or Spot Blocks. Set the cluster composition by using one of these configuration types:
- OnDemand: This is the default value. It applies to On-Demand nodes.
- stable_spot_instance_settings: This applies to Spot Instances. For example, stable_spot_instance_settings: {maximum_bid_price_percentage: "", timeout_for_request: ""}.
- spot_block_settings: This applies to Spot Blocks. For example, spot_block_settings: {duration: ""}.
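The three composition types can be sketched as a small Python helper that returns the fragment to merge into node_configuration. `composition_settings` and its argument names are hypothetical. The sketch assumes that On-Demand composition is expressed by omitting both settings objects, which follows from the statement that the request type depends on whether stable_spot_instance_settings are passed. The numeric values in the usage lines are placeholders.

```python
def composition_settings(kind="ondemand", bid_percentage=None,
                         timeout_minutes=None, duration_minutes=None):
    """Return the node_configuration fragment for one composition type.

    Hypothetical helper; the key names come from the examples in this section.
    """
    if kind == "ondemand":
        # Assumption: On-Demand is the default, so nothing extra is sent.
        return {}
    if kind == "stable_spot":
        return {
            "stable_spot_instance_settings": {
                "maximum_bid_price_percentage": bid_percentage,
                "timeout_for_request": timeout_minutes,
            }
        }
    if kind == "spot_block":
        return {"spot_block_settings": {"duration": duration_minutes}}
    raise ValueError("unknown composition type: " + kind)


# Merge the fragment into the rest of node_configuration.
node_configuration = {"initial_nodes": 2, "max_nodes": 4}
node_configuration.update(composition_settings("spot_block", duration_minutes=120))
print(node_configuration)
```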
Cluster Composition Settings (AWS) describes how to configure the coordinator and Minimum Number of Nodes through the Clusters UI page.
heterogeneous_instance_config¶
See An Overview of Heterogeneous Nodes in Clusters for more information.
Parameter | Description |
---|---|
memory | To configure the heterogeneous cluster, you must provide a list of whitelisted set of "node_configuration":{
"heterogeneous_instance_config":{
"memory": [
{"instance_type": "m4.4xlarge", "weight": 1.0},
{"instance_type": "m4.2xlarge", "weight": 0.5},
{"instance_type": "m4.xlarge", "weight": 0.25}
]
}
}
The following points about the instance types apply to a heterogeneous cluster:
|
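The weights in the example above match each instance type's memory relative to the first type in the list (m4.4xlarge has 64 GiB, m4.2xlarge has 32 GiB, and m4.xlarge has 16 GiB). The following Python sketch builds the memory list on that basis. Treat the weighting rule as an assumption drawn from the example values rather than a documented rule; `memory_weights` and the `MEMORY_GIB` lookup table are hypothetical names.

```python
# Memory per instance type in GiB (published AWS sizes for the m4 family).
MEMORY_GIB = {"m4.xlarge": 16, "m4.2xlarge": 32, "m4.4xlarge": 64}


def memory_weights(instance_types, memory_gib=MEMORY_GIB):
    """Weight each type by its memory relative to the first type in the list.

    Assumption: this mirrors how the example weights appear to be derived.
    """
    base = memory_gib[instance_types[0]]
    return [
        {"instance_type": name, "weight": memory_gib[name] / base}
        for name in instance_types
    ]


heterogeneous_instance_config = {
    "memory": memory_weights(["m4.4xlarge", "m4.2xlarge", "m4.xlarge"])
}
print(heterogeneous_instance_config)
```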
ebs_upscaling_config¶
Note
For the required EC2 permissions, see Sample Policy for EBS Upscaling.
Parameter | Description |
---|---|
max_ebs_volume_count | The maximum number of EBS volumes that can be attached to an instance. It must be more than ebs_volume_count for upscaling to work. |
percent_free_space_threshold | The percentage of free space on the logical volume as a whole at which addition of disks must be attempted. The default value is 25%, which means new disks are added when the EBS volume is (greater than or equal to) 75% full. |
absolute_free_space_threshold | The absolute free capacity of the EBS volume above which upscaling does not occur. With a percentage threshold, the absolute trigger point changes as the size of the logical volume increases. For
example, if you start with a threshold of 15% and a single disk of 100GB, the disk would upscale when it has less than 15GB free capacity. On addition of a new disk,
the total capacity of the logical volume becomes 200GB and it would upscale when the free capacity falls below 30GB. If you would prefer to upscale only when the
free capacity is below a fixed value, you may use the absolute_free_space_threshold . The default value is 100, which means that if the logical volume has at least
100GB of free capacity, Qubole would not add more EBS volumes. |
sampling_interval | It is the frequency at which the capacity of the logical volume is sampled. Its default value is 30 seconds. |
sampling_window | It is the number of The logical volume is upscaled if, based on the current rate, it is estimated to get full in (sampling_interval + 600) seconds (the additional 600 seconds is because
the addition of a new EBS volume to a heavily loaded volume group has been observed to take up to 600 seconds.) Here is an example how the free space threshold
decrease with respect to the Sample Window and Sample Interval. Assuming the default value of
|
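The threshold arithmetic described in this table can be worked through in a few lines of Python. This is a sketch under stated assumptions, not Qubole's implementation: it assumes that the percentage threshold triggers upscaling only while free capacity is below the absolute threshold, and that the rate-based check compares the estimated time to fill with sampling_interval + 600 seconds. All function names are hypothetical.

```python
def percent_threshold_gb(total_gb, percent_free_space_threshold=25.0):
    """Free capacity (GB) at which the percentage threshold is reached."""
    return total_gb * percent_free_space_threshold / 100.0


def should_upscale(total_gb, free_gb, percent_free_space_threshold=25.0,
                   absolute_free_space_threshold=100):
    """Assumed combination of the percentage and absolute thresholds."""
    if free_gb >= absolute_free_space_threshold:
        # Enough absolute free capacity: no volumes are added.
        return False
    return free_gb <= percent_threshold_gb(total_gb, percent_free_space_threshold)


def fills_soon(free_gb, fill_rate_gb_per_sec, sampling_interval=30):
    """Rate-based check: full within (sampling_interval + 600) seconds?"""
    if fill_rate_gb_per_sec <= 0:
        return False
    return free_gb / fill_rate_gb_per_sec <= sampling_interval + 600


# The 15% example from the table: one 100GB disk, then two disks (200GB).
print(percent_threshold_gb(100, 15))  # 15.0
print(percent_threshold_gb(200, 15))  # 30.0
```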
spot_instance_settings¶
Parameter | Description |
---|---|
timeout_for_request | The timeout for a Spot Instance request in minutes. The default value is 1 for new clusters. Qubole recommends using the default value of 1 minute in
existing clusters. |
maximum_spot_instance_percentage | The maximum percentage of instances that may be purchased from the AWS Spot market. The default value is 50 . |
stable_spot_instance_settings¶
Use this parameter to set coordinator and minimum number of nodes in a cluster. For more information, see Coordinator and Minimum Number of Nodes in a Cluster.
Parameter | Description |
---|---|
timeout_for_request | The timeout for a Spot Instance request in minutes. The default value is 1 for new clusters. Qubole recommends using the default value of 1 minute in
existing clusters. |
spot_block_settings¶
Use this parameter to set the coordinator node and minimum number of nodes as described in Coordinator and Minimum Number of Nodes in a Cluster, and worker nodes.
Parameter | Description |
---|---|
duration | The duration in minutes. The accepted value range is 60-360 minutes and the duration must be a multiple of 60. It is set in node_configuration. Spot blocks are more stable than Spot nodes because they are not susceptible to being taken away during the specified duration. However, these nodes are terminated once the requested duration is over. For more details, see AWS spot blocks. Here is an example of a Spot block configuration: "node_configuration": {"spot_block_settings": {"duration":120} }
|
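The duration rule (60 to 360 minutes, in multiples of 60) is easy to check before sending a request. The Python sketch below uses a hypothetical helper name and is a client-side convenience only. The second function illustrates the statement in the node_configuration table that autoscaled nodes are acquired for the remaining duration of the cluster.

```python
def spot_block_settings(duration_minutes):
    """Build the spot_block_settings fragment after validating the duration."""
    if not 60 <= duration_minutes <= 360 or duration_minutes % 60 != 0:
        raise ValueError("duration must be 60-360 minutes and a multiple of 60")
    return {"spot_block_settings": {"duration": duration_minutes}}


def remaining_block_hours(cluster_hours, elapsed_hours):
    """Hours for which an autoscaled Spot block is acquired."""
    return cluster_hours - elapsed_hours


print(spot_block_settings(120))     # {'spot_block_settings': {'duration': 120}}
print(remaining_block_hours(5, 2))  # 3
```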
hadoop_settings¶
Parameter | Description |
---|---|
use_hadoop2 | Set this parameter to true to start Hadoop 2 daemons on a cluster. It is a mandatory setting for a Hadoop 2
cluster. |
use_spark | This is a mandatory setting for a Spark cluster. Its value must be true to start Spark daemons on the cluster. |
custom_config | The custom Hadoop configuration overrides. The default value is blank. |
fairscheduler_settings | The fair scheduler configuration options. |
use_qubole_placement_policy | Use Qubole Block Placement policy for clusters with spot nodes. |
fairscheduler_settings¶
Parameter | Description |
---|---|
fairscheduler_config_xml | The XML string, with custom configuration parameters, for the fair scheduler. The default value is blank. |
default_pool | The default Fair Scheduler queue, which is used if no queue is specified during job submission. |
security_settings¶
It is now possible to enhance security of a cluster by authorizing Qubole to generate a unique SSH key every time a cluster is started. This feature is not enabled by default. Create a ticket with Qubole Support to enable this feature. Once this feature is enabled, Qubole starts using the unique SSH key to interact with the cluster. For clusters running in private subnets, enabling this feature generates a unique SSH key for the Qubole account. This SSH key must be authorized on the Bastion host.
Parameter | Description |
---|---|
encrypted_ephemerals | Qubole allows encrypting ephemeral drives on the instances. Create a ticket with Qubole Support to enable the block device encryption. |
ssh_public_key | The SSH key to use to log in to the instances. The default value is none. (Note: This parameter is not visible to non-admin users.) The SSH key must be in the OpenSSH format and not in the PEM/PKCS format. |
persistent_security_group | This option overrides the account-level security group settings. By default, this option is not set and the cluster inherits the account-level persistent security group, if any. Use this option if you want to give additional access permissions to cluster nodes. Qubole uses only the security group name for validation, so do not provide the security group’s ID. You must provide a persistent security group when you configure outbound communication from cluster nodes to pass through an Internet proxy server. |
presto_settings¶
Parameter | Description |
---|---|
enable_presto | Enables Presto on the cluster. |
custom_config | Specify the custom Presto configuration overrides. The default value is blank. |
spark_settings¶
Parameter | Description |
---|---|
custom_config | Specify the custom Spark configuration overrides. The default value is blank. |
datadog_settings¶
Note
This feature is enabled on Hadoop 2 (Hive), Presto, and Spark clusters. Once you set the Datadog settings, Ganglia monitoring gets automatically enabled. Although the Ganglia monitoring is enabled, its link may not be visible in the cluster’s UI resources list.
Parameter | Description |
---|---|
datadog_api_token | Specify the Datadog API token to use the Datadog monitoring service. The default value is NULL. |
datadog_app_token | Specify the Datadog APP token to use the Datadog monitoring service. The default value is NULL. |
Response¶
The response contains a JSON object representing the created cluster. All the attributes mentioned here are returned (except when otherwise specified or redundant).
Example¶
Goal
Create a cluster called ‘my_cluster’ in the ‘us-west-2’ AWS region.
curl -X POST -H "X-AUTH-TOKEN:$X_AUTH_TOKEN" -H "Content-Type:application/json" -H "Accept: application/json" \
-d '{
"label":[
"my_cluster"
],
"node_configuration":{
"slave_request_type":"spot",
"initial_nodes":1,
"spot_instance_settings":{
"timeout_for_request":10,
"maximum_spot_instance_percentage":60,
"maximum_bid_price_percentage":100.0
},
"max_nodes":10,
"master_instance_type":"m1.large",
"slave_instance_type":"m1.xlarge"
},
"hadoop_settings":{
"use_hadoop2":true,
"fairscheduler_settings":{
"default_pool":null
},
"custom_config":null
},
"enable_ganglia_monitoring":false,
"node_bootstrap_file":"node_bootstrap.sh",
"security_settings":{
"encrypted_ephemerals":false
},
"ec2_settings":{
"compute_validated":true,
"aws_region":"us-west-2",
"aws_preferred_availability_zone":"Any",
"compute_secret_key":"<your_ec2_compute_secret_key>",
"vpc_id":null,
"compute_access_key":"<your_ec2_compute_access_key>",
"subnet_id":null
},
"presto_settings":{
"enable_presto":false,
"custom_config":null
},
"disallow_cluster_termination":false
}' \
https://api.qubole.com/api/v1.3/clusters
Note
The above syntax uses https://api.qubole.com as the endpoint. Qubole provides other endpoints to access QDS that are described in Supported Qubole Endpoints on Different Cloud Providers.
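The same request can be prepared with Python's standard library. The sketch below only builds the request object and does not send it. `build_create_cluster_request` is a hypothetical helper name, the payload is a shortened version of the example body, and the endpoint argument defaults to the one used in the curl call.

```python
import json
import os
import urllib.request


def build_create_cluster_request(payload, token, endpoint="https://api.qubole.com"):
    """Prepare (but do not send) a POST request for /api/v1.3/clusters/."""
    return urllib.request.Request(
        url=endpoint + "/api/v1.3/clusters/",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "X-AUTH-TOKEN": token,
            "Content-Type": "application/json",
            "Accept": "application/json",
        },
        method="POST",
    )


payload = {
    "label": ["my_cluster"],
    "ec2_settings": {"aws_region": "us-west-2"},
    "hadoop_settings": {"use_hadoop2": True},
}
# The token is read from the same environment variable as in the curl example.
request = build_create_cluster_request(payload, os.environ.get("X_AUTH_TOKEN", ""))
print(request.get_method(), request.full_url)

# To send it: urllib.request.urlopen(request) returns the JSON response body.
```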
Response
{
"security_settings":{
"encrypted_ephemerals":false
},
"enable_ganglia_monitoring":false,
"label":[
"my_cluster"
],
"ec2_settings":{
"compute_validated":false,
"compute_secret_key":"<your_ec2_compute_secret_key>",
"aws_region":"us-west-2",
"vpc_id":null,
"aws_preferred_availability_zone":"Any",
"compute_access_key":"<your_ec2_compute_access_key>",
"subnet_id":null
},
"node_bootstrap_file":"node_bootstrap.sh",
"hadoop_settings":{
"use_hadoop2":true,
"custom_config":null,
"fairscheduler_settings":{
"default_pool":null
}
},
"disallow_cluster_termination":false,
"presto_settings":{
"enable_presto":false,
"custom_config":null
},
"id":116,
"state":"DOWN",
"node_configuration":{
"max_nodes":10,
"master_instance_type":"m1.large",
"slave_instance_type":"m1.xlarge",
"use_stable_spot_nodes":false,
"slave_request_type":"spot",
"initial_nodes":1,
"spot_instance_settings":{
"maximum_bid_price_percentage":"100.0",
"timeout_for_request":10,
"maximum_spot_instance_percentage":60
}
}
}
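A client typically needs only a few fields from this response, such as the cluster ID for later API calls. The Python sketch below extracts them with the standard json module; `summarize_cluster` is a hypothetical helper. Note that in the sample response maximum_bid_price_percentage comes back as a string ("100.0"), so the sketch converts it to a float.

```python
import json


def summarize_cluster(response_text):
    """Pick the ID, state, labels, and Spot bid percentage from a response."""
    cluster = json.loads(response_text)
    node_cfg = cluster.get("node_configuration") or {}
    spot = node_cfg.get("spot_instance_settings") or {}
    bid = spot.get("maximum_bid_price_percentage")
    return {
        "id": cluster["id"],
        "state": cluster["state"],
        "labels": cluster["label"],
        # The sample response returns this value as a string.
        "bid_percentage": float(bid) if bid is not None else None,
    }


# A shortened copy of the sample response, used here as test input.
sample = """{
  "label": ["my_cluster"],
  "id": 116,
  "state": "DOWN",
  "node_configuration": {
    "spot_instance_settings": {"maximum_bid_price_percentage": "100.0"}
  }
}"""
print(summarize_cluster(sample))
```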
Create a New Airflow Cluster¶
In the Parameters table, only label, node_configuration, ec2_tags, node_bootstrap, and enable_ganglia_monitoring apply to an Airflow cluster. An Airflow cluster is a single-node machine: it has no worker nodes and always uses On-Demand nodes.
An Airflow cluster has an additional configuration option called engine_config.
See Setting up a Data Store (AWS) for additional information on setting up a data store and authorizing it to connect to an Airflow cluster in EC2 Classic/VPC.
node_configuration for an Airflow Cluster¶
Note
Parameters marked in bold below are mandatory. Others are optional and have default values.
Parameter | Description |
---|---|
master_instance_type | The instance type of the coordinator node. An Airflow cluster is a single-node machine: it has no worker instances, only a single coordinator instance. |
slave_instance_type | Not supported |
initial_nodes | Not applicable; Airflow clusters do not support autoscaling. |
max_nodes | Not applicable; Airflow clusters do not support autoscaling. |
slave_request_type | Not applicable; Airflow clusters only support On-Demand instances. |
fallback_to_ondemand | Not applicable |
ebs_volume_type | Not supported |
ebs_volume_size | Not supported |
ebs_volume_count | Not supported |
spot_instance_settings | Not supported |
stable_spot_instance_settings | Not supported |
custom_ec2_tags | An optional set of tags to apply to the AWS instances created for the cluster. Specify the tags as a JSON object of tag-value pairs, for example, {"project": "webportal", "owner": "john@example.com"}. Tags and values must have alphanumeric characters and can contain only these special characters: + (plus sign), . (period), - (hyphen), @ (at sign), = (equal sign), / (forward slash), : (colon), and _ (underscore). The tags Qubole and alias are reserved for use by Qubole (see Qubole Cluster EC2 Tags (AWS)). Tags beginning with aws- are reserved for use by Amazon. |
use_hadoop2 | Not applicable |
use_spark | Not applicable |
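The custom_ec2_tags rules described in the table can be checked with a short Python function before a request is sent. This is a hypothetical client-side sketch: it treats the reserved names Qubole and alias as exact, case-sensitive matches, which is an assumption, and it reports the names of tags that break a rule.

```python
import re

# Alphanumeric characters plus the special characters listed in the table.
_ALLOWED = re.compile(r"[A-Za-z0-9+.\-@=/:_]+")
_RESERVED = {"Qubole", "alias"}


def invalid_tags(tags):
    """Return the tag names that are reserved or contain disallowed characters."""
    bad = []
    for name, value in tags.items():
        if name in _RESERVED or name.startswith("aws-"):
            bad.append(name)
        elif not _ALLOWED.fullmatch(name) or not _ALLOWED.fullmatch(str(value)):
            bad.append(name)
    return bad


print(invalid_tags({"project": "webportal", "owner": "john@example.com"}))  # []
print(invalid_tags({"alias": "x", "team": "data platform"}))  # ['alias', 'team']
```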
engine_config for an Airflow Cluster¶
The following table contains engine_config
for an Airflow cluster.
Note
Parameters marked in bold below are mandatory. Others are optional and have default values.
Parameter | Description |
---|---|
dbtap_id | ID of the data store inside QDS. See Setting up a Data Store (AWS) for more information. Set it to -1 if you
are using the local MySQL instance as the data store. |
fernet_key | Encryption key for sensitive information inside the Airflow database, for example, user passwords and connections. It must be 32 URL-safe base64-encoded bytes. |
type | Engine type. It is airflow for an Airflow cluster. |
version | The default version is 1.10.0 (stable version). The other supported stable versions are 1.8.2 and 1.10.2. All the Airflow versions are compatible with MySQL 5.6 or higher. |
airflow_python_version | Supported versions are 3.5 (supported using package management) and 2.7. To know more, see Configuring an Airflow Cluster. |
overrides | Airflow configuration to override the default settings. Use the following syntax for overrides:
|
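A value in the format required for fernet_key (32 random bytes, URL-safe base64 encoded) can be generated with the Python standard library, as sketched below. The function name is hypothetical. The `cryptography` package's `Fernet.generate_key()` produces a key in the same format if that package is available.

```python
import base64
import os


def generate_fernet_key():
    """Return 32 random bytes encoded as URL-safe base64 text."""
    return base64.urlsafe_b64encode(os.urandom(32)).decode("ascii")


key = generate_fernet_key()
# 32 bytes always encode to 44 base64 characters, including padding.
print(len(key))  # 44
```

Store the generated key securely; it is needed to decrypt the sensitive values in the Airflow database.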
Example for Creating an Airflow Cluster¶
curl -X POST -H "X-AUTH-TOKEN:$X_AUTH_TOKEN" -H "Content-Type:application/json" -H "Accept: application/json" \
-d '{
"label": ["airflow"],
"enable_ganglia_monitoring": true,
"node_bootstrap_file": "node_bootstrap.sh",
"ec2_settings": {
"vpc_id": "",
"subnet_id": "",
"bastion_node_public_dns": "",
"aws_region": "us-east-1",
"aws_preferred_availability_zone": "Any",
"use_account_compute_creds": true
},
"security_settings": {
"encrypted_ephemerals": false,
"ssh_public_key": "",
"persistent_security_group": ""
},
"node_configuration": {
"master_instance_type": "m1.large",
"custom_ec2_tags": {}
},
"datadog_settings": {
"datadog_api_token": "",
"datadog_app_token": ""
},
"engine_config": {
"dbtap_id": "9670",
"fernet_key": "<your-fernet-key>",
"type": "airflow",
"overrides": "core.dag_concurrency=32"
}
}' \
https://api.qubole.com/api/v1.3/clusters
Note
The above syntax uses https://api.qubole.com as the endpoint. Qubole provides other endpoints to access QDS that are described in Supported Qubole Endpoints on Different Cloud Providers.
Sample Response¶
{
"state": "DOWN",
"id": 50335,
"spark_version": null,
"label": [
"airflow"
],
"disallow_cluster_termination": true,
"force_tunnel": false,
"enable_ganglia_monitoring": true,
"node_bootstrap_file": "node_bootstrap.sh",
"ec2_settings": {
"aws_preferred_availability_zone": "Any",
"aws_region": "us-east-1",
"compute_validated": true,
"vpc_id": null,
"subnet_id": null,
"bastion_node_public_dns": null,
"compute_secret_key": "",
"compute_access_key": "AKIAIDR6RL*********",
"use_account_compute_creds": true
},
"hadoop_settings": {
"use_spark": false,
"custom_config": null,
"use_hadoop2": false,
"use_qubole_placement_policy": false,
"fairscheduler_settings": {
"default_pool": null
}
},
"node_configuration": {
"master_instance_type": "m1.large",
"slave_instance_type": "",
"initial_nodes": 1,
"max_nodes": 1,
"slave_request_type": "ondemand",
"cluster_name": "qa_qbol_acc2930_cl50335"
},
"security_settings": {
"encrypted_ephemerals": false
},
"presto_settings": {
"enable_presto": false,
"custom_config": null
},
"spark_settings": {
"custom_config": null
},
"errors": [],
"datadog_settings": {
"datadog_api_token": "",
"datadog_app_token": ""
},
"spark_s3_package_name": null,
"zeppelin_s3_package_name": null,
"engine_config": {
"type": "airflow",
"dbtap_id": 9670,
"fernet_key": "rsHS7m/xJa7pkNtLpvwcGIFbkNsnSWHHnro+6ZWSsjo=",
"overrides": "core.dag_concurrency=32"
}
}
engine_config for Enabling HiveServer2 on a Hadoop 2 (Hive) Cluster¶
You can enable HiveServer2 on a Hadoop 2 (Hive) cluster. The following table contains engine_config for enabling
HiveServer2 on a cluster. Other settings of HiveServer2 are configured under the hive_settings parameter. For more
information on HiveServer2 in QDS, see Configuring a HiveServer2 Cluster.
This is an additional setting in the Hadoop 2 request API for enabling HiveServer2. Other settings that are explained in Parameters must be added. For details on configuring multi-instance as an option to run HS2 through REST API, see Choosing Multi-instance as an option for running HiveServer2 on Hadoop (Hive) Clusters.
Note
Parameters marked in bold below are mandatory. Others are optional and have default values.
Parameter | Sub-parameter | Description |
---|---|---|
hive_settings | is_hs2 | Set it to true to enable HiveServer2. |
hive_settings | hive_version | The Hive version that supports HiveServer2. The values are 1.2.0, 2.1.1, and 2.3. Qubole’s Hive 2.1.1 is a stable version; LLAP from the Hive open source is not verified in Qubole’s Hive 2.1.1. For more information, see Understanding Hive Versions. |
hive_settings | hive.qubole.metadata.cache | Enables Hive metadata caching, which reduces split computation time for ORC files. This feature is not available by default. Create a ticket with Qubole Support to use this feature on the QDS account. Set it to true in old Hadoop 2 (Hive) clusters; it is enabled by default in new clusters. For more information, see Understanding Hive Metadata Caching. |
hive_settings | hs2_thrift_port | Sets the HiveServer2 port. The default port is 10003. This parameter is not available on the Hadoop 2 (Hive) cluster UI; Qubole plans to add the UI option in a future release. |
hive_settings | overrides | Hive configuration to override the default settings. |
hive_settings | pig_version | The default version of Pig is 0.11. Pig 0.15 and Pig 0.17 (beta) are the other supported versions. Pig 0.17 (beta) is only supported with Hive 2.1.1. |
hive_settings | pig_execution_engine | Only with Pig 0.17 (beta), you can use this parameter to set tez as the engine. MapReduce (mr) is the default engine used for Pig versions. |
type | | Engine type. It is hadoop2 for a HiveServer2 cluster. |
Sample API Request¶
Here is a sample request body with the HiveServer2 settings.
{
  "engine_config": {
    "type": "hadoop2",
    "hive_settings": {
      "is_hs2": true,
      "hive_version": "2.1.1",
      "hive.qubole.metadata.cache": "true",
      "hs2_thrift_port": "10007",
      "overrides": "hive.execution.engine=tez"
    }
  }
}
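As a minimal sketch, the request body above could also be assembled programmatically before it is merged with the other cluster settings described in Parameters. The helper name `build_hs2_engine_config` and its argument defaults are hypothetical and not part of any Qubole SDK; only the field names, the default port, and the supported Hive versions are taken from the table above.

```python
import json

# Hive versions listed in the table above as supporting HiveServer2.
SUPPORTED_HIVE_VERSIONS = {"1.2.0", "2.1.1", "2.3"}


def build_hs2_engine_config(hive_version, thrift_port=10003,
                            metadata_cache=False, overrides=None):
    """Return an engine_config fragment that enables HiveServer2.

    Hypothetical helper: the field names mirror the table above, but the
    function itself is only an illustration.
    """
    if hive_version not in SUPPORTED_HIVE_VERSIONS:
        raise ValueError(f"unsupported hive_version: {hive_version!r}")

    hive_settings = {
        "is_hs2": True,
        "hive_version": hive_version,
        # The sample request passes the port as a string, so do the same.
        "hs2_thrift_port": str(thrift_port),
    }
    if metadata_cache:
        # The sample request passes this flag as the string "true".
        hive_settings["hive.qubole.metadata.cache"] = "true"
    if overrides:
        hive_settings["overrides"] = overrides

    return {"engine_config": {"type": "hadoop2", "hive_settings": hive_settings}}


if __name__ == "__main__":
    fragment = build_hs2_engine_config(
        "2.1.1", thrift_port=10007, metadata_cache=True,
        overrides="hive.execution.engine=tez",
    )
    print(json.dumps(fragment, indent=2))
```

Rejecting an unlisted Hive version on the client side is a design choice of this sketch; the API may perform its own validation.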
Choosing Multi-instance as an option for running HiveServer2 on Hadoop (Hive) Clusters¶
Configuring Multi-instance HiveServer2 describes choosing multi-instance to run HS2 on a Hadoop 2 (Hive) cluster and how to configure it.
This enhancement is in beta and is not enabled by default. Create a ticket with Qubole Support to enable it on the QDS account.
Parameters for Choosing Multi-instance as an Option to run HS2 on Hadoop 2 (Hive) Clusters¶
Note
Parameters marked in bold below are mandatory. Others are optional and have default values.
Parameter | Description |
---|---|
node_bootstrap_file | You can specify a different node bootstrap file location if you want to change its default location inherited from the associated Hadoop 2 (Hive) cluster. |
ec2_settings | The only supported setting is the Elastic IP for the coordinator node of the multi-instance HS2. Set it if you do not want to use the Elastic IP of the associated Hadoop 2 (Hive) cluster. For more information, see ec2_settings for Choosing Multi-instance as an Option to run HS2 on Hadoop 2 (Hive) Clusters. |
node_configuration | It has four configuration properties and one of them is mandatory. For more information, see node_configuration for Choosing Multi-instance as an Option to run HS2 on Hadoop 2 (Hive) Clusters. |
engine_config | It denotes the type of the engine. For more information, see engine_config for Choosing Multi-instance as an Option to run HS2 on Hadoop 2 (Hive) Clusters. |
node_configuration for Choosing Multi-instance as an Option to run HS2 on Hadoop 2 (Hive) Clusters¶
The coordinator node type of multi-instance HS2 is always m3.xlarge. Create a ticket with Qubole Support to configure the coordinator node of the multi-instance HS2.
Parameter | Description |
---|---|
slave_instance_type | The instance type to use for cluster worker nodes. The default value is m3.xlarge when you choose multi-instance as an option to run HS2 on a Hadoop 2 (Hive) cluster. You can set any other instance type, regardless of the instance family. |
initial_nodes | The default number of nodes is 2. You can increase the number of nodes even while the cluster is running, but you cannot reduce the number of nodes or remove nodes while the cluster is running. |
custom_ec2_tags | Add custom EC2 tags and values and ensure that you do not add reserved keywords as EC2 tags as described in ec2_settings. |
parent_cluster_id | You must specify the cluster ID of the Hadoop 2 (Hive) cluster on which you want to enable multi-instance as an option to run HS2. |
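The node-count rule in the table above can be expressed as a small client-side check. This is only an illustrative sketch: `validate_node_change` is a hypothetical helper, and it assumes that a decrease is acceptable once the cluster is stopped and that at least one node is required. Neither assumption is confirmed by this page.

```python
DEFAULT_INITIAL_NODES = 2  # default stated in the table above


def validate_node_change(current_nodes, requested_nodes, cluster_running):
    """Return requested_nodes if the change looks allowed, else raise ValueError.

    Hypothetical helper; the service itself remains the authority on what
    it accepts.
    """
    # Assumption: a multi-instance HS2 needs at least one worker node.
    if requested_nodes < 1:
        raise ValueError("initial_nodes must be at least 1")
    # Documented rule: nodes cannot be reduced while the cluster is running.
    if cluster_running and requested_nodes < current_nodes:
        raise ValueError(
            f"cannot reduce nodes from {current_nodes} to {requested_nodes} "
            "while the cluster is running"
        )
    return requested_nodes


if __name__ == "__main__":
    # Growing a running cluster is allowed.
    print(validate_node_change(DEFAULT_INITIAL_NODES, 4, cluster_running=True))
```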
ec2_settings for Choosing Multi-instance as an Option to run HS2 on Hadoop 2 (Hive) Clusters¶
Parameter | Description |
---|---|
master_elastic_ip | The Elastic IP (EIP) of the coordinator node of the multi-instance HS2. To run queries directly on multi-instance HS2 (through external BI tools), attach an EIP to it and configure the tools to connect to the EIP of the multi-instance coordinator. You must add the EIP to the multi-instance HS2’s coordinator node because HS2 queries run on the multi-instance HS2 instead of the associated Hadoop 2 (Hive) cluster. |
engine_config for Choosing Multi-instance as an Option to run HS2 on Hadoop 2 (Hive) Clusters¶
Parameter | Description |
---|---|
type | The value of type must be hs2 for configuring multi-instance as an option to run HS2. |
Sample Request API¶
Here is a sample API call for choosing multi-instance as an option to run HS2 on a Hadoop 2 (Hive) cluster, which has 24419 as its cluster ID.
curl -X POST -H "X-AUTH-TOKEN:$X_AUTH_TOKEN" -H "Content-Type:application/json" -H "Accept: application/json" \
-d '{
  "node_bootstrap_file": "node_bootstrap.sh",
  "ec2_settings": {
    "master_elastic_ip": null
  },
  "node_configuration": {
    "slave_instance_type": "m3.2xlarge",
    "initial_nodes": 2,
    "custom_ec2_tags": "HS2",
    "parent_cluster_id": 24419
  },
  "engine_config": {
    "type": "hs2"
  }
}' "https://api.qubole.com/api/v1.3/clusters"
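For readers who script cluster creation, a rough Python equivalent of the curl call might look like the sketch below. It uses only the standard library. The function name `build_multi_instance_hs2_request` is hypothetical, the payload is reduced to the mandatory-looking fields plus the documented defaults, and `API_URL` is an assumed endpoint that you should adjust to your QDS environment. The sketch builds the request without sending it.

```python
import json
import os
import urllib.request

# Assumed endpoint; replace it with the API URL of your QDS environment.
API_URL = "https://api.qubole.com/api/v1.3/clusters"


def build_multi_instance_hs2_request(parent_cluster_id, auth_token,
                                     slave_instance_type="m3.xlarge",
                                     initial_nodes=2):
    """Build (but do not send) a POST request for a multi-instance HS2.

    Hypothetical helper; the defaults follow the node_configuration table.
    """
    payload = {
        "node_configuration": {
            "slave_instance_type": slave_instance_type,
            "initial_nodes": initial_nodes,
            "parent_cluster_id": parent_cluster_id,
        },
        # type must be hs2 for a multi-instance HS2.
        "engine_config": {"type": "hs2"},
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "X-AUTH-TOKEN": auth_token,
            "Content-Type": "application/json",
            "Accept": "application/json",
        },
        method="POST",
    )


if __name__ == "__main__":
    request = build_multi_instance_hs2_request(
        24419, os.environ.get("X_AUTH_TOKEN", ""),
    )
    # Inspect the request instead of sending it. Passing it to
    # urllib.request.urlopen(request) would issue the actual API call.
    print(request.get_method(), request.full_url)
    print(request.data.decode("utf-8"))
```

Keeping request construction separate from sending makes it possible to review or log the payload before a cluster is actually created.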