Create a Cluster on Microsoft Azure
- POST /api/v2/clusters/
Use this API to create a new cluster when you are running Qubole on the Azure cloud. You typically create a new cluster for a workload that must run in parallel with your pre-existing workloads, or to run workloads in a different geographical location; there can be other reasons for creating a new cluster as well.
Required Role
The following users can make this API call:
Users who belong to the system-user or system-admin group.
Users who belong to a group associated with a role that allows creating a cluster. See Managing Groups and Managing Roles for more information.
Parameters
Note
Parameters marked in bold below are mandatory. Others are optional and have default values.
Parameter | Description
---|---
**label** | A list of labels that identify the cluster. At least one label must be provided when creating a cluster.
cloud_config | It contains the cloud-specific configuration of the cluster. See cloud_config below.
cluster_info | It contains the configuration of the cluster. See cluster_info below.
engine_config | It contains the configuration of the cluster engine type. See engine_config below.
monitoring | It contains the cluster monitoring configuration. See monitoring below.
security_settings | It contains the security settings for the cluster. See security_settings below.
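At a high level, these objects nest in the request body as shown below (a minimal sketch; the label value is an illustrative placeholder and the elided objects are described in the sections that follow):

{
    "cloud_config": { ... },
    "cluster_info": {
        "label": ["<cluster label>"]
    },
    "engine_config": { ... },
    "monitoring": { ... },
    "security_settings": { ... }
}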
cloud_config
Parameter | Description
---|---
provider | It defines the cloud provider. Set it to azure.
compute_config | It defines the Azure account compute credentials for the cluster. See compute_config below.
location | It is used to set the geographical Azure location, for example, centralus.
network_config | It defines the network configuration for the cluster. See network_config below.
storage_config | It defines the Azure account storage credentials for the cluster. See storage_config below.
compute_config
Parameter | Description
---|---
compute_validated | It denotes whether the compute credentials are validated. The default value is false.
use_account_compute_creds | It determines whether to use the account-level compute credentials. By default, it is set to false.
compute_client_id | The client ID of the Azure Active Directory application that has permissions over the subscription. It is required when use_account_compute_creds is set to false.
compute_client_secret | The client secret of the Azure Active Directory application. It is required when use_account_compute_creds is set to false.
compute_tenant_id | The tenant ID of the Azure Active Directory. It is required when use_account_compute_creds is set to false.
compute_subscription_id | The subscription ID of the Azure account where you want to create the compute resources. It is required when use_account_compute_creds is set to false.
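For example, a compute_config that supplies cluster-specific credentials instead of the account-level ones (all values are placeholders) might look like this:

"compute_config": {
    "use_account_compute_creds": false,
    "compute_client_id": "<your client ID>",
    "compute_client_secret": "<your client secret key>",
    "compute_tenant_id": "<your tenant ID>",
    "compute_subscription_id": "<your subscription ID>"
}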
network_config
Parameter | Description
---|---
vnet_name | Set the virtual network.
subnet_name | Set the subnet.
vnet_resource_group_name | Set the resource group of your virtual network.
bastion_node_public_dns | It is the public IP address of the bastion node used to access private subnets, if required.
persistent_security_group_name | It is the network security group name on the Azure account.
persistent_security_group_resource_group_name | It is the resource group of the network security group of the Azure account.
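For example, a network_config that places the cluster in an existing virtual network (all names are illustrative placeholders) might look like this:

"network_config": {
    "vnet_name": "<vnet name>",
    "subnet_name": "<subnet name>",
    "vnet_resource_group_name": "<vnet resource group name>",
    "persistent_security_group_name": "<network security group name>",
    "persistent_security_group_resource_group_name": "<security group resource group name>"
}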
storage_config
Parameter | Description
---|---
disk_storage_account_name | Set your Azure disk storage account. Configure either this parameter or managed_disk_account_type, but not both.
disk_storage_account_resource_group_name | Set your Azure disk storage account resource group name.
managed_disk_account_type | You can set it if you do not want to configure the disk storage account details. Its accepted values are standard_lrs and premium_lrs.
data_disk_count | It is the number of reserved disks to be attached to each cluster node. For example, a data disk count of 2 on a four-node cluster provisions eight disks in all.
data_disk_size | It is used to set the data disk size in gigabytes (GB). The default size is 256 GB.
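For example, a storage_config that uses managed disks instead of a disk storage account (the account type and disk values are illustrative) might look like this:

"storage_config": {
    "managed_disk_account_type": "premium_lrs",
    "data_disk_count": 2,
    "data_disk_size": 256
}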
cluster_info
Parameter | Description
---|---
label | A cluster can have one or more labels, separated by commas. You can make a cluster the default cluster by including the label "default".
master_instance_type | The instance type of the coordinator node. The default is Standard_A5.
slave_instance_type | The instance type of the worker nodes. The default is Standard_A5.
min_nodes | The minimum number of worker nodes. The default is 1.
max_nodes | The maximum number of worker nodes. The default is 1.
node_bootstrap | You can append the name of a node bootstrap script to the default path.
disallow_cluster_termination | Set it to true to prevent Qubole from automatically terminating an idle cluster. The default value is false.
custom_tags | It is an optional parameter. Its value contains a <tag> and a <value>.
rootdisk | Use this parameter to configure the root volume of cluster instances. You must configure its size within this parameter; the size must be within the supported range.
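For example, a cluster_info block for a small autoscaling cluster (all values are illustrative) might look like this:

"cluster_info": {
    "label": ["azure-example"],
    "master_instance_type": "Standard_A5",
    "slave_instance_type": "Standard_A5",
    "min_nodes": 1,
    "max_nodes": 4,
    "node_bootstrap": "node_bootstrap.sh",
    "disallow_cluster_termination": false
}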
engine_config
Parameter | Description
---|---
flavour | It denotes the type of cluster. The supported values are: airflow, hadoop2, presto, and spark.
airflow_settings | It provides a list of Airflow-specific configurable sub-options. See airflow_settings below.
hadoop_settings | It provides a list of Hadoop-specific configurable sub-options. See hadoop_settings below.
presto_settings | It provides a list of Presto-specific configurable sub-options. See presto_settings below.
spark_settings | It provides a list of Spark-specific configurable sub-options. See spark_settings below.
hive_settings | It provides a list of HiveServer2-specific configurable sub-options. See engine_config to enable HiveServer2 on a Hadoop 2 (Hive) Cluster below.
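For example, an engine_config for a Spark cluster (the override shown is an illustrative Spark property) might look like this:

"engine_config": {
    "flavour": "spark",
    "spark_settings": {
        "custom_spark_config": "spark.executor.cores=2"
    }
}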
hadoop_settings
Parameter | Description
---|---
custom_hadoop_config | The custom Hadoop configuration overrides. The default value is blank.
fairscheduler_settings | The fair scheduler configuration options. See fairscheduler_settings below.
fairscheduler_settings
Parameter | Description
---|---
fairscheduler_config_xml | The XML string, with custom configuration parameters, for the fair scheduler. The default value is blank.
default_pool | The default pool for the fair scheduler. The default value is blank.
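For example, hadoop_settings that define a fair scheduler pool and make it the default (the pool name and weight are illustrative placeholders) might look like this:

"hadoop_settings": {
    "fairscheduler_settings": {
        "fairscheduler_config_xml": "<allocations><pool name=\"batch\"><weight>2.0</weight></pool></allocations>",
        "default_pool": "batch"
    }
}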
presto_settings
Parameter | Description
---|---
presto_version | Specify the Presto version to be used on the cluster; if it is not set, the default supported version is used.
custom_presto_config | Specifies the custom Presto configuration overrides. The default value is blank.
spark_settings
Parameter | Description
---|---
zeppelin_interpreter_mode | The Zeppelin interpreter mode on the cluster. The default mode is legacy.
custom_spark_config | Specify the custom Spark configuration overrides. The default value is blank.
spark_version | It is the Spark version used on the cluster; if it is not set, the default supported version is used.
monitoring
Parameter | Description
---|---
enable_ganglia_monitoring | Enable Ganglia monitoring for the cluster. The default value is false.
security_settings
Parameter | Description
---|---
ssh_public_key | SSH key to use to log in to the instances. The default value is none. The SSH key must be in the OpenSSH format, not the PEM/PKCS format. (Note: This parameter is not visible to non-admin users.)
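For example, a security_settings block with an OpenSSH-format public key (the key shown is a truncated placeholder) might look like this:

"security_settings": {
    "ssh_public_key": "ssh-rsa AAAAB3... user@example"
}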
airflow_settings
The following table contains the engine_config parameters for an Airflow cluster.
Note
Parameters marked in bold below are mandatory. Others are optional and have default values.
Parameter | Description
---|---
**dbtap_id** | ID of the data store inside QDS. Set it to -1 to use the local data store that is created with the cluster.
**fernet_key** | Encryption key for sensitive information inside the Airflow database, for example, user passwords and connections. It must be 32 url-safe base64-encoded bytes.
type | Engine type. It is airflow for an Airflow cluster.
version | The default version is 1.10.0 (stable version). The other supported stable versions are 1.8.2 and 1.10.2. All the Airflow versions are compatible with MySQL 5.6 or higher.
airflow_python_version | Supported versions are 3.5 (supported using package management) and 2.7. To know more, see Configuring an Airflow Cluster.
overrides | Airflow configuration to override the default settings. Use the syntax <section>.<property>=<value>, with one override per line.
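For example, an engine_config for an Airflow cluster (the fernet key placeholder and the override value are illustrative) might look like this:

"engine_config": {
    "flavour": "airflow",
    "airflow_settings": {
        "dbtap_id": -1,
        "fernet_key": "<your fernet key>",
        "version": "1.10.0",
        "overrides": "core.load_examples=False"
    }
}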
engine_config to enable HiveServer2 on a Hadoop 2 (Hive) Cluster
You can enable HiveServer2 on a Hadoop 2 (Hive) cluster. The following table contains the engine_config parameters for enabling HiveServer2 on a cluster. Other HiveServer2 settings are configured under the hive_settings parameter. For more information on HiveServer2 in QDS, see Configuring a HiveServer2 Cluster.
This is an additional setting in the Hadoop 2 request API for enabling HiveServer2. The other settings explained in Parameters above must also be added.
Note
Parameters marked in bold below are mandatory. Others are optional and have default values.
Parameter | Sub-parameter | Description
---|---|---
hive_settings | is_hs2 | Set it to true to enable HiveServer2 on the cluster. The default value is false.
| hive_version | It is the Hive version that supports HiveServer2, for example, 2.3.
| hive.qubole.metadata.cache | This parameter enables Hive metadata caching, which reduces split computation time for ORC files. This feature is not available by default. Create a ticket with Qubole Support to use this feature on the QDS account. Set it to true to enable metadata caching.
| hs2_thrift_port | It is used to set the HiveServer2 port. The default port is 10003.
| overrides | Hive configuration to override the default settings.
flavour | | It denotes the cluster type. It is hadoop2 for a Hadoop 2 (Hive) cluster.
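For example, to enable HiveServer2 on a Hadoop 2 (Hive) cluster, the engine_config might include the following (the values mirror the sample request below):

"engine_config": {
    "flavour": "hadoop2",
    "hive_settings": {
        "is_hs2": true,
        "hive_version": "2.3",
        "hs2_thrift_port": 10003
    }
}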
Request API Syntax
If use_account_compute_creds is set to true, the cluster uses the account-level compute credentials, and you do not need to set the cluster-specific compute credentials.
curl -X POST -H "X-AUTH-TOKEN:$X_AUTH_TOKEN" -H "Content-Type:application/json" -H "Accept: application/json" \
-d '{
"cloud_config" : {
"provider" : "azure",
"compute_config" : {
"compute_validated": "<default is ``false``/set it to ``true``>",
"use_account_compute_creds": false,
"compute_client_id": "<your client ID>",
"compute_client_secret": "<your client secret key>",
"compute_tenant_id": "<your tenant ID>",
"compute_subscription_id": "<your subscription ID>"
},
"location": {
"location": "centralus"
},
"network_config" : {
"vnet_name" : "<vpc name>",
"subnet_name": "<subnet name>",
"vnet_resource_group_name": "<vnet resource group name>",
"bastion_node_public_dns": "<bastion node public dns>",
"persistent_security_groups": "<persistent security group>",
"master_elastic_ip": ""
},
"storage_config" : {
"disk_storage_account_name": "<Disk storage account name>",
"disk_storage_account_resource_group_name": "<Disk account resource group name>",
//You can either configure "disk_storage_account_name" or "managed_disk_account_type"
"managed_disk_account_type":"<standard_lrs/premium_lrs>",
"data_disk_count":"<Count>",
"data_disk_size":"<Disk Size>"
}
},
"cluster_info": {
"master_instance_type": "Standard_A6",
"slave_instance_type": "Standard_A6",
"label": ["azure1"],
"min_nodes": 1,
"max_nodes": 2,
"cluster_name": "Azure1",
"node_bootstrap": "node_bootstrap.sh",
},
"engine_config": {
"flavour": "hadoop2",
"hadoop_settings": {
"custom_hadoop_config": <default is null>,
"fairscheduler_settings": {
"default_pool": <default is null>
}
}
},
"monitoring": {
"ganglia": <default is false/set it to true>,
}
}' \ "https://azure.qubole.com/api/v2/clusters"
Sample API Request
curl -X POST -H "X-AUTH-TOKEN:$X_AUTH_TOKEN" -H "Content-Type:application/json" -H "Accept: application/json" \
-d '{
"cloud_config" : {
"provider" : "azure",
"compute_config" : {
"compute_validated": False,
"use_account_compute_creds": False,
"compute_client_id": "<your client ID>",
"compute_client_secret": "<your client secret key>",
"compute_tenant_id": "<your tenant ID>",
"compute_subscription_id": "<your subscription ID>"
},
"location": {
"location": "centralus"
},
"network_config" : {
"vnet_name" : "<vpc name>",
"subnet_name": "<subnet name>",
"vnet_resource_group_name": "<vnet resource group name>",
"persistent_security_groups": "<persistent security group>"
},
"storage_config" : {
"storage_access_key": "<your storage access key>",
"storage_account_name": "<your storage account name>",
"disk_storage_account_name": "<your disk storage account name>",
"disk_storage_account_resource_group_name": "<your disk storage account resource group name>"
"data_disk_count":4,
"data_disk_size":300 GB
}
},
"cluster_info": {
"master_instance_type": "Standard_A6",
"slave_instance_type": "Standard_A6",
"label": ["azure1"],
"min_nodes": 1,
"max_nodes": 2,
"cluster_name": "Azure1",
"node_bootstrap": "node_bootstrap.sh"
},
"engine_config": {
"flavour": "hadoop2",
"hadoop_settings": {
"custom_hadoop_config": "mapred.tasktracker.map.tasks.maximum=3"
},
"hive_settings":{
"is_hs2":true,
"hive_version":"2.3",
"overrides":"hive.server2.a=dummy",
"is_metadata_cache_enabled":false,
"execution_engine":"tez",
"hs2_thrift_port":10003
}
},
"monitoring": {
"ganglia": true,
}
}' "https://azure.qubole.com/api/v2/clusters"