Create a Cluster on Microsoft Azure

POST /api/v2/clusters/

Use this API to create a new cluster when you are using Qubole on the Azure cloud. You create a cluster for a workload that has to run in parallel with your pre-existing workloads.

You might want to run workloads across different geographical locations or there could be other reasons for creating a new cluster.

Required Role

The following users can make this API call:

  • Users who belong to the system-user or system-admin group.
  • Users who belong to a group associated with a role that allows creating a cluster. See Managing Groups and Managing Roles for more information.

Parameters

Note

Parameters marked in bold below are mandatory. Others are optional and have default values.

Parameter Description
cloud_config It contains the cloud-specific (Azure) configuration of the cluster, such as the provider, compute, network, and storage configurations.
cluster_info It contains the configurations of a cluster.
engine_config It contains the configuration of the cluster's engine (the cluster type and its engine-specific settings).
monitoring It contains the cluster monitoring configuration.
security_settings It contains the security settings for the cluster.

cloud_config

Parameter Description
provider It defines the cloud provider. Set azure when the cluster is created on QDS-on-Azure.
compute_config It defines the Azure account compute credentials for the cluster.
location It is used to set the geographical Azure location. eastus is the default location. The other locations are centralus, southcentralus, southeastasia, and westus.
network_config It defines the network configuration for the cluster.
storage_config It defines the Azure account storage credentials for the cluster.

compute_config

Parameter Description
compute_validated It denotes whether the credentials are validated.
use_account_compute_creds It determines whether to use the account-level compute credentials. By default, it is set to false. Setting it to true implies that the following four parameters are not required.
compute_client_id The client ID of the Azure active directory application which has the permissions over the subscription. It is required when use_account_compute_creds is set to false.
compute_client_secret The client secret of the Azure active directory application. It is required when use_account_compute_creds is set to false.
compute_tenant_id The tenant_id of the Azure Active Directory. It is required when use_account_compute_creds is set to false.
compute_subscription_id The subscription ID of the Azure account where you want to create the compute resources. It is required when use_account_compute_creds is set to false.
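
The two ways of supplying compute credentials described above can be sketched as JSON payload fragments. This is a minimal Python sketch; the credential values are placeholders, not real identifiers:

```python
import json

# Option 1: reuse the account-level compute credentials. With
# use_account_compute_creds set to true, the four compute_* fields
# can be omitted entirely.
compute_config_account = {
    "use_account_compute_creds": True,
}

# Option 2: supply cluster-specific credentials (placeholder values).
compute_config_custom = {
    "use_account_compute_creds": False,
    "compute_client_id": "my-client-id",
    "compute_client_secret": "my-client-secret",
    "compute_tenant_id": "my-tenant-id",
    "compute_subscription_id": "my-subscription-id",
}

print(json.dumps({"cloud_config": {"provider": "azure",
                                   "compute_config": compute_config_custom}},
                 indent=2))
```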

network_config

Parameter Description
vnet_name Set the virtual network.
subnet_name Set the subnet.
vnet_resource_group_name Set the resource group of your virtual network.
bastion_node It is the public IP address of the bastion node used to access private subnets, if required.
persistent_security_group_name It is the network security group name on the Azure account.
persistent_security_group_resource_group_name It is the resource group of the network security group of the Azure account.

storage_config

Parameter Description
disk_storage_account_name Set your Azure storage account. Configure either this parameter or managed_disk_account_type, but not both.
disk_storage_account_resource_group_name Set your Azure disk storage account resource group name.
managed_disk_account_type You can set it if you do not want to configure the disk storage account details. Its accepted values are standard_lrs and premium_lrs. Configure either this parameter or disk_storage_account_name, but not both.
data_disk_count It is the number of reserved disks to be attached to each cluster node; so, for example, choosing a Data Disk Count of 2 in a four-node cluster will provision eight disks in all.
data_disk_size It is used to set the Data Disk Size in gigabytes (GB). The default size is 256 GB.
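
As the data_disk_count row notes, reserved disks are attached to each node, so the total provisioned storage scales with cluster size. A quick sanity check of the four-node example above (hypothetical numbers):

```python
# Each cluster node gets data_disk_count reserved disks of
# data_disk_size GB each, so total disks = nodes * data_disk_count.
data_disk_count = 2      # disks attached to each node
data_disk_size_gb = 256  # default size per disk, in GB
nodes = 4                # example four-node cluster

total_disks = nodes * data_disk_count
total_storage_gb = total_disks * data_disk_size_gb
print(total_disks, total_storage_gb)  # 8 disks, 2048 GB in all
```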

cluster_info

Parameter Description
label A cluster can have one or more labels separated by commas. You can make a cluster the default cluster by including the label "default".
master_instance_type The coordinator node type. The default is Standard_A5.
slave_instance_type The worker node type. The default is Standard_A5.
min_nodes The minimum number of worker nodes. The default is 1.
max_nodes The maximum number of worker nodes. The default is 1.
node_bootstrap You can append the name of a node bootstrap script to the default path.
disallow_cluster_termination Set it to true if you do not want QDS to terminate idle clusters automatically. Qubole recommends setting this parameter to false.
custom_tags It is an optional parameter. Its value is a set of <tag> and <value> pairs.
rootdisk Use this parameter to configure the root volume of cluster instances. You must configure its size within this parameter. The supported range for the root volume size is 90-2047 GB. An example usage would be "rootdisk" => {"size" => 500}.
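
Put together, a cluster_info fragment using these parameters might look like the following sketch; the labels and tag values are illustrative:

```python
import json

cluster_info = {
    "label": ["azure1", "default"],         # "default" makes this the default cluster
    "master_instance_type": "Standard_A5",
    "slave_instance_type": "Standard_A5",
    "min_nodes": 1,
    "max_nodes": 3,
    "disallow_cluster_termination": False,
    "custom_tags": {"team": "analytics"},   # <tag>: <value> pairs (illustrative)
    "rootdisk": {"size": 500},              # root volume size in GB (90-2047)
}
print(json.dumps(cluster_info, indent=2))
```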

engine_config

Parameter Description
flavour It denotes the type of cluster. The supported values are: hadoop2, presto, airflow, and spark.
airflow_settings It provides a list of Airflow-specific configurable sub options.
hadoop_settings It provides a list of Hadoop-specific configurable sub-options.
presto_settings It provides a list of Presto-specific configurable sub-options.
spark_settings It provides a list of Spark-specific configurable sub-options.
hive_settings It provides a list of HiveServer2-specific configurable sub-options.

hadoop_settings

Parameter Description
custom_hadoop_config The custom Hadoop configuration overrides. The default value is blank.
fairscheduler_settings The fair scheduler configuration options.

fairscheduler_settings

Parameter Description
fairscheduler_config_xml The XML string, with custom configuration parameters, for the fair scheduler. The default value is blank.
default_pool The default pool for the fair scheduler. The default value is blank.
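
fairscheduler_config_xml is passed as a single XML string. A minimal sketch, assuming the standard Hadoop fair scheduler allocation-file format (the queue name and weight here are illustrative):

```python
import json

# Illustrative allocation file for the Hadoop fair scheduler,
# passed to QDS as one XML string.
fairscheduler_config_xml = (
    '<?xml version="1.0"?>'
    '<allocations>'
    '<queue name="batch"><weight>2.0</weight></queue>'
    '</allocations>'
)

fairscheduler_settings = {
    "fairscheduler_config_xml": fairscheduler_config_xml,
    "default_pool": "batch",
}
print(json.dumps(fairscheduler_settings, indent=2))
```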

presto_settings

Parameter Description
presto_version Specify the Presto version to be used on the cluster. The default version is 0.142. The stable version that is supported is 0.157.
custom_presto_config Specifies the custom Presto configuration overrides. The default value is blank.

spark_settings

Parameter Description
zeppelin_interpreter_mode The default mode is legacy. Set it to user if you want user-level cluster-resource management on notebooks. See Configuring a Spark Notebook for more information.
custom_spark_config Specify the custom Spark configuration overrides. The default value is blank.
spark_version It is the Spark version used on the cluster. The default version is 2.0-latest. The other supported version is 2.1-latest.
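
Based on the table above, an engine_config for a Spark cluster can be sketched as follows; the custom_spark_config value is an illustrative override:

```python
import json

engine_config = {
    "flavour": "spark",
    "spark_settings": {
        "zeppelin_interpreter_mode": "user",   # user-level resource management
        "custom_spark_config": "spark.executor.memory=4g",  # illustrative override
        "spark_version": "2.1-latest",
    },
}
print(json.dumps(engine_config, indent=2))
```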

monitoring

Parameter Description
enable_ganglia_monitoring Enable Ganglia monitoring for the cluster. The default value is false.

security_settings

Parameter Description
ssh_public_key SSH key to use to login to the instances. The default value is none. (Note: This parameter is not visible to non-admin users.) The SSH key must be in the OpenSSH format and not in the PEM/PKCS format.

airflow_settings

The following table contains engine_config for an Airflow cluster.

Note

Parameters marked in bold below are mandatory. Others are optional and have default values.

Parameter Description
dbtap_id ID of the data store inside QDS. Set it to -1 if you are using the local MySQL instance as the data store.
fernet_key Encryption key for sensitive information inside the Airflow database, for example, user passwords and connections. It must be 32 url-safe base64-encoded bytes.
type Engine type. It is airflow for an Airflow cluster.
version The default version is 1.10.0 (stable version). The other supported stable versions are 1.8.2 and 1.10.2. All the Airflow versions are compatible with MySQL 5.6 or higher.
airflow_python_version Supported versions are 3.5 (supported using package management) and 2.7. To know more, see Configuring an Airflow Cluster.
overrides

Airflow configuration to override the default settings. Use the following syntax for overrides:

<section>.<property>=<value>\n<section>.<property>=<value>...
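
The overrides value is a single string with one <section>.<property>=<value> entry per line. A small helper to build it (the section and property names used here are illustrative Airflow settings, not requirements of the API):

```python
def build_airflow_overrides(settings):
    """Join {(section, property): value} pairs into the overrides string."""
    return "\n".join(f"{section}.{prop}={value}"
                     for (section, prop), value in settings.items())

overrides = build_airflow_overrides({
    ("core", "parallelism"): 16,       # illustrative Airflow settings
    ("scheduler", "max_threads"): 4,
})
print(overrides)
```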

engine_config to enable HiveServer2 on a Hadoop 2 (Hive) Cluster

You can enable HiveServer2 on a Hadoop 2 (Hive) cluster. The following table contains engine_config for enabling HiveServer2 on a cluster. Other settings of HiveServer2 are configured under the hive_settings parameter. For more information on HiveServer2 in QDS, see Configuring a HiveServer2 Cluster.

This is an additional setting in the Hadoop 2 request API for enabling HiveServer2. Other settings that are explained in Parameters must be added.

Note

Parameters marked in bold below are mandatory. Others are optional and have default values.

Parameter Description
is_hs2 Set it to true to enable HiveServer2. It is configured under the hive_settings parameter.
hive_version It is the Hive version that supports HiveServer2. The values are 1.2.0, 2.1.1, and 2.3. Qubole’s Hive 2.1.1 is a stable version and LLAP from the Hive open source is not verified in Qubole’s Hive 2.1.1. For more information, see Understanding Hive Versions.
hive.qubole.metadata.cache This parameter enables Hive metadata caching that reduces split computation time for ORC files. This feature is not available by default. Create a ticket with Qubole Support for using this feature on the QDS account. Set it to true in old Hadoop 2 (Hive) clusters and it is enabled by default in new clusters. For more information, see Understanding Hive Metadata Caching.
hs2_thrift_port It is used to set HiveServer2 port. The default port is 10003. This parameter is not available on the Hadoop 2 (Hive) cluster UI and Qubole plans to add the UI option in a future release.
overrides Hive configuration to override the default settings.
flavour It denotes the cluster type. It is hadoop2 for a HiveServer2 cluster.
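
Combining the rows above, the engine_config fragment that enables HiveServer2 on a Hadoop 2 (Hive) cluster can be sketched as follows (the overrides value is illustrative):

```python
import json

engine_config = {
    "flavour": "hadoop2",           # HiveServer2 runs on a Hadoop 2 (Hive) cluster
    "hive_settings": {
        "is_hs2": True,             # enable HiveServer2
        "hive_version": "2.1.1",    # stable Hive version
        "hs2_thrift_port": 10003,   # default HiveServer2 port
        "overrides": "hive.execution.engine=tez",  # illustrative override
    },
}
print(json.dumps(engine_config, indent=2))
```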

Request API Syntax

If use_account_compute_creds is set to true, then it is not required to set the compute credentials.

curl -X POST -H "X-AUTH-TOKEN:$X_AUTH_TOKEN" -H "Content-Type:application/json" -H "Accept: application/json" \
-d '{
     "cloud_config" : {
       "provider" : "azure",
       "compute_config" : {
                     "compute_validated": "<default is ``false``/set it to ``true``>",
                     "use_account_compute_creds": false,
                     "compute_client_id": "<your client ID>",
                     "compute_client_secret": "<your client secret key>",
                     "compute_tenant_id": "<your tenant ID>",
                     "compute_subscription_id": "<your subscription ID>"
               },
               "location": {
                     "location": "centralus"
                  },
               "network_config" : {
                     "vnet_name" : "<vpc name>",
                         "subnet_name": "<subnet name>",
                         "vnet_resource_group_name": "<vnet resource group name>",
                         "bastion_node_public_dns": "<bastion node public dns>",
                        "persistent_security_groups": "<persistent security group>",
                        "master_elastic_ip": ""
               },
               "storage_config" : {
                     "disk_storage_account_name": "<Disk storage account name>",
                     "disk_storage_account_resource_group_name": "<Disk account resource group name>",
         //You can either configure "disk_storage_account_name" or "managed_disk_account_type"
         "managed_disk_account_type":"<standard_lrs/premium_lrs>",
         "data_disk_count":"<Count>",
         "data_disk_size":"<Disk Size>"
         }
         },
     "cluster_info": {
          "master_instance_type": "Standard_A6",
          "slave_instance_type": "Standard_A6",
          "label": ["azure1"],
          "min_nodes": 1,
          "max_nodes": 2,
          "cluster_name": "Azure1",
          "node_bootstrap": "node_bootstrap.sh",
          },
     "engine_config": {
          "flavour": "hadoop2",
          "hadoop_settings": {
             "custom_hadoop_config": <default is null>,
             "fairscheduler_settings": {
                "default_pool": <default is null>
             }
          }
     },
     "monitoring": {
            "ganglia": <default is false/set it to true>,
           }
     }' \ "https://azure.qubole.com/api/v2/clusters"

Sample API Request

curl -X POST -H "X-AUTH-TOKEN:$X_AUTH_TOKEN" -H "Content-Type:application/json" -H "Accept: application/json" \
-d '{
     "cloud_config" : {
       "provider" : "azure",
       "compute_config" : {
                     "compute_validated": False,
                     "use_account_compute_creds": False,
                     "compute_client_id": "<your client ID>",
                     "compute_client_secret": "<your client secret key>",
                     "compute_tenant_id": "<your tenant ID>",
                     "compute_subscription_id": "<your subscription ID>"
               },
       "location": {
                     "location": "centralus"
               },
       "network_config" : {
                     "vnet_name" : "<vpc name>",
                         "subnet_name": "<subnet name>",
                         "vnet_resource_group_name": "<vnet resource group name>",
                         "persistent_security_groups": "<persistent security group>"
               },
       "storage_config" : {
                     "storage_access_key": "<your storage access key>",
                     "storage_account_name": "<your storage account name>",
                     "disk_storage_account_name": "<your disk storage account name>",
                     "disk_storage_account_resource_group_name": "<your disk storage account resource group name>"
         "data_disk_count":4,
         "data_disk_size":300 GB
               }
     },
     "cluster_info": {
          "master_instance_type": "Standard_A6",
          "slave_instance_type": "Standard_A6",
          "label": ["azure1"],
          "min_nodes": 1,
          "max_nodes": 2,
          "cluster_name": "Azure1",
          "node_bootstrap": "node_bootstrap.sh"
          },
     "engine_config": {
          "flavour": "hadoop2",
            "hadoop_settings": {
                "custom_hadoop_config": "mapred.tasktracker.map.tasks.maximum=3"
            }
          "hive_settings":{
            "is_hs2":true,
            "hive_version":"2.3",
            "overrides":"hive.server2.a=dummy",
            "is_metadata_cache_enabled":false,
            "execution_engine":"tez",
            "hs2_thrift_port":10003
            }
           },
     "monitoring": {
            "ganglia": true,
           }
     }' "https://azure.qubole.com/api/v2/clusters"