Configuring a Spark Cluster¶
Qubole has introduced an improvement that reduces Spark cluster launch time. It is not enabled by default; create a ticket with Qubole Support to enable it.
For information on Notebook Interpreter Mode on a Spark cluster, see Using the User Interpreter Mode for Spark Notebooks.
By default, each account has a Spark cluster; this cluster is used automatically for Spark jobs and applications. You can see and change the configuration of the default Spark cluster on the Clusters page, and you can also use that page to add a new Spark cluster. QDS clusters are configured with reasonable defaults.
To make changes, navigate to the Clusters page and find the cluster with the spark label.
- To edit the cluster configuration, click the Edit button next to that cluster.
- To add a new Spark cluster, click the New button near the top left of the page, and then choose Spark on the Create New Cluster page.
Specify a label for a new cluster; you can also change the label of an existing cluster. The Cluster Type must be Spark.
You can select the version from the Spark Version drop-down list. If you are changing the version for an existing cluster, you must restart the cluster for the change to take effect.
In the drop-down list, Spark 2.x-latest means the latest open-source maintenance version of 2.x. When a new maintenance version is released, Qubole Spark versions are automatically upgraded to that version. So if 2.2-latest currently points to 2.2.0, then when 2.2.1 is released, QDS Spark clusters running 2.2-latest will automatically start using 2.2.1 on a cluster restart. See QDS Components: Supported Versions and Cloud Platforms for more information about Spark versions in QDS.
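The way a -latest label tracks maintenance releases can be sketched in a few lines of Python. This is only an illustration of the rule described above, not Qubole's actual resolution mechanism; the function name resolve_latest and the list of released versions are hypothetical.

```python
def resolve_latest(label, released_versions):
    """Return the version a label such as '2.2-latest' would point to.

    Illustrative only: `released_versions` is a hypothetical list of
    open-source release numbers, not something read from QDS.
    """
    suffix = "-latest"
    if not label.endswith(suffix):
        return label  # already a pinned version such as "2.1.1"

    prefix_parts = label[: -len(suffix)].split(".")  # "2.2" -> ["2", "2"]
    candidates = []
    for version in released_versions:
        parts = version.split(".")
        if parts[: len(prefix_parts)] == prefix_parts:
            # Compare numerically so that 2.2.10 sorts after 2.2.9.
            candidates.append(tuple(int(p) for p in parts))

    if not candidates:
        raise ValueError(f"no released version matches {label!r}")
    return ".".join(str(p) for p in max(candidates))


# Before 2.2.1 is released, the label resolves to 2.2.0.
before = resolve_latest("2.2-latest", ["2.1.0", "2.1.1", "2.2.0"])
# After 2.2.1 is released, the same label resolves to 2.2.1.
after = resolve_latest("2.2-latest", ["2.1.0", "2.1.1", "2.2.0", "2.2.1"])
```

In QDS the new version is picked up only when the cluster restarts; the sketch models the label lookup, not the restart.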
There is a known issue for Spark 2.2.0 in Qubole Spark: Avro write fails with org.apache.spark.SparkException: Task failed while writing rows. This is a known issue in the open-source code. As a workaround, append the following to your node bootstrap script:
rm -rf /usr/lib/spark/assembly/target/scala-2.11/jars/spark-avro_2.11-3.2.0.jar
/usr/lib/hadoop2/bin/hadoop fs -get s3://paid-qubole/spark/jars/spark-avro/spark-avro_2.11
See Managing Clusters for instructions on changing other cluster settings.
Handling Spot Node Loss in Spark Clusters¶
Qubole proactively identifies the nodes that undergo Spot loss, and stops scheduling tasks on the corresponding executors.
This feature is supported on Spark versions 2.1.0, 2.1.1, and 2.2-latest, and is controlled by the Spark configuration spark.qubole.spotloss.handle, which is set to true by default. To disable the feature, set spark.qubole.spotloss.handle to false in the Override Spark Configuration field of the SPARK SETTINGS section on the Edit Cluster Settings > Advanced Configuration page.
After you modify any cluster configuration, you must restart the cluster for the changes to take effect.
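As a rough illustration of the default-plus-override behavior, the following Python sketch checks whether a given override text leaves Spot loss handling enabled. It assumes the override text uses one whitespace-separated key-value pair per line and that the last occurrence of a key wins; both are assumptions made for the example, and spot_loss_handling_enabled is a hypothetical helper, not part of QDS.

```python
SPOT_LOSS_KEY = "spark.qubole.spotloss.handle"


def spot_loss_handling_enabled(override_text):
    """Return True unless the override text sets the key to something other than true."""
    value = "true"  # documented default when the key is not overridden
    for line in override_text.splitlines():
        parts = line.split(None, 1)  # split into key and the rest of the line
        if len(parts) == 2 and parts[0] == SPOT_LOSS_KEY:
            value = parts[1].strip()  # assumption: the last occurrence wins
    return value.lower() == "true"


# No override: the default applies.
default_state = spot_loss_handling_enabled("")
# Explicit override that disables the feature.
disabled_state = spot_loss_handling_enabled(
    "spark-defaults.conf:\nspark.qubole.spotloss.handle false"
)
```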
Viewing a Package Management Environment on the Spark Cluster UI¶
When this feature is enabled, creating a new Spark cluster also creates a package environment and attaches it to the cluster by default. The feature itself is not enabled by default; create a ticket with Qubole Support to enable it on the QDS account.
You can attach a package management environment to an existing Spark cluster. For more information, see Using QDS Package Management.
Once an environment is attached to the cluster, you can see the ENVIRONMENT SETTINGS in the Spark cluster’s Advanced Configuration. Here is an environment attached to the Spark cluster.
The default environment gets a list of pre-installed Python and R packages. To see the environment list, navigate to the Control Panel > Environments.
Configuring Heterogeneous Nodes in Spark Clusters¶
An Overview of Heterogeneous Nodes in Clusters explains how to configure heterogeneous nodes in Hadoop 2 and Spark clusters.
Overriding the Spark Default Configuration¶
Qubole provides a default configuration based on the Slave Node Type. The settings are used by Spark programs running in the cluster whether they are run from the UI, an API, or an SDK.
The figure below shows the default configuration.
Note: Use the tooltip to get help on a field or checkbox.
To change or override the default configuration, provide the configuration values in the Override Spark Configuration Variables text box. Enter the configuration variables as follows:
- In the first line, enter spark-defaults.conf: (including the trailing colon).
- Enter one <key> <value> pair on each subsequent line.
Provide only one key-value pair per line; for example:
spark-defaults.conf:
spark.executor.cores 2
spark.executor.memory 10G
To apply the new settings, restart the cluster.
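If you generate override text programmatically, a small helper can enforce the format described above (a spark-defaults.conf: header followed by one key-value pair per line). The following Python sketch is a minimal example under that assumption; build_spark_override is a hypothetical name, and the result still has to be pasted into the text box or sent through whatever interface you use.

```python
def build_spark_override(settings):
    """Render a dict of Spark settings as override text, one pair per line."""
    lines = ["spark-defaults.conf:"]  # required first line
    for key, value in settings.items():
        key = str(key).strip()
        value = str(value).strip()
        if not key or any(ch.isspace() for ch in key):
            raise ValueError(f"invalid configuration key: {key!r}")
        if not value or any(ch in "\r\n" for ch in value):
            # A line break inside a value would put two entries on one pair.
            raise ValueError(f"invalid value for {key}: {value!r}")
        lines.append(f"{key} {value}")
    return "\n".join(lines)


override_text = build_spark_override(
    {"spark.executor.cores": 2, "spark.executor.memory": "10G"}
)
print(override_text)
```

The helper only checks the shape of each line; it does not verify that a key is a valid Spark property or that the value suits the node type.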
To handle different types of workloads (for example, memory-intensive versus compute-intensive) you can add clusters and configure each appropriately.
Setting Time-To-Live in the JVMs for DNS Lookups on a Running Cluster¶
Qubole supports configuring the Time-To-Live (TTL) for DNS lookups in the JVMs of a running cluster (except on Airflow and Presto clusters).
This feature is not enabled by default; create a ticket with Qubole Support to enable it on the QDS account. The recommended TTL value is 60 seconds.