Configuring a Spark Cluster

Note

Qubole has introduced an improvement that reduces Spark cluster launch time. This improvement is not implemented by default, but you can create a ticket on Qubole Support to enable it.

Note

For information on Notebook Interpreter Mode on a Spark cluster, see Using the User Interpreter Mode for Spark Notebooks.

By default, each account has a Spark cluster; this cluster is used automatically for Spark jobs and applications. You can see and change the configuration of the default Spark cluster on the Clusters page, and you can also use that page to add a new Spark cluster. QDS clusters are configured with reasonable defaults.

To make changes, navigate to the Clusters page and find the cluster with the spark label.

  • To edit the cluster configuration, click the Edit button next to that cluster.
  • To add a new Spark cluster, click the New button near the top left of the page, and then choose Spark on the Create New Cluster page.

Specify a label for a new cluster; you can also change the label of an existing cluster. The Cluster Type must be Spark.

You can select the version from the Spark Version drop-down list. If you are changing the version for an existing cluster, you must restart the cluster for the change to take effect.

In the drop-down list, Spark 2.x-latest means the latest open-source maintenance version of 2.x. When a new maintenance version is released, Qubole Spark versions are automatically upgraded to that version. So if 2.2-latest currently points to 2.2.0, then when 2.2.1 is released, QDS Spark clusters running 2.2-latest will automatically start using 2.2.1 on a cluster restart. See QDS Components: Supported Versions and Cloud Platforms for more information about Spark versions in QDS.
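The alias behavior described above can be illustrated with a short sketch. The function name `resolve_latest` and the list of released versions are hypothetical and exist only for this example; this is not a description of how QDS resolves the alias internally. It assumes version strings are purely numeric, dot-separated components.

```python
def resolve_latest(alias, released_versions):
    """Map an alias such as '2.2-latest' to the highest matching released version.

    Hypothetical helper for illustration only; assumes versions look like '2.2.1'.
    """
    suffix = "-latest"
    if not alias.endswith(suffix):
        return alias  # already a concrete version, nothing to resolve

    prefix_parts = alias[: -len(suffix)].split(".")  # e.g. ['2', '2']
    candidates = []
    for version in released_versions:
        parts = version.split(".")
        # Keep only versions that belong to the aliased release line.
        if parts[: len(prefix_parts)] == prefix_parts:
            candidates.append(tuple(int(p) for p in parts))

    if not candidates:
        raise ValueError(f"no released version matches {alias!r}")

    # Compare numerically so that 2.2.10 sorts after 2.2.9.
    return ".".join(str(p) for p in max(candidates))


# Before 2.2.1 is released, the alias points to 2.2.0.
print(resolve_latest("2.2-latest", ["2.1.0", "2.1.1", "2.2.0"]))
# After 2.2.1 is released, the same alias points to 2.2.1.
print(resolve_latest("2.2-latest", ["2.1.0", "2.1.1", "2.2.0", "2.2.1"]))
```

In this sketch the alias is resolved each time the function is called, which mirrors the documented behavior that a cluster picks up the newer maintenance version only on a restart.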

Note

Spark 2.2.0 in Qubole Spark has a known issue: Avro writes fail with org.apache.spark.SparkException: Task failed while writing rows. The issue originates in the open-source code. As a workaround, append the following to your node bootstrap script:

rm -rf /usr/lib/spark/assembly/target/scala-2.11/jars/spark-avro_2.11-3.2.0.jar
/usr/lib/hadoop2/bin/hadoop fs -get s3://paid-qubole/spark/jars/spark-avro/spark-avro_2.11

See Managing Clusters for instructions on changing other cluster settings.

Handling Spot Node Loss in Spark Clusters

Qubole proactively identifies nodes that are undergoing Spot loss and stops scheduling tasks on the executors running on those nodes. This feature is supported on Spark versions 2.1.0, 2.1.1, and 2.2-latest, and is controlled by the Spark configuration property spark.qubole.spotloss.handle.

By default, spark.qubole.spotloss.handle is set to true. To disable this feature, set spark.qubole.spotloss.handle = false in the Override Spark Configuration field of the SPARK SETTINGS section on the Edit Cluster Settings > Advanced Configuration page.

Note

After you modify any cluster configuration, you must restart the cluster for the changes to take effect.

Viewing a Package Management Environment on the Spark Cluster UI

When this feature is enabled on the account, creating a new Spark cluster also creates a package environment and attaches it to the cluster. The feature itself is not enabled by default; create a ticket with Qubole Support to enable it on the QDS account.

You can attach a package management environment to an existing Spark cluster. For more information, see Using QDS Package Management.

Once an environment is attached to the cluster, the ENVIRONMENT SETTINGS section appears in the Spark cluster’s Advanced Configuration. The following figure shows an environment attached to a Spark cluster.

../../_images/EnvironmentSettings.png

The default environment gets a list of pre-installed Python and R packages. To see the environment list, navigate to the Control Panel > Environments.

Configuring Heterogeneous Nodes in Spark Clusters

An Overview of Heterogeneous Nodes in Clusters explains how to configure heterogeneous nodes in Hadoop 2 and Spark clusters.

Overriding the Spark Default Configuration

Qubole provides a default configuration based on the Slave Node Type. The settings are used by Spark programs running in the cluster whether they are run from the UI, an API, or an SDK.

The figure below shows the default configuration.

../../_images/spark-defaults.png

Note: Use the help tooltip icon to get help on a field or checkbox.

To change or override the default configuration, provide the configuration values in the Override Spark Configuration Variables text box. Enter the configuration variables as follows:

On the first line, enter spark-defaults.conf: (including the trailing colon). On each subsequent line, enter exactly one <key> <value> pair; for example:

spark-defaults.conf:
spark.executor.cores 2
spark.executor.memory 10G

To apply the new settings, restart the cluster.
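If you maintain overrides in a script, the format described above can be generated rather than typed by hand. The following is a minimal sketch; `render_spark_overrides` is a hypothetical helper written for this example and is not part of any Qubole SDK. It only produces the text to paste into the override text box and does not contact QDS.

```python
def render_spark_overrides(settings):
    """Render a dict of Spark settings as override text.

    Hypothetical helper: the output starts with the 'spark-defaults.conf:'
    header line, followed by one '<key> <value>' pair per line.
    """
    lines = ["spark-defaults.conf:"]
    for key, value in settings.items():
        key = str(key).strip()
        value = str(value).strip()
        # A key containing whitespace would be ambiguous in '<key> <value>' form.
        if not key or any(ch.isspace() for ch in key):
            raise ValueError(f"invalid configuration key: {key!r}")
        # A value spanning several lines would break the one-pair-per-line rule.
        if not value or "\n" in value or "\r" in value:
            raise ValueError(f"invalid value for {key}: {value!r}")
        lines.append(f"{key} {value}")
    return "\n".join(lines)


print(render_spark_overrides({
    "spark.executor.cores": 2,
    "spark.executor.memory": "10G",
}))
```

The validation is deliberately strict: rejecting keys with whitespace and multi-line values keeps every generated line to a single key-value pair.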

To handle different types of workloads (for example, memory-intensive versus compute-intensive) you can add clusters and configure each appropriately.

Setting Time-To-Live in the JVMs for DNS Lookups on a Running Cluster

Qubole now supports configuring the Time-To-Live (TTL) for DNS lookups in the JVMs of a running cluster (all cluster types except Airflow and Presto). This feature is not enabled by default; create a ticket with Qubole Support to enable it on the QDS account. The recommended TTL value is 60 seconds.