Composing Spark Commands in the Workbench Page

Use the command composer on the Workbench page to compose a Spark command in any of the supported languages.

See Running Spark Applications and Spark in Qubole for more information. For information about using the REST API, see Submit a Spark Command.

Spark queries run on Spark clusters. See Mapping of Cluster and Command Types for more information.

Qubole Spark Parameters

As part of a Spark command, you can use command-line options to set or override Qubole parameters such as the following:

  • The Qubole parameter spark.sql.qubole.parquet.cacheMetadata allows you to turn caching on or off for Parquet table data. Caching is on by default; Qubole caches data to prevent table-data-access query failures in case of any change in the table’s Cloud storage location. If you want to disable caching of Parquet table data, set spark.sql.qubole.parquet.cacheMetadata to false. You can do this at the Spark cluster or job level, or in a Spark Notebook interpreter.
  • When you use DirectFileOutputCommitter (DFOC) with Spark, a task that fails after partially writing its files can cause subsequent reattempts to fail with FileAlreadyExistsException (because of the partial files left behind), and so the job fails. To prevent such job failures, set the spark.hadoop.mapreduce.output.textoutputformat.overwrite and spark.qubole.outputformat.overwriteFileInWrite flags to true, as shown in the example after this list.
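
For example, to disable Parquet metadata caching and enable the DFOC overwrite flags for a single job, you could enter options like the following in the Spark Submit Command Line Options text field (a sketch using only the parameters described above; the options are shown one per line for readability):

    --conf spark.sql.qubole.parquet.cacheMetadata=false
    --conf spark.hadoop.mapreduce.output.textoutputformat.overwrite=true
    --conf spark.qubole.outputformat.overwriteFileInWrite=true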

Ways to Compose and Run Spark Applications

You can compose a Spark application in Scala, Python, SQL, or R, or by using the command line, as described in the sections that follow.

Note

You can read a Spark job’s logs, even after the cluster on which it was run has terminated, by means of the offline Spark History Server (SHS). For offline Spark clusters, only event log files that are less than 400 MB are processed in the SHS. This prevents high CPU utilization on the webapp node. For more information, see this blog.

You can use the --packages option to add a list of comma-separated Maven coordinates for external packages that are used by a Spark application composed in any supported language. For example, in the Spark Submit Command Line Options text field, enter --packages com.package.module_2.10:1.2.3.
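
To add more than one external package, separate the Maven coordinates with commas. The sketch below reuses the coordinate from the example above and adds a second, purely hypothetical coordinate to show the comma-separated form:

    # the second coordinate below is a placeholder, not a real package
    --packages com.package.module_2.10:1.2.3,org.example:placeholder-lib_2.10:0.4.0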

You can use macros in script files for Spark commands with subtypes scala (Scala), py (Python), R (R), sh (Command), and sql (SQL). You can also use macros in large inline content and large script files for scala (Scala), py (Python), R (R), and sql (SQL). This capability is not enabled for all users by default; create a ticket with Qubole Support to enable it for your QDS account.

About Using Python 2.7 in Spark Jobs

If your cluster is running Python 2.6, you can enable Python 2.7 for a Spark job as follows:

  1. Add the following configuration in the node bootstrap script (node_bootstrap.sh) of the Spark cluster:

    source /usr/lib/hustler/bin/qubole-bash-lib.sh
    qubole-hadoop-use-python2.7
    
  2. To run spark-shell/spark-submit from any node’s shell, first run these two commands (you can add them in the Spark Submit Command Line Options text field):

    source /usr/lib/hustler/bin/qubole-bash-lib.sh
    qubole-hadoop-use-python2.7
    

Compose a Spark Application in Scala

  1. Navigate to Workbench and click + New Collection.

  2. Select Spark from the command type drop-down list.

  3. By default, Scala is selected.

  4. Choose the cluster on which you want to run the query. View the health metrics of a cluster before you decide to use it.

  5. Query Statement is selected by default in the drop-down list (upper-right corner of the screen). Enter your query in the text field.

    or

    To run a stored query, select Query Path from the drop-down list, then specify the cloud storage path that contains the query file.

  6. Add macro details (as needed).

  7. Optionally enter command options in the Spark Submit Command Line Options text field to override the default command options (see the example after these steps).

  8. Optionally specify arguments in the Arguments for User Program text field.

  9. Click Run to execute the query.
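
The options in step 7 are standard spark-submit options. For example, a sketch that overrides executor resources might look like the following (illustrative values; tune them for your workload):

    --num-executors 10 --executor-memory 4g --executor-cores 2 --conf spark.driver.memory=4g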

Monitor the progress of your job using the Status and Logs panes. You can toggle between the two using a switch. The Status tab also displays useful debugging information if the query does not succeed. For more information on how to download command results and logs, see Get Results. Note the clickable Spark Application UI URL in the Resources tab.

Compose a Spark Application in Python

  1. Navigate to Workbench and click + New Collection.

  2. Select Spark from the command type drop-down list.

  3. By default, Scala is selected. Select Python from the drop-down list.

  4. Choose the cluster on which you want to run the query. View the health metrics of a cluster before you decide to use it.

  5. Query Statement is selected by default in the drop-down list (upper-right corner of the screen). Enter your query in the text field.

    or

    To run a stored query, select Query Path from the drop-down list, then specify the cloud storage path that contains the query file.

  6. Add macro details (as needed).

  7. Optionally enter command options in the Spark Submit Command Line Options text field to override the default command options.

  8. Optionally specify arguments in the Arguments for User Program text field.

  9. Click Run to execute the query.

Monitor the progress of your job using the Status and Logs panes. You can toggle between the two using a switch. The Status tab also displays useful debugging information if the query does not succeed. For more information on how to download command results and logs, see Get Results. Note the clickable Spark Application UI URL in the Resources tab.

Compose a Spark Application using the Command Line

Note

Qubole does not recommend using the Shell command option to run a Spark application via Bash shell commands, because in this case automatic changes (such as increases in the Application Coordinator memory based on the driver memory, and the availability of debug options) do not occur. Such automatic changes do occur when you run a Spark application using the Command Line option.

  1. Navigate to Workbench and click + New Collection.

  2. Select Spark from the command type drop-down list.

  3. By default, Scala is selected. Select Command Line from the drop-down list.

  4. Choose the cluster on which you want to run the query. View the health metrics of a cluster before you decide to use it.

  5. Query Statement is selected by default in the drop-down list (upper-right corner of the screen). Enter your query in the text field (see the example after these steps).

    or

    To run a stored query, select Query Path from the drop-down list, then specify the cloud storage path that contains the query file.

  6. Add macro details (as needed).

  7. Click Run to execute the query.
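
As a sketch of what you might enter as the query statement in step 5, the following runs the SparkPi example that ships with Spark. The spark-submit path and the examples jar location are assumptions about the cluster image; adjust the class, jar, and options for your own application:

    # paths are assumed; verify them on your cluster
    /usr/lib/spark/bin/spark-submit --class org.apache.spark.examples.SparkPi /usr/lib/spark/examples/jars/spark-examples*.jar 10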

Monitor the progress of your job using the Status and Logs panes. You can toggle between the two using a switch. The Status tab also displays useful debugging information if the query does not succeed. For more information on how to download command results and logs, see Get Results. Note the clickable Spark Application UI URL in the Resources tab.

Compose a Spark Application in SQL

Note

You can run Spark commands in SQL with Hive Metastore 2.1, and you can run Spark SQL commands with large script files and large inline content. These capabilities are not enabled for all users by default; create a ticket with Qubole Support to enable them for your QDS account.

  1. Navigate to Workbench and click + New Collection.

  2. Select Spark from the command type drop-down list.

  3. By default, Scala is selected. Select SQL from the drop-down list.

  4. Choose the cluster on which you want to run the query. View the health metrics of a cluster before you decide to use it.

  5. Query Statement is selected by default in the drop-down list (upper-right corner of the screen). Enter your query in the text field. Press Ctrl + Space in the command editor to get a list of suggestions.

    or

    To run a stored query, select Query Path from the drop-down list, then specify the cloud storage path that contains the query file.

  6. Add macro details (as needed).

  7. Optionally enter command options in the Spark Submit Command Line Options text field to override the default command options.

  8. Click Run to execute the query.

Monitor the progress of your job using the Status and Logs panes. You can toggle between the two using a switch. The Status tab also displays useful debugging information if the query does not succeed. For more information on how to download command results and logs, see Get Results. Note the clickable Spark Application UI URL in the Resources tab.

Compose a Spark Application in R

  1. Navigate to Workbench and click + New Collection.

  2. Select Spark from the command type drop-down list.

  3. By default, Scala is selected. Select R from the drop-down list.

  4. Choose the cluster on which you want to run the query. View the health metrics of a cluster before you decide to use it.

  5. Query Statement is selected by default in the drop-down list (upper-right corner of the screen). Enter your query in the text field.

    or

    To run a stored query, select Query Path from the drop-down list, then specify the cloud storage path that contains the query file.

  6. Add macro details (as needed).

  7. Optionally enter command options in the Spark Submit Command Line Options text field to override the default command options.

  8. Optionally specify arguments in the Arguments for User Program text field.

  9. Click Run to execute the query.

Monitor the progress of your job using the Status and Logs panes. You can toggle between the two using a switch. The Status tab also displays useful debugging information if the query does not succeed. For more information on how to download command results and logs, see Get Results. Note the clickable Spark Application UI URL in the Resources tab.

Known Issue

The Spark Application UI might display an incorrect state of the application when Spot Instances are used. You can view the accurate status of the Qubole command on the Workbench or Notebooks page.

While a Spark application is running, if the coordinator node or the node that runs the driver is lost, the Spark Application UI may display an incorrect state of the application. For a running application, event logs are periodically persisted from the HDFS location to cloud storage. If the coordinator node is removed because of a Spot loss, cloud storage may not have the latest application status, so the Spark Application UI may continue to display the application as running.

To prevent this issue, Qubole recommends using an On-Demand coordinator node.