Composing Spark Commands in Different Spark Languages through the UI

You can compose a Spark command in different languages using the query composer on the Analyze page. See Running Spark Applications and Spark in Qubole for more information. For REST API-related information, see Submit a Spark Command.

Spark queries run on Spark clusters. See Mapping of Cluster and Command Types for more information.

Qubole Spark Parameters

  • The Qubole parameter spark.sql.qubole.parquet.cacheMetadata allows you to turn caching on or off for Parquet table data. Caching is on by default; Qubole caches data to prevent table-data-access query failures in case of any change in the table’s Cloud storage location. If you want to disable caching of Parquet table data, set spark.sql.qubole.parquet.cacheMetadata to false. You can do this at the Spark cluster or job level, or in a Spark Notebook interpreter.
  • With DirectFileOutputCommitter (DFOC), if a Spark task fails after writing some of its files, subsequent reattempts can fail with FileAlreadyExistsException because of the partial files left behind, causing the job to fail. To prevent such failures, set the spark.hadoop.mapreduce.output.textoutputformat.overwrite and spark.qubole.outputformat.overwriteFileInWrite flags to true.
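Both settings can be supplied as job-level --conf entries, for example in the Spark Submit Command Line Options text field. A minimal sketch (the flag names come from the descriptions above; the values shown are the non-default ones being set):

```shell
# Disable caching of Parquet table metadata for this job:
--conf spark.sql.qubole.parquet.cacheMetadata=false

# Let task reattempts overwrite partial files left behind under DFOC:
--conf spark.hadoop.mapreduce.output.textoutputformat.overwrite=true
--conf spark.qubole.outputformat.overwriteFileInWrite=true
```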

Ways to Compose and Run Spark Applications

You can compose a Spark application using:

Note

Through the offline Spark History Server (SHS), you can access a Spark job’s logs even after the cluster on which the job ran has been terminated. For offline Spark clusters, the offline SHS processes only event log files smaller than 400 MB; this prevents high CPU utilization on the webapp node. For more information, see this blog.

Note

You can use the --packages option to add a list of comma-separated Maven coordinates of external packages that are used by a Spark application composed in any supported language. For example, in the Spark Submit Command Line Options text field, enter --packages com.package.module_2.10:1.2.3.

Note

You can use macros in script files for Spark commands with the subtypes scala (Scala), py (Python), R (R), sh (Command), and sql (SQL). You can also use macros in large inline content and large script files for scala (Scala), py (Python), R (R), and sql (SQL). This feature is not enabled for all users by default. Create a ticket with Qubole Support to enable this feature on the QDS account.

About Using Python 2.7 in Spark Jobs

If your cluster is running Python 2.6, you can enable Python 2.7 for a Spark job as follows:

  1. Add the following configuration in the node bootstrap script (node_bootstrap.sh) of the Spark cluster:

    source /usr/lib/hustler/bin/qubole-bash-lib.sh
    qubole-hadoop-use-python2.7
    
  2. To run spark-shell or spark-submit from any node’s shell, run these two commands before launching spark-shell or spark-submit:

    source /usr/lib/hustler/bin/qubole-bash-lib.sh
    qubole-hadoop-use-python2.7
    

Note

See Using the Supported Keyboard Shortcuts in Analyze for the list of supported keyboard shortcuts.

Compose a Spark Application in Scala

Perform the following steps to compose a Spark command:

  1. Navigate to the Analyze page and click Compose. Select Spark Command from the Command Type drop-down list.

  2. By default, Scala is selected. Compose the Spark application in Scala in the query editor. The query composer with Spark Command as the command type is as shown in the following figure.

    ../../_images/ComposeSparkScala.png
  3. Optionally enter command options in the Spark Submit Command Line Options text field to override the default command options in the Spark Default Submit Command Line Options text field.

  4. Optionally specify arguments in the Arguments for User Program field.

  5. Click Run to execute the query. Click Save if you want to re-run the same query later. (See Workspace for more information on saving queries.)

  6. The query result is displayed in the Results tab, and the query logs in the Logs tab. The Logs tab has an Errors and Warnings filter. For more information on how to download command results and logs, see Download Results and Logs from the Analyze UI. Note the clickable Spark Application UI URL in the Resources tab.
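The options accepted in step 3 are standard spark-submit flags. A sketch of typical overrides (the values are illustrative only, not Qubole recommendations):

```shell
# Illustrative overrides for the Spark Submit Command Line Options field:
--conf spark.executor.memory=2g
--conf spark.executor.cores=2
--conf spark.sql.shuffle.partitions=200
```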

Compose a Spark Application in Python

Perform the following steps to compose a Spark command:

  1. Navigate to the Analyze page and click Compose. Select Spark Command from the Command Type drop-down list.

  2. By default, Scala is selected. Select Python from the drop-down list. Compose the Spark application in Python in the query editor. The query composer with Spark Command as the command type is as shown in the following figure.

    ../../_images/ComposeSparkPython.png
  3. Optionally enter command options in the Spark Submit Command Line Options text field to override the default command options in the Spark Default Submit Command Line Options text field.

    You can pass remote files in a Cloud storage location, in addition to the local files, as values to the --py-files argument.

  4. Optionally specify arguments in the Arguments for User Program field.

  5. Click Run to execute the query. Click Save if you want to run the same query later. (See Workspace for more information on saving queries.)

  6. The query result is displayed in the Results tab, and the query logs in the Logs tab. The Logs tab has an Errors and Warnings filter. For more information on how to download command results and logs, see Download Results and Logs from the Analyze UI. Note the clickable Spark Application UI URL in the Resources tab.
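For example, the --py-files argument mentioned in step 3 can mix local files with a remote archive in Cloud storage; the bucket and file names below are hypothetical:

```shell
# Hypothetical: ship a zip of helper modules from Cloud storage
# together with a local Python file.
--py-files s3://example-bucket/libs/helpers.zip,local_utils.py
```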

Compose a Spark Application using the Command Line

Perform the following steps to compose a Spark command:

  1. Navigate to the Analyze page and click Compose. Select Spark Command from the Command Type drop-down list.

  2. By default, Scala is selected. Select Command Line from the drop-down list. Compose the Spark application using command-line commands in the query editor. You can override default command options in the Spark Default Submit Command Line Options text field by specifying other options.

    Note

    Qubole does not recommend running a Spark application as Bash commands under the Shell command option, because in this case automatic changes (such as increases in the Application Master memory based on the driver memory, and the availability of debug options) do not occur. Such automatic changes do occur when you run a Spark application using the Command Line option.

    The query composer with Spark Command as the command type is as shown in the following figure.

    ../../_images/ComposeSparkCmdLine.png
  3. Click Run to execute the query. Click Save if you want to run the same query later. (See Workspace for more information on saving queries.)

  4. The query result is displayed in the Results tab, and the query logs in the Logs tab. The Logs tab has an Errors and Warnings filter. For more information on how to download command results and logs, see Download Results and Logs from the Analyze UI. Note the clickable Spark Application UI URL in the Resources tab.
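As a sketch of what the Command Line option accepts, the following runs the SparkPi example bundled with Spark. The install paths shown are typical defaults and may differ on your cluster:

```shell
# Sketch: a complete spark-submit invocation entered as the command.
# Paths are typical defaults; adjust them for your cluster layout.
/usr/lib/spark/bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.executor.memory=1g \
  /usr/lib/spark/examples/jars/spark-examples.jar 100
```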

Compose a Spark Application in SQL

Note

You can run Spark commands in SQL with Hive Metastore 2.1. This feature is not enabled for all users by default. Create a ticket with Qubole Support to enable this feature on the QDS account.

Note

You can run Spark SQL commands with large script file and large inline content. This feature is not enabled for all users by default. Create a ticket with Qubole Support to enable this feature on the QDS account.

  1. Navigate to the Analyze page and click Compose. Select Spark Command from the Command Type drop-down list.

  2. By default, Scala is selected. Select SQL from the drop-down list. Compose the Spark application in SQL in the query editor. Press Ctrl + Space in the command editor to get a list of suggestions.

    The query composer with Spark Command as the command type is as shown in the following figure.

    ../../_images/ComposeSparkSQL.png
  3. Optionally enter command options in the Spark Submit Command Line Options text field to override the default command options in the Spark Default Submit Command Line Options text field.

  4. Click Run to execute the query. Click Save if you want to run the same query later. (See Workspace for more information on saving queries.)

  5. The query result is displayed in the Results tab, and the query logs in the Logs tab. The Logs tab has an Errors and Warnings filter. For more information on how to download command results and logs, see Download Results and Logs from the Analyze UI. Note the clickable Spark Application UI URL in the Resources tab.

Compose a Spark Application in R

Perform the following steps to compose a Spark command:

  1. Navigate to the Analyze page and click Compose. Select Spark Command from the Command Type drop-down list.

  2. By default, Scala is selected. Select R from the drop-down list. Compose the Spark application in R in the query editor. The query composer with Spark Command as the command type is as shown in the following figure.

    ../../_images/ComposeSparkR.png

    The example in the figure above uses Amazon S3 (s3://). On Azure, use wasb:// or adl://; on Oracle OCI, use oci://.

  3. Optionally enter command options in the Spark Submit Command Line Options text field to override the default command options in the Spark Default Submit Command Line Options text field.

  4. Optionally specify arguments in the Arguments for User Program field.

  5. Click Run to execute the query. Click Save if you want to run the same query later. (See Workspace for more information on saving queries.)

  6. The query result is displayed in the Results tab, and the query logs in the Logs tab. The Logs tab has an Errors and Warnings filter. For more information on how to download command results and logs, see Download Results and Logs from the Analyze UI. Note the clickable Spark Application UI URL in the Resources tab.

Run a Spark Notebook from the Analyze Query Composer

Qubole supports running a Spark notebook from the Analyze page’s query composer. To run a notebook using the Analyze page, perform the following steps:

  1. Navigate to the Analyze page and click Compose. Select Spark Command from the Command Type drop-down list.

  2. By default, Scala is selected. Select Notebook from the drop-down list. The query composer with Spark Command as the command type is as shown in the following figure.

    ../../_images/ComposeSparkNotebook.png
  3. Select the note from the Notebooks drop-down list. You can select any cluster from the cluster drop-down list. The cluster associated with the selected notebook is also part of the cluster drop-down list.

  4. In the Arguments text field, you can optionally add parameters to the notebook command. These parameters are passed to dynamic forms of the notebook. You can pass more than one variable. The syntax for using arguments is given below.

    {"key1":"value1", "key2":"value2", ..., "keyN":"valueN"}
    

    Where key1, key2, …, keyN are the parameters that you want to pass before you run the notebook. If you need to change the corresponding values (value1, value2, …, valueN), you can do so each time you run the command.

  5. Click Run to execute the query. Click Save if you want to run the same query later. (See Workspace Tab for more information on saving queries.)

  6. The query result is displayed in the Results tab, and the query logs in the Logs tab. The Logs tab has an Errors and Warnings filter. For more information on how to download command results and logs, see Download Results and Logs from the Analyze UI. Note the clickable Spark Application UI URL in the Resources tab.

Known Issue

The Spark Application UI might display an incorrect state of the application when Spot Instances are used. You can view the accurate status of the Qubole command in the Analyze or Notebooks page.

When the Spark application is running, if the master node or the node that runs the driver is lost, the Spark Application UI might display an incorrect state of the application. Event logs for a running application are periodically persisted from their HDFS location to Cloud storage. If the master node is removed because of a Spot loss, the Cloud storage might not have the latest status of the application. As a result, the Spark Application UI might continue to show the application in a running state.

To avoid this issue, Qubole recommends using an On-Demand master node.

Other Examples

Example 1: Sample Scala program with the package parameter

The following figure shows a sample Scala program with mypackage passed through the --packages parameter.

../../_images/ComposeSparkScala_pkg-name.png

Example 2: Sample Scala program with the --repositories parameter

In this example, an external package is downloaded from the JitPack Maven repository and made available to the Spark application.

The following figure shows a sample Scala program with the --repositories https://jitpack.io --packages com.github.apache:commons-csv:CSV_1.0_RC2 parameter in the Spark Submit Command Line Options text field.

../../_images/ComposeSparkScala_repo.png

Example 3: Sample Scala program with the --class parameter

The following figure shows a sample Scala program with the --class=myclass parameter in the Spark Submit Command Line Options text field.

../../_images/ComposeSparkScala_class.png

Example 4: Sample Scala program with user program arguments

The following figure shows a sample Scala program with the user program argument args(0) set to 10, and the corresponding result.

../../_images/user-pgm-example.png