Spark

Note

Spark 2.3-latest is now set to Spark 2.3.2 in the QDS UI. Spark clusters running 2.3-latest will run 2.3.2 after a cluster restart.

New Features

RubiX Caching

SPAR-3162: Spark on Azure now supports RubiX caching. Via Support, Disabled.

Once Qubole Support has activated RubiX caching for your account, enable it for a cluster in the QDS UI by selecting the Enable Rubix check box under the Advanced tab of the Clusters page when you create or modify a Spark cluster.

RubiX caching is supported only for Azure Blob storage (WASB).
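
If you manage clusters through the QDS REST API rather than the UI, the same setting can in principle be toggled programmatically. The following is a minimal sketch in Python; the engine_config, rubix_settings, and enable_rubix payload keys are assumptions mirroring the Enable Rubix check box, not a documented contract, so verify them against the cluster API documentation for your account:

    # Hypothetical sketch: enabling RubiX caching on an existing Spark
    # cluster through the QDS REST API. The payload keys below are
    # assumptions mirroring the "Enable Rubix" UI check box; verify them
    # against the QDS cluster API documentation before use.
    import requests

    QDS_ENDPOINT = "https://azure.qubole.com/api/v2"  # Azure QDS environment
    AUTH_TOKEN = "<your-account-API-token>"
    CLUSTER_ID = "<spark-cluster-id>"

    resp = requests.put(
        "{0}/clusters/{1}".format(QDS_ENDPOINT, CLUSTER_ID),
        headers={"X-AUTH-TOKEN": AUTH_TOKEN, "Content-Type": "application/json"},
        json={"engine_config": {"rubix_settings": {"enable_rubix": True}}},
    )
    resp.raise_for_status()
    # As with the UI setting, restart the cluster for caching to take effect.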

Support for Hive Authorization Admin Commands

SPAR-2786: Spark on Qubole now supports Hive authorization admin commands, allowing users to grant privileges such as SELECT, INSERT, UPDATE, and DELETE to other users or roles. Via Support, Disabled.

The following commands are supported (an example sketch follows the list):

  • Set role

  • Grant privilege (SELECT, INSERT, DELETE, UPDATE, or ALL)

  • Revoke privilege (SELECT, INSERT, DELETE, UPDATE, or ALL)

  • Grant role

  • Revoke role

  • Show grant

  • Show current roles

  • Show roles

  • Show role grant

  • Show principals for role

Support for these commands is available in Spark 2.4 and later versions.
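
As a sketch of how these commands might be issued from a Spark application, here they are run through spark.sql in PySpark. The role, user, and table names (etl_role, alice, sales) are placeholders, and the exact GRANT/REVOKE syntax can vary with the Hive authorization mode configured for the cluster:

    # Illustrative Hive authorization admin commands issued from PySpark
    # (Spark 2.4+). Role, user, and table names are placeholders.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    spark.sql("GRANT ROLE etl_role TO USER alice")                # grant role
    spark.sql("SET ROLE etl_role")                                # set role
    spark.sql("GRANT SELECT ON TABLE sales TO ROLE etl_role")     # grant privilege
    spark.sql("SHOW GRANT ROLE etl_role ON TABLE sales").show()   # show grant
    spark.sql("SHOW CURRENT ROLES").show()
    spark.sql("SHOW ROLES").show()
    spark.sql("SHOW ROLE GRANT USER alice").show()                # show role grant
    spark.sql("SHOW PRINCIPALS etl_role").show()                  # show principals for role
    spark.sql("REVOKE SELECT ON TABLE sales FROM ROLE etl_role")  # revoke privilege
    spark.sql("REVOKE ROLE etl_role FROM USER alice")             # revoke role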

Qubole Job History Server Upgrade

SPAR-3053: The multi-tenant Qubole Job History Server has been upgraded to Spark 2.3 (2.3.1 by default). This server provides the logs and history of Spark jobs that ran on clusters that have since been terminated.

Improvements

  • SPAR-3003: Cluster images now include the PyArrow package to support Pandas UDFs, enabling performance improvements in Spark 2.3.1; see the sketch after this list. This enhancement is available via Support and is disabled by default for Spark 2.3.1. It is enabled by default for Spark 2.4 and later versions.

  • SPAR-2649: You can now dynamically change min executors and max executors for a running Spark application from the Executors tab of the Spark Application UI. This capability is supported in Spark 2.3.1 and later versions.
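
For reference, here is a minimal scalar Pandas UDF of the kind the bundled PyArrow package enables. It uses the standard pandas_udf API for Spark 2.3/2.4; the function and column names are illustrative:

    # Minimal scalar Pandas (vectorized) UDF; requires the PyArrow package
    # now bundled in cluster images. Names are illustrative.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import pandas_udf, PandasUDFType

    spark = SparkSession.builder.getOrCreate()

    @pandas_udf("double", PandasUDFType.SCALAR)
    def fahrenheit_to_celsius(f):
        # f is a pandas.Series; operating on a whole Series at a time avoids
        # the per-row overhead of ordinary Python UDFs.
        return (f - 32) * 5.0 / 9.0

    df = spark.createDataFrame([(32.0,), (212.0,)], ["temp_f"])
    df.select(fahrenheit_to_celsius("temp_f").alias("temp_c")).show()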

Bug Fix

  • SPAR-3059: Fixes the following problem with native Optimized Row Columnar (ORC) writes when DirectFileOutputCommitter is used: if a task failed after writing partial files, its re-attempt also failed with a FileAlreadyExistsException, and the job failed. Fixed in Spark 2.4.