Spark
Note
Spark 2.3-latest is now set to Spark 2.3.2 in the QDS UI. Spark clusters running 2.3-latest will run 2.3.2 after a cluster restart.
New Features
RubiX Caching
SPAR-3162: Spark on Azure now supports RubiX caching. This feature is available via Support and is disabled by default.
Once Qubole support has activated RubiX caching for your account, use the QDS UI to enable it for a cluster by checking the Enable Rubix check box under the Advanced tab of the Clusters page when you create or modify a Spark cluster.
RubiX caching is supported only for Azure Blob storage (WASB).
Qubole Job History Server Upgrade
SPAR-3053: The multi-tenant Qubole Job History Server has been upgraded to Spark 2.3 (2.3.1 by default). This server makes available the logs and history of Spark jobs that ran on clusters that have since been terminated.
Improvements
SPAR-3003: Cluster images now include the PyArrow package to support Pandas UDFs, enabling performance improvements in Spark 2.3.1. This enhancement is available via Support and is disabled by default for Spark 2.3.1. It is enabled by default for Spark 2.4 and later versions.
SPAR-2649: You can now dynamically change min executors and max executors for a running Spark application from the Executors tab of the Spark Application UI. This capability is supported in Spark 2.3.1 and later versions.
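The executor bounds that the UI adjusts at run time have standard counterparts that can be set when an application is submitted. The sketch below builds those properties as spark-submit arguments. It assumes the UI fields correspond to the open-source Spark properties spark.dynamicAllocation.minExecutors and spark.dynamicAllocation.maxExecutors; the helper names are hypothetical, and other settings a cluster may need for dynamic allocation are not covered.

```python
def dynamic_allocation_conf(min_executors, max_executors):
    """Build Spark properties that set the starting executor bounds.

    Assumption: these open-source Spark properties are the submit-time
    counterparts of the min/max executors fields in the UI.
    """
    if min_executors < 0 or max_executors < min_executors:
        raise ValueError("need 0 <= min_executors <= max_executors")
    return {
        "spark.dynamicAllocation.enabled": "true",
        "spark.dynamicAllocation.minExecutors": str(min_executors),
        "spark.dynamicAllocation.maxExecutors": str(max_executors),
    }


def to_submit_args(conf):
    """Render the properties as spark-submit --conf arguments."""
    args = []
    for key in sorted(conf):
        args += ["--conf", "{}={}".format(key, conf[key])]
    return args


conf = dynamic_allocation_conf(2, 10)
print(" ".join(to_submit_args(conf)))
```

Values set this way are only the starting bounds; on Spark 2.3.1 and later they can then be changed from the Executors tab while the application runs.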
Bug Fix
SPAR-3059: Fixes the following problem with native Optimized Row Columnar (ORC) with DirectFileOutputCommitter: if a task failed after writing partial files, the re-attempt also failed with FileAlreadyExistsException and the job failed. Fixed in Spark 2.4.
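The failure mode can be reproduced in miniature without Spark. The sketch below is a plain-Python simulation, not Qubole's or Spark's code: a writer that refuses to overwrite stands in for a committer that writes directly to the final path, and Python's FileExistsError stands in for FileAlreadyExistsException. Removing the partial file before the retry is only one illustrative way to let the re-attempt succeed; the release note does not say how the fix was implemented. All names are hypothetical.

```python
import os
import tempfile


def write_exclusive(path, rows, fail_after=None):
    """Write rows to path, refusing to overwrite; optionally fail partway."""
    with open(path, "x") as f:  # "x" raises FileExistsError if path exists
        for i, row in enumerate(rows):
            if fail_after is not None and i == fail_after:
                raise RuntimeError("simulated task failure")
            f.write(row + "\n")


def run_task(path, rows, attempts, clean_up_partial):
    """Run up to `attempts` attempts; the first one always fails partway."""
    outcomes = []
    for attempt in range(attempts):
        fail_after = 1 if attempt == 0 else None
        try:
            write_exclusive(path, rows, fail_after)
            outcomes.append("ok")
            break
        except RuntimeError:
            outcomes.append("task failed")
            # Illustrative remedy (an assumption): delete the partial file
            # so that the next attempt can create it again.
            if clean_up_partial and os.path.exists(path):
                os.remove(path)
        except FileExistsError:
            outcomes.append("file already exists")
    return outcomes


with tempfile.TemporaryDirectory() as d:
    rows = ["a", "b", "c"]
    without_cleanup = run_task(os.path.join(d, "part-0"), rows, 2, False)
    with_cleanup = run_task(os.path.join(d, "part-1"), rows, 2, True)

print(without_cleanup)  # ['task failed', 'file already exists']
print(with_cleanup)     # ['task failed', 'ok']
```

Without cleanup, the partial file left by the first attempt blocks the second, which matches the behavior described in the bug report.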