Spark

Spark provides Dynamic Filtering, which improves the performance of join queries. Learn more. Via Support. Disabled
The Sparklens experimental open-source tool is available at http://sparklens.qubole.net. Learn more.
Proactive cleanup of shuffle block data allows faster downscaling of nodes. Learn more. Via Support. Disabled
Autoscaling is enabled by default for Qubole Spark clusters. The default value for the maximum number of autoscaling nodes has been increased from 2 to 10 for a new Spark cluster. Learn more.
Large Spark SQL commands are now supported in the API and on the Analyze page of the QDS UI. Learn more. Via Support. Disabled
Spark commands of sub-type scala, python, R, command line, and sql now support macros in a script file. Learn more. Via Support. Disabled

Note

Spark 2.3.2 will be supported in a patch following the R54 release.

New Features

Dynamic Filtering

SPAR-2572: Dynamic Filtering improves the performance of queries using JOIN. Via Support. Disabled
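
As a rough illustration of the general idea behind dynamic filtering (not Qubole's implementation, which these notes do not describe), the sketch below derives a filter from the small side of a join at run time and applies it to the large side before the join runs. The function name, the row layout, and the sample column names are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

Row = Dict[str, object]


def dynamic_filter_join(
    fact_rows: Iterable[Row],
    dim_rows: Iterable[Row],
    key: str,
) -> Tuple[List[Row], int]:
    """Inner-join two row sets on `key`, pruning the large side first.

    Returns the joined rows and the number of fact rows that survived
    the run-time filter (the rows the join actually had to process).
    """
    dim_rows = list(dim_rows)

    # 1. Derive the filter at run time from the small side of the join.
    allowed_keys = {d[key] for d in dim_rows}

    # 2. Apply it while "scanning" the large side, before the join runs.
    scanned = [f for f in fact_rows if f[key] in allowed_keys]

    # 3. Ordinary hash join over the reduced input.
    dim_by_key: Dict[object, List[Row]] = {}
    for d in dim_rows:
        dim_by_key.setdefault(d[key], []).append(d)
    joined = [{**f, **d} for f in scanned for d in dim_by_key[f[key]]]

    return joined, len(scanned)
```

When the small side is selective, most rows of the large side are dropped in step 2, so the join itself handles far less data.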

Metadata Caching

SPAR-2116: Parquet footer metadata caching in Spark improves query performance by reducing the time spent on reading Parquet footers from an object store. Via Support. Disabled
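
The benefit of footer caching can be pictured with a minimal sketch, assuming a cache keyed by file path and modification time. This is an illustration only: the class name, the key choice, and the `read_footer` callback are hypothetical and do not reflect how the Qubole feature is implemented or configured.

```python
from typing import Callable, Dict, Tuple


class FooterCache:
    """Cache parsed footers keyed by (path, modification time)."""

    def __init__(self, read_footer: Callable[[str], dict]) -> None:
        # read_footer stands in for the expensive object-store read.
        self._read_footer = read_footer
        self._cache: Dict[Tuple[str, float], dict] = {}
        self.remote_reads = 0

    def get(self, path: str, mtime: float) -> dict:
        """Return the footer for `path`, reading it remotely only on a miss."""
        cache_key = (path, mtime)
        if cache_key not in self._cache:
            # Miss: either a new file or a file rewritten since it was cached.
            self.remote_reads += 1
            self._cache[cache_key] = self._read_footer(path)
        return self._cache[cache_key]
```

Including the modification time in the key means a rewritten file is read again rather than served from a stale entry, while repeated queries over unchanged files skip the object-store round trip.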

Shuffle Data Cleanup

SPAR-2283: The Spark external shuffle service removes shuffle data after it goes out of scope, which allows nodes to be downscaled more aggressively. Via Support. Disabled

Cluster Autoscaling

SPAR-2658: Autoscaling is enabled by default for Spark clusters. The default value for the maximum number of autoscaling nodes has been increased from 2 to 10 for a new Spark cluster.

Support for HiveServer2

SPAR-2827: HiveServer2 is now supported with Spark 2.3.x. JDBC and ODBC clients can execute SQL and Hive queries over JDBC and ODBC protocols on Spark 2.3.x.

Support for Large SQL Commands

SPAR-2894: Spark SQL commands with large script files and large inline content are now supported. Via Support. Disabled

Support for Qubole Macros

SPAR-2653: Spark commands of sub-type scala, python, R, command line, and sql now support macros in a script file. Learn more. Via Support. Disabled
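
To show what macro support in a script file amounts to, here is a minimal sketch of placeholder expansion. It assumes placeholders are written as `$name$`; that syntax, the function name, and the sample macro are assumptions made for illustration, so check the Qubole macros documentation for the actual rules.

```python
import re
from typing import Mapping

# Assumed placeholder form: an identifier wrapped in dollar signs, e.g. $run_date$.
_PLACEHOLDER = re.compile(r"\$([A-Za-z_][A-Za-z0-9_]*)\$")


def expand_macros(script: str, macros: Mapping[str, str]) -> str:
    """Replace every $name$ placeholder in `script` with its macro value."""

    def substitute(match: "re.Match[str]") -> str:
        name = match.group(1)
        if name not in macros:
            # Fail loudly rather than run a script with an unexpanded macro.
            raise KeyError(f"undefined macro: {name}")
        return macros[name]

    return _PLACEHOLDER.sub(substitute, script)
```

For example, a script containing `WHERE dt = '$run_date$'` expanded with `{"run_date": "2018-11-01"}` would contain `WHERE dt = '2018-11-01'`.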

Sparklens

SPAR-2812: The Sparklens experimental open-source tool is available at http://sparklens.qubole.net. You can use this tool with any Spark application to identify optimization opportunities such as heavy driver-side computation, lack of parallelism, and skew. Learn more.

Improvements

SPAR-2500: Optimizes INSERT OVERWRITE into dynamic partitions in Hive tables via Spark direct writes. Spark writes files directly to the final destination instead of writing to a temporary staging directory, which improves performance. Supported on Spark 2.2.x and 2.3.x. Via Support. Disabled
SPAR-3042 and SPAR-3060: If the cluster uses a custom package, the package is identified in the Custom Spark Package field under the Configuration tab of the Edit Cluster Settings page. You can remove a custom package and choose a mainline Spark version instead; the default is 2.3.

Deprecations

SPAR-2975: The following Spark versions are deprecated: 1.5.1, 1.6.0, 1.6.1, 2.0.0, and 2.1.0. QDS continues to support Spark 1.6.2 and the latest maintenance version of each Spark 2.x minor version. See the Supported Versions page. Spark 2.3-latest is now the default Spark version.

Bug Fixes

SPAR-2127: Spark commands using a query path are now supported.
SPAR-2866: The legacy Hadoop aws-sdk JAR conflicted with the Spark aws-sdk JAR. The legacy JAR has been removed.