Spark

Note

Spark 2.3.2 will be supported in a patch following the R54 release.

New Features

Dynamic Filtering

SPAR-2572: Dynamic Filtering improves the performance of JOIN queries. Available via Support. Disabled by default.

Metadata Caching

SPAR-2116: Parquet footer metadata caching in Spark improves query performance by reducing the time spent reading Parquet footers from an object store. Available via Support. Disabled by default.

Shuffle Data Cleanup

SPAR-2283: The Spark external shuffle service removes shuffle data once it goes out of scope, which allows more aggressive downscaling. Available via Support. Disabled by default.

Cluster Autoscaling

SPAR-2658: Autoscaling is enabled by default for Spark clusters. The default value for the maximum number of autoscaling nodes has been increased from 2 to 10 for a new Spark cluster.

Support for HiveServer2

SPAR-2827: HiveServer2 is now supported with Spark 2.3.x. JDBC and ODBC clients can run SQL and Hive queries on Spark 2.3.x.
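As a rough illustration of what a JDBC client needs, the sketch below builds a connection URL in the standard jdbc:hive2:// form used by HiveServer2 drivers. The host name, port 10000, and database name are placeholder assumptions, not values from these release notes; substitute the endpoint of your own cluster.

```python
def hiveserver2_jdbc_url(host, port=10000, database="default"):
    """Return a HiveServer2-style JDBC URL: jdbc:hive2://<host>:<port>/<database>."""
    if not host:
        raise ValueError("host must be a non-empty string")
    return f"jdbc:hive2://{host}:{port}/{database}"


# Hypothetical endpoint, for illustration only.
url = hiveserver2_jdbc_url("spark-master.example.com")
print(url)
```

The resulting string is what a JDBC client would typically be given as its connection URL, together with whatever credentials the cluster requires.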

Support for Large SQL Commands

SPAR-2894: Spark SQL commands with large script files and large inline content are now supported. Available via Support. Disabled by default.

Support for Qubole Macros

SPAR-2653: Spark commands of the Scala, Python, R, command-line, and SQL sub-types now support macros in a script file. Available via Support. Disabled by default.

Sparklens

  • SPAR-2812: Sparklens, an experimental open-source tool, is available at http://sparklens.qubole.net. You can use it with any Spark application to identify optimization opportunities such as heavy driver-side computation, lack of parallelism, and skew.

Improvements

  • SPAR-2500: Optimizes INSERT OVERWRITE into dynamic partitions of Hive tables by using Spark direct writes. Spark writes files directly to the final destination instead of to a temporary staging directory, which improves performance. Supported on Spark 2.2.x and 2.3.x. Available via Support. Disabled by default.
  • SPAR-3042 and SPAR-3060: If the cluster uses a custom package, the package is identified in the Custom Spark Package field under the Configuration tab of the Edit Cluster Settings page. You can remove a custom package and choose a mainline Spark version instead; the default is 2.3.
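
To make the SPAR-2500 item concrete, here is a minimal sketch of the kind of statement it targets: an INSERT OVERWRITE whose partition values come from the query instead of being listed literally. The table and column names are hypothetical, and the helper only builds the SQL text; with a live session it would typically be passed to spark.sql. Hive-style dynamic partitioning generally also requires hive.exec.dynamic.partition.mode=nonstrict when no partition value is static.

```python
def insert_overwrite_dynamic(target, source, columns, partition_cols):
    """Build an INSERT OVERWRITE statement that uses dynamic partitions.

    In a dynamic-partition insert, the partition columns must be the
    last columns of the SELECT list, in the order given in PARTITION (...).
    """
    if not partition_cols:
        raise ValueError("at least one partition column is required")
    select_list = ", ".join(list(columns) + list(partition_cols))
    partition_spec = ", ".join(partition_cols)
    return (
        f"INSERT OVERWRITE TABLE {target} "
        f"PARTITION ({partition_spec}) "
        f"SELECT {select_list} FROM {source}"
    )


# Hypothetical tables: staging_sales feeds sales_by_day, partitioned by ds.
stmt = insert_overwrite_dynamic(
    target="sales_by_day",
    source="staging_sales",
    columns=["order_id", "amount"],
    partition_cols=["ds"],
)
print(stmt)
```

Each distinct ds value in the source rows becomes its own partition in the target table. The optimization described in SPAR-2500 concerns where those partition files are written, not the shape of the statement.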

Deprecations

  • SPAR-2975: The following Spark versions are deprecated: 1.5.1, 1.6.0, 1.6.1, 2.0.0, and 2.1.0. QDS continues to support Spark 1.6.2 and the latest maintenance version of each Spark 2.x minor version. See the Supported Versions page. Spark 2.3-latest is now the default Spark version.

Bug Fixes

  • SPAR-2127: Spark commands using a query path are now supported.
  • SPAR-2866: The legacy Hadoop aws-sdk JAR conflicted with the Spark aws-sdk JAR. The legacy JAR has been removed.