Spark
Spark provides Dynamic Filtering, which improves the performance of join queries. Learn more. Via Support. Disabled
The Sparklens experimental open-source tool is available at http://sparklens.qubole.net. Learn more.
Proactive cleanup of shuffle block data allows faster downscaling of nodes. Learn more. Via Support. Disabled
Autoscaling is enabled by default for Qubole Spark clusters. The default value for the maximum number of autoscaling nodes has been increased from 2 to 10 for a new Spark cluster. Learn more.
Large Spark SQL commands are now supported in the API and on the Analyze page of the QDS UI. Learn more. Via Support. Disabled
Spark commands of sub-type scala, python, R, command line, and sql now support macros in a script file. Learn more. Via Support. Disabled
Note
Spark 2.3.2 will be supported in a patch following the R54 release.
New Features
Dynamic Filtering
SPAR-2572: Dynamic Filtering improves the performance of queries using JOIN. Via Support. Disabled
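The idea behind dynamic filtering can be sketched in plain Python. This is only a conceptual illustration, not Qubole's implementation: the function name `join_with_dynamic_filter` and the sample tables are hypothetical. It assumes the usual approach, in which the join keys that survive a selective predicate on the small side of the join are collected at run time and used to prune the large side before the join is performed.

```python
from typing import Callable, List, Tuple

FactRow = Tuple[int, float]   # (join_key, measure)
DimRow = Tuple[int, str]      # (join_key, attribute)


def join_with_dynamic_filter(
    fact_rows: List[FactRow],
    dim_rows: List[DimRow],
    dim_predicate: Callable[[str], bool],
) -> Tuple[List[Tuple[int, float, str]], int]:
    """Inner-join fact and dimension rows, pruning the fact side first.

    Returns the joined rows and the number of fact rows that survived
    the dynamic filter (hypothetical helper, for illustration only).
    """
    # Build side: apply the selective predicate to the small table.
    dim_by_key = {key: attr for key, attr in dim_rows if dim_predicate(attr)}

    # The "dynamic filter" is the set of join keys known only at run time.
    key_filter = set(dim_by_key)

    # Probe side: drop fact rows whose key cannot match, before joining.
    pruned = [(key, value) for key, value in fact_rows if key in key_filter]

    joined = [(key, value, dim_by_key[key]) for key, value in pruned]
    return joined, len(pruned)


fact = [(1, 10.0), (2, 20.0), (1, 5.0), (3, 7.5)]
dim = [(1, "US"), (2, "IN"), (3, "US")]
joined, rows_after_filter = join_with_dynamic_filter(fact, dim, lambda c: c == "IN")
print(joined, rows_after_filter)
```

In this toy example only one of the four fact rows reaches the join. In a real engine the benefit would come from skipping the corresponding reads and shuffle work on the large table.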
Metadata Caching
SPAR-2116: Parquet footer metadata caching in Spark improves query performance by reducing the time spent on reading Parquet footers from an object store. Via Support. Disabled
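As a rough illustration of why footer caching helps, the sketch below memoizes footer reads so that repeated queries over the same file do not go back to the object store. It is a conceptual sketch only: the `FooterCache` class, the `slow_read` stand-in, and the choice to key the cache on path plus modification time are assumptions made for this example, not details of the Qubole feature.

```python
from typing import Callable, Dict, Tuple


class FooterCache:
    """Memoizes footer reads (hypothetical class, for illustration only)."""

    def __init__(self, read_footer: Callable[[str], dict]) -> None:
        self._read_footer = read_footer
        self._cache: Dict[Tuple[str, int], dict] = {}
        self.misses = 0

    def get(self, path: str, modified_at: int) -> dict:
        # Assumption: a changed modification time means the file was
        # rewritten, so the old footer must not be reused.
        key = (path, modified_at)
        if key not in self._cache:
            self.misses += 1
            self._cache[key] = self._read_footer(path)
        return self._cache[key]


def slow_read(path: str) -> dict:
    # Stand-in for an expensive object-store request.
    return {"path": path, "num_rows": 1000}


cache = FooterCache(slow_read)
first = cache.get("s3://bucket/table/part-0.parquet", 100)
second = cache.get("s3://bucket/table/part-0.parquet", 100)   # served from cache
third = cache.get("s3://bucket/table/part-0.parquet", 200)    # file changed, re-read
print(cache.misses)
```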
Shuffle Data Cleanup
SPAR-2283: The Spark external shuffle service removes shuffle data after it goes out of scope to improve aggressive downscaling. Via Support. Disabled
Cluster Autoscaling
SPAR-2658: Autoscaling is enabled by default for Spark clusters. The default value for the maximum number of autoscaling nodes has been increased from 2 to 10 for a new Spark cluster.
Support for HiveServer2
SPAR-2827: HiveServer2 is now supported with Spark 2.3.x. JDBC and ODBC clients can execute SQL and Hive queries over JDBC and ODBC protocols on Spark 2.3.x.
Support for Large SQL Commands
SPAR-2894: Spark SQL commands with large script files and large inline content are now supported. Via Support. Disabled
Support for Qubole Macros
SPAR-2653: Spark commands of sub-type scala, python, R, command line, and sql now support macros in a script file. Learn more. Via Support. Disabled
Sparklens
SPAR-2812: The Sparklens experimental open-source tool is available at http://sparklens.qubole.net. You can use this tool with any Spark application to identify opportunities for optimization with respect to driver-side computations, lack of parallelism, skew, etc. Learn more.
Improvements
SPAR-2500: Optimizes INSERT OVERWRITE into dynamic partitions in Hive tables via Spark direct writes. Spark writes files directly to the final destination instead of writing to a temporary staging directory, which improves performance. Supported on Spark 2.2.x and 2.3.x. Via Support. Disabled
SPAR-3042 and SPAR-3060: If the cluster uses a custom package, the package is identified in the Custom Spark Package field under the Configuration tab of the Edit Cluster Settings page. You can remove a custom package and choose a mainline Spark version instead; the default is 2.3.
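For readers unfamiliar with the statement shape that SPAR-2500 targets, the following sketch builds an INSERT OVERWRITE into dynamic partitions as a string. The helper `build_dynamic_partition_overwrite` and the table and column names are hypothetical; the only point it illustrates is that the partition columns are named without values in the PARTITION clause and appear last in the SELECT list, so the partition for each row is decided at run time.

```python
from typing import Sequence


def build_dynamic_partition_overwrite(
    target: str,
    source: str,
    data_cols: Sequence[str],
    partition_cols: Sequence[str],
) -> str:
    """Build an INSERT OVERWRITE statement with dynamic partitions.

    Hypothetical helper for illustration; it does no identifier quoting.
    """
    if not partition_cols:
        raise ValueError("at least one partition column is required")

    # Dynamic partition columns carry no literal values here.
    partition_spec = ", ".join(partition_cols)

    # Partition columns must come last in the SELECT list.
    select_list = ", ".join([*data_cols, *partition_cols])

    return (
        f"INSERT OVERWRITE TABLE {target} PARTITION ({partition_spec}) "
        f"SELECT {select_list} FROM {source}"
    )


statement = build_dynamic_partition_overwrite(
    "sales", "staging_sales", ["order_id", "amount"], ["dt", "country"]
)
print(statement)
```

The resulting string could then be passed to `spark.sql(...)` on a cluster where the tables exist.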
Deprecations
SPAR-2975: The following Spark versions are deprecated: 1.5.1, 1.6.0, 1.6.1, 2.0.0, and 2.1.0. QDS continues to support Spark 1.6.2, and the latest maintenance versions of each minor version of Spark 2.x. See the Supported Versions page. Spark 2.3-latest is now the default Spark version.
Bug Fixes
SPAR-2127: Spark commands using a query path are now supported.
SPAR-2866: The legacy Hadoop aws-sdk jar was causing conflicts with the Spark aws-sdk jar. The legacy JAR has been removed.