Spark¶
- Spark provides Dynamic Filtering to improve the performance of join queries. Learn more. Via Support. Disabled
- The Sparklens experimental open-source tool is available on http://sparklens.qubole.net. Learn more
- Proactive cleanup of shuffle block data allows faster downscaling of nodes. Learn more. Via Support. Disabled
- Autoscaling is enabled by default for Qubole Spark clusters. The default value for the maximum number of autoscaling nodes has been increased from 2 to 10 for a new Spark cluster. Learn more.
- Large Spark SQL commands are now supported in the API and on the Analyze page of the QDS UI. Learn more. Via Support. Disabled
- Spark commands of sub-type scala, python, R, command line, and sql now support macros in a script file. Learn more. Via Support. Disabled
Note
Spark 2.3.2 will be supported in a patch following the R54 release.
New Features¶
Dynamic Filtering¶
SPAR-2572: Dynamic Filtering improves the performance of queries using JOIN. Via Support. Disabled
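To illustrate the general idea behind dynamic filtering, the following is a minimal plain-Python sketch, not Qubole's or Spark's implementation. It assumes the common form of the technique: the join keys that survive a filter on the smaller table are collected at run time and used to prune the larger table before the join. The function name `dynamic_filter_join` and the row layout are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Row = Dict[str, object]


def dynamic_filter_join(
    fact_rows: List[Row],
    dim_rows: List[Row],
    key: str,
    dim_predicate: Callable[[Row], bool],
) -> Tuple[List[Row], int]:
    """Join fact_rows with the dim_rows that satisfy dim_predicate.

    Returns the joined rows and the number of fact rows that were
    kept after pruning (hypothetical helper, for illustration only).
    """
    # Step 1: filter the small (dimension) side first.
    selected = [row for row in dim_rows if dim_predicate(row)]

    # Step 2: the surviving join keys form the dynamic filter.
    allowed_keys = {row[key] for row in selected}

    # Step 3: prune the large (fact) side with that filter, so rows
    # that cannot match never reach the join.
    pruned = [row for row in fact_rows if row[key] in allowed_keys]

    # Step 4: ordinary hash join on the reduced inputs.
    dim_by_key: Dict[object, List[Row]] = {}
    for row in selected:
        dim_by_key.setdefault(row[key], []).append(row)
    joined = [{**fact, **dim} for fact in pruned for dim in dim_by_key[fact[key]]]
    return joined, len(pruned)
```

In a real engine the benefit comes from step 3 being pushed down to the scan of the large table, so less data is read and shuffled; here it only reduces the list passed to the join.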
Metadata Caching¶
SPAR-2116: Parquet footer metadata caching in Spark improves query performance by reducing the time spent on reading Parquet footers from an object store. Via Support. Disabled
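The effect of footer caching can be sketched conceptually as memoizing footer reads by file path. The example below is a plain-Python illustration under stated assumptions, not the QDS implementation: `FooterStore` is a hypothetical stand-in for an object store that counts how many footer reads reach it, and `CachingFooterReader` is a hypothetical wrapper built on the standard library's `functools.lru_cache`.

```python
from functools import lru_cache
from typing import Dict


class FooterStore:
    """Hypothetical stand-in for an object store holding Parquet footers."""

    def __init__(self, footers: Dict[str, dict]) -> None:
        self._footers = dict(footers)
        self.reads = 0  # number of (simulated) remote footer reads

    def read_footer(self, path: str) -> dict:
        self.reads += 1  # each call represents one round trip to the store
        return self._footers[path]


class CachingFooterReader:
    """Serves repeated footer requests for the same path from memory."""

    def __init__(self, store: FooterStore, max_entries: int = 1024) -> None:
        # lru_cache keeps at most max_entries footers, evicting the
        # least recently used entry when the cache is full.
        self._cached_read = lru_cache(maxsize=max_entries)(store.read_footer)

    def footer(self, path: str) -> dict:
        return self._cached_read(path)
```

With this wrapper, a second query that touches the same files reads their footers from memory instead of going back to the store. A production cache would also need to invalidate entries when a file is rewritten, which this sketch omits.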
Shuffle Data Cleanup¶
SPAR-2283: The Spark external shuffle service removes shuffle data after it goes out of scope, which allows nodes to be downscaled more aggressively. Via Support. Disabled
Cluster Autoscaling¶
SPAR-2658: Autoscaling is enabled by default for Spark clusters. The default value for the maximum number of autoscaling nodes has been increased from 2 to 10 for a new Spark cluster.
Support for HiveServer2¶
SPAR-2827: HiveServer2 is now supported with Spark 2.3.x. JDBC and ODBC clients can execute SQL and Hive queries over JDBC and ODBC protocols on Spark 2.3.x.
Support for Large SQL Commands¶
SPAR-2894: Spark SQL commands with large script files and large inline content are now supported. Via Support. Disabled
Support for Qubole Macros¶
SPAR-2653: Spark commands of sub-type scala, python, R, command line, and sql now support macros in a script file. Learn more. Via Support. Disabled
Sparklens¶
- SPAR-2812: The Sparklens experimental open-source tool is available at http://sparklens.qubole.net. You can use this tool with any Spark application to identify opportunities for optimizations with respect to driver-side computations, lack of parallelism, skew, etc. Learn more
Improvements¶
- SPAR-2500: INSERT OVERWRITE into dynamic partitions of Hive tables is now optimized through Spark direct writes. Spark writes files directly to the final destination instead of writing to a temporary staging directory, which improves performance. Supported on Spark 2.2.x and 2.3.x. Via Support. Disabled
- SPAR-2500: The PyArrow package is now installed on Spark clusters to support Pandas UDFs.
- SPAR-3042 and SPAR-3060: If the cluster uses a custom package, the package is identified in the Custom Spark Package field under the Configuration tab of the Edit Cluster Settings page. You can remove a custom package and choose a mainline Spark version instead; the default is 2.3.
- SPAR-3118: For Spark clusters using Rubix caching, QDS has added a Hadoop override to the Recommended Configuration (shown on the Advanced Configuration tab of the Clusters page in the UI): hadoop.cache.data.fullness.percentage 50. This sets the disk usage limit to 50%.
Deprecations¶
- SPAR-2975: The following Spark versions are deprecated: 1.5.1, 1.6.0, 1.6.1, 2.0.0, and 2.1.0. QDS continues to support Spark 1.6.2, and the latest maintenance versions of each minor version of Spark 2.x. See the Supported Versions page. Spark 2.3-latest is now the default Spark version.
Bug Fixes¶
- SPAR-2127: Spark commands using a query path are now supported.
- SPAR-2866: The legacy Hadoop aws-sdk jar was causing conflicts with the Spark aws-sdk jar. The legacy JAR has been removed.