Spark

In this release, Qubole Spark provides various new features and enhancements for performance and autoscaling.

Some of the key features are listed below:

  • Integration with RubiX distributed file cache to improve file read performance.
  • Dynamic filtering to improve join performance.
  • Metadata caching to improve query execution performance.
  • Proactive cleanup of shuffle block data to enable faster downscaling of nodes.
  • Better out-of-the-box cluster autoscaling defaults.
  • HS2 support to allow SQL access via JDBC/ODBC.
  • Large Spark SQL commands are now supported through the API and the Analyze page.
  • Qubole Spark command subtypes are now supported with script files containing macros.

New Features

RubiX Cache Integration

SPAR-1896: Qubole Spark now supports the RubiX caching system for s3a/s3n file systems. This feature provides faster read performance in Spark for frequently accessed files. This feature is supported on Spark 2.2.0, 2.2.1, and 2.3.1 versions. Via Support Disabled

SPAR-2763: You can enable RubiX for Spark at a cluster level in the Advanced Settings of the Spark Clusters UI. This feature is supported on Spark 2.2.0, 2.2.1, and 2.3.1 versions. Via Support Disabled

RubiX is Qubole’s file caching layer. Learn more about RubiX.

Dynamic Filtering

SPAR-2572: Qubole Spark provides dynamic filtering, a join optimization that infers predicates from the small table in a join and applies them to the large table at runtime, improving join query performance. This feature is supported on Spark 2.3.1 and later versions. Via Support Disabled
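The example below is a minimal sketch of the kind of query this optimization targets: a large fact table joined to a small dimension table that carries a selective predicate. The table and column names are hypothetical; the optimization itself is applied by the engine once the feature is enabled.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dynamic-filtering-example").getOrCreate()

# Hypothetical tables: `sales` is a large fact table, `stores` is a small dimension table.
# With dynamic filtering enabled, the selective predicate on the small side
# (st.region = 'EU') can be inferred and applied to the scan of the large side,
# so fewer `sales` rows are read and shuffled for the join.
result = spark.sql("""
    SELECT s.store_id, SUM(s.amount) AS total_amount
    FROM sales s
    JOIN stores st ON s.store_id = st.store_id
    WHERE st.region = 'EU'
    GROUP BY s.store_id
""")
result.show()
```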

Metadata Caching

SPAR-2116: Parquet footer metadata caching in Spark improves query performance by reducing the time spent on reading Parquet footers from an object store. This feature is supported on Spark 2.3.1 and later versions. Via Support Disabled
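As a rough illustration, footer caching helps workloads that repeatedly plan and scan the same Parquet files on an object store, as in the hypothetical sketch below (the bucket path and filters are placeholders); the caching itself is transparent once enabled.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-footer-cache-example").getOrCreate()

# Hypothetical S3 location containing many Parquet files. Planning a scan requires
# reading each file's footer (schema and row-group metadata) from the object store.
events = spark.read.parquet("s3a://my-bucket/warehouse/events/")

# Repeated queries over the same files normally re-read those footers; with footer
# caching enabled, the cached metadata is reused and query planning is faster.
events.filter("event_type = 'click'").count()
events.filter("event_type = 'view'").count()
```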

Shuffle Data Cleanup

SPAR-2283: The Spark external shuffle service now removes shuffle data after it goes out of scope, enabling more aggressive downscaling of nodes. Via Support Disabled

Cluster Autoscaling

SPAR-2658: Cluster autoscaling on Qubole Spark clusters is enabled by default. The default value of max-nodes is increased from 2 to 10 when creating a new Spark cluster.

Support for HiveServer2

SPAR-2827: HiveServer2 is now supported with Spark 2.3.x. JDBC and ODBC clients can execute SQL and Hive queries against Spark 2.3.x clusters over the JDBC and ODBC protocols.
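A minimal sketch of a client session is shown below, using PyHive, which speaks the HiveServer2 Thrift protocol that the JDBC and ODBC drivers also use. The host, port, user, and table names are placeholders for your cluster's HS2 endpoint.

```python
from pyhive import hive  # assumes the PyHive client library is installed

# Hypothetical HiveServer2 endpoint exposed by the Spark 2.3.x cluster.
conn = hive.Connection(host="spark-master.example.com", port=10000,
                       username="qubole_user", database="default")
cursor = conn.cursor()
cursor.execute("SELECT page, COUNT(*) AS hits FROM web_logs GROUP BY page LIMIT 10")
for row in cursor.fetchall():
    print(row)
cursor.close()
conn.close()
```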

Support for Large SQL Commands

SPAR-2894: Spark SQL commands with large script files and large inline content are now supported. This feature is supported on Spark 2.1.1, 2.2.0, 2.2.1, and 2.3.1 versions. Via Support Disabled

Support for Qubole Macros

SPAR-2653: Qubole macro substitution is now supported for all Spark command subtypes when a script file is provided. Via Support Disabled

Sparklens

  • SPAR-2812: The Sparklens tool is now available via the experimental open service at http://sparklens.qubole.net. You can use this tool with any Spark application to identify potential optimization opportunities related to driver-side computations, lack of parallelism, skew, and so on; a usage sketch follows. Learn more
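The sketch below shows one way to attach the Sparklens listener to a PySpark application so that it reports on driver time, parallelism, and skew when the application finishes; the package coordinate and version shown are assumptions and may need adjusting for your environment.

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("sparklens-example")
         # Pull the Sparklens package and register its listener before the context starts.
         .config("spark.jars.packages", "qubole:sparklens:0.3.2-s_2.11")
         .config("spark.extraListeners", "com.qubole.sparklens.QuboleJobListener")
         .getOrCreate())

# Run the normal workload; Sparklens prints its analysis when the application ends.
spark.range(0, 10000000).selectExpr("id % 100 AS k").groupBy("k").count().show()
spark.stop()
```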

Enhancements

  • SPAR-2683: The file listing optimization feature is enhanced to improve the performance of commands that compute and update table statistics. This feature is supported on Spark 2.3.1 and later versions. Via Support Disabled
  • SPAR-2500: INSERT OVERWRITE into dynamic partitions of Hive tables is now optimized via Spark direct writes: Spark writes files directly to the final destination instead of a temporary staging directory, which improves performance (see the sketch after this list). This feature is supported on Spark 2.2.x and 2.3.x versions. Via Support Disabled
  • SPAR-3042 and SPAR-3060: The Custom Spark package field is now displayed on the Edit Cluster Settings page if the package is available for the cluster. This field helps you identify whether the cluster has a custom Spark package and whether any changes to the package are required when debugging issues. If you are using a custom package, you can remove it from the Spark cluster UI and select the required Spark version. By default, Spark 2.3 is selected.
  • SPAR-1632 and SPAR-3145: In yarn-client mode, the Spark driver uses the local directories configured in spark.local.dir, which is set to /tmp by default. Now, in Spark 2.3.1, spark.local.dir is set to a folder on an ephemeral volume to avoid failures caused by the root volume filling up. This feature is supported on Spark 2.1.1, 2.2.0, 2.2.1, and 2.3.1 versions. Via Support Disabled
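The sketch below shows the kind of INSERT OVERWRITE into dynamic partitions that benefits from direct writes (SPAR-2500). The table names and partition column are hypothetical; with the feature enabled, partition files go straight to their final location rather than through a staging directory.

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("dynamic-partition-overwrite-example")
         .enableHiveSupport()
         .getOrCreate())

# Allow fully dynamic partition values in the INSERT below.
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")

# Hypothetical tables: `staging_sales` feeds the partitioned Hive table `sales_by_day`.
spark.sql("""
    INSERT OVERWRITE TABLE sales_by_day PARTITION (ds)
    SELECT store_id, amount, ds
    FROM staging_sales
    WHERE ds >= '2018-11-01'
""")
```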

Deprecations

  • SPAR-2975: Qubole deprecates support for the following Spark versions: 1.5.1, 1.6.0, 1.6.1, 2.0.0, and 2.1.0. Qubole will continue to support Spark 1.6.2 and the latest maintenance version of each minor version in Spark 2.x in this release. See version support documentation. Spark 2.3-latest is now the default Spark version.

Bug Fixes

  • SPAR-2686: Extended support for the offline Spark History Server to additional AWS regions (eu-central-1, us-east-2). You can now open the Spark UI links for completed applications from all S3 bucket regions. This issue is fixed in Spark 2.0.2, 2.1.1, 2.2.0, and 2.2.1 versions.
  • SPAR-2866: A legacy Hadoop aws-sdk JAR in the AMI conflicted with the Spark aws-sdk JAR. This legacy JAR is now removed from the AMI. This issue is fixed in Spark 2.1.1, 2.2.0, 2.2.1, and 2.3.1 versions.
  • SPAR-2662: Fixed the issue that caused the Spark UI to fail to open when the application threw an application_xxxxxxxxxxxxx_xxxx not found exception. This issue is fixed in Spark 2.0.2, 2.1.1, and 2.2.1 versions.
  • SPAR-3072: Different results were displayed when a distinct count was run on a DataFrame that was created using a window function. This issue is fixed.

Enhancements After 13th November 2018

Spark 2.4

SPAR-3151: Qubole now supports the latest Apache Spark 2.4.0 version. It is displayed as 2.4 latest (2.4.0) in the Spark Version field of the Create New Cluster page in the QDS UI.

Spark 2.4 provides various major features and performance enhancements. Some of the key enhancements are listed below:

  • Barrier execution mode: New task scheduling model to start all tasks at the same time and restart all tasks if any of the tasks fails. This enables fault tolerance for deep learning frameworks.
  • Flexible streaming sinks: New user API in Structured Streaming (foreachBatch) to reuse batch sinks. This allows the use of batch connectors for streaming data when a streaming connector is not available (see the sketch after this list).
  • Multiple streaming sinks: Ability to write output rows of a micro-batch to multiple sinks within a single streaming query.
  • Pandas UDFs: Support for aggregate and window functions in Pandas UDFs.
  • Avro data source: Support for reading Avro data in Spark SQL and Structured Streaming.
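As a minimal sketch of the new foreachBatch API, the example below reuses a batch connector (JDBC) as the sink of a streaming query; the source path, JDBC URL, credentials, and table names are placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("foreachbatch-example").getOrCreate()

# Hypothetical streaming source: JSON files arriving under an S3 prefix.
stream_df = (spark.readStream
             .schema("id LONG, amount DOUBLE")
             .json("s3a://my-bucket/incoming/"))

def write_to_jdbc(batch_df, batch_id):
    # Reuse a batch connector that has no streaming equivalent.
    (batch_df.write
        .format("jdbc")
        .option("url", "jdbc:postgresql://db.example.com:5432/analytics")
        .option("dbtable", "orders")
        .option("user", "writer")
        .option("password", "secret")
        .mode("append")
        .save())

query = (stream_df.writeStream
         .foreachBatch(write_to_jdbc)   # new in Spark 2.4
         .option("checkpointLocation", "s3a://my-bucket/checkpoints/orders/")
         .start())
query.awaitTermination()
```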

For more information about all the features and enhancements of Apache Spark 2.4, see https://spark.apache.org/releases/spark-release-2-4-0.html.

Qubole Spark 2.4 also includes the new Spark features added in the R54 release.