Spark

In this release, the Apache Spark’s latest version 2.3.1 is included. Spark 2.3 provides various features including vectorization of PySpark performance improvements, Structured Streaming enhancements, and ORC read performance improvements.

In addition, Qubole provides the following features with respect to usability, performance, and integrations:

  • Disallow creation of Spark clusters with low memory instances

  • Integration of DDL commands with Snowflake through Spark

New Features

Spark 2.3.1

Qubole Spark 2.3.1 is the first release of Apache Spark 2.3.x on Qubole. It is displayed as 2.3 latest (2.3.1) in the Spark Version field of the Create New Cluster page on QDS UI.

Apache Spark 2.3 has the following major features and enhancements:

  • Vectorization of PySpark UDFs significantly improves execution performance of PySpark.

  • Structured Streaming

    • Stream-stream joins add the ability to join streaming data from multiple streaming sources.

    • Continuous processing provides millisecond latency stream processing on certain Spark functions.

  • Vectorized ORC Reader is a new ORC reader with improved performance of ORC file read.

For more information about all the features and enhancements of Apache Spark 2.3, see:

Qubole Spark 2.3 Enhancements

Apart from the Apache Spark 2.3 enhancements, Qubole Spark 2.3 has the following enhancements:

  • SPAR-2527: Integration with newer autoscaling APIs introduced in Apache Spark 2.3.

  • SPAR-2603: Refactored the Qubole idle timeout functionality which shuts down Spark applications that are idle for more than the idle timeout minutes. The default value is 60 minutes.

Known Issues and Limitations in Qubole Spark 2.3.1

Bug Fixes

  • SPAR-2678: Downscaling did not occur when the number of executors is equal to the maximum capacity.

  • SPAR-2726: Snowflake writes on Spark 2.1 clusters failed because of problems with the latest Snowflake JARs.

  • SPAR-2445: When QDS autoscaled Spark executors, running executors were not counted. As a result, executors were autoscaled too aggressively (faster than the ETA provided by spark.qubole.autoscaling.stagetime).

  • SPAR-2905: Issues related to private IP addresses in Spark that caused communication between driver and executors to fail.

Enhancements

  • SPAR-2525: Qubole Spark’s listingV2 feature considerably reduces overall query time. Beta, Via Support, Disabled

  • SPAR-2128: QDS provides an option to prevent creation of Spark clusters with low-memory instances (memory < 8 GB). When this option is turned on, an update to an existing cluster that uses low-memory instances will fail. Beta, Via Support, Disabled

  • SPAR-2527: Provides integration with newer autoscaling APIs introduced in Apache Spark 2.3.

  • SPAR-2603: Changes to the QDS idle timeout, which shuts down Spark applications that are idle beyond the idle-timeout limit. The default value is 60 minutes.