Spark
In this release, the Apache Spark’s latest version 2.3.1 is included. Spark 2.3 provides various features including vectorization of PySpark performance improvements, Structured Streaming enhancements, and ORC read performance improvements.
In addition, Qubole provides the following features with respect to usability, performance, and integrations:
Disallow creation of Spark clusters with low memory instances
Integration of DDL commands with Snowflake through Spark
New Features
Spark 2.3.1
Qubole Spark 2.3.1 is the first release of Apache Spark 2.3.x on Qubole. It is displayed as 2.3 latest (2.3.1)
in the Spark Version field of the Create New Cluster page on QDS UI.
Apache Spark 2.3 has the following major features and enhancements:
Vectorization of PySpark UDFs significantly improves execution performance of PySpark.
Structured Streaming
Stream-stream joins add the ability to join streaming data from multiple streaming sources.
Continuous processing provides millisecond latency stream processing on certain Spark functions.
Vectorized ORC Reader is a new ORC reader with improved performance of ORC file read.
For more information about all the features and enhancements of Apache Spark 2.3, see:
Qubole Spark 2.3 Enhancements
Apart from the Apache Spark 2.3 enhancements, Qubole Spark 2.3 has the following enhancements:
SPAR-2527: Integration with newer autoscaling APIs introduced in Apache Spark 2.3.
SPAR-2603: Refactored the Qubole idle timeout functionality which shuts down Spark applications that are idle for more than the idle timeout minutes. The default value is 60 minutes.
Known Issues and Limitations in Qubole Spark 2.3.1
SPAR-2827: HiveServer2(HS2) fails to come up with Spark 2.3.1 due to a known issue in Open Source Spark (OSS). For details, see https://issues.apache.org/jira/browse/SPARK-22918.
Bug Fixes
SPAR-2678: Downscaling did not occur when the number of executors is equal to the maximum capacity.
SPAR-2726: Snowflake writes on Spark 2.1 clusters failed because of problems with the latest Snowflake JARs.
SPAR-2445: When QDS autoscaled Spark executors, running executors were not counted. As a result, executors were autoscaled too aggressively (faster than the ETA provided by
spark.qubole.autoscaling.stagetime
).SPAR-2905: Issues related to private IP addresses in Spark that caused communication between driver and executors to fail.
Enhancements
SPAR-2525: Qubole Spark’s listingV2 feature considerably reduces overall query time. Beta, Via Support, Disabled
SPAR-2128: QDS provides an option to prevent creation of Spark clusters with low-memory instances (memory < 8 GB). When this option is turned on, an update to an existing cluster that uses low-memory instances will fail. Beta, Via Support, Disabled
SPAR-2527: Provides integration with newer autoscaling APIs introduced in Apache Spark 2.3.
SPAR-2603: Changes to the QDS idle timeout, which shuts down Spark applications that are idle beyond the idle-timeout limit. The default value is 60 minutes.