Spark

In this release, the Apache Spark’s latest version 2.3.1 is included. Spark 2.3 provides various features including vectorization of PySpark performance improvements, Structured Streaming enhancements, and ORC read performance improvements.

In addition, Qubole provides the following features with respect to usability, performance, and integrations:

Improvement in S3 listing performance
Disallow creation of Spark clusters with low memory instances
Integration of DDL commands with Snowflake through Spark
Rolling and aggregation of Spark executor and driver logs into remote S3

New Features

Rolling and Aggregation of Spark Logs

SPAR-2347: Qubole supports rolling and aggregation of Spark driver and executor logs. The logs for running applications are rolled and aggregated periodically into the remote storage S3, to prevent hard disk space issues. This is especially useful in long running workloads such as stream processing jobs. Beta, Via Support, Disabled

Spark 2.3.1

Qubole Spark 2.3.1 is the first release of Apache Spark 2.3.x on Qubole. It is displayed as 2.3 latest (2.3.1) in the Spark Version field of the Create New Cluster page on QDS UI.

Apache Spark 2.3 has the following major features and enhancements:

Vectorization of PySpark UDFs significantly improves execution performance of PySpark.
Structured Streaming
- Stream-stream joins add the ability to join streaming data from multiple streaming sources.
- Continuous processing provides millisecond latency stream processing on certain Spark functions.
Vectorized ORC Reader is a new ORC reader with improved performance of ORC file read.

For more information about all the features and enhancements of Apache Spark 2.3, see:

Qubole Spark 2.3 Enhancements

Apart from the Apache Spark 2.3 enhancements, Qubole Spark 2.3 has the following enhancements:

SPAR-2527: Integration with newer autoscaling APIs introduced in Apache Spark 2.3.
SPAR-2274: Refactored collecting S3 metrics by integrating with the newer AppStatusListener APIs in Apache Spark 2.3.
SPAR-2603: Refactored the Qubole idle timeout functionality which shuts down Spark applications that are idle for more than the idle timeout minutes. The default value is 60 minutes.

Known Issues and Limitations in Qubole Spark 2.3.1

SPAR-2827: HiveServer2(HS2) fails to come up with Spark 2.3.1 due to a known issue in Open Source Spark (OSS). For details, see https://issues.apache.org/jira/browse/SPARK-22918.
ZEP-2769: %knitr Interpreter is not supported with Spark 2.3.1

Enhancements

SPAR-2525: The overall query time is reduced considerably with Qubole Spark listing V2 feature. Beta, Via Support, Disabled
SPAR-2128: Qubole provides an option to disallow creation of Spark clusters with low memory instances (memory < 8 GB) for Spark clusters. With this option, the existing cluster that uses a low memory instance fails. Beta, Via Support, Disabled
SPAR-2217: Spark 2.2.1 is included in this release and it is reflected as 2.2 latest (2.2.1) on the Spark cluster UI. All 2.2.0 clusters are automatically upgraded to 2.2.1 in accordance with Qubole’s Spark versioning policy.
SPAR-2179: Users can run DDL commands on Snowflake data warehouse by using the new Qubole Spark Scala API. This enables users to perform queries such as CTAS on Snowflake tables through Spark.

Bug Fixes

SPAR-2678: Downscaling fails when the number of executors is equal to the maximum capacity.
SPAR-2726: Snowflake writes on Spark 2.1 clusters fail due to an upgrade in the Snowflake Jars.
SPAR-2445: When autoscaling the executors in Spark, the running executors are not considered. As a result, autoscaling of executors was faster than the ETA provided by spark.qubole.autoscaling.stagetime.
SPAR-1984: 403 error or No logs error is displayed when the user clicks on the driver/executor logs link from Spark Application UI.
SPAR-2905: Spark 2.3 queries fail due to the connection issues between driver and executors when private IPs are used in the spark cluster.