Spark 3.0 Features

Spark on Qubole supports Apache Spark 3.0.0, the latest version. It is displayed as 3.0 latest in the Spark Version field of the Create New Cluster page on the QDS UI.

Apache Spark 3.0 Features

Spark 3.0 introduces several major features and performance enhancements. The key features are listed below:

Feature	Description	Configuration
Adaptive Query Execution	Adaptive Query Execution (AQE) re-optimizes the Spark execution plan at runtime, using statistics collected from intermediate data and completed stages. The optimized plan can convert a sort-merge join to a broadcast join, optimize the reducer count, and handle data skew during join operations.	spark.sql.adaptive.enabled. This configuration is disabled by default in Qubole Spark clusters.
Dynamic Partition Pruning	Dynamic Partition Pruning (DPP) selects, at runtime, the specific partitions of a table that need to be read. It improves job performance for queries whose join condition is on the partition column by significantly reducing the amount of data read and processed.	spark.sql.optimizer.dynamicPartitionPruning.enabled. This configuration is disabled by default in Qubole Spark clusters.
Disk-persisted RDD Blocks Served by Shuffle Service	Serving disk-persisted RDD blocks through the shuffle service optimizes the dynamic allocation of resources in Spark by addressing two issues: (1) executors are not downscaled even when they have been idle for a long time, because RDD cache data is present on them; (2) an executor that holds required RDD cache data is downscaled, which forces the RDD to be recomputed. With this feature, RDD blocks are fetched through the Shuffle Service in the same way as shuffle data. The executors are not responsible for maintaining the state of the RDD and can be downscaled easily.	spark.shuffle.service.fetch.rdd.enabled. This configuration is disabled by default in Qubole Spark clusters.
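
As a minimal sketch of how the three properties listed above might be switched on for a single job, the following Python snippet assembles a spark-submit command line with one --conf flag per property. The property names come from the table; the helper name build_spark_submit, the application file my_job.py, and the choice to set every value to true are assumptions made for illustration only. Depending on your setup, the same properties may instead be set at the cluster level.

```python
import shlex

# Property names are taken from the table above. Setting each one to
# "true" is an illustrative assumption, not a Qubole recommendation.
SPARK_3_FEATURE_FLAGS = {
    "spark.sql.adaptive.enabled": "true",
    "spark.sql.optimizer.dynamicPartitionPruning.enabled": "true",
    "spark.shuffle.service.fetch.rdd.enabled": "true",
}


def build_spark_submit(app, flags):
    """Return a spark-submit argument list with one --conf per property.

    Hypothetical helper: it only builds the argument list and does not
    launch anything.
    """
    args = ["spark-submit"]
    for key, value in flags.items():
        args += ["--conf", f"{key}={value}"]
    args.append(app)
    return args


# "my_job.py" is a placeholder application name.
command = build_spark_submit("my_job.py", SPARK_3_FEATURE_FLAGS)
print(shlex.join(command))
```

As far as the Apache Spark configuration documentation describes it, spark.shuffle.service.fetch.rdd.enabled only takes effect when the external shuffle service is in use, so it would normally be combined with spark.shuffle.service.enabled.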

For more information about all the features and enhancements of Apache Spark 3.0.0, see Apache Spark 3.0.0 documentation.
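
To make the idea behind Dynamic Partition Pruning more concrete, here is a toy model in plain Python. It is not Spark's implementation: dictionaries stand in for a partitioned fact table and a small dimension table, and all table names, values, and the join_with_pruning helper are hypothetical. It shows only the principle that the filter on the small side of the join is evaluated first, and the resulting keys decide which partitions of the large side are read.

```python
# Toy model only: plain Python data structures stand in for tables.

# Fact table partitioned by date; each partition holds (store_id, amount) rows.
sales_partitions = {
    "2020-01-01": [("s1", 10), ("s2", 20)],
    "2020-01-02": [("s1", 5)],
    "2020-01-03": [("s3", 7)],
}

# Small dimension table: (date, is_holiday) rows.
dates = [
    ("2020-01-01", True),
    ("2020-01-02", False),
    ("2020-01-03", True),
]


def join_with_pruning(fact_partitions, dim_rows, dim_filter):
    """Join on the partition key, reading only the partitions that can match."""
    # Step 1: evaluate the filter on the small dimension side at runtime.
    keys = {key for key, attr in dim_rows if dim_filter(attr)}
    # Step 2: scan only the fact partitions whose key survived the filter.
    scanned = [p for p in fact_partitions if p in keys]
    rows = [(p, *row) for p in scanned for row in fact_partitions[p]]
    return scanned, rows


scanned, rows = join_with_pruning(
    sales_partitions, dates, lambda is_holiday: is_holiday
)
print(scanned)  # ['2020-01-01', '2020-01-03']
print(rows)
```

In this sketch the partition for 2020-01-02 is never read, which corresponds to the reduction in data read and processed described for queries that join on the partition column.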

Spark on Qubole Features with Spark 3.0.0

Spark on Qubole provides customized features with Spark 3.0.0 as listed in the following table.

Feature Area	Description	Reference
Total Cost of Ownership	Stage level autoscaling for the best utilization of resources in the cluster.	Autoscaling in Spark on Qubole.
Total Cost of Ownership	Graceful Decommission of Spot nodes to achieve higher Spot utilization, which optimizes the job for lower costs while ensuring reliability.	Spark Cluster Optimization for Cost, Reliability and Performance.
Total Cost of Ownership	Container Packing, a new resource allocation strategy that makes more nodes available for downscaling in an elastic computing environment, while simultaneously preventing hot spots in the cluster and trying to honor data locality preferences.	Container packing.
Read Optimization	Dynamic Filtering improves the performance of Join Queries. It supports both Row filtering and Partition Pruning at runtime.	Dynamic Filtering.
Read Optimization	Skew Join Optimization handles skew in joins by using hints specified by the users. In a hint, users identify the join keys that are skewed and the values on which they are skewed.	NA
Write Optimization	Direct Writes in Storage delivers performance improvements of up to 40x for write-heavy Spark workloads.	Direct Writes.
Write Optimization	Distributed Writes saves the query results directly in the cloud’s object store from the Spark executors during the execution of the query, which prevents memory and performance issues.	NA
Data Governance for SparkSQL	Apache Ranger with Spark delivers fine-grained data access control, including row-level filtering and column-level masking.	Data Governance.
Data Governance for SparkSQL	Hive Authorization provides Qubole Hive users the ability to control granular access to Hive tables and columns.	Understanding Qubole Hive Authorization.
Usability and Reliability	Executor Taint Management handles Out of Memory issues that might occur when the memory configurations are not tuned appropriately.	NA
Usability and Reliability	Performance and robustness enhancements in the Spark Application UI and Spark History Server.	NA
Streaming Analytics	Qubole Pipelines Service is a stream processing service that helps data engineers to ingest and process streaming data from various sources, accelerate the development of streaming applications, and run highly reliable and observable production applications on a managed environment at a low cost.	Introduction to Qubole Pipelines Service.