Spark 3.0 Features
Spark on Qubole supports the latest Apache Spark version, 3.0.0. It appears as 3.0 latest in the Spark Version field of the Create New Cluster page in the QDS UI.
Apache Spark 3.0 Features
Spark 3.0 introduces several major features and performance enhancements. The key features are listed in the following table.
Features | Description | Configuration |
---|---|---|
Adaptive Query Execution | Adaptive Query Execution (AQE) changes the Spark execution plan at runtime based on statistics collected from intermediate data and completed stages. The optimized plan can convert a sort-merge join to a broadcast join, optimize the reducer count, and handle data skew during the join operation. | |
Dynamic Partition Pruning | Dynamic Partition Pruning (DPP) selects, at runtime, the specific partitions of a table that need to be read. This improves performance for queries whose join condition is on the partitioned column by significantly reducing the amount of data read and processed. | This configuration is disabled by default in Qubole Spark clusters. |
Disk-persisted RDD Blocks Served by Shuffle Service | Serving disk-persisted RDD blocks from the shuffle service improves the dynamic allocation of resources in Spark. With this feature, RDD blocks are fetched through the shuffle service in the same way as shuffle data. The executors are not responsible for maintaining the state of the RDD blocks and can downscale easily. | This configuration is disabled by default in Qubole Spark clusters. |
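The features in the table above are controlled by Spark configuration properties. The following is a minimal sketch that collects the corresponding open-source Apache Spark 3.0 property names and renders them as `--conf` arguments. The helper names `spark3_feature_conf` and `to_submit_args` are hypothetical, and the sketch assumes that Qubole clusters honor the same property names as Apache Spark. Verify the names and defaults against the Qubole documentation before relying on them.

```python
# Hypothetical helpers for illustration only; not part of Spark or QDS.
# Property names are from open-source Apache Spark 3.0. Whether a Qubole
# cluster uses the same names and defaults is an assumption.


def spark3_feature_conf(aqe=True, dpp=True, rdd_via_shuffle_service=True):
    """Return a dict of Spark properties for the selected Spark 3.0 features."""
    conf = {}
    if aqe:
        # Adaptive Query Execution and two of its runtime optimizations.
        conf["spark.sql.adaptive.enabled"] = "true"
        conf["spark.sql.adaptive.coalescePartitions.enabled"] = "true"
        conf["spark.sql.adaptive.skewJoin.enabled"] = "true"
    if dpp:
        # Dynamic Partition Pruning.
        conf["spark.sql.optimizer.dynamicPartitionPruning.enabled"] = "true"
    if rdd_via_shuffle_service:
        # Disk-persisted RDD blocks are served by the external shuffle
        # service, so the shuffle service itself must also be enabled.
        conf["spark.shuffle.service.enabled"] = "true"
        conf["spark.shuffle.service.fetch.rdd.enabled"] = "true"
    return conf


def to_submit_args(conf):
    """Render a property dict as a flat list of --conf key=value arguments."""
    args = []
    for key in sorted(conf):
        args += ["--conf", f"{key}={conf[key]}"]
    return args


if __name__ == "__main__":
    print(" ".join(to_submit_args(spark3_feature_conf())))
```

The same key and value pairs could instead be supplied wherever the cluster or job accepts Spark overrides. The sketch only shows which properties are involved.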
For more information about all the features and enhancements of Apache Spark 3.0.0, see Apache Spark 3.0.0 documentation.
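To make the idea behind Dynamic Partition Pruning more concrete, the following toy model shows the principle in plain Python. It is not Spark code and does not reflect how Spark implements the optimization. The function `prune_partitions` and the sample data are hypothetical. The filter on the small dimension side is evaluated first, and only the partitions of the large table whose key survives that filter are read.

```python
# Toy model of dynamic partition pruning; all names and data are made up.


def prune_partitions(fact_partitions, dim_rows, dim_filter, join_key):
    """Keep only the fact partitions whose key passes the dimension filter."""
    # Step 1: evaluate the filter on the (small) dimension side at runtime.
    wanted = {row[join_key] for row in dim_rows if dim_filter(row)}
    # Step 2: read only the partitions of the (large) fact side that match.
    return {key: rows for key, rows in fact_partitions.items() if key in wanted}


# Fact table partitioned by date: partition value -> rows in that partition.
fact_partitions = {
    "2020-06-01": ["sale-1", "sale-2"],
    "2020-06-02": ["sale-3"],
    "2020-06-03": ["sale-4", "sale-5"],
}

# Dimension table joined on the partition column.
dim_dates = [
    {"date": "2020-06-01", "is_holiday": False},
    {"date": "2020-06-02", "is_holiday": True},
    {"date": "2020-06-03", "is_holiday": False},
]

selected = prune_partitions(
    fact_partitions, dim_dates, lambda row: row["is_holiday"], "date"
)
print(sorted(selected))  # ['2020-06-02']
```

In this example only one of the three partitions is read, which is the source of the savings described for queries that join on the partitioned column.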
Spark on Qubole Features with Spark 3.0.0
Spark on Qubole provides customized features with Spark 3.0.0 as listed in the following table.
Feature Area | Description | Reference |
---|---|---|
Total Cost of Ownership | Stage-level autoscaling for the best utilization of resources in the cluster. | |
Total Cost of Ownership | Graceful Decommission of Spot nodes to achieve higher Spot utilization, which optimizes the job for lower costs while ensuring reliability. | Spark Cluster Optimization for Cost, Reliability and Performance |
Total Cost of Ownership | Container Packing, a new resource allocation strategy that makes more nodes available for downscaling in an elastic computing environment, while simultaneously preventing hot spots in the cluster and trying to honor data locality preferences. | |
Read Optimization | Dynamic Filtering improves the performance of join queries. It supports both row filtering and partition pruning at runtime. | |
Read Optimization | Skew Join Optimization handles skew in joins by using hints that users specify. A hint names the join keys that are skewed and the values on which they are skewed. | NA |
Write Optimization | Direct Writes in Storage delivers performance improvements of up to 40x for write-heavy Spark workloads. | |
Write Optimization | Distributed writes save the query results directly in the cloud’s object store from the Spark executors during query execution, which prevents memory and performance issues. | NA |
Data Governance for SparkSQL | Apache Ranger with Spark delivers fine-grained data access control, including row-level filtering and column-level masking. | |
Data Governance for SparkSQL | Hive Authorization provides Qubole Hive users the ability to control granular access to Hive tables and columns. | |
Usability and Reliability | Executor Taint Management handles Out of Memory issues that might occur when the memory configurations are tuned appropriately. | NA |
Usability and Reliability | Performance and robustness enhancements in the Spark Application UI and Spark History Server. | NA |
Streaming Analytics | Qubole Pipelines Service is a stream processing service that helps data engineers ingest and process streaming data from various sources, accelerate the development of streaming applications, and run highly reliable and observable production applications in a managed environment at a low cost. | |