Spark 3.0 Features

Spark on Qubole supports Apache Spark 3.0.0. This version is displayed as 3.0 latest in the Spark Version field of the Create New Cluster page in the QDS UI.

Apache Spark 3.0 Features

Spark 3.0 introduces several major features and performance enhancements. The key features are listed below:

Features

Description

Configuration

Adaptive Query Execution

Adaptive Query Execution (AQE) changes the Spark execution plan at runtime based on statistics collected from intermediate data and completed stages. The optimized plan can convert a sort-merge join to a broadcast join, optimize the reducer count, and handle data skew during join operations.

spark.sql.adaptive.enabled

This configuration is disabled by default in Qubole Spark clusters.

Dynamic Partition Pruning

Dynamic Partition Pruning (DPP) determines at runtime which partitions of a table need to be read. This improves job performance for queries whose join condition is on the partitioned column, because it significantly reduces the amount of data read and processed.

spark.sql.optimizer.dynamicPartitionPruning.enabled

This configuration is disabled by default in Qubole Spark clusters.

Disk-persisted RDD Blocks Served by Shuffle Service

Serving disk-persisted RDD blocks from the shuffle service improves dynamic resource allocation in Spark by addressing the following issues:

  • Executors that hold cached RDD data are not downscaled, even when they have been idle for a long time.

  • Cached RDD data is required, but the executor that holds it has already been downscaled, so the RDD must be recomputed.

With this feature, RDD blocks are fetched through the Shuffle Service in the same way as shuffle data. Executors are no longer responsible for maintaining RDD state and can be downscaled easily.

spark.shuffle.service.fetch.rdd.enabled


This configuration is disabled by default in Qubole Spark clusters.
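Because the three settings above are disabled by default, they have to be turned on explicitly. The following is a minimal sketch of one way to do that, assuming the job is launched through spark-submit and its standard --conf key=value option. The property names come from the table above; the value "true", the dictionary name, and the helper function are illustrative assumptions, not part of Qubole's or Spark's API.

```python
# Property names are taken from the table above. Setting them to "true" is an
# assumption: the table only states that they are disabled by default.
SPARK3_FEATURE_FLAGS = {
    "spark.sql.adaptive.enabled": "true",
    "spark.sql.optimizer.dynamicPartitionPruning.enabled": "true",
    "spark.shuffle.service.fetch.rdd.enabled": "true",
}


def to_spark_submit_args(conf):
    """Render a dict of Spark properties as spark-submit --conf arguments.

    Hypothetical helper for illustration only.
    """
    args = []
    # Sort the keys so the generated command line is deterministic.
    for key, value in sorted(conf.items()):
        args.extend(["--conf", f"{key}={value}"])
    return args


if __name__ == "__main__":
    # Print the flags so they can be appended to a spark-submit command.
    print(" ".join(to_spark_submit_args(SPARK3_FEATURE_FLAGS)))
```

The same key and value pairs could instead be supplied wherever the cluster or job accepts Spark configuration overrides; the sketch only shows the command-line form.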

For more information about all the features and enhancements of Apache Spark 3.0.0, see the Apache Spark 3.0.0 documentation.
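The idea behind Dynamic Partition Pruning, described in the table above, can be illustrated with a toy model. The sketch below is plain Python, not Spark code, and does not reflect how Spark implements the optimization internally. All table names, column names, and values are hypothetical. It only shows the principle: the filter on the small side of the join is evaluated first, and the result decides which partitions of the large table are read.

```python
# Toy model of Dynamic Partition Pruning; this is NOT Spark code.
# All names and values below are hypothetical.

# Fact table stored as partitions keyed by the partition column (sale date).
SALES_PARTITIONS = {
    "2020-06-01": [("p1", 10), ("p2", 5)],
    "2020-06-02": [("p1", 7)],
    "2020-06-03": [("p3", 2), ("p2", 4)],
}

# Small dimension table: (date, is_holiday).
DATES = [("2020-06-01", False), ("2020-06-02", True), ("2020-06-03", False)]


def pruned_join(fact_partitions, dim_rows, dim_filter):
    """Join the fact table to the filtered dimension table on the date key."""
    # Step 1: evaluate the dimension-side filter at runtime.
    wanted = {date for date, flag in dim_rows if dim_filter(flag)}
    # Step 2: read only the fact partitions whose key survived the filter.
    scanned = [key for key in fact_partitions if key in wanted]
    rows = [(key, *row) for key in scanned for row in fact_partitions[key]]
    return scanned, rows


if __name__ == "__main__":
    scanned, rows = pruned_join(SALES_PARTITIONS, DATES, lambda is_holiday: is_holiday)
    print("partitions read:", scanned)
    print("joined rows:", rows)
```

In this toy data set only one of the three partitions is read, which is the kind of reduction in scanned data that the description above refers to.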

Spark on Qubole Features with Spark 3.0.0

Spark on Qubole provides customized features with Spark 3.0.0 as listed in the following table.

Feature Area

Description

Reference

Total Cost of Ownership

Stage-level autoscaling for the best utilization of resources in the cluster.

Autoscaling in Spark on Qubole.

Graceful Decommission of Spot nodes to achieve higher Spot utilization, which optimizes the job for lower costs while ensuring reliability.

Spark Cluster Optimization for Cost, Reliability and Performance.

Container Packing, a new resource allocation strategy that makes more nodes available for downscaling in an elastic computing environment, while simultaneously preventing hot spots in the cluster and trying to honor data locality preferences.

Container packing.

Read Optimization

Dynamic Filtering improves the performance of join queries. It supports both row filtering and partition pruning at runtime.

Dynamic Filtering.

Skew Join Optimization handles skew in joins by using hints that users specify. In a hint, users identify the skewed join keys and the values on which those keys are skewed.

NA

Write Optimization

Direct Writes in Storage delivers performance improvements of up to 40x for write-heavy Spark workloads.

Direct Writes.

Distributed Writes saves query results directly to the cloud’s object store from the Spark executors while the query runs, which prevents memory and performance issues.

NA

Data Governance for SparkSQL

Apache Ranger with Spark delivers fine-grained data access control, including row-level filtering and column-level masking.

Data Governance.

Hive Authorization provides Qubole Hive users the ability to control granular access to Hive tables and columns.

Understanding Qubole Hive Authorization.

Usability and Reliability

Executor Taint Management handles out-of-memory issues that might occur when the memory configurations are tuned appropriately.

NA

Performance and robustness enhancements in the Spark Application UI and Spark History Server.

NA

Streaming Analytics

Qubole Pipelines Service is a stream processing service that helps data engineers ingest and process streaming data from various sources, accelerate the development of streaming applications, and run highly reliable and observable production applications in a managed environment at a low cost.

Introduction to Qubole Pipelines Service.