Spark
The new features, enhancements, and bug fixes in this release are listed below:
Introducing Size-based JOIN Reordering
SPAR-3882: Spark on Qubole now supports automatic reordering of JOINs based on table sizes, and the optimization works even in the absence of column-level statistics. In Qubole's benchmarking, the optimization was effective in 38 of the TPC-DS benchmark queries and improved query performance by up to 85%.
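As a minimal sketch of the kind of query that benefits, consider a TPC-DS-style multi-way join (the table and column names below are illustrative, not taken from the release note). With size-based reordering, the optimizer can join the small dimension tables first even when column-level statistics are unavailable, shrinking intermediate results:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("JoinReorderSketch").getOrCreate()

    // store_sales is a large fact table; date_dim and item are small dimension
    // tables. Size-based reordering can rearrange the join order by table size.
    val result = spark.sql("""
      SELECT i.i_item_id, SUM(ss.ss_net_paid) AS total_paid
      FROM store_sales ss
      JOIN date_dim d ON ss.ss_sold_date_sk = d.d_date_sk
      JOIN item i ON ss.ss_item_sk = i.i_item_sk
      WHERE d.d_year = 2002
      GROUP BY i.i_item_id
    """)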
Introducing Apache Spark 3.0
The Spark 3.0 release comes with many new features and enhancements. Significant features include Adaptive Query Execution, Dynamic Partition Pruning, and disk-persisted RDD blocks served by the shuffle service, along with substantial improvements to the pandas API and up to 40X speedups in invoking R user-defined functions. Scala 2.12 is generally available; Spark 3.0 no longer supports Scala 2.11.
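As a minimal sketch, these features are controlled by standard Spark configuration properties; in open-source Spark 3.0, Adaptive Query Execution is off by default and Dynamic Partition Pruning is on by default (whether Qubole changes these defaults is not stated here):

    import org.apache.spark.sql.SparkSession

    // Enable Adaptive Query Execution and Dynamic Partition Pruning explicitly.
    val spark = SparkSession.builder()
      .appName("Spark3Features")
      .config("spark.sql.adaptive.enabled", "true")
      .config("spark.sql.optimizer.dynamicPartitionPruning.enabled", "true")
      .getOrCreate()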
Enhancements
SPAR-3414: PySpark now supports using the MySQL data store.
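The release note does not spell out the access path; a common way to read a MySQL store from Spark is the built-in JDBC source, shown here in Scala for consistency with the other sketches (the connection details are hypothetical, and the MySQL JDBC driver must be on the classpath):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("MySqlRead").getOrCreate()

    // Hypothetical host, database, table, and credentials.
    val orders = spark.read
      .format("jdbc")
      .option("url", "jdbc:mysql://db-host:3306/sales")
      .option("dbtable", "orders")
      .option("user", "report_user")
      .option("password", sys.env("MYSQL_PASSWORD"))
      .load()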
SPAR-4409: If a Hive table is an ACID table, reads and writes on that table are blocked unless the Hive ACID datasource jar is present in the application's classpath.
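As a sketch of reading such a table, assuming the Hive ACID datasource jar (for example, Qubole's open-source spark-acid package, which exposes a "HiveAcid" format) is on the application's classpath; the table name is hypothetical:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("AcidRead")
      .enableHiveSupport()
      .getOrCreate()

    // Reads a Hive ACID table through the HiveAcid datasource. Without the
    // datasource jar on the classpath, reads and writes on the table are blocked.
    val acidDf = spark.read
      .format("HiveAcid")
      .option("table", "default.acid_orders")
      .load()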
SPAR-4416: Upgraded the netty version from 4.1.17 to 4.1.47 in Spark version 2.4.x to avoid a known security vulnerability. Spark version 3.0 ships with the upgraded netty version by default.
SPAR-4418: Spark 2.4-latest is now the default Spark version. When you create a new Spark cluster, the Spark version is set to Spark 2.4-latest by default.
SPAR-4588: Spark versions 1.6.2, 2.0-latest, and 2.1-latest are now deprecated.
SPAR-4620: Spark on Qubole skips fetching authorization information while listing Hive table partitions in Spark versions 2.4-latest and 3.0-latest. This enhancement is enabled by default. To disable it, set spark.sql.qubole.hive.storageAuthorization.skip=false.
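A minimal sketch of disabling it at session start (the release note does not say whether the property can also be changed on a running session):

    import org.apache.spark.sql.SparkSession

    // Re-enable fetching of authorization information during partition listing.
    val spark = SparkSession.builder()
      .appName("ListPartitions")
      .config("spark.sql.qubole.hive.storageAuthorization.skip", "false")
      .enableHiveSupport()
      .getOrCreate()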
Bug Fixes
SPAR-4247: A user can now set the deprecated property spark.yarn.executor.memoryOverhead to specify the Spark job's executor memory overhead; earlier, Spark honored only spark.executor.memoryOverhead (see the sketch after this list).
SPAR-4397: Fixed an issue in which SparkFileStatusCache incorrectly stored an input path that contained a comma in its name.
SPAR-4676: HDFS space was getting clogged unnecessarily with logs of terminated applications. To resolve this problem, Spark now automatically deletes terminated applications' logs that are older than 30 days, running the cleanup daily; the configuration properties that delete older files are enabled by default. This applies to Spark versions 2.3 and later.
For a list of bug fixes between versions R59 and R60, see Changelog for api.qubole.com.