Spark

The new features and enhancements are:

Introducing Size-based JOIN Reordering

SPAR-3882: Spark on Qubole now supports automatic reordering of JOINs based on table sizes, and this works even in the absence of column-level statistics. In Qubole's benchmarks on the TPC-DS standard, the optimization applied to 38 queries and improved query performance by up to 85%.
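
The reordering is applied automatically by the optimizer, so no query changes are required. As a hypothetical illustration, assuming TPC-DS-style tables are available in the metastore, a multi-join aggregation such as the following can benefit when the small dimension tables are joined ahead of the large fact table:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("join-reorder-demo").getOrCreate()

    # Hypothetical TPC-DS-style query: store_sales is a large fact table,
    # while date_dim and item are small dimension tables. Size-based JOIN
    # reordering lets the optimizer reorder these joins by table size even
    # when column-level statistics are unavailable.
    result = spark.sql("""
        SELECT d.d_year, i.i_category, SUM(ss.ss_net_paid) AS total_paid
        FROM store_sales ss
        JOIN date_dim d ON ss.ss_sold_date_sk = d.d_date_sk
        JOIN item i ON ss.ss_item_sk = i.i_item_sk
        GROUP BY d.d_year, i.i_category
    """)
    result.show()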

Introducing Apache Spark 3.0

The Spark 3.0 release comes with many new features and enhancements. Significant features include Adaptive Query Execution, Dynamic Partition Pruning, and disk-persisted RDD blocks served by the shuffle service. There are substantial improvements in the pandas API and up to 40x speedups when invoking R user-defined functions. Scala 2.12 is generally available; Spark 3.0 no longer supports Scala 2.11.
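
As a minimal sketch, assuming a standard PySpark 3.0 setup (the application name is arbitrary), Adaptive Query Execution and Dynamic Partition Pruning can be controlled through the stock Apache Spark configuration properties when building a session:

    from pyspark.sql import SparkSession

    # Stock Apache Spark 3.0 properties, not Qubole-specific ones:
    # Adaptive Query Execution is off by default in 3.0 and must be
    # enabled; Dynamic Partition Pruning is on by default and is set
    # here only to make the example explicit.
    spark = (
        SparkSession.builder
        .appName("spark3-features-demo")
        .config("spark.sql.adaptive.enabled", "true")
        .config("spark.sql.optimizer.dynamicPartitionPruning.enabled", "true")
        .getOrCreate()
    )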

Other enhancements and bug fixes are listed below:

Enhancements

  • SPAR-3414: PySpark now supports using the MySQL data store.
  • SPAR-4409: If a Hive table is an ACID table, any reads and writes on that table are blocked when the Hive ACID datasource JAR is not present in the application’s classpath.
  • SPAR-4416: Upgraded the Netty version from 4.1.17 to 4.1.47 in Spark 2.4.x to avoid a security vulnerability. Spark 3.0 contains the upgraded Netty version by default.
  • SPAR-4418: Spark 2.4-latest is now the default Spark version. When you create a new Spark cluster, the Spark version is set to Spark 2.4-latest by default.
  • SPAR-4588: Spark versions 1.6.2, 2.0-latest, and 2.1-latest are now deprecated.
  • SPAR-4620: Spark on Qubole skips fetching authorization information while listing Hive table partitions in Spark versions 2.4-latest and 3.0-latest. This enhancement is enabled by default. To disable it, set spark.sql.qubole.hive.storageAuthorization.skip=false, as shown in the sketch after this list.
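
As a minimal sketch of disabling the SPAR-4620 behavior (the property name is taken from the note above; the session setup around it is an assumption), the flag can be passed when building a session, or equivalently through the cluster's Spark configuration override:

    from pyspark.sql import SparkSession

    # Re-enable fetching of authorization information during Hive
    # partition listing; Spark on Qubole skips it by default in
    # 2.4-latest and 3.0-latest.
    spark = (
        SparkSession.builder
        .appName("storage-auth-demo")
        .config("spark.sql.qubole.hive.storageAuthorization.skip", "false")
        .getOrCreate()
    )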

Bug Fixes

  • SPAR-4247: You can now set the deprecated property spark.yarn.executor.memoryOverhead to specify a Spark job’s executor memory overhead, as illustrated in the sketch after this list. Earlier, Spark honored only spark.executor.memoryOverhead.
  • SPAR-4397: Fixed an issue in which SparkFileStatusCache incorrectly stored an input path that contained a comma in its name.
  • SPAR-4676: Previously, logs of terminated applications unnecessarily consumed HDFS space. Spark now automatically cleans up terminated applications’ logs that are older than 30 days; this daily cleanup is enabled by default and applies to Spark versions 2.3 and later.
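
As a hedged illustration of the SPAR-4247 fix (the overhead value and application name are arbitrary), the deprecated property can again be used in place of its replacement:

    from pyspark.sql import SparkSession

    # The deprecated spark.yarn.executor.memoryOverhead (in MiB) is
    # honored again; spark.executor.memoryOverhead remains the
    # preferred, non-deprecated name for the same setting.
    spark = (
        SparkSession.builder
        .appName("memory-overhead-demo")
        .config("spark.yarn.executor.memoryOverhead", "1024")
        .getOrCreate()
    )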

For a list of bug fixes between versions R59 and R60, see Changelog for api.qubole.com.