Spark

In this release, Spark on Qubole provides various new features, enhancements, and bug fixes.

New Features

  • SPAR-3979 and SPAR-2953: Dynamic Filtering includes the following improvements in Spark 2.4.3 and later versions:

    • Pruning of partitions at the scan level to avoid the overhead of scanning redundant partitions.
    • Pushdown of values for the ORC file format as well as Parquet.

    Gradual Rollout.

  • SPAR-3821: Direct writes in Create Table as Select (CTAS) commands are supported for enhanced performance. This feature is supported on Spark 2.4.3 and later versions. Via Support.

  • SPAR-3713: For JOIN operations, Spark automatically detects skew in the data and applies skew join optimization to handle it. Via Support.
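
The partition pruning listed under Dynamic Filtering (SPAR-3979 and SPAR-2953) can be illustrated with a minimal, plain-Python sketch. It is not Qubole's implementation: the table layout, the `dt` partition column, and the helper names `collect_filter_values` and `prune_partitions` are all hypothetical and exist only to show the idea of collecting join-key values from the filtered side of a join and skipping partitions that cannot match.

```python
from typing import Dict, Iterable, List, Set


def collect_filter_values(dim_rows: Iterable[dict], key: str) -> Set[str]:
    """Gather the distinct join-key values from the small, already filtered side."""
    return {row[key] for row in dim_rows}


def prune_partitions(
    partitions: Dict[str, List[dict]], keep: Set[str]
) -> Dict[str, List[dict]]:
    """Keep only the partitions whose partition value can match the join."""
    return {value: rows for value, rows in partitions.items() if value in keep}


# Hypothetical fact table partitioned by date; each dict key is a partition value.
sales_partitions = {
    "2019-01-01": [{"dt": "2019-01-01", "amount": 10}],
    "2019-01-02": [{"dt": "2019-01-02", "amount": 20}],
    "2019-01-03": [{"dt": "2019-01-03", "amount": 30}],
}

# Hypothetical dimension rows that remain after the query's own filter.
holidays = [{"dt": "2019-01-01"}, {"dt": "2019-01-03"}]

keep = collect_filter_values(holidays, "dt")
scanned = prune_partitions(sales_partitions, keep)

# Only partitions that can contribute to the join are scanned.
print(sorted(scanned))
```

In this sketch the partition for 2019-01-02 is never read, which is the overhead that scan-level pruning is meant to avoid.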

Enhancements

  • SPAR-3431: Spark on Qubole supports DBTap for Redshift. Users can now create a Redshift datastore from the Explorer UI and use it in Spark applications. This feature is supported on Spark 2.4.3 and later versions. Via Support.
  • SPAR-3952 and SPAR-3616: Spark's core scheduling algorithm is enhanced for memory-intensive applications to improve the reliability of Spark applications. Tasks that fail with an executor OOM exception (for example, while handling large data skew within a stage) get a larger share of the available executor memory when they are rescheduled. To make this possible, the scheduling algorithm controls the number of memory-intensive tasks scheduled on that executor. This feature is supported on Spark 2.4.3 and later versions. Via Support.
  • SPAR-3838: The driver memory for Spark commands is allocated based on the instance type of the cluster to optimize memory usage. This feature is supported on non-heterogeneous clusters running Spark 2.3.2 and later versions.
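
The rescheduling behavior described in SPAR-3952 and SPAR-3616 can be pictured with a small plain-Python sketch. It is an illustration under stated assumptions, not the actual scheduler: the slot model, the doubling rule in `on_oom`, the greedy `admit` function, and the value of `EXECUTOR_SLOTS` are all hypothetical. The sketch only shows how giving a failed task a larger share of an executor reduces how many other tasks can run beside it.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Task:
    name: str
    slots: int = 1  # hypothetical share of one executor's memory


def on_oom(task: Task, executor_slots: int) -> Task:
    """Return the retry of a task that failed with OOM, with a doubled share.

    The share is capped at the whole executor (assumed policy).
    """
    return Task(task.name, min(task.slots * 2, executor_slots))


def admit(pending: List[Task], executor_slots: int) -> List[Task]:
    """Greedily pick tasks, in order, whose combined share fits on one executor."""
    admitted: List[Task] = []
    used = 0
    for task in pending:
        if used + task.slots <= executor_slots:
            admitted.append(task)
            used += task.slots
    return admitted


EXECUTOR_SLOTS = 4  # hypothetical capacity of one executor

tasks = [Task("t1"), Task("t2"), Task("t3"), Task("t4")]
first_wave = admit(tasks, EXECUTOR_SLOTS)  # all four tasks share the executor

# Suppose t2 fails with OOM twice; each retry asks for a larger share.
retry = on_oom(tasks[1], EXECUTOR_SLOTS)
retry_again = on_oom(retry, EXECUTOR_SLOTS)

# The second retry now occupies the whole executor, so t5 has to wait.
second_wave = admit([retry_again, Task("t5")], EXECUTOR_SLOTS)
print([t.name for t in second_wave])
```

The design choice shown here is the trade-off the release note implies: a larger memory share per retried task comes at the cost of lower task concurrency on that executor.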

Bug Fixes

  • SPAR-3862: Driver logs were not displayed when Spark applications were run in cluster deploy mode. This issue is fixed, and the executor page now renders the container logs of the driver.
  • SPAR-3791: The rerun option caused commands to fail immediately when re-running Scheduled Jobs. This issue is fixed.
  • SPAR-3714: Queries with a large number of nested sub-queries experienced a significant slowdown when Hive Authorization was enabled. With this fix, the performance of such queries is improved. This issue is fixed in Spark 2.4.3.