Spark

In this release, Spark on Qubole provides various new features and enhancements.

New Features

Support for Redshift Connector

SPAR-2991: Spark on Qubole supports Redshift connector. You can use this connector to load data from Amazon redshift tables into Spark SQL DataFrames, and write data back to Redshift tables.

This feature is supported on Spark 2.4 and later versions.

Support for Glue MetaStore

SPAR-3370: Glue is supported as a metastore for Spark. Users can read metadata from and write metadata to Glue. Via Support, Cluster Restart Required.

For information about the limitations of the connector, see Limitations.

Executor Based Broadcast

SPAR-3220: Executor based broadcast is introduced in which values that are broadcasted are not collected on driver. This enables the users to broadcast with lower driver memory as the driver memory is not the bottleneck for using broadcast.

Enhancements

  • SPAR-3480: Handling of limit 0 queries is improved, where unnecessary jobs are detected and pruned away.
  • SPAR-3321: Direct writes is supported in Datasource writes when dynamicPartitionOverwrite is used.
  • SPAR-2805: Spark on Qubole supports UDF parallelization using systemd. This improves the time taken to bring up Spark and the required daemons on a Spark cluster after the nodes are provisioned.
  • SPAR-3317: Secure per user access to data stores is supported. Only users with the appropriate privileges can access the data stores. This feature is supported on Spark 2.3.2, 2.4.0, and later versions. Via Support.
  • SPAR-3300: Users can use Sparklens with the QDS spark commands without having to pass the sparklens jars option externally. Beta
  • SPAR-3032: CBO is enabled by default in Spark version 2.4.0 to collect statistics automatically whenever the data in the underlying table changes. Statistics are also collected by default for various DDL statements.
  • SPAR-2484: The functionality of Multipart upload based File Output Committer (MFOC) supported by Hadoop is extended in Spark on Qubole for S3. This feature is supported in Spark 2.4.0 and later versions.
  • SPAR-2164: Handling skew in the join keys is supported. Users can now specify the hint ` /*+ SKEW ('<table_name>') */ ` for a join that describes the column and the values upon which skew is expected. Based on that information, the engine automatically ensures that the skewed values are handled appropriately.
  • SPAR-3005: For the offline Spark clusters, only the event log files that are less than 400 MB are processed in the offline Spark History Server (SHS). This prevents high CPU utilization on the internal servers due to SHS.

Bug Fixes

  • SPAR-3298: Spark history showed incomplete even when spark job was finished. This issue is fixed in Spark 2.2.1, 2.3.2, and 2.4.0 versions. After the fix, the Spark UI shows completed state after the successful completion of a Spark job.
  • SPAR-3471: The deadlock issue with TaskManager is now fixed.

For a list of bug fixes between versions R55 and R56, see Changelog for api.qubole.com.