Spark

In this release, Spark on Qubole provides various new features, enhancements, and bug fixes.

New Features

  • SPAR-2937: Users can configure Ranger policies for Hive tables, and Spark SQL honors these policies for authorization. This feature is supported on Spark 2.4.0 and later versions. Feature to opt in | Cluster Restart Required.
  • ZEP-3465: Qubole supports RStudio Server Pro hosted on the cluster coordinator, with sparklyr and various common libraries preinstalled. Users can access RStudio from the Resources section of the Clusters page. RStudio is supported on Spark 2.2 and 2.3 clusters. Beta, Via Support.
  • SPAR-3510: Qubole now supports Apache Spark 2.4.3. It is displayed as 2.4 latest (2.4.3) in the Spark Version field of the Create New Cluster page in the QDS UI. All existing 2.4.0 clusters are automatically upgraded to 2.4.3 in accordance with Qubole’s Spark versioning policy.

Enhancements

  • SPAR-3616: Spark applications run reliably even in Out of Memory scenarios, which might occur when the memory configurations are not tuned appropriately. This is supported on Spark 2.4.3 and later versions. Via Support.
  • SPAR-3418: ORC metadata caching in Spark improves query performance by reducing the time spent reading ORC metadata from an object store. This is supported on Spark 2.4.3 and later versions. Via Support.
  • SPAR-3197: Users can use macro variables in the Arguments for User Program field of scheduled Spark commands. Macro variables are supported for Python, Scala, and R. This is supported on Spark 2.4.0 and later versions.
  • SPAR-3650: Spark computes the size of the input table during query planning, which speeds up queries involving joins by allowing the planner to use BroadcastHashJoin. This is supported on Spark 2.4.0 and later versions. Via Support.
  • SPAR-3226: Spark applications handle Spot Node Loss and Spot-blocks using the YARN Graceful-Decommission status. This is supported on Spark 2.4.0 and later versions. Via Support.
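
To illustrate why a size estimate at planning time matters for SPAR-3650, the following is a minimal sketch in plain Python, not Qubole's or Spark's planner code. It assumes the open-source Spark behavior in which a join side can be broadcast when its estimated size is at or below spark.sql.autoBroadcastJoinThreshold (10 MB by default, -1 to disable). The function name choose_join_strategy and the use of None for an unknown size are hypothetical choices made for this example.

```python
# Simplified illustration only; the real planner also considers join type and hints.
# 10 MB is the documented default of spark.sql.autoBroadcastJoinThreshold.
DEFAULT_BROADCAST_THRESHOLD_BYTES = 10 * 1024 * 1024


def choose_join_strategy(left_bytes, right_bytes,
                         threshold=DEFAULT_BROADCAST_THRESHOLD_BYTES):
    """Pick a join strategy from estimated table sizes (hypothetical helper).

    A size of None means the planner has no estimate for that side.
    """
    # A negative threshold disables broadcast joins.
    if threshold < 0:
        return "SortMergeJoin"
    # Only sides with a known size can qualify for broadcasting.
    known_sizes = [s for s in (left_bytes, right_bytes) if s is not None]
    if known_sizes and min(known_sizes) <= threshold:
        return "BroadcastHashJoin"
    return "SortMergeJoin"


if __name__ == "__main__":
    fact_table = 500 * 1024 ** 3   # 500 GB fact table
    dim_table = 2 * 1024 * 1024    # 2 MB dimension table
    # Without a size estimate for the small table, no broadcast is chosen.
    print(choose_join_strategy(None, fact_table))
    # With the size computed during planning, the small table can be broadcast.
    print(choose_join_strategy(dim_table, fact_table))
```

In this sketch, the small table is broadcast only when its size is known. This is the case the enhancement targets.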

Bug Fixes

  • SPAR-3730: A ClassNotFoundException error occurred because the Rubix caching jars were missing from the Hive Metastore classpath. With this fix, the Rubix caching jars are available in the Hive Metastore classpath. This issue is fixed on Spark 2.2.0 and later versions.
  • SPAR-3701: Run times of a few TPC-DS queries had increased because filter pushdown in subqueries disabled subquery reuse. With this fix, the overall query run time is reduced wherever subquery reuse applies.
  • SPAR-3405: Hive configs such as hive.metastore.uris were not reaching the Spark Hive Authorizer plugin when passed through Spark defaults or --conf. As a result, connections to the Hive Metastore failed when Hive Authorization was enabled. This issue is fixed in Spark 2.4.0 and later versions. Via Support.
  • SPAR-3766: During operations such as Update Table Stats, the owner of the table was changed to the user running the command. With this fix, the original owner of the table is retained. This issue is fixed in Spark 2.4.0 and later versions.
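
For SPAR-3405, the following is a minimal sketch of how a Hive config might be passed to a Spark job through --conf. It assumes the open-source Spark convention that properties prefixed with spark.hadoop. are copied into the Hadoop configuration. The exact form Qubole expects may differ. The helper name build_submit_args, the application file name, and the metastore host are hypothetical.

```python
def build_submit_args(app_path, hive_conf):
    """Build a spark-submit argument list (hypothetical helper).

    Each Hive property is passed as --conf spark.hadoop.<key>=<value>,
    which is an assumption based on open-source Spark conventions.
    """
    args = ["spark-submit"]
    # Sort keys so the generated command line is deterministic.
    for key, value in sorted(hive_conf.items()):
        args += ["--conf", "spark.hadoop.{}={}".format(key, value)]
    args.append(app_path)
    return args


if __name__ == "__main__":
    # The host below is a placeholder, not a real endpoint.
    conf = {"hive.metastore.uris": "thrift://metastore.example.com:9083"}
    print(" ".join(build_submit_args("my_job.py", conf)))
```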

For a list of bug fixes between versions R56 and R57, see Changelog for api.qubole.com.