Changes in Spark provides more details on the new features and enhancements.

Backported Fixes/Changes from Open Source

  • SPAR-2073: Backported SPARK-21445 from 2.2.1 to 2.2.0, which makes IntWrapper and LongWrapper in UTF8String serializable.

Bug Fixes

  • SPAR-1098: Added support for spark-submit command lines spread over multiple lines, using a backslash at the end of each line.
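
    For example, a command of the following form is now handled correctly (the class and jar names here are illustrative placeholders, not values from this release):

```shell
# A spark-submit invocation split across lines, with a trailing
# backslash marking each continued line.
spark-submit \
  --class com.example.MyApp \
  --master yarn \
  my-app.jar
```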

  • SPAR-1689: The issue in which the --class option was not working with Scala has been resolved. When the --class parameter is included in the spark-submit command-line options, that class is invoked as part of the spark-submit command to run the Spark job, irrespective of the class name given in the Scala code.

  • SPAR-1984: Private IP addresses are now used for all nodes in Spark. As a result, executor logs are accessible.

  • SPAR-2023: The issue in which overriding --repositories in the spark-submit command-line options was not honored has been resolved.

  • SPAR-2181: Fixed issues with running long-running streaming jobs from an AWS IAM Role-enabled account.

  • SPAR-2255: The package name is now added in Spark jobs written in Scala during spark-submit command execution. For example, spark-submit package_name.class_name*.jar.

    A fix has been added for finding the package name and sending it during Spark job submission.

  • SPAR-2277: Qubole Spark now supports Hive metastore 2.1 on Spark 2.2 and later versions.

  • SPAR-2237: The issue in which the Spark Application UI was inaccessible has been resolved.

  • SPAR-2279: The issue in which quboleDBTap.registerTables displayed an exception, com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: Unknown database, has been resolved.

  • SPAR-2458: Fixed a bug where the NodeManager (NM) did not shut down because of the auxiliary ExternalShuffleService running inside it.

  • SPAR-2574: The issue in which the ALTER TABLE RECOVER PARTITIONS command did not work correctly with spark.sql.qubole.recover.partitions has been resolved. The issue occurred when invalid files/directories were present in the partition path(s).

  • SPAR-2584: Qubole has made the custom package deploy for Spark and Zeppelin more robust by handling the error condition and adding retries.

Enhancements

  • SPAR-2166: The default value of max-executors for a Spark application has been increased from 2 to 1000. To use a different value, set the spark.dynamicAllocation.maxExecutors configuration explicitly at the Spark application level. To use a different value for all Spark applications run on a cluster, set the value as a Spark override on that cluster. The new default is effective only when the Spark application is run through the Analyze UI or through a REST API call; it does not apply to Spark notebooks.
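
    Assuming the usual spark-submit conventions, setting the cap for a single application would look like this (my-app.jar is a placeholder):

```shell
# Hypothetical example: limit dynamic allocation to 100 executors
# for this application only, overriding the new default of 1000.
spark-submit \
  --conf spark.dynamicAllocation.maxExecutors=100 \
  my-app.jar
```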

  • SPAR-2173: With DirectFileOutputCommitter (DFOC), if a task fails after writing partial files, the reattempt also fails with FileAlreadyExistsException and the job fails. This issue is fixed in Spark versions 2.1.x and 2.2.x.

    To use the fix, enable spark.hadoop.mapreduce.output.textoutputformat.overwrite and spark.qubole.outputformat.overwriteFileInWrite. Qubole Spark will enable both options by default in the near future.
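
    A sketch of enabling both options at submit time (the jar name is illustrative):

```shell
# Hypothetical example: allow reattempts to overwrite the partial
# files left by a failed task when DFOC is in use.
spark-submit \
  --conf spark.hadoop.mapreduce.output.textoutputformat.overwrite=true \
  --conf spark.qubole.outputformat.overwriteFileInWrite=true \
  my-app.jar
```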

  • SPAR-2198: QDS has changed the default values of the following Spark configurations:

    spark.shuffle.reduceLocality.enabled = false
    spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version = 2
    spark.hadoop.parquet.enable.summary-metadata = false
    spark.sql.qubole.ignoreFNFExceptions = true
    spark.hadoop.fs.s3a.connection.establish.timeout = 5000
    spark.hadoop.fs.s3a.connection.maximum = 200
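
    Assuming the standard configuration precedence applies, any of these new defaults can still be overridden per application; for example, to restore the previous file output committer algorithm for one job (my-app.jar is a placeholder):

```shell
# Hypothetical override: revert to FileOutputCommitter algorithm
# version 1 for this application only.
spark-submit \
  --conf spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=1 \
  my-app.jar
```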

  • SPAR-2210: Qubole Spark supports the Hive 2.1 metastore for Spark 2.2.x.
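
    Assuming the standard open-source Spark metastore settings apply here, pointing a Spark 2.2.x job at a Hive 2.1 metastore could be sketched as follows (the jar name is illustrative):

```shell
# Hypothetical example: select the Hive 2.1 metastore client,
# resolving the matching client jars from Maven.
spark-submit \
  --conf spark.sql.hive.metastore.version=2.1 \
  --conf spark.sql.hive.metastore.jars=maven \
  my-app.jar
```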

  • SPAR-2217: Spark 2.2.1 is the latest version supported on Qubole Spark and it is reflected as 2.2 latest (2.2.1) on the Spark cluster UI. All 2.2.0 clusters are automatically upgraded to 2.2.1 in accordance with Qubole’s Spark versioning policy.

    Spark 2.2.1 will be rolled out as the latest version in a subsequent patch after the R52 release.

  • SPAR-2264: Spark defaults have been upgraded for a few instance types to ensure executors have at least 8 GB of memory where possible. There is an option to use executors with larger memory (up to 64 GB) on bigger instance types.

  • SPAR-2285: The issue in which the data store information was not pushed into autoscaled nodes has been resolved.

  • SPAR-2332: The snowflake-jdbc and spark-snowflake jars have been updated in Spark 2.2 clusters. The changed versions are:

    snowflake-jdbc: Upgraded from 3.4.0 to 3.5.3
    spark-snowflake_2.11: Upgraded from 2.2.8 to 2.3.0