Spark Structured Streaming

In this release, Spark Structured Streaming includes the following changes:

Error Sink

SPAR-4534: Added support for detecting schema mismatches and invalid records in the input data stream from Kafka. Qubole filters out such records and writes the metadata of each bad record, along with the exception message, to a configurable cloud storage location. This feature is supported in Spark versions 2.4.3 and later.
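
The behavior described above can be sketched in plain Python: records that fail schema validation are filtered out of the batch, and the bad record's metadata plus the exception message are captured for a side location. All names here (`validate`, `process_batch`, the field schema) are illustrative assumptions, not Qubole's actual API.

```python
import json
from datetime import datetime, timezone

# Assumed record schema for illustration only.
EXPECTED_FIELDS = {"id": int, "event": str}

def validate(record: dict) -> None:
    """Raise ValueError if the record does not match the expected schema."""
    for field, ftype in EXPECTED_FIELDS.items():
        if field not in record:
            raise ValueError(f"missing field: {field}")
        if not isinstance(record[field], ftype):
            raise ValueError(f"bad type for {field}: {type(record[field]).__name__}")

def process_batch(raw_records, error_records):
    """Split a micro-batch into valid rows and bad-record metadata entries."""
    valid = []
    for offset, raw in enumerate(raw_records):
        try:
            record = json.loads(raw)
            validate(record)
            valid.append(record)
        except ValueError as exc:  # json.JSONDecodeError is a ValueError
            # Keep the bad record's metadata and the exception message,
            # analogous to what the Error Sink writes to cloud storage.
            error_records.append({
                "offset": offset,
                "raw": raw,
                "error": str(exc),
                "timestamp": datetime.now(timezone.utc).isoformat(),
            })
    return valid

batch = ['{"id": 1, "event": "click"}', '{"id": "x"}', "not json"]
errors = []
rows = process_batch(batch, errors)
```

Here the second record fails type validation and the third fails JSON parsing, so only the first row reaches the output while two metadata entries are recorded.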

Enhancements

  • SPAR-4289: Log Rolling and Aggregation are enabled by default for Spark Streaming clusters. Spark executor logs are written to spark.log files instead of stderr for Spark applications running on these clusters. This change makes the spark.log file accessible from the Spark UI's Executors and Tasks pages, along with the existing links for stderr and stdout. The enhancement applies to Spark versions 2.4.3 and 3.0.0.
  • The following streaming metrics are now available:
    • Kafka offset lag at the per-topic and per-partition level
    • Number of input rows for a batch
    • Number of invalid rows for a batch, when Error Sink is enabled
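
As a rough illustration of the offset-lag metric: per-partition lag is the log-end offset minus the committed offset, and the topic-level figure aggregates the partitions. The offset values and helper names below are made up for the example.

```python
# Latest (log-end) and committed offsets keyed by (topic, partition);
# the numbers are illustrative, not real broker data.
latest_offsets = {("events", 0): 1200, ("events", 1): 980, ("audit", 0): 50}
committed_offsets = {("events", 0): 1150, ("events", 1): 980, ("audit", 0): 20}

def partition_lag(latest, committed):
    """Per-partition lag: log-end offset minus committed offset."""
    return {tp: latest[tp] - committed.get(tp, 0) for tp in latest}

def topic_lag(per_partition):
    """Aggregate per-partition lag up to the topic level."""
    totals = {}
    for (topic, _), lag in per_partition.items():
        totals[topic] = totals.get(topic, 0) + lag
    return totals

per_partition = partition_lag(latest_offsets, committed_offsets)
per_topic = topic_lag(per_partition)
```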

Bug Fixes

  • SPAR-4493 and SPAR-4526: Enhancements and improvements to the Kinesis connector for Structured Streaming:
    • Over 10x performance improvement when ingesting data into the Kinesis sink.
    • Added the ability to start a new streaming job from a specified Kinesis offset. Previously, Qubole supported only the earliest and latest positions.
    • Graceful handling of deleted shards when the streaming job is restarted.
    • Fixed issues with retries on Kinesis exceptions with a status code >= 500.
    • Added support for session-based authentication.
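
The retry fix above follows a common pattern: server-side errors (status code >= 500) are retried with exponential backoff, while client errors propagate immediately. This is a hedged sketch of that pattern; `KinesisError` and the backoff parameters are illustrative, not the connector's actual types.

```python
import time

class KinesisError(Exception):
    """Illustrative stand-in for an exception carrying an HTTP status code."""
    def __init__(self, status_code, message=""):
        super().__init__(message)
        self.status_code = status_code

def call_with_retries(fn, max_attempts=4, base_delay=0.01):
    """Retry fn on status >= 500 with exponential backoff; re-raise otherwise."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except KinesisError as exc:
            if exc.status_code < 500 or attempt == max_attempts - 1:
                raise  # client errors and exhausted retries propagate
            time.sleep(base_delay * (2 ** attempt))

attempts = []
def flaky():
    # Fails twice with a 503, then succeeds.
    attempts.append(1)
    if len(attempts) < 3:
        raise KinesisError(503, "service unavailable")
    return "ok"

result = call_with_retries(flaky)
```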