Spark Structured Streaming¶
In this release, Spark Structured Streaming includes:
- Error Sink (new feature)
- Enhancements
- Bug fixes
Error Sink¶
SPAR-4534: Added support for detecting schema mismatches and invalid records in the input data stream from Kafka. Qubole filters out such records and writes the metadata of each bad record, along with the exception message, to a configurable cloud storage location. This feature is supported in Spark version 2.4.3 and later.
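Conceptually, the Error Sink separates records that match the expected schema from bad ones and persists metadata plus the exception message for each bad record. The sketch below illustrates that split in plain Python; the schema, function, and field names are illustrative, not Qubole's actual API.

```python
import json

# Illustrative expected schema: field name -> required Python type.
EXPECTED_FIELDS = {"id": int, "event": str}

def split_valid_invalid(raw_records):
    """Separate records matching the expected schema from bad ones,
    capturing metadata and the exception message for each bad record."""
    valid, errors = [], []
    for offset, raw in enumerate(raw_records):
        try:
            record = json.loads(raw)
            for field, ftype in EXPECTED_FIELDS.items():
                if not isinstance(record[field], ftype):
                    raise TypeError(f"field {field!r} is not {ftype.__name__}")
            valid.append(record)
        except Exception as exc:
            # Metadata of the bad record plus the exception message,
            # analogous to what the Error Sink would write to cloud storage.
            errors.append({"offset": offset, "raw": raw, "error": str(exc)})
    return valid, errors

good, bad = split_valid_invalid([
    '{"id": 1, "event": "click"}',
    '{"id": "oops", "event": "click"}',  # schema mismatch
    'not-json',                          # invalid record
])
```

In the real feature, the `errors` side would land in the configured cloud storage location while `good` continues through the streaming query.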
Enhancements¶
- SPAR-4289: Log Rolling and Aggregation are enabled by default for Spark Streaming clusters. Spark executor logs are available in `spark.log` files instead of `stderr` for Spark applications running on these clusters. This change makes the `spark.log` file accessible from the Spark UI's Executors and Tasks page, along with the existing links for `stderr` and `stdout`. The enhancement applies to Spark versions 2.4.3 and 3.0.0.
- New metrics for streaming jobs:
  - Kafka offset lag at the per-topic and per-partition level
  - Number of input rows for a batch
  - Number of invalid rows for a batch, when Error Sink is enabled
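For reference, executor log rolling in open-source Spark is controlled by properties such as the following (values here are illustrative; Qubole enables equivalent settings by default on these clusters):

```
spark.executor.logs.rolling.strategy           time
spark.executor.logs.rolling.time.interval      hourly
spark.executor.logs.rolling.maxRetainedFiles   48
```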
Bug Fixes¶
- SPAR-4493 and SPAR-4526: Enhancements and improvements in the Kinesis Connector for Structured Streaming:
  - Over 10x performance improvement in ingesting data into the Kinesis Sink.
  - Added the ability to start a new streaming job from a specified Kinesis offset; previously, Qubole supported only the `earliest` and `latest` positions.
  - Graceful handling of deleted shards when a streaming job is restarted.
  - Fixed issues with retries on Kinesis exceptions with status code >= 500.
  - Support for session-based authentication.