Spark Structured Streaming

New Features

State Store Implementation

SPAR-3024: The performance of stateful Structured Streaming jobs is improved by a RocksDB-based state store implementation. You can enable the RocksDB-based state store by setting the following Spark configuration before starting the streaming query: --conf spark.sql.streaming.stateStore.providerClass=org.apache.spark.sql.execution.streaming.state.qubole.RocksDbStateStoreProvider. This feature is supported on Spark 2.4 and later versions.
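
For example, a streaming word count with the RocksDB provider enabled could look like the following sketch in Scala. The provider class name is taken from the note above; the socket source, host, and port are placeholders used only to produce a stateful aggregation.

    import org.apache.spark.sql.SparkSession

    // Illustrative sketch: enable the RocksDB-based state store provider before
    // starting any streaming query. The provider class name comes from the release
    // note above; the socket source and aggregation below are placeholders.
    val spark = SparkSession.builder()
      .appName("rocksdb-state-store-example")
      .config("spark.sql.streaming.stateStore.providerClass",
              "org.apache.spark.sql.execution.streaming.state.qubole.RocksDbStateStoreProvider")
      .getOrCreate()

    // A stateful aggregation (groupBy + count) exercises the state store.
    val counts = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 9999)
      .load()
      .groupBy("value")
      .count()

    val query = counts.writeStream
      .outputMode("complete")
      .format("console")
      .start()

    query.awaitTermination()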

Amazon S3-SQS Integration

SPAR-2918: The Amazon S3-SQS source, which avoids the expensive S3 listing that otherwise occurs in every microbatch, is now supported. The S3-SQS source discovers new files by reading notifications from an Amazon SQS queue instead of listing the S3 bucket directly. As a prerequisite, you must configure the S3 bucket to send Object Created notifications to the SQS queue.

This feature is supported on Spark 2.4 and later versions.
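
A minimal sketch of reading from the S3-SQS source is shown below. The source short name (s3-sqs) and the option names (sqsUrl, region, fileFormat), as well as the queue URL and schema, are assumptions for illustration; refer to the connector documentation for the exact names.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.types._

    val spark = SparkSession.builder()
      .appName("s3-sqs-source-example")
      .getOrCreate()

    // Schema of the JSON files landing in the S3 bucket (placeholder).
    val schema = new StructType()
      .add("id", LongType)
      .add("payload", StringType)

    val events = spark.readStream
      .format("s3-sqs")                               // assumed source short name
      .schema(schema)
      .option("sqsUrl",
              "https://sqs.us-east-1.amazonaws.com/123456789012/my-queue") // assumed option, placeholder URL
      .option("region", "us-east-1")                  // assumed option
      .option("fileFormat", "json")                   // assumed option
      .load()

    val query = events.writeStream
      .format("console")
      .outputMode("append")
      .start()

    query.awaitTermination()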

Enhancements

SPAR-3284: In the Spark on Qubole Structured Streaming solution, the offset file of the last microbatch is fetched while constructing a new microbatch. With a file stream as the data source and S3 as the checkpoint location, the fetch operation could fail with an IllegalArgumentException because of S3's eventual consistency. With this fix, the probability of such eventual-consistency issues is reduced. You can optionally skip fetching the previous batch offset where it has no subsequent effect and can be safely ignored. Via Support.

Bug Fixes

  • SPAR-3375: The Kinesis connector commits processed offsets to a checkpoint location, which is usually an S3 path. Retries with configurable timeouts are added to handle eventual-consistency issues when committing to an S3 path; see the sketch after this list. This change is supported in Spark 2.3.2 and 2.4.0.
  • SPAR-1868: Auto-retries are added for Spark streaming applications that might fail with intermittent errors. Via Support.
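
The following sketch shows where the committed offsets live for a Kinesis query: checkpointLocation is the standard Structured Streaming option, and the retries described in SPAR-3375 apply when that location is an S3 path. The kinesis source name and its options (streamName, endpointUrl) follow the open-source qubole/kinesis-sql connector and, like the stream name and S3 paths, are assumptions used only for illustration.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("kinesis-checkpoint-example")
      .getOrCreate()

    val records = spark.readStream
      .format("kinesis")                                              // assumed source name
      .option("streamName", "my-kinesis-stream")                      // assumed option, placeholder stream
      .option("endpointUrl", "https://kinesis.us-east-1.amazonaws.com") // assumed option
      .load()

    val query = records.writeStream
      .format("parquet")
      .option("path", "s3://my-bucket/output/")                       // placeholder output path
      // Processed offsets are committed under this checkpoint location.
      .option("checkpointLocation", "s3://my-bucket/checkpoints/kinesis-query/") // placeholder path
      .start()

    query.awaitTermination()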

For a list of bug fixes between versions R55 and R56, see Changelog for api.qubole.com.