Spark Structured Streaming
New Features
SPAR-3615: For Spark structured streaming, users can create a Snowflake data store in the QDS UI and use the corresponding catalog name (instead of passing a username and password) on the QuEST UI or Notebooks UI. The Snowflake dependency JAR is upgraded to support Snowflake structured streaming.
Users can use the following API for Snowflake structured streaming:
    val query = df
      .writeStream
      .option("sfDatabase", <DATABASE NAME>)
      .option("streaming_stage", <TEMP TABLE NAME>)
      .option("checkpointLocation", <CHECKPOINT LOCATION>)
      .snowflake(<QUBOLE DBTAP CATALOG NAME>, <WAREHOUSE NAME>, <TARGET TABLE NAME>)
      .start()
This is supported on Spark 2.3.3, 2.4.0, 2.4.3, and later versions.
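A minimal usage sketch with the placeholders filled in; the catalog, database, stage, warehouse, table, and checkpoint names here are illustrative assumptions:

    val query = df
      .writeStream
      .option("sfDatabase", "SALES_DB")                                          // Snowflake database (illustrative)
      .option("streaming_stage", "SALES_STAGE")                                  // temporary stage table (illustrative)
      .option("checkpointLocation", "s3://example-bucket/checkpoints/snowflake/") // checkpoint path (illustrative)
      .snowflake("snowflake_catalog", "SALES_WH", "SALES_EVENTS")                // DbTap catalog, warehouse, target table
      .start()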
Enhancements
SPAR-3747: Memory-leak issues in the RocksDB-based state store are fixed, and the RocksDB configuration is tuned to improve read performance. The previous RocksDB state-store provider class is deprecated. Users should use the following Spark configuration to enable the RocksDB-based state store:

    spark.sql.streaming.stateStore.providerClass = org.apache.spark.sql.execution.streaming.state.RocksDbStateStoreProvider

This is supported on Spark 2.4.0 and later versions.
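A minimal sketch, assuming the configuration is set when building the session; the application name is an illustrative assumption:

    import org.apache.spark.sql.SparkSession

    // Enable the RocksDB-based state store for stateful streaming queries.
    val spark = SparkSession.builder()
      .appName("rocksdb-state-store-example") // illustrative name
      .config("spark.sql.streaming.stateStore.providerClass",
        "org.apache.spark.sql.execution.streaming.state.RocksDbStateStoreProvider")
      .getOrCreate()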
SPAR-3760: Users should specify the SQS queue region with the region option when reading from an s3-sqs source, in the following format:

    val inputDf = spark
      .readStream
      .format("s3-sqs")
      .schema(schema)
      .option("sqsUrl", queueUrl)
      .option("region", region)
      .option("fileFormat", "json")
      .option("sqsFetchIntervalSeconds", "2")
      .option("sqsLongPollingWaitTimeSeconds", "5")
      .option("maxFilesPerTrigger", "50")
      .option("ignoreFileDeletion", "true")
      .load()

This is supported on Spark 2.4.0, 2.4.3, and later versions.
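The schema, queueUrl, and region values referenced above could be defined as follows; the queue URL, region, and schema fields are illustrative assumptions:

    import org.apache.spark.sql.types._

    // Schema of the JSON files announced on the SQS queue (illustrative).
    val schema = new StructType()
      .add("id", StringType)
      .add("ts", TimestampType)

    val queueUrl = "https://sqs.us-east-1.amazonaws.com/123456789012/example-queue" // illustrative
    val region = "us-east-1"                                                        // illustrative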
SPAR-3555: Users can write a structured streaming query that appends data to a Spark Data Source or Hive table, and then query that table to view the latest data in near real time. Users can use the following API:
    dataset
      .writeStream
      .appendToTable("tableName")
      .option("checkpointLocation", checkpointDir)
      .start()
This is supported on Spark 2.3.2, 2.4.0, 2.4.3, and later versions.
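A minimal sketch, assuming a built-in rate source as input; the table name and checkpoint path are illustrative assumptions:

    // Append a rate-source stream to a table named "events" (illustrative).
    val dataset = spark.readStream.format("rate").load()

    val query = dataset
      .writeStream
      .appendToTable("events")
      .option("checkpointLocation", "s3://example-bucket/checkpoints/events/")
      .start()

    // The table can then be queried to view the latest data.
    spark.sql("SELECT * FROM events ORDER BY timestamp DESC LIMIT 10").show()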
SPAR-3591: Users can pass custom S3 object metadata, such as x-amz-meta-metadata(key1=val1,key2=val2), while creating a new streaming job with S3 as the sink by setting the option fs.s3a.user.metadata as key1=val1,key2=val2. The streaming application creates new files with this metadata. This is supported on Spark 2.3.2 and later versions.
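A minimal sketch, assuming the option is set on the Hadoop configuration and the stream writes JSON to an S3 sink; the bucket, paths, and key/value pairs are illustrative assumptions:

    // Attach custom user metadata to files created by the streaming sink (assumed
    // to be configured via the Hadoop configuration; values are illustrative).
    spark.sparkContext.hadoopConfiguration.set("fs.s3a.user.metadata", "key1=val1,key2=val2")

    val query = df
      .writeStream
      .format("json")
      .option("path", "s3a://example-bucket/stream-output/")
      .option("checkpointLocation", "s3a://example-bucket/checkpoints/s3-sink/")
      .start()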
For a list of bug fixes between versions R56 and R57, see Changelog for api.qubole.com.