Spark Structured Streaming

Spark Structured Streaming is a scalable and fault-tolerant stream processing engine built on the Spark SQL engine. It provides fast, end-to-end exactly-once stream processing without requiring the user to reason about streaming. You can express a streaming computation the same way you would express a batch computation on static data: use the Dataset/DataFrame API in Scala, Java, Python, or R to express streaming aggregations, event-time windows, stream-to-batch joins, and so on. The same optimized Spark SQL engine runs the computation incrementally and continuously, updating the final result as streaming data continues to arrive.
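
For example, a minimal streaming word count in PySpark (a sketch based on the standard socket-source pattern; the host and port are placeholders) reads almost exactly like its batch counterpart:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import explode, split

    spark = SparkSession.builder.appName("StructuredWordCount").getOrCreate()

    # Read lines from a socket as an unbounded table (host and port are placeholders).
    lines = (spark.readStream.format("socket")
             .option("host", "localhost")
             .option("port", 9999)
             .load())

    # The same DataFrame operations you would apply to static data.
    words = lines.select(explode(split(lines.value, " ")).alias("word"))
    counts = words.groupBy("word").count()

    # Continuously update the full word counts on the console.
    query = (counts.writeStream
             .outputMode("complete")
             .format("console")
             .start())
    query.awaitTermination()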

Note

Kinesis, Kafka, and file system (S3) client jars are available in Spark in Qubole as part of the basic package.

Spark Structured Streaming on Qubole provides the following capabilities:

  • Support for Spark 2.2 and later versions.
  • Support for long-running tasks, with no 36-hour timeout.
  • Ability to monitor the health of a job.
  • End-to-end exactly-once fault-tolerance guarantees through checkpointing.
  • Support for various input data sources, such as Kafka, file system (S3), Kinesis, and Azure Event Hubs.
  • Support for various data sinks, such as Kafka, file system (S3), Kinesis, and Spark tables.
  • Rotation and aggregation of Spark logs to prevent hard disk space issues.
  • Direct S3 writes for checkpointing (see the sketch after this list).
  • Direct streaming append to Spark tables.
  • Optimized performance of stateful streaming queries using RocksDB.
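
As an illustration of the checkpointing, S3, and RocksDB items above, the sketch below checkpoints a stateful query directly to S3 and opts in to a RocksDB-backed state store. The bucket path is a placeholder, and the provider class shown is the one from open-source Spark 3.2 and later; the exact setting in the Qubole build may differ, so confirm it with Qubole Support:

    # Assumes an active SparkSession bound to `spark` (as in QDS notebooks).
    # NOTE: this provider class is the open-source Spark 3.2+ class; the
    # Qubole build may expose a different class or setting.
    spark.conf.set(
        "spark.sql.streaming.stateStore.providerClass",
        "org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider")

    # A stateful aggregation over the built-in rate source (test data only).
    counts = (spark.readStream.format("rate").load()
              .groupBy("value").count())

    # Checkpoint offsets and state directly to S3 (placeholder bucket).
    query = (counts.writeStream
             .outputMode("complete")
             .format("console")
             .option("checkpointLocation", "s3://your-bucket/checkpoints/demo")
             .start())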

Note

Spark Structured Streaming on Qubole is not enabled for all users by default. Create a ticket with Qubole Support to enable this feature on the QDS account.

When running Spark Structured Streaming jobs, you must understand how to run the jobs and monitor their progress. You can also refer to examples for various data sources on the Notebooks page.

Running Spark Structured Streaming on QDS

You can run Spark Structured Streaming jobs on a Qubole Spark cluster from the Analyze and Notebooks pages, just as you would any other Spark application. You can also submit Spark Structured Streaming jobs through the REST API. For more information, see Submit a Spark Command.
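
For example, a minimal REST submission might look like the following sketch, which assumes the v1.2 commands endpoint described in Submit a Spark Command; the authentication token, cluster label, and program are placeholders:

    import requests

    # Placeholders: substitute your own API token, cluster label, and program.
    payload = {
        "command_type": "SparkCommand",
        "language": "python",
        "label": "spark-streaming-cluster",
        "program": "print('hello from a Spark command')",
    }
    response = requests.post(
        "https://api.qubole.com/api/v1.2/commands",
        headers={
            "X-AUTH-TOKEN": "<your-auth-token>",
            "Content-Type": "application/json",
            "Accept": "application/json",
        },
        json=payload,
    )
    print(response.json())  # the response includes the command ID and its status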

Note

QDS has a 36-hour time limit on every command run. For streaming applications, this limit can be removed; for more information, contact Qubole Support.

Running the Job from the Analyze Page

  1. Navigate to the Analyze page.
  2. Click +Compose.
  3. Select Spark Command from the Command Type drop-down list.
  4. Select the required Spark language from the drop-down list. By default, Scala is selected.
  5. Select Query Statement or Query Path.
  6. Compose the code and click Run to execute it. A minimal example that you can paste is sketched after these steps.
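
For example, the following self-contained query (a sketch built on Spark's built-in rate source, so it needs no external data) can be pasted as a Query Statement with Python selected as the language:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("RateSourceDemo").getOrCreate()

    # The rate source generates (timestamp, value) rows for testing.
    stream = spark.readStream.format("rate").option("rowsPerSecond", 5).load()

    # Print each micro-batch to the command's console output.
    query = (stream.writeStream
             .outputMode("append")
             .format("console")
             .start())

    query.awaitTermination(60)  # run for about a minute, then return
    query.stop()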

For more information on composing a Spark command from the Analyze page, see Composing Spark Commands in Different Spark Languages through the UI.

Running the Job from the Notebooks Page

  1. Navigate to the Notebooks page.
  2. Start your Spark cluster.
  3. Compose your paragraphs and run each paragraph in order by clicking its Run icon.

The following is a sample program that you can run in a notebook paragraph.
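
It is a sketch (not a Qubole-provided example) that reads a Kafka topic and keeps a running count per key; the bootstrap servers and topic name are placeholders, and spark is the SparkSession that QDS notebooks provide:

    # Read a Kafka topic as a stream; bootstrap servers and topic are placeholders.
    events = (spark.readStream.format("kafka")
              .option("kafka.bootstrap.servers", "broker-1:9092")
              .option("subscribe", "demo-topic")
              .load())

    # Keep a running count of records per key.
    counts = (events.selectExpr("CAST(key AS STRING) AS key")
              .groupBy("key").count())

    query = (counts.writeStream
             .outputMode("complete")
             .format("memory")        # queryable with SQL from another paragraph
             .queryName("key_counts")
             .start())

A later paragraph can then inspect the running counts with spark.sql("SELECT * FROM key_counts").show().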

Monitoring Health of Streaming Jobs

For long-running ETL tasks, you should monitor the health of the job or pipeline to understand the following information:

  • Input and output throughput of the Spark cluster, to prevent overflow of incoming data.
  • Latency, which is the time taken to complete a job on the Spark cluster (see the sketch after this list).
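
Both metrics are also exposed programmatically on a running query. The sketch below assumes query is the StreamingQuery handle returned by start(), and reads the most recent progress report:

    import time

    # Wait for a few micro-batches, then read the latest progress report.
    time.sleep(10)
    progress = query.lastProgress  # a dict in PySpark; None before the first batch
    if progress:
        print("input rows/sec:     ", progress["inputRowsPerSecond"])
        print("processed rows/sec: ", progress["processedRowsPerSecond"])
        print("batch duration (ms):", progress["durationMs"]["triggerExecution"])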

When you start a streaming query in a notebook paragraph, the monitoring graph is displayed in the same paragraph.

Note

The streaming query progress graphs feature is not enabled for all users by default. Create a ticket with Qubole Support to enable this feature on the QDS account.

The following figure shows a sample graph displayed on the Notebooks page.

../../_images/streaming-query-notebooks.png

Additionally, you can monitor the streaming query progress graphs in notebooks and on the Grafana dashboard.

Monitoring from the Notebooks Page

  1. Navigate to the Notebooks page.
  2. Click Interpreters.
  3. On the Spark interpreter, click spark ui. The Spark UI opens in a separate tab.
  4. In the Spark UI, click the Streaming Query tab.

The following figure shows a sample Spark UI with details of the streaming jobs.

../../_images/spark-streaming-cluster-webpage.png

Monitoring from the Grafana Dashboard

  1. Navigate to the Clusters page.
  2. Select the required Spark cluster.
  3. Navigate to Overview >> Resources >> Prometheus Metrics.

The Grafana dashboard opens in a separate tab.

The following figure shows a sample Grafana dashboard with the details.

../../_images/Grafana-Streaming-Dashboard.png

Examples

You can refer to the examples that show streaming from various data sources on the Notebooks page.

  1. Log in to https://api.qubole.com/notebooks#home (or the equivalent URL for your QDS environment).

  2. Navigate to Examples >> Streaming.

  3. Depending on the data source, select the appropriate examples from the list.

    Data Source      Name of the Example
    ---------------  ------------------------------
    S3 Source        CloudTrail Log Analysis
    Kafka Source     Kafka Structured Streaming
    Kinesis Source   Kinesis Structured Streaming

Limitations

  • The Logs tab on the Analyze page displays only the first 1500 lines. To view the complete logs, you must log in to the corresponding cluster.
  • Historical logs, events, and dashboards are not displayed.