Introduction to Qubole Pipelines Service

Qubole Pipelines Service is a Data Engineering product offered by Qubole. It is a Stream Processing Platform-as-a-Service for real-time ingestion and reporting use cases. Qubole Pipelines Service helps Data Engineers operationalize complex streaming ETL workloads. It moves data in real time, applying transformations along the way, from sources such as Kafka and Kinesis to targets such as S3, Hive, and Snowflake.

Qubole Pipelines Service provides the following benefits:

  • Lower TCO at high performance: Qubole’s Spark Structured Streaming-as-a-service provides advanced auto-scaling capabilities and reliability with exactly-once semantics.
  • Higher Productivity: A management platform to build, test, deploy, monitor, and manage streaming ETL pipelines.
  • Wide Connectivity: Lower Time-to-Insight with a wide range of connectors to Real-Time, NoSQL, and Data Warehouse stores to support end-to-end real-time analytics.
  • Open Source standard conformity and Multi-Cloud: To future-proof your solution, align your workloads with OSS standards so that they can be seamlessly ported from one public cloud to another.
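As an illustration of what exactly-once semantics means in practice, the following is a minimal, self-contained sketch and not Qubole's implementation. It assumes a replayable source, an offset checkpoint that is committed only after a successful write, and a sink that stores each batch under its batch ID so that a replayed batch overwrites itself. All names (IdempotentSink, run_pipeline, fail_before_commit_of) are hypothetical.

```python
class IdempotentSink:
    """Hypothetical sink: each batch is stored under its batch id,
    so writing the same batch twice leaves a single copy."""

    def __init__(self):
        self.batches = {}

    def write(self, batch_id, records):
        self.batches[batch_id] = list(records)

    def all_records(self):
        return [r for b in sorted(self.batches) for r in self.batches[b]]


def run_pipeline(source, sink, checkpoint, batch_size, fail_before_commit_of=None):
    """Process `source` in fixed-size batches, resuming from the checkpointed offset."""
    while checkpoint["offset"] < len(source):
        start = checkpoint["offset"]
        batch_id = start // batch_size
        batch = source[start:start + batch_size]

        # Transform and write first; commit the offset only afterwards.
        sink.write(batch_id, [record.upper() for record in batch])

        if batch_id == fail_before_commit_of:
            raise RuntimeError("simulated crash after write, before checkpoint commit")

        checkpoint["offset"] = start + len(batch)


source = ["a", "b", "c", "d", "e"]   # stands in for a replayable stream
sink = IdempotentSink()
checkpoint = {"offset": 0}           # stands in for durable checkpoint storage

try:
    # First run crashes on the second batch after it has already been written.
    run_pipeline(source, sink, checkpoint, batch_size=2, fail_before_commit_of=1)
except RuntimeError:
    pass  # a restart resumes from the last committed offset

# Second run replays the uncommitted batch; the sink deduplicates it by batch id.
run_pipeline(source, sink, checkpoint, batch_size=2)
result = sink.all_records()
```

Although the second batch is written twice, each record appears in the sink once, because the replay is keyed by the same batch ID.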

Note

Qubole Pipelines Service is enabled for all users by default as a BETA feature.

The primary users of Qubole Pipelines Service are Data Engineers and the secondary users are Data Scientists, Data Analysts, and ETL developers.

  • Data Engineers with coding expertise in Scala, Python, or Java can use this platform to operationalize business-critical workloads.
  • Data Scientists, Data Analysts, and ETL developers can perform simple ingestions without having to write a single line of code, and subscribe to a streaming topic from Kafka or Kinesis.

Key Capabilities

To operationalize your business-critical pipelines, Qubole Pipelines Service provides the following capabilities in addition to those offered by open source Apache Spark Structured Streaming:

  • Reliability:
    • Retry on Failure: Jobs are restarted on intermittent failures.
    • Real-Time Monitoring: Integrated monitoring with Prometheus.
    • Reliable checkpoint management on the underlying file system: Handles S3 eventual consistency issues.
    • Prevent Disk Errors and support debugging: Log rolling and aggregation on cloud storage.
  • Performance and Cost: RocksDB support. Optimized performance for stateful streaming joins.
  • Auto-scaling: Auto-scaling to handle bursts and periods of inactivity.
  • Parity with Open Source Spark: Spark Structured Streaming is supported on Spark versions 2.2, 2.3, and 2.4.
  • Formats: JSON, Avro, and Parquet.
  • Connectivity: The supported connectors are Apache Kafka, Amazon Kinesis Data Streams, Amazon S3, Amazon Redshift, Apache Hive, Snowflake, Azure Event Hubs, Azure Storage (Blob Storage and Data Lake Storage Gen2), and Google Cloud Storage.
  • Enterprise Productivity:
    • Bring your own code/jar workflow for data engineers and assisted pipeline workflow for other users.
    • Capability to view the list or history of all pipelines with their instances and states.
    • Real-time insights into pipelines, alerts to various channels, and integrated debugging with the Analyze UI.
    • Dev/Test workflow that lets you run multiple tests until the required quality and business logic are met, without having to constantly clean checkpoint data and sink data.
    • Ability to invoke reusable User Defined Functions.
    • Save as draft and edit pipelines.
  • Administration Controls:
    • Security: Access Control Lists (ACL) on streaming jobs for different users.
    • Dashboard: View the real-time progress of an instance of any pipeline on the integrated Grafana dashboard.
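The Retry on Failure capability listed above can be pictured with the following minimal sketch. This document does not describe the actual retry policy used by Qubole Pipelines Service, so the attempt limit, the exponential backoff, and the names run_with_retries and flaky_job are assumptions made purely for illustration.

```python
import time


def run_with_retries(job, max_attempts=3, base_delay_s=0.01, sleep=time.sleep):
    """Run `job`, restarting it on transient errors with exponential backoff.

    Hypothetical helper: only ConnectionError is treated as intermittent;
    any other exception propagates immediately.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return job()
        except ConnectionError:
            if attempt == max_attempts:
                raise  # give up once the attempt budget is exhausted
            sleep(base_delay_s * 2 ** (attempt - 1))


attempts = {"count": 0}


def flaky_job():
    """Stand-in for a streaming job that fails twice before succeeding."""
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("transient broker error")
    return "completed"


status = run_with_retries(flaky_job)
```

Restricting retries to errors known to be transient keeps genuine defects, such as a bug in the transformation logic, from being retried indefinitely.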

For more information about the Structured Streaming solution in Spark on Qubole, see Spark Structured Streaming.