Supported Connectors
From the Pipelines UI, you can use the following sources and sinks to build your streaming pipeline.
Note
Kinesis and Kafka client jars are available natively as part of Spark on Qubole.
Sources
Apache Kafka
Apache Kafka is an open-source publish-subscribe messaging system designed to provide fast, scalable, and fault-tolerant handling of real-time data feeds.
Kafka is generally used for the following applications:
Building real-time streaming data pipelines that reliably get data between systems or applications
Building real-time streaming applications that transform or react to the streams of data
When building a streaming pipeline with Kafka as the source, you can specify multiple topic names or a topic pattern, along with the input format. JSON and Avro are the supported input formats.
You can also define the starting point for a new query as latest or earliest.
To replay from a specific offset, you can set the --reset-offsets configuration from the Edit Cluster Settings >> Advanced Configuration page.
For more information about Kafka, see Kafka Documentation.
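The topic, pattern, and starting-point choices above correspond to options of the Kafka source in Spark Structured Streaming. The following is a minimal sketch of that mapping, not the code the Pipelines UI generates. The helper names `kafka_source_options` and `read_kafka` and the broker address are hypothetical; the option keys (`kafka.bootstrap.servers`, `subscribe`, `subscribePattern`, `startingOffsets`) are the standard Spark ones.

```python
def kafka_source_options(bootstrap_servers, topics=None, pattern=None,
                         starting_offsets="latest"):
    """Build the option map for Spark's Kafka source (hypothetical helper)."""
    # Exactly one of a topic list or a topic pattern must be given.
    if (topics is None) == (pattern is None):
        raise ValueError("give exactly one of topics or pattern")
    if starting_offsets not in ("latest", "earliest"):
        raise ValueError("starting_offsets must be 'latest' or 'earliest'")

    options = {
        "kafka.bootstrap.servers": bootstrap_servers,
        "startingOffsets": starting_offsets,  # applies to new queries only
    }
    if topics is not None:
        # Multiple topic names are passed as a comma-separated list.
        options["subscribe"] = ",".join(topics)
    else:
        # A Java regular expression that matches topic names.
        options["subscribePattern"] = pattern
    return options


def read_kafka(spark, options):
    """Create a streaming DataFrame from an existing SparkSession."""
    return spark.readStream.format("kafka").options(**options).load()
```

For example, `kafka_source_options("broker1:9092", topics=["orders", "payments"], starting_offsets="earliest")` subscribes to two topics and starts a new query from the earliest available offsets.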
Amazon Kinesis
Amazon Kinesis is an Amazon Web Services (AWS) service for collecting and processing large streams of data records in real time.
When using Amazon Kinesis as the source, you can define the starting point for a new query as latest or earliest. Currently, only the JSON input format is supported.
For more information about Amazon Kinesis, see Amazon Kinesis Documentation.
S3-SQS
Amazon S3-SQS is a source that avoids the expensive S3 listing that would otherwise occur in every micro-batch. Instead of listing the S3 bucket directly, the S3-SQS source discovers new files by reading file events from Amazon SQS.
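To illustrate the idea, the sketch below turns one SQS message body into a list of S3 file paths without listing the bucket. It assumes the queue receives standard S3 event notifications in JSON form; the function name `files_from_sqs_message` is hypothetical, and this is not the connector's actual implementation.

```python
import json
from urllib.parse import unquote_plus


def files_from_sqs_message(body):
    """Return s3:// paths for objects created, given one notification body."""
    event = json.loads(body)
    paths = []
    # Test events and other non-file messages carry no "Records" list.
    for record in event.get("Records", []):
        # Only newly created objects are new input files.
        if not record.get("eventName", "").startswith("ObjectCreated"):
            continue
        bucket = record["s3"]["bucket"]["name"]
        # Object keys are URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        paths.append(f"s3://{bucket}/{key}")
    return paths
```

Each micro-batch can then process only the paths gathered from the queue, so the cost no longer grows with the number of objects already in the bucket.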
Sinks
S3
Amazon S3 can be used as a data sink for storing large volumes of streaming data.
When building the pipeline in the Pipelines UI, you can partition your streaming data by columns and store the data on S3 in JSON, CSV, or Parquet format.
For more information about S3, see Amazon S3 Documentation.
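In Spark Structured Streaming terms, the format and partition choices for the S3 sink roughly correspond to a file sink with `partitionBy`. The sketch below shows that correspondence under stated assumptions: the helper names `s3_sink_settings` and `start_s3_sink` and any bucket paths are hypothetical, and the Pipelines UI may configure the sink differently.

```python
SUPPORTED_FORMATS = ("json", "csv", "parquet")


def s3_sink_settings(path, checkpoint_location, fmt="parquet",
                     partition_columns=()):
    """Validate and collect the settings for an S3 file sink."""
    if fmt not in SUPPORTED_FORMATS:
        raise ValueError(f"format must be one of {SUPPORTED_FORMATS}")
    return {
        "format": fmt,
        "path": path,
        # Spark's file sink needs a checkpoint location to track progress.
        "checkpointLocation": checkpoint_location,
        "partitionBy": list(partition_columns),
    }


def start_s3_sink(df, settings):
    """Start writing a streaming DataFrame to S3 with the given settings."""
    writer = (df.writeStream
              .format(settings["format"])
              .option("path", settings["path"])
              .option("checkpointLocation", settings["checkpointLocation"]))
    if settings["partitionBy"]:
        # Produces one directory level per column, e.g. event_date=.../
        writer = writer.partitionBy(*settings["partitionBy"])
    return writer.start()
```

Partitioning by a low-cardinality column such as a date usually keeps the number of output directories manageable.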
Additionally, the following supported connectors can be used as sinks:
Kafka
Event Hub
MongoDB
Hive table
Redshift
Snowflake
Elasticsearch
For more details about using these supported connectors as sinks, contact Qubole Support.