Supported Connectors

From the Pipelines UI, you can use the following sources and sinks to build your streaming pipeline.

Note

Kinesis and Kafka client jars are available natively as part of Spark on Qubole.

Sources

  • Apache Kafka

    Apache Kafka is an open-source publish-subscribe messaging system designed to provide fast, scalable, and fault-tolerant handling of real-time data feeds.

    Kafka is generally used for the following applications:

    • Building real-time streaming data pipelines that reliably get data between systems or applications
    • Building real-time streaming applications that transform or react to the streams of data

    When building a streaming pipeline with Kafka as the source, you can specify multiple topic names or a topic pattern, and the input format. JSON and Avro are the supported input formats.

    You can also define the starting point for a new query as latest or earliest.

    To replay from a specific offset, you can set the --reset-offsets configuration on the Edit Cluster Settings >> Advanced Configuration page.

    For more information about Kafka, see the Kafka Documentation.
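A pipeline built this way runs as a Spark Structured Streaming query. As a rough, hypothetical sketch of the Kafka source options described above (the broker addresses and topic pattern below are placeholders, not values from the Pipelines UI):

```python
# Hypothetical sketch: options for a Kafka source in Spark Structured
# Streaming. Broker addresses and the topic pattern are placeholders.
kafka_options = {
    "kafka.bootstrap.servers": "broker1:9092,broker2:9092",  # placeholder
    "subscribePattern": "events.*",  # multiple topics via a pattern
    "startingOffsets": "earliest",   # starting point: "earliest" or "latest"
}

# On a Spark on Qubole cluster, the source would be wired up roughly as:
#
#   df = (spark.readStream
#               .format("kafka")
#               .options(**kafka_options)
#               .load())
#
# The Kafka "value" column arrives as bytes; for JSON input it is typically
# cast to a string and parsed with from_json against a user-supplied schema.
```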

  • Amazon Kinesis

    Amazon Kinesis is an Amazon Web Service (AWS) for collecting and processing large streams of data records in real time.

    When using Amazon Kinesis as the source, you can define the starting point for a new query as latest or earliest. Currently, only the JSON input format is supported.

    For more information about Amazon Kinesis, see Amazon Kinesis Documentation.
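As with Kafka, a Kinesis-backed pipeline is a Spark Structured Streaming query underneath. A minimal sketch, with hypothetical option names and placeholder stream and endpoint values (the exact option names depend on the Kinesis connector in use):

```python
# Hypothetical sketch: options for a Kinesis source. The stream name,
# endpoint URL, and option names are placeholders / assumptions.
kinesis_options = {
    "streamName": "my-stream",                                 # placeholder
    "endpointUrl": "https://kinesis.us-east-1.amazonaws.com",  # placeholder
    "startingPosition": "earliest",  # starting point: "earliest" or "latest"
}

# Roughly:
#
#   df = (spark.readStream
#               .format("kinesis")
#               .options(**kinesis_options)
#               .load())
#
# Since only JSON input is supported, the record payload would then be
# parsed with from_json against a user-supplied schema.
```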

  • S3-SQS

    Amazon S3-SQS is a source that avoids the expensive S3 listing that would otherwise occur on every microbatch. Instead of listing the S3 bucket directly, the S3-SQS source discovers new files by reading S3 event notifications from an Amazon SQS queue.
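A sketch of the idea, with a hypothetical queue URL and option names (the exact names depend on the S3-SQS connector build):

```python
# Hypothetical sketch: an S3-SQS source reads S3 event notifications from
# an SQS queue instead of listing the bucket each microbatch. The queue URL
# and option names are placeholders.
sqs_options = {
    "sqsUrl": "https://sqs.us-east-1.amazonaws.com/<account-id>/new-files",
    "fileFormat": "json",  # format of the files landing in S3
    "region": "us-east-1",
}

# Roughly:
#
#   df = (spark.readStream
#               .format("s3-sqs")
#               .schema(user_schema)   # schema of the incoming files
#               .options(**sqs_options)
#               .load())
```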

Sinks

  • S3

    Amazon S3 can be used as a sink to store large volumes of streaming data.

    When building the pipeline in the Pipelines UI, you can partition your streaming data by columns and store it in JSON, CSV, or Parquet format on S3.

    For more information about S3, see Amazon S3 Documentation.
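As a rough, hypothetical sketch of such a sink (the bucket paths and partition columns below are placeholders):

```python
# Hypothetical sketch: writing a streaming DataFrame to S3, partitioned by
# columns. The bucket paths and partition columns are placeholders.
sink_options = {
    "path": "s3://my-bucket/streaming-output/",          # placeholder bucket
    "checkpointLocation": "s3://my-bucket/checkpoints/", # placeholder bucket
}

# Roughly:
#
#   query = (df.writeStream
#               .format("parquet")               # or "json" / "csv"
#               .partitionBy("date", "country")  # placeholder columns
#               .options(**sink_options)
#               .start())
```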

Additionally, the following supported connectors can be used as sinks:

  • Kafka
  • Event Hub
  • MongoDB
  • Hive table
  • Redshift
  • Snowflake
  • ElasticSearch

For more details about using these supported connectors as sinks, contact Qubole Support.