Using StreamX (AWS)

StreamX captures data from Kafka logs and preserves it in a cloud store (currently Amazon S3), where it can be processed by engines such as Hive or Spark.

Why Use StreamX?

StreamX is a managed, scalable, and reliable open-source service, built on the Kafka Connect framework and running in a dedicated Qubole Data Service (QDS) cluster. It provides streaming ingest that is ready to use with minimal configuration or maintenance.

Major features include:

  • Output in Avro or Parquet
  • Output can be partitioned into multiple topics, and written to multiple paths in the cloud store

How to Use StreamX

Proceed as follows to configure and start StreamX:

  1. If you have not already done so, install Kafka and start a Kafka cluster.
  2. Use these instructions to create a persistent AWS security group. Then, in the security configuration of your Kafka cluster, open ports 2181 (ZooKeeper), 8081 (schema registry), and 9092 (Kafka server) to members of that group.
  3. In the QDS UI, navigate to Clusters and choose New.
  4. Choose StreamX as the cluster type.
  5. Accept the default Kafka version or use the drop-down to change it.
  6. Specify the Kafka brokers as a comma-separated list of DNS names, each in the form <fully-qualified-domain-name>:<port>.
  7. Choose the instance types for master and slave nodes from the drop-downs.
  8. Specify the number of nodes in the cluster. (StreamX clusters do not support auto-scaling.)
  9. Assuming that your Kafka cluster is running on AWS, select the same AWS Region and Availability Zone for the QDS cluster as for the Kafka cluster.
  10. Optionally specify the number, type, and size of EBS reserved volumes to be mounted to each instance as additional storage.
  11. Provide a file name to be appended to the path for the node bootstrap file, or accept the default. Click Next to proceed.

Note

Automatic cluster termination is disabled in StreamX clusters, but you can stop the cluster manually.

  12. If your Kafka cluster is running in an AWS VPC, specify the same VPC for the QDS cluster.
  13. Optionally specify parameter values to override the Kafka Connect configuration defaults (the UI provides an example).
  14. Provide the name of the Persistent Security Group you created in step 2.
  15. When you are satisfied with the configuration, click Create. (For more information on optional fields, see Configuring Clusters.)
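
Before creating the cluster, it can help to sanity-check the broker list from step 6 and the ports opened in step 2. The following is a minimal Python sketch, not part of StreamX or QDS: the helper names parse_brokers and port_is_open and the host names are hypothetical, and the check only works from a machine that is allowed through the security group.

```python
import socket

# Ports and roles as listed in step 2.
REQUIRED_PORTS = {2181: "ZooKeeper", 8081: "schema registry", 9092: "Kafka server"}


def parse_brokers(broker_list):
    """Split a comma-separated 'host:port' string into (host, port) tuples."""
    brokers = []
    for entry in broker_list.split(","):
        entry = entry.strip()
        if not entry:
            continue
        host, sep, port = entry.rpartition(":")
        if not sep or not host or not port.isdigit():
            raise ValueError(f"expected host:port, got {entry!r}")
        brokers.append((host, int(port)))
    if not brokers:
        raise ValueError("broker list is empty")
    return brokers


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS lookup failures.
        return False


# Hypothetical broker names, used here only to show the expected format.
brokers = parse_brokers("kafka1.example.com:9092, kafka2.example.com:9092")
print(brokers)
```

To probe a host, call port_is_open(host, port) for each entry in REQUIRED_PORTS. ZooKeeper and the schema registry may run on different hosts from the brokers, so adjust the host for each port to match your deployment.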