Understanding the Talend Integration with QDS

The Qubole-Talend integration offers a serverless experience through Qubole’s workload-aware autoscaling, which automatically resizes big data clusters based on the data jobs and pipelines built in Talend Studio.

With this integration, you can use Talend Studio to integrate data from various sources into the cloud data lake, and build data quality workflows that cleanse, mask, and transform data according to business requirements. You can use Qubole Data Service (QDS) as a Hadoop or Spark data processing engine for Talend jobs, as well as an HDFS connection in data preparation jobs.
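Outside of Talend Studio, the same QDS engines can also be driven programmatically. The sketch below uses Qubole’s open-source Python SDK (qds-sdk-py) to run a Hive query on a QDS cluster; it is only an illustration of QDS acting as a processing engine. The API token and cluster label are placeholders, and Talend Studio itself submits jobs through its own connection rather than through this SDK:

    # Illustrative sketch: running a Hive query on QDS with qds-sdk-py.
    # The API token and cluster label below are placeholders.
    from qds_sdk.qubole import Qubole
    from qds_sdk.commands import HiveCommand

    # Authenticate against the QDS Control Plane with the account API token.
    Qubole.configure(api_token="YOUR_QDS_API_TOKEN")

    # run() submits the query to the cluster with the given label and blocks
    # until the command completes, returning the command object.
    cmd = HiveCommand.run(query="SHOW TABLES;", label="hadoop2-cluster")
    print(cmd.status)  # e.g. "done" on success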

Talend Studio is responsible for job management, configuration, monitoring, and history, while QDS is responsible for job execution and logging. The Talend connection bypasses Qubole’s Control Plane and connects directly to the cluster; as a result, the job history is visible only in Talend Studio.

Note

In Talend Studio, a cluster is always referred to as a Hadoop cluster, but on QDS, clusters are categorized as Hadoop2 or Spark clusters for the respective jobs.

When a job is run, Talend Studio connects to the coordinator node of the Qubole cluster and submits the job. During execution, QDS optimizes the cluster size, automatically scaling it up and down based on the requirements of the data preparation job. Cluster management is transparent to Talend Studio and requires no user intervention. If the Qubole cluster is configured to use AWS Spot instances, QDS automatically determines when to bid for, acquire, and rebalance Spot nodes based on those requirements. As a result, execution costs stay low, achieving an optimal price-performance combination.
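The autoscaling range and Spot policy described above are properties of the Qubole cluster itself, not of the Talend job, and can be set in the QDS UI or through the QDS REST API. The Python sketch below illustrates the kind of settings involved; the endpoint and field names are assumed from Qubole’s v1.3 clusters API, and the token, cluster ID, and values are placeholders:

    # Illustrative sketch: tuning a Qubole cluster's autoscaling range and
    # AWS Spot policy via the QDS REST API. Endpoint and field names are
    # assumed from the v1.3 clusters API; token, cluster ID, and values
    # are placeholders.
    import requests

    QDS_API = "https://api.qubole.com/api/v1.3/clusters"
    HEADERS = {
        "X-AUTH-TOKEN": "YOUR_QDS_API_TOKEN",  # QDS account API token
        "Content-Type": "application/json",
    }

    payload = {
        "node_configuration": {
            "initial_nodes": 2,            # floor QDS scales down to
            "max_nodes": 10,               # ceiling QDS scales up to
            "slave_request_type": "spot",  # use AWS Spot instances for workers
            "spot_instance_settings": {
                "maximum_bid_price_percentage": 100,   # bid cap, % of on-demand
                "timeout_for_request": 10,             # minutes to wait for Spot
                "maximum_spot_instance_percentage": 50 # cap Spot share of nodes
            },
        },
    }

    # Update an existing cluster's configuration (cluster ID is a placeholder).
    resp = requests.put(f"{QDS_API}/123", json=payload, headers=HEADERS)
    resp.raise_for_status()
    print(resp.json())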

The following diagram illustrates the Qubole-Talend integration workflow:

../../../../_images/talend-arch.png