Sample Use Case for Creating a Job

This topic explains how to create a Spark job after the cluster has been set up on the Talend server. The sample use case creates a job that filters data stored in S3 by using Talend and Qubole.

Prerequisites

The Spark cluster must have been configured on the Talend server. See Configuring Talend to Interact with QDS.

Note

If you want to submit jobs remotely, ensure that the Talend JobServer is installed.

Parameters for this sample use case:

  • Name of the cluster connecting to QDS: qubole_hadoop_cluster.
  • HDFS connection: qubole_hdfs.
  • Name of the sample job: spark_sample_job.

Note

Hive components are not supported on the Spark cluster. Therefore, you must not use Hive tables in Spark jobs.

Steps

Perform the following steps to create a Spark job:

  1. In the Talend Studio, navigate to Repository >> Job Design.
  2. Right-click on Big Data Batch, and select Create Big Data Batch Job.
  3. Enter the name, purpose, and description for the job in the respective fields as shown in the following figure, and click Finish.
../../../../_images/sample-spark-job.png
  4. In the new Studio editor, navigate to Repository >> Metadata >> Hadoop Cluster.
  5. Select the Hadoop cluster that was configured, for example, qubole_hadoop_cluster.
  6. Drag the HDFS connection (for example, qubole_hdfs) to the Designer pane. The Components pop-up box is displayed as shown in the following figure.
../../../../_images/components.png
  7. Select tHDFSConfiguration and click OK.
  8. Click Finish.
  9. Search for tS3Configuration in the Palette sidebar of the Designer pane, as shown in the following figure.
../../../../_images/palette.png
  10. Configure the tS3Configuration component:

    1. Drag the tS3Configuration component to the Designer pane.
    2. Select tS3Configuration, and click on the Component tab.
    3. Enter the access key, secret key, bucket name, and temp folder in the respective fields as shown in the following figure.
    ../../../../_images/ts3config.png
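
For orientation only, the fields in tS3Configuration correspond roughly to the S3 credentials that the generated Spark job passes to the s3a connector. The following is a minimal Scala sketch with hypothetical placeholder values; Talend generates and manages the actual code:

    import org.apache.spark.sql.SparkSession

    // Hypothetical placeholders for the values entered in tS3Configuration.
    val spark = SparkSession.builder()
      .appName("spark_sample_job")
      .config("spark.hadoop.fs.s3a.access.key", "YOUR_ACCESS_KEY")
      .config("spark.hadoop.fs.s3a.secret.key", "YOUR_SECRET_KEY")
      .getOrCreate()

    // The bucket name and temp folder are used later when building
    // s3a:// paths, for example s3a://<bucket>/<temp-folder>/.
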
  11. Search for tFileInputDelimited in the Palette sidebar of the Designer pane.

  12. Configure the tFileInputDelimited component:

    1. Drag the tFileInputDelimited component to the Designer pane.
    2. Select tFileInputDelimited, and click on the Component tab.
    3. From Define a storage configuration component, select tS3Configuration_1 for AWS or tAzureFSConfiguration for Azure.
    4. Enter the name of the file to be filtered in the Folder/File field.
    5. Select the appropriate Row Separator and Field Separator.

    The following figure shows the sample values.

    ../../../../_images/tfileinput.png
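
In Spark terms, this step reads a delimited file from the configured storage. A minimal Scala sketch, assuming the s3a connector and the session from the previous sketch; the path and separator below are hypothetical:

    // Reuses the SparkSession (`spark`) from the tS3Configuration sketch above.
    // The path and Field Separator are hypothetical placeholders.
    val rows = spark.read
      .option("sep", ";")        // Field Separator
      .option("header", "false") // treat the first line as data
      .csv("s3a://my-bucket/input/customers.csv")
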
  13. For Azure, update the storage account name, access key, and container for tAzureFSConfiguration as shown in the following figure.
../../../../_images/azure_fsconfig.png
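
For Azure, the equivalent sketch supplies the storage account key and reads from a wasbs:// path; the storage account, container, and key below are hypothetical:

    import org.apache.spark.sql.SparkSession

    // Hypothetical storage account, access key, and container.
    val spark = SparkSession.builder()
      .appName("spark_sample_job")
      .config("spark.hadoop.fs.azure.account.key.mystorageaccount.blob.core.windows.net",
              "YOUR_ACCESS_KEY")
      .getOrCreate()

    val rows = spark.read
      .option("sep", ";")
      .csv("wasbs://mycontainer@mystorageaccount.blob.core.windows.net/input/customers.csv")
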
  14. Search for tFilterRow in the Palette sidebar of the Designer pane.

  15. Configure the tFilterRow component:

    1. Drag the tFilterRow component to the Designer pane.
    2. Select tFilterRow, and click on the Component tab.
    3. Enter the details as shown in the following figure.
    ../../../../_images/tfilterrow.png
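
Conceptually, tFilterRow applies a row-level condition to the incoming rows. A minimal Scala sketch with a hypothetical condition (the real condition is whatever you define in the component):

    import org.apache.spark.sql.functions.col

    // `rows` is the Dataset read in the tFileInputDelimited sketch above.
    // The column name and threshold are hypothetical.
    val filtered = rows.filter(col("_c2") > 30)
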
  16. Search for tLogRow in the Palette sidebar of the Designer pane.

  17. Configure the tLogRow component:

    1. Drag the tLogRow component to the Designer pane.
    2. Select tLogRow, and click on the Component tab.
    3. Enter the details as shown in the following figure.
    ../../../../_images/tlogrow.png
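
tLogRow prints the rows that reach it to the console; in Spark terms this is roughly:

    // `filtered` is the Dataset produced by the tFilterRow sketch above.
    // Show up to 20 rows without truncating long values.
    filtered.show(20, false)
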

The following figure shows the sample workflow for AWS.

../../../../_images/sample-workflow.png

The following figure shows the sample workflow for Azure.

../../../../_images/sample-workflow-azure.png
  18. From the Designer pane, click Run (Job spark_sample_job).

  19. From the left pane, select Target Exec, and then select the custom job server that was configured as part of configuring Talend.

  20. Select Spark Configuration and set the Property Type, Spark Version, and Spark Mode as shown in the following figure.

    ../../../../_images/spark-config.png
  21. Click the Run tab. From the navigation pane, select Basic Run, and click Run as shown in the following figure.

    ../../../../_images/job-run.png

The job runs on the cluster and the results are displayed on the console.