Configuring Talend to Interact with QDS

You can use this procedure to configure Talend with Hadoop, Spark, and Hive clusters in QDS.

Before you begin, you must have the public DNS of the cluster’s coordinator node. If you want to submit jobs remotely, you must also have installed the Talend JobServer.

  1. Install the Talend Real-Time Big Data Platform on a Windows AWS instance or on a standalone computer running the Windows operating system.

  2. Navigate to the Talend Website and search for Qubole Distribution.

  3. Download the Qubole Distribution solution.

    Note

    The downloaded zip file is used later when defining the connection, so note its path.

  4. Optional: If you want to run jobs remotely, configure the JobServer.

    1. Create a JobServer cluster.

      1. Create an EC2 instance in the VPC where your Qubole cluster is running.
      2. Upload the JobServer to the instance and run it.
    2. In the Studio, define this JobServer as a remote server.

      1. Open Window > Preferences, and in the Preferences dialog, open Talend > Run/Debug > Remote.

      2. Click the [+] button twice to add lines. Enter the location of your EC2 JobServer instance and leave the default values in the Password column.

        The following figure shows the Preferences page with sample values.

      ../../../../_images/talend-pref.png
    3. Click Apply and then OK to validate the configuration.

    The JobServer is now ready to be used to run your Job remotely.
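
    Before running a Job remotely, you may want to confirm that the JobServer ports are reachable from the machine running the Studio. The following is a minimal sketch, not part of the Talend product: the host name is a placeholder, and the ports assume the common JobServer defaults (8001 for commands, 8002 for file transfer); check your JobServer configuration for the actual values.

      import java.io.IOException;
      import java.net.InetSocketAddress;
      import java.net.Socket;

      public class JobServerPortCheck {

          // Placeholder host and assumed default JobServer ports; replace with
          // the public DNS of your EC2 instance and the ports configured in
          // your JobServer installation.
          private static final String JOBSERVER_HOST = "ec2-xx-xx-xx-xx.compute-1.amazonaws.com";
          private static final int[] JOBSERVER_PORTS = {8001, 8002};

          public static void main(String[] args) {
              for (int port : JOBSERVER_PORTS) {
                  try (Socket socket = new Socket()) {
                      // Attempt a TCP connection with a 5-second timeout.
                      socket.connect(new InetSocketAddress(JOBSERVER_HOST, port), 5000);
                      System.out.println("Port " + port + " is reachable.");
                  } catch (IOException e) {
                      System.out.println("Port " + port + " is NOT reachable: " + e.getMessage());
                  }
              }
          }
      }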

  5. Define the Qubole connection.

    1. Launch Talend Studio from C:\Talend\7.0.1\studio.

    2. In the Repository tree view, right-click Hadoop cluster under the Metadata node to open the contextual menu.

    3. Select Create Hadoop cluster.

    4. In step 1 of the wizard, enter descriptive information about the connection to be created, such as its name and purpose.

      Note

      White space and the special characters ~, !, `, #, ^, &, *, \, /, ?, :, ;, ", ., (, ), ', ¥, «, », <, and > are not supported in the name. These characters are all replaced with "_" in the file system, which might result in duplicate entries (the sketch after the figure below illustrates such a collision).

      The following figure shows step 1 of the wizard with sample values.

      ../../../../_images/hadoop-cluster1.png
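
      To see why this replacement can cause duplicates, consider the following sketch. It is purely illustrative: the sanitization rule is reproduced from the note above, and the helper class is not part of Talend.

        public class ConnectionNameCheck {

            // Characters the wizard replaces with "_" (from the note above),
            // plus the space character.
            private static final String UNSUPPORTED = "~!`#^&*\\/?:;\".()'¥«»<> ";

            // Predict the name the connection gets on the file system.
            static String fileSystemName(String name) {
                StringBuilder sb = new StringBuilder();
                for (char c : name.toCharArray()) {
                    sb.append(UNSUPPORTED.indexOf(c) >= 0 ? '_' : c);
                }
                return sb.toString();
            }

            public static void main(String[] args) {
                // Two distinct display names that collide on disk:
                System.out.println(fileSystemName("qubole cluster"));  // qubole_cluster
                System.out.println(fileSystemName("qubole*cluster")); // qubole_cluster
            }
        }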
    5. In the Import Option section of the wizard, select Enter manually Hadoop services and click Finish.

    6. In step 2 of the wizard, from the Distribution drop-down list, select Custom - Unsupported as shown in the following figure.

      ../../../../_images/hadoop-cluster2.png
    7. Click the […] button to import the Qubole zip file (QuboleExchange.zip). Click OK to validate the import. Click Yes to continue.

    8. Enter the public DNS of the cluster’s coordinator node with port 9000 in the Namenode URI field, port 8032 in the Resource Manager field, port 8030 in the Resource Manager Scheduler field, and port 10020 in the Job History field, as shown in the following figure. The same endpoints are sketched as Hadoop configuration keys after this list.

      • For Hadoop, use the public DNS of the Hadoop cluster’s coordinator node.
      • For Spark, use the public DNS of the Spark cluster’s coordinator node.
      • For Hive, use the public DNS of the Hadoop2 cluster’s coordinator node.
      ../../../../_images/hadoop-cluster3.png
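
      For reference, the sketch below expresses the same four wizard fields as the stock Hadoop and YARN configuration keys they correspond to. It assumes the Hadoop client libraries are on the classpath, and the host name is a placeholder for your cluster coordinator’s public DNS.

        import org.apache.hadoop.conf.Configuration;

        public class QuboleEndpoints {

            public static Configuration build() {
                // Placeholder; use the public DNS of your cluster’s coordinator node.
                String coordinator = "ec2-xx-xx-xx-xx.compute-1.amazonaws.com";

                Configuration conf = new Configuration();
                conf.set("fs.defaultFS", "hdfs://" + coordinator + ":9000");               // Namenode URI
                conf.set("yarn.resourcemanager.address", coordinator + ":8032");           // Resource Manager
                conf.set("yarn.resourcemanager.scheduler.address", coordinator + ":8030"); // Resource Manager Scheduler
                conf.set("mapreduce.jobhistory.address", coordinator + ":10020");          // Job History
                return conf;
            }
        }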
    9. Ensure that Use Yarn is selected.

    10. Enter a valid Talend user name in the User name field.

    11. Click Check Services to verify the connection.

    12. Depending on whether you are adding a Hadoop cluster or a Spark cluster, perform the appropriate action (the key/value format both dialogs expect is sketched after this list):

      • For a Hadoop cluster, click the button next to Hadoop Properties and enter the appropriate values as shown in the following figure:

        ../../../../_images/hadoop-properties.png
      • For a Spark cluster, select the Use Spark Properties check box, click the button next to the check box, and enter the appropriate values as shown in the following figure:

        ../../../../_images/spark-properties.png
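
      Both dialogs take plain key/value pairs. The sketch below is purely illustrative: the property names and values are common Hadoop and Spark settings chosen as examples, not values that Qubole requires; enter whatever your cluster actually needs.

        import java.util.LinkedHashMap;
        import java.util.Map;

        public class TalendClusterProperties {

            public static void main(String[] args) {
                // Example Hadoop properties, as entered in the Hadoop Properties dialog.
                Map<String, String> hadoopProps = new LinkedHashMap<>();
                hadoopProps.put("mapreduce.framework.name", "yarn");
                hadoopProps.put("dfs.replication", "2");

                // Example Spark properties, as entered after selecting Use Spark Properties.
                Map<String, String> sparkProps = new LinkedHashMap<>();
                sparkProps.put("spark.executor.memory", "2g");
                sparkProps.put("spark.executor.cores", "2");

                // Print both sets in the key=value form the dialogs display.
                hadoopProps.forEach((k, v) -> System.out.println(k + "=" + v));
                sparkProps.forEach((k, v) -> System.out.println(k + "=" + v));
            }
        }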
    13. Click OK. Click Finish on the Hadoop Configuration Import Wizard.

      The new connection is displayed under the Hadoop Cluster node in Repository.

    14. Right-click the newly created cluster and, in the contextual menu, select Create HDFS.

    15. Follow the wizard to create the connection to the HDFS service of your Qubole cluster as shown in the following figure.

      ../../../../_images/hadoop-connection.png

      The connection parameters are inherited from the parent Qubole connection. Modify them if required.

    16. Click Check to verify the connection to the HDFS service and click Finish.

      This HDFS connection is displayed under the Qubole connection you previously defined, in the Hadoop Cluster node of the Repository.

      The following figure shows the list of HDFS connections.

      ../../../../_images/hadoop-connection-list.png
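
      An equivalent connectivity check can be run outside the Studio with the Hadoop client API. This is a minimal sketch that simply lists the HDFS root; it assumes the Hadoop client libraries are on the classpath, and the Namenode URI and user name are placeholders for the values used in the wizard.

        import java.net.URI;

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FileStatus;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;

        public class HdfsConnectionCheck {

            public static void main(String[] args) throws Exception {
                // Placeholders; use the values from your Qubole connection.
                String namenodeUri = "hdfs://ec2-xx-xx-xx-xx.compute-1.amazonaws.com:9000";
                String user = "talend-user";

                // Connect as the given user and list the contents of the HDFS root.
                try (FileSystem fs = FileSystem.get(new URI(namenodeUri), new Configuration(), user)) {
                    for (FileStatus status : fs.listStatus(new Path("/"))) {
                        System.out.println(status.getPath());
                    }
                }
            }
        }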
The Qubole connection is now ready to be used in a Talend Job.
  6. Optional: If you want to use a Hive connection for the data integration job, perform the following steps:

    1. Right-click the cluster and, in the contextual menu, select Create Hive.
    2. Fill in the required fields in the first step of the wizard.
    3. In the second step of the wizard, perform the following steps:
      1. Select DB Type as Hive.
      2. Select Repository from the Hadoop Cluster drop-down list. Select the appropriate Hadoop cluster.
      3. Set Hive Model to Standalone.
      4. Select Use Yarn.
      5. Enter HiveServer2 in the HiveServer Version field.
      6. Enter the appropriate values for Server, Port, and Database fields.
      7. If required, enter the additional JDBC settings, encryption, and Hive properties.

    The following figure shows Step 2/2 of the Hive connection wizard.

    ../../../../_images/hive-connection.png
    4. Click Test Connection to verify the connection settings.
    5. Click Finish to complete the Hive connection configuration.
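
    To verify the same endpoint outside the Studio, you can open a JDBC connection with the standard HiveServer2 driver. The sketch below assumes the Hive JDBC driver is on the classpath; the host, database, and user are placeholders for the values entered in the wizard, and 10000 is the stock HiveServer2 default port, which may differ on your cluster.

      import java.sql.Connection;
      import java.sql.DriverManager;
      import java.sql.ResultSet;
      import java.sql.Statement;

      public class HiveConnectionCheck {

          public static void main(String[] args) throws Exception {
              // Placeholder endpoint; 10000 is the stock HiveServer2 default port.
              String url = "jdbc:hive2://ec2-xx-xx-xx-xx.compute-1.amazonaws.com:10000/default";

              // Open a session and list the tables in the chosen database.
              try (Connection conn = DriverManager.getConnection(url, "talend-user", "");
                   Statement stmt = conn.createStatement();
                   ResultSet rs = stmt.executeQuery("SHOW TABLES")) {
                  while (rs.next()) {
                      System.out.println(rs.getString(1));
                  }
              }
          }
      }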