GCP Quick Start Guide

This document helps you get started quickly with the Qubole Data Service (QDS) on Google Compute Engine (GCE).

Getting Started with Qubole on GCP

How to Sign up

  1. Go to http://gce.qubole.com/.

  2. Click Sign Up.

  3. Provide the information you are prompted for and click Next.

  4. Enter your email address and full name. Click CREATE MY FREE ACCOUNT. You will receive an email message at the address you provided, containing an activation code. You can confirm your account either by clicking the link in the message or by copying and pasting the activation code into the signup window.

    Alternatively, you can use your Google or SAML credentials to create a Qubole account.

  5. After creating and confirming your account, you can log in and start your free trial of the Qubole Data Service; you’ll see the Analyze page initially.

Configuring Qubole Account Settings

Prerequisite: You must have a Google Cloud Platform account to configure a Qubole account.

Navigate to the Control Panel; the Clusters tab appears by default. Click Account Settings to configure account and compute settings.

Configure the settings as follows:

  • Account Name - Choose a name for the account.

  • Timeout for Command notification - Enter the number of seconds after which an email alert is triggered for queries you run from the Compose tab of the Analyze page.

  • Project ID - Enter the project ID of your project from the Google Cloud Platform Dashboard page, for example:

    abstract-key-123456

  • Service Account Client Email - From the Google Cloud Platform Accounts Home page, navigate to APIs Manager > Credentials > Manage Service Accounts. Enter the address listed under Service account ID for your project, for example, my-123@abstract-key-123456.iam.gserviceaccount.com

  • Service Account Client Private Key File -

    1. Create the key in Google Cloud Platform as follows: on the Manage Service Accounts page, click the menu icon at the far right of the entry for your project and choose Create key. In the resulting dialog, choose P12 and click Create. This downloads a key file to your local system. (For a command-line alternative, see the sketch at the end of this list.)

    2. Now, on the Qubole Account Settings page, click Choose File, navigate to the downloaded file, and choose that file.

  • Default Location (for the created data) - Enter the path to the location in the Google Cloud Storage bucket (specified in the next field) where logs and output data will be stored, for example,

    mybucket/defloc

  • GCE System Bucket - Enter the name of the bucket you want to use as your system bucket.

  • Compute Settings - QUBOLE_MANAGED or CUSTOMER_MANAGED. QUBOLE_MANAGED is available for a 15-day trial period. After the trial expires, you must set this to CUSTOMER_MANAGED, and then set or reset the GCE Settings for your clusters. To do this, go to the Control Panel, click the edit icon for each cluster, and enter your GCP Project ID, Service Account Client Email, and Storage System Bucket. These may be the same ones you used when you originally configured the account, as described earlier in this section.

  • Domains Allowed to signin/signup - Specify the domain(s) whose users are allowed to sign up and sign in (for example, mydomain.com). Specifying domains prevents a third party from associating their Qubole ID with your account. Separate multiple domain names with commas.

    After specifying domains, click Save.
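
If you prefer the command line for the Google-side steps above, the Google Cloud SDK provides equivalents. The following is a minimal sketch, assuming gcloud and gsutil are installed and authenticated; the project ID, service account address, and bucket name are the example values used earlier, so substitute your own:

    # Find the Service Account Client Email for the project.
    gcloud iam service-accounts list --project abstract-key-123456

    # Create a P12 key for the service account; qubole-key.p12 is an
    # arbitrary local file name.
    gcloud iam service-accounts keys create qubole-key.p12 \
        --iam-account my-123@abstract-key-123456.iam.gserviceaccount.com \
        --key-file-type p12

    # Create the bucket to use as the GCE System Bucket.
    gsutil mb -p abstract-key-123456 gs://mybucket/

You can then upload the downloaded .p12 file through the Service Account Client Private Key File chooser described above.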

Running a Simple Word Count Hadoop Streaming Job

Perform the following steps to run a Hadoop streaming job:

  1. Navigate to the Analyze page and click the Compose button. A command editor opens in the right frame.

  2. Select Hadoop Job from the drop-down list.

  3. Specify the location of the job JAR file (in this case: gs://qubole-karma/HadoopAPIExamples/jars/hadoop-0.20.1-dev-streaming.jar)

  4. Specify the arguments to the JAR file. Following the example below, specify the mapper (here the Unix wc utility), the number of reducers, the location of the input dataset, and an output location in a Google Cloud Storage bucket. (A sample Python mapper and reducer are sketched at the end of this procedure.)

    -mapper wc -numReduceTasks 1
    -input gs://qubole-karma/default-datasets/gutenberg
    -output gs://.../.../
    

    Note

    The output path shown in the above step and the following figure is not an actual path. Provide an output location in a Google Cloud Storage bucket that you own.

A sample Hadoop job is shown in the following figure.

[Figure: GCEHadoopSampleJob.png]

  5. Click Run to execute the job. (This may take a few minutes because this is your first query and the cluster has to start up.) The status of the job appears below the query composer. When the job completes, the display switches to the Results tab to show you the result.

Congratulations! You have executed your first Hadoop command using the Qubole Data Service.
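
The job above uses the Unix wc utility as its mapper, so no separate script is needed. A Hadoop streaming mapper or reducer can be any executable that reads lines from standard input and writes results to standard output. The following is a minimal word-count sketch in Python; mapper.py and reducer.py are hypothetical file names, and you would need to make the scripts available to the cluster (for example, via Hadoop streaming's -files option) and reference them with the -mapper and -reducer arguments.

mapper.py:

    #!/usr/bin/env python
    # Hypothetical mapper: emit "<word>\t1" for each word read from stdin.
    import sys

    for line in sys.stdin:
        for word in line.split():
            print("%s\t1" % word)

reducer.py:

    #!/usr/bin/env python
    # Hypothetical reducer: sum the counts for each word. Hadoop streaming
    # sorts mapper output by key, so all lines for a word arrive together.
    import sys

    current_word = None
    current_count = 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t", 1)
        if word != current_word:
            if current_word is not None:
                print("%s\t%d" % (current_word, current_count))
            current_word = word
            current_count = 0
        current_count += int(count)

    if current_word is not None:
        print("%s\t%d" % (current_word, current_count))

With this approach, the arguments in step 4 would change along the lines of -mapper mapper.py -reducer reducer.py (with the scripts shipped to the cluster), while -numReduceTasks, -input, and -output remain as shown.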

Managing Qubole Clusters for GCP

Qubole supports Hadoop and Spark clusters. Use the Control Panel to add or reconfigure a cluster for GCP. The Active Clusters tab is the default view. Click the + icon to add a new cluster, or the edit icon to edit an existing cluster.

In the Cluster Settings section, add a label in the Cluster Labels text field to identify the cluster. This is a mandatory step. To change the cluster type from the default (Hadoop), select Hadoop 2 or Spark.

In the Hadoop Cluster Settings section, set the options as follows:

  • Minimum Slave Count: Set the minimum number of slave nodes if you want to change it from the default (2).
  • Maximum Slave Count: Set the maximum number of slave nodes if you want to change it from the default (2).
  • Master Node Type: Set the master node type by selecting the preferred node type from the drop-down list.
  • Slave Node Type: Set the slave node type by selecting the preferred node type from the drop-down list.
  • Node bootstrap File: Provide the path to the node bootstrap file, if any.
  • To override the default Hadoop configuration, enter cluster and job variables in the Override Hadoop Configuration Variables text field (see the example at the end of this section).
  • Enter Fair Scheduler Configuration values if you want to override the default values.
  • Specify the Default Fair Scheduler Pool for pools not specified during job submission.
  • Other Settings: Choose Disable Automatic Cluster Termination if you do not want QDS to terminate the cluster when it is idle. Qubole strongly recommends leaving automatic cluster termination enabled, as inadvertently leaving an idle cluster running can be very expensive.

Click Save to add a Hadoop cluster, or Cancel to discard your changes.
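
As an illustration of the Override Hadoop Configuration Variables field mentioned in the list above, overrides are typically entered as one key=value pair per line. The property names below are standard Hadoop settings, and the values are placeholders for illustration rather than recommendations:

    mapred.reduce.tasks=10
    mapred.task.timeout=600000

Here mapred.reduce.tasks sets the number of reduce tasks and mapred.task.timeout is the per-task timeout in milliseconds; any property you set should be one your cluster's Hadoop version actually supports.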

GCP Machine Types

To see the machine types, their configurations, and explanations, see the machine types page in the Google Compute Engine documentation.