Running a Hadoop Job

This section describes how to run native MapReduce jobs written in Java using the Qubole Data Service (QDS).

QDS Access

These are the prerequisites:

  • You must be signed up for QDS. New users can sign up from the Sign Up page and create an account as instructed.
  • To run Hadoop jobs using the UI, you must sign in to QDS.
  • To run Hadoop jobs using the API, you must have an authentication token, which you can obtain from the profile page. For more information on authentication, see Authentication.

Example Hadoop Job

For this example, let us use a widely referenced Python MapReduce tutorial. The input data set for this job is the text of three books from Project Gutenberg. The map and reduce programs are Python scripts that calculate word counts over this data set. To make this example easily accessible to Qubole users, the required data and code are provided in a publicly accessible bucket:

  • Input Data: s3n://paid-qubole/default-datasets/gutenberg
  • Map Script: s3n://paid-qubole/HadoopAPIExamples/WordCountPython/
  • Reduce Script: s3n://paid-qubole/HadoopAPIExamples/WordCountPython/
  • Jar File: s3://paid-qubole/HadoopAPIExamples/jars/hadoop-0.20.1-dev-streaming.jar. This is a standard Hadoop streaming JAR that is compatible with the Qubole Hadoop service and can be used for all streaming jobs.
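
Hadoop streaming, as packaged in the JAR above, runs the mapper and reducer as ordinary executables that read lines from stdin and write tab-separated key/value pairs to stdout. As a conceptual sketch (not the actual Gutenberg job), the word-count flow can be reproduced locally with standard POSIX tools:

```shell
# Conceptual sketch of the streaming word-count flow: map -> shuffle -> reduce.
# The "mapper" emits "<word><TAB>1" per word; sort groups identical keys
# (Hadoop's shuffle); the "reducer" sums the counts for each key.
printf 'the quick brown fox\nthe lazy dog\n' |
  tr -s ' ' '\n' |                                # map: one word per line
  awk '{ print $0 "\t1" }' |                      # map: emit key<TAB>1
  sort |                                          # shuffle: group by key
  awk -F'\t' '{ c[$1] += $2 }
              END { for (w in c) print w "\t" c[w] }' |
  sort                                            # deterministic output order
```

In the real job, Hadoop distributes the map and reduce stages across the cluster, but each task still just reads stdin and writes stdout, which is why plain Python scripts work as the mapper and reducer.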


In new QDS accounts, QDS provides example saved queries of different command types. For more information, see Workspace Tab.

Running Hadoop Jobs from Analyze

The steps to run a Hadoop job using a custom jar are:

  1. Navigate to the Analyze page from the top menu and click the Compose button.

  2. Clicking Compose opens a command editor. Select Hadoop Job as the command type from the drop-down list. Custom Jar is selected by default in the Job Type drop-down list.

  3. Specify the location of the job JAR file (in this case: s3://paid-qubole/HadoopAPIExamples/jars/hadoop-0.20.1-dev-streaming.jar)

  4. Specify the arguments to the JAR file. In this example, the arguments are the mapper and reducer scripts, the location of those scripts, the number of reducers, the location of the input data set, and an output Amazon S3 location, as shown below.

    -files s3n://paid-qubole/HadoopAPIExamples/WordCountPython/,
    -mapper -reducer -numReduceTasks 1
    -input s3n://paid-qubole/default-datasets/gutenberg
    -output s3n://<S3 location>


    When you copy and paste the above arguments, you must remove the newline characters and provide a valid output location before running the query.

    The output path shown in the step above and in the following figure is not an actual path. Provide an output location in an Amazon S3 bucket that you own, and ensure that the output directory does not already exist before running the Hadoop job.

  5. Click Run to execute the job. The status of the job is displayed at the top of the query composer, and the query result is displayed in the Results tab.
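
As noted in step 4, the pasted argument string must end up on a single line. A quick way to check this in a shell before submitting (here `<mapper-script>`, `<reducer-script>`, and `<your-bucket>` are placeholders, since the actual script names are elided above):

```shell
# One-line form of the step-4 arguments. <mapper-script>, <reducer-script>,
# and <your-bucket> are placeholders: substitute your own values.
ARGS='-files s3n://paid-qubole/HadoopAPIExamples/WordCountPython/ -mapper <mapper-script> -reducer <reducer-script> -numReduceTasks 1 -input s3n://paid-qubole/default-datasets/gutenberg -output s3n://<your-bucket>/<new-output-dir>'

# Fail loudly if a newline survived the copy-paste:
nl='
'
case "$ARGS" in
  *"$nl"*) echo "error: ARGS contains a newline" >&2; exit 1 ;;
  *)       echo "ARGS is a single line" ;;
esac
```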

Viewing Hadoop Job Logs

After the job is submitted, you can monitor its progress by viewing the logs. The job submission logs are available under the Log section of the Composer tab, and also in the History tab for later access. The job log provides a Job Tracker URL; clicking it displays detailed information about the job, such as map and reduce task information.


Figure: Sample Hadoop Log

Congratulations! You have executed your first Hadoop command using Qubole Data Service.

Running Hadoop Jobs using the API

You can also run Hadoop jobs from the command line using the Qubole API. The following steps show how.


The environment variable AUTH_TOKEN used in these examples must be set to the user's authentication token, as described in Authentication.
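
For example, the token can be exported once and the repeated headers wrapped in a small helper function. This is a convenience sketch, not part of the QDS API itself: `qds_api` is a hypothetical name, and the endpoint URL is left for you to fill in.

```shell
# Hypothetical convenience wrapper around curl for the QDS REST calls below.
# AUTH_TOKEN comes from your QDS profile page (see Authentication).
export AUTH_TOKEN="<your-api-token>"

qds_api() {
  # usage: qds_api GET|POST <url> [json-body]
  method=$1; url=$2; body=$3
  set -- -i -X "$method" \
         -H "X-AUTH-TOKEN: $AUTH_TOKEN" \
         -H "Content-Type: application/json" \
         -H "Accept: application/json"
  if [ -n "$body" ]; then
    set -- "$@" -d "$body"
  fi
  curl "$@" "$url"
}
```

With this helper, the submit call in step 1 reduces to `qds_api POST "<endpoint-url>" '{"sub_command": "jar", ...}'`, and the status and log calls reduce to plain GETs.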

  1. Submit the command:


    The syntax below assumes a particular endpoint. Qubole provides other endpoints for accessing QDS; these are described in Supported Qubole Endpoints on Different Cloud Providers.

    unix-prompt > curl  -i -X POST -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Content-Type: application/json" -H "Accept: application/json" -d '{"sub_command": "jar", "sub_command_args": "s3://paid-qubole/HadoopAPIExamples/jars/hadoop-0.20.1-dev-streaming.jar -files s3n://paid-qubole/HadoopAPIExamples/WordCountPython/,s3n://paid-qubole/HadoopAPIExamples/WordCountPython/ -mapper -reducer -numReduceTasks 1 -input s3n://paid-qubole/default-datasets/gutenberg -output s3://.../grun2", "command_type": "HadoopCommand"}' ""
    HTTP/1.1 200 OK
    {"id":137222, ...}

    The ID of the command, 137222, is shown in the result of the REST API call. Let us use this ID to check the status of the command.

  2. Check Status of command:

    unix-prompt > curl -i -X GET -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Content-Type: application/json" -H "Accept: application/json" ""
    HTTP/1.1 200 OK
    {"status":"done","qbol_session_id":30867,"created_at":"2013-04-22T11:37:32Z","command":{"job_url":"http://ec2...","sub_command":"jar","sub_command_args":"s3://paid-qubole/HadoopAPIExamples/jars/hadoop-0.20.1-dev-streaming.jar -mapper -reducer -numReduceTasks 1 -input s3n://paid-qubole/default-datasets/gutenberg -output ...

    The status field shows whether the command is waiting, running, done, and so on. In this case, the command has already completed (the status is done).

  3. Get the logs of the command:

    unix-prompt > curl -i -X GET -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Content-Type: application/json" -H "Accept: application/json" ""
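
Rather than repeating the status call in step 2 by hand, it can be polled in a loop until a terminal state is reached. This is a sketch under assumptions: `poll_status` is a hypothetical helper, `<endpoint>` stands in for the elided endpoint URL, and done/error/cancelled are assumed to be the terminal status values (waiting and running are not).

```shell
# Hypothetical polling loop: repeat the GET on the command until its
# "status" field reaches a terminal value, then print the final response.
poll_status() {
  # usage: poll_status <command-id>
  id=$1
  while :; do
    resp=$(curl -s -X GET \
      -H "X-AUTH-TOKEN: $AUTH_TOKEN" \
      -H "Accept: application/json" \
      "https://<endpoint>/commands/$id")
    case "$resp" in
      *'"status":"done"'*|*'"status":"error"'*|*'"status":"cancelled"'*)
        printf '%s\n' "$resp"
        return 0 ;;
    esac
    sleep 5   # still waiting or running; try again shortly
  done
}
```

For the example above, `poll_status 137222` would block until command 137222 finishes, then print its final JSON.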

Detailed documentation is available at the Documentation home page.