Running a Dumbo Job

Dumbo is a popular Python module for running Hadoop jobs. This Quick Start Guide is for users who want to run Dumbo programs using Qubole Data Service (QDS).

Example Dumbo Program

This example uses the canonical word-count program, written with Dumbo instead of plain Python. To make the example easily accessible to Qubole users, the required data is provided in a publicly accessible S3 bucket, and the Python program as a publicly accessible Pastebin paste (since Dumbo does not work directly with S3 files).

  • Input Data: s3://paid-qubole/default-datasets/gutenberg, which contains a small subset of books from Project Gutenberg
  • Dumbo program: http://pastebin.com/raw.php?i=8RVaucJf (raw file)

Installing Dumbo

The simplest way to install Dumbo on a cluster is to do so in its node bootstrap file. Add the following line in the bootstrap file:

easy_install -z dumbo

This installs Dumbo and its dependencies on all the cluster nodes, so that Qubole's command infrastructure can be used to run Dumbo programs.

Running Dumbo Jobs from Analyze

Perform the following steps to run a Dumbo job:

  1. Navigate to the Analyze page from the menu on the left and click on the Compose button.

  2. Select the command type as ShellCommand from the drop-down list.

  3. In the editor window, provide the following commands in order.

    # Fetch the Dumbo program (Dumbo cannot read it directly from S3)
    wget -q http://pastebin.com/raw.php?i=8RVaucJf -O /tmp/wordcount.py
    # Run the word-count job over the sample data set
    dumbo start /tmp/wordcount.py -input s3://paid-qubole/default-datasets/gutenberg -output /dumbo/wc-output -hadoop /usr/lib/hadoop > /dev/null
    # Print the ten most frequent words from the job output
    dumbo cat /dumbo/wc-output -hadoop /usr/lib/hadoop | sort -k2nr | head -10
    
  4. Click Run to execute the job. The job returns the top 10 words from the sample data set. The progress of the job can be monitored in the Logs tab.

  5. Once the job completes, the results are available in the Results tab, as shown below:

    ../../_images/dumbo_results.png

    Results from the Dumbo wordcount job (not surprisingly, "the" is right on top)
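
The Dumbo program itself is not reproduced above, but a typical Dumbo word-count follows the standard mapper/reducer pattern. The sketch below is an assumption about what wordcount.py likely contains (it is not the actual paste); the `simulate` helper is purely illustrative, standing in for the map → shuffle → reduce pipeline and the `sort -k2nr | head -10` step that Hadoop and the shell perform in the real job:

```python
from itertools import groupby
from operator import itemgetter

def mapper(key, value):
    # key: byte offset of the line in the input file; value: one line of text
    for word in value.split():
        yield word.lower(), 1

def reducer(key, values):
    # values: all counts emitted by the mapper for this word
    yield key, sum(values)

# Hypothetical local simulation of the pipeline (in the real job, Dumbo's
# runner submits mapper/reducer to Hadoop instead).
def simulate(lines, top=10):
    mapped = [pair for i, line in enumerate(lines) for pair in mapper(i, line)]
    mapped.sort(key=itemgetter(0))  # the "shuffle" phase groups pairs by word
    reduced = [pair for word, group in groupby(mapped, key=itemgetter(0))
               for pair in reducer(word, (count for _, count in group))]
    # Equivalent of `sort -k2nr | head -<top>`: sort by count, descending
    return sorted(reduced, key=itemgetter(1), reverse=True)[:top]
```

For instance, `simulate(["the quick brown fox", "the lazy dog"])` puts `("the", 2)` on top, matching the result shown above.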

Congratulations! You have executed your first Dumbo program using Qubole Data Service.

Further documentation is available at our Documentation home page.