Running a Dumbo Job¶
Dumbo is a popular Python module for running Hadoop jobs. This Quick Start Guide is for users who want to run Dumbo programs using Qubole Data Service (QDS).
Example Dumbo Program¶
For this example, let us use the canonical word-count program, written with Dumbo rather than plain Python. To make the example easily accessible to Qubole users, the required data is provided in a publicly accessible Amazon S3 bucket, and the Python program as a publicly accessible pastebin paste (since Dumbo does not work directly with Amazon S3 files).
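The paste itself is not reproduced here, but a Dumbo word-count program typically looks like the sketch below (a hypothetical illustration following the canonical Dumbo word count; the actual paste contents may differ):

```python
# Sketch of a canonical Dumbo word-count program (hypothetical;
# the actual paste referenced above may differ).

def mapper(key, value):
    # Each input line arrives as `value`; emit (word, 1) for every word
    for word in value.split():
        yield word, 1

def reducer(key, values):
    # All counts for one word arrive together; sum them into a total
    yield key, sum(values)

if __name__ == "__main__":
    try:
        # dumbo.run wires the mapper/reducer into a Hadoop streaming job;
        # the dumbo module is only present on the cluster nodes
        import dumbo
        dumbo.run(mapper, reducer)
    except ImportError:
        pass
```

Dumbo handles the Hadoop streaming plumbing; the program only declares the map and reduce logic.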
The simplest way to install Dumbo on a cluster is to do so in the cluster's node bootstrap file. Add the following line to the bootstrap file:
easy_install -z dumbo
This installs the required modules on all the cluster nodes, so that Qubole's command infrastructure can be used to run Dumbo programs.
Running Dumbo Jobs from Analyze¶
Perform the following steps to run a Dumbo job:
Navigate to the Analyze page from the menu on the left and click on the Compose button.
Select the command type as ShellCommand from the drop-down list.
In the editor window, provide the following commands in order.
wget -q http://pastebin.com/raw.php?i=8RVaucJf -O /tmp/wordcount.py
dumbo start /tmp/wordcount.py -input s3://paid-qubole/default-datasets/gutenberg -output /dumbo/wc-output -hadoop /usr/lib/hadoop > /dev/null
dumbo cat /dumbo/wc-output -hadoop /usr/lib/hadoop | sort -k2nr | head -10
Click Run to execute the job. The job returns the top 10 words from the sample data set. The progress of the job can be monitored in the Logs tab.
Once the job completes, the results are available in the Results tab, as shown below:
Results from the Dumbo word-count job (not surprisingly, "the" is right on top)
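The last command's `sort -k2nr | head -10` suffix is a general shell idiom: sort the tab-separated (word, count) output numerically on the second field in descending order, then keep the top entries. The same idiom can be tried locally on hypothetical sample output:

```shell
# Sort hypothetical word-count lines by the count column (field 2),
# numerically, in reverse order, keeping the top 10
printf 'the\t120\ncat\t7\nand\t95\n' | sort -k2nr | head -10
# prints:
# the	120
# and	95
# cat	7
```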
Congratulations! You have executed your first Dumbo program using Qubole Data Service.
Further documentation is available at our Documentation home page.