Running Node Bootstrap and Ad hoc Scripts on a Cluster

Qubole allows you to run node bootstrap scripts on cluster nodes, as well as other scripts ad hoc as needed. The following topics describe running node bootstrap and ad hoc scripts:

Running Node Bootstrap Scripts on a Cluster

You can edit the default node bootstrap script from the cluster settings page: in the QDS UI, navigate to Clusters and click Edit next to the cluster you want to configure. Managing Clusters provides more information.

Note

Qubole recommends activating Qubole's virtual environment and installing or updating custom Python libraries inside it. Qubole's virtual environment is recommended because it contains many popular Python libraries and offers the advantages described in Using Pre-installed Python Libraries from the Qubole VirtualEnv.

You can install and update Python libraries in Qubole’s virtual environment by adding code to the node bootstrap script, as follows:

# Load Qubole's helper functions and activate Qubole's Python 2.7 virtual environment
source /usr/lib/hustler/bin/qubole-bash-lib.sh
qubole-use-python2.7
# Install the library into the activated virtual environment
pip install <library name>
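
For example, a minimal node bootstrap fragment that installs a couple of libraries into the Qubole virtual environment might look like the following sketch (the library names and pinned version are illustrative, not required):

#!/bin/bash
source /usr/lib/hustler/bin/qubole-bash-lib.sh
qubole-use-python2.7
# illustrative libraries; install whatever your jobs actually need
pip install numpy
pip install requests==2.20.0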

The node bootstrap logs are written to node_bootstrap.log under /media/ephemeral0/logs/others. You can also view them from the QDS UI in the Nodes table for a running cluster: in the Clusters section of the UI, the number of nodes is displayed against Nodes for each active cluster; click that number to open the Nodes table. For more information, see Using the Cluster User Interface.
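
If you can SSH into a cluster node, you can also inspect the log directly on the node, for example:

tail -n 100 /media/ephemeral0/logs/others/node_bootstrap.log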

Understanding a Node Bootstrap Script provides more information.

Examples of Bootstrapping Cluster Nodes with Custom Scripts

Example 1: Using Giraph (AWS and Azure)

Giraph is a framework to perform offline batch processing of semi-structured graph data on a massive scale. To use Giraph jar files in a bootstrap script, perform the following steps:

  1. Create a node bootstrap script, node_bootstrap.sh, that downloads and extracts the Giraph jar files.

For AWS:

mkdir -p /media/ephemeral0/
cd /media/ephemeral0/
hadoop dfs -get s3://paid-qubole/giraph/giraph-qubole.tar.gz ./
tar -xvf giraph-qubole.tar.gz

For Azure (Blob storage):

mkdir -p /media/ephemeral0/
cd /media/ephemeral0/
hadoop dfs -get wasb://<container>@<storage-account>.blob.core.windows.net/paid-qubole/giraph/giraph-qubole.tar.gz ./
tar -xvf giraph-qubole.tar.gz

  2. Run the Hadoop jar jobs.

Sample Shortest Path Job

# copy the sample input to HDFS
hadoop dfs -put /media/ephemeral0/giraph/tiny_text.txt /tmp
export JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.45.x86_64
# run the single-source shortest paths example with one worker (-w 1)
hadoop jar /media/ephemeral0/giraph/giraph-examples/target/giraph-examples-1.1.0-SNAPSHOT-for-hadoop-1.2.1-jar-with-dependencies.jar org.apache.giraph.GiraphRunner org.apache.giraph.examples.SimpleShortestPathsComputation -vif org.apache.giraph.io.formats.JsonLongDoubleFloatDoubleVertexInputFormat -vip /tmp/tiny_text.txt -vof org.apache.giraph.io.formats.IdWithValueTextOutputFormat -op /tmp/out1 -w 1
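
The shortest-paths output is written to the directory given by -op (/tmp/out1 in the command above); once the job finishes, you can inspect it with, for example:

hadoop dfs -cat /tmp/out1/*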

Page Ranker Benchmark Job

hadoop jar /media/ephemeral0/giraph/giraph-examples/target/giraph-examples-1.1.0-SNAPSHOT-for-hadoop-1.2.1-jar-with-dependencies.jar org.apache.giraph.benchmark.PageRankBenchmark  -e 1 -s 3 -v -V 5000 -w 1

Example 2: Changing Python Version

Python 2.7 is the version configured on cluster nodes by default. This also applies to older clusters if they are restarted after Python 2.7 became the default; existing clusters that use Python 2.6 continue to use Python 2.6 until they are restarted. If your cluster is running Python 2.6, you can enable Python 2.7 for Hadoop tasks by adding the following lines to the node bootstrap file specified in the cluster configuration:

source /usr/lib/hustler/bin/qubole-bash-lib.sh
qubole-hadoop-use-python2.7

Example: Node Bootstrap Script for a Mahout Job (AWS)

mkdir -p /media/ephemeral0/mahout
cd /media/ephemeral0/mahout

# copy the Mahout 0.9 distribution and sample data from s3://paid-qubole
hadoop dfs -get s3://paid-qubole/mahout0.9/mahout-distribution-0.9.tar.gz .
hadoop dfs -get s3://paid-qubole/mahout0.9/data.tar.gz .

tar -xvf mahout-distribution-0.9.tar.gz
tar -xvf data.tar.gz
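
After this bootstrap runs, the Mahout distribution is available under /media/ephemeral0/mahout on every node. A hypothetical invocation from a Shell command, assuming the archive extracts to a mahout-distribution-0.9 directory containing the standard bin/mahout launcher, could look like this:

export MAHOUT_HOME=/media/ephemeral0/mahout/mahout-distribution-0.9
# running the launcher with no arguments lists the available Mahout programs
$MAHOUT_HOME/bin/mahout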

Example: Adding a Node Bootstrap Script Using Pig (AWS and Azure)

  1. Create a file node_bootstrap.sh (or choose another name) with the following content:

For AWS:

mkdir -p /media/ephemeral1/pig11
hadoop dfs -get s3://paid-qubole/pig11/pig.tar.gz /media/ephemeral1/pig11/
tar -xvf /media/ephemeral1/pig11/pig.tar.gz -C /media/ephemeral1/pig11/

For Azure (Blob storage):

mkdir -p /media/ephemeral1/pig11
hadoop dfs -get wasb://<container>@<storage-account>.blob.core.windows.net/paid-qubole/pig11/pig.tar.gz /media/ephemeral1/pig11/
tar -xvf /media/ephemeral1/pig11/pig.tar.gz -C /media/ephemeral1/pig11/

  2. Edit the cluster in the QDS UI and enter the name of the bootstrap file in the Node Bootstrap File field; place the file itself in the corresponding location in Cloud storage.

With the above bootstrap in place, Pig is installed under /media/ephemeral1/pig11 on all the nodes of the cluster. You can then invoke any Pig script through a Shell command in the QDS UI; you must provide the complete path to the Pig binary, as shown in the following AWS example.

/media/ephemeral1/pig11/pig/bin/pig s3n://<BucketName>/<subfolder>/<PigfileName>.pig

Example: Installing R and RHadoop on a Cluster

  1. Create a file named node_bootstrap.sh (or another name you choose) with the following content:

# Install R
sudo yum -y install R
# Install the R packages that RHadoop depends on
echo "install.packages(c(\"rJava\", \"Rcpp\", \"RJSONIO\", \"bitops\", \"digest\",
               \"functional\", \"stringr\", \"plyr\", \"reshape2\", \"dplyr\",
               \"R.methodsS3\", \"caTools\", \"Hmisc\"), repos=\"http://cran.uk.r-project.org\")" > base.R
Rscript base.R
# Install rhdfs (fetch the raw archive rather than the GitHub HTML page)
wget https://github.com/RevolutionAnalytics/rhdfs/raw/master/build/rhdfs_1.0.8.tar.gz
echo "install.packages(\"rhdfs_1.0.8.tar.gz\", repos=NULL, type=\"source\")" > rhdfs.R
Rscript rhdfs.R
# Install rmr2
wget https://github.com/RevolutionAnalytics/rmr2/releases/download/3.3.1/rmr2_3.3.1.tar.gz
echo "install.packages(\"rmr2_3.3.1.tar.gz\", repos=NULL, type=\"source\")" > rmr.R
Rscript rmr.R
# Download the Hadoop streaming jar used by rmr2
cd /usr/lib/hadoop
wget http://www.java2s.com/Code/JarDownload/hadoop-streaming/hadoop-streaming-1.1.2.jar.zip
unzip hadoop-streaming-1.1.2.jar.zip

  2. Edit the cluster in the QDS UI and enter the name of the bootstrap file in the Node Bootstrap File field; place the file itself in the corresponding location in Cloud storage.

The above example installs R along with the RHadoop packages rmr2 and rhdfs on the cluster nodes. You can now run R commands as well as RHadoop commands. A sample R script that uses RHadoop is given below.

# point rmr2 at the Hadoop streaming jar downloaded by the bootstrap
Sys.setenv("HADOOP_STREAMING"="/usr/lib/hadoop/hadoop-streaming-1.1.2.jar")
library(rmr2)
# write a small vector to HDFS and square each value with a map-only job
small.ints = to.dfs(1:1000)
mapreduce(
  input = small.ints,
  map = function(k, v) cbind(v, v^2))
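
As a quick check, you could save the sample above in a file on the cluster (the file name below is illustrative) and run it with Rscript:

Rscript /tmp/rmr_sample.R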

Running Ad hoc Scripts on a Cluster

You may want to execute scripts on a cluster in an ad hoc manner. You can use a REST API to execute a script located in Cloud storage. See Run Adhoc Scripts on a Cluster for information about the API.

The Run Adhoc Script functionality uses pssh (parallel SSH) to launch ad hoc scripts on the cluster nodes. It has been tested under the following conditions:

  • It works on clusters that are set up using a proxy tunnel server.
  • Even if the script's execution time is longer than the pssh timeout, the script still executes on the node.
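
For context, pssh runs a command on a list of hosts in parallel. A standalone sketch of that kind of invocation (the host file, timeout, and script path here are purely illustrative, not Qubole's actual internals) might be:

# hosts.txt contains one cluster-node address per line
pssh -h hosts.txt -t 600 -i "bash /tmp/adhoc_script.sh"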

Limitations of Running Ad hoc Scripts

If a script is running and you try to execute the same script on the same cluster, the second instance does not run. To work around this, change the script's path (for example, copy it to a different location in Cloud storage) and invoke the API again with the new path.
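
For example, on AWS you could copy the script to a new key before re-invoking the API (the bucket and key names are illustrative):

aws s3 cp s3://<bucket>/scripts/adhoc_script.sh s3://<bucket>/scripts/adhoc_script_run2.sh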