Running Node Bootstrap and Adhoc Scripts on a Cluster

Qubole lets you run node bootstrap scripts on cluster nodes, and run other scripts in an adhoc manner as needed. The following topics describe how to run node bootstrap and adhoc scripts:

Running Node Bootstrap Scripts on a Cluster

You can edit the default node bootstrap script from the cluster settings page: in the QDS UI, navigate to Clusters and click Edit against a specific cluster. Managing Clusters provides more information.

Note

Qubole recommends activating Qubole’s virtual environment before you install or update custom Python libraries, so that the libraries are installed in it. The virtual environment already contains many popular Python libraries and has the advantages described in Using Pre-installed Python Libraries from the Qubole VirtualEnv.

You can install and update Python libraries in Qubole’s virtual environment by adding code to the node bootstrap script, as follows:

source /usr/lib/hustler/bin/qubole-bash-lib.sh
qubole-use-python2.7
pip install <library name>
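When a bootstrap script installs several libraries, the lines above can be wrapped in a small function. The following is a minimal sketch, not Qubole tooling: install_python_libs, the QUBOLE_LIB override, and the DRY_RUN switch are hypothetical names introduced here for illustration. Only the helper path and the qubole-use-python2.7 command come from the snippet above.

```shell
#!/bin/bash
# Sketch: install a list of Python libraries in Qubole's virtual environment.
# QUBOLE_LIB and DRY_RUN are illustrative overrides, not Qubole settings.
QUBOLE_LIB="${QUBOLE_LIB:-/usr/lib/hustler/bin/qubole-bash-lib.sh}"

install_python_libs() {
    local lib
    if [ "${DRY_RUN:-0}" = "1" ]; then
        # Print the commands instead of running them (useful off-cluster).
        for lib in "$@"; do
            echo "pip install $lib"
        done
        return 0
    fi
    if [ ! -f "$QUBOLE_LIB" ]; then
        echo "Qubole helper not found: $QUBOLE_LIB" >&2
        return 1
    fi
    # Activate Qubole's virtual environment, as in the snippet above.
    source "$QUBOLE_LIB"
    qubole-use-python2.7
    for lib in "$@"; do
        # Keep going if one library fails, but record the failure on stderr.
        pip install "$lib" || echo "WARN: failed to install $lib" >&2
    done
}
```

In a bootstrap script, a call such as install_python_libs numpy pandas would follow the function definition. Setting DRY_RUN=1 prints the pip commands without running them, which can help when reviewing the script off-cluster.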

The node bootstrap logs are written to node_bootstrap.log under /media/ephemeral0/logs/others. You can also find them in the QDS UI, in the Nodes table of a running cluster: in the Clusters section of the UI, the number of nodes is displayed against Nodes below the active/running cluster; click the number to see the Nodes table. For more information on Resources, see Using the Cluster User Interface.
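When you are logged in to a node, a quick scan of the log file can show whether the bootstrap script hit a problem. The sketch below assumes the log path given above; scan_bootstrap_log is a hypothetical helper name, and the search pattern is only a guess at common failure messages, not a format that Qubole documents.

```shell
#!/bin/bash
# Sketch: print lines of the node bootstrap log that look like failures.
# The default path comes from the text above; the pattern is an assumption.
BOOTSTRAP_LOG="${BOOTSTRAP_LOG:-/media/ephemeral0/logs/others/node_bootstrap.log}"

scan_bootstrap_log() {
    local log_file
    log_file="${1:-$BOOTSTRAP_LOG}"
    if [ ! -r "$log_file" ]; then
        echo "cannot read $log_file" >&2
        return 2
    fi
    # -n prefixes line numbers, -i ignores case, -E enables alternation.
    # grep exits with status 1 when nothing matches.
    grep -niE 'error|failed|no such file|command not found' "$log_file"
}
```

Running scan_bootstrap_log with no argument reads the default log; passing a path scans a copy of the log stored elsewhere.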

Understanding a Node Bootstrap Script provides more information.

Examples of Bootstrapping Cluster Nodes with Custom Scripts

Example 1: Using Giraph (AWS and Azure)

Giraph is a framework to perform offline batch processing of semi-structured graph data on a massive scale. To use Giraph jar files in a bootstrap script, perform the following steps:

  1. Create a node bootstrap script, node_bootstrap.sh, that downloads and extracts the Giraph jar files.

For AWS:

mkdir -p /media/ephemeral0/

cd /media/ephemeral0/

hadoop dfs -get s3://paid-qubole/giraph/giraph-qubole.tar.gz ./

tar -xvf giraph-qubole.tar.gz

For Azure (Blob storage):

mkdir -p /media/ephemeral0/

cd /media/ephemeral0/

hadoop dfs -get wasb://default-datasets@paidqubole.blob.core.windows.net/paid-qubole/giraph/giraph-qubole.tar.gz ./

tar -xvf giraph-qubole.tar.gz
  2. Run the Hadoop jar jobs.

Sample Shortest Path Job

hadoop dfs -put /media/ephemeral0/giraph/tiny_text.txt /tmp
export JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.45.x86_64
hadoop jar /media/ephemeral0/giraph/giraph-examples/target/giraph-examples-1.1.0-SNAPSHOT-for-hadoop-1.2.1-jar-with-dependencies.jar org.apache.giraph.GiraphRunner org.apache.giraph.examples.SimpleShortestPathsComputation -vif org.apache.giraph.io.formats.JsonLongDoubleFloatDoubleVertexInputFormat -vip /tmp/tiny_text.txt -vof org.apache.giraph.io.formats.IdWithValueTextOutputFormat -op /tmp/out1 -w 1

Page Ranker Benchmark Job

hadoop jar /media/ephemeral0/giraph/giraph-examples/target/giraph-examples-1.1.0-SNAPSHOT-for-hadoop-1.2.1-jar-with-dependencies.jar org.apache.giraph.benchmark.PageRankBenchmark  -e 1 -s 3 -v -V 5000 -w 1

Example 2: Changing Python Version

If your cluster is running Python 2.6, you can enable Python 2.7 for Hadoop tasks by adding the following lines to the node bootstrap file specified in the cluster configuration.

source /usr/lib/hustler/bin/qubole-bash-lib.sh
qubole-hadoop-use-python2.7

Example 3: Node Bootstrap Script for a Mahout Job (AWS)

# Create a working directory on the ephemeral volume
mkdir -p /media/ephemeral0/mahout
cd /media/ephemeral0/mahout

# Copy the Mahout distribution and the data archive from s3://paid-qubole
hadoop dfs -get s3://paid-qubole/mahout0.9/mahout-distribution-0.9.tar.gz .
hadoop dfs -get s3://paid-qubole/mahout0.9/data.tar.gz .

# Extract both archives
tar -xvf mahout-distribution-0.9.tar.gz
tar -xvf data.tar.gz
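The Giraph and Mahout examples above share one pattern: create a directory, fetch an archive from cloud storage, and extract it. That pattern could be factored into a helper that stops on the first failure and skips a download that has already happened. This is a sketch under stated assumptions: fetch_and_extract and the FETCH_CMD override are hypothetical names, and the skip check only tests whether the archive file exists, so it does not detect a partial download.

```shell
#!/bin/bash
# Sketch: fetch an archive from cloud storage and unpack it.
# FETCH_CMD is a hypothetical override hook; it defaults to the
# `hadoop dfs -get` command used in the examples above.
FETCH_CMD="${FETCH_CMD:-hadoop dfs -get}"

fetch_and_extract() {
    local src dest_dir archive
    src="$1"        # e.g. s3://paid-qubole/mahout0.9/data.tar.gz
    dest_dir="$2"   # e.g. /media/ephemeral0/mahout
    archive="$dest_dir/$(basename "$src")"

    mkdir -p "$dest_dir" || return 1
    # Skip the download when an earlier run already fetched the archive.
    if [ ! -f "$archive" ]; then
        # FETCH_CMD is left unquoted so it splits into command and arguments.
        $FETCH_CMD "$src" "$dest_dir/" || return 1
    fi
    tar -xf "$archive" -C "$dest_dir"
}
```

With this helper, the Mahout bootstrap script would reduce to two calls, one per archive, each passing /media/ephemeral0/mahout as the destination directory.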

Example 4: Adding Node Bootstrap Script using Pig (AWS and Azure)

  1. Create a file node_bootstrap.sh (or choose another name) with the following content:

For AWS:

mkdir -p /media/ephemeral1/pig11
hadoop dfs -get s3://paid-qubole/pig11/pig.tar.gz /media/ephemeral1/pig11/
tar -xvf /media/ephemeral1/pig11/pig.tar.gz -C /media/ephemeral1/pig11/

For Azure (Blob storage):

mkdir -p /media/ephemeral1/pig11
hadoop dfs -get wasb://default-datasets@paidqubole.blob.core.windows.net/paid-qubole/pig11/pig.tar.gz /media/ephemeral1/pig11/
tar -xvf /media/ephemeral1/pig11/pig.tar.gz -C /media/ephemeral1/pig11/
  2. Edit the specific Qubole cluster in the Control Panel, enter the name of the bootstrap file in Node Bootstrap File, and place the file in the appropriate location in Cloud storage.

With the bootstrap file in the above example, pig11 is installed on all the nodes of the cluster. Later, you can use the Shell Command interface in the QDS user interface to invoke any Pig script with the pig command. You must provide the complete path to pig, as shown in the following example.

/media/ephemeral1/pig11/pig/bin/pig s3n://<BucketName>/<subfolder>/<PigfileName>.pig
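Because the complete path is required, a shell command can first check that the bootstrap script actually installed pig before calling it. The following is a minimal sketch: run_pig and the PIG_BIN override are hypothetical names, and the default path is the one used in the example above.

```shell
#!/bin/bash
# Sketch: run a Pig script through the binary installed by the bootstrap script.
# PIG_BIN defaults to the path from the example; run_pig is a hypothetical helper.
PIG_BIN="${PIG_BIN:-/media/ephemeral1/pig11/pig/bin/pig}"

run_pig() {
    if [ ! -x "$PIG_BIN" ]; then
        # A missing binary usually means the bootstrap script did not finish.
        echo "pig not found at $PIG_BIN; check the node bootstrap log" >&2
        return 127
    fi
    # Pass all arguments through unchanged.
    "$PIG_BIN" "$@"
}
```

A call would then look like run_pig followed by the script location, for example the s3n path shown above with your own bucket and file names filled in.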

Example 5: Installing R and RHadoop on a Cluster

  1. Create a file named node_bootstrap.sh (the file name is user-defined) with the following content:
sudo yum -y install R
echo "install.packages(c(\"rJava\", \"Rcpp\", \"RJSONIO\", \"bitops\", \"digest\",
               \"functional\", \"stringr\", \"plyr\", \"reshape2\", \"dplyr\",
               \"R.methodsS3\", \"caTools\", \"Hmisc\"), repos=\"http://cran.uk.r-project.org\")" > base.R
Rscript base.R
wget https://github.com/RevolutionAnalytics/rhdfs/raw/master/build/rhdfs_1.0.8.tar.gz
echo "install.packages(\"rhdfs_1.0.8.tar.gz\", repos=NULL, type=\"source\")" > rhdfs.R
Rscript rhdfs.R
wget https://github.com/RevolutionAnalytics/rmr2/releases/download/3.3.1/rmr2_3.3.1.tar.gz
echo "install.packages(\"rmr2_3.3.1.tar.gz\", repos=NULL, type=\"source\")" > rmr.R
Rscript rmr.R
cd /usr/lib/hadoop
wget http://www.java2s.com/Code/JarDownload/hadoop-streaming/hadoop-streaming-1.1.2.jar.zip
unzip hadoop-streaming-1.1.2.jar.zip
  2. Edit the specific Qubole cluster in the Control Panel, enter the name of the bootstrap file in Node Bootstrap File, and place the file in the appropriate location in Cloud storage.

The above example installs R, RHadoop, and RHDFS on the cluster nodes. You can now run R commands as well as RHadoop commands. The following is a sample R script that uses RHadoop.

Sys.setenv("HADOOP_STREAMING"="/usr/lib/hadoop/hadoop-streaming-1.1.2.jar")
library(rmr2)
small.ints = to.dfs(1:1000)
mapreduce(
  input = small.ints,
  map = function(k, v) cbind(v, v^2))

Running Adhoc Scripts on a Cluster

At times, you may want to execute some scripts on the cluster in an adhoc manner. To run an adhoc script, you can use a REST API to execute a script located in Cloud storage. See Run Adhoc Scripts on a Cluster for information about the API.

The Run-Adhoc Script functionality uses pssh (parallel ssh) to spawn adhoc scripts on the cluster nodes. It has been tested under the following conditions:

  • It works in clusters that are set up using a proxy tunnel server.
  • The script still executes on the node even if its execution time is longer than the pssh timeout.

Limitations of Running Adhoc Scripts

If a script is running and you try to execute the same script on the same cluster, the second instance will not run. To work around this, you can change the path of the script (for example, by copying it to a different location) and then run it through a separate API call.
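One way to produce a different path for the workaround is to add a suffix to the file name. The sketch below only computes the new path; unique_script_path is a hypothetical helper, and the naming scheme (a suffix before the extension, defaulting to a timestamp) is an assumption rather than a Qubole convention.

```shell
#!/bin/bash
# Sketch: derive a distinct script path so a second run uses a different location.
# The naming scheme is illustrative only.
unique_script_path() {
    local path suffix dir file
    path="$1"                               # e.g. s3://my-bucket/scripts/cleanup.sh
    suffix="${2:-$(date +%Y%m%d%H%M%S)}"    # default suffix: current timestamp
    dir="${path%/*}"                        # everything before the last slash
    file="${path##*/}"                      # file name after the last slash
    case "$file" in
        # Insert the suffix before the extension when there is one.
        *.*) echo "$dir/${file%.*}-$suffix.${file##*.}" ;;
        *)   echo "$dir/$file-$suffix" ;;
    esac
}
```

After computing the new path, copy the script to it with your cloud storage tool of choice (for example, hadoop dfs -cp) and pass the new location in the API request.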