Understanding a Node Bootstrap Script

Bootstrap scripts allow installation, management, and configuration of tools useful for cluster monitoring and data loading. A node bootstrap script runs on all cluster nodes, including auto-scaling nodes, when they come up. In an AWS cluster, the script is called a user-data script.

Place node bootstrap scripts in the default location, which looks similar to one of the following:

  • for AWS: s3://pydata.com/scripts/
  • for Azure: wasb://defloc@quboledatastore.blob.core.windows.net/nodebootstrap/
  • for Oracle OCI: oci://<bucket>@<namespace>//defloc/scripts/hadoop/
  • for Oracle OCI Classic: swift://<defloc_container>.oracle/<defloc_directory>/scripts/hadoop/

The logs written by the node bootstrap script are saved in node_bootstrap.log in /media/ephemeral0/logs/others.
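
To check what the script did on a node, you can read the end of that log. The following is a minimal sketch, assuming you have shell access to the node. The function name show_bootstrap_log and the LOG_FILE override are illustrative only and are not part of Qubole's tooling.

```shell
# Log path as given in the text above; LOG_FILE is a hypothetical override for testing.
LOG_FILE="${LOG_FILE:-/media/ephemeral0/logs/others/node_bootstrap.log}"

# Print the last N lines of the node bootstrap log (default: 50).
show_bootstrap_log() {
  if [ -r "$LOG_FILE" ]; then
    tail -n "${1:-50}" "$LOG_FILE"
  else
    echo "No readable log at $LOG_FILE" >&2
    return 1
  fi
}

# Example call; "|| true" keeps the shell going if the log is absent.
show_bootstrap_log 50 || true
```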

The node bootstrap logs are also available in the cluster UI, in the Nodes table of a running cluster. Below the active/running cluster, the UI displays the number of nodes next to Nodes. Click the number to open the Nodes table. For more information on Resources, see Using the Cluster User Interface.

Note

Qubole recommends that you first activate Qubole’s virtual environment and then install or update custom Python libraries inside it. The virtual environment is recommended because it contains many popular Python libraries and has the advantages described in Using Pre-installed Python Libraries from the Qubole VirtualEnv.

You can install or update Python libraries in Qubole’s virtual environment by adding commands to the node bootstrap file, as in the following example:

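# Load Qubole's Bash helper functions.
source /usr/lib/hustler/bin/qubole-bash-lib.sh
# Activate Qubole's Python 2.7 virtual environment.
qubole-use-python2.7
# Install the library into the active environment.
pip install <library name>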

The node bootstrap script is invoked as the root user. It does not have a terminal (TTY or text-only console); note that many programs do not run without a TTY. In Hadoop clusters, a node bootstrap script is invoked after the HDFS daemons have been brought up (the NameNode on the Master and DataNodes on Workers) but before the MapReduce and YARN daemons have been initialized. As a result, Hadoop applications run only after the node bootstrap completes.
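
Because no terminal is attached, commands in the script should not wait for input. One way to guard against that is sketched here. The helper names has_tty and run_noninteractive are hypothetical, and redirecting standard input from /dev/null is a general shell technique rather than anything specific to Qubole.

```shell
# Succeeds only when standard input is attached to a terminal.
has_tty() {
  [ -t 0 ]
}

# Run a command with stdin detached, so a program that tries to read an
# answer from stdin gets end-of-file immediately instead of blocking.
run_noninteractive() {
  "$@" < /dev/null
}

if ! has_tty; then
  echo "No TTY attached; running commands non-interactively." >&2
fi

# Example: 'cat' would otherwise wait for input; here it returns at once.
run_noninteractive cat
```

Programs that open the terminal device directly can still fail in this setting, so prefer each tool's own non-interactive options where they exist.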

The node bootstrap process is idempotent and is executed via code resident on the instance. This code is executed only on the first boot cycle, not on reboot.
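
If you want individual steps in your own script to be safe to re-run as well, one common shell pattern is a marker-file guard. This is a sketch of that general pattern, not a description of how Qubole implements idempotence. The run_once helper and the MARKER_DIR location are hypothetical.

```shell
# Hypothetical directory for marker files; any persistent local path would do.
MARKER_DIR="${MARKER_DIR:-/tmp/bootstrap-markers}"

# run_once NAME CMD...: run CMD only if the step NAME has not completed before.
run_once() {
  step_name="$1"
  shift
  mkdir -p "$MARKER_DIR"
  if [ -e "$MARKER_DIR/$step_name.done" ]; then
    echo "Skipping $step_name (already done)" >&2
    return 0
  fi
  # Record the marker only if the command succeeds.
  "$@" && touch "$MARKER_DIR/$step_name.done"
}
```

A step would then be written as, for example, run_once install-libs pip install <library name>, so that a second invocation skips the work.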

The cluster launch process waits without limit for the node bootstrap script to complete. Specifically, worker daemons and task execution daemons, for example NodeManager (Hadoop2) and TaskTracker (Hadoop1), wait for the script to finish executing.
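
Since nothing bounds this wait, a command that hangs inside the script holds up the node indefinitely. One way to protect against that is to cap long-running commands yourself. The sketch below assumes that GNU coreutils timeout is available on the node; run_with_limit is a hypothetical helper name.

```shell
# run_with_limit DURATION CMD...: run CMD, stopping it after DURATION (for example 300 or 5m).
run_with_limit() {
  limit="$1"
  shift
  if timeout "$limit" "$@"; then
    return 0
  else
    rc=$?
    # GNU timeout exits with status 124 when the time limit was reached.
    if [ "$rc" -eq 124 ]; then
      echo "Command timed out after $limit: $*" >&2
    fi
    return "$rc"
  fi
}

# Example: a quick command finishes well within its limit.
run_with_limit 5 echo "step finished"
```

Whether a timed-out step should abort the whole script or merely be logged depends on how essential the step is to the node.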

Running Node Bootstrap Scripts on a Cluster describes how to run node bootstraps on a cluster. Run Utility Commands in a Cluster describes how to run utility commands that return node-related information, such as whether a node is a Worker or Master, or the master node’s IP address. See also How do I check if a node is a master node or a worker node?
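
As a rough illustration of branching on the node's role inside a bootstrap script, the sketch below maps a master flag to a role name. It rests on an assumption: that the Qubole Bash library sourced in the earlier example provides a nodeinfo command, and that nodeinfo is_master prints 1 on the master node. Confirm this against Run Utility Commands in a Cluster before relying on it. The role_from_flag helper is hypothetical.

```shell
# Library path taken from the Python example earlier on this page.
QUBOLE_LIB=/usr/lib/hustler/bin/qubole-bash-lib.sh

# Map a flag value to a role name; "1" is assumed to mean the master node.
role_from_flag() {
  if [ "$1" = "1" ]; then
    echo master
  else
    echo worker
  fi
}

if [ -r "$QUBOLE_LIB" ]; then
  . "$QUBOLE_LIB"
  # Assumption: 'nodeinfo is_master' is defined by the library and prints 1 on the master.
  role_from_flag "$(nodeinfo is_master)"
fi
```

On a machine without the library, the final block is skipped and only the helper is defined.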