Understanding a Node Bootstrap Script

Bootstrap scripts allow installation, management, and configuration of tools useful for cluster monitoring and data loading. A node bootstrap script runs on all cluster nodes, including autoscaling nodes, when they come up.

In an AWS cluster, the script is called a user-data script.

Node bootstrap scripts must be placed in the default location, which looks similar to one of the following:

  • for AWS: s3://<default location of the account>/scripts/hadoop/
  • for Azure: wasb://defloc@quboledatastore.blob.core.windows.net/nodebootstrap/
  • for Oracle OCI: oci://<bucket>@<namespace>//defloc/scripts/hadoop/
  • for Oracle OCI Classic: swift://<defloc_container>.oracle/<defloc_directory>/scripts/hadoop/

The logs written by the node bootstrap script are saved in node_bootstrap.log in /media/ephemeral0/logs/others.
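
To check quickly whether a bootstrap run logged any problems on a node, you can search that log file. The following is a minimal sketch: the function name show_bootstrap_errors and the keyword list are illustrative assumptions, not part of Qubole's tooling.

```shell
# Hypothetical helper (not part of Qubole): print lines from the node
# bootstrap log that look like errors. The keyword list is an assumption.
show_bootstrap_errors() {
  local log_file="${1:-/media/ephemeral0/logs/others/node_bootstrap.log}"
  if [ ! -r "$log_file" ]; then
    echo "cannot read $log_file" >&2
    return 1
  fi
  # -n prefixes line numbers, -i ignores case, -E enables alternation
  grep -n -i -E 'error|failed|not found' "$log_file"
}
```

Called without arguments on a cluster node, the function reads the default log path given above. grep returns a non-zero status when nothing matches, so a non-zero result with no output simply means no suspicious lines were found.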

The Node Bootstrap Logs are also available in the cluster UI as part of the Nodes table for a running cluster. In the cluster UI, below a running cluster, the number of nodes in the cluster is displayed next to Nodes. Click the number to see the Nodes table. For more information on Resources, see Using the Cluster User Interface.

Note

Qubole recommends that you install or update custom Python libraries only after activating Qubole’s virtual environment, so that the libraries are installed in it. Qubole’s virtual environment is recommended because it contains many popular Python libraries and has the advantages described in Using Pre-installed Python Libraries from the Qubole VirtualEnv.

You can install or update Python libraries in Qubole’s virtual environment by adding a script to the node bootstrap file as in the following example:

source /usr/lib/hustler/bin/qubole-bash-lib.sh
qubole-use-python2.7
pip install <library name>

The node bootstrap script is invoked as the root user. It does not have a terminal (TTY or text-only console); note that many programs do not run without a TTY. In Hadoop clusters, the script is invoked on worker nodes after the HDFS daemons have been brought up but before the MapReduce and YARN daemons have been initialized. On the master node, however, it is invoked after the ResourceManager is started. This means that Hadoop applications run only after the node bootstrap completes.
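
Because the script runs without a terminal, a command that prompts for input can hang or fail. The following is a minimal sketch of one way to guard against that, assuming a hypothetical wrapper named run_noninteractive. It redirects standard input from /dev/null, so a command that tries to read a prompt gets end-of-file immediately instead of waiting.

```shell
# Hypothetical wrapper (not part of Qubole): run a command with stdin
# detached so that it cannot block waiting for interactive input.
run_noninteractive() {
  "$@" < /dev/null
}

# Example: `read` gets end-of-file at once instead of waiting for a user.
if run_noninteractive read -r answer; then
  echo "got input"
else
  echo "no input available"
fi
```

Package managers usually need their own assume-yes option as well, for example yum install -y <package name>.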

The node bootstrap process is executed via code resident on the node. This code is executed only on the first boot cycle, not on reboot.

The cluster launch process waits without limit for the node bootstrap script to complete. Specifically, worker daemons and task execution daemons, for example the NodeManager (Hadoop 2), wait for the script to finish executing.
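
Because the wait has no limit, a single hung command in the bootstrap can keep a node from joining the cluster. One way to bound a step, sketched here, is the coreutils timeout command. The wrapper name run_with_limit and the time limits are illustrative assumptions.

```shell
# Hypothetical wrapper (not part of Qubole): give up on a command after a
# number of seconds so that one slow step cannot stall the node forever.
run_with_limit() {
  local seconds="$1"
  shift
  timeout "$seconds" "$@"
  local rc=$?
  if [ "$rc" -eq 124 ]; then
    # GNU timeout exits with status 124 when the time limit was hit
    echo "timed out after ${seconds}s: $*" >&2
  fi
  return "$rc"
}

# Example: a step that takes too long is abandoned and the script continues.
run_with_limit 1 sleep 5 || echo "continuing without optional step"
```

This pattern suits optional steps only. A step that jobs depend on should fail loudly instead of being skipped.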

Qubole provides a library of certified bootstrap functions for use in node bootstraps. It is recommended to use those certified bootstrap functions to avoid compatibility issues with future versions of Qubole Software.

Running Node Bootstrap Scripts on a Cluster describes how to run node bootstraps on a cluster. Run Utility Commands in a Cluster describes how to run utility commands that return node-related information, such as whether a node is a worker or the master, or the master node’s IP address. See also How do I check if a node is a master node or a worker node?.

Understanding the Multistep Node Bootstrap

Note

Currently, the multistep node bootstrap feature is supported only on Qubole-on-AWS.

A single node bootstrap file that runs at a specific time in the bootup sequence of the master and worker nodes poses these two major problems:

  • If the bootstrap changes anything with respect to the daemons that start before it, the daemons must restart. This can cause command failures as sometimes daemons do not come up in time.
  • Some functions that are run in the bootstrap are not critical to running jobs, for example, installing monitoring software such as statsd or security software such as Qualys. These are not required to run tasks on worker nodes, but they get installed before the NodeManager starts. This delays the node from joining YARN and therefore slows down upscaling.

To overcome the problems mentioned above, Qubole supports running a bootstrap script at multiple points in the node startup sequence in YARN-based clusters. Specifically, you can run functions at the following points:

  • On the master node:
    • Before any of the services are started
    • After all the services are started
  • On worker nodes:
    • Before any of the services are started
    • After some services (such as data node) are started but before the NodeManager is started, that is before tasks can start running.
    • After the NodeManager is started

An example of a multistep node bootstrap is as follows.

source /usr/lib/qubole/bootstrap-functions/misc/util.sh

function pre_service_start() {
  set_timezone "US/Mountain"
}

function pre_task_start() {
  pip install numpy
  pip install scipy
}

function post_start() {
  ssh_key="ssh-rsa foo.... <email address>"
  add_to_authorized_keys "${ssh_key}" "ec2-user"
}

A multistep node bootstrap defines up to three functions that are executed at specific stages of instance boot as follows:

  1. pre_service_start: This function is executed after services are configured but before they are started. This includes services such as HMS. Thus, any configuration overrides that you apply in this function do not get overridden by Qubole’s defaults. In addition, making configuration changes in this function avoids the need to restart daemons. Because operations that execute in this function delay marking the cluster Up/Active, it is recommended to keep this function short.
  2. pre_task_start: This function is equivalent to the current single step bootstrap on worker nodes. It runs before NodeManager is started.
  3. post_start: This function is executed after all services are started. You can treat this as an ideal location to place any function that’s not directly related to the command execution. On the master node, this function is equivalent to the single function bootstrap.

Mapping of Multistep Node Bootstrap with Node Bootstrap

Multistep Node Bootstrap    Node Bootstrap
pre_service_start           It is new only in the multistep node bootstrap.
pre_task_start              It is equivalent to the current bootstrap on worker nodes. It is not applicable to the master node.
post_start                  It is equivalent to the current bootstrap on the master node.
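
Following this mapping, an existing single-step script can be carried over by calling its body from pre_task_start on worker nodes and from post_start on the master node. The sketch below rests on two assumptions. First, the old script body has been moved into a hypothetical function named legacy_bootstrap. Second, the nodeinfo is_master utility command (see Run Utility Commands in a Cluster) prints 1 on the master node. Verify both against your cluster before relying on this.

```shell
# Assumption: Qubole's bash library provides `nodeinfo`; source it if present.
if [ -f /usr/lib/hustler/bin/qubole-bash-lib.sh ]; then
  source /usr/lib/hustler/bin/qubole-bash-lib.sh
fi

# Hypothetical helper: succeeds only on the master node.
# Assumption: `nodeinfo is_master` prints 1 on the master.
is_master_node() {
  [ "$(nodeinfo is_master 2>/dev/null)" = "1" ]
}

# Hypothetical function holding the body of the old single-step script.
legacy_bootstrap() {
  echo "running legacy bootstrap steps"
}

# Runs on worker nodes only, at the point where the single-step script ran.
pre_task_start() {
  legacy_bootstrap
}

# Runs after all services start; repeat the legacy steps only on the
# master, where post_start matches the single-step timing.
post_start() {
  if is_master_node; then
    legacy_bootstrap
  fi
}
```

Steps that are not needed for running tasks, such as installing monitoring agents, can then be moved out of legacy_bootstrap and into post_start so that they no longer delay worker nodes from joining YARN.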