Using the Node Bootstrap on Airflow Clusters (AWS)

In QDS, all clusters share the same node bootstrap script by default, but for an Airflow cluster running on AWS, Qubole recommends you configure a separate node bootstrap script.

Note

A separate, Airflow-specific node bootstrap script is currently supported only on AWS.

Through the node bootstrap script, you can perform the tasks described in the following sections.

Install Packages on Airflow Cluster

Add this code snippet in the node bootstrap to install packages on the Airflow cluster.

# this activates the virtual environment in which Airflow is running, so that we can install packages in it
source ${AIRFLOW_HOME}/airflow/qubole_assembly/scripts/virtualenv.sh activate

pip install <package name>
source ${AIRFLOW_HOME}/airflow/qubole_assembly/scripts/virtualenv.sh deactivate

Automatically Synchronize DAGs from Amazon S3

Add this code snippet in the node bootstrap to automatically synchronize DAGs from Amazon S3 when the cluster is in an IAM Keys-based account.

# install awscli, as we'll use it for sync
source ${AIRFLOW_HOME}/airflow/qubole_assembly/scripts/virtualenv.sh activate
pip install awscli

# set config file which will be used by awscli
mkdir -p ~/.aws
echo "[default]" >> ~/.aws/config
echo "aws_access_key_id=$(s3cmd -c /usr/lib/hustler/s3cfg --dump-config | grep access_key | awk '{print $3}')" >> ~/.aws/config
echo "aws_secret_access_key=$(s3cmd -c /usr/lib/hustler/s3cfg --dump-config | grep secret_key | awk '{print $3}')" >> ~/.aws/config

# prepare the command
command="*/5 * * * * aws s3 sync s3://{path_to_dags}/airflow_dags/ $AIRFLOW_HOME/dags"

# register it on cron
crontab -l | { cat; echo "$command"; } | crontab -
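
Because the snippet above appends to ~/.aws/config with >>, running it a second time on the same node would add a second [default] section. The following is a minimal sketch of one way to avoid that, not a Qubole-provided script: s3cfg_value and write_aws_config are hypothetical helper names, and the "key = value" layout of the s3cmd --dump-config output is an assumption inferred from the awk '{print $3}' call above.

```shell
# Hypothetical helper: print the value of one key from "key = value" lines on stdin.
# Matching the key name exactly avoids picking up other keys that contain it.
s3cfg_value() {
    awk -v key="$1" '$1 == key && $2 == "=" { print $3; exit }'
}

# Hypothetical helper: write an awscli config file into the given directory.
write_aws_config() {
    dir="$1"
    access_key="$2"
    secret_key="$3"
    mkdir -p "$dir"
    # Truncate and rewrite the file so a rerun does not append a duplicate section.
    printf '[default]\naws_access_key_id=%s\naws_secret_access_key=%s\n' \
        "$access_key" "$secret_key" > "$dir/config"
    # Restrict the file to its owner, since it holds a secret key.
    chmod 600 "$dir/config"
}

# Usage on the cluster (commented out because it needs s3cmd and /usr/lib/hustler/s3cfg):
#   dump="$(s3cmd -c /usr/lib/hustler/s3cfg --dump-config)"
#   write_aws_config ~/.aws \
#       "$(printf '%s\n' "$dump" | s3cfg_value access_key)" \
#       "$(printf '%s\n' "$dump" | s3cfg_value secret_key)"
```

The design choice here is to read the s3cmd configuration once and overwrite the target file, so the result is the same no matter how many times the bootstrap runs.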

Add this code snippet in the node bootstrap to automatically synchronize DAGs from Amazon S3 when the cluster is in an IAM Roles-based account.

# install awscli, as we'll use it for sync
source ${AIRFLOW_HOME}/airflow/qubole_assembly/scripts/virtualenv.sh activate
pip install awscli

# prepare the command to sync every 5 minutes
command="*/5 * * * * aws s3 sync s3://{path_to_dags}/airflow_dags/ $AIRFLOW_HOME/dags"

# register it on cron
crontab -l | { cat; echo "$command"; } | crontab -
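
The cron registration line above appends the entry unconditionally, so if the bootstrap script runs more than once on the same node the same entry ends up in the crontab several times. A possible guard is sketched below. The function name add_cron_entry_once is hypothetical and not part of QDS. It is a plain filter over the crontab text, so it can be tried without touching the real crontab.

```shell
# Hypothetical helper: read an existing crontab on stdin and print it with the
# given entry appended only if that exact line is not already present.
add_cron_entry_once() {
    entry="$1"
    existing="$(cat)"
    if [ -n "$existing" ]; then
        printf '%s\n' "$existing"
    fi
    # -F treats the entry as a fixed string (the asterisks are literal),
    # -x requires a whole-line match, -q suppresses output.
    if ! printf '%s\n' "$existing" | grep -Fxq -- "$entry"; then
        printf '%s\n' "$entry"
    fi
}

# Usage on the cluster (commented out because it rewrites the user's crontab):
#   crontab -l 2>/dev/null | add_cron_entry_once "$command" | crontab -
```

The same helper could be used for the GitHub-based synchronization, since that snippet registers its cron entry in the same way.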

Automatically Synchronize DAGs from a GitHub Repository

Add this code snippet in the node bootstrap editor to automatically synchronize DAGs from a GitHub repository.

# clone the repo using a GitHub access token
git clone https://{access_token}@github.com/username/airflow-dags.git $AIRFLOW_HOME/dags

# prepare command
command="*/5 * * * * cd $AIRFLOW_HOME/dags && git pull"

# register it on cron
crontab -l | { cat; echo "$command"; } | crontab -
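
Note that git clone refuses to write into a destination directory that already exists and is not empty, so the clone step above fails if $AIRFLOW_HOME/dags already contains files or if the bootstrap is rerun. The sketch below shows one way to handle both cases. The function name sync_dags_repo is hypothetical, and the choice of --ff-only (refuse to merge if the local copy has diverged) is an assumption about the desired behavior rather than something the original snippet specifies.

```shell
# Hypothetical helper: clone the DAG repository on first run, pull on later runs.
sync_dags_repo() {
    repo_url="$1"
    dest="$2"
    if [ -d "$dest/.git" ]; then
        # Already a git checkout: update it in a subshell so the caller's
        # working directory is left unchanged.
        ( cd "$dest" && git pull --ff-only )
    else
        git clone "$repo_url" "$dest"
    fi
}

# Usage on the cluster (commented out because it needs network access and a real token):
#   sync_dags_repo "https://{access_token}@github.com/username/airflow-dags.git" "$AIRFLOW_HOME/dags"
```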

Create a User in RabbitMQ to Access It Through the Dashboard

If you are using the RabbitMQ instance that is installed on the cluster and you want to access its dashboard through QDS, create a new user in RabbitMQ, because the default user (guest) cannot access the RabbitMQ dashboard from outside.

Add the following code snippet to the node bootstrap to create a new user (new_user) in RabbitMQ.

/usr/sbin/rabbitmqctl add_user new_user new_password
/usr/sbin/rabbitmqctl set_user_tags new_user administrator
/usr/sbin/rabbitmqctl set_permissions -p / new_user ".*" ".*" ".*"
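
rabbitmqctl add_user returns an error when the user already exists, which matters if the bootstrap can run again on a node where the user was created earlier. The following sketch checks the output of rabbitmqctl list_users first. The function names rabbitmq_user_exists and ensure_rabbitmq_admin are hypothetical, and the sketch assumes that list_users prints one user per line with the user name in the first column.

```shell
# Path taken from the snippet above; overridable for testing.
RABBITMQCTL="${RABBITMQCTL:-/usr/sbin/rabbitmqctl}"

# Hypothetical helper: succeed if the given user name appears in the first
# column of "rabbitmqctl list_users" output read on stdin.
rabbitmq_user_exists() {
    awk -v user="$1" '$1 == user { found = 1 } END { exit !found }'
}

# Hypothetical helper: create the user only if missing, then (re)apply the
# administrator tag and full permissions on the default vhost.
ensure_rabbitmq_admin() {
    user="$1"
    password="$2"
    if "$RABBITMQCTL" list_users | rabbitmq_user_exists "$user"; then
        echo "user $user already exists, skipping add_user"
    else
        "$RABBITMQCTL" add_user "$user" "$password"
    fi
    "$RABBITMQCTL" set_user_tags "$user" administrator
    "$RABBITMQCTL" set_permissions -p / "$user" ".*" ".*" ".*"
}

# Usage on the cluster (commented out because it needs a running RabbitMQ node):
#   ensure_rabbitmq_admin new_user new_password
```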