Troubleshooting Airflow Issues

This topic describes best practices for Airflow, along with common issues and their solutions.

Cleaning up Root Partition Space by Removing the Task Logs

You can set up a cron job to reclaim root partition space that is filled by task logs. An Airflow cluster usually runs for a long time, so it can accumulate a large volume of logs, which can cause problems for scheduled jobs. To clear old logs, set up a cron job by following these steps:

  1. Edit the crontab:

    sudo crontab -e

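  2. Add the following line at the end and save the file. It runs daily at midnight and deletes log files that were last modified more than seven days ago: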

    0 0 * * * /bin/find $AIRFLOW_HOME/logs -type f -mtime +7 -exec rm -f {} \;
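Before relying on the cron entry, it can help to preview which files it would remove. The following is a minimal sketch, assuming a POSIX shell; the function names and the directory and age arguments are illustrative and not part of the original steps. Because cron typically runs with a minimal environment, the sketch takes the log directory as an explicit argument instead of assuming that $AIRFLOW_HOME is set.

```shell
# Sketch only: list_old_logs and delete_old_logs are hypothetical helper names.

# Print task log files older than the given number of days without deleting them.
list_old_logs() {
  # $1 = log directory, $2 = age threshold in days
  find "$1" -type f -mtime +"$2" -print
}

# Delete the same set of files that list_old_logs prints.
delete_old_logs() {
  find "$1" -type f -mtime +"$2" -exec rm -f {} \;
}

# Example (run manually first):
#   list_old_logs "$AIRFLOW_HOME/logs" 7
#   delete_old_logs "$AIRFLOW_HOME/logs" 7
```

Running the preview first makes it easier to confirm that the seven-day threshold does not remove logs that are still needed.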

Upgrading Airflow from Version 1.7.0 to Version 1.8.2 on QDS-on-AWS

Airflow 1.8.2 is supported only with MySQL 5.6 or later. Before upgrading, if you are not using the MySQL instance on the cluster, ensure that your MySQL version is 5.6 or later. You can check the MySQL version by running mysql --version on the machine on which MySQL is installed. Perform these steps to upgrade:

  1. Log in to the Airflow cluster.

  2. Stop the Airflow scheduler and workers by running the following commands:

    sudo monit stop scheduler; sudo monit stop worker;

  3. Create a compressed archive of the DAGs in the $AIRFLOW_HOME/dags folder:

    cd $AIRFLOW_HOME/
    tar -zcvf dags.tar.gz dags/

  4. Upload the DAG archive to AWS S3 by using the following command:

    s3cmd -c /usr/lib/hustler/s3cfg put dags.tar.gz s3://<defloc>/logs/airflow/dags_backup/

  5. Check that all workers have finished the tasks that they were executing. Run ps -ef|grep airflow and wait until no airflow run commands are running.

  6. Navigate to the Clusters UI page. Clone the existing cluster and update the cloned cluster's version to 1.8.2.

  7. Log in to the cloned cluster.

  8. Download the DAG archive from AWS S3 and extract it into $AIRFLOW_HOME/dags by using the following commands:

    s3cmd -c /usr/lib/hustler/s3cfg get s3://<defloc>/logs/airflow/dags_backup/dags.tar.gz
    tar -zxvf dags.tar.gz
    mv dags $AIRFLOW_HOME/dags
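The wait in step 5 can be scripted instead of re-running ps by hand. The following is a minimal sketch, assuming a POSIX shell; the function names and the 30-second polling interval are illustrative choices and not part of the original procedure.

```shell
# Sketch only: helper names and the polling interval are assumptions.

# Count lines that mention "airflow run" in ps-style text read from stdin.
# grep -c exits non-zero when the count is 0, so "|| true" keeps the function's status clean.
count_airflow_runs() {
  grep -c 'airflow run' || true
}

# Poll until no "airflow run" task processes remain on this node.
wait_for_airflow_tasks() {
  while [ "$(ps -ef | grep -v grep | count_airflow_runs)" -gt 0 ]; do
    echo "airflow run processes still active; waiting..."
    sleep 30
  done
  echo "No airflow run processes found."
}

# Example:
#   wait_for_airflow_tasks
```

Once the function returns, no task processes were visible on the node at the time of the last check, which is the condition step 5 asks for before cloning the cluster.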

Using macros with Airflow

See Macros on Airflow for information on how to use macros.

Common Issues with Possible Solutions

Issue 1: When a DAG has X number of tasks but it has only Y number of running tasks

Check the DAG concurrency setting (dag_concurrency) in the Airflow configuration file (airflow.cfg).
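To read the current limits without opening the file, a small helper such as the following may be enough. It is a sketch under stated assumptions: get_cfg_value is a hypothetical name, it returns only the first matching key, and it ignores section headers. In open-source Airflow 1.x, dag_concurrency limits concurrent task instances for a DAG and parallelism limits them across the installation.

```shell
# Sketch only: get_cfg_value is a hypothetical helper, not an Airflow command.

# Print the value of a key from an INI-style file such as airflow.cfg.
get_cfg_value() {
  # $1 = path to the configuration file, $2 = key name
  sed -n "s/^[[:space:]]*$2[[:space:]]*=[[:space:]]*//p" "$1" | head -n 1
}

# Example:
#   get_cfg_value "$AIRFLOW_HOME/airflow.cfg" dag_concurrency
#   get_cfg_value "$AIRFLOW_HOME/airflow.cfg" parallelism
```

If the number of running tasks (Y) matches one of these values, the limit is the likely reason that the remaining tasks are waiting.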

Issue 2: When it is difficult to trigger one of the DAGs

Check the following:

  1. Check the connection ID used in the task or Qubole operator. There could be an issue with the API token used in the connection. To check the connection ID, navigate to Airflow Webserver -> Admin -> Connections.
  2. Check the datastore connection, sql_alchemy_conn, in the Airflow configuration file (airflow.cfg).

If neither of these checks reveals an issue, create a ticket with Qubole Support.
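When inspecting sql_alchemy_conn, or sharing it in a ticket, it is safer not to expose the password. The following sketch masks the password part of a connection URL. The function names are hypothetical, and the pattern assumes the usual scheme://user:password@host form with no @ character inside the password.

```shell
# Sketch only: both helper names are assumptions for illustration.

# Replace the password in scheme://user:password@host URLs read from stdin with ***.
mask_conn_password() {
  sed -E 's#(://[^:/@]+):[^@]*@#\1:***@#'
}

# Print the sql_alchemy_conn line from a configuration file with the password masked.
show_sql_alchemy_conn() {
  # $1 = path to airflow.cfg
  grep -E '^[[:space:]]*sql_alchemy_conn[[:space:]]*=' "$1" | mask_conn_password
}

# Example:
#   show_sql_alchemy_conn "$AIRFLOW_HOME/airflow.cfg"
```

The masked output still shows the driver, user, host, and database name, which are usually the parts worth verifying.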

Issue 3: Tasks for a specific DAG get stuck

Check whether the depends_on_past property is enabled in the airflow.cfg file. Depending on its value, choose the appropriate solution:

  1. If depends_on_past is enabled, check the runtime of the last task that ran, successfully or not, before the task got stuck. If that runtime is longer than the interval between scheduled runs of the DAG, the DAG or its tasks are stuck for this reason. This is a bug in open-source Airflow. Create a ticket with Qubole Support to clear the stuck task. Before creating the ticket, gather the information described in Troubleshooting Query Problems – Before You Contact Support.
  2. If depends_on_past is not enabled, create a ticket with Qubole Support. Before creating the ticket, gather the information described in Troubleshooting Query Problems – Before You Contact Support.
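In open-source Airflow, depends_on_past is an operator argument that is usually set per task or through a DAG's default_args, so it can help to search the DAG files as well as airflow.cfg. The following is a minimal sketch; find_depends_on_past is a hypothetical helper name, and the directory layout is the one this topic describes for $AIRFLOW_HOME.

```shell
# Sketch only: find_depends_on_past is a hypothetical helper name.

# Print every line that mentions depends_on_past in airflow.cfg and in the DAG files.
# Output format is path:line-number:matching-line. Missing paths are silently skipped.
find_depends_on_past() {
  # $1 = Airflow home directory (for example, "$AIRFLOW_HOME")
  grep -rn 'depends_on_past' "$1/airflow.cfg" "$1/dags" 2>/dev/null || true
}

# Example:
#   find_depends_on_past "$AIRFLOW_HOME"
```

An empty result suggests that the property is not set anywhere and is therefore at its default.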

Issue 4: When manually running a DAG is impossible

If you cannot manually run a DAG from the UI, follow these steps:

  1. Go to line 902 of the /usr/lib/virtualenv/python27/lib/python2.7/site-packages/apache_airflow-1.9.0.dev0+incubating-py2.7.egg/airflow/www/ file.
  2. Replace the import statement "from airflow.executors import CeleryExecutor" with "from airflow.executors.celery_executor import CeleryExecutor".
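If you prefer to script the change rather than edit the file by hand, the following sketch rewrites the import in place and keeps a backup copy. It is an illustration only: fix_celery_import is a hypothetical helper, the file path is passed as an argument because it depends on your installation, and the pattern matches the statement wherever it appears in the file rather than at a specific line number.

```shell
# Sketch only: fix_celery_import is a hypothetical helper name.

# Rewrite the CeleryExecutor import in the given Python file.
# Leading indentation is preserved, and the original file is saved with a .bak suffix.
fix_celery_import() {
  # $1 = path to the Python file that contains the old import
  sed -E -i.bak \
    's/^([[:space:]]*)from airflow\.executors import CeleryExecutor[[:space:]]*$/\1from airflow.executors.celery_executor import CeleryExecutor/' \
    "$1"
}

# Example (substitute the file from step 1):
#   sudo sh -c '. ./this_script.sh; fix_celery_import /path/to/file.py'
```

Keeping the .bak copy makes it straightforward to revert if the change does not resolve the problem.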

Questions on Airflow Service Issues

Here is a list of FAQs that are related to Airflow service issues with corresponding solutions.

  1. Which logs do I look up for Airflow cluster startup issues?

    Refer to the logs of the Airflow services that are brought up during cluster startup.

  2. Where can I find Airflow Services logs?

    The Airflow services are Scheduler, Webserver, Celery, and RabbitMQ. The service logs are available at /media/ephemeral0/logs/airflow on the cluster node. Because an Airflow cluster is a single-node machine, all logs are accessible on the same node. These logs are helpful in troubleshooting cluster bring-up and scheduling issues.

  3. What is $AIRFLOW_HOME?

    $AIRFLOW_HOME is a location that contains all configuration files, DAGs, plugins, and task logs. It is an environment variable set to /usr/lib/airflow for all machine users.

  4. Where can I find Airflow Configuration files?

    The configuration file is located at $AIRFLOW_HOME/airflow.cfg.

  5. Where can I find Airflow DAGs?

    The DAG files are available in the $AIRFLOW_HOME/dags folder.

  6. Where can I find Airflow task logs?

    The task logs are available in the $AIRFLOW_HOME/logs folder.

  7. Where can I find Airflow plugins?

    The plugins are available in the $AIRFLOW_HOME/plugins folder.

  8. How do I restart Airflow Services?

    You can start, stop, or restart each Airflow service. The command for each service is given below, where <action> is start, stop, or restart:

    • Run sudo monit <action> scheduler for Airflow Scheduler.
    • Run sudo monit <action> webserver for Airflow Webserver.
    • Run sudo monit <action> worker for Celery workers. A stop operation gracefully shuts down the existing workers. A start operation adds the number of workers specified in the configuration. A restart operation gracefully shuts down the existing workers and then adds the number of workers specified in the configuration.
    • Run sudo monit <action> rabbitmq for RabbitMQ.

  9. How do I invoke Airflow CLI commands within the node?

    Airflow is installed inside a virtual environment at /usr/lib/virtualenv/python27. First, activate the virtual environment by running source /usr/lib/virtualenv/python27/bin/activate, and then run the Airflow command.
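The activate-then-run sequence in the last answer can be wrapped in a small function so that the virtual environment is active only for the duration of one command. This is a sketch, assuming a POSIX shell: run_airflow is a hypothetical helper, and AIRFLOW_VENV is a hypothetical override variable that defaults to the path given above.

```shell
# Sketch only: run_airflow and AIRFLOW_VENV are assumptions for illustration.

# Run an Airflow CLI command inside the virtual environment.
# The subshell keeps the activation from leaking into the calling shell.
run_airflow() {
  venv="${AIRFLOW_VENV:-/usr/lib/virtualenv/python27}"
  ( . "$venv/bin/activate" && airflow "$@" )
}

# Example:
#   run_airflow list_dags
```

Because the activation happens in a subshell, the PATH and prompt of your login shell stay unchanged after the command finishes.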

Questions on DAGs

Is there any button to run a DAG on Airflow?

There is no button to run a DAG in the Qubole UI, but the Airflow 1.8.2 web server UI provides one.

How do I delete a DAG?

Deleting a DAG is still not very intuitive in Airflow. Qubole supports its own implementation of deleting DAGs, but you must be careful in using it.

To delete a DAG, submit the following command from the Qubole Analyze UI.

    airflow delete_dag <dag_id> -f

The above command deletes the DAG Python code along with its history from the datastore. Two types of errors can occur while deleting a DAG:

  • DAG isn't available in Dagbag:

    This happens when the DAG Python code is not found on the cluster’s DAG location. In that case, nothing can be done from the UI and it would need a manual inspection.

  • Active DAG runs:

    If there are active DAG runs pending for the DAG, then QDS cannot delete it. In such a case, you can visit the DAG and mark all tasks under those DAG runs as completed and try again.

Can I create a configuration to externally trigger an Airflow DAG?

No, but you can trigger DAGs from the QDS Analyze UI using the shell command airflow trigger_dag <DAG>....

If there is no connection password, the qubole_example_operator DAG will fail when it is triggered.