Introduction to Airflow in Qubole

Airflow is a platform for programmatically authoring, scheduling, and monitoring workflows, and it supports integration with third-party platforms. You can author complex directed acyclic graphs (DAGs) of tasks inside Airflow. It comes packaged with a rich feature set that is essential to the ETL world. Its rich user interface and command-line utilities make it easy to visualize pipelines running in production, monitor progress, and troubleshoot issues as required.

See QDS Components: Supported Versions and Cloud Platforms for up-to-date information on support for Airflow in QDS.

Airflow Principles

  • Dynamic: Airflow pipelines are configured as code (Python). (In the ETL world, "pipeline" is synonymous with "workflow".) Defining pipelines in code allows dynamic pipeline generation: you can write code that instantiates pipelines on the fly, as shown in the sketch after this list.
  • Extensible: You can easily define your own Airflow operators and executors, and extend the library to fit the level of abstraction that suits your environment.
  • Scalable: Airflow has a modular architecture and uses a message queue to orchestrate an arbitrary number of workers, so it can scale out to meet demand.
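
For example, here is a minimal sketch of the "Dynamic" principle, using the standard BashOperator to generate one task per table in a loop. The dag_id, start date, and table names are illustrative assumptions:

from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

dag = DAG(dag_id='dynamic_example', start_date=datetime(2019, 1, 1))

# Because the pipeline is ordinary Python, tasks can be generated in a loop.
for table in ['users', 'orders', 'events']:  # illustrative table names
    BashOperator(
        task_id='export_{}'.format(table),
        bash_command='echo "exporting {}"'.format(table),
        dag=dag,
    )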

Qubole Airflow is derived from Apache (Incubator) Airflow versions 1.10.0 and 1.10.2. Airflow as a service provides the following features:

  • A single-click deployment of Airflow and other required services on a Cloud
  • Cluster and configuration management
  • Linking Airflow with QDS
  • Airflow monitoring dashboards

Note

Qubole supports file and Hive table sensors that Airflow can use to programmatically monitor workflows. For more information, see File and Partition Sensors and the Sensor API documentation. A minimal sensor sketch follows this note.
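
As an illustration, the sketch below defines a file sensor that waits for an object to appear in S3, using the QuboleFileSensor from Airflow's contrib package. The connection ID, S3 path, and intervals are assumptions for the example, and dag is presumed to be an existing DAG object:

from airflow.contrib.sensors.qubole_sensor import QuboleFileSensor

# Succeeds once the given S3 object exists; path and intervals are illustrative.
check_s3_file = QuboleFileSensor(
    task_id='check_s3_file',
    qubole_conn_id='qubole_default',
    poke_interval=60,   # seconds between checks
    timeout=600,        # fail the task after 10 minutes of waiting
    data={'files': ['s3://my-bucket/path/to/file']},
    dag=dag,
)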

Qubole Operator

Qubole has introduced a new type of Airflow operator called QuboleOperator. You can use this operator just like any other Airflow operator. When the operator executes in a workflow, it submits a command to QDS and waits until the command completes. You can execute any valid Qubole command from the QuboleOperator. In addition to the required Airflow parameters such as task_id and dag, submitting a command within QDS requires other key-value arguments. For example, to submit a Hive command within QDS, define QuboleOperator as shown below:

from airflow.contrib.operators.qubole_operator import QuboleOperator

hive_operator = QuboleOperator(task_id='hive_show_table', command_type='hivecmd',
                               query='show tables', cluster_label='default',
                               fetch_logs=True, dag=dag)
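
For context, the sketch below wires the same operator into a complete DAG and chains it to a second task. The dag_id, schedule, and the follow-up Shell command task are illustrative assumptions, not part of the original example:

from datetime import datetime, timedelta

from airflow import DAG
from airflow.contrib.operators.qubole_operator import QuboleOperator

# Illustrative DAG definition; dag_id and schedule are assumptions.
dag = DAG(
    dag_id='qubole_hive_example',
    start_date=datetime(2019, 1, 1),
    schedule_interval=timedelta(days=1),
)

hive_operator = QuboleOperator(
    task_id='hive_show_table',
    command_type='hivecmd',
    query='show tables',
    cluster_label='default',
    fetch_logs=True,  # pull the Qubole command logs into the Airflow task log
    dag=dag,
)

# A second, hypothetical task, showing that QuboleOperator participates in
# ordinary Airflow dependency wiring.
shell_operator = QuboleOperator(
    task_id='shell_echo',
    command_type='shellcmd',
    script='echo "hive task finished"',
    cluster_label='default',
    dag=dag,
)

hive_operator >> shell_operator  # run the Shell command after the Hive query

Each task runs as a separate Qubole command, so the Airflow task state tracks the state of the corresponding command in QDS.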

For the supported command types and their required parameters, see the detailed documentation of the QuboleOperator class in the Airflow codebase. See Qubole Operator Examples DAG for QuboleOperator usage with various use cases.

For more information, see Questions about Airflow.