Setting up a Data Store (AWS)

Airflow uses a data store to track the status of jobs, tasks, and other related information. QDS provisions Airflow clusters with a default, cluster-local data store for this purpose. This data store lasts only for the lifetime of the cluster.

For Airflow clusters running on AWS, Qubole recommends you also configure a persistent data store outside the cluster, to simplify the Airflow upgrade process and safeguard DAG metadata from cluster failures. To do this, proceed as follows.

Note

  • Configuring an external, persistent data store for your Airflow cluster is currently supported only on AWS.
  • QDS Airflow clusters support MySQL, Amazon Aurora-MySQL, and Postgres data stores at present.

  1. Create a MySQL, Amazon Aurora-MySQL, or Postgres database in your Cloud account; you may want to name the database airflow for ease of identification.
  2. Use the Explore page in the QDS UI to add the data store you have created.
  3. Edit your Airflow cluster (from the Clusters section of the UI), and select your airflow database from the drop-down in the Data Store field under the Configuration tab. Select Update to save the change.
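Behind the scenes, Airflow reaches its metadata database through a SQLAlchemy connection URI. As a rough sketch of the kind of URI that results from pointing the cluster at an external MySQL store (the host, user, and password below are placeholders, not QDS defaults):

```python
# Sketch of the SQLAlchemy URI Airflow uses for its metadata database.
# The credentials and RDS hostname here are illustrative placeholders.
def airflow_metadata_uri(user, password, host, port=3306, db="airflow"):
    # MySQL shown here; a Postgres store would use a postgresql:// scheme.
    return f"mysql://{user}:{password}@{host}:{port}/{db}"

uri = airflow_metadata_uri(
    "airflow_user", "s3cret",
    "airflow-db.abc123.us-east-1.rds.amazonaws.com")
```

QDS manages this configuration for you when you select the data store in the cluster UI; the sketch only shows the shape of the connection string.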

Authorizing Data Stores for AWS

The following sections describe how to authorize an Airflow cluster to connect to a data store: Authorizing RDS Data Stores to connect to Airflow Clusters in AWS EC2 Classic and Authorizing Data Stores to connect to Airflow Clusters in a VPC.

(For AWS, see RDS Security Groups for more information.)

Authorizing RDS Data Stores to connect to Airflow Clusters in AWS EC2 Classic

Authorize the Airflow cluster in EC2 Classic to connect to the Amazon RDS data store by performing the following steps:

  1. Create an empty EC2 security group in EC2-classic (see Create a Persistent Security Group in AWS).
  2. Authorize the EC2 security group you created in the Amazon RDS DB security group (that is, add it as an ingress source).
  3. Specify the EC2 security group in the Persistent Security Group field while creating an Airflow cluster in QDS. See Configuring an Airflow Cluster for more information.
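The relationship between the two groups in the steps above can be sketched in CloudFormation terms; the resource names and descriptions below are placeholders, not anything QDS creates for you:

```yaml
# Illustrative sketch only (EC2-Classic); names are placeholders.
Resources:
  AirflowClassicSG:                 # step 1: empty EC2-Classic security group
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupDescription: Persistent security group for the Airflow cluster
  AirflowDBSecurityGroup:           # step 2: RDS DB security group that
    Type: AWS::RDS::DBSecurityGroup #   authorizes the EC2 security group
    Properties:
      GroupDescription: Allows the Airflow cluster to reach the RDS data store
      DBSecurityGroupIngress:
        - EC2SecurityGroupName: !Ref AirflowClassicSG
```

The EC2 security group (AirflowClassicSG here) is the one you would then enter in the Persistent Security Group field in step 3.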

Authorizing Data Stores to connect to Airflow Clusters in a VPC

Authorize the Airflow cluster in a VPC to connect with the data store by performing the following steps:

Note

These steps apply only when the cluster and the data store are in the same VPC.

  1. Create an empty VPC security group (SG1) for the Airflow cluster.
  2. Create another VPC security group (SG2) to attach to the DB instance. Set an inbound rule allowing TCP connections on port 3306 (or whichever port your data store uses) from the security group (SG1) created in step 1. (For AWS, Configuring a Cluster in a VPC with Public and Private Subnets (AWS) has more information on inbound rules.)
  3. Launch the DB instance in the same VPC. You can launch the DB instance in a private or public subnet as required and attach the security group (SG2) you created in step 2.
  4. Specify the security group (SG1) created in step 1 in the Security Group field when you create the Airflow cluster. See Configuring an Airflow Cluster for more information.
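The two security groups from the steps above can be sketched in CloudFormation terms; the resource names and the VpcId are placeholders, not values QDS supplies:

```yaml
# Illustrative sketch only; names and VpcId are placeholders.
Resources:
  AirflowClusterSG:              # SG1: attach to the Airflow cluster (step 4)
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupDescription: Airflow cluster security group (SG1)
      VpcId: vpc-0123abcd
  DataStoreSG:                   # SG2: attach to the DB instance (step 3)
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupDescription: Data store security group (SG2)
      VpcId: vpc-0123abcd
      SecurityGroupIngress:
        - IpProtocol: tcp
          FromPort: 3306         # or your data store's port
          ToPort: 3306
          SourceSecurityGroupId: !Ref AirflowClusterSG   # allow SG1 in
```

The key point is that SG2's inbound rule references SG1 rather than an IP range, so any cluster node carrying SG1 can reach the database.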

Note

When the data store is set to the default, the connection authorization password (the AUTH token) is stored directly on the cluster in the default data store. Because that database is deleted when the cluster is terminated, you must re-add the password (AUTH token) each time you restart the cluster.

For more information, see Questions about Airflow.

A database user must have permission to perform all DDL and DML operations on the airflow database. QDS sets up the Airflow tables when launching the cluster.
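For MySQL, such a user might be created along these lines (the database name matches the recommendation above; the user name and password are placeholders you should replace):

```sql
-- Placeholder credentials; QDS creates the Airflow tables itself at launch.
CREATE DATABASE IF NOT EXISTS airflow;
CREATE USER 'airflow_user'@'%' IDENTIFIED BY 'choose-a-password';
GRANT ALL PRIVILEGES ON airflow.* TO 'airflow_user'@'%';
FLUSH PRIVILEGES;
```

GRANT ALL on the airflow database covers the DDL and DML operations QDS needs without granting server-wide privileges.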

Some index key prefixes in Airflow can exceed the MySQL length limit, depending on the character set, storage engine, and MySQL version you use. For example, MySQL's utf8 character set uses up to 3 bytes per character, whereas latin1 uses a single byte. Tune the database to handle such cases; otherwise Airflow database initialization may fail. If you want to use utf-8 or utf-16, you can also increase the maximum index key length, as described in the MySQL (5.6) documentation.
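The arithmetic behind this limit is easy to check. Assuming InnoDB's default index key prefix limit of 767 bytes (the MySQL 5.6 default without innodb_large_prefix), an indexed VARCHAR(255) column fits under latin1 or utf8 but not under a 4-byte character set such as utf8mb4:

```python
# InnoDB's default index key prefix limit (MySQL 5.6, without
# innodb_large_prefix) is 767 bytes.
INNODB_KEY_PREFIX_LIMIT = 767

def index_key_bytes(chars, bytes_per_char):
    """Worst-case bytes needed to index a column of `chars` characters."""
    return chars * bytes_per_char

latin1_len  = index_key_bytes(255, 1)  # 255 bytes  -> fits
utf8_len    = index_key_bytes(255, 3)  # 765 bytes  -> just fits
utf8mb4_len = index_key_bytes(255, 4)  # 1020 bytes -> exceeds the limit
```

This is why a schema that initializes cleanly under latin1 can fail under a wider character set unless the key length limit is raised or the column is shortened.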

You do not need to provide Qubole with network access to the data store. If QDS does not have access, you may see an error such as Data store was created successfully but it could not be activated. You can ignore this message; the data store will still be available to the Airflow cluster. See Understanding a Data Store for more information on creating a data store through the QDS UI, and Create a DbTap for creating one through a REST API call.