Qubole Operator API

This page describes the Qubole Operator API. For more information on the Qubole Operator, see Introduction to Airflow in Qubole, Qubole Operator Examples, and Questions about Airflow.

class airflow.contrib.operators.QuboleOperator(qubole_conn_id='qubole_default', *args, **kwargs)

Execute tasks (commands) on QDS.

Parameters Description

Parameter	Description
qubole_conn_id	The connection ID which consists of QDS auth_token.

kwargs

Parameter	Description
command_type	Type of command that is to be executed. For example, a Hive, Shell, or Hadoop command.
tags	An array of tags that you can assign to the command.
cluster_label	The label of a cluster on which the command is executed.
name	A name that you can provide to the command. This is a template-supported field.
notify	Set this to receive an email when the command completes. You are notified on a successful or failed command.

Understanding Command-type-specific Parameters

Here are the different command-type-specific parameters:

hivecmd Parameters
prestocmd Parameters
pigcmd Parameters
hadoopcmd Parameters
shellcmd Parameters
sparkcmd Parameters
dbtapquerycmd Parameters
dbexportcmd Mode 1/Simple Mode Parameters
dbexportcmd Mode 2/Advanced Mode Parameters
dbimportcmd Mode 1/Simple Mode Parameters
dbimportcmd Mode 2/Advanced Mode Parameters
get_results
get_log
get_jobs

Note

You can also use .txt files for template-driven use cases.

hivecmd Parameters

Parameter	Description
query	An inline query statement. This is a template-supported field. Either a `query` or a `script_location` is required.
script_location	For AWS, the S3 location that contains the query statement. This is a template-supported field. Either a `query` or a `script_location` is required. This parameter does not currently support Azure blob storage locations, but you can download the file to local storage and pass the local location in this parameter.
sample_size	Sample size in bytes on which to run a query.
macros	Macro values that are used in the query. This is a template-supported field.

prestocmd Parameters

Parameter	Description
query	An inline query statement. This is a template-supported field. Either a `query` or a `script_location` is required.
script_location	For AWS, the S3 location that contains the query statement. It is a template-supported field. Either a `query` or a `script_location` is required. This parameter does not currently support Azure blob storage locations, but you can download the file to local storage and pass the local location in this parameter.
macros	Macro values that are used in the query. This is a template-supported field.

hadoopcmd Parameters

Parameter	Description
sub_command	Must be `jar`, `s3distcp`, or `streaming` followed by 1 or more arguments. This is a template-supported field. (`s3distcp` is valid for all platforms.)

shellcmd Parameters

Parameter	Description
script	An inline command with arguments. This is a template-supported field. Either a `script` or a `script_location` is required.
script_location	For AWS, the S3 location that contains the query statement. This is a template-supported field. Either a `script` or a `script_location` is required. This parameter does not currently support Azure blob storage locations, but you can download the file to local storage and pass the local location in this parameter.
files	A list of files in an AWS S3 bucket in the `file1` and `file2` format. These files are copied into the working directory where the Qubole command is being executed. It is a template-supported field.
archives	A list of archives in an AWS S3 bucket in the `archive1` and `archive2` format. These are unarchived into the working directory where the Qubole command is being executed. This is a template-supported field.
parameters	Any additional arguments which must be passed to the script (only when `script_location` is added). This is a template-supported field.

pigcmd Parameters

Parameter	Description
script	An inline command with arguments. This is a template-supported field. Either a `script` or a `script_location` is required.
script_location	For AWS, the S3 location that contains the query statement. It is a template-supported field. Either a `script` or a `script_location` is required. This parameter does not currently support Azure blob storage locations, but you can download the file to local storage and pass the local location in this parameter.
parameters	Any additional arguments which must be passed to the script (only when `script_location` is added).

sparkcmd Parameters

Parameter	Description
program	The complete Spark program in Scala, SQL, Command, R, or Python. This is a template-supported field. A Spark notebook can be run using the QuboleOperator. For more information, see Qubole Operator Examples.
cmdline	The Spark-submit command line; specify the required information on this command line. This is a template-supported field.
sql	An inline SQL query statement. This is a template-supported field.
script_location	The local file path that contains the query statement. This is a template-supported field. One of the following values must be specified: `script_location` , `program` , `cmdline` , `sql` , or `note_id`. This parameter does not currently support Azure blob storage locations, but you can download the file to local storage and pass the local location in this parameter.
language	The program languages `scala`, `sql`, `R`, `python`, and `notebook` are supported. Specify the language that you want to use.
app_id	The ID of an Spark job server app.
note_id	The ID of a notebook.
arguments	These are Spark-submit command line arguments.
user_program_arguments	These are arguments that the user program accepts.
macros	Macro values that are used in the query. It is a template-supported field.

Example

The following example shows how to use the cmdline spark parameter with the Qubole Operator API.

operator = QuboleOperator(
    task_id='hello_world',
    command_type="sparkcmd",
    cmdline="/usr/lib/spark/bin/spark-submit --max-executors 10 --num-executors 15 --driver-memory 2g --executor-memory 3g --executor-cores 5 s3://mybucket/somelocation/hello_world.py 'myuserargument'",
    dag=dag)

dbtapquerycmd Parameters

Parameter	Description
db_tap_id	The data store ID of the target database in Qubole. This is a template-supported field. Its value is a string and not an integer.
query	An inline query statement. This is a template-supported field.
macros	Macro values that are used in the query. This is a template-supported field.

dbexportcmd Mode 1/Simple Mode Parameters

Parameter	Description
mode	The value must be 1 for the simple mode to push data from QDS to a relational database.
hive_table	The name of the Hive table. This is a template-supported field.
partition_spec	The partition specification for the Hive table. This is a template-supported field.
dbtap_id	The data store ID of the target database in Qubole. This is a template-supported field. Its value is a string and not an integer.
db_table	The name of the DB table. This is a template-supported field.
db_update_mode	The two different types of update modes, `allowinsert` or `updateonly`.
db_update_keys	Columns used to determine the uniqueness of rows and it is only valid for `db_update_mode`. This is a template-supported field.

dbexportcmd Mode 2/Advanced Mode Parameters

Parameter	Description
mode	The mode value for advanced mode is `2` to export to an HDFS directory or a storage location.
dbtap_id	The data store ID of the target database in Qubole. This is a template-supported field. Its value is a string and not an integer.
db_table	The name of the DB table. This is a template-supported field.
db_update_mode	The two different types of update modes, `allowinsert` or `updateonly`.
db_update_keys	Columns used to determine the uniqueness of rows and it is only valid for `db_update_mode`. This is a template-supported field.
export_dir	An HDFS/Cloud location from which data is exported. This is a template-supported field.
fields_terminated_by	The Hex value of the character used as a column separator in the dataset.

dbimportcmd Mode 1/Simple Mode Parameters

Parameter	Description
mode	The mode value for simple mode is `1` to pull data from a relational database to QDS in a Hive table.
hive_table	The name of the Hive table. This is a template-supported field.
dbtap_id	The data store ID of the target database in Qubole. This is a template-supported field. Its value is a string and not an integer.
db_table	The name of the db table. This is a template-supported field.
where_clause	The `WHERE` clause (if any). This is a template-supported field.
parallelism	The number of parallel database connections used for extracting the data.

dbimportcmd Mode 2/Advanced Mode Parameters

Parameter	Description
mode	The mode value for advanced mode is `2` to specify a custom query to transform the data before pulling it.
hive_table	The name of the Hive table. This is a template-supported field.
dbtap_id	The data store ID of the target database in Qubole. This is a template-supported field. Its value is a string and not an integer.
db_table	The name of the db table. This is a template-supported field.
parallelism	The number of parallel database connections used for extracting the data.
extract_query	The SQL query to extract data from the database. `$CONDITIONS` must be part of the `WHERE` clause. This is a template-supported field.
boundary_query	The query used to get range of row IDs that are to be extracted. This is a template-supported field.
split_column	Column used as row ID to split data into ranges. This is a template-supported field.

get_results

This command returns standard output of the command represented by the Qubole Operator.

Parameter	Description
delim	Specify the delimiter (example can be a `,`, `(space)`, and so on to segregate each row’s data. Delimiter replaces Ctrl + A from results data.
fp	Use this to write command results directly into a file . If you do not specify `fp`, Airflow creates an `fp` and returns it.
inline	This parameter decides whether or not to display the command results inline as a CRLF-separated string.
fetch	This parameter decides whether or not to download large results directly from the Cloud; it is set to `true` by default. It becomes effective only when `inline` is set to `true`. If `inline` is `true` and `fetch` is `false`, only the Cloud path is displayed.
ti	The TaskInstance object.

get_log

This command returns standard logs (in a raw text format) of the command represented by the Qubole Operator.

Parameter	Description
ti	The TaskInstance object.

get_jobs

This command returns jobs of the command represented by the Qubole Operator. It calls the Jobs API and retrieves the details of the hadoop jobs spawned on the cluster by command (command_id). This information is only available for commands, which have been completed.

Parameter	Description
ti	The TaskInstance object.