Qubole Operator API

This page describes the Qubole Operator API. For more information on the Qubole Operator, see Introduction to Airflow in Qubole, Qubole Operator Examples, and Questions about Airflow.

class airflow.contrib.operators.QuboleOperator(qubole_conn_id='qubole_default', *args, **kwargs)

Execute tasks (commands) on QDS.

Parameters

Parameter Description
qubole_conn_id The connection ID, which contains the QDS auth token.

kwargs

Parameter Description
command_type The type of command to execute, for example a Hive, Shell, or Hadoop command.
tags An array of tags that you can assign to the command.
cluster_label The label of a cluster on which the command is executed.
name A name that you can provide to the command. This is a template-supported field.
notify Set this to receive an email when the command completes, whether it succeeds or fails.
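
For illustration, here is a minimal sketch of a task built from these arguments; the DAG object (dag), cluster label, and tag values are placeholders, not fixed API values:

from airflow.contrib.operators.qubole_operator import QuboleOperator

task = QuboleOperator(
    task_id='show_tables',
    command_type='hivecmd',           # the type of command to run on QDS
    query='show tables',              # hivecmd-specific field (see below)
    cluster_label='default',          # label of the cluster to execute on
    tags=['example_tag'],             # tags attached to the command
    notify=True,                      # email when the command completes
    qubole_conn_id='qubole_default',  # connection that holds the QDS auth token
    dag=dag)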

Understanding Command-type-specific Parameters

Here are the different command-type-specific parameters:

Note

You can also use .txt files for template-driven use cases.

hivecmd Parameters

Parameter Description
query An inline query statement. This is a template-supported field. Either a query or a script_location is required.
script_location For AWS, the S3 location that contains the query statement. This is a template-supported field. Either a query or a script_location is required. This parameter does not currently support Azure blob storage locations, but you can download the file to local storage and pass the local location in this parameter.
sample_size Sample size in bytes on which to run a query.
macros Macro values that are used in the query. This is a template-supported field.
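
A minimal sketch of a hivecmd task with an inline query; the task ID, query, and cluster label are placeholders, and dag is assumed to be defined as in the example above:

hive_task = QuboleOperator(
    task_id='hive_inline_query',
    command_type='hivecmd',
    query='show tables',        # inline query; use script_location instead
                                # to point at a query file in S3
    cluster_label='default',
    dag=dag)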

prestocmd Parameters

Parameter Description
query An inline query statement. This is a template-supported field. Either a query or a script_location is required.
script_location For AWS, the S3 location that contains the query statement. This is a template-supported field. Either a query or a script_location is required. This parameter does not currently support Azure blob storage locations, but you can download the file to local storage and pass the local location in this parameter.
macros Macro values that are used in the query. This is a template-supported field.
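
A similar sketch for a prestocmd task; the cluster label is assumed to point at a Presto cluster:

presto_task = QuboleOperator(
    task_id='presto_inline_query',
    command_type='prestocmd',
    query='show tables',
    cluster_label='presto',   # placeholder label of a Presto cluster
    dag=dag)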

hadoopcmd Parameters

Parameter Description
sub_command Must be jar, s3distcp, or streaming, followed by one or more arguments. This is a template-supported field. (s3distcp is valid for all platforms.)
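
A hedged sketch of a hadoopcmd task; the jar path and its arguments are placeholders:

hadoop_task = QuboleOperator(
    task_id='hadoop_jar_cmd',
    command_type='hadoopcmd',
    # sub_command starts with jar, s3distcp, or streaming,
    # followed by the arguments for that subcommand
    sub_command='jar s3://mybucket/jars/myjob.jar input_path output_path',
    cluster_label='default',
    dag=dag)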

shellcmd Parameters

Parameter Description
script An inline command with arguments. This is a template-supported field. Either a script or a script_location is required.
script_location For AWS, the S3 location that contains the script. This is a template-supported field. Either a script or a script_location is required. This parameter does not currently support Azure blob storage locations, but you can download the file to local storage and pass the local location in this parameter.
files A list of files in an AWS S3 bucket, in the format file1,file2. These files are copied into the working directory where the Qubole command is executed. This is a template-supported field.
archives A list of archives in an AWS S3 bucket, in the format archive1,archive2. These are unarchived into the working directory where the Qubole command is executed. This is a template-supported field.
parameters Any additional arguments to pass to the script (valid only when script_location is specified). This is a template-supported field.
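
A hedged sketch of a shellcmd task that runs a script from S3; the script path and parameter values are placeholders:

shell_task = QuboleOperator(
    task_id='shell_from_s3',
    command_type='shellcmd',
    script_location='s3://mybucket/scripts/myscript.sh',  # placeholder path
    parameters='param1 param2',   # passed to the script; valid only
                                  # because script_location is specified
    dag=dag)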

pigcmd Parameters

Parameter Description
script An inline command with arguments. This is a template-supported field. Either a script or a script_location is required.
script_location For AWS, the S3 location that contains the Pig script. This is a template-supported field. Either a script or a script_location is required. This parameter does not currently support Azure blob storage locations, but you can download the file to local storage and pass the local location in this parameter.
parameters Any additional arguments to pass to the script (valid only when script_location is specified).
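
A hedged sketch of a pigcmd task; the script path and key-value parameters are placeholders:

pig_task = QuboleOperator(
    task_id='pig_from_s3',
    command_type='pigcmd',
    script_location='s3://mybucket/scripts/script1.pig',  # placeholder path
    parameters='key1=value1 key2=value2',
    cluster_label='default',
    dag=dag)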

sparkcmd Parameters

Parameter Description
program The complete Spark program in Scala, SQL, Command, R, or Python. This is a template-supported field. A Spark notebook can be run using the QuboleOperator. For more information, see Qubole Operator Examples.
cmdline The Spark-submit command line; specify the required information on this command line. This is a template-supported field.
sql An inline SQL query statement. This is a template-supported field.
script_location The local file path that contains the query statement. This is a template-supported field. One of script_location, program, cmdline, sql, or note_id must be specified. This parameter does not currently support Azure blob storage locations, but you can download the file to local storage and pass the local location in this parameter.
language The language of the program: scala, sql, R, python, or notebook.
app_id The ID of a Spark job server app.
note_id The ID of a notebook.
arguments The spark-submit command-line arguments.
user_program_arguments The arguments that the user program accepts.
macros Macro values that are used in the query. This is a template-supported field.

Example

The following example shows how to use the sparkcmd cmdline parameter with the Qubole Operator API.

operator = QuboleOperator(
    task_id='hello_world',
    command_type="sparkcmd",
    cmdline="/usr/lib/spark/bin/spark-submit --max-executors 10 --num-executors 15 --driver-memory 2g --executor-memory 3g --executor-cores 5 s3://mybucket/somelocation/hello_world.py 'myuserargument'",
    dag=dag)

dbtapquerycmd Parameters

Parameter Description
db_tap_id The data store ID of the target database in Qubole. This is a template-supported field. Its value is a string and not an integer.
query An inline query statement. This is a template-supported field.
macros Macro values that are used in the query. This is a template-supported field.
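
A hedged sketch of a dbtapquerycmd task; the data store ID is a placeholder and is passed as a string:

db_query_task = QuboleOperator(
    task_id='db_query',
    command_type='dbtapquerycmd',
    query='show tables',
    db_tap_id='2064',   # placeholder data store ID, passed as a string
    dag=dag)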

dbexportcmd Mode 1/Simple Mode Parameters

Parameter Description
mode Set the value to 1 for simple mode, which pushes data from QDS to a relational database.
hive_table The name of the Hive table. This is a template-supported field.
partition_spec The partition specification for the Hive table. This is a template-supported field.
dbtap_id The data store ID of the target database in Qubole. This is a template-supported field. Its value is a string and not an integer.
db_table The name of the DB table. This is a template-supported field.
db_update_mode The update mode: either allowinsert or updateonly.
db_update_keys The columns used to determine the uniqueness of rows; valid only when db_update_mode is updateonly. This is a template-supported field.
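
A hedged sketch of a simple-mode export; the table names, partition specification, and data store ID are placeholders:

db_export_task = QuboleOperator(
    task_id='db_export',
    command_type='dbexportcmd',
    mode=1,                          # simple mode: Hive table to database
    hive_table='my_hive_table',      # placeholder table names
    partition_spec='dt=20110104-02',
    dbtap_id='2064',                 # placeholder data store ID, as a string
    db_table='my_db_table',
    db_update_mode='allowinsert',
    dag=dag)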

dbexportcmd Mode 2/Advanced Mode Parameters

Parameter Description
mode Set the value to 2 for advanced mode, which exports data from an HDFS directory or Cloud storage location.
dbtap_id The data store ID of the target database in Qubole. This is a template-supported field. Its value is a string and not an integer.
db_table The name of the DB table. This is a template-supported field.
db_update_mode The update mode: either allowinsert or updateonly.
db_update_keys The columns used to determine the uniqueness of rows; valid only when db_update_mode is updateonly. This is a template-supported field.
export_dir An HDFS/Cloud location from which data is exported. This is a template-supported field.
fields_terminated_by The hex value of the character used as the column separator in the dataset.
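
A hedged sketch of an advanced-mode export; the export directory, table name, and data store ID are placeholders, and the '\\x09' separator value assumes tab-separated columns and a hex-escape encoding:

db_export_adv_task = QuboleOperator(
    task_id='db_export_advanced',
    command_type='dbexportcmd',
    mode=2,                                 # advanced mode: HDFS/Cloud location to database
    dbtap_id='2064',                        # placeholder data store ID
    db_table='my_db_table',
    export_dir='s3://mybucket/export_dir',  # placeholder location to export from
    fields_terminated_by='\\x09',           # hex value for tab (assumed encoding)
    dag=dag)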

dbimportcmd Mode 1/Simple Mode Parameters

Parameter Description
mode Set the value to 1 for simple mode, which pulls data from a relational database into a Hive table in QDS.
hive_table The name of the Hive table. This is a template-supported field.
dbtap_id The data store ID of the target database in Qubole. This is a template-supported field. Its value is a string and not an integer.
db_table The name of the DB table. This is a template-supported field.
where_clause The WHERE clause (if any). This is a template-supported field.
parallelism The number of parallel database connections used for extracting the data.
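
A hedged sketch of a simple-mode import; the table names, data store ID, and WHERE clause are placeholders:

db_import_task = QuboleOperator(
    task_id='db_import',
    command_type='dbimportcmd',
    mode=1,                          # simple mode: database table to Hive table
    hive_table='hive_import_table',  # placeholder table names
    dbtap_id='2064',                 # placeholder data store ID
    db_table='my_db_table',
    where_clause='id < 10',
    parallelism=2,
    dag=dag)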

dbimportcmd Mode 2/Advanced Mode Parameters

Parameter Description
mode Set the value to 2 for advanced mode, which lets you specify a custom query to transform the data before pulling it.
hive_table The name of the Hive table. This is a template-supported field.
dbtap_id The data store ID of the target database in Qubole. This is a template-supported field. Its value is a string and not an integer.
db_table The name of the DB table. This is a template-supported field.
parallelism The number of parallel database connections used for extracting the data.
extract_query The SQL query to extract data from the database. $CONDITIONS must be part of the WHERE clause. This is a template-supported field.
boundary_query The query used to get the range of row IDs to extract. This is a template-supported field.
split_column The column used as the row ID to split data into ranges. This is a template-supported field.
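
A hedged sketch of an advanced-mode import; the queries, column name, and IDs are placeholders. Note that $CONDITIONS appears in the WHERE clause of extract_query, as required:

db_import_adv_task = QuboleOperator(
    task_id='db_import_advanced',
    command_type='dbimportcmd',
    mode=2,                          # advanced mode with a custom extract query
    hive_table='hive_import_table',
    dbtap_id='2064',                 # placeholder data store ID
    db_table='my_db_table',
    # $CONDITIONS must appear in the WHERE clause so the query can be
    # split across parallel connections
    extract_query='SELECT id, name FROM my_db_table WHERE $CONDITIONS AND id < 100',
    boundary_query='SELECT MIN(id), MAX(id) FROM my_db_table',
    split_column='id',
    parallelism=2,
    dag=dag)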

get_results

This method returns the standard output of the command represented by the Qubole Operator.

Parameter Description
delim The delimiter used to separate each row's data, for example a comma (,) or a space. The delimiter replaces Ctrl+A in the results data.
fp Use this to write command results directly into a file. If you do not specify fp, Airflow creates a file pointer and returns it.
inline Decides whether to return the command results inline as a CRLF-separated string.
fetch Decides whether to download large results directly from the Cloud; it is set to true by default. This takes effect only when inline is true. If inline is true and fetch is false, only the Cloud path is returned.
ti The TaskInstance object.
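
A hedged sketch of retrieving results from a downstream task; it reuses hive_task from the hivecmd example above, and the PythonOperator import and provide_context flag follow the contrib-era Airflow API:

from airflow.operators.python_operator import PythonOperator

def fetch_results(**kwargs):
    ti = kwargs['ti']
    # get_results writes the command output to a local file
    # and returns the file path
    results_path = hive_task.get_results(ti, delim=',')
    with open(results_path) as results:
        return results.read()

fetch_task = PythonOperator(
    task_id='fetch_results',
    python_callable=fetch_results,
    provide_context=True,   # pass the TaskInstance (ti) into kwargs
    dag=dag)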

get_log

This method returns the standard logs (in raw text format) of the command represented by the Qubole Operator.

Parameter Description
ti The TaskInstance object.

get_jobs

This method returns the jobs of the command represented by the Qubole Operator. It calls the Jobs API and retrieves details of the Hadoop jobs that the command (command_id) spawned on the cluster. This information is available only for completed commands.

Parameter Description
ti The TaskInstance object.
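
Both get_log and get_jobs take only the TaskInstance. A hedged sketch, again reusing hive_task from the hivecmd example; it could run in a downstream PythonOperator like the fetch_results sketch above:

def inspect_command(**kwargs):
    ti = kwargs['ti']
    log_text = hive_task.get_log(ti)   # standard logs in raw text format
    jobs = hive_task.get_jobs(ti)      # Hadoop job details; available only
                                       # once the command has completed
    print(log_text, jobs)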