Qubole Operator API

This page describes the Qubole Operator API. For more information on the Qubole Operator, see Introduction to Airflow in Qubole, Qubole Operator Examples, and Questions about Airflow.

class airflow.contrib.operators.QuboleOperator(qubole_conn_id='qubole_default', *args, **kwargs)

Execute tasks (commands) on QDS.

Parameters Description

Parameter

Description

qubole_conn_id

The connection ID which consists of QDS auth_token.

kwargs

Parameter

Description

command_type

Type of command that is to be executed. For example, a Hive, Shell, or Hadoop command.

tags

An array of tags that you can assign to the command.

cluster_label

The label of a cluster on which the command is executed.

name

A name that you can provide to the command. This is a template-supported field.

notify

Set this to receive an email when the command completes. You are notified on a successful or failed command.

Understanding Command-type-specific Parameters

Here are the different command-type-specific parameters:

Note

You can also use .txt files for template-driven use cases.

hivecmd Parameters

Parameter

Description

query

An inline query statement. This is a template-supported field. Either a query or a script_location is required.

script_location

For AWS, the S3 location that contains the query statement. This is a template-supported field. Either a query or a script_location is required. This parameter does not currently support Azure blob storage locations, but you can download the file to local storage and pass the local location in this parameter.

sample_size

Sample size in bytes on which to run a query.

macros

Macro values that are used in the query. This is a template-supported field.

prestocmd Parameters

Parameter

Description

query

An inline query statement. This is a template-supported field. Either a query or a script_location is required.

script_location

For AWS, the S3 location that contains the query statement. It is a template-supported field. Either a query or a script_location is required. This parameter does not currently support Azure blob storage locations, but you can download the file to local storage and pass the local location in this parameter.

macros

Macro values that are used in the query. This is a template-supported field.

hadoopcmd Parameters

Parameter

Description

sub_command

Must be jar, s3distcp, or streaming followed by 1 or more arguments. This is a template-supported field. (s3distcp is valid for all platforms.)

shellcmd Parameters

Parameter

Description

script

An inline command with arguments. This is a template-supported field. Either a script or a script_location is required.

script_location

For AWS, the S3 location that contains the query statement. This is a template-supported field. Either a script or a script_location is required. This parameter does not currently support Azure blob storage locations, but you can download the file to local storage and pass the local location in this parameter.

files

A list of files in an AWS S3 bucket in the file1 and file2 format. These files are copied into the working directory where the Qubole command is being executed. It is a template-supported field.

archives

A list of archives in an AWS S3 bucket in the archive1 and archive2 format. These are unarchived into the working directory where the Qubole command is being executed. This is a template-supported field.

parameters

Any additional arguments which must be passed to the script (only when script_location is added). This is a template-supported field.

pigcmd Parameters

Parameter

Description

script

An inline command with arguments. This is a template-supported field. Either a script or a script_location is required.

script_location

For AWS, the S3 location that contains the query statement. It is a template-supported field. Either a script or a script_location is required. This parameter does not currently support Azure blob storage locations, but you can download the file to local storage and pass the local location in this parameter.

parameters

Any additional arguments which must be passed to the script (only when script_location is added).

sparkcmd Parameters

Parameter

Description

program

The complete Spark program in Scala, SQL, Command, R, or Python. This is a template-supported field. A Spark notebook can be run using the QuboleOperator. For more information, see Qubole Operator Examples.

cmdline

The Spark-submit command line; specify the required information on this command line. This is a template-supported field.

sql

An inline SQL query statement. This is a template-supported field.

script_location

The local file path that contains the query statement. This is a template-supported field. One of the following values must be specified: script_location , program , cmdline , sql , or note_id. This parameter does not currently support Azure blob storage locations, but you can download the file to local storage and pass the local location in this parameter.

language

The program languages scala, sql, R, python, and notebook are supported. Specify the language that you want to use.

app_id

The ID of an Spark job server app.

note_id

The ID of a notebook.

arguments

These are Spark-submit command line arguments.

user_program_arguments

These are arguments that the user program accepts.

macros

Macro values that are used in the query. It is a template-supported field.

Example

The following example shows how to use the cmdline spark parameter with the Qubole Operator API.

operator = QuboleOperator(
    task_id='hello_world',
    command_type="sparkcmd",
    cmdline="/usr/lib/spark/bin/spark-submit --max-executors 10 --num-executors 15 --driver-memory 2g --executor-memory 3g --executor-cores 5 s3://mybucket/somelocation/hello_world.py 'myuserargument'",
    dag=dag)

dbtapquerycmd Parameters

Parameter

Description

db_tap_id

The data store ID of the target database in Qubole. This is a template-supported field. Its value is a string and not an integer.

query

An inline query statement. This is a template-supported field.

macros

Macro values that are used in the query. This is a template-supported field.

dbexportcmd Mode 1/Simple Mode Parameters

Parameter

Description

mode

The value must be 1 for the simple mode to push data from QDS to a relational database.

hive_table

The name of the Hive table. This is a template-supported field.

partition_spec

The partition specification for the Hive table. This is a template-supported field.

dbtap_id

The data store ID of the target database in Qubole. This is a template-supported field. Its value is a string and not an integer.

db_table

The name of the DB table. This is a template-supported field.

db_update_mode

The two different types of update modes, allowinsert or updateonly.

db_update_keys

Columns used to determine the uniqueness of rows and it is only valid for db_update_mode. This is a template-supported field.

dbexportcmd Mode 2/Advanced Mode Parameters

Parameter

Description

mode

The mode value for advanced mode is 2 to export to an HDFS directory or a storage location.

dbtap_id

The data store ID of the target database in Qubole. This is a template-supported field. Its value is a string and not an integer.

db_table

The name of the DB table. This is a template-supported field.

db_update_mode

The two different types of update modes, allowinsert or updateonly.

db_update_keys

Columns used to determine the uniqueness of rows and it is only valid for db_update_mode. This is a template-supported field.

export_dir

An HDFS/Cloud location from which data is exported. This is a template-supported field.

fields_terminated_by

The Hex value of the character used as a column separator in the dataset.

dbimportcmd Mode 1/Simple Mode Parameters

Parameter

Description

mode

The mode value for simple mode is 1 to pull data from a relational database to QDS in a Hive table.

hive_table

The name of the Hive table. This is a template-supported field.

dbtap_id

The data store ID of the target database in Qubole. This is a template-supported field. Its value is a string and not an integer.

db_table

The name of the db table. This is a template-supported field.

where_clause

The WHERE clause (if any). This is a template-supported field.

parallelism

The number of parallel database connections used for extracting the data.

dbimportcmd Mode 2/Advanced Mode Parameters

Parameter

Description

mode

The mode value for advanced mode is 2 to specify a custom query to transform the data before pulling it.

hive_table

The name of the Hive table. This is a template-supported field.

dbtap_id

The data store ID of the target database in Qubole. This is a template-supported field. Its value is a string and not an integer.

db_table

The name of the db table. This is a template-supported field.

parallelism

The number of parallel database connections used for extracting the data.

extract_query

The SQL query to extract data from the database. $CONDITIONS must be part of the WHERE clause. This is a template-supported field.

boundary_query

The query used to get range of row IDs that are to be extracted. This is a template-supported field.

split_column

Column used as row ID to split data into ranges. This is a template-supported field.

get_results

This command returns standard output of the command represented by the Qubole Operator.

Parameter

Description

delim

Specify the delimiter (example can be a ,, (space), and so on to segregate each row’s data. Delimiter replaces Ctrl + A from results data.

fp

Use this to write command results directly into a file . If you do not specify fp, Airflow creates an fp and returns it.

inline

This parameter decides whether or not to display the command results inline as a CRLF-separated string.

fetch

This parameter decides whether or not to download large results directly from the Cloud; it is set to true by default. It becomes effective only when inline is set to true. If inline is true and fetch is false, only the Cloud path is displayed.

ti

The TaskInstance object.

get_log

This command returns standard logs (in a raw text format) of the command represented by the Qubole Operator.

Parameter

Description

ti

The TaskInstance object.

get_jobs

This command returns jobs of the command represented by the Qubole Operator. It calls the Jobs API and retrieves the details of the hadoop jobs spawned on the cluster by command (command_id). This information is only available for commands, which have been completed.

Parameter

Description

ti

The TaskInstance object.