Qubole Operator API¶
This page describes the Qubole Operator API. For more information on the Qubole Operator, see Introduction to Airflow in Qubole, Qubole Operator Examples, and Questions about Airflow.
class airflow.contrib.operators.QuboleOperator(qubole_conn_id='qubole_default', *args, **kwargs)
Execute tasks (commands) on QDS.
Parameters Description¶
Parameter | Description |
---|---|
qubole_conn_id | The connection ID which consists of QDS auth_token. |
kwargs¶
Parameter | Description |
---|---|
command_type | Type of command that is to be executed. For example, a Hive, Shell, or Hadoop command. |
tags | An array of tags that you can assign to the command. |
cluster_label | The label of a cluster on which the command is executed. |
name | A name that you can provide to the command. This is a template-supported field. |
notify | Set this to receive an email when the command completes. You are notified on a successful or failed command. |
Understanding Command-type-specific Parameters¶
Here are the different command-type-specific parameters:
- hivecmd Parameters
- prestocmd Parameters
- pigcmd Parameters
- hadoopcmd Parameters
- shellcmd Parameters
- sparkcmd Parameters
- dbtapquerycmd Parameters
- dbexportcmd Mode 1/Simple Mode Parameters
- dbexportcmd Mode 2/Advanced Mode Parameters
- dbimportcmd Mode 1/Simple Mode Parameters
- dbimportcmd Mode 2/Advanced Mode Parameters
- get_results
- get_log
- get_jobs
Note
You can also use .txt
files for template-driven use cases.
hivecmd Parameters¶
Parameter | Description |
---|---|
query | An inline query statement. This is a template-supported field. Either a query or a
script_location is required. |
script_location | For AWS, the S3 location that contains the query statement. This is a template-supported field. Either a
query or a script_location is required. This parameter does not currently support Azure
blob storage locations, but you can download the file to local storage and pass the local location in this
parameter. |
sample_size | Sample size in bytes on which to run a query. |
macros | Macro values that are used in the query. This is a template-supported field. |
prestocmd Parameters¶
Parameter | Description |
---|---|
query | An inline query statement. This is a template-supported field. Either a query or a
script_location is required. |
script_location | For AWS, the S3 location that contains the query statement. It is a template-supported field. Either a
query or a script_location is required.
This parameter does not currently support Azure blob storage locations, but you can download the file to
local storage and pass the local location in this parameter. |
macros | Macro values that are used in the query. This is a template-supported field. |
hadoopcmd Parameters¶
Parameter | Description |
---|---|
sub_command | Must be jar , s3distcp , or streaming followed by 1 or more arguments.
This is a template-supported field. (s3distcp is valid for all platforms.) |
shellcmd Parameters¶
Parameter | Description |
---|---|
script | An inline command with arguments. This is a template-supported field. Either a script or a
script_location is required. |
script_location | For AWS, the S3 location that contains the query statement. This is a template-supported field. Either a
script or a script_location is required.
This parameter does not currently support Azure blob storage locations, but you can download the file to
local storage and pass the local location in this parameter. |
files | A list of files in an AWS S3 bucket in the file1 and file2 format. These files are copied
into the working directory where the Qubole command is being executed.
It is a template-supported field. |
archives | A list of archives in an AWS S3 bucket in the archive1 and archive2 format. These are
unarchived into the working directory where the Qubole command is being executed.
This is a template-supported field. |
parameters | Any additional arguments which must be passed to the script (only when script_location is added).
This is a template-supported field. |
pigcmd Parameters¶
Parameter | Description |
---|---|
script | An inline command with arguments. This is a template-supported field. Either a script or a
script_location is required. |
script_location | For AWS, the S3 location that contains the query statement. It is a template-supported field. Either a
script or a script_location is required.
This parameter does not currently support Azure blob storage locations, but you can download the file to
local storage and pass the local location in this parameter. |
parameters | Any additional arguments which must be passed to the script (only when script_location is added). |
sparkcmd Parameters¶
Parameter | Description |
---|---|
program | The complete Spark program in Scala, SQL, Command, R, or Python. This is a template-supported field. A Spark notebook can be run using the QuboleOperator. For more information, see Qubole Operator Examples. |
cmdline | The Spark-submit command line; specify the required information on this command line. This is a template-supported field. |
sql | An inline SQL query statement. This is a template-supported field. |
script_location | The local file path that contains the query statement. This is a template-supported field. One of the
following values must be specified: script_location , program , cmdline , sql , or
note_id .
This parameter does not currently support Azure blob storage locations, but you can download the file to
local storage and pass the local location in this parameter. |
language | The program languages scala , sql , R , python , and notebook are
supported. Specify the language that you want to use. |
app_id | The ID of an Spark job server app. |
note_id | The ID of a notebook. |
arguments | These are Spark-submit command line arguments. |
user_program_arguments | These are arguments that the user program accepts. |
macros | Macro values that are used in the query. It is a template-supported field. |
Example¶
The following example shows how to use the cmdline
spark parameter with the Qubole Operator API.
operator = QuboleOperator(
task_id='hello_world',
command_type="sparkcmd",
cmdline="/usr/lib/spark/bin/spark-submit --max-executors 10 --num-executors 15 --driver-memory 2g --executor-memory 3g --executor-cores 5 s3://mybucket/somelocation/hello_world.py 'myuserargument'",
dag=dag)
dbtapquerycmd Parameters¶
Parameter | Description |
---|---|
db_tap_id | The data store ID of the target database in Qubole. This is a template-supported field. Its value is a string and not an integer. |
query | An inline query statement. This is a template-supported field. |
macros | Macro values that are used in the query. This is a template-supported field. |
dbexportcmd Mode 1/Simple Mode Parameters¶
Parameter | Description |
---|---|
mode | The value must be 1 for the simple mode to push data from QDS to a relational database. |
hive_table | The name of the Hive table. This is a template-supported field. |
partition_spec | The partition specification for the Hive table. This is a template-supported field. |
dbtap_id | The data store ID of the target database in Qubole. This is a template-supported field. Its value is a string and not an integer. |
db_table | The name of the DB table. This is a template-supported field. |
db_update_mode | The two different types of update modes, allowinsert or updateonly . |
db_update_keys | Columns used to determine the uniqueness of rows and it is only valid for db_update_mode . This is a
template-supported field. |
dbexportcmd Mode 2/Advanced Mode Parameters¶
Parameter | Description |
---|---|
mode | The mode value for advanced mode is 2 to export to an HDFS directory or a storage location. |
dbtap_id | The data store ID of the target database in Qubole. This is a template-supported field. Its value is a string and not an integer. |
db_table | The name of the DB table. This is a template-supported field. |
db_update_mode | The two different types of update modes, allowinsert or updateonly . |
db_update_keys | Columns used to determine the uniqueness of rows and it is only valid for db_update_mode . This is a
template-supported field. |
export_dir | An HDFS/Cloud location from which data is exported. This is a template-supported field. |
fields_terminated_by | The Hex value of the character used as a column separator in the dataset. |
dbimportcmd Mode 1/Simple Mode Parameters¶
Parameter | Description |
---|---|
mode | The mode value for simple mode is 1 to pull data from a relational database to QDS in a Hive table. |
hive_table | The name of the Hive table. This is a template-supported field. |
dbtap_id | The data store ID of the target database in Qubole. This is a template-supported field. Its value is a string and not an integer. |
db_table | The name of the db table. This is a template-supported field. |
where_clause | The WHERE clause (if any). This is a template-supported field. |
parallelism | The number of parallel database connections used for extracting the data. |
dbimportcmd Mode 2/Advanced Mode Parameters¶
Parameter | Description |
---|---|
mode | The mode value for advanced mode is 2 to specify a custom query to transform the data before pulling it. |
hive_table | The name of the Hive table. This is a template-supported field. |
dbtap_id | The data store ID of the target database in Qubole. This is a template-supported field. Its value is a string and not an integer. |
db_table | The name of the db table. This is a template-supported field. |
parallelism | The number of parallel database connections used for extracting the data. |
extract_query | The SQL query to extract data from the database. $CONDITIONS must be part of the WHERE clause.
This is a template-supported field. |
boundary_query | The query used to get range of row IDs that are to be extracted. This is a template-supported field. |
split_column | Column used as row ID to split data into ranges. This is a template-supported field. |
get_results¶
This command returns standard output of the command represented by the Qubole Operator.
Parameter | Description |
---|---|
delim | Specify the delimiter (example can be a , , (space) , and so on to segregate each row’s data.
Delimiter replaces Ctrl + A from results data. |
fp | Use this to write command results directly into a file . If you do not specify fp , Airflow creates
an fp and returns it. |
inline | This parameter decides whether or not to display the command results inline as a CRLF-separated string. |
fetch | This parameter decides whether or not to download large results directly from the Cloud; it is set to
true by default. It becomes effective only when inline is set to true . If inline is true
and fetch is false , only the Cloud path is displayed. |
ti | The TaskInstance object. |
get_log¶
This command returns standard logs (in a raw text format) of the command represented by the Qubole Operator.
Parameter | Description |
---|---|
ti | The TaskInstance object. |
get_jobs¶
This command returns jobs of the command represented by the Qubole Operator. It calls the Jobs API and retrieves the
details of the hadoop jobs spawned on the cluster by command (command_id
). This information is only available for
commands, which have been completed.
Parameter | Description |
---|---|
ti | The TaskInstance object. |