Qubole Operator API
This page describes the Qubole Operator API. For more information on the Qubole Operator, see Introduction to Airflow in Qubole, Qubole Operator Examples, and Questions about Airflow.
class airflow.contrib.operators.QuboleOperator(qubole_conn_id='qubole_default', *args, **kwargs)
Execute tasks (commands) on QDS.
Parameters

Parameter | Description
---|---
qubole_conn_id | The connection ID, which consists of the QDS auth_token.
kwargs

Parameter | Description
---|---
command_type | The type of command to be executed, for example, a Hive, Shell, or Hadoop command.
tags | An array of tags that you can assign to the command.
cluster_label | The label of the cluster on which the command is executed.
name | A name that you can provide to the command. This is a template-supported field.
notify | Set this to receive an email when the command completes. You are notified whether the command succeeds or fails.
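For illustration, here is a minimal sketch of a task that uses these keyword arguments. It assumes the airflow.contrib import path for the operator; the DAG name, cluster label, and tag values are hypothetical.

```python
from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.qubole_operator import QuboleOperator

# Hypothetical DAG reused by the sketches on this page.
dag = DAG('qubole_demo', start_date=datetime(2020, 1, 1), schedule_interval=None)

# Runs a Hive query on the cluster labeled 'default', tags the
# command, names it, and sends an email when it completes.
show_tables = QuboleOperator(
    task_id='show_tables',
    command_type='hivecmd',
    query='show tables',
    cluster_label='default',
    tags=['airflow', 'demo'],      # placeholder tag values
    name='show_tables_command',
    notify=True,
    qubole_conn_id='qubole_default',
    dag=dag)
```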
Understanding Command-type-specific Parameters
Here are the different command-type-specific parameters:
Note: You can also use .txt files for template-driven use cases.
hivecmd Parameters
Parameter | Description
---|---
query | An inline query statement. This is a template-supported field. Either a query or a script_location is required.
script_location | For AWS, the S3 location that contains the query statement. This is a template-supported field. Either a query or a script_location is required.
sample_size | The sample size, in bytes, on which to run the query.
macros | Macro values that are used in the query. This is a template-supported field.
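The following sketch pairs an inline hivecmd query with macros; the table and macro names are placeholders, and the JSON-list macro format follows the convention used in Qubole Operator Examples. It reuses the dag from the earlier sketch.

```python
hive_task = QuboleOperator(
    task_id='daily_hive_report',
    command_type='hivecmd',
    # $date$ is expanded from the macros value at run time.
    query="select count(*) from events where dt = '$date$'",
    macros='[{"date": "{{ ds }}"}]',  # {{ ds }} is templated by Airflow
    cluster_label='default',
    dag=dag)
```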
prestocmd Parameters
Parameter | Description
---|---
query | An inline query statement. This is a template-supported field. Either a query or a script_location is required.
script_location | For AWS, the S3 location that contains the query statement. This is a template-supported field. Either a query or a script_location is required.
macros | Macro values that are used in the query. This is a template-supported field.
hadoopcmd Parameters
Parameter | Description
---|---
sub_command | Must be jar, s3distcp, or streaming, followed by one or more arguments.
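A sketch of a hadoopcmd task follows; the JAR location and its arguments are placeholders.

```python
hadoop_task = QuboleOperator(
    task_id='hadoop_jar_job',
    command_type='hadoopcmd',
    # sub_command: 'jar' followed by the JAR location and its arguments.
    sub_command='jar s3://mybucket/jars/wordcount.jar s3://mybucket/input s3://mybucket/output',
    cluster_label='default',
    dag=dag)
```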
shellcmd Parameters
Parameter | Description
---|---
script | An inline command with arguments. This is a template-supported field. Either a script or a script_location is required.
script_location | For AWS, the S3 location that contains the script. This is a template-supported field. Either a script or a script_location is required.
files | A list of files in an AWS S3 bucket, in the file1,file2 format. These files are copied into the working directory where the Qubole command is executed.
archives | A list of archives in an AWS S3 bucket, in the archive1,archive2 format. These archives are unarchived into the working directory where the Qubole command is executed.
parameters | Any additional arguments which must be passed to the script (only when script_location is supplied).
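A sketch of a shellcmd task that runs a script stored in S3 and passes it arguments; all S3 paths and argument values are placeholders.

```python
shell_task = QuboleOperator(
    task_id='run_shell_script',
    command_type='shellcmd',
    script_location='s3://mybucket/scripts/process.sh',
    parameters='arg1 arg2',                 # passed to the script
    files='s3://mybucket/data/lookup.csv',  # copied into the working directory
    cluster_label='default',
    dag=dag)
```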
pigcmd Parameters
Parameter | Description
---|---
script | An inline command with arguments. This is a template-supported field. Either a script or a script_location is required.
script_location | For AWS, the S3 location that contains the Pig script. This is a template-supported field. Either a script or a script_location is required.
parameters | Any additional arguments which must be passed to the script (only when script_location is supplied).
sparkcmd Parameters
Parameter | Description
---|---
program | The complete Spark program in Scala, SQL, Command, R, or Python. This is a template-supported field. A Spark notebook can be run using the QuboleOperator. For more information, see Qubole Operator Examples.
cmdline | The spark-submit command line; specify all required information on this command line itself. This is a template-supported field.
sql | An inline SQL query statement. This is a template-supported field.
script_location | The local file path that contains the query statement. This is a template-supported field. One of program, cmdline, sql, or script_location must be specified.
language | The language of the program: Scala, SQL, Command, R, or Python.
app_id | The ID of a Spark Job Server app.
note_id | The ID of a notebook.
arguments | The spark-submit command line arguments.
user_program_arguments | The arguments that the user program accepts.
macros | Macro values that are used in the query. This is a template-supported field.
Example
The following example shows how to use the cmdline Spark parameter with the Qubole Operator API:
operator = QuboleOperator(
task_id='hello_world',
command_type="sparkcmd",
cmdline="/usr/lib/spark/bin/spark-submit --max-executors 10 --num-executors 15 --driver-memory 2g --executor-memory 3g --executor-cores 5 s3://mybucket/somelocation/hello_world.py 'myuserargument'",
dag=dag)
dbtapquerycmd Parameters
Parameter | Description
---|---
db_tap_id | The data store ID of the target database in Qubole. This is a template-supported field. Its value is a string, not an integer.
query | An inline query statement. This is a template-supported field.
macros | Macro values that are used in the query. This is a template-supported field.
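A sketch of a dbtapquerycmd task; the data store ID and query are placeholders.

```python
db_query_task = QuboleOperator(
    task_id='query_data_store',
    command_type='dbtapquerycmd',
    db_tap_id='2064',      # passed as a string, not an integer
    query='show tables',
    dag=dag)
```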
dbexportcmd Mode 1/Simple Mode Parameters
Parameter | Description
---|---
mode | The value must be 1 for the simple mode, which pushes data from QDS to a relational database.
hive_table | The name of the Hive table. This is a template-supported field.
partition_spec | The partition specification for the Hive table. This is a template-supported field.
dbtap_id | The data store ID of the target database in Qubole. This is a template-supported field. Its value is a string, not an integer.
db_table | The name of the database table. This is a template-supported field.
db_update_mode | The update mode: allowinsert or updateonly.
db_update_keys | The columns used to determine the uniqueness of rows; only valid with db_update_mode: updateonly.
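A sketch of a simple-mode export; the Hive table, database table, and data store ID are placeholders.

```python
export_task = QuboleOperator(
    task_id='export_to_db',
    command_type='dbexportcmd',
    mode=1,
    hive_table='default.daily_events',
    dbtap_id='2064',
    db_table='daily_events',
    db_update_mode='allowinsert',
    dag=dag)
```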
dbexportcmd Mode 2/Advanced Mode Parameters
Parameter | Description
---|---
mode | The value must be 2 for the advanced mode.
dbtap_id | The data store ID of the target database in Qubole. This is a template-supported field. Its value is a string, not an integer.
db_table | The name of the database table. This is a template-supported field.
db_update_mode | The update mode: allowinsert or updateonly.
db_update_keys | The columns used to determine the uniqueness of rows; only valid with db_update_mode: updateonly.
export_dir | The HDFS/Cloud location from which data is exported. This is a template-supported field.
fields_terminated_by | The hex value of the character used as the column separator in the dataset.
dbimportcmd Mode 1/Simple Mode Parameters
Parameter | Description
---|---
mode | The value must be 1 for the simple mode.
hive_table | The name of the Hive table. This is a template-supported field.
dbtap_id | The data store ID of the target database in Qubole. This is a template-supported field. Its value is a string, not an integer.
db_table | The name of the database table. This is a template-supported field.
where_clause | The WHERE clause, if any.
parallelism | The number of parallel database connections used for extracting the data.
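A sketch of a simple-mode import; the table names, data store ID, and WHERE clause are placeholders.

```python
import_task = QuboleOperator(
    task_id='import_from_db',
    command_type='dbimportcmd',
    mode=1,
    hive_table='default.users_copy',
    dbtap_id='2064',
    db_table='users',
    where_clause="signup_date > '2020-01-01'",
    parallelism=2,
    dag=dag)
```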
dbimportcmd Mode 2/Advanced Mode Parameters
Parameter | Description
---|---
mode | The value must be 2 for the advanced mode.
hive_table | The name of the Hive table. This is a template-supported field.
dbtap_id | The data store ID of the target database in Qubole. This is a template-supported field. Its value is a string, not an integer.
db_table | The name of the database table. This is a template-supported field.
parallelism | The number of parallel database connections used for extracting the data.
extract_query | The SQL query to extract data from the database. $CONDITIONS must be part of the WHERE clause.
boundary_query | The query used to get the range of row IDs to be extracted. This is a template-supported field.
split_column | The column used as the row ID to split data into ranges. This is a template-supported field.
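A sketch of an advanced-mode import; the table and column names are placeholders, and the $CONDITIONS behavior described in the comment is an assumption based on the Sqoop-style semantics of extract_query.

```python
adv_import_task = QuboleOperator(
    task_id='advanced_import',
    command_type='dbimportcmd',
    mode=2,
    hive_table='default.orders_copy',
    dbtap_id='2064',
    db_table='orders',
    parallelism=4,
    # $CONDITIONS is assumed to be replaced with the row-ID ranges
    # derived from boundary_query and split_column.
    extract_query='select * from orders where $CONDITIONS',
    boundary_query='select min(id), max(id) from orders',
    split_column='id',
    dag=dag)
```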
get_results
This method returns the standard output of the command represented by the Qubole Operator.
Parameter | Description
---|---
delim | The delimiter (for example, a comma) used to separate the command results.
fp | Use this to write command results directly into a file. If you do not specify fp, the results are written to a default file location.
inline | Decides whether or not to display the command results inline as a CRLF-separated string.
fetch | Decides whether or not to download large results directly from the Cloud; it is set to True by default.
ti | The TaskInstance object.
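The sketch below retrieves results from the hive_task defined in an earlier sketch, using a downstream PythonOperator; the task names are hypothetical.

```python
from airflow.operators.python_operator import PythonOperator

def fetch_results(**kwargs):
    ti = kwargs['ti']
    # Retrieves the results of the command run by hive_task.
    results = hive_task.get_results(ti, inline=True, delim=',')
    print(results)

fetch_task = PythonOperator(
    task_id='fetch_results',
    python_callable=fetch_results,
    provide_context=True,
    dag=dag)
```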
get_log
This method returns the standard logs (in raw text format) of the command represented by the Qubole Operator.
Parameter | Description
---|---
ti | The TaskInstance object.
get_jobs
This method returns the jobs of the command represented by the Qubole Operator. It calls the Jobs API and retrieves the details of the Hadoop jobs spawned on the cluster by the command (command_id). This information is only available for commands that have completed.

Parameter | Description
---|---
ti | The TaskInstance object.
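Following the same pattern as get_results, a hypothetical downstream task can pull both the logs and the job details of a completed command:

```python
def inspect_command(**kwargs):
    ti = kwargs['ti']
    # Raw text logs of the command run by hive_task.
    log = hive_task.get_log(ti)
    # Details of the Hadoop jobs spawned by the command;
    # available only after the command completes.
    jobs = hive_task.get_jobs(ti)
    print(log, jobs)

inspect_task = PythonOperator(
    task_id='inspect_command',
    python_callable=inspect_command,
    provide_context=True,
    dag=dag)
```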