Submit a Hadoop Jar Command

POST /api/v1.2/commands/

Use this API to submit a Hadoop Jar command. Ensure that the output directory does not already exist before you run a Hadoop job.

For developing applications, see Use Cascading with QDS.

Required Role

The following users can make this API call:

Users who belong to the system-user or system-admin group.
Users who belong to a group associated with a role that allows submitting a command. See Managing Groups and Managing Roles for more information.

Parameters

Note

Parameters marked in bold below are mandatory. Others are optional and have default values.

Parameter	Description
command_type	HadoopCommand
sub_command	jar
sub_command_args	s3_path_to_jar [main_class] [hadoop-generic-options] [arg1] [arg2] ...
label	Specify the cluster label on which this command is to be run.
retry	Denotes the number of retries for a job. Valid values of `retry` are 1, 2, and 3.
retry_delay	Denotes the time interval between the retries when a job fails. The unit of measurement is minutes.
name	Add a name to the command to make it easier to filter in the command history. The name cannot contain the special characters & (ampersand), < (less than), > (greater than), " (double quote), or ' (single quote), or HTML tags. It can contain a maximum of 255 characters.
pool	Use this parameter to specify the Fairscheduler pool name for the command to use.
tags	Add a tag to a command so that it is easily identifiable and searchable from the commands list in the Commands History. Add a tag as a filter value while searching commands. It can contain a maximum of 255 characters. A comma-separated list of tags can be associated with a single command. While adding a tag value, enclose it in square brackets. For example, `{"tags":["<tag-value>"]}`.
macros	Denotes the macros, which are valid assignment statements that pair variables with their expressions, in the form: `macros: [{"<variable>":<variable-expression>}, {..}]`. You can add more than one variable. For more information, see Macros.
timeout	The timeout for command execution, in seconds. The default value is 129600 seconds (36 hours). QDS checks the timeout for a command every 60 seconds, so a command with an 80-second timeout is killed at the next check, that is, after 120 seconds. Set this parameter to prevent a command from running for the full 36 hours.
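
The constraints on `retry`, `name`, and `timeout` in the table above can be checked on the client before a request is sent. The following is a minimal sketch, not part of the QDS API: the helper names `valid_retry`, `valid_name`, and `effective_kill_time` are hypothetical, and the rounding rule assumes that the 60-second checks are counted from the start of the command, which matches the 80-to-120-second case described for `timeout`.

```shell
#!/bin/sh
# Hypothetical client-side checks for the optional parameters above.
CHECK_INTERVAL=60   # QDS checks command timeouts every 60 seconds

# valid_retry N: succeed only for the documented values 1, 2, and 3.
valid_retry() {
    case $1 in
        1|2|3) return 0 ;;
        *)     return 1 ;;
    esac
}

# valid_name NAME: reject & < > " ' and names longer than 255 characters.
# Rejecting < and > also rules out HTML tags.
valid_name() {
    case $1 in
        *"&"*|*"<"*|*">"*|*'"'*|*"'"*) return 1 ;;
    esac
    [ "${#1}" -le 255 ]
}

# effective_kill_time SECONDS: round the timeout up to the next check
# (assumption: checks fall on multiples of CHECK_INTERVAL).
effective_kill_time() {
    secs=$1
    echo $(( (secs + CHECK_INTERVAL - 1) / CHECK_INTERVAL * CHECK_INTERVAL ))
}

effective_kill_time 80       # prints 120
effective_kill_time 129600   # prints 129600 (the 36-hour default)
```

The server remains the authority on what it accepts; these checks only catch obvious mistakes early.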

Examples

The following example runs a Hadoop Streaming job. The streaming jar is stored on S3, and the application runs a map-only job that applies the Unix utility wc to the input dataset.

Hadoop Streaming Job

export OUTPUT_LOC="<s3 output location>"

curl -i -X POST -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Content-Type: application/json" -H "Accept: application/json" \
-d '{"sub_command": "jar", "sub_command_args": "s3://paid-qubole/HadoopAPIExamples/jars/hadoop-0.20.1-dev-streaming.jar -mapper wc -numReduceTasks 0 -input s3://paid-qubole/HadoopAPITests/data/3.tsv -output '"$OUTPUT_LOC"'", "command_type": "HadoopCommand"}' \
"https://api.qubole.com/api/v1.2/commands"

Note

The above syntax uses https://api.qubole.com as the endpoint. Qubole provides other endpoints to access QDS that are described in Supported Qubole Endpoints on Different Cloud Providers.

Sample Response

{
  "id":4246,
  "meta_data":
   {
      "results_resource":"commands/4246/results",
      "logs_resource":"commands/4246/logs"
   },
   "command":{"sub_command_args":"s3n://paid-qubole/HadoopAPITests/jars/hadoop-0.20.1-dev-streaming.jar -mapper wc -numReduceTasks 0 -input s3://paid-qubole/datasets/data1_30days/20100101/EU/3.tsv -output s3n://paid-qubole/tmp/wcl_3","sub_command":"jar"},
   "progress":0,
   "status":"waiting",
   "command_type":"HadoopCommand",
   "qbol_session_id":1629,
   "created_at":"2012-10-16T11:29:36Z",
   "user_id":9
}
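
A script that submits a command usually needs the `id` and the resource paths from a response like the sample above. The following is a minimal sketch that extracts them with `grep` and `sed` only; `RESPONSE` is a trimmed copy of the sample response, and the helpers `json_number` and `json_string` are hypothetical names. It relies on `grep -o`, which GNU and BSD grep provide, and it only handles flat, simple values like the ones shown here.

```shell
#!/bin/sh
# Trimmed copy of the sample response, held in a variable for illustration.
RESPONSE='{"id":4246,"meta_data":{"results_resource":"commands/4246/results","logs_resource":"commands/4246/logs"},"progress":0,"status":"waiting","command_type":"HadoopCommand"}'

# json_number KEY: print the first numeric value stored under KEY.
json_number() {
    printf '%s' "$RESPONSE" | grep -o "\"$1\": *[0-9]*" | head -n 1 | sed 's/.*: *//'
}

# json_string KEY: print the first string value stored under KEY.
json_string() {
    printf '%s' "$RESPONSE" | grep -o "\"$1\": *\"[^\"]*\"" | head -n 1 \
        | sed 's/^[^:]*: *"//; s/"$//'
}

COMMAND_ID=$(json_number id)
STATUS=$(json_string status)
RESULTS_PATH=$(json_string results_resource)
LOGS_PATH=$(json_string logs_resource)

echo "id=$COMMAND_ID status=$STATUS"
echo "results=$RESULTS_PATH"
echo "logs=$LOGS_PATH"
```

For anything beyond a quick script, a real JSON parser such as `jq` is a more robust choice than pattern matching.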

Hadoop Jar Gutenberg Job

curl -i -X POST -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Content-Type: application/json" -H "Accept: application/json" \
-d '{"sub_command": "jar", "sub_command_args": "s3://paid-qubole/HadoopAPIExamples/jars/hadoop-0.20.1-dev-streaming.jar -files s3n://paid-qubole/HadoopAPIExamples/WordCountPython/mapper.py,s3n://paid-qubole/HadoopAPIExamples/WordCountPython/reducer.py -mapper mapper.py -reducer reducer.py -numReduceTasks 1 -input s3n://paid-qubole/default-datasets/gutenberg -output s3://paid-qubole/default-datasets/grun119_1",
"command_type": "HadoopCommand"}' "https://api.qubole.com/api/v1.2/commands"

Hadoop Streaming Job with a Cluster Label

curl -i -X POST -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Content-Type: application/json" -H "Accept: application/json" \
-d '{"sub_command": "streaming", "sub_command_args": "-files s3://paid-qubole/HadoopAPIExamples/WordCountPython/mapper.py,s3://paid-qubole/HadoopAPIExamples/WordCountPython/reducer.py -mapper mapper.py -reducer reducer.py -numReduceTasks 1 -input s3://paid-qubole/default-*/guten* -output s3://paid-qubole/default-datasets/output4",
"command_type": "HadoopCommand", "label":"HadoopCluster"}' "https://api.qubole.com/api/v1.2/commands"

Hadoop Streaming Job without a Cluster Label

Note: When a job is run without a cluster label, the default cluster runs the command.
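
The labeled and unlabeled streaming requests in this section differ only in whether the body carries a `label` field. The following is a minimal sketch of a helper that builds either body; `build_payload` is a hypothetical name, and the sketch does no JSON escaping, so it assumes that the arguments contain no double quotes or backslashes.

```shell
#!/bin/sh
# build_payload ARGS [LABEL]: print the JSON body for a streaming job.
# The "label" field is added only when LABEL is non-empty; without it,
# the default cluster runs the command.
build_payload() {
    args=$1
    label=${2:-}
    if [ -n "$label" ]; then
        printf '{"sub_command": "streaming", "sub_command_args": "%s", "command_type": "HadoopCommand", "label": "%s"}\n' "$args" "$label"
    else
        printf '{"sub_command": "streaming", "sub_command_args": "%s", "command_type": "HadoopCommand"}\n' "$args"
    fi
}

ARGS="-files s3://paid-qubole/HadoopAPIExamples/WordCountPython/mapper.py,s3://paid-qubole/HadoopAPIExamples/WordCountPython/reducer.py -mapper mapper.py -reducer reducer.py -numReduceTasks 1 -input s3://paid-qubole/default-*/guten* -output s3://paid-qubole/default-datasets/output4"

build_payload "$ARGS" "HadoopCluster"   # body with a cluster label
build_payload "$ARGS"                   # body without a cluster label
```

To submit the body, pass it to curl with `-d "$(build_payload "$ARGS")"` together with the same headers and endpoint as the other examples.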

curl -i -X POST -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Content-Type: application/json" -H "Accept: application/json" \
-d '{"sub_command": "streaming", "sub_command_args": "-files s3://paid-qubole/HadoopAPIExamples/WordCountPython/mapper.py,s3://paid-qubole/HadoopAPIExamples/WordCountPython/reducer.py -mapper mapper.py -reducer reducer.py -numReduceTasks 1 -input s3://paid-qubole/default-*/guten* -output s3://paid-qubole/default-datasets/output4",
"command_type": "HadoopCommand"}' "https://api.qubole.com/api/v1.2/commands"