Running a Pig Job

This Quick Start Guide is for users who want to run Pig jobs using Qubole Data Service (QDS). To run Pig jobs using the API, you must have a working QDS account. If you do not have one, sign up to create one.

Files Used in the Demo

The Pig script and its UDFs must be uploaded to Amazon S3. For this quick start, we have already uploaded sample Pig scripts and UDFs to our S3 bucket, s3://paid-qubole/PigAPIDemo. Here is the list of files:

  1. /data/excite-small.log - The dataset to be processed.

  2. /jars/tutorial.jar - A JAR containing the user-defined functions (UDFs) used by the script. Pig provides extensive support for UDFs as a way to specify custom processing.

  3. /scripts/script1-hadoop-s3-small.pig - The Pig script we will be running.

  4. /scripts/script1-hadoop-parametrized.pig - The same script as the one above, except that it is parametrized. It takes the following parameters: $input, $output, and $udf_jar.

The script is the Query Phrase Popularity script, which processes a search query log file from the Excite search engine and finds search phrases that occur with particularly high frequency during certain times of the day. These files are cloned from the Apache Pig Tutorial, which explains in detail what the script does.
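
For orientation, here is a condensed sketch of the kind of Pig Latin the script contains, adapted from the Apache Pig Tutorial; it is abridged and illustrative, and the actual file in the S3 bucket is authoritative:

REGISTER s3://paid-qubole/PigAPIDemo/jars/tutorial.jar;
raw = LOAD 's3://paid-qubole/PigAPIDemo/data/excite-small.log' USING PigStorage('\t') AS (user, time, query);
clean = FILTER raw BY org.apache.pig.tutorial.NonURLDetector(query);
-- ... lowercase the queries, extract the hour, build n-grams, count and score them per hour ...
STORE ordered_uniq_frequency INTO '<output location>' USING PigStorage();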

Steps

Step 1: Get the Access Token

Get the API access token as mentioned in Authentication.

Set the environment variables:

  1. export AUTH_TOKEN=<your account's auth token>

  2. export V=v1.2 # API version at the time of writing

Step 2: Submit a Pig Command

Let us submit the non-parametrized Pig script using curl.

Note

The syntax below uses https://api.qubole.com as the endpoint. Qubole provides other endpoints to access QDS that are described in Supported Qubole Endpoints on Different Cloud Providers.

curl -i -X POST -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Content-Type: application/json" -H "Accept: application/json" \
-d '{"script_location":"s3://paid-qubole/PigAPIDemo/scripts/script1-hadoop-s3-small.pig","command_type": "PigCommand"}' \
"https://api.qubole.com/api/${V}/commands"

The API returns a JSON representation of the command object. Note the id in the response; you will need it in the next steps.
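
An abridged, illustrative response might look like the following (the id value here is made up; the exact fields and values in your response will vary):

{
  "id": 12345,
  "command_type": "PigCommand",
  "status": "waiting"
}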

Similarly, to submit the parametrized script, run this command:

curl -i -X POST -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Content-Type: application/json" -H "Accept: application/json" \
-d '{"script_location":"s3://paid-qubole/PigAPIDemo/scripts/script1-hadoop-parametrized.pig","parameters":{"udf_jar":"s3://paid-qubole/PigAPIDemo/jars/tutorial.jar","input":"s3://paid-qubole/PigAPIDemo/data/excite-small.log","output": "<your s3 output location>" },"command_type": "PigCommand"}' \
"https://api.qubole.com/api/${V}/commands"

Step 3: Get the Command Status

From the JSON response, get the id. To check the status of the Pig command, replace ${id} in the request below with the actual value from the response (or export it, for example: export id=<command id>).

curl -i -X GET -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Content-Type: application/json" -H "Accept: application/json" "https://api.qubole.com/api/${V}/commands/${id}"
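
The status field in the response moves through states such as waiting and running before reaching a terminal state such as done or error. If you have jq installed, a simple polling loop (a convenience sketch, not part of the QDS API) could look like this:

while true; do
  status=$(curl -s -H "X-AUTH-TOKEN: $AUTH_TOKEN" \
    "https://api.qubole.com/api/${V}/commands/${id}" | jq -r .status)
  echo "status: ${status}"
  case "$status" in done|error|cancelled) break;; esac
  sleep 10
done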

Step 4: Check the Logs and Results

To fetch the logs for the command:

curl -i -X GET -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Content-Type: application/json" -H "Accept: application/json" "https://api.qubole.com/api/${V}/commands/${id}/logs"
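
Once the command has finished, you can fetch its output from the companion results endpoint (see the QDS API documentation for the response format):

curl -i -X GET -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Content-Type: application/json" -H "Accept: application/json" "https://api.qubole.com/api/${V}/commands/${id}/results"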

Congratulations! You have submitted your first Pig job.

You can also provide the Pig Latin statements in the request itself. For more details, refer to the API documentation: Submit a Pig Command.