Pig in Qubole

The Qubole Pig distribution is derived from Apache Pig versions 0.11, 0.15, and 0.17 (beta). It currently runs on AWS only.

Pig is a platform for analyzing large data sets. It provides a high-level language for expressing data analysis programs, together with the infrastructure for evaluating them. Pig’s infrastructure layer consists of a compiler that produces sequences of MapReduce programs, for which large-scale parallel implementations already exist. Pig’s language layer consists of a textual language called Pig Latin, which has the following key properties:

  • Ease of programming - Complex tasks made up of multiple interrelated data transformations are explicitly encoded as data flow sequences, which makes them easy to write, understand, and maintain.
  • Optimization opportunities - The way in which tasks are encoded lets the system optimize their execution automatically, so you can focus on semantics rather than efficiency.
  • Extensibility - You can create your own functions to do special-purpose processing.
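The data flow style described in the first property can be illustrated with a small sketch. The Python below is only an analogy, not Pig itself: each function corresponds to one Pig Latin statement (shown in the comments), and each intermediate result has a name, as a relation would in a Pig Latin script. The sample records, field layout, and function names are hypothetical and chosen for illustration.

```python
from collections import defaultdict

# Hypothetical sample records: (user, url, response time in ms).
# Pig Latin analogue:
#   visits = LOAD 'visits.tsv' AS (user:chararray, url:chararray, ms:int);
VISITS = [
    ("alice", "/home", 120),
    ("bob", "/home", 80),
    ("alice", "/cart", 300),
    ("carol", "/home", 150),
    ("bob", "/cart", 210),
]


def filter_slow(rows, threshold_ms):
    # Pig Latin analogue: slow = FILTER visits BY ms > 100;
    return [row for row in rows if row[2] > threshold_ms]


def group_by_user(rows):
    # Pig Latin analogue: by_user = GROUP slow BY user;
    groups = defaultdict(list)
    for row in rows:
        groups[row[0]].append(row)
    return dict(groups)


def count_per_group(groups):
    # Pig Latin analogue: counts = FOREACH by_user GENERATE group, COUNT(slow);
    return {user: len(rows) for user, rows in groups.items()}


def slow_visit_counts(rows, threshold_ms=100):
    # The whole job is a sequence of named steps, each consuming the
    # output of the previous one.
    slow = filter_slow(rows, threshold_ms)
    by_user = group_by_user(slow)
    return count_per_group(by_user)


if __name__ == "__main__":
    print(slow_visit_counts(VISITS))
```

In Pig, the equivalent statements are compiled into MapReduce jobs and the system decides how to execute them. In this sketch the steps simply run in order in a single process.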

Qubole supports Pig versions 0.11 (Pig11), 0.15 (Pig15), and 0.17 (beta) (Pig17) on Hadoop 2 clusters. Pig 0.11 is the default version on Hadoop 2 clusters. Pig 0.15 is supported in shell commands on Hadoop 2 clusters. When you select Pig 0.17 (beta), you can also choose MapReduce or Tez as the execution engine.

Running a Pig Job is a quick-start guide to running a Pig job. Submit a Pig Command describes the REST API.
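As a rough sketch of what a REST submission might involve, the Python below assembles a request body and headers without sending anything. The endpoint URL, the X-AUTH-TOKEN header, and the command_type and latin_statements fields are assumptions not stated on this page, and the helper names are hypothetical. Check Submit a Pig Command for the authoritative parameters and for the endpoint of your Qubole environment.

```python
import json

# Assumed endpoint; the host name differs between Qubole environments.
COMMANDS_URL = "https://api.qubole.com/api/v1.2/commands"


def build_pig_command(latin_statements):
    # Assumed field names; verify them against Submit a Pig Command.
    return {
        "command_type": "PigCommand",
        "latin_statements": latin_statements,
    }


def build_headers(auth_token):
    # Assumed authentication header carrying the account API token.
    return {
        "X-AUTH-TOKEN": auth_token,
        "Content-Type": "application/json",
        "Accept": "application/json",
    }


if __name__ == "__main__":
    # Hypothetical script: load a file and print its contents.
    script = "A = LOAD 's3://example-bucket/input.txt'; DUMP A;"
    body = json.dumps(build_pig_command(script))
    print("POST", COMMANDS_URL)
    print(body)
```

The sketch stops at building the request so that it runs without credentials. Sending it would be an HTTP POST of the JSON body with these headers.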

Qubole supports HCatalog and Pig integration. However, only Pig11 and later versions support HCatalog integration. See Pig HCatalog Integration for more information.