MapReduce Configuration in Hadoop 2

Qubole’s Hadoop 2 offering is based on Apache Hadoop 2.6.0. Qubole has some optimizations in S3 access and has enhanced it with its autoscaling code. Qubole jars have been uploaded in a maven repository and can be accessed seamlessly for developing mapreduce/yarn applications as highlighted by this POM file.

One of the significant shifts from Hadoop 1 to Hadoop 2 is the JobTracker’s functionality split into Resource Manager and Application Master. The other significant shift is the replacement of Map and Reduce slots by containers.

In Hadoop 1, a task tracker (slave node) has a maximum limit on the number of map and reduce slots. This means that if a node is running the limit on the map slots, more map slots cannot be scheduled even when more resources are available.

In Hadoop 2, slots have been replaced by containers, which is an abstracted part of the slave resources. A container can be of any size within the limit of the Node Manager (slave node). The map and reduce tasks are Java Virtual Machines (JVMs) launched within these containers.

This change means that specifying the container sizes become important. For example, a memory-heavy map task, would require a larger container than a lighter map task. Moreover, the container sizes are different for different instance types (for example, an instance with larger memory has larger container size). While Qubole specifies good default parameters for the container sizes per instances, there are certain cases when you would like to change the defaults.

The default Hadoop 2 settings for a cluster is shown in the Edit Cluster page of a Hadoop 2 Cluster. (Navigate to Clusters and click the edit button that is against a Hadoop 2 (Hive) Cluster. See Managing Clusters for more information.) A sample is as shown in the following figure.

../../../../_images/Hadoop2Cluster_Settings.jpg

In MapReduce, changing a task’s memory requirement requires changing the following parameters:

  • The size of the container in which the map/reduce task is launched.
  • Specifying the maximum memory (-Xmx) to the JVM of the map/reduce task.

The two parameters (mentioned above) are changed for MapReduce Tasks/Application Master as shown below:

  • Map Tasks:
mapreduce.map.memory.mb=2240   # Container size
mapreduce.map.java.opts=-Xmx2016m  # JVM arguments for a Map task
  • Reduce Tasks:
mapreduce.reduce.memory.mb=2240  # Container size
mapreduce.reduce.java.opts=-Xmx2016m  # JVM arguments for a Reduce task
  • MapReduce Application Master:
yarn.app.mapreduce.am.resource.mb=2240  # Container size
yarn.app.mapreduce.am.command-opts=-Xmx2016m  # JVM arguments for an Application Master

Locating Logs

  • The YARN logs (Application Master and container logs) are stored at: <scheme><defloc>/logs/hadoop/<cluster_id>/<cluster_inst_id>/app-logs.
  • The daemon logs for each host are stored at: <scheme><defloc>/logs/hadoop/<cluster_id>/<cluster_inst_id>/<host>.
  • The MapReduce Job History files are stored at: <scheme><defloc>/logs/hadoop/<cluster_id>/<cluster_inst_id>/mr-history.

Where:

  • scheme is the Cloud-specific URI scheme: s3:// for AWS; wasb:// or adl:// for Azure; oraclebmc:// for Oracle OCI.
  • defloc is the default storage location for the QDS account.
  • cluster_id is the cluster ID as shown on the Clusters page of the QDS UI.
  • cluster_inst_id is the cluster instance ID. It is the latest folder under <scheme><defloc>/logs/hadoop/<cluster_id>/ for a running cluster or the last-terminated cluster.

To extract a container’s log files, create a YARN command line similar to the following:

 yarn logs \
-applicationId application_<application ID> \
-logsDir <scheme><qubolelogs-location>/logs/hadoop/<clusterid>/<cluster-instanceID>/app-logs \
-appOwner <application owner>

Configuring RubiX on Hadoop 2 Clusters (AWS)

RubiX is supported on Hadoop 2 clusters. Ensure that the entire Hadoop job runs on the master, which you can configure by setting hive.on.master to true for Hive jobs.

Note

This parameter applies to only an Hadoop 2 cluster and not a Spark cluster.

Set the following properties to make Hive jobs on MapReduce and Tez use RubiX for caching data.

set fs.s3n.impl=com.qubole.rubix.hadoop2.CachingHadoop2FileSystem;
set fs.s3.impl=com.qubole.rubix.hadoop2.CachingHadoop2FileSystem;