OutOfMemory errors are sometimes caused by too many files being involved in split computation. To resolve this problem, increase the Application Master (AM) memory by setting the following parameters:
set tez.am.resource.memory.mb=<Size in MB>;
set tez.am.launch.cmd-opts=-Xmx<Size in MB>m;

The default value for tez.am.resource.memory.mb is 1536 MB. Note the unit suffix on -Xmx: without it, the JVM interprets the value as bytes.
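As a rough illustration of how the two parameters relate, the sketch below builds both statements from a single container size. The function name `tez_am_settings` and the 0.8 heap fraction are assumptions made for this example (a common rule of thumb that leaves headroom for non-heap JVM memory), not values taken from this guide.

```python
def tez_am_settings(container_mb, heap_fraction=0.8):
    """Build Hive `set` statements for the Tez AM container and its JVM heap.

    heap_fraction is an assumed rule of thumb, not a documented default:
    the heap is kept below the container size to leave room for
    non-heap JVM memory.
    """
    if container_mb <= 0:
        raise ValueError("container_mb must be positive")
    if not 0 < heap_fraction <= 1:
        raise ValueError("heap_fraction must be in (0, 1]")
    heap_mb = int(container_mb * heap_fraction)
    return [
        f"set tez.am.resource.memory.mb={container_mb};",
        f"set tez.am.launch.cmd-opts=-Xmx{heap_mb}m;",
    ]


# Example: a hypothetical 4096 MB AM container.
for statement in tez_am_settings(4096):
    print(statement)
```

The printed statements can be pasted into a Hive session; adjust the container size to match what the cluster can actually allocate.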
Block & Split Tuning
The HDFS block size governs how data is stored in the cluster, and the split size drives how that data is read for processing by MapReduce. Make sure that the block size and the Mapper minimum and maximum split sizes do not cause an unnecessarily large number of files to be created.
dfs.blocksize: Sets the HDFS block size for storage - defaults to 128 MB
mapred.min.split.size: Sets the minimum split size - defaults to dfs.blocksize
mapred.max.split.size: Sets the maximum split size - defaults to dfs.blocksize
Configuring the split size boundaries for MapReduce may have cascading effects on the number of mappers created and the number of files each Mapper will access.
Blocks Required = Dataset Size / dfs.blocksize
Maximum Mappers Required = Dataset Size / mapred.min.split.size
Minimum Mappers Required = Dataset Size / mapred.max.split.size
Maximum Mappers per Block = Maximum Mappers Required / Blocks Required
Maximum Blocks per Mapper = Blocks Required / Minimum Mappers Required
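The formulas above can be worked through for a concrete case. The sketch below is a minimal example; the function name `split_estimates`, the 10 GB dataset, and the 64 MB / 256 MB split bounds are hypothetical values chosen for illustration. Rounding the block and mapper counts up to whole numbers is also an assumption, since the formulas are stated as plain division.

```python
MB = 1024 * 1024


def ceil_div(a, b):
    """Integer division rounded up (assumed here; a partial block still needs a block)."""
    return (a + b - 1) // b


def split_estimates(dataset_bytes, block_bytes, min_split_bytes, max_split_bytes):
    """Apply the block and split formulas to one dataset."""
    blocks = ceil_div(dataset_bytes, block_bytes)
    max_mappers = ceil_div(dataset_bytes, min_split_bytes)
    min_mappers = ceil_div(dataset_bytes, max_split_bytes)
    return {
        "blocks_required": blocks,
        "maximum_mappers_required": max_mappers,
        "minimum_mappers_required": min_mappers,
        "maximum_mappers_per_block": max_mappers / blocks,
        "maximum_blocks_per_mapper": blocks / min_mappers,
    }


# Hypothetical inputs: 10 GB dataset, 128 MB blocks, splits between 64 MB and 256 MB.
estimates = split_estimates(
    dataset_bytes=10 * 1024 * MB,
    block_bytes=128 * MB,
    min_split_bytes=64 * MB,
    max_split_bytes=256 * MB,
)
for name, value in estimates.items():
    print(f"{name}: {value}")
```

With these inputs the dataset occupies 80 blocks and is processed by between 40 and 160 Mappers, so a block may be shared by up to 2 Mappers and a single Mapper may read up to 2 blocks.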
The number of tasks configured for worker nodes determines the parallelism of the cluster for processing Mappers and Reducers. As the slots get used by MapReduce jobs, there may be job delays due to constrained resources if the number of slots was not appropriately configured. Set maximums rather than constants, so that Hive is bounded but not tied to a fixed number of tasks.
mapred.tasktracker.map.tasks.maximum: Maximum number of map tasks
mapred.tasktracker.reduce.tasks.maximum: Maximum number of reduce tasks
If analysis of the tasks reveals that memory utilization is low, consider modifying the memory allocation for the Hadoop cluster. Reducing the memory allocated to the tasks frees up space on the cluster and allows for an increase in the number of Mappers or Reducers.
mapred.map.child.java.opts: Java heap memory setting for the map tasks
mapred.reduce.child.java.opts: Java heap memory setting for the reduce tasks
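The trade-off between per-task heap and task count can be sketched with simple arithmetic. This is an approximation under stated assumptions: the function name `tasks_that_fit` and the 16384 MB memory pool are hypothetical, and the calculation ignores non-heap JVM overhead and memory reserved for other services on the node.

```python
def tasks_that_fit(task_memory_pool_mb, task_heap_mb):
    """Estimate how many task JVMs fit in the memory set aside for tasks on one node.

    Simplification: only the Java heap (-Xmx) is counted, not JVM overhead.
    """
    if task_heap_mb <= 0:
        raise ValueError("task_heap_mb must be positive")
    return task_memory_pool_mb // task_heap_mb


# Hypothetical node with 16384 MB available for map and reduce tasks.
pool_mb = 16384
for heap_mb in (2048, 1024):
    print(f"-Xmx{heap_mb}m -> {tasks_that_fit(pool_mb, heap_mb)} tasks per node")
```

In this example, halving the heap from 2048 MB to 1024 MB doubles the number of tasks that fit on the node, from 8 to 16. The reduction is only safe when task analysis shows the larger heap was not being used.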