Troubleshooting Errors and Exceptions in Hive Jobs
This topic provides information about the errors and exceptions that you might encounter when running Hive jobs or applications. You can resolve these errors and exceptions by following the respective workarounds.
Container memory requirement exceeds physical memory limits
Problem Description
A Hive job fails, and the error message below appears in the Qubole UI under the Logs tab of the Analyze page, or in the Mapper logs, Reducer logs, or ApplicationMaster logs:
Container [pid=18196,containerID=container_1526931816701_34273_02_000003] is running beyond physical memory limits.
Current usage: 2.2 GB of 2.2 GB physical memory used; 3.2 GB of 4.6 GB virtual memory used. Killing container.
Diagnosis
Three different kinds of failure can result in this error message:
- Mapper failure: This error can occur because the Mapper is requesting more memory than the configured memory. The parameter mapreduce.map.memory.mb represents the Mapper memory.
- Reducer failure: This error can occur because the Reducer is requesting more memory than the configured memory. The parameter mapreduce.reduce.memory.mb represents the Reducer memory.
- ApplicationMaster failure: This error can occur when the container hosting the ApplicationMaster is requesting more than the assigned memory. The parameter yarn.app.mapreduce.am.resource.mb represents the memory allocated to the ApplicationMaster.
Solution
Mapper failure: If a Mapper fails with the above error, modify the two parameters below to increase the memory for Mapper tasks.
- mapreduce.map.memory.mb: The upper memory limit that Hadoop allows to be allocated to a Mapper, in megabytes.
- mapreduce.map.java.opts: Sets the heap size for a Mapper.
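For example, you can raise both values at the query level using the set command. The values below are illustrative; keep the heap size (-Xmx) comfortably below the container size, for example around 80% of it.
set mapreduce.map.memory.mb=4096;
set mapreduce.map.java.opts=-Xmx3276m;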
Reducer failure: If a Reducer fails with the above error, modify the two parameters below to increase the memory for Reducer tasks.
- mapreduce.reduce.memory.mb: The upper memory limit that Hadoop allows to be allocated to a Reducer, in megabytes.
- mapreduce.reduce.java.opts: Sets the heap size for a Reducer.
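Similarly, for Reducer tasks (again, the values are illustrative):
set mapreduce.reduce.memory.mb=8192;
set mapreduce.reduce.java.opts=-Xmx6553m;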
ApplicationMaster failure: If the ApplicationMaster fails with the above error, modify the two parameters below to increase its memory.
- yarn.app.mapreduce.am.resource.mb: The amount of memory the ApplicationMaster needs, in megabytes.
- yarn.app.mapreduce.am.command-opts: Sets the heap size for the ApplicationMaster.
Make sure that the heap size set through yarn.app.mapreduce.am.command-opts is less than yarn.app.mapreduce.am.resource.mb. Qubole recommends setting the heap size to around 80% of yarn.app.mapreduce.am.resource.mb.
Example: Use the set command to update the configuration properties at the query level:
set yarn.app.mapreduce.am.resource.mb=3500;
set yarn.app.mapreduce.am.command-opts=-Xmx2000m;
To update the configuration at the cluster level, add or update the parameters under Override Hadoop Configuration Variables in the Advanced Configuration tab of the Cluster Settings, and then restart the cluster.
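For example, to apply the same values as in the query-level example above, you might add the following entries under Override Hadoop Configuration Variables:
yarn.app.mapreduce.am.resource.mb=3500
yarn.app.mapreduce.am.command-opts=-Xmx2000m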
See also: MapReduce Configuration in Hadoop 2
GC overhead limit exceeded, causing out of memory error
Problem Description
A Hive job fails with an out-of-memory error “GC overhead limit exceeded,” as shown below.
java.io.IOException: org.apache.hadoop.ipc.RemoteException(java.lang.OutOfMemoryError): GC overhead limit exceeded
at org.apache.hadoop.mapred.ClientServiceDelegate.invoke(ClientServiceDelegate.java:337)
at org.apache.hadoop.mapred.ClientServiceDelegate.getJobStatus(ClientServiceDelegate.java:422)
at org.apache.hadoop.mapred.YARNRunner.getJobStatus(YARNRunner.java:579)
at org.apache.hadoop.mapreduce.Job$1.run(Job.java:348)
at org.apache.hadoop.mapreduce.Job$1.run(Job.java:345)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
Diagnosis
This out-of-memory error comes from the getJobStatus method call and is likely an issue with the JobHistory server running out of memory. You can confirm this by checking the JobHistory server log on the coordinator node in media/ephemeral0/logs/mapred; the log will show an out-of-memory exception stack trace like the one above.
The out-of-memory error for the JobHistory server usually happens in the following cases:
- The cluster coordinator node is too small and the JobHistory server is configured with a small heap size, for example 1 GB.
- The jobs are very large, with thousands of Mapper tasks running.
Solution
Qubole recommends that you use a larger cluster coordinator node, with at least 60 GB RAM and a heap size of 4 GB for the JobHistory server process.
Depending on the nature of the job, even 4 GB for the JobHistory server heap size might not be sufficient. In this case, set the JobHistory server memory to a higher value, such as 8 GB, using the following bootstrap commands:
sudo echo 'export HADOOP_JOB_HISTORYSERVER_HEAPSIZE="8192"' >> /etc/hadoop/mapred-env.sh
sudo -u mapred /usr/lib/hadoop2/sbin/mr-jobhistory-daemon.sh stop historyserver
sudo -u mapred /usr/lib/hadoop2/sbin/mr-jobhistory-daemon.sh start historyserver
Mapper or reducer job fails because no valid local directory is found
Problem Description
A Mapper or Reducer job fails with the following error:
Caused by: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid directory for <file_path>
at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext$DirSelector.getPathForWrite(LocalDirAllocator.java:541)
at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:627)
at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:173)
at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:154)
at org.apache.tez.runtime.library.common.task.local.output.TezTaskOutputFiles.getInputFileForWrite(TezTaskOutputFiles.java:250)
at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.MapOutput.createDiskMapOutput(MapOutput.java:100)
at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.MergeManager.reserve(MergeManager.java:404)
at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.copyMapOutput(FetcherOrderedGrouped.java:476)
at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.copyFromHost(FetcherOrderedGrouped.java:278)
at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.fetchNext(FetcherOrderedGrouped.java:178)
at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.callInternal(FetcherOrderedGrouped.java:191)
at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.callInternal(FetcherOrderedGrouped.java:54)
Diagnosis
This error can appear on the Analyze page of the QDS UI or in the Hadoop Mapper or Reducer logs.
MapReduce stores intermediate data in local directories specified by the parameter mapreduce.cluster.local.dir in the mapred-site.xml file. During job processing, MapReduce checks these directories to see if there is enough space to create the intermediate files. If no directory has enough space, the MapReduce job fails with the error shown above.
Solution
Make sure that there is enough space in the local directories, based on the requirements of the data to be processed.
You can compress the intermediate output files to minimize space consumption.
Parameters to be set for compression:
set mapreduce.map.output.compress=true;
set mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec; -- Snappy will be used for compression
Out of Memory error when using ORC file format
Problem Description
An out-of-memory error occurs while generating split information when the ORC file format is used.
Diagnosis
The following logs appear on the Analyze page of the QDS UI under the Logs tab:
Caused by: java.util.concurrent.ExecutionException: java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.util.concurrent.FutureTask.report(FutureTask.java:122)
at java.util.concurrent.FutureTask.get(FutureTask.java:192)
at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1098)
... 15 more
Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
at com.google.protobuf.ByteString.copyFrom(ByteString.java:192)
at com.google.protobuf.CodedInputStream.readBytes(CodedInputStream.java:324)
at org.apache.hadoop.hive.ql.io.orc.OrcProto$StringStatistics.<init>(OrcProto.java:1331)
at org.apache.hadoop.hive.ql.io.orc.OrcProto$StringStatistics.<init>(OrcProto.java:1281)
at org.apache.hadoop.hive.ql.io.orc.OrcProto$StringStatistics$1.parsePartialFrom(OrcProto.java:1374)
Solution
The out-of-memory error can occur because the default ORC split strategy (HYBRID) requires more memory. Qubole recommends using the BI split strategy instead by setting the parameter below:
hive.exec.orc.split.strategy=BI
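You can apply this at the query level with the set command, for example:
set hive.exec.orc.split.strategy=BI;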
Hive job fails when “lock wait timeout” is exceeded
Problem Description
A Hive job fails with the following error message:
Lock wait timeout exceeded; try restarting transaction.
The timeout occurs during partitioned insert operations.
Diagnosis
The following content appears in the hive.log file:
ERROR metastore.RetryingHMSHandler (RetryingHMSHandler.java:invoke(173)) - Retrying HMSHandler after 2000 ms (attempt 9 of 10) with error:
javax.jdo.JDODataStoreException: Insert of object "org.apache.hadoop.hive.metastore.model.MPartition@74adce4e" using statement
"INSERT INTO `PARTITIONS` (`PART_ID`,`TBL_ID`,`LAST_ACCESS_TIME`,`CREATE_TIME`,`PART_NAME`,`SD_ID`) VALUES (?,?,?,?,?,?)" failed :
Lock wait timeout exceeded; try restarting transaction
This MySQL transaction timeout can happen during heavy traffic on the Hive Metastore when the RDS server is too busy.
Solution
Try setting a higher value for innodb_lock_wait_timeout on the MySQL side. innodb_lock_wait_timeout defines the length of time, in seconds, that an InnoDB transaction waits for a row lock before giving up. The default value is 50 seconds.
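For example, on a self-managed MySQL server an administrator can raise the limit at runtime with the statement below. The value 120 is illustrative and applies to new sessions; on a managed service such as Amazon RDS, the variable is typically changed through the instance's parameter group instead.
SET GLOBAL innodb_lock_wait_timeout = 120;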
S3 “Access denied” error while creating a Hive table
Problem Description
An S3 “Access denied” error appears when creating a Hive table.
Diagnosis
When server-side encryption is used for S3 buckets, the “Access denied” error message below appears when creating a Hive table:
ERROR: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. MetaException(message:com.qubole.com.amazonaws.services.s3.model.AmazonS3Exception: Access Denied (Service: Amazon S3; Status Code: 403; Error Code: AccessDenied
Solution
Set the required server-side encryption algorithm using fs.s3a.server-side-encryption-algorithm for s3a, or fs.s3n.sse for s3n. For more information, see Enabling SSE-KMS. If the S3 “Access denied” error still appears, check the S3 bucket policy and ensure that the required permissions are defined there. For more information, see What are some examples of policies I should use to delegate access to Qubole for my Cloud accounts?.
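For example, to use SSE-KMS with the s3a scheme, you could set the algorithm at the query level or add it to the cluster's Hadoop overrides; the value below is illustrative (use AES256 for SSE-S3):
set fs.s3a.server-side-encryption-algorithm=SSE-KMS;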