Disk Space Issues in Hadoop

This topic describes how to troubleshoot a few common Hadoop disk space issues.

Handling a Disk Space Issue When Creating a Directory

While running Hadoop jobs, you can hit this exception: cannot create directory :No space left on device.

This exception usually appears when HDFS runs out of disk space. In Qubole, a cron job runs regularly to delete temporary/intermediate data from HDFS. Even so, this issue can occur in cases such as:

Long-running jobs that write a lot of intermediate data, which the cron job cannot delete while the jobs are still running.
Long-running clusters where, in rare cases, data written by failed or killed tasks is not deleted.

Solution: Verify the actual cause by checking the HDFS disk usage using one of these methods:

In the Qubole UI, through DFS Status on the running cluster’s UI page.

By logging in to a cluster node and running this command:

hadoop dfsadmin -report

A sample response is shown below.

Configured Capacity: 153668681728 (143.12 GB)
Present Capacity: 153668681728 (143.12 GB)
DFS Remaining: 153555091456 (143.01 GB)
DFS Used: 113590272 (108.33 MB)
DFS Used%: 0.07%
Under replicated blocks: 33
Blocks with corrupt replicas: 0
Missing blocks: 0

-------------------------------------------------
Live datanodes (2):

Name: x.x.x.x:50010 (ip-x-x-x-x.ec2.internal)
Hostname: ip-x-x-x-x.ec2.internal
Decommission Status : Normal
Configured Capacity: 76834340864 (71.56 GB)
DFS Used: 56795136 (54.16 MB)
Non DFS Used: 0 (0 B)
DFS Remaining: 76777545728 (71.50 GB)
DFS Used%: 0.07%
DFS Remaining%: 99.93%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 2
Last contact: Tue Dec 26 11:21:19 UTC 2017


Name: x.x.x.x:50010 (ip-x-x-x-x.ec2.internal)
Hostname: ip-x-x-x-x.ec2.internal
Decommission Status : Normal
Configured Capacity: 76834340864 (71.56 GB)
DFS Used: 56795136 (54.16 MB)
Non DFS Used: 0 (0 B)
DFS Remaining: 76777545728 (71.50 GB)
DFS Used%: 0.07%
DFS Remaining%: 99.93%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 2
Last contact: Tue Dec 26 11:21:21 UTC 2017
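
When a cluster has many datanodes, scanning the report by eye can be tedious. The following is a minimal Python sketch, not part of Qubole or Hadoop, that extracts the DFS Used% values from report text in the format shown above. The function names, the sample hostnames, the 97.50% figure, and the 90% threshold are all made up for illustration.

```python
import re

# Shortened report text for illustration. The hostnames and the 97.50%
# value are hypothetical; they are not taken from a real cluster.
SAMPLE = """\
Configured Capacity: 153668681728 (143.12 GB)
DFS Used: 113590272 (108.33 MB)
DFS Used%: 0.07%

Live datanodes (2):

Name: 10.0.0.1:50010 (node-a.example.internal)
Hostname: node-a.example.internal
DFS Used%: 0.07%

Name: 10.0.0.2:50010 (node-b.example.internal)
Hostname: node-b.example.internal
DFS Used%: 97.50%
"""


def parse_dfs_used_percent(report):
    """Map "cluster" and each datanode hostname to its DFS Used% value."""
    usage = {}
    current = "cluster"  # lines before the first Hostname belong to the summary
    for raw in report.splitlines():
        line = raw.strip()
        host = re.match(r"Hostname:\s*(\S+)", line)
        if host:
            current = host.group(1)
            continue
        used = re.match(r"DFS Used%:\s*([\d.]+)%", line)
        if used:
            usage[current] = float(used.group(1))
    return usage


def nodes_over(usage, threshold):
    """Return the sorted names whose DFS Used% is at or above the threshold."""
    return sorted(name for name, pct in usage.items() if pct >= threshold)


if __name__ == "__main__":
    usage = parse_dfs_used_percent(SAMPLE)
    print(usage)
    print(nodes_over(usage, 90.0))  # 90.0 is an assumed threshold
```

To run it against a live cluster, the report text could be captured with subprocess.run(["hadoop", "dfsadmin", "-report"], capture_output=True, text=True).stdout and passed to parse_dfs_used_percent in place of SAMPLE.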

Handling a Device Disk Space Error

While running jobs, you may hit this exception: java.io.IOException: No space left on device.

Cause: This exception usually appears when there is no disk space left on the worker or coordinator nodes. You can confirm this by logging in to the corresponding node and running df -h while the query is still running.
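
If the check needs to be repeated or scripted on a node, the same information that df -h reports can be read from Python. This is a minimal sketch using the standard library's shutil.disk_usage. The function names, the "/" path, and the 90% threshold are assumptions for illustration, so substitute the mount point that the job actually writes to.

```python
import shutil


def disk_usage_percent(path="/"):
    """Return the used percentage of the filesystem that holds `path`.

    Computed as used / (used + free), which is how df derives its Use%
    column, so space reserved for root is excluded from the total.
    """
    usage = shutil.disk_usage(path)
    available_total = usage.used + usage.free
    if available_total == 0:
        return 0.0  # pseudo-filesystems can report zero capacity
    return 100.0 * usage.used / available_total


def is_low_on_space(path="/", threshold=90.0):
    """True when the used percentage is at or above `threshold`."""
    return disk_usage_percent(path) >= threshold


if __name__ == "__main__":
    # "/" and 90.0 are assumed values, not Qubole defaults.
    pct = disk_usage_percent("/")
    print(f"/ is {pct:.1f}% used")
    if is_low_on_space("/", threshold=90.0):
        print("Warning: the node is running low on disk space")
```

Running it periodically while the query executes gives a rough picture of how quickly local disk fills up on that node.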

Solution: You can avoid this error in one of these ways:

Enable EBS autoscaling. Once it is enabled, additional EBS volumes can be attached based on the query’s requirements.
Use cluster instance types with more disk space.