Disk Space Issues in Hadoop

This topic describes how to troubleshoot a few common Hadoop disk space issues.

Handling a Disk Space Issue When Creating a Directory

While running Hadoop jobs, you may hit this exception: cannot create directory: No space left on device.

This exception usually appears when the HDFS disk space is full. In Qubole, only temporary/intermediate data in HDFS is deleted; a cron job runs regularly to delete the temp files. This issue can appear in cases such as:

  • Long-running jobs that write a lot of intermediate data, which the cron job cannot delete while the jobs are still running.

  • Long-running clusters where, in rare cases, the data written by failed or killed tasks may not get deleted.
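When you suspect leftover intermediate data, a quick way to see where it accumulates is to rank HDFS directories by size. The helper below is a minimal sketch, not part of Qubole's tooling; the function name is hypothetical, and the `/tmp` path shown in the comment is only an example of where intermediate data might live.

```shell
# Minimal sketch: rank "<bytes> <path>" lines (as produced by
# `hadoop fs -du`) and keep only the n largest entries.
top_usage() {
  n="${1:-3}"              # how many entries to show (default 3)
  sort -n | tail -n "$n"   # numeric sort by size, keep the largest
}

# On a cluster node (the path is an example):
#   hadoop fs -du /tmp | top_usage 5
```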

Solution: Verify the actual cause by checking the HDFS disk usage through one of these methods:

  • On the Qubole UI, through the DFS Status from the running cluster’s UI page.

  • By logging into the cluster node and running this command:

    hadoop dfsadmin -report

    A sample response:

    Configured Capacity: 153668681728 (143.12 GB)
    Present Capacity: 153668681728 (143.12 GB)
    DFS Remaining: 153555091456 (143.01 GB)
    DFS Used: 113590272 (108.33 MB)
    DFS Used%: 0.07%
    Under replicated blocks: 33
    Blocks with corrupt replicas: 0
    Missing blocks: 0
    
    -------------------------------------------------
    Live datanodes (2):
    
    Name: x.x.x.x:50010 (ip-x-x-x-x.ec2.internal)
    Hostname: ip-x-x-x-x.ec2.internal
    Decommission Status : Normal
    Configured Capacity: 76834340864 (71.56 GB)
    DFS Used: 56795136 (54.16 MB)
    Non DFS Used: 0 (0 B)
    DFS Remaining: 76777545728 (71.50 GB)
    DFS Used%: 0.07%
    DFS Remaining%: 99.93%
    Configured Cache Capacity: 0 (0 B)
    Cache Used: 0 (0 B)
    Cache Remaining: 0 (0 B)
    Cache Used%: 100.00%
    Cache Remaining%: 0.00%
    Xceivers: 2
    Last contact: Tue Dec 26 11:21:19 UTC 2017
    
    
    Name: x.x.x.x:50010 (ip-x-x-x-x.ec2.internal)
    Hostname: ip-x-x-x-x.ec2.internal
    Decommission Status : Normal
    Configured Capacity: 76834340864 (71.56 GB)
    DFS Used: 56795136 (54.16 MB)
    Non DFS Used: 0 (0 B)
    DFS Remaining: 76777545728 (71.50 GB)
    DFS Used%: 0.07%
    DFS Remaining%: 99.93%
    Configured Cache Capacity: 0 (0 B)
    Cache Used: 0 (0 B)
    Cache Remaining: 0 (0 B)
    Cache Used%: 100.00%
    Cache Remaining%: 0.00%
    Xceivers: 2
    Last contact: Tue Dec 26 11:21:21 UTC 2017
    
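If you check this regularly (for example, from a monitoring script), you can extract the cluster-wide DFS Used% from the report with a small awk helper. This is a sketch under the assumption that you pipe `hadoop dfsadmin -report` output into it; the function name is hypothetical.

```shell
# Sketch: pull the cluster-wide "DFS Used%" value out of a
# `hadoop dfsadmin -report`. The summary line appears before the
# per-datanode sections, so we print the first match and exit.
dfs_used_pct() {
  awk -F': ' '/^DFS Used%/ { gsub(/%/, "", $2); print $2; exit }'
}

# On a cluster node you would pipe the live report in:
#   hadoop dfsadmin -report | dfs_used_pct
# Demonstrated here with the summary lines from the sample above:
dfs_used_pct <<'EOF'
Configured Capacity: 153668681728 (143.12 GB)
DFS Used: 113590272 (108.33 MB)
DFS Used%: 0.07%
EOF
```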

Handling a Device Disk Space Error

While running jobs, you may hit this exception: java.io.IOException: No space left on device.

Cause: This exception usually appears when there is no disk space left on the worker or coordinator nodes. You can confirm this by logging into the corresponding node and running df -h while the query is still running.
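As a sketch of what to look for in that output, the helper below flags any local filesystem at or above a usage threshold. The function name and the 90% default are assumptions; tune the threshold to your nodes.

```shell
# Sketch: print mount points whose usage meets or exceeds a threshold.
# Reads POSIX `df -P` output on stdin.
check_df() {
  threshold="${1:-90}"
  awk -v t="$threshold" 'NR > 1 {
    use = $5; gsub(/%/, "", use)        # strip the trailing % sign
    if (use + 0 >= t) print $6, use "%"
  }'
}

# On the worker or coordinator node while the query is running:
#   df -P | check_df 90
```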

Solution: You can avoid this error with one of these approaches:

  1. Enable EBS autoscaling. With autoscaling enabled, additional EBS volumes are attached based on the query’s requirements.

  2. Alternatively, use cluster instance types with more disk space.