April 11, 2011

Common Hadoop HDFS exceptions with large files

Big data in HDFS means plenty of disk problems. First of all, make sure there is at least ~20-30% free space on each node. There are two other problems I faced recently:
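A quick way to check how close a node is to that threshold (the mount point here is an assumption; point it at the partition holding your dfs.data.dir):

```shell
# Print the percentage of free space on a data partition.
# "/tmp" stands in for the DataNode's dfs.data.dir mount point (an assumption).
free_pct=$(df -P /tmp | awk 'NR==2 { print 100 - $5 }')
echo "free space: ${free_pct}%"
```

On the cluster side, `hadoop dfsadmin -report` prints per-DataNode DFS usage.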

all datanodes are bad
This error can be caused by too many open files; the per-process limit is 1024 by default on most Linux distributions. To increase it, use
ulimit -n newsize
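A minimal sketch of checking and raising the limit (the "hadoop" user in the limits.conf lines is an assumption; use whichever user runs the DataNode/TaskTracker):

```shell
# Show the current per-process open-file limit (often 1024 by default).
ulimit -n

# Raise the soft limit for the current shell; fails if it exceeds the hard limit.
ulimit -n 8192 2>/dev/null || echo "raise the hard limit first (see limits.conf)"

# To make the change permanent, add lines like these to /etc/security/limits.conf
# ("hadoop" is an assumed daemon user):
#   hadoop  soft  nofile  8192
#   hadoop  hard  nofile  8192
```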

error in shuffle in fetcher#k 
This is another problem; here is the full error log:
2011-04-11 05:59:45,744 WARN org.apache.hadoop.mapred.Child: 
Exception running child : org.apache.hadoop.mapreduce.task.reduce.Shuffle$ShuffleError: 
error in shuffle in fetcher#2
 at org.apache.hadoop.mapreduce.task.reduce.Shuffle.run(Shuffle.java:124)
 at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:362)
 at org.apache.hadoop.mapred.Child$4.run(Child.java:217)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:416)
 at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:742)
 at org.apache.hadoop.mapred.Child.main(Child.java:211)
Caused by: java.lang.OutOfMemoryError: Java heap space
 at org.apache.hadoop.io.BoundedByteArrayOutputStream.&lt;init&gt;(BoundedByteArrayOutputStream.java:58)
 at org.apache.hadoop.io.BoundedByteArrayOutputStream.&lt;init&gt;(BoundedByteArrayOutputStream.java:45)
 at org.apache.hadoop.mapreduce.task.reduce.MapOutput.&lt;init&gt;(MapOutput.java:104)
 at org.apache.hadoop.mapreduce.task.reduce.MergeManager.unconditionalReserve(MergeManager.java:267)
 at org.apache.hadoop.mapreduce.task.reduce.MergeManager.reserve(MergeManager.java:257)
 at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyMapOutput(Fetcher.java:305)
 at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:251)
 at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:149)

One way to work around this problem is to make sure there are not too many map tasks for small input files. If possible, you can cat input files manually to create bigger chunks, or push Hadoop to combine multiple tiny input files into a single mapper's input (e.g. with CombineFileInputFormat).
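For the manual route, a small sketch (directory and file names are made up for illustration):

```shell
# Simulate many tiny input files, then concatenate them into one bigger chunk
# so Hadoop launches fewer map tasks.
mkdir -p small-inputs merged
for i in 1 2 3 4 5; do echo "record $i" > "small-inputs/part-$i.txt"; done

# Merge everything into one chunk. In practice, start a new chunk as you
# approach the HDFS block size (e.g. 64 MB) instead of merging everything.
cat small-inputs/part-*.txt > merged/chunk-0.txt
wc -l merged/chunk-0.txt

# Then load the merged files instead of the tiny ones:
#   hadoop fs -put merged /user/hadoop/input
```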

Also, on the Hadoop discussion groups it is mentioned that the default value of the dfs.datanode.max.xcievers parameter, the upper bound on the number of files an HDFS DataNode can serve concurrently, is too low and causes this ShuffleError. In hdfs-site.xml, I set it to 2048 and that worked in my case.
<property>
  <name>dfs.datanode.max.xcievers</name>
  <value>2048</value>
</property>

Update: The default value of dfs.datanode.max.xcievers has since been raised in this JIRA.