22 March 2012

Finding the Input File for a Hadoop Map Task



I had a lot of pure Map-only jobs whose main function was to clean the incoming log stream and emit a refined output log with consistent fields. Due to code bugs or variation in the input, map tasks would often get killed or fail to produce the desired output. In the quest to narrow down those offenders, I found this simple line of code that always helped.

But before that, a bit about my input format and environment:

  • Input files were gzipped (meaning no input splits would happen)
  • Hadoop jobs with only a Map phase, no reducers (a minimal driver sketch follows this list)
  • Using the Cloudera Hadoop distribution

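For context, a map-only job just means setting the number of reduce tasks to zero, so each mapper's output is written straight to HDFS. A minimal driver along these lines would look roughly as follows; class names such as LogCleanerDriver and LogCleanerMapper are placeholders, not my actual code:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class LogCleanerDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "log-cleaner");      // Job.getInstance(conf, ...) on newer releases
        job.setJarByClass(LogCleanerDriver.class);
        job.setMapperClass(LogCleanerMapper.class);  // hypothetical cleaning mapper
        job.setNumReduceTasks(0);                    // map-only: mapper output is written straight to HDFS
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);
        // The default TextInputFormat decompresses .gz files transparently,
        // but a gzipped file is never split, so each file goes to exactly one map task.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
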
Since in my case the whole input file would be processed by a single map task, it was easy to find the offending file that caused the job to fail. Here is the snippet:

// Inside the Mapper subclass (new org.apache.hadoop.mapreduce API):
// log which input file this map attempt is reading as soon as the task starts.
@Override
protected void setup(Context context) throws IOException, InterruptedException {
        super.setup(context);
        // The cast assumes a FileInputFormat-based job, so the split is a FileSplit
        System.out.println(context.getTaskAttemptID() + " - " + ((FileSplit) context.getInputSplit()).getPath());
}

This snippet prints the file being processed by the map attempt to its stdout log. One can then navigate to the job's log folder and find the input file for the failing map task.
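
For completeness, here is roughly how that setup method sits inside a full mapper class; the class name, key/value types and the pass-through map body are placeholders for the actual cleaning logic. As an optional extra, context.setStatus() pushes the same path into the task status, so it can also be spotted in the job tracker web UI without opening the stdout log.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class LogCleanerMapper extends Mapper<LongWritable, Text, NullWritable, Text> {

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        super.setup(context);
        String path = ((FileSplit) context.getInputSplit()).getPath().toString();
        System.out.println(context.getTaskAttemptID() + " - " + path);  // goes to the stdout log
        context.setStatus("input: " + path);                            // visible in the web UI
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // The real cleaning logic would go here; pass the line through unchanged for the sketch
        context.write(NullWritable.get(), value);
    }
}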

My strategy was to pick this file, copy it out of HDFS, and run it in isolation outside Hadoop to find the root cause of the problem.
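
A small standalone harness is enough for that. The sketch below reads the gzipped file line by line and reports the line that blows up; cleanRecord() is a stand-in for whatever cleaning logic the mapper applies.

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.util.zip.GZIPInputStream;

public class IsolatedRun {
    public static void main(String[] args) throws Exception {
        // args[0] is the offending .gz file, copied out of HDFS first (e.g. with 'hadoop fs -get')
        BufferedReader reader = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(new FileInputStream(args[0])), "UTF-8"));
        String line;
        long lineNo = 0;
        while ((line = reader.readLine()) != null) {
            lineNo++;
            try {
                cleanRecord(line);   // stand-in for the mapper's cleaning logic
            } catch (Exception e) {
                System.err.println("Failed at line " + lineNo + ": " + line);
                e.printStackTrace();
            }
        }
        reader.close();
    }

    // Placeholder for the real cleaning routine shared with the mapper
    private static void cleanRecord(String line) {
        // ... cleaning / field normalisation would go here ...
    }
}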
