28 April 2014 ~ 1 Comment

[Apache Pig] Extending CSVExcelLoader to append file name of the split



CSVExcelLoader doesn't have an option to append the name of the file behind the split it is processing. That is handy when the file name itself carries information you need in each record. Here is a quick way to add that support:

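import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigSplit;
import org.apache.pig.data.Tuple;
// Assumption: CSVLoader is the one shipped with piggybank.
import org.apache.pig.piggybank.storage.CSVLoader;

public class CSVExcelLoaderWithFileName extends CSVLoader {

  // Path of the file behind the split currently being read.
  private Path path;

  @Override
  public void prepareToRead(RecordReader reader, PigSplit split) throws IOException {
    super.prepareToRead(reader, split);
    // Assumes the wrapped split is a FileSplit.
    path = ((FileSplit) split.getWrappedSplit()).getPath();
  }

  @Override
  public Tuple getNext() throws IOException {
    Tuple superTuple = super.getNext();
    if (superTuple != null) {
      // Append the file name (last path component) as the final field.
      superTuple.append(path.getName());
    }
    return superTuple;
  }
}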

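The code is simple. When the load function is about to read a split, prepareToRead hands it that split, and we remember the path of the file behind it. getNext then appends the file name as the last field of every tuple returned by the parent loader.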
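To see the override-and-append pattern without a Pig or Hadoop classpath, here is a small dependency-free sketch. SimpleCsvLoader and SimpleCsvLoaderWithFileName are hypothetical stand-ins invented for illustration; they are not Pig classes, and the comma splitting is deliberately naive.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for a loader such as CSVLoader (not a Pig class).
class SimpleCsvLoader {
    private Iterator<String> lines;

    // Called once before a split is read; splitPath names the file behind it.
    public void prepareToRead(String splitPath, List<String> splitLines) {
        this.lines = splitLines.iterator();
    }

    // Returns the next record as a mutable list of fields, or null at end of input.
    public List<String> getNext() {
        if (lines == null || !lines.hasNext()) {
            return null;
        }
        return new ArrayList<>(Arrays.asList(lines.next().split(",")));
    }
}

// Same idea as the loader above: remember the path, append its name to each record.
class SimpleCsvLoaderWithFileName extends SimpleCsvLoader {
    private String fileName;

    @Override
    public void prepareToRead(String splitPath, List<String> splitLines) {
        super.prepareToRead(splitPath, splitLines);
        // Keep only the last path component, as Path.getName() does.
        fileName = splitPath.substring(splitPath.lastIndexOf('/') + 1);
    }

    @Override
    public List<String> getNext() {
        List<String> record = super.getNext();
        if (record != null) {
            record.add(fileName);
        }
        return record;
    }
}

class FileNameDemo {
    public static void main(String[] args) {
        SimpleCsvLoaderWithFileName loader = new SimpleCsvLoaderWithFileName();
        loader.prepareToRead("/data/in/sales.csv", Arrays.asList("1,apple", "2,pear"));
        List<String> record;
        while ((record = loader.getNext()) != null) {
            System.out.println(record); // [1, apple, sales.csv] then [2, pear, sales.csv]
        }
    }
}
```

The file name always lands in the last position, so downstream code can address it by position regardless of how many columns the CSV has.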

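Caution: Watch out for the pig.splitCombination property. When split combination is enabled, Pig may combine several small files into a single split, so the path captured in prepareToRead may not match the file each record actually came from. More info at http://pig.apache.org/docs/r0.12.1/perf.html#combine-files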
