Crunching Data with Apache Crunch – Part 6 – Word Co-occurrence

In this post we solve the word co-occurrence problem with Apache Crunch, using the Pairs approach described in Data-Intensive Text Processing with MapReduce. Please refer to the other posts in this series for the basics.

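We shall use a default neighbor window of 2 (DEFAULT_NEIGHBOUR_WINDOW in the code), meaning each word is paired with the words up to two positions on either side of it.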
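
To make the window concrete, here is a small plain-Java sketch (no Crunch or Hadoop involved) that lists the neighbors of a word at a given position. The class name NeighbourWindow, the helper neighbours and the constant WINDOW are made up for this illustration, and the sketch assumes the window covers up to two words on each side, with both bounds inclusive.

```java
import java.util.ArrayList;
import java.util.List;

class NeighbourWindow {
    // Assumed value, mirroring the window of 2 used in this post.
    static final int WINDOW = 2;

    // Returns the words within WINDOW positions of index i, excluding the word at i itself.
    static List<String> neighbours(String[] words, int i) {
        int start = Math.max(0, i - WINDOW);              // inclusive lower bound
        int end = Math.min(words.length - 1, i + WINDOW); // inclusive upper bound
        List<String> result = new ArrayList<>();
        for (int j = start; j <= end; j++) {
            if (j != i) {
                result.add(words[j]);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String[] words = "the quick brown fox jumps".split("\\s+");
        System.out.println(neighbours(words, 0)); // [quick, brown]
        System.out.println(neighbours(words, 2)); // [the, quick, fox, jumps]
    }
}
```

Words near the start or end of a line simply get fewer neighbors, because the bounds are clamped to the array.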

PCollection<String> textFile = pipeline.readTextFile(args[0]);

PTable<TextPair, Long> wordCoOccurrence = textFile.parallelDo(new DoFn<String, Pair<TextPair, Long>>() {
  transient TextPair textPair;

  @Override
  public void initialize() {
    super.initialize();
    textPair = new TextPair();
  }

  @Override
  public void process(String input, Emitter<Pair<TextPair, Long>> emitter) {
    String[] words = input.split("\\s+");

    for (int i = 0; i < words.length; i++) {
      String word = words[i];
      if (Strings.isNullOrEmpty(word)) {
        continue;
      }

      // Look for neighbours within the window; start and end are both inclusive
      int start = Math.max(0, i - DEFAULT_NEIGHBOUR_WINDOW);
      int end = Math.min(words.length - 1, i + DEFAULT_NEIGHBOUR_WINDOW);
      for (int j = start; j <= end; j++) {
        if (i == j) continue;
        textPair.set(new Text(word), new Text(words[j]));
        emitter.emit(Pair.of(textPair, 1L));
      }
    }
  }
}, textFile.getTypeFamily().tableOf(Writables.writables(TextPair.class), Writables.longs()));

CombineFn<TextPair, Long> longSumCombiner = CombineFn.SUM_LONGS();
PTable<TextPair, Long> wordCoOccurrenceCount = wordCoOccurrence.groupByKey().combineValues(longSumCombiner);

pipeline.writeTextFile(wordCoOccurrenceCount, args[1]);

Code Flow

  • We split each line into words on whitespace
  • For each word, we emit a (word, neighbor) pair with a count of 1 for every neighbor inside the window
  • We sum the counts for each pair and write the result to a text file
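
The same flow can be tried locally without a cluster. The sketch below is a plain-Java approximation, not the Crunch pipeline itself: it keeps the counts in an in-memory map instead of a PTable and uses a "word,neighbour" string as the key in place of TextPair. The class name PairsCoOccurrence, the method count and the constant WINDOW are hypothetical names chosen for this example.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class PairsCoOccurrence {
    // Assumed value, mirroring the window of 2 used in this post.
    static final int WINDOW = 2;

    // Counts (word, neighbour) pairs over all lines, keyed as "word,neighbour".
    static Map<String, Long> count(Iterable<String> lines) {
        Map<String, Long> counts = new TreeMap<>();
        for (String line : lines) {
            String[] words = line.trim().split("\\s+");
            for (int i = 0; i < words.length; i++) {
                if (words[i].isEmpty()) {
                    continue; // an empty line yields a single empty token
                }
                int start = Math.max(0, i - WINDOW);
                int end = Math.min(words.length - 1, i + WINDOW); // inclusive
                for (int j = start; j <= end; j++) {
                    if (j == i) {
                        continue;
                    }
                    // The "emit 1, then sum" steps collapsed into a single map update
                    counts.merge(words[i] + "," + words[j], 1L, Long::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Long> counts = count(List.of("a b c", "a b"));
        counts.forEach((pair, n) -> System.out.println(pair + "\t" + n));
    }
}
```

For the two input lines "a b c" and "a b", the pair a,b is counted twice (once per line) and a,c once. Note that pairs are ordered, so a,b and b,a are separate keys, as in the Pairs approach.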

The TextPair class is from https://github.com/tomwhite/hadoop-book/tree/2e

2 thoughts on “Crunching Data with Apache Crunch – Part 6 – Word Co-occurrence”

  1. Hi Ashish!

    First of all, your posts are awesome! And I am being able to learn more from Crunch thanks to you (:
    Just one thing, could you please provide the TextPair class url??
    Thanks!
