Crunching Data with Apache Crunch – Part 6 – Word Co-occurrence

In this post, we solve the word co-occurrence problem with Crunch, using the Pairs approach described in Data-Intensive Text Processing with MapReduce. Please refer to the other posts in the series for the basics.

We shall use the default neighbor window of 2, i.e. each word is paired with up to two words on either side of it.

PCollection<String> textFile = pipeline.readTextFile(args[0]);

PTable<TextPair, Long> wordCoOccurrence = textFile.parallelDo(new DoFn<String, Pair<TextPair, Long>>() {
  // Reused for every emit; transient fields are not serialized with the DoFn,
  // so the instance is created in initialize()
  transient TextPair textPair;

  public void initialize() {
    textPair = new TextPair();
  }

  public void process(String input, Emitter<Pair<TextPair, Long>> emitter) {
    String[] words = input.split("\\s+");

    for (int i = 0; i < words.length; i++) {
      String word = words[i];
      // skip empty tokens (e.g. produced by leading whitespace)
      if (Strings.isNullOrEmpty(word)) {
        continue;
      }

      // let's look for neighbours now; start and end are inclusive indices
      int start = (i - DEFAULT_NEIGHBOUR_WINDOW < 0) ? 0 : i - DEFAULT_NEIGHBOUR_WINDOW;
      int end = (i + DEFAULT_NEIGHBOUR_WINDOW >= words.length) ? words.length - 1 : i + DEFAULT_NEIGHBOUR_WINDOW;
      for (int j = start; j <= end; j++) {
        if (i == j) continue;
        textPair.set(new Text(word), new Text(words[j]));
        emitter.emit(Pair.of(textPair, 1L));
      }
    }
  }
}, textFile.getTypeFamily().tableOf(Writables.writables(TextPair.class), Writables.longs()));

CombineFn<TextPair, Long> longSumCombiner = CombineFn.SUM_LONGS();
PTable<TextPair, Long> wordCoOccurrenceCount = wordCoOccurrence.groupByKey().combineValues(longSumCombiner);

pipeline.writeTextFile(wordCoOccurrenceCount, args[1]);

Code Flow

  • We split the line into words
  • We iterate over the words and emit each word paired with each of its neighbors individually
  • We aggregate the results and write to file
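
To make the flow above concrete without a cluster, here is a minimal plain-Java sketch of the same three steps. It does not use Crunch at all: the class name CoOccurrenceSketch, the WINDOW constant, and the tab-joined string key (standing in for TextPair) are assumptions made only for this illustration, and the in-memory map stands in for the groupByKey/combine step.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class CoOccurrenceSketch {
    // Hypothetical stand-in for DEFAULT_NEIGHBOUR_WINDOW from the post
    static final int WINDOW = 2;

    // Steps 1 and 2: split the line and pair every word with its neighbours
    static List<String[]> pairs(String line) {
        String[] words = line.trim().split("\\s+");
        List<String[]> out = new ArrayList<>();
        for (int i = 0; i < words.length; i++) {
            if (words[i].isEmpty()) {
                continue; // an empty line yields a single empty token
            }
            int start = Math.max(0, i - WINDOW);              // inclusive
            int end = Math.min(words.length - 1, i + WINDOW); // inclusive
            for (int j = start; j <= end; j++) {
                if (i != j) {
                    out.add(new String[] {words[i], words[j]});
                }
            }
        }
        return out;
    }

    // Step 3: sum a count of 1 per emitted pair, keyed by "word<TAB>neighbour"
    static Map<String, Long> count(List<String> lines) {
        Map<String, Long> counts = new TreeMap<>();
        for (String line : lines) {
            for (String[] p : pairs(line)) {
                counts.merge(p[0] + "\t" + p[1], 1L, Long::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        count(Arrays.asList("a b c a"))
            .forEach((pair, n) -> System.out.println(pair + "\t" + n));
    }
}
```

Note that the pairs are ordered, so (a, b) and (b, a) are counted separately, just as in the Crunch version.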

The TextPair class is from

