Crunching Data with Apache Crunch – Part 2

In Part 1, we saw the word count example. Let's build on top of it.
A very common extension of the Word Count example is finding the top 100 words. With plain MapReduce, you would typically use a secondary sort to get this. Let's try to achieve the same functionality using Crunch.

Requirement
Find the 100 most frequently occurring words in an input.

NOTE: We shall build this on the WordCount code from Part 1.

PTable<String, Long> top100 = counts.top(100);

// Instruct the pipeline to write the resulting counts to a text file.
pipeline.writeTextFile(top100, args[1]);

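We just need to add the first line of the snippet above to the WordCount code, and write top100 instead of counts to the output. That line takes the counts table and returns the 100 words with the highest counts. To find the least frequently occurring words, a similar bottom() API is available.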

Run the example again and see the output.
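
To make concrete what top(100) and bottom(100) ask for, here is a small plain-Java sketch that selects the n highest or lowest counts from an in-memory map using a bounded heap. This is only an illustration of the idea and not how Crunch implements it. The class name TopWords and its methods are hypothetical and are not part of the Crunch API.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

// Hypothetical helper, not part of Crunch: a local, in-memory analogue of top()/bottom().
public class TopWords {

    // Returns the n entries with the highest counts, highest first.
    public static List<Map.Entry<String, Long>> top(Map<String, Long> counts, int n) {
        return select(counts, n, Map.Entry.<String, Long>comparingByValue());
    }

    // Returns the n entries with the lowest counts, lowest first.
    public static List<Map.Entry<String, Long>> bottom(Map<String, Long> counts, int n) {
        return select(counts, n, Map.Entry.<String, Long>comparingByValue().reversed());
    }

    // Keeps the n strongest entries under the given order in a heap of size at most n.
    private static List<Map.Entry<String, Long>> select(
            Map<String, Long> counts, int n, Comparator<Map.Entry<String, Long>> order) {
        if (n <= 0) {
            return new ArrayList<>();
        }
        PriorityQueue<Map.Entry<String, Long>> heap = new PriorityQueue<>(order);
        for (Map.Entry<String, Long> entry : counts.entrySet()) {
            heap.offer(entry);
            if (heap.size() > n) {
                heap.poll(); // evict the weakest candidate seen so far
            }
        }
        List<Map.Entry<String, Long>> result = new ArrayList<>(heap);
        result.sort(order.reversed()); // strongest entry first
        return result;
    }
}
```

Calling TopWords.top(counts, 100) on a word-to-count map gives the same kind of answer locally that counts.top(100) gives on the pipeline. The order among words with equal counts is not defined in this sketch.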
