Crunching Data with Apache Crunch – Part 3

In Part 2 of the series, we saw how to find the top 100 words. Let's now explore filtering the data. From the word list, we may want to remove certain words. For this post, we shall remove "the" from the list of words. This could be done while splitting the text as well, but we will keep it separate so that filtering has its own stage.

We are again going to build the code on top of the WordCount example.

Here is what the filter code looks like:

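// Filtering the data: emit every word except "the" (case-insensitive)
PCollection<String> filteredWords = words.parallelDo(new DoFn<String, String>() {
    @Override
    public void process(String word, Emitter<String> emitter) {
        // Only words that pass the check are emitted; anything else is dropped
        if (!"the".equalsIgnoreCase(word)) {
            emitter.emit(word);
        }
    }
}, Writables.strings());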

It's a function that compares each received word against "the" (ignoring case) and drops the word if it matches. The same approach extends to matching the word against a list of blocked words. The function operates on the output of the split function (a PCollection<String>). For all subsequent calculations, we use the filteredWords reference instead of words.
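
As a rough illustration of the blocked-words idea, here is a minimal sketch in plain Java, kept free of Crunch so it can be run on its own. The class name BlockedWords, the helper methods, and the contents of the block list are all assumptions made for this example and are not part of the WordCount code. The isBlocked check is the piece that would replace the "the".equalsIgnoreCase(word) test inside process().

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class BlockedWords {
    // Hypothetical block list for illustration; entries are stored in lower case
    private static final Set<String> BLOCKED =
            new HashSet<>(Arrays.asList("the", "a", "an"));

    // Returns true if the word is on the block list, ignoring case
    static boolean isBlocked(String word) {
        return BLOCKED.contains(word.toLowerCase(Locale.ROOT));
    }

    // Keeps only the words that are not blocked, preserving their order
    static List<String> filter(List<String> words) {
        List<String> kept = new ArrayList<>();
        for (String word : words) {
            if (!isBlocked(word)) {
                kept.add(word);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("The", "quick", "brown", "fox", "an", "apple");
        System.out.println(filter(words)); // prints [quick, brown, fox, apple]
    }
}
```

Using a set keeps the lookup cost per word roughly constant, which matters when the check runs once for every word in the input.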
