[Learning Spark with Examples] Famous Word Count

In the last post we saw filtering; now it's time for the famous Word Count example.

The code can be found at WordCount.java

Let's walk through the code, omitting the parts we already discussed in the last post

// Now that we have non-empty lines, let's split them into words
JavaRDD<String> words = nonEmptyLines.flatMap(new FlatMapFunction<String, String>() {
  @Override
  public Iterable<String> call(String s) throws Exception {
    return Arrays.asList(s.split(" "));
  }
});

// Convert words to Pairs, remember the TextPair class in Hadoop world
JavaPairRDD<String, Integer> wordPairs = words.mapToPair(new PairFunction<String, String, Integer>() {
  public Tuple2<String, Integer> call(String s) {
    return new Tuple2<String, Integer>(s, 1);
  }
});

JavaPairRDD<String, Integer> wordCount = wordPairs.reduceByKey(new Function2<Integer, Integer, Integer>() {
  @Override
  public Integer call(Integer a, Integer b) throws Exception {
    return a + b;
  }
});

// Just for debugging, NOT FOR PRODUCTION
wordCount.foreach(new VoidFunction<Tuple2<String, Integer>>() {
  @Override
  public void call(Tuple2<String, Integer> tuple) throws Exception {
    System.out.println(String.format("%s - %d", tuple._1(), tuple._2()));
  }
});

From the last post, we have already filtered out empty lines. To count words, we need to split each line into words and then sum the counts per word.

  • To split lines into individual words, we use the flatMap() API, whose function returns an Iterable of words
  • Once we have the words, we create a Tuple2 (analogous to the TextPair class in the Hadoop world) with the word as key and 1 as value
  • To get the word count, we call the reduceByKey() API, which sums all the values for a particular key
  • The last few lines just print the results to the console; they are for debugging in examples only, not for production use
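To see why this pipeline produces the right counts, the same split-and-sum logic can be sketched in plain Java 8 streams, with no Spark required. This is only an illustration of the logic (the class name, sample lines, and helper method are made up for this sketch); flatMap plays the same role as in the RDD version, and groupingBy with counting stands in for reduceByKey:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

public class WordCountStreams {

  // Same shape as the Spark pipeline: flatten lines into words,
  // then count occurrences per word
  static Map<String, Long> count(List<String> nonEmptyLines) {
    return nonEmptyLines.stream()
        .flatMap(line -> Arrays.stream(line.split(" ")))
        .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
  }

  public static void main(String[] args) {
    // Hypothetical sample input standing in for the nonEmptyLines RDD
    Map<String, Long> wordCount = count(Arrays.asList("to be or not to be", "be"));
    // Same debug-style output as the foreach() above
    wordCount.forEach((word, n) -> System.out.println(String.format("%s - %d", word, n)));
  }
}
```

Running this prints each word with its count, e.g. "be - 3" and "to - 2" for the sample lines; the Spark version distributes exactly this computation across partitions.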

Compilation and Launch

$ mvn clean package

Once the build is successful, run the program as follows (from the directory where pom.xml is present)

$ ~/cots/spark-1.2.0-bin-hadoop2.4/bin/spark-submit --class org.learningspark.simple.WordCount --master local[1] target/learningspark-1.0-SNAPSHOT.jar /Users/ashishpaliwal/open-spource/flume/trunk/CHANGELOG
