[Learning Spark with Examples] Line Count

In the first post we looked at how to load/save an RDD. In this post we build upon that example and count the number of lines present in the RDD.

The code can be found at LineCount.java

For complete project refer https://github.com/paliwalashish/learning-spark

Let's look at the code:

public static void main(String[] args) {
    SparkConf sparkConf = new SparkConf().setAppName("Line Count");
    JavaSparkContext sparkContext = new JavaSparkContext(sparkConf);

    // Read the source file
    JavaRDD<String> input = sparkContext.textFile(args[0]);

    // count() returns the number of entries (lines) in the RDD
    long count = input.count();
    System.out.println(String.format("Total lines in %s is %d", args[0], count));
}
  • First, we create an instance of SparkConf and set the application name
  • We create a JavaSparkContext, passing in the SparkConf
  • We read the file into a JavaRDD. For simplicity, we have picked a plain text file
  • The RDD implementation has a count() API that returns the number of records present in the RDD. This becomes our line count
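For small files, the Spark result can be sanity-checked with plain Java before submitting the job. This is a minimal sketch (the class name LocalLineCount is made up for illustration, not part of the project):

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

public class LocalLineCount {
    public static void main(String[] args) throws Exception {
        Path path = Paths.get(args[0]);
        // Files.lines streams the file lazily; close it via try-with-resources
        long count;
        try (Stream<String> lines = Files.lines(path)) {
            count = lines.count();
        }
        System.out.println(String.format("Total lines in %s is %d", path, count));
    }
}
```

On the same input file, this count should match what the Spark job prints.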



Let's compile the program

$mvn clean package

Once the build is successful, run the program as follows (run from the directory where pom.xml is present)

$~/cots/spark-1.2.0-bin-hadoop2.4/bin/spark-submit --class org.learningspark.simple.LineCount --master local[1] target/learningspark-1.0-SNAPSHOT.jar /Users/ashishpaliwal/open-spource/flume/trunk/CHANGELOG 

The program shall print the number of lines in the file.

Next, we shall extend the same example and use filtering to remove empty lines from the RDD.
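As a rough preview of that idea using plain Java streams rather than Spark (the sample data below is made up for illustration), filtering empty lines before counting looks like this:

```java
import java.util.Arrays;
import java.util.List;

public class FilterPreview {
    public static void main(String[] args) {
        // Hypothetical sample data standing in for the lines of an RDD
        List<String> lines = Arrays.asList("first line", "", "second line", "");
        // filter() keeps only lines matching the predicate, analogous to JavaRDD.filter()
        long nonEmpty = lines.stream().filter(line -> !line.isEmpty()).count();
        System.out.println(String.format("Non-empty lines: %d", nonEmpty)); // prints 2
    }
}
```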
