[Learning Spark with Examples] Left Outer Join

In the last post, we saw the inner join example. Time to tweak it into an Apache Spark left outer join example. Starting from the data set used for the inner join, we may need a dataset with all the ads served, along with the matching impression if one was received. A left outer join is the way to do that.

The complete code can be found at LeftOuterJoin.java

The data preparation is the same as in the inner join example, so we shall skip it here. Let’s look at the code for the join:

// Left outer join: keep every ad, with its impression if one was received
JavaPairRDD<String, Tuple2<String, Optional<String>>> joinedData = adsRDD.leftOuterJoin(impressionsRDD);

There is a minor change from the inner join here: the second element of the Tuple2 returned by the left outer join is an Optional, to cater to the scenario where impression data is not present for every key.
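To make the shape of that return type concrete, here is a minimal plain-Java sketch of left outer join semantics. It does not use Spark: it assumes unique keys held in a Map (a pair RDD may hold duplicate keys) and uses java.util.Optional as a stand-in for the Optional type Spark returns. The class name LeftOuterJoinSketch, the method leftOuterJoin, and the sample ad and impression values are made up for illustration.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

// Illustration only: plain Java collections, not Spark's leftOuterJoin.
class LeftOuterJoinSketch {

    // For every key on the left side, emit (leftValue, Optional<rightValue>).
    // Keys with no match on the right side get Optional.empty().
    static Map<String, SimpleEntry<String, Optional<String>>> leftOuterJoin(
            Map<String, String> left, Map<String, String> right) {
        Map<String, SimpleEntry<String, Optional<String>>> joined = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : left.entrySet()) {
            Optional<String> match = Optional.ofNullable(right.get(e.getKey()));
            joined.put(e.getKey(), new SimpleEntry<>(e.getValue(), match));
        }
        return joined;
    }

    public static void main(String[] args) {
        // Hypothetical sample data: two ads, only one of which got an impression.
        Map<String, String> ads = new LinkedHashMap<>();
        ads.put("ad1", "Shoes");
        ads.put("ad2", "Phones");
        Map<String, String> impressions = new HashMap<>();
        impressions.put("ad1", "imp-100");

        leftOuterJoin(ads, impressions).forEach((adId, pair) ->
                System.out.println(adId + " -> " + pair.getKey()
                        + ", impression present: " + pair.getValue().isPresent()));
    }
}
```

Both ads survive the join; the ad without an impression simply carries an empty Optional instead of being dropped, which is the difference from the inner join.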

Here is what the output looks like:


The output carries the Optional class along with it. We can apply a transformation to the output and modify the data to suit our needs.
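One such transformation is to replace the Optional with a plain value before writing the result out. The following is a rough plain-Java sketch of that idea, not code from LeftOuterJoin.java: the class UnwrapOptionalSketch, the method describe, and the NO_IMPRESSION placeholder are hypothetical. It uses java.util.Optional and its orElse method; with Spark 1.2 the Optional in the join result should be Guava's com.google.common.base.Optional, where the equivalent call is or(defaultValue).

```java
import java.util.Optional;

// Illustration only: flattens one joined record into a single line of text.
class UnwrapOptionalSketch {

    // Assumed placeholder for ads that never received an impression.
    static final String NO_IMPRESSION = "NO_IMPRESSION";

    // Build "adId,adName,impression", substituting the placeholder
    // when the impression side of the join is absent.
    static String describe(String adId, String adName, Optional<String> impression) {
        return adId + "," + adName + "," + impression.orElse(NO_IMPRESSION);
    }

    public static void main(String[] args) {
        // Hypothetical records, shaped like the left outer join result.
        System.out.println(describe("ad1", "Shoes", Optional.of("imp-100")));
        System.out.println(describe("ad2", "Phones", Optional.empty()));
    }
}
```

In the Spark job, the same logic would sit inside a map over the joined RDD, so the Optional never reaches the final output.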

Compile and Run

$ mvn clean package
$ ~/cots/spark-1.2.0-bin-hadoop2.4/bin/spark-submit --class org.learningspark.simple.LeftOuterJoin --master local[1] target/learningspark-1.0-SNAPSHOT.jar ./src/main/resources/ads.csv ./src/main/resources/impression.csv
