In the last post, we saw the inner join example. Time to tweak it into an Apache Spark left outer join example. Starting from the data set used for the inner join, we may need a dataset with all the ads served, along with the impression for each ad, if one was received. A left outer join is the way to do that.
The complete code can be found at LeftOuterJoin.java
The data preparation is the same as in the inner join example, so we shall skip it here. Let’s look at the code for the left outer join:
// Left outer join: keep every ad, together with its impression if one was received
JavaPairRDD<String, Tuple2<String, Optional<String>>> joinedData = adsRDD.leftOuterJoin(impressionsRDD);
joinedData.saveAsTextFile("./output-outerjoin");
There is one minor change from the inner join here: in the return type of the left outer join, the second parameter of the Tuple2 class is an Optional, to cater for keys (ads) that have no matching data on the impressions side.
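To make the semantics concrete without needing a Spark cluster, here is a minimal plain-Java sketch of what a left outer join does. It is not Spark's implementation: the class name LeftOuterJoinSketch, the sample keys, and the use of java.util.Optional and Map entries in place of Spark's Optional and Tuple2 are all assumptions made for illustration. It also assumes each key appears at most once on each side, whereas an RDD join emits one row per matching pair.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

// Illustrative stand-in only: plain collections instead of RDDs,
// java.util.Optional instead of the Optional type used by the Spark job.
public class LeftOuterJoinSketch {

    // Keeps every entry from the left side (ads). The right-side value
    // (impression) is wrapped in an Optional that is empty when the key
    // has no match, instead of the entry being dropped as in an inner join.
    static Map<String, SimpleImmutableEntry<String, Optional<String>>> leftOuterJoin(
            Map<String, String> ads, Map<String, String> impressions) {
        Map<String, SimpleImmutableEntry<String, Optional<String>>> joined = new LinkedHashMap<>();
        for (Map.Entry<String, String> ad : ads.entrySet()) {
            Optional<String> publisher = Optional.ofNullable(impressions.get(ad.getKey()));
            joined.put(ad.getKey(), new SimpleImmutableEntry<>(ad.getValue(), publisher));
        }
        return joined;
    }

    public static void main(String[] args) {
        // Hypothetical sample data: two ads, only one of which got an impression.
        Map<String, String> ads = new LinkedHashMap<>();
        ads.put("ad-1", "AdProvider1");
        ads.put("ad-2", "AdProvider4");
        Map<String, String> impressions = new LinkedHashMap<>();
        impressions.put("ad-1", "Publisher1");

        leftOuterJoin(ads, impressions).forEach((adId, value) ->
                System.out.println(adId + " -> " + value.getKey() + ", " + value.getValue()));
    }
}
```

Both ads survive the join; the one without an impression simply carries an empty Optional, which is the same shape the Spark job produces.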
Here is what the output looks like:
(00832901-21a6-4888-b06b-1f43b9d1acac,(AdProvider1,Optional.of(Publisher1)))
(9a1786e1-ab21-43e3-b4b2-4193f572acbc,(AdProvider1,Optional.of(Publisher1)))
(aca88cd0-fe50-40eb-8bda-81965b377827,(AdProvider1,Optional.of(Publisher1)))
(50a78218-d65a-4574-90de-0c46affbe7f3,(AdProvider5,Optional.absent()))
(611cf585-a8cf-43e9-9914-c9d1dc30dab5,(AdProvider1,Optional.of(Publisher1)))
(940c138a-88d3-4248-911a-7dbe6a074d9f,(AdProvider3,Optional.of(Publisher3)))
(983bb5e5-6d5b-4489-85b3-00e1d62f6a3a,(AdProvider3,Optional.of(Publisher3)))
(5de3ae82-d56a-4f70-8738-7e787172c018,(AdProvider1,Optional.of(Publisher1)))
(f1b6c6f4-8221-443d-812e-de857b77b2f4,(AdProvider2,Optional.of(Publisher2)))
(d9bb837f-c85d-45d4-95f2-97164c62aa42,(AdProvider4,Optional.absent()))
The output carries the Optional wrapper along with it. We can apply a transformation to the output to reshape the data to suit our needs.
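One such transformation could flatten each record into a CSV line, substituting a placeholder when no impression was received. The sketch below is a plain-Java illustration rather than code from LeftOuterJoin.java: the class JoinedRecordFormatter, the method toCsvLine, and the "NoImpression" placeholder are hypothetical names, and java.util.Optional stands in for the Optional type seen in the output above so that the example runs without Spark.

```java
import java.util.Optional;

// Hypothetical helper: turns one joined record into a flat CSV line.
public class JoinedRecordFormatter {

    // Assumed placeholder for ads that never received an impression.
    static final String NO_IMPRESSION = "NoImpression";

    // Unwraps the Optional publisher, falling back to the placeholder
    // when the ad had no matching impression.
    static String toCsvLine(String adId, String adProvider, Optional<String> publisher) {
        return adId + "," + adProvider + "," + publisher.orElse(NO_IMPRESSION);
    }

    public static void main(String[] args) {
        // Two records taken from the sample output above.
        System.out.println(toCsvLine(
                "00832901-21a6-4888-b06b-1f43b9d1acac", "AdProvider1", Optional.of("Publisher1")));
        System.out.println(toCsvLine(
                "50a78218-d65a-4574-90de-0c46affbe7f3", "AdProvider5", Optional.empty()));
    }
}
```

In the Spark job itself, the same idea would be applied to joinedData before calling saveAsTextFile. The Optional.of(...) and Optional.absent() strings in the output suggest Guava's Optional, where the default-value method is or(...) rather than orElse(...).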
Compile and Run
$ mvn clean package
$ ~/cots/spark-1.2.0-bin-hadoop2.4/bin/spark-submit --class org.learningspark.simple.LeftOuterJoin --master local target/learningspark-1.0-SNAPSHOT.jar ./src/main/resources/ads.csv ./src/main/resources/impression.csv