<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Thread.currentThread().join()</title>
	<atom:link href="http://www.ashishpaliwal.com/blog/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.ashishpaliwal.com/blog</link>
	<description>From Programmer, For Programmers</description>
	<lastBuildDate>Sat, 19 May 2012 05:03:00 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.2</generator>
		<item>
		<title>Hadoop Recipe &#8211; Implementing Custom Partitioner</title>
		<link>http://www.ashishpaliwal.com/blog/2012/05/hadoop-recipe-implementing-custom-partitioner/</link>
		<comments>http://www.ashishpaliwal.com/blog/2012/05/hadoop-recipe-implementing-custom-partitioner/#comments</comments>
		<pubDate>Thu, 17 May 2012 11:49:11 +0000</pubDate>
		<dc:creator>ashish</dc:creator>
				<category><![CDATA[Hadoop]]></category>

		<guid isPermaLink="false">http://www.ashishpaliwal.com/blog/?p=722</guid>
		<description><![CDATA[This recipe is about implementing a custom Partitioner. A Partitioner in the MapReduce world partitions the key space. The partitioner is used to derive the partition to which a key-value pair belongs. It is responsible for bringing records with the same key to the same partition so that they can be processed together by a reducer. To implement a [...]]]></description>
			<content:encoded><![CDATA[<p>This recipe is about implementing a custom <a href="http://hadoop.apache.org/common/docs/r0.23.0/api/org/apache/hadoop/mapreduce/Partitioner.html">Partitioner</a>.</p>
<p>A Partitioner in the MapReduce world partitions the key space. It derives the partition to which a key-value pair belongs, and so is responsible for bringing records with the same key to the same partition, where they can be processed together by a reducer.</p>
<p>To implement a custom Partitioner, we need to extend the Partitioner class.<br />
Let's look at the code for the Partitioner class:</p>
<pre class="brush: java">
public abstract class Partitioner&lt;KEY, VALUE&gt; {

  /**
   * Get the partition number for a given key (hence record) given the total
   * number of partitions i.e. number of reduce-tasks for the job.
   *
   * &lt;p&gt;Typically a hash function on a all or a subset of the key.&lt;/p&gt;
   *
   * @param key the key to be partioned.
   * @param value the entry value.
   * @param numPartitions the total number of partitions.
   * @return the partition number for the &lt;code&gt;key&lt;/code&gt;.
   */
  public abstract int getPartition(KEY key, VALUE value, int numPartitions);
}
</pre>
<p>The default partitioner is <a href="http://hadoop.apache.org/common/docs/r0.23.0/api/org/apache/hadoop/mapreduce/lib/partition/HashPartitioner.html">HashPartitioner</a>, which picks a partition based on the hash code of the key:</p>
<pre class="brush: java">
public class HashPartitioner&lt;K, V&gt; extends Partitioner&lt;K, V&gt; {

  /** Use {@link Object#hashCode()} to partition. */
  public int getPartition(K key, V value,
                          int numReduceTasks) {
    return (key.hashCode() &amp; Integer.MAX_VALUE) % numReduceTasks;
  }
}
</pre>
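<p>The bit mask with Integer.MAX_VALUE is there because hashCode() may return a negative number, and Java's remainder operator keeps the sign of the dividend. Below is a minimal stand-alone sketch of the same arithmetic in plain Java, with no Hadoop on the classpath; the class and method names are made up for illustration.</p>

```java
public class HashPartitionSketch {

    // Same arithmetic as HashPartitioner.getPartition(), minus the Hadoop types.
    public static int partition(Object key, int numReduceTasks) {
        // Clearing the sign bit keeps the result in [0, numReduceTasks).
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Without the mask, the remainder keeps the sign of the hash code.
    public static int unmaskedRemainder(Object key, int numReduceTasks) {
        return key.hashCode() % numReduceTasks;
    }

    public static void main(String[] args) {
        Integer key = -7; // Integer.hashCode() is the int value itself
        System.out.println(unmaskedRemainder(key, 4)); // prints -3
        System.out.println(partition(key, 4));         // prints 1
    }
}
```

<p>A negative partition number would not map to any reducer, which is why the mask is applied before taking the remainder.</p>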
<p>Now let's implement a custom partitioner. Assume that we have a Text key and want to use its first character as the deciding factor for the partition.</p>
<pre class="brush: java">
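import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class FirstCharTextPartitioner extends Partitioner&lt;Text, Text&gt; {

  @Override
  public int getPartition(Text key, Text value,
                          int numReduceTasks) {
    // An empty key has no first char; send it to partition 0
    if (key.getLength() == 0) {
      return 0;
    }
    return key.toString().charAt(0) % numReduceTasks;
  }
}

// Register it in the driver:
// job.setPartitionerClass(FirstCharTextPartitioner.class);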
</pre>
<p>The code uses the first character of the key to determine the partition, so all keys that begin with 'A' are processed by the same reducer. The next step is to set our custom class on the Job with job.setPartitionerClass(FirstCharTextPartitioner.class), so that the framework uses our partitioning logic.</p>
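<p>One thing to keep in mind with a scheme like this is skew: if most keys start with the same character, one reducer gets most of the work. The sketch below is plain Java with String standing in for Text (the class name and the sample keys are made up, and sending empty keys to partition 0 is an assumption); it counts how a handful of keys spread over four partitions.</p>

```java
import java.util.Arrays;
import java.util.List;

public class FirstCharPartitionSketch {

    // First-character partitioning, with String standing in for Text.
    public static int partition(String key, int numReduceTasks) {
        if (key.isEmpty()) {
            return 0; // assumption: empty keys go to partition 0
        }
        return key.charAt(0) % numReduceTasks;
    }

    // Counts how many of the given keys land in each partition.
    public static int[] histogram(List<String> keys, int numReduceTasks) {
        int[] counts = new int[numReduceTasks];
        for (String key : keys) {
            counts[partition(key, numReduceTasks)]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("Apple", "Avocado", "Almond", "Banana", "Cherry");
        // 'A' is 65, 'B' is 66, 'C' is 67, so with 4 partitions they map to 1, 2, 3
        System.out.println(Arrays.toString(histogram(keys, 4))); // prints [0, 3, 1, 1]
    }
}
```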
<p>Based on our needs, we can choose our own way of partitioning the key space.</p>
<p>Before implementing a custom Partitioner, it's best to check the following Partitioners provided by Hadoop:</p>
<ol>
<li>
<a href="http://hadoop.apache.org/common/docs/r0.23.0/api/org/apache/hadoop/mapreduce/lib/partition/BinaryPartitioner.html">BinaryPartitioner</a></li>
<li>
<a href="http://hadoop.apache.org/common/docs/r0.23.0/api/org/apache/hadoop/mapreduce/lib/partition/HashPartitioner.html">HashPartitioner</a></li>
<li>
<a href="http://hadoop.apache.org/common/docs/r0.23.0/api/org/apache/hadoop/mapreduce/lib/partition/KeyFieldBasedPartitioner.html">KeyFieldBasedPartitioner</a></li>
<li>
<a href="http://hadoop.apache.org/common/docs/r0.23.0/api/org/apache/hadoop/mapreduce/lib/partition/TotalOrderPartitioner.html">TotalOrderPartitioner</a></li>
</ol>
]]></content:encoded>
			<wfw:commentRss>http://www.ashishpaliwal.com/blog/2012/05/hadoop-recipe-implementing-custom-partitioner/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Hadoop Recipe &#8211; Implementing Custom Writable</title>
		<link>http://www.ashishpaliwal.com/blog/2012/05/hadoop-recipe-implementing-custom-writable/</link>
		<comments>http://www.ashishpaliwal.com/blog/2012/05/hadoop-recipe-implementing-custom-writable/#comments</comments>
		<pubDate>Tue, 15 May 2012 02:50:11 +0000</pubDate>
		<dc:creator>ashish</dc:creator>
				<category><![CDATA[Hadoop]]></category>

		<guid isPermaLink="false">http://www.ashishpaliwal.com/blog/?p=712</guid>
		<description><![CDATA[This recipe is about implementing a custom Writable to be used in MapReduce code. Hadoop provides many Writable implementations out of the box, which suffice for most cases. However, at times we need to pass custom objects. These are implementations of Hadoop's Writable interface. Let's see how to implement [...]]]></description>
			<content:encoded><![CDATA[<p>This Recipe is about implementing a custom Writable to be used in MapReduce code. </p>
<p>Hadoop provides many Writable implementations out of the box, which suffice for most cases. However, at times we need to pass custom objects. These must implement Hadoop's <a href="http://hadoop.apache.org/common/docs/r0.23.0/api/org/apache/hadoop/io/Writable.html">Writable</a> interface. Let's see how to implement one.</p>
<h2>Use Case:</h2>
<p>We want to pass request information as a whole, consisting of a request ID, a request type and a timestamp. The goal is to be able to pass it as the value for a key, and eventually to use it as a key.</p>
<p>NOTE: This is only a custom Writable, so as it stands it can only be used as a value. To be used as a key it must also implement <a href="http://hadoop.apache.org/common/docs/r0.23.0/api/org/apache/hadoop/io/WritableComparable.html">WritableComparable</a>, which we shall cover in a different post.</p>
<p>Let's see the code</p>
<pre class="brush: java">
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

/**
 * A custom Writable implementation for Request information.
 *
 * This is simple Custom Writable, and does not implement Comparable or RawComparator
 */
public class RequestInfo implements Writable {

    // Request ID as a String
    private Text requestId;

    // Request Type
    private Text requestType;

    // request timestamp
    private LongWritable timestamp;

    public RequestInfo() {
        this.requestId = new Text();
        this.requestType = new Text();
        this.timestamp = new LongWritable();
    }

    public RequestInfo(Text requestId, Text requestType, LongWritable timestamp) {
        this.requestId = requestId;
        this.requestType = requestType;
        this.timestamp = timestamp;
    }

    public RequestInfo(String requestId, String requestType, long timestamp) {
        this.requestId = new Text(requestId);
        this.requestType = new Text(requestType);
        this.timestamp = new LongWritable(timestamp);
    }

    public void write(DataOutput dataOutput) throws IOException {
        requestId.write(dataOutput);
        requestType.write(dataOutput);
        timestamp.write(dataOutput);
    }

    public void readFields(DataInput dataInput) throws IOException {
        requestId.readFields(dataInput);
        requestType.readFields(dataInput);
        timestamp.readFields(dataInput);
    }

    public Text getRequestId() {
        return requestId;
    }

    public Text getRequestType() {
        return requestType;
    }

    public LongWritable getTimestamp() {
        return timestamp;
    }

    public void setRequestId(Text requestId) {
        this.requestId = requestId;
    }

    public void setRequestType(Text requestType) {
        this.requestType = requestType;
    }

    public void setTimestamp(LongWritable timestamp) {
        this.timestamp = timestamp;
    }

    @Override
    public int hashCode() {
        // This is used by HashPartitioner, so implement it as per need
        // this one shall hash based on request id
        return requestId.hashCode();
    }
}
</pre>
<p>The code is fairly simple. We implement the Writable interface and put the logic in the write() and readFields() methods. In write() we dump the current state of the object, and in readFields() we read it back in the same order.</p>
<p>Hadoop handles strings in its own way, so I decided to use the <a href="http://hadoop.apache.org/common/docs/r0.23.0/api/org/apache/hadoop/io/Text.html">Text</a> class. For a more authoritative discussion of this topic, please refer to <a href="http://www.amazon.com/gp/product/1449311520/ref=as_li_ss_tl?ie=UTF8&tag=ashtecblo-20&linkCode=as2&camp=1789&creative=390957&creativeASIN=1449311520">Hadoop: The Definitive Guide</a>, Chapter 4 - Hadoop I/O.</p>
<p><strong>Note about hashCode()</strong></p>
<p>The current implementation uses requestId's hashCode(). Implement this method carefully if you plan to use the object as a key in your MapReduce code, as the default <a href="http://hadoop.apache.org/common/docs/r0.23.0/api/org/apache/hadoop/mapreduce/lib/partition/HashPartitioner.html">HashPartitioner</a> uses it to partition the keys.</p>
<p>Also, the equals() method has been left for the reader to implement <img src='http://www.ashishpaliwal.com/blog/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </p>
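<p>To try the write()/readFields() round trip without a cluster, here is a stand-in sketch in plain Java. It is not the RequestInfo class above: it does not implement Hadoop's Writable, it uses String and long instead of Text and LongWritable, and the class name and the roundTrip() helper are made up. It also shows one possible equals() that stays consistent with a hashCode() based on the request ID.</p>

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Stand-in for RequestInfo: same write()/readFields() shape, plain Java types.
public class RequestInfoSketch {

    private String requestId = "";
    private String requestType = "";
    private long timestamp;

    public RequestInfoSketch() { }

    public RequestInfoSketch(String requestId, String requestType, long timestamp) {
        this.requestId = requestId;
        this.requestType = requestType;
        this.timestamp = timestamp;
    }

    public void write(DataOutput out) throws IOException {
        out.writeUTF(requestId);
        out.writeUTF(requestType);
        out.writeLong(timestamp);
    }

    // Fields must be read back in exactly the order they were written.
    public void readFields(DataInput in) throws IOException {
        requestId = in.readUTF();
        requestType = in.readUTF();
        timestamp = in.readLong();
    }

    // One possible equals(): compares all three fields.
    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof RequestInfoSketch)) return false;
        RequestInfoSketch other = (RequestInfoSketch) o;
        return timestamp == other.timestamp
                && requestId.equals(other.requestId)
                && requestType.equals(other.requestType);
    }

    // Hashes on the request ID only; equal objects still get equal hash codes.
    @Override
    public int hashCode() {
        return requestId.hashCode();
    }

    // Serializes to a byte array and reads the bytes into a fresh instance.
    public static RequestInfoSketch roundTrip(RequestInfoSketch original) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            original.write(new DataOutputStream(bytes));
            RequestInfoSketch copy = new RequestInfoSketch();
            copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

<p>If readFields() read the fields in a different order than write() wrote them, the copy would come back garbled or the read would fail, so a round-trip check like this is a cheap test to keep around.</p>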
]]></content:encoded>
			<wfw:commentRss>http://www.ashishpaliwal.com/blog/2012/05/hadoop-recipe-implementing-custom-writable/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Hadoop Recipe &#8211; Using Custom Java Counters</title>
		<link>http://www.ashishpaliwal.com/blog/2012/05/hadoop-recipe-using-custom-java-counters/</link>
		<comments>http://www.ashishpaliwal.com/blog/2012/05/hadoop-recipe-using-custom-java-counters/#comments</comments>
		<pubDate>Fri, 11 May 2012 15:02:39 +0000</pubDate>
		<dc:creator>ashish</dc:creator>
				<category><![CDATA[Application Programming]]></category>

		<guid isPermaLink="false">http://www.ashishpaliwal.com/blog/?p=703</guid>
		<description><![CDATA[This post starts the Hadoop Recipe series, in which I shall pick a topic and provide sample code around it. Each recipe shall be small and concise, and provide ready-to-use hints on the topic covered. This post covers the usage of custom Counters in Java. Counters are very helpful in the MapReduce world. We can [...]]]></description>
			<content:encoded><![CDATA[<p>This post starts the Hadoop Recipe series, in which I shall pick a topic and provide sample code around it. Each recipe shall be small and concise, and provide ready-to-use hints on the topic covered.</p>
<p>This post covers the usage of custom Counters in Java. Counters are very helpful in the MapReduce world. We can use them to watch progress, or as an indirect means of debugging or validation. For example, you may want a specific counter to know how many records of a particular kind were processed out of the complete data, such as how many requests in an Apache access log contained a specific keyword. There can be many other similar use cases.</p>
<p>Hadoop provides some built-in counters that are always there, like the number of Map input records, the number of bytes processed, etc.</p>
<p>Let's see how we can use a custom Counter in Java code.</p>
<h2>Define the counter</h2>
<pre class="brush: java">
public static enum COUNTERS {
    ERROR_COUNT,
    MISSING_FIELDS_RECORD_COUNT
}
</pre>
<p>The Counter definition is very simple: we define an enum that lists all the Counters we want to use.</p>
<h2>Using the Counters</h2>
<pre class="brush: java">
 @Override
 public void map(LongWritable key, Text value, Context context)
                 throws IOException, InterruptedException {
     // mapper code here

     // if error condition, increment the error counter
     if(error) {
         context.getCounter(COUNTERS.ERROR_COUNT).increment(1);
     }

     // if missing records conditions
     if(missingRecords) {
         context.getCounter(COUNTERS.MISSING_FIELDS_RECORD_COUNT).increment(1);
     }
}
</pre>
<p>Using the Counters is just as simple: when a given condition holds, increment the counter. In the example we increment by one, but you can increment by larger values as well.</p>
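<p>If you want to unit-test the counting logic before running a job, the same enum can be tallied locally. The sketch below is a stand-in only: an EnumMap plays the role of the framework's counters, and the class name, the process() method and the rule that a record has three comma-separated fields are all made up for illustration.</p>

```java
import java.util.EnumMap;
import java.util.Map;

public class CounterSketch {

    // Same enum as in the recipe.
    public enum COUNTERS {
        ERROR_COUNT,
        MISSING_FIELDS_RECORD_COUNT
    }

    private final Map<COUNTERS, Long> counts = new EnumMap<>(COUNTERS.class);

    // Local stand-in for context.getCounter(name).increment(amount).
    public void increment(COUNTERS name, long amount) {
        counts.merge(name, amount, Long::sum);
    }

    // Counters that were never incremented read as zero.
    public long get(COUNTERS name) {
        return counts.getOrDefault(name, 0L);
    }

    // Hypothetical check: a complete record has 3 comma-separated fields.
    public void process(String record) {
        if (record.split(",", -1).length != 3) {
            increment(COUNTERS.MISSING_FIELDS_RECORD_COUNT, 1);
        }
    }
}
```

<p>In the real mapper, the increment() call is replaced by the context.getCounter(...) call shown above.</p>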
<h2>Viewing the Counter values</h2>
<p>You can view the counter values in the JobTracker UI or programmatically. You can print all the Counters or a specific one. The following example shows how to print the value of a specific Counter:</p>
<pre class="brush: java">

// Code in the Job Driver Class
Counter errorCounter = job.getCounters().findCounter(COUNTERS.ERROR_COUNT);
System.out.println(&quot;Error Counter = &quot;+errorCounter.getValue());
</pre>
]]></content:encoded>
			<wfw:commentRss>http://www.ashishpaliwal.com/blog/2012/05/hadoop-recipe-using-custom-java-counters/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Playing with JClouds transient Blobstore</title>
		<link>http://www.ashishpaliwal.com/blog/2012/04/playing-with-jclouds-transient-blobstore/</link>
		<comments>http://www.ashishpaliwal.com/blog/2012/04/playing-with-jclouds-transient-blobstore/#comments</comments>
		<pubDate>Mon, 23 Apr 2012 12:19:22 +0000</pubDate>
		<dc:creator>ashish</dc:creator>
				<category><![CDATA[jclouds]]></category>

		<guid isPermaLink="false">http://www.ashishpaliwal.com/blog/?p=684</guid>
		<description><![CDATA[JClouds BlobStore API provides a portable way of managing key-value providers like Amazon S3. This post is a getting started guide with the API. We shall explore a bit about the API and create a simple program to use the same. Before we begin, let's get hold of some concepts. Service - It refers to [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.jclouds.org/">JClouds</a> BlobStore API provides a portable way of managing key-value providers like Amazon S3. This post is a getting started guide with the API. We shall explore a bit about the API and create a simple program to use the same.</p>
<p>Before we begin, let's get hold of some concepts:<br />
<strong><u>Service</u></strong> - It refers to the provider where we host our key-value data, like Amazon S3</p>
<p><strong><u>Containers</u></strong> - They are namespaces for the data, where we store our Blobs. For example, in Amazon S3, Containers == buckets</p>
<p><strong><u>Blob</u></strong> - This is the unstructured data that we store inside Containers</p>
<p>There are some more details about the API, but they are not covered here to keep things simple.</p>
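<p>As a mental model only, the three concepts nest like maps: a Service holds Containers by name, and a Container holds Blobs by name. The toy class below is plain Java and none of it is JClouds API; every name in it is made up for illustration.</p>

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Toy in-memory model of Service / Container / Blob; not the JClouds API.
public class BlobConceptSketch {

    // The "service": container name -> (blob name -> blob bytes)
    private final Map<String, Map<String, byte[]>> containers = new LinkedHashMap<>();

    public void createContainer(String containerName) {
        containers.putIfAbsent(containerName, new LinkedHashMap<>());
    }

    public void putBlob(String containerName, String blobName, String payload) {
        Map<String, byte[]> container = containers.get(containerName);
        if (container == null) {
            throw new IllegalArgumentException("No such container: " + containerName);
        }
        container.put(blobName, payload.getBytes(StandardCharsets.UTF_8));
    }

    // Number of blobs in a container, or 0 if the container does not exist.
    public int blobCount(String containerName) {
        Map<String, byte[]> container = containers.get(containerName);
        return container == null ? 0 : container.size();
    }
}
```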
<p>Let's briefly see the steps that we need to follow to use the API:</p>
<ul>
<li>We create a BlobStoreContext. In simple words, it's our handle to the Service provider</li>
<li>We get BlobStore handle from BlobStoreContext</li>
<li>We create a Container</li>
<li>We add our data/Blobs to the Container</li>
</ul>
<p>Let's look at the code for these steps.</p>
<p>The best way to play with the JClouds API is to clone the examples GitHub repo (https://github.com/jclouds/jclouds-examples) and modify the code to fit your needs; this is what I have done.</p>
<p>NOTE: We shall be using the JClouds in-memory blobstore for playing around, so that we don't have to pay the usage charges of BlobStore providers.</p>
<p><strong><u>Step 1:</u></strong> Creating the BlobStoreContext</p>
<pre class="brush: java">
 String provider = &quot;transient&quot;;
 String identity = &quot;Unused&quot;;
 String credential = Optional.absent().toString();

 // Init
 BlobStoreContext context = ContextBuilder.newBuilder(provider)
                .credentials(identity, credential)
                .build(BlobStoreContext.class);
</pre>
<p>The code is simple enough. The transient provider represents the in-memory BlobStore provider, and the other options are specific to this transient provider. If we need to use S3, we replace these options.</p>
<p><strong><u>Step 2:</u></strong> Get the handle to BlobStore</p>
<pre class="brush: java">
BlobStore blobStore = context.getBlobStore();
</pre>
<p>From the BlobStoreContext, we get the handle to the BlobStore, which we shall use to perform subsequent operations.</p>
<p><strong><u>Step 3:</u></strong> Creating a Container</p>
<pre class="brush: java">
String containerName = &quot;dummybase&quot;;
blobStore.createContainerInLocation(null, containerName);
</pre>
<p>This shall create a Container with name "dummybase"</p>
<p><strong><u>Step 4:</u></strong> Adding Blob to Container</p>
<pre class="brush: java">
// Add Blob
Blob blob = blobStore.blobBuilder(&quot;test&quot;).payload(&quot;testdata&quot;).build();
blobStore.putBlob(containerName, blob);

// Add Blob
Blob blob2 = blobStore.blobBuilder(&quot;test1&quot;).payload(&quot;testdata1&quot;).build();
blobStore.putBlob(containerName, blob2);
</pre>
<p>We create two blobs with String payloads and add them to the Container. We can use different APIs to set the payload, like using a File or an InputStream.</p>
<p>That's it, we have added the Blobs to the transient BlobStore provider.</p>
<p>Let's list the contents</p>
<pre class="brush: java">
for (StorageMetadata resourceMd : blobStore.list()) {
    if (resourceMd.getType() == StorageType.CONTAINER || resourceMd.getType() == StorageType.FOLDER) {
        // Use Map API
        Map&lt;String, InputStream&gt; containerMap = context.createInputStreamMap(resourceMd.getName());
        System.out.printf(&quot;  %s: %s entries%n&quot;, resourceMd.getName(), containerMap.size());
     }
}
</pre>
<p>The complete code together</p>
<pre class="brush: java">
public static void main(String[] args) throws IOException {
    String provider = &quot;transient&quot;;
    String identity = &quot;Unused&quot;;
    String credential = Optional.absent().toString();
    String containerName = &quot;dummybase&quot;;

    // Init
    BlobStoreContext context = ContextBuilder.newBuilder(provider)
                .credentials(identity, credential)
                .build(BlobStoreContext.class);

    try {
        // Create Container
        BlobStore blobStore = context.getBlobStore();
        blobStore.createContainerInLocation(null, containerName);

        // Add Blob
        Blob blob = blobStore.blobBuilder(&quot;test&quot;).payload(&quot;testdata&quot;).build();
        blobStore.putBlob(containerName, blob);

        // Add Blob
        Blob blob2 = blobStore.blobBuilder(&quot;test1&quot;).payload(&quot;testdata1&quot;).build();
        blobStore.putBlob(containerName, blob2);

        // List Container
        for (StorageMetadata resourceMd : blobStore.list()) {
            if (resourceMd.getType() == StorageType.CONTAINER || resourceMd.getType() == StorageType.FOLDER) {
                // Use Map API
                Map&lt;String, InputStream&gt; containerMap = context.createInputStreamMap(resourceMd.getName());
                System.out.printf(&quot;  %s: %s entries%n&quot;, resourceMd.getName(), containerMap.size());
            }
        }
    } finally {
        // Close connection
        context.close();
        System.exit(0);
    }
}
</pre>
]]></content:encoded>
			<wfw:commentRss>http://www.ashishpaliwal.com/blog/2012/04/playing-with-jclouds-transient-blobstore/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Using Hadoop Distributed Cache</title>
		<link>http://www.ashishpaliwal.com/blog/2012/04/using-hadoop-distributed-cache/</link>
		<comments>http://www.ashishpaliwal.com/blog/2012/04/using-hadoop-distributed-cache/#comments</comments>
		<pubDate>Tue, 17 Apr 2012 09:31:40 +0000</pubDate>
		<dc:creator>ashish</dc:creator>
				<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[Distributed Cache]]></category>

		<guid isPermaLink="false">http://www.ashishpaliwal.com/blog/?p=669</guid>
		<description><![CDATA[Hadoop has a distributed cache mechanism that makes files needed by Map/Reduce jobs available locally. This post tries to expand a bit on the information provided by the javadoc of DistributedCache. Use Case: Let's understand our use case in a bit more detail so that we can follow the code snippets. [...]]]></description>
			<content:encoded><![CDATA[<p>Hadoop has a distributed cache mechanism that makes files needed by Map/Reduce jobs available locally. This post tries to expand a bit on the information provided by the javadoc of <a href="http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/filecache/DistributedCache.html">DistributedCache</a>.</p>
<h1>Use Case</h1>
<p>Let's understand our use case in a bit more detail so that we can follow the code snippets.<br />
We have a key-value file that we need to use in our Map tasks. For simplicity, let's say we need to replace every keyword that we encounter during parsing with some other value.</p>
<p>So what we need is</p>
<ul>
<li>A key-value file (let's use a Properties file)</li>
<li>The Mapper code that uses the file</li>
</ul>
<h2>Step 1</h2>
<p>Place the key-values file on the HDFS</p>
<pre class="brush: bash">
hadoop fs -put ./keyvalues.properties cache/keyvalues.properties
</pre>
<p>This path is relative to the user's home folder on HDFS</p>
<h2>Step 2</h2>
<p>Write the Mapper code that uses it</p>
<pre class="brush: java">
public class DistributedCacheMapper extends Mapper&lt;LongWritable, Text, Text, Text&gt; {

    Properties cache;

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        super.setup(context);
        Path[] localCacheFiles = DistributedCache.getLocalCacheFiles(context.getConfiguration());

        if(localCacheFiles != null) {
            // expecting only single file here
            for (int i = 0; i &lt; localCacheFiles.length; i++) {
                Path localCacheFile = localCacheFiles[i];
                cache = new Properties();
                cache.load(new FileReader(localCacheFile.toString()));
            }
        } else {
            // do your error handling here
        }

    }

    @Override
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // use the cache here
        // if value contains some attribute, cache.get(&lt;value&gt;)
        // do some action or replace with something else
    }

}
</pre>
<p>The Mapper code is simple enough. During the setup phase, we read the file and populate the Properties object. Inside map() we use the cache to look up certain keys and replace them if they are present.</p>
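<p>The body of map() is left as comments above. As one way to fill it in, here is a stand-alone sketch of the lookup-and-replace step in plain Java. It loads the properties from a String instead of a cache file so that it runs without Hadoop; the class name, the method names and the choice to split on single spaces are all assumptions.</p>

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

public class KeywordReplaceSketch {

    // Loads key=value pairs into a Properties object, as setup() does,
    // but from a String rather than a local cache file.
    public static Properties load(String propertiesText) {
        Properties cache = new Properties();
        try {
            cache.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return cache;
    }

    // Replaces every space-separated token that is a key in the cache;
    // tokens without an entry are kept unchanged.
    public static String replaceKeywords(String line, Properties cache) {
        StringBuilder out = new StringBuilder();
        String[] tokens = line.split(" ");
        for (int i = 0; i < tokens.length; i++) {
            if (i > 0) {
                out.append(' ');
            }
            out.append(cache.getProperty(tokens[i], tokens[i]));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Properties cache = load("NYC=New York\nGET=get\n");
        System.out.println(replaceKeywords("GET /index NYC", cache)); // prints get /index New York
    }
}
```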
<h2>Step 3</h2>
<p>Add the properties file to your driver code</p>
<pre class="brush: java">
JobConf jobConf = new JobConf();
// set job properties
// set the cache file
DistributedCache.addCacheFile(new URI(&quot;cache/keyvalues.properties#keyvalues.properties&quot;), jobConf);
</pre>
]]></content:encoded>
			<wfw:commentRss>http://www.ashishpaliwal.com/blog/2012/04/using-hadoop-distributed-cache/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Finding the input File for Hadoop Map task</title>
		<link>http://www.ashishpaliwal.com/blog/2012/03/finding-the-input-file-for-hadoop-map-task/</link>
		<comments>http://www.ashishpaliwal.com/blog/2012/03/finding-the-input-file-for-hadoop-map-task/#comments</comments>
		<pubDate>Thu, 22 Mar 2012 05:19:02 +0000</pubDate>
		<dc:creator>ashish</dc:creator>
				<category><![CDATA[Hadoop]]></category>

		<guid isPermaLink="false">http://www.ashishpaliwal.com/blog/?p=665</guid>
		<description><![CDATA[I had a lot of pure Map-only jobs, whose main function was to clean the incoming log stream and emit a refined output log with consistent fields. Due to code bugs or variation in the input, Map tasks would often get killed or fail to produce the desired outcome. In the quest of [...]]]></description>
			<content:encoded><![CDATA[<p>I had a lot of pure Map-only jobs, whose main function was to clean the incoming log stream and emit a refined output log with consistent fields. Due to code bugs or variation in the input, Map tasks would often get killed or fail to produce the desired outcome. In the quest to narrow down the offenders, I found a simple line of code that always helped.</p>
<p>But before that, a bit about my input format and environment:</p>
<ul>
<li>Input files were gzipped (meaning no split would happen)</li>
<li>Hadoop jobs with only Map</li>
<li>Using Cloudera Hadoop Distribution</li>
</ul>
<p>Since in my case the whole input file would be processed by a single map task, it was easy to find the offending file that caused the job to fail. Here is the snippet:</p>
<pre class="brush: java">
@Override
protected void setup(Context context) throws IOException, InterruptedException {
        super.setup(context);
        System.out.println( context.getTaskAttemptID() +  &quot; - &quot;+ ((FileSplit)context.getInputSplit()).getPath());
}
</pre>
<p>This snippet prints the file being processed by the Map attempt to the stdout log file. One can navigate to the job log folder and find the file for the failing Map task.</p>
<p>My strategy was to pick this file and run it in isolation, outside Hadoop, to find the cause of the problem.</p>
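<p>Running the file in isolation can be as small as the sketch below: read the gzipped file line by line and report the first line that fails a check. This is plain Java and only a sketch; the class name, the gzip() helper and the rule that a clean line has exactly three tab-separated fields are made up, so substitute the parsing logic of your own mapper.</p>

```java
import java.io.BufferedReader;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class LocalReplaySketch {

    // Hypothetical validity rule: a clean log line has this many tab-separated fields.
    static final int EXPECTED_FIELDS = 3;

    // Returns the 1-based number of the first bad line, or -1 if every line passes.
    public static long firstBadLine(InputStream gzipped) {
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(gzipped), StandardCharsets.UTF_8))) {
            String line;
            long lineNo = 0;
            while ((line = reader.readLine()) != null) {
                lineNo++;
                if (line.split("\t", -1).length != EXPECTED_FIELDS) {
                    return lineNo;
                }
            }
            return -1;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Helper for trying the sketch without a real file: gzips a String in memory.
    public static byte[] gzip(String text) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (GZIPOutputStream out = new GZIPOutputStream(bytes)) {
                out.write(text.getBytes(StandardCharsets.UTF_8));
            }
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

<p>For a real run, pass a FileInputStream opened on the file name printed by the snippet above, after copying that file out of HDFS.</p>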
]]></content:encoded>
			<wfw:commentRss>http://www.ashishpaliwal.com/blog/2012/03/finding-the-input-file-for-hadoop-map-task/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Updating Google Calendar with MS Project Tasks &#8211; 2 &#8211; Revisited</title>
		<link>http://www.ashishpaliwal.com/blog/2011/08/updating-google-calendar-with-ms-project-tasks-2-revisited/</link>
		<comments>http://www.ashishpaliwal.com/blog/2011/08/updating-google-calendar-with-ms-project-tasks-2-revisited/#comments</comments>
		<pubDate>Tue, 23 Aug 2011 03:53:36 +0000</pubDate>
		<dc:creator>ashish</dc:creator>
				<category><![CDATA[Application Programming]]></category>
		<category><![CDATA[Java]]></category>
		<category><![CDATA[Google Calendar]]></category>
		<category><![CDATA[MS Project]]></category>

		<guid isPermaLink="false">http://www.ashishpaliwal.com/blog/?p=648</guid>
		<description><![CDATA[This is the 2nd part in the series. Please refer to Updating Google Calendar with MS Project Tasks - Revisited for Part 1, including how to build the code. In the previous post we saw how to list all the calendars. In this post we shall build upon the code to list all the calendars [...]]]></description>
			<content:encoded><![CDATA[<p>This is the 2nd part in the series. Please refer to <a href="http://www.ashishpaliwal.com/blog/2011/07/updating-google-calendar-with-ms-project-revisited/">Updating Google Calendar with MS Project Tasks - Revisited</a> for Part 1, including how to build the code.</p>
<p>In the previous post we saw how to list all the calendars. In this post we shall build upon that code to list all the calendars for a given user account and add the MS Project tasks to the selected calendar.</p>
<p>We shall need the following three steps:</p>
<ul>
<li>List Calendars and select one</li>
<li>Parse MS Project tasks and convert them to Calendar events</li>
<li>Update Google Calendar with Events</li>
</ul>
<h3>Step 1: List all calendars</h3>
<p>We have already seen the code for this in the previous <a href="http://www.ashishpaliwal.com/blog/2011/07/updating-google-calendar-with-ms-project-revisited">post</a>.</p>
<h3>Step 2: Parse MS Project tasks and convert them to Calendar events</h3>
<p>Let's look at the code for the conversion:</p>
<pre class="brush: java">
// Builds a Calendar event from an MS Project task: the task name becomes
// the event title and the task start date becomes the event start time.
protected CalendarEventEntry convertTaskToCalenderEntry(Task task) {
    System.out.println(&quot;Task &quot;+task);
    CalendarEventEntry eventEntry = new CalendarEventEntry();
    eventEntry.setTitle(new PlainTextConstruct(task.getName()));

    // Only the start time is set here
    When date = new When();
    date.setStartTime(new DateTime(task.getStart()));
    eventEntry.addTime(date);
    return eventEntry;
}
</pre>
<p>The code takes the task's name and start date and creates a CalendarEventEntry from them. Note that the snippet sets only the start time; the task's end date can be added in the same way through <em>When.setEndTime</em>. Additional logic can be built into this step, such as filtering tasks based on some rules.</p>
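<p>As a rough illustration of the filtering idea mentioned above, here is a minimal sketch. It assumes a hypothetical <em>SimpleTask</em> class standing in for the MS Project Task type, and the names <em>TaskFilterSketch</em> and <em>select</em> are made up for this example; they are not part of mpputils.</p>

```java
import java.util.ArrayList;
import java.util.Date;
import java.util.List;
import java.util.function.Predicate;

public class TaskFilterSketch {
    // Hypothetical stand-in for the MS Project Task type (assumption, not the real class)
    static class SimpleTask {
        final String name;
        final Date start;

        SimpleTask(String name, Date start) {
            this.name = name;
            this.start = start;
        }
    }

    // Keeps only the tasks that satisfy the rule; only these would then be converted to events
    static List<SimpleTask> select(List<SimpleTask> tasks, Predicate<SimpleTask> rule) {
        List<SimpleTask> selected = new ArrayList<>();
        for (SimpleTask task : tasks) {
            if (rule.test(task)) {
                selected.add(task);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        List<SimpleTask> tasks = new ArrayList<>();
        tasks.add(new SimpleTask("Design review", new Date(1000L)));
        tasks.add(new SimpleTask("Unscheduled idea", null));

        // Example rule: skip tasks that have no name or no start date
        Predicate<SimpleTask> hasNameAndStart = t -> t.name != null && t.start != null;

        for (SimpleTask task : select(tasks, hasNameAndStart)) {
            System.out.println(task.name);
        }
    }
}
```

<p>Keeping the rule as a separate predicate means the conversion method itself does not need to change when the filtering rules do.</p>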
<h3> Step 3: Updating entries into Google Calendar </h3>
<pre class="brush: java">
// Converts each task to an event and inserts it into the calendar at the given URL.
public void updateCalenderWithEntry(List&lt;Task&gt; tasks, CalendarService calendarService, URL url) {
    for (Task task : tasks) {
        CalendarEventEntry entry = convertTaskToCalenderEntry(task);
        try {
            calendarService.insert(url, entry);
        } catch (IOException e) {
            // A failed insert is logged and the remaining tasks are still processed
            e.printStackTrace();
        } catch (ServiceException e) {
            e.printStackTrace();
        }
    }
}
</pre>
<p>Here we just iterate through the tasks and insert the corresponding events using CalendarService. The calendar URL can be retrieved from the CalendarEntry class.</p>
<p>A complete working example is available in the <em>com.ashishpaliwal.mpputils.examples.MppToGoogleCalendar</em> class.</p>
<h4>Using the sample program</h4>
<p>>com.ashishpaliwal.mpputils.examples.MppToGoogleCalendar [MPP File] [Google User Name] [password]</p>
<h3>What's Next?</h3>
<p>I am eager to hear from users about what they would like to see. Adding a UI and an out-of-the-box working package would be good additions. In the meantime, please do try this out, and I shall try my best to improve it.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.ashishpaliwal.com/blog/2011/08/updating-google-calendar-with-ms-project-tasks-2-revisited/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Hunting down CPU hogging Java Thread</title>
		<link>http://www.ashishpaliwal.com/blog/2011/08/finding-java-thread-consuming-high-cpu/</link>
		<comments>http://www.ashishpaliwal.com/blog/2011/08/finding-java-thread-consuming-high-cpu/#comments</comments>
		<pubDate>Tue, 09 Aug 2011 21:48:21 +0000</pubDate>
		<dc:creator>ashish</dc:creator>
				<category><![CDATA[Java]]></category>
		<category><![CDATA[CPU]]></category>
		<category><![CDATA[Thread]]></category>

		<guid isPermaLink="false">http://www.ashishpaliwal.com/blog/?p=632</guid>
		<description><![CDATA[Most of us have had to find the cause of high CPU usage in a Java application. Profiling is the best way, but at times running a profiler in production is not an option. Fortunately, there is a simple way if you are running your app on *nix. Let's explore how to find this. Find the [...]]]></description>
			<content:encoded><![CDATA[<p>Most of us have had to find the cause of high CPU usage in a Java application. Profiling is the best way, but at times running a profiler in production is not an option. Fortunately, there is a simple way if you are running your app on *nix.</p>
<p>Let's explore how to find the offending thread.</p>
<ul>
<li>Find the pid of the application, using the <em>top</em> or <em>jps</em> command.</li>
<li>Once you have the pid, run the following command:<br />
   <strong>$ ps -L pid</strong>
<p>We get an output as shown in the figure:</p>
<div id="attachment_635" class="wp-caption aligncenter" style="width: 410px"><a href="http://www.ashishpaliwal.com/blog/wp-content/uploads/2011/08/Screen-shot-2011-08-10-at-2.39.54-AM1.png"><img src="http://www.ashishpaliwal.com/blog/wp-content/uploads/2011/08/Screen-shot-2011-08-10-at-2.39.54-AM1.png" alt="" title="ps output" width="400" height="216" class="size-full wp-image-635" /></a><p class="wp-caption-text">ps output</p></div>
<p>The output displays all the threads in the application along with the CPU time each has spent. Find the thread that has spent the most time in execution (entry circled on the right), and note its LWP ID (entry circled on the left).</p>
</li>
<li>Using <em>jstack</em> or <em>visualvm</em>, take a thread dump.</li>
<li>Convert the LWP ID to hex and search for it in the thread dump, where it appears as the <em>nid</em> value of a thread (for example, <em>nid=0x1a2b</em>).</li>
</ul>
<p>This pinpoints the thread that is consuming the most CPU; its stack trace in the dump is the starting point for further investigation.</p>
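<p>The decimal-to-hex conversion in the last step can be done with a few lines of Java. This is only a small helper sketch: the class name <em>LwpToNid</em> and the LWP ID 6699 are made up for illustration, and it assumes the thread dump prints the native thread ID as <em>nid=0x</em> followed by lowercase hex digits.</p>

```java
public class LwpToNid {
    // Formats a decimal LWP ID as the nid string to search for in the thread dump
    // (assumes the dump uses the form nid=0x<lowercase hex>)
    static String toNid(long lwpId) {
        return "nid=0x" + Long.toHexString(lwpId);
    }

    public static void main(String[] args) {
        // 6699 is a made-up LWP ID; pass the real one as the first argument
        long lwpId = args.length > 0 ? Long.parseLong(args[0]) : 6699L;
        System.out.println(toNid(lwpId)); // prints nid=0x1a2b for 6699
    }
}
```

<p>The printed string can then be searched for directly in the saved thread dump.</p>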
]]></content:encoded>
			<wfw:commentRss>http://www.ashishpaliwal.com/blog/2011/08/finding-java-thread-consuming-high-cpu/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Updating Google Calendar with MS Project Tasks &#8211; Revisited</title>
		<link>http://www.ashishpaliwal.com/blog/2011/07/updating-google-calendar-with-ms-project-revisited/</link>
		<comments>http://www.ashishpaliwal.com/blog/2011/07/updating-google-calendar-with-ms-project-revisited/#comments</comments>
		<pubDate>Mon, 11 Jul 2011 11:04:51 +0000</pubDate>
		<dc:creator>ashish</dc:creator>
				<category><![CDATA[Java]]></category>
		<category><![CDATA[Google Calendar]]></category>
		<category><![CDATA[MS Project]]></category>

		<guid isPermaLink="false">http://www.ashishpaliwal.com/blog/?p=619</guid>
		<description><![CDATA[Back in 2008, I wrote the post on Updating Google Calendar with MS Project. Since then, many people have asked for the code, and I felt that getting the code on GitHub would be great. I shall complete the post in 2 parts, starting with setup, listing the calendars and then winding up with [...]]]></description>
			<content:encoded><![CDATA[<p>Back in 2008, I wrote the post on <a href="http://www.ashishpaliwal.com/blog/2008/10/updating-google-calendar-with-ms-project-tasks/">Updating Google Calendar with MS Project</a>. Since then, many people have asked for the code, and I felt that getting the code on GitHub would be great. I shall complete the post in 2 parts, starting with setup and listing the calendars, and then winding up with updating Google Calendar with MS Project entries.</p>
<p>To begin, you can get the code from GitHub at the following location:</p>
<p><a href="https://github.com/paliwalashish/mpputils">https://github.com/paliwalashish/mpputils</a></p>
<h2>Building the code</h2>
<p>To build the code and download the dependencies, please refer to the wiki page <a href="https://github.com/paliwalashish/mpputils/wiki/Build">https://github.com/paliwalashish/mpputils/wiki/Build</a></p>
<h2>Retrieving Calendars</h2>
<p>Now let's move on to retrieving the list of all the calendars for a user. The code is simple enough; here is the function that does it:</p>
<pre class="brush: java">
public static List&lt;CalendarEntry&gt; getAllCalendars(String userName, String password) throws Exception {
    CalendarService myService = new CalendarService(&quot;CalendarService-&quot; + userName);
    myService.setUserCredentials(userName, password);

    // Request the feed that lists every calendar of the authenticated user
    URL feedUrl = new URL(&quot;https://www.google.com/calendar/feeds/default/allcalendars/full&quot;);
    CalendarFeed resultFeed = myService.getFeed(feedUrl, CalendarFeed.class);

    return resultFeed.getEntries();
}
</pre>
<p>This snippet returns all the calendars for the specified user account.</p>
<p>The code includes an example of its usage. Please refer to the <em><strong>com.ashishpaliwal.mpputils.examples.ListAllCalendars</strong></em> class.</p>
<p>In the next part we shall see the code for updating Google Calendar with MS Project tasks.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.ashishpaliwal.com/blog/2011/07/updating-google-calendar-with-ms-project-revisited/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>A Simple LRU Cache</title>
		<link>http://www.ashishpaliwal.com/blog/2011/05/a-simple-lru-cache/</link>
		<comments>http://www.ashishpaliwal.com/blog/2011/05/a-simple-lru-cache/#comments</comments>
		<pubDate>Fri, 13 May 2011 16:19:37 +0000</pubDate>
		<dc:creator>ashish</dc:creator>
				<category><![CDATA[Java]]></category>

		<guid isPermaLink="false">http://www.ashishpaliwal.com/blog/?p=612</guid>
		<description><![CDATA[Recently, while working on a module, I needed a very lightweight, fixed-size LRU cache. There are definitely a lot of caching solutions around, but I decided to try a simple Map-based solution. Surprisingly, the solution using LinkedHashMap turned out to be the simplest. Here is the snippet. Removed all additional validation and other stuff [...]]]></description>
			<content:encoded><![CDATA[<p>Recently, while working on a module, I needed a very lightweight, fixed-size LRU cache. There are definitely a lot of caching solutions around, but I decided to try a simple Map-based solution. Surprisingly, the solution using LinkedHashMap turned out to be the simplest.</p>
<p>Here is the snippet. I have removed all additional validation and other details for simplicity.</p>
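<pre class="brush: java">
import java.util.LinkedHashMap;
import java.util.Map;

public class LruCache {
    // cache data holder
    private final Map&lt;Object, Object&gt; cache;

    // Max size
    private final int maxCacheSize;

    public LruCache(int maxSize) {
        this.maxCacheSize = maxSize;
        // accessOrder = true keeps entries ordered from least to most recently accessed
        cache = new LinkedHashMap&lt;Object, Object&gt;(maxSize, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry&lt;Object, Object&gt; eldest) {
                // Evict the least recently used entry once the limit is exceeded
                return size() &gt; maxCacheSize;
            }
        };
    }

    public void put(Object key, Object value) {
        cache.put(key, value);
    }

    public Object get(Object key) {
        return cache.get(key);
    }
}
</pre>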
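<p>The implementation is fairly simple. The key is the third constructor parameter of LinkedHashMap: a value of "true" makes the map keep its entries in access order rather than insertion order, so the eldest entry is always the least recently used one. Overriding removeEldestEntry then evicts that entry whenever the size exceeds the maximum.</p>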
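<p>To see the eviction order in action, here is a small self-contained sketch. The wrapper class <em>LruCacheDemo</em> is made up for this example, and the nested <em>LruCache</em> is a condensed copy of the same idea so that the block runs on its own.</p>

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class LruCacheDemo {
    // Condensed copy of the LruCache idea, nested here so the demo is self-contained
    static class LruCache {
        private final int maxCacheSize;
        private final Map<Object, Object> cache;

        LruCache(int maxSize) {
            this.maxCacheSize = maxSize;
            // Third argument true = access order, so reads also refresh an entry
            this.cache = new LinkedHashMap<Object, Object>(maxSize, 0.75f, true) {
                @Override
                protected boolean removeEldestEntry(Map.Entry<Object, Object> eldest) {
                    return size() > maxCacheSize;
                }
            };
        }

        void put(Object key, Object value) {
            cache.put(key, value);
        }

        Object get(Object key) {
            return cache.get(key);
        }
    }

    public static void main(String[] args) {
        LruCache cache = new LruCache(2);
        cache.put("a", 1);
        cache.put("b", 2);
        cache.get("a");    // "a" becomes the most recently used entry
        cache.put("c", 3); // size would be 3, so the eldest entry "b" is evicted

        System.out.println(cache.get("a")); // 1
        System.out.println(cache.get("b")); // null
        System.out.println(cache.get("c")); // 3
    }
}
```

<p>With insertion order instead (third argument "false"), the same sequence would evict "a", since the read would not refresh it.</p>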
]]></content:encoded>
			<wfw:commentRss>http://www.ashishpaliwal.com/blog/2011/05/a-simple-lru-cache/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

