Getting Started with Cassandra
After spending some time with Cassandra, I thought I would write a short post. It is an attempt to help you get started with Cassandra through quick examples. This post does not try to explain the data model.
Prerequisites
1. Cassandra 0.7
2. Cassandra GUI
3. Hector (Cassandra client). You can use any other client with slight modifications.
You will need the following jars on your classpath (available with the downloads above):
1. apache-cassandra-0.7.0.jar
2. hector-core-0.7.0-22.jar
3. slf4j-log4j12-1.6.1.jar
4. slf4j-api-1.6.1.jar
5. libthrift-0.5.jar
6. log4j-1.2.16.jar
7. perf4j-0.9.12.jar
8. high-scale-lib-1.0.jar
NOTE: I tweaked the Cassandra GUI to work with Cassandra 0.7.
The reason for using the GUI is to see how things are stored within Cassandra. This visualization helps in understanding things faster.
We shall explore Cassandra with 3 independent examples.
Download the example code from https://code.google.com/p/cassandra-examples
Let's first do some housekeeping: starting Cassandra and applying the schema.
Starting Cassandra
1. Download and extract Cassandra 0.7
2. Go to the Cassandra-install-dir/bin directory
3. Execute the following command
> cassandra -f
This starts Cassandra in the foreground.
NOTE: We are running a single Cassandra node for the sake of simplicity.
Applying the Schema
The schema used for the examples is part of the example code.
Steps to apply the schema:
1. Place the schema file (cassandra.yaml) on the classpath of the Cassandra server, for example in the conf directory
2. Launch jconsole (from JAVA_HOME\bin)
3. Connect to Cassandra
4. Click MBeans tab and navigate to org.apache.cassandra.db.StorageService MBean
5. Expand Operations and click on loadSchemaFromYaml
6. Click the loadSchemaFromYaml button in the right pane
You should receive a success message, which means the schema was loaded.
Let's see how the schema looks in cassandra-gui.
So far so good. Let's move on to our first sample.
Example 1: Tweets
When I started with Cassandra, I found plenty of Tweet examples to learn from. So here is my own simple Tweet version. For those who are looking for a complete Tweet application, twissjava is the way to go.
The idea of this sample is to store every tweet that is received, giving each one a unique id.
The POJO has just three fields
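import java.io.Serializable;
import java.util.UUID;

public class Tweet implements Serializable {
    private final UUID key;     // unique id of the tweet, used as the row key
    private final String uname; // user who posted the tweet
    private final String body;  // tweet text

    public Tweet(UUID key, String uname, String body) {
        this.key = key;
        this.uname = uname;
        this.body = body;
    }
    // Getters omitted for clarity
}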
Now we need to get hold of the Cluster. This usage is specific to Hector.
final static Cluster cluster = HFactory.createCluster("LogsCluster",
        new CassandraHostConfigurator("localhost:9160"));
final static Keyspace keyspace = HFactory.createKeyspace("LogData", cluster);
These two statements get a reference to the Cassandra cluster and the keyspace we are using. The names are the ones defined in the cassandra.yaml that we loaded earlier.
Now let's see how we save the tweets.
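public void saveTweet(Tweet tweet) {
    // ss is the shared StringSerializer (StringSerializer.get());
    // TWEETS is the column family name constant from the example code.
    Mutator<String> m1 = HFactory.createMutator(keyspace, ss);
    // Queue both columns under the same row key (the tweet's UUID).
    m1.addInsertion(tweet.getKey().toString(),
                    TWEETS,
                    HFactory.createStringColumn("uname", tweet.getUname()))
      .addInsertion(tweet.getKey().toString(),
                    TWEETS,
                    HFactory.createStringColumn("body", tweet.getBody()));
    // Send the queued insertions to Cassandra.
    m1.execute();
}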
Here we create a Mutator for the given keyspace and insert the tweet details: the username and the tweet body, each as its own column. The row key is the tweet's UUID, which uniquely identifies the tweet.
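To make the resulting layout concrete, here is a minimal plain-Java sketch that models a column family as nested sorted maps. It does not talk to Cassandra or use Hector. The class TweetLayoutSketch and its methods are hypothetical names used only for illustration, and the model shows the shape of the data, not how Cassandra stores it internally.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.UUID;

// Hypothetical in-memory model: row key -> (column name -> column value).
// Illustration only; this is not part of Hector or Cassandra.
class TweetLayoutSketch {
    private final Map<String, Map<String, String>> tweets = new TreeMap<>();

    // Mirrors the two addInsertion calls: one row per tweet, two columns per row.
    void saveTweet(UUID key, String uname, String body) {
        Map<String, String> columns =
                tweets.computeIfAbsent(key.toString(), k -> new TreeMap<>());
        columns.put("uname", uname);
        columns.put("body", body);
    }

    // Returns the columns stored under the given row key, or null if absent.
    Map<String, String> columnsFor(UUID key) {
        return tweets.get(key.toString());
    }

    int rowCount() {
        return tweets.size();
    }
}
```

Saving one tweet produces exactly one row whose columns are "body" and "uname", which is the same shape the GUI shows for each UUID.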
Now let's see the main function.
public static void main(String[] args) {
    // Number of Tweets to be stored
    int count = 500;
    TweetSample sample = new TweetSample();
    System.out.println("Saving Tweets ....");
    for (int i = 0; i < count; i++) {
        // Each tweet gets a fresh random UUID as its key
        Tweet tweet = new Tweet(UUID.randomUUID(),
                                "paliwalashish",
                                "This is tweet# " + i);
        sample.saveTweet(tweet);
        System.out.println("Saving Tweet # : " + i);
    }
    System.out.println("Tweet Saved....");
}
This is simple: we create Tweet objects and insert them in a loop. So how does our Cassandra data look after we run this program?
As we can see, for each key (UUID) we have stored two columns: the username and the tweet body.
Example 2: Saving User Action Log
This example is very similar to our first example. Here we are storing userid, action and the URL in the DB.
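The post does not show the code for this example, so the following is only a sketch of what the POJO might look like and which columns it would produce. The class name UserAction, its field names, and the column names "userid", "action" and "url" are assumptions made for illustration; the actual example code may differ.

```java
import java.io.Serializable;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical POJO for the user action log; names are assumptions.
class UserAction implements Serializable {
    private final String userId;
    private final String action;
    private final String url;

    UserAction(String userId, String action, String url) {
        this.userId = userId;
        this.action = action;
        this.url = url;
    }

    // Column name -> value pairs that would be written under one row key,
    // in the same way saveTweet writes "uname" and "body".
    Map<String, String> toColumns() {
        Map<String, String> columns = new LinkedHashMap<>();
        columns.put("userid", userId);
        columns.put("action", action);
        columns.put("url", url);
        return columns;
    }
}
```

Each entry of toColumns() would map to one string column insertion on the Mutator, just as in the Tweet example.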
Example 3: Saving Logs per Hour
In this example we want to save logs per hour so that we can analyze them easily. I chose to use a SuperColumn for this, with the day as the row key and the hour as the SuperColumn name. There are other approaches that give the same functionality. The idea is to have the following structure for the logs:
For each day, we will store logs per hour
The Log POJO just has a string message to be saved. A real-world scenario can be more sophisticated.
Let's see how we add the data to the SuperColumn.
public void saveLogs(String tag, String hrTag, Log logMessage) {
    Mutator<String> mutator = HFactory.createMutator(keyspace, ss);
    mutator.insert(tag, LOGS, HFactory.createSuperColumn(hrTag,
            Arrays.asList(HFactory.createStringColumn(UUID.randomUUID().toString(),
                                                      logMessage.getLogMessage())),
            ss, ss, ss));
    mutator.execute();
}
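The calls are essentially similar, but we add more levels. Here tag is the day (YYYYMMDD) and acts as the row key, LOGS is the name of the super column family, and hrTag is the name of the SuperColumn for that hour. Inside the SuperColumn, we add the log message as a column whose name is a unique id.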
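The nesting of day, hour and log entry can be sketched in plain Java as follows. This is an in-memory illustration only, with no Cassandra or Hector involved. The class HourlyLogSketch is hypothetical, and the two-digit "HH" hour tag is an assumption, since the post only specifies the YYYYMMDD format for the day.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Map;
import java.util.TreeMap;
import java.util.UUID;

// Hypothetical model: day (row key) -> hour (super column) -> id (column) -> message.
class HourlyLogSketch {
    private static final DateTimeFormatter DAY = DateTimeFormatter.ofPattern("yyyyMMdd");
    // Assumption: the hour tag is the zero-padded hour of day, "00" to "23".
    private static final DateTimeFormatter HOUR = DateTimeFormatter.ofPattern("HH");

    private final Map<String, Map<String, Map<String, String>>> logs = new TreeMap<>();

    static String dayTag(LocalDateTime time) {
        return time.format(DAY);
    }

    static String hourTag(LocalDateTime time) {
        return time.format(HOUR);
    }

    // Stores the message under its day and hour, keyed by a random unique id.
    void saveLog(LocalDateTime when, String message) {
        logs.computeIfAbsent(dayTag(when), d -> new TreeMap<>())
            .computeIfAbsent(hourTag(when), h -> new TreeMap<>())
            .put(UUID.randomUUID().toString(), message);
    }

    // Number of log entries stored for the given day and hour tags.
    int countFor(String day, String hour) {
        Map<String, Map<String, String>> hours = logs.get(day);
        if (hours == null) {
            return 0;
        }
        Map<String, String> entries = hours.get(hour);
        return entries == null ? 0 : entries.size();
    }
}
```

With this layout, reading all logs for one hour of one day is a lookup by two keys, which is what makes the per-hour analysis convenient.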
Here is how it looks when stored in Cassandra.
The Cassandra data model is slightly tricky to understand in the beginning. There are really wonderful posts out there explaining it. Take some time to read about the data model, tweak the examples, and have fun.
Happy Scaling







Great!
Thanks for this article.
Together with the source code I now have a very good start!
NOTE: The reference links have changed:
Cassandra Swing GUI: http://code.google.com/p/cassandra-gui/downloads/list
Full example code: http://cassandra-examples.googlecode.com/svn/trunk/
What is ss?
It’s StringSerializer
final static StringSerializer ss = StringSerializer.get();
My apology for the incomplete snippet. You can however see the complete code from the google-code link in the post (https://code.google.com/p/cassandra-examples)
Please note, this post is very old and Cassandra has changed a lot since then.
Hi,
That may not be a great data schema for the logs. The data is partitioned between your Cassandra nodes based on the row key. Since your row key is the same for a whole day, this would mean all logs for each day get sent to only one of your nodes, which may or may not be what you want. See http://www.datastax.com/docs/1.0/cluster_architecture/partitioning
It may be better to use a UUID for the key so that the logs are distributed between servers.
Jeff
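As a rough illustration of the point above, the sketch below picks a node by hashing the row key. It is a deliberately simplified stand-in: Cassandra's real partitioners assign token ranges on a ring rather than taking a hash modulo the node count, and the class PartitionSketch is a hypothetical name. It only shows that identical row keys always land on the same node, while distinct keys such as UUIDs spread out.

```java
import java.util.HashSet;
import java.util.Set;

// Simplified stand-in for a partitioner; not Cassandra's actual algorithm.
class PartitionSketch {
    // Maps a row key to a node index in the range [0, nodeCount).
    static int nodeFor(String rowKey, int nodeCount) {
        return Math.floorMod(rowKey.hashCode(), nodeCount);
    }

    // Returns the distinct nodes that would receive writes for the given row keys.
    static Set<Integer> nodesUsed(Iterable<String> rowKeys, int nodeCount) {
        Set<Integer> nodes = new HashSet<>();
        for (String key : rowKeys) {
            nodes.add(nodeFor(key, nodeCount));
        }
        return nodes;
    }
}
```

Feeding it a thousand writes that all use the day key "20110115" yields a single node, whereas a thousand random UUID keys will in practice touch several nodes.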
Thanks Jeff! Your point is well taken. This was just a sample to get started, so I didn’t go deeper into partitioning. And a UUID is way better for even partitions.
cheers !
ashish