Multidimensional Drill-downs using Cassandra

Multidimensional data is data organized around hierarchies, and these hierarchies can be implicit or explicit. One of the most useful ways of navigating multidimensional hierarchies is through drill-downs: moving from summary information to detailed data by focusing in on one dimension at a time. Such drill-downs are hard to serve from an RDBMS at scale, because each additional drill-down is essentially another WHERE clause, and these queries get progressively slower as the data grows.

Traditionally, the problem of consistently fast drill-downs was solved using special multidimensional structures such as OLAP cubes and MDX queries. However, with the advent of NoSQL technologies like Cassandra, we now have faster and cheaper ways of supporting these drill-downs.

Cassandra is an open-source, distributed database management system. Its main features are decentralization, replication support including multi-datacenter replication, elastic scalability, fault tolerance, tunable consistency, MapReduce support, and an SQL-like query language (CQL).

To get into the problem of drill-downs, let’s look at some sample data from a shopping cart:

timestamp, transaction_id, item_code, user_id, payment_method
20130304221518, abcd, 3, 5, creditcard
20130304221519, efgh, 4, 6, cashondelivery
20130305180402, ijkl, 4, 5, loyaltypoints

This data isn’t directly useful for analytics. Most analytics is about finding patterns and making marketing or other business decisions based on them. The useful information that can be derived from this raw data is as follows:

{month: march} {number of transactions: 3, unique_users: 2, unique_items_sold: 2}
{month: march, item: 4}{number of transactions: 2, unique_users: 2}
{month: march, day: 4} {number of transactions: 2, unique_users: 2, unique_items_sold: 2}
{item: 4, payment: cashondelivery} {number of transactions: 1, unique_users: 1, items_sold: 1}

Clearly, what we are trying to derive here are aggregate numbers for various combinations of dimensions. In an RDBMS, this data can be obtained using queries with COUNT, DISTINCT, and WHERE clauses. It is also evident that as this data grows large, the joins and scans such queries require will make them considerably long-running.
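
For illustration, each of the aggregates above roughly corresponds to a SQL query like the one below; the transactions table and column names here are assumptions for the sketch, not an actual schema, and every further drill-down adds another WHERE predicate:

-- Illustrative only: assumes a transactions(ts, transaction_id, item_code, user_id, payment_method) table
SELECT COUNT(*)                  AS number_of_transactions,
       COUNT(DISTINCT user_id)   AS unique_users,
       COUNT(DISTINCT item_code) AS unique_items_sold
FROM   transactions
WHERE  ts >= '2013-03-01' AND ts < '2013-04-01'  -- month: March
  AND  item_code = 4;                            -- drill down into item 4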

Thus, a better way to serve these kinds of aggregations in a reasonable amount of time is to pre-compute the aggregates and store them. This is where NoSQL technologies provide a distinct advantage, thanks to their lack of predefined schemas. This allows us to store all these dimension combinations in a single column family (Cassandra's equivalent of a table), which in turn lets us serve extremely fast drill-downs. To illustrate this, let's look at how the above data would be stored in Cassandra:

Column Family: Monthly{
    Rowkey: March {
         number_of_transactions: 3,
         unique_users: 2,
         unique_items_sold: 2
     },

     Rowkey: March, Item:4 {
         number_of_transactions: 2,
         unique_users: 2
     },

     Rowkey: March, Day:4 {
         number_of_transactions: 2,
         unique_users: 2,
         unique_items_sold: 2
     },

     Rowkey: March, Item:4, Day:4 {
         number_of_transactions: 1,
         unique_users: 1,
         items_sold: 1
     }
}
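
In modern CQL terms, a minimal sketch of the same roll-up model might look like the table below; the monthly_rollups name, the rollup_key encoding, and the column names are illustrative assumptions, not part of the original design:

-- Illustrative CQL sketch of the roll-up column family
CREATE TABLE monthly_rollups (
    rollup_key text PRIMARY KEY,   -- e.g. 'March', 'March:Item:4', 'March:Day:4'
    number_of_transactions bigint,
    unique_users bigint,
    unique_items_sold bigint
);

-- Pre-computed aggregates are written once by the aggregation job:
INSERT INTO monthly_rollups (rollup_key, number_of_transactions, unique_users, unique_items_sold)
VALUES ('March', 3, 2, 2);

INSERT INTO monthly_rollups (rollup_key, number_of_transactions, unique_users)
VALUES ('March:Item:4', 2, 2);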

With the above data model in place, fetching a drill-down on March for item 4 is as simple as forming the row key and doing a single fetch: fast and join-free.
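
Continuing the illustrative monthly_rollups sketch above, the whole drill-down reduces to a single key lookup in CQL:

SELECT number_of_transactions, unique_users
FROM monthly_rollups
WHERE rollup_key = 'March:Item:4';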

The logical question to ask here is: when we start creating these kinds of combinations for every value of every dimension in our data, what about data size and data explosion? This is a real concern, as the storage requirements for such pre-computed data are huge. Cassandra helps here by providing elastic scalability, with performance and storage capacity growing roughly linearly as nodes are added.

However, this problem also warrants some data-specific optimizations. One of these is to limit your combinations to only those values that make business sense: generate data by judiciously selecting which subsets you want to pre-compute to serve your end users, as sketched below.
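
One possible way to maintain such a whitelisted set of combinations at write time is with Cassandra counter columns, sketched here with assumed table and key names; note that counters suit additive metrics like transaction counts, while distinct counts such as unique users still need a separate pre-computation step:

-- Illustrative counter table for write-time roll-ups
CREATE TABLE rollup_counters (
    rollup_key text PRIMARY KEY,
    number_of_transactions counter
);

-- On each incoming transaction, increment only the combinations that make business sense:
UPDATE rollup_counters SET number_of_transactions = number_of_transactions + 1
WHERE rollup_key = 'March';

UPDATE rollup_counters SET number_of_transactions = number_of_transactions + 1
WHERE rollup_key = 'March:Item:4';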

Finally, this is only one of the problems where Cassandra is useful, and Cassandra is only one possible solution to it. Let us know if you have ever needed to solve this kind of problem, how you did it, or whether you are using Cassandra to solve any other interesting problems.

One thought on “Multidimensional Drill-downs using Cassandra”

  1. Nikhil Gupta January 31, 2014 / 6:17 pm

    I landed up on this page by accident and then again got caught by the word cassandra. The scenario which you have mentioned over here doesn't seem to justify the use of so called cassandra. I mean pre-computing as per specific combinations can be done in RDBMS as well which will make your so called Multidimensional Drill-downs faster as well.

    I am pretty much a novice in this area but I believe much of the relevance of cassandra lies in :-

    1) Parallel processing at multiple nodes (Clustering)
    2) Distribution of huge volume of data in the form of small chunks at these nodes
    3) Backup and replication of data at these specific nodes avoiding run time failures.
    4) Peer to peer clustering
    5) Faster migration of data to these nodes because it is transferred in unorganized format in the first step unlike migrating data to RDBMS where database integrity and other stuff results in slow data transfer

    Sincere apologies if I have goofed up somewhere as I have read on this hot crap only a little 😛

