Mapping the NoSQL space

NoSQL is an unfortunate name – it says nothing about what a product does, only which query language it will not support. What’s worse, it makes people treat the various non-relational databases as interchangeable, while in fact many of them solve completely different problems and have different trade-offs, strengths, etc.

What these DBs have in common is that they give up ACID in an attempt to improve scalability; most of them are distributed, and most were built to handle semi-structured or unstructured data.

The theoretical case for these databases starts from the CAP theorem, which says you can’t have consistency, availability and partition tolerance all at once. Partition tolerance means the system keeps working correctly when a network failure splits the cluster into groups that can’t talk to each other – you don’t want network failures to cause split-brain, data corruption or incorrect results.

Since you can’t have all three, you choose two. Oracle RAC, for example, chooses partition tolerance and consistency at the expense of availability – if the voting disk crashes or loses network connectivity, the entire cluster goes down.

NoSQL databases keep availability and partition tolerance at the expense of consistency. They have something called “Soft-State” and “Eventual Consistency”. To the best of my understanding, “Eventual Consistency” means that all the DML statements in a transaction are inserted into a queue (or some equivalent), from which they are executed at different times by different servers. Eventually they are all executed and you reach a consistent state, but you don’t know when. Of course, with such a system it appears nearly impossible to prevent lost updates.
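To make that description concrete, here is a toy sketch in Python of the queue idea. It is only a model of my understanding above, not how any particular product works; `Replica` and `broadcast` are made-up names. It shows two servers disagreeing for a while, converging later, and a lost update slipping through.

```python
from collections import deque


class Replica:
    """One server with its own copy of the data and a queue of pending writes."""

    def __init__(self, name):
        self.name = name
        self.data = {}
        self.pending = deque()

    def enqueue(self, key, value):
        # The write is only queued; readers of this replica cannot see it yet.
        self.pending.append((key, value))

    def apply_one(self):
        # Apply the oldest pending write, if there is one.
        if self.pending:
            key, value = self.pending.popleft()
            self.data[key] = value

    def drain(self):
        # Apply every pending write.
        while self.pending:
            self.apply_one()


def broadcast(replicas, key, value):
    # Every replica receives the same writes in the same order.
    for replica in replicas:
        replica.enqueue(key, value)


a, b = Replica("a"), Replica("b")
broadcast([a, b], "balance", 100)
broadcast([a, b], "balance", 80)

a.drain()      # replica a has caught up
b.apply_one()  # replica b has applied only the first write
stale_read = (a.data["balance"], b.data["balance"])  # the replicas disagree

b.drain()      # eventually b catches up too
converged = a.data == b.data

# Lost update: two clients read the same balance and each write back "balance - 10".
seen_by_client_1 = a.data["balance"]
seen_by_client_2 = a.data["balance"]
broadcast([a, b], "balance", seen_by_client_1 - 10)
broadcast([a, b], "balance", seen_by_client_2 - 10)
a.drain()
b.drain()
final_balance = a.data["balance"]  # only one of the two withdrawals survived
```

Both replicas end up agreeing, which is all that "eventual" promises; nothing in the queue notices that the second withdrawal was computed from a value that was already out of date.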

This doesn’t seem like a good way to manage bank accounts, but when I reviewed the databases I manage, only very few of them really need full consistency. Many of them are not updated concurrently, have no updates at all (just inserts), or contain data such as project plans, where not being consistent at every single second would be OK.

Here’s a short list of the non-relational databases I’m somewhat familiar with and the problems they solve:

Map-Reduce – not a database at all, it’s an algorithm or a design methodology that allows for massive scalability.
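For anyone who hasn’t seen the pattern, here is a minimal single-process sketch of it in Python, counting words. The function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are my own; a real framework would run the map and reduce steps on many machines, which is where the scalability comes from.

```python
from collections import defaultdict


def map_phase(document):
    # Emit one (word, 1) pair for every word in the document.
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    # Group all emitted values by key.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    # Combine all the values for one key into a single result.
    return key, sum(values)


def word_count(documents):
    pairs = []
    for document in documents:
        # Each document is mapped independently, so this loop could be spread over many machines.
        pairs.extend(map_phase(document))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["NoSQL is not SQL", "SQL is a language"])
```

The important property is that no map call depends on another and no reduce call needs more than one key, so both phases can be split up freely.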

Hadoop – not a database. It’s a platform – a distributed file-system and a map-reduce job manager.

Hive – It’s a SQL-like language that allows for structured schema design and queries on top of Hadoop. It has some superficial similarities with an RDBMS, but only in the syntax – every query is translated to map-reduce code, execution is totally different, and you shouldn’t expect most RDBMS features.

HBase – Allows you to create tables with rows and columns (normally very large ones) and query them through several Java/HTTP interfaces. You query each table individually, no joins.
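As a rough illustration of what “no joins” means in practice, here is a sketch in Python that models two tables as plain dictionaries. This is not the HBase client API; the tables, column names and the `get` and `order_with_user_name` helpers are all invented for the example.

```python
# Each table maps a row key to a dictionary of column -> value.
users = {
    "u1": {"info:name": "Ann", "info:city": "Oslo"},
    "u2": {"info:name": "Bob"},  # rows need not have the same columns
}
orders = {
    "o1": {"order:user": "u1", "order:total": "30"},
    "o2": {"order:user": "u2", "order:total": "12"},
}


def get(table, row_key, columns=None):
    # Fetch one row from one table, optionally restricted to some columns.
    row = table.get(row_key, {})
    if columns is None:
        return dict(row)
    return {column: row[column] for column in columns if column in row}


def order_with_user_name(order_key):
    # No joins: two separate lookups, stitched together in application code.
    order = get(orders, order_key)
    user = get(users, order["order:user"], ["info:name"])
    return {**order, **user}


result = order_with_user_name("o1")
```

Anything a relational database would do with a join becomes the application’s job, which is one reason these tables tend to be designed around the queries you already know you will run.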

Cassandra – Does much the same as HBase. To the best of my understanding it is more configurable and flexible but is not as well documented.

Tokyo Cabinet / Tyrant – Stores key/value pairs. There are no tables and no data types. You can store data in hash tables, b-trees or fixed-size arrays. It is not distributed. Said to have amazing performance.
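The choice between a hash table and a b-tree matters mostly for how you can read the data back. Here is a small Python sketch of the difference; it does not use the Tokyo Cabinet bindings, and `HashStore` and `OrderedStore` are stand-ins I made up (a sorted key list plays the role of the b-tree).

```python
import bisect


class HashStore:
    """Exact-match lookups only, like a hash table."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)


class OrderedStore:
    """Keeps keys sorted, standing in for a b-tree, so range reads are possible."""

    def __init__(self):
        self._keys = []
        self._data = {}

    def put(self, key, value):
        if key not in self._data:
            bisect.insort(self._keys, key)  # keep the key list sorted
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)

    def range(self, start, end):
        # Return (key, value) pairs with start <= key < end, in key order.
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, end)
        return [(key, self._data[key]) for key in self._keys[lo:hi]]


btree = OrderedStore()
for key, value in [("b1", "x"), ("a2", "y"), ("a1", "z")]:
    btree.put(key, value)

in_range = btree.range("a", "b")
```

With the hash version you can only ask for a key you already know; the ordered version can also answer “everything between these two keys”, at the cost of keeping the keys sorted.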

Voldemort – Similar to Tokyo Cabinet, but distributed – although it appears that when adding nodes performance doesn’t scale well.

CouchDB – This is a document store, where each document contains multiple key-value pairs. It does include some of the traditional DB features, just in a different context. It has the concept of an index, and you create an index for each report you want to run. It supports multiple versions of each document, where a report is guaranteed to run on the same version from beginning to end. There is no schema – documents can contain different keys.
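Here is a toy Python model of those three ideas together: schema-less documents, multiple versions, and one index per report. It is a sketch of the concepts only, not CouchDB’s actual API; `DocumentStore`, `build_index` and `posts_by_author` are hypothetical names.

```python
class DocumentStore:
    def __init__(self):
        self._revisions = {}  # document id -> list of versions, oldest first

    def save(self, doc_id, doc):
        # Every save adds a new version instead of overwriting the old one.
        self._revisions.setdefault(doc_id, []).append(dict(doc))
        return len(self._revisions[doc_id])  # revision number, starting at 1

    def get(self, doc_id, rev=None):
        versions = self._revisions[doc_id]
        return dict(versions[-1] if rev is None else versions[rev - 1])

    def snapshot(self):
        # Freeze the latest version of every document for a report to run on.
        return {doc_id: dict(versions[-1]) for doc_id, versions in self._revisions.items()}


def build_index(snapshot, map_fn):
    # map_fn yields (key, value) pairs per document; the sorted rows are the index for one report.
    rows = []
    for doc_id, doc in snapshot.items():
        for key, value in map_fn(doc):
            rows.append((key, value, doc_id))
    return sorted(rows)


def posts_by_author(doc):
    # Documents have no schema, so the function checks for the keys it needs.
    if doc.get("type") == "post":
        yield doc["author"], doc["title"]


store = DocumentStore()
store.save("p1", {"type": "post", "author": "gwen", "title": "NoSQL"})
store.save("c1", {"type": "comment", "author": "moose"})  # different keys, no title

snap = store.snapshot()
store.save("p1", {"type": "post", "author": "gwen", "title": "Mapping NoSQL"})  # revision 2

index = build_index(snap, posts_by_author)  # still sees revision 1 of p1
```

The report built from `snap` keeps seeing the version it started with, while new readers of `store` get the latest revision.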

MongoDB – Similar to CouchDB, it is a document store. It is not distributed. It doesn’t have multiple document revisions – all updates are done on the same document. No indexes either, which allows for ad-hoc querying.

Hope this is useful :)

5 Comments on “Mapping the NoSQL space”

  1. Hey Gwen,

    Nice round up but you are a little off with MongoDB. It does have indexes within collections (similar to tables but without a defined schema) which make for very fast ad-hoc queries.

    It also does support several different distribution models ranging from the traditional master – slave replication, through master – master replication right up to a fully sharded setup with many nodes.

    It also does MapReduce if you feel the need :)

    Regards,
    a_musing_moose

  2. prodlife says:

    Thanks for the corrections! I’m a bit new to all this, so I guess it was easy to misunderstand what I was reading…

  3. […] Shapira has been at the compass and protractor, mapping the NoSQL space and returns from those wild parts […]

