Mapping the NoSQL spacePosted: February 19, 2010
NoSQL is an unfortunate name – it doesn’t give any description of what the product does except what query language it will not support. What’s worse, it makes people treat the various non-relational databases as interchangable, while in fact many of them solve completely different problems and have different trade-offs, strengths, etc.
What is common to all these DBs is that they don’t do ACID in an attempt to improve scalability, most of them are distributed and most of them were built to handle semi-structured or unstructured data.
The theoretical case for these databases starts from the CAP theorem which says you can’t have consistency, availability and partition tolerance all at once. Partition tolerance is the prevention of split-brain in a cluster or distributed system – you don’t want network failures to allow data corruptions or incorrect results.
Since you can’t have all three, you choose two. So RAC does partition tolerance and consistency at the expense of availability – if the voting disk crashes or loses network connectivity, the entire cluster will go down.
NoSQL databases keep availability and partition tolerance at the expense of consistency. They have something called “Soft-State” and “Eventual Consistency”. To the best of my understanding, “Eventual Consistency” means that all the DML statements in the transaction are inserted into a queue (or some equivalent), from which they are executed at different times by different servers. Eventually they are all executed and you reach a consistent state, but you don’t know when. Of course with such system, it appears nearly impossible to prevent lost updates.
This doesn’t seem like a good way to manage bank accounts, but when I reviewed the databases I manage, only very few of them really need full consistency. Many of them are not updated concurrently, or where there are no updates (just inserts) or contain data such as project plans where not being consistent at every single second would be OK.
Here’s a short list of the the non-relational databases I’m somewhat familiar with and the problems they solve:
Map-Reduce – not a database at all, its an algorithm or a design methodology that allows for massive scalability.
Hadoop – not a database. Its a platform – a distributed file-system and a map-reduce job manager.
Hive – Its a SQL like language allows for structured schema design and queries on top of Hadoop. It has some superficial similarities with RDBMS, but it is just the syntax – every query is translated to map-reduce code, execution is totally different and don’t expect most RDBMS features.
HBase – Allows you to create tables with rows and columns (normally very large ones) and query them through several Java/HTTP interfaces. You query each table individually, no joins.
Cassandra – Does exactly the same as HBASE. To the best of my understanding it is more configurable and flexible but is not as well documented.
Tokyo Cabinet /Tyrant – Stores key/value pairs. There are no tables and no data types. You can store data in hash tables, b-trees or fixed-size arrays. It is not distributed. Said to have amazing performance.
Voldemort – Similar to Tokyo Cabinet, but distributed – although it appears that when adding nodes performance doesn’t scale well.
CouchDB – This is a document store, where each document contains multiple key-value pairs. It does include some of the traditional DB features, just in a different context. It has the concept of index, and you create an index for each report you want to run. It supports multiple-versions of each document, where a report is guaranteed to run on the same version from beginning to end. There is no schema – documents can contain different keys.
MongoDB – Similar to CouchDB, it is a document store. It is not distributed. It doesn’t have multiple document revisions – all updates are done on the same document. No indexes either, which allows for ad-hoc querying.
Hope this is useful :)