It’s the End of the World As We Know It (NoSQL Edition)

This post originally appeared over at Pythian. There are also some very smart comments over there that you shouldn’t miss, go take a look!

Everyone knows that seminal papers need a simple title and descriptive title. “A Relational Model for Large Shared Data Banks” for example. I think Michael Stonebraker overshot the target In a 2007 paper titled, “The End of an Architectural Era”.

Why is this The End? According to Michael Stonebraker “current RDBMS code lines, while attempting to be ‘one size fits all’ solution, in face, excel at nothing. Hence, they are 25 years old legacy code lines that should be retired in favor of a collection of ‘from scratch’ specialized engined”.

He makes his point by stating that traditional RDBM design is already being replaced for a variety of specialized solutions: Data-warehouses, streams processing, text and scientific databases. The only uses left for RDBMS is OLTP and hybrid systems.

The provocatively named paper is simply a description of a system, designed from scratch for modern OLTP requirements and the demonstration that this system gives better performance than traditional RDBMS on OLTP type load. The conclusion is that since RDBMS can’t even excel at OLTP – it must be destined for the garbage pile. I’ll ignore the fact that hybrid systems are far from extinct and look at the paper itself.

The paper starts with a short review of the design considerations behind traditional RDBMS, before proceeding to list the design considerations behind the new OLTP system, HStore.:

  1. The OLTP database should fit entirely in-memory. There should be no disk writes at all. Based on TPC-C size requirements this should be possible, if not now then within few years.
  2. The OLTP database should be single threaded – no concurrency at all. This should be possible since OLTP transactions are all sub-millisecond. In an memory-only system they should be even faster. This will remove the need for complex algorithms and data structures and will improve performance even more. Ad-hoc queries will not be allowed.
  3. It should be possible to add capacity to an OLTP system without any downtime. This means incremental expansion – it should be possible to grow the system by adding nodes transparently.
  4. The system should be highly available, with a peer-to-peer configuration – the OLTP load should be distributed across multiple machines and inter-machine replication should be used for availability. According to the paper, in such a system redo and undo logging becomes unnecessary. This paper references another paper that argues that rebuilding a failed node over the network is as efficient as recovering from redo log. Obviously, eliminating redo logs eliminates one of the worse OLTP bottlenecks where data is written to disk synchronously.
  5. No DBAs. Modern systems should be completely self tuning.

In other sections Stonebraker describes few more properties of the system:

  1. With persistent redo logs gone, and locks/latches gone, the overhead of JDBC interface is likely to be the next bottleneck. Therefore the application code should be in form of stored procedures inside the data store. The only command ran externally should be “execute transaction X”.
  2. Given that the DB will be distributed and replicated, and network latencies still take milliseconds, the two-phase commit protocol should be avoided
  3. There will be no ad-hoc queries. The entire workload will be specified in advance.
  4. SQL is an old legacy language with serious problems that were exposed by Chris Date two decades ago. Modern OLTP systems should be programmable in a modern light-weight language such as Ruby. Currently the system is queried with C++.

The requirements seem mostly reasonable and very modern – use replication as a method of high availability and scalabilty, avoid disks and their inherent latencies, avoid the complications of concurrency, avoid ad-hoc queries, avoid SQL and avoid annoying DBAs. If Stonebraker can deliver on his promise, if he can do all of the above without sacraficing the throughput and durability of the system, this sounds like a database we’ll all enjoy.
In the rest of the paper, the authors describe some special properties of OLTP work loads, and then explains how HStore utilizes the special properties to implement a very efficient distributed OLTP system. In the last part of the paper, the authors use HStore to run a TPC-C like benchmark and compare the results with an RDBMS.

Here are in very broad strokes the idea:
The paper explains in some detail how things are done, while I only describe what is done:

The system is distributed, with each object partitioned over the nodes. You can have specify how many copies of each row will be distributed, and this will provide high availability (if one node goes down you will have all the data available on other nodes).

Each node is single threaded. Once SQL query arrives at a node, it will be performed to the end without interruptions. There are no physical files. The data objects are stored as Btrees in memory, Btree block is sized to match L2 cache line.

The system will have a simple cost-based optimizer. It can be simple because OLTP queries are simple. If multi-way joins happen they always involve identifying a single tuple and then tuples to join to that record in a small number of 1-to-n joins. Group by and aggregation don’t happen in OLTP systems.

The query plans can either run completely in one of the nodes, can be decomposed to a set of independent transactions that can run completely in one node each, or require results to be communicated between nodes.

The way to make all this efficient is by using a “database designer” – Since the entire workload is known in advance, the database designer’s job is to make sure that most queries in the workload can run completely on a single node. It does this by smartly partitioning the tables, placing parts that are used together frequently on the same node and copying tables (or just specific columns) that are read-only all over the place.

Since there are at least two copies of each row and each table, there must be a way to consistently update them. Queries that can complete on a single node, can just be sent to all relevant nodes and we can be confident that they will all complete them with identical results. The only complication is that each node must wait a few milliseconds before running the latest transaction to allow for recieving prior transactions from other nodes. The order in which transactions run is identified by timestamps and node ids. This allows for identical order of execution on all nodes and is responsible for consistent results.

In case of transactions that span multiple sites and involve changes that affect other transactions (i.e. The order in which they execute in relation to other transactions matter), one way to achieve consistency could be locking the data sources for the duration of the transaction. The HStore uses another method – each worker node recieves its portion of the transaction from a coordinator. If there are no conflicting transactions with lower timestamps, the transaction runs and the worker sends the coordinator an “ok”, otherwise the worker aborts and notifies the coordinator. The transaction failed and its up to the application to recover from this. Of course, some undo should be used to rollback the successfull nodes.

The coordinator monitors the number of aborts and if there are too many unsuccessfull transactions, it starts waiting longer between the time a transaction arrives at a node until the node attempts to run it. If there are still too many failures, a more advanced strategy of aborting is used. In short, this is a very optimistic database where failure is prefered to locking.

I’ll skip the part where a modified TPC-C proves that HStore is much faster than a traditional RDBMS tuned for 3 days by an expert. We all know that all benchmarks are institutionalized cheating.

What do I think of this database?

  1. It may be too optimistic in its definition of OLTP. I’m not sure we are all there with the pre-defined workload. Especially since adding queries can require a complete rebuild of the data-store.
  2. I’m wondering how he plans to get a consistent image of the data stored there to another system to allow querying. ETL hooks are clearly required, but it is unclear how they can be implemented.
  3. Likewise, there is no clear solution on how to migrate existing datasets into this system.
  4. HStore seems to depend quite heavily on the assumption that networks never fail or slow down. Not a good assumption from my experience.
  5. If Stonebraker is right and most datasets can be partitioned in a way that allows SQL and DML to almost always run on a single node, this can be used to optimize OLTP systems on RAC.
  6. I like the idea of memory only systems. I like the idea of replication providing recoverability and allowing us to throw away the redo logs. I’m not sure we are there yet, but I want to be there.
  7. I also like the idea of a system allowing only stored procedures to run.
  8. I’m rather skeptical about systems without DBAs, I’ve yet to see any large system work without someone responsible for it to keep working.
  9. I’m even more skeptical about systems without redo logs and how they manage to still be atomic, durable and just plain reliable. Unfortunately this paper doesn’t explain how redo-less systems can be recovered. It references another paper as proof that it can be done.
  10. Stonebraker deserves credit for anticipating the NoSQL boom 3 years in advance. Especially the replication and memory-only components.

I hope that in the next few month I’ll add few more posts reviewing futuristic systems. I enjoy keeping in touch with industry trends and cutting-edge ideas.


On the Difficulty of Data Migrations (Especially to NoSQL Databases)

(Originally posted at Pythian Blog)

I’ve been reading a lot of NoSQL blogs recently, and one thing that bothers me is that many of the leading NoSQL bloggers seem to have very different experience in operations that I’ve had.

Here’s an example:
Over at the O’Reilly community blogs, Andy Oram interviewed two MongoDB experts about migrating from a relational databases to MongoDB.

Here’s what the experts said:

” 1. Get to know MongoDB. Download it, read the tutorials, try some toy projects.
2. Think about how to represent your model in its document store.
3. Migrate the data from the database to MongoDB, probably simply by writing a bunch of SELECT * FROM statements against the database and then loading the data into your MongoDB model using the language of your choice.
4. Rewrite your application code to query MongoDB through statements such as insert() or find().

OK, so which step do you think takes the longest? And the answer is…step 2. Design is critical, and there are trade-offs that provide no simple answers but require a careful understanding of your application. Migrating the data and rewriting the application are straightforward by comparison. “

I’ve never migrated anything to MongoDB, but I was involved in the migration of a large application from SQLServer to Oracle. Both are relational databases so there was almost no need to rethink the data model. The rewrite and the migration took over two years, with significant bugs discovered and fixed up to the last week. The majority of the time spent on migration. None of it was done by “simply by writing a bunch of SELECT * FROM statements against the database”.

We did not lack expertise – we had plenty SQLServer and Oracle developers and DBAs with 10+ years of experience. Note that no one has 10 years of MongoDB experience.

I don’t doubt that modeling is critical and the trade-offs are always difficult, but I’ve yet to see a modeling phase that took more than rewrite + migration of large applications with big data. Note that large applications and big data are the target customers of NoSQL databases, so I’m not inventing irrelevant issues here.

I’ve experienced two major difficulties with migrations:
The first one is that you normally have large number of users, and you may be reluctant to migrate everyone to a new system at once. No matter how good your load testing skills are, you will still not be 100% certain your new system will have perfect performance under peak load. So you do phased migration. Start by moving 5% of the users, then another 15%, then another 30%, and then if everything goes well, you may migrate the rest.

Why is this a difficulty? First, the users may share data with users that have not yet migrated. There could be dependencies. You’ll need to figure these out and write temporary code to solve those that will be used only during the migration phase. But before that, you need to find a way to migrate specific parts of your data. This requires figuring out how to tear things apart carefully within and across tables. A mini modeling project in its own right. This complicates the “bunch of SELECT * FROM statements” quite a bit.

Oh, and the migration may fail. Spectacularly. At 3am. You now need to migrate all the users back. With the new data they inserted into the new DB. I hope you prepared a script in advance to do that.

And that is just the first difficulty. The second major problem is that you may have large amounts of data arriving at high rates. You could declare 3 days downtime to move all the data, but I can see some reasons not to do that.

The alternative is to move the data in increments. First select and copy all the data inserted until today at 8am. Once this is done, select and copy all the data inserted between 8am and now. Then all the data between the previous now and the now-now. All in ever shrinking deltas of data that will eventually converge to a point where you can switch the users over. This requires that all large tables will have timestamps, preferably indexed, hopefully partitioned. Even with timestamps it is not a trivial application to write, and it has to take care of dependencies – you can’t migrate comments on a document without migrating the document itself.

During the incremental migration and the data streaming phase, you have to support two systems with the same one operational group. The same operational group that now have to learn to support a new database and a lot of new code rewritten for it. Not impossible, but far from “straightforward”.

I always thought that the biggest misconception developers have about operations is the “just add a bunch of servers to solve the performance issue” myth. I can add “migration to a new system is straighforward” as another dangerous myth.

I’m not blaming them, they are architects and developers. Solving difficult operational problems is not their job. The “migration is straightforward” attitude is a problem only when you ask your developers to support your operations. Something that seems depressingly common when NoSQL databases arrive to operations. Operations have no NoSQL experience and management asks the developers to help out until the ops teams learn to support the new beast. Problem is that NoSQL developers without operations experience are likely to cause just as much damage as operations without NoSQL experience.

Mapping the NoSQL space

NoSQL is an unfortunate name – it doesn’t give any description of what the product does except what query language it will not support. What’s worse, it makes people treat the various non-relational databases as interchangable, while in fact many of them solve completely different problems and have different trade-offs, strengths, etc.

What is common to all these DBs is that they don’t do ACID in an attempt to improve scalability, most of them are distributed and most of them were built to handle semi-structured or unstructured data.

The theoretical case for these databases starts from the CAP theorem which says you can’t have consistency, availability and partition tolerance all at once. Partition tolerance is the prevention of split-brain in a cluster or distributed system – you don’t want network failures to allow data corruptions or incorrect results.

Since you can’t have all three, you choose two. So RAC does partition tolerance and consistency at the expense of availability – if the voting disk crashes or loses network connectivity, the entire cluster will go down.

NoSQL databases keep availability and partition tolerance at the expense of consistency. They have something called “Soft-State” and “Eventual Consistency”. To the best of my understanding, “Eventual Consistency” means that all the DML statements in the transaction are inserted into a queue (or some equivalent), from which they are executed at different times by different servers. Eventually they are all executed and you reach a consistent state, but you don’t know when. Of course with such system, it appears nearly impossible to prevent lost updates.

This doesn’t seem like a good way to manage bank accounts, but when I reviewed the databases I manage, only very few of them really need full consistency. Many of them are not updated concurrently, or where there are no updates (just inserts) or contain data such as project plans where not being consistent at every single second would be OK.

Here’s a short list of the the non-relational databases I’m somewhat familiar with and the problems they solve:

Map-Reduce – not a database at all, its an algorithm or a design methodology that allows for massive scalability.

Hadoop – not a database. Its a platform – a distributed file-system and a map-reduce job manager.

Hive – Its a SQL like language allows for structured schema design and queries on top of Hadoop. It has some superficial similarities with RDBMS, but it is just the syntax – every query is translated to map-reduce code, execution is totally different and don’t expect most  RDBMS features.

HBase – Allows you to create tables with rows and columns (normally very large ones) and query them through several Java/HTTP interfaces. You query each table individually, no joins.

Cassandra – Does exactly the same as HBASE. To the best of my understanding it is more configurable and flexible but is not as well documented.

Tokyo Cabinet /Tyrant – Stores key/value pairs. There are no tables and no data types. You can store data in hash tables, b-trees or fixed-size arrays. It is not distributed. Said to have amazing performance.

Voldemort – Similar to Tokyo Cabinet, but distributed – although it appears that when adding nodes performance doesn’t scale well.

CouchDB – This is a document store, where each document contains multiple key-value pairs. It does include some of the traditional DB features, just in a different context. It has the concept of index, and you create an index for each report you want to run. It supports multiple-versions of each document, where a report is guaranteed to run on the same version from beginning to end. There is no schema – documents can contain different keys.

MongoDB – Similar to CouchDB, it is a document store. It is not distributed. It doesn’t have multiple document revisions – all updates are done on the same document. No indexes either, which allows for ad-hoc querying.

Hope this is useful 🙂

Notes about Hadoop

My notes from two presentations given to the data mining SIG of the local ACM chapter.

Hadoop is a scalable fault-tolerant grid operating system for data storage and processing.

It is not a database. It is more similar to an operating system: Hadoop has a file system (HDFS) and a job scheduler (Map-Reduce). Both are distributed. You can load any kind of data into Hadoop.

It is quite popular – the last Hadoop Summit had 750 attendees. Not bad for a new open-source technology. It is also quite efficient for some tasks. Hadoop cluster of 1460 nodes can sort a Terabyte of data in 62 seconds – currently the world record for sorting a terabyte.

Hadoop Design Axioms:

  • System will manage and heal itself. (Because using commodity hardware – failure is inevitable).
  • Performance will scale linearly. (With few limitations).
  • Compute should move to data (Processing job should run on the machine holding the data to process)
  • Simple core. Modular and extensible

Distributed file system. Block size is 64M (!). User configures replication factor – each block is replicated on K machines (K chosen by user). More replication can be configured for hot blocks.
A name node keeps track of the blocks and if a node fails the data on it will be replicated to other nodes.

Distributes jobs. It tried to run jobs local to their data to avoid network overhead. It also detects failures and even servers running behind on the processing. If a part of the job is lagging in processing, it will start copies of this part of the job on other servers with the hope that one of the copies will finish faster.

Hadoop Ecosystem:

  • HBase: Google’s big table implementation. Key-value based. Good for quick lookups, but not batch processing. Transactional.
  • Pig, Hive, Scoop: Different languages. Map-Reduce is like assembly – High performance, low-level, contains too much details for most tasks. Hive is SQL language for Hadoop.

Hadoop vs. RDBMS?
RDBMS – expensive, structured, fast, interactive, has standards, transactional.
Hadoop – affordable, unstructured, scalable, resilient. Solves both storage and processing.

Hive and Hadoop at Facebook
Facebook got 200GB of data each day as of March 2008. Thats a lot of data to manage. Facebook philosophy is that more insights can be achieved from running simpler algorithms on more data.

Why Hadoop? Cost of storage. Limitations of data-analysis systems. Many systems have limited scalability. And they were closed and propitiatory.

Why not map-reduce? Not many developers have experience with it. Needed well known schemas and structure.

Hive was built on top of Hadoop to solve these problems. It saves metadata and adds SQL. Also allows integrating with other systems. Hive has tables, which have partitions which hold buckets. Buckets are used for sampling. Hive is very extensible. You can have user defined functions, types, objects, etc.

Hive does optimizations – join order, different processing for skewed data. The optimizer is rule based and uses hints. It also does some kind of dynamic sampling. You can look at the explain plans for the jobs and use that for tuning. Hive uses columnar compression.

Hive support integrations with JDBC, ODBC and Thrift.

It lacks resource management and needs monitoring to catch and kill “bad” jobs.

Concurrency wise, the idea is that you insert data, “publish” it and from the moment it is published everyone else can see it – but it cannot be modified or deleted. This means no read/write contention.