It’s the End of the World As We Know It (NoSQL Edition)

This post originally appeared over at Pythian. There are also some very smart comments over there that you shouldn’t miss, go take a look!

Everyone knows that seminal papers need a simple, descriptive title. “A Relational Model of Data for Large Shared Data Banks”, for example. I think Michael Stonebraker overshot the target in his 2007 paper, “The End of an Architectural Era”.

Why is this The End? According to Michael Stonebraker, “current RDBMS code lines, while attempting to be a ‘one size fits all’ solution, in fact excel at nothing. Hence, they are 25-year-old legacy code lines that should be retired in favor of a collection of ‘from scratch’ specialized engines”.

He makes his point by noting that traditional RDBMS design is already being replaced by a variety of specialized solutions: data warehouses, stream processing, text and scientific databases. The only uses left for the RDBMS are OLTP and hybrid systems.

The provocatively named paper is simply a description of a system designed from scratch for modern OLTP requirements, and a demonstration that this system outperforms a traditional RDBMS on an OLTP-type load. The conclusion is that since the RDBMS can’t even excel at OLTP, it must be destined for the garbage pile. I’ll ignore the fact that hybrid systems are far from extinct and look at the paper itself.

The paper starts with a short review of the design considerations behind traditional RDBMS, before proceeding to list the design considerations behind the new OLTP system, HStore:

  1. The OLTP database should fit entirely in memory. There should be no disk writes at all. Based on TPC-C size requirements this should be possible, if not now then within a few years.
  2. The OLTP database should be single threaded – no concurrency at all. This should be possible since OLTP transactions are all sub-millisecond. In a memory-only system they should be even faster. This will remove the need for complex algorithms and data structures and will improve performance even more. Ad-hoc queries will not be allowed.
  3. It should be possible to add capacity to an OLTP system without any downtime. This means incremental expansion – it should be possible to grow the system by adding nodes transparently.
  4. The system should be highly available, with a peer-to-peer configuration – the OLTP load should be distributed across multiple machines, and inter-machine replication should be used for availability. According to the paper, in such a system redo and undo logging become unnecessary. The paper references another paper that argues that rebuilding a failed node over the network is as efficient as recovering from a redo log. Obviously, eliminating redo logs eliminates one of the worst OLTP bottlenecks, where data is written to disk synchronously.
  5. No DBAs. Modern systems should be completely self-tuning.

In other sections Stonebraker describes a few more properties of the system:

  1. With persistent redo logs gone, and locks/latches gone, the overhead of the JDBC interface is likely to be the next bottleneck. Therefore the application code should be in the form of stored procedures inside the data store. The only command run externally should be “execute transaction X”.
  2. Given that the DB will be distributed and replicated, and network latencies still take milliseconds, the two-phase commit protocol should be avoided.
  3. There will be no ad-hoc queries. The entire workload will be specified in advance.
  4. SQL is an old legacy language with serious problems that were exposed by Chris Date two decades ago. Modern OLTP systems should be programmable in a modern lightweight language such as Ruby. Currently the system is queried with C++.
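
To make the “execute transaction X” idea concrete, here is a minimal sketch of a stored-procedure-only interface: the client never sends SQL, it just names a pre-registered procedure and passes parameters. All class and procedure names here are my own illustration, not HStore’s actual API.

```python
class ProcedureStore:
    """Toy data store whose only external command is 'execute transaction X'."""

    def __init__(self):
        self._procs = {}

    def register(self, name, func):
        # Stored procedures are installed ahead of time, with the workload.
        self._procs[name] = func

    def execute(self, name, *args):
        # The single externally visible command: run a named transaction.
        if name not in self._procs:
            raise KeyError(f"unknown transaction: {name}")
        return self._procs[name](*args)


store = ProcedureStore()
store.register(
    "new_order",
    lambda customer, item, qty: {"customer": customer, "item": item, "qty": qty},
)

# The client ships only the procedure name and parameters, never SQL text.
result = store.execute("new_order", 42, "widget", 3)
```

The point is that parsing, planning, and the driver round-trips all disappear from the critical path; only the dispatch by name remains.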

The requirements seem mostly reasonable and very modern – use replication as a method of high availability and scalability, avoid disks and their inherent latencies, avoid the complications of concurrency, avoid ad-hoc queries, avoid SQL and avoid annoying DBAs. If Stonebraker can deliver on his promise, if he can do all of the above without sacrificing the throughput and durability of the system, this sounds like a database we’ll all enjoy.
In the rest of the paper, the authors describe some special properties of OLTP workloads, and then explain how HStore exploits these properties to implement a very efficient distributed OLTP system. In the last part of the paper, the authors use HStore to run a TPC-C-like benchmark and compare the results with an RDBMS.

Here, in very broad strokes, is the idea. The paper explains in some detail how things are done, while I only describe what is done:

The system is distributed, with each object partitioned over the nodes. You can specify how many copies of each row will be distributed, and this provides high availability (if one node goes down, all the data is still available on other nodes).

Each node is single threaded. Once a SQL query arrives at a node, it runs to completion without interruption. There are no physical files. The data objects are stored as B-trees in memory, with the B-tree block sized to match the L2 cache line.
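
The single-threaded model is worth pausing on: because each node drains its transaction queue one transaction at a time, no locks or latches are ever needed on the in-memory structures. A minimal sketch, with names of my own invention:

```python
from collections import deque


def run_node(transactions, table):
    """Run each transaction to completion before starting the next.

    Because execution is strictly serial, no other transaction can observe
    `table` mid-update, so no locking is required.
    """
    queue = deque(transactions)
    while queue:
        txn = queue.popleft()
        txn(table)  # uninterrupted, run-to-completion
    return table


# Two transactions mutating a shared in-memory "table" (a dict), serially.
deposits = [
    lambda t: t.__setitem__("alice", t.get("alice", 0) + 100),
    lambda t: t.__setitem__("alice", t.get("alice", 0) + 50),
]
balance = run_node(deposits, {})
```

The bet is that sub-millisecond transactions make the lost parallelism cheaper than the latching it would require.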

The system will have a simple cost-based optimizer. It can be simple because OLTP queries are simple. When multi-way joins happen, they always involve identifying a single tuple and then joining tuples to that record in a small number of 1-to-n joins. Group-by and aggregation don’t happen in OLTP systems.

A query plan can either run completely on one node, be decomposed into a set of independent transactions that each run completely on one node, or require results to be communicated between nodes.

The way to make all this efficient is by using a “database designer” – since the entire workload is known in advance, the database designer’s job is to make sure that most queries in the workload can run completely on a single node. It does this by smartly partitioning the tables, placing parts that are frequently used together on the same node, and copying read-only tables (or just specific columns) all over the place.
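
A toy version of that designer decision might look like the following: given the fully known workload, pick the partitioning column that lets the largest share of queries run on a single node. This is purely illustrative; the real designer considers much more (replication, co-location, column copies).

```python
def choose_partition_key(columns, workload):
    """Pick the candidate column that makes the most queries single-node.

    `workload` is a list of sets, each holding the columns a query filters on.
    A query is single-node if it filters on the partitioning column, since
    the routing layer can then send it to exactly one partition.
    """
    def single_node_fraction(col):
        hits = sum(1 for query in workload if col in query)
        return hits / len(workload)

    return max(columns, key=single_node_fraction)


# Hypothetical workload: most queries filter by customer, one by order only.
workload = [
    {"customer_id"},
    {"customer_id", "order_id"},
    {"customer_id"},
    {"order_id"},
]
key = choose_partition_key(["customer_id", "order_id"], workload)
```

Here partitioning on `customer_id` makes three of the four queries single-node, so it wins.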

Since there are at least two copies of each row and each table, there must be a way to update them consistently. Queries that can complete on a single node can simply be sent to all relevant nodes, and we can be confident that they will all complete with identical results. The only complication is that each node must wait a few milliseconds before running the latest transaction, to allow for receiving prior transactions from other nodes. The order in which transactions run is determined by timestamps and node ids. This allows for an identical order of execution on all nodes and is what keeps the results consistent.
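
The ordering trick is simple enough to sketch: every replica sorts pending transactions by (timestamp, node id) before executing, so ties on timestamp are broken deterministically and all copies apply the same transactions in the same order. Field names below are my assumptions for illustration.

```python
from collections import namedtuple

# A transaction carries its origin timestamp, the id of the node that
# submitted it, and the operation to apply.
Txn = namedtuple("Txn", ["timestamp", "node_id", "op"])


def execution_order(pending):
    """Return the globally agreed execution order for pending transactions.

    Sorting by (timestamp, node_id) yields a total order: timestamps give
    the main ordering, node ids break ties identically on every replica.
    """
    return sorted(pending, key=lambda t: (t.timestamp, t.node_id))


pending = [Txn(10, 2, "b"), Txn(9, 7, "a"), Txn(10, 1, "c")]
ops = [t.op for t in execution_order(pending)]  # same on every replica
```

The few-millisecond wait in the text exists precisely so that a replica is unlikely to execute a transaction before an older one from another node has arrived.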

In the case of transactions that span multiple sites and involve changes that affect other transactions (i.e., where the order of execution relative to other transactions matters), one way to achieve consistency could be locking the data sources for the duration of the transaction. HStore uses another method – each worker node receives its portion of the transaction from a coordinator. If there are no conflicting transactions with lower timestamps, the transaction runs and the worker sends the coordinator an “ok”; otherwise the worker aborts and notifies the coordinator. The transaction has failed, and it’s up to the application to recover from this. Of course, some undo must be used to roll back the successful nodes.

The coordinator monitors the number of aborts, and if there are too many unsuccessful transactions, it starts waiting longer between the time a transaction arrives at a node and the time the node attempts to run it. If there are still too many failures, a more advanced abort strategy is used. In short, this is a very optimistic database, where failure is preferred to locking.
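
The two mechanisms above fit together roughly like this sketch: a worker runs its fragment only if no conflicting transaction with a lower timestamp is pending, otherwise it aborts; a coordinator counts aborts and backs off by lengthening its dispatch delay. Class names, fields, and the doubling policy are all my assumptions, not HStore’s actual design.

```python
class Worker:
    """Holds (timestamp, touched-keys) pairs for not-yet-run transactions."""

    def __init__(self):
        self.pending = []

    def try_run(self, timestamp, keys):
        # Optimistic check: abort instead of locking if an older pending
        # transaction touches any of the same keys.
        for ts, other_keys in self.pending:
            if ts < timestamp and other_keys & keys:
                return False  # conflict -> abort, coordinator is notified
        return True  # no conflict -> run and reply "ok"


class Coordinator:
    """Tracks aborts and stretches the dispatch delay when they pile up."""

    def __init__(self, abort_threshold=3):
        self.wait_ms = 1  # delay between arrival and execution attempt
        self.aborts = 0
        self.abort_threshold = abort_threshold

    def record(self, ok):
        if not ok:
            self.aborts += 1
            if self.aborts >= self.abort_threshold:
                self.wait_ms *= 2  # too many failures: wait longer
                self.aborts = 0


w = Worker()
w.pending = [(5, {"row1"})]          # an older transaction touching row1
c = Coordinator(abort_threshold=1)

ok = w.try_run(timestamp=9, keys={"row1"})  # conflicts, so it aborts
c.record(ok)                                # coordinator doubles its wait
```

Failure really is the cheap path here: the worker does no blocking at all, and contention only costs extra latency through the growing dispatch delay.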

I’ll skip the part where a modified TPC-C proves that HStore is much faster than a traditional RDBMS tuned for 3 days by an expert. We all know that all benchmarks are institutionalized cheating.

What do I think of this database?

  1. It may be too optimistic in its definition of OLTP. I’m not sure we are all there with the pre-defined workload, especially since adding queries can require a complete rebuild of the data store.
  2. I’m wondering how he plans to get a consistent image of the data stored there to another system to allow querying. ETL hooks are clearly required, but it is unclear how they can be implemented.
  3. Likewise, there is no clear solution on how to migrate existing datasets into this system.
  4. HStore seems to depend quite heavily on the assumption that networks never fail or slow down. Not a good assumption from my experience.
  5. If Stonebraker is right and most datasets can be partitioned in a way that allows SQL and DML to almost always run on a single node, this can be used to optimize OLTP systems on RAC.
  6. I like the idea of memory only systems. I like the idea of replication providing recoverability and allowing us to throw away the redo logs. I’m not sure we are there yet, but I want to be there.
  7. I also like the idea of a system allowing only stored procedures to run.
  8. I’m rather skeptical about systems without DBAs, I’ve yet to see any large system work without someone responsible for it to keep working.
  9. I’m even more skeptical about systems without redo logs and how they manage to still be atomic, durable and just plain reliable. Unfortunately this paper doesn’t explain how redo-less systems can be recovered. It references another paper as proof that it can be done.
  10. Stonebraker deserves credit for anticipating the NoSQL boom 3 years in advance. Especially the replication and memory-only components.

I hope that in the next few months I’ll add a few more posts reviewing futuristic systems. I enjoy keeping in touch with industry trends and cutting-edge ideas.

3 Comments on “It’s the End of the World As We Know It (NoSQL Edition)”

  1. Hey that looks familiar. I work on VoltDB which is an attempt at commercializing the H-Store project.

    #1 part 1 definitely keeps me up at night. The only saving grace is that no transactional alternatives to a SQL RDBMS have sprung up. There are still some workloads that benefit from transactions and are extremely difficult to implement correctly without. I hear a lot of people talking about how they integrate new stores but keep the RDBMS around for certain roles.
    Part 2 is no different from a traditional database. You don’t need to rebuild the data part of the data-store. You can compile a new catalog and load it onto a running system to add/remove procedures, change permissions, and add/drop tables.

    #2 You can create append only tables where rows inserted into the append only table are transactionally pushed to another system. You can also specify that all changes to a table are logged and pushed to another system. You can write a client to receive the data, load it into the system of your choice, and then ack it.

    #3 There is a ticket for writing a CSV loader. One of the many things that has to get done. If you are migrating an existing system to use VoltDB you are also probably migrating parts of it to other data stores as well. It’s a polyglot future.

    #4 An unspoken part of the H-Store premise is that you can do a lot more with a lot less (otherwise why bother?). I suspect most people will be adding nodes for memory capacity (and to use lower density memory) and not transactional throughput. Thirty nodes with 3-way replication can do 1.3 million single partition stored procedures a second. When you don’t need hundreds of nodes it becomes much more realistic to have redundant switches and rack local UPS. Also, the txn ordering system described in the paper has been replaced with one that does not rely on timely message delivery for correctness. That jumped out at me the first time as well.

    #5 One look at the TPC-C leaderboard will tell you that RAC scales, but at the end of the day it is all about $ per txn. If part of your workload can be done at a fraction of the cost by another system or if your workload starts to grow rapidly then you are a lot more likely to look at something purpose built.

    #6 & #9 Rack local UPS gives you enough time to finish in flight work, snapshot to disk, and shutdown. With no UPS or in a true disaster you will be set back to the last online snapshot which means low single digit minutes of data loss. Eventually (pending implementation) you could fail over to a WAN replica and only lose the txns in flight.

    #10 Some NoSQL stores have very strong disk-based durability stories (Voldemort, Cassandra) and in the case of Cassandra a strong multi-datacenter replication story, which is much more important IMO.

  2. Iggy Fernandez says:

    re: “If Stonebraker is right and most datasets can be partitioned in a way that allows SQL and DML to almost always run on a single node, this can be used to optimize OLTP systems on RAC.”

    Stonebraker was unable to partition the nine-table TPC-C schema in such a way that each TPC-C transaction can be performed on a single node. The “Voter” teaching example used by VoltDB has only one table and two queries🙂 See

    • Noons says:

      Amazing, isn’t it? Here we are with production systems with in excess of 30,000 tables, and along comes Stonewhatever telling us that we don’t need any of that; all we need is his concept of one table and two queries…
      I am so looking forward to his next demented claim!
