Big Data News from Oracle OpenWorld 2013

Only a week after Oracle OpenWorld concluded and I already feel like I’m hopelessly behind on posting news and impressions. Behind or not, I have news to share!

The most prominent feature announced at OpenWorld is the “In-Memory Option”  for Oracle Database 12c.  This option is essentially a new part of the SGA that caches tables in column formats. This is expected to make data warehouse queries significantly faster and more efficient. I would have described the feature in more details, but Jonathan Lewis gave a better overview in this forum discussion, so just go read his post.

Why am I excited about a feature that has nothing to do with Hadoop?

First, because I have a lot of experience with large data warehouses. So I know that big data often means large tables, but only few columns used in each query. And I know that in order to optimize these queries and to avoid expensive disk reads every time each query runs, we build indexes on those columns, which makes data loading slow. In-memory option will allow us to drop those indexes and just store the columns we need in memory.

Second, because I’m a huge fan of in-memory data warehouses, and am happy that Oracle is now making these feasible. Few TB of memory in a large server are no longer science fiction, which means that most of your data warehouse will soon fit in memory. Fast analytics for all! And what do you do with the data that won’t fit in memory? Perhaps store it in your Hadoop cluster.

Now that I’m done being excited about the big news, lets talk about small news that you probably didn’t notice but you should.

Oracle announced two cool new features for the Big Data Appliance. Announced may be a big word, Larry Ellison did not stand up on stage and talked about them. Instead the features sneaked quietly into the last upgrade and appeared in the documentation.

Perfect Balance – If you use Hadoop as often as I do, you know how data skew can mess with query performance. You run a job with several reducers, each aggregates data for a subset of keys. Unless you took great care in partitioning your data, the data will not be evenly distributed between the reducers, usually because it wasn’t evenly distributed between the keys. As a result, you will spend 50% of the time waiting for that one last reducer to finish already.

Oracle’s Perfect Balance makes the “took great case in partitioning your data” part much much easier. This blog post is just a quick overview, not an in-depth blog post, so I won’t go into details of how this works (wait for my next post on this topic!). I’ll just mention that Perfect Balance can be used without any change to the application code, so if you are using BDA, there is no excuse not to use it.

And no excuse to play solitaire while waiting for the last reducer to finish.

Oracle XQuery for Hadoop – Announced but not officially released yet, which is why I’m pointing you at an Amis blog post. For now thats the best source of information about this feature. This feature, combined with the existing Oracle Loader for Hadoop will allow running XQuery operations on XMLs stored in Hadoop, pushing down the entire data processing bit to Map Reduce on the Hadoop cluster. Anyone who knows how slow, painful and CPU intensive XML processing can be on an Oracle database server will appreciate this feature. I wish I had it a year ago when I had to ingest XMLs at a very high rate. It is also so cool that I’m a bit sorry that we never developed more awesome XQuery capabilities for Hive and Impala. Can’t wait for the release so I can try that!

During OpenWorld there was also additional exposure for existing, but perhaps not very well known Oracle Big Data features – Hadoop for ODI, Hadoop for OBIEE and using GoldenGate with Hadoop. I’ll try to write more about those soon.

Meanwhile, let me know what you think of In-Memory, Perfect Balance and OXH.