Notes about Hadoop

My notes from two presentations given to the data mining SIG of the local ACM chapter.

Hadoop is a scalable fault-tolerant grid operating system for data storage and processing.

It is not a database. It is more similar to an operating system: Hadoop has a file system (HDFS) and a job scheduler (Map-Reduce). Both are distributed. You can load any kind of data into Hadoop.

It is quite popular – the last Hadoop Summit had 750 attendees. Not bad for a new open-source technology. It is also quite efficient for some tasks. Hadoop cluster of 1460 nodes can sort a Terabyte of data in 62 seconds – currently the world record for sorting a terabyte.

Hadoop Design Axioms:

  • System will manage and heal itself. (Because using commodity hardware – failure is inevitable).
  • Performance will scale linearly. (With few limitations).
  • Compute should move to data (Processing job should run on the machine holding the data to process)
  • Simple core. Modular and extensible

Distributed file system. Block size is 64M (!). User configures replication factor – each block is replicated on K machines (K chosen by user). More replication can be configured for hot blocks.
A name node keeps track of the blocks and if a node fails the data on it will be replicated to other nodes.

Distributes jobs. It tried to run jobs local to their data to avoid network overhead. It also detects failures and even servers running behind on the processing. If a part of the job is lagging in processing, it will start copies of this part of the job on other servers with the hope that one of the copies will finish faster.

Hadoop Ecosystem:

  • HBase: Google’s big table implementation. Key-value based. Good for quick lookups, but not batch processing. Transactional.
  • Pig, Hive, Scoop: Different languages. Map-Reduce is like assembly – High performance, low-level, contains too much details for most tasks. Hive is SQL language for Hadoop.

Hadoop vs. RDBMS?
RDBMS – expensive, structured, fast, interactive, has standards, transactional.
Hadoop – affordable, unstructured, scalable, resilient. Solves both storage and processing.

Hive and Hadoop at Facebook
Facebook got 200GB of data each day as of March 2008. Thats a lot of data to manage. Facebook philosophy is that more insights can be achieved from running simpler algorithms on more data.

Why Hadoop? Cost of storage. Limitations of data-analysis systems. Many systems have limited scalability. And they were closed and propitiatory.

Why not map-reduce? Not many developers have experience with it. Needed well known schemas and structure.

Hive was built on top of Hadoop to solve these problems. It saves metadata and adds SQL. Also allows integrating with other systems. Hive has tables, which have partitions which hold buckets. Buckets are used for sampling. Hive is very extensible. You can have user defined functions, types, objects, etc.

Hive does optimizations – join order, different processing for skewed data. The optimizer is rule based and uses hints. It also does some kind of dynamic sampling. You can look at the explain plans for the jobs and use that for tuning. Hive uses columnar compression.

Hive support integrations with JDBC, ODBC and Thrift.

It lacks resource management and needs monitoring to catch and kill “bad” jobs.

Concurrency wise, the idea is that you insert data, “publish” it and from the moment it is published everyone else can see it – but it cannot be modified or deleted. This means no read/write contention.


Build Less (DB Design Version)

37Signals, the company behind few highly successful web-based applications, has published a book about their business building experience. Knowing that the company is both successful and has an unconventional business development philosophy, I decided to browse a bit.

One of the essays that caught my attention is “Build Less”. The idea is that instead of having more features than the competition (or more employees or whatever), you should strive to have less. To avoid any sense of irony – the essay is very short 🙂

One of the suggestions I would add to the essay is:
“Keep less data”

Keeping a lot of data is a pain. Indexing, partitioning, tuning, backup and recovery – everything is more painful when you have terabytes instead of gigabytes. And when it comes to cleaning data out, it always causes endless debates on how long to keep the data (3 month? 7 years?) and different life-cycle options (move to “old data” system? archiving? how to purge? What is the process?).

What’s more, a lot of the time customers would really prefer we won’t keep the data. Maybe its privacy concerns (when we keep a lot of search history) or difficulty in generating meaningful reports or just plain confusion caused by all those old projects floating around.

Google taught us that all the data should be stored forever. But perhaps your business can win by keeping less data.

RMOUG Presentations

Like many other DBAs, I’ll be attending RMOUG training days conference on Feb 17-18 in Denver. I’ll give two presentations in the conference. On the same day, just thinking about it makes me exhausted.

The first presentation is “Everything DBAs need to know about TCP/IP Networks”. Here’s the paper and the slides. I’ll also present this at NoCOUG‘s winter conference in Pleasanton, CA. Maybe you prefer to catch me there.

The second presentation is “Analyzing Database Performance using Time Series Techniques”. Here’s the paper and the slides.

I still have time to improve the presentations and papers – so comments are very welcome 🙂

Damnit Method for Writing Papers

So today was this date when I was supposed to send RMOUG the papers I’m going to present at the conference.
Normally I’m pretty good about having papers ready well in advance, and indeed the time-series paper was done few weeks back.

But there was the other paper. “Everything a DBA Needs to Know about TCPIP Networks”. I’ve been avoiding it to the point that I only finished writing it this morning. Which is totally unlike me.

Its not that I dislike the topic. On the contrary – I love TCP and am very passionate about it. This blog is full of posts where I explain TCP. Its nearly as much fun as queue theory and CPU scheduling. And its not like I had nothing to say – I had tons of stories and examples and network captures and techniques and even few opinions to share.

The problem was that I also had a bunch of things that I did not want to say, but felt like I have to.

You see, I started working on the paper by thinking about my experience. I had few war stories in mind where knowledge of networking saved the day, and I planned of sharing the stories and the knowledge with the audience. So far so good.

But when I started planning and outlining the paper, I shifted my thinking from my knowledge and stories to the potential audience. What do “they” need to know and to see and to understand. Generally it is a good idea to think about the audience when writing papers and preparing presentations, but this also meant that a lot of stuff that I did not want to write about found its way into the paper outline.

All of the sudden writing my paper meant that I have to write about things that I don’t want to write about, and the paper feels like a school paper and I’m avoiding it just so I don’t have to write about all that stuff.

So yesterday, when it looks like I’ll never finish the paper on time because I can’t even start writing it, I decided to take a second look at the plan. I figured out that the only way the paper will ever be finished is if I’ll give myself full permission not to write about things that I don’t want to write about.

It went like this:

  • I’m not writing yet another introduction to networking. I’m not describing any layers model, I’m not drawing IP datagrams or 3-way-handshake diagrams or anything with arrows going back and forth. Enough people described layers and IP and TCP basics and I don’t have to. This paper will not include basics, damnit!
  • I don’t care if every single DBA on the planet is tuning SDU sizes. I can say that other people are doing it, but I still think it is a waste of time and I won’t do it. No demo code for how tunining SDU size can improve performance, damnit!
  • I’m not doing any tuning that involves calculating exactly how many bytes go into each packet and how much space goes into the header. Its annoying and a waste of time on any known network. I’m not counting header sizes, damn it!
  • I’m giving some tips on how to use packet capture software, but I’m not teaching everything about Wireshark. Some things are out of scope, damn it!

After that I felt much better and proceeded to write the paper in about 4 hours. Amazingly, even without all those topics, I still have a fairly large paper full of content barely covered elsewhere and which is demonstrated with my own stories and examples.

I hope that if you ever have to write a paper, you will also give yourself full permission to write only about what you see as the interesting stuff and to skip things that bore you. I really believe that a paper that was fun to write is also much more useful for the reader.