“A beginning is the time for taking the most delicate care that the balances are correct.”

It is spring. Time for planting new seeds. I started on a new job last week, and it seems that few of my friends and former colleagues are on their way to new adventures as well. I’m especially excited because I’m starting not just a new job – I will be working on a new product, far younger than Oracle and even MySQL. I am also making first tiny steps in the open-source community, something I’ve been looking to do for a while.

I’m itching to share lessons I’ve learned in my previous job, three challenging and rewarding years as a consultant. The time will arrive for those, but now is the time to share what I know about starting new jobs. Lessons that I need to recall, and that my friends who are also in the process of starting a new job may want to hear.

Say hello
I’m usually a very friendly person and after years of attending conferences I’m very comfortable talking to people I’ve never met before. But still, Cloudera has around 200 people in the bay area offices, which means that I had to say “Hello, I’m Gwen Shapira the new Solutions Architect, who are you?” around 200 times. This is not the most comfortable feeling in the world. Its important to go through the majority of the introductions in the first week or two, later on it becomes a bit more awkward. So in the first week it will certainly seem like you are doing nothing except meeting people, chatting a bit and franctically memorizing names and faces. This is perfectly OK.

Get comfortable being unproductive
The first week in a new job feels remarkably unproductive. This is normal. I’m getting to know people, processes, culture, about 20 new products and 40 new APIs. I have incredibly high expectations of myself, and naturaly I’m not as fast installing Hadoop cluster as I am installing RAC cluster. It takes me far longer to write Python code than it does to write SQL. My expectations create a lot of pressure, I internally yell at myself for taking an hour or so to load data into Hive when it “should” have taken 5 minutes. But of course, I don’t know how long it “should” take, I did it very few times before. I’m learning and while learning has its own pace, it is an investment and therefore productive.

Have lunch, share drinks
The best way to learn about culture is from people, and the best way to learn about products is from the developers who wrote them and are passionate about how they are used. Conversations at lunch time are better than tackling people in the corridor or interrupting them at their desk. Inviting people for drinks are also a great way to learn about a product. Going to someones cube and asking for an in-depth explanation of Hive architecture can be seen as entitled and bothersome. Sending email to the internal Hive mailing list and saying “I’ll buy beer to anyone who can explain Hive architecture to me” will result in a fun evening.

If its not overwhelming, you may be in the wrong job
I’m overwhelmed right now. So many new things to learn. First there are the Hadoop ecosystem products, I know some but far from all of them, and I feel that I need to learn everything in days. Then there is programming. I can code, but I’m not and never have been a proficient programmer. My colleagues are sending out patches left and right. It also seems like everyone around me is a machine learning expert. When did they learn all this? I feel like I will never catch up.

And that is exactly how I like it.

Make as many mistakes as possible
You can learn faster by doing, and you can do faster if you are not afraid of failing and making mistakes. Mistakes are more understandable and forgivable when you are new. I suggest using this window of opportunity and accelerate your learning by trying to do as much as possible. When you make a mistake smile and say “Sorry about that. I’m still new. Now I know what I should and shouldn’t do”

Take notes
When you are new a lot of things will look stupid. Sometimes just because they are very different from the way you are used to things in a previous job. Don’t give in to the temptation to criticise everything, because you will look like a whiner. No one likes whiner. But take note of them, because you will get used to them soon and never see things with “beginner mind” again. In few month take a look at your list, if things still look stupid, it will be time to take on a project or two to fix them.

I may be new at this specific job, but I still have a lot to contribute. I try hard to look for opportunities and I keep finding out that I’m more useful than I thought. I participate in discussions in internal mailing lists, I make suggestions, I help colleagues solve problems. I participate in interviews and file tickets when our products don’t work as expected. I don’t wait to be handed work or to be sent to a customer, I look for places where I can be of use.

I don’t change jobs often. So its quite possible that I don’t know everything there is to know about starting a new job. If you have tips and suggestions to share with me and my readers, please comment!

It’s the End of the World As We Know It (NoSQL Edition)

This post originally appeared over at Pythian. There are also some very smart comments over there that you shouldn’t miss, go take a look!

Everyone knows that seminal papers need a simple title and descriptive title. “A Relational Model for Large Shared Data Banks” for example. I think Michael Stonebraker overshot the target In a 2007 paper titled, “The End of an Architectural Era”.

Why is this The End? According to Michael Stonebraker “current RDBMS code lines, while attempting to be ‘one size fits all’ solution, in face, excel at nothing. Hence, they are 25 years old legacy code lines that should be retired in favor of a collection of ‘from scratch’ specialized engined”.

He makes his point by stating that traditional RDBM design is already being replaced for a variety of specialized solutions: Data-warehouses, streams processing, text and scientific databases. The only uses left for RDBMS is OLTP and hybrid systems.

The provocatively named paper is simply a description of a system, designed from scratch for modern OLTP requirements and the demonstration that this system gives better performance than traditional RDBMS on OLTP type load. The conclusion is that since RDBMS can’t even excel at OLTP – it must be destined for the garbage pile. I’ll ignore the fact that hybrid systems are far from extinct and look at the paper itself.

The paper starts with a short review of the design considerations behind traditional RDBMS, before proceeding to list the design considerations behind the new OLTP system, HStore.:

  1. The OLTP database should fit entirely in-memory. There should be no disk writes at all. Based on TPC-C size requirements this should be possible, if not now then within few years.
  2. The OLTP database should be single threaded – no concurrency at all. This should be possible since OLTP transactions are all sub-millisecond. In an memory-only system they should be even faster. This will remove the need for complex algorithms and data structures and will improve performance even more. Ad-hoc queries will not be allowed.
  3. It should be possible to add capacity to an OLTP system without any downtime. This means incremental expansion – it should be possible to grow the system by adding nodes transparently.
  4. The system should be highly available, with a peer-to-peer configuration – the OLTP load should be distributed across multiple machines and inter-machine replication should be used for availability. According to the paper, in such a system redo and undo logging becomes unnecessary. This paper references another paper that argues that rebuilding a failed node over the network is as efficient as recovering from redo log. Obviously, eliminating redo logs eliminates one of the worse OLTP bottlenecks where data is written to disk synchronously.
  5. No DBAs. Modern systems should be completely self tuning.

In other sections Stonebraker describes few more properties of the system:

  1. With persistent redo logs gone, and locks/latches gone, the overhead of JDBC interface is likely to be the next bottleneck. Therefore the application code should be in form of stored procedures inside the data store. The only command ran externally should be “execute transaction X”.
  2. Given that the DB will be distributed and replicated, and network latencies still take milliseconds, the two-phase commit protocol should be avoided
  3. There will be no ad-hoc queries. The entire workload will be specified in advance.
  4. SQL is an old legacy language with serious problems that were exposed by Chris Date two decades ago. Modern OLTP systems should be programmable in a modern light-weight language such as Ruby. Currently the system is queried with C++.

The requirements seem mostly reasonable and very modern – use replication as a method of high availability and scalabilty, avoid disks and their inherent latencies, avoid the complications of concurrency, avoid ad-hoc queries, avoid SQL and avoid annoying DBAs. If Stonebraker can deliver on his promise, if he can do all of the above without sacraficing the throughput and durability of the system, this sounds like a database we’ll all enjoy.
In the rest of the paper, the authors describe some special properties of OLTP work loads, and then explains how HStore utilizes the special properties to implement a very efficient distributed OLTP system. In the last part of the paper, the authors use HStore to run a TPC-C like benchmark and compare the results with an RDBMS.

Here are in very broad strokes the idea:
The paper explains in some detail how things are done, while I only describe what is done:

The system is distributed, with each object partitioned over the nodes. You can have specify how many copies of each row will be distributed, and this will provide high availability (if one node goes down you will have all the data available on other nodes).

Each node is single threaded. Once SQL query arrives at a node, it will be performed to the end without interruptions. There are no physical files. The data objects are stored as Btrees in memory, Btree block is sized to match L2 cache line.

The system will have a simple cost-based optimizer. It can be simple because OLTP queries are simple. If multi-way joins happen they always involve identifying a single tuple and then tuples to join to that record in a small number of 1-to-n joins. Group by and aggregation don’t happen in OLTP systems.

The query plans can either run completely in one of the nodes, can be decomposed to a set of independent transactions that can run completely in one node each, or require results to be communicated between nodes.

The way to make all this efficient is by using a “database designer” – Since the entire workload is known in advance, the database designer’s job is to make sure that most queries in the workload can run completely on a single node. It does this by smartly partitioning the tables, placing parts that are used together frequently on the same node and copying tables (or just specific columns) that are read-only all over the place.

Since there are at least two copies of each row and each table, there must be a way to consistently update them. Queries that can complete on a single node, can just be sent to all relevant nodes and we can be confident that they will all complete them with identical results. The only complication is that each node must wait a few milliseconds before running the latest transaction to allow for recieving prior transactions from other nodes. The order in which transactions run is identified by timestamps and node ids. This allows for identical order of execution on all nodes and is responsible for consistent results.

In case of transactions that span multiple sites and involve changes that affect other transactions (i.e. The order in which they execute in relation to other transactions matter), one way to achieve consistency could be locking the data sources for the duration of the transaction. The HStore uses another method – each worker node recieves its portion of the transaction from a coordinator. If there are no conflicting transactions with lower timestamps, the transaction runs and the worker sends the coordinator an “ok”, otherwise the worker aborts and notifies the coordinator. The transaction failed and its up to the application to recover from this. Of course, some undo should be used to rollback the successfull nodes.

The coordinator monitors the number of aborts and if there are too many unsuccessfull transactions, it starts waiting longer between the time a transaction arrives at a node until the node attempts to run it. If there are still too many failures, a more advanced strategy of aborting is used. In short, this is a very optimistic database where failure is prefered to locking.

I’ll skip the part where a modified TPC-C proves that HStore is much faster than a traditional RDBMS tuned for 3 days by an expert. We all know that all benchmarks are institutionalized cheating.

What do I think of this database?

  1. It may be too optimistic in its definition of OLTP. I’m not sure we are all there with the pre-defined workload. Especially since adding queries can require a complete rebuild of the data-store.
  2. I’m wondering how he plans to get a consistent image of the data stored there to another system to allow querying. ETL hooks are clearly required, but it is unclear how they can be implemented.
  3. Likewise, there is no clear solution on how to migrate existing datasets into this system.
  4. HStore seems to depend quite heavily on the assumption that networks never fail or slow down. Not a good assumption from my experience.
  5. If Stonebraker is right and most datasets can be partitioned in a way that allows SQL and DML to almost always run on a single node, this can be used to optimize OLTP systems on RAC.
  6. I like the idea of memory only systems. I like the idea of replication providing recoverability and allowing us to throw away the redo logs. I’m not sure we are there yet, but I want to be there.
  7. I also like the idea of a system allowing only stored procedures to run.
  8. I’m rather skeptical about systems without DBAs, I’ve yet to see any large system work without someone responsible for it to keep working.
  9. I’m even more skeptical about systems without redo logs and how they manage to still be atomic, durable and just plain reliable. Unfortunately this paper doesn’t explain how redo-less systems can be recovered. It references another paper as proof that it can be done.
  10. Stonebraker deserves credit for anticipating the NoSQL boom 3 years in advance. Especially the replication and memory-only components.

I hope that in the next few month I’ll add few more posts reviewing futuristic systems. I enjoy keeping in touch with industry trends and cutting-edge ideas.

BAAG, Best Practices and Multiple Choice Exams

(This post originally appeared at the Pythian blog)

I’ve been following the discussion in various MySQL blogs regarding the sort_buffer_size parameters. As an Oracle DBA, I don’t have an opinion on the subject, but the discussion did remind me of many discussions I’ve been involved in. What’s the best size for SDU? What is the right value for OPEN_CURSORS? How big should the shared pool be?

All are good questions. Many DBAs ask them hoping for a clear cut answer – Do this, don’t do that! Some experts recognize the need for a clear cut answer, and if they are responsible experts, they will give the answer that does the least harm.

Often the harmless answer is “Don’t touch anything, because if you have to ask this question you don’t have the experience to make the correct decision”. As Sheeri noted, it is a rather patronizing answer and it is stands in the way of those who truly want to learn and become experts.

But I can appreciate that it comes from long and bitter experience. Many users read random bits of information off the web and then rush to modify the production database without fully digesting the details. They end up tuning their database in a way no database should ever be tuned. Not even MySQL.

I used to think that users search for those “best practices” out of laziness, or maybe a lack of time. I used to laugh at the belief that there are best practices and clear answers, because if there were – we wouldn’t have a parameter. But now I think the problem is in the way most institutions evaluate intelligence, which affects the way many people approach any problem.

Even though all of us DBAs come from a wide variety of cultures, I’m willing to bet that every one of us had to work his way through a multiple choice test. If you ever took an Oracle certification exam, you know what I mean:

How do you find the name of the database server?
D) none of the above

You run into those in certification exams, job interviews and in slightly different variation when you try to get accepted to a university. You had to learn to succeed at those multiple choice tests at a very early age, or you would be labled “less intelligent”.

Yet those questions are absurd. In the question above, the answer could be A, but A would be wrong if my database is a RAC cluster. Besides a much better way would be to use /etc/oratab because there may be more than one DB on the machine.

But you can’t have a discussion with the exam author. You can’t ask for assumptions and clarifications  and you can’t even explain your assumptions in the test. What’s more, these tests also check for speed, so you don’t have much time to contemplate the options.

What these exams teach you is that every question has a single solution and one that is so obvious that once you see it, you recognize its rightness in less than 30 seconds. They also teach you that someone else knows the right answer (the person grading the test). So finding the right answer can be a matter of getting the “expert” to give you the one and obvious correct answer.

If our society teaches this point of view from a very young age, why are we surprised that DBAs keep looking for the one obvious answer?

I fully support the BAAG cause, and I believe that DBAs should be given a full explanation of the problem involved, the different choices that exist, their meaning, the trade-offs involved and in general give them the tools to make the right decision themselves time after time. But we should realize that undoing years of faulty teaching can take a while.

There is an even worse side effect to those multiple-choice-should-be-obvious tests. You may learn to never ask for clarifications. That asking for clarifications is “bad” or “wrong” in some way. In any real problem, asking for clarifications such as “why are you asking this?”, “what is the real issue you are trying to solve?” and  “How will you use this script?” is the most important part of finding a solution. It is not cheating – it is doing a professional job.

It is a scary thought that the very way we train and evaluate the future DBAs is something that will prevent them from doing a good job of being DBAs.

What a Difference a Month Makes

Its March 22. Exactly one month ago, I came back from few days spent in Colorado. I gave presentations, met amazing people, enjoyed skiing and drank a lot of beer, wine and whiskey. I also barely made it back, but that’s a different story.

Obviously I drank too much, or maybe I gave too many presentations. I don’t remember.

What I do know is that in the month since that eventful weekend, my life has taken a sharp turn.

I’m about to start working for Pythian. There will be a separate post about that, where I explain how amazing Pythian is and how to continue to follow my blog once it is merged with the official Pythian blog. Same hypothetical blog post will also include comments about how much I love my current colleagues and how sorry I am to leave them.

But more important – I’m now an Oakie. Seriously. Check it out 🙂 Can you tell I’m stoked?

This is not really a blog post, more of a tweet really. But I had to share the news.

Deliberate Practice

Recently I did some soul searching about my expertise as a DBA. I am not talking about my knowledge, my talents and my work style. I’m talking about which things I’m really comfortable doing. The commands I know by heart, the issues I ran into so often that I can diagnose with the tiniest clues.

There are definitely things I’m very good at. Diagnosing why RAC crashed or wouldn’t start. Solving a range of different problems with Streams. User managed recoveries. Netapp. BASH. Top. sar. vmstat. Redo log mining. Datapump. ASH and its relatives AWR and ADDM. Using Snapper to work with wait event interface. SQL coding, network diagnosis, Patching.

These are mostly things I do every day or close enough to it that the commands, the techniques, the traps and the limitations are always clear in my mind. But there are things that I do rarely or even never. This are important DBA skills, some are even very basic, which I do not have because they are not very useful in my specific position.

These include RMAN, ASM, Dataguard, AWK, perl, python, PL/SQL, tracing, SQL tuning, upgrade testing, benchmarks, many Linux administration tools, hadoop and those new NoSQL things, MySQL, Amazon’s cloud databases, RAT, partitions, scheduler.

These are all things that I know something about, that I’ve read about – but I can’t say I’m confident with any of these because I simply haven’t played with them all that much. After all, you learn by doing and running into issues – not by reading people say how everything works perfectly when they use it.

In order to widen my skill set a bit, I’m planning to take time this year to deliberately practice some of the technologies I didn’t use much last year. I’m thinking of taking anything from few weeks to few month per technology. I’ll invent and look up “lab exercises” for the specific topic and then proceed to spend the month practicing. Think of it as a “poor DBA’s Oracle University”.

This is the opposite of what I’ve been doing until now which can be described as “read Oracle’s Concepts book”. Reading the concepts book is great, but I’m at a point where I feel that I know tons of theory and need to spend some quality time grappling with its various applications. There seems to be a lot of research that shoes that best way to become an expert is with deliberate practice, so this year – I’ll practice.

Unless I’ll run into something really exciting, I don’t expect to blog about this adventure. After all, if I start posting scores of trivial AWK scripts, I doubt if anyone will keep reading my blog. But I thought that maybe some of my readers will enjoy joining me in my practice. So I opened this mailing list.

If you are also interested in practicing with me, feel free to join the mailing list. I’ll announce a “topic of the month”. This month its AWK, next month probably RMAN, we’ll see how this goes. I’ll send to the list the questions I’m planning to solve. If you join, you can send in your own questions. We’ll send our answers to our own and others questions with a week delay (so everyone will have time to practice on his own first). We’ll discuss and compare our answers. We’ll cheer each other as we improve our skills. I will not try to sell you anything.

This is an obvious ploy. I need people on the list so I’ll be accountable to this practice. Otherwise I’ll probably forget it by next week. Feel free to join just to watch me stumble and remind me that this is what learning looks like and it will be worth it.

Here’s to a year of practice and hopefully better skills that will follow.

Automated Root Cause Analysis

I’ve ran into multiple products that claim to offer automated root cause analysis, so don’t think that I’m ranting against a specific product or vendor. I have a problem with the concept.

The problem these products are trying to solve: IT staff spend much of their time trying to troubleshoot issues. Essentially finding the cause of effects they don’t like. What is causing high response times on this report? What is causing the lower disk throughputs?

If we can somehow automate the task of finding a cause for a problem, we’ll have a much more efficient IT department.

The idea that troubleshooting can be automated is rather seductive. I’d love to have a “What is causing this issue” button. My problem is with the way those vendors go about solving this issue.

Most of them use variations of a very similar technique:
All these vendors already have monitoring software, so they usually know when there is a problem. They also know of many other things that happen at the same time. So if their software detects that response time go up, it can look at disk throughput, DB cpu, swap, load average, number of connections, etc etc.
When they see that CPU goes up together with response times – Tada! Root cause found!

First problem with this approach: You can’t look at correlation and declare that you found a cause. Enough said.

Second problem: If you collect so much data (and often these systems have millions of measurements) you will find many correlation by pure chance, in addition to some correlations that do indicate a common issue.
What these vendors do is ignore all the false findings and present the real problems found at a conference as proof that their method works. Also, you can’t reduce the rate of false-findings without losing the rate of finding real issues as well.

Note that I’m not talking about tools like Tanel Poder visualization tool. Tools which makes it easier for the DBA to look at large amounts of data and using our brain’s built in pattern matcher to find correlations. I support any tool that assists me in applying my knowledge to large sets of data at once.

I have a problem with tools that use statistical correlation as a replacement to applying knowledge. It can’t be done.

Here’s the kind of tool I’d like to see:
Suppose your monitoring tool will give you the ability to visually browse, filter and explore all that data you collect in ways that help you troubleshoot. The tool will remember the things you looked at and the steps you took. After you solve the problem, you can upload the problem description and your debug process to a website. You can even mark away the dead-ends of the investigation.

Now you can go to that website and see that for problem X, 90% of the DBAs started by looking at v$sesstat and 10% ran trace. Maybe you can even have a friend network, so you can see that in this case Fuad looked at disk utilization first while Iggy checked how much redo is written each hour.

If you are not into sharing, you can still browse your own past problems and solutions for ideas that might have slipped your mind.

I think that a troubleshooting tool combined with “collective wisdom” site can assist experienced DBAs and improve the learning curve for junior DBAs without pretending to automate away knowledge and experience.

Notes about Hadoop

My notes from two presentations given to the data mining SIG of the local ACM chapter.

Hadoop is a scalable fault-tolerant grid operating system for data storage and processing.

It is not a database. It is more similar to an operating system: Hadoop has a file system (HDFS) and a job scheduler (Map-Reduce). Both are distributed. You can load any kind of data into Hadoop.

It is quite popular – the last Hadoop Summit had 750 attendees. Not bad for a new open-source technology. It is also quite efficient for some tasks. Hadoop cluster of 1460 nodes can sort a Terabyte of data in 62 seconds – currently the world record for sorting a terabyte.

Hadoop Design Axioms:

  • System will manage and heal itself. (Because using commodity hardware – failure is inevitable).
  • Performance will scale linearly. (With few limitations).
  • Compute should move to data (Processing job should run on the machine holding the data to process)
  • Simple core. Modular and extensible

Distributed file system. Block size is 64M (!). User configures replication factor – each block is replicated on K machines (K chosen by user). More replication can be configured for hot blocks.
A name node keeps track of the blocks and if a node fails the data on it will be replicated to other nodes.

Distributes jobs. It tried to run jobs local to their data to avoid network overhead. It also detects failures and even servers running behind on the processing. If a part of the job is lagging in processing, it will start copies of this part of the job on other servers with the hope that one of the copies will finish faster.

Hadoop Ecosystem:

  • HBase: Google’s big table implementation. Key-value based. Good for quick lookups, but not batch processing. Transactional.
  • Pig, Hive, Scoop: Different languages. Map-Reduce is like assembly – High performance, low-level, contains too much details for most tasks. Hive is SQL language for Hadoop.

Hadoop vs. RDBMS?
RDBMS – expensive, structured, fast, interactive, has standards, transactional.
Hadoop – affordable, unstructured, scalable, resilient. Solves both storage and processing.

Hive and Hadoop at Facebook
Facebook got 200GB of data each day as of March 2008. Thats a lot of data to manage. Facebook philosophy is that more insights can be achieved from running simpler algorithms on more data.

Why Hadoop? Cost of storage. Limitations of data-analysis systems. Many systems have limited scalability. And they were closed and propitiatory.

Why not map-reduce? Not many developers have experience with it. Needed well known schemas and structure.

Hive was built on top of Hadoop to solve these problems. It saves metadata and adds SQL. Also allows integrating with other systems. Hive has tables, which have partitions which hold buckets. Buckets are used for sampling. Hive is very extensible. You can have user defined functions, types, objects, etc.

Hive does optimizations – join order, different processing for skewed data. The optimizer is rule based and uses hints. It also does some kind of dynamic sampling. You can look at the explain plans for the jobs and use that for tuning. Hive uses columnar compression.

Hive support integrations with JDBC, ODBC and Thrift.

It lacks resource management and needs monitoring to catch and kill “bad” jobs.

Concurrency wise, the idea is that you insert data, “publish” it and from the moment it is published everyone else can see it – but it cannot be modified or deleted. This means no read/write contention.