I'm just a simple DBA on a complex production system

Writing about all things production. Especially Oracle databases.

Mapping the NoSQL space February 19, 2010

Filed under: NoSQL — prodlife @ 7:26 pm

NoSQL is an unfortunate name – it doesn’t give any description of what the product does except what query language it will not support. What’s worse, it makes people treat the various non-relational databases as interchangable, while in fact many of them solve completely different problems and have different trade-offs, strengths, etc.

What is common to all these DBs is that they don’t do ACID in an attempt to improve scalability, most of them are distributed and most of them were built to handle semi-structured or unstructured data.

The theoretical case for these databases starts from the CAP theorem which says you can’t have consistency, availability and partition tolerance all at once. Partition tolerance is the prevention of split-brain in a cluster or distributed system – you don’t want network failures to allow data corruptions or incorrect results.

Since you can’t have all three, you choose two. So RAC does partition tolerance and consistency at the expense of availability – if the voting disk crashes or loses network connectivity, the entire cluster will go down.

NoSQL databases keep availability and partition tolerance at the expense of consistency. They have something called “Soft-State” and “Eventual Consistency”. To the best of my understanding, “Eventual Consistency” means that all the DML statements in the transaction are inserted into a queue (or some equivalent), from which they are executed at different times by different servers. Eventually they are all executed and you reach a consistent state, but you don’t know when. Of course with such system, it appears nearly impossible to prevent lost updates.

This doesn’t seem like a good way to manage bank accounts, but when I reviewed the databases I manage, only very few of them really need full consistency. Many of them are not updated concurrently, or where there are no updates (just inserts) or contain data such as project plans where not being consistent at every single second would be OK.

Here’s a short list of the the non-relational databases I’m somewhat familiar with and the problems they solve:

Map-Reduce – not a database at all, its an algorithm or a design methodology that allows for massive scalability.

Hadoop – not a database. Its a platform – a distributed file-system and a map-reduce job manager.

Hive – Its a SQL like language allows for structured schema design and queries on top of Hadoop. It has some superficial similarities with RDBMS, but it is just the syntax – every query is translated to map-reduce code, execution is totally different and don’t expect most  RDBMS features.

HBase – Allows you to create tables with rows and columns (normally very large ones) and query them through several Java/HTTP interfaces. You query each table individually, no joins.

Cassandra - Does exactly the same as HBASE. To the best of my understanding it is more configurable and flexible but is not as well documented.

Tokyo Cabinet /Tyrant – Stores key/value pairs. There are no tables and no data types. You can store data in hash tables, b-trees or fixed-size arrays. It is not distributed. Said to have amazing performance.

Voldemort – Similar to Tokyo Cabinet, but distributed – although it appears that when adding nodes performance doesn’t scale well.

CouchDB - This is a document store, where each document contains multiple key-value pairs. It does include some of the traditional DB features, just in a different context. It has the concept of index, and you create an index for each report you want to run. It supports multiple-versions of each document, where a report is guaranteed to run on the same version from beginning to end. There is no schema – documents can contain different keys.

MongoDB – Similar to CouchDB, it is a document store. It is not distributed. It doesn’t have multiple document revisions – all updates are done on the same document. No indexes either, which allows for ad-hoc querying.

Hope this is useful :)

 

Deliberate Practice February 10, 2010

Filed under: musing — prodlife @ 3:27 am

Recently I did some soul searching about my expertise as a DBA. I am not talking about my knowledge, my talents and my work style. I’m talking about which things I’m really comfortable doing. The commands I know by heart, the issues I ran into so often that I can diagnose with the tiniest clues.

There are definitely things I’m very good at. Diagnosing why RAC crashed or wouldn’t start. Solving a range of different problems with Streams. User managed recoveries. Netapp. BASH. Top. sar. vmstat. Redo log mining. Datapump. ASH and its relatives AWR and ADDM. Using Snapper to work with wait event interface. SQL coding, network diagnosis, Patching.

These are mostly things I do every day or close enough to it that the commands, the techniques, the traps and the limitations are always clear in my mind. But there are things that I do rarely or even never. This are important DBA skills, some are even very basic, which I do not have because they are not very useful in my specific position.

These include RMAN, ASM, Dataguard, AWK, perl, python, PL/SQL, tracing, SQL tuning, upgrade testing, benchmarks, many Linux administration tools, hadoop and those new NoSQL things, MySQL, Amazon’s cloud databases, RAT, partitions, scheduler.

These are all things that I know something about, that I’ve read about – but I can’t say I’m confident with any of these because I simply haven’t played with them all that much. After all, you learn by doing and running into issues – not by reading people say how everything works perfectly when they use it.

In order to widen my skill set a bit, I’m planning to take time this year to deliberately practice some of the technologies I didn’t use much last year. I’m thinking of taking anything from few weeks to few month per technology. I’ll invent and look up “lab exercises” for the specific topic and then proceed to spend the month practicing. Think of it as a “poor DBA’s Oracle University”.

This is the opposite of what I’ve been doing until now which can be described as “read Oracle’s Concepts book”. Reading the concepts book is great, but I’m at a point where I feel that I know tons of theory and need to spend some quality time grappling with its various applications. There seems to be a lot of research that shoes that best way to become an expert is with deliberate practice, so this year – I’ll practice.

Unless I’ll run into something really exciting, I don’t expect to blog about this adventure. After all, if I start posting scores of trivial AWK scripts, I doubt if anyone will keep reading my blog. But I thought that maybe some of my readers will enjoy joining me in my practice. So I opened this mailing list.

If you are also interested in practicing with me, feel free to join the mailing list. I’ll announce a “topic of the month”. This month its AWK, next month probably RMAN, we’ll see how this goes. I’ll send to the list the questions I’m planning to solve. If you join, you can send in your own questions. We’ll send our answers to our own and others questions with a week delay (so everyone will have time to practice on his own first). We’ll discuss and compare our answers. We’ll cheer each other as we improve our skills. I will not try to sell you anything.

This is an obvious ploy. I need people on the list so I’ll be accountable to this practice. Otherwise I’ll probably forget it by next week. Feel free to join just to watch me stumble and remind me that this is what learning looks like and it will be worth it.

Here’s to a year of practice and hopefully better skills that will follow.

 

Automated Root Cause Analysis February 5, 2010

Filed under: musing — prodlife @ 11:15 pm

I’ve ran into multiple products that claim to offer automated root cause analysis, so don’t think that I’m ranting against a specific product or vendor. I have a problem with the concept.

The problem these products are trying to solve: IT staff spend much of their time trying to troubleshoot issues. Essentially finding the cause of effects they don’t like. What is causing high response times on this report? What is causing the lower disk throughputs?

If we can somehow automate the task of finding a cause for a problem, we’ll have a much more efficient IT department.

The idea that troubleshooting can be automated is rather seductive. I’d love to have a “What is causing this issue” button. My problem is with the way those vendors go about solving this issue.

Most of them use variations of a very similar technique:
All these vendors already have monitoring software, so they usually know when there is a problem. They also know of many other things that happen at the same time. So if their software detects that response time go up, it can look at disk throughput, DB cpu, swap, load average, number of connections, etc etc.
When they see that CPU goes up together with response times – Tada! Root cause found!

First problem with this approach: You can’t look at correlation and declare that you found a cause. Enough said.

Second problem: If you collect so much data (and often these systems have millions of measurements) you will find many correlation by pure chance, in addition to some correlations that do indicate a common issue.
What these vendors do is ignore all the false findings and present the real problems found at a conference as proof that their method works. Also, you can’t reduce the rate of false-findings without losing the rate of finding real issues as well.

Note that I’m not talking about tools like Tanel Poder visualization tool. Tools which makes it easier for the DBA to look at large amounts of data and using our brain’s built in pattern matcher to find correlations. I support any tool that assists me in applying my knowledge to large sets of data at once.

I have a problem with tools that use statistical correlation as a replacement to applying knowledge. It can’t be done.

Here’s the kind of tool I’d like to see:
Suppose your monitoring tool will give you the ability to visually browse, filter and explore all that data you collect in ways that help you troubleshoot. The tool will remember the things you looked at and the steps you took. After you solve the problem, you can upload the problem description and your debug process to a website. You can even mark away the dead-ends of the investigation.

Now you can go to that website and see that for problem X, 90% of the DBAs started by looking at v$sesstat and 10% ran trace. Maybe you can even have a friend network, so you can see that in this case Fuad looked at disk utilization first while Iggy checked how much redo is written each hour.

If you are not into sharing, you can still browse your own past problems and solutions for ideas that might have slipped your mind.

I think that a troubleshooting tool combined with “collective wisdom” site can assist experienced DBAs and improve the learning curve for junior DBAs without pretending to automate away knowledge and experience.

 

Excited about NoCOUG Winter Conference February 4, 2010

Filed under: nocoug — prodlife @ 1:39 am

NoCOUG is hosting its winter conference next week – On February 11th.
As usual, we’ll have the best speakers and presentations ever. This time I’m extra happy because two of the speakers that are going to be there, Dr. Neil Gunther and Robyn Sands, are there because I was wowed by them in a previous conference and asked our Director of Conference Programming to invite them. And they agreed! I believe it is the first time that either of them presents at NoCOUG and I’m very excited about this.

I’m sure I don’t need to introduce Robyn Sands to any Oracle professional – She’s an OakTable member who talks a lot about the right ways to manage performance. She is very scientific and precise but she gives very practical advice that is very applicable.

Dr. Neil Gunther is a well known performance expert. So well known that he has his own Wikipedia article. I first ran into his work when I did performance testing work, something like 6 years ago. From his articles, I learned the importance of having performance models without which you cannot interpret your results and know when your tests were faulty. I ran into him again when Tanel Poder mentioned that Dr. Neil Gunther is now doing work that will be relevant to Oracle professionals. He appeared in HotSos few years back and now we get to see him at NoCOUG – with both a keynote session and a technical session. He invited the crowds to ask questions at his blog, so you can participate.

In addition to these two prestigious names, we have few local celebrities giving presentations: Ahbaid Gaffoor, lead DBA at Amazon, will show his make-based deployment methodology. If you don’t have a deployment methodology, this presentation is a must-see. Maria Colgan will give a presentation about data loading for data warehouses. Although she’s an Oracle presenter, which sometimes means “marketing”, Maria is smart and knowledgeable and if you are doing data warehouse work – she is worth listening to.

I’ll be presenting “What Every DBA Should Know About TCP/IP Networks”. The presentation is about network problems I’ve had to solve in the last year and how I solved them with some basic knowledge of networks, a packet sniffer and an envelope. If you ever wondered how to make your network admin take you seriously, how to get more bang from your bandwidth and whether or not you should care about your SDU, you should definitely show up.

I’m looking forward to meeting old and new friends in the conference. Its going to be a blast.

 

 
Follow

Get every new post delivered to your Inbox.

Join 48 other followers