I'm just a simple DBA on a complex production system

Writing about all things production. Especially Oracle databases.

Deliberate Practice February 10, 2010

Filed under: musing — prodlife @ 3:27 am

Recently I did some soul searching about my expertise as a DBA. I am not talking about my knowledge, my talents and my work style. I’m talking about which things I’m really comfortable doing. The commands I know by heart, the issues I ran into so often that I can diagnose with the tiniest clues.

There are definitely things I’m very good at. Diagnosing why RAC crashed or wouldn’t start. Solving a range of different problems with Streams. User managed recoveries. Netapp. BASH. Top. sar. vmstat. Redo log mining. Datapump. ASH and its relatives AWR and ADDM. Using Snapper to work with wait event interface. SQL coding, network diagnosis, Patching.

These are mostly things I do every day or close enough to it that the commands, the techniques, the traps and the limitations are always clear in my mind. But there are things that I do rarely or even never. This are important DBA skills, some are even very basic, which I do not have because they are not very useful in my specific position.

These include RMAN, ASM, Dataguard, AWK, perl, python, PL/SQL, tracing, SQL tuning, upgrade testing, benchmarks, many Linux administration tools, hadoop and those new NoSQL things, MySQL, Amazon’s cloud databases, RAT, partitions, scheduler.

These are all things that I know something about, that I’ve read about – but I can’t say I’m confident with any of these because I simply haven’t played with them all that much. After all, you learn by doing and running into issues – not by reading people say how everything works perfectly when they use it.

In order to widen my skill set a bit, I’m planning to take time this year to deliberately practice some of the technologies I didn’t use much last year. I’m thinking of taking anything from few weeks to few month per technology. I’ll invent and look up “lab exercises” for the specific topic and then proceed to spend the month practicing. Think of it as a “poor DBA’s Oracle University”.

This is the opposite of what I’ve been doing until now which can be described as “read Oracle’s Concepts book”. Reading the concepts book is great, but I’m at a point where I feel that I know tons of theory and need to spend some quality time grappling with its various applications. There seems to be a lot of research that shoes that best way to become an expert is with deliberate practice, so this year – I’ll practice.

Unless I’ll run into something really exciting, I don’t expect to blog about this adventure. After all, if I start posting scores of trivial AWK scripts, I doubt if anyone will keep reading my blog. But I thought that maybe some of my readers will enjoy joining me in my practice. So I opened this mailing list.

If you are also interested in practicing with me, feel free to join the mailing list. I’ll announce a “topic of the month”. This month its AWK, next month probably RMAN, we’ll see how this goes. I’ll send to the list the questions I’m planning to solve. If you join, you can send in your own questions. We’ll send our answers to our own and others questions with a week delay (so everyone will have time to practice on his own first). We’ll discuss and compare our answers. We’ll cheer each other as we improve our skills. I will not try to sell you anything.

This is an obvious ploy. I need people on the list so I’ll be accountable to this practice. Otherwise I’ll probably forget it by next week. Feel free to join just to watch me stumble and remind me that this is what learning looks like and it will be worth it.

Here’s to a year of practice and hopefully better skills that will follow.

 

Automated Root Cause Analysis February 5, 2010

Filed under: musing — prodlife @ 11:15 pm

I’ve ran into multiple products that claim to offer automated root cause analysis, so don’t think that I’m ranting against a specific product or vendor. I have a problem with the concept.

The problem these products are trying to solve: IT staff spend much of their time trying to troubleshoot issues. Essentially finding the cause of effects they don’t like. What is causing high response times on this report? What is causing the lower disk throughputs?

If we can somehow automate the task of finding a cause for a problem, we’ll have a much more efficient IT department.

The idea that troubleshooting can be automated is rather seductive. I’d love to have a “What is causing this issue” button. My problem is with the way those vendors go about solving this issue.

Most of them use variations of a very similar technique:
All these vendors already have monitoring software, so they usually know when there is a problem. They also know of many other things that happen at the same time. So if their software detects that response time go up, it can look at disk throughput, DB cpu, swap, load average, number of connections, etc etc.
When they see that CPU goes up together with response times – Tada! Root cause found!

First problem with this approach: You can’t look at correlation and declare that you found a cause. Enough said.

Second problem: If you collect so much data (and often these systems have millions of measurements) you will find many correlation by pure chance, in addition to some correlations that do indicate a common issue.
What these vendors do is ignore all the false findings and present the real problems found at a conference as proof that their method works. Also, you can’t reduce the rate of false-findings without losing the rate of finding real issues as well.

Note that I’m not talking about tools like Tanel Poder visualization tool. Tools which makes it easier for the DBA to look at large amounts of data and using our brain’s built in pattern matcher to find correlations. I support any tool that assists me in applying my knowledge to large sets of data at once.

I have a problem with tools that use statistical correlation as a replacement to applying knowledge. It can’t be done.

Here’s the kind of tool I’d like to see:
Suppose your monitoring tool will give you the ability to visually browse, filter and explore all that data you collect in ways that help you troubleshoot. The tool will remember the things you looked at and the steps you took. After you solve the problem, you can upload the problem description and your debug process to a website. You can even mark away the dead-ends of the investigation.

Now you can go to that website and see that for problem X, 90% of the DBAs started by looking at v$sesstat and 10% ran trace. Maybe you can even have a friend network, so you can see that in this case Fuad looked at disk utilization first while Iggy checked how much redo is written each hour.

If you are not into sharing, you can still browse your own past problems and solutions for ideas that might have slipped your mind.

I think that a troubleshooting tool combined with “collective wisdom” site can assist experienced DBAs and improve the learning curve for junior DBAs without pretending to automate away knowledge and experience.

 

Excited about NoCOUG Winter Conference February 4, 2010

Filed under: nocoug — prodlife @ 1:39 am

NoCOUG is hosting its winter conference next week – On February 11th.
As usual, we’ll have the best speakers and presentations ever. This time I’m extra happy because two of the speakers that are going to be there, Dr. Neil Gunther and Robyn Sands, are there because I was wowed by them in a previous conference and asked our Director of Conference Programming to invite them. And they agreed! I believe it is the first time that either of them presents at NoCOUG and I’m very excited about this.

I’m sure I don’t need to introduce Robyn Sands to any Oracle professional – She’s an OakTable member who talks a lot about the right ways to manage performance. She is very scientific and precise but she gives very practical advice that is very applicable.

Dr. Neil Gunther is a well known performance expert. So well known that he has his own Wikipedia article. I first ran into his work when I did performance testing work, something like 6 years ago. From his articles, I learned the importance of having performance models without which you cannot interpret your results and know when your tests were faulty. I ran into him again when Tanel Poder mentioned that Dr. Neil Gunther is now doing work that will be relevant to Oracle professionals. He appeared in HotSos few years back and now we get to see him at NoCOUG – with both a keynote session and a technical session. He invited the crowds to ask questions at his blog, so you can participate.

In addition to these two prestigious names, we have few local celebrities giving presentations: Ahbaid Gaffoor, lead DBA at Amazon, will show his make-based deployment methodology. If you don’t have a deployment methodology, this presentation is a must-see. Maria Colgan will give a presentation about data loading for data warehouses. Although she’s an Oracle presenter, which sometimes means “marketing”, Maria is smart and knowledgeable and if you are doing data warehouse work – she is worth listening to.

I’ll be presenting “What Every DBA Should Know About TCP/IP Networks”. The presentation is about network problems I’ve had to solve in the last year and how I solved them with some basic knowledge of networks, a packet sniffer and an envelope. If you ever wondered how to make your network admin take you seriously, how to get more bang from your bandwidth and whether or not you should care about your SDU, you should definitely show up.

I’m looking forward to meeting old and new friends in the conference. Its going to be a blast.

 

Notes about Hadoop January 27, 2010

Filed under: musing — prodlife @ 11:16 pm

My notes from two presentations given to the data mining SIG of the local ACM chapter.

Hadoop is a scalable fault-tolerant grid operating system for data storage and processing.

It is not a database. It is more similar to an operating system: Hadoop has a file system (HDFS) and a job scheduler (Map-Reduce). Both are distributed. You can load any kind of data into Hadoop.

It is quite popular – the last Hadoop Summit had 750 attendees. Not bad for a new open-source technology. It is also quite efficient for some tasks. Hadoop cluster of 1460 nodes can sort a Terabyte of data in 62 seconds – currently the world record for sorting a terabyte.

Hadoop Design Axioms:

  • System will manage and heal itself. (Because using commodity hardware – failure is inevitable).
  • Performance will scale linearly. (With few limitations).
  • Compute should move to data (Processing job should run on the machine holding the data to process)
  • Simple core. Modular and extensible

HDFS:
Distributed file system. Block size is 64M (!). User configures replication factor – each block is replicated on K machines (K chosen by user). More replication can be configured for hot blocks.
A name node keeps track of the blocks and if a node fails the data on it will be replicated to other nodes.

Map-Reduce:
Distributes jobs. It tried to run jobs local to their data to avoid network overhead. It also detects failures and even servers running behind on the processing. If a part of the job is lagging in processing, it will start copies of this part of the job on other servers with the hope that one of the copies will finish faster.

Hadoop Ecosystem:

  • HBase: Google’s big table implementation. Key-value based. Good for quick lookups, but not batch processing. Transactional.
  • Pig, Hive, Scoop: Different languages. Map-Reduce is like assembly – High performance, low-level, contains too much details for most tasks. Hive is SQL language for Hadoop.

Hadoop vs. RDBMS?
RDBMS – expensive, structured, fast, interactive, has standards, transactional.
Hadoop – affordable, unstructured, scalable, resilient. Solves both storage and processing.

Hive and Hadoop at Facebook
Facebook got 200GB of data each day as of March 2008. Thats a lot of data to manage. Facebook philosophy is that more insights can be achieved from running simpler algorithms on more data.

Why Hadoop? Cost of storage. Limitations of data-analysis systems. Many systems have limited scalability. And they were closed and propitiatory.

Why not map-reduce? Not many developers have experience with it. Needed well known schemas and structure.

Hive was built on top of Hadoop to solve these problems. It saves metadata and adds SQL. Also allows integrating with other systems. Hive has tables, which have partitions which hold buckets. Buckets are used for sampling. Hive is very extensible. You can have user defined functions, types, objects, etc.

Hive does optimizations – join order, different processing for skewed data. The optimizer is rule based and uses hints. It also does some kind of dynamic sampling. You can look at the explain plans for the jobs and use that for tuning. Hive uses columnar compression.

Hive support integrations with JDBC, ODBC and Thrift.

It lacks resource management and needs monitoring to catch and kill “bad” jobs.

Concurrency wise, the idea is that you insert data, “publish” it and from the moment it is published everyone else can see it – but it cannot be modified or deleted. This means no read/write contention.

 

Build Less (DB Design Version) January 20, 2010

Filed under: musing — prodlife @ 1:28 am

37Signals, the company behind few highly successful web-based applications, has published a book about their business building experience. Knowing that the company is both successful and has an unconventional business development philosophy, I decided to browse a bit.

One of the essays that caught my attention is “Build Less”. The idea is that instead of having more features than the competition (or more employees or whatever), you should strive to have less. To avoid any sense of irony – the essay is very short :)

One of the suggestions I would add to the essay is:
“Keep less data”

Keeping a lot of data is a pain. Indexing, partitioning, tuning, backup and recovery – everything is more painful when you have terabytes instead of gigabytes. And when it comes to cleaning data out, it always causes endless debates on how long to keep the data (3 month? 7 years?) and different life-cycle options (move to “old data” system? archiving? how to purge? What is the process?).

What’s more, a lot of the time customers would really prefer we won’t keep the data. Maybe its privacy concerns (when we keep a lot of search history) or difficulty in generating meaningful reports or just plain confusion caused by all those old projects floating around.

Google taught us that all the data should be stored forever. But perhaps your business can win by keeping less data.

 

RMOUG Presentations January 19, 2010

Filed under: advert, network, presentation, statistics — prodlife @ 11:47 pm

Like many other DBAs, I’ll be attending RMOUG training days conference on Feb 17-18 in Denver. I’ll give two presentations in the conference. On the same day, just thinking about it makes me exhausted.

The first presentation is “Everything DBAs need to know about TCP/IP Networks”. Here’s the paper and the slides. I’ll also present this at NoCOUG’s winter conference in Pleasanton, CA. Maybe you prefer to catch me there.

The second presentation is “Analyzing Database Performance using Time Series Techniques”. Here’s the paper and the slides.

I still have time to improve the presentations and papers – so comments are very welcome :)

 

Damnit Method for Writing Papers January 12, 2010

Filed under: musing — prodlife @ 8:18 pm

So today was this date when I was supposed to send RMOUG the papers I’m going to present at the conference.
Normally I’m pretty good about having papers ready well in advance, and indeed the time-series paper was done few weeks back.

But there was the other paper. “Everything a DBA Needs to Know about TCPIP Networks”. I’ve been avoiding it to the point that I only finished writing it this morning. Which is totally unlike me.

Its not that I dislike the topic. On the contrary – I love TCP and am very passionate about it. This blog is full of posts where I explain TCP. Its nearly as much fun as queue theory and CPU scheduling. And its not like I had nothing to say – I had tons of stories and examples and network captures and techniques and even few opinions to share.

The problem was that I also had a bunch of things that I did not want to say, but felt like I have to.

You see, I started working on the paper by thinking about my experience. I had few war stories in mind where knowledge of networking saved the day, and I planned of sharing the stories and the knowledge with the audience. So far so good.

But when I started planning and outlining the paper, I shifted my thinking from my knowledge and stories to the potential audience. What do “they” need to know and to see and to understand. Generally it is a good idea to think about the audience when writing papers and preparing presentations, but this also meant that a lot of stuff that I did not want to write about found its way into the paper outline.

All of the sudden writing my paper meant that I have to write about things that I don’t want to write about, and the paper feels like a school paper and I’m avoiding it just so I don’t have to write about all that stuff.

So yesterday, when it looks like I’ll never finish the paper on time because I can’t even start writing it, I decided to take a second look at the plan. I figured out that the only way the paper will ever be finished is if I’ll give myself full permission not to write about things that I don’t want to write about.

It went like this:

  • I’m not writing yet another introduction to networking. I’m not describing any layers model, I’m not drawing IP datagrams or 3-way-handshake diagrams or anything with arrows going back and forth. Enough people described layers and IP and TCP basics and I don’t have to. This paper will not include basics, damnit!
  • I don’t care if every single DBA on the planet is tuning SDU sizes. I can say that other people are doing it, but I still think it is a waste of time and I won’t do it. No demo code for how tunining SDU size can improve performance, damnit!
  • I’m not doing any tuning that involves calculating exactly how many bytes go into each packet and how much space goes into the header. Its annoying and a waste of time on any known network. I’m not counting header sizes, damn it!
  • I’m giving some tips on how to use packet capture software, but I’m not teaching everything about Wireshark. Some things are out of scope, damn it!

After that I felt much better and proceeded to write the paper in about 4 hours. Amazingly, even without all those topics, I still have a fairly large paper full of content barely covered elsewhere and which is demonstrated with my own stories and examples.

I hope that if you ever have to write a paper, you will also give yourself full permission to write only about what you see as the interesting stuff and to skip things that bore you. I really believe that a paper that was fun to write is also much more useful for the reader.

 

Autocorrelation and Causation December 7, 2009

Filed under: musing, statistics — prodlife @ 7:48 pm

Everyone talks about how correlation doesn’t imply causation, but no one says what autocorrelation implies.

Maybe its because we don’t talk about autocorrelation at all :)

Lets start talking about autocorrelation by saying what it is:
First of all, autocorrelation is a concept related to time series. Time series is an ordered series of measurements taken at intervals over time. You know how we measure CPU utilization every 10 minutes and then display nice “CPU over time” graphs? thats a time series.
Autocorrelation is the correlation between points in the time series. So we can compare every point in the time series to the measurement taken 10 minutes later, 20 minutes later, etc. And we can find out that every point in our graph is strongly correlated with the measurement taken 10 minutes later and the point taken 60 minutes later.

But does that imply causation?
We normally don’t assume that the current value of the CPU caused the value that the CPU has in 60 minutes. It makes much more sense to assume that there is a third factor that causes the CPU to peak every 60 minutes. This effect is also called seasonality. The weather today is strongly correlated with the weather on Dec 9th 2008. The third factor in this case is the circles our planet makes around the sun.

However the autocorrelation with the point immediately following the current value, often does imply causation of sorts. If you want to make a good guess about the value of IBM stock tomorrow, your best bet is to guess that it will be the same as the value today. Stock values usually have very strong short-term autocorrelation, and we can say that tomorrow’s value is todays value plus some error. IBM stock prices are normally stable, so the error is normally small. So you can say that today’s stock value is caused by today’s value. In a similar way the CPU in 2 minutes can be predicted to be identical to the CPU right now.

I’m hesitant to call this “causation”, because although the stock price today does cause the stock price of tomorrow (plus an error!), the “real” cause is that stock prices and cpus behave in a specific way. On the other hand, we know that they behave in a specific way because we measured the autocorrelation, modeled it and made predictions that work. So in two important uses of causation, understanding the behavior of the thing we measured and making predictions, we can say that we have a cause-and-effect relation. Albeit a bit less intuitive that usual.

If you dig the idea of explaining and predicting CPU and other important performance measurements by using only the measure itself without looking for other explaining factors, then you should definitely attend my presentation about time series analysis at RMOUG. I’ll show exactly how we find autocorrelations and how to predict future values and we’ll discuss whether or not this is a useful method.

 

Goodness of Data December 2, 2009

Filed under: Analysis — prodlife @ 4:16 pm

I’m working on my time-series analysis presentation for RMOUG, and one of the topics I may include (or may not, because it is only marginally relevant) is that of data quality.

You cannot do good analysis and get meaningful results if your data is distorted. If you start your analysis with garbage you’ll end up with garbage.

So, before starting your analysis, you have to look at the data and make sure it doesn’t have any obvious problems. One of the favorite ways of doing this is by literally looking at the data. Plotting a graph of the data is the fastest and easiest way to spot issues.

What kinds of issues?

  1. Outliers: It is easy to spot outliers in a graph (especially if you use a box plot), but not all outliers are bad data. It is important to differentiate between extreme but legitimate data and bad measurements. In order to do that, you really need to understand the data you are looking at and the system it describes. There is a world of difference between “Yes, the system did hit 75 load average that morning” and “Oh yeah, thats the morning when we hit a freaky problem with the /proc system and top reported 75 load average even though the system was not loaded at all”. Some outliers don’t offer a ready explanation – the data says load average was 75, but you have no clue if it was a real issue or not. In those cases my tendency is to err on the side of including the data – if I can’t explain why the data is bad and should be excluded, then I keep it.
  2. Missing data: Missing data isn’t always bad. Systematically missing data is fine – ASH samples Oracle sessions every 10 seconds, so you can say that data in between is missing. But thats just random sampling and all analysis tools deal with this, so you are safe. The bad kind of missing data is the biased missing data – if you monitor your sessions from an external tool that queries v$session, it is likely that when the load is very high, the tool will be unable to connect and query your database. So the data isn’t randomly missing – it is always missing the points of the highest load. Your data set will lack the most important data, and worse it will show your system load as much lower than it really is. Obviously any analysis based on this data will be hopelessly flawed.
  3. Breaks: By breaks I mean specific points in time where the nature of the data completely changed. Example: You are looking at monthly response-times data, and at Nov 15th someone upgraded the SAN. Before the upgrade average response time was 8s with standard deviation of 4s, but after the average response time was 3s with standard deviation of 0.01s. It should be obvious that you can’t analyze November as one single time series, because the behavior of the system changed dramatically. Any forecast made based on the first half of the month will be completely irrelevant.

How do we fix the bad data? Here are few relatively simple suggestions.

  1. Replace outliers and missing data with average value. Note that if your data has strong trend or seasonality (higher load on Monday morning for example), you will need to use local average values because overall average will be meaningless. The process of replacing missing data (or outliers) with meaningful average data is also known as interpolation. You do this whenever you display your data as a graph with one continuous line instead of a series of dots. You only sampled the system every 10 seconds, but your show the graph as if you also have all the data in between. This is probably the most common and most intuitive way of fixing data.
  2. If the missing data has a bias (data is only missing when the load is very high) then replacing missing data with averages is not a good idea. You know that the missing data was not average. In this case you can replace the missing values with maximum values of the data you did measure.
  3. If the data has breaks in it, analyze each part of the data separately and fit a different model to each part. In forecasts you will probably want to use only the latest model (if you have reasons to believe that this behavior will continue into the future).
 

Mad Troubleshooting Skillz! November 24, 2009

Filed under: musing — prodlife @ 3:06 am

Last week I attended Tanel Poder’s Advanced Oracle Troubleshooting seminar, organized by NoCOUG.

Well, actually it was organized by me, with lots of help from Iggy and the rest of NoCOUG. Organizing a seminar was not trivial, but wow – it was totally worth it!

When the seminar was over, we asked the attendees to fill a small survey and tell us what was good and what was bad. Turns out that there was just a single answer for “What was the best thing about the seminar”. The answer is – Tanel.

He is clearly an expert. He loves what he is doing. He can think on his feet and answer all sorts of more-or-less related questions from the audience. He is extremely generous with his time and his knowledge and his scripts (I know lots of DBAs who guard their scripts with their lives, Tanel happily shares them in his blog). Tanel is also funny, entertaining and the course is pretty well structured. He also talks very fast (and seems somewhat obsessed about not wasting a single millisecond) – you won’t believe how much you’ll learn in two days.

I had some trouble explaining explaining to my team and boss what I learned:
“Well, the first half day was troubleshooting hangs and slowdowns using v$views”
“But you knew that before!”
“We also learned lots of Oracle internals. How the shared pool really works, and how SQL is really processed. Lots of cool stuff.”
“OK, but what is it good for?”
“Knowing how stuff works is always good. Anyway, I also learned to use Unix tools to debug problems. Like dumping process stuck to see where it hangs! Also, we learned how to handle free memory issues. Remember that awful leak we had on that test server?”
“No.”
“Oh! Look! Shiny scripts!”

But the proof of the pudding is in the eating. No one can argue the fact that last week I already managed to troubleshoot and solve two problems that other team members failed to make much progress on. I did it very quickly too.
Now here is the strange thing – the two problems were in areas of Oracle that Tanel very explicitly did not mention during the seminar. Streams and Clusterware. I did not even use any of his scripts to shoot them. And yet I’m still convinced that the reason I was so effective in solving those problems is directly related to the seminar. How is that? Here are the important non-technical things I learned at the seminar:

  1. Systematic approach – You don’t work off lists, you don’t waste time by looking at random places and you don’t guess (much). You gather symptoms, you use them to pinpoint the problem and you use the pinpointed knowledge to work your way toward a solution. The last part sometimes involves Oracle support. I knew about the systematic approach thing before, but two days of looking at someone demonstrating it makes a difference.
  2. Don’t believe anyone (except the OS) – Users lie, other DBAs lie, even Oracle sometimes lies. Always crosscheck and double check the facts. No one lies intentionally, but the result is still misleading.
  3. Problems have causes. I know it sounds funny, but very often we stop troubleshooting too soon, attributing a problem to mysterious unknown forces or at least say “well, I don’t know how to know this” and leave things at that. Tanel went farther than anyone I’ve ever seen by saying “I have to know why this behaves like that” and when Oracle doesn’t tell him, he goes to the OS, or the network, or writes his own tools. Thats a good lesson – don’t take no for an answer.
  4. All DBAs have tons of troubleshooting scripts. Real experts have scripts with very short names and very flexible arguments. They also have a script for reminding them how to use their scripts
  5. I no longer view trouble as something annoying that wastes my time and prevents me from doing stuff I want to do. Instead every trouble is now cherished as an opportunity to practice what I learned, learn more and polish my skills.

I highly recommend Tanel’s course to DBAs who want to suddenly become the best troubleshooters in their team. Its not a comfortable position to be in (suddenly a lot more trouble finds its way to you), but it can be lots of fun.