DBA on a deserted island

Sometimes I like asking myself – If I had to manage large databases at a deserted island, what tools would I take with me? What do I really use day in and day out and would not like to admin my DB without?

Lets start building the system from ground up:

Netapp storage will be a must. I’m addicted to many of Netapp capabilities, such as snapshots and flexible cloning, and it would be difficult to live without them. I also prefer my storage on NFS.

The servers would be Intel, 64 bit, lots of memory, several fast network cards. I don’t care about much more than that.

Linux, of course. I’m used to RH, but will work with anything. Bash, perl, gvim, sar, vmstat, top, ps, ssh and all my scripts and aliases should be included. Don’t forget to install rlwrap.

The DB? Oracle of course. works great, but 11 may be more fun. Can I take both? Throw in few patches.  Throw in the relevant Oracle documentation and books, can I also download the PSOUG library?

It is not absolutely mandatory, but performance tuning goes much faster with proper tools. Either Confio’s or Oracle’s will do.

Thats it. I’m ready to rock 🙂

Needless to say, I did not get payed by any vendor for mentioning them here. On the contrary, my company pays them quite a bit and they deliver the goods.


Why Application Developers Think Differently Than DB Developers

It is all about programming practices. Very early in their training application developers learn three important paradigms:

  1. Generality – Solutions should work in as many new situations as possible
  2. Premature optimization is the root of all evil
  3. Goto considered harmful

These are very good programming practices that work great in many development situations. Except when it comes to databases. With databases, the first two practices are trouble. Here’s how:

  1. Application developers want their code to be very general and adaptable. So, they can’t depend on any vendor specific DB features. This mostly means that the developers will be forced to reinvent the wheel on every step of the way, or write a huge, expansive database layer, or even worse – use Hibernate.
  2. More generality: Often application developers think that the idea of having separate tables for different conceptual objects and different columns for specific attributes of these objects is a form of evil hard coding. Their solution will be to use few “generic” tables, like this one:
    create table things (thing_id number,
    thing_name varchar(4000),
    thing_value varchar(4000),
    thing_data_type_id number);

    Or sometimes, they will add columns named userdata_1 … userdata_n to an otherwise well designed table. Just in case.
  3. Of course, if someone will point out that their design will never scale and will cause the application will run very slowly, the application developer will scoff “This is premature optimization! Knuth said it is evil! If it will be slow we will get the DBA to add few indexes and buy more memory”.

DB developers have a different view of things. For us, performance and scalability are not features number 18 and 27 on our list. For DBs performance and scalability is everything. There is no other reason to even have a DB, not to mention an expensive one like Oracle.

No optimization is ever premature on a database, and we prefer to start optimizing when we do the design and modeling. The question “which way is faster” is usually asked when we try to decide between several design ideas or different ways to write the same query. On the other hand, we optimize on an entire different level than C developers, and only very rarely resort to inlining functions and embedding assembly code.

DB developers remember very well that if the data model is good, all other optimizations are much easier. Often great queries will simply write themselves on a good model, but even when they don’t, it is easy to see which indexes should be added and how queries should be tuned when the underlying model is clear.

We also know that DBs work best on specifics. The more information about your data the DB knows, the better you can get it to perform, and also the better you can guard the consistency of your data. While application developers like to “generalizing” their programs by pretending they don’t have information that they actually have, DB developers prefer to use every piece of information they can get their hands on, and then force more information and promises from their customers.

Good DB developers are vendor specific. They know one database very well, they know all the features, limitations, concepts and behaviors. They know how to make it run very fast and how to avoid problems that can only happen on this specific DB. They can tune it to death and get maximum value out of every last feature the vendor put in. Being good engineers, we will not let this knowledge go to waste.

No wonder it is very difficult for application developers and DB developers to get along!
What the application developer considers a best practice, DB developers call a complete waste of resources and a hopelessly slow application. When DB developers talk about their best practices, application developers think this is the root of all evil.

Of course, I didn’t do justice to the topics I covered. Tom Kyte has several best selling books about how to approach database development and design. But I think I had a new insight about the source of the problem: Application developers are not stupid, they just had different indoctrination. Thought it is worth sharing.

Anecdote about Rebuilding Indexes

A year ago I was handling a rather confusing performance issue. It was the usual problem – a query that took forever to run. While checking the issue, I discovered that one of the main tables that this query was querying had amazing size variance at different times it held anything between 3 rows to 3 million rows. This is obviously a design mistake that can indeed cause performance problems. An experienced DBA might have configured a job to collect statistics very frequently, or maybe something about bind variable peeking, or just go to the developers and demand an immediate fix.

I’m not an experienced DBA. I noticed that the indexes were highly fragmented with lots of deleted leaf nodes, and so I decided to rebuild the indexes. Online rebuild, so users will not be affected and parallel – so it will go faster.

It was a disaster. First of all, rebuilding online does affect performance. A lot. Users did notice. The table was effectively locked until we killed the rebuild. Its been a year and Oracle support is still silent about whether this is an expected behavior of online rebuild. Also, since the index was rebuilt with parallel option, many of the queries started running in parallel, which also had horrible effects and we had to fix this issue too.

After we cleaned up the entire mess, I got a stern lecture from Senior DBA: “You can’t solve performance problems by randomly rebuilding indexes. You must throughly understand the problem first”.

Fast forward to this weekend. Customer complains about very slow queries. Quick check on v$sql shows that each execution retrieves about 100 rows, but does about 10 million buffer gets. Ouch. The explain plan looked reasonable and I couldn’t see any cause for the excessive gets. I decide to wait with this to Monday, run a trace and see what is really happening there. I never got to run that trace. Senior DBA got home later that day, saw the emails about that query and immediately decides to rebuild indexes!

Needless to say, rebuilding indexes works great when you really understand the problem. It took 2 hours, caused no additional performance issues, and after it was done, the query required only 1000 buffer gets.

I still have some suspicions that the rebuild didn’t change much and what really caused the difference was the fact that Senior DBA also collected statistics on the entire table and all its indexes immediately after the rebuild, but we can’t really know, and as far as I saw – the plan did change.

Regardless of the exact facts, this reminded me of an AI Koan:

A novice was trying to fix a broken Lisp machine by turning the power off and on. Knight, seeing what the student was doing spoke sternly: “You can not fix a machine by just power-cycling it with no understanding of what is going wrong.” Knight turned the machine off and on. The machine worked.

(Knight is the inventor of the Lisp machine).

You can read more AI Koans here. Highly recommended.

Log Buffer #80 – A Carnival of the Vanities for DBAs

Couple of weeks ago, when I asked Dave Edwards of the Pythian Group to let me edit Log Buffer edition #80, I just wanted to reserve an edition with a nice round number. As it turned out, I got to write Log Buffer for an incredibly exciting week.

Breaking news: Sun bought MySQL (and everyone is blogging about it).
Groklaw site posts The Details on the Deal, including the announcement, responses of the primary players and big media coverage. Brian “Krow” Aker posts very quick thoughts, Kaj Arno talked to MySQL founders, and Matt Asay on CNet blogs posted a good analysis on what it all means for the software industry. Reading the posts you can see that everyone is celebrating this incredible change for the open source database and the largest ever open source deal. Well, maybe except for Sheeri Cabral who is a bit worried. The nice thing about blogs is that you can read many opinions and pick your side – Paul Vallee is not worried and thinks it is great. Between Sheeri and Paul, Pythian Blogs provided lots of coverage for this event and if you want to read (and hear!) more about this topic, head over to the MySQL category of Pythian blogs.

The other big news of the week is that Oracle bought BEA, a bit less exciting because BEA is neither a database nor open source. Eye on Oracle provided good coverage and asks for your opinions, while Lucas Jellema of Amis Technology Blog posted several different opinions about the deal.

Oracle released another Critical Patch Update. Matt Asay of Cnet Blogs is shocked to hear that 66% of Oracle users will not install it. Pete Finnigan is not shocked at all but thinks it is time to get over our collective fear of patches. Jaykumar Vijayan on Computer World blog posts reader opinions who collectively say that when CPUs are not applied it is not the DBA fault. For those who are wondering, I will be installing Jan08 CPU on all our production servers within this quarter.

On other Oracle news: Laurent Schneider made a typo and discovered that in SQL white space can be optional, Tanel Poder thinks that this flexibility can be viewed as a feature, but Red Database Security thinks this can be security threat.

Jonathan Lewis posted great tips for troubleshooting and Richard Foote explains reverse indexes and then explains them some more.

In SQL Server blogs Decipher Information Systems posted a simple script for capturing DB size and showed how data can be moved in and out of partitions. In SQL Team blogs, Joe Webb posts instructions on how to reclaim unused space in a table, and Dan Guzman explains the significance of database owner.

More news about Open Source Databses: MySQL DBA posts about problems with version 5.0.51, and warns not to use O_DIRECT with EXT3, Xaprb explains how to speed up MySQL replication slaves and Mark Schoonover posts about a linked-in group for MySQL certified professionals.
In Database Soup, Josh Berckus tells PostgreSQL DBAs to stay updated with the current security patches. Postgres fans may also want to read David Fetter‘s blog, he posts a weekly summary of Postgres News.

Thats it for Log Buffer #80. Plenty of links this week. Remember that you don’t have to read them all at once – you have the entire weekend for catching up on the news.

Rethinking RAC

I’ve working with RAC since my first day as a DBA. My first task was to install a RAC server (took well over a week), and since then I’ve installed dozens of RAC servers, more than anyone I know, and I spend 90% of my time maintaining them.

I’ve had lots and lots of trouble with RAC, but at the end of the day – I love RAC, with all of its marvelous complexity.  I love the ability to do rolling maintenances, without any downtime to our customers and quite easily and I love the technology and the amazing ideas behind it.

Which is why everyone was very surprised that in a large DBA and management meeting, I suggested replacing RAC with DataGuard for HA. The reason for the meeting was cost-cutting. The objective was clear – reduce the cost per customer. We do shared hosting (mostly), so to reduce the cost per customer we need to either reduce the cost of a single system or put more customers on one system. Both options are viable, and I’ll also write a post about how to max-out an existing system, but it seemed to me that replacing RAC with other HA alternatives will be a very immediate way of cutting costs without significantly lowering availability.

Moans Noorgard wrote a while back an amazing article: “You probably don’t need RAC“.  I’ve spent the days since the last meeting reading it again and again, and trying to prepare a rock solid case that our system can host more customers with more availability and lower costs without RAC. I’ve also found a good post by the storage guy on the same subject,  he is feeling the pain with 11g RAC. The real irony is that at the same time I’m claiming that we don’t need RAC at all, I’m still proceeding with our 11g clusterware tests, because a DBA should always be prepared.

A year later Moens wrote another good article, this time about how difficult it is to get his message accepted, he also mentions that the discussion is very emotional and non-technical. I hope my experience won’t be as bad as his, but I think it may be worse. In addition of the usual difficulties of getting people to participate in a serious technical discussion (it is a lot of work to prepare a serious technical case, and much easier to resort to rhetorics), the entire team that made the RAC decision three years ago is still around, and saying “we made a wrong decision and stuck with it for three years” is very difficult at best of times, and then Oracle sales will get involved sooner or later, the difference between our RAC and non-RAC cost is very high (especially since we are talking about many servers), and I can’t see Oracle accepting the sudden loss of revenue without a fight.

Maybe Oracle is fighting back already? A while back Kevin Closson wrote number of articles about RAC, its performance, high availability, maintenance, etc. I remember they were very good, but I’ve made the mistake and didn’t save a copy assuming they will always be there.  Unfortunately, many good articles are no longer available.

I’m doing Log Buffer again this  Friday! Don’t forget to visit for the hottest DB blog posts of the week.

Interesting observation about “drop user cascade”

Started two “drop user cascade” one from sqldeveloper on my machine and the other by ssh to the server and running local sqlplus (normal run, didn’t use nohup or anything like this). Each of the users had about 70,000 objects, so few hours later when I had to go home, I closed my sqldeveloper (reasoning that it can’t possibly cause a rollback), left the ssh session open, disconnected my laptop from the network, and went home.

The next morning – the user I was deleting from sqldeveloper had the same number of objects as it did when I closed the client last night. But the user I deleted through ssh and sqlplus was completely deleted. Obviously the server process didn’t notice that the client went away and continued to run.

Exceptions Should Not Contain Business Logic

Steven Feuerstein’s PL/SQL book has a remarkably good chapter about collections. In addition to showing the different kinds of collections, their methods and how to use them, he also shows some cool non-trivial uses for collections, a taste of the great things you can do.
Few of the examples involve the use of collections for caching. Associative arrays are perfect for caching, and caching is a useful programing technique. One of the examples is as follows (The code is available on Oreilly’s site, so I don’t think I’m breaking copyright here):

   FUNCTION description (code_in IN hairstyles.code%TYPE)
      RETURN hairstyles.description%TYPE
      return_value   hairstyles.description%TYPE;

      FUNCTION desc_from_database RETURN hairstyles.description%TYPE
         CURSOR desc_cur IS
            SELECT description FROM hairstyles WHERE code = code_in;
         desc_rec   desc_cur%ROWTYPE;
         OPEN desc_cur;
         FETCH desc_cur INTO desc_rec;
         RETURN desc_rec.description;
      RETURN descriptions (code_in);
         descriptions (code_in) := desc_from_database;
         RETURN descriptions (code_in);

This code is short, simple to understand, can be adapted for many different uses, and is an excellent demonstration of the power of associative arrays.

I have just one problem with this code: Most of its logic is implemented inside the exception clause.

Technically speaking, exceptions are very similar to “goto” and historically goto has been used for exactly the same purpose – get out of the current block when things have gone wrong. Exceptions (and goto) make the code easier to read by putting all the clean up work in one place and jumping to this one place whenever needed, but other uses of exceptions can make the code more confusing.

The development standards I’m working with specify that exceptions should be used for handling exceptional conditions, not situations that will occur during the normal flow of the program. The advantage of this standard is that when someone reads the code, she can understand the main logic without jumping back and forth between different blocks. Then when she gets to the exception block she can see which problems you anticipated and how you dealt with those.
How do we know if we did this right? We look at the code and ask ourselves what would happen if we deleted the exception block and then ran it in ideal circumstances (enough memory, disk space, smart users, etc) – would it behave the same from business logic perspective? It should.

So, I would rewrite the function above as follows:

    If  (not descriptions.exists(code_in)) then
        descriptions (code_in) := desc_from_database;
    End if;
    RETURN descriptions (code_in);

I’m not saying that Steven’s original example is confusing. It is not. I enjoyed the smart use of exceptions. But I know from bitter experience that examples have ways of propagating, and beginners that see this example in the book may copy this bit without thinking about the finer points of code readability.

If you are not sure what belongs in the exception clause and what doesn’t, keep your business logic out of it. Even if an example in your favorite book is doing the opposite. 99% of the time, your code will be more readable.