Sometimes I like asking myself: if I had to manage large databases on a deserted island, what tools would I take with me? What do I really use day in and day out, and what would I not want to admin my DBs without?
Let's start building the system from the ground up:
NetApp storage is a must. I'm addicted to many of NetApp's capabilities, such as snapshots and flexible cloning, and it would be difficult to live without them. I also prefer my storage on NFS.
The servers would be Intel, 64 bit, lots of memory, several fast network cards. I don’t care about much more than that.
Linux, of course. I’m used to RH, but will work with anything. Bash, perl, gvim, sar, vmstat, top, ps, ssh and all my scripts and aliases should be included. Don’t forget to install rlwrap.
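Since I mentioned rlwrap: it wraps readline around any command-line tool, which sqlplus badly needs. A sketch of the aliases I'd drop into my shell profile (nothing Oracle-specific about it):

```shell
# give sqlplus and rman command history and line editing via rlwrap
alias sqlplus='rlwrap sqlplus'
alias rman='rlwrap rman'
```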
The DB? Oracle, of course. 10.2.0.3 works great, but 11 may be more fun. Can I take both? Throw in a few patches. Throw in the relevant Oracle documentation and books. Can I also download the PSOUG library?
It is not absolutely mandatory, but performance tuning goes much faster with proper tools. Either Confio’s or Oracle’s will do.
That's it. I'm ready to rock.
Needless to say, I did not get paid by any vendor for mentioning them here. On the contrary, my company pays them quite a bit and they deliver the goods.
It is all about programming practices. Very early in their training application developers learn three important paradigms:
- Generality – Solutions should work in as many new situations as possible
- Premature optimization is the root of all evil
- Goto considered harmful
These are very good programming practices that work great in many development situations. Except when it comes to databases. With databases, the first two practices are trouble. Here’s how:
- Application developers want their code to be very general and adaptable, so they can't depend on any vendor-specific DB features. This mostly means that the developers will be forced to reinvent the wheel at every step of the way, or write a huge, expensive database abstraction layer, or even worse – use Hibernate.
- More generality: Often application developers think that the idea of having separate tables for different conceptual objects and different columns for specific attributes of these objects is a form of evil hard coding. Their solution is to use a few “generic” tables, like this one:
create table things (thing_id number,
                     thing_type varchar2(30),
                     attribute_name varchar2(30),
                     attribute_value varchar2(4000));
Or sometimes, they will add columns named userdata_1 … userdata_n to an otherwise well designed table. Just in case.
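To see why this makes DBAs twitch, look at what it takes to get a single conceptual object back out of such a table. A sketch, assuming a generic one-row-per-attribute table like the one above (all the names here are illustrative):

```sql
-- Reassembling one "customer" takes a self-join per attribute,
-- and the optimizer knows almost nothing about the data it is joining:
select n.attribute_value as customer_name,
       c.attribute_value as customer_city
from   things n,
       things c
where  n.thing_id       = c.thing_id
and    n.thing_type     = 'CUSTOMER'
and    n.attribute_name = 'NAME'
and    c.attribute_name = 'CITY';
```

Every attribute you want in the result adds another join to the same huge table.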
- Of course, if someone points out that this design will never scale and will cause the application to run very slowly, the application developer will scoff: “This is premature optimization! Knuth said it is evil! If it turns out slow, we will get the DBA to add a few indexes and buy more memory.”
DB developers have a different view of things. For us, performance and scalability are not features number 18 and 27 on a list. For DBs, performance and scalability are everything. There is no other reason to even have a DB, not to mention an expensive one like Oracle.
No optimization is ever premature on a database, and we prefer to start optimizing when we do the design and modeling. The question “which way is faster” is usually asked when we try to decide between several design ideas or different ways to write the same query. On the other hand, we optimize on an entirely different level than C developers do, and only very rarely resort to inlining functions or embedding assembly code.
DB developers remember very well that if the data model is good, all other optimizations are much easier. Often great queries will simply write themselves on a good model, but even when they don’t, it is easy to see which indexes should be added and how queries should be tuned when the underlying model is clear.
We also know that DBs work best on specifics. The more information about your data the DB has, the better you can get it to perform, and the better you can guard the consistency of your data. While application developers like to “generalize” their programs by pretending they don't have information that they actually have, DB developers prefer to use every piece of information they can get their hands on, and then extract more information and promises from their customers.
Good DB developers are vendor specific. They know one database very well, they know all the features, limitations, concepts and behaviors. They know how to make it run very fast and how to avoid problems that can only happen on this specific DB. They can tune it to death and get maximum value out of every last feature the vendor put in. Being good engineers, we will not let this knowledge go to waste.
No wonder it is very difficult for application developers and DB developers to get along!
What the application developer considers a best practice, DB developers call a complete waste of resources and a hopelessly slow application. When DB developers talk about their best practices, application developers think this is the root of all evil.
Of course, I didn't do justice to the topics I covered. Tom Kyte has several best-selling books about how to approach database development and design. But I think I had a new insight about the source of the problem: application developers are not stupid, they just had a different indoctrination. I thought it was worth sharing.
A year ago I was handling a rather confusing performance issue. It was the usual problem – a query that took forever to run. While checking the issue, I discovered that one of the main tables this query touched had amazing size variance: at different times it held anything between 3 rows and 3 million rows. This is obviously a design mistake that can indeed cause performance problems. An experienced DBA might have configured a job to collect statistics very frequently, or done something about bind variable peeking, or just gone to the developers and demanded an immediate fix.
I'm not an experienced DBA. I noticed that the indexes were highly fragmented, with lots of deleted leaf rows, and so I decided to rebuild the indexes. Online, so users would not be affected, and parallel, so it would go faster.
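The fragmentation check itself was the standard one – a sketch of what I looked at, with the index name made up, and remembering that ANALYZE ... VALIDATE STRUCTURE blocks DML on the table while it runs:

```sql
analyze index orders_idx validate structure;

-- index_stats holds the results of the last validate in this session
select name,
       lf_rows,
       del_lf_rows,
       round(del_lf_rows * 100 / nullif(lf_rows, 0), 1) as pct_deleted
from   index_stats;
```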
It was a disaster. First of all, rebuilding online does affect performance. A lot. Users did notice. The table was effectively locked until we killed the rebuild. It's been a year, and Oracle support is still silent about whether this is expected behavior of an online rebuild. Also, since the index was rebuilt with the parallel option, many of the queries started running in parallel, which also had horrible effects, and we had to fix that issue too.
After we cleaned up the entire mess, I got a stern lecture from the Senior DBA: “You can't solve performance problems by randomly rebuilding indexes. You must thoroughly understand the problem first.”
Fast forward to this weekend. A customer complained about very slow queries. A quick check of v$sql showed that each execution retrieved about 100 rows but did about 10 million buffer gets. Ouch. The explain plan looked reasonable, and I couldn't see any cause for the excessive gets. I decided to leave it until Monday, run a trace, and see what was really happening there. I never got to run that trace. The Senior DBA got home later that day, saw the emails about that query, and immediately decided to rebuild indexes!
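The v$sql check is the easy part – something along these lines, with the threshold picked to taste:

```sql
-- statements doing a suspicious amount of work per execution
select sql_id,
       executions,
       round(buffer_gets    / executions) as gets_per_exec,
       round(rows_processed / executions) as rows_per_exec
from   v$sql
where  executions > 0
and    buffer_gets / executions > 100000
order  by gets_per_exec desc;
```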
Needless to say, rebuilding indexes works great when you really understand the problem. It took 2 hours, caused no additional performance issues, and after it was done, the query required only 1000 buffer gets.
I still have some suspicions that the rebuild didn’t change much and what really caused the difference was the fact that Senior DBA also collected statistics on the entire table and all its indexes immediately after the rebuild, but we can’t really know, and as far as I saw – the plan did change.
Regardless of the exact facts, this reminded me of an AI Koan:
A novice was trying to fix a broken Lisp machine by turning the power off and on. Knight, seeing what the student was doing, spoke sternly: “You cannot fix a machine by just power-cycling it with no understanding of what is going wrong.” Knight turned the machine off and on. The machine worked.
(Tom Knight is one of the inventors of the Lisp machine.)
You can read more AI Koans here. Highly recommended.
I've been working with RAC since my first day as a DBA. My first task was to install a RAC server (it took well over a week), and since then I've installed dozens of RAC servers (more than anyone I know), and I spend 90% of my time maintaining them.
I've had lots and lots of trouble with RAC, but at the end of the day I love RAC, with all of its marvelous complexity. I love the ability to do rolling maintenance quite easily, without any downtime for our customers, and I love the technology and the amazing ideas behind it.
Which is why everyone was very surprised that, in a large DBA and management meeting, I suggested replacing RAC with Data Guard for HA. The reason for the meeting was cost-cutting. The objective was clear – reduce the cost per customer. We do (mostly) shared hosting, so to reduce the cost per customer we need to either reduce the cost of a single system or put more customers on one system. Both options are viable, and I'll also write a post about how to max out an existing system, but it seemed to me that replacing RAC with other HA alternatives would be a very immediate way of cutting costs without significantly lowering availability.
Mogens Nørgaard wrote an amazing article a while back: “You Probably Don't Need RAC“. I've spent the days since the last meeting reading it again and again, trying to prepare a rock-solid case that our system can host more customers, with more availability and lower costs, without RAC. I've also found a good post by the storage guy on the same subject; he is feeling the pain with 11g RAC. The real irony is that at the same time I'm claiming we don't need RAC at all, I'm still proceeding with our 11g clusterware tests, because a DBA should always be prepared.
A year later Mogens wrote another good article, this time about how difficult it is to get his message accepted; he also mentions that the discussion tends to be emotional and non-technical. I hope my experience won't be as bad as his, but I suspect it may be worse. In addition to the usual difficulty of getting people to participate in a serious technical discussion (it is a lot of work to prepare a serious technical case, and much easier to resort to rhetoric), the entire team that made the RAC decision three years ago is still around, and saying “we made a wrong decision and stuck with it for three years” is difficult at the best of times. And Oracle sales will get involved sooner or later: the difference between our RAC and non-RAC cost is very high (especially since we are talking about many servers), and I can't see Oracle accepting the sudden loss of revenue without a fight.
Maybe Oracle is fighting back already? A while back Kevin Closson wrote a number of articles about RAC – its performance, high availability, maintenance, etc. I remember they were very good, but I made the mistake of not saving a copy, assuming they would always be there. Unfortunately, many good articles are no longer available.
I’m doing Log Buffer again this Friday! Don’t forget to visit for the hottest DB blog posts of the week.
I started two “drop user cascade” commands – one from SQL Developer on my machine, the other by ssh-ing to the server and running sqlplus locally (a normal run; I didn't use nohup or anything like that). Each of the users had about 70,000 objects, so a few hours later, when I had to go home, I closed my SQL Developer (reasoning that this couldn't possibly cause a rollback), left the ssh session open, disconnected my laptop from the network, and went home.
The next morning, the user I was dropping from SQL Developer had the same number of objects as it did when I closed the client the night before. But the user I dropped through ssh and sqlplus was completely gone. Obviously, the server process behind the ssh session didn't notice that the client went away, and continued to run.
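The lesson I took from this: don't rely on that behavior either way. If a long-running drop has to survive your client, detach it from the terminal explicitly – a sketch, with the script name made up:

```shell
# run the drops on the server itself, immune to closed clients
# and dropped network connections
nohup sqlplus "/ as sysdba" @drop_users.sql > drop_users.log 2>&1 &
```

(screen on the server works just as well.)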
Steven Feuerstein’s PL/SQL book has a remarkably good chapter about collections. In addition to showing the different kinds of collections, their methods and how to use them, he also shows some cool non-trivial uses for collections, a taste of the great things you can do.
A few of the examples involve the use of collections for caching. Associative arrays are perfect for caching, and caching is a useful programming technique. One of the examples is as follows (the code is available on O'Reilly's site, so I don't think I'm breaking copyright here):
FUNCTION description (code_in IN hairstyles.code%TYPE)
   RETURN hairstyles.description%TYPE
IS
   return_value hairstyles.description%TYPE;

   FUNCTION desc_from_database RETURN hairstyles.description%TYPE
   IS
      CURSOR desc_cur IS
         SELECT description FROM hairstyles WHERE code = code_in;
      desc_rec desc_cur%ROWTYPE;
   BEGIN
      OPEN desc_cur;
      FETCH desc_cur INTO desc_rec;
      RETURN desc_rec.description;
   END;
BEGIN
   RETURN descriptions (code_in);
EXCEPTION
   WHEN NO_DATA_FOUND THEN
      descriptions (code_in) := desc_from_database;
      RETURN descriptions (code_in);
END;
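For the excerpt to compile, descriptions has to be an associative array declared outside the function, at the package level. That declaration is not part of the excerpt, but it would look roughly like this (the type name here is mine):

```sql
TYPE description_aat IS TABLE OF hairstyles.description%TYPE
   INDEX BY PLS_INTEGER;   -- assuming hairstyles.code is numeric

descriptions description_aat;
```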
This code is short, simple to understand, can be adapted for many different uses, and is an excellent demonstration of the power of associative arrays.
I have just one problem with this code: Most of its logic is implemented inside the exception clause.
Technically speaking, exceptions are very similar to “goto” and historically goto has been used for exactly the same purpose – get out of the current block when things have gone wrong. Exceptions (and goto) make the code easier to read by putting all the clean up work in one place and jumping to this one place whenever needed, but other uses of exceptions can make the code more confusing.
The development standards I’m working with specify that exceptions should be used for handling exceptional conditions, not situations that will occur during the normal flow of the program. The advantage of this standard is that when someone reads the code, she can understand the main logic without jumping back and forth between different blocks. Then when she gets to the exception block she can see which problems you anticipated and how you dealt with those.
How do we know if we did this right? We look at the code and ask ourselves what would happen if we deleted the exception block and then ran it in ideal circumstances (enough memory, disk space, smart users, etc) – would it behave the same from business logic perspective? It should.
So, I would rewrite the function above as follows:
BEGIN
   IF (NOT descriptions.EXISTS (code_in)) THEN
      descriptions (code_in) := desc_from_database;
   END IF;

   RETURN descriptions (code_in);
END;
I'm not saying that Steven's original example is confusing. It is not. I enjoyed the smart use of exceptions. But I know from bitter experience that examples have a way of propagating, and beginners who see this example in the book may copy it without thinking about the finer points of code readability.
If you are not sure what belongs in the exception clause and what doesn’t, keep your business logic out of it. Even if an example in your favorite book is doing the opposite. 99% of the time, your code will be more readable.
I made a few New Year's resolutions about being a kinder, gentler person, so I'll say this as nicely as possible:
Vendors sometimes make their DB related software very challenging to maintain.
I've spent the last three days trying to run some very routine maintenance on a schema that contains data for a specific vendor application. The original time estimate was 2 hours. I was about to publish the name of the software and the company and warn you to do your best to avoid it, but I ran into similar issues last month with a different vendor's software, and three months ago it was yet another one, so I won't point fingers. If you have to maintain a database for a vendor application, you are probably screwed.
A few challenges that are common to vendor applications (if you are a vendor, do your best to avoid these traps):
- No instructions on how to configure the application if you need to move the schema from one DB to another. This is a common task. Really.
- DB Maintenance instructions in the maintenance documentation that say “Disable cost based optimization” and “rebuild indexes frequently”. DBAs read this as “Clueless vendor”.
- No built in data purging policy or good way to clean old data from schema. We can’t let the application grow as much as it wants. Unfortunately, disk space is still expensive.
- Doing a “select * from huge_table” in many trivial operations – or more than once, on the same huge table, during the same trivial operation. You do realize that this makes the application slow?
- Scary bugs in DB maintenance utilities. I realize that bugs are hard to avoid, but when an aggregation utility fails to aggregate data, it should not delete the data it just failed to aggregate. Especially not by deleting tiny batches of the data and committing after each batch, making it impossible to use flashback for recovery and forcing a full fledged recovery process.
- Maintenance utilities that do very strange things. I started the vendor’s data purging utility this morning. 6 hours later it is still running (22% complete), and yet to run a single query. What the hell is it doing?
- Undocumented maintenance utilities – the maintenance instructions tell me to run a “db reorganization tool” regularly. I have no clue what this will actually do to my DB, and as a responsible DBA I simply can't run it on production.
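For contrast, a purging utility does not have to be mysterious. A sketch of a batched, restartable purge – table name, retention and batch size are all illustrative:

```sql
-- delete old rows in small batches: undo stays bounded, and the purge
-- can be killed and restarted at any point without losing progress
begin
   loop
      delete from audit_log
      where  created_date < add_months(sysdate, -12)
      and    rownum <= 10000;
      exit when sql%rowcount = 0;
      commit;
   end loop;
end;
/
```

Document what it does, and a DBA will happily run it for you.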
As a general word of advice – dev managers of DB-intensive software would do well to read Chapter 10 of “Oracle Insights: Tales of the Oak Table”. It's called “Design Disasters” and was written by Jonathan Lewis. It contains lots of other traps to avoid that I did not mention here, because Jonathan Lewis already explained them very well indeed.