C developers with SQL code

Jeff Hunter posted an amusing story about how he optimized a query and what the developer did with his suggestions. The story is so cool not because it is one of these horrible WTF stories that are unlikely to happen in any but the most moronic IT organizations, but because this is a rather harmless oddity that happens all the time.

Most developers started their career by studying CS. This means about three years of writing code in Pascal, C, Java or whatever language CS departments prefer these days. After three years of C development you think of every problem in terms of arrays and loops, after three years of Java you think about objects and patterns. You are not even aware that you are doing this, but everything you look at instantly becomes an array with a loop, or a virtual private class.

At least in my days, databases were a highly optional course in the CS department, so the new developer on the team is probably seeing SQL for the first time in her life, after years of converting problems to arrays and loops. Unfortunately, a database table looks exactly the same as an array. The developer’s automatic response is to write a loop that goes over the array. Imagine her horror when she discovers that SQL doesn’t do loops. With few exceptions, even after learning SQL, the developer  will continue thinking of it in terms of loops and arrays. For the procedural developer, everything is a full table scan and nested joins.

At this point, his code will do funny things like fetch entire tables and iterate over the results to count the number of rows. Maybe not that funny, but close enough. In a good development organization, there will be code reviews and his manager or peers will turn him to the path of efficient database development.

In my experience, it is best that if at this fragile stage, no one will tell the developer about PL/SQL. The longer the developer has to work only with set operations within the database,  the more he will get the ideas of database developement and the better DBA he will become. Developers need to be broken of their loop addiction early and brutally. If they find out about PL/SQL too soon, they will always have the crutch of going back to their loops and will never really get DBs, or at least the process will be much delayed.

DBAs working with development teams really need to take care in educating their developers. This is really preventing performance issues before they are even written, and it is much easier to solve problems at this stage.

Meaningless Foreign Keys

Jonathan Lewis wrote an article about Meaningless Keys. It’s an old article, but I just read it today and I found it very inspiring. So immediately I started thinking about my latest project.

For the project, I created a customers table. Each customer has a unique name, but it also has a meaningless ID, generated as sequence, and this meaningless id is used as a foreign key in other tables.

Why do I have this meaningless ID? Well, by now it was a habit, but I can still remember the reason – you don’t want to duplicate the same piece of data in multiple tables. Thats a known best practice for developers. But why? Because you don’t want to waste space – but in this case it is just varchar(50) column in a table that will never reach 100,000 rows, the space is not the issue. Also because if we will want to change the customer name we will have to change it in many places. We will have to hunt down its usage in all the child tables and update them as well. Thats the real bummer – it will complicate the application and make it more bug prone.

So, maybe Jonathan Lewis got it wrong? Or maybe I got him wrong? Maybe my meaningless key is not such a bad idea after all? Too bad, it would have been nice using the customer name as the key and avoiding all these join  operations. If I could just solve the update problem. Why can’t I create a foreign key constraint with “on update cascade”?

Actually, I can.
Oracle didn’t provide this feature as built-in, but Tom Kyte wrote an update cascade package that does exactly what I just describe – it updates all the child references when you update a key on the parent table. I used it, it works great.
I would have given an example here, but this is Tom Kyte we are talking about, and of course he gave a great example that is much clearer than what I can put here. So just go check it out.

How to automate tasks

Automation has always been the holy grail of information technology departments everywhere. Sysadmins and DBAs always prefer to automate the routine and boring tasks to make more time for interesting projects. Managers also like automation, mostly because this means fewer employees can do more work, and because automated tasks usually involve fewer human errors.

Some tasks are automated to a point where they run automatically at a specific time every day without DBA present. Backups are the most common example of this. More often the DBA writes a script that will automatically do what used to be a series of steps done manually by the DBA at some point, every DBA has the “create schema” script that will create a user with the permissions and objects he is going to need.

After years of automating different tasks, I’ve come up with a method of doing this which usually brings good results, and which I always regret when I don’t follow:

1. Automate the task after you did it manually at least three times. This gives you better understanding of the different scenarios you need to prepare for and also prevent you from wasting time automating tasks that turn out not to be routine after all.

2.  Write down all the steps you do when you perform the task manually. This helps when writing the script, and also serves as good documentation later on.

3. Write a script that does the task based on the steps you wrote earlier.

4. To verify that the script indeed does the right thing I run it in debug mode – step by step, verifying that each step behaves as expected. Step n  is supposed to shut down the listener, verify that after step n the listener is really down.

5. The script should output a log with all the action taken, responses and errors. You will need this when something goes wrong.

6. Run the script several times and verify the results and log.

7. Ask someone else to review the script with you. This can help catch bugs, make sure the script is readable to other people and have the added benefit of having someone else who can maintain the script if needed.

8. If you schedule it using cron, oracle job scheduler or a similar system, add a change to your change tracking system saying that this script now runs automatically every day at 3am.

Keeping a tree in your relational DB

One of the questions that our developers face again and again is how to keep hierarchical data in a relational database. Developers always run into this question because trees are one of the basic data structure developers use and like all data structures, sometimes they need to be serialized to the disk for persistence in case the server crashes or between restarts. Unlike other data structures, trees simply don’t map well to the relational database. The models are too different, and no matter which strategy you choose you will run into problems. Here are some solutions you can try:

1. Don’t use the DB at all. Keep the tree in a separate file. Thats the first solution I recommended to our developers – the DB is not a good place for trees, try using something else. This solution failed miserably when we needed to recovery from backups and discovered that data in the tree and in the DB were now inconsistent because we couldn’t recover the DB to the exact same time the file was backed up.

2. Don’t use a relational DB. Our developers chose OpenLDAP as a good place to keep trees. It is really a good solution from the development point of view. I’m not as happy because now I have to manage both Oracle and OpenLDAP, and I don’t like OpenLDAP all that much.

3. Keep the tree as a huge XML inside the DB. Very similar to keeping the data in a file, but without the recovery and consistency issues. Just don’t expect the DBA to help you if you need to get some data from inside the XML. Although Oracle now has built in support for XML, not all DBAs are fluent with the new creature.

4. In each row, have a column with the ID of the parent of this row. Kind of like pointers in the database. Use this and you are likely to run into a very painful situation where you need to use unbelievable number of self joins for any simple operation. IO usage will probably become an issue very quickly. You can also forget about having any usable index there.

5. In each row, have a column with a string that encodes the level in the tree that this row belongs to, and the ID of the parent. I’m not sure if this is similar to the previous idea or even worse.

6. According to Microsoft technology preview SQL Server 2008 should have a built in hierarchical data type. I think it just makes it easier for developers to make the wrong decision, but maybe Microsoft will surprise us all and the new data type will be both easy to use in queries and have decent performance.

Iterations are not enough

Over at the SQL Server blogs, Andy Leonard wrote a nice article praising iterations. I’m often surprised at how often articles about SQL Server are relevant for Oracle DBAs when the databases are so different. This article however, is relevant for everyone. Not just DBAs, all developers.

I agree with Andy that software tends to improve with iterations. I know how I tend to have trouble when I start developing a new package or script, but once I get something written and start running it, debugging and improving as I go, the process becomes quicker and easier as the code gets better and better. Often, at the end, there is very little code left from the first iteration. Everything has been rewritten.

When I look at Oracle, though, I realize that iterations only go so far. There is some level of conceptual integrity, that requires a really strong up front design.

Conceptual integrity is not easy to define. Like beauty, quality and porn, you know it is there when you see it. When the different pieces of software work together in a way that really makes sense. When there are few basic concepts to the software, and when you get those you are rarely surprised by the software behavior, everything falls into place in a very understandable way.

Consider Oracle. No one could call Oracle simple, or easy to use. But, it is really not too complicated either. There is the SGA, cut up into buffer cache and library cache. There is the redo log and its archives. The control file, data files, undo space. Tablespaces and extents. Five basic background processes. You learn all that in a two day architecture course, and after that everything makes sense. Backup and recovery, IO issues, CPU use, locking and concurrency, ORA-01555, latches, everything you do as a DBA is really based on very few concepts. Oracle is really beautiful this way.

And I don’t think this simplicity can be achieved by iterations. It is perfected by iterations, many of them, and it is debugged in many many more iterations, but the ideas have to be in place first. The foundation that the building is based on. Without a strong foundation, the building will end up unstable, no matter how much you will polish and color the walls.

Data Types and Cost Based Optimization

Jeff Smith, my favorite SQL Server blogger, wrote a great post about the importance of data types in database design.

Jeff is absolutely right – developers have absolutely no excuse for not getting data type corret. If you let an Excel script kiddie design your database, you deserve to be screwed.

If you are running Oracle, instead of SQL Server, the issue of data types is even more important than you think. It is not just a matter of getting the sort wrong, taking more space than you should, or being more vulnerable to typos. It is a matter of Oracle getting the execution plan right and getting you the data in an efficient way.

Here’s a nice example:

A bug tracking application was designed to support multiple databases. Since different databases handle dates very differently, the developers took a shortcut and decided to use numeric field for the date, use something like ‘20070301’ to signify that the bug was closed at March 1st, 2007, and write tons of application code to handle all kinds of trivial date operations. Note that they could have used Unix time, seconds since 1970, and avoid all the pain I describe here, but Unix time is less human readable. The application also supported generating reports. Number of bugs opened last week, two month ago, etc.

Suppose we have two years of bugs in our database. Thats 730 different possible closing dates. But if they are kept as numbers from 20060101 to 20071231, the optimizer will be convinced that there are 11130 different possible dates.

First scenario: We want to generate a report from January 1 to June 1 2007.
When we use real dates that means a quarter of our date range and the optimizer will correctly decide that it is faster to go do a table scan than use an index.
When you use numbers, 20070601-20070101 = 500. That is 4% of the data and indexes will be used to retrieve 25% of your data. Go grab a cup of coffee and maybe a nice book (“Cost Based Oracle Fundamentals” will be a good choice), it will be a while before you see the results of that report.

Second scenario: Suppose you want just one week of data. Say, March 1 to March 7. With both data types this is a very small percent of the data and an index will be used.
But what if you want to see the week from December 28 to January 3rd? If you use dates, this is still just 1% of the data and the same plan will be used with the same quick results. If you use the numbers, however, 20070103-20061228=8875. Thats around 50% of the data – this means a full table scan. You will see lots of managers wondering why the report became so slow “all of the sudden” and “just when we need the end of the year report very urgently”.

I hope I made my point here: Using correct data types is critical, more so in Oracle, and it is so easy you really have no reason to get it wrong.

Is there a DBA in the development team?

You know a DBA didn’t participate in the design of the application or the database schema when:

  1. The primary key of the table is a numeric ID, generated by the application. Looks like the developers never heard of sequence objects.
  2. No foreign keys where two tables clearly reference each other.
  3. No indexes other than the numeric primary key, making it impossible to search for anything. Did I mention the table contained over 10 million rows?
  4. Instead of using partitions, the data is spread between about 100 tables and a view does “union all” so the application can query all the data.
  5. No bind variables are used, at all.

I’ve had all of these, and few more in today’s incident.

A user complained that he is seeing the wrong data. We noticed that he got the wrong data by searching the data table through an ID generated somewhere. We had to search a bit to find where the application got that ID because foreign keys were not used, but finally we found that the data table references an “additional data” table through an application generated id.

We discovered that few days ago the application started generating duplicate ids, and now when a user looks at the additional data he sometimes see the wrong data that is referenced by the same ID. We noticed that ids generated today duplicate ids generated 3 month ago, but we have no way of using this information because the date column is not indexed. In fact, nothing is indexed. I’m glad the “additional data” table is very rarely used.

The id should have been unique, but the data is spread through 100 tables in an attempt to mimic partitions – and the uniqueness cannot be enforced between tables.

As a temporary solution we ended up deleting the old data to clean the duplication. Not too easy with 100 tables, 10 million rows and no index in sight. I hope we bought enough time for the application developers to fix the issue.

Dear developers,

Oracle has many wonderful features. It has sequence numbers, it has constraints that can be used to enforce data integrity, it has partitions, it has indexes. Please make good use of those features.

Production DBA