Data Types and Cost Based OptimizationPosted: July 10, 2007
Jeff Smith, my favorite SQL Server blogger, wrote a great post about the importance of data types in database design.
Jeff is absolutely right – developers have absolutely no excuse for not getting data type corret. If you let an Excel script kiddie design your database, you deserve to be screwed.
If you are running Oracle, instead of SQL Server, the issue of data types is even more important than you think. It is not just a matter of getting the sort wrong, taking more space than you should, or being more vulnerable to typos. It is a matter of Oracle getting the execution plan right and getting you the data in an efficient way.
Here’s a nice example:
A bug tracking application was designed to support multiple databases. Since different databases handle dates very differently, the developers took a shortcut and decided to use numeric field for the date, use something like ‘20070301’ to signify that the bug was closed at March 1st, 2007, and write tons of application code to handle all kinds of trivial date operations. Note that they could have used Unix time, seconds since 1970, and avoid all the pain I describe here, but Unix time is less human readable. The application also supported generating reports. Number of bugs opened last week, two month ago, etc.
Suppose we have two years of bugs in our database. Thats 730 different possible closing dates. But if they are kept as numbers from 20060101 to 20071231, the optimizer will be convinced that there are 11130 different possible dates.
First scenario: We want to generate a report from January 1 to June 1 2007.
When we use real dates that means a quarter of our date range and the optimizer will correctly decide that it is faster to go do a table scan than use an index.
When you use numbers, 20070601-20070101 = 500. That is 4% of the data and indexes will be used to retrieve 25% of your data. Go grab a cup of coffee and maybe a nice book (“Cost Based Oracle Fundamentals” will be a good choice), it will be a while before you see the results of that report.
Second scenario: Suppose you want just one week of data. Say, March 1 to March 7. With both data types this is a very small percent of the data and an index will be used.
But what if you want to see the week from December 28 to January 3rd? If you use dates, this is still just 1% of the data and the same plan will be used with the same quick results. If you use the numbers, however, 20070103-20061228=8875. Thats around 50% of the data – this means a full table scan. You will see lots of managers wondering why the report became so slow “all of the sudden” and “just when we need the end of the year report very urgently”.
I hope I made my point here: Using correct data types is critical, more so in Oracle, and it is so easy you really have no reason to get it wrong.