Installing Oracle 11g

I didn’t install Oracle 11g. Augusto Bott over at Pythian group installed 11g on Ubuntu, which is really cool considering how Ubuntu is not officially supported and all. What’s the challenge in installing Oracle on a supported platform?

Apparently, the challenge is finding Oracle 11g to install. I’ve no idea how Augusto got his copy. I guess you have to have insider access to Oracle, because I searched OTN today and all I found was a bunch of whitepapers. Fascinating reading, but not what I wanted. I thought 11g had already been released, but it doesn’t seem to be fully available yet.

So, here I am, itching to play with the new toy, and all I can do is read articles about how the other kids are having tons of fun with it.

If any of my kind readers happens to be an Oracle insider and knows how to get a copy, please feel free to drop me a line, or better yet – a CD!


Iterations are not enough

Over at the SQL Server blogs, Andy Leonard wrote a nice article praising iterations. I’m often surprised at how often articles about SQL Server are relevant for Oracle DBAs when the databases are so different. This article, however, is relevant for everyone – not just DBAs, but all developers.

I agree with Andy that software tends to improve with iterations. I know how I tend to have trouble when I start developing a new package or script, but once I get something written and start running it, debugging and improving as I go, the process becomes quicker and easier as the code gets better and better. Often, at the end, there is very little code left from the first iteration. Everything has been rewritten.

When I look at Oracle, though, I realize that iterations only go so far. There is a level of conceptual integrity that requires a really strong up-front design.

Conceptual integrity is not easy to define. Like beauty, quality and porn, you know it is there when you see it. The different pieces of software work together in a way that really makes sense. There are a few basic concepts to the software, and once you get those you are rarely surprised by the software’s behavior; everything falls into place in a very understandable way.

Consider Oracle. No one could call Oracle simple, or easy to use. But it is really not too complicated either. There is the SGA, cut up into the buffer cache and the shared pool (with its library cache). There is the redo log and its archives. The control file, data files, undo space. Tablespaces and extents. Five basic background processes. You learn all that in a two-day architecture course, and after that everything makes sense. Backup and recovery, IO issues, CPU use, locking and concurrency, ORA-01555, latches, everything you do as a DBA is really based on very few concepts. Oracle is really beautiful this way.

And I don’t think this simplicity can be achieved by iterations. It is perfected by iterations, many of them, and it is debugged in many, many more iterations, but the ideas have to be in place first. The foundation that the building stands on. Without a strong foundation, the building will end up unstable, no matter how much you polish and color the walls.


Few notes on migrating from SQL Server to Oracle

Ibrahimdemir suggested I should describe how we migrated from SQL Server to Oracle. Here are a few notes on the subject. Obviously, there are lots of details here – the move took anywhere from six months to two years, depending on where you start counting, so I’m only mentioning the highlights.

1) Motivation: The official reason for moving was that Oracle released version 10.2. It became clear that many companies were successfully deploying RAC systems and that RAC represented a real solution for database high availability. High availability was such an issue with SQL Server that management agreed to the project with surprising speed.
Less officially, we were five Oracle DBAs who found ourselves managing a production environment based almost entirely on SQL Server. We hated that and wanted a way out. RAC was simply the bait for upper management.

2) The applications: Our largest application was supposed to support Oracle out of the box, but we needed to change the JDBC driver to one that supports RAC. During tests we found other errors related to Oracle being different from SQL Server (lots of issues around case sensitivity, which completely slipped my mind when I wrote the previous post), but mostly the change to the application was rather painless.
Other applications needed to be changed – anything from small modifications to the connection string to a complete rewrite, depending on how the application used different database features.
We had to convert around a dozen applications, and the most difficult part was hunting them all down. Just yesterday, five days before we were due to shut down the SQL Servers, we found two more applications that had somehow been forgotten.

3) Deciding to migrate: At first we didn’t plan to migrate at all. We planned to put new customers on Oracle and keep existing customers on SQL Server, with the understanding that at some point the SQL Server environment would die naturally.
We lived with both environments for about a year, and around that time two interesting things happened – we discovered that a two-node RAC could carry the same load as six SQL Server machines while giving significantly better performance, and the data center started pressuring us to use less electricity. The decision was made – move everything to Oracle and shut down the SQL Servers.

4) How to migrate: By the time we decided to migrate, we already knew exactly how the applications should work on Oracle. The correct schemas were already in place. The only question was how to move the data. Our development team took a very controversial approach (I’m still not convinced it was the right one) and wrote two utilities for the move.

The first utility could dump an SQL Server schema to a bunch of CSV files, and then load the CSV files into a ready schema on the Oracle side. The utility moved only data, not definitions (table structures, indexes, constraints, etc. were not included), but that was OK because we already had the right definitions in place on the Oracle side.

The second utility could “listen” for new data coming into SQL Server and copy it to Oracle. We used sequence (and identity) numbers to keep track of which rows had already been copied.
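A minimal sketch of the idea, with a made-up ORDERS table and bind variable (the real utility was a custom program, not a single query):

-- On the SQL Server side, fetch only rows added since the last copy,
-- relying on the ever-increasing identity column:
SELECT *
FROM orders
WHERE order_id > :last_copied_id
ORDER BY order_id;
-- each returned row is inserted into the matching Oracle table,
-- and :last_copied_id is advanced to the highest ORDER_ID copied so far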

Using both utilities, we could copy a customer over to Oracle, run both systems in parallel for a few days until the customer was happy with Oracle, and then cut off the data duplication and remove the SQL Server schema.

We moved over 200 customers this way, one by one, over the course of six months.

5) Summary: We thought we had it all nailed down and were ready for the big shutdown on Wednesday. However, this morning we discovered two applications that had been forgotten, so the shutdown was delayed by two weeks (which is rather costly, because the data center really wants us to cut down on electricity). During this time we have to make the necessary changes in the applications, test them and make the move. We hope that this time it will really be it.


RAC tricks – rolling patch with a shared home

We had to apply yet another opatch, but we are only allowed 4 hours of downtime per system per month and we had already used our monthly budget, so we needed a way to apply the patch without any downtime.

On some of our systems, this is not an issue. They are RAC systems where each node has its own $ORACLE_HOME on its own server. We take one node down, apply the patch, start the node, then stop the other node, apply the patch, and start the other node. The patch is installed on both nodes, with no downtime for our customers. Win-win.

But what do we do about our other systems? The ones which share a single $ORACLE_HOME on a filer? Where we need to take both nodes down for applying the patch?

A co-worker came up with a brilliant idea:

1. Stop one node.
2. Use the filer’s power to duplicate $ORACLE_HOME.
3. Connect the stopped node to the new home – just change the mount in /etc/fstab; the database will never notice the difference.
4. Apply the patch in the new home and start the database there.
5. Stop the second node, connect it to the new home as well, and start it there.

We now have a patched DB with no downtime on a shared-home system! We even have a built-in rollback – connect one node after the other back to the old home, where we didn’t apply the patch. In my experience, rollback of opatches doesn’t always work, so having a sure rollback plan is a great bonus.

We tested it today in a staging environment and it seems to work well. Now we just need to convince management that we should do it in production. It looks like a great solution, but in my experience management hates approving any plan that does not appear in the Oracle manuals. For all their talk of innovation and “thinking outside the box”, they are a very conservative bunch. I can understand the extreme risk aversion of IT management, but if you never do anything new, you can never improve, and that’s also risky.


More on Data Pump

A few Data Pump limitations that I ran into and that you probably want to know about:

1. If you are using Oracle 10.2 and you want to write your exports to a directory that is on an NFS mount, it must be a hard mount. In 10.1 soft mounts used to work, and the upgrade caused us significant headaches.

You see, our exports directory sits on an old and slow NetApp (R200), and our storage manager gives us terabytes of space there, but without any promise of high availability. This is fine for exports, but you can’t afford to hard mount storage that doesn’t have high availability: once it goes down, processes hang in uninterruptible sleep and we need to reboot the machine to get rid of them. So with 10.2 Data Pump we are stuck with either hard mounting the exports directories and taking that risk, or negotiating with the storage manager for space on a highly available device. I’m considering an interruptible hard mount, which should give us the best of both worlds, but I haven’t tested it yet. I hope Oracle plans to bring back the option of configuring Data Pump directories on soft mounts.
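For reference, an interruptible hard mount is just a couple of mount options in /etc/fstab; something along these lines (the filer and path names are made up):

filer01:/vol/exports  /exports  nfs  rw,bg,hard,intr  0 0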

2. Importing statistics takes unreasonable amounts of time. We had to export and import one schema. Collecting statistics on this schema takes around 5 minutes. Importing the statistics using Data Pump took over 20 minutes. Oracle’s documentation fails to mention this little bit of information. I really hope this issue is fixed in 11g.

Workaround: import with the “exclude=statistics” flag and collect statistics later.
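For example (the schema, directory and file names here are made up), the import and the follow-up statistics gathering look roughly like this:

impdp system schemas=APP_OWNER directory=DP_DIR dumpfile=app_owner.dmp exclude=statistics logfile=app_owner_imp.log

and then, once the import is done, from SQL*Plus:

EXEC DBMS_STATS.GATHER_SCHEMA_STATS(ownname => 'APP_OWNER');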

3. Data Pump doesn’t write errors to standard output. At one point we believed that Data Pump was completely stuck, until we checked the alert log and discovered that we had run out of space in a tablespace. Data Pump itself never mentioned this little problem.

4. Data Pump can hang when another process is collecting statistics while Data Pump is trying to import statistics. Note that Oracle 10g has a daily job that collects statistics by default.
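If you suspect you hit this, one option is to disable the default job for the duration of the import and enable it afterwards (GATHER_STATS_JOB is the 10g default job name; check what actually runs on your own system):

EXEC DBMS_SCHEDULER.DISABLE('SYS.GATHER_STATS_JOB');
-- run the Data Pump import here
EXEC DBMS_SCHEDULER.ENABLE('SYS.GATHER_STATS_JOB');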

5. If you are considering using multiple parallel workers when importing, you need to do your homework: multiple Data Pump workers don’t always behave as you might expect. This article by Carol Palmer gives a good explanation of the issues involved, and of when using multiple workers may bring no benefit and may even slow you down.

Don’t get me wrong, I love Data Pump. It is one of the best things that happened to my job this year, literally saving me hours of boring work. You just need to be aware of its few catches.


To upgrade or not to upgrade

So 11g was finally announced and it seems to have some very cool features. Database Replay is the largest and most attractive new feature; maybe it will finally allow us to have an accurate staging environment. Direct NFS also sounds interesting, especially for the added instrumentation and monitoring opportunities. Tablespace encryption sounds good, especially if it allows us to get rid of the scary encryption device we currently use. Aside from completely new features, Oracle promises improvements to features I already enjoy – Data Pump, partitioning, Automatic Memory Management, online index builds and lots more.

I haven’t gotten around to installing a test instance of 11g yet, but I’m planning to, as I can’t wait to try it hands-on.

Contrary to what Oracle believes, we are not even thinking of installing it on anything that remotely resembles a production system until Oracle releases the first patch set, or maybe even the second. Oracle managed to teach the entire DBA community to avoid new versions like the plague, and new features even more so. I thought Oracle knew this, but now I’m not so sure. I opened an SR on Metalink regarding a bug I found in 10.2 that needed a fix (it was a rather horrible bug which completely prevented us from collecting statistics), and Oracle Support claimed that the bug would not be fixed in 10.2 and that we should upgrade to 11g, which would be released this month. We insisted that we wouldn’t upgrade to 11g within the next year and that we would need statistics sooner, and eventually they did release an opatch, but it was a rather unpleasant experience.

While we don’t plan to upgrade to 11g soon, it does raise an interesting question about the systems that we never got around to upgrading from 9i to 10g. Do we upgrade them to 10g, or stay on 9i until 11g becomes stable? I prefer to upgrade, because my work is easier when all of my systems are on the same version (and you get addicted to features like regular expressions very quickly), but the customers know that 11g has been released and are pushing to wait a bit and then get the latest.

If you are planning to install 11g on production any time soon, the following article by Dizwell will probably change your mind.  In any case, you probably want to read Arup Nanda’s 11g new features series.


Rounding dates

We are shutting down most of the SQL Servers, moving our applications and customers to Oracle. This move has been going on for a while and is targeted to finish at the end of the month. One of the applications that will have to move is our SLA calculations. That’s an important and highly visible app – our executives follow the results, so we really don’t want to mess this one up. The downside is that moving it means translating a bunch of code written in T-SQL for SQL Server to Oracle’s PL/SQL. To make things even better, the application involves a lot of date manipulation, and SQL Server and Oracle have very different approaches to dates. So this task has been tossed back and forth for a while, because no one really wanted to get their hands dirty with it, until it firmly landed on my desk, with a two-day deadline.

So, one of the things that had to be translated was a query that groups the last day of data into 15-minute intervals. I spent a few minutes trying to translate the code to Oracle on an operation-by-operation basis before I gave up and decided to write it from scratch. Since my solution is both generic (it works for any minute interval, not just 15) and much cuter than the other solutions I found in various forums, I decided to post it here:

Round down to the previous 15 minutes (i.e. 4:10 becomes 4:00):
TRUNC(SYSDATE,'HH24') + (TO_CHAR(SYSDATE,'MI') - MOD(TO_CHAR(SYSDATE,'MI'),15))/(24*60)

Round up to the next 15 minutes (4:10 becomes 4:15):
TRUNC(SYSDATE,'HH24') + (TO_CHAR(SYSDATE,'MI') + 15 - MOD(TO_CHAR(SYSDATE,'MI'),15))/(24*60)
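And to show how this fits the original task of grouping the last day of data into 15-minute intervals, here is a sketch with made-up table and column names:

SELECT TRUNC(sample_time,'HH24') + (TO_CHAR(sample_time,'MI') - MOD(TO_CHAR(sample_time,'MI'),15))/(24*60) AS interval_start,
       COUNT(*) AS events
FROM sla_events
WHERE sample_time > SYSDATE - 1
GROUP BY TRUNC(sample_time,'HH24') + (TO_CHAR(sample_time,'MI') - MOD(TO_CHAR(sample_time,'MI'),15))/(24*60)
ORDER BY interval_start;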


Data Types and Cost Based Optimization

Jeff Smith, my favorite SQL Server blogger, wrote a great post about the importance of data types in database design.

Jeff is absolutely right – developers have no excuse for not getting data types correct. If you let an Excel script kiddie design your database, you deserve to be screwed.

If you are running Oracle instead of SQL Server, the issue of data types is even more important than you think. It is not just a matter of getting the sort order wrong, taking more space than you should, or being more vulnerable to typos. It is a matter of Oracle getting the execution plan right and getting you the data in an efficient way.

Here’s a nice example:

A bug tracking application was designed to support multiple databases. Since different databases handle dates very differently, the developers took a shortcut: they decided to use a numeric field for the date, with something like 20070301 signifying that the bug was closed on March 1st, 2007, and to write tons of application code to handle all kinds of trivial date operations. Note that they could have used Unix time (seconds since 1970) and avoided all the pain I describe here, but Unix time is less human-readable. The application also supported generating reports: number of bugs opened last week, two months ago, etc.

Suppose we have two years of bugs in our database. That’s 730 different possible closing dates. But if they are kept as numbers from 20060101 to 20071231, the optimizer will assume the values are spread evenly over a range of 11130 possible values.

First scenario: We want to generate a report from January 1 to June 1, 2007.
When we use real dates, that is about a quarter of our date range, and the optimizer will correctly decide that it is faster to do a full table scan than to use an index.
When we use numbers, 20070601 - 20070101 = 500. That looks like about 4% of the range, so an index will be used to retrieve what is actually 25% of the data. Go grab a cup of coffee and maybe a nice book (“Cost Based Oracle Fundamentals” would be a good choice); it will be a while before you see the results of that report.

Second scenario: Suppose you want just one week of data, say March 1 to March 7. With both data types this is a very small percentage of the data and an index will be used.
But what if you want to see the week from December 28 to January 3rd? If you use dates, this is still just about 1% of the data and the same plan will be used, with the same quick results. If you use the numbers, however, 20070103 - 20061228 = 8875. To the optimizer that looks like roughly 80% of the range – which means a full table scan. You will see lots of managers wondering why the report became so slow “all of a sudden” and “just when we need the end-of-year report so urgently”.
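If you want to see this for yourself, here is a sketch (table and column names are made up): store the closing date twice, once as a DATE and once as a number, load a couple of years of data, gather statistics, and compare the optimizer’s row estimates for the year-boundary week:

CREATE TABLE bugs (
  bug_id      NUMBER PRIMARY KEY,
  closed_date DATE,
  closed_num  NUMBER(8)   -- e.g. 20070301
);

-- after loading the data and gathering statistics:
EXPLAIN PLAN FOR
SELECT * FROM bugs WHERE closed_date BETWEEN DATE '2006-12-28' AND DATE '2007-01-03';
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);

EXPLAIN PLAN FOR
SELECT * FROM bugs WHERE closed_num BETWEEN 20061228 AND 20070103;
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);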

I hope I made my point here: Using correct data types is critical, more so in Oracle, and it is so easy you really have no reason to get it wrong.


Is there a DBA in the development team?

You know a DBA didn’t participate in the design of the application or the database schema when:

  1. The primary key of the table is a numeric ID, generated by the application. Looks like the developers never heard of sequence objects (see the sketch after this list).
  2. No foreign keys where two tables clearly reference each other.
  3. No indexes other than the numeric primary key, making it impossible to search for anything. Did I mention the table contained over 10 million rows?
  4. Instead of using partitions, the data is spread across about 100 tables, with a view doing a “union all” so the application can query all the data.
  5. No bind variables are used, at all.
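For the record, the sequence, constraint and index mentioned in items 1 to 3 cost a handful of statements; the names here are made up:

CREATE SEQUENCE data_seq;

-- used instead of the application-generated ID when inserting
INSERT INTO data_table (id, additional_data_id, created)
VALUES (data_seq.NEXTVAL, :additional_data_id, SYSDATE);

-- documents and enforces the relationship (assuming additional_data.id is the primary key)
ALTER TABLE data_table ADD CONSTRAINT fk_data_additional
  FOREIGN KEY (additional_data_id) REFERENCES additional_data (id);

-- an index on a column we actually search by
CREATE INDEX data_created_idx ON data_table (created);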

I’ve had all of these, and a few more, in today’s incident.

A user complained that he was seeing the wrong data. We noticed that he got the wrong data by searching the data table for an ID that was generated somewhere. We had to search a bit to find where the application got that ID, because foreign keys were not used, but finally we found that the data table references an “additional data” table through an application-generated ID.

We discovered that a few days ago the application started generating duplicate IDs, and now when a user looks at the additional data he sometimes sees the wrong row, referenced by the same ID. We noticed that IDs generated today duplicate IDs generated three months ago, but we have no way of using this information because the date column is not indexed. In fact, nothing is indexed. I’m glad the “additional data” table is very rarely used.

The ID should have been unique, but the data is spread across 100 tables in an attempt to mimic partitions – and uniqueness cannot be enforced across tables.
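Spotting the duplicates themselves is easy enough through the “union all” view (the view name is made up), even if fixing them is not:

SELECT id, COUNT(*)
FROM all_data_view
GROUP BY id
HAVING COUNT(*) > 1;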

As a temporary solution we ended up deleting the old data to clean the duplication. Not too easy with 100 tables, 10 million rows and no index in sight. I hope we bought enough time for the application developers to fix the issue.

Dear developers,

Oracle has many wonderful features. It has sequence numbers, it has constraints that can be used to enforce data integrity, it has partitions, it has indexes. Please make good use of those features.

Yours,
Production DBA


Recovering from Crash

A poor soul landed on my blog after searching for “oracle db won’t come up after crash”. I can just imagine the lonely DBA sitting in front of the crashed DB without a clue how to proceed.

Bummer, isn’t it? The best solution is to find the nearest experienced DBA and get him to help you. But if you don’t have one around, here’s my take on the situation:


It could be that the same issue that caused the crash still exists. Lack of disk space and other storage issues are the first things that come to mind, but I’m sure there can be others.


Another possibility is that while the DB was up you made some changes to the SPFile, and now that it tries to start with the new SPFile, those changes prevent it from starting. In this case, I’d try starting from a backup of the spfile or init.ora (you did back up these very important files, right?).


The most common possibility is that nothing prevents the DB from starting, but when it crashed it got into a slightly inconsistent state and now it needs some recovery. You can use O’Reilly’s “Recovering Oracle” page, which is very comprehensive and covers almost every scenario.

Most likely you have a damaged data file, and when you try to do “alter database open” Oracle will tell you exactly which data file is damaged. In this case I hope you have your archive log files ready, and you can do "Recover datafile '<datafile-name>';". You can also try recovering to a known good time before the crash by doing “Recover database until time '2001-03-06:16:00:00'“.
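A minimal sketch of that recovery, from SQL*Plus as SYSDBA, assuming your archived logs are available and the file name is whatever Oracle complained about (the path here is made up):

STARTUP MOUNT;
RECOVER DATAFILE '/u01/oradata/prod/users01.dbf';
ALTER DATABASE OPEN;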


There are tons of other options – recover until cancel, bring datafiles back from backup, bring the control file back from backup, etc. – which is why I recommend looking up O’Reilly’s page in emergencies, and reading Oracle’s Backup and Recovery book cover to cover while you are not handling emergencies.

Understanding Oracle’s concepts of datafiles, redo logs, log archives and control files goes a long way toward having calm and successful recoveries.

I also really recommend practicing recovery from backup on a development system every month or two, because if you are used to doing recovery it will be that much easier to do it when disaster strikes.