A DBA from another team went on vacation, and asked me to do few urgent tasks for him while he is gone.
One of the tasks was to refresh test schema from production data. In his email he commented that in his team the process includes opening a change in the Change Management system for this task “in case we drop the production schema instead of the test schema by mistake”.
I was very surprised by this. In my team, the ticket asking for the refresh was considered enough documentation, and there was no need to open a change. Also, if you regularly drop the wrong schema, opening changes will not help you.
I asked around for explanation for the different procedure. Turned out that few month ago someone from the other team dropped the wrong schema by mistake. It was recovered from backup with no issues, but it still caused an hour or two of downtime for the users. Which means that we need to open an Incident, and of course, every Incident has to contain action items for preventing same issue from reoccurring. In this case, the manager who reviewed the incident noticed that there was no Change open for the schema refresh, which means that the DBA did not follow the right procedure! The natural action item for the change was “Instruct DBAs on Change Management procedures”. DBAs were properly instructed (at least some of them) and are now opening a change before dropping the wrong schema.
Which just goes to show how procedures can’t replace common sense.
It started at 5pm, because all interesting production issues start at 5pm. As the on-call DBA, I recieved an alert about over used undo tablespace on one of our databases. First thing to do in such case is check v$transaction for transactions that have been running for a while. Turned out that v$transaction is empty.
The next suspect was the monitor itself. Maybe there is a bug in the query that checks undo tablespace usage? Checking the query, I noticed two interesting things:
1. It reported over 3G of used undo tablespace, while I know we only have 2G.
2. It contained the following lovely gem (completely unrelated to the post, but too cute not to mention): case when to_number(value) < 3600 then to_number(value)/60/1440 else to_number(value)/3600/24 end
Clearly the query needed some debugging, and I set to the task.
I usually start debugging by breaking the query into component subqueries and verifying that they give the expected results. This time, I started with the part that calculates the amount of undo bytes we need to keep due to our retention policy. The following query gave me all the undo history on the system. Not just the rows that were within our retention policy:
to_char(begin_time,'dd-mm-yyyy hh24:mi:ss'),to_char(end_time,'dd-mm-yyyy hh24:mi:ss'),undoblks,txncount
US.BEGIN_TIME >= SYSDATE - (
Very strange. The calculation looks right. Lets go all the way back to basics:
SQL> select to_char(SYSDATE,'dd-mm-yyyy hh24:mi:ss') from dual ;
This is not an undo issue anymore. What happened to my sysdate? The documentation says that sysdate gets the date from the host OS. So I start there:
[oracle@vvsdb01 ~]$ date
Mon Mar 17 18:40:31 PDT 2008
OS looks fine. What about timestamp?
SQL> select sysdate from dual;
02 Mar 2004 22:23:24
SQL> select current_timestamp from dual;
17-MAR-08 05.48.49.846746 PM -07:00
Timestamp is fine, but now senior DBA noticed that the sysdate time didn’t change at all since the last time we looked at sysdate. Surely more than few minutes passed? Quick search revealed a previously unknown system parameter – fixed_date. FIXED_DATE enables you to set a constant date that SYSDATE will always return instead of the current date.
SQL> show parameter FIX
NAME TYPE VALUE
———————————— ———– ——————————
fixed_date string 02-MAR-04
Yes. We had that. Checking the alert log to see when it happened:
Mon Mar 17 16:48:51 2008
ALTER SYSTEM SET fixed_date='2004-03-02 22:23:24' SCOPE=BOTH;
We caught it right on time! Lets fix it fast:
ALTER SYSTEM SET fixed_date=NONE SCOPE=BOTH;
And now how did it happen? We had the time it happened, so we checked /var/log/secure for matching ssh logins and got the IP.
The IP was of my own machine. I was completely flabbergasted for quite some time. But I think I reconstructed how it happened:
I was reading about some initialization parameters on this site, and I have a nasty habit of playing with the mouse while reading. I must have selected about half of that page. Then I remembered that I had other work to do and proceeded to copy-paste from vim window to sqlplus. Turned out that I didn’t really copy from vim, and when I pasted into sqlplus, I actually pasted half of that webpage. I saw that I got a bunch of errors because I just pasted a bunch of random sentences into sqlplus, so I cursed a bit and tried copying from vim again. What I did not notice was that among the other things I pasted by mistake was:
SQL> alter session set nls_date_format = 'dd.mon.yyyy hh24:mi:ss';
alter system set fixed_date='2004-03-02 22:23:24';
select sysdate from dual;
the problem was right there in my terminal screen all along!
Every once in a while something happens that makes me question not only my skills as a DBA, but my sanity as well. Today’s event was one of the stranger ones that happened to me.
A coworker asked me to export a schema and import it into the same database but under a different name. Very common request, not something that is very likely to go wrong. I’m running the expdp command, finished without warnings. Time to run impdp command, with remap_schema to load the data into a new schema, again it finishes with no warnings. I log on to the database and change the password for the new user, again it finished successfully.
Great, I call the co-worker and I tell him his new schema is ready. Two minutes later he is back, and complains that he gets an error logging in. Puzzled, I try logging in as that user myself:
SQL> conn test_user_copy/secret
ORA-01017: invalid username/password; logon denied
Even more puzzled I check dba_users for the user I just created. It is not there. Blinking, I check again. Still not there. I know that I successfully ran “alter user… identified by….” on that user not two minutes ago. It is a large user – dropping the schema would take significantly longer, even if someone was so inclined, which I doubt.
I recheck the import log, and it still seems that the new user was created and the data was imported successfully. Resigned, I reran the import command. It succeeded, which means that the user really was no longer there. I altered the password again, verified that the user still exists and notified the coworker that he can try again. He confirmed that everything is fine, so I guess the user didn’t disappear this time.
Puzzled, I consulted a senior DBA with my problem. He suggested looking at the alert log, as exports and imports also write a rather cryptic line there. I checked. The alert log clearly showed only one import – the second one.
The senior DBA kindly suggested that I was imagining things when I thought I was running the first import, and maybe I should go home early today and get some rest.
I really have no explanation for this event. I’m not even planning on contacting Oracle Support with this story, it is just too strange.
While searching for something vaguely NFS related in Google, I found out that someone in Netapp has a blog. I was unable to find out what is he doing in Netapp besides blogging, but I found his posts interesting. I happen to love working with Netapp – both the appliance and the vendor, and the blog really echo the competent and friendly vibe I get when dealing with the company.