This post originally appeared over at Pythian. There are also some very smart comments over there that you shouldn’t miss, go take a look!
This should have been the easiest task on my to-do list: install Oracle 10.2.0.3 EE standalone on a new Linux RHEL 5 server, later to be used as a standby for a production RAC system. That means two lines of “runInstaller -silent …”, less than 5 minutes of DBA work and maybe 20 minutes of waiting. I did not expect to spend over 5 hours doing this.
Problems started when I discovered that I didn’t have the 10.2.0.3 patchset, or another patch that exists on production and had to be installed on the standby. I had to wait for my Metalink credentials to be approved for this customer’s CSI before I could download the patches.
“Why don’t you just clone the software from production?” asked a helpful colleague.
Sounds like a great suggestion. I’ve cloned Oracle software before and it is a simple process: tar $ORACLE_HOME, copy the tar file to the new server, untar, run the cloning script which registers the new home with the inventory, and you are done!
In theory, at least.
Here is what actually happened:
- Tar, copy, untar, script
- Ran OPatch to verify that the new Oracle home is in the inventory and that I see the correct version and patches.
- OPatch showed two nodes. Oops. I didn’t realize the Oracle home keeps information about the cluster – didn’t Oracle move the inventory elsewhere? Spent an hour looking for the cause of this.
- Found that the two nodes are mentioned in $ORACLE_HOME/inventory/ContentsXML/oraclehomeproperties.xml
- Removed this file.
- Detached the Oracle home to clean the inventory without deleting the software.
- Ran the clone script again
- Yay! OPatch looks good now.
- Decided to create a test database to be extra sure everything is fine.
- NETCA failed with a linking error. Spent an hour figuring out why. Cursed a bit.
- Had to install the 32-bit version of libc-devel. Too bad RDA didn’t catch this.
- Created the test database, but SQLPLUS now failed with a linking error. More cursing. Wondered what I did to deserve this.
- libaio.so.1 was missing, so I had to install the 64-bit version of libaio. Too bad RDA was silent about this as well.
- Couldn’t start the database because it couldn’t find the cluster. Why was it even looking for a cluster? Spent an hour figuring out why. Ah – because I copied the software from a RAC server, it was linked as a RAC database.
- Relinked everything with the RAC_OFF option.
- Finally things are working. Too bad it is 8pm already.
What I should have done (I’m not sure whether Oracle supports this, but at least it works):
- Double check that we have all RPMs.
- Tar, copy, untar
- remove $ORACLE_HOME/inventory/ContentsXML/oraclehomeproperties.xml
- run clone.pl: clone/bin/clone.pl ORACLE_HOME=/appl/oracle/product/10.2.0/db_1 ORACLE_HOME_NAME=OraDb10g_home1
- Relink as non-RAC: make -f ins_rdbms.mk rac_off
- Verify with OPatch.
- Create test DB:
netca /silent /responsefile ~/netca.rsp
dbca -silent -createDatabase -templateName General_Purpose.dbc -gdbName TST -sid TST -SysPassword xxx -SystemPassword xxxx -emConfiguration NONE -datafileDestination /appl/oracle/oradata -storageType FS -characterSet WE8ISO8859P1 -nationalCharacterSet AL16UTF16 -memoryPercentage 40
- Go for a nice afternoon ride.
I hope that I’m not the only DBA who always has to find the most difficult way to accomplish a task, and that this post will be useful to others. Perhaps the best piece of advice I can offer is to avoid this type of cloning in the first place.
I’m definitely having one of these weeks.
We have this insanely complicated, highly visible, ultra-high-priority project going live this Wednesday. The DBA who worked on this project nearly full time for the last 6 months left for a 3-week vacation on Monday. I’m the replacement.
Tuesday morning, I’m on a conference call. The night job failed due to an issue with “dblink”. It must be fixed ASAP or the go-live will be delayed. I was all “Wait. I’m just a simple DBA. What dblink are you talking about?”.
So the morning was spent with me trying to slowly work my way through huge amounts of night-job code. While everyone around me was running in circles screaming. Mostly screaming “network” and “dblink”.
You know what I found out? That we have no clue why the night job failed. Because at the end of every single procedure in the job was code that said:
EXCEPTION WHEN OTHERS THEN
    INSERT INTO NIGHT_JOB (status, last_run_stmp) VALUES ('failed', SYSDATE);
    COMMIT;
The job failed. What more information could you possibly want?
And the funny thing that happens when you have such useful error messages is that everyone starts developing theories about why the failure occurred. Maybe someone tested the dblink a few hours after the failure occurred, noticed that the remote site was down, and decided that this was the issue. Someone else tried rerunning some of the code, got a “unique constraint” error, and guessed that this was the issue. But of course, it’s all guesses after the fact. No one can know what caused the night job to fail on its original run.
Trying to troubleshoot an issue based on guesses about what the error was is a complete waste of time – what are the chances that you’ll find the real issue, and that everything will work the next time the night job runs?
Proper error handling is a complex topic – which exceptions can be handled locally and which should be raised? At which levels do we trap exceptions? These questions are a topic of much discussion among developers and architects. But the basics of error handling should not be ignored – when an error occurs, we must know what the error was.
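Even a tiny change to that handler would have answered “what was the error?”. A minimal sketch, reusing the NIGHT_JOB table from the job’s own code; the err_msg column is a hypothetical addition:

```sql
EXCEPTION
  WHEN OTHERS THEN
    -- Capture the actual error text before anything else can change it.
    INSERT INTO NIGHT_JOB (status, last_run_stmp, err_msg)
    VALUES ('failed', SYSDATE, SUBSTR(SQLERRM, 1, 4000));
    COMMIT;
    -- Re-raise so the caller and the job scheduler see a real failure
    -- instead of a silent "success".
    RAISE;
```

In real code you would do the logging in an autonomous transaction, so that the COMMIT does not also commit the failed job’s partial work, and on 10g you could log DBMS_UTILITY.FORMAT_ERROR_BACKTRACE as well to record the failing line.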
My friends were teasing me for working in a very large company, and how we have all those “all hands meetings” with hundreds of people participating from all over the world. This reminded me of an old story that I want to share.
This happened over five years ago. I was still an application developer at the time. The customer was a huge telephony company, from an exotic country. The project was to move about 250G of data from our database to their database, and then connect it to an application over in the customer’s data center. I was on the project as the application expert. I would be working with special tools developed for this migration, and with the DBAs on the customer side. The main challenge for this project was that we were working under very tight deadlines.
Obviously the customer was warned about the amount of data involved. I also explained that we would not be doing direct loading, but bulk inserts. These generate redo, and they had to prepare for the amount of redo generated. I was very proud that although I was not (yet) a DBA, I could already explain about redo 🙂
Anyway, we started the project by creating the user, and then ran some scripts to create the empty segments, into which we would later load the data. Our 250G of data was not uniformly distributed among the 200 tables in the schema. We actually had fewer than 10 very large tables, a large number of smaller tables, and a significant number of empty tables. Halfway through the object creation our tablespace ran out of space. Anticipating large amounts of data, someone had configured HUGE extents. OK, these things happen; let’s fix that and try again. We needed to get going fast – loading the data was expected to take about 24 hours.
At around 12pm in my time zone, which was around 2am for the customer, their database crashed – it ran out of space for archive logs. I called their DBAs one after another, leaving messages, but didn’t manage to catch (=wake up) anyone. This was surprising; the DBAs I worked with in my company would wake up at any hour.
6 hours later, one of the DBAs called me back. I told him that his database was down, had been down for the last 6 hours, and that this was holding up our urgent data loading project.
His response: “Oh, this is really bad. I must call a meeting to discuss this”.
Two days later the DB was up and running and I could continue the data migration. I love the whooshing sounds deadlines make as they pass by.
At that time I was horrified at their response, because I knew that when our servers crashed, our DBAs leaped to save them; they did not call any meetings. Now I have some perspective from the DBA side of things, and I still don’t get it. We have procedures for dealing with a DB that ran out of redo space (it is a relatively common disaster), and these procedures do not include any meetings.
Kevin Closson likes my blog! This makes me very happy, not just because it drives hordes of readers my way (hey everyone! hope you like it here!), but also because Kevin’s blog is the very first Oracle blog I started reading regularly. Kevin is one of the people I most admire in this field. His knowledge of OS, network, storage and hardware is unbelievable. I’ve always been more interested in how stuff actually works than in how to do things, and Kevin’s blog is just full of this kind of knowledge.
Suppose you have a table like this:
drop table t_;

create table t_ (
  nm Varchar2(20),
  pr Char(7),
  vl Number
);

insert into t_ values ('company 1','2003-06', 10);
insert into t_ values ('company 1','2003-07', 29);
insert into t_ values ('company 1','2003-08', 39);
insert into t_ values ('company 1','2003-09', 41);
insert into t_ values ('company 1','2003-10', 22);
insert into t_ values ('company 2','2003-06', 13);
insert into t_ values ('company 2','2003-07', 17);
insert into t_ values ('company 2','2003-08', 61);
insert into t_ values ('company 2','2003-09', 55);
insert into t_ values ('company 2','2003-10', 71);
insert into t_ values ('company 3','2003-06', 33);
insert into t_ values ('company 3','2003-07', 18);
insert into t_ values ('company 3','2003-08', 27);
insert into t_ values ('company 3','2003-09', 5);
insert into t_ values ('company 3','2003-10', 32);
(Thanks to René Nyffenegger for the helpful create script)
And you want this output:
NM                          JUL        AUG        SEP
-------------------- ---------- ---------- ----------
company 2                    17         61         55
company 3                    18         27          5
company 1                    29         39         41
What you should never, ever do, is this:
select t1.nm, t1.vl jul, t2.vl aug, t3.vl sep
from t_ t1, t_ t2, t_ t3
where 1=1
  and t1.nm=t2.nm and t2.nm=t3.nm
  and t1.pr='2003-07' and t2.pr='2003-08' and t3.pr='2003-09'
group by t1.nm, t1.vl, t2.vl, t3.vl
And if you do this, and t_ has more than 15 rows, and you actually need more than 3 columns, don’t be surprised if the performance is slightly disappointing.
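For the record, the usual alternative (10g has no PIVOT clause yet) is conditional aggregation – a single pass over t_ instead of one extra copy of the table in the FROM clause per column:

```sql
select nm,
       max(case when pr = '2003-07' then vl end) jul,
       max(case when pr = '2003-08' then vl end) aug,
       max(case when pr = '2003-09' then vl end) sep
  from t_
 where pr between '2003-07' and '2003-09'
 group by nm;
```

Adding another month means adding one more CASE expression, not another self-join, so the cost grows gently with the number of columns.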
Ever since I was a young kid, I was fascinated with superheroes and I’ve always wanted to have a superpower. Now it looks like I got what I asked for: somewhere in the last year, I’ve become invisible.
We are having a lot of “End of Year” talks in the company, at all levels. Teams, departments, organizations, business, etc. You know how these go – what did we accomplish last year, what are our plans for next year. So, in the department meeting we talked about the improvements in the applications and new customer demands for the next year. In the organization level we talked more about sales, strategy and plans to cut costs.
Listening to all those discussions, you’d come to the conclusion that we have customers who pay us a lot and have high demands, and who use applications that we keep upgrading. These applications run on servers, which cost a lot, so we are moving to VMWare, and use storage, which is also expensive, but there is nothing to be done about that. Everything is hosted in a data center that costs way too much, so we will be moving to a cheaper one. We also have nice processes that we keep improving and internal tools that we are building.
It sounds like we don’t have a database anywhere. Which makes me wonder what it is I do all day. Or, more accurately, why the DBA team fails to communicate our value to the business, and whether it means that we will be the first team to go when downsizing starts.
So, we have this customer who has way too many grants and permissions on our database. This kind of thing should have been illegal, but sales sold him a solution that required it, and we were forced to cooperate.
Now this customer consumes about twice as many of our resources as all the other customers combined. Maybe more. In theory, he got all those grants so he could administer his own instance and we wouldn’t be involved with him at all; in practice, the customer and his DB developers bring the database to its knees every few days, then they blame us, and we have to go in, prove it is not us, and in the process fix the issue for them.
Now, we’ve been involved in enough of these fires to know much of their code and processes inside and out. We know where they coded to unrealistic requirements (polling a table every second when it updates every five minutes), and where the requirements were fine but the code could be optimized. In fact, we believe we can clean up their code enough to prevent many of these fires.
The customer will be happy if we help them fix the code; they don’t enjoy the frequent problems any more than we do. We’ll be happy to fix the code too – every decent DBA would prefer fixing code to spending a few hours a week killing processes and writing emails documenting what he did and what the developers should be looking into.
So why don’t we do that? Because the customer didn’t pay for us to do “DBA work” for him. He paid for the instance and for the liberal access, but not for the DBAs. So, DBAs are not allowed to fix any problem with the customer code or do any other DBA work for him. We are only allowed to shoot processes that cause CPU or IO problems on the server itself, send the customer trace files, and reboot the servers when the customer requests it. Never mind that this decision is actually costing us *more* DBA time, and therefore the company can both save money and make the big customer happy by giving him more than he asked for. Never mind that DBAs are usually much happier when they are actually allowed to do their job. The most important thing is that we are following our contracts to the letter.
Jeff the SQL Server blogger explains that distinct is not a function. You have no idea how many times I’ve seen this mistake (and even did it myself once or twice).
Too often, when I search the net for some information I need, I find something completely unrelated that catches my interest. Normal people might bookmark the interesting article for later review and continue searching for the original piece of information. But I am not normal people – I have N.A.D.D. – so I’ll read the interesting article, think about it for a bit, then write a blog entry about it, discuss it with friends, and a few hours later remember that I still need to find that original bit.
This time, the interesting article is “Oracle DBA Checklist“. It’s a nice list of things DBAs do daily, weekly, monthly, etc. At first glance it looked like the kind of thing every DBA should have, and I was starting to wonder if I’m a bad DBA because I don’t have one.
At second glance, I figured out that I hate this list and everything it stands for. No DBA should have such a list. The correct name for the list is “Monitors to set up when configuring an Oracle system”.
The first problem is the entire idea of having the DBA go through a checklist every day, or even every week. This is an unbelievable waste of time. Everything that has to be done at such frequency should be automated during the first week on the job. While the list’s author does mention that “some sites may wish to automate this”, that is not good enough. It should be phrased as “Automate the entire checklist or find another job”.
The second problem is one of frequency. They suggest checking that the instances are up once a day. I don’t know about their installation, but we can’t afford a day of downtime on our databases; our monitors check for instance availability every 10 minutes. Verifying resources should also be done more frequently – disk space problems can occur very fast and crash your instance, and high CPU can cause severe performance problems. You want to catch these sooner rather than later.
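To illustrate the kind of automation I mean, even the disk space check is a few lines of shell in cron rather than a human eyeballing df every morning. A sketch – the threshold and the cron scheduling are my own assumptions, adjust to taste:

```shell
#!/bin/sh
# Sketch: print an alert for any filesystem above a usage threshold.
# Run it from cron every few minutes and wire the output to your
# alerting (mail, pager, dashboard) so it only bothers a human when
# it actually prints something.
THRESHOLD=${1:-90}
df -P | awk -v t="$THRESHOLD" 'NR > 1 {
  pct = $5
  sub(/%/, "", pct)                                # "95%" -> "95"
  if (pct + 0 >= t) printf "ALERT: %s is %s%% full\n", $6, pct
}'
```

The same pattern – a cheap check, a threshold, output only on trouble – covers instance availability, archive log destinations, and most of the rest of that daily checklist.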
However, there were a few gems in that document – I loved the suggestion to read manuals for an hour a day. So I decided to write my own mini-checklist, despite my hatred of anything that involves routine and repetition:
- Go through mailbox and dashboard to verify that no alerts were sent during the night from our monitors and scheduled jobs.
- Read DBA manuals
- Go over the blog aggregator for interesting news and posts.
- Review reports of CPU, Disk, network, memory and response times for trends, future scalability and ideas for improvement.
The rest of the day can be dedicated to the usual disasters, urgent requests and firefighting. Oh, and maybe some system improvements, in our spare time.
I’ve run into a nice post about RTFM on the BAAG journal. Yes, I know this is not exactly news, but I think it’s going to resonate with many DBAs. However, I suspect that the RTFM post oversimplifies what is actually a rather painful issue.
Let’s start with the fact that many DBAs are not native English speakers. They know enough English to get along nicely, but reading technical documents can still be a slow and painful activity – perhaps slower than waiting for a kind soul on a mailing list.
Then there is the fact that getting meaningful answers out of Oracle documentation is a bit of an art form. There are both OTN documents and documents on Metalink. They sometimes contradict each other, so you need to verify which is more up to date, and make sure it matches your version.
If you need to know how to configure automatic memory management or how to rebuild an index, the documentation is pretty good. If you need to know what happens when you set SGA_MAX_SIZE to zero while using automatic memory management, you are in for a significant search, at the end of which you will have more information and will still end up having to make a guess that may or may not be correct. I remember looking for a good quote that would allow me to prove to my boss that if we rebuild an index with the parallel option, queries using this index can be parallelized to the same degree. I couldn’t find one, although it could be inferred by combining several paragraphs from two different books in the right way.
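For what it’s worth, the behavior I was trying to find in writing is easy to demonstrate (index name here is hypothetical): rebuilding with PARALLEL doesn’t just parallelize the rebuild itself, it leaves the parallel degree set as an attribute of the index, which the optimizer then takes into account for ordinary queries afterwards.

```sql
alter index emp_name_ix rebuild parallel 8;

-- The degree persists after the rebuild finishes:
select degree from user_indexes where index_name = 'EMP_NAME_IX';

-- Reset it if you only wanted the rebuild itself to run in parallel:
alter index emp_name_ix noparallel;
```

The DEGREE column in USER_INDEXES makes the point far more convincingly than any quote I could have dug out of the manuals.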
There is a reason for the huge market for Oracle books other than the official documentation: the official documentation is difficult to read and sometimes is not even good enough as a reference. There are a bunch of websites, blogs and magazine articles that explain information already contained in the Oracle documents. A co-worker is learning PL/SQL and asked for a good book. You can bet I told him to get Steven Feuerstein’s PL/SQL book and didn’t tell him to download Oracle’s book from OTN.
So, I agree with Simon Haslam at BAAG that people need to RTFM. I just wouldn’t recommend Oracle’s official documentation for that purpose.
You know a DBA didn’t participate in the design of the application or the database schema when:
- The primary key of the table is a numeric ID, generated by the application. Looks like the developers never heard of sequence objects.
- No foreign keys where two tables clearly reference each other.
- No indexes other than the numeric primary key, making it impossible to search on anything else. Did I mention the table contains over 10 million rows?
- Instead of using partitions, the data is spread across about 100 tables, and a view does a “union all” so the application can query all the data.
- No bind variables are used, at all.
I’ve had all of these, and a few more, in today’s incident.
A user complained that he was seeing the wrong data. We noticed that he got the wrong data by searching the data table by an ID generated somewhere. We had to search a bit to find where the application got that ID, because foreign keys were not used, but finally we found that the data table references an “additional data” table through an application-generated ID.
We discovered that a few days ago the application started generating duplicate IDs, and now when a user looks at the additional data he sometimes sees the wrong row, referenced by the same ID. We noticed that IDs generated today duplicate IDs generated 3 months ago, but we have no way of using this information because the date column is not indexed. In fact, nothing is indexed. I’m glad the “additional data” table is very rarely used.
The ID should have been unique, but the data is spread across 100 tables in an attempt to mimic partitions – and uniqueness cannot be enforced across tables.
As a temporary solution we ended up deleting the old data to clean up the duplication. Not too easy with 100 tables, 10 million rows and no index in sight. I hope we bought the application developers enough time to fix the issue.
Oracle has many wonderful features. It has sequence numbers, it has constraints that can be used to enforce data integrity, it has partitions, it has indexes. Please make good use of those features.
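To make that concrete, here is a minimal sketch of what would have prevented today’s incident (all names are hypothetical): a sequence so the database hands out the IDs, a primary key so duplicates are impossible, and an index on the date column so we could actually search by it:

```sql
create sequence additional_data_seq;

create table additional_data (
  id          number
              constraint additional_data_pk primary key,
  created_on  date not null,
  payload     varchar2(100)
);

create index additional_data_date_ix on additional_data (created_on);

-- The sequence, not the application, generates the IDs:
insert into additional_data (id, created_on, payload)
values (additional_data_seq.nextval, sysdate, 'some data');
```

A sequence never repeats within its cycle, and even if some other code path inserted a duplicate ID anyway, the primary key constraint would reject it at insert time instead of letting users see someone else’s data three months later.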