Environment Variables in Grid Control User Defined Metrics

This post originally appeared at the Pythian blog.

Emerson wrote: "A foolish consistency is the hobgoblin of little minds". I love this quote, because it allows me to announce a presentation titled "7 Sins of Concurrency" and then show up with only 5. There are places where consistency is indeed foolish, while at other times I wish for more consistency.

Here is a nice story that illustrates both kinds of consistency, or the lack thereof.

This customer had Grid Control installed in their environment. We were asked to configure all kinds of metrics and monitors for several databases, and we decided to use Grid Control for this. One of the things we decided to monitor was the success of the backup jobs.

Like many others, this customer runs their backup jobs from cron, and the cron job generates an RMAN logfile. I thought that a monitor that checks the logfile for RMAN- and ORA- errors would be just the thing we need.

To be consistent, I could have moved the backup jobs to run from the Grid Control scheduler instead of cron. But in my opinion, this would have been foolish consistency – why risk breaking perfectly good backups? Why divert my attention from the monitoring project to take on side improvements?

To have Grid Control check the log files, I decided to use an OS UDM: that's a "User Defined Metric" that is defined on "host" targets and lets you run a script on the server. I wrote a very small shell script that finds the latest log, greps for errors and counts them. The script returns the error count to Grid Control, and more than 0 errors is a critical status for the monitor. I followed the instructions in the documentation to the letter – and indeed, everything worked easily. Hurray!
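The script itself was roughly along these lines (a reconstructed sketch rather than the exact script; the log directory, file name pattern and the em_result output convention are my assumptions):

#!/bin/sh
# Minimal sketch of the OS UDM script: find the newest RMAN log,
# count RMAN-/ORA- error lines in it, and hand the count back to Grid Control.
# LOG_DIR and the rman_*.log pattern are made-up examples.

LOG_DIR=/backup/logs
LATEST_LOG=`ls -t $LOG_DIR/rman_*.log 2>/dev/null | head -1`

if [ -z "$LATEST_LOG" ]; then
    # no log at all is also worth a critical alert
    echo "em_result=999"
    exit 0
fi

ERRORS=`egrep -c 'RMAN-|ORA-' $LATEST_LOG`

# Grid Control picks the metric value up from stdout (em_result=<n>, as I recall)
echo "em_result=$ERRORS"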

Wait. There’s a catch (and a reason for this blog post). I actually had two instances that are backed up, and therefore two logs to check. I wanted to use the same script and just change the ORACLE_SID in the environment.

No worries. The UI has a field called “Environment” and the documentation says: “Enter any environmental variable(s) required to run the user-defined script.”

One could imagine, based on the field name and the documentation, that if I type: “ORACLE_SID=mysid” in this field, and later run “echo $ORACLE_SID” in my script, the result would be “mysid”.

Wrong. What does happen is that $ORACLE_SID is empty. $1, on the other hand, is “{ORACLE_SID=mysid}”.

To get the environment variable I wanted, I had to do: tmp=(`echo $1 | tr '{}' '  '`); eval $tmp
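Spelled out inside the script, with a comment or two (this assumes a single NAME=value pair in the Environment field):

# $1 arrives from the "Environment" field as: {ORACLE_SID=mysid}
# strip the braces, then eval the result so ORACLE_SID is actually set
tmp=(`echo $1 | tr '{}' '  '`)
eval $tmp

echo $ORACLE_SID    # now prints "mysid"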

It took me a while to figure this out as this behavior is not documented and I found no usage examples that refer to this issue.

Consistency between your product, the UI, and the documentation is not foolish consistency. I expect the documentation and field descriptions to help me do my job, and I'm annoyed when they don't.

At least now this behavior is documented somewhere, so future googlers may have an easier time.


Cloning Oracle Home from RAC to Stand-Alone

This post originally appeared over at Pythian. There are also some very smart comments over there that you shouldn't miss; go take a look!

This should have been the easiest task on my todo list: install Oracle 10.2.0.3 EE standalone on a new Linux RHEL 5 server, later to be used as a standby for a production RAC system. This means two lines of "runInstaller -silent …", less than 5 minutes of DBA work and maybe 20 minutes of waiting. I did not expect to spend over 5 hours doing this.

Problems started when I discovered that I didn't have the 10.2.0.3 patchset, plus another patch that exists on production and should be installed on the standby. I had to wait for my Metalink credentials to be approved for this customer's CSI before I could download the patches for them.

“Why don’t you just clone the software from production?” asked a helpful colleague.

Sounds like a great suggestion. I have cloned Oracle software before, and it is a simple process: tar $ORACLE_HOME, copy the tar file to the new server, untar, run the cloning script (which registers the new home with the inventory), and you are done!
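In broad strokes, the happy path looks like this (a sketch, reusing the home path and name from the clone.pl command later in this post; the hostname is a placeholder):

# on the production server: package the existing home and ship it over
cd /appl/oracle/product/10.2.0
tar cf db_1.tar db_1
scp db_1.tar newserver:/appl/oracle/product/10.2.0/

# on the new server: unpack and register the home with the local inventory
cd /appl/oracle/product/10.2.0
tar xf db_1.tar
cd db_1/clone/bin
perl clone.pl ORACLE_HOME=/appl/oracle/product/10.2.0/db_1 ORACLE_HOME_NAME=OraDb10g_home1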

In theory, at least.

Here is what actually happened:

  1. Tar, copy, untar, script
  2. Ran OPatch to verify that the new Oracle home is in the inventory and that I see the correct version and patches.
  3. OPatch showed two nodes. Oops. I didn't realize the Oracle home has information about the cluster – didn't Oracle move the inventory elsewhere? Spent an hour looking for the cause of this.
  4. Found that the two nodes are mentioned in $ORACLE_HOME/inventory/ContentsXML/oraclehomeproperties.xml
  5. Removed this file.
  6. Detached the Oracle home to clean up the inventory without deleting the software.
  7. Ran the clone script again
  8. Yay! OPatch looks good now.
  9. Decided to create a test database, to be extra sure everything was fine.
  10. NETCA failed with a linking error. Spent an hour figuring out why. Cursed a bit.
  11. Had to install libc-devel, 32-bit version. Too bad RDA didn't catch this.
  12. Created the test database, but SQL*Plus now failed with a linking error. More cursing. Wondered what I did to deserve this.
  13. libaio.so.1 was missing, so I had to install the 64-bit version of libaio. Too bad RDA was silent about this as well.
  14. Couldn't start the database because it couldn't find the cluster. Why was it even looking for a cluster? Spent an hour figuring out why. Ah, because I copied the software from a RAC server, so it was linked as a RAC database.
  15. Relinked everything with the rac_off option.
  16. Finally things are working. Too bad it is 8pm already.

What I should have done: (I’m not sure if it is supported by Oracle, but at least it works)

  1. Double check that we have all RPMs.
  2. Tar, copy, untar
  3. Remove $ORACLE_HOME/inventory/ContentsXML/oraclehomeproperties.xml
  4. Run clone.pl: clone/bin/clone.pl ORACLE_HOME=/appl/oracle/product/10.2.0/db_1 ORACLE_HOME_NAME=OraDb10g_home1
  5. Relink as non-RAC: make -f ins_rdbms.mk rac_off (spelled out in the sketch after this list)
  6. Verify with OPatch.
  7. Create test DB:
    netca /silent /responsefile ~/netca.rsp
    dbca -silent -createDatabase -templateName General_Purpose.dbc -gdbName TST -sid TST -SysPassword xxx -SystemPassword xxxx -emConfiguration NONE -datafileDestination /appl/oracle/oradata  -storageType FS -characterSet WE8ISO8859P1 -nationalcharacterSet AL16UTF16 -memoryPercentage 40
  8. Go for a nice afternoon ride.
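The relink in step 5 needs to run from the rdbms library directory; as far as I remember, the full sequence is something like this (ioracle is the target that actually relinks the oracle binary):

cd $ORACLE_HOME/rdbms/lib
# turn the RAC option off, then relink the oracle executable
make -f ins_rdbms.mk rac_off
make -f ins_rdbms.mk ioracle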

I hope that I'm not the only DBA who always has to find the most difficult way to accomplish a task, and that this post will be useful to others. Perhaps the best piece of advice I can offer is to avoid this type of cloning in the first place.


Recommended articles from Metalink

Another DBA team in the company hired a consultant to review and comment on their RAC installations, and they shared his report with us. Some of the recommendations were rather mysterious and involved changing kernel parameters I was not used to changing, so I decided to look around Metalink to see if these parameters and their effects are documented somewhere.

I didn't find what I was searching for. I have lots of trouble getting the Metalink search engine to find anything useful for me. Oracle could probably get rid of many support calls by making a deal with Google and using them for the Metalink search. But while I was looking, I ran across two cool articles.

Note:265262.1 – Things to Know About Linux
A very cute document summarizing everything a DBA ever wanted to know about Linux, or almost: links to installation guides, useful prerequisites and parameters, how to find out important information (like whether you are running 64-bit or 32-bit), how to relink the kernel, and even debugging tools.
A must-read and a must-bookmark; print it and hang it beside the desk. This one is a keeper.

Note:169706.1 – Oracle Database on AIX,HP-UX,Linux,MacOSX,Solaris,Tru64 Operating Systems Installation and Configuration Requirements Quick Reference (8.0.5 to 10.2)
They are cheating with the name – the document also includes 11g.
A useful and concise summary of the installation pre-reqs on Unix platforms. Nice if you are doing many installations on different platforms and are tired of looking up a specific detail in a huge installation manual.


Log Buffer #59: a Carnival of the Vanities for DBAs

As I hinted at the beginning of the week, this week I'm writing up edition #59 of Log Buffer. Dave Edwards of the Pythian Group contacted me a couple of weeks ago and offered me this opportunity. I hesitated quite a bit; this is a big responsibility, and I was not sure my rather new blog was ready for such exposure. I'm glad I agreed. Dave Edwards sent me a bunch of helpful links to other interesting blogs, so I got to discover many valuable blogs I had managed to miss so far, and to read about databases that I usually don't see much of. Despite the fact that the links Dave sent included many databases besides Oracle, this Log Buffer will probably be a bit biased toward the world's most popular DB.

Oracle 11g is still the hottest topic in the Oracle blogosphere. Tanel Poder started an 11g internals series with an article about automatic memory management, getting into the implementation details on Linux. Kevin Closson continued looking into this topic with a typically well-researched article.

Blogging About Oracle also has a series about Oracle 11g, written by different authors. I particularly liked Part 7 – Function Result Cache, written by Roel. He gives a good example that demonstrates the value of this feature.
Virag Sharma, in Database, RAC, Dataguard and Apps, writes about SQL Performance Analyzer, one of the hottest features in 11g, and Oracle Brains explains how to tune using 11g's invisible indexes.

If you are still not tired of 11g, Eddie Awad posted 40+ links to 11g articles and blog posts. This should keep everyone busy during the weekend.

Oracle Open World is three months away, but Dan Norris and Shay Shmeltzer are already getting ready for the event. I also need to start looking for interesting sessions; there are always so many to choose from.

And in other news: Tanel Poder published a seriously cool script for tracking session-level performance stats that works even when you are not allowed to do anything on the database.
Alex Nuijten, on the highly useful Amis Technology blog, explained how to do a create table as select when you have nested tables.
In Eye on Oracle, Mike Brunelli is collecting information about the quality of Oracle support. I hope his project will generate many responses, and maybe it will even cause Oracle to rethink their support organization.

Daniel Fink, the Optimal DBA, has fun with SQL and SQL*Plus while dynamically assigning column names. Don Seiler at Die Seilerwerks writes about how he used a 10053 trace to determine how the CBO does its job, and Jakub Pawlowski points to training material about PL/SQL that Steven Feuerstein published on his site.

SQL Server fans continue to blog about SQL Server 2008: Bob Beauchemin writes about SQL Server's support for extended events, and Jeff Smith writes about composite primary keys. I truly hope that everyone already knows about them, but if you don't, it's a must-read. Mladen Prajdić, at I want some Moore, explains how to modify data in a result set that uses joins. I wish he had written this article a few months back when I needed it.

Meanwhile, on the open source web:
Ronald, at the Technical Notes blog, posts a bunch of links for Oracle DBAs learning MySQL and also some advice regarding backup and recovery. Last week I was handed the task of taking ownership of a MySQL server and writing a recovery procedure for it, so I can testify that both articles are very useful.
Charlie Cahoon ends his Summer of Code blog with a release of his MySQL proxy. Corra is already using MySQL Proxy on Ubuntu.
Morgan Tocker and Brian "Krow" Aker try to decide how big transactions should be, and last but not least </depesz> explains how to secure PostgreSQL.

Whew, that has been one long post. I didn't realize how much interesting material is being written by DBAs every day. There were so many great articles that picking and choosing among them for this post was much more difficult than I expected. I highly recommend that every database blogger try writing Log Buffer once or twice; it's a unique experience.


Oracle Performance Training with Kyle Hailey

On Friday I went to a training day with Kyle Hailey, organized by the Northern California Oracle Users Group.

It was really a great experience. The first part of the day was dedicated to ASH – v$active_session_history. We learned how to use the data, how to use the related graphs in Enterprise Manager, and how to get similar data when you don't have the ASH table (ASH requires 10g and a special license). I had already worked with ASH before, but it was still very interesting and useful.

The second part of the day was dedicated to specific wait events – why they happen and how to deal with them. The talk was very technical, including overviews of buffer cache memory management (I didn't know Oracle keeps the buffer cache in a hash table) and explanations of all the scenarios where you can run into a TX enqueue lock. There are more scenarios than you think, and most of them will be hell to debug.

We also received a copy of "Oracle Wait Interface: A Practical Guide to Performance Diagnostics and Tuning". I started reading it over the weekend, and it seems like a useful and interesting book. Maybe I'll review it when I'm done.

I was really impressed by the audience. No one asked stupid or annoying questions; on the contrary, many people had very insightful comments and questions. Everyone I talked to during the breaks was interesting, had interesting experiences to share, and had great comments when listening to my war stories. Can you tell I had fun? I also learned that I'm probably the only DBA who has not tried ASM yet. I really have to get to it.

So, hurray to NoCoug, who organized this great day. I strongly recommend Kyle Hailey's talk to all DBAs.


Becoming a better DBA

I've read Coskan's post about invalid DBAs, and now I'm a bit worried that I'm nearly invalid. Too many of the traits he mentioned hit too close to home.

So, here’s how I plan to recompile my professional skills:

  1. Read the concepts guide. I've read parts of it, but it is probably time to give it a more thorough read. Pretty much everyone agrees that one cannot be a good DBA without reading it.
  2. Learn to do wait interface tuning. Start by reading the book and trying out the ideas in the test system. I know the theory, but currently I do all the wait interface analysis using Ignite for Oracle. A marvelous tool, but a DBA should know how to get along without fancy tools.
  3. Get more acquainted with Oracle security. Maybe start by reading Pete Finnigan‘s papers.
  4. Get Certified! This will instantly make me a better DBA, no?

This short list will probably keep me occupied well into the end of the year. Especially the performance tuning part – Millsap's book is not exactly a short and easy read for the next flight. Hopefully I'll be able to start the next year as a slightly less invalid DBA.


Flashback!

Many of our application owners arrive from Microsoft SQL Server backgrounds, where they learned some very bad habits. For example, when using SQL Server's Query Analyzer, every query you run is automatically committed. When we started using Oracle, they were confused at first, but quickly learned to configure their Oracle tools (AquaStudio is rather popular for some reason) to emulate the same behavior. Usually it doesn't really matter, until someone makes a mistake.

“Chen, how do I roll back a transaction after it was committed?” is not the most frequent question I hear, but it is somewhere in the top 10.

We keep our undo for about 30 minutes, so if they catch me fast enough, I'm able to save the day using the magical flashback query:
SELECT * FROM mytable AS OF TIMESTAMP TO_TIMESTAMP('06-aug-07 08:30:00','DD-MON-YY HH24:MI:SS')

Once the user verifies that I managed to get a good copy of his table, I usually save it as another table, so I won't lose it if too much time passes:

create table mytable_old as SELECT * FROM mytable AS OF TIMESTAMP TO_TIMESTAMP('06-aug-07 08:30:00','DD-MON-YY HH24:MI:SS')

And then I have to figure out how to use the old data to undo what the user did. Do I insert missing rows? Reverse a bad update? It all depends on the exact mistake, but if I have good data from flashback, I know that we can fix the damage. Now the user can go to her white board and write 50 times: “I will not use auto commit”.
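To give one concrete case: if the mistake was an over-eager delete, putting the missing rows back from the saved copy can be as simple as this (a sketch; it assumes mytable has a primary key column called id):

insert into mytable
select o.*
  from mytable_old o
 where not exists (select 1 from mytable m where m.id = o.id);
commit;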

I heard that in 11g flashback will use the redo logs and not just the undo, so theoretically it means that there won’t be any time constraints on resolving the issue. I’m looking forward to that.

Sometimes, DBAs make mistakes too. And our mistakes are even more painful, because DROP and TRUNCATE can't be rolled back. Oracle now has an option of keeping dropped tables for a while longer in a recycle bin before really dropping them, allowing the DBA to fix his mistakes. We don't use the recycle bin, for a silly reason: when we first started using 10g, this feature caught us by surprise and caused us to run out of space when we least expected it, so we turned it off and planned to deal with it later. Later hasn't happened yet. So when we drop an important table by mistake, we need to restore our backups on a standby machine, roll the logs forward to the point just before the mistake, and get the data from there. That can easily take two hours, while using flashback takes a few minutes.
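For comparison, with the recycle bin enabled (the 10g default), recovering from an accidental drop takes seconds; something along these lines, with a hypothetical table name:

drop table important_table;                      -- the mistake
flashback table important_table to before drop;  -- the fix, moments later

-- the catch: dropped segments keep their space until purged,
-- which is what surprised us back when we turned the feature off
purge recyclebin;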

Maybe it's time to revisit our recycle bin policy.


Error Triggers

Once every few weeks, this customer encountered an ORA-1031 error – insufficient privileges. We saw the error in the alert log, and the customer complained about getting the error, but for some reason we never had enough information about what the customer was actually doing when the error occurred. We needed to know what he was doing in order to figure out how we could fix the issue.

We tried tracing, but the issue occurred so rarely that we were never able to actually capture it in time.

I thought we could solve the issue with an error trigger. My suggestion was rejected by my team lead, who prefers not to use triggers on production, but I still think it could be useful:

create table caught_errors (
  dt        date,
  username  varchar2(30),   -- value from ora_login_user
  msg       varchar2(512),  -- error stack (truncated)
  stmt      varchar2(512)   -- offending statement (truncated)
);

CREATE OR REPLACE TRIGGER log_errors AFTER SERVERERROR ON DATABASE
declare
  sql_text ora_name_list_t;
  msg_     varchar2(2000) := null;
  stmt_    varchar2(2000) := null;
BEGIN
  IF (IS_SERVERERROR(1031)) THEN
    -- collect the whole error stack, not just the top error
    for depth in 1 .. ora_server_error_depth loop
      msg_ := msg_ || ora_server_error_msg(depth);
    end loop;
    -- the offending statement comes back in chunks; glue them together
    for i in 1 .. ora_sql_txt(sql_text) loop
      stmt_ := stmt_ || sql_text(i);
    end loop;
    -- truncate so a long statement doesn't overflow the 512-byte columns
    insert into caught_errors (dt, username, msg, stmt)
    values (sysdate, ora_login_user, substr(msg_, 1, 512), substr(stmt_, 1, 512));
  END IF;
END;
/
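If the trigger ever does make it to production, finding out what the user was running when the error fired becomes a simple query:

select dt, username, msg, stmt
  from caught_errors
 order by dt desc;
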

Grow your tablespace in 3 easy steps

This is a rather simple script, but I use it so often that it must be good for something.

We don't use auto-extending tablespaces, because auto-extension puts you at risk of running out of disk space. When you run out of disk space for data files, the database will have a nasty crash and you'll probably need to recover it. If you run out of tablespace, on the other hand, you just get some errors. We did the math and put strict limits on the tablespaces. This means that when a customer decides to add a significant amount of data without telling us first, we get alerts that we are running out of tablespace.

We check that we have disk space to grow into, and at that point we do the following:

1. Using this query, I find the tablespace that is running out of space:

SELECT a.tablespace_name,
       a.bytes bytes_used,
       b.bytes bytes_free,
       b.largest,
       round(((a.bytes - b.bytes) / a.bytes) * 100, 2) percent_used
FROM (SELECT tablespace_name, sum(bytes) bytes
        FROM dba_data_files
       GROUP BY tablespace_name) a,
     (SELECT tablespace_name, sum(bytes) bytes, max(bytes) largest
        FROM dba_free_space
       GROUP BY tablespace_name) b
WHERE a.tablespace_name = b.tablespace_name
ORDER BY ((a.bytes - b.bytes) / a.bytes) DESC

2. Let's assume that TBS_009 is out of space. Using this query, I decide whether I want to grow one of the existing data files in the tablespace or add a new one:

select * from dba_data_files where tablespace_name= 'TBS_009'

3. If I want to grow an existing data file, I run:

ALTER DATABASE DATAFILE '/voloradata/ora6410g/oradata/dev/tbs_009_01.dbf'
RESIZE 268435456;   -- new size is given in bytes (256 MB here)

And if I want to add a new one, I run:

alter tablespace TBS_009
add datafile '/voloradata/ora6410g/oradata/dev/tbs_009_02.dbf' size 4M autoextend off;

Tip: Even if you want to create a data file that will eventually hold 10G, start by creating a very small data file, check that it is created successfully, and then grow it to the size you really need. We had a client that crashed while we were creating a large data file on 10.1.0.3, and it ended up creating a corrupted file, which we had to take offline and drop. Oracle confirmed that this is indeed a bug, so try to avoid creating large files in one step.
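Putting the tip into commands (the file name and sizes here are just examples):

-- start small and confirm the file is created cleanly
alter tablespace TBS_009
add datafile '/voloradata/ora6410g/oradata/dev/tbs_009_03.dbf' size 4M autoextend off;

-- only then grow it to the size you actually need
ALTER DATABASE DATAFILE '/voloradata/ora6410g/oradata/dev/tbs_009_03.dbf'
RESIZE 10G;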


Thank you, NoCoug

A few weeks ago, my manager discovered that our training budget had disappeared. We are still not sure where it went, but the fact was that our DBA team could not take any course this year. Quite disappointing, as we had planned to go to the Hotsos seminar and take a RAC course. The good news was that we couldn't afford Oracle Week either, and I hate attending this marketing event.

Upper management thinks that training for DBAs is a waste of both time and money. They think we can learn everything we need from books and online, so when the budget is in trouble, training is the first thing they cut. Not very smart. A bit like eating your seeds instead of planting them. I've been trying to explain the added value of seminars and courses: the intensity of the learning experience, the value of interacting with other DBAs and hearing about their issues and ideas. I really believe that the value we bring back from the Hotsos seminar is well worth the price. He-who-controls-the-budget remained unconvinced.

I was all but resigned to having a boring year without any courses or seminars, when I received news that NoCoug – the Northern California Oracle Users Group – is having a performance training day with Kyle Hailey. The best part is that this training day, which sounds very interesting and useful, costs only $250. At that price, I was happy to pay for it out of my own pocket, but my manager came through and got me reimbursed for it anyway.

So now I have a good training day to look forward to. Thank you, NoCoug and Kyle Hailey. I hope you'll do this more often.