I’m just a simple DBA on a complex production system

Writing about all things production. Especially Oracle databases.

New Look May 15, 2008

Filed under: design — prodlife @ 5:14 am

I wanted the fancy code formatting to look better on my blog.  Besides, the design was too monochromatic and the links on the side-bar were difficult to read.

So I’m trying out a new design, and as an added bonus, my blog now has flowers. Flowers are usually an improvement.

Feedback on new look is welcome. If my blog is less readable than before, let me know and I’ll continue searching for a good theme. Those of you reading me from a feed shouldn’t care about the change at all.

 

Spying on sessions May 14, 2008

Filed under: tips — prodlife @ 11:28 pm

Sometimes I want to know what a session, a process, a user or a machine is doing, so I wrote a kitchen-sink query that shows me everything I want to know about what a session is running (10g and up).
Its amazing how many issues you can catch simply by running a simple query few times.


select
--'alter system kill session '''|| s.SID||',' || s.serial# ||'''' ,
--'kill -9 ' || p.spid,
p.SPID UnixProcess ,s.SID,s.serial#,s.USERNAME,s.COMMAND,s.MACHINE,s.SQL_ADDRESS,s.SQL_HASH_VALUE
,s.program, cpu_time,fetches,disk_reads,buffer_gets,rows_processed,
executions,child_latch,event,sql_text,COMMAND_TYPE,
sbc.name,to_char(sbc.last_captured,'yyyy-mm-dd hh24:mi:ss'),sbc.value_string
from v$session s
join v$process p on p.ADDR = s.PADDR-- and p.SID in (27817)
left outer join v$sqlarea sa on sa.ADDRESS = s.SQL_ADDRESS
left outer join V$SQL_BIND_CAPTURE sbc on sbc.ADDRESS = s.SQL_ADDRESS
where 1=1
--and s.program like 'imp%'
and s.username ='DEV'
--AND p.SPID in (15179,8304)
--and s.SID in (1019)

The fancy code formatting looks better when Don Seiler is using it. Maybe its time for a new look for the blog.

 

ITIL Processes Taken to Extremes May 8, 2008

Filed under: xfiles — prodlife @ 3:06 am

A DBA from another team went on vacation, and asked me to do few urgent tasks for him while he is gone.
One of the tasks was to refresh test schema from production data. In his email he commented that in his team the process includes opening a change in the Change Management system for this task “in case we drop the production schema instead of the test schema by mistake”.

I was very surprised by this. In my team, the ticket asking for the refresh was considered enough documentation, and there was no need to open a change. Also, if you regularly drop the wrong schema, opening changes will not help you.

I asked around for explanation for the different procedure. Turned out that few month ago someone from the other team dropped the wrong schema by mistake. It was recovered from backup with no issues, but it still caused an hour or two of downtime for the users. Which means that we need to open an Incident, and of course, every Incident has to contain action items for preventing same issue from reoccurring.  In this case, the manager who reviewed the incident noticed that there was no Change open for the schema refresh, which means that the DBA did not follow the right procedure! The natural action item for the change was “Instruct DBAs on Change Management procedures”. DBAs were properly instructed (at least some of them) and are now opening a change before dropping the wrong schema.

Which just goes to show how procedures can’t replace common sense.

 

Monitoring Performance With BHive May 7, 2008

Filed under: monitoring — prodlife @ 4:10 am

After we established that looking at the CPU is not a productive way to monitoring performance, and after I wrote a long long rant about how every application level response time monitor is useless, I want to talk a bit about an application response time tool I’ve seen that takes an interesting approach to the problem -  BHive’s Conductor.

Just to make sure we are on the same page: Application level response time monitors are the tools that attempt to check the application response times as it is experienced by its users. As such it monitors the entire system, not a particular component of it.

The first cool thing about the Conductor is that it works by monitoring your network traffic. Why is this so important?

  • Its agentless. You install one box on your network, and that should be it.
  • It monitors your real users, probably by examining packets flying back and forth. This means that you don’t have to write scripts to emulate users, and worry about getting the scripts wrong. I’m a big fan of virtual user scripts, because you get very consistent data set that is easier to analyze. But even I have to admit that it is very difficult to get those scripts right.
    As an additional bonus - you will not get alerted about performance issues that happen at 3am when the big batch jobs are running but users are asleep.
  • It attempts to model your business processes (login, search, bid, buy, wish list, etc) from the network information it catches. So the monitor can alert about response time for bidding vs. searching, which is the language that Operations and Business are speaking.
    This worked great in the demo, and it could save tons of time if it always works that well. I’m still slightly skeptical, because I’ve seen so many products fail when they attempt to do this. Especially when some critical information about what a page will do is embedded in environment variables sent to server in HTTP POST.
  • They also automatically map all your servers, so after the monitor ran for a while you see a nice map of which application server is serving which business processes and which DB is used by which application server, and it even maps the queries that each DB is running. I suspect that the map is one more thing that is nicer in a demo than in a real IT, where there are just too many things to look at.
  • The latter point means that they should have good information if the performance issue is DB related or not. Which may mean that the DB will not get automatically blamed for everything.
  • The ability to relate DB slowdowns with real business processes and real users can lead to incredible things. Imagine instructing the helpdesk (or even configuring alerts) to beep you only if more than 10 users are affected, or that they can call you at 3am for slow “buy” but slow “search” can wait until morning.

All these are quite cool, and certainly make life easier, but they are not exactly revolutionary. What makes the Conductor so interesting is that BHive does not intend to leave it as a passive monitor.

The idea is that there are some things you can do once performance issue is detected: You can modify load balancer settings or if you are using VMWare you can move resources from a  server that is doing fine to one that is in trouble, maybe even rebuild indexes ;).  So why  not have  the monitoring system do this for you when it discovers that there is an issue?

Of course, I don’t mind if the Conductor modifies our Load Balancer settings twice an hour, but I’d rather die before I let it rebuilt indexes. I think that our network admin feels exactly the reverse. There are definitely an issue with automating tasks there were previously manual (and quite rare). I suggested that instead of actually doing the modification, the Conductor should attach a script with instructions to each alert - so the admin can review the actions and decide if he wants to run this or not. Oracle’s ADDM does it successfully.

One direction that I’d really love to see this product taking is more integration with external data sources. BHive is talking about integration with VMWare, but I want to see much more. I want to see the CPU on the machines where slowness is detected. If the DB is oracle, please show me wait events, correlated with the slow queries that the monitor shows. Maybe take AWR snapshot while you are at it and send me the result of the report.

In my opinion one of the most important things to remember while developing a monitoring product is that you can’t really give good value if you cover just one aspect of the system. You need to integrate, you need to correlate, and you need to give me the fullest picture possible. You can’t give good value if you play vendor power games and pretend that your product is the only one I’ll ever use. Remember, I already have 10 terrific monitoring products. It is time to give us the big picture. I hope BHive will be able to do this. They certainly seem to be going the right way.

(Disclaimer: My best friend works for BHive. I did not get paid to write this review - I wrote this because I found the product exciting. I hope my readers will find this review somewhat useful, but I’m sure we are all grownups capable of making our own monitoring decisions. Specifically, for reasons beyond my control (i.e. corporate policy), I’m not using this product myself. )

 

Newsflash - 100% CPU is worse than you think! May 6, 2008

Filed under: hardcore, performance — prodlife @ 1:53 am

Found a recent thread on OTN that discusses my recent obsession: http://forums.oracle.com/forums/thread.jspa?threadID=650986&start=0&tstart=0

Jonathan Lewis points out a major issue with running your DB at 100% cpu:

“Consider the simple case of 8 queries running on 8 CPUs. They will be competing for the same cache buffers chains latches - which means that seven processes could be spinning on the same latch while the eighth is holding it. None of the processes ever need wait, but most of them could be wasting CPU most of the time.”

Amazing. During my entire discussion of CPU load and process priorities I completely ignored the fact that I’m using 2 dual core cpus on that system, and that all Oracle processes use shared memory, which means shared resource, which means locks, which means resource wasting by waiting for locks.

And this complicated the discussion, because 6 processes on 8 CPUs will also waste time waiting for locks. You don’t need 100% CPU to suffer from this. The thread mentions that the book “Forecasting Oracle Performance” discusses this issue and mentions 75% cpu as the tipping point, but I’d assume that the number would be different for systems with different numbers of CPUs. I definitely need to read that book.

I also was not aware that processes stay on CPU while waiting for latch. I’d assume the CPU will replace it with runable process? Of course the switch will also cost resources, so you lose either way.

I can’t believe I ignored this until now (and that not one of my readers mentioned this!). The thread is well worth reading.

 

Praise of OTN Forums May 1, 2008

Filed under: musing, tips — prodlife @ 5:13 am

Sometimes readers of this blog email me questions. Recently it has been mostly streams related questions. Sometimes I have a good answer, and sometimes I barely understand the question, but there is one sentence I say in every reply:

“Why don’t you ask your question in OTN forums?”

Because seriously, there is no better place for Oracle questions. Often far better than Metalink (especially if bugs are not involved). You ask a question in XMLDB forum and there is a good chance that the program manager of the product will reply. Oracle program manager! You can’t buy this kind of support, but you can get it for free!

Jonathan Lewis must be one of the most expensive Oracle consultants, but if you post your performance issues in OTN Forums with good statspack report, he’ll often give you good advice for free! The best Oracle experts in the world are answering hundreds of questions every day, for free. This is simply incredible.

I started posting in OTN forums less than half a year ago, and I regret not doing it before that. I was aware that these forums existed, but I was highly biased against any kind of product forums. Lots of forums are full of lazy idiots asking questions and slightly less lazy idiots giving very bad answers. Only when I started meeting OTN regulars through their blogs, I realized that OTN forums may be different. There are still idiots asking idiotic questions, but there are also many many excellent and interesting questions, and the quality of answers is simply unbelievably high.

Keeping such an amazing signal-to-noise ratio in a public forum is not easy, and few OTN regulars help maintain the high quality by being rather harsh with lazy questions. It may be unkind to the people posting the questions, and is certainly very scary to newbies, but I appreciate their diligence. There are enough forums where bad questions are tolerated, I’m happy there are some where they are not.

Of course, even after few month of regularly posting questions and answers in OTN, I’m still a little scared before posting, rereading my post several times to make sure I’m not going to get flamed. I’ve been somewhat hurt by replies to my posts in the past, and I know it may happen again. I’ll gladly pay this small price in exchange for the best support forums in the world.

The main danger of OTN is that it is highly addictive. I started by posting a random question every few weeks, then I answered a question or two, and now I spend few hours a week reading threads and posting replies. Very little compared to “Top Addicts in Category” who seem to live in the forum, respond to every question almost immediately and post scores of answers each day. Still, even few hours a week makes it a serious hobby.

Why do I do that? Because there are truly interesting threads there every day, and I learn a lot by reading them. Because if someone asks a question and I know the answer, I enjoy helping out a bit to a fellow DBA, knowing that I’ll get help with my issues when I need it. Because sometimes I don’t know the answer, but the topic is interesting, so I’m happy to do some reading and testing and find out the answer. Because if I get the answer wrong, often a top notch expert will correct me and I’ll learn from that.

But probably the best reward is that after you follow OTN forums for a while, you start recognizing the regulars (some OTNers you can recognize by their SQLPLUS listings without even looking at the name!). Sometimes they even recognize you. And after a while you realize that there is a terrific community in there, with its own special humor, and it is really a fun place to hang out on the net. There are worse ways to waste time.

 

Measuring Performance - The Problem April 30, 2008

Filed under: monitoring, musing — prodlife @ 6:27 am

After I published “Yet Another Nice Myth”, my friend MosheZ sent me an angry email:
” STOP looking at top. You need actual application data if you want to validate application responsiveness.”

And as always, he is absolutely right. My managers fell for one of the oldest operations management errors - Confusing what can be easily measure with what they actually need to know.

They wanted to know if our users are unhappy with the application response times, what they could easily see was server CPU, and to simplify their lives, they decided that “high cpu = unhappy users”. Which is true often enough to make this simplification tempting, but not often enough to make this a serious operational strategy.

Measuring application performance is incredibly complex. Applications are complex, performance is difficult to define and even more difficult to define is “good performance”. I’ve been working in the field of application performance for 8 years now.  I spent the first two years of my career developing application monitoring tools, and another 6 years using them - first as operations manager and now as a DBA, and I can easily claim that this is still an open problem. Tons of tools, each and every one of them is doing it wrong.

Two problems I see with the tools I’ve looked at (besides managers taking cpu averages too seriously):

1.  Narrow view: Network monitors show you the network. System monitors show you cpu, io and memory. DB Monitors show you how the database wastes its time. End User Monitors should you slow web pages.  So you end up with at least 10 different tools (How many do you have in your IT organization? Do you even know?), and whenever one of them indicates something unusual, you have to juggle and correlate all of them to get a reasonably accurate idea of what is going on. Usually they show conflicting information that you never manage to fully correlate and explain. In any case, tons of intuition and guess work is involved.

Those different tools will be owned by different teams. So you are not correlating network monitors, storage monitors, db monitors and application monitors. You are correlating your understanding of the situation as you see it from db monitors with what you understand from the network manager who is trying to understand his network monitors simultaneously. Throw in the storage manager, application owners, angry customers, flustered managers and few consultants, and you no longer have rational correlation of facts, you have politics and finger-pointing.

Uncorrelated data means that understanding scope of issues is very difficult. DB server has 98% cpu, does it cause an application problem? if so, where? which users are affected? Or maybe application monitoring tool shows slow login - is it caused by web server? DB? LDAP? Maybe its the load balancer? or a switch? are there other applications affected by same issue? Is it a reoccurring issue?

2. Too much data, too little meaning: Often a problem of application layer monitors. Actually, this is not exactly a tool problem - application owners usually insist on having incredible (=absurd) amounts of data about their application performance and having no idea what to do with it later.

Here the tools diverge into two kinds, virtual users and real users. Each one of these tools has its own way of generating too much data.

Virtual User based tools require an application expert that will write a script to emulate a user visiting his website, performing various common operations. The you run this script from various locations and ISPs, usually every 5-15 minutes, the tool will collect information regarding response times for each file accessed by the script, often broken down to various network components (server time, network time, ssl, dns, etc).
Take all this data, multiply by 30 locations, running 12 times an hour, 24×7, and of course you want to keep this for something like 18 month, right? These tools also have the additional issue that you may be monitoring the wrong thing, or having a badly written script completely skew your results.

Real User tools will sit at your data center and capture pretty much the same information, but about the real users that are actually visiting your site. If your site is worth monitoring, this will generate much more information that Virtual Users tools (because you should have more real users than what monitoring will generate), this information will also be more “real” by definition. It will be more difficult to analyze, because virtual users lend themselves to nice time-series type analysis. But lets face it, no one does time-series on monitors. We have averages, and on good days - percentiles.

Now you got 50G of monitoring data. Do you really need all this? Do you know what to do with it? Is it meaningful to you? Thursday night, 5am GMT you see performance downgrade for users (real or virtual) accessing your search page from Japan. What do you do?

Do you have meaningful trends? Can your tool show you the slow-down that occurs 10 minutes after the hour, every hour? Can it show you that it is getting worth? Do you know if it is something you should worry about and investigate?  Application owners rarely have the ability to mine their own data. Do you have a tool that does it for you?

I’ve seen operators get alerts about performance issue every day at 3pm, it lasted for 10 minutes so it was dismissed at “momentary anomaly”, and not noticing that it happened same time every day for over a month, starting right after a new patchset was installed. It is easy to drown in too much data, especially data that you don’t immediately know how to process. After all, no one knows how to debug performance issues that last for 10 minutes. You need to know that it is repeating the same way every day in order to debug it.

If you have a performance monitoring tool that you are truly happy with, please comment and let us know! In general, I’m very interested in how you do your monitoring.

My friend Moshe works for BHive, a company that happened to develop a new monitor tool that promise to take some of the pain (and irrelevant data) out of the equation. He arranged for me a demo of their solution, and while it is still very very new and incomplete, it does take monitoring in an interesting direction. I’ll write a seperate post about that.

****

Exciting posts appeared on my RSS Reader yesterday:

Dom Brooks at OraStory upgraded production server to 11g and lived to tell the tale. Its the first production upgrade I’ve heard of and apparently it went well. Good news!

Greg Rahn at Structured Data wrote about data warehouse scalability.  This post is so good that I have no superlatives. Just go read it. Even if you don’t have data warehouse. Its a must.

 

Splitting Comma Seperated String to Rows April 15, 2008

Filed under: sql — prodlife @ 11:19 pm

Got the following email from one of my users:

I am trying to write a SQL select to get comma delimited values from one field in rows.

For example I have a Table T1
create table t1(a  varchar2(30),
b  varchar2(30));

insert into t1 values ('A','27.68%,2.78%,69.55%');

SELECT A, B FROM T1;
A       B
=====  =====
A  27.68%,2.78%,69.55%

Can you help me write a SQL which will give the following output?

A        PERCENT
======  ========
A   27.68%
A    2.78%
A   69.55%

My solution is rather ugly, but it works (on 10g and above) and makes no assumptions about the number of elements in the list :

With t2 as (select A,','||B||',' as B from t1)
SELECT A,SUBSTR(B,INSTR(B,',',1,LEVEL)+1,INSTR(B,',',1,LEVEL+1)-INSTR(B,',',1,LEVEL)-1)
FROM t2
CONNECT BY LEVEL <= length(regexp_replace('[^\,]‘,”))-2 ;

Senior DBA had a simpler solution: “Just replace all the commas with new lines”.

 

Walk to Work - California Style April 15, 2008

Filed under: musing — prodlife @ 3:51 am

Now that the weather is nice again, I decided to save a bit on gasoline and walk to work.
Here is a map of my daily commute (slightly simplified):

My house is point A and my workplace is at point G.
It doesn’t take much knowledge of geometry or parity to realize that I have two completely unnecessary road crossings in my route. These road crossings are in fact completely unnecessary, and were not added to the route simply because I want to spend more time waiting for traffic lights to change.
There is no pedestrian crossing between points B and E, and there is no sidewalk between points D and G. So I either take the scenic route or I jaywalk on a very busy street during commute hour.

Chicken and egg problem: Americans never walk because their streets are so incompetently designed, or perhaps street design is incompetent because people who never walk don’t notice how annoying this is?

***
And for those of you who expected Oracle related content, I’ll send you to an excellent and thought provoking article, uploaded by Daniel Fink - the OptimalDBA.

 

Yet Another Nice Myth April 9, 2008

Filed under: Linux, nerdism — prodlife @ 2:49 am

A while ago I explained Nice, what it can do and what it can’t. From this point, whenever someone misunderstood nice, I just pointed at my post.

But yesterday I ran into a misunderstanding that was not covered in the first post:

Boss: Server X is at 100% cpu. Can you renice process 4096?
Me: Sure thing boss. *clickety-click* Done
Boss: Why is cpu still at 100%?

Renice will not lower CPU utilization because:
1. It does not kill processes. After renicing the process will have lower priority, but it will still need the same amount of CPU it did before, and so do all the other processes in the system. At any given moment, our nice process will either wait for CPU or use CPU. Either way, we are still at 100% CPU.
2. You can’t lower the priority to a point that the CPU will prefer to sit idle instead of running that process. Which is exactly what you ask for if you expect renice to affect CPU utilization.

The Linux OS will not let the CPU run on IDLE if someone wants it. It makes absolutely no sense to make processes wait while the CPU is not doing anything.

Boss: Sysadmin types always say that 100% CPU is not necessarily a bad thing, but we are DBAs and we know that if a machine is at 100% CPU it means trouble.

I hope we can all agree that what we want to prevent is processes having to wait for CPU. Ideally, every process that needs CPU will get it immediately. In this sense, 100% CPU is indeed a bad thing, because either processes are already waiting for CPU (look at load average!) or it can happen any second. But it is not because 100% cpu is the problem, it is because waiting processes are the problem.

Suppose you have a classic OLTP system - lots of tiny, fast, transactions. Usually the system runs at around 60% CPU on its single CPU server, but this morning someone ran a huge cpu-intensive report on this system. You know that this report runs for 60 minutes on an empty system. He begs you not to kill the process, but he doesn’t mind waiting. Now you have two options:

1. Let this process run as usual. Since it is a large report, it could be scheduled for more than its fair share of CPU. Taking maybe 80% of cpu time, leaving 20% for our tiny transactions. Process will finish in 75 minutes. On average each tiny transaction will take three times longer than usual. 75 of 100% cpu, bad performance and complaints.
2. Renice the process. Now it will get 40% of CPU time. Tiny transactions get 60% (which is all they need). Report takes 150 minutes, tiny transactions take as usual. 150 minutes of 100% cpu, no performance issues.

Obviously the second scenario is much better. Even if you spend twice as much time in 100% CPU. So maybe 100% cpu is not always evil, and renice is useful even when it doesn’t lower the cpu utilization.


Boss: But look at Top! Things look terrible! How do we know that everything is OK?

It is important to remember that viewing the CPU world through “top” tends to skew our perspective a bit. Top gives CPU utilization aggregated according to its refresh rate - 3 seconds by default. Now during these 3 seconds, 1.2 seconds were used by huge report (40%!) and 2.8 seconds used by 20 processes pooling all those tiny 2 ms transactions. Each process will show close to 0.05% CPU. It looks scary - big bad report taking so much CPU leaving almost nothing for other processes. But each time a transaction wanted to run it got 100% of the CPU. It just finished very very fast, and the big report stayed.

How do we REALLY know that everything is OK?
I trust Oracle on this and go to Active Session History tables (or EM report), find the transaction processing sessions, check out the wait events and make sure CPU is reasonable.

Boss: But what do we do about CPU?

So, 100% CPU is not good, but we can’t use renice to solve this. What can we do?
Killing the offending process is a crude but effective solution. Rumor says SQL Profiles that will allow you to alter the execution plan of the query, maybe making it easier on the CPU. I never tried it. Of course, the ultimate solution is to optimize the report to use less CPU, and in my experience it is easier to convince developers to optimize after you killed the report several times. In worst case, buying more CPU power is always an option.