I'm just a simple DBA on a complex production system

Writing about all things production. Especially Oracle databases.

Cumulative Distribution May 28, 2008

Filed under: links, musing, nerdism — prodlife @ 2:05 am

Laurent Schneider sent me a link to an interesting discussion in AskTom.

How do we calculate Cumulative Normal Distribution? This means calculating the probability that a random variable from a normal distribution, will end up equal to or less than X.

I’d love to know why the OP needed that. While Normal Distribution is very popular, most realistic statistical applications would use Student’s T distribution. This is so common that Oracle contains a built in T Test function. Maybe this would be useful to the OP?

Regardless of Tom’s flippant reply, it is actually a difficult question. Knowing the cumulative normal distribution means integrating the normal (Gaussian) function. Which is impossible. So in a way, the correct answer to the OP’s question is “you can’t”.

But thats not correct either, while there is no “general” function for the integral of the normal distribution, you can manually calculate the integral for any specific value on the function. Lucky for us, Gauss had grad students who did this hard work, and until very recently statisticians referred to books with tables containing these results whenever they needed to know the probabilities.

Which brings us to another good answer – find a website with this table, load the table into Oracle, and use it as much as you want. No PL/SQL needed.

Another interesting way to calculate the cumulative normal distribution in Oracle would be to use Oracle’s dbms_random.normal to generate an approximation of the normal distribution and calculate the cumulative distribution on the result. Of course it is not very accurate, but it is a lot of fun:

select n,cume_dist ( ) over (order by n) cume_dist from
(select dbms_random.normal n from dual connect by level<=100);

In the AskTom thread, Georg from Germany linked to a function that approximates the cumulative normal distribution, and that site contained link to a paper with even better approximation. This is probably how statistics software (SPSS and its friends) calculates probabilities.

Interesting question, many good answers.

***

Frits Hoogland also managed to write a long post about a short question with a short answer:
How an ASM diskgroup is found by the database.

 

New Look May 15, 2008

Filed under: design — prodlife @ 5:14 am

I wanted the fancy code formatting to look better on my blog.  Besides, the design was too monochromatic and the links on the side-bar were difficult to read.

So I’m trying out a new design, and as an added bonus, my blog now has flowers. Flowers are usually an improvement.

Feedback on new look is welcome. If my blog is less readable than before, let me know and I’ll continue searching for a good theme. Those of you reading me from a feed shouldn’t care about the change at all.

 

Spying on sessions May 14, 2008

Filed under: tips — prodlife @ 11:28 pm

Sometimes I want to know what a session, a process, a user or a machine is doing, so I wrote a kitchen-sink query that shows me everything I want to know about what a session is running (10g and up).
Its amazing how many issues you can catch simply by running a simple query few times.


select
--'alter system kill session '''|| s.SID||',' || s.serial# ||'''' ,
--'kill -9 ' || p.spid,
p.SPID UnixProcess ,s.SID,s.serial#,s.USERNAME,s.COMMAND,s.MACHINE,s.SQL_ADDRESS,s.SQL_HASH_VALUE
,s.program, cpu_time,fetches,disk_reads,buffer_gets,rows_processed,
executions,child_latch,event,sql_text,COMMAND_TYPE,
sbc.name,to_char(sbc.last_captured,'yyyy-mm-dd hh24:mi:ss'),sbc.value_string
from v$session s
join v$process p on p.ADDR = s.PADDR-- and p.SID in (27817)
left outer join v$sqlarea sa on sa.ADDRESS = s.SQL_ADDRESS
left outer join V$SQL_BIND_CAPTURE sbc on sbc.ADDRESS = s.SQL_ADDRESS
where 1=1
--and s.program like 'imp%'
and s.username ='DEV'
--AND p.SPID in (15179,8304)
--and s.SID in (1019)

The fancy code formatting looks better when Don Seiler is using it. Maybe its time for a new look for the blog.

 

ITIL Processes Taken to Extremes May 8, 2008

Filed under: xfiles — prodlife @ 3:06 am

A DBA from another team went on vacation, and asked me to do few urgent tasks for him while he is gone.
One of the tasks was to refresh test schema from production data. In his email he commented that in his team the process includes opening a change in the Change Management system for this task “in case we drop the production schema instead of the test schema by mistake”.

I was very surprised by this. In my team, the ticket asking for the refresh was considered enough documentation, and there was no need to open a change. Also, if you regularly drop the wrong schema, opening changes will not help you.

I asked around for explanation for the different procedure. Turned out that few month ago someone from the other team dropped the wrong schema by mistake. It was recovered from backup with no issues, but it still caused an hour or two of downtime for the users. Which means that we need to open an Incident, and of course, every Incident has to contain action items for preventing same issue from reoccurring.  In this case, the manager who reviewed the incident noticed that there was no Change open for the schema refresh, which means that the DBA did not follow the right procedure! The natural action item for the change was “Instruct DBAs on Change Management procedures”. DBAs were properly instructed (at least some of them) and are now opening a change before dropping the wrong schema.

Which just goes to show how procedures can’t replace common sense.

 

Monitoring Performance With BHive May 7, 2008

Filed under: monitoring — prodlife @ 4:10 am

After we established that looking at the CPU is not a productive way to monitoring performance, and after I wrote a long long rant about how every application level response time monitor is useless, I want to talk a bit about an application response time tool I’ve seen that takes an interesting approach to the problem -  BHive’s Conductor.

Just to make sure we are on the same page: Application level response time monitors are the tools that attempt to check the application response times as it is experienced by its users. As such it monitors the entire system, not a particular component of it.

The first cool thing about the Conductor is that it works by monitoring your network traffic. Why is this so important?

  • Its agentless. You install one box on your network, and that should be it.
  • It monitors your real users, probably by examining packets flying back and forth. This means that you don’t have to write scripts to emulate users, and worry about getting the scripts wrong. I’m a big fan of virtual user scripts, because you get very consistent data set that is easier to analyze. But even I have to admit that it is very difficult to get those scripts right.
    As an additional bonus – you will not get alerted about performance issues that happen at 3am when the big batch jobs are running but users are asleep.
  • It attempts to model your business processes (login, search, bid, buy, wish list, etc) from the network information it catches. So the monitor can alert about response time for bidding vs. searching, which is the language that Operations and Business are speaking.
    This worked great in the demo, and it could save tons of time if it always works that well. I’m still slightly skeptical, because I’ve seen so many products fail when they attempt to do this. Especially when some critical information about what a page will do is embedded in environment variables sent to server in HTTP POST.
  • They also automatically map all your servers, so after the monitor ran for a while you see a nice map of which application server is serving which business processes and which DB is used by which application server, and it even maps the queries that each DB is running. I suspect that the map is one more thing that is nicer in a demo than in a real IT, where there are just too many things to look at.
  • The latter point means that they should have good information if the performance issue is DB related or not. Which may mean that the DB will not get automatically blamed for everything.
  • The ability to relate DB slowdowns with real business processes and real users can lead to incredible things. Imagine instructing the helpdesk (or even configuring alerts) to beep you only if more than 10 users are affected, or that they can call you at 3am for slow “buy” but slow “search” can wait until morning.

All these are quite cool, and certainly make life easier, but they are not exactly revolutionary. What makes the Conductor so interesting is that BHive does not intend to leave it as a passive monitor.

The idea is that there are some things you can do once performance issue is detected: You can modify load balancer settings or if you are using VMWare you can move resources from a  server that is doing fine to one that is in trouble, maybe even rebuild indexes ;) .  So why  not have  the monitoring system do this for you when it discovers that there is an issue?

Of course, I don’t mind if the Conductor modifies our Load Balancer settings twice an hour, but I’d rather die before I let it rebuilt indexes. I think that our network admin feels exactly the reverse. There are definitely an issue with automating tasks there were previously manual (and quite rare). I suggested that instead of actually doing the modification, the Conductor should attach a script with instructions to each alert – so the admin can review the actions and decide if he wants to run this or not. Oracle’s ADDM does it successfully.

One direction that I’d really love to see this product taking is more integration with external data sources. BHive is talking about integration with VMWare, but I want to see much more. I want to see the CPU on the machines where slowness is detected. If the DB is oracle, please show me wait events, correlated with the slow queries that the monitor shows. Maybe take AWR snapshot while you are at it and send me the result of the report.

In my opinion one of the most important things to remember while developing a monitoring product is that you can’t really give good value if you cover just one aspect of the system. You need to integrate, you need to correlate, and you need to give me the fullest picture possible. You can’t give good value if you play vendor power games and pretend that your product is the only one I’ll ever use. Remember, I already have 10 terrific monitoring products. It is time to give us the big picture. I hope BHive will be able to do this. They certainly seem to be going the right way.

(Disclaimer: My best friend works for BHive. I did not get paid to write this review – I wrote this because I found the product exciting. I hope my readers will find this review somewhat useful, but I’m sure we are all grownups capable of making our own monitoring decisions. Specifically, for reasons beyond my control (i.e. corporate policy), I’m not using this product myself. )

 

Newsflash – 100% CPU is worse than you think! May 6, 2008

Filed under: hardcore, performance — prodlife @ 1:53 am

Found a recent thread on OTN that discusses my recent obsession: http://forums.oracle.com/forums/thread.jspa?threadID=650986&start=0&tstart=0

Jonathan Lewis points out a major issue with running your DB at 100% cpu:

“Consider the simple case of 8 queries running on 8 CPUs. They will be competing for the same cache buffers chains latches – which means that seven processes could be spinning on the same latch while the eighth is holding it. None of the processes ever need wait, but most of them could be wasting CPU most of the time.”

Amazing. During my entire discussion of CPU load and process priorities I completely ignored the fact that I’m using 2 dual core cpus on that system, and that all Oracle processes use shared memory, which means shared resource, which means locks, which means resource wasting by waiting for locks.

And this complicated the discussion, because 6 processes on 8 CPUs will also waste time waiting for locks. You don’t need 100% CPU to suffer from this. The thread mentions that the book “Forecasting Oracle Performance” discusses this issue and mentions 75% cpu as the tipping point, but I’d assume that the number would be different for systems with different numbers of CPUs. I definitely need to read that book.

I also was not aware that processes stay on CPU while waiting for latch. I’d assume the CPU will replace it with runable process? Of course the switch will also cost resources, so you lose either way.

I can’t believe I ignored this until now (and that not one of my readers mentioned this!). The thread is well worth reading.

 

Praise of OTN Forums May 1, 2008

Filed under: musing, tips — prodlife @ 5:13 am

Sometimes readers of this blog email me questions. Recently it has been mostly streams related questions. Sometimes I have a good answer, and sometimes I barely understand the question, but there is one sentence I say in every reply:

“Why don’t you ask your question in OTN forums?”

Because seriously, there is no better place for Oracle questions. Often far better than Metalink (especially if bugs are not involved). You ask a question in XMLDB forum and there is a good chance that the program manager of the product will reply. Oracle program manager! You can’t buy this kind of support, but you can get it for free!

Jonathan Lewis must be one of the most expensive Oracle consultants, but if you post your performance issues in OTN Forums with good statspack report, he’ll often give you good advice for free! The best Oracle experts in the world are answering hundreds of questions every day, for free. This is simply incredible.

I started posting in OTN forums less than half a year ago, and I regret not doing it before that. I was aware that these forums existed, but I was highly biased against any kind of product forums. Lots of forums are full of lazy idiots asking questions and slightly less lazy idiots giving very bad answers. Only when I started meeting OTN regulars through their blogs, I realized that OTN forums may be different. There are still idiots asking idiotic questions, but there are also many many excellent and interesting questions, and the quality of answers is simply unbelievably high.

Keeping such an amazing signal-to-noise ratio in a public forum is not easy, and few OTN regulars help maintain the high quality by being rather harsh with lazy questions. It may be unkind to the people posting the questions, and is certainly very scary to newbies, but I appreciate their diligence. There are enough forums where bad questions are tolerated, I’m happy there are some where they are not.

Of course, even after few month of regularly posting questions and answers in OTN, I’m still a little scared before posting, rereading my post several times to make sure I’m not going to get flamed. I’ve been somewhat hurt by replies to my posts in the past, and I know it may happen again. I’ll gladly pay this small price in exchange for the best support forums in the world.

The main danger of OTN is that it is highly addictive. I started by posting a random question every few weeks, then I answered a question or two, and now I spend few hours a week reading threads and posting replies. Very little compared to “Top Addicts in Category” who seem to live in the forum, respond to every question almost immediately and post scores of answers each day. Still, even few hours a week makes it a serious hobby.

Why do I do that? Because there are truly interesting threads there every day, and I learn a lot by reading them. Because if someone asks a question and I know the answer, I enjoy helping out a bit to a fellow DBA, knowing that I’ll get help with my issues when I need it. Because sometimes I don’t know the answer, but the topic is interesting, so I’m happy to do some reading and testing and find out the answer. Because if I get the answer wrong, often a top notch expert will correct me and I’ll learn from that.

But probably the best reward is that after you follow OTN forums for a while, you start recognizing the regulars (some OTNers you can recognize by their SQLPLUS listings without even looking at the name!). Sometimes they even recognize you. And after a while you realize that there is a terrific community in there, with its own special humor, and it is really a fun place to hang out on the net. There are worse ways to waste time.