Newsflash – 100% CPU is worse than you think!
Posted: May 6, 2008 | Filed under: hardcore, performance | 9 Comments
I found a thread on OTN that discusses my current obsession: http://forums.oracle.com/forums/thread.jspa?threadID=650986&start=0&tstart=0
Jonathan Lewis points out a major issue with running your DB at 100% cpu:
“Consider the simple case of 8 queries running on 8 CPUs. They will be competing for the same cache buffers chains latches – which means that seven processes could be spinning on the same latch while the eighth is holding it. None of the processes ever need wait, but most of them could be wasting CPU most of the time.”
Amazing. During my entire discussion of CPU load and process priorities I completely ignored the fact that I’m using 2 dual-core CPUs on that system, and that all Oracle processes use shared memory, which means shared resources, which means locks, which means CPU wasted waiting for locks.
And this complicates the discussion, because 6 processes on 8 CPUs will also waste time waiting for locks. You don’t need 100% CPU to suffer from this. The thread mentions that the book “Forecasting Oracle Performance” discusses this issue and cites 75% CPU as the tipping point, but I’d assume the number would differ for systems with different numbers of CPUs. I definitely need to read that book.
I also was not aware that processes stay on the CPU while waiting for a latch. I’d have assumed the OS would replace them with a runnable process? Of course the switch also costs resources, so you lose either way.
I can’t believe I ignored this until now (and that not one of my readers mentioned this!). The thread is well worth reading.
Whoaaaa! A bit of confusion here!
An Oracle process spins, waiting for a latch, for a given, finite period of time. Which is tunable, BTW.
Once that spin time is exceeded without getting the latch, the process is taken off the CPU ready-to-run queue and put in the latch wait queue.
Where it sits waiting for the latch to be released.
Once that release happens, the releasing process posts the queue to mark the next in-line queued process ready to run.
Then, the waiting process returns to the ready-to-run queue, courtesy of the normal OS scheduling algorithms.
I’m terribly over-simplifying here, as the whole thing can be very, very complex. Particularly where CPU affinity and pipeline affinity creep in, to keep all those CPU cores busy with their shared pipelines and local caches.
The reason for the initial spin is to avoid the overhead of all that complexity: if the wait is potentially very small, it’s faster to spin on a lock waiting for it to be released than it is to go through all that paraphernalia of de-queueing/re-scheduling/queueing.
As such, 100% CPU use due only to spin is not necessarily a bad thing. In fact, it can be quite good.
What you do NOT want by any means, is 100% CPU use AND lots of latch waits. That is a sure sign of under-gunned CPU resource.
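The spin-then-sleep behaviour described above can be sketched, very loosely, in Python. This is a toy model only: the class, the `SPIN_LIMIT` constant and the use of a condition variable are my own illustrative choices, not how Oracle actually implements latches.

```python
import threading
import time

SPIN_LIMIT = 2000  # stand-in for the tunable spin count mentioned above

class SpinThenSleepLatch:
    """Toy latch: spin for a bounded time, then fall back to a blocking wait."""

    def __init__(self):
        self._held = False
        self._cond = threading.Condition()

    def acquire(self):
        # Phase 1: spin, burning CPU, hoping the holder releases very soon.
        for _ in range(SPIN_LIMIT):
            with self._cond:
                if not self._held:
                    self._held = True
                    return "spin"      # got the latch while spinning
        # Phase 2: spin time exceeded; sleep until posted by the releaser.
        with self._cond:
            while self._held:
                self._cond.wait()      # off the run queue: uses no CPU
            self._held = True
            return "sleep"

    def release(self):
        with self._cond:
            self._held = False
            self._cond.notify()        # post the next queued waiter

latch = SpinThenSleepLatch()
print(latch.acquire())  # uncontended, so acquired in the spin phase: "spin"
latch.release()
```

The point of the spin phase is exactly the one made above: when the latch is released quickly, the spinner never pays the cost of being de-queued and re-scheduled.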
And of course I should also have pointed out that there are fundamentally two types of CPU time use: overhead and user process runtime.
Overhead is what CPUs have to do – OS code or application code – to make sure they can also dedicate themselves to full time running of user programs – the so-called user process runtime.
The processes involved in queueing/rescheduling/de-queueing are a portion of the many that *may* cause CPU overhead. Note that I said MAY, I didn’t say they necessarily DO.
When any process is sitting in a wait queue, it is NOT using up CPU. It’s idle, off the CPU. The wait itself does not use CPU, although of course the time spent waiting can be measured!
The process of queueing that wait and de-queueing it, is part of what causes CPU overhead. It’s called queueing overhead.
The spin on wait is an attempt to reduce that queueing overhead by replacing it with a “brute force” approach of “try-again-and-again-and-again”, for a *finite and controlled* amount of time, hopefully smaller than the time it takes to perform the queueing overhead. Otherwise there would be no net gain.
The whole idea is to reduce the overall amount of time a CPU is in overhead state and increase the time it is in running state.
Note that all this also depends on how CPU time is measured.
For example: in most UNIX systems, what you see as CPU in “idle” state is not really idle at all!
What happens is that the CPU is spending time in a tight loop doing nothing, basically twiddling its thumbs executing the same instructions again and again. It’s called the “idle loop” and the time a CPU spends executing it is measured as “idle time”.
In fact, it’s anything but that! In reality the CPU is flat out executing that small loop of code until it gets interrupted by the OS to do real work.
So in absolute terms, to say that a CPU is 10% idle means simply that it is 10% busy executing a tight loop that does essentially nothing!
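The idle-loop accounting trick can be caricatured in a few lines of Python (my own sketch; real idle loops live in the kernel, and on modern hardware they typically use halt-style instructions rather than a pure spin):

```python
# A caricature of the classic OS "idle loop": the CPU is flat out
# executing this loop, yet accounting reports the time as "idle".
def idle_loop(work_arrived):
    ticks = 0
    while not work_arrived():  # twiddle thumbs until interrupted
        ticks += 1             # each pass is real executed work
    return ticks

# Simulate an "interrupt" arriving on the fifth check.
checks = iter(range(5))
print(idle_loop(lambda: next(checks) == 4))  # prints 4
```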
Your 8 processes are wasting CPU time while spinning only if you assume that they are heavily accessing or modifying (worse) the same block(s), thus undergoing latch contention.
I think that’s fairly unusual, given the number of blocks in the buffer cache, unless the application design is somewhat flawed.
I mean: you can have 8 processes that undergo no latch contention and work as if no other process were running, or you can have latch contention; in both cases you see 100% CPU. In the first case you have high load but the app is working, in the second it’s a performance issue.
Exactly! The important consideration is to understand what “busy” means in each case.
As you pointed out in your example, 100% busy CPU with few buffer latch waits means the CPU(s) are producing near maximum throughput.
While 100% busy CPU with a lot of buffer latch waits means there is high contention for the same range of blocks, and probably one is not getting any mileage out of the last 20% or so of CPU usage. I.e., the throughput has not increased by increasing CPU usage from, say, 80%.
It does NOT mean, in any of these cases, that 100% CPU is producing LESS throughput!
Let’s not confuse “increased response time” with “reduced throughput”: the two things are potentially related, but not the same!
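The distinction can be made concrete with the textbook M/M/1 queueing formula (my own illustration, not something from the thread): as long as utilisation stays below 100%, throughput simply equals the arrival rate, while mean response time R = 1/(μ − λ) climbs steeply.

```python
# M/M/1 queue: throughput vs response time as utilisation climbs.
# Assumed numbers: a server that completes mu = 100 jobs/sec.
mu = 100.0

for lam in (50.0, 80.0, 95.0, 99.0):   # arrival rates, jobs/sec
    rho = lam / mu                      # utilisation
    r = 1.0 / (mu - lam)                # mean response time, seconds
    # Throughput equals the arrival rate whenever rho < 1: the server
    # keeps up with the work, it just makes every job wait longer.
    print(f"util={rho:4.0%}  throughput={lam:3.0f}/s  response={r * 1000:5.0f} ms")
```

At 50% utilisation the mean response time here is 20 ms; at 99% it is a full second, yet throughput still matches the arrival rate in both cases.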
As Nuno Souto points out, you have to be very careful with your terminology here. You don’t “wait” for a latch: you are either spinning (which means you are wasting CPU in the hope that the latch will become free) or you are sleeping, which means you are off the run queue (i.e. not using any CPU) and an alarm (interrupt) will put you back on the run queue, where you may spend some time getting to the head of the queue before you start spinning again.
If you want to know more about things like behaviour being “different for different numbers of CPUs” then the book to read is Cary Millsap’s Optimizing Oracle Performance, chapter 9 particularly, where he discusses queueing theory and the effects of ‘randomly arriving’ jobs for a finite number of servers. Basically, the more servers (e.g. CPUs) available, the higher the level of utilisation before you run into problems of extreme variation in response times.
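That effect of server count can be sketched with the Erlang C formula (standard M/M/c queueing theory; the numbers below are my own illustration, not taken from Millsap’s book):

```python
from math import factorial

def erlang_c(servers: int, utilisation: float) -> float:
    """Erlang C: probability an arriving job must queue (M/M/c model)."""
    a = servers * utilisation            # offered load in Erlangs
    waiting = a ** servers / factorial(servers) / (1 - utilisation)
    total = sum(a ** k / factorial(k) for k in range(servers)) + waiting
    return waiting / total

# At the same 75% per-CPU utilisation, more CPUs mean far fewer queued jobs:
for c in (1, 2, 8, 32):
    print(f"{c:2d} CPUs at 75% busy -> P(job queues) = {erlang_c(c, 0.75):.2f}")
```

With one CPU at 75% busy, three out of four arriving jobs must queue; with 32 CPUs at the same per-CPU utilisation, the probability is far smaller, which is why a bigger box can safely run “hotter”.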
I think you may introduce some confusion by talking about the “latch wait queue” and the “run queue”. The first is more of an Oracle concept, while the second is an O/S concept. From the perspective of the O/S, aren’t you simply either on the run queue or not on the run queue?
I believe, by the way, that there is an option (that may have appeared in 9i) for a process to yield immediately on a latch collision – in other words to not spin (use CPU) but immediately drop to the bottom of the run queue (i.e. not leave the run queue to wait for an alarm).
Your observation is valid. The worst CPU wastage occurs if all your processes are constantly accessing the same few blocks. (I think the most visible waits are likely to be buffer busy waits, though, if the purpose of the access is to modify the same few blocks). However, the number of latches is relatively small compared to the number of buffers (typically about 1 latch per 64 to 128 buffers) and the same competition can occur in the library cache if your processes are executing large numbers of “light-weight” statements. The scenario of 8 processes running aggressively on 8 CPUs was given only as a simple thought experiment to explain the concept of a job using more CPU than normal because of latch competition that did not involve waiting.
In practice, I have an example of a query X (say) against table T1 which takes 15 CPU seconds to complete by accessing a single block 8 million times.
If I run two copies of query X concurrently on a machine with 2 CPUs (using 9i), both copies take 45 CPU seconds to complete, where the excess CPU is due to “spin gets” on the single critical cache buffers chains latch. If I modify one copy of query X so that it runs against a clone of the target table, the two copies take 18 CPU seconds to complete because of scheduling overhead, CPU cache sniffing, etc. even though they don’t hit the same latch.
At no stage do I see any processes in the run queue (i.e. runnable but not running) for any significant period of time.
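Jonathan’s experiment can be loosely mimicked with two threads fighting over one lock versus two private locks. This is my own rough analogue in Python, not his test: CPython’s global interpreter lock blunts the effect considerably, so treat it as the shape of the experiment rather than a reproduction of the numbers.

```python
import threading
import time

N = 100_000  # increments per thread; arbitrary illustrative workload

def hammer(lock, counter):
    # Each increment takes a lock first: the analogue of a latch get.
    for _ in range(N):
        with lock:
            counter[0] += 1

def run(shared_lock: bool) -> float:
    """Time two threads incrementing counters under a shared or private lock."""
    if shared_lock:
        lock = threading.Lock()
        locks = [lock, lock]              # both threads contend for one lock
    else:
        locks = [threading.Lock(), threading.Lock()]
    counters = [[0], [0]]
    threads = [threading.Thread(target=hammer, args=(locks[i], counters[i]))
               for i in range(2)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    assert counters[0][0] == N and counters[1][0] == N
    return time.perf_counter() - start

print(f"shared lock  : {run(True):.3f}s")
print(f"private locks: {run(False):.3f}s")
```

As in Jonathan’s figures, even the private-lock run pays some extra cost over a single-threaded run, from scheduling and time-slicing rather than lock contention.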
See above response to Rudy – with no latch waits, I can still use a huge amount of CPU on latch spinning that does no effective work; and even when not competing for the same latch I can still see extra CPU disappearing somewhere because of time-slicing and some type of scheduling activity.
Clearly 100% CPU does NOT “mean” that you are producing near maximum throughput. In fact my example IS a case of 100% CPU obviously producing less throughput. On the other hand, if I have to wait because you are working, the presence of WAITS (as opposed to spins and CPU usage) allows the throughput to stay constant while both of us see increased response time.
Wow, lots of discussion appeared while I was busy with my day job 🙂
Thank you so much! I learned more from reading your comments here than from a full year of OS theory course at the university.
I’m still not clear on how I’d see “latch spinning”. A “latch wait” will be visible as a wait event, but it seems that latch spins will be accounted as CPU time, just like actual productive work?
Great input as always, Jonathan.
Thanks a lot for jumping in and making things a lot more clear.
[…] afterwards my blog dashboard showed a couple of incoming references from a blog entry that Chen Shapira had made about my comments. Her blog had received a couple of follow-up comments (from Nuno Souto, […]
[…] out, I care a lot about what my CPUs are doing. Last time I came up with the epiphany that 100% CPU utilization is a bad idea. During the discussion that followed, Jonathan Lewis and Noons took the time to explain to me the […]