You know that you care a lot about a topic, if you find yourself thinking about it again and again, each time hoping to gain enough insights to stop this cycle.
Turns out, I care a lot about what my CPUs are doing. Last time I came up with the epiphany that 100% CPU utilization is a bad idea. During the discussion that followed, Jonathan Lewis and Noons took the time to explain to me the difference between waiting and spinning.
The topic came up again as I’m digging for the worse ways concurrency can go wrong.
Concurrency becomes interesting when the concurrent processes attempt to access shared resources, and since Oracle has shared memory, the shared resources tend to be areas in the memory.
We are in 11g now, so we have 3 Oracle ways to protect memory structures – Locks, latches and mutexes (Oracle mutexes, which should not be confused with OS mutexes.). Below, I’ll summarize the main differences between them. Nearly everything I wrote (and a lot more including cool examples) is covered by Tom Kyte’s Expert Oracle Database Architecture book. I’m just summarizing the importnat points below for my (and your) convinience.
When you read about latches, the first thing you hear is that “Latches are lightweight locks”. Lightweight in this case means “Takes less bits in memory”. Why do we care about our locking structures being small? Small memory footprint of the locking mechanism will translate to faster checks and changes to it. Latches are smaller than locks, and the new 10g mutexes are even smaller than latches.
Locks work by queueing. One process holds the lock for a resource, everyone else who tries to access the lock queues up and goes to sleep (i.e. off the CPU). When the current process finishes, the next in line becomes runnable and now owns the lock on the resource. Queuing is nice, because we have a bunch of queueing theory that lets us predict response times, and waits and such. It is also nice, because while Oracle manages the locks and queues it gives us tons of information about who is blocking and who is waiting. And as one last nice, while all those processes are waiting for locks, they are not using CPU, nor represent any cpu scheduling overhead.
Latches work by spinning (mostly). Think of a case when you know a process will need the memory structure for a very short amount of time. Do you really want to maintain queues, waste time on context switching, lose your CPU cache all for just few milliseconds of waiting? Latches exist for this reason. If a process tries to access the latch and its busy, it will keep on retrying for a while, still using the cpu. If during this “retry time” the latch became free, the process can take the latch and we are saved from the need to context switch. If it didn’t get the latch after several retries, the process goes off the CPU.
The important thing to note it that there is no queue. So there is a lot of uncertainty around when your process will finally get its latch. It is entirely possible that processes that started spinning on the latch later will get the latch first due to a fluke of luck. Because there is no queue, it seems that there is no good way to find a list of processes that are waiting for a latch (maybe by looking at statistics and wait tables?). You do have good information about how many requests, misses, spins and sleeps per latch, which is very useful information.
It is interesting to see how Oracle attempts to prevent the situation where a process waits forever for a latch, and keeps missing it because newer processes keep snatching the latch away as soon as it is freed. When reading about “latch free” wait events, the documentation says: ” The wait time increases exponentially and does not include spinning on the latch (active waiting). The maximum wait time also depends on the number of latches that the process is holding. There is an incremental wait of up to 2 seconds.” It is nearly the same mechanism ethernet uses to avoid machines starving for network connections (“truncated binary exponential backoff“) . Incrementally increasing the wait times reduces the probability of collision.
Mutexes are best covered by Tanel Poder. They are even smaller and faster to take than latches, they also work as a cursor pin (signifying shared or exclusive ownership of a cursor), and they give you even less information about who is waiting for what and for how long. You have information about sleep times, but not number of requests and misses.