Fetch as Much as You Can

In my “Everything DBAs Need to Know About TCP/IP” presentation, I discussed the issue of fetch size – tuning the number of rows your application will get from Oracle in a single “fetch” call.

The problem is interesting when you want to get a large number of rows from the DB to the application. Cary Millsap would say that no one should want to do that, but Cary obviously never had to work with a statistical analysis application. These things demand large amounts of raw data to do their thing.

If this is too slow for your users, you should look at statistics like “round trips”. If the number of round trips looks completely off the wall – say, 200,000 round trips for 2M rows – someone probably left JDBC at its default fetch size of 10.

In my presentation, I mentioned that there is really no problem with setting a very high fetch size. I recommended setting it as high as possible while taking into account the amount of memory each row will consume.
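The knob goes by different names in different clients: in JDBC it is Statement.setFetchSize(), in SQL*Plus it is arraysize. A minimal SQL*Plus sketch (big_table is a made-up name for illustration):

SQL> set arraysize 1000
SQL> select * from big_table;
SQL> select n.name, m.value from v$mystat m, v$statname n where m.statistic# = n.statistic# and n.name = 'SQL*Net roundtrips to/from client';

The last query shows the round trips for your own session – with a fetch size of 1000, it should come out to roughly one round trip per 1000 rows fetched.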

Charles Hooper blogged last week with a new reason to use large fetch sizes: it turns out that a large fetch allows Oracle to keep the block pinned for the duration of the fetch. He references an OTN discussion where Jonathan Lewis says:

“…what is the array fetch size? If the array fetch size is large enough the number of block visits would be similar to the number of physical block reads. If the arraysize is one then the number of block visits could be close to the number of rows fetched as a block has to be released at the end of each fetch and then reacquired for the next fetch.”

So a higher fetch size ends up with fewer consistent gets. Even if you are not tuning any specific issue that requires lowering the number of consistent gets, I hope you’ll agree that getting the same results with less work is a good thing.
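This is easy to see for yourself with autotrace – a rough sketch, again with a made-up table:

SQL> set autotrace traceonly statistics
SQL> set arraysize 2
SQL> select * from big_table;
SQL> set arraysize 1000
SQL> select * from big_table;

The statistics for the second run should show far fewer consistent gets (and round trips), because each block is visited once per fetch instead of once or twice per row.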

One more reason to set fetch size to as many rows as your memory allows.


One More Thing Everyone Should Know About Queues

I already posted two things everyone should know about queues, but the incidents of the last month made me realize I missed another very important queue fact.

Don’t double the load on a server and expect that response times will be “a little slower”. If the server is above 30-40% capacity, doubling the load is likely to be catastrophic.
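The queueing math backs this up. For a single server (assuming the textbook M/M/1 model), the average response time is

R = S / (1 − ρ)

where S is the service time and ρ is the utilization. At 48% utilization, R ≈ 1.9S – about double the service time, and nobody notices. Double the load to 96% utilization and R = S / 0.04 = 25S – not “a little slower”, but a 25x blowup.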

Suppose you have two servers, for example two Netapp heads, that are operating at 48% capacity each, and suddenly one of them fails and you have to move all the load to the other. This means you now have one server at close to 100% capacity. Users will not complain that things are a bit slow, they will complain that nothing is working.

Someone once told me that if I have two RAC nodes running at 100% CPU each, I do not have high availability. The truth is that you stop having high availability long before the 200% CPU point.

Oh, and in case anyone wonders what we did about the overloaded Netapp: we stopped every non-essential process reading from it. This included snapmirrors, non-production databases, exports, VMwares, and probably more stuff than I know about. This brought utilization down to around 60% and life was good (except that we weren’t too comfortable about the lack of snapmirrors and exports).


Two Things Everyone Should Know About Queues

If you are in the performance business, you should know a lot about queues. How to use them to find performance problems, predict issues, plan your capacity, model your load test results, etc. Queues are just a part of what you should know and be comfortable discussing.

But what if you are not a performance professional? What if you are a sales person or a manager or a dentist? Do you still need to understand queues?

Obviously not everyone needs to know queueing theory at a precise mathematical level. But queues are everywhere, and sometimes I wish people around me understood them better – it would make it easier for me to explain things. There are two things I think everyone should know about queues:

  1. If it takes me one hour on average to handle a request, and I get one request every hour – most of the time requests will be delayed due to queueing and backlog. Running your DBAs (or servers, or doctors, or toll-booths) at full utilization with every minute accounted for means queueing and delays.
  2. If there are multiple servers (or DBAs or DMV clerks), the most efficient way to get service is to arrange all the requests in a single queue and have all servers take requests from that queue. The way supermarkets do it – a separate queue per cashier – is inefficient. Deciding that you want all your requests handled by a specific DBA because she is better looking is also less efficient than entering the request into the general DBA queue. (See the worked example after this list.)
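To put rough numbers on the second point (assuming the textbook M/M/1 and M/M/2 models): take two servers, each 50% utilized, with average service time S. With a separate queue per server, the average wait in queue at each one is

Wq = ρS / (1 − ρ) = S

With a single shared queue feeding both servers at the same total load, the Erlang C formula gives

Wq = S / 3

Same servers, same load, a third of the queueing – simply because a request can never be stuck waiting at one server while the other sits idle.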

Spread the word 🙂


Unexpected Side Effect of Table Shrinking

Last year, I wrote a bit about table shrinking.  It generated quite a bit of discussion and both Freek d’Hooge and Venz warned about some of the side effects involved, especially when using “enable row movement”.

I just discovered another effect of row movement. As the name implies, after enabling row movement, rows get moved around. This can impact the clustering factor of your indexes.

If you had an index on a monotonically increasing column (such as “create date”), the table and the index would be ordered in the same way. This results in a low clustering factor, which means the optimizer will consider the index more favorably for some operations. I’m not a big expert on the optimizer, but I assume that if the index and the table are ordered the same way, a range scan through the index visits each table block once, instead of bouncing back to the same table blocks again and again for rows that are scattered all over. (Again, this is common sense only – I did not test or even deeply study this.)

Anyway, so you have this nicely ordered table, and a nicely ordered index to match, and you probably enjoy all kinds of performance benefits from this arrangement.

Now you enable row movement and shrink. Rows get moved around, and the order of rows in the table no longer matches the order of rows in the index. So you have a smaller table (fewer blocks) and a higher clustering factor. Maybe scanning portions of the table will be faster since the table is smaller and you read fewer blocks, but maybe it will take longer because the order changed and the index is less useful now. In any case, this is something to consider before deciding to shrink a table with row movement – it can have an unexpected effect on your execution plans.

Here’s a small test I ran to verify:


SQL> create table test (id number);

Table created.

SQL> INSERT INTO test SELECT rownum FROM dual CONNECT BY LEVEL <= 10000;

10000 rows created.

SQL> commit;

Commit complete.

SQL> create index test_i on test (id);

Index created.

SQL> EXEC dbms_stats.gather_table_stats(ownname=>'SCOTT', tabname=>'TEST', estimate_percent=> null, cascade=> true, method_opt=>'FOR ALL COLUMNS SIZE 1');

PL/SQL procedure successfully completed.

SQL> select t.table_name,i.index_name,t.blocks,t.num_rows,i.clustering_factor from user_tables t, user_indexes i where t.table_name=i.table_name and i.index_name='TEST_I';

TABLE_NAME                     INDEX_NAME                         BLOCKS   NUM_ROWS CLUSTERING_FACTOR
------------------------------ ------------------------------ ---------- ---------- -----------------
TEST                           TEST_I                                 20      10000                16

SQL> delete from test where mod(id,2)=0;

5000 rows deleted.

SQL> commit;

Commit complete.

SQL> EXEC dbms_stats.gather_table_stats(ownname=>'SCOTT', tabname=>'TEST', estimate_percent=> null, cascade=> true, method_opt=>'FOR ALL COLUMNS SIZE 1');

PL/SQL procedure successfully completed.

SQL> select t.table_name,i.index_name,t.blocks,t.num_rows,i.clustering_factor from user_tables t, user_indexes i where t.table_name=i.table_name and i.index_name='TEST_I';

TABLE_NAME                     INDEX_NAME                         BLOCKS   NUM_ROWS CLUSTERING_FACTOR
------------------------------ ------------------------------ ---------- ---------- -----------------
TEST                           TEST_I                                 20       5000                16

SQL> alter table test enable row movement ;

Table altered.

SQL> alter table test shrink space cascade;

Table altered.

SQL> alter table test disable row movement ;

Table altered.

SQL> EXEC dbms_stats.gather_table_stats(ownname=>'SCOTT', tabname=>'TEST', estimate_percent=> null, cascade=> true, method_opt=>'FOR ALL COLUMNS SIZE 1');

PL/SQL procedure successfully completed.

SQL> select t.table_name,i.index_name,t.blocks,t.num_rows,i.clustering_factor from user_tables t, user_indexes i where t.table_name=i.table_name and i.index_name='TEST_I';

TABLE_NAME                     INDEX_NAME                         BLOCKS   NUM_ROWS CLUSTERING_FACTOR
------------------------------ ------------------------------ ---------- ---------- -----------------
TEST                           TEST_I                                  8       5000                22


Latches, Spinning and Queues

You know you care a lot about a topic when you find yourself thinking about it again and again, each time hoping to gain enough insight to stop the cycle.

Turns out, I care a lot about what my CPUs are doing. Last time I came up with the epiphany that 100% CPU utilization is a bad idea. During the discussion that followed, Jonathan Lewis and Noons took the time to explain to me the difference between waiting and spinning.

The topic came up again as I’m digging into the worst ways concurrency can go wrong.

Concurrency becomes interesting when concurrent processes attempt to access shared resources, and since Oracle is built on shared memory, the shared resources tend to be areas of memory.

We are in 11g now, so we have three Oracle ways to protect memory structures – locks, latches and mutexes (Oracle mutexes, which should not be confused with OS mutexes). Below, I’ll summarize the main differences between them. Nearly everything I wrote (and a lot more, including cool examples) is covered in Tom Kyte’s Expert Oracle Database Architecture book. I’m just summarizing the important points below for my (and your) convenience.

When you read about latches, the first thing you hear is that “latches are lightweight locks”. Lightweight in this case means “takes fewer bits in memory”. Why do we care about our locking structures being small? A small memory footprint for the locking mechanism translates to faster checks and changes to it. Latches are smaller than locks, and the mutexes introduced in 10g are even smaller than latches.

Locks work by queueing. One process holds the lock for a resource; everyone else who tries to take the lock queues up and goes to sleep (i.e. off the CPU). When the current process finishes, the next in line becomes runnable and now owns the lock on the resource. Queueing is nice because we have a bunch of queueing theory that lets us predict response times and waits. It is also nice because, since Oracle manages the locks and queues, it gives us tons of information about who is blocking and who is waiting. And one last nice thing: while all those processes are waiting for locks, they are not using CPU and don’t add any CPU scheduling overhead.
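For example, you can simply ask who is blocking whom – v$session has carried a blocking_session column since 10g:

SQL> select sid, blocking_session, event, seconds_in_wait from v$session where blocking_session is not null;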

Latches work by spinning (mostly). Think of a case where you know a process will need the memory structure for a very short amount of time. Do you really want to maintain queues, waste time on context switching, and lose your CPU cache, all for just a few milliseconds of waiting? Latches exist for this reason. If a process tries to take the latch and it’s busy, it will keep retrying for a while, still on the CPU. If during this “retry time” the latch becomes free, the process takes the latch and we are saved the need to context switch. If it still hasn’t gotten the latch after several retries, the process goes off the CPU.

The important thing to note is that there is no queue, so there is a lot of uncertainty about when your process will finally get its latch. It is entirely possible that a process that started spinning on the latch later will get it first, by a fluke of luck. Because there is no queue, there seems to be no good way to find the list of processes that are waiting for a latch (maybe by looking at statistics and wait tables?). You do, however, have good information about how many gets, misses, spins and sleeps happened per latch, which is very useful.
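That per-latch bookkeeping is exposed in v$latch:

SQL> select name, gets, misses, spin_gets, sleeps from v$latch order by sleeps desc;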

It is interesting to see how Oracle attempts to prevent the situation where a process waits forever for a latch, and keeps missing it because newer processes keep snatching the latch away as soon as it is freed. When describing the “latch free” wait event, the documentation says: “The wait time increases exponentially and does not include spinning on the latch (active waiting). The maximum wait time also depends on the number of latches that the process is holding. There is an incremental wait of up to 2 seconds.” This is nearly the same mechanism Ethernet uses to resolve collisions on a shared segment (“truncated binary exponential backoff”). Incrementally increasing the wait times reduces the probability of repeated collisions.

Mutexes are best covered by Tanel Poder. They are even smaller and faster to take than latches; they also work as a cursor pin (signifying shared or exclusive ownership of a cursor); and they give you even less information about who is waiting for what and for how long. You get information about sleep times, but not about the number of requests and misses.
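What little instrumentation there is lives in v$mutex_sleep (and v$mutex_sleep_history):

SQL> select mutex_type, location, sleeps, wait_time from v$mutex_sleep order by sleeps desc;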


Newsflash – 100% CPU is worse than you think!

Found a recent thread on OTN that discusses my recent obsession: http://forums.oracle.com/forums/thread.jspa?threadID=650986&start=0&tstart=0

Jonathan Lewis points out a major issue with running your DB at 100% cpu:

“Consider the simple case of 8 queries running on 8 CPUs. They will be competing for the same cache buffers chains latches – which means that seven processes could be spinning on the same latch while the eighth is holding it. None of the processes ever need wait, but most of them could be wasting CPU most of the time.”

Amazing. During my entire discussion of CPU load and process priorities I completely ignored the fact that I’m running two dual-core CPUs on that system, and that all Oracle processes use shared memory – which means shared resources, which means latches and locks, which means CPU wasted on contention.

And this complicates the discussion, because 6 processes on 8 CPUs can also waste time spinning on the same latches – you don’t need 100% CPU to suffer from this. The thread mentions that the book “Forecasting Oracle Performance” discusses this issue and gives 75% CPU as the tipping point, but I’d assume the number would be different for systems with different numbers of CPUs. I definitely need to read that book.

I also was not aware that processes stay on the CPU while spinning on a latch. I’d have assumed the OS would replace them with a runnable process? Of course, the context switch also costs resources, so you lose either way.

I can’t believe I ignored this until now (and that not one of my readers mentioned this!). The thread is well worth reading.


Oracle on NFS and TCP Throttling

In an old post about the futility of best practices, I mentioned a strange best practice that is used in our organization: “mounting a single NFS volume to multiple mount points and configuring the DB to use them as though they were separate volumes (i.e. put different data files on different mount points).”

I was 100% sure that there was absolutely no reason for this practice. I thought it was there because someone misunderstood OFA, and no one changed it because they worried about imaginary performance implications. Reader comments on this blog further convinced me of this.

This week I had lunch with a former colleague. He worked as a DBA in the same organization for at least 10 years before I joined, and he is a real storage expert (a rare gift – a DBA who knows storage). I had to ask him whether this strange best practice was in effect when he was working here, and what he thought of it. In response he burst out laughing. I thought it was because he also found the practice ridiculous, but it turned out (once he stopped laughing and was able to talk again) that he was the one who invented it. He had excellent reasons at the time. It is not his fault that the practice was kept long after the reasons were no longer relevant.

So, why would you want to mount the same volume on different mount points?

If you use NFS over TCP (I think no one runs NFS over UDP anymore), and you have a high-throughput system (like a data warehouse), you risk reaching the point where the ACKs from the Netapp are not arriving fast enough, and Linux will throttle your connection.

The reason for this behavior lies in TCP congestion control. Congestion control was introduced in the eighties to prevent the internet from choking on noisy lines, and it is built around a dynamic value called the TCP congestion window. The congestion window is the amount of data a sender will agree to send without receiving any ACKs. If that amount of data has been sent and no ACK has arrived yet, the connection will patiently wait until the ACKs arrive. There are good reasons for this behavior. First, we don’t want the OS to risk running out of memory keeping all those unacknowledged packets around. Even more important, it is good network citizenship: maybe there is a bottleneck on the way and the packets never reach the target – if you keep sending, the bottleneck will only get worse.
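The congestion window also puts a hard cap on throughput, since you can never push more than one window’s worth of data per round trip:

max throughput ≈ window size / round-trip time

As a rough worked example: with a 1ms round trip to the filer, a 64KB window caps a single connection at about 64MB/s, while a 1MB window allows on the order of 1GB/s (assuming the physical line can keep up).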

However, Linux defaults are really too conservative for database systems, which is why Oracle’s installation instructions include changes to these defaults:

net.core.rmem_default=262144
net.core.wmem_default=262144
net.core.rmem_max=1048576
net.core.wmem_max=1048576

These parameters control the send and receive buffer sizes. In this post I’m talking about the send buffer (wmem), which is used to govern the TCP congestion window. The receive buffer (rmem) relates to how much data the machine will accept when acting as a client, and is out of scope here. Oracle’s recommended buffer sizes are a nice start, but you can change these values to match the throughput your system needs and your hardware supports.
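These are kernel parameters, so you would put them in /etc/sysctl.conf and apply them with sysctl -p, or change them on the fly – a quick sketch:

# check the current value
sysctl net.core.wmem_max
# raise it (takes effect immediately, but is lost on reboot unless added to /etc/sysctl.conf)
sysctl -w net.core.wmem_max=1048576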

So, nowadays improving throughput by playing with window sizes is all fun and games. But according to the old and wise DBA, back in the days of Linux 2.2 you did not want to change these values. You had to work around the issue in a different way.

By mounting your single NFS volume on several mount points, you tricked Linux into creating a separate TCP connection – and therefore a separate send buffer and congestion window – for each mount point. Now you have as many multiples of the original window as you want. As long as the physical line keeps up, you are fine.
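In /etc/fstab the trick would look something like this (filer and path names are made up) – the same volume mounted twice, giving you two TCP connections and two congestion windows:

filer01:/vol/oradata   /u01/oradata   nfs   rw,hard,rsize=32768,wsize=32768   0 0
filer01:/vol/oradata   /u02/oradata   nfs   rw,hard,rsize=32768,wsize=32768   0 0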

Great solution. Cheap, simple, scalable, maintainable, stable, everything you want.

Except that this same practice was still in use 10 years later, on low-throughput OLTP systems that are not even close to saturating the default window sizes, and when there is an easier way to adjust the window anyway. Because no one knew why it was done.

Aside from the cool network details (I love cool network details – if you enjoy them as well, you probably want to read Tanel Poder’s blog), I think this illustrates the story behind many best practices: once upon a time, it was a great solution to a problem. Then the problem was forgotten, the technology changed, but the solution stayed.


Anecdote about Rebuilding Indexes

A year ago I was handling a rather confusing performance issue. It was the usual problem – a query that took forever to run. While checking the issue, I discovered that one of the main tables the query touched had amazing size variance – at different times it held anything between 3 rows and 3 million rows. This is obviously a design mistake that can indeed cause performance problems. An experienced DBA might have configured a job to collect statistics very frequently, or looked into bind variable peeking, or just gone to the developers and demanded an immediate fix.

I’m not an experienced DBA. I noticed that the indexes were highly fragmented, with lots of deleted leaf nodes, so I decided to rebuild them. An online rebuild, so users would not be affected, and parallel, so it would go faster.

It was a disaster. First of all, rebuilding online does affect performance. A lot. Users did notice. The table was effectively locked until we killed the rebuild. It’s been a year and Oracle support is still silent about whether this is expected behavior for an online rebuild. Also, since the index was rebuilt with the parallel option, it kept its parallel degree and many queries started running in parallel, which also had horrible effects, so we had to fix that issue too.
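For the record, the statements involved would have looked something like this (index name made up) – a rebuild done with the parallel clause leaves the index with that parallel degree until you reset it:

SQL> alter index orders_ix rebuild online parallel 8;
SQL> alter index orders_ix noparallel;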

After we cleaned up the entire mess, I got a stern lecture from Senior DBA: “You can’t solve performance problems by randomly rebuilding indexes. You must thoroughly understand the problem first.”

Fast forward to this weekend. A customer complains about very slow queries. A quick check of v$sql shows that each execution retrieves about 100 rows but does about 10 million buffer gets. Ouch. The explain plan looked reasonable and I couldn’t see any cause for the excessive gets. I decided to wait until Monday, run a trace, and see what was really happening there. I never got to run that trace. Senior DBA got home later that day, saw the emails about the query, and immediately decided to rebuild indexes!

Needless to say, rebuilding indexes works great when you really understand the problem. It took 2 hours, caused no additional performance issues, and after it was done, the query required only 1000 buffer gets.

I still suspect that the rebuild didn’t change much, and that what really made the difference was that Senior DBA also collected statistics on the entire table and all its indexes immediately after the rebuild. We can’t really know, but as far as I saw, the plan did change.

Regardless of the exact facts, this reminded me of an AI Koan:

A novice was trying to fix a broken Lisp machine by turning the power off and on. Knight, seeing what the student was doing, spoke sternly: “You cannot fix a machine by just power-cycling it with no understanding of what is going wrong.” Knight turned the machine off and on. The machine worked.

(Knight is the inventor of the Lisp machine).

You can read more AI Koans here. Highly recommended.


Don’t use AWR for wait event analysis

Tonight, at midnight, inserts slowed to a crawl on one of our main production databases. This database processes about 300 inserts every second, but for about 20 minutes starting at midnight, each insert took almost a minute. The problem resolved itself, but now the application owner wants answers – what caused the slowdown at midnight?

A co-worker ran AWR and found that the top wait event was enq: TM – contention. A lock! This looks promising, but inserts usually don’t lock. That’s one of the most important principles of Oracle – readers don’t block writers and writers don’t block readers. The worst I’d expect from inserts would be latch contention on a hot block. But my co-worker is a very experienced DBA, and he knew that TM contention can come from an unindexed foreign key, so he started looking for one.
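A rough sketch of a query for finding unindexed foreign keys (simplified – it only checks that each foreign key column appears in the same position in some index on the table):

SQL> select c.table_name, cc.column_name from user_constraints c, user_cons_columns cc where c.constraint_name = cc.constraint_name and c.constraint_type = 'R' and not exists (select 1 from user_ind_columns ic where ic.table_name = cc.table_name and ic.column_name = cc.column_name and ic.column_position = cc.position);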
At this point I arrived at the office, and he updated me on the situation.
I don’t like AWR at all, so I took out my favorite tool – Confio. This is a nice graphical tool that helps with wait event analysis, but you can probably get pretty much the same data by digging into v$active_session_history yourself.
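Something along these lines gets you the event, the statement, and the machine in one shot (the machine column appeared in 11g; adjust the time window to taste):

SQL> select event, sql_id, machine, count(*) samples from v$active_session_history where sample_time > sysdate - 1/24 group by event, sql_id, machine order by samples desc;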

Confio also showed that TM contention was the largest wait event, but it also showed the queries waiting (lock table … in exclusive mode) and the machines. The locked table and the machines had nothing to do with the insert issue – a completely different part of the application. So we could skip this event. I don’t think you can get this kind of information from AWR!

The second and third highest events were “db file sequential read” and “gc current request”. These events are so typical of selects (usually with a bad index) that if I saw them in AWR I’d probably assume they were unrelated, but Confio showed, clear as day, that all the queries with these waits were inserts. I’m not talking about a fancy “insert into … select”, I’m talking about a plain “insert into … values”.

I still don’t know why a plain insert had to wait on “db file sequential read”, or why it did so only at midnight and stopped 20 minutes later. But at least now I know where to look!

Note how AWR not only led us in the wrong direction, it also didn’t give us enough information to follow up on the issue – like the queries, tables and machines involved.