Fetch as Much as You Can

In my “Everything DBAs Need to Know About TCP/IP” presentation, I discussed the issue of fetch size – tuning the number of rows your application will get from Oracle in a single “fetch” call.

The problem is interesting when you want to get a large number of rows from the DB to the application. Cary Millsap would say that no one should want to do that, but Cary obviously never had to work with a statistical analysis application. These things demand large amounts of raw data to do their thing.

If this is too slow for your users, you should look at statistics like “round trips”. If the number of round trips looks completely off the wall – say, 200,000 round trips for 2M rows – someone might have left JDBC at its default fetch size of 10.

In my presentation, I mentioned that there is really no problem with setting a very high fetch size. I recommended setting it as high as possible while taking into account the amount of memory each row will consume.
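The relationship between fetch size and round trips is just ceiling division. A quick Python sketch (the 5000 here is an arbitrary illustration, not a recommendation):

```python
def round_trips(total_rows, fetch_size):
    """Number of fetch calls needed to pull total_rows from the server."""
    return -(-total_rows // fetch_size)  # ceiling division

# JDBC's default fetch size is 10:
print(round_trips(2_000_000, 10))    # 200,000 round trips for 2M rows
print(round_trips(2_000_000, 5000))  # 400 round trips
```

If each row takes, say, 1 KB, a fetch size of 5000 means roughly 5 MB of client-side buffer per open cursor – that is the trade-off you are sizing.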

Charles Hooper blogged last week with a new reason to use large fetch sizes: it turns out that a large fetch size allows Oracle to keep the block pinned for the duration of the fetch. He references an OTN discussion where Jonathan Lewis says:

“…what is the array fetch size? If the array fetch size is large enough, the number of block visits would be similar to the number of physical block reads. If the arraysize is one then the number of block visits could be close to the number of rows fetched, as a block has to be released at the end of each fetch and then reacquired for the next fetch.”

So a higher fetch size ends up with fewer consistent gets. Even if you are not tuning any specific issue that requires lowering the number of consistent gets, I hope you’ll agree that getting the same results with less work is a good thing.

One more reason to set fetch size to as many rows as your memory allows.

RMOUG Presentations

Like many other DBAs, I’ll be attending the RMOUG Training Days conference on Feb 17-18 in Denver. I’ll give two presentations at the conference, both on the same day – just thinking about it makes me exhausted.

The first presentation is “Everything DBAs need to know about TCP/IP Networks”. Here’s the paper and the slides. I’ll also present this at NoCOUG’s winter conference in Pleasanton, CA. Maybe you’d prefer to catch me there.

The second presentation is “Analyzing Database Performance using Time Series Techniques”. Here’s the paper and the slides.

I still have time to improve the presentations and papers – so comments are very welcome 🙂

What is “Network Delay”?

Data Guard is often used for disaster planning, where the primary server is usually very far from the failover server – often on a different continent.

We are also planning to use Data Guard, but for a different disaster. We are moving about 90% of our operation to a new data center, about 3,000 miles away, with four hours of downtime. There is an old Russian saying – “two moves equal one fire” – so I guess we are planning for 50% of a disaster 🙂

One of the questions that pops up a lot is the question of network delays. It’s not a very well defined question. The person asking usually starts with “A change was done on production at 3pm. When will it be applied on the failover?” and the next question is: “How much of this gap is spent on network-related waits, rather than Oracle waits or disk waits?”

There are three important factors that will influence the answer:

  1. Network latency. The time it takes to send one packet of data from source to destination. It is the easiest to measure – just ping the destination – and it is influenced mostly by the physical distance between the two locations and the speed of light.
  2. Bandwidth utilization. The time it takes to send one packet is interesting, but we are more interested in the time it takes to send 500 MB of redo log. We have a nice OC3, theoretically capable of 155 Mb/s. So theoretically 500 MB should take around 26 seconds? Not really. First of all, a network is a queueing system, and we all know that we shouldn’t run a queueing system at 100% capacity, so we can’t use all 155 Mb/s. TCP has congestion control built in, so it won’t let you send all your data at once; it will make you wait for the other side to start acknowledging, and it carefully controls the amount of data you are allowed to send at one time. Oh, and other applications may be using the line too. Given all that, it should be obvious that the percentage of the nominal bandwidth you actually manage to use has a huge impact on transfer rates. I’ve no idea how to check the line utilization myself – but my network manager can send me very nice graphs 🙂
  3. Congestion and errors. You have an SLA with your provider and all, but it’s a fact of life that not all the packets that leave the source arrive safely at the destination. Sometimes they get lost, or arrive in the wrong order. These errors have a disproportionate impact on transfer time – 1% lost packets can cause a 200% change in transfer times, because once TCP has to retransmit lost packets it starts sending data very slowly, waiting to make sure each piece is received on the other side before sending more – and utilization drops like a rock.
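A back-of-the-envelope sketch of the first two factors, in Python (the distance, sizes, and utilization figures here are illustrative, not measurements):

```python
FIBER_KM_PER_S = 200_000  # light in fiber travels at roughly 2/3 of c

def one_way_latency_ms(distance_km):
    """Lower bound on latency from distance alone, ignoring hops and queuing."""
    return distance_km / FIBER_KM_PER_S * 1000

def transfer_seconds(size_mb, link_mbps, utilization=1.0):
    """Idealized time to push size_mb megabytes over a link rated in Mb/s."""
    return size_mb * 8 / (link_mbps * utilization)

# ~3000 miles (~4800 km) between data centers:
print(round(one_way_latency_ms(4800), 1))         # 24.0 ms each way, at best
# 500 MB of redo over an OC3:
print(round(transfer_seconds(500, 155), 1))       # 25.8 s at full rate
print(round(transfer_seconds(500, 155, 0.5), 1))  # 51.6 s at 50% usable bandwidth
```

The third factor (loss and retransmits) is exactly what these idealized formulas leave out, which is why a real transfer is slower still.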

If you are in the business of getting data across the ocean at a decent speed, you should also know that there are companies that do WAN acceleration by addressing the causes of delay I mentioned above, and by introducing compression and proxies into the game. It’s worth taking a look.

Most of all, don’t estimate delays by pinging remote machines. Talk to your network manager.

Chicken and Egg Troubleshooting

It all started about 10 days ago, with a phone call at 8am. 8am phone calls mean that something bad happened at 4am, but the system kept functioning, so it wasn’t worth waking anyone up about it.

One of our standalone databases was swapping like there is no tomorrow (or more precisely, no free memory). Load average was up at around 20 – on a 2-CPU machine. We could not find any process to blame for the situation, so we did the usual – restarted the instance. This solved the immediate problem.

During the post mortem, one of the night DBAs reported that at some point during the night the machine lost connectivity to the Netapp for about 10 minutes. He did not have an exact time for when it happened, we couldn’t find any trace of it in the database logs, and the system people denied it completely – it did not happen, it could not happen. Except that we had already seen that loss of connectivity to the Netapp results in a large number of processes in “uninterruptible sleep” mode, which drives load average up… So this was still our best explanation. And it was not a very good one.

During the next few days, at least twice a day, load average and swap went way, way up on that machine. Most of the CPU time was system time, and we still couldn’t find a single process that was taking too much memory. Sometimes the situation went away quickly; sometimes I had to kill a bunch of old connections to stabilize it. Naturally, I was unhappy about it. When I’m unhappy, I start looking at numbers. I compared the broken standalone to two other standalone servers that were doing similar work. The good servers had an average of 200 concurrent connections each, serving about 20 customers with 10 connections each. The broken server had 500 connections: 30 customers, each with 15 connections. It also had only 12G of memory, compared to 16G on the other servers.

Aha! I said to my boss. The machine is overloaded. We can try to move a few customers to a less loaded server, but we should also consider moving to a better machine, or adding another server.

My boss was skeptical – if this is just connection load, why did it spring on us all of a sudden? Shouldn’t we have seen a slow degradation? Well, each connection consumes some memory. As long as the machine had enough memory, there was no issue. Once it started swapping, things went wrong very fast.

My boss was not convinced, and I was not convinced either. There was the possibility that the high load was the cause of the many connections, not the result of them: when each task takes much longer to complete, you will have more users trying to complete their tasks at the same time. But my plan was the only thing that was remotely close to a solution, so we decided to try it anyway.

Lucky for us, that night the machine lost connectivity to the Netapp again. This time the issue did not resolve itself, and the sysadmins had to replace the network cards on the server to solve the problem.

Needless to say, all my problems – load, swap, too many connections, went away that night and never came back.  I’ve been staring at a bunch of eggs for days, trying to decide which one of them was the chicken.

How did the broken network card cause all these problems? In two ways, I think.

The first one is quite simple – every IO operation the DB had to do took much longer due to the broken network connectivity. This means that everything takes longer, which means more connections, all of them waiting for IO most of the time.

Which leads straight into the second problem – since the network was now unreliable, and since TCP tries to cover up the problems and keep communication reliable, each socket now had to keep a huge buffer of data to be retransmitted, again and again. With more connections, each taking much more memory, we went deep into swap. Now the OS was so busy managing broken connections and swapping that it had very little time for the work we actually wanted it to do, and load average went way up too. We had more and more connections, taking more and more memory and getting less and less IO and CPU to do their work.

Next time I see high swap and system taking lots of CPU, I’ll remember to test all my network connections.

Oracle on NFS and TCP Throttling

In an old post about the futility of best practices, I mentioned a strange best practice that is used in our organization: “mounting a single NFS volume to multiple mount points and configuring the DB to use them as though they were separate volumes (i.e. put different data files on different mount points).”

I was 100% sure that there was absolutely no reason for this practice. I thought it was there because someone misunderstood OFA, and no one changed it because they worried about imaginary performance implications. Reader comments on this blog further convinced me of this.

This week I had lunch with a former colleague. He worked as a DBA in the same organization for at least 10 years before I joined, and he is a real expert on storage (a rare gift – a DBA who knows storage). I had to ask him whether this strange best practice was in effect when he was working here, and what he thought of it. In response he burst out laughing. I thought it was because he also found the practice ridiculous, but it turned out (once he stopped laughing and was able to talk again) that he was the one who invented it. He had excellent reasons for doing it. It is not his fault that the practice was kept long after the reasons were no longer relevant.

So, why would you want to mount the same volume on different mount points?

If you use NFS on top of TCP (I think no one does NFS on top of UDP anymore), and you have a heavy-throughput system (like a data warehouse), you risk reaching the point where the ACKs from the Netapp are not arriving fast enough, and Linux will throttle your connection.

The reason for this behavior lies in TCP congestion control. Congestion control was introduced in the eighties to prevent the internet from choking on noisy lines, and it is built around a dynamic value called the TCP congestion window – the amount of data a server will agree to send without receiving any ACKs. If this amount of data has been sent and no ACK has arrived yet, the connection will patiently wait until the ACKs arrive. There are good reasons for this behavior: first, we don’t want the OS to risk running out of memory keeping all those packets. But even more important, it is good network behavior – maybe there is a bottleneck on the way and the packets never actually reach the target; if you continue to send them, the bottleneck will only get worse.

However, Linux defaults are really too conservative for database systems, which is why Oracle’s installation instructions include changes to these defaults:
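For reference, the settings in question live in /etc/sysctl.conf; the values below match Oracle’s Linux install guides of that era (check the guide for your own release before copying):

```
net.core.rmem_default = 262144
net.core.rmem_max = 4194304
net.core.wmem_default = 262144
net.core.wmem_max = 1048576
```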


These parameters control the send and receive buffer sizes. In this post, I’m talking about the send buffer (wmem), which is used to govern the TCP congestion window. The receive buffer (rmem) relates to how much data the machine will accept when acting as a client, and is out of scope here. Oracle’s recommendations for the buffer sizes are a nice start, but you can change these values to match the throughput your system needs and your hardware supports.

So, nowadays, improving throughput by playing with window sizes is all fun and games. But according to the old and wise DBA, back in the days of Linux 2.2 you did not want to change them. You had to work around the issue in a different way.

By mounting your single NFS volume on different mount points, you could trick Linux into creating a separate TCP buffer for each connection (one per mount point), giving you as many multiples of the original window as you want. As long as the physical line keeps up, you are fine.
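The arithmetic behind the trick is the bandwidth-delay product: a single TCP connection can never move data faster than its window divided by the round-trip time, and each extra mount point adds another window. A sketch with made-up numbers:

```python
def max_throughput_mbps(window_bytes, rtt_ms):
    """One TCP connection is capped at window size / round-trip time."""
    return window_bytes * 8 / 1000 / rtt_ms

# A 64 KB send window on a 10 ms round trip (illustrative values):
one_mount = max_throughput_mbps(64 * 1024, 10)
print(round(one_mount, 1))      # 52.4 Mb/s, no matter how fat the pipe is
print(round(4 * one_mount, 1))  # four mount points ~ four windows: 209.7 Mb/s
```

This is also why simply enlarging the window (as the sysctl settings above do) solves the same problem once the kernel lets you.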

Great solution. Cheap, simple, scalable, maintainable, stable, everything you want.

Except that this same practice is still used 10 years later, on low-throughput OLTP systems that are not even close to saturating Oracle’s default window sizes, and when there is an easier way to adjust the window anyway. Because no one knew why this was done.

Aside from the cool network details (I love cool network details; if you enjoy them as well, you probably want to read Tanel Poder’s blog), I think this illustrates the story behind many best practices – once upon a time, it was a great solution to a problem. Then the problem was forgotten, the technology changed, but the solution stayed.

Sniffing the network for fun and profit

Every DBA should know something about the network (and the OS, and storage, and application development – good DBAs are multi-disciplinary). Probably everyone knows how to debug connectivity issues with ping and traceroute, how to diagnose problems with the DNS, and how clients connect to the listener.

DBAs with RAC need to know much more about their network – they have virtual IPs and an interconnect, and usually a storage network too. RAC systems can crash due to network latency problems, and the DBA may need to diagnose this. RAC also has the entire load balancing and failover thing, where network addresses can move around and you need the LOCAL_LISTENER and REMOTE_LISTENER parameters configured.

However, it is very rare that things get as complicated as they did this week:
On Thursday we configured a server with LOCAL_LISTENER and REMOTE_LISTENER values.
Almost immediately a customer started experiencing occasional connection failures. Naturally, we (the DBAs) didn’t hear about it until it was escalated to an emergency call on Saturday afternoon. I had a date on Saturday night and didn’t want to be late due to a long debugging session, so I rolled back the change, asked the customer to send me their tnsnames.ora, hosts file and screenshots of the failure, and told them I’d look into the issue on Monday.

Monday morning arrived sooner than I would have preferred. The tnsnames.ora that was supposed to connect to our database actually contained an address I did not recognize. A quick meeting with the local network manager revealed that these guys have a VPN, they connect through a NAT, and they also have a special firewall configuration. Remember I said that every DBA should know networks? Well, I didn’t mean NATs and VPNs. So I don’t know enough about networks, and the network manager doesn’t understand listeners and RAC, but we had to find a solution together.

It appeared that after I configured the LOCAL_LISTENER values, when the listener attempted to redirect their connection to another server, it sent them an address (or maybe an IP?) that their client couldn’t connect to, and the connection therefore failed. But why did everything work before we configured the LOCAL_LISTENER? According to the documentation, we would still have been sending addresses the client can’t connect to – just the local addresses instead of the VIP. The network administrator had a theory that maybe the NAT translated the local address, as it was sent back to the client, into something the client understands, but this is really far-fetched.

This is where the sniffer comes into the picture. When you have a complex setup, you have to know exactly what the behavior is – who initiates the connection, what the reply is, where the redirects go, who closes the connection, and how. A sniffer will listen to the network and give you the complete picture.

I use Ethereal, which is graphical and relatively clear and friendly, but still very powerful.

In this case, Ethereal was especially useful – using the customer’s network capture, we could easily see that we had been debugging the wrong issue all along. The tnsnames.ora file he sent us belonged to a different machine, one that didn’t experience any problem. The machine that experienced the issue connected to a different IP, which no one really knew about until this time. We are still not sure how it is related.
What we do know is that if you have a complicated network configuration that has changed several times, in very inconsistent ways, and that no one documented – a sniffer is your only friend.