Oracle on NFS and TCP Throttling

In an old post about the futility of best practices, I mentioned a strange best practice that is used in our organization: “mounting a single NFS volume to multiple mount points and configuring the DB to use them as though they were separate volumes (i.e. put different data files on different mount points).”

I was 100% sure that there was absolutely no reason for this practice. I thought it was there because someone misunderstood OFA, and no one changed it because they worried about imaginary performance implications. Reader comments on this blog further convinced me of this.

This week I had lunch with a former colleague. He worked as a DBA in the same organization for at least 10 years before I joined, and he is a real expert on storage (a rare gift: a DBA who knows storage). I had to ask him whether this strange best practice was in effect when he was working here, and what he thought of it. In response he burst out laughing. I thought it was because he also found this practice ridiculous, but it turned out (once he stopped laughing and was able to talk again) that he was the one who invented this best practice. He had excellent reasons for doing it. It is not his fault that the practice was kept long after the reasons were no longer relevant.

So, why would you want to mount the same volume on different mount points?

If you use NFS on top of TCP (I think no one does NFS on top of UDP anymore), and you have a high-throughput system (like a data warehouse), you risk reaching the point where the ACKs from the Netapp are not arriving fast enough, and Linux will throttle your connection.

The reason for this behavior lies in TCP Congestion Control. Congestion Control was introduced in the eighties to prevent the internet from choking on noisy lines, and it is built around a dynamic value called the TCP Congestion Window. The TCP Congestion Window is the amount of data a server will agree to send without receiving any ACKs. If this amount of data has been sent and no ACK has arrived yet, the connection will patiently wait until the ACKs arrive. There are good reasons for this behavior. First, we don’t want the OS to risk running out of memory by keeping all those packets. Even more important, it is good network behavior: maybe there is a bottleneck on the way and the packets never reach the target, in which case continuing to send them will only make the bottleneck worse.
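To put rough numbers on the window idea, here is a minimal sketch of the usual back-of-the-envelope bound: if at most one window of data can be unacknowledged at a time, a connection cannot move more than one window per round trip. The function name and the 64 KiB / 1 ms figures are my own illustrative assumptions, not measurements from the system described here.

```python
def max_throughput_bytes_per_sec(window_bytes: int, rtt_seconds: float) -> float:
    """Upper bound on one connection's throughput when at most
    window_bytes may be sent without an ACK coming back."""
    if rtt_seconds <= 0:
        raise ValueError("rtt_seconds must be positive")
    # One full window can be delivered per round trip, at best.
    return window_bytes / rtt_seconds


# Hypothetical numbers: a 64 KiB window and a 1 ms round trip.
window = 64 * 1024
rtt = 0.001
rate = max_throughput_bytes_per_sec(window, rtt)
print(f"{rate * 8 / 1e6:.1f} Mbit/s")  # prints 524.3 Mbit/s
```

With these assumed values a single connection tops out at roughly half of a gigabit link, no matter how fast the storage or the wire is.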

However, Linux defaults are really too conservative for database systems, which is why Oracle’s installation instructions include changes to these defaults:

net.core.rmem_default=262144
net.core.wmem_default=262144
net.core.rmem_max=1048576
net.core.wmem_max=1048576

These parameters control the send and receive buffer sizes. In this post, I’m talking about the send buffer (wmem), which holds the data that has been sent but not yet acknowledged and so puts a ceiling on how far the TCP Congestion Window can usefully grow. The receive buffer (rmem) is related to how much data the machine will accept when acting as a client and is out of scope here. Oracle’s recommendations for the buffer sizes are a nice start, but you can change these values to match the throughput your system needs and your hardware supports.
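If you want to see how these limits apply to an ordinary socket, the following is a small sketch that asks for a send buffer and reads back what the kernel actually granted. The helper name `effective_send_buffer` is hypothetical. Note that this only looks at a socket created by the script itself; the kernel NFS client manages its own sockets, so treat it as a way to observe the limits, not as a way to tune NFS.

```python
import socket


def effective_send_buffer(requested_bytes: int) -> int:
    """Request a send buffer size on a fresh TCP socket and return
    the size the OS reports afterwards."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, requested_bytes)
        return s.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF)


# Ask for a few sizes around the values from Oracle's instructions.
for requested in (262144, 1048576, 4194304):
    granted = effective_send_buffer(requested)
    print(f"requested {requested:>8} bytes, kernel reports {granted} bytes")
```

On Linux the reported value is typically double the request (the kernel reserves room for bookkeeping), and requests above net.core.wmem_max are capped, so the largest request above may come back smaller than you asked for.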

So, nowadays improving throughput by playing with window sizes is all fun and games. But according to the old and wise DBA, back in the days of Linux 2.2, you did not want to change them. You had to work around the issue in a different way.

By mounting your single NFS volume on different mount points you could trick Linux into creating a separate TCP buffer for each separate connection (mount point), and now you have as many times the size of the original window as you want. As long as the physical line is keeping up, you are fine.
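The arithmetic behind the trick can be sketched as follows, assuming (as described above) that every mount point gets its own TCP connection with its own window. The function name and the link speed, window size and round-trip time are hypothetical illustration values.

```python
import math


def connections_needed(link_bits_per_sec: float,
                       window_bytes: int,
                       rtt_seconds: float) -> int:
    """Number of independent connections (mount points) required before
    the combined windows can keep a link of the given speed busy."""
    # Each connection can move at most one window per round trip.
    per_connection_bits = window_bytes * 8 / rtt_seconds
    return max(1, math.ceil(link_bits_per_sec / per_connection_bits))


# Hypothetical setup: 64 KiB window per connection, 1 ms round trip.
print(connections_needed(1e9, 64 * 1024, 0.001))   # 1 Gbit/s link  -> 2
print(connections_needed(10e9, 64 * 1024, 0.001))  # 10 Gbit/s link -> 20
```

The same calculation also shows why the trick is pointless on a low-throughput system: if one window per round trip already exceeds what the database asks for, extra mount points add nothing.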

Great solution. Cheap, simple, scalable, maintainable, stable, everything you want.

Except that this same practice is still used 10 years later, on low-throughput OLTP systems that are not even close to saturating Oracle’s default window sizes, and when there is an easier way to adjust the window anyway. All because no one knew why this was done.

Aside from the cool network details (I love cool network details; if you enjoy them as well, you probably want to read Tanel Poder’s blog), I think this illustrates the story behind many best practices: once upon a time, it was a great solution to a problem. Then the problem was forgotten, the technology changed, but the solution stayed.

7 Comments on “Oracle on NFS and TCP Throttling”

  1. [...] Why does one always have to write down a justification along with production standards? [...]

  2. Dan says:

    We are testing 10Gbit NICs on Sun T2000s and we have discovered that a single-threaded IO can only push 700Mbit/s … Multiple streams can push all the way to 9.7Gbit/s

    Using multiple NFS mount points to the same volume may be a way to “fool” the OS into using multiple NFS streams.

    Gonna have to test this theory!

  3. sean says:

    Scale-out NAS that supports Oracle (over NFS) allows one to access same data via multiple NFS servers (each of which can also have multiple NICs, of course).

  4. [...] herself (just a simple DBA on a complex production system) posts about the right way to handle Oracle on NFS and TCP throttling, reaching into relatively dark corners of Linux, like net.core.rmem and [...]

  5. Derek Mallard says:

    Say what you will about ‘best practices’: managers, consulting firms and CEOs still like to use the phrase, especially when the real problem is that we don’t know what the heck to do.
    NFS mounts may overcome the overhead of the protocol with the faster network technology available now, but I am still leery about stability and about the reduced understanding of impacts as more and more organizations follow a ‘division of responsibilities’ policy in light of SOX, CSOX or whatever the flavour is in your country. More than once a storage ‘expert’ has told me they need a sub-second outage for a change on the NetApp that we would not even see. The first time that was tried, the database crashed. The second time, at a different company, I put my foot down and insisted on a scheduled outage. It was for a ‘network id’ change, in the words of the storage person. The sub-second outage turned into an hour because it turned out that there was another system on the network with the same ID. Nobody thought to check first. That kind of error would only occur using NAS and NFS. Yes, there are other errors that can occur with direct attached storage or SAN, but my access to storage cannot be impacted by such a simple mistake as somebody connecting another system with the same address. Multiple active NICs to the same NetApp would be nice for performance and HA. The Sys Admin has told me that IBM has documented that better throughput can be had with multiple mount points to the same backend storage. We are going to be testing that as we build our new production server.

  6. prodlife says:

    Derek,

    I agree with everything you said. I also stopped believing our storage manager when he talks about changes to our Netapp that will have no impact on production.

    The best DBAs I worked with are of the old-fashioned type that are also great sysadmins, storage managers, network managers and developers. Unfortunately, there are very few of those.

    I’m striving to be one myself, but it’s an uphill battle since (as you wrote) IT management believes in a completely different strategy.

  7. Rich says:

    There is no such thing as a DBA who is a “...great sysadmins, storage managers, network managers and developers”. It is foolish to think that.

    A very good DBA, or any individual for that matter, knows the limits of their knowledge, and where to turn when those limits have been reached.

    A person can be a great DBA, but never more than a good sysadmin/storage admin/network admin/developer/etc. The reason is that the point of view of a DBA is heavily influenced by their need: databases and how to manipulate them. Everything else is on a need basis.

    To Derek: the Netapp is an awesome tool that has the ability to be fully rebooted in under 30 seconds, and NFS has the ability to survive that. However, a database over NFS is another beast. Your admin, if he was worth anything, would never have even suggested rebooting without advance notice or preplanning. The stupidity of an individual will always be the weak link in the chain.

    And ‘best practices’ are _guidelines_, people. They are not hard-n-fast rules, but they are a starting point. There are a lot of bad admins that act like bulls in a china shop, and ‘best practice’ guidelines are out there to protect everyone from those admins.

