How high can you go?

What’s the highest load average you’ve seen on a machine?

Our production databases run on boxes with two dual-core CPUs, so we usually have a load average of around 3. When someone is doing heavy processing on a machine, the load average goes up to 10 by the time one of the DBAs intervenes and stops the madness. When we had a huge problem tuning the memory for client processes and the machine was swapping like there was no tomorrow, we had a load average of 25, and that was the highest I’d ever seen a machine go. Needless to say, at that moment response times were less than inspiring.
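
For a sense of scale – this is a sketch I’m adding for reference, not something we actually had in a monitor – the quickest way to judge whether a load average is high for a given box is to compare it against the number of cores, since on Linux the load average counts both runnable processes and processes stuck in uninterruptible sleep:

# Compare the 1-minute load average to the core count; on our dual dual-core
# boxes (4 cores) a sustained load of 3 is routine, 10 is trouble, 25 is a fire.
cores=$(grep -c ^processor /proc/cpuinfo)
load1=$(cut -d' ' -f1 /proc/loadavg)
echo "1-minute load: $load1 on $cores cores"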

You can imagine my surprise when I got a phone call one Sunday morning from the application owner, who said: “Looks like one of our DBs has a load average of 500. It could be a problem with the monitor, but can you take a look?”

Sure, I can take a look; I’ve nothing better to do on Sunday mornings.

I took a look, and saw the following:
[oracle@productiondb]# uptime
11:15:58 up 162 days, 15:58, 2 users, load average: 546.06, 538.21, 522.55

Not a monitor bug, then. But the machine seemed surprisingly responsive for such an extravagant load average. I connected to the DB with no delays and ran a few queries, which returned within a reasonable time. Everything seemed fine except for the high load average numbers.

And the numbers were going up. 30 minutes later we were well over 600.

I tried running “top”, which promptly hung; “ps” hung too. “df” worked, so at least I knew it was not a disk space problem or an NFS issue.

I peeked into the “/proc” directory and saw that many of the stuck processes were actually “ps” and “bash”. So the hanging ps was probably causing the high load average, not the other way around. We had 500 ps processes hanging there, which inflated the load average while letting the database function as usual, and more and more ps processes kept accumulating because we use ps in most of our automated monitors. vmstat showed 50% iowait, but that’s obviously related to the stuck processes, because the DB showed no more than the usual IO waits. So ps hangs while waiting for IO? Interesting.
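
Here’s a sketch of the kind of /proc scan I mean – my reconstruction for this post, not the exact commands from that morning, and it assumes a coreutils new enough to have the timeout command so the scan itself can’t get stuck on a hung PID:

# List processes stuck in uninterruptible sleep (state D) without relying on ps.
# Each read is wrapped in timeout so a hung /proc entry can't hang the scan too.
for pid in /proc/[0-9]*; do
  state=$(timeout 2 awk '/^State:/ {print $2}' "$pid/status" 2>/dev/null)
  [ "$state" = "D" ] && echo "uninterruptible: $pid"
done

Processes in D state count toward the load average just like runnable ones, which is how a few hundred stuck ps processes can push the load past 500 while the database itself hums along.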

Why does ps hang? strace reveals the problem:
stat("/proc/538", {st_mode=S_IFDIR|0555, st_size=0, ...}) = 0
open("/proc/538/stat", O_RDONLY)= 6
read(6,<unfinished ...>
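
For the record, a trace like the one above can be captured along these lines; the exact invocation isn’t in my notes, so treat it as a sketch:

# Run a fresh ps under strace, logging only file-related syscalls to a file.
# This ps will hang just like the others, so it is backgrounded and the trace
# is read afterwards; the last lines show which /proc read it is blocked on.
strace -f -e trace=open,read,stat -o /tmp/ps.trace ps -ef &
sleep 5
tail -3 /tmp/ps.trace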

So ps gets stuck on process 538. I wonder what this process is doing. Of course, I can’t use ps, so let’s go straight to the proc directory. Except that cd /proc/538 got stuck too. No way to know what it is, then. Can I kill it? kill -9 538 doesn’t hang, but it doesn’t do anything either. Only a reboot will clear it.
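
kill -9 has no effect here because a process in uninterruptible sleep only acts on pending signals once the kernel operation it is waiting on completes; if that IO never completes, the signal is never delivered. Two things are still worth grabbing for the support case, if the reads cooperate at all (they may hang like everything else, hence the timeout guards – this is a sketch with 538 plugged in, not a transcript):

# wchan names the kernel function the process is sleeping in, and status gives
# its name, state and parent without going through ps.
timeout 2 cat /proc/538/wchan; echo
timeout 2 grep -E '^(Name|State|PPid)' /proc/538/status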

Now comes the interesting part: should we reboot a perfectly working database just because the OS is doing something interesting? Time for a management decision. I escalated to my manager, who escalated to his, who called our system administrator and the application owner.

Meanwhile the system administrator revealed another interesting piece of information: a few hours earlier the DB had been moved to a new Netapp cluster. Is this the cause of the strange behavior? How can a move to new storage corrupt process memory blocks so thoroughly? Well, Redhat will need to answer that question.

Three hours later I got the decision – reboot the server. We took a netdump of the machine and restarted it. Five minutes later the DB was up and running again with a normal load average, and ps worked.

We sent the netdump to Redhat. It is now almost a month since the incident, and there is still no word on what happened to process 538 or whether we can blame the new Netapp.


10 Comments on “How high can you go?”

  1. Rishi says:

    Yes – I also once saw the load average on a Linux box running Oracle reach 500.

    Then we rebooted the server to rectify the issue.

  2. prodlife says:

    yep, looks like reboot is the only solution.

  3. Seth says:

    I’ve seen this many times myself. The record for me has been just over 800. I have a screenshot somewhere too.

    I’ve never had the trouble with ps though. It’s usually a problem with too many processes in the sleep state driving up the load average, but since the processes in the run state aren’t killing the load, the machine remains responsive. I can usually stop the processes in the sleep state and restart the daemon controlling them, but your case seems to have been different.

  4. Bob Diebel says:

    Any word from Red Hat to anyone?
    We also are having this issue.

  5. prodlife says:

    Hi Bob,
    No update from Redhat. We should contact them if it happens again. It hasn’t happened again yet, so we have no clue what it was.

    When you say “having the issue”, do you mean it happened once and was solved by a reboot? Or does it keep happening again and again? Because if it keeps happening, I’d ask my system team to change the memory on the machine, or even move the DB to a completely new machine.

  6. Danny Tauber says:

    Hi Guys,

    We also have the same issue. It occurs daily. We are running Redhat with kernel 2.6.9-42.0.10, with about 200+ concurrent users.

    Has anyone got any further information? ps and top hang for me too. Anything that forks a process seems to hang when this occurs, and the load increases over about 20 minutes to 600. A reboot is the only solution.

    Memory and processor seem to be fine. What kind of hardware are you guys using? We have an HP ML570 G4 with a P400 RAID controller.

    Any help or suggestions would be appreciated.

  7. Ratheesh says:

    Yes, I too hit a similar problem, but with httpd (Apache). The load average is consistently around 23, but the system is very much responsive. When I peek into the /proc dir I see a lot of httpd threads. When I look at the Apache server status I see the same threads in the W state, and these threads have a very high value under the SS column. Any idea how this could be solved?

  8. oblek says:

    Hi, I had the same problem.
    My system was an Xubuntu LTSP box with 10 concurrent users accessing it simultaneously. After a couple of rounds of googling I found your website. I’m glad I’m not alone, but it seems there is no clear clue how to get rid of this problem without a reboot.

    A quick look at the process in /proc shows the hung process is in the “D (disk sleep)” state. From the kernel documentation (proc.txt), this process is not interruptible and might be killed if its parent is killed, but on my system the PPID is 1, and that means I do have to reboot 🙂

  9. oblek says:

    An update to my previous comment:
    the problem is with the process that is blocking the ‘ps’ process. In my case it was the Thunar file manager on XFCE, where a user had a USB flash disk attached on the LTSP network. Killing the ltspfsmount solved the problem, and now ‘ps’ and ‘top’ run just fine.

  10. John baptist says:

    This happened to me too. On Ubuntu. Turns out the problem was operapluginwrapper, a program used by the Opera web browser to display plugins. Who knew.

    I figured out the process id with a technique similar to yours. Then, even though you can’t ls the appropriate /proc dir, you can examine things in it. You can do, e.g., cat /proc/10032/status, which will tell you the name of the process. The answer also turned up in the kernel log. Examining /proc/10032/cmdline or running ls /proc/10032, though, will definitely hang.

    Still don’t know WHY it happened, though, and of course I can’t reproduce it. It’s probably in the kernel, or perhaps a result of the process going nuts and overwriting some location that the kernel expects to be valid. Please let me know if you ever get any new info on this.

    Clearly this is a kernel issue

