Chicken and Egg TroubleshootingPosted: December 15, 2008
It all started about 10 days ago, with a phone call at 8am. 8am phone calls mean that something bad happened at 4am, but the system kept functioning, so it wasn’t worth waking anyone up about it.
One of our stand alone databases was swapping like there is no tomorrow (or more precisely, no free memory). Load average was up at around 20. On a 2 cpu machine. We could not find out any process that we could blame for the current situation, so we did the usual – restart the instance. This solved the immediate problem.
During the post mortem, one of the night DBAs reported that at some point during the night, this machine lost connectivity to the Netapp for about 10 minutes. He did not have an exact time when it happened, we couldn’t find any trace for this in the database logs, and the system people denied it completely – it did not happen, it could not happen. Except that we already seen that loss of connectivity to the Netapp will result in large number of processes in “uninterruptible sleep” mode, which will bring load average up… So this was still our best explanation. And it was not a very good one.
During the next few days, at least twice a day, load average and swap went way way up on that machine. Most of the CPU time was system time, and we also couldn’t find a single process that was taking too much memory. Sometimes the situation went away quickly, sometimes I had to kill a bunch of old connections to stabilize it. Naturally, I was unhappy about it. When I’m unhappy, I start looking at numbers. I compared the broken stand-alone to two other stand alone servers that were doing similar work. Good servers had an average of 200 concurrent connections. Each serving about 20 customers with 10 connections each. The broken server had 500 connections. It had 30 customers each with 15 connections. It also has only 12G of memory, compared to 16G on other servers.
Aha! I said to my boss. The machine is overloaded. We can try to move few customers to a less loaded server, but we should also consider moving to a better machine, or add another server.
My boss was skeptical – if this is just connectivity load, why did it spring on us all of the sudden? Shouldn’t we have seen a slow degradation? Well, each connection consumes some memory. As long as the machine had enough memory, there were no issue. Once it started swapping, things went wrong very fast.
My boss was not convinced, and I was not convinced either. There was the possibility that the high load was the cause of too many connections, not the result of it. When each task takes much longer to complete, you will have more users trying to complete their tasks at the same time. But my plan was the only thing that was remotely close to a solution, so we decided to try anyway.
Lucky for us, that night the machine lost connectivity to the Netapp again. This time the issue did not resolve itself, and the sysadmins had to replace the network cards on the server to solve the problem.
Needless to say, all my problems – load, swap, too many connections, went away that night and never came back. I’ve been staring at a bunch of eggs for days, trying to decide which one of them was the chicken.
How did the broken network card caused all these problems? Two ways, I think.
The first one is quite simple – every IO operation the DB had to do took much longer due to broken network connectivity. This means that everything takes longer, which means more connections. All of them waiting for IO most of the time.
Which leads straight into the second problem – since the network was now unreliable, and since TCP tries to cover up for the problems and keep communication reliable, each socket now had to keep huge buffer of information that has to be retransmitted, again and again. With more connections, each taking much more memory, we went into very high swap. Now our OS is so busy with managing the broken connections and swapping, that it has very little time for the work we actually want it to do, and load average is way up too. We have more and more connections, taking more and more memory and getting less and less IO and CPU to do their work.
Next time I see high swap and system taking lots of CPU, I’ll remember to test all my network connections.