One More Thing Everyone Should Know About Queues

I already posted two things everyone should know about queues, but the incidents of the last month made me realize I missed another very important queue fact.

Don’t double the load on a server and expect response times to be just “a little slower”. If the server is above 30-40% utilization, doubling the load is likely to be catastrophic.
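To see why, here is a rough sketch using the classic M/M/1 queueing model, where mean response time is R = S / (1 - ρ) for service time S and utilization ρ. The 10 ms service time is an assumption for illustration, not a number from any real system:

```python
# M/M/1 mean response time: R = S / (1 - rho).
# S (service time) and the utilization points are illustrative assumptions.

def response_time(service_time_ms: float, utilization: float) -> float:
    """M/M/1 mean response time; utilization must be strictly below 1.0."""
    return service_time_ms / (1.0 - utilization)

S = 10.0  # assumed service time in milliseconds

for rho in (0.20, 0.40, 0.48, 0.80, 0.96):
    print(f"utilization {rho:.0%}: response time {response_time(S, rho):.1f} ms")
```

Doubling the load from 48% to 96% utilization takes response time from about 19 ms to 250 ms in this model, roughly 13x slower, which is why users report “nothing works” rather than “things are a bit slow”.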

Suppose you have two servers, for example a pair of Netapp heads, each operating at 48% capacity, and suddenly one of them fails and you have to move all the load to the survivor. Now you have one server at close to 100% capacity. Users will not complain that things are a bit slow; they will complain that nothing is working.

Someone once told me that if I have two RAC nodes running at 100% CPU each, I do not have high availability. The truth is that you stop having high availability long before the 200% CPU point.

Oh, and in case anyone wonders what we did about the overloaded Netapp: we stopped every non-essential process reading from it. This included snapmirrors, non-production databases, exports, VMware hosts, and probably more stuff than I know about. This brought utilization down to around 60%, and life was good (except that we weren’t too comfortable about the lack of snapmirrors and exports).


5 Comments on “One More Thing Everyone Should Know About Queues”

  1. Hi Chen/Gwen,

    My experience from a little while back is that Netapp heads start to perform worse after approximately 40%. But CPU is not the only thing to watch: there is also a statistic called ‘cache age’, which shows you how old things get in the cache, and that is what keeps the head performing.

  2. John Brady says:

    Ah yes! The magic ‘knee’ in the utilisation curve. Or rather, it is always a curve and never a straight line, so doubling utilisation never puts you at a point that is simply twice the original response time. As you say, it is always more than twice the response time. How much more depends on where you were to begin with – at 20% it will be a little more than 2x, but at 40% it will be a lot more than 2x, and beyond 40% things might appear to have stopped altogether, as you say.

    When discussing failover between shared systems with clients I always ask them to identify their separate workloads and prioritise them. This way we can take out and switch off low priority workloads during a failover, until we are sure the high priority ones are performing adequately.

    If the client says that they either have only one workload or that everything is high priority, then I say that each node of the cluster should be at 50% maximum utilisation to ensure adequate response times during failover. There is no other way of achieving this – you must have redundant, unused, spare capacity to be able to handle a failure and move its workload somewhere else and achieve the same response times and throughputs.

    This assumes a 2 way cluster, and potentially 3 or 4 way clusters can reduce the ‘spare capacity’ you must keep unused per node. But this also assumes that you can evenly distribute your workload amongst 3 or 4 nodes in a cluster. Some workloads cannot be split this way, and must be run only on a single node / system. In which case you need to ensure that the other nodes have spare capacity for this workload should its node fail.
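    John’s spare-capacity rule can be sketched numerically. Assuming a failed node’s load redistributes evenly across the survivors (his own caveat – some workloads cannot be split this way), each surviving node’s utilisation is multiplied by N/(N-1):

    ```python
    # Sketch of the spare-capacity rule for an N-node cluster, assuming the
    # failed node's workload spreads evenly over the N-1 surviving nodes.

    def post_failover_utilization(per_node_util: float, n_nodes: int) -> float:
        """Utilization on each surviving node after one of n_nodes fails."""
        return per_node_util * n_nodes / (n_nodes - 1)

    def max_safe_utilization(n_nodes: int, ceiling: float = 1.0) -> float:
        """Highest per-node utilization that keeps survivors at or below ceiling."""
        return ceiling * (n_nodes - 1) / n_nodes

    for n in (2, 3, 4):
        print(f"{n}-way cluster: run each node at no more than "
              f"{max_safe_utilization(n):.0%}")
    ```

    This gives 50% for a 2-way cluster, 67% for 3-way, and 75% for 4-way – and if you want survivors to stay under the ~40% knee after a failure, pass that as the ceiling and the safe limits drop accordingly.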

    John

  3. Welker says:

    Hi, I’m an Oracle DBA too, and planning to buy Netapp storage, so can you write some good/bad stuff about this hardware?
    How do you manage the snapmirror, and how does this work with a critical DB? Do you put the datafiles on this partition and do your snapshot?
    If you can comment….
    Thanks.

  4. NeverNetapp says:

    Oracle DBA. Three words: DON’T DO IT. Netapp filers are “write optimized” via the WAFL file system. This is a “write anywhere” approach that assures massive data fragmentation and IO performance that gets slower and slower as time goes by. It’s great for writes, but most database systems do far more reading than writing.

    • prodlife says:

      For what it’s worth, our systems have been running on Netapp storage for over 10 years. We had no performance problems (in fact, Netapp solved many problems we encountered with EMC systems), and we love its snapshot and snapmirror features, which make backups a snap (sorry for the pun…). In terms of ease of configuration, nothing comes close to Netapp. Even now that we are moving to HP storage for disks, we still use Netapp’s V-Series as a management front-end.

      To answer Jonny’s question specifically: Yes, we put the datafiles on the partition, put the tablespaces in “begin backup” mode, take a snapshot, and “end backup”. Netapp also has “SnapManager for Oracle”, which gives a friendly interface for managing backups and restores, but I still use good old shell scripts🙂

