The politics of running Oracle on NFS

Kevin Closson wrote an article making the economic case for running Oracle on NFS. It's a great article; go read it. It shows how much cheaper, more scalable, and easier to configure and administer NFS can be compared to SAN solutions, even when possible performance degradations are taken into account.

One thing Kevin failed to mention is that choosing NFS requires a strong backbone, and not every DBA would wish to test his or her mettle by attempting to introduce NFS into a production environment.

You know how the database is always blamed for every application issue? How the DBAs always have to run around and prove that the DB is innocent while everyone else is pointing their fingers at them? I'm sure everyone reading this blog is familiar with the "the database is guilty until proven innocent beyond all shadow of doubt" mentality.

Well, it's the same with NFS, only a bit worse. We've been running several large production RAC systems on NFS for the last two years, and for those two years Netapp and the NFS protocol have been blamed for OS crashes, cluster errors, Oracle bugs, application errors, database misconfiguration, badly written backup scripts, errors that occurred within third-party network devices, and the list goes on and on.

Our RAC systems had tons of issues, and Netapp and the NFS protocol have been blamed for about 90% of them. I don't believe we found even one case where it was clearly a storage issue. There used to be a saying: "No one was ever fired for deciding to use IBM." I can't say the same for deciding to use Oracle on NFS.

The already sensitive situation was made even worse when we were unable to find any consultant, from Oracle, Netapp or a third party, who was willing to review our storage configuration top to bottom and reassure us (and mostly our twitchy management) that we were doing things right, that everything was configured as it should be, and that if our Linux servers and Oracle cluster crash twice a day, the problem must be somewhere else.

Maybe Kevin should start NFS auditing services.

(I had actually planned to write about auditing today, but I got so worked up about storage that I totally forgot what is so interesting about audits anyway. Don't worry, I'm testing fine-grained auditing for a customer these days, so some interesting bits about it are sure to come up soon.)


5 Comments on “The politics of running Oracle on NFS”

  1. kevinclosson says:

    Chen,

    Please stay tuned regarding Oracle’s commitment to NAS. Oracle 11g goes public on July 11 and I’ve got a lot to talk about. Please, stay tuned!

    BTW, it is a shame to hear of such scapegoating, but honestly, that happens in SANs too.

    I think the landscape changes dramatically regarding Oracle on NFS in the 11g time frame, and that momentum should prompt a closer look at the state of affairs with 10g on NFS.

    Please tell us, however, what Oracle versions had problems on SAN and what was the OS? You’ve probably read how irresponsible I think it was for people to promote 9i on NAS/NFS with such OSes as Red Hat AS 2.1. Bad, very bad. What was your configuration?

  2. prodlife says:

    Kevin,

    While I'm sure 11g will be great, I think Oracle has conditioned us not to install any version in production before at least two patch sets have been released. It will be well over a year before we can enjoy the benefits of the new version.

    We are working with Oracle 10.1.0.3, 10.2.0.2 and 10.2.0.3, all of them on RHEL 3.
    The funny thing is that some of these systems are very stable, while others show the strangest issues, without any difference between them that we have managed to detect.
    Our staging systems are uniformly stable no matter how hard we try to use them to reproduce our real issues.
    Out of three 10.2.0.3 RACs, one suddenly started experiencing horrible response times, which only stopped when we disabled automatic shared memory management (the shared pool had bloated to take 75% of the SGA). The performance issue began the day after we moved the machine to a new Netapp cluster. Maybe it was a coincidence; it certainly didn't look related, but it is very hard to say.

    It is long past time to upgrade to RHEL 4, but since we know that our staging environment can't reproduce our real issues, and so probably can't help us catch new ones either, we can't really go ahead with this project.

    In any case, we have quite a mess, but our storage system seems to have very little to do with it.

  3. kevinclosson says:

    RHEL3 32 or 64bit?

  4. prodlife says:

    64-bit Linux on 64-bit AMD hardware.

  5. […] also had a thread going with Chen Shapira who has blogged about Oracle troubles on NAS. His point throughout that blog entry, and the comments to follow, was that they’ve suffered […]

