When-Others-Then and other Troubleshooting Monsters

I’m definitely having one of these weeks.

We have this insanely complicated, highly visible, ultra high priority project going live this this Wednsday. The DBA who worked on this project nearly full time for the last 6 month left on 3 weeks vacation on Monday. I’m the replacement.

Tuesday morning, I’m on a conference call. The night job failed due to an issue with “dblink”. Must be fixes ASAP or the go-live will be delayed. I was all “Wait. I’m just a simple DBA. What dblink are you talking about?”.

So the morning was spent with me trying to slowly work my way through huge amounts of night-job code. While everyone around me was running in circles screaming. Mostly screaming “network” and “dblink”.

You know what I found out? That we have no clue why the night job failed. Because at the end of every single procedure in the job was code that said:

EXCEPTION
 WHEN OTHERS
 THEN
 INSERT INTO NIGHT_JOB
 (status, last_run_stmp)
 VALUES ('failed', SYSDATE);
 COMMIT;

The job failed. What more information could you possibly want?

And the funny thing that happens when you have such useful error messages, is that everyone starts developing theories about why the failure occurred. Maybe someone tested the dblink few hours after the failure occurred, noticed that the remote site is down, and decides that this is the issue. Someone else tries rerunning some of the code and gets “unique constraint” error, so he guesses that this is the issue. But of course, its all guesses after the fact. No one can know what caused the night job to fail on its original run.

Trying to troubleshoot an issue based on guesses about what was the error is a complete waste of time – what are the chances that you’ll find the real issue and that the next time the night-job runs everything will work?

Proper error handling is a complex topic – which exceptions can be handled locally and which should be raised? At which levels do we trap exceptions? These questions are a topic of much discussions among developers and architects. But the basics of error handling should not be ignored – When an error occurs,  we must know what was the error.


19 Comments on “When-Others-Then and other Troubleshooting Monsters”

  1. chet says:

    That’s not fun at all.

    Should I ask why that INSERT wasn’t wrapped in a procedure with a PRAGMA AUTONOMOUS declaration?

  2. Boneist says:

    You’d think it would be obvious, wouldn’t you?!

  3. John Russell says:

    The challenge is, what the most efficient/reusable way to collect the information? Separate column with the SQLERRM value… another column with the name of the procedure that failed… Can a running procedure determine its own name rather than hardcoding a different literal in every error-handling location?

  4. Gary says:

    The scary thing is the commit. “We don’t know what happened or what state the transaction is in, so it must be safe to commit”.

    The easiest solution here is a trigger. BEFORE INSERT ON NIGHT_JOB FOR EACH ROW BEGIN
    IF NEW.STATUS = ‘failed’ THEN
    RAISE_APPLICATION_ERROR(SQLCODE,SQLERRM);
    END IF;
    END;

    (possibly with DBMS_UTILITY.FORMAT_ERROR_STACK and/or DBMS_UTILITY.FORMAT_ERROR_BACKTRACE)
    At least there’s a chance of catching something useful without recoding every procedure.

  5. prodlife says:

    Gary,
    Thanks for the excellent idea!
    On the other hand, the application which is not used to receive exceptions may not like the new trigger. Maybe instead insert the error into a separate table? But this will only add to the mess…

  6. amol says:

    the other question: is the job expected to run in the “all or nothing” mode or ignore the error on that specific row and move-on to the next one.

  7. Timur Akhmadeev says:

    Chen,

    it’s strange for me to see that after such a great post you are possibly trying to find the answer how to fix when-others-then-null.
    IMO the answer is the only one: get rid of that cr*p immediately. I.e. delete that useless, ugly, stupid piece of … you know what I mean. What would be changed after such exception handler removal? Absolutely nothing (nothing works, people screaming, etc.) except you know what happened and real cause(s).
    >On the other hand, the application which is not used to receive exceptions may not like the new trigger
    Why trigger? Just delete this handler.

  8. prodlife says:

    @amol – Good question. Now its “finish no matter what”, but I’m not sure this is what the users really want.

  9. prodlife says:

    @Timur – I can’t get rid of the crap 😦
    The functionality is in use, and it looks like few month of work to rewrite. Removing the handler means that all the errors from now on are suddenly my fault – because I’m the last one to touch it.

    The reasonable thing to do it stay away from this thing and explain that developers and not DBAs should debug the code.

    I hope that after a week or so, the developers will find a way to improve error reporting for their own good.

    • Sybrand Bakker says:

      Chen,

      I wouldn’t be surprised when nothing happens at all.
      I have been constantly dealing with this kind of code, and usually it is discovered when the 3rd-party developers are gone, and the bills have been paid, so fixing the code would result in new bills.
      Also in my own experience all developers at consultancies are instructed always to deny responsibility, so as to avoid lawsuits.

    • Nikunj says:

      Agree with your statement “The reasonable thing to do it stay away from this thing and explain that developers and not DBAs should debug the code.”

      DBA should be there for support/suggestions and help to developers. It is sometimes hard for a DBA to change the entire process and code written by developers.

      Based on what you have seen there can be some suggestions for developers (again they will to some extent go with the requirements) for improvement and proper error handling, but DBA should probably stay away from major coding – code changes.

      • prodlife says:

        @Nikunj
        Yep. I sent my suggestions to the developers, and the account manager (just in case). Now I just have to pray that development will find it worthwhile to fix the system.

    • joel garry says:

      I was going to suggest this Feuerstein blog post, but then I saw your reply to Sybrand.

  10. John Hurley says:

    How is it that code in an apparently important system is allowed to go close to production status with error handling coded like this? If DBA’s are not allowed to stop something like this and if there is no quality assurance system in place to stop stuff like this from even getting close to prod status … it is time to update the resume.

    Tom Kyte in particular has been on this topic for like what … a decade? Steve Feuerstein is another one that has warned repeatedly about things like this.

    • prodlife says:

      @John Hurley
      Oh, the DBAs and QA can tell management that there are big issues, but if there is enough money on the table, the system will still go production as is. We can refuse to sign an SLA though 🙂
      As you probably know, quality of code in production is not the only measurement of whether a workplace is a good one. I actually like working here. But as you notice, we still have some processes to improve.

  11. […] facilmente errori o problemi.  Qualche giorno fa però ho letto sul blog di Chen Shapira un post in cui descrive una situazione in cui si è trovata coinvolta a causa di procedure PL/SQL in cui non […]

  12. […] Shapira was stalking WHEN-OTHERS-THEN and other troubleshooting monsters. Chen says, “Trying to troubleshoot an issue based on guesses about what was the error is a […]


Leave a comment