When-Others-Then and other Troubleshooting Monsters
Posted: July 16, 2009
I’m definitely having one of these weeks.
We have this insanely complicated, highly visible, ultra high priority project going live this Wednesday. The DBA who worked on this project nearly full time for the last 6 months left on a 3-week vacation on Monday. I’m the replacement.
Tuesday morning, I’m on a conference call. The night job failed due to an issue with “dblink”. It must be fixed ASAP or the go-live will be delayed. I was all “Wait. I’m just a simple DBA. What dblink are you talking about?”.
So the morning was spent with me trying to slowly work my way through huge amounts of night-job code. While everyone around me was running in circles screaming. Mostly screaming “network” and “dblink”.
You know what I found out? That we have no clue why the night job failed. Because at the end of every single procedure in the job was code that said:
EXCEPTION
  WHEN OTHERS THEN
    INSERT INTO NIGHT_JOB (status, last_run_stmp) VALUES ('failed', SYSDATE);
    COMMIT;
The job failed. What more information could you possibly want?
And the funny thing that happens when you have such useful error messages is that everyone starts developing theories about why the failure occurred. Maybe someone tests the dblink a few hours after the failure, notices that the remote site is down, and decides that this is the issue. Someone else tries rerunning some of the code, gets a “unique constraint” error, and guesses that this is the issue. But of course, it’s all guesses after the fact. No one can know what caused the night job to fail on its original run.
Trying to troubleshoot an issue based on guesses about what the error was is a complete waste of time. What are the chances that you’ll find the real issue and that the next time the night job runs everything will work?
Proper error handling is a complex topic: which exceptions can be handled locally and which should be raised? At which levels do we trap exceptions? These questions are a topic of much discussion among developers and architects. But the basics of error handling should not be ignored: when an error occurs, we must know what the error was.
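For what it’s worth, here is a minimal sketch of what “knowing what the error was” could look like in one of those night-job procedures. Everything named here is an assumption for illustration: the NIGHT_JOB_ERRORS table, the LOG_NIGHT_JOB_ERROR procedure and the LOAD_REMOTE_DATA step are hypothetical and not part of the actual job. The idea is simply to capture SQLCODE, the error stack and the backtrace before anything else happens, and then re-raise instead of swallowing the exception.

```sql
-- Hypothetical logging table; the name and columns are assumptions.
CREATE TABLE night_job_errors (
  proc_name   VARCHAR2(61),
  error_code  NUMBER,
  error_stack VARCHAR2(4000),
  backtrace   VARCHAR2(4000),
  logged_at   DATE DEFAULT SYSDATE
);

-- Hypothetical logging procedure, meant to be called from an exception handler.
-- The autonomous transaction commits only the log row, so the row survives a
-- rollback of the failed work and the failed work is not committed by accident.
CREATE OR REPLACE PROCEDURE log_night_job_error (p_proc_name IN VARCHAR2) IS
  PRAGMA AUTONOMOUS_TRANSACTION;
  -- SQLCODE cannot be referenced directly in a SQL statement, so copy it first.
  v_code      NUMBER         := SQLCODE;
  v_stack     VARCHAR2(4000) := SUBSTR(DBMS_UTILITY.FORMAT_ERROR_STACK, 1, 4000);
  v_backtrace VARCHAR2(4000) := SUBSTR(DBMS_UTILITY.FORMAT_ERROR_BACKTRACE, 1, 4000);
BEGIN
  INSERT INTO night_job_errors (proc_name, error_code, error_stack, backtrace)
  VALUES (p_proc_name, v_code, v_stack, v_backtrace);
  COMMIT;
END log_night_job_error;
/

-- Hypothetical night-job step showing the handler pattern.
CREATE OR REPLACE PROCEDURE load_remote_data IS
BEGIN
  NULL;  -- the real work (for example, queries over the dblink) would go here
EXCEPTION
  WHEN OTHERS THEN
    log_night_job_error('LOAD_REMOTE_DATA');
    RAISE;  -- re-raise so the caller still sees the original error
END load_remote_data;
/
```

With something like this in place, the morning-after question becomes a query against NIGHT_JOB_ERRORS rather than a round of guessing. Whether to re-raise at every level or only at the top of the job is exactly the kind of design decision mentioned above; the sketch only covers the minimum of not losing the error.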