One of those… Months?

I’ve had one of those days, and even some weeks like that, but this is the first time we’ve had an entire Murphy month, where everything that can go wrong does.

Let’s see the list:

  1. A DBA accidentally dropped a production schema. He thought he was on the test DB, of course. We are very proud that we managed to restore said schema with no data loss.
  2. One of our databases magically lost the storage network. No idea why or how. Reboot solved it.
  3. 8 hours of downtime caused by a faulty switch. We have high availability, so we automatically failed over to the secondary switch. The secondary switch immediately failed too. Since we test the failover regularly, this can only be described as unbelievably bad luck.
  4. One of our NetApp heads failed. Again, we have high availability, so we failed over to the second head. Except that after we fixed the first head, it refused to recognize the disks. According to NetApp, the first head has to run a “rebuild” on the disks so it can figure out again where our data is. We could have done it with a few hours of downtime, but we had already had a lot of downtime this month, so we opted for an online rebuild. Which is as fun as an online rebuild of indexes. An online rebuild of each disk takes around 12 hours. We have 14 disks. It was the week of unbelievable IO latency. The only upside is that for one week the DBA team was not the target of performance complaints.
  5. A bunch of smaller things: a DBA who accidentally reset passwords on 20 servers, backups that stopped working, ORA-600 on the capture process for our largest Streams customer, accidentally exposing one customer’s data to another, etc.

It should be obvious that the gods are out to get us. So much bad luck in one month cannot be accidental or random.

Since this run of production crashes coincides with the Jewish “Day of Atonement” (Yom Kippur) and the days of repentance that precede it, the solution seemed obvious: I should repent my sins, promise never to repeat them, and pray for atonement. In Judaism, any transgression of the law is considered a sin, even if it is not a moral wrong or was an innocent mistake.

So consider this the reverse of New Year’s resolutions. What I resolve not to do next year:

  1. Install new servers and publish them as production before I verify that backups and monitors indeed work on these servers.
  2. Make undocumented changes on production servers.
  3. Accuse developers of being stupid and lazy. Not even if I find a nice way to paraphrase this.
  4. Ignore large infrastructural problems just because I prefer to work on something else.
  5. Ignore mysterious production glitches, just because they don’t happen a lot.

These steps should help our production be more stable next year. The more positive resolutions will wait for January 🙂
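The first resolution is the easiest one to turn into an automated gate. Here is a minimal sketch in Python of what such a check might look like. Everything in it is an assumption made for illustration: the `ServerStatus` record, its field names, and the one-day backup threshold are hypothetical, and the real values would have to come from whatever backup and monitoring tools are actually in use.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class ServerStatus:
    """Hypothetical summary of what we know about a newly installed server."""
    name: str
    last_successful_backup: Optional[datetime]  # None if no backup ever succeeded
    monitored: bool                             # True once monitoring reports on it


def blocking_issues(status: ServerStatus,
                    now: datetime,
                    max_backup_age: timedelta = timedelta(days=1)) -> List[str]:
    """Return the reasons this server should not be published as production."""
    issues = []
    if not status.monitored:
        issues.append("monitoring is not configured")
    if status.last_successful_backup is None:
        issues.append("no successful backup yet")
    elif now - status.last_successful_backup > max_backup_age:
        issues.append("last successful backup is too old")
    return issues


def ready_for_production(status: ServerStatus, now: datetime) -> bool:
    """A server is ready only when nothing blocks it."""
    return not blocking_issues(status, now)
```

The point of returning a list of reasons rather than a bare yes/no is that whoever wants to publish the server sees exactly what is still missing.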

5 Comments on “One of those… Months?”

  1. […] — prodlife @ 12:11 am I already posted two things everyone should know about queues, but the incidents of the last month made me realize I missed another very important queue […]

  2. joel garry says:

    Oh man, I remember an interview where one of the things they asked was “the previous dba dropped the production database” to see my reaction. Apparently I looked horrified enough!

    The entire community must repent for the sin of each.

    The US government just loaned $529M to this company. Would you? 🙂

  3. On a similar topic, ruthlessly following up on small problems can avoid future large problems.

    I also try to come up with simple things that reduce the probability of prod/dev/test mixups. The simplest one I’ve come up with is to change the background color on any screen or interactive session (SSH, RDP) that is on a production system. If prod backgrounds are bright red and dev/test are normal colors, you’ve got a chance to think before you drop a schema in prod.

    • prodlife says:

      We have been doing the background thing since the last drop (something like 4 years ago). This time the DBA outwitted us and used SQL Developer for the drop…

      We are much better at doing recoveries this time around 🙂

  4. Cd-MaN says:

    Keep your spirits up! The important thing is that you always try to better yourself. Other than that, the only one who doesn’t make mistakes is the one who doesn’t work.

