One of those… Months?
Posted: October 7, 2009
I’ve had one of those days, and even some weeks like that, but it’s the first time we’ve had an entire Murphy month, where everything that can go wrong does.
Let’s see the list:
- A DBA accidentally dropped a production schema. He thought he was on the test DB, of course. We are very proud that we managed to restore said schema with no data loss.
- One of our databases magically lost the storage network. No idea why or how. Reboot solved it.
- 8 hours of downtime caused by a faulty switch. We have high availability, so we automatically failed over to the secondary switch. The secondary switch immediately failed too. Since we test the failover regularly, this can only be described as unbelievably bad luck.
- One of our NetApp heads failed. Again, we have high availability, so we failed over to the second head. Except that after we fixed the first head, it refused to recognize the disks. According to NetApp, the first head has to run a “rebuild” on the disks so it can figure out again where our data is. We could have done it with a few hours of downtime, but we already had a lot of downtime this month, so we opted for an online rebuild, which is as much fun as an online rebuild of indexes. Online rebuild of each disk takes around 12 hours. We have 14 disks. It was the week of unbelievable IO latency. The only upside is that for one week the DBA team was not the target of performance complaints.
- A bunch of smaller things: a DBA who accidentally reset passwords on 20 servers, backups that stopped working, ORA-600 on the capture process for our largest Streams customer, accidentally exposing data of one customer to another, etc.
It should be obvious that the gods are out to get us. So much bad luck in one month cannot be accidental or random.
Since this run of production crashes coincides with the Jewish “Day of Atonement” (Yom Kippur) and the days of repentance that precede it, the solution seemed obvious: I should repent my sins, promise never to repeat them, and pray for atonement. In Judaism any transgression of the law is considered a sin, even if it is not a moral hazard or was done by innocent mistake.
So consider this the reverse of New Year’s resolutions. What I resolve not to do next year:
- Install new servers and publish them as production before I verify that backups and monitors indeed work on these servers.
- Make undocumented changes on production servers.
- Accuse developers of being stupid and lazy. Not even if I find a nice way to paraphrase this.
- Ignore large infrastructural problems just because I prefer to work on something else.
- Ignore mysterious production glitches, just because they don’t happen a lot.
These steps should help our production be more stable next year. The more positive resolutions will wait for January 🙂