Alexey Kharlamov: Production system maintenance

Recently I spent around 5 hours recovering a production system from a failure caused by improper actions of the engineering staff during an update procedure. I was lucky: it was not me who pressed the red button. After that emergency I tried to analyze the causes of the problem and to establish a better deployment process for software updates. Here are some thoughts that somebody may find interesting:

Keep people without relevant experience out during important updates
Yes, some people are pretty cool at development, and they may be very attentive. But a production system update requires the engineer to be a little paranoid. If you have not had bad experiences in this area, you will probably be unable to predict the side effects and the bad things that might happen. Ideally, the update should be done by a pair of engineers. The human factor is reduced a lot in that case.

Have a detailed update plan
Well, if the first rule cannot be followed... Prepare a detailed update plan that includes everything (shell commands, file and database operations, etc.). It would be wonderful if you tested it in your staging environment first. Ideally, you should have an auto-install package (deb, rpm, or whatever).
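
One way to make such a plan executable is to write it down as an ordered list of commands and let a small runner stop at the first failure. The sketch below is only an illustration: the `run_plan` helper and the placeholder commands in `PLAN` are hypothetical and stand in for your real shell, file and database operations.

```python
import subprocess
import sys


def run_plan(steps):
    """Run (description, argv) steps in order and stop at the first failure.

    Returns the descriptions of the steps that completed successfully.
    """
    done = []
    for description, argv in steps:
        print(f"[step] {description}: {' '.join(argv)}")
        result = subprocess.run(argv)
        if result.returncode != 0:
            # Do not continue with a half-applied update.
            print(f"[fail] {description} exited with code {result.returncode}")
            break
        done.append(description)
    return done


# Hypothetical plan: harmless placeholder commands instead of real operations.
PLAN = [
    ("stop application", [sys.executable, "-c", "print('stopping')"]),
    ("back up database", [sys.executable, "-c", "print('backing up')"]),
    ("install new version", [sys.executable, "-c", "print('installing')"]),
    ("start application", [sys.executable, "-c", "print('starting')"]),
]

if __name__ == "__main__":
    completed = run_plan(PLAN)
    print(f"{len(completed)} of {len(PLAN)} steps completed")
```

The same list can be run against staging first, so the plan that reaches production is exactly the one that was tested.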

Have a detailed fallback plan
This is actually a continuation of the previous thought. Your update plan should include a description of a graceful fallback procedure. Some errors appear only on the production system, or your development team might have missed something. So you must have a way to return to the previous version. Always.
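
A minimal sketch of one possible fallback mechanism, assuming each version is installed into its own directory and a `current` symlink points at the active one (this layout, the function names and the `health_check` callback are assumptions for illustration, not something the update described here used). It relies on POSIX symlinks and the atomic rename behind `os.replace`.

```python
import os


def switch_release(current_link, target_dir):
    """Repoint current_link at target_dir and return the previous target, if any."""
    previous = os.readlink(current_link) if os.path.islink(current_link) else None
    tmp_link = current_link + ".tmp"
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)
    os.symlink(target_dir, tmp_link)
    # Renaming over the old link is atomic on POSIX, so there is no moment
    # when "current" does not exist.
    os.replace(tmp_link, current_link)
    return previous


def update_with_fallback(current_link, new_dir, health_check):
    """Activate new_dir; if health_check() fails, return to the previous version."""
    previous = switch_release(current_link, new_dir)
    if health_check():
        return True
    if previous is not None:
        switch_release(current_link, previous)
    return False
```

Here `health_check` is whatever tells you the new version really works on production, for example a request to a status page. Note that this only covers files; database changes need their own way back.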

Have a Plan B
In some cases the fallback scenario cannot help, for example when the update script is executed improperly: somebody deleted important files or dropped the wrong database. Yes, your update script contains a backup step. But what if that step failed silently?
The general rule is "expect trouble, always". So regular backups of production are a must-have. Some data centers (such as ours) even provide daily server backups in the default service package.
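
To guard against a backup step that fails silently, the update script can check the backup before it touches anything else. Below is a rough sketch for a single file; `backup_and_verify` is a hypothetical helper, and treating an empty backup as a failure is an assumption that fits database dumps but not every file.

```python
import hashlib
import os
import shutil


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_and_verify(source, backup):
    """Copy source to backup and raise RuntimeError unless the copy checks out."""
    shutil.copy2(source, backup)
    # Assumption: an empty backup means something went wrong.
    if not os.path.isfile(backup) or os.path.getsize(backup) == 0:
        raise RuntimeError(f"backup {backup} is missing or empty")
    if sha256_of(source) != sha256_of(backup):
        raise RuntimeError(f"backup {backup} does not match {source}")
    return backup
```

If the function raises, the update stops before anything destructive has happened. A check like this does not replace the regular backups of the whole server; it only makes a broken backup step loud instead of silent.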

Alexey Kharlamov

Wednesday, December 12, 2007
