Tuesday, April 24, 2007

common pitfalls in outage/crisis situations

Teams dealing with outages typically suffer the following behavioral problems. Some of these conflict with each other in aims .. there isn't unfortunately a set formula I can come up with as each situation has its own variables/complexities. A standardized process/template however, would be great !

1> Trial and error
Don't fall for this. You know you are in trouble when you get ambiguity from the technical teams. If you are stuck, ask yourself what can you do to get additional information onto the table. Often, you will find teams stuck 'enjoying' the problem because they have no method for infusing new information or experience into the problem diagnosis. Doing things trial & error mode also force you into a sequential analysis mode. Also, this leads to the '2 hr ETA bait' (see below).

2> Debugging in live
While prevention of recurrence is key and that requires some investment in analysis / data collection during an outage, it MUST be capped. Do not fall into the trap of ... "give me 20 more minutes, I have almost figured it out ..". You will hate yourself later. Walk into the situation with a time limit in your mind upon which you will trigger a failsafe way of restoring service. Communicate that upfront to the team (I am assuming that you have a failsafe procedure to restart the system to restore service that you will trigger on .. may be something as simple as rebooting the servers). Of course, best practice is that you always execute the failsafe and never debug in live, however, that requires a significant investment in test infrastructures that are capable of reproducing the problem. Remember, there is a value to getting to true root cause as that is the only way you will prevent recurrence.

3> Sequential analysis
Distinguish 'sequential analysis' from 'sequential execution'. Sequential execution is good, sequential analysis is BAD. On the analysis side, you should try to split off multiple teams (assuming you have the resources) on different aspects of the problem. That allows you to cover all bases quickly vs. an elongated recovery path where you are problem solving only one thing at a time. Sequential execution is GOOD because you want to introduce only one variable at a time else, you will break the cause and effect chain. Usually, problem solving is about eliminating variables and then incrementally fixing one thing at a time using a measured/scientific approach.

4> 2 hr ETA bait ...
Setting expectations on 'expected time to restoral' is really really hard. Here is the dark art of estimation at its finest. Setting no expectation is unacceptable (it will be fixed when it is fixed .. attitude). Your business partners/users will not be as upset about an outage as they will be about setting false expectations. Usually, a significant outage will require operational teams to build workaround/catch up plans where they may have to staff overtime or weekends. These plans depend on your estimates.

And .. what's worse is none of your technology suppliers / partners will co-operate.

On crisis situations with financial implications, vendors get very very conservative or worse, clam up ! In a crisis situation, you always will feel the information is inadequate to make a decision or set an expectation. I usually follow my instincts here (of course, harnessing whatever facts are available on the situation). Don't try this unless you have the right technical experience.

No comments: