The (traditional) disaster recovery plan

Aka the business continuity plan, and rather less well known as the "risk action plan," this is a document whose existence and table of contents are subject to audit - but which, like most data processing control artifacts, doesn't have to bear much resemblance to reality. In theory, of course, it does: after all, the primary control, the user service level agreement, specifies how long data processing has to bring a list of critical applications back on-line, and the risk action plan documents are meant to describe just how those commitments will be met.

Unfortunately, all of the plans I've reviewed have had one thing in common: a lack of testing, or even testability, under realistic conditions.

In theory, a disaster recovery plan consists of a list of possible disaster scenarios together with a proven method (including staffing and technology) for overcoming the consequences of each one. Typically, therefore, such plans start with a hypothetical event that closes or wrecks the data center, and then focus on who does what, and where, to bring a carefully prioritised list of applications back up as quickly as possible.
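
To make that "if this, then that" structure concrete, here is a minimal sketch - in Python, purely for illustration - of the lookup-table logic such a plan boils down to. The scenario names, responsible roles, application priorities, and outage targets are invented placeholders, not drawn from any real plan.

    # A minimal sketch of the "if this, then that" shape a traditional plan takes.
    # Every scenario name, role, application, and number here is a hypothetical
    # placeholder, not taken from any actual plan.

    RECOVERY_PLAN = {
        "data_center_fire": {
            "owner": "operations manager on call",             # who is supposed to act
            "standby_site": "certified hot site",              # where processing resumes
            "restore_order": ["payroll", "billing", "email"],  # priority list from the SLA
            "max_outage_hours": 24,                            # commitment taken from the SLA
        },
        "extended_power_loss": {
            "owner": "facilities lead",
            "standby_site": "on-site generators",
            "restore_order": ["billing", "email"],
            "max_outage_hours": 8,
        },
    }


    def respond_to(scenario: str) -> dict:
        """Look up the pre-approved response for a named disaster scenario."""
        plan = RECOVERY_PLAN.get(scenario)
        if plan is None:
            # This branch is where real disasters tend to land.
            raise KeyError(f"no plan on file for {scenario!r}")
        return plan


    if __name__ == "__main__":
        print(respond_to("data_center_fire"))

The tidiness is the point - and the weakness. The whole mechanism assumes the event that actually happens will arrive under a name the table already knows.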

In reality, of course, the disasters rarely fit the scenarios, the people listed as responsible for each action are rarely reachable, and the senior managers who get rousted out when the brown stuff hits the fan usually throw the best-laid plans into total chaos by overruling the rule book within minutes of arriving on site.

That disconnect between plans and reality is perfectly normal, and people usually just muddle through - but the abnormal can be even more fun. Two favourite stories:

  1. The first organisation had its disaster recovery plans professionally prepared by high-powered consultants from a major international firm. After several weeks of intensive effort, they handed over a very pretty piece of work - the PowerPoint slides were works of art and the embedded "emergency adaptive organisational call-out" process was masterful.

    Everything had been considered, all contingencies covered - except that when an unhappy employee spent $29.95 on a butane torch at Home Depot, disabled the halon system, and then sloshed around some gasoline to really get those rack mounts running hot, it turned out that the only copies the company had of its disaster recovery plans were stored on those servers - along with the readers and encryption keys for the backup tapes carefully stored off-site.

    Worse, the police closed the entire data center to all traffic for about ten days while they conducted their investigation, and the health department refused access for another week because of the chemicals released before and during the fire.

  2. In the second case, a government agency wanted to take control of its own data processing from the IBM-dominated group providing government-wide services. Negotiations having failed, local management invoked its right to opt out and hired the Canadian franchise holder for a large American consulting group to implement a non-IBM mainframe solution (an HDS), complete with disaster recovery document preparation and appropriate staff recruitment and training - but compromised by agreeing to use the central agency's ultra-safe, temperature-controlled vaults for off-site data storage.

    A few years later, a contractor's employee working on the tunnel system two floors below the data center is thought to have unknowingly punctured a gas line sometime before leaving work on a Friday night. The inevitable happened early Sunday morning - turning that Hitachi into just so much shredded metal and taking the disks and the on-site tape vault with it to some otherwise unreachable digital heaven.

    On Tuesday, messengers arriving at the central organisation's off-site storage facility to pick up Thursday's tapes were turned away - and by late Wednesday local management had got the message: the central agency had put itself in charge of certifying disaster recovery sites, had not certified the Hitachi partner providing standby processing support for the agency, and "quite properly" refused to release the tapes to an uncertified site.

The bottom-line message here should be clear: a formal disaster recovery plan of the traditional "if this, then that" style only makes sense if you can count on being able to control both the timing and the nature of the disaster - and doesn't if you can't. In other words, the only things that are really predictable about data center recovery are that the plan won't apply to what actually happens, that the recovery process will take longer and cost more than expected, and that the whole thing will be far more chaotic and ad hoc than anyone ever wants to admit afterward.

So what do you do instead? That's tomorrow's topic, but here's the one-word answer: drill.


Paul Murphy wrote and published The Unix Guide to Defenestration. Murphy is a 25-year veteran of the I.T. consulting industry, specialising in Unix and Unix-related management issues.