Draft Blog Entries

% fortune -ae paul murphy

Disaster Recovery Planning and hardware change

Most of the time software change outpaces hardware change, but sometimes that reverses. Right now, for example, Microsoft's Xenon, IBM's Cell, and Sun's T1 series CMT machines are all well ahead of the software they're intended to run.

In the IBM and Microsoft cases the software simply isn't there yet and the hardware leapfrog's major implication is that there are enormous opportunities open to those willing to go after the evolving Xenon and Cell markets.

In Sun's case, however, existing software runs unchanged on the new gear and direct exploits of the new hardware using existing software tend to be both shorter term and lower risk than going after the performance opportunities open to those willing to code directly for the new architecture.

A couple of weeks ago, for example, I talked about consolidating many Exchange licenses on Wintel servers to Domino on my hypothetical Sun pod with one T2000 processor. That's a low risk, high reward application almost anyone with appropriate volume can pull off quickly.

The opportunity in data center disaster recovery planning is pretty much the opposite of messaging change in terms of user visibility, process duration, and the degree of overall change needed to take advantage of it. Switching from Exchange to Domino affects almost everyone but has essentially no ripple effects on either client software or the rest of the data center. In contrast, using the T1 performance breakthrough in disaster recovery affects almost nobody in the short term, but builds toward extensive change in the data center.

What makes this possible is simply that Sun's CMT line, starting with the UltraSPARC T1, leapfrogs software resource requirements growth to the point that redundant data center operation becomes financially feasible for tens of thousands of organizations that were previously forced to rely on traditional backup and recovery solutions for disaster recovery.

Basically what's happened is that the fixed cost of operating dual, mutually redundant, data centers has dropped, for many organizations, below the expected cost of business interruptions at the heart of the traditional disaster recovery plan.

Remember that traditional disaster planning process? Realistically, it went something like this:

1. Identify critical business processes and the applications that support them - and hope against all reality that applications judged to be less critical either really are less critical or don't fail.
2. For each selected application, estimate the expected cost of failure by determining the organizational financial impact of failure to meet the recovery and operational minimums specified in the Service Level Agreement - and hope against all reality that those standards exist and prove reasonable when tested under disaster conditions.
3. Look at HW/SW risk reduction options for each application to guesstimate the expected cost of risk reduction through traditional methods like the use of RAID storage, off-site tape backup, and a hot site recovery contract.
4. Get the board or CFO to take responsibility for the indicated cost/benefit trade-off and implement.

Or, in plain English: figure out what the minimum acceptable strategy is, and do that - but make sure someone else signs off as valuing the savings achieved over the processing continuity achievable.
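
To make the trade-off in that process concrete, here is a minimal sketch of the per-application arithmetic it implies. Everything in it is an assumption for illustration: the Application fields, the worth_mitigating helper, and the sample figures are hypothetical and do not come from any real plan. It also bakes in the assumption that failures are independent and costable per application.

```python
from dataclasses import dataclass


@dataclass
class Application:
    """One application in a traditional DR plan (all fields are hypothetical estimates)."""
    name: str
    outage_probability_per_year: float  # chance of a serious outage in a given year
    expected_outage_hours: float        # how long recovery is expected to take
    cost_per_outage_hour: float         # business cost of each hour of downtime
    mitigation_cost_per_year: float     # RAID, off-site tape, hot site contract share

    def expected_annual_loss(self) -> float:
        # Expected cost of failure = probability x duration x hourly cost.
        return (self.outage_probability_per_year
                * self.expected_outage_hours
                * self.cost_per_outage_hour)


def worth_mitigating(apps):
    """Return the names of applications whose mitigation costs less than the expected loss."""
    return [a.name for a in apps
            if a.mitigation_cost_per_year < a.expected_annual_loss()]


if __name__ == "__main__":
    # Illustrative numbers only.
    portfolio = [
        Application("billing", 0.25, 48, 50_000, 100_000),
        Application("intranet", 0.5, 8, 500, 20_000),
    ]
    for app in portfolio:
        print(app.name, app.expected_annual_loss(), app.mitigation_cost_per_year)
    print("Protect:", worth_mitigating(portfolio))
```

Applications that fall below the line are simply left exposed, which is the "hope against all reality" part of the process.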

There are lots of assumptions hidden in the plans this process produces. On the surface the big one is simply that failures are granular, graduated, and costable - i.e. that all processing is centralized, that the cost of failure varies by application, that application failures are independent events, and that recovery processes can be structured to match.

Although almost everyone still does this in one form or another, the underlying structure reflects the simplicity and control ideals of a mid-seventies mainframe data center and is a correspondingly poor fit for today's interactive environments.

The right answer for today's environment is to treat all of this as inapplicable and go right to the solution: operate duplicate, fully concurrent, data centers, with different management and technical teams individually responsible for operations in each one.

Doing it right takes discipline and cross training but provides maximal assurance for processing integrity because everything is covered, all the time. Basically, someone could blow up one data center, and the users would only find out via the news.

The problem with this strategy, of course, is cost. If you need two complete teams, they'd better be small relative to the business; and if you need two complete data centers, they'd better be cheap relative to the business. In other words, you'd better be using Unix ideas and technologies - a mid-size airline, for example, cannot afford redundant hundred million dollar mainframe data centers, but can afford redundant Unix centers at about seven cents on the mainframe dollar.
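
The airline example works out as follows. This is only a back-of-the-envelope sketch using the two figures quoted above (a hundred million dollars per mainframe center, seven cents on the dollar for Unix); the function names are hypothetical, and whether the figure is a capital or an annual cost is left open, as it is in the text.

```python
MAINFRAME_CENTER_COST = 100_000_000      # dollars, per data center (figure from the text)
UNIX_CENTS_PER_MAINFRAME_DOLLAR = 7      # "about seven cents on the mainframe dollar"


def unix_center_cost(mainframe_cost: int) -> int:
    """Cost of one Unix center, in whole dollars (integer math avoids float rounding)."""
    return mainframe_cost * UNIX_CENTS_PER_MAINFRAME_DOLLAR // 100


def redundant_pair_cost(single_center_cost: int) -> int:
    """Cost of operating two mutually redundant centers."""
    return 2 * single_center_cost


if __name__ == "__main__":
    mainframe_pair = redundant_pair_cost(MAINFRAME_CENTER_COST)
    unix_pair = redundant_pair_cost(unix_center_cost(MAINFRAME_CENTER_COST))
    print(f"Two mainframe centers: ${mainframe_pair:,}")
    print(f"Two Unix centers:      ${unix_pair:,}")
```

On those figures, a redundant pair of Unix centers costs a fraction of even a single mainframe center, which is why the second center stops being the obstacle.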

Notice, however, that there are two dependencies here: first, the organization has to have the in-house skills needed to pull it off; and second, the combination of processing volume and available budget has to make a redundant data center operational strategy both affordable and effective.

That's where Sun's new technologies come in: lowering both the cost and skill barriers to the point that the redundant data center strategy becomes feasible for tens of thousands of companies that could not previously afford it.

Running messaging via Domino on Sun, for example, is much less manpower intensive than running it via Exchange Servers - and the hardware cost reduction is such that you can get fully redundant pods in place for less than you would previously have had to pay for a few months of support on gear of comparable power.

In other words, the kind of true data center redundancy that was once prohibitively expensive is now easily within range for organizations with anywhere from a few hundred to a few tens of thousands of employees - and it isn't an all-at-once or nothing issue either: you can go at this transition one step at a time.

Messaging is as good a place to start as any. Suppose, for example, that you're responsible for messaging support for 15,000 users spread across a large number of offices. With Exchange your redundancy strategy consists of using RAID disks or clustered servers with backup tapes, plus the bet that outages will be local - meaning that only a few hundred users will be affected by each outage, so prompt recovery is important but not mission critical. In that situation, converting to two fully synchronized Domino pods on the T2000, "network distant" from each other, isn't difficult and eliminates the need for off-site backup storage. It also improves processing and storage redundancy: you get better performance almost all of the time, and automatic service continuation for everybody if something truly nasty hits one service center or the other.

You can cost justify this simply on Microsoft license savings; and saving cash is cool, but there's another benefit: doing this starts you down the road to running truly redundant data centers - there's no reason, for example, not to consider adding a second T2000 and disk pack to each rack as a way to safeguard operational integrity for your company's ERP/SCM database.

Paul Murphy wrote and published The Unix Guide to Defenestration. Murphy is a 25-year veteran of the I.T. consulting industry, specializing in Unix and Unix-related management issues.