% fortune -ae paul murphy

Using disaster avoidance to drive change

In traditional data processing you prepare a comprehensive disaster recovery plan, rank your scheduled jobs in some kind of criticality order, contract with an off-site provider for standby hardware access, and pretend you understand how recovery from an actual disaster is going to happen - while spending a significant portion of the total DP budget on ensuring that you never find out for real.

With Unix you can do that too - but it's neither necessary nor appropriate. The difference is this: data processing's (and by extension your SOX auditor's) view of appropriate disaster recovery planning evolved over ninety years of experience to fit a world of predictable jobs in which hardware is extremely expensive relative to staffing - but none of those conditions hold for Unix. With Unix, staff time is expensive, hardware is cheap, and processor loads are largely unpredictable - all exact opposites of the conditions the traditional DR planning process responds to.

At the personal and mom-and-pop shop levels, therefore, the simplest answer is to make good backups while keeping an older generation of hardware in a box from which it can be unpacked for emergency use. Thus last year's PC runs Linux just about as well as this year's, and a ten-year-old Sun box will run today's USIV+/Solaris 10 SPARC binaries faster than their US2 originals ran under Solaris 2.7 or 8.
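For the curious, here's a minimal sketch of what "good backups" can mean at that scale - a script run nightly from cron that tars up a directory tree and keeps the last few generations, so one bad night doesn't take everything with it. The source path, destination, and retention count below are assumptions for illustration, not recommendations:

#!/usr/bin/env python3
# Generational backup sketch: archive a tree, keep the newest N copies.
# SOURCE, DEST and KEEP are illustrative assumptions.
import tarfile
import time
from pathlib import Path

SOURCE = Path("/home")      # tree to back up (assumed)
DEST = Path("/backups")     # ideally removable or remote media (assumed)
KEEP = 7                    # generations to retain (assumed)

def make_backup() -> Path:
    """Write a timestamped tar.gz of SOURCE under DEST."""
    DEST.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    archive = DEST / f"backup-{stamp}.tar.gz"
    with tarfile.open(str(archive), "w:gz") as tar:
        tar.add(str(SOURCE), arcname=SOURCE.name)
    return archive

def prune_old(keep: int = KEEP) -> None:
    """Delete all but the newest `keep` archives (names sort by date)."""
    archives = sorted(DEST.glob("backup-*.tar.gz"))
    for old in archives[:-keep]:
        old.unlink()

if __name__ == "__main__":
    print("wrote", make_backup())
    prune_old()

Pair something like that with the older box in the closet and recovery amounts to unpacking hardware and untarring the latest archive.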

As you get bigger, the key is to distribute your risks - and once you get over 40 or 50 users and/or more than two sites, full on-line redundancy becomes a better, more effective bet because each site can back up at least one other, and processing resources at that level are cheaper than network resources - meaning that maintaining a local processor at each site is often cheaper and more effective than centralisation.
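One concrete reading of "each site can back up at least one other" is a nightly cross-site push: every site mirrors its data onto a designated peer, so losing one machine room still leaves a current copy somewhere else. The sketch below assumes rsync over ssh and an invented three-site pairing - it illustrates the idea, not any particular product or procedure:

#!/usr/bin/env python3
# Pairwise cross-site backup sketch: each site pushes its data
# directory to a peer with rsync over ssh. Site names, paths and
# the pairing itself are hypothetical.
import subprocess

# Hypothetical ring: each site backs up to the next one.
PEERS = {
    "site-a": "site-b",
    "site-b": "site-c",
    "site-c": "site-a",
}

DATA_DIR = "/export/data/"           # local tree to replicate (assumed)
REMOTE_DIR = "/export/peer-backup/"  # where the peer stores it (assumed)

def push_to_peer(local_site: str) -> None:
    """Mirror this site's data directory onto its designated peer."""
    peer = PEERS[local_site]
    subprocess.run(
        ["rsync", "-az", "--delete",
         DATA_DIR,
         f"{peer}:{REMOTE_DIR}{local_site}/"],
        check=True,
    )

if __name__ == "__main__":
    push_to_peer("site-a")

The ring arrangement means every site is covered while no single peer ends up holding everyone's data.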

As you get larger, however, you find that unit bandwidth costs go down much faster than staff and systems costs, so that by the time you get much above a couple of thousand employees and a half dozen roughly equal sites, maintaining two widely separated data centers, both capable of serving everybody, is usually both cheaper and more effective than maintaining a single co-ordination center and local processors at each major site.

Note, however, that staffing has to be fully redundant too - because the overwhelming majority of user-affecting failures for Unix are caused by IT staff action - and setting things up so that a moronic moment at one site promptly affects the other is as counter-productive as it is common.

When you try to move in this direction, however, site management will typically fight you tooth and nail - and you can talk about combining centralised processing with decentralised management until your face turns blue, but they won't believe you, and that disbelief will prevent you from doing it.

What you're caught in then is a classic chicken and egg situation: you can't achieve much of anything while the old relationships and beliefs about Systems exist, but the mutual suspicion created by the service level agreement process and its consequences (particularly its annual budgeting component) prevents changing those relationships.

What you can do, however, is sell your redundant centers to divisional and site management as disaster avoidance and recovery tools - backing up the processors and staff they have on site until you eventually earn the trust needed to reverse that arrangement by centralising processing and quietly reconfiguring the local servers for whatever functions continue to be appropriate.


Paul Murphy wrote and published The Unix Guide to Defenestration. Murphy is a 25-year veteran of the I.T. consulting industry, specialising in Unix and Unix-related management issues.