% fortune -ae paul murphy

Questioning backup

There's a wonderfully exemplary National Institutes of Health document available on the web entitled File Backup Guidance.doc. Here's the Google headline on it:

NIH Computer Center Disaster Recovery Plan, Backup and Off-Site Storage Procedures. ... describes backup mistakes and avoidance advice for UNIX machines. ...

but it's a ".doc" - meaning that the people offering this bit of Unix advice don't use Unix, and that the general focus, Google's summary to the contrary, is in fact on backing up Wintel gear.

Here's a sample:

Mission Critical Systems.

In general, mission critical systems at NIH are described as ones where their loss or compromise could: ...

The following backup guidelines apply to mission critical systems: ...

At best this could be described as endearingly naive, but the assumptions about tape backup and off-site storage raise an interesting and surprisingly difficult question: are there circumstances under which an enterprise-scale system should not be backed up in the traditional sense of copying bit streams to portable media for remote storage?

Let's take an extreme case: imagine that you had a couple of hundred 24" Sun Ray users on a separately switched and wired network in a DOD or related agency, all running a high security application off a pair of maxed-out Sun 6900s, both booting from internal mirrored U320 drives and accessing a shared "thumper" array offering 96TB of fully mirrored ZFS storage. Would tape backup make sense?

Obviously not, right? First, because the data volume makes it impractical; second, because storing the tapes on site is self-defeating while taking them off-site is insecure; third, because the likelihood of ever needing to recover from tape is vanishingly small; and finally because the cheaper, smarter solution is a physically and electronically secure vault somewhere else with another thumper and a dedicated optical link to the production systems.
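To make that concrete, here's roughly what the vault arrangement could look like at the command line - a minimal sketch using ZFS snapshots and incremental send/receive, with the pool, dataset, and host names invented for illustration, and ssh standing in for whatever transport the dedicated link would actually use:

    # take a named snapshot of the production dataset
    zfs snapshot tank/secure@2006-09-18

    # ship only the blocks changed since yesterday's snapshot to the
    # standby thumper in the vault, which already holds the earlier one
    zfs send -i tank/secure@2006-09-17 tank/secure@2006-09-18 | \
        ssh vault-x4500 zfs receive backup/secure

Run that from cron every few minutes and the vault copy stays near-current without a single tape ever leaving the building.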

That's an argument I'd buy into cheerfully enough, but there's a slippery slope component to this - along the lines of the "will you sleep with me for a million bucks? OK, now that we know what you are, how about $20" joke - because if you agree that this makes sense for the agency, how can you argue that it doesn't make sense for most business or government organizations?

Think about it: hardware reliability gets better every time the gear, particularly the power transformers in it, gets smaller, while ZFS automates disk mirroring and recovery for even quite small installations. I don't have numbers on system-wide reliability, but I'd be surprised if Solaris on SPARC isn't pushing five nines out of the box, even on systems without processor redundancy and hot-swap capabilities built in - certainly the T1 CPU complex has a much wider range of service reliability features built in than almost anything else.
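For anyone who hasn't played with it, the mirroring claim is easy to demonstrate - a minimal sketch, with device and pool names invented for illustration:

    # build a mirrored pool; if either disk dies and is replaced,
    # ZFS resilvers the new one automatically
    zpool create tank mirror c0t0d0 c0t1d0

    # walk every block, verify checksums, and repair any silent
    # corruption from the good side of the mirror
    zpool scrub tank
    zpool status tank

No volume manager, no newfs, no fsck - two commands and the pool is protecting itself.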

Costs keep going down too - a 12TB thumper with 16GB of buffer and two dual-core Opteron processors provides a lot of instant backup at a list price of about $33K, while smaller SCSI or SATA arrays are now in the discretionary account range. So if the big risk is user or administrator error and ZFS protects you from that, while the marginal risk is data center destruction and redundancy protects you from that, why not drop the traditional off-site tape business with all its attendant risks, costs, and hassles?
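On the user and administrator error point, the ZFS protection in question is the snapshot - again a sketch, with dataset and file names made up for the example:

    # snapshots are instant and initially consume no extra space
    zfs snapshot tank/home@before-upgrade

    # if an administrator then wrecks the dataset,
    # back the whole thing out in one command
    zfs rollback tank/home@before-upgrade

    # or pull back one accidentally deleted file
    cp /tank/home/.zfs/snapshot/before-upgrade/report.doc /tank/home/

With snapshots covering fat-fingered mistakes and the remote pool covering site loss, it's hard to see what job is left for the tape.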


Paul Murphy wrote and published The Unix Guide to Defenestration. Murphy is a 25-year veteran of the I.T. consulting industry, specialising in Unix and Unix-related management issues.