% fortune -ae paul murphy

Paranoia, money, and research

/usr/sfw/bin/gtar: /dev/rmt/0: Cannot read: I/O error
/usr/sfw/bin/gtar: /dev/rmt/0: Cannot read: I/O error
/usr/sfw/bin/gtar: Too many errors, quitting
/usr/sfw/bin/gtar: Error is not recoverable: exiting now

That's frightening, but simple-minded compared to a very different kind of recovery problem that's becoming increasingly important in academic and other research-oriented service areas. What's going on is that researchers are coming under increasing financial and career pressure to produce either commercial success or grants at just the same time that changes in science are making simulation both riskier and more important - and some of them are finding ways to make their problems ours.

Consider these two lines:

602 192304:45./// 2512213:55/ 204096414:45/ 2512215:28/ 784608414:07/ 45122304:44/ 11215:29/ 145122305:10/ 756322812:09/ 281922304:44/ 25122614:15/ 25122304:45/ 11215:29/ 195122306:04/ 5220482812:09/ 94258752514:37/ 210242812:07/ 45122304:45/ 7988514:31/ 25122306:08/ 35122810:37/ 411024214:03/ 4510242811:00/ 6512215:29/
466 25122305:101394/ 1302304:47..///0: 1272304:47..///0: 172304:54/0 1102304:54/0 1112305:55./// 1252304:54..///0: 1282304:51..///0: 35122304:54/ 1312304:47..///0: 1302304:47..///0: 35122304:54/ 1332304:54..///0: 1392304:54..///0: 25122304:54/ 1482304:54..//1,0/7/0,0/0,30: 1482304:540..//1,0/7/0,0/0,30: 25122304:54/ 25122307:23/ 1292304:54..///0: 1542304:540..//1,0/7/0,0/0,378:0 1292304:47..///0: 135215:29//1,0/,6413:640 182304:540/640 25122304:54/ 1312304:54..///0: 2528514:37/ 142306:0700 1482306:070..//1,0/7/0,0/0,30: 1482306:070..//1,0/7/0,0/0,30: 192306:0700 1302304:54..///0:

Now imagine, first, that the file these came from has slightly over 328 million lines, all of which look pretty much like these - and, second, that the guy who owns the data looks at you, the departmental sysadmin, while telling your boss that someone altered the file to sabotage his research.

If this file were real it would represent the outcome of about 3,174,336,000 AMD 2.6GHz CPU-seconds and nearly seven months spent waiting for a two-rack grid with 176 total cores to produce a result. Researchers, of course, don't care about CPU-seconds or hardware limitations - but they do care about waiting time, and care even more if that waiting time either lets a competitor publish first or turns out to have been wasted because of an error in the setup or running code.
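For anyone who wants to sanity-check those numbers, the arithmetic is easy - a quick Python sketch (the core count and CPU-second total are the ones quoted above; the 30.44-day average month is my own assumption):

# Back-of-the-envelope check of the CPU-second figure quoted above.
cores = 176                            # two racks, as described
cpu_seconds = 3_174_336_000            # the hypothetical run total
wall_seconds = cpu_seconds / cores     # assumes every core busy the whole time
print(wall_seconds / 86400)            # about 209 days
print(wall_seconds / (86400 * 30.44))  # about 6.9 months - "nearly seven"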

Combine frustration with outside financial pressure and what you get is a recipe for paranoia - hence our entirely hypothetical researcher's assertion that the file had been sabotaged, and his consequent threats of both legal and direct action against the guy responsible for running the hardware.

Since I'm making this up as I go along I can imagine that this particular situation was resolved when the system logs showed that some unsung hero in the University's central IT department had taken it upon himself to update every Solaris box on campus through an automated shutdown, patch, and reboot process running every thirty days - thereby surfacing both an IT control hidden in purchasing and a synchronization error in the panic-shutdown, checkpointing, and restart code managing the application.

In the more common case, however, the only thing between you and getting fired in disgrace will be your ability to document the stringent application of control procedures designed to safeguard both data and processes - procedures that, if actually applied, would just about cripple the research productivity you're supposed to facilitate.

So what can you do?

There are some things you have to do - for example, using cryptography and Solaris containers or similar means to ensure user privacy, and denying all outsiders, particularly central IT people if you're burdened with those, access to the machine during critical periods.
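What makes controls like these retroactively documentable is a tamper-evident record of what each result file looked like when a run ended. Here's a minimal sketch of that idea in Python - the file names (results.dat, integrity.log) are hypothetical, and in practice the log would live somewhere the researchers can't write:

import hashlib
import time

def file_digest(path, algo="sha256", chunk=1 << 20):
    # Hash in chunks so a multi-gigabyte result file never sits in memory.
    h = hashlib.new(algo)
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
    return h.hexdigest()

# Append a timestamped digest; any later edit to results.dat won't match it.
with open("integrity.log", "a") as log:
    stamp = time.strftime("%Y-%m-%dT%H:%M:%S")
    log.write("%s  results.dat  %s\n" % (stamp, file_digest("results.dat")))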

The more general solution, however, is duplication - convince management that any serious research computing is at risk, and that the only known way to combine relationship-based researcher access with retroactively demonstrable processing integrity is to silently copy all work to another system run by somebody else.

If you demonstrably have no access to the other guy's system and it produces the same results, the burden of proof will shift to the accuser - meaning that, practically speaking, you're off the hook, and so is your department chairman.

To implement this, take advantage of two things: continuously falling hardware costs and continuously improving software. If, for example, you work in biotechnology and you get along with the sysadmin over in chemistry you've got the basis for a deal - because he's got the same problem.

Propose that new hardware be duplicated at both sites, use one set of containers to achieve verifiable run-time isolation, and use another pair to swap data and programs. When paranoia erupts and one of you is facing accusations of leaking, distorting, destroying, or otherwise interfering in some user's march to the Nobel - you've got backup: and it's not a tape, it's a person who can testify that the same data with the same programs on similar hardware produced the same results.
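The verification itself is the easy part: if each site keeps a sha256sum-style manifest (one "digest  path" pair per line) for its copy of the run, a few lines of Python either confirm the match or name the file that diverged. The manifest names below are invented for illustration:

def load_manifest(path):
    # Parse "digest  path" lines, as produced by sha256sum or the like.
    entries = {}
    with open(path) as f:
        for line in f:
            digest, name = line.split(None, 1)
            entries[name.strip()] = digest
    return entries

local = load_manifest("manifest.biotech")       # hypothetical names
remote = load_manifest("manifest.chemistry")

# Report any file present on only one side or hashed differently on the two.
for name in sorted(set(local) | set(remote)):
    if local.get(name) != remote.get(name):
        print("DIVERGED: %s: %s vs %s" % (name, local.get(name), remote.get(name)))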

Just make sure only you and the department heads know - or the researchers will be signing agreements promising faithfully not to whine at you if only they can, just this once, please, have access to both halves of the facility, because, you know, it's an emergency, and you can trust them, right?


Paul Murphy wrote and published The Unix Guide to Defenestration. Murphy is a 25-year veteran of the IT consulting industry, specializing in Unix and Unix-related management issues.