% fortune -ae paul murphy

OASIS? ODL? XML? Whaaaaa!!

My favorite CD has Ormandy's Philadelphia orchestra playing the Shostakovich fifth. The recording was made in (I think) 1964 with the CD replacing the LP sometime in the early nineties. Both, however, still work as well as they ever did, with the CD clearer but colder than the record. In contrast I've got a 3480 tape cartridge somewhere with several hundred functionally unrecoverable documents I worked on for an Alberta government department in the mid to late eighties. They required the use of a proprietary IBM word processor on PC-DOS and even if I had the software and the drive, I don't think the tape would still be readable. Those documents are lost.

For my own use I get around this problem by doing most things first as unformatted text and, when I use FrameMaker instead of vi, making separate backup copies of my FrameMaker files using its text output function. Unfortunately that's not a solution I can recommend to clients whose ideas about word processors start and end with Microsoft Word. So what can I suggest to them?

Is XML, in any variant including the OASIS and ODL proposals, part of the answer?

Notice that software and formatting are only part of the problem - tape storage suffers from print-through, CDs and DVDs lose information over time, and so do standard disks. Yes, you can buy drives and media certified for fifty years, but no one's had them for even fifteen years, so how much do you want to bet on this stuff?

It's the software stability issue that's the killer here. Get everybody in the organization using the same word processing tools and the problem takes care of itself, right? Wrong. OpenOffice.org actually opens and correctly formats more kinds of Microsoft Word documents than Microsoft Word does - and the more rapid and adaptive change becomes, the worse this problem gets.

So is XML likely to go the same way? In principle no, in practice I think so.

In principle XML is just a set of rules for the derivation of a document type definition [DTD] that is then used to describe document formatting. Thus SAML is an XML exemplar, and a tool like FrameMaker that can import an XML-compliant DTD like SAML can be used to recover both the original text and its look and feel.
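To make the in-principle case concrete, here is a minimal Python sketch of recovering the text of an XML document using nothing but a generic parser. The sample document and its element names (memo, title, para, em) are invented for illustration and don't correspond to any real DTD; the helper name recover_text is likewise hypothetical. Note that this only gets the words back: recovering the look and feel would still need whatever defines what the markup means.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the element and attribute names below are
# invented for illustration and do not come from any real DTD.
SAMPLE = """<memo>
  <title style="bold">Budget notes</title>
  <para>First paragraph of the memo.</para>
  <para>Second paragraph, with <em>emphasis</em> inside.</para>
</memo>"""


def recover_text(xml_source):
    """Return the text of each top-level child element, markup stripped."""
    root = ET.fromstring(xml_source)
    lines = []
    for child in root:
        # itertext() walks nested elements, so inline markup such as
        # <em> contributes its text without needing to be understood.
        text = "".join(child.itertext())
        # Collapse runs of whitespace left over from the source layout.
        lines.append(" ".join(text.split()))
    return lines


if __name__ == "__main__":
    for line in recover_text(SAMPLE):
        print(line)
```

The point of the sketch is that the parser needs no knowledge of the vocabulary to get the text out, which is the sense in which the format is stable in principle.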

In practice, however, Microsoft's mind share means that whatever it calls XML becomes XML. They've apparently given up on their first idea of turning XML into a web programming language, but that doesn't mean we'll get stability in document formatting. Indeed we've already seen Word's basic DTDs evolve and accrete as Microsoft's needs and goals changed. Already, for example, people who bought into the lockable and one-read document ideas on offer in 2001 face hard choices: write these documents off as unrecoverable or face the cost of having someone read them using old software and then write them using the newer product.

Don't misunderstand: this isn't a Microsoft or even a PC issue, and nobody's homered this one. The problem is that XML is quite stable in principle but not in practice. In practice specific DTDs change over time, or the technology needed to process them changes or disappears. In Word's case it's the DTDs that change over time; the WordPerfect case, although not XML, illustrates what happens when technology and our access to it change.

Either way, of course, our cost of recovering the stored information and formatting rises with each change, and whether that change takes place in the technology or the market doesn't really matter. At some point that curve goes exponential and it becomes first impractical, and then functionally impossible, to recover the information.

So what do you tell a client legally required to keep documents on file and accessible for a minimum of sixty years? I'm thinking of killing the problem with hardware and knocking off document search in the process - on which more tomorrow.

Paul Murphy wrote and published The Unix Guide to Defenestration. Murphy is a 25-year veteran of the I.T. consulting industry, specializing in Unix and Unix-related management issues.