% fortune -ae paul murphy

Document Storage and Retrieval

My basic problem is simple: thousands of word processor and application-generated documents are created each day, there's a legal requirement that at least some of these be kept accessible for sixty years, and there's a practical need for some people, under some circumstances, to have immediate access to old documents.

I'm a big believer in open document standards, but a sixty-year-old document today would have been typed in 1945, and we have trouble reading optical microforms from that far back. In other words, I just don't think any standard adopted today will last long enough to address this problem.

So my solution is hardware combined with business processes. Establish an organizational document center to provide both archiving and document search functions to the entire organization. That document center would then update its technologies as needed, converting all documents to the new formats each time, and thereby keep everything both current and accessible.
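The convert-everything-on-each-technology-change process described above can be sketched in a few lines. This is only an illustration under stated assumptions: the ArchivedDocument record, the format labels, and the converter table are all hypothetical names invented here, and a real document center would call actual conversion tools rather than the stand-in function shown.

```python
from dataclasses import dataclass, replace
from typing import Callable, Dict, List, Tuple

# Hypothetical archive record; the field names are assumptions for illustration.
@dataclass(frozen=True)
class ArchivedDocument:
    doc_id: str
    fmt: str        # label of the format the payload is stored in
    payload: bytes

# A converter turns a payload in one format into a payload in another.
Converter = Callable[[bytes], bytes]

def migrate_archive(
    docs: List[ArchivedDocument],
    target_fmt: str,
    converters: Dict[Tuple[str, str], Converter],
) -> List[ArchivedDocument]:
    """Return the archive with every document converted to target_fmt.

    Documents already in the target format are passed through unchanged.
    A document with no registered converter raises LookupError, so that
    nothing is silently left behind in an obsolete format.
    """
    migrated = []
    for doc in docs:
        if doc.fmt == target_fmt:
            migrated.append(doc)
            continue
        convert = converters.get((doc.fmt, target_fmt))
        if convert is None:
            raise LookupError(
                f"no converter from {doc.fmt} to {target_fmt} for {doc.doc_id}"
            )
        migrated.append(replace(doc, fmt=target_fmt, payload=convert(doc.payload)))
    return migrated
```

The design point is the failure mode: refusing to finish a migration while any document lacks a conversion path is what keeps the whole archive current, rather than letting a few old formats quietly become unreadable.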

In operation, this center's management would do all the normal things: dual-center redundancy, use of non-volatile secondary storage, and careful scrutiny of access authorization and use. In fact, it would look a lot like Google will if it goes ahead and offers OpenOffice document storage, retrieval, and conversion to Gmail users.

I'm not a big fan of virtualization in general, but there's a natural application here for Sun's Solaris Containers: just use them to segregate the search engine by access authorization class, because that allows use of a single storage hierarchy. In other words, use ZFS with RAID-Z on a couple of towers of ordinary SATA disks. Add automated conversion of incoming documents to a standard form, along with something like Google's licensable search technologies, and what you get is reliable storage coupled with an easily secured search capability that responds differently to different classes of user authorization.
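To make the idea of one shared store with a separate search view per authorization class concrete, here is a minimal sketch. It is not how Solaris Containers or any commercial search product works internally; the class names, the document tuples, and both functions are assumptions made up for the example, with a plain in-memory word index standing in for a real search engine.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

# Hypothetical document shape: (doc_id, classes allowed to see it, text).
Document = Tuple[str, Set[str], str]
# One word index per authorization class: class -> word -> doc_ids.
Indexes = Dict[str, Dict[str, Set[str]]]

def build_indexes(documents: Iterable[Document]) -> Indexes:
    """Build one word index per authorization class from a single store.

    Each class's index only ever contains documents that class may see,
    so a search run against it cannot leak anything else.
    """
    indexes: Indexes = defaultdict(lambda: defaultdict(set))
    for doc_id, classes, text in documents:
        for word in set(text.lower().split()):
            for cls in classes:
                indexes[cls][word].add(doc_id)
    return indexes

def search(indexes: Indexes, user_class: str, term: str) -> List[str]:
    """Look up a single term in the index belonging to user_class."""
    class_index = indexes.get(user_class, {})
    return sorted(class_index.get(term.lower(), set()))

store = [
    ("memo-1", {"staff", "legal"}, "Quarterly budget memo"),
    ("case-7", {"legal"}, "Budget dispute settlement"),
]
indexes = build_indexes(store)
print(search(indexes, "staff", "budget"))   # ['memo-1']
print(search(indexes, "legal", "budget"))   # ['case-7', 'memo-1']
```

The documents are stored once; only the indexes are duplicated per class. In the arrangement described above, each index and its search process would live in its own container, so securing search reduces to controlling which container a user can reach.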

And if standard document formats change tomorrow - or five years from now? There's no new impact: the people responsible for operations just adapt on the fly, updates continue, search continues, and backwards compatibility remains assured unless a power-off outlasts the backup optical or other media - say, ten years or more.


Paul Murphy wrote and published The Unix Guide to Defenestration. Murphy is a 25-year veteran of the I.T. consulting industry, specializing in Unix and Unix-related management issues.