Draft Blog Entries

% fortune -ae paul murphy

A word from Rob Pike

The world of text processing, in the sense of search and interpretation, is very broadly split into two camps: people who want to structure the document database according to content before searching, and those who want to impose that structure as part of each search. Yahoo, for example, seeks to classify web documents by type and content and then enables text search within each such pre-classified group. Google, in contrast, imposes no external structure on the data until a search is requested and then creates a new classification (documents that meet the search conditions) on the fly.
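To make the contrast concrete, here is a minimal sketch of the two camps over a toy corpus. Everything in it (the three documents, the category labels, and the function names) is made up for illustration, and it makes no claim about how Yahoo or Google actually work; it only shows the difference between filing documents before any query exists and letting each query define its own grouping.

```python
from collections import defaultdict

# Hypothetical three-document corpus; ids and text are invented for illustration.
DOCS = {
    "a": "unix file systems and databases",
    "b": "mail folders versus mail search",
    "c": "searching unix mail with grep",
}

# Camp 1: classify first, search second.
# Each document is filed under exactly one category before any query is seen.
CATEGORIES = {"a": "storage", "b": "mail", "c": "unix"}


def search_in_category(category, term):
    """Return ids of documents filed under `category` that contain `term`."""
    return sorted(
        doc_id
        for doc_id, text in DOCS.items()
        if CATEGORIES[doc_id] == category and term in text.split()
    )


# Camp 2: no filing at all. Build an inverted index (word -> document ids)
# and let each query create its own classification on the fly.
def build_index(docs):
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.split():
            index[word].add(doc_id)
    return index


def search(index, *terms):
    """Return ids of documents that contain every one of `terms`."""
    sets = [index.get(term, set()) for term in terms]
    return sorted(set.intersection(*sets)) if sets else []


INDEX = build_index(DOCS)

if __name__ == "__main__":
    # Document "c" mentions mail but was filed under "unix",
    # so a search inside the "mail" category never sees it.
    print(search_in_category("mail", "unix"))
    # The query itself defines the group: documents mentioning both words.
    print(search(INDEX, "unix", "mail"))
```

In this toy case the pre-filed search misses the one document that mentions both Unix and mail, because whoever filed it had to pick a single category in advance. The index-based search finds it without anyone having decided ahead of time what the document was "about".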

When Microsoft announced its intention to re-invent PICK in the Longhorn file system, it also saw XML as a programming language for the web, and therefore saw the combination as a means of imposing structure on the disorder of the typical PC disk, and of the internet. In effect, this was a Yahoo-style, classification-based solution.

When Rob Pike, co-creator of Plan 9 and one of the true gurus of both Unix and C, did a web interview on Slashdot, he was working with Google and touched on this set of issues. Here's part of what he said:

This is not the first time databases and file systems have collided, merged, argued, and split up, and it won't be the last. The specifics of whether you have a file system or a database is a rather dull semantic dispute, a contest to see who's got the best technology, rigged in a way that neither side wins. Well, as with most technologies, the solution depends on the problem; there is no single right answer.
What's really interesting is how you think about accessing your data. File systems and databases provide different ways of organizing data to help find structure and meaning in what you've stored, but they're not the only approaches possible. Moreover, the structure they provide is really for one purpose: to simplify accessing it. Once you realize it's the access, not the structure, that matters, the whole debate changes character.
One of the big insights in the last few years, through work by the internet search engines but also tools like Udi Manber's glimpse, is that data with no meaningful structure can still be very powerful if the tools to help you search the data are good. In fact, structure can be bad if the structure you have doesn't fit the problem you're trying to solve today, regardless of how well it fit the problem you were solving yesterday. So I don't much care any more how my data is stored; what matters is how to retrieve the relevant pieces when I need them.
Grep was the definitive Unix tool early on; now we have tools that could be characterized as `grep my machine' and `grep the Internet'. GMail, Google's mail product, takes that idea and applies it to mail: don't bother organizing your mail messages; just put them away for searching later. It's quite liberating if you can let go your old file-and-folder-oriented mentality. Expect more liberation as searching replaces structure as the way to handle data.

From the big-picture perspective, what matters here is the implied preference for unstructured data: any classification imposes its own limits on what can later be found.

Look closely at what Microsoft is doing with XML now and you'll see a contrast with what the Cocoon people are doing. Microsoft's approach is easier to understand and consistent with classification-based text processing practice, but is likely to be dead-ended by its lack of flexibility in the face of vast amounts of data and differing user agendas. In contrast, Cocoon's use of XML seems to be getting ever closer to the original point of the specification: its use as a flexible markup language for transmitting format information rather than as a structuring tool describing or defining the information itself.

Paul Murphy wrote and published The Unix Guide to Defenestration. Murphy is a 25-year veteran of the I.T. consulting industry, specializing in Unix and Unix-related management issues.