% fortune -ae paul murphy

Hey, where's my (search) app, dude?

We tend to think of the internet as an almost unlimited information source - everything you ever wanted to know about almost anything. Unfortunately, this perception is fundamentally wrong.

The most obvious gap in the internet knowledge base is the absence of third party validation for much of what is presented as fact. Thus you can generally tell "good" information sites from "bad" information sites only by comparing what you read to what you already know - and if what you know comes from the internet or some other equally suspect source, you could ultimately find yourself unable to differentiate a site decoding crop circles from one describing the limits of what is known about the relationship between lightning and sprites.

What's less obvious is that the depth of knowledge represented on the internet isn't remotely comparable to its breadth. In other words, you'll find something about nearly every topic, but most of what you find will be very superficial, repeated on tens if not hundreds or even thousands of sites, and prove, on close inspection, to be either trivial or unverifiable.

Part of what's going on is that search engines aren't meeting the challenge. Google started out as a handy index to the educated web, but its basic page ranking idea - that third party citations proxy credibility - has been left far behind as Google adapted to its own commercial needs, site builders learnt how to game Google's system, and more people put up useless sites.
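
For readers who haven't seen it spelled out, that page ranking idea fits in a few lines: a page's score is fed by the scores of the pages citing it. The sketch below is a generic power-iteration version, not Google's production algorithm, and the toy link graph is invented.

    # Generic PageRank sketch: a page's credibility score is the sum of the
    # scores of the pages citing it, damped by a constant. The graph is a toy.

    def pagerank(links, damping=0.85, iterations=50):
        pages = list(links)
        rank = {p: 1.0 / len(pages) for p in pages}
        for _ in range(iterations):
            new_rank = {}
            for p in pages:
                # Each citing page q passes along a share of its own score.
                incoming = sum(rank[q] / len(links[q])
                               for q in pages if p in links[q])
                new_rank[p] = (1 - damping) / len(pages) + damping * incoming
            rank = new_rank
        return rank

    # Toy graph: three sites citing each other, plus one site nobody cites.
    toy_links = {
        "museum.example": ["archive.example", "journal.example"],
        "archive.example": ["museum.example", "journal.example"],
        "journal.example": ["museum.example", "archive.example"],
        "uncited.example": ["museum.example"],
    }
    print(pagerank(toy_links))   # the uncited site ends up near the floor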

To apply the needle and haystack analogy to internet search: the number of needles hasn't increased proportionally with the size of the haystack, people selling pins have learnt how to simulate needles, and Google now makes money misdirecting user attention. As a result a lot of searches produce pages and pages of useless hits - try, for example, something like Minnesota Mennonite (dowry OR hope) chest and see how many hits you have to read through to find anything of real relevance.

From a user perspective, one way to look at this is to say that you need to go through several search steps, using the first few to gain the vocabulary needed to refine your search - or perhaps adding something like a "site:.edu" to your search string if you want to cut out most commercial sites.

From a systems perspective, however, what we're seeing emerge is an enormous opportunity to improve on existing search tools or, failing that, to provide sites specifically dedicated to improving search results.

You could, for example, set up a site that takes a naive search string like my earlier example (Minnesota Mennonite (dowry OR hope) chest), asks the user a few questions about what is really wanted, and then applies domain specific knowledge to improve the query before passing it to Google. This kind of approach would work well for museums whose staff are deeply familiar with projects like the Getty's controlled vocabularies for arts-related subjects, and in doing it they would provide both a valuable service to the public and new revenue and traffic opportunities for the museums.
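
A minimal sketch of that refinement step might look something like this, with a tiny hand-made synonym table standing in for a real controlled vocabulary like the Getty's - the table entries, question names, and helper are all hypothetical:

    # Hypothetical sketch: expand a naive query using a small hand-built
    # vocabulary and the user's answers, then hand the result to a search engine.

    CONTROLLED_VOCABULARY = {
        # Illustrative entries only - not the Getty's actual terms.
        "hope chest": ["dowry chest", "blanket chest", "cassone"],
    }

    def refine_query(naive_query, answers):
        """Build a sharper query string from a naive one plus the user's answers."""
        terms = [naive_query]
        # Fold in synonyms for any vocabulary term the naive query mentions.
        for term, synonyms in CONTROLLED_VOCABULARY.items():
            if term in naive_query.lower():
                terms.append("(" + " OR ".join('"%s"' % s for s in synonyms) + ")")
        # Apply simple answer-driven filters.
        if answers.get("scholarly"):
            terms.append("(site:.edu OR site:.museum)")
        if answers.get("exclude_auctions"):
            terms.append("-site:ebay.com")
        return " ".join(terms)

    print(refine_query("Minnesota Mennonite hope chest",
                       {"scholarly": True, "exclude_auctions": True}))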

Another idea, this one taking advantage of both the PC's local processing capabilities and widespread broadband connectivity, would be to provide a browser add-in that applies a set of externally editable criteria to winnow the list of Google hits down to just the ones likely to interest the user.

Such an application would need to fetch each document found in the initial search, filter it against the user's criteria, and then cache the interesting ones for display. On net it might not be much slower than today's process, but it would have the nice side benefit of burying the worst offenders - the useless directory sites that currently pollute Google's search results - under huge numbers of automated, valueless page hits, ultimately putting them out of business.
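
The core loop is straightforward; here's a rough sketch, assuming the add-in already has the list of result URLs and a user-edited set of keyword criteria (the criteria format and cache location are made up):

    # Hypothetical sketch of the add-in's core loop: fetch each result page,
    # keep only those matching the user's editable criteria, and cache them.

    import urllib.request
    from pathlib import Path

    CACHE_DIR = Path("search_cache")      # illustrative location
    CACHE_DIR.mkdir(exist_ok=True)

    def matches(criteria, text):
        """Simple keyword criteria: require some terms, reject others."""
        lowered = text.lower()
        if any(term in lowered for term in criteria.get("reject", [])):
            return False
        return all(term in lowered for term in criteria.get("require", []))

    def winnow(result_urls, criteria):
        """Fetch each hit, filter it, and cache the interesting ones for display."""
        kept = []
        for i, url in enumerate(result_urls):
            try:
                with urllib.request.urlopen(url, timeout=10) as resp:
                    page = resp.read().decode("utf-8", errors="replace")
            except OSError:
                continue                   # unreachable pages are simply dropped
            if matches(criteria, page):
                (CACHE_DIR / ("hit_%d.html" % i)).write_text(page, encoding="utf-8")
                kept.append(url)
        return kept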

New forms of search would be interesting too. In the longer run I'm convinced Google's crawler-and-cache approach will give way to much more efficient solutions, but the short-term opportunities probably lie more in text search augmentation than in text search replacement.

Consider, for example, the problem facing someone (like me) who just bought an old chair because it looked "interesting" and now wants to know what it actually is.

An internet search done via Google using Danish teak 60s chair on July 5th got 48,600 hits, and adding a "-site:ebay" to cut out the hits that just send you to eBay's rather useless search engine only reduced this to 17,300.

Trying this on Google's image search produces a claimed 12,000 images, so just clicking through them in hopes of spotting the right chair is unlikely to be successful. The big problem, of course, is that Google's image search doesn't search images: it uses text search on image names and descriptive text to select the images to display.

The technology for the search application that's needed both exists and doesn't exist. On the positive side, Ingres was initially created as a vehicle for research into image storage, recognition, and retrieval. Postgres carried on this work, and Montage (the original name for the Postgres spinoff that then became Illustra and ultimately formed part of Informix 10) came with a Stanford-developed blade enabling contextual image retrieval from movies - and that was ten years ago.

On the negative side, I don't know of anyone who's ever put the package together so a user could upload a digital image, have the computer interact with the user to characterise major image components, and then have the software go raid the internet to find other images having similar characteristics.
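
None of the pieces is exotic at toy scale. A crude stand-in for the "find images like this one" step - matching colour histograms with the Pillow library over a local folder rather than the whole internet - might look like this (filenames and threshold are invented):

    # Toy sketch of "find images that look like this one": compare normalised
    # colour histograms. Internet-scale retrieval would need far richer features;
    # this only illustrates the matching step, not the crawling.

    from PIL import Image   # Pillow

    def histogram(path, size=64):
        """Downscale, convert to RGB, and return a normalised colour histogram."""
        img = Image.open(path).convert("RGB").resize((size, size))
        hist = img.histogram()             # 256 bins per channel, 768 total
        total = float(sum(hist))
        return [h / total for h in hist]

    def similarity(a, b):
        """Histogram intersection: 1.0 means identical colour distributions."""
        return sum(min(x, y) for x, y in zip(a, b))

    def find_similar(query_path, candidate_paths, threshold=0.7):
        """Return candidates whose colour distribution resembles the query image."""
        query = histogram(query_path)
        return [p for p in candidate_paths
                if similarity(query, histogram(p)) >= threshold]

    # Example with hypothetical filenames:
    # matches = find_similar("my_chair.jpg", ["chair1.jpg", "chair2.jpg"])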

Having worked on facial recognition software, I think this would be very hard - but also extremely valuable. So much so, in fact, that there's probably another Google-class IPO waiting for whoever gets a process like this to work at internet scale.

So what's the bottom line? Internet search is getting less and less effective every day, and while there are many contributing factors, one key to understanding this is to recognise that the normal race - between new applications on the one hand and the evolutionary responses that render existing ones ineffective on the other - has defaulted to the bad guys, because the new applications just haven't appeared.


Paul Murphy wrote and published The Unix Guide to Defenestration. Murphy is a 25-year veteran of the I.T. consulting industry, specializing in Unix and Unix-related management issues.