% fortune -ae paul murphy

Scale, language, and perceptual change

For most of us, most of the time, a large file is something in the hundreds of megabytes while a significant database usually has a few tens of gigabytes of real data.

In many science applications, however, files start in the gigabyte range and significant databases can start in the terabyte range.

At the lowest end of the computing range, software has typically driven hardware, with each new Microsoft product set forcing another round of consumer hardware upgrades - as witness Windows 3.1, 95, NT, 2000, 2003/XP, and now "Vista".

At the cutting edge of science applications, however, the opposite has generally been true, with hardware limitations driving software development. Individual computers have thus been considered quite small relative to the problems that people working where mathematics and theory meet practice want to solve, and that, in turn, has driven the evolution first of array processors and then of grid computing.

From a hardware perspective, IBM's Cell processor represents the next evolutionary step: it takes most of the physical distance, and thus the wiring and signalling time, out of the grid, and it has been designed, like Sun's MAJC CPU before it, to support multiple processor sets on a single piece of silicon - I'm told a four-way (32 SPU) Cell is already running, and bigger units are undoubtedly on the way.

Unfortunately, even a performance increase in the range of three orders of magnitude - which is where Cell2 is headed relative to last year's 3GHz Xeon - will still leave the hardware decades behind recent increases in problem scale.

Consider, for example, an apparently trivial pattern matching problem: find all ordered matches between substrings (of length greater than X) of a vector N and substrings of the rows of a matrix M, with interpolation (of fewer than Y characters) allowed. Thus, for example, the substring "ABDC" in "AABABDCB" matches the pattern "ABC" despite the interpolated "D".

For small problems this is much harder to get right than you'd think it should be, but it isn't conceptually difficult - indeed, 25 lines of Perl should just about do it. But what happens when the inputs get bigger? Make N a million characters long, give M 500 million rows with an average row length of 2.5 billion characters, allow substrings as short as 800 characters, and the run time of our Perl program stretches to tens of thousands of years.
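To make that concrete, here is a minimal brute-force sketch, in Perl, of just the innermost comparison: one candidate pattern drawn from N checked against one row of M, allowing fewer than a chosen number of interpolated characters. The variable names and the toy data are mine, purely for illustration - the real problem wraps this kernel in loops over every qualifying substring of N and every row of M, and that nesting is where the tens of thousands of years come from.

    #!/usr/bin/perl
    # Brute-force kernel: does $pattern occur, in order, starting at $start
    # in $row with fewer than $max_interp interpolated characters?
    use strict;
    use warnings;

    sub matches_at {
        my ($row, $start, $pattern, $max_interp) = @_;
        my $i      = $start;   # current position in the row
        my $interp = 0;        # interpolated characters seen so far
        for my $ch (split //, $pattern) {
            while ($i < length($row) && substr($row, $i, 1) ne $ch) {
                $i++;
                return 0 if ++$interp >= $max_interp;
            }
            return 0 if $i >= length($row);
            $i++;              # consumed a matching character
        }
        return 1;
    }

    my ($row, $pattern) = ('AABABDCB', 'ABC');
    for my $start (0 .. length($row) - 1) {
        print "match at offset $start\n" if matches_at($row, $start, $pattern, 2);
    }

Run against the example above it reports the match at offset 3 - the "ABDC" stretch - and nothing else; scale the inputs up and this nested scanning is exactly what buries you.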

To get stuff like this done in human time, people have developed better algorithms, found ways to reduce both M and N, and worked out how to spread the job across many separate processors. That works: Blast (which finds good, but not always optimal, matches) can more or less do this job for small M and N in a few hours on a typical supercomputing grid - but the approach also has consequences that go far beyond the recognised limits imposed by the computational shortcuts used.
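To give a feel for the kind of shortcut involved - and this is just the general flavour of exact-seed heuristics, not Blast's actual code - here's a sketch in the same vein: index short exact "seeds" of the pattern and only look harder where a seed hits.

    #!/usr/bin/perl
    # Seed-and-filter sketch: index exact k-character words of the pattern,
    # then scan a row and flag only the positions with an exact seed hit.
    use strict;
    use warnings;

    my $k       = 3;            # seed length - illustrative only
    my $pattern = 'ABC';
    my $row     = 'AABABDCB';

    # Index every exact k-mer of the pattern.
    my %seeds;
    $seeds{ substr($pattern, $_, $k) } = 1 for 0 .. length($pattern) - $k;

    # Scan the row: only seed hits would get the expensive comparison above.
    for my $i (0 .. length($row) - $k) {
        my $word = substr($row, $i, $k);
        print "seed hit '$word' at offset $i - worth extending\n"
            if $seeds{$word};
    }

With these toy inputs the seed scan finds nothing at all, even though the brute-force kernel above finds the interpolated match at offset 3 - which is precisely the trade: enormous speed in exchange for matches that are good, but not guaranteed optimal.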

For example, the ease with which grids can be partitioned combines with their cost to facilitate the stupidest possible, and most common, administrative reaction: allocating fractional grid resources to individual researchers - thereby forcing the best and brightest to compete destructively for resources while still spending most of their research time just waiting for results.

The most important consequence, however, is more subtle: machine limitations have consistently colored all serious thinking about how to address large scale problems.

Science has a long history of wasting resources on make-dos - consider, for example, the human effort that has gone into earlier computational shortcuts: from logarithms to the advancement of Calculus, it has all been about getting "good enough" results in the absence of adequate discrete computational power. But such detours take on a life of their own - and even today people who made their careers on trivial advances in Calculus continue to force it on students while blocking widespread acceptance of more effective computational methods.

Unfortunately, the solutions adopted to get science codes down to human-scale run times are having a similar effect - with machine limitations forcing us to frame our thinking about many of these problems in ways that reflect not the underlying mathematics, but our computational methods.

And that's bad because it means, first, that the graduate students now inventing tomorrow's methods are being forced to do so in a programming context set by those limitations, and, second, that ideas transcending those limitations will be crushed both by peer review and by the funding agencies dependent on it.


Paul Murphy wrote and published The Unix Guide to Defenestration. Murphy is a 25-year veteran of the I.T. consulting industry, specializing in Unix and Unix-related management issues.