% fortune -ae paul murphy

Thinking about multicore and software

InformIT.com recently ran a fascinating Andrew Binstock interview with Donald Knuth touching on some of the same topics we discuss here - programming methods, the value of open source, and the links between programming and hardware change.

I want to get back to some of his ideas about programming languages next week; today, though, let's look at his views on how well problems and programmers will adapt to the emerging multi-core world.

Here's the key exchange on this:

Andrew: Vendors of multicore processors have expressed frustration at the difficulty of moving developers to this model. As a former professor, what thoughts do you have on this transition and how to make it happen? Is it a question of proper tools, such as better native support for concurrency in languages, or of execution frameworks? Or are there other solutions?

Donald: I don't want to duck your question entirely. I might as well flame a bit about my personal unhappiness with the current trend toward multicore architecture. To me, it looks more or less like the hardware designers have run out of ideas, and that they're trying to pass the blame for the future demise of Moore's Law to the software writers by giving us machines that work faster only on a few key benchmarks! I won't be surprised at all if the whole multithreading idea turns out to be a flop, worse than the "Itanium" approach that was supposed to be so terrific - until it turned out that the wished-for compilers were basically impossible to write.

Let me put it this way: During the past 50 years, I've written well over a thousand programs, many of which have substantial size. I can't think of even five of those programs that would have been enhanced noticeably by parallelism or multithreading. Surely, for example, multiple processors are no help to TeX.[1]

How many programmers do you know who are enthusiastic about these promised machines of the future? I hear almost nothing but grief from software people, although the hardware folks in our department assure me that I'm wrong.

I know that important applications for parallelism exist - rendering graphics, breaking codes, scanning images, simulating physical and biological processes, etc. But all these applications require dedicated code and special-purpose techniques, which will need to be changed substantially every few years.

Even if I knew enough about such methods to write about them in TAOCP, my time would be largely wasted, because soon there would be little reason for anybody to read those parts. (Similarly, when I prepare the third edition of Volume 3 I plan to rip out much of the material about how to sort on magnetic tapes. That stuff was once one of the hottest topics in the whole software field, but now it largely wastes paper when the book is printed.)

The machine I use today has dual processors. I get to use them both only when I'm running two independent jobs at the same time; that's nice, but it happens only a few minutes every week. If I had four processors, or eight, or more, I still wouldn't be any better off, considering the kind of work I do - even though I'm using my computer almost every day during most of the day. So why should I be so happy about the future that hardware vendors promise? They think a magic bullet will come along to make multicores speed up my kind of work; I think it's a pipe dream. (No - that's the wrong metaphor! "Pipelines" actually work for me, but threads don't. Maybe the word I want is "bubble.")

From the opposite point of view, I do grant that web browsing probably will get better with multicores. I've been talking about my technical work, however, not recreation. I also admit that I haven't got many bright ideas about what I wish hardware designers would provide instead of multicores, now that they've begun to hit a wall with respect to sequential computation. (But my MMIX design contains several ideas that would substantially improve the current performance of the kinds of programs that concern me most - at the cost of incompatibility with legacy x86 programs.)

Notice that he's talking about personal use here - he does most of his work on a dual-core x86 laptop running Linux (Ubuntu) - and in that context he's obviously right: he'd be better off if it were possible to trade that second core for even 25% fewer wait states on the other one.

What's important to note, however, is that his comments only partially apply to Sun's CMT architecture and don't apply at all to environments in which many users share the same machine for broadly similar tasks.

The issue with personal tasks is simple: if the user has to wait for the computer, then the system is too slow - and because many tasks are ones we think of as inherently sequential, and for which we don't know how to write effective parallel code, many of today's multi-core computers effectively deliver only single-core performance and therefore give us less than we're paying for.

In Knuth's case a TeX compile of a long chapter uses one processor, and that processor is memory bound for most of the run - in fact a 2.4GHz Intel Centrino dual-core running Linux will do the job on one core that spends upwards of 80% of its time waiting for memory.

In contrast, a 2GHz Sun "Rock" chip with hardware scout enabled and not much else to do will complete the same task at very nearly full clock rate - meaning in about a quarter of the Centrino's elapsed time, and entirely without changing a line of code.

The current T2 CPUs do not have hardware scout, and the attempt to build equivalent scouting into the compilers has not yet delivered a working product; but the T2 will run the code, absent a competing workload, at the maximum rate memory allows - and because it has greater aggregate bandwidth but the same read-latency bottleneck, I'd expect it to take just about as long as the Centrino.

(Obviously this depends on the task sequence and scale. The Centrino will be faster for very small jobs, the T2 faster for extremely large ones - and both will do better the second time a job is run in quick succession than the first. I don't have TeX here, but you can, for example, cheat the clock a bit on a mid-size text processing job by pre-loading the document:

% time nroff -me f > f1
7.0u 0.0s 0:07 95% 0+0k 0+0io 0pf+0w
% sleep 300;
% ps -e | wc
324 1391 9956
% wc f &; !t
wc f ; time nroff -me f > f1
53991 416489 2512805 f
6.0u 0.0s 0:06 93% 0+0k 0+0io 0pf+0w

The wc run is what pre-loads the file into memory, and csh's !t history substitution re-runs the timed nroff command, which then finishes a second sooner. Not, mind you, that this kind of thing is useful in the real world.)

Notice, however, that this comparison assumes that the code hasn't been modified for the T2. In reality, people who bought T2 workstations would probably modify the code to suit - more Knuth:

Andrew Binstock: You are one of the fathers of the open-source revolution, even if you aren't widely heralded as such. You previously have stated that you released TeX as open source because of the problem of proprietary implementations at the time, and to invite corrections to the code - both of which are key drivers for open-source projects today. Have you been surprised by the success of open source since that time?

Donald Knuth: The success of open source code is perhaps the only thing in the computer field that hasn't surprised me during the past several decades. But it still hasn't reached its full potential; I believe that open-source programs will begin to be completely dominant as the economy moves more and more from products towards services, and as more and more volunteers arise to improve the code.

For example, open-source code can produce thousands of binaries, tuned perfectly to the configurations of individual users, whereas commercial software usually will exist in only a few versions. A generic binary executable file must include things like inefficient "sync" instructions that are totally inappropriate for many installations; such wastage goes away when the source code is highly configurable. This should be a huge win for open source.

In other words, a key benefit of open source is better hardware adaptability: people can strip out the overhead associated with making the code run on a generic target in favor of customization for their specific machines. In the CMT case, for example, the standard SPARC/Solaris binary would work, but you would normally reduce the executable in both size and run-time by compiling it specifically for the UltraSPARC T2 - and then go on from there to add whatever optimizations make sense for your workload.
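To make that concrete - a sketch only, with a hypothetical source file, and flag spellings that vary by compiler release (GCC's SPARC back end calls the chip "niagara2", Sun Studio calls the target "ultraT2") - a T2-specific rebuild looks something like:

% gcc -O3 -mcpu=niagara2 -o report report.c
% cc -fast -xtarget=ultraT2 -o report report.c

Either line tells the compiler to schedule for the T2 instead of a generic SPARC, and workload-specific options can then be layered on top of that.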

So here's a bet: for the kind of coding, compiling, and text processing workload Knuth ascribes to himself, a T2-based workstation would provide broadly similar performance to what he has now if he makes no code changes, and outperform the Centrino by up to five times on key jobs if he does.

The multi-user case is simpler. Tim Bray's "Widefinder" project uses a simple sysadmin job - finding the ten most popular downloads recorded in an Apache log file - to illustrate the limitations of CMT-style multi-threading, arguing, incorrectly I believe, that the apparently serial nature of jobs like this says something about the difficulty of switching to a heavily multi-processing, multi-threading environment like Solaris/CMT.
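To see why the job looks so stubbornly serial, here's a minimal Perl version of its shape - not Bray's code, and the log-format assumption (the path is whatever follows the "GET token) is mine:

#!/usr/bin/perl
# Count requests per path in an Apache access log and print the ten
# most frequently requested - one sequential pass, one shared hash.
use strict;
use warnings;

my %hits;
while (<>) {                        # log file named on the command line
    $hits{$1}++ if /"GET (\S+)/;    # capture the path after the "GET token
}

my @top = (sort { $hits{$b} <=> $hits{$a} } keys %hits)[0 .. 9];
printf "%8d %s\n", $hits{$_}, $_ for grep { defined } @top;

Run as "perl topten.pl access_log" it spends essentially all of its time in that single loop over the file, which is exactly the apparent lack of parallelism Bray is pointing at.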

The reason I don't think it does has nothing to do with the task itself - which makes his point nicely - but everything to do with how CMT hardware is actually used. In his illustration, having 64 concurrent threads available doesn't much affect job execution time; in the real world it does, because the CMT/CoolThreads architecture allows a "load once, run many times" approach to sysadmin work like this that other routes to multi-core performance can't match.

Specifically, what happens in the real world of CMT usage is that the first such report delivered to a key user spawns requests for many more - and once the count gets past two or three, the CMT/Perl strategy of loading the data once and then forking off many report threads to run in parallel gets the whole package done significantly sooner than the traditional multicore strategies of serial or parallel execution. The more custom reports users demand from the same data, the bigger the CMT advantage gets.
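Here's a minimal Perl sketch of that load-once, report-many pattern - the report names, output files, and default log path are hypothetical, not anyone's production script; the point is that the data is read once and each additional report is just another forked child, which on a T2 lands on its own hardware thread:

#!/usr/bin/perl
# Load the Apache log into memory once, then fork one child per
# requested report so the reports run in parallel over the same data.
use strict;
use warnings;

my $log = shift || 'access_log';
open my $fh, '<', $log or die "cannot open $log: $!";
my @records = <$fh>;                  # read the data exactly once
close $fh;

my %reports = (
    top_pages   => \&top_pages,       # the Widefinder-style report
    top_clients => \&top_clients,     # a second report on the same data
);

my @pids;
for my $name (sort keys %reports) {
    my $pid = fork;
    die "fork failed: $!" unless defined $pid;
    if ($pid == 0) {                  # child: shares the loaded records copy-on-write
        open my $out, '>', "$name.txt" or die "cannot write $name.txt: $!";
        $reports{$name}->(\@records, $out);
        exit 0;
    }
    push @pids, $pid;                 # parent: note the child and keep going
}
waitpid $_, 0 for @pids;              # wait for every report to finish

sub top_pages {                       # ten most-requested paths
    my ($records, $out) = @_;
    my %hits;
    for (@$records) { $hits{$1}++ if /"GET (\S+)/ }
    report_top(\%hits, $out);
}

sub top_clients {                     # ten busiest client addresses
    my ($records, $out) = @_;
    my %hits;
    for (@$records) { $hits{$1}++ if /^(\S+)/ }
    report_top(\%hits, $out);
}

sub report_top {                      # shared "print the top ten" helper
    my ($hits, $out) = @_;
    my @top = (sort { $hits->{$b} <=> $hits->{$a} } keys %$hits)[0 .. 9];
    printf {$out} "%8d %s\n", $hits->{$_}, $_ for grep { defined } @top;
}

A third or fourth custom report is just one more entry in %reports; the log is still read from disk only once, and that's where the CMT box pulls away from running the same reports back to back.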

(Notice that CMT/Solaris is much more tightly integrated for multi-threading than a Linux/multi-core x86 machine: Knuth could load Bray's records into memory once and then fire off two processing routines, but he'd still pay a memory access penalty on the second process, still face cache contention, and, if his machine uses an Intel paired-core package rather than an AMD dual core, face additional wait states for memory access and processor switching.)

So what's the bottom line? I think that Knuth is perfectly correct with respect to the case he describes - after all, you can't buy a T2 laptop, and his work doesn't fit the x86 multi-core model very well; that his conclusions can't be generalized to other CPU architectures and other workloads; and that his comment about open source letting people create efficient, custom binaries probably makes sense, has tremendous appeal, and ultimately undercuts the restriction of his multi-core complaints to the x86 world.


Paul Murphy wrote and published The Unix Guide to Defenestration. Murphy is a 25-year veteran of the I.T. consulting industry, specializing in Unix and Unix-related management issues.