Back in May of last year someone using the name "high end crusader" [HEC] did an HPCwire interview with Thomas Sterling, a Faculty Associate at the Center for Advanced Computing Research at the California Institute of Technology. Sterling had a lot of interesting stuff to say. Here's a sample:
HEC: How would you characterise the current (and possibly future) demand for high-end computing? Some people seem quite upset about setbacks in this area while others appear not to care. Do we have people living in totally different performance regimes, e.g., gigascale, terascale, petascale? In general, what governs the way users perceive their computing requirements?
Sterling: Demand for something in the future that does not currently exist is difficult to measure. While the field is surpassing 100-Teraflops peak performance, the scale of system sub-partitions ordinarily made available to computational scientists is measured at best in a few Teraflops (yes, there are exceptions, especially for benchmarking and demos just before SC). Practitioners have to accomplish near-term objectives within their respective problem domains using the resources they have available. That is usually their day-to-day focus. Instead of determining what they could ultimately use for their problems, they consider what they are likely to get in the next few years and then craft their present projects around that perception of reality. Also, there is the subtle influence of having to tell your sponsors that you can do the job with what they are going to give you. If you say you can't do it and need more, you'll probably get nothing and the resources will go to someone else.
And yet, investigations of some important application domains have revealed ultimate requirements so dramatic that the mind blurs with the number of orders of magnitude in required future capability. While excellent work is being accomplished in the multiple-Teraflops performance arena, key problems of strategic national and social import can consume Exaflops to enable the level of resolution and phenomenology necessary for high-confidence simulations. I am excited about such computational challenges as long-duration climate modelling, controlled-fusion reactor simulation, molecular modelling and protein folding, gravity-wave astronomy, and even symbolic processing for goal-driven autonomous-system operation (like rovers on Mars, or house-cleaning robots smart enough not to try to suck up the pet cat with their vacuum attachments). No, I'm not expecting mobile Exaflops computing any time soon, but other machine-intelligence applications, for example homeland security, may be very important for large ground-based systems in the immediate future.
Of course, demand can break down into a number of operational categories, if not disjoint, then at least with strong alternative biases in terms of sensitivity to resource availability. One distinction is between capability and capacity computing. I would add to this "cooperative" or "coordinated" computing. "Capability computing" should be reserved for those systems that reduce the execution time of fixed-size problems. "Capacity computing" should be restricted to large-throughput workloads with little or no interaction among the concurrent pieces ("embarrassingly parallel"). A lot of the computing workload we carry out is of this throughput type. Achieved performance is largely a function of the total number of separate data sets being processed.
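The capacity case is easy to sketch. Here is a minimal Python illustration (the function names and data are mine, not from the interview) in which each task touches only its own data set, so aggregate throughput is governed by how many independent data sets there are, not by any coordination among them:

```python
from concurrent.futures import ThreadPoolExecutor

def analyse(dataset):
    # Each data set is processed in isolation -- no communication
    # between the concurrent pieces, hence "embarrassingly parallel".
    return sum(x * x for x in dataset)

def capacity_run(datasets, workers=4):
    # Throughput-style workload: total work done scales with the
    # number of separate data sets (and available processors).
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(analyse, datasets))

print(capacity_run([[1, 2, 3], [4, 5], [6]]))  # -> [14, 41, 36]
```

Because the tasks never interact, there are no barriers or data exchanges to wait on, which is exactly why this category is so much less sensitive to architecture than capability computing.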
HEC: In your opinion, what are---today---the major roadblocks to Petaflops computing?
Sterling: Latency, overhead, contention, and starvation---in a word: architecture. Today, the vast majority of HEC systems comprise ensembles of commodity sequential processors. With the possible exception of the vector processors from Cray and NEC, none of these incorporate mechanisms intended to manage parallel resources or concurrent activities (excluding cache coherence mechanisms among a few processors in SMP configurations). It is not simply a matter of having enough resources; it is what those resources do. Think about how we manage our putative parallel computing today. We start up a bunch of like processes on largely separate processors, one per processor. They run for a while on local data, sometimes getting low efficiency because their memory hierarchy is unsuitable for the data-access patterns of the application. At some point, the entire computer comes to a halt with the different processes checking in at a global barrier. When the slowest task finally stops, this global barrier is negotiated and then a bunch of data is exchanged among the nodes, only to have the entire computer come to a halt again, waiting for the slowest processor at another global barrier. Finally, all of the processes on the separate computers are allowed to start up, yet again; and over and over again. Yet, we are OK with this?!
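The cycle Sterling describes (compute, barrier, exchange, barrier, repeat) is the classic bulk-synchronous pattern. A minimal sketch using Python threads as stand-ins for nodes (the workload and names here are illustrative assumptions, not anything from the interview):

```python
import threading

NUM_WORKERS = 4
SUPERSTEPS = 3
barrier = threading.Barrier(NUM_WORKERS)
inbox = [0] * NUM_WORKERS      # stand-in "network": one slot per node
results = [0] * NUM_WORKERS

def node(rank):
    local = rank + 1
    for step in range(SUPERSTEPS):
        local += step                             # local compute phase
        barrier.wait()                            # stall until the slowest node arrives
        inbox[(rank + 1) % NUM_WORKERS] = local   # exchange: send to right neighbour
        barrier.wait()                            # second global barrier before resuming
        local += inbox[rank]                      # consume left neighbour's data
    results[rank] = local

threads = [threading.Thread(target=node, args=(r,)) for r in range(NUM_WORKERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)
```

Every node pays for the slowest node at every `barrier.wait()`, twice per superstep; that idle time is exactly the overhead and starvation Sterling is objecting to.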
Perhaps the biggest roadblock is not technology at all but a community-wide attitude of complacency. Often we talk about how hard it is to program HEC systems. But the real problem, in my mind, is that we are programming the wrong computer structures: trying to force piles of sequential components to pretend they are a parallel computing engine. Of course it is difficult: the programmer has to do essentially all of the work related to resource management and performance optimisation that the system should be doing automatically. When we put in place a real parallel-computing architecture, many of the other problems will become tractable. Until then, it will continue to be a heroic and costly struggle as we move into the Petaflops era. So the question is not so much the roadblocks to Petaflops computing, but the roadblocks to practical, effective Petaflops computing. I have no doubt that, with sufficient money and infrastructure to provide the electrical power and cooling, there will be a system comprising an ensemble of interconnected processors (probably four to a chip) with a peak performance greater than a Petaflops around the year 2010, plus or minus 18 months.
But what we should care about is when real users will routinely get the resources to achieve a sustained Petaflops on their important science, engineering, and defence-related applications. One interesting statistic comes from the TOP500 list itself: the difference in years between when a specific performance threshold is crossed by the #1 machine and when the same Linpack performance is achieved by the #500 machine. This is approximately 7 years, which would mean that mere mortals would be able to undertake Petaflops computing around the year 2016 or so. This could be greatly accelerated if new architectures were developed that were far more efficient in both space and time. The ideas being pursued at Caltech, Stanford, and other institutions mentioned earlier are examples of ways this might be done.
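Sterling's projection is simple arithmetic, and worth making explicit. A sketch of both the seven-year-lag heuristic from the interview and an exponential-growth cross-check (the entry-level performance and growth factor below are illustrative assumptions of mine, not figures from the article):

```python
import math

def year_for_rank500(year_rank1_crosses, lag_years=7):
    # TOP500 heuristic from the interview: the #500 machine reaches a
    # Linpack threshold roughly seven years after the #1 machine does.
    return year_rank1_crosses + lag_years

def years_to_target(current_flops, target_flops, annual_growth):
    # Exponential-growth extrapolation; the growth factor is an
    # assumption for illustration, not a figure from the article.
    return math.log(target_flops / current_flops) / math.log(annual_growth)

# If #1 peaks at a Petaflops around 2010, mere mortals get it around 2017.
print(year_for_rank500(2010))  # -> 2017

# Cross-check: an assumed ~1-Tflops entry level growing ~1.8x per year
# needs roughly a decade to reach a Petaflops, consistent with the lag.
print(round(years_to_target(1e12, 1e15, 1.8), 1))
```

The two estimates agree only because sustained exponential growth is baked into both; a genuinely more efficient architecture would break that curve, which is precisely Sterling's point.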
In the end, I suppose, it comes down to cost. Beowulf and its siblings brought the cost down dramatically for certain classes of applications. Multicore and Cell-like architectures may also bring the cost down for some applications. But to do so for a wide array of applications, including those falling in the category of capability problems, the architectural challenges mentioned at the beginning will have to be solved.
Notice that he's coming down solidly on the side of hardware design as the basis for future solutions. While I think that's probably right in the short and long terms, I think it's an oversimplification over the three-to-ten-year term.
The biggest gains, in my opinion, are going to come from the software side. Both Sun's CMT and IBM's Cell architectures can take on both capacity- and capability-constrained tasks, but there's one clear difference. Cell implements a known technology, grids, in a better way and thus offers significant performance gains at the cost of incremental software change, while Sun's SMP-on-a-chip includes a number of "inadvertent" features (e.g. true simultaneous thread execution) that we simply don't yet know how to exploit, and therefore sets the stage for breakthrough gains in software performance.