% fortune -ae paul murphy

PS3 based super-computing cluster on Linux

As many people know, Yellow Dog Linux, from Terra Soft, now runs on the Cell engine in Sony's PS3. That's very cool, but what many people may not realize is that Terra Soft isn't so much in the Yellow Dog business as it is in the supercomputing and life sciences software businesses.

Thus Terra Soft's recent announcement, "Terra Soft to Build World's First Cell-Based Supercomputer," is focused on the use of the PS3 hardware to run bioscience software, not on its ability to run Linux - an orientation that should be of interest to the folks at the University of Cincinnati Genome Research Institute, whose IT people have embarked on a lonely quest to port and maintain similar software for Windows.

Here's the summary:

Glen Otero, Director of Life Sciences Research for Terra Soft Solutions explains, "This cluster represents a two-fold opportunity: to optimize a suite of open-source life science applications for the Cell processor; to develop a hands-on community around this world-first cluster whereby researchers and life science studies at all levels may benefit. Once up and running with our first labs engaged, we will expand the community through invitations and referrals, supporting a growing knowledge base and library of Cell optimized code, open and available to life science researchers everywhere."

Lawrence Berkeley National Lab is working with Terra Soft to optimize a suite of life science applications. Los Alamos and Oak Ridge National Labs are also engaged, with select universities coming on-board early in 2007. Terra Soft is working to optimize the entire Y-Bio bioinformatics suite.

Thomas Swidler, Sr. Director of Research & Development at SCEI states, "This cluster is for Sony a means of demonstrating the diversity of the PS3, taking it well beyond the traditional role of a game box. While we are not in the business of competing for the Top500.org nor building cluster components, this creative use of the PS3 beta systems enables Sony to support a level of real world research that may produce very positive, beneficial results."

Regarding Terra Soft's contribution to the project, Swidler continued, "In working with Terra Soft, we found a single source for the operating system, cluster construction tools, and bioinformatics software suite. Again, their dedication to detail and professional results has surpassed our expectations. We are very eager for the completion of this initial phase in order that the research may begin."

The thumbnail on Cell, incidentally, is simple: it's IBM's current implementation of a communication and synchronization method bringing ordinary OpenGrid technology down to the chip level. Thus the Cell patent is mainly about managing inter-processor communication both on and off the grid. The name derives from both the design idea itself and the ability to plug Cell hardware together to form arbitrary processing grids, and the current implementation is a PPC-based eight-way grid with an embedded, 3.2 GHz, G5+ derived master controller.

Cell is fast enough that there's a serious payoff for facing the programming complexity that goes with it, but there's a problem: much of supercomputing relies on double precision arithmetic and the current Cell hardware is largely geared to single precision arithmetic. How effective can it be, therefore, for typical supercomputing tasks?
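To see why the single versus double precision distinction matters, here is a minimal sketch in Python. It is an illustration of the general numerical issue, not anything taken from Cell or from the paper discussed here: it simulates single precision by rounding each running total to a 32-bit float with the standard `struct` module, and the helper names (`to_single`, `accumulate`) are made up for this example.

```python
import struct


def to_single(x: float) -> float:
    """Round a Python float (an IEEE 754 double) to the nearest 32-bit float."""
    return struct.unpack("f", struct.pack("f", x))[0]


def accumulate(step: float, count: int, single: bool) -> float:
    """Add `step` to a running total `count` times.

    When `single` is True the total is rounded to single precision after
    every addition, which mimics doing the whole sum in 32-bit arithmetic.
    """
    total = 0.0
    for _ in range(count):
        total += step
        if single:
            total = to_single(total)
    return total


N = 1_000_000
EXACT = 100_000.0  # what 0.1 added a million times should give

double_sum = accumulate(0.1, N, single=False)
single_sum = accumulate(to_single(0.1), N, single=True)

double_error = abs(double_sum - EXACT)
single_error = abs(single_sum - EXACT)

print(f"double precision error: {double_error:.3e}")
print(f"single precision error: {single_error:.3e}")
```

The single precision total drifts much further from the exact answer than the double precision one, which is the kind of accumulated rounding error that pushes long-running scientific codes toward double precision.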

That's the question addressed in a recent Lawrence Berkeley research paper by Drs. Williams, Shalf, Oliker, and others. Here's their complete abstract:

The slowing pace of commodity microprocessor performance improvements combined with ever-increasing chip power demands has become of utmost concern to computational scientists. As a result, the high performance computing community is examining alternative architectures that address the limitations of modern cache-based designs. In this work, we examine the potential of using the forthcoming STI Cell processor as a building block for future high-end computing systems.

Our work contains several novel contributions. First, we introduce a performance model for Cell and apply it to several key scientific computing kernels: dense matrix multiply, sparse matrix vector multiply, stencil computations, and 1D/2D FFTs. The difficulty of programming Cell, which requires assembly level intrinsics for the best performance, makes this model useful as an initial step in algorithm design and evaluation. Next, we validate the accuracy of our model by comparing results against published hardware results, as well as our own implementations on the Cell full system simulator. Additionally, we compare Cell performance to benchmarks run on leading super-scalar (AMD Opteron), VLIW (Intel Itanium2), and vector (Cray X1E) architectures. Our work also explores several different mappings of the kernels and demonstrates a simple and effective programming model for Cell's unique architecture. Finally, we propose modest microarchitectural modifications that could significantly increase the efficiency of double-precision calculations. Overall results demonstrate the tremendous potential of the Cell architecture for scientific computations in terms of both raw performance and power efficiency.

They do a lot of careful performance testing and architectural exploration - but here's a quick sample of their results on performance:

Table 2 shows a performance comparison of GEMM between Cellpm and the set of modern processors evaluated in our study. Note the impressive performance characteristics of the Cell processors, achieving 69x, 26x, and 7x speed up for SGEMM compared with the Itanium2, Opteron, and X1E respectively. For DGEMM, the default Cell processor is 2.7x and 3.7x faster than the Itanium2 and Opteron. In terms of power, the Cell performance is even more impressive, achieving over 200x the efficiency of the Itanium2 for SGEMM.
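For readers who don't live in linear algebra libraries: GEMM is the standard "general matrix multiply" operation, C = alpha*A*B + beta*C, with SGEMM the single precision version and DGEMM the double precision one. The following is only a naive reference sketch in plain Python to show what is being timed; it is not the tuned Cell code from the paper, and the function name and list-of-lists layout are choices made for this example.

```python
def gemm(alpha, a, b, beta, c):
    """Return alpha * (a x b) + beta * c for matrices stored as lists of rows."""
    rows, inner, cols = len(a), len(b), len(b[0])
    result = []
    for i in range(rows):
        row = []
        for j in range(cols):
            # Dot product of row i of a with column j of b.
            dot = 0.0
            for k in range(inner):
                dot += a[i][k] * b[k][j]
            row.append(alpha * dot + beta * c[i][j])
        result.append(row)
    return result


A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
C = [[1.0, 1.0], [1.0, 1.0]]

result = gemm(1.0, A, B, 1.0, C)
print(result)
```

For n by n matrices the triple loop does roughly 2n^3 floating point operations on only 3n^2 values, which is why dense matrix multiply is the classic benchmark for raw arithmetic throughput.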

Overall, they conclude that the next generation Cell product needs minor hardware changes to scale efficiently for double precision work, but that the first generation is already between 3 and 60 times faster, and between 10 and 200 times more power efficient, than its competitors - numbers to keep in mind when you think about Apple's triumph in arranging to get dual core Xeon CPUs from Intel for only slightly more than four times the $89 Sony is estimated to pay for an 8+1 Cell at 3.2 GHz.

They're also numbers to keep in mind when thinking about next generation supercomputing. Terra Soft is mainly focused on biosciences applications and Yellow Dog Linux works now, but the real bottom line on the trade-off between Cell's programming complexity and its performance potential is simply that we're just a language breakthrough away from everybody's supercomputer being a rack of Cell processors.

That may sound overblown, but consider this: you can buy Mercury's dual Cell compute server from IBM (as the QS20 blade) at a list price of $18,995 - meaning that you could put 16 of these in a rack for less than $350,000 exclusive of disk and connectivity. In theory, that rack could sustain around 500 teraflops - making it significantly faster than the IBM ASCI Purple and Blue Gene/L combination for which the Lawrence Livermore labs paid an estimated $290 million (including disk and connectivity) in 2005.
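The cost side of that comparison is simple arithmetic, sketched here in Python using only the figures quoted above (list price, 16 blades per rack, two Cells per blade, and the estimated $290 million). The variable names are invented for this example, and the sketch says nothing about sustained performance.

```python
BLADE_LIST_PRICE_USD = 18_995          # QS20 list price quoted above
BLADES_PER_RACK = 16                   # rack size assumed in the text
CELLS_PER_BLADE = 2                    # the QS20 is a dual Cell blade
LLNL_ESTIMATED_COST_USD = 290_000_000  # estimate quoted above, disk and connectivity included

# Hardware cost of one rack, exclusive of disk and connectivity.
rack_cost = BLADE_LIST_PRICE_USD * BLADES_PER_RACK

# Number of Cell processors in that rack.
cells_per_rack = BLADES_PER_RACK * CELLS_PER_BLADE

# How many such racks the Livermore estimate would buy at list price.
cost_ratio = LLNL_ESTIMATED_COST_USD / rack_cost

print(f"rack cost: ${rack_cost:,}")
print(f"Cell processors per rack: {cells_per_rack}")
print(f"Livermore estimate / rack cost: {cost_ratio:.0f}x")
```

Note that the two sides aren't quite like for like: the rack figure excludes disk and connectivity while the Livermore estimate includes them.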

Paul Murphy wrote and published The Unix Guide to Defenestration. Murphy is a 25-year veteran of the I.T. consulting industry, specializing in Unix and Unix-related management issues.