This is a draft for my LinuxWorld.com series. Please do not copy or distribute this without the permission of Linuxworld.com.
This is the first of three articles in which Paul Murphy takes a close hard look at running Linux on the mainframe. In this one he fills in the technical background and looks at the probable price/performance of the system relative to Linux on x86 and Solaris on SPARC.
In the second article he will look at what Linux on the mainframe may mean for the Linux marketplace and in the third he'll look at what IBM should be doing to make the case for Mainframe Linux.
Note: My plan called for Linuxworld.com to invite IBM to write a fourth article in this series putting forward their answers to my comments but the editor wanted to include IBM responses right in the first article.
A few weeks ago I was asked what I thought of using Linux on the mainframe. The truth was that I'd simply never really thought about it at all; I had just assumed that it made sense on the basis of IBM's reputation and my own experience with MVS/XA. Forced to think about it, I realized that my MVS experience is 16 years out of date and that my perception then, that the 3084Q we were running on was fast relative to my workstation (a Sun 160 with a 16MHz MC68020 and 2MB of RAM), was coloring my reactions now.
|What's in an acronym?|
"HGOS" could be used as an acronym for "Hosted Guest Operating System" but (apparently to the
dismay of VM experts everywhere) I'd like to use "GHOST." There are two reasons:
As it turns out, the IBM product set is not cost effective relative to other Unix options such as Linux on Intel or Solaris on SPARC, but this doesn't mean no one should buy into it. As next week's article tries to make clear, there are people who might be well advised to buy into this technology despite its costs and limitations.
As of Feb 26/02 IBM said:
Linux for zSeries will support the new 64 bit architecture in real and virtual mode on zSeries servers. The Linux code to exploit the 64 bit architecture will be available from the IBM developerWorks web site at a later date. Linux for S/390, currently available on G5, G6 and Multiprise 3000 processors, will be able to execute on zSeries servers in 31 bit mode.
In brief, one of the standard versions of Linux is compiled for the zSeries and loaded as a guest operating system under VM where it resides on one or more real or memory resident mini-disks (IBMese for logical volume) and acts for all intents and purposes like a real machine running the selected Linux system.
In the earliest days of commercial data processing, jobs were entered into the computer on decks of punched cards. In preparing these, the user started with a bare hardware systems model and so the first set of cards in a card box would define the resources to be used by the machine, including things like the hex addresses for the memory range and I/O devices to be used. The next set of cards then had the program, followed by the data, and eventually the fourth set of cards had the job end instructions freeing resources again. Although today's JCL [Job Control Language] evolved from the control statements section of this, CP/40 [Control Program] originated as an attempt to create a set of virtual machines predefining resources that could be interactively assigned to jobs on the logical, rather than physical, level.
Out of this developed CP/67 and its followup - CP/VM [Control Program Virtual Machine] or just VM. (For lots of interesting detail on the history and the conflicts between the batch oriented majority within IBM and the attempts to create an interactive environment that led to VM/CMS see VM and the VM Community: Past, Present, and Future by Melinda Varian.)
The VM/CP combination doesn't operate directly on the hardware: both exploit an integrated microcode component called the Processor Resource/Systems Manager (PR/SM), which handles things like basic system resource partitioning and operates a bit like a PC BIOS.
A fully configured zSeries can be partitioned into no more than 15 logical partitions [LPARs] in this way although further micro-partitioning is possible within the CP/VM environment. Since each such LPAR is independent of all others, it can run VM or any other OS, including Linux, separately although each remains dependent on the underlying hardware and microcode.
It is possible to run Linux as a single operating system controlling the entire z800 processor. This machine starts at about $250,000 and has a maximum of four system CPUs plus one control and storage assist processor. Running Linux in this way turns the mainframe into a simple four-way box with some enhanced reliability and performance characteristics.
This approach is largely ignored here both because it isn't really different from running Linux on any other four-way box and because the major benefits IBM advertises for mainframe Linux generally derive from VM's ability to switch between multiple Linux GHOSTs on the same machine.
Amdahl's UTS product provided native System V Unix on the 470 mainframe as early as the mid eighties but the highly interactive nature of Unix conflicted with mainframe design to make it a very poor performance bet relative to running BSD on Vaxen. For some special purposes, however, this didn't matter and it is, I believe, still in use.
It is possible to dispense with VM entirely, either across the whole machine or only in one or more LPARs. Without VM an LPAR, whether it has one engine ("engine" is an IBM term denoting a CPU and the I/O hardware it's embedded in) or all of them, can run only one instance of whatever operating system is booted from CP, which makes it possible to run Linux in near-native mode on a dedicated LPAR or machine. Similarly, the fact that the CP/VM combination handles all resource allocation means that you can use VM to share resources among guest operating systems or to establish and maintain communication pathways between them.
In an FAQ accompanying its January 25/02 announcement of an entry level configuration (estimated at $250,000 with one Linux engine enabled) IBM described the target market for the machine as:
The IBM zSeries Offering for Linux is mainly targeted to server consolidation workloads of 20 to many hundreds of servers. The offering is designed from the ground up for server consolidation giving you unparalleled Total Cost of Ownership through consolidation of UNIX, Windows NT and Linux applications to Linux on zSeries.
Furthermore, it is an excellent application development platform for large customers or Independent Software Vendors (ISVs) requiring a 64-bit target platform. It provides an ideal lower-entry-price, new workload platform for customers who want the qualities of service provided by zSeries processors.
Note that most IBM mainframes are shipped with the maximum number of CPUs and amount of memory for that line pre-installed. These resources are then licensed for use as needed to tune the hardware to the specific workload to which it is applied.
According to IBM advertising and whitepapers the five most important benefits offered by this approach are:
In addition most published references to the IBM mainframe also have the words "high performance" or something very similar in, or very near, the same sentence. This implies that mainframe use confers an additional benefit: access to very high levels of throughput.
The basic zSeries hardware consists of a single integrated CPU board [known as an MCM] with 20 on-board CPU pairs, up to 64GB of RAM, and 24 1GB/Sec full duplex I/O ports. Both processors in each pair execute the same code; if the results fail to match, that pair is taken off-line and the spare pair is switched in. By default the board is structured to have up to 16 central processors [CP] in a pair of tightly coupled 8-way SMP configurations, three System Assist Processors [SAP] which direct I/O, and one spare. Current generation processors run at about 770MHz in 64-bit mode.
Each I/O port can be multiplexed four ways to produce a total of 96 I/O connection points. For disk related I/O these are now usually FICON [IBM Fiber channel] connections to independently memory buffered disk arrays and operate at industry standard rates. According to FICON and FICON Express Channel Performance Version 1.0 by Cronin et al (IBM, Poughkeepsie, February, 2002):
In addition to its bandwidth improvements, the zSeries 900 native FICON Express channel also improves the number of 4K byte operations/sec that can be processed. If a single native FICON Express channel is connected via a native FICON director to two different native FICON Shark CU ports, it can process up to 7200 IO/sec as shown in Figure 5 above. With 96 FICON and 160 ESCON, a z900 could theoretically drive a peak of over 800,000 4K I/O operations per second.
Note that FICON controllers have 333MHz PPC processors providing what amounts to a DMA service without interrupting main processing.
The introduction to the RedBook on Linux on IBM zSeries and S/390: ISP/ASP Solutions by Michael MacIsaac, Peter Chu, et al, contains an excellent overview of the hardware. Thus a fully configured zSeries machine offers:
The hardware architecture and the microcode supporting it are heavily optimized for processing batch transactions. In these:
As a result the main CPU should have ready access to co-processors to handle I/O, should have enough on-board cache to hold the main instruction loop and a small data set, and should have very fast access to the information returned from database calls. Multiple CPUs should also share the same external cache, to avoid having to tie transactions to specific processors and thus incur wait states if that processor is busy on return from an I/O call.
IBM pioneered what we now think of as interactive on-line transaction processing using the CICS/IMS combination in the late sixties, but how this works is very different from what you may be used to with Unix. When a PC boots up Linux and a user logs in, there may be as many as 60 to 90 running processes supporting that activity. Add apache/tomcat with mod_perl or PHP and a database to support on-line transactions, and the CPU could be switching between 140 or more concurrent processes.
In mainframe on-line work the transactions processor loads as a batch job - it just runs continuously getting input from a pre-processor rather than a file - and is usually the only thing running in that LPAR. (Note that this applies to so-called "production processing" under z/OS, not to VM's batch facility. When the latter runs, it acts more like a shell script than a traditional batch job and can share resources with other VM processes.)
For a (1070 page) introduction to batch processing in the CICS/DB2 environment see:
Jim Gray and Andreas Reuter, Transaction Processing: Concepts and Techniques; Morgan Kaufmann Publishers, San Francisco, 1993.
For a quick introduction to the look and feel of this technology today, check out a February 2002 IBM Redbook by Andrea Consett et al on using IBM VisualAge Cobol with CICS and DB2.
That RedBook is predominantly about using the editor and related facilities but it includes an example (Page 135) of what JCL [Job Control Language] statements look like after almost 40 years of progressive refinement:
Here is a simple compile job for a batch program. Except for the fact that for a DB2 program there is an additional precompile step, this compile JCL also applies to compiling a DB2 program, including a DB2 Stored Procedure. See 9.5, "Preparing your program for debugging" on page 104 for more information about compiling a program for test.
Although the zSeries achieves a near perfect balance within this set of requirements the hardware cannot, by itself, be tailored to specific applications, only to a generic class of applications. The refinement needed to further tune the machine to its workload is implemented in licensing, not hardware or microcode. Systems are shipped with the maximum number of CPUs and amount of memory pre-installed, and then tuned to the workload by adjusting the number of processors, or amount of memory, licensed for actual use.
Pricing information is both difficult to obtain and quite different from normal Unix pricing structures. When you buy a Unix machine from someone like Sun or Dell, the OS is part of the package, not something you lease separately. In contrast a fully configured zSeries, estimated at around $4.8 million, usually needs separate monthly operating systems and related software licensing that can add considerably to the total cost of the system.
On smaller machines some licensing does devolve to the model most people are used to. For example, IBM offers z/VM, for the z800 only, on a perpetual license for a rumored $45,000 per engine (with a maximum of four engines per system) and only $11,000 in per-license annual maintenance.
A sales document by Sytek Services (an IBM mainframe reseller) offers some cost information for Linux on the mainframe with tabulations showing an estimated $22,031 in monthly software license fees, $26,000 for the initial TurboLinux setup, (SuSe is rumored to run about $11,000 per engine) and $3,200 a month for Linux support - all for running Linux in one partition on an entry level machine.
A tech-news site offers precise list price and configuration information for the basic "raw iron". They show that:
Assuming these costs are authoritative (and tech-news is widely respected within the mainframe community), they would represent list prices before rumored average 30% discounts, but also without the additional disk and licensed software resources normally needed to run this type of gear in an enterprise data center. Real system costs are likely to net out marginally higher because few customers buy stripped down systems without significant additional software.
IBM does not seem to publish much zSeries performance information. The Redbook on running Linux in an ASP mode quoted above has extensive comments on benchmarking, all of which I believe can be summarized as saying that IBM does not do publicly audited benchmarks because there are no benchmarks which reflect the mainframe's strength.
Of course, the question is whether this should be of concern to major benchmark management organizations like the Transaction Processing Performance Council [tpc.org] and the Standard Performance Evaluation Corporation [spec.org], or to IBM and its customers.
In any case, as of Feb 25/02, I could not find any audited benchmark results for the zSeries, or for releases of the S/390 (its immediate predecessor) subsequent to about 1998, when the S/390 lost badly to early versions of the Sun E10K, listed among the benchmark reports offered by SAP, Oracle, Peoplesoft, SPEC, or TPC.
In searching for applicable benchmark information using google I came across an analysis by David Boyes which is frequently cited in lieu of a benchmark and three candidate benchmarks:
The Boyes Report
This analysis has been widely reported and often quoted (just do a google search using "test plan Charlie" "David Boyes" -quotes as shown!- for a long list of press and other citations) in support of the claim that the mainframe can run thousands of concurrent Linux ghosts.
Three quotations from the report tell the story:
Boyes selected a simple application - presenting a static page via the Apache web server - as a good test case that, for documentation purposes, could be quickly constructed and instrumented. The LPAR available to Boyes consisted of two CPUs from a G5-class System/390 along with 128MB of central storage and a recently acquired EMC disk unit that had not yet been placed into service and was available to be dedicated to the LPAR for testing purposes ...
During this phase, Boyes also developed some short REXX execs to duplicate and customize a Linux instance. He discovered that creating and configuring a new Linux instance from one of the master copies involved no more than two commands and about 90 seconds to duplicate and configure the system. This code was later used as the core of the production solution...
Finally, Test Plan Charlie, the "let's push it until it falls apart" test, was created to gauge the upper limits of the solution. Charlie began at 5 p.m. on a Friday; by midnight Saturday, it had reached 41,400 servers and it had run out of resources on the System/390 LPAR. While the system did not crash, it was unable to create new servers due to lack of resources.
|Note that the Redbook Linux on IBM zSeries and S/390: ISP/ASP Solutions: Create and maintain hundreds of virtual Linux images shows, on page 214, that it takes about 30 seconds just to copy a 250 cylinder mini-disk, so the 90 seconds is much more credible than the 2.69 second average (=111,600/41,400) implied.|
I have problems with this. He says, for example, that it takes about 90 seconds to create a Linux ghost but then claims to have created 41,400 of them in the 111,600 seconds from 5PM Friday to midnight Saturday. I don't understand this: at 90 seconds each, 41,400 instances should have taken 43 days to create even if each new instance added absolutely nothing to system load. If he created them in parallel he would have had to be running an average of about 33.4 creation streams for a continuous write rate (at 70MB each) of about 26MB/Sec before any other system activities, like running the already created instances.
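The arithmetic behind this objection can be checked with a short Python sketch. The figures are the ones quoted above; the helper names are mine, and the write rate rests on the assumption that every instance receives its own full 70MB copy rather than sharing disk with a master image.

```python
# Figures quoted in the text; the helper names are illustrative only.
INSTANCES = 41_400
SECONDS_PER_CLONE = 90        # stated time to duplicate and configure one instance
WINDOW_SECONDS = 31 * 3600    # 5 p.m. Friday to midnight Saturday is 31 hours
IMAGE_MB = 70                 # assumed size of each instance's own disk image


def serial_days(instances: int, seconds_each: float) -> float:
    """Days needed if instances are created strictly one after another."""
    return instances * seconds_each / 86_400


def parallel_streams(instances: int, seconds_each: float, window: float) -> float:
    """Average number of simultaneous creation streams needed to finish in the window."""
    return instances * seconds_each / window


def write_rate_mb_per_sec(instances: int, image_mb: float, window: float) -> float:
    """Sustained write rate if every instance receives a full, separate image."""
    return instances * image_mb / window


implied_seconds_each = WINDOW_SECONDS / INSTANCES
days = serial_days(INSTANCES, SECONDS_PER_CLONE)
streams = parallel_streams(INSTANCES, SECONDS_PER_CLONE, WINDOW_SECONDS)
rate = write_rate_mb_per_sec(INSTANCES, IMAGE_MB, WINDOW_SECONDS)

print(f"window length:            {WINDOW_SECONDS:,} s")
print(f"implied time per ghost:   {implied_seconds_each:.2f} s")
print(f"serial creation time:     {days:.1f} days")
print(f"parallel streams needed:  {streams:.1f}")
print(f"sustained write rate:     {rate:.1f} MB/s")
```

If the ghosts share a read-only master image instead of each getting a private copy, the write rate falls away, which is exactly the "share essentially everything" case.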
It probably is possible to get 41,400 instances on a small machine -if you have all of them share essentially everything, don't load a separate and individually complete working set for each ghost, and don't connect each instance to an external network. In my opinion, however, this deserves all the credibility of a claim that my ability to run 10,000 concurrent "sleep 7200" processes on a Sun Model 80 workstation proves it can support 10,000 concurrent users.
Mr. Boyes appears to be careful with his wording but, from a community perspective, the problem is that his experiment is widely presented as realistic. Consider, for example, this from the header on an interview he did with LinuxPlanet.com:
Anyone who works with Linux on IBM's System/390 mainframes has certainly heard of David Boyes. He made history early in the project by running no less than 41,400 Linux images on a single mainframe, all of them doing real work under simulated load as web servers.
or, nicely combining two bits of utter nonsense in one citation, this from a column in the San Francisco Chronicle:
To back up its claims, IBM steered me to David Boyes of Sine Nomine Associates, a networking systems consultancy in Ashburn, Va. Boyes recently helped a major East Coast telecommunications company install a new S/390, which can host up to 41,400 separate "virtual servers" running Linux. Before settling on the mainframe, Boyes and his customer considered using big Sun Microsystems servers instead, but they figured they would have needed about 750 of them, filling more than eight times as much expensive data-center space, to get equivalent computing power.
Similarly, if you check out IBM's compilation of his video clips and related materials you'll be able to hear and watch him say that each instance is a fully functional, separate, ghost system and only cynical and suspicious people like me will notice that he says this about ghosts in general, but not specifically about the 41,400 he ran.
You'll also hear him make statements about Unix that I don't think are true. For example he repeatedly claims that resource management either does not work or does not exist in Unix. In reality user resource limits are consistent with basic Unix philosophies and have been available at least since BSD 4.x in the early eighties. Stronger allocation tools are somewhat inconsistent with core system beliefs but are often commercially required and so available from all major vendors: Sun offers Solaris Resource Manager, HP has both a Workload Manager and a Process Resource Manager, AIX supports Workload Management, and Tru64 implements these functions as Class Manager.
There are many other things I don't grok either. For example, I don't understand how he could multiplex 41,400 apache instances into available TCP/IP resources without dropping performance below that of two cans connected with a bit of string.
My main question, however, is how he got 41,400 instances to fit into a 128MB machine. The problem here is that the default Linux interrupt timer runs 100 times per second so VM would have to page in the ghost, start it running (that's over 50 processes for a typical Linux instance running Apache), process the interrupt, and then either page it out or do whatever work is queued up; and do all this 100 times per second per instance. If a minimal working set for the Linux kernel [6MB] plus an Apache instance [2MB] runs to about 8MB, his machine would have had to handle something like (8MB x 100 times per second x 41,400 ghosts =) 33,120 GB/sec in throughput. Since that's about 1,380 times the maximum theoretical capacity of a fully configured next generation z900, I have trouble believing it happened.
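A minimal sketch of that estimate in Python follows. The working-set sizes, timer rate, and ghost count come from the paragraph above; the 24 GB/sec capacity figure is my assumption, taken from the MCM bandwidth quoted in the Redbook, and 1 GB is treated as 1,000 MB.

```python
# Inputs from the text; Z900_GB_PER_SEC is an assumed capacity figure
# (the MCM bandwidth quoted in the IBM Redbook), not a measured value.
KERNEL_MB = 6
APACHE_MB = 2
TICKS_PER_SEC = 100           # default Linux timer interrupt rate
GHOSTS = 41_400
Z900_GB_PER_SEC = 24


def required_gb_per_sec(working_set_mb: float, ticks_per_sec: float, ghosts: int) -> float:
    """Paging throughput if every ghost's working set is touched on every timer tick."""
    return working_set_mb * ticks_per_sec * ghosts / 1000


working_set_mb = KERNEL_MB + APACHE_MB
needed = required_gb_per_sec(working_set_mb, TICKS_PER_SEC, GHOSTS)
ratio = needed / Z900_GB_PER_SEC

print(f"working set per ghost: {working_set_mb} MB")
print(f"required throughput:   {needed:,.0f} GB/s")   # 33,120 GB/s
print(f"multiple of capacity:  {ratio:,.0f}x")        # 1,380x
```

This is a worst case: it assumes the full working set moves on every tick, which is the premise of the argument rather than a measurement.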
What he appears to have done was modify the timer (the Linux kernel needs about 45,000 different lines of code to work on the mainframe - see Chart 10) to avoid these interrupts and so reduce the paging load. As he puts it in the Dancing Penguins document:
Default Linux idle task management concept is not well-suited for hypervisor environments.
- Default 100 hz timer pops consume substantial resources for no benefit if system is idle.
- Must be adjusted proportionately -- other important timing functions are derived from this value.
If I understand things correctly, the adjustment needed to fit the paging requirements for 41,400 ghosts into available bandwidth on a brand new 16 CPU z900, never mind a two CPU G5, means that the interrupt frequency has to be reset from the default 100 times per second to about once every 13.8 seconds. Maybe I'm missing something, but this doesn't seem practical - if each interrupt caused Apache to serve up one character, you could drive the entire 3,085 mile length of the I90 from Seattle to Boston and back in about the time it would take for all 41,400 ghosts to serve up this article.
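The 13.8 second figure can be reproduced with the following sketch. It assumes, as above, an 8MB working set per ghost and 24 GB/sec of usable bandwidth (my reading of the Redbook's MCM figure); the function name is illustrative.

```python
def min_tick_interval_seconds(ghosts: int, working_set_mb: float,
                              bandwidth_gb_per_sec: float) -> float:
    """Shortest timer interval at which paging every ghost once fits in the bandwidth."""
    total_gb_per_round = ghosts * working_set_mb / 1000   # 1 GB taken as 1,000 MB
    return total_gb_per_round / bandwidth_gb_per_sec


interval = min_tick_interval_seconds(ghosts=41_400, working_set_mb=8,
                                     bandwidth_gb_per_sec=24)
default_interval = 1 / 100    # the default 100 Hz timer fires every 10 ms

print(f"minimum tick interval: {interval:.1f} s")                     # 13.8 s
print(f"slowdown vs. 100 Hz:   {interval / default_interval:,.0f}x")  # 1,380x
```

On a two CPU G5 the usable bandwidth would be lower than the assumed 24 GB/sec, so the interval would be longer still.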
In a press release on the IBM site sendmail.com claims that an IBM zSeries can support up to 2 million e-mail accounts but provides no data to back up the claim.
I have been discussing this issue with a Sendmail representative. As part of this I've received a confidential document which includes a "preliminary" compilation of results from a partial mstone test (see this report of a test on a 733MHz Linux system for details on mstone) run on the mainframe.
The test reports we have are limited to pop3 mail users accessing the system only via traditional dialup lines to download five messages per day while sending nothing. Only 10% of users are "active."
Although the report we have only shows partial results for three tests it does include one which corresponds on three values (z900, 400,000 mailboxes, 13% CPU utilization) to IBM's report:
Tests conducted in a controlled environment with a 400,000 user load resulted in very low hardware utilization (approximately 13%). Based on these results, IBM and Sendmail project that a single IBM zSeries mainframe may be able to support more than two million user mailboxes running the POP protocol!
This test failed on Sendmail's Login & Retrieval QoS [quality of service] criteria. Neither of the other tests shown correspond to IBM's numbers, although Sendmail does mention a 250K user test which passed QoS tests but provides no detail for it. Perhaps "may" is a key word in IBM's sentence?
When asked about a later [January 29/02] press release claiming that sendmail on a two processor Proliant supports 10,000 users at 215% [sic] less than a four year old Sun 450, Jon Doyle, for sendmail, wrote "We did nothing more than check retail pricing."
Another vendor, which claimed in its press release to have participated in a similar sendmail benchmark, eventually sent this note to explain why it could not forward the actual data:
I am told that the Sendmail legal department would not authorize the release of what they consider internal information. The press release was completed prior to this decision.
With reference to the Domino benchmark, Joann Duguid, Director of Linux on IBM eServer zSeries, sent me a TCO study on the use of Bynari Insight which compares the cost of running Microsoft Exchange server on NT to the cost of getting comparable services using the Bynari Insight Server running under Linux on the mainframe.
The Bynari product looks like it might be pretty neat but the TCO study certainly isn't.
For example it provides a seemingly detailed, but unsupported, cost tabulation to show that the three year cost of running the Bynari product for 5,000 mailboxes comes to $3,193,210 using Linux on the mainframe - including 50MB of disk space per user and a Bynari license fee of $11.60 per user.
At the 5000 user level they show the cost per user over 36 months as about twice that for running Exchange Server under NT, but then make their fundamental point about mainframe scaling by working out the cost ($3,278,210) for 50,000 users on the mainframe and comparing that to the $5,447,900 they compute for using NT/Exchange.
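To make the scaling claim concrete, here is a small sketch that turns the study's three-year totals into per-user figures. The dollar amounts are the ones quoted above; the dictionary layout and function name are mine, and the study gives no NT/Exchange total at 5,000 users, so that case is left out.

```python
MONTHS = 36

# Three-year totals as quoted from the TCO study: (platform, users) -> dollars.
TOTALS = {
    ("Bynari on mainframe Linux", 5_000): 3_193_210,
    ("Bynari on mainframe Linux", 50_000): 3_278_210,
    ("Exchange on NT", 50_000): 5_447_900,
}


def per_user(total: float, users: int) -> float:
    """Three-year cost per mailbox."""
    return total / users


for (platform, users), total in TOTALS.items():
    cost = per_user(total, users)
    print(f"{platform:26s} {users:>6,} users: "
          f"${cost:8.2f} per user, ${cost / MONTHS:6.2f} per user-month")

# Marginal cost of growing the mainframe case from 5,000 to 50,000 users.
extra_cost = TOTALS[("Bynari on mainframe Linux", 50_000)] - TOTALS[("Bynari on mainframe Linux", 5_000)]
marginal = extra_cost / (50_000 - 5_000)
print(f"marginal mainframe cost per added user: ${marginal:.2f}")
```

The whole scaling argument rests on that last number: almost all of the mainframe cost is treated as fixed.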
The pricing shown includes some hefty discounts:
How this kind of thinking plays out in the real world is nicely illustrated by a piece in E-week for March 25/02. This article describes a company's success in moving 700 [sic] users to Bynari under Linux on a mainframe at a cost of "just $26,000" --and "between 7 percent and 10 percent" of their mainframe MIPS.
The Notesbench.org site has information about a benchmark result obtained by running Domino R4 mail against an R5 server on a 10 way S/390 under OS/390. In all other cases I looked at, full pricing information was provided, including one based on a 4.5 Million dollar IBM iSeries but, for the zSeries they only report that:
"The $/User and $/NotesMark are not reported because the NotesBench certification is based on a total system cost exceeding $500,000.00."
Based on the tech-news numbers, this machine would have had a base cost of about $3,735,000 before maintenance, software licensing, or DASD. At that base cost its NotesMark score of 42,508 for 32,000 concurrent users gives it a minimum estimate of around $87.86 per NotesMark.
There are only two other reports available for this particular version of the benchmark, both for PC servers running NT. The faster of these, an IBM Netfinity 5500 M20 with two 550MHz Xeons, scored 10,957 for 8,250 concurrent users at a total cost of $10,419 or about $4.15 per NotesMark.
Dozens of systems are reported on for the marginally more complex R5Mail benchmark. Here, for example, a Sun V880 gets a score of 27,435 for a cost per NotesMark of $6.42 and an IBM P680 with 24 processors at 600MHz achieves 108,000 NotesMarks at a total cost of $2,952,402 or $19.66 per point.
GT.M Financial Transactions
A May 2001 press release by IBM and Sanchez Associates (donor of the open source GT.M and maker of a financial system based on GT.M, formerly MUMPS) included the statement:
"Initial testing on the z900 showed strong promise, with initial accrual processing throughput of 5,841 accounts per second on a 10 million account database," said Wayne Ross, Sanchez' engineering manager of systems evaluation.
The Sanchez website offers a whitepaper on their benchmarking effort and PDFs of their reports on the Sun 6800 and IBM S80, but no further information on the Linux for zSeries effort. Their report on the Sun 6800 shows a 24CPU model hitting 7,949 on the same accrual processing task.
This is the only unambiguous performance comparison found and shows the Sun 6800 outperforming the mainframe by about 35% in absolute terms, but lacks comparative pricing information.
The IBM argument for this solution does not anticipate heavy use of the Linux resource in either interactive or server mode. Instead, the focus is on replacing lightly loaded Sun (not Linux or Windows) servers. The IBM Redbook mentioned earlier contains an example showing the kind of consolidation effort the machine is aimed at:
This is the setup we inherit at the fictitious company XYZ.
Table 2-1 Setup for company XYZ
|Function||Server Type||# of Servers||Average utilization|
|File Server||Compaq DL380||10||10%|
|DNS Server||Sun 5S||4||15%|
|Firewall||Sun 420R||2||15%|
|Web Server||Sun 280R||10||15%|
Note: Before we go through each of the elements of sizing, keep in mind that many of the calculations we base our sizing on are confidential and cannot be explicitly written out. There are several reasons for this, the most important being we do not want to set a "standard" for how to size. Although this may seem counterintuitive, when one considers how many variations there can be in hardware (notice that our setup is fairly small and homogeneous, which will not always be the case), software, and workload, one can see why we cannot endorse a generic formula with some constants and a few variables. Since each situation is different, each sizing will have to vary accordingly. The intent here is to illustrate the principle, and not the specific implementation. [Chapter 2. Sizing 33]
There are some aspects of this hypothetical consolidation target that are really quite remarkable. For example:
Even if we assume that all gear is configured with the maximum of the most recent CPUs available we have
|System||Maximum CPUs (count x MHz)||Servers Listed||Utilization Shown||Implied cycles needed (MHz)|
|Sun 280R||2 x 750||10||15%||2,250|
|Sun 420R||4 x 450||2||15%||540|
|Sun 5S||1 x 440||4||15%||264|
|Compaq DL380||2 x 1000||10||10%||2,000|
a total systems requirement that's well within range for a single Dell 8450 or Sun V880.
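The "implied cycles" column is simply CPUs times clock rate times servers times utilization. The sketch below recomputes it directly from the first four columns and totals the result. Treating raw MHz as interchangeable across SPARC and x86 is a crude simplification carried over from the table, not a performance model, and the names are illustrative.

```python
# (system, CPUs per server, MHz per CPU, servers, average utilization), from the table.
SERVERS = [
    ("Sun 280R",     2,  750, 10, 0.15),
    ("Sun 420R",     4,  450,  2, 0.15),
    ("Sun 5S",       1,  440,  4, 0.15),
    ("Compaq DL380", 2, 1000, 10, 0.10),
]


def implied_mhz(cpus: int, mhz: int, servers: int, utilization: float) -> float:
    """Aggregate busy clock cycles (in MHz) across all servers of one type."""
    return cpus * mhz * servers * utilization


rows = [(name, implied_mhz(c, m, n, u)) for name, c, m, n, u in SERVERS]
total = sum(value for _, value in rows)

for name, value in rows:
    print(f"{name:13s} {value:8,.0f} MHz")
print(f"{'Total':13s} {total:8,.0f} MHz")
```

Dividing the total by the aggregate clock rate of a candidate replacement box gives a rough feel for the headroom, subject to the same caveat about comparing MHz across architectures.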
As the IBM authors put it in something of a masterpiece of understatement:
It turns out that server consolidation is most viable when there is some inefficiency in the current operation. [Chapter 2, Page 26]
The sizing case isn't actually worked out (after all the sizing methodology is confidential) so we don't know what performance level this is intended to match, what the disk requirements were, what the cost case is, or why this configuration was chosen.
No such ambiguity exists, however, in LINUX for S/390: Scalability and Competitive Advantage, apparently by a group called Sine Nomine Associates [SNA]. (See also a June 2000 presentation by David Boyes on the same subject.)
Here, SNA proposes that a single S/390 "with support for up to 40,000 virtual servers" and costing "less than 5 million in the first year" can use Linux ghosting to replace:
I don't know what a UE1000 was (the SPARCserver 1000 was a much earlier machine), although it is presented here as a quad processor, but I know the UE2 well -- I'm typing this on one I got in 1996. The UE2 is a workstation, and not a reasonable choice for a server job, then or now.
If the client requirement called for each customer to have a dedicated machine (but not three dedicated machines), Sun's 450 would have left the organization with 250 machines at about half the price, and the S/390 option would not have met the client's business requirements.
The IBM solution is only possible if there is no business reason for the use of separate servers for each customer. In this situation a cluster of HP K boxes or Sun 6000s would have provided an 80% cost reduction without reducing performance or reliability.
This stuff isn't just specious, it's contagious. Another bit of advertorial made available on
the IBM site and
IBM eServer z900 Provides Energy Saving Alternative to Server Farms
While a typical configuration of 750 Sun servers costs approximately $620/day in electricity to run, a single z900 -- running the same workload -- costs only $32/day, a power saving ratio of nearly 20-1. The savings are even more dramatic when floor space requirements of a server farm are considered. The average server farm requires some 10,000 square feet of floor space compared with only 400 square feet for a single IBM z900. At an average of 100 Watts per square foot, the savings can be significant.
I think these people are unintentionally illustrating a logical process called "reductio ad absurdum" in which you disprove something by showing that its consequences are untenable - it's not that there's anything wrong with that, but they don't carry things quite far enough to draw serious conclusions.
Had they asked me, I'd have pointed out that the Sun boxes have to be idle 99% of the time for the workload to fit on the mainframe. Allowing for trickle power and spin-up time, that means the Sun gear will be powered down around 97 percent of the time and so cumulatively use less power, and produce less heat, than the mainframe.
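The arithmetic behind that claim can be sketched roughly as follows. The two daily cost figures come from the IBM piece quoted above; the assumption that electricity cost scales linearly with the fraction of time the machines are powered on is mine, and the function name is only illustrative.

```python
# Daily electricity costs as quoted in the IBM piece.
SUN_FARM_COST_PER_DAY = 620.0  # 750 Sun servers at full power, dollars/day
Z900_COST_PER_DAY = 32.0       # one z900, dollars/day


def farm_cost_per_day(full_cost, powered_on_fraction):
    """Daily cost if the farm is powered on only part of the time.

    Assumption (not from the source): cost scales linearly with the
    fraction of time the machines are powered on.
    """
    return full_cost * powered_on_fraction


# Powered down about 97 percent of the time means powered on about 3 percent.
sun_cost = farm_cost_per_day(SUN_FARM_COST_PER_DAY, 1.0 - 0.97)

# Fraction of powered-on time at which the farm costs the same as the z900.
break_even_fraction = Z900_COST_PER_DAY / SUN_FARM_COST_PER_DAY

print(f"Sun farm at 3% duty: ${sun_cost:.2f}/day")
print(f"z900:                ${Z900_COST_PER_DAY:.2f}/day")
print(f"Break-even duty:     {break_even_fraction:.1%}")
```

Under that linear assumption the Sun farm comes to about $18.60 a day against the z900's $32, and stays cheaper as long as it is powered on less than about 5 percent of the time.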
If there were some organizational reason for doing things like having DNS services run on four separate machines those reasons would presumably rule out having them all run on one machine, whether that's a larger Sun box or an IBM mainframe.
Absent such a reason a sensible manager working two years ago would have put in something like a Dell 6400 for the Windows file and print support and a Sun 450 for everything else. Of course, being "sensible" he, or she, would have two of each for redundancy and thus end up with a total of four servers instead of the 26 shown - and a total cost of about $260,000 exclusive of the disk space requirements left unspecified in the IBM document.
Today, of course, that same manager would choose between two Dell 8450s running Linux or two Sun V880s running Solaris to achieve the same services on two machines for a total cost of about $210,000 - or somewhere around two million less than the IBM mainframe proposed in the document.
The RedBook cited above also contains quite a lot of commentary on the inappropriateness of benchmarking for performance, most of which looked like special pleading to me, and a number of statements like:
An important element of the MCM and PU design is the massive bandwidth available to each processor. The MCM has a total of 24 GB/sec of bandwidth, resulting in an available bandwidth to each processor of 1.5 GB/sec. This is an order of magnitude, or more, greater than traditional enterprise-class UNIX servers. This is significant because the traditional UNIX approach is to try and minimize I/O operations; using the same approach on the z900 architecture will not make maximum use of its capabilities [Chapter 1. Introduction 9]
However, in general these servers have relatively limited memory bandwidth, so that the more frequently cache misses occur and data must be retrieved from main memory, the less the deep, private cache helps. In particular, when the system is heavily loaded and tasks must compete for processor time, each task's working set must be loaded into the private cache each time that task moves to a different processor. It is for this reason that most SMP UNIX servers are typically sized to run at utilization levels of approximately 40 to 50%. [Chapter 1. Introduction 11]
which contradict my understanding of "traditional enterprise-class UNIX servers" from Sun, HP, and DEC/Compaq.
With respect to the memory bandwidth and cache coherency management claims, I believe that the table below is more nearly correct:
| ||IBM zSeries 900/Shark Disk Array||Dell 8450/210S Disk Array||Sun 3800/A5200 Array|
|Maximum SMP CPUs||16||8||12|
|System wide Cache Coherency maximum||2 x 16MB||1MB||96MB|
|Per CPU external Cache||16MB shared 8 ways||2MB||8MB|
|CPU to Cache bandwidth||1.5GB/Sec||3.2GB/Sec||9.6GB/Sec|
|Cache to RAM bandwidth
|Maximum single controller disk I/O rate||32.0MB/Sec||160MB/Sec||160MB/Sec|
|Maximum Disk I/O channels
maximum combined I/O rate
|Maximum CPU cycles/sec||(16 x 770)||(8 x 900)||(12 x 900)|
|Includes OS, 3 Years||$5,200,000; 64GB, 1.6TB Disk; [Estimated]||$115,000; 32GB, 0.5TB Disk||$306,000; 64GB, 1.6TB disk|
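One crude way to read the CPU and price rows together is to work out aggregate clock rate and dollars per MHz. The sketch below uses only the CPU counts, clock speeds, and prices given in the table; treating MHz as comparable across three different architectures is a simplifying assumption, and the three configurations differ in memory and disk, so the result is indicative at best.

```python
# CPU counts, clock speeds (MHz), and prices taken from the table above.
systems = {
    "IBM zSeries 900": {"cpus": 16, "mhz": 770, "price": 5_200_000},
    "Dell 8450": {"cpus": 8, "mhz": 900, "price": 115_000},
    "Sun 3800": {"cpus": 12, "mhz": 900, "price": 306_000},
}


def total_mhz(system):
    """Aggregate clock rate: number of CPUs times per-CPU MHz."""
    return system["cpus"] * system["mhz"]


def dollars_per_mhz(system):
    """Price divided by aggregate clock rate (a crude measure only)."""
    return system["price"] / total_mhz(system)


for name, system in systems.items():
    print(f"{name:16s} {total_mhz(system):6d} MHz "
          f"${dollars_per_mhz(system):8.2f} per MHz")
```

On these figures the mainframe works out to roughly $422 per MHz, against about $16 for the Dell and about $28 for the Sun.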
To me, the comments on cache utilization seem to reflect a very fundamental design difference between batch oriented processing and interactive work. In the traditional IBM world a process is created first, resources are assigned to it, and then it enters the run queue. Once it is running, the executable switches between data and instruction sources as new transactions and logic arrive, but the main process control loop stays largely "CPU resident" and external resource allocations do not change during the run.
|Batch processing on Unix?|
|You can emulate batch processing on Unix but you can't wholly remove Unix
from the system while your batch runs.
There are Unix job schedulers that resemble their mainframe cousins, and products like Unikix or transaction processing environments [e.g. Sun MTP/DBM] that simplify both the porting and management of the applications that go with this.
In the Unix world, however, processes are not created; they spring magically into existence and start to run when their contexts are loaded. As a result most Unix CPUs have hardware context management allowing them to completely switch processes within one instruction cycle - or even to run more than one instruction stream concurrently. In effect that creates a large process cache independent of on board data, address, or instruction caches.
The third claim made, that "traditional enterprise-class UNIX servers" are usually sized to run at 40-50 percent utilization to compensate for memory bandwidth limits, strikes me as another example of a cultural difference in perception resulting in a claim that looks perfectly sensible to a mainframer but utterly nonsensical to a Unix user.
In the IBM mainframe world workloads are very tightly scheduled into a 24 x 7 processing envelope. This works because considerable resources are devoted to predicting and managing run-times, thereby allowing systems managers to precisely balance the hardware they license against the workload they expect. Combined with the extremely high cost of a fundamentally scarce resource, this ability to predict and easily measure system utilization has led to capacity planning and utilization management becoming widely recognized professional specialties within the mainframe community.
Neither these specialties nor the need for them exist in Unix. The Unix world is fundamentally interactive - meaning that you cannot predict when someone will start a job, what that job will be, or precisely what resources it will take. About the only thing you can predict with certainty is that individual users will think their jobs take too long to run and that groups of users will launch vast conspiracies against you by starting their longest and most incompetently programmed ad hoc queries at the same time - i.e. just before leaving for lunch, coffee breaks, or staff meetings. As a result the experienced Unix manager always wants all the instantaneous processing resources he can possibly afford - and typically couldn't care less about such touchstones of mainframe management as average system utilization levels.
On the numbers, the mainframe should not be remotely competitive with Unix. The hardware specifications don't match up to those from Sun's midrange (and, by extension, to those describing the Alpha and PA-RISC); the upfront cost appears to be much higher; and IBM has stopped benchmarking the S/390 and its successors against Unix on things like SAP or TPC transaction processing.
Nevertheless, when you talk to mainframers they're usually absolutely confident that their "big iron" outperforms everything else - and able to point to roughly 14,000 mainframe data centers in which these machines continue to do some serious "heavy lifting."
I believe that this situation exists for three main reasons:
Careful load planning combines with precise capacity management to produce very high system utilization rates because grouping processing requirements into batches averages resource demand over time.
In the Unix world the primary load consists of handling user interaction and thus occurs mainly while users are at work - typically during less than 25% of the 24 x 7 week. Both systems have to devote off peak resources to things like backup, but a well run mainframe center can use batch control and capacity management to achieve 96% or higher average utilization for 168 hours a week while a well run Unix system will usually average less than 50% utilization during peak hours and 10% during off hours.
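To see how far apart those utilization figures are over a whole week, the Unix average can be worked out as a time-weighted mean. This is only a sketch: it treats the quoted figures (peak hours as 25% of the week, 50% peak and 10% off-peak utilization) as upper bounds, and the function name is illustrative.

```python
def weekly_average_utilization(peak_share, peak_util, off_peak_util):
    """Time-weighted average utilization over the full 24 x 7 week."""
    return peak_share * peak_util + (1.0 - peak_share) * off_peak_util


# Upper-bound figures quoted in the text for a well run Unix system.
unix_avg = weekly_average_utilization(
    peak_share=0.25, peak_util=0.50, off_peak_util=0.10
)

# Figure quoted in the text for a well run mainframe center.
mainframe_avg = 0.96

print(f"Unix weekly average:      {unix_avg:.0%}")
print(f"Mainframe weekly average: {mainframe_avg:.0%}")
print(f"Ratio:                    {mainframe_avg / unix_avg:.1f}x")
```

On those numbers the Unix system averages at most about 20% over the week, so each mainframe CPU is kept busy nearly five times as much of the time.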
|Linux kills the assembler advantage|
One reason the mainframe gets far more work done per CPU than you might expect based on experience with Unix is that much of the code used on the mainframe is very highly optimized.
This efficiency is a consequence of the tens of billions of dollars spent on mainframe coding, performance optimization, and tool development --but it all goes away when you try to run Linux and Linux applications on that mainframe because these are not written in assembler, are not highly optimized to fit the technical environment, and don't have forty years of hardware specific performance tuning behind them.
The key to performance is usually found in data and algorithm design, not the choice of language. Optimizing compilers like GNU C or IBM's PL/x encode hundreds of man-years of experience in converting basic language structures into highly efficient machine code and so generally do this better than ordinary assembler programmers inventing the code for themselves.
Most data centers, however, work with assembler primarily where analysis shows that compiled code bottlenecks. In those situations order of magnitude improvements are common because even optimizing compilers have to generalize, whereas hand tweaking by people intimately familiar with the specific system can fit code to exactly the hardware and system software installed.
There are few analogues to this in the Unix/Windows worlds because we don't typically code for a specific system installation. The closest example I can think of, Sun's medialib attempt to get more people using the VIS/SIMD instructions on SPARC, is still generic to an architecture --not specific to an installed system. For people willing to make the effort, use of VIS/SIMD can produce average speed-ups in the 6-10 times range for "new media" type processing and four times for some arithmetic processing.
To put this into a mainframe context, imagine a production Solaris system that bottlenecks on something like the checksum computation needed for packet assembly, so its managers recode just that function using the SIMD short array capability to get about a 5:1 speedup. Because bottlenecks cause other problems, like excessive paging, their removal typically provides a disproportionate overall gain in throughput. Finding and fixing bottlenecks like this is how mainframe code optimization often works.
A Unix system like a PC running Linux or BSD is a general purpose machine capable of handling a wide variety of jobs ranging from highly interactive to batch. Mainframes, on the other hand, are specialized machines purpose built for exactly one kind of job: processing large numbers of relatively simple transactions. Even interactive environments like TSO load as batch jobs that loop to read and process content sent from block mode terminals and therefore only emulate interactive processing of the kind native to the Unix kernel.
From a user management or organizational perspective these amount to coping mechanisms that both individually and collectively raise systems cost while compromising systems performance. Batch processing is resource efficient from a systems perspective - in fact, it started as a way to make maximum use of very small processing capacities. From a corporate perspective, however, this kind of resource efficiency is important only so long as the cost of the resource is high relative to the cost consequences of the workarounds needed to minimize use of that resource.
Given the enormous cost of mainframe computing, organizational costs incurred including:
are easily justified. Step outside that environment, however, and the low cost of Unix processing power reverses the balance making it more important to get organizational benefits like:
than to save a few dollars in capital cost by trying to spread processing loads over the full 24 x 7 week.
Fundamentally that's what's wrong with running Linux on the mainframe: all of the machine's design advantages are shunted aside by the interactive nature of the workload, the inefficiency of the software in hardware terms, and the unpredictability of usage demand --leaving only its high costs in place.
|A salute to the Show Me! state|
None of the benchmark results are definitive
with respect to the cost performance tradeoff and this product's positioning relative to other
Unix offerings including Linux and BSD on the PC, Solaris on SPARC, Tru64 on Alpha,
HP-UX on PA-RISC, and even AIX on the Power4. On the numbers we have, Linux
on the mainframe looks like a loser but we don't actually know because we don't have
access to real test data.
As you'll see in next week's second article on this topic, I think that Linux on the mainframe has an important role to play and ought to be considered for use in many data centers - i.e. that there is an IBM value proposition to be considered. At the same time, however, I don't think mainframe Linux is remotely competitive with Linux on x86 or Solaris on SPARC on either cost or performance.
The way to find out whether this is right or wrong is to run actual tests. Third party, audited, tests with verifiable results and full information on the real costs of using the products. In the third article in this series I'm going to propose a framework for this - and hope IBM responds in a positive way.
As noted earlier IBM claims five main benefits for mainframe Linux:
All of these are, I believe, highly questionable.
The mainframe's reliability reflects its use in a system that includes an extremely well defined set of management methods. The hardware is reliable, but no more so than that from other manufacturers making comparable quality gear; it is the management methods which go with the hardware that make the combination extremely reliable. Take away those management methods, and the claimed reliability benefit is unlikely to materialize.
The isolation benefit is theoretical rather than practical for any significant number of Linux instances because:
Access to open source software
Quite aside from the obvious fact that Free Linux for the mainframe costs in excess of $11,000 per CPU plus several thousand per month for support, there's the problem of bridging the gap between the fundamentally interactive nature of Unix and the batch oriented mainframe architecture.
The most effective route to resource sharing is to move the applicable Linux functions out of the Linux guest OS and handle them in VM instead. Consider, for example, the effect of the 64GB memory limit and the use of a 1GB ramdisk as a way of limiting Linux paging chains. If you have Linux create the ramdisk it becomes subject to VM paging and those overheads will prevent you from setting up more than perhaps 30-40 guest instances. If, on the other hand, you use VM to partition memory and create the ramdisk to be used for swapping, you pretty much have to make it a shared resource because otherwise you'll run out of system memory at no more than perhaps 30 to 40 Linux instances.
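A back-of-envelope sketch shows why a dedicated ramdisk per guest caps the instance count so quickly. The 64GB limit and the 1GB ramdisk come from the text; the 768MB of working memory per guest is a purely hypothetical figure chosen for illustration, and VM's own memory needs are ignored.

```python
TOTAL_MB = 64 * 1024        # 64GB system memory limit (from the text)
RAMDISK_MB = 1024           # 1GB swap ramdisk (from the text)
GUEST_WORKING_MB = 768      # hypothetical per-guest working memory


def max_guests_dedicated(total_mb, ramdisk_mb, guest_mb):
    """Guests that fit when every guest gets its own ramdisk."""
    return total_mb // (ramdisk_mb + guest_mb)


def max_guests_shared(total_mb, ramdisk_mb, guest_mb):
    """Guests that fit when a single ramdisk is shared by all guests."""
    return (total_mb - ramdisk_mb) // guest_mb


dedicated = max_guests_dedicated(TOTAL_MB, RAMDISK_MB, GUEST_WORKING_MB)
shared = max_guests_shared(TOTAL_MB, RAMDISK_MB, GUEST_WORKING_MB)

print(f"Dedicated ramdisk per guest: {dedicated} guests")
print(f"One shared ramdisk:          {shared} guests")
```

With the assumed working set, dedicated ramdisks allow 36 guests, which falls inside the 30 to 40 range, while a shared ramdisk allows 84. A different working-set assumption moves both numbers, but not the shape of the tradeoff.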
In the longer term these kinds of conflicts between the basic interactive design at the heart of Linux and the design of the mainframe will need to be resolved. Some of those differences are hardware specific, including the big endian vs little endian issues, assembler issues, the absence of sigcontext capabilities etc, but the more important ones are fundamental to the operational concepts behind system deployment. Linux embeds very basic assumptions about usage patterns and access to hardware that have to be worked around on the mainframe. For example, IBM recommends the use of telnet instead of a native GUI like KDE or GNOME while issues like memory management and timer control require significant kernel patches to work efficiently on the mainframe - but become unLinux-like in the process.
As a result mainframe Linux has different operational values than does desktop x86 based Linux and those differences are ultimately reflected first in usage patterns and then in code adapted to those usage patterns.
IBM has done a lot of work on this already and provided a starting point for people considering porting Linux applications to mainframe Linux. These change requirements are extensive, so much so, in fact, that ordinary users downloading code cannot reasonably expect to be able to make those changes on the fly or through the use of simple tools like GNU's configure utilities. Over time, therefore, I expect that we'll see applications that run on Linux, but not on mainframe Linux; and applications that run on mainframe Linux but not on x86 Linux --thus voiding this claimed benefit.
There's little reason to question this. Running one IBM mainframe uses less power than running 750 Sun or PC servers. No ifs, buts, or maybes; this would be a real benefit, however trivial next to the cost of the Linux and VM licenses, if the mainframe could handle the same load - something I don't believe.
The networking benefit demonstrates the kind of logical fallacy known as "begging the question," in which you prove eggs by assuming chickens and then prove the chickens by pointing at the eggs.
The argument is that if you replace several hundred Linux PCs with one zSeries, all of the networking gear and resources previously needed to allow the Linux PCs to communicate with each other become virtual connections within the VM environment. Since these new connections are faster and have essentially no maintenance costs, the elimination of the previous networking costs amounts to a zSeries benefit.
The conclusion is obviously correct, except that it assumes both an unlikely problem - a need to replace hundreds of Linux machines with Linux ghosts - and its solution - replace hundreds of Linux machines with Linux ghosts.
Consider that there are two usage scenarios where the requirement might exist:
On a "raw iron" basis the machine beats high end PC servers but doesn't stack up against mid range Sun gear (and, by extension, against competing PA-RISC and Alpha products).
On a workload applicability basis the gear fits Linux about as well as snowshoes go on a downhill ski racer.
For an easily severable workload, like Domino, the same benchmark results that show a 10-way mainframe getting a NotesMark score of 42,508 suggest that a cluster of five Dell 2450 PC servers, for a total of around $61,000, would blow it away on both absolute and relative performance.
We don't have a comparative performance benchmark for a mixed workload of relatively small but unpredictable tasks of the kind Linux on x86 excels at. What we do have, however, is comparative cost information on the basic systems. At list price, you could rack up eighty (80) Dell 8450 servers each with:
At the moment Linux doesn't scale well past four processors and about 4GB of RAM but other Unix variants do. If your workload needs massive SMP capabilities with flat memory spaces above 16GB the place to start is with the Sun 3800 at about $350,000 fully configured, but the only direct comparison for which we have performance indicators is with the larger 6800:
| ||IBM zSeries (2064-116)||Sun 6800||Sun 6800 as Percent of IBM zSeries|
|Maximum CPUs (Total MHz)||16 (12,320)||24 (21,600)||175%|
|Maximum System Throughput||24 GB/Sec||67.2 GB/Sec||280%|
|Maximum System Memory||64GB||192GB||300%|
|Estimated cost with 1.5TB of disk, 16 CPUs, 64GB RAM||$5,251,000||$960,000||18.3%|
|Note that the 18.3% price comparison is a best case for the mainframe and assumes a workload justifying the flat memory space and high reliability of the 6800. If the workload is easily severable, like Domino or a Windows file and print service, you could expect to achieve about the same throughput with a cluster of eight four-way PC servers like the Dell 8450 at less than 10% of the mainframe's cost.|
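The percentage column in the table can be checked directly from the two value columns. The sketch below simply recomputes each ratio from the figures shown; the row labels and the helper name are illustrative.

```python
# (IBM zSeries value, Sun 6800 value) for each row of the table above.
rows = {
    "Total MHz": (12_320, 21_600),
    "System throughput (GB/sec)": (24.0, 67.2),
    "System memory (GB)": (64, 192),
    "Estimated cost ($)": (5_251_000, 960_000),
}


def sun_as_percent_of_ibm(ibm_value, sun_value):
    """Sun 6800 figure expressed as a percentage of the zSeries figure."""
    return 100.0 * sun_value / ibm_value


for label, (ibm_value, sun_value) in rows.items():
    pct = sun_as_percent_of_ibm(ibm_value, sun_value)
    print(f"{label:28s} {pct:6.1f}%")
```

The recomputed ratios come out at roughly 175%, 280%, 300%, and 18.3%, matching the table.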
On the Sanchez financial transactions benchmark, the kind of thing that constitutes home field advantage for the mainframe, the 6800 beat the mainframe by about 35% - at about one fifth the cost.
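Taken together, those two figures imply a price/performance gap that is easy to work out. This is a rough sketch using only the rounded numbers in the sentence above (35% faster, one fifth the cost); the function name is illustrative.

```python
def price_performance_advantage(relative_performance, relative_cost):
    """How many times more work per dollar one system delivers than another.

    Both arguments are expressed relative to the baseline system (1.0).
    """
    return relative_performance / relative_cost


# Sun 6800 relative to the mainframe: about 35% faster at about 1/5 the cost.
advantage = price_performance_advantage(
    relative_performance=1.35, relative_cost=0.20
)

print(f"Work per dollar, 6800 vs. mainframe: about {advantage:.2f}x")
```

On those rounded figures the 6800 delivers somewhere between six and seven times the work per dollar.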
To put it nicely, Linux on the mainframe looks like a loser, so why is IBM, a serious company with big dollars at stake, telling us that Linux on the zSeries is a good idea? That's the topic of next week's article.