Draft Blog Entries

% fortune -ae paul murphy

The purloined benchmark

Last Wednesday I challenged Windows proponents to demonstrate a metric not based on market share which when fairly applied showed an advantage for Windows over Unix.

Nobody took me up on it, but frequent contributor "ShadeTree" suggested that Windows would win with metrics based on business application development and on cluster performance.

I challenged him to prove his claims about development - to, so far, resounding silence. On cluster performance, however, he did provide a quotation from, and reference to, a Microsoft/Oracle joint presentation on Windows best practices. That presentation in turn quoted a set of performance results from work done by Larry Pedigo at Performance Tuning Corporation, claiming that a two-way Oracle Real Application Clusters [RAC] setup on Windows 2003/XP Server outperforms, for simple TPC-C like transactions, that same RAC on Linux.

Various people, including Erik Engbrecht and Odubtaig, corrected him about what "scaling" means and questioned the applicability of the data he relies on - and I thought they were right. It's possible, however, to take the claim on its own and see how it stands up - to see, in other words, just how legitimate this benchmark was.

In reading through the Pedigo document I had two distinct reactions. First I was deeply impressed by how effectively the author seemed to be applying Poe's purloined letter strategy: everything the testers do to ensure a Windows "win" is proudly and openly presented as an effort to be fair to both technologies.

At a deeper level, however, the thing that's most striking about this document is that you can't tell (at least, I couldn't) apparent incompetence from real dishonesty. I can't tell, for example, whether their decision to use only one storage connector per server was a clever way to leverage a Windows performance weakness against Linux or simply what happens when you put Windows experts in charge of running a benchmark on Linux.

Look closely, for example, at Pedigo's description of the server hardware used:

For the purposes of these tests, it was important that the servers for both the Linux and Windows clusters be configured as identically as possible. The following configuration options were chosen for the RAC clusters:

Four HP® ProLiant DL380 G4
Two Windows servers

RacBench1
RacBench2

Two Linux servers

racbench3
racbench4

Two Intel EM64T Xeon processors per server (4 logical processors with Hyperthreading), 3.4 GHz
Two 36 GB SCSI disks per server, configured as RAID 1
8 GB RAM per server
8 GB swap space/paging file per server
Two Gigabit NICs per server
One Qlogic 2340-E HBA per server

This may look fair, but it isn't - ask competing teams of experts to set up a two-server Oracle Real Application Clusters benchmark for Windows and Linux, and they will definitely not come up with identical hardware configurations.

So what I see here, instead of a fair configuration, is one that's been carefully structured to give Windows every advantage while inflicting the slowdowns of a thousand cuts on Linux.

Some areas of concern include:

Windows incurs fairly high overheads on activating, using, and deactivating storage objects - so a Windows expert setting up a small benchmark like this would naturally try to use only one connector per server.
His Linux colleague, on the other hand, doesn't face a significant cost for parallelism and so would want at least two, and preferably three, connectors to ensure that reads, writes, and logging could all happen in parallel.
The Windows guy would know that the TCP/IP stack is quite literally a stack of objects that call each other in sequence and then wait - and therefore try to minimize both memory and cycle use by dedicating one NIC to each major process group - something that's normally approximated in the Windows world by having one card per CPU in the box.
His Linux colleague wouldn't do that because having two where only one is needed creates unnecessary bus contention - indirectly using parallelism to force waits throughout the rest of the system.
Windows Server 2003/XP was designed for 32 bits and then adapted first for Intel's 2 and 4 bit extensions and then for AMD's x64 instruction set - meaning that there are both pagefile management penalties for any unused memory and memory access penalties if any part of Oracle's SGA steps over the 8GB base boundary.
These issues don't exist for Linux, but under similar workloads default Linux does reserve more memory than Windows for things like buffer management and new process creation. As a result, a smart Windows benchmarker competing against Linux is going to do just what these guys did: cap memory at 8GB and then adjust the workload expectation to just fill it on Windows - thereby both avoiding some overheads on the Windows side and forcing Linux to spend a lot more time paging.
In somewhat the same way the decision to configure the two internal drives as RAID1 (mirrors) makes perfect sense for Windows Server 2003/XP because the underlying paging file model reduces parallel paging opportunities - but that's not true for Linux where you get the best performance by mirroring only the system part of the disk and allowing parallel I/O to multiple swap partitions.

I haven't done the homework needed to be sure, but I think essentially all other system components - from EMC setup to network switch configurations - are similarly optimized for Windows, and thus stacked against Linux.

Pedigo tells us, for example, about the care taken in testing Oracle configurations for each OS - and then concludes that the fairest thing to do is use the Windows settings on Linux:

After the initial tests, the [Oracle] tuning was performed manually. Before and after each test, Automatic Workload Repository snapshots were created as time markers. After each test, an Automatic Workload Repository report was generated to monitor the performance efficiency of each Oracle instance. In particular, the Top 5 Timed Wait Events, the Buffer Pool Advisory, the PGA Advisory, the Shared Pool Advisory, and the SGA Advisory sections were carefully monitored. The SPFILE parameters were adjusted after each run to iteratively improve performance. After several iterations, it became obvious that the optimal parameters for both Windows and Linux were very close (this proved true for both the Stress Tests and the User Load tests). This is not surprising, since the same test is being run against databases that are physically configured exactly the same. To simplify the Stress Test procedures, a common set of optimal parameters was determined and applied to both the Linux and Windows instances. The same approach was used for the User Load tests.

In my opinion, however, this result isn't just surprising, it's astonishing - because the two releases have major common components but are quite OS specific in critical areas from file I/O to garbage collection.

Pedigo has a disclosure summary on the effect the OS differences have on Oracle:

It should also be noted that performance for a 64-bit Oracle Database on Linux is not necessarily exactly the same as a 64-bit Oracle Database on MS Windows. Windows uses a multi-threaded processing model. This means that only one process is implemented for Oracle, although the Oracle process contains multiple threads. In theory, a fully multi-threaded application should be highly efficient. In contrast, Linux uses a multi-process processing model. There are multiple background processes visible on the database server (PMON, SMON, etc.). Multi-threading is only utilized for certain components, such as the Multi-Threaded Server. The two processing models may behave differently in some scenarios.

This confuses Windows threads with Unix threads but otherwise suggests the obvious: these two implementations are sufficiently dissimilar in terms of key tunables like page size, process counts, and logging that an SPFILE appropriate for small memory use on Windows could reasonably be expected to kneecap the product on Linux.

The nature of the actual tests run is a bit unclear. Throughout almost the entirety of the report Pedigo credits Dominic Giles, creator of the Swingbench toolset, with producing the benchmark tools he used. The exception occurs in this paragraph:

Two test suites were run against both the Linux Cluster and the Windows Cluster: a "Stress Test" and a "User Load Test" (both designed by the author). Both tests utilize the Order Entry schema.

I didn't see anything to explain how what Giles describes as a "classic order entry benchmark. TPC-C like" becomes a mere Order Entry schema, but I haven't used Swingbench and this may be a simple case of the same words meaning different things to different people - or it may mean that the transactions themselves were revised "to be fair" to Windows.

Some of the settings information provided suggests this - for example, here's how Pedigo describes part of the test set-up as executed:

The Stress Test was designed to use relatively small numbers of user sessions, each session "pushing" as many transactions as possible per unit of time. To accomplish this, the Maximum and Minimum Think Times were reduced to zero. Think time is meant to simulate users waiting between each transaction in order to "think" about the next step. The actual time between any two transactions is randomly chosen between the Minimum and Maximum Think Times. With Think Times set to zero, transactions are submitted back-to-back. This is guaranteed to fully stress the CPU and I/O capabilities of the system.
The User Load test was designed to simulate how a system will respond to supporting a relatively large number of user connections. For this test, a combination of a Minimum Think Time of 1 second and a Maximum Think Time of 5 seconds was used. This is a reasonable approximation of real-world work load. In addition, Shared Server connections were used to maximize the number of connections while conserving memory.
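
To make the think-time mechanics in that description concrete, here's a minimal Python sketch of a closed-loop load model. It is my own illustration, not Swingbench code and not anything from the report: the uniform distribution is an assumption (the report says only "randomly chosen"), and the session counts and the 0.05 second service time in the usage note are hypothetical numbers picked purely for the arithmetic.

```python
import random


def think_time(min_s, max_s, rng):
    """Pick one pause between transactions, in seconds.

    Assumption: the pause is uniformly distributed between the minimum
    and maximum think times; the report does not name the distribution.
    """
    if min_s < 0 or max_s < min_s:
        raise ValueError("need 0 <= min_s <= max_s")
    return rng.uniform(min_s, max_s)


def offered_rate(sessions, min_s, max_s, service_s):
    """Approximate transactions per second offered by closed-loop sessions.

    Each session repeats: run one transaction (service_s seconds, a
    hypothetical figure), then pause for one think time. Under the uniform
    assumption the mean pause is the midpoint of the two limits.
    """
    mean_think = (min_s + max_s) / 2
    return sessions / (service_s + mean_think)
```

Under those assumptions, 10 sessions with zero think time and a 0.05 second transaction offer roughly 200 transactions per second, while 100 sessions with 1 to 5 second think times offer only about 33. The zero-think-time case produces far more load from far fewer connections, which is the trade the Stress Test makes.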

"Think times" are normally used to simulate the transactions and I/O interleaving associated with having to deal with a large number of "clients" - remote devices the servers have to communicate with. In this case, however, only two client computers were used - meaning that setting query delay to zero does produce a large number of queries, but also plays to Windows weaknesses that are absent in Linux by artificially reducing both per source memory use and per query switching overheads for both networking and storage access.

I could go on - my first draft for this blog ran another 200 lines - but my bottom line is simple: this could be a completely honest attempt, by Windows experts who didn't know Linux very well and didn't really understand that it's not Windows, to fairly benchmark a few very simple database functions on both Windows and Linux - or it could be a brilliant example of Poe's purloined letter strategy: hiding one lie after another in plain sight - and after reading the report several times I still don't know which it is. But I do know this: change some of the hardware and setup parameters in this test to reflect Linux and the Windows server wouldn't have a look-in.

Paul Murphy wrote and published The Unix Guide to Defenestration. Murphy is a 25-year veteran of the I.T. consulting industry, specializing in Unix and Unix-related management issues.