Draft Blog Entries

% fortune -ae paul murphy

Exploring web server volume failure modalities

One of the big risks involved in building a site whose readership could vary from nothing to several hundred thousand pages per hour is that you either grossly overbuild, or build something that becomes a significant constraint on success.

To get a rough handle on this I recently ran some tests - and got weird (i.e. retroactively predictable) results I'm hoping you guys can help me understand and extend.

The technology is Apache, PHP, MySQL, and WordPress on Solaris/SPARC.

As a first test I put the draft site on a 650MHz Sun 150 and had a dual 400MHz Sun 60 fire off page requests via wget across a local 100Mbps link. These pages are big (60K) because there are lots of images, so the Sun 60 had to write its wget records to /tmp, but other than that neither machine registered any serious strain before totally maxing out my 100Mbps local net.
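To put a rough number on why the wire gives out first, here is a back-of-the-envelope sketch of how many 60K pages a link can carry per hour. It is only an upper bound under stated assumptions: 1K is taken as 1000 bytes, protocol overhead is ignored, and the function name `pages_per_hour` is made up for this illustration.

```python
def pages_per_hour(link_mbps: float, page_kbytes: float) -> float:
    """Upper bound on pages per hour a link can carry, ignoring protocol overhead."""
    bytes_per_second = link_mbps * 1_000_000 / 8  # megabits/s -> bytes/s
    page_bytes = page_kbytes * 1000               # assumption: 1K = 1000 bytes
    return bytes_per_second / page_bytes * 3600   # pages/s -> pages/hour


if __name__ == "__main__":
    # The two local link speeds mentioned in the tests, with 60K pages.
    for mbps in (100, 1000):
        rate = pages_per_hour(mbps, 60)
        print(f"{mbps} Mbps link, 60K pages: about {rate:,.0f} pages/hour")
```

On those assumptions a saturated 100Mbps link tops out at roughly 750,000 pages per hour, which is already in the "several hundred thousand pages per hour" range, so it is not too surprising that the network filled up before either machine broke a sweat.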

A friend ran a similar test for me between a Sun 890 host and some local workstations. This maxed out the internal 1Gbps network well before the 890 showed significant loading. Similarly, using a remote T2000 across a nominal 173MB/S link caused the routers at both ends to fail (losing up to half the arriving packets) well before either machine showed any significant impact.

Part of the issue here is that the web pages produced are reasonably static: each WordPress page request gets assembled from a lot of pieces, but they're almost all cached, so the technology is quite resource efficient. Code that constructs each page from unique data would undoubtedly produce very different results.

So here's what prompts my bottom line question: the gear I used to test with here is getting close to ten years old, but it easily maxed out a direct 100Mbps connection that's much faster than anything I can afford to pay for when, or if, this thing goes live on the internet - and on the bigger machines the tests clobbered the routers long before the servers hit their maximums.

So what's really going on inside when something like the Intel/IIS combination used at the Do Not Call registry I talked about yesterday starts to lose the ability to produce and manage pages? What specifically fails, and in what order?

In other words, has anyone seen a credible analysis in which a couple of web servers were stressed until they failed - and clear logs were kept and analysed to show exactly what failed and under what conditions? It's absurd, but so far I haven't found anything useful - just the usual pathetic whinging about default parameter settings being, duh, "inappropriate" to high volume processing.
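For concreteness, here is a minimal sketch of the client side of that kind of stepped test: fire a batch of requests at a given concurrency, then record how many failed, how they failed, and how long they took. It is not the wget setup used in the tests above. The names `fetch`, `timed_request` and `run_step` are hypothetical, it uses only the Python standard library, and it says nothing about what is failing inside the server, which would need the server's own logs alongside.

```python
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def fetch(url: str, timeout: float = 10.0) -> int:
    """Fetch one page and return the number of bytes read."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return len(response.read())


def timed_request(fetch_page, url):
    """Run one request; return (ok, elapsed_seconds, detail)."""
    start = time.perf_counter()
    try:
        size = fetch_page(url)
        return True, time.perf_counter() - start, f"{size} bytes"
    except Exception as exc:  # record every failure mode instead of stopping
        return False, time.perf_counter() - start, type(exc).__name__


def run_step(url, concurrency, total_requests, fetch_page=fetch):
    """Fire total_requests at url using `concurrency` worker threads."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(
            pool.map(lambda _: timed_request(fetch_page, url), range(total_requests))
        )
    # Count failures by exception type so the log shows what failed, not just that it did.
    failures = {}
    for ok, _, detail in results:
        if not ok:
            failures[detail] = failures.get(detail, 0) + 1
    times = sorted(seconds for _, seconds, _ in results)
    return {
        "concurrency": concurrency,
        "requests": total_requests,
        "failed": sum(failures.values()),
        "failures_by_type": failures,
        "median_seconds": times[len(times) // 2],
        "max_seconds": times[-1],
    }
```

Calling `run_step(url, workers, workers * 20)` for a rising series of `workers` values (say 1, 4, 16, 64) and saving each returned summary gives a crude record of the point at which errors start and which kind shows up first. The `fetch_page` parameter is only there so the function can be exercised without a live server.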

Paul Murphy wrote and published The Unix Guide to Defenestration. Murphy is a 25-year veteran of the I.T. consulting industry, specializing in Unix and Unix-related management issues.