% fortune -ae paul murphy

About that London Stock Exchange IT failure

It's the third one in a year and worse even than last year's November 8th failure.

That said, here's a "reprint" of my blog for November 21, 2006 - followed by a few new comments.

---

Another Microsoft anti-Linux case study

As most people know, Microsoft has an anti-Linux program called "Get the Facts" featuring case studies arguing the Windows case. One of those, wearing the title "London Stock Exchange chooses Windows over Linux for reliability", arrived in my email last week.

Here's the summary quotation attributed to the customer, LSE CIO David Lester:

"No other exchange is undertaking such an ambitious technology refresh programme based on next-generation Microsoft technology. We've always provided a first-class service, but now we can claim to be the fastest in the world as well."

Take a careful look at the actual wording: "No other exchange is undertaking.." and, "now we can claim to be the fastest in the world." (Emphasis added.)

The Tandem system this replaced was installed in 1995 and had earned its NonStop tradename with zero downtime over the last six operating years, but it now belongs to HP and is therefore going away. In response, LSE CIO David Lester developed a plan - one structured around a partnership with Microsoft:

Before choosing Microsoft technology, the London Stock Exchange reviewed several potential architectures to meet the requirements of Infolect® and the TRM design objectives. The Microsoft .NET Framework -an integral component of the Windows Server® 2003 operating system- was selected for a number of reasons, including developer efficiency, performance, and scalability. The Infolect® application, which went into production in September 2005, was implemented on a total of 120 HP ProLiant servers across multiple data centres. This configuration allows Infolect to process an average of 15 million real-time messages a day distributed to more than 107,000 trading screens in more than 100 countries.

120 HP ProLiant servers sounds like a lot - but then so does 15 million if you're thinking in terms of personal dollars or weeds to pull in your garden. Unfortunately neither number squares with the reality that 15 million messages per day amounts to somewhere between about 520 messages per second, if generation occurs only during an eight-hour trading period, and about 175, if you average across 24 hours to allow for electronic trading. Either way, that's easily within scope for a small Unix server like a four-way Opteron or T2000 - remember, this stuff ran on an old Tandem before those 120 ProLiants were brought in.
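Average rates of this kind are easy to check. The sketch below is only a back-of-the-envelope calculation: the 15 million daily messages and the 120 servers come from the case study, the eight-hour and 24-hour windows are this post's assumptions, and the even spread of load across servers is a simplification for illustration. These are averages, so real peaks would be higher.

```python
SECONDS_PER_HOUR = 3600


def messages_per_second(messages_per_day: int, active_hours: float) -> float:
    """Average rate if the daily total is spread evenly over active_hours."""
    return messages_per_day / (active_hours * SECONDS_PER_HOUR)


DAILY_MESSAGES = 15_000_000  # figure quoted in the case study
SERVER_COUNT = 120           # HP ProLiant servers quoted in the case study

# Assumed windows: an eight-hour trading session versus round-the-clock activity.
trading_day_rate = messages_per_second(DAILY_MESSAGES, 8)
round_the_clock_rate = messages_per_second(DAILY_MESSAGES, 24)

# Simplifying assumption: load is shared evenly by every server.
per_server_rate = trading_day_rate / SERVER_COUNT

print(f"8-hour session:  {trading_day_rate:.0f} msg/s")
print(f"24-hour average: {round_the_clock_rate:.0f} msg/s")
print(f"per server (8-hour session): {per_server_rate:.1f} msg/s")
```

On those assumptions the average works out to a few messages per second per server.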

But at least they can claim it's fast, right? Here's their headline:

London Stock Exchange Cuts Information Dissemination Time from 30 to 2 Milliseconds

Two milliseconds isn't much time - in fact it's barely communications latency for a PC NIC - and 30 ms is pretty fast for the old gear, considering that the system was first developed and implemented before the Pentium hit 100 MHz.

If you look carefully at the wording, especially as repeated in the excerpt below, you'll see how this is achieved. They say only that the information is "distributed to more than 107,000 trading screens in more than 100 countries", not that their system actually does the distributing:

Reliability is fundamental to the London Stock Exchange value proposition for the market and continues to give its senior managers peace of mind about system uptime. There are approximately 300 customers who connect directly to the live Infolect system to receive real-time market data directly from the London Stock Exchange. The data disseminated from Infolect is then displayed on more than 107,000 terminals in more than 100 countries.

In other words, we're entitled to assume that the 2ms number represents something like a packet delivery time for bulk flows over a local area network - and not only do those "107,000 screens in more than 100 countries" have nothing at all to do with the 2ms claim, but, because they're attached to networks run by the 300 or so big customers with servers on that LAN, it's very doubtful that their users would have experienced any change at all.

All of which should have you wondering what Linux has to do with any of this - Microsoft's headline, you'll recall, said that the LSE picked Windows over Linux for reliability.

The answer is that Linux has nothing to do with any of this: Microsoft simply hung an anti-Linux label on a very carefully worded story about a pair of committed Microsoft partners, HP and Accenture, getting together with Microsoft to sell rather simple technology to a willing customer - and neither Linux nor Solaris is mentioned anywhere in the text.

---

So now the chickens are coming home to roost and the question is, why? Are Microsoft's dot.net technologies so inherently unreliable that it's simply absurd to expect them to work when volume changes dramatically and performance pressure mounts, or is there something deeper going on?

My vote goes for a combination of both: second-rate technology combining with a problem obvious in both the decision process and Microsoft's decision to brag about this install on its anti-Linux site. The underlying problem as I see it is one of incentives: what incentive did any of the power players involved have to get either the decision or the implementation right?

Before the sale, incentives for Accenture, HP, and Microsoft were aligned with selling a Windows project - not with actually achieving both the high reliability and the high performance expected. And, after the sale, the incentives align more with keeping costs down while getting sign-offs than with meeting any promises made about reliability or performance.

What I'm reminded of in this context is the sad story of the frog who believed a scorpion's promise of unscorpion-like behavior and died for his naivete when the scorpion did what scorpions do. What I think, in other words, is that primary responsibility for the LSE mess belongs to the top LSE managers who let their CIO get the LSE into bed with Microsoft and its partners.

Basically it's top management's job to set the right performance incentives in place, to understand how existing incentives are likely to work out, and to take immediate corrective action when people who report to them start to respond to career incentives that don't align with the organization's welfare - and thus the single most important driver for these recent failures wasn't poor technology but the simple fact that LSE top management didn't do its job.


Paul Murphy wrote and published The Unix Guide to Defenestration. Murphy is a 25-year veteran of the I.T. consulting industry, specializing in Unix and Unix-related management issues.