This is the first of a series in which we compare the results produced when real business applications are implemented using Unix tools and ideas to what happens when those same applications are implemented using software licensed from Microsoft.
Each application will be the subject of two articles. The first will present the theoretical, or "book learning," view of the issue and invite readers with real-world experience using the underlying technologies to contact me, in confidence, to correct my errors, estimate the time needed for the work, and discuss what goes wrong when you try to go from theory to practice.
The follow-up article will then try to summarize community experience with the technologies in order to draw out conclusions everyone can use and to answer the basic question for this series: is open source software better than Microsoft software?
The focus here is on the technology, but readers should be aware that the most important factors in real architecture decisions usually have little or nothing to do with the technology or cost. The goal in making tech decisions is to get an end product that works for its intended users, but getting the best product at the lowest cost is only part of this. As discussed at length in my Unix Guide to Defenestration, any Unix technology, no matter how insanely great or cheap, can be made to fail if the managers who get control of it after implementation either wanted something else and consciously or unconsciously set out to prove they were right, or only understand how to manage a proprietary system and insist on applying those ideas to Unix.
All three of the examples planned have a context set by the same project in the same imaginary company: Nichievo Inc. This setting, which is purely imaginary, is designed to illustrate a political opportunity for Unix and open source in an otherwise closed shop. Be aware, however, of the risks involved: if the people who take over from you on system delivery don't want to make the system work, it won't.
The overall job involves setting up a secure digital exchange for what Nichievo calls an acceptance order. Nichievo insures receivables; an acceptance order is the company's commitment to pay an insured receivable in case of default by the debtor. In its unsigned form such an order is a quote; signed, it is a contractual commitment. Our job is to collect quotes as they are issued, make them available for review and signature by senior managers, and make the signed orders available for customer download.
Background on Nichievo and the overall systems project under discussion can be found in this extended sidebar.
As envisaged, our solution will require XML publishing capabilities so this first article will look at the Windows versus Linux option in terms of the core hardware and licensed software needed for XML publishing. Specifically, we'll look at the choice between Apache/Cocoon on the one hand and Microsoft's proprietary tools on the other.
The third article in the series will look at the development issue. If we picked Cocoon and open source in round one, we'd already be largely committed to using JavaBeans and Java for forms management and validation, but there are quite a few other things we need to do. For those, should we use Perl, PHP, or try to extend our use of Java and the Cocoon framework?
On the other hand, if we decided to do this job with Microsoft's tools, we could use a third party IDE for some tasks but would be using BASIC, or some variant of it, for others.
The fifth article will look at the database layer. If we made the Microsoft choice in round one, SQL-Server is a given here. The open source choice, in contrast, has options: should we use leading products like PostgreSQL and MySQL, explore interesting new ideas like eXist, or choose a commercial product like Sybase?
That's the plan, but this series need not be limited to these toolsets or the Nichievo case. If you have additional, or alternative, suggestions for toolsets please let me know.
Reality check |
---|
It's important to think about who is paying your bills, particularly when putting an application design into a service proposal. In this case the client, Nichievo, has a dismal technology record and a CIO who is not on side with the managers bringing us in.
In this situation your clients have to let the CIO review the proposal, and it's not that unusual for him to respond by having some of his people cook up a few screens that prove beyond any doubt that he can do a better job of implementing your ideas than you can. If senior management buys into that, it leaves your sponsors looking like idiots and you without an invoiceable project - or friends you can go back to in that company.
Having been burnt a few times, I now brief the sponsors whenever that kind of thing looks likely and put something significant in the proposal that hits a few hot buttons among client executives but is as hard as possible for the CIO to promise or do. In this case we don't need XML to do this project; we need it to win this project. Technically, ordinary HTML with either Perl or PHP would work just fine, but management would really like to automate much of the customer interaction and, of course, the international outsourcing services firm hired 18 months ago has failed to deliver on its promises to do this.
The CIO isn't the only threat to worry about. That giant international outsourcing services firm doesn't want its billings to fall and could respond by going to the firm's managing director with a story that blames the CIO for their failures: if only he had let them use Domino, everything would long since have been working beautifully.
Technically Domino would work for the basic job but be very hard to push to full customer portal operation. Putting together a working Domino demo wouldn't be that hard; much easier, in fact, than doing it with Cocoon. That reverses, however, as you add functional complexity. By the time you've developed a full messaging portal, my guess is that Domino would demand many times the programming effort that Cocoon needs to get to that same level.
So how serious a threat is this? Well, if you pack enough serious suits with pretty business cards into a room, it's often rather easy to convince senior management to feel heroic and dedicated about backstabbing their friend the CIO - a tough decision, of course, but taken in the interests of the company, you understand.
To head this kind of thing off I've been praising the ebXML standardization effort that OASIS (the Organization for the Advancement of Structured Information Standards) has been working on as a downstream means of standardizing the business messaging they need to enable the customer message exchange they want. I think acting on this direction would be premature, but I picked XML for this application as a building block toward eventually doing that - knowing that it would be very difficult to do with Lotus while blocking the CIO's credibility if he attempts a putsch. |
The job seems to call for a cross between centralized XML document publishing solution and a customer portal. Either way, we need provision for strong authentication, lots of logging, and applications code to handle the on-line addition of digital signatures as well as some back end database functions to simplify acceptance processing for standing orders.
Under this view of the process:
It would be possible to do this using servers already in place in Nichievo's 34 operating offices but the centralized approach is preferable because:
We could implement the centralized alternative in one of two ways:
The steps required using the open source toolset (Apache and Tomcat with mod_perl and Cocoon on Linux, BSD, or Solaris) are extensively discussed on the apache/cocoon site. Here's what the main page says about it:
Apache Cocoon is an XML publishing framework that raises the usage of XML and XSLT technologies for server applications to a new level. Designed for performance and scalability around pipelined SAX processing, Cocoon offers a flexible environment based on a separation of concerns between content, logic, and style. To top this all off, Cocoon's centralized configuration system and sophisticated caching help you to create, deploy, and maintain rock-solid XML server applications.

Cocoon interacts with most data sources, including filesystems, RDBMS, LDAP, native XML databases, and network-based data sources. It adapts content delivery to the capabilities of different devices like HTML, WML, PDF, SVG, and RTF, to name just a few. You can run Cocoon as a Servlet as well as through a powerful, command line interface. The deliberate design of its abstract environment gives you the freedom to extend its functionality to meet your special needs in a highly modular fashion.
Although Microsoft's site doesn't seem to offer a clear statement of direction on this kind of work, it does provide a long and apparently detailed discussion showing how various products can be integrated to achieve something vaguely similar to Cocoon.
From a design perspective we see the central system as a web based "order switch" collecting requests from customers, recommended orders from juniors, order approvals from senior partners, and then passing the approved orders back to customers.
Diagram one, below, shows typical high level use cases for this:
To make this work we need:
Since we have no control over the client device, reliance on a web browser as the user interface is a given and that decision essentially determines that we'll use a web server as our means of communicating with the user client and logging accesses.
Document volume is quite low: we expect a maximum of only about 120,000 acceptance orders per month. On the other hand, we have to keep every document ever filed on the server on line --both in order to support the firm's customer relationship management effort and to provide data for part of its risk assessment methodology-- so the numbers build relatively quickly. In three years, for example, we can expect to need on-line access to something like 3.2 million approved, and 400,000 unapproved, orders.
It would be possible to store and index these as signed and unsigned documents, but it would be better to store the data for them in simple tables and have the application construct the documents on request. That step complicates processing, but enormously facilitates activity logging, reporting, backup, system recovery, and statistical uses of the data.
Use of a database would also, in this case, reduce disk space requirements considerably. A typical order document, stored as a Microsoft Word 10 binary, takes 19,658 bytes exclusive of the standard contract terms referenced in it. Storing the same information in an SQL table takes only about 320 bytes for the addressee (which is normally stored only once per customer) and 190 bytes per covered receivable, for an overall disk space saving of about 95% after indexing and overhead. It is not the dollars that are important here; disk is cheap. What's important is the reduction in backup and recovery time: recovering 60GB from tape takes hours; recovering 60MB takes minutes.
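To make the arithmetic behind that estimate concrete, here is a minimal sketch using only the byte counts quoted above. The number of receivables per order and of orders per customer are assumptions (the article gives neither), and the calculation ignores indexing and overhead, so it gives a raw saving rather than the 95% net figure.

```python
# Byte counts quoted in the article.
WORD_DOC_BYTES = 19_658    # one order stored as a Word binary
ADDRESSEE_BYTES = 320      # stored once per customer
RECEIVABLE_BYTES = 190     # stored once per covered receivable


def raw_saving(receivables_per_order: int, orders_per_customer: int = 1) -> float:
    """Fraction of disk space saved by table storage, before indexing and overhead.

    Both arguments are hypothetical workload parameters, not figures from the article.
    """
    # The addressee row is shared by all of a customer's orders.
    addressee_share = ADDRESSEE_BYTES / orders_per_customer
    table_bytes = addressee_share + RECEIVABLE_BYTES * receivables_per_order
    return 1 - table_bytes / WORD_DOC_BYTES


if __name__ == "__main__":
    for n in (1, 3, 5):
        print(f"{n} receivable(s) per order: {raw_saving(n):.1%} saved")
```

The raw saving shrinks as orders cover more receivables, which is one reason a net figure after indexing and overhead is lower than the best case.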
Using a database and constructing the documents on the fly eliminates concern over varying input file formats since data entry can be handled via a browser form. It creates, on the other hand, two additional problems:
This is addressed through use of XML. This will enable us to produce almost anything the customer requires including PDFs, flat files suitable for use with Excel or other spreadsheet tools, or formats we currently don't know about.
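As a rough illustration of why an XML intermediate form makes new output formats cheap, the sketch below flattens one order document into CSV suitable for a spreadsheet. The element and attribute names are invented for the example, and in Cocoon itself this step would normally be an XSLT stylesheet in a pipeline rather than Python.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical order document; element and attribute names are illustrative only.
ORDER_XML = """\
<acceptanceOrder id="A-1001" customer="Example Co">
  <receivable debtor="Debtor One" amount="1250.00"/>
  <receivable debtor="Debtor Two" amount="830.50"/>
</acceptanceOrder>
"""


def order_to_csv(xml_text: str) -> str:
    """Flatten one order document into CSV rows, one per covered receivable."""
    order = ET.fromstring(xml_text)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["order_id", "customer", "debtor", "amount"])
    for rec in order.findall("receivable"):
        writer.writerow([order.get("id"), order.get("customer"),
                         rec.get("debtor"), rec.get("amount")])
    return out.getvalue()


if __name__ == "__main__":
    print(order_to_csv(ORDER_XML), end="")
```

Supporting a format we don't yet know about then means writing one more transformation of the same XML, with no change to the data or the application logic.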
We cannot rely on our ability to reproduce documents sent to customers on an as-needed basis because it would be possible to argue, in court, that our system could have changed in the interim and so introduce doubt about the authenticity of the copy we generate.
To deal with that we need to store the actual document sent the customer, together with authentication and delivery information. This does not, however, destroy the usefulness of the database approach; the best solution is probably to do both: use the database and store the final documents as sent. That's because, in almost all cases, the need to take time to recover the document files will not impede resumption of production operations and so does not significantly affect recovery time.
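One way to keep "the actual document sent" verifiable is to record a cryptographic digest of its exact bytes alongside the delivery details. The following is a sketch only: the field names are hypothetical, and a digest is a tamper check on the archived copy, not a substitute for the digital signatures the application itself has to handle.

```python
import hashlib
from datetime import datetime, timezone


def archive_record(document: bytes, order_id: str, recipient: str, signer: str) -> dict:
    """Build the metadata stored alongside the exact bytes sent to the customer."""
    return {
        "order_id": order_id,      # hypothetical field names throughout
        "recipient": recipient,
        "signer": signer,
        "sent_at": datetime.now(timezone.utc).isoformat(),
        "sha256": hashlib.sha256(document).hexdigest(),
        "size_bytes": len(document),
    }


def verify(document: bytes, record: dict) -> bool:
    """Check that an archived copy still matches the digest recorded at delivery."""
    return hashlib.sha256(document).hexdigest() == record["sha256"]
```

Because the small metadata records can live in the database while the document files sit in bulk storage, production can resume from the database alone and the files can be restored afterwards.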
Therefore, as shown below, the design will be based on using a database to store the information going into each document, using an "XML enabled" application layer to construct documents as needed, and using a web server as an interface to the user's browser.
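The middle layer of that design, building a document on request from table rows, might look something like the sketch below. The schema, table names, and XML vocabulary are all assumptions made for illustration, and SQLite stands in for whichever database is eventually chosen.

```python
import sqlite3
import xml.etree.ElementTree as ET


def build_order_xml(conn: sqlite3.Connection, order_id: int) -> str:
    """Construct an order document on request from table rows."""
    cust_name, signed = conn.execute(
        "SELECT c.name, o.signed FROM orders o "
        "JOIN customers c ON c.id = o.customer_id WHERE o.id = ?",
        (order_id,),
    ).fetchone()
    # An unsigned order is a quote; a signed one is a contractual commitment.
    root = ET.Element("acceptanceOrder", id=str(order_id),
                      status="signed" if signed else "quote")
    ET.SubElement(root, "customer").text = cust_name
    for debtor, amount in conn.execute(
        "SELECT debtor, amount FROM receivables WHERE order_id = ? ORDER BY id",
        (order_id,),
    ):
        ET.SubElement(root, "receivable", debtor=debtor, amount=f"{amount:.2f}")
    return ET.tostring(root, encoding="unicode")


def demo_connection() -> sqlite3.Connection:
    """In-memory database holding one unsigned order, for illustration only."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, signed INTEGER);
        CREATE TABLE receivables (id INTEGER PRIMARY KEY, order_id INTEGER,
                                  debtor TEXT, amount REAL);
        INSERT INTO customers VALUES (1, 'Example Co');
        INSERT INTO orders VALUES (10, 1, 0);
        INSERT INTO receivables VALUES (1, 10, 'Debtor One', 1250.0);
    """)
    return conn


if __name__ == "__main__":
    print(build_order_xml(demo_connection(), 10))
```

The web server layer would then hand this XML to whatever transformation produces the format the user's browser asked for.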
The processing applications needed can be thought of as modules within an overall framework. Diagram three, below, shows a typical screen flow for one such module.
Actual definition of these screens is best done using an active prototyping approach in which you start with your best, and usually rather naive, idea of how it should work and then do two things in parallel:
Once your prototype achieves stability, you can implement formal testing and review by users not previously associated with the project and use their comments to refine the thing to the point that they think your prototype "works."
Once the system works, phase two will deal with deployment issues including:
From a capital cost perspective the new system is to fit into the existing network and support framework. As a result initial infrastructure costs are limited to the server and any licensed software needed.
The weakest link |
---|
In this situation, verify that the network can deliver. You may have a 10Mbps connection with low utilization but that doesn't mean you can add a substantial new load. Particularly on PC type networks all kinds of things - firewalls, poorly configured or underpowered routers, "invisible" SMB network use - can foul things up.
If the network is slow, your users won't care about your excuses or your demonstrations of how fast the server is. They'll see poor response, and turn off. So be sure to test your connection, repeatedly and at different times of day, before agreeing to its adequacy. If their in-house network won't support your access needs, and the local network guru doesn't take action, try to take your test system somewhere else - and make the network effect obvious when the guru's boss has to migrate the box in-house. |
Server sizing is something of a non-issue. We know that the database will be quite small, probably still under 20GB three years from now, and we know that typical usage volume will also be quite low because, on a typical day, the company insures about 7,500 customers of whom around 900 will record some change - usually a receipt or a new receivable on a rolling account.
It is likely that most users will initially see use of this server as an additional burden imposed by management and respond by fitting their interactions with it into already busy schedules. In practice, that means we can predict usage surges just before or after lunch and just around go-home times in each time zone. Unfortunately west coasters tend to leave impositions until after lunch while east coasters do them just before going home, leaving us facing the likelihood that the two biggest surges, those from the Pacific and Eastern time zones, will overlap.
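The overlap is easy to check with a little clock arithmetic. In this sketch the surge hours (13:00 to 14:00 Pacific, 16:00 to 17:00 Eastern) are my assumptions for "after lunch" and "just before going home", and fixed standard-time offsets are used since daylight saving shifts both zones by the same amount.

```python
from datetime import datetime, timedelta, timezone

# Fixed standard-time offsets; daylight saving moves both zones equally.
PACIFIC = timezone(timedelta(hours=-8))
EASTERN = timezone(timedelta(hours=-5))
UTC = timezone.utc


def window(tz: timezone, start_hour: int, hours: float):
    """Return a (start, end) surge window on an arbitrary reference day, in UTC."""
    start = datetime(2002, 9, 24, start_hour, 0, tzinfo=tz)
    end = start + timedelta(hours=hours)
    return start.astimezone(UTC), end.astimezone(UTC)


def overlap_minutes(a, b) -> float:
    """Minutes during which two (start, end) windows coincide; 0 if they don't."""
    latest_start = max(a[0], b[0])
    earliest_end = min(a[1], b[1])
    return max((earliest_end - latest_start).total_seconds() / 60, 0)


# Assumed surge times, not measurements.
pacific_surge = window(PACIFIC, 13, 1)   # after lunch on the west coast
eastern_surge = window(EASTERN, 16, 1)   # before going home on the east coast

if __name__ == "__main__":
    print(f"Overlap: {overlap_minutes(pacific_surge, eastern_surge):.0f} minutes")
```

With the three-hour offset between the coasts, an hour that starts at 13:00 Pacific is the same hour that starts at 16:00 Eastern, so under these assumptions the two surges coincide completely.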
Once people see value in this service, usage will balance out, but the quickest way to destroy any chance of that happening is to under-configure the hardware at the beginning - users who have to wait for your server the first time they connect to it will have their resentment of the new imposition reinforced - and you'll never recover their trust.
On the other hand the cost difference between "about right" and grossly overpowered is, in this context, a few dollars - so I've intentionally specified an insanely overpowered machine below: a dual processor Dell Xeon running at 2.4GHz.
Server Capital Cost | Microsoft | Open Source |
---|---|---|
Machine type (data from Dell.com, Sept 24, 2002) | Dell 4600; includes Tripp Lite 3000VA UPS, CD, floppy, and network cards; no services, monitor, mouse, or keyboard | Dell 4600; includes Tripp Lite 3000VA UPS, CD, floppy, and network cards; no services; has 16" monitor, mouse, and keyboard |
CPU type | 2 x 2.4GHz P4/Xeon | 2 x 2.4GHz P4/Xeon |
RAM | 4 x 1GB DDR/SDRAM | 4 x 1GB DDR/SDRAM |
Internal disk | 4 x 36GB internal | 4 x 36GB internal |
Document storage | Dell 220 PowerVault; 6 x 73GB, US160 | Dell 220 PowerVault; 6 x 73GB, US160 |
SDLT tape | External 220GB with controller and CA ARCserve license | External 220GB with controller |
Operating system | Windows 2000 Server with 25 client licenses | Caldera Open Linux |
Total cost | $31,087 | $26,047 |
I considered a Sun 480 for this role too since the cost wouldn't be much different while its fiber channel disk, the higher reliability of Solaris on SPARC, and its upgrade-ability to four CPUs give it some advantages.
For this article I want to compare Linux to Microsoft solutions on the same hardware, but a real world decision would be more influenced by the workload. The twin Xeons are faster than the two UltraSPARCs but the Sun machine offers a hardware cryptographic accelerator for $2,700 that's capable of doing around 4,300 SSL "hand-shakes" per second. If system usage were going to be high relative to the hardware, that accelerator would make a big difference, but that isn't true here and the Xeons' shorter completion times on single tasks make them, to me, the better choice.
On the software side I've not worked with the Microsoft stuff and am not all that sure what we need or don't need. The list here is deduced from the how-to article on the Microsoft website referenced earlier.
Software Licensing Cost | Microsoft product | Microsoft cost | Open Source product | Open Source cost |
---|---|---|---|---|
Database layer (licensed per processor) | Microsoft SQL Server 2000 | 4,999 x 2 = $9,998; may need Enterprise Edition ($19,999 per CPU) | PostgreSQL or MySQL | 0.0 |
Database integration (licensed per processor) | BizTalk Server 2002, Standard Edition; includes Simple API for XML (SAX2) | 6,999 x 2 = $13,998 | Cocoon | 0.0 |
Programming language | XMLSpy IDE (assumes VB license?) | $999 | mod_perl or mod_php | 0.0 |
Proxy/cache web server | Internet Security and Acceleration (ISA) Server | 1,499 x 2 = $2,998 | | 0.0 |
Other required licenses | Unknown | ? | none | 0.0 |
Total list | | $25,739 | | 0.0 |
As an operational matter the importance of this data to the company means that I'd recommend redundancy - setting up two servers, in different cities, with different administration, and different internet backbone connectivity - at somewhat more than twice the cost. As shown below, you could do that with the Linux solution for about the cost of one Windows 2000 system.
Total Capital Cost | Microsoft | Open Source | Percentage Savings with Open Source |
---|---|---|---|
Server Hardware | $31,087 | $26,047 | 16% |
Software Tools | $25,739 | 0 | 100% |
Total | $56,826 | $26,047 | 54% |

A note on purchase timing |
---|
For in-house development on Windows 2000 you would probably load the licenses once, on the production machine. That means you'd buy everything before writing a line of code.
In the Linux world there aren't any license portability issues so you'd develop and test on an existing machine, postponing capital expenses until you had a working system and thus reducing overall project risks. |
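The percentages, and the claim that two redundant Linux servers cost about the same as one Windows 2000 system, can be checked with a few lines of arithmetic. This sketch takes the hardware totals from the Server Capital Cost table and the software total as listed; it recomputes nothing inside those tables.

```python
# Totals as listed in the article's cost tables (US dollars, list prices).
HARDWARE = {"microsoft": 31_087, "open_source": 26_047}
SOFTWARE = {"microsoft": 25_739, "open_source": 0}


def total(option: str) -> int:
    """Hardware plus licensed software for one server."""
    return HARDWARE[option] + SOFTWARE[option]


def saving_pct(microsoft: int, open_source: int) -> float:
    """Percentage saved by choosing the open source option."""
    return 100 * (microsoft - open_source) / microsoft


if __name__ == "__main__":
    print(f"Hardware saving: {saving_pct(HARDWARE['microsoft'], HARDWARE['open_source']):.0f}%")
    print(f"Total saving:    {saving_pct(total('microsoft'), total('open_source')):.0f}%")
    print(f"Two Linux servers:  ${2 * total('open_source'):,}")
    print(f"One Windows server: ${total('microsoft'):,}")
```

Note that a real two-site deployment would cost "somewhat more than twice" a single server, so the doubled figure here is a floor rather than an estimate.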
a week? |
---|
This should be an area of intense comment from people who have actually used this stuff. Remember, article two needs your experience and opinions. If you have used Cocoon, or the BizTalk/SQL-Server combo, in a real application then please contact me. |
We do not yet have manpower estimates for either the development or the operational phases of this work. On the development side the requirements are currently only loosely understood while operational issues have yet to be discussed at any length.
Nevertheless, experience tells us that the first prototype can be developed under Cocoon in about a week and that the process is likely to go through three to five iterations before a full user manual (which is the requirements specification) can be written for user signoff.
Clearly infrastructure costs for the open source solution are less than half those for the proprietary solution but that doesn't, by itself, make the Cocoon solution better. The cost difference, perhaps $100,000 for a two-way redundant system, looks like a lot of money at the personal level but barely registers on Nichievo's bottom line.
Failure would hurt us and our sponsors but won't break the firm. Success, on the other hand, affects the balance of management power in the firm and could lead, downstream, to radical change starting with the cancellation of the current development contract and the ousting of the CIO. That, in turn would create opportunities for us in particular and the open source movement in general - replace 1,500 or so Windows desktops with Unix smart displays and we'll have a massive positive impact on the firm's bottom line.
The potential rewards of change are therefore clear --and no-one's under the illusion that we're here to sell Microsoft products-- but we still need to ask the question as fairly and "straight up" as we can:
What are the relative risks associated with each decision? Use Cocoon, or use Microsoft's tools?
If you choose the open source route for this there's no doubt it will be going into a hostile environment but the Windows decision isn't all that great either. Yes, it makes you compliant with the CIO's preferred direction but it still leaves you in a conflict with the international outsourcing and consulting firm that's been beavering away there for the last eighteen or so months.
Different agendas, different methods |
---|
Having the client own and manage the development environment is great
if your primary interest is selling time. After all, waiting for the other
guy to act, or just for a PC to grind something out,
is far more profitable than working because
it reduces your average selling cost.
One client I know fell for this twice; not only demanding control of the development servers but once buying the development house's used 486s and once getting another consultant's retired P2s as Oracle development workstations. These worked as Oracle seats; but Windows NT and Oracle on P2 gear made for lots of long, and fully billable, waiting while mutual finger pointing and related delays added more billable days to the project's overall duration. |
Either way, you'll have people working against you; fewer and more muted with Windows than with Linux, but no picnic either way.
This is, of course, the biggest risk there is; but if you've made your clients aware of the danger and they're willing to take the risk, then it's your job to minimize it, not to undermine their judgment by agonizing over it.
Resource control is the most effective risk reduction strategy possible here. If you want to succeed, own the hardware and control network access to it -- even if that means putting a bunch of their PCs in a room with the server and a small hub. Later, put two phases into your deployment plan:
To facilitate this I often include an offer in the proposal to develop on production scale hardware that we own until hand over. At that time the client can decide to buy it at the pre-agreed price or replace it with hardware of his own. In most cases this looks like a great risk reduction strategy to the client (and it is) because he doesn't spend a hardware nickel until the software works and then doesn't face a systems transition, but its real purpose is to trap the opposition between rocks and hard places:
Notice, however, that this strategy requires you to buy the machine and any needed licenses upfront and is, therefore, a powerful argument for Linux because:
Both sets of tools are under development with both subject to change. As a general rule, however, Microsoft's changes affect everything from the operating system (which may require new hardware to run) to the client interface layer while Apache's changes tend to be independent of the operating system.
There are operating system patches to consider in both cases, but Linux patches don't generally require application reconfiguration or testing; Windows service packs, in contrast, often change everything from licensing terms to API internals.
From a stability perspective, therefore, both choices mean that we will be adapting to technical change as it occurs, but the Cocoon option limits that to the application and is therefore strongly preferable.
The absence of licensing issues together with the separation of application, database, server, and OS on Linux mean that we could, if necessary, recover the application to any Linux machine capable of handling the load. That isn't true on Windows 2000 server; a failure pretty much has to be recovered on the machine that failed - otherwise we're really looking at a new install; something that's usually much harder and more time consuming to do.
Given how critical this application is, recoverability is a killer issue - and a strong vote for Linux with Cocoon.
Security is the other killer issue. There have been security issues with Apache, Tomcat, PostgreSQL, and Linux, but not many, and those that appeared were quickly remedied. The Microsoft toolset, on the other hand, has dozens of outstanding security issues, including XML based attacks on SQL-Server and Windows 2000 Server, and remediation is usually slow in coming.
This, to me, is a decisive issue: Linux and Cocoon it is.