Things only Unix can do: going way off scale at SafetyJet International

Author Paul Murphy's previous LinuxWorld article on making the Unix decision suggested that Unix is usually a smarter business choice than Windows. The current article, fourth in a series, looks at what it takes to implement that knowledge.

The scenario being explored is that of a consultant brought in early in the business planning cycle for a new airline. Unlike the other articles in this series, this example does not reflect the author's real world consulting experience. I have done work for an airline, but nothing on this scale. SafetyJet is a wholly imaginary company constructed solely to illustrate three outstanding Unix characteristics:

  1. massive scalability;
  2. very high reliability; and,
  3. the ability to use maintenance free smart displays.

A typical smart display has a powerful graphics engine and runs Java/OS, Microsoft Windows, or X clients; often concurrently. The image shown is of a 21 inch NCD900; others are made by companies like Sun, IBM, and Thinknic. Typical MTBF ratings are in the 300,000 hour range and there are neither moving parts nor user accessible OS components to cause failures.

In this case, the problems are real, the proposed solutions arguable, and the consulting assignment described is not based on real events.

Background

The scenario assumed for this article is that a group of investors have reacted to recent upheaval in the airline industry by commissioning development of a business plan for a new airline. Their terms of reference to the design team, of whom "I" represent the IT end of things, are:

  1. the marketing message, and so the business design, will focus on safety, reliability, and ease of use;
  2. their idea of starting small involves around ten aircraft and 1000 people at start-up but with very rapid growth to 100 airplanes and 10,000 employees serving every major center in the US and Canada;
  3. they expect that the present state of the industry will let them wrest significant regulatory concessions from the authorities in the United States, Canada, and at various airports. In particular they expect to be able to sidestep existing regulation on fare structures, freight shipment, and passenger loading procedures at airports; and,
  4. they expect the regulatory authorities to be willing to bend as a side effect of current airline problems and public concern - but recognize that some changes cannot easily be made. Cabotage rules (governing the movement of passengers within a country on international flights) are set by international treaty and can't be easily waived or changed. As a result SafetyJet will legally be a resource company: SafetyJet International, that charters flights from two operating companies: Canadian SafetyJet International Charter Ltd. and American SafetyJet International Charter Inc.

On this basis our team has agreed that:

  1. this airline will be an all Boeing production with only 757 and 767 equipment;
  2. preference will be given, everywhere in the company, to hiring servicemen. Military training and current Reserve or Guard certification will be mandatory for all drivers and senior officers. Crews will be assigned as units with, where possible, same shift returns to their home bases;
  3. the initial scope of operation will be defined, not by market surveys, but by the combination of airport agreement to proposed procedures and the availability of cheap, longer term landing and takeoff slots as other airlines go under;
  4. SafetyJet will focus only on its own business with its customers. We will not forward checked luggage, make third party reservations, or pay commission agents;
  5. SafetyJet's aircraft will park and load in the open. Passengers will be responsible for luggage loaded onto, and taken off, aircraft loading dollies. Ground crew will externally release aircraft wheel brakes on pre-departure clearance;
  6. SafetyJet will provide transportation between destinations, not between airports. SafetyJet buses will shuttle most passengers between the aircraft and pickup points such as downtown hotels or the main passenger concourse as part of the ticket price;
  7. where possible SafetyJet will fly into, and out of, secondary airports in major areas; using, for example, the airport at Hamilton instead of Toronto's Pearson airport as a way of avoiding congestion and related delays in or around the Toronto area;
  8. SafetyJet will be sold to the public as a fair pricing airline, not as a discount airline. Taxes will be reported as line items on ticket sales. Nominal fares will be set to yield an approximate 20% allocatable margin with a 33% revenue passenger load per flight, but actual fares will be adjusted, at time of departure, to reflect the actual load.
    For example, if a flight is expected to cost the airline about $22,500 inclusive, then the nominal pre-tax ticket cost would be $421.87 [= 1.2 x 22500/64, where 64 seats is one third of the 192 available], but that price would be reduced to a pre-tax $183.67 when 147 people board the flight [= 1.2 x 22500/147].
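The fare arithmetic above is simple enough to sketch directly (the function name and structure here are illustrative only, not part of any proposed system):

```python
def pre_tax_fare(flight_cost, passengers, margin=0.20):
    """Pre-tax fare: flight cost plus the target margin,
    spread across the passengers assumed (or known) to be aboard."""
    return flight_cost * (1 + margin) / passengers

# Nominal fare assumes exactly one third of the 192 seats (64) are sold.
nominal = pre_tax_fare(22500, 192 // 3)   # about 421.87, as in the text
# The actual fare is recomputed at departure against the real load.
actual = pre_tax_fare(22500, 147)         # about 183.67
```

The same function prices both cases; only the passenger count changes, which is the whole point of the load-adjusted fare scheme.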

My job is to design, and approximately cost out, an information architecture to support their vision.

Basic Issues

The airline business, perhaps more than any other industry, is about large numbers. Some examples:

  1. it costs about $80,000,000 to buy and configure a new Boeing 757-200 for mid range operation with 192 revenue seats;
  2. fully loaded that 757-200 will cruise at 580MPH while eating fuel at about 1.8 cents per passenger mile - that's about one third the fuel cost of the average car moving at 60 MPH but adds up, for 192 passengers at 9.6 miles per minute, to just over $2,000 per hour;
  3. a simple three legged flight from Hamilton airport to Minneapolis, to Chicago, and back to Hamilton must comply with approximately 90,000 individual regulations enforced at levels ranging from international aviation treaties to local ordinances;
  4. an extremely efficient, mid-size airline operation with highly motivated staff will typically require about 100 people per operating aircraft;
  5. regulation and taxation often combine to produce bizarre results. The fuel tax rebate system, for example, combines with other federal, state and provincial taxes to make some routes markedly cheaper than others - and produces arbitrage opportunities for airlines willing to treat fuel as paid-in freight for some cycles;
  6. airport taxes, ticket surcharges, and related compliance costs amount, for a profitable operation, to death by a thousand cuts. Depending on the specific airport and the jurisdictions it operates under, an airline can face 15 or more separate levies for each transit and up to 10 more that apply to each passenger carried.
    Some surcharges imposed by taxation or other authority are added as flat fees rather than being calculated as a percentage of ticket price. As a result they disproportionately affect short haul routes, where the customer might already be tempted to use alternate modes, and so pull people off airlines and onto highways.

    In Canada, for example, the federal government has announced a $12 per ticket "travel safety improvement" tax that perfectly illustrates this problem.

    There are several high volume, limited distance routes in Canada. The Edmonton to Calgary distance, for example, is about 180 miles downtown to downtown but only about 145 miles airport to airport. The local discount carrier's one way ticket costs from $54 to about $95. Inclusive of parking and cab fares, a typical day trip costs from $160 to about $220 before the new tax adds about 20% to the ticket price and 15% to the lowest net cost.

    For most people, most of the time, the end to end trip takes about three hours - but that same average person can, especially if he remembers the inevitable police speed traps around Red Deer, make the same end to end trip by car in about three and a half hours. At a reimbursement rate of 27 cents per kilometer and $20 for destination parking, driving costs about $177 - vs. $184 for the lowest net cost of air transportation after the new tax is added.

    Since most people feel safer and more in control of their own schedules in their cars than in an airliner, the effect of the new tax will be to push people onto the highway - thereby penalizing the airline industry and raising the overall rate of death, injury, and property loss among people making the trip.

  7. Combined with sales taxes these add-on charges can often reach 45% of the ticket price on short haul flights and one third or more of the total on medium and longer range services.
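The per-hour fuel figure in item 2 above can be checked directly from the stated per-passenger-mile cost (the function name is illustrative; the numbers are the article's):

```python
def fuel_cost_per_hour(cents_per_pax_mile, speed_mph, passengers):
    """Hourly fuel bill: per-passenger-mile cost, times cruise speed,
    times the number of passengers paying that rate."""
    return cents_per_pax_mile / 100.0 * speed_mph * passengers

hourly = fuel_cost_per_hour(1.8, 580, 192)   # about $2,004 - "just over $2,000 per hour"
```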

Airport authorities pose the biggest challenge to SafetyJet's business plan because they control access to their local markets and can invoke literally thousands of regulations to smother almost any change initiative. Our plans will arouse their hostility because bussing passengers around airport delays reduces both their revenues and their control. This, I'm assured by the experts responsible for the project, will be the primary regulatory battleground on which the airline's potential for success or failure is going to hang.

Within this context the major IT challenges are:

  1. managing the revenue cycle - from customer interest to commitment, through ticket issuance, luggage tracking, and pricing adjustment, to making and tracking compliance payments;
  2. resource scheduling - everything it takes to initiate and complete a flight leg;
  3. system wide security - including making sure that crew members are who they say they are, that all systems are clear, that traveler information is available to the authorities and connecting airlines, and that all financial and related information is fully and accurately reported.

None of this is very difficult when you have one or two small airplanes that zip back and forth on domestic routes of four hours or less, but the resource scheduling problem gets exponentially more complex as you scale up. By the time you get to 8 airplanes, 45 daily flights, and 120 flight crew the problem has exceeded human capacity. At 100 airplanes inefficiencies in the solutions used can add up to 5% or more of total operating costs - more than the bottom line - and it just gets worse as you get larger.

When airlines started to experience this complexity, in the nineteen twenties, computers had not been invented and people simply did the best they could with manual means. By 1970 airlines had invested heavily in the use of computers for reservations management but computerization, outside of military logistics planning, had yet to make inroads into scheduling. By 1980 that had changed with major airlines investing in Cray and other Supercomputer gear to attack this problem. The operations research groups created within airlines to do this were not, of course, up to solving the entire problem and so concentrated on specific subsets where they hoped to have the greatest short term impact on profitability. As a result organizational structures and technical disciplines evolved around these problem sub-sets; mainly: maintenance scheduling, flight scheduling, crew rostering, and pairings (matching inbound and outbound routes to bring crews back to home bases).

A NASA paper by John Usher of Mississippi State University provides a clear and easily understood problem description. The more mathematically inclined may want to check out informs.org or start with the MetaNeos project site at Argonne National Laboratory.

The classic text in the field of optimization is Harvey Wagner's Principles of Operations Research (Prentice Hall, New Jersey, 1969) although many people will find Claude McMillan's Mathematical Programming (Wiley, New York; 1970) rather easier to follow.

There has been tremendous progress in both the theory and practice behind the computation of actual solutions to various scheduling models. A problem which, run on a 300MHz Sun UltraSparc IIi, took about 83 hours to solve using one of the best available 1990 solvers (CPLEX 1) may now complete in less than three minutes on that same machine using the latest CPLEX release. For larger problems susceptible to the best modern algorithms the improvement, on identical hardware, over the decade is on the order of 4,000 times.

Coupled with improvements in hardware, those algorithm gains have made it possible to solve problems that were once considered unthinkably complex - including the original integrated problem which, because it could not be solved, led to the segmented approach still deeply embedded in most airline organizations today.

Solutions

The most fundamental problems to be addressed by the technology solution are:

  1. reliability;
  2. response speed; and,
  3. accuracy.

In this context security is an aspect of reliability and completeness an aspect of all three main requirements: reliability, speed, and accuracy.

Decisions

Initially, I expected to be able to break the IT components down into major sections each of which could then be dealt with separately through purchase and deployment of one or more commercial packages. Those beliefs, however, turned out to be extremely naive.

I had assumed, for example, that we could buy or license so called "revenue cycle" software and resource allocation software. That's the way most airlines do it but, on closer review, the two sets of problems turn out to be complements that use the same core data and so are logically part of one system.

The revenue cycle starts when a passenger makes a service request and ends when that service has been delivered and all consequent liabilities have been satisfied. The resource allocation process controls how that service is delivered. Basically this is just a very large scale ERP problem but breaking these things up into the widely separated islands of automation needed to address them with 1970s solutions guarantees that attempts to fit them back together produce unnecessary inefficiencies.

What gradually dawned on me as we toured airline data centers and talked to both sellers and users of this software is the extent to which airline data processing traditions hold airline operators hostage. On the surface the problems these guys deal with are huge and the solutions, experientially evolved over decades, hold together well enough. But, look a bit deeper and several things stand out:

  1. the data centers are often enormously expensive. Several data center operations that hopeful vendors toured us through had staffs numbering into the thousands running hundreds of separate applications on warehouses full of fully configured mainframe gear;
  2. the defining application for the industry does not deal with reservations or resource allocation. The defining application is middleware. Everywhere we went, people pointed proudly to their MQ-Series or related implementations as switching data between half a dozen or more production MVS/XA applications and talked about using "connectivity glue" and "data marts" as if those related magically to the profitability of their airlines;
  3. the relative stability of airline data processing through bankruptcies, mergers, out-sourcing, and re-integration is striking. Many of the people we talked to had twenty-five years or more of experience in the industry and seemed to have done exactly the same things, with the same applications and gear, whether the data center they worked at was currently owned by an airline, an out-sourcer, or a bankruptcy trustee;
  4. Unix, although used extensively by those concerned with crew scheduling and related compute intensive work, is not widely accepted in the industry. Most of the data centers maintained a strong "them" and "us" relationship with the Unix users in scheduling and optimization; and,
  5. all of the available software embedded a 1950s style view of the travel industry in which:
    • the relationship between airline and passenger is mediated by a knowledgeable travel agent; and,
    • passengers generally have complicated, multi modal, itineraries.

SafetyJet isn't in the business of fulfilling travel agent orders nor do we need to route transcontinental passengers through a half dozen short hops. We work, instead, directly for the passenger and conceptualize the airline operation as nothing more than a fast link in a bus service between major downtown points of arrival and departure anywhere in Canada or the US. To SafetyJet, the fundamental service issue isn't fulfilling flight orders but getting passengers from where they are to where they want to go.

As a result, we eventually decided to recommend custom development instead of licensing in order to:

  1. get a cheaper, easier to operate, fully integrated system; and,
  2. embed a business model reflecting SafetyJet's operational plans.

That decision wasn't easy to make and will be even harder to defend if, or when, the bosses get the regulatory agreements they need to proceed and all of this suddenly acquires reality.

The key computing workloads to be addressed within the integrated database framework are:

  1. the on-line transactions processing systems, mainly for revenue cycle, maintenance, and operations (scheduling);
    The core OLTP problem is quite small. In full operation SafetyJet will sell fewer than 100,000 seats per day and require something less than a million daily database transactions outside the Solver. Twenty years ago, that was a lot; today it's easily within range for a PC server.
  2. the security systems;
  3. the financial systems including tax compliance; and,
  4. the management information systems (including operational support systems).
There are, in addition, a range of smaller, limited purpose, systems which interact with the database framework at one remove - i.e. with a filtration step. These include:
  1. EDI and related regulatory, competitor, and supplier links;
  2. document management and training systems;
  3. fuel management system;
  4. audit services; and,
  5. the HR system.

The most basic of the ideas underlying the software development project is that the passenger is part of the optimization equation. Flying an airplane from one place to another, and all the organizational complexity it takes to deliver that flight, is a means to an end, not an end in itself. The airline's business is about moving people, not airplanes. In effect the optimizer will direct minute by minute operations in an attempt to minimize cost while picking up, and delivering, both people and freight according to a defined schedule.

In that context our fundamental passenger story is:

  1. a passenger wants to be in some location at least until some specified date and time;
  2. at some later time and date he wants to be somewhere else.

The airline's job is to make that happen with a minimum of risk, effort, or complexity within the time frame set by the customer.

For example:

  1. a customer (or representative) checks the web site (or phones the call center, or appears at a departure desk), chooses start/end points and a latest arrival time;
  2. the system offers one or more candidate itineraries including downtown pickup/drop off points and shows both nominal and average actual prices (before and after taxes);
  3. customer chooses one, proffers credit card (or contract number) for payment;
  4. space is reserved on the buses and aircraft serving his route;
  5. payment is cleared for the nominal total cost;
  6. customer may be asked about luggage, food, allergy, load time, or security issues and appropriate resources allocated;
  7. confirmations are issued: printed, emailed, faxed, or held for pickup at a departure desk;
  8. customer is offered links to third parties such as hotels and car rental agencies in the point, or points, of destination;
  9. resources are cleared by service delivery or cancellation and re-assigned by customer initiated change.
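A minimal sketch of that revenue-cycle flow follows; every class, function, and field name here is a hypothetical illustration, not the proposed system's design:

```python
from dataclasses import dataclass, field

@dataclass
class Booking:
    origin: str
    destination: str
    latest_arrival: str          # customer-specified deadline (step 1)
    nominal_fare: float          # charged at booking time (step 5)
    status: str = "requested"
    resources: list = field(default_factory=list)

def book(origin, destination, latest_arrival, nominal_fare, legs):
    """Steps 4-5: reserve space on every bus and flight leg serving
    the chosen itinerary, then clear payment for the nominal fare."""
    b = Booking(origin, destination, latest_arrival, nominal_fare)
    b.resources = [f"seat:{leg}" for leg in legs]
    b.status = "paid"
    return b

def settle(booking, actual_fare):
    """Step 9: on service delivery, release the reserved resources and
    refund the difference between nominal and load-adjusted fares."""
    refund = booking.nominal_fare - actual_fare
    booking.resources.clear()
    booking.status = "delivered"
    return refund
```

Booking a three-leg downtown-to-downtown trip and settling it against a lighter actual load would, for the fares used earlier in the article, refund the passenger the difference between $421.87 and $183.67.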

In a perfect world this would be easy but, in the real world, complications can include:

  1. passengers rarely want to get somewhere at 3:00 AM. Instead they generally want to arrive during evenings or early mornings. That demand pattern imposes serious constraints on resource scheduling because the easiest way to meet it, having aircraft sit idle for a few hours between flights, is an airline's way of hemorrhaging cash - it costs about $25 per minute just to own the airplane; to make money, it has to be in the air moving passengers.

    The single most common cause of delay is another airline's inability to meet its landing or take-off slot commitments. In some cases that can be due to an airline using its control of specific airport traffic patterns to keep a competitor out of that airport but, in most cases, delays are unavoidable side-effects of "the system" at work.

    In theory, the RTAOS design can mitigate the overall impact of third party delay.

    For example, a 25 minute take-off delay in a United Flight from New Orleans to Denver can intersect a three minute Denver landing window for a SafetyJet arriving from Winnipeg - potentially delaying landing by 15 minutes and causing two crews, and possibly 384 passengers, to wait.

    This has obvious direct costs to the airline for things like fuel and maintenance but may also have the indirect effect of requiring that the incoming crew take a four hour rest period - because, without it, they would be over-time on arrival in Winnipeg. That means a different crew has to be sent - putting three crews off-schedule and two out of place.

    If alerted early enough, Operations can burn slightly more fuel to get that aircraft into United's nominal slot - avoiding the problem and saving money for both airlines.

  2. to keep crews functioning as integrated teams, build relationships between cabin crew and frequent flyers, and ease flight crew recruiting and retention, we need to ensure that they, to the maximum extent possible, start and end their shifts in the same place;
    Scale makes this much easier. It is virtually impossible to achieve with few planes and relatively long routes but quite easy to do if the airplanes are interchangeable from a crew qualification perspective and the airline has a dense mix of short and medium to long range flights.

  3. things will go wrong every day for weather, mechanical, or human reasons. When that happens passenger schedules are affected and airline costs usually rise but flight revenues do not.

The best solution to the cost trade-offs implicit in these processes, if technically feasible, is obviously to integrate consideration of passenger (and freight) issues along with all other operating parameters in the dynamic programming model used for resource allocation.

If:

  1. the model produced results in near real-time;
  2. the airline were able to continuously track all air and ground vehicles;
  3. good information on slot change by other airport users were quickly available to the operations center;
  4. operations desk change orders had near real-time effect on the information provided at departure desks, on buses, and on-board the aircraft; and,
  5. the airline could generally expect good regulatory co-operation on minor flight plan changes,

then operations should be able to maintain an overall schedule optimized in terms of passenger needs while giving up a minimum in cycle time, fuel, or other operational penalties.

Overall optimization for passengers does not, of course, mean individual optimization. The occasional passenger may find herself temporarily re-routed to Alaska or stranded in Saskatoon but the system would continually adjust itself to produce the best possible result for the majority of passengers the majority of the time.

As it happens that's also the best possible result for the airline because minimizing passenger waiting times and ground travel distances is usually the same as keeping the fastest gear, i.e. the airplanes, busiest earning money.

The critical design question is therefore clear: can the scheduling problem be formulated inclusively enough to generate usable results and yet solve in near real-time - particularly if we define the latter as "generally less than one minute"?

Formulating the scheduling problem requires considerable expertise and an immense amount of data - both of which should be available. Since there is no compelling reason to believe that the problem cannot be properly formulated we'll assume that it can be and concentrate on options for solving it.

The actual problem size is difficult to predict, but inclusion of passenger concerns will add less complexity than might be expected because many passenger constraints are linearly dependent - meaning that while a full linear program might have 100 million rows and 200 million columns, the subset of interest will usually be at least an order of magnitude smaller in each dimension.
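The combinatorial character of the underlying problem is easy to see at toy scale. The sketch below brute-forces a three-crew, three-flight assignment (the cost matrix is invented); real formulations replace this enumeration with LP/IP solvers like CPLEX precisely because factorial enumeration fails long before airline scale:

```python
from itertools import permutations

# Invented costs (deadheading, overtime, hotel nights) of assigning
# each crew (row) to each flight (column).
cost = [
    [4, 9, 3],
    [7, 2, 8],
    [6, 5, 1],
]

def best_assignment(cost):
    """Try every crew-to-flight pairing and keep the cheapest.
    With n crews this examines n! pairings - hopeless past toy sizes."""
    n = len(cost)
    return min(
        (sum(cost[crew][flight] for crew, flight in enumerate(p)), p)
        for p in permutations(range(n))
    )

total, plan = best_assignment(cost)   # total 7: crews 0, 1, 2 keep flights 0, 1, 2
```

Even 15 flights would mean more than a trillion pairings, which is why the solver choice, not the bookkeeping, dominates this design.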

There are some givens in solving this. For example, the use of the Informix database with Tuxedo is a given in view of the reliability requirements for the transactions environment and the consequent need to keep the two data centers fully synchronized. Since this requirement also amounts to a Solaris specification for the primary transactions processing and database hosting jobs, the real architectural issue for the solver lies between:

  1. putting the scheduling problem, along with everything else that fits within the integrated database framework on a single machine in each data center; or,
  2. putting the solver on a Linux or Solaris GRID as a kind of co-processor to the main OLTP host.

It is fairly clear that the single machine approach would work with today's computing gear for a ten airplane operation - what SafetyJet will be at start-up. The real question, however, is what happens eight months or a year later: when second round financing enables explosive growth to the 100 airplane level? It would be suicidally stupid to build the greatest piece of airline software ever, only to have to abandon it as unworkable if growth drives the problem size past hardware capabilities.

What I need to predict, therefore, is whether or not the problem can be run, with a target solution time of one minute or less, on hardware available about two years from now.

The best guide to that is, of course, performance today but I don't have good information on the distribution of solution times as a function of problem complexity on this scale. It is relatively easy to predict how long it will take, for any given set of hardware, to load the problem and to run pre-solve (collapsing redundant row and column information) preparatory to invoking barrier or other algorithms to produce the optimal solution. Beyond that, however, the actual time-to-solution depends far more on the applicability of the algorithms used to the specific data set attempted than on the hardware.

The easy approach to scaling up is to add machines. There are several research projects aimed at making use of thousands of individual machines and even an off-the-shelf product like Sun's GRID engine can be used to push the problem out across a network of co-operating machines. Our requirement for near real time answers means, however, that large scale, internet based, compute sharing isn't going to be viable because:

  1. we can't be assured of predictable compute times; and,
  2. we can't be assured about computation accuracy or system availability in the presence of incompetence or hostile attack.
For us, therefore, a distributed approach means a Linux or Solaris/Intel compute farm with racks of dedicated processors.

Although the actual compute time under either hardware scenario depends more on relationships in the data for a specific problem than on its size, we know that the primary predictor of relative efficiency between the two solutions is the number of iterations that require information from outside the local compute block to proceed. The more linkage has to be accounted for, the better the single machine approach will look. Unfortunately, however, experience with smaller problems, on the order of a million rows, does not translate directly to problems with forty million or more rows and so we won't know the answer to this until we run actual trials.

A benchmark that may describe a best case uses a real world fluid dynamics program. This has reasonable complexity and scale along with known high separability between components and should, therefore, mark the upper limit on the effectiveness of a distributed processing solution. The benchmark data posted at Fluent.com should thus provide a useful guide for our decision.

Although the numbers as presented need some interpretation, the results show that, for the largest problem benchmarked:

  1. we can assign a 1.4GHz machine running Windows 2000 a relative score of 1;
  2. and then see that a network of 128 dedicated Linux machines, each running a 1GHz P3, gets a relative score of 34; and,
  3. a single Sun Starfire running 72 CPUs at 900MHz gets a relative score of 42.

In the time since this particular test was run, two significant changes have been made to the Starfire:

  1. updated on-board microcode has improved memory management, including cache consistency, across multiple boards; and,
  2. Sun released its new MediaLib - and pushed those results backward into the standard compilers to make far better use of the SIMD/VIS instruction set for linear algebra and related matrix operations than ever before on SPARC systems. Early results, on SunBlade workstations, indicate up to a 40% improvement on computationally intensive tasks - like CPLEX or the fluid dynamics model benchmarked.

Since both of these address critical components of our problem we can reasonably speculate that a newly configured Starfire 15K with all 106 possible CPUs installed should score around 65 - or about what you'd get if you replaced each of the Gigahertz P3 boxes in the IBM X-series with a dual CPU Xeon chasing a gigabyte of RAM at about 1.7GHz.


Large Model Fluid Dynamic Computation - Relative Power Ratings
Windows 2000 on 1.4GHz P4 = 1
IBM Xseries = MYRINET Linux Cluster
128 Machines each with 1GHz P3 = 34
Sun 15K, 72 CPUs [Actual] = 42
Sun 15K, 106 CPUs [Estimated; includes effect of MediaLib code] = 65
Data from: http://www.fluent.com/software/fluent/fl5bench/fullres.htm on 10/12/01
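The 106-CPU figure in the table is a projection, not a measured result. It scales the measured 72-CPU score by the added processors and then applies an assumed net MediaLib benefit; the 5% system-wide figure is my own conservative discount of the up-to-40% workstation results:

```python
measured_72cpu = 42                      # actual Starfire benchmark score
cpu_scaled = measured_72cpu * 106 / 72   # about 61.8 from the extra CPUs alone
MEDIALIB_NET_GAIN = 0.05                 # assumed system-wide gain (my estimate)
projected_106cpu = cpu_scaled * (1 + MEDIALIB_NET_GAIN)   # about 65
```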

The worst case would occur if separability is minimized. A benchmark that tracks that is provided by the Transaction Processing Performance Council's analytical processing (TPC-H) test. Results are not linearly comparable across database sizes but reasonable estimates suggest that the per-query workload increases by about 3.3 times as the database grows from 300GB to 1TB. On that basis an eight processor Proliant with Microsoft Windows 2000 and SQL Server which achieved 1,506 QphH on the 300GB test could be expected to produce about 456 QphH on the 1TB test.
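The 456 QphH estimate above is just the measured 300GB score discounted by the assumed workload growth:

```python
measured_300gb = 1506          # Proliant QphH on the 300GB test
workload_growth = 3.3          # estimated per-query workload increase, 300GB to 1TB
estimated_1tb = measured_300gb / workload_growth   # about 456 QphH
sun_1tb = 4735                 # reported 24-CPU Sun 6800 score at 1TB
ratio = sun_1tb / estimated_1tb                    # about 10.4 - "ten times faster"
```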

Sun didn't report a 300GB test, but posts a QphH score of 4735 for a 24 processor 6800 on the 1TB test - making that machine seem about ten times faster than the Proliant. On a CPU basis alone this doesn't make sense because the Proliant offers 31% of the raw cycles provided by the Sun 6800; what made the difference was the 9.6GB/Sec data exchange rate for the UltraSparc CPUs in the 6800 versus the 2.4GB/Sec for the Proliant.

Relative Power Ratings on 1TB QphH benchmark
Windows 2000 on 8 Processor Proliant = 456 (estimated)
Sun 6800, 24 CPUs = 4735
Sun 15K, 72 CPUs [Estimated] = 10,735
Data from: http://www.tpc.org/ on 10/12/01

Since the reality is likely to fall between these two extremal cases we can conclude that the Solaris based solution is likely to be feasible with presently available hardware - and by implication that real costs two years from now will be lower than, and solution times shorter than, would be the case if we needed that capability today. We can, furthermore, conclude that the decision on trying to reap the cost savings available from the use of a distributed GRID type solution to running Solver can be postponed until just before we scale up to meet the 100 airplane level - by which time better data on algorithmic performance with respect to the problem as formulated will also be available.

There's a hidden danger to the airline in a big machine solution. The Starfire choice invites downstream mis-management because it makes it easier for the board to eventually choose a completely unsuitable CIO who will then destroy operational efficiency by doing all the things experience has taught him to do - like hierarchical staffing, stove-pipe decision making, isolation of users from technical staff, and the imposition of rigid operational controls. These methods are appropriate to an MVS/XA environment but wholly counterproductive in a Unix one and will, if forcefully applied, first raise costs, then freeze adaptation to external change, and, eventually, kill the airline's ability to compete.

That risk comes about because machines like the Starfire 15K are key pieces in Sun's metamorphosis from Unix guerrilla to data center gorilla. To get the dollars available from mainframe managers to whom a five million dollar machine looks cheap, but who demand that it replicate all of their favorite VM/370 facilities, Sun has added things that use resources pointlessly but enable these people to treat the machines as if they were cheaper, faster mainframes. As a result the board will eventually be looking at resumes from people who are utterly clueless about Unix, user relationships, or making money, but claim expertise with the Starfire along with 25 years or more of "progressively more senior experience" in airline data centers.

This is a much bigger issue than most people believe. Management methods are not independent of technology; the right organizational posture for an MVS based operation is radically different from the right structure for a Unix based solution. Resource limited environments require careful management and control of user access to services. Unix based infrastructures simply don't face those limits and so benefit from staffing strategies, leadership, and working relationships with the user community that are anathema in traditional data shops. Please see my Unix Guide to Defenestration for a detailed discussion of what this means and how it works.

The core solver application and its relationship to both the revenue cycle and operational systems constitute the largest single component of the real-time airline operating system we're considering, but it isn't the only critical piece. SafetyJet will be sold in part on its merit as a safe airline. Part of the safety factor applies to passenger apprehension about hijacks and other malicious action affecting operations. The security systems are, therefore, both extremely sensitive and mission-critical.

The crew security plan functionally requires Solaris - it's one reason each operating office will have a Sun V880 (or 280R) local computer. That's because Solaris lets us use Java cards with SunRays to make user identification both easier and more certain. That capability means that each local center will have locally booting SunRays that run their X environments on the local processor. That machine, in turn, will connect those X servers to the client end running in the data centers. It is the need to make this process foolproof (and secure against man-in-the-middle attacks originating within SafetyJet) that makes the strongest argument for using the simplest possible processing architecture.

One of the things that this makes easy to implement, for example, is crew vouching and verification. The enabling SunRay feature here is that the user's card identifies a session history and is independent of terminal or location. Thus a driver can pull his card from a SunRay in Omaha and resume exactly the same session when he plugs in again in Denver an hour later. Add biometric smarts to the card (in this case, a temperature sensitive embedded fingerprint reader is envisaged) to uniquely identify its owner and the system becomes both hassle free to its users and reasonably secure.
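The session mobility described here amounts to keeping session state on the server, keyed by card ID, so that any terminal can resume it. A minimal sketch with invented names - this models the idea only and is not the actual SunRay protocol:

```python
# Toy model of SunRay-style session mobility: the session lives on the
# server, keyed by the user's card ID, so any terminal can resume it.
# All names and structures here are illustrative.

class SessionServer:
    def __init__(self):
        self._sessions = {}              # card_id -> session state

    def insert_card(self, card_id, terminal):
        # Resume the existing session if one exists, else start fresh.
        session = self._sessions.setdefault(card_id, {"apps": []})
        session["terminal"] = terminal
        return session

    def pull_card(self, card_id):
        # Session state survives on the server; only the display detaches.
        self._sessions[card_id]["terminal"] = None

server = SessionServer()
s = server.insert_card("crew-1042", "omaha-07")
s["apps"].append("crew-schedule")
server.pull_card("crew-1042")                      # unplug in Omaha
s2 = server.insert_card("crew-1042", "denver-21")  # plug in again in Denver
print(s2["apps"])   # ['crew-schedule'] - same session, different terminal
```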

"Reasonably secure" is not, of course, quite good enough for people who are being given charge of 192 passengers and 80,000 pounds of jet fuel, so all team members with flight line responsibilities follow check-in procedures that require them to vouch for the identity of the other people checking in with them.

The security application used for this is one of only two applications in the airline that are not automatically mirrored between the two data centers. In both cases the front end machines used to run the SunRays randomly switch sessions between the two data centers. As a result it is essentially impossible to capture a SafetyJet by stealth - a well designed ground assault can succeed, but the operations control center will be instantly alerted to the problem and the jet won't get off the ground even if the hijackers have trained people on board - because the external wheel locks require that the crew chief get unlock codes from the operations center.

The other such application involves the passenger identification side of security; here the protection is as much against official (and internal) misuse of data as it is against co-ordinated digital and physical attacks.

Each workstation used for pre-departure passenger check-in has both document and portrait cameras. When travel papers are required (e.g. identification for transborder flights) the departure clerk requests the document to copy key information from it into an on-screen form and, to facilitate that, puts it face up in a pre-determined position. As soon as the first field is entered on screen, the computer grabs images of both the document and the passenger for transmission to the customs service in the destination country.

Both images are also stored in our database and matched to passenger information with database access cross-reported to audit services in the two data centers as a control on misuse of the information.

The Ramp-Up Processes

The systems development plan is based on four phases:

  1. initial ramp-up and application development is expected to take about eight months and run in parallel with regulatory and financial negotiations;
  2. the 10 airplane operation is expected to prove the SafetyJet marketing concept and run from six to nine months;
  3. airline ramp-up to 100 airplane operation will take about a year but systems ramp-up to that level must precede the addition of airplanes and destinations by at least a month; and,
  4. team projections on full operation are to be based on a five year financial planning horizon from the start of ramp-up.

At the end of the phase three ramp-up period (18 to 24 months from "go") we want to have the following systems in place:

  1. an Office of the CIO to control and co-ordinate all IT related activities across all three companies;
  2. a total of about 65 IT staff split about equally between two fully redundant data centers. These will be co-located with the main operating offices for the Canadian and American airlines; i.e. most probably in Winnipeg and Minneapolis;
  3. each data center will have a major transactions processor, around ten terabytes of data storage, and the primary solver but whether the OLTP and solver operations run on one big machine or a mid range coupled with a GRID based solver solution will not be decided until the hardware is actually needed;
  4. the data centers will be mutually redundant in real time despite their separation distance. To ensure this, we will use Informix ODS with Tuxedo and route almost all transactions to both centers;
  5. each local operating office will need 40 or so SunRay 17 inch workstations, 8-10 printers, various scanners and their radio based local networks, 8 to 10 of the 21 inch NCDs, and perhaps a dozen PDAs along with a V880 or comparable host;
  6. the distributed call center will grow to about 100 people and use 21 inch NCD smart displays with Sony video cameras in the homes of its staffers. These will have high speed VPN connections to the local operating office host for boot and applications (OpenOffice.org) hosting purposes but open windows to their designated primary data center for reservations access;
    The call center operators will generally work from their homes. As a rule this type of call center suffers from 50% or higher turn-over rates among employees, but we hope to address the causes of that turnover by linking these operators to each other and SafetyJet via video conferencing technologies. The core idea is that people want to work with people - and so tend to quit distributed call centers as soon as they realize that these don't provide "water cooler" type social benefits. With our commitment to always-live video conferencing, we're hoping that people will form sub-communities of their own as they help each other with customers, system issues, or other problems. Quitting would then break those bonds - leading us to think that the telearb system will greatly reduce recruitment and retraining costs on this part of the business.
  7. the major data centers will be connected via dedicated fiber with backup agreements in place for internet backbone use in emergencies while all offices outside the two headquarters will be independently connected to both data centers through two separately contracted international data carriers with no common links on their routes to the centers;
  8. almost all administrative desktops will be NC900 or similar smart displays with 21 inch screens although some, mainly in advertising, will be Mac G4 workstations. All passenger or crew processing stations will use 17 inch SunRay smart displays;
  9. everyone in the company will have on-line access to web pages (done with the statware QC E-server) showing real time quality control charts on key indicators including average service times for the call centers, web sales, and departure desks, on-time performance, and revenue. Critical charts are displayed, live, 24 x 7, on 21 inch flat screens hanging in the entry areas of all executive offices;
  10. full video conferencing gear - to be used for all internal telephony and based on next generation SunForum with the real time video streamer - will be installed on all smart displays; and,
  11. each operations control center will have redundant Ultra 60 or comparable workstations driving high intensity projectors that do nothing but graphically display the real time GPS reported position of all aircraft against a map overlay. If feasible, an on-demand facility will be added to allow display of real time bus positions by metropolitan area.
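The mutual redundancy called for in item 4 above can be sketched as a dual-write with randomized session placement. This is purely illustrative application-level code; the real design relies on Informix ODS with Tuxedo to route transactions to both centers:

```python
# Sketch of mutual data center redundancy: every transaction is applied
# to both centers, while interactive sessions land on a randomly chosen
# center. Illustrative only; the production design uses Informix ODS
# with Tuxedo rather than code like this.

import random

CENTERS = {"winnipeg": [], "minneapolis": []}

def apply_transaction(txn):
    # Both centers get every write, so either can carry the load alone.
    for log in CENTERS.values():
        log.append(txn)

def pick_session_center(rng=random):
    # Random placement also frustrates attacks aimed at a single center.
    return rng.choice(sorted(CENTERS))

apply_transaction({"book": "SJ101", "passenger": "P-1"})
print(CENTERS["winnipeg"] == CENTERS["minneapolis"])   # True
```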

Development Rationale and Processes

The decision to try to develop what amounts to a real time operating system for an airline is a potential business killer and correspondingly questionable. My fundamental arguments for doing it are:

  1. the SafetyJet business model is not a good fit for any established system that we're aware of;
    The one we liked best, built around a core reservation system that provided the foundation for SouthWest's success and is now powering several other airlines to pre-eminence in their markets, seemed to require over 100 ancillary systems and is dependent on MPE/3000 - a product which HP has recently end-of-lifed. The vendor assured us that continuity will not be a problem and we accepted that, but it is my belief that good software reflects both the business it serves and the mentality of its developers. As companies using other major applications ported from world-class integrated database environments (e.g. PICK, OS/400) have demonstrated many times, the magic that makes the thing an unusual success in its original environment is usually missing from the ported product.
  2. you don't get competitive advantage by copying the other guy's software. At best you'll always be at a disadvantage because his people have more experience with it than yours do;
  3. the actual costs and risks are not substantially different from those involved in setting up all of the interfacing and related middleware pieces needed to string together the 100 or more packaged applications we would need to license, and make work, if we went the licensed software route;
  4. winning at the airline game is a matter of gaining and holding many small advantages. A core system built around an all encompassing operational optimization model should, if it works, deliver those better than any other approach can;
  5. if successful, our systems operating cost will be less than 25% of competitor costs and translate to long term competitive advantages for SafetyJet; and,
  6. a successful custom system can, if managed properly, provide greater long term security with lower downstream risk, than any licensed alternative regardless of safeguards.

Development isn't as frightening as it used to be. We're not going to be spending years building a static specification nor will we be spending time and effort on database development, front end development, or any of the internal or network infrastructure - all of these pieces are fully predefined. What we are going to be doing is connecting a complex mixed integer linear programming formulation to a rapid prototyping environment that gets us nearly instant user feedback on functionality.

A lot of the ancillary functions will, furthermore, be licensed from third parties and then run at arm's length for security reasons. None of the document and EDI/Interface management systems will, for example, be either home grown or directly linked to the production systems. Whether licensed or mandated, these systems will run independently of the real-time airline operating system with fully proceduralized manual intervention on any required data transfers.

Internally to the real time system, we see the scheduler driving the revenue cycle just as it drives the expenditure side. In effect, marketing determines a basic set of flight plans well in advance but reality, in the form of weather, Murphy's law, regulatory action, and passenger/freight demand, will determine the actual minute by minute schedule. The core idea, therefore, is to start with a basic solution, add demand and reality information, and then operate all parts of the business according to the best schedule the constraints will allow. In that vision the two high volume sources of the demand and reality information needed are the ticketing and operations pieces - that's one reason they'll be built using a shared database.
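The solve-then-adjust cycle just described can be sketched as a loop that folds reality updates into the constraint set and re-solves. The solver below is a trivial stub standing in for the real mixed integer formulation; all names and numbers are invented:

```python
# Illustrative control loop for the scheduling core: start from the
# marketing-driven base schedule, fold in demand and reality updates,
# and re-solve. The "solver" is a stub; the real system would run a
# mixed integer LP here.

def solve(base_schedule, constraints):
    # Stub: delay any flight under a weather hold by 60 minutes.
    schedule = dict(base_schedule)
    for flight, status in constraints.items():
        if status == "weather-hold":
            schedule[flight] = schedule[flight] + 60
    return schedule

base = {"SJ101": 480, "SJ102": 540}   # departure times, minutes after midnight
constraints = {}
for update in [{"SJ101": "weather-hold"}, {}]:   # stream of reality updates
    constraints.update(update)
    current = solve(base, constraints)           # re-solve on each update
print(current)   # {'SJ101': 540, 'SJ102': 540}
```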

The biggest single problem with developing this system is that we don't have a live airline to practice on. Normally, as described in my Unix Guide to Defenestration, I strongly prefer adaptive prototyping: build something, hand it over, fix or replace it, repeat until perfect. Here, however, we don't have a working airline within which to develop the functional prototypes needed.

This means bootstrapping the system at the same time that we bootstrap the airline. That's even scarier than the development decision by itself - but I don't see many choices. As a result my development plan is to do all three:

  1. the business model;
  2. the system; and,
  3. the airline
at the same time.

To provide a starting point and get some idea of the complexity, I asked two teams, each with three people, one of whom is a constraint programming expert, to develop a structural framework that:

  1. has a common web front end for all users;
  2. handles making, changing, or dropping reservations based on having the customer (or call center agent) pick a starting point, a destination, and a latest arrival time to generate one or more choices from which a decision can be made;
  3. adopts a database structure that allows the solver pre and post processes to pick up constraint data, and write solutions, with an absolute minimum of delay; and,
  4. can provably deal with conflicting reservation attempts; last minute changes; bar code and/or ticket printing errors; missed freight scans; and, incorrect passenger counts, while immediately incorporating resource allocation changes coming from the solver.
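Requirement 2 above - origin, destination, and latest arrival in; choices out - reduces to a filtered query at its simplest. A minimal sketch with invented flight data and field names:

```python
# Sketch of the reservation query: given origin, destination, and a
# latest arrival time, return candidate flights. Flight data and field
# names are invented for illustration.

from datetime import time

FLIGHTS = [
    {"id": "SJ101", "origin": "YWG", "dest": "MSP", "arrives": time(9, 15)},
    {"id": "SJ103", "origin": "YWG", "dest": "MSP", "arrives": time(13, 40)},
    {"id": "SJ201", "origin": "YWG", "dest": "ORD", "arrives": time(11, 5)},
]

def find_choices(origin, dest, latest_arrival):
    """Return flights on the requested route arriving by the deadline."""
    return [f for f in FLIGHTS
            if f["origin"] == origin
            and f["dest"] == dest
            and f["arrives"] <= latest_arrival]

choices = find_choices("YWG", "MSP", time(12, 0))
print([f["id"] for f in choices])   # ['SJ101']
```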

One of the rules I make for this is that nothing will be done as a database stored procedure although triggers are allowed for database products lacking native referential integrity support. This maximizes code re-use and hedges against database licensing schemes based on processor use.

When Oracle imposed processor based licensing recently it became obvious that applications with modularized external code could limit their database use to purely database functions and needed far fewer processor licenses than those which relied on stored procedures. For many companies, that meant millions of dollars in license fees went to Oracle when better application design could have avoided the expense.
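The point is easy to illustrate: with business rules kept in external application code, the database executes only plain SQL. The sketch below uses Python's sqlite3 as a stand-in for the production database, with an invented schema and rule:

```python
# Illustration of the "no stored procedures" rule: the business rule
# (don't oversell a flight) lives in external application code, and the
# database sees only plain SQL. Schema and rule are invented; sqlite3
# stands in for the production database.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE seats (flight TEXT, sold INTEGER, capacity INTEGER)")
conn.execute("INSERT INTO seats VALUES ('SJ101', 148, 150)")

def book_seat(conn, flight):
    """Business rule enforced here, not in a stored procedure."""
    sold, cap = conn.execute(
        "SELECT sold, capacity FROM seats WHERE flight = ?", (flight,)
    ).fetchone()
    if sold >= cap:
        return False
    conn.execute("UPDATE seats SET sold = sold + 1 WHERE flight = ?", (flight,))
    return True

print(book_seat(conn, "SJ101"))   # True  (149th seat of 150)
print(book_seat(conn, "SJ101"))   # True  (150th)
print(book_seat(conn, "SJ101"))   # False (sold out)
```

Because the logic is ordinary application code, it runs on whatever machine the application runs on and can be reused unchanged against any SQL database - which is exactly what keeps the processor-license count down.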

Compared to the real thing, the result was trivial but the winning team's 16 day masterpiece did let us simulate a multiple airport, multiple user reservation system and demonstrate direct linkage, for both passengers and freight, to both the solver and operations control application components. Interestingly, but uninformatively, their demo solver set-up ran LINDO solutions for a 25,000 row, 80,000 column problem in less than two seconds using an 8GB, dual 400MHz, Sun 450 with locally attached disk.
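For readers unfamiliar with the shape of such problems, here is a toy assignment version solved by brute force. The invented numbers are nothing like the 25,000 row production formulation, which is far beyond enumeration and needs a real solver:

```python
# Toy stand-in for the solver demo: assign aircraft to routes to
# maximize revenue, solved by brute-force enumeration. All data is
# invented; real formulations of this kind need an LP/MILP solver.

from itertools import permutations

aircraft = ["A1", "A2", "A3"]
routes = ["YWG-MSP", "MSP-ORD", "YWG-ORD"]
revenue = {                               # revenue[aircraft][route], $000
    "A1": {"YWG-MSP": 40, "MSP-ORD": 55, "YWG-ORD": 30},
    "A2": {"YWG-MSP": 35, "MSP-ORD": 60, "YWG-ORD": 50},
    "A3": {"YWG-MSP": 45, "MSP-ORD": 40, "YWG-ORD": 55},
}

def total(assignment):
    return sum(revenue[a][r] for a, r in zip(aircraft, assignment))

best = max(permutations(routes), key=total)
print(dict(zip(aircraft, best)), total(best))   # best assignment totals 155
```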

The winning team used Unify Vision and ran the demo on U2K rather than Informix ODS, but the tools are database independent and scale, within the Unix/Smart display architecture, essentially linearly as you add users.

On a personal basis I have used this product set for demos since about 1988, but have yet to have a client adopt it for serious development work and production. Without exception, they've declared the company likely to go under real soon now and bought into better marketed products that died or were absorbed by Microsoft or CA - while Unify went right on making first class products.

I've never understood this - just as I don't now understand people buying Microsoft Office rather than adopting OpenOffice.org because the fact that it costs big bucks gives them more faith in its future. If you can explain it without using words like "stupid", please send me a note!

Combined with the results obtained by the Forte/iPlanet team, we developed a consensus that perfect execution would let us tie the licensed pieces together under one Informix ODS based application with less than sixteen man years of effort and four major adaptive iterations of the core user interaction pieces. We'll probably never find out how realistic that is, but it produced the brilliantly indefensible two million dollar estimate that went into the subsequent business plan.

Staffing

Finding qualified staff isn't easy. By the time we're done we'll need about 65 people, 60 of them regular IT staff, and the other five comprising the office of the CIO.

Many of these people need special skills, particularly those who will work with Boeing or FAA/DOT mandated systems, the solver, and the development tools we'll be selecting. Our general recruiting strategy is to recognize that we don't need them all today, start with a few good people, and let those people recruit others for us.

Two of our team members, for example, are full time professors - one of whom teaches several OR courses - and we'll count on them to recruit some promising graduate students for us. For other start-up jobs we will use existing team members and people recruited via advertisements in the two metropolitan areas we'll be headquartering in. The key decision here, however, is to hire good people as we come across them rather than waiting for the actual jobs to materialize - at worst, we can always sell their services to third parties as consultants.

Once people are in place we'll adopt the usual Unix strategies:

  1. everyone gets assigned basic personal responsibilities that require some expertise, initiative, and commitment to carry out;
  2. people are encouraged to work directly with users and act as user advocates within the data center; and,
  3. people are encouraged to swap themselves into, and out of, whatever project team needs, or doesn't need, their expertise;
    Team members can delegate jobs, but not responsibility. Someone hired as a DBA can, for example, delegate some tasks to someone previously limited to Solaris sysadmin work but retain full personal responsibility for database integrity and availability - while actually spending most of his time helping to get documents to display properly on PDAs carried by mechanics. Done properly, this combination of personal responsibility with tremendous job flexibility produces highly productive, happier workplaces - and graduates who become CIOs in later life.

As usual, there are two key goals to this:

  1. to ensure that users, not systems management, dictate most systems decisions; and,
  2. to ensure that systems people get cross-trained on just about everything in the operation but never lose their sense of personal responsibility for one or more key functions.
Capital Cost

The full capital spending plan has over 600 entries ranging from a $50,000 software license with half a million in installation and formulation support services, to a set of $225 stackable hubs. Actual expenditures are scheduled to occur in parallel with the airline bootstrap operation. The first six months of work, for example, will take place in one center with only a Sun V880 server and some workstations on a local network.

To stay on schedule we will need, however, to commit to some steps - like data center construction or CPLEX licensing - well before the matching approvals and go-aheads come in. To reduce our exposure we hope, therefore, to negotiate risk sharing arrangements with some key vendors (e.g. Sun) under which we can cancel receipt or delay payment at the last minute if regulatory approvals are denied.

The first five months will see us spend about $1.8 million on hardware, licenses, and staff. The next three or four, exclusive of contractor commitments, will about double that but the major expenditures start about 60 days before the first airplane arrives. During those last 60 days all the networks and servers will go in while the last of the staff are brought up to speed in both data centers and last minute panic sets in. We'll blow through something like 9 million during that period - real money, but much less than what the airline will spend on reconditioned highway buses, departure desk installations, or airport facility re-construction.


Key IT Goals
  1. Provide continuous operations information in near real-time
  2. Provide highly scalable Infrastructure
  3. Build long term staff skills and loyalties
  4. 0% down time no matter what
  5. Keep costs to a minimum
Approximate Budgeted Costs (in $000) for six month operating period, by phase1

                                           Pre-Operation   10 Airplane       100 Airplane
                                                           Operations        Operations
Six months staff cost (number of people)   $825 (22)       $1,500 (40)       $2,438 (65)
Licensed software and services             $540            $1,700            $1,800
OLTP/Solver processor (type)               $130 (V880)     $2,290            $11,700
                                                           (4 x 6800; 10TB)  (2 x 15K; 2 x 10TB)2
Typical local operations office
  (40 SunRays, 1 V880, 10 NCD NC900s,
  10 printers, 8 barcode reader/scanners
  & redundant wireless hubs, 15 PDAs)
  (number needed)                          $120 (1)        $2,400 (25)       $19,200 (160)
Operations center                          $40             $190              $190
Head office operations                     $160            $120              $200
Call center rollout
  (NC900 with video gear & DSLPipes)       $0              $60               $240
Network operations                         $15             $550              $900
Total cash cost                            $1,830          $8,810            $36,668
1Current costs are from the vendor web sites as of Nov 11/01.
2This is the worst case outcome. If Solver runs fast enough on the 6800, no upgrade may be needed. A mid range option adds a 256 CPU Linux/Solaris GRID for about $4 million less than adding two Starfires.

During the initial operational phase:

  1. both data centers will have dual Sun 6800 machines, each with 24 CPUs, 128GB of RAM, and about 10TB of on-line storage;
  2. the 20 local operating offices will account for about 800 of the 17" SunRays, 200 printers, 20 wireless barcode reader/scanner networks, about 100 NCD 21" smart displays, and just over 200 handhelds;
  3. there will be about 20 of the 21 inch NCDs allocated to the distributed call center, each with a DSLPipe or comparable VPN capable router;
  4. the two administrative offices will each have about 80 of the NC900 smart displays, 40 or so Sunrays, and a handful of Mac G4 workstations;
  5. we will need about 90 IT workstations, mostly NC900 smart displays but including about a half dozen Model 60 graphics displays and a few Sun Blades for higher speed applications; and,
  6. we will need a Windows 2000 server and perhaps as many as a dozen PCs - and a full time support person - to handle government and vendor mandated communications functions requiring Windows clients as well as format conversions for incoming Windows file formats.

Supporting all this will take about 18 people in each center and three in the office of the CIO. That will grow rapidly to about 65 when we ramp up for full operation - well in advance of the arrival of more airplanes and approvals. At that time:

  1. each center will need about 9 people for basic network and related "power-on" activities;
  2. the development teams which started as:
    As a matter of policy all code will be written twice, once by each team, with adoption decided after users, and other developers, have exercised both sets of changes to the evolving prototype;
    • two prototyping teams of seven people and a full time joint development co-ordinator; and,
    • an independently chartered development team of two will work on the statistical quality control and geophysical reporting applications. This team will cover both data centers with one member posted at each center;
    will carry on refining and improving the system;
  3. the major document management sub-systems will have staff dedicated to their care and feeding; adding three more bodies to each center;
  4. the tax compliance team working within finance will have dedicated IT support in each data center - adding two more people to the total;
  5. the EDI and other third party data exchange systems will have at least one person specifically assigned to them in each center;
  6. each data center will need at least one person to babysit the few Windows Operating systems on PCs needed to deal with incoming file conversions and meet mandated client environments for regulatory and supplier notifications;
  7. the user support team will comprise about four people in each center. Their role will not, however, be that of traditional help desk staff because SafetyJet will not deploy Windows to users and so needs no support staff in that role. What these people will do, instead, is install local office gear, provide training and related support, and liaise between local office users and the data centers. These people, in other words, will primarily act as roving ambassadors and teachers;
  8. the audit and security group will be independently staffed and report directly to the president and board but be housed within the two data centers. Their staff will grow from one person at each location during startup to about one person for each three to five destinations as SafetyJet matures;

for a total complement of about 65 IT people and perhaps as many as 20 others who work out of the data centers but are not carried on the Systems budget.

Getting these people in place will be the hardest part of the scale-up process needed to sustain 100 airplane operation. The technology part is made easy by the SPARC architecture which lets us run the same code on anything from a single user workstation to a 15K.

Depending on the solver decision, scaling up will involve either:

  1. adding a Starfire 15K (or its descendant) to each data center, migrating the RTAOS [real time airline operating system] applications from the two 6800s to the new machine and dedicating the former to backup and experimental processing roles; or,
  2. migrating the solver component of the RTAOS from the 6800s to the Linux (or Solaris) GRID;

but no code or significant procedural change within the overall system beyond that needed to accommodate the solver solution chosen. Really, when you get right down to it, that's the best thing about SPARC - no code, staff, or skill changes as you scale up from next to nothing to the biggest box anyone offers.