Draft Blog Entries

% fortune -ae paul murphy

From Chapter one: Data Processing and the IBM Mainframe

This is the 7th excerpt from the second book in the Defen series: BIT: Business Information Technology: Foundations, Infrastructure, and Culture

Note that the section this is taken from, on the evolution of the data processing culture, includes numerous illustrations and note tables omitted here.

Roots (Part five: CICS/IMS and SDM)

One problem in particular seemed to bedevil customers trying to develop their own new systems: keeping some card images in memory while retrieving from disk the out-of-order information needed to complete their processing, without stopping all other processing while waiting for the disk to respond.

In this kind of application it was the card ordering, specifically the inability to presort records, that created the most difficulty. Think back to the earlier example involving processing medical claims and notice that efficiency there depended on having the records arrive in a known order. In an on-line query environment, however, you can't predict the order in which data will be needed and therefore need the ability to store some data in memory while getting other information from disk.

In effect the process:

Read some cards
Determine there's missing information
Issue a request to the disk handler to get that information
Park the cards in memory
Read some other cards
Process them until
The disk handler announces availability of the missing information
Park the current cards in memory
Finish processing the first set and write them out
Take up the second set again

came up for hundreds of customers, most of whom were defeated by the complexities involved in doing this effectively in environments where you could get multiple waits and multiple interrupts.
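For readers who think better in code, here is a minimal sketch of that park-and-resume loop in modern Python. It is an illustration only, not how CICS or any period system was written: the DISK table, the transaction names, and the fixed disk_latency are all made up, and a "tick" simply stands for one unit of work.

```python
from collections import deque

# Hypothetical "disk": account details keyed by account id (made-up data).
DISK = {"A17": "balance 40.00", "B02": "balance 12.50"}

def run(transactions, disk_latency=2):
    """Process transactions in arrival order, parking any that need disk data.

    Each transaction is (name, account_id or None). A disk read is assumed
    to take `disk_latency` ticks; other work continues while it is pending.
    """
    parked = deque()     # (ready_tick, name, account) held in memory
    finished = []        # (name, data) in order of completion
    queue = deque(transactions)
    tick = 0
    while queue or parked:
        # The disk handler announces that requested data is now available:
        # finish the parked work before taking up anything new.
        while parked and parked[0][0] <= tick:
            _, name, account = parked.popleft()
            finished.append((name, DISK[account]))
        if queue:
            name, account = queue.popleft()
            if account is None:
                finished.append((name, None))   # nothing missing, just process
            else:
                # Issue the disk request and park the work in memory.
                parked.append((tick + disk_latency, name, account))
        tick += 1
    return finished

result = run([("lookup-1", "A17"), ("batch-1", None),
              ("batch-2", None), ("lookup-2", "B02")])
```

Note that lookup-1 completes after batch-1 even though it arrived first. With a single wait that is easy to follow; the difficulty described above comes from having many waits and interrupts outstanding at once.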

For example, several IBM customers in the utilities billing segment wanted to use terminals (teletypes; the 327X CRT (monitor) didn't appear until 1972) to allow customer account look-up but, of course, lookups couldn't be sorted in advance and so incurred this problem.

In response IBM developed a sample application to demonstrate how software could be built to address this need. Originally called "Public Utility Customer Information Control System," this quickly became known as CICS (Customer Information Control System) and was later transformed into one of the primary support pillars of mainframe processing.

CICS was one of three products (the others were GIS and IMS) used to launch IBM Program Products, the commercial software business spawned when IBM separated Engineering Support from Software Support in June of 1969.

Hierarchal databases

In MS-DOS and early versions of Windows, files were organized in directories and they are still shown that way today.
Thus:
c:\windows\office\myfiles
is a hierarchy that lets you find your files in the "myfiles" directory.
In a hierarchal database, such as an early PC file system, each directory can contain a number of "child" directories - c:\windows\office typically contains at least a dozen subdirectories and hundreds of files.
Your "myfiles" directory can, similarly, contain many "child" records but it, like all directories, can only have one parent, or source directory - in this case "office".
That's the essence of the hierarchal database: parent records can have many child records but a child record can have only one parent.
The point of this is query efficiency. Since you can pre-define the path to any record, the computer need not spend memory or CPU resources on this task.
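The one-parent rule and the predefined path can be sketched in a few lines of Python. This is only an illustration of the idea using the directory example above; the Record class and the find function are hypothetical names, and nothing here reflects how IMS actually stores its segments.

```python
class Record:
    """A node in a hierarchy: any number of children, at most one parent."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent          # a single parent, or None for the root
        self.children = {}            # child name -> Record
        if parent is not None:
            parent.children[name] = self

    def path(self):
        """Walk up the single-parent chain to rebuild the full path."""
        parts = []
        node = self
        while node is not None:
            parts.append(node.name)
            node = node.parent
        return "\\".join(reversed(parts))

def find(root, path):
    """Follow a predefined path from the root, one lookup per level."""
    names = path.split("\\")
    if names[0] != root.name:
        raise KeyError(path)
    node = root
    for name in names[1:]:
        node = node.children[name]    # no searching, just the next step down
    return node

root = Record("c:")
windows = Record("windows", root)
office = Record("office", windows)
myfiles = Record("myfiles", office)

found = find(root, "c:\\windows\\office\\myfiles")
```

Because each record has exactly one parent, the path to it is unambiguous and retrieval costs one step per level, which is the query efficiency referred to above.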

IMS was an advance on tools available with DOS/360 - which had included a keyed sequential access [KSAM] file manager through which users could read a batch of records from tape to disk, index them, and then retrieve them by index value. IMS, first released as Information Control System (ICS) and Data Language/Index (DL/I), took this several steps further and became the first IBM mainframe product to separate database management from applications code and so address a number of very important systems design problems.

Both IMS and CICS used strong partitions in memory to achieve efficient and effective operation while maintaining separation from customer written code accessing CICS/IMS facilities. In effect a customer using these tools would dedicate CPU, disk, and memory resources to them and then treat those resources as being on a separate machine that could interact with customer written code but not be modified by it.

The IMS hierarchal data model is singularly efficient and extremely well suited to simple query applications. As a result the CICS/IMS combination rapidly became a mainstay of IBM transaction processing and is currently (mid 2008) in production use by well over 12,000 mainframe data centers worldwide.

IMS introduced the concept of the hierarchal database and terminology like "entity-relationship" to the mainframe community. In its earliest incarnations the "entity-relationship" concept simply specified the parent-child relationship between data items like payroll hours and external entities like employees.

Over time, however, accretion set in and, by the mid seventies, whole disciplines and numerous books had grown up around entity-relationship analysis and diagramming.

Even today many data centers, although now using relational databases (see Chapter Five) to which these concepts do not apply, continue to use entity-relationship diagramming techniques as basic components of their process models. In the early days, however, the concepts involved did apply to the technology used and that gave them a natural role in the evolution of systems development life cycle methodologies such as SDM70 (Systems Development Methodology, 1970).

In general these followed a sequence first formally described in 1970 (Royce, W.W., Managing the Development of Large Software Systems, WESCON, San Francisco, 1970) and known as the waterfall model for obvious visual reasons.

Later variations typically broke up the steps shown here to allow for increased specialization, added budget and approval management steps, and/or put user feedback loops between two or more steps; but the basic model continues to dominate most mainframe system development work today.

Essentially all of these SDLC/SDM models share the same basic structural assumptions about how automatic data processing is done. These include:

Application software is home grown, not purchased or licensed;
Each stage of a development cycle is independent of the other stages except for its position in the development pipeline or cycle;
Individual steps are handled at the technical level and therefore by lower ranked staff;
Management control is exerted by focusing on process and (paper) deliverables, not outcomes.
Reviews therefore take place at project milestones - typically transitions from one stage in the pipeline or cycle to the next; and,
End users are involved up front in the specifications process and at the end as users, but not in between.

The underlying logical model here is the same as that of batch processing - an application consists of multiple independent batch jobs done in the right order - and reflects two generally unstated assumptions. These are:

The durability assumption; and,
The clear channel assumption.

Business cases developed to obtain budgetary approval for large projects often envisaged three to five year development periods followed by ten and fifteen year run-time periods. Implicitly this assumes that requirements are sufficiently well understood, and stable, to be specified years in advance of system delivery and then remain sufficiently stable to be addressed only through maintenance for an extended run-time or deployment period.

CICS and IMS were successes, but the lead product got ahead of its market.

The lead product in the set of three offered at the announcement of IBM's first commercial software division in 1969 was called GIS: Generalized Information System.
One of its key designers, E. F. Codd, went on to define the relational database (see: A Relational Model of Data for Large Shared Data Banks, CACM 13, 6, June 1970), but GIS itself, although (or because) the most advanced of the three, did not receive widespread acceptance.

Since the assumed stability is not generally found in any organization, whether business or government, this assumption led to massive internal conflicts as systems groups tried to force business stability by maintaining systems stability. In many cases the Systems dog was able to wag the Business tail in this way mainly because it reported through Finance and was therefore isolated from both business change and user reaction.

Thus the underlying focus on automated clerking had structural consequences which often allowed the data center to use its budget as a means of putting itself squarely in place as a blocking force opposing business change - thereby nicely creating the antagonisms that set the stage for the later emergence of the PC mono-culture discussed in Chapter Three of this book.

The "clear channel" assumption about communications clarity means that all the players are assumed to share a common vocabulary and understand data and process labels to mean exactly the same things. Unfortunately, this "clear channel" doesn't exist in the real world any more than long term stability does.

In spoken or written English a little sloppiness isn't generally detrimental to communications; if I write about hospital beds, it doesn't really matter if half the readers think of those in the context of physical objects for people to sleep on while most of the others consider the term an accounting fiction - we all share some approximate sense of what we mean by standard terms like "bed" or "general voucher" or "adjusting entry", and can adapt on the fly to differences in meaning as those become clear from the context.

That sloppiness about meanings isn't acceptable in computer programming. If we're not talking about exactly the same hospital bed, and exactly the same processes around it, a program you write will reflect your understanding of the term and not meet my understanding of my needs.

Worse, after a few rounds of mutual incomprehension, we won't be on speaking terms any more.

This problem isn't limited to communications between systems people and user staff. Talk about vouchers to someone in Payroll and you won't be sharing any understanding of the term as it is used in Accounts Payable; and, of course, the GL people have their own interpretation - but almost every single one of the people the systems analysts talk to about vouchers will first assume that everyone knows what a voucher is and then decry the analyst's abilities for not understanding the obvious.

Initially applications were built independently of each other and the inter-disciplinary communications problems within the business didn't matter much. A GL analyst learned what a voucher was to local GL users and adjusted IBM's sample code to incorporate that local meaning. The payroll analyst did the same, but since they were working on different systems, the fact that their definitions were different didn't matter.

Of course, once you have both a working GL and a working Payroll, it starts to seem unreasonable to print out the payroll totals only to retype them for entry into the GL. Even if these aren't integrated in the sense of using the same database and logic, surely - people thought - it should be easy to interface the two by building something which stored the general journal entries from payroll and then read them into the GL?

As early as the mid sixties people who had used IBM or other code to get basic GL and Payroll systems running were trying to get such automated interfaces between them to work - but that meant sharing data and, more importantly, data definitions, between two or more systems. That, in turn, meant trying to get people who were certain they already knew what terms like "bed" or "voucher" meant, to understand that they actually didn't know.

In practice therefore integration was achieved mainly by continuing batch scheduling practices from the 1920s rather than through data management or software. In this approach jobs are separate but the programmer building the second job knows the format of the file output by the first. Although the output format has nothing to do with naming standards, definitions, or data flow, this information is sufficient to allow the programmer to read data from the file, assign his own names and structures, and so develop something that works. As long as the second guy's output format is documented, the programmer for the third step can do much the same, and so on.

At run time the dependencies are therefore handled by the scheduler - since job 1 has to run before job 2 and so on - just as they would be by program flow in an integrated application.
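A small Python sketch may make the file-format hand-off concrete. Everything specific in it is assumed for illustration: the fixed-width layout, the account codes, the amounts, and the function names. The point is only that the second job shares nothing with the first except column positions and run order.

```python
import os
import tempfile

# Assumed layout, agreed only as a file format (hypothetical):
# columns 0-5 hold an account code, columns 6-15 an amount in cents.

def payroll_job(path):
    """Job 1: write journal totals in the agreed fixed-width layout."""
    totals = [("510000", 1250000), ("220000", 310000)]  # made-up figures
    with open(path, "w") as out:
        for account, cents in totals:
            out.write(f"{account:<6}{cents:>10}\n")

def gl_job(path):
    """Job 2: read by column position only, assigning its own names."""
    ledger = {}
    with open(path) as src:
        for line in src:
            gl_code = line[0:6].strip()     # the GL programmer's name for it
            posting = int(line[6:16])       # likewise for the amount
            ledger[gl_code] = ledger.get(gl_code, 0) + posting
    return ledger

def run_schedule(workdir):
    """The 'scheduler': the dependency is expressed purely as run order."""
    handoff = os.path.join(workdir, "payroll.out")
    payroll_job(handoff)        # job 1 must finish first
    return gl_job(handoff)      # job 2 then reads what job 1 left behind

with tempfile.TemporaryDirectory() as workdir:
    ledger = run_schedule(workdir)
```

Nothing in gl_job says what an "account" or a "posting" means to Payroll. It works as long as the columns line up, which is exactly why this approach sidesteps the definitions problem rather than solving it.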

---

Some notes:

These excerpts don't include footnotes and most illustrations have been dropped as simply too hard to insert correctly. (The wordpress html "editor" as used here enables a limited html subset and is implemented to force frustrations like the CP/M line delimiters from MS-DOS).
The feedback I'm looking for is what you guys do best: call me on mistakes, add thoughts/corrections on stuff I've missed or gotten wrong, and generally help make the thing better.
Notice that getting the facts right is particularly important for BIT - and that the length of the thing plus the complexity of the terminology and ideas introduced suggest that any explanatory anecdotes anyone may want to contribute could be valuable.
When I make changes suggested in the comments, I make those changes only in the original, not in the excerpts reproduced here.

Paul Murphy wrote and published The Unix Guide to Defenestration. Murphy is a 25-year veteran of the I.T. consulting industry, specializing in Unix and Unix-related management issues.