% fortune -ae paul murphy

Comparing code bases

The Swartz analysis I discussed Monday reflected an attempt to compare the Red Hat 5.2 source to other code bases, including those for SCO OpenServer. To the extent that the documents we have describe the methodology, it appears to have consisted of two phases: first an automated review looking for line-by-line similarities, and then a manual review of the files found to share lines.
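To make the first phase concrete, here is a minimal sketch in Python of what an automated line-by-line review might look like. It is not the tooling Swartz used, which the documents don't describe in detail; the function names, the whitespace normalisation, and the `min_length` cutoff are all assumptions made for illustration.

```python
from collections import defaultdict


def normalize(line):
    """Collapse runs of whitespace so reformatting alone doesn't hide a match."""
    return " ".join(line.split())


def shared_lines(files_a, files_b, min_length=10):
    """Return {(name_a, name_b): set of normalized lines found in both files}.

    files_a and files_b are dicts mapping a file name to its source text.
    min_length is an assumed cutoff that skips trivial lines such as '}'.
    """
    # Index every non-trivial line of the second code base by its text.
    index = defaultdict(set)
    for name_b, text in files_b.items():
        for line in text.splitlines():
            norm = normalize(line)
            if len(norm) >= min_length:
                index[norm].add(name_b)

    # Look up each line of the first code base in that index.
    hits = defaultdict(set)
    for name_a, text in files_a.items():
        for line in text.splitlines():
            norm = normalize(line)
            for name_b in index.get(norm, ()):
                hits[(name_a, name_b)].add(norm)
    return dict(hits)
```

The file pairs this returns are only candidates: in the two-phase method described above, deciding whether shared lines mean anything is left to the manual review.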

So is this methodology adequate?

It is clearly not adequate for legal purposes - that's the main thing to be learnt from a denunciation, presumably written by IBM's lawyers but attributed to Brian Kernighan, of apparently similar work done by Sandeep Gupta for SCO. In its more substantive portions this cites a legal standard for such comparisons derived from "Gates Rubber v. Bando, 9 F.3d 823 (10th Circuit, 1993)" and claims, with apparent credibility, that Gupta's work does not meet it - and, by extension, that Swartz's wouldn't either.

Suppose, however, that we want to answer a more general question: ignoring the issues raised by case law, how can such comparisons be done fairly?

In its most general form this challenge may be stated as: given a signal, reproduce all of its information content in another form. (That way, two signals could be converted to a standard form and their information content fairly compared.) Since there's a tremendous amount of theoretical research being done on this, I believe that, in the long run, something akin to Star Trek's universal translator will solve this problem - and obsolete all forms of cryptology.

In the short run the right way to do this for two Unix code bases might be to focus on the details of the algorithms used instead of the code - by, for example, partial compilation and comparison of the intermediate tree structures after aggressive optimisation.
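As a rough illustration of comparing tree structures rather than text, the sketch below uses Python's standard `ast` module on Python source. That is a stand-in of my own choosing: a real comparison of two Unix code bases would need a C front end and an optimiser, neither of which is attempted here. The helper names `tree_fingerprint` and `same_structure` are hypothetical.

```python
import ast


class _Canonicalize(ast.NodeTransformer):
    """Replace every identifier with a placeholder numbered by first use."""

    def __init__(self):
        self.names = {}

    def _alias(self, name):
        # The first distinct name becomes v0, the next v1, and so on.
        return self.names.setdefault(name, f"v{len(self.names)}")

    def visit_Name(self, node):
        node.id = self._alias(node.id)
        return node

    def visit_arg(self, node):
        node.arg = self._alias(node.arg)
        return node

    def visit_FunctionDef(self, node):
        node.name = self._alias(node.name)
        self.generic_visit(node)
        return node


def tree_fingerprint(source):
    """Parse source and dump its syntax tree with identifiers canonicalized."""
    tree = _Canonicalize().visit(ast.parse(source))
    # ast.dump omits line and column numbers by default, so layout is ignored.
    return ast.dump(tree, annotate_fields=False)


def same_structure(source_a, source_b):
    """True when two sources differ only in naming, comments, and layout."""
    return tree_fingerprint(source_a) == tree_fingerprint(source_b)
```

Two functions that differ only in variable names and comments come out identical under this test, which is exactly the kind of rewrite a line-by-line comparison misses. The renaming is deliberately blunt, though: it also hides which library routines are called, so a match is a reason to look closer, not a finding.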

Another way might be to forget about the code itself and look, instead, at what the product does, or doesn't do: comparing Linux 2.4, for example, to Sys VR4.3 to see where, if anywhere, differences in the results produced by the two systems suggest differences in their construction.
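In code, this black-box idea amounts to feeding the same probes to both systems and recording where the answers diverge. The sketch below is an assumption-laden toy: it treats each system as a Python callable, and the name `behavioural_differences` is mine. Comparing two real operating systems would mean running commands or system calls on each and capturing their output instead.

```python
def behavioural_differences(impl_a, impl_b, probes):
    """Run both implementations on every probe and list the disagreements.

    Each entry is (probe, outcome_a, outcome_b), where an outcome is either
    ("ok", result) or ("error", exception class name).
    """
    differences = []
    for probe in probes:
        outcomes = []
        for impl in (impl_a, impl_b):
            try:
                outcomes.append(("ok", impl(probe)))
            except Exception as exc:
                # A failure is behaviour too, so record it rather than abort.
                outcomes.append(("error", type(exc).__name__))
        if outcomes[0] != outcomes[1]:
            differences.append((probe, outcomes[0], outcomes[1]))
    return differences
```

The interesting probes are the odd corners: shared quirks and bugs hint at common ancestry, while differences in those corners suggest independent construction.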

Although fundamentally inadequate for legal purposes, this approach should be more than adequate for technical people wanting to form their own opinions - a fact of which I was reminded by the Swartz report because I think this may well be the same Bob Swartz mentioned by Dennis Ritchie in a very similar, but much earlier, context.

Here's the complete entry posted by Dennis Ritchie to alt.folklore.computers in early 1998:

An anecdote: sometime fairly early after the Mark Williams company started offering their Coherent system (a Unix clone), some AT&T legal people asked me to visit Mark Williams for purposes of determining whether what they were offering was a rip-off (i.e. essentially a copy) of the currently licensed Unix done by us. I find it hard to reconstruct the date this happened, but it was a long time ago; probably early 1980s. I went to Chicago with Otis Wilson, who was then involved in Unix licensing.

It was a rather strange experience. The Mark Williams company was a paint producer, and I was given to understand that the subsidiary that was doing Coherent was, approximately, a corporation arranged by a father who, approaching retirement, had more or less shut down the older business and was using the corporate name and legal setup to help his son in a new venture.

Otis and I visited the offices of Mark Williams on the outskirts of Chicago and were received with courtesy and some deference. We talked to the father and the son (Bob Swartz, i.e. the guy behind Coherent). There had been communication before, and from their point of view we were like the IRS auditors coming in. From my point of view, I felt the same, except that playing that role was a new, and not particularly welcome, experience. The locale of the company was in an industrial section and it definitely retained the flavor of the offices of a paint company being recycled.

What I actually did was to play around with Coherent and look for peculiarities, bugs, etc. that I knew about in the Unix distributions of the time. Whatever legal stuff had been talked about in the letters between MWC and AT&T didn't allow us to look at their source. I'd made some notes about things to look for.

I concluded two things:

First, that it was very hard to believe that Coherent and its basic applications were not created without considerable study of the OS code and details of its applications.

Second, that looking at various corners convinced me that I couldn't find anything that was copied. It might have been that some parts were written with our source nearby, but at least the effort had been made to rewrite. If it came to it, I could never honestly testify that my opinion was that what they generated was irreproducible from the manual.

I wrote up a detailed description of this. I can't find it, probably because at the time I was advised that it was privileged lawyer/client material. Partly at the time, partly thereafter, I learned that a variety of Unix enthusiasts (several from U. Toronto) had spent time there.

In the event, "we" (=AT&T) backed off, possibly after other thinking and investigation that I wasn't involved in.

So far as I know, after that MWC and Coherent were free to offer their system and allow it to succeed or fail in the market.

I suppose there's a second story about the suit by USL against BSDI and then UCB, but my own involvement was far tinier and didn't get me a trip to Falls Church or Berkeley to snoop. What advice I offered in this situation was exactly in line with that about MWC/Coherent, and as it turned out the resolution (though more costly for all) was pretty much the same.

(As a capper, Bob Swartz came by Bell Labs a week or so ago, and we had a pleasant social visit.)

The right way to handle things.


Paul Murphy wrote and published The Unix Guide to Defenestration. Murphy is a 25-year veteran of the I.T. consulting industry, specializing in Unix and Unix-related management issues.