% fortune -ae paul murphy

Out-of-context patterns

Most non-textual business data reflects the sum of multiple overlapping cyclic patterns - for example, the base level of retail demand for toilet paper is distorted before and after major holidays because people overstock in anticipation of holiday guests.
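
To make the "sum of cyclic patterns" idea concrete, here is a minimal Python sketch. Everything in it is invented for illustration: the base level of 100, the weekly offsets, and the annual sine wave are assumptions, not real retail data, and the function names are hypothetical. It builds a series as base + weekly cycle + annual cycle, then recovers the weekly cycle by subtracting a centered seven-day average.

```python
import math

# Invented day-of-week offsets (assumption); they sum to zero so that any
# seven consecutive days cancel the weekly cycle exactly.
WEEKLY = [-5.0, -3.0, -1.0, 0.0, 2.0, 4.0, 3.0]


def synthetic_demand(days=364):
    """Synthetic daily demand: base level + weekly cycle + annual cycle."""
    return [
        100.0 + WEEKLY[d % 7] + 20.0 * math.sin(2 * math.pi * d / 364)
        for d in range(days)
    ]


def centered_week_average(series):
    """Centered 7-day mean for every index that has 3 neighbours each side."""
    return {i: sum(series[i - 3:i + 4]) / 7 for i in range(3, len(series) - 3)}


def weekly_profile(series):
    """Estimate the day-of-week effect as the mean residual per weekday."""
    trend = centered_week_average(series)
    buckets = [[] for _ in range(7)]
    for i, level in trend.items():
        buckets[i % 7].append(series[i] - level)
    return [sum(b) / len(b) for b in buckets]


if __name__ == "__main__":
    for day, effect in enumerate(weekly_profile(synthetic_demand())):
        print(day, round(effect, 2))
```

This only works because the cycle length (seven days) is known in advance. A holiday distortion like the toilet paper one would need its own calendar-aligned term, and the sketch makes no attempt at that.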

Data mining and combinatorial pattern matching methods (see StatSoft's online text for a quick overview) are good at spotting these kinds of patterns - discovering, for example, that beer, diapers, and sanitary napkins are likely to be bought together, and refining that pattern to help retailers understand the impact that local employment and ethnic factors have on things like optimal staffing schedules, store layouts, or product display strategies.

In many cases, however, the critical pattern actually isn't in the data directly - there's information about it, but only in shadows cast by the interaction of otherwise unrelated patterns in the data you do have.

In Alberta, for example, tickets issued by speed cameras during the early hours of winter darkness are, on average, for lesser offences than tickets issued on the same roads during those same hours in summertime. Oddly, however, traffic tickets issued by police show the opposite phenomenon: people are charged, on average, with breaking the speed limits by greater margins during the early hours of winter darkness than during those same hours in summer.

The crucial bit of data needed to understand what's going on here isn't in the database - although, as coincidence would have it, it was an unrelated review of the increasing importance of daylighting (using natural light inside stores) in retailing that brought this issue into the foreground. It's simply this: changes in ambient light drive changes in risk perception. On the highways that means the average driver generally speeds less in the dark, while police officers react to perceived risk increases by acting more aggressively during traffic stops made in the dark.

Indirect relationships, especially those which, like the ticket writing example above, oscillate between positive and negative correlations over the course of the year, are very hard to spot - and most of our current data mining tools don't do well at the job at all.
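
A small Python sketch shows why a relationship that flips sign over the year is so easy to miss. The data is entirely synthetic and the names (`DARK_MONTHS`, `synthetic_rows`, and so on) are hypothetical. It is not the Alberta ticket data, just two variables whose relationship is positive in six months and negative in the other six.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Assumption: months in which the invented relationship is positive.
DARK_MONTHS = {10, 11, 12, 1, 2, 3}
X_VALUES = [-2.0, -1.0, 0.0, 1.0, 2.0]   # invented driver variable
NOISE = [0.3, -0.3, 0.0, 0.3, -0.3]      # fixed "noise" keeps the run repeatable


def synthetic_rows():
    """Return (month, x, y) rows where y follows +x or -x by season."""
    rows = []
    for month in range(1, 13):
        sign = 1.0 if month in DARK_MONTHS else -1.0
        for x, e in zip(X_VALUES, NOISE):
            rows.append((month, x, sign * x + e))
    return rows


def pooled_correlation(rows):
    """Correlation of x and y over the whole year at once."""
    return pearson([x for _, x, _ in rows], [y for _, _, y in rows])


def correlation_by_month(rows):
    """Correlation of x and y computed separately for each month."""
    result = {}
    for month in sorted({m for m, _, _ in rows}):
        xs = [x for m, x, _ in rows if m == month]
        ys = [y for m, _, y in rows if m == month]
        result[month] = pearson(xs, ys)
    return result


if __name__ == "__main__":
    rows = synthetic_rows()
    print("pooled:", round(pooled_correlation(rows), 3))
    for month, r in correlation_by_month(rows).items():
        print(month, round(r, 3))
```

Pooled over the year, the correlation comes out close to zero, so a tool scanning for strong pairwise associations would pass over it. Split by month, the same rows show a strong positive relationship in half the year and a strong negative one in the other half. The catch, of course, is that you have to know to split by month in the first place.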

As an aside, it's interesting to note that there is a positive side to the PC here - we're getting better at visual information presentation - check out this Java-based presentation on the U.S. national budget for an example.

But how much progress has there been since the mid-nineties in spotting out-of-context patterns (associations for which no previous pattern is known)? Outside of the national security environment, I believe the answer is "None."

So why?

Having just spent some time reviewing this stuff, I have a dispiriting answer in two parts:

  1. first, most of the tools we have today originated when machines were smaller - and I don't mean Mickey Mouse stuff like datacubes designed to make PCs look useful - I mean from the days when Teradata was king of the hill and a big machine like a high-end Amdahl would crunch along for a couple of weeks to discover a dozen candidate relationships in a month's worth of GM dealer data.

  2. and, secondly, SQL, and more particularly the assumptions about data structure that go with SQL use, appear to have had a profoundly negative influence on what people believe can be done and therefore on what they're willing to try.

As I said, depressing - but perhaps not as badly so as the continuing parade of tenured mental blocks slowing the adoption of otherwise rather obviously good ideas - consider, for example, what serious investigation of the eminently reasonable Nemesis hypothesis (that Earth's Sun is part of a binary system) could do in terms of kicking most currently accepted theories of climatological change into the dustbin of political history if it turns out to be right.


Paul Murphy wrote and published The Unix Guide to Defenestration. Murphy is a 25-year veteran of the I.T. consulting industry, specializing in Unix and Unix-related management issues.