% fortune -ae paul murphy

Associative arrays and elections data

As you may, or may not, be aware the right wing press has recently been in an uproar over Obama's refusal to release information about donations of less than $200 - and they're also upset because the Ohio Secretary of State is refusing to compare the information provided on about 650,000 new voter registration forms to state records because up to about 200,000 of the voters concerned could be forced to use provisional ballots if the names, addresses, and numbers used on those forms don't match those found in records maintained by the Ohio Bureau of Motor Vehicles or the Social Security Administration.

In both cases the groups concerned cite the complexity of the processes needed to meet these requirements as the main reason for refusing to do so.

Ohio says it either has, or can easily make, the data available in electronic format - and because the Obama campaign uses the Paymentech merchant service they also have immediate on line access to every card based donation attempt during the current and previous six months.

So the question is, for either 650,000 names or about 7 million donation records, how hard is the required matching process really?

I asked a few people and got the most amazing answers: many involving SQL and data warehousing or "business inteligence" tools -one guy even seriously suggested that this would be a clerical task best done this using Excel!

The right answer is that it's trivial to do this kind of matching using an associative array - something I wrote about in a Solaris tip back in 2004:

The associative array is one of the great computing constructs. Essentially you create an array and index it by value rather than row or column position. It seems simple and you probably thought it was simple in school, but may not use as often as you should today because it can actually be quite tricky.

It can also be extremely useful. When the boss wants to know, for example, which tables in the ERP system use "bal.udak_surr_key," how many users print to "HP-16-3rd-flr-east," or the size of data transfers to the east pohunk office by client id, you've really got three choices. You can opt for sheer masochism and use SQL, indulge in a lot of typing and error checking with Excel, or use an associative array.

Awk and Nawk are good for this, but perl really shines at it - and you can use the same basic syntax for all three. Here's how it works.

First you need to organize your input data as a flat file with some kind of delimiter -I like "|" but you can use anything that isn't found in your data - so that each row represents one occurance of whatever you happen to be counting. For example, you can use awk or perl to process the proxy server log (or the print log or the data dictionary dump, etc) to produce a delimited file in which one field has the thing you want to count by (in this case names) and another has the thing you want to count (in this case bytes transfered).

% cat demo.data
Microsoft|1234|Mike Lame
Tropica Vacations|56789|Janice Bezier
ISACA|101112|Elizabeth Noheart
National|131415|Janice Bezier
bardunes.hpg.com.br|16171819|Mike Lame
KPMG|202122|Elizabeth Noheart

To add them up we're going to create an array that's indexed by name and then print that out:

% cat demo.pl
#!/bin/perl
##separated steps for clairity

$FS = '|';
while (<>) {
($from,$bytes,$name) = split(/[|\n]/, $_, 9999);
$AA{$name} += $bytes;
}

foreach $i (keys %AA) {
printf "%20s %10d\n", $i, $AA{$i};
}

That first line in the while loop splits out the input data and the second line accummulates totals in an array that's indexed by the user's name.

Apply this to our sample file and you get:

% demo.pl demo.data
Mike Lame 16173053
Elizabeth Noheart 303234
Janice Bezier 188204

That same script will, with minor hacking, do a within-groups total for almost anything you can generate an input file for - giving you a simple tool to handle the majority of "how many/how much" requests from your boss.

So, how difficult would either task really be? with traditional coding methods either one could become a messy little project - but with associative arrays and Perl? you'll spend some time figuring out how to prepare the input file (sorts for Ohio, repeating field removal for the transactions list), but the guts of the thing? maybe twenty minutes plus run-time.


Paul Murphy wrote and published The Unix Guide to Defenestration. Murphy is a 25-year veteran of the I.T. consulting industry, specializing in Unix and Unix-related management issues.