Using associative arrays

By Paul Murphy, author of The Unix Guide to Defenestration

The associative array is one of the great computing constructs. Essentially, you create an array and index it by value rather than by row or column position. It seems simple, and you probably thought it was simple in school, but you may not use it as often as you should today because it can actually be quite tricky.
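To make the idea concrete, here is a minimal Perl sketch contrasting an ordinary array, indexed by position, with an associative array (a "hash" in Perl terms), indexed by value. The second printer name and the job counts are made up purely for illustration.

```perl
use strict;
use warnings;

# Ordinary array: elements are found by position.
# The second printer name is hypothetical.
my @printers = ('HP-16-3rd-flr-east', 'HP-16-3rd-flr-west');

# Associative array: elements are found by value (here, the printer name).
my %jobs;
$jobs{'HP-16-3rd-flr-east'} += 1;   # first job seen on this printer
$jobs{'HP-16-3rd-flr-east'} += 1;   # second job on the same printer
$jobs{'HP-16-3rd-flr-west'} += 1;

print "$printers[0] is at position 0\n";
print "HP-16-3rd-flr-east printed $jobs{'HP-16-3rd-flr-east'} jobs\n";
```

Note that the hash entry springs into existence the first time it is incremented, which is what makes running totals so easy to keep.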

It can also be extremely useful. When the boss wants to know, for example, which tables in the ERP system use "bal.udak_surr_key," how many users print to "HP-16-3rd-flr-east," or the size of data transfers to the east pohunk office by client id, you've really got three choices. You can opt for sheer masochism and use SQL, indulge in a lot of typing and error checking with Excel, or use an associative array.

Awk and nawk are good for this, but Perl really shines at it, and you can use the same basic syntax for all three. Here's how it works.

First, you need to organize your input data as a flat file with some kind of delimiter (I like "|", but you can use anything that isn't found in your data) so that each row represents one occurrence of whatever you happen to be counting. For example, you can use awk or Perl to process the proxy server log (or the print log, or the data dictionary dump, etc.) to produce a delimited file in which one field has the thing you want to count by (in this case, names) and another has the thing you want to count (in this case, bytes transferred).
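As a rough sketch of that preprocessing step, the following Perl turns log lines into "|"-delimited records. The log layout here (tab-separated time, user name, site, and bytes) is an assumption invented for the example, not a real proxy log format, and to_record is a hypothetical helper name; adjust the split to whatever your log actually looks like.

```perl
use strict;
use warnings;

# Hypothetical log layout (an assumption, not a real proxy format):
#   time <TAB> user name <TAB> site <TAB> bytes
sub to_record {
    my ($line) = @_;
    chomp $line;
    my ($stamp, $user, $site, $bytes) = split /\t/, $line;
    # Emit site|bytes|user, the field order used in this article.
    return join '|', $site, $bytes, $user;
}

# Made-up sample lines standing in for a real log.
my @log = (
    "09:12:01\tMike Lame\tMicrosoft\t1234",
    "09:15:42\tJanice Bezier\tTropica Vacations\t56789",
);

print to_record($_), "\n" for @log;
```

In practice you would read the log with while (<>) rather than from an in-line list; the list just keeps the sketch self-contained.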

% cat
Microsoft|1234|Mike Lame
Tropica Vacations|56789|Janice Bezier
ISACA|101112|Elizabeth Noheart
National|131415|Janice Bezier
|16171819|Mike Lame
KPMG|202122|Elizabeth Noheart

To add them up we're going to create an array that's indexed by name and then print that out:

% cat
$FS = '|';    # awk-style field separator; unused here, since split() does the work
while (<>) {
    ($from,$bytes,$name) = split(/[|\n]/, $_, 9999);
    $AA{$name} += $bytes;
}

foreach $i (keys %AA) {
    printf "%20s %10d\n", $i, $AA{$i};
}

The first line in the while loop splits out the input data, and the second line accumulates totals in an array that's indexed by the user's name.
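To see what the split actually hands back, here is a small sketch that runs the same split call on a single sample record. The only additions are the my declarations and the print statement.

```perl
use strict;
use warnings;

my $line = "Microsoft|1234|Mike Lame\n";

# The character class [|\n] treats both the delimiter and the trailing
# newline as separators, so the name comes back without a newline attached.
my ($from, $bytes, $name) = split(/[|\n]/, $line, 9999);

print "from=[$from] bytes=[$bytes] name=[$name]\n";
# prints: from=[Microsoft] bytes=[1234] name=[Mike Lame]
```

Only the first three fields are assigned; anything after the third delimiter on a line is silently ignored.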

Apply this to our sample file and you get:

           Mike Lame   16173053
   Elizabeth Noheart     303234
       Janice Bezier     188204
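One caveat: Perl does not promise any particular order from keys, so your rows may come out in a different sequence. If the boss wants the biggest users first, a sort on the totals takes care of it. The sketch below hard-codes the three totals shown above so that it runs on its own; in the real script you would simply sort the keys of %AA after the while loop.

```perl
use strict;
use warnings;

# Totals copied from the sample output.
my %AA = (
    'Mike Lame'         => 16173053,
    'Elizabeth Noheart' => 303234,
    'Janice Bezier'     => 188204,
);

# Sort names by descending total; ties fall back to alphabetical order.
my @by_total = sort { $AA{$b} <=> $AA{$a} or $a cmp $b } keys %AA;

printf "%20s %10d\n", $_, $AA{$_} for @by_total;
```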

That same script will, with minor hacking, do a within-groups total for almost anything you can generate an input file for - giving you a simple tool to handle the majority of "how many/how much" requests from your boss.
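As one example of that minor hacking, here is a sketch that answers a "how many users" question instead of a "how much" one. The "user|printer" record layout, the sample rows, and the second printer name are all made up for illustration. The trick is a hash of hashes: each user is recorded once per printer, so repeat print jobs don't inflate the count.

```perl
use strict;
use warnings;

# Made-up print-log records in a hypothetical "user|printer" layout.
my @records = (
    "Mike Lame|HP-16-3rd-flr-east",
    "Janice Bezier|HP-16-3rd-flr-east",
    "Mike Lame|HP-16-3rd-flr-east",
    "Elizabeth Noheart|HP-4-2nd-flr-west",
);

my %seen;    # printer => { user => 1 }, so repeat jobs count once
for my $rec (@records) {
    my ($user, $printer) = split /\|/, $rec;
    $seen{$printer}{$user} = 1;
}

# The number of distinct users is the number of keys in the inner hash.
for my $printer (sort keys %seen) {
    printf "%-20s %3d\n", $printer, scalar keys %{ $seen{$printer} };
}
```

If you want a count of print jobs rather than distinct users, a single-level hash with $count{$printer}++ is enough.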