Sorting with expensive comparisons

[ Perl tips index ]
[ Subscribe to Perl tips ]

Last week we investigated how we can use Perl's sort function to sort lists using different methods of comparison. This week we'll see two special techniques to make sorting faster when our comparisons are expensive.

Let's pretend that we have a list of files, and we wish to sort them based upon the number of lines containing the string "Perl". Performing such a count is easy:

    sub count_perls {
    local $_;    # Avoid clobbering caller's $_
        my ($filename) = @_;
        my $count = 0;
        open(FH, "<", $filename) or die "Cannot open $filename - $!\n";
        while (<FH>) {
            $count++ if /Perl/i;
        }
        close(FH);
        return $count;
    }

You'll notice this code uses 'local $_' to say that changes to $_ (such as those made by the 'while (<FH>)' construct) will be localised to this subroutine. This is considered good manners -- anyone calling our code probably doesn't want $_ to mysteriously change.

Now we can sort our files based upon the number of occurrences of the string "Perl":

  my @sorted = sort { count_perls($a) <=> count_perls($b) } @filenames;

This works perfectly well, but it's doing a lot of needless work. The count_perl subroutine is called for each filename every time we do a comparison, and since each filename will be compared to a number of other filenames during the sort, we'll needlessly search and re-search each file in our list.

One solution to this is to do the counts once before we sort, and place those results into a cache. We can then simply sort on the cached values.

Hashes are ideal for this sort of work:

    my %cache;
    foreach my $file (@filenames) {
        $cache{$file} = count_perls($file);
    }

    my @sorted = sort { $cache{$a} <=> $cache{$b} } @filenames;

By using just a little bit of memory for our cache, we've saved ourselves a lot of time in needless file processing.

Alternately, we can massage our list into a list of pairs: the filename and a precomputed count. This can be faster than creating a cache, but the cache might be easier to understand. We can do this by using map:

    my @pairs = map { [ $_, count_perls($_) ] } @filenames;

@pairs should now look similar to:

    @pairs = ( [ 'av.c',    56 ],
               [ 'deb.c',   31 ],
               [ 'doio.c', 249 ],
               [ 'doop.c',  36 ] );

Now all we need to do is sort this list on the second element:

    my @sorted_pairs = sort { $a->[1] <=> $b->[1] } @pairs;

Which gives us:

    @sorted_pairs = ( [ 'deb.c',   31 ],
                      [ 'doop.c',  36 ],
                      [ 'av.c',    56 ],
                      [ 'doio.c', 249 ] );

However, all we really want are the filenames, not this set of pairs. Now we can use map again to remove the unwanted counts.

    my @sorted = map { $_->[0] } @sorted_pairs;

Putting this all together we get:

    my @pairs        = map  { [ $_, count_perls($_) ] } @filenames;
    my @sorted_pairs = sort { $a->[1] <=> $b->[1] }     @pairs;
    my @sorted       = map  { $_->[0] }                 @sorted_pairs;

Careful examination will show that for the three lines, the input for the next line is the result from the previous line. This means these can be condensed to give:

    my @sorted = map  { $_->[0] }
                 sort { $a->[1] <=> $b->[1] }
                 map  { [ $_, count_perls($_) ] }
                 @filenames;

This is commonly known as the 'Schwartzian Transform', as it was first discussed in detail by Randal Schwartz in a column of Unix Review, in January 1996. The original article can be found at http://www.stonehenge.com/merlyn/UnixReview/col06.html.

Both the Schwartzian Transform and the hash caching methods provide significant improvements over the naive approach. The speed difference between the two methods is negligible so we recommend you choose whichever you feel most comfortable with. Pre-computing your sort keys is useful whenever you're sorting using an attribute that may take a long time to compute, such as filesystem attributes or database lookups.

This is the last perl-tip for 2004 as Perl Training Australia's staff break for a well-earned rest. Perl-tips will return in 2005.

NEXT YEAR: Why use objects? A brief introduction to Object Oriented programming, Perl-style.

[ Perl tips index ]
[ Subscribe to Perl tips ]


This Perl tip and associated text is copyright Perl Training Australia. You may freely distribute this text so long as it is distributed in full with this Copyright noticed attached.

If you have any questions please don't hesitate to contact us:

Email: contact@perltraining.com.au
Phone: 03 9354 6001 (Australia)
International: +61 3 9354 6001

Valid XHTML 1.0 Valid CSS