Tricks with Arrays and Hashes

[ Perl tips index ]
[ Subscribe to Perl tips ]

Today's Perl tip is about a simple programming concept, but one that's easy to get wrong: basic operations when dealing with lists of data. Let's pretend we wish to find the unique elements in an array. For example, if we had:

        my @array = (1, 1, 1, 2, 1);

we'd want to get back:

        my @unique = (1, 2);

Naive solution

This problem is naively solved using a nested loop. For example:

        # This is an AWFUL solution

        my @array = (1, 1, 1, 2, 1);
        my @uniques;

        FIND: 
        foreach my $element (@array) {
            foreach my $unique (@uniques) {
                if($element == $unique) {
                    # This is already in our unique array
                    # so skip to the next element
                    next FIND;
                }
            }

            # If we're here, it's not yet in our unique array, so add
            # it.
            push @uniques, $element;
        }

        # Uniques now contains all the unique elements
        print "@uniques\n";

While this solution looks correct, and may seem fast when dealing with small lists, it's extremely slow when working with large lists. In fact this algorithm has a complexity of O(n2), which means the time taken increases in proportion to the square of the number of elements. (See our Perl tip on Big-O notation for more information.)

Hash slices

A better way to find the unique elements in a list is to use a hash, since it guarantees its keys are unique. To build a hash, we only need to walk through our list once, which gives us an algorithmic complexity of O(n), meaning our time scales in linear proportion to the number of elements we're processing.

We could use a looping construct to build our hash, but Perl actually provides us with a short-cut, known as a hash slice. A hash slice allows us to reference multiple keys at once, and looks a little like this:

        my %favourite_colour;

        # A hash slice still uses curly-braces around the keys, but
        # an '@' to access multiple keys at once.

        @favourite_colour{ qw(Paul Jacinta Sean) } = qw(Blue Green Red);

We can use the fact that hash keys are unique, as well as the short-cut of using a hash-slice, to very quickly find our unique set:

        my @duplicates = (1, 1, 2, 3, 4, 5, 5, 6, 7, 7, 8, 9, 2, 3, 3,);

        my %unique;

        # Our hash here has keys, but all its values will be undefined.
        # That's fine, since we're only interested in the keys.

        @unique{@duplicates} = ();

        my @unique_elements = sort keys %unique;

        print "@unique_elements\n";  # 1 2 3 4 5 6 7 8 9

Retaining the original order

If we did not sort the results, in the previous example, we might get back the set:

        6 3 7 9 2 8 1 4 5

which is not related to the original order of @duplicates. This is because hashes do not keep any internal ordering. If it is important that the original ordering is preserved we can use Tie::IxHash (covered in our tip about Ordered hashes. Tie::IxHash does not come standard with Perl, but can be downloaded from the CPAN.

        use Tie::IxHash;

        tie my %unique => 'Tie::IxHash';

        my @duplicates = (1, 1, 2, 3, 4, 5, 5, 6, 7, 7, 8, 9, 2, 3, 3,);

        @unique{@duplicates} = ();

        my @unique_elements = keys %unique;

        print "@unique_elements\n";  # 1 2 3 4 5 6 7 8 9

This works even if our @duplicates are words:


        my @duplicates = qw(e d c b e e e b c a a d e e a b );

        @unique{@duplicates} = ();
        my @unique_elements = keys %unique;

        print "@unique_elements\n";  # e d c b a

There's more than one way to do it (List::MoreUtils)

Although hash slices provide us with a very useful way to gain unique values, there's another, even easier way to achieve the same purpose. The List::MoreUtils module (available from the CPAN) provides a uniq function which with the same results as our Tie::IxHash example above:

        use List::MoreUtils qw(uniq);

        my @duplicates = (1, 1, 2, 3, 4, 5, 5, 6, 7, 7, 8, 9, 2, 3, 3,);

        my @unique_elements = uniq( @duplicates );

        print "@unique_elements\n";  # 1 2 3 4 5 6 7 8 9

The List::MoreUtils subroutines are not only easier to understand than our hand-rolled code, but since they're implemented in pure C they should also outperform our home-grown code by a significant margin.

Resources

[ Perl tips index ]
[ Subscribe to Perl tips ]


This Perl tip and associated text is copyright Perl Training Australia. You may freely distribute this text so long as it is distributed in full with this Copyright noticed attached.

If you have any questions please don't hesitate to contact us:

Email: contact@perltraining.com.au
Phone: 03 9354 6001 (Australia)
International: +61 3 9354 6001

Valid XHTML 1.0 Valid CSS