Searching for items in a large list

[ Perl tips index ]
[ Subscribe to Perl tips ]

Often we need to search for one or more items in a larger list. This tip examines how we can do this and examines the efficiency of various techniques.

Searching for a single item

Let's pretend we have a list of usernames and we wish to find out whether a given name is in that list. A traditional approach using a foreach loop would look like this:

        my $found = 0;
        foreach my $username (@usernames) {
                if($username eq $check_name) {
                        $found = 1;
                        last;
                }
        }

Those who have used Perl's grep function before would probably write:

        my $found = grep { $_ eq $check_name } @usernames;

We could also use first from the Scalar::Util module (which comes standard with Perl), which allows us to find the first occurrence of the element in the list.

        use Scalar::Util qw(first);
        my $found = first { $_ eq $check_name } @usernames;

Each of these are fine solutions, although the solution using foreach provides the best performance in most circumstances. It avoids searching the entire list (which grep does even in a scalar context), and the overhead of subroutine calls (which first uses for each comparison).

Searching for multiple items

Let's say we have a list of items and we want to test each one for existence in a much larger list. A simple solution may look like this:

        foreach my $wanted_book (@books_I_want) {
                foreach my $library_book (@library_books) {
                        if($library_book eq $wanted_book) {
                                # My book is available.  Order it.
                                order_book($wanted_book);
                                last;
                        }
                }
        }

An alternate solution using grep could look like this:

        foreach my $wanted_book (@books_I_want) {
                if(grep { $_ eq $wanted_book} @library_books) {
                        # My book is available.  Order it.
                        order_book($wanted_book);
                        last;
                }
        }

If we have many library books (for example approximately 10,000) and a medium demand for book orders (approximately 150), one third of which do not exist in the library, then we find that foreach is moderately faster (26%) than grep which is moderately faster (26%) than first. Each of these take between one third and half of a second to run on modern hardware.

Using a hash

If we're doing lots of searches, it becomes much faster to have our books in a hash, and use a hash-lookup:

        my $found = exists $library_index{$book_name};

How much faster? About 100,000 times faster on average when compared to a linear search with foreach on a list with with 10,000 items in it. Of course, there is the additional overhead of building the hash in the first place. For two-character long keys, building a hash takes twice the time as building an array, and it's even longer for longer keys. It's just not worth the extra effort to build a hash if we're going to perform just a single search.

However in our library example we're walking over our arrays multiple times, making this a prime candidate for using a hash to speed things up. We can use a hash slice to add our keys to the hash in a single step.

        my %library_search;
        @library_search{@library_books} = ();

        foreach my $book (@books_I_want) {
                if(exists $library_search{$book}) {
                        # My book is available.  Order it.
                        order_book($book);
                }
        }

Or alternately, putting our wanted books into a hash (notice the change in the foreach loop's list).

        # Add library books into a hash
        my %books_I_want;
        @books_I_want{@books_I_want} = ();

        foreach my $library_book (@library_books) {
                if(exists $books_I_want{$library_book}) {
                        # yay my book's available
                        order_book($library_book);

                        # We've ordered it, delete from hash
                        delete $books_I_want{$library_book};
                }

                # end if we have no more books we want
                last unless keys %books_I_want;
        }

Building the hash of all the library books (our first example) and then searching is much faster (1928%) than our simple linear searches with foreach, while building the hash of our wanted books (our second example) is even faster (160%) than that.

Both our examples requires that Perl walks through each of our arrays at least once, either to build the hash (using our hash slice), or to search the hash (using our foreach loop). Our second example is the faster of the two since we're building a smaller hash, which takes less time. We also have the potential to exit our main loop early if we find all the books we're looking for.

However our second solution is only superior if we throw away our hashes at the end, and rebuild them whenever we need to order new books. If we're able to retain our hashes in a persistent process, then it's much faster to index all the library books once, and then loop over the smaller list each time:

        foreach my $wanted_book (@books_I_want) {
                if(exists $prebuilt_library_books{$wanted_book}) {
                        # yay my book's available
                        order_book($wanted_book);
                }
        }

In fact, it's blindingly fast, running at more than 78 times the speed of the fastest code that builds the hash each time. Whenever you build a hash for the purposes of searching, it's often worthwhile to hang onto it as long as possible.

Benchmark and code

We used Perl's Benchmark module to obtain these results as follows:

              Rate   first    grep foreach lb-hash  w-hash  pre-built
 first      2.03/s      --    -20%    -37%    -97%    -99%      -100%
 grep       2.55/s     26%      --    -21%    -96%    -98%      -100%
 foreach    3.21/s     58%     26%      --    -95%    -98%      -100%
 lb-hash    65.0/s   3103%   2451%   1928%      --    -62%      -100%
 w-hash      169/s   8238%   6540%   5178%    160%      --       -99%
 pre-built 13444/s 662497% 527577% 419353%  20588%   7846%         --

(where lb-hash is building the hash from the library books and w-hash is building the hash from the wanted books). Note that these results will vary slightly from machine to machine and run to run.

We can tell by the rate that we can run the solution using first 2.03 times in a second, while we can use the solution using foreach 3.21 times in a second. The results with less library books (about 5000) are much the same.

Benchmark is a standard Perl module, and its documentation can be found on CPAN ( http://search.cpan.org/perldoc ) or can be obtained with perldoc Benchmark.

The original code used to generate these tests can be found at http://perltraining.com.au/tips/listsearch.pl.html. Note that there will be some variation between results from one invocation to the next.

Conclusion

There are reasons why you may have large lists instead of hashes. For example you may need the elements to be ordered, or an array may make more sense at other points of your program. However, if you are likely to be doing lots of random accesses into a list, including looking for items which do not exist, then it may be more efficient to create a hash. In our situation, even with the cost of creating a hash of all of the library books, we discovered a massive performance increase of over 1900% than using a simple foreach loop. Using a better approach and creating a hash of our wanted books was an improvement of more than 5100% over the simple foreach loop. If we can instead use a hash instead of a list our lookups are even faster.

[ Perl tips index ]
[ Subscribe to Perl tips ]


This Perl tip and associated text is copyright Perl Training Australia. You may freely distribute this text so long as it is distributed in full with this Copyright noticed attached.

If you have any questions please don't hesitate to contact us:

Email: contact@perltraining.com.au
Phone: 03 9354 6001 (Australia)
International: +61 3 9354 6001

Valid XHTML 1.0 Valid CSS