Big-O notation

[ Perl tips index ]
[ Subscribe to Perl tips ]

Big-O notation is a mathematical construct used to describe algorithmic complexity. In particular it is allows efficiency comparisons of different algorithms for large data sets.

Typical orders are (in order of desirability):

O(1) also C
Constant time. Computation does not take appreciably longer as the data set grows. For example, a hash lookup is considered to occur in constant time.
O(log N)
Logarithmic. Computation time grows as N grows, but this growth is small for large N. For example doing a binary search over a sorted list to find an item.
O((log N)c)
Poly-logarithmic. Computation time grows as N grows, but this extra growth in time is still small for large N.
O(N)
Linear. Computation time grows in equal proportion to N's growth. For example, adding a value to each item in a list is linear.
O(N log N)
Quasi-linear. Computation time grows only slightly faster than N's growth, for large N.
O(N2) and O(Nc)
Quadratic and Polynomial. For N2: doubling N will quadruple execution time, N3: doubling N increases execution time by a factor of 8. For large N, O(N2) and above rapidly become unusable.
O(n!)
Factorial. Almost never a good idea, large for even small N. 5! = 120.

It is an interesting property of Big-O notation that if a function can be written as a sum of other functions, then only the fastest growing function matters. This is because the growth in that function will rapidly dwarf the growth in the others. For example:

cN + dN2 => O(N2) for any constants c and d

N4 + N! => O(N!)

Why Big-O matters

Understanding the basics of how to calculate Big-O notation can help us identify bad choices in our code and spot likely causes of slow execution time. For example consider the following attempt to find all the elements in list1 which are also in list2:

        my @list1 = (10 .. 100);
        my @list2 = (50 .. 200);

        # Walk over @list1
        foreach my $element (@list1) {

                # Check in @list2 to see if it's there
                foreach my $item (@list2) {

                        if($item eq $element) {
                                print "$item is in both lists\n";
                        }
                }
        }

If @list1 and @list2 are small, this code works very quickly. However, if we use:


        my @list1 = (10 .. 10000);
        my @list2 = (5000 .. 2000000);

then the code takes a very long time. Let's call the size of @list1 N and the size of @list2 M then what we're doing is:

        for each N
                for each M
                        do something

You can see here that this means we walk over @list1 once, but for each item we also walk over the whole of @list2. This means we're dealing with O(N M). As the smaller list approaches the same size as the larger, this approaches O(N2) which suggests we should investigate whether a faster possibility exists.

We can rewrite the above code as follows:

        my @list1 = (10 .. 100);        # length N
        my @list2 = (50 .. 200);        # length M

        # Create a hash of all the values in @list1: O(N)
        my %in_list1;
        foreach my $element (@list1) {
                $in_list1{$element} = ();
        }

        # Walk over each element in @list2 and compare against the hash
        # O(M)
        foreach my $item (@list2) {
                if(exists $in_list1{$item}) {
                        print "$item is in both lists\n";
                }
        }

Now we can see that the end result is O(N+M) - we walk over @list1 once, and we walk over @list2 once. This is an O(N) solution so we know it's good. Note that we're achieving this trade-off by using extra memory to store our %in_list1 hash. For very large N this needs to be considered.

The above style problem comes up regularly, even though it may not quite look as simple as this. For example, imagine that you have a flat file containing addresses for businesses, and you wish to display a subset of those based on a list of suburbs. You could go through the whole file for each suburb (O(N M)) or you could adapt the above solution to store either the addresses or suburbs in a hash (depending on your requirements) and go through each only once. The same style solution can also be used to find the complement of a list.

Ignoring constants

Big-O notation does not predict how fast something will run, but rather gives us information about how well it will scale. For example doubling the size of N for a O(N2) problem will quadruple the program execution time regardless of whether the function complexity is actually O(0.5 N2) or O(10000 N2).

Constants only matter when comparing functions of different complexities. For example O(100*N) is slower than O(N2 /100) until N is greater than 10,000.

See also

Mathematical coverage for Big-O notation can be found on Wikipedia at: http://en.wikipedia.org/wiki/Big_O_notation.

There are also some great notes on Big-O notation on perlmonks.org

[ Perl tips index ]
[ Subscribe to Perl tips ]


This Perl tip and associated text is copyright Perl Training Australia. You may freely distribute this text so long as it is distributed in full with this Copyright noticed attached.

If you have any questions please don't hesitate to contact us:

Email: contact@perltraining.com.au
Phone: 03 9354 6001 (Australia)
International: +61 3 9354 6001

Valid XHTML 1.0 Valid CSS