Improving regular expressions

[ Perl tips index ]
[ Subscribe to Perl tips ]

Many programmers find regular expressions to be esoteric constructions. They're dense, complex, and for many people utterly incomprehensible. Large regular expressions without sufficient documentation can be staggeringly difficult to maintain, so it's no surprise that many developers are happy if they match at all, even if they're not as fast or maintainable as they could be.

This tip will focus on how to make regular expressions easier to understand, easier to maintain, and faster to run.

Optimising for readability

Perl regular expressions have an extended mode, which allows you to embed whitespace and comments into your expressions. This makes them both faster to write (and debug) and easier to read in the future. To use this mode add the /x flag to the end of your expression. For example we can take this rather dense regexp:

  # Parse a line from 'ls -l'

  m{^([\w-]+)\s+(\d+)\s+(\w+)\s+(\w+)\s+(\d+)\s+(\w+\s+\d+\s+[\d:]+)\s+(.*)$};

and turn it into this much more understandable form:

        # Parse a line from 'ls -l'

        m{
            ^                           # Start of line.
            ([\w-]+)\s+                 # $1 - File permissions.
            (\d+)\s+                    # $2 - Hard links.
            (\w+)\s+                    # $3 - User
            (\w+)\s+                    # $4 - Group
            (\d+)\s+                    # $5 - File size
            (\w+\s+\d+\s+[\d:]+)\s+     # $6 - Date and time.
            (.*)                        # $7 - Filename.
            $                           # End of line.
        }x;

This latter version makes it much easier to see why the expression fails to match the following two lines:

    -rw-r--r-- 1 jarich jarich    7376 2008-04-04 14:03 usingobjects.sgml
    -rw-r--r-- 1 jarich jarich   15092 2008-04-04 14:03 whatisperl.sgml

Everything looks good until we get to the date and time. 2008- does not match \w+\s. On this particular machine, ls shows the date and time differently than on the machine for which this regular expression was designed. The regular expression is instead expecting lines more like:

        -rw-r--r-- 1 jarich jarich  7376 Apr  4 14:03 usingobjects.sgml
        -rw-r--r-- 1 jarich jarich 15092 Apr  4 14:03 whatisperl.sgml

Using extended regular expressions allows us to break up our expression to better match how we think of the data. Note that when using extended regular expressions you must specify your space characters explicitly. You can do this with [ ] or \ if you like, but if you really want to match any sort of whitespace character (including tabs and newlines) then we recommend using \s. If you need to search for a literal hash, just use \#.

The /x switch also works for the left hand side (only) of substitutes too:

        s{
                (\W*)           # $1 - Before the first word
                (\w+)           # $2 - The first word (assume no punctuation)
                (.*)            # $3 - Everything in the middle
                \b(\w+)         # $4 - The last word
        }
        {$1$4$3$2}x;            # Swap first and last words

Extended regular expressions still run just as quickly as their more dense counterparts, so you should consider using them whenever possible.

Pulling data forward

Consider the following expression:

        m{Perl Training|Perl Courses|Perl Course Notes|Perl Tips};

This will match many of the things Perl Training Australia offers. However consider matching it against the following string:

        $string = "Can you help me find a Perl Programmer?";

In this case the Perl regular expression engine will try really hard to match. It starts by trying to match Perl Training, as that's our first alternative. However while it finds a partial match with Perl , it's not followed by a T, so it fails and backtracks.

Before Perl 5.10, the regular expression engine goes to the next match, and starts to match Perl all over again. It does this for each failed match, getting as far as the first failing character, going on to the next alternative, and stepping through all the common characters all over again. This is a big waste of time, and if we had a longer string of common characters, it would take even longer.

We can make this expression much more efficient with a very little change:

        m{Perl (?:Training|Courses|Course Notes|Tips)};

Now the regular expression engine matches Perl only once, and can then fail very rapidly as it sees the next character in our string doesn't match any of our alternatives. Under Perl 5.8.8 our new regexp runs eight times faster than our old regexp for the example given.

The Perl 5.10 regexp engine is smart enough to do this factoring by itself, so both expressions run with the same speed on 5.10 systems.

Most common matches first

As an extension to the above improvement, if you know roughly how often you'll be seeing each option in your alternation, move the most commonly matched ones to the front. This reduces the amount of failures and backtracking required in order for a successful match.

Avoiding regular expressions

Regular expressions are a great way to solve text-matching problems, but they're not the only way. Starting up the regular expression engine can be slow, and although there are many, many optimisations to make Perl's regular expressions really fast, they're not always the fastest solution.

Before writing any regular expression you should:

By limiting the number of regular expressions your code uses, you may be able to both increase both the maintainability and speed of your program.

Benchmarking

If you wish to determine if a change to your program or regular expression makes it faster, then you may wish to consider using Perl's Benchmark module. We have a separate Perl-tip on this topic.

Further resources

[ Perl tips index ]
[ Subscribe to Perl tips ]


This Perl tip and associated text is copyright Perl Training Australia. You may freely distribute this text so long as it is distributed in full with this Copyright noticed attached.

If you have any questions please don't hesitate to contact us:

Email: contact@perltraining.com.au
Phone: 03 9354 6001 (Australia)
International: +61 3 9354 6001

Valid XHTML 1.0 Valid CSS