[ Perl tips index]
[ Subscribe to Perl tips ]
Many programmers find regular expressions to be esoteric constructions. They're dense, complex, and for many people utterly incomprehensible. Large regular expressions without sufficient documentation can be staggeringly difficult to maintain, so it's no surprise that many developers are happy if they match at all, even if they're not as fast or maintainable as they could be.
This tip will focus on how to make regular expressions easier to understand, easier to maintain, and faster to run.
Perl regular expressions have an extended mode, which allows you to embed
whitespace and comments into your expressions. This makes them both faster
to write (and debug) and easier to read in the future. To use this mode
add the /x flag to the end of your expression. For example we can take
this rather dense regexp:
# Parse a line from 'ls -l'
m{^([\w-]+)\s+(\d+)\s+(\w+)\s+(\w+)\s+(\d+)\s+(\w+\s+\d+\s+[\d:]+)\s+(.*)$};
and turn it into this much more understandable form:
# Parse a line from 'ls -l'
m{
^ # Start of line.
([\w-]+)\s+ # $1 - File permissions.
(\d+)\s+ # $2 - Hard links.
(\w+)\s+ # $3 - User
(\w+)\s+ # $4 - Group
(\d+)\s+ # $5 - File size
(\w+\s+\d+\s+[\d:]+)\s+ # $6 - Date and time.
(.*) # $7 - Filename.
$ # End of line.
}x;
This latter version makes it much easier to see why the expression fails to match the following two lines:
-rw-r--r-- 1 jarich jarich 7376 2008-04-04 14:03 usingobjects.sgml
-rw-r--r-- 1 jarich jarich 15092 2008-04-04 14:03 whatisperl.sgml
Everything looks good until we get to the date and time. 2008- does not
match \w+\s. On this particular machine, ls shows the date and time
differently than on the machine for which this regular expression was
designed. The regular expression is instead expecting lines more like:
-rw-r--r-- 1 jarich jarich 7376 Apr 4 14:03 usingobjects.sgml
-rw-r--r-- 1 jarich jarich 15092 Apr 4 14:03 whatisperl.sgml
Using extended regular expressions allows us to break up our expression
to better match how we think of the data. Note that when using extended
regular expressions you must specify your space characters explicitly.
You can do this with [ ] or \ if you like, but if you really want
to match any sort of whitespace character (including tabs and newlines)
then we recommend using \s. If you need to search for a literal
hash, just use \#.
The /x switch also works for the left hand side (only) of substitutes too:
s{
(\W*) # $1 - Before the first word
(\w+) # $2 - The first word (assume no punctuation)
(.*) # $3 - Everything in the middle
\b(\w+) # $4 - The last word
}
{$1$4$3$2}x; # Swap first and last words
Extended regular expressions still run just as quickly as their more dense counterparts, so you should consider using them whenever possible.
Consider the following expression:
m{Perl Training|Perl Courses|Perl Course Notes|Perl Tips};
This will match many of the things Perl Training Australia offers. However consider matching it against the following string:
$string = "Can you help me find a Perl Programmer?";
In this case the Perl regular expression engine will try really hard to
match. It starts by trying to match Perl Training, as that's our
first alternative. However while it finds a partial match
with Perl , it's not followed by a T, so it fails and backtracks.
Before Perl 5.10, the regular expression engine goes to the next match,
and starts to match Perl all over again. It does this for each
failed match, getting as far as the first failing character, going on to
the next alternative, and stepping through all the common characters
all over again. This is a big waste of time, and if we had a longer
string of common characters, it would take even longer.
We can make this expression much more efficient with a very little change:
m{Perl (?:Training|Courses|Course Notes|Tips)};
Now the regular expression engine matches Perl only once, and can then
fail very rapidly as it sees the next character in our string doesn't
match any of our alternatives. Under Perl 5.8.8 our new regexp runs
eight times faster than our old regexp for the example given.
The Perl 5.10 regexp engine is smart enough to do this factoring by itself, so both expressions run with the same speed on 5.10 systems.
As an extension to the above improvement, if you know roughly how often you'll be seeing each option in your alternation, move the most commonly matched ones to the front. This reduces the amount of failures and backtracking required in order for a successful match.
Regular expressions are a great way to solve text-matching problems, but they're not the only way. Starting up the regular expression engine can be slow, and although there are many, many optimisations to make Perl's regular expressions really fast, they're not always the fastest solution.
Before writing any regular expression you should:
See if someone else has written it for you. Consider the
Regexp::Common modules for common
solutions to regular expressions. These can match balanced parentheses,
bracket and curlies; IP addresses; comments (C, C++, PHP, Java, Perl etc),
numbers and so much more. Chances are these will be more complete, more
efficient and more correct than the expressions you write yourself.
Consider specialist modules which might solve your problem. For example,
instead of attempting to parse lines from a comma or tab separated file, look
at Text::CSV_XS.
Instead of parsing apache logs by hand look at
ApacheLog::Parser.
Ask yourself whether there's a better function. Could you split the string
up on a regular delimiter (even if you need to stitch a field or two back
together again)? Are you dealing with fixed length records where substr
would work better? Would a combination of substr and index do the
trick faster?
By limiting the number of regular expressions your code uses, you may be able to both increase both the maintainability and speed of your program.
If you wish to determine if a change to your program or regular expression
makes it faster, then you may wish to consider using Perl's
Benchmark module. We have a separate
Perl-tip on this
topic.
Regular expression basics:
perldoc perlre,
perldoc
perlretut,
perldoc
perlrequick.
[ Perl tips index ]
[ Subscribe to Perl tips ]
| Location | Course | Course Date | Duration | Early Bird Date |
|---|---|---|---|---|
| Melbourne | Programming Perl | Tue 10 Feb 2009 | 4 days | Mon 12 Jan 2009 |
| Sydney | Programming Perl | Tue 10 Mar 2009 | 4 days | Mon 9 Feb 2009 |
| Canberra | Programming Perl | Tue 31 Mar 2009 | 4 days | Mon 2 Mar 2009 |
| Melbourne | Programming Perl | Tue 12 May 2009 | 4 days | Mon 13 Apr 2009 |
| Sydney | Programming Perl | Tue 2 Jun 2009 | 4 days | Mon 4 May 2009 |
| Melbourne | Programming Perl | Tue 15 Sep 2009 | 4 days | Mon 17 Aug 2009 |
For future dates, please see our training calendar.
This Perl tip and associated text is copyright Perl Training Australia. You may freely distribute this text so long as it is distributed in full with this Copyright noticed attached.
If you have any questions please don't hesitate to contact us:
| Email: | contact@perltraining.com.au |
| Phone: | 03 9354 6001 (Australia) |
| International: | +61 3 9354 6001 |
Copyright 2001-2008 Perl Training Australia. Contact us at contact@perltraining.com.au