Back-references allow us to re-match the same sequence of characters matched by an earlier set of parentheses. For example, we could use back-references to match repeating words:
this this
that that
bang bang
While we can use our capture variables in substitutions, they are of no use
in a simple match pattern, because $1 and friends aren't set until after the
match is complete. Something like:
say if m{(\b\w+) $1\b};
will not match "this this" or "that that". Rather, it will match a word
followed by a space, followed by whatever $1 was set to by an earlier
match.
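To make the pitfall concrete, here is a minimal sketch. The helper name flawed_match is made up for illustration; it first runs an unrelated match that sets $1 to "that", and then tries the flawed pattern against its argument.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original tip): an earlier, unrelated
# match sets $1 to "that", then we try the flawed pattern against $text.
sub flawed_match {
    my ($text) = @_;
    "that" =~ m{(\w+)} or return;    # this earlier match sets $1 to "that"
    # $1 is interpolated before matching, so the pattern is (\b\w+) that\b
    return $text =~ m{(\b\w+) $1\b} ? 1 : 0;
}

print flawed_match("this this"), "\n";   # prints 0: the repeat is missed
print flawed_match("this that"), "\n";   # prints 1: the stale $1 is matched
```

The pattern is built from the old value of $1 before the match starts, which is why the repeated word goes unnoticed.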
In order to match "this this" (or "that that") we need the special
regular expression metacharacters called back-references. These are
written \1, \2, etc. They refer to parenthesized parts of a match
pattern, just as $1 does, but within the same match rather than
referring back to the previous match.
# say if we find repeated words, eg: "this this"
say $1 if m{(\b\w+) \1\b};
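As a small runnable sketch of the same pattern, the helper below (first_repeat is a name invented for this example) returns the first doubled word in a string, or undef when there is none.

```perl
use strict;
use warnings;

# Hypothetical helper: return the first doubled word in $text, or undef.
sub first_repeat {
    my ($text) = @_;
    return $text =~ m{(\b\w+) \1\b} ? $1 : undef;
}

for my $line ("this this", "bang bang boom", "this that") {
    my $word = first_repeat($line);
    print defined $word ? "$line: $word\n" : "$line: no repeat\n";
}
# prints:
# this this: this
# bang bang boom: bang
# this that: no repeat
```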
Along with named matches, Perl 5.10.0 provides us with named
back-references using the \g{name} syntax:
say $+{repeated} if m{(?<repeated>\w+) \g{repeated}};
We can also use the \g{} syntax to refer to recent sets of parentheses
by counting backwards. \g{-1} matches whatever the most recent
set of parentheses (including named matches) matched, \g{-2} the
second most recent set, and so on. Thus the above could also have been
written:
say $+{repeated} if m{(?<repeated>\w+) \g{-1}};
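One place where counting backwards may help is when a pattern fragment is reused inside a larger pattern. The sketch below assumes two hypothetical fragments, $doubled and $doubled_abs, stored as plain strings and interpolated after another set of parentheses. The relative form still refers to its own parentheses, while the \1 form now refers to the first set in the combined pattern.

```perl
use strict;
use warnings;

# Hypothetical reusable fragments (names invented for this example).
my $doubled     = '(\w+) \g{-1}';   # relative: the set just before it
my $doubled_abs = '(\w+) \1';       # absolute: always the first set

# Each helper embeds a fragment after (\d+), so the word is now group 2.
sub repeat_relative {
    my ($line) = @_;
    return $line =~ m{^(\d+): $doubled\z} ? $2 : undef;
}

sub repeat_absolute {
    my ($line) = @_;
    return $line =~ m{^(\d+): $doubled_abs\z} ? $2 : undef;
}

print repeat_relative("12: bang bang"), "\n";                    # bang
print defined repeat_absolute("12: bang bang") ? "yes" : "no";   # no
print "\n";
print repeat_absolute("12: bang 12"), "\n";                      # bang
```

In the last call \1 has silently become a reference to the digits, so "bang 12" is accepted as a "repeat".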
Finally, we can use \g{} to write ordinary numbered back-references in a
way that is always unambiguous. Inside a regular expression, \10 can mean
either the character whose ordinal in octal is 010 (a backspace) or, if
there are at least 10 sets of capturing parentheses before it in the
regular expression, the 10th back-reference. It is therefore better (for
clarity and a lack of surprises) to always use \g{}. \1 and \g{1} are
identical.
say $1 if m{(\b\w+) \g{1}\b};
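The two readings of \10 can be seen side by side in the following sketch. The helper names are hypothetical, and the ten single-letter groups exist only to reach a tenth set of parentheses.

```perl
use strict;
use warnings;

# Hypothetical helpers, named here for illustration only.

# No capture groups: \10 is read as octal 010, the backspace character.
sub matches_octal {
    my ($text) = @_;
    return $text =~ m{^\10\z} ? 1 : 0;
}

# Ten capture groups: \10 is now read as a back-reference to group 10.
sub matches_tenth_bare {
    my ($text) = @_;
    return $text =~ m{^(a)(b)(c)(d)(e)(f)(g)(h)(i)(j)\10\z} ? 1 : 0;
}

# \g{10} can only ever mean the tenth set of parentheses.
sub matches_tenth_braced {
    my ($text) = @_;
    return $text =~ m{^(a)(b)(c)(d)(e)(f)(g)(h)(i)(j)\g{10}\z} ? 1 : 0;
}

print matches_octal("\x08"), "\n";                  # 1
print matches_tenth_bare("abcdefghijj"), "\n";      # 1
print matches_tenth_bare("abcdefghij\x08"), "\n";   # 0
print matches_tenth_braced("abcdefghijj"), "\n";    # 1
```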
The braces in the above examples are not required if you're using a
numbered back-reference. Thus sometimes you'll see \g1 and \g2 etc:
say $1 if m{(\b\w+) \g1\b};
Likewise, named back-references can also be written \k<name>:
say $+{repeated} if m{(?<repeated>\w+) \k<repeated>};
Although these syntaxes are allowed, we recommend always using \g{} for
consistency.
Regardless of the syntax we use to write our back-reference, it is important to remember that any back-reference will only match the characters the matching set of parentheses matched. This is why we require both of the word boundaries in the above examples. Without them we'd match part-way through words:
say $1 if m{(\w+) \g{1}};
# matches: "this is a test" (prints "is")
# matches: "an antelope ate the apples" (prints "an")
In the first case, the parentheses are starting their match part way through a word (at "is" in "this") and the back-reference is matching "is" standing alone. In the second case, the parentheses are matching a whole word ("an") but the back-reference is matching a partial word ("an" in "antelope"). If we wish to match duplicate words, we need to match only full words, so we require that both the parentheses and the back-reference be bounded with word boundaries.
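The difference the word boundaries make can be checked with a short sketch. The two helpers below (loose_repeat and strict_repeat are names invented here) apply the pattern without and with boundaries to the same strings.

```perl
use strict;
use warnings;

# Hypothetical helpers: the same back-reference with and without boundaries.
sub loose_repeat {
    my ($text) = @_;
    return $text =~ m{(\w+) \g{1}} ? $1 : undef;
}

sub strict_repeat {
    my ($text) = @_;
    return $text =~ m{\b(\w+) \g{1}\b} ? $1 : undef;
}

for my $text ("this is a test",
              "an antelope ate the apples",
              "Paris in the the spring") {
    my $loose  = loose_repeat($text)  // "none";
    my $strict = strict_repeat($text) // "none";
    print "$text: loose=$loose strict=$strict\n";
}
# prints:
# this is a test: loose=is strict=none
# an antelope ate the apples: loose=an strict=none
# Paris in the the spring: loose=the strict=the
```

Only the genuinely duplicated word survives once both ends are anchored to word boundaries.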
For more information on back-references, check out the handy Perl Regular Expression Tutorial (perlretut).
This Perl tip and associated text is copyright Perl Training Australia. You may freely distribute this text so long as it is distributed in full with this copyright notice attached.
Copyright 2001-2012 Perl Training Australia. Contact us at contact@perltraining.com.au