The regular expression extensions are a way to significantly add to the power of patterns without adding a lot of meta-characters to the proliferation that already exists. By using the basic (?...) notation, the regular expression capabilities can be greatly extended.
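At this time, Perl recognizes five extensions. These vary widely in functionality, from adding comments to setting options. They are the embedded comment (?#...), the non-capturing group (?:...), the positive look-ahead assertion (?=...), the negative look-ahead assertion (?!...), and embedded pattern options such as (?i).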
A very useful feature of extended mode is the ability to add comments directly inside your patterns. For example, would you rather see a pattern that looks like this (ext1.pl):
# Match a string with two words. $1 will be the
# first word. $2 will be the second word.
m/^\s*(\w+)\W+(\w+)\s*$/;
or one that looks like this (ext2.pl):
m/
(?# This pattern will match any string with two)
(?# and only two words in it. The matched words)
(?# will be available in $1 and $2 if the match)
(?# is successful.)
^ (?# Anchor this match to the beginning)
(?# of the string)
\s* (?# skip over any whitespace characters)
(?# use the * because there may be none)
(\w+) (?# Match the first word, we know it's)
(?# the first word because of the anchor)
(?# above. Place the matched word into)
(?# pattern memory.)
\W+ (?# Match at least one non-word)
(?# character, there may be more than one)
(\w+) (?# Match another word, put into pattern)
(?# memory also.)
\s* (?# skip over any whitespace characters)
(?# use the * because there may be none)
$ (?# Anchor this match to the end of the)
(?# string. Because both ^ and $ anchors)
(?# are present, the entire string will)
(?# need to match the pattern. A)
(?# sub-string that fits the pattern will)
(?# not match.)
/x;
Of course, the commented pattern is much longer, but it takes the same amount of time to execute. In addition, it will be much easier to maintain the commented pattern because each component is explained. When you know what each component is doing in relation to the rest of the pattern, it becomes easy to modify its behavior when the need arises.
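As a quick check that the compact and commented forms behave the same way, the following sketch runs both against a few strings. The sample strings and the names @samples, @results, $plain, and $commented are assumptions made up for this illustration; they are not part of the ext1.pl or ext2.pl listings.

```perl
use strict;
use warnings;

# Hypothetical sample strings, chosen only for this illustration.
my @samples = ('hello world', '  two words  ', 'one', 'three little words');

my @results;
foreach my $string (@samples) {
    # Compact form of the pattern.
    my $plain = $string =~ m/^\s*(\w+)\W+(\w+)\s*$/ ? "$1,$2" : 'no match';

    # Extended form: whitespace is ignored under /x, and each
    # (?#...) group is a comment.
    my $commented = $string =~ m/
        ^       (?# anchor to the beginning of the string)
        \s*     (?# optional leading whitespace)
        (\w+)   (?# first word, saved in pattern memory)
        \W+     (?# at least one non-word character)
        (\w+)   (?# second word, saved in pattern memory)
        \s*     (?# optional trailing whitespace)
        $       (?# anchor to the end of the string)
    /x ? "$1,$2" : 'no match';

    push @results, "$plain | $commented";
    printf "%-22s %s | %s\n", "[$string]", $plain, $commented;
}
```

Both columns should agree for every string: the first two samples match, and the one-word and three-word samples do not. Keep in mind that a (?#...) comment ends at the first closing parenthesis, so the comment text itself cannot contain one.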
Extensions also let you group pattern components, and so change the order of evaluation, without affecting pattern memory. For example (ext3.pl):
m/(?:a|b)+/;
will match a run of one or more characters, each of which is either a or b (for example, a, bb, or abba). The parentheses only group the alternation so that the + applies to all of it; the pattern memory will not be affected.
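To see the effect on pattern memory, the sketch below compares a non-capturing group with an ordinary capturing group. The sample string and the variable names are hypothetical, and a second set of parentheses, (y+), is added only so that there is something to capture.

```perl
use strict;
use warnings;

my $string = 'xxabbayy';   # hypothetical sample text

# The group matches a mixed run of a and b characters.
$string =~ m/(?:a|b)+/;
my $run = $&;

# Non-capturing group: (y+) is the only capturing group, so it is $1.
my @plain_groups = $string =~ m/(?:a|b)+(y+)/;

# Capturing group: (a|b) now takes $1 and pushes (y+) to $2.
my @capture_groups = $string =~ m/(a|b)+(y+)/;

print "matched run:   $run\n";
print "non-capturing: @plain_groups\n";
print "capturing:     @capture_groups\n";
```

With (?:...), later capturing groups keep the numbers you would expect, which matters when a grouping is added to an existing pattern whose $1 and $2 are already in use.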
At times, you might like to include a pattern component in your pattern without including it in the $& variable that holds the matched string. The technical term for this is a zero-width positive look-ahead assertion. You can use this to ensure that the string following the matched component is correct without affecting the matched value. For example, if you have some data that looks like this:
David Veterinarian 56
Jackie Orthopedist 34
Karen Veterinarian 28
and you want to find all veterinarians and store the value of the first column, you can use a look-ahead assertion. This will do both tasks in one step. For example (ext4.pl):
while (<>) {
    push(@array, $&) if m/^\w+(?=\s+Vet)/;
}
print("@array\n");
This program will display:
David Karen
Let's look at the pattern with comments added using the extended mode. In this case, it doesn't make sense to add comments directly to the pattern because the pattern is part of the if statement modifier. Adding comments in that location would make the comments hard to format. So let's use a different tactic (ext5.pl).
$pattern = '^\w+ (?# Match the first word in the string)
(?=\s+ (?# Use a look-ahead assertion to match)
(?# one or more whitespace characters)
Vet) (?# In addition to the whitespace, make)
(?# sure that the next column starts)
(?# with the character sequence "Vet")
';
while (<>) {
    push(@array, $&) if m/$pattern/x;
}
print("@array\n");
Here we used a variable to hold the pattern and then used variable interpolation in the pattern with the match operator. You might want to pick a more descriptive variable name than $pattern, however.
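As one possible variation, the sketch below gives the pattern a more descriptive name and stores it with the qr// operator instead of a quoted string. Note the assumptions: qr// requires Perl 5.005 or later, the name $name_before_vet is hypothetical, and the records are held in an array rather than read from a file so that the example is self-contained.

```perl
use strict;
use warnings;

# Sample records from the text, held in an array instead of
# being read from a file.
my @records = (
    'David Veterinarian 56',
    'Jackie Orthopedist 34',
    'Karen Veterinarian 28',
);

# Hypothetical descriptive name for the pattern. qr//x compiles it
# once and lets the comments sit next to each component.
my $name_before_vet = qr/
    ^\w+        (?# the first word in the string)
    (?= \s+     (?# look ahead: one or more whitespace characters)
        Vet     (?# followed by the character sequence Vet)
    )
/x;

my @vets;
foreach my $record (@records) {
    push @vets, $& if $record =~ $name_before_vet;
}
print "@vets\n";    # David Karen
```

Because the look-ahead is zero-width, $& holds only the first word even though the pattern also checked the second column.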
The last extension that we'll discuss is the zero-width negative look-ahead assertion. This type of component is used to specify values that shouldn't follow the matched string. For example, using the same data as in the previous example, you can look for everyone who is not a veterinarian. Your first inclination might be to simply replace the (?=...) with the (?!...) in the previous example (ext6.pl):
while (<>) {
    push(@array, $&) if m/^\w+(?!\s+Vet)/;
}
print("@array\n");
Unfortunately, this program displays
Davi Jackie Kare
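which is not what you need. The problem is backtracking. When \w+ matches all of David, the assertion fails because whitespace and Vet do follow, so Perl backs up one character and tries again. With \w+ matching only Davi, the next character is d, which is not whitespace, so the negative assertion succeeds and Davi is accepted as the match. To correctly match the first word, you need to explicitly tell Perl that the first word ends at a word boundary, like this (ext7.pl):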
while (<>) {
    push(@array, $&) if m/^\w+\b(?!\s+Vet)/;
}
print("@array\n");
This program displays
Jackie
which is correct.
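The two behaviors can be compared side by side with a short sketch. The helper first_words() and the variable names are hypothetical, the records are held in an array so the example is self-contained, and qr// assumes Perl 5.005 or later.

```perl
use strict;
use warnings;

my @records = (
    'David Veterinarian 56',
    'Jackie Orthopedist 34',
    'Karen Veterinarian 28',
);

# Collect the matched text ($&) for every record a pattern matches.
sub first_words {
    my ($pattern) = @_;
    my @found;
    foreach my $record (@records) {
        push @found, $& if $record =~ $pattern;
    }
    return "@found";
}

# Without \b, \w+ can give back a character, so the assertion is
# tested in the middle of the name, where it succeeds.
my $without_boundary = first_words(qr/^\w+(?!\s+Vet)/);

# With \b, the word must end before the assertion is tested.
my $with_boundary = first_words(qr/^\w+\b(?!\s+Vet)/);

print "without \\b: $without_boundary\n";
print "with \\b:    $with_boundary\n";
```

The first line of output reproduces the truncated names from ext6.pl, and the second shows only Jackie, as in ext7.pl.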
Note: There are many ways of matching any value. If the first method you try doesn't work, try breaking the value into smaller components and match each boundary. If all else fails, you can always ask for help on the comp.lang.perl.misc newsgroup.