Using the Match Operator

Here are some handy uses of the match operator:

If you need to find repeated characters in a string like the AA in "ABC AA ABC", then do this:
```
m/(.)\1/;
```
This pattern uses pattern memory to store a single character, and the back-reference (\1) then requires that same character to appear again immediately. A back-reference refers to pattern memory from inside the pattern; anywhere else in the program, use the $1 variable. After a successful match, $1 will hold the repeated character. The pattern matches any non-newline character that occurs twice in a row.
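As a minimal sketch of how the repeated-character pattern might be used, it can be wrapped in a small helper that reports the doubled character. The name first_repeat is hypothetical and is not part of the original example.

```perl
use strict;
use warnings;

# Hypothetical helper: return the first character that appears twice
# in a row, or undef when the string has no doubled character.
sub first_repeat {
    my ($text) = @_;
    return $text =~ m/(.)\1/ ? $1 : undef;
}

my $doubled = first_repeat("ABC AA ABC");
print "Repeated character: $doubled\n" if defined $doubled;   # prints "Repeated character: A"
```

Testing the result of the match before reading $1 matters, because $1 keeps its old value when a match fails.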
If you need to find the first word in a string, then do this:
```
m/^\s*(\w+)/;
```
After this statement, $1 will hold the first word in the string. Any whitespace at the beginning of the string is skipped by the \s* meta-character sequence, and the \w+ meta-character sequence then matches the first word. Note that the * quantifier, which matches zero or more, is used for the whitespace because there may not be any. The + quantifier, which matches one or more, is used for the word.
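The first-word pattern can be tried out with a short sketch like the following. The helper name first_word is an assumption made for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return the first word of a string, skipping any
# leading whitespace, or undef when the string does not start with a word.
sub first_word {
    my ($text) = @_;
    return $text =~ m/^\s*(\w+)/ ? $1 : undef;
}

print first_word("   Hello, world"), "\n";   # prints "Hello"
```

Because the pattern is anchored with a caret, a string such as "  !!!" gives no match at all rather than a word from later in the string.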

If you need to find the last word in a string, then do this:

```
m/
    (\w+)      (?# Match a word, store its value into pattern memory)
    [.!?]?     (?# Some strings might hold a sentence. If so, this)
               (?# component will match zero or one punctuation)
               (?# characters)

    \s*        (?# Match trailing whitespace using the * because there)
               (?# might not be any)
    $          (?# Anchor the match to the end of the string)
/x;
```

After this statement, $1 will hold the last word in the string. To handle other kinds of trailing punctuation, expand the character class, [.!?], by adding more punctuation characters.
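One possible way to expand the character class is sketched below. The extra punctuation characters and the switch from ? to * (so that several trailing marks are allowed) are assumptions, as is the helper name last_word.

```perl
use strict;
use warnings;

# Hypothetical helper: return the last word of a string, ignoring any
# trailing punctuation and whitespace, or undef when there is no word.
sub last_word {
    my ($text) = @_;
    return $text =~ m/(\w+)[.!?,;:"')]*\s*$/ ? $1 : undef;
}

print last_word("This is the way to San Jose."), "\n";   # prints "Jose"
print last_word('He said (quietly).'), "\n";             # prints "quietly"
```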

If you need to know that there are only two words in a string, you can do this:
```
m/^(\w+)\W+(\w+)$/x;
```
After this statement, $1 will hold the first word and $2 will hold the second word, assuming that the pattern matches. The pattern starts with a caret and ends with a dollar sign, which means that the entire string must match the pattern. The \w+ meta-character sequence matches one word. The \W+ meta-character sequence matches the whitespace between words; because \W matches any non-word character, punctuation between the words is accepted as well. You can test for additional words by adding one \W+(\w+) meta-character sequence for each additional word to match.
If you need to know that there are only two words in a string while ignoring leading or trailing spaces, you can do this:
```
m/^\s*(\w+)\W+(\w+)\s*$/;
```
After this statement, $1 will hold the first word and $2 will hold the second word, assuming that the pattern matches. The \s* meta-character sequence will match any leading or trailing whitespace.
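A short sketch of the two-word test in use follows. The helper name two_words is hypothetical; it returns both words on success and an empty list otherwise.

```perl
use strict;
use warnings;

# Hypothetical helper: return the two words when the string holds
# exactly two words (ignoring surrounding whitespace), else an empty list.
sub two_words {
    my ($text) = @_;
    return $text =~ m/^\s*(\w+)\W+(\w+)\s*$/ ? ($1, $2) : ();
}

foreach my $candidate ("  hello world  ", "one two three") {
    my @words = two_words($candidate);
    print "'$candidate': ", (@words ? "@words" : "not exactly two words"), "\n";
}
```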
If you need to assign the first two words in a string to $one and $two and the rest of the string to $rest, you can do this:
```
$_ = "This is the way to San Jose.";
$word   = '\w+';    # match a whole word.
$space  = '\W+';    # match at least one non-word character
$string = '.*';     # match any number of anything except
                    # for the newline character.
($one, $two, $rest) = (m/^($word) $space ($word) $space ($string)/x);
```
After this statement, $one will hold the first word, $two will hold the second word, and $rest will hold everything else in the $_ variable. The same values are also available in $1, $2, and $3. This example uses variable interpolation to, hopefully, make the match pattern easier to read. This technique also emphasizes which meta-sequence is used to match words and whitespace. It lets the reader focus on the whole of the pattern rather than the individual pattern components by adding a level of abstraction.
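The same idea can be sketched with precompiled pattern pieces. Using qr// instead of plain strings is a substitution made here, not the original method, and the helper name first_two_and_rest is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: split a string into its first word, its second
# word, and whatever follows them.
sub first_two_and_rest {
    my ($text) = @_;

    my $word = qr/\w+/;   # a whole word
    my $gap  = qr/\W+/;   # one or more non-word characters
    my $tail = qr/.*/;    # anything except a newline

    # In list context the match returns the three captured groups,
    # or an empty list when the pattern does not match.
    return $text =~ m/^($word) $gap ($word) $gap ($tail)/x;
}

my ($one, $two, $rest) = first_two_and_rest("This is the way to San Jose.");
print "one=$one two=$two rest=$rest\n";
```

The /x option is what allows the spaces between the interpolated pieces; without it, each space would have to match a literal space in the string.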

If you need to see if $_ contains a legal Perl variable name, you can do this:

```
$result = m/
            ^          (?# Anchor the pattern to the start of the string)
            [\$\@\%]   (?# Use a character class to match the first)
                       (?# character of a variable name)

            [a-z]      (?# Use a character class to ensure that the)
                       (?# first character of the name is a letter)

            \w*        (?# Match the rest of the variable name, which)
                       (?# may hold only alphanumeric or underscore)
                       (?# characters)

            $          (?# Anchor the pattern to the end of the)
                       (?# string. This means that for the pattern to)
                       (?# match, the variable name must be the only)
                       (?# value in the string being tested)

          /ix;         # Use the /i option so that the search is
                       # case-insensitive and use the /x option to
                       # allow whitespace and comments in the pattern.
```

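After this statement, $result will be true if $_ contains a legal variable name and false if it does not. The pattern covers only simple names: it does not accept names that begin with an underscore or special variables such as $_ and $1.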
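A compact, single-line form of the variable-name pattern can be exercised against a few candidates, as in this sketch. The helper name is_simple_var_name and the sample names are assumptions chosen for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: true (1) when the string is a sigil followed by
# a letter and then any number of word characters, false (0) otherwise.
sub is_simple_var_name {
    my ($text) = @_;
    return $text =~ m/^[\$\@\%][a-z]\w*$/i ? 1 : 0;
}

# Single quotes keep Perl from interpolating the sample names.
foreach my $candidate ('$count', '@list2', '%Lookup', '$9lives', 'count') {
    my $verdict = is_simple_var_name($candidate) ? "accepted" : "rejected";
    print "$candidate: $verdict\n";
}
```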

If you need to see if $_ contains a legal integer literal, you can do this:

```
$result = m/
            (?# First check for just numbers in the string)

            ^         (?# Anchor to the start of the string)
            \d+       (?# Match one or more digits)
            $         (?# Anchor to the end of the string)
            |         (?# or)

            (?# Now check for hexadecimal numbers)

            ^         (?# Anchor to the start of the string)
            0x        (?# The "0x" sequence starts a hexadecimal number)
            [\da-f]+  (?# Match one or more hexadecimal characters)
            $         (?# Anchor to the end of the string)
          /ix;
```

After this statement, $result will be true if $_ contains an integer literal and false if it does not.
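The integer test can be combined with a conversion step, as in the following sketch. The pattern is a compact form with the two alternatives grouped between one pair of anchors; the helper names is_integer_literal and integer_value are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: true (1) for decimal digits or a 0x-prefixed
# hexadecimal number, false (0) for anything else.
sub is_integer_literal {
    my ($text) = @_;
    return $text =~ m/^(?:\d+|0x[\da-f]+)$/i ? 1 : 0;
}

# Hypothetical helper: numeric value of a validated literal, or undef.
sub integer_value {
    my ($text) = @_;
    return unless is_integer_literal($text);
    return $text =~ m/^0x/i ? hex($text) : $text + 0;
}

foreach my $candidate ("42", "0x1F", "12abc") {
    my $value = integer_value($candidate);
    print "$candidate: ", (defined $value ? $value : "not an integer literal"), "\n";
}
```

Validating first and converting second keeps hex() from being handed a string that it would only partly understand.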

If you need to match all legal integers in $_, you can do this:
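```
@results = m/\b(?:0x[\da-f]+|\d+)\b/gi;
```
After this statement, @results will contain a list of all integer literals in $_. @results will contain an empty list if no literals are found. The \b word-boundary anchors are used here instead of ^ and $ so that a literal can be found anywhere in the string, not only when it is the whole string.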
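A sketch of collecting integer literals that are embedded in longer text follows. It assumes \b word boundaries rather than ^ and $ anchors, so that literals in the middle of a string are found; find_integers is a hypothetical name.

```perl
use strict;
use warnings;

# Hypothetical helper: return every decimal or 0x-prefixed hexadecimal
# literal found in the text, in order of appearance.
sub find_integers {
    my ($text) = @_;
    # The group is non-capturing, so in list context the /g match
    # returns each complete match.
    my @found = $text =~ m/\b(?:0x[\da-f]+|\d+)\b/gi;
    return @found;
}

my @literals = find_integers("Offsets 10, 0x1F and 255 are used.");
print "@literals\n";   # prints "10 0x1F 255"
```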
If you need to match the end of the first word in a string, you can do this:
```
m/\w\W/;
```
After this statement is executed, $& will hold the last character of the first word and the non-word character that follows it. If you want only the last character, use pattern memory:
```
m/(\w)\W/;
```
Then $1 will be equal to the last character of the first word. If you add the global option and evaluate the match in list context,
```
@array = m/(\w)\W/g;
```
then you can create an array that holds the last character of each word that is followed by a non-word character.
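The global form can be sketched as a helper that returns the final character of each word. It assumes a capturing group around \w so that only that character is returned; the name word_ends is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: return the last character of every word that is
# followed by a non-word character.
sub word_ends {
    my ($text) = @_;
    my @ends = $text =~ m/(\w)\W/g;
    return @ends;
}

print join(" ", word_ends("The quick fox.")), "\n";   # prints "e k x"
```

A word at the very end of the string, with nothing after it, is not reported, because \W must match an actual character.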
If you need to match the start of the second word in a string, you can do this:
```
m/\W\w/;
```
After this statement, $& will hold the non-word character that immediately precedes the second word, followed by the first character of that word. While this pattern is the mirror image of the pattern that matches the end of words, it will not match the beginning of the first word when the string starts with a word, because the \W meta-character needs a non-word character to match. Simply adding a * meta-character to the pattern after the \W does not help, because the pattern could then match on zero non-word characters and would therefore match every word character in the string.
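If the start of the first word is needed too, one possible workaround is to accept either the start of the string or a non-word character before the word. This is a sketch of an alternative, not the pattern shown above, and the name word_starts is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: return the first character of every word,
# including a word that begins the string.
sub word_starts {
    my ($text) = @_;
    my @starts = $text =~ m/(?:^|\W)(\w)/g;
    return @starts;
}

print join(" ", word_starts("The quick fox")), "\n";   # prints "T q f"
```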
If you need to match the file name in a file specification, you can do this:
```
$_ = '/user/Jackie/temp/names.dat';
m!^.*/(.*)!;
```
After this match statement, $1 will equal names.dat. The match is anchored to the beginning of the string, and the .* component matches everything up to the last slash because the * quantifier is greedy. Then the next (.*) matches the file name and stores it into pattern memory. You can store the file path into pattern memory by placing parentheses around the first .* component; the path is then in $1 and the file name in $2.
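Capturing both the path and the file name might look like the following sketch. The helper name split_path is an assumption made for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: split a file specification at its last slash and
# return (path, file name), or an empty list when there is no slash.
sub split_path {
    my ($spec) = @_;
    return $spec =~ m!^(.*)/(.*)! ? ($1, $2) : ();
}

my ($dir, $file) = split_path('/user/Jackie/temp/names.dat');
print "dir=$dir file=$file\n";   # prints "dir=/user/Jackie/temp file=names.dat"
```

A specification with no slash, such as names.dat on its own, does not match this pattern at all, so the caller should check for the empty list.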
If you need to match two prefixes and one root word, like "rockfish" and "monkfish," you can do this:
```
m/(?:rock|monk)fish/x;
```
The alternation meta-character (|) is used to say that either rock or monk followed by fish needs to be found. The (?:...) grouping does not store anything in pattern memory. If you need to know which alternative was found, then use regular parentheses in the pattern instead. After the match, $1 will then be equal to either rock or monk.
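The capturing variant can be sketched as follows. The helper name fish_prefix is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: return "rock" or "monk" depending on which
# prefix was found in front of "fish", or undef when neither was.
sub fish_prefix {
    my ($text) = @_;
    return $text =~ m/(rock|monk)fish/ ? $1 : undef;
}

foreach my $dish ("grilled monkfish", "rockfish stew", "catfish") {
    my $prefix = fish_prefix($dish);
    print "$dish: ", (defined $prefix ? $prefix : "no match"), "\n";
}
```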

If you want to search a file for a string and print some of the surrounding lines, you can do this:

```
# Read the whole file into memory.
open(FILE, "<fndstr.dat") or die("Cannot open fndstr.dat: $!\n");
@array = <FILE>;
close(FILE);

# Specify which string to find.
$stringToFind = "A";

# Iterate over the array looking for the string.
for ($index = 0; $index <= $#array; $index++) {
    last if $array[$index] =~ /$stringToFind/;
}
die("'$stringToFind' was not found.\n") if $index > $#array;

# Print two lines before and two lines after the line that
# contains the match, without running past either end of the array.
$first = ($index < 2)           ? 0       : $index - 2;
$last  = ($index + 2 > $#array) ? $#array : $index + 2;
foreach $line ($first .. $last) {
    print("$line: $array[$line]");
}
```

There are many ways to perform this type of search, and this is just one of them. This technique is only good for relatively small files because the entire file is read into memory at once. The numbers that are printed are array indexes, so the first line of the file is line 0. The program stops with a message if the file cannot be opened or if it does not contain the string that you are looking for.

dave@cs.cf.ac.uk