Next: Quantifiers Up: Regular Expressions Previous: The Binding Operators

Character Classes

A character class defines a set of characters, any one of which may match at that position in the pattern. The character class [0123456789] defines the class of decimal digits, and [0-9a-f] defines the class of lowercase hexadecimal digits. Notice that you can use a dash to define a range of consecutive characters. Character classes let you match any one of a set of characters without knowing in advance which character will be matched. This capability to match non-specific characters is what meta-characters are all about.
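
As a small illustration of listed versus ranged classes, the sketch below checks a few sample strings. The helper name has_hex_digit and the sample strings are made up for this example and are not part of the original text.

```perl
use strict;
use warnings;

# Hypothetical helper: true (1) if the string contains at least one
# lowercase hexadecimal digit, false (0) otherwise.
sub has_hex_digit {
    my ($text) = @_;
    return $text =~ /[0-9a-f]/ ? 1 : 0;
}

# The explicit list and the range describe the same class of decimal digits.
my @samples = ("room 7", "xyz", "beef");
for my $s (@samples) {
    my $listed = $s =~ /[0123456789]/ ? "yes" : "no";
    my $ranged = $s =~ /[0-9]/        ? "yes" : "no";
    print "$s: listed=$listed ranged=$ranged hex=", has_hex_digit($s), "\n";
}
```

For every sample, the listed and ranged columns should agree, since both classes contain exactly the ten decimal digits.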

You can use variable interpolation inside the character class, but you must be careful when doing so. For example (char1.pl),

$_ = "AAABBBCCC";
$charList = "ADE";
print "matched" if m/[$charList]/;

will display

matched

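This is because the variable interpolation results in a character class of [ADE]. If you use the variable as one half of a character range, you need to ensure that the resulting range is valid: its starting character must not come after its ending character in the character set, which is easy to get wrong when you mix letters and digits. For example (char2.pl),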

$_ = "AAABBBCCC";
$charList = "ADE";
print "matched" if m/[$charList-9]/;

will result in the following error message when executed:

/[ADE-9]/: invalid [] range in regexp at test.pl line 4.
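
One way to guard against this kind of failure is to compile the interpolated pattern inside an eval block, so that an invalid range can be detected instead of stopping the script. The following is only a sketch; the helper name try_class and its two parameters are invented for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: build a character class from a list of characters
# followed by a range end. Returns the compiled pattern, or undef if Perl
# rejects it (for example, because the range is reversed).
sub try_class {
    my ($chars, $range_end) = @_;
    my $re = eval { qr/[$chars-$range_end]/ };
    return $re;
}

my $bad  = try_class("ADE", "9");   # [ADE-9]: E sorts after 9, so invalid
my $good = try_class("ADE", "Z");   # [ADE-Z]: A, D, and the range E-Z

print "ADE-9 compiled: ", (defined $bad  ? "yes" : "no"), "\n";
print "ADE-Z compiled: ", (defined $good ? "yes" : "no"), "\n";

# \Q...\E escapes the interpolated text, so a dash in the list is literal
# rather than the start of a range.
my $list = "A-";
print "dash is literal\n" if "x-y" =~ /[\Q$list\E]/;
```

If the list of characters comes from outside the program, wrapping it in \Q...\E as in the last lines is probably the safer choice, because no character in it can then be read as a range operator.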

At times, it's necessary to match on any character except for a given character list. This is done by complementing the character class with the caret. For example (char3.pl),

$_ = "AAABBBCCC";
print "matched" if m/[^ABC]/;

will display nothing, because every character in the string is an A, B, or C. This match returns true only if the searched string contains a character other than A, B, or C. If you complement a list with just the letter A (char4.pl):

$_ = "AAABBBCCC";
print "matched" if m/[^A]/;

then the string "matched" will be displayed because B and C are part of the string; in other words, the string contains a character other than the letter A.
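
To see which character a complemented class actually matched, you can capture it. The sketch below assumes a made-up helper, first_outsider, and two sample strings chosen for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: return the first character of $text that is not in
# the class [ABC], or undef if every character is A, B, or C.
sub first_outsider {
    my ($text) = @_;
    return $text =~ /([^ABC])/ ? $1 : undef;
}

for my $s ("AAABBBCCC", "AAABxBCCC") {
    my $c = first_outsider($s);
    print "$s: ", (defined $c ? "found '$c'" : "nothing outside [ABC]"), "\n";
}
```

Keep in mind that character classes are case-sensitive unless the pattern uses the i modifier, so a lowercase c counts as a character outside [ABC].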

Perl has shortcuts for some frequently used character classes: the escape sequences \d (digit), \s (whitespace), and \w (word character). \D, \S, and \W are the negations of \d, \s, and \w.

You can use these symbols inside other character classes, but not as endpoints of a range. For example, you can do the following:

$_ = "\tAAA";
print "matched" if m/[\d\s]/;

which will display

matched

because the value of $_ includes the tab character.
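
The shortcuts combine inside a class just like ordinary characters. As a rough sketch, the helper below (count_digit_or_space is a name invented for this example) counts every character that is either a digit or whitespace.

```perl
use strict;
use warnings;

# Hypothetical helper: count the characters that are digits or whitespace.
sub count_digit_or_space {
    my ($text) = @_;
    # A global match in list context returns every match; assigning that
    # list to () and then to a scalar yields the number of matches.
    my $count = () = $text =~ /[\d\s]/g;
    return $count;
}

print count_digit_or_space("\tAAA"), "\n";   # prints 1: only the tab matches
print count_digit_or_space("a1 b2"), "\n";   # prints 3: "1", the space, "2"

# \W is the negation of \w, so [^\w] and \W match the same characters.
print "same\n" if ("a-b" =~ /[^\w]/) && ("a-b" =~ /\W/);
```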

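Tip: Most meta-characters that appear inside the square brackets that define a character class are used in their literal sense; they lose their meta-meaning. The exceptions are the caret at the start of the class, the dash between two characters, the closing bracket, and the backslash. This may be a little confusing at first.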
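
A short sketch of this behavior follows. The variable names $punct and $special and the sample strings are illustrative only.

```perl
use strict;
use warnings;

# Inside brackets, ., *, +, and ? stand for themselves.
my $punct = qr/[.*+?]/;

for my $s ("a.b", "abc", "what?") {
    print "$s: ", ($s =~ $punct ? "contains . * + or ?" : "no match"), "\n";
}

# A few characters keep a special role inside a class: ^ (when first),
# - (between two characters), ] and \. A backslash makes each one literal.
my $special = qr/[\^\-\]\\]/;
print "caret found\n" if "2^8" =~ $special;
```

Outside a class, the dot in the first pattern would match any character, so "abc" would match; inside the brackets it matches only a literal dot.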

Note: I think that most of the confusion regarding regular expressions lies in the fact that each character of a pattern might have several possible meanings. The caret could be an anchor, it could be a literal caret, or it could be used to complement a character class. Therefore, it is vital that you decide which context any given pattern character or symbol is in before assigning a meaning to it.
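
The different roles of the caret can be seen side by side in the following sketch, which uses a made-up test string.

```perl
use strict;
use warnings;

my $text = "a^b";

# Outside a class, ^ anchors the match at the start of the string.
print "anchor: starts with a\n" if $text =~ /^a/;

# Escaped with a backslash, ^ is a plain caret.
print "literal: contains a caret\n" if $text =~ /\^/;

# Inside a class but not first, ^ is also a plain caret.
print "literal in class: b or caret found\n" if $text =~ /[b^]/;

# First inside a class, ^ complements the class.
print "complement: non-letter found\n" if $text =~ /[^a-z]/;
```

All four patterns match this string, each for a different reason, which is why the position of the caret has to be checked before its meaning can be decided.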


dave@cs.cf.ac.uk