JavaScript Lesson 11, Regular Expressions

What are regular expressions?

A regular expression is similar to a string. However, it has one essential difference - it is a pattern or template which is matched against strings to see whether they fit it or not. The characters which make up regular expressions can have special meanings. For instance, the letter D on its own represents the character "D", but when preceded by a backslash (\D) it represents any character which is not a digit. Perhaps an example will make this clear:

<html>
<body>
<script language="JavaScript">
var example = "Sam, Samuel, Samantha, sam, balsam";
var rg = /Sam/;
document.write(example.replace(rg, "Pete"));
</script>
</body>
</html>

Running this program gives the following:

Pete, Samuel, Samantha, sam, balsam

So what's going on? Well, you will notice that the value given to the variable rg is enclosed within / / symbols rather than quotation marks. This automatically tells JavaScript that the value is a regular expression rather than a normal string.

The (normal) string, example, has a replace() method which is given, in this case, a regular expression and a string. When you met it in the previous section, it was given just two strings - and it had to replace one of them with the other inside the string. It really comes into its own with regular expressions.

The regular expression represents the string Sam in this case. The replace() method finds the first occurrence of Sam inside example, and replaces it with Pete, as you can see from the result displayed.

So far, regular expressions don't seem to be that different from normal strings. However, replace the declaration of the regular expression with the following:

var rg = /Sam/g;

The letter g has been added to the end of the regular expression, outside the / symbol but before the semicolon. The letter g, in this case, means "global replace", i.e. replace all the occurrences of Sam with Pete. This time, when you run the program, you get

Pete, Peteuel, Peteantha, sam, balsam

It has replaced all of them - hang on a moment! It missed a couple. No - those versions of Sam start with lower case letters, so they don't match the regular expression (it had an upper case S, didn't it!) If you want it to ignore the case of the characters, i.e. replace both Sam and sam wherever they occur, then add the letter i to the end of the regular expression alongside the g:

var rg = /Sam/gi;

This time, when the program is run you get:

Pete, Peteuel, Peteantha, Pete, balPete
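
The three variations above can be compared side by side. This is only a small sketch: it uses console.log rather than document.write so that it can also be run outside a web page, and the variable names first, all and allAnyCase are made up for the illustration.

```javascript
const example = "Sam, Samuel, Samantha, sam, balsam";

// No flag: only the first match is replaced.
const first = example.replace(/Sam/, "Pete");
// g flag: every match with exactly this capitalisation is replaced.
const all = example.replace(/Sam/g, "Pete");
// g and i flags: every match is replaced, ignoring case.
const allAnyCase = example.replace(/Sam/gi, "Pete");

console.log(first);      // Pete, Samuel, Samantha, sam, balsam
console.log(all);        // Pete, Peteuel, Peteantha, sam, balsam
console.log(allAnyCase); // Pete, Peteuel, Peteantha, Pete, balPete
```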

There is an alternative way to declare a regular expression, using the following syntax:

var rg = new RegExp("Sam");

This is equivalent to setting a regular expression using /Sam/, except that we are explicitly creating rg as an object of type RegExp, which means a regular expression. In this case, we can't add modifiers such as i or g by adding them to the end of the string. If we want to do that, then we need to pass them as a second parameter:

var rg = new RegExp("Sam","ig");

This is now equivalent to /Sam/ig. This method for setting up a regular expression takes more typing than the method involving the / signs, but has one advantage - you can use some sort of string variable to set up the regular expression. Here is the same example involving a string variable:

var choice = "Sam";
var rg = new RegExp(choice,"ig");

The fact that string variables can be included allows the patterns for the regular expressions to be entered by the user in prompt() statements or using form elements.
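
Here is a short sketch of the string-variable form in use. The value of choice is hard-coded to stand in for something the user typed, and the names replaced, digits and masked are invented for the example. One point worth knowing, which the text above does not mention: when the pattern is written as a normal string, any backslash in it has to be doubled, because the string itself treats a single backslash as an escape character (the \d symbol used here is explained later in this lesson).

```javascript
// "choice" stands in for text entered by the user (hypothetical value).
const choice = "Sam";
const rg = new RegExp(choice, "ig");
const replaced = "Sam, Samuel, sam".replace(rg, "Pete");
console.log(replaced); // Pete, Peteuel, Pete

// In a string, "\\d+" reaches RegExp as \d+ (one or more digits).
const digits = new RegExp("\\d+", "g");
const masked = "a1b22".replace(digits, "#");
console.log(masked); // a#b#
```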

Special characters in regular expressions

The special symbol \w matches any word character. A word character is defined as any upper or lower case letter, a digit or an underscore (i.e. any character that can make up a variable name). This means that the regular expression /a\wa/ will match any three-character string (even if it is inside a longer word) consisting of a letter a (lower case), followed by any word character, and then another a. It will find a match inside parallel, apathetic and aVa. Note that there are no spaces within the regular expression - if we put spaces into it, the regular expression would only match substrings which had spaces in the corresponding position. The regular expression /\w\w\w/ would match any three word characters.

The special symbol \d matches any digit (0 to 9). The special symbol \s represents a "white space" character (anything which produces white space on the screen), namely a space character (obviously), but also a tab character and a new-line character (specified in a string as \n).

The symbols \W, \D and \S represent, respectively, any character other than a word character, any character other than a digit and any character other than a white space character. For instance, if you wanted to match a three-letter word sandwiched between two non-word characters, you would use the regular expression /\W\w\w\w\W/ with the upper case symbol specifying that the characters to the right and left of the three letters you are trying to pick up are not word characters. Similarly, to find a word which consists of a single letter or digit or some other symbol in the middle of a sentence, you would use the regular expression /\s\S\s/, with the two \s symbols representing the space characters and the \S representing the non-space character in the middle. However, this regular expression won't find the single non-space character if it happens to be at the beginning or end of the string, as there wouldn't be a white space character at the start of the sentence or at the end.
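
The patterns described so far can be tried out with a few quick checks. This sketch uses the test() method of a regular expression, which is not covered in this lesson: it simply returns true if the pattern is found anywhere in the string and false otherwise. The sample strings and variable names are invented for the illustration.

```javascript
// /a\wa/ : a, then any word character, then a
const inParallel = /a\wa/.test("parallel"); // true ("ara")
const inBanjo = /a\wa/.test("banjo");       // false

// \d matches a digit
const hasTwoDigits = /\d\d/.test("room 42"); // true

// Three word characters with a non-word character on each side
const threeLetter = /\W\w\w\w\W/;
const midSentence = threeLetter.test("the cat sat"); // true (" cat ")
const wholeString = threeLetter.test("cat");         // false, nothing either side

// A single non-space character between two white space characters
const single = /\s\S\s/;
const inMiddle = single.test("this is a test"); // true (" a ")
const atStart = single.test("a test");          // false, no space before the a

console.log(inParallel, inBanjo, hasTwoDigits, midSentence, wholeString, inMiddle, atStart);
```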

To solve this problem, there is a special symbol \b which matches any word boundary. This might be the boundary between any "word" and a white space symbol (i.e. the normal boundaries at the end of words) or the boundary between a word and the beginning or end of the string. The regular expression /\b\w\w\w\b/ will therefore match three letter words (where "letter" comprises any word character) - and only three letter words (i.e. not any sequence of three word characters regardless of how long the word is that they are in), even if that three letter word is at the beginning or end of the string. Similarly, the symbol \B matches any position which isn't at a word boundary.

One thing to note about the symbols \b and \B is that they don't match characters, just positions. The symbol \b, for instance, does not "use up" a character in the string. For instance, the regular expression /\w\b\s\b\w/ represents three characters (a word character, followed by a white space, followed by a word character), rather than five. In this particular case, of course, the fact that a white space is being specified after a word character means that the \b symbol isn't really necessary. Similarly, the regular expression /pat\B/ will match the letters pat but only when they do not come at the end of a word. They can appear at the start of the word (as we have not specified a \B symbol at the start).
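
A few checks may help to show how \b and \B behave. As before, this is only a sketch: it relies on the test() method of a regular expression (true if the pattern is found, false if not), which this lesson does not otherwise cover, and the sample strings are invented.

```javascript
const threeLetterWord = /\b\w\w\w\b/;

const startOfString = threeLetterWord.test("cat food");        // true, "cat" starts the string
const endOfString = threeLetterWord.test("feed the cat");      // true, "the" and "cat"
const onlyLongWords = threeLetterWord.test("some longer words"); // false

// /pat\B/ : pat, but not at the end of a word
const inPattern = /pat\B/.test("pattern"); // true
const inSpat = /pat\B/.test("spat");       // false, pat ends the word
const inSpatula = /pat\B/.test("spatula"); // true, pat need not start the word

console.log(startOfString, endOfString, onlyLongWords, inPattern, inSpat, inSpatula);
```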

The symbols ^ and $ match only the start and the end of the string respectively. This means that you can match on a word which appears only at the start of the string, or only at the end of the string. The regular expression /^\w\w\w\b/ will only produce a match if there is a three-letter word at the start of the string. Similarly, the regular expression /\D\D\D$/ will produce a match if the last three characters of the string are not digits. Just as with \b and \B, the symbols ^ and $ do not represent characters, only positions. (Written with a backslash, \$ matches an actual dollar sign instead.)

Including characters within [^ and ] specifies that any characters except those specified are to be matched. For instance, if you wanted to match any character except B, D or T, you would use the following:

[^BDT]

Similarly, you can match any character that does not lie within a given range by specifying the first and last character in that range with a hyphen between them:

[^4-9]

(match any character except a digit in the range 4 to 9, i.e. 4, 5, 6, 7, 8 or 9)

^ is used without the presence of the square brackets to indicate the beginning of the string, i.e. /^D/ only matches D if it occurs at the start of the string. Similarly, $ indicates the end of the string and always follows the other parts (as the end of the string would), so /and$/ only matches the string and if it occurs at the end of the string.
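
The two uses of ^ (inside and outside square brackets) and the use of $ can be seen in the following sketch. The sample strings and variable names are invented, and test() is a method of regular expressions, not covered in this lesson, which returns true if the pattern is found.

```javascript
// ^ outside square brackets: start of the string
const dAtStart = /^D/.test("Dog");     // true
const dElsewhere = /^D/.test("a Dog"); // false

// $ : end of the string
const andAtEnd = /and$/.test("salt and");   // true
const andAtStart = /and$/.test("and salt"); // false

// ^ inside square brackets: anything except the listed characters
const notBDT = "BAT".replace(/[^BDT]/g, "-");        // "B-T"
const only4to9 = "0123456789".replace(/[^4-9]/g, ""); // "456789"

console.log(dAtStart, dElsewhere, andAtEnd, andAtStart, notBDT, only4to9);
```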

Ambiguous characters

Use square brackets to specify a set of characters, any one of which will match at a given position. For instance, to specify a vowel character in a regular expression, use [aeiou]. This represents just one character position in the string, and will produce a match if the character at that position is any one of the characters between the square brackets. The regular expression /\bp[aeiou]t\b/ matches the words pat, pet, pit, pot and put. Similarly, the regular expression /Q[\sAc$]/ will match the upper case letter Q followed by either a white-space character, an upper case letter A, a lower case letter c or a dollar sign. (Please note that inside square brackets the dollar sign is just an ordinary character; it has nothing to do with the $ that marks the end of the string.)

There is a special shortcut that can be used with ambiguous characters. To specify any character in a given range from a start to an end character, put a hyphen between those characters, so [A-Z] stands for any uppercase letter and [0-4] stands for any digit from 0 to 4 inclusive. These ranges can be combined with other characters, for instance [a-f\d] stands for any digit or lower case letter in the range a to f. Also, [A-Za-z] stands for any upper or lowercase letter, etc.
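
The square-bracket patterns mentioned above can be checked as follows. This is just a sketch with invented sample strings; test() is a method of regular expressions (not covered in this lesson) which returns true if the pattern is found.

```javascript
// One vowel between p and t, as a whole word
const words = "pat pet pit pot put pzt";
const starred = words.replace(/\bp[aeiou]t\b/g, "*"); // "* * * * * pzt"

// Inside the brackets, $ is an ordinary dollar sign
const qDollar = /Q[\sAc$]/.test("Q$"); // true
const qOther = /Q[\sAc$]/.test("Qx");  // false

// A range combined with \d: letters a to f, or any digit
const hashed = "face 2 Go".replace(/[a-f\d]/g, "#"); // "#### # Go"

console.log(starred, qDollar, qOther, hashed);
```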

Repetition

Place a plus sign, +, after a character or symbol to indicate "one or more" of that symbol:

\d matches a single digit.
\d+ matches one or more digits.
\d+, matches one or more digits followed by a comma.

Similarly, a ? sign matches the previous item zero or once, i.e. it makes the previous item optional.

ca?t

matches both cat and ct (i.e. the a is optional), but not caat

The * sign matches the previous item zero or more times:

ca*t

matches ct, cat, caat, caaaaat etc.

You can match a specified number of the previous item by enclosing the number within curly brackets, e.g. {40} matches exactly 40 of the previous item:

a{3}

matches aaa but not aa or a. It will also find a match inside aaaa, because the first three characters, aaa, match.

a{3,6}

matches aaa, aaaa, aaaaa or aaaaaa (i.e. anything from three a's to six a's). In a longer run of a's, it matches the first six.

a{4,}

matches any sequence of a's as long as there are four or more of them
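
The repetition symbols can be tested as shown in this sketch. To make sure that the whole string is being checked, and not just part of it, most of the patterns here are wrapped in ^ and $ (start and end of the string). The variable names are invented, and test() is a method of regular expressions, not covered in this lesson, which returns true if the pattern is found.

```javascript
// ? : zero or one a
const ct = /^ca?t$/.test("ct");     // true
const cat = /^ca?t$/.test("cat");   // true
const caat = /^ca?t$/.test("caat"); // false

// * : zero or more a's
const manyAs = /^ca*t$/.test("caaaaat"); // true

// + : one or more digits followed by a comma
const digitsComma = /\d+,/.test("1234, 5678"); // true

// {n}, {n,m} and {n,}
const exactlyThree = /^a{3}$/.test("aaa");       // true
const onlyTwo = /^a{3}$/.test("aa");             // false
const sevenIsTooMany = /^a{3,6}$/.test("aaaaaaa"); // false
const fourOrMore = /^a{4,}$/.test("aaaaaaa");    // true

// Without ^ and $, a{3} finds aaa inside aaaa
const partOfFour = "aaaa".replace(/a{3}/, "X"); // "Xa"

console.log(ct, cat, caat, manyAs, digitsComma, exactlyThree, onlyTwo, sevenIsTooMany, fourOrMore, partOfFour);
```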

Matching special characters

Suppose you want to match the symbol $ within a string. You can't just include the $ symbol within the regular expression, as JavaScript would assume you wanted to match the end of the string. To overcome this, precede the $ sign with the \ symbol to indicate that you actually want to match the dollar sign: \$

Similarly, preceding other special characters such as +, * or ^ by \ to give \+, \* and \^ will match those particular characters. JavaScript does not confuse these with the slashes pointing the other way, / /, which are used to specify the start and end of the regular expression itself.
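
The difference between an escaped and an unescaped special character can be seen in this short sketch. The sample string and the variable names are invented for the illustration.

```javascript
const price = "Cost: $25 + tax";

// \$ matches the dollar sign itself
const noDollar = price.replace(/\$/, "USD "); // "Cost: USD 25 + tax"
// \+ matches the plus sign itself
const noPlus = price.replace(/\+/, "plus");   // "Cost: $25 plus tax"
// An unescaped $ is the end of the string, so nothing is removed
const atEnd = price.replace(/$/, "!");        // "Cost: $25 + tax!"
// \^ matches the ^ character itself
const power = "2^3".replace(/\^/, "**");      // "2**3"

console.log(noDollar, noPlus, atEnd, power);
```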

A Summary of regular expression characters

Symbol | What does it match? | Example
-------|---------------------|--------
\d | Any digit from 0 to 9 | \d\d\d matches 415 but not xqy or 47x
\D | Any non-digit character | \D\D matches Q$ but not 76 or 9W
\w | Any word character, that is A-Z, a-z, 0-9 and the underscore character | \w\w\w matches A_4 but not ££$ or Z_%
\W | Any non-word character | \W\W matches &@, but not E4
\s | Any white space character, including tab, new line and carriage return | \s\s\s will match a space followed by a new line and then another space
\S | Any non-white space character | \S matches X, but not a tab character
. | Any single character | .. matches any two characters
[...] | Any one of the characters between the brackets | [Q4\s] matches Q, 4 or a single white space character but nothing else
[^...] | Any single character except the ones specified in the brackets | [^Q4\s] matches any character which isn't Q, 4 or a white space character
\b | Any boundary at the start or end of a word | c\b matches a single character c provided it ends a word (both \b and \B match positions, not characters)
\B | Any position which isn't the start or end of a word | pat\B matches pat in pattern but not in spat
{a} | Match a occurrences of the previous item | o{3} matches ooo
{a,b} | Match at least a and at most b occurrences of the previous item | £{2,4} matches ££, £££ and ££££
{a,} | Match a or more occurrences of the previous item | F{4,} matches FFFF, FFFFF, FFFFFF etc.
? | Match the previous item zero times or one time | P? matches P or nothing at all
+ | Match the previous item one or more times | P+ matches P, PP, PPP etc.
* | Match the previous item zero or more times | P* matches nothing, P, PP, PPP etc.

String methods and regular expressions

Regular expressions really come into their own when passed as parameters to methods built into string variables:

split()

You have met this with simple string parameters. It is used to split a string into substrings (words, typically) and the parameter that is passed is the character or string that appears between the substrings:

var s = "The dog, whose name is Fido, is here.";
var word_list = s.split(" ");

The example above creates the array word_list which will contain eight elements. It is equivalent to the following:

var word_list = new Array();
word_list[0] = "The";
word_list[1] = "dog,";
word_list[2] = "whose";
word_list[3] = "name";
word_list[4] = "is";
word_list[5] = "Fido,";
word_list[6] = "is";
word_list[7] = "here.";

However, you will notice that the punctuation has been included in the words, i.e. there is the occasional full stop or comma at the end of some of the words. What should we do if we only want the words themselves, and no punctuation? What we want is a version of split() which can extract only the words from the string and ignore everything else:

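var word_list = s.split(/[^a-zA-Z]+/);

The regular expression [^a-zA-Z]+ matches any run of one or more characters which aren't upper case (A-Z) or lower case (a-z) letters, so a comma followed by a space counts as a single separator. The elements of the array will contain only the letter characters, not the spaces or punctuation. Note that because the string ends with a full stop, the array finishes with one empty element, which you may want to discard.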
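
It is worth checking exactly what split() returns here. The sketch below compares splitting on a single non-letter character with splitting on a run of them, and then throws away any empty elements. The names bySingle, byRun and wordList are invented for the example, and filter() is a standard array method which this lesson has not introduced.

```javascript
const s = "The dog, whose name is Fido, is here.";

// Each non-letter is a separate separator, so ", " leaves an empty element.
const bySingle = s.split(/[^a-zA-Z]/);
// A run of non-letters counts as one separator.
const byRun = s.split(/[^a-zA-Z]+/);
// Discard the empty element left by the final full stop.
const wordList = byRun.filter(function (word) {
  return word.length > 0;
});

console.log(bySingle.length); // 11 (three of the elements are empty)
console.log(byRun.length);    // 9 (the last element is empty)
console.log(wordList);        // the eight words, with no punctuation
```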

replace()

This method is used to replace one substring with another, but in this more advanced version, the thing to be replaced is specified as a regular expression. Remember to include g at the end of the regular expression if you want every occurrence of the substring replaced. Otherwise, only the first occurrence will be replaced (assuming it is found at all!)

There is something which is particularly useful with the replace() method - the idea of groups. A group allows you to give a number to part of the regular expression, so that the text it matched can be retrieved when a match is made. To specify a group within the regular expression, enclose the part of the regular expression within round brackets. For instance:

var myRegExpression = /[a-z](\w)x\s\s(\D)/;

This matches any lower case letter followed by any word character, followed by a lower case x, two white space characters and, lastly, any non-digit character. However, the unspecified word character is enclosed within round brackets, so it can be referred to as $1 (meaning the first group) in the second parameter of the replace() method. Similarly, the non-digit character at the end can be referred to as $2. Groups are numbered from $1 upwards in the order of their opening brackets ($1 to $9 for the first nine), and they are normally used as follows:

var years = "1990, 1991, 1992, 1993";
var poshYears = years.replace(/\b(\d{4})\b/,"$1 A.D.");

In this case, the \d{4} (which matches 4 digits) is enclosed within round brackets, so whatever 4 digits match that can be referred to in the second parameter of the replace() method as $1. In this case, the regular expression matches 1990 only (we forgot to put on g at the end) so 1990 is replaced by 1990 A.D. If we had written the second line as follows:

var poshYears = years.replace(/\b(\d{4})\b/g,"$1 A.D.");

then all the years would have been replaced, i.e. 1990 would become 1990 A.D. then 1991 would become 1991 A.D. etc. In each case, $1 corresponds to the 4 digits inside the regular expression (the \d{4} part) which matched.

Note that the original variable, years, is not altered in any way. The replace() method simply creates a new version of it with the replacement(s) made, for you to do with as you think fit (i.e. assign to another variable or display on the screen).
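
The following sketch shows groups at work. In JavaScript the first group is referred to as $1 in the replacement string, the second as $2, and so on. The "Smith, John" string and the variable names are invented for the illustration.

```javascript
const years = "1990, 1991, 1992, 1993";

// Without g, only the first year is changed.
const firstOnly = years.replace(/\b(\d{4})\b/, "$1 A.D.");
// With g, every year is changed.
const everyYear = years.replace(/\b(\d{4})\b/g, "$1 A.D.");
// Two groups can be put back in a different order.
const swapped = "Smith, John".replace(/(\w+), (\w+)/, "$2 $1");

console.log(firstOnly); // 1990 A.D., 1991, 1992, 1993
console.log(everyYear); // 1990 A.D., 1991 A.D., 1992 A.D., 1993 A.D.
console.log(swapped);   // John Smith
console.log(years);     // 1990, 1991, 1992, 1993 (the original is unchanged)
```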

search()

This is the regular expression equivalent of indexOf(). It is given a single parameter, which is the regular expression, and it searches for that within the string. It returns the position of the first occurrence of the regular expression, or -1 if it is not found.

var proverb = "People who live in glass houses shouldn't throw stones.";
var pattern = /\b\w{3}\b/;    // Match first three letter word
document.write(proverb.search(pattern));

This displays the number 7, as the first substring to match the pattern (i.e. the first three letter word) is "who" which occurs at position 7. Note that since search() only ever returns the position of the first occurrence, there is no point in putting g on the end of the regular expression.
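
The position returned by search() can be used with other string methods. In this sketch, slice() (a standard string method not covered in this lesson) picks out the three characters found, and a second search shows the -1 result. The variable names position, found and missing are invented for the example.

```javascript
const proverb = "People who live in glass houses shouldn't throw stones.";

// Position of the first three letter word
const position = proverb.search(/\b\w{3}\b/);
// The three characters starting at that position
const found = proverb.slice(position, position + 3);
// There are no digits in the proverb, so this search fails
const missing = proverb.search(/\d/);

console.log(position); // 7
console.log(found);    // who
console.log(missing);  // -1
```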

match()

This is similar to the search() method except that it returns an array of all the substrings that matched the regular expression. It doesn't keep a record of whereabouts in the string they were found:

var years = "1990, 1964, distant past 1842, future year 2108";
var findThem = /\d{4}/g;    // Match any four digit number
                            // (Note the "g" for global match)
var listOfYears = years.match(findThem);

This will create an array called listOfYears with four elements in it:

listOfYears[0] will contain 1990,
listOfYears[1] will contain 1964,
listOfYears[2] will contain 1842 and
listOfYears[3] will contain 2108.

Again, there is no need to declare listOfYears specifically as an array. The match() method returns an array, so listOfYears becomes an array automatically. If nothing in the string matches, however, match() returns null rather than an empty array.
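
Here is the same example as a sketch that can be run, together with one case where nothing matches. When there is no match at all, match() returns null, so it is safest to check for that before using the result as an array. The "no dates here" string and the names none and safeList are invented for the illustration.

```javascript
const years = "1990, 1964, distant past 1842, future year 2108";
const findThem = /\d{4}/g; // any four digits, global match

const listOfYears = years.match(findThem);
console.log(listOfYears.length); // 4
console.log(listOfYears[2]);     // 1842

// No four digit numbers here, so match() returns null.
const none = "no dates here".match(findThem);
// Fall back to an empty array so that length can still be used.
const safeList = none || [];
console.log(safeList.length); // 0
```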