Regular Expressions

Regular expressions (often shortened to “regex”) are widely used in applications that involve matching patterns in text.

The full documentation for Java’s regular expression syntax can be found in the Javadoc for the Pattern class. The tables below include a simplified subset of the full syntax.

This regular expression tutorial introduces the basics of writing Java code that uses regular expressions. Note that the focus of today’s lab is on regular expression syntax, not on writing Java code that uses regular expressions. The Java code will be provided for you.

Regular Expression Syntax by Example

The examples below illustrate Java-style regular expressions applied to the following string:

cat cat comic catatat catatonic rabbit caaat. mat

Read through each example and make sure you understand why the provided regex matches the indicated regions of the string.

Individual Characters

Most individual characters match themselves.

regex literal: "m"

cat cat comic catatat catatonic rabbit caaat. mat
          ^                                   ^
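
As a rough illustration of how the match positions marked above could be found from Java, the sketch below uses java.util.regex.Pattern and Matcher to collect the start index of each match. The class name LiteralMatchDemo and the helper matchStarts are invented for this example; the code provided with the lab may be organized differently.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the lab's provided code.
class LiteralMatchDemo {
    static final String TEXT = "cat cat comic catatat catatonic rabbit caaat. mat";

    // Collects the start index of every non-overlapping match of regex in text.
    static List<Integer> matchStarts(String regex, String text) {
        List<Integer> starts = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(text);
        while (matcher.find()) {          // find() advances to the next match
            starts.add(matcher.start());  // 0-based index of the match's first character
        }
        return starts;
    }

    public static void main(String[] args) {
        System.out.println(matchStarts("m", TEXT)); // prints [10, 46]
    }
}
```

The two indices correspond to the two caret positions in the diagram, counting from 0.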

Metacharacters

Some characters (“metacharacters”) have a special meaning when they appear in a regular expression. Here is the complete list:

<([{\^-=$!|]})?*+.>

To match any of these characters literally, we need to “escape” it by prefixing it with a \ character. Because \ is also the escape character in Java string literals, it must itself be doubled when the regular expression is written as a string literal, so the regex \. becomes the literal "\\.":

regex literal: "\\."

cat cat comic catatat catatonic rabbit caaat. mat
                                            ^
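
To make the effect of escaping concrete, the sketch below counts matches for an unescaped and an escaped dot. The class EscapeDemo and the helper countMatches are hypothetical names chosen for this example. Pattern.quote is a standard library method that builds a literal pattern from a string, which is one alternative to escaping by hand.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the lab's provided code.
class EscapeDemo {
    static final String TEXT = "cat cat comic catatat catatonic rabbit caaat. mat";

    // Counts the non-overlapping matches of regex in text.
    static int countMatches(String regex, String text) {
        Matcher matcher = Pattern.compile(regex).matcher(text);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Unescaped: . is a metacharacter and matches each of the 49 characters.
        System.out.println(countMatches(".", TEXT));                // prints 49
        // Escaped: \. matches only the literal period.
        System.out.println(countMatches("\\.", TEXT));              // prints 1
        // Pattern.quote produces a pattern that matches its argument literally.
        System.out.println(countMatches(Pattern.quote("."), TEXT)); // prints 1
    }
}
```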

Logical Operators

Construct    Description
XY           X followed by Y
X|Y          Either X or Y
(X)          X, as a group

regex literal: "caaat\\."

cat cat comic catatat catatonic rabbit caaat. mat
                                       ^----^
regex literal: "comic|caaat\\."

cat cat comic catatat catatonic rabbit caaat. mat
        ^---^                          ^----^
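
The sketch below shows how alternation and grouping interact, under the assumption that matches are collected with Matcher.find. The names AlternationDemo and allMatches are made up for illustration. Because | has the lowest precedence, "comic|caaat\\." means "comic" or "caaat."; wrapping the alternatives in a group makes the trailing period apply to both.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the lab's provided code.
class AlternationDemo {
    static final String TEXT = "cat cat comic catatat catatonic rabbit caaat. mat";

    // Returns the text of every non-overlapping match of regex in text.
    static List<String> allMatches(String regex, String text) {
        List<String> found = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(text);
        while (matcher.find()) {
            found.add(matcher.group()); // group() is the full matched substring
        }
        return found;
    }

    public static void main(String[] args) {
        // Either "comic", or "caaat" followed by a period.
        System.out.println(allMatches("comic|caaat\\.", TEXT));   // prints [comic, caaat.]
        // Either "comic" or "caaat", and then a period in both cases.
        System.out.println(allMatches("(comic|caaat)\\.", TEXT)); // prints [caaat.]
    }
}
```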

Character Classes

Bracket notation can be used to create sets such that any character in the set will be considered a match.

Construct    Description
[abc]        a, b, or c (simple class)
[^abc]       Any character except a, b, or c (negation)
[a-zA-Z]     a through z, or A through Z, inclusive (range)

regex literal: "[cm]at"

cat cat comic catatat catatonic rabbit caaat. mat
^-^ ^-^       ^-^     ^-^                     ^-^
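
For comparison, the sketch below runs a simple class and a negated class against the same string. AlternationDemo-style helpers are not assumed to exist; the class CharClassDemo and the helper allMatches are hypothetical and defined here only for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the lab's provided code.
class CharClassDemo {
    static final String TEXT = "cat cat comic catatat catatonic rabbit caaat. mat";

    // Returns the text of every non-overlapping match of regex in text.
    static List<String> allMatches(String regex, String text) {
        List<String> found = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(text);
        while (matcher.find()) {
            found.add(matcher.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // c or m, followed by "at".
        System.out.println(allMatches("[cm]at", TEXT)); // prints [cat, cat, cat, cat, mat]
        // Any single character other than c, followed by "at".
        System.out.println(allMatches("[^c]at", TEXT)); // prints [tat, tat, aat, mat]
    }
}
```

Note that the negated class still consumes one character, so "[^c]at" always matches three characters, never a bare "at".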

Predefined Character Classes

Construct    Description
.            Any character
\d           A digit: [0-9]
\D           A non-digit: [^0-9]
\s           A whitespace character: [ \t\n\x0B\f\r]
\S           A non-whitespace character: [^\s]
\w           A word character: [a-zA-Z_0-9]
\W           A non-word character: [^\w]

regex literal: "...m"           (any three characters followed by m)

cat cat comic catatat catatonic rabbit caaat. mat
       ^--^                                ^--^
regex literal: "\\w"            (any word character)

cat cat comic catatat catatonic rabbit caaat. mat
^^^ ^^^ ^^^^^ ^^^^^^^ ^^^^^^^^^ ^^^^^^ ^^^^^  ^^^
regex literal: "\\W"            (any non-word character)

cat cat comic catatat catatonic rabbit caaat. mat
   ^   ^     ^       ^         ^      ^     ^^
regex literal: "\\s"            (any whitespace character)

cat cat comic catatat catatonic rabbit caaat. mat
   ^   ^     ^       ^         ^      ^      ^
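
One way to see what a predefined class matches is to replace every match with something visible. The sketch below does this with String.replaceAll, which takes a regular expression as its first argument. The class PredefinedClassDemo and its two helper methods are hypothetical names used only for this example.

```java
// Hypothetical demo class, not part of the lab's provided code.
class PredefinedClassDemo {
    static final String TEXT = "cat cat comic catatat catatonic rabbit caaat. mat";

    // Replaces each whitespace character (\s) with an underscore.
    static String markWhitespace(String text) {
        return text.replaceAll("\\s", "_");
    }

    // Deletes each non-word character (\W), leaving only letters, digits, and _.
    static String stripNonWord(String text) {
        return text.replaceAll("\\W", "");
    }

    public static void main(String[] args) {
        System.out.println(markWhitespace(TEXT));
        // prints cat_cat_comic_catatat_catatonic_rabbit_caaat._mat
        System.out.println(stripNonWord(TEXT));
        // prints catcatcomiccatatatcatatonicrabbitcaaatmat
    }
}
```

The period survives the first call but not the second, because it is a non-word character that is not whitespace.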

Quantifiers

Construct    Description
X?           X, once or not at all
X*           X, zero or more times
X+           X, one or more times
X{n}         X, exactly n times
X{n,}        X, at least n times
X{n,m}       X, at least n but not more than m times

regex literal: "c(at)*"         (c followed by zero or more copies of at)

cat cat comic catatat catatonic rabbit caaat. mat
^-^ ^-^ ^   ^ ^-----^ ^---^   ^        ^
regex literal: "c(at)+"         (c followed by one or more copies of at)

cat cat comic catatat catatonic rabbit caaat. mat
^-^ ^-^       ^-----^ ^---^
regex literal: "c(at){3}"       (c followed by exactly three copies of at)

cat cat comic catatat catatonic rabbit caaat. mat
              ^-----^
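
The sketch below tries a few of the remaining quantifier forms on the same string. It also shows that a quantifier applies to the single item before it: in "ca+t" only the a is repeated, while in "c(at){2,3}" the whole group is. The class QuantifierDemo and the helper allMatches are hypothetical names for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the lab's provided code.
class QuantifierDemo {
    static final String TEXT = "cat cat comic catatat catatonic rabbit caaat. mat";

    // Returns the text of every non-overlapping match of regex in text.
    static List<String> allMatches(String regex, String text) {
        List<String> found = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(text);
        while (matcher.find()) {
            found.add(matcher.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // c, one or more a, then t: + applies only to the a.
        System.out.println(allMatches("ca+t", TEXT));       // prints [cat, cat, cat, cat, caaat]
        // c, at least two a, then t.
        System.out.println(allMatches("ca{2,}t", TEXT));    // prints [caaat]
        // c followed by two or three copies of the group "at".
        System.out.println(allMatches("c(at){2,3}", TEXT)); // prints [catatat, catat]
    }
}
```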

Boundary Matchers

Boundary matchers restrict where a match can be made.

Construct    Description
^            The beginning of a line
$            The end of a line
\b           A word boundary

regex literal: "^cat"

cat cat comic catatat catatonic rabbit caaat. mat
^-^
regex literal: "\\bcat\\b"             (the word cat, but not catatonic etc.)

cat cat comic catatat catatonic rabbit caaat. mat
^-^ ^-^
regex literal: "\\b\\w*ic"             (a word, or the leading part of one, ending in ic)

cat cat comic catatat catatonic rabbit caaat. mat
        ^---^         ^-------^
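
The sketch below compares patterns with and without boundary matchers. The class BoundaryDemo and the helper allMatches are hypothetical, and the second string, "icing picnic", is made up purely to show what a trailing \b changes; it does not come from the lab files.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the lab's provided code.
class BoundaryDemo {
    static final String TEXT = "cat cat comic catatat catatonic rabbit caaat. mat";
    static final String EXTRA = "icing picnic"; // invented test string

    // Returns the text of every non-overlapping match of regex in text.
    static List<String> allMatches(String regex, String text) {
        List<String> found = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(text);
        while (matcher.find()) {
            found.add(matcher.group());
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(allMatches("cat", TEXT).size());       // prints 4
        System.out.println(allMatches("\\bcat\\b", TEXT).size()); // prints 2
        System.out.println(allMatches("mat$", TEXT));             // prints [mat]

        // Without a trailing \b, the match may stop partway through a word.
        System.out.println(allMatches("\\b\\w*ic", EXTRA));       // prints [ic, picnic]
        // With a trailing \b, the word must end right after "ic".
        System.out.println(allMatches("\\b\\w*ic\\b", EXTRA));    // prints [picnic]
    }
}
```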

Exercises

Download the following files:

Open BleakHouse.txt and numbers.txt in a text editor to get a feel for their contents.

Complete the unfinished methods in RegexExercises.java. Use SearchDriver.java to experiment with the regular expressions as you develop each one. Submit your finished version of RegexExercises.java through Autolab.