Friday, April 15, 2011

Regular Expressions and State Machines

I recently found out that some of my colleagues did not know (much) about regular expressions or state machines.  These are amazingly useful concepts that I use on a daily basis, so I thought I'd try to set down everything you need to know about these subjects in one page (or thereabouts).

Regular Expressions for Searching

The most common use of regular expressions is for searching through text files.  For example, if I wanted to search my source code for all class definitions that include the word 'Collection' in their class names, I would search on the following regular expression: 'class .*Collection'.  This is read as "the word class, followed by a space, followed by zero or more of (*) any character (.), followed by the word Collection".  That is, most characters are interpreted literally in a regular expression, but some are treated specially (e.g., '.' and '*').

Here's the typical core grammar of regular expressions (I use a, b, c to denote non-special characters and x, y, z to denote arbitrary regular expressions):
  • a -- the regular expression matching the (non-special) character 'a';
  • xy -- the regular expression matching x followed by y;
  • (x) -- the regular expression matching x (it is sometimes useful to treat a complex regular expression as a single item);
  • x|y -- the regular expression matching x or y;
  • x* -- the regular expression matching zero or more instances of x;
  • nil -- the regular expression matching the empty string (which in practice is just written as the empty string, rather than 'nil').
So, for example, the regular expression '(aa|bb)*c' matches any sequence of aa or bb pairs, followed by a c.  That is, the language recognised by '(aa|bb)*c' consists of the strings {c, aac, bbc, aaaac, aabbc, bbaac, bbbbc, ...}.

In practice, the language of regular expressions is extended with plenty of syntactic sugar.  Here are some standard extensions:
  • x+ -- the regular expression matching one or more instances of x (i.e., x+ = xx*);
  • x? -- the regular expression matching zero or one instance of x (i.e., x? = (x|nil);
  • [abc] -- the regular expression matching any of the characters a, b, or c;
  • [a-f] -- the regualar expression matching any of the characters in the range a..f;
  • [^abc] -- the regular expression matching any character that is not a, b, or c;
  • [^a-f] -- the regular expression matching any character that is not in the range a..f;
  • ^ -- the regular expression matching the start of a line of text;
  • \$ -- the regular expression matching the end of a line of text;
  • \a -- the regular expression matching the escape sequence a (e.g., \* matches a literal *, \t matches a tab character, \( matches an open parenthesis, etc.)
There are many more (see, e.g., http://msdn.microsoft.com/en-us/library/az24scfc(v=VS.100).aspx for the .NET 4 regular expression language).

Learning these is easy and really, really useful!

State Machines

Imagine a machine with a set of lights.  Each light is called a state and the machine is called a state machine.  When the machine is started, it always starts with a particular light on (all other lights are off); this is the start state.

The machine takes as input a sequence of symbols of some type (e.g., characters, GUI events, messages, whatever you like.).  As each symbol is read by the machine, which light is shining changes depending on which light was shining just before the new symbol was read and on what particular symbol was just read in.  For example, if the machine is in state 3, say (i.e., light number three is on) and the next symbol read in is a 'q', then light 3 might turn off and light 7 might come one (i.e., reading a 'q' when in state 3 causes the machine to change to state 7).  The machine has a transition rule like this for every possible input for every possible state.

Each of the lights on the machine is either red or green.  The green lights denote accepting states and the red lights denote non-accepting states.  If the input sequence finishes with the machine showing a green light, then the input sequence was accepted by the state machine.  Otherwise the input sequence was not accepted.

The language accepted by the state machine is the set of input sequences that leave it in an accepting state.  In fact, every regular expression has a corresponding state machine and vice versa -- they are two ways of describing the same thing!

For example, here's a state machine to match our earlier pattern (aa|bb)*c:

State Input Next state
s1 [start] a s2
b s3
c s4
(...) s5
s2 a s1
(...) s5
s3 b s1
(...) s5
s4 [accepting] (...) s5
s5 [error] (...) s5

Observe the inclusion of an error state (s5) which, like the Hotel California, you can never leave. This state just ensures that the state machine never enters an accepting state once the input has differed from the pattern we are trying to match.

[Aside: the fancy name for what I've just described is a deterministic finite automaton or DFA.]

Deterministic and Non-deterministic State Machines

The state machine described above is deterministic, meaning it is only ever in one state and each state transition is defined entirely by the current state and the next symbol to be read.  This is a very useful notion for designing programs that have to process sequences.

One can also imagine a nondeterministic state machine.  Such a machine would works like this: at any point, any collection of lights can be on (i.e., the machine could be in any one of these states given the input so far). The transition rules are relaxed so that each state can have multiple possible successors for each input, plus have the option of having successor states that can be reached without consuming any input at all.  For example, the transition rules for state 3 might say "move to state 7 after reading a 'q' or move to state 8 after reading a 'q' or move to state 9 without reading anything".  So, for our non-deterministic state machine, if light 3 is one of the lights that is currently on, then light 9 must also be on (because it is a "no input" successor to 3); also, if a 'q' is read next, then both lights 7 and 8 will come on.

Why is this idea useful?  Well, here's the magic: you can always turn a non-deterministic state machine into a deterministic state machine (how this magic happens is a little beyond the scope of this article, but the starting point is to imagine every possible subset of lights on the non-deterministic state machine as corresponding to a single state on the equivalent deterministic state machine).

It's easy to see how to construct a deterministic state machine that handles simple sequences of symbols or repetition of simple sequences, the tricky bit is handling alternation (i.e., regular expressions of the form x|y).  The easiest way of constructing a state machine to match x|y is to construct a non-deterministic state machine from the two state machines for recognising x and y (hopefully you can imagine how this might be done without too much difficulty), then we can turn that into a deterministic state machine.

In practice, textual regular expressions are typically just compiled into representations of their non-deterministic state machines and the matching algorithm simulates all possible state transitions at each point.  This is actually quite effective, since the cost of searching a string of m characters with an n state machine is just O(mn) (and typically much less).

What Can't Regular Expressions Match?

Roughly speaking, regular expressions can't match any pattern that requires counting to arbitrarily large numbers.  For example, you can't write a regular expression that checks whether the parentheses ( ) in a program are properly balanced, because a program can contain any number of parentheses.  

[What follows is for extra credit.]

There is something called the pumping lemma, which says that for any state machine with n states that accepts some input of length n or more symbols, then that input can be written somehow as xyz (where y is a non-empty sequence and the length of xy is at most n) and the machine must also accept all patterns of the form xy*z (i.e., the machine had to go through at least one loop while recognising the y part in xyz).  

How does this relate to counting patterns?  Well, let's say we did claim to have a state machine with 4 states that matched all and only programs with properly balanced parentheses.  Now consider the balanced input sequence '((((a))))', which our state machine duly accepts.  The pumping lemma tells us that this must be decomposable as some xyz where y is non-empty and the length of xy is at most 4.  Well, we only have four possibilities for x and y here -- it doesn't matter which one we end up with.  Let's say the answer happens to be x = '(((' and y = '(' -- leaving z = 'a))))'.  The pumping lemma says that our machine must accept all strings of the form xy*z, which clearly includes just xz, namely '(((a))))', which is clearly not balanced!  Therefore our claim must be wrong: our 7 state machine does not match all and only well balanced programs.  Since there is nothing in our argument that really depends on the 4 states starting point, the same thing must be true for any number of states that a parenthesis-matching state machine might have.  Therefore, there is no regular expression (or state machine) that matches all and only well balanced programs.  If you don't believe me, I encourage you to have a go at designing such a state machine!

I find all this stuff really rather beautiful.

What About Counting Support in Modern Regular Expressions?


Modern textual regular expressions usually include some support for counting, such as x{40,70} meaning "between 40 and 70 instances of the pattern x".  This functionality comes at a cost: since the size of the equivalent ordinary regular expressions for these "counting" patterns quickly grows out of hand, they are implemented using backtracking pattern matchers.  These are exponentially slower than non-deterministic state machines and it is easy to construct pathological search patterns that will cause these matching algorithms to take forever.

Further Reading


Here is a brilliant paper on implementing regular expression matchers.