Home » » Tokenization vs. Eager Regular Expressions

Tokenization vs. Eager Regular Expressions

We now have regular-expression based chunkers and tokenizers in LingPipe. They work by compiling a regex using java.util.regex.Pattern and then running a java.util.regex.Matcher’s find()
method over the inputs and pulling out the matches. Recently, I (Bob)
have been tuning part-of-speech taggers for French trained on the French Treebank.

I thought writing a regex-based tokenizer would make sense to match
the way the treebank itself tokenized as best as possible.
Unfortunately, it’s impossible to write a pure tokenizer to match
because, like the Penn Treebank and Penn BioIE projects, the annotators
decided to use syntactic and semantic contextual information in
deciding when to group a sequence of characters into a token. For
instance, hyphens after prefixes (a lexical syntactic issue) are coded
one way and hyphens before suffixes another; in the Penn Treebank,
periods at the end of sentences (a contextual decision) are coded as
separate tokens whereas those appearing sentence internally are coded
as part of the token they follow.

It turns out that regular expressions don’t work the way I thought
they would. I wanted to write a bunch of regexes and then or (|) them together to produce a larger regular expression that would greedily match as much as it could in each chunk.

Let’s consider a simple example:


And consider running a find against the string "aacac". What do we get? Let’s ask Java.

import java.util.regex.Pattern;
import java.util.regex.Matcher;

public class Regex {

public static void main(String[] args) {
test("(a|b)+|(a|c)+", "aacac");

static void test(String regex, String input) {
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(input);
System.out.println("regex=" + regex
+ " input=" + input
+ " found=" + matcher.group());


This just sets up the pattern based on the regex, generates a matcher
from the input and then runs find on the matcher and prints out the
first thing found. What’ll it do?

c:\carp\temp>javac Regex.java

c:\carp\temp>java -cp . Regex
regex=(a|b)+|(a|c)+ input=aacac found=aa

Ouch. I was expecting the whole thing to match. Apparently, that’s what a POSIX regex
would do. But Java follows the Perl model, in which eagerness overcomes
greediness. Specifically, if the first disjunct of an alternation
matches, the matcher does not try to find a longer match in the second
disjunct. Unfortunately, there are no greediness/reluctance modifiers
for the disjunct.

So what do we do? Refactor the regex, of course. How about this one?


This should do the trick of matching the longest possible sequence
of alternating as and bs or alternating as and cs. Sure enough, adding
the following to the main() method:

    test("a*(b(a|b)*|c(a|c)*)?", "aacac");

produces the expected output:

regex=a*(b(a|b)*|c(a|c)*)? input=aacac found=aacac


Popular Posts