Home » » Chain Conditional Random Fields: Implementation and Design Issues

Chain Conditional Random Fields: Implementation and Design Issues


Sent to you by Jeffye via Google Reader:


via LingPipe Blog by lingpipe on 6/18/09

I'm starting to implement conditional random fields (CRF) for LingPipe. (See, I really don't hate CRFs.) If you don't know CRFs, here are three good tutorials:

Implementing CRFs brings up several design issues.

N-gram Order

Like HMMs, chain CRFs extract features over the sequence of tags. The issue is whether the features can span more than a single pair of tags (a bigram). For instance, can I use the previous label and label two back to help model the current label?

Although it'd be nice to implement everything very generally for arbitrary n-gram contexts, it's actually a pain in practice in that everything gets deeper loops and allocating all the arrays is hard to do generically in Java with arrays. And then extracting answers with confidence and doing n-best also gets more complex.

I'm also not sure that any order beyond first is going to be tractable enough for training.

Feature Extraction

I plan to do the usual factoring of features into node features (depend only on inputs) and edge features (depend on input and also previous category). Even though edge features are more general, it's much more efficient to implement the usually much more prevalent node features independently (it all goes through the forward-backward algorithm). Nodes and edges have the following interfaces:

 interface ChainCrf.Node {     int position();     String[] words(); } 


 interface ChainCrf.Edge  {     String previousTag();     int position();     String[] words(); } 

Then we have two feature extractors required to estimate a CRF from a corpus of labeled tag data:

  • FeatureExtractor<ChainCrf.Edge>, and
  • FeatureExtractor<ChainCrf.Node>.

One or More Coefficient Vectors?

Given the feature extraction, I have to follow our logistic regression model in having different coefficient vectors per category rather than a single coefficient vector that essentially concatenates all the individual vectors. Having different vectors makes feature extraction much more economical. In the "standard" CRF presentation, a separate feature extraction is done for each possible tag, both for nodes and edges. That does allow some richer features (e.g. using a feature for labels that don't change entity type, such as B_PERSON to I_PERSON), and also allows features to be structural zeroes for some tags.

Note that the output tag is implicit in the proposed feature extraction, which means it's always included to distinguish both node and edge features across output tags.

K-1 Coefficient Vectors vs. K coefficient Vectors

That leaves the issue of whether to use (K-1) coefficient vectors (where K is the number of tags), as you see in typical statistics presentations of logistic regression. The first advantage is one less vector to store, and even more importantly, one fewer vector to update. The second is a system with an identified answer even for Laplace priors or maximum likelihood.

The alternative is to use K coefficient vectors, as you see in the typical "machine learning" presentation of CRFs or logistic regression. I'm planning to go with the stats approach, as I did for our logistic regression.

Order of Coefficient Vectors and Sparsity

There's an issue of how to store the coefficient matrix vector, either tag dominant or feature dominant. That is, do we have a map from features to categories to numerical values, or from categories to features to numerical values. The latter is the natural way to do it if you look at the typical definitions. The former way would be useful if many features are non-zero for half the categories or fewer. I don't think that'll happen much given the way the regressions work, so I'm planning to go with (features to categories to numerical values).

I pretty much have to use sparse vectors for the edge and node features. But what about the model's regression coefficients? I'll be taking the coefficient vector's dot-product with the sparse edge and node features. That's much more efficient for dense coefficient vectors. I could save some space for really big models, but I think the time's going to be more precious here. Luckily, this implementation detail will be easy to change under the hood if I change my mind later.

Arithmetic Precision

We won't be able to estimate coefficients with more accuracy than is representable as a float. So for storing the models, floats make most sense. The problem is that at run time, we need to do double-precision arithmetic for decoding (in all three cases, Viterbi first best, Viterbi/A* n-best, or full forward-backward marginal decoding). It's slower to cast floats to doubles every time you do a multiply, so I'll probably keep the coefficients dense and doubles as long as they'll fit. If they won't fit, I'll probably switch the implementation to sparse floats. It'll all be transparent and backward compatible at the API level if I make the switch after the first release.

Start/End Tags

For HMMs, I added a BEGIN-OF-SENTENCE tag and an END-OF-SENTENCE tag. The start tag is not generated, but used as context to generate the first word in the sentence. The end-of-sentence tag is generated, because it's required to get the normalizations right.

For CRFs, it seems to make most sense to include a BEGIN-OF-SENTENCE tag that doesn't have an estimated coefficient, but is just used in defining the edge features for the first token in a sentence. The end marker is actually derivable from the node (word-only) features. So is the begin marker, but it seems simpler this way than to have a null value passed in to the edge feature extractor.

Stochastic Gradient Descent

I'm not even considering other implementations. I'll parameterize it the same way as for logistic regression. This means I'll allow coefficient priors (Laplace, Gaussian, Cauchy, with the option of non-zero means and different variances/scales per dimension). I'll also allow an annealing schedule.

SGD Code Re-use

A serious question is whether I should try to re-use the SGD code in logistic regression. The basic algorithm's the same, only now I have to do forward-backward in the inner loop to compute the gradient of the log likelihood function. I'm thinking I won't share the code, but that's mainly because the refactoring to something with the right callbacks in the inner loop looks daunting.

Forward/Backward Decoding and N-best

I need to do forward/backward for training, not just confidence-based extraction. What I'd like to do is provide a common format for the lattice output I can share between HMM taggers and CRFs.

I've already defined a new package com.aliasi.tag that has three interfaces for tagging: first-best, n-best (joint or conditional probabilities), and marginal (per tag) taggers matching those for HMMs.

Relation to Chunking

I'd also like to provide a general tagger to chunker implementation, which means more refactoring, but will immediately let me use CRFs to do named-entity extraction with n-best and confidence-based (per entity mention) extractions. I'd very much like to refactor the CharLmHmm chunker to rely on a generic tagger. This should be doable.

Threading and/or File Compilation

Unfortunately, storing in memory all of the features for an input task the size of the Penn Treebank POS data (about 50 tags and 1M words) is prohibitive. There are 50M edge feature extractions and 1M node feature extractions. Even if the edge feature extractions are tiny (say 40 bytes), that's a lot of memory; if we add in previous tag plus word features, it won't fit in memory even on a big workstation. If there are hundreds of word features per node, which is conservative if we use 5-word contexts, n-grams and subword features, and represent them tightly, we're still looking at several KB/vector and thus several GB of memory.

So I'm inclined to do the feature extraction online. This is making me queasy, though, because it'll be very expensive if I do it naively in a single thread (even measured relative to forward/backward, perhaps dominating training time).

Alternatively, I can run feature extraction in one thread and gradient computation in another. Then we probably won't notice the feature extraction if there are enough CPUs free.

The other alternative, which matches what may CRF packages do, is to require all of the feature vectors for the training corpus to be serialized. Then they can be read back in much more efficiently than creating them from scratch. It won't take that much disk space, but the requirement will be nontrivial (perhaps as much as 20 or 30GB with Treebank POS size corpora and subword features).

Generic Inputs

There's really nothing word-specific about CRFs. In fact, they could take arbitrarily structured inputs instead of word tokens. This would require generifying the CRF interface to the type of objects being tagged, and similarly generifying feature extraction and the lattice representations. I could probably get away then with having the HMMs just instantiate the generic to String.

I may just go ahead and do this once it's implemented for strings. It'll probably be easier to refactor a string-based implementation than to start with something more generic. (Thanks again to those Java engineers for making things so easy to modify with backward compatibility.)


Things you can do from here:



Popular Posts