I'm designing a conditional random field (CRF) package for LingPipe, so I've been reading through the literature on CRFs, most of which is expressed in terms of maximum entropy. I can't recommend Ryan McDonald's tutorial slides highly enough:
- McDonald, Ryan. 2007. Generalized Linear Classifiers in NLP. Course presented to the Swedish Graduate School in Language Technology.
Among other things, Ryan shows how the maximum entropy estimate subject to expectation matching constraints yields the same unique solution as the maximum likelihood estimate (where the derivative of the error shows the max likelihood estimate matches expectations). But I'm not going to harp on terminology or philosophy today; instead, I'll focus on the more immediate practical problems of feature extraction and coefficient encoding.
K vs. K-1 Coefficient Vectors?
Consider multinomial classifiers with K outcome categories c[1],…,c[K] and n-dimensional vector inputs x. In logistic regression, as typically done in stats, you have K-1 n-dimensional coefficient vectors β[1],…,β[K-1], and the probability distribution over categories given an input vector x is defined so that p(c[k]|x) is proportional to exp(β[k] * x) for 1 <= k <= K-1, and p(c[K]|x) is proportional to exp(0 * x) = 1.
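As a minimal sketch of the K-1 encoding just described (the function name and example values here are hypothetical, not from any package):

```python
import math

def softmax_probs_k_minus_1(betas, x):
    """Category probabilities under the (K-1)-vector encoding:
    the K-th coefficient vector is implicitly pinned to zero."""
    # Linear scores for categories 1..K-1.
    scores = [sum(b_i * x_i for b_i, x_i in zip(beta, x)) for beta in betas]
    scores.append(0.0)  # beta[K] pinned to the zero vector: exp(0 * x) = 1
    z = sum(math.exp(s) for s in scores)
    return [math.exp(s) / z for s in scores]

# Hypothetical example: K = 3 categories, n = 2 input dimensions.
probs = softmax_probs_k_minus_1([[0.5, -1.0], [2.0, 0.3]], [1.0, 2.0])
print(probs)       # three probabilities, one per category
print(sum(probs))  # sums to 1 (up to floating point)
```

Passing K-1 coefficient vectors but getting back K probabilities is exactly the asymmetry the K-vector encoding removes.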
Sometimes, you see K coefficient vectors, with p(c[K]|x) proportional to exp(β[K] * x) just like for the other categories. For instance, Dave Lewis and David Madigan et al.'s package Bayesian Multinomial Regression (BMR) lets you choose K or K-1 coefficient vectors; LingPipe uses the K-1 encoding. Dave (Lewis) told me they included the K-vector approach because their users found it easier to interpret than pinning the last coefficient vector to 0.
The problem with the K coefficient vector approach is that the parameters are not identified in the statistical sense. Basically, if you subtract β[k] from all the coefficient vectors (leaving β[k] = 0), you get exactly the same predictions. Zero-centered (or other) priors can identify unique solutions that minimize coefficient values. A Gaussian prior (aka L2 regularization) always leads to a unique solution; a Laplace prior (L1 regularization) might not. Imagine a very simple binary (outcomes K = 2) single regression (dimensions n = 1), with β[1] = 1.0 in the K-1 coefficient coding, which implicitly sets β[2] = 0.0. For an input x, we get
p(c[1]|x) = exp(x * 1.0) / ( exp(x * 1.0) + exp(x * 0.0) )
p(c[2]|x) = exp(x * 0.0) / ( exp(x * 1.0) + exp(x * 0.0) ).
We get the same results with the K-vector coefficient encoding, taking β[1] = 0.5 and β[2] = -0.5, namely
p(c[1]|x) = exp(x * 0.5) / ( exp(x * 0.5) + exp(x * -0.5) )
p(c[2]|x) = exp(x * -0.5) / ( exp(x * 0.5) + exp(x * -0.5) )
(To see the equivalence, multiply the numerator and denominator of the second set of equations by exp(x*0.5). )
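The equivalence is easy to check numerically. A minimal sketch (the input value x = 0.7 is an arbitrary choice):

```python
import math

def binary_probs(beta1, beta2, x):
    """p(c[1]|x), p(c[2]|x) for a one-dimensional, two-category model."""
    e1, e2 = math.exp(x * beta1), math.exp(x * beta2)
    return e1 / (e1 + e2), e2 / (e1 + e2)

x = 0.7
# K-1 encoding: beta[1] = 1.0, beta[2] implicitly pinned to 0.0.
p_k_minus_1 = binary_probs(1.0, 0.0, x)
# K-vector encoding shifted by -0.5: beta[1] = 0.5, beta[2] = -0.5.
p_k = binary_probs(0.5, -0.5, x)
print(p_k_minus_1, p_k)  # identical distributions
```

Shifting both coefficients by any constant leaves the probabilities untouched, which is exactly the non-identifiability described above.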
Coefficients of 0.5 and -0.5 are more likely under a zero-centered Gaussian prior (L2 regularization) than coefficients of 1.0 and 0.0, but the two have the same likelihood under a Laplace prior (L1 regularization).
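To make the prior comparison concrete, here is a tiny sketch of the two penalties on the coefficient pairs above (smaller penalty = more probable under the corresponding zero-centered prior):

```python
def l2_penalty(betas):
    # Negative log density of a zero-centered Gaussian, up to scaling.
    return sum(b * b for b in betas)

def l1_penalty(betas):
    # Negative log density of a zero-centered Laplace, up to scaling.
    return sum(abs(b) for b in betas)

print(l2_penalty([1.0, 0.0]), l2_penalty([0.5, -0.5]))  # 1.0 vs. 0.5
print(l1_penalty([1.0, 0.0]), l1_penalty([0.5, -0.5]))  # 1.0 vs. 1.0
```

The L2 penalty strictly prefers the centered pair (0.5, -0.5), so the Gaussian prior picks out a unique solution; the L1 penalty is indifferent between the two, so the Laplace prior leaves the tie unbroken.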
Sampling approaches to modeling the posterior distribution p(θ|X) are more problematic, because the usual measures of Gibbs sampler convergence (e.g. R hat) won't work. The usual solution is to monitor the K-1 dimensional transform, because that is identified.
Using a Single Coefficient Vector
What always throws me when I read max entropy presentations is that they don't use K or K-1 vectors, they use a single vector, with a different kind of feature extraction. Specifically, there's a different input feature vector for each category, traditionally written φ(c[k],e) for input e and category c[k]. Recall that the usual setup in logistic regression is to use a single feature extractor ψ(e) that is independent of category (the value of ψ(e) is the input vector x we used above before talking about feature extraction). Now with K vectors for input e, φ(c[1],e) through φ(c[K],e), we have a single coefficient vector β and we model probabilities p(c[k]|e) as being proportional to φ(c[k],e) * β.
It's easy to code the usual logistic setup in the max entropy pattern. Just set β = (β[1],…,β[K]), the concatenation of the K coefficient vectors, and set φ(c[k],e) to (0,…,0,ψ(e),0,…,0), with ψ(e) in the k-th block. Going the other way around doesn't work, because there may be features extracted by a more general φ that can't be replicated in the usual logistic setup.
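A minimal sketch of that block encoding (the helper names and example values are my own, not from any package):

```python
def phi(k, psi_e, K):
    """Hypothetical block encoding: psi(e) in the k-th of K blocks, zeros elsewhere."""
    n = len(psi_e)
    vec = [0.0] * (K * n)
    vec[(k - 1) * n:(k - 1) * n + n] = psi_e
    return vec

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Concatenated coefficient vector beta = (beta[1], beta[2], beta[3]), with n = 2.
beta_blocks = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
beta = [b for block in beta_blocks for b in block]
psi_e = [1.0, 2.0]  # hypothetical psi(e)

print(phi(2, psi_e, 3))              # [0.0, 0.0, 1.0, 2.0, 0.0, 0.0]
print(dot(phi(2, psi_e, 3), beta))   # 0.3*1.0 + 0.4*2.0 = 1.1
```

The dot product φ(c[k],e) * β picks out exactly β[k] * ψ(e), so the single-vector max entropy model reproduces the K-vector logistic scores.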
I asked Hal Daumé III when he was in town if people used this flexibility in the models. He said usually not, but he did mention one case he'd seen, which is in first-order CRFs, where the conditioning is on the previous category as well as the entire input string. He said he'd seen people code up the feature "my tag is the same as the previous tag", which is not something you can code directly in the logistic setting. There, you can only get "is my tag T and is the previous tag T" duplicated for every tag T.
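A sketch of the two feature encodings Hal described (the tag set and function names are hypothetical):

```python
def same_as_prev(prev_tag, cur_tag):
    # Single shared feature with one coefficient: fires whenever the
    # current tag matches the previous one, whatever that tag is.
    return 1.0 if prev_tag == cur_tag else 0.0

def per_tag_same(prev_tag, cur_tag, tagset):
    # Logistic-style encoding: a separate "prev is T and cur is T"
    # feature (and coefficient) for each tag T in the tag set.
    return {t: 1.0 if prev_tag == t == cur_tag else 0.0 for t in tagset}

tags = ["DET", "NN", "VB"]
print(same_as_prev("NN", "NN"))        # 1.0
print(per_tag_same("NN", "NN", tags))  # only the "NN" feature fires
```

The shared feature ties one coefficient across all tags, while the per-tag version has to learn K separate coefficients for what is conceptually a single generalization.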
Are there other examples?
Is it worth implementing the more general feature extraction case?