# How Fat is Your (Prior’s) Tail?

This post is from the LingPipe Blog.

What kind of prior should you use for logistic regression (aka maximum entropy) coefficients? The common choices are the Laplace prior (aka L1 regularization, aka double exponential prior, aka the lasso), the Gaussian prior (aka L2 regularization, aka normal, aka ridge regression), and more recently, the Cauchy prior (aka Lorentz, aka Student-t with one degree of freedom).

These are all symmetric priors, and we typically use zero mean distributions as priors (or zero median in the Cauchy, which has no mean because the integral diverges). The effect is that there’s a penalty for larger coefficients, or looked at the other way, the priors shrink the parameters.
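To make the "penalty" reading concrete, here is a minimal sketch of the three priors written as negative log densities with the additive constants dropped. The function names and the default scale parameters are illustrative assumptions, not anything taken from LingPipe.

```python
import math

def gaussian_penalty(beta, sigma=1.0):
    """Negative log density of a zero-mean Gaussian, constants dropped (L2)."""
    return beta ** 2 / (2.0 * sigma ** 2)

def laplace_penalty(beta, b=1.0):
    """Negative log density of a zero-mean Laplace, constants dropped (L1)."""
    return abs(beta) / b

def cauchy_penalty(beta, scale=1.0):
    """Negative log density of a zero-median Cauchy, constants dropped."""
    return math.log1p((beta / scale) ** 2)

# Print the penalty each prior charges for coefficients of growing size.
print("  beta    gaussian     laplace      cauchy")
for beta in (0.0, 0.5, 1.0, 5.0, 50.0):
    print(f"{beta:6.1f}  {gaussian_penalty(beta):10.3f}  "
          f"{laplace_penalty(beta):10.3f}  {cauchy_penalty(beta):10.3f}")
```

All three penalties are zero at zero and grow with the size of the coefficient, but at very different rates: quadratically for the Gaussian, linearly for the Laplace, and only logarithmically for the Cauchy.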

The main difference between the priors is how fat their tails are. For a given variance, the Laplace prior puts more mass near zero and has fatter tails than the Gaussian. The Cauchy, which has no finite variance at all, has much fatter tails than either.
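A quick numeric check of tail mass may help here. The sketch below computes the two-sided tail probability P(|X| > t) in closed form for each distribution. The Gaussian and Laplace are matched to unit variance. The Cauchy has no variance to match, so its scale of 1 is an arbitrary assumption made only for illustration.

```python
import math

def gaussian_tail(t, sigma=1.0):
    """P(|X| > t) for X ~ Normal(0, sigma^2)."""
    return math.erfc(t / (sigma * math.sqrt(2.0)))

def laplace_tail(t, b=1.0):
    """P(|X| > t) for X ~ Laplace(0, b), whose variance is 2 * b^2."""
    return math.exp(-t / b)

def cauchy_tail(t, scale=1.0):
    """P(|X| > t) for X ~ Cauchy(0, scale)."""
    return 1.0 - (2.0 / math.pi) * math.atan(t / scale)

SIGMA = 1.0
B = SIGMA / math.sqrt(2.0)  # gives the Laplace the same variance as the Gaussian
SCALE = 1.0                 # assumption: the Cauchy has no variance to match

print("     t    gaussian     laplace      cauchy")
for t in (1.0, 3.0, 5.0, 10.0):
    print(f"{t:6.1f}  {gaussian_tail(t, SIGMA):10.3e}  "
          f"{laplace_tail(t, B):10.3e}  {cauchy_tail(t, SCALE):10.3e}")
```

Far from zero the ordering is Gaussian, then Laplace, then Cauchy, from thinnest to fattest. Close to zero the Gaussian and Laplace can swap places, because the Laplace also concentrates more mass around its mode.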

There’s an interesting paper by Goodman, Exponential Priors for Maximum Entropy Models, for which Microsoft received a blatantly ridiculous patent. Maybe we’ll get a nastygram from Redmond for using a textbook technique. Like the stats textbooks, Goodman notes that the Laplace prior likes to shrink coefficients to zero, effectively performing a kind of Bayesian feature selection.
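The shrink-to-zero behavior is easiest to see in one dimension. The sketch below is a deliberate simplification: it swaps the logistic log likelihood for a quadratic loss 0.5 * (beta - z)^2 around a hypothetical unpenalized estimate z, because that case has closed-form solutions. The names `laplace_map`, `gaussian_map` and `lam` are made up for this illustration.

```python
import math

def laplace_map(z, lam):
    """Minimizer of 0.5 * (beta - z)**2 + lam * abs(beta): soft thresholding."""
    return math.copysign(max(abs(z) - lam, 0.0), z)

def gaussian_map(z, lam):
    """Minimizer of 0.5 * (beta - z)**2 + 0.5 * lam * beta**2: rescaling."""
    return z / (1.0 + lam)

lam = 0.5
print("     z     laplace    gaussian")
for z in (-2.0, -0.3, 0.1, 0.3, 2.0):
    print(f"{z:6.1f}  {laplace_map(z, lam):10.3f}  {gaussian_map(z, lam):10.3f}")
```

Under the Laplace penalty, any z with absolute value at most `lam` maps to exactly zero, which is the feature-selection effect. Under the Gaussian penalty every estimate is scaled toward zero but never reaches it unless z is already zero.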

There’s also a more recent offering by Genkin, Lewis and Madigan, Large-scale Bayesian logistic regression for text categorization, which covers both L1 and L2 regularization and comes with an open-source implementation for the multinomial case. This paper does a good job of comparing regularized logistic regression to SVM baselines. Dave Lewis has been a great help in understanding the math and algorithms behind logistic regression, particularly the one-parameter vector vs. two-parameter vector case, and the implications for priors and offsets in sparse computations.

There’s an interesting paper by Gelman, Jakulin, Su and Pittau, A Default Prior Distribution for Logistic and Other Regression Models, in which they argue that after centering and variance adjusting inputs, the Cauchy prior is a good general purpose prior. They evaluate on a range of binary classification problems.
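As a rough sketch of what "centering and variance adjusting" might look like for a single input column, followed by a zero-centered Cauchy log prior on its coefficient: the defaults `target_sd=0.5` and `scale=2.5` are assumptions here and should be checked against the paper, and both function names are hypothetical.

```python
import math
import statistics

def standardize(column, target_sd=0.5):
    """Shift a column to mean 0 and rescale it to standard deviation target_sd."""
    mean = statistics.mean(column)
    sd = statistics.pstdev(column)
    if sd == 0.0:
        # A constant column carries no information; centering leaves all zeros.
        return [0.0 for _ in column]
    return [(x - mean) * target_sd / sd for x in column]

def log_cauchy_prior(beta, scale=2.5):
    """Log density of a Cauchy prior centered at zero with the given scale."""
    return -math.log(math.pi * scale * (1.0 + (beta / scale) ** 2))

ages = [23.0, 35.0, 41.0, 58.0, 62.0]  # made-up input values
print(standardize(ages))
print(log_cauchy_prior(0.0), log_cauchy_prior(5.0))
```

Putting every input on a common scale is what lets a single prior scale make sense across all coefficients.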

I just had coffee with Andrew Gelman after guest-lecturing in his stat computation class, and we had a chat about regularization and about discriminative versus generative models. He wasn’t happy with Laplace priors being applied willy-nilly, suggesting instead that feature selection be done separately from feature estimation. Andrew also surprised me in that he thinks of logistic regression as a generative model (“of course it’s generative, it’s Bayesian”, I believe he said); but for me, this is odd, because the data’s “generated” trivially. He was keen on my trying out some kind of posterior variance fiddling to move from a maximum a posteriori form of reasoning to a more Bayesian one. The only problem is that I just can’t invert the million by million matrix needed for the posterior variance estimates.
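Some back-of-the-envelope arithmetic shows why that inversion is out of reach. The sketch assumes "million" is taken literally, that the matrix is stored densely in 8-byte floats, and that a dense inversion costs on the order of n cubed operations.

```python
def dense_matrix_bytes(n, bytes_per_entry=8):
    """Memory needed to hold a dense n-by-n matrix of fixed-size entries."""
    return n * n * bytes_per_entry

def cubic_op_count(n):
    """Order-of-magnitude operation count for a dense n-by-n inversion."""
    return n ** 3

n = 1_000_000  # one coefficient per feature, taking "million" literally
print(f"storage: {dense_matrix_bytes(n) / 1e12:.1f} TB")
print(f"operations: about {cubic_op_count(n):.0e}")
```

That comes to 8 TB just to hold the matrix, before any work is done on it.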

The issue here is a culture conflict between “machine learning” and “statistics”. If you read Gelman and Hill’s book on regression, it’s study after study of small dimensionality problems with dense data, not much data, and carefully engineered features and interaction features. Natural language problems, like most machine learning problems, involve millions of features and very sparse vectors. Maybe different techniques work better for quantitatively different problems.