## Logistic Regression by Any Other Name

I

(Bob) have been working on logistic regression. In particular,

multinomial logistic regression with a range of different priors and a

sparse regularized stochastic gradient descent optimizer. It’ll be out

in LingPipe 3.5 any day now.

I wrote up everything I learned in a white paper, Lazy Sparse Stochastic Gradient Descent for Regularized Multinomial Logistic Regression. The result was a slew of algebra reminiscent of Dan Klein and Chris Manning’s max-ent tutorial, but with more general regularization, a different (k-1)-vector parameterization, and a different optimization scheme.

I added a new white papers section

to the blog to hold it. I made a minor original contribution: a

stochastic gradient descent algorithm with regularization that fully

preserves both sparseness and the stocahsticity of regularization.

The amount of math I had to do to put the pieces back together took me straight back to grad school. My parents are visiting

and asked me last night why there was an enormous pile of math books on

the table (Strang on matrices [I couldn’t remember what positive

definite means], a calc textbook [I couldn’t even remember how to

differentiate logs and exponentials], Larsen and Marx (or was it

DeGroot and Schervish?) [need the basic density definitions and

properties], Gelman et al.’s Bayesian Data Analysis and Regression

books, rounded off with Bishop’s, MacKay’s, and Hastie et al.’s trio of

machine learning books and Manning and Schuetze for good measure). The

answer is that they all contained pieces of the puzzle.

I work out the full error function under maximum likelihood and

Gaussian, Laplace and Cauchy priors, derive the gradients and Hessians,

and show how to build it all into a stochastic gradient descent solver.

I show the links between logistic regression and max entropy, work out

the impact of feature functions, exponential bases, binary features,

input centering and variance normalization, and even kernel methods. I

have to admit I skipped full duality (but did talk about

kernelization), didn’t work out the Hessian positive definiteness proof

for error function is concavity, and don’t cover any of the convergence

proofs for SGD. But there are lots of references to books, papers,

tutorials and software packages.

I have to admit that during this process I felt like the great

Nicolai Ivanovich Lobachevsky as satirized in the Tom Lehrer song Lobachevsky.

The unit tests are now passing. I’d like to thank Dave Lewis

for help with test cases and understanding some of the math. Dave and I

were undergrad math majors at Michigan State together, and now find

ourselves doing logistic regression for NLP; he’s contributing to the BMR: Bayesian Multinomial Regression Software package (check out their cool use of priors for IDF-like term weighting).

One of the problems I had is the diverse literature around

regularized logistic regression and the sheer volume of terms used to

refer to it and its properties. Ditto for the stochastic gradient and

other optimization literature. Here’s the “also known as” section from

the forthcoming LingPipe 3.5 Javadoc (we’re our own search engine optimization providers):

Multinomial logistic regression is also known as polytomous,

polychotomous, or multi-class logistic regression, or just multilogit

regression.Binary logistic regression is an instance of a generalized linear model (GLM) with the logit link function.

The logit function is the inverse of the logistic function, and the

logistic function is sometimes called the sigmoid function or the

s-curve.Logistic regression estimation obeys the maximum entropy principle,

and thus logistic regression is sometimes called “maximum entropy

modeling”, and the resulting classifier the “maximum entropy

classifier”.The generalization of binomial logistic regression to multinomial

logistic regression is sometimes called a softmax or exponential model.Maximum a priori (MAP) estimation with Gaussian priors is often

referred to as “ridge regression”; with Laplace priors MAP estimation

is known as the “lasso”. MAP estimation with Gaussian, Laplace or

Cauchy priors is known as parameter shrinkage. Gaussian and Laplace are

forms of regularized regression, with the Gaussian version being

regularized with the L_{2}norm (Euclidean distance, called the

Frobenius norm for matrices of parameters) and the Laplace version

being regularized with the L_{1}norm (taxicab distance or Manhattan metric); other Minkowski metrics may be used for shrinkage.Binary logistic regression is equivalent to a one-layer,

single-output neural network with a logistic (sigmoid) activation

function trained under log loss. This is sometimes called

classification with a single neuron.

## 0 Comments:

Post a Comment