(Bob) have been working on logistic regression. In particular,
multinomial logistic regression with a range of different priors and a
sparse regularized stochastic gradient descent optimizer. It’ll be out
in LingPipe 3.5 any day now.
I wrote up everything I learned in a white paper, Lazy Sparse Stochastic Gradient Descent for Regularized Multinomial Logistic Regression. The result was a slew of algebra reminiscent of Dan Klein and Chris Manning’s max-ent tutorial, but with more general regularization, a different (k-1)-vector parameterization, and a different optimization scheme.
I added a new white papers section
to the blog to hold it. I made a minor original contribution: a
stochastic gradient descent algorithm with regularization that fully
preserves both sparseness and the stocahsticity of regularization.
The amount of math I had to do to put the pieces back together took me straight back to grad school. My parents are visiting
and asked me last night why there was an enormous pile of math books on
the table (Strang on matrices [I couldn’t remember what positive
definite means], a calc textbook [I couldn’t even remember how to
differentiate logs and exponentials], Larsen and Marx (or was it
DeGroot and Schervish?) [need the basic density definitions and
properties], Gelman et al.’s Bayesian Data Analysis and Regression
books, rounded off with Bishop’s, MacKay’s, and Hastie et al.’s trio of
machine learning books and Manning and Schuetze for good measure). The
answer is that they all contained pieces of the puzzle.
I work out the full error function under maximum likelihood and
Gaussian, Laplace and Cauchy priors, derive the gradients and Hessians,
and show how to build it all into a stochastic gradient descent solver.
I show the links between logistic regression and max entropy, work out
the impact of feature functions, exponential bases, binary features,
input centering and variance normalization, and even kernel methods. I
have to admit I skipped full duality (but did talk about
kernelization), didn’t work out the Hessian positive definiteness proof
for error function is concavity, and don’t cover any of the convergence
proofs for SGD. But there are lots of references to books, papers,
tutorials and software packages.
I have to admit that during this process I felt like the great
Nicolai Ivanovich Lobachevsky as satirized in the Tom Lehrer song Lobachevsky.
The unit tests are now passing. I’d like to thank Dave Lewis
for help with test cases and understanding some of the math. Dave and I
were undergrad math majors at Michigan State together, and now find
ourselves doing logistic regression for NLP; he’s contributing to the BMR: Bayesian Multinomial Regression Software package (check out their cool use of priors for IDF-like term weighting).
One of the problems I had is the diverse literature around
regularized logistic regression and the sheer volume of terms used to
refer to it and its properties. Ditto for the stochastic gradient and
other optimization literature. Here’s the “also known as” section from
the forthcoming LingPipe 3.5 Javadoc (we’re our own search engine optimization providers):
Multinomial logistic regression is also known as polytomous,
polychotomous, or multi-class logistic regression, or just multilogit
Binary logistic regression is an instance of a generalized linear model (GLM) with the logit link function.
The logit function is the inverse of the logistic function, and the
logistic function is sometimes called the sigmoid function or the
Logistic regression estimation obeys the maximum entropy principle,
and thus logistic regression is sometimes called “maximum entropy
modeling”, and the resulting classifier the “maximum entropy
The generalization of binomial logistic regression to multinomial
logistic regression is sometimes called a softmax or exponential model.
Maximum a priori (MAP) estimation with Gaussian priors is often
referred to as “ridge regression”; with Laplace priors MAP estimation
is known as the “lasso”. MAP estimation with Gaussian, Laplace or
Cauchy priors is known as parameter shrinkage. Gaussian and Laplace are
forms of regularized regression, with the Gaussian version being
regularized with the L2 norm (Euclidean distance, called the
Frobenius norm for matrices of parameters) and the Laplace version
being regularized with the L1 norm (taxicab distance or Manhattan metric); other Minkowski metrics may be used for shrinkage.
Binary logistic regression is equivalent to a one-layer,
single-output neural network with a logistic (sigmoid) activation
function trained under log loss. This is sometimes called
classification with a single neuron.