SGD Special Ingredient: Reckless Driving

I’ve been playing around with convergence for my stochastic gradient descent (SGD) implementation in the upcoming LingPipe 3.5, in the context of the 2008 i2b2 Obesity Challenge,
the full title of which is "Second i2b2 Shared-Task and Workshop;
Challenges in Natural Language Processing for Clinical Data; Obesity
Challenge (A Shared-Task on Obesity): Who’s obese and what
co-morbidities do they (definitely/likely) have?". Participants have
until April 15, 2008 to register.


Slide 37 of Léon Bottou’s NIPS tutorial Learning with Large Datasets
reveals the "secret ingredient" behind (his) successful stochastic
gradient search (where η is the learning rate, which he calls a "gain"):



At any moment during training, we can:


  • Select a small subsample of examples.
  • Try various gains η on the subsample.
  • Pick the gain η that most reduces the cost.
  • Use it for the next 100000 iterations on the full dataset.

This is a kind of meta-descent algorithm, the best known of which is Nicol Schraudolph’s Stochastic Meta-Descent.
I’m going to try Bottou’s approach tomorrow. But I’m skeptical it’ll
work very well, because the optimal learning rate changes pretty
quickly through the corpus, so balancing batch size and number of rates
to probe may prove tricky. What I need is a control stick, sort of like
David MacKay et al.’s Dasher interface for convergence.
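
In code, Bottou’s probing recipe amounts to something like the following sketch. The subsample size, the candidate gain grid, and the sgd_pass and cost functions are all placeholders of mine, not his:

    import random

    def probe_gain(examples, params, candidate_gains, sgd_pass, cost,
                   sample_size=1000):
        # Select a small subsample of examples.
        subsample = random.sample(examples, sample_size)
        best_gain, best_cost = None, float("inf")
        # Try various gains on the subsample and keep whichever
        # most reduces the cost.
        for gain in candidate_gains:
            trial = sgd_pass(list(params), subsample, gain)  # work on a copy
            trial_cost = cost(trial, subsample)
            if trial_cost < best_cost:
                best_gain, best_cost = gain, trial_cost
        return best_gain  # then use it for the next big block of iterations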


Until today, I’ve been using a simple annealing schedule to set the
learning rate. The good news is that the rumors were true: sparse SGD
is very fast. The bad news is that driving the SGD implementation is
like bobsledding in the dark down a steep, bumpy hill (that
reference was for you, Jeff). Too high a rate, and the first bump’ll
blow you off the hill; too heavy a hand on the brakes, and it’ll take
months to get to the bottom. Get it just right, and you’re there in a few
hundred iterations.
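
For concreteness, the sort of simple annealing schedule I mean is the standard 1/t-style decay; this is an illustrative sketch, not necessarily the exact form or constants that will ship:

    def annealed_rate(initial_rate, epoch, half_life):
        # 1/t-style decay: the rate is halved at epoch `half_life`
        # and keeps shrinking toward zero after that.
        return initial_rate / (1.0 + epoch / half_life)

    # e.g. with initial_rate=0.1, half_life=10:
    # epoch 0 -> 0.100, epoch 10 -> 0.050, epoch 30 -> 0.025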


Today, I played around with the so-called "reckless driver"
heuristic, which increases the learning rate for the next epoch if the
current epoch lowered the error, and decreases it if the error went
up. It helps. Most of the time. I’m undecided about whether to include
it in the release and, if so, whether to parameterize the
acceleration/deceleration terms.
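
In code, the heuristic is just a couple of lines; the acceleration and braking constants here are arbitrary stand-ins, not the values (if any) that will end up in the release:

    def reckless_driver(rate, prev_error, cur_error, accel=1.1, brake=0.5):
        # Speed up while the error is dropping; brake hard when it rises.
        if cur_error < prev_error:
            return rate * accel  # last epoch helped, so drive faster
        return rate * brake      # error went up, so back off

Braking much harder than you accelerate seems like the natural way to keep a single bad epoch from blowing the search off the hill.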


If you can get close to a solution early on with a fairly low
learning rate, turning the rate up later on seems to help. Yes, I know
that eventually the learning rate has to decrease to get within epsilon
of a solution. So much for the asymptotics and unbounded number of steps
assumed in the theoretical analysis.
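
For the record, the asymptotic story I’m grumbling about is the usual Robbins-Monro condition on the learning rates:

    \sum_{t=1}^{\infty} \eta_t = \infty
    \qquad\text{and}\qquad
    \sum_{t=1}^{\infty} \eta_t^2 < \infty

The first condition lets the parameters travel arbitrarily far from where they start; the second shrinks the steps fast enough that the noise in the stochastic gradient averages out.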


I’m reporting total error and breaking out its log likelihood
component. What’s interesting to me is that the log likelihood component
(really the negative log likelihood) drops almost to zero, indicating
the data’s pretty much separable, then starts climbing back up as the
regularization kicks in and shrinks all the parameters back down. I’m
wondering whether separate learning rates for the regularization and
likelihood gradient contributions make sense.
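
Concretely, what I have in mind is splitting the usual regularized update into two rates, along these lines; this is a sketch of the idea, not LingPipe’s actual update rule:

    def split_update(beta, grad_loglik, eta_lik, eta_reg, prior_var, n):
        # One stochastic step with separate rates for the likelihood
        # gradient and the Gaussian-prior (regularization) gradient;
        # the prior's pull is amortized over the n training examples.
        for j in range(len(beta)):
            beta[j] += eta_lik * grad_loglik[j]             # fit the data
            beta[j] -= eta_reg * beta[j] / (prior_var * n)  # shrink toward 0
        return beta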


I’m also realizing why the "early stopping" heuristic was introduced
as a poor man’s regularization. Even with proper Bayesian
regularization, the optimal solution under cross-validation for many
choices of prior is not the maximum a posteriori (MAP) solution. For
instance, if I take a Gaussian prior with a variance of 2.0 for every
dimension, the MAP solution is inferior to "early stopping". Hello, empirical Bayes. And perhaps dimension-specific priors, as in Genkin, Madigan and Lewis’s Bayesian Multinomial Regression Software, which lets you get the effect of variance-normalizing the input dimensions.
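
The connection to input scaling falls out of the penalty term. With dimension-specific variances, the Gaussian prior contributes

    \sum_j \frac{\beta_j^2}{2\sigma_j^2}

to the negative objective, and rescaling an input x_j by a constant rescales the optimal β_j by its inverse, so setting σ_j proportional to 1/sd(x_j) with a shared base variance has the same effect as standardizing x_j to unit variance.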


The other good news is that (at least under most parameterizations)
regularized logistic regression definitely works better than naive
Bayes, our character language model, or TF/IDF classifiers. As I
suspected all along, logistic regression, even with just bag-of-words or
bag-of-character-ngram features, is incredibly sensitive to its priors,
and the task has incredibly high variance under cross-validation no
matter which classifier we use.


On the very next slide of his tutorial (38), Bottou describes a
batch version of regularized SGD that maintains sparsity in the per-input
gradient updates. I independently came up with the same algorithm, and
then went on to derive my fully lazy implementation, which preserves
stochasticity.
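
The lazy trick, in sketch form (variable names are mine, and the catch-up shrinkage here uses a simple multiplicative approximation, so the actual implementation may differ): record when each dimension was last regularized, and apply all of its pending shrinkage only when one of its features next fires.

    def lazy_step(beta, last_touched, t, active_dims, grad, eta, prior_var, n):
        # Touch only the dimensions active in the current (sparse) example;
        # each one first catches up on the shrinkage it skipped while dormant.
        for j in active_dims:
            skipped = t - last_touched[j]
            beta[j] *= (1.0 - eta / (prior_var * n)) ** skipped  # pending decay
            beta[j] += eta * grad[j]                             # likelihood step
            last_touched[j] = t
        return beta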

