I’ve been playing around with convergence for my SGD implementation for the upcoming LingPipe 3.5, in the context of the 2008 i2b2 Obesity Challenge, the full title of which is "Second i2b2 Shared-Task and Workshop; Challenges in Natural Language Processing for Clinical Data; Obesity Challenge (A Shared-Task on Obesity): Who’s obese and what co-morbidities do they (definitely/likely) have?". Participants have until April 15, 2008 to register.

Slide 37 of Léon Bottou’s NIPS tutorial Learning with Large Datasets reveals the "secret ingredient" behind (his) successful stochastic gradient search (where η is the learning rate, which he calls a "gain"):

At any moment during training, we can:

- Select a small subsample of examples.
- Try various gains η on the subsample.
- Pick the gain η that most reduces the cost.
- Use it for the next 100000 iterations on the full dataset.

This is a kind of meta-descent algorithm, the best known of which is Nicol Schraudolph’s Stochastic Meta-Descent.
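A minimal sketch of the probing step from the slide; the function and parameter names here are my own, not Bottou’s, and the gradient/cost callbacks are whatever your model supplies:

```python
import numpy as np

def probe_gain(w, subsample, grad_fn, cost_fn, etas, n_steps=100):
    """Try each candidate gain eta on a small subsample and return the
    one that most reduces the cost, per Bottou's recipe.  Each trial
    runs SGD from the current weights w on up to n_steps examples."""
    best_eta, best_cost = None, float("inf")
    for eta in etas:
        w_trial = w.copy()
        for x, y in subsample[:n_steps]:
            # Plain SGD update with the candidate gain.
            w_trial -= eta * grad_fn(w_trial, x, y)
        cost = cost_fn(w_trial, subsample)
        if cost < best_cost:
            best_eta, best_cost = eta, cost
    return best_eta

# Toy check: least squares on y = 3x; a mid-range gain should win,
# since a tiny gain barely moves and a huge one diverges.
data = [(np.array([1.0]), 3.0)] * 20
grad = lambda w, x, y: (w @ x - y) * x
cost = lambda w, d: sum((w @ x - y) ** 2 for x, y in d)
best = probe_gain(np.array([0.0]), data, grad, cost, [1e-3, 1e-1, 10.0])
```

The winner would then be frozen and used for the next big block of iterations on the full dataset.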

I’m going to try Bottou’s approach tomorrow. But I’m skeptical it’ll work very well, because the optimal learning rate changes pretty quickly through the corpus, so balancing batch size and the number of rates to probe may prove tricky. What I need is a control stick, sort of like David MacKay et al.’s Dasher interface, but for convergence.

Until today, I’ve been using a simple annealing schedule to set the learning rate. The good news is that the rumors were true — sparse SGD is very fast. The bad news is that driving the SGD implementation is like bobsledding on a steep hill in the dark over hilly terrain (that reference was for you, Jeff). Too high a rate, and the first bump’ll blow you off the hill; too heavy a hand on the brakes, and it’ll take months to get to the bottom. Just right, and you’re there in a few hundred iterations.
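For concreteness, a typical annealing schedule of the simple sort described; the exact form and constants in the LingPipe implementation may differ:

```python
def annealed_rate(eta0, t, tau):
    """Inverse-scaling annealing: eta_t = eta0 / (1 + t / tau).
    The rate starts at eta0, is halved by step t = tau, and decays
    like 1/t for large t."""
    return eta0 / (1.0 + t / tau)
```

Too large an eta0 (or too slow a tau) is the "off the hill" failure mode; too aggressive a decay is the "months to the bottom" one.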

Today, I played around with the so-called "reckless driver" heuristic, which increases the learning rate for the next epoch if the current epoch lowered the overall error, and decreases it if the error went up. It helps. Most of the time. I’m undecided about whether to include it in the release, and if so, whether to parameterize the acceleration/deceleration terms.
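The heuristic itself is tiny; a sketch with hypothetical acceleration/deceleration constants (the values here are illustrative, not what LingPipe ships):

```python
def reckless_driver(eta, prev_error, curr_error, accel=1.1, brake=0.5):
    """Nudge the learning rate up after an epoch that lowered the
    overall error; brake hard after an epoch that raised it.
    accel and brake are the parameterization question in the post."""
    if curr_error < prev_error:
        return eta * accel   # last epoch helped: speed up a little
    return eta * brake       # last epoch hurt: slow down a lot
```

The asymmetry (gentle acceleration, hard braking) is what keeps the occasional bad epoch from blowing the whole run off the hill.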

If you can get close to a solution early on with a fairly low learning rate, turning the rate up later on seems to help. Yes, I know that eventually the learning rate has to decrease to get within epsilon of a solution. So much for asymptotics and the unbounded number of steps assumed in the theoretical analysis.

I’m reporting total error and breaking out the log likelihood component. What’s interesting to me is that the log likelihood goes way down (almost to zero, indicating the data’s pretty much separable), then starts climbing up as regularization kicks in to shrink all the parameters back down. I’m wondering if separate learning rates for the regularization and likelihood gradient contributions make sense.
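Tracking the two pieces separately just means splitting the objective. A sketch for logistic regression with a Gaussian prior, labels in {-1, +1}; the function name and signature are mine:

```python
import numpy as np

def objective_components(w, X, y, variance=2.0):
    """Split the regularized objective into its two reported parts:
    negative log likelihood and the Gaussian-prior penalty."""
    z = X @ w
    # Stable log(1 + exp(-y * z)) per example, summed.
    nll = np.sum(np.logaddexp(0.0, -y * z))
    # Gaussian prior with a single shared variance.
    penalty = np.sum(w * w) / (2.0 * variance)
    return nll, penalty, nll + penalty
```

On separable data the nll term can be driven nearly to zero by growing the weights, at which point the penalty term dominates the total and pulls them back, which is exactly the climb described above.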

I’m also realizing why the "early stopping" heuristic was introduced as a poor man’s regularization. Even with proper Bayesian regularization, the optimal solution under cross-validation for many choices of prior is not the maximum a posteriori solution. For instance, if I take a Gaussian prior with a variance of 2.0 for every dimension, the MAP solution is inferior to "early stopping". Hello, empirical Bayes. And perhaps dimension-specific priors, as in Genkin, Madigan and Lewis’s Bayesian Multinomial Regression Software, which lets you get the effect of variance-normalizing the input dimensions.
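The dimension-specific version just swaps the single shared variance for a vector of them; a sketch (a hypothetical function of mine, not the API of the Genkin et al. package):

```python
import numpy as np

def dimension_specific_penalty(w, variances):
    """Gaussian prior penalty with a separate variance per dimension.
    Giving high-magnitude input dimensions proportionally larger
    variances has the same effect as variance-normalizing the inputs."""
    return np.sum(w * w / (2.0 * np.asarray(variances)))
```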

The other good news is that (at least under most parameterizations) regularized logistic regression definitely works better than naive Bayes, our character language model, or TF/IDF classifiers. As I suspected all along, logistic regression, even with just bag-of-word or bag-of-character-ngrams features, is incredibly sensitive to priors, and the task has incredibly high variance under cross-validation no matter which classifiers we use.

On the very next page of his tutorial (slide 38), Bottou describes the batch version of regularized SGD, which maintains sparsity in the per-input gradient updates. I independently came up with the same algorithm, and then went on to derive my fully lazy implementation, which preserves stochasticity.
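My reconstruction of the lazy idea, for regularized logistic regression with a Gaussian prior: shrinkage for a dimension is deferred until that dimension next appears in an input, at which point the skipped steps are caught up in one shot. This is a sketch of the technique, not necessarily the LingPipe implementation:

```python
import numpy as np

def lazy_sgd_step(w, last_update, t, x_indices, x_values, y, eta, variance):
    """One lazy SGD step (labels y in {0, 1}, sparse input as parallel
    index/value lists).  Only dimensions active in this input are
    touched, so the cost per step is proportional to input sparsity."""
    # Catch up deferred regularization on the active dimensions:
    # 'skipped' rounds of multiplicative shrinkage applied at once.
    for i in x_indices:
        skipped = t - last_update[i]
        w[i] *= (1.0 - eta / variance) ** skipped
        last_update[i] = t
    # Likelihood gradient update, again only on active dimensions.
    z = sum(w[i] * v for i, v in zip(x_indices, x_values))
    p = 1.0 / (1.0 + np.exp(-z))  # model probability of class 1
    err = y - p
    for i, v in zip(x_indices, x_values):
        w[i] += eta * err * v
    return w, last_update
```

The catch-up trick assumes a fixed eta between a dimension’s visits (and eta/variance < 1); with an annealing schedule the deferred shrinkage factors have to be accumulated instead of exponentiated, which is where the bookkeeping gets interesting.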
