Collapsed Gibbs Sampling for LDA and Bayesian Naive Bayes

via LingPipe Blog by lingpipe on 7/13/10

I've uploaded a short (though dense) tech report that works through the collapsing of Gibbs samplers for latent Dirichlet allocation (LDA) and the Bayesian formulation of naive Bayes (NB).

Carpenter, Bob. 2010. Integrating out multinomial parameters in latent Dirichlet allocation and naive Bayes for collapsed Gibbs sampling. LingPipe Technical Report.

Thomas L. Griffiths and Mark Steyvers used the collapsed sampler for LDA in their (old enough to be open access) PNAS paper, Finding scientific topics. They show the final form, but don't derive the integral or provide a citation.
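For reference, the final form in question is usually presented as a per-token conditional over topics: a word term times a document term, both built from counts that exclude the token being resampled. What follows is a minimal sketch of that standard presentation, assuming symmetric scalar priors `alpha` and `beta`; the function names, argument names and array layout are illustrative assumptions, not taken from the paper or from LingPipe.

```python
import numpy as np

def lda_topic_conditional(word, doc, doc_topic, topic_word, topic_total, alpha, beta):
    """Normalized p(topic = k | all other assignments, words) for one token.

    Every count passed in must already EXCLUDE the token being resampled.
    doc_topic:   (num_docs, num_topics) tokens in each doc assigned to each topic
    topic_word:  (num_topics, vocab_size) times each word is assigned to each topic
    topic_total: (num_topics,) row sums of topic_word
    alpha, beta: symmetric Dirichlet priors (scalars); an assumption of this sketch
    """
    vocab_size = topic_word.shape[1]
    # How much topic k likes this word, smoothed by beta.
    word_term = (topic_word[:, word] + beta) / (topic_total + vocab_size * beta)
    # How much this doc likes topic k; the doc-length denominator is
    # constant in k, so it is dropped before normalizing.
    doc_term = doc_topic[doc] + alpha
    weights = word_term * doc_term
    return weights / weights.sum()

def resample_token(rng, word, doc, old_topic, doc_topic, topic_word, topic_total, alpha, beta):
    """One collapsed Gibbs update for a single token; mutates the count arrays."""
    # Take the token out of the counts.
    doc_topic[doc, old_topic] -= 1
    topic_word[old_topic, word] -= 1
    topic_total[old_topic] -= 1
    probs = lda_topic_conditional(word, doc, doc_topic, topic_word, topic_total, alpha, beta)
    new_topic = int(rng.choice(len(probs), p=probs))
    # Put it back under the freshly sampled topic.
    doc_topic[doc, new_topic] += 1
    topic_word[new_topic, word] += 1
    topic_total[new_topic] += 1
    return new_topic
```

A full sweep would call `resample_token` once per token with `rng = np.random.default_rng()`. Note that no topic or document multinomials appear anywhere; only counts do, which is the whole point of collapsing.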

I suppose these 25-step integrals are supposed to be child's play. Maybe they are if you're a physicist or theoretical statistician. But that was a whole lot of choppin' with the algebra and the calculus for a simple country computational linguist like me.

On to Bayesian Naive Bayes

My whole motivation for doing the derivation was that someone told me that it wasn't possible to integrate out the multinomials in naive Bayes (actually, they told me you'd be left with residual $\Gamma$ functions). It seemed to me like it should be possible because the structure of the problem was so much like the LDA setup.

It turned out to be a little trickier than I expected and I had to generalize the LDA case a bit. But in the end, the result's interesting. I didn't wind up with what I expected. Instead, the derivation led me to see that the collapsed sampler uses Bayesian updating at a per-token level within a doc. Thus a second occurrence of a token will be more likely than the first because the topic multinomial parameter will have been updated to take account of the assignment of the first item.
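The per-token updating can be sketched in a few lines. This is a minimal illustration, assuming a symmetric scalar prior `beta` and a hypothetical function name and count layout; it is the generic Dirichlet-multinomial sequential predictive rather than the tech report's exact notation.

```python
import numpy as np

def doc_log_prob_given_category(doc_tokens, category_word_counts, beta):
    """log p(doc tokens | category) with the word multinomial integrated out.

    doc_tokens:           sequence of word ids for the doc being resampled
    category_word_counts: (vocab_size,) word counts from the OTHER docs
                          currently assigned to the category
    beta:                 symmetric Dirichlet prior (scalar); an assumption
    """
    counts = np.asarray(category_word_counts, dtype=float).copy()
    total = counts.sum()
    vocab_size = counts.shape[0]
    log_prob = 0.0
    for w in doc_tokens:
        # Predictive probability of this token given everything seen so far,
        # including the earlier tokens of this same doc.
        log_prob += np.log((counts[w] + beta) / (total + vocab_size * beta))
        # Bayesian update: the token just scored now counts as evidence.
        counts[w] += 1.0
        total += 1.0
    return log_prob
```

With an empty category, a two-word vocabulary and `beta = 1.0`, the doc `[0, 0]` scores 1/2 for the first token and 2/3 for the second, whereas a plain multinomial would score both the same. The total is order-independent, so `[0, 1]` and `[1, 0]` get the same probability.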

This is so cool. I actually learned something from a derivation.

In my prior post, Bayesian Naive Bayes, aka Dirichlet-Multinomial Classifiers, I provided a proper Bayesian definition of naive Bayes classifiers (though the model is also appropriate for soft clustering with Gibbs sampling replacing EM). Don't get me started on the misappropriation of the term "Bayesian" by any system that applies Bayes's rule, but do check out another of my previous posts, What is Bayesian Statistical Inference? if you're curious.

Thanks to Wikipedia

I couldn't have done the derivation for LDA (or NB) without the help of

Wikipedia: Latent Dirichlet Allocation

It pointed me (implicitly) at a handful of invaluable tricks, such as

multiplying through by the appropriate Dirichlet normalizers to reduce an integral over a Dirichlet density kernel to a constant,
unfolding products based on position, unfolding a $\Gamma()$ function for the position at hand, then refolding the rest back up so it could be dropped out, and
reparameterizing products for total counts based on sufficient statistics.
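The first two tricks rest on two standard identities, written out here as a sketch. The shorthand $c_i$ for the count of topic $i$ in the document with the current token removed is an assumption of this note, not the tech report's or Wikipedia's notation.

$$
\int \prod_{i} \theta_i^{a_i - 1} \, d\theta = \frac{\prod_{i} \Gamma(a_i)}{\Gamma\left(\sum_{i} a_i\right)},
\qquad
\Gamma(x + 1) = x \, \Gamma(x).
$$

The first turns an integral over a Dirichlet kernel into a ratio of $\Gamma$ functions. The second handles the position at hand: if the current token is assigned to topic $k$, only the $k$-th count goes up by one, so

$$
\Gamma(c_k + \alpha_k + 1) \prod_{i \neq k} \Gamma(c_i + \alpha_i)
= (c_k + \alpha_k) \prod_{i} \Gamma(c_i + \alpha_i).
$$

The product on the right now runs over every topic, so it no longer depends on $k$ and can be dropped, leaving just $c_k + \alpha_k$.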

Does anyone know of another source that works through equations more gently? I went through the same exegesis for SGD estimation for multinomial logistic regression with priors a while back.

But Wikipedia's Derivation is Wrong!

At least I'm pretty sure it is as of 5 PM EST, 13 July 2010.

Wikipedia's calculation problem starts in the move from the fifth equation before the end to the fourth. At this stage, we've already eliminated all the integrals, but still have a mess of $\Gamma$ functions left. The only hint at what's going on is in the text above, which says it drops terms that don't depend on $k$ (the currently considered topic assignment for the $n$-th word of the $m$-th document). Wikipedia's calculation then proceeds to drop the term $\prod_{i \neq k} \Gamma(n^{i,-(m,n)}_{m,(\cdot)} + \alpha_i)$ without any justification. It clearly depends on $k$.

The problems continue in the move from the third equation before the end to the penultimate equation, where a whole bunch of $\Gamma$ function applications are dropped, such as $\Gamma(n^{k,-(m,n)}_{m,(\cdot)} + \alpha_k)$, which even more clearly depend on $k$.
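A quick numeric check makes the dependence on $k$ concrete. This sketch assumes the factor left at that stage has the form $\Gamma(c_k + \alpha_k + 1) \prod_{i \neq k} \Gamma(c_i + \alpha_i)$, where $c_i$ is shorthand (introduced here, not in the Wikipedia article) for $n^{i,-(m,n)}_{m,(\cdot)}$; the counts and priors are made-up illustration values.

```python
import math

def log_gamma_product_excluding(k, counts, alpha):
    """log of prod_{i != k} Gamma(c_i + alpha_i): the term dropped too early."""
    return sum(
        math.lgamma(c + a)
        for i, (c, a) in enumerate(zip(counts, alpha))
        if i != k
    )

def log_full_term(k, counts, alpha):
    """log of Gamma(c_k + alpha_k + 1) * prod_{i != k} Gamma(c_i + alpha_i)."""
    return math.lgamma(counts[k] + alpha[k] + 1.0) + log_gamma_product_excluding(k, counts, alpha)

def log_gamma_product_all(counts, alpha):
    """log of prod_i Gamma(c_i + alpha_i): genuinely constant in k."""
    return sum(math.lgamma(c + a) for c, a in zip(counts, alpha))

counts = [3, 0, 5]        # hypothetical per-topic counts, current token removed
alpha = [0.5, 0.5, 0.5]   # hypothetical Dirichlet prior

for k in range(len(counts)):
    dropped = log_gamma_product_excluding(k, counts, alpha)
    ratio = math.exp(log_full_term(k, counts, alpha) - log_gamma_product_all(counts, alpha))
    print(k, dropped, ratio, counts[k] + alpha[k])
```

The second column changes with `k`, so the product over $i \neq k$ cannot simply be discarded. The third and fourth columns agree, which is what you get by unfolding the one $\Gamma$ function first and only then dropping the full product.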

It took me a few days to see what was going on, and I figured out how to eliminate the variables properly. I also explain each and every step for those of you like me who don't eat, sleep and breathe differential equations. I also use the more conventional stats numbering system where the loop variable $m$ ranges from $1$ to $M$ so you don't need to keep (as large) a symbol table in your head.

I haven't changed the Wikipedia page for two reasons: (1) I'd like some confirmation that I haven't gone off the rails myself anywhere, and (2) their notation is hard to follow and the Wikipedia interface not so clean, so I worry I'd introduce typos or spend all day entering it.

LaTeX Rocks

I also don't think I could've done this derivation without LaTeX. The equations are just too daunting for pencil and paper. The non-interactive format of LaTeX did get me thinking that there might be some platform for sketching math out there that would split the difference between paper and LaTeX. Any suggestions?