**EM Clustering, Fat Gaussians and Number of Clusters**

I just got back from a very nice talk at Columbia by Sanjoy Dasgupta, who’s on leave at Yahoo! from UCSD. The talk was basically about an old paper, Dasgupta and Schulman 2000, in which the authors introduce a “fat Gaussian” to overcome the curse of dimensionality in EM (soft) clustering.

Dasgupta started by making the point that a Gaussian with 1000

independent dimensions has almost all sample points roughly the same

distance from the mean, which is easy to prove given the central limit

theorem. In other words, the standard Gaussian defines a thin

high-likelihood shell around the expected distance. In reality (or at

least in experimental data sets), data’s more dispersed than

independent high-dimensional Gaussians can represent. Therefore, they

introduce fat Gaussians, which are just mixtures of Gaussians with

different variances, thus averaging over a bunch of thin shells at

their respective expected distances from the mean. He went on to show

various separability results for clustering based on these fat

Gaussians and mostly separable data.

What struck me most about the talk was Dasgupta’s answer to a

question from an audience member who asked if his method would help

determine the number of clusters. This is a perennial problem in

clustering. K-means and EM-clustering require the number of clusters to

be set ahead of time. Dasgupta’s answer was in two parts. The official

answer was “no”. But he qualified that by saying that what he really

believed was that the problem was ill-formed because different numbers

of clusters were useful for different applications.

Some natural language processing problems are like this, and some

are different. Categorizing documents by topic is like this, because

the world doesn’t come pre-segmented into a known number of topics. For

instance, we might categorize news into business, politics, sports and

entertainment, leading to four categories. Or we might go further, and

divide them into regional and international politics, baseball, soccer

and bowling, and movies versus music. It depends on the application.

But one reason we’re interested in clustering is for coreference

resolution, including database deduplication and grouping mentions of

the same entity (such as a consumer electronics product, gene name or

person) in text. For these tasks, there is a notion of “true” number of

clusters, though one might quibble that Bob-as-cook is different than

Bob-as-programmer.

EM clustering that estimates means and variances in a maximum

likelihood setting is ill-formed. The max likelihood solution in the

2-cluster setting places one cluster mean on a single data point with

zero variance, leading to an infinite value of the density for that

point, with the other points belonging to a single high-variance

covering distribution. One solution to this problem is to fix the

variances to positive values ahead of time and only estimate the means.

Another thing that strikes me about EM clustering is that it’s

almost always done with a uniform distribution over categories. That

is, the likelihood of a point x being in cluster c is p(c) * p(x|c).

I’ve never seen a presentation of Gaussian EM clustering where p(c) is

taken to be anything other than uniform. That is, we’re not modeling

the situation where clusters have different likelihoods — the

generative model is to draw a cluster according to a uniform

distribution and then draw a point from that cluster’s distribution.

The Bayesian (equivalently minimum description length) solution to

the number of clusters problem also addresses the uniform-cluster size

problem. Specifically, we put a prior on the model structure and then

choose the clustering that maximizes the likelihood times the prior:

p(data|model) * p(model). The standard Bayesian prior for this kind of

setting would comprise something like a Dirichlet process prior on the

distribution of categories p(c); we could also put Bayesian priors on

the means or more commonly variances of the Gaussians. With a Dirichlet

process, there is no bound on the number of categories, and draws from

a Dirichlet process prior lead to power-law (Zipf-like) distributions

of number and size of categories. You can see this working quite well

in coreference resolution as evidenced by the recent paper Haghighi and Klein 2007.

The only problem with the Bayesian “solution” is that the

concentration parameter of the Dirichlet process prior is just another

way of setting k. That is, you need to make a prior assumption about

how costly it is to introduce new categories. This is just replacing a

direct control (number of clusters) with an indirect one (cost to

introduce a new cluster). Haghighi and Klein, for instance, set the

concentration parameter of their Dirichlet process prior empirically.

## 1 Comments:

Post a Comment