I just got back from a very nice talk at Columbia by Sanjoy Dasgupta, who’s on leave at Yahoo! from UCSD. The talk was basically about an old paper, Dasgupta and Schulman 2000, in which the authors introduce a “fat Gaussian” to overcome the curse of dimensionality in EM (soft) clustering.
Dasgupta started by making the point that a Gaussian with 1000
independent dimensions has almost all sample points roughly the same
distance from the mean, which is easy to prove given the central limit
theorem. In other words, the standard Gaussian defines a thin
high-likelihood shell around the expected distance. In reality (or at
least in experimental data sets), data’s more dispersed than
independent high-dimensional Gaussians can represent. Therefore, they
introduce fat Gaussians, which are just mixtures of Gaussians with
different variances, thus averaging over a bunch of thin shells at
their respective expected distances from the mean. He went on to show
various separability results for clustering based on these fat
Gaussians and mostly separable data.
What struck me most about the talk was Dasgupta’s answer to a
question from an audience member who asked if his method would help
determine the number of clusters. This is a perennial problem in
clustering. K-means and EM-clustering require the number of clusters to
be set ahead of time. Dasgupta’s answer was in two parts. The official
answer was “no”. But he qualified that by saying that what he really
believed was that the problem was ill-formed because different numbers
of clusters were useful for different applications.
Some natural language processing problems are like this, and some
are different. Categorizing documents by topic is like this, because
the world doesn’t come pre-segmented into a known number of topics. For
instance, we might categorize news into business, politics, sports and
entertainment, leading to four categories. Or we might go further, and
divide them into regional and international politics, baseball, soccer
and bowling, and movies versus music. It depends on the application.
But one reason we’re interested in clustering is for coreference
resolution, including database deduplication and grouping mentions of
the same entity (such as a consumer electronics product, gene name or
person) in text. For these tasks, there is a notion of “true” number of
clusters, though one might quibble that Bob-as-cook is different than
EM clustering that estimates means and variances in a maximum
likelihood setting is ill-formed. The max likelihood solution in the
2-cluster setting places one cluster mean on a single data point with
zero variance, leading to an infinite value of the density for that
point, with the other points belonging to a single high-variance
covering distribution. One solution to this problem is to fix the
variances to positive values ahead of time and only estimate the means.
Another thing that strikes me about EM clustering is that it’s
almost always done with a uniform distribution over categories. That
is, the likelihood of a point x being in cluster c is p(c) * p(x|c).
I’ve never seen a presentation of Gaussian EM clustering where p(c) is
taken to be anything other than uniform. That is, we’re not modeling
the situation where clusters have different likelihoods — the
generative model is to draw a cluster according to a uniform
distribution and then draw a point from that cluster’s distribution.
The Bayesian (equivalently minimum description length) solution to
the number of clusters problem also addresses the uniform-cluster size
problem. Specifically, we put a prior on the model structure and then
choose the clustering that maximizes the likelihood times the prior:
p(data|model) * p(model). The standard Bayesian prior for this kind of
setting would comprise something like a Dirichlet process prior on the
distribution of categories p(c); we could also put Bayesian priors on
the means or more commonly variances of the Gaussians. With a Dirichlet
process, there is no bound on the number of categories, and draws from
a Dirichlet process prior lead to power-law (Zipf-like) distributions
of number and size of categories. You can see this working quite well
in coreference resolution as evidenced by the recent paper Haghighi and Klein 2007.
The only problem with the Bayesian “solution” is that the
concentration parameter of the Dirichlet process prior is just another
way of setting k. That is, you need to make a prior assumption about
how costly it is to introduce new categories. This is just replacing a
direct control (number of clusters) with an indirect one (cost to
introduce a new cluster). Haghighi and Klein, for instance, set the
concentration parameter of their Dirichlet process prior empirically.