Vector Space, Probabilistic LSI, and LDA

Sent to you by jeffye via Google Reader:

via IR Thoughts by E. Garcia on 4/3/09

lda
source: http://www.cs.princeton.edu/~blei/papers/BleiNgJordan2003.pdf

There is a kind of buzz about Probabilistic Latent Semantics Indexing, so this post goes.

From VSM to LSI

Prior to 1988 the prevalent IR model was Salton's Vector Space Model (VSM). This model treats documents and queries as vectors in a multidimensional space. In this space a query is treated just as another document. In this term space, it is not possible to assign a position to terms simply because these are the dimensions of the space. Coordinate values assigned to document and query vectors are given by terms weights computed using a particular weighting scheme.

VSM and its many variants are based on matching query terms to terms found in documents. These models assume term independence. However, we know this assumption is not necessarily correct since terms can be dependent via (a) synonymity and (b) polysemy.

In 1988, Dumais and co-workers at Bellcore (now Telcordia) published two papers in which they applied Golub and Kahan's 1965 SVD algorithm to "documents" exhibiting (a) and (b) and called that Latent Semantic Indexing (LSI).

LSI became an improvement over the simplistic point of view of term matching, accounting for term dependencies. The "documents" were not HTML Web documents (there were no Web documents back then), but just abstracts and memos from specific knowledge domains (HCI, scientific, med). As expected these consisted of synonyms and related terms used in these domains. Thus, clusters of these were obtained.

It was immediately claimed that LSI could be used to model aspects of basic linguistic -like synonymy and polysemy- and how the human mind associates words to concepts and concepts to meaning.

Moving twenty years forward, SEOs misread such outdated research and the synonym-stuffing myth was born.

There is now a crew of SEOs claiming that they can design documents "LSI-friendly" by making these rich in synonyms and related terms. We have demonstrated via our SVD and LSI tutorial series why this is not possible. These marketers are simply inventing out of thin air LSI Myths in order to market better whatever they sell or promote (often their own image as "experts"). Same goes for those that claim "PLSI-SEO" strategies.

Research findings suggest that what makes LSI works is first and higher-order co-occurrence paths hidden in the term-term LSI matrix. These paths are responsible for how and why of the redistribution of term weights in a truncated term-document matrix. Altering terms (even a single term) of this matrix provokes a redistribution of term weights across the entire matrix, whose outcome cannot be predicted. This is why "LSI-friendly" documents is plain SEO Snakeoil. Again, the same goes for those that claim "PLSI-SEO" strategies. Keep reading.

Enters Probabilistic Latent Semantic Indexing (PLSI) model

In 1998 LSI was put into question. Given a generative model of text: why adopt LSI when one could use Bayesian or maximum likelihood methods and fit the model to data?

In 1999, Thomas Hofmann presented the Probabilistic Latent Semantic Indexing (PLSI) model, also known as the Aspect Model, as an alternative to LSI. PLSI (or PLSA) models each word in a document as a sample from a mixture model. The mixture components are multinomial random variables viewed as representations of topics.

Each word is generated from a single topic, and different words in a document can be generated from different topics. In this model each document is represented as a list of mixing proportions for these mixture components. Thus, documents are reduced to a probability distribution over a set of topics, which is the expected "reduced description" associated with the document.

But there is a problem.

Enters Latent Dirichlet Allocation Model (LDA)

By 2003 Hofman's PLSI model was put into question, this time by David Blei, Andrew Ng and Michael Jordan, who proposed that year the Latent Dirichlet Allocation Model (LDA). As noted by Blei, et al. (and quote) PLSI "is incomplete in that it provides no probabilistic model at the level of documents. In pLSI, each document is represented as a list of numbers (the mixing proportions for topics), and there is no generative probabilistic model for these numbers. "

Blei and co-workers then stated that this leads to two problems:

1. the number of parameter in the model grows linearly with the size of the corpus, which leads to serious problems with over fitting

2. it is not clear how to assign probability to a document outside of the training set.

In Salton Term Vector Model as in the LSI and PLSI models word order does not matter. Documents are simply considered a "bag of words". However, common sense dictates that this is not a valid assumption since word semantics is sensitive to word ordering. This explains why searches in Google for college junior or junior college produce far different results.

To underscore the importance of word ordering consider this: applying a similarity measure like a Jaccard Coefficient computed from a term-term matrix to the above two queries produces identical results, but again the computed similarity scores are disconnected from word semantics.

Blei and co-workers have argued that if we want to consider exchangeable representations (ordering) for documents and words, we need to consider mixture models that capture the exchangeability of both words and documents. This is why they proposed their LDA model.

In LDA documents are represented as random mixtures over latent topics, and each topic is characterized by a distribution over words.

I believe we are moving toward a Unified IR Theory where Co-Occurrence, Probability and Geometry will converge. In this unified framework there is no room for the idea of term independence or of documents as mere "bags of words". The former is IR's Original Sin and the later is its copycat.

The image above gives me a flash back on research work I conducted in the late '80s on sequential simplex optimization methods.