Marginalizing Latent Variables in EM

Sent to you by Jeffye via Google Reader:

via LingPipe Blog by lingpipe on 1/19/10

After getting carried away in my last post on this topic, Maximum Likelihood Estimates of Expression in RNA-Seq are Winner-Take-All Biased, I want to explain what went wrong when I "turned the crank" (an idiom in mathematics for solving an equation by brute force without thinking too much about it).

Expression Mixture Model

Recall that we were looking at a mixture model, which is the bread and butter of maximum likelihood estimators like EM. We defined a joint likelihood function for a sequence of observed reads $r$ , gene or splice variant sources $g$ , and mRNA expression $\theta$ , setting

$p(r,g|\theta) = \prod_{n=1}^N p(r_n,g_n|\theta) = \prod_{n=1}^N p(r_n|g_n) \ p(g_n|\theta)$ .

Maximum Likelihood

Maximum likelihood is usually stated as finding the model parameters that maximize the likelihood function consisting of the probability of the observed data given the model parameters.

What's a Parameter?

So that's what I did, calculating

$g^*, \theta^* = \arg \max_{g,\theta} p(r|g,\theta)$ .

The problem is that there's an implicit "except the latent parameters" inside the usual understanding the MLE.

What I should have done is marginalized out the latent gene sources $g$ , taking

$\theta^* = \arg \max_{\theta} p(r|\theta)$ .

That requires marginalizing out the latent mapping of reads $r_n$ to gene or splice variant sources $g_n$ ,

$p(r|\theta) = \prod_{n=1}^N p(r_n|\theta_n) = \prod_{n=1}^N \sum_{k = 1}^K p(r_n|k) \ p(k|\theta)$ .

On a log scale, that's

$\log p(r|\theta) = \sum_{n=1}^N \log \sum_{k = 1}^K p(r_n|k) \ p(k|\theta)$ .

The Example

I'll repeat the example here for convenience:

Isoform Expression Example

Suppose I have a gene with two isoforms, A and B, spliced from three exons, 1, 2, and 3. For simplicitly, we'll suppose all the exons are the same length. Suppose isoform A is made up of exon 1 and exon 2, and isoform B of exon 1 and 3. Now suppose that you run your RNA-Seq experiment and find 70 reads mapping to exon 1, 20 reads mapping to exon 2, and 50 reads mapping to exon 3. In table form, we have the following mapping counts:

Exon 1 Exon 2 Exon 3 Total

Iso 1 N 20 0 20+N

Iso 2 70-N 0 50 120-N

Total 70 20 50 140

The 20 reads mapping to exon 2 must be part of isoform 1, and similarly the 50 reads mapping to exon 3 must belong to isoform 2. That leaves the 70 reads falling in exon 1 to spread in some proportion between isoform 1 and isoform 2.

	Exon 1	Exon 2	Exon 3	Total
Iso 1	N	20	0	20+N
Iso 2	70-N	0	50	120-N
Total	70	20	50	140

Turning the Crank In the Right Direction

By assuming each splice variant consists of two exons of the same length, the probability of a read in an exon is 1/2 for each exon in the splice variant and 0 for the exon not in the variant.

Now when we turn the crank, we see that

$\log p(r|\theta) = \sum_{n=1}^N \log \sum_{k=1}^K p(r_n|k) p(k|\theta)$

${} \ \ \ \ = 70 \log \sum_{k=1}^K p(1|k) p(k|\theta) + 20 \log \sum_{k=1}^K p(2|k) p(k|\theta) + 50 \log \sum_{k=1}^K p(3|k) p(k\theta)$

$= 20 \log \frac{\theta_1}{2} + 50 \log \frac{\theta_2}{2} + 70 \log (\frac{\theta_1}{2} + \frac{\theta_2}{2})$

$= 20 \log \theta_1 + 50 \log \theta_2 + (70 \log 1 -20\log 2 - 50 \log 2 - 70 \log 2)$ .

The reduction on the third line is because the probability of a read in the second exon, $p(2|k)$ , is only non-zero for isoform $k=1$ , and similarly for a read in the third exon, where $p(3|k)$ is nly non-zero for the second gene or splice variant, $k=2$ . The final reduction follows because $\theta_1 + \theta_2 = 1$ .

Not so Biased After All

After marginalizing out the read mappings $g$ , the MLE for expression is right where we'd expect it, at the solution of

$\frac{d}{d\theta} \log p(r|\theta) = 0$ ,

which is

$\theta^*=20/70$ .

It turns out EM isn't so biased for these discrete mixtures, at least when you turn the crank the right way.

Things you can do from here:

Subscribe to LingPipe Blog using Google Reader
Get started using Google Reader to easily keep up with all your favorite sites

2 Comments:

Robert said...: This comment has been removed by the author.; 2:19 AM
Robert said...: A fascinating technique to simplify the EM process! Marginalizing latent variables improves efficiency and ||New York Divorce Timeline||New York Divorce Waiting Period accuracy in complex models. A key breakthrough in machine learning and statistics!; 2:19 AM

Marginalizing Latent Variables in EM

Sent to you by Jeffye via Google Reader: