Sent to you by Jeffye via Google Reader:
After getting carried away in my last post on this topic, Maximum Likelihood Estimates of Expression in RNA-Seq are Winner-Take-All Biased, I want to explain what went wrong when I "turned the crank" (an idiom in mathematics for solving an equation by brute force without thinking too much about it).
Expression Mixture Model
Recall that we were looking at a mixture model, which is the bread and butter of maximum likelihood estimators like EM. We defined a joint likelihood function for a sequence of observed reads , gene or splice variant sources , and mRNA expression , setting
Maximum likelihood is usually stated as finding the model parameters that maximize the likelihood function consisting of the probability of the observed data given the model parameters.
What's a Parameter?
So that's what I did, calculating
The problem is that there's an implicit "except the latent parameters" inside the usual understanding the MLE.
What I should have done is marginalized out the latent gene sources , taking
That requires marginalizing out the latent mapping of reads to gene or splice variant sources ,
On a log scale, that's
I'll repeat the example here for convenience:
Isoform Expression Example
Suppose I have a gene with two isoforms, A and B, spliced from three exons, 1, 2, and 3. For simplicitly, we'll suppose all the exons are the same length. Suppose isoform A is made up of exon 1 and exon 2, and isoform B of exon 1 and 3. Now suppose that you run your RNA-Seq experiment and find 70 reads mapping to exon 1, 20 reads mapping to exon 2, and 50 reads mapping to exon 3. In table form, we have the following mapping counts:
Exon 1 Exon 2 Exon 3 Total Iso 1 N 20 0 20+N Iso 2 70-N 0 50 120-N Total 70 20 50 140
The 20 reads mapping to exon 2 must be part of isoform 1, and similarly the 50 reads mapping to exon 3 must belong to isoform 2. That leaves the 70 reads falling in exon 1 to spread in some proportion between isoform 1 and isoform 2.
Turning the Crank In the Right Direction
By assuming each splice variant consists of two exons of the same length, the probability of a read in an exon is 1/2 for each exon in the splice variant and 0 for the exon not in the variant.
Now when we turn the crank, we see that
The reduction on the third line is because the probability of a read in the second exon, , is only non-zero for isoform , and similarly for a read in the third exon, where is nly non-zero for the second gene or splice variant, . The final reduction follows because .
Not so Biased After All
After marginalizing out the read mappings , the MLE for expression is right where we'd expect it, at the solution of
It turns out EM isn't so biased for these discrete mixtures, at least when you turn the crank the right way.