The Entropy of English vs. Chinese

I (Bob) have long been fascinated by the idea of comparing the
communication efficiency of different languages. Clearly there’s a
noisy-channel problem that languages have in some way optimized through
evolution. Mark Liberman recently had some interesting discussion on
Language Log in the entry “Comparing Communication Efficiency Across
Languages”, and in a reply to a follow-up by Bob Moore in “Mailbag:
comparative communication efficiency”.

Mark does a great job of pointing out what the noisy-channel issues
are and why languages might not all be expected to have the same
efficiency. He cites grammatical marking issues like English requiring
articles, plural markers, etc., on every noun.

The spoken side is even more interesting, and not just because
spoken language is more "natural" in an evolutionary sense. How
efficiently (and accurately) the sounds of a language are encoded in
its characters also plays a role in the efficiency of the writing
system. For instance, Arabic orthography doesn’t usually encode short
vowels in its spellings, so you need to use context to sort them out.
The script has marks for these vowels, but they are conventionally
employed only for important texts, like the Qur’an.

I would add to Mark’s inventory of differences between English and
Chinese the fact that English has a lot of borrowings on both the
lexical and spelling side, which increase entropy. That is, we could
probably eke out some gains by recoding “ph” as “f”, collapsing the
distinctions among reduced vowels, and so on. For instance, we wouldn’t
have to code the difference between “Stephen” and “Steven”, which is
only present in the written language (at least in my dialect).

There are lots of other differences. It may seem that Chinese saves
bits by not coding spaces between words or distinguishing capitalized
from uncapitalized letters. Surprisingly, that intuition didn’t hold up
for case. When I was working on language modeling in LingPipe, I tested
the compressibility (with character n-grams of lengths 5 to 16) of
English drawn from LDC’s Gigaword corpus, with and without case
normalization. The unnormalized version could be compressed more,
indicating that even though case adds superficial distinctions (higher
entropy under a uniform model), those distinctions added more
information than they took away. Ditto for punctuation. I didn’t try
removing spaces, but I should have.
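
The case-normalization comparison can be sketched roughly as follows.
This is not LingPipe's implementation. It is a minimal Python sketch
that assumes a fixed-order character n-gram model with additive
smoothing, and the names train_char_ngram, bits_per_char,
uniform_bits_per_char, compare_case, and alpha are all made up for
illustration. It reports cross-entropy in bits per character on
held-out text, alongside the uniform-model entropy mentioned above.

```python
import math
from collections import Counter


def train_char_ngram(text, n):
    """Count character n-grams and the (n-1)-character contexts that precede them."""
    positions = range(len(text) - n + 1)
    ngrams = Counter(text[i:i + n] for i in positions)
    contexts = Counter(text[i:i + n - 1] for i in positions)
    return ngrams, contexts


def bits_per_char(model, test, n, vocab_size, alpha=0.1):
    """Cross-entropy of `test` under the model, in bits per predicted character."""
    ngrams, contexts = model
    total_bits = 0.0
    predicted = 0
    for i in range(len(test) - n + 1):
        gram = test[i:i + n]
        # Additive smoothing is an assumption made for brevity here.
        p = (ngrams[gram] + alpha) / (contexts[gram[:-1]] + alpha * vocab_size)
        total_bits -= math.log2(p)
        predicted += 1
    if predicted == 0:
        raise ValueError("test text is shorter than the n-gram length")
    return total_bits / predicted


def uniform_bits_per_char(vocab_size):
    """Entropy of a uniform distribution over the character set."""
    return math.log2(vocab_size)


def compare_case(train, test, n=5):
    """Return (raw, case-normalized) cross-entropies in bits per character."""
    results = []
    for tr, te in ((train, test), (train.lower(), test.lower())):
        vocab_size = len(set(tr) | set(te))
        results.append(bits_per_char(train_char_ngram(tr, n), te, n, vocab_size))
    return tuple(results)
```

Calling compare_case(train_text, test_text) on two disjoint samples of
the same corpus gives the two numbers to compare. Which one comes out
lower depends on the corpus and the model, so the sketch makes no claim
about the outcome.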

Counterintuitively, I also found that MEDLINE could be compressed
more tightly than Gigaword English. So even though it looks harder to
non-specialists, it’s actually more predictable.

So why can’t we measure entropy? First of all, even the Gigaword New
York Times section is incredibly non-stationary. Evaluations on
different samples have much more variance than would be expected
if the data were stationary.
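
One rough way to look at this kind of sample-to-sample variation is to
compress equal-sized chunks of a corpus separately and look at the
spread of the rates. The sketch below is only an illustration: it swaps
in zlib as a crude stand-in for a real language model, and the function
names chunk_bits_per_char and spread are hypothetical.

```python
import statistics
import zlib


def chunk_bits_per_char(text, chunk_size):
    """Compress each full chunk separately and return bits per character for each."""
    rates = []
    for start in range(0, len(text) - chunk_size + 1, chunk_size):
        chunk = text[start:start + chunk_size]
        compressed = zlib.compress(chunk.encode("utf-8"), 9)
        rates.append(8 * len(compressed) / len(chunk))
    return rates


def spread(rates):
    """Mean and population standard deviation of the per-chunk rates."""
    return statistics.mean(rates), statistics.pstdev(rates)
```

To judge whether the spread is larger than a stationary source would
produce, one would need a baseline, for example the same measurement
after shuffling documents across chunks. The sketch leaves that out.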

Second of all, what’s English? We can only measure the compressibility of a corpus, and corpora vary by content.

Finally, why can’t we trust Brown et al.’s
widely cited paper? Because the result depends on what background
training data is used. They used a ton of data from sources "similar"
to what they were testing. The problem with this game is: how close are
you allowed to get? Given the test set, it’s pretty easy to engineer a
training set by carefully culling data. We might try to compress a
fixed corpus, but that leads to all the usual problems of overtraining.
This is the approach of the Hutter Prize
(based on compressing Wikipedia). So instead, we create baskets of
corpora and evaluate those, with the result that there’s no clear
“winning” compression method.
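
A basket evaluation can be sketched along these lines. This is a toy
illustration using three general-purpose compressors from the Python
standard library in place of real language models. The names
COMPRESSORS, basket_scores, and winners are hypothetical, and the
corpora are assumed to be non-empty strings.

```python
import bz2
import lzma
import zlib

# General-purpose compressors used as stand-ins for competing methods.
COMPRESSORS = {
    "zlib": zlib.compress,
    "bz2": bz2.compress,
    "lzma": lzma.compress,
}


def basket_scores(corpora):
    """Map each corpus name to {compressor name: bits per character}."""
    scores = {}
    for corpus_name, text in corpora.items():
        data = text.encode("utf-8")
        scores[corpus_name] = {
            name: 8 * len(compress(data)) / len(text)
            for name, compress in COMPRESSORS.items()
        }
    return scores


def winners(scores):
    """For each corpus, the compressor with the lowest bits per character."""
    return {corpus: min(rates, key=rates.get) for corpus, rates in scores.items()}
```

If winners(basket_scores(corpora)) names different methods for different
corpora, that is the "no clear winner" situation described above.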
