I (Bob) am starting to feel like a participant in an early 20th century epistolary academic exchange (e.g. that between Mr. Russell and Mr. Strawson).
In a comment to John Langford’s response to my blog entry recapitulating his comments after his talk, Kuzman Ganchev points out that he and Mark Dredze did the empirical legwork in their 2008 paper:
- Ganchev, Kuzman and Mark Dredze. 2008. Small Statistical Models by Random Feature Mixing. In Proceedings of the ACL-2008 Workshop on Mobile Language Processing.
To summarize, we’re considering the effect on classification accuracy of using hash codes (modulo some fixed n) of feature representations as dimensional identifiers rather than requiring a unique identifier for each of m
underlying features. The reason this is interesting is that it requires
much less memory and far less computational effort to compute features
this way. The reason it might be problematic for accuracy is that
there’s an increasing likelihood of collisions as the number of
parameters n decreases.
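To make that concrete, here’s a minimal sketch of hashed feature encoding in Java (the class and method names and the use of String.hashCode() are my own illustrative choices, not anyone’s production code): each feature string is hashed straight to an index modulo n, with no symbol table over the m underlying features.

```java
import java.util.Arrays;

public class HashedFeatures {

    /**
     * Encodes a bag of feature strings as a dense vector of length n
     * (the number of parameters) by hashing each feature modulo n,
     * rather than looking it up in a symbol table over all m features.
     */
    public static double[] encode(String[] features, int numParameters) {
        double[] x = new double[numParameters];
        for (String feature : features) {
            // mask the sign bit so the index is non-negative
            int index = (feature.hashCode() & 0x7FFFFFFF) % numParameters;
            x[index] += 1.0; // colliding features simply pool their counts
        }
        return x;
    }

    public static void main(String[] args) {
        String[] features = { "river", "bank", "bank", "loan" };
        System.out.println(Arrays.toString(encode(features, 8))); // n = 8
    }
}
```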
For instance, character n-gram hash codes may be computed at a cost of only a couple of assignments and arithmetic operations per character by the Karp-Rabin algorithm, which represents the n-grams only implicitly. For Latin1 (ISO-8859-1) encoded text, Karp-Rabin can even be computed online over binary input streams (java.io.InputStream) without the expense of decoding Unicode characters. With n-gram-based features, working directly over byte input streams is usually effective even with other character encodings. More complex features may be handled by simple modifications of the Karp-Rabin algorithm.
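Here’s a rough sketch of that kind of rolling hash over character 3-grams (the base constant and class name are illustrative assumptions, not LingPipe’s implementation): each successive 3-gram code is computed from the previous one with a few arithmetic operations, the 3-grams are never materialized as strings, and the code modulo n indexes directly into the parameter vector.

```java
public class TrigramHashFeatures {

    private static final int BASE = 257; // assumed hash base

    /**
     * Adds counts for all character 3-grams of the text into a vector
     * of length n (the number of parameters) using a Karp-Rabin style
     * rolling hash. For longer n-grams, the running hash should be
     * reduced modulo a large prime to avoid overflow.
     */
    public static void addTrigramCounts(String text, double[] counts) {
        final int k = 3;
        if (text.length() < k) return;
        int n = counts.length;
        long highPow = (long) BASE * BASE; // BASE^(k-1)
        long hash = 0;
        for (int i = 0; i < k; ++i) // hash of the first 3-gram
            hash = hash * BASE + text.charAt(i);
        counts[(int) (hash % n)] += 1.0;
        for (int i = k; i < text.length(); ++i) {
            hash -= text.charAt(i - k) * highPow; // drop the outgoing char
            hash = hash * BASE + text.charAt(i);  // shift in the incoming char
            counts[(int) (hash % n)] += 1.0;
        }
    }

    public static void main(String[] args) {
        double[] counts = new double[16]; // n = 16 parameters
        addTrigramCounts("the bank", counts);
        System.out.println(java.util.Arrays.toString(counts));
    }
}
```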
Ganchev and Dredze showed that for many NLP problems (e.g. spam
filtering, 20 newsgroups, Reuters topics, appliance reviews), there is
very little loss from drastic reductions in n relative to m. This is very good news indeed.
Getting back to John Langford’s last post, I would like to answer my own question
about how having multiple hash codes per feature helps maintain
discriminative power in the face of collisions. It may seem counterintuitive: having more features (2 * m) for the same number of parameters (n) seems like it would simply produce more collisions.
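Before working through the math, here is what I take the multiple-hash-codes trick to look like in code (the salted second hash is my own illustrative assumption, not John’s actual implementation): each feature is added at two hashed indices, so a collision at one index can be compensated for by the copy at the other.

```java
import java.util.Arrays;

public class TwoHashFeatures {

    /**
     * Encodes each feature at two hashed indices rather than one, so
     * that a collision at one index can be compensated for by the
     * other copy of the feature.
     */
    public static double[] encode(String[] features, int numParameters) {
        double[] x = new double[numParameters];
        for (String f : features) {
            int i0 = (f.hashCode() & 0x7FFFFFFF) % numParameters;
            int i1 = (("salt:" + f).hashCode() & 0x7FFFFFFF) % numParameters;
            x[i0] += 1.0; // first copy of the feature
            x[i1] += 1.0; // second copy of the feature
        }
        return x;
    }

    public static void main(String[] args) {
        String[] features = { "river", "bank" };
        System.out.println(Arrays.toString(encode(features, 8)));
    }
}
```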
Let’s consider a very simple case where there is a feature f which is split into two features f0 and f1 without collision. With maximum likelihood, if the original weight for f is β, then the weights for f0 and f1 will each be β/2 (or any pair of weights summing to β). With Laplace priors (L1 regularization), the behavior is the same because the penalty is unchanged: abs(β) = abs(β/2) + abs(β/2). But with Gaussian priors (L2 regularization), the penalty is no longer equivalent, because for β != 0, β² > (β/2)² + (β/2)².
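Spelled out, the penalty comparison for the split weights is just:

```latex
% Laplace prior (L1): penalty unchanged by the split
\left|\beta\right| = \left|\tfrac{\beta}{2}\right| + \left|\tfrac{\beta}{2}\right|

% Gaussian prior (L2): penalty halved by the split, for \beta \neq 0
\left(\tfrac{\beta}{2}\right)^{2} + \left(\tfrac{\beta}{2}\right)^{2}
  = \tfrac{\beta^{2}}{2} < \beta^{2}
```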
Setting aside regularization, let’s work through an example with two
topics with the following generative model (adapted from Griffiths’ and
Steyvers’ LDA paper):
| Topic | river(0) | bank(1) | loan(2) |
|---|---|---|---|
| 0 | 1/2 | 1/2 | 0 |
| 1 | 0 | 1/2 | 1/2 |
So topic 0 is about geography and topic 1 about finance. In topic 0,
there is a 50% chance of generating the word "river", a 50% chance of
generating the word "bank", and no chance of generating the word
"loan". It is easy to identify a set of regression parameters that has
perfect classification behavior: β = (1,0,-1) [the maximum
likelihood solution will not actually be identified; with priors, the
scale or variance parameter of the prior determines the scale of the
coefficients].
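As a quick sanity check on that claim, here is a tiny sketch (the example documents and the sign-of-score decision rule are my illustrative assumptions) showing β = (1, 0, -1) separating the two topics:

```java
public class TopicScore {

    // coefficients beta = (1, 0, -1) over features river(0), bank(1), loan(2)
    static final double[] BETA = { 1.0, 0.0, -1.0 };

    /** Linear score: positive predicts topic 0, negative predicts topic 1. */
    static double score(double[] counts) {
        double s = 0.0;
        for (int i = 0; i < BETA.length; ++i)
            s += BETA[i] * counts[i];
        return s;
    }

    public static void main(String[] args) {
        double[] geographyDoc = { 2.0, 1.0, 0.0 }; // "river bank river"
        double[] financeDoc   = { 0.0, 1.0, 2.0 }; // "bank loan loan"
        System.out.println(score(geographyDoc));   // 2.0  -> topic 0
        System.out.println(score(financeDoc));     // -2.0 -> topic 1
    }
}
```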
Now what happens when we blow out each feature to two features and
allow collisions? The generative model is the same, but each feature is
replicated. If there is a collision between the first code for "river"
and the first code for "loan", the resulting coefficients look like:
| Word (feature) | riverA, loanA(0) | riverB(1) | bankA(2) | bankB(3) | loanB(4) |
|---|---|---|---|---|---|
| Coefficients β | 0 | 1 | 0 | 0 | -1 |
The resulting coefficients again produce a perfect classifier. The
collision at code 0 is simply a non-discriminative feature and the
split versions pick up the slack.
If the collision is between a discriminative and non-discriminative feature code, there is still a perfect set of coefficients:
| Word (feature) | riverA(0) | riverB, bankA(1) | bankB(2) | loanA(3) | loanB(4) |
|---|---|---|---|---|---|
| Coefficients β | 1 | 0 | 0 | -1/2 | -1/2 |
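Continuing the sanity-check sketch from above (same made-up documents, now blown out into A and B copies with the riverB/bankA collision at code 1), these coefficients produce exactly the same scores:

```java
public class CollisionScore {

    // features: riverA(0), riverB+bankA(1), bankB(2), loanA(3), loanB(4)
    static final double[] BETA = { 1.0, 0.0, 0.0, -0.5, -0.5 };

    /** Linear score: positive predicts topic 0, negative predicts topic 1. */
    static double score(double[] counts) {
        double s = 0.0;
        for (int i = 0; i < BETA.length; ++i)
            s += BETA[i] * counts[i];
        return s;
    }

    public static void main(String[] args) {
        // "river bank river": river hits codes 0 and 1, bank hits codes 1 and 2
        double[] geographyDoc = { 2.0, 3.0, 1.0, 0.0, 0.0 };
        // "bank loan loan": bank hits codes 1 and 2, loan hits codes 3 and 4
        double[] financeDoc   = { 0.0, 1.0, 1.0, 2.0, 2.0 };
        System.out.println(score(geographyDoc));   // 2.0  -> topic 0
        System.out.println(score(financeDoc));     // -2.0 -> topic 1
    }
}
```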
Of course, real problems aren’t quite as neat as these examples, and as we pointed out above, splitting weights changes the regularization penalty except under maximum likelihood (no penalty) or Laplace priors.
In the real world, we typically find "new" features at runtime (e.g.
a character 8-gram that was never seen during training). There is a
very long tail for most linguistic processes. Luckily, there is also a
lot of redundancy in most classification problems.
This problem doesn’t occur in the hermetically sealed world of a fixed training and test set with feature pruning (as in the ECML KDD 2006 Spam Detection Challenge, which distributed the data as bags of words, including only words that occurred four or more times in the training data).