I (Bob) am starting to feel like a participant in an early 20th century epistolary academic exchange (e.g. that between Mr. Russell and Mr. Strawson).

In a comment to John Langford’s response to my blog entry recapitulating his comments after his talk, Kuzman Ganchev points out that he and Mark Dredze did the empirical legwork in their 2008 paper:

- Ganchev, Kuzman and Mark Dredze. 2008. Small Statistical Models by Random Feature Mixing. In
*Proceedings of the ACL-2008 Workshop on Mobile Language*.

To summarize, we’re considering the effect on classification accuracy of using hash codes (modulo some fixed *n*) of feature representations as dimensional identifiers rather than requiring a unique identifier for each of *m*

underlying features. The reason this is interesting is that it requires

much less memory and far less computational effort to compute features

this way. The reason it might be problematic for accuracy is that

there’s an increasing likeliood of collisions as the number of

parameters *n* decreases.

For instance, character *n*-gram hash codes may be computed at a cost of only a couple assignments and arithmetic operations per character by the Karp-Rabin algorithm, which only implicitly represent the *n*-grams themselves. For Latin1 (ISO-8859-1) encoded text, Karp-Rabin can even be computed online with binary input streams (`java.io.InputStream`

) without the expense of decoding Unicode characters. With *n*-gram-based

features, input streams are usually effective with other character

encodings. More complex features may be handled by simple modifications

of the Karp-Rabin algorithm.

Ganchev and Dredze showed that for many NLP problems (e.g. spam

filtering, 20 newsgroups, Reuters topics, appliance reviews), there is

very little loss from drastic reductions in *n* relative to *m*. This is very good news indeed.

Getting back to John Langford’s last post, I would like to answer my own question

about how having multiple hash codes per feature helps maintain

discriminative power in the face of collisions. It may seem

counterintuive, as having more features (2 * *m*) for the same number of parameters (*n*) seems like there will simply be more collisions.

Let’s consider a very simple case where there is a feature *f* which is split into two features *f0* and *f1* without collision. With maximum likelihood, if the original weight for *f* is *β*, then the weights for *f0* and*f1* and *f2* will be *β/2* (or any linear

interpolation). With Laplace priors (L1 regularization), the behavior

is the same because the penalty is unchanged abs(*β*) = abs(*β*/2) + abs(*β*/2). But, with Gaussian priors (L2 regularization), the penalty is no longer equivalent, because with *β != 0*, *β*^{2} > (*β*/2)^{2} + (*β*/2)^{2}.

Setting aside regularization, let’s work through an example with two

topics with the following generative model (adapted from Griffiths’ and

Steyvers’ LDA paper):

Word(feature) | |||
---|---|---|---|

Topic | river(0) | bank(1) | loan(2) |

0 | 1/2 | 1/2 | 0 |

1 | 0 | 1/2 | 1/2 |

So topic 0 is about geography and topic 1 about finance. In topic 0,

there is a 50% chance of generating the word "river", a 50% chance of

generating the word "bank", and no chance of generating the word

"loan". It is easy to identify a set of regression parameters that has

perfect classification behavior: *β* = (1,0,-1) [the maximum

likelihood solution will not actually be identified; with priors, the

scale or variance parameter of the prior determines the scale of the

coefficients].

Now what happens when we blow out each feature to two features and

allow collisions? The generative model is the same, but each feature is

replicated. If there is a collision between the first code for "river"

and the first code for "loan", the resulting coefficients look like:

Words(feature) | |||||
---|---|---|---|---|---|

Topic | riverA, loanA(0) | riverB(1) | bankA(2) | bankB(3) | loanB(4) |

Coefficients β | 0 | 1 | 0 | 0 | -1 |

The resulting coefficients again produce a perfect classifier. The

collision at code 0 is simply a non-discriminative feature and the

split versions pick up the slack.

If the collision is between a discriminative and non-discriminative feature code, there is still a perfect set of coefficients:

Words(feature) | |||||
---|---|---|---|---|---|

Topic | riverA(0) | riverB, bankA(1) | bankB(2) | loanA(3) | loanB(4) |

Coefficients β | 1 | 0 | 0 | -1/2 | -1/2 |

Of course, real problems aren’t quite as neat as the examples, and

as we pointed out, regularization is non-linear except for maximum

likelihood and Laplace priors.

From LingPipe blog

In the real world, we typically find "new" features at runtime (e.g.

a character 8-gram that was never seen during training). There is a

very long tail for most linguistic processes. Luckily, there is also a

lot of redundancy in most classification problems.

This problem doesn’t occur in the hermetically sealed world of a fixed training and test set with feature pruning (as in the ECML KDD 2006 Spam Detection Challenge, which distributed data as bags of words with words occurring only if they occurred 4 or more times in the training data).

## 1 Comments:

Post a Comment