Symbols as Hash Codes for Super-fast Linear Classifiers
this paper is from LingPipe Blog
I was talking to John Langford, of Machine Learning(Theory) blog fame, after his talk in the Columbia machine learning reading group yesterday.
He mentioned a way to speed up linear classifiers at runtime that’s intriguing. A binary linear classifier is based on the dot product of a weight vector β (almost always dense) with a feature vector x (almost always sparse). Examples of linear classifiers include perceptrons, logistic regression, naive Bayes/multinomial, SVMs, and it’s even the innermost loop in discriminitive sequence models like CRFs.
The bottleneck at runtime for linear classifiers is converting the objects being classified into sparse vectors. If you use a symbol table, the bottleneck is the hash lookup in the symbol table. The feature weight vector β is almost always an array, so once you have the symbol ID, you pay the array lookup, multiplication of feature weight (typically 1 in NLP problems) by value found, and add to the sum for the dot product. Only the array lookup is time consuming here.
Actually constructing the sparse vector itself would also be expensive, but this can be done implicitly, because all we need is the dot product of the vector with the parameter
So what happens if we replace the symbol generated by a symbol table with a hash code? Instant speedup. We eliminate the expensive hash lookup, which requires an array lookup almost certainly out of L2 cache, and then iterating over the collision set doing a string-match until we get a match or exhaust the bucket.
The price we pay is possible collisions. In effect, any two features that have the same hash code get conflated. If we’re doing 20 newsgroups and trying to distinguish hockey posts from baseball posts, it’s going to hurt accuracy if the hashcode of “goalie” and “pitcher” are the same, as they’re highly discriminitive in this domain.
Now we’re going to use a hash code that produces numbers in a small range, say 0 to 2**18, or 18 bits, so that an array of floats or doubles of that size fits in L2 cache on our CPU. Now we’re really flying. The symbol we’re looking up will fit in a register, so computing its hash code will be pretty fast. It’s the lookup out of cache and subsequent matching that’s the time-sink.
In practice, John reports that experiments they’ve done have shown that this isn’t a problem. He found this somewhat surprising, but I didn’t. Language is highly redundant, so a few features being conflated is unlikely to hurt performance much. It’d be interesting to see a plot of size of hash table vs. number of features vs. accuracy.
This approach extends to the more complex, structured features common in discriminitive classifiers. We never need to build an explicit feature representation if we can generate a hash code for it.
If we ever have to make a simple classifier that really flies, this is what I’ll be thinking about. I might also be thinking about perfect hashing, because I’m a neat freak.
Resources about lucene Resources Introductions The API documentation contains a short and simple code example that show...
This is a very useful list of surveys from Doug Oard , which would be definitely helpful for those who want to enter this territory. I...
We examine top Python Machine learning open source projects on Github, both in terms of contributors and commits, and identify most popula...
The problems such as multirow.sty’ not found can be fixed via the following command (Ubuntu system): sudo apt-get install texlive-latex...
Logistic Regression by Any Other Name LingPipe Blog I (Bob) have been working on logistic regression. In particular, multinomial logistic r...
Repost from http://terrytao.wordpress.com/advice-on-writing-papers/ There are three rules for writing the novel. Unfortunately, no on...
This plugin can be used to read a RSS feed and transform it into a custom piece of HTML. Setup <!DOCTYPE html> <html> <...
Just read a post from http://blog.bigml.com/2013/02/21/everything-you-wanted-to-know-about-machine-learning-but-were-too-afraid-to-ask-pa...
From https://de.dariah.eu/tatom/preprocessing.html Also refer to http://www.nltk.org/api/nltk.tokenize.html#module-nltk.tokenize ...
After installing Scipy, when I import optimize from Scipy, the following error occurs. Traceback (most recent call last): File &q...