Symbols as Hash Codes for Superfast Linear Classifiers
this paper is from LingPipe Blog
I was talking to John Langford, of Machine Learning(Theory) blog fame, after his talk in the Columbia machine learning reading group yesterday.
He mentioned a way to speed up linear classifiers at runtime that’s intriguing. A binary linear classifier is based on the dot product of a weight vector β (almost always dense) with a feature vector x (almost always sparse). Examples of linear classifiers include perceptrons, logistic regression, naive Bayes/multinomial, SVMs, and it’s even the innermost loop in discriminitive sequence models like CRFs.
The bottleneck at runtime for linear classifiers is converting the objects being classified into sparse vectors. If you use a symbol table, the bottleneck is the hash lookup in the symbol table. The feature weight vector β is almost always an array, so once you have the symbol ID, you pay the array lookup, multiplication of feature weight (typically 1 in NLP problems) by value found, and add to the sum for the dot product. Only the array lookup is time consuming here.
Actually constructing the sparse vector itself would also be expensive, but this can be done implicitly, because all we need is the dot product of the vector with the parameter
So what happens if we replace the symbol generated by a symbol table with a hash code? Instant speedup. We eliminate the expensive hash lookup, which requires an array lookup almost certainly out of L2 cache, and then iterating over the collision set doing a stringmatch until we get a match or exhaust the bucket.
The price we pay is possible collisions. In effect, any two features that have the same hash code get conflated. If we’re doing 20 newsgroups and trying to distinguish hockey posts from baseball posts, it’s going to hurt accuracy if the hashcode of “goalie” and “pitcher” are the same, as they’re highly discriminitive in this domain.
Now we’re going to use a hash code that produces numbers in a small range, say 0 to 2**18, or 18 bits, so that an array of floats or doubles of that size fits in L2 cache on our CPU. Now we’re really flying. The symbol we’re looking up will fit in a register, so computing its hash code will be pretty fast. It’s the lookup out of cache and subsequent matching that’s the timesink.
In practice, John reports that experiments they’ve done have shown that this isn’t a problem. He found this somewhat surprising, but I didn’t. Language is highly redundant, so a few features being conflated is unlikely to hurt performance much. It’d be interesting to see a plot of size of hash table vs. number of features vs. accuracy.
This approach extends to the more complex, structured features common in discriminitive classifiers. We never need to build an explicit feature representation if we can generate a hash code for it.
If we ever have to make a simple classifier that really flies, this is what I’ll be thinking about. I might also be thinking about perfect hashing, because I’m a neat freak.
Symbols as Hash Codes for Superfast Linear Classifiers
Posted by jeffy
Posted on 7:12 PM
with No comments
Popular Posts

Resources about lucene Resources Introductions The API documentation contains a short and simple code example that show...

After installing Scipy, when I import optimize from Scipy, the following error occurs. Traceback (most recent call last): File &q...

Dnsmasq Contents [ hide ] 1 1. Introduction 2 2. Installation. 2.1 2.1. Install Using The 'aptget' Softw...

汉字编码问题 下面是搜集的多篇关于汉字编码问题文章的合集，相信你的问题一定包含在其中，如果没有请留言，一起把这方面的内容补充全。 一、汉字编码的种类 汉字编码中现在主要用到的有三类，包括GBK，GB2312和Big5。 1、 GB2312又称国标码 ，由国家标准总...

Downloads Download pages for classes can be found below. Videos are archived by unit, are numbered, named and have a playlist. CS...

The problems such as multirow.sty’ not found can be fixed via the following command (Ubuntu system): sudo aptget install texlivelatex...

ACL 2013: ACCEPTED PAPERS A Bayesian Model for Joint Unsupervised Induction of Sentiment, Aspect and Discourse Representations Angel...

From https://de.dariah.eu/tatom/preprocessing.html Also refer to http://www.nltk.org/api/nltk.tokenize.html#modulenltk.tokenize ...

In IR and many other research domains, we always have to use statistical Test to evaluate whether a newly proposed model can bring signif...

Type in a terminal window: gs sDEVICE=bbox dNOPAUSE dBATCH file.pdf (or file.ps) you must have ghostscript installed of course. This c...
0 Comments:
Post a Comment