Symbols as Hash Codes for Super-fast Linear Classifiers
this paper is from LingPipe Blog
I was talking to John Langford, of Machine Learning(Theory) blog fame, after his talk in the Columbia machine learning reading group yesterday.
He mentioned a way to speed up linear classifiers at runtime that’s intriguing. A binary linear classifier is based on the dot product of a weight vector β (almost always dense) with a feature vector x (almost always sparse). Examples of linear classifiers include perceptrons, logistic regression, naive Bayes/multinomial, SVMs, and it’s even the innermost loop in discriminitive sequence models like CRFs.
The bottleneck at runtime for linear classifiers is converting the objects being classified into sparse vectors. If you use a symbol table, the bottleneck is the hash lookup in the symbol table. The feature weight vector β is almost always an array, so once you have the symbol ID, you pay the array lookup, multiplication of feature weight (typically 1 in NLP problems) by value found, and add to the sum for the dot product. Only the array lookup is time consuming here.
Actually constructing the sparse vector itself would also be expensive, but this can be done implicitly, because all we need is the dot product of the vector with the parameter
So what happens if we replace the symbol generated by a symbol table with a hash code? Instant speedup. We eliminate the expensive hash lookup, which requires an array lookup almost certainly out of L2 cache, and then iterating over the collision set doing a string-match until we get a match or exhaust the bucket.
The price we pay is possible collisions. In effect, any two features that have the same hash code get conflated. If we’re doing 20 newsgroups and trying to distinguish hockey posts from baseball posts, it’s going to hurt accuracy if the hashcode of “goalie” and “pitcher” are the same, as they’re highly discriminitive in this domain.
Now we’re going to use a hash code that produces numbers in a small range, say 0 to 2**18, or 18 bits, so that an array of floats or doubles of that size fits in L2 cache on our CPU. Now we’re really flying. The symbol we’re looking up will fit in a register, so computing its hash code will be pretty fast. It’s the lookup out of cache and subsequent matching that’s the time-sink.
In practice, John reports that experiments they’ve done have shown that this isn’t a problem. He found this somewhat surprising, but I didn’t. Language is highly redundant, so a few features being conflated is unlikely to hurt performance much. It’d be interesting to see a plot of size of hash table vs. number of features vs. accuracy.
This approach extends to the more complex, structured features common in discriminitive classifiers. We never need to build an explicit feature representation if we can generate a hash code for it.
If we ever have to make a simple classifier that really flies, this is what I’ll be thinking about. I might also be thinking about perfect hashing, because I’m a neat freak.
Resources about lucene Resources Introductions The API documentation contains a short and simple code example that show...
The problems such as multirow.sty’ not found can be fixed via the following command (Ubuntu system): sudo apt-get install texlive-latex...
Just read a post from http://blog.bigml.com/2013/02/21/everything-you-wanted-to-know-about-machine-learning-but-were-too-afraid-to-ask-pa...
From https://de.dariah.eu/tatom/preprocessing.html Also refer to http://www.nltk.org/api/nltk.tokenize.html#module-nltk.tokenize ...
We examine top Python Machine learning open source projects on Github, both in terms of contributors and commits, and identify most popula...
汉字编码问题 下面是搜集的多篇关于汉字编码问题文章的合集，相信你的问题一定包含在其中，如果没有请留言，一起把这方面的内容补充全。 一、汉字编码的种类 汉字编码中现在主要用到的有三类，包括GBK，GB2312和Big5。 1、 GB2312又称国标码 ，由国家标准总...
牛津学生英语搭配词典（OXFORD Collocations Dictionary for Students of English ） 这是新东方李笑来老师极力推荐的字典，今天试用了一下果然不错–fall in love with it in first sight。都是英文...
Journals TOIS – ACM Transactions on Information Systems Publication: 467 Citation: 13992 IPM – Information Processing and Management...
Type in a terminal window: gs -sDEVICE=bbox -dNOPAUSE -dBATCH file.pdf (or file.ps) you must have ghostscript installed of course. This c...
Downloads Download pages for classes can be found below. Videos are archived by unit, are numbered, named and have a playlist. CS...