Symbols as Hash Codes for Super-fast Linear Classifiers
this paper is from LingPipe Blog
I was talking to John Langford, of Machine Learning(Theory) blog fame, after his talk in the Columbia machine learning reading group yesterday.
He mentioned a way to speed up linear classifiers at runtime that’s intriguing. A binary linear classifier is based on the dot product of a weight vector β (almost always dense) with a feature vector x (almost always sparse). Examples of linear classifiers include perceptrons, logistic regression, naive Bayes/multinomial, SVMs, and it’s even the innermost loop in discriminitive sequence models like CRFs.
The bottleneck at runtime for linear classifiers is converting the objects being classified into sparse vectors. If you use a symbol table, the bottleneck is the hash lookup in the symbol table. The feature weight vector β is almost always an array, so once you have the symbol ID, you pay the array lookup, multiplication of feature weight (typically 1 in NLP problems) by value found, and add to the sum for the dot product. Only the array lookup is time consuming here.
Actually constructing the sparse vector itself would also be expensive, but this can be done implicitly, because all we need is the dot product of the vector with the parameter
So what happens if we replace the symbol generated by a symbol table with a hash code? Instant speedup. We eliminate the expensive hash lookup, which requires an array lookup almost certainly out of L2 cache, and then iterating over the collision set doing a string-match until we get a match or exhaust the bucket.
The price we pay is possible collisions. In effect, any two features that have the same hash code get conflated. If we’re doing 20 newsgroups and trying to distinguish hockey posts from baseball posts, it’s going to hurt accuracy if the hashcode of “goalie” and “pitcher” are the same, as they’re highly discriminitive in this domain.
Now we’re going to use a hash code that produces numbers in a small range, say 0 to 2**18, or 18 bits, so that an array of floats or doubles of that size fits in L2 cache on our CPU. Now we’re really flying. The symbol we’re looking up will fit in a register, so computing its hash code will be pretty fast. It’s the lookup out of cache and subsequent matching that’s the time-sink.
In practice, John reports that experiments they’ve done have shown that this isn’t a problem. He found this somewhat surprising, but I didn’t. Language is highly redundant, so a few features being conflated is unlikely to hurt performance much. It’d be interesting to see a plot of size of hash table vs. number of features vs. accuracy.
This approach extends to the more complex, structured features common in discriminitive classifiers. We never need to build an explicit feature representation if we can generate a hash code for it.
If we ever have to make a simple classifier that really flies, this is what I’ll be thinking about. I might also be thinking about perfect hashing, because I’m a neat freak.
Symbols as Hash Codes for Super-fast Linear Classifiers
Posted by jeffy
Posted on 7:12 PM
with No comments
Popular Posts
-
Resources about lucene Resources Introductions The API documentation contains a short and simple code example that show...
-
The problems such as multirow.sty’ not found can be fixed via the following command (Ubuntu system): sudo apt-get install texlive-latex...
-
1 baidu.com Music search engine and free MP3 & video streaming for all kind o… More 2 qq.com 中国最大的门户网站,提供即时通讯、新闻资讯、网络游戏以...
-
After installing Scipy, when I import optimize from Scipy, the following error occurs. Traceback (most recent call last): File &q...
-
Journals TOIS – ACM Transactions on Information Systems Publication: 467 Citation: 13992 IPM – Information Processing and Management...
-
Dnsmasq Contents [ hide ] 1 1. Introduction 2 2. Installation. 2.1 2.1. Install Using The 'apt-get' Softw...
-
牛津学生英语搭配词典(OXFORD Collocations Dictionary for Students of English ) 这是新东方李笑来老师极力推荐的字典,今天试用了一下果然不错–fall in love with it in first sight。都是英文...
-
lftp is an amazing command line ftp tool, which lets you operate remote files just like in a local filesystem in a terminal (bash). If y...
-
Downloads Download pages for classes can be found below. Videos are archived by unit, are numbered, named and have a playlist. CS...
-
Python Regular Expressions 1. Extracting a single url OR all the urls in a text snippet If you're only looking for one: import...

0 Comments:
Post a Comment