Home » » About Averaged Kernel Perceptrons

Perceptrons
are a kind of large-margin linear classifier. The
polynomial kernel trick
is used to embed the basic feature vector in a higher-degree vector space
in which the data are more separable.

An average of all of the perceptrons created during training is
used for the final prediction. The factored formulation of the
algorithm allows the perceptrons to be expressed as linearly
weighted training samples.

Although theoretical bounds are almost equivalent, in practice
averaged perceptrons slightly underperform support
vector machine
(SVM) learners over the same polynomial kernels.
The advantage of perceptrons is that they are much more efficient
in time and slightly more efficient in space in practice.

### Averaged Perceptron Model with Polynomial Kernel

The model used for runtime predictions by the averaged
perceptron is quite straightforward, consisting of a set of
weighted feature vectors (represented as parallel arrays of basis
vectors and weights) and a kernel degree:

` Vector[] basisVectors; int[] weights; int degree;`

The basis vectors are all vectors derived from single training
examples by the specified feature extractor. The weights may
be positive or negative and represent the cumulative voted
weight of the specified basis vector.

The kernel function computes a distance
`kernel(v1,v2)` between vectors `v1` and
`v2` in an enhanced feature space defined by the
particular kernel employed.

A new input to classify is first converted to a feature
vector by the feature extractor. Classification is then
based on the sign of the following score:

` score(Vector v) = Σi weights[i] * kernel(basisVectors[i],v)`

An example is accepted if the score of its feature vector is
greater than zero and rejected otherwise.

### Estimating the Perceptron Model

To estimate the perceptron model, we will assume that we have a
training corpus consisting of an array of vectors with boolean
polarities indicating whether they are positive (to accept) or
negative (to reject) examples. We also assume we have a fixed
kernel function. The training method iterates over the corpus a
specified number of times.

` Vector[] basisVectors; int[] incrementalWeights; boolean[] polarities; int degree; int index = -1; for (# iterations)     for (vector,polarity) in training corpus         yHat = scoreIntermediate(vector);         if (yHat &gt; 0 && polarity) || (yHat &lt; 0 && !polarity)             ++incrementalWeights[index];         else             ++index;             basisVectors[index] = vector;             polarities[index] = polarity;             incrementalWeights[index] = 1;`

` scoreIntermediate(vector)   = Σi &lt;= index polarities[i] * kernel(basisVectors[i],vector) `

The final weight for a vector is the cumulative weight
computed as follows:

` cumulativeWeight(j) = Σk &gt;= j incrementalWeights[k] `

The actual implementations of these methods involve
considerably more indirection and index chasing to avoid
copies and duplication in the final vectors.

### Historical Notes

The averaged kernel perceptron implemented here was introduced
in the following paper, which also provides error bounds
for learning and evaluations with polynomial kernels of various
degrees:

Freund, Yoav and Robert E. Schapire (1999)
Large margin classification using the perceptron algorithm.
Machine Learning 37(3):277-296.

The basic perceptron model was introduced in:

Block, H.D. (1962) The perceptron: a model for brain functioning.
Reviews of Modern Physics 34:123-135.

The kernel-based perceptron was introduced in:

Aizerman, M.A., E.M. Braverman, and L.I. Rozonoer. 1964.
Theoretical foundations of the potential function method in pattern
recognition learning. Automation and Remote Control.
25:821-837.

The basis of the voting scheme is a deterministically averaged
version of the randomized approach of adapting online learners
to a batch setup described in the following paper:

Helmbold, D.P. and M.K. Warmuth. (1995)
On weak learning. Journal of Computer and System Sciences
50:551-573.