### About Averaged Kernel Perceptrons

Perceptrons

are a kind of large-margin linear classifier. The

polynomial kernel trick

is used to embed the basic feature vector in a higher-degree vector space

in which the data are more separable.

An average of all of the perceptrons created during training is

used for the final prediction. The factored formulation of the

algorithm allows the perceptrons to be expressed as linearly

weighted training samples.

Although theoretical bounds are almost equivalent, in practice

averaged perceptrons slightly underperform support

vector machine (SVM) learners over the same polynomial kernels.

The advantage of perceptrons is that they are much more efficient

in time and slightly more efficient in space in practice.

### Averaged Perceptron Model with Polynomial Kernel

The model used for runtime predictions by the averaged

perceptron is quite straightforward, consisting of a set of

weighted feature vectors (represented as parallel arrays of basis

vectors and weights) and a kernel degree:

Vector[] basisVectors;

int[] weights;

int degree;

The basis vectors are all vectors derived from single training

examples by the specified feature extractor. The weights may

be positive or negative and represent the cumulative voted

weight of the specified basis vector.

The kernel function computes a distance

`kernel(v1,v2)`

between vectors `v1`

and

`v2`

in an enhanced feature space defined by the

particular kernel employed.

A new input to classify is first converted to a feature

vector by the feature extractor. Classification is then

based on the sign of the following score:

score(Vector v) = Σ_{i}weights[i] * kernel(basisVectors[i],v)

An example is accepted if the score of its feature vector is

greater than zero and rejected otherwise.

### Estimating the Perceptron Model

To estimate the perceptron model, we will assume that we have a

training corpus consisting of an array of vectors with boolean

polarities indicating whether they are positive (to accept) or

negative (to reject) examples. We also assume we have a fixed

kernel function. The training method iterates over the corpus a

specified number of times.

Vector[] basisVectors;

int[] incrementalWeights;

boolean[] polarities;

int degree;

int index = -1;

for (# iterations)

for (vector,polarity) in training corpus

yHat = scoreIntermediate(vector);

if (yHat > 0 && polarity) || (yHat < 0 && !polarity)

++incrementalWeights[index];

else

++index;

basisVectors[index] = vector;

polarities[index] = polarity;

incrementalWeights[index] = 1;

scoreIntermediate(vector)

= Σ_{i <= index}polarities[i] * kernel(basisVectors[i],vector)

The final weight for a vector is the cumulative weight

computed as follows:

cumulativeWeight(j) = Σ_{k >= j}incrementalWeights[k]

The actual implementations of these methods involve

considerably more indirection and index chasing to avoid

copies and duplication in the final vectors.

### Historical Notes

The averaged kernel perceptron implemented here was introduced

in the following paper, which also provides error bounds

for learning and evaluations with polynomial kernels of various

degrees:

Freund, Yoav and Robert E. Schapire (1999)

Large margin classification using the perceptron algorithm.

Machine Learning37(3):277-296.

The basic perceptron model was introduced in:

Block, H.D. (1962) The perceptron: a model for brain functioning.

Reviews of Modern Physics34:123-135.

The kernel-based perceptron was introduced in:

Aizerman, M.A., E.M. Braverman, and L.I. Rozonoer. 1964.

Theoretical foundations of the potential function method in pattern

recognition learning.Automation and Remote Control.

25:821-837.

The basis of the voting scheme is a deterministically averaged

version of the randomized approach of adapting online learners

to a batch setup described in the following paper:

Helmbold, D.P. and M.K. Warmuth. (1995)

On weak learning.Journal of Computer and System Sciences

50:551-573.

## 0 Comments:

Post a Comment