Kernel Perceptrons

The update rule for a perceptron was discussed earlier. For a misclassified example $(x', y')$, it is given by (taking the learning rate $\eta = 1$ from the second line onwards):

\[\begin{align} w^{k+1} &= w^k + \eta y'\phi(x') \\ \implies f^{k+1}(x) &= sign\left(f^k(x) + y'\phi^\text{T}(x')\phi(x)\right) \\ \implies f^{k+1}(x) &= sign\left(f^0(x) + \sum_{x',y'} y'\phi^\text{T}(x')\phi(x) count(x')\right)\\ \end{align}\]

$count(x')$ is the number of times that $x'$ has been misclassified so far. Notice that $\phi^\text{T}(x')\phi(x)$ is a dot product, which acts as a measure of the similarity between the two points.
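
To make the update concrete, here is a minimal sketch of the primal training loop; the feature map `phi`, the learning rate, and the epoch count are illustrative choices, not fixed by these notes:

```python
import numpy as np

def perceptron_train(X, y, phi, eta=1.0, epochs=10):
    """Primal perceptron: w is updated whenever an example is misclassified."""
    w = np.zeros_like(phi(X[0]), dtype=float)
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            if y_i * np.dot(w, phi(x_i)) <= 0:   # misclassified (or on the boundary)
                w = w + eta * y_i * phi(x_i)     # w^{k+1} = w^k + eta * y' * phi(x')
    return w
```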

We can generalize this to obtain a non-linear decision function, as given below:

\[f(x) = sign\left( b + \sum_i y_i \alpha_i K(x_i, x) \right)\]

$\alpha$ is a vector initialized to zeros; the element $\alpha_i$ is incremented by 1 each time $x_i$ is misclassified. (It stores $count(x_i)$.)

$K(x_i, x)$ replaces $\phi^\text{T}(x')\phi(x)$, and it is the "relation" (similarity) between the given two datapoints.
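
Putting the pieces together, a minimal sketch of the dual (kernel) perceptron might look like this; the mistake-driven bias update `b += y[j]` is the standard convention, and the epoch count is an illustrative assumption:

```python
import numpy as np

def kernel_perceptron_train(X, y, K, epochs=10):
    """Dual perceptron: alpha[i] counts how often x_i has been misclassified."""
    m = len(X)
    alpha = np.zeros(m)   # the count vector from the text
    b = 0.0               # bias term
    for _ in range(epochs):
        for j in range(m):
            # f(x_j) = b + sum_i y_i * alpha_i * K(x_i, x_j)
            f = b + sum(y[i] * alpha[i] * K(X[i], X[j]) for i in range(m))
            if y[j] * f <= 0:     # misclassified: increment its count
                alpha[j] += 1
                b += y[j]
    return alpha, b
```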

Some examples of kernels are:

  1. Linear: $K(x,y) = x^\text{T}y$
  2. Polynomial: $K(x,y) = \left(1+x^\text{T}y\right)^2$
  3. Radial Basis: $K(x,y) = exp\left(-\vert\vert x-y \vert\vert^2_2 / 2\sigma^2\right)$
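
As a sketch, the three kernels above translate directly into code (note the minus sign in the RBF kernel, so that similarity decays with distance; $\sigma$ defaults to 1 here purely for illustration):

```python
import numpy as np

def linear_kernel(x, y):
    return np.dot(x, y)

def polynomial_kernel(x, y):
    return (1.0 + np.dot(x, y)) ** 2

def rbf_kernel(x, y, sigma=1.0):
    return np.exp(-np.linalg.norm(x - y) ** 2 / (2 * sigma ** 2))
```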

Kernels are used to find non-linear partitions between classes. They operate on an implicit feature space: evaluating $K(x,y)$ directly is usually much cheaper than explicitly constructing the higher-dimensional features and computing a linear separator there.
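
To see what "implicit space" means, consider the polynomial kernel above for $x, y \in \mathbb{R}^2$ and expand it (a standard worked example, not specific to these notes):

\[\left(1+x^\text{T}y\right)^2 = 1 + 2x_1y_1 + 2x_2y_2 + x_1^2y_1^2 + x_2^2y_2^2 + 2x_1x_2y_1y_2 = \phi^\text{T}(x)\phi(y)\]

where $\phi(x) = \left(1, \sqrt{2}x_1, \sqrt{2}x_2, x_1^2, x_2^2, \sqrt{2}x_1x_2\right)$. The kernel computes a dot product in this 6-dimensional space while only ever touching the original 2 coordinates.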


Gram Matrices

For a given dataset $\{ x_1, \ldots, x_m \}$, the Gram (kernel) matrix $\mathcal{K}$ is defined as follows:

\[\mathcal{K} = \begin{bmatrix} K(x_1, x_1) & K(x_1, x_2) & \ldots & K(x_1, x_m)\\ K(x_2, x_1) & K(x_2, x_2) & \ldots & K(x_2, x_m)\\ \vdots & \vdots & & \vdots\\ K(x_m, x_1) & K(x_m, x_2) & \ldots & K(x_m, x_m)\\ \end{bmatrix}\]

Given a Gram matrix, the feature vectors can be recovered via an eigendecomposition (which, for a symmetric matrix, coincides with the Singular Value Decomposition). That is, we find a diagonal matrix $D$ and an orthogonal matrix $U$ with $UU^T=I$ such that:

\[\mathcal{K} = UDU^T = (UD^{1/2})(UD^{1/2})^T = \Phi\Phi^T\]

The rows of $\Phi = UD^{1/2}$ act as the feature vectors $\phi(x_i)$; for $D^{1/2}$ to be real, the eigenvalues in $D$ must be non-negative.
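
A quick numerical sketch of this recovery, using NumPy's `eigh` (the eigendecomposition of a symmetric matrix); the dataset and kernel here are arbitrary illustrations:

```python
import numpy as np

def gram_matrix(X, K):
    """Build the m x m Gram matrix for dataset X under kernel K."""
    m = len(X)
    return np.array([[K(X[i], X[j]) for j in range(m)] for i in range(m)])

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))
G = gram_matrix(X, lambda a, b: np.exp(-np.linalg.norm(a - b) ** 2 / 2))

# G = U D U^T, so Phi = U D^{1/2} satisfies G = Phi Phi^T.
eigvals, U = np.linalg.eigh(G)
Phi = U @ np.diag(np.sqrt(np.clip(eigvals, 0, None)))  # clip round-off negatives
assert np.allclose(Phi @ Phi.T, G)  # rows of Phi play the role of phi(x_i)
```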

Therefore, for a kernel matrix to be valid:

  1. It needs to be symmetric
  2. It needs to be positive semi-definite (for any $b\in\mathbb{R}^m$, $b^T\mathcal{K}b\geq 0$)
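
Both conditions are easy to check numerically; a small helper might look like this (the tolerance is an illustrative choice to absorb floating-point round-off):

```python
import numpy as np

def is_valid_kernel_matrix(G, tol=1e-9):
    """A kernel matrix must be symmetric and positive semi-definite."""
    symmetric = np.allclose(G, G.T)
    psd = np.all(np.linalg.eigvalsh(G) >= -tol)  # all eigenvalues >= 0
    return symmetric and psd
```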

Mercer Kernel

A kernel which has the following property is said to be a Mercer kernel. Keep in mind that every Mercer kernel is valid. (This follows from Mercer's theorem, which shows that the resulting Gram matrix is positive semi-definite, but the proof is not necessary for this course.)

\[\int_x\int_y K(x,y)g(x)g(y) dx dy \geq 0 \text{ for all square-integrable functions } g(x)\]

Given two Mercer kernels $K_1(x,y)$ and $K_2(x,y)$, it can be proven that:

  • $\alpha K_1(x,y) + \beta K_2(x,y)$ is Mercer as well, for any $\alpha, \beta \geq 0$
  • $K_1(x,y)K_2(x,y)$ is Mercer as well
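
For the first property, the argument at the Gram matrix level is one line: for any $b \in \mathbb{R}^m$,

\[b^\text{T}\left(\alpha \mathcal{K}_1 + \beta \mathcal{K}_2\right)b = \alpha\, b^\text{T}\mathcal{K}_1 b + \beta\, b^\text{T}\mathcal{K}_2 b \geq 0\]

since both terms are non-negative when $\alpha, \beta \geq 0$. The product case is less direct: the element-wise product of two positive semi-definite matrices is itself positive semi-definite, by the Schur product theorem.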