Analysis of Perceptron Algorithm
The following claim can be proven mathematically.
Claim. If $\exists\, w^*$ that linearly separates the given data (i.e., the data is linearly separable), then the perceptron algorithm converges after a finite number of updates to a weight vector $\hat{w}$ which classifies the entire data set correctly.
Proof. (TODO)
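While the proof is left to be filled in, one standard route is Novikoff's mistake bound. The following sketch assumes (beyond what is stated above) labels $y_i \in \{-1, +1\}$, features bounded by $\|\phi_i\| \le R$, a unit-norm separator $\|w^*\| = 1$ with margin $\gamma = \min_i y_i\,{w^*}^\text{T}\phi_i > 0$, and the update rule $w \leftarrow w + y_i\phi_i$ on each misclassified point, starting from $w = 0$; the symbols $R$ and $\gamma$ are introduced here only for this sketch. Let $w_k$ denote the weights after $k$ mistakes. Each mistake increases the alignment with $w^*$:
\[{w^*}^\text{T} w_{k+1} = {w^*}^\text{T} w_k + y_i\,{w^*}^\text{T}\phi_i \ge {w^*}^\text{T} w_k + \gamma \quad\Rightarrow\quad {w^*}^\text{T} w_k \ge k\gamma\]
while the norm grows slowly, since an update happens only when $y_i\, w_k^\text{T}\phi_i \le 0$:
\[\|w_{k+1}\|^2 = \|w_k\|^2 + 2 y_i\, w_k^\text{T}\phi_i + \|\phi_i\|^2 \le \|w_k\|^2 + R^2 \quad\Rightarrow\quad \|w_k\|^2 \le kR^2\]
Combining the two via Cauchy–Schwarz, $k\gamma \le {w^*}^\text{T} w_k \le \|w_k\| \le \sqrt{k}\,R$, so $k \le R^2/\gamma^2$. The number of mistakes is therefore finite, the algorithm eventually stops updating, and the resulting $\hat{w}$ classifies every point correctly.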
Stochastic Gradient Descent
In standard (batch) gradient descent, we compute the gradient of the error over the entire data set, $\nabla \mathcal{E}(\Phi W, Y)$, and update the weight vector accordingly.
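As a concrete illustration, the sketch below performs one batch update while taking $\mathcal{E}$ to be squared error, which is an assumption (the notes leave $\mathcal{E}$ generic); the function name, learning rate, and variable names are illustrative only.

```python
import numpy as np

def batch_gradient_step(W, Phi, Y, lr=0.1):
    """One batch gradient descent step, assuming the squared error
    E(Phi W, Y) = 0.5 * ||Phi W - Y||^2 (an assumption; the notes do
    not fix a particular error function). Phi is (n, d), Y is (n,)."""
    residual = Phi @ W - Y       # predictions minus targets, shape (n,)
    grad = Phi.T @ residual      # gradient of E w.r.t. W, shape (d,)
    return W - lr * grad         # step against the full-batch gradient
```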
In stochastic gradient descent, we iterate over the data and update the weight vector at the $i^{th}$ step using the gradient on a single example, $\nabla \mathcal{E}(W^\text{T}\phi_i, y_i)$. It can be seen that the perceptron update rule is stochastic gradient descent with the hinge loss function
\[\text{Hinge Loss}(f_w(x), y)= \max(0, -y\,f_w(x))\]
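To make the connection explicit, here is a minimal sketch of one SGD step on the loss above, assuming an update of the form $w \leftarrow w - \eta\,\nabla_w \text{loss}$ with learning rate $\eta = 1$ (names and signature are illustrative). The (sub)gradient of $\max(0, -y\,w^\text{T}\phi)$ is $-y\phi$ when the point is misclassified and $0$ otherwise, which recovers exactly the perceptron update.

```python
import numpy as np

def sgd_perceptron_step(w, phi, y, lr=1.0):
    """One SGD step on loss(w) = max(0, -y * (w . phi)).
    Subgradient: -y * phi if y * (w . phi) <= 0 (mistake), else 0.
    With lr = 1 this is exactly the perceptron update w <- w + y * phi."""
    if y * (w @ phi) <= 0:       # misclassified (or on the boundary)
        w = w + lr * y * phi     # step against the subgradient
    return w
```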
Deciding final weight vector
Using the weight vector obtained at the very last iteration is not always a good idea: the algorithm keeps updating on every misclassified point, so the final vector can be overly influenced by the last few examples it saw. Usually, one of these two methods is employed:
- Voted Perceptron: keep the intermediate weight vectors together with how many data points each classified correctly before being updated, and let them take a weighted vote when classifying a new point
- Averaged Perceptron: use the weighted average of the intermediate weight vectors (each weighted by how long it survived) as the final $\hat{w}$; see the sketch after this list
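As a rough illustration of the averaged variant (the voted variant differs only in how predictions are made), here is a minimal sketch assuming labels in $\{-1, +1\}$ and a fixed number of passes over the data; all names are illustrative.

```python
import numpy as np

def averaged_perceptron(Phi, Y, epochs=10):
    """Averaged perceptron: return the survival-weighted average of all
    intermediate weight vectors. Assumes Phi is (n, d) and Y holds
    labels in {-1, +1}. (Illustrative sketch, not the notes' notation.)"""
    n, d = Phi.shape
    w = np.zeros(d)                  # current perceptron weights
    w_sum = np.zeros(d)              # running sum, one term per example seen
    for _ in range(epochs):
        for phi, y in zip(Phi, Y):
            if y * (w @ phi) <= 0:   # mistake: standard perceptron update
                w = w + y * phi
            w_sum += w               # vectors that survive longer count more
    return w_sum / (epochs * n)      # averaged weight vector, used as the classifier
```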