Analysis of Perceptron Algorithm
The following claim can be proven mathematically.
Claim. If $\exists\, w^*$ that linearly separates the given data (i.e., the data is linearly separable), then the perceptron algorithm converges after a finite number of updates to a weight vector $\hat{w}$ which classifies the entire data set correctly.
Proof. (TODO)
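While the proof is left to be filled in, one standard route is Novikoff's mistake bound. The following sketch assumes (beyond what is stated above) labels $y_i \in \{-1, +1\}$, features bounded by $\|\phi_i\| \le R$, a unit-norm separator $\|w^*\| = 1$ with margin $\gamma = \min_i y_i\,{w^*}^\text{T}\phi_i > 0$, and the update rule $w \leftarrow w + y_i\phi_i$ on each misclassified point, starting from $w = 0$; the symbols $R$ and $\gamma$ are introduced here only for this sketch. Let $w_k$ denote the weights after $k$ mistakes. Each mistake increases the alignment with $w^*$:
\[{w^*}^\text{T} w_{k+1} = {w^*}^\text{T} w_k + y_i\,{w^*}^\text{T}\phi_i \ge {w^*}^\text{T} w_k + \gamma \quad\Rightarrow\quad {w^*}^\text{T} w_k \ge k\gamma\]
while the norm grows slowly, since an update happens only when $y_i\, w_k^\text{T}\phi_i \le 0$:
\[\|w_{k+1}\|^2 = \|w_k\|^2 + 2 y_i\, w_k^\text{T}\phi_i + \|\phi_i\|^2 \le \|w_k\|^2 + R^2 \quad\Rightarrow\quad \|w_k\|^2 \le kR^2\]
Combining the two via Cauchy–Schwarz, $k\gamma \le {w^*}^\text{T} w_k \le \|w_k\| \le \sqrt{k}\,R$, so $k \le R^2/\gamma^2$. The number of mistakes is therefore finite, the algorithm eventually stops updating, and the resulting $\hat{w}$ classifies every point correctly.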
Stochastic Gradient Descent
In standard (batch) gradient descent, we compute the gradient of the error over the entire data set, $\nabla \mathcal{E}(\Phi W, Y)$, and update the weight vector accordingly.
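As a concrete illustration, the sketch below performs one batch update while taking $\mathcal{E}$ to be squared error, which is an assumption (the notes leave $\mathcal{E}$ generic); the function name, learning rate, and variable names are illustrative only.

```python
import numpy as np

def batch_gradient_step(W, Phi, Y, lr=0.1):
    """One batch gradient descent step, assuming the squared error
    E(Phi W, Y) = 0.5 * ||Phi W - Y||^2 (an assumption; the notes do
    not fix a particular error function). Phi is (n, d), Y is (n,)."""
    residual = Phi @ W - Y       # predictions minus targets, shape (n,)
    grad = Phi.T @ residual      # gradient of E w.r.t. W, shape (d,)
    return W - lr * grad         # step against the full-batch gradient
```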
In stochastic gradient descent, we iterate over the data and update the weight vector at the $i^{th}$ step using the gradient on a single example, $\nabla \mathcal{E}(W^\text{T}\phi_i, y_i)$. It can be seen that the perceptron update rule is stochastic gradient descent with the hinge loss function
\[\text{Hinge Loss}(f_w(x), y)= \max(0, -y\,f_w(x))\]
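To make the connection explicit, here is a minimal sketch of one SGD step on the loss above, assuming an update of the form $w \leftarrow w - \eta\,\nabla_w \text{loss}$ with learning rate $\eta = 1$ (names and signature are illustrative). The (sub)gradient of $\max(0, -y\,w^\text{T}\phi)$ is $-y\phi$ when the point is misclassified and $0$ otherwise, which recovers exactly the perceptron update.

```python
import numpy as np

def sgd_perceptron_step(w, phi, y, lr=1.0):
    """One SGD step on loss(w) = max(0, -y * (w . phi)).
    Subgradient: -y * phi if y * (w . phi) <= 0 (mistake), else 0.
    With lr = 1 this is exactly the perceptron update w <- w + y * phi."""
    if y * (w @ phi) <= 0:       # misclassified (or on the boundary)
        w = w + lr * y * phi     # step against the subgradient
    return w
```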
Deciding final weight vector
Using the weight vector obtained at the very last iteration is not always a good idea: the algorithm keeps updating on every misclassified point, so the final vector can be overly influenced by the last few examples it saw. Usually, one of these two methods is employed:
- Voted Perceptron: keep the intermediate weight vectors together with how many data points each classified correctly before being updated, and let them take a weighted vote when classifying a new point
- Averaged Perceptron: use the weighted average of the intermediate weight vectors (each weighted by how long it survived) as the final $\hat{w}$; see the sketch after this list
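As a rough illustration of the averaged variant (the voted variant differs only in how predictions are made), here is a minimal sketch assuming labels in $\{-1, +1\}$ and a fixed number of passes over the data; all names are illustrative.

```python
import numpy as np

def averaged_perceptron(Phi, Y, epochs=10):
    """Averaged perceptron: return the survival-weighted average of all
    intermediate weight vectors. Assumes Phi is (n, d) and Y holds
    labels in {-1, +1}. (Illustrative sketch, not the notes' notation.)"""
    n, d = Phi.shape
    w = np.zeros(d)                  # current perceptron weights
    w_sum = np.zeros(d)              # running sum, one term per example seen
    for _ in range(epochs):
        for phi, y in zip(Phi, Y):
            if y * (w @ phi) <= 0:   # mistake: standard perceptron update
                w = w + y * phi
            w_sum += w               # vectors that survive longer count more
    return w_sum / (epochs * n)      # averaged weight vector, used as the classifier
```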