Neural Networks

We’ll now tackle non-linear classification. We’ve already discussed the use of kernels for this purpose; however, selecting a suitable kernel for a given problem requires domain knowledge, which limits their applicability.

Neural networks, on the other hand, act as universal function approximators which do not require as much domain knowledge.

In general, a series of mappings is used to obtain the desired result.

\[x\xrightarrow{f}y\xrightarrow{g}z\xrightarrow{h}\{c_1, \ldots c_k\}\]
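As a rough sketch of this chain (the functions and weights below are hypothetical placeholders, not anything from the notes), each mapping can be a simple transform and the final step picks one of $k$ classes:

```python
import numpy as np

# Hypothetical chain of mappings x -> f -> y -> g -> z -> class label.
def f(x):                           # first mapping: affine transform + non-linearity
    W1 = np.array([[1.0, -1.0], [0.5, 2.0]])
    return np.tanh(W1 @ x)

def g(y):                           # second mapping: another affine transform + non-linearity
    W2 = np.array([[2.0, 0.0], [-1.0, 1.0], [0.5, 0.5]])
    return np.maximum(0.0, W2 @ y)  # ReLU

def h(z):                           # final mapping: choose among k = 3 classes
    return int(np.argmax(z))        # index of the winning class

x = np.array([0.3, -0.7])
print(h(g(f(x))))                   # class label for this input
```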

 

Activation Functions

\[f(x) = g(w^Tx)\]

Here, $g$ is called the activation function, and $s = w^Tx$ denotes its input. Some examples of activation functions are:

  • Sigmoid
  • Tanh
  • Linear
  • ReLU: $\max(0,s)$
  • Softplus: $\log(1+e^s)$, is the “differentiable version” of ReLU

The problem with sigmoid and tanh is that their outputs are bounded, so they saturate for large inputs and their gradients become very small there. ReLU and Softplus are unbounded above and do not have this problem.
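As a quick sketch (NumPy only, not part of the original notes), the activations listed above can be written in a few lines:

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))   # bounded in (0, 1)

def tanh(s):
    return np.tanh(s)                 # bounded in (-1, 1)

def linear(s):
    return s                          # identity activation

def relu(s):
    return np.maximum(0.0, s)         # max(0, s)

def softplus(s):
    return np.log1p(np.exp(s))        # log(1 + e^s), smooth version of ReLU

s = np.linspace(-5, 5, 5)
for name, fn in [("sigmoid", sigmoid), ("tanh", tanh), ("linear", linear),
                 ("relu", relu), ("softplus", softplus)]:
    print(name, np.round(fn(s), 3))
```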

 

VC Dimensions

The VC dimension of a function class $\{f_w\}$ is the cardinality of the largest set of points that it can shatter. The class is said to shatter a given set of points if, for every assignment of labels to those points, there exists a $w$ such that $f_w$ classifies all of them correctly.

The VC dimension of a linear separator in $\mathcal{R}^2$ is 3.

The VC dimension of a threshold classifier in $\mathcal{R}$ is 1.
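A small brute-force check (a sketch with hypothetical point sets and a coarse parameter grid) illustrates both claims: three non-collinear points in $\mathcal{R}^2$ can be shattered by a linear separator, while two points on the real line cannot be shattered by a one-sided threshold classifier.

```python
import itertools
import numpy as np

def shatters(points, classify, params):
    """True if, for every labelling of the points, some parameter setting
    classifies all of them correctly."""
    n = len(points)
    for labels in itertools.product([-1, +1], repeat=n):
        if not any(all(classify(p, theta) == y for p, y in zip(points, labels))
                   for theta in params):
            return False
    return True

# Linear separator in R^2: sign of w1*x1 + w2*x2 + b
def linear_sep(p, theta):
    w1, w2, b = theta
    return +1 if w1 * p[0] + w2 * p[1] + b > 0 else -1

# Threshold classifier on R: +1 if x >= t else -1
def threshold(p, theta):
    (t,) = theta
    return +1 if p[0] >= theta[0] else -1

grid = list(itertools.product([-2, -1, 0, 1, 2], repeat=3))  # coarse (w1, w2, b) grid
three_pts = [(0, 0), (1, 0), (0, 1)]                         # hypothetical points in R^2
two_pts = [(0,), (1,)]                                       # two points on the real line

print(shatters(three_pts, linear_sep, grid))                           # True
print(shatters(two_pts, threshold, [(t,) for t in np.linspace(-1, 2, 31)]))  # False
```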

 

Designing Neural Networks

Neural networks have great expressive power owing to:

  1. Non-linearity of the activation functions
  2. Cascading these non-linear activations across layers (a short sketch of why this matters follows below)
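A short sketch (hypothetical weights, NumPy only) of why the non-linearity matters: stacking two purely linear layers collapses into a single linear map, whereas inserting a ReLU between them produces a function no single linear layer can represent.

```python
import numpy as np

W1 = np.array([[1.0, -1.0], [2.0, 0.5]])   # hypothetical first-layer weights
W2 = np.array([[0.5, 1.0]])                # hypothetical second-layer weights

def two_linear_layers(x):
    return W2 @ (W1 @ x)                   # no activation in between

def collapsed(x):
    return (W2 @ W1) @ x                   # the same map as a single linear layer

def with_relu(x):
    return W2 @ np.maximum(0.0, W1 @ x)    # cascaded non-linear activation

x = np.array([0.3, -0.8])
print(np.allclose(two_linear_layers(x), collapsed(x)))  # True: a linear stack collapses
print(two_linear_layers(x), with_relu(x))               # the ReLU version differs in general
```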

There are four main design choices to consider while coding up a neural network. These are:

  • Input Layer
  • Number of Hidden layers and number of nodes per hidden layer
  • Output layer
  • Loss function

Do note that the activations at the hidden layers are not directly observed; only the inputs and the output-layer targets are available, even during training.
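To tie the four choices together, here is a minimal sketch (NumPy only, with hypothetical layer sizes and random weights) of a forward pass for $k$-class classification with a cross-entropy loss; the hidden activation is computed internally but never compared against any target.

```python
import numpy as np

rng = np.random.default_rng(0)

d_in, d_hidden, k = 4, 8, 3                # input layer, one hidden layer, output layer sizes
W1 = rng.normal(size=(d_hidden, d_in))     # hypothetical weights (normally learned)
W2 = rng.normal(size=(k, d_hidden))

def forward(x):
    h1 = np.maximum(0.0, W1 @ x)           # hidden layer: ReLU activation (not supervised)
    scores = W2 @ h1                        # output layer: one score per class
    return scores

def cross_entropy_loss(scores, label):
    # Loss function: softmax cross-entropy against the true class label.
    scores = scores - scores.max()          # subtract max for numerical stability
    log_probs = scores - np.log(np.exp(scores).sum())
    return -log_probs[label]

x = rng.normal(size=d_in)                   # one input example
y = 2                                       # its true class in {0, ..., k-1}
print(cross_entropy_loss(forward(x), y))
```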