Linear Regression
Regression is about learning to predict a set of output (dependent) variables as a function of input (independent) variables.
Consider the inputs to be pairs of the form $\langle x_i, y_i \rangle$. Attributes of $x$ are (possibly non-linear) functions $\phi$ which operate on $x$. The form of the model that linear regression fits is:

$$y = W^T \Phi(x) + b = \sum_j w_j \, \phi_j(x) + b$$

where $\Phi$ is a vector of all attributes, and $W$ of all weights.
Do note that $b$ can be dropped by defining $\widetilde{w}, \widetilde{\Phi}$ with one additional element each, $b$ and $1$ respectively, so that $y = \widetilde{w}^T \widetilde{\Phi}(x)$.
Linear regression is linear in terms of weights and attributes, and (generally) non-linear in terms of $x$ owing to $\Phi$.
For example, $\phi_1$ could be the date of the investment, $\phi_2$ could be the value of the investment, and so on.
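As a concrete sketch in Python (the attributes $\phi_1(x) = x$, $\phi_2(x) = x^2$, the weights, and the bias below are hypothetical choices for illustration, not anything fixed by these notes):

```python
import numpy as np

# Evaluate y = W^T Phi(x) + b for a single input x,
# with made-up attributes phi_1(x) = x and phi_2(x) = x**2.
def phi(x):
    return np.array([x, x ** 2])        # attribute vector Phi(x)

W = np.array([1.5, -0.5])               # example weights
b = 2.0                                 # bias term

x = 3.0
y = W @ phi(x) + b

# Dropping b: append b to W and 1 to Phi(x); the prediction is unchanged.
W_tilde = np.append(W, b)
phi_tilde = np.append(phi(x), 1.0)
assert np.isclose(y, W_tilde @ phi_tilde)
```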
There are general classes of basis functions, such as the ones below (a small sketch of two of them follows the list):
- Radial basis functions
- Wavelet basis
- Fourier basis
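As an illustration, here is a sketch of a Gaussian radial basis function and one Fourier component in Python; the centres, widths, and frequencies are arbitrary choices for illustration:

```python
import numpy as np

def rbf(x, c, s):
    """Gaussian radial basis function centred at c with width s."""
    return np.exp(-((x - c) ** 2) / (2 * s ** 2))

def fourier(x, k):
    """One sine component of a Fourier basis with frequency k."""
    return np.sin(k * x)

x = 0.7
features = np.array([rbf(x, c=0.0, s=1.0),
                     rbf(x, c=1.0, s=1.0),
                     fourier(x, k=1),
                     fourier(x, k=2)])
# 'features' plays the role of Phi(x); the model stays linear in the weights
# even though each feature is a non-linear function of x.
```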
Formal Notation
Dataset $\mathcal{D} = \{\langle x_1, y_1 \rangle, \ldots, \langle x_m, y_m \rangle\}$
Attribute/basis functions are $\phi_1, \ldots, \phi_n$, and the matrix $\Phi$ over the whole dataset is given as shown below. Do note that we have redefined $\Phi$ now (it is no longer the attribute vector of a single $x$), and we shall be using this definition from here on.

$$\Phi = \begin{bmatrix} \phi_1(x_1) & \phi_2(x_1) & \cdots & \phi_n(x_1) \\ \phi_1(x_2) & \phi_2(x_2) & \cdots & \phi_n(x_2) \\ \vdots & \vdots & \ddots & \vdots \\ \phi_1(x_m) & \phi_2(x_m) & \cdots & \phi_n(x_m) \end{bmatrix}, \qquad \Phi_{ij} = \phi_j(x_i)$$
With the above redefinition, the linear equation for a given $W$ becomes $Y = \Phi W$.
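A minimal NumPy sketch of this construction; the dataset, basis functions, and weights below are made-up examples:

```python
import numpy as np

# Row i of the design matrix holds the attributes of x_i: Phi[i, j] = phi_j(x_i).
X = np.array([0.0, 0.5, 1.0, 1.5])            # inputs x_1 ... x_m
basis = [lambda x: 1.0,                       # phi_1: constant (absorbs the bias)
         lambda x: x,                         # phi_2: identity
         lambda x: x ** 2]                    # phi_3: square

Phi = np.array([[phi_j(x_i) for phi_j in basis] for x_i in X])   # shape (m, n)

W = np.array([2.0, -1.0, 0.5])                # some weight vector
Y_pred = Phi @ W                              # predictions, Y = Phi W
```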
General regression would be to find $\hat{f}$ such that:

$$\hat{f} = \arg\min_{f} \sum_{i=1}^{m} \mathrm{error}\big(f(x_i),\, y_i\big)$$
Parameterized regression is a bit more involved: the functional form $f(\phi(x), w, b)$ is fixed, and the weights in the above definition are optimized so as to minimize the error.
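One common way of writing this optimization, reusing the $\mathrm{error}$ notation from the general definition above:

$$w^*, b^* = \arg\min_{w,\, b} \sum_{i=1}^{m} \mathrm{error}\big(f(\phi(x_i), w, b),\, y_i\big)$$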
The error function determines the type of regression. Some examples are given below, and they will be discussed later in the course (the error functions for the first two are written out after the list).
- Least Squares Regression
- Ridge Regression
- Logistic Regression
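For reference, the error functions minimized by the first two are commonly written as below, where $\lambda \geq 0$ is a regularization hyperparameter (not defined elsewhere in these notes):

$$E_{\text{LS}}(W) = \|Y - \Phi W\|^2, \qquad E_{\text{ridge}}(W) = \|Y - \Phi W\|^2 + \lambda \|W\|^2$$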
Least Squares Solution
Formally, the solution is given by:

$$W^* = \arg\min_{W} \|Y - \Phi W\|^2 = \arg\min_{W} \sum_{i=1}^{m} \big(y_i - (\Phi W)_i\big)^2$$
If the “true” relation between $X$ and $Y$ were linear in nature, then zero error is attainable. That is, some $W$ with $Y = \Phi W$ exists, i.e. $Y$ belongs to the column space of $\Phi$, and we can simply solve the linear system to get the optimal value of $W$.
If $Y$ is not in the column space of $\Phi$, the closed-form solution for the optimal weights $W^*$ is given by:

$$W^* = (\Phi^T \Phi)^{-1} \Phi^T Y$$

(A code sketch of this computation follows the note below.)
Do note that $\Phi^T\Phi$ is invertible iff $\Phi$ has full column rank. That is:
- All columns of $\Phi$ are linearly independent of each other
- The columns are not data driven
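Putting the closed form and the rank condition together, a minimal NumPy sketch (the data here is synthetic, and `np.linalg.solve` is used rather than forming the inverse explicitly):

```python
import numpy as np

# Closed-form least squares: W* = (Phi^T Phi)^{-1} Phi^T Y, on synthetic data.
rng = np.random.default_rng(0)
Phi = rng.normal(size=(50, 3))                 # design matrix: m = 50 examples, n = 3 attributes
W_true = np.array([1.0, -2.0, 0.5])
Y = Phi @ W_true + 0.1 * rng.normal(size=50)   # noisy targets

if np.linalg.matrix_rank(Phi) == Phi.shape[1]: # full column rank => Phi^T Phi invertible
    W_star = np.linalg.solve(Phi.T @ Phi, Phi.T @ Y)
else:
    W_star = np.linalg.pinv(Phi) @ Y           # fall back to the pseudo-inverse
```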
It can be proven that gradient descent on the squared error converges to the same solution as well.
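A minimal sketch of that claim on the same kind of synthetic data, using the mean squared error as the objective (the step size and iteration count are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
Phi = rng.normal(size=(50, 3))
W_true = np.array([1.0, -2.0, 0.5])
Y = Phi @ W_true + 0.1 * rng.normal(size=50)

W_star = np.linalg.solve(Phi.T @ Phi, Phi.T @ Y)   # closed-form solution for comparison

m, n = Phi.shape
W = np.zeros(n)
lr = 0.1
for _ in range(2000):
    grad = -(2.0 / m) * Phi.T @ (Y - Phi @ W)      # gradient of the mean squared error
    W = W - lr * grad

print(np.allclose(W, W_star))                      # True: both reach the same minimiser
```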