Linear Regression

Regression is about learning to predict a set of output (dependent) variables as a function of input (independent) variables.

Consider the data points to be of the form $\langle x_i, y_i \rangle$. Attributes of $x$ are (possibly non-linear) functions $\phi_i$ that operate on $x$. The model that linear regression fits has the form:

$$ Y = \sum_{i=1}^p w_i\phi_i(x) + b = W^T\Phi(x) + b $$

where $\Phi(x)$ is the vector of all attribute values and $W$ is the vector of all weights.

Note that $b$ can be absorbed into the weights by defining $\widetilde{W}$ and $\widetilde{\Phi}$ with one additional element each, $b$ and $1$ respectively.
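Concretely, with $\widetilde{W} = (w_1, \ldots, w_p, b)^T$ and $\widetilde{\Phi}(x) = (\phi_1(x), \ldots, \phi_p(x), 1)^T$:

$$\widetilde{W}^T\widetilde{\Phi}(x) = \sum_{i=1}^p w_i\phi_i(x) + b \cdot 1 = W^T\Phi(x) + b$$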

Linear regression is linear in terms of weights and attributes, and (generally) non-linear in terms of $x$ owing to $\Phi$.

For example, $\phi_1$ could be the date of investment, $\phi_2$ could be the value of the investment, and so on.

There are general classes of basis functions, such as the ones listed below (a small sketch of such feature maps follows the list):

  • Radial basis functions
  • Wavelet basis functions
  • Fourier basis functions
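To make this concrete, here is a minimal sketch in Python (with NumPy) of radial and Fourier feature maps; the centre, width, and frequency parameters are illustrative assumptions, not part of the notes:

```python
import numpy as np

def radial_basis(x, centers, width=1.0):
    """Gaussian radial basis features: phi_i(x) = exp(-(x - c_i)^2 / (2 * width^2))."""
    x = np.asarray(x, dtype=float)
    return np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2.0 * width ** 2))

def fourier_basis(x, num_frequencies=3):
    """Fourier features: sin(k x) and cos(k x) for k = 1..num_frequencies."""
    x = np.asarray(x, dtype=float)
    feats = [np.sin(k * x) for k in range(1, num_frequencies + 1)]
    feats += [np.cos(k * x) for k in range(1, num_frequencies + 1)]
    return np.stack(feats, axis=1)

x = np.linspace(0.0, 1.0, 5)
print(radial_basis(x, centers=np.array([0.0, 0.5, 1.0])).shape)  # (5, 3)
print(fourier_basis(x).shape)                                    # (5, 6)
```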

Formal Notation

Dataset $\mathcal{D} = \{\langle x_1, y_1\rangle, \ldots, \langle x_m, y_m\rangle\}$

The attribute/basis functions are $\phi_i$, and $\Phi$ now denotes the $m \times p$ matrix of their values over the dataset, given below. Note that this redefines the symbol $\Phi$; we shall use this definition from here on.

$$\Phi = \begin{bmatrix} \phi_1(x_1) & \phi_2(x_1) & \ldots & \phi_p(x_1) \\ \vdots & \vdots & \ddots & \vdots \\ \phi_1(x_m) & \phi_2(x_m) & \ldots & \phi_p(x_m) \end{bmatrix}$$

With the above redefinition, the linear equation for a given $W$ becomes $Y = \Phi W$.
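As a small sketch (Python/NumPy, with a hypothetical polynomial basis chosen purely for illustration), building $\Phi$ and evaluating $\Phi W$:

```python
import numpy as np

def design_matrix(x, basis_functions):
    """Stack phi_j(x_i) into the m-by-p matrix Phi (rows = data points, columns = basis functions)."""
    return np.column_stack([phi(x) for phi in basis_functions])

# Hypothetical basis: a constant column (absorbing b), the identity, and a square term.
basis = [lambda x: np.ones_like(x), lambda x: x, lambda x: x ** 2]

x = np.array([0.0, 1.0, 2.0, 3.0])
W = np.array([0.5, -1.0, 2.0])

Phi = design_matrix(x, basis)   # shape (4, 3)
Y = Phi @ W                     # predictions, shape (4,)
print(Y)
```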

General regression is to find $\hat{f}$ within a function class $\mathcal{F}$ that minimizes the error:

$$\hat{f} = \arg\min_{f\in\mathcal{F}} E(f, \mathcal{D})$$

Parameterized regression fixes the functional form $f(\phi(x), w, b)$ and optimizes over the parameters $w$ and $b$ to minimize the error:

$$ w^*, b^* = \arg\min_{w,b} \left[ E(f(\phi(x), w, b), \mathcal{D}) \right] $$

The error function determines the type of regression. Some examples are given below; these will be discussed later in the course, but the corresponding objectives are sketched briefly after the list.

  • Least Squares Regression
  • Ridge Regression
  • Logistic Regression
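For a rough sense of how the error functions differ (details come later in the course), the least-squares and ridge objectives can be written as:

$$E_{\text{LS}}(W, b) = \sum_{j=1}^m \Big( \sum_{i=1}^p w_i\phi_i(x_j) + b - y_j \Big)^2, \qquad E_{\text{ridge}}(W, b) = E_{\text{LS}}(W, b) + \lambda \lVert W \rVert^2$$

while logistic regression instead minimizes the negative log-likelihood of a sigmoid model of $P(y \mid x)$, making it suitable for classification rather than real-valued outputs.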

Least Squares Solution

Formally, the solution is given by:

$$w^*, b^* = \arg\min_{w,b} \sum_{j=1}^m \left( \sum_{i=1}^p w_i\phi_i(x_j) + b - y_j \right)^2 $$

If the “true” relation between $X$ and $Y$ is linear in the chosen basis, then zero error is attainable. That is, a $W$ with $Y = \Phi W$ exists, i.e. $Y$ belongs to the column space of $\Phi$. We can then obtain the optimal $W$ simply by solving this system of linear equations.

If $Y$ is not in the column space of $\Phi$, the closed-form solution for the optimal weights $W^*$ is given by:

$$W^* = \left(\Phi^T\Phi\right)^{-1}\Phi^TY$$
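A minimal sketch of the closed-form solution (Python/NumPy, on hypothetical toy data); solving the normal equations $(\Phi^T\Phi)W = \Phi^TY$ directly is cheaper and numerically safer than forming the explicit inverse:

```python
import numpy as np

def least_squares_weights(Phi, Y):
    """Solve the normal equations (Phi^T Phi) W = Phi^T Y for the least-squares weights."""
    return np.linalg.solve(Phi.T @ Phi, Phi.T @ Y)

# Hypothetical toy data: y is roughly quadratic in x, plus a little noise.
rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 50)
y = 0.5 - x + 2.0 * x ** 2 + 0.05 * rng.standard_normal(x.shape)

Phi = np.column_stack([np.ones_like(x), x, x ** 2])  # constant column absorbs b
W_star = least_squares_weights(Phi, y)
print(W_star)  # roughly recovers [0.5, -1.0, 2.0]
```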

Note that $\Phi^T\Phi$ is invertible iff $\Phi$ has full column rank. That is:

  • All columns are linearly independent of each other
  • Whether this holds is data driven: it depends on the dataset as well as on the chosen basis functions

It can be shown that gradient descent on the least-squares objective converges to the same solution as well.
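As an illustration (using the same hypothetical toy data and basis as above, with a hand-picked learning rate and step count), gradient descent on the squared error reaches the closed-form solution:

```python
import numpy as np

def gradient_descent_least_squares(Phi, Y, lr=0.1, num_steps=5000):
    """Minimize ||Phi W - Y||^2 by gradient descent; the gradient is 2 Phi^T (Phi W - Y)."""
    W = np.zeros(Phi.shape[1])
    m = len(Y)
    for _ in range(num_steps):
        grad = 2.0 * Phi.T @ (Phi @ W - Y) / m   # average over data points
        W -= lr * grad
    return W

# Same hypothetical toy quadratic data as the closed-form example.
rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 50)
y = 0.5 - x + 2.0 * x ** 2 + 0.05 * rng.standard_normal(x.shape)
Phi = np.column_stack([np.ones_like(x), x, x ** 2])

W_gd = gradient_descent_least_squares(Phi, y)
W_closed = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)
print(np.allclose(W_gd, W_closed, atol=1e-4))  # True: both reach the same minimizer
```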