
Linear Regression

If you have ever had to fit a line to some data points, you have quite possibly come across linear regression and least squares. Most of the time, (linear) regression is introduced as follows:

Assume we have some target data $y \in \mathbb{R}^N$ and some observations $X \in \mathbb{R}^{N \times p}$, and our task is to fit a line $f(X_i) = w^T X_i + w_0$ which minimizes the error.

Now at this point the mean squared error is often introduced:

$$\text{MSE}(f) = \frac{1}{N} \sum_{i=1}^N \left(y_i - f(X_i)\right)^2$$
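This is straightforward to compute; here is a small NumPy sketch (the argument names are placeholders):

```python
import numpy as np

def mse(y, y_pred):
    """Mean squared error between targets y and predictions y_pred."""
    return np.mean((y - y_pred) ** 2)
```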

The question one should always ask: Why exactly do we use this? If we ask ourselves which conditions the error function should satisfy, we will see that the (mean) squared error arises quite naturally.

The squared error $(y_i - f(X_i))^2$ meets the conditions one would naturally ask for: squaring makes every error positive, errors between 0 and 1 are given less weight, while errors greater than 1 get further amplified. At the same time, the least squares problem has the closed-form solution $\hat{w} = (X^T X)^{-1} X^T y$.
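As a minimal sketch of that closed-form solution, assuming the intercept $w_0$ is handled by appending a column of ones to $X$:

```python
import numpy as np

def fit_least_squares(X, y):
    """Closed-form least squares: w_hat = (X^T X)^{-1} X^T y.

    A column of ones is appended to X so that the intercept w_0
    becomes the last entry of the returned weight vector.
    """
    X1 = np.column_stack([X, np.ones(len(X))])
    # Solving the normal equations is numerically preferable to
    # explicitly inverting X^T X.
    return np.linalg.solve(X1.T @ X1, X1.T @ y)
```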

Now, even though all of this sounds reasonable, in my opinion there is a better way to introduce regression and least squares.

Probabilistic introduction to Regression

As before, we want to fit a line $f(X_i) = w^T X_i + w_0$ as well as possible to our target data $y$. What this means is that we assume the target values $y_i$ and the data have the following relationship:

$$y_i = w^T X_i + w_0 + \epsilon_i$$

Here $\epsilon_i$ is a random error that will always be present in real observations, which we assume to be drawn from a normal distribution $\mathcal{N}(0, \sigma^2)$.

Knowing this we can also express the conditional probability of $y$ in terms of a normal distribution:

$$P(y_i \mid X_i) = \mathcal{N}(y_i \mid w^T X_i + w_0, \sigma^2)$$
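To make the assumed data-generating process concrete, here is a small simulation sketch; the specific values of $w$, $w_0$ and $\sigma$ are arbitrary choices, not from the text:

```python
import numpy as np

rng = np.random.default_rng(42)

N, p = 200, 2
w_true = np.array([1.5, -2.0])   # arbitrary "true" weights
w0_true, sigma = 0.5, 0.3        # arbitrary intercept and noise level

X = rng.normal(size=(N, p))
eps = rng.normal(loc=0.0, scale=sigma, size=N)   # epsilon_i ~ N(0, sigma^2)
y = X @ w_true + w0_true + eps                   # y_i = w^T X_i + w_0 + epsilon_i
```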

Now we still want to find the best possible $w$ and $w_0$ for our data. A simple way to fit a statistical model is to use maximum likelihood estimation, which involves maximizing the following likelihood function:

$$L(w) = \prod_{i=1}^N \mathcal{N}(y_i \mid w^T X_i + w_0, \sigma^2)$$

Taking the logarithm of the above expression, multiplying by $-1$ and plugging in the definition of the normal distribution, we can equivalently minimize the negative log-likelihood:

$$\begin{aligned} \text{NLL}(w) &= - \sum_{i=1}^N \log \left[ \sqrt{\frac{1}{2 \pi \sigma^2}} \exp \left( -\frac{1}{2\sigma^2}\left(y_i - w^T X_i - w_0\right)^2 \right) \right]\\ &= \frac{1}{2\sigma^2}\sum_{i = 1}^N \left(y_i - w^T X_i - w_0\right)^2 + \frac{N}{2}\log\left(2 \pi\sigma^2\right) \end{aligned}$$

If you look closely, the first term contains the squared error we introduced earlier, while the second term is a constant that can be neglected when minimizing.

The minimization problem at hand, $\mathop{\rm arg\,min}\limits_{w, w_0} \frac{1}{2\sigma^2}\sum_{i = 1}^N (y_i - w^T X_i - w_0)^2$, is proportional to the residual sum of squares and the mean squared error introduced above. Hence the solution of the maximum likelihood estimation is also:

$$\hat{w} = (X^T X)^{-1} X^T y$$
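One way to convince yourself of this equivalence is to minimize the negative log-likelihood numerically and compare the result with the closed-form solution. Here is a sketch using `scipy.optimize.minimize`, assuming the simulated `X` and `y` from above and fixing $\sigma = 1$ (which does not change the minimizer):

```python
import numpy as np
from scipy.optimize import minimize

def nll(params, X, y, sigma=1.0):
    """Negative log-likelihood of the linear-Gaussian model, up to the constant term."""
    w, w0 = params[:-1], params[-1]
    residuals = y - X @ w - w0
    return np.sum(residuals ** 2) / (2 * sigma ** 2)

# Numerical maximum likelihood estimate
res = minimize(nll, x0=np.zeros(X.shape[1] + 1), args=(X, y))
w_mle = res.x

# Closed-form least squares solution for comparison
X1 = np.column_stack([X, np.ones(len(X))])
w_ls = np.linalg.solve(X1.T @ X1, X1.T @ y)

print(np.max(np.abs(w_mle - w_ls)))   # close to zero, up to optimizer tolerance
```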

To me this is quite a remarkable explanation of why the squared error is used.

One step further

We can take the above one step further if, instead of maximum likelihood estimation, we use maximum a posteriori (MAP) estimation.

$$w_{MAP}, w_{0_{MAP}} = \mathop{\rm arg\,max}\limits_{w, w_0} \prod_{i=1}^N P(y_i \mid w^T X_i + w_0) \cdot P(\mathbf{w})$$

Here $P(\mathbf{w})$ is the probability density of the prior we choose for the weights $w, w_0$ of our model.

Using this framework we can easily derive many regression models, such as lasso and ridge regression, as summarized in the following table:

Summary of regression models for different likelihoods and priors. Likelihood refers to the distribution of $P(y_i \mid X_i)$ in this case.

| Likelihood | Prior | Name |
| --- | --- | --- |
| Gaussian | Uniform | Least Squares |
| Gaussian | Gaussian | Ridge |
| Gaussian | Laplace | Lasso |
| Laplace | Uniform | Robust regression |
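For instance, with a Gaussian likelihood and a zero-mean Gaussian prior on the weights, the MAP estimate is ridge regression, which only adds a penalty term to the normal equations. A minimal sketch; the regularization strength `lam` (determined by the prior and noise variances) is a free choice here:

```python
import numpy as np

def fit_ridge(X, y, lam=1.0):
    """MAP estimate under a Gaussian likelihood and a zero-mean Gaussian prior on w,
    i.e. ridge regression: w_hat = (X^T X + lam * I)^{-1} X^T y.
    """
    X1 = np.column_stack([X, np.ones(len(X))])
    penalty = lam * np.eye(X1.shape[1])
    penalty[-1, -1] = 0.0   # conventionally, the intercept w_0 is not penalized
    return np.linalg.solve(X1.T @ X1 + penalty, X1.T @ y)
```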

An in-depth explanation of this topic can also be found in Chapter 11 of Murphy (2022).

References
  1. Murphy, K. P. (2022). Probabilistic Machine Learning: An introduction. MIT Press. https://probml.github.io/pml-book/book1.html