
Linear Regression

If you have ever had to fit a line to some data points, you have quite possibly come across linear regression and least squares. Most of the time, (linear) regression is introduced as follows:

Assume we have some target data $y \in \mathbb{R}^N$ and some observations $X \in \mathbb{R}^{N \times p}$, and our task is to fit a line $f(X_i) = w^T X_i + w_0$ which minimizes the error.

Now at this point the mean squared error is often introduced:

$$\text{MSE}(f) = \frac{1}{N} \sum_{i=1}^N \left(y_i - f(X_i)\right)^2$$
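This is straightforward to compute; here is a small NumPy sketch (the argument names are placeholders):

```python
import numpy as np

def mse(y, y_pred):
    """Mean squared error between targets y and predictions y_pred."""
    return np.mean((y - y_pred) ** 2)
```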

The question one should always ask: Why exactly do we use this? If we ask ourselves which conditions the error function should satisfy, we will see that the (mean) squared error arises quite naturally.

The squared error $(y_i - f(X_i))^2$ meets the conditions one would naturally ask for: squaring makes every error positive, errors between 0 and 1 are given less weight, while errors greater than 1 get further amplified. At the same time, the least squares problem has the closed-form solution $\hat{w} = (X^T X)^{-1} X^T y$.
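As a minimal sketch of that closed-form solution, assuming the intercept $w_0$ is handled by appending a column of ones to $X$:

```python
import numpy as np

def fit_least_squares(X, y):
    """Closed-form least squares: w_hat = (X^T X)^{-1} X^T y.

    A column of ones is appended to X so that the intercept w_0
    becomes the last entry of the returned weight vector.
    """
    X1 = np.column_stack([X, np.ones(len(X))])
    # Solving the normal equations is numerically preferable to
    # explicitly inverting X^T X.
    return np.linalg.solve(X1.T @ X1, X1.T @ y)
```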

Now, even though all of this sounds reasonable, in my opinion there is a better way to introduce regression and least squares.

Probabilistic introduction to Regression

As before, we want to fit a line $f(X_i) = w^T X_i + w_0$ as well as possible to our target data $y$. What this means is that we assume the target values $y_i$ and the data have the following relationship:

$$y_i = w^T X_i + w_0 + \epsilon_i$$

Here $\epsilon_i$ is a random error that will always be present in real observations, which we assume to be drawn from a normal distribution $\mathcal{N}(0, \sigma^2)$.

Knowing this we can also express the conditional probability of $y$ in terms of a normal distribution:

$$P(y_i \mid X_i) = \mathcal{N}(y_i \mid w^T X_i + w_0, \sigma^2)$$
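To make the assumed data-generating process concrete, here is a small simulation sketch; the specific values of $w$, $w_0$ and $\sigma$ are arbitrary choices, not from the text:

```python
import numpy as np

rng = np.random.default_rng(42)

N, p = 200, 2
w_true = np.array([1.5, -2.0])   # arbitrary "true" weights
w0_true, sigma = 0.5, 0.3        # arbitrary intercept and noise level

X = rng.normal(size=(N, p))
eps = rng.normal(loc=0.0, scale=sigma, size=N)   # epsilon_i ~ N(0, sigma^2)
y = X @ w_true + w0_true + eps                   # y_i = w^T X_i + w_0 + epsilon_i
```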

Now we still want to find the best possible $w$ and $w_0$ for our data. A simple way to fit a statistical model is to use maximum likelihood estimation, which involves maximizing the following likelihood function:

$$L(w) = \prod_{i=1}^N \mathcal{N}(y_i \mid w^T X_i + w_0, \sigma^2)$$

Taking the logarithm of the above expression, multiplying by $-1$ and plugging in the definition of the normal distribution, we can equivalently minimize the negative log-likelihood:

$$\begin{aligned} \text{NLL}(w) &= - \sum_{i=1}^N \log \left[ \sqrt{\frac{1}{2 \pi \sigma^2}} \exp \left( -\frac{1}{2\sigma^2}\left(y_i - w^T X_i - w_0\right)^2 \right) \right]\\ &= \frac{1}{2\sigma^2}\sum_{i = 1}^N \left(y_i - w^T X_i - w_0\right)^2 + \frac{N}{2}\log\left(2 \pi\sigma^2\right) \end{aligned}$$

If you look closely, the first term contains the squared error we introduced earlier, while the second term is a constant that can be neglected when minimizing.

The minimization problem at hand, $\mathop{\rm arg\,min}\limits_{w, w_0} \frac{1}{2\sigma^2}\sum_{i = 1}^N (y_i - w^T X_i - w_0)^2$, is proportional to the residual sum of squares and the mean squared error introduced above. Hence the solution of the maximum likelihood estimation is also:

$$\hat{w} = (X^T X)^{-1} X^T y$$
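One way to convince yourself of this equivalence is to minimize the negative log-likelihood numerically and compare the result with the closed-form solution. Here is a sketch using `scipy.optimize.minimize`, assuming the simulated `X` and `y` from above and fixing $\sigma = 1$ (which does not change the minimizer):

```python
import numpy as np
from scipy.optimize import minimize

def nll(params, X, y, sigma=1.0):
    """Negative log-likelihood of the linear-Gaussian model, up to the constant term."""
    w, w0 = params[:-1], params[-1]
    residuals = y - X @ w - w0
    return np.sum(residuals ** 2) / (2 * sigma ** 2)

# Numerical maximum likelihood estimate
res = minimize(nll, x0=np.zeros(X.shape[1] + 1), args=(X, y))
w_mle = res.x

# Closed-form least squares solution for comparison
X1 = np.column_stack([X, np.ones(len(X))])
w_ls = np.linalg.solve(X1.T @ X1, X1.T @ y)

print(np.max(np.abs(w_mle - w_ls)))   # close to zero, up to optimizer tolerance
```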

To me this is quite a remarkable explanation of why the squared error is used.

One step further

We can take the above one step further if, instead of maximum likelihood estimation, we use maximum a posteriori (MAP) estimation.

$$w_{MAP}, w_{0_{MAP}} = \mathop{\rm arg\,max}\limits_{w, w_0} \prod_{i=1}^N P(y_i \mid w^T X_i + w_0) \cdot P(\mathbf{w})$$

Here $P(\mathbf{w})$ is the probability density of the prior we choose for the weights $w, w_0$ of our model.

Using this framework we can easily derive many regression models, such as lasso and ridge regression, as summarized in the following table:

Summary of regression models for different likelihoods and priors. Likelihood refers to the distribution of $P(y_i \mid X_i)$ in this case.

| Likelihood | Prior | Name |
| --- | --- | --- |
| Gaussian | Uniform | Least Squares |
| Gaussian | Gaussian | Ridge |
| Gaussian | Laplace | Lasso |
| Laplace | Uniform | Robust regression |
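For instance, with a Gaussian likelihood and a zero-mean Gaussian prior on the weights, the MAP estimate is ridge regression, which only adds a penalty term to the normal equations. A minimal sketch; the regularization strength `lam` (determined by the prior and noise variances) is a free choice here:

```python
import numpy as np

def fit_ridge(X, y, lam=1.0):
    """MAP estimate under a Gaussian likelihood and a zero-mean Gaussian prior on w,
    i.e. ridge regression: w_hat = (X^T X + lam * I)^{-1} X^T y.
    """
    X1 = np.column_stack([X, np.ones(len(X))])
    penalty = lam * np.eye(X1.shape[1])
    penalty[-1, -1] = 0.0   # conventionally, the intercept w_0 is not penalized
    return np.linalg.solve(X1.T @ X1 + penalty, X1.T @ y)
```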

An in-depth explanation of this topic can also be found in Chapter 11 of Murphy (2022).

References
  1. Murphy, K. P. (2022). Probabilistic Machine Learning: An introduction. MIT Press. https://probml.github.io/pml-book/book1.html