I recently took an Image Processing/Computer Vision class as part of my studies and was introduced to various variational approaches for image processing. Although I had some previous exposure to the calculus of variations, it was nothing of this extent.
Most of these variational approaches take the form

$$E(u) = \underbrace{D(u, f)}_{\text{data term}} + \lambda\, \underbrace{R(u)}_{\text{regularizer}}, \tag{1}$$

and the goal is to find a function $u$ that minimizes this energy functional. $R(u)$ is a regularizing term whose goal is to produce a smooth result. Any background in classical machine learning and/or optimization will also tell you that a regularization term is generally helpful in cases where the problem is ill-posed or unstable.
What I noticed after seeing many of these variational models is that in the image processing community the standard regularizer is the squared norm of the gradient, $R(u) = \int_\Omega \|\nabla u\|^2 \, dx$. For example, a variational denoising model with the goal of smoothing a noisy image $f$ might look like:

$$E(u) = \frac{1}{2}\int_\Omega (u - f)^2 \, dx + \frac{\lambda}{2}\int_\Omega \|\nabla u\|^2 \, dx. \tag{2}$$
By the calculus of variations, the minimizer of this functional has to satisfy

$$(u - f) - \lambda \left( \frac{\partial u_x}{\partial x} + \frac{\partial u_y}{\partial y} \right) = 0, \tag{3}$$

where I use $u_x, u_y$ to denote the entries of the gradient $\nabla u = (u_x, u_y)^T$. Depending on your expertise in multivariable calculus, you might recognize that the second term is the divergence of the gradient field.
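To make this concrete, here is a minimal NumPy sketch (the function name `denoise` and all parameter values are my own choices, not from any specific reference) that minimizes (2) by explicit gradient descent; each update simply follows the residual of the optimality condition (3):

```python
import numpy as np

def denoise(f, lam=1.0, tau=0.1, n_iter=200):
    """Minimize 0.5*||u - f||^2 + 0.5*lam*||grad u||^2 by gradient descent.

    The descent direction is the left-hand side of (3):
    u <- u - tau * ((u - f) - lam * laplacian(u)).
    """
    u = f.copy()
    for _ in range(n_iter):
        # 5-point-stencil Laplacian with replicated (Neumann) boundaries
        up = np.pad(u, 1, mode="edge")
        lap = (up[:-2, 1:-1] + up[2:, 1:-1]
               + up[1:-1, :-2] + up[1:-1, 2:] - 4.0 * u)
        u = u - tau * ((u - f) - lam * lap)
    return u

# Toy example: a noisy step edge
rng = np.random.default_rng(0)
f = np.zeros((64, 64))
f[:, 32:] = 1.0
f += 0.2 * rng.standard_normal(f.shape)
u = denoise(f)  # visibly smoother than f
```

One well-known property of this quadratic regularizer is that it penalizes all gradients equally, so true edges get smoothed along with the noise.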
Divergence
The divergence of a vector field $v = (v_1, v_2)$ is given by

$$\operatorname{div} v = \frac{\partial v_1}{\partial x} + \frac{\partial v_2}{\partial y}.$$

Imagining particles flowing along the vector field, the divergence roughly quantifies how the number of particles in a small region changes over time. I.e. if the divergence is positive, the number of particles will become smaller over time (they flow away from that point); similarly, if the divergence is negative ($\operatorname{div} v < 0$), particles will accumulate at that point.
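A quick numeric illustration (grid and field chosen arbitrarily): the field $v(x, y) = (x, y)$ pushes particles radially outward, and its divergence is $2$ everywhere.

```python
import numpy as np

# v(x, y) = (x, y): a pure "source" field, div v = 1 + 1 = 2
xs = np.linspace(-1.0, 1.0, 51)
X, Y = np.meshgrid(xs, xs, indexing="ij")
v1, v2 = X, Y

dv1_dx = np.gradient(v1, xs, axis=0)  # d v1 / dx
dv2_dy = np.gradient(v2, xs, axis=1)  # d v2 / dy
div = dv1_dx + dv2_dy
print(div.mean())  # ~2.0 everywhere: particles disperse
```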
Remembering that the gradient points in the direction of steepest ascent, minima of $u$ will have a large (positive) divergence, since all gradients point away from such a point, and vice versa for maxima of $u$.
The divergence of the gradient even has a special name: it is called the Laplacian,

$$\Delta u = \operatorname{div}(\nabla u) = \frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2},$$

and as one might have noticed, it is simply the trace of the Hessian.
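This identity is easy to check with finite differences; the test function below is an arbitrary choice. For $u(x, y) = x^2 + 3y^2$, both the divergence of the gradient and the trace of the Hessian are $2 + 6 = 8$:

```python
import numpy as np

xs = np.linspace(-1.0, 1.0, 101)
X, Y = np.meshgrid(xs, xs, indexing="ij")
u = X**2 + 3.0 * Y**2

ux, uy = np.gradient(u, xs, xs)    # entries of the gradient
uxx = np.gradient(ux, xs, axis=0)  # diagonal entries of the Hessian
uyy = np.gradient(uy, xs, axis=1)

laplacian = uxx + uyy              # div(grad u) = trace of the Hessian
print(laplacian[1:-1, 1:-1].mean())  # ~8.0 (interior only; edges are less accurate)
```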
In my opinion this is quite interesting considering the optimality condition (3). What (3) tells us is that for an optimal function $u$, the difference between it and the target $f$ must equal $\lambda$ times the divergence of $\nabla u$, i.e. $u - f = \lambda\, \Delta u$.
From a machine learning perspective, if we constrain $u$ to the set of linear functions, (2) is the usual ridge model. Let $u(x) = w^T x$ and let $y_i$ be the target values for samples $x_i$; then (2) becomes

$$E(w) = \frac{1}{2}\sum_i \left(w^T x_i - y_i\right)^2 + \frac{\lambda}{2}\|w\|^2, \tag{4}$$

since the gradient of a linear function is just $\nabla_x u = w$, so the smoothness term reduces to the squared norm of the weights.
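As a sanity check of this reduction (the helper name `ridge_loss` is mine), note that for $u(x) = w^T x$ the gradient with respect to the input is $w$ itself, independent of $x$, so the smoothness penalty collapses to $\|w\|^2$:

```python
import numpy as np

def ridge_loss(w, X, y, lam):
    """0.5*||Xw - y||^2 + 0.5*lam*||w||^2, i.e. objective (4).

    The second term is the smoothness penalty of (2): the gradient of
    the linear model u(x) = w @ x is w, so ||grad u||^2 == ||w||^2.
    """
    residual = X @ w - y
    return 0.5 * residual @ residual + 0.5 * lam * w @ w
```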
I’m not sure why, but at least as far as I can remember, I have never seen $\ell_2$ regularization of the weights introduced from the perspective of adding the squared norm of the gradient as a smoothness term during my studies up to now.
This might give quite a few new insights and connections.
From (3) we know that the optimal weights must satisfy:

$$X^T (Xw - y) + \lambda w = 0 \quad \Longleftrightarrow \quad w = \left(X^T X + \lambda I\right)^{-1} X^T y, \tag{5}$$

where $X$ stacks the samples $x_i$ as rows and $y$ collects the targets $y_i$.
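Here is a small check (random data, arbitrary $\lambda$) that the closed-form solution indeed satisfies this condition:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.standard_normal(100)
lam = 0.5

# Closed-form ridge solution: w = (X^T X + lam*I)^{-1} X^T y
w = np.linalg.solve(X.T @ X + lam * np.eye(5), X.T @ y)

# Optimality condition (5): X^T (Xw - y) + lam*w = 0
print(np.allclose(X.T @ (X @ w - y) + lam * w, 0.0))  # True
```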
Since $\ell_2$ regularization and weight decay are equivalent for plain gradient descent, the regularization constant also acts as a kind of preconditioner during optimization: $\lambda$ is added to the diagonal of the Hessian $X^T X + \lambda I$, shifting all of its eigenvalues up by $\lambda$ and thereby improving the conditioning of the problem.
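A small numeric illustration of this effect (matrix and $\lambda$ values chosen arbitrarily): since $X^T X$ is positive semi-definite, the smallest eigenvalue of $X^T X + \lambda I$ is at least $\lambda$, so the condition number of the Hessian drops sharply once $\lambda > 0$.

```python
import numpy as np

rng = np.random.default_rng(0)
# Features with wildly different scales -> ill-conditioned X^T X
X = rng.standard_normal((100, 5)) @ np.diag([10.0, 5.0, 1.0, 0.1, 0.01])
H = X.T @ X  # Hessian of the unregularized least-squares loss

for lam in (0.0, 0.1, 1.0):
    cond = np.linalg.cond(H + lam * np.eye(5))
    print(f"lam={lam}: condition number = {cond:.1e}")
```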