
Divergence, Regularization and Variational Approaches

I have recently taken an Image Processing/Computer Vision class as part of my studies and was introduced to various variational approaches for image processing. Although I had some previous exposure to the calculus of variations, it was not to this extent.

Most of these variational approaches take the form

$$E(u) = \int L(f, u) + \alpha R(u) \tag{1}$$

and the goal is to find a function $u$ that minimizes this energy functional. $R$ is a regularizing term whose goal is to produce a smooth result. Any background in classical machine learning and/or optimization will also tell you that a regularization term is generally helpful in cases where the problem is ill-posed or unstable.

What I noticed after seeing many of these variational models is that in the image processing community the standard regularizer is given by $\|\nabla u\|^2$. For example, a variational denoising model with the goal of smoothing the image $f$ might look like:

$$E(u) = \frac{1}{2} \int (u - f)^2 + \alpha \|\nabla u\|^2 \,\mathrm{d}x\,\mathrm{d}y \tag{2}$$

By the calculus of variations, the minimizer of this functional has to satisfy

$$u - f - \alpha \left( \frac{\partial}{\partial x} u_x + \frac{\partial}{\partial y} u_y \right) = 0 \tag{3}$$

where I use $u_x, u_y$ to denote the $x$ and $y$ components of the gradient $\nabla u$. Depending on your expertise in multivariable calculus, you might recognize that the second term is the divergence of the gradient field.
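
To make this a bit more tangible, here is a minimal numpy sketch (my own illustration, not something from the class) that minimizes energy (2) by plain gradient descent, using the left-hand side of (3) as the descent direction. The synthetic image, $\alpha$, the step size and the iteration count are just placeholder choices.

```python
import numpy as np

def laplacian(u):
    """Five-point finite-difference Laplacian with replicated boundaries."""
    up = np.pad(u, 1, mode="edge")
    return (up[:-2, 1:-1] + up[2:, 1:-1]
            + up[1:-1, :-2] + up[1:-1, 2:] - 4.0 * u)

def denoise(f, alpha=2.0, tau=0.1, iters=300):
    """Gradient descent on E(u) from (2): each step moves u against
    u - f - alpha * Laplacian(u), i.e. the left-hand side of (3).
    tau must be small enough for this explicit scheme to stay stable."""
    u = f.copy()
    for _ in range(iters):
        u -= tau * (u - f - alpha * laplacian(u))
    return u

# Usage: smooth a noisy synthetic image.
rng = np.random.default_rng(0)
f = np.zeros((64, 64))
f[16:48, 16:48] = 1.0
f += 0.2 * rng.standard_normal(f.shape)
u = denoise(f)
```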

Divergence

The divergence of a vector field $v: \mathbb{R}^N \to \mathbb{R}^N$ is given by $\operatorname{div} v = \sum_{i=1}^N \frac{\partial v_i}{\partial x_i}$. Imagining particles flowing along the vector field, the divergence roughly quantifies how the number of particles in a small region changes over time: if the divergence is large ($>0$), the number of particles at that point decreases over time (they flow away from it); similarly, if the divergence is negative ($<0$), particles accumulate at that point.
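
As a quick numerical illustration (my own, with a made-up vector field), the divergence can be approximated with finite differences:

```python
import numpy as np

# Sample the vector field v(x, y) = (x, y) on a grid. Particles flow away
# from the origin everywhere, so the divergence should be +2 at every point.
xs = np.linspace(-1.0, 1.0, 101)
ys = np.linspace(-1.0, 1.0, 101)
x, y = np.meshgrid(xs, ys, indexing="ij")
v1, v2 = x, y

# div v = d(v1)/dx + d(v2)/dy, approximated with central differences.
div_v = np.gradient(v1, xs, axis=0) + np.gradient(v2, ys, axis=1)
print(div_v.mean())  # ~2.0
```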

Remembering that the gradient points in the direction of steepest ascent, minima of $u$ will have a large divergence of $\nabla u$, since all gradients point away from such a point, and vice versa for maxima of $u$.

The divergence of the gradient even has a special name: it is called the Laplacian $\Delta u$, and as one might have noticed, it is simply the trace of the Hessian.
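
Written out, the identity behind that remark is simply:

$$\Delta u = \operatorname{div}(\nabla u) = \sum_{i=1}^N \frac{\partial}{\partial x_i} \frac{\partial u}{\partial x_i} = \sum_{i=1}^N \frac{\partial^2 u}{\partial x_i^2} = \operatorname{tr}\left(\operatorname{Hess}[u]\right)$$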

In my opinion this is quite interesting considering the optimality condition (3). What (3) tells us is that for an optimal function $u$, the difference between it and the target $f$ should equal $\alpha$ times the divergence of its gradient, i.e. $u - f = \alpha \Delta u$.

From a machine learning perspective, if we constrain $u$ to the set of linear functions, (2) is the usual ridge regression model. Let $u(\mathbf{x}) = \boldsymbol{w}^T\mathbf{x}$ and let $f$ be the target values; since $\nabla u = \boldsymbol{w}$, (2) becomes

$$E(u) = \frac{1}{2} \int (u(\mathbf{x}) - f(\mathbf{x}))^2 + \alpha \|\boldsymbol{w}\|^2 \,\mathrm{d}\mathbf{x} \tag{4}$$
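
To spell out the connection, here is a small numpy sketch (my own illustration with made-up data) of the discrete analogue of (4): replacing the integral with a sum over samples gives the familiar penalized least-squares problem, whose minimizer is the usual ridge solution.

```python
import numpy as np

def ridge_fit(X, y, alpha=1.0):
    """Closed-form minimizer of 1/2 * ||X w - y||^2 + alpha/2 * ||w||^2,
    the discrete analogue of energy (4) with the integral replaced by a sum."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

# Usage with made-up data: a noisy linear target.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
w_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ w_true + 0.1 * rng.standard_normal(200)

w_hat = ridge_fit(X, y, alpha=1.0)
print(w_hat)  # close to w_true, shrunk slightly towards zero
```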

I’m not sure why, but at least as far as I can remember, I have never seen $\ell_2$ regularization of the weights $\boldsymbol{w}$ introduced from the perspective of adding the squared norm of the gradient as a smoothness term during my studies up to now.

This might give quite a few new insights and connections.

From (3) we know that the optimal weights $\boldsymbol{w}$ must satisfy:

$$u(\mathbf{x}) - f(\mathbf{x}) = \alpha \Delta u(\mathbf{x}) = \alpha \cdot \operatorname{tr}\left(\operatorname{Hess}[u](\mathbf{x})\right) \tag{5}$$

Since $\ell_2$ regularization and weight decay in gradient descent are equivalent, the regularization constant $\alpha$ also acts as a preconditioner during optimization, where it adds $\alpha$ to the diagonal of the Hessian of the loss.
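
For reference, the equivalence used here can be read off from a single plain gradient descent step with learning rate $\eta$ on the penalized loss $L(\boldsymbol{w}) + \frac{\alpha}{2} \|\boldsymbol{w}\|^2$:

$$\boldsymbol{w}_{t+1} = \boldsymbol{w}_t - \eta \left( \nabla L(\boldsymbol{w}_t) + \alpha \boldsymbol{w}_t \right) = (1 - \eta \alpha)\, \boldsymbol{w}_t - \eta \nabla L(\boldsymbol{w}_t),$$

which is exactly a weight-decay update with decay factor $1 - \eta\alpha$.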