I recently took an Image Processing/Computer Vision class as part of my studies and was introduced to various variational approaches for image processing. Although I had some previous exposure to the calculus of variations, it was nothing of this extent.
Most of these variational approaches take the form

$$E(u) = \underbrace{D(u, f)}_{\text{data term}} + \lambda\, \underbrace{R(u)}_{\text{regularizer}}, \tag{1}$$

and the goal is to find a function $u$ that minimizes this energy functional. $R(u)$ is a regularizing term whose goal is to produce a smooth result. Any background in classical machine learning and/or optimization will also tell you that a regularization term is generally helpful in cases where the problem is ill-posed or unstable.
What I noticed after seeing many of these variational models is that in the image processing community the standard regularizer is the squared norm of the gradient, $R(u) = \int_\Omega \|\nabla u\|^2 \, dx$. For example, a variational denoising model with the goal of smoothing a noisy image $f$ might look like:

$$E(u) = \frac{1}{2}\int_\Omega (u - f)^2 \, dx + \frac{\lambda}{2}\int_\Omega \|\nabla u\|^2 \, dx. \tag{2}$$
By the calculus of variations, the minimizer of this functional has to satisfy

$$(u - f) - \lambda \left( \frac{\partial u_x}{\partial x} + \frac{\partial u_y}{\partial y} \right) = 0, \tag{3}$$

where I use $u_x, u_y$ to denote the entries of the gradient $\nabla u = (u_x, u_y)^T$. Depending on your expertise in multivariable calculus, you might recognize that the second term is the divergence of the gradient field.
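To make this concrete, here is a minimal NumPy sketch (the function name `denoise` and all parameter values are my own choices, not from any specific reference) that minimizes (2) by explicit gradient descent; each update simply follows the residual of the optimality condition (3):

```python
import numpy as np

def denoise(f, lam=1.0, tau=0.1, n_iter=200):
    """Minimize 0.5*||u - f||^2 + 0.5*lam*||grad u||^2 by gradient descent.

    The descent direction is the left-hand side of (3):
    u <- u - tau * ((u - f) - lam * laplacian(u)).
    """
    u = f.copy()
    for _ in range(n_iter):
        # 5-point-stencil Laplacian with replicated (Neumann) boundaries
        up = np.pad(u, 1, mode="edge")
        lap = (up[:-2, 1:-1] + up[2:, 1:-1]
               + up[1:-1, :-2] + up[1:-1, 2:] - 4.0 * u)
        u = u - tau * ((u - f) - lam * lap)
    return u

# Toy example: a noisy step edge
rng = np.random.default_rng(0)
f = np.zeros((64, 64))
f[:, 32:] = 1.0
f += 0.2 * rng.standard_normal(f.shape)
u = denoise(f)  # visibly smoother than f
```

One well-known property of this quadratic regularizer is that it penalizes all gradients equally, so true edges get smoothed along with the noise.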
Divergence
The divergence of a vector field $v = (v_1, v_2)$ is given by

$$\operatorname{div} v = \frac{\partial v_1}{\partial x} + \frac{\partial v_2}{\partial y}.$$

Imagining particles flowing along the vector field, the divergence roughly quantifies how the number of particles in a small region changes over time. I.e. if the divergence is positive, the number of particles will become smaller over time (they flow away from that point); similarly, if the divergence is negative ($\operatorname{div} v < 0$), particles will accumulate at that point.
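A quick numeric illustration (grid and field chosen arbitrarily): the field $v(x, y) = (x, y)$ pushes particles radially outward, and its divergence is $2$ everywhere.

```python
import numpy as np

# v(x, y) = (x, y): a pure "source" field, div v = 1 + 1 = 2
xs = np.linspace(-1.0, 1.0, 51)
X, Y = np.meshgrid(xs, xs, indexing="ij")
v1, v2 = X, Y

dv1_dx = np.gradient(v1, xs, axis=0)  # d v1 / dx
dv2_dy = np.gradient(v2, xs, axis=1)  # d v2 / dy
div = dv1_dx + dv2_dy
print(div.mean())  # ~2.0 everywhere: particles disperse
```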
Remembering that the gradient points in the direction of steepest ascent, minima of $u$ will have a large (positive) divergence, since all gradients point away from such a point, and vice versa for maxima of $u$.
The divergence of the gradient even has a special name: it is called the Laplacian,

$$\Delta u = \operatorname{div}(\nabla u) = \frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2},$$

and as one might have noticed, it is simply the trace of the Hessian.
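This identity is easy to check with finite differences; the test function below is an arbitrary choice. For $u(x, y) = x^2 + 3y^2$, both the divergence of the gradient and the trace of the Hessian are $2 + 6 = 8$:

```python
import numpy as np

xs = np.linspace(-1.0, 1.0, 101)
X, Y = np.meshgrid(xs, xs, indexing="ij")
u = X**2 + 3.0 * Y**2

ux, uy = np.gradient(u, xs, xs)    # entries of the gradient
uxx = np.gradient(ux, xs, axis=0)  # diagonal entries of the Hessian
uyy = np.gradient(uy, xs, axis=1)

laplacian = uxx + uyy              # div(grad u) = trace of the Hessian
print(laplacian[1:-1, 1:-1].mean())  # ~8.0 (interior only; edges are less accurate)
```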
In my opinion this is quite interesting considering the optimality condition (3). What (3) tells us is that for an optimal function $u$, the difference between it and the target $f$ must equal $\lambda$ times the divergence of $\nabla u$, i.e. $u - f = \lambda\, \Delta u$.
From a machine learning perspective, if we constrain $u$ to the set of linear functions, (2) is the usual ridge model. Let $u(x) = w^T x$ and let $y_i$ be the target values for samples $x_i$; then (2) becomes

$$E(w) = \frac{1}{2}\sum_i \left(w^T x_i - y_i\right)^2 + \frac{\lambda}{2}\|w\|^2, \tag{4}$$

since the gradient of a linear function is just $\nabla_x u = w$, so the smoothness term reduces to the squared norm of the weights.
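As a sanity check of this reduction (the helper name `ridge_loss` is mine), note that for $u(x) = w^T x$ the gradient with respect to the input is $w$ itself, independent of $x$, so the smoothness penalty collapses to $\|w\|^2$:

```python
import numpy as np

def ridge_loss(w, X, y, lam):
    """0.5*||Xw - y||^2 + 0.5*lam*||w||^2, i.e. objective (4).

    The second term is the smoothness penalty of (2): the gradient of
    the linear model u(x) = w @ x is w, so ||grad u||^2 == ||w||^2.
    """
    residual = X @ w - y
    return 0.5 * residual @ residual + 0.5 * lam * w @ w
```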
I’m not sure why, but at least as far as I can remember, I have never seen $\ell_2$ regularization of the weights introduced from the perspective of adding the squared norm of the gradient as a smoothness term during my studies up to now.
This might give quite a few new insights and connections.
From (3) we know that the optimal weights must satisfy:

$$X^T (Xw - y) + \lambda w = 0 \quad \Longleftrightarrow \quad w = \left(X^T X + \lambda I\right)^{-1} X^T y, \tag{5}$$

where $X$ stacks the samples $x_i$ as rows and $y$ collects the targets $y_i$.
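Here is a small check (random data, arbitrary $\lambda$) that the closed-form solution indeed satisfies this condition:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.standard_normal(100)
lam = 0.5

# Closed-form ridge solution: w = (X^T X + lam*I)^{-1} X^T y
w = np.linalg.solve(X.T @ X + lam * np.eye(5), X.T @ y)

# Optimality condition (5): X^T (Xw - y) + lam*w = 0
print(np.allclose(X.T @ (X @ w - y) + lam * w, 0.0))  # True
```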
Since $\ell_2$ regularization and weight decay are equivalent for plain gradient descent, the regularization constant also acts as a kind of preconditioner during optimization: $\lambda$ is added to the diagonal of the Hessian $X^T X + \lambda I$, shifting all of its eigenvalues up by $\lambda$ and thereby improving the conditioning of the problem.
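A small numeric illustration of this effect (matrix and $\lambda$ values chosen arbitrarily): since $X^T X$ is positive semi-definite, the smallest eigenvalue of $X^T X + \lambda I$ is at least $\lambda$, so the condition number of the Hessian drops sharply once $\lambda > 0$.

```python
import numpy as np

rng = np.random.default_rng(0)
# Features with wildly different scales -> ill-conditioned X^T X
X = rng.standard_normal((100, 5)) @ np.diag([10.0, 5.0, 1.0, 0.1, 0.01])
H = X.T @ X  # Hessian of the unregularized least-squares loss

for lam in (0.0, 0.1, 1.0):
    cond = np.linalg.cond(H + lam * np.eye(5))
    print(f"lam={lam}: condition number = {cond:.1e}")
```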