# A Denoiser Can Do Much More than Just Clean Noise

#### Regularization by Denoising

Nearly all image processing tasks require access to some “approximate” notion of the images’ probability density function. This problem is generally intractable, especially due to the high dimensions that are involved. Rather than directly approximating this distribution, the image processing community has consequently built algorithms that—either explicitly or implicitly—incorporate key features of the unknown distribution of natural images. In particular, researchers have proposed very efficient denoising algorithms (i.e., algorithms that remove noise from images, which is the simplest inverse problem) and embedded valuable characteristics of natural images in them. The driving question is thus as follows: How can we systematically leverage these algorithms and deploy their implicit information about the distribution in more general tasks?

Consider a noisy image observation \(\mathbf{y=x+v}\), where \(\mathbf{x}\) is an unknown image that is corrupted by zero-mean white Gaussian noise \(\mathbf{v}\) of a known standard deviation \(\sigma\). We use \(f\) to denote an image denoising function — a mapping from \(\mathbf{y}\) to an image of the same size \(\mathbf{\hat{x}}=f\mathbf{(y)}\), such that the resulting estimate will be as close as possible to the unknown \(\mathbf{x}\). This innocent-looking problem has attracted much attention over the past 50 years and sparked innovative ideas across different fields, including robust statistics, harmonic analysis, sparse representations, nonlocal modeling, and deep learning. Indeed, denoising engines are now at the core of the image processing pipeline in any smartphone device, surveillance system, and medical imaging machine.

The recent development of sophisticated and well-performing denoising algorithms has led researchers to believe that current methods have reached the ceiling in terms of noise reduction performance. This belief comes from the observation that substantially different algorithms lead to nearly the same denoising performance; it has been corroborated by theoretical studies that aimed to derive denoising performance bounds. These insights led researchers to conclude that improving image denoising algorithms may be a task with diminishing returns, or to put it more bluntly: a dead end.

Surprisingly, a consequence of this realization is the emergence of a new and exciting area of research: the leveraging of denoising engines to solve other, far more challenging inverse problems. Examples of such problems include image deblurring, super-resolution imaging, inpainting, demosaicing, and tomographic reconstruction. The basis for achieving this goal resides in the formulation of an inverse problem as a general optimization task that seeks to solve

\[\hat{\mathbf{x}} = \underset {\mathbf{x}} {\textrm{argmin}} \, l(\mathbf{y,x}) + \lambda R (\mathbf{x}).\tag1\]

The term \(l(\mathbf{y,x})\) is called the likelihood term (up to scaling, it is the negative log-likelihood under Gaussian noise, also known as the data-fidelity term) and measures how faithful \(\mathbf{x}\) is to the measurement \(\mathbf{y}\). For example, \(l(\mathbf{y,x})= \parallel \mathbf{x-y} \parallel ^2_2\) in the case of image denoising and \(l(\mathbf{y,x})= \parallel H\mathbf{x-y} \parallel ^2_2\) in the case of image deblurring, for which we assume that \(\mathbf{y}= H\mathbf{x+v}\) with a linear blurring operator \(H\). The term \(R(\mathbf{x})\) represents the prior, or regularizer, that aims to drive the optimization task towards a unique or stable solution; one typically cannot achieve such a solution via the likelihood term alone. The hyperparameter \(\lambda\) controls the regularization strength.

For the denoising problem, the choice of \(\lambda = 0\) in \((1)\) leads to a trivial solution for which \(\mathbf{\hat{x} = y}\). This solution reveals the crucial role of \(R(\mathbf{x})\); loosely speaking, an ideal prior should penalize the appearance of noise in \(\mathbf{\hat{x}}\) while preserving edges, textures, and other internal structures in the unknown \(\mathbf{x}\). This intuition has motivated the formulation of important image denoising priors, such as Laplacian smoothness, total variation, wavelet sparsity, enforcement of nonlocal self-similarity, Gaussian mixture models, Field of Experts models, and sparse approximation.

How can we leverage a given powerful denoising machine \(f(\mathbf{x})\) to handle other image processing problems? The Plug-and-Play priors (PPP) framework is an innovative, systematic approach for treating a wide class of inverse problems via denoising engines [2]. PPP’s key novelty is the observation that one can use denoising algorithms as “black box” solvers, which in turn define general image priors. The framework achieves this by introducing an auxiliary image \(\mathbf{z}\) to \((1)\) that decouples the denoising task from the likelihood term:^{1}

\[\mathbf{\hat{x}}= \underset {\mathbf{x,z}} {\textrm{argmin}} \, l (\mathbf{y,x})+R(\mathbf{z})+\frac{1}{2\mu} \parallel \mathbf{x-z} \parallel^2_2.\tag2\]

We can minimize the above objective with alternating optimization techniques. For example, consider a deblurring problem with \(l(\mathbf{y,x})= \parallel H \mathbf{x-y} \parallel^2_2\). When treating \(\mathbf{z}\) as fixed, the minimization of \((2)\) with respect to \(\mathbf{x}\) involves solving a simple linear system of equations — a sharpening step. When optimizing \((2)\) with respect to \(\mathbf{z}\) while \(\mathbf{x}\) is fixed, we obtain a denoising problem that treats the sharpened image \(\mathbf{x}\) as the noisy input. We can interpret the hyperparameter \(\mu\) as the noise level in the candidate estimate \(\mathbf{x}\).
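The alternating scheme described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the original PPP algorithm (which uses the alternating direction method of multipliers): the signal is one-dimensional, the blur \(H\) is a circular box filter, and `toy_denoiser` is a hypothetical stand-in for the black-box denoiser \(f\). The names `circulant`, `toy_denoiser`, and `ppp_deblur`, as well as the values of \(\mu\) and the iteration count, are illustrative choices.

```python
import numpy as np

def circulant(kernel, n):
    """Dense n-by-n circular-convolution matrix for a short, centered kernel."""
    col = np.zeros(n)
    half = len(kernel) // 2
    for k, w in enumerate(kernel):
        col[(k - half) % n] = w
    # Column i is the kernel shifted by i samples (with wrap-around).
    return np.stack([np.roll(col, i) for i in range(n)], axis=1)

def toy_denoiser(x):
    """Hypothetical stand-in for a black-box denoiser f: a 3-tap circular smoother."""
    return 0.25 * np.roll(x, 1) + 0.5 * x + 0.25 * np.roll(x, -1)

def ppp_deblur(y, H, denoiser, mu=1.0, n_iters=50):
    """Alternate between the x-step and the z-step of the penalized objective (2)."""
    n = H.shape[1]
    # x-step normal equations for ||Hx - y||^2 + ||x - z||^2 / (2 mu):
    # (2 H^T H + I / mu) x = 2 H^T y + z / mu
    A = 2.0 * H.T @ H + np.eye(n) / mu
    b = 2.0 * H.T @ y
    z = y.copy()
    for _ in range(n_iters):
        x = np.linalg.solve(A, b + z / mu)  # sharpening step (linear system)
        z = denoiser(x)                     # denoising step (black box)
    return x, z

# Synthetic example: a blurred, slightly noisy step signal.
rng = np.random.default_rng(0)
n = 64
x_true = np.zeros(n)
x_true[20:40] = 1.0
H = circulant(np.ones(5) / 5.0, n)
y = H @ x_true + 0.01 * rng.standard_normal(n)
x_hat, z_hat = ppp_deblur(y, H, toy_denoiser)
```

Replacing the \(\mathbf{z}\)-update with a call to the denoiser is exactly the "black box" idea: the code never needs an explicit expression for \(R\). The sketch makes no claim about reconstruction quality, which depends on the denoiser and on \(\mu\).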

Inspired by the PPP rationale, the framework of Regularization by Denoising (RED) [1] takes a different route and defines an *explicit* regularizer \(R(\mathbf{x})\) of the form

\[R(\mathbf{x})=\frac{1}{2}\mathbf{x}^T(\mathbf{x}-f(\mathbf{x})).\]

Put simply, the value of the above penalty function is low if the cross-correlation between the candidate image \(\mathbf{x}\) and its denoising residual \(\mathbf{x}-f(\mathbf{x})\) is small, or if the residual itself is small. RED brings a modern interpretation of the classic Laplacian regularizer \(R(\mathbf{x})=\frac{1}{2}\mathbf{x}^T(\mathbf{x}-W\mathbf{x})\), for which \(W\) is a fixed and predefined smoothing operator, like a Gaussian filter. In striking contrast to the classic Laplacian prior, RED replaces the naïve filter \(W\) with a state-of-the-art image adaptive denoising filter that is defined by a black box function \(f\).
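To make the penalty concrete, the following minimal sketch evaluates \(R(\mathbf{x})\) for a one-dimensional signal. It assumes a fixed 3-tap circular smoothing filter as the denoiser, which corresponds to the classic Laplacian case with a predefined \(W\) rather than a modern adaptive denoiser; the names `smooth` and `red_penalty` and the noise level are hypothetical choices for illustration.

```python
import numpy as np

def smooth(x):
    """Hypothetical denoiser: a fixed, symmetric 3-tap circular smoother (plays the role of W)."""
    return 0.25 * np.roll(x, 1) + 0.5 * x + 0.25 * np.roll(x, -1)

def red_penalty(x, denoiser):
    """R(x) = 0.5 * x^T (x - f(x)) for a 1-D signal x."""
    return 0.5 * (x @ (x - denoiser(x)))

rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 128, endpoint=False)
clean = np.sin(2.0 * np.pi * t)                    # smooth signal: small residual
noisy = clean + 0.3 * rng.standard_normal(t.size)  # noisy signal: large residual

r_clean = red_penalty(clean, smooth)
r_noisy = red_penalty(noisy, smooth)
r_const = red_penalty(np.ones(128), smooth)        # the smoother leaves constants unchanged
print(r_clean, r_noisy, r_const)
```

In this setting the penalty is much larger for the noisy signal than for the clean one, and it vanishes for a constant signal because the denoising residual is zero.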

What are the mathematical properties of the RED prior? Can we hope to compute its derivative? Recall that scientists often formulate state-of-the-art denoising functions as optimization problems; therefore, computing the derivative of \(f\) will likely be highly nontrivial. Surprisingly, research has shown that RED’s penalty term is differentiable and convex under testable conditions, and its gradient is simply the residual \(\mathbf{x}-f(\mathbf{x})\) [1]. As a result, for a convex likelihood \(l(\mathbf{y,x})\)—as in the deblurring example—the optimization problem

\[\mathbf{\hat{x}}= \underset {\mathbf{x}} {\textrm{argmin}} \, l(\mathbf{y,x})+\frac{\lambda}{2} \mathbf{x}^T(\mathbf{x}-f(\mathbf{x}))\]

is convex as well, so a suitable first-order method converges to a global optimum. One can flexibly treat this task with a wide variety of first-order optimization procedures, as the gradient is simple to obtain and requires only a single activation of the denoiser. Formally, RED requires the chosen denoiser to meet some strict conditions, including local homogeneity, differentiability, and Jacobian symmetry. From an empirical standpoint, however, RED-based recovery algorithms seem to be highly stable and capable of incorporating any denoising algorithm as a regularizer (from the simplest median filtering to state-of-the-art deep learning methods) and treating general inverse problems very effectively.
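As an illustration of how simple a first-order RED solver can be, here is a minimal gradient descent sketch for the deblurring objective \(l(\mathbf{y,x}) + \lambda R(\mathbf{x})\) from \((1)\), with \(l(\mathbf{y,x})= \parallel H\mathbf{x-y} \parallel^2_2\) and the RED prior \(R\). It rests on several assumptions: the signal is one-dimensional, `blur` is a hypothetical symmetric circular box blur (so \(H^T = H\)), and `smooth` is a linear symmetric smoother standing in for the denoiser, for which RED's conditions hold exactly. The value of \(\lambda\), the step size, and all function names are illustrative.

```python
import numpy as np

def smooth(x):
    """Hypothetical stand-in denoiser f: a symmetric 3-tap circular smoother."""
    return 0.25 * np.roll(x, 1) + 0.5 * x + 0.25 * np.roll(x, -1)

def blur(x):
    """Hypothetical symmetric 5-tap circular box blur, so H equals its transpose."""
    return sum(np.roll(x, s) for s in range(-2, 3)) / 5.0

def red_objective(x, y, lam, denoiser):
    """||Hx - y||^2 + (lam / 2) * x^T (x - f(x))."""
    r = blur(x) - y
    return r @ r + 0.5 * lam * (x @ (x - denoiser(x)))

def red_gradient(x, y, lam, denoiser):
    """2 H^T (Hx - y) + lam * (x - f(x)); uses H^T = H for the symmetric blur."""
    return 2.0 * blur(blur(x) - y) + lam * (x - denoiser(x))

def red_gradient_descent(y, lam=0.2, n_iters=200, denoiser=smooth):
    # Both operators have spectral norm at most 1, so 2 + lam bounds the
    # gradient's Lipschitz constant and 1 / (2 + lam) is a safe step size.
    step = 1.0 / (2.0 + lam)
    x = y.copy()
    history = []
    for _ in range(n_iters):
        # Tracking the objective is for illustration only; it costs an extra denoiser call.
        history.append(red_objective(x, y, lam, denoiser))
        x = x - step * red_gradient(x, y, lam, denoiser)
    return x, history

rng = np.random.default_rng(0)
x_true = np.zeros(64)
x_true[20:40] = 1.0
y = blur(x_true) + 0.01 * rng.standard_normal(64)
x_hat, history = red_gradient_descent(y)
```

Each gradient evaluation calls the denoiser once. With this linear smoother the objective is a convex quadratic, so the recorded objective values do not increase from one iteration to the next; for a general black-box denoiser, the same update can be applied, but the gradient formula then relies on the conditions discussed above.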

The PPP and RED frameworks pose new and exciting research questions. The gap between theory and practice has inspired the development of a series of new variations for RED’s prior, as well as novel numerical algorithms. Provable convergence guarantees further support these new methods, broadening the family of denoising machines that one can use to solve general inverse problems. Another exciting line of research seeks a rigorous connection between RED and PPP, with the hope that such an understanding will lead to improved regularization schemes and optimizers. In terms of machine learning aspects, RED solvers formulate novel deep learning architectures by replacing the traditional nonlinear activation functions—like rectified linear units or sigmoid functions—with well-performing denoising algorithms. This approach offers new ways for researchers to train data-driven solvers for the RED functional, with the hope of ultimately achieving superior recovery in fewer iterations than the analytic approach.

*This article is based on Yaniv Romano’s SIAM Activity Group on Imaging Science Early Career Prize Lecture at the 2020 SIAM Conference on Imaging Science, which took place virtually last year. Romano’s presentation is available on SIAM’s YouTube Channel.*

^{1} Here we present a simplified version of the original PPP objective by replacing the hard constraint \(\mathbf{x = z}\) with a penalty; the original PPP relied on an augmented Lagrangian and the alternating direction method of multipliers.

**References**

[1] Romano, Y., Elad, M., & Milanfar, P. (2017). The little engine that could: regularization by denoising (RED). *SIAM J. Imaging Sci.*, *10*(4), 1804-1844.

[2] Venkatakrishnan, S.V., Bouman, C.A., & Wohlberg, B. (2013). Plug-and-play priors for model based reconstruction. In *2013 IEEE global conference on signal and information processing* (pp. 945-948). Austin, TX: IEEE.