Regularization, Bayesian Inference, and Machine Learning Methods for Inverse Problems

Classical methods for inverse problems are mainly based on regularization theory, in particular those that are based on the optimization of a criterion with two parts: a data-model-matching term and a regularization term. Different choices for these two terms and a great number of optimization algorithms have been proposed. When these two terms are distance or divergence measures, they can have a Bayesian Maximum A Posteriori (MAP) interpretation, where these two terms correspond to the likelihood and prior-probability models, respectively. The Bayesian approach gives more flexibility in choosing these terms and, in particular, the prior term, via hierarchical models and hidden variables. However, the Bayesian computations can become very heavy computationally. Machine-learning (ML) methods, such as classification, clustering, segmentation, and regression, based on neural networks (NN), particularly convolutional NNs, deep NNs, physics-informed neural networks, etc., can become helpful to obtain approximate practical solutions to inverse problems. In this tutorial article, particular examples of image denoising, image restoration, and computed-tomography (CT) image reconstruction illustrate this cooperation between ML and inversion.


Introduction
Inverse problems arise in almost any scientific and engineering application. In fact, they arise whenever we want to infer a quantity that is not directly measured. Noting the unknown quantity f and the measurement data g, we may have a mathematical relation between them: g = H(f), where f can be a 1D function (signal), a 2D function (image), a 3D function, or more (e.g., video, hyperspectral images, etc.). H is a mathematical model, called a forward operator, and g can also be a 1D, 2D, 3D, or higher-dimensional function. In practice, we may only have discrete values of it available and, for this reason, the inverse problem, that is, inferring f from this limited data, is an ill-posed problem. When discretized, we may write the relation between them as g = H(f) + ε, where g contains all the data, f contains all the discretized representations of the unknown quantity, and H is a multidimensional operator connecting them. Finally, ε represents all the errors of discretization and measurement uncertainties.
Handling inverse problems, even in the discretized linear model g = Hf + ε, is not easy, for at least two reasons: the first is the ill-conditioning of the matrix H and its great dimensions; the second is accounting for the errors ε.
Classical methods for inverse problems are mainly based on regularization theory, particularly those that are based on the optimization of a criterion with two parts: a data-model-matching part Δ1(g, Hf) and a regularization term Δ2(f, f0), with a balancing parameter between them: J(f) = Δ1(g, Hf) + λΔ2(f, f0), where Δ1 and Δ2 are two distances (L2, L1, etc.) or divergence measures.
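As an illustration, when both Δ1 and Δ2 are quadratic (L2), the criterion J(f) has a closed-form minimizer. A minimal NumPy sketch, with an assumed toy forward matrix H and f0 = 0:

```python
import numpy as np

# Minimal sketch of the two-part criterion J(f) = ||g - H f||^2 + lam * ||f - f0||^2.
# With both terms quadratic, the minimizer has the closed form
#   f_hat = (H^t H + lam I)^(-1) (H^t g + lam f0).
def regularized_inverse(H, g, lam=0.1, f0=None):
    n = H.shape[1]
    if f0 is None:
        f0 = np.zeros(n)
    A = H.T @ H + lam * np.eye(n)
    b = H.T @ g + lam * f0
    return np.linalg.solve(A, b)

# Toy problem (an assumption for illustration): a small lam stabilizes the inversion.
rng = np.random.default_rng(0)
H = rng.normal(size=(20, 10))
f_true = rng.normal(size=10)
g = H @ f_true + 0.01 * rng.normal(size=20)
f_hat = regularized_inverse(H, g, lam=0.1)
```

The balance parameter λ trades data fidelity against the prior guess f0; λ → 0 recovers the (possibly unstable) least-squares solution.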

Inverse Problems Example
Inverse problems arise almost everywhere in science and engineering: everywhere we want to infer an unknown quantity f that is not accessible (observable) directly. We only have access to another observable quantity g that is related to it via a linear or a nonlinear relation H [36-38].
As you can see, I am going to use a color code: red for unknown quantities and blue for observed or assumed, known quantities. The forward operator linking the two quantities is noted H. In general, the forward problem is well-posed, but the inverse problem is ill-posed. This means that either the classical inverse operator does not exist (existence), or we can define many generalized inverse operators, so many solutions to the problem can be defined (uniqueness), or, even if we can define an inverse operator, it may be unstable (stability) [39].
Let us mention a few examples of common inverse problems here.

Image Restoration
Any photographic system (camera, microscope, or telescope) has a limited field of view and a limited resolution. If we note by f(x, y) the original image and by g(x, y) the observed image, and if we assume a linear and space-invariant operator between them, then the forward relation can be written as a convolution: g(x, y) = ∫∫ h(x − x′, y − y′) f(x′, y′) dx′ dy′ + ε(x, y), where h(x, y) represents the point spread function (psf) of the imaging system. Many examples can be given [40,41]. In Figure 1, one synthetic example is shown.
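A minimal sketch of this convolution forward model, with an assumed 3×3 uniform psf and a toy point-source image:

```python
import numpy as np

# Sketch of the space-invariant forward model g(x, y) = (h * f)(x, y):
# every observed pixel is a psf-weighted sum of neighboring object pixels.
def convolve2d_same(f, h):
    """'Same'-size 2D convolution with zero padding (a toy psf blur)."""
    kx, ky = h.shape
    px, py = kx // 2, ky // 2
    fp = np.pad(f, ((px, px), (py, py)))
    g = np.zeros_like(f, dtype=float)
    for i in range(f.shape[0]):
        for j in range(f.shape[1]):
            # correlation with the flipped kernel == convolution
            g[i, j] = np.sum(fp[i:i + kx, j:j + ky] * h[::-1, ::-1])
    return g

f = np.zeros((9, 9)); f[4, 4] = 1.0   # a point source
h = np.ones((3, 3)) / 9.0             # assumed uniform 3x3 psf
g = convolve2d_same(f, h)             # the blurred observation
```

Blurring the point source reproduces the psf itself, which is exactly what "point spread function" means.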

X-ray Computed Tomography
In X-ray computed tomography (CT), the relation between the data and the object can be modeled via the Radon transform: g(r, φ) = ∫∫ f(x, y) δ(r − x cos φ − y sin φ) dx dy, where δ is the Dirac delta function; thus, g(r, φ) represents the line integrals of the object function f(x, y) over the lines at angles φ. The forward operation is called projection, and the inversion process is called image reconstruction. In Figure 2, one synthetic example is shown.
Figure 2. Forward and inverse problems in computed tomography. The horizontal axis on the left is r, the vertical axis is φ, and the values of g(r, φ) are presented as gray levels. On the right, the object section f(x, y) is presented. The forward operation is called projection, and the inversion process is called image reconstruction.
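A toy sketch of the projection operation for two angles (0° and 90°), where the line integrals reduce to sums along the image axes; the general-angle case needs interpolation along rotated lines:

```python
import numpy as np

# Toy discretization of the Radon transform g(r, phi): for phi = 0 and
# phi = 90 degrees, the line integrals reduce to sums along columns and rows.
def projections_0_90(f):
    p0 = f.sum(axis=0)   # integrate along y: lines x = const (phi = 0)
    p90 = f.sum(axis=1)  # integrate along x: lines y = const (phi = 90)
    return p0, p90

f = np.zeros((8, 8)); f[2:6, 3:5] = 1.0   # a small rectangular "object"
p0, p90 = projections_0_90(f)
```

Note that every projection sums to the same total mass of the object, a basic consistency condition on tomographic data.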

Acoustical Imaging
Acoustic source localization in acoustical imaging can also be considered as an inverse problem, where the positions and intensities of acoustical sources have to be estimated from the signals received by the microphone arrays. If we represent the distribution of the sources by f(x, y) = Σ_n s_n δ(x − x_n, y − y_n), each microphone m receives the sum of the delayed sources' sounds [42-45]: g_m(t) = Σ_n s_n(t − τ_mn), where τ_mn is the delay of transmission from the source position n to the microphone position m. This delay is a function of the speed of sound and the distance between the source position (x_n, y_n) and the microphone position (x_m, y_m).
In Figure 3, one synthetic example is shown to explain the main idea.
Figure 3. Forward and inverse problems in acoustical imaging. Each microphone receives the sum of the delayed sources' sounds. The inverse problem is to estimate the sources' distribution from the signals received by the microphone array.
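A minimal sketch of this delay-and-sum forward model; the source and microphone positions, the sinusoidal source signals, and the speed of sound are assumptions for illustration:

```python
import numpy as np

# Sketch of the acoustic forward model: microphone m records the sum of
# source signals delayed by tau_mn = dist(source n, mic m) / c.
c = 340.0                                   # assumed speed of sound (m/s)
fs = 8000.0                                 # assumed sampling frequency (Hz)
src_pos = np.array([[0.0, 0.0], [1.0, 0.5]])
mic_pos = np.array([[3.0, 0.0], [3.0, 1.0], [3.0, 2.0]])

def s(n, t):
    """Toy source signals: sinusoids with a source-dependent frequency."""
    return np.sin(2 * np.pi * (200.0 + 100.0 * n) * t)

t = np.arange(0, 0.05, 1 / fs)
g = np.zeros((len(mic_pos), len(t)))
for m, rm in enumerate(mic_pos):
    for n, rn in enumerate(src_pos):
        tau = np.linalg.norm(rm - rn) / c   # propagation delay tau_mn
        g[m] += s(n, t - tau)               # delayed contribution of source n
```

Inverting this model (beamforming or regularized inversion) amounts to finding source positions and signals consistent with the recorded mixtures g_m(t).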

Microwave Imaging for Breast-Cancer Detection
In microwave imaging, the body is illuminated by microwaves. As the electrical properties (conductivity and permittivity) of healthy and tumor tissues are different, their corresponding induced sources are different. These differences can be measured via the sensors outside the breast. The inverse problem, in this case, consists in estimating these induced sources, or even directly the distribution of the conductivity and permittivity inside the breast. Looking at such images, the tumor area can be visualized [46,47].
The forward problem here is described by two coupled equations: g(r_i) = ∫ G_m(r_i, r′) φ(r′) f(r′) dr′ and φ(r) = φ0(r) + ∫ G_o(r, r′) φ(r′) f(r′) dr′, where f(r) represents the variation of the conductivity, φ(r) the induced field due to the incident field φ0(r), and G_m(r_i, r′) and G_o(r, r′) are the Green's functions. The first equation relates the measured diffracted field g on the sensors to the induced currents J(r) = φ(r) f(r) inside the breast due to the external field, via the Green's function, and the second relates the total field to the sum of the incident and the induced fields. So, the forward problem is nonlinear, as φ(r) appears on both sides of the second equation. However, it can be approximated by a linear relation if we assume that the induced field inside is very small compared to the incident field (φ(r′) ≈ φ0(r′)). This is called the Born approximation: g(r_i) = ∫ G_m(r_i, r′) φ0(r′) f(r′) dr′. Both the bilinear relation and the linear Born approximation are used in microwave imaging. The first is more common in industrial non-destructive testing (NDT), and the second in biological and medical applications.

Brain Imaging
In brain imaging, the electrical activity of the neurons inside the brain propagates and can be measured at the surface of the scalp via electrodes fixed on it. These signals are called the electroencephalogram (EEG). It is also possible to measure the magnetic field created by this activity; this time, the signals are called the magnetoencephalogram (MEG). In both cases, the inversion process consists in estimating the distribution of the brain activity from the measured signals. If the brain's electrical activity can be modeled as electrical mono- or dipoles distributed over the surface of the brain, then the simplified forward model is almost similar to the acoustical source localization of the previous example. Here, the sources are distributed in the 3D space of the brain, and the EEG electrodes are positioned on the scalp. The signal received by each EEG electrode can be compared to the signals received by the microphones in the previous example. A great number of forward models, analysis, and inversion methods have been proposed for this application. A very good toolbox is EEGLAB, which can be found on the internet and easily obtained.

Other Applications
Many other imaging systems exist to see inside the human body, or inside any industrial object in non-destructive testing (NDT) applications. We can mention a few more: magnetic-resonance imaging (MRI), ultrasound imaging such as echography, positron emission tomography (PET), single-photon emission computed tomography (SPECT), electrical impedance tomography, and eddy-current tomography [48].

Classification of Inverse Problems' Methods
Inverse problems' methods can be classified into the following categories:
• Analytical inversion methods.
• Generalized inversion approach.
• Regularization methods.
• Bayesian inference methods.
In the first category, the main idea is to recognize the forward operator as one of the well-known invertible mathematical operators and thus to use the appropriate inversion operator. Typical examples are the Fourier transform (FT) and the Radon transform (RT). In the second category, the notion of generalized inversion is used. The corresponding methods are either based on singular value decomposition (SVD) or on iterative projection-based algorithms. The regularization methods are mainly based on the optimization of a criterion, often made of two parts: the data-model adequacy and the regularization, with a regularization parameter. Finally, the Bayesian inference approach, which I consider to be the most general and complete, has all the necessary tools to go beyond the regularization methods. Figure 4 shows the main idea behind the analytical methods via two classical cases: image deconvolution and X-ray image reconstruction. In the first case, as the forward model is a 2D convolution, g(x, y) = h(x, y) * f(x, y), or, equivalently, a 2D Fourier-transform (FT) filtering, G(u, v) = H(u, v)F(u, v), the operation consists in going to the Fourier domain, doing inverse filtering, and coming back. However, the inverse filtering 1/H(u, v) must be regularized, either by limiting the bandwidth or by applying an appropriate window mask before doing the inverse Fourier transform (IFT).
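A minimal sketch of such regularized inverse filtering, using a Wiener-like damping term eps in place of a band limit or window mask (the blur kernel and eps values are assumptions for illustration):

```python
import numpy as np

# Sketch of deconvolution in the Fourier domain: G = H * F, so naive
# inversion divides by H(u, v); the regularization avoids blowing up
# where |H| is small (here, a Wiener-like damping with parameter eps).
def inverse_filter(g, h, eps=1e-2):
    H = np.fft.fft2(h, s=g.shape)
    G = np.fft.fft2(g)
    F_hat = np.conj(H) * G / (np.abs(H) ** 2 + eps)   # regularized 1/H
    return np.real(np.fft.ifft2(F_hat))

f = np.zeros((16, 16)); f[8, 8] = 1.0
h = np.zeros((16, 16)); h[:2, :2] = 0.25              # assumed small blur kernel
g = np.real(np.fft.ifft2(np.fft.fft2(f) * np.fft.fft2(h)))  # circular blur
f_hat = inverse_filter(g, h, eps=1e-6)
```

Where H(u, v) vanishes, those frequencies are simply lost; the damping keeps the estimate stable instead of amplifying them without bound.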

Analytical Methods
In the second case, the forward model is the Radon transform (RT). Using the relation between the FT and the RT (the Fourier slice theorem), the analytical inversion process becomes:
1. Compute the 1D FT of the projections g(r, φ) with respect to r;
2. Relate it to the 2D FT of f(x, y) via the Fourier slice theorem and interpolate to obtain the full 2D FT of f(x, y); and
3. Compute the 2D IFT to obtain f(x, y).
For more details, refer to [49,50]. The main difficulty with the analytical methods is that the forward relations are given for continuous functions or densities. They can give satisfactory results if the inversion process is regularized and if the data are dense, complete, and without any measurement or discretization errors. In practical situations, rarely are all these conditions satisfied.
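The Fourier slice theorem used in step 2 can be checked numerically for the angle φ = 0, where the projection is a simple axis sum:

```python
import numpy as np

# Numerical check of the Fourier slice theorem: the 1D FT of the
# projection at phi = 0 equals the corresponding axis slice of the
# 2D FT of the object f.
rng = np.random.default_rng(1)
f = rng.random((16, 16))            # an arbitrary toy object

proj = f.sum(axis=0)                # projection along y (phi = 0)
slice_1d = np.fft.fft(proj)         # 1D FT of the projection
F2 = np.fft.fft2(f)                 # 2D FT of the object
slice_2d = F2[0, :]                 # the central slice of the 2D FT
```

Each projection angle fills one line through the origin of the 2D Fourier plane, which is why interpolation between the slices is needed before the 2D IFT.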

Generalized Inversion Approach
In this approach, the main idea is based on the fact that the forward operator is, in general, a singular one. This means there are many possible solutions to the inverse problem. There are mainly two categories of methods. The first is based on singular-value decomposition (SVD). The second is based on the optimization of a criterion, such as least squares (LS). In both, the main idea is to define a set of possible solutions, called generalized inverse solutions, {f : Hf = g}, or pseudo-solutions, {f : f minimizes ||g − Hf||²}. Then, among those possible solutions, one tries to define a criterion, such as the minimum norm, to choose one solution. For linear inverse problems, the corresponding solutions are given by f = H^t(HH^t)^(-1) g (minimum norm) or f = (H^t H)^(-1) H^t g (least squares). In great-dimensional problems, even if we have these analytical expressions, in practice, the solutions are computed by using iterative optimization algorithms, for example, optimizing the LS criterion J(f) = ||Hf − g||² by a gradient-based algorithm, f^(k+1) = f^(k) + α H^t(g − H f^(k)), with a stopping criterion or just after some fixed number of iterations. We will see in the next sections how this can lead to a deep-learning NN structure.
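A minimal sketch of this gradient-based LS iteration with a fixed number of iterations K; the step size α = 1/||H||² and the toy problem are assumptions for illustration:

```python
import numpy as np

# Sketch of the iterative LS solver f_{k+1} = f_k + alpha H^t (g - H f_k),
# stopped after a fixed number of iterations K (each iteration will later
# be read as one layer of a deep network).
def gradient_ls(H, g, alpha=None, K=50):
    if alpha is None:
        # step small enough for stability: alpha < 2 / ||H||_2^2
        alpha = 1.0 / np.linalg.norm(H, 2) ** 2
    f = np.zeros(H.shape[1])
    for _ in range(K):
        f = f + alpha * H.T @ (g - H @ f)
    return f

rng = np.random.default_rng(2)
H = rng.normal(size=(30, 10))       # a well-conditioned toy operator
f_true = rng.normal(size=10)
g = H @ f_true                      # noiseless data for this sketch
f_hat = gradient_ls(H, g, K=200)
```

With noisy, ill-conditioned H, stopping early (small K) acts as an implicit regularizer, which is the behavior mentioned above.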

Model-Based and Regularization Approach
The model-based methods are related to the notions of the forward model and the inverse-problems approach. Figure 5 shows the main idea: a physical model links some unknown quantity f, via the forward problem, to the data g, and the inverse problem returns an estimate of that unknown quantity f.

Given the forward model H and the source f, the prediction of the data g can be done either in a deterministic way, g = H(f), or via a probabilistic model, p(g|f, H), as we will see in the next section. In the same way, given the forward model H and the data g, the estimation of the unknown source f can be done either via a deterministic method or a probabilistic one. One of the deterministic methods is the generalized inversion: f = H†(g). A more general method is regularization: f = arg min_f {Δ1(g, Hf) + λΔ2(f, f0)}. As we will see later, the only probabilistic method that can be used efficiently for inverse problems is the Bayesian approach.

Regularization Methods
Let us consider the discretized linear inverse problem, g = Hf + ε, and the regularization criterion J(f) = ||g − Hf||² + λR(f). The first main issue in such a regularization method is the choice of the regularizer R(f). The most common examples are R(f) = ||Df||², R(f) = ||Df||₁, or, more generally, R(f) = Σ_j φ([Df]_j), where D is a linear operator, generally a first-order derivation (a gradient) or a second-order derivation. The function φ has to be convex to ensure the uniqueness of the solution. Many such functions have been proposed, but some non-convex ones have also been proposed, which then need global optimization techniques. The second main issue in regularization is the choice of an appropriate optimization algorithm. This mainly depends on the type of the criterion, and we have:
• R(f) quadratic: gradient-based and conjugate-gradient algorithms are appropriate. In this case, we have an analytic solution: f = (H^t H + λD^t D)^(-1) H^t g. However, in practice, this analytic solution is not usable in high-dimensional problems. In general, as the gradient ∇J(f) = −2H^t(g − Hf) + 2λD^t Df can be evaluated analytically, gradient-based algorithms are used.
• R(f) non-quadratic but convex and differentiable: here, too, the gradient-based and conjugate-gradient (CG) methods can be used, but there is also a great number of convex-criterion optimization algorithms.
• R(f) convex but non-differentiable, e.g., L1 (TV), which is not differentiable at zero: here, the notion of a sub-gradient is used, and the algorithms use the notions of the Fenchel conjugate, the dual problem, the sub-gradient, and the proximal operator [11,51-53], as well as variable splitting and the augmented Lagrangian.
The best possible solution to push all these limits further is the Bayesian approach, which has: (a) many possibilities to choose prior models; (b) the possibility of estimating the hyperparameters; and, most important, (c) an accounting for the uncertainties.
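For the L1 case, the proximal operator is the soft-thresholding function; a minimal ISTA sketch for J(f) = (1/2)||g − Hf||² + λ||f||₁ (the toy sparse problem and the parameter values are assumptions for illustration):

```python
import numpy as np

# The proximal operator of lam * ||.||_1 is soft thresholding.
def soft_threshold(x, thr):
    return np.sign(x) * np.maximum(np.abs(x) - thr, 0.0)

# A few ISTA iterations for J(f) = 0.5 ||g - H f||^2 + lam ||f||_1:
# a gradient step on the quadratic part, then the proximal (threshold) step.
def ista(H, g, lam=0.1, K=100):
    alpha = 1.0 / np.linalg.norm(H, 2) ** 2      # step size <= 1/L
    f = np.zeros(H.shape[1])
    for _ in range(K):
        f = soft_threshold(f + alpha * H.T @ (g - H @ f), alpha * lam)
    return f

rng = np.random.default_rng(3)
H = rng.normal(size=(40, 20))
f_true = np.zeros(20); f_true[[3, 11]] = [2.0, -1.5]   # sparse truth
g = H @ f_true
f_hat = ista(H, g, lam=0.05, K=300)
```

The thresholding step is what produces exact zeros, i.e., the sparsity that the L1 regularizer promotes.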

Basic Idea
The simple case of the Bayes rule is: p(f|g, H) = p(g|f, H) p(f|H) / p(g|H), where H is a model, p(g|f, H) is the likelihood of f in the data through the model, p(f|H) is the prior knowledge about the unknown quantity f, and p(f|g, H), called the posterior, is the result of the combination of the likelihood and the prior. The denominator p(g|H), called the evidence, is the overall likelihood of the model in the data g.
When there are some hyperparameters θ, for example, the parameters of the likelihood and those of the prior law, which also have to be estimated, we have: p(f, θ|g, H) ∝ p(g|f, θ, H) p(f|θ, H) p(θ). This is called the joint posterior law of all the unknowns. From this joint posterior distribution, we may also obtain the marginals p(f|g, H) and p(θ|g, H).

Gaussian Priors Case
To be more specific, let us consider the case of linear inverse problems, g = Hf + ε. Then, assuming Gaussian noise, we have p(g|f) = N(g|Hf, v_ε I). Assuming also a Gaussian prior, p(f) = N(f|f0, v_f I), it is easy to see that the posterior is also Gaussian, and the MAP and posterior mean (PM) estimates become the same and can be computed as the minimizer of J(f) = ||g − Hf||² + λ||f − f0||², with λ = v_ε/v_f. In summary, we have p(f|g) = N(f|f̂, Σ), with f̂ = (H^t H + λI)^(-1)(H^t g + λf0) and Σ = v_ε(H^t H + λI)^(-1). This case is also summarized in Figure 6.
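A minimal sketch of this Gaussian case, computing the posterior mean and covariance for assumed variances v_ε and v_f (with prior mean f0 = 0):

```python
import numpy as np

# Sketch of the Gaussian case: with g = H f + eps, eps ~ N(0, v_eps I)
# and prior f ~ N(0, v_f I), the posterior is N(f_hat, Sigma) with
#   Sigma = v_eps (H^t H + lam I)^(-1),  f_hat = (H^t H + lam I)^(-1) H^t g,
# where lam = v_eps / v_f; f_hat is both the MAP and the posterior mean.
def gaussian_posterior(H, g, v_eps=0.01, v_f=1.0):
    n = H.shape[1]
    lam = v_eps / v_f
    A_inv = np.linalg.inv(H.T @ H + lam * np.eye(n))
    f_hat = A_inv @ H.T @ g
    Sigma = v_eps * A_inv
    return f_hat, Sigma

rng = np.random.default_rng(4)
H = rng.normal(size=(25, 8))
f_true = rng.normal(size=8)
g = H @ f_true + np.sqrt(0.01) * rng.normal(size=25)
f_hat, Sigma = gaussian_posterior(H, g)
```

The diagonal of Σ provides pixel-wise uncertainty, which is exactly what the point-estimate-only regularization view does not give.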


We may note that, in this case, we have an analytical expression for the posterior law, which is also a Gaussian law and is entirely specified by its mean f̂ and covariance Σ. However, for high-dimensional problems, where f is a very large vector, the computation of Σ can become very costly. The computation of the posterior mean can be done by optimization, as it is the same as the MAP solution.

Gaussian Priors with Unknown Parameters
For the case where the hyperparameters v_ε and v_f are unknown (the unsupervised case), we can derive the following expressions, where all the details and, in particular, the expressions for α_ε, β_ε, α_f, and β_f can be found in [19].
As we can see, the expressions of f̂ and Σ are the same as in the previous case, except that the estimates of v_ε, v_f, and λ have to be updated; they are obtained from the conditionals. This case is also summarized in Figure 7.

The joint posterior can be written as p(f, v_ε, v_f|g) ∝ p(g|f, v_ε) p(f|v_f) p(v_ε) p(v_f). From this expression, we have different possible expansions. Each iteration can be done in two steps:
1. Fix v_ε and v_f to their previous values and optimize with respect to f;
2. Fix f and optimize with respect to v_ε and v_f.
The first step results in a quadratic criterion with respect to f, which yields an analytical expression for the solution that can be used for small-dimension problems, or it can be optimized easily by any gradient-based algorithm. The second step, in this case, results in two separate explicit solutions: one for v_ε and one for v_f.
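A minimal sketch of this two-step alternation; the variance updates assume inverse-gamma hyperpriors with illustrative α and β values (the exact expressions are given in [19]):

```python
import numpy as np

# Minimal sketch of the two-step JMAP alternation (assumed inverse-gamma
# hyperpriors; the exact alpha/beta expressions are in [19]):
#   step 1: f   <- argmin ||g - H f||^2 / v_eps + ||f||^2 / v_f
#   step 2: v's <- explicit variance updates given the current f
def jmap(H, g, alpha=2.0, beta=1.0, n_iter=20):
    M, N = H.shape
    v_eps, v_f = 1.0, 1.0
    f = np.zeros(N)
    for _ in range(n_iter):
        lam = v_eps / v_f
        f = np.linalg.solve(H.T @ H + lam * np.eye(N), H.T @ g)   # step 1
        v_eps = (beta + 0.5 * np.sum((g - H @ f) ** 2)) / (alpha + M / 2 + 1)
        v_f = (beta + 0.5 * np.sum(f ** 2)) / (alpha + N / 2 + 1)  # step 2
    return f, v_eps, v_f

rng = np.random.default_rng(5)
H = rng.normal(size=(40, 10))
f_true = rng.normal(size=10)
g = H @ f_true + 0.1 * rng.normal(size=40)
f_hat, v_eps, v_f = jmap(H, g)
```

The regularization weight λ = v_ε/v_f is thus estimated from the data themselves instead of being tuned by hand.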
• Gibbs sampling and Markov chain Monte Carlo (MCMC): these steps can be done using the expressions of the conditionals given in Equation (18). These methods are generally used when we want not only a point estimator, such as the MAP or the posterior mean, but also to quantify the uncertainties by estimating the variances and covariances [19,55-58].
• Variational Bayesian approximation (VBA): we can see that the alternate optimization of KL(q_1, q_2, q_3|p) with respect to q_1, q_2, and q_3 results in the same expressions as in Equation (18); only the expressions for updating the parameters α_ε, β_ε, α_f, and β_f are different. The approximate Bayesian computation (ABC) methods and, in particular, the VBA and mean-field-approximation methods are used when Gibbs sampling and MCMC methods are too expensive and we still want to quantify the uncertainties, for example, by estimating the variances.

Imaging inside the Body: From Data Acquisition to Decision
To introduce the link between the different model-based methods and the machine-learning tools, let us consider the case of medical imaging, from the acquisition to the decision steps:
• Data acquisition;
• Image reconstruction (inversion);
• Image post-processing, segmentation, and feature extraction; and
• Interpretation and decision.

Bayesian Joint Reconstruction and Segmentation
The questions now are: can we combine any of these steps? Can we go directly from the image to the decision? For the first one, the Bayesian approach can provide a solution. The main tool here is to introduce a hidden variable that can represent the segmentation: a classification hidden variable z with z_j ∈ {1, 2, · · · , K}, which can be used to show the segmented image. Figures 8 and 9 summarize this scheme.
The computations can be done via MCMC (Gibbs sampling) or VBA (alternate optimization). A few comments on these relations:

• p(g|f, z) does not depend on z, so it can be written as p(g|f).
• We use a Markovian Potts model for p(z) to obtain more compact homogeneous regions [18,19].
• If we choose a Gaussian law for p(f|z), then p(f, z|g) becomes a Gauss-Markov-Potts model [19].
• We can use the joint posterior p(f, z|g) to infer on (f, z): we may just do JMAP, (f̂, ẑ) = arg max{p(f, z|g)}, or try to access the expected posterior values by using the variational Bayesian approximation (VBA) techniques [17,19,58-62].
• When the iterations are finished, we obtain an estimate of the reconstructed image f̂ and its segmentation ẑ when using JMAP, as well as the covariance of f and the parameters of the posterior laws of z.
This scheme can be extended to consider the estimation of the hyperparameters too. Figure 10 shows this. Again, here, we can use the joint posterior p(f, z, θ|g) to infer on all the unknowns.
When the iterations are finished, we obtain an estimate of the reconstructed image f̂, its segmentation ẑ, as well as all the unknown parameters, such as the means and variances of the reconstructed image in each of its segments. Giving more details is out of the scope of this overview article; they can be found in [17].


Advantages of the Bayesian Framework
Among the main advantages of the Bayesian framework for inverse problems, we can mention the following:
• Large flexibility of prior models;
• The possibility of estimating the hyperparameters; and
• The accounting for uncertainties.
From previous sections, we see that we have many solutions to go from the data to an image by inversion (image reconstruction), then to the extraction of interesting features (segmentation), and finally to the interpretation and decision. The question that we may ask now is: can we do all of this together in an easier way? Machine-learning and artificial-intelligence tools may propose such a solution. See Figure 11. To be able to use ML to go from data to decision, there is a crucial need for a great and rich database, obtained by experts, to let the machine learn from it. In the next section, we go into a little more detail to see the advantages, limitations, and drawbacks.
Data g → Machine Learning and Artificial Intelligence → Tumor or Not Tumor Figure 11. Two approaches going from the data to decision. Top: from data, first, reconstruct an image via inversion, then post-process to obtain segmentation. Perform pattern recognition to extract the contours of the region of interest and finally make a decision. Bottom: try to use machine-learning methods to go directly from data to decision.

Machine Learning's Basic Idea
The main idea in machine learning is first to learn from a great number of data-decision pairs (g_i, d_i), i = 1, · · · , N. Figure 12 shows the main process of ML. Additionally, the combination of imaging technology and systems, image processing, computer vision, machine learning, and artificial intelligence has been the seed for much great progress in all areas of health and our environment. The frontiers between these sciences and technologies have become less precise, as shown in Figure 14.

Figure 14 relates these fields and their typical tasks:
• Image technology: 2D, 3D, and hyperspectral acquisition; compression; transmission.
• Image processing: representation, compression, and segmentation; enhancement and restoration.
• Computer vision: segmentation and contour detection; segments, edges, patterns, RoIs, and feature extraction; pattern matching and localization.
• Machine learning and artificial intelligence: object detection and identification; 2D and 3D pattern recognition and interpretation; classification, clustering, recognition, decision making, etc.
Among the machine-learning tools using NNs, convolutional NNs (CNN), recurrent NNs (RNN), deep learning (DL), and generative adversarial networks (GAN) have had great success in different areas, such as speech recognition and computer vision, and specifically in segmentation, classification and clustering, and multi-modality and cross-domain information fusion.
However, there are still many limitations: a lack of interpretability, reliability, and uncertainty quantification, and no reasoning or explaining capabilities. To overcome this, there is still much to do on the fundamentals.

Neural Networks
Let us start this section with a few words on neurons and neural networks. Figure 15 shows the main idea of a neuron in a mathematical framework.
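A minimal sketch of such a neuron, with an assumed sigmoid activation:

```python
import numpy as np

# A neuron in the mathematical framework: a weighted sum of inputs plus a
# bias, passed through a nonlinear activation: y = sigma(w . x + b).
def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def neuron(x, w, b):
    return sigmoid(np.dot(w, x) + b)

# Illustrative values (assumptions, not learned weights):
x = np.array([1.0, 2.0, -1.0])
w = np.array([0.5, -0.25, 1.0])
b = 0.0
y = neuron(x, w, b)   # sigma(0.5 - 0.5 - 1.0) = sigma(-1.0)
```

A network is built by composing layers of such units; training adjusts the weights w and biases b.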

NN and Learning
A neural network can be used for modeling a universal relation between its inputs X and outputs Y. This model can be written as Y = F_W(X), where W represents the parameters of the model, i.e., the weights of the network's node relations. They are commonly used for:
• Classification (supervised learning): a set of data {(x_i, y_i)} with labels (classes) {c_i} is given. The objective is to use them for training the network, which is then used for classifying a new incoming sample (x_j, y_j).
• Clustering (unsupervised learning): a set of data {(x_i, y_i)} is given. The objective is to cluster them into different classes {c_i}.
• Regression (supervised learning): a set of data {(x_i, y_i)} is given. The objective is to find a function F describing the relation between them, F(x, y), or, explicitly, y = F(x) for any x (extrapolation or interpolation).
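A minimal sketch of the regression case, using a linear one-node model fitted by least squares (the data relation y = 2x + 1 is an assumption for illustration):

```python
import numpy as np

# Minimal sketch of the regression use of Y = F_W(X): a linear model
# y = w0 + w1 * x fitted to data pairs {(x_i, y_i)} by least squares
# (a one-node, identity-activation "network").
x = np.linspace(0.0, 1.0, 21)
y = 2.0 * x + 1.0                          # data from an assumed relation
X = np.column_stack([np.ones_like(x), x])  # design matrix [1, x]
W = np.linalg.lstsq(X, y, rcond=None)[0]   # learned parameters [w0, w1]

def F_W(x_new):
    return W[0] + W[1] * x_new             # the trained model, usable for any x

y_pred = F_W(0.5)
```

Deep networks follow the same pattern with nonlinear, multi-layer F_W and many more parameters.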

Modeling, Identification, and Inversion
Here, we make a connection between the classical and ML tools and show the links between:
• Forward modeling and inversion or inference;
• Model identification and learning (training); and
• Inversion and using the NN.

ML for Inverse Problems
To show the possibilities of the interaction between inverse problems' methods, machine learning, and NN methods, the best way is to give a few examples.

First Example: A Linear One- or Two-Layer Feed-Forward NN
The first one is the case of linear inverse problems and quadratic regularization, or the Bayesian approach with Gaussian priors. The solution has an analytic expression: f̂ = (H^t H + λI)^(-1) H^t g = Bg or, equivalently, f̂ = H^t(HH^t + λI)^(-1) g = H^t Cg. As we can see, these relations directly induce a linear feed-forward NN structure. In particular, if H represents a convolution operator, then H^t, H^t H, and HH^t do too, as well as the operators B and C. Thus, the whole inversion can be modeled by a CNN [63].
For the case of computed tomography (CT), the first operation is equivalent to an analytic inversion; the second corresponds to back-projection first, followed by 2D filtering in the image domain; and the third corresponds to the famous filtered back-projection (FBP), which is implemented on classical CT scanners. These three cases are illustrated in Figure 17. Figure 17. Three linear NN structures that are derived directly from the quadratic regularization inversion method.

Second Example: Image Denoising with a Two-Layer CNN
The second example is denoising, g = f + ε, with an ℓ1 regularizer (reconstructed here in its standard form):

J(f) = ‖g − f‖² + λ‖D f‖₁,

where D is a filter, i.e., a convolution operator. This can also be considered as the MAP estimator with a double-exponential prior. It is easy to show that the solution can be obtained by a convolution followed by a thresholding [64][65][66]:

g → Thresholding → z → D → f, or equivalently, g → two-layer CNN → f,

where S_λ is a thresholding operator.
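A runnable sketch of "convolution followed by thresholding": here D is a one-level orthonormal Haar transform, and the noise level and threshold value are our own illustrative choices, not values from the text.

```python
import numpy as np

def soft(x, t):
    # soft-thresholding S_t(x) = sign(x) * max(|x| - t, 0)
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def haar(x):
    a = (x[0::2] + x[1::2]) / np.sqrt(2.0)   # low-pass (averages)
    d = (x[0::2] - x[1::2]) / np.sqrt(2.0)   # high-pass (details)
    return a, d

def ihaar(a, d):
    x = np.empty(2 * a.size)
    x[0::2] = (a + d) / np.sqrt(2.0)
    x[1::2] = (a - d) / np.sqrt(2.0)
    return x

rng = np.random.default_rng(2)
f_true = np.repeat([0.0, 1.0, 0.0], 32)      # piecewise-constant signal
g = f_true + 0.1 * rng.normal(size=f_true.size)

a, d = haar(g)
f_hat = ihaar(a, soft(d, 0.25))              # threshold the detail band only
```

Thresholding only the high-pass band implements a prior on the filtered coefficients D f, in the spirit of the criterion above.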

Third Example: A Deep-Learning Equivalence
One of the classical iterative methods for linear inverse problems is based on a gradient method to optimize J(f) = ‖g − H f‖², where the solution is obtained recursively (the standard update, reconstructed here):

f^(k+1) = f^(k) + α H^t (g − H f^(k)) = (I − α H^t H) f^(k) + α H^t g.

It is well known that, when the forward-model operator H is singular or ill-conditioned, this iterative algorithm starts by converging but may then easily diverge. One practical method to obtain an acceptable approximate solution is simply to stop after K iterations. This idea can be translated to a deep-learning NN by using K layers, each layer representing one iteration of the algorithm. See Figures 18 and 19.

Figure 19. A K-layer DL NN equivalent to K iterations of the basic optimization algorithm.
This DL structure can easily be extended to a regularized criterion: we just need to replace (I − α H^t H) by (I − α H^t H − αλ D^t D).
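A sketch of the truncated iterative inversion with a toy random H (the sizes and the number of iterations K are illustrative): each pass of the loop is one "layer" computing f ← (I − α H^t H) f + α H^t g.

```python
import numpy as np

rng = np.random.default_rng(3)
m, n, K = 40, 30, 50
H = rng.normal(size=(m, n)) / np.sqrt(m)     # toy forward operator
f_true = rng.normal(size=n)
g = H @ f_true + 0.01 * rng.normal(size=m)

alpha = 1.0 / np.linalg.norm(H.T @ H, 2)     # step size below 1/||H^t H||
f = np.zeros(n)
for _ in range(K):                            # K layers = K iterations
    f = f + alpha * H.T @ (g - H @ f)
```

Stopping after K iterations plays the role of regularization; the estimate is closer to f_true than the zero initialization.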
This structure can also be extended to all the sparsity-enforcing regularization terms, such as ℓ1 and total variation (TV), using appropriate algorithms such as ISTA, FISTA, ADMM, etc., by replacing the update expression and by adding a nonlinear operation, much like in ordinary NNs. A simple example is given in the following subsection.

Fourth Example: ℓ1 Regularization and NN
Let us consider the linear inverse problem g = H f + ε with the ℓ1 regularization criterion

J(f) = ‖g − H f‖² + λ‖f‖₁

and an iterative optimization algorithm such as ISTA:

f^(k+1) = S_{λα}( (I − α H^t H) f^(k) + α H^t g ),

where S_θ is a soft-thresholding operator and α ≤ 1/L, with L = max |eig(H^t H)| the Lipschitz constant of the gradient of the data term. When H is a convolution operator, then:
• (I − α H^t H) can also be approximated by a convolution and thus considered as a filtering operator;
• α H^t g can be considered as a bias term and is also obtained by a convolution operator; and
• S_{θ=λα} is a nonlinear point-wise operator. In particular, when f is a positive quantity, this soft-thresholding operator can be compared to the ReLU activation function of NNs. See Figure 20.
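A minimal sketch of ISTA viewed as a network layer, on a toy sparse problem (the sizes, λ, and iteration count are our own choices): linear weights W, a bias b, then the soft-threshold nonlinearity, repeated.

```python
import numpy as np

def soft(x, t):
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

rng = np.random.default_rng(4)
m, n = 60, 100
H = rng.normal(size=(m, n)) / np.sqrt(m)
f_true = np.zeros(n)
f_true[[5, 40, 77]] = [2.0, -1.5, 1.0]        # sparse ground truth
g = H @ f_true + 0.01 * rng.normal(size=m)

lam = 0.05
alpha = 1.0 / np.linalg.norm(H.T @ H, 2)      # 1 / Lipschitz constant
W = np.eye(n) - alpha * (H.T @ H)             # shared "filter" weights
b = alpha * (H.T @ g)                         # shared bias term
f = np.zeros(n)
for _ in range(200):
    f = soft(W @ f + b, lam * alpha)          # S_{lambda*alpha}, ReLU-like
```

The iterations recover a sparse estimate supported on the few true nonzeros, illustrating the layer structure (filter, bias, point-wise nonlinearity).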


In all these examples, we could obtain the structure of the NN directly from the forward model and known parameters. However, in these approaches, there are some difficulties concerning the determination of the structure of the NN. For example, in the first example, obtaining the structure of B depends on the regularization parameter λ. The same difficulty arises for determining the shape and the threshold level of the thresholding block of the network in the second example. The regularization parameter, as well as many other hyperparameters, is necessary to create the NN structure and weights. In practice, we can decide, for example, on the number and structure of the layers of a DL network; however, as their corresponding weights depend on many unknown or difficult-to-fix parameters, ML may become of help. In the following, we first consider the training part of a general ML method. Then, we see how to include the physics-based knowledge of the forward model in the structure of learning.

ML General Approach
The ML approach can become helpful if we have a great amount of data: input-output examples {(f, g)_k, k = 1, 2, ..., K}. Then, during the training step, we can learn the coefficients of the NN and use it for obtaining a new solution f for new data g.
The main issue is the number of input-output examples {(f, g)_k, k = 1, 2, ..., K} we can have for the training step of the network.


Fully Learned Method
Let us consider a one-layer NN where the relation between its input g_k and output f_k is given by f_k = φ(W g_k), where W contains the weighting parameters of the NN and φ is the point-wise nonlinearity of the output layer. The estimation of W from the training data in the learning step is done by an optimization algorithm, which optimizes a loss function L defined, for example, with a quadratic distance as

L(W) = ∑_k ‖f_k − φ(W g_k)‖²,

or with any other appropriate distance, divergence, or probabilistic criterion. When the NN is trained and we have obtained the weights Ŵ, we can use it easily when a new case (test data g_j) appears, just by applying f̂_j = φ(Ŵ g_j). These two steps of training and using (also called testing) are illustrated in Figure 21.
The scheme that we presented is general and can be extended to any multi-layer NN and DL. In fact, if we had a great number of data/ground-truth examples {(f, g)_k, k = 1, 2, ..., K}, with K much larger than the number of elements W_{m,n} of the weighting parameters W, then we would not even need the forward model H. This can be possible for very low-dimensional problems [67][68][69][70]. However, in general, in practice, we do not have enough data, so some prior or regularizer is needed to obtain a usable solution. This can be done just by adding a regularizer R(W) to the loss functions (25) and (26), but we can also use the physics of the forward operator H.
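A sketch of the fully learned setup with φ = identity and toy sizes (all our own choices): learn W from K input-output pairs by minimizing the quadratic loss, then apply it to new data. Closed-form least squares stands in for SGD training.

```python
import numpy as np

rng = np.random.default_rng(5)
m, n, K = 30, 15, 2000                        # K much larger than m*n weights
H = rng.normal(size=(m, n)) / np.sqrt(m)      # used only to simulate data
F = rng.normal(size=(K, n))                   # ground-truth examples f_k
G = F @ H.T + 0.01 * rng.normal(size=(K, m))  # simulated data g_k = H f_k + noise

# Training: minimize sum_k ||f_k - W g_k||^2 in closed form.
W_hat = np.linalg.lstsq(G, F, rcond=None)[0].T

# Testing: invert a previously unseen measurement.
f_new = rng.normal(size=n)
g_new = H @ f_new + 0.01 * rng.normal(size=m)
f_rec = W_hat @ g_new
```

Note that H is used only to generate the training pairs; the learned W never sees it explicitly, which is exactly the fully learned (and data-hungry) regime described above.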


Physics-Based ML
As mentioned above, in practice, a rich-enough and complete data set is often not available, particularly for inverse problems. So, some prior or regularizer is needed to obtain a usable solution. Adding a regularizer R(W) to the loss function (25) is good but not enough: we also have to use the physics of the forward operator H. This can be done in different ways.

Decomposition of the NN Structure to Fixed and Trainable Parts
The first easiest and understandable method consists in decomposing the structure of the network W in two parts: a fixed part and a learnable part. As the simplest example, we can consider the case of analytical expression of the quadratic regularization: f = (H H t + λDD t ) −1 H t g = BH t g which suggests to have a two layers network with a fixed part structure H t and a trainable one B = (H H t + λDD t ) −1 . See Figure 22. The scheme that we presented is general and can be extended to any multi-layer NN and DL. In fact, if we had a great number of data-ground-truth examples {( f , g) k , k = 1, 2, . . . , K} with K much more than the number of elements W m,n of the weighting parameters W, then, we would not even have any need for forward model H. This can be possible for very low dimensional problems [67][68][69][70]. However, in general, in practice, we do not have enough data. So, some prior or regularizer is needed to obtain a usable solution. This can be done just by adding a regularizer R(W) to the loss function (25) and (26), but we can also use the physics of the forward operator H.
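A sketch of this fixed/trainable split on a toy problem (sizes of our own choosing): the back-projection layer H^t is fixed by the physics, and only B, mapping H^t g to f, is learned from examples.

```python
import numpy as np

rng = np.random.default_rng(8)
m, n, K = 30, 20, 1000
H = rng.normal(size=(m, n)) / np.sqrt(m)
F = rng.normal(size=(K, n))                   # training images f_k
G = F @ H.T + 0.01 * rng.normal(size=(K, m))  # training data g_k

Z = G @ H                                      # fixed layer: z_k = H^t g_k
B = np.linalg.lstsq(Z, F, rcond=None)[0].T     # trainable layer, here by LS

f_new = rng.normal(size=n)                     # testing on a new case
z_new = H.T @ (H @ f_new + 0.01 * rng.normal(size=m))
f_rec = B @ z_new
```

Compared with the fully learned method, far fewer examples are needed, since the physics (H^t) is built in rather than learned.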


Decomposition of the NN Structure to Fixed and Trainable Parts
The first, easiest, and understandable method consists in decomposing the structure of the network W in two parts: a fixed part and a learnable part. As the simplest example, we can consider the case of the analytical expression of the quadratic regularization: f = (H H t + λDD t ) −1 H t g = BH t g which suggests to have a two-layer network with a fixed part structure H t and a trainable one B = (H H t + λDD t ) −1 . See Figure 22. It is interesting to note that in X-ray Computed Tomography (CT) the forward operator H is called Projection, the adjoint operator H t is called Back-Projection (BP) and the B operator is assimilated to a 2D filtering (convolution).

Using Singular-Value Decomposition of the Forward and Backward Operators
Another option is to use the singular-value decomposition of the pseudo- or generalized-inverse operator, which can be written (reconstructed here in a generic form) as

W = V Λ U H^t,

where Λ is a diagonal matrix containing the singular values and U and V contain the corresponding singular vectors. This decomposes W into four operators, three of which can be fixed, with only Λ trainable. It is interesting to know that, when the forward operator H has a shift-invariant (convolution) property, the operators U and V correspond, respectively, to the FT and IFT operators, and the diagonal elements of Λ correspond to the FT of the impulse response of the convolution forward operator. So, we have a fixed layer corresponding to H^t, which can be interpreted as a matched filtering; a fixed FT layer, which is a feed-forward linear network; a trainable filtering part corresponding to the diagonal elements of Λ; and a fourth fixed layer corresponding to the IFT. See Figure 23.
Figure 23. A four-layer NN with three physics-based fixed layers, corresponding to H^t, U (or FT), and V (or IFT), and one trainable layer corresponding to Λ.
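A sketch of the four-layer decomposition W = V Λ U H^t for a convolutional H: U = FFT and V = IFFT are fixed, and only the diagonal Λ (one complex gain per frequency) is learned. The kernel, the band-limited training signals, and the least-squares fit are illustrative choices of ours; a Wiener-like filter emerges from the data.

```python
import numpy as np

rng = np.random.default_rng(6)
n, K = 64, 200
h = np.roll(np.exp(-0.5 * (np.arange(n) - n // 2) ** 2 / 4.0), -n // 2)
h /= h.sum()
Hf = np.fft.fft(h)                            # convolution operator in Fourier

# Band-limited random training signals f_k and blurred, noisy data g_k.
mask = np.abs(np.fft.fftfreq(n)) < 0.1
F = np.real(np.fft.ifft(np.fft.fft(rng.normal(size=(K, n)), axis=1) * mask, axis=1))
G = np.real(np.fft.ifft(Hf * np.fft.fft(F, axis=1), axis=1))
G += 0.01 * rng.normal(size=G.shape)

# Fixed layers: matched filter H^t, then FFT; learn Lambda per frequency.
Z = np.conj(Hf) * np.fft.fft(G, axis=1)       # U H^t g_k for all k
Ff = np.fft.fft(F, axis=1)
Lam = np.sum(np.conj(Z) * Ff, axis=0) / (np.sum(np.abs(Z) ** 2, axis=0) + 1e-12)

f_rec = np.real(np.fft.ifft(Lam * Z[0]))      # fixed IFFT output layer
```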

DL Structure Based on an Iterative Inversion Algorithm
The iterative gradient-based algorithm with a fixed number of iterations, used for computing a generalized inverse (GI) or a regularized one as explained in the previous section, can be used to propose a DL structure with K layers, K being the number of iterations before stopping. Figure 24 shows this structure for a quadratic regularization, which results in a linear NN, and Figure 25 shows the case of ℓ1 regularization.


Figure 24. A K-layer DL NN equivalent to K iterations of a basic gradient-based optimization algorithm. A quadratic regularization results in a linear NN, while an ℓ1 regularization results in a classical NN with a nonlinear activation function. Left: supervised case. Right: unsupervised case. In both cases, all the K layers have the same structure.
Figure 25. All the K layers of a DL NN equivalent to K iterations of an iterative gradient-based optimization algorithm. The simplest solution is to choose W^(0) = αH^t and W^(k) = W = (I − αH^t H), k = 1, ..., K. A more robust, but more costly, option is to learn all the layers W^(k) = (I − α^(k) H^t H), k = 1, ..., K.
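This equivalence can be checked by construction on a toy random H: the K-layer network with shared weights W = I − αH^t H and bias b = αH^t g computes exactly the same output as K iterations of the gradient algorithm.

```python
import numpy as np

rng = np.random.default_rng(7)
m, n, K = 25, 20, 30
H = rng.normal(size=(m, n)) / np.sqrt(m)
g = rng.normal(size=m)
alpha = 1.0 / np.linalg.norm(H.T @ H, 2)

W = np.eye(n) - alpha * (H.T @ H)             # shared layer weights
b = alpha * (H.T @ g)                         # shared bias term

f_iter = np.zeros(n)                          # (a) the iterative algorithm
for _ in range(K):
    f_iter = f_iter + alpha * H.T @ (g - H @ f_iter)

f_net = np.zeros(n)                           # (b) the K-layer network
for _ in range(K):
    f_net = W @ f_net + b

print(np.allclose(f_iter, f_net))             # True
```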


Conclusions and Challenges
Signal and image processing (SIP), imaging systems (IS), computer vision (CV), machine learning (ML), and artificial intelligence (AI) have made great progress in the last forty years. The first generation of methods in signal and image processing was based on a linear transformation, followed by thresholding or windowing, and then transforming back. The second generation was model-based: the forward-modeling and inverse-problems approach. The main successful approach was based on regularization methods using a combined criterion. The third generation was model-based but probabilistic, in particular using the Bayes rule: the so-called Bayesian approach.
Classical methods for inverse problems are mainly based on regularization methods, particularly those that are based on the optimization of a criterion with two parts: a data-model matching part and a regularization term. A great number of methods have been proposed for choosing these two parts and proposing appropriate optimization algorithms. A Bayesian Maximum A Posteriori (MAP) interpretation for these regularization methods can be given, where these two terms correspond to the likelihood and prior probability models, respectively.
The Bayesian approach gives more flexibility in different aspects: (i) these terms, and in particular the prior term, can be chosen via hierarchical models and hidden variables; (ii) a more extended class of prior models can be obtained, particularly via hierarchical prior models; (iii) the regularization parameter, and more generally all the hyperparameters, can also be estimated; (iv) all the uncertainties are accounted for, and all the remaining uncertainties can be evaluated.
However, the Bayesian computations can become very heavy, particularly when we want to compute the uncertainties (variances and covariances) and when we also want to estimate the hyperparameters. Recently, machine-learning (ML) methods have become a good help for some of these difficulties.
Nowadays, ML, neural networks (NN), convolutional NN (CNN), and deep-learning (DL) methods have achieved great success in classification, clustering, object detection, speech and face recognition, etc. However, they need a great number of training data, still lack explainability, and may fail very easily.
For inverse problems, these methods still need progress. In fact, using only a data-driven NN, without any specific structure coming from the forward model (physics), is just an illusion. However, progress arrives via their interaction with model-based methods. In fact, the success of CNN and DL methods greatly depends on the appropriate choice of the network structure, and this choice can be guided by the model-based methods.
In this work, a few examples of such interactions were described. As we could see, the main contribution of ML and NN tools can be in reducing the costs of the inversion method when an appropriate model is trained. However, to obtain a good model, there is a need for sufficiently rich data and a good network structure obtained from the physics-based knowledge of the problem at hand.
For inverse problems, when the forward models are non-linear and complex, NN and DL may be of great help. However, we may still need to choose the structure of the NN via an approximate forward model and approximate Bayesian inversion.