1. Introduction
Ill-posed inverse problems represent a major challenge to research in many disciplines. They arise, e.g., when one wishes to reconstruct the shape of an astronomical object after its light has passed through a turbulent atmosphere and an imperfect telescope, or when we image the interior of a patient's skull during a computed tomography scan. In a more theoretical setting, extracting spectral functions from numerical simulations of strongly correlated quantum fields constitutes another example. The common difficulty among these tasks lies in the fact that we do not have direct access to the quantity of interest (from here on referred to as ρ) but instead only to a distorted representation of it, measured in our experiment (from here on denoted by D). Extracting ρ from D, in general, requires us to solve an inherently nonlinear optimization problem, which we construct and discuss in detail below.
Let us consider the common class of inverse problems, where the quantity of interest ρ and the measured data D are related via an integral convolution

      D(τ) = ∫ dμ K(τ, μ) ρ(μ),    (1)

with a kernel function K(τ, μ). For the sake of simplicity, let us assume (as is often the case in practice) that the function K is exactly known. The task at hand is to estimate the function ρ that underlies the observed D. The ill-posedness (and ill-conditioning) of this task is readily spotted if we acknowledge that our data comes in the form of N_τ discrete estimates D_i = D(τ_i) + η_i of the true function D(τ), where η denotes a source of noise. In addition, we need to approximate the integral in some form for numerical treatment. In its simplest form, writing it as a sum over N_μ bins, we obtain

      D_i = Σ_{l=1}^{N_μ} Δμ_l K(τ_i, μ_l) ρ(μ_l) ≡ Σ_{l=1}^{N_μ} K_il ρ_l.    (2)
At this point, we are asked to estimate N_μ optimal parameters ρ_l from N_τ < N_μ data points, which themselves carry uncertainty. A naive χ² fit of the ρ_l's is of no use, since it would produce an infinite number of degenerate solutions, which all reproduce the set of D_i's within their error bars. Only if we introduce an appropriate regularization can the problem be made well-posed, and it is this regularization which in general introduces nonlinearities.
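To make the discretization of (2) concrete, the following minimal Python sketch builds a kernel matrix and synthetic data points. The exponential kernel K(τ, μ) = exp(−μτ), the grid choices, the τ_i values, and the mock ρ are illustrative placeholders assumed for this sketch and are not quantities taken from this manuscript.

```python
import numpy as np

# Discretize D_i = sum_l Dmu_l K(tau_i, mu_l) rho_l, Eq. (2), for an assumed
# exponential kernel K(tau, mu) = exp(-mu*tau). All numbers are illustrative.
N_tau, N_mu = 3, 500                                 # few data points, many parameters
tau = np.array([0.1, 0.2, 0.3])                      # hypothetical measurement points
mu, dmu = np.linspace(0.0, 25.0, N_mu, retstep=True)

K = dmu * np.exp(-np.outer(tau, mu))                 # K_il = Dmu * K(tau_i, mu_l)

rho_true = np.full(N_mu, 1e-2)                       # flat background ...
rho_true[np.argmin(np.abs(mu - 5.0))] += 1.0         # ... plus one sharp peak

D_ideal = K @ rho_true                               # noiseless data, Eq. (2)
D = D_ideal * (1.0 + 1e-3 * np.random.randn(N_tau))  # add Gaussian noise eta_i
```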
Bayesian inference represents one way to regularize the inversion task. It provides a systematic procedure for how additional (so-called prior) knowledge can be incorporated to that effect. Bayes theorem

      P[ρ|D, I] = P[D|ρ, I] P[ρ|I] / P[D|I]    (3)

states that the posterior probability P[ρ|D, I] for some set of parameters ρ_l to be the true solution of the inversion problem is given by the product of the likelihood probability P[D|ρ, I] and the prior probability P[ρ|I]. The ρ-independent normalization P[D|I] is often referred to as the evidence. Assuming that the noise η is Gaussian, we may write the likelihood as

      P[D|ρ, I] ∝ exp[−L],   L = (1/2) Σ_{i,j=1}^{N_τ} (D_i − D_i^ρ) C^{-1}_ij (D_j − D_j^ρ),    (4)

where C_ij is the unbiased covariance matrix of the measured data with respect to the true mean and D_i^ρ = Σ_l K_il ρ_l refers to the synthetic data that one obtains by inserting the current set of ρ_l parameters into (2). A χ² fit would simply return one of the many degenerate extrema of L, hence being referred to as a maximum likelihood fit.
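As a small illustration of (4), the Gaussian likelihood exponent L can be evaluated as in the sketch below; the kernel matrix K, data D, and inverse covariance Cinv are assumed to be supplied by the user, and the function is an illustration rather than code from this manuscript.

```python
import numpy as np

def likelihood_L(rho, K, D, Cinv):
    """L = 1/2 (D - D^rho)^T C^{-1} (D - D^rho) with synthetic data D^rho = K rho."""
    r = D - K @ rho                     # residual between measured and synthetic data
    return 0.5 * r @ Cinv @ r

# Example for uncorrelated errors: Cinv = np.diag(1.0 / dD**2) with error bars dD.
```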
The important ingredient of Bayes theorem is the presence of the prior probability, often expressed in terms of a regulator functional S,

      P[ρ|I(m, α)] ∝ exp[S].    (5)

It is here where pertinent domain knowledge can be encoded. For the study of intensity profiles of astronomical objects and hadronic spectral functions, for example, it is a priori known that the values of ρ must be positive. Depending on which type of information one wishes to incorporate, the explicit form of S will be different. It is customary to parameterize the shape of the prior distribution by two types of hyperparameters, the default model m(μ) and a confidence function α(μ). The discretized m_l represents the maximum of the prior P[ρ_l|I] for each parameter ρ_l and each α_l its corresponding spread.
Once both the likelihood and prior probability are set, we may search for the maximum a posteriori (MAP) point estimate of the ρ_l's via

      ρ^Bayes = argmax_ρ P[ρ|D, I],   i.e.   δP[ρ|D, I]/δρ |_{ρ = ρ^Bayes} = 0.    (6)

ρ^Bayes constitutes the most probable set of parameters given our data and prior knowledge. (In the case that no data is provided, the Bayesian reconstruction will simply reproduce the default model m.) Note that since the exponential function is monotonic, instead of finding the extremum of P[ρ|D, I], in practice, one often considers the extremum of its exponent Q ≡ S − L directly.
The idea of introducing a regularization in order to meaningfully estimate the most probable set of parameters underlying observed data has a long history. As early as 1919 [1], it was proposed to combine what we here call the likelihood with a smoothing regulator. Let us have a look at three choices of regulators from the literature: the historic Tikhonov (TK) regulator [2] (1943–), the Shannon–Jaynes entropy deployed in the Maximum Entropy Method (MEM) [3,4] (1986–), and the more recent Bayesian Reconstruction (BR) method [5] regulator (2013–), respectively,

      S_TK = −(1/2) ∫ dμ α(μ) ( ρ(μ) − m(μ) )²,    (7)
      S_SJ = ∫ dμ α(μ) ( ρ(μ) − m(μ) − ρ(μ) log[ρ(μ)/m(μ)] ),    (8)
      S_BR = ∫ dμ α(μ) ( 1 − ρ(μ)/m(μ) + log[ρ(μ)/m(μ)] ).    (9)
Both S_SJ and S_BR are axiomatically constructed, incorporating the assumption of positivity of the function ρ. The assumption manifests itself via the presence of a logarithmic term that forces ρ to be positive-semidefinite in the former and positive-definite in the latter case. It is this logarithm that is responsible for the numerical optimization problem (6) becoming genuinely nonlinear.
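For concreteness, the three regulators (7)–(9) can be discretized as simple sums over the μ-bins, e.g., as in the sketch below; the variable names and the placement of the Δμ_l factors are choices of this illustration.

```python
import numpy as np

# Discretized regulators of Eqs. (7)-(9); m = default model, alpha = confidence
# function, dmu = bin widths (arrays of length N_mu, or scalars via broadcasting).

def S_TK(rho, m, alpha, dmu):
    return -0.5 * np.sum(dmu * alpha * (rho - m) ** 2)

def S_SJ(rho, m, alpha, dmu):           # requires rho > 0 (positivity built in)
    return np.sum(dmu * alpha * (rho - m - rho * np.log(rho / m)))

def S_BR(rho, m, alpha, dmu):           # requires rho > 0
    return np.sum(dmu * alpha * (1.0 - rho / m + np.log(rho / m)))
```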
Note that all three functions are concave, which (as proven for example in [6]) guarantees that if an extremum of P[ρ|D, I] exists, it is unique—i.e., within the N_μ-dimensional solution space spanned by the discretized parameters ρ_l, in the case that a Bayesian solution exists, we will be able to locate it with standard numerical methods in a straightforward fashion.
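Because of this uniqueness, the MAP point of (6) can in principle be searched for with a standard optimizer over the full N_μ-dimensional space. A minimal sketch is given below; it assumes a regulator with the signature of the helpers shown above and uses scipy's L-BFGS-B routine as one possible optimizer choice, not necessarily the one used for any results quoted in this manuscript.

```python
import numpy as np
from scipy.optimize import minimize

def map_estimate(K, D, Cinv, m, alpha, dmu, S, rho0=None):
    """Find the extremum of Q = S - L, Eq. (6), by minimizing -Q = L - S.
    For S_SJ or S_BR an additional positivity constraint on rho is required,
    e.g., via bounds or the substitution rho = exp(x)."""
    def neg_Q(rho):
        r = D - K @ rho
        return 0.5 * r @ Cinv @ r - S(rho, m, alpha, dmu)
    x0 = m.copy() if rho0 is None else rho0
    return minimize(neg_Q, x0, method="L-BFGS-B").x
```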
  2. Diagnosis of the Problem in Bryan’s MEM
In this section, we investigate the consequences of the choice of regularization on the determination of the most probable spectrum. The starting point is the fully linear Tikhonov regularization, on which a lot of the intuition in the treatment of inverse problems is built. We then continue to the Maximum Entropy Method, which amounts to a genuinely nonlinear regularization, and point out how arguments that were valid in the linear case fail in the nonlinear context.
  2.1. Tikhonov Regularization
The Tikhonov choice amounts to a Gaussian prior probability, which allows ρ to take on both positive and negative values. The default model m(μ) denotes the value for ρ(μ) which was most probable before the arrival of the measured data D (e.g., from a previous experiment), and α(μ) represents our confidence in this prior knowledge (e.g., the uncertainty of the previous experiment).
Since both (4) and (7) are at most quadratic in ρ, taking the derivative in (6) leads to a set of linear equations that need to be solved to compute the Bayesian optimal solution ρ^Bayes. It is this fully linear scenario from which most intuition is derived when it comes to the solution space of the inversion task. Indeed, taking for notational simplicity a constant confidence function α(μ) = α (a single hyperparameter, as in Bryan's approach), we are led to the following relations

        α (ρ_l − m_l) = −Σ_{i=1}^{N_τ} K_il ∂L/∂D_i^ρ,   i.e.   α (ρ − m) = −K^t (∂L/∂D^ρ),    (10)

which can be written solely in terms of linear vector-matrix operations. Note that in this case ∂L/∂D^ρ = C^{-1}(D^ρ − D) contains the vector ρ in a linear fashion. (10) invites us to parameterize the function ρ via its deviation from the default model

        ρ = m + a,    (11)

and to look for the optimal set of parameters a_l. Here, we may safely follow Bryan [7] and investigate the singular values of K^t = U Σ V^t, with U being an N_μ × N_μ special orthogonal matrix, Σ an N_μ × N_τ matrix with N_τ nonvanishing diagonal entries, corresponding to the singular values of K^t, and V being an N_τ × N_τ special orthogonal matrix. We are led to the expression

        a = −(1/α) U Σ V^t (∂L/∂D^ρ),    (12)

which tells us that in this case, the solution of the Tikhonov inversion lies in a functional space spanned by the first N_τ columns of the matrix U (usually referred to as the SVD or singular subspace) around the default model m—i.e., we can parameterize ρ as

        ρ_l = m_l + Σ_{j=1}^{N_τ} c_j U_lj.    (13)
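Because the stationarity condition (10) is linear in ρ, the Tikhonov MAP point can even be obtained from a single linear solve. The short sketch below illustrates this; it assumes a constant α with any Δμ_l factors absorbed into it and user-supplied K, D, Cinv, m, and it is an illustration rather than code from this manuscript.

```python
import numpy as np

def tikhonov_map(K, D, Cinv, m, alpha):
    """Solve the stationarity condition of Q_TK = S_TK - L, i.e. Eq. (10) rearranged
    as the linear system (K^t C^-1 K + alpha*1) rho = K^t C^-1 D + alpha*m."""
    N_mu = K.shape[1]
    lhs = K.T @ Cinv @ K + alpha * np.eye(N_mu)
    rhs = K.T @ Cinv @ D + alpha * m
    return np.linalg.solve(lhs, rhs)
```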
The point here is that if we add to this SVD space parametrization any further column of the matrix U, it directly projects into the null space of K via K U^(k) = 0 for k > N_τ. In turn, such a column does not contribute to computing synthetic data via (2). As was pointed out in [8], in such a linear scenario, the SVD subspace is indeed all there is. If we add extra columns of U to the parametrization of ρ, these do not change the likelihood. Thus, the corresponding parameter c_k of that column is not constrained by data and will come out as zero in the optimization procedure of (6), as it encodes a deviation from the default model, which the regulator S_TK drives to zero.
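This projection property is easy to verify numerically: for any kernel matrix with N_τ < N_μ, the columns of U beyond the first N_τ are annihilated by K. A small self-contained check, with an assumed exponential kernel and arbitrary grid choices:

```python
import numpy as np

# Columns U[:, k] with k >= N_tau lie in the null space of K, so adding them to the
# linear parametrization (13) cannot change the synthetic data of Eq. (2).
N_tau, N_mu = 3, 200
tau = np.array([0.1, 0.2, 0.3])                       # illustrative choices
mu, dmu = np.linspace(0.0, 25.0, N_mu, retstep=True)
K = dmu * np.exp(-np.outer(tau, mu))

U, sv, Vt = np.linalg.svd(K.T)                        # full SVD, U is N_mu x N_mu
print(np.max(np.abs(K @ U[:, N_tau:])))               # numerically zero (machine precision)
```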
  2.2. Maximum Entropy Method
Now that we have established that in a fully linear context the arguments based on the SVD subspace are indeed justified, let us continue to the Maximum Entropy Method, which deploys the Shannon–Jaynes entropy as regulator. S_SJ encodes as prior information, e.g., the positivity of the function ρ, which manifests itself in the presence of the logarithm in (8). This logarithm, however, also entails that we are now dealing with an inherently nonlinear optimization problem in (6). Carrying out the functional derivative with respect to ρ on the exponent S_SJ − L, we are led to the following relation

        α log[ρ_l/m_l] = −Σ_{i=1}^{N_τ} K_il ∂L/∂D_i^ρ.    (14)
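The only new ingredient compared to the Tikhonov case is the entropic derivative ∂S_SJ/∂ρ_l = −α_l log(ρ_l/m_l) entering (14). The following quick finite-difference check of this expression (with Δμ_l factors absorbed into α_l and all numbers arbitrary) is a sketch added for illustration:

```python
import numpy as np

# Finite-difference check of dS_SJ/drho_l = -alpha_l * log(rho_l / m_l).
rng = np.random.default_rng(0)
N_mu, l, eps = 50, 7, 1e-6
m = np.full(N_mu, 0.1)
alpha = np.full(N_mu, 2.0)
rho = m * np.exp(rng.normal(scale=0.3, size=N_mu))    # arbitrary positive test point

def S_SJ(r):
    return np.sum(alpha * (r - m - r * np.log(r / m)))

e_l = np.eye(N_mu)[l]                                 # unit vector in direction l
numeric = (S_SJ(rho + eps * e_l) - S_SJ(rho - eps * e_l)) / (2 * eps)
print(numeric, -alpha[l] * np.log(rho[l] / m[l]))     # the two values agree
```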
As suggested by Bryan [7], let us introduce the completely general (since ρ_l > 0) and nonlinear redefinition of the parameters

        ρ_l = m_l exp[a_l].    (15)

Inserting (15) into (14), we are led to an expression that is formally quite similar to the result obtained in the Tikhonov case

        α a_l = −Σ_{i=1}^{N_τ} K_il ∂L/∂D_i^ρ,   i.e.   α a = −K^t (∂L/∂D^ρ).    (16)
While at first sight this relation also appears amenable to being written as a linear relation for a, it is actually fundamentally different from (10), since, due to its nonlinear nature, a enters the synthetic data D^ρ, and thus ∂L/∂D^ρ, via the componentwise exponentiation ρ = m exp[a]. It is here, when attempting to make a statement about such a nonlinear relation with the tools of linear algebra, that we run into difficulties. What do I mean by that? Let us push ahead and introduce the SVD decomposition of the transposed kernel as before,

        a = −(1/α) U Σ V^t (∂L/∂D^ρ).    (17)

At first sight, this relation seems to imply that the vector a—which encodes the deviation from the default model (this time multiplicatively)—is restricted to the SVD subspace, spanned by the first N_τ columns of the matrix U. My claim (as put forward most recently in [9]) is that this conclusion is false, since this linear-algebra argument is not applicable when working with (16). Let us continue to set up the corresponding SVD parametrization advocated, e.g., in [8],

        ρ_l = m_l exp[ Σ_{j=1}^{N_τ} c_j U_lj ].    (18)
In contrast to (13), the SVD space is not all there is to (18) (see explicit computations in Appendix A). This we can see by taking the k-th column of the matrix U with k > N_τ, exponentiating it componentwise, and applying it to the matrix K. In general, we get

        Σ_{l=1}^{N_μ} K_il exp[U_lk] ≠ 0.    (19)
This means that if we add additional columns of the matrix U to the parametrization in (18), they do not automatically project into the null-space of the kernel (see the explicit example in Appendix A of this manuscript) and thus will contribute to the likelihood. In turn, the corresponding parameter c_k related to that column will not automatically come out to be zero in the optimization procedure of (6). Hence, we cannot a priori disregard its contribution and thus the contribution of this direction of the search space, which is not part of the SVD subspace. We thus conclude that limiting the solution space in the MEM to the singular subspace amounts to an ad-hoc procedure, motivated by an incorrect application of linear-algebra arguments to a fully nonlinear optimization problem.
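The failure of the projection argument is again easy to check numerically: while K U^(k) vanishes for k > N_τ, the componentwise exponential of the same column is not annihilated by K. The sketch below uses the same illustrative kernel as before and is not the computation of Appendix A.

```python
import numpy as np

# Eq. (19): K applied to exp(U[:, k]) does NOT vanish for k >= N_tau, so such
# directions change the synthetic data and hence the likelihood.
N_tau, N_mu = 3, 200
tau = np.array([0.1, 0.2, 0.3])
mu, dmu = np.linspace(0.0, 25.0, N_mu, retstep=True)
K = dmu * np.exp(-np.outer(tau, mu))

U = np.linalg.svd(K.T)[0]
k = N_tau                                    # first column outside the SVD subspace
print(np.max(np.abs(K @ U[:, k])))           # numerically zero (linear statement)
print(np.max(np.abs(K @ np.exp(U[:, k]))))   # clearly nonzero (nonlinear statement)
```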
A representative example from the literature, where the nonlinear character of the parametrization of ρ is not taken into account, is the recent [8] (see Equations (7) and (12) in that manuscript). We emphasize that one does not apply a column of the matrix U itself to the matrix K but the componentwise exponentiation of this column. This operation does not project into the null-space.
In those cases where we have only a few pieces of reliable prior information and our datasets are limited, restricting to the SVD subspace may lead to significantly distorted results (as shown explicitly in [10]). On the other hand, its effect may be mild if the default model already encodes most of the relevant features of the final result and the number of datapoints is large, so that the SVD subspace is large enough to (accidentally) encompass the true Bayesian solution sought after in (6). Independent of the severity of the problem, the artificial restriction to the SVD subspace in Bryan's MEM is a source of systematic error, which needs to be accounted for when the MEM is deployed as a precision tool for inverse problems.
Being liberated from the SVD subspace does not lead to any conceptual problems either. We have brought to the table N_τ points of data, as well as N_μ points of prior information in the form of the default model m (as well as its uncertainty α). This is enough information to determine the N_μ parameters ρ_l uniquely, as proven in [6].
Recognizing that linear-algebra arguments fail in the MEM setup also helps us to understand some of the otherwise perplexing results found in the literature. If the singular subspace were all there is to the parametrization of ρ in (18), then it would not matter whether we use the first N_τ columns of U or just the N_τ columns of K^t directly. Both encode the same target space, the difference being only that the columns of U are orthonormal. However, as was clearly seen in Figure 28 of [11], using the SVD parametrization or the columns of K^t leads to significantly different results in the reconstructed features of ρ. If the MEM were a truly linear problem, these two parameterizations would give exactly the same result. The finding that the results do not agree emphasizes that the MEM inversion is genuinely nonlinear and the restriction to the SVD subspace is ad hoc.
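That the two parameterizations span the same linear space is straightforward to confirm: stacking the first N_τ columns of U next to the columns of K^t does not increase the numerical rank, so any difference between the resulting MEM reconstructions must originate in the nonlinear exponential map. The check below uses the same illustrative kernel as in the previous sketches.

```python
import numpy as np

# span{U^(1..N_tau)} equals the column space of K^t, so in a truly linear problem
# the two parameterizations would be strictly equivalent.
N_tau, N_mu = 3, 200
tau = np.array([0.1, 0.2, 0.3])
mu, dmu = np.linspace(0.0, 25.0, N_mu, retstep=True)
K = dmu * np.exp(-np.outer(tau, mu))

U = np.linalg.svd(K.T)[0][:, :N_tau]
print(np.linalg.matrix_rank(np.hstack([U, K.T])))   # expected to stay N_tau (= 3)
```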
  2.3. Numerical Evidence for the Inadequacy of the SVD Subspace
Let us construct an explicit example to illustrate the fact that the solution of the MEM reconstruction may lie outside of the SVD search space. Since Bryan’s derivation of the SVD subspace proceeds independently of the particular form of the kernel K, the provided data D, and the choice of the default model m, we are free to choose them at will. For our example, we consider a transform often encountered among inverse problems related to the physics of strongly correlated quantum systems.
One then has the discretized kernel matrix K_il = Δμ_l K(τ_i, μ_l), and we may set Δμ_l = Δμ. With Δμ entering simply as a scaling factor in (16), we do not consider it further in the following. Let me emphasize again that the arguments leading to (16) did not make any reference to the data we wish to reconstruct. Here, we will consider three datapoints that encode a single delta peak, embedded in a flat background.
Now, let us discretize the frequency domain between μ_min and μ_max with N_μ points. Together with the choice of the three points τ_1, τ_2, and τ_3, this fully determines the kernel matrix K_il in (2). Three different mock functions ρ^(1), ρ^(2), and ρ^(3) are considered, with the delta peak placed at three successively higher positions in μ; the background is assigned a small constant magnitude (see Figure 1 left).
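A Python sketch of this mock setup is given below (the manuscript's own computation, in Mathematica, is listed in Appendix B). All grid values, τ_i points, and peak positions in the sketch are placeholders rather than the values used for Figures 1 and 2, and the exponential kernel is an assumption of this illustration.

```python
import numpy as np

# Mock setup: exponential kernel on a discretized mu-grid, three spectra consisting
# of a flat background plus a single delta-like peak, observed at three tau points.
N_tau, N_mu = 3, 200
tau = np.array([0.1, 0.2, 0.3])                       # placeholder tau_i
mu, dmu = np.linspace(0.0, 25.0, N_mu, retstep=True)  # placeholder mu-grid
K = dmu * np.exp(-np.outer(tau, mu))                  # K_il = Dmu * exp(-mu_l tau_i)

def mock_rho(mu_peak, background=1e-2):
    rho = np.full(N_mu, background)                   # flat background
    rho[np.argmin(np.abs(mu - mu_peak))] += 1.0       # delta-like peak in one bin
    return rho

rho_mocks = [mock_rho(p) for p in (5.0, 10.0, 15.0)]  # three placeholder peak positions
D_mocks = [K @ rho for rho in rho_mocks]              # three sets of ideal datapoints
```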
Bryan's argument states that in the presence of three datapoints (see Figure 1 right), irrespective of the data encoded in those datapoints, the extremum of the posterior must lie in the space spanned by the exponentiation of the first three columns of the matrix U, obtained from the SVD of the transposed kernel K^t. In Figure 2, its first three columns are explicitly plotted. Note that while they do show some peaked behavior around the origin, they quickly flatten off at larger μ. From this inspection by eye, it already follows that it will be very difficult to linearly combine U^(1), U^(2), and U^(3) into a sharply peaked function, especially for a peak located far away from the origin.
Assuming that the data comes with a constant relative error ΔD_i/D_i, let us find out how well we can reproduce it within Bryan's search space. A minimization carried out by Mathematica (see the explicit code in Appendix B) tells us that the minimal value of L attainable within the SVD subspace remains large for each of the three mock functions—i.e., we clearly see that we are not able to reproduce the provided datapoints well and that the deviation becomes more and more pronounced the higher the delta peak is positioned along μ. In the full search space, on the other hand, we can always find a set of ρ_l's which brings the likelihood as close to zero as desired.
Minimizing the likelihood of course is not the whole story in a Bayesian analysis, which is why we have to take a look at the contributions from the regulator term S_SJ. Remember that by definition, it is negative. We find that for all three mock functions ρ^(i), Bryan's best fit of the likelihood leads to values of S_SJ that are modest in magnitude compared with the correspondingly large minimal values of L. On the other hand, a similarly modest value of S_SJ is obtained in the full search space for the parameter set ρ_l which contains only a single entry of unity, with all other entries set to the background value.
From the above inspection of the extremum of L and the associated value of S_SJ inside and outside of Bryan's search space, we arrive at the following: Due to the very limited structure present in the SVD basis functions U^(1), U^(2), and U^(3), it is in general not possible to obtain a good reconstruction of the input data. This in turn leads to a large minimal value of L accessible within the SVD subspace. Due to the fact that S_SJ cannot compensate for such a large value of L, we have constructed an explicit example where at least one set of ρ_l's (the one that brings L close to zero in the full search space) leads to a smaller value of L − S_SJ and thus to a larger posterior probability than any of the ρ_l's within the SVD subspace.
In other words, we have constructed a concrete example in which the MEM solution, given by the global extremum of the posterior, is not contained in the SVD subspace.
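The comparison described above can be reproduced in spirit with the following self-contained Python sketch: it minimizes L within Bryan's three-parameter search space (18) and evaluates Q = S_SJ − L there as well as for a full-space candidate that reproduces the data exactly. The kernel, grid, error model, and peak position are placeholder choices, so the numbers it prints will differ from those quoted in this manuscript, and it is not the Mathematica code of Appendix B.

```python
import numpy as np
from scipy.optimize import minimize

# Compare Q = S_SJ - L inside Bryan's SVD subspace with a full-space candidate.
N_tau, N_mu = 3, 200
tau = np.array([0.1, 0.2, 0.3])
mu, dmu = np.linspace(0.0, 25.0, N_mu, retstep=True)
K = dmu * np.exp(-np.outer(tau, mu))

m = np.full(N_mu, 1e-2)                                # default model = flat background
alpha = np.ones(N_mu)
rho_true = m.copy()
rho_true[np.argmin(np.abs(mu - 15.0))] += 1.0          # delta-like peak (placeholder)
D = K @ rho_true
Cinv = np.diag(1.0 / (0.01 * D) ** 2)                  # assumed 1% relative errors

def L(rho):                                            # Gaussian likelihood, Eq. (4)
    r = D - K @ rho
    return 0.5 * r @ Cinv @ r

def S(rho):                                            # Shannon-Jaynes entropy, Eq. (8)
    return np.sum(dmu * alpha * (rho - m - rho * np.log(rho / m)))

U = np.linalg.svd(K.T)[0][:, :N_tau]

def rho_bryan(c):                                      # Bryan's parametrization, Eq. (18)
    return m * np.exp(U @ c)

res = minimize(lambda c: L(rho_bryan(c)), np.zeros(N_tau), method="Nelder-Mead")
rho_svd = rho_bryan(res.x)

print("SVD subspace : L =", L(rho_svd),  " Q =", S(rho_svd)  - L(rho_svd))
print("full space   : L =", L(rho_true), " Q =", S(rho_true) - L(rho_true))
```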
  4. Summary and Conclusions
We have critically assessed in this paper the arguments underlying Bryan's MEM, which we show to be flawed, as they resort to methods of linear algebra when treating an inherently nonlinear optimization problem. Therefore, we conclude that even though the individual steps in the derivation of the SVD subspace are all correct, they do not apply to the problem at hand, and their conclusions can be disproved with a direct counterexample. The counterexample we provided utilizes the fact that the componentwise exponentiated columns of the matrix U do not project into the null-space of the kernel when computing synthetic data. After establishing the fact that the restriction to the SVD subspace is an ad-hoc procedure, we discussed possible ways to overcome it, suggesting either to systematically extend the search space within the MEM or to abandon the MEM in favor of one of the many modern Bayesian approaches developed over the past two decades.
In our ongoing work to improve methods for the Bayesian treatment of inverse problems, we focus on the development of more specific regulators. So far, the methods described above only make reference to very generic properties of the quantity ρ, such as smoothness and positivity. As a specific example, in the study of spectral functions in strongly correlated quantum systems, the underlying first-principles theory provides extra domain knowledge about admissible structures in ρ that so far is not systematically exploited. A further focus of our work is the extension of the separable regulators discussed in this work to fully correlated prior distributions that exploit cross-correlations between the individual parameters ρ_l.