Image Reconstruction in Diffuse Optical Tomography Using Adaptive Moment Gradient Based Optimizers: A Statistical Study

Diffuse optical tomography (DOT) is an emerging modality that reconstructs the optical properties of a highly scattering medium from measured boundary data. One way to solve DOT and recover the quantities of interest is an inverse problem approach, which requires choosing an optimization algorithm for the iterative approximation of the solution. However, by the no free lunch principle, no single optimizer can be expected to perform best on every problem. This paper compares the behavior of three gradient descent-based optimizers on the DOT inverse problem by running randomized simulations and analyzing the generated data, in order to shed light on any significant difference in performance among these optimizers, if one exists at all, in the specific context of DOT. The major practical problem when selecting or using an optimization algorithm in a production context for a DOT system is to be confident that the algorithm will have a high rate of convergence to the true solution, a reasonably fast speed, and a high quality of the reconstructed image in terms of good localization of the inclusions and good agreement with the true image. In this work, we used carefully designed randomized simulations to tackle the practical problem of choosing the right optimizer with the right parameters in the context of practical DOT applications, and derived statistical results concerning rate of convergence, speed, and quality of image reconstruction. The statistical analysis performed on the generated data and the main results for convergence rate, reconstruction speed, and quality for the three optimization algorithms are presented.


Introduction
In recent years, DOT has attracted growing interest because it offers many advantages. It is a non-invasive, non-ionizing, and inexpensive technique compared to other imaging modalities such as Magnetic Resonance Imaging (MRI) and X-ray [1][2][3]. DOT has been applied to detect breast tumors [4][5][6][7] and brain injuries [8,9], to image newborn infants' heads [10], and to provide important information about tissue metabolism. Solving the DOT problem involves addressing the radiative transfer equation (RTE), which describes light propagation in biological tissues [11,12]. However, the RTE does not have an analytical closed-form solution for complex geometries, and its numerical solution is computationally expensive. Since the diffusion approximation (DA) of the RTE is easy to implement, we use it as the forward model throughout this work.
It is a well known fact that the inverse problem in DOT is nonlinear and severely ill-posed. Gradient-based methods are commonly used to solve minimization problems in optical tomography [11].
In recent years, a number of new optimizers have been proposed to tackle the problem of convergence when there is insufficient prior knowledge to select a good learning rate. One of the most popular and practical techniques is to adaptively control the distance of each step. Adaptive moment estimation (Adam) is one of the first adaptive moment optimizers proposed in the literature and was presented by Diederik Kingma and Jimmy Ba [13]. It is a combination of the adaptive gradient algorithm (AdaGrad) [14] and Root Mean Square propagation (RMSProp) with momentum [15]. Adam is an efficient optimizer that requires only first-order gradients; it uses squared gradients to scale the learning rate and implements momentum by using a moving average of the gradient rather than the gradient itself. To cope with the shortcomings of Adam, mainly the lack of convergence guarantees, a number of variants of the Adam algorithm have recently been derived, such as Nesterov-accelerated Adaptive Moment Estimation (Nadam) [16] and the AmsGrad optimizer [17]. For more details, we refer the reader to [18,19].
In this work, we examine the convergence behavior of the Adam, Nadam, and AmsGrad optimizers when applied to the problem of DOT. These optimizers are compared and the results are discussed. We characterize the performance of these algorithms with respect to the choice of some hyperparameters and the initial guess error. To evaluate the quality of the reconstructed images in a quantitative manner, we use quality metrics such as the structural similarity index (SSIM) and the peak signal-to-noise ratio (PSNR).
The structure of this paper is as follows: In Section 2, we give an overview of the mathematical formulation of the diffusion approximation in continuous wave (CW) cases. In Section 3, we describe the inverse problem and the algorithms we use to reconstruct the absorption coefficient of DOT. In Section 4, we show the results of our statistical analysis of the simulation data. We present conclusions in Section 5.

Forward Problem
In this section, we describe the mathematical formulation of the diffusion approximation (DA).
Let $\Omega \subset \mathbb{R}^n$, $n = 2, 3$, be our domain of interest, and $\partial\Omega$ the boundary of $\Omega$. Then, the DA inside the domain $\Omega$ satisfies the partial differential equation
$$-\nabla \cdot \big(D(r)\nabla\Phi(r)\big) + \mu_a(r)\Phi(r) = 0, \quad r \in \Omega \tag{1}$$
with the Robin boundary condition
$$\Phi(r) + 2aD(r)\,\frac{\partial\Phi(r)}{\partial\hat{n}} = S(r), \quad r \in \partial\Omega \tag{2}$$
where $\Phi(r)$ is the photon density and $D(r)$ is the diffusion coefficient defined by $D(r) = \frac{1}{3(\mu_a + \mu_s')}$. Here, $a$ is the Fresnel reflection coefficient, which depends on the mismatch between the refractive indices; $\mu_a$ and $\mu_s$ are the absorption and scattering coefficients, respectively; and $\mu_s'$ is the reduced scattering coefficient expressed as $\mu_s' = (1 - g)\mu_s$, where $g$ is the anisotropy factor. $S(r)$ describes the boundary condition for the incoming radiation, and $\hat{n}$ is the outward normal vector to $\partial\Omega$.
We assume that the medium is highly scattering such that $\mu_a \ll \mu_s'$. The forward model (1) and (2) is solved by using the finite element method as described in [20].
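As a quick worked illustration of the definitions of the reduced scattering and diffusion coefficients, the background values used later in the simulations ($\mu_a = 0.01$ mm$^{-1}$, scattering value 2 mm$^{-1}$, $g = 0.9$) can be plugged in. This assumes that the quoted scattering value is the unreduced coefficient $\mu_s$; the text does not say which of the two is meant, so the numbers below are only indicative.

```latex
\mu_s' = (1 - g)\,\mu_s = (1 - 0.9)\times 2 = 0.2\ \text{mm}^{-1},
\qquad
D = \frac{1}{3(\mu_a + \mu_s')} = \frac{1}{3\,(0.01 + 0.2)} \approx 1.59\ \text{mm}
```

Under this assumption the ratio of absorption to reduced scattering is 0.01/0.2 = 0.05, which is consistent with the highly scattering regime required by the DA.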

Inverse Problem
The inverse problem we are interested in consists of determining the couple $(\mu_a, \mu_s)$ from the set of data $y_i$ such that
$$F_i(\mu_a, \mu_s) = y_i \tag{3}$$
where $F_i$ denotes the forward operator, which is assumed to be Fréchet differentiable, and $y_i$ the approximate measured data. In this study, we restrict our attention to the reconstruction of the absorption coefficient, and we assume that the distribution of the scattering coefficient is known. Then, the objective function can be written as follows:
$$J(\mu_a) = \frac{1}{2}\sum_i \big\| F_i(\mu_a) - y_i \big\|^2 \tag{4}$$
and the problem can be stated as the optimization problem
$$\min_{\mu_a} J(\mu_a) \tag{5}$$
Since the inverse problem is ill-posed, it requires regularization. A Total Variation regularization is applied [21]. By adding a regularization term, the cost function is formulated as
$$J_R(\mu_a) = J(\mu_a) + \lambda R(\mu_a) \tag{6}$$
where $R(\mu_a) = \|\mu_a - \mu_a^0\|^2$ is the regularization operator that enforces smoothness conditions in the solution, $\lambda$ is the regularization parameter, and $\mu_a^0$ denotes the initial guess. The forward operator $F_i$ is linearized around the initial guess $\mu_a^0$:
$$F_i(\mu_a) = F_i(\mu_a^0) + F_i'(\mu_a^0)(\mu_a - \mu_a^0) + W(\mu_a, \mu_a^0) \tag{7}$$
where $F_i'$ is the Fréchet derivative of the forward operator $F_i$, and $W$ denotes the Taylor remainder for the linearization around $\mu_a^0$. The gradient of the objective functional can be written as follows:
$$\nabla J_R(\mu_a) = \sum_i F_i'(\mu_a)^{*}\big(F_i(\mu_a) - y_i\big) + \lambda R'(\mu_a) \tag{8}$$
where $F_i'(\mu_a)^{*}$ is the adjoint of $F_i'(\mu_a)$ and $R'(\mu_a)$ is the Fréchet derivative of the regularization operator with respect to $\mu_a$.
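To make the structure of the regularized cost and its gradient concrete, the following is a minimal Python sketch for a least-squares data term with the penalty $\lambda\|\mu_a - \mu_a^0\|^2$. It is not the DOT forward operator: the names `cost_and_gradient`, `forward`, and `jacobian`, and the small exponential toy model, are hypothetical stand-ins chosen only so that the example runs; the factor 1/2 in the data term is also an assumption.

```python
import numpy as np

def cost_and_gradient(mu, forward, jacobian, y, mu0, lam):
    """Regularized least-squares cost J_R and its gradient.

    forward(mu)  -> predicted data vector (stand-in for the stacked F_i)
    jacobian(mu) -> matrix of partial derivatives of forward at mu
    """
    residual = forward(mu) - y
    cost = 0.5 * residual @ residual + lam * np.sum((mu - mu0) ** 2)
    # Transpose (adjoint) of the Jacobian applied to the residual,
    # plus the derivative of lam * ||mu - mu0||^2.
    grad = jacobian(mu).T @ residual + 2.0 * lam * (mu - mu0)
    return cost, grad

# Hypothetical toy forward model (smooth and nonlinear), NOT the DOT operator.
A = np.array([[1.0, 0.5], [0.2, 2.0], [0.7, 0.3]])
forward = lambda mu: np.exp(-A @ mu)
jacobian = lambda mu: -np.exp(-A @ mu)[:, None] * A

mu_true = np.array([0.02, 0.01])
mu0 = np.array([0.01, 0.01])          # initial guess
y = forward(mu_true)                  # synthetic, noise-free data
cost, grad = cost_and_gradient(mu0, forward, jacobian, y, mu0, lam=1e-8)
```

In a real reconstruction the Jacobian would come from the finite element forward solver; only the way the residual, the adjoint, and the regularization derivative combine is meant to carry over.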

Iterative Inverse Problem Solution
We consider an iterative optimization algorithm denoted by $Q$. The statement of our problem can be reduced to the iterative form:
$$\mu_{n+1} = Q(\mu_n; J_R) \tag{9}$$
Naturally, the promise of the algorithm is to bring us closer to the solution after each iteration. A proof of convergence for a specific algorithm ensures that
$$\lim_{n \to +\infty} Q(\mu_n; J_R) = \mu_a^{*} \tag{10}$$
and can give even more information on the speed of convergence by deriving a theoretical bound on $\|Q(\mu_n; J_R) - \mu_a^{*}\|$ as a function of $n$. In general, such a bound is hard to derive, and it is even more difficult for complex problems like DOT, with many multidimensional parameters. In practice, the convergence speed is influenced by many factors related to the algorithm itself and to the configuration of the problem (physical reality and constraints). A numerical approach based on simulation and statistical analysis is very useful in tackling these hard situations, and can help us gain more insight into the choice of the optimization algorithm and other practical matters. Since our study considers a family of optimization algorithms based on gradient descent, we single out the learning rate hyperparameter as the main factor of interest in this context. From a practical point of view, $J_R$ depends on the structure of the problem and, consequently, on different factors such as the nature of the inclusions (their number, form, distribution, ...), the properties of the medium, and all other parameters that shape the forward problem stated in the previous section. In addition, it depends on the choice of the regularization and of the initial guess. Table 1 below gives an example of the parameters and hyperparameters that can be of interest in studying the practical optimization problem (including the hyperparameters of the iterative algorithm).
A more focused statement of the iterative optimization algorithm, for the study in the present paper, can be formulated as
$$\mu_{l+1} = Q_{AM}(\mu_l;\ \mu_a^0, n, \beta, \Theta) \tag{11}$$
where $Q_{AM}$ describes the adaptive moment algorithm, $\mu_a^0$ the initial guess, $n$ the number of inclusions, $\beta$ the learning rate hyperparameter, and $\Theta$ all the remaining parameters.
Hereafter, we focus our attention only on the number of inclusions $n$, the learning rate $\beta$, and the initial guess $\mu_a^0$. In our implementation of the optimization problem, we used an objective function $C$ defined as
$$C(l; \beta, n, \mu_a^0) = \frac{1}{2}\Big(J_R\big(Q_{AM}(\mu_l; n, \beta)\big) + \epsilon_0 + \big|J_R\big(Q_{AM}(\mu_l; n, \beta)\big) - \epsilon_0\big|\Big) \tag{12}$$
We can easily show that
$$C(l; \beta, n, \mu_a^0) = \max\Big(J_R\big(Q_{AM}(\mu_l; n, \beta)\big),\ \epsilon_0\Big) \geq \epsilon_0 \tag{13}$$
We define the number of iterations to convergence by
$$N_{Q_{AM}} = \min\big\{\, l \geq 1 : C(l; \beta, n, \mu_a^0) = \epsilon_0 \ \text{or}\ l > L_{max} \,\big\} \tag{14}$$
This formulation guarantees that our optimization algorithm stops whenever $J_R(Q_{AM}(\mu_l; n, \beta))$ is lower than $\epsilon_0$ or $l$ is greater than $L_{max}$, where $\epsilon_0 > 0$ and $L_{max} \geq 1$ are the parameters of the stopping criterion.
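The clipped objective and the stopping rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the list `cost_history` (the successive values of the regularized cost) are hypothetical, and `eps0` and `l_max` stand for the threshold and the iteration cap of the stopping criterion.

```python
def clipped_cost(j_r, eps0):
    """C = (J_R + eps0 + |J_R - eps0|) / 2, which equals max(J_R, eps0)."""
    return 0.5 * (j_r + eps0 + abs(j_r - eps0))

def iterations_to_convergence(cost_history, eps0, l_max):
    """Return the first 1-based iteration at which the cost is at most eps0
    (equivalently, C has reached its floor eps0), or l_max if this never
    happens within the first l_max iterations."""
    for l, j_r in enumerate(cost_history[:l_max], start=1):
        if j_r <= eps0:
            return l
    return l_max

# Hypothetical cost values, for illustration only.
n_iter = iterations_to_convergence([5.0, 2.0, 0.5, 0.1], eps0=1.0, l_max=200)
```

The comparison is done on the raw cost rather than on `clipped_cost(...) == eps0` to avoid relying on exact floating-point equality.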
The aim of our numerical statistical study of convergence speed can then be reduced to the study of the properties of the probability distribution of $N_{Q_{AM}}$, namely $P(N_{Q_{AM}} \mid n, \beta, \mu_a^0)$, using simulation tools. In the following study, we restrict our attention to the comparison of three algorithms based on the adaptive moment procedure. For more details, we refer the reader to the next section.

Simulation and Data Generation
As mentioned before, only the absorption coefficient is reconstructed and discussed. The distribution of the scattering coefficient is assumed to be known. To generate synthetic data, we use the Toast++ software [22], which solves the forward problem (1) and (2) described above using the finite element method. In all the numerical simulations, a circular domain of radius 20 mm containing inclusions of different sizes and shapes is considered. To avoid inverse crime [23], we use different meshes in the forward and inverse problems. In all cases, we use a circular mesh with 22,011 nodes and 43,400 tetrahedral elements for the forward problem and 15,408 nodes and 30,308 tetrahedral elements for the inverse problem. Sixteen sources and 16 detectors are equally spaced on the boundary of the domain. The location, size, and number of anomalies in $\mu_a$ are chosen randomly, with background values $\mu_a^{bkg} = 0.01$ mm$^{-1}$ and $\mu_s^{bkg} = 2$ mm$^{-1}$. We consider that there is no change in the anisotropy factor $g$, which is taken to be equal to 0.9. The regularization parameter $\lambda$ is set to $10^{-8}$. To solve the minimization problem (5), we use Algorithms 1-3, as described in the pseudo-codes below [18], where $\beta$ is the learning rate, and $\rho_1$ and $\rho_2$ are the exponential decay rates for the moment estimates. The stabilization parameter $\epsilon$ is set to $10^{-10}$.
To control all the parameters of our simulation, we first control the error of the initial guess of the reconstruction, $\mu_a^0$, which is obtained by perturbing $\mu_a^{real}$ with $\alpha$, where $\mu_a^{real}$ is the original image matrix used to solve the forward problem, and $\alpha$ is a random matrix variable sampled uniformly such that $\|\alpha\|_{\infty} = \delta$, where $\delta$ itself is a uniformly random number taken in the range [0, 0.2]. We define $\delta$ as the initial guess error.
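One possible way to draw such an initial guess is sketched below. The exact form of the perturbation is not recoverable from the text, so the multiplicative form `mu_real * (1 + alpha)` used here is an assumption, as are the function name `perturbed_initial_guess` and the small test image.

```python
import numpy as np

def perturbed_initial_guess(mu_real, rng, delta_max=0.2):
    """Draw an initial guess from the true image with a controlled error.

    ASSUMPTION: the perturbation is multiplicative, mu0 = mu_real * (1 + alpha);
    the source does not state the exact form.
    """
    delta = rng.uniform(0.0, delta_max)            # initial guess error
    alpha = rng.uniform(-1.0, 1.0, size=mu_real.shape)
    alpha *= delta / np.max(np.abs(alpha))         # enforce ||alpha||_inf = delta
    return mu_real * (1.0 + alpha), alpha, delta

rng = np.random.default_rng(0)
mu_real = np.full((32, 32), 0.01)   # background absorption in mm^-1
mu_real[10:15, 10:15] = 0.02        # hypothetical inclusion
mu0, alpha, delta = perturbed_initial_guess(mu_real, rng)
```

With this construction the relative deviation of every pixel of the initial guess from the true image is bounded by $\delta$.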

Algorithm 3: Pseudo-code of AmsGrad
Require: $\mu_a^0$, $\beta$, $\rho_1$, $\rho_2$, and $\epsilon$

The choice of the learning rate is very important. For this purpose, we conducted a preliminary study in which we experimented with learning rates for all optimizers in the range [0, 0.5], and noticed that, when the learning rate is outside the interval [0.001, 0.3], the minimization of the objective function either does not converge or takes a very long time to do so. For this simulation, the learning rate is therefore drawn uniformly from the range [0.001, 0.3]. We fixed all the other hyperparameters of all optimizers to the values recommended in the corresponding literature (the moment decay rates $\rho_1$ and $\rho_2$ are 0.9 and 0.999, respectively). The number of anomalies is drawn among 1, 2, and 3 with equal probability. Figure 1 shows the resulting distributions (histograms) from running the simulation for the three parameters of the study: the learning rate $\beta$, the number of anomalies, and the perturbation coefficient $\delta$. To be fair, we use the same parameters for all optimizers in each simulation instance.
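As a rough illustration of how the three optimizers differ, their update rules can be sketched in Python. This is a minimal sketch based on commonly published forms of Adam, Nadam (simplified, without a momentum schedule), and AmsGrad (running maximum of the second moment, without bias correction); it is not the authors' implementation, and the function name `adaptive_moment_minimize` and the toy quadratic objective are hypothetical.

```python
import numpy as np

def adaptive_moment_minimize(grad, mu0, variant="adam", beta=0.01,
                             rho1=0.9, rho2=0.999, eps=1e-10, n_iter=1000):
    """Minimize a function, given its gradient, with Adam, Nadam, or AmsGrad."""
    mu = np.array(mu0, dtype=float)
    s = np.zeros_like(mu)       # first moment estimate
    r = np.zeros_like(mu)       # second moment estimate
    r_max = np.zeros_like(mu)   # running maximum used by AmsGrad
    for t in range(1, n_iter + 1):
        g = grad(mu)
        s = rho1 * s + (1.0 - rho1) * g
        r = rho2 * r + (1.0 - rho2) * g * g
        if variant == "adam":
            s_hat = s / (1.0 - rho1 ** t)
            r_hat = r / (1.0 - rho2 ** t)
        elif variant == "nadam":
            # Nesterov look-ahead on the bias-corrected first moment.
            s_hat = (rho1 * s / (1.0 - rho1 ** (t + 1))
                     + (1.0 - rho1) * g / (1.0 - rho1 ** t))
            r_hat = r / (1.0 - rho2 ** t)
        elif variant == "amsgrad":
            # Keep the largest second moment seen so far.
            r_max = np.maximum(r_max, r)
            s_hat, r_hat = s, r_max
        else:
            raise ValueError(f"unknown variant: {variant}")
        mu = mu - beta * s_hat / (np.sqrt(r_hat) + eps)
    return mu

# Toy problem: minimize 0.5 * ||mu - target||^2 (not a DOT objective).
target = np.array([1.0, -2.0])
grad = lambda mu: mu - target
results = {v: adaptive_moment_minimize(grad, np.zeros(2), variant=v, beta=0.05)
           for v in ("adam", "nadam", "amsgrad")}
```

Here `beta`, `rho1`, `rho2`, and `eps` follow the notation of the text (learning rate, moment decay rates, stabilization parameter). Library implementations may differ in details such as bias correction for AmsGrad.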

Results
In this part of our study, we will characterize the convergence rate of the three algorithms, and compare the convergence/divergence behavior in relation to the parameters of simulation, and, finally, we will examine the quality of the resulting reconstructions of the three optimizers.
First of all, in the context of the present analysis, we define divergence as the state of a running optimization in which the error has not improved for more than 200 iterations in total.
The first subject of focus is the convergence rate of each algorithm. Let $X_{AD}$, $X_{NAD}$, and $X_{AMS}$ be three random variables representing the state of convergence for the Adam, Nadam, and AmsGrad optimizers. These variables take the value 0 or 1, depending on whether the corresponding algorithm diverges or converges, such that:
$$X = \begin{cases} 1 & \text{if the algorithm converges,} \\ 0 & \text{if the algorithm diverges,} \end{cases} \qquad X \in \{X_{AD}, X_{NAD}, X_{AMS}\}$$
The simulation provided us with three samples of independent and identically distributed variables $(X_{AD,n})_{n \leq 1340}$, $(X_{NAD,n})_{n \leq 1340}$, and $(X_{AMS,n})_{n \leq 1340}$. To statistically estimate the rates of convergence, namely $P(X_{AD} = 1) = p_{AD}$, $P(X_{NAD} = 1) = p_{NAD}$, and $P(X_{AMS} = 1) = p_{AMS}$, we use the three estimators
$$\hat{p}_{AD} = \frac{\#\{n : X_{AD,n} = 1\}}{N}, \qquad \hat{p}_{NAD} = \frac{\#\{n : X_{NAD,n} = 1\}}{N}, \qquad \hat{p}_{AMS} = \frac{\#\{n : X_{AMS,n} = 1\}}{N}$$
where $\#$ denotes the count function and $N$ the sample size, in order to construct 95% confidence intervals based on the large-sample normal approximation, as presented in Table 2. To shed light on the influence of the simulation parameters on the convergence rates, we run a logistic regression to estimate the conditional distributions $P(X \mid \beta, n, \delta)$, where $X \in \{X_{AD}, X_{NAD}, X_{AMS}\}$, $n$ denotes the number of anomalies in the image, and the remaining variables $\beta$ and $\delta$ are as defined earlier. The result of this procedure is given in Table 3, which presents the p-values for the statistical significance of each regression parameter.
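A confidence interval of this kind, based on the normal approximation for a proportion, can be computed as in the short sketch below. The function name `proportion_ci` is hypothetical, and the counts in the usage line are made up for illustration; they are not the values of Table 2.

```python
import math

def proportion_ci(n_success, n_total, z=1.96):
    """Normal-approximation (Wald) confidence interval for a proportion.
    z = 1.96 corresponds to a two-sided 95% interval."""
    p_hat = n_success / n_total
    half_width = z * math.sqrt(p_hat * (1.0 - p_hat) / n_total)
    return p_hat, max(0.0, p_hat - half_width), min(1.0, p_hat + half_width)

# Hypothetical counts, for illustration only.
p_hat, low, high = proportion_ci(n_success=900, n_total=1000)
```

The approximation is reasonable when both the number of convergent and the number of divergent runs are large; for proportions very close to 0 or 1 other intervals (for example Wilson's) are often preferred.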
From Table 3, we conclude that the main parameter with a significant influence on the convergence of these algorithms is the learning rate hyperparameter. Since the logistic coefficient for $\beta$ is positive for Adam and Nadam, the larger the learning rate, the more likely these algorithms are to converge. This statement is reversed for AmsGrad, for which we observe a negative coefficient for the learning rate. AmsGrad is also negatively impacted by the number of anomalies in the image.
Running our simulation provided us with 1340 convergent instances in total (that is, instances where $X_{AD} = 1$, $X_{NAD} = 1$, and $X_{AMS} = 1$), which we use to evaluate the comparative performance of the optimizers in terms of the speed of convergence, measured by the number of iterations taken by each optimizer to reach the solution. We conduct a statistical analysis on the generated data, first comparing the speed globally between optimizers and then relating it to the simulation variables, namely the initial guess error, the choice of learning rate, and the number of anomalies in the image. In addition, the influence of these variables on reconstructed image quality is discussed, and PSNR and SSIM score values are calculated for each simulation instance; we refer the reader to the later discussion of reconstruction quality in this paper for more information on these scores.
A number of statistical methods have been applied and results are examined to describe the convergence speed behavior of each algorithm when applied to the inverse problem of DOT.
Image reconstruction in optical tomography is an ill-posed nonlinear inverse problem. Algorithms based on gradient descent offer no guarantee of converging to the global minimum when the optimization problem at hand has local minima. The convergence point depends heavily on the choice of the starting point of the optimization, and, generally, these algorithms converge (depending also on the learning rate) to the local minimum nearest to the starting point.
In this section, we address the optimization problem (image reconstruction) from the perspective of the speed of convergence, which is one of the most important matters in the practical use of DOT in clinical applications, rather than the sensitivity of the algorithms to the choice of the initial guess with respect to their ability to find the global minimum, which is the other important practical issue in applying DOT. This last perspective is equally relevant and certainly needs particular attention and further analysis, but it remains an open question outside the scope of the current paper, as our randomized simulation design was focused on controlling the factors that influence the speed of convergence. The same approach could be used to quantify statistically the efficiency and sensitivity in reaching the global minimum depending on the problem factors, but this would require redesigning the simulation to generate data suitable for this substantially different analysis objective.
The "blindness" of gradient descent-based algorithms toward the global or local character of the reached optimum is an inherent property: the gradient is a local concept and carries only local information about the objective function, which makes these algorithms very sensitive to the choice of learning rate and initialization. The adaptive moment features add nothing to this picture except some amount of "memory" of the recent gradients.
A rough observation that can be mentioned here is that, in our generated sample data, most of the convergent instances for Nadam and Adam reached the global minimum, but we cannot draw any statistical evidence from this naïve observation because our randomized simulation design does not support this analysis.
First of all, we check the distributions of the number of iterations (speed) for normality, in the hope of being able to use the large and powerful set of parametric statistical approaches available in the literature, which rely heavily on the normal distribution.
Probability distributions of the speed of convergence and of the log of the speed of convergence are shown in the QQ plots in Figure 2a,b, respectively. From these two graphs, it clearly appears that these distributions are very far from being normally or log-normally distributed. This is not surprising, since these distributions are not symmetric to begin with and look strongly skewed, but we wanted to exclude the possibility of an approximate (left-truncated) normal distribution. Confirming this visual observation, the results of running Shapiro-Wilk normality tests on the three data samples are listed in Table 4. From Table 4, we conclude that the number of iterations for each optimizer deviates significantly from a normal distribution, and there is very little evidence, if any, to support normality. We did not test the goodness of fit for other density functions such as Gumbel, Fréchet, and Weibull, even though the shape of the distributions may suggest this family of extreme value distributions (EVD), mainly for two reasons. First, those EVDs, even if approximately fitted to our empirical distributions, would not, in our best judgment, provide any advantage: our main goal is not the exact nature of the distributions but rather their locations, and all of the well-known parametric statistical methods available for this purpose are based on the assumption that the samples come from an (approximately) normal distribution.
Second, since we stopped the optimization iterations at 200 as mentioned above, we automatically lost information about the distribution in the extreme left part of the tail (which is almost 10% of the population according to the estimates in Table 2, for the three algorithms). This would certainly have a heavy impact on the estimation of any EVD parameter and, consequently, would reduce the power of any parametric test based on those inherently biased and grossly approximate fits, which would in turn reduce any comparative advantage of a parametric method over a non-parametric alternative. Following the arguments above, we use non-parametric statistical approaches to recover further information about the performance of the three optimizers from the data, and, since the exact distributions are not well defined, we use the empirical cumulative distribution as a legitimate approximation.
From the superposition of the empirical densities and cumulative distribution functions of the speed of convergence for the three optimizers, as shown in Figure 3a,b, respectively, we note differences in the central tendencies of the speed of convergence, and we remark that the minimization of the objective function converges faster with the AmsGrad algorithm than with the other two algorithms. To gain more credible evidence about these preliminary observations, we conducted a Kruskal-Wallis paired test [24] to detect any significant difference of means among the three optimizers. The results of the tests are included, with p-values, in the box plot shown in Figure 4. We can conclude with high confidence that there is a significant difference (p < 0.05) in the speed of convergence among the three optimizers. Comparing the mean number of iterations between each pair of algorithms individually, and especially between Adam and AmsGrad, which look very close (mean-wise), we conclude that there is a significant difference between these two groups too.
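The normality check and the rank-based comparison discussed above can be reproduced in outline with SciPy, as in the sketch below. The iteration counts are synthetic, right-skewed samples generated only for illustration; they are assumptions and do not come from the simulations of this paper.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Hypothetical right-skewed iteration counts, for illustration only.
iters = {
    "Adam": rng.gamma(shape=2.0, scale=20.0, size=300) + 30.0,
    "Nadam": rng.gamma(shape=2.0, scale=18.0, size=300) + 25.0,
    "AmsGrad": rng.gamma(shape=2.0, scale=12.0, size=300) + 15.0,
}

# Shapiro-Wilk: a small p-value rejects the hypothesis of normality.
normality_p = {}
for name, sample in iters.items():
    _, p = stats.shapiro(sample)
    normality_p[name] = p

# Kruskal-Wallis: rank-based test that all groups share one distribution.
h_stat, kw_p = stats.kruskal(*iters.values())
```

Note that `scipy.stats.kruskal` treats the groups as independent samples; if the runs are matched across optimizers, a test designed for related samples may be worth considering as well.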
To frame these differences in speed between the three algorithms, we generate 95% confidence intervals for the median differences using the bootstrap method with 10,000 replicates each. Normal, percentile, and pivotal 95% confidence intervals have been calculated. The results are summarized in Table 5. From this table, we see a clear advantage of Nadam and AmsGrad over Adam in the speed of convergence (on average), while the difference between AmsGrad and Nadam is only around four steps.
Following the logic of our study, we investigate the relationship between the speed of convergence and each of the three factors of the simulation, namely the number of anomalies in the image, the initial guess error, and the choice of the learning rate. To verify the impact of the number of anomalies on the speed of convergence, the Kruskal-Wallis test is applied to the speed-of-convergence sample of each algorithm, grouped by the number of inclusions. The Kruskal-Wallis test results are presented in Figure 5, and we can conclude (by failing to reject the Kruskal-Wallis null hypothesis) that the number of anomalies present in the image does not significantly affect the speed of convergence (p > 0.05) for the different optimizers.
To complete our investigation, we discuss the impact of the initial guess error and of the learning rate on the number of iterations, shown in the scatter plots of Figure 6a and Figure 6b, respectively. Spearman's correlation coefficient is used because of its robustness against the outliers that appear in the data. Spearman's correlation coefficient R and the p-value are given at the top of each graph. From Figure 6b, we notice that, when the learning rate lies in [0.001, 0.2], the Nadam and Adam algorithms take more iterations than the AmsGrad algorithm.
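The bootstrap confidence interval for a difference of medians mentioned above can be sketched as follows. Only the percentile interval is shown, and the two samples are resampled independently, which is an assumption about the design; the function name `bootstrap_median_diff_ci` and the synthetic iteration counts are hypothetical.

```python
import numpy as np

def bootstrap_median_diff_ci(x, y, n_boot=10_000, level=0.95, seed=0):
    """Percentile bootstrap CI for median(x) - median(y), resampling each
    sample independently with replacement."""
    rng = np.random.default_rng(seed)
    x, y = np.asarray(x), np.asarray(y)
    xb = rng.choice(x, size=(n_boot, x.size), replace=True)
    yb = rng.choice(y, size=(n_boot, y.size), replace=True)
    diffs = np.median(xb, axis=1) - np.median(yb, axis=1)
    tail = (1.0 - level) / 2.0
    low, high = np.quantile(diffs, [tail, 1.0 - tail])
    return np.median(x) - np.median(y), low, high

rng = np.random.default_rng(2)
# Hypothetical iteration counts, for illustration only.
adam = rng.gamma(2.0, 20.0, size=300) + 30.0
ams = rng.gamma(2.0, 12.0, size=300) + 15.0
point, low, high = bootstrap_median_diff_ci(adam, ams, n_boot=2000)
```

An interval that excludes zero supports a difference in medians between the two optimizers at the chosen level.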
In addition, we note that the AmsGrad algorithm shows some robustness toward the learning rate, although it presents some outliers in the range [0.001, 0.2]. According to Spearman's correlation coefficient, we observe a very strong correlation (R = −1) between the learning rate and the number of iterations for the Adam and Nadam optimizers, and a negligible correlation for the AmsGrad optimizer, which presents some outliers when the learning rate lies in [0.2, 0.3]. On the other hand, Figure 6a shows the relationship between the initial guess error and the number of iterations taken by each optimizer to reach convergence of the cost functional. We note that the error has the same impact on the Adam and Nadam algorithms, judging by their p-values and correlation coefficients. However, we observe that AmsGrad is more efficient than the other two optimizers even when the initial guess is far from the real image.
To assess the quality of the reconstructed images produced by these optimizers, we performed statistical tests for differences of means on the PSNR and SSIM measured on the reconstructed images. These two scores are defined as follows:
$$\mathrm{PSNR} = 10 \log_{10}\!\left(\frac{\mathrm{MAX}^2}{\mathrm{MSE}}\right)$$
$$\mathrm{SSIM}(x, y) = \frac{(2\mu_x\mu_y + c_1)(2\sigma_{xy} + c_2)}{(\mu_x^2 + \mu_y^2 + c_1)(\sigma_x^2 + \sigma_y^2 + c_2)}$$
where MAX is the maximum possible pixel value, MSE is the mean squared error between the reconstructed and true images, $\mu_x$ and $\mu_y$ are the mean intensities of the images $x$ and $y$, $\sigma_x^2$ and $\sigma_y^2$ their variances, $\sigma_{xy}$ their covariance, and $c_1$ and $c_2$ small constants that stabilize the division.
To evaluate the influence of the number of inclusions on image quality, we conduct a Wilcoxon test [25]. The test was applied to the groups defined by the number of inclusions. The resulting p-values are summarized in Table 6. The analysis shows that there is a statistically significant difference between means due to the difference in the number of inclusions present in the images (p-value < 0.05). A similar conclusion is reached about the influence of the learning rate on PSNR and SSIM. The scatter plots in Figure 8a,b clearly show this strong influence of the learning rate on PSNR and SSIM, respectively. The resulting Spearman's correlation coefficients by optimizer (and the corresponding p-values) for PSNR and SSIM are given at the top of each graph.
As shown in Figure 8a, we notice that there is a strong negative correlation between learning rate hyper-parameter and PSNR for the case of Adam and AmsGrad. For the case of Nadam, we note a moderate negative correlation between the choice of learning rate and PSNR of reconstructed images. From Figure 8b, we observe that there is a strong negative correlation between learning rate parameter and SSIM for the case of Adam and Nadam. In addition, there is a moderate negative correlation between learning rate and SSIM in the case of AmsGrad. The resulting p-values mentioned at the top of each graph indicate that these correlations are statistically significant, and, consequently, we can conclude the same about the significance of the influence of learning rate choice on the quality of the resulting reconstructed image. Thus, a small value of learning rate that ranges between 0.001 and 0.2 is recommended.
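A Spearman rank correlation of the kind reported here can be computed with SciPy as in the following sketch. The data are synthetic: the PSNR values are generated to decrease with the learning rate plus noise, purely as an assumption for illustration, and do not reproduce the figures of this paper.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
learning_rate = rng.uniform(0.001, 0.3, size=200)
# Hypothetical PSNR values that decrease with the learning rate, plus noise.
psnr_values = 35.0 - 40.0 * learning_rate + rng.normal(0.0, 1.0, size=200)

# Spearman's coefficient works on ranks, so it is insensitive to outliers
# and to any monotone rescaling of either variable.
rho, p_value = stats.spearmanr(learning_rate, psnr_values)
```

A coefficient close to −1 with a small p-value indicates a strong, statistically significant monotone decrease of the score with the learning rate.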
Concerning the initial guess error, the scatter plots in Figure 9 show its influence on reconstructed image quality. From Figure 9a,b, the results show no statistically significant relationship between the initial guess error and the resulting quality (PSNR/SSIM) (p > 0.05). Thus, we conclude that the image quality is influenced only by the number of anomalies in the image and the choice of the learning rate.
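For reference, the two quality scores used throughout this analysis can be computed as in the minimal NumPy sketch below. The SSIM here is a simplified global version computed from whole-image statistics, whereas common implementations average the index over local windows; this simplification, the function names, the constants `k1` and `k2`, and the test image are assumptions made for illustration.

```python
import numpy as np

def psnr(true_img, recon_img, data_range):
    """Peak signal-to-noise ratio in dB; data_range is the peak value MAX."""
    mse = np.mean((true_img - recon_img) ** 2)
    if mse == 0.0:
        return np.inf
    return 10.0 * np.log10(data_range ** 2 / mse)

def global_ssim(x, y, data_range, k1=0.01, k2=0.03):
    """Single-window SSIM computed from whole-image statistics."""
    c1, c2 = (k1 * data_range) ** 2, (k2 * data_range) ** 2
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = np.mean((x - mx) * (y - my))
    return ((2 * mx * my + c1) * (2 * cov + c2)) / (
        (mx ** 2 + my ** 2 + c1) * (vx + vy + c2))

# Hypothetical absorption image (mm^-1) and a reconstruction with an offset.
true_img = np.full((16, 16), 0.01)
true_img[4:8, 4:8] = 0.02
recon_img = true_img + 0.001
score_psnr = psnr(true_img, recon_img, data_range=0.02)
score_ssim = global_ssim(true_img, recon_img, data_range=0.02)
```

Higher values of both scores indicate closer agreement with the true image; SSIM equals 1 only for identical images.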
We illustrate some cases from our simulation. Figure 10 shows the reconstructed absorption coefficient $\mu_a$ for the case of one inclusion with an initial guess error of $\delta = 0.2$. Different values of the learning rate are used. The background values of the true images are $\mu_a^{bkg} = 0.01$ mm$^{-1}$ and $\mu_s^{bkg} = 2$ mm$^{-1}$. The reconstructions using Nadam and Adam show a good localization of the inclusion. In addition, its size matches that in the true image, with optical properties close to the true values. Some artifacts are observed at the borders close to the source and detector regions when the learning rate is higher than 0.1. For the AmsGrad reconstruction, we observe that the size of the reconstructed inclusion matches that of the true image, with some artifacts in the center, when the learning rate is lower than 0.1. However, when the learning rate is greater than 0.1, AmsGrad can localize the inclusion, but with some artifacts at the borders, and the size and shape of the inclusion do not match those in the true image. Figure 11 shows the reconstructed absorption coefficient $\mu_a$ for the case of two inclusions with different shapes, for the same values of initial guess error and optical properties as in the first case of one inclusion. From Figure 11, we notice that a good localization of both inclusions is obtained with Nadam and Adam for the different values of the learning rate. However, when the learning rate is higher than 0.01, we observe some artifacts near the borders. For the AmsGrad reconstruction, it is clear that, when the learning rate exceeds 0.1, the size and shape of the inclusions do not match those in the true image.

Figure 10. Reconstruction of the absorption coefficient $\mu_a$ with one inclusion. The first row presents the true image (left) and the initial guess image with an initial guess error $\delta = 0.2$ (right). The subsequent rows present the reconstructed images for the learning rate $\beta$ equal to 0.001, 0.01, 0.1, 0.2, and 0.3, respectively, using Nadam, Adam, and AmsGrad (from left to right).

Figure 11. Reconstruction of the absorption coefficient $\mu_a$ with two inclusions. The first row presents the true image (left) and the initial guess image with an initial guess error $\delta = 0.2$ (right). The subsequent rows present the reconstructed images for the learning rate $\beta$ equal to 0.001, 0.01, 0.1, 0.2, and 0.3, respectively, using Nadam, Adam, and AmsGrad (from left to right).

Discussion and Conclusions
This research work analyzed the behavior of three optimizers when applied to the inverse problem of DOT regarding the speed of convergence and quality of reconstruction. The three optimizers under study, namely Nadam, Adam, and AmsGrad, are enhanced versions of the simple gradient descent algorithm, and have proved to perform very well in solving optimization problems in other areas of applications, especially in Deep Learning model search. The study we performed is based on a carefully designed randomized numerical simulation that aimed to gain credible statistical evidence on the actual performance of these optimizers when applied to solving the DOT inverse problem. We focused our attention on the impact of number of inclusions, the learning rate choice, and initial guess error on the speed of convergence. We also considered the impact of these same parameters on the quality of image reconstruction. The results derived using mainly non-parametric statistical approaches provide a scientifically credible quantification of the actual performance of these optimizers, with respect to the choice of learning rate, and under the constraint of the true numbers of inclusions and the arbitrariness of the initial starting point of the optimization.
The study provides valuable guidelines, in the form of statistical evidence, on the importance of a good choice of the learning rate for the three algorithms, and statistically demonstrates the robustness of Nadam and Adam to the initial guess and the number of inclusions; these results can help to improve and further promote the use of DOT in practical medical applications. However, we did not study the impact of these parameters on the simultaneous reconstruction of the absorption and scattering coefficients.
In future work, we aim to combine Nadam and AmsGrad and construct an algorithm that switches back and forth between AmsGrad and Nadam in a controlled fashion, where the AmsGrad optimizer will be used to accelerate the speed of convergence and Nadam to obtain a good quality of reconstructed images.