Confidence Estimation for Machine Learning-Based Quantitative Photoacoustics

In medical applications, the accuracy and robustness of imaging methods are of crucial importance to ensure optimal patient care. While photoacoustic imaging (PAI) is an emerging modality with promising clinical applicability, state-of-the-art approaches to quantitative photoacoustic imaging (qPAI), which aim to solve the ill-posed inverse problem of recovering optical absorption from the measurements obtained, currently cannot comply with these high standards. This can be attributed to the fact that existing methods often rely on several simplifying a priori assumptions of the underlying physical tissue properties or cannot deal with realistic noise levels. In this manuscript, we address this issue with a new method for estimating an indicator of the uncertainty of an estimated optical property. Specifically, our method uses a deep learning model to compute error estimates for optical parameter estimations of a qPAI algorithm. Functional tissue parameters, such as blood oxygen saturation, are usually derived by averaging over entire signal intensity-based regions of interest (ROIs). Therefore, we propose to reduce the systematic error of the ROI samples by additionally discarding those pixels for which our method estimates a high error and thus a low confidence. In silico experiments show an improvement in the accuracy of optical absorption quantification when applying our method to refine the ROI, and it might thus become a valuable tool for increasing the robustness of qPAI methods.


Introduction
Photoacoustic imaging (PAI) has been shown to have various medical applications and to potentially benefit patient care [1][2][3].It is a non-invasive modality that offers the ability to measure optical tissue properties, especially optical absorption µ a , both locally resolved and centimeters deep in tissue.Knowledge of these properties allows for deriving functional tissue parameters, such as blood oxygenation SO 2 , which is a biomarker for tumors and other diseases [4].The photoacoustic (PA) signal is a measure of the pressure waves arising from the initial pressure distribution p 0 , which depends mainly on µ a , the Grüneisen coefficient Γ, and the light fluence φ, which is shaped by the optical properties of the imaged tissue [5].Because of this dependence, the measured p 0 is only a qualitative indicator of the underlying µ a , because even if the initial pressure distribution could be recovered perfectly, estimation of the light fluence is an ill-posed inverse problem that has not conclusively been solved [6].
In order to derive quantitative information from initial pressure p 0 reconstructions of photoacoustic images, one has to account for the light fluence and solve the optical inverse problem.Most methods model the distribution of optical absorption coefficients by iteratively updating the distribution after computing the solution of a forward model (cf., e.g., [7][8][9][10][11][12][13][14]) with inclusion of the acoustic inverse problem [15,16].Alternatively, in multispectral photoacoustic imaging applications, the functional parameters are approximated directly by using a variety of spectral unmixing techniques (cf., e.g., [17][18][19]).Recently, machine learning-based methods for quantitative PAI (qPAI) have been proposed.These encompass end-to-end deep learning on 2D images [20] or the estimation of voxel point estimates with Context Encoding qPAI (CE-qPAI) [21], which incorporates the 3D p 0 context around each voxel into a single feature vector that is used to learn the fluence at that particular voxel.Some of the listed approaches to qPAI have been shown to work in ideal in silico conditions or on specific datasets.At the same time, they have proven difficult to use in clinical applications, which can be attributed to a lack of robustness caused by a priori assumptions that are made regarding, e.g., illumination, probe design, calibration factors, or scattering properties [22].Developing tools to estimate systematic errors and gain information on the quantification of uncertainties in PAI could thus be of great benefit and could be utilized to improve quantification accuracy.
In a recent publication [44], we showed a method for uncertainty quantification for the CE-qPAI method.A key result was that the practice of evaluating PA images over a purely input noise-based (aleatoric) region of interest (ROI) can be improved when also taking into account model-based (epistemic) uncertainty.To achieve this, we combined both sources of uncertainty into a joint uncertainty metric and used this to create an ROI mask on which to compute the statistics.A limitation to that approach could be seen in the fact that we used an uncertainty model specially tailored toward the CE-qPAI method.To overcome this bottleneck, we expand on our prior work in this contribution and present a method that yields confidence estimates by observing the performance of an arbitrary qPAI algorithm and uses the estimates to refine an ROI that was defined based on aleatoric uncertainty.
For validation in the context of qPAI, we applied this methodology to different PA signal quantification algorithms to investigate whether the approach is applicable in a general manner.We hypothesize that an estimated error metric is indicative of the actual absorption quantification error and that we can consequently improve on µ a estimations by evaluating on an ROI that is further narrowed down with a confidence threshold (CT).

Materials and Methods
This section gives an overview of the confidence estimation approach, the experiments, and the used dataset and briefly introduces the different qPAI methods that are being used.
Method for Confidence Estimation.Our approach to confidence estimation can be applied to any qPAI method designed to convert an input image I (p 0 or raw time-series data) into an image of optical absorption I µ a .In order to not restrict the qPAI method to a certain class (e.g., a deep learning-based method), we made the design decision to base the confidence quantification method on an external observing method.For this, we use a neural network, which is presented tuples of input image I and absorption quantification error e µ a in the training phase.When applying the method to a previously unseen image I, the following steps are performed (cf. Figure 1).Visualization of the proposed method for confidence estimation using an observing neural network as an error model.The estimator generates an output for a given input and the error model is used to obtain an estimate of the quantification error from the same input data.The region of interest (ROI), which is based on the aleatoric uncertainty I aleatoric extracted from the input data, can then be refined using the error estimates I epistemic of the error model as a confidence threshold (CT).

1.
Quantification of aleatoric uncertainty: I is converted into an image I aleatoric reflecting the aleatoric uncertainty.For this purpose, we use the contrast-to-noise-ratio (CNR), as defined by Welvaert and Rosseel [45], as CNR = (S − avg(b))/std(b), with S being the pixel signal intensity, and avg(b) and std(b) being the mean and standard deviation of all background pixels in the dataset.Using this metric, we predefine our ROI to comprise all pixels with CNR > 5.

2.
Quantification of epistemic confidence: I is converted into an image I epistemic reflecting the epistemic confidence.For this purpose, we use the external model to estimate the quantification error e µ a of the qPAI algorithm.

3.
Output generation: A threshold over I aleatoric yields a binary image with an ROI representing confident pixels in I µ a according to the input signal intensity.We then proceed to narrow down the ROI by applying a confidence threshold (CT) which removes the n% least confident pixels according to I epistemic .

Deep Learning Model.
As our external observing network, we used an adapted version of a standard U-Net [46] implemented in PyTorch [47].The model uses 2 × 2 max pooling for downscaling, 2 × 2 transpose convolutions for upscaling, and all convolution layers have a kernel size of 3 × 3 and a padding of 1 × 1.We modified the skip connections to incorporate a total of three convolution layers and thus generate an asymmetric U-Net capable of dealing with different input and output sizes (cf. Figure 2).This is necessary to enable the network to directly output reconstructed initial pressure or optical absorption distributions when receiving raw time-series data as input, as the data have different sizes on the y-axis.Specifically, the second of these convolutions was modified to have a kernel size of 3 × 20, a stride of 1 × 20, and a padding of 1 × 9, effectively scaling down the input along the y-axis by a factor of 20.To be more robust to overfitting, we added dropout layers with a dropout rate of 25% to each convolutional layer of the network.Note that a recent study [48] suggests that the U-net architecture is particularly well-suited for medical imaging applications because of its ability to generate data representations on many abstraction levels.Quantitative PAI Methods.We applied our approach to confidence estimation to three different qPAI methods; a naïve quantification method, as well as two different deep learning-based approaches (cf. Figure 3).The three methods are detailed in the following paragraphs.
Naïve Fluence Correction: As a naïve quantitative PAI (qPAI) reference method that does not use a deep learning model to reconstruct a quantitative absorption estimate μa , we performed fluence compensation similar to, e.g., [49,50].To achieve this, we used a simple Monte Carlo fluence simulation φ h based on the same hardware setup as used in the dataset, without any vascular structures inside the volume but instead with a homogeneous absorption coefficient of 0.1 cm −1 and a reduced scattering coefficient of 15 cm −1 .To quantify optical absorption with this method, we corrected the simulated p 0 images with φ h by calculating μa = p 0 /φ h .
Fluence Correction: Fluence correction refers to a two-stage algorithm, where the initial pressure is corrected by an estimation of the underlying light fluence.The quantification model for this method uses two deep neural networks, N 1 and N 2 , with adapted U-Net architectures, as described above (cf.Figure 2).N 1 yields an estimation of the underlying fluence φ from the input data S: N 1 (S) = φ, and N 2 is used to obtain the initial pressure distribution from the input data p0 : N 2 (S) = p0 with the aim of also reducing the noise of S. When using PA raw time-series data as the input, it has to be considered that the decoder and encoder sections of N 2 are asymmetric and the described U-Net adaptation has to be used.The optical absorption coefficients are then estimated from the results of N 1 and N 2 by computing μa = N 2 (S)/N 1 (S) = p0 / φ (for a visual representation of the method see Figure 3a).
Direct Absorption Estimation: Recently, a deep learning method for end-to-end estimation of absorption and derived functional parameters from p 0 distributions was proposed [20].In a similar fashion, we also use a deep learning model N with a modified U-Net architecture (cf. Figure 2) to directly estimate µ a from the input signal S: N(S) = μa (for a visual representation of the method see Figure 3b).This time, µ a estimation is done without the intermediate step of fluence correction, which might be more sensitive to errors due to error propagation or the presence of artifacts and noise.Validation Data.We simulated an in silico dataset containing 3600 training samples, 400 validation samples, as well as 150 calibration and test samples which were used in all experiments.Each data sample consists of the ground truth optical tissue parameters, the corresponding light fluence, and the initial pressure distribution simulated with the mcxyz framework [51], which is a Monte Carlo simulation of photon propagation where we assume a symmetric illumination geometry with two laser outputs.The data sample also comprises raw time-series data simulated using the k-Wave [52] toolkit with the first-order 2D k-space method, assuming a 128-element linear array ultrasound transducer with a central frequency of 7.5 MHz, a bandwidth of 60%, and a sample rate of 1.5 × 10 −8 s.The illumination and ultrasound geometry are depicted in Figure 4.Each tissue volume sample comprises 1-10 tubular vessel structures, whose absorption coefficients µ a range from 2 to 10 cm −1 in vessel structures and are assumed constant with 0.1 cm −1 in the background.We chose a constant reduced scattering coefficient of 15 cm −1 in both background and vessel structures.Additional details on the simulation parameters can also be found in our previous work [53].The raw time-series data was noised after k-space simulation with an additive Gaussian noise model of recorded noise of our system [54].For the experiments where we directly use p 0 , we noise the initial pressure distribution with a Gaussian additive noise model, as also described in [7].In our case, the noise model comprised an additive component with (5 ± 5)% of the mean signal and a multiplicative component with a standard deviation of 20% to simulate imperfect reconstructions of p 0 .The data used in this study is available in a Zenodo repository [55].Experimental Design.We performed five in silico experiments to validate our approach to confidence estimation in qPAI-one experiment with naïve fluence correction applied to p 0 data, as well as our different configurations for quantification with deep learning.Both methods shown in Figure 3 are applied to initial pressure p 0 , as well as raw time-series data.We used the Trixi [56] framework to perform the experiments.All qPAI models were trained on the training set and the progress was supervised with the validation set.We also used the validation set for hyperparameter optimization of the number of training epochs and batch sizes.We trained for 50 epochs, showing the network 10 4 randomly drawn and augmented samples from the training set; each epoch had a learning rate of 10 −4 , used an L 1 loss function, and augmented every sample with a white Gaussian multiplicative noise model and horizontal mirroring to prevent the model from overfitting.Afterward, we estimated the optical absorption μa of the validation set, calculated the relative errors e µ a = | μa − µ a |/µ a , and trained the external observing neural networks on the errors of the validation set with the same hyperparameters, supervising the progression on the calibration set.For better convergence, we used the weights of the p 0 estimation model as a starting point for the e µ a estimation deep learning models.All results presented in this paper were computed on the test set.

Results
We report the relative absorption quantification error e µ a at various different confidence thresholds (CT).To this end, we evaluated only the top n percent most confident estimates by excluding all estimates below a certain CT.We performed this evaluation over all ROI image samples of the respective dataset and examined the relative changes in e µ a compared with the evaluation over all ROI pixels.This was done in five different in silico experiments, corresponding to the qPAI methods when applied to initial pressure, as well as raw time-series input data.Our findings are summarized in Figure 5. Figure 5 shows the absorption quantification error e µ a for all five experiments for confidence thresholds CT ranging from 10 to 100%.In all cases, the error decreases when excluding more pixels with a higher estimated error.When only considering the top 50% most confident estimates, our method yields a decrease in the error e µ a of up to approx.30% (increasing to up to about a 50% improvement when evaluating only the top 10% of the most confident estimates).Figure 6 shows violin plots visualizing the changes in the distribution of e µ a when applying different confidence thresholds CT.The results reveal a meandering of the distribution toward lower error values.Note that the results are only given for the ROI.When computing the statistics over the entire image (thus decreasing the difficulty of the problem due to the homogeneous background), the median quantification error drops to about 0.1% (direct estimation), 5-10% (fluence correction), and 20% (naïve approach).
Figure 7 shows representative images of the experiment corresponding to the end-to-end direct µ a quantification method applied to p 0 data.It shows samples of the test set comprising the best, median, and worst result when applying a 50% CT to narrow down the ROI.The results show a maximum increase of nearly 80% in accuracy in the best case while achieving a median improvement of about 29% and worsening the quantification results by 28% in the worst case from the test set.Analogous illustrations for the other experiments can be found in Appendices A-D.

Discussion
In this work, we present a method which uses estimated confidences in the context of photoacoustic signal quantification to increase the accuracy of the quantification algorithms.In theory, the proposed method is independent of the underlying qPAI method, as it uses a deep learning model to observe the errors resulting from the quantification method in order to provide confidence estimates.While the application of our method to other state-of-the-art qPAI methods is the subject of future work, we aimed to show the general applicability of our method by also incorporating a naïve fluence compensation method into the experiments.Our results suggest that using a method to estimate confidence information to refine a region of interest for subsequent computations might be a valuable tool for increasing the robustness of qPAI methods and could be easily integrated in future qPAI research.
We hypothesized that our confidence metric is indicative of e µ a .In the experiments, we showed that a deep learning model is able to learn a representation of the errors of the quantification method leading to error improvements of 10-50% in region-of-interest structures and yielding up to 5-fold improvements in background structures.Furthermore, Figure 5 shows that the absorption estimation error does not decrease monotonously, especially for the qPAI methods that yield more accurate results.One reason for this might be that the confidence estimates are not correlated to the quantification error and that low confidences might still correspond to low errors.One has to point out the worse performance of the quantification methods when applied directly to raw time-series data.One reason for this might be that the addition of the acoustic inverse problem and the inclusion of a realistic noise model greatly increased the complexity of the problem and reduced the amount of information in the data due to, e.g., limited-view artifacts.At the same time, we did not increase the number of training samples or change the methodology to account for this.
The dataset simulated for the experiments was specifically designed such that out-of-plane fluence effects cannot occur, as the in silico phantoms contain only straight tubular vessel structures that run orthogonal to the imaging plane.Additionally, other a priori assumptions of the parameter space were made, such as a constant background absorption, an overall constant scattering coefficient, and a fixed illumination geometry.Due to the homogeneous nature of the background structure, the errors observed in our study are highly specialized to our dataset.This is especially apparent with the direct estimation method, as here, e µ a is never greater than 0.2%.As such, we focus on reporting the errors in the ROI, as only reporting the results of the entire images would be misleading.In order for the method to generalize to more complex or in vitro datasets and yield similar µ a and confidence estimation results, more elaborate and diverse datasets would need to be simulated.Nevertheless, the experiments demonstrate that applying an ROI threshold based on the estimation of the quantification error can lead to an increase in accuracy for a given dataset regardless of the underlying qPAI method.
From a qPAI perspective, end-to-end deep learning-based inversion of PA data is feasible in specific contexts and for specific in silico datasets, as shown previously [20] and in this work.However, PA signal quantification cannot be regarded as solved in a general manner.One of the main reasons is the large gap between simulated in silico data and in vivo recordings.In order for deep learning to tackle this problem, either highly sophisticated unsupervised domain adaptation methods have to be developed, or a large number of labeled correspondences between the simulation domain and real recorded images need to be provided, which is not currently feasible due to the lack of methodology to reliably measure ground truth optical properties in in vivo settings.However, with the promising progress in PA image reconstruction from limited-view geometries with deep learning techniques (cf., e.g., [57,58]), it might be possible to start bridging the gap and to improve on the current methods for qPAI.

Acknowledgments:
The authors would like to thank David Zimmerer for letting us tap into his well of knowledge on deep learning, Clemens Hentschke and the SIDT group for their support on the state of the art in non deep learning uncertainty quantification, Niklas Holzwarth for proofreading the manuscript, and the ITCF of the DKFZ for the provision of their computing cluster for data simulation.

Conflicts of Interest:
The authors declare no conflict of interest.

Figure 1 .
Figure1.Visualization of the proposed method for confidence estimation using an observing neural network as an error model.The estimator generates an output for a given input and the error model is used to obtain an estimate of the quantification error from the same input data.The region of interest (ROI), which is based on the aleatoric uncertainty I aleatoric extracted from the input data, can then be refined using the error estimates I epistemic of the error model as a confidence threshold (CT).

Figure 2 .
Figure 2. Visualization of the deep learning model used in the experiments: a standard U-Net structure with slight modifications to the skip connections.The (x, y, c) numbers shown represent the x and y dimensions of the layers, as well as the number of channels c.Specifically, in this figure, they show the values for a 128 × 128 input and 128 × 128 output.The center consists of an additional convolution and a skip convolution layer to enable different input and output sizes.

Figure 3 .
Figure3.Visualization of the two methods for absorption quantification with subsequent confidence estimation.(a) A quantification approach based on fluence estimation.In our implementation, the denoised initial pressure p0 and the fluence φ distributions are estimated using deep learning models.These are used to calculate the underlying absorption μa .An error model then estimates the quantification error and, in combination with an aleatoric uncertainty metric, a region of interest is defined.(b) An approach in which one model is used to directly estimate μa from the input data.I aleatoric is calculated on the basis of the input data and I epistemic is estimated with a second model.

Figure 4 .
Figure 4. Depiction of the illumination geometry and the transducer design which is based on our Fraunhofer DiPhAS photoacoustic imaging (PAI) system [54].(a) The ultrasound transducer design with the position of the laser output and the transducer elements, as well as the imaging plane; (b) one-half of the symmetric transducer design, where the laser outputs are in parallel left and right to the transducer elements over a length of 2.45 cm.

Figure 5 .
Figure 5. Quantification error as a function of the confidence threshold (CT) for five different quantification methods.The line shows the median relative absorption estimation error when only evaluating with the most confident estimates regarding CT, and the transparent background shows the corresponding interquartile range.Naïve: Fluence correction with a homogeneous fluence estimate; Fluence raw/p 0 : Deep learning-based quantification of the fluence applied to p 0 and raw time-series input data and subsequent estimation of µ a ; Direct raw/p 0 : End-to-end deep learning-based quantification of µ a applied to p 0 and raw time-series input data.

Figure 6 .
Figure 6.Visualization of the changes in the distribution of the absorption quantification error e µ a when applying different confidence thresholds CT = {100%, 50%, 10%}.The plot shows results for all five conducted experiments.The white line denotes the median, and the black box corresponds to the interquartile range.Outliers of e µ a with a value greater than 100% have been omitted from this plot.

Figure 7 .
Figure 7. Sample images of the end-to-end direct µ a quantification method applied to p 0 data, showing the best, the worst, and the median performance of our method when considering only the 50% most confident quantification estimations.All images show the (a) ground truth absorption coefficients (b) reconstructed absorption, (c) error estimate from external model, (d) the actual quantification error.