Conditional Variational Autoencoder for Learned Image Reconstruction

Learned image reconstruction techniques using deep neural networks have recently gained popularity and have delivered promising empirical results. However, most approaches focus on recovering a single point estimate for each observation, and thus neglect uncertainty information. In this work, we develop a novel computational framework that approximates the posterior distribution of the unknown image at each query observation. The proposed framework is very flexible: it handles implicit noise models and priors, it incorporates the data formation process (i.e., the forward operator), and the learned reconstructive properties are transferable between different datasets. Once the network is trained using the conditional variational autoencoder loss, it provides a computationally efficient sampler for the approximate posterior distribution via feed-forward propagation, and the summary statistics of the generated samples are used for both point estimation and uncertainty quantification. We illustrate the proposed framework with extensive numerical experiments on positron emission tomography (with both moderate and low count levels), showing that the framework generates high-quality samples when compared with state-of-the-art methods.


Introduction
Machine learning techniques, predominantly using deep neural networks (DNNs), have received attention in recent years, and delivered state-of-the-art reconstruction performance on many classical imaging tasks, including image denoising [56], image deblurring [53], super-resolution [16], and on challenging applications such as low-dose and sparse-data computed tomography [27,13], and under-sampled magnetic resonance imaging [23], to name just a few.
Most existing works on DNN-based image reconstruction aim at providing a single point estimate of the unknown image of interest for a given (noisy) observation; that is, they regard DNNs as deterministic machineries. Nonetheless, the stochastic nature of practical imaging problems (e.g., a noisy observation process, or imprecise prior knowledge) implies that there actually exists an ensemble of plausible reconstructions that are consistent with the given data (although to varying extents). This observation has motivated the development of a fully probabilistic treatment of the reconstruction task, in order to assess the reliability of one specific reconstructed image and to inform downstream decision making [28]. Unfortunately, uncertainty / reliability information is not directly available from most existing DNN-based image reconstruction techniques, and hence there is an imperative need for uncertainty quantification (UQ) frameworks that can efficiently deliver uncertainty information along with the reconstruction [7].
The Bayesian framework provides a systematic yet very flexible way to address many UQ tasks, by modelling the data and the unknown image probabilistically (i.e., as random variables), and has been popular for UQ in imaging inverse problems [26,50]. Furthermore, recent advances in Bayesian inference leverage powerful deep generative modelling tools, such as Wasserstein generative adversarial networks (GANs) [4] and variational autoencoders (VAEs) [31]. A deep generative model outputs one sample of the unknown image via feed-forward propagation, which is computationally cheap and, for multiple samples, can also be performed in parallel. These techniques hold enormous potential for quantifying the uncertainty of image reconstruction; nonetheless, developing rigorous data-driven image reconstruction techniques within a Bayesian framework remains challenging.
The first challenge stems from the high complexity of the posterior distribution of the unknown image (conditioned on a given observation). The "conventional" Bayesian setting often involves an explicit likelihood and the construction of a prior. In fact, in Bayesian inversion, both the likelihood and the prior are given explicitly (or hierarchically). The likelihood is derived from the statistical model of an observation process, under the assumption that both the noise statistics and the underlying physical principles of the imaging modality are well calibrated. Nonetheless, deriving precise likelihoods can be highly nontrivial (e.g., due to a complex corruption process). Further, how to stipulate statistically meaningful yet explicit priors is a long-standing open problem in image reconstruction. In the context of learning-based approaches, one can only implicitly extract prior information in a "data-driven" way using deep neural networks from a set of training pairs, instead of using an explicit posterior distribution (e.g., given up to a normalising constant); this leads to an approximate posterior that is "intrinsically implicit". Furthermore, the UQ task is substantially complicated by the high dimensionality of the reconstructed image, which renders many conventional sampling-type techniques not directly applicable.
The second challenge is related to the presence of physical laws. In many practical imaging applications, the data formation process itself is highly complex. To make matters worse, the forward operator itself can be implicitly defined, for instance, via a system of differential equations, as in many medical imaging modalities: a second-order elliptic differential equation in electrical impedance tomography [11], or the radiative transfer equation in diffuse optical tomography [6]. The forward maps often describe fundamental physical laws, and can be regarded as "established" prior knowledge. Therefore, it is useful to directly inform DNNs of the underlying physical laws (instead of learning them from the given training data), and how best to use the physical laws represents an important issue in architectural design that is currently actively researched.
In this work, we develop a novel computational framework for uncertainty-aware learned image reconstruction, using a conditional variational autoencoder (cVAE) [49], implemented via algorithmic unrolling [39]. The framework recurrently refines the (stochastic) reconstruction, benefiting directly from the physical laws as well as from samples of a low-dimensional random variable that is conditionally dependent on the observation. Furthermore, minimizing the cVAE objective is equivalent to optimizing an upper bound of an expected Kullback-Leibler (KL) divergence (see Proposition 4.1 for a precise statement), and the output samples (i.e., the plausible reconstructions) are actually drawn from an approximate posterior distribution of the unknown image. In sum, the proposed framework has the following features: (i) it serves as an efficient sampler from an approximate posterior in the sense of (generalised) variational inference [51]; (ii) it is scalable due to its encoder-decoder structure, and more specifically to the low dimensionality of the latent variable. Note that the low-dimensional latent variable in cVAE is introduced to inject stochasticity into the reconstructions, and the heavy-duty reconstruction task is still carried out by the unrolling construction. Thus, the approach does not suffer from the oversmoothing effect typically observed for VAEs.
This work lies within the context of quantifying uncertainty for learned image reconstruction techniques. There are mainly two different types of uncertainties: aleatoric [28,34] and epistemic [18,28]. See the review [7] for UQ in medical image analysis, and the review [2] for UQ of deep learning techniques in general. We focus on the aleatoric uncertainty associated with the reconstructed image. It arises from the intrinsic stochasticity of the given data, and is irreducible even if one collects more data. It differs from epistemic uncertainty, which is often a byproduct of a Bayesian treatment of neural networks, that is, Bayesian neural networks (BNNs) [19,10,18,9,8]. BNNs model the uncertainty within network weights, which are then inferred with approximate inference techniques [37,51,43]. The corresponding uncertainty arises from the fact that an ensemble of weight configurations can explain the given training data equally well, since there are generally far more network parameters than available training data (a.k.a. overparameterization).
The rest of the paper is organised as follows. In Section 2, we describe the related works. In Section 3, we describe the problem formulation. Section 4 provides background information on the conditional variational autoencoder, and describes our proposed framework, including the network architecture, and the training and inference phases. In Section 5, we showcase our framework on an established medical imaging modality, positron emission tomography (PET) [45], for which uncertainty quantification has long been desired yet remains very challenging to achieve [54,17,57], and confirm that the generated samples are indeed of high quality in terms of both point estimation and uncertainty quantification, when compared with several state-of-the-art benchmarks. Finally, in Section 6, we conclude the paper with additional discussions.

Related works
This work lies at the interface of the following three topics: learned image reconstruction, uncertainty quantification using generative models, and approximate inference. We briefly review the related works below.

Learned reconstruction techniques have received a lot of attention recently. Most of them are learned in a supervised manner [41]. One prominent idea is algorithmic unrolling, i.e., to unfold classical iterative optimization schemes and to replace certain components of the schemes with neural networks, whose parameters are then learned from training data. This construction brings good interpretability into the working of the algorithms and allows incorporating physical models, and thus often delivers highly performant results without resorting to very large training datasets. It was first done in [20], which unrolls the iterative shrinkage thresholding algorithm for sparse coding. In the context of inverse problems, an early work is [44]. Currently, there is a large body of works on unrolling; see [39] for a recent overview. These approaches focus mostly on point estimates, and ignore the associated uncertainties. This work employs the unrolling idea to construct the backbone network, but also endows the reconstructions with uncertainty estimates using the low-dimensional latent space.
Over the last decade, there has been impressive progress on UQ of DNNs, especially Bayesian neural networks [19,10,18,28,43] (see [2] for an overview). Nonetheless, high-dimensional inference tasks such as image reconstruction remain largely underexplored due to the associated computational cost [7], and thus UQ of learned image reconstruction techniques is only just emerging. For epistemic uncertainty, the most common technique used in image reconstruction is Monte Carlo dropout [18], due to its low computational cost, but it may sacrifice accuracy [9]. Further, one may separate the sources of uncertainty into aleatoric and epistemic ones [8], if desired. This work addresses the aleatoric uncertainty associated with a reconstructed image, by approximating an implicitly defined posterior distribution, which differs markedly from existing works on epistemic uncertainty.
In the machine learning community, there is a myriad of ways to approximate a target distribution using an approximate inference scheme, e.g., variational inference [25], variational Gaussian approximation [42,5], expectation propagation [37,54], and the Laplace approximation [36]. This work provides one way to approximate the distribution using cVAE. Note that, to rigorously justify the distribution modelled by DNNs for aleatoric uncertainty, proper Bayesian interpretations of the loss function used in the training of DNNs (in connection with the target posterior distribution) are needed. The cVAE approach adopted in the present work does admit an interpretation as generalised variational inference (cf. Proposition 4.1). Very recently, a graph version of cVAE was employed for synthesizing cortical morphological changes in [12]. Of course, cVAE represents only one way to construct the approach. Other approaches, e.g., generative adversarial networks [4], also hold potential, by suitably adapting the corresponding loss function and equipping it with a Bayesian interpretation. For example, the interesting work [22] constructs a sampler by approximating the transport map with DNNs. However, a thorough comparison of these different approaches utilizing deep generative models for UQ of learned reconstruction is still missing.

Preliminaries
In this section we describe the problem setting, and the variational autoencoder (VAE) framework.

Problem Formulation
First we specify the setting of this work, i.e., finite-dimensional linear inverse problems. Let x ∈ R^n be the unknown image of interest, and y ∈ R^m be the observational data. The linear forward operator A : R^n → R^m maps the unknown x to the data y. In practice, the observation y is a noisy version of the exact data Ax:
$$y = \eta(Ax),$$
where η(·) denotes a general corruption process by noise of possibly unknown type, e.g., Gaussian, Poisson, salt-and-pepper, Cauchy, uniform, or mixtures thereof. The image reconstruction task is to recover the unknown ground truth x from the given noisy observation y. In the Bayesian framework, the corruption process η(·) is encoded by a conditional distribution p^*(y|x), and the unknown image x of interest is a random variable with a prior distribution p^*(x). In the proposed framework, we only require (i) samples from the joint distribution p^*(x, y) := p^*(y|x)p^*(x) (which is proportional to the posterior distribution p^*(x|y), up to the normalizing constant $p(y) = \int_{\mathbb{R}^n} p^*(x, y)\,\mathrm{d}x$) and (ii) the action of the maps A and A^* (the adjoint of A). These two conditions hold for many practical imaging problems.
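As a purely illustrative sketch of requirement (i), the following numpy snippet simulates samples from the joint distribution p^*(x, y) with a Poisson corruption process η, without ever writing down the posterior. The surrogate operator, the gamma "prior" and the count scale are hypothetical, and not the PET system model used in the experiments.

```python
import numpy as np

rng = np.random.default_rng(1)

def make_training_triple(n=64, m=32, counts=1e4):
    """Simulate one (x, y, A): a nonnegative image x drawn from a surrogate
    prior, a random nonnegative forward map A, and Poisson-corrupted data
    y = eta(Ax). Only samples are produced; no posterior density is used."""
    A = rng.uniform(0.0, 1.0, size=(m, n)) / n   # surrogate forward operator
    x = rng.gamma(shape=2.0, scale=1.0, size=n)  # surrogate prior sample
    mean = counts * (A @ x)                      # exact data Ax, scaled to counts
    y = rng.poisson(mean).astype(float)          # noisy observation
    return x, y, A

x, y, A = make_training_triple()
print(x.shape, y.shape, A.shape)
```

Repeated calls yield an i.i.d. stream of triples, which is exactly the kind of training data the framework consumes.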
Note that DNN-based image reconstruction techniques employ paired training data {(x_i, y_i)}_{i=1}^N from the underlying joint distribution p^*(x, y), together with the associated forward operators, which may differ for each data pair. Thus the training data tuple takes the form (x_i, y_i, A_i), which ideally is an exhaustive representation of the image reconstruction problem. In particular, a closed-form expression of the posterior distribution p^*(x|y) is not needed, hence the term "implicit" posterior distribution. In practice, one can collect measurements corresponding to samples of x drawn from the prior p^*(x) (i.e., physically acquired data), without explicitly knowing the corruption process η(·). However, in many medical imaging applications, a dataset of paired ground-truth images and observational data is often expensive to acquire at a large volume, if not impossible. Note that the dimension of (x_i, y_i, A_i) can also vary from one sample to another, e.g., due to discretization at different resolution levels. The explicit presence of the operator A is a noticeable difference from standard supervised learning in machine learning. The main goal of learned image reconstruction with aleatoric UQ is to provide a computational framework that gives an approximation to the posterior distribution p^*(x|y) or the joint distribution p^*(x, y), by suitably training on the given dataset. Below we use the notation h(·) to denote a DNN, and use subscripts to distinguish between DNNs. Further, we abuse the subscript shared by a distribution and a DNN to indicate that the DNN reparameterizes the corresponding random variable. For instance, p_θ(x) is a distribution of x and h_θ(·) is a DNN that reparameterizes x, both with the parameter vector θ.

Variational Inference and Variational Autoencoders
Now we describe the basic technique, variational inference (VI), and the building block of the proposed framework, variational autoencoders (VAEs).
First we describe the idea of VI for posterior approximation. Let p_θ(x|y) be an intractable target distribution of interest, where the vector θ in p_θ(x|y) contains the hyperparameters of both the prior p_θ(x) and the likelihood p_θ(y|x), such as the prior belief strength (a.k.a. regularisation parameter) and the noise precision. The intractability of the distribution p_θ(x|y) often arises from the high dimensionality of the parameter space (i.e., n is large) and the non-Gaussian nature of the distribution (e.g., due to a complex data formation process).
To approximate the target distribution p_θ(x|y), we employ variational inference (VI), which is a popular approximate inference technique in machine learning [25,51]. It selects the best approximation q_φ(x|y) from a candidate family Q (parameterized by the vector φ, commonly known as the variational parameters) by minimizing a suitable divergence measure between probability distributions. The family Q is often taken to consist of independent Gaussian distributions, which are fully characterised by the mean and (diagonal) covariance. This is commonly known as the mean field approximation. In VI, the most popular choice of the probability metric is the Kullback-Leibler (KL) divergence [33]. The KL divergence D_KL(q_φ(x|y)||p_θ(x|y)) of the approximate posterior q_φ(x|y) from the target posterior p_θ(x|y) is defined by
$$D_{\mathrm{KL}}(q_\phi(x|y)\,\|\,p_\theta(x|y)) = \int q_\phi(x|y)\log\frac{q_\phi(x|y)}{p_\theta(x|y)}\,\mathrm{d}x.$$
It follows directly from Jensen's inequality that it is always nonnegative, and zero if and only if q_φ(x|y) = p_θ(x|y) almost everywhere. Since the target p_θ(x|y) is often only known up to a normalizing constant, the problem min_φ D_KL(q_φ(x|y)||p_θ(x|y)) is often cast into an equivalent evidence lower bound (ELBO) maximization
$$\max_\phi\; \mathcal{L}(\phi) := \mathbb{E}_{q_\phi(x|y)}[\log p_\theta(y|x)] - D_{\mathrm{KL}}(q_\phi(x|y)\,\|\,p_\theta(x)), \qquad (3.1)$$
where the notation E_q[·] denotes taking expectation with respect to the distribution q. In the ELBO, the first term enforces data consistency, whereas the second term represents a penalty relative to the prior distribution p_θ(x). Solving the optimization problem (3.1) requires evaluating the gradient of the loss L with respect to the variational parameters φ, which is often not analytically tractable and can only be evaluated by Monte Carlo methods. The reparameterization trick [31,46] is useful to overcome the challenge of directly backpropagating the gradients (with respect to φ) through the random variable x. It provides an unbiased estimate of the gradient of the ELBO with respect to the variational parameters φ.
In fact, we assume that the conditional sampling of x depends on y and an easy-to-sample auxiliary variable ε distributed according to p(ε) (e.g., a Gaussian distribution):
$$x = g_\phi(y, \epsilon), \qquad \epsilon \sim p(\epsilon),$$
where g_φ(·) is a deterministic function (in our case, a DNN). By estimating θ from the given data y, while simultaneously optimizing the variational parameter φ, one obtains the following Monte Carlo estimator of the loss in (3.1):
$$\hat{\mathcal{L}}(\phi) = \frac{1}{N}\sum_{i=1}^N \Big(\frac{1}{L}\sum_{\ell=1}^L \log p_\theta(y_i|x^{(i,\ell)}) - D_{\mathrm{KL}}(q_\phi(x|y_i)\,\|\,p_\theta(x))\Big),$$
where {x^{(i,ℓ)}}_{ℓ=1}^L are L samples generated with y_i using the variational encoder q_φ(x|y): {ε_ℓ}_{ℓ=1}^L are sampled from p(ε) and x^{(i,ℓ)} = g_φ(y_i, ε_ℓ). The KL term can often be evaluated in closed form, since both factors therein are often taken to be Gaussian; otherwise it can also be approximated by Monte Carlo. Note that in the preceding construction, we have allowed the observation data y to vary with the unknown image x, in order to accommodate the training data {(x_i, y_i)}_{i=1}^N. This differs slightly from standard approximate inference schemes, and is often referred to as amortized variational inference.
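The reparameterization x = g_φ(y, ε) can be illustrated with a small numpy sketch, where g_φ is a deliberately trivial affine map (the form of g_φ and all constants are illustrative): the randomness is isolated in ε ~ N(0, 1), so derivatives with respect to the parameters pass through the deterministic map g_φ rather than through the sampling step.

```python
import numpy as np

rng = np.random.default_rng(2)

def g_phi(y, eps, mu_w=0.5, sigma=0.3):
    """Illustrative reparameterization: x = mu(y) + sigma * eps, so that
    x ~ N(mu(y), sigma^2) while the sampling noise eps ~ N(0, 1) is
    decoupled from the parameters (mu_w, sigma)."""
    mu = mu_w * y                  # a trivially simple "encoder" mean
    return mu + sigma * eps

y = 2.0
L = 100000
eps = rng.standard_normal(L)       # eps_1, ..., eps_L ~ p(eps)
x = g_phi(y, eps)                  # L samples x^{(l)} = g_phi(y, eps_l)
# Monte Carlo estimates of E_q[x] and Var_q[x]; gradients with respect to
# (mu_w, sigma) can flow through g_phi because eps carries all the randomness.
print(round(float(x.mean()), 2), round(float(x.var()), 2))
```

Replacing the scalar affine map by a DNN that outputs (μ, σ) from y gives exactly the amortized setting described above.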
In generative modelling, the above considerations give rise to a formalism very similar to the popular variational autoencoders (VAEs) [31,32], with x assuming the role of the latent variable and y being the data to be probabilistically modelled; see the remark below for more details on the difference.
VAE is an unsupervised deep generative framework that learns a stochastic mapping between the observed y-space and a latent x-space. The model learns a joint distribution p θ (y, x) that factorises as p θ (y, x) = p θ (x)p θ (y|x) with a stochastic decoder p θ (y|x) and a prior distribution over a low-dimensional latent space p θ (x). A stochastic encoder q φ (x|y) (a.k.a. parametric inference model) approximates the intractable posterior p θ (x|y). The framework compresses the observed data into a constrained latent distribution (via the encoder) and reconstructs it faithfully (via the decoder). This process is carried out by two neural networks, an encoding network h φ (·) with parameter φ (i.e., the weights and the biases of the network) also called variational parameters, and a decoding network h θ (·) with parameter θ.
In practice, VAEs often do not use the encoding network to model the parameterization function g_φ(·) in an end-to-end way. Instead, the encoding network takes the observation y and outputs the coefficients that reparameterize the image x. For example, for a multivariate Gaussian q_φ(x|y), the encoding network can output the mean μ and the Cholesky factor L of the covariance Σ = LL^⊤. We can then generate samples from q_φ(x|y) by sampling from the standard Gaussian p(ε) = N(ε|0, I), followed by the affine transformation x = μ + Lε. VAE allows performing simultaneously VI (with respect to φ) and model selection (with respect to θ), and the resulting VAE objective is given by
$$\max_{\theta,\phi}\; \mathcal{L}_{\mathrm{VAE}}(\theta,\phi) := \mathbb{E}_{q_\phi(x|y)}[\log p_\theta(y|x)] - D_{\mathrm{KL}}(q_\phi(x|y)\,\|\,p_\theta(x)).$$
In practice, one may employ a Gaussian with identity covariance, with the mean being the decoder's output, as the likelihood p_θ(y|x), and a standard Gaussian distribution as the prior p_θ(x).
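The sampling step x = μ + Lε is a plain affine transform, as in the following numpy sketch (the values of μ and Σ are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(3)

mu = np.array([1.0, -1.0])
Sigma = np.array([[1.0, 0.6],
                  [0.6, 1.0]])
L = np.linalg.cholesky(Sigma)          # Sigma = L L^T

# draw samples from N(mu, Sigma) via the affine transform x = mu + L eps,
# with eps ~ N(0, I); one column of eps per sample
eps = rng.standard_normal((2, 50000))
x = mu[:, None] + L @ eps

print(np.round(x.mean(axis=1), 1))     # close to mu
print(np.round(np.cov(x), 1))          # close to Sigma
```

In a VAE, μ and L would be the outputs of the encoding network h_φ evaluated at y, rather than fixed constants.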
Remark 3.1. Note that the original formalism of VAE [31] is slightly different from the above. We briefly recall its derivation for the convenience of the readers. Given a dataset {y_i}_{i=1}^N drawn from an unknown probability distribution and a multivariate latent encoding vector z, the objective of VAE is to model the data y via a distribution p_θ(y), i.e.,
$$p_\theta(y) = \int p_\theta(y|z)\,p_\theta(z)\,\mathrm{d}z,$$
where θ represents the network parameters. Thus, in VAE, if we assume that p_θ(y|z) is a Gaussian distribution, then p_θ(y) is a (continuous) mixture of Gaussians. To solve for θ in a learning paradigm, one constructs an approximation q_φ(z|y) ≈ p_θ(z|y) by means of VI with variational parameters φ. Repeating the preceding discussion on VI directly yields the following standard VAE loss
$$\mathcal{L}_{\mathrm{VAE}}(\theta,\phi) = \frac{1}{N}\sum_{i=1}^N\Big(\mathbb{E}_{q_\phi(z|y_i)}[\log p_\theta(y_i|z)] - D_{\mathrm{KL}}(q_\phi(z|y_i)\,\|\,p_\theta(z))\Big).$$
The problem then reduces to an autoencoder formalism: the conditional likelihood distribution p_θ(y|z) is the probabilistic decoder, and the approximation q_φ(z|y) serves as the probabilistic encoder. Hence, the goal of VAE is to represent the given unlabelled data {y_i}_{i=1}^N and to generate new data (from the latent variable z), which differs markedly from the task in learned image reconstruction, for which the image (represented as the latent variable) is the main object of interest (and often of very high dimensionality, larger than that of y). Besides, the problem of modelling p_θ(y|z) and p_θ(z) in the VAE framework is usually unidentifiable, in the sense that there may exist many different (p_θ(y|z), p_θ(z)) pairs that admit the same marginal distribution p_θ(y) [29]. Thus the associated modelled approximate posterior q_φ(z|y) is not unique.

Proposed framework
We develop a computational UQ framework that learns a map from the observation y to a distribution, denoted by p_φ(x|y), which approximates the true posterior distribution p^*(x|y). The map is modelled with a recurrent network, and a probabilistic encoder therein allows for diverse reconstruction samples, thereby facilitating the quantification of the associated aleatoric uncertainty.

Conditional VAE as approximate inference
Note that a direct application of VAEs to image reconstruction is problematic: VAEs are unsupervised generative machineries, and would only use the noisy observations y, but not the ground-truth images x, during the training process. To circumvent this issue, we employ the cVAE loss [49,52], a conditional variant of VAE:
$$\max_{\theta,\phi}\; \mathcal{L}_{\mathrm{cVAE}}(\theta,\phi) := \mathbb{E}_{q_\theta(z|x,y)}[\log p_{\phi_2}(x|y,z)] - D_{\mathrm{KL}}(q_\theta(z|x,y)\,\|\,p_{\phi_1}(z|y)). \qquad (4.1)$$
In cVAEs, there are three distributions: a teacher encoder q_θ(z|x, y), a student encoder p_{φ_1}(z|y) and a conditional decoder p_{φ_2}(x|y, z). The vector φ = (φ_1, φ_2) assembles the parameters of p_{φ_1}(z|y) and p_{φ_2}(x|y, z). The approximate posterior p_φ(x|y) obtained by cVAE is given by
$$p_\phi(x|y) = \int p_{\phi_2}(x|y,z)\,p_{\phi_1}(z|y)\,\mathrm{d}z.$$
The cVAE loss admits the following approximate inference interpretation in a generalised sense. Define the functional
$$J^*(p_\phi(x|y)) := \mathbb{E}_{p^*(y)}\big[D_{\mathrm{KL}}(p^*(x|y)\,\|\,p_\phi(x|y))\big].$$
Note that J^* is a functional of the variational distribution p_φ(x|y) and other auxiliary distributions involving z.

Proposition 4.1. Minimizing the cVAE loss (4.1), in expectation over the training data distribution p^*(x, y), is equivalent to minimizing an upper bound of J^*(p_φ(x|y)).

Proof. By the definition of the KL divergence and the Fubini theorem,
$$J^*(p_\phi(x|y)) = \mathbb{E}_{p^*(x,y)}[\log p^*(x|y)] - \mathbb{E}_{p^*(x,y)}[\log p_\phi(x|y)]. \qquad (4.2)$$
Next we derive a lower bound for log p_φ(x|y) of the conditional distribution p_φ(x|y) using the standard procedure. Since p_φ(x, z|y) = p_φ(z|x, y)p_φ(x|y), we have
$$\log p_\phi(x|y) = \log p_\phi(x,z|y) - \log p_\phi(z|x,y),$$
and consequently, for any distribution q(z|x, y), there holds
$$\int q(z|x,y)\log\frac{p_\phi(x,z|y)}{q(z|x,y)}\,\mathrm{d}z = \log p_\phi(x|y) - D_{\mathrm{KL}}(q(z|x,y)\,\|\,p_\phi(z|x,y)).$$
By the nonnegativity of the KL divergence, the second term is nonpositive, and then using the splitting p_φ(x, z|y) = p_{φ_2}(x|y, z)p_{φ_1}(z|y), we deduce
$$\begin{aligned}
\log p_\phi(x|y) &\ge \int q(z|x,y)\log\frac{p_\phi(x,z|y)}{q(z|x,y)}\,\mathrm{d}z = \int q(z|x,y)\log\frac{p_{\phi_1}(z|y)\,p_{\phi_2}(x|y,z)}{q(z|x,y)}\,\mathrm{d}z\\
&= \int q(z|x,y)\log\frac{p_{\phi_1}(z|y)}{q(z|x,y)}\,\mathrm{d}z + \int q(z|x,y)\log p_{\phi_2}(x|y,z)\,\mathrm{d}z\\
&= -D_{\mathrm{KL}}(q(z|x,y)\,\|\,p_{\phi_1}(z|y)) + \mathbb{E}_{z\sim q(z|x,y)}[\log p_{\phi_2}(x|y,z)],
\end{aligned}$$
i.e.,
$$-\log p_\phi(x|y) \le D_{\mathrm{KL}}(q(z|x,y)\,\|\,p_{\phi_1}(z|y)) + \mathbb{E}_{z\sim q(z|x,y)}[-\log p_{\phi_2}(x|y,z)]. \qquad (4.3)$$
Taking expectation of (4.3) with respect to p^*(x, y) and then substituting it into the identity (4.2) yield
$$J^*(p_\phi(x|y)) \le \mathbb{E}_{p^*(x,y)}[\log p^*(x|y)] + \mathbb{E}_{p^*(x,y)}\big[D_{\mathrm{KL}}(q(z|x,y)\,\|\,p_{\phi_1}(z|y)) + \mathbb{E}_{z\sim q(z|x,y)}[-\log p_{\phi_2}(x|y,z)]\big].$$
Since the term E_{p^*(x,y)}[log p^*(x|y)] is independent of the variational distribution p_φ(x|y) and other auxiliary distributions involving z, minimizing the cVAE loss in (4.1) in expectation over the training data distribution p^*(x, y) is equivalent to minimizing an upper bound of J^*(p_φ(x|y)). This shows the desired assertion.
In view of Proposition 4.1, cVAEs can indeed learn an optimal map from the observation y to an approximate posterior distribution p_φ(x|y), in the sense of minimizing an expected KL divergence. This interpretation underpins the validity of the procedure for quantifying aleatoric uncertainty: the minimizer is indeed an approximation to the target posterior distribution p^*(x|y), constructed by a generalised variational inference principle.

Example 4.1. We briefly validate Proposition 4.1, which states that the cVAE framework can approximate the ground-truth distribution p^*(x|y) in the sense of generalised variational inference. Note that it is notoriously challenging to numerically verify the accuracy of any approximate inference scheme for high-dimensional distributions. Nonetheless, in the low-dimensional case, the target distribution can be efficiently sampled by Markov chain Monte Carlo [35]. To illustrate this, we take the ground-truth distribution p^*(x|y) to be a two-dimensional multi-modal distribution consisting of a mixture of Gaussians, shown in Figure 1(a), and approximate it by cVAE (with the teacher encoder, student encoder and decoder modelled by three different three-layer neural networks with ReLU as the nonlinear activation function). We clearly observe from Figure 1(b) that the approximation p_φ(x|y) is fairly accurate and captures well the multi-modality of the distribution, thereby verifying Proposition 4.1.
(a) ground-truth (b) cVAE approximation Figure 1: Validation of the cVAE for a toy two-dimensional distribution, Gaussian mixture with eight components.
We model the conditional decoder p_{φ_2}(x|y, z) by a mean-field Gaussian with covariance βI, where β > 0 is a hyperparameter. The DNN h_{φ_2} with parameter φ_2 only outputs the mean of p_{φ_2}(x|y, z), and on a mini-batch {(x_i, y_i)}_{i=1}^M the objective function L̂_cVAE(φ, θ) is given by
$$\hat{\mathcal{L}}_{\mathrm{cVAE}}(\phi,\theta) = \frac{1}{M}\sum_{i=1}^M\Big(-\frac{1}{2\beta L}\sum_{\ell=1}^L \|x_i - \hat{x}^{(i,\ell)}\|^2 - D_{\mathrm{KL}}(q_\theta(z|x_i,y_i)\,\|\,p_{\phi_1}(z|y_i))\Big),$$
where x̂^{(i,ℓ)} is the mean of the distribution p_{φ_2}(x|y_i, z^{(i,ℓ)}) and {z^{(i,ℓ)}}_{ℓ=1}^L are L samples drawn from the distribution q_θ(z|x_i, y_i), with the default choice L = 1. Note that, for special choices of q_θ(z|x, y) and p_{φ_1}(z|y), the KL divergence term may be evaluated analytically, and this closed form can be used if available. The gradient of the loss L̂_cVAE(φ, θ) is then evaluated by the reparameterization trick.
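The mini-batch objective can be sketched in numpy as follows, using toy affine maps in place of the teacher encoder, student encoder and decoder (which in the actual framework are the DNNs h_θ, h_{φ1}, h_{φ2}), together with the closed-form KL divergence between two diagonal Gaussians; all stand-in maps and constants are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)

def kl_diag_gauss(mu_q, logvar_q, mu_p, logvar_p):
    """Closed-form KL(N(mu_q, diag) || N(mu_p, diag)), summed over dims."""
    return 0.5 * np.sum(
        logvar_p - logvar_q
        + (np.exp(logvar_q) + (mu_q - mu_p) ** 2) / np.exp(logvar_p)
        - 1.0)

def cvae_minibatch_loss(xs, ys, beta=0.1, L=1):
    """Negative of the cVAE objective on a mini-batch {(x_i, y_i)}:
    Gaussian reconstruction term (likelihood covariance beta*I) plus the
    KL between teacher q_theta(z|x,y) and student p_phi1(z|y)."""
    loss = 0.0
    for x, y in zip(xs, ys):
        # toy teacher encoder for z | x, y (stand-in for h_theta)
        mu_t, logvar_t = 0.5 * (x + y), np.full_like(x, -1.0)
        # toy student encoder for z | y (stand-in for h_phi1)
        mu_s, logvar_s = 0.8 * y, np.zeros_like(y)
        recon = 0.0
        for _ in range(L):
            eps = rng.standard_normal(x.shape)
            z = mu_t + np.exp(0.5 * logvar_t) * eps   # z ~ q_theta(z|x,y)
            x_hat = z                                 # toy decoder mean
            recon += np.sum((x - x_hat) ** 2) / (2 * beta * L)
        loss += recon + kl_diag_gauss(mu_t, logvar_t, mu_s, logvar_s)
    return loss / len(xs)

xs = [rng.standard_normal(3) for _ in range(4)]
ys = [x + 0.1 * rng.standard_normal(3) for x in xs]
print(float(cvae_minibatch_loss(xs, ys)))
```

In the toy sketch z is given the same dimension as x purely for brevity; in the actual framework z is low-dimensional and the decoder maps (y, z) back to image space.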
Remark 4.1. In VAEs, the approximate posterior of the image x (i.e., the latent variable) is modelled by q_φ(x|y), whereas in cVAEs, it is modelled by p_φ(x|y) = ∫ p_{φ_2}(x|y, z)p_{φ_1}(z|y)dz. In both VAEs and cVAEs, the stochasticity of x comes from z: in VAEs, z is independent of the observation y, whereas in cVAEs, z is dependent on y. Since the distribution of z is learned, it can be more flexible than that in VAEs. Thus, even if p_{φ_2}(x|y, z) is chosen to be a simple distribution (e.g., Gaussian), p_φ(x|y) can still model a broad family of distributions for the continuous unobservable x, due to the presence of p_{φ_1}(z|y), in a manner similar to a scale mixture of Gaussians.

cVAE for learned reconstruction
Probabilistic modelling consists of a learning principle, given by a loss function with a proper probabilistic interpretation, and a graphical model describing the probabilistic dependency between variables. In the proposed framework, we employ a cVAE type loss function:
$$\mathcal{L}(\theta,\phi;x,y,A) = \mathbb{E}_{q_\theta(z|x,y,A)}[\log p_{\phi_2}(x|y,z,A)] - D_{\mathrm{KL}}(q_\theta(z|x,y,A)\,\|\,p_{\phi_1}(z|y)). \qquad (4.4)$$
Its difference from the standard cVAE loss (4.1) is that (4.4) also includes the forward map A (and its adjoint A^*) as a part of the training data. Here A may have different realizations (e.g., corresponding to different levels of discretization) with varying dimensions. Nonetheless, it is viewed as a deterministic variable. The modelled approximate posterior distribution p_φ(x|y) is then given by
$$p_\phi(x|y) = \int p_{\phi_2}(x|y,z,A)\,p_{\phi_1}(z|y)\,\mathrm{d}z.$$
The graphical model is given in Figure 2(a). The learning algorithm learns a conditional sampler, in a manner similar to a random number generator (RNG) for a given probability distribution. Note that for an RNG, different runs lead to different samples, but with a fixed random seed, the path is the same for different runs. The auxiliary (low-dimensional) latent variable z (conditionally dependent on the observation y) is an analogue of the random seed in the RNG, and is introduced into the deterministic recurrent process (modelled by a network) to diversify the reconstruction samples. In particular, for a fixed z, the recursion process inputs the sample initialization x_0 and applies a recurrent refining step based on suitable sample quality measures and the information encoded in the variable z. Three DNNs are employed to model the distributions in L(θ, φ; x, y, A) and to carry out the conditional sampling process: one for the recursive process (i.e., the recurrent unit modelled by a neural network) and two for the probabilistic encoders (i.e., the teacher and student encoders). The observation y and the forward map A constitute their inputs: y is fed into the two probabilistic encoders and each recurrence of the recurrent unit; A is fed into each recurrence of the recurrent unit and the teacher encoder.
Below we explain how the three networks work separately.
The recurrent component is the (deep) network h_{φ_2}(·); see Figure 2(b). The network begins with an initial guess x_0 (default: the back-projected data A^* y) and outputs x_K after K iterations as the mean of p_φ(x|y, z, A), following the established idea of algorithmic unrolling [44,39], which mimics the iterations of standard optimization algorithms (e.g., (proximal) gradient descent, the alternating direction method of multipliers (ADMM), and the primal-dual hybrid gradient method). At the k-th recursion, the network h_{φ_2} takes one sample x_{k−1} to refine and outputs an improved sample x_k. To incorporate the forward map A and the observation y, we employ a functional E(A, y, x_{k−1}). In the lens of variational regularization [24], E(A, y, x_{k−1}) measures how well x_{k−1} can explain the data y. To indicate how well x_{k−1} fulfils the prior knowledge (e.g., sparsity and smoothness), we use a penalty R(x_{k−1}) as a part of the input. For the sample quality measure E(A, y, x_{k−1}) and the penalty R(x_{k−1}), we use ||y − Ax_{k−1}||^2 (or its gradient), and ||x_{k−1}||_2^2 or |x_{k−1}|_TV (the total variation semi-norm), respectively. Besides the latest iterate x_{k−1} and the quality measures E and R, the network h_{φ_2}(·) also takes a memory variable a_{k−1} and an auxiliary variable z. The memory variable a_{k−1} plays the role of momentum in gradient type methods, and retains long-term information between recursions. The overall procedure resembles an unrolled momentum-accelerated gradient descent. The auxiliary random variable z is low-dimensional and injects randomness into the iteration procedure.
Since both x_{k−1} and x_k belong to the image space R^n, we adopt CNNs without pooling layers to model the recurrent unit h_φ2. The different inputs of h_φ2(·) are concatenated along the channel axis, and so are its outputs, namely the update δx_k (with x_k = x_{k−1} + δx_k) and the updated memory a_k.
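To make the recursion concrete, the data flow of one sampling run can be sketched as follows. This is a minimal numpy illustration in which a plain momentum gradient step on the data-fit term E(A, y, x) = ||y − Ax||^2 stands in for the learned CNN h_φ2; all function names, step sizes, and the way z perturbs the update are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def refine_step(x, a, z, A, y, step=0.1, momentum=0.9):
    """One recursion of a stand-in recurrent unit.

    The inputs mirror those of h_phi2: the current sample x, the memory
    variable a, the latent code z, and the gradient of the data-fit term
    E(A, y, x) = ||y - A x||^2.  A learned CNN would map the concatenated
    inputs to (delta_x, a); here a momentum gradient step with a small
    z-dependent perturbation stands in for it.
    """
    grad_E = A.T @ (A @ x - y)        # gradient of the data-fit term
    a = momentum * a - step * grad_E  # memory variable acts like momentum
    delta_x = a + 0.01 * z            # latent z diversifies the update
    return x + delta_x, a

def sample_reconstruction(A, y, z, K=10):
    """Run K recursions from the back-projected initialisation A^T y."""
    x, a = A.T @ y, np.zeros(A.shape[1])
    for _ in range(K):
        x, a = refine_step(x, a, z, A, y)
    return x
```

Different draws of z then yield different trajectories and hence different samples, while a fixed z reproduces the same path, exactly as with a fixed random seed in an RNG.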
Remark 4.2. At each step, the recurrent unit (the neural network h_φ2) reuses the observation data y and the map A for refinement, so the overall procedure differs from a deterministic mapping that serves as a post-processing step of back-projection. The latter is an end-to-end mapping that takes the back-projected data and outputs a refinement, whereas the proposed approach employs the current sample and the quality measures and then decides the refinement strategy.

The framework employs two encoders of z: a teacher encoder q_θ(z|x, y, A) and a student encoder p_φ1(z|y); see Figure 3. The student encoder p_φ1(z|y) takes the observation y, and encodes the observation-based knowledge so as to inform the recurrent unit h_φ2. Given one sample z from p_φ1(z|y), the network h_φ2 gives one refining increment, and the distribution of z contributes to the diversity of the unknown image x. To help train the student encoder p_φ1(z|y), we provide the teacher encoder q_θ(z|x, y, A) with the ground truth x and the forward map A; the teacher encoder is discarded once the training is finished. The encoders p_φ1(z|y) and q_θ(z|x, y, A) are modelled by two DNNs h_φ1(·) and h_θ(·), respectively, which reparameterize the auxiliary variable z. Since z is low-dimensional, CNNs with reduce-mean layers and 1 × 1 convolutional layers can guarantee the dimension flexibility of the input y. To input the ground-truth data x and y into h_θ(·), we apply the forward map A and concatenate A(x) with y along the channel axis. This construction is very flexible with respect to the problem dimension, and allows training pairs (x_i, y_i) of varying size (and the corresponding A_i).

Now we can state the complete algorithms for training and inference of the cVAE framework for uncertainty-aware learned image reconstruction in Algorithms 1 and 2, respectively.
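Both encoders output the mean and (diagonal) log-variance that reparameterize the low-dimensional z (six-dimensional in Figure 5), and, as in the standard cVAE loss, the training penalizes the KL divergence between the two distributions. A minimal numpy sketch of the reparameterization trick and the closed-form KL between two diagonal Gaussians (function names are illustrative; in the paper the moments are produced by the CNNs h_φ1 and h_θ):

```python
import numpy as np

rng = np.random.default_rng(0)

def reparameterize(mu, log_var):
    """Draw z = mu + sigma * eps with eps ~ N(0, I) (reparameterization trick),
    so that gradients can flow through the sampling step."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_diag_gaussians(mu_q, log_var_q, mu_p, log_var_p):
    """Closed-form KL(q || p) between two diagonal Gaussians, e.g. the
    teacher q_theta(z|x, y, A) and the student p_phi1(z|y)."""
    var_q, var_p = np.exp(log_var_q), np.exp(log_var_p)
    return 0.5 * np.sum(
        log_var_p - log_var_q + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0
    )
```

Since the KL term vanishes when the two distributions coincide, minimizing it drives the student, which sees only y, towards the teacher, which additionally sees x and A.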
Here M denotes the mini-batch size, T the maximum number of training batches, K the number of recurrences of h_φ2 for one sample, and (φ̂, θ̂) the output of the training procedure (i.e., the learned parameters). There are many possible choices of the stochastic optimizer at line 11 of Algorithm 1 (e.g., ADAM [30] and SGD); we employ ADAM [30] in our experiments. The final sample of the recurrent process is regarded as the mean of the conditional distribution p_φ(x|y, z, A). Thus, given the initial guess x_0, the iteration with different realisations of z leads to diverse samples of the unknown image x. Since each sample is the mean of p_φ(x|y, z, A) rather than a direct sample from the approximate posterior p_φ(x|y), the summarizing statistics also have to be transformed; see (4.5). Note that the posterior covariance contains two components: one due to the background (i.e., βI), and one due to the sample covariance, as shown in the following result.

Proposition 4.2. Let p_φ(x|y, z, A) = N(x|x_K(z), βI) and let the approximate posterior p_φ(x|y) = ∫ p_φ2(x|y, z, A) p_φ1(z|y) dz be a mixture of Gaussian distributions with z being the mixture variable. Then, given samples {z_i}_{i=1}^S of z from p_φ1(z|y) and the corresponding x_K(z_i), denoted by {x_i}_{i=1}^S, the mean E_{p_φ(x|y)}[x] and the covariance Cov_{p_φ(x|y)}[x] of p_φ(x|y) can be estimated by
$$\mathbb{E}_{p_\phi(x|y)}[x] \approx \bar x := \frac{1}{S}\sum_{i=1}^S x_i, \qquad \mathrm{Cov}_{p_\phi(x|y)}[x] \approx \beta I + \frac{1}{S-1}\sum_{i=1}^S (x_i-\bar x)(x_i-\bar x)^\top. \tag{4.5}$$
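In code, the estimates in (4.5) amount to a sample average and a β-shifted sample covariance. A minimal numpy sketch for the pointwise (diagonal) statistics reported in the experiments (the function name is illustrative):

```python
import numpy as np

def posterior_summaries(samples, beta):
    """Estimate the mean and pixelwise variance of p_phi(x|y) from S
    conditional means x_i = x_K(z_i), following (4.5): the mixture mean
    is the sample average, and the variance adds the uniform background
    beta to the (unbiased, ddof=1) sample variance of the x_i."""
    samples = np.asarray(samples)            # shape (S, n): S samples of x
    mean = samples.mean(axis=0)
    var = beta + samples.var(axis=0, ddof=1)
    return mean, var
```

Subtracting beta from the returned variance recovers the structural (sample) component discussed below.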
Proof. For the mean E_{p_φ(x|y)}[x], by definition, there holds
$$\mathbb{E}_{p_\phi(x|y)}[x] = \int x\, p_\phi(x|y)\,\mathrm{d}x = \int\!\!\int x\, p_{\phi_2}(x|y,z,A)\, p_{\phi_1}(z|y)\,\mathrm{d}x\,\mathrm{d}z = \int x_K(z)\, p_{\phi_1}(z|y)\,\mathrm{d}z = \mathbb{E}_{p_{\phi_1}(z|y)}[x_K(z)].$$
Similarly, for the covariance Cov_{p_φ(x|y)}[x], by the standard bias-variance decomposition,
$$\mathrm{Cov}_{p_\phi(x|y)}[x] = \mathbb{E}_{p_{\phi_1}(z|y)}\big[\mathrm{Cov}_{p_{\phi_2}(x|y,z,A)}[x]\big] + \mathrm{Cov}_{p_{\phi_1}(z|y)}\big[\mathbb{E}_{p_{\phi_2}(x|y,z,A)}[x]\big].$$
For the first term on the right-hand side, there holds Cov_{p_φ2(x|y,z,A)}[x] = βI for every z, so its expectation is βI, while the second term is the covariance of x_K(z) under p_φ1(z|y). Replacing the expectations with Monte Carlo averages over the samples {x_i}_{i=1}^S yields the estimates in (4.5).
Numerical experiments and discussions
Now we showcase the proposed cVAE framework for learned reconstruction with quantified aleatoric uncertainty with numerical experiments on positron emission tomography (PET). PET is a pillar of modern diagnostic imaging, allowing noninvasive, sensitive and specific detection of functional changes in a number of diseases. Most conventional PET reconstruction algorithms rely on penalized maximum likelihood estimates, using a hand-crafted prior (e.g., total variation and anatomical priors); see the work [45] for an overview of classical reconstruction techniques. More recently, learning based approaches have been proposed. While these techniques have been successful, they still lack the capability to provide uncertainty estimates; see the works [5,17,54,57] for several recent investigations on UQ in PET reconstruction, none of which is based on deep learning.

[Algorithm 1 (training) and Algorithm 2 (inference) listings: given the hyper-parameters β, T, K and M, each pass samples z_s (from the teacher encoder during training and from the student encoder p_φ1(z|y) at inference), initialises x_s with A*(y) and the memory a with zeros, and runs K recurrences of h_φ2; training additionally updates the parameters with a stochastic optimizer.]
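The outer training loop of Algorithm 1 can be sketched as follows. This is an illustrative skeleton only: plain SGD stands in for ADAM, and grad_fn abstracts the gradient of the cVAE loss L(θ, φ; x, y, A); all names are assumptions:

```python
import numpy as np

def train(data_stream, grad_fn, params, T=100, M=10, lr=1e-3):
    """Minimal sketch of the training loop of Algorithm 1: for each of T
    batches, draw M training triples (x, y, A), average the gradient of
    the cVAE loss over the batch via grad_fn, and take a stochastic
    gradient step (the paper uses ADAM; plain SGD here for brevity)."""
    for t in range(T):
        batch = [next(data_stream) for _ in range(M)]
        grads = np.mean([grad_fn(params, x, y, A) for x, y, A in batch], axis=0)
        params = params - lr * grads  # one stochastic optimizer step
    return params
```

In the actual implementation, params collects the weights of h_φ1, h_φ2 and h_θ, and the batch is generated on-the-fly as described in the experimental setup below.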
For the experiments below, we employ a 3-layer CNN as the recurrent unit h_φ2 and fix K = 10 iterations for each sampling step (cf. Figure 4 for a schematic illustration), and VGG style encoders for both h_φ1 and h_θ (cf. Figure 5). We train the network on a synthetic dataset consisting of elliptical phantoms, and test it on the public medical imaging dataset BrainWeb [14] (available from https://brainweb.bic.mni.mcgill.ca/). Throughout, the training pair (x, y) ∈ R^{128×128} × R^{30×183}, and the forward map A is the (normalised) Radon transform; different peak values of x are used to indicate the count level: 1e4 and 1e2 for the moderate and low count levels, respectively. The observation y is generated by corrupting the sinogram Ax with Poisson noise entrywise, i.e., y_i ∼ Pois((Ax)_i), where Pois(·) denotes the Poisson distribution. The hyper-parameter β is tuned in a trial-and-error manner, and fixed at β = 5e-3 below. The experiments are conducted on a desktop with two Nvidia GeForce 1080 Ti GPUs and an Intel i7-7700K CPU (4.20GHz×8). The network is trained for T = 1e5 batches, each containing 10 (x, y) pairs randomly generated on-the-fly. The training nearly converges after 2e4 batches, and it takes around 11 hours to go over all T = 1e5 batches. The summarizing statistics reported below are computed from 1000 samples for each observation y generated by the trained network. The implementation uses the following public deep learning frameworks: TensorFlow (https://www.tensorflow.org/, [1]), TensorFlow Probability (https://www.tensorflow.org/probability, [15]), DeepMind Sonnet (github.com/deepmind/sonnet) and ODL (github.com/odlgroup/odl), and the source code will be made available at github.com/chenzxyz/cvae.
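The stated observation model can be sketched as follows. This is a minimal numpy illustration; the convention of rescaling the phantom to the target peak value is an assumption consistent with the stated peak values 1e4 and 1e2, and the function name is illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_pet_data(A, x, peak):
    """Generate a noisy PET observation: rescale the phantom x to the
    given peak value (1e4 / 1e2 for the moderate / low count levels) and
    corrupt the sinogram Ax entrywise with Poisson noise,
    y_i ~ Pois((A x)_i)."""
    x_scaled = x * (peak / x.max())
    sinogram = A @ x_scaled.ravel()
    return rng.poisson(sinogram).astype(float)
```

Because the Poisson variance equals its mean, lowering the peak value lowers the counts and raises the relative noise level, which is why the low count case is the harder regime below.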
First we compare the proposed cVAE approach with conventional and deep learning based methods on all 181 phantoms in the BrainWeb dataset. It is compared with the following three benchmark methods: maximum likelihood EM (MLEM) [48], total variation (TV) [47] with a nonnegativity constraint, and learned gradient descent (LGD) [3]. MLEM and TV are two established reconstruction methods in the PET community, and LGD is an unrolled iterative method inspired by classical variational regularization that exploits DNNs for iterative refinement. For MLEM, we use the odl inbuilt solver mlem, and for TV, the primal-dual hybrid gradient method (implemented by odl.solvers.pdhg). The regularization parameter α for the total variation prior is fixed at α = 2e-1 and α = 2e0 for the moderate and low count levels, respectively, determined in a trial-and-error manner. The comparative results are summarized in Table 1, shown with the two most popular image quality metrics, SSIM and PSNR, averaged over all 181 phantoms in the BrainWeb dataset. The results clearly show that cVAE can deliver state-of-the-art point estimates in terms of PSNR and SSIM, especially in the practically very important low count case.

Figure 4: The layer configuration of the recurrent unit h_φ2: 3 × 3 × 32 denotes a convolutional layer with kernel size 3 × 3 and 32 output channels. In the third convolutional layer, 5 + 1 denotes 5 channels for the memory a_k and 1 channel for the update δx_k.

Figure 5: The layer configurations of the teacher (top) and student (bottom) encoders: (3 × 3 × 32) × 3 denotes 3 convolutional layers, each followed by a ReLU layer, with kernel size 3 × 3 and 32 output channels. The 2 under the brown layer denotes an average pooling layer with stride 2. 1 × 1 × (2 × 6) denotes a 1 × 1 convolutional layer with 12 output channels, i.e., 6 for the mean µ and 6 for the log (diagonal) variance log Σ.
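For reference, the PSNR metric reported in the comparisons can be computed as follows (a minimal sketch with the peak taken from the ground truth; SSIM is typically computed with a library routine, e.g. from scikit-image):

```python
import numpy as np

def psnr(x_true, x_rec):
    """Peak signal-to-noise ratio (in dB), with the peak value taken
    from the ground-truth image."""
    mse = np.mean((x_true - x_rec) ** 2)
    return 10.0 * np.log10(x_true.max() ** 2 / mse)
```

Higher PSNR indicates a smaller mean-squared error relative to the image peak, so the per-phantom averages in Table 1 directly reflect point-estimate accuracy.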
Compared with these deterministic benchmark methods, cVAE can additionally provide uncertainty information. Next we compare cVAE with a probabilistic approach [34], which reports state-of-the-art performance for aleatoric uncertainty (see [21] for theoretical interpretations). It employs (non-Bayesian) neural network ensembles to estimate predictive uncertainty, where each network in the ensemble learns similar values close to the training data, and different ones in regions of the space far from the training data. For the comparison, we train a mixture of three multivariate Gaussians (GM3) without adversarial samples, where the training of each component of the mixture fits a mean network and a variance network to the outputs using a Gaussian log-likelihood [40]. To stabilize the training procedure, we first train the mean network, and then train the variance network. Alternatively, one can train the mean network as a warm-up and then train the mean and variance networks simultaneously, but this usually leads to worse results, and thus we do not present the corresponding results. The comparative quantitative results are given in Table 2, which presents the PSNR and SSIM results for ten phantoms from the BrainWeb dataset, in order to shed fine-grained insight into the performance of the methods over different phantoms. It is observed that in the low count case, cVAE provides better point estimates in terms of both SSIM and PSNR, which concurs with Figure 6, whereas in the moderate count case, GM3 can sometimes deliver better results. In terms of the variance map, the results by GM3 contain more structural patterns, and resemble the error map more closely. To shed more insight into the variances by cVAE and the benchmark GM3, we show the cross-section plots with 0.95 Highest Posterior Density (HPD) regions in Figure 7 for both moderate and low count levels.
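For the GM3 benchmark, the predictive mean and variance of an equally weighted Gaussian mixture are available in closed form: for M components N(μ_m, σ_m²), the mixture mean is the average of the μ_m, and the mixture variance is the average of σ_m² + μ_m² minus the squared mixture mean. A minimal numpy sketch (the equal-weight assumption follows the ensemble construction of [34]; the function name is illustrative):

```python
import numpy as np

def ensemble_statistics(means, variances):
    """Predictive mean and variance of an equally weighted Gaussian
    mixture (deep-ensemble style; GM3 uses M = 3 components).  Inputs
    are the per-component means and variances, stacked along axis 0."""
    means, variances = np.asarray(means), np.asarray(variances)
    mu = means.mean(axis=0)
    var = (variances + means ** 2).mean(axis=0) - mu ** 2
    return mu, var
```

Note the parallel with Proposition 4.2: here too the predictive variance decomposes into an average within-component variance plus the spread of the component means.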
According to Proposition 4.2, the estimated covariance by cVAE contains two distinct sources: the sample variance and the variance β of the conditional Gaussians p_φ(x|y, z, A) = N(x|x_K(z), βI). The latter is uniform across the pixels, and acts as a background.

Figure 6: Reconstructions of two phantoms (No. 10 for the top two rows and No. 90 for the bottom two rows) from BrainWeb with peak value 1e2, by cVAE and GM3, respectively, in terms of the mean x̂, the pointwise error x̂ − x†, and the posterior variance Var(x).

We show the HPDs of cVAE with the full variance (the unbiased variance estimated by Cov_{p_φ(x|y)}[x]) and the variance without the β factor (i.e., Cov_{p_φ(x|y)}[x] − βI). It is observed that the latter yields more structure in the credible intervals. Further, the overall shape and magnitude of the HPDs by cVAE with the full variance and by GM3 are fairly close to each other. It is noteworthy that in the cold regions (i.e., regions of zero count), cVAE provides almost zero variance upon subtracting the background variance, and is able to indicate the contrast of the variance to highlight such pixels, whereas the variances by GM3 remain relatively high there. The comparison between the cross-section plots for the low and moderate count cases (i.e., high and low noise levels) on the same ground-truth phantom indicates that cVAE does provide higher uncertainty for the higher noise level, which is intuitively consistent with the underlying statistical background.

Lastly, we evaluate all the methods on phantoms with an artificially added tumour, created by changing the pixel values to the peak value, in order to examine their capability of recovering novel features that are not present in the training data. We (randomly) choose two phantoms from the BrainWeb dataset (Python style indices 10 and 110). A small tumour of radius 2 and a large tumour of radius 5 are added to the 10-th and 110-th phantoms, respectively. The corresponding reconstructions are shown in Figures 8 and 9. It is observed that the tumours can be clearly reconstructed by the cVAE means for both count levels, except the small tumour at the low count level. In the latter case, none of the methods can reasonably reconstruct the tumour, since the data is very noisy in view of the low signal strength. The results by cVAE, LGD and GM3 are comparable, at least visually.
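The 0.95 HPD bands in the cross-section plots of Figure 7 can be computed pointwise from the summarizing statistics; for approximately Gaussian pixel marginals, the 0.95 HPD coincides with the central interval mean ± 1.96 sd. A minimal sketch, with the option of subtracting the uniform background β (function name and the Gaussian approximation are assumptions for illustration):

```python
import numpy as np

def hpd_band(mean, var, level=1.96, beta=None):
    """Pointwise 0.95 highest-density band mean +/- 1.96 * sd along a
    cross-section.  Optionally subtract the uniform background variance
    beta (cf. Cov - beta*I) to expose the structural part of the
    uncertainty."""
    if beta is not None:
        var = np.maximum(var - beta, 0.0)  # clip to keep variances nonnegative
    sd = np.sqrt(var)
    return mean - level * sd, mean + level * sd
```

With beta subtracted, cold (zero-count) regions collapse to near-zero bands, matching the behaviour of cVAE observed above.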
The ability to reconstruct tumours further indicates that cVAE does not miss important features that are not present in the training data, so long as the signal is sufficiently strong, indicating a certain degree of robustness of the cVAE framework; by contrast, many deep learning based methods tend to miss the tumour due to the bias induced by tumour-free training data [38].

Figure 8: Reconstructions for one BrainWeb phantom (No. 10) with tumour, obtained by the benchmark methods (MLEM, TV, LGD, GM3) and cVAE, in terms of the ground truth x†, the reconstructions x_mlem, x_tv, x_lgd and x̂_cvae, and the variances Var(x_cvae) and Var(x_gm3). The top two rows are for the low count level and the bottom two rows for the moderate count level.

Conclusion
In this work, we have developed a general and flexible probabilistic computational framework for the uncertainty quantification of inverse problems in a purely data-driven setting. The approach is based on the conditional variational autoencoder loss, and employs a deep neural network as a recurrent unit to recurrently refine the samples using the observation and the forward map, seeded by a probabilistic encoder conditioned on the observation. The efficiency of the framework is underpinned by encoding the observations in a low-dimensional latent space. The significant potential of the framework has been demonstrated on PET image reconstruction with both moderate and low count levels, and the approach shows competitive performance when compared with several deterministic and probabilistic benchmark methods, especially in the low count regime. There are several avenues for further study. First, the framework is flexible and general, and it is of much interest to evaluate its potential on other computationally expensive imaging modalities (e.g., MRI, CT and PET-MRI), especially in the under-sampling and low-dose regimes, for which there is a great demand for uncertainty quantification due to the lack of sufficient information. Such studies will also shed important insight into the statistical features of the framework. Second, it is of much interest to analyze the theoretical properties of the cVAE loss as an upper bound of the expected KL divergence (e.g., approximation error and asymptotic convergence). This line of research has long been outstanding in approximate inference, and would provide theoretical guarantees for the overall inference procedure as well as guidelines for constructing efficient approximations. Third and last, it is imperative to develop scalable benchmarks for the uncertainty quantification of high-dimensional inverse problems.
Several deep learning based uncertainty quantification techniques have been proposed in the machine learning literature, but mostly for different types of uncertainty or without explicitly elucidating the sources of the uncertainty (see the work [8] for a preliminary study in medical imaging). A careful calibration of the obtained uncertainties is essential towards practical deployment, as recently highlighted in the review [7].