An Overview of Neural Network Methods for Predicting Uncertainty in Atmospheric Remote Sensing

In this paper, we present neural network methods for predicting uncertainty in atmospheric remote sensing. These include methods for solving the direct and the inverse problem in a Bayesian framework. In the first case, a method based on a neural network for simulating the radiative transfer model and a Bayesian approach for solving the inverse problem is proposed. In the second case, (i) a neural network, in which the output is the convolution of the output for a noise-free input with the input noise distribution; and (ii) a Bayesian deep learning framework that predicts input aleatoric and model uncertainties, are designed. In addition, a neural network that uses assumed density filtering and interval arithmetic to compute uncertainty is employed for testing purposes. The accuracy and the precision of the methods are analyzed by considering the retrieval of cloud parameters from radiances measured by the Earth Polychromatic Imaging Camera (EPIC) onboard the Deep Space Climate Observatory (DSCOVR).


Introduction
In atmospheric remote sensing, the retrieval of atmospheric parameters is an inverse problem that is usually ill posed. Due to this ill-posedness, small measurement errors can lead to large errors in the retrieved quantities. It is therefore desirable to accompany each retrieved value with an estimate of uncertainty describing the range of values that could plausibly have produced the measurement [1].
Retrieval algorithms are mostly based on deterministic or stochastic optimization methods. From the first category, the method of Tikhonov regularization and iterative regularization methods deserve to be mentioned, while from the second category, Bayesian methods are the most representative. The Bayesian framework [2,3] provides an efficient way to deal with the ill-posedness of the inverse problem and its uncertainties. In this case, the solution of the inverse problem is the a posteriori distribution (the conditional distribution of the retrieved quantity given the measurement), which accounts for all assumed retrieval uncertainties. Under the assumptions that (i) the a priori knowledge and the measurement uncertainty are both Gaussian, and (ii) the forward model is moderately nonlinear, the a posteriori distribution is approximately Gaussian with a covariance matrix that can be calculated analytically. However, even when these assumptions hold, the method is not efficient for the operational processing of large data volumes, because the computation of the forward model and its Jacobian is expensive.
Compared to Bayesian methods, neural networks are powerful tools for the design of efficient retrieval algorithms. Their capability to approximate any continuous function on a compact set to arbitrary accuracy makes them well suited to approximate the input-output function represented by a radiative transfer model. Consider a forward model F mapping an input x onto an output y. In the first case, referred to as the direct problem, the input x is the set of atmospheric parameters, the output y is the set of data, and the forward model F coincides with the radiative transfer model R, while in the second case, referred to as the inverse problem, the situation is reversed: the input x includes the sets of data and forward model parameters, while the output y includes the set of atmospheric parameters to be retrieved (in the absence of forward model parameters, the forward model F reproduces the inverse of the radiative transfer model, $R^{-1}$).
In machine learning, the task is to approximate F(x) by a neural network model f(x, ω) characterized by a set of parameters ω [26,27]. For doing this, we consider a set of inputs $X = \{x^{(n)}\}_{n=1}^N$ and a corresponding set of outputs $Y = \{y^{(n)}\}_{n=1}^N$, given by $y^{(n)} = F(x^{(n)})$, where N is the number of samples. In a regression problem, $D = \{(x^{(n)}, y^{(n)})\}_{n=1}^N$ forms a data set, or more precisely a training set, from which the neural network model f(x, ω) can be inferred. Traditional neural networks are composed of units or nodes arranged in an input layer, an output layer, and a number of hidden layers situated between them. Let L + 1 be the number of layers and $N_l$ the number of units in layer l, with l = 0, ..., L; the input layer corresponds to l = 0 and the output layer to l = L, so that $N_x = N_0$ and $N_y = N_L$. In feed-forward networks, the signals $y_{i,l-1}$ from units $i = 1, \dots, N_{l-1}$ in layer l − 1 are multiplied by a set of weights $w_{ji,l}$, summed, and combined with a bias $b_{j,l}$ to form the pre-activation signal
$$u_{j,l} = \sum_{i=1}^{N_{l-1}} w_{ji,l}\, y_{i,l-1} + b_{j,l},$$
which is transformed by the layer activation function $g_l$ into the activation signal $y_{j,l} = g_l(u_{j,l})$ of unit $j = 1, \dots, N_l$ in layer l. Defining the matrix of weights $W_l \in \mathbb{R}^{N_l \times N_{l-1}}$ and the vector of biases $b_l \in \mathbb{R}^{N_l}$ by $[W_l]_{ji} = w_{ji,l}$ and $[b_l]_j = b_{j,l}$, respectively, and letting $\omega = \{W_l, b_l\}_{l=1}^L$ be the set of network parameters, the feed-forward operations can be written in matrix form as
$$u_l = W_l y_{l-1} + b_l, \quad y_l = g_l(u_l), \quad l = 1, \dots, L, \quad f(x, \omega) = y_L,$$
where $[y_l]_i = y_{i,l}$ and $[u_l]_j = u_{j,l}$. Deep learning is the process of regressing the network parameters ω on the data D. The standard procedure is to compute a point estimate $\widehat{\omega}$ as the minimizer of some loss function by using the back-propagation algorithm [28].
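As a minimal sketch of the feed-forward operations described above, the recursion $u_l = W_l y_{l-1} + b_l$, $y_l = g_l(u_l)$ can be written in a few lines of NumPy; the layer sizes and random parameters below are purely illustrative, not those of any network used in this work:

```python
import numpy as np

def feed_forward(x, weights, biases, activations):
    """Propagate an input x through layers l = 1, ..., L:
    u_l = W_l y_{l-1} + b_l,  y_l = g_l(u_l);  f(x, omega) = y_L."""
    y = x
    for W, b, g in zip(weights, biases, activations):
        y = g(W @ y + b)
    return y

relu = lambda u: np.maximum(0.0, u)   # hidden-layer activation
identity = lambda u: u                # linear output layer

# Illustrative sizes: N_0 = 4 inputs, 8 hidden units, N_L = 2 outputs
rng = np.random.default_rng(0)
weights = [rng.standard_normal((8, 4)), rng.standard_normal((2, 8))]
biases = [np.zeros(8), np.zeros(2)]
y_out = feed_forward(rng.standard_normal(4), weights, biases, [relu, identity])
```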
In a stochastic framework, the loss function is usually defined as the log likelihood of the data set, with an eventual regularization term to penalize the network parameters. From a statistical point of view, this procedure is equivalent to a maximum a posteriori (MAP) estimation when regularization is used, and maximum likelihood estimation (MLE) when this is not the case.
In this section, we review the basic theory which serves as a basis for the development of different neural network architectures. In particular, we describe (i) the methodology for computing point estimates; (ii) the different types of uncertainty; and (iii) Bayesian networks.

Point Estimates
A data model with output noise is given by
$$y = F(x) + \delta_y, \quad \delta_y \sim N(0, C_{\delta y}), \qquad (6)$$
where here and in the following, the notation $N(x; \bar{x}, C_x)$, or more simply $N(\bar{x}, C_x)$ when no confusion arises, stands for a Gaussian distribution with mean $\bar{x}$ and covariance matrix $C_x$. When the true input x is hidden (so that it cannot be observed) but samples from a random vector $z = x + \delta_x$ with $\delta_x \sim N(0, C_{\delta x})$, i.e., $p(z|x) = N(z; x, C_{\delta x})$, are available, the pertinent model is the data model with input and output noise, that is,
$$y = f(z, \omega) + \Delta_y, \quad \Delta_y \sim N(0, C_{\Delta y}).$$
The error $\Delta_y$ sums up the contributions of the output error and of the input error propagated through the network into the output space. Specifically, when the noise process in the input space is small and the linearization
$$f(x, \omega) \approx f(z, \omega) + K_x(z, \omega)(x - z),$$
with $K_x = \partial f / \partial x$ the Jacobian of f with respect to the input, is assumed, we find (cf. Equations (5), (7), and (9)) $\Delta_y = \delta_y + K_x(z, \omega)(x - z)$, and further,
$$C_{\Delta y}(z, \omega) = C_{\delta y} + K_x(z, \omega)\, C_{\delta x}\, K_x^T(z, \omega). \qquad (11)$$
To design a neural network, we consider a data set D associated with each data model, meaning that: 1.
A data set with input and output noise $D = \{(z^{(n)}, y^{(n)})\}_{n=1}^N$, where $z^{(n)} = x^{(n)} + \delta_x$ and $y^{(n)} = F(x^{(n)}) + \delta_y$. In a stochastic framework, a neural network can be regarded as a probabilistic model p(y|z, ω): given an observable input z and a set of parameters ω, the network assigns a probability to each possible output y. In view of Equations (7) and (8), the a priori confidence in the predictive power of the model is given by
$$p(y|z, \omega) = N(y; f(z, \omega), C_{\Delta y}(z, \omega)).$$
The process of learning from the data D can be described by the posterior p(ω|D) = p(ω|Z, Y), which represents the Bayesian plausibility for the parameters ω given the data D. This can be estimated by using the Bayes rule
$$p(\omega|D) = \frac{p(D|\omega)\, p(\omega)}{p(D)}, \qquad (13)$$
where p(D|ω) is the likelihood or the probability of the data, p(ω) is the prior over the network parameters, $p(D) = \int p(D|\omega)\, p(\omega)\, d\omega$ is the evidence, and
$$E(\omega) = E_D(\omega) + E_R(\omega) = -\log p(D|\omega) - \log p(\omega)$$
is the loss function. The first term $E_D(\omega)$ in the expression of the loss function E(ω) is the contribution from the likelihood p(D|ω), written as the product (cf. Equation (12))
$$p(D|\omega) = \prod_{n=1}^N p(y^{(n)}|z^{(n)}, \omega),$$
while the second term $E_R(\omega)$ is the contribution from the prior p(ω), chosen for example as the Gaussian distribution $p(\omega) = N(\omega; 0, C_\omega)$. In this regard, point estimates with regularization are computed by maximizing the posterior p(ω|D),
$$\widehat{\omega}_{\mathrm{MAP}} = \arg\max_\omega p(\omega|D) = \arg\min_\omega E(\omega),$$
while point estimates without regularization are computed by maximizing the likelihood p(D|ω),
$$\widehat{\omega}_{\mathrm{MLE}} = \arg\max_\omega p(D|\omega) = \arg\min_\omega E_D(\omega).$$
Some comments are in order.

1.
For a data model with output noise, we have z = x, $C_{\Delta y} = C_{\delta y}$, and $D = \{(x^{(n)}, y^{(n)})\}_{n=1}^N$ with $y^{(n)} = F(x^{(n)}) + \delta_y$. Moreover, for the covariance matrix $C_{\delta y} = \sigma_y^2 I$, where I is the identity matrix, we find
$$E_D(\omega) = \frac{1}{2\sigma_y^2} \sum_{n=1}^N \| y^{(n)} - f(x^{(n)}, \omega) \|^2, \qquad (21)$$
up to an additive constant independent of ω. Assuming $C_\omega = \sigma_\omega^2 I$ and using Equation (21), we infer that the point estimate $\widehat{\omega} = \widehat{\omega}_{\mathrm{MAP}}$ is the minimizer of the Tikhonov function
$$T(\omega) = \frac{1}{2} \sum_{n=1}^N \| y^{(n)} - f(x^{(n)}, \omega) \|^2 + \alpha \| \omega \|^2, \qquad (23)$$
where $\alpha = \sigma_y^2 / (2\sigma_\omega^2)$ is the regularization parameter.

2.
A model with exact data can be handled by considering the data model with output noise and letting $\sigma_y^2 \approx 0$ in the representation of the data error covariance matrix $C_{\delta y} = \sigma_y^2 I$. For $\sigma_y^2 \approx 0$, the relation $y^{(n)} = F(x^{(n)}) + \delta_y$ yields $y^{(n)} \approx F(x^{(n)})$, while Equation (23) and the relation $\alpha = \sigma_y^2 / (2\sigma_\omega^2) \approx 0$ give
$$\widehat{\omega}_{\mathrm{MLE}} = \arg\min_\omega \sum_{n=1}^N \| y^{(n)} - f(x^{(n)}, \omega) \|^2.$$
Thus, when learning a neural network with exact data, the maximum likelihood estimate minimizes the sum of squared errors. Note that in this case, regularization is not strictly required, because the output data are exact.

3.
For a data model with input and output noise, the computation of the estimate $\widehat{\omega}$ is not a trivial task, because the covariance matrix $C_{\Delta y}(z, \omega)$, which enters the expression of $E_D(\omega)$, depends on ω. Moreover, in Equation (11), $C_{\Delta y}(z, \omega)$ corresponds to a linearization of the neural network function under the assumption that the noise process in the input space is small. This problem can be solved by implicitly learning the covariance matrix $C_{\Delta y}(z, \omega)$ from the loss function [29]. Specifically, we assume that $C_{\Delta y}(z, \omega)$ is a diagonal matrix with entries $\sigma_j^2(z, \omega)$, that is,
$$C_{\Delta y}(z, \omega) = \mathrm{diag}[\sigma_j^2(z, \omega)]_{j=1}^{N_y},$$
implying
$$-\log p(y|z, \omega) = \frac{1}{2} \sum_{j=1}^{N_y} \Big[ \frac{(y_j - \mu_j(z, \omega))^2}{\sigma_j^2(z, \omega)} + \log \sigma_j^2(z, \omega) \Big] + \mathrm{const}. \qquad (25)$$
Here, we identified µ ≡ f, and set $\mu_j = [\mu]_j$ and $y_j = [y]_j$. To learn the variances $\sigma_j^2(z, \omega)$ from the loss function, we use a single network with input z and output $[\mu_j(z, \omega), \sigma_j^2(z, \omega)] \in \mathbb{R}^{2N_y}$; thus, in the output layer, $N_y$ units are used to predict $\mu_j$ and $N_y$ units to predict $\sigma_j^2$. In practice, to increase numerical stability, we train the network to predict the log variance $\rho_j = \log \sigma_j^2$, in which case the likelihood loss function is
$$E_D(\omega) = \sum_{n=1}^N E_D^{(n)}(\omega), \qquad (26)$$
with
$$E_D^{(n)}(\omega) = \frac{1}{2} \sum_{j=1}^{N_y} \Big[ e^{-\rho_j(z^{(n)}, \omega)} \big( y_j^{(n)} - \mu_j(z^{(n)}, \omega) \big)^2 + \rho_j(z^{(n)}, \omega) \Big]. \qquad (27)$$
From Equation (25), it is apparent that the $\log \sigma_j^2$ term discourages the model from predicting high uncertainty everywhere, while the residual term discourages it from predicting low uncertainty for points with high residual error, as a low $\sigma_j^2$ increases the contribution of the residual. On the other hand, it should be noted that a basic assumption of the model is that the covariance matrix $C_{\Delta y}(z, \omega)$ is diagonal, which, unfortunately, is contradicted by Equation (11). 4.
In order to generate a data set with input and output noise, that is, $D = \{(z^{(n)}, y^{(n)})\}_{n=1}^N$ with $z^{(n)} = x^{(n)} + \delta_x$ and $y^{(n)} = F(x^{(n)}) + \delta_y$, we used the jitter approach. According to this approach, at each forward pass through the network, a new random noise realization $\delta_x$ is added to the original input vector $x^{(n)}$; each time a training sample is passed through the network, the input variables therefore differ. By this approach, which is a simple form of data augmentation, the stored data set remains small (actually, the stored data set is just $D = \{(x^{(n)}, y^{(n)})\}_{n=1}^N$ with $y^{(n)} = F(x^{(n)}) + \delta_y$).
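The heteroscedastic likelihood loss of Equations (26) and (27), with the log variance $\rho_j = \log \sigma_j^2$ predicted by the network, can be sketched as follows; this is a toy NumPy version in which the mean and log variance are passed in directly rather than produced by the two halves of an output layer:

```python
import numpy as np

def heteroscedastic_nll(y, mu, rho):
    """Per-sample loss 0.5 * sum_j [ exp(-rho_j) (y_j - mu_j)^2 + rho_j ],
    where rho_j = log sigma_j^2 is predicted instead of sigma_j^2 for
    numerical stability (no division by a possibly tiny variance)."""
    return 0.5 * np.sum(np.exp(-rho) * (y - mu) ** 2 + rho, axis=-1)

y = np.array([1.0, 2.0])             # target
mu = np.array([1.0, 1.0])            # predicted mean
rho = np.log(np.array([0.5, 2.0]))   # predicted log variances
loss = heteroscedastic_nll(y, mu, rho)
```

The `exp(-rho)` factor penalizes a low predicted variance wherever the residual is large, while the `+ rho` term penalizes indiscriminately high variances.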

Uncertainties
In our analysis, we are interested in estimating the uncertainty associated with the underlying processes. The quantity which exactly quantifies the model's uncertainty is the predictive distribution of an unknown output y given an observable input z, defined by
$$p(y|z, D) = \int p(y|z, \omega)\, p(\omega|D)\, d\omega. \qquad (28)$$
If p(y|z, D) is known, the first two moments of the output y can be computed as
$$E(y) = \int y\, p(y|z, D)\, dy, \qquad (29)$$
$$E(yy^T) = \int yy^T p(y|z, D)\, dy, \qquad (30)$$
and the covariance matrix as
$$\mathrm{Cov}(y) = E(yy^T) - E(y)\, E(y)^T. \qquad (31)$$
From Equation (28), we see that the predictive distribution p(y|z, D) can be computed if the Bayesian posterior p(ω|D) is known. Unfortunately, computing the distribution p(ω|D) by means of Equation (13) is usually an intractable problem, because computing the evidence $p(D) = \int p(D|\omega)\, p(\omega)\, d\omega$ is not a trivial task. To address this problem, either: 1.
The Laplace approximation, which yields an approximate representation for the posterior p(ω|D); or 2.
A variational inference approach, which learns a variational distribution $q_\theta(\omega)$ to approximate the posterior p(ω|D), can be used. In the first case, we are dealing with deterministic (point estimate) neural networks, in which a single realization of the network parameters ω is learned, while in the second case, we are dealing with stochastic neural networks, in which a distribution over the network parameters ω is learned. The Laplace approximation is of theoretical interest because it provides explicit representations for the different types of uncertainty. In Appendix A, it is shown that under the Laplace approximation, the predictive distribution is given by [23,24,30]
$$p(y|z, D) \approx N(y; f(z, \widehat{\omega}), C_y(z, \widehat{\omega})), \qquad (32)$$
with
$$C_y(z, \widehat{\omega}) = C_{\Delta y}(z, \widehat{\omega}) + C_y^e(z, \widehat{\omega}), \qquad (33)$$
$$C_y^e(z, \widehat{\omega}) = K_\omega(z, \widehat{\omega})\, H^{-1}(\widehat{\omega})\, K_\omega^T(z, \widehat{\omega}), \qquad (34)$$
where
$$K_\omega(z, \omega) = \frac{\partial f(z, \omega)}{\partial \omega}$$
is the Jacobian of f with respect to ω and
$$H(\widehat{\omega}) = \frac{\partial^2 E(\omega)}{\partial \omega\, \partial \omega^T} \Big|_{\omega = \widehat{\omega}}$$
is the Hessian matrix of the loss function. Equation (32) provides a Gaussian approximation to the predictive distribution p(y|z, D) with mean $f(z, \widehat{\omega})$ and covariance matrix $C_y(z, \widehat{\omega})$. From Equation (33), we see that the covariance matrix $C_y(z, \widehat{\omega})$ has two components. 1.
The first component $C_{\Delta y}(z, \widehat{\omega})$ is the covariance matrix of the distribution over the error in the output y. This term, which is input dependent, describes the so-called aleatoric heteroscedastic uncertainty measured by p(y|z, ω). For a data model with output noise, we have (cf. Equation (11)) $C_{\Delta y} = C_{\delta y}$, and this term, which is input independent, describes the so-called aleatoric homoscedastic uncertainty measured by p(y|x, ω).

2.
The second component $C_y^e(z, \widehat{\omega})$ reflects the uncertainty in the weights ω, also called epistemic or model uncertainty; it refers to the fact that we do not know the model that best explains the given data, i.e., the best values of the weights in all the trainable layers. The sources of this uncertainty, measured by p(ω|D), are for example: (i) non-optimal hyperparameters (the number of hidden layers, the number of units per layer, the type of activation function); (ii) non-optimal training parameters (the minimum learning rate at which the training is stopped, the learning rate decay factor, the batch size used for the mini-batch gradient descent); and (iii) a non-optimal optimization algorithm (ADAGRAD, ADADELTA, ADAM, ADAMAX, NADAM). Epistemic uncertainty is often referred to as reducible uncertainty, because it can in principle be reduced by acquiring more data.
Some comments can be made here.

1.
The heteroscedastic covariance matrix $C_{\Delta y}(z, \omega) = \mathrm{diag}[\sigma_j^2(z, \omega)]_{j=1}^{N_y}$ can be learned from the data by minimizing the loss functions (26) and (27).

2.
The computation of the epistemic covariance matrix $C_y^e(z, \widehat{\omega})$ from Equation (34) requires the knowledge of the Hessian matrix $H(\widehat{\omega})$. In general, this problem is practically insoluble because the matrix H is very large and therefore very difficult to compute. However, the problem can be avoided by using the diagonal Hessian approximation
$$[H]_{ij} \approx \delta_{ij} [H]_{ii},$$
where $\delta_{ij}$ is the Kronecker delta. The diagonal matrix elements can be computed very efficiently by using a procedure similar to the back-propagation algorithm used for computing the first derivatives [31]. 3.
The covariance matrix $C_y(z, \widehat{\omega})$ can be approximated by the conditional average covariance matrix $E(C_y|D)$ of all network errors $\varepsilon = y - f(z, \widehat{\omega})$ over the data set D, meaning that
$$E(C_y|D) = \frac{1}{N} \sum_{n=1}^N \big[\varepsilon^{(n)} - E(\varepsilon)\big] \big[\varepsilon^{(n)} - E(\varepsilon)\big]^T,$$
where $\varepsilon^{(n)} = y^{(n)} - f(z^{(n)}, \widehat{\omega})$ and $E(\varepsilon) = (1/N) \sum_{n=1}^N \varepsilon^{(n)}$. Note that this is a very rough approximation, because in contrast to $C_y(z, \widehat{\omega})$, $E(C_y|D)$ does not depend on z.

4.
For exact data, we have $C_y(x, \widehat{\omega}) = C_y^e(x, \widehat{\omega})$. Thus, when learning a neural network with exact data, the predictive distribution p(y|x, D) is Gaussian with mean $f(x, \widehat{\omega})$ and (epistemic) covariance matrix $C_y^e(x, \widehat{\omega})$.
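The rough approximation of comment 3, the conditional average covariance matrix of all network errors over the data set, amounts to a few lines of NumPy; the toy targets and predictions below merely stand in for actual network outputs:

```python
import numpy as np

def average_error_covariance(y_true, y_pred):
    """E(C_y|D) = (1/N) sum_n (eps_n - mean_eps)(eps_n - mean_eps)^T,
    with eps_n = y_n - f(z_n, omega_hat); note the result is independent
    of the input z, which is what makes the approximation rough."""
    eps = y_true - y_pred            # network errors, shape (N, Ny)
    eps_c = eps - eps.mean(axis=0)   # remove the mean error E(eps)
    return (eps_c.T @ eps_c) / eps.shape[0]

rng = np.random.default_rng(1)
y_true = rng.standard_normal((500, 2))
y_pred = y_true + 0.1 * rng.standard_normal((500, 2))  # toy network errors
C_avg = average_error_covariance(y_true, y_pred)
```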

Bayesian Networks
Stochastic neural networks are a type of neural network built by introducing stochastic components (activations or weights) into the network. A Bayesian neural network can be regarded as a stochastic neural network trained by using Bayesian inference [32]. This type of neural network provides a natural approach to quantifying uncertainty in deep learning, and allows one to distinguish between epistemic and aleatoric uncertainties.
Bayesian neural networks equipped with variational inference bypass the computation of the evidence $p(D) = \int p(D|\omega)\, p(\omega)\, d\omega$, which determines the Bayesian posterior p(ω|D) via Equation (13), by learning a variational distribution $q_\theta(\omega)$ to approximate p(ω|D), that is, $q_\theta(\omega) \approx p(\omega|D)$, where θ are some variational parameters. These parameters are computed by minimizing the Kullback-Leibler (KL) divergence
$$\mathrm{KL}\big(q_\theta(\omega)\,\|\,p(\omega|D)\big) = \int q_\theta(\omega)\, \log \frac{q_\theta(\omega)}{p(\omega|D)}\, d\omega$$
with respect to θ. The KL divergence is a measure of similarity between the approximate distribution $q_\theta(\omega)$ and the posterior distribution p(ω|D), and minimizing the KL divergence can be shown to be the same as minimizing the variational free energy, defined by
$$F(\theta, D) = \mathrm{KL}\big(q_\theta(\omega)\,\|\,p(\omega)\big) - E_{q_\theta(\omega)}\big[\log p(D|\omega)\big].$$
Considering the data model with output noise, assuming $C_{\delta y} = \sigma_y^2 I$, and replacing p(ω|D) by $q_\theta(\omega)$ in Equation (28), we obtain an approximate predictive distribution
$$q(y|x, D) = \int p(y|x, \omega)\, q_\theta(\omega)\, d\omega, \qquad (40)$$
which can be approximated at test time by
$$q(y|x, D) \approx \frac{1}{T} \sum_{t=1}^T p(y|x, \omega_t),$$
where $\omega_t$ is sampled from the distribution $q_\theta(\omega)$. As a result, for $p(y|x, \omega) = N(y; f(x, \omega), \sigma_y^2 I)$, the predictive mean and the covariance matrix of the output y given the input x can be approximated, respectively, by (Appendix B)
$$E(y) \approx \frac{1}{T} \sum_{t=1}^T f(x, \omega_t), \qquad (42)$$
$$\mathrm{Cov}(y) \approx \sigma_y^2 I + \frac{1}{T} \sum_{t=1}^T f(x, \omega_t)\, f^T(x, \omega_t) - E(y)\, E^T(y). \qquad (43)$$
The first term $\sigma_y^2 I$ in Equation (43) corresponds to the homoscedastic uncertainty (the amount of noise in the data), while the second part corresponds to the epistemic uncertainty (how much the model is uncertain in its prediction). The predictive mean (42) is known as model averaging and is equivalent to performing T stochastic passes through the network and averaging the results. Note that for exact data, i.e., $\sigma_y^2 \approx 0$, the covariance matrix simplifies to
$$\mathrm{Cov}(y) \approx \frac{1}{T} \sum_{t=1}^T f(x, \omega_t)\, f^T(x, \omega_t) - E(y)\, E^T(y) \qquad (44)$$
and describes the epistemic uncertainty.
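The Monte Carlo estimates (42) and (43) only require T stochastic forward passes; given the sampled outputs $f(x, \omega_t)$, the moments follow directly. In the sketch below, synthetic samples stand in for actual network passes:

```python
import numpy as np

def mc_predictive_moments(f_samples, sigma_y2):
    """Given T stochastic forward passes f_samples of shape (T, Ny), estimate
    E(y)   ~ (1/T) sum_t f_t                                   (Eq. 42)
    Cov(y) ~ sigma_y^2 I + (1/T) sum_t f_t f_t^T - E(y) E(y)^T (Eq. 43)."""
    T, Ny = f_samples.shape
    mean = f_samples.mean(axis=0)
    second_moment = (f_samples.T @ f_samples) / T
    cov = sigma_y2 * np.eye(Ny) + second_moment - np.outer(mean, mean)
    return mean, cov

rng = np.random.default_rng(2)
# Stand-in for T = 1000 stochastic passes with spread 0.2 around [1, -1]
f_samples = rng.normal(loc=[1.0, -1.0], scale=0.2, size=(1000, 2))
mean, cov = mc_predictive_moments(f_samples, sigma_y2=0.01)
```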
In this section, we present the most relevant algorithm used for Bayesian inference (Bayes-by-backprop), as well as two approximate Bayesian methods. For this purpose, we consider a data model with output noise and assume $C_{\delta y} = \sigma_y^2 I$; in this case, p(D|ω) is given by Equation (15) in conjunction with Equation (21) or (22).

Dropout
Dropout, which was initially proposed as a regularization technique during training [35,36], is equivalent to a Bayesian approximation. This result was proved in [37] by showing that the variational free energy F(θ, D) has the standard form of the dropout loss function (the sum of a square loss function and an $L_2$ regularization term). A simplified proof of this assertion is given in Appendix C. The term "dropout" refers to removing a unit along with all its connections; the choice of a dropped unit is random. In the case of dropout, the feed-forward operations of a standard neural network, Equations (2) and (3), become
$$u_{j,l} = \sum_{i=1}^{N_{l-1}} w_{ji,l}\, \widetilde{y}_{i,l-1} + b_{j,l}, \quad y_{j,l} = g_l(u_{j,l}),$$
where
$$\widetilde{y}_{k,l-1} = z_{k,l-1}\, y_{k,l-1}, \quad z_{k,l-1} \sim \mathrm{Bernoulli}(p).$$
Essentially, the output of unit k in layer l − 1, $y_{k,l-1}$, is multiplied by the binary variable $z_{k,l-1}$ to create the new output $z_{k,l-1}\, y_{k,l-1}$. The binary variable $z_{k,l-1}$ takes the value 1 with probability p and the value 0 with probability 1 − p; thus, if $z_{k,l-1} = 0$, the new output is zero and unit k is dropped out as an input to the next layer l. The same values of the binary variables are used in the backward pass when propagating the derivatives of the loss function. At test time, the weights $W_l$ are scaled as $pW_l$. Thus, we retain units with probability p at training time and multiply the weights by p at test time. Alternatively, in order to maintain constant outputs at training time, we can scale the weights by 1/p and leave the weights unmodified at test time. The model parameters $\omega = \{W_l, b_l\}_{l=1}^L$ are obtained by minimizing the loss function (21), possibly with an $L_2$ regularization term.
Uncertainty can be obtained from a dropout neural network [37]. For a set of samples $\{\omega_t\}_{t=1}^T$, where $\omega_t$ corresponds to a realization of the Bernoulli variables $z_{k,l-1}$ for all l = 1, ..., L, the predictive mean and covariance matrix are computed by means of Equations (42) and (43).
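A sketch of the dropout forward pass, using the 1/p scaling variant at training time and keeping dropout active for the T stochastic passes of Equations (42) and (43); the ReLU layers and random weights below are illustrative stand-ins for a trained network:

```python
import numpy as np

rng = np.random.default_rng(3)

def dropout_forward(x, weights, biases, p, train=True):
    """One pass through a ReLU network with dropout applied to the inputs of
    each layer. At training time, units are kept with probability p and the
    surviving outputs are scaled by 1/p, so the weights need no rescaling at
    test time."""
    y = x
    for W, b in zip(weights, biases):
        if train:
            z = rng.random(y.shape) < p   # Bernoulli(p) keep mask
            y = z * y / p                 # drop and rescale
        y = np.maximum(0.0, W @ y + b)
    return y

weights = [rng.standard_normal((8, 4)), rng.standard_normal((2, 8))]
biases = [np.zeros(8), np.zeros(2)]
x = rng.standard_normal(4)
# T stochastic passes with dropout kept on: samples from the approximate posterior
samples = np.stack([dropout_forward(x, weights, biases, p=0.9) for _ in range(200)])
pred_mean, pred_std = samples.mean(axis=0), samples.std(axis=0)
```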

Batch Normalization
Ioffe and Szegedy [38] introduced batch normalization as a technique for training deep neural networks that normalizes (standardizes) the inputs to a layer for each mini-batch. This allows for higher learning rates, regularizes the model, and reduces the number of training epochs. Moreover, Teye et al. [39] showed that a batch normalized network can be regarded as an approximate Bayesian model, and it can thus be used for modeling uncertainty.
Let us split the data set D into $N_b$ mini-batches, and let the nth mini-batch $B^{(n)}$ contain M pair samples $(x^{(n,m)}, y^{(n,m)})$, meaning that $B^{(n)} = \{(x^{(n,m)}, y^{(n,m)})\}_{m=1}^M$. In batch normalization, the input $u_{j,l}^{(n,m)}$ of the activation function $g_l$, corresponding to unit j, layer l, sample m, and mini-batch n, is first normalized,
$$\widehat{u}_{j,l}^{(n,m)} = \frac{u_{j,l}^{(n,m)} - \mu_{j,l}^{(n)}}{\sqrt{v_{j,l}^{(n)} + \epsilon}}, \qquad (52)$$
and then scaled and shifted,
$$\widetilde{u}_{j,l}^{(n,m)} = \alpha_{j,l}\, \widehat{u}_{j,l}^{(n,m)} + \beta_{j,l}, \qquad (53)$$
where
$$\mu_{j,l}^{(n)} = \frac{1}{M} \sum_{m=1}^M u_{j,l}^{(n,m)}, \quad v_{j,l}^{(n)} = \frac{1}{M} \sum_{m=1}^M \big( u_{j,l}^{(n,m)} - \mu_{j,l}^{(n)} \big)^2$$
are the mean and variance of the activation inputs over the M samples (in unit j, layer l, and mini-batch n), respectively, $\alpha_{j,l}$ and $\beta_{j,l}$ are learnable model parameters, and ε is a small number added to the mini-batch variance to prevent division by zero. By normalization (or whitening), $u_{j,l}^{(n,m)}$ becomes a random variable with zero mean and unit variance, while by scaling and shifting, we guarantee that the transformation (53) can represent the identity transform. In a stochastic framework, we interpret the mini-batch mean and variance $\omega^{(n)} = (\mu^{(n)}, v^{(n)})$ as a realization of the stochastic parameters ω = (µ, v). Optimizing over mini-batches of size M instead of the full training set, the objective function for the mini-batch $B^{(n)}$ becomes [39]
$$F^{(n)}(\theta) = \mathrm{KL}\big(q_\theta(\omega)\,\|\,p(\omega)\big) - \frac{N}{M} \sum_{m=1}^M \log p\big(y^{(n,m)} \,\big|\, x^{(n,m)}, \theta, \omega^{(n)}\big),$$
where the conditioning on $\omega^{(n)}$ indicates that the mean and variance $\omega^{(n)} = (\mu^{(n)}, v^{(n)})$ are used in the normalization step (52). At the end of the training stage, we obtain:
1.
The maximum a posteriori parameters $\theta = \theta_{\mathrm{MAP}}$;
2.
The mean and variance realizations $\{\omega^{(n)}\}_{n=1}^{N_b}$ of the stochastic parameters ω = (µ, v); and
3.
The moving averages of the mean and variance over the training set $\overline{\omega} = (\overline{\mu} = E(\mu), \overline{v} = E(v))$. Some comments can be made here.

1.
A batch-normalized network samples the stochastic parameters ω once per training step (mini-batch). For a large number of epochs, the $\omega^{(n)}$ become independent and identically distributed random variables for each training example;
The variational distribution q θ (ω) corresponds to the joint distribution of the weights induced by the stochastic parameters ω; 3.
In the inference, the output for a given input x is $f_{\overline{\omega}}(x, \widehat{\theta})$. To estimate the predictive mean and covariance matrix, we proceed as follows. For each t = 1, ..., T, we sample a mini-batch $B^{(t)}$ from the data set $D = \cup_{n=1}^{N_b} B^{(n)}$, compute the output $f_{\omega_t}(x, \widehat{\theta})$ for the corresponding mean and variance realization $\omega_t = (\mu^{(t)}, v^{(t)})$, and then compute the predictive mean and covariance matrix from Equations (42) and (43).
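The normalization and scale-and-shift steps (52) and (53) over a mini-batch can be sketched as follows; ε and the batch statistics follow the text, while the input values are illustrative:

```python
import numpy as np

def batch_norm(u, alpha, beta, eps=1e-5):
    """Normalize pre-activations over a mini-batch (axis 0), then scale and
    shift with the learnable parameters alpha and beta (Eqs. 52 and 53)."""
    mu = u.mean(axis=0)                      # mini-batch mean per unit
    v = u.var(axis=0)                        # mini-batch variance per unit
    u_hat = (u - mu) / np.sqrt(v + eps)      # whitened pre-activation
    return alpha * u_hat + beta, mu, v

rng = np.random.default_rng(4)
u = rng.normal(5.0, 3.0, size=(100, 8))      # M = 100 samples, 8 units
out, mu, v = batch_norm(u, alpha=1.0, beta=0.0)
```

With alpha = 1 and beta = 0, the output has (nearly) zero mean and unit variance per unit; learning alpha and beta lets the layer recover the identity transform if that is optimal.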

Neural Networks for Atmospheric Remote Sensing
In this section, we design several neural networks for atmospheric remote sensing. To test the neural networks, we considered a specific problem, namely the retrieval of cloud optical thickness and cloud top height from radiances measured by the Earth Polychromatic Imaging Camera (EPIC) onboard the Deep Space Climate Observatory (DSCOVR). DSCOVR is placed in a Lissajous orbit about the Lagrange-1 point and provides a unique angular perspective in an almost backward direction, with scattering angles between 168° and 176°. The EPIC instrument has 10 spectral channels ranging from the UV to the near-IR, which include four channels around the oxygen A- and B-bands: two absorption channels are centered at 688 nm and 764 nm with bandwidths of 0.8 nm and 1.0 nm, respectively, while two continuum bands are centered at 680 nm and 780 nm with bandwidths of 2.0 nm. These four channels are used for inferring the cloud parameters. To generate the database, we use a radiative transfer model based on the discrete ordinate method with matrix exponential [40,41] that uses several acceleration techniques, such as the telescoping technique, the method of false discrete ordinate [42], the correlated k-distribution method [43], and principal component analysis [44-46]. The atmospheric parameters to be retrieved are the cloud optical thickness τ and the cloud top height H, while the forward model parameters are the solar and viewing zenith angles and the surface albedo. The O2 absorption cross-sections are computed using LBL calculations [47] with optimized rational approximations for the Voigt line profile [48]. The wavenumber grid point spacing is chosen as a fraction (e.g., 1/4) of the minimum half-width of the Voigt lines taken from the HITRAN database [49]. The Rayleigh cross-section and depolarization ratios are computed as in [50], while the pressure and temperature profiles correspond to the US standard model atmosphere [51].
The radiances are solar-flux normalized and are computed by means of the delta-M approximation in conjunction with the TMS correction. We generate N = 20,000 samples by employing the smart sampling technique [52]. Of this data set, 18,000 samples were used for training and the other 2000 for prediction. The noisy spectra are generated by using the measurement noise $\delta_{\mathrm{mes}} \sim N(0, \mathrm{diag}[\sigma_{\mathrm{mes},j}^2]_{j=1}^4)$, where for the jth channel, we use $\sigma_{\mathrm{mes},j} = 0.1\, \bar{I}_j$, with $\bar{I}_j$ being the average of the simulated radiance over the N samples.
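The noise model above, with the noise level of each channel set to 10% of its average simulated radiance, can be sketched as follows; the toy radiances replace the actual simulated spectra:

```python
import numpy as np

rng = np.random.default_rng(5)
N, channels = 2000, 4
# Stand-in for the simulated solar-flux-normalized radiances
radiances = np.abs(rng.normal(0.5, 0.1, size=(N, channels)))

# Channel-wise noise level: 10% of the mean simulated radiance per channel
sigma_mes = 0.1 * radiances.mean(axis=0)
noisy = radiances + rng.normal(0.0, sigma_mes, size=(N, channels))
```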
The neural network algorithms are implemented in FORTRAN by using a feed-forward multilayer perceptron architecture. The tool contains a variety of optimization algorithms, activation functions, regularization terms, dynamic learning rates, and stopping rules. The optimization algorithm of Ref. [53] is used, a mini-batch of 100 samples is chosen, and a ReLU activation function is considered.

Neural Networks for Solving the Direct Problem
For a direct problem, the input x is the set of atmospheric parameters, the output y is the set of data, and the forward model F reproduces the radiative transfer model R.
We consider a neural network trained with exact data. For the predictive distribution p(y|x, D) given by Equation (32), we assume that the epistemic covariance matrix $C_y^e(x, \widehat{\omega})$ (= $C_y(x, \widehat{\omega})$) is computed from the statistics of $\varepsilon = y - f(x, \widehat{\omega})$, that is, $C_y^e \approx E(C_y|D)$, where $f(x, \widehat{\omega})$ is the network output. Furthermore, let $y^\delta = y + \delta_y$ with $\delta_y \sim N(0, C_{\delta y})$ be the noisy data vector. Using the result
$$p(y^\delta|y) = N(y^\delta; y, C_{\delta y}),$$
we compute the predictive distribution for the noisy data $p(y^\delta|x, D)$ by marginalization, that is,
$$p(y^\delta|x, D) = \int p(y^\delta|y)\, p(y|x, D)\, dy = N(y^\delta; f(x, \widehat{\omega}), C_y(\widehat{\omega})),$$
where (cf. Equation (33)) $C_y(\widehat{\omega}) = C_y^e(\widehat{\omega}) + C_{\delta y}$. In the next step, we solve the inverse problem $y^\delta = f(x, \widehat{\omega})$ by means of a Bayesian approach [54,55]. In this case, the a posteriori density $p(x|y^\delta, D)$ is given by
$$p(x|y^\delta, D) \propto p(y^\delta|x, D)\, p(x),$$
whence, by assuming that the state vector x is a Gaussian random vector with mean $x_a$ and covariance matrix $C_x$, that is, $p(x) = N(x; x_a, C_x)$, we obtain
$$p(x|y^\delta, D) = C \exp[-V(x)],$$
where
$$V(x) = \frac{1}{2} \Big[ \Delta y^T C_y^{-1}(\widehat{\omega})\, \Delta y + (x - x_a)^T C_x^{-1} (x - x_a) \Big]$$
is the a posteriori potential, $\Delta y = y^\delta - f(x, \widehat{\omega})$, and C is a constant. The maximum a posteriori estimator, defined by
$$\widehat{x} = \arg\max_x p(x|y^\delta, D) = \arg\min_x V(x),$$
can be computed by any optimization method, as for example, the Gauss-Newton method. If the problem is almost linear, the a posteriori density $p(x|y^\delta, D)$ is Gaussian with mean $\widehat{x}$ and covariance matrix [54]
$$\widehat{C}_x = \Big[ K_x^T(\widehat{x}, \widehat{\omega})\, C_y^{-1}(\widehat{\omega})\, K_x(\widehat{x}, \widehat{\omega}) + C_x^{-1} \Big]^{-1},$$
where $K_x$ is the Jacobian of f with respect to x. It should be pointed out that we can also train the neural network for a data set with output noise $D = \{(x^{(n)}, y^{\delta(n)})\}_{n=1}^N$, in which case the covariance matrix $C_y(\widehat{\omega})$ can be directly computed from the statistics of $\varepsilon = y^\delta - f(x, \widehat{\omega})$.
Essentially, the method involves the following steps: 1.
Train a neural network with exact data for simulating the radiative transfer model; 2.
Compute the epistemic covariance matrix from the statistics of all network errors over the data set; 3.
Solve the inverse problem by a Bayesian approach under the assumption that the a priori knowledge and the measurement uncertainty are both Gaussian; 4.
Determine the uncertainty in the retrieval by assuming that the forward model is nearly linear.
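Steps 3 and 4 above can be sketched for a linear forward model $f(x, \widehat{\omega}) = Kx$, for which the Gauss-Newton iteration converges in one step and the MAP estimate and a posteriori covariance have closed forms; the matrix K, noise levels, and a priori statistics below are illustrative, not values from the EPIC retrieval:

```python
import numpy as np

def map_retrieval(y, K, C_y, x_a, C_x):
    """MAP estimate and a posteriori covariance for a linear model y = K x:
    minimize V(x) = 0.5 * [ (y - Kx)^T C_y^{-1} (y - Kx)
                            + (x - x_a)^T C_x^{-1} (x - x_a) ]."""
    Cy_inv = np.linalg.inv(C_y)
    Cx_inv = np.linalg.inv(C_x)
    C_hat = np.linalg.inv(K.T @ Cy_inv @ K + Cx_inv)   # retrieval covariance
    x_hat = x_a + C_hat @ (K.T @ Cy_inv @ (y - K @ x_a))
    return x_hat, C_hat

# Toy setup: 4 "radiances", 2 "atmospheric parameters"
K = np.array([[1.0, 0.5], [0.2, 1.0], [0.3, 0.3], [0.8, 0.1]])
x_true = np.array([2.0, 5.0])
C_y = 0.01 * np.eye(4)    # total data covariance (epistemic + noise)
C_x = 4.0 * np.eye(2)     # a priori covariance
x_hat, C_hat = map_retrieval(K @ x_true, K, C_y, x_a=np.zeros(2), C_x=C_x)
```

With noise-free data and a weak prior, the estimate recovers the true state up to a small prior-induced shrinkage, and the diagonal of `C_hat` gives the retrieval variances used for the 3σ intervals.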
In Figure 1, we plot the histograms of the relative error over the prediction set,
$$\varepsilon_x = \frac{x_{\mathrm{pred}} - x}{x},$$
where x stands for τ and H, and $x_{\mathrm{pred}}$ and x are the predicted and true values, respectively. Also shown are the plots of $x_{\mathrm{pred}}$ and $x_{\mathrm{pred}} \pm 3\sigma_x$ versus x. The results demonstrate that the cloud optical thickness is retrieved with better accuracy than the cloud top height, and that for the cloud top height, the predicted uncertainties are especially unrealistic, because the condition
$$x_{\mathrm{pred}} - 3\sigma_x \le x \le x_{\mathrm{pred}} + 3\sigma_x$$
is not satisfied. The reason for this failure might be that the forward model is not nearly linear.

Neural Networks for Solving the Inverse Problem
For an inverse problem, the input x includes the sets of data and forward model parameters (solar and viewing zenith angles, and surface albedo), while the output y includes the set of atmospheric parameters to be retrieved (cloud optical thickness and cloud top height).

Method 1
Let $f(x, \widehat{\omega})$ be the output of a neural network trained with exact data, and assume that the predictive distribution p(y|x, D) given by Equation (32), and in particular the epistemic covariance matrix $C_y^e(x, \widehat{\omega})$ (= $C_y(x, \widehat{\omega})$), are available. For the noisy input $z = x + \delta_x$ with $p(z|x) = N(z; x, C_{\delta x} = \sigma_x^2 I)$ and under the assumption that the prior p(x) is a slowly varying function as compared to p(z|x), the predictive distribution of the network output can be approximated by [56]
$$p(y|z, D) = \int p(y|x, D)\, p(x|z)\, dx \approx \int p(y|x, D)\, p(z - x)\, dx. \qquad (64)$$
Thus, if p(x) varies much more slowly than p(z|x) = p(z − x), we assume that p(y|z, D) is the convolution of the predictive distribution for an uncorrupted input p(y|x, D) with the input noise distribution p(z − x). In the noise-free case, that is, if $\sigma_x \to 0$, we have $\lim_{\sigma_x \to 0} p(z|x) = \delta(z - x)$, yielding p(y|z, D) = p(y|x, D). Using Equation (64), we obtain
$$E(y|z) = \int y\, p(y|z, D)\, dy \approx \int f(x, \widehat{\omega})\, p(z - x)\, dx \qquad (65)$$
and
$$E(yy^T|z) = \int yy^T p(y|z, D)\, dy \approx \int \big[ C_y(x, \widehat{\omega}) + f(x, \widehat{\omega})\, f^T(x, \widehat{\omega}) \big]\, p(z - x)\, dx. \qquad (66)$$
Equations (65) and (66) show that in general $E(y|z) \neq f(z, \widehat{\omega})$; a noise process blurs or smooths the original mapping.
To compute the predictive mean E(y|z) and the covariance matrix Cov(y|z), the integrals in Equations (65) and (66) must be calculated. This can be done by Monte Carlo integration (by sampling the Gaussian distribution p(z|x)) or by a quadrature method. In the latter case, we consider a uniform grid $\{x_i\}_{i=1}^{N_{\mathrm{int}}}$ around z and use the computational formulas
$$E(y|z) \approx \sum_{i=1}^{N_{\mathrm{int}}} v_i\, f(x_i, \widehat{\omega}), \quad v_i = \frac{N(x_i; z, \sigma_x^2 I)}{\sum_{j=1}^{N_{\mathrm{int}}} N(x_j; z, \sigma_x^2 I)}, \qquad (67)$$
and
$$E(yy^T|z) \approx \sum_{i=1}^{N_{\mathrm{int}}} v_i\, \big[ C_y(x_i, \widehat{\omega}) + f(x_i, \widehat{\omega})\, f^T(x_i, \widehat{\omega}) \big]. \qquad (68)$$
The neural network trained with exact data can be deterministic or stochastic, as for example, Bayes-by-backprop, dropout, and batch normalization. In this regard, for each noisy input z, we consider a uniform grid $\{x_i\}_{i=1}^{N_{\mathrm{int}}}$ around z, calculate the weights $v_i$ by means of Equation (67), and for each $x_i$, compute $C_y^e(x_i, \widehat{\omega})$ as follows: 1.
For a deterministic network, we approximate $C_y^e(x_i, \widehat{\omega})$ by the conditional average covariance matrix $E(C_y^e|D)$ of all network errors over the data set, that is, $C_y^e \approx E(C_y^e|D)$; 2.
For Bayes-by-backprop and dropout networks, we compute $C_y^e(x_i, \widehat{\omega})$ by means of Equation (44) with E(y) as in Equation (42); 3.
For a batch normalized network, we compute $C_y^e(x_i, \widehat{\omega})$ in an analogous manner. Note that for a Bayes-by-backprop network, the output is $f(x_i, \widehat{\omega} = \mu_\omega)$, for a dropout network, the output $f(x_i, \widehat{\omega})$ is computed without dropout, and for a batch normalized network, the output is $f_{\overline{\omega}}(x_i, \widehat{\theta})$.
In summary, the method uses: 1.
Deterministic and stochastic networks trained with exact data to compute the epistemic covariance matrix; and 2.
The assumption that the predictive distribution of the network output is the convolution of the predictive distribution for an uncorrupted input with the input noise distribution to estimate the covariance matrix.
Under the assumption that the noise process is Gaussian, the convolution integrals are computed by a quadrature method with a uniform grid around the noisy input.
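In the scalar case, the quadrature computation of the convolution integral reduces to a weighted sum over a uniform grid around the noisy input; a minimal sketch with an illustrative (linear) stand-in for the network output:

```python
import numpy as np

def blurred_prediction(f, z, sigma_x, n_grid=41, half_width=3.0):
    """Approximate E(y|z) = integral of f(x) p(z - x) dx on a uniform grid
    around the noisy input z (scalar case), with normalized Gaussian
    quadrature weights v_i as in Eq. (67)."""
    x = z + np.linspace(-half_width * sigma_x, half_width * sigma_x, n_grid)
    w = np.exp(-0.5 * ((x - z) / sigma_x) ** 2)   # unnormalized Gaussian
    v = w / w.sum()                               # weights sum to 1
    return np.sum(v * f(x)), v

f = lambda x: 2.0 * x + 1.0   # toy (linear) network output
mean_y, v = blurred_prediction(f, z=1.5, sigma_x=0.1)
```

For a linear f and a symmetric grid, the blurring has no effect and E(y|z) equals f(z); for a nonlinear f, the weighted sum smooths the mapping, as stated after Equation (66).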
The results for Method 1 with deterministic and stochastic networks are illustrated in

Method 2
In order to model both heteroscedastic and epistemic uncertainties, we used the approach described in Ref. [57]. More precisely, we considered the data model with input and output noise, and used dropout to learn the heteroscedastic covariance matrix C_δy(z, ω) = diag[σ²_j(z, ω)]_{j=1}^{N_y} from the data (see Section 2.1). The network has the output [µ_j(z, ω), σ²_j(z, ω)] ∈ R^{2N_y} and is trained to predict the log variance ρ_j = log σ²_j, in which case the likelihood loss function is given by Equations (26) and (27). Considering the set of samples {ω_t}_{t=1}^{T}, where ω_t corresponds to a realization of the Bernoulli distribution Z_{l−1} for all l = 1, . . . , L, we compute the predictive mean and covariance matrix for each noisy input z as in [57]. The first term in Equation (71) reproduces the heteroscedastic uncertainty, while the second and third terms reproduce the epistemic uncertainty. Instead of a dropout network, a Bayes-by-backprop network can also be used to learn the heteroscedastic covariance matrix from the data; in this case, F(θ, D) is given by Equation (47) with the likelihood loss of Equations (26) and (27). The results for Method 2 with a dropout and a Bayes-by-backprop network are illustrated in Figures 6 and 7.
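The Monte Carlo estimate of the predictive moments from T stochastic forward passes can be sketched as follows. The callable `forward` is a hypothetical dropout network returning (µ, ρ) with ρ = log σ² for one Bernoulli-mask realization; the decomposition into an aleatoric (mean heteroscedastic covariance) and an epistemic term mirrors the structure described above, under the simplifying assumption of a diagonal heteroscedastic covariance.

```python
import numpy as np

def mc_dropout_moments(forward, z, T=100, rng=None):
    """Predictive mean and covariance from T stochastic forward passes.
    forward(z, rng) -> (mu, rho), one dropout realization, rho = log(sigma^2).
    Cov(y) = mean heteroscedastic covariance       (aleatoric term)
           + E[mu mu^T] - E[mu] E[mu]^T            (epistemic term)."""
    rng = rng if rng is not None else np.random.default_rng(0)
    mus, vars_ = [], []
    for _ in range(T):
        mu, rho = forward(z, rng)       # one Bernoulli-mask realization
        mus.append(mu)
        vars_.append(np.exp(rho))       # sigma^2 = exp(rho)
    mus = np.array(mus)
    mean = mus.mean(axis=0)
    aleatoric = np.diag(np.mean(vars_, axis=0))
    epistemic = (mus[..., None] * mus[:, None, :]).mean(axis=0) - np.outer(mean, mean)
    return mean, aleatoric + epistemic
```

For a deterministic toy `forward`, the epistemic term vanishes and only the learned heteroscedastic variance remains.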

Method 3
Let ω = {W_l, b_l}_{l=1}^L be the parameters of a dropout network trained with exact data. Further, assume that the input data are noisy, i.e., z = x + δ_x. In order to compute the heteroscedastic uncertainty, we forward propagate the input noise through the network. This is done by using assumed density filtering and interval arithmetic.

1.
Assumed density filtering (ADF). This approach was applied to neural networks by Gast and Roth [58] to replace each network activation by a probability distribution. In the following, we provide a simplified justification of this approach; for a more detailed analysis, we refer to Appendix D. For a linear layer (g_l(x) = x), the feed-forward operation (without dropout) is

y_{k,l} = ∑_{j=1}^{N_{l−1}} w_{kj,l} y_{j,l−1} + b_{k,l}. (77)

By straightforward calculation, we find E(y_{k,l}) = ∑_j w_{kj,l} µ_{j,l−1} + b_{k,l}, while E(y_{k,l} y_{k₁,l}) involves the cross terms w_{kj,l} w_{k₁j₁,l} µ_{j,l−1} µ_{j₁,l−1}. Assuming that the y_{k,l} are independent random variables, in which case the covariance matrix corresponding to the column vector [y_{1,l}, . . . , y_{N_l,l}]^T is diagonal, meaning that E(y_{k,l} y_{k₁,l}) − E(y_{k,l})E(y_{k₁,l}) = δ_{kk₁}[E(y²_{k,l}) − E²(y_{k,l})] = δ_{kk₁} v_{k,l}, we obtain v_{k,l} = ∑_{j=1}^{N_{l−1}} w²_{kj,l} v_{j,l−1}. In summary, the iterative process for a linear layer is

µ_{k,l} = ∑_{j=1}^{N_{l−1}} w_{kj,l} µ_{j,l−1} + b_{k,l}, (78)

v_{k,l} = ∑_{j=1}^{N_{l−1}} w²_{kj,l} v_{j,l−1}. (79)

For a ReLU activation function ReLU(x) = max(0, x), it was shown that the post-activation mean and variance are given by Equations (80) and (81), where µ_{k,l} and v_{k,l} are the pre-activation moments of Equations (78) and (79), respectively, and the expressions involve the standard normal density and cumulative distribution functions.

2.
Interval Arithmetic (IA). Interval arithmetic is based on an extension of the real number system to a system of closed intervals on the real axis [59]. For the intervals X and Y, the elementary arithmetic operations are defined by the rule X ⊕ Y = {x ⊕ y | x ∈ X, y ∈ Y}, where the binary operation ⊕ can stand for addition, subtraction, multiplication, or division. This definition guarantees that x ⊕ y ∈ X ⊕ Y. Functions of interval arguments are defined in terms of standard set mapping; that is, the image of an interval X under a function f is the set f(X) = {f(x) | x ∈ X}. This is not the same as an interval function obtained from a real function f by replacing the real argument by an interval argument and the real arithmetic operations by the corresponding interval operations. The latter is called an interval extension of the real function f and is denoted by F(X). As a corollary of the fundamental theorem of interval analysis, it can be shown that f(X) ⊆ F(X). Interval analysis provides a simple and accessible way to assess error propagation. The iterative process for error propagation (compare with Equations (77)-(79)) propagates an interval through each layer, while the output predictions µ_{k,L} and their standard deviations σ_{k,L} = √v_{k,L} are computed from the interval endpoints, where G_l(U) is the interval extension of the activation function g_l, and X̲ and X̄ are the left and right endpoints of the interval X = [X̲, X̄], respectively. By assumed density filtering or interval arithmetic, the forward pass of a neural network generates not only the output predictions µ_L = [µ_{1,L}, . . . , µ_{N_L,L}]^T but also their variances v_L = [v_{1,L}, . . . , v_{N_L,L}]^T. Following Ref. [2], we now consider a network with dropout; that is, in Equations (78), (79), and (83), we replace w_{kj,l} by w_{kj,l} z_{j,l−1}, where z_{j,l−1} ∼ Bernoulli(p) and the dropout probability p is the same as that used for training.
For the set of samples {ω_t}_{t=1}^T, where ω_t corresponds to a realization of the Bernoulli distribution Z^(t)_{l−1} for all l = 1, . . . , L, we denote by µ_L(x, ω_t) and v_L(x, ω_t) the outputs of the network for an input x corrupted by the noise δ_x ∼ N(0, C_δx = diag[σ²_{xk}]_{k=1}^{N_x}), and compute the predictive mean and covariance matrix as in Appendix B. From Equations (87) and (88), it is apparent that the prediction ensemble is generated not by the network output f(x, ω_t), but by the prediction µ_L(x, ω_t). In summary, the algorithm involves the following steps:

1. Transform a dropout network into its assumed density filtering or interval arithmetic version (which does not require retraining);
2. Propagate (x, v_0) through the dropout network and collect T output predictions and variances;
3. Compute the predictive mean and covariance matrix by means of Equations (87) and (88).
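The ADF transformation of step 1 amounts to replacing each layer by a map on the first two moments of its activations. A minimal NumPy sketch of this moment propagation through one linear layer followed by a ReLU is given below; the formulas are the standard ADF moment-matching expressions of Gast and Roth, which we believe correspond to Equations (78)-(81), and the function names are illustrative.

```python
import numpy as np
from math import erf

def _phi(x):
    """Standard normal pdf."""
    return np.exp(-0.5 * np.asarray(x) ** 2) / np.sqrt(2.0 * np.pi)

def _Phi(x):
    """Standard normal cdf."""
    return 0.5 * (1.0 + np.vectorize(erf)(np.asarray(x) / np.sqrt(2.0)))

def adf_linear(W, b, mu, v):
    """Moment propagation through a linear layer (cf. Equations (78)-(79)):
    mu_out = W mu + b, v_out = (W*W) v, under the diagonal-covariance
    (independent activations) assumption."""
    return W @ mu + b, (W ** 2) @ v

def adf_relu(mu, v, eps=1e-12):
    """Gaussian moment matching for y = ReLU(x), x ~ N(mu, v):
    with s = sqrt(v) and a = mu/s,
      E[y]   = mu*Phi(a) + s*phi(a)
      E[y^2] = (mu^2 + v)*Phi(a) + mu*s*phi(a)
    and Var[y] = E[y^2] - E[y]^2 (standard ADF formulas)."""
    s = np.sqrt(np.maximum(v, eps))
    a = mu / s
    m = mu * _Phi(a) + s * _phi(a)
    second = (mu ** 2 + v) * _Phi(a) + mu * s * _phi(a)
    return m, np.maximum(second - m ** 2, 0.0)
```

Applying `adf_linear` and `adf_relu` layer by layer, initialized with (x, v_0), yields the output predictions and variances collected in step 2.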
The results for Method 3 using assumed density filtering and interval arithmetic are illustrated in Figures 8 and 9, respectively. The main drawback of this method is that it requires knowledge of the exact input data x. Because this is not the case in atmospheric remote sensing, Method 3 is used only for comparison with the other two methods.
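For the interval arithmetic variant, interval bounds can be propagated through a linear layer and a monotone activation as sketched below. Reading the output prediction as the interval midpoint and the standard deviation as the half-width is our assumption for this sketch, not necessarily the paper's exact formula for µ_{k,L} and σ_{k,L}.

```python
import numpy as np

def ia_linear(W, b, lo, hi):
    """Interval propagation through a linear layer: split W into positive
    and negative parts so that each endpoint picks the correct bound."""
    Wp, Wn = np.maximum(W, 0.0), np.minimum(W, 0.0)
    return Wp @ lo + Wn @ hi + b, Wp @ hi + Wn @ lo + b

def ia_relu(lo, hi):
    """Interval extension of ReLU (monotone, so apply to the endpoints)."""
    return np.maximum(lo, 0.0), np.maximum(hi, 0.0)

def ia_predict(lo, hi):
    """Output prediction and standard deviation from the final interval:
    mu = midpoint, sigma = half-width (one plausible reading)."""
    return 0.5 * (lo + hi), 0.5 * (hi - lo)
```

By the fundamental theorem of interval analysis, the true network output is guaranteed to lie inside the propagated interval, so the half-width is a conservative uncertainty estimate.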

Summary of Numerical Analysis
In Table 1, we summarize the results of our numerical simulations by listing the average relative error with its spread over the prediction set, E(ε_x) ± √E([ε_x − E(ε_x)]²), and the average predicted standard deviation E(σ_x), where x stands for the cloud optical thickness τ and the cloud top height H. The accuracy of a method is reflected by the bias of the error E(ε_x) and the interval about the mean of length √E([ε_x − E(ε_x)]²), while the precision is reflected by the standard deviation E(σ_x) (which determines the length of the uncertainty interval). Note that (i) √E([ε_x − E(ε_x)]²) reproduces the square root of the diagonal elements of the conditional average covariance matrix E(C_y|D_test) of all network errors over the test set D_test; and (ii) roughly speaking, the epistemic (model) uncertainties are large if there are large variations around the mean.
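The summary statistics above can be computed from per-sample relative errors and predicted standard deviations as in the following sketch (array names are illustrative):

```python
import numpy as np

def summary_metrics(eps, sigma):
    """Summary statistics over the prediction set:
    bias   = E(eps)                      (accuracy: systematic error)
    spread = sqrt(E[(eps - E(eps))^2])   (accuracy: variation about the mean)
    E(sigma)                             (precision: mean predicted std dev)."""
    bias = eps.mean()
    spread = np.sqrt(np.mean((eps - bias) ** 2))
    return bias, spread, sigma.mean()
```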
Table 1. Average relative error E(ε_x) ± √E([ε_x − E(ε_x)]²) and standard deviation E(σ_x) over the prediction set for the methods used to solve the direct and inverse problems. The parameter x stands for the cloud optical thickness τ and the cloud top height H. In the case of the inverse problem, the results correspond to Method 1 with a deterministic network (1a), Bayes-by-backprop (1b), dropout (1c), and batch normalization (1d); Method 2 with dropout (2a) and Bayes-by-backprop (2b); and Method 3 with assumed density filtering (3a) and interval arithmetic (3b).

The results in Figures 1-9 and Table 1 can be summarized as follows:

1.
For the direct problem, the neural network method used in conjunction with a Bayesian inversion method provides satisfactory accuracy but does not correctly predict the uncertainty. The reason is that the forward model is not nearly linear, which is the main assumption underlying the computation of the uncertainty in the retrieval.

2.
For the inverse problem, the following features are apparent.
(a) Method 1 using deterministic and Bayes-by-backprop networks yields a low accuracy, while the method using dropout and batch-normalized networks provides a high accuracy;
(b) Method 2 using a dropout network has an acceptable accuracy. For cloud top height retrieval, the method using a Bayes-by-backprop network has a similar accuracy, but for cloud optical thickness retrieval, the accuracy is low. Possible reasons for the loss of accuracy of a Bayes-by-backprop network are (i) non-optimal training and/or (ii) the use of the prior p(ω) = N(ω; 0, I) instead of that given by Equation (45) (recall that for p(ω) = N(ω; 0, I), the KL divergence KL(q_θ(ω)|p(ω)) can be computed analytically);
(c) Method 3 with assumed density filtering and interval arithmetic is comparable to Method 2 using a dropout network; that is, the method has a high accuracy.
(d) Among the methods with reasonable accuracy, Method 2 using a dropout network predicts uncertainties similar to those of Method 3 with assumed density filtering and interval arithmetic. In contrast, Method 1 using dropout and batch-normalized networks provides lower uncertainties, dominated by epistemic uncertainties. In general, it seems that Method 1 predicts too small a heteroscedastic uncertainty. Possibly, this is because we trained the neural network with exact data, and for this retrieval example, the predictive distribution, represented as a convolution integral over the input noise distribution, does not correctly reproduce the aleatoric uncertainty.
In particular, we deduce that the conditional average covariance matrix E(C_y|D_test) does not coincide with the predictive covariance matrix Cov(y), which reflects the uncertainty.

Conclusions
We presented several neural networks for predicting uncertainty in atmospheric remote sensing. The neural networks are designed in a Bayesian framework and are devoted to the solution of direct and inverse problems.

1.
For solving the direct problem, we considered a neural network for simulating the radiative transfer model, computed the epistemic covariance matrix from the statistics of all network errors over the data set, solved the inverse problem by a Bayesian approach, and determined the uncertainty in the retrieval by assuming that the forward model is nearly linear.

2.
For solving the inverse problem, two neural network methods, relying on different assumptions, were implemented: (a) The first method uses deterministic and stochastic (Bayes-by-backprop, dropout, and batch normalization) networks to compute the epistemic covariance matrix and, under the assumption that the predictive distribution of the network output is the convolution of the predictive distribution for a noise-free input with the input noise distribution, estimates the covariance matrix; (b) The second method uses dropout and Bayes-by-backprop networks to learn the heteroscedastic covariance matrix from the data.
In addition, for solving the inverse problem, a third method that uses a dropout network and forward propagates the input noise through the network by using assumed density filtering and interval arithmetic was designed. Because this method requires the knowledge of the exact input data, it was used only for testing purposes.
Our numerical analysis has shown that a dropout network that is used to learn the heteroscedastic covariance matrix from the data is appropriate for predicting the uncertainty associated with the retrieval of cloud parameters from EPIC measurements. In fact, the strengths of a dropout network are (i) its capability to avoid overfitting and (ii) its stochastic character (the method is equivalent to a Bayesian approximation).
All neural network algorithms are implemented in FORTRAN and incorporated in a common tool. In the future, we intend to implement the algorithms in the high-level programming language Python and use the deep learning library PyTorch. This software library has a variety of network architectures that provide auto-differentiation and support GPUs to enable fast and efficient computation. The Python tool will be released through a public repository to make the methods available to the scientific community.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A
According to the Laplace approximation, we expand the loss function E(ω) around the point estimate ω = ω_MAP and use the optimality condition ∇E(ω_MAP) = 0 to obtain a quadratic approximation of the loss, and further (cf. Equation (13)) a Gaussian approximation of the posterior. Inserting Equations (12) and (A5) into the expression of the predictive distribution given by Equation (28), we find Equation (A6). In Equation (A6), the model function f(x, ω) can be approximated by a linear Taylor expansion around ω_MAP, that is,

f(x, ω) ≈ f(x, ω_MAP) + J(x, ω_MAP)(ω − ω_MAP), (A7)

where J(x, ω) = ∂f(x, ω)/∂ω is the Jacobian of f with respect to ω. Substituting Equation (A7) into Equation (A6), approximating C_δy(z, ω) ≈ C_δy(z, ω_MAP), and computing the integral over ω gives the representation (32) for the predictive distribution p(y|z, D).
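In compact form, the Laplace step can be summarized as follows (a hedged reconstruction with generic symbols; H denotes the Hessian of the loss at ω_MAP):

```latex
E(\omega) \approx E(\omega_{\mathrm{MAP}})
 + \tfrac{1}{2}\,(\omega-\omega_{\mathrm{MAP}})^{T}
   \mathbf{H}\,(\omega-\omega_{\mathrm{MAP}}),
\qquad
\mathbf{H} = \nabla\nabla E(\omega_{\mathrm{MAP}}),
```
```latex
p(\omega \mid \mathcal{D}) \propto e^{-E(\omega)}
 \approx \mathcal{N}\!\left(\omega;\, \omega_{\mathrm{MAP}},\, \mathbf{H}^{-1}\right).
```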

Appendix B
For p(y|x, ω) = N(y; f(x, ω), C_δy = σ²_y I), we compute the predictive mean of the output y given the input x as

E(y) = ∫ y p(y|x, D) dy,

and, by using the result

∫ yy^T N(y; µ, C) dy = C + µµ^T,

the covariance matrix as

Cov(y) = ∫ yy^T p(y|x, D) dy − E(y)E(y)^T.

The predictive mean and covariance matrix of a dropout network with assumed density filtering, given by Equations (87) and (88), respectively, can be computed in the same manner by taking into account the form of the predictive distribution of the network in this case.

Appendix C
In this appendix, which is borrowed from [37], we show that the variational free energy has the standard-form representation of the dropout loss function (as the sum of a square loss function and an L2 regularization term).
For ω = {W_l, b_l}_{l=1}^L and W_l = [w_{k,l}]_{k=1}^{N_{l−1}} ∈ R^{N_l×N_{l−1}}, we construct the variational distribution q_θ(ω) as

q_θ(ω) = ∏_{l=1}^L q(W_l) q(b_l), (A10)

with

q(W_l) = ∏_{k=1}^{N_{l−1}} q(w_{k,l}), q(w_{k,l}) = p N(m_{k,l}, σ² I_{N_l}) + (1 − p) N(0, σ² I_{N_l}), (A11)

and

q(b_l) = N(n_l, σ² I_{N_l}). (A12)
Here, p ∈ [0, 1] is an activation probability, σ > 0 a scalar, and M_l = [m_{k,l}]_{k=1}^{N_{l−1}} and n_l are variational parameters to be determined; thus, θ = {M_l, n_l}_{l=1}^L. The key point of the derivation is the representation of q(w_{k,l}) as a mixture of two Gaussians with the same variance (cf. Equation (A11)). As the standard deviation σ tends towards 0, the Gaussians tend to Dirac delta distributions, showing that sampling from the mixture of the two Gaussians is equivalent to sampling from a Bernoulli distribution that returns either the value 0 with probability 1 − p or m_{k,l} with probability p, that is, w_{k,l} ≈ z_{k,l−1} m_{k,l} with z_{k,l−1} ∼ Bernoulli(p). Note that the binary variable z_{k,l−1} corresponds to the unit k in layer l − 1 being dropped out as an input to layer l. For b_l, we take into account that lim_{σ→0} N(n_l, σ² I_{N_l}) = δ(b_l − n_l); hence, in the limit σ → 0, b_l is approximately deterministic, and we have b_l ≈ n_l.
The variational parameters M_l and n_l are computed by minimizing the variational free energy of Equation (39). The two terms in that equation are computed as follows.
Consider first KL(q(W_l)|p(W_l)), where q(W_l) is as in Equations (A10) and (A11) and p(W_l) = ∏_{k=1}^{N_{l−1}} p(w_{k,l}) with p(w_{k,l}) = N(0, I_{N_l}). This term can be evaluated by using the following result: for K, N ∈ N, let p = (p_1, . . . , p_K) be a probability vector, q(x) = ∑_{k=1}^K p_k N(x; µ_k, Σ_k) with x ∈ R^N a mixture of Gaussians with K components, and p(x) = N(x; 0, I_N); then, for sufficiently large N, the KL divergence KL(q(x)|p(x)) admits a closed-form approximation. Consequently, for large numbers of hidden units N_l, l = 1, . . . , L, we find that KL(q(W_l)|p(W_l)) reduces to a term proportional to (p/2) ∑_k m_{k,l}^T m_{k,l} plus a constant C. In the case of b_l, the KL divergence KL(q(b_l)|p(b_l)), where (cf. Equation (A12)) q(b_l) = N(n_l, σ² I_{N_l}) and p(b_l) = N(0, I_{N_l}), involves two single Gaussians and can be computed analytically as

KL(q(b_l)|p(b_l)) = (1/2)[n_l^T n_l + N_l(σ² − log(σ²) − 1)] + C.
Collecting all the results, we find that the variational free energy F(θ, D) has the standard-form representation as the sum of a square loss function and an L2 regularization term.
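Under the stated approximations, the resulting standard form can be written as follows (a hedged reconstruction following the dropout derivation of [37]; σ_y denotes the output-noise standard deviation, ω̂_i a Bernoulli realization of the weights, and C absorbs all constants):

```latex
\mathcal{F}(\theta,\mathcal{D}) \approx
 \frac{1}{2\sigma_y^2}\sum_{i=1}^{N}
   \bigl\| \mathbf{y}_i - \mathbf{f}(\mathbf{x}_i,\hat{\omega}_i) \bigr\|^2
 + \frac{p}{2}\sum_{l=1}^{L}\|\mathbf{M}_l\|_F^2
 + \frac{1}{2}\sum_{l=1}^{L}\|\mathbf{n}_l\|^2 + C.
```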

Appendix D
In this appendix, we describe the uncertainty propagation based on assumed density filtering by following the analysis given in [58].
The feed-forward operation of a neural network can be written as (cf. Equations (2) and (3))

y_l = f_l(y_{l−1}; ω_l) = g_l(W_l y_{l−1} + b_l),

where ω_l = {W_l, b_l}. Thus, each layer function f_l(y_{l−1}; ω_l) is a nonlinear transformation of the previous activation y_{l−1} parametrized by ω_l. The deep neural network can then be expressed as a succession of nonlinear layers:

f(x, ω) = f_L(f_{L−1}(. . . f_1(y_0; ω_1) . . . ; ω_{L−1}); ω_L).
To formalize the deep probabilistic model, we replace each activation, including input and output, by a probability distribution. In particular, we assume that the joint density of all activations is given by

p(y_{0:L}) = p(y_0) ∏_{l=1}^L p(y_l|y_{l−1}), p(y_l|y_{l−1}) = δ(y_l − f_l(y_{l−1}; ω_l)), p(y_0) = ∏_{k=1}^{N_0} N(y_{k,0}; x_k, σ²_{xk}),

where p(y_{0:L}) = p(y_0, . . . , y_L) and C_δx = diag[σ²_{xk}]_{k=1}^{N_x}. Because this distribution is intractable, we apply the assumed density filtering approach to the network activations. The goal of this approach is to find a tractable approximation q(y_{0:L}) of p(y_{0:L}), that is, p(y_{0:L}) ≈ q(y_{0:L}), where

q(y_{0:L}) = q(y_0) ∏_{l=1}^L q(y_l), q(y_l) = ∏_{k=1}^{N_l} N(y_{k,l}; µ_{k,l}, v_{k,l}), q(y_0) = p(y_0),

and µ_{k,l} and v_{k,l} are the mean and variance of the activation of unit k in layer l, respectively. Thus, starting from the input activation q(y_0) = p(y_0), we approximate subsequent layer activations by independent Gaussian distributions q(y_l). To compute the approximation q(y_{0:L}), we use an iterative (layer-by-layer) process initialized by q(y_0) = p(y_0). In particular, for a layer l ≥ 1, we assume that q(y_0), . . . , q(y_{l−1}) are known, or equivalently, that {(µ_{l₁}, v_{l₁})}_{l₁=0}^{l−1} are known, and aim to compute (µ_l, v_l), where µ_{k,l} = [µ_l]_k and v_{k,l} = [v_l]_k. For this purpose, we take into account that the layer function f_l transforms the activations into the distribution

p̂(y_{0:l}) = p(y_l|y_{l−1}) q(y_{0:l−1}),

where p(y_l|y_{l−1}) = δ(y_l − f_l(y_{l−1}; ω_l)) is the true posterior at layer l and q(y_{0:l−1}) = ∏_{l₁=0}^{l−1} q(y_{l₁}) is the previous approximating factor. Furthermore, we compute the first- and second-order moments of p̂(y_{0:l}). This is done in two steps. In the first step, we derive the moments of an activation variable y_k that belongs to all layers except the last one, i.e., y_k is an element of y_{0:l−1} = {y_0, . . . , y_{l−1}}; in the second step, we assume that y_k is an activation variable contained in the last layer y_l = {y_{1,l}, . . . , y_{N_l,l}}. The resulting moment relations are those of Equations (A31)-(A34).
From Equations (A31) and (A33), we see that, for all layers except the lth layer, the moments remain unchanged after the update. The moments for the lth layer are computed by means of Equations (A32) and (A34). For a linear activation function, i.e., f_{k,l}(y_{l−1}; ω_l) = ∑_{i=1}^{N_{l−1}} w_{ki,l} y_{i,l−1} + b_{k,l}, the expressions of µ_{k,l} and v_{k,l} are given by Equations (78) and (79), respectively, while for a ReLU activation function ReLU(x) = max(0, x), they are given by Equations (80) and (81), respectively.