Model Selection in Atmospheric Remote Sensing with an Application to Aerosol Retrieval from DSCOVR/EPIC, Part 1: Theory

Abstract: The retrieval of aerosol and cloud properties such as their optical thickness and/or layer/top height requires the selection of a model that describes their microphysical properties. We demonstrate that, if there is not enough information for an appropriate microphysical model selection, the solution’s accuracy can be improved if the model uncertainty is taken into account and appropriately quantified. For this purpose, we design a retrieval algorithm accounting for the uncertainty in model selection. The algorithm is based on (i) the computation of each model solution using the iteratively regularized Gauss–Newton method, (ii) the linearization of the forward model around the solution, and (iii) the maximum marginal likelihood estimation and the generalized cross-validation to estimate the optimal model. The algorithm is applied to the retrieval of aerosol optical thickness and aerosol layer height from synthetic measurements corresponding to the Earth Polychromatic Imaging Camera (EPIC) instrument onboard the Deep Space Climate Observatory (DSCOVR) satellite. Our numerical simulations show that the heuristic approach based on the solution minimizing the residual, which is frequently used in the literature, is completely unrealistic when both the aerosol model and surface albedo are unknown.


Introduction
The limited information provided by satellite measurements does not allow for a complete determination of aerosol and cloud properties and, in particular, their microphysical properties. To deal with this problem, a set of models describing the microphysical properties is typically used. For example, the aerosol models implemented in the Moderate Resolution Imaging Spectroradiometer (MODIS) aerosol algorithm over land [1] consist of three spherical, fine-dominated types and one spheroidal, coarse-dominated type, while typical cloud models consist of water and ice clouds with a predefined particle shape and size. In this regard, a priori information such as the selection of a microphysical model is an essential part of the retrieval process. Standard retrieval algorithms usually ignore model uncertainty, i.e., a model is chosen from a given set of candidate models, and the retrieval is performed as if the selected model represents the true state. This approach is very risky and may result in large errors in the retrieved parameters if the model does not reflect the real scenario. An efficient way to quantify the uncertainty in model selection is the Bayesian approach and, in particular, Bayesian model selection and model averaging [2]. By model selection we mean the specific problem of choosing the most suitable model from a given set of candidate models. In general, model selection is not a trivial task because for a given measurement, there will be several models that fit the data equally well. These tools were used, among others, in [3,4] to study uncertainty quantification in remote sensing of aerosols from Ozone Monitoring Instrument (OMI) data.
The key quantity in Bayesian model selection is relative evidence, which is a measure of how well the model fits the measurement. This is expressed in terms of the marginal likelihood, which is the integral over the state vector space of the a priori density times the likelihood density (see below). An accurate computation of the marginal likelihood is not a trivial task because some parameters characterizing the a priori and likelihood densities, e.g., the data error variance and the a priori state variance, are not precisely known and must be estimated. Moreover, the integral over the state vector space cannot be computed analytically because the likelihood density depends on the nonlinear forward model, for which an analytical representation is not available.
In this paper, we aim to eliminate the drawbacks of the Bayesian approach by using: 1. an iteratively regularized Gauss-Newton method for computing the solution of each model and estimating the ratio of the data error variance and the a priori state variance; 2. a linearization of the forward model around the solution for integrating the likelihood density over the state vector; 3. parameter choice methods from linear regularization theory for estimating the optimal model and the data error variance.
The paper is organized as follows. In Section 2, we derive a scaled data model with white noise in order to simplify the subsequent analysis. Section 3 is devoted to a review of the Bayesian parameter estimation and model selection in order to describe the problem and clarify the nomenclature. In Section 4, we summarize the iteratively regularized Gauss-Newton method and emphasize some special features of this method, including the estimation of the ratio of the data error variance and the a priori state variance. In Section 5, we assume a linearization of the forward model around the solution and, under this assumption, extend some regularization parameter choice methods for linear problems (the maximum marginal likelihood estimation and the generalized cross-validation) for data error variance estimation and model selection. Section 6 summarizes the computational steps of the proposed retrieval algorithm. In Section 7, we apply the algorithm to the retrieval of aerosol optical thickness and aerosol layer height from the Earth Polychromatic Imaging Camera (EPIC)/Deep Space Climate Observatory (DSCOVR) measurements.

Data Model
Consider N m microphysical models, and let F m (x) ∈ R M be the forward model corresponding to the model m for m = 1, . . . , N m . More specifically, in our analysis, F m (x) is the vector of the logarithms of the simulated radiances at all wavelengths λ i , i = 1, . . . , M, i.e., [F m (x)] i = ln I m (λ i , x), and x ∈ R N is the state vector encapsulating the atmospheric parameters to be retrieved, e.g., aerosol optical thickness and layer height.
The nonlinear data model under examination consists of the equations: y δ = y + δ mes (1) and: y = F m (x) + δ aerm , (2) where y δ ∈ R M is the measurement vector or the noisy data vector, y the exact data vector, δ mes ∈ R M the measurement error vector, and δ aerm ∈ R M the model error vector, i.e., the error due to an inadequate model. Defining the (total) data error vector by: δ m = δ mes + δ aerm , the data model (1) and (2) becomes: y δ = F m (x) + δ m . (3) In a deterministic setting, δ m is characterized by the noise level ∆ m (defined as an upper bound for ||δ m ||, i.e., ||δ m || ≤ ∆ m ), x is a deterministic vector, and we are faced with the solution of the nonlinear equation y δ = F m (x). In a stochastic setting, δ m and x are random vectors, and the data model (3) is solved by means of a Bayesian approach.
Using the prewhitening technique, we transform the data model (3) into a model with white noise. For this purpose, we take: 1. δ mes as a Gaussian random vector with zero mean and covariance matrix C mes = σ 2 mes C̄ mes , and 2. δ aerm as a Gaussian random vector with zero mean and covariance matrix C aerm = σ 2 aerm I M . If δ mes and δ aerm are independent random vectors, then δ m is a Gaussian random vector with zero mean and covariance matrix C δm = C mes + C aerm = σ 2 m C̄ δm , where: σ 2 m = σ 2 mes + σ 2 aerm is the variance of the data error vector and: w mesm = σ 2 mes /σ 2 m is the weighting factor giving the contribution of C̄ mes to the normalized covariance matrix C̄ δm = w mesm C̄ mes + (1 − w mesm )I M . After these preliminary constructions, we introduce the scaled data model: ȳ δ = F̄ m (x) + δ̄ m , where: ȳ δ = Py δ , F̄ m (x) = PF m (x), δ̄ m = Pδ m , and: P = C̄ δm −1/2 (5) is the scaling matrix. Taking into account that: PC̄ δm P T = I M , we see that by means of the prewhitening technique, the Gaussian random vector δ m ∼ N(0, C δm = σ 2 m C̄ δm ) is transformed into the white noise δ̄ m ∼ N(0, σ 2 m I M ). Here and in the following, the notation N(x mean , C x ) stands for a normal distribution with mean x mean and covariance matrix C x .
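The prewhitening step can be sketched numerically as follows; this is a minimal illustration under the diagonal-covariance assumptions above, with all names (e.g., `prewhiten`) chosen for this example rather than taken from the paper:

```python
import numpy as np

def prewhiten(y_noisy, C_mes, sigma2_aer):
    """Sketch of the prewhitening step: transform a data model with total
    covariance C_delta = C_mes + sigma2_aer * I into one with white noise.
    Returns the scaled data vector, the scaling matrix P, and sigma2_m."""
    M = len(y_noisy)
    C_delta = C_mes + sigma2_aer * np.eye(M)   # total data error covariance
    sigma2_m = np.trace(C_delta) / M           # data error variance (overall scale)
    C_bar = C_delta / sigma2_m                 # normalized covariance
    # scaling matrix P = C_bar^{-1/2} via a symmetric eigendecomposition
    w, V = np.linalg.eigh(C_bar)
    P = V @ np.diag(1.0 / np.sqrt(w)) @ V.T
    return P @ y_noisy, P, sigma2_m
```

Applying `P` on both sides of the data model then yields a noise vector with covariance proportional to the identity, which is what the subsequent Bayesian analysis assumes.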
The following comments can be taken into consideration: 1. In [3], the covariance matrix C aerm was estimated by empirically exploring a set of residuals of model fits to the measurements. Essentially, C aerm is assumed to be in the form: [C aerm ] ij = σ 2 0 δ ij + σ 2 1 exp(−d 2 /l 2 ), d = |λ i − λ j |, (6) where σ 2 0 , σ 2 1 , and l (representing the non-spectral diagonal variance, the spectral variance, and the correlation length, respectively) are computed by means of an empirical semivariogram, i.e., the variances of the residual differences are calculated for each wavelength pair with the distance d = |λ i − λ j |, and a theoretical Gaussian variogram model is fitted to these empirical semivariogram values. As compared to Equation (6), in our analysis, we use the diagonal matrix approximation C aerm ≈ σ 2 aerm I M with σ 2 aerm = σ 2 0 + σ 2 1 . 2. In principle, the scaling matrix P depends on the model error variance σ 2 aerm through the weighting factor w mesm . However, for [F m (x)] i = ln I m (λ i , x), we usually have σ 2 mesi ≈ σ 2 mes for all i = 1, . . . , M. In this case, it follows that C mes , together with C δm and P, are close to the identity matrix. This result does not mean that in our model, σ 2 aerm does not play any role; σ 2 aerm is included in σ 2 m , which is the subject of an estimation process.
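The correlated model-error covariance just described can be sketched as below; the Gaussian correlation form C = σ0² I + σ1² exp(−(d/l)²) is an assumption consistent with the Gaussian variogram fit, not a verbatim reproduction of the construction in [3]:

```python
import numpy as np

def aerosol_model_covariance(wavelengths, s0sq, s1sq, corr_len):
    """Illustrative model-error covariance: a non-spectral diagonal part
    s0sq plus a spectrally correlated part s1sq with a Gaussian correlation
    over the wavelength distance d = |lambda_i - lambda_j|."""
    lam = np.asarray(wavelengths, dtype=float)
    d = np.abs(lam[:, None] - lam[None, :])          # pairwise distances
    return s0sq * np.eye(len(lam)) + s1sq * np.exp(-(d / corr_len) ** 2)
```

Note that on the diagonal (d = 0) the entries are σ0² + σ1², which is exactly the variance used in the diagonal approximation C aerm ≈ (σ0² + σ1²) I M adopted in the text.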

Bayesian Approach
In this section, we review the basics of Bayesian parameter estimation and model selection by following the analysis given in [2,3].

Bayesian Parameter Estimation
In statistical inversion theory, the state vector x is assumed to be a random vector, and in this regard, we take x ∼ N(x a , C x ), where x a is the a priori state vector, the best beforehand estimate of the solution, and: C x = σ 2 x C̄ x , with σ 2 x the a priori state variance. Furthermore, the matrix L is defined through the Cholesky factorization: C̄ x −1 = L T L, and the parameter α is defined as: α = σ 2 m /σ 2 x , i.e., the ratio of the data error variance and the a priori state variance. The data model (3) gives a relation between the three random vectors y δ , x, and δ m , and therefore, their probability densities depend on each other. The following probability densities are relevant in Bayesian parameter estimation: (i) the a priori density p(x | m), which represents the conditional probability density of x given the model m before performing the measurement y δ , (ii) the likelihood density p(y δ | x, m), which represents the conditional probability density of y δ given the state x and the model m, and (iii) the a posteriori density p(x | y δ , m), which represents the conditional probability density of x given the data y δ and the model m.
The Bayes theorem of inverse problems relates the a posteriori density to the likelihood density: p(x | y δ , m) = p(y δ | x, m)p(x | m)/p(y δ | m). In Bayesian parameter estimation, the marginal likelihood density p(y δ | m), defined by: p(y δ | m) = ∫ p(y δ | x, m)p(x | m) dx, (8) plays the role of a normalization constant and is usually ignored. However, as we will see in the next section, this density is of particular importance in Bayesian model selection.
Under these Gaussian assumptions, the probability densities p(x | m) and p(y δ | x, m) are given by: p(x | m) ∝ exp(−(α/(2σ 2 m ))||L(x − x a )|| 2 ) (9) and: p(y δ | x, m) ∝ exp(−(1/(2σ 2 m ))||y δ − F m (x)|| 2 ), (10) respectively. The Bayes formula yields the following expression for the a posteriori density: p(x | y δ , m) ∝ exp(−V α (x | y δ , m)), (11) where the a posteriori potential V α (x | y δ , m) is defined by: V α (x | y δ , m) = (1/(2σ 2 m ))[||y δ − F m (x)|| 2 + α||L(x − x a )|| 2 ]. (12) The maximum a posteriori estimator x δ mα maximizing the conditional probability density p(x | y δ , m) also minimizes the potential V α (x | y δ , m), i.e., x δ mα = arg min x V α (x | y δ , m). (13)
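For a linear forward model, the minimizer of the a posteriori potential has a familiar closed form; the sketch below (illustrative names, linear case only) shows the estimator that the maximum a posteriori estimation generalizes to the nonlinear setting:

```python
import numpy as np

def map_estimate(K, y, x_a, L, alpha):
    """MAP estimate for a linear forward model y = K x + noise, i.e., the
    minimizer of ||y - K x||^2 + alpha ||L (x - x_a)||^2.  The nonlinear
    case is handled later by the iteratively regularized Gauss-Newton
    method."""
    A = K.T @ K + alpha * (L.T @ L)      # regularized normal matrix
    b = K.T @ (y - K @ x_a)              # right-hand side w.r.t. the a priori
    return x_a + np.linalg.solve(A, b)
```

For noiseless consistent data and vanishing α, the estimate reduces to the least-squares solution, while for large α it is pulled toward the a priori x_a.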

Bayesian Model Selection
The relative evidence, also known as the a posteriori probability of the model m given the measurement y δ , p(m | y δ ), is used for model comparison. By taking into consideration the Bayes theorem, this conditional probability density is defined by: p(m | y δ ) = p(y δ | m)p(m)/p(y δ ). The law of total probability: p(y δ ) = ∑ m p(y δ | m)p(m), and the assumption that all models are equally likely, i.e., p(m) = 1/N m , yield: p(m | y δ ) = p(y δ | m)/∑ m ′ p(y δ | m ′ ). (14) Intuitively, the model with the highest evidence is the best among all the models involved, and in this regard, we define the maximum solution estimate as: x δ max = x δ m̂α , m̂ = arg max m p(m | y δ ). (15) In fact, we can compare the models to see if one of them clearly shows the highest evidence, or if there are several models with comparable values of evidence. When several models can provide equally good fits to the measurement, we can use the Bayesian model averaging technique to combine the individual a posteriori densities weighted by their evidence. Specifically, using the relation: p(x, y δ ) = ∑ m p(x | y δ , m)p(m | y δ )p(y δ ), on the one hand, and p(x, y δ ) = p(x | y δ )p(y δ ), on the other hand, we are led to the Bayesian model averaging formula: p(x | y δ ) = ∑ m p(x | y δ , m)p(m | y δ ) (16) and, consequently, to the mean solution estimate: x δ mean = ∑ m p(m | y δ )x δ mα . (17) The above analysis shows that in Bayesian parameter estimation and model selection, we are faced with the following problems: Problem 1. From Equations (11)-(13), it is apparent that the computation of the estimator x δ mα requires the knowledge of the parameter α, i.e., the ratio of the data error variance and the a priori state variance. Problem 2. From Equations (15) and (17), we see that the solution estimates x δ max and x δ mean are expressed in terms of the relative evidence p(m | y δ ), which in turn, according to Equation (14), is expressed in terms of the marginal likelihood density p(y δ | m). In view of Equation (8), the computation of the marginal likelihood density p(y δ | m) requires the knowledge of the likelihood density p(y δ | x, m) and therefore, in view of Equation (10), of the data error variance σ 2 m . Problem 3.
The dependency of the likelihood density p(y δ | x, m) on the nonlinear forward model F m (x) does not allow an analytical integration in Equation (8).
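The evidence-based selection and averaging described above can be sketched as follows; log-space normalization is used for numerical stability, and all names are illustrative:

```python
import numpy as np

def model_evidence_and_average(log_marginals, solutions):
    """Bayesian model selection/averaging with equal priors p(m) = 1/N_m:
    evidences are normalized marginal likelihoods, the 'max' estimate picks
    the highest-evidence model, and the 'mean' estimate is the
    evidence-weighted average of the per-model solutions."""
    log_m = np.asarray(log_marginals, dtype=float)
    w = np.exp(log_m - log_m.max())   # subtract max before exponentiating
    evidence = w / w.sum()            # relative evidence p(m | y)
    X = np.asarray(solutions, dtype=float)
    x_max = X[np.argmax(evidence)]    # maximum solution estimate
    x_mean = evidence @ X             # mean (model-averaged) estimate
    return evidence, x_max, x_mean
```

When one model dominates, x_max and x_mean nearly coincide; when several models have comparable evidence, x_mean blends them, which is exactly the situation Bayesian model averaging is meant to handle.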

Iteratively Regularized Gauss-Newton Method
In the framework of Tikhonov regularization, a regularized solution x δ mα to the nonlinear equation y δ = F m (x) is computed as the minimizer of the Tikhonov function: F mα (x) = (1/2)[||y δ − F m (x)|| 2 + α||L(x − x a )|| 2 ], (18) that is, x δ mα = arg min x F mα (x). (19) In classical regularization theory, the first term on the right-hand side of Equation (18) is the squared residual; the second one is the penalty term; while α and L are the regularization parameter and the regularization matrix, respectively. From Equations (12) and (18), we see that V α (x | y δ , m) = (1/σ 2 m )F mα (x), and from Equations (13) and (19), we deduce that the maximum a posteriori estimator coincides with the Tikhonov solution. Thus, the Bayesian parameter estimation can be regarded as a stochastic version of the method of Tikhonov regularization with an a priori chosen regularization parameter α.
In the framework of Tikhonov regularization, the computation of the optimal regularization parameter is a crucial issue. With too little regularization, reconstructions deviate significantly from the a priori, and the solution is said to be under-regularized. With too much regularization, the reconstructions are too close to the a priori, and the solution is said to be over-regularized. In the Bayesian framework, the optimal regularization parameter is identified as the true ratio of the data error variance and the a priori state variance.
Several regularization parameter choice methods were discussed in [5]. These include methods with constant regularization parameters, e.g., the maximum likelihood estimation, the generalized cross-validation, and the nonlinear L-curve method. Unfortunately, at present, there is no fail-safe regularization parameter choice method that guarantees small solution errors in any circumstance, that is for any errors in the data vector.
The problems associated with the regularization parameter selection are alleviated in the framework of the so-called iterative regularization methods. These approaches are less sensitive to overestimates of the regularization parameter, but require more iteration steps to achieve convergence. A representative iterative approach is the iteratively regularized Gauss-Newton method.
At iteration step k of the iteratively regularized Gauss-Newton method, the forward model is linearized around the current iterate x δ mk , i.e., F m (x) ≈ F m (x δ mk ) + K mk (x − x δ mk ), where: K mk = (∂F m /∂x)(x δ mk ) is the Jacobian matrix at x δ mk , and the nonlinear equation F m (x) = y δ is replaced by its linearization: K mk p = y δ mk , (20) where: p = x − x a is the step vector with respect to the a priori and: y δ mk = y δ − F m (x δ mk ) + K mk (x δ mk − x a ) (21) is the linearized data vector at iteration step k.
Since the nonlinear problem is ill-posed, its linearization is also ill-posed. Therefore, the linearized Equation (20) is solved by means of Tikhonov regularization with the penalty term α k ||Lp|| 2 , where the regularization parameters α k are the terms of a decreasing (geometric) sequence, i.e., α k = qα k−1 with q < 1. Note that in the method of Tikhonov regularization, the regularization parameter α is kept constant during the iterative process. The Tikhonov function for the linearized equation takes the form: F mα k (p) = (1/2)[||y δ mk − K mk p|| 2 + α k ||Lp|| 2 ], and its minimizer is given by: p δ mk = K † mk y δ mk , where: K † mk = (K T mk K mk + α k L T L) −1 K T mk is the regularized generalized inverse at x δ mk . The new iterate is computed as: x δ mk+1 = x a + p δ mk . For the iterative regularization methods, the number of iteration steps k plays the role of the regularization parameter, and the iterative process is stopped after an appropriate number of steps k ∗ in order to avoid an uncontrolled expansion of errors in the data. In fact, a mere minimization of the residual ||r δ mk ||, where: r δ mk = y δ − F m (x δ mk ) is the residual vector at x δ mk , leads to a semi-convergent behavior of the iterated solution: while the error in the residual decreases as the number of iteration steps increases, the error in the solution starts to increase after an initial decay. A widely used a posteriori choice for the stopping index k ∗ is the discrepancy principle. According to this principle, the iterative process is terminated after k ∗ steps such that: ||r δ mk ∗ || 2 ≤ η∆ 2 m < ||r δ mk || 2 , 0 ≤ k < k ∗ , where η > 1 is a control parameter. Since the noise level cannot be estimated a priori for many practical problems arising in atmospheric remote sensing, we adopt a practical approach. This is based on the observation that the squared residual ||r δ mk || 2 decreases during the iterative process and attains a plateau at approximately ∆ 2 m . Thus, if the nonlinear residuals ||r δ mk || converge to ||r δ m∞ || within a prescribed tolerance, we use the estimate: ∆ 2 m ≈ ||r δ m∞ || 2 . The above heuristic stopping rule does not have any mathematical justification, but works sufficiently well in practice.
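A minimal sketch of the iteration just described is given below; the forward model F and Jacobian jac are user-supplied, and the step-length safeguard and the full discrepancy-principle bookkeeping of the complete algorithm are omitted:

```python
import numpy as np

def irgn(F, jac, y, x_a, L, alpha1=1.0, q=0.5, max_iter=30, eps_r=1e-3):
    """Sketch of the iteratively regularized Gauss-Newton method: at each
    step the linearized equation is solved by Tikhonov regularization with
    a geometrically decreasing alpha_k, and the iteration stops when the
    nonlinear residual reaches a plateau."""
    x, alpha = x_a.copy(), alpha1
    r_prev = np.inf
    for k in range(max_iter):
        K = jac(x)
        y_lin = y - F(x) + K @ (x - x_a)         # linearized data vector
        A = K.T @ K + alpha * (L.T @ L)
        p = np.linalg.solve(A, K.T @ y_lin)      # step w.r.t. the a priori
        x = x_a + p                              # new iterate
        r = np.linalg.norm(y - F(x)) ** 2        # squared nonlinear residual
        if (r_prev - r) / max(r_prev, 1e-300) < eps_r:   # residual plateau
            break
        r_prev = r
        alpha *= q                               # geometric decrease of alpha
    return x
```

For a well-posed linear test problem the iterate converges to the exact least-squares solution as the regularization is relaxed.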
Since the amount of regularization is gradually decreased during the iterative processes, the iteratively regularized Gauss-Newton method can handle problems that practically do not require much regularization.
The numerical experiments performed in [5] showed that at the solution x δ mk ∗ , (i) α k ∗ −1 is close to the optimal regularization parameter, and (ii) x δ mk ∗ is close to the Tikhonov solution corresponding to the optimal regularization parameter. In this regard, we make the following assumptions: 1. α̂ = α k ∗ −1 is an estimate for the optimal regularization parameter, and 2. x δ m α̂ = x δ mk ∗ is the minimizer of the Tikhonov function with regularization parameter α̂, F m α̂ (x).
Some consequences of the second assumption are the following. From the optimality condition for the Tikhonov function F m α̂ (x) at x δ m α̂ , we see that: p δ m α̂ = x δ m α̂ − x a is the minimizer of the Tikhonov function for the linearized equation: F m α̂ (p) = (1/2)[||y δ m α̂ − K m α̂ p|| 2 + α̂||Lp|| 2 ], where K m α̂ = K mk ∗ and y δ m α̂ is the linearized data vector (21) at the solution, and that the residual vector of the linear equation y δ m α̂ = K m α̂ p is equal to the nonlinear residual vector at the solution, i.e., y δ m α̂ − K m α̂ p δ m α̂ = y δ − F m (x δ m α̂ ). In summary, in the iteratively regularized Gauss-Newton method: 1. the computation of the regularized solution x δ m α̂ depends only on the initial value α 1 and the ratio q of the geometric sequence α k , which determine the rate of convergence, and 2. the regularization parameter at the solution is an estimate for the optimal regularization parameter, and so, for the ratio of the data error variance and the a priori state variance.
Taking these results into account, we may conclude that the iteratively regularized Gauss-Newton method gives a solution to Problem 1 of the Bayesian parameter estimation.

Parameter Estimation and Model Selection
In this section, we address Problems 2 and 3 related to Bayesian model selection. To solve Problem 3, we suppose that the forward model can be linearized around the solution x δ m α̂ [6]. In other words, in the first-order Taylor expansion: F m (x) = F m (x δ m α̂ ) + K m α̂ (x − x δ m α̂ ) + R m (x, x δ m α̂ ), the remainder term R m (x, x δ m α̂ ) can be neglected. As a result, the nonlinear data model y δ = F m (x) + δ m becomes the linear model: y δ m α̂ = K m α̂ p + δ m , (26) where y δ m α̂ is given by Equation (21) at the solution and p = x − x a . The direct consequences of dealing with the linear model (26) are the following results: 1. the a posteriori density p(x | y δ , m) can be expressed as a Gaussian distribution: p(x | y δ , m) = N(x δ m α̂ , Ĉ mx ), (27) where: Ĉ mx = σ 2 m (K T m α̂ K m α̂ + α̂L T L) −1 is the a posteriori covariance matrix, and 2. as shown in Appendix A, the marginal likelihood density p(y δ | m) can be computed analytically; the result is: p(y δ | m) = (2πσ 2 m ) −M/2 det(I M − A m α̂ ) 1/2 exp(−(1/(2σ 2 m ))y δT m α̂ (I M − A m α̂ )y δ m α̂ ), (28) where A m α̂ = K m α̂ (K T m α̂ K m α̂ + α̂L T L) −1 K T m α̂ is the influence matrix. The replacement of the nonlinear data model by a linear data model will also enable us to deal with Problem 2. More specifically, for this purpose, we will adapt some regularization parameter choice methods for linear problems, i.e., maximum marginal likelihood estimation [7][8][9] and generalized cross-validation [10,11], to model selection. Furthermore, estimates for the data error variance σ 2 m delivered by these methods will be used to compute the marginal likelihood density and, therefore, the relative evidence.
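Under the linearization, the a posteriori covariance matrix has the standard closed form for a linear Gaussian model; a small sketch with illustrative names:

```python
import numpy as np

def posterior_covariance(K, L, alpha, sigma2):
    """A posteriori covariance of the linearized Gaussian model:
    C = sigma^2 (K^T K + alpha L^T L)^{-1}, i.e., the inverse of the data
    precision plus the scaled a priori precision."""
    return sigma2 * np.linalg.inv(K.T @ K + alpha * (L.T @ L))
```

With more regularization (larger α) the posterior spread shrinks toward the a priori uncertainty, and with α → 0 it approaches the pure least-squares error covariance.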
In Appendix B, it is shown that: 1. in the framework of maximum marginal likelihood estimation, the data error variance can be estimated by: σ 2 Lm = y δT m α̂ (I M − A m α̂ )y δ m α̂ /M, (29) and the model with the smallest value of the marginal likelihood function, defined by: λ(m) = y δT m α̂ (I M − A m α̂ )y δ m α̂ /det(I M − A m α̂ ) 1/M , (30) is optimal; 2. in the framework of generalized cross-validation, the data error variance can be estimated by: σ 2 Gm = ||(I M − A m α̂ )y δ m α̂ || 2 /trace(I M − A m α̂ ), (31) and the model with the smallest value of the generalized cross-validation function, defined by: v(m) = ||(I M − A m α̂ )y δ m α̂ || 2 /[trace(I M − A m α̂ )] 2 , (32) is optimal.
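Both estimators can be computed directly from the influence matrix of the linearized problem; the sketch below follows the standard MMLE/GCV definitions, which is an assumption on the omitted equations:

```python
import numpy as np

def mle_gcv_selection(K, L, y, alpha):
    """Variance estimates and model-selection functions for the linearized
    problem y ~ K p with penalty alpha ||L p||^2.  Returns the MMLE and GCV
    variance estimates together with the selection functions lambda(m) and
    v(m); the model with the smallest function value is preferred."""
    M = len(y)
    # influence matrix A = K (K^T K + alpha L^T L)^{-1} K^T
    A = K @ np.linalg.solve(K.T @ K + alpha * (L.T @ L), K.T)
    R = np.eye(M) - A
    # maximum marginal likelihood estimation
    sigma2_L = (y @ R @ y) / M
    lam = (y @ R @ y) / np.linalg.det(R) ** (1.0 / M)
    # generalized cross-validation
    res2 = (R @ y) @ (R @ y)
    sigma2_G = res2 / np.trace(R)
    v = res2 / np.trace(R) ** 2
    return sigma2_L, lam, sigma2_G, v
```

Evaluating λ(m) and v(m) for each candidate model and picking the smallest value implements the two selection rules stated above.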
Denoting by p L (y δ | m) and p G (y δ | m) the marginal likelihoods corresponding to σ 2 Lm and σ 2 Gm , respectively, we define: 1. the relative evidence of an approach based on the marginal likelihood and the computation of the data error variance in the framework of the Maximum Marginal Likelihood Estimation (MLMMLE) by: p MLMMLE (m | y δ ) = p L (y δ | m)/∑ m ′ p L (y δ | m ′ ), (33) 2. the relative evidence of an approach based on the marginal likelihood and the computation of the data error variance in the framework of the Generalized Cross-Validation (MLGCV) by: p MLGCV (m | y δ ) = p G (y δ | m)/∑ m ′ p G (y δ | m ′ ). (34) Note that for the data error variance estimate (29), the marginal likelihood density becomes: p L (y δ | m) = [cλ(m)] −M/2 , (35) with: c = 2πe/M. (36) On the other hand, the statements (30) and (32) are equivalent to (cf. Equation (15)): m̂ L = arg max m p MLMMLE (m | y δ ) and: m̂ G = arg max m p MLGCV (m | y δ ), respectively. Deviating from a stochastic interpretation of the relative evidence and regarding this quantity in a deterministic setting merely as a merit function characterizing the solution x δ m α̂ , we define: 1. the relative evidence of an approach based on Maximum Marginal Likelihood Estimation (MMLE) by: p MMLE (m | y δ ) = λ(m) −1 /∑ m ′ λ(m ′ ) −1 , (37) 2. the relative evidence of an approach based on Generalized Cross-Validation (GCV) by: p GCV (m | y δ ) = v(m) −1 /∑ m ′ v(m ′ ) −1 . (38) Note that p MMLE (m | y δ ) and p GCV (m | y δ ) do not depend on the data error variance σ 2 m . Note also that σ 2 Gm is defined in terms of the squared residual: ||(I M − A m α̂ )y δ m α̂ || 2 , while, according to Equation (29), σ 2 Lm is defined in terms of the quantity y δT m α̂ (I M − A m α̂ )y δ m α̂ . The same difference exists between the numerators of the cross-validation function v(m) and the marginal likelihood function λ(m) (see Equations (32) and (30)).

Algorithm Description
An algorithmic implementation of the iteratively regularized Gauss-Newton method in connection with model selection is as follows:
1. compute the scaling matrix P by means of Equation (5) and the scaled data vector ȳ δ = Py δ ;
2. given the current iterate x δ mk at step k, compute the forward model F m (x δ mk ), the Jacobian matrix K mk , and the scaled quantities F̄ m (x δ mk ) = PF m (x δ mk ) and K̄ mk = PK mk ;
3. compute the linearized data vector ȳ δ mk by means of Equation (21);
4. compute the singular values γ mi of the quotient matrix K̄ mk L −1 , where γ m1 and γ mN are the largest and the smallest singular values, respectively;
5. if k = 1, set α 1 = γ 2 m1 ; otherwise, set α k = max(qα k−1 , α min );
6. compute the minimizer p δ mk of the Tikhonov function for the linearized equation, and the new iterate, x δ mk+1 = p δ mk + x a ;
7. compute the nonlinear residual vector r δ mk+1 at x δ mk+1 , and the residual r k+1 = ||r δ mk+1 || 2 ;
8. compute the condition number c k = γ m1 /γ mN , the scalar quantities entering the model selection functions, and the normalized covariance matrix Ĉ mx /σ 2 m = V̄(Γ 2 m + α k I N ) −1 V̄ T , where Γ m = diag(γ mi ) N and V̄ = L −1 V, with V the matrix of right singular vectors of K̄ mk L −1 ;
9. if r k+1 ≥ r k , recompute x δ mk+1 by means of a step-length algorithm such that r k+1 < r k ; if the residual cannot be reduced, set r ∞ = r k and go to Step 12;
10. compute the relative decrease in the residual ∆r = (r k − r k+1 )/r k ;
11. if ∆r > ε R , go to Step 2; otherwise set r ∞ = r k+1 , and go to Step 12;
12. determine k ∗ such that r k ∗ ≤ ηr ∞ < r k for all 0 ≤ k < k ∗ ;
13. compute the data error variances σ 2 Lm and σ 2 Gm , the corresponding marginal likelihoods p L (y δ | m) and p G (y δ | m), and the a posteriori covariance matrix Ĉ mx at the solution x δ mk ∗ ;
14. compute the relative evidence p X (m | y δ ), where X stands for the character strings MLMMLE, MLGCV, MMLE, and GCV, by using Equations (33), (34), (37), and (38), respectively;
15. compute the maximum and mean solution estimates x δ max and x δ mean by using Equations (15) and (17), respectively, and the mean a posteriori density p mean (x | y δ ) by using Equation (16).
The control parameters of the algorithm are: (i) the ratio q of the geometric sequence of regularization parameters, (ii) the minimum acceptable value of the regularization parameter α min , (iii) the tolerance ε R of the residual test convergence, and (iv) the tolerance η of the discrepancy principle stopping rule.
We conclude our theoretical analysis with some comments: 1. Another estimate for the data error variance is derived in Appendix C; this is given by: σ 2 Rm = ||r δ m α̂ || 2 /(M − N). If α̂ ≪ γ 2 mi for all i = 1, . . . , N, we can approximate (cf. Equation (A17)): trace(I M − A m α̂ ) ≈ M − N, and in view of Equation (31), we deduce that σ 2 Rm ≈ σ 2 Gm . 2. If a model is far from reality, it is natural to assume that the model parameter errors are large or, equivalently, that the data error variance is large. This observation suggests that, in a deterministic setting, we may define the relative evidence as: p Y (m | y δ ) = σ −2 Ym /∑ m ′ σ −2 Ym ′ , where the character variable Y takes the values R, L, or G. In this case, for σ 2 Rm = ||r δ m α̂ || 2 /(M − N), the model with the smallest value of the squared residual is considered to be optimal.

Application to the EPIC Instrument
The Deep Space Climate Observatory (DSCOVR) flies in a Lissajous orbit about the first Earth-Sun Lagrange point (L 1 ), which is 1.5 million kilometers from the Earth towards the Sun. The Earth Polychromatic Imaging Camera (EPIC) is one of the two Earth observing instruments on board DSCOVR. This instrument views the entire sunlit side of the Earth from sunrise to sunset in the backscattering direction (scattering angles between 168.5° and 175.5°) with 10 narrowband filters, ranging from 317.5 to 779.5 nm [12]. This unique Earth observing geometry of EPIC, compared to other instruments in Sun-synchronous orbits that rarely view the Earth at such large scattering angles [13], makes it a suitable candidate for several climate science applications, including the measurement of cloud reflectivity and of aerosol optical thickness and layer height [14].
We apply the algorithm proposed in Section 6 to the retrieval of aerosol optical thickness and aerosol layer height by generating synthetic measurements corresponding to the EPIC instrument. Specifically, Channels 7 and 8 in the Oxygen B-band at 680 and 687.75 nm, respectively, and Channels 9 and 10 in the Oxygen A-band at 764 and 779.5 nm, respectively, are used in the retrieval. The linearized radiative transfer model is based on the discrete ordinates method with matrix exponential and uses the correlated k-distribution method in conjunction with the principal component analysis technique to speed up the computations [15,16].
We consider the aerosol models implemented in the Moderate Resolution Imaging Spectroradiometer (MODIS) aerosol retrieval algorithm over land [1]. There are three spherical, fine-dominated model types (non-absorbing, moderately absorbing, and absorbing) and one spheroidal, coarse-dominated model type (dust). These aerosol models, which depend on location and season, are the result of a cluster analysis of the climatology of almucantar retrievals from global AERONET (AErosol RObotic NETwork) measurements [17]. Each model consists of two log-normal modes (accumulation and coarse). A single log-normal mode is described by the number size distribution: dn(r)/d ln r = [N/(√(2π)σ)] exp[−(ln r − ln r mod ) 2 /(2σ 2 )], where r mod is the modal or median radius of the number size distribution, σ is the standard deviation, and N is the total number of particles per cross-section of the atmospheric column. Table 1 displays the log-normal size parameters and the complex refractive indices m for the four aerosol models, where, for each log-normal mode, r v = r mod exp(3σ 2 ) is the median radius of the volume size distribution, and V is the volume of particles per cross-section of the atmospheric column.

In our numerical analysis, the state vector is x = [τ, H] T , where τ is the aerosol optical thickness and H the aerosol layer height. The true aerosol optical thicknesses to be retrieved are τ t = 0.25, 0.50, 0.75, 1.0, 1.25, and 1.5, while for the true aerosol layer height, we take H t = 1.0, 1.5, 2.0, 2.5, and 3.0 km. The a priori values, which coincide with the initial guesses, are τ a = 2.0 and H a = 4 km. A Lambertian surface is assumed, and if not stated otherwise, the surface albedo is A = 0.06. We generate synthetic measurements by choosing the moderately absorbing aerosol model (Model 2) as the true or exact model. Specifically, we: 1. compute the radiances for the moderately absorbing aerosol model and the true values τ t and H t , I modabs (λ i , τ t , H t ); 2.
add the measurement noise δ Imes , where δ Imes ∼ N(0, C Imes ), C Imes = diag(σ 2 Imesi ) M , and σ Imesi = I modabs (λ i , τ t , H t )/SNR, with SNR the maximum signal-to-noise ratio according to the EPIC camera specifications; and 3. in view of the approximation ln(I + δ I ) ≈ ln I + δ I /I, take [y δ ] i = ln I modabs (λ i , τ t , H t ) + δ Imesi /I modabs (λ i , τ t , H t ), implying σ mesi = 1/SNR for all i = 1, . . . , M. The above scheme yields C δm = I M and P = I M , showing that the regularized solution does not depend on the scaling matrix. Coming to the regularization matrix, to have a better control on the amount of regularization that is applied to each component of the state vector, we introduce the weight w xi of component [x] i and take L diagonal with [L] ii = w xi . If w xi → ∞, the component [x] i is close to the a priori, while for w xi → 0, the component [x] i is practically unconstrained. In our simulations, we used w τ = w H = 1.0. What is left is the specification of the control parameters of the algorithm; these are q = 0.1, α min = 10 −6 γ mN , where γ mN is the smallest singular value of the quotient matrix K mk L −1 at k = 1 (the first iteration step), ε R = 10 −3 , and η = 1.05.
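The building blocks of this simulation setup, i.e., the single-mode log-normal size distribution and the SNR-based noise model, can be sketched as follows; the log-normal form, the volume-median relation r_v = r_mod exp(3σ²), and all function names are standard assumptions for illustration, not taken verbatim from the paper:

```python
import numpy as np

def lognormal_number_dist(ln_r, r_mod, sigma, N=1.0):
    """Single-mode log-normal number size distribution dN/dln r, with
    sigma the standard deviation of ln r and r_mod its median radius."""
    return (N / (np.sqrt(2.0 * np.pi) * sigma)
            * np.exp(-((ln_r - np.log(r_mod)) ** 2) / (2.0 * sigma ** 2)))

def volume_median_radius(r_mod, sigma):
    """The volume size distribution of a log-normal number distribution is
    again log-normal, with median radius r_v = r_mod * exp(3 sigma^2)."""
    return r_mod * np.exp(3.0 * sigma ** 2)

def noisy_log_radiance(I_true, snr, rng):
    """Synthetic measurement: Gaussian noise with standard deviation I/snr
    is added to the radiances; in log space the resulting noise standard
    deviation is approximately 1/snr."""
    return np.log(I_true + rng.normal(0.0, I_true / snr))
```

The last function illustrates why the log-radiance data model has a nearly constant measurement error variance, which in turn makes the scaling matrix close to the identity.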
To analyze the accuracy of the aerosol retrieval we consider several test examples.
Test Example 1

In the first example, we analyze the efficiency of the aerosol model selection by considering all models, i.e., we take m = 1, 2, 3, 4. Thus, the algorithm has the chance to identify the correct aerosol model. The solution accuracy is characterized by the relative errors: ε τ mean = |τ mean − τ t |/τ t and ε H mean = |H mean − H t |/H t , corresponding to (cf. Equation (17)) x δ mean = [τ mean , H mean ] T , and: ε τ max = |τ max − τ t |/τ t and ε H max = |H max − H t |/H t , corresponding to (cf. Equation (15)) x δ max = [τ max , H max ] T . In Figures 1 and 2, we illustrate the variations of the relative errors with respect to τ t and H t , respectively, while in Table 2, we show the relative errors averaged over the N τ = 6 values of τ t and over the N H = 5 values of H t . The following conclusions can be drawn: 1. the relative errors ε τ mean and ε H mean corresponding to the generalized cross-validation (GCV and MLGCV) are in general smaller than those corresponding to the maximum marginal likelihood estimation (MMLE and MLMMLE); 2. the relative errors ε τ max and ε H max are very small, and so, we deduce that the algorithm recognizes the exact aerosol model; 3. the best method is GCV, in which case the average relative errors are ε τ mean|τ = 0.009, ε H mean|τ = 0.020, ε τ mean|H = 0.001, and ε H mean|H = 0.024; 4. the aerosol optical thickness is better retrieved than the aerosol layer height.

Test Example 2

In the second example, we consider all aerosol models m except the exact one, i.e., we take m = 1, 3, 4. In this more realistic scenario, the algorithm tries to find an aerosol model that is as close as possible to the exact one. The variations of the relative errors with respect to τ t and H t are illustrated in Figures 3 and 4, respectively, while the average relative errors are given in Table 3. The following conclusions can be drawn: 1. the relative errors ε τ mean and ε H mean corresponding to the generalized cross-validation (GCV and MLGCV) are still smaller than those corresponding to the maximum marginal likelihood estimation (MMLE and MLMMLE); 2.
the relative errors ε τ max and ε H max are extremely large, and so, we infer that the maximum solution estimate x δ max = [τ max , H max ] is completely unrealistic; 3. as before, the best method is GCV characterized by the average relative errors ε τ mean|τ = 0.060, ε H mean|τ = 0.111, ε τ mean|H = 0.022, and ε H mean|H = 0.096; 4. the relative errors are significantly larger than those corresponding to the case when all four aerosol models are taken into account.   . Relative errors ε τ mean , ε τ max , ε H mean , and ε H max versus H t for τ t = 1.0. All aerosol models except the exact one are involved in the retrieval.
In Figure 5, we show the mean a posteriori density of τ and H computed by GCV for Test Examples 1 and 2. In Test Example 1, the a posteriori density is sharply peaked, indicating small errors in the retrieval, while in Test Example 2, the a posteriori density is wide and spread over all the aerosol models.
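The average relative errors quoted above (e.g., ε_τ^mean|τ) are obtained by averaging the pointwise relative errors over the grid of true states. A minimal numpy sketch with purely illustrative numbers (not the paper's retrieval results):

```python
import numpy as np

# Grid of true aerosol optical thicknesses (N_tau = 6, as in the text) and
# hypothetical retrieved mean solution estimates; values are illustrative only.
tau_t = np.array([0.2, 0.4, 0.6, 0.8, 1.0, 1.2])
tau_mean = np.array([0.21, 0.39, 0.62, 0.79, 1.01, 1.23])

# Pointwise relative error of the mean solution estimate at each true state ...
eps_tau = np.abs(tau_mean - tau_t) / tau_t

# ... and its average over tau_t, i.e., the tabulated quantity eps_tau^mean|tau
eps_tau_avg = eps_tau.mean()
print(eps_tau_avg)
```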

Test Example 3
In the third series of our numerical experiments, we include the surface albedo in the retrieval. In fact, the surface albedo, regarded as a model parameter, can be (i) assumed to be known, (ii) included in the retrieval, or (iii) treated as a model uncertainty. The second situation, which leads to more accurate results, is considered here. In this case, however, the aerosol optical thickness and the surface albedo are strongly correlated (the condition number of K_mk L^{-1} at k = 1 is 10^3-10^4 times larger than for the case in which the surface albedo is not part of the retrieval). Since we are not interested in an accurate surface albedo retrieval, we choose the weight controlling the constraint of the surface albedo in the regularization matrix as w_A = 10^3. As w_τ = w_H = 1.0, this means that the surface albedo is tightly constrained to the a priori. We use the a priori and true values A_a = 0.06 and A_t = 0.063, respectively; thus, the uncertainty in the surface albedo with respect to the a priori is 5%. The synthetic measurements are generated with the true surface albedo A_t = 0.063.
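The effect of the weight w_A = 10^3 can be seen directly in the penalty term α‖L(x − x_a)‖^2 of the regularized objective: even a 5% deviation of the albedo from its a priori dominates the penalty. A small numpy sketch, assuming for illustration that L is the diagonal matrix of the weights (the τ and H values are hypothetical; the albedo values are those quoted above):

```python
import numpy as np

# State vector x = [tau, H, A]: aerosol optical thickness, layer height,
# surface albedo. Weights as quoted in the text: w_tau = w_H = 1.0, w_A = 1e3.
w_tau, w_H, w_A = 1.0, 1.0, 1.0e3
L = np.diag([w_tau, w_H, w_A])  # assumed diagonal weighting for illustration

x_a = np.array([1.0, 3.0, 0.06])   # a priori state (tau, H illustrative; A_a from text)
x   = np.array([1.2, 3.5, 0.063])  # candidate state with the 5% albedo deviation

# Contribution of each component to the penalty term ||L (x - x_a)||^2:
penalty = (L @ (x - x_a))**2
print(penalty)  # the albedo term dominates despite its tiny absolute deviation
```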
In Figure 6, we illustrate the variations of the relative errors ε_τ^mean and ε_H^mean with respect to τ_t and H_t, when all four aerosol models are taken into account. The average relative errors are given in Table 4. The results show that:
1. the relative errors ε_τ^mean and ε_H^mean corresponding to generalized cross-validation (GCV and MLGCV) are in general larger than those corresponding to maximum marginal likelihood estimation (MMLE and MLMMLE);
2. the best methods are MMLE and MLMMLE; the average relative errors given by MLMMLE are ε_τ^mean|τ = 0.051, ε_H^mean|τ = 0.060, ε_τ^mean|H = 0.011, and ε_H^mean|H = 0.027;
3. the relative errors are significantly larger than those obtained in the first two test examples (when the surface albedo is known exactly).

Figure 7 illustrates the variations of the relative errors ε_τ^mean and ε_H^mean with respect to τ_t and H_t, when all aerosol models except the exact one are considered. The corresponding average relative errors are given in Table 5. The results show that:
1. as in the previous scenario, the relative errors ε_τ^mean and ε_H^mean corresponding to generalized cross-validation (GCV and MLGCV) are in general larger than those corresponding to maximum marginal likelihood estimation (MMLE and MLMMLE).

Table 4. Average relative errors for the results plotted in Figure 6.

Table 5. Average relative errors for the results plotted in Figure 7.

Figure 6. Relative errors ε_τ^mean and ε_H^mean versus τ_t for H_t = 3.0 km (upper panels) and versus H_t for τ_t = 1.0 (lower panels). The surface albedo is included in the retrieval and all four aerosol models are considered.

Figure 7. Relative errors ε_τ^mean and ε_H^mean versus τ_t for H_t = 3.0 km (upper panels) and versus H_t for τ_t = 1.0 (lower panels). The surface albedo is included in the retrieval and all aerosol models except the exact one are considered.

Conclusions
We designed a retrieval algorithm that takes into account uncertainty in model selection.
The solution corresponding to a specific model is characterized by a metric called relative evidence, which is a measure of the fit between the model and the measurement. Based on this metric, the maximum solution estimate, corresponding to the model with the highest evidence, and the mean solution estimate, representing a linear combination of solutions weighted by their evidence, are introduced.
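The two estimates can be sketched in a few lines: given per-model solutions and their normalized relative evidences, the maximum solution estimate picks the solution of the highest-evidence model, while the mean solution estimate averages the solutions with the evidences as weights. All numbers below are illustrative, not retrieval results:

```python
import numpy as np

# Hypothetical per-model solutions x_m = [tau_m, H_m] for four candidate
# aerosol models, and their relative evidences (illustrative values).
solutions = np.array([[1.02, 3.1],   # model 1
                      [0.95, 2.8],   # model 2
                      [1.10, 3.4],   # model 3
                      [0.90, 2.6]])  # model 4
evidence = np.array([0.55, 0.25, 0.15, 0.05])
evidence = evidence / evidence.sum()  # normalize the relative evidences

# Maximum solution estimate: solution of the model with the highest evidence.
x_max = solutions[np.argmax(evidence)]

# Mean solution estimate: evidence-weighted linear combination of solutions.
x_mean = evidence @ solutions

print(x_max, x_mean)
```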
The retrieval algorithm is based on:
1. an application of the prewhitening technique in order to transform the data model into a scaled model with white noise;
2. a deterministic regularization method, i.e., the iteratively regularized Gauss-Newton method, in order to compute the regularized solution (equivalent to the maximum a posteriori estimate of the solution in a Bayesian framework) and to determine the optimal value of the regularization parameter (equivalent to the ratio of the data error and a priori state variances in a Bayesian framework);
3. a linearization of the forward model around the solution in order to transform the nonlinear data model into a linear model and, in turn, facilitate an analytical integration of the likelihood density over the state vector;
4. an extension of maximum marginal likelihood estimation and generalized cross-validation to model selection and data error variance estimation.
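Step 2 above can be illustrated with a minimal sketch of an iteratively regularized Gauss-Newton iteration for a generic nonlinear model; the functions `F` and `jacobian`, the toy problem, and the geometric decay α_k = α_0 q^k are placeholder assumptions for illustration, not the paper's implementation:

```python
import numpy as np

def irgn(F, jacobian, y_delta, x_a, L, alpha0=1.0, q=0.5, n_iter=20):
    """Iteratively regularized Gauss-Newton sketch: at each step, solve the
    regularized normal equations around the current iterate and shrink the
    regularization parameter geometrically (alpha_k = alpha0 * q**k)."""
    x = x_a.copy()
    alpha = alpha0
    for _ in range(n_iter):
        K = jacobian(x)
        r = y_delta - F(x)
        # (K^T K + alpha L^T L) dx = K^T r + alpha L^T L (x_a - x)
        A = K.T @ K + alpha * L.T @ L
        b = K.T @ r + alpha * (L.T @ L) @ (x_a - x)
        x = x + np.linalg.solve(A, b)
        alpha *= q
    return x

# Toy usage on a mildly nonlinear two-parameter problem (noise-free data):
F = lambda x: np.array([x[0] + 0.1 * x[1]**2, x[1]])
jacobian = lambda x: np.array([[1.0, 0.2 * x[1]], [0.0, 1.0]])
x_true = np.array([1.0, 3.0])
y = F(x_true)
x_hat = irgn(F, jacobian, y, x_a=np.array([0.5, 2.0]), L=np.eye(2))
print(x_hat)  # close to x_true
```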
Essentially, the algorithm includes four model selection methods, corresponding to:
1. the two parameter choice methods used (maximum marginal likelihood estimation and generalized cross-validation) and
2. the two settings in which the relative evidence is treated (stochastic and deterministic).
The algorithm is applied to the retrieval of aerosol optical thickness and aerosol layer height from synthetic measurements corresponding to the EPIC instrument. In the simulations, the aerosol models implemented in the MODIS aerosol algorithm over land are considered, and the surface albedo is either assumed to be known or included in the retrieval. The following conclusions are drawn:
1. The differences between the results corresponding to the stochastic and deterministic interpretations of the relative evidence are not significant.
2. If the surface albedo is assumed to be known, generalized cross-validation is superior to maximum marginal likelihood estimation; if the surface albedo is included in the retrieval, the contrary is true.
3. The errors in the aerosol optical thickness retrieval are smaller than those in the aerosol layer height retrieval. In the most realistic situation, when the exact aerosol model and surface albedo are unknown, the average relative errors in the retrieved aerosol optical thickness are about 10%, while the corresponding errors in the aerosol layer height are about 20%.
4. The maximum solution estimate is completely unrealistic when both the aerosol model and surface albedo are unknown.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A
In this Appendix, we compute the marginal likelihood for the linear data model (26) with p ~ N(0, C_x = σ_m^2 (α L^T L)^{-1}) and δ_m ~ N(0, C_{δm} = σ_m^2 I_M). In this case, the a priori and likelihood densities p(p | m) and p(y^δ | p, m) are given by

  p(p | m) = [α^{N/2} |det(L)| / (2π σ_m^2)^{N/2}] exp(-(α / (2 σ_m^2)) ||L p||^2)

and

  p(y^δ | p, m) = [1 / (2π σ_m^2)^{M/2}] exp(-(1 / (2 σ_m^2)) ||y^δ_{mα} - K_{mα} p||^2),

respectively. Using the identity

  ||y^δ_{mα} - K_{mα} p||^2 + α ||L p||^2 = (p - p̂^δ_{mα})^T (K_{mα}^T K_{mα} + α L^T L) (p - p̂^δ_{mα}) + y^{δT}_{mα} (I_M - Â_{mα}) y^δ_{mα},

where (cf. Equation (23)) p̂^δ_{mα} = K^†_{mα} y^δ_{mα} and Â_{mα} = K_{mα} K^†_{mα}, we express the integrand in Equation (8) as

  p(y^δ | p, m) p(p | m) = [α^{N/2} |det(L)| / (2π σ_m^2)^{(M+N)/2}] exp(-(1 / (2 σ_m^2)) [(p - p̂^δ_{mα})^T (K_{mα}^T K_{mα} + α L^T L) (p - p̂^δ_{mα}) + y^{δT}_{mα} (I_M - Â_{mα}) y^δ_{mα}]).

The normalization condition

  ∫ exp(-(1 / (2 σ_m^2)) (p - p̂^δ_{mα})^T (K_{mα}^T K_{mα} + α L^T L) (p - p̂^δ_{mα})) dp = (2π σ_m^2)^{N/2} det(I_N - A_{mα})^{1/2} / (α^{N/2} |det(L)|),

where A_{mα} = K^†_{mα} K_{mα} is the averaging kernel matrix, and the relation

  det(I_N - A_{mα}) = det(I_M - Â_{mα})

yield the following expression for the marginal likelihood density:

  p(y^δ_{mα} | m) = [1 / (2π σ_m^2)^{M/2}] det(I_M - Â_{mα})^{1/2} exp(-(1 / (2 σ_m^2)) y^{δT}_{mα} (I_M - Â_{mα}) y^δ_{mα}).
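For this linear Gaussian model, the marginal likelihood admits the closed form p(y^δ | m) = (2π σ_m^2)^{-M/2} det(I_M - Â_{mα})^{1/2} exp(-y^{δT}(I_M - Â_{mα})y^δ / (2σ_m^2)), with Â_{mα} = K_{mα}(K_{mα}^T K_{mα} + αL^T L)^{-1} K_{mα}^T. A numpy sketch checking this against the direct Gaussian density of y^δ ~ N(0, σ_m^2 (I_M + K_{mα}(αL^T L)^{-1} K_{mα}^T)), using a random K and y purely for verification:

```python
import numpy as np

rng = np.random.default_rng(0)
M, N = 6, 3
K = rng.standard_normal((M, N))  # random stand-in for the Jacobian K_malpha
L = np.eye(N)
alpha, sigma2 = 0.7, 0.25
y = rng.standard_normal(M)       # random stand-in for the data y^delta

# Influence matrix A_hat = K (K^T K + alpha L^T L)^{-1} K^T
Kdag = np.linalg.solve(K.T @ K + alpha * L.T @ L, K.T)
A_hat = K @ Kdag

# Closed-form marginal likelihood:
p_closed = (np.linalg.det(np.eye(M) - A_hat)**0.5
            / (2 * np.pi * sigma2)**(M / 2)
            * np.exp(-y @ (np.eye(M) - A_hat) @ y / (2 * sigma2)))

# Direct Gaussian density of y ~ N(0, C), C = sigma2 (I + K (alpha L^T L)^{-1} K^T):
C = sigma2 * (np.eye(M) + K @ np.linalg.inv(alpha * L.T @ L) @ K.T)
p_direct = (np.exp(-y @ np.linalg.solve(C, y) / 2)
            / np.sqrt((2 * np.pi)**M * np.linalg.det(C)))

print(np.isclose(p_closed, p_direct))  # True
```

The agreement follows from the Woodbury identity, which gives (I_M - Â_{mα}) = σ_m^2 C^{-1}.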

Appendix B
In this Appendix, we present a general tool for estimating the data error variance and the optimal model. Consider the linear model (26), i.e., y^δ_{mα} = K_{mα} p + δ_m, and a singular value decomposition of the quotient matrix K_{mα} L^{-1} = U Γ V^T, where Γ = diag(γ_{mi}), with the convention γ_{mi} = 0 for i > N, U = [u_1, ..., u_M], and V = [v_1, ..., v_N]. The covariance matrix of the data y^δ_{mα} can be computed as

  C_y = σ_m^2 (K_{mα} (α L^T L)^{-1} K_{mα}^T + I_M) = U Σ_y U^T,

where

  Σ_y = σ_m^2 diag(1 + γ_{mi}^2 / α).

Define the scaled data Y^δ_{mα} = U^T y^δ_{mα} and note that the covariance matrix of Y^δ_{mα} is the diagonal matrix Σ_y, i.e.,

  E{Y^{δ2}_{mαi}} = σ_m^2 (1 + γ_{mi}^2 / α),  i = 1, ..., M.  (A2)

If σ_m^2 is the correct data error variance, we must have (cf. Equation (A2)):

If σ_m^2 is unknown, we can find an estimate σ̂_m^2 from the equations:

However, since only one realization of the random vector Y^δ_{mα} is known, the calculation of σ̂_m^2 from Equation (A3) may lead to erroneous results. Therefore, we look for another selection criterion. Set:

and define the function:

with ψ(a) being a strictly concave function, i.e., ψ''(a) < 0. The expected value of f(·) is given by:

we express E{f(Y^δ_{mα} | m, σ̂_m^2)} as:

Then, we obtain:

Considering the second-order Taylor expansion around a_i(σ_m^2),

  ψ(a_i(σ̂_m^2)) = ψ(a_i(σ_m^2)) + ψ'(a_i(σ_m^2)) [a_i(σ̂_m^2) - a_i(σ_m^2)] + (1/2) ψ''(ξ_i) [a_i(σ̂_m^2) - a_i(σ_m^2)]^2

for some ξ_i between a_i(σ_m^2) and a_i(σ̂_m^2), and taking into account the fact that ψ is strictly concave, we deduce that each term in the sum (A6) is non-negative and vanishes only for a_i(σ_m^2) = a_i(σ̂_m^2). Thus, we have

  E{f(Y^δ_{mα} | m, σ̂_m^2)} ≥ E{f(Y^δ_{mα} | m, σ_m^2)}

for all σ̂_m^2. In conclusion, σ_m^2, defined through Equation (A5), is the unique global minimizer of E{f(Y^δ_{mα} | m, σ̂_m^2)}, i.e.,

  σ_m^2 = arg min over σ̂_m^2 of E{f(Y^δ_{mα} | m, σ̂_m^2)}.

Coming to the optimal model m, a natural choice seems to be:

However, instead of Equation (A8), we use a more general selection criterion, namely the optimal model m defined as:

where F is a monotonically increasing function of E{f(Y^δ_{mα} | m, σ_m^2)}. The generalized cross-validation [10,11] and the maximum marginal likelihood estimation [7-9] can be obtained by particular choices of the function ψ(a).
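The diagonalization underlying the scaled data, i.e., that the covariance of Y^δ_{mα} = U^T y^δ_{mα} is diag(σ_m^2 (1 + γ_{mi}^2/α)) for the SVD K_{mα} L^{-1} = U Γ V^T, can be verified numerically. A small numpy sketch with a random K and illustrative dimensions:

```python
import numpy as np

rng = np.random.default_rng(1)
M, N = 5, 3
K = rng.standard_normal((M, N))  # random stand-in for K_malpha
L = np.diag([1.0, 2.0, 0.5])     # illustrative regularization matrix
alpha, sigma2 = 0.5, 0.1

# Full SVD of the quotient matrix K L^{-1} = U Gamma V^T
U, gamma, Vt = np.linalg.svd(K @ np.linalg.inv(L))       # U is M x M
gamma_full = np.concatenate([gamma, np.zeros(M - N)])    # gamma_i = 0 for i > N

# Covariance of y = K p + delta with p ~ N(0, sigma2 (alpha L^T L)^{-1}):
C_y = sigma2 * (K @ np.linalg.inv(alpha * L.T @ L) @ K.T + np.eye(M))

# In the scaled data Y = U^T y, this covariance becomes diagonal:
Sigma_y = U.T @ C_y @ U
print(np.allclose(Sigma_y, np.diag(sigma2 * (1 + gamma_full**2 / alpha))))  # True
```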

Generalized cross-validation
For the choice:

we obtain:

Since σ_m^2 is the unique global minimizer of E{f(Y^δ_{mα} | m, σ̂_m^2)}, the condition:

yields:

The above equation, together with the relation:

provides the following estimate for the data error variance:

On the other hand, from Equations (A10) and (A12), we obtain:

and we define the function F(·) by:

Taking Equations (A13) and (A14) into account, we find that the function F(·) can be calculated as:

In practice, since only one realization of Y^δ_{mα} is available, the expectation E{Y^{δ2}_{mαi}} is approximated by the single realization Y^{δ2}_{mαi}. Finally, using the representations:

we express the estimate of the data error variance as:

From the minimization condition (A11), we find that σ̂_m^2 satisfies the equation: