Intercomparison of Remote Sensing Retrievals: An Examination of Prior-Induced Biases in Averaging Kernel Corrections

In remote sensing applications, Optimal Estimation (OE) retrievals are sometimes compared to independent OE retrievals of the same process. This intercomparison is often done in instrument validation, where retrievals are compared with data from a separate validation instrument, and it is sometimes done in data assimilation, where data from multiple instruments need to be adjusted to the “same footing.” In these cases, the two different retrievals are compared using an adjustment that is colloquially known as the averaging kernel correction. A general misconception in the existing literature is that this averaging kernel correction removes any bias introduced by prior misspecification by either (or both) of the two OE retrievals being compared. In this paper, we will analytically show that this is not the case and that the averaging kernel correction process implicitly “shifts” both OE retrievals to a common comparison prior. We will also show that there is generally a non-zero bias that is proportional to the difference between this comparison prior mean and the true (but unobserved) mean state, which has large implications for retrieval validation and data assimilation in remote sensing. Finally, to better characterize OE retrievals and retrieval intercomparisons, we will make some recommendations for mitigating this prior-induced bias in the intercomparison of OE retrievals.


Introduction
Remote sensing observations of emitted and reflected electromagnetic radiation provide indirect information about atmospheric quantities of interest. For instance, total-column CO2 may be inferred from hyperspectral observations in the strong CO2, weak CO2, and O2 bands [1,2] in the near-infrared portion of the spectrum. The retrieval process is then an inverse problem where one needs to calculate from a set of measurements (e.g., photon counts or radiances) the causal factors that gave rise to them (e.g., CO2 concentration).
The current state-of-the-art for solving atmospheric inverse problems is Optimal Estimation (OE) [3]. It assumes that the measurement-error characteristics of the remote sensing instrument are known and that there exists a functional relationship between the "hidden" atmospheric state of interest and the observed measurements. This function is often called the forward model, and it is commonly constructed from radiative transfer equations. Further, the OE algorithm requires specifying a (typically Gaussian) distribution for the natural variability of the hidden state, also called a prior, which consists of a prior mean vector and a prior covariance matrix.
Once all of these components are specified, the OE retrieval is then the maximum a posteriori (MAP) estimate of the state given the noisy observed radiances. Examples of OE retrievals in remote sensing include total-column carbon dioxide from NASA's Orbiting Carbon Observatory-2 (OCO-2; [4]), sea surface temperature from the Spinning Enhanced Visible and Infra-Red Imager (SEVIRI; [5]), total-column carbon dioxide and methane from the Greenhouse gases Observing SATellite (GOSAT; [6]), carbon dioxide, carbon monoxide, and methane from the Total Carbon Column Observing Network (TCCON; [7]), temperature and ozone from the Tropospheric Emission Spectrometer (TES; [8]), and aerosols from the Meteosat Second Generation Spinning Enhanced Visible and Infrared Imager (MSG/SEVIRI; [9]).
It is well understood that misspecifications of the forward model or the measurement-error characteristics can adversely impact the quality of OE retrievals (e.g., [10][11][12]). Reference [13] examined the impact of the misspecification of the prior mean vector and prior covariance matrix in depth and proved that using an incorrect prior mean vector (relative to the true mean vector of the natural variability) will result in biased retrievals. That is, a bias may arise in the retrieval solely as a function of the choice of prior. One simple corollary of this result is that when we have multiple OE retrievals (e.g., total-column CO2 from two different instruments: OCO-2 and GOSAT), an intercomparison of their OE retrievals would need to account for the relative bias introduced by their choices of priors.
To be more precise, assume that we have a state vector x generated by a multivariate Gaussian distribution with parameters {x_T, S_T}, where x_T is the prior mean vector and S_T is the prior covariance matrix. This state x is observed by two different instruments, producing the radiance vectors y_1 and y_2. Further, assume two different OE retrievals x̂_1 and x̂_2 with priors {x_a,1, S_a,1} and {x_a,2, S_a,2}, respectively, where x_a,1 and x_a,2 are the prior means and S_a,1 and S_a,2 are both proper prior covariance matrices. In this paper, we shall call {x_T, S_T} the true prior because it describes the distribution that generated the hidden state x, and we shall call the parameters used within the OE retrievals (i.e., {x_a,i, S_a,i}) the working priors. Even if both retrievals have specified the correct forward model and instrument measurement-error characteristics, if x_a,1 ≠ x_a,2, then it is straightforward to show that the expected value of x̂_1 is different from the expected value of x̂_2 (i.e., E(x̂_1) ≠ E(x̂_2)) [13]; or equivalently, E(x̂_1 − x̂_2) ≠ 0, which implies that there is a prior-induced relative bias for OE retrievals with different priors.
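This prior-induced relative bias is straightforward to reproduce in simulation. The following sketch uses a toy linear Gaussian setup of our own construction (the dimensions, Jacobians, priors, and noise levels are illustrative assumptions, not any operational configuration): states are drawn from the true prior, retrieved with two different working priors, and the Monte Carlo mean difference is compared against the corresponding analytic bias expression.

```python
import numpy as np

rng = np.random.default_rng(0)
r, n, N = 3, 8, 200_000                         # toy state/radiance dimensions, sample size

K1, K2 = rng.normal(size=(n, r)), rng.normal(size=(n, r))  # toy Jacobians
Se = 5.0 * np.eye(n)                            # measurement-error covariance (both instruments)
xT, ST = np.array([1.0, 2.0, 3.0]), np.eye(r)   # true prior {x_T, S_T}
xa1 = xT + np.array([0.3, -0.2, 0.1])           # misspecified working prior mean, retrieval 1
xa2 = xT + np.array([-0.1, 0.4, 0.2])           # misspecified working prior mean, retrieval 2
Sa = 2.0 * np.eye(r)                            # common working prior covariance

def gain(K, Se, Sa):
    """OE gain matrix G = (Sa^{-1} + K' Se^{-1} K)^{-1} K' Se^{-1}."""
    return np.linalg.solve(np.linalg.inv(Sa) + K.T @ np.linalg.inv(Se) @ K,
                           K.T @ np.linalg.inv(Se))

G1, G2 = gain(K1, Se, Sa), gain(K2, Se, Sa)
A1, A2 = G1 @ K1, G2 @ K2                       # averaging kernels

# Simulate x ~ Gau(xT, ST), y_i = K_i x + eps_i, and retrieve with the working priors.
x = rng.multivariate_normal(xT, ST, size=N)
y1 = x @ K1.T + rng.multivariate_normal(np.zeros(n), Se, size=N)
y2 = x @ K2.T + rng.multivariate_normal(np.zeros(n), Se, size=N)
xhat1 = xa1 + (y1 - xa1 @ K1.T) @ G1.T
xhat2 = xa2 + (y2 - xa2 @ K2.T) @ G2.T

mc_bias = (xhat1 - xhat2).mean(axis=0)          # Monte Carlo estimate of E(xhat1 - xhat2)
analytic = (A1 - np.eye(r)) @ (xT - xa1) - (A2 - np.eye(r)) @ (xT - xa2)
print(mc_bias, analytic)                        # both clearly non-zero, and they agree
```

Even though both retrievals see the same states and use the correct forward models and noise statistics, the mean difference of the retrievals is non-zero because the two working prior means differ.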
The intercomparison of remote sensing retrievals is often required in remote sensing for validation or in data assimilation. For instance, in instrument validation, it is common practice for OE retrievals of a remotely-sensed quantity to be compared against retrievals from an independent validation instrument (e.g., validation of OCO-2 against retrievals from the TCCON sites [14]; methane comparison of GOSAT to TCCON [15]; validation of the Measurements of Pollution in the Troposphere (MOPITT) CO retrievals with aircraft in situ profiles [16]; and comparison of GOSAT total-column CO 2 against aircraft measurements [17]). Similarly, in data assimilation, it is sometimes necessary to use retrievals of a common process from multiple instruments when constraining the process of interest (e.g., using chemical compound data from TES, MOPITT, and the Microwave Limb Sounder (MLS) instrument in a chemical data assimilation system [18]; and inverse modeling of CH4 emissions from GOSAT and the SCanning Imaging Absorption SpectroMeter for Atmospheric CHartographY (SCIAMACHY) instrument [19]).
Cognizant of the relative bias introduced by the choices of different priors, several works in the existing literature attempt to account for this bias. The definitive treatment of this topic is given in [20], which describes how to put two different OE retrievals on the same footing by adjusting one (or both) retrievals using a "common comparison ensemble," which is denoted by {x_C, S_C}. This process is also colloquially known as averaging kernel correction or averaging kernel convolution.
It is generally believed that the biasing effect of the working prior can be removed by convolution with the averaging kernel when comparing different optimal estimation retrievals. For instance, Reference [21] (p. 11141) notes that "Indeed, the use of averaging kernels makes atmospheric inversion insensitive to the choice of a particular retrieval prior CO2 profile," and they provided a citation to [22], which includes the following statement: "The smoothing of the averaging kernel introduces an error, which will be important in any comparison between measurements made at different resolutions... However, it is possible to eliminate this smoothing error by using the averaging kernel. In comparing a high-resolution measurement (or a model calculation) with a low-resolution measurement, one computes x_s = x_a + A(x_m − x_a), where x_m is the high-resolution measurement, and compares the result to x̂" ([22] Section 2.5).
In this paper, we will address the general misconception that the averaging kernel convolution specified by [20,22] removes any bias introduced by prior misspecification. It is important to note that our paper is not about how to compute the prior-induced bias in the intercomparison of retrievals in practice, because in real applications, the true prior {x_T, S_T} is never known precisely. Rather, this paper will discuss the existence of such a bias and how a researcher can reduce its magnitude with imperfect knowledge of {x_T, S_T}. This will enable better characterization of the potential causes of biases in OE validation studies, in addition to more accurate retrieval intercomparisons.
In Section 2, we provide a quick background of the OE equations, and we analytically show that in the intercomparison of remote sensing retrievals, even after averaging kernel correction, there is a relative bias that is proportional to the difference between the comparison prior mean and the true prior mean (i.e., x_C − x_T). The common misconception in the existing literature, then, can be seen as an implicit assumption that the comparison prior mean is the same as the true prior mean of the process. In Section 3, we provide a simulation of the relative bias using an OCO-2 surrogate forward model to illustrate some instances of this relative bias under different choices of averaging kernel convolutions. Finally, we end the paper with some discussion and recommendations for mitigating this prior-induced bias in practice.

Background and Requirements
Without loss of generality, we consider the intercomparison of two different OE retrievals of the same unknown state x. Inferences about this state may be made from the same radiance vector y obtained from a single instrument (but with two different OE retrieval priors), or they may arise from OE retrievals of two different radiance vectors (i.e., y_1 and y_2), where the same atmospheric state is viewed by two different instruments.
Similar to [20], we require that the state vector x be identically defined by both OE retrievals. That is, the state elements of x should have the same physical interpretation (e.g., temperature, water vapor, CO2 concentration, etc.), and they should be defined on the same vertical grid. In cases where one retrieval has a coarser vertical grid than the other, an interpolation to a common grid should be applied before averaging kernel correction.
In many applications, in addition to the common state vector x, there are state parameters that are present in one retrieval but not in the other. For instance, the full state vectors may be represented by x_1 = (x′, e_1′)′ and x_2 = (x′, e_2′)′, where x is the part of the state vector that is common to both retrievals, while e_i consists of state elements that are unique to the i-th retrieval. This complicates the intercomparison, because the residual (x̂ − x) has a term that is dependent on e_i (Reference [20], Section 2.1, called this "interference error"). We note that the term "interference error" has been used in slightly different contexts in the literature. For instance, Reference [10] used this term to describe the error due to state elements that are not "of interest" to a mission's goal but are nevertheless important in the forward model. In this paper, we will follow the definition given in [20]. In their derivations, Reference [20] assumed that this source of error is small or negligible. We will make the same assumption here, and we will only consider the intercomparison of the common state vector x.
Here, we will quickly review the OE formalism before examining averaging kernel correction. In the general case, let i ∈ {1, 2} be the instrument index, and assume that an N_i-dimensional radiance vector y_i is related to the r-dimensional (common) true state x by the following data model:

y_i = F_i(x) + ε_i,  (1)

where F_i(·) is the N_i-dimensional vector-valued forward model (the subscript on F_i(·) allows for the possibility of the state being observed on different sets of spectral channels across multiple instruments), x is the r-dimensional state vector, which in OE is assumed to be multivariate Gaussian with true mean x_T and true covariance matrix S_T, and ε_i is the N_i-dimensional Gaussian measurement-error vector with mean zero and covariance matrix S_ε,i. Here, ε_i captures the measurement-error characteristics of the i-th instrument, and we assume that ε_1 is independent of ε_2 and that both are independent of x. That is,

x ~ Gau_r(x_T, S_T) and ε_i ~ Gau_{N_i}(0, S_ε,i),  (2)

where Gau_n(µ, Σ) denotes an n-dimensional Gaussian (or normal) distribution with mean vector µ and covariance matrix Σ.
We now consider the leading case of a linear forward model, which can be thought of as the first-order term of the Taylor expansion of the non-linear model in Equation (1) around some known state vector (e.g., [23]). Equation (1) then becomes

y_i = K_i x + c_i + ε_i,  (3)

where the N_i × r matrix K_i = ∂F_i/∂x is the Jacobian of the i-th forward model and c_i is an N_i-dimensional constant vector. Without loss of generality, we can assume that c_i = 0 (since c_i is known and hence, in principle, can be subtracted from y_i). In this case, y_i is a vector of "centered" radiances, which simplifies the notation in the following sections without detracting from general applicability.

Working Prior vs. True Prior
Now, assume that we have two OE retrievals for y_1 and y_2 with priors {x_a,1, S_a,1} and {x_a,2, S_a,2}, respectively, where {x_a,1, S_a,1} is generally different from {x_a,2, S_a,2}, and both are typically misspecified relative to the (unknown) natural-variability parameters {x_T, S_T}. We note that in the strictly Bayesian tradition, the prior is taken as a starting point or as an opinion, and hence there is no preferred "true" prior. However, optimal estimation, as formalized by [3], recognizes that there is a true (but unobserved) natural distribution of x and that x_T and S_T are simply the first and second moments of this distribution (i.e., x_T = E(x) and S_T = var(x)). Reference [3] also recognized that the prior used in the OE algorithm can be misspecified relative to this true prior, noting that "if the a priori are inappropriate, [then] their errors are incorrect." He further acknowledged the difficulty of knowing the true prior {x_T, S_T}, recommending that, in choosing a working prior, practitioners make a "reasonable estimate of a probability density function consistent with all our knowledge, one that is least committal about the state but consistent with whatever more or less detailed understanding we may have of the state vector prior to the measurement(s)" ([3] Section 10.3.3.2).
Within the OE framework, then, there is clearly a preferred prior. This is evident in the existing literature, where the general consensus in prior design is to make the working prior as "accurate" as possible (e.g., [3,12,24]). For instance, Reference [24] noted that "[using] the most accurate prior will lead to the most accurate result." Reference [13] showed that using the true prior {x_T, S_T} in the OE algorithm will result in retrievals with the following properties: (1) unbiasedness, (2) smallest uncertainties (relative to x) among all choices of priors, and (3) accurate posterior uncertainties. All of these properties are desirable in instrument design, validation, and scientific analysis. For this reason, we call {x_T, S_T} the "true" prior, with the acknowledgment that in practice, the working prior is rarely, if ever, identical to the true prior.
Given the working priors, Reference [3] gave the following expression for the retrieved states x̂_1 and x̂_2:

x̂_1 = x_a,1 + G_1(y_1 − K_1 x_a,1),  (4)

and similarly, x̂_2 = x_a,2 + G_2(y_2 − K_2 x_a,2), where G_i = (S_a,i^{-1} + K_i′ S_ε,i^{-1} K_i)^{-1} K_i′ S_ε,i^{-1} is the gain matrix and ε_r,i = G_i ε_i is the retrieval noise. For ease of reference, we include a list of notations and their definitions in Table 1.
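In the linear Gaussian case, these quantities are a few lines of linear algebra. The sketch below (toy dimensions and values of our own choosing, not an operational configuration) builds the gain matrix and averaging kernel and confirms the equivalent decomposition x̂ = (I − A)x_a + Ax + Gε, which is the form used in the bias calculations that follow.

```python
import numpy as np

rng = np.random.default_rng(1)
r, n = 3, 8
K = rng.normal(size=(n, r))                    # toy Jacobian
Se = 5.0 * np.eye(n)                           # measurement-error covariance S_eps
Sa = 2.0 * np.eye(r)                           # working prior covariance S_a
xa = np.array([1.3, 1.8, 3.1])                 # working prior mean x_a

# Gain matrix G = (Sa^{-1} + K' Se^{-1} K)^{-1} K' Se^{-1}; averaging kernel A = G K
G = np.linalg.solve(np.linalg.inv(Sa) + K.T @ np.linalg.inv(Se) @ K,
                    K.T @ np.linalg.inv(Se))
A = G @ K

# One synthetic sounding: y = K x + eps
x = np.array([1.0, 2.0, 3.0])
eps = rng.multivariate_normal(np.zeros(n), Se)
y = K @ x + eps

xhat = xa + G @ (y - K @ xa)                          # OE retrieval
decomposed = (np.eye(r) - A) @ xa + A @ x + G @ eps   # (I - A) xa + A x + eps_r
print(np.allclose(xhat, decomposed))                  # True
```

The decomposition makes explicit that the retrieval is a weighted blend of the prior mean and the true state, plus smoothed noise, which is why a misspecified prior mean propagates directly into the retrieval's expectation.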

Table 1. Notation.

Symbol   Definition
x_T      True prior mean vector of the state vector x
x_a,i    Working prior mean vector of x for the i-th retrieval
x_C      Comparison ensemble mean within averaging kernel correction
S_ε,i    Covariance matrix for the radiance measurement error of the i-th instrument
S_T      True prior covariance matrix of the state vector x
S_a,i    Working prior covariance matrix of x for the i-th retrieval
S_C      Comparison ensemble covariance matrix within averaging kernel correction
x̂_i     Retrieved state from y_i using the working prior {x_a,i, S_a,i}
x̂_12    Averaging-kernel-adjusted x̂_2 under the comparison ensemble {x_a,1, S_a,1}

Relative Bias from Different OE Retrievals of the Same State
In general, using a prior that is different from the true prior will result in a bias. Reference [13] showed for a given retrieval with prior {x_a,i, S_a,i} that there is a bias given by:

E_T(x̂_i − x) = (A_i − I)(x_T − x_a,i),  (5)

where the subscript T on the expected-value operator indicates that it is taken with respect to the true natural variability of the state and A_i ≡ G_i K_i is the working averaging kernel. Note that the equation above implies that it is possible for an instrument to have an incorrect prior mean but still be unbiased if it uses an uninformative prior (that is, assuming that S_a,i is "large" enough that S_a,i^{-1} is effectively 0). However, most OE retrievals in remote sensing use an informative prior covariance matrix, since many retrieval problems are ill-posed, and there is not enough information to provide a unique solution (i.e., K_i′ S_ε,i^{-1} K_i is singular) without adding the constraints implied by the prior covariance matrix S_a,i. Therefore, for the rest of this paper, we will assume that S_a,i < ∞ (i.e., all eigenvalues of S_a,i are finite).
A simple corollary of Equation (5) is that whenever we compare two OE retrievals with different working prior means, the comparison will be biased. The expected value of the difference between x̂_1 and x̂_2 is:

E_T(x̂_1 − x̂_2) = (A_1 − I)(x_T − x_a,1) − (A_2 − I)(x_T − x_a,2).  (6)

It is easy to see from Equation (6) that two retrievals with different working prior mean vectors (i.e., x_a,1 ≠ x_a,2) will in general have a relative bias. Even if the working prior mean vectors are the same, it is still possible for E_T(x̂_1 − x̂_2) ≠ 0 if either S_a,1 ≠ S_a,2 or K_1 ≠ K_2, since either implies that A_1 ≠ A_2. Another somewhat unintuitive result is that there is also a relative bias if data from two different instruments are retrieved with identical forward models, prior mean vectors, and prior covariance matrices. This is because the two instruments will have different measurement-error characteristics (i.e., S_ε,1 ≠ S_ε,2). Since the definition of the averaging kernel includes a term for S_ε,i, this implies that A_1 ≠ A_2 even when {x_a,1, S_a,1} = {x_a,2, S_a,2}, which in turn implies that E_T(x̂_1 − x̂_2) ≠ 0.
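The last point can be checked directly. In the sketch below (an illustrative toy configuration of our own choosing, not an operational one), the two instruments share one Jacobian and one working prior and differ only in their noise covariances; their averaging kernels nevertheless differ, so the relative bias of Equation (6) is non-zero:

```python
import numpy as np

rng = np.random.default_rng(2)
r, n = 3, 8
K = rng.normal(size=(n, r))              # one Jacobian shared by both instruments
Se1 = 1.0 * np.eye(n)                    # instrument 1: lower noise
Se2 = 9.0 * np.eye(n)                    # instrument 2: higher noise
Sa = 2.0 * np.eye(r)                     # identical working prior covariance
xT = np.array([1.0, 2.0, 3.0])           # true prior mean
xa = xT + np.array([0.3, -0.2, 0.1])     # identical (misspecified) working prior mean

def avg_kernel(K, Se, Sa):
    G = np.linalg.solve(np.linalg.inv(Sa) + K.T @ np.linalg.inv(Se) @ K,
                        K.T @ np.linalg.inv(Se))
    return G @ K

A1, A2 = avg_kernel(K, Se1, Sa), avg_kernel(K, Se2, Sa)

# Equation (6) with x_a,1 = x_a,2 = xa reduces to (A1 - A2)(xT - xa)
rel_bias = (A1 - A2) @ (xT - xa)
print(rel_bias)    # non-zero even with identical priors and forward models
```

The noisier instrument leans more heavily on the (misspecified) prior, so its retrieval is pulled further toward the prior mean than the quieter instrument's retrieval.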

Averaging Kernel Correction
It has been long recognized that the comparison of OE retrievals with different priors is susceptible to prior-induced biases (e.g., [20,22]). The definitive treatment of this topic was given in [20], which describes how to compare retrievals with different OE priors by shifting each of the retrievals involved onto a common prior {x_C, S_C}, which Reference [20] called the "comparison ensemble". While Reference [20] did not imply that their procedure would ensure unbiasedness, a general misconception in the remote sensing literature is that the prior-induced bias implied by Equation (6) can be removed by the averaging kernel convolution process (e.g., [21]). Specifically, the misconception is that given two OE retrievals x̂_1 and x̂_2 from two different priors, if one adjusts x̂_2 into x̂_12 by convolving with the averaging kernel, then the relative bias between x̂_1 and the averaging kernel-corrected x̂_12 is zero; that is, E(x̂_12 − x̂_1) = 0. However, we will show analytically that this claim is not supported by [20].
In applications of averaging kernel correction, it is common to choose the comparison ensemble from one of the two priors used to produce the datasets. One advantage of this approach is that the researcher doing the intercomparison can take advantage of the subject matter expertise as encoded in the existing priors, although it is prudent for the researcher to select the prior that is considered more accurate or precise from the two as the comparison ensemble. Another advantage of this approach, of course, is that the retrievals from the dataset with the chosen comparison prior do not need any adjustment. For instance, when comparing OCO-2 retrievals to TCCON retrievals, one might adjust the OCO-2 retrievals using the TCCON averaging kernel while leaving TCCON retrievals unadjusted (e.g., [14]).
Here, without loss of generality, we will choose the prior {x_a,1, S_a,1} as the comparison ensemble. That is, we set {x_C, S_C} = {x_a,1, S_a,1}. This implies that we need to adjust x̂_2. From [20], the general procedure for this averaging kernel correction process is as follows:
1. Substitute the comparison ensemble mean x_C for retrieval 2's working prior mean x_a,2;
2. Make x̂_2 optimal with respect to the comparison ensemble {x_C, S_C} ([20], Equation (18)), which also yields an updated averaging kernel, denoted here by Ã_2;
3. Smooth the resulting retrieval, denoted x̃_2, with retrieval 1's averaging kernel A_1.
Combining Steps 1 to 3 above, the adjusted value x̂_12 is given as follows:

x̂_12 = x_C + A_1(x̃_2 − x_C).  (7)

We note that in a particular variant of averaging kernel correction, Step 2 (making x̂_2 optimal with respect to the comparison ensemble) is skipped in the process (e.g., [22]). This should only be done when the retrieval to be modified is already optimal with respect to the comparison ensemble {x_C, S_C}. As [20], Section 4.4, noted, "if retrieval 2 is not optimal with respect to the comparison ensemble, then Equation (18) [denoted in this paper as Step 2 in the itemized list above] should be used to convert it, and to provide an improved averaging kernel matrix." In this paper, we assume that the full averaging kernel correction is performed as suggested by [20], although the majority of our conclusions (with one important exception) still apply to the abbreviated averaging kernel correction variant in [22] (see Section 2.6 for more details).
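In a linear Gaussian simulation, this procedure can be sketched as follows. Because the retrieval that is optimal with respect to {x_C, S_C} in the linear case is simply the OE solution recomputed from y_2 with the comparison prior, Steps 1 and 2 are implemented below by re-solving the retrieval, and Step 3 smooths the result with A_1. This is a toy reading of the procedure under illustrative assumptions, not production code:

```python
import numpy as np

def gain(K, Se, Sa):
    """OE gain matrix G = (Sa^{-1} + K' Se^{-1} K)^{-1} K' Se^{-1}."""
    return np.linalg.solve(np.linalg.inv(Sa) + K.T @ np.linalg.inv(Se) @ K,
                           K.T @ np.linalg.inv(Se))

def ak_correct_full(y2, K2, Se2, A1, xC, SC):
    """Steps 1-2: recompute the retrieval from y2 with the comparison prior {xC, SC},
    which makes it optimal with respect to the comparison ensemble (linear case).
    Step 3: smooth the result with retrieval 1's averaging kernel A1."""
    G2c = gain(K2, Se2, SC)
    x2_opt = xC + G2c @ (y2 - K2 @ xC)
    return xC + A1 @ (x2_opt - xC)

# Toy demonstration (illustrative values only)
rng = np.random.default_rng(3)
r, n = 3, 8
K1, K2 = rng.normal(size=(n, r)), rng.normal(size=(n, r))
Se1, Se2 = 5.0 * np.eye(n), 5.0 * np.eye(n)
xa1, Sa1 = np.array([1.3, 1.8, 3.1]), 2.0 * np.eye(r)
A1 = gain(K1, Se1, Sa1) @ K1
xC, SC = xa1, Sa1                         # comparison ensemble = retrieval 1's prior

x = np.array([1.0, 2.0, 3.0])             # one true state
y2 = K2 @ x + rng.multivariate_normal(np.zeros(n), Se2)
xhat12 = ak_correct_full(y2, K2, Se2, A1, xC, SC)
print(xhat12)
```

As a sanity check, a noiseless radiance generated from the comparison mean itself (y2 = K2 xC) is returned unchanged by the correction.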

Examining the Biases
Having set up the background and the appropriate convolution with the averaging kernel, we will examine the biases that arise therefrom. First, let us consider the bias of x̂_1 relative to the unobserved truth x. Since we used {x_a,1, S_a,1} as the comparison ensemble, no adjustment to x̂_1 is necessary. From [13], this bias is equal to:

E(x̂_1 − x) = (A_1 − I)(x_T − x_C).  (8)

It is also straightforward to show that the adjusted value x̂_12 given in Equation (7) is also biased relative to the truth:

E(x̂_12 − x) = (A_1 Ã_2 − I)(x_T − x_C),  (9)

where Ã_2 is the averaging kernel of x̂_2 after it has been made optimal with respect to the comparison ensemble. Finally, we show that the retrieval x̂_1 and the adjusted retrieval x̂_12 are also biased relative to one another. That is,

E(x̂_12 − x̂_1) = A_1(Ã_2 − I)(x_T − x_C).  (10)

The result in Equation (10) is particularly concerning for instrument validation in practice, because it implies that even after implementing the averaging kernel correction as suggested by [20], there is still a relative bias that is proportional to (x_T − x_C), and this bias arises solely from the misspecification of the comparison ensemble {x_C, S_C}.
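All three biases can be checked numerically without any sampling, since every expectation here is linear in the unknowns. In the toy sketch below (illustrative values of our own choosing; `A2t` stands for Ã_2, retrieval 2's averaging kernel after it is made optimal with respect to the comparison ensemble), the relative bias equals the difference of the two biases with respect to the truth:

```python
import numpy as np

rng = np.random.default_rng(4)
r, n = 3, 8
K1, K2 = rng.normal(size=(n, r)), rng.normal(size=(n, r))
Se = 5.0 * np.eye(n)
xT = np.array([1.0, 2.0, 3.0])                      # true prior mean
xC = xT + np.array([0.3, -0.2, 0.1])                # misspecified comparison mean
SC = 2.0 * np.eye(r)                                # comparison covariance

def avg_kernel(K, Se, Sa):
    G = np.linalg.solve(np.linalg.inv(Sa) + K.T @ np.linalg.inv(Se) @ K,
                        K.T @ np.linalg.inv(Se))
    return G @ K

A1 = avg_kernel(K1, Se, SC)     # retrieval 1 already uses the comparison prior
A2t = avg_kernel(K2, Se, SC)    # retrieval 2 after being made optimal w.r.t. {xC, SC}
I = np.eye(r)

bias_1 = (A1 - I) @ (xT - xC)               # E(xhat1  - x)
bias_12 = (A1 @ A2t - I) @ (xT - xC)        # E(xhat12 - x)
rel_bias = A1 @ (A2t - I) @ (xT - xC)       # E(xhat12 - xhat1)
print(rel_bias)                             # non-zero whenever xC != xT
```

Setting xC equal to xT in this sketch drives all three vectors to zero, which is precisely the implicit assumption behind the misconception discussed below.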
For instance, OCO-2 retrievals are often compared to TCCON retrievals (after the appropriate averaging kernel adjustments) in order to assess the relative bias. The result in Equation (10) indicates that such a comparison will produce a bias as long as the comparison ensemble mean is not the same as the mean of the natural variability of the state x (i.e., x_T ≠ x_C). In other words, a comparison between OCO-2 retrievals and TCCON might produce a bias that might lead the OCO-2 instrument team to re-examine issues such as calibration and absorption coefficients, when the bias is actually due to the comparison profile not being the same as the true mean of the state. In this case, where x_T ≠ x_C, chasing after calibration or spectroscopy to remove the bias would be fruitless.
In some validation exercises, it might be important to assess the magnitude of the prior-induced bias relative to known validation errors, and Equation (10) offers a means to do so. Note that Equation (10) has only one unknown, the true prior mean x_T. Here, the researcher can make reasonable guesses at some choices of x_T and then compute the resulting prior-induced biases. For instance, the researcher may assume that x_C is misspecified by 1% or 2% (i.e., x_C = 0.99 x_T or x_C = 0.98 x_T), which can then be used in Equation (10) to produce a numeric result that the researcher can compare to other sources of error.
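That back-of-the-envelope exercise can be sketched as follows (a toy configuration; the 1% and 2% misspecifications, the CO2-like magnitude of x_T, and all other values are illustrative assumptions, not an operational setup):

```python
import numpy as np

rng = np.random.default_rng(5)
r, n = 3, 8
K1, K2 = rng.normal(size=(n, r)), rng.normal(size=(n, r))
Se = 5.0 * np.eye(n)                   # measurement-error covariance (illustrative)
SC = 16.0 * np.eye(r)                  # comparison ensemble covariance (illustrative)
xT = np.full(r, 400.0)                 # hypothetical true prior mean (CO2-like, in ppm)

def avg_kernel(K, Se, Sa):
    G = np.linalg.solve(np.linalg.inv(Sa) + K.T @ np.linalg.inv(Se) @ K,
                        K.T @ np.linalg.inv(Se))
    return G @ K

A1 = avg_kernel(K1, Se, SC)            # retrieval 1 uses the comparison prior
A2t = avg_kernel(K2, Se, SC)           # stand-in for retrieval 2 made optimal w.r.t. {xC, SC}
I = np.eye(r)

# Relative bias evaluated for xC misspecified by 1% and 2% of xT
biases = {frac: A1 @ (A2t - I) @ (xT - (1.0 - frac) * xT) for frac in (0.01, 0.02)}
for frac, bias in biases.items():
    print(f"{frac:.0%} misspecification -> bias norm {np.linalg.norm(bias):.4f}")
```

Because the bias is linear in (x_T − x_C), doubling the assumed misspecification exactly doubles the implied bias, which makes this kind of sensitivity table cheap to produce.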
Upon closer examination of the biases in Equations (8)-(10), we see that they all involve a multiplicative term (x_T − x_C), where we recall that for this particular example, we set {x_C, S_C} = {x_a,1, S_a,1}. It is then easy to see where the misunderstanding about the convolution of the averaging kernel being able to remove the bias originated: if x_C = x_T, then it is obvious that E(x̂_1 − x) = 0, E(x̂_12 − x) = 0, and E(x̂_12 − x̂_1) = 0. That is, the general misconception in the literature arises from the implicit assumption that the comparison prior mean x_C is equal to the true prior mean x_T. Reference [20] described how to compare retrievals with different priors by moving the retrievals onto a common comparison ensemble {x_C, S_C}, but their paper did not address what happens when this comparison ensemble mean x_C is different from the true mean x_T of the state x.
Unfortunately for OE retrievals, the comparison ensemble is almost certainly not equal to {x_T, S_T} in real applications. This leads to biases in E(x̂_1 − x), E(x̂_12 − x), and E(x̂_12 − x̂_1), and this is why it is important to consider the effect of prior misspecification. In instrument team validation, it is often necessary to compare OE retrievals to retrievals from a validation source in order to assess the bias. Often, an averaging kernel correction is applied before intercomparison, and the instrument team then attempts to correlate any resulting bias with potential sources (e.g., spectroscopy, calibration, clouds and aerosols, etc.). The result in Equation (10) indicates that the choice of priors should also be considered as a potential source of bias whenever intercomparison between multiple retrievals is necessary (e.g., instrument validation, data assimilation of multiple data sources, etc.).
It is important to note that the relative bias in Equation (10) will change depending on the choice of the comparison prior. That is, in general, E(x̂_12 − x̂_1) ≠ E(x̂_21 − x̂_2), where x̂_21 denotes the averaging kernel-adjusted x̂_1 under the comparison ensemble {x_a,2, S_a,2}. In many applications, a researcher is given two different OE retrievals and has the choice of selecting one of the two corresponding priors as the comparison prior. Here, the researcher may not be able to avoid a relative bias via the averaging kernel correction process, but he or she might be able to reduce its effect by selecting the prior that results in a smaller bias.
We first note that the biases in Equations (8)-(10) are all proportional to (x_T − x_C). Therefore, if possible, a researcher should select from the two choices the one that has the more realistic prior mean; that is, the prior mean that is more likely to be close to the true (but hidden) prior mean x_T, which would reduce the magnitude of the term (x_T − x_C).
In some applications, however, it is not easy to judge which of the two prior means is more realistic or more accurate. In this case, it is possible to rely on the choice of the comparison prior covariance matrix S_C to mitigate the magnitude of the prior-induced bias. It has been demonstrated that using a "large" prior covariance matrix in an OE algorithm will reduce the magnitude of the prior-induced bias [13]. This is because inflating the prior covariance matrix will decrease the information content of the prior, thereby reducing the degree to which a misspecified prior mean can bias the retrieval.
In the same manner, choosing a "larger" covariance matrix for S_C will decrease the magnitudes of E(x̂_1 − x), E(x̂_12 − x), and E(x̂_12 − x̂_1). Let us consider the limiting case where S_C = S_a,1 → ∞. Since the averaging kernel is given as A_1 = (S_a,1^{-1} + K_1′ S_ε,1^{-1} K_1)^{-1} K_1′ S_ε,1^{-1} K_1, we have A_1 → I as S_a,1 → ∞, and it is easy to see that

E(x̂_1 − x) = (A_1 − I)(x_T − x_C) → 0.  (11)

Likewise, we can show the same result for the averaging kernel-corrected x̂_12 as well:

E(x̂_12 − x) = (A_1 Ã_2 − I)(x_T − x_C) → 0.  (12)

Following the procedure above, it is easy to show that the relative bias E(x̂_12 − x̂_1) → 0 as the comparison prior covariance matrix S_C → ∞. Therefore, if minimizing the relative bias is of paramount concern and a researcher is unsure of the accuracy of the prior means x_a,1 and x_a,2, it is a good idea to choose the prior with the "larger" prior covariance matrix as the comparison prior. This result also provides an avenue for completely removing the prior-induced bias in the intercomparison of OE retrievals. If it is absolutely important to remove the prior as a source of bias in a validation study, one approach would be to let S_C be an uninformative matrix (i.e., S_C → ∞), which would force the prior-induced bias to approach zero. It is important to note, however, that there are some trade-offs involved in using a less informative prior covariance matrix. For instance, when the instrument signal-to-noise ratio is low, using an uninformative prior covariance can result in very large posterior uncertainties (for more details, see [13]). In the case of averaging kernel corrections, letting S_C → ∞ will ensure that E(x̂_12 − x̂_1) → 0, but this choice may lead to larger variability in the difference (i.e., var(x̂_12 − x̂_1) may be "larger").
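The limiting behavior is easy to visualize numerically. The sketch below (toy values of our own choosing) inflates S_C by successive factors, recomputing both averaging kernels accordingly (recall S_C = S_a,1 in this setup), and tracks the norm of the relative bias, which shrinks toward zero as the comparison prior becomes uninformative:

```python
import numpy as np

rng = np.random.default_rng(6)
r, n = 3, 8
K1, K2 = rng.normal(size=(n, r)), rng.normal(size=(n, r))
Se = 5.0 * np.eye(n)
xT = np.array([1.0, 2.0, 3.0])
xC = xT + np.array([0.3, -0.2, 0.1])     # misspecified comparison mean

def avg_kernel(K, Se, Sa):
    G = np.linalg.solve(np.linalg.inv(Sa) + K.T @ np.linalg.inv(Se) @ K,
                        K.T @ np.linalg.inv(Se))
    return G @ K

I = np.eye(r)
norms = []
for scale in (1.0, 10.0, 100.0, 1000.0):
    SC = scale * 2.0 * np.eye(r)                 # progressively less informative
    A1 = avg_kernel(K1, Se, SC)                  # both kernels use the comparison prior
    A2t = avg_kernel(K2, Se, SC)
    rel_bias = A1 @ (A2t - I) @ (xT - xC)        # relative bias under the full correction
    norms.append(np.linalg.norm(rel_bias))
print(norms)   # shrinks toward zero as SC grows
```

Both A_1 and Ã_2 approach the identity as S_C is inflated, so the multiplicative factor in front of (x_T − x_C) collapses, at the cost of noisier individual retrievals.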
At this point, the reader may ask if Equation (10) points at a possible approach to estimate the true prior mean vector x_T. That is, given a set of retrievals and the fact that the bias is proportional to (x_C − x_T), we could iterate over different choices of x_C until we achieve E(x̂_12 − x̂_1) = 0, which should occur when x_C = x_T. In theory, this is possible. However, recall that in this paper, we simplified the problem to assume no bias from non-prior sources (e.g., calibration, spectroscopy, geolocation, interference factors such as aerosols, etc.). In real applications, the prior-induced bias is likely only one component of the overall bias, and iterating x_C until E(x̂_12 − x̂_1) = 0 would falsely ascribe these non-prior sources of bias to the choice of the comparison ensemble. A more principled approach would be to remove the prior-induced bias via the recommendations in this section, which would allow the researcher to be reasonably sure that any remaining bias is due to other causes such as calibration or spectroscopy.
So far in this section, we have considered the biases in the original state-vector space. In many applications, the state vector is convolved into a scalar using a linear weight vector (e.g., combining the 20-dimensional CO2 profile into an XCO2 value [25]). That is, a given state vector x is converted into a scalar as the linear combination h′x, where h is a pressure weighting vector. The scalar bias can then be easily computed from the results in Equations (8)-(10). For instance, for a given pressure weighting vector h, the scalar relative bias is given as follows:

E(h′x̂_12 − h′x̂_1) = h′A_1(Ã_2 − I)(x_T − x_C).

We note that most of the conclusions in this section hold in the scalar space as well, although the magnitude of the scalar bias will vary depending on the choice of the weight vector h. In theory, it is possible for h to be orthogonal to E(x̂_12 − x̂_1), which would result in the scalar bias being zero. In practice, the weighting vector is often constructed based on physical motivations (e.g., [25]), independent of the prior-selection process. Therefore, it is nearly impossible for these two vectors to be orthogonal, and it would be foolhardy to rely on this infinitesimally small chance as a means to remove this prior-induced bias.
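A minimal sketch of the scalar case follows (the three-level weighting vector h is an illustrative stand-in for a real pressure weighting function, and all other values are toy assumptions):

```python
import numpy as np

rng = np.random.default_rng(7)
r, n = 3, 8
K1, K2 = rng.normal(size=(n, r)), rng.normal(size=(n, r))
Se, SC = 5.0 * np.eye(n), 2.0 * np.eye(r)
xT = np.array([1.0, 2.0, 3.0])
xC = xT + np.array([0.3, -0.2, 0.1])

def avg_kernel(K, Se, Sa):
    G = np.linalg.solve(np.linalg.inv(Sa) + K.T @ np.linalg.inv(Se) @ K,
                        K.T @ np.linalg.inv(Se))
    return G @ K

A1, A2t = avg_kernel(K1, Se, SC), avg_kernel(K2, Se, SC)
vec_bias = A1 @ (A2t - np.eye(r)) @ (xT - xC)   # vector relative bias

h = np.array([0.2, 0.5, 0.3])                   # toy pressure weighting vector (sums to 1)
scalar_bias = h @ vec_bias                      # h' E(xhat12 - xhat1)
print(scalar_bias)
```

The scalar bias is just the weighted combination of the vector bias, so it inherits the proportionality to (x_T − x_C) and vanishes only in the (practically irrelevant) orthogonal case.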

Abbreviated Variant of Averaging Kernel Correction
As we noted earlier, a variant of the averaging kernel correction process omits Step 2 (adjusting the retrieval to make it "optimal" relative to {x_C, S_C}) of the correction procedure described above. Reference [20] recommended that Step 2 be taken whenever the retrievals to be compared have a different prior than {x_C, S_C}, but this advice is not always heeded in practice. In this variant, the abbreviated averaging kernel correction has a much simpler form:

x̂_12 = x_C + A_1(x̂_2 + (I − A_2)(x_C − x_a,2) − x_C).  (13)

The relative bias E(x̂_12 − x̂_1) is then given as:

E(x̂_12 − x̂_1) = A_1(A_2 − I)(x_T − x_C).  (14)

From Equation (14), we see that this abbreviated averaging kernel correction still has a relative bias if x_T ≠ x_C and that choosing a comparison prior mean that is as close as possible to x_T will reduce this prior-induced bias. However, we note that one important difference between Equations (10) and (14) is their behavior as S_C → ∞. We have seen that under the full averaging kernel correction in [20], making S_C more uninformative will reduce the prior-induced bias to 0. Under the abbreviated variant discussed in this section, this remedy is no longer available, because from Equation (14), it is easy to see that lim_{S_C→∞} E(x̂_12 − x̂_1) = (A_2 − I)(x_T − x_a,1). Therefore, we recommend that the averaging kernel correction follow the full procedure described in [20] whenever possible.
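The contrast with the full correction can be seen numerically. In the sketch below (toy values of our own choosing), A_2 is kept fixed at retrieval 2's own working prior, as in the abbreviated variant; inflating S_C no longer drives the relative bias to zero, and the bias approaches the stated non-zero limit:

```python
import numpy as np

rng = np.random.default_rng(8)
r, n = 3, 8
K1, K2 = rng.normal(size=(n, r)), rng.normal(size=(n, r))
Se = 5.0 * np.eye(n)
xT = np.array([1.0, 2.0, 3.0])
xa1 = xT + np.array([0.3, -0.2, 0.1])   # comparison ensemble mean xC = xa1
Sa2 = 1.5 * np.eye(r)                   # retrieval 2 keeps its own working prior covariance

def avg_kernel(K, Se, Sa):
    G = np.linalg.solve(np.linalg.inv(Sa) + K.T @ np.linalg.inv(Se) @ K,
                        K.T @ np.linalg.inv(Se))
    return G @ K

A2 = avg_kernel(K2, Se, Sa2)            # NOT re-derived under {xC, SC} in this variant
I = np.eye(r)

norms = []
for scale in (1.0, 10.0, 100.0, 1000.0):
    SC = scale * 2.0 * np.eye(r)                # inflate the comparison covariance
    A1 = avg_kernel(K1, Se, SC)
    rel_bias = A1 @ (A2 - I) @ (xT - xa1)       # abbreviated-variant relative bias
    norms.append(np.linalg.norm(rel_bias))

limit = (A2 - I) @ (xT - xa1)           # limiting relative bias as SC -> infinity
print(norms, np.linalg.norm(limit))     # the norms do NOT shrink to zero
```

Because A_2 never sees the inflated S_C, the factor (A_2 − I) stays fixed, which is exactly why the uninformative-S_C remedy fails for the abbreviated variant.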

Observation-Model Intercomparisons
In the previous section, we primarily studied the intercomparison of two OE retrievals, x̂_1 and x̂_2. In some applications, it is necessary to compare a remote sensing retrieval x̂_1 against model data x̂_m (e.g., comparing GEOS-Chem chemical transport model data against satellite retrievals [26]). Further, for a retrieval x̂_1 and model data x̂_m in data assimilation, it is common to convolve the model profile with the averaging kernel of x̂_1 using the following equation given by [27]: x̂_1m = x_a,1 + A_1 (x̂_m − x_a,1). (15) It is not immediately clear how the prior-induced bias comes into play here, but a close inspection shows that it arises both from the misspecification of {x_a,1, S_a,1} and from biases in the model data.
We first assume that the model data x̂_m can be written as a function of the true state x in the following linear form: x̂_m = b + H x + δ, (16) where b is an intercept term, H is a "slope" or "scaling factor" on x, and δ is a zero-mean residual error term. We can think of Equation (16) as a first-order Taylor expansion of the functional relationship between x̂_m and the true state x, where b and H are fixed parameters and δ is a random error vector. In principle, the error term δ could have a non-zero mean, but that systematic error can be absorbed into the intercept term b, so we work with the simple zero-mean model. Given the above, the bias of the corrected model profile x̂_1m, relative to the true x, is given as E(x̂_1m − x) = (I − A_1)(x_a,1 − x_T) + A_1 b + A_1 (H − I) x_T, which indicates that when x_a,1 ≠ x_T, b ≠ 0, or H ≠ I, the bias E(x̂_1m − x) is generally nonzero. Similarly, if the goal is the intercomparison of the retrieved x̂_1 against the corrected model data x̂_1m, then we see that E(x̂_1m − x̂_1) = A_1 b + A_1 (H − I) x_T. (17) The result in Equation (17) is striking in that the effect of the prior mean of x̂_1 has disappeared entirely: x_a,1 no longer appears in the expression for the bias. This is encouraging, especially when retrievals from multiple satellites are compared to the same model data. However, we note that the choice of the prior covariance S_a,1 can still affect the magnitude of the bias, since the averaging kernel A_1 appears as a multiplicative factor on the right-hand side of Equation (17).
What is remarkable about the correction in Equation (17) is that the bias E(x̂_1m − x̂_1) depends only on the intercept b and the "Jacobian" H of the first-order linear approximation of the functional relationship between x̂_m and the true state x. If b = 0 and H = I, the observation-model intercomparison is unbiased. However, this amounts to assuming that x̂_m = x + δ, i.e., that the model data are simply the true state x plus some random zero-mean error. This is a very strong assumption that in general requires strong supporting evidence.
It is important to note that although model data are often used as a proxy for the truth in remote sensing, they are not the same thing. In practice, there will likely be some discrepancy between the model data and the true state x, formalized as b ≠ 0 or H ≠ I. In this case, the first-order Taylor approximation indicates that there will be a non-zero bias in the observation-model comparison.
In practice, the results in this section imply that researchers should generally be more skeptical about treating model data as a proxy for the truth. Often, any bias found in an observation-model intercomparison is ascribed to potential causes in the retrievals (e.g., spectroscopy, calibration, systematic measurement errors, retrieval parameterizations, etc.). Equation (17) indicates that a bias can also arise if the model data are "different" from the true field (that is, if b ≠ 0 or H ≠ I), so researchers should consider the accuracy of the model data as one potential source of bias in observation-model intercomparisons.
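The dependence of Equation (17) on only b and H (and the cancellation of the prior mean x_a,1) can be checked numerically. The sketch below uses a deliberately small linear-Gaussian toy model with a random Jacobian and hypothetical dimensions and parameters, not the OCO-2 surrogate model; the retrieval is the standard linear OE estimate x̂ = x_a + G(y − K x_a) with averaging kernel A = GK:

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Toy linear OE setup (all numbers hypothetical, not the OCO-2 model) ---
n_state, n_obs = 3, 6
K = rng.normal(size=(n_obs, n_state))   # Jacobian of the linear forward model
S_eps = 0.05 * np.eye(n_obs)            # measurement-error covariance
S_a1 = np.eye(n_state)                  # working prior covariance of retrieval 1
x_a1 = np.array([1.0, 2.0, 3.0])        # working prior mean of retrieval 1
x_T = np.array([1.1, 1.9, 3.2])         # true mean state
S_T = 0.25 * np.eye(n_state)            # true natural variability

# Gain matrix and averaging kernel of retrieval 1.
G1 = np.linalg.solve(K.T @ np.linalg.inv(S_eps) @ K + np.linalg.inv(S_a1),
                     K.T @ np.linalg.inv(S_eps))
A1 = G1 @ K

# Model "data" xhat_m = b + H x + delta, as in Equation (16).
b = np.array([0.1, -0.05, 0.02])
H = np.eye(n_state) + 0.05 * rng.normal(size=(n_state, n_state))

# --- Monte Carlo estimate of E(xhat_1m - xhat_1) ---
n_mc = 200_000
x = rng.multivariate_normal(x_T, S_T, size=n_mc)                  # true states
y = x @ K.T + rng.multivariate_normal(np.zeros(n_obs), S_eps, size=n_mc)
xhat_1 = x_a1 + (y - x_a1 @ K.T) @ G1.T                           # OE retrievals
xhat_m = b + x @ H.T + 0.05 * rng.normal(size=(n_mc, n_state))    # model data
xhat_1m = x_a1 + (xhat_m - x_a1) @ A1.T                           # AK-convolved model

mc_bias = (xhat_1m - xhat_1).mean(axis=0)
theory_bias = A1 @ (b + (H - np.eye(n_state)) @ x_T)  # Equation (17) form
print(mc_bias, theory_bias)  # the two should agree closely

# The working prior mean cancels in this comparison: repeating with a
# shifted x_a1 leaves the bias unchanged.
x_a1_alt = x_a1 + 5.0
xhat_1_alt = x_a1_alt + (y - x_a1_alt @ K.T) @ G1.T
xhat_1m_alt = x_a1_alt + (xhat_m - x_a1_alt) @ A1.T
mc_bias_alt = (xhat_1m_alt - xhat_1_alt).mean(axis=0)
```

Algebraically, x̂_1m − x̂_1 = A_1 x̂_m − G_1 y in this linear setting, so the prior mean cancels exactly, which is why `mc_bias_alt` matches `mc_bias` to floating-point precision.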

Simulation Study
In the previous section, we derived the analytical expressions for the biases that arise from misspecified priors, which remain even after taking into account averaging kernel corrections. In this section, we will demonstrate empirically the effect of the biases using data from an Observing System Simulation Experiment (OSSE), which is based on a streamlined version of the OCO-2 forward model used in [10].
The OCO-2 spectrometers measure reflected solar radiation in three near-infrared bands, with a total of N = 3048 spectral channels. The mission's primary goal is to provide high-resolution estimates of the total-column carbon dioxide dry air mixing ratio (XCO2). We based our simulation on the simplified OCO-2 forward model of [10], which "makes some simplification for interpretability and computational efficiency while attempting to maintain the key components of the state vector and RT [radiative transfer] that contribute substantially to uncertainty in [total-column CO 2 ]". Here, the state vector x is 39-dimensional, consisting of a 20-level CO2 profile, surface air pressure, surface albedo, and aerosol profiles (for an overview of the construction and parameterization of this simplified forward model, see Section 3 of [10]).
In this simulation experiment, we first designated a known distribution as the true prior {x_T, S_T} from which we would sample simulated true states x. Similar to [13], we selected this true prior as the sample mean and sample covariance of 5000 retrieved states obtained after simulation from a non-linear control case ([10] Section 4.3). For simplicity, the forward model for both simulated instruments is assumed to have the linear form F(x) = c + Kx, where K is a Jacobian matrix chosen from one of the 5000 retrievals from the control case in [10], and c = 0 (recall that c, being known, can be subtracted from y_i without impact on subsequent calculations). From the simulated true state x, we then simulate two separate radiance vectors y_1 and y_2 and compute the retrieved states x̂_1 and x̂_2 using different sets of working priors, depending on the experiment. Since we wish to demonstrate that the bias is not invariant to the choice of the comparison ensemble, we applied averaging kernel corrections to both x̂_1 and x̂_2 (in each case using the prior of the other instrument as the comparison prior) and computed E(x̂_21 − x̂_2) and E(x̂_12 − x̂_1).
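The pipeline just described can be sketched in a deliberately low-dimensional linear-Gaussian stand-in for the OSSE (all dimensions and numbers below are hypothetical, not the 39-element OCO-2 surrogate). In the linear case, adjusting a retrieval to the comparison prior is equivalent to re-retrieving from the same radiances under {x_C, S_C}, which is what the sketch does before smoothing by the averaging kernel:

```python
import numpy as np

rng = np.random.default_rng(1)

# Low-dimensional linear-Gaussian stand-in for the OSSE (hypothetical numbers).
n_state, n_obs = 4, 8
K = rng.normal(size=(n_obs, n_state))   # shared linear forward model (c = 0)
S_eps = 10.0 * np.eye(n_obs)            # same measurement noise for both
x_T = np.array([400.0, 401.0, 399.0, 400.5])  # true prior mean (ppm-like)
S_T = 4.0 * np.eye(n_state)             # true natural variability

def oe_retrieve(y, x_a, S_a):
    """Linear OE: xhat = x_a + G (y - K x_a); also returns A = G K."""
    G = np.linalg.solve(K.T @ np.linalg.inv(S_eps) @ K + np.linalg.inv(S_a),
                        K.T @ np.linalg.inv(S_eps))
    return x_a + (y - x_a @ K.T) @ G.T, G @ K

# Working prior means misspecified as fractions of x_T, as in Table 2.
x_1, x_2 = 0.99 * x_T, 0.98 * x_T

n_mc = 100_000
x = rng.multivariate_normal(x_T, S_T, size=n_mc)   # simulated true states
noise = lambda: rng.multivariate_normal(np.zeros(n_obs), S_eps, size=n_mc)
y1, y2 = x @ K.T + noise(), x @ K.T + noise()      # two radiance ensembles

def relative_bias(x_C, S_C):
    """Full correction: re-retrieve both under {x_C, S_C} (the linear-case
    equivalent of adjusting them to the comparison prior), then smooth
    retrieval 2 with instrument 1's comparison-prior averaging kernel."""
    xh1, A1 = oe_retrieve(y1, x_C, S_C)
    xh2, _ = oe_retrieve(y2, x_C, S_C)
    xh12 = x_C + (xh2 - x_C) @ A1.T
    return (xh12 - xh1).mean(axis=0)

bias_1 = relative_bias(x_1, S_T)   # comparison prior {x_1, S_T}
bias_2 = relative_bias(x_2, S_T)   # comparison prior {x_2, S_T}
print(bias_1, bias_2)              # bias_2 is roughly twice bias_1
```

Because x_2 − x_T = 2(x_1 − x_T) in this setup, the second bias comes out at roughly twice the first, mirroring the proportionality of the relative bias to (x_C − x_T).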
We wish to examine the roles of the prior mean and the prior covariance in intercomparison bias; therefore, we set the instrument noise to be the same for both (simulated) instruments (i.e., S_ε,1 = S_ε,2 = S_ε, where the measurement-error matrix S_ε is obtained from the surrogate OCO-2 model in [10]). There are three experiments, and within each experiment, we selected as the prior means either x_1 or x_2, which are misspecified as fractions of the true prior mean x_T: x_1 = 0.99 x_T and x_2 = 0.98 x_T. In Table 2, we show the values of x_T, x_1, and x_2 for all 39 state elements. Similarly, in some of the three experiments, we allowed the retrievals to use a misspecified prior covariance matrix S_a based on the operational prior for OCO-2, which depends on the latitude and time of the OCO-2 sounding and on a climatology obtained from the GLOBALVIEW dataset [10]. Table 2. True prior means x_T and working prior means x_1 and x_2 used in the simulation. While x_1 = 0.99 x_T and x_2 = 0.98 x_T, we display their actual values in the third and fourth columns for ease of reference.

[Table 2 body omitted: columns Name, x_T, x_1, x_2; entries include the vertical CO2 volume mixing ratio profile (means in ppm) and the remaining state elements.]

The prior covariance matrix S_a is diagonal for all non-CO2 elements, and the block covariance matrix corresponding to the CO2 profile has off-diagonal entries "estimated based on the Laboratoire de Météorologie Dynamique general circulation model, but the correlation coefficients were reduced arbitrarily to ensure numerical stability in taking its inverse" [1]. Furthermore, the diagonal entries of S_a are significantly inflated relative to the best scientific understanding; they are "unrealistically large for most of the world, [in order to provide] a minimal constraint on the retrieved XCO2" [1]. The differences in design and magnitude between S_T and S_a are displayed in the top and bottom rows, respectively, of Figure 1. Interested readers can find the prior means (x_T, x_1, x_2), the prior covariance matrices (S_T, S_a), the pressure-weighting vector h, the Jacobian K, and the measurement-error matrix S_ε in the Supplementary Materials. We examine the bias arising from the choice of priors in three experiments, summarized below; a quick overview of their parameterization is given in Table 3.
Experiment 1 Here, we wish to examine the impact of the misspecification of the prior mean on the relative bias. We assume that Instruments 1 and 2 are using the same prior covariance matrix S T and that x a,1 = x 1 and x a,2 = x 2 .
Experiment 2 Here, we wish to examine the impact of the prior covariance matrix on the relative bias. We assume that Instruments 1 and 2 are using the same misspecified prior mean x_1 and that S_a,1 = S_a and S_a,2 = S_T.
Experiment 3 Here, we wish to examine the impact of further inflating the prior covariance matrix. We assume that Instruments 1 and 2 are using the same prior mean x_1 and that S_a,1 = 1000 S_a and S_a,2 = S_T.
The steps for our simulation experiments are as follows: (1) sample a true state x from {x_T, S_T}; (2) simulate the radiance vectors y_1 and y_2; (3) compute the retrievals x̂_1 and x̂_2 under the working priors of the given experiment; and (4) apply the averaging kernel corrections and record the differences. The results of our three experiments (in XCO2 space) are listed in Table 4, where the units are parts per million. For Experiment 1, where we correctly specified the prior covariance matrix but misspecified the prior means, both retrievals are biased relative to the truth (i.e., E(x̂_1 − x) = −1.5 ppm and E(x̂_2 − x) = −3.0 ppm). These biases relative to the hidden true x are consistent with the theoretical expression given in Equation (5), although we will not discuss them in detail in this section since they were covered in depth in [13].
The general misconception in the existing literature is that the averaging kernel correction removes the relative biases, but this is not supported by the simulations in Columns 3 and 5. There, we see that using the comparison prior {x_1 = 0.99 x_T, S_T} leaves a relative bias of −0.68 ppm, while using the comparison prior {x_2 = 0.98 x_T, S_T} leaves a relative bias of −1.37 ppm. Note that, based on Equation (10), the relative bias is proportional to the difference (x_C − x_T). Since x_1 = 0.99 x_T and x_2 = 0.98 x_T, we have 2(x_1 − x_T) = (x_2 − x_T), so we would expect E(x̂_21 − x̂_2) to be twice E(x̂_12 − x̂_1); this is precisely what Experiment 1 shows. These simulation results (based on 3000 realizations of the true state x from {x_T, S_T}) are consistent with the theoretical calculation from Equation (10), the precise values of which are given in Columns 4 and 6. To reiterate, this experiment indicates that the choice of the prior mean for the comparison ensemble is crucial in determining the magnitude of the bias in the intercomparison of remote sensing retrievals. In general, the closer this choice of x_C is to the hidden x_T, the smaller the bias.
In Experiment 2, we wish to examine the effect of choosing a comparison ensemble with a "more uninformative" prior covariance matrix (here, S_a). To this end, we misspecified the prior means by the same amount (x_i = 0.99 x_T) for both simulated instruments. As we can see, when we perform the averaging kernel correction with {x_C, S_C} = {x_1, S_T}, the relative bias is −0.68 ppm, even though the comparison covariance matrix S_C is equal to the true prior covariance S_T. Yet, when we choose the less informative, and certainly incorrect, prior {x_C, S_C} = {x_1, S_a}, the relative bias is reduced roughly by a factor of four, to −0.15 ppm. This indicates that an accurate prior covariance matrix may reinforce, rather than mitigate, the bias arising from an incorrect prior mean vector.

Table 4. XCO2 biases remaining after averaging kernel correction, based on 3000 simulations (Columns 3 and 5) and on theoretical derivations (Columns 4 and 6). Units are parts per million (ppm).

In Experiment 3, we explored the limiting case where the largely uninformative S_a is further inflated by a factor of 1000. Although unrealistic as a choice of prior, this inflation allows us to numerically explore the limiting case of an uninformative prior in the averaging kernel correction. The mathematics of Section 2 indicate that this choice should further reduce the relative bias E(x̂_12 − x̂_1) toward 0, and this is precisely what the simulation shows. The bias based on simulations is 0.0119 ppm, fairly close to the predicted theoretical value of −0.006 ppm; both are so close to zero that the intercomparison can roughly be considered unbiased. This experiment reinforces the conclusion of Section 2 that, if unbiasedness of the intercomparison is of paramount concern, then in general a researcher should choose a comparison ensemble with a less informative prior covariance matrix.
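The limiting behavior explored in Experiment 3 can also be seen in a direct linear-Gaussian toy calculation. For two identical instruments adjusted to a common comparison prior {x_C, S_C}, the relative bias reduces (in our reconstruction for this special case, consistent with the proportionality to (x_C − x_T) and the S_C → ∞ limit described in the text) to A(I − A)(x_C − x_T), where A is the averaging kernel under the comparison prior; inflating S_C drives A toward I and the bias toward 0. All numbers below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical linear-Gaussian setup with two identical instruments.
n_state, n_obs = 4, 8
K = rng.normal(size=(n_obs, n_state))   # shared Jacobian, full column rank
S_eps = 10.0 * np.eye(n_obs)            # shared measurement-error covariance
x_T = np.array([400.0, 401.0, 399.0, 400.5])  # true mean state
S_T = 4.0 * np.eye(n_state)             # true natural variability
x_C = 0.99 * x_T                        # misspecified comparison prior mean

def relative_bias(S_C):
    """A (I - A) (x_C - x_T): our linear-Gaussian reconstruction of the
    relative bias for identical instruments adjusted to {x_C, S_C}."""
    G = np.linalg.solve(K.T @ np.linalg.inv(S_eps) @ K + np.linalg.inv(S_C),
                        K.T @ np.linalg.inv(S_eps))
    A = G @ K
    return A @ (np.eye(n_state) - A) @ (x_C - x_T)

for scale in (1.0, 10.0, 1000.0):
    # Inflating the comparison covariance pushes A toward I and the bias to 0.
    print(scale, np.abs(relative_bias(scale * S_T)).max())
```

The printed maximum-absolute bias shrinks by orders of magnitude as the scale factor grows, mirroring the near-zero residual bias found with the 1000-fold inflation in Experiment 3.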

Discussion
Optimal estimation is a state-of-the-art retrieval method in remote sensing, and it is the method of choice for a large number of remote sensing missions (e.g., OCO-2, SEVIRI, GOSAT, TES, etc.). One important component of OE retrieval is specifying the prior mean and prior covariance of the state vector x, and it has been shown in the literature that misspecifying the prior distribution within the retrieval algorithm (relative to the true unobserved natural variability of x) will result in biased retrievals.
Intercomparison of remote sensing retrievals from two or more instruments is often necessary in product validation and data assimilation [14-19]. Reference [20] recognized that "even if the retrievals are optimal with respect to their own a priori, they will not normally be optimal with respect to the comparison ensemble [{x_C, S_C}]" ([20] Section 3), and hence recommended that the researcher adjust one (or both) of the retrievals to a comparison ensemble {x_C, S_C}. We note that [20] were well aware of the limitations of their averaging kernel correction with respect to bias: they claimed only that "[the] smoothing error of the comparison of x̂_12 with x̂_1 should be smaller than that of the direct comparison of x̂_1 and x̂_2" ([20] Section 4.4). They also derived an expression for the difference between x̂_12 and x̂_1 ([20] Equation (29)), which is equivalent to our expression for E(x̂_12 − x̂_1) in Equation (14), indicating that they were well aware that the averaging kernel correction does not completely remove the relative bias.
Notwithstanding this lack of support from [20], a popular misconception in the existing literature is that any bias induced by prior misspecification is "removed" in intercomparison via the averaging kernel correction. In this paper, we explicitly separated the natural variability of the state {x_T, S_T} from the comparison prior {x_C, S_C} (or "comparison ensemble" [20]), which allowed us to analytically examine the biases that remain after averaging kernel correction. In the typical application with two OE retrievals to be compared, it is usual to "shift" a retrieval, say x̂_2, to a comparison prior given by {x_C, S_C} = {x_a,1, S_a,1}. In order to better characterize retrieval intercomparisons and the potential causes of OE biases, we examined the existence of these prior-induced biases, the means to mitigate them, and potential approaches for estimating their magnitudes. Our key findings are summarized as follows:
• If x_C ≠ x_T, then the averaging kernel-corrected value will be biased (i.e., E(x̂_12 − x) ≠ 0), and the intercomparison of the two instruments will also be biased (i.e., E(x̂_12 − x̂_1) ≠ 0).
• Even if retrievals x̂_1 and x̂_2 use the same prior mean x_a and prior covariance S_a, if the instrument noise processes are different (i.e., S_ε,1 ≠ S_ε,2), then there will be a relative bias. Similarly, if K_1 ≠ K_2, then E(x̂_12 − x̂_1) ≠ 0.

• The biases E(x̂_12 − x) and E(x̂_12 − x̂_1) are proportional to the vector difference (x_C − x_T). This explains the source of the common misconception that the averaging kernel correction can remove any relative bias: such claims are likely based on the implicit assumption that x_C = x_T, which is highly unlikely to hold, if ever, in actual practice.
• A simple consequence of these results is that in validation studies where retrievals of a geophysical process (e.g., carbon dioxide, methane, etc.) are compared with data from a different instrument, a bias can result simply from a choice of comparison ensemble with x_C ≠ x_T. Validation studies should therefore keep the choice of priors in mind as a potential source of bias. We note that if x_C ≠ x_T, no amount of tinkering with calibration, spectroscopy, radiative transfer models, geolocation, the non-linear optimizer, or other potential causes will remove this bias.
• In the intercomparison of multiple retrievals, it is common practice to choose the comparison prior from one of the priors that produced the retrievals. Since the bias is proportional to (x_C − x_T), it is in general recommended to choose the comparison prior with the most "accurate" prior mean.
• We showed that using a comparison prior with a less informative prior covariance matrix results in a smaller bias. Therefore, if it is not possible to assess which prior has the more accurate mean, all other things being equal, it is recommended to choose the prior with the larger prior covariance matrix (we note that this point applies only to the full averaging kernel correction procedure described in [20], not to the abbreviated procedure described by [22]). While this choice reduces the bias, the drawback is that it may increase the variability of the difference (i.e., var(x̂_12 − x̂_1) may be "larger").
• If a researcher wishes to remove the choice of prior entirely as a potential source of bias, he or she can choose a comparison ensemble where S_C → ∞ (in practice, multiplying one of the prior covariance matrices by a sufficiently large constant suffices; see Experiment 3 of Section 3). This forces the prior-induced bias toward 0, and the researcher can then be reasonably sure that any remaining bias is due to other causes such as calibration, geolocation, spectroscopy, etc.
• One useful exercise in validation intercomparisons is to assess the potential magnitude of the prior-induced bias, which can be done using Equation (10), whose only unknown is x_T. Here, the researcher can make reasonable guesses, such as x_C being misspecified by 1% or 2% (i.e., x_C = 0.99 x_T or x_C = 0.98 x_T), which would produce an estimate of the prior-induced bias for comparison with other sources of error.

Acknowledgments: […] comments, from which the manuscript has benefited. We would also like to thank Vineet Yadav from the Jet Propulsion Laboratory for his support and encouragement in writing this paper.

Conflicts of Interest:
The authors declare no conflict of interest.