On statistical approaches to generate Level 3 products from satellite remote sensing retrievals

Satellite remote sensing of trace gases such as carbon dioxide (CO$_2$) has increased our ability to observe and understand Earth's climate. However, these remote sensing data, specifically~Level 2 retrievals, tend to be irregular in space and time, and hence, spatio-temporal prediction is required to infer values at any location and time point. Such inferences are not only required to answer important questions about our climate, but they are also needed for validating the satellite instrument, since Level 2 retrievals are generally not co-located with ground-based remote sensing instruments. Here, we discuss statistical approaches to construct Level 3 products from Level 2 retrievals, placing particular emphasis on the strengths and potential pitfalls when using statistical prediction in this context. Following this discussion, we use a spatio-temporal statistical modelling framework known as fixed rank kriging (FRK) to obtain global predictions and prediction standard errors of column-averaged carbon dioxide based on Version 7r and Version 8r retrievals from the Orbiting Carbon Observatory-2 (OCO-2) satellite. The FRK predictions allow us to validate statistically the Level 2 retrievals globally even though the data are at locations and at time points that do not coincide with validation data. Importantly, the validation takes into account the prediction uncertainty, which is dependent both on the temporally-varying density of observations around the ground-based measurement sites and on the spatio-temporal high-frequency components of the trace gas field that are not explicitly modelled. Here, for validation of remotely-sensed CO$_2$ data, we use observations from the Total Carbon Column Observing Network. We demonstrate that the resulting FRK product based on Version 8r compares better with TCCON data than that based on Version 7r.


Introduction
Level 2 retrievals from satellite remote sensing instruments are typically retrieved irregularly in space and time. Hence, in order to validate these retrievals against either ground-based remote-sensing data (e.g., [1]) or atmospheric transport model output (e.g., [2,3]), some form of gap-filling, or spatial prediction, is required. A wide variety of approaches has been proposed that are either deterministic (e.g., geographic co-location methodologies; see [4,5]) or statistical (e.g., kriging; see [6,7]). Statistical techniques tend to be more computationally intensive, on the one hand, but on the other hand, they allow for uncertainty quantification. This is indispensable when validating satellite remote sensing products to ground-based measurements or to other transport-model outputs, as it puts into context the magnitudes of any observed discrepancies. They have also been shown to be more accurate in practice, in a mean-squared-error sense, than their deterministic counterparts [8].
A large variety of statistical techniques has been proposed to produce Level 3 maps from satellite remote sensing Level 2 retrievals. These include spatial kriging (e.g., [3,9]), spatial block kriging [10] variants of kriging that can be, and have been, used to derive these products. In this section, we discuss a number of these variants, and we explore the circumstances under which some approaches are more appropriate than others. For ease of exposition, throughout this section, we focus on the prediction aspect and assume that mean and covariance parameters are known.

Observation Space vs. Process Space
An issue that is not often discussed, from both a theoretical and practical perspective, is whether the Level 3 product is a prediction of the retrievals (i.e., in observation space) or of the underlying process (i.e., in process space). While the former is a prediction of what a retrieval would be if it were to be done at some point in space and time, the latter is a prediction of the actual process (e.g., the true column-averaged CO 2 ) at that point in space and time. Currently, little distinction appears to be made between the two variants, yet the two types of predictions are intrinsically different. Which is used affects the conclusions one should draw when validating satellite remote sensing instruments.
Consider a field of interest, say column-averaged CO 2 , that is indexed by both space and time. Assume the field is random, and denote it by Y. Then, Y(s; t) is a random variable representing the field evaluated at some location on the sphere s at time t. Consider now a remote sensing retrieval of this field at location s i and t i , which we denote as Z(s i ; t i ), for i = 1, . . . , m. A typical model relating these retrievals to the field is the linear additive model: for i = 1, . . . , m, where { i } is a set of independent measurement errors with variance σ 2 , typically assumed to be Gaussian with a mean that is usually set to zero, but does not need to be zero.
Consider the vector of retrievals Z ≡ (Z(s 1 ; t 1 ), . . . , Z(s m ; t m )) . Kriging, or optimal linear spatio-temporal prediction, in observation space refers to predicting Z(s * j ; t * j ), j = 1, . . . , N, from the retrievals Z, where (s * j ; t * j ) is a prediction space-time location that for simplicity we assume does not coincide with an observation location. Clearly, this can only be done if the measurement-error variance σ 2 at the prediction location is known, since the kriging equations explicitly involve var(Z(s * j ; t * j )) = var(Y(s * j ; t * j )) + σ 2 . Kriging in process space, also known as 'filtered kriging,' refers to predicting Y(s * j ; t * j ), j = 1, . . . , N, from the observations Z [23] (Section 4.1.2). In both cases, optimal should be taken in the sense that the predicted quantities minimise their respective mean squared prediction errors under the model assumptions and predictor constraints. Other predictors that are optimal in a different sense are sometimes more appropriate (e.g., [24]). For the purposes of Level 3 products, minimising the mean-squared prediction error results in kriging predictors and is generally considered suitable for mapping.
Recall that here, all parameters in the statistical model for Y are assumed known and fixed. Then, it is straightforward to show that, at unobserved locations, the conditional expectations, E(Y(s * j ; t * j ) | Z) and E(Z(s * j ; t * j ) | Z), are the optimal (unbiased) linear predictors in process space and observation space, respectively, and that they are identical. However, the prediction variances are different due to the extra variability present in observation space: For unobserved locations, var(Z(s * j ; t * j ) | Z) = var(Y(s * j ; t * j ) | Z) + σ 2 , j = 1, . . . , N.
Consequently, the questions that Level 3 products in observation space answer may be very different from those that Level 3 products in process space answer. Specifically, the former provide uncertainty measures on retrievals, while the latter provide uncertainty measures on the underlying process itself. This distinction is particularly important in validation studies of satellite remote sensing retrievals.
Example 1: The implications of carrying out comparisons in an inappropriate space can be seen from a simple simulation experiment that we shall use throughout this section. Let Y be a spatial process with mean zero and covariance function C Y (s 1 , s 2 ) ≡ σ 2 Y R Y (s 1 , s 2 ), where R Y (s 1 , s 2 ) ≡ e − s 2 −s 1 /τ is a correlation function (in this case, a stationary exponential correlation function), τ is the e-folding length scale at which the correlation drops to e −1 and σ 2 Y is the process variance. Because we are working from a simulation, we know the covariance function, as well as the mean. Consider the spatial domain D ≡ [0, 1] × [0, 1], and let τ = 0.15 and σ 2 Y = 1. Assume further that we have 1000 measurements of the process randomly placed in D and that the measurement-error variance σ 2 = 1 is constant and known. These measurements play the role of 'retrievals.' For prediction locations, we consider a 100 × 100 grid on D. Predictions and prediction errors on this grid play the role of the Level 3 product. We also consider 200 random diagnostic locations on D, which serve as validation in both the process space and the observation space. The true field is simulated from Y under Gaussian assumptions, and it plays the role of the scientific quantity of interest, such as the output from a forward transport model. Figure 1a-d shows the experimental setup, the simulated (true) field, the optimal prediction in process space (which at unobserved locations is the same as the optimal prediction in observation space) and the prediction standard error in process space, respectively. The prediction standard error is seen to be high in regions of sparse measurements, and the predictor satisfactorily reproduces the unobserved true signal, as would be expected from an optimal spatial predictor.
Level 3 products based on statistical techniques always come in pairs, with one map showing a prediction and a companion map showing the prediction standard error. The first map on its own can only be used to derive discrepancy diagnostics for the prediction, such as the mean prediction error (MPE), the mean absolute prediction error (MAPE) or the root-mean-squared prediction error (RMSPE), where here, "mean" refers to the "sample mean" or "average" given in the definitions below. The MPE is designed to give an an estimate of the predictor 'bias', but it does not reveal anything about the variance of the predictor. Although the MAPE increases with both the predictor bias and variance, it does so in a highly nonlinear fashion; it is commonly used because of its intuitive interpretation. On the other hand, the MSPE is an estimate of the sum of the squared bias and the predictor variance, and it is commonly viewed (or its square root, as we do here) as a "gold standard" when comparing predictors.
Denote the process values at N d diagnostic locations as Y d and the optimal predictor at those locations asŶ d . MPE, MAPE and RMSPE are then given by: respectively. The process values, Y d , to which the predictor is compared, are typically obtained from data that are ignored when constructing the predictor or data from another instrument. In this example, we know the true (simulated) process values and thus set Y d to the true values. For example, the RMSPE over the N d = 200 diagnostic locations was 0.4995. MPE, MAPE and RMSPE are valuable, but diagnostics that treat uncertainty quantification require the second map, which shows the prediction standard errors. One such diagnostic that is very important is the coverage, which can help the analyst assess whether the prediction standard errors are correct. If uncertainty is quantified correctly, x% of (independent) validation data should fall within the x% prediction interval. Usually a large prediction interval (e.g., x = 90 or 95) is chosen since this is likely to be more useful to the analyst than a smaller prediction interval. However, any prediction interval, or set of intervals, may be chosen for diagnostic purposes. Clearly, using the correct prediction standard error when constructing the interval is required to determine the correct coverage. In Section 3.3, we further develop this idea for the situation where measurements are biased. For nominal 90% coverage, Table 1 shows the empirical coverage when validating against (i) the true field (which in this case plays the role of the output from a perfect CO 2 transport model in process space) and (ii) left-out retrievals in observation space. The table shows the misinterpretations that are possible: Using prediction standard errors in process space to construct the intervals when validating against other retrievals would give the impression that our predictions are too 'optimistic' (from the table, only 51% of the left-out retrievals fell into the constructed 90% prediction interval). Using prediction standard errors in observation space to construct the intervals when validating against a process realisation (output from a transport model, say) would give us the impression that our inferences on the process are too 'pessimistic' (from the table, all of the process values fell into the constructed 90% prediction interval).
These results are illustrative of one particular diagnostic: indeed, there is a whole suite of probabilistic diagnostics that require consideration of prediction standard errors (see [25] for a detailed review). In all these cases, incorrect prediction intervals will lead to incorrect probabilistic diagnostics and hence incorrect conclusions. Therefore, it is crucial to know whether the Level 3 products generated from statistical techniques that are used in an analysis are defined for process space or for observation space.

Level 3 Maps Generated Using Statistical Techniques Will Appear Smooth
A common criticism of kriging predictions is that they appear to be too smooth when compared to what is expected from the process. However, an attempt to produce a predictor that looks 'realistic' will likely yield a suboptimal predictor (as we show below, it is the method of conditional simulation that should be used instead to produce 'realistic' outputs). For example, the covariance-matching-constrained kriging of [26] will have larger mean squared error than the usual kriging predictor. Note that the optimal prediction in Figure 1c is smoother than the true, unobserved, process in Figure 1b, and this can be established theoretically (see below). Kriging is not meant to yield the underlying process, but it is meant to give correct coverage with optimal prediction intervals, which makes the "too smooth" criticism a moot point. The purpose of this section is to show that the optimal predictor of Y(s * j ; t * j ) in the mean-squared-error sense (which is the conditional expectation E(Y(s * j ; t * j )|Z)), is a data smoother and that a smooth predictor is not necessarily a poor predictor. Consider a setup similar to that of Section 2.1, and define the second-order difference quotient of the process as, is a finite-difference approximation of the field's curvature in the direction of the first coordinate at location s * 2 . Now, under Gaussian and mean-zero assumptions, it is straightforward to show that D Y (s * 1 , s * 2 , s * 3 ) is normally distributed with mean zero and variance: where R Y (s, r) ≡ corr(Y(s), Y(r)). Since |D Y (s * 1 , s * 2 , s * 3 )| is half-normal, we have that: from standard properties of the half-normal distribution [27] (Chapter 13, Section 10.1). Now, consider the second-order difference quotient of the optimal predictor, E(Y(s) | Z), Then, D E(Y|Z) (s * 1 , s * 2 , s * 3 ) has exactly mean zero for linear spatial trends and approximately mean zero for smooth spatial trends and small h. Importantly, it can be shown to have variance equal to We can therefore expect, on average, that the optimal predictor is smoother (in the sense that it will have smaller absolute second-order derivatives) than the process. The degree of smoothness is a function of the SNR, an aspect that we explore further in Section 2.5. If rougher maps are sought based on the same spatial covariance function, the analyst has no option but to relinquish the concept of mean-squared optimality (see [24,26] for further discussion).
This result highlights one important property of the Level 3 products, which in turn allows for their correct interpretation as summaries of random quantities. Specifically, the quantities {Y(s * j ; t * j ) : j = 1, . . . , N} are dependent random variables and not deterministic scalars. Their conditional expectation with respect to the observed data Z (the predictor) illustrates just one aspect of these random variables. Since one cannot visually determine a Level 3 product's characteristics from the conditional expectation alone, it is often preferable to generate a few conditional simulations [17] (Section 3.6.2) of the process conditional on Z, to reveal whether the spatial structure is correctly captured or not in the product.
Example 2: In Figure 2a-c, we show two conditional simulations, together with the true process Y, from the simulation experiment of Example 1, where the true mean and covariance-function parameters are known. The conditional simulations reveal that the statistical Level 3 product encodes reasonable spatial structure, despite the optimal predictor being 'smooth.'

Fixed Rank Kriging
The rather complex structures evident in Figure 1b are 'smoothed out' in the optimal prediction of Figure 1c, for reasons elucidated in Section 2.2. This smoothing action implies that there is likely to be a simpler, lower-dimensional model that could be used to satisfactorily carry out pointwise inferences in both process space and measurement space. A suite of reduced-rank methods, an example of which is FRK [12,28,29], is designed to do just that, and these yield optimal predictors based on large datasets at a fraction of the computational cost of traditional kriging. FRK has often been used to produce Level 3 products from Level 2 retrievals (e.g., [12,13,29]).
FRK assumes that the process Y can be decomposed into a weighted sum of r pre-specified basis functions, where the weights are random. To simplify the specification of the random field Y, we assume that E(Y(·; ·)) = 0. Then: which is a linear random effects model, where the random vector of coefficients η ≡ (η 1 , . . . , η r ) have mean zero and covariance matrix K and where the fine-scale variation term ζ(s; t) plays the role of the 'nugget' (without a measurement-error component) in geostatistical models. From this basis-function representation, we immediately have: where φ(s; t) ≡ (φ 1 (s; t), . . . , φ r (s; t)) and I(·) is the indicator function.
The choice of basis functions {φ i } is important. At the very least, they should be able to accurately reconstruct the optimal predictor under more general assumptions about cov(Y(s; t), Y(r; u)), and ideally, they should also be able to provide a good representation of the prediction-standard-error map. Hence, it is common to set r as large as computationally feasible, which usually comes at the cost of imposed model constraints through a structured K (e.g., [30,31]).
All excess variation in the process that cannot be explained by the spatio-temporal components {φ i (·; ·)η i } is absorbed by the white-noise component ζ(·; ·) that ensures that the total pointwise prediction standard error is accurately quantified. However, this component does not contribute to improved prediction itself, since it is spatially uncorrelated. The matrix K and var(ζ(·; ·)) are parameters of the spatio-temporal covariances. In general, they cannot be assumed known; in what follows, we shall use the method of maximum likelihood to estimate them from the data (see [29]).
Example 3: To illustrate FRK, consider the spatial-only example of Section 2.1. The basis functions {φ i } were chosen to be multi-resolution bisquares of the form, where A > 0 is the aperture. The bisquare basis functions were regularly distributed in the domain; see Figure 3a. The matrix K was restricted to be block diagonal by resolution, with each block a covariance matrix derived from an exponential covariance function; see [29] for details. The FRK predictions and prediction standard errors using these basis functions and known measurement-error variance are given in Figure 3b,c, respectively. Notice that these have a practically identical appearance to those in Figure 1c,d, respectively, despite K and the variance of the process ζ being estimated from the data. The RMSPE at the diagnostic locations using FRK was 0.4996 (compared to 0.4995 using the optimal predictor), while the coverages were identical to those obtained using classic kriging in Table 1.
Despite the FRK prediction and coverage being virtually indistinguishable from that of the classic kriging prediction, conditional simulations now are visually different from each other; see Figure 4a,b. The reason for this is that while the basis functions are able to adequately reproduce the optimal predictor and prediction standard error surfaces, they are less suited to reconstruct the sharp gradient in the covariance function close to the origin. Specifically, the reduced-rank covariance function, is not able to reproduce the drop that appears in the exponential covariance function for s close to r, which plays a big role in the 'rough' behaviour of exponential fields shown in Figure 2 [17] (Section 2.3.1). Instead, a smooth covariance function is fit, and the excess variation is absorbed by ζ, which is reflected in the sudden jump at the origin in Figure 4c, which in turn is reflected in 'speckles' in the sample paths. Technical details on the quality of the approximation in terms of the Kullback-Leibler divergence are given in [32].
Example 3 clearly shows that the sample paths from conditional simulation based on FRK are generally too speckled because of the assumption of an uncorrelated ζ. Alternative models that assume a fine-scale correlated structure can remedy this [33]. The claim that FRK constrains the predictive surface to be smoother than the optimal predictor is not true in general. Indeed, if one is only concerned with predictions, FRK can even yield the optimal predictor exactly in our example when the true covariance function is known. In this case, one needs as many basis functions as data points. Specifically, for the simulation study of Example 3, one needs to set the vector φ(·) = c(·) K −1  Using basis functions other than these leads to a surface that is suboptimal for a process with covariance function C Y (·, ·); however, in Example 3, we demonstrated nearly perfect performance with r < m and the use of a set of basis functions that was not optimised for the task. Hence, over-smoothing with FRK is the consequence of either a poor choice (or an insufficient number) of basis functions, or a poor choice for K; but, it is not a consequence of the use of a reduced-rank method in itself. Spectral considerations, such as those considered in [34], can be used to ensure that an adequate set of basis functions is chosen. Finally, the last decade has seen a surge in reduced-rank methods based on dimensionality reduction that can handle orders of magnitude more basis functions than FRK by putting additional constraints on K or K −1 (e.g., [30,31,35]). They are briefly discussed in Section 4.
In summary, FRK can yield the optimal predictor, the correct coverage and a good approximation to the true, underlying, covariance function even when this is not specified a priori. This flexibility is important, since a specific choice of covariance function considerably affects predictions and prediction standard errors. In addition, it can be used with large datasets (hundreds of thousands to millions compared to a few thousand as with classic kriging), and it can also handle structured non-stationarity. Indeed, FRK using the entire dataset only becomes less desirable from a practical standpoint when the system has a high SNR. This is where local kriging methods come into their own; these are discussed next.

Fixed-Window and Moving-Window Local Space-Time Kriging
It is generally problematic to consider satellite remote sensing datasets in their entirety when carrying out spatio-temporal prediction, for two reasons. The first reason is computational: When carrying out optimal prediction, the computational cost of prediction always increases with dataset size. Further, there is little to be gained by using data far away from the space-time prediction location, due to the use of covariances that decay with space and time separation. The second reason is modelling: It requires effort to define and fit spatio-temporal models with non-stationary covariance functions at fine, medium, and large scales. Using dynamic models (e.g., [36,37]), one can alleviate both the computational and the modelling problems somewhat. However, even then, fixed-lag (i.e., local) smoothing is generally used for datasets that span large temporal scales.
One way to circumvent these difficulties is to generate Level 3 products by considering spatial snapshots of retrievals, that is, retrievals that fall into temporal bins of fixed width. When using spatial snapshots the bin width is clearly important, and one needs 'to balance the competing goals of including as many observations as possible, while avoiding time periods over which the [. . . ] field itself would change substantially' [10]. Unfortunately, this statement is difficult to operationalise, and it results in the use of different bin widths by different users. For example, both six-day bins [3,10] and monthly bins [38] have been used to produce spatial maps from the Greenhouse gases Observing SATellite (GOSAT) retrievals.
Unless a spatio-temporal field is highly correlated in time, spatial-only kriging of spatio-temporal data is likely to yield poorer predictions than spatio-temporal kriging: Retrievals that are spatially close but on different satellite orbits that are days apart, may be reflections of completely different process values that will be treated as 'similar' when doing spatial-only prediction. From a modelling point of view, when temporally binning data, implicitly one is considering a separable spatio-temporal covariance function of the form C   Y (t, t ) says that two data in different bins are independent, which is likely to be a very poor representation of reality. Clearly, this is a strong assumption, and any class of continuous temporal covariance functions (such as the exponential) is likely to be an improvement over this choice.
Example 4: To elucidate further the deleterious impact of assuming temporal independence across bins, assume that in the simulation experiment where s = (s 1 , s 2 ) , s 1 denotes one-dimensional space and s 2 denotes time. In this case, splitting the domain of t(= s 2 ) into fixed bins of width 0.1 units, and carrying out prediction in each of these bins (as a function of spatial location s 1 ∈ [0, 1]), is reasonable, since this process exhibits high temporal correlation within 0.1 units (corr(s, s + (0, 0.1) ) = 0.51). Carrying out one-dimensional optimal spatial prediction in each bin using only the data in that bin yields the predictions shown in Figure 5a, where we also show the process values for t at the centre of each bin for comparison.
Validation can be carried out by allocating each datum used for diagnostics to its relevant temporal bin and computing the respective prediction standard errors. We found that the empirical version, RMSPE, at the left-out diagnostic locations increases by approximately 15% from 0.500 (in Example 1) to 0.574. Further, Figure 5b shows that the implied space-time predictor exhibits 'stripes' due to the fixed binning procedure (which assumes temporal invariance within each bin). In practice, deterioration is likely to be greater when spatial-covariance-function parameters are also estimated from the data, since only a fraction of the full dataset is available in each temporal bin. (a) One-dimensional spatial-only kriging of a two-dimensional field obtained by binning the data into fixed temporal bins (of s 2 ) of width 0.1. The data that fall into the bins are shown as blue dots, the true (unobserved) field at the centre of the bin as a black solid line, and the spatial prediction as a red line. (b) The implied two-dimensional spatio-temporal prediction (pred) surface (i.e., assuming temporal invariance within each bin), when carrying out prediction using fixed bins. (c) Predictions (pred) obtained by using data in a moving window in s 2 and using optimal prediction for locations at the centre of the moving window.
If dataset size or suspected temporal heterogeneity is the reason behind binning the data into temporal bins, and the main aim of the analysis is to obtain a good approximation to the optimal predictor E(Y(·) | Z), it is usually much better to consider a moving temporal window, where the prediction at some space-time location (s 1 ; t), for t(= s 2 ), considers data in the temporal bin [s 2 − ∆/2, s 2 + ∆/2], for temporal bin width ∆. Then spatio-temporal prediction (and not just spatial prediction) is done based on the spatio-temporal data in this bin centred on the prediction location. This is a type of local kriging [11] discussed further in Section 2.5 that, despite some theoretical flaws, in high SNR scenarios is well suited to carry out local (but not global) inference in space and time. In Figure 5b,c, we compare the prediction using fixed temporal bins with that using moving temporal bins. The RMSPE and the coverage computed from the latter predictions (Figure 5c) are virtually identical to those obtained from spatio-temporal kriging on [0, 1] × [0, 1] (Figure 1c). Results from [7] also show the considerable improvement of moving-window spatio-temporal predictions compared to a sequence of spatial-only predictions. In Section 3 we use moving-window spatio-temporal kriging to obtain predictions and prediction standard errors of column-averaged CO 2 (XCO 2 ) from OCO-2.

Local Prediction and Signal-To-Noise Ratio
Local kriging reduces computational burden, since predictions are based on only a subset of the whole dataset. Local kriging has a place for answering local questions, since the prediction standard errors are appropriate for the local predictor. However, as was discussed in Section 1, local kriging yields predictions that are statistically incoherent over large scales. Continental-scale or oceanic-scale questions require a combination of local predictors, and the associated prediction standard error of aggregated local predictors requires a coherent global model. Consequently, predictors obtained from local kriging do not generally have correct coverage over large scales.
If Level 3 products are primarily used to obtain local information, global incoherence may not be of prime concern in some applications. However, even when doing local-only predictions, one must still be careful: The quality of the prediction when using only a subset of observations is highly determined by the SNR of the process. With low SNRs, that is when σ 2 is large with respect to σ 2 Y , the optimal predictor borrows more strength from observations that are far away from the prediction location, than with high SNRs. If the SNR is small, one may need to consider tens of thousands of observations to get a good approximation to the optimal predictor, which is impossible with the classic kriging methods. In this sense, local kriging, even when using moving windows, is not recommended in low SNR settings.
In contrast, a global-scale predictor like FRK is very useful where there is low SNR, since it can consider hundreds of thousands of data points with ease. Further, in Appendix A.3, we show that spatial-predictor smoothness (measured through second-order absolute difference quotients) increases with measurement-error variance (i.e., decreasing SNR). Thus, FRK is an ideal 'big-data' approach for spatio-temporal prediction in low SNR settings. Furthermore, local modelling/prediction and FRK are not mutually exclusive. Indeed, it is often the case that a large amount of data is present and spatio-temporal domain subsets are needed, even when using state-of-the-art approaches that deal with a large number of basis functions. We demonstrate FRK in the context of a temporal moving window in Section 3.
Example 5: Consider again the simulation experiment of Section 2.1, but this time let σ 2 = 10. The optimal prediction with this increased measurement-error variance is shown in Figure 6a. Now the prediction obtained using the moving-window local kriging method described in Section 2.4, shown in Figure 6b, exhibits striping since predictions are based on relatively few, very noisy observations. The RMSPE at the diagnostic locations when doing kriging with all the data is 0.685, but it increases to 0.707 when using a moving window of width 0.1 (despite use of the exact covariance function). The prediction from FRK using the parameters estimated in Section 2.3, illustrated in Figure 6c, is clearly much closer to the optimal predictor, despite its being a misspecified model. The RMSPE using FRK in this case was 0.694. We can expect local kriging's relative RMSPE (relative to the optimal) to deteriorate further as the SNR decreases. In our experience, when the SNR is less than about 0.2, local kriging begins to produce poor-quality pointwise predictions when compared to dimensionality-reduction methods (e.g., FRK) in models involving the exponential covariance function.  (Figure 1b) using the experimental setup in Section 2.1 and data with increased measurement-error variance, σ 2 = 10. (a) Optimal prediction (pred) using the simulated retrievals; (b) the prediction (pred) from simple kriging on the prediction grid from the simulated retrievals using the exact model and a moving window in s 2 of width 0.1 units; (c) the prediction (pred) from fixed rank kriging on the prediction grid using the simulated retrievals.

OCO-2 Level 3 Products from V7r and V8r Lite Files
In this section, we use FRK and a moving window in time to produce statistical Level 3 products from the OCO-2 Data Release 7r Lite File Version B7305Br and the OCO-2 Data Release 8r Lite File Version B8100r. The use of the bias-corrected Lite Files, as opposed to the raw retrievals, is justified in Section 3.3. The Level 3 products, which we name FRK Version 7r and FRK Version 8r products, are made up of daily FRK predictions and prediction standard errors from 1 October 2014-28 February 2017. The FRK products, and animations of these, are provided at https://niasra.uow.edu.au/cei/ oco2level3.

OCO-2 Data Preprocessing
The OCO-2 Release 7r Lite File [19] includes bias corrections for footprints, parameters, and scalings. All data in this File with some missing variables, with a warn level greater than or equal to 15 and a quality flag of one, or with a reported retrieval standard error of more than 3 ppm, were filtered out. The OCO-2 Release 8r Lite File [20] accounts for stratospheric aerosols, thus improving the data quality in the Southern Hemisphere, and it has larger global coverage due to improvements in pre-screeners. All data in this File with a quality flag of zero were kept for generating the product, irrespective of the warn level. For both versions, we did not make a distinction between the mode of operation (land nadir, land glint, ocean glint; see [39] Section 2 for a concise summary of modes of operation) in which the retrieval was made when generating the products, and all retrievals obtained when the instrument was in target mode were filtered out.
The FRK Level 3 products were produced in process space using a moving window of 16 days' duration, which matches the repeat cycle of the satellite. We considered data in the Lite File between 24 September 2014 and 8 March 2017, and hence the FRK products were produced for days between 1 October 2014 and 28 February 2017. The standard errors of the retrievals returned by the Level 2 retrieval algorithm are known to be underestimated [40]. To cater for this, when the Lite File gave a standard error below 2 ppm, we raised it to 2 ppm, following a similar strategy in [41]. This may result in our FRK Level 3 products being under-confident; however, this is preferred to ones that are over-confident. We do not expect this adjustment to affect the relative quality of the Version 7r and the Version 8r FRK products: Indeed, when repeating the study discussed below with the standard error threshold set to 0.5 ppm instead of 2 ppm, we obtained similar relative performance of the Version 7r and Version 8r FRK products but worse predictions (higher MAE and RMSPE) when assessing these products against TCCON.
All retrievals were subsequently aggregated into a 1 × 1 × 1 lon-lat-day grid. An aggregated retrieval and standard error in any of these space-time cubes was found by averaging the retrievals and the retrieval standard errors of all OCO-2 retrievals falling into the space-time cube. Specifically, denote the vector of m i retrievals falling into the i-th space-time cube asZ i and the corresponding vector of retrieval standard errors asσ i . An aggregated retrieval Z i and retrieval standard error σ ,i for the i-th space-time cube were then found by averaging the vector elements, for i = 1, . . . , m, where m is the number of 1 × 1 × 1 lon-lat-day cubes containing one or more retrievals. The formula for σ ,i assumes (approximately) perfect correlation between the retrieval errors within the i-th space-time cube. These aggregated retrievals and retrieval standard errors were then used for generating the FRK Level 3 products. Notice that here, unlike in Section 2, we allow for heteroscedasticity of the retrieval standard error.

Implementation Details for FRK
A 16-day moving-window variant of FRK was used to construct the Level 3 products. Each window was indexed by the eighth day in the window, at which a spatial global prediction was made, and the spatio-temporal domain of interest was spanned with 3168 spatio-temporal basis functions. These basis functions were constructed by finding the tensor product of 396 spatial bisquares arranged over three resolutions on the sphere, and eight temporal bisquares at a single temporal resolution regularly spaced in the 16-day window; see Figure 7. The eight temporal basis functions were chosen to allow the signal to considerably vary temporally within a 16-day window, while the number of resolutions for the spatial basis functions was capped so that the total number of basis functions was less than 4000, a point beyond which FRK begins to slow computationally. Spectral-based methods that may assist in choosing basis functions are outlined in [34]. Note that a larger window is not considered for computational reasons; reduced rank methods that are able to deal with a larger number of basis functions are discussed in Section 4. In order to produce a prediction for the target day (i.e., the eighth day in the window), we require there to be data on the target day or, within its 16-day window, data on a day before and a day after the target day. Missing days in the Level 3 products are hence only present when there are gaps of eight days or more in the respective Lite File. Since kriging methods are not well suited for extrapolation, predictions were only made for latitudes ranging between the lowest and highest data point on the target date. This resulted in Level 3 products with a latitude extent that changed slightly from day to day. Due to the increased data density in the Version 8r Lite File, the FRK Version 8r product has a considerably larger latitudinal span than the FRK Version 7r product.
To implement FRK, one needs to estimate parameters appearing in the matrix K, here notated as ϑ, and to estimate the variance of the fine-scale process ζ(·; ·). In our case, we let K(ϑ) = bdiag({K q (ϑ) : q = 1, . . . , n res }), where bdiag(·) returns a block-diagonal matrix constructed from its arguments. The matrices are: ijq are the spatial and temporal distances between the centroids of the i-th and j-th basis functions at the q-th resolution, respectively; r q is the number of basis functions at the q-th resolution, q = 1, . . . , n res ; n res = 3 is the number of resolutions; ϑ 1q is the marginal variance at the q-th resolution; ϑ 2q is the spatial e-folding length at the q-th spatial resolution; and ϑ 3q is the temporal e-folding length at the q-th spatial resolution. All parameters in each moving window were estimated using an expectation maximisation (EM) algorithm; see [29] for details.
Predictions and prediction standard errors of the FRK Version 7r product for 13 May 2016, as well as the retrievals obtained on that day and all retrievals in the 16-day window centred on that day (from the Version 7r Lite File), are shown in Figure 8a-d. The difference between the Version 8r and Version 7r predictions and the ratio of the prediction standard errors are shown in Figure 8e,f, respectively. From Figure 8e we see that there are substantial differences in the predictions, with regional variations on the order of 2 ppm. As expected, the prediction standard errors for the FRK Version 8r product are consistently lower than those from Version 7r. This is mostly due to there being more Version 8r retrievals with which to generate the Level 3 product. Note that the FRK prediction in Figure 8c appears smooth. This is partly due to our imposing a minimum retrieval standard error of 2 ppm (recall Appendix A.3) but also because the optimal prediction is unlikely to resemble what one would expect from a typical transport model (recall Section 2.2). The quality of Level 3 products can only be properly assessed through validation, using diagnostics. At least one of these diagnostics should take uncertainty into account. In Section 2.1 we introduced one such diagnostic, coverage. In the next section, we show why it is important to use data that has been corrected for bias when assessing coverage of the resulting product from those data.

A Coverage Diagnostic in the Presence of Measurement Bias
Recall that the true field (here, column-averaged CO 2 ) is denoted as Y(s; t), at spatial location s and time point t. A retrieval at (s i ; t i ) results in the observation Z(s i ; t i ) that is equal to the true value Y(s i ; t i ) plus measurement error. In the case of raw OCO-2 retrievals, it is likely that the measurement error, i , has non-zero mean, resulting in a biased retrieval. The probabilistic considerations that follow can account for this by assuming that the bias at location s i and time t i is µ (s i ; t i ).
Under an assumption of spatio-temporally varying measurement bias, the retrieval error, . Hence, with probability 0.95, the error Z(s i ; t i ) − Y(s i ; t i ) belongs to the interval, (µ (s i ; t i ) − 1.96σ ,i , µ (s i ; t i ) + 1.96σ ,i ). Equivalently, the true XCO 2 value, Y(s; t), lies in the random interval, with probability 0.95. We refer to Equation (1) as a 95% prediction interval. This simple result is very important: It says that the measurement should first be corrected for bias; then the 95% prediction interval can be obtained by adding and subtracting 1.96σ ,i . In the context of OCO-2, this means that one should generate products based on the bias-corrected Lite Files, and not the raw retrievals, in order to ensure reliable coverages. Section 2 discusses how spatio-temporal kriging can be carried out to obtain a predictionŶ(s * ; t * ) of Y(s * ; t * ), even though there may be no data at (s * ; t * ). There, we assumed that µ (s * ; t * ) = 0; however, more generally,Ŷ(s * ; t * ) is an unbiased predictor of Y(s * ; t * ) + µ (s * ; t * ). Further, the kriging variance is: . Hence, the prediction error, has mean µ (s * ; t * ) and variance σ 2 k (s * ; t * ). Provided this error is (approximately) normally distributed, we can see that what led to Equation (1) also leads to the following (approximate) 95% kriging prediction interval at any location (s * ; t * ), not just at retrieval locations. With probability 0.95, the true XCO 2 value at any given location (s * ; t * ) lies in the random interval, (Ŷ(s * ; t * ) − µ (s * ; t * ) − 1.96σ k (s * ; t * ),Ŷ(s * ; t * ) − µ (s * ; t * ) + 1.96σ k (s * ; t * )).
These same ideas can be used to compare two different XCO 2 values at (s * ; t * ) that are each attempting to predict Y(s * ; t * ). We denote the two predictors asŶ 1 (s * ; t * ) andŶ 2 (s * ; t * ) with prediction errors,Ŷ 1 (s * ; t * ) − Y(s * ; t * ) andŶ 2 (s * ; t * ) − Y(s * ; t * ), respectively, which have (approximately) normal distributions, N(µ ,1 (s * ; t * ), σ 2 k,1 ) and N(µ ,2 (s * ; t * ), σ 2 k,2 ), respectively. Then, a comparison of the two predictors is given by, From Equation (2), where ρ is the correlation between the two prediction errors. IfŶ 1 andŶ 2 are from two different prediction methods based on the same measurements, then ρ will be non-zero in general. However, in many important cases ρ = 0, which greatly simplifies the calculation of σ 2 ∆ = (σ 2 k,1 + σ 2 k,2 ). One such case is whenŶ 1 (s * ; t * ) is a kriging predictor from OCO-2 data and Y 2 (s * ; t * ) is a TCCON observation at location (s * ; t * ). In this case, for known µ ,1 , µ ,2 , the two prediction errors are statistically independent because the random component of the OCO-2 measurement error is independent of the TCCON measurement. Generally, the random interval, contains zero with probability 0.95. If one consistently observes the interval not straddling zero, it is an important diagnostic indicating that something is not right with µ ∆ (s * ; t * ) or σ ∆ . If initially µ ∆ (s * ; t * ) is set equal to zero, and the problem is subsequently diagnosed as an undetected bias term, it can be added to the zero bias and the diagnosis based on Equation (3) repeated. If the fraction of intervals containing zero (i.e., the coverage) is larger than 0.95, then the prediction interval is said to be conservative, which indicates that σ ∆ is too large. If the coverage is smaller than 0.95, the prediction interval is said to be liberal, which indicates that σ ∆ is too small (when it comes to prediction intervals, being conservative is preferable to being liberal). Different kriging predictors (i.e., different Level 3 products) can be compared via their coverage when compared to TCCON observations. A predictor with small RMSPE might actually have poor coverage, resulting in misleading inferences that declare a "signal" to be present when in fact it is not. Of course, some assumptions need to be made about the measurement biases, µ ,1 (s * ; t * ) and µ ,2 (s * ; t * ), in practice. When comparing to TCCON, it is reasonable to assume that µ ,2 (s * ; t * ) = 0 at all TCCON locations (i.e., that TCCON measurements are unbiased). Then, µ ,1 (s * ; t * ) can be estimated by fitting a classical multivariate linear model to the differences between some simple predictions based on the raw OCO-2 retrievals, and other predictions that may be partially based on TCCON data if desired. This is what is done when constructing the Lite Files, where the regressors for µ ,1 (s * ; t * ) are based on physical attributes such as surface pressure and aerosol abundance. Once this estimate is made, it is then treated as fixed. Strictly, if TCCON data are used to estimate µ ,1 (as is done for constructing the Lite Files) then this estimate also depends onŶ 2 . However, allowing for this induced dependence is beyond the the scope of this paper. In what follows, we assume that the products generated from the Lite Files are unbiased in space and time and hence that coverages properly derived from this product are valid.

Comparison to TCCON Data
TCCON is a ground-based network designed to provide observations of XCO 2 that can be directly compared to OCO-2 retrievals. In this study, we use TCCON data from the GGG2014 database [22], specifically data collected from 24 stations (listed in Table A1 in Appendix B), six in the Southern Hemisphere and 18 in the Northern Hemisphere . We only consider the TCCON measurements at a station that fall within a 60-min time-window centred on the average local crossing time of the OCO-2 satellite over that station (usually in the early afternoon local time). The remaining TCCON measurements were then aggregated by site and by day in the same way OCO-2 was aggregated on the 1 × 1 × 1 lon-lat-day grid; see Section 3.1.
Unlike [39,66], we do not rely on coincidence criteria to compare the FRK Level 3 products to TCCON measurements, since the Level 3 products are global and at a daily resolution. The availability of global daily Level 3 products increases the number of comparisons we can make, since FRK inferences on XCO 2 can be made for every 1 × 1 × 1 lon-lat-day cube when TCCON data are available (except for days when OCO-2 is not making retrievals for eight consecutive days or longer, or when the TCCON station falls outside the latitude range of the retrievals in the 16-day window). In total, we have 3607 FRK Version 7r predictions and 4156 FRK Version 8r predictions that we can compare to TCCON values, globally, between October 2014 and February 2017. However, to make the comparisons fair, we only consider Version 7r and Version 8r predictions that are associated with the same locations in space and time. This reduces the total number of comparisons to 3510. The FRK Version 7r predictions and the TCCON data at two stations (Lamont and Wollongong), together with their differences, are shown in Figure 9a-f. Note how the 'spread' of the differences at the two stations is similar, but that the differences at the Wollongong station have a distinct seasonal cycle. This cycle is also apparent when using the coincidence criteria in (Figure A1 (s) [39]) and when using the Version 8r data (not shown). Scatter plots showing the mean differences by station and month for FRK Versions 7r and 8r are given in Figure 10a,b, respectively. The differences are randomly spread around the unit line, which is what one would expect from Level 3 products generated from bias-corrected Level 2 retrievals. There are two problematic stations in FRK Version 7r: those at Sodankylä and Pasadena. The former is very far north and poses challenges to the OCO-2 retrieval algorithm due to high solar zenith angles and snowy scenes, while the TCCON station at Pasadena is situated in a megacity, namely greater Los Angeles [39]. The city acts as a fine-scale strong source of carbon dioxide emissions that cannot be adequately captured in the Level 3 products that recall are on a 1 × 1 lon-lat grid. However, we remark that the Level 3 predictions corroborate well with the retrievals from the Edwards station, which is only about 100 km away from Pasadena. We observe that predictions at Sodankylä from the FRK Version 8r product are much improved, and a comparison of Figure 10a,b reveals a slightly smaller spread around the unit line for the FRK Version 8r product. Detailed diagnostics by station for the FRK Version 7r product are given in Table A1 in Appendix B. These include the MPE, the MAPE, the RMSPE, the coefficient of determination (R 2 ), the slope of the regression line constrained to pass through zero (Slope) and the empirical 95% coverage (95% Cov.).
The Version 7r RMSPEs given in Table A1 are close to, but overall an improvement over, those provided by [39] (who also considered the Version 7r Lite File with warn levels ≤ 15), where only OCO-2 retrievals coincident with TCCON measurements were considered. Unfortunately, detailed comparisons to [39] cannot be made due to the intrinsically different methodologies and the different number of observations on which the diagnostics are based. However, the agreement is reassuring, since our comparisons are made for days when OCO-2 retrievals are not necessarily coincident with TCCON measurements. The empirical coverage is also reasonable in most cases.
Similar diagnostics are available for the FRK Version 8r product in Table A2 in Appendix B. A comparison of the two tables reveals an overall improvement of the FRK Version 8r product over the Version 7r product at the TCCON sites, with MPEs at Wollongong and Lauder in the Southern Hemisphere substantially reduced. This corroborates the anticipated improvements of Version 8r over Version 7r that include, amongst other things, accounting for stratospheric aerosols in the Level 2 algorithm.
Summary diagnostics across all TCCON stations for both FRK products are given in Table 2, with Pasadena included or excluded in the summaries. For the FRK Version 7r product, the overall agreement between the FRK prediction and TCCON is quite good, with a global MPE (bias) of 0.08 ppm, an RMSPE of 1.36 ppm, and coefficient of determination R 2 = 0.80. The slope of the regression line (treating TCCON as a perfect covariate) forced to pass through zero is 1.00019. If Pasadena is omitted from the statistics obtained based on all stations, then the global MPE (bias) increases slightly to 0.35 ppm, but the RMSPE decreases to 1.15 ppm. This compares well with [1] who also omitted Pasadena from the study, but considered OCO-2 retrievals over a smaller time horizon (comparisons there found a similar global bias as we found, and an RMSPE of 1.5 ppm). For the FRK Version 8r product, all diagnostics fared better than those for the FRK Version 7r product. Uncertainty was also better captured: an 86% empirical coverage was achieved when Pasadena was included, and 92% was achieved when excluded.

Conclusions
In this article, we have presented statistical approaches to generating Level 3 products that contain maps of predictions and prediction standard errors, from satellite remote sensing retrievals. We showed that these products are likely to appear 'smooth', but that this is necessary to achieve optimality. We also showed that various local approaches to generating the Level 3 products each have their own strengths and weaknesses. In particular, we showed that moving-window methods can be expected to produce better predictions than 'blocking' methods, despite their theoretical limitations, but that their performance can suffer in the presence of low SNR. When both data size and SNR are an issue, we showed that reduced-rank methods such as FRK are a viable and attractive way forward.
We used FRK to compare OCO-2 retrievals in the Version 7r and Version 8r Lite Files to TCCON data. The advantage of FRK over 'coincident methods' [39,66] is that it increases the number of comparisons one can make, and it obviates the need to 'extend' the coincident region to ensure that there are sufficient retrievals to make a comparison possible. We found that the Level 3 FRK maps from OCO-2 retrievals have favourable diagnostics at validation (in this case, TCCON) locations, both in terms of prediction accuracy and in terms of uncertainty quantification. We also found that FRK Version 8r fared much better than FRK Version 7r on all diagnostics we considered. This improvement was expected since the bias correction in the Lite Files for Version 8r is based on more data than that for Version 7r. Furthermore, one should refrain from concluding that the FRK Version 8r product is superior to the Version 7r product globally, since estimates of the bias corrections in the Lite Files do use TCCON data; out-of-sample validation data would be required to make this claim.
The FRK product we generated used a 16-day moving window to predict XCO 2 , the column-averaged CO 2 , on the eighth day. Hence, with this product, one is not able to obtain prediction standard errors over temporal aggregates (say monthly averages of XCO 2 ). To remedy this, one would need to retain spatio-temporal predictions, prediction variances and covariances, over a time span that is at least as large as that of the desired temporal aggregation level (which might necessitate the use of a larger window). This would require considerable computational effort and possibly a different modelling framework.
In this article, we used the statistical technique of FRK to generate the Level 3 products, in part because it is fast and has been seen to work well with satellite remote sensing data elsewhere (e.g., [8]), and it extends elegantly to predictions of aggregations of the process (i.e., change-of-support). Recent years have seen the development of other methods built on the concept of dimension-reduction that may find use in the generation of statistical Level 3 products. These include fitting a stochastic partial differential equation model through finite elements [30] and multi-resolution approximations to Gaussian processes [35]. There are variants of FRK that impose a sparse structure on K −1 [31] to speed up computations and model finer scales; these have also been shown to work well in practice. Irrespective of the statistical method used for generating a Level 3 product, diagnostics such as those presented here should be used for validation. In Section 3.3, we show how the important notion of coverage can be used to assess uncertainty quantification, specifically the prediction standard errors, of a Level 3 product.
R Software [67] code and instructions for reproducing the results in this paper can be found at https://github.com/andrewzm/oco2-frk.

Appendix A.2. Recovering the Optimal Predictor with FRK
In this section, we show that FRK can be set up in such a way to yield the optimal predictor exactly. Suppose we have a process, {Y(s) : s ∈ D}, with known covariance function C Y (s, u), for s, u ∈ D. Let s 1  where Y(·) and { i } are independent, { i } are mutually independent and where i ∼ Gau(0, σ 2 ) has known, constant variance σ 2 . The aim here is to define a process {Ỹ(s) : s ∈ D}, decomposed as Y(s) = ∑ r i=1 φ i (s)η i , that has the same optimal predictor as that of the original process. That is, we require that: E(Y(s * ) | Z) = E(Ỹ(s * ) | Z).
The decomposition that achieves this is similar to that used in the predictive process [68]. In particular, let φ(s * ) = c(s * ) (C + σ 2 I) −1 , and let η | Z ∼ Gau(Z, K), with K = (C + σ 2 I). Then, Y(s * ) = c(s * ) (C + σ 2 I) −1 η. (A2) The conditional expectation of Equation (A2) with respect to Z is: which is the optimal predictor; see [69] (Chapter 2). Interestingly, this setup of FRK will yield the following prediction variance: var(Ỹ(s * ) | Z) = c(s * ) (C + σ 2 I) −1 c(s * ), which can be shown to be equal to the variance of the optimal predictor E(Y(s * ) | Z), rather than the conditional variance (conditional on Z) of Y(s * ), and in this case, it is important that the variances are interpreted as such.
Appendix B Table A1. Summary statistics of the differences between the fixed rank kriging Version 7r product predictions and the Total Carbon Column Observing Network measurements between 1 October 2014 and 28 February 2017. Summary statistics include the mean prediction error (MPE), the mean absolute prediction error (MAPE), the root-mean-squared prediction error (RMSPE), the coefficient of determination (R 2 ), the slope (Slope) of the regression line constrained to pass through (0,0) and the empirical 95% coverage (95% Cov.). The number of observations considered in each row is denoted as N.