## 1. Introduction

At Environment and Climate Change Canada (ECCC) we have been producing hourly surface pollutant analyses covering North America [1,2,3] using an optimum interpolation scheme which combines output from the operational air quality forecast model GEM-MACH [4] with real-time hourly observations of O$_3$, PM$_{2.5}$, PM$_{10}$, NO$_2$, and SO$_2$ from the AirNow gateway, supplemented by additional observations from Canada. These analyses are not used to initialize the air quality model, and we wish to evaluate them by cross-validation, that is, by leaving out a subset of observations from the analysis and using them for verification. Observations used to produce the analysis are called active observations, while those used for verification are called passive observations.

In the first part of this study, Ménard and Deshaies-Jacques [5], we examined different verification metrics using either active or passive observations. As we changed the ratio of observation error to background error variances $\gamma = \sigma_o^2/\sigma_b^2$, while keeping the sum $\sigma_o^2 + \sigma_b^2$ equal to $\mathrm{var}(O-B)$, we found a minimum of $\mathrm{var}(O-A)$ in passive observation space. In this second part, we formalize this result, develop the principles of estimating the analysis error covariance by cross-validation, and apply them to estimate and optimize the analysis error covariance of ECCC’s surface analyses of O$_3$ and PM$_{2.5}$.

When we refer to analysis error, or analysis error covariance, it is important to distinguish the perceived analysis error from the true analysis error [6]. The perceived analysis error is the analysis error that results from the analysis algorithm itself, whereas the true analysis error is the difference between the analysis and the true state. Analysis schemes are usually derived from an optimization of some sort. In a variational analysis scheme, for example, the analysis is obtained by minimizing a cost function with some given or prescribed observation and background error covariances, $\tilde{R}$ and $\tilde{B}$ respectively. In a linear unbiased analysis scheme, the gain matrix $\tilde{K}$ is obtained by minimum variance estimation, yielding an expression of the form $\tilde{K}=\tilde{B}H^{T}{(H\tilde{B}H^{T}+\tilde{R})}^{-1}$, where $H$ is the observation operator. The perceived analysis error covariance is then derived as $\tilde{A}=(I-\tilde{K}H)\tilde{B}$. In deriving an expression for the perceived analysis error covariance we in fact assume that the given error covariances, $\tilde{R}$ and $\tilde{B}$, are error covariances with respect to the true state, i.e., the true error covariances. We also assume that the observation operator is not an approximation with some error, but the true, error-free observation operator. Of course, in real applications neither $\tilde{R}$ nor $\tilde{B}$ is a covariance with respect to the true state, but only a more or less accurate estimate of it. Daley [6] argued that, in principle, for an arbitrary gain matrix $\tilde{K}$ the true analysis error covariance $A$ can be computed as $A=(I-\tilde{K}H)B{(I-\tilde{K}H)}^{T}+\tilde{K}R{\tilde{K}}^{T}$, provided that we know the true observation and background error covariances, $R$ and $B$. This expression is quadratic in the gain matrix, and has the property that the true analysis error variance $\mathrm{tr}(A)$ is minimum when $\tilde{K}=BH^{T}{(HBH^{T}+R)}^{-1}=K$. In that sense, the analysis is truly optimal. The optimal gain matrix $K$ is called the Kalman gain. This illustrates that although an analysis is obtained through some minimization principle, the resulting analysis error is not necessarily the true analysis error.
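This distinction can be illustrated numerically. The following sketch (a hypothetical small setup with a selection-type observation operator and randomly generated "true" covariances $B$ and $R$, not our operational configuration) evaluates the true analysis error covariance $A=(I-\tilde{K}H)B(I-\tilde{K}H)^{T}+\tilde{K}R\tilde{K}^{T}$ for an arbitrary gain, and shows that its trace is smallest, and equals the perceived covariance $(I-KH)B$, only at the Kalman gain:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: n grid points, p observed points; H selects
# every other grid point. B and R play the role of the TRUE (in practice
# unknown) background and observation error covariances.
n, p = 8, 4
H = np.zeros((p, n))
H[np.arange(p), 2 * np.arange(p)] = 1.0
L = rng.standard_normal((n, n))
B = L @ L.T / n + 0.5 * np.eye(n)   # true background error covariance (SPD)
R = 0.3 * np.eye(p)                 # true observation error covariance

def true_analysis_cov(K):
    """True A = (I - K H) B (I - K H)^T + K R K^T for an arbitrary gain K."""
    IKH = np.eye(n) - K @ H
    return IKH @ B @ IKH.T + K @ R @ K.T

# Optimal (Kalman) gain built from the true covariances.
K_opt = B @ H.T @ np.linalg.inv(H @ B @ H.T + R)

# Any other gain increases the true analysis error variance tr(A), and
# only at K_opt does the perceived covariance (I - K H) B coincide with A.
K_pert = K_opt + 0.1 * rng.standard_normal(K_opt.shape)
```

Running the sketch confirms that the perturbed gain yields a strictly larger trace, a direct numerical counterpart of the statement above.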

One of the main sources of information for obtaining the true $R$ and $B$ is the $\mathrm{var}(O-B)$ statistic. However, it has always been argued that this is not possible without making some assumptions [7,8,9], the most useful one being that background errors are spatially correlated while observation errors are spatially uncorrelated, or at least correlated on a much shorter length-scale. Even under those assumptions, different estimation methods, such as the Hollingsworth-Lönnberg method [10] and maximum likelihood [11], give different error variances and different correlation lengths. Other methods use $\mathrm{var}(O-B)$ for rescaling but assume that the observation error is known. The assumption that the observation error is known is itself debatable, since observation errors contain representativeness errors [12] that include observation operator errors. How to obtain an optimal analysis is thus unclear.

Evaluating the true or perceived analysis error covariance using the analysis’s own active observations is also misleading unless the analysis is already optimal. Hollingsworth and Lönnberg [13] first addressed this issue, noting that in the case of an optimal gain (i.e., an optimal analysis) the statistics of observation-minus-analysis residuals $O-\widehat{A}$ are related to the analysis error by $\mathrm{E}[(O-\widehat{A}){(O-\widehat{A})}^{T}]=R-H\widehat{A}{H}^{T}$, where $\widehat{A}$ is the optimal analysis error covariance and $H$ and $R$ are the observation operator and observation error covariance, respectively. The caret (^) over $A$ indicates that the analysis uses an optimal gain. In the context of spatially uncorrelated observation errors, the off-diagonal elements of $\mathrm{E}[(O-\widehat{A}){(O-\widehat{A})}^{T}]$ would then give the analysis error covariance in observation space. Hollingsworth and Lönnberg [13] argued that for most practical purposes the negative intercept of $\mathrm{E}[(O-\widehat{A}){(O-\widehat{A})}^{T}]$ at zero distance and the prescribed observation weight should be nearly equal, and thus could be used to assess the optimality of an analysis. However, when such agreement does not exist, an estimate of the actual analysis error is not possible. Another method, proposed by Desroziers et al. [14], argues that the diagnostic $\mathrm{E}[(O-\widehat{A}){(\widehat{A}-B)}^{T}]$ should equal the analysis error covariance in observation space but, again, only if the gain is optimal and the innovation covariance consistency is respected [15].
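Both residual-based diagnostics can be checked algebraically on a small linear example. The sketch below (hypothetical dimensions and covariances) computes the expected Hollingsworth-Lönnberg and Desroziers statistics in closed form for an optimal gain, using the fact that for a linear analysis $O-HA=(I-HK)(O-HB)$ and $HA-HB=HK(O-HB)$:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical setup: n grid points, p observed points, H a selection
# operator, B and R the assumed-true background and observation error
# covariances (randomly generated for illustration).
n, p = 8, 4
H = np.zeros((p, n))
H[np.arange(p), 2 * np.arange(p)] = 1.0
L = rng.standard_normal((n, n))
B = L @ L.T / n + 0.5 * np.eye(n)
R = 0.3 * np.eye(p)

D = H @ B @ H.T + R                # innovation covariance E[(O-B)(O-B)^T]
K = B @ H.T @ np.linalg.inv(D)     # optimal (Kalman) gain
A_hat = (np.eye(n) - K @ H) @ B    # optimal analysis error covariance

# Expected residual statistics for a linear analysis with gain K:
HL = (np.eye(p) - H @ K) @ D @ (np.eye(p) - H @ K).T   # E[(O-A)(O-A)^T]
DZ = (np.eye(p) - H @ K) @ D @ (H @ K).T               # E[(O-A)(A-B)^T]
```

With the optimal gain, `HL` reduces to $R-H\widehat{A}H^{T}$ and `DZ` to $H\widehat{A}H^{T}$, exactly as stated above; with a suboptimal gain neither identity holds.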

The impasse in estimating the true analysis error seems to be tied to the use of active observations, i.e., the same observations as those used to create the analysis. A robust approach that does not require an optimal analysis is to use observations whose errors are uncorrelated with the analysis error. For example, if we assume that observation errors are temporally (serially) uncorrelated, the analysis error can be estimated with the help of a forecast model initialized by the analysis, by verifying the forecast against these observations. This is the essential assumption used traditionally in meteorological data assimilation to assess the analysis error indirectly, by comparing the resulting forecast with observations valid at the forecast time. As forecast error grows with time, the observation-minus-forecast statistics can be used to assess whether one analysis is better than another. Using a somewhat different method but the same assumption, Daley [6] used the temporal (serial) correlation of the innovations to diagnose the optimality of the gain matrix, a property first established in the context of Kalman filter estimation theory by Kailath [16]. However, both the traditional meteorological forecast approach and the Daley method [6] are subject to limitations: they assume that the model forecast has no bias and that the analysis corrections are made correctly on all the variables needed to initialize the model. In practice, improper initialization of unobserved meteorological variables gives rise to spin-up problems or imbalances. Furthermore, with the traditional meteorological approach, compensation due to model error can occur, so that an optimal analysis does not necessarily yield an optimal forecast [7].

An alternative approach, introduced by Marseille et al. [17], which we follow here, is to use independent, or passive, observations to assess the analysis error. The essential assumption of this method is that the observations have spatially uncorrelated errors, so that the observations used for verification, i.e., the passive observations, have errors uncorrelated with the analysis error. The advantage of this approach is that it does not involve any model to propagate the analysis information to a later time. Marseille et al. [17] then showed that by multiplying the Kalman gain by an appropriate scalar, one can reduce the analysis error. In this paper, we go further by using principles of error covariance estimation to obtain a near-optimal Kalman gain. In addition, we impose the innovation covariance consistency [15] and show that all diagnostics of analysis error variance nearly agree with one another. These include the Hollingsworth and Lönnberg [13] and Desroziers et al. [14] diagnostics, as well as new diagnostics that we will introduce.
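The cross-validation principle can be sketched numerically. In the following hypothetical setup (a synthetic 1-D station network with a Gaussian correlation model assumed correct, and only the ratio $\gamma=\sigma_o^2/\sigma_b^2$ misspecified under the constraint $\sigma_o^2+\sigma_b^2=\mathrm{var}(O-B)$), the expected passive-space $\mathrm{var}(O-A)$ is computed in closed form and is smallest at the true ratio:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical 1-D network of ns stations; Gaussian background error
# correlation with length-scale Lc. Only var(O - B) is assumed known.
ns, Lc = 40, 15.0
x = np.sort(rng.uniform(0.0, 100.0, ns))
C = np.exp(-0.5 * ((x[:, None] - x[None, :]) / Lc) ** 2)
sig_b2, sig_o2 = 2.0, 1.0          # true variances (true gamma = 0.5)
var_omb = sig_o2 + sig_b2          # innovation variance, assumed known

# Split the stations into active (analysis) and passive (verification) sets.
idx = rng.permutation(ns)
act, pas = idx[:30], idx[30:]

def passive_var_oma(gamma):
    """Expected var(O - A) at passive stations when the analysis assumes
    ratio gamma, with s_o2 + s_b2 constrained to equal var(O - B)."""
    s_o2 = var_omb * gamma / (1.0 + gamma)
    s_b2 = var_omb / (1.0 + gamma)
    Caa = C[np.ix_(act, act)]
    Cpa = C[np.ix_(pas, act)]
    Cpp = C[np.ix_(pas, pas)]
    # Gain mapping active innovations to the passive stations.
    K = s_b2 * Cpa @ np.linalg.inv(s_b2 * Caa + s_o2 * np.eye(len(act)))
    # E[(O_p - A_p)(O_p - A_p)^T] evaluated with the TRUE covariances.
    E = (sig_o2 * np.eye(len(pas)) + sig_b2 * Cpp
         - sig_b2 * Cpa @ K.T - K @ (sig_b2 * Cpa.T)
         + K @ (sig_b2 * Caa + sig_o2 * np.eye(len(act))) @ K.T)
    return np.trace(E) / len(pas)

# Passive var(O - A) attains its minimum at the true ratio gamma = 0.5.
scores = {g: passive_var_oma(g) for g in (0.1, 0.5, 2.0)}
```

This is only a caricature of the method developed in Section 2, but it captures the key point: since passive observation errors are uncorrelated with the analysis error, the passive $\mathrm{var}(O-A)$ is the true analysis error variance plus a constant, and minimizing it identifies the optimal gain.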

The paper is organized as follows. Section 2 presents the theory and diagnostics of analysis error covariance in both passive and active observation spaces, together with a geometrical representation; this leads to a method to minimize the true analysis error variance. Section 3 presents the experimental setup used to obtain near-optimal analyses, gives the results of several diagnostics in active and passive observation spaces, and compares them with the analysis error variance obtained from the optimum interpolation scheme itself. Section 4 discusses the statistical assumptions being used, whether and how they can be extended, and how this formalism can be used in other applications, such as the estimation of correlated observation errors with satellite observations. Finally, we draw some conclusions in Section 5.

## 5. Conclusions

We showed that analysis error variance can be estimated and optimized, without using a model forecast, by partitioning the original observation data set into a training set, used to create the analysis, and an independent (or passive) set, used to evaluate the analysis. This kind of evaluation by partitioning is called cross-validation. The method derives from assuming that the observations have spatially uncorrelated errors or, minimally, that the independent (or passive) observations have errors uncorrelated with those of the active observations and with the background error. This leads to the important property that passive observation errors are uncorrelated with the analysis error, so that passive observations can be used to evaluate the analysis [17].

We have developed a theoretical framework and a geometric interpretation that allowed us to derive a number of statistical estimation formulas for the analysis error covariance that can be used in both passive and active observation spaces. It is shown that by minimizing the variance of the observation-minus-analysis residuals in passive observation space we actually identify the optimal analysis. This was done with respect to a single parameter, namely the ratio of observation to background error variances, to obtain a near-optimal Kalman gain. The optimization was also done under the constraint of the innovation covariance consistency [14,15]. This optimization could have been carried out with more than one error covariance parameter, but this has not been attempted here. The theory does suggest, however, that the minimum is unique.

Once an optimal analysis is identified, we evaluate the analysis error covariance using several different formulas: those of Desroziers et al. [14] and Hollingsworth and Lönnberg [13], and one that we develop in this paper which works in either active or passive observation space. To validate the analysis error variance computed by the analysis scheme itself, the so-called perceived analysis error variance [6], we compare it with the values obtained from the different statistical diagnostics of analysis error variance.

This methodology arises from a need to assess and improve ECCC’s surface air quality analyses, which use our operational air quality model GEM-MACH and real-time surface observations of O$_3$ and PM$_{2.5}$. Our method applies the theory in a simplified way: first, by considering the averaged observation and background error variances and finding an optimal ratio $\gamma=\sigma_o^2/\sigma_b^2$, using as a constraint the trace of the innovation covariance consistency [15]; second, by using a single-parameter correlation model, characterized by its correlation length, and obtaining near-optimal analyses by maximum likelihood estimation [11]. We also did not attempt to account for representativeness error in the observations by, for example, filtering out closely spaced observations. Despite these limitations, our results show that with near-optimal analyses all estimates of analysis error variance roughly agree with one another, while they disagree strongly when the input error statistics are not optimal. This check on the estimation of the analysis error variance gives us confidence that the proposed method is reliable, and it provides an objective way to evaluate different configurations of the analysis components, such as the type of background error correlation model, the spatial distribution of error variances, and possibly the thinning of observations to circumvent the effects of representativeness errors.

The methodology introduced here for estimating analysis error variances is general and not restricted to the case of surface pollutant analysis. It would be desirable to investigate other areas of application, such as surface analysis in meteorology and oceanography. The method could, in principle, provide guidance for any assimilation system. By considering observation-space subdomains [25], proper scaling, local averaging [26], or other methods discussed in Janjic et al. [12], it may also be possible to extend this methodology to spatially varying error statistics. Based on our verification results in Part I [5], we found a dependence between model values and error variances, which we will investigate further in view of our next operational implementation of the Canadian surface air quality analysis and assimilation.

One strong limitation of the optimum interpolation scheme we are using (i.e., homogeneous isotropic error correlations and uniform error variances), which is also the case for most 3D-Var implementations, is the lack of innovation covariance consistency. Ensemble Kalman filters seem much better in that regard, although they have their own issues with localization and inflation. Experiments with chemical data assimilation using an ensemble Kalman filter do give ${\chi}^{2}/{N}_{s}$ values very close to unity after simple adjustments of the observation and model error variances [27]. We thus argue that ensemble methods, such as the ensemble Kalman filter, would produce analysis error variance estimates that are much more consistent across the different diagnostics.

Estimates of analysis uncertainty can also be obtained by resampling techniques, such as the jackknife and the bootstrap [28]. In bootstrapping with replacement, the distribution of the analysis error is obtained by creating new analyses from observation sets resampled, with duplication, from the existing set of observations [28]. This technique relies on the assumption that the members of the dataset are independent and identically distributed. For surface ozone analyses, where there is persistence from one day to the next and the statistics are spatially inhomogeneous, the assumption of statistical independence may not be adequate. Such resampling estimates of analysis uncertainty could be compared with our analysis error variance estimates to help identify limitations and areas for improvement.
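As an illustration of the resampling idea (a minimal sketch on an entirely synthetic 1-D setup, not our operational configuration), an analysis can be bootstrapped by redoing the optimum interpolation on observation sets resampled with replacement and taking the spread of the resulting analyses as an uncertainty estimate:

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic 1-D setup: ns stations, a zero background, an exponential
# correlation model, and prescribed (hypothetical) error variances.
ns = 30
x = np.sort(rng.uniform(0.0, 100.0, ns))
obs = np.sin(x / 10.0) + 0.3 * rng.standard_normal(ns)
xb = np.zeros(ns)
C = np.exp(-np.abs(x[:, None] - x[None, :]) / 15.0)
s_b2, s_o2 = 1.0, 0.3

def oi_analysis(sel):
    """Optimum interpolation at all stations from the (possibly duplicated)
    observations indexed by sel."""
    K = s_b2 * C[:, sel] @ np.linalg.inv(
        s_b2 * C[np.ix_(sel, sel)] + s_o2 * np.eye(len(sel)))
    return xb + K @ (obs[sel] - xb[sel])

# Bootstrap with replacement: resample station indices, redo the analysis,
# and take the spread over members as an uncertainty proxy.
boot = np.array([oi_analysis(rng.integers(0, ns, ns)) for _ in range(200)])
spread = boot.std(axis=0)
```

The caveat in the text applies directly: the resampled members here are treated as independent and identically distributed, which serially persistent and spatially inhomogeneous surface ozone data may violate.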