A Maximum Likelihood Ensemble Filter via a Modified Cholesky Decomposition for Non-Gaussian Data Assimilation

This paper proposes an efficient and practical implementation of the Maximum Likelihood Ensemble Filter via a Modified Cholesky decomposition (MLEF-MC). The method works as follows: via an ensemble of model realizations, a well-conditioned and full-rank square-root approximation of the background error covariance matrix is obtained. This square-root approximation serves as a control space onto which analysis increments can be computed. These are calculated via Line-Search (LS) optimization. We theoretically prove the convergence of the MLEF-MC. Experimental simulations were performed using an Atmospheric General Circulation Model (AT-GCM) and a highly nonlinear observation operator. The results reveal that the proposed method can obtain posterior error estimates within reasonable accuracies in terms of ℓ−2 error norms. Furthermore, our analysis estimates are similar to those of the MLEF with large ensemble sizes and full observational networks.


Introduction
Remotely sensed observations by earth observing satellites are usually spatially and temporally discontinuous as a result of the sensor, satellite, and target view geometries [1]. For instance, polar orbiting satellites/sensors provide greater spatial details at a reduced temporal resolution, while geostationary orbiting satellites provide a better temporal resolution at a reduced spatial resolution [2]. Data Assimilation (DA) methods can be employed to make these observations more coherent both in time and space [3,4]. In this context, information from observations and an imperfect numerical forecast are optimally combined to estimate the state x * ∈ R n of a dynamical system which approximately evolves according to some imperfect numerical model: where x b ∈ R n is the background state, B ∈ R n×n is the background error covariance matrix, y ∈ R m is a vector holding the observations, m is the number of observations, R ∈ R m×m is the (estimated) data-error covariance matrix, and H : R n → R m is the observation operator (which maps model states to observations). Equation (2) is better known as the Three-Dimensional Variational (3D-Var) cost function. The analysis state is estimated via the solution of the 3D-Var optimization problem: where x a ∈ R n is the analysis state. For linear observation operators, closed-form solutions can be obtained for Equation (3), these are widely employed by ensemble-based methods. However, for nonlinear observation operators, numerical optimization methods can be employed to iteratively solve Equation (3). For instance, in the Maximum Likelihood Ensemble Filter (MLEF), vector states are constrained to the space spanned by an ensemble of model realizations, which is nothing but a low-rank square-root approximation of B. This method is widely accepted in the DA community owing to its efficient formulation and relative ease of implementation. Nevertheless, since analysis increments are computed onto an ensemble space, convergence is not ensured. We think that it is possible to replace the ensemble square-root approximation by a full-rank, well-conditioned square-root approximation of B via a modified Cholesky decomposition. In this manner, analysis increments are computed onto a space whose dimension equals that of the model one. Moreover, convergence can be ensured as long as the classic assumptions of Line-Search (LS) methods are satisfied. The structure of this paper is as follows. In Section 2, we discuss ensemble-based methods for (non) linear data assimilation. Section 3.1 proposes a Maximum Likelihood Ensemble Filter via a Modified Cholesky decomposition (MLEF-MC); the theoretical aspects of this method as well as its computational cost are analyzed. In Section 4, numerical simulations are performed using the Lorenz-96 model and an Atmospheric General Circulation Model (AT-GCM). Section 5 states the conclusions of this research.

Preliminaries
In this section, we briefly discuss ensemble-based data assimilation in linear and nonlinear cases. Line-Search optimization methods are also discussed for the numerical solution of optimization problems.

Ensemble-Based Data Assimilation
Ensemble-based methods estimate prior error distributions via an ensemble of model realizations [5]: where N is the ensemble size, and x b[e] ∈ R n stands for the e-th ensemble member, for 1 ≤ e ≤ N.
The empirical moments of Ensemble (4) are employed to estimate the moments of the prior error distributions: and where ∆X is the matrix of background anomalies: and 1 is a vector whose components are all ones. A well-known method in the ensemble context is the Ensemble Kalman Filter (EnKF) [6]. In the EnKF, a posterior ensemble can be built via the use of synthetic observation [7,8] or by employing an affine transformation on prior members [9,10]. Regardless which method is employed to estimate the analysis members, sampling errors impact the quality of the analysis members. This obeys the fact that, in practice, ensemble sizes are much smaller than model dimensions [11]. To counteract the effects of sampling noise, localization techniques are commonly employed during the assimilation steps. Localization relies on the idea that, for most geophysical systems, distant observations are weakly correlated [11,12]. Covariance localization and domain localization are frequently employed in operational scenarios. Furthermore, another possible choice is to make use of inverse covariance matrix estimation. In the EnKF based on a modified Cholesky decomposition [13][14][15], the precision covariance B −1 is estimated via the Bickel and Levina covariance matrix estimator [16]. This estimator has the form where the nonzero components of L ∈ R n×n are obtained by fitting linear models of the form where x [i] ∈ R N is a vector holding the i-th model component across all ensemble members in Equation (7), for 1 ≤ i ≤ n, and Π(i, r) denotes components within a local box of i for a radius size r. Note that the L factor is sparse since local neighborhoods are assumed for each model component. Moreover, it is possible to obtain sparse lower triangular factors by exploiting the mesh structures of numerical grids, that is, the sparsity pattern of L relies on the selection of r. Likewise, η ∈ R N is Gaussian with zero-mean and uncorrelated errors with unknown variance σ 2 . Some structures of L are shown in Figure 1 for a one-dimensional grid and different values of r, cyclic boundary conditions are assumed for the physics/dynamics. The ordering of model components plays an important role in an efficient manner to perform computations [17,18]. Thus, one can potentially exploit the special structure of the numerical mesh to obtain estimates which can be efficiently applied during the analysis steps [19]. However, the current literature proposes a modified Cholesky implementation, which can be applied without a prespecified ordering of model components [20].
EnKF methods commonly linearize observation operators when these are (highly) nonlinear [21], and as a direct consequence, this can induce bias on posterior members [22]. To handle nonlinear observation operators during the assimilation steps, optimization-based methods can be employed to estimate analysis increments. A well-known method in this context is the Maximum Likelihood Ensemble Filter (MLEF) [23]. This square-root filter employs the ensemble space to compute analysis increments [24,25]: which is nothing but a pseudo square-root approximation of B 1/2 . Thus, vector states can be written as follows: where w ∈ R N is a vector in redundant coordinates to be computed later. By replacing Equation (10) in Equation (2), one obtains [26,27] J The optimization problem to solve reads This problem can be numerically solved via Line-Search (LS) and/or Trust-Region methods. However, convergence cannot be ensured as long as gradient approximations are performed onto a reduced space whose dimension is much smaller than that of the model.

Line-Search Optimization Methods
The solution of optimization problems of the form in Equation (3) can be approximated via Numerical Optimization [28,29]. In this context, solutions are obtained via iterations: wherein k denotes the iteration index, and ∆x k ∈ R n is a descent direction, for instance, the gradient descent direction [30][31][32][33] the Newton's step [34][35][36], or a quasi-Newton-based method [37][38][39], where P k ∈ R n×n is a positive definite matrix. A concise survey of Newton-based methods can be consulted in [40]. Since step sizes in Equation (14) are based on first or second-order Taylor polynomials, the step size can be chosen via Line-Search [41][42][43] and/or Trust-Region [44][45][46] methods. Thus, we can ensure global convergence of optimization methods to stationary points of the cost function (2). This holds as long as some assumptions regarding the functions, gradients, and (potentially) Hessians are preserved [47]. In the context of Line-Search, the following assumptions are commonly made: There is a constant L such as where B is an open convex set which contains Ω 0 . These conditions together with iterates of the form ensure global convergence [48] as long as α is chosen as an (approximated) minimizer of In practice, rules for choosing step size such as the Goldstein rule [49], the Strong Wolfe rule [50], and the Halving method [51] are employed to partially solve Equation (16).

A Maximum Likelihood Ensemble Filter via a Modified Cholesky Decomposition
In this section, we develop an efficient and practical implementation of an MLEF-based filter via a modified Cholesky decomposition.

Filter Derivation
To solve the optimization problem (Equation (3)), we consider the matrix of anomalies (Equation (7)) to estimate B −1 via a modified Cholesky decomposition. We then employ the square-root approximation as a control space onto which analysis increments can be estimated. Note that We constrain vector states to the space spanned by Equation (17a): where η ∈ R n is a vector of weights to be computed later. The 3D-Var cost function (Equation (2)) onto the space (Equation (17a)) reads with the corresponding optimization problem: To approximate a solution for Equation (17d), we consider iterates of the form where k denotes iteration index, and K is the maximum number of iterations. The weights η k can be computed as follows: at iteration k, we linearize the observation operator about x k , this is where H k is the Jacobian of H(x) at x k . By employing this linear Taylor expansion, we obtain the following quadratic approximation of Equation (17c): where from which an estimate of the optimal weight η * k is as follows: Since η * k is obtained via a quadratic approximation of Equation (2), the step size (Equation (18e)) can be too large. Thus, we employ a Line-Search on Equation (17c) in the direction B 1/2 η * k : and therefore, by letting η k ≈ ρ * k η * k in Equation (18a), we obtain This process is repeated until a maximum number of iterations K is reached. Hence, an approximation of the optimal weight (Equation (17d)) reads from which an estimate of the analysis state (Equation (3)) can be computed as follows: The posterior covariance can be readily obtained from Equation (18d). Posterior weights can be sampled as follows: and therefore, the analysis ensemble members read The analysis members (Equation (19d)) are then propagated in time until new observations are available: next .
Putting it all together, the entire assimilation process is condensed in Algorithm 1. We call this filter formulation the Maximum Likelihood Ensemble Filter via a Modified Cholesky decomposition (MLEF-MC). Note that our goal is to obtain a minimizer (local optimum) of the 3D-Var optimization problem (Equation (3)). Other families of methods such as the Cluster Sampling Filters [52] target entire posterior density functions, that is, their goal is to draw samples from posterior kernels and using their empirical moments, to estimate posterior modes of error distributions.

Computational Cost of the MLEF-MC
We detail the computational cost of each line of Algorithm 1, and in this manner, we can estimate the overall computational cost of the MLEF-MC. We do not consider the computational cost of the Line-Search in Equation (18f), which will depend on the algorithm chosen for computing the optimal steps.

1.
In Line 1, the direct inversion of matrix B 1/2 is not actually needed. Note that the optimization variable in Equation (17d) can be expressed in terms of a new control variable ς k as follows: and in this manner, we can exploit the special structure of L and D to perform forward and backward substitutions on the optimal weights η * k . Thus, the number of computations to solve Equation (20) reads O ϕ 2 n where ϕ is the maximum number of nonzero entries across all rows in L with ϕ n. ϕ is commonly some function of the radius of influence r.

2.
The computation of Q k = H k L T D −1/2 −1 in Line 5 can be performed similarly to Equation (20).
On the basis of the dimensions of H k , a bound for computing Q k is as follows In Line 7, the bounds for computations are as follows: where the implicit linear system involving I + Q T k R −1 Q k can be solved, for instance, via the iterative Sherman Morrison formula [53] with no more than O ϕ 2 nm computations. Thus, the computational effort of computing Equation (18e) reads This bound is valid for Lines 9, 12, and 13. Since Lines 12 and 13 are performed N times, their computational cost reads O ϕ 2 Nnm + nNm + Nm 2 . Since all computations are performed K times, the overall cost of the MLEF-MC is as follows: Control space estimation Compute: k-th weight estimation Analysis weight 11: for e ← 1 → N do Analysis members computation 12: 14: for e ← 1 → N do Forecast step 15:

Global Convergence of the Analysis Step in the MLEF-MC
To prove the global convergence of the proposed MLEF-MC in the analysis step, we consider the assumptions in Conditions (C-A), (C-B), and In the next theorem, we state the necessary conditions to ensure global convergence in the MLEF-MC method. (22) hold, then the MLEF-MC with exact Line-Search generates an infinite sequence {x k } ∞ k=0 , then
Proof. By Taylor series, the cost function (Equation (2)) can be expanded as follows: and therefore, for any x k+1 on the ray By Condition (C-A), and Equation (22), it follows that {J (x k )} ∞ k=0 is a monotone decreasing number sequence and it has a bound below, therefore {J (x k )} ∞ k=0 has a limit, and consequently Equation (23) holds.

Further Comments
Note that the proposed filter implementation performs the analysis step onto a control space whose dimension is equal to that of the model. This space is obtained via a modified Cholesky decomposition to mitigate the impact of sampling errors. Furthermore, its computational cost is linear with regard to the model size, which makes the MLEF-MC formulation for operational settings attractive. Moreover, the analysis step globally converges to posterior modes of the error distribution. The next section assesses the accuracy of our proposed filter implementation in several experimental settings.

Numerical Simulations
In this section, we test the proposed MLEF-MC implementation and compare our results with those obtained by the well-know MLEF method. We make use of two surrogate models for the experiments: the Lorenz-96 model [54] and an Atmospheric General Circulation Model (AT-GCM). In both cases, we consider the following general settings: • Starting with a random solution, we employ the numerical model to obtain an initial condition which is consistent with the model dynamics. In a similar fashion, the background state, the actual state, and the initial ensemble are computed; • We consider the following nonlinear observation operator [55]: where j denotes the j-th observed component from the model state, for 1 ≤ j ≤ m. Likewise, we vary γ in γ ∈ {1, 3, 5}. Note that we start with a linear observation operator and end up with a highly nonlinear one. Since this observation operator is nondifferentiable, we employ the sign function to approximate its derivative: • The − 2 norm measures the accuracy of analysis states at assimilation stages, where x * t and x a t are the reference and the analysis solution at the assimilation step t, respectively; • We employ the Root Mean Square Error (RMSE) as a measure of accuracy (average) for an entire set of time-spaced observations, • We employ a Truncated Singular Value Decomposition (T-SVD) to fit the models (Equation (9)); • All experiments were performed under perfect model assumptions. No model errors were present during the assimilation steps; • We employ the MLEF formulation proposed by Zupansky in [23].

The Lorenz-96 Model
The Lorenz-96 model is described by the following ordinary differential equations [56]: where n is the number of model variables, and F is the external force. Periodic boundary conditions are assumed. When F = 8 and n = 40, the model exhibits chaotic behavior, which makes it a relevant surrogate problem for atmospheric dynamics [57]. One time unit in the Lorenz-96 represents 7 days in the atmosphere. Details regarding the construction of the reference solution, background state, initial background ensemble member, and experimental settings are as follows: • We create an initial pool X b of N = 1000 ensemble members. For each experiment, we sample N = 20 members from X b to obtain the initial ensemble X b . Two-dimensional projections of the initial pool making use of its two leading directions are shown in Figure 2; For a reference, we employ a MLEF method with an ensemble size of N = 100 members and a full observational network s = 1. Note that this ensemble size is more than twice the model resolution n = 40. In this manner, we can have an idea about how errors should evolve for large ensemble sizes and full observational networks. We refer to this as the ideal case.
The evolution of errors for the proposed filter implementation is detailed in Figures 3 and 4 for the percentage of observations s = 1 and s = 0.7, respectively. We employ a log-scale of − 2 error norms for ease of reading. Note that as the noise level σ o increases, the accuracy of the MLEF-MC degrades. This should be expected since more uncertainty is injected into the observations and as a direct consequence, the expected posterior errors increased. Nevertheless, in all cases, the evolution of errors are visually bounded (they do not blow up), and therefore, filter convergence is evidenced. For full observational networks, increments in the observation frequencies do not degrade the quality of the analysis increments; however, for observation coverages of s = 0.7, the initial accuracies (spin-up period) can be impacted slightly as the observation frequency increases. However, this does not prevent errors from becoming stable (and to decrease) in time. Note that the degree γ of the observation operator does not impact the quality of analysis corrections in the MLEF-MC method. One can see that errors are stable in time regardless of the degree of H(x). On the other hand, the radius coverage plays an important role in the assimilation of observations as the time frequency of observations increases. For instance, as δt increases, the forecast steps are longer, and therefore, more information about background error correlations can be properly captured in our estimate B −1 . Recall that background error correlations are driven by the nonlinear dynamics of the model (Equation (27)), and given the special structure of the ODE system (Equation (27)), it is reasonable to think that radius lengths larger than one can provide useful information to unobserved components during the analysis corrections. Thus, as the radius length increases, errors in the MLEF-MC behave similar to those in the ideal case.
In Figures 5 and 6, we report the gradient norms of the initial assimilation step for s = 1 and s = 0.7, respectively, for the MLEF-MC implementation. Note that for small γ and σ o values, gradient norms are similarly decreased for different values of r among iterations in the MLEF-MC context. As the noise level increases, high accuracies demand more iterations for large r values. Thus, the noise level plays an important role as long as radius lengths are increased. As should be expected, the rate of convergence can be impacted by the degree of the nonlinear observation operator. Recall that we employ a second-order approximation of J (x) to estimate its gradient, and therefore, as the degree γ increases, small step lengths will be employed by the Line-Search method, among iterations.
For the first assimilation cycle, we show two-dimensional projections of the optimization steps using the two leading directions of X b in Figures 7 and 8 for observation coverages of s = 1 and s = 0.7, respectively. We report the actual state x * , some samples from the background error distribution X b , and the iterates for different r values. The ideal case is also reported. Note that as the degree γ increases, more iterations are needed before we obtain a reasonable estimate of x * . As we mentioned before, second-order Taylor-based approximations can poorly estimate ∇J (x) as γ increases. As can be seen, as the noise level increases, the analysis estimate for the different radius lengths can be impacted.
In Figure 9, we report the average of elapsed times for computing analysis increments across M = 100 assimilation steps. As can be seen, as the radius of influence increases, the elapsed time of assimilation steps slightly increases. This agrees with the bound (Equation (21)) wherein the computational cost of the MLEF-MC formulation linearly depends on the model resolution n. Recall that the factor r is strictly related to ϕ, which in turn is bounded by n. In practice, ϕ n. It is essential to note that by employing a modified Cholesky decomposition (Equation (8)), the degree of freedom of the control space (Equation (10)) is artificially increased. Thus, we have more directions (which are consistent with the model dynamics) onto which error dynamics can be captured. This is similar to having a localized square-root approximation of B. In this manner, we can decorrelate distant model components based on our prior knowledge about the model dynamics. Moreover, we can also decrease the impact of sampling errors. All these properties are possessed by our set of basis vectors (Equation (17a)), which can explain why our proposed filter implementation can decrease initial background errors by several orders of magnitudes. This obeys two important facts: the control-space dimension is equal to that of the model, and more importantly, MLEF-MC ensures convergence as long as the conditions of Theorem 1 are satisfied.

An Atmospheric General Circulation Model (AT-GCM)
In this section, we study the performance of the MLEF-MC method by using a highly nonlinear model: the SPEEDY model. This model is an atmospheric general circulation model that mimics the behavior of the atmosphere across different pressure levels [58,59]. The number of numerical layers in this model is seven, and we employ a T-30 spectral model resolution (96 × 48 × 7 grid components) for the space discretization of each model layer [60,61]. We employ four model variables. These are detailed in Table 1 with their corresponding units and the number of layers. Note that the total number of model components to be estimated is n = 133,632. We set the number of model realizations (ensemble size) as N = 30 for all experimental scenarios. In this case, the model resolution is approximately 4454 times larger than the sample size (n N), which is very common under operational DA scenarios. Additional details of the experimental settings are described below, some are similar to those detailed in [62]: • Starting with a system in equilibrium, the model is integrated over a long time period to obtain an initial condition whose dynamics are consistent with those of the SPEEDY model; • The initial condition is perturbed N times and propagated over a long time period from which the initial background ensemble is obtained; • We employ the trajectory of the initial condition as the reference. This reference trajectory serves to build synthetic observations; The number of assimilation steps is M = 12. Thus, the total simulation time is 7.5 days. Table 2 shows the RMSE values for the MLEF-MC method. We vary the nonlinear degree of γ, and the percentage of observations s. Likewise, the radius of influence r is 1. As can be seen, the proposed filter implementation can decrease forecast errors for all model variables by, in some cases, several orders of magnitudes. As the degree of the observation operator increases, analysis errors can impact the analysis corrections, but all analysis errors are within the same orders of magnitude. Moreover, filter convergence is evident for all synthetic scenarios which agrees with Theorem 1. Note that as the number of observations increases, the accuracy of posterior estimates improves. This is expected since more information regarding the error dynamics is injected into the numerical forecast. Figures 11 and 12 show the time evolution of errors for s = 0.7 and s = 1.0, respectively. Clearly, initial errors are drastically decreased by the proposed filter implementation. This behavior is obtained regardless of the degree γ of the nonlinear observation operator (Equation (24)). As we mentioned before, the more observations employed during assimilation steps, the faster the posterior errors can be decreased. Furthermore, on the basis of the number of observations, the differences between posterior errors can be of orders of magnitude. Figure 13 shows snapshots of the first assimilation step. The results are reported for the first numerical layer of the SPEEDY model and the model variables u and T. As can be seen, background errors are drastically improved by the MLEF-MC method. Spurious waves near the poles of the T and u variables are quickly dissipated, and the numerical model retains the actual shapes (and magnitudes) of these variables.

Conclusions
Satellite remote sensing with a wide range of sources, for instance, on-board sensors, platforms, and satellite data (which provides genuine earth observation information), has transformed our view of the Earth and its environment. These sensors offer different types of observations on large scales and over decades. Typically, observations (data) are nonlinearly related to model states. This paper proposes a Maximum Likelihood Ensemble Filter method via a Modified Cholesky decomposition (MLEF-MC) for nonlinear data assimilation. This method works as follows: snapshots of an ensemble of model realizations are taken at observation steps; these ensembles are employed to build control spaces onto which analysis increments can be estimated. The control spaces are obtained via a modified Cholesky decomposition. The control-space dimension is equal to that of the model, which mitigates the impact of sampling errors. Experimental tests were performed by using the Lorenz-96 model and an Atmospheric General Circulation Model (AT-GCM). The well-known Maximum Likelihood Ensemble Filter (MLEF) was employed with an ensemble size of 100 and a full observational network as a reference method to compare solutions. The results reveal that the proposed filter implementation performs similarly to the MLEF implementation (ideal case) in terms of 2 error norms and Root Mean Square Error values.