Cluster Sampling Filters for Non-Gaussian Data Assimilation

This paper presents a fully non-Gaussian version of the Hamiltonian Monte Carlo (HMC) sampling filter. The Gaussian prior assumption in the original HMC filter is relaxed. Specifically, a clustering step is introduced after the forecast phase of the filter, and the prior density function is estimated by fitting a Gaussian Mixture Model (GMM) to the prior ensemble. Using the data likelihood function, the posterior density is then formulated as a mixture density, and is sampled using an HMC approach (or any other scheme capable of sampling multimodal densities in high-dimensional subspaces). The main filter developed herein is named the "cluster HMC sampling filter" (ClHMC). A multi-chain version of the ClHMC filter, named MC-ClHMC, is also proposed to guarantee that samples are taken from the vicinities of all probability modes of the formulated posterior. The new methodologies are tested using a quasi-geostrophic (QG) model with double-gyre wind forcing and bi-harmonic friction. Numerical results demonstrate the usefulness of using GMMs to relax the Gaussian prior assumption in the HMC filtering paradigm.


Introduction
Data assimilation (DA) is a complex process that involves combining information from different sources in order to produce accurate estimates of the true state of a physical system such as the atmosphere. Sources of information include computational models of the system, a background probability distribution, and observations collected at discrete time instances. With the model state denoted by x ∈ R^{n_var}, the prior probability density P^b(x) encapsulates the knowledge about the system state before incorporating any other source of information such as the observations. Let y ∈ R^m be a measurement (observation) vector. The observation likelihood function P(y|x) quantifies the mismatch between the model predictions (of observed quantities) and the collected measurements. A standard application of Bayes' theorem yields the posterior probability distribution P(x|y), which provides an improved description of the unknown true state of the system of interest.
In the ideal case where the underlying probability distributions are Gaussian, the model dynamics are linear, and the observations are linearly related to the model state, the posterior can be obtained analytically, for example by applying the Kalman filter (KF) equations [26,25]. For large-dimensional problems the computational cost of the standard Kalman filter is prohibitive, and in practice the probability distributions are approximated using small ensembles. The ensemble-based approximation has led to the ensemble Kalman filter (EnKF) family of methods [9,13,14,22]. Several modifications of the EnKF, for example [19,23,40,32,36,38], have been introduced in the literature to solve practical DA problems of different complexities.
One of the drawbacks of the EnKF family is the reliance on an ensemble update formula that comes from the linear Gaussian theory. Several approaches have been proposed in the literature to alleviate the limitations of the Gaussian assumptions. The maximum likelihood ensemble filter (MLEF) [27,41,42] computes the maximum a posteriori estimate of the state in the ensemble space. The iterative EnKF [18,32] (IEnKF) extends MLEF to handle nonlinearity in models as well as in observations. IEnKF, however, assumes that the underlying probability distributions are Gaussian and the analysis state is best estimated by the posterior mode.
These families of filters can generally be tuned (e.g., using inflation and localization) for optimal performance on the problem at hand. However, if the posterior is a multimodal distribution, these filters are expected to diverge, or at best capture a single probability mode, especially in the case of long-term forecasts. Only a small number of filtering methodologies designed to work in the presence of highly non-Gaussian errors are available, and their efficiency with realistic models is yet to be established.
The Hybrid/Hamiltonian Monte Carlo (HMC) sampling filter was proposed in [6] as a fully non-Gaussian filtering algorithm, and has been extended to the four-dimensional (smoothing) setting in [6,4,5,7]. The HMC sampling filter is a sequential DA filtering scheme that works by directly sampling the posterior probability distribution via an HMC approach [12,39]. The HMC filter is designed to handle cases where the underlying probability distributions are non-Gaussian. Nevertheless, the first HMC formulation presented in [6] assumes that the prior distribution can always be approximated by a Gaussian distribution.
This assumption was introduced for simplicity of implementation; however, it can be too restrictive in many cases, and may lead to inaccurate conclusions. This strong assumption needs to be relaxed in order to accurately sample from the true posterior, while preserving computational efficiency.
This work relaxes the Gaussian prior assumption in the original HMC formulation. Specifically, the prior is represented by a Gaussian Mixture Model (GMM) that is fitted to the forecast ensemble via a clustering step. The posterior is formulated accordingly. In the analysis step the resulting mixture posterior is sampled following an HMC approach. The analysis phase, however, can be easily modified to incorporate any other efficient direct sampling method. The resulting algorithm is named the cluster HMC (ClHMC) sampling filter. In order to improve the sampling from the mixture posterior, a multi-chain version, named the MC-ClHMC filter, is also discussed.
Using a GMM to approximate the prior density, given the forecast ensemble, was presented in [2,36] as a means to solve the nonlinear filtering problem. In [2], a continuous approximation of the prior density was built as a sum of Gaussian kernels, where the number of kernels is equal to the ensemble size. Assuming a Gaussian likelihood function, the posterior was formulated as a GMM with updated mixture parameters. The updated means and covariance matrices of the GMM posterior were obtained by applying the convolution rule of Gaussians to the prior mixture components and the likelihood, and the analysis ensemble was generated by direct sampling from the GMM posterior. On the other hand, the approach presented in [36] works by fitting a GMM to the prior ensemble, with the number of mixture components detected using the Akaike information criterion. The EnKF equations are applied to each of the components in the mixture distribution to generate an analysis ensemble from the GMM posterior.
Unlike the existing approaches [2,36], the methodology proposed herein is fully non-Gaussian, and does not limit the posterior density to a Gaussian mixture distribution or to Gaussian likelihood functions. Here we sample the posterior distribution using a Hamiltonian Monte Carlo approach; however, the direct sampling method can be replaced with any efficient analysis algorithm capable of dealing with high-dimensional multimodal distributions.
The remaining part of the paper is organized as follows. Section 2 reviews the original formulation of the HMC sampling filter. Section 3 presents the new algorithms ClHMC and MC-ClHMC. Numerical results and discussions are presented in Section 4. Conclusions are drawn in Section 5.

The HMC Sampling Filter
In this section we present a brief overview of the HMC sampling methodology, followed by the original formulation of the HMC sampling filter.

HMC sampling
HMC sampling follows an auxiliary-variable approach [8,37] to accelerate the sampling process of Markov chain Monte-Carlo (MCMC) algorithms. In this approach, the MCMC sampler is devoted to sampling the joint probability density of the target variable, along with an auxiliary variable. The auxiliary variable is chosen carefully to allow for the construction of a Markov chain that mixes faster, and is easier to simulate than sampling the marginal density of the target variable [20].
The main component of the HMC sampling scheme is an auxiliary Hamiltonian system that plays the role of the proposal (jumping) distribution. The Hamiltonian dynamical system operates in a phase space of points (p, x) ∈ R^{2 n_var}, where the individual variables are the position x ∈ R^{n_var} and the momentum p ∈ R^{n_var}. The total energy of the system, given the position and the momentum, is described by the Hamiltonian function H(p, x). A general formulation of the Hamiltonian function (the Hamiltonian) of the system is given by:
$$ H(p, x) = \frac{1}{2} p^T M^{-1} p + J(x), \qquad (1) $$
where M ∈ R^{n_var × n_var} is a symmetric positive definite matrix referred to as the mass matrix, and J(x) is the potential energy of the system.
The first term in the sum (1) quantifies the kinetic energy of the Hamiltonian system, while the second term is the associated potential energy. The dynamics of the Hamiltonian system in time is described by the following ordinary differential equations (ODEs):
$$ \frac{dx}{dt} = \nabla_p H(p, x), \qquad \frac{dp}{dt} = -\nabla_x H(p, x). \qquad (2) $$
The time evolution of the system (2) in the phase space is described by the flow [29,33]
$$ \Phi_T : (p(0), x(0)) \mapsto (p(T), x(T)), \qquad (3) $$
which maps the initial state of the system (p(0), x(0)) to (p(T), x(T)), the state of the system at time T. In practical applications, the analytic flow Φ_T is replaced by a numerical solution obtained with a time-reversible and symplectic numerical integration method [34,33]. The length T of the Hamiltonian trajectory can generally be long, and may lead to instability of the time integrator if the step size is set to T. In order to accurately approximate Φ_T, the symplectic integrator typically takes m steps of size h = T/m, where h is chosen so as to maintain stability. We will use Φ_T hereafter to represent the numerical approximation of the Hamiltonian flow. Given the formulation of the Hamiltonian (1), the dynamics of the Hamiltonian system is governed by the equations
$$ \frac{dx}{dt} = M^{-1} p, \qquad \frac{dp}{dt} = -\nabla_x J(x). \qquad (4) $$
The canonical probability distribution of the state (p, x) of the Hamiltonian system in the phase space R^{2 n_var}, up to a scaling factor, is given by
$$ P(p, x) \propto \exp\left( -H(p, x) \right) = \exp\left( -\frac{1}{2} p^T M^{-1} p \right) \exp\left( -J(x) \right). \qquad (5) $$
The product form of this joint probability distribution shows that the two variables p and x are independent [34]. The marginal distribution of the momentum variable is Gaussian, p ∼ N(0, M), while the marginal distribution of the position variable is proportional to the exponential of the negative potential energy, that is f(x) ∝ exp(−J(x)) = φ(x). Here f(x) is the normalized marginal density of the position variable, while φ(x) drops the scaling factor (e.g., the normalization constant) of the density function.
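To make the preceding discussion concrete, the following sketch advances the system (4) with a kick-drift-kick (Verlet/leapfrog) symplectic integrator. It is a minimal illustration, assuming a diagonal mass matrix and a user-supplied gradient of the potential energy; all function names and interfaces are illustrative, not part of the original formulation.

    import numpy as np

    def leapfrog(x, p, grad_J, M_diag, h, m):
        """Approximate the Hamiltonian flow Phi_T (T = m*h) with m leapfrog steps.

        x, p    : position and momentum vectors (1-D numpy arrays)
        grad_J  : callable returning the gradient of the potential energy J(x)
        M_diag  : diagonal of the mass matrix M (assumed diagonal here)
        h, m    : step size and number of steps of the symplectic integrator
        """
        x, p = x.copy(), p.copy()
        p -= 0.5 * h * grad_J(x)           # initial half momentum step ("kick")
        for _ in range(m - 1):
            x += h * p / M_diag            # full position step ("drift", uses M^{-1} p)
            p -= h * grad_J(x)             # full momentum step
        x += h * p / M_diag                # last position step
        p -= 0.5 * h * grad_J(x)           # final half momentum step
        return x, p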
In order to draw samples {x(e)} e=1,2,...,nens from a given probability distribution f (x) ∝ φ(x) , HMC makes the following analogy with the Hamiltonian mechanical system (2). The state x is viewed as a position variable, and an auxiliary momentum variable p ∼ N (0, M) is included. The negative-log of the target probability density J (x) = − log(φ(x)) is viewed as the potential energy of an auxiliary Hamiltonian system. The kinetic energy of the system is given by the negative-log of the Gaussian distribution of the auxiliary momentum variable. The mass matrix M is a user-defined parameter that is assumed to be symmetric positive definite. To achieve favorable performance of the HMC sampler, M is generally assumed to be diagonal, with values on the diagonal chosen to reflect the scale of the components of the target variable under the target density [6,29]. The HMC sampler proceeds by constructing a Markov chain whose stationary distribution is set to the canonical joint density (5). The chain is initialized to some position and momentum values, and at each step of the chain, a Hamiltonian trajectory starting at the current state is constructed to propose a new state. A Metropolis-Hastings-like acceptance rule is used to either accept or reject the proposed state. Since both position and momentum are statistically independent, the retained position samples are actually sampled from the target density f (x). The collected momentum samples are discarded, and the position samples are returned as the samples from the target probability distribution f (x).
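A compact sketch of the resulting sampler, under the same illustrative assumptions as above (diagonal mass matrix, user-supplied potential J and gradient grad_J, reusing the leapfrog function from the previous sketch), could read as follows. Note that the paper's Algorithm 1 collects distinct samples; here every post-burn-in state of the chain is stored for simplicity.

    import numpy as np

    def hmc_sample(x0, J, grad_J, M_diag, h, m, n_samples, burn_in=20, rng=None):
        """Draw n_samples approximate samples from f(x) proportional to exp(-J(x)) via HMC."""
        rng = np.random.default_rng() if rng is None else rng
        x = x0.copy()
        samples = []
        while len(samples) < n_samples:
            p = rng.normal(size=x.size) * np.sqrt(M_diag)        # momentum refresh: p ~ N(0, M)
            H_old = 0.5 * np.sum(p**2 / M_diag) + J(x)
            x_new, p_new = leapfrog(x, p, grad_J, M_diag, h, m)  # Hamiltonian trajectory proposal
            H_new = 0.5 * np.sum(p_new**2 / M_diag) + J(x_new)
            if rng.uniform() < min(1.0, np.exp(-(H_new - H_old))):  # Metropolis-Hastings-like test
                x = x_new
            if burn_in > 0:
                burn_in -= 1                                     # discard burn-in states
            else:
                samples.append(x.copy())
        return np.array(samples)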
The performance of the HMC sampling scheme is greatly influenced by the settings of the Hamiltonian trajectory, that is, the choice of the two parameters m and h. The step size h should be small enough to maintain stability, while m should generally be large for the sampler to reach distant points in the state space. The parameters of the Hamiltonian trajectory can be set empirically [29] to achieve an acceptable rejection rate of at most 25%-30%, or be automatically adapted using tuning schemes such as the No-U-Turn sampler (NUTS) [21] or the Riemannian Manifold HMC sampler (RMHMC) [17].
The ideas presented in this work can be easily extended to incorporate any of the HMC sampling algorithms with automatically tuned parameters. In this paper we tune the parameters of the Hamiltonian trajectory following the empirical approach, and focus on the sampler performance due to the choice of the prior distribution in the sequential filtering context.

HMC sampling filter
In the filtering framework, following a perfect-model approach, the posterior distribution P^a(x_k) at a time instance t_k follows from Bayes' theorem:
$$ P^a(x_k) = P(x_k | y_k) = \frac{P(y_k | x_k)\, P^b(x_k)}{P(y_k)}, \qquad (6) $$
where P^b(x_k) is the prior distribution and P(y_k | x_k) is the likelihood function, all at time instance t_k. P(y_k) acts as a scaling factor and is ignored in the HMC context. As mentioned in Section 1, the formulation of the HMC sampling filter proposed in [6] assumes that the prior distribution P^b(x_k) can be represented by a Gaussian distribution
$$ P^b(x_k) \propto \exp\left( -\frac{1}{2} \left\| x_k - x^b_k \right\|^2_{B_k^{-1}} \right), \qquad (7) $$
where x^b_k is the background state, and B_k ∈ R^{n_var × n_var} is the background error covariance matrix. The background state x^b_k is generally taken as the mean of an ensemble of forecasts {x^b_k(e)}_{e=1,...,n_ens}, obtained by forward model runs from a previous assimilation cycle. The associated weighted norm is defined as
$$ \| a \|^2_{A} = a^T A\, a. \qquad (8) $$
Under the traditional, yet non-restrictive, assumption that the observation errors are distributed according to a Gaussian distribution with zero mean and observation error covariance matrix R_k ∈ R^{m×m}, the likelihood function takes the form
$$ P(y_k | x_k) \propto \exp\left( -\frac{1}{2} \left\| H_k(x_k) - y_k \right\|^2_{R_k^{-1}} \right), \qquad (9) $$
where H_k : R^{n_var} → R^m is the observation operator that maps a given state x_k to the observation space at time instance t_k. The dimension of the observation space m is generally much smaller than the state space dimension, that is m ≪ n_var. The posterior follows immediately from (6), (7), and (9) as:
$$ P^a(x_k) \propto \exp\left( -J(x_k) \right), \qquad (10) $$
$$ J(x_k) = \frac{1}{2} \left\| x_k - x^b_k \right\|^2_{B_k^{-1}} + \frac{1}{2} \left\| H_k(x_k) - y_k \right\|^2_{R_k^{-1}}, \qquad (10a) $$
where J(x_k), given by (10a), is the negative-log of the posterior distribution. The derivative of J(x_k) with respect to the system state x_k is given by
$$ \nabla_{x_k} J(x_k) = B_k^{-1} \left( x_k - x^b_k \right) + \mathbf{H}_k^T R_k^{-1} \left( H_k(x_k) - y_k \right), \qquad (11) $$
where H_k = ∂H_k(x_k)/∂x_k is the linearized observation operator (i.e., the Jacobian).
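As a small illustration, the negative-log posterior (10a) and its gradient (11) could be evaluated as sketched below, assuming a linear(ized) observation operator supplied as a matrix H; all names are illustrative.

    import numpy as np

    def gaussian_prior_potential(x, x_b, B_inv, H, R_inv, y):
        """Negative-log posterior J(x) of Eq. (10a) and its gradient, Eq. (11),
        for a Gaussian prior and a linear observation operator H (a matrix)."""
        dx = x - x_b                       # background departure
        dy = H @ x - y                     # observation-space residual
        J = 0.5 * dx @ (B_inv @ dx) + 0.5 * dy @ (R_inv @ dy)
        grad = B_inv @ dx + H.T @ (R_inv @ dy)
        return J, grad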
The HMC sampling filter [6] proceeds in two steps, namely a forecast step and an analysis step. Given an analysis ensemble of states {x^a_{k−1}(e)}_{e=1,...,n_ens} at time t_{k−1}, an ensemble of forecasts at time t_k is generated using the forward model M:
$$ x^b_k(e) = \mathcal{M}_{t_{k-1} \rightarrow t_k}\left( x^a_{k-1}(e) \right), \quad e = 1, 2, \ldots, n_{ens}. \qquad (12) $$
In the analysis step, the posterior (10) is sampled by running an HMC sampler with the potential energy set to (10a), where B_k is approximated using the available ensemble of forecasts.
The formulation of the HMC filter presented in [6], and reviewed above, tends to be restrictive due to the assumption that the prior is always approximated by a Gaussian distribution. The prior distribution can be viewed as the result of propagating the posterior of the previous assimilation cycle through the model dynamics. In the case of nonlinear model dynamics, the prior is a nonlinear transformation of a non-Gaussian distribution, and is therefore generally expected to be non-Gaussian. Tracking the prior distribution exactly, however, is not possible, and a relaxing approximation must be made.
We propose conducting a more accurate density estimation of the prior, by fitting a GMM to the available prior ensemble, replacing the Gaussian prior with a Gaussian mixture prior.

Mixture models
The probability distribution P(x) is said to be a mixture of n_c probability distributions if it can be written as
$$ P(x) = \sum_{i=1}^{n_c} \tau_i\, C_i(x), \qquad \sum_{i=1}^{n_c} \tau_i = 1, \quad \tau_i \ge 0. \qquad (13) $$
The weights τ_i are commonly referred to as the mixing weights, and C_i(x) are the densities of the mixing components.

Gaussian mixture models (GMM)
A GMM is a special case of (13) in which the mixture components are Gaussian densities,
$$ C_i(x) = \mathcal{N}\left( x;\, \mu_i, \Sigma_i \right), \quad i = 1, \ldots, n_c. $$
Fitting a GMM to a given data set is one of the most popular approaches for density estimation. Given a data set {x(e)}_{e=1,...,n_ens} sampled from an unknown probability distribution P(x), one can estimate the density function P(x) by a GMM; the parameters of the GMM, i.e., the mixing weights τ_i, the means μ_i, and the covariances Σ_i of the mixture components, can be inferred from the data.
The most popular approach to obtain the maximum likelihood estimate of the GMM parameters is the expectation-maximization (EM) algorithm [11]. EM is an iterative procedure that alternates between two steps, expectation (E) and maximization (M). At iteration t+1 the E-step computes the expectation of the complete-data log-likelihood based on the posterior probability of x belonging to the i-th component, with the parameters Θ^{t} from the previous iteration. In particular, the following quantity Q(Θ | Θ^{t}) is evaluated:
$$ Q\left( \Theta \,\big|\, \Theta^{\{t\}} \right) = \sum_{e=1}^{n_{ens}} \sum_{i=1}^{n_c} r_{e,i} \left[ \log \tau_i + \log \mathcal{N}\left( x(e);\, \mu_i, \Sigma_i \right) \right], \qquad r_{e,i} = \frac{ \tau_i^{\{t\}}\, \mathcal{N}\left( x(e);\, \mu_i^{\{t\}}, \Sigma_i^{\{t\}} \right) }{ \sum_{j=1}^{n_c} \tau_j^{\{t\}}\, \mathcal{N}\left( x(e);\, \mu_j^{\{t\}}, \Sigma_j^{\{t\}} \right) }. \qquad (14) $$
Here Θ = {τ_i, μ_i, Σ_i}_{i=1,...,n_c} is the parameter set of all the mixture components, and r_{e,i} is the probability that the e-th ensemble member lies under the i-th mixture component.
In the M-step, the new parameters Θ^{t+1} = arg max_Θ Q(Θ | Θ^{t}) are obtained by maximizing the quantity Q in (14) with respect to the parameters Θ. The updated parameters Θ^{t+1} are given by the analytical formulas:
$$ \tau_i^{\{t+1\}} = \frac{1}{n_{ens}} \sum_{e=1}^{n_{ens}} r_{e,i}, \qquad \mu_i^{\{t+1\}} = \frac{ \sum_{e=1}^{n_{ens}} r_{e,i}\, x(e) }{ \sum_{e=1}^{n_{ens}} r_{e,i} }, \qquad \Sigma_i^{\{t+1\}} = \frac{ \sum_{e=1}^{n_{ens}} r_{e,i} \left( x(e) - \mu_i^{\{t+1\}} \right) \left( x(e) - \mu_i^{\{t+1\}} \right)^T }{ \sum_{e=1}^{n_{ens}} r_{e,i} }. \qquad (15) $$
To initialize the parameters for the EM iterations, the mixing weights are simply chosen to be equal, τ_i = n_c^{-1}, the means μ_i can be randomly selected from the given ensemble, and the covariance matrices of the components can all be set to the covariance matrix of the full ensemble. Regardless of the initialization, the convergence of the EM algorithm is ensured by the fact that it monotonically increases the observed-data log-likelihood at each iteration [11], that is:
$$ \log P\left( x(1), \ldots, x(n_{ens}) \,\big|\, \Theta^{\{t+1\}} \right) \;\ge\; \log P\left( x(1), \ldots, x(n_{ens}) \,\big|\, \Theta^{\{t\}} \right). $$
The EM algorithm achieves this improvement of the data log-likelihood indirectly by improving the quantity Q over consecutive iterations, i.e., Q(Θ^{t+1} | Θ^{t}) ≥ Q(Θ^{t} | Θ^{t}).
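The sketch below implements one EM pass of (14)-(15) for a GMM with full covariance matrices; it is an illustrative implementation for clarity, not the code used in the paper.

    import numpy as np
    from scipy.stats import multivariate_normal

    def em_step(X, tau, mu, Sigma):
        """One EM iteration for a GMM.  X has shape (n_ens, n_var);
        tau: (n_c,), mu: (n_c, n_var), Sigma: list of (n_var, n_var) covariances."""
        n_ens, _ = X.shape
        n_c = tau.size
        # E-step: responsibilities r[e, i] of Eq. (14)
        r = np.empty((n_ens, n_c))
        for i in range(n_c):
            r[:, i] = tau[i] * multivariate_normal.pdf(X, mean=mu[i], cov=Sigma[i])
        r /= r.sum(axis=1, keepdims=True)
        # M-step: updated parameters of Eq. (15)
        n_i = r.sum(axis=0)                          # effective number of members per component
        tau_new = n_i / n_ens
        mu_new = (r.T @ X) / n_i[:, None]
        Sigma_new = []
        for i in range(n_c):
            d = X - mu_new[i]
            Sigma_new.append((r[:, i, None] * d).T @ d / n_i[i])
        return tau_new, mu_new, Sigma_new, r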

Model selection
Before the EM iterations start, the number of mixture components n_c must be detected. To choose the number of components in the prior mixture, model selection is employed. Model selection is the process of selecting, from a set of candidate models, the model that gives the best trade-off between model fit and complexity. Here, the best number of components n_c can be selected with common model selection methodologies such as the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC):
$$ \mathrm{AIC} = -2 \log P\left( x(1), \ldots, x(n_{ens}) \,\big|\, \Theta^*_{n_c} \right) + 2\, n_p, \qquad \mathrm{BIC} = -2 \log P\left( x(1), \ldots, x(n_{ens}) \,\big|\, \Theta^*_{n_c} \right) + n_p \log(n_{ens}), \qquad (16) $$
where Θ*_{n_c} = {τ_i, μ_i, Σ_i}_{i=1,...,n_c} is the set of optimal parameters for the candidate GMM model with n_c components, and n_p is the number of free parameters of that model. The best number of components n_c minimizes the AIC or BIC criterion [24,35]. The main difference between the two criteria, as explained by the second terms in Equation (16), is that BIC imposes a greater penalty on the number of parameters (3 n_c − 1) of the candidate GMM model. For small or moderate numbers of samples, BIC often chooses models that are too simple because of its heavy penalty on complexity.
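One possible way to carry out this model selection step is sketched below, with scikit-learn's GaussianMixture used as a stand-in for the EM/GMM machinery described above; the library choice, the candidate range, and the lower bound on members per component (see Section 4) are illustrative assumptions.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def select_gmm(X, max_components=10, criterion="aic", min_members=2):
        """Fit candidate GMMs to the forecast ensemble X (n_ens x n_var) and
        return the one minimizing AIC or BIC, skipping fits that leave a
        component with fewer than `min_members` ensemble members."""
        best_gmm, best_score = None, np.inf
        for n_c in range(1, max_components + 1):
            gmm = GaussianMixture(n_components=n_c, covariance_type="diag").fit(X)
            labels = gmm.predict(X)
            if np.min(np.bincount(labels, minlength=n_c)) < min_members:
                continue                              # enforce the lower bound on members per component
            score = gmm.aic(X) if criterion == "aic" else gmm.bic(X)
            if score < best_score:
                best_gmm, best_score = gmm, score
        return best_gmm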

Cluster HMC sampling filter (ClHMC)
The prior distribution is approximated by a GMM fitted to the forecast ensemble, e.g., using an EM clustering step. The prior PDF reads:
$$ P^b(x_k) = \sum_{i=1}^{n_c} \tau_{k,i}\, \mathcal{N}\left( x_k;\, \mu_{k,i}, \Sigma_{k,i} \right), \qquad (17) $$
where the weights τ_{k,i} quantify the probability that an ensemble member x_k(e) belongs to the i-th component, and (μ_{k,i}, Σ_{k,i}) are the mean and the covariance matrix associated with the i-th component of the mixture model at time instance t_k. Assuming Gaussian observation errors, the posterior can be formulated using equations (6), (9), and (17) as follows:
$$ P^a(x_k) \propto \sum_{i=1}^{n_c} \tau_{k,i}\, \mathcal{N}\left( x_k;\, \mu_{k,i}, \Sigma_{k,i} \right) \exp\left( -\frac{1}{2} \left\| H_k(x_k) - y_k \right\|^2_{R_k^{-1}} \right). \qquad (18) $$
In general the posterior PDF (18) does not correspond to a Gaussian mixture, due to the nonlinearity of the observation operator, which makes an analytical solution impossible. Here we seek to sample directly from the posterior PDF (18).
HMC sampling requires setting the potential energy term in the Hamiltonian (1) to the negative-log of the posterior distribution (18). The potential energy term J(x_k) is:
$$ J(x_k) = -\log \sum_{i=1}^{n_c} \tau_{k,i}\, \left| \Sigma_{k,i} \right|^{-1/2} \exp\left( -J_i(x_k) \right), \qquad (19) $$
where
$$ J_i(x_k) = \frac{1}{2} \left\| x_k - \mu_{k,i} \right\|^2_{\Sigma_{k,i}^{-1}} + \frac{1}{2} \left\| H_k(x_k) - y_k \right\|^2_{R_k^{-1}}, \quad i = 1, \ldots, n_c. $$
Equation (19) is expected to suffer from numerical difficulties due to evaluating the logarithm of a sum of very small values. To address the accumulation of roundoff errors, and without loss of generality, we assume from now on that the terms under the sum in Equation (19) are sorted in decreasing order, i.e.,
$$ \tau_{k,1} \left| \Sigma_{k,1} \right|^{-1/2} e^{-J_1(x_k)} \;\ge\; \tau_{k,2} \left| \Sigma_{k,2} \right|^{-1/2} e^{-J_2(x_k)} \;\ge\; \cdots \;\ge\; \tau_{k,n_c} \left| \Sigma_{k,n_c} \right|^{-1/2} e^{-J_{n_c}(x_k)}. $$
The potential energy function (19) is rewritten as:
$$ J(x_k) = J_1(x_k) - \log\left( \tau_{k,1} \left| \Sigma_{k,1} \right|^{-1/2} \right) - \log\left( 1 + \sum_{i=2}^{n_c} \frac{ \tau_{k,i} \left| \Sigma_{k,i} \right|^{-1/2} }{ \tau_{k,1} \left| \Sigma_{k,1} \right|^{-1/2} } \exp\left( J_1(x_k) - J_i(x_k) \right) \right). \qquad (20) $$
The gradient of the potential energy (20) is:
$$ \nabla_{x_k} J(x_k) = \frac{ \sum_{i=1}^{n_c} \tau_{k,i} \left| \Sigma_{k,i} \right|^{-1/2} e^{-J_i(x_k)}\, \nabla_{x_k} J_i(x_k) }{ \sum_{i=1}^{n_c} \tau_{k,i} \left| \Sigma_{k,i} \right|^{-1/2} e^{-J_i(x_k)} }, \qquad \nabla_{x_k} J_i(x_k) = \Sigma_{k,i}^{-1} \left( x_k - \mu_{k,i} \right) + \mathbf{H}_k^T R_k^{-1} \left( H_k(x_k) - y_k \right). \qquad (21) $$
In the case where the mixture contains a single component (one Gaussian distribution), the potential energy function (20) and its gradient (21) reduce, up to an additive constant, to the following, respectively:
$$ J(x_k) = \frac{1}{2} \left\| x_k - \mu_{k,1} \right\|^2_{\Sigma_{k,1}^{-1}} + \frac{1}{2} \left\| H_k(x_k) - y_k \right\|^2_{R_k^{-1}}, \qquad \nabla_{x_k} J(x_k) = \Sigma_{k,1}^{-1} \left( x_k - \mu_{k,1} \right) + \mathbf{H}_k^T R_k^{-1} \left( H_k(x_k) - y_k \right). $$
This shows that the ClHMC sampling filter proposed herein reduces to the original HMC filter when the EM algorithm detects a single component during the prior density approximation phase.
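For illustration, a log-sum-exp style evaluation of (20)-(21) can be sketched as follows, again with a linear(ized) observation operator supplied as a matrix and all names illustrative.

    import numpy as np

    def mixture_potential(x, tau, mu, Sigma_inv, log_det_Sigma, H, R_inv, y):
        """Stable evaluation of the mixture potential energy (20) and its gradient (21).

        tau:           mixture weights (n_c,)
        mu:            component means (n_c, n_var)
        Sigma_inv:     component precision matrices (n_c, n_var, n_var)
        log_det_Sigma: log-determinants of the component covariances (n_c,)
        H, R_inv, y:   linear(ized) observation operator, observation precision, observations
        """
        dy = H @ x - y
        obs_term = 0.5 * dy @ (R_inv @ dy)
        obs_grad = H.T @ (R_inv @ dy)
        n_c = tau.size
        J_i = np.empty(n_c)
        grad_i = np.empty((n_c, x.size))
        for i in range(n_c):
            dx = x - mu[i]
            J_i[i] = 0.5 * dx @ (Sigma_inv[i] @ dx) + obs_term
            grad_i[i] = Sigma_inv[i] @ dx + obs_grad
        # log of each (unnormalized) term under the sum in Eq. (19)
        log_w = np.log(tau) - 0.5 * log_det_Sigma - J_i
        log_w_max = np.max(log_w)                    # factor out the leading term
        w = np.exp(log_w - log_w_max)
        J = -(log_w_max + np.log(np.sum(w)))         # Eq. (20), up to an additive constant
        grad = (w @ grad_i) / np.sum(w)              # Eq. (21)
        return J, grad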
The ClHMC sampling algorithm. As in the HMC sampling filter, information about the analysis probability density at the previous time t_{k−1} is captured by the analysis ensemble of states {x^a_{k−1}(e)}_{e=1,...,n_ens}. The forecast step consists of two stages. First, the model (12) is used to integrate each analysis ensemble member forward to time t_k, where observations are available. Next, a clustering scheme (e.g., EM) is used to generate the parameters of the GMM. The analysis step constructs a Markov chain starting from an initial state x^0_k, and proceeds by sampling the posterior PDF (18) at stationarity. Here the superscript of x_k refers to the iteration number in the Markov chain.
The steps of the ClHMC sampling filter are detailed in Algorithm 1. As discussed in [6], Algorithm 1 can be used either as a non-Gaussian filter, or as a replenishment tool for parallel implementations of the traditional filters such as EnKF.

Computational considerations
To initialize the Markov chain one seeks a state that is likely with respect to the analysis distribution. Therefore one can start with the background ensemble mean, or with the mean of the component that has the highest weight. Alternatively, one can apply a traditional EnKF step and use the mean analysis to initialize the chain.
The joint ensemble mean and covariance matrix can be evaluated using the forecast ensemble, or using the GMM parameters. Given the GMM parameters (τ_{k,i}; μ_{k,i}, Σ_{k,i}), the joint background mean and covariance matrix are, respectively:
$$ \bar{x}^b_k = \sum_{i=1}^{n_c} \tau_{k,i}\, \mu_{k,i}, \qquad B_k = \sum_{i=1}^{n_c} \tau_{k,i} \left( \Sigma_{k,i} + \left( \mu_{k,i} - \bar{x}^b_k \right) \left( \mu_{k,i} - \bar{x}^b_k \right)^T \right). $$
Both the potential energy (20) and its gradient (21) require evaluating the determinants of the covariance matrices associated with the mixture components. This is a computationally expensive process that is best avoided for large-scale problems. A simple remedy is to force the covariance matrices Σ_{k,i}, for all i = 1, 2, ..., n_c, to be diagonal while constructing the GMM.

Algorithm 1 Cluster HMC sampling filter (ClHMC)
1: Forecast step: given an analysis ensemble {x^a_{k−1}(e)}_{e=1,2,...,n_ens} at time t_{k−1}:
   i- Generate the forecast ensemble using the model M: x^b_k(e) = M_{t_{k−1}→t_k}(x^a_{k−1}(e)), e = 1, 2, ..., n_ens.
   ii- Use the AIC/BIC criteria to detect the number of mixture components n_c in the GMM, then use EM to estimate the GMM parameters {(τ_{k,i}; μ_{k,i}, Σ_{k,i})}_{i=1,2,...,n_c}.
2: Analysis step: given the observation y_k at time point t_k, follow steps i to v:
   i- Initialize the Markov chain (x^0_k) to the best estimate available, e.g., to the mean of the joint forecast ensemble, or to the mixture component mean with maximum likelihood.
   ii- Choose a positive definite mass matrix M. A recommended choice is to set M to a diagonal matrix whose diagonal is equal to the diagonal of the posterior precision matrix. The precisions calculated from the prior ensemble can be used as a proxy.
   iii- Set the potential energy function to (20), and its derivative to (21).
   iv- Starting from the state x^0_k, generate n_ens ensemble members from the posterior distribution (18) as follows:
      1) Draw a random vector p^r ∼ N(0, M).
      2) Use a symplectic numerical integrator (e.g., Verlet, 2-stage, or 3-stage [34,6]) to advance the current state (p^r, x^r_k) by a pseudo-time increment T to obtain a proposal state (p^*, x^*_k).
      3) Evaluate the energy loss: ∆H = H(p^*, x^*_k) − H(p^r, x^r_k).
      4) Calculate the acceptance probability: a^(r) = 1 ∧ e^{−∆H}.
      5) Discard both p^*, p^r.
      6) (Acceptance/Rejection) Draw a uniform random variable u^(r) ∼ U(0, 1):
         i- If a^(r) > u^(r), accept the proposal as the next sample: x^{r+1}_k := x^*_k;
         ii- If a^(r) ≤ u^(r), reject the proposal and continue with the current state: x^{r+1}_k := x^r_k.
      7) Repeat steps 1 to 6 until n_ens distinct samples are drawn.
   v- Use the generated samples {x^a_k(e)}_{e=1,2,...,n_ens} as an analysis ensemble. The analysis ensemble can be used to infer the posterior moments, e.g., the posterior mean and posterior covariance matrix.
When Algorithm 1 is applied sequentially, a single mixture component may be detected in the prior ensemble at some steps. In this case forcing a diagonal covariance structure does not help; instead, the full ensemble covariance is calculated and the standard HMC sampler step is applied.
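A small sketch of the joint-moment computation above, with a fallback to the full ensemble covariance when a single component is detected, is given below; names and shapes are illustrative.

    import numpy as np

    def joint_background_moments(tau, mu, Sigma, X):
        """Joint background mean/covariance from GMM parameters (tau, mu, Sigma);
        if only one component was detected, fall back to the ensemble covariance of X."""
        if tau.size == 1:
            return np.mean(X, axis=0), np.cov(X, rowvar=False)
        x_b = tau @ mu                                  # weighted mean of the component means
        B = np.zeros((mu.shape[1], mu.shape[1]))
        for t, m, S in zip(tau, mu, Sigma):
            d = (m - x_b)[:, None]
            B += t * (S + d @ d.T)                      # within- plus between-component spread
        return x_b, B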

A multi-chain version of the ClHMC filter (MC-ClHMC)
Given the special geometry of the posterior mixture distribution one can construct separate Markov chains for different components of the posterior. These chains can run in parallel to independently sample different regions of the analysis distribution. By running a Markov chain starting at each component of the mixture distribution we ensure that the proposed algorithm navigates all modes of the posterior, and covers all regions of high probability.
The parameters of the jumping distribution for each of the chains can be tuned locally based on the statistics of the ensemble points belonging to the corresponding component in the mixture. This approach is potentially very efficient, not only because it reduces the total running time of the sampler, but also because it favors an increased acceptance rate. The local ensemble size (sample size per chain) can be specified based on the prior weight of the corresponding component multiplied by the likelihood of the mean of that component. Every chain is initialized to the mean of the corresponding component in the prior mixture. The diagonal of the mass matrix can be set globally for all components, for example using the diagonal of the precision matrix of the forecast ensemble, or can be chosen locally based on the second-order moments estimated from the prior ensemble under the corresponding component in the prior mixture. This local choice of the mass matrix does not change the marginal density of the target variable.
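The allocation of per-chain sample sizes described above might be sketched as follows; the Gaussian likelihood evaluated at each component mean carries over the assumption in (9), and all names are illustrative.

    import numpy as np

    def local_ensemble_sizes(tau, mu, H, R_inv, y, n_ens):
        """Distribute the total ensemble size among the MC-ClHMC chains,
        proportionally to the prior weight of each component times the
        (Gaussian) likelihood of its mean."""
        scores = np.empty(tau.size)
        for i, (t, m) in enumerate(zip(tau, mu)):
            dy = H @ m - y
            scores[i] = t * np.exp(-0.5 * dy @ (R_inv @ dy))
        weights = scores / scores.sum()
        # at least one sample per chain; the total may differ slightly from n_ens due to rounding
        return np.maximum(1, np.round(weights * n_ens).astype(int))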

Numerical Results
We first apply the proposed algorithms, ClHMC and MC-ClHMC, to sample a simple one-dimensional mixture distribution. The proposed methodologies are then tested using a quasi-geostrophic (QG) model and compared against the original HMC sampling filter and against EnKF. We mainly use a nonlinear 1.5-layer reduced-gravity QG model with double-gyre wind forcing and bi-harmonic friction [31].

One-dimensional test problem
We start with a prior ensemble generated from a GMM with n_c = 5 components and mixture parameters given by (25). The EM algorithm is used to construct a GMM approximation of the true probability distribution from which the given prior ensemble is drawn. The model selection criterion used here is AIC. The generated GMM approximation of the prior has n_c = 4 components. The prior ensemble and the GMM approximation of the true prior are shown in Figure 1.
Given the observation likelihood function, the resulting mixture posterior is sampled with the proposed algorithms.
Figure 1: The one-dimensional example. A random sample of size n_ens = 100 generated from a GMM with parameters given by (25), and the GMM approximation constructed by the EM algorithm with the AIC model selection criterion.
For the quasi-geostrophic model experiments, the time integration scheme used is the fourth-order Runge-Kutta scheme with a time step of 1.25 [time units].
An initial background state is created by adding Gaussian noise with zero mean and variance equal to 5 [units squared] to the reference initial condition. An initial ensemble is created by adding Gaussian random perturbations with zero mean and variance 5 [units squared] to this background (forecast) initial condition.
For all experiments in this work, the model is run over 1000 model time steps, with observations made available every 10 time steps. In this synthetic model the scales are not relevant, and we use generic space, time, and solution amplitude units.

Observations and observation operators
Two observation operators are used with this model.
• The first observation operator is linear, and directly measures selected components of the model state.
• The second observation operator is nonlinear, and measures the magnitude of the flow velocity, √(u² + v²). The flow velocity components u, v are obtained from the stream function ψ using a finite difference approximation of the relations
$$ u = -\frac{\partial \psi}{\partial y}, \qquad v = \frac{\partial \psi}{\partial x}. $$
In both cases, the observed components are uniformly distributed over the state vector length, with a random offset that is updated at each assimilation cycle. The reference initial state, an example of the observational grid used, and the initial forecast state are shown in Figure 3.
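A sketch of the nonlinear wind-magnitude operator on a 2-D stream function grid might look as follows; the centered differences, the axis conventions (y along axis 0, x along axis 1), the uniform grid spacing, and the observation indices are illustrative assumptions.

    import numpy as np

    def wind_magnitude_obs(psi, obs_idx, d=1.0):
        """Nonlinear observation operator: magnitude of the flow velocity sqrt(u^2 + v^2).

        psi     : 2-D stream function field
        obs_idx : flat indices of the observed grid points
        d       : grid spacing used in the centered finite differences
        """
        u = -(np.roll(psi, -1, axis=0) - np.roll(psi, 1, axis=0)) / (2.0 * d)  # u = -d(psi)/dy
        v = (np.roll(psi, -1, axis=1) - np.roll(psi, 1, axis=1)) / (2.0 * d)   # v =  d(psi)/dx
        speed = np.sqrt(u**2 + v**2)
        return speed.ravel()[obs_idx]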

Filter tuning
We used an implementation of the traditional (stochastic) EnKF with parameters tuned as suggested in [31]. Specifically, we apply a covariance localization by means of a Hadamard product as explained in [23]. The localization function used is Gaspari-Cohn [16] with localization radius set to 15 grid cells. Inflation is applied with factor δ = 1.19 to the forecast ensemble members at each assimilation cycle of EnKF.
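A minimal sketch of the covariance inflation and Hadamard-product localization described above is given below, using the standard fifth-order Gaspari-Cohn correlation function; the distance handling is generic here and only illustrative of the configuration used.

    import numpy as np

    def gaspari_cohn(r):
        """Gaspari-Cohn compactly supported correlation; r = distance / localization radius."""
        r = np.abs(r)
        gc = np.zeros_like(r)
        m1 = r <= 1.0
        m2 = (r > 1.0) & (r < 2.0)
        z = r[m1]
        gc[m1] = -0.25*z**5 + 0.5*z**4 + 0.625*z**3 - (5.0/3.0)*z**2 + 1.0
        z = r[m2]
        gc[m2] = (z**5)/12.0 - 0.5*z**4 + 0.625*z**3 + (5.0/3.0)*z**2 - 5.0*z + 4.0 - 2.0/(3.0*z)
        return gc

    def localize_and_inflate(X, dist, radius=15.0, delta=1.19):
        """Inflate the forecast ensemble X (n_ens x n_var) about its mean by delta,
        and localize its sample covariance with a Hadamard (element-wise) product."""
        x_mean = X.mean(axis=0)
        X_infl = x_mean + delta * (X - x_mean)          # multiplicative inflation
        P = np.cov(X_infl, rowvar=False)
        rho = gaspari_cohn(dist / radius)               # dist: n_var x n_var pairwise distances
        return X_infl, rho * P                          # localized covariance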
The parameters of the HMC and ClHMC sampling filters are tuned empirically in a preprocessing step to guarantee a rejection rate of at most 25%-30%. Specifically, the step size parameters of the symplectic integrator are set to h = 0.0075, m = 15. The integrator used for the Hamiltonian system in all experiments is the three-stage symplectic integrator [6,34]. The mass matrix M is chosen to be a diagonal matrix whose diagonal is set to the diagonal of the precision matrix of the forecast ensemble. In the current experiments, the first 20 steps of the Markov chains are discarded as a burn-in stage. Alternatively, one can run a suboptimal minimization of the negative-log of the posterior to achieve convergence to the posterior.
The parameters of the MC-ClHMC filter are set as follows. The step size parameters of the symplectic integrator are set to h = 0.0075/n_c, m = 10, and the mass matrix is a diagonal matrix whose diagonal is set to the diagonal of the precision matrix of the forecast ensemble members labeled under the corresponding mixture component. To avoid numerical problems related to very small ensemble variances, for example in the case of outliers, the variances are averaged with the modeled forecast variances of 5 [units squared].
The prior GMM is built with the number of components determined using both the AIC and BIC model selection criteria, with a lower bound of 2 on the number of ensemble members belonging to each component of the mixture. This lower bound is enforced as a means to ameliorate the effect of outliers on the GMM construction. For cases where a component contains a very small number of ensemble members, covariance tapering [15] can prove useful.

Assessment metrics
To assess the accuracy of the tested filters we use the root mean squared error (RMSE):
$$ \mathrm{RMSE} = \sqrt{ \frac{1}{n_{var}} \sum_{i=1}^{n_{var}} \left( x_i - x^{true}_i \right)^2 }, \qquad (30) $$
where x^{true} = ψ^{true} is the reference state of the system and x is the analysis state, e.g., the average of the analysis ensemble. Here n_var = 129 × 129 = 16641 is the dimension of the model state. We also use the Talagrand (rank) histogram [1,10] to assess the quality of the ensemble spread around the true state.

Figure 4 shows the reference field ψ^{true} at several times, along with the error fields corresponding to the EnKF, HMC, ClHMC, and MC-ClHMC filter analyses. While HMC errors are comparable to EnKF errors, ClHMC errors are much larger in magnitude. Figure 5 presents the RMSE (30) results of the analyses obtained using the EnKF, HMC, ClHMC, and MC-ClHMC filters in the presence of a linear observation operator. Results for the ClHMC filter are shown for the number of mixture components detected using both the AIC and BIC criteria. As seen in Figure 5, the ClHMC analysis drifts away from the true trajectory in the long run. The original HMC filter (without optimal tuning of the parameter set) produces an analysis that is close to the EnKF analysis as measured by RMSE.
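For completeness, a minimal sketch of the two assessment metrics, the RMSE of (30) and the rank histogram, is given below; the implementation details are illustrative.

    import numpy as np

    def rmse(x_analysis, x_true):
        """Root mean squared error (30) between an analysis state and the reference state."""
        return np.sqrt(np.mean((x_analysis - x_true) ** 2))

    def rank_histogram(ensembles, truths):
        """Talagrand (rank) histogram: rank of the truth within each ensemble,
        pooled over assimilation cycles and state components.

        ensembles : list of (n_ens, n_var) analysis ensembles (one per cycle)
        truths    : list of (n_var,) reference states (one per cycle)
        """
        n_ens = ensembles[0].shape[0]
        counts = np.zeros(n_ens + 1, dtype=int)
        for E, xt in zip(ensembles, truths):
            ranks = np.sum(E < xt[None, :], axis=0)     # rank of the truth in each component
            counts += np.bincount(ranks, minlength=n_ens + 1)
        return counts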

Results with linear observation operator
As discussed in [6], the performance of the HMC filter can be further enhanced by automatically tuning the parameters of the symplectic integrator at the beginning of each assimilation cycle. Here, however, we are mainly interested in assessing the performance of the new methodologies compared to the original HMC filter under equivalent settings. The RMSE of the MC-ClHMC analysis is comparable to that of the HMC filter. It is important to note that the MC-ClHMC filter requires shorter Hamiltonian trajectories to explore the space under each local mixture component, which results in computational savings. Additional savings can be obtained by running the chains in parallel to sample different regions of the posterior. Since we are not interested in only a single best estimate of the true state of the system, RMSE alone is not sufficient to judge the quality of the filtering system. The analysis ensemble sampled from the posterior should be spread widely enough to cover the truth and avoid filter collapse. The rank histograms of the analysis ensembles are shown in Figure 6. The two small spikes in Figure 6(b) suggest that the performance of the original HMC filter could be enhanced by increasing the length of the Hamiltonian trajectories in some assimilation cycles.
The rank histograms shown in Figures 6(c) and 6(d) show that the analysis ensembles produced by the ClHMC filter are very under-dispersed. Since the ensemble size is relatively small and the prior GMM is multimodal, with regions of low probability between the different mixture components, a multimodal mixture posterior with isolated components is obtained. As explained in [30], this is a case where HMC sampling in general can suffer from being entrapped in a local minimum (and fails to jump between different high-probability modes). This behavior is expected to result in ensemble collapse, as seen in Figures 6(c) and 6(d), leading to filter divergence as illustrated by the RMS errors shown in Figure 5. Figure 7 shows the number of mixture components detected using the AIC and BIC model selection criteria with the ClHMC and MC-ClHMC filters. Here the covariance matrices are forced to be diagonal at all assimilation cycles. In the sequential application of the ClHMC filter, as seen in Figure 7(a), the number of mixture components quickly reduces to one. This means that the prior distribution reduces to a Gaussian, for which we still force the prior covariance matrix to be diagonal, as mentioned before. These results, along with the rank histograms in Figures 6(c) and 6(d), suggest that the analysis ensemble quickly reduces to an ensemble collected from only one of the mixture components, thereby losing its dispersion. This is supported by the results in Figure 8, where the rank histograms are plotted using results from the first one and two cycles, respectively. In the first cycle, the ClHMC filter detects 12 components and runs a single chain that fails to jump between all modes, as suggested by the long spikes in Figure 8(a). The use of BIC at the first cycle results in a single component being detected, leading to a Gaussian prior. Since the prior covariance matrix is forced to be diagonal, the prior is heavily modified, and the posterior is likely to be severely affected. Starting from the second assimilation cycle, the ClHMC filter is entrapped in a single mode of small variances, leading to filter divergence in the long run. The ensemble collapse can be avoided if we force the sampler to collect ensemble members from all the probability modes. This is illustrated by the rank histograms of results obtained using the MC-ClHMC filter with the AIC criterion, as shown in Figure 9. Using BIC results in a more dispersed ensemble than using AIC. However, both criteria used with the MC-ClHMC filter give very similar analysis RMSE and visually indistinguishable results, as presented in Figure 5.
The RMSE results shown in Figure 5 indicate that while MC-ClHMC outperforms both the EnKF and the original HMC filters, its results become worse in the long run. We believe that having isolated regions of high probability, e.g., components containing very small numbers of ensemble members, contributes to this degradation. With automatic tuning of the Hamiltonian parameters, the performance of both the HMC and MC-ClHMC filters is expected to be greatly enhanced. To help decide whether to apply the original formulation of the HMC filter or the proposed methodology, one can run tests of non-Gaussianity on the forecast ensemble. To assess non-Gaussianity of the forecast, several numerical or visual normality tests are available, e.g., the Mardia test [28] based on multivariate extensions of skewness and kurtosis measures. Indication of non-Gaussianity can be found by visually inspecting several bivariate contour plots of the joint distribution of selected components of the state vector. Visualization methods for multivariate normality assessment, such as chi-square Q-Q plots, can be very useful as well. Figure 10 shows several chi-square Q-Q plots of the forecast ensembles generated from the results of the EnKF, HMC, and MC-ClHMC filters at different time instances. These plots show strong signs of non-Gaussianity in the forecast ensemble, and suggest that the Gaussian-prior assumption may in general lead to inaccurate conclusions.

Results with nonlinear wind-magnitude observations
In the presence of a nonlinear observation operator the distribution is expected to show even stronger signs of non-Gaussianity. With stronger non-Gaussianity, the cluster methodology is expected to outperform the original formulation of the HMC sampling filter. Figure 11 shows RMSE results, with the nonlinear observation operator, for the analyses obtained by the HMC, ClHMC, and MC-ClHMC filtering systems. While the EnKF diverges under the current settings after the third cycle (results omitted for clarity), HMC and MC-ClHMC continue to produce reasonable analyses with relatively small RMSE.
In the present experiment the diagonal covariance relaxation assumption is imposed. However, this structure is not imposed if only one mixture component is detected, and the ClHMC and MC-ClHMC filters fall back to the original HMC filter. Figure 12 shows the number of mixture components detected using the AIC and BIC model selection criteria in the forecast phase of the MC-ClHMC filter. We see that the number of mixture components varies considerably between cycles with the BIC criterion. However, as revealed by the rank histograms in Figures 13(d) and 13(e), the performance of the MC-ClHMC filter in both cases is very similar. Figure 13 shows rank histograms of HMC, ClHMC, and MC-ClHMC with the nonlinear observation operator. We can see that the ClHMC performance is very similar to the case when the linear observation operator is used. It seems to be entrapped in a local minimum, quickly losing its dispersion. The results of the MC-ClHMC filter avoid this effect and show a reasonable spread. Similar to the linear observation operator case, the results obtained in the present settings by the MC-ClHMC filter, using both the AIC and BIC model selection criteria, are very similar.
The results presented here suggest that the cluster formulation of the HMC sampling filter is advantageous, especially in the presence of a highly nonlinear observation operator, or of strong indications of non-Gaussianity.

Conclusions and Future Work
This work presents a new formulation of the HMC sampling filter for non-Gaussian data assimilation. The new formulation, named the cluster HMC sampling filter (ClHMC), relaxes the Gaussian prior assumption. The prior density is represented more accurately via a GMM fitted to the forecast ensemble. The initial formulation of the ClHMC filter presented here is not expected to outperform the original HMC filter unless the sampler is capable of efficiently sampling multimodal distributions with modes separated by large regions of low probability. A multi-chain version of ClHMC, namely MC-ClHMC, is developed in order to achieve this goal.
Numerical experiments are carried out using a nonlinear 1.5-layer reduced-gravity quasi-geostrophic model in the presence of observation operators with different levels of nonlinearity. The results show that the new methodologies are much more efficient than the original HMC sampling filter in the presence of a highly nonlinear observation operator.
The MC-ClHMC algorithm deserves further investigation. For example, the local sample sizes here are selected based on the prior weight multiplied by the likelihood of the corresponding component mean. An optimal selection of the local ensemble sizes is required to guarantee efficient sampling from the target distribution.
Instead of using the MC-ClHMC filter, one can use ClHMC with a geometrically tempered Hamiltonian sampler, as recently proposed in [30], so as to guarantee navigation between separate modes of the posterior. Alternatively, the posterior distribution can be split into n_c target distributions with different potential energy functions and associated gradients. This is equivalent to running independent HMC sampling filters in different regions of the state space under the target posterior.
The authors have started to investigate the ideas discussed here, in addition to testing the proposed methodologies with automatically tuned HMC samplers.