A New Linear Regression Kalman Filter with Symmetric Samples

Abstract: Nonlinear filtering is of great significance in industry. In this work, we develop a new linear regression Kalman filter for discrete-time nonlinear filtering problems. Under the framework of the linear regression Kalman filter, the key step is to minimize the Kullback–Leibler divergence between the standard normal distribution and its Dirac mixture approximation formed by symmetric samples, so that we obtain a set of samples that captures the information of the reference density. The samples representing the conditional densities evolve deterministically; consequently, fewer samples are needed than in the particle filter, since our method introduces less variance. Numerical results show that the new algorithm is more efficient than the widely used extended Kalman filter, unscented Kalman filter, and particle filter.


Introduction
In nonlinear filtering problems, we aim to find the optimal estimate of the state based on noisy observations; such problems have a long history that can be traced back to the 1960s. In 1960, Kalman proposed the famous Kalman filter (KF) [1], and one year later, Kalman and Bucy proposed the Kalman–Bucy filter [2]. However, the KF requires the filtering system to be linear and Gaussian. For general nonlinear filtering problems, the optimal estimate usually cannot be obtained in closed form; therefore, approximations are required in order to derive suboptimal but still efficient estimators.
One direction is to approximate the nonlinear system. For instance, the extended Kalman filter (EKF) [3] approximates the original system by a linear one using first-order Taylor expansions. Second-order variants of the EKF can be found in [4]. In [5], the state is extended and the nonlinear system is then approximated by a bilinear system using the Carleman approach. Clearly, these methods require smoothness of the system, and the estimates are sensitive to the specific point used for the expansion.
Alternatively, we can approximate the conditional density function rather than the system, since the optimal estimate is completely determined by the conditional density. For example, the particle filter (PF) [6] and its variants use the empirical distribution of a set of particles to approximate the true conditional distribution. For continuous-time filtering systems, the posterior density function satisfies the Duncan–Mortensen–Zakai equation [7–9], and many works aim to solve this equation, such as the direct method [10] and the Yau–Yau method [11–13].
If we approximate the conditional density by a single Gaussian distribution and use the KF formulas in the updating step, we obtain a class of so-called Nonlinear Kalman Filters [4]. However, linearization is still needed for nonlinear systems, and one suitable way to perform it is statistical linearization in the form of statistical linear regression [14]. In this sample-based approach, the relevant densities are represented by a set of randomly or deterministically selected samples. The Nonlinear Kalman Filters that make use of statistical linear regression are called Linear Regression Kalman Filters (LRKFs) [14]; for more details on the LRKF, readers are referred to [15]. The unscented Kalman filter (UKF) [16] is the most commonly used LRKF; it uses a fixed number of deterministic sigma points to capture the information of the conditional densities. Intuitively, the LRKF can be viewed as a hybrid of the PF and the KF in which the particles are obtained deterministically. Some heuristic algorithms combine the PF and the KF directly in practical applications such as target tracking [17,18].
The key problem in the LRKF is how to select the points, which can be determined by minimizing some distance between the original density and the Dirac mixture density formed by these points. In this paper, we propose a new LRKF that uses the Kullback–Leibler (K-L) divergence as the measure. Motivated by the work [19] and by the symmetry of the Gaussian distribution, we approximate the standard normal density by the Dirac mixture density formed by any given number of symmetric points, so that only an optimization problem remains to be solved. Many PFs based on MCMC sampling have appeared in recent decades; however, only a few papers have improved particle sampling via variational inference, since it is difficult to compute the K-L divergence involving discrete densities. With the rise of generative models in machine learning, more and more approximate algorithms related to variational inference have been developed, and Stein variational gradient descent (SVGD) is an important one [20,21]. SVGD drives discrete particles to approximate a continuous posterior density through kernel functions, so that the K-L divergence between the continuous density and its Dirac mixture approximation is minimized by gradient descent. We then obtain points that approximate non-standard Gaussian density functions via the Mahalanobis transformation [22]. Finally, using the framework of the LRKF, we obtain the estimation result. Inheriting the advantages of the LRKF, the proposed algorithm can handle discontinuous filtering systems and does not require computing the Jacobian matrix.
The first contribution of this work is the introduction of the K-L divergence to measure the distance between the continuous density and its Dirac mixture approximation formed by any given number of symmetric samples. The second is the use of SVGD to solve the corresponding optimization problem, motivated by the frequent use of variational inference in machine learning. Numerical simulations show that our algorithm is highly efficient compared with the classical EKF, UKF, and PF.
Notations: |·| denotes the Euclidean norm. δ(·) denotes the Dirac delta function, i.e., δ(x) = 0 for all x ≠ 0, which is also constrained to satisfy the identity \(\int_{\mathbb{R}^n} \delta(x)\,dx = 1\). N(x; m, P) denotes the Gaussian density function with mean m and positive definite covariance P, i.e.,
\[
\mathcal{N}(x; m, P) = \frac{1}{(2\pi)^{n/2}\sqrt{\det P}}\exp\Big(-\frac{1}{2}(x-m)^{\top}P^{-1}(x-m)\Big),
\]
where n is the dimension of x and det P is the determinant of P.
This paper is organized as follows. In Section 2, we review the basic filtering problem and some preliminary results. In Section 3, we give one new LRKF. In Section 4, a numerical example is implemented to show the efficiency of the new algorithm.

Preliminaries
The discrete-time filtering system considered here is as follows:
\[
x_k = f(x_{k-1}) + w_{k-1}, \quad (1)
\]
\[
y_k = h(x_k) + v_k, \quad (2)
\]
where x_k ∈ R^n is the state of the stochastic nonlinear dynamic system (1) at discrete time instant k, y_k ∈ R^m is the noisy measurement (or observation) generated according to the model (2), and {w_k ∈ R^q, k = 0, 1, · · · } and {v_k ∈ R^r, k = 1, · · · } are Gaussian white noise processes with w_k ∼ N(w_k; 0, Q_k) and v_k ∼ N(v_k; 0, R_k). We assume that {w_k, k = 0, 1, · · · }, {v_k, k = 1, · · · } and the initial state x_0 are independent of each other. The density function of the initial state x_0 is p(x_0). Y_k denotes the history of the observations up to time instant k, i.e., Y_k := {y_1, · · · , y_k}. We aim to seek the optimal estimate of the state x_k based on the observation history Y_k in the sense of minimum mean square error.

Definition 1 (Minimum mean square error estimate ([3])). Let x̂ be an estimate of the random variable x. Then the minimum mean square error estimate of x is
\[
\hat{x}_{\mathrm{MMSE}} = \arg\min_{\hat{x}} \mathbb{E}\big[|x - \hat{x}|^2\big].
\]
Jazwinski proved that the minimum mean square error estimate of the state x_k based on Y_k is its conditional expectation, as stated in the following theorem.

Theorem 1 (Theorem 5.3 in [3]). Let the estimate of x_k be a functional of Y_k. Then the minimum mean square error estimate of the state x_k is its conditional mean E[x_k|Y_k].
Obviously, if we can obtain the conditional density of x k based on Y k , i.e., p(x k |Y k ), then we can simply compute E[x k |Y k ]. The evolution of the conditional density function p(x k |Y k ) is given in the following theorem.

Theorem 2 ([19]). Consider the filtering problem (1)–(2) from time step k − 1 to step k. The evolution of the conditional density function p(x_k|Y_k) consists of two iterative steps:

• In the prediction step, employing the system model (1) and the Chapman–Kolmogorov equation, we obtain
\[
p(x_k|Y_{k-1}) = \int_{\mathbb{R}^n} p\big(x_k - f(x_{k-1})\big)\, p(x_{k-1}|Y_{k-1})\, dx_{k-1}, \quad (4)
\]
where p(w_{k-1}) = N(w_{k-1}; 0, Q_{k-1}) denotes the density of the process noise.
• In the updating step, when the latest measurement y_k arrives, using Bayes' rule, we have
\[
p(x_k|Y_k) = \frac{p(y_k|x_k)\, p(x_k|Y_{k-1})}{\int_{\mathbb{R}^n} p(y_k|x_k)\, p(x_k|Y_{k-1})\, dx_k}, \quad (5)
\]
where the likelihood function p(y_k|x_k) is obtained according to the measurement model (2):
\[
p(y_k|x_k) = \mathcal{N}\big(y_k; h(x_k), R_k\big).
\]
The initial value of the conditional density function is p(x_0|Y_0) = p(x_0). Then, according to Theorem 2, we have the evolution framework of p(x_k|Y_k) shown in Figure 1. Unfortunately, although we have these recursive evolution equations, the conditional density function cannot be obtained analytically in most cases. Therefore, we cannot obtain the optimal estimate E[x_k|Y_k] and must resort to approximation techniques.
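For a scalar system, the two steps of Theorem 2 can be carried out numerically on a grid. The sketch below is purely illustrative and assumes a simple linear-Gaussian toy model (x_k = x_{k−1} + w, y_k = x_k + v, with invented noise levels), not the paper's example; it only demonstrates the prediction–update recursion.

```python
import numpy as np

# Numerical illustration of the density recursion in Theorem 2 on a 1-D grid,
# for an assumed linear-Gaussian toy model x_k = x_{k-1} + w, y_k = x_k + v.
x = np.linspace(-5.0, 5.0, 401)
dx = x[1] - x[0]

def gauss(z, var):
    """Zero-mean Gaussian density with variance `var`, evaluated at z."""
    return np.exp(-z**2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)

prior = gauss(x, 1.0)                       # p(x_{k-1} | Y_{k-1}) = N(0, 1)

# Prediction (Chapman-Kolmogorov): convolve the prior with the noise density.
Q = 0.5
pred = np.array([np.sum(gauss(xi - x, Q) * prior) * dx for xi in x])

# Update (Bayes' rule): multiply by the likelihood and renormalize.
R, y_obs = 0.5, 1.2
post = gauss(y_obs - x, R) * pred
post /= np.sum(post) * dx

mean_post = np.sum(x * post) * dx           # approximates E[x_k | Y_k]
```

In this linear-Gaussian special case the recursion reproduces the Kalman filter answer, which makes the grid result easy to sanity-check.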

Nonlinear Kalman Filtering Based on Statistical Linearization
One important approximate Bayesian estimation technique underlies the class of Nonlinear Kalman Filters. These filters assume that both p(x_k|Y_{k−1}) and p(x_k|Y_k) are well approximated by Gaussian distributions. The detailed procedure is listed as follows [19].

1. Initialization: The initial density of x_0 is approximated by the Gaussian
\[
p(x_0|Y_0) \approx \mathcal{N}(x_0; \hat{x}_{0|0}, P_{0|0})
\]
with initial mean \(\hat{x}_{0|0}\) and initial covariance \(P_{0|0}\).

2. Prediction: The a priori density function is approximated by
\[
p(x_k|Y_{k-1}) \approx \mathcal{N}(x_k; \hat{x}_{k|k-1}, P_{k|k-1}), \quad (10)
\]
with predicted state mean
\[
\hat{x}_{k|k-1} = \int_{\mathbb{R}^n} f(x_{k-1})\, \mathcal{N}(x_{k-1}; \hat{x}_{k-1|k-1}, P_{k-1|k-1})\, dx_{k-1}
\]
and predicted state covariance matrix
\[
P_{k|k-1} = \int_{\mathbb{R}^n} \big(f(x_{k-1}) - \hat{x}_{k|k-1}\big)\big(f(x_{k-1}) - \hat{x}_{k|k-1}\big)^{\top} \mathcal{N}(x_{k-1}; \hat{x}_{k-1|k-1}, P_{k-1|k-1})\, dx_{k-1} + Q_{k-1},
\]
respectively.

3. Updating: The Bayesian filter step (5) can be reformulated in terms of the joint density p(x_k, y_k|Y_{k−1}) according to
\[
p(x_k|Y_k) = \frac{p(x_k, y_k|Y_{k-1})}{\int_{\mathbb{R}^n} p(x_k, y_k|Y_{k-1})\, dx_k}.
\]
Here, this joint density is approximated by the Gaussian
\[
p(x_k, y_k|Y_{k-1}) \approx \mathcal{N}\left( \begin{pmatrix} x_k \\ y_k \end{pmatrix}; \begin{pmatrix} \hat{x}_{k|k-1} \\ \hat{y}_{k|k-1} \end{pmatrix}, \begin{pmatrix} P_{k|k-1} & P^{xy}_{k} \\ (P^{xy}_{k})^{\top} & P^{yy}_{k} \end{pmatrix} \right).
\]
Then, according to Theorem A2 in Appendix A, the posterior density is approximated by
\[
p(x_k|Y_k) \approx \mathcal{N}(x_k; \hat{x}_{k|k}, P_{k|k}), \quad (14)
\]
with posterior state mean
\[
\hat{x}_{k|k} = \hat{x}_{k|k-1} + P^{xy}_{k}\,(P^{yy}_{k})^{-1}\,\big(y_k - \hat{y}_{k|k-1}\big)
\]
and posterior state covariance matrix
\[
P_{k|k} = P_{k|k-1} - P^{xy}_{k}\,(P^{yy}_{k})^{-1}\,(P^{xy}_{k})^{\top},
\]
where the measurement mean is
\[
\hat{y}_{k|k-1} = \int_{\mathbb{R}^n} h(x_k)\, \mathcal{N}(x_k; \hat{x}_{k|k-1}, P_{k|k-1})\, dx_k,
\]
the measurement covariance matrix is
\[
P^{yy}_{k} = \int_{\mathbb{R}^n} \big(h(x_k) - \hat{y}_{k|k-1}\big)\big(h(x_k) - \hat{y}_{k|k-1}\big)^{\top} \mathcal{N}(x_k; \hat{x}_{k|k-1}, P_{k|k-1})\, dx_k + R_k,
\]
and the cross-covariance matrix of predicted state and measurement is
\[
P^{xy}_{k} = \int_{\mathbb{R}^n} \big(x_k - \hat{x}_{k|k-1}\big)\big(h(x_k) - \hat{y}_{k|k-1}\big)^{\top} \mathcal{N}(x_k; \hat{x}_{k|k-1}, P_{k|k-1})\, dx_k.
\]
However, these integrals cannot be evaluated in closed form in general.
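Once the predicted and measurement moments are available, the updating step reduces to simple matrix algebra. The sketch below applies the Gaussian conditioning formulas to invented numerical moments (all values are illustrative, not taken from the paper):

```python
import numpy as np

# Gaussian updating step from precomputed moments (all numbers illustrative):
# x_post = x_pred + P_xy P_yy^{-1} (y_k - y_hat),
# P_post = P_pred - P_xy P_yy^{-1} P_xy^T.
x_pred = np.array([0.5, -0.2])        # predicted state mean
P_pred = np.array([[1.0, 0.1],
                   [0.1, 0.8]])       # predicted state covariance
y_hat = np.array([0.4])               # predicted measurement mean
P_yy = np.array([[0.6]])              # measurement covariance (includes R_k)
P_xy = np.array([[0.3],
                 [0.05]])             # state-measurement cross-covariance
y_k = np.array([0.9])                 # incoming measurement

K = P_xy @ np.linalg.inv(P_yy)        # Kalman-type gain
x_post = x_pred + K @ (y_k - y_hat)   # posterior state mean
P_post = P_pred - K @ P_yy @ K.T      # posterior state covariance
```

Note that the posterior covariance stays symmetric and positive definite, and shrinks relative to the predicted one, as expected from conditioning.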

The Linear Regression Kalman Filter
If the state densities in the integrals of the Nonlinear Kalman Filter are replaced by proper Dirac mixture densities formed by samples, these integrals become easy to compute. That is, the statistical linearization turns into an approximate statistical linear regression, which is exactly what the LRKF does.
The Dirac mixture approximation of an arbitrary density p(s_k) of s_k ∈ R^N by L samples is
\[
\tilde{p}_L(s_k) = \sum_{i=1}^{L} \alpha_{k,i}\, \delta(s_k - s_{k,i}),
\]
with samples {s_{k,1}, · · · , s_{k,L}} and positive scalar weights {α_{k,1}, · · · , α_{k,L}}, for which
\[
\sum_{i=1}^{L} \alpha_{k,i} = 1.
\]
Therefore, the information of the density p(s_k) is approximately encoded in the L(N + 1) Dirac mixture parameters, which can be determined by minimizing a certain distance between p(s_k) and \(\tilde{p}_L(s_k)\).
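As a quick illustration of how a Dirac mixture encodes a density, the snippet below forms an equally weighted mixture from randomly drawn samples (a hypothetical choice for illustration; the samples used later in the paper are deterministic) and checks that expectations under the mixture reduce to weighted sums:

```python
import numpy as np

# A Dirac mixture reduces expectations to weighted sums over samples.
# Here the samples are drawn randomly from N(0, 1) (hypothetical choice),
# and the first two moments are recovered from the weighted sums.
rng = np.random.default_rng(0)
L = 5000
samples = rng.standard_normal(L)            # s_{k,1}, ..., s_{k,L}
weights = np.full(L, 1.0 / L)               # alpha_{k,i} > 0, summing to 1

mean_hat = np.sum(weights * samples)        # E[s] under the Dirac mixture
var_hat = np.sum(weights * (samples - mean_hat) ** 2)
```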

The Smart Sampling Kalman Filter
One of the LRKFs is the Smart Sampling Kalman Filter proposed in [15]. As the goal in the LRKF is to approximate the Gaussian densities p(x_k|Y_{k−1}) and p(x_k|Y_k) by Dirac mixture densities, one first approximates an N-dimensional standard normal distribution N(s; 0, I) by a Dirac mixture with equal weights and symmetric samples in the following manner:
\[
\tilde{p}(s) = \frac{1}{2L} \sum_{i=1}^{L} \big[ \delta(s - s_i) + \delta(s + s_i) \big]. \quad (22)
\]
Then, given a non-standard Gaussian distribution
\[
\mathcal{N}(z; m_z, P_z), \quad (23)
\]
we can use the Mahalanobis transformation [22] to obtain its Dirac mixture approximation. More explicitly, by transforming the samples s_i in (22) according to
\[
z_i = m_z + \sqrt{P_z}\, s_i, \quad (24)
\]
where \(\sqrt{P_z}\) is the square root of P_z obtained by Cholesky decomposition, we have the Dirac mixture approximation of (23) as follows:
\[
\tilde{p}(z) = \frac{1}{2L} \sum_{i=1}^{L} \Big[ \delta\big(z - (m_z + \sqrt{P_z}\, s_i)\big) + \delta\big(z - (m_z - \sqrt{P_z}\, s_i)\big) \Big]. \quad (25)
\]
Moreover, this transformation can also be understood from the property of Gaussian random variables listed in Theorem A1. By approximating the a priori density N(x_k; x̂_{k|k−1}, P_{k|k−1}) in (10) and the posterior density N(x_k; x̂_{k|k}, P_{k|k}) in (14) using (24) and (25), we obtain the desired filtering result under the framework of the LRKF. Now the key question is how to determine the samples {s_1, · · · , s_L} in (22) so that they approximate a multivariate standard normal distribution in an optimal way. In [19], a combination of the Localized Cumulative Distribution and the modified Cramér–von Mises distance is adopted.
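The Mahalanobis transformation (24) is easy to verify numerically. The sketch below uses random symmetric samples as a stand-in for the optimized sample set (an assumption made purely for illustration) and checks that the transformed set matches the target mean and covariance:

```python
import numpy as np

# Checking the Mahalanobis transformation z_i = m_z + sqrt(P_z) s_i:
# symmetric samples approximating N(0, I_2) are mapped onto N(m_z, P_z).
# The random symmetric set below is a stand-in for the optimized samples.
rng = np.random.default_rng(1)
S_half = rng.standard_normal((2, 2000))
S = np.hstack([S_half, -S_half])            # symmetric set: mean is exactly 0

m_z = np.array([1.0, -2.0])
P_z = np.array([[2.0, 0.5],
                [0.5, 1.0]])
L_chol = np.linalg.cholesky(P_z)            # sqrt(P_z) via Cholesky

Z = m_z[:, None] + L_chol @ S               # transformed samples z_i
m_hat = Z.mean(axis=1)                      # equals m_z exactly (symmetry)
P_hat = np.cov(Z)                           # close to P_z up to sampling error
```

A side benefit of the symmetric construction, visible here, is that the sample mean of the transformed set matches m_z exactly, not just approximately.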

The New Linear Regression Kalman Filter
Apparently, in the framework of the LRKF, we need to approach the goal through the formulation of an optimization problem with respect to an appropriately chosen distance measure between the original density and its Dirac mixture approximation. Instead of the distance measure adopted in [19], we minimize the K-L divergence between the multivariate standard normal distribution \(p_N(s) := \mathcal{N}(s; 0, I_n)\) and its approximation
\[
\tilde{p}_N(s) = \frac{1}{2L} \sum_{i=1}^{L} \big[ \delta(s - s_i) + \delta(s + s_i) \big] \quad (26)
\]
formed by the symmetric samples S := {s_1, . . . , s_L}.

Kullback-Leibler Divergence and Stein Variational Gradient Descent
The K-L divergence, D_KL, is used to measure how one probability distribution differs from a reference probability distribution. By definition, the K-L divergence between the Dirac mixture approximation density \(\tilde{p}_N(s)\) defined in (26) and the standard normal distribution \(p_N(s)\) is
\[
D_{\mathrm{KL}}\big(\tilde{p}_N \,\|\, p_N\big) = \int \tilde{p}_N(s) \log \frac{\tilde{p}_N(s)}{p_N(s)}\, ds.
\]
The new algorithm requires that the initial particles {s_1, . . . , s_L} be computed in advance. We therefore divide the new algorithm into two parts: the first part is a pre-calculation implemented off-line, while in the on-line part we use the LRKF to obtain the state estimates. In the off-line part, particle sampling is regarded as a variational inference problem, and SVGD is used to capture and store the most important statistical information of the target distribution.
For all our experiments, we use the kernel \(K(x, x') = \exp\big(-\frac{1}{h}\|x - x'\|_2^2\big)\) and take the bandwidth to be \(h = \mathrm{med}^2/\log(2L)\), where "med" is the median of the pairwise distances between the current points \(\{x_i\}_{i=1}^{2L}\). We must point out that different kernel functions may lead to different numerical results; we choose this kernel so as to approximate the Gaussian distribution better. Details of SVGD can be found in [20,21].
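A minimal one-dimensional SVGD sketch with this RBF kernel and median-heuristic bandwidth might look as follows; the step size, iteration count, and initialization are illustrative choices, not the paper's settings:

```python
import numpy as np

# 1-D SVGD toward p_N(s) = N(0, 1) with the RBF kernel
# K(x, x') = exp(-|x - x'|^2 / h) and the median-heuristic bandwidth.
def svgd_standard_normal(n_particles=20, n_iter=1000, step=0.1, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.uniform(-3.0, 3.0, n_particles)
    off_diag = ~np.eye(n_particles, dtype=bool)
    for _ in range(n_iter):
        diff = x[:, None] - x[None, :]               # diff[i, j] = x_i - x_j
        med = np.median(np.abs(diff[off_diag]))
        h = med**2 / np.log(n_particles) + 1e-8      # median-heuristic bandwidth
        K = np.exp(-diff**2 / h)                     # kernel matrix
        grad_logp = -x                               # grad log p_N(s) = -s
        # SVGD direction: kernel-weighted attraction plus repulsive term
        phi = (K @ grad_logp + (2.0 / h) * (K * diff).sum(axis=1)) / n_particles
        x = x + step * phi
    return x

particles = svgd_standard_normal()
```

The attraction term pulls particles toward high-density regions, while the repulsive term (the kernel gradient) keeps them spread out, so the final set covers the target rather than collapsing to its mode.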
The procedure of this off-line algorithm is listed in Algorithm 1; its output is a set of particles \(\{s_i\}_{i=1}^{L}\) that approximates the target distribution.

On-Line Filtering Algorithm
With the off-line data S = {s_1, . . . , s_L} ready, we can obtain the estimate x̂_{k|k} of the state x_k by the following procedure [19].
Denote by \(\{\tilde{s}_1, \ldots, \tilde{s}_{2L}\} := \{s_1, \ldots, s_L, -s_1, \ldots, -s_L\}\) the full symmetric sample set. For k ≥ 1, given the posterior particles x^e_{k−1,i}, 1 ≤ i ≤ 2L, the estimate x̂_{k−1|k−1} and the covariance P_{k−1|k−1}, the recursion proceeds as follows.

1. Propagate each particle through the state model (1), i.e., \(x^p_{k,i} = f(x^e_{k-1,i})\); then we have the predicted mean and covariance
\[
\hat{x}_{k|k-1} = \frac{1}{2L}\sum_{i=1}^{2L} x^p_{k,i}, \qquad P_{k|k-1} = \frac{1}{2L}\sum_{i=1}^{2L} \big(x^p_{k,i} - \hat{x}_{k|k-1}\big)\big(x^p_{k,i} - \hat{x}_{k|k-1}\big)^{\top} + Q_{k-1}.
\]
2. Redraw deterministic particles from the approximated a priori density via the Mahalanobis transformation (24):
\[
x^a_{k,i} = \hat{x}_{k|k-1} + \sqrt{P_{k|k-1}}\, \tilde{s}_i, \quad 1 \le i \le 2L.
\]
3. When the measurement y_k arrives, let \(y_{k,i} = h(x^a_{k,i})\) and compute
\[
\hat{y}_{k|k-1} = \frac{1}{2L}\sum_{i=1}^{2L} y_{k,i}, \qquad P^{yy}_{k} = \frac{1}{2L}\sum_{i=1}^{2L} \big(y_{k,i} - \hat{y}_{k|k-1}\big)\big(y_{k,i} - \hat{y}_{k|k-1}\big)^{\top} + R_k,
\]
\[
P^{xy}_{k} = \frac{1}{2L}\sum_{i=1}^{2L} \big(x^a_{k,i} - \hat{x}_{k|k-1}\big)\big(y_{k,i} - \hat{y}_{k|k-1}\big)^{\top},
\]
and then
\[
\hat{x}_{k|k} = \hat{x}_{k|k-1} + P^{xy}_{k}(P^{yy}_{k})^{-1}\big(y_k - \hat{y}_{k|k-1}\big), \qquad P_{k|k} = P_{k|k-1} - P^{xy}_{k}(P^{yy}_{k})^{-1}(P^{xy}_{k})^{\top}.
\]
The particles are updated according to
\[
x^e_{k,i} = \hat{x}_{k|k} + \sqrt{P_{k|k}}\, \tilde{s}_i, \quad 1 \le i \le 2L.
\]
The on-line procedure is summarized in Algorithm 2.
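One prediction-update cycle of this on-line recursion can be sketched for a scalar system. The state map f, the base sample set, and all numbers below are illustrative assumptions; the measurement map is the cubic one used in the numerical example:

```python
import numpy as np

# One prediction-update cycle of the on-line algorithm for a scalar system
# x_k = f(x_{k-1}) + w, y_k = h(x_k) + v. Samples, f, and all numbers are
# illustrative; h is the cubic map from the numerical example.
f = lambda x: 0.9 * x                  # assumed state map
h = lambda x: x**3                     # cubic measurement map
Q, R = 0.01, 0.01

s = np.array([0.5, 1.0, 1.5])
s = s / np.sqrt(np.mean(s**2))         # scale so the set has unit variance
S = np.concatenate([s, -s])            # symmetric set, equal weights 1/(2L)

x_hat, P = 0.2, 0.5                    # posterior mean/covariance at k-1
# 1. deterministic posterior particles, propagated through f
xp = f(x_hat + np.sqrt(P) * S)
x_pred = xp.mean()
P_pred = np.mean((xp - x_pred) ** 2) + Q
# 2. redraw particles from the predicted density (Mahalanobis transform)
xa = x_pred + np.sqrt(P_pred) * S
# 3. push through h and perform the Kalman-type update
ys = h(xa)
y_hat = ys.mean()
P_yy = np.mean((ys - y_hat) ** 2) + R
P_xy = np.mean((xa - x_pred) * (ys - y_hat))
y_k = 0.1                              # incoming measurement (illustrative)
K = P_xy / P_yy
x_post = x_pred + K * (y_k - y_hat)
P_post = P_pred - K * P_xy
```

Because the particles are redrawn deterministically from the predicted Gaussian, no resampling noise enters the recursion, which is where the variance advantage over the PF comes from.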
The total time step is K. We run the simulations 100 times and denote by \(\hat{x}^i_k\) the estimate of the real state \(x^i_k\) at time step k in the i-th experiment. In order to measure the performance of the numerical algorithms over time, we define the Root Mean Square Error (RMSE) and the mean error at time step k as follows:
\[
\mathrm{RMSE}_k := \sqrt{\frac{1}{100}\sum_{i=1}^{100} \frac{1}{k}\sum_{j=1}^{k} \big|\hat{x}^i_j - x^i_j\big|^2}, \qquad \text{Mean Error}_k := \frac{1}{100}\sum_{i=1}^{100} \big|\hat{x}^i_k - x^i_k\big|.
\]
Mean Error_k is the average estimation error at time k, while RMSE_k is the accumulated average estimation error up to time k.

Numerical Example
We consider the classic cubic sensor here; the model is as follows:
\[
x_k = x_{k-1} + w_{k-1}, \qquad y_k = x_k^3 + v_k,
\]
where x_k ∈ R, y_k ∈ R, w_k ∼ N(0, 0.01), v_k ∼ N(0, 0.01), and k = 1, 2, · · · , K with K = 500. The continuous and continuous-discrete cubic sensor problems have been investigated in [5,11,12], and it has been proved that no recursive finite-dimensional filter driven by the observations can exist for the continuous cubic sensor system [23]. We compare our new LRKF (NLRKF for short) with the classical EKF, UKF, and PF. We first use 10 symmetric points (or particles) for the NLRKF and 50 particles for the PF. The performance in one experiment is shown in Figure 2. Apparently, the EKF totally fails to track the real state, while the other algorithms track it with varying accuracy.
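Assuming the standard random-walk state equation for the cubic sensor (the measurement being the cubed state plus noise), generating a synthetic trajectory is straightforward; the seed and initial state below are illustrative choices:

```python
import numpy as np

# Simulating the cubic sensor: random-walk state, cubed observation.
# Noise standard deviations are 0.1, matching N(0, 0.01); the seed and
# initial state are illustrative choices.
rng = np.random.default_rng(2)
K_steps = 500
x = np.zeros(K_steps + 1)              # x[0] = 0 is the initial state
y = np.zeros(K_steps + 1)              # y[k] is defined for k = 1, ..., K
for k in range(1, K_steps + 1):
    x[k] = x[k - 1] + 0.1 * rng.standard_normal()   # w ~ N(0, 0.01)
    y[k] = x[k] ** 3 + 0.1 * rng.standard_normal()  # v ~ N(0, 0.01)
```

Note that whenever the state is near zero, the cubed observation carries almost no information locally (the derivative of x³ vanishes at 0), which is precisely why EKF-style linearization around the estimate breaks down on this example.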
To further explore the performance of the different algorithms over 100 experiments, we plot the evolution of the mean error and RMSE over time in Figures 3 and 4. As can be seen from these two figures, the NLRKF with 20 particles performs best, followed by the NLRKF with 10 and 6 particles; the UKF is better than the PF. The total estimation error RMSE_500 and the computation times of the different methods over 100 experiments are listed in Table 1, in which N_NLRKF denotes the number of particles (points) used in the NLRKF and N_PF the number of particles used in the PF. The NLRKFs with 20, 10, and 6 particles are the three best-performing algorithms considering both RMSE and computation time. We also display the relation between RMSE and the number of particles in Figure 5. The UKF and EKF are independent of the number of particles, so their two lines are horizontal. The NLRKF surpasses the other methods with just 6 particles.

Conclusions
In this paper, we proposed a new LRKF using the K-L divergence and symmetric samples. Numerical simulation results show that our algorithm is accurate and efficient. However, the proposed algorithm requires an explicit model of the filtering system. Moreover, the algorithm remains within the framework of the Nonlinear Kalman Filter, i.e., the conditional density is approximated by a single Gaussian density, which can lead to large errors when the conditional density is highly non-Gaussian. Removing this restriction while still propagating the points deterministically is left for future work.

Acknowledgments:
The authors thank the editor and anonymous reviewers for their helpful comments and constructive suggestions. S. S.-T. Yau is grateful to the National Center for Theoretical Sciences (NCTS) for providing an excellent research environment while part of this research was done.

Conflicts of Interest:
The authors declare no conflicts of interest.

Appendix A. Properties of Gaussian Random Variables
Recall that the density function of a Gaussian random variable is characterized by two parameters, the mean and the covariance. We list some properties of Gaussian random variables here.
Theorem A1 (Theorem 2.11 in [3]). Let z be a p-dimensional Gaussian random vector with z ∼ N(m_z, P_z), and let η = Cz + c_0, where C ∈ R^{q×p} is a constant matrix and c_0 ∈ R^q is a constant vector. Then η ∼ N(Cm_z + c_0, CP_zC^T).
Theorem A2 (Theorem 2.13 in [3]). Let x and y be jointly normally distributed according to
\[
\begin{pmatrix} x \\ y \end{pmatrix} \sim \mathcal{N}\left( \begin{pmatrix} m_x \\ m_y \end{pmatrix}, \begin{pmatrix} P_{xx} & P_{xy} \\ P_{yx} & P_{yy} \end{pmatrix} \right). \quad \text{(A1)}
\]
Then the conditional density of x given y is normal with mean
\[
m_x + P_{xy} P_{yy}^{-1} (y - m_y) \quad \text{(A2)}
\]
and covariance matrix
\[
P_{xx} - P_{xy} P_{yy}^{-1} P_{yx}. \quad \text{(A3)}
\]