Bayesian Nonlinear Filtering via Information Geometric Optimization

Li, Yubo; Cheng, Yongqiang; Li, Xiang; Wang, Hongqiang; Hua, Xiaoqiang; Qin, Yuliang

doi:10.3390/e19120655

by Yubo Li *, Yongqiang Cheng, Xiang Li, Hongqiang Wang, Xiaoqiang Hua and Yuliang Qin

College of Electronic Science and Engineering, National University of Defense Technology, Changsha 410073, China

* Author to whom correspondence should be addressed.

Entropy 2017, 19(12), 655; https://doi.org/10.3390/e19120655

Submission received: 17 October 2017 / Revised: 16 November 2017 / Accepted: 29 November 2017 / Published: 1 December 2017

(This article belongs to the Special Issue Radar and Information Theory)

Abstract:

In this paper, Bayesian nonlinear filtering is considered from the viewpoint of information geometry and a novel filtering method is proposed based on information geometric optimization. Under the Bayesian filtering framework, we derive a relationship between the nonlinear characteristics of filtering and the metric tensor of the corresponding statistical manifold. Bayesian joint distributions are used to construct the statistical manifold. In this case, nonlinear filtering can be converted to an optimization problem on the statistical manifold and the adaptive natural gradient descent method is used to seek the optimal estimate. The proposed method provides a general filtering formulation and the Kalman filter, the Extended Kalman filter (EKF) and the Iterated Extended Kalman filter (IEKF) can be seen as special cases of this formulation. The performance of the proposed method is evaluated on a passive target tracking problem and the results demonstrate the superiority of the proposed method compared to various Kalman filter methods.

Keywords: information geometry; Bayesian filtering; nonlinear filtering; Riemannian metric tensor; natural gradient descent

1. Introduction

Filtering problems arise in many applications, such as signal processing, automatic control and financial time series analysis. The goal of nonlinear filtering is to estimate the state of a nonlinear dynamic process from noisy observations. Over the last several decades, the Kalman filter has become the standard method for linear dynamic systems with linear measurements, for which it provides an optimal analytical solution. However, it is not suitable for nonlinear filtering problems. In the spirit of the Kalman filter, approximate methods have been proposed for nonlinear filtering, such as the Extended Kalman filter (EKF) [1], the Unscented Kalman filter (UKF) [2], the Gauss–Hermite Kalman filter (GHKF) [3] and the Cubature Kalman filter (CKF) [4]. All of these methods can be derived from the Bayesian approach with different approximations [5]. Apart from these methods, sequential Monte Carlo techniques, which approximate the Bayesian probability density functions (PDFs), offer another feasible approach; an example is particle filtering (PF) [6], which uses particle representations of probability distributions. Within the Bayesian framework, nonlinear filtering is cast as Bayesian filtering, and the procedure consists of two steps: state propagation and measurement update. State propagation provides the prior information on the state, and the measurement update combines this prior information with the measurement likelihood to obtain the posterior PDF of the state. In particular, nonlinear and non-Gaussian conditions make the measurement update more difficult and the posterior PDF intractable.

Because the Bayesian posterior PDF plays a key role in nonlinear filtering, its study has attracted increasing attention over the past few decades. Conventionally, the linear minimum mean square error (LMMSE) estimator and the maximum a posteriori (MAP) estimator have played major roles in estimating the posterior PDF. The LMMSE estimator approximates the posterior mean and covariance matrix by the estimate and its mean square error matrix, respectively. The EKF, UKF and CKF can all be derived from the LMMSE estimator. Besides, an adaptable recursive method named the recursive update filter (RUF) has been derived from the LMMSE principle [7,8]; it overcomes some of the limitations of the EKF. In contrast to the LMMSE estimator, the MAP estimator estimates the posterior mean and obtains the covariance matrix by linearizing the measurement function around the MAP estimate. The well-known iterated EKF (IEKF), which can be derived with the Gauss–Newton optimization [9] or the Levenberg–Marquardt (LM) [10] method, can be interpreted as a MAP estimator. By employing a variational approach for the MAP optimization, the variational Kalman filter (VKF) [11] has been proposed. By using Newton–Raphson iterative optimization steps to yield an approximate MAP estimate, the generalized iterated Kalman filter (GIKF) [12] has been presented to handle nonlinear stochastic discrete-time systems with state-dependent multiplicative estimation. Generally speaking, MAP estimator methods need an iterative procedure to obtain the final estimate, and these iterative methods tend to perform better in nonlinear filtering. Just as the IEKF outperforms the EKF and UKF, as shown by Lefebvre [13], the iterated UKF (IUKF) [14] performs better than the UKF in estimating the state and the corresponding covariance matrix.

Recently, Morelande [15] adopted the Kullback–Leibler (KL) divergence as the metric to analyze the difference between the true joint posterior PDF of the state conditional on the measurement and its approximation. This metric can also be used to derive new algorithms. The iterated posterior linearization filter (IPLF) [16] can be seen as an approximate recursive KL divergence minimization procedure. The adaptive unscented Gaussian likelihood approximation filter (AUGLAF) [17] selects the best approximation to the posterior PDF based on the KL divergence. The KL partitioned update Kalman filter (KLPUKF) [18] uses the KL divergence to measure the nonlinearity of the measurement. In essence, the KL divergence, also known as relative entropy, is an information theoretic quantity, and this optimization criterion has already been applied in signal processing. Using information theoretic quantities to capture higher-order statistics can yield a significant performance improvement. Meanwhile, another optimization criterion from information theoretic learning (ITL), the maximum correntropy criterion (MCC), has been introduced for filtering problems. By incorporating this criterion into the existing filtering framework, several new Kalman-type filters have been proposed, such as the maximum correntropy Kalman filter (MCKF) [19], the maximum correntropy unscented Kalman filter (MCUKF) [20] and a robust information filter based on the maximum correntropy criterion [21].

Inspired by the use of information theoretic quantities in nonlinear filtering, we consider nonlinear filtering from the information geometric viewpoint. Information geometry, originally proposed by Amari [22], has become a new mathematical tool for studying manifolds of probability distributions. The combination of information theory and differential geometry opens a new perspective on the geometric structure of information theory and provides a new way to deal with existing statistical problems. In this paper, we study nonlinear filtering by using information geometric methods. By using the joint PDFs of the measurement and the state to construct the statistical manifold, the nonlinear characteristics can be represented by geometric quantities, such as the metric tensor, and the filtering problem is converted to an optimization problem on the statistical manifold. In this way, nonlinear filtering can be carried out by information geometric optimization, which induces an iterative estimation procedure. The natural gradient descent [23] method is used to seek the optimal estimate across the statistical manifold, and the distance defined on the statistical manifold is used as the stopping criterion.

The paper is organized as follows. We first give a brief description of Bayesian filtering and information geometry in Section 2 and Section 3, respectively. Then, the adaptive natural gradient descent method on the statistical manifold is presented and used to derive the new nonlinear filtering algorithm in Section 4. Further discussion of the proposed method is given in Section 5, and numerical simulations demonstrating its performance are presented in Section 6. Finally, conclusions are drawn in Section 7.

2. Bayesian Filtering

The Bayesian principle provides a general approach to nonlinear filtering, known as Bayesian filtering [5]. Bayesian filtering recasts the state and the measurement of the state-space model in terms of probability distributions. Its goal is to estimate the state of a nonlinear dynamic process conditional on the measurements. The Bayesian filtering recursion is

p (x_{k} | y_{k - 1}) = \int p (x_{k} | x_{k - 1}) p (x_{k - 1} | y_{k - 1}) d x_{k - 1}

(1)

p (x_{k} | y_{k}) = \frac{p (y_{k} | x_{k}) p (x_{k} | y_{k - 1})}{\int p (y_{k} | x_{k}) p (x_{k} | y_{k - 1}) d x_{k}}

(2)

with the probability densities as follows

x_{k} | x_{k - 1} \sim p (x; f (x_{k - 1}), Q)

(3)

y_{k} | x_{k} \sim p (y; h (x_{k}), R)

(4)

which correspond to the general state-space model formulation

x_{k} = f (x_{k - 1}) + u_{k}

(5)

y_{k} = h (x_{k}) + v_{k}

(6)

where $f$ and $h$ denote the state transition and measurement functions, and the covariance matrices $Q$ and $R$ correspond to the zero-mean Gaussian noises $u_{k}$ and $v_{k}$, respectively.

For the Bayesian filtering problem, Equation (1) represents the state propagation, while Equation (2) represents the measurement update. Because the Bayesian posterior distribution is usually intractable to compute analytically, optimization methods such as MAP estimation are used to avoid the troublesome integral and address the problem in a computationally feasible way. Compared with the LMMSE method, the advantage of the MAP method is that the integrals in the Bayesian posterior distribution need not be evaluated. Further, when these PDFs are considered, information geometry [22] provides an alternative approach, and optimization on the statistical manifold can be used to derive a new filtering algorithm.
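To make the two steps concrete, the following is a minimal sketch that evaluates Equations (1) and (2) by brute-force numerical integration on a grid for a scalar state. The model functions, noise variances, grid and measurement value are hypothetical choices made only for illustration; a linear measurement is used so that the result can be compared with the Kalman filter, but `h` may be any function.

```python
import math

def gauss(z, mean, var):
    """Scalar Gaussian density N(z; mean, var)."""
    return math.exp(-0.5 * (z - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def bayes_step(grid, post_prev, y, f, h, Q, R):
    """One propagation/update cycle on a uniform grid (Riemann sums)."""
    dx = grid[1] - grid[0]
    # Eq. (1): integrate the transition density against the previous posterior.
    prior = [sum(gauss(x, f(xp), Q) * pp for xp, pp in zip(grid, post_prev)) * dx
             for x in grid]
    # Eq. (2): multiply by the likelihood and normalize.
    unnorm = [gauss(y, h(x), R) * p for x, p in zip(grid, prior)]
    evidence = sum(unnorm) * dx
    return prior, [u / evidence for u in unnorm]

# Hypothetical scalar model (assumption, not taken from the paper).
grid = [-5.0 + 0.05 * i for i in range(201)]
post0 = [gauss(x, 1.0, 0.5) for x in grid]          # posterior at time k-1
prior, post = bayes_step(grid, post0, y=1.2,
                         f=lambda x: 0.9 * x, h=lambda x: x, Q=0.1, R=0.2)
dx = grid[1] - grid[0]
post_mean = sum(x * p for x, p in zip(grid, post)) * dx
print(post_mean)
```

The cost of such a grid grows exponentially with the state dimension, which is one reason approximate methods are used in practice.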

3. Information Geometry

3.1. Riemannian Metric Tensor

Consider a parameterized family of probability densities $S = \{p (y | θ), θ = {(θ_{1}, \dots, θ_{n})}^{T} \in Θ\}$, where $y \in R^{m}$ is a measurable random variable, $θ$ is the parameter to be estimated, and $p (y | θ)$ is the conditional PDF of $y$ given $θ$. With the parameter $θ$ acting as the coordinate system, $S$ can be regarded as an n-dimensional manifold. When the Fisher information matrix (FIM) [24]

\begin{matrix} {[F (θ)]}_{i j} & ≜ E_{y | θ} [\frac{\partial \log p (y | θ)}{\partial θ_{i}} \frac{\partial \log p (y | θ)}{\partial θ_{j}}] \\ = - E_{y | θ} [\frac{\partial^{2} \log p (y | θ)}{\partial θ_{i} \partial θ_{j}}] \end{matrix}

(7)

is defined as the Riemannian metric tensor, $S$ is a statistical manifold. This metric tensor on the statistical manifold is usually called the Fisher metric tensor [25]. $E_{y | θ} [\cdot]$ denotes the expectation with respect to $p (y | θ)$.

When $θ$ has the prior PDF $p (θ)$, the Bayesian principle can be used to characterize the joint PDF $p (y, θ)$ of $y$ and $θ$. The Fisher metric tensor for Bayesian problems is defined as follows [24]

\begin{matrix} {[G (θ)]}_{i j} & ≜ - E_{y, θ} [\frac{\partial^{2} \log p (y, θ)}{\partial θ_{i} \partial θ_{j}}] \\ = - E_{y | θ} E_{θ} [\frac{\partial^{2} \log p (y | θ)}{\partial θ_{i} \partial θ_{j}}] - E_{θ} [\frac{\partial^{2} \log p (θ)}{\partial θ_{i} \partial θ_{j}}] \end{matrix}

(8)

where $E_{y, θ} [\cdot]$ and $E_{θ} [\cdot]$ denote the expectations with respect to $p (y, θ)$ and $p (θ)$, respectively.

This Fisher metric tensor (8) is the expected Fisher information on the measurement plus the negative Hessian of the log-prior on the parameter. The first part of Equation (8) characterizes the measurement conditional on the parameter, while the second part includes the effect of prior information on the parameter. In other words, the two terms of Equation (8) correspond to the information obtained from the measurement and the prior distribution of the parameter, respectively.

Besides, the Fisher metric tensor is related to the Bayesian posterior Cramér–Rao bound (PCRB), which is applicable to multidimensional nonlinear, possibly non-Gaussian, dynamical systems [26]. From the viewpoint of statistical inference, the performance of an estimator can be measured against the Bayesian PCRB. The PCRB on the estimation error is formulated as

\begin{matrix} PCRB ≜ E_{y, θ} [(\hat{θ} (y) - θ) {(\hat{θ} (y) - θ)}^{T}] \geq G^{- 1} (θ) \end{matrix}

(9)

where $\hat{θ} (y)$ denotes the estimate of $θ$, and $G (θ)$ is the $n \times n$ Fisher information matrix with the elements defined in Equation (8). The inequality "$\geq$" means that the difference $PCRB - G^{- 1} (θ)$ is a positive semidefinite matrix. This inequality describes the bound on the estimation error. In applications, the covariance matrix of the estimate is usually approximated by the inverse of the Fisher information matrix.

3.2. Natural Gradient Descent

The gradient method is a general approach to solving optimization problems. For most problems that estimate the parameter $θ$ in Euclidean space, the gradient descent update is defined as

\begin{matrix} {\hat{θ}}^{i} = {\hat{θ}}^{i - 1} - \nabla_{θ} L ({\hat{θ}}^{i - 1}) \end{matrix}

(10)

where $\nabla_{θ} L$ is the gradient of the convex differentiable objective function $L$, which determines the update direction for the next iterative step. However, this update is not suitable for a Riemannian manifold because of the curvature of the manifold. Based on the Riemannian metric tensor, the gradient on a Riemannian manifold has been proposed as

{\tilde{\nabla}}_{θ} L (θ) = G^{- 1} (θ) \nabla_{θ} L (θ)

known as the natural gradient. Thus, the natural gradient descent update on the Riemannian manifold is

\begin{matrix} {\hat{θ}}^{i} = {\hat{θ}}^{i - 1} - η_{i} G^{- 1} ({\hat{θ}}^{i - 1}) \nabla_{θ} L ({\hat{θ}}^{i - 1}) \end{matrix}

(11)

where $G (θ)$ is the Riemannian metric tensor associated with the estimated parameter $θ$, $i$ is the iteration index and $η_{i} \in (0, 1]$ denotes the step-size of the iterative update. The natural gradient descent multiplies the gradient of the objective function by the inverse of the Riemannian metric tensor. In this way it accounts for the curvature of the manifold when choosing the direction of steepest descent. It has been proven that the natural gradient gives the steepest descent direction on the Riemannian manifold [27]. The natural gradient method has been used in many applications, such as nonlinear estimation [28]. In addition, the natural gradient descent is Fisher efficient [23]. Luo et al. [29] have analyzed the convergence and bound properties of the natural gradient descent method. Once the natural gradient descent method has constructed the iterative update procedure, stopping conditions have to be set to obtain the final estimate. Usually, the distance between two successive estimates is used for these conditions.
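As a small illustration of the difference between updates (10) and (11), the sketch below estimates the mean of a scalar Gaussian with known variance. The objective, the sample mean `ybar` and the variances are hypothetical choices; the Fisher information of the mean, $1 / σ^{2}$, plays the role of $G$.

```python
def grad_L(mu, ybar, var):
    """Gradient of L(mu) = (ybar - mu)^2 / (2 var) with respect to mu."""
    return (mu - ybar) / var

def euclidean_step(mu, ybar, var):
    """Plain gradient descent update, Eq. (10)."""
    return mu - grad_L(mu, ybar, var)

def natural_step(mu, ybar, var, eta=1.0):
    """Natural gradient descent update, Eq. (11), with G = 1 / var."""
    G = 1.0 / var
    return mu - eta * grad_L(mu, ybar, var) / G

# Same starting point and data, three different noise variances.
for var in (0.1, 1.0, 10.0):
    print(var, euclidean_step(0.0, 2.0, var), natural_step(0.0, 2.0, var))
```

In this toy problem the plain update overshoots or undershoots depending on the variance, while the natural gradient step with $η = 1$ lands on the minimizer `ybar` regardless of the variance, because the metric rescales the gradient.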

3.3. Divergence and Distance

In Euclidean space, the distance describes the difference between two quantities and is defined by the Euclidean norm as $∥ Δ θ ∥ = \sqrt{< Δ θ, Δ θ >} = \sqrt{Δ θ^{T} Δ θ}$. On a Riemannian manifold, the distance is instead defined with the Riemannian metric tensor $G (θ)$ as ${∥ Δ θ ∥}_{G (θ)} = \sqrt{< Δ θ, Δ θ >_{G (θ)}} = \sqrt{Δ θ^{T} G (θ) Δ θ}$. On a statistical manifold, with the Fisher metric tensor serving as the Riemannian metric tensor, Amari [30] has defined the squared distance between two nearby distributions $p (y, θ)$ and $p (y, θ + Δ θ)$ as

d s^{2} = Δ θ^{T} G (θ) Δ θ = < Δ θ, Δ θ >_{G (θ)}

(12)

Besides, the KL divergence provides another means to measure the similarity of two nearby probability distributions. The KL divergence is defined as

D_{KL} (p ∥ q) = \int p (y) \log (\frac{p (y)}{q (y)}) d y

(13)

where p and q denote two probability densities. In information theory, the KL divergence is also called relative entropy. It is a good measure of difference with desirable mathematical properties [31].

Let $q$ be a density in the neighborhood of $p$, $q (y, θ) = p (y, θ + Δ θ)$; a Taylor expansion then gives the approximation of the KL divergence

\begin{matrix} D_{KL} (p (y, θ) ∥ p (y, θ + Δ θ)) = E_{y, θ} [\log \{\frac{p (y, θ)}{p (y, θ + Δ θ)}\}] \\ = E_{y, θ} [\log p (y, θ)] - E_{y, θ} [\log p (y, θ + Δ θ)] \\ \approx E_{y, θ} [\log p (y, θ)] - E_{y, θ} [\log p (y, θ) + \sum_{i = 1}^{n} \frac{\partial \log p (y, θ)}{\partial θ_{i}} Δ θ_{i} + \frac{1}{2} \sum_{i, j = 1}^{n} \frac{\partial^{2} \log p (y, θ)}{\partial θ_{i} \partial θ_{j}} Δ θ_{i} Δ θ_{j}] \\ = - E_{y, θ} [\frac{1}{2} \sum_{i, j = 1}^{n} \frac{\partial^{2} \log p (y, θ)}{\partial θ_{i} \partial θ_{j}} Δ θ_{i} Δ θ_{j}] \\ = \frac{1}{2} Δ θ^{T} G (θ) Δ θ \end{matrix}

(14)

where $G (θ)$ denotes the Fisher metric tensor of the statistical manifold and $Δ θ_{i}$ denotes the i-th component of $Δ θ$. In the derivation, $E [\frac{\partial \log p (y, θ)}{\partial θ}] = 0$ [22] has been used. This relationship shows that the KL divergence captures the second-order information of the probability density. Compared with Amari's squared distance, the KL divergence is locally approximately half of the squared distance on the statistical manifold, which coincides with the geodesic distance for infinitesimal displacements [32]. With the help of the divergence and the distance, we can measure how close two estimates are from the viewpoint of the statistical manifold.

The statistical manifold of multivariate Gaussian distributions has the explicit probability density

\begin{matrix} p (y; μ, Σ) = \frac{\exp [- \frac{1}{2} {(y - μ)}^{T} Σ^{- 1} (y - μ)]}{\sqrt{{(2 π)}^{m} det (Σ)}} \end{matrix}

(15)

where $y$ is the random variable, and $μ \in R^{m}$ and $Σ \in Sym (m)$ denote the mean and the covariance matrix, respectively. $Sym (m)$ is the space of real symmetric positive-definite $m \times m$ matrices. The mean and the covariance matrix are the unknown parameters to be estimated. The statistical manifold is constructed from this probability density as $S = \{p (y; μ, Σ), (μ, Σ) \in R^{m} \times Sym (m)\}$. The log-likelihood of the multivariate Gaussian can be written as

\begin{matrix} ℓ = \log p (y; μ, Σ) = - \frac{1}{2} {(y - μ)}^{T} Σ^{- 1} (y - μ) - \frac{1}{2} \log det (Σ) - \frac{m}{2} \log (2 π) \end{matrix}

(16)

Treating $μ$ and $Σ$ as mutually independent parameters, the first-order partial derivatives of $ℓ$ with respect to $μ$ and $Σ$ are

\begin{matrix} \nabla_{μ} ℓ = Σ^{- 1} (y - μ) \end{matrix}

(17)

\begin{matrix} \nabla_{Σ} ℓ & = \frac{\partial (- \frac{1}{2} tr [Σ^{- 1} (y - μ) {(y - μ)}^{T}])}{\partial Σ} + \frac{\partial (- \frac{1}{2} \log det (Σ))}{\partial Σ} \\ = \frac{1}{2} Σ^{- 1} (y - μ) {(y - μ)}^{T} Σ^{- 1} - \frac{1}{2} Σ^{- 1} \end{matrix}

(18)

then, we can compute the second order partial derivatives of ℓ as follows

\nabla_{μ} {[\nabla_{μ} ℓ]}^{T} = - Σ^{- 1}

(19)

\nabla_{Σ} {[\nabla_{Σ} ℓ]}^{T} = \frac{\partial Σ^{- 1}}{\partial Σ} (y - μ) {(y - μ)}^{T} Σ^{- 1} - \frac{1}{2} \frac{\partial Σ^{- 1}}{\partial Σ}

(20)

The Fisher metric tensor with respect to $μ$ is

\begin{matrix} F_{μ} = - E_{y} [\nabla_{μ} {[\nabla_{μ} ℓ]}^{T}] = Σ^{- 1} \end{matrix}

(21)

and the Fisher metric tensor with respect to $Σ$ is

\begin{matrix} F_{Σ} & = - E_{y} [\nabla_{Σ} {[\nabla_{Σ} ℓ]}^{T}] \\ = - \frac{\partial Σ^{- 1}}{\partial Σ} Σ Σ^{- 1} + \frac{1}{2} \frac{\partial Σ^{- 1}}{\partial Σ} = \frac{1}{2} Σ^{- 1} \otimes Σ^{- 1} \end{matrix}

(22)

where ⊗ denotes the Kronecker product.

With $E_{y} [\nabla_{μ} ℓ \nabla_{Σ} ℓ^{T}] = E_{y} {[\nabla_{Σ} ℓ \nabla_{μ} ℓ^{T}]}^{T} = 0$, we can obtain the distance

\begin{matrix} d s^{2} & = Δ μ^{T} Σ^{- 1} Δ μ + \frac{1}{2} {(vec (Δ Σ))}^{T} (Σ^{- 1} \otimes Σ^{- 1}) vec (Δ Σ) \\ = Δ μ^{T} Σ^{- 1} Δ μ + \frac{1}{2} tr [{(Σ^{- 1} Δ Σ)}^{2}] \end{matrix}

(23)

where $m$ is the dimension of the data $y$, and $Δ μ$ and $Δ Σ$ denote the variations of $μ$ and $Σ$, respectively. In the above derivation, the following identities [33] have been used

tr (ABCD) = {(vec (D^{T}))}^{T} (C^{T} \otimes A) vec (B)

(24)

\frac{\partial tr (A X^{- 1} B)}{\partial X} = - X^{- 1} BA X^{- 1}

(25)

\frac{\partial \log det X}{\partial X} = X^{- 1}

(26)

\frac{\partial X^{- 1}}{\partial X} = - (X^{- T} \otimes X^{- 1})

(27)

where $A, B, C, D, X$ are matrices.

The distance between two points on the statistical manifold corresponds to the KL divergence between the two associated probability densities. When two points on the statistical manifold are compared, the distance measures their similarity: a shorter distance means a smaller divergence between the two probability densities. This can be used to describe the convergence of an estimate.
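The local relation between the squared distance (23) and the KL divergence can be checked numerically. The sketch below is restricted to a scalar Gaussian ($m = 1$), for which the KL divergence has a well-known closed form; the nominal parameters and perturbation sizes are arbitrary illustrative values.

```python
import math

def kl_gauss_1d(mu1, var1, mu2, var2):
    """Closed-form KL divergence D_KL(N(mu1, var1) || N(mu2, var2))."""
    return 0.5 * (math.log(var2 / var1) + (var1 + (mu1 - mu2) ** 2) / var2 - 1.0)

def ds2_gauss_1d(var, d_mu, d_var):
    """Eq. (23) with m = 1: ds^2 = d_mu^2 / var + 0.5 * (d_var / var)^2."""
    return d_mu ** 2 / var + 0.5 * (d_var / var) ** 2

mu, var = 0.0, 2.0
for scale in (1.0, 0.1, 0.01):
    d_mu, d_var = 0.1 * scale, 0.2 * scale
    kl = kl_gauss_1d(mu, var, mu + d_mu, var + d_var)
    ds2 = ds2_gauss_1d(var, d_mu, d_var)
    # The ratio 2 * KL / ds^2 should approach 1 as the perturbation shrinks.
    print(scale, 2.0 * kl / ds2)
```

The ratio departs from one for large perturbations, which is consistent with the relation being a second-order (local) approximation.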

Conventionally, a gradient method runs an iterative procedure to estimate the state, and the distance between two estimates is used to measure the convergence of the algorithm. Intuitively, convergence on the statistical manifold means that the difference between the two probability distributions corresponding to two estimates is very small. In the iterative estimation procedure, convergence means that two successive estimates are almost identical or, in practice, that the distance between them is less than a given threshold.

4. Natural Gradient Descent Filtering

For Bayesian filtering, the posterior distribution plays the key role. However, a closed form of the posterior PDF is generally unavailable because of the Bayesian integral. Usually, optimization techniques are used to obtain an approximation. In this section, we consider this problem from the information geometric perspective and derive a new filtering method by using an information geometric optimization technique.

In Bayesian filtering, the state propagation (Equation (1)) provides the prior information on the state before the measurement update. The prior PDF is Gaussian when the state transition function is linear, and non-Gaussian when it is nonlinear. Here, we focus on the measurement update step and apply information geometric optimization to the posterior PDF. As in conventional Gaussian filtering, a Gaussian density is used to approximate the non-Gaussian density in the state propagation step [16], which gives

\begin{matrix} p (x_{k} | y_{k - 1}) \approx N (x_{k}; {\hat{x}}_{k}^{-}, Σ_{k}) \end{matrix}

(28)

where ${\hat{x}}_{k}^{-}$ and $Σ_{k}$ denote the mean and covariance matrix of the prior PDF after state propagation, that is, of the state $x_{k}$ at time $k$ given the measurement $y_{k - 1}$ at time $k - 1$. The measurement likelihood function at time $k$, i.e., the conditional probability density of the measurement $y_{k}$ given $x_{k}$, is

\begin{matrix} p (y_{k} | x_{k}) = N (y_{k}; h (x_{k}), R) \end{matrix}

(29)

We can obtain the Bayesian posterior probability density by substituting Equations (28) and (29) into Equation (2). Numerical optimization is used to obtain an approximate solution without evaluating the Bayesian integral. Similar to MAP estimation, the optimization is formulated as

\begin{matrix} {\hat{x}}_{k} & = \arg max_{x_{k}} p (x_{k} | y_{k}) = \arg min_{x_{k}} L (x_{k}) \end{matrix}

(30)

where $L (x_{k}) = - \log p (y_{k} | x_{k}) - \log p (x_{k} | y_{k - 1})$ denotes the negative log-posterior, neglecting the terms independent of $x_{k}$.

Consider the statistical manifold $S = \{p (y_{k}, x_{k}), x_{k} \in R^{n}\}$ constructed from the joint probability density $p (y_{k}, x_{k})$; the natural logarithm maps the statistical manifold to $R$ as $\log : S \to R$. Given a point ${\hat{x}}_{k}$, the Fisher metric tensor of $S$ at ${\hat{x}}_{k}$ can be calculated as

\begin{matrix} G ({\hat{x}}_{k}) & = E_{y_{k}, x_{k}} [- \nabla_{x_{k}} {[\nabla_{x_{k}} \log p (y_{k}, x_{k})]}^{T}] \\ & = E_{x_{k}} E_{y_{k} | x_{k}} [e_{y}^{T} R^{- 1} \nabla_{x_{k}}^{2} h ({\hat{x}}_{k}) + \nabla_{x_{k}} h {({\hat{x}}_{k})}^{T} R^{- 1} \nabla_{x_{k}} h ({\hat{x}}_{k}) + Σ_{k}^{- 1}] \\ & = E_{x_{k}} [\nabla_{x_{k}} h {({\hat{x}}_{k})}^{T} R^{- 1} \nabla_{x_{k}} h ({\hat{x}}_{k})] + Σ_{k}^{- 1} \\ & = \nabla_{x_{k}} h {({\hat{x}}_{k})}^{T} R^{- 1} \nabla_{x_{k}} h ({\hat{x}}_{k}) + Σ_{k}^{- 1} \end{matrix}

(31)

where $\nabla_{x_{k}} h (x_{k}) = \frac{\partial h (x_{k})}{\partial x_{k}} \in R^{m \times n}$ denotes the first-order partial derivative (Jacobian) of $h$ with respect to $x_{k}$, and $e_{y} = h (x_{k}) - y_{k}$ represents the measurement prediction error evaluated at the state estimate $x_{k} = {\hat{x}}_{k}$. The term containing $e_{y}$ vanishes under the expectation $E_{y_{k} | x_{k}}$ because the measurement noise has zero mean.

From Equation (31), we can note that the Fisher metric tensor consists of two parts: the measurement information and the prior information of the state. It also means that the curvature of the statistical manifold is affected by the measurement data and the prior information. The effect of the nonlinear measurement is reflected by the first term of the Fisher metric tensor. Equation (31) thus establishes the relationship between the nonlinear measurement and the metric tensor.
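A scalar sketch may help to see how the measurement nonlinearity enters the metric. Below, Equation (31) is specialized to a scalar state and a scalar measurement, so that $G = H^{2} / R + 1 / Σ_{k}$. The two measurement models and the numerical values are hypothetical and serve only as an illustration.

```python
def fisher_metric(x_hat, dh, R, P_prior):
    """Scalar form of Eq. (31): G = H^2 / R + 1 / P_prior, with H = dh(x_hat)."""
    H = dh(x_hat)
    return H * H / R + 1.0 / P_prior

# Hypothetical measurement models (assumptions for illustration only).
dh_linear = lambda x: 2.0                   # derivative of h(x) = 2 x
dh_atan = lambda x: 1.0 / (1.0 + x * x)     # derivative of h(x) = atan(x)

for x in (0.0, 1.0, 3.0):
    print(x,
          fisher_metric(x, dh_linear, R=0.1, P_prior=0.5),
          fisher_metric(x, dh_atan, R=0.1, P_prior=0.5))
```

For the linear model the metric is the same at every point, whereas for the nonlinear model it varies with the state, so the measurement contributes less information where `h` is flat.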

Since the joint PDFs construct the statistical manifold, the minimization of $L (x_{k})$ is converted to an optimization on the statistical manifold for the best estimate. This optimization treats the measurement and the state in a unified way, and the natural gradient descent method can be used to seek the optimal estimate on the manifold along geodesic lines.

By computing the first-order partial derivative of $L (x_{k})$, we obtain the gradient

\begin{matrix} \nabla_{x_{k}} L (x_{k}) & = \nabla_{x_{k}} h {(x_{k})}^{T} R^{- 1} e_{y} + Σ_{k}^{- 1} e_{x} \end{matrix}

(32)

where $e_{y} = h (x_{k}) - y_{k}$ and $e_{x} = x_{k} - {\hat{x}}_{k}^{-}$ denote the measurement prediction error and the state estimation error, respectively.

Since $R^{- 1}$ is a symmetric positive-definite diagonal matrix, $\nabla_{x_{k}} h {({\hat{x}}_{k})}^{T} R^{- 1} \nabla_{x_{k}} h ({\hat{x}}_{k})$ is positive semi-definite. Because $Σ_{k}^{- 1}$ is positive definite, $G (x_{k})$ is positive definite and hence nonsingular. Thus, the natural gradient on the statistical manifold is defined as

\begin{matrix} G^{- 1} ({\hat{x}}_{k}) \nabla_{x_{k}} L ({\hat{x}}_{k}) = {(\nabla_{x_{k}} h {({\hat{x}}_{k})}^{T} R^{- 1} \nabla_{x_{k}} h ({\hat{x}}_{k}) + Σ_{k}^{- 1})}^{- 1} (\nabla_{x_{k}} h {({\hat{x}}_{k})}^{T} R^{- 1} e_{y} + Σ_{k}^{- 1} e_{x}) \end{matrix}

(33)

With the natural gradient descent on the statistical manifold, we can update the state estimation

\begin{matrix} {\hat{x}}_{k}^{+} & = {\hat{x}}_{k}^{-} - η G^{- 1} ({\hat{x}}_{k}^{-}) \nabla_{x_{k}} L ({\hat{x}}_{k}^{-}) \\ = {\hat{x}}_{k}^{-} - η {(H^{T} R^{- 1} H + Σ_{k}^{- 1})}^{- 1} (H^{T} R^{- 1}) (h ({\hat{x}}_{k}^{-}) - y_{k}) \end{matrix}

(34)

and the covariance matrix is approximated by the inverse of the Fisher metric tensor

\begin{matrix} {\hat{Σ}}_{k} & = G^{- 1} ({\hat{x}}_{k}^{-}) = {(H^{T} R^{- 1} H + Σ_{k}^{- 1})}^{- 1} \end{matrix}

(35)

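where $H = {\nabla_{x_{k}} h (x_{k})|}_{x_{k} = {\hat{x}}_{k}^{-}}$ is the Jacobian matrix of the measurement function $h$. The prior term of the gradient drops out of Equation (34) because $e_{x} = 0$ at $x_{k} = {\hat{x}}_{k}^{-}$.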

Usually, one step cannot achieve the best estimation, so more steps must be utilized to achieve the final estimation. The states are updated iteratively in state space, while the corresponding posterior probabilities are moving across the statistical manifold.

To sum it up, we can construct an iterative estimation of state through the natural gradient descent method

\begin{matrix} {\hat{x}}_{k}^{i + 1} & = {\hat{x}}_{k}^{i} - η_{i} G^{- 1} ({\hat{x}}_{k}^{i}) \nabla_{x_{k}} L ({\hat{x}}_{k}^{i}) \\ = {\hat{x}}_{k}^{i} - η_{i} {(H_{i}^{T} R^{- 1} H_{i} + {({\hat{Σ}}_{k}^{i})}^{- 1})}^{- 1} (H_{i}^{T} R^{- 1} (h ({\hat{x}}_{k}^{i}) - y_{k}) + {({\hat{Σ}}_{k}^{i})}^{- 1} ({\hat{x}}_{k}^{i} - {\hat{x}}_{k}^{0})) \\ = {\hat{x}}_{k}^{i} + η_{i} {(H_{i}^{T} R^{- 1} H_{i} + {({\hat{Σ}}_{k}^{i})}^{- 1})}^{- 1} (H_{i}^{T} R^{- 1} (y_{k} - h ({\hat{x}}_{k}^{i})) - {({\hat{Σ}}_{k}^{i})}^{- 1} ({\hat{x}}_{k}^{i} - {\hat{x}}_{k}^{0})) \end{matrix}

(36)

which corresponds to moving from $p (y_{k}, {\hat{x}}_{k}^{i})$ to $p (y_{k}, {\hat{x}}_{k}^{i + 1})$ on the statistical manifold. In Equation (36), $H_{i} = \nabla_{x_{k}} h (x_{k}) |_{x_{k} = {\hat{x}}_{k}^{i}}$, and ${\hat{x}}_{k}^{0} = {\hat{x}}_{k}^{-}$ is the prior state estimate provided by the state propagation. Meanwhile, the error covariance matrix associated with ${\hat{x}}_{k}^{i + 1}$ is approximated by the inverse of the Fisher metric tensor

\begin{matrix} {\hat{Σ}}_{k}^{i + 1} & = G^{- 1} ({\hat{x}}_{k}^{i}) = {(H_{i}^{T} R^{- 1} H_{i} + {({\hat{Σ}}_{k}^{i})}^{- 1})}^{- 1} \end{matrix}

(37)

Therefore, we obtain the iterative posterior mean and covariance matrix $({\hat{x}}_{k}^{i + 1}, {\hat{Σ}}_{k}^{i + 1})$. When the stopping criterion is satisfied, the iterative procedure terminates and the filtered state is obtained.

4.1. Adaptive Step-Size

In this iterative procedure, there are some parameters that must be taken into account. One of them is the step-size parameter, which describes the update step-size in each iterated step. It can be a fixed value through the whole iterative procedure. As an alternative, the step-size is initialized and adjusted in each iteration. There are many strategies to adjust the value of step-size in order to achieve the sufficient decrease during the iterative procedure. In our proposed method, the value of step-size can be obtained by an exact line search [34]

\begin{matrix} η_{i} = \arg min_{0 < η \leq 1} L ({\hat{x}}_{k}^{i} - η {\tilde{\nabla}}_{x_{k}} L ({\hat{x}}_{k}^{i})) \end{matrix}

(38)

where ${\tilde{\nabla}}_{x_{k}} L ({\hat{x}}_{k}^{i})$ denotes the natural gradient. In each iteration, the search direction given by the natural gradient is fixed, and only the parameter $η$ needs to be selected. Usually, the candidates for $η$ can be generated randomly.
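One simple way to realize the step-size selection is sketched below: the objective is evaluated at a finite set of candidate step-sizes along the descent direction, and the best candidate is kept. The step is taken against the natural gradient, consistent with the descent update (36). The scalar model, the numerical values and the uniform candidate grid are assumptions made for illustration (randomly drawn candidates could be passed in the same way), and a finite candidate set only approximates an exact line search.

```python
def neg_log_posterior(x, y, x_prior, P_prior, h, R):
    """Scalar objective L(x) of Eq. (30), up to constants."""
    return 0.5 * (h(x) - y) ** 2 / R + 0.5 * (x - x_prior) ** 2 / P_prior

def pick_step_size(L, x, nat_grad, candidates):
    """Return the candidate eta in (0, 1] that minimizes L(x - eta * nat_grad)."""
    return min(candidates, key=lambda eta: L(x - eta * nat_grad))

# Hypothetical scalar setup (assumption): h(x) = x^2.
h = lambda x: x * x
y, x_prior, P_prior, R = 4.0, 1.5, 0.5, 0.1
L = lambda x: neg_log_posterior(x, y, x_prior, P_prior, h, R)

x = x_prior
H = 2.0 * x                                        # derivative of h at x
G = H * H / R + 1.0 / P_prior                      # scalar Fisher metric
grad = H * (h(x) - y) / R + (x - x_prior) / P_prior
nat_grad = grad / G                                # natural gradient

candidates = [k / 10 for k in range(1, 11)]        # 0.1, 0.2, ..., 1.0
eta = pick_step_size(L, x, nat_grad, candidates)
print(eta, L(x), L(x - eta * nat_grad), L(x - nat_grad))
```

In this example a step shorter than the full step gives a lower objective value, which is the situation the adaptive step-size is meant to handle.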

4.2. Stopping Criterion

In the iterative procedure, the number of steps can be fixed as a constant. However, a fixed number ignores convergence and may lead to additional computational burden. Alternatively, the number of steps is determined by a stopping criterion. As in the conventional IEKF, the stopping criterion is that the distance between the state estimates of two successive iterations is smaller than a given constant $α$, that is,

\begin{matrix} ∥ {\hat{x}}_{k}^{i + 1} - {\hat{x}}_{k}^{i} ∥^{2} = 〈 {\hat{x}}_{k}^{i + 1} - {\hat{x}}_{k}^{i}, {\hat{x}}_{k}^{i + 1} - {\hat{x}}_{k}^{i} 〉 < α \end{matrix}

(39)

While the distance $∥ {\hat{x}}_{k}^{i + 1} - {\hat{x}}_{k}^{i} ∥^{2}$ is defined in Euclidean space, its counterpart on the Riemannian manifold is

\begin{matrix} ∥ {\hat{x}}_{k}^{i + 1} - {\hat{x}}_{k}^{i} ∥_{G}^{2} = {〈 {\hat{x}}_{k}^{i + 1} - {\hat{x}}_{k}^{i}, {\hat{x}}_{k}^{i + 1} - {\hat{x}}_{k}^{i} 〉}_{G} \end{matrix}

(40)

We can use this distance to measure convergence on the statistical manifold. Owing to the equivalence between the manifold distance and the KL divergence, the KL divergence can also be used to measure convergence. Here, we adopt the KL divergence (Equation (14)) instead of the distance in the stopping criterion, i.e.,

D_{KL} (p (y_{k}, {\hat{x}}_{k}^{i}) ∥ p (y_{k}, {\hat{x}}_{k}^{i + 1})) < \frac{γ}{2}

(41)

Furthermore, this criterion also describes the divergence between the probability densities of two successive iterations. When convergence is achieved, the divergence is very low.
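The two stopping rules can be contrasted in a short sketch. Equation (14) is not reproduced in this section, so the code assumes the usual second-order approximation of the KL divergence between nearby densities, D_KL ≈ 0.5 Δx^T G Δx, under which criterion (41) reads Δx^T G Δx < γ. The function names `kl_stop` and `euclidean_stop` are hypothetical.

```python
import numpy as np

def kl_stop(x_prev, x_next, G, gamma=1.0):
    """Return True when the approximate KL divergence is below gamma / 2.

    Uses D_KL ~= 0.5 * dx^T G dx (second-order approximation, an assumption);
    G is the Fisher metric evaluated at x_prev.
    """
    dx = np.asarray(x_next, dtype=float) - np.asarray(x_prev, dtype=float)
    kl_approx = 0.5 * dx @ G @ dx
    return bool(kl_approx < gamma / 2.0)

def euclidean_stop(x_prev, x_next, alpha=1.0):
    """Stopping rule of Equation (39): squared Euclidean distance below alpha."""
    dx = np.asarray(x_next, dtype=float) - np.asarray(x_prev, dtype=float)
    return bool(dx @ dx < alpha)

# Hypothetical metric that is much more informative along the first coordinate.
G = np.diag([100.0, 1.0])
x_prev = np.zeros(2)
move_informative = np.array([0.5, 0.0])   # same length, different direction
move_flat = np.array([0.0, 0.5])
```

With these values the Euclidean rule treats both moves identically, while the metric-weighted rule keeps iterating for the move along the well-observed coordinate.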

5. Discussion

5.1. Comparison with KF

When the state-space model is linear and Gaussian, i.e.,

f (x_{k - 1}) = F x_{k - 1}

and

h (x_{k}) = H x_{k}

, the Fisher metric is

\begin{matrix} G (x_{k}) = H^{T} R^{- 1} H + Σ_{k}^{- 1} \end{matrix}

(42)

It is independent of the state

x_{k}

, which means that the metric tensor is constant across the statistical manifold. Thus, only one step is needed to estimate the state, and the full step is taken by setting

η = 1

. The natural gradient descent filtering (NGDF) then simplifies to

\begin{matrix} {\hat{x}}_{k} & = {\hat{x}}_{k}^{-} - {G^{- 1} (x_{k}) \nabla L (x_{k})|}_{x_{k} = {\hat{x}}_{k}^{-}} \\ & = {\hat{x}}_{k}^{-} - {{(H^{T} R^{- 1} H + Σ_{k}^{- 1})}^{- 1} (H^{T} R^{- 1} e_{y} + Σ_{k}^{- 1} e_{x})|}_{x_{k} = {\hat{x}}_{k}^{-}} \\ & = {\hat{x}}_{k}^{-} + (Σ_{k} H^{T}) {(H Σ_{k} H^{T} + R)}^{- 1} (y_{k} - H {\hat{x}}_{k}^{-}) \end{matrix}

(43)

where

h (x_{k}) = H x_{k}

,

e_{y} = h (x_{k}) - y_{k}

, and

e_{x} = x_{k} - {\hat{x}}_{k}^{-}

. When the first-order partial derivative is evaluated at the prior estimate,

e_{x}

is zero. In addition, the matrix inversion lemma (Equation (47)) has been used. Equation (43) is the same as the measurement update of the conventional Kalman filter, where the Kalman gain is defined as

K = (Σ_{k} H^{T}) {(H Σ_{k} H^{T} + R)}^{- 1}

. Meanwhile, using Equation (48), the error covariance matrix (Equation (37)) can be calculated as

\begin{matrix} {\hat{Σ}}_{k} & = G^{- 1} (x_{k}) = {(H^{T} R^{- 1} H + Σ_{k}^{- 1})}^{- 1} \\ & = Σ_{k} - Σ_{k} H^{T} {(H Σ_{k} H^{T} + R)}^{- 1} H Σ_{k} \\ & = Σ_{k} - K (H Σ_{k} H^{T} + R) K^{T} \end{matrix}

(44)

Comparing these expressions with the state estimate and error covariance matrix of the Kalman filter shows that the NGDF is equivalent to the Kalman filter when the model is linear and Gaussian. As for the EKF, the NGDF linearizes the nonlinear measurement function in the same way, so its update has a form similar to that of the Kalman filter.
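The equivalence in Equations (43) and (44) can be checked numerically. The sketch below draws an arbitrary linear-Gaussian problem (the dimensions, the random seed and all variable names are assumptions made for illustration) and compares one full natural gradient step with the conventional Kalman measurement update.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 4, 3                                    # state and measurement sizes

H = rng.standard_normal((m, n))                # linear measurement matrix
A = rng.standard_normal((n, n))
Sigma = A @ A.T + n * np.eye(n)                # prior covariance (symmetric positive definite)
B = rng.standard_normal((m, m))
R = B @ B.T + m * np.eye(m)                    # measurement noise covariance
x_prior = rng.standard_normal(n)
y = rng.standard_normal(m)

# Natural gradient step with eta = 1 (first line of Equation (43)).
G = H.T @ np.linalg.solve(R, H) + np.linalg.inv(Sigma)     # Equation (42)
grad = H.T @ np.linalg.solve(R, H @ x_prior - y)           # e_x = 0 at the prior
x_ngdf = x_prior - np.linalg.solve(G, grad)
P_ngdf = np.linalg.inv(G)

# Conventional Kalman measurement update (last lines of Equations (43), (44)).
S = H @ Sigma @ H.T + R
K = Sigma @ H.T @ np.linalg.inv(S)
x_kf = x_prior + K @ (y - H @ x_prior)
P_kf = Sigma - K @ S @ K.T

print(np.allclose(x_ngdf, x_kf), np.allclose(P_ngdf, P_kf))
```

Both comparisons should agree up to floating-point tolerance, which is the numerical counterpart of the matrix inversion lemma used in the derivation.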

5.2. Comparison with IEKF

Furthermore, if we set the step-size parameter as

η = 1

in the NGDF, the state update can be rewritten as

\begin{matrix} {\hat{x}}_{k}^{i + 1} & = {\hat{x}}_{k}^{i} + {(H_{i}^{T} R^{- 1} H_{i} + {({\hat{Σ}}_{k}^{i})}^{- 1})}^{- 1} (H_{i}^{T} R^{- 1} (y_{k} - h ({\hat{x}}_{k}^{i})) - {({\hat{Σ}}_{k}^{i})}^{- 1} ({\hat{x}}_{k}^{i} - {\hat{x}}_{k}^{0})) \\ = {\hat{x}}_{k}^{i} + {(H_{i}^{T} R^{- 1} H_{i} + {({\hat{Σ}}_{k}^{i})}^{- 1})}^{- 1} H_{i}^{T} R^{- 1} (y_{k} - h ({\hat{x}}_{k}^{i})) - {(H_{i}^{T} R^{- 1} H_{i} + {({\hat{Σ}}_{k}^{i})}^{- 1})}^{- 1} {({\hat{Σ}}_{k}^{i})}^{- 1} ({\hat{x}}_{k}^{i} - {\hat{x}}_{k}^{0}) \\ = {\hat{x}}_{k}^{i} + {\hat{Σ}}_{k}^{i} H_{i}^{T} {(H_{i} {\hat{Σ}}_{k}^{i} H_{i}^{T} + R)}^{- 1} (y_{k} - h ({\hat{x}}_{k}^{i})) - ({\hat{Σ}}_{k}^{i} - {\hat{Σ}}_{k}^{i} H_{i}^{T} {(H_{i} {\hat{Σ}}_{k}^{i} H_{i}^{T} + R)}^{- 1} H_{i} {\hat{Σ}}_{k}^{i}) {({\hat{Σ}}_{k}^{i})}^{- 1} ({\hat{x}}_{k}^{i} - {\hat{x}}_{k}^{0}) \\ = {\hat{x}}_{k}^{0} + {\hat{Σ}}_{k}^{i} H_{i}^{T} {(H_{i} {\hat{Σ}}_{k}^{i} H_{i}^{T} + R)}^{- 1} (y_{k} - h ({\hat{x}}_{k}^{i})) + {\hat{Σ}}_{k}^{i} H_{i}^{T} {(H_{i} {\hat{Σ}}_{k}^{i} H_{i}^{T} + R)}^{- 1} H_{i} ({\hat{x}}_{k}^{i} - {\hat{x}}_{k}^{0}) \\ = {\hat{x}}_{k}^{0} + K_{i} (y_{k} - h ({\hat{x}}_{k}^{i}) + H_{i} ({\hat{x}}_{k}^{i} - {\hat{x}}_{k}^{0})) \end{matrix}

(45)

where

K_{i} = {\hat{Σ}}_{k}^{i} H_{i}^{T} {(H_{i} {\hat{Σ}}_{k}^{i} H_{i}^{T} + R)}^{- 1}

, and the covariance matrix is approximated by the inverse of the Fisher metric tensor

\begin{matrix} {\hat{Σ}}_{k}^{i + 1} = G^{- 1} ({\hat{x}}_{k}^{i}) = {\hat{Σ}}_{k}^{i} - {\hat{Σ}}_{k}^{i} H_{i}^{T} {(H_{i} {\hat{Σ}}_{k}^{i} H_{i}^{T} + R)}^{- 1} H_{i} {\hat{Σ}}_{k}^{i} \end{matrix}

(46)

In the above procedure, the two forms of the matrix inversion lemma

{(H^{T} R^{- 1} H + Σ^{- 1})}^{- 1} H^{T} R^{- 1} = Σ H^{T} {(H Σ H^{T} + R)}^{- 1}

(47)

{(H^{T} R^{- 1} H + Σ^{- 1})}^{- 1} = Σ - Σ H^{T} {(H Σ H^{T} + R)}^{- 1} H Σ

(48)

have been used.

Thus, the iterative procedure of the NGDF coincides with the IEKF when we set

η = 1

for the step-size parameter. However, the step-size of the NGDF is usually adjusted according to the objective function. Adjusting the step-size may improve the estimation accuracy of the NGDF, as illustrated in the following experiments.
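As a numerical check of Equation (45), the sketch below evaluates its first and last lines for a single iteration. The toy range-bearing measurement function, its Jacobian and all numerical values are hypothetical and chosen only for illustration; they are not the tracking model of Section 6.

```python
import numpy as np

def h(x):
    """Toy nonlinear measurement (hypothetical): range and bearing of a 2-D point."""
    return np.array([np.hypot(x[0], x[1]), np.arctan2(x[1], x[0])])

def jacobian(x):
    """Analytic Jacobian of the toy measurement function."""
    r2 = x[0] ** 2 + x[1] ** 2
    r = np.sqrt(r2)
    return np.array([[x[0] / r, x[1] / r],
                     [-x[1] / r2, x[0] / r2]])

def ngdf_step(x_i, x_0, Sigma, R, y, eta=1.0):
    """One natural gradient step: first line of Equation (45), scaled by eta."""
    H = jacobian(x_i)
    G = H.T @ np.linalg.solve(R, H) + np.linalg.inv(Sigma)
    grad = H.T @ np.linalg.solve(R, h(x_i) - y) + np.linalg.solve(Sigma, x_i - x_0)
    return x_i - eta * np.linalg.solve(G, grad)

def iekf_step(x_i, x_0, Sigma, R, y):
    """One IEKF iteration: last line of Equation (45)."""
    H = jacobian(x_i)
    K = Sigma @ H.T @ np.linalg.inv(H @ Sigma @ H.T + R)
    return x_0 + K @ (y - h(x_i) + H @ (x_i - x_0))

x_0 = np.array([10.0, 5.0])             # prior mean
x_i = np.array([10.5, 4.6])             # current iterate
Sigma = np.diag([4.0, 4.0])             # covariance used in the iteration
R = np.diag([0.1 ** 2, 0.01 ** 2])      # measurement noise covariance
y = h(np.array([11.0, 4.0]))            # noise-free measurement of a nearby point

full = ngdf_step(x_i, x_0, Sigma, R, y)
print(np.allclose(full, iekf_step(x_i, x_0, Sigma, R, y)))
```

Passing a value of `eta` below 1 to `ngdf_step` gives the damped update that distinguishes the NGDF from the IEKF.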

5.3. Comparison with GIKF

Apart from the natural gradient method, the Newton–Raphson method is another gradient-based method. Hu et al. [12] proposed the GIKF based on the Newton–Raphson method. It should be noted that the natural gradient method differs fundamentally from the Newton–Raphson method. The natural gradient multiplies the gradient by the inverse of the Riemannian metric tensor, while the Newton–Raphson method multiplies the gradient by the inverse of the Hessian matrix. The Riemannian metric tensor is the metric of the underlying space and is independent of the objective function to be approximated, whereas the Hessian matrix depends on the objective function and on the parameter coordinates. Thus, the natural gradient, which uses the Riemannian metric tensor instead of the Hessian matrix, is more robust [28].
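To make the distinction concrete, assume the objective is the negative logarithm of the Gaussian joint density with constants dropped, L(x_k) = ½ (y_k − h(x_k))^T R^{−1} (y_k − h(x_k)) + ½ (x_k − x̂_k^−)^T Σ_k^{−1} (x_k − x̂_k^−). This form is consistent with the gradient used in Equation (43), but it is restated here as an assumption. Writing H for the Jacobian of h at x_k and h_j for the j-th component of h, the Hessian and the Fisher metric then differ by a residual-weighted curvature term:

```latex
\begin{aligned}
\nabla^{2} L(x_k) &= H^{T} R^{-1} H + \Sigma_k^{-1}
  + \sum_{j} \left[ R^{-1} \left( h(x_k) - y_k \right) \right]_{j} \nabla^{2} h_{j}(x_k), \\
G(x_k) &= H^{T} R^{-1} H + \Sigma_k^{-1}.
\end{aligned}
```

The extra term depends on the measurement residual and on the second derivatives of h, and it vanishes when h is linear. The metric G is positive definite whenever Σ_k is, whereas the Hessian can lose positive definiteness when the residual is large.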

5.4. Relationship with Riemannian Manifold MCMC

Besides the MAP technique, the Monte Carlo technique is an alternative way to evaluate the integrals in Bayesian filtering. The sequential Monte Carlo (SMC) and the Markov chain Monte Carlo (MCMC) methods are two widely used approaches. The SMC method has been applied to filtering problems, where it is known as the particle filter (PF). The MCMC method is a fundamental tool for generating samples from a posterior density in Bayesian data analysis and inference, and it performs robustly in nonlinear filtering and manifold learning. Recently, geometric concepts have been introduced into the MCMC method. Just as we take the underlying geometric structure into account for recursive estimation based on PDFs, the Riemannian manifold is used in the MCMC method. This method was introduced by Girolami and Calderhead [35] as

\begin{matrix} θ^{n + 1} = θ^{n} + \frac{ε^{2}}{2} G^{- 1} (θ^{n}) \nabla_{θ} ℓ (θ^{n}) + ε \sqrt{G^{- 1} (θ^{n})} z^{n} \end{matrix}

(49)

where

ε \in (0, 1]

denotes the step-size of integration, and the random variable satisfies

z \sim N (z; 0, I)

. The sampling mechanism can be rewritten in the Gaussian form

N (θ; μ (θ^{n}, ε), Σ (θ^{n}, ε))

, where

\begin{matrix} μ (θ^{n}, ε) & = θ^{n} + \frac{ε^{2}}{2} G^{- 1} (θ^{n}) \nabla_{θ} ℓ (θ^{n}) \end{matrix}

(50)

\begin{matrix} Σ (θ^{n}, ε) & = ε^{2} G^{- 1} (θ^{n}) \end{matrix}

(51)

These are similar to our proposed filtering method: the state update is performed by the natural gradient method, and the covariance matrix is approximated by the inverse of the Riemannian metric tensor. The differences from our proposed method are the step-size in the state update and the decaying factor applied to the Riemannian metric tensor. The reason is that the Riemannian manifold MCMC is derived from Hamiltonian dynamics and Langevin diffusion, while our proposed method is derived from information geometric optimization on the statistical manifold. In addition, the metric tensor plays an important role on the Riemannian manifold, and the natural gradient method provides a general approach for optimization.
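The proposal mechanism of Equations (49) to (51) can be sketched as below. The function name `manifold_langevin_proposal` is hypothetical, the metric is treated as locally constant, and the Cholesky factor of the inverse metric is used in place of its symmetric square root, which is an assumption that leaves the Gaussian law of the proposal unchanged. A complete sampler would also need an accept/reject step, which is omitted here.

```python
import numpy as np

def manifold_langevin_proposal(theta, grad_log_density, G, eps, rng):
    """Draw one proposal from N(mu, Sigma) given by Equations (50) and (51).

    theta            : current point
    grad_log_density : gradient of the log-density at theta
    G                : metric tensor at theta (treated as constant, an assumption)
    eps              : integration step-size in (0, 1]
    """
    G_inv = np.linalg.inv(G)
    mu = theta + 0.5 * eps ** 2 * G_inv @ grad_log_density    # Equation (50)
    cov = eps ** 2 * G_inv                                    # Equation (51)
    # Any factor A with A @ A.T == G_inv produces the same Gaussian law.
    A = np.linalg.cholesky(G_inv)
    z = rng.standard_normal(theta.shape[0])
    return mu + eps * A @ z, mu, cov

rng = np.random.default_rng(0)
sample, mu, cov = manifold_langevin_proposal(
    np.zeros(2), np.array([2.0, 0.0]), 2.0 * np.eye(2), 1.0, rng)
```

The mean is a natural gradient step of size ε²/2 and the covariance is the scaled inverse metric, which is the structural similarity to the NGDF update noted above.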

6. Simulation

In this section, we compare the classical filtering methods EKF, IEKF and RUF with our NGDF method in a passive target tracking application [14,36]. The EKF uses a single step in the update procedure, and the IEKF is an iterative filtering method derived from the Newton method. The RUF [7,8] is another iterative filtering method; it is derived from the LMMSE estimator rather than the MAP estimator, and it uses a fixed number of steps.

For the system setting, the state at time instant k is

x_{k} = {[x_{k}, {\dot{x}}_{k}, y_{k}, {\dot{y}}_{k}]}^{T}

consisting of position vector

{[x_{k}, y_{k}]}^{T}

and velocity vector

{[{\dot{x}}_{k}, {\dot{y}}_{k}]}^{T}

, while the measurement at the same instant is

z_{k} = {[θ_{k}, {\dot{θ}}_{k}, {\dot{f}}_{k}]}^{T}

consisting of bearing

θ_{k}

, bearing rate

{\dot{θ}}_{k}

, and Doppler rate

{\dot{f}}_{k}

.

The state equation is linear and represented as

\begin{matrix} x_{k} = F x_{k - 1} + G v_{k} \end{matrix}

(52)

where

\begin{matrix} F = I_{2} \otimes (\begin{matrix} 1 & τ \\ 0 & 1 \end{matrix}), \quad G = I_{2} \otimes (\begin{matrix} 0.5 τ^{2} \\ τ \end{matrix}) \end{matrix}

(53)

where

I_{2}

is a

2 \times 2

identity matrix, ⊗ is the Kronecker product,

τ

is the sampling time, and

v_{k} = {[v_{x_{k}}, v_{y_{k}}]}^{T}

is zero-mean process noise with covariance matrix

Q_{k}

.
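The Kronecker-product construction in Equation (53) can be written out directly. In this sketch the noise gain is named `Gamma` to avoid a clash with the Fisher metric G, the helper `propagate` is hypothetical, and the values of τ and Q are the ones used in the simulation settings of this section.

```python
import numpy as np

tau = 0.5                                          # sampling time in seconds
F = np.kron(np.eye(2), np.array([[1.0, tau],
                                 [0.0, 1.0]]))     # 4 x 4 transition matrix
Gamma = np.kron(np.eye(2), np.array([[0.5 * tau ** 2],
                                     [tau]]))      # 4 x 2 noise gain (G in Equation (52))
Q = np.diag([9.0 ** 2, 2.0 ** 2])                  # process noise covariance

def propagate(x, rng):
    """One step of Equation (52): x_k = F x_{k-1} + G v_k with v_k ~ N(0, Q)."""
    v = rng.multivariate_normal(np.zeros(2), Q)
    return F @ x + Gamma @ v

x_next = propagate(np.array([800.0, -50.0, 300.0, 10.0]), np.random.default_rng(0))
```

Because the identity matrix is the left factor, the result is block diagonal, which matches the state ordering of position and velocity along x followed by position and velocity along y.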

The measurement equation is

\begin{matrix} z_{k} & = [\begin{matrix} θ_{k} \\ {\dot{θ}}_{k} \\ {\dot{f}}_{k} \end{matrix}] = [\begin{matrix} \arctan (y_{k} / x_{k}) \\ ({\dot{y}}_{k} x_{k} - {\dot{x}}_{k} y_{k}) / r_{k}^{2} \\ - {({\dot{y}}_{k} x_{k} - {\dot{x}}_{k} y_{k})}^{2} / (λ r_{k}^{3}) \end{matrix}] + [\begin{matrix} n_{θ_{k}} \\ n_{{\dot{θ}}_{k}} \\ n_{{\dot{f}}_{k}} \end{matrix}] \\ & ≜ h (x_{k}) + n_{k} \end{matrix}

(54)

where

r_{k} = \sqrt{x_{k}^{2} + y_{k}^{2}}

,

λ

denotes the wavelength of received signal, and

n_{θ_{k}}, n_{{\dot{θ}}_{k}}, n_{{\dot{f}}_{k}}

are mutually independent zero-mean Gaussian distributed noises with covariance matrix

R_{k} = diag [σ_{θ}^{2}, σ_{\dot{θ}}^{2}, σ_{\dot{f}}^{2}]

. Here, we treat

x_{k}, y_{k}

as the coordinates of the target, and

z_{k}

as the measurement, which differs from the notation used in the preceding sections.
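A direct transcription of the measurement function in Equation (54) is sketched below. Two choices are assumptions rather than part of the source: `np.arctan2` replaces arctan(y/x) (the two agree when x > 0), and a central-difference Jacobian stands in for the analytic Jacobian that a filter would normally use.

```python
import numpy as np

def h(state, wavelength=0.1):
    """Measurement function of Equation (54) for state [x, x_dot, y, y_dot]."""
    x, x_dot, y, y_dot = state
    r = np.hypot(x, y)                       # range r_k
    cross = y_dot * x - x_dot * y
    bearing = np.arctan2(y, x)               # equals arctan(y / x) when x > 0
    bearing_rate = cross / r ** 2
    doppler_rate = -cross ** 2 / (wavelength * r ** 3)
    return np.array([bearing, bearing_rate, doppler_rate])

def numerical_jacobian(func, state, step=1e-6):
    """Central-difference Jacobian; a stand-in for the analytic one (assumption)."""
    state = np.asarray(state, dtype=float)
    base = func(state)
    J = np.zeros((base.size, state.size))
    for j in range(state.size):
        d = np.zeros_like(state)
        d[j] = step * max(1.0, abs(state[j]))
        J[:, j] = (func(state + d) - func(state - d)) / (2.0 * d[j])
    return J

z = h(np.array([800.0, -50.0, 300.0, 10.0]))
J = numerical_jacobian(h, [800.0, -50.0, 300.0, 10.0])
```

The first row of the Jacobian has zero entries for the velocity components, since the bearing depends on position only.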

In this simulation, we consider 200 time steps for tracking, and

τ = 0.5 s

,

λ = 0.1 m

,

Q = diag ([{(9 m / s^{2})}^{2}, {(2 m / s^{2})}^{2}])

. The prior PDF at time 0 is

p (x_{0}) = N (x_{0}; μ_{0}, Σ_{0})

, where

μ_{0} = {[800 m, - 50 m / s, 300 m, 10 m / s]}^{T}

, and

Σ_{0} = [\begin{matrix} σ_{x}^{2} & 0 & ρ σ_{x} σ_{y} & 0 \\ 0 & σ_{v}^{2} & 0 & 0 \\ ρ σ_{x} σ_{y} & 0 & σ_{y}^{2} & 0 \\ 0 & 0 & 0 & σ_{v}^{2} \end{matrix}]

(55)

where

σ_{x} = 500 m

,

σ_{y} = 200 m

,

ρ = 0.95

, and

σ_{v} = 100 m / s

.
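The prior of Equation (55) can be assembled and sampled as follows. The helper name `draw_initial_state` is hypothetical; the numerical values are those stated above.

```python
import numpy as np

sigma_x, sigma_y, sigma_v, rho = 500.0, 200.0, 100.0, 0.95
mu0 = np.array([800.0, -50.0, 300.0, 10.0])          # [x, x_dot, y, y_dot]

Sigma0 = np.diag([sigma_x ** 2, sigma_v ** 2, sigma_y ** 2, sigma_v ** 2])
Sigma0[0, 2] = Sigma0[2, 0] = rho * sigma_x * sigma_y   # position cross-covariance

def draw_initial_state(rng):
    """Sample x_0 ~ N(mu0, Sigma0), e.g., at the start of a Monte Carlo run."""
    return rng.multivariate_normal(mu0, Sigma0)

x0 = draw_initial_state(np.random.default_rng(0))
```

With ρ = 0.95 the two position coordinates are strongly correlated, but the matrix remains positive definite, so it is a valid covariance.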

We carry out the numerical experiment from two aspects: the first uses different initializations of the filters on the same track, while the second uses different tracks with different noise levels. The normalized root mean square errors (RMSEs) of position and velocity are used to evaluate the tracking performance. They are defined as

\begin{matrix} {RMSE}_{Pos} & = \frac{\sqrt{{(x - \hat{x})}^{2} + {(y - \hat{y})}^{2}}}{\sqrt{x^{2} + y^{2}}} \end{matrix}

(56)

\begin{matrix} {RMSE}_{Vel} & = \frac{\sqrt{{(v_{x} - {\hat{v}}_{x})}^{2} + {(v_{y} - {\hat{v}}_{y})}^{2}}}{\sqrt{v_{x}^{2} + v_{y}^{2}}} \end{matrix}

(57)

where

{[x, v_{x}, y, v_{y}]}^{T}

denotes the true state, and

{[\hat{x}, {\hat{v}}_{x}, \hat{y}, {\hat{v}}_{y}]}^{T}

is the estimated state. For the filtering methods, we set

γ = 1

in the stopping criterion (41) for our proposed method, while

α = 1

in (39) for the IEKF. In the iterative methods (IEKF, RUF and NGDF), the maximum number of iterative steps is

N = 30

.
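The error measures of Equations (56) and (57) can be computed per time step as sketched below. The function name `normalized_errors` is hypothetical, and how the values are averaged over Monte Carlo runs is left to the caller, since the averaging is not written out in the equations.

```python
import numpy as np

def normalized_errors(true_state, est_state):
    """Normalized position and velocity errors of Equations (56) and (57).

    Both states are ordered as [x, v_x, y, v_y].
    """
    t = np.asarray(true_state, dtype=float)
    e = np.asarray(est_state, dtype=float)
    pos_err = np.hypot(t[0] - e[0], t[2] - e[2]) / np.hypot(t[0], t[2])
    vel_err = np.hypot(t[1] - e[1], t[3] - e[3]) / np.hypot(t[1], t[3])
    return pos_err, vel_err

perfect = normalized_errors([3.0, 0.0, 4.0, 5.0], [3.0, 0.0, 4.0, 5.0])
offset = normalized_errors([3.0, 0.0, 4.0, 5.0], [0.0, 3.0, 0.0, 1.0])
```

Dividing by the true range and speed makes the errors dimensionless, so tracks at different distances can be compared on one scale.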

Firstly, we set the measurement noise standard deviations as

σ_{θ} = 2 \times 10^{- 4} rad

,

σ_{\dot{θ}} = 10^{- 5} rad / s

and

σ_{\dot{f}} = 0.05 Hz / s

, and select the track shown in Figure 1a. We carry out 100 Monte Carlo runs and average over the realizations to obtain the filtered track and the normalized position RMSE. In each Monte Carlo run, the initial filtered state at time 0 is generated randomly according to the prior PDF, so it generally differs from run to run. The results are shown in Figure 1. The different initial states affect the first few steps, after which the filters perform well. In addition, changes in the trajectory can degrade the tracking performance. In this experiment, the track filtered by our method is closer to the true track than those of the other methods, and its normalized position RMSE is lower. Compared with the EKF, the iterative methods perform better. This is because the iterative methods use additional iterations to handle the nonlinear measurement function, which is more accurate than the single linearization in the EKF.

Secondly, we consider different tracks with different noise levels. We carry out 100 Monte Carlo runs and average over the realizations of the tracks and the corresponding filtered states. The initial true state is generated randomly according to the prior PDF of the state. Because a random initial state is used for each track and process noise is added to the state at each time step, the tracks differ across the Monte Carlo runs. In the Bayesian filtering framework, we focus on the measurement update, so different levels of measurement noise are considered in this paper. We analyze three scenarios that differ in the measurement noise standard deviations:

Scenario 1: $σ_{θ} = 2 \times 10^{- 4} rad$, $σ_{\dot{θ}} = 10^{- 5} rad / s$ and $σ_{\dot{f}} = 0.05 Hz / s$;
Scenario 2: $σ_{θ} = 2 \times 10^{- 6} rad$, $σ_{\dot{θ}} = 10^{- 5} rad / s$ and $σ_{\dot{f}} = 0.05 Hz / s$;
Scenario 3: $σ_{θ} = 2 \times 10^{- 6} rad$, $σ_{\dot{θ}} = 10^{0.1} rad / s$ and $σ_{\dot{f}} = 10^{- 3} Hz / s$.

The normalized RMSEs of position and velocity are shown in Figure 2, Figure 3 and Figure 4, and the number of iterative steps is shown in Figure 5. The results in Figure 2, Figure 3 and Figure 4 show that the iterative methods perform better than the EKF, and that our proposed NGDF outperforms the other two iterative methods, IEKF and RUF, in both position and velocity estimation. Our method is stable, with the normalized position RMSE remaining at a low level, whereas the EKF performs poorly and diverges at some time steps. For velocity estimation, the performance of all four filtering methods fluctuates considerably. This is a limitation of the measurements of the radar system: the position measurement can be obtained directly, while the velocity measurement is obtained indirectly and also requires the position measurement. Errors in the position measurement and estimate therefore also affect the velocity estimate. Even so, our method fluctuates less than the other methods. In Figure 2, Figure 3 and Figure 4, the EKF fluctuates greatly, which means that it diverges in many cases and is not suitable for this tracking problem with nonlinear measurements. Meanwhile, the IEKF exhibits some jump points. This is because the stopping criterion is satisfied before the iterative estimate has converged. The RUF has a stable performance second only to our method. However, as shown in Figure 5, it has a far greater computational burden than our method, the EKF and the IEKF.

To measure the computational burden, we first compare the number of iterations. The average number of iterations in each time step is shown in Figure 5. In the simulation, the EKF uses one step to obtain the filtered state in each time step, i.e., the iteration number is

N = 1

. The RUF is an iterative method with a fixed iteration number of

N = 30

. For these three scenarios, the iteration number of the IEKF is about

N = 5

, while the iteration number of the NGDF is generally less than

N = 4

in Figure 5a,b. In Figure 5c, the NGDF needs more iterations than the IEKF before time step 80, but fewer in the subsequent time steps. These results reflect the computational burden. In each time step, the computational burden of the RUF is about 30 times that of the EKF, while that of the IEKF is about 5 times and that of the NGDF is less than 4 times. Comparing the IEKF and the NGDF, the step-size of the NGDF is adaptive, which leads to fewer steps and a more robust estimate, whereas the IEKF uses the full step-size, which may cause fluctuation or delay the stopping time. In addition, we compare the execution times in milliseconds for Scenario 1. The four methods are implemented in Matlab on an Intel Core i7 laptop. The average execution times over the 100 Monte Carlo runs are: EKF (11.5 s), IEKF (38.4 s), NGDF (26.4 s) and RUF (187.6 s). Note that these times cover the whole filtering procedure, from the first time step to the last. We can conclude that, among the iterative methods, our proposed method achieves the lowest computational burden.

In this simulation, we also note that the performance may degrade after a certain number of tracking steps. This is due to the resolution limit of the radar system measurements. The resolution limit affects the accuracy of the measurements and thus the state estimate. This is why radar tracking is confined to a tracking time interval; beyond this interval, reliable tracking performance is not available. In addition, the noise level influences both the track and the estimate. A lower level of process noise makes the track closer to deterministic, while a lower level of measurement noise makes the measurements more informative. Both can make the estimate more accurate, but there is no simple deterministic relationship between the noise level and the filtering performance, because of the nonlinear function relating the state to the measurement. As is usual for filtering problems, we have made some simplifications and considered a limited set of cases to compare the filtering methods.

To sum up, in this simulation, our method achieves the best performance among the compared methods. It is stable, and its gain in performance does not come with a much higher computational burden than the other iterative methods.

7. Conclusions

In this paper, we have derived a novel filtering method using the information geometric approach. The filtering problem has been converted to an optimization problem on the statistical manifold constructed from the joint PDFs of Bayesian filtering, and the adaptive natural gradient descent method is used to seek the optimal estimate. In the filtering procedure, the curvature introduced by the nonlinearity of the observation operator

h

is accounted for by the Fisher metric tensor. For Bayesian filtering, the Fisher metric tensor consists of a measurement likelihood term and a prior state information term. The metric is constant in the linear case and varies with the state in the nonlinear case. The adaptive natural gradient descent technique is then used to derive the iterative filter, and the KL divergence is employed in the stopping criterion. Furthermore, the conventional Kalman filter, the EKF and the IEKF are special cases of our proposed method under certain conditions. With the adaptive step-size and the KL divergence stopping criterion, the proposed method improves on the EKF, IEKF and RUF.

For Bayesian nonlinear filtering, the posterior density may be non-Gaussian, for two reasons. The first is that non-Gaussian observation densities make the posterior density non-Gaussian. An optimal filter has been proposed to address this problem [37]; it modifies the Kalman filter to handle the non-Gaussian observation density. The second is that a nonlinear measurement function turns Gaussian densities into non-Gaussian ones. The Monte Carlo method can be used to solve this problem, at the cost of a large computational burden.

In the derivation of our method, we note that information geometric optimization for filtering can be extended to some non-Gaussian cases, such as exponential densities. When the density has an analytic form, the metric can be computed, and information geometric optimization can be used to derive the filtering method. Moreover, since a non-Gaussian density can be approximated by a sum of Gaussian densities, the method proposed in this paper can be applied to each Gaussian component. In future work, we will continue to combine information geometric optimization with non-Gaussian filtering and provide approaches for handling non-Gaussian densities in filtering.

Acknowledgments

This work was supported by the National Natural Science Foundation of China under grant No. 61601479. The authors gratefully acknowledge the reviewers for their very valuable and insightful comments and suggestions, which have improved the presentation.

Author Contributions

Yubo Li and Yongqiang Cheng put forward the original ideas and performed the research. Xiang Li and Hongqiang Wang conceived and designed the simulations comparing the proposed method with other existing methods. Xiaoqiang Hua discussed the adaptive natural gradient descent method. Yuliang Qin reviewed the paper and provided useful comments. All authors have read and approved the final manuscript.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Haug, A. Bayesian Estimation and Tracking: A Practical Guide; John Wiley & Sons: Hoboken, NJ, USA, 2012.
2. Julier, S.; Uhlmann, J. Unscented filtering and nonlinear estimation. Proc. IEEE 2004, 92, 401–422.
3. Arasaratnam, I.; Haykin, S.; Elliott, R. Discrete-Time Nonlinear Filtering Algorithms Using Gauss–Hermite Quadrature. Proc. IEEE 2007, 95, 953–977.
4. Arasaratnam, I.; Haykin, S. Cubature Kalman filters. IEEE Trans. Autom. Control 2009, 54, 1254–1269.
5. Stano, P.; Lendek, Z.; Braaksma, J.; Babuška, R.; Keizer, C.; den Dekker, A. Parametric Bayesian Filters for Nonlinear Stochastic Dynamical Systems: A Survey. IEEE Trans. Cybern. 2013, 43, 1607–1624.
6. Arulampalam, M.; Maskell, S.; Gordon, N. A tutorial on particle filters for online nonlinear/non-gaussian bayesian tracking. IEEE Trans. Signal Process. 2002, 50, 174–188.
7. Zanetti, R. Recursive Update Filtering for Nonlinear Estimation. IEEE Trans. Autom. Control 2012, 57, 1481–1490.
8. Zanetti, R. Adaptable Recursive Update Filter. J. Guid. Control Dyn. 2015, 38, 1295–1299.
9. Bell, B.; Cathey, F. The iterated Kalman filter update as a Gauss-Newton method. IEEE Trans. Autom. Control 1993, 38, 294–297.
10. Bellaire, R.; Kamen, E.; Zabin, S. A new nonlinear iterated filter with applications to target tracking. In Proceedings of the SPIE’s 1995 International Symposium on Optical Science, Engineering, and Instrumentation, San Diego, CA, USA, 9–14 July 1995; Volume 2561, pp. 240–251.
11. Auvinen, H.; Bardsley, J.; Haario, H.; Kauranne, T. The Variational Kalman Filter and an efficient implementation using limited memory BFGS. Int. J. Numer. Methods Fluids 2010, 64, 314–335.
12. Hu, X.; Bao, M.; Zhang, X.; Guan, L.; Hu, Y. Generalized Iterated Kalman Filter and its Performance Evaluation. IEEE Trans. Signal Process. 2015, 63, 3204–3217.
13. Lefebvre, T.; Bruyninckx, H.; Schutter, J. Kalman filters for nonlinear systems: A comparison of performance. Int. J. Control 2004, 77, 639–653.
14. Zhan, R.; Wan, J. Iterated Unscented Kalman Filter for Passive target tracking. IEEE Trans. Aerosp. Electron. Syst. 2007, 43, 1155–1163.
15. Morelande, M.; García-Fernández, Á. Analysis of Kalman filter approximations for nonlinear measurements. IEEE Trans. Signal Process. 2013, 61, 5477–5484.
16. García-Fernández, Á.; Svensson, L.; Morelande, M.; Sarkka, S. Posterior Linearization Filter: Principles and Implementation Using Sigma Points. IEEE Trans. Signal Process. 2015, 63, 5561–5573.
17. García-Fernández, Á.; Morelande, M.; Grajal, J.; Svensson, L. Adaptive unscented Gaussian likelihood approximation filter. Automatica 2015, 54, 166–175.
18. Raitoharju, M.; García-Fernández, Á.; Piché, R. Kullback-Leibler divergence approach to partitioned update Kalman filter. Signal Process. 2017, 130, 289–298.
19. Chen, B.; Liu, X.; Zhao, H.; Principe, J. Maximum correntropy Kalman filter. Automatica 2017, 76, 70–77.
20. Liu, X.; Qu, H.; Zhao, J.; Yue, P.; Wang, M. Maximum Correntropy Unscented Kalman Filter for Spacecraft Relative State Estimation. Sensors 2016, 16, 1530.
21. Wang, Y.; Zheng, W.; Sun, S.; Li, L. Robust Information Filter Based on Maximum Correntropy Criterion. J. Guid. Control Dyn. 2016, 39, 1124–1129.
22. Amari, S.; Nagaoka, H. Methods of Information Geometry; American Mathematical Society: Providence, RI, USA, 2007.
23. Amari, S. Natural Gradient works efficiently in learning. Neural Comput. 1998, 10, 251–276.
24. Van Trees, H.L.; Bell, K.L. Bayesian Bounds for Parameter Estimation and Nonlinear Filtering/Tracking; Wiley: Hoboken, NJ, USA, 2007.
25. Amari, S. Information Geometry and Its Applications; Springer: Tokyo, Japan, 2016.
26. Tichavský, P.; Muravchik, C.; Nehorai, A. Posterior Cramér-Rao Bounds for Discrete-Time Nonlinear Filtering. IEEE Trans. Signal Process. 1998, 46, 1386–1396.
27. Raskutti, G.; Mukherjee, S. The Information Geometry of Mirror Descent. IEEE Trans. Inf. Theory 2015, 61, 1451–1457.
28. Cheng, Y.; Wang, X.; Moran, B. Optimal Nonlinear Estimation in Statistical Manifolds with Application to Sensor Network Localization. Entropy 2017, 19, 308.
29. Luo, Z.; Liao, D.; Qian, Y. Bound Analysis of Natural Gradient Descent in Stochastic Optimization Setting. In Proceedings of the 2016 23rd International Conference on Pattern Recognition (ICPR), Cancún, México, 4–8 December 2016; pp. 4166–4171.
30. Amari, S. Information Geometry on Hierarchy of Probability Distributions. IEEE Trans. Inf. Theory 2001, 47, 1701–1711.
31. Oizumi, M.; Tsuchiya, N.; Amari, S. Unified framework for information integration based on information geometry. Proc. Natl. Acad. Sci. USA 2016, 113, 14817–14822.
32. Lenglet, C.; Rousson, M.; Deriche, R.; Faugeras, O. Statistics on Manifold of Multivariate Normal Distributions: Theory and Application to Diffusion Tensor MRI Processing. J. Math. Imaging Vis. 2006, 25, 423–444.
33. Zhang, X. Matrix Analysis and Applications, 2nd ed.; Tsinghua University Press: Beijing, China, 2013.
34. Nocedal, J.; Wright, S. Numerical Optimization, 2nd ed.; Springer: New York, NY, USA, 2006.
35. Girolami, M.; Calderhead, B. Riemann manifold Langevin and Hamiltonian Monte Carlo methods. J. R. Stat. Soc. B 2011, 73, 123–214.
36. García-Fernández, Á.; Svensson, L. Gaussian MAP Filtering Using Kalman Optimization. IEEE Trans. Autom. Control 2015, 60, 1336–1349.
37. Masreliez, C. Approximate Non-Gaussian Filtering with Linear State and Observation Relation. IEEE Trans. Autom. Control 1975, 20, 107–110.

Figure 1. The trajectories and normalized RMSE of Position: (a) trajectory; (b) normalized Position RMSE.

Figure 2. The normalized RMSEs of Scenario 1: (a) normalized Position RMSE; (b) normalized Velocity RMSE.

Figure 3. The normalized RMSEs of Scenario 2: (a) normalized Position RMSE; (b) normalized Velocity RMSE.

Figure 4. The normalized RMSEs of Scenario 3: (a) normalized Position RMSE; (b) normalized Velocity RMSE.

Figure 5. The number of Iteration of three scenarios: (a) Scenario 1; (b) Scenario 2; (c) Scenario 3.

© 2017 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
