You are currently viewing a new version of our website. To view the old version click .
Sensors
  • Article
  • Open Access

24 September 2018

A Non-Linear Filtering Algorithm Based on Alpha-Divergence Minimization

,
,
and
Global Navigation Satellite System Research Center, Wuhan University, Wuhan 430079, China
*
Author to whom correspondence should be addressed.
This article belongs to the Special Issue Sensor Signal and Information Processing II

Abstract

A non-linear filtering algorithm based on the alpha-divergence is proposed, which uses the exponential family distribution to approximate the actual state distribution and the alpha-divergence to measure the approximation degree between the two distributions; thus, it provides more choices for similarity measurement by adjusting the value of α during the updating process of the equation of state and the measurement equation in the non-linear dynamic systems. Firstly, an α -mixed probability density function that satisfies the normalization condition is defined, and the properties of the mean and variance are analyzed when the probability density functions p ( x ) and q ( x ) are one-dimensional normal distributions. Secondly, the sufficient condition of the alpha-divergence taking the minimum value is proven, that is when α 1 , the natural statistical vector’s expectations of the exponential family distribution are equal to the natural statistical vector’s expectations of the α -mixed probability state density function. Finally, the conclusion is applied to non-linear filtering, and the non-linear filtering algorithm based on alpha-divergence minimization is proposed, providing more non-linear processing strategies for non-linear filtering. Furthermore, the algorithm’s validity is verified by the experimental results, and a better filtering effect is achieved for non-linear filtering by adjusting the value of α .

1. Introduction

The analysis and design of non-linear filtering algorithms are of enormous significance because non-linear dynamic stochastic systems have been widely used in practical systems, such as navigation system [1], simultaneous localization and mapping [2], and so on. Because the state model and the measurement model are non-linear and the state variables and the observation variables of the systems no longer satisfy the Gaussian distribution, the representation of the probability density distribution of the non-linear function will become difficult. In order to solve this problem, deterministic sampling (such as the unscented Kalman filter and cubature Kalman filter) and random sampling (such as the particle filter) are adopted to approximate the probability density distribution of the non-linear function, that is to say, to replace the actual state distribution density function by a hypothetical one [3].
In order to measure the similarity between the hypothetical state distribution density function and the actual one, we need to select a measurement method to ensure the effectiveness of the above methods. The alpha-divergence, proposed by S.Amari, is used to measure the deviation between data distributions p ( x ) and q ( x ) [4]. It can be used to measure the similarity between the hypothetical state distribution density function and the actual one for the non-linear filtering. Compared with the Kullback–Leibler divergence (the KL divergence), the alpha-divergence provides more choices for measuring the similarity between the hypothetical state distribution density function and the actual one. Therefore, we use alpha-divergence as a measurement criterion to measure the similarity between the two distribution functions. Indeed, adjusting the value of parameter α in the function can ensure the interesting properties of similarity measurement. Another choice of α characterizes different learning principles, in the sense that the model distribution is more inclusive ( α ) or more exclusive ( α ) [5]. Such flexibility enables α -based methods to outperform KL-based methods with the value of α being properly selected. The higher the similarity of the two probability distributions p ( x ) and q ( x ) , the smaller the value of alpha-divergence will be. Then, it can be proven that in a specific range of value, q ( x ) can fully represent the properties of p ( x ) when the value of alpha-divergence is minimum.
Because the posterior distribution of non-linear filtering is difficult to solve, given that the posterior probability distribution is p ( x ) , we can use the probability distribution q ( x ) to approximate the posterior probability distribution p ( x ) of non-linear filtering. The approximate distribution q ( x ) is expected to be a distribution with a finite moment vector. This in turn means that a good choice for the approximate distribution is from the exponential family distribution, which is a practically convenient and widely-used unified family of distributions on finite dimensional Euclidean spaces.
The main contributions of this article include:
  • We define an α -mixed probability density function and prove that it satisfies the normalization condition when we specify the probability distributions p ( x ) and q ( x ) to be univariate normal distributions. Then, we analyze the monotonicity of the mean and the variance of the α -mixed probability density function with respect to the parameter when p ( x ) and q ( x ) are specified to be univariate normal distributions. The results will be used in the algorithm implementation to guarantee the convergence.
  • We specify the probability density function q ( x ) as an exponential family state density function and choose it to approximate the known state probability density function p ( x ) . After the α -mixed probability density function is defined by q ( x ) and p ( x ) , we prove that the sufficient condition for alpha-divergence minimization is when α 1 and the expected value of the natural statistical vector of q ( x ) is equivalent to the expected value of the natural statistical vector of the α -mixed probability density function.
  • We apply the sufficient condition to the non-linear measurement update step of the non-linear filtering. The experiments show that the proposed method can achieve better performance by using a proper α value.

3. Background Work

In Section 3.1, we provide the framework of the non-linear filtering. Then, we introduce the alpha-divergence in Section 3.2, which contains many types of divergence as special cases.

3.1. Non-Linear Filtering

The actual system studied in the filtering is usually non-linear and non-Gaussian. Non-linear filtering refers to a filtering that can estimate the optimal estimation problem of the state variables in the dynamic system online and in real time from the system observations.
The state space model of non-linear systems with additive Gaussian white noise is:
x k = f ( x k 1 ) + w k 1
where x k R n is the system state vector that needs to be estimated; w k is the zero mean value Gaussian white noise, and its variance is E [ w k w k T ] = Q k . Equation (1) describes the state transition p ( x k | x k 1 ) of the system.
The random observation model of the state vector is:
z k = h ( x k ) + v k
where z k R m is system measurement; v k is the zero mean value Gaussian white noise, and its variance is E [ v k v k T ] = R k . Suppose w k and v k are independent of each other and the observed value z k is independent of the state variables x k .
The entire probability state space is represented by the generation model as shown in Figure 1. x k is the system state; z k is the observational variable, and the purpose is to estimate the value of state x k . The Bayesian filter is a general method to solve state estimation. The Bayesian filter is used to calculate the posterior distribution p ( x k | z k ) , and its recursive solution consists of prediction steps and update steps.
Figure 1. Hidden Markov Model (HMM).
Under the Bayesian optimal filter framework, the system state equation determines that the conditional transition probability of the current state is a Gaussian distribution:
p ( x k | x k 1 , z 1 : k 1 ) = N ( x k | f ( x k 1 ) , Q k )
If the prediction distribution of the system can be obtained from Chapman–Kolmogorov, the prior probability is:
p ( x k | z 1 : k 1 ) = p ( x k | x k 1 , z 1 : k 1 ) p ( x k 1 | z 1 : k 1 ) d x k 1
When there is a measurement input, the system measurement update equation determines that the measurement likelihood transfer probability of the current state obeys a Gaussian distribution:
p ( z k | x k , z 1 : k 1 ) = N ( z k | h ( x k ) , R k )
According to the Bayesian information criterion, the posterior probability obtained is:
p ( x k | z 1 : k ) = p ( z k | x k , z 1 : k 1 ) p ( x k | z 1 : k 1 ) p ( z k | z 1 : k 1 )
where p ( z k | z 1 : k 1 ) is the normalized factor, and it is defined as follows:
p ( z k | z 1 : k 1 ) = p ( z k | x k , z 1 : k 1 ) p ( x k | z 1 : k 1 ) d x k
Unlike the Kalman filter framework, the Bayesian filter framework does not demand that the update structure be linear, so it can use non-linear update steps.
In the non-linear filtering problem, the posterior distribution p ( x k | z 1 : k ) often cannot be solved correctly. Our purpose is to use the distribution q ( x ) to approximate the posterior distribution p ( x k | z 1 : k ) without an analytical solution. Here, we use the alpha-divergence measurement to measure the similarity between the two. We propose a method that directly minimizes alpha-divergence without adding any additional approximations.

3.2. The Alpha-Divergence

The KL divergence is commonly used in similarity measures, but we will generalize it to the alpha-divergence. The alpha-divergence is a parametric family of divergence functions, including several well-known divergence measures as special cases, and it gives us more flexibility in approximation [20].
Definition 1.
Let us consider two unnormalized distributions p ( x ) and q ( x ) with respect to a random variable x. The alpha-divergence is defined by:
D α [ p q ] = 1 α ( 1 α ) α p ( x ) + ( 1 α ) q ( x ) p ( x ) α q ( x ) 1 α d x
where α R , which means D α is continuous at zero and one.
The alpha-divergence meets the following two properties:
  • D α [ p q ] 0 , if and only if p = q , D α [ p q ] = 0 . This property can be used precisely to measure the difference between the two distributions.
  • D α [ p q ] is a convex function with respect to p ( x ) and q ( x ) .
Note that the term [ α p ( x ) + ( 1 α ) q ( x ) ] d x disappears when p ( x ) and q ( x ) are normalized distributions, i.e., p ( x ) d x = p ( x ) d x = 1 . The alpha-divergence in (8) is expressed by:
D α [ p q ] = 1 α ( 1 α ) ( 1 p ( x ) α q ( x ) 1 α d x )
In general, we can get another equivalent expression of the alpha-divergence when we set β = 2 α 1 :
D β [ p q ] = 4 1 β 2 1 β 2 p ( x ) + 1 + β 2 q ( x ) p ( x ) 1 + β 2 q ( x ) 1 β 2 d x
Alpha-divergence includes several special cases such as the KL divergence, the Hellinger divergence and χ 2 divergence (Pearson’s distance), which are summarized below.
  • As α approaches one, Equation (8) is the limitation form of 0 0 , and it specializes to the KL divergence from q ( x ) to p ( x ) as L’Hôpital’s rule is used:
    lim α 1 D α [ p q ] = lim α 1 1 α ( 1 α ) α p ( x ) + ( 1 α ) q ( x ) p ( x ) α q ( x ) 1 α d x = lim α 1 1 1 2 α p ( x ) q ( x ) p ( x ) α l o g ( p ( x ) ) q ( x ) 1 α + p ( x ) α q ( x ) 1 α l o g ( q ( x ) ) d x = p ( x ) l o g p ( x ) q ( x ) p ( x ) + q ( x ) d x = K L [ p q ]
    When p ( x ) and q ( x ) are normalized distributions, the KL divergence is expressed as:
    K L [ p q ] = p ( x ) l o g p ( x ) q ( x ) d x
  • As α approaches zero, Equation (8) is still the limitation form of 0 0 , and it specializes to the dual form of the KL divergence from q ( x ) to p ( x ) as L’Hôpital’s rule is used:
    lim α 0 D α [ p q ] = lim α 0 1 α ( 1 α ) α p ( x ) + ( 1 α ) q ( x ) p ( x ) α q ( x ) 1 α d x = lim α 0 1 1 2 α p ( x ) q ( x ) p ( x ) α l o g ( p ( x ) ) q ( x ) 1 α + p ( x ) α q ( x ) 1 α l o g ( q ( x ) ) d x = q ( x ) l o g q ( x ) p ( x ) + p ( x ) q ( x ) d x = K L [ q p ]
    When p ( x ) and q ( x ) are normalized distributions, the dual form of the KL divergence is expressed as:
    K L [ q p ] = q ( x ) l o g q ( x ) p ( x ) d x
  • When α = 1 2 , the alpha-divergence specializes to the Hellinger divergence, which is the only dual divergence in the alpha-divergence:
    D 1 2 [ p q ] = 2 ( p ( x ) + q ( x ) 2 p ( x ) 1 2 q ( x ) 1 2 ) d x = 2 ( p ( x ) q ( x ) ) 2 d x = 4 H e l 2 [ p q ]
    where H e l [ p q ] = 1 2 ( p ( x ) q ( x ) ) 2 d x is the Hellinger distance, which is the half of the Euclidean distance between two random distributions after taking the difference of the square root, and it corresponds to the fundamental property of distance measurement and is a valid distance metric.
  • When α = 2 , the alpha-divergence degrades to χ 2 -divergence:
    D 2 [ p q ] = 1 2 ( 2 p ( x ) q ( x ) p ( x ) 2 q ( x ) d x ) = 1 2 ( p ( x ) 2 + q ( x ) 2 2 p ( x ) q ( x ) q ( x ) d x ) = 1 2 ( p ( x ) q ( x ) ) 2 q ( x ) d x
In the later experiment, we will adapt the value of α to optimize the distribution similarity measurement.

4. Non-Linear Filtering Based on the Alpha-Divergence

We first define an α -mixed probability density function, which will be used in the non-linear filtering based on the alpha-divergence minimization. Then, we show that the sufficient condition for the alpha-divergence minimization is when α 1 and the expected value of the natural statistical vector of q ( x ) is equivalent to the expected value of the natural statistical vector of the α -mixed probability density function. At last, we apply the sufficient condition to the non-linear measurement update steps for solving the non-linear filtering problem.

4.1. The α -Mixed Probability Density Function

We first give a definition of a normalized probability density function called the α -mixed probability density function, which is expressed as p α ( x ) .
Definition 2.
We define an α-mixed probability density function:
p α ( x ) = p ( x ) α q ( x ) ( 1 α ) p ( x ) α q ( x ) ( 1 α ) d x
We can prove that when both p ( x ) and q ( x ) are univariate normal distributions, then p α ( x ) is still the Gaussian probability density function.
Suppose that p ( x ) N ( μ p , σ p 2 ) and q ( x ) N ( μ q , σ q 2 ) , so the probability density functions can be expressed as follows:
p ( x ) = 1 2 π σ p e x p ( x μ p ) 2 2 σ p 2 and q ( x ) = 1 2 π σ q e x p ( x μ q ) 2 2 σ q 2
Then we can combine these two functions with parameter α :
p ( x ) α q ( x ) ( 1 α ) = ( 2 π σ p 2 ) α 2 ( 2 π σ q 2 ) 1 α 2 e x p α ( x μ p ) 2 σ q 2 + ( 1 α ) ( x μ q ) 2 σ p 2 2 σ p 2 σ q 2 = S α 2 π σ α e x p ( x μ α ) 2 2 σ α 2
where μ α = α μ p σ q 2 + ( 1 α ) μ q σ p 2 α σ q 2 + ( 1 α ) σ p 2 is the mean of the α -mixed probability density function; σ α 2 = σ q 2 σ p 2 α σ q 2 + ( 1 α ) σ p 2 (which can be reduced to 1 σ α = α 1 σ p + ( 1 α ) 1 σ q ) is the variance of the α -mixed probability density function; S α is a scalar factor, and the expression is as follows:
S α = ( 2 π σ α 2 ) 1 2 ( 2 π σ p 2 ) α 2 ( 2 π σ q 2 ) 1 α 2 e x p α ( 1 α ) ( μ p μ q ) 2 2 [ α σ q 2 + ( 1 α ) σ p 2 ] = ( 2 π σ α 2 ) α + 1 α 2 ( 2 π σ p 2 ) α 2 ( 2 π σ q 2 ) 1 α 2 e x p α ( 1 α ) ( μ p μ q ) 2 2 [ α σ q 2 + ( 1 α ) σ p 2 ] = ( σ q 2 α σ q 2 + ( 1 α ) σ p 2 ) α 2 ( σ p 2 α σ q 2 + ( 1 α ) σ p 2 ) 1 α 2 e x p α ( 1 α ) ( μ p μ q ) 2 2 [ α σ q 2 + ( 1 α ) σ p 2 ]
Therefore, p α ( x ) is a normalized probability density function, satisfying the normalization conditions p α ( x ) d x = 1 . It is clear that the product of two Gaussian distributions is still a Gaussian distribution, which will bring great convenience to the representation of probability distribution of the latter filtering problem.
At the same time, we can get that the variance of p α ( x ) is σ α 2 , which should satisfy the condition that its value is greater than zero. We can know by its denominator when σ q 2 σ p 2 , the value of α can take any value on the real number axis; when σ q 2 < σ p 2 , the scope of α is α < σ p 2 σ p 2 σ q 2 . Then, it is easy to know that the closer σ p 2 is to σ q 2 , the greater the range of values of α .
In addition, the influence of the mean and the variance of the two distributions on the mean and variance of the α -mixed probability density function can be analyzed to facilitate the solution of the algorithm latter. As for the variance, when σ q 2 > σ p 2 , σ α 2 decreases with the increase of α ; when σ q 2 = σ p 2 , it can be concluded that σ α 2 = σ q 2 = σ p 2 ; when σ q 2 < σ p 2 , σ α 2 increases with the increase of α . As for the mean value, when σ q 2 = σ p 2 , μ α = ( μ p μ q ) α + μ q ; if σ q 2 σ p 2 , μ α = μ p σ q 2 μ q σ p 2 σ q 2 σ p 2 + ( μ q μ p ) σ q 2 σ p 2 ( σ q 2 σ p 2 ) 2 α + ( σ q 2 σ p 2 ) σ p 2 . It is clear that if μ p > μ q , then μ α increases with the increase of α ; if μ p < μ q , then μ α decreases with the increase of α . The summary of the properties is shown in Table 1.
Table 1. The monotonicity of the mean μ α and the variance σ α 2 of the α -mixed probability density function.
The monotonicity of the mean μ α and the variance σ α 2 with respect to α is shown in Figure 2.
Figure 2. The monotonicity of the mean μ α and the variance σ α 2 with respect to α .
It is clear that when μ p < μ q and σ q 2 > σ p 2 , μ α decreases with the increase of α and σ α 2 decreases with the increase of α ; when μ p < μ q and σ q 2 < σ p 2 , μ α decreases with the increase of α and σ α 2 increases with the increase of α ; when μ p > μ q and σ q 2 > σ p 2 , μ α increases with the increase of α and σ α 2 decreases with the increase of α ; when μ p > μ q and σ q 2 < σ p 2 , μ α increases with the increase of α and σ α 2 increases with the increase of α .
When α ( 0 , 1 ) , the α -mixed probability density function is the interpolation function of p ( x ) and q ( x ) , so its mean value and the variance are all between p ( x ) and q ( x ) , as shown in Figure 2, and its image curve is also between them.
The above analysis will be used in the algorithm implementation of the sufficient condition in the non-linear filtering algorithm.

4.2. The Alpha-Divergence Minimization

In the solving process of the alpha-divergence minimization, either the posterior distribution itself or the calculation of the maximized posterior distribution is complex, so the approximate distribution q ( x ) with good characterization ability is often used to approximate the true posterior distribution p ( x ) . As a result, a higher degree achieves better approximation. Here, we restrict the approximate distribution q ( x ) to be an exponential family distribution; denote p e ( x ) , with good properties, defined as follows:
p e ( x ) = h ( x ) e x p { ϕ T ( θ ) u ( x ) + g ( ϕ ( θ ) ) }
Here, θ is a parameter set of probability density function; c(x) and g( ϕ ( θ ) ) are known functions; ϕ ( θ ) is a vector composed of natural parameters; u ( x ) is a natural statistical vector. u ( x ) contains enough information to express the state variable x in the exponential family distribution completely; ϕ ( θ ) is a coefficient parameter that combines u ( x ) based on parameter set θ .
In the non-linear filtering, assume the exponential family distribution is p e ( x ) ; arbitrary function is p ( x ) , and we use p e ( x ) to approximate p ( x ) , measuring the degree of approximation by the alpha-divergence. Therefore, the alpha-divergence of p ( x ) relative to p e ( x ) is obtained, defined as:
J = D α [ p | | p e ] = 1 α ( 1 α ) [ 1 p ( x ) α p e ( x ) 1 α ] = 1 α ( 1 α ) 1 p ( x ) α [ h ( x ) e x p ( ϕ T ( θ ) u ( x ) + g ( ϕ ( θ ) ) ) ] 1 α
We state and prove in Theorem 1 that the alpha-divergence between the exponential family distribution and the probability density function of arbitrary state variable is minimum, if and only if the expected value of the natural statistical vector in the exponential family distribution is equal to the expected value of the natural statistical vector in the α -mixed probability state density function. In Corollary 1, given α = 1 , the equivalence condition can be obtained in the case of K L [ p q ] . In Corollary 2, we conclude that the specialization of the exponential family distribution is obtained after being processed by the Gaussian probability density function.
Theorem 1.
The alpha-divergence between the exponential family distribution and the known state probability density function takes the minimum value; if and only if α 1 , the expected value of the natural statistical vector in the exponential family distribution is equal to the expected value of the natural statistical vector in the α-mixed probability state density function, that is:
E p e u ( x ) = E p α u ( x )
Proof of Theorem 1.
Sufficient conditions for J minimization are that the first derivative and the second derivative satisfy the following conditions:
J ϕ ( θ ) = 0 a n d 2 J ϕ ( θ ) 2 > 0
First, we derive Equation (22) with respect to ϕ ( θ ) , and according to the conditions in the first derivative, the outcome is:
J ϕ ( θ ) = 1 α ( 1 α ) p ( x ) α ( 1 α ) p e ( x ) α p e ( x ) u ( x ) + ( g ( ϕ ( θ ) ) ϕ ( θ ) ) d x = 1 α p ( x ) α p e ( x ) 1 α u ( x ) + ( g ( ϕ ( θ ) ) ϕ ( θ ) ) d x = 1 α p ( x ) α p e ( x ) 1 α u ( x ) d x 1 α p ( x ) α p e ( x ) 1 α ( g ( ϕ ( θ ) ) ϕ ( θ ) ) d x
Let the above equation be equal to zero, then:
g ( ϕ ( θ ) ) ϕ ( θ ) = p ( x ) α p e ( x ) 1 α p ( x ) α p e ( x ) 1 α d x u ( x ) d x = p α ( x ) u ( x ) d x
In addition, since p e ( x ) is a probability density function, it satisfies the normalization condition:
p e ( x ) d x = h ( x ) e x p { ϕ T ( θ ) u ( x ) + g ( ϕ ( θ ) ) } d x = 1
Derive ϕ ( θ ) in the above equation, and the outcome is:
ϕ ( θ ) p e ( x ) = p e ( x ) u ( x ) d x + g ( ϕ ( θ ) ) ϕ ( θ ) = 0
The first item of Equation (23) can be obtained from Equations (26) and (28), which is the existence conditions of the stationary point for J.
To ensure that Equation (24) can minimize Equation (22), which means the stationary point is also its minimum point, we also need to prove that the second derivative satisfies the condition. Derive ϕ ( θ ) in Equation (25); the outcome is:
2 J ϕ ( θ ) 2 = ϕ ( θ ) 1 α p ( x ) α p e ( x ) 1 α u ( x ) d x 1 α p ( x ) α p e ( x ) 1 α ( g ( ϕ ( θ ) ) ϕ ( θ ) ) d x = 1 α α p ( x ) α p e ( x ) 1 α u ( x ) + ( g ( ϕ ( θ ) ) ϕ ( θ ) ) u ( x ) d x 1 α p ( x ) α p e ( x ) 1 α 2 g ( ϕ ( θ ) ) ϕ ( θ ) 2 d x 1 α α p ( x ) α p e ( x ) 1 α u ( x ) + ( g ( ϕ ( θ ) ) ϕ ( θ ) ) g ( ϕ ( θ ) ) ϕ ( θ ) d x = 1 α p ( x ) α p e ( x ) 1 α 2 g ( ϕ ( θ ) ) ϕ ( θ ) 2 d x 1 α α p ( x ) α p e ( x ) 1 α u ( x ) 2 + 2 u ( x ) ( g ( ϕ ( θ ) ) ϕ ( θ ) ) + ( g ( ϕ ( θ ) ) ϕ ( θ ) ) 2 d x = 1 α 2 g ( ϕ ( θ ) ) ϕ ( θ ) 2 p ( x ) α p e ( x ) 1 α d x + α 1 α p ( x ) α p e ( x ) 1 α u ( x ) + g ( ϕ ( θ ) ) ϕ ( θ ) 2 d x = 1 α 2 g ( ϕ ( θ ) ) ϕ ( θ ) 2 p α ( x ) d x + α 1 α p ( x ) α p e ( x ) 1 α u ( x ) + g ( ϕ ( θ ) ) ϕ ( θ ) 2 d x
For the first item, it is easy to prove 2 g ( ϕ ( θ ) ) ϕ ( θ ) 2 < 0 , and the proof is as follows.
It can be known from Equation (21):
g ( ϕ ( θ ) ) = l o g h ( x ) e x p ϕ T ( θ ) u ( x ) d x
The gradient of Equation (30) with respect to the natural parameter vector is as follows:
g ( ϕ ( θ ) ) ϕ ( θ ) = h ( x ) e x p ϕ T ( θ ) u ( x ) h ( x ) e x p ϕ T ( θ ) u ( x ) d x u ( x ) d x = = h ( x ) e x p ϕ T ( θ ) u ( x ) e x p g ( ϕ ( θ ) ) d x u ( x ) d x = p e ( x ) u ( x ) d x
Then, consider the matrix formed by its second derivative with respect to the natural parameter vector:
2 g ( ϕ ( θ ) ) ϕ i ( θ ) ϕ j ( θ ) = ϕ j ( θ ) h ( x ) e x p ϕ T ( θ ) u ( x ) h ( x ) e x p ϕ T ( θ ) u ( x ) d x u i ( x ) d x = ϕ j ( θ ) h ( x ) e x p ϕ T ( θ ) u ( x ) u i ( x ) d x h ( x ) e x p ϕ T ( θ ) u ( x ) d x = h ( x ) e x p ϕ T ( θ ) u ( x ) u i ( x ) d x h ( x ) e x p ϕ T ( θ ) u ( x ) d x ( h ( x ) e x p ϕ T ( θ ) u ( x ) d x ) 2 + h ( x ) e x p ϕ T ( θ ) u ( x ) u i ( x ) d x h ( x ) e x p ϕ T ( θ ) u ( x ) u j ( x ) d x ( h ( x ) e x p ϕ T ( θ ) u ( x ) d x ) 2 = p e ( x ) u i ( x ) u j ( x ) d x p e ( x ) u i ( x ) d x p e ( x ) u j ( x ) d x
According to the definition of the covariance matrix, the content in the bracket is the covariance matrix of the natural parameter vector with respect to the exponential family probability density function p e ( x ) , and for arbitrary probability density distribution p e ( x ) , the variance matrix is a positive definite matrix, so 2 g ( ϕ ( θ ) ) ϕ ( θ ) 2 < 0; and when α > 0 , the first item is greater than zero.
The integral of the second item is the secondary moment, so α 1 or α < 0 , and the second item is greater than zero.
To sum up, when α 1 , 2 J ϕ ( θ ) 2 > 0 . ☐
Corollary 1.
(See Theorem 1 of [3] for more details) When α = 1 , p α ( x ) = p ( x ) , D α [ p q ] turns into K L [ p q ] . We can obtain the above theorem under the condition of K L [ p q ] and obtain the approximate distribution by minimizing the KL divergence, which also proves that the stationary point obtained when the first derivative of its KL divergence is equal to zero also satisfies the condition that its second derivative is greater than zero. The corresponding expectation propagation algorithm is shown as follows:
E q ( x ) u ( x ) = E p ( x ) u ( x )
Corollary 2.
(See Corollary 1.1 of [3] for more details) When the exponential family distribution is simplified as the Gaussian probability density function, its sufficient statistic for u ( x ) = ( x , x 2 ) , we use the mean and variance of Gaussian probability density function, and the expectation of the corresponding propagation algorithm can use the moment matching method to calculate, so the first moment and the second moment are defined as follows:
m = E p ( x ) x a n d M = E p ( x ) x x T
The corresponding second central moment is defined as follows:
P = M m m T = E p ( x ) ( x m ) ( x m ) T
The complexity of Theorem 1 lies in that both sides of Equation (23) depend on the probability distribution of q ( x ) at the same time. The q ( x ) that satisfies the condition can be obtained by repeated iterative update on q ( x ) . The specific process is shown in Algorithm 1:
Algorithm 1 Approximation of the true probability distribution p ( x ) .
Input: Target distribution parameter of p ( x ) ; damping factor ϵ ( 0 , 1 ) ; divergence parameter α [ 1 , + ) ; initialization value of q ( x )
Output: The exponential family probability function q ( x )
1:
Calculate the α -mixed probability density function p α ( x )
2:
According to Equation (23), we get a new q ( x ) using the expectation propagation algorithm described in Corollary 1, and the new q ( x ) is denoted as q ( x )
3:
Revalue the q ( x ) as
q ( x ) = q ( x ) ϵ q ( x ) 1 ϵ q ( x ) ϵ q ( x ) 1 ϵ d x
4:
while K L [ p q ] > 0.01 do
5:
  Calculate the KL divergence of the old q ( x ) and the new q ( x )
6:
end while
In the above algorithms, we need to pay attention to the following two problems: giving an initial value of q ( x ) and selecting damping factors. As for the first problem, we can know that when σ q 2 < σ p 2 , the value range of α is α < σ p 2 σ p 2 σ q 2 , according to the analysis of the α -mixed probability density function in Section 4.1. Although the value of α is greater than one, the value range of α is limited under the condition that σ p 2 is unknown in the initial state; when σ q 2 σ p 2 , the value of α can take any value on the whole real number axis, so the initial value we can choose is relatively larger, making σ q 2 σ p 2 and μ q > μ p . When the value of α is greater than one, the mean value of the α -mixed probability density function will decrease, and the variance will also decrease, as shown in the upper left of Figure 2.
As for the second question, when α ( 0 , 1 ) , the α -mixed probability density function is the interpolation function of p ( x ) and q ( x ) according to the analysis in Section 4.1. The value range in ( 0 , 1 ) of damping factor ϵ is quite reasonable because the two probability density functions are interpolated when the value range of ϵ is in (0, 1), and the new probability density function is between the two. According to Equation (36), the smaller of ϵ , the closer the new q ( x ) to the old q ( x ) ; the larger of ϵ , the closer the new q ( x ) to q ( x ) . The mean value and the variance of q ( x ) is smaller than the real p ( x ) according to the analysis of the first question. Then, we will continue to combine new q ( x ) with p ( x ) to form a α -mixed probability density function. Similarly, we clarify that the mean value and the variance of the new q ( x ) are larger than p ( x ) , so the value of ϵ we choose should be as close as possible to one.
The convergence of the algorithm can be guaranteed after considering the above two problems, and we can get q ( x ) that meets the conditions. It can be known from Theorem 1 that the approximation q ( x ) of p ( x ) can be obtained to ensure it converges on this minimum point after repeated iterative updates.

4.3. Non-Linear Filtering Algorithm Based on the Alpha-Divergence

In the process of non-linear filtering, assuming that a priori and a posteriori probability density functions satisfy the Assumed Density Filter (ADF), then define the prior parameter as θ k = m k , P k ; the corresponding distribution is prior distribution q ( x k ; θ k ) ; define the posterior parameter as θ k + = m k + , P k + , then the corresponding distribution is posterior distribution q ( x k ; θ k + ) .
The prediction of the state variance can be expressed as follows:
p ( x k | z 1 : k 1 ) = p ( x k | x k 1 , z 1 : k 1 ) d x k 1
θ k = a r g min θ D α [ p ( x k | z 1 : k 1 ) q ( x k ; θ ) ]
The corresponding first moment about the origin f ( x k 1 ) = x k p ( x k | z 1 : k 1 ) d x k of p ( x k | z 1 : k 1 ) can be obtained from Equation (37a).
By Corollary 2, when the alpha-divergence is simplified to the KL divergence, the corresponding mean value and variance are:
m k = f ( x k 1 ) q ( x k 1 ; θ k 1 + ) d x
P k = f ( x k 1 ) f ( x k 1 ) T N ( x k 1 | m k 1 + , P k 1 + ) d x m k m k T + Q k
Here, the prior distribution q ( x k ; θ k ) can be obtained.
Similarly, the update steps of the filter can be expressed as follows:
p ( x k | z 1 : k ) = p ( z k | x k , z 1 : k 1 ) q ( x k ; θ k ) p ( x k | x k , z 1 : k 1 ) q ( x k ; θ k ) d x k
θ k + = a r g min θ D α [ p ( x k | z 1 : k ) q ( x k ; θ ) ]
It is clear according to Theorem 1:
E q ( x k ; θ k + ) u ( x ) = E p α ( x ) u ( x ) = p α ( x ) u ( x ) d x = u ( x ) p ( x k | z 1 : k ) α q ( x k ; θ k + ) 1 α π ( x ) π ( x ) d x i = 1 N u ( x i ) [ p ( z k | x k i , z 1 : k 1 ) q ( x k i ; θ k ) ] α q ( x k i ; θ k + ) ( 1 α ) / π ( x i ) j [ p ( z k | x k j , z 1 : k 1 ) q ( x k j ; θ k ) ] α q ( x k j ; θ k + ) ( 1 α ) / π ( x j )
Here, x i i i d π t ( x t ) , i = 1 , , N , π t is the proposal distribution. We choose the proposal distribution as a priori distribution q ( x k ; θ k ) . We define w i = [ p ( z k | x k i , z 1 : k 1 ) q ( x k i ; θ k ) ] α q ( x k i ; θ k + ) 1 α / π ( x i ) , W = j w j , so:
E q ( x k ; θ k + ) u ( x ) 1 W i = 1 N w i u ( x i )
An approximate calculation of the mean value and the variance for q ( x k ; θ k + ) is conducted:
m k + = 1 W i = 1 N w i x i
P k + = 1 W i = 1 N w i ( x i m k i ) ( x i m k i ) T
Since Equation (40) contains q ( x k ; θ k + ) on both sides of the equation, we must use Algorithm 1 to conduct the iterative calculation to get the satisfied posterior distribution q ( x k ; θ k + ) .
If α = 1 , the above steps can be reduced to a simpler filtering algorithm, as shown in [3].
In this process, we do not use the integral operation of the denominator in Equation (39a), but use the Monte Carlo integral strategy proposed in [15], as shown in Equation (40). We cannot conduct resampling, which greatly reduces the calculation.

5. Simulations and Analysis

According to Theorem 1, when α 1 , the non-linear filtering method we proposed is feasible theoretically. In the simulation experiment, the algorithm is validated by taking different values when α 1 . We name our proposed method as AKF and compare it with the traditional non-linear filtering methods such as EKF and UKF.
We choose the Univariate Nonstationary Growth Model (UNGM) [22] to analyze the performance of the proposed method. The system state equation is:
x ( k ) = 0.5 x ( k 1 ) + 2.5 x ( k 1 ) 1 + x 2 ( k 1 ) + 8 c o s ( 1.2 ( k 1 ) ) + w ( k )
The observation equation is:
y ( k ) = x 2 ( k ) 20 + v ( k )
The equation of state is a non-linear equation including the fractional relation, square relation and trigonometric function relation. w ( k ) is the process noise with the mean value of zero and the variance of Q. The relationship between the observed signal y ( k ) and state v ( k ) in the measurement equation is also non-linear. v ( k ) is the observation noise with the mean value of zero and the variance of R. Therefore, this system is a typical system with non-linear states and observations, and this model has become the basic model for verifying the non-linear filtering algorithm [22,23].
In the experiment, we set Q = 10, R = 1 and set the initial state as p ( x ( 1 ) ) = N ( x ( 1 ) ; 0 , 1 ) .
First, we simulate the system. When α 1 , the values of α are right for the experiments; here, the value of α is two, and the entire experimental simulation time is T = 50. The result of the state estimation is shown in Figure 3, and it can be seen that the non-linear filtering method we proposed is feasible; the state value can be estimated well during the whole process, and its performance is superior to EKF and UKF in some cases.
Figure 3. State estimation comparison of different non-linear filtering methods.
Second, in order to measure the accuracy of state estimation, the difference between the real state value at each moment and the estimated state value can be calculated to obtain the absolute value; thus, the absolute deviation of the state estimation at each moment is obtained, namely:
R M S ( k ) = | x r e a l ( k ) x e s t i m a t e d ( k ) |
As shown in Figure 4, we can see that the algorithm error we proposed is always relatively small where the absolute value deviation is relatively large. It can be seen that our proposed method performs better than other non-linear methods.
Figure 4. RMS comparison at different times.
In order to measure the overall level of error, we have done many simulation experiments. The average error of each experiment is defined as:
R M S E ( k ) = 1 T k 1 T R M S ( k )
The experimental results are shown in Table 2. We can see that when the estimation of T time series is averaged, the error mean of each AKF is minimum, which indicates the effectiveness of the algorithm, and the filtering accuracy of the algorithm is better than the other two methods under the same conditions. Because the UNGM has strong nonlinearity and we set the variance to the state noise as 10, which is quite large, so the performance differences between EKF, UKF and AKF are rather small.
Table 2. Average errors of experiments.
Then, we analyze the influence of the initial value on the filtering results by modifying the value of process noise. As can be seen from Table 3, AKF’s performance becomes more and more similar to EKF/UKF as the Q becomes smaller.
Table 3. Influence of the variance Q of state equation noise on experimental error.
In the end, we analyze the performance of the whole non-linear filtering algorithm by adjusting the value of α through 20 experiments. In order to reduce the influence of the initial value on the experimental results, we take Q = 0.1 and then average the 20 experimental errors. The result is shown in Figure 5. We can see that the error grows as α grows in this example, as the noise is relatively small.
Figure 5. The error changes as α changes.

6. Conclusions

We have first defined the α -mixed probability density function and analyzed the monotonicity of the mean and the variance under different α values. Secondly, the sufficient conditions for α to find the minimum value have been proven, which provides more methods for measuring the distribution similarity of non-linear filtering. Finally, a non-linear filtering algorithm based on the alpha-divergence minimization has been proposed by applying the above two points to the non-linear filtering. Moreover, we have verified that the validity of the algorithm in one-dimensional UNGM.
Although the filtering algorithm is effective, the alpha-divergence is a direct extension of the KL divergence. We can try to verify that the minimum physical meaning of the alpha divergence is equivalent to the minimum physical meaning of the KL divergence in a further study. The algorithm should be applied to more practical applications to prove its effectiveness. Meanwhile, we can use more sophisticated particle filtering techniques, such as [24,25], to make the algorithm more efficient. Furthermore, the alpha-divergence method described above is applied to uni-modal approximations, but more attention should be paid to multi-modal distributions, which are more difficult and common in practical systems. Furthermore, it is worth designing a strategy to automatically learn the appropriate α values.

Author Contributions

Y.L. and C.G. conceived of and designed the method and performed the experiment. Y.L. and S.Y. participated in the experiment and analyzed the data. Y.L. and C.G. wrote the paper. S.Y. revised the paper. J.Z. guided and supervised the overall process.

Funding

This research was supported by a grant from the National Key Research and Development Program of China (2016YFB0501801).

Acknowledgments

The authors thank Dingyou Ma of Tsinghua University for his help.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Grewal, M.S.; Andrews, A.P. Applications of Kalman filtering in aerospace 1960 to the present [historical perspectives]. IEEE Control Syst. 2010, 30, 69–78. [Google Scholar]
  2. Durrant-Whyte, H.; Bailey, T. Simultaneous localization and mapping: Part I. IEEE Robot. Autom. Mag. 2006, 13, 99–110. [Google Scholar] [CrossRef]
  3. Darling, J.E.; Demars, K.J. Minimization of the Kullback Leibler Divergence for Nonlinear Estimation. J. Guid. Control Dyn. 2017, 40, 1739–1748. [Google Scholar] [CrossRef]
  4. Amari, S. Differential Geometrical Method in Statistics; Lecture Note in Statistics; Springer: Berlin, Germany, 1985; Volume 28. [Google Scholar]
  5. Minka, T. Divergence Measures and Message Passing; Microsoft Research Ltd.: Cambridge, UK, 2005. [Google Scholar]
  6. Amari, S. Integration of Stochastic Models by Minimizing α-Divergence. Neural Comput. 2007, 19, 2780–2796. [Google Scholar] [CrossRef] [PubMed]
  7. Raitoharju, M.; García-Fernández, Á.F.; Piché, R. Kullback–Leibler divergence approach to partitioned update Kalman filter. Signal Process. 2017, 130, 289–298. [Google Scholar] [CrossRef]
  8. Mansouri, M.; Nounou, H.; Nounou, M. Kullback–Leibler divergence-based improved particle filter. In Proceedings of the 2014 IEEE 11th International Multi-Conference on Systems, Signals & Devices (SSD), Barcelona, Spain, 11–14 February 2014; pp. 1–6. [Google Scholar]
  9. Martin, F.; Moreno, L.; Garrido, S.; Blanco, D. Kullback–Leibler Divergence-Based Differential Evolution Markov Chain Filter for Global Localization of Mobile Robots. Sensors 2015, 15, 23431–23458. [Google Scholar] [CrossRef] [PubMed]
  10. Hu, C.; Lin, H.; Li, Z.; He, B.; Liu, G. Kullback–Leibler Divergence Based Distributed Cubature Kalman Filter and Its Application in Cooperative Space Object Tracking. Entropy 2018, 20, 116. [Google Scholar] [CrossRef]
  11. Kumar, P.; Taneja, I.J. Chi square divergence and minimization problem. J. Comb. Inf. Syst. Sci. 2004, 28, 181–207. [Google Scholar]
  12. Qiao, W.; Wu, C. Study on Image Segmentation of Image Thresholding Method Based on Chi-Square Divergence and Its Realization. Comput. Appl. Softw. 2008, 10, 30. [Google Scholar]
  13. Wang, C.; Fan, Y.; Xiong, L. Improved image segmentation based on 2-D minimum chi-square-divergence. Comput. Eng. Appl. 2014, 18, 8–13. [Google Scholar]
  14. Amari, S. Alpha-Divergence Is Unique, Belonging to Both f-Divergence and Bregman Divergence Classes. IEEE Trans. Inf. Theory 2009, 55, 4925–4931. [Google Scholar] [CrossRef]
  15. Gultekin, S.; Paisley, J. Nonlinear Kalman Filtering with Divergence Minimization. IEEE Trans. Signal Process. 2017, 65, 6319–6331. [Google Scholar] [CrossRef]
  16. Hernandezlobato, J.M.; Li, Y.; Rowland, M.; Bui, T.D.; Hernandezlobato, D.; Turner, R.E. Black Box Alpha Divergence Minimization. In Proceedings of the International Conference on Machine Learning, New York, NY, USA, 19–24 June 2016; pp. 1511–1520. [Google Scholar]
  17. Tsallis, C. Possible Generalization of Boltzmann-Gibbs Statistics. J. Stat. Phys. 1988, 52, 479–487. [Google Scholar] [CrossRef]
  18. Tsallis, C. Introduction to Nonextensive Statistical Mechanics. Condens. Matter Stat. Mech. 2004. [Google Scholar] [CrossRef]
  19. Li, Y.; Turner, R.E. Rényi divergence variational inference. In Proceedings of the Advances in Neural Information Processing Systems, Barcelona, Spain, 5–10 December 2016; pp. 1073–1081. [Google Scholar]
  20. Amari, S.I. Information Geometry and Its Applications; Springer: Berlin, Germany, 2016. [Google Scholar]
  21. Nielsen, F.; Critchley, F.; Dodson, C.T.J. Computational Information Geometry; Springer: Berlin, Germany, 2017. [Google Scholar]
  22. Garcia-Fernandez, Á.F.; Morelande, M.R.; Grajal, J. Truncated unscented Kalman filtering. IEEE Trans. Signal Process. 2012, 60, 3372–3386. [Google Scholar] [CrossRef]
  23. Li, Y.; Cheng, Y.; Li, X.; Hua, X.; Qin, Y. Information Geometric Approach to Recursive Update in Nonlinear Filtering. Entropy 2017, 19, 54. [Google Scholar] [CrossRef]
  24. Martino, L.; Elvira, V.; Camps-Valls, G. Group Importance Sampling for particle filtering and MCMC. Dig. Signal Process. 2018, 82, 133–151. [Google Scholar] [CrossRef]
  25. Salomone, R.; South, L.F.; Drovandi, C.C.; Kroese, D.P. Unbiased and Consistent Nested Sampling via Sequential Monte Carlo. arXiv, 2018; arXiv:1805.03924. [Google Scholar]

Article Metrics

Citations

Article Access Statistics

Multiple requests from the same IP address are counted as one view.