A Non-Linear Filtering Algorithm Based on Alpha-Divergence Minimization

A non-linear filtering algorithm based on the alpha-divergence is proposed, which uses the exponential family distribution to approximate the actual state distribution and the alpha-divergence to measure the approximation degree between the two distributions; thus, it provides more choices for similarity measurement by adjusting the value of α during the updating process of the equation of state and the measurement equation in the non-linear dynamic systems. Firstly, an α-mixed probability density function that satisfies the normalization condition is defined, and the properties of the mean and variance are analyzed when the probability density functions p(x) and q(x) are one-dimensional normal distributions. Secondly, the sufficient condition of the alpha-divergence taking the minimum value is proven, that is when α≥1, the natural statistical vector’s expectations of the exponential family distribution are equal to the natural statistical vector’s expectations of the α-mixed probability state density function. Finally, the conclusion is applied to non-linear filtering, and the non-linear filtering algorithm based on alpha-divergence minimization is proposed, providing more non-linear processing strategies for non-linear filtering. Furthermore, the algorithm’s validity is verified by the experimental results, and a better filtering effect is achieved for non-linear filtering by adjusting the value of α.


Introduction
The analysis and design of non-linear filtering algorithms are of enormous significance because non-linear dynamic stochastic systems are widely used in practice, for example in navigation systems [1] and simultaneous localization and mapping [2]. Because the state model and the measurement model are non-linear, the state variables and the observation variables of the system no longer follow a Gaussian distribution, and the probability density of the non-linearly transformed variables becomes difficult to represent. To solve this problem, deterministic sampling (such as the unscented Kalman filter and the cubature Kalman filter) and random sampling (such as the particle filter) are adopted to approximate this probability density, that is, to replace the actual state distribution density function by a hypothetical one [3].
In order to measure the similarity between the hypothetical state distribution density function and the actual one, we need to select a measurement method to ensure the effectiveness of the above methods. The alpha-divergence, proposed by S. Amari, is used to measure the deviation between data distributions p(x) and q(x) [4]. It can be used to measure the similarity between the hypothetical state distribution density function and the actual one for non-linear filtering. Compared with the Kullback-Leibler divergence (the KL divergence), the alpha-divergence provides more choices for measuring the similarity between the hypothetical state distribution density function and the actual one. The contributions of this paper are as follows:
1. We define an α-mixed probability density function and prove that it satisfies the normalization condition when we specify the probability distributions p(x) and q(x) to be univariate normal distributions. Then, we analyze the monotonicity of the mean and the variance of the α-mixed probability density function with respect to the parameter α when p(x) and q(x) are specified to be univariate normal distributions. The results are used in the algorithm implementation to guarantee convergence.
2. We specify the probability density function q(x) as an exponential family state density function and choose it to approximate the known state probability density function p(x). After the α-mixed probability density function is defined by q(x) and p(x), we prove that a sufficient condition for alpha-divergence minimization is that α ≥ 1 and the expected value of the natural statistical vector under q(x) equals its expected value under the α-mixed probability density function.
3. We apply the sufficient condition to the non-linear measurement update step of the non-linear filtering. The experiments show that the proposed method can achieve better performance by using a proper α value.

Related Work
Applying divergence measures to optimization and filtering has become common, and the KL divergence, as the only invariant flat divergence, has been the most widely studied [6]. The KL divergence is used to measure the error in the Gaussian approximation process, and it is applied in the process of distributing updated Kalman filtering [7]. The proposal distribution of the particle filter algorithm is regenerated using the KL divergence after incorporating the latest measurement values, so that the new proposal distribution approaches the actual posterior distribution [8]. Martin et al. proposed the Kullback-Leibler divergence-based differential evolution Markov chain filter for global localization of mobile robots in a challenging environment [9], where the KL divergence is the basis of the cost function to be minimized. The work in [3] applies KL minimization to the prediction and update steps of the filtering algorithm, which provides a better way of estimating the posterior distribution, but it only proves the KL divergence minimization. In the distributed cubature Kalman filter, the similarity of the posterior probability distributions of adjacent sensors is measured by minimizing the KL divergence, and good simulation results are achieved in a collaborative space target tracking task [10].
As a special case of the alpha-divergence, the KL divergence is easy to calculate, but it provides only one measurement method. Therefore, the theory and related applications of the alpha-divergence have received considerable attention. A discrete probability distribution of minimum Chi-square divergence is established [11]. Chi-square divergence is taken as a new criterion for image thresholding segmentation, obtaining better image segmentation results than those from the KL divergence [12,13]. It has been proven that the alpha-divergence minimization is equivalent to the α-integration of stochastic models, and it is applied to the multiple-expert decision-making system [6]. Amari et al. [14] also proved that the alpha-divergence is the only divergence class that belongs to both the f-divergences and the Bregman divergences, so it has information monotonicity, a geometric structure with the Fisher metric and a dual flat geometric structure. Gultekin et al. [15] proposed to use Monte Carlo integration to optimize the minimization equation of the alpha-divergence, but they did not prove the condition for the alpha-divergence minimization. In [16], the application of the alpha-divergence minimization in approximate inference has been systematically analyzed, and different values of α move the algorithm between the variational Bayesian algorithm and the expectation propagation algorithm. As a special case of the alpha-divergence (α = 2q − 1), q-entropy [17,18] has been widely used in the field of physics. Li et al. [19] proposed a new class of variational inference methods using a variant of the alpha-divergence, called the Rényi divergence, and applied it to variational auto-encoders and Bayesian neural networks. Further introductions to the theory and applications of the alpha-divergence can be found in [20,21].
Although the theory and applications of the alpha-divergence have been widely studied, a proof of the minimization condition is still missing. We focus on providing this theory for the alpha-divergence minimization and applying it to non-linear filtering.

Background Work
In Section 3.1, we provide the framework of the non-linear filtering. Then, we introduce the alpha-divergence in Section 3.2, which contains many types of divergence as special cases.

Non-Linear Filtering
The actual system studied in filtering is usually non-linear and non-Gaussian. Non-linear filtering estimates the state variables of a dynamic system online and in real time from the system observations. The state space model of a non-linear system with additive Gaussian white noise is:

$$x_k = f(x_{k-1}) + w_k \tag{1}$$

where $x_k \in \mathbb{R}^n$ is the system state vector that needs to be estimated; $w_k$ is zero-mean Gaussian white noise with covariance $E[w_k w_k^T] = Q_k$. Equation (1) describes the state transition $p(x_k|x_{k-1})$ of the system. The random observation model of the state vector is:

$$z_k = h(x_k) + v_k \tag{2}$$

where $z_k \in \mathbb{R}^m$ is the system measurement; $v_k$ is zero-mean Gaussian white noise with covariance $E[v_k v_k^T] = R_k$. Suppose that $w_k$ and $v_k$ are independent of each other and that the observation $z_k$ depends on the state history only through the current state $x_k$.
The entire probability state space is represented by the generative model shown in Figure 1, where $x_k$ is the system state and $z_k$ is the observed variable; the purpose is to estimate the value of the state $x_k$. The Bayesian filter is a general method for state estimation. It computes the posterior distribution $p(x_k|z_{1:k})$, and its recursive solution consists of prediction steps and update steps.
Under the Bayesian optimal filter framework, the system state equation determines that the conditional transition probability of the current state is a Gaussian distribution:

$$p(x_k|x_{k-1}) = \mathcal{N}(x_k; f(x_{k-1}), Q_k) \tag{3}$$

The prediction distribution of the system is obtained from the Chapman-Kolmogorov equation, which gives the prior probability:

$$p(x_k|z_{1:k-1}) = \int p(x_k|x_{k-1})\, p(x_{k-1}|z_{1:k-1})\, dx_{k-1} \tag{4}$$

When there is a measurement input, the system measurement equation determines that the measurement likelihood of the current state is a Gaussian distribution:

$$p(z_k|x_k) = \mathcal{N}(z_k; h(x_k), R_k) \tag{5}$$

According to Bayes' rule, the posterior probability is:

$$p(x_k|z_{1:k}) = \frac{p(z_k|x_k)\, p(x_k|z_{1:k-1})}{p(z_k|z_{1:k-1})} \tag{6}$$

where $p(z_k|z_{1:k-1})$ is the normalizing factor, defined as follows:

$$p(z_k|z_{1:k-1}) = \int p(z_k|x_k)\, p(x_k|z_{1:k-1})\, dx_k \tag{7}$$

Unlike the Kalman filter framework, the Bayesian filter framework does not demand that the update structure be linear, so it can use non-linear update steps.
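The prediction and update recursion can be illustrated numerically. The following is a minimal sketch, not the method proposed in this paper: it evaluates the Chapman-Kolmogorov integral and Bayes' rule on a fixed grid for a scalar state. The function names, the grid, and the linear test model (chosen only because its exact answer is given by the Kalman filter) are assumptions made for illustration.

```python
import numpy as np

def gauss(x, mean, var):
    """Density of a univariate normal distribution."""
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def grid_bayes_step(grid, prev_post, z, f, h, Q, R):
    """One prediction + update step of the Bayesian filter on a uniform grid."""
    dx = grid[1] - grid[0]
    # Prediction: trans[i, j] = p(x_k = grid[i] | x_{k-1} = grid[j]),
    # integrated against the previous posterior (Chapman-Kolmogorov).
    trans = gauss(grid[:, None], f(grid)[None, :], Q)
    predicted = trans @ prev_post * dx
    # Update: multiply by the likelihood and renormalize (Bayes' rule).
    unnorm = gauss(z, h(grid), R) * predicted
    return unnorm / (unnorm.sum() * dx)

# Hypothetical linear test case so that the result can be checked in closed form.
grid = np.linspace(-10.0, 10.0, 801)
dx = grid[1] - grid[0]
post0 = gauss(grid, 0.0, 1.0)
post1 = grid_bayes_step(grid, post0, z=1.0,
                        f=lambda x: 0.5 * x, h=lambda x: x, Q=1.0, R=1.0)
post_mean = np.sum(grid * post1) * dx
post_var = np.sum((grid - post_mean) ** 2 * post1) * dx
```

For this linear Gaussian case the predicted variance is 0.25 + 1 = 1.25, so the exact posterior mean and variance are both 1.25/2.25. A grid of this kind does not scale to higher dimensions, which is one motivation for the parametric approximation q(x) used in the rest of the paper.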
In the non-linear filtering problem, the posterior distribution $p(x_k|z_{1:k})$ often cannot be obtained in closed form. Our purpose is to use a distribution q(x) to approximate the posterior distribution $p(x_k|z_{1:k})$ when no analytical solution exists. Here, we use the alpha-divergence to measure the similarity between the two. We propose a method that directly minimizes the alpha-divergence without adding any additional approximations.

The Alpha-Divergence
The KL divergence is commonly used in similarity measures, but we will generalize it to the alpha-divergence. The alpha-divergence is a parametric family of divergence functions, including several well-known divergence measures as special cases, and it gives us more flexibility in approximation [20].

Definition 1. Let us consider two unnormalized distributions p(x) and q(x) with respect to a random variable x. The alpha-divergence is defined by:

$$D_\alpha[p\|q] = \frac{1}{\alpha(1-\alpha)} \int \left[\alpha p(x) + (1-\alpha) q(x) - p(x)^{\alpha} q(x)^{1-\alpha}\right] dx \tag{8}$$

where $\alpha \in \mathbb{R}$; at α = 0 and α = 1 the expression is defined by its limit, so that $D_\alpha$ is continuous at zero and one.
The alpha-divergence has the following two properties:
1. $D_\alpha[p\|q] \geq 0$, with equality if and only if p(x) = q(x). This property makes it suitable for measuring the difference between the two distributions.
2. $D_\alpha[p\|q]$ is a convex function with respect to p(x) and q(x).
In general, another equivalent expression of the alpha-divergence can be obtained by reparameterizing α. The alpha-divergence includes several special cases, such as the KL divergence, the Hellinger divergence and the χ² divergence (Pearson's distance), which are summarized below.

• As α approaches one, Equation (8) takes the indeterminate form 0/0, and applying L'Hôpital's rule shows that it specializes to the KL divergence of p(x) from q(x):

$$\lim_{\alpha \to 1} D_\alpha[p\|q] = \int p(x)\log\frac{p(x)}{q(x)}\,dx - \int p(x)\,dx + \int q(x)\,dx$$

When p(x) and q(x) are normalized distributions, the KL divergence is expressed as:

$$KL[p\|q] = \int p(x)\log\frac{p(x)}{q(x)}\,dx$$

• As α approaches zero, Equation (8) again takes the indeterminate form 0/0, and applying L'Hôpital's rule shows that it specializes to the dual form of the KL divergence:

$$\lim_{\alpha \to 0} D_\alpha[p\|q] = \int q(x)\log\frac{q(x)}{p(x)}\,dx + \int p(x)\,dx - \int q(x)\,dx$$

When p(x) and q(x) are normalized distributions, the dual form of the KL divergence is expressed as:

$$KL[q\|p] = \int q(x)\log\frac{q(x)}{p(x)}\,dx$$

• When α = 1/2, the alpha-divergence specializes to the Hellinger divergence, which is the only self-dual (symmetric) member of the alpha-divergence family:

$$D_{1/2}[p\|q] = 2\int \left(\sqrt{p(x)} - \sqrt{q(x)}\right)^2 dx = 4\,Hel[p\|q] \tag{15}$$

where $Hel[p\|q] = \frac{1}{2}\int \left(\sqrt{p(x)} - \sqrt{q(x)}\right)^2 dx$ is the Hellinger distance, that is, half of the squared Euclidean distance between the square roots of the two densities. Its square root satisfies the fundamental properties of a distance measure and is a valid distance metric.

• When α = 2, the alpha-divergence reduces to the χ² divergence:

$$D_2[p\|q] = \frac{1}{2}\int \frac{\left(p(x) - q(x)\right)^2}{q(x)}\,dx$$

In the later experiments, we adjust the value of α to optimize the distribution similarity measurement.
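The special cases above can be checked numerically. The sketch below assumes the alpha-divergence has the form $\frac{1}{\alpha(1-\alpha)}\int[\alpha p + (1-\alpha)q - p^{\alpha}q^{1-\alpha}]dx$ and evaluates it by a Riemann sum on a grid for two univariate Gaussians. The helper names and the two test densities are illustrative choices, not taken from the experiments in this paper.

```python
import numpy as np

def gauss(x, mean, var):
    """Density of a univariate normal distribution."""
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def alpha_divergence(p, q, dx, alpha):
    """Riemann-sum approximation of D_alpha[p||q] for alpha not in {0, 1}."""
    integrand = alpha * p + (1.0 - alpha) * q - p ** alpha * q ** (1.0 - alpha)
    return integrand.sum() * dx / (alpha * (1.0 - alpha))

# Two hypothetical test densities on a wide grid.
x = np.linspace(-15.0, 15.0, 30001)
dx = x[1] - x[0]
p = gauss(x, 0.0, 1.0)
q = gauss(x, 1.0, 2.25)

kl_pq = np.sum(p * np.log(p / q)) * dx              # KL[p||q]
kl_qp = np.sum(q * np.log(q / p)) * dx              # KL[q||p]
hel = 0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2) * dx
chi2_half = 0.5 * np.sum((p - q) ** 2 / q) * dx

d_near_one = alpha_divergence(p, q, dx, 1.0 - 1e-4)   # close to KL[p||q]
d_near_zero = alpha_divergence(p, q, dx, 1e-4)        # close to KL[q||p]
d_half = alpha_divergence(p, q, dx, 0.5)              # equals 4 * Hel
d_two = alpha_divergence(p, q, dx, 2.0)               # equals chi-square / 2
```

Values of α very close to zero or one are used instead of the limits themselves, because the closed form divides by α(1 − α).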

Non-Linear Filtering Based on the Alpha-Divergence
We first define an α-mixed probability density function, which will be used in the non-linear filtering based on the alpha-divergence minimization. Then, we show that the sufficient condition for the alpha-divergence minimization is when α ≥ 1 and the expected value of the natural statistical vector of q(x) is equivalent to the expected value of the natural statistical vector of the α-mixed probability density function. At last, we apply the sufficient condition to the non-linear measurement update steps for solving the non-linear filtering problem.

The α-Mixed Probability Density Function
We first give a definition of a normalized probability density function called the α-mixed probability density function, which is expressed as p α (x).

Definition 2. We define an α-mixed probability density function:

$$p_\alpha(x) = \frac{p(x)^{\alpha} q(x)^{1-\alpha}}{\int p(x)^{\alpha} q(x)^{1-\alpha}\,dx}$$

We can prove that when both p(x) and q(x) are univariate normal distributions, $p_\alpha(x)$ is still a Gaussian probability density function.
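Suppose that $p(x) \sim \mathcal{N}(\mu_p, \sigma_p^2)$ and $q(x) \sim \mathcal{N}(\mu_q, \sigma_q^2)$, so the probability density functions can be expressed as follows:

$$p(x) = \frac{1}{\sqrt{2\pi}\sigma_p}\exp\left(-\frac{(x-\mu_p)^2}{2\sigma_p^2}\right), \qquad q(x) = \frac{1}{\sqrt{2\pi}\sigma_q}\exp\left(-\frac{(x-\mu_q)^2}{2\sigma_q^2}\right)$$

Then we can combine these two functions with parameter α:

$$p(x)^{\alpha} q(x)^{1-\alpha} = S_\alpha\,\frac{1}{\sqrt{2\pi}\sigma_\alpha}\exp\left(-\frac{(x-\mu_\alpha)^2}{2\sigma_\alpha^2}\right)$$

where

$$\mu_\alpha = \frac{\alpha\mu_p\sigma_q^2 + (1-\alpha)\mu_q\sigma_p^2}{\alpha\sigma_q^2 + (1-\alpha)\sigma_p^2}$$

is the mean of the α-mixed probability density function;

$$\sigma_\alpha^2 = \frac{\sigma_p^2\sigma_q^2}{\alpha\sigma_q^2 + (1-\alpha)\sigma_p^2}$$

is the variance of the α-mixed probability density function; $S_\alpha$ is a scalar factor that does not depend on x:

$$S_\alpha = \frac{\sigma_\alpha}{\sigma_p^{\alpha}\sigma_q^{1-\alpha}}\exp\left(-\frac{\alpha(1-\alpha)(\mu_p-\mu_q)^2}{2\left(\alpha\sigma_q^2 + (1-\alpha)\sigma_p^2\right)}\right)$$

Dividing by $\int p(x)^{\alpha} q(x)^{1-\alpha}dx = S_\alpha$ therefore gives a normalized probability density function $p_\alpha(x)$ satisfying $\int p_\alpha(x)dx = 1$. In other words, the product of powers of two Gaussian densities is, after normalization, still a Gaussian density whenever $\sigma_\alpha^2 > 0$, which greatly simplifies the representation of the probability distributions in the filtering problem treated later.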
At the same time, the variance of $p_\alpha(x)$ is $\sigma_\alpha^2$, which must be greater than zero. From its denominator $\alpha\sigma_q^2 + (1-\alpha)\sigma_p^2$, we see that when $\sigma_q^2 \geq \sigma_p^2$, every α ≥ 0 is admissible (the denominator can only vanish for a negative α); when $\sigma_q^2 < \sigma_p^2$, α must satisfy $\alpha < \frac{\sigma_p^2}{\sigma_p^2 - \sigma_q^2}$. It follows that the closer $\sigma_p^2$ is to $\sigma_q^2$, the greater the admissible range of α. In addition, the influence of the means and variances of the two distributions on the mean and variance of the α-mixed probability density function can be analyzed to facilitate the solution of the algorithm later. As for the variance, when $\sigma_q^2 > \sigma_p^2$, $\sigma_\alpha^2$ decreases as α increases; when $\sigma_q^2 = \sigma_p^2$, $\sigma_\alpha^2 = \sigma_q^2 = \sigma_p^2$; when $\sigma_q^2 < \sigma_p^2$, $\sigma_\alpha^2$ increases as α increases. As for the mean, when $\sigma_\alpha^2 > 0$:

$$\frac{\partial \mu_\alpha}{\partial \alpha} = \frac{\sigma_p^2\sigma_q^2(\mu_p - \mu_q)}{\left(\alpha\sigma_q^2 + (1-\alpha)\sigma_p^2\right)^2}$$

It is clear that if $\mu_p > \mu_q$, then $\mu_\alpha$ increases as α increases; if $\mu_p < \mu_q$, then $\mu_\alpha$ decreases as α increases. The properties are summarized in Table 1.

Table 1. The monotonicity of the mean $\mu_\alpha$ and the variance $\sigma_\alpha^2$ of the α-mixed probability density function.

| Condition | Behavior |
| $\sigma_q^2 < \sigma_p^2$ | $\sigma_\alpha^2$ increases with the increase of α |
| $\sigma_q^2 = \sigma_p^2$ | $\sigma_\alpha^2 = \sigma_q^2 = \sigma_p^2$ |
| $\sigma_q^2 > \sigma_p^2$ | $\sigma_\alpha^2$ decreases with the increase of α |
| $\mu_p > \mu_q$ | $\mu_\alpha$ increases with the increase of α |
| $\mu_p < \mu_q$ | $\mu_\alpha$ decreases with the increase of α |

The monotonicity of the mean $\mu_\alpha$ and the variance $\sigma_\alpha^2$ with respect to α is shown in Figure 2. When $\mu_p < \mu_q$ and $\sigma_q^2 > \sigma_p^2$, both $\mu_\alpha$ and $\sigma_\alpha^2$ decrease as α increases; when $\mu_p < \mu_q$ and $\sigma_q^2 < \sigma_p^2$, $\mu_\alpha$ decreases and $\sigma_\alpha^2$ increases as α increases; when $\mu_p > \mu_q$ and $\sigma_q^2 > \sigma_p^2$, $\mu_\alpha$ increases and $\sigma_\alpha^2$ decreases as α increases; when $\mu_p > \mu_q$ and $\sigma_q^2 < \sigma_p^2$, both $\mu_\alpha$ and $\sigma_\alpha^2$ increase as α increases.
When α ∈ (0, 1), the α-mixed probability density function is the interpolation function of p(x) and q(x), so its mean value and the variance are all between p(x) and q(x), as shown in Figure 2, and its image curve is also between them.
The above analysis will be used in the algorithm implementation of the sufficient condition in the non-linear filtering algorithm.
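The closed-form behavior of the α-mixed density is easy to explore in code. The sketch below assumes the α-mixed density is proportional to $p(x)^{\alpha}q(x)^{1-\alpha}$ with univariate Gaussian p and q, so that its precision is $\alpha/\sigma_p^2 + (1-\alpha)/\sigma_q^2$. The function name and the example parameters are hypothetical.

```python
def alpha_mixed_gaussian(mu_p, var_p, mu_q, var_q, alpha):
    """Mean and variance of the normalized density p(x)^alpha * q(x)^(1-alpha)
    for univariate Gaussians p = N(mu_p, var_p) and q = N(mu_q, var_q)."""
    denom = alpha * var_q + (1.0 - alpha) * var_p
    if denom <= 0.0:
        # The mixture is not normalizable: alpha lies outside the admissible range.
        raise ValueError("alpha is outside the admissible range for these variances")
    var_a = var_p * var_q / denom
    mu_a = (alpha * mu_p * var_q + (1.0 - alpha) * mu_q * var_p) / denom
    return mu_a, var_a

# Example with mu_p < mu_q and var_q > var_p: both the mean and the variance
# of the mixture decrease as alpha grows.
alphas = [0.0, 0.5, 1.0, 2.0, 4.0]
results = [alpha_mixed_gaussian(0.0, 1.0, 1.0, 4.0, a) for a in alphas]
means = [r[0] for r in results]
variances = [r[1] for r in results]
```

With α = 0 the function returns the parameters of q(x), with α = 1 those of p(x), and values in (0, 1) interpolate between them. When $\sigma_q^2 < \sigma_p^2$ the function raises an error once α reaches $\sigma_p^2/(\sigma_p^2 - \sigma_q^2)$, which mirrors the admissible range discussed above.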

The Alpha-Divergence Minimization
In the alpha-divergence minimization, either the posterior distribution itself or the calculation of the maximum of the posterior distribution is complex, so an approximate distribution q(x) with good representational ability is often used to approximate the true posterior distribution p(x). The more expressive q(x) is, the better the approximation that can be achieved. Here, we restrict the approximate distribution q(x) to be an exponential family distribution, denoted $p_e(x)$, which has good properties and is defined as follows:

$$p_e(x) = c(x)\exp\left(\phi(\theta)^{T}u(x) + g(\phi(\theta))\right) \tag{21}$$

Here, θ is the parameter set of the probability density function; c(x) and g(φ(θ)) are known functions; φ(θ) is a vector composed of natural parameters; u(x) is the natural statistical vector. u(x) contains enough information to express the state variable x in the exponential family distribution completely; φ(θ) is a coefficient parameter that combines u(x) based on the parameter set θ.
In non-linear filtering, let $p_e(x)$ be the exponential family distribution and p(x) an arbitrary probability density function. We use $p_e(x)$ to approximate p(x) and measure the degree of approximation by the alpha-divergence. Since both densities are normalized, the alpha-divergence of p(x) relative to $p_e(x)$ is:

$$J = D_\alpha[p\|p_e] = \frac{1}{\alpha(1-\alpha)}\left(1 - \int p(x)^{\alpha} p_e(x)^{1-\alpha}\,dx\right) \tag{22}$$

We state and prove in Theorem 1 that the alpha-divergence between the exponential family distribution and the probability density function of an arbitrary state variable is minimized when α ≥ 1 and the expected value of the natural statistical vector under the exponential family distribution is equal to its expected value under the α-mixed probability state density function. In Corollary 1, given α = 1, the corresponding condition is obtained for KL[p||q]. In Corollary 2, the exponential family distribution is specialized to the Gaussian probability density function.

Theorem 1. The alpha-divergence between the exponential family distribution and the known state probability density function takes its minimum value if α ≥ 1 and the expected value of the natural statistical vector under the exponential family distribution is equal to its expected value under the α-mixed probability state density function, that is:

$$E_{p_e}[u(x)] = E_{p_\alpha}[u(x)], \qquad p_\alpha(x) = \frac{p(x)^{\alpha} p_e(x)^{1-\alpha}}{\int p(x)^{\alpha} p_e(x)^{1-\alpha}\,dx} \tag{23}$$

Proof of Theorem 1. Sufficient conditions for the minimization of J are that the first derivative and the second derivative satisfy:

$$\frac{\partial J}{\partial \phi(\theta)} = 0, \qquad \frac{\partial^2 J}{\partial \phi(\theta)^2} > 0 \tag{24}$$

First, we differentiate Equation (22) with respect to φ(θ). Using $\partial p_e(x)/\partial\phi(\theta) = p_e(x)\left(u(x) + \partial g(\phi(\theta))/\partial\phi(\theta)\right)$, the outcome is:

$$\frac{\partial J}{\partial \phi(\theta)} = -\frac{1}{\alpha}\int p(x)^{\alpha} p_e(x)^{1-\alpha}\left(u(x) + \frac{\partial g(\phi(\theta))}{\partial \phi(\theta)}\right)dx \tag{25}$$

Let the above equation be equal to zero, then:

$$\frac{\int p(x)^{\alpha} p_e(x)^{1-\alpha}u(x)\,dx}{\int p(x)^{\alpha} p_e(x)^{1-\alpha}\,dx} = -\frac{\partial g(\phi(\theta))}{\partial \phi(\theta)} \tag{26}$$

In addition, since $p_e(x)$ is a probability density function, it satisfies the normalization condition:

$$\int p_e(x)\,dx = 1 \tag{27}$$

Differentiating the above equation with respect to φ(θ) gives:

$$\int p_e(x)\left(u(x) + \frac{\partial g(\phi(\theta))}{\partial \phi(\theta)}\right)dx = 0 \quad\Longrightarrow\quad E_{p_e}[u(x)] = -\frac{\partial g(\phi(\theta))}{\partial \phi(\theta)} \tag{28}$$

Combining Equations (26) and (28) yields Equation (23), which is the condition for the existence of a stationary point of J.
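To ensure that the stationary point given by Equation (23) minimizes Equation (22), that is, that the stationary point is also a minimum point, we also need to prove that the second derivative satisfies the condition in Equation (24). Differentiating Equation (25) with respect to φ(θ), the outcome is:

$$\frac{\partial^2 J}{\partial \phi(\theta)^2} = -\frac{1}{\alpha}\frac{\partial^2 g(\phi(\theta))}{\partial \phi(\theta)^2}\int p(x)^{\alpha} p_e(x)^{1-\alpha}dx + \frac{\alpha-1}{\alpha}\int p(x)^{\alpha} p_e(x)^{1-\alpha}\left(u(x) + \frac{\partial g(\phi(\theta))}{\partial \phi(\theta)}\right)\left(u(x) + \frac{\partial g(\phi(\theta))}{\partial \phi(\theta)}\right)^{T}dx \tag{29}$$

For the first item, it is easy to prove that $\frac{\partial^2 g(\phi(\theta))}{\partial \phi(\theta)^2} < 0$, as follows. From Equation (21) and the normalization condition:

$$g(\phi(\theta)) = -\log\int c(x)\exp\left(\phi(\theta)^{T}u(x)\right)dx \tag{30}$$

The gradient of Equation (30) with respect to the natural parameter vector is:

$$\frac{\partial g(\phi(\theta))}{\partial \phi(\theta)} = -\int p_e(x)\,u(x)\,dx = -E_{p_e}[u(x)]$$

Then, consider the matrix formed by its second derivative with respect to the natural parameter vector:

$$\frac{\partial^2 g(\phi(\theta))}{\partial \phi(\theta)^2} = -\left(E_{p_e}\left[u(x)u(x)^{T}\right] - E_{p_e}[u(x)]\,E_{p_e}[u(x)]^{T}\right)$$

According to the definition of the covariance matrix, the content in the bracket is the covariance matrix of the natural statistical vector u(x) under the exponential family probability density function $p_e(x)$. This covariance matrix is positive definite, so $\frac{\partial^2 g(\phi(\theta))}{\partial \phi(\theta)^2} < 0$, and when α > 0, the first item of Equation (29) is greater than zero. The integral in the second item is a second moment and hence non-negative, so the second item is non-negative when α ≥ 1 or α < 0. Both requirements hold simultaneously when α ≥ 1, in which case the second derivative is positive and the stationary point is a minimum.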

Corollary 1. (See Theorem 1 of [3] for more details) When α = 1, the alpha-divergence reduces to KL[p||q] and the α-mixed probability density function reduces to p(x). We can therefore obtain the above theorem for KL[p||q] and obtain the approximate distribution by minimizing the KL divergence; the proof also shows that the stationary point obtained when the first derivative of the KL divergence is equal to zero satisfies the condition that the second derivative is greater than zero. The corresponding expectation propagation algorithm is:

$$E_{q}[u(x)] = E_{p}[u(x)]$$

Corollary 2. (See Corollary 1.1 of [3] for more details) When the exponential family distribution is specialized to the Gaussian probability density function, its sufficient statistic is $u(x) = (x, x^2)$, and the distribution is described by its mean and variance. The expectations in the corresponding propagation algorithm can then be calculated by moment matching, with the first moment and the second moment defined as follows:

$$m = E_p[x] = \int x\,p(x)\,dx, \qquad M = E_p[x x^{T}] = \int x x^{T} p(x)\,dx$$

(for a scalar state, $x x^{T} = x^2$). The corresponding second central moment is:

$$P = M - m m^{T}$$

The complexity of Theorem 1 lies in the fact that both sides of Equation (23) depend on the probability distribution q(x) at the same time. A q(x) that satisfies the condition can be obtained by repeated iterative updates of q(x). The specific process is shown in Algorithm 1.

Algorithm 1 Approximation of the true probability distribution p(x).
Input: Target distribution parameter of p(x); damping factor ε ∈ (0, 1); divergence parameter α ∈ [1, +∞); initialization value of q(x)
Output: The exponential family probability function q(x)
1: while q(x) has not converged do
2: Calculate the α-mixed probability density function $p_\alpha(x)$
3: According to Equation (23), obtain a new q(x) using the expectation propagation algorithm described in Corollary 1; the new q(x) is denoted q′(x)
4: Revalue q(x) as the damped combination of the old q(x) and q′(x) with damping factor ε, Equation (36)
5: Calculate the KL divergence of the old q(x) and the new q(x) to test convergence
6: end while

In the above algorithm, we need to pay attention to the following two problems: giving an initial value of q(x) and selecting the damping factor. As for the first problem, when $\sigma_q^2 < \sigma_p^2$, the value range of α is $\alpha < \frac{\sigma_p^2}{\sigma_p^2 - \sigma_q^2}$, according to the analysis of the α-mixed probability density function in Section 4.1. This bound is greater than one, but it still limits the range of α, and $\sigma_p^2$ is unknown in the initial state; when $\sigma_q^2 \geq \sigma_p^2$, every α ≥ 1 is admissible. We therefore choose a relatively large initial value, making $\sigma_q^2 \geq \sigma_p^2$ and $\mu_q > \mu_p$. When the value of α is greater than one, the mean of the α-mixed probability density function then decreases, and the variance also decreases, as shown in the upper left of Figure 2.
As for the second problem, when α ∈ (0, 1), the α-mixed probability density function is an interpolation of p(x) and q(x), according to the analysis in Section 4.1. Restricting the damping factor ε to (0, 1) is reasonable for the same reason: the two probability density functions are interpolated when ε is in (0, 1), and the new probability density function lies between the two. According to Equation (36), the smaller ε is, the closer the new q(x) is to the old q(x); the larger ε is, the closer the new q(x) is to q′(x). According to the analysis of the first problem, the mean and the variance of q′(x) are smaller than those of the real p(x). Then, we combine the new q(x) with p(x) again to form an α-mixed probability density function. Since the mean and the variance of the new q(x) are larger than those of p(x), the value of ε we choose should be as close as possible to one.
The convergence of the algorithm can be guaranteed after considering the above two problems, and we can get q(x) that meets the conditions. It can be known from Theorem 1 that the approximation q(x) of p(x) can be obtained to ensure it converges on this minimum point after repeated iterative updates.
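The iteration of Algorithm 1 can be sketched for the simplest case in which p(x) is itself a known univariate Gaussian, so that every step has a closed form. This is a toy illustration under stated assumptions, not the implementation used in the experiments: the α-mixed density is taken to be proportional to $p^{\alpha}q^{1-\alpha}$, moment matching of a Gaussian q′ to it is exact, and the damped update is assumed to be the geometric interpolation $q^{1-\varepsilon}q'^{\varepsilon}$ (linear in the natural parameters). The function names, the symbol ε and the default values are hypothetical.

```python
import math

def kl_gauss(m1, v1, m2, v2):
    """KL divergence KL[N(m1, v1) || N(m2, v2)] for univariate Gaussians."""
    return 0.5 * (math.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1.0)

def approximate_gaussian(mu_p, var_p, mu_q, var_q,
                         alpha=2.0, eps=0.4, tol=1e-14, max_iter=500):
    """Damped fixed-point iteration for E_q[u] = E_{p_alpha}[u], Gaussian case."""
    for _ in range(max_iter):
        # Alpha-mixed density of p and the current q, in natural parameters
        # (precision and precision-weighted mean).
        lam_mix = alpha / var_p + (1.0 - alpha) / var_q
        if lam_mix <= 0.0:
            raise ValueError("alpha-mixed density is not normalizable; "
                             "start from a larger var_q or reduce alpha")
        eta_mix = alpha * mu_p / var_p + (1.0 - alpha) * mu_q / var_q
        # Moment matching: q' is the alpha-mixed Gaussian itself.
        # Assumed damping: q <- q^(1 - eps) * q'^eps.
        lam_new = (1.0 - eps) / var_q + eps * lam_mix
        eta_new = (1.0 - eps) * mu_q / var_q + eps * eta_mix
        mu_new, var_new = eta_new / lam_new, 1.0 / lam_new
        # Convergence test on the KL divergence between successive iterates.
        step = kl_gauss(mu_q, var_q, mu_new, var_new)
        mu_q, var_q = mu_new, var_new
        if step < tol:
            break
    return mu_q, var_q

# Start from a broad q with a larger mean than p, as suggested in the text.
mu_hat, var_hat = approximate_gaussian(mu_p=0.0, var_p=1.0, mu_q=3.0, var_q=4.0)
```

Because p(x) is Gaussian here, the minimizer is p(x) itself, so the iteration should return a mean close to 0 and a variance close to 1. In the filtering problem p(x) is only known up to normalization, and the moments of the α-mixed density have to be estimated numerically.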

Non-Linear Filtering Algorithm Based on the Alpha-Divergence
In the process of non-linear filtering, we assume that the prior and posterior probability density functions follow the Assumed Density Filter (ADF) framework. Define the prior parameter as $\theta_k^- = (m_k^-, P_k^-)$, so that the corresponding prior distribution is $q(x_k; \theta_k^-)$; define the posterior parameter as $\theta_k^+ = (m_k^+, P_k^+)$, so that the corresponding posterior distribution is $q(x_k; \theta_k^+)$. The prediction of the state distribution can be expressed as follows:

$$p(x_k|z_{1:k-1}) = \int p(x_k|x_{k-1})\,q(x_{k-1};\theta_{k-1}^+)\,dx_{k-1} \tag{37a}$$

Since $\int x_k\,p(x_k|x_{k-1})\,dx_k = f(x_{k-1})$, the first moment about the origin of $p(x_k|z_{1:k-1})$ can be obtained from Equation (37a).
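By Corollary 2, when the alpha-divergence is simplified to the KL divergence, the corresponding mean value and variance are:

$$m_k^- = \int f(x_{k-1})\,q(x_{k-1};\theta_{k-1}^+)\,dx_{k-1}$$

$$P_k^- = \int f(x_{k-1})f(x_{k-1})^{T}q(x_{k-1};\theta_{k-1}^+)\,dx_{k-1} - m_k^-\left(m_k^-\right)^{T} + Q_k$$

Here, the prior distribution $q(x_k; \theta_k^-)$ can be obtained. Similarly, the update step of the filter can be expressed as follows:

$$p(x_k|z_{1:k}) = \frac{p(z_k|x_k)\,q(x_k;\theta_k^-)}{\int p(z_k|x_k)\,q(x_k;\theta_k^-)\,dx_k} \tag{39a}$$

According to Theorem 1:

$$E_{q(x_k;\theta_k^+)}[u(x_k)] = \frac{\int u(x_k)\,p(x_k|z_{1:k})^{\alpha}q(x_k;\theta_k^+)^{1-\alpha}dx_k}{\int p(x_k|z_{1:k})^{\alpha}q(x_k;\theta_k^+)^{1-\alpha}dx_k} \approx \frac{\sum_{i=1}^{N} w_i\,u(x_i)}{\sum_{i=1}^{N} w_i}, \qquad w_i = \frac{\left[p(z_k|x_i)\,q(x_i;\theta_k^-)\right]^{\alpha}q(x_i;\theta_k^+)^{1-\alpha}}{\pi_t(x_i)} \tag{40}$$

Here, $x_i \sim \pi_t(x_t)$ are independent and identically distributed samples, i = 1, ..., N, and $\pi_t$ is the proposal distribution. The normalizing factor of Equation (39a) cancels in the ratio, so it does not appear in the weights. We choose the proposal distribution to be the prior distribution $q(x_k; \theta_k^-)$, in which case the weights simplify to $w_i = p(z_k|x_i)^{\alpha}\left[q(x_i;\theta_k^+)/q(x_i;\theta_k^-)\right]^{1-\alpha}$.
An approximate calculation of the mean value and the variance of $q(x_k; \theta_k^+)$ is then:

$$m_k^+ \approx \frac{\sum_{i=1}^{N} w_i\,x_i}{\sum_{i=1}^{N} w_i}, \qquad P_k^+ \approx \frac{\sum_{i=1}^{N} w_i\,(x_i - m_k^+)(x_i - m_k^+)^{T}}{\sum_{i=1}^{N} w_i}$$

Since Equation (40) contains $q(x_k; \theta_k^+)$ on both sides of the equation, we must use Algorithm 1 to conduct the iterative calculation until a posterior distribution $q(x_k; \theta_k^+)$ satisfying the condition is obtained. If α = 1, the above steps reduce to a simpler filtering algorithm, as shown in [3].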
In this process, we do not evaluate the integral in the denominator of Equation (39a), but use the Monte Carlo integration strategy proposed in [15], as shown in Equation (40). We do not need to conduct resampling, which greatly reduces the amount of calculation.
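A measurement update of this kind can be sketched for a scalar state as follows. This is a hedged illustration rather than the authors' implementation: the importance weights are assumed to take the form $p(z_k|x_i)^{\alpha}[q(x_i;\theta_k^+)/q(x_i;\theta_k^-)]^{1-\alpha}$ with samples drawn once from the prior, the damped update is assumed to be a geometric interpolation in natural parameters, and all names, defaults and the linear test model are hypothetical.

```python
import numpy as np

def log_gauss(x, mean, var):
    """Log-density of a univariate normal distribution."""
    return -0.5 * (np.log(2.0 * np.pi * var) + (x - mean) ** 2 / var)

def alpha_measurement_update(m_prior, P_prior, z, h, R, alpha=2.0, eps=0.4,
                             n_samples=20000, n_iter=40, seed=0):
    """Monte Carlo measurement update with a Gaussian posterior q(x; m, P)."""
    rng = np.random.default_rng(seed)
    # Proposal = prior q(x_k; theta_k^-); samples are drawn once, no resampling.
    x = rng.normal(m_prior, np.sqrt(P_prior), size=n_samples)
    log_lik = log_gauss(z, h(x), R)
    log_prior = log_gauss(x, m_prior, P_prior)
    m, P = m_prior, P_prior  # initialize the posterior guess with the prior
    for _ in range(n_iter):
        # Self-normalized importance weights for the alpha-mixed density.
        log_w = alpha * log_lik + (1.0 - alpha) * (log_gauss(x, m, P) - log_prior)
        w = np.exp(log_w - log_w.max())
        w /= w.sum()
        # Moment matching of q' to the weighted samples.
        m_mix = np.sum(w * x)
        P_mix = np.sum(w * (x - m_mix) ** 2)
        # Assumed damping in natural parameters: q <- q^(1 - eps) * q'^eps.
        lam = (1.0 - eps) / P + eps / P_mix
        eta = (1.0 - eps) * m / P + eps * m_mix / P_mix
        m, P = eta / lam, 1.0 / lam
    return m, P

# Linear measurement used only as a sanity check: the exact posterior for
# prior N(0, 1), z = 1, R = 1 is N(0.5, 0.5).
m_kl, P_kl = alpha_measurement_update(0.0, 1.0, 1.0, lambda x: x, 1.0, alpha=1.0)
m_a2, P_a2 = alpha_measurement_update(0.0, 1.0, 1.0, lambda x: x, 1.0, alpha=2.0)
```

Working with log-weights and subtracting the maximum before exponentiating avoids underflow when the likelihood is sharply peaked. For a non-linear h the result depends on α, which is the degree of freedom the proposed filter exploits.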

Simulations and Analysis
According to Theorem 1, when α ≥ 1, the non-linear filtering method we proposed is feasible theoretically. In the simulation experiment, the algorithm is validated by taking different values when α ≥ 1. We name our proposed method as AKF and compare it with the traditional non-linear filtering methods such as EKF and UKF.
We choose the Univariate Nonstationary Growth Model (UNGM) [22] to analyze the performance of the proposed method. In its commonly used form, the system state equation is:

$$x(k) = 0.5\,x(k-1) + \frac{25\,x(k-1)}{1 + x(k-1)^2} + 8\cos\left(1.2\,(k-1)\right) + w(k)$$

The observation equation is:

$$y(k) = \frac{x(k)^2}{20} + v(k)$$

The state equation is non-linear and includes a fractional relation, a square relation and a trigonometric function relation. w(k) is the process noise with a mean value of zero and a variance of Q. The relationship between the observed signal y(k) and the state x(k) in the measurement equation is also non-linear. v(k) is the observation noise with a mean value of zero and a variance of R. Therefore, this system is a typical system with non-linear states and observations, and this model has become a basic model for verifying non-linear filtering algorithms [22,23].
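A trajectory of this benchmark can be generated with a few lines of code. The sketch below assumes the commonly used UNGM constants (0.5, 25, 8, 1.2 and 1/20) and the timing convention cos(1.2(k − 1)), which may differ in detail from the setup used in the experiments; the function name and the seeding are illustrative.

```python
import numpy as np

def simulate_ungm(T=50, Q=10.0, R=1.0, seed=0):
    """Simulate T steps of the UNGM; index 0 of the arrays corresponds to k = 1."""
    rng = np.random.default_rng(seed)
    x = np.empty(T)
    y = np.empty(T)
    x[0] = rng.normal(0.0, 1.0)  # initial state x(1) ~ N(0, 1)
    y[0] = x[0] ** 2 / 20.0 + rng.normal(0.0, np.sqrt(R))
    for i in range(1, T):
        # i equals k - 1 for the one-based time index k used in the equations.
        x[i] = (0.5 * x[i - 1]
                + 25.0 * x[i - 1] / (1.0 + x[i - 1] ** 2)
                + 8.0 * np.cos(1.2 * i)
                + rng.normal(0.0, np.sqrt(Q)))
        y[i] = x[i] ** 2 / 20.0 + rng.normal(0.0, np.sqrt(R))
    return x, y

states, observations = simulate_ungm(T=50, Q=10.0, R=1.0, seed=1)
```

Because the observation depends on the square of the state, the sign of x(k) cannot be recovered from a single measurement, which is one reason the posterior of this model is often far from Gaussian.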
In the experiment, we set Q = 10, R = 1 and set the initial state as p(x(1)) = N(x(1); 0, 1). First, we simulate the system. Any α ≥ 1 is admissible for the experiments; here, the value of α is two, and the entire experimental simulation time is T = 50. The result of the state estimation is shown in Figure 3, and it can be seen that the non-linear filtering method we proposed is feasible; the state value can be estimated well during the whole process, and its performance is superior to EKF and UKF in some cases. Second, in order to measure the accuracy of the state estimation, we compute the absolute value of the difference between the real state value and the estimated state value $\hat{x}(k)$ at each moment, which gives the absolute deviation of the state estimation at each moment, namely:

$$e(k) = \left|x(k) - \hat{x}(k)\right|$$

As shown in Figure 4, the error of the proposed algorithm remains relatively small at the moments where the absolute deviation of the other methods is relatively large. It can be seen that our proposed method performs better than the other non-linear methods. In order to measure the overall level of error, we have carried out many simulation experiments. The average error of each experiment is defined as:

$$\bar{e} = \frac{1}{T}\sum_{k=1}^{T}\left|x(k) - \hat{x}(k)\right|$$

The experimental results are shown in Table 2. When the estimation errors are averaged over the T time steps, the mean error of AKF is the smallest in each experiment, which indicates the effectiveness of the algorithm; the filtering accuracy of the algorithm is better than that of the other two methods under the same conditions. Because the UNGM has strong non-linearity and we set the variance of the state noise to 10, which is quite large, the performance differences between EKF, UKF and AKF are rather small. Then, we analyze the influence of the initial value on the filtering results by modifying the value of the process noise. As can be seen from Table 3, the performance of AKF becomes more and more similar to that of EKF and UKF as Q becomes smaller.
In the end, we analyze the performance of the whole non-linear filtering algorithm by adjusting the value of α through 20 experiments. In order to reduce the influence of the initial value on the experimental results, we take Q = 0.1 and then average the 20 experimental errors. The result is shown in Figure 5. We can see that the error grows as α grows in this example, as the noise is relatively small.

Conclusions
We have first defined the α-mixed probability density function and analyzed the monotonicity of its mean and variance under different α values. Secondly, the sufficient condition for the alpha-divergence to attain its minimum value has been proven, which provides more methods for measuring the distribution similarity in non-linear filtering. Finally, a non-linear filtering algorithm based on the alpha-divergence minimization has been proposed by applying the above two points to non-linear filtering. Moreover, we have verified the validity of the algorithm on the one-dimensional UNGM.
Although the filtering algorithm is effective, the alpha-divergence is a direct extension of the KL divergence. In further study, we can try to verify whether the physical meaning of the alpha-divergence minimum is equivalent to that of the KL divergence minimum. The algorithm should be applied to more practical applications to prove its effectiveness. Meanwhile, we can use more sophisticated particle filtering techniques, such as [24,25], to make the algorithm more efficient. Furthermore, the alpha-divergence method described above is applied to uni-modal approximations, but more attention should be paid to multi-modal distributions, which are more difficult and common in practical systems. Finally, it is worth designing a strategy to automatically learn the appropriate α values.
Author Contributions: Y.L. and C.G. conceived of and designed the method and performed the experiment. Y.L. and S.Y. participated in the experiment and analyzed the data. Y.L. and C.G. wrote the paper. S.Y. revised the paper. J.Z. guided and supervised the overall process.