An Overdispersed Black-Box Variational Bayesian–Kalman Filter with Inaccurate Noise Second-Order Statistics

Aimed at the problems in which the performance of filters derived from a hypothetical model will decline or the cost of time of the filters derived from a posterior model will increase when prior knowledge and second-order statistics of noise are uncertain, a new filter is proposed. In this paper, a Bayesian robust Kalman filter based on posterior noise statistics (KFPNS) is derived, and the recursive equations of this filter are very similar to that of the classical algorithm. Note that the posterior noise distributions are approximated by overdispersed black-box variational inference (O-BBVI). More precisely, we introduce an overdispersed distribution to push more probability density to the tails of variational distribution and incorporated the idea of importance sampling into two strategies of control variates and Rao–Blackwellization in order to reduce the variance of estimators. As a result, the convergence process will speed up. From the simulations, we can observe that the proposed filter has good performance for the model with uncertain noise. Moreover, we verify the proposed algorithm by using a practical multiple-input multiple-output (MIMO) radar system.


Introduction
Recently, the design of robust filters has been one of the most popular topics in modern radar systems and MIMO-based radar system in particular. There are two main reasons why this has occurred. First, a radar signal of interest can be corrupted naturally from other interference and noise or by clutter and so on, which results in uncertainty in assumed statistical models of given received signals. Second, the models of measurement system can be affected by the setup of transmitters and receivers in practical applications of radar systems. Unfortunately, it is impossible or prohibitively expensive to design an optimal filter by obtaining and/or understanding accurate models in the real world, which makes some nominally optimal filters suffer significant degradations in performance even if small deviations from the assumed models occur. Therefore, a robust filter can be considered as a kind of optimal filter under some conditions of uncertainty models. The seminal concept of robust filters is proposed in the 1970s [1,2], where the minimax approach was discussed for the Wiener filter, which is a batch-based optimal filter for stationary signals. Currently, the Bayesian-based robust approach has attracted significant attention in the design of robust filters due to the use of information with respect to the prior distribution of models in providing more accurate knowledge on the statistical model.
It is known to all that classical Kalman filtering [3] involves an optimal filter in terms of the Gaussian and linear assumptions of dynamic systems. It has been well established in many fields such as communication, navigation, radar and control. In addition, more and more Kalman filtering-related technologies are applied to the localization and tracking of moving agents [4][5][6]. However, it also suffers from prominent shortcomings, for example, the good performance lies in completely grasping the statistical models of noise and the state of signal. As a result, the problem of designing a robust Kalman filter in practical applications where the knowledge of noise distribution is missing or imprecise is a big challenge for both researchers and developers. The authors of [7][8][9][10][11][12] proposed methods, which include so-called adaptive Kalman filtering, to estimate signal states and noise simultaneously, which works well when there is a lot of data used to obtain certain accurate performances in the entire estimated period. Later on, minimax-based Kalman filters [13][14][15] and finite impulse response Kalman filters [16][17][18][19] were presented.
Inspired by both the Bayesian innovation process and the Bayesian orthogonality principle, an intrinsically Bayesian robust Kalman filter (IBRKF) was proposed in [20]. The authors argued that it should be optimal from the point of view of the cost function and the prior distribution of uncertain noise. Herein, its recursive equations are very similar to that of the classical Kalman filter. The IBRKF makes use of concepts of effective noise statistics and effective Kalman gain by replacing the corresponding variables in the classical Kalman filter framework in order to improve performance. Therefore, IBRKF can be considered as a robust Kalman filter in terms of uncertain noise.
The IBR method can achieve optimal performances in terms of the uncertainty of models by using prior distributions, which does not concern measurements, and may contain some useful information for improving performances further. Instead, the optimal Bayesian Kalman filter (OBKF) takes into account the observation sequence and prior knowledge simultaneously in order to achieve optimal filtering [21]. It designs a factorgraph-based message passing algorithm, which transforms the global function into a form of local function multiplication. In this manner, it integrates the effective information of each local node and lays the foundation for approximating posterior noise distributions. Moreover, note that the recursive structure of OBKF is also the same as that of the classical Kalman filter. However, the filtering process of OBKF is quite complicated. It needs to pass messages from the beginning when calculating posterior parameters at each moment. When the dimension of the matrix or observation sequence is large, the time cost of OBKF is catastrophic. In addition, the maximum a posteriori criterion and variational inference are common methods for estimating the unknown noise covariance matrix [22][23][24][25][26]. In terms of performance, although these methods outperform classical filtering algorithms, there is still room for improvement. For example, Refs. [22,23] only solves the problem of the unknown observation noise covariance matrix while ignoring process noise. Coincidentally, Ref. [24] used the variational Bayesian method to estimate the system state and observation noise while simultaneously using the maximum a posteriori method to estimate process noise. By selecting the inverse Wishart distribution as the conjugate prior for the covariance matrix of a Gaussian distribution with known mean, Refs. [25,26] jointly estimated the state vector, prediction error and observation noise covariance matrix under the framework of variational inference. It is undeniable that these methods increased the complexity of unknown noise covariance matrix estimation. Therefore, the aim of our study is to improve the real-time performance of the algorithm as much as possible under the premise of ensuring performance.
Thanks to the pioneering work of Bode, Shannon, etc., the concept of innovation process is widely used in the derivation of Kalman recursive equations [27][28][29]. In this paper, the recursive equations of KFPNS are derived based on Bayesian theory. Compared with the recursive equations of a classical Kalman filter, it only needs to calculate the posterior noise statistics, which are obtained by variational distribution. The variational distribution is an approximate distribution of the real posterior noise distribution. Firstly, we joined the observation sequence and the prior distribution of uncertain noise parameters based on the Markov hypothesis, and then we used the O-BBVI method to approximate real posterior noise distributions. Herein, we assume that the state-space model is Gaussian. The posterior noise statistics are the posterior expectation of the second-order noise statistics and are also one of the parameters in the process of approximation. In other words, we extend the IBRKF framework to an a posteriori version based on an observation sequence and design a new approach to solve the problem of uncertain noise parameters.
Our KFPNS is universal. Any model in which the noise second-order statistics can be assumed to be a certain probability distribution is suitable for this framework. However, note that under the assumption of Gaussian noise, the proposed method can obtain better estimation accuracy, while non-Gaussian noise will produce larger errors. Moreover, the run time of KFPNS is a difficult problem when the dimension of observation sequence is large, because the time cost in computing the posterior noise statistics mainly depends on the number of samples and the length of observation sequence. Fortunately, if the results of multiple calculations of the posterior noise statistics change little, we can consider skipping this step.

Materials and Methods
Let us first introduce some notations to be used in this paper. Lowercase bold letters represent vectors and uppercase bold letters represent matrices, respectively. We use x k to represent the value of x at time k. E[·] is used to denote the expectation of the random vector. A T stands for a transpose operator. N (x; µ, Σ) represents a multivariate Gaussian distribution with mean µ and covariance matrix Σ.

Kalman Filter
The authors of [3] proposed a discrete time-domain filtering scheme in the linear state-space model, which is optimal for all linear system models. The state-space model usually consists of the following equations.
Among them, x k and y k are the n × 1 dimensional state vector and the m × 1 dimensional observation vector, respectively. Both u k and v k are zero-mean random vectors, called process noise and observation noise, and their sizes are s × 1 and m × 1, respectively. Φ k is the state transition matrix of n × n. Γ k represents the process noise transition matrix of n × s. H k is an m × n matrix called the observation transition matrix. The statistics of the above vector are as follows: where 0 m×n is a zero-valued matrix of size m × n. The purpose of Kalman filtering is to obtain the least square estimatex k of the state x k from observation y l (l ≤ k − 1).

Introduction to Bayesian Robust Filters
In engineering, the Bayesian criterion and minimax criterion (instead of the Bayesian method when there is no prior knowledge) are usually used to design robust filters [30][31][32]. The meaning of a Bayesian criterion is as follows: The average cost is the least. In the context of state estimation, the average cost is generally reflected by the mean-square error (MSE); in other words, we have the following. According to the Bayesian criterion, we minimize the cost function to find an optimal filter: where Ψ is the possible class of filters. The filter that satisfies the above formula is the minimum mean square error filter. When we cannot obtain all the knowledge of the model, we assume that the model is controlled by an unknown parameter θ = [θ 1 , . . . , θ l ], θ ∈ Θ, where Θ is the set of all possible parameters, called the uncertain class. Fortunately, these parameters can be obtained by solving the corresponding posterior distribution; thus, the Bayesian robust filter can be expressed as follows: where the expectation is taken relative to the posterior distribution p( θ| Y k ), and C θ (·) is the cost function relative to the unknown parameter θ. Dehghannasiri proposed that the average cost of the filter relative to the uncertain class is equal to the error of applying the filter to the effective information [20]. Therefore, the process of designing a robust filter for an uncertain model can be transformed into designing a Bayesian robust filter using posterior effective information.

KFPNS Framework
Now assume that the second-order statistical knowledge of process noise and observation noise is unknown and that it is determined by the unknown parameters α 1 and α 2 . Here, the covariance matrices of these two noises are as follows.
If α 1 and α 2 are statistically independent, then p(α) = p(α 1 )p(α 2 ) stands for the prior distribution. The state-space model can be parameterized as follows.
Although the IBRKF is optimal in the sense of prior, there is still a certain deviation between the prior expectation of noise and the actual noise, which makes it impossible to achieve real optimal filtering. Therefore, this paper focuses on the influence of posterior noise distribution on the cost function. With Y k = {y 0 , . . . , y k } as the observation sequence and analysis based on (6), the linear filtering function that satisfies the definition of a Bayesian robust filter is as follows: such that the following is the case: where is the vector space of all n × m matrix-valued functions, and Ω k,l ∈ is a mapping Ω k,l : N × N → R n×m such that ∑ ∞ k=1 ∑ ∞ l=1 Ω k,l 2 < ∞, · 2 being the L 2 norm. The derivation of KFPNS recursive equations depends on the following theorem, definition and lemma. Our equations are only a posterior version relative to [20], and the proof of related theories are similar to that in [20]. Theorem 1. (Bayesian Orthogonality Principle) If the weight function Ω Θ k,l of a linear filter satisfies (12), then thex α k obtained in (11) is called the optimal Bayesian least-squares estimation if and only if the following is the case.
Definition 1. (Bayesian innovation process) According to the state-space model described in (9) and (10), ifx α k is the least-square estimate of x α 1 k , then the stochastic process of the following: is a zero-mean process and ∀l, l ≤ k − 1; the following is obtained.
Lemma 1. (Bayesian information equivalence) The Bayesian least-squares estimation obtained based on z α k is equivalent to the Bayesian least-squares estimation calculated based on y α k ; in other words, we have the following.
Then, (11) can be rewritten as follows.
By substituting it into (16), after moving the terms on the left and right sides of the equation, we obtain the following.
In this case, the linear function of Bayesian robust filter based on posterior noise statistics is as follows.
Thus, the state update equation is as follows: where the following is the case.
Next, let x α k = x α 1 k −x α k denote the estimation error at time k. After some mathematical operations, x α k+1 can be expanded into the following. We can find the update equation of the estimation error covariance matrix from this equation, which is described as follows.
This completes the construction of KFPNS. Note that in (21) and (23), the observation sequences corresponding to the estimation error covariance matrix are different, which are E α P x,α k Y k−1 and E α P x,α k Y k , respectively, while the one generated by the last iteration of the recursive equation should be E α P x,α k Y k−1 . For this problem, we can iterate from the beginning; that is, first calculate Y k and repeat the process until we obtain E α P x,α k Y k . However, this strategy requires reiteration from the origin after updating posterior noise statistics, which makes the algorithm heavily loaded. In order to improve the efficiency of the algorithm, we assume that E α P x,α Table 1 reports the difference between these two recursive equations. KFPNS replaces K k , P x k , Q and R in the classical Kalman filter with , but their recursive structures are the same. Figure 1 is the overall framework of KFPNS. The key is that KFPNS updates the state based on posterior noise statistics. Therefore, KFPNS is an improved version of the classical Kalman filter concerning posterior noise distribution.   Figure 1. KFPNS framework.

The Calculation Method of Posterior Noise Statistics
We need to additionally calculate fixed preset values, and the core process of obtaining these two conditional expectations lies in the posterior distribution ( ) , which is unknown. Therefore, the goal is to solve the probability density function of the unknown parameter α under the condition by replacing the fixed preset values, and the core process of obtaining these two conditional expectations lies in the posterior distribution p( α| Y k ), which is unknown. Therefore, the goal is to solve the probability density function of the unknown parameter α under the condition of given , it is difficult to directly acquire a closed-form solution by relying on the prior distribution. However, a known simple distribution can be used to approximate a complex distribution, and by limiting the type of approximate distribution, a locally optimal approximate posterior distribution can be achieved. Unless otherwise specified, the distributions presented in this paper are Gaussian.
Consider a simple variational distribution, which belongs to the following exponential family: where g(α) is the base measure, e is the natural parameter, t(α) is the sufficient statistics and A(e) is the log-normalizer. Based on the considerations of this paper, we believe that e represents the expectations, t(α) can completely characterize all quantities of a probability distribution and t(α) = α. Take a simple multivariate Gaussian distribution N (α; e, Λ) as an example (Λ represents the identity matrix) and transform it into an exponential family form of the following: thus, the following is obtained.
In addition, the exponential family distribution also has the following property.
By using this property, the mutual transformation between expectation calculation and derivation calculation can be completed in variational inference. Therefore, most applications of variational inference use exponential family functions. More importantly, based on this simple property, we use the strategy of control variates to reduce the variance of the Monte Carlo gradient. At the same time, in this paper, we hope to use q(α; e) to approximate p( α| y k ), and the special form of exponential family distributions renders the approximation process traceable. More precisely, we consider α as a Gaussian distribution, which belongs to the exponential family, and its natural parameter e represents expectations. Therefore, the process of finding the optimal natural parameters that makes q(α; e) the closest to p( α| y k ) involves calculating posterior noise statistics. Now, suppose q(α; e) can be decomposed into the following.
Based on the idea of variational inference, we try to minimize Kullback-Leibler divergence D KL [q(α; e) p( α| Y k )] so that q(α; e) approaches p( α| Y k ). This idea still cannot be realized directly, because D KL contains an unknown distribution p( α| Y k ). Fortunately, af- ter some logarithmic operations, minimizing D KL is equivalent to maximizing the evidence lower bound objective (ELBO).
The principle is shown in Figure 2. Then, ELBO takes the derivative with respect to parameter e: where the following is the case.
Sensors 2021, 21, x FOR PEER REVIEW 9 of 23 As a result, computing ∇ e  becomes a process of looking for an expectation, which can be estimated by the Monte Carlo approach. Then, the gradient descent method is used to make the parameters converge, which makes ( ) ; q α e approximate the real posterior distribution. q α e are concentrated near the peak point, and there are fewer samples in the tail. This causes the Monte Carlo gradient to often have high variance, which means that the estimated gradient may be very different from the true value, resulting in optimization times that are too long.

One goes with the other
In view of this defect, O-BBVI adds a good proposal distribution that matches the variational problem [33]. ( ) e r ; , α τ is an overdispersed version of ( ) ; q α e . Its tail is heavier, which increases the probability that the value of the tail is sampled. In addition, O-BBVI also uses two strategies, control variates and Rao-Blackwellization in order to reduce the variance of the Monte Carlo gradient of the original BBVI. Let us begin with this overdispersed distribution: where τ ,τ is the dispersion coefficient of the overdispersed distribution. O-BBVI believes that for each component of the gradient, the proposal distribution that minimizes the variance of the estimator is not ( ) ; q α e but the following.
That is, the optimal proposal distribution pushes the probability density to the tail of As a result, computing ∇ e L becomes a process of looking for an expectation, which can be estimated by the Monte Carlo approach. Then, the gradient descent method is used to make the parameters converge, which makes q(α; e) approximate the real posterior distribution.
The aforementioned black-box variational inference (BBVI) only relies on samples from q(α; e) to calculate the Monte Carlo gradient. However, the samples of the variational distribution q(α; e) are concentrated near the peak point, and there are fewer samples in the tail. This causes the Monte Carlo gradient to often have high variance, which means that the estimated gradient may be very different from the true value, resulting in optimization times that are too long.
In view of this defect, O-BBVI adds a good proposal distribution that matches the variational problem [33]. r(α; e, τ) is an overdispersed version of q(α; e). Its tail is heavier, which increases the probability that the value of the tail is sampled. In addition, O-BBVI also uses two strategies, control variates and Rao-Blackwellization in order to reduce the variance of the Monte Carlo gradient of the original BBVI. Finally, O-BBVI constructed a new Monte Carlo gradient equation, and based on the unconventional usage of importance sampling, it samples from r(α; e, τ) and q(α; e), respectively, in order to estimate the gradient.
Let us begin with this overdispersed distribution: where τ = [τ 1 , τ 2 ] is the dispersion coefficient of the overdispersed distribution. For a fixed τ, r(α; e, τ) and q(α; e) belong to the same exponential family. Moreover,r(α; e, τ) allocates higher mass to the tails of q(α; e), which is why we introduced it. O-BBVI believes that for each component of the gradient, the proposal distribution that minimizes the variance of the estimator is not q(α; e) but the following.
That is, the optimal proposal distribution pushes the probability density to the tail of q(α; e);hence, we have reason to believe that r(α; e, τ) is closer to the optimal proposal distribution than the variational distribution q(α; e) due to the fact that, in sampling, it provides more opportunities for α, which exists in the tail of variational distribution but has higher posterior probability. Thus, (30) can be rewritten as follows: where α (m) represents the m − th sample in the sample set of r(α; e, τ).
To solve the problem of importance sampling failure caused by high-dimensional hidden variables, we use the correlation theory of mean-field, i.e., keep the rest of the components unchanged when solving the n − th component. That is to say, (34) can be expressed as follows:∇ where w(α n ) = q(α n ; e n )/r(α n ; e n , τ n ), and α −n means variables other than α n . e −n is the same. Inspired by the foregoing discussion, we take a simple sample α 0 −n from q(α −n ; e −n ) to estimate E q(α −n ;e −n ) [·] and extract M samples from r(α n ; e n , τ n ) to estimate E r(α n ;e n ,τ n ) [·]. At the same time, a score function is defined as follows.
It is used to calculate the following.
This formula is the embodiment of Rao-Blackwellization. Note that in order to reduce the complexity of the algorithm, we utilize y k instead of Y k in Equation (37). Moreover, p n is acquired by the Markov assumption. Herein, we assume that x k is a Gaussian distribution withx α k as the mean and P x,α k as the covariance. Given the state x k , the corresponding observation y k obeys the Gaussian distribution N (y k ; H k x k , R α 2 ). Thus, the likelihood function of the unknown parameter α can be approximated as follows.
Consequently, we obtain the following.
Finally, the gradient of ELBO relative to each component e n is as follows: where the following is the case.
Note that b n marks the application of the strategy of control variates. Next, we use the AdaGrad algorithm to make e converge to the optimal value: where λ t is the learning rate, and its setting is similar to that in [34]. • stands for the Hadamard product. Note that the expectations contained in natural parameter e are the posterior noise statistics E α [ Q α 1 |y k ] and E α [ R α 2 |y k ] of the k-th sampling period and not Here, they are only needed for computing the average value from 0 to the k-th sampling period. Furthermore, we want to automatically adjust the dispersion coefficient τ n during the iteration. We first calculate the derivative of (35) with respect to τ n .

−
∂V ∇ e n L ∂τ n Since f n (α) and w(α n ) are calculated in the above process, this process will not consume much time. Next, the Monte Carlo estimation method is used to calculate the solution of the equation and then the gradient descent method is used to make t converge: where s n is the step size, which needs to be set in advance. Note that if e n represents a vector, only the derivative part needs to be adjusted to the sum of the derivatives of all components of e n , and Equation (46) can still be used. Finally, if the posterior noise statistics converge to a certain value or changes little after multiple iterations, the subsequent state estimation can skip the step of finding the posterior noise parameters in order to save the algorithm overhead. The steps for implementation of the proposed KFPNS framework are summarized in Algorithms 1 and 2.  1: function O-BBVI (y k , p(α)) 2: f ( y k |α) ← N (y k ; H k x k , R α 2 ) 3: While the algorithm has not converged do 4: Draw α (0) ∼ q(α; e) 5: for n = 1 to 2 do 6: Draw M samples α (m) n ∼ r(α n ; e n , τ n ) 7: n ; e n /r α (m) n ; e n , τ n 10: n ; e n 11: n ; e n 12:

Results
In this section, we analyze the performance of the proposed algorithm and other Kalman filtering methods. The first Kalman filtering method to participate in the comparison is the IBR approach. The second one is the model-specific Kalman filtering approach, which is a classical Kalman filter designed with respect to real noise parameters. The third is the minimax method, which performs best in the worst case when the noise parameters are uncertain. We set the parameters of the minimax method to α max1 = 4, α max2 = 4. The last one is OBKF, which designs a message passing algorithm based on factor graphs to calculate the likelihood function formulaically and then employs the Metropolis Hastings MCMC method to find posterior effective noise statistics.

Simulation
For the tracking scenario in a two-dimensional space, we assume that the state of the target is x k = p x p y v x v y T , and the subscripts x, y represent the x and y dimensions, respectively. Assuming that the target's motion model is a constant velocity model and that there is only a weak random disturbance, the radar obtains the measurements of the target at every t seconds. We construct the state-space model of the measured target based on (9) and (10), where the matrices' representations are as follows. The covariance matrices of the process and observation noise are expressed as follows.
Let t = 1 and the target's motion state is initialized to E[ ([1000 1000 100 100 ]), where diag(a) represents a diagonal matrix with elements in vector a as diagonal elements.
In the first simulation, it is assumed that both α 1 and α 2 are unknown parameters that are uniformly distributed in the interval [1,4] and [0. 5,4], respectively. In order to analyze the average performance of five Kalman filtering methods, we generate a large amount of data based on the model described above and calculate the average MSE. The results are shown in Figure 3. This set of data is generated depending on the prior distribution of noise, namely, we randomly generated 200 pairs [α 1 , α 2 ], and each combination corresponds to 10 different observation sequences. Among the five filters, the average MSE of the model-specific Kalman filter is the smallest because it uses real noise parameters. Instead, the minimax approach only considers the worst case and cannot effectively solve the problem of unknown noise second-order statistics. Figure 3 also suggests that the average MSEs of OBKF and IBRKF are almost the same at the initial stage of filtering, and both greatly outperform KFPNS since the performance of KFPNS depends on the length of the observation sequence used to adjust filter parameters and the prior distribution of noise. If the number of sampling period k is small, the filter parameters may be adjusted incorrectly after each iteration, resulting in large state estimation errors for certain subsequent sampling periods. However, in the long run, the increase in analyzable data will return the uncertain parameters close to the true value. At this time, the performance of the KFPNS is even similar to that of OBKF. Both KFPNS and OBKF need to use a certain amount of observed data to estimate unknown noise parameters in order to achieve better estimation results, which causes them to converge more slowly than other algorithms, especially for KFPNS. The good news is that as the number of observations continues to increase, their performance is closer to the modelspecific Kalman filter.  We set up two pairs of specific noise parameters for the second set of simulations. In order to discuss the average performance of various filters, we provide 200 observation sequences for each specific combination in order to weaken the interference of abnormal results. Then, we calculated the average MSE of 200 sets of observations in order to visually compare the performance of various filters. The first column of Figure 4 is a performance analysis based on the specific model α α = = der the first specific model. Similarly, (d) and (f) correspond to the second specific noise We set up two pairs of specific noise parameters for the second set of simulations. In order to discuss the average performance of various filters, we provide 200 observation sequences for each specific combination in order to weaken the interference of abnormal results. Then, we calculated the average MSE of 200 sets of observations in order to visually compare the performance of various filters. The first column of Figure 4 is a performance analysis based on the specific model α 1 = 3, α 2 = 1.5. The true value of the specific noise pair in the second column is α 1 = 3.5, α 2 = 3.5. The variation of average MSE of each filter in Figure 4a is roughly consistent with the previous simulation. Figure 4c,e, respectively, show the variation of the average posterior mean E[α 1 |Y k ] and E[α 2 |Y k ] under the first specific model. Similarly, (d) and (f) correspond to the second specific noise combination. The vertical columnar line represents the variance, and the size of variance is represented by the length of the line. The figures in the second and third rows show that with the increase in observation data, E[α 1 |Y k ] and E[α 2 |Y k ], moves closer to the true value from the preset empirical values of 2.2 and 2 and finally stabilizes near the true value. At the same time, the variance of the average posterior also means changes from small to large and then to small and eventually stabilizes. This indicates that the posterior distribution of noise estimated by KFPNS is more and more in line with the true distribution, and its value is concentrated around the true value. We also observe that in Figure 4b, the minimax method outperforms IBRKF and even outperforms OBKF in the beginning. Note that under the condition of prior knowledge, the average performance of IBRKF exceeds the minimax Kalman filter. However, IBRKF performs worse with respect to some specific models; for example, when the parameters of the minimax method are closer to the true values than the prior mean of IBRKF. Fortunately, both the OBKF and KFPNS use their approaches to approximate true posterior noise distribution, and as we observe more data, posterior second-order statistics will tend towards the true model. This allows them to break the limitations of most noise models and to deal with models that IBRKF is unable to handle.
Next, we discuss the performance changes of different robust Kalman filters when the mass of the prior probability is not uniform. Suppose α 1 is uniformly distributed in the interval [1,2] and fixed. On the other hand, let a Beta distribution B(α r , β r ) in the interval [0.25, 4] govern α 2 . The mean and variance of the Beta distribution are α r /α r + β r and α r β r /(α r + β r ) 2 (α r + β r + 1), respectively, and α r + β r = 1. On the basis of ensuring that the mean of this prior distribution remains unchanged, a new prior distribution with different mass can be obtained by appropriately adjusting its parameters. We consider two pairs of specific parameters, α r = 0.1, β r = 0.9 and α r = 0.01, β r = 0.09. As a result, the reduction in parameters causes the variance to become larger, and the prior distribution will be more relaxed. Figure 5 shows their average MSEs. Even if the Beta distribution is changed to make the probability of values in prior distribution different, the average MSEs of the minimax method and IBRKF are almost unchanged. The reason is that for all prior distributions, the prior mean of the IBR method is always the same, while the minimax method does not consider prior distribution at all. They cannot effectively cope with sudden changes in the model. In this case, they are equivalent to the classical Kalman filter. Compared with the above two robust filtering strategies, although the mass distribution of the prior has changed, since KFPNS incorporates observations into prior knowledge in order to estimate posterior noise distribution, its average MSE is related to the tightness of the prior distribution. In other words, the tighter the prior distribution, the worse the performance of KFPNS, while the opposite is true when the prior distribution is relaxed. Importantly, the average MSE of KFPNS can always be closer to that of the model-specific Kalman filter.
In the final simulation, we analyze the performance of each robust filtering strategy when the prior knowledge is inaccurate. In practical applications, it is often impossible to understand the underlying true model, and the mastery of prior knowledge is not comprehensive. Therefore, it is crucial for a robust filter not to rely too much on the prior distribution. Our previous simulations are based on the assumption that the prior distribution of noise is accurate and available. Here, we only consider the case where the variance of the observation noise is unknown and the range of its prior distribution is included in the interval of the true distribution. We still have α 1 uniformly distributed over [1,2], while assuming that α 2 is uniformly distributed over [3,5], which is wrong relative to the real interval. Similarly, we generated 20 different true values based on the correct prior interval, and each true value contained 20 different sets of observations. The evaluation index of the robustness of each filter is still the average MSE. Figure 6a,b are the average MSEs of various Kalman filters in the entire sampling period when the correct interval is [2,6] and [0. 5, 7.5]. When k is large, the two robust filters, OBKF and KFPNS still perform best in terms of average MSEs. It is worth noting that KFPNS outperforms OBKF for the first time in Figure 6b, but it is slightly inferior or equivalent to OBKF in the previous simulation. This is related to the computing method of posterior noise statistics. When OBKF employs the Metropolis Hastings MCMC method to select sample points, the prior distribution incorporated is wrong and lacks an adjustment strategy, which makes the posterior noise statistics unable to converge to the true value. On the contrary, KFPNS adds a weighting factor in the calculation process, making it not completely dependent on the inaccurate prior distribution. Therefore, KFPNS performs better when facing models with larger errors between prior knowledge and the underlying true model. Unfortunately, the performance of KFPNS may decline when the prior distribution provided is inaccurate, and the stronger the inaccuracy, the faster the performance decline.  Next, we discuss the performance changes of different robust Kalman filters when the mass of the prior probability is not uniform. Suppose α 1 is uniformly distributed in mass distribution of the prior has changed, since KFPNS incorporates observations into prior knowledge in order to estimate posterior noise distribution, its average MSE is related to the tightness of the prior distribution. In other words, the tighter the prior distribution, the worse the performance of KFPNS, while the opposite is true when the prior distribution is relaxed. Importantly, the average MSE of KFPNS can always be closer to that of the model-specific Kalman filter. In the final simulation, we analyze the performance of each robust filtering strategy when the prior knowledge is inaccurate. In practical applications, it is often impossible to understand the underlying true model, and the mastery of prior knowledge is not comprehensive. Therefore, it is crucial for a robust filter not to rely too much on the prior distribution. Our previous simulations are based on the assumption that the prior distribution of noise is accurate and available. Here, we only consider the case where the variance of the observation noise is unknown and the range of its prior distribution is included in the interval of the true distribution. We still have α 1 uniformly distributed over [1,2] , while assuming that α 2 is uniformly distributed over [3,5] , which is wrong relative to the real interval. Similarly, we generated 20 different true values based on the correct prior interval, and each true value contained 20 different sets of observations. The evaluation index of the robustness of each filter is still the average MSE. Figure 6a,b are the average MSEs of various Kalman filters in the entire sampling period when the correct interval is [2,6] and [0.5,7.5]. When k is large, the two robust filters, OBKF and KFPNS still perform best in terms of average MSEs. It is worth noting that KFPNS outperforms OBKF for the first time in Figure 6b, but it is slightly inferior or equivalent to OBKF in the previous simulation. This is related to the computing method of posterior noise statistics. When OBKF employs the Metropolis Hastings MCMC method to select sample points, the prior distribution incorporated is wrong and lacks an adjustment strategy, which makes the posterior noise statistics unable to converge to the true value. On the contrary, KFPNS adds a weighting factor in the calculation process, making it not completely dependent on the inaccurate prior distribution. Therefore, KFPNS performs better when facing models with larger errors between prior knowledge and the underlying true model. Unfortunately, the performance of KFPNS may decline when the prior distribution provided is inaccurate, and the stronger the inaccuracy, the faster the performance decline.

Experiment
We also processed the observed data of the real MIMO radar system. We conducted experiments in two specific scenarios. The experimental conditions are described as follows. We have delineated two fixed experimental areas, namely the indoor area of × 2 175 465 cm and the outdoor area of × 2 800 800 cm . In addition, we also calibrated some points in the two experimental areas and took these points as the ground truth of the target. Due to the limited scanning range of the MIMO radar, the position of the radar has a certain distance from the boundary of the experimental area, but we eliminated the influence of this distance in the subsequent processing. Table 2 reports the parameters of the radar used. We first sampled the environment of the two experimental areas without any targets and filtered out non-target points on this basis. During data acquisition, two people moved at a constant velocity in these two areas according to two predetermined trajectories. Figure 7 describes our experimental scenarios. Figure 7b also shows the MIMO radar board we used and the host computer interface during data acquisition. After collecting measurements, we clustered the target points and took the cluster center as the observed position of the target. Finally, four different Kalman filtering methods were used to process the data and to analyze the error.

Experiment
We also processed the observed data of the real MIMO radar system. We conducted experiments in two specific scenarios. The experimental conditions are described as follows. We have delineated two fixed experimental areas, namely the indoor area of 175 × 465 cm 2 and the outdoor area of 800 × 800 cm 2 . In addition, we also calibrated some points in the two experimental areas and took these points as the ground truth of the target. Due to the limited scanning range of the MIMO radar, the position of the radar has a certain distance from the boundary of the experimental area, but we eliminated the influence of this distance in the subsequent processing. Table 2 reports the parameters of the radar used. We first sampled the environment of the two experimental areas without any targets and filtered out non-target points on this basis. During data acquisition, two people moved at a constant velocity in these two areas according to two predetermined trajectories. Figure 7 describes our experimental scenarios. Figure 7b also shows the MIMO radar board we used and the host computer interface during data acquisition. After collecting measurements, we clustered the target points and took the cluster center as the observed position of the target. Finally, four different Kalman filtering methods were used to process the data and to analyze the error.   The filters we use are the classical Kalman filter (CKF), IBRKF, OBKF and KFPNS. The filtering results under different experimental scenarios are shown in Figures 8 and 9. Note that the blue point represents the estimated position of target A, and the red point represents the estimated position of target B. When filtering these position data, we do not fully grasp the underlying real model, which means that the prior knowledge we provided for these algorithms may not be accurate enough. Nevertheless, it can be observed that the estimated trajectories of the two Kalman filters designed based on posterior information are significantly smoother than the other methods. The inaccuracy of prior knowledge of noise is the main reason for the poor trajectory estimation of CKF and IB-RKF. Although IBRKF introduces the concept of effective statistics, which weakens this effect to a certain extent, when the inaccuracy of prior knowledge is strong, its performance is only slightly improved compared with the classical algorithm. As expected, both KFPNS and OBKF have the characteristics of computing the posterior value and making it approximate the real value through algorithm iteration, which makes them very robust; thus, their estimated trajectory is also the closest to the true trajectory.  Figures 8 and 9. Note that the blue point represents the estimated position of target A, and the red point represents the estimated position of target B. When filtering these position data, we do not fully grasp the underlying real model, which means that the prior knowledge we provided for these algorithms may not be accurate enough. Nevertheless, it can be observed that the estimated trajectories of the two Kalman filters designed based on posterior information are significantly smoother than the other methods. The inaccuracy of prior knowledge of noise is the main reason for the poor trajectory estimation of CKF and IBRKF. Although IBRKF introduces the concept of effective statistics, which weakens this effect to a certain extent, when the inaccuracy of prior knowledge is strong, its performance is only slightly improved compared with the classical algorithm. As expected, both KFPNS and OBKF have the characteristics of computing the posterior value and making it approximate the real value through algorithm iteration, which makes them very robust; thus, their estimated trajectory is also the closest to the true trajectory.  In addition, the position error (PE) between the estimated position and the true position will also be used as an evaluation index for filter performance. In Figure 10, although the PE of the new filter sometimes exceeds IBRKF and the classical Kalman filter, overall, most of the PE curve of KFPNS is in a lower position. Moreover, the PE of the proposed algorithm is more concentrated than compared to these two algorithms. However, in Figure 10a, it cannot be ignored that KFPNS performs poorly in the early stages of filtering. The reason for this phenomenon may be that the selection of the initial value is quite dif- In addition, the position error (PE) between the estimated position and the true position will also be used as an evaluation index for filter performance. In Figure 10, although the PE of the new filter sometimes exceeds IBRKF and the classical Kalman filter, overall, most of the PE curve of KFPNS is in a lower position. Moreover, the PE of the proposed algorithm is more concentrated than compared to these two algorithms. However, in Figure 10a, it cannot be ignored that KFPNS performs poorly in the early stages of filtering. The reason for this phenomenon may be that the selection of the initial value is quite different from the real value, and there are fewer observed data available for analysis. Fortunately, with the input of more observed data, its estimation error gradually approaches OBKF.
We also compute the root-mean-square error (RMSE) of the four algorithms, and the results are shown in Figure 11. In Figure 11a, when the number of sampling period is small, the RMSE of KFPNS takes the maximum value. After 20 sampling periods, the RMSE of KFPNS tends to be stable and is significantly different from that of the other two Kalman filters with poor robustness. Figure 11b also suggests that the gap between KFPNS and OBKF is further narrowed. The change of RMSE over time proves that both KFPNS and OBKF have strong robustness. Table 3 reports the mean value of RMSE (MMSE) for various algorithms, which can be used to intuitively compare the average performance of various filters in different scenes. In the indoor scene, for target B, the average performance of KFPNS is 29.62% and 41.56% higher than that of IBRKF and the classical Kalman filter separately. Despite the average performance of KFPNS decreases by 18.1% compared with OBKF, they differ by only 0.9434 cm. ferent from the real value, and there are fewer observed data available for analysis. Fortunately, with the input of more observed data, its estimation error gradually approaches OBKF.
(a) (b) We also compute the root-mean-square error (RMSE) of the four algorithms, and the results are shown in Figure 11. In Figure 11a, when the number of sampling period is small, the RMSE of KFPNS takes the maximum value. After 20 sampling periods, the RMSE of KFPNS tends to be stable and is significantly different from that of the other two Kalman filters with poor robustness. Figure 11b also suggests that the gap between KFPNS and OBKF is further narrowed. The change of RMSE over time proves that both KFPNS and OBKF have strong robustness. Table 3 reports the mean value of RMSE (MMSE) for various algorithms, which can be used to intuitively compare the average performance of various filters in different scenes. In the indoor scene, for target B, the average performance of KFPNS is 29.62% and 41.56% higher than that of IBRKF and the classical Kalman filter separately. Despite the average performance of KFPNS decreases by 18.1% compared with OBKF, they differ by only 0.9434 cm.   We also compute the root-mean-square error (RMSE) of the four algorithms, and the results are shown in Figure 11. In Figure 11a, when the number of sampling period is small, the RMSE of KFPNS takes the maximum value. After 20 sampling periods, the RMSE of KFPNS tends to be stable and is significantly different from that of the other two Kalman filters with poor robustness. Figure 11b also suggests that the gap between KFPNS and OBKF is further narrowed. The change of RMSE over time proves that both KFPNS and OBKF have strong robustness. Table 3 reports the mean value of RMSE (MMSE) for various algorithms, which can be used to intuitively compare the average performance of various filters in different scenes. In the indoor scene, for target B, the average performance of KFPNS is 29.62% and 41.56% higher than that of IBRKF and the classical Kalman filter separately. Despite the average performance of KFPNS decreases by 18.1% compared with OBKF, they differ by only 0.9434 cm.    Figure 12 further analyzes the cumulative distribution function (CDF) of the RMSE of each filtering algorithm in the indoor scene. In Figure 12, 90% of the RMSE of KFPNS is less than 6.739 cm, which improves by 31% over IBRKF (9.767 cm) and 42.35% over CKF (11.69 cm) but degrades by 23.11% over OBKF (5.474 cm). In a nutshell, compared with the IBR robust filtering strategy, KNPNS and OBKF are more suitable for models where the noise is unknown and the real prior cannot be fully grasped. Figure 12 further analyzes the cumulative distribution function (CDF) of the RMSE of each filtering algorithm in the indoor scene. In Figure 12, 90% of the RMSE of KFPNS is less than 6.739 cm, which improves by 31% over IBRKF (9.767 cm) and 42.35% over CKF (11.69 cm) but degrades by 23.11% over OBKF (5.474 cm). In a nutshell, compared with the IBR robust filtering strategy, KNPNS and OBKF are more suitable for models where the noise is unknown and the real prior cannot be fully grasped.

Time Cost Analysis
Previous experiments have proved that OBKF and KFPNS have considerable robustness. Now we study the difference of their complexity. Since the recursive structure of these two robust Kalman filters is consistent with the classical algorithm, their computational burden is mainly concentrated on the calculation of posterior noise statistics. It is worth mentioning that OBKF uses a factor-graph-based method to convert the problem of finding the likelihood function into a matrix operation, which increases the complexity of the algorithm. Moreover, if the posterior noise statistics at thek th observation need to be computed, the likelihood function of each MCMC sample must be iterated from i = 0 until i = k -1 . Therefore, its computational complexity depends on the dimension of the matrix, the number of samples in the posterior distribution and the number of sampling periods. In contrast, the proposed algorithm rarely involves matrix operations, and there is no need to compute the likelihood function from the beginning. At the same time, the variational distribution is fixedly decomposed into two factors. These reasons all reduce the calculation burden of KFPNS to a considerable extent. Figure 13 shows the average run times of the OBKF and KFPNS relative to the sampling period. Note that both algorithms need to generate 10,000 samples to approximate the true posterior noise distribution. In addition, the running platform of all simulations and experiments in this paper is MATLAB R2018b with Intel(R) Core(TM) i5-8257U CPU at 1.40 GHz as well as 8-

Time Cost Analysis
Previous experiments have proved that OBKF and KFPNS have considerable robustness. Now we study the difference of their complexity. Since the recursive structure of these two robust Kalman filters is consistent with the classical algorithm, their computational burden is mainly concentrated on the calculation of posterior noise statistics. It is worth mentioning that OBKF uses a factor-graph-based method to convert the problem of finding the likelihood function into a matrix operation, which increases the complexity of the algorithm. Moreover, if the posterior noise statistics at the k-th observation need to be computed, the likelihood function of each MCMC sample must be iterated from i = 0 until i = k − 1. Therefore, its computational complexity depends on the dimension of the matrix, the number of samples in the posterior distribution and the number of sampling periods. In contrast, the proposed algorithm rarely involves matrix operations, and there is no need to compute the likelihood function from the beginning. At the same time, the variational distribution is fixedly decomposed into two factors. These reasons all reduce the calculation burden of KFPNS to a considerable extent. Figure 13 shows the average run times of the OBKF and KFPNS relative to the sampling period. Note that both algorithms need to generate 10,000 samples to approximate the true posterior noise distribution. In addition, the running platform of all simulations and experiments in this paper is MATLAB R2018b with Intel(R) Core(TM) i5-8257U CPU at 1.40 GHz as well as 8-GB RAM. Figure 13 illustrates that under the same number of samples and the same length of observation sequence, OBKF needs to consume more time. Consequently, KFPNS performs better in real time and is more suitable for applications in actual projects.  Figure 13 illustrates that under the same number of samples and the same length of observation sequence, OBKF needs to consume more time. Consequently, KFPNS performs better in real time and is more suitable for applications in actual projects.

Discussion
In this paper, the IBR framework is extended to a posteriori version by using posterior effective noise information. More exactly, a Bayesian robust Kalman filter based on posterior noise statistics is designed by adding the calculation method of posterior noise

Discussion
In this paper, the IBR framework is extended to a posteriori version by using posterior effective noise information. More exactly, a Bayesian robust Kalman filter based on posterior noise statistics is designed by adding the calculation method of posterior noise statistics. Our simulation results also suggest that when dealing with models with uncertain noise parameters, the performance of the designed filter is second only to OBKF. Therefore, it can be concluded that the proposed calculation method of posterior noise statistics can make the uncertain noise parameters converge near the true value. When used in an actual MIMO radar system, we found that our algorithm has a good performance in robustness and real-time situations compared with current methods. However, it is worth noting that since the run time of the KFPNS is proportional to the length of the observation sequence, when the tracking period is longer, it will consume more time than compared to the classic algorithm. Furthermore, our algorithm is run under the assumption of a Gaussian model. In other words, if the noise is non-Gaussian, the performance of our algorithm will be greatly reduced. In future investigations, we will propose a more effective and faster method for estimating posterior noise distribution and make more classes of noise suitable for this framework.