Random Finite Set Based Parameter Estimation Algorithm for Identifying Stochastic Systems

Parameter estimation is one of the key technologies for system identification. The Bayesian parameter estimation algorithms are very important for identifying stochastic systems. In this paper, a random finite set based algorithm is proposed to overcome the disadvantages of the existing Bayesian parameter estimation algorithms. It can estimate the unknown parameters of the stochastic system which consists of a varying number of constituent elements by using the measurements disturbed by false detections, missed detections and noises. The models used for parameter estimation are constructed by using random finite set. Based on the proposed system model and measurement model, the key principles and formula derivation of the proposed algorithm are detailed. Then, the implementation of the algorithm is presented by using sequential Monte Carlo based Probability Hypothesis Density (PHD) filter and simulated tempering based importance sampling. Finally, the experiments of systematic errors estimation of multiple sensors are provided to prove the main advantages of the proposed algorithm. The sensitivity analysis is carried out to further study the mechanism of the algorithm. The experimental results verify the superiority of the proposed algorithm.


Introduction
In modern society, mathematical models are playing increasingly important roles in engineering systems when used for simulating, predicting, managing, and deciding [1]. Since system identification is essential for modeling, it is the potential and vital technology to meet the challenges of model based scientific research. System identification has been applied to many areas, including industrial systems, agricultural systems, biological systems, medical systems and economic systems, and even social systems [2][3][4][5].
Generally speaking, the existing system identification algorithms can be classified into non-parametric model identification algorithms and parametric model identification algorithms [6]. Parametric models are more powerful, because they can represent the studied system with fewer parameters and have more statistical power. The parametric model identification algorithms usually assume some model structure and estimate the associated parameters by reducing the error criterion function of the model and practical system. Therefore, parameter estimation is the core component of these algorithms.

Parameter Estimation
Parameter estimation refers to identifying, estimating and calibrating continuous or discrete model parameters [7]. The parameters to be estimated could be the time-invariant physical quantities, or the coefficients and other constants in a functional relationship that describe the practical physical system [8]. They can be represented by scalars, vectors, or matrixes. In this paper, we use an n elements vector θ ∈ Θ to represent the unknown model parameters.
The stochastic system is the system whose input, output and interference have random factors, or the system itself has some uncertainty. Only by parameter estimation can we get the model parameters for the stochastic system. Parameter estimation of the stochastic system is the basis of state filtering, stochastic control, fault diagnosis, forecasting and many other fields [9]. This paper deals with the identification problem of stochastic systems whose model structures are known and parameters to be identified. In this case, the identification of stochastic system minimizes the estimating of unknown parameters associated with the known models by using prior information and observations from the practical system.
The Sequential Monte Carlo (SMC) method is a powerful tool to perform optimal state and parameter estimating by using the nonlinear and non-Gaussian state space models. Since SMC based parameter estimation algorithms employ the non-parametric statistic inference based on Bayesian inference theory, and have no assumptions of the distribution and linearity of the studied system, they are widely used for parameter estimation [10][11][12][13][14].
According to [11], the SMC based parameter estimation algorithms can be broadly formulated into Maximum Likelihood (ML) formulation and Bayesian formulation. ML formulation is mainly related to finding a point estimate of the unknown parameter which is the most likely to be associated with the observations. The ML estimate of the parameters is the parameters that minimize the negative logarithm of the likelihood function based on the received measurements. The estimation process is shown as the following equation: whereθ ML is the estimated parameter vector, z 1:T is the sequence of measurements, and θ is the unknown parameter vector to be estimated. ML method is only applicable to the estimation problem where a joint probability density function (PDF) can be assigned to the measurements [15]. In addition, ML estimate needs to use the gradient based search methods such as a damped Gauss-Newton method to actually compute estimated parameters [16].
In the Bayesian formulation, the unknown parameters θ are regarded as an unobserved random variable or vector. The Bayesian parameter estimation gives a posterior PDF of the unknown parameters, conditioned on measurements and prior distribution. Thus, the parameter estimation problem refers to obtaining the posterior PDF of θ by using the available measurements. According to Bayes' theory, the posterior distribution can be obtained by Equation (2), where p 0 (θ) is the prior distribution.
p(θ|z 1:T ) = p(z 1:T |θ)p 0 (θ) p(z 1:T ) The first advantage of the Bayesian parameter estimation algorithm over the traditional algorithms is the conceptualization of the estimation result. In the traditional algorithms, there is a specific but unknown model that needs to be identified. In the Bayesian algorithm, the model is a random variable rather than deterministic. The estimation result is a probability distribution for the model parameters named as posterior distribution. With the probability distribution, we can answer many probabilistic questions about the model. Such questions are impossible in the traditional algorithms, because the identified model they provided is not a random variable.
The second advantage of the Bayesian parameter estimation algorithm over the traditional algorithms is the incorporation with prior information represented mathematically as a prior distribution for the model. The Bayesian parameter estimation algorithm naturally incorporates prior information about the estimation result, ranging from hard additional constraints to experience based intuition. Once measurements are collected, they are combined with the prior distribution using Bayesian inference to produce the desired posterior distribution of the unknown parameters.
Both ML and Bayesian parameter estimation can be implemented off-line or on-line. In an off-line parameter estimation algorithm, we infer the unknown parameters by iterating over a fixed measurement dataset. In an on-line method, the unknown parameters are sequentially estimated as the measurements become available. The role of SMC for both off-line and on-line parameter estimation is discussed in [10,17]. If we assume that the states of the studied system are unknown, there is a need to deal with the relationship between states and parameters. Marginalization and data augmentation are two different strategies for dealing with this kind of parameter estimation problem. Marginalization means integrating out the states from the problem and viewing the unknown parameters as the only unknown variable of interest. Usually, marginalization is implemented in an off-line manner. Data augmentation estimates the unknown parameters together with the states [11,18]. In other words, the data augmentation performs parameter estimation by extending the state with the unknown parameters and transforming the problem into an optimal filtering problem. Usually, data augmentation is implemented in an on-line manner.

Nomenclature
An introduction of some concepts used in this paper is provided in the following. More detailed information can be found in the provided references.

Random Finite set (RFS)
In the middle 1970s, G. Matheron systematically described the random set theory [19]. The random finite set (RFS) approach introduced by Mahler is an advanced theory that unifies much of information fusion under a single Bayesian paradigm [20,21]. The RFS ψ is defined as a random variable that draws its instantiations ψ = Y from the hyperspace X of all finite subsets Y (the null set ∅ included) of some underlying space X 0 . Here, we assume that X 0 = R n is a Euclidean vector space and that, therefore, X consists of all finite sets of the form Y = ∅ or Y = {y 1 , . . . , y n } where n ≥ 1 is an arbitrary positive integer and y 1 , . . . , y n are arbitrary vectors in R n .

Finite Set Statistics (FISST)
Here, we adopt FISST which provides the mathematical representation and manipulation of RFS. FISST is a set of tools for filtering and estimating problems involving RFS [22]. FISST enables the RFS based Bayesian inference by providing some mathematical tools. Fundamentally, it approaches the modeling of multi-object tracking by introducing sets that contain a random number of single-object states, each of which is a random vector. It allows the dynamics of each object (member of the object set) to vary according to some motion model while the number of objects (members in the set) is allowed to vary according to some point process model.

Probability Hypothesis Density (PHD)
The difficulty of RFS based Bayesian inference is its computational complexity. To make the RFS based estimation possible, Mahler proposed the PHD (probability hypothesis density) filter [21,23,24]. The PHD D k|k (x|Z k ) of the posterior probability PDF f k|k (X k |Z k ) is a density function defined on the single object state x ∈ X 0 as follows: Here, we use the abbreviation D k|k (x) = D k|k (x|Z k ). In point theory, D k|k (x) is defined as the intensity density. D k|k (x) is not a probability density, but it represents the density of expected number of points at x. If S represents a region in the single object state space X 0 , the integral S D k|k (x)dx represents the expected number of targets in the state space S. More information about PHD can be found in [22].

Simulated Tempering
Simulated tempering is a stochastic optimization method that is used for improving the probability of an optimization routine converging to a global optima in highly multimodal objectives [25]. It adopts the properties of thermodynamical systems and deals with two important problems with Markov Chain Monte Carlo (MCMC) methods: one is improving exploration of multimodal distributions, and the other one is allowing estimation of the normalizing constant of the target distribution. Simulated tempering augments the Markov chain state with a discrete index variable controlling the inverse temperature of the system [26].
The objective function that we want to minimize is defined by the energy function φ of the system and the variables being optimized with the state x ∈ X. An increasing schedule of K inverse The variables x ∈ X on which the target distribution P is defined are augmented with a discrete index variable k ∈ {0, . . . , K}. A joint density on the target variable x and temperature index k is then defined as where C is a constant, and {ω k } K k=0 is a set of prior weights associated with each inverse temperature value, and which can be used to help shape the marginal distribution on the temperature index k. The energy function φ is defined as the negative logarithm of the unnormalized target density i.e., φ(x) = − logp(x). Function ψ defines a corresponding energy function for a base distribution Q with normalized density q(x) = exp(−ψ(x)).
If k = 0, the conditional distribution P x|k on the target variables x corresponds to the base distribution Q and β 0 = 0. If k = K, it will correspond to the target distribution P and β K = 1. We can therefore use the x components of sampled chain states for which k = K to estimate the variables of interest which is related to the target distribution P. In simulated tempering, a Markov chain with invariant distribution corresponding to (4) is constructed by alternating updates of the target variables x given the current value of temperature index k, with updates of the temperature index k given the current value of the target variables x. More information about simulated tempering can be found in [27].

Motivation and Advantages of the Proposed Parameter Estimation Algorithm
In this paper, we choose to use the Bayesian formulation represented by Equation (2). The unknown parameters are regarded as a random vector and associated with a suitable prior information modeled by the prior distribution. The posterior PDF of the parameters is to be characterized by using the given measurements. The proposed parameter estimation algorithm can be regarded as the Monte Carlo batch techniques [28], and it is perfect for estimating parameters of stochastic dynamic systems. The proposed parameter estimation algorithm is an off-line Bayesian parameter estimation algorithm, and it is an updated version of the marginalization based algorithms.
The conventional Bayesian parameter estimation algorithms depend on the vector based representation of data including states and measurements. The vector based representation makes these algorithms have three essential disadvantages. The first one is that they must assume that the studied system is a single permanently active system and the model is presumed. They cannot be used for estimating the dynamic system that switches on and off randomly. The transition from one mode to another is impossible. The second disadvantage is that they are based on the assumption that the detection is perfect which means no clutters and no missed detections, and they also need the number and sort order of measurements to be previously designated. The third disadvantage is that they are not appropriate for estimating the systems with varying state dimensions caused by the varying number of constituent elements contained in the studied system.
The limits of the commonly used random vector based Bayesian estimation algorithms are analyzed in [29]. We can find that the root of their disadvantages is that they must ensure the dimension and elements' order in each vector to be equal and fixed. They also need necessary operations outside of the Bayesian recursion to ensure the consistency of the vectors used for calculation. The determination of newly observed measurements and missed measurements is through vector augmentation and truncation which are very computationally intensive and irreversible. In this paper, we propose to employ the RFS theory to overcome these disadvantages of the standard vector based Bayesian parameter estimation algorithms.
An RFS has a broader application prospect than a random vector, because an RFS has a random number of constituent vectors and it allows these constituent vectors themselves to be random, distinct and unordered [30]. In practice, the number of measurements is not fixed and the ordering of measurements is irrelevant, so the measurements are better to be modeled as an RFS. RFS based data representation generalizes the system model and measurement model for parameter estimation, because it takes into account a more realistic situation where the varying number of objects, detection uncertainty, clutters, missed detections and data association uncertainty are all taken into consideration.
Unlike the traditional parameter estimation algorithms and the vector based Bayesian parameter estimation algorithms, the proposed Bayesian parameter estimation algorithm employs RFS based modeling, and has the following advantages:

1.
It can estimate the parameters by using the imperfect detections including noises, clutters, data association uncertainty and missed detections. Thus, it enables the measurement model used for parameter estimation to be more universal and consistent with the practical system.

2.
It can estimate the parameters for the stochastic systems which consist of a varying number of constituent components or objects. It can take the object birth, object death and spawned objects of the studied system into consideration.

3.
The proposed algorithm can also accurately estimate the states of the studied system and the number of objects contained in the studied system while estimating the unknown parameters. The estimation of states and object number depends on the estimated parameters.
The rest of the paper is structured as follows. We present the RFS based modeling for both measurement model and system model in Section 2. A detailed description of the PHD based Bayesian inference and the SMC based implementation are given in Section 3. The proposed RFS based parameter estimation algorithm is detailed in Section 4. After describing the proposed algorithm, in Section 5, we take a particular problem of estimating systematic errors of multiple sensors as an example to verify this algorithm. In order to better promote academic exchanges, we provide the source codes for the experiments in the supplementary materials part. Finally, the conclusion is provided in Section 6.

RFS Based Modeling
The RFS based parameter estimation algorithm uses two kinds of models to describe the real world: one is the Markov transition density based system model; the other one is the measurement likelihood based measurement model. System model is used to model the state transition of the studied system, and the measurement model is used to model the practical measurement system. Here, we focus on estimating the static parameter vector θ which features in the state space models that represent system dynamics and measurement system using observations from the practical system. The RFS based representation of measurements and states is the foundation of RFS based algorithm. In this section, we give the RFS based problem formulation and the RFS based models used in the proposed parameter estimation algorithm.

Problem Formulation
Suppose at time step k = 0, 1, 2, . . ., the studied system consists of n k objects. Their states can, respectively, be represented by the vectors x k,1 , . . . , x k,n k , and each state vector is obtained from the state space X ⊆ R n x . In practice, the number n k and the individual states in X of system's constituent elements are both time varying and random. In this paper, the system state at k is described by an RFS X k = {x k,1 , . . . , x k,n k } ∈ F (X ). The state evolution of the RFS based system model can be described by the Markov system state process conditioned on the static parameters θ ∈ Θ. The state evolution is characterized by a prior density X 0 ∼ µ(X 0 |θ) and a multiple objects transition density The measurement process is imperfect in the sense that there exists detection uncertainty. The measurement devices usually create false detections. If observation datasets are available at time k = 1, 2, . . ., and there are m k elements in the observation dataset at time k, and the observations take a values in the measurement space Z ⊆ R n z . Then, the observations can be modeled by an RFS Z k = {z k,1 , . . . , z k,m k } ∈ F (Z ). Here, we assume that the measurement process is conditionally independent from the system's state evolution, and can be fully described by the We assign a prior distribution to the unknown parameter vector θ, and estimate it on the parameter vector space according to Bayesian theory. The parameter estimation problem in this paper mainly refers to estimating the posterior PDF p(θ|Z 1:K ); here, Z 1:K = Z 1 , . . . , Z K is the measurement set sequence gathered up to time index K. If the prior density p 0 (θ) is given, according to Equation (2), the estimation result in the Bayesian approach is given by: The computation of f (Z 1:K |θ) is quite difficult, because it refers to performing estimation on the joint space of the unknown system states history X 1:K and the unknown parameter vector θ.
To solve this problem, we choose to consider a broader problem instead of solving the problem directly. The parameter estimation problem can be regarded as a subproblem of the chosen problem where the main task is to obtain the following posterior PDF : where f (X 1:K |Z 1:K , θ) is the posterior PDF of the system states conditioned on θ and all the given observation datasets.

RFS Based Measurement Model
For the given predicted system state X k = {x 1 , . . . , x n }, the random measurement set collected by the sensor can be represented by is the detection set for state x j ; and C k is the set of Poisson clutters.
Any given object in the studied system can generate either a single measurement or no measurement at all. Consequently, is the sensor-noise model associated with ith state x i ; and ∅ p D (x i |θ) is the discrete random subset of the baseline measurement space Z 0 . Detection uncertainty is modeled as a Bernoulli RFS as follows: where p D (x|θ) is the probability that a single object whose state is x gives a single observation, and 1 − p D (x|θ) is the probability that the sensor generates no measurement at all. The fact that the clutter process is Poisson means that the clutters can be represented by the set C = {c 1 , . . . , c M }, where M is a random nonnegative integer with probability distribution p M (m) = e −λ λ m /m!; c 1 , . . . , c M are independent, and identically distributed random measurement vectors with density c(z) = g C (z|θ).
For the state set X k , the PDF of receiving the measurement set Z k is described by the true likelihood function ϕ k (Z k |X k , θ), which describes the sensor measurement for the studied system and characterizes the missed detections, clutters, and object generated observations. As described in [29], the true likelihood function can be calculated as follows: ; the summation is taken over all associated hypothesis ξ : {1, . . . , n} −→ {0, 1, . . . , m}. In time, the clutter process is Poisson distributed, and the mean value is λ. In space, the clutter process is distributed according to an arbitrary density c(z), and c k|k ( is the likelihood function of a single sensor.

RFS Based System Model
Some systems may consist of multiple objects, and the number of these objects can be constantly varying, just as the objects appear and disappear from the system. Existing objects can give rise to new objects through spawning. Objects can likewise leave the system, as when disappearing, or they can be damaged or destroyed. All of these instances of multiple object dynamics should be explicitly modeled. If multiple objects contained in the studied system are related to birth, death, spawning, merging, and so on, the standard vector based estimation algorithms are inapplicable, because they fail to accurately estimate the number of objects in the studied system. We model the time evolution of the system which consists of multiple objects by employing RFS. We take the object birth, death, spawn, and motion into consideration. The system Markov density Π k|k−1 (X k |X k−1 , θ) characterizes the time evolution of the system states. Here, we give the true system Markov density function Π k|k−1 (X k |X k−1 , θ) for the proposed system model, where X k = {x 1 , . . . , x n k } and X k−1 = {x 1 , . . . , x n k−1 }. The RFS based representation of system' states has the following mathematical form where X k is the set of predicted system states, S k|k−1 (X k−1 ) is a set of states of persisting objects that are continuing to exist in the system, B k|k−1 (X k−1 ) represents the state set of spawned elements, B k represents the state set of spontaneous objects.
is the set of spawned states by the object whose previous state is x i . As detailed in [29], the RFS based system model is identical in mathematical form to the RFS based measurement model. The corresponding true Markov density is where π k|k−1 (x k |x k−1 , θ) is a Markov transition density, which corresponds to a single object system model It is the likelihood that a single object will have state vector x k at time step The related definitions in Equation (13) are as follows: Here, µ 0 is the expected number of spontaneously generated new objects and b 0 (x|θ) is their physical distribution. In addition, µ(x i |θ) is the expected number of new objects spawned by an object whose previous state is x i , and b(x|x i , θ) is their physical distribution.

PHD Based Bayesian Equations
The Bayesian estimator sequentially determines the posterior PDF of the system states f k|k = (X k |Z 1:k , θ) at each time step k, where Z 1:k ≡ Z 1 , . . . , Z k denotes the gathered measurement set sequence up to time step k. The posterior PDF is calculated through the prediction and the correction recursion sequentially. On the assumption that the posterior PDF f k−1|k−1 (X k−1 |Z 1:k−1 , θ) at time k − 1 is already known, if the set of measurements Z k which is related to time k has been given, the prediction step and corrected step can be described as in [22]: The PHD filter is an approximation of the Bayesian estimator described by Equations (17) and (18) [31]. Rather than propagating the FISST PDF f k|k (X k |Z 1:k , θ), we can only propagate PHD D k|k (x|Z 1:k , θ), defined by Equation (3).
The prediction equation of the PHD filter can be represented as follows [21]: where γ k|k−1 (x|θ) represents the PHD related to the object birth RFS between step k − 1 and k.
Once the observation dataset Z k for time k is available, the correction equation of the PHD filter can be represented as follows [21]: (20) where κ k (z|θ) is the PHD related to the clutter RFS at step k. Here, the clutters are characterized by the Poisson RFS. The PHD of the Poisson RFS is κ k (z|θ) = λc(z|θ); here, λ denotes the mean number at step k, and c(z|θ) denotes the spatial distribution of clutters in the measurement space.

SMC based Implementation
Beginning with Sidenbladh [32] and Zajic [33], many researchers have studied how to implement the PHD filter by using the SMC method. By using sequential importance sampling to each terms of Equation (19), the particle approximation of Equation (19) can be obtained. After choosing the importance densities p k (·|Z k , θ) and q k (·|x k−1 , Z k , θ) for spontaneous birth and persisting objects, Equation (19) can be rewritten as Therefore, we can get the particle approximation of D k|k−1 (x|θ) as follows For the correction step of the PHD filter, the predicted PHD can be represented by {ω Applying the correction Equation (20), we can get the following particle approximation where By updating the weights of the particles in term of Equation (26), the correction step maps the into the corrected particle system {ω

RFS based Parameter Estimation Algorithm
The system states conditioned on θ, are estimated by using the Bayesian theory. The main task is to obtain an estimate of PDF f (X 1:K |Z 1:K , θ) for each possible value of θ by running a PHD filter implemented by SMC. By this way, we can evaluate the measurement likelihood f (Z 1:K |θ), and then we can indirectly obtain a practical solution for p(θ|Z 1:K ).

Formula Derivation
By employing PHD to approximate Equation (6), we can estimate f (X 1:K |Z 1:K , θ) which depends on RFS based modeling. According to the relationship between the PHD and the RFS based PDF detailed in [22], Equation (6) can be rewritten as the following equation The advantage of Equation (27) is that the posterior PHD D(x 1:K |Z 1:K , θ) is defined on the standard vector based state space X , so the calculation of the RFS based posterior PDF f (X 1:K |Z 1:K , θ) is quite easy.
The observed data likelihood f (Z 1:K |θ) of the accumulated sequence of observation dataset which features in Equations (5) and (27) can be calculated by using the following equation: Now, the problem is how to calculate the measurement likelihood f (Z k |Z 1:k−1 , θ) using the results of PHD filter. The equations and SMC based implementation of the PHD filter for recursively estimating D k|k (x|Z 1:k , θ) have been introduced in Sections 3.1 and 3.2. The measurement likelihood f (Z k |Z 1:k−1 , θ), which is defined as in [21], can be calculated as follows From Equations (28) and (29), we can find the tight relationship between the parameter estimation problem and the state estimation problem. The key challenge is how to deal with the unknown system states. The PHD filter implemented by the SMC method is used to solve this problem. Thus, f (Z k |Z 1:k−1 , θ) can be obtained by using PHD filer, and can be calculated as follows: where α is a proportionality constant.
As described in Section 3.2, we can use the weighted particle system {x to represent the posterior PHD D k|k (x|Z 1:k , θ). To estimate the posterior p(θ|Z 1:K ), we should firstly obtain the approximation of f (Z k |Z 1:k−1 , θ) given by Equation (30) by running the PHD filter. According to Equation (22), the predicted PHD D k|k−1 (x|Z 1:k−1 , θ) is approximated by the following particles Substituting Equation (31) into Equation (30), we can obtain the following result: Therefore, according to Equation (28), f (Z 1:K |θ) can be calculated as follows:

Simulated Tempering Based Importance Sampling
We have obtained the estimate of f (Z 1:K |θ) by running the PHD filter, the following step is to compute p(θ|Z 1:K ) according to Equation (5). Since the analytic solution for p(θ|Z 1:K ) is difficult to compute, we solve Equation (5) by using the simulated tempering based importance sampling method.
Since we are interested in the values of the unknown parameters, we can compute them as the posterior mean by using the following equation: To avoid doing integral calculus, we propose to approximate Equation (34) via importance sampling. Here, we assume the sample size is M, we draw a sample θ 1 , . . . , θ M from the importance density ξ. Thus, the integral in Equation (34) can be approximated by the following equation: The sample weights j ≥ 0, (j = 1, . . . , M) are defined as: According to Equation (36), we find that α can be omitted, since it cancels out if we normalize the weights. Thus, for convenience, we assume α = 1. The determination of sample weight j depends on the selection of importance density ξ(θ j ). A good selection of importance density should be proportional to the posterior PDF f (Z 1:K |θ j ) and produce sample weights with a small variance.
Since we can only obtain an estimation of f (Z 1:K |θ j )/α K after running the SMC based PHD filter, the selection of the importance density ξ is quite important. If the sample size M → ∞, the approximation Equation (35) will be quite accurate for many importance densities. However, if M is finite, the approximation accuracy will depend greatly on the specific selection of importance density ξ. As shown in Equation (36), the accuracy of importance sampling by using the prior information is not satisfactory for any possible sample size, since the posterior PDF p(θ|Z 1:K ) which is proportional to f (Z 1:K |θ)p 0 (θ) is unknown. We propose a simulated tempering [2] based importance sampling method to obtain a better importance density ξ which is similar to the posterior PDF p(θ|Z 1:K ).
Here, we use S to represent the maximum stages number and ϑ s , s = 1, . . . , S, represent the proposal density for the s-th stage. The main idea of tempering is to sequentially generate a serious of importance densities from which we can sequentially draw samples. The first importance density usually resembles the prior density p 0 (θ), and the final importance density is the posterior PDF p(θ|Z 1:K ). The sequential importance density in the sequence should have very small difference. We obtain ϑ S = p(θ|Z 1:K ) at the final stage S . A sequence of importance densities that begin with the prior density and increasingly resemble the posterior PDF can be generated by the following equation, for s = 1, . . . , S ϑ s (θ) ∝ [ f (Z 1:K |θ)] Λ s p 0 (θ) where Λ s = ∑ s j=1 β j with β j ∈ (0, 1] and at the final stage we have Λ S = 1. Thus, Λ s increases with the growth of s and its upper bound is 1. During the tempering process, we should sequentially draw samples from importance densities ϑ 1 , ϑ 2 , . . . , ϑ S . At the first stage (s = 1), we draw the sample {θ 1 j } M j=1 from the prior density ϑ 0 (θ) = p 0 (θ). At the s-th stage, we use the sample {θ s−1 j } M j=1 drawn from ϑ s−1 at the (s − 1)th stage to obtain a sample from ϑ s . Firstly, we need to compute weights from particles in {θ s−1 j } M j=1 by using the equation˜ s j = f (Z 1:K |θ s−1 j ) β s for j = 1, . . . , M. Then, we normalize the weights. Resampling is quite necessary for tempering, because it can help to remove the particles with lower weights from the sample {θ s−1 j } M j=1 and improve the diversity of the remaining particles. Thus, the second step is resampling. Resampling means selecting M indices i s 1 , . . . , i s M with probability Pr{i s j = j} = s j . After resampling, the particles {θ s j } 1≤j≤M will almost surely have repeated elements, because the weights s 1 , . . . , s M will be the most possible to be uneven. Here, the MCMC (Markov Chain Monte Carlo) [34] is employed to the sample {θ s j } 1≤j≤M that have been resampled. MCMC can help to remove the duplicate particles, thus the diversity of the sample will be increased. For each particle of {θ s j } 1≤j≤M , a new sample particle can be drawn as follows where g s represents the importance density for the s-th stage. The decision of accepting or rejecting θ s, * j with certain probability is made by using the Metropolis-Hastings algorithm. If the MCMC move is accepted, θ s j = θ s, * j . If the move is rejected, θ s j =θ To diversify the sample, the importance density g s should generate the candidate particles over a fairly large area of the parameter space. We should ensure that g s is still within the area where the likelihood is high. We can select the importance density by the following method as introduced in [35]: g s (θ * |θ) = g s (θ * ) = N (θ * ;μ s ,Ĉ s ) (40) where N (θ; µ, C) denotes a Gaussian distribution;μ s andĈ s are respectively the mean and covariance matrix of the weighted particles { s j , θ s−1 j } M j=1 . The pseudo-code of the proposed parameter estimation algorithm is shown in Algorithm 1. Here, 1 ≤ H ≤ M and φ > 0 are parameters that should be defined by users.

Experiments
In this section, we verify the correctness and validity of the proposed algorithm by using an example of multiple sensors. Estimating the systematic errors of multiple sensors is important for sensor registration in cooperative networked surveillance [36]. In this example, we estimate the systematic errors of two sensors. The advantages of the proposed algorithm introduced in Section 1.3 are all successfully proved. In order to ensure that those who are interested in the experiments can conduct these experiments smoothly, we also provide the source codes as the supplementary materials.

Experimental Models
Here, we use the measurements given by R (here R = 2) static sensors to estimate the systematic errors of these sensors. Here, we denote the observation dataset reported by sensor r k ∈ {1, . . . , R} at time k by Z To verify the first advantage introduced in Section 1.3, we use the detection probability p D to characterize the detection uncertainty, and missed detection. The sensor observations are affected by the systematic errors and the measurement noises. The clutters are characterized by using Poisson RFS. We assume that the observations provided by the sensors are the azimuth and range to the objects in the surveillance system. For an observation z ∈ Z (r k ) k which is related to the object with state x ∈ X k , the measurement likelihood of the r-th sensor is modeled as: k (x) is the measurement function of the r-th sensor, and its specific form is as follows: where (x r , y r ) is the position of the r-th sensor. The unknown parameter vector θ consists of four parameters as follows: where θ (r) = [∆η r , ∆ρ r ] T , which appears in Equation (41), denotes the systematic error vector for the r-th sensor (r = 1, 2), ∆η is the azimuth systematic error and ∆ρ is the range systematic error. Σ D denotes the detection probability of the r-th sensor.
To verify the second advantage of the proposed algorithm introduced in Section 1.3, the targeted system surveilled by the sensors consists of a varying number of objects. We use the probability of survival p s to characterize the uncertainty of target existence. The state of each individual object is described by the vector x = [x,ẋ, y,ẏ] T , here (x, y) denotes the position of the object and (ẋ,ẏ) denotes the velocity.
The state of each object in the surveillance system evolves according to the Markov transitional density in time, and it is independent of the unknown parameters θ. Here, we use the following motion model: where and ⊗ is the Kroneker product, T k = t k+1 − t k denotes the time interval of the sensors, and denotes the intensity of process noises.

Experimental Setup
The typical observation dataset sequence Z 1:K , the real states of the multiple objects and the location of the sensors in the experiment are shown in Figure 1. The number of observation datasets used for estimating parameter is K = 20. The number of component objects is varying in the experiment, with n k = 3 for k = 1, 2, 3, 4, 5; n k = 5 for k = 6,7,8,9,10; n k = 4 for k = 11, 12, 13, 14, 15; and n k = 3 for k = 16, 17, 18, 19, 20. The two sensors in the experiment work asynchronously: Sensor 1 provides observation datasets at odd time steps, and Sensor 2 provides observation datasets at even time steps. T k = 1 s is the time interval between k and k − 1. The location of two sensors are (x 1 , y 1 ) = (0 km, 0 km) and (x 2 , y 2 ) = (200 km, 100 km). The probability of detection is p

Experimental Results
The estimated systematic errors for the sensors are given in Figure 2. Figure Table 1.
In Figure 2 and Table 1, we can know that the proposed RFS based parameter estimation algorithm can provide the accurate estimated parameters by using the RFS based measurement model and the measurements set that consists of noises, false detections, and missed detections. This is consistent with the first advantage introduced in Section 1.3. Moreover, the proposed algorithm gives the accurate estimated parameters for the stochastic system in which the number of constituent objects (targets) is varying over time. This is consistent with the second advantage introduced in Section 1.3. The obtained experimental results show an outstanding accuracy of the proposed algorithm, although the size of dataset is only 20 scans.   To quantitatively assess the convergence of the algorithm, we adopt the convergence diagnostic named as the estimated potential scale reduction (EPSR). The definition of EPSR is given in [37,38]. The resulted sequence of correction factors β and the EPSR at each stage s = 1, . . . , 10 are listed in Table 2. From this table, we can know that the simulated tempering process works effectively. As the stage number increases, the correction factors of each stage increase, and EPSR at each stage decreases, which means that the resulted sample in the simulated tempering process increasingly concentrates in the area of the parameter space characterized by the true likelihood.  The estimates θ s (the mean of all the particles at each stage) over the simulated tempering stages obtained by running the proposed algorithm are shown in Figure 3. The black solid lines are the true range and azimuth systematic errors of the sensors. The red circles are the sample mean at different tempering stages. From this figure, we can see that, after the fourth stage, the estimation results are very close to the real systematic errors. Since Λ s = ∑ s j=1 β j is still less than 1, the estimation process will continue until reaching the maximum stage number S = 10.  The SMC approximation of the posterior PDF of the sensors' systematic errors according to Equation (35) at the first stage, fifth stage and tenth stage are shown in Figure 4. The true systematic errors for sensors are represented by black circles. In Figure 4, we can see that, as the simulated tempering process evolves, the particles gradually concentrate on the real systematic errors of the sensors.

System States and Object Number Estimation
As given in Section 1.3, the third advantage of the proposed algorithm is that it can provide the estimate of the system states and the number of targets. Figure 5a gives the real target number contained in the system and the estimated one. Figure 5b gives the real and estimated system states at all the sampling time. From this figure, we can find that the proposed algorithm can accurately estimate the varying number of targets and system states at the final stage by using the estimated systematic errors of sensors.

Comparison with Metropolis Hastings Algorithm
There are many probability distributions from which it is impossible to sample directly. The Metropolis Hastings (MH) algorithm is a typical MCMC algorithm that can generate a sequence of random samples from these distributions. Since it is the most widely used method in statistics and in statist physics, we verify the superiority of the proposed algorithm by comparing with MH algorithm. Figure 6 displays the histogram of the resulted samples given by using MH algorithm with the identical experimental setup. From this figure, we can find that the resulted samples given by MH algorithm are more decentralized than the proposed algorithm. The estimated results are shown in Table 3. We can find that the proposed algorithm are more accurate than MH algorithm.  To compare the accuracy of the importance sampling process, we adapt the evaluation criteria named as autocorrelation function (ACF) [39,40]. According to the fundamentals of MCMC, the samples generated by using MCMC are auto-correlated. Compared with the independent samples, the information content of the samples is relatively reduced. The sample based ACF at lag k of a set of samples is defined as follows: whereθ = 1 M ∑ M s=1 θ s is the Monte Carlo estimate, and θ s is a sample. From the definition, we can see that, if the ACF dies off to 0 more rapidly, it indicates that the samples are less auto-correlated and more accurate. Figure 7 displays the ACF of the proposed algorithm and MH algorithm. From this figure, we can find that the ACF of the proposed algorithm dies off to 0 more rapidly than the MH algorithm. Thus, the accuracy of the prosed algorithm is much higher.  The former experiments assume that the prior p 0 (θ) is a uniform distribution. To verify the superiority of the proposed algorithm in a more general case, we assume the prior p 0 (θ) is a Gaussian distribution with mean value [−2.5; 9.5; 2.5; −9.5] and covariance diag [10,20,10,20]. Figure 8 displays the histogram of the resulted samples by using the proposed algorithm. We can find that the resulted KDE densities have deviation by comparing with the former experiment. This is caused by the inaccurate prior distribution. Figure 9 displays the histogram of the resulted samples given by HM algorithm. From this figure, we can find that the resulted samples are still very decentralized and the distribution of samples is heavily depending on the prior distribution.
The estimated results of the two algorithms are given in Table 4. We can find that the accuracy of the proposed algorithm deteriorates slightly, but it is still better than the results of MH algorithm.
The ACF of the two algorithms is given in Figure 10. We can find that the ACF of the proposed algorithm dies off to 0 more slowly than the former experiment, but still more rapidly than the MH algorithm. Thus, the accuracy of the estimated results of the proposed algorithm is a little better than the MH algorithm.

Sensitivity Analysis
In this section, we analyze the sensitivity of the proposed algorithm to a number of important parameters, such as the particle number N for PHD filter, the sample number M in parameter space as well as the parameter φ related to the simulated tempering process.

Particle Number N
The PHD filter can provide the estimation of the likelihood f (Z 1:K |θ). Hence, the accuracy of the proposed parameter estimation algorithm may depend on the particle number N used by the SMC based PHD filter. The influence of the particle number N on the parameter estimation is presented in Figure 11. The estimated systematic errors at the 10th stage for different particle number N by using MAP are given in Table 5.  In Figure 11 and Table 5, we can find that the estimation error has a decreasing tendency as N increases. With more particles, PHD filter can give more accurate likelihood f (Z 1:K |θ), and thus leads to a better coverage of the parameter space. The plot shows that the decreasing tendency in estimation errors by increasing N is not proportional, and when N increases to a certain degree, the improvement of estimation accuracy is very small. The result shows that, if N increases to a certain degree, it will not significantly affect the parameter estimation results. As the accuracy of the estimation results is restricted by the quality of the model and data used in the parameter estimation process, the estimation results without any errors by continuously increasing N is impossible. The computational complexity will also be increased as N increases. Therefore, we should achieve a balance between computational complexity and estimation accuracy. Figure 12 shows the ACF by using different values of N. For the convenience of analysis, we give the ACF of the first 150 samples. From this figure, we can find that, if N is larger, the corresponding ACF will die off to 0 more rapidly, so the accuracy is higher.  Table 6 shows the EPSR at all the stages for different particle number N used by the PHD filter. From this table, we can find that, as the number of stage increases, the EPSR of each stage decreases. This means that as the simulated tempering process progresses, the sample increasingly concentrates in the area of the parameter space characterized by the true likelihood. The increase of the particle number N seams to have no influence on the convergence of the proposed algorithm.   Figure 13. Table 7 is the MAP estimate at the 10th stage for different M. From the plot, we can find that the decreasing tendency in estimation errors by increasing M is quite similar to the result caused by increasing N. If M increases to a certain degree, the improvement of estimation accuracy is very small. There also needs a balance between computational complexity and estimation accuracy.
The EPSR of different sample size M is given in Table 8. From this table, we can find that the increase of M will result in the increase of EPSR, which means that the convergence of the proposed algorithm will be increased.   By setting the maximum stage number to be S = ∞, the influence of the parameter φ on the parameter estimation results is presented in Figures 14 and 15. In Figure 14, we can find that a smaller φ will need more stages and get more accurate estimation result. If φ is bigger than a certain degree, the number of stages will be very small, and the estimation errors will be bigger. In Figure 15b, we can find that, if a lower value of parameter φ is used, the MCMC acceptance rate will be increased.
In Figure 15a, we can find that, if a lower value of parameter φ is used, the number of stages S will be increased, because the increments of correction factors β s become smaller.  The computational complexity of simulated tempering based importance sampling is determined by the number of stages S and the correction factors β 1 , · · · , β S . To reduce the computational complexity, we tend to use a small value of S. However, at the same time, if we choose a small S, the sequential intermediate distribution ϑ s will vary rapidly. There is a balance between the number of stages (related to the computational complexity) and the estimation accuracy.
We give the estimated systematic errors at the final stage for different φ in Table 9. It seems that, if φ decreases, the stage number will increase and the estimated results will be more accurate.

Conclusions
We have presented the proposed RFS based parameter estimation algorithm to overcome some disadvantages of the vector based Bayesian algorithms. We also detail how to implement this algorithm by using the SMC based PHD filter and simulated tempering based importance sampling. By applying the SMC based PHD filter, the proposed algorithm successfully extends the application of parameter estimation from the single object system to the stochastic system which consists of a varying number of constituent objects. Without any assumptions under ideal conditions, it can fit the dynamic nature of the studied system and the detection uncertainty of the measurement system very well. It can be successfully used for identification of the stochastic systems which consist of a varying number of constituent objects. It can also be used for parameter estimation problem where the observation dataset is affected by false detections, missed detections and noises.