Reinforcement Learning-Based Data Association for Multiple Target Tracking in Clutter

Data association is a crucial component of multiple target tracking, in which each measurement obtained by the sensor must be determined to belong either to a target or to clutter. However, many methods reported in the literature cannot simultaneously ensure association accuracy and low computational complexity, especially in the presence of dense clutter. In this paper, a novel data association method based on reinforcement learning (RL), the so-called RL-JPDA method, is proposed to solve this problem. In the presented method, RL is leveraged to exploit the available information of the measurements. In addition, the motion characteristics of the targets are utilized to ensure the accuracy of the association results. Experiments are performed to compare the proposed method with the global nearest neighbor data association method, the joint probabilistic data association method, the fuzzy optimal membership data association method and the intuitionistic fuzzy joint probabilistic data association method. The results show that the proposed method yields a shorter execution time than the other methods. Furthermore, it obtains effective and feasible estimates in environments with dense clutter.


Introduction
Measurement data association in a cluttered environment is considered a highly promising and challenging technique in the field of multiple target tracking [1,2]. The main mission of data association is to determine, when multiple targets are present, whether each measurement obtained by the sensor belongs to a target [3,4]. However, clutter such as false alarms and electronic countermeasures makes it very difficult to accomplish the data association mission efficiently. Therefore, many methods have been proposed in the literature to solve this problem [5][6][7]. The nearest neighbor data association method (NN) [8] completes the data association by selecting the measurement with the shortest distance to the predicted measurement of the target. However, the nearest measurement may be a clutter point, in which case the mission ultimately fails. Reference [9] proposed a fuzzy-based nearest-neighbor association method for multiple target tracking, in which fuzzy clustering is used to acquire a likelihood measure instead of the classical Mahalanobis distance. The probabilistic data association (PDA) method [10] calculates the association probability between the obtained measurements and the target, but is only applicable to assigning multiple measurements to a single target. Reference [11] proposed a novel data association technique that combines PDA and NN, in which the probability of each measurement is obtained from the conditional probability density functions of the events of interest. A multiple hypothesis tracker (MHT) [12] has been proposed to evaluate the likelihoods of association hypotheses for tracking systems; its output is a list of hypotheses sorted by their probability estimates. However, all the possible association hypotheses must be maintained and evaluated, so the computational cost of MHT grows rapidly with the number of measurements.

•	RL is embedded into the traditional JPDA method to learn the relationship between the measurement distribution and its association probability in the presence of dense measurement clutter;
•	The motion characteristics of the targets are considered to improve the accuracy of data association.
The structure of this paper is organized as follows. The problem formulation is described in Section 2. Section 3 explains the detailed implementation of the proposed RL-JPDA method. In Section 4, the experiments are introduced and comparative results with other JPDA variants are presented. Finally, Section 5 summarizes the conclusions.

The Target Model
It is assumed that there are t = 1, 2, . . . , T targets observed by the sensor, and the dynamics and measurement model of target t are defined as follows:

X^t(k) = F^t(k − 1)X^t(k − 1) + w^t(k − 1) (1)

Z^t(k) = H^t(k)X^t(k) + v^t(k) (2)

where X^t(k) represents the state vector of target t at scan k, and Z^t(k) represents the measurement vector. F^t(k) denotes the state transition matrix and H^t(k) denotes the measurement transition matrix. The process noise w^t(k) is Gaussian white noise with covariance Q^t(k) and zero mean. The measurement noise v^t(k) is zero-mean Gaussian noise with known covariance R^t(k). In a clutter-free environment, the state vector of each target t is predicted and updated based on the correct measurements as follows [15]:

X̂^t(k|k − 1) = F^t(k − 1)X̂^t(k − 1|k − 1) (3)

P^t(k|k − 1) = F^t(k − 1)P^t(k − 1|k − 1)(F^t(k − 1))^T + Q^t(k − 1) (4)

Z̃^t(k) = Z^t(k) − H^t(k)X̂^t(k|k − 1) (5)

S^t(k) = H^t(k)P^t(k|k − 1)(H^t(k))^T + R^t(k) (6)

K^t(k) = P^t(k|k − 1)(H^t(k))^T (S^t(k))^−1 (7)

X̂^t(k|k) = X̂^t(k|k − 1) + K^t(k)Z̃^t(k) (8)

P^t(k|k) = (I − K^t(k)H^t(k))P^t(k|k − 1) (9)

where X̂^t(k|k − 1) represents the predicted state vector of the t th target at scan k, and P^t(k|k − 1) denotes the predicted value of the state covariance. Z̃^t(k) is the innovation, S^t(k) is the innovation covariance, K^t(k) is the Kalman filter gain, X̂^t(k|k) is the estimated state at scan k, and P^t(k|k) is the estimated state covariance.
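The prediction and update steps above can be sketched in a few lines of generic linear Kalman filter code; the function and variable names are illustrative, not taken from the paper:

```python
import numpy as np

def kf_predict(x, P, F, Q):
    """Predict step, Eqs. (3)-(4): propagate state and covariance."""
    x_pred = F @ x
    P_pred = F @ P @ F.T + Q
    return x_pred, P_pred

def kf_update(x_pred, P_pred, z, H, R):
    """Update step, Eqs. (5)-(9): fold one measurement into the estimate."""
    innovation = z - H @ x_pred               # Eq. (5)
    S = H @ P_pred @ H.T + R                  # Eq. (6), innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)       # Eq. (7), Kalman gain
    x_est = x_pred + K @ innovation           # Eq. (8)
    P_est = (np.eye(len(x_pred)) - K @ H) @ P_pred  # Eq. (9)
    return x_est, P_est, S
```

The innovation covariance S returned here is also the matrix used later for the gating test of the JPDA step.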

Joint Probabilistic Data Association Method
The JPDA method is briefly revisited here. It is assumed that Z(k) denotes all the measurements observed by one sensor at scan k. To obtain the candidate measurements, a gate centered around the predicted measurement is used to complete the measurement selection:

[Z_j(k) − Ẑ^t(k|k − 1)]^T (S^t(k))^−1 [Z_j(k) − Ẑ^t(k|k − 1)] ≤ ζ (10)

where Ẑ^t(k|k − 1) is the predicted measurement of the t th target, and the parameter ζ is the limit of the gate. The qualified measurements are defined as the candidate measurements Z_j^t(k), j = 1, 2, . . . , N_C^t, where N_C^t is the maximum number of candidate measurements. Due to the existence of clutter, the candidate measurements contain the true measurements mixed with many false ones. A validation matrix Ω = [w_{j,t}] is defined to describe the relationship between each target and each measurement as follows:

w_{j,t} = 1, if the j th measurement lies in the gate of target t; w_{j,t} = 0, otherwise (11)

where the index t = 0 means "no target". The joint event matrix ŵ_j^t(θ(k)) indicates whether the joint event θ(k) contains the association of target t and measurement j. The joint event matrix is generated according to (11) and two basic hypotheses:

•	Each measurement is assigned to one target uniquely;
•	Each target has at most one measurement.
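Under these two hypotheses, the gate test of (10) and the construction of the validation matrix of (11) can be sketched as follows; the array shapes and names are illustrative assumptions:

```python
import numpy as np

def validation_matrix(measurements, z_pred, S_inv, zeta):
    """Build the validation matrix of Eq. (11) via the gate test of Eq. (10).
    measurements: (N, d) array of received measurements;
    z_pred: (T, d) predicted measurements, one per target;
    S_inv: (T, d, d) inverse innovation covariances; zeta: gate limit."""
    N, T = len(measurements), len(z_pred)
    omega = np.zeros((N, T + 1), dtype=int)
    omega[:, 0] = 1                      # column t = 0: the "no target" origin
    for j, z in enumerate(measurements):
        for t in range(T):
            v = z - z_pred[t]
            d2 = v @ S_inv[t] @ v        # squared Mahalanobis distance, Eq. (10)
            if d2 <= zeta:
                omega[j, t + 1] = 1
    return omega
```

Each row of the returned matrix lists the feasible origins of one measurement, which is the input to the joint-event enumeration.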
The posterior probabilities of the joint events are computed to account for the fact that a candidate measurement may have originated from more than one target. The posterior probabilities P{θ(k)|Z^k} are defined as follows:

P{θ(k)|Z^k} = (1/ς) · (φ!/V^φ) · ∏_j [N_{t_j}(Z_j(k))]^{τ_j} · ∏_t (P_D^t)^{δ_t} (1 − P_D^t)^{1−δ_t} (12)

where Z^k = {Z(l)}, l = 1, . . . , k is the cumulative list of candidate measurements up to scan k, ς is a normalization constant, φ is the number of clutter measurements, V is the volume of the tracking gate, N_{t_j}(Z_j(k)) denotes the probability density function of the predicted measurement of target t_j, δ_t is a target indicator of whether there is a measurement associated with target t (δ_t = 1) or not (δ_t = 0), τ_j is the number of targets associated with measurement j, and P_D^t is the detection probability of the t th target.
Therefore, the probability that measurement j is associated with the t th target is

β_j^t(k) = Σ_{θ(k)} P{θ(k)|Z^k} ŵ_j^t(θ(k)) (13)

and the estimated values of the target state and state covariance are obtained by weighting the candidate measurements with these probabilities. The posterior probabilities P{θ(k)|Z^k} require the cumulative evaluation of all probability density functions, and it is obvious that the computational cost over all joint events increases exponentially with the number of measurements. Meanwhile, the weight factor involving V^φ will be nearly zero when the number of clutter measurements increases significantly, and the dimension explosion problem will occur.
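A small sketch makes the combinatorial growth concrete: the function below enumerates the joint events allowed by a validation matrix under the two hypotheses above. The recursive formulation is an illustration, not the paper's implementation:

```python
def count_joint_events(omega):
    """Count feasible joint events for a validation matrix.
    omega: rows are measurements; column 0 is the clutter ("no target")
    origin, columns 1..T are targets. Each measurement takes exactly one
    origin; a target can be used by at most one measurement."""
    def rec(j, used):
        if j == len(omega):
            return 1                      # all measurements assigned: one event
        total = 0
        for t, feasible in enumerate(omega[j]):
            # clutter (t == 0) is reusable, targets are not
            if feasible and (t == 0 or t not in used):
                total += rec(j + 1, used | ({t} if t else set()))
        return total
    return rec(0, frozenset())
```

Even two fully gated measurements and two targets already yield seven joint events, and the count grows exponentially with the number of gated measurements, which is the cost the RL-JPDA method is designed to avoid.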

Reinforcement Learning
RL has achieved a number of significant breakthroughs over time. Methods for solving RL problems can be divided into two kinds: on-policy and off-policy methods [32]. On-policy methods evaluate and improve the same policy that is used to make decisions. In off-policy methods, however, the policy being evaluated may differ from the policy used to generate the data: the data can be generated offline by applying a behavior policy to the system, while the learning process for the target policy proceeds separately. Thus, in off-policy methods, these two functions are decoupled. Off-policy methods reuse the experience acquired from the performed policy to update the value functions, which brings high efficiency and speed. Q-learning is a typical off-policy RL method, widely used due to its simplicity [33]. In Q-learning, the action with the highest expected Q-value is performed at each state; the agent then receives feedback from the environment, and the policy is improved. The Q-value is updated based on the reward as follows:

Q(s_t, a_t) ← Q(s_t, a_t) + λ[r_{t+1} + γ max_a Q(s_{t+1}, a) − Q(s_t, a_t)] (19)

where a_t is the current action, s_t is the current state, γ is the discount parameter, s_{t+1} is the next state, λ is the learning rate, r_{t+1} is the RL reward acquired from performing a_t at s_t, and Q(s_{t+1}, a) is the estimated Q-value when action a is performed at state s_{t+1}. The pseudocode of the Q-learning method is shown in Algorithm 1: The Q-learning method pseudocode.

Initialize
    Set the states s and the actions a
    For each state s_i and action a_i: set Q(s_i, a_i) = 0
    Randomly choose an initial state s_t
While the terminal condition is not reached do
    Choose the best action a_t for the current state s_t from the Q-table
    Perform a_t, observe the reward r_{t+1} and the next state s_{t+1}
    Update Q(s_t, a_t) according to the Q-value update rule
    s_t ← s_{t+1}
End While
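A tabular Q-value update matching the rule above can be sketched as follows; the λ and γ values and the dictionary-based table are illustrative placeholders:

```python
def q_learning_step(Q, s, a, r, s_next, lam=0.1, gamma=0.9):
    """One tabular Q-learning update:
    Q(s,a) <- Q(s,a) + lam * (r + gamma * max_a' Q(s',a') - Q(s,a)).
    Q is a dict of dicts: Q[state][action] -> value."""
    best_next = max(Q[s_next].values())          # max_a Q(s_{t+1}, a)
    Q[s][a] += lam * (r + gamma * best_next - Q[s][a])
    return Q[s][a]
```

Repeated application of this update over observed (s, a, r, s') transitions is all that the tabular method requires, which is what makes it attractive for the per-target Q-tables used later in the paper.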

RL-JPDA Development and Implementation
This section explains the procedure of the proposed RL-JPDA data association method, which consists of three major parts. After initializing the basic RL and JPDA parameters, for each scan, the candidate measurements and their distribution are acquired in Part 1. Then, the association probability is calculated according to the target motion characteristics and the candidate measurement distribution in Part 2. RL is leveraged in this step to make full use of the distribution law of the candidate measurements. The tracked targets are defined as the agents of RL, and eight subregions of the tracking gate are considered as the states in the Q-table. Each agent switches its action adaptively according to the distribution law: if performing the action yields better performance, a positive reward is given; otherwise, the agent is punished with a negative reward. In Part 3, the data association process is performed, and the Q-table is updated.
The flow chart of the RL-JPDA method is shown in Figure 1, and the pseudocode is illustrated in Algorithm 2. The detailed formulation is elaborated as follows. Algorithm 2. The pseudocode for the RL-JPDA method.

Initialize
    Set the basic parameters
    Set the states s = {s1, s2, s3, s4, s5, s6, s7, s8} and the actions a = {a1, a2, a3}
    Set the initial Q-table: Q^t(s, a) = 0
    Acquire the real measurements Z^t(k|k), k = 1, . . . , K_train of the training process
    Set k = 1
While k < K_max do
    Calculating candidate measurements
        If k < K_train: generate the clutter Z_training(k) by (20)
        Acquire the candidate measurements Z_j^t(k)
        Acquire the distribution of all candidate measurements
    Calculating association probability
        Calculate the metric D_{2,j}^t(k) by (25)
        For each candidate measurement: choose the best action a for the current state s from the Q-table

    Data association and Q-table update
        Estimate the state X^t(k|k) and the covariance P^t(k|k) by (30) and (9)
        If k < K_train
            Estimate the state X^t_train(k|k) by (31)
            Complete the data association of the training process with X^t_train(k|k)
            Calculate the cost value f^t_train(k) by (32)
            Calculate the reward r^t_train(k) by (33)
            Update the Q-table by (34)
        Else
            Complete the data association with X^t(k|k)
            Calculate the cost value f^t(k) by (39)
            Calculate the reward r^t(k) by (40)
            Update the Q-table by (41)
        End If
        k = k + 1
End While
Return the results and terminate

Figure 1.
The flow chart of the reinforcement learning-joint probabilistic data association (RL-JPDA) method.

Calculating Candidate Measurements
This paper mainly focuses on the situation in which the initial segment of multiple target tracking is clutter free and the subsequent measurements are mixed with clutter [34]. Thus, the target data association of the initial segment is regarded as the RL training process. During the training process, the state-action map of RL is established preliminarily. The proposed method reconstructs the computation of the joint association probabilities in JPDA with the state-action map of RL to acquire the available information of the measurements. When the target enters the clutter region, the RL agent chooses an action to acquire the estimated data association results according to the state-action map, and the estimated results are in turn used to update the state-action map to ensure the accuracy of the subsequent association process. This setting is mainly aimed at scenarios without off-line training time; if conditions permit, the training process can also be performed offline to obtain the state-action map. As a result, the proposed method can be applied to the whole tracking process with dense clutter.
In the training process, the clutter Z_training(k) at scan k is generated around the predicted measurement according to (20), where i = 1, 2, . . . , N_f indexes the clutter points, l represents the gate side length, and rand[0,1] is a random parameter limited to [0,1]. K_train is defined as the upper bound on the number of time epochs of the training process. Therefore, the measurements at scan k consist of the real measurements together with the generated clutter. The candidate measurements Z_j^t(k), j = 1, 2, . . . , N_C^t can be acquired by using (10). As shown in Figure 2, the tracking gate is established as a circular area centered at the predicted value with the ζ value given in (10) as its radius, and is divided into four quadrants. An extra separation boundary at radius ζ/2 is introduced, which generates eight subregions of the tracking gate representing the eight RL state values. In this way, the distribution of each candidate measurement can be acquired. Furthermore, the measurement distribution matrix M^t = [M_j^t] is defined, where M_j^t represents the distribution (i.e., the subregion) of the j th measurement. For example, suppose the first target (t = 1) has five candidate measurements (N_C^1 = 5) at the time epoch k = 30; the distribution of each candidate measurement is shown in Figure 3, from which the RL state of each measurement can be read off directly.
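The mapping from a candidate measurement to one of the eight subregion states can be sketched as follows; the numbering of quadrants and rings is an assumption, since the paper fixes the layout only in Figure 2:

```python
import math

def measurement_state(z, z_pred, zeta):
    """Map a 2D candidate measurement to one of 8 RL states: the circular
    gate of radius zeta around the predicted measurement z_pred is split
    into four quadrants, each divided at radius zeta/2 into an inner and
    an outer subregion. Returns None if z falls outside the gate.
    The 0-7 numbering convention here is illustrative only."""
    dx, dy = z[0] - z_pred[0], z[1] - z_pred[1]
    r = math.hypot(dx, dy)
    if r > zeta:
        return None                        # outside the tracking gate
    quadrant = (0 if dx >= 0 else 1) if dy >= 0 else (3 if dx >= 0 else 2)
    ring = 0 if r <= zeta / 2 else 1       # inner vs outer half of the gate
    return quadrant * 2 + ring
```

Collecting this state for every gated measurement yields the measurement distribution matrix described above.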

Calculating Association Probability
The association probability between the j th measurement and the t th target is calculated according to two metrics, D_{1,j}^t(k) and D_{2,j}^t(k), defined in this work. The basic cost value of D_{1,j}^t(k) is the Mahalanobis distance between the predicted measurement and each candidate measurement, scaled by the RL parameter w as in (23). Each basic cost value is thus affected by the distribution M_d^t of its measurement through the Q-learning method. Figure 4 illustrates the form of the Q-table, which is designed as an 8 × 3 matrix: the rows represent the states and the columns represent the actions. For each state, three actions are proposed to control the RL parameter w, as follows.

•	Increase action: it takes place when the agent lacks confidence. This action commonly happens when the agent finds that it has failed in some scan, where failure means that the cost value defined in (23) at scan k is worse than its value at scan k − 1. This failure decreases the agent's confidence and hence increases its RL parameter.

•	Decrease action: the agent's success may motivate this action; it reflects a right decision taken by the agent and hence increases its confidence, decreasing its RL parameter.

•	Maintain action: the current RL parameter keeps its present value, as there is no motivation for either increasing or decreasing it.
The three actions above directly affect the metric D_{1,j}^t(k) through the change factor ∆, as in (24). The metric D_{2,j}^t(k) measures the degree of matching between each candidate measurement and the kinetic characteristics of the target in the form of a Mahalanobis distance, as in (25), where Ẑ^t_{k−ν→k}(k|k − ν) is the predicted measurement at the k th scan calculated by (26) and (27) from the state vector X^t_{k−ν→k}(k|k − ν) of the t th target at the (k − ν) th scan, and ν is the procedure parameter. Figure 5 shows the computational process of the metric D_{2,j}^t(k) when ν = 3. The predicted measurement Ẑ^t_{k−3→k}(k|k − 3) is calculated by (26) and (27); the metric D_{2,j}^t(k) is then acquired by calculating the distance between Ẑ^t_{k−3→k}(k|k − 3) and Z_j^t(k). D_{2,j}^t(k) is smaller if the measurement Z_j^t(k) is more in line with the motion characteristics of the target; otherwise, it is amplified. Therefore, the association probability of each candidate measurement at scan k is calculated by combining the two metrics, as in (28).
In addition, the association probability is normalized by (29).
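The effect of the three actions on the RL parameter w can be sketched as follows; an additive step of size ∆ is assumed here for illustration, since the exact update rule of (24) is not reproduced in this excerpt:

```python
DELTA = 0.5  # change factor; the paper's experiments use 0.5

def apply_action(w, action, delta=DELTA):
    """Adjust the RL parameter w that scales the metric D1.
    'increase' raises w (agent lost confidence after a worse cost value),
    'decrease' lowers it (agent succeeded), 'maintain' leaves it unchanged.
    The additive form is an assumption, not the paper's exact Eq. (24)."""
    if action == "increase":
        return w + delta
    if action == "decrease":
        return max(w - delta, 0.0)   # keep the scale factor non-negative
    return w                          # "maintain"
```

The clamping at zero is a defensive choice in this sketch, so that repeated decrease actions cannot turn the cost scale negative.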

Data Association and Q-table Update
According to (7) and (9), the Kalman filter is used to estimate the next state of the target, as in (30). When the target enters the clutter region, the estimated results are used to complete the data association and the Q-table update. In the training process, however, the result of this state estimation is only used to update the Q-table: the real measurement is used to estimate the next state X^t_train(k|k) and to complete the data association according to the Kalman filter, as in (31). For the training process, the Euclidean distance between X^t(k|k) and X^t_train(k|k) is designed as the cost value f^t_train(k) in (32), the RL reward r^t_train(k) is calculated by (33), and the Q-table is updated by (34), where i = 1, 2, . . . , 8 indexes the RL states. When the target enters the clutter region, the predicted state X̂^t(k|k − 1) and the state estimate X^t(k|k) are propagated to the (k + 1) th scan by (35) and (36), and the corresponding predicted measurements Ẑ^t(k + 1|k − 1) and Ẑ^t(k + 1|k) are calculated by (37) and (38). The Mahalanobis distance between the predicted measurements Ẑ^t(k + 1|k − 1) and Ẑ^t(k + 1|k) is considered as the cost value f^t(k) in (39), where S^t(k + 1) = H^t(k + 1)P^t(k|k)H^t(k + 1)^T. Furthermore, the RL reward r^t(k) is calculated by (40), and the Q-table is updated by (41).
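The training-phase cost and reward can be sketched as follows; the ±1 reward shape is an assumption for illustration, while Eqs. (32)-(33) give the paper's exact definitions:

```python
import numpy as np

def training_cost(x_est, x_train):
    """Cost value f_train(k), Eq. (32): Euclidean distance between the state
    estimated from the candidate measurements and the state estimated from
    the real (clutter-free) measurements."""
    return float(np.linalg.norm(np.asarray(x_est) - np.asarray(x_train)))

def rl_reward(f_curr, f_prev):
    """Sign-based reward: an action that improves (reduces) the cost value is
    rewarded, otherwise punished. A +/-1 scheme is assumed here; the exact
    shapes of Eqs. (33) and (40) are not printed in this excerpt."""
    return 1.0 if f_curr < f_prev else -1.0
```

Feeding this reward into the tabular Q-value update closes the learning loop for each target's independent Q-table.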

Computing Complexity
As shown in Figure 1, the initialization process is performed once at the start, and the data association process is executed in each cycle. Let T be the number of targets, M the number of measurements obtained by the sensor at the k th scan, and N_C^t the number of candidate measurements of target t at the k th scan. For the initialization phase, the basic parameters are initialized, and the corresponding computing complexity is O(1). Then, the method starts to perform data association.
In Part 1, the M measurements include real measurements and generated clutter. The computing complexity of generating clutter is O(M − T), and the computing complexity of acquiring the candidate measurements is O(M · T) per scan. In Part 2, the metric D_{2,j}^t(k) mainly calculates the degree of matching between each candidate measurement and the kinetic characteristics of the target; the computing complexity of this operation is O(N_C^t). The metric D_{1,j}^t(k) needs to obtain the RL parameter and the distance between the predicted measurement and each candidate measurement, which costs O(Σ_{t=1}^T N_C^t) per scan, and the computing complexity of calculating the association probability is likewise O(Σ_{t=1}^T N_C^t). In Part 3, for the training process, the measurement association needs to acquire three quantities: the estimated covariance, the estimated state calculated from the candidate measurements, and the estimated state calculated from the real measurements, which together cost O(Σ_{t=1}^T N_C^t) per scan. When the target enters the clutter region, the measurement association only needs two quantities: the estimated covariance and the estimated state calculated from the candidate measurements. The computing complexity of updating the Q-table at each scan is O(T), since each target updates a single Q-table entry. Therefore, because M is greater than N_C^t, the maximum computing complexity of the proposed method is O(M · T) per scan.

The Experiments and Results
In this section, three experiments are designed to evaluate the effectiveness and feasibility of the RL-JPDA method. Comparative results with the GNN [35], JPDA [15], EDA [21], FOMJPDA [36], IFJPDA1 and IFJPDA2 [23] methods are also given to show the superiority of the proposed method. The initial parameters are set as follows: the upper limit of the training process K_train is set to 16, the upper limit of scans K_max is set to 100, the change factor ∆ is set to 0.5, the procedure parameter ν is set to 3, and the ellipsoid tracking gate size ζ is set to 9.21. Thirty Monte Carlo simulations are performed to acquire the experimental results.

Scenario of Two Targets with Constant Velocity
In this section, the clutter distributed in the field of view (FOV) of the sensor is modelled as uniform in space for space tracking applications [37], where λ_z denotes the mean return rate of the measurement clutter and V is the volume of the tracking gate. Two cases are considered to compare the performance of the methods under different clutter rates (λ_z = 20 and λ_z = 40, respectively). The targets are assumed to move in straight lines with constant velocity. Measurement data are created by simulating the actual target motion in two dimensions and then adding noise to the true measurements. The target state model is defined by (1) and (2), where the state transition matrix F and the measurement matrix H are given by:

F = [1 τ 0 0; 0 1 0 0; 0 0 1 τ; 0 0 0 1], H = [1 0 0 0; 0 0 1 0]

where τ is the sampling interval. The state vector X^t(k) = [x(k), ẋ(k), y(k), ẏ(k)]^T contains the target position and velocity, where x(k) and y(k) denote the x- and y-coordinates of the target, and ẋ(k) and ẏ(k) denote the corresponding velocities. The process noise and measurement noise are assumed to be Gaussian with zero mean and covariances Q and R, where q = diag(0.5² m²s⁻⁴, 0.5² m²s⁻⁴). The target detection probabilities are assumed to be 1.0 and the sampling interval is taken to be 1 s. The initial positions ((x, y) in meters) of the two targets are assumed to be (−30,500 m, 24,500 m) and (−25,250 m, 31,500 m) for Target 1 and Target 2, respectively.

In Case 1, Figure 6 shows the trajectory estimation of the RL-JPDA method, which indicates that the proposed method presents better trajectory association performance. The position estimation errors of the seven methods in Case 1 are illustrated in Figures 7 and 8. The position error is defined as:

e(k) = sqrt[(x̂(k) − x_true(k))² + (ŷ(k) − y_true(k))²]

where x_true and y_true are the real target positions, and x̂ and ŷ are the estimated target positions. It is obvious that the proposed method performs better in the data association process than the other methods because it employs RL and the motion characteristics. The position error of the IFJPDA2 method is slightly higher than that of the proposed method, and all other methods perform poorly in Case 1. For the second case, the density of clutter is increased. Because of the dimension explosion, the JPDA method cannot complete the trajectory association mission. Figure 9 shows the trajectory estimation result of the RL-JPDA method; the trajectory associated by the proposed method still presents good performance. The position errors of the seven methods in Case 2 are illustrated in Figures 10 and 11. The position errors of the other methods in Case 2 are larger than those in Case 1. This is mainly because the association errors of the targets increase with the clutter density, which results in a performance decrease for all methods. In addition, the RL-JPDA method outperforms the GNN, JPDA, EDA, FOMJPDA, IFJPDA1 and IFJPDA2 methods as the clutter density increases.
The error results also show that the proposed method can complete the trajectory association mission accurately in dense clutter environments. Table 1 shows that the RMS errors of RL-JPDA are 24.90 m and 26.60 m, which are significantly better than those of the other methods. The RMS results of IFJPDA1 are worse than those of IFJPDA2, but the execution time of IFJPDA2 reaches 1.34 s. This is because the degree of association in the IFJPDA2 method is obtained by splitting the validation matrix during the computational process, an operation that greatly increases the computational complexity. The proposed method does not need to perform this operation, and its computational complexity does not increase rapidly with increasing clutter density.
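The constant-velocity model and the position-error metric used in this scenario can be sketched as follows; a standard CV layout for the state [x, ẋ, y, ẏ] is assumed:

```python
import numpy as np

def cv_model(tau=1.0):
    """Constant-velocity state-space matrices for the state [x, vx, y, vy].
    tau is the sampling interval (1 s in the paper's experiments)."""
    F = np.array([[1, tau, 0, 0],
                  [0, 1,   0, 0],
                  [0, 0,   1, tau],
                  [0, 0,   0, 1]], dtype=float)
    H = np.array([[1, 0, 0, 0],     # measure x position
                  [0, 0, 1, 0]], dtype=float)  # measure y position
    return F, H

def position_error(x_est, y_est, x_true, y_true):
    """Euclidean position error between estimated and true positions."""
    return float(np.hypot(x_est - x_true, y_est - y_true))
```

One application of F advances the position by velocity times τ, which is the motion assumption behind the straight-line trajectories of this scenario.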

Scenario of Three Targets with Constant Acceleration
In this section, the targets are assumed to move with constant acceleration, and two cases with different clutter densities are again considered to compare the performance of the methods. The state transition matrix F and the measurement matrix H are given by:

F = [1 τ τ²/2 0 0 0; 0 1 τ 0 0 0; 0 0 1 0 0 0; 0 0 0 1 τ τ²/2; 0 0 0 0 1 τ; 0 0 0 0 0 1], H = [1 0 0 0 0 0; 0 0 0 1 0 0]

where τ is the sampling interval. The state vector X^t(k) = [x(k), ẋ(k), ẍ(k), y(k), ẏ(k), ÿ(k)]^T contains the target position, velocity and acceleration, where x(k) and y(k) denote the x- and y-coordinates of the target, ẋ(k) and ẏ(k) denote the corresponding velocities, and ẍ(k) and ÿ(k) denote the corresponding accelerations. The process noise covariance Q and the measurement noise covariance R are defined analogously to the constant-velocity scenario.

For Case 1, Figure 12 shows the trajectory estimation result of the RL-JPDA method; the trajectory associated by the proposed method shows significant performance. The mean position errors of the seven methods in Case 1 are illustrated in Figures 13-15. It is obvious that the proposed method obtains better estimates and achieves better performance than the other methods. For the second case, the JPDA method still cannot complete the trajectory association mission because of the dimension explosion. Figure 16 shows the trajectory estimation result of the RL-JPDA method, and the trajectory associated by the RL-JPDA method again shows better performance. This is because the proposed method uses RL to acquire the association probability, which differs from JPDA, FOMJPDA, IFJPDA1 and IFJPDA2; as the state estimation of the targets becomes more accurate, the tracking performance also improves. The mean position errors of the seven methods in Case 2 are illustrated in Figures 17-19. It is obvious that the proposed method has the best performance in trajectory estimation, while the other methods cannot maintain stable performance in tracking three targets. The comparison results of the RMS errors and execution time are illustrated in Table 2. In Case 1, the RMS errors of RL-JPDA are 44.48 m, 57.21 m and 61.94 m, which are obviously superior to those of the other methods. The execution time of RL-JPDA is 0.75 s, whereas that of JPDA is 4.91 s. These data indicate that embedding RL improves the calculation of the association probability in JPDA, and the computational complexity is greatly reduced. Meanwhile, when the targets move with constant acceleration, the tracking results of the data association methods based on fuzzy clustering are not stable.
Thus, the thirty Monte Carlo results of JPDA are better than those of FOMJPDA, IFJPDA1 and IFJPDA2. The EDA method has poor performance but the minimum execution time. The GNN method yields the maximum RMS error, which indicates that it has the worst estimation result for the trajectory association of multiple targets with constant acceleration. As the clutter density increases, the computation grows explosively because the number of valid measurements that fall into the tracking gate increases. However, the execution time of RL-JPDA is 0.92 s, which indicates that the proposed method has lower computational complexity than all other methods except EDA.
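The constant-acceleration counterpart of the earlier motion model can be sketched as follows; a standard CA layout with state [x, ẋ, ẍ, y, ẏ, ÿ] is assumed:

```python
import numpy as np

def ca_model(tau=1.0):
    """Constant-acceleration state-space matrices for the state
    [x, vx, ax, y, vy, ay]; tau is the sampling interval."""
    block = np.array([[1, tau, 0.5 * tau**2],
                      [0, 1,   tau],
                      [0, 0,   1]], dtype=float)
    F = np.kron(np.eye(2), block)     # block-diagonal: one block per axis
    H = np.zeros((2, 6))
    H[0, 0] = H[1, 3] = 1.0           # measure x and y positions only
    return F, H
```

The per-axis block adds τ²/2 times the acceleration to the position each scan, which is exactly the nonzero-acceleration behavior that destabilizes the fuzzy-clustering-based methods compared in this scenario.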

Scenario of Reentry Vehicle
In this section, a reentry vehicle tracking scenario is used to verify the performance of the proposed method, and two cases with different degrees of target proximity are considered. Because of the strong nonlinearities exhibited by the forces of aerodynamic drag, gravity and the random buffeting terms acting on the vehicle, the tracking problem of a reentry vehicle is particularly stressful for data association methods. The vehicle dynamic model is given in [38], where x_1(k) and x_2(k) are the position of the vehicle, x_3(k) and x_4(k) are the velocity of the vehicle, x_5(k) is a parameter of its aerodynamic properties, G(k) is the gravity term, D(k) is the drag term, and w_i(k), i = 1, 2, 3 is the process noise. The position of the vehicle is tracked by a radar located at (x_r, y_r), which measures the range r and bearing θ [37], where v_1(k) and v_2(k) are zero-mean measurement noises. The nonlinear filtering method of [37] is used for target state estimation.
For Case 1, Figure 20 shows the trajectory estimation result of the proposed method. The true trajectories consist of three crossing tracks, and the estimated result of the proposed method shows excellent performance. The mean position errors of the seven methods are illustrated in Figures 21-23. The performance of the RL-JPDA method is better than that of all other methods, because the proposed method can acquire the motion characteristics of the reentry vehicle by training and online learning, which improves the accuracy of data association. For the second case, Figure 24 shows the trajectory estimation result of the RL-JPDA method; the trajectory associated by the RL-JPDA method still shows better performance. The mean position errors of the seven methods in Case 2 are illustrated in Figures 25-27. Because of the proximity of the targets in Case 2, the position errors of Case 2 are larger than those of Case 1. This is mainly due to the fact that close targets increase the chance of erroneous association, which decreases the performance of all methods. However, the results of EDA in Case 1 and Case 2 change only slightly, which means the performance of EDA is not obviously affected by the change in the distance between targets. The proposed method performs better than the other methods in solving the data association mission for close targets. Moreover, the comparison results of the RMS errors and execution time are illustrated in Table 3 with the clutter rate λ_z = 10 (for realistic reentry vehicle tracking, the clutter rate cannot be too high). As shown in Table 3, because of the nonlinear variation caused by aerodynamic drag, the results of GNN and JPDA show poor performance. The RMS errors of RL-JPDA in Case 1 are 34.90 m, 32.69 m and 32.19 m, which proves that the proposed method still has a strong association effect for a nonlinear motion model.
The performance of IFJPDA2 is better than that of FOMJPDA and IFJPDA1, but worse than that of the proposed method. The execution time of all methods is extended due to the frequent invocation of the target dynamics function during the association process. However, the execution time of RL-JPDA is 1.30 s, while that of JPDA is 2.82 s; these data indicate that the computational complexity of the RL-JPDA method is lower than that of JPDA. Meanwhile, when the seven methods are used to solve the data association problem for close targets, the execution times of JPDA and IFJPDA2 are extended obviously, because close targets increase the number of situations in which a measurement can be assigned to multiple targets, which significantly increases the number of joint event matrices in these two methods. However, the execution time of RL-JPDA is 1.32 s, and its RMS errors are 45.01 m, 58.61 m and 28.41 m. These data indicate that the proposed method still has better performance. Table 3. RMS errors and execution time of the reentry vehicle example.

In summary, the above experimental results show that the combination of RL and JPDA can significantly improve the trajectory association performance, especially in dense clutter environments. The structure of the JPDA method provides reliable association accuracy. Tables 1-3 show that the execution time of RL-JPDA is much less than that of JPDA. These data indicate that the JPDA method has higher computational complexity, and that integrating the reinforcement learning process into the traditional JPDA method facilitates better handling of measurement clutter, so as to achieve effective data association results. Meanwhile, the position information of the measurements inside the tracking gate is taken into full account, and the motion characteristics of the targets are introduced as a constraint, which further improves the association performance of the proposed method.

Analysis of RL-JPDA Control Parameters
The value of the training process parameter K_train is set according to the length of the clutter-free initial segment of multiple target tracking. If the training process is too short, the accuracy of data association in the initial segment will be affected; meanwhile, since the clutter density increases gradually as the association process proceeds, the training process should not be too long either. The setting of the change factor ∆ affects the performance of the RL: when ∆ approaches 1, the metric D_{1,j}^t(k) fluctuates dramatically whenever the RL action is switched, whereas the change in D_{1,j}^t(k) becomes negligible when a small value of ∆ is given. The procedure parameter ν is set according to the motion characteristics of the target; its value cannot be large because there are errors in the dynamic model of the target. In addition, the tracking gate is an important underlying technology of the data association method. The tracking gate size ζ should be chosen to contain as little clutter and interference as possible, which ultimately improves the data association performance.

Conclusions
In this paper, a novel data association method based on reinforcement learning, called RL-JPDA, has been presented for solving multiple target tracking data association problems in environments with dense clutter. The proposed method reconstructs the computation of the joint association probabilities in JPDA by means of reinforcement learning, which is employed to acquire the available information of the measurements. The distribution of the measurements is defined as the states of RL, and the estimated results are regarded as the evaluative signals. In particular, the learning process of each target is independent, which means that the same distribution of measurements may lead to different association results for different targets due to the independent Q-tables. In addition, the motion characteristics of the targets are exploited to ensure the accuracy of the association results. Finally, the performance of the proposed method has been compared with six other methods in three scenarios in terms of error statistics and execution time. The results show that the RL-JPDA method is superior to the other six methods and can solve the data association problem effectively in environments with dense clutter.