Track-to-Track Association for Intelligent Vehicles by Preserving Local Track Geometry

Track-to-track association (T2TA) is a challenging task in situational awareness in intelligent vehicles and surveillance systems. In this paper, the problem of track-to-track association with sensor bias (T2TASB) is considered. Traditional T2TASB algorithms only consider a statistical distance cost between local tracks from different sensors, without exploiting the geometric relationship between one track and its neighboring ones from each sensor. However, the relative geometry among neighboring local tracks is usually stable, at least for a while, and is thus helpful in improving T2TASB. We propose a probabilistic method, called the local track geometry preservation (LTGP) algorithm, which takes advantage of the geometry of tracks. Assuming that the local tracks of one sensor are represented by Gaussian mixture model (GMM) centroids, the corresponding local tracks of the other sensor are fitted to those of the first sensor. In this regard, a geometrical descriptor connectivity matrix is constructed to exploit the relative geometry of these tracks. The track association problem is formulated as a maximum likelihood estimation problem with a local track geometry constraint, and an expectation–maximization (EM) algorithm is developed to find the solution. Simulation results demonstrate that the proposed method offers better performance than the state-of-the-art methods.


Introduction
Reliable situational awareness plays an essential role in intelligent vehicles and surveillance systems [1][2][3][4][5][6]. Typical intelligent vehicles employ various types of sensors, such as radio detection and ranging (radar), light detection and ranging (lidar), and video. The radar sensor determines the relative location and the radial velocity of objects by emitting radio signals. Radar measurements often consist of false alarm detections in addition to detections from real objects or targets while missing some target-originated detections. The lidar sensor uses laser light to detect objects. Compared to radar, it provides more detailed measurements at an increased cost. Video sensors are feature-rich with a wide field-of-view, but they are more sensitive to different illumination and weather conditions [1]. Since these sensors have different sensing capabilities, features, and accuracies, the use of multiple heterogeneous sensors can result in more reliable and multi-modal environment perception systems. Therefore, pedestrians, vehicles, and obstacles are typically detected and tracked using a multi-sensor system in intelligent vehicles [7][8][9]. A multi-sensor multi-target tracking module jointly estimates the states and the number of targets from sensor measurements in intelligent vehicles, and it can be broadly categorized as centralized or distributed. The advantage of the distributed tracking systems is that they can provide a degree of scalability and robustness not achievable by traditional centralized tracking systems [1].
Track-to-track association (T2TA) is a crucial task in distributed tracking to find the correspondence between local tracks from different sensors. It is commonly applied to combine the local tracks of a sensor with those of another sensor to form the global track list. For automotive applications, radar, lidar, and video sensors in environmental perception systems for intelligent vehicles use different coordinate systems and sampling frequencies. Therefore, a spatio-temporal calibration should be performed to align the detections from different sensors [1]. In practice, detections from radar, lidar, and video sensors cannot always be calibrated or aligned accurately [10]. Each sensor may cover a different part of the surveillance region with a detection probability of less than one. As a result, some local tracks from a sensor may not correspond to those of other sensors. The range, azimuth, and elevation biases of a radar sensor may lead to errors in the local tracks from that sensor. The relationship between radar sensor bias and local tracks is presented in Figure 1, where two radar sensors, A and B, and one target, T, are shown. The radar sensor bias leads to the reporting of the target T as tracks $T_A$ and $T_B$ by radars A and B, respectively. Note that if the biases of radars A and B are significant, the distance between $T_A$ and $T_B$ is correspondingly large. In this case, $T_A$ and $T_B$ are considered by the T2TA module as originating from two different targets. Therefore, T2TA in intelligent vehicles or surveillance systems faces many challenges, including missed detections and measurement bias. In this paper, the focus is on the problem of independent T2TA for each frame in the presence of missed detections and sensor bias. To formulate T2TA as an optimization problem, different statistical distances or metrics have been proposed in the literature [11][12][13][14][15][16][17][18][19][20][21][22][23][24].
In [11], a weighted statistical distance is proposed for T2TA with the assumption that the local estimation error of one sensor is independent of those of other sensors for the same target. In [13], the independence assumption is relaxed, and a modified statistical distance with dependent errors is developed for T2TA. In [19], three algorithms based on the squared Mahalanobis distance are investigated, and the nearest neighbor (NN) and global nearest neighbor (GNN) algorithms are applied to compute the distance between two tracks for T2TA. In [20], a likelihood function for T2TA from multiple sensors is derived, and the multidimensional assignment algorithm is employed to solve the optimal matching problem. State augmentation data, which combines the kinematic state information and the additional feature state information, is proposed to perform T2TA in [21]. In [25], a track association algorithm is proposed based on the permutation matrix to support the track-to-track multi-sensor data fusion for multiple targets in an autonomous driving system. It is worth noting that most T2TA algorithms in intelligent vehicles employ the conventional GNN algorithm [25][26][27][28].
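The distance-based association idea surveyed above can be illustrated with a minimal sketch. It assumes independent estimation errors between the two sensors (so the covariances simply add) and uses a plain nearest-neighbor rule with a gate; the function names, the toy tracks, and the gate value are hypothetical choices for illustration and are not taken from the cited works.

```python
import numpy as np


def squared_mahalanobis(x1, P1, x2, P2):
    """Squared Mahalanobis distance between two track estimates.

    Assumption: the two estimation errors are independent, so the
    covariance of the difference x1 - x2 is P1 + P2.
    """
    d = np.asarray(x1, dtype=float) - np.asarray(x2, dtype=float)
    S = np.asarray(P1, dtype=float) + np.asarray(P2, dtype=float)
    return float(d @ np.linalg.solve(S, d))


def nearest_neighbor_association(tracks_a, tracks_b, gate):
    """For every track of sensor A, return the index of the closest track
    of sensor B, or None when even the closest one lies outside the gate.

    Each track is a (state, covariance) pair. Plain NN does not enforce a
    one-to-one assignment; enforcing it is what GNN adds.
    """
    result = []
    for xa, Pa in tracks_a:
        dists = [squared_mahalanobis(xa, Pa, xb, Pb) for xb, Pb in tracks_b]
        if not dists:
            result.append(None)
            continue
        best = int(np.argmin(dists))
        result.append(best if dists[best] < gate else None)
    return result


I2 = np.eye(2)
tracks_a = [([0.0, 0.0], I2), ([10.0, 10.0], I2)]
tracks_b = [([10.5, 9.5], I2), ([0.2, -0.1], I2), ([50.0, 50.0], I2)]
# gate = 9.21 is roughly the 99% chi-square quantile for 2 degrees of freedom
print(nearest_neighbor_association(tracks_a, tracks_b, gate=9.21))  # [1, 0]
```

Because the distance depends only on the absolute positions of the two tracks, an uncompensated sensor bias shifts every distance directly, which is the weakness discussed in the following paragraphs.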
Nevertheless, most of the above methods do not consider the presence of sensor bias. In reality, T2TA performance degrades significantly with sensor bias [29]. The literature addressing this problem can be roughly divided into batch and online approaches. The batch approach is an offline implementation that estimates the track association and sensor bias using all local tracks [30][31][32]. A joint sensor registration and track-to-track fusion method is derived using an equivalent measurement method in [30], while a pseudo-measurement approach is adopted to handle registration and track fusion simultaneously in [31]. In [32], a joint registration, data association, and fusion method in a distributed sensor network is formulated as a maximum likelihood (ML) optimization problem. An expectation-maximization (EM) algorithm is then proposed to perform the ML optimization, joint association, and bias removal by following an iterative strategy. However, these methods are susceptible to being trapped in local minima and have high computational costs. The online approach is a real-time implementation that performs track association with sensor bias. In [33], relative position information among neighboring tracks is analyzed, and a reference topology feature is derived for the absolute position information. An optimal sub-pattern assignment (OSPA) metric is also proposed to construct the association cost for T2TASB. In [34], the OSPA metric is modified by compensating for the relative azimuth bias. In [35], the T2TASB is formulated as a point set registration problem, and a coherent point drift (CPD) algorithm is proposed to perform T2TASB. In the CPD algorithm, local tracks of one sensor are represented by Gaussian mixture model (GMM) centroids [4,36], and the local tracks of all sensors are fitted to those of a reference sensor.
Still, the CPD algorithm only exploits the relationship among local tracks from different sensors, i.e., it does not utilize the relative geometric relationship between a local track and its neighbors from each sensor. The geometry among neighboring local tracks is usually stable, at least for a while, and is thus helpful in improving T2TASB. Here, the use of geometry is inspired by the idea that the relationship between a local track and its neighbors from different sensors could be preserved after the transformation.
In this paper, the problem of independent T2TA for each frame in the presence of missed detections and sensor bias is considered. A probabilistic method, called the local track geometry preservation (LTGP) algorithm, is proposed to handle T2TASB. In the proposed method, the local tracks of one sensor are represented by GMM centroids, and the local tracks of the other sensor are fitted to those of the first using a nonlinear transformation function. The local track geometry with k-connected neighborhood is developed, and the T2TASB is formulated as an ML optimization problem with an EM algorithm being proposed to address it.
Different from other literature, the main contributions of this paper are as follows:
1. The mathematical formulation for T2TASB is presented. Moreover, the local track geometry with k-connected neighborhood is derived to improve the robustness and accuracy of T2TASB. The proposed method extends the CPD method by considering the geometric relationship between neighboring tracks.
2. An EM algorithm is proposed for T2TASB. The optimal T2TASB correspondence matrix and transformation function between local tracks are estimated simultaneously.
3. The performance of the proposed method is validated by computer simulations and by experiments using the KITTI dataset.
This paper is organized as follows. The formulation of T2TASB is presented in Section 2. In Section 3, the EM algorithm is used to estimate the parameters in the proposed method. The performance of the proposed approach is evaluated using computer simulations and experiments on the KITTI dataset in Sections 4 and 5, respectively. Finally, conclusions and future work are discussed in Section 6.

A New Method for T2TASB
In this section, the T2TASB problem is formulated and a new solution is proposed. Let $X^s_k$ denote the local tracks from sensor $s$ at time $k$, where $X^s_k = \{x^s_{i,k}\}_{i=1}^{N^s_k}$. Here, $x^s_{i,k}$ denotes the $i$-th track state estimate and the corresponding covariance from sensor $s$ at time $k$, where $k = 1, 2, \ldots, K$ and $i = 1, 2, \ldots, N^s_k$, with $K$ and $N^s_k$ being the total number of discrete time steps and the number of tracks reported by sensor $s$ at time $k$, respectively. Two sensors are used to find the global track states, i.e., $s = 1, 2$. The objective of the T2TASB algorithm is to find the correspondence between $X^1_k$ and $X^2_k$. In [35], T2TASB is considered as a probability density estimation problem. In this paper, the relative geometry among neighboring local tracks from each sensor is used to formulate a maximum likelihood estimation problem with a local track geometrical constraint. Assuming that the local tracks of one sensor are represented by GMM centroids, the corresponding local tracks of the other sensor are fitted to those of the first sensor. Let $x^1_{t,k}$ be the $t$-th data point and $x^2_{l,k}$ be the centroid of the $l$-th component. That is,

$$p(x^1_{t,k} \mid z^k_{t,l} = 1) = \mathcal{N}\big(x^1_{t,k};\, f(x^2_{l,k}),\, \sigma_k^2 I\big) = (2\pi\sigma_k^2)^{-D/2} \exp\!\left(-\frac{\lVert x^1_{t,k} - f(x^2_{l,k}) \rVert^2}{2\sigma_k^2}\right)$$

where $\mathcal{N}$ denotes the Gaussian distribution; $\sigma_k^2$ denotes the equal isotropic covariance at time $k$; $f$ denotes the nonrigid transformation; $I$ is the identity matrix; $D$ is the size of a local track vector; and $z^k_t$ is a binary indicator vector with elements $z^k_{t,l}$. The $z^k_{t,l}$ satisfy $z^k_{t,l} \in \{0, 1\}$ and $\sum_l z^k_{t,l} = 1$. That is, only one element of the vector $z^k_t$ is 1 while all other elements are 0. The density of $x^1_{t,k}$ is then the mixture in (2), whose components are weighted by the membership probabilities $\pi^k_{t,l}$. Here, an additional distribution with weight $w$ is employed to represent the component of a target detected by sensor 1 but not detected by sensor 2. The relationship between the local tracks $x^1_{t,k}$ and $x^2_{l,k}$ is described by the nonrigid transformation $f$, which aligns the local tracks; some nonlinear functions might be employed to approximate it as well. More details on the nonrigid transformation are given in Appendix A.
Here, the displacement function is adopted as in [35], so that the transformed tracks are

$$f(X^2_k) = X^2_k + G_k W_k$$

where $W_k$ is an $N^2_k \times D$ weight matrix of the Gaussian kernel; $G_k$ denotes an $N^2_k \times N^2_k$ Gaussian kernel matrix with elements $g_{ij}$; and $\beta$ denotes the width parameter of the smoothing Gaussian filter. To enforce the smoothness of the transformation $f$, the constraint on the weight matrix $W_k$ can be given by [35,37]:

$$\mathrm{Tr}\big(W_k^T G_k W_k\big)$$

where $\mathrm{Tr}(\cdot)$ denotes the trace of a matrix and superscript $T$ denotes transposition. By ignoring the constants independent of $\{\sigma_k^2\}$ and $\{W_k\}$, the objective function of T2TASB can be written as $Q$ in (7), where $\alpha$ is the trade-off parameter and $q(z^k_{t,l})$ denotes $p(z^k_{t,l} = 1 \mid X^1_k, X^2_k)$. Consider the membership probability $\pi^k_{t,l}$ in (2), which is assumed to be the same for all components in [35]. Here, $\pi^k_{t,l}$ is initialized using the traditional nearest neighbor (NN) method [38] as follows:
(1) If track $t$ from sensor 1 is associated with track $l$ from sensor 2 at time $k$ using NN assignment, the membership probability is set according to $\tau$, where $0 \le \tau \le 1$ is the confidence in the association made by the NN method.
(2) If track $t$ from sensor 1 is not associated with any track from sensor 2 at time $k$ using NN assignment, then the uniform membership probability is applied.

The transformation $f$ uses the relationship between local tracks from different sensors, but does not consider the relative geometry between one track and its neighbors from each sensor. The geometry is inspired by the idea that the relationship between a local track and its neighbors from different sensors could be preserved after the transformation. To ensure an accurate T2TA, a geometrical constraint on the local tracks is proposed in this paper. A schematic illustration of the geometrical constraint is given in Figure 2.
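The Gaussian-kernel displacement model and the smoothness term described above can be sketched numerically as follows. This is only an illustration under stated assumptions: the kernel form $g_{ij} = \exp(-\lVert x_i - x_j \rVert^2 / (2\beta^2))$ is the one commonly used with CPD and is assumed here rather than taken from this paper, and all function names are hypothetical.

```python
import numpy as np


def gaussian_kernel_matrix(X, beta):
    """N x N Gaussian kernel matrix over the rows of X.

    Assumed form: g_ij = exp(-||x_i - x_j||^2 / (2 * beta^2)).
    """
    X = np.asarray(X, dtype=float)
    diff = X[:, None, :] - X[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=2)
    return np.exp(-sq_dist / (2.0 * beta ** 2))


def transform_tracks(X, W, beta):
    """Apply f(X) = X + G W, i.e. move each track by a smooth displacement."""
    G = gaussian_kernel_matrix(X, beta)
    return np.asarray(X, dtype=float) + G @ np.asarray(W, dtype=float)


def smoothness_penalty(X, W, beta):
    """Regularization term Tr(W^T G W) that penalizes non-smooth displacements."""
    G = gaussian_kernel_matrix(X, beta)
    W = np.asarray(W, dtype=float)
    return float(np.trace(W.T @ G @ W))


X2 = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])  # toy sensor-2 track positions
W = np.zeros((3, 2))                                  # zero weights: identity transform
print(transform_tracks(X2, W, beta=1.0))
print(smoothness_penalty(X2, W, beta=1.0))
```

With zero weights the tracks are left unchanged and the penalty is zero; nonzero rows of `W` displace a track and, through the kernel, its nearby tracks in a correlated way.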
[Figure 2: Schematic illustration of the geometrical constraint. Legend: local tracks from sensor 1; transformed local tracks from sensor 2.]
We desire to preserve the geometry of the tracks $X^2_k$ after the nonrigid transformation $f$. Based on the Euclidean distance between each local track and its neighbors in $X^2_k$, the $M$ nearest neighbors of each local track in $X^2_k$ are obtained. Then, each point in $X^2_k$ is represented as a weighted linear combination of its $M$ nearest neighbors.
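Let $L_{lj}$ denote the weight of track $j$ in the reconstruction of track $l$. If track state $x^2_{j,k}$ does not belong to the $M$ nearest neighbors of track state $x^2_{l,k}$, then $L_{lj}$ is set to 0. Here, matrix $L$ is obtained by minimizing the following cost function:

$$E(L) = \sum_{l=1}^{N^2_k} \Big\lVert x^2_{l,k} - \sum_{j=1}^{N^2_k} L_{lj}\, x^2_{j,k} \Big\rVert^2$$

where the sum of each row of $L$ is equal to 1. After the nonrigid transformation, the local track geometry can be preserved by minimizing the same cost evaluated on the transformed tracks $x^2_{i,k} + G_k(i, \cdot)\, W_k$, where $G_k(i, \cdot)$ is the $i$-th row of $G_k$. The objective function of T2TA with sensor bias is then obtained by combining $Q$ in (7) with this geometry term, where $\gamma$ controls the trade-off between $Q$ and $E(L)$.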
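The construction of the weight matrix $L$ can be sketched as below. The sketch follows the standard locally linear reconstruction recipe (a small regularized linear system per track, with weights normalized so that each row sums to 1); the regularization constant `reg` and the function names are assumptions made for illustration, not details given in the paper.

```python
import numpy as np


def local_geometry_weights(X, M, reg=1e-3):
    """Row-stochastic weight matrix L reconstructing each point from its
    M nearest neighbors (Euclidean distance). Non-neighbors get weight 0.
    """
    X = np.asarray(X, dtype=float)
    N = X.shape[0]
    if not 1 <= M < N:
        raise ValueError("M must satisfy 1 <= M < number of tracks")
    L = np.zeros((N, N))
    for l in range(N):
        d2 = np.sum((X - X[l]) ** 2, axis=1)
        d2[l] = np.inf                      # a track is not its own neighbor
        nbrs = np.argsort(d2)[:M]
        Z = X[nbrs] - X[l]                  # neighbors relative to track l
        C = Z @ Z.T                         # local Gram matrix (M x M)
        C += (reg * np.trace(C) + 1e-12) * np.eye(M)  # assumed regularization
        w = np.linalg.solve(C, np.ones(M))
        L[l, nbrs] = w / w.sum()            # enforce that the row sums to 1
    return L


def reconstruction_cost(X, L):
    """E(L): sum of squared errors when every point is rebuilt from its neighbors."""
    X = np.asarray(X, dtype=float)
    return float(np.sum((X - L @ X) ** 2))


rng = np.random.default_rng(0)
tracks = rng.uniform(-100.0, 100.0, size=(12, 2))   # toy track positions in km
L = local_geometry_weights(tracks, M=4)
print(L.sum(axis=1))                                 # every row sums to 1
print(reconstruction_cost(tracks, L))
```

Because the weights depend only on differences between a track and its neighbors, they are unchanged by a common translation or rotation of all tracks, which is one way to see why such a descriptor is insensitive to a shared sensor bias.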

EM Solution for the Proposed Method
Let $\Theta = \{\{Z_k\}, \{\sigma_k^2\}, \{W_k\}\}$ be the unknown parameters. To obtain an ML estimate of $\Theta$, the EM algorithm is applied here. The EM algorithm alternates between two steps, an expectation step (E-step) and a maximization step (M-step), where $m$ denotes the iteration number of the algorithm. The E-step calculates the conditional expectation using the current estimate $\Theta^{(m)}$, whereas the M-step provides an updated estimate, $\Theta^{(m+1)}$. The estimate of $\Theta$ is refined by iterating through these two steps so that the complete-data likelihood function is maximized.

E-Step
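First, $q(z^k_{t,l})$, i.e., the posterior probability $p(z^k_{t,l} = 1 \mid X^1_k, X^2_k)$, is found using Bayes' theorem from the current parameter estimates.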
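A minimal sketch of such a posterior computation is given below. It assumes that the membership probabilities act as prior weights for $N^2_k$ isotropic Gaussian components plus one extra column for the "not detected by sensor 2" component, and that this extra component has a constant density `outlier_density`; both the exact mixture form and that constant are assumptions for illustration, since the paper's own expression is not reproduced here.

```python
import numpy as np


def e_step(X1, Y, pi, sigma2, outlier_density):
    """Posterior membership probabilities (responsibilities).

    X1 : (N1, D) tracks of sensor 1
    Y  : (N2, D) transformed tracks of sensor 2
    pi : (N1, N2 + 1) prior membership probabilities; the last column is
         the extra "no counterpart in sensor 2" component (assumed layout)
    Returns an (N1, N2 + 1) matrix whose rows sum to 1.
    """
    X1 = np.asarray(X1, dtype=float)
    Y = np.asarray(Y, dtype=float)
    pi = np.asarray(pi, dtype=float)
    D = X1.shape[1]
    sq_dist = np.sum((X1[:, None, :] - Y[None, :, :]) ** 2, axis=2)
    gauss = np.exp(-sq_dist / (2.0 * sigma2)) / (2.0 * np.pi * sigma2) ** (D / 2.0)
    extra = np.full((X1.shape[0], 1), outlier_density)
    joint = pi * np.hstack([gauss, extra])       # prior times likelihood
    return joint / joint.sum(axis=1, keepdims=True)  # Bayes normalization


X1 = np.array([[0.0, 0.0], [100.0, 100.0]])
Y = np.array([[0.1, 0.0], [5.0, 5.0]])
pi = np.full((2, 3), 1.0 / 3.0)
q = e_step(X1, Y, pi, sigma2=1.0, outlier_density=1e-4)
print(q.argmax(axis=1))  # first track matches component 0, second falls to the extra column
```

In this toy case the first sensor-1 track is almost certainly explained by the first sensor-2 track, while the second one is far from every centroid and is absorbed by the extra component.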

M-Step
Then, $E_L(\Theta, \Theta^{(m)})$ is rewritten in matrix form, where $\mathrm{diag}(\cdot)$ indicates a diagonal matrix; $R_k$ is an $N^1_k \times N^2_k$ matrix with elements $q(z^k_{t,l})$ for $t = 1, 2, \ldots, N^1_k$, $l = 1, 2, \ldots, N^2_k$; $B_k = (I - L)^T \mathrm{diag}(R_k^T \mathbf{1}) (I - L)$; $\mathbf{1}$ represents the all-one column vector of corresponding length; and $I$ is the identity matrix. The estimates of $\sigma_k^2$ and $W_k$ are iteratively updated by setting the corresponding partial derivatives of the expected log-likelihood to zero. Setting the derivative with respect to $\sigma_k^2$ to zero yields the update of $\sigma_k^2$. Similarly, setting the derivative with respect to $W_k$ to zero yields an equation that is solved for $W_k$.
Here, $C_k$ denotes the cost matrix of T2TASB at time $k$, an $N^1_k \times (N^2_k + 1)$ matrix with $(t, l)$ element $C_k(t, l)$ for $t = 1, 2, \ldots, N^1_k$, $l = 1, 2, \ldots, N^2_k + 1$, where $C_k(t, N^2_k + 1)$ represents the cost of not making an assignment. The assignment of track $t$ from sensor 1 to track $l$ from sensor 2 can occur only if $C_k(t, l) < C_k(t, N^2_k + 1)$ for $t = 1, 2, \ldots, N^1_k$, $l = 1, 2, \ldots, N^2_k$, with $C_k(t, N^2_k + 1)$ acting as a gate. If the gate is violated, no assignment is made for that track. The solution of the above assignment problem is computed using the Hungarian algorithm [39].
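The gated assignment step can be illustrated on a small cost matrix. The sketch below enumerates all feasible assignments by brute force, which is only practical for a handful of tracks and is a stand-in for the Hungarian algorithm used in the paper; the function name and the example costs are hypothetical.

```python
from itertools import product


def gated_assignment(C):
    """Minimum-cost one-to-one assignment with a non-assignment option.

    C is an N1 x (N2 + 1) cost matrix given as a list of rows; the last
    column holds the cost of leaving a sensor-1 track unassigned and also
    acts as the gate: track t may take track l only if C[t][l] < C[t][N2].
    Returns (assignment, total_cost), where assignment[t] is a sensor-2
    index or None. Brute force: intended for tiny examples only.
    """
    n1 = len(C)
    n2 = len(C[0]) - 1
    options = []
    for t in range(n1):
        allowed = [l for l in range(n2) if C[t][l] < C[t][n2]]
        options.append(allowed + [None])
    best_cost, best = float("inf"), None
    for choice in product(*options):
        used = [l for l in choice if l is not None]
        if len(used) != len(set(used)):
            continue  # a sensor-2 track may be used at most once
        cost = sum(C[t][n2] if l is None else C[t][l] for t, l in enumerate(choice))
        if cost < best_cost:
            best_cost, best = cost, list(choice)
    return best, best_cost


cost_matrix = [
    [1.0, 4.0, 3.0],   # track 0: only column 0 passes the gate (3.0)
    [1.5, 2.0, 3.0],   # track 1: both columns pass the gate
    [5.0, 6.0, 3.0],   # track 2: nothing passes the gate, so it stays unassigned
]
print(gated_assignment(cost_matrix))  # ([0, 1, None], 6.0)
```

Track 1 would individually prefer sensor-2 track 0, but the globally cheapest solution gives that track to track 0, which is the difference between a global assignment and a greedy nearest-neighbor choice.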
The proposed LTGP method for T2TASB is summarized in Algorithm 1.

Computer Simulations
In this section, the performance of the proposed method is evaluated using simulated data. Thirty targets following a discretized nearly constant velocity motion model [40] are tracked by multiple radar sensors. The initial target positions are randomly generated in the region [−100 km, 100 km] × [−100 km, 100 km]. The initial velocities of these targets are chosen as [0.5 km/s, 0.2 km/s]. The covariances of the process and measurement noise components are respectively set to $\mathrm{diag}(10^{-4}\,\mathrm{km}^2,\ 10^{-4}\,\mathrm{km}^2/\mathrm{s}^2,\ 10^{-4}\,\mathrm{km}^2,\ 10^{-4}\,\mathrm{km}^2/\mathrm{s}^2)$ and $\mathrm{diag}(10^{-4}\,\mathrm{km}^2,\ 10^{-5}\,\mathrm{rad}^2)$, where the cross-covariance terms have been ignored in the former [40]. The clutter is generated uniformly over the surveillance region, with the number of clutter points at each time step drawn from a Poisson distribution with a mean of 30. The sampling period of the measurements is 1 s. The number of time steps is 100.
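A target trajectory of the kind described above can be generated with a short sketch. It assumes the state ordering [x, vx, y, vy] (chosen to match the order of the process-noise variances) and a diagonal process-noise covariance; the function names and the random seed are hypothetical.

```python
import numpy as np


def ncv_transition(T):
    """State transition matrix of the nearly constant velocity model
    for the assumed state ordering [x, vx, y, vy] and sampling period T."""
    return np.array([
        [1.0, T, 0.0, 0.0],
        [0.0, 1.0, 0.0, 0.0],
        [0.0, 0.0, 1.0, T],
        [0.0, 0.0, 0.0, 1.0],
    ])


def simulate_ncv(x0, steps, T, q_diag, rng):
    """Simulate one target; q_diag holds the diagonal process-noise variances.
    Returns an array of shape (steps + 1, 4) including the initial state."""
    F = ncv_transition(T)
    std = np.sqrt(np.asarray(q_diag, dtype=float))
    traj = [np.asarray(x0, dtype=float)]
    for _ in range(steps):
        noise = rng.normal(0.0, std)      # independent noise per state component
        traj.append(F @ traj[-1] + noise)
    return np.array(traj)


rng = np.random.default_rng(0)
# Random initial position in [-100 km, 100 km]^2, velocity [0.5 km/s, 0.2 km/s]
x0 = [rng.uniform(-100.0, 100.0), 0.5, rng.uniform(-100.0, 100.0), 0.2]
trajectory = simulate_ncv(x0, steps=100, T=1.0, q_diag=[1e-4] * 4, rng=rng)
print(trajectory.shape)  # (101, 4)
```

Repeating the call for thirty independent initial positions would give a target set comparable in size to the one used in this section; clutter and missed detections are not modeled in this sketch.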
Two radar sensors are considered in the distributed sensor network. The biases in the two sensors are set to $\eta_1 = [1\ \mathrm{km}, -0.017\ \mathrm{rad}]^T$ and $\eta_2 = [-2\ \mathrm{km}, 0.034\ \mathrm{rad}]^T$. The detection probabilities $P_d$ of both radars are chosen as 0.95. Measurement-to-track association is performed at each sensor without considering the sensor bias. The local tracks from sensor 1 and sensor 2 are illustrated in Figure 3. Parameter $\tau$ denotes the confidence in the association by the NN method. Parameter $w$ denotes the initial assumption on the number of false targets detected by sensor 1, but not detected by sensor 2. Parameter $\beta$ represents the width of the smoothing Gaussian filter in the nonlinear transformation function. Parameter $M$ represents the number of nearest neighbors used in linear reconstruction to preserve the local track structure, while $\rho$ is the parameter in the cross-covariance fusion. We set $\tau = 0.5$, $w = 0.2$, $\beta = 0.1$, $M = 10$, and $\rho = 0.4$ throughout this paper.
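The effect of the range and azimuth biases on the reported target position can be sketched as follows. The sketch is noise-free, and the sensor locations and the target position are hypothetical values chosen for illustration, since they are not specified in the text; only the bias vectors are taken from the setup above.

```python
import math


def biased_position(target_xy, sensor_xy, bias):
    """Cartesian position reported by a radar with additive range and azimuth
    biases (bias = (range_bias_km, azimuth_bias_rad)); measurement noise is ignored."""
    dx = target_xy[0] - sensor_xy[0]
    dy = target_xy[1] - sensor_xy[1]
    r = math.hypot(dx, dy) + bias[0]          # biased range
    theta = math.atan2(dy, dx) + bias[1]      # biased azimuth
    return (sensor_xy[0] + r * math.cos(theta),
            sensor_xy[1] + r * math.sin(theta))


target = (50.0, 50.0)        # km, hypothetical target position
sensor_a = (0.0, 0.0)        # km, hypothetical sensor locations
sensor_b = (100.0, 0.0)
eta_1 = (1.0, -0.017)        # biases from the simulation setup
eta_2 = (-2.0, 0.034)

track_a = biased_position(target, sensor_a, eta_1)
track_b = biased_position(target, sensor_b, eta_2)
# Separation between the two reports of the same target
print(math.dist(track_a, track_b))
```

Even without noise, the two sensors report the same target at positions that are kilometers apart, which is why a purely distance-based association can fail in this scenario.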
Parameters $\alpha$ and $\gamma$ represent the trade-off regularization terms. The ranges of these parameters were determined experimentally. The correct association probability $P_c$, defined as the ratio of the number of correctly assigned tracks to the total number of tracks, is employed as the primary metric for performance evaluation. The variation of $P_c$ with regularization parameters $\alpha$ and $\gamma$ at time step $k = 50$ is shown in Figure 4. It is observed that the proposed method performs best when $\alpha \in [5, 7]$ and $\gamma \in [10, 20]$. Here, we set $\alpha = 6$ and $\gamma = 15$.
The proposed LTGP algorithm is used for T2TASB, and the results at time step $k = 50$ are given in Figure 5, which illustrates that the local tracks from the two sensors are associated correctly by the proposed LTGP method. The performance of the proposed method is demonstrated next relative to those of GNN without registration, the reference pattern-based algorithm [33], and the CPD algorithm [35]. All results are averaged over 50 Monte Carlo runs. The proposed method achieves the best performance, as illustrated in Figure 6. Compared to the reference pattern-based algorithm, the CPD algorithm improves the $P_c$ by about 8%. The $P_c$ of the proposed method is improved by a further 5% compared with the CPD algorithm. Furthermore, the results for scenarios with varying detection probabilities and different numbers of targets are illustrated in Figures 7 and 8, respectively. It is observed that the proposed algorithm outperforms the other three benchmark algorithms. From Figure 7, the performance of GNN without registration, the reference pattern-based algorithm, and the CPD algorithm degrades rapidly as the detection probability decreases. From Figure 8, the performance of the proposed method remains almost constant as the number of targets increases. Moreover, the average $P_c$ of the proposed method is improved by approximately 9% compared with the CPD algorithm. The computational complexity of the proposed LTGP algorithm is analyzed next.
For simplicity, the same number of local tracks, $N$, is assumed for each sensor at every time step. At each time step in the LTGP algorithm, the computational complexity of searching the $M$ nearest neighbors for each local track in $X^2_k$ is $O((M + N) \log N)$ using the k-d tree [41]; the computational complexity of obtaining matrix $L$ is $O(M^3 N)$; and the complexity of the EM algorithm is almost $O(N^3)$ [42]. The computational complexity at each time step in the LTGP algorithm is therefore $O(N^3)$, and the total computational complexity of the proposed LTGP algorithm is $O(N^3 K)$, where $K$ is the total number of time steps (measurement samples).
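The correct association probability $P_c$ used as the primary metric in this section can be computed as in the sketch below. It assumes that associations are stored per sensor-1 track as a sensor-2 index or `None`, and that correctly leaving an unmatched track unassigned counts as correct; this convention and the function name are assumptions for illustration.

```python
def correct_association_probability(estimated, truth):
    """P_c: fraction of sensor-1 tracks whose estimated association equals the
    ground truth. Entries are sensor-2 indices, or None for "no counterpart"."""
    if len(estimated) != len(truth):
        raise ValueError("estimated and truth must have the same length")
    if not truth:
        return 0.0
    correct = sum(1 for e, t in zip(estimated, truth) if e == t)
    return correct / len(truth)


estimated = [0, 1, None, 3]
truth = [0, 2, None, 3]
print(correct_association_probability(estimated, truth))  # 0.75
```

Averaging this value over time steps and Monte Carlo runs gives a summary figure of the kind reported above.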

Experiments on KITTI Dataset
In this section, the proposed data association method is evaluated using the KITTI multi-object tracking dataset [43]. The vehicle tracking test sequences 01 and 20, and pedestrian tracking test sequences 16 and 17 are used. Each sequence consists of 30 frames. Figure 9 depicts the starting frames of the left and right cameras for each sequence along with the results of object detection. For the left camera, the detection results for vehicles or pedestrians are provided by the ground truth. Meanwhile, the deformable part model detector method [44] is used to detect vehicles or pedestrians in the images of the right camera.
The ground truth matching between the left and right images in each frame is confirmed by manual annotation. The GNN without registration, the reference pattern-based algorithm, the CPD algorithm, and the proposed method are employed to associate the local tracks. The average T2TA matching accuracies of the different T2TA methods are depicted in Figure 10. It is confirmed that the performance of the proposed method is substantially better than those of GNN without registration, the reference pattern-based algorithm, and the CPD algorithm. Compared with the CPD algorithm, the average performance of the proposed method is improved by about 7.8%. In addition, since KITTI sequence 17 contains heavy pedestrian occlusion and more motion than the other sequences, the performance gap is more evident for this sequence than for the others. The proposed LTGP method performs better than the three benchmark algorithms on KITTI sequence 17. This is because the proposed method preserves the geometry of local tracks in the data association. The average run-times of these algorithms are given in Table 1, which reveals that the proposed LTGP method has higher computational complexity than the GNN without registration, reference pattern-based, and CPD methods.

Conclusions
A probabilistic method for track-to-track association, namely LTGP, was proposed in this paper. In the LTGP method, the local tracks of one sensor were transformed to fit those of the other sensor using a nonlinear function. We utilized k-connected neighbors to preserve the relative local track geometry. The T2TASB problem was formulated as a probability density estimation problem, and the EM algorithm was used to associate biased tracks from two sensors. To illustrate the advantages of the proposed method, computer simulations and experiments on the KITTI dataset were performed, and the results were compared with those of GNN without registration, the reference pattern-based algorithm, and the CPD algorithm. The computer simulations involved varying detection probabilities and different numbers of targets; the proposed method performed better than the other algorithms for all detection probabilities and numbers of targets. On the KITTI dataset, the proposed LTGP method also performed better than the other methods, and its T2TA matching accuracy was improved by about 7.8% compared with the CPD method. From the simulation and KITTI results, it can be concluded that the proposed LTGP method outperforms the GNN without registration algorithm, the reference pattern-based algorithm, and the CPD algorithm, but at a higher computational load.
The proposed method is not restricted to the considered application and can be extended in the future to other tasks, such as multi-sensor T2TASB for connected vehicles. For the multi-sensor T2TASB scenario, the LTGP method can be extended using sequential processing.
Acknowledgments: The authors gratefully acknowledge the Autonomous Vision Group for providing the KITTI dataset. The authors also would like to thank the editors and referees for their valuable comments and suggestions.

Conflicts of Interest:
The authors declare no conflict of interest.
From (A5)-(A8), we obtain Equation (A9), which involves the rotation matrix

$$\begin{bmatrix} \cos\theta_{12} & \sin\theta_{12} \\ -\sin\theta_{12} & \cos\theta_{12} \end{bmatrix}$$

where $\theta_{12} = \Delta\theta_1 - \Delta\theta_2$. Equation (A9) gives the relationship between the local tracks $x^1_{t,k}$ and $x^2_{j,k}$. In this paper, a nonrigid transformation as in (5) is proposed to approximate the relationship between the local tracks of one sensor and those of the other.

Abbreviations
$\pi^k_{t,l}$: Membership probability of the $t$-th row and $l$-th column element in $\pi^k$ at time $k$
$\pi^k$: Membership probability matrix at time $k$
$Z_k$: Indicator matrix
$z^k_t$: a $1 \times N^1_k$ binary vector for $l = 1, 2, \ldots, N^2_k$ at time $k$
$z^k_{t,l}$: the $t$-th row and $l$-th column element in $z^k_t$ at time $k$
$W_k$: an $N^2_k \times D$ dimensional weight matrix of the Gaussian kernel
$G_k$: an $N^2_k \times N^2_k$ Gaussian kernel matrix
$g_{ij}$: the $i$-th row and $j$-th column element in $G_k$
$\beta$: the width parameter in the smoothing Gaussian filter
$\mathrm{Tr}(\cdot)$: Trace of a matrix
$L$: $N^2_k \times N^2_k$ weighted matrix
$L_{lj}$: the $l$-th row and $j$-th column element in $L$
$G_k(i, \cdot)$: the $i$-th row of $G_k$
$\gamma$: Trade-off parameter controlling the balance between $Q$ and $E(L)$
$R_k$: an $N^1_k \times N^2_k$ matrix
$C_k$: Cost matrix of T2TASB at time $k$, an $N^1_k \times (N^2_k + 1)$ matrix
$[x^1_{t,k}, y^1_{t,k}]$: x-axis and y-axis positions of target $x^1_{t,k}$
$[x^2_{j,k}, y^2_{j,k}]$: x-axis and y-axis positions of target $x^2_{j,k}$