Hierarchical Network-Based Tracklets Data Association for Multiple Extended Target Tracking with Intermittent Measurements

The key issue of multiple extended target tracking is to differentiate the origins of the measurements. The association of measurements with the possible origins within the target’s extent is difficult, especially for occlusions or detection blind zones, which cause intermittent measurements. To solve this problem, a hierarchical network-based tracklet data association algorithm (ET-HT) is proposed. At the low association level, a min-cost network flow model based on the divided measurement sets is built to extract the possible tracklets. At the high association level, these tracklets are further associated with the final trajectories. The association is formulated as an integral programming problem for finding the maximum a posterior probability in the network flow model based on the tracklets. Moreover, the state of the extended target is calculated using the in-coordinate interval Kalman smoother. Simulation and experimental results show the superiority of the proposed ET-HT algorithm over JPDA- and RFS-based methods when measurements are intermittently unavailable.


Introduction
Conventional multiple target tracking (MTT) algorithms assume that objects can be represented as points and allow only a single measurement per sensor scan. However, modern high-resolution sensors have revealed the existence of targets that can generate multiple measurements per scan [1]. This challenges the suitability of the conventional point-target assumption. In such scenarios, multiple extended target tracking (METT) provides a more appropriate approach as it specifically addresses the tracking of targets that can produce multiple measurements per scan. The MTT with more measurements per target is called multiple extended target tracking [2,3].
Data association plays a vital role in multi-target tracking by differentiating between false alarms and actual targets. Incorrect data association can significantly impact the performance of tracking multiple targets. For point targets, several methods for data association have been proposed, which can be found in [4]. On the other hand, data association for extended targets involves the challenging task of matching measurements to specific targets, and this complexity increases with the number of targets and measurements. To address this, multiple data association techniques have been developed specifically for multiple extended target tracking scenarios. Vivone introduced a method that incorporates a detector and joint probability data association (JPDA) tracker specifically designed for METT [5]. Additionally, the multi-detection JPDA (MD-JPDA) algorithm was developed to address many-to-one associations in high-resolution radar sensors [6]. However, MD-JPDA suffers from high complexity due to exhaustive combinations of measurements. To mitigate this issue, MD-JIPDA, an extension of MD-JPDA, integrates the existence probability, reducing complexity [7,8]. Another approach by Yang focuses on calculating

Problem Formulations
The METT problem aims to estimate states and parameters while dealing with measurements from multiple extended targets and the presence of clutter. At each time step k, the states of the multiple extended targets are denoted as S k = {s 1,k , s 2,k , .., s N T ,k }, N T representing the number of extended targets. Each extended target state is defined as s i,k = [x i,k , X i,k ] T . The subset of the state x i,k includes the centroid's location m i,k and the velocityṁ i,k of the centroid of the target, x i,k = [m i,k ,ṁ i,k ] T . X i,k ∈ R d is the extension state that describes the size and shape of the i-th target. At the time step k, the unordered set of measurements containing the clutter and target Z k is received by the sensor. Z k = {z 1,k , z 2,k , . . . , z M k ,k }, where M k = ∑ N T i M i,k + M C,k represents the number of measurements. M i,k is defined as the total number of measurements that originates from the i-th extended target. The value of clutter M C,k is Poisson-distributed with mean λ C .
In multiple extended target tracking, the measurements may intermittently be unavailable at any time. This is due to occlusions, detection blind zones, or even low frame rates caused by the work mode of the radar. To cater for the missing measurement, a binary variable representing the existence of the extended target is introduced, inspired by [23]. r k = {r 1,k , r 2,k , . . . , r N T ,k }, with r i,k ∈ {0, 1}, where the values of 0 and 1 correspond to the loss of the data and the measurement set of the i-th extended target is successfully received, respectively. The time-homogeneous binary Markov process has a transition probability matrix given by where S = {0, 1} is defined as the state space of the Markov process. The parameters p f ,i and q s,i denote the failure rate and recovery rate, 0 < p f ,i , q s,i < 1, such that the Markov process {r k } k≥0 is ergodic. Obviously, a smaller value of p f and a larger value of q s indicate a more reliable measurement received. Thus, the obtained measurement set of the i-th extended target at the time step k can be described as The objective of METT is to accurately and robustly track multiple targets, even in challenging scenarios with occlusions, target interactions, intermittent measurements, and other complexities [24].

Hierarchical Network-Based Data Association for Multiple Extended Target Tracking
In this paper, emphasis is placed on the modeling of the data association for multiple extended targets (METs), utilizing a hierarchical dense neighborhood search approach. To achieve this, the density-based spatial clustering algorithm is employed to divide measurements at a specific time into clusters. Subsequently, a low-level association network is constructed based on the clustering results, enabling the calculation of tracklets. The association of clusters is formulated as a maximum a posteriori (MAP) problem, taking into account target initiation, termination, and false trajectories arising from clutter. At a higher level, the association network estimates trajectories by utilizing the tracklets derived from the low-level network. The paper presents the flowchart of the ET-HT filter in Figure 1, illustrating the stepwise process.

Pre-Processing
Since the set of measurements includes extended targets and clutter, it is necessary to extract the measured data of every extended object in one time step. Considering the property of the measurements of the extended target in that detections are spatially distributed around, the density-based spatial clustering algorithm is used to partition the set of measurements into multiple partitions. For the set of measurements Z k which is received by the sensor at the time k, the set of subpartitions is Π k = (π 1,k , π 2,k , . . . π N π ,k ) where π i,k represents a single subpartition, and N π is the number of all subpartitions. Note that the traditional partition algorithms include distance partition, subpartition, and K-means clustering. These methods require the specification of the amount of clusters.
The density-based spatial clustering can determine the number of clusters automatically in this paper, based on the intrinsic structure of the set of measurements. The details of the implementation can be found in [25].

Hierarchical Association a. Low-level association network
Due to the intermittent observations in the extended target tracking system, the extended target trajectories may be divided into several unconnected tracklets. Let . . , τ n i ) represent the i-th extended target trajectory, where τ j i is defined as the j-th tracklet of the i-th target, and the number of trajectories is unknown. T = (T 1 , T 2 , . . . , T m ) is the set of the tracklets. In this section, the set of subpartitions is denoted as Π 1:k = (Π 1 , Π 2 , . . . , Π k ). The low-level association defines Π 1:k as the input, and uses the network flow method to generate the tracklets. The key here is to calculate a MAP estimate for T with a set of a cluster of measurements Π 1:k .
In this chapter, the G(V, E) is defined as a directed acyclic graph, where V represents the set of nodes and E represents the set of edges, with s denoting the nodes of the graph and n being the set of edges of the graph. This gives the set of graph nodes V and E, with v i defined as a subpartition and each edge e j in G(V, E) representing the motion between the subpartitions. In solving the data association process, the concept of network flow is used to represent f ij as a directed flow variable from node v i to node v j , where f si and f jn denote the starting flow variable and the ending flow variable, respectively. Each flow in the graph is subject to the following constraints. Firstly, the sum of the flows arriving at node v i is equal to the sum of the flows leaving this node at the same moment. For any tracklet τ j i : Secondly, the cost flow network must ensure that nothing other than a single extended object can be represented at one time. The upper bound of the sum of outgoing flows from node v i is set to 1. For any node: Considering that targets can appear or disappear anywhere in the cost-flow network, a source node and a sink node are introduced in [26], which are connected to the respective nodes. The source and sink nodes are also subject to a constraint that enforces all the flows starting in s to end in n.
Through the network optimization process, Equation (4) is transformed into an integer programming (IP) problem, and the logarithm of the objective function is given by where c si is the flow cost from the source node to the subpartition π i , c ij is the flow cost from the subpartition π i to the subpartition π j , and c jn is the flow cost from the subpartition π j to the sink node. Hence, the IP problem can be described as The cost can be defined as follows: In this paper, the data association problem is transformed into a MAP estimation for T . The A * search algorithm is used to solve the problem, and the optimum trajectory T * is obtained.

b.
High-level association network In this section, the tracklet-based network flow framework is represented as G * (V * , E * ) to obtain the trajectories of the multiple extended targets. It can be also formulated as a MAP problem.
where Ψ i denotes the merged trajectory, Note that the set of linear constraints is similar to those of Equations (5)- (7). The only difference is in the tracklet-based network flow model, where each node represents a tracklet extracted in the low-level association stage, which is a set of continuous measurements in a batch of time frames.
Assuming that the likelihoods of the input tracklets are conditionally independent given the merged trajectories set, and each merged trajectory is independent, the cost flow network of the tracklets is given by where P(Ψ j ) represents a Markov Chain. P s (T * i,0 ), P l (T * i,l−1 |T * i,l ) and P t (T * i,l ) represent the initial probability, transition probability, and terminal probability, respectively.
The link cost between two tracklets is defined as where p m (T * i |T * j ) and p t (T * i |T * j ) represent the motion cost and time cost, respectively. The motion-associated probability is defined as where ∆t denotes the frame gap between the last detection set of the tracklet T * j and the first detection set of the tracklet T * i .  In this case, it is assumed that the error of the predicted location and the central location of the detection set ∆p 1 and ∆p 2 follows a Gaussian distribution. The smaller the error between the predicted location and the actual position of the target to be connected, the greater the motion similarity between the corresponding track slices will be. The temporal associated cost is defined as where ς is the upper bound of the frame gap. The initialization probabilities and termination probabilities for each tracklet are set to be Similar to the low-level association process, the network flow model based on tracklets is established to solve the motion cost and time cost between the tracklets, and the optimal trajectories are obtained using the A* search algorithm.

Trajectory Smoothness
In this section, the in-coordinate interval Kalman smoothing algorithm is used to calculate the extended target's state [27]. The algorithm consists of forward filtering and backward smoothing. Assume that the interval between two frames is defined as k(i) = k(0) + i × T, (i = 0, 1, 2, . . . , M l ), where T is the sampling time. At the time k(i)(i = 0, 1, 2, . . . , M l − 1), the Kalman filter only predicts the state of the extended target until a new measurement arrives at time k(M l ) [28]. The time update step is as follows: where s k(i) denotes the extended target state at the time k(i), and F k(i−1) is the transition matrix. Once the measurements arrive at time k(M l ), the forward filtering step is wherez k(M l ) represents the central location of the measurements of the extracted extended target at time k(M l ). Here, the state and covariance matrix at each time are calculated via the backward recursive method starting from the last time k max , given by where s k|k max and P k|k max represent the smoothness state estimation and covariance matrix at time k, respectively. J k is the smoothness gain matrix.

Numerical Simulation and Experiments
In this section, the feasibility of the proposed algorithm is verified by numerical simulation and experimental verification.

Case 1: Numerical Simulation
In the simulation, the validity of the ET-HT filter is tested. Consider a 2D surveillance region which is set as [3000, 10 [1,5] s and [5,30] s, respectively. The kinematic state is given by where s i,k = [x i,k , y i,k ,ẋ i,k ,ẏ i,k ] is the state variable. x i,k and y i,k represent the location of the targets.ẋ i,k andẏ i,k represent the velocity of the targets. F i,k is the kinematic state transition matrix of the i-th target. w i,k represents the Gaussian process noise of the i-th target with zero mean and covariance Q i,k , Q i,k = diag[100, 100, 0, 0]. The measurement model is defined as where H k is the observation matrix, z i,k is defined as the measurements generated by the i-th target at time k, and e i,k denotes the Gaussian measurement noise of the i-th target with zero mean and covariance R i,k , R i,k = diag[100, 100].
In the simulation, the clutter Poisson rate λ c is set to 600, and then the clutter density λ c c k is 3 × 10 −15 . The failure and recovery rate are set to p f = 0.2 and q s = 0.8, respectively. The simulated target scenario with intermittent measurements is shown in Figure 3. In the experiment, the absolute mean number of targets estimation error (NTE) [29] and the optimal subpattern assignment (OSPA) distance [30] are taken into account as metrics to evaluate the performance of the proposed ET-HT filter against the ET-PHD, ET-JPDA, ET-PMBM, and ET-NFPHD filters.
The absolute mean number of targets estimation error is defined as where X k and Y k are two finite subsets, and |X k | and |Y k | represent the potential of the two subsets, respectively. The OSPA distance between X k and Y k is defined as the distance between the position and the potential of the two sets.
where c is the penalty cost for the cardinality mismatch, and p is the order of the OSPA metric, 1 < p < ∞, c > 0. In the simulation, they are set to 10 and 2, respectively. The OSPA distances and NTEs of the five filters are depicted in Figure 4. Comparatively, the ET-HT filter exhibits a lower OSPA distance and NTEs when compared to the other filters. In contrast, the performance of the ET-JPDA filter is influenced by clutter and necessitates prior knowledge regarding the number of targets. As the estimation error increases, the OSPA distance and NTE of the ET-JPDA filter also escalate. On the other hand, the ET-PHD filter leverages the first-order approximate moment of the multi-target density to convey valuable information about target potential. When the motion trajectories intersect, the targets are partitioned into one cell, which leads to an increase in the OSPA distance. Meanwhile, the ET-PHD and ET-PMBM filters require precise models of both targets and clutter. However, the lack of prior information and the interference of clutter result in a large discrepancy between the estimated and actual number of targets, which incurs increases in the NTE. The ET-NFPHD filter utilizes the PHD filter to process sensor measurements and extract the PHD component at different time steps. These PHD components are integrated into the network flow graph as nodes. By solving the minimum cost maximum flow problem among these nodes, the optimal data association path can be determined. As a result, the ET-NFPHD filter demonstrates a reduced OSPA distance and NTE compared to the ET-PHD filter. However, due to the direct modeling of the PHD filter's update data, the algorithm suffers from accumulated update errors, resulting in inferior tracking performance compared to the ET-HT filter proposed in this study.  In order to illustrate the influence of clutter on the experiment, we designed simulation experiments with different clutter numbers. Figure 5 presents the average OSPA distances and NTEs of the five filters at varying clutter rates. The ET-HT filter consistently demonstrates smaller average OSPA distance and NTEs compared to the ET-JPDA, ET-PMBM, ET-PHD and ET-NFPHD filters. To delve further into the influence of intermittent probability, the investigation extends to a different failure rate p f and recovery rate q s . The OSPA distances and NTEs of the five filters are illustrated in Figure 6a,b, respectively, considering diverse recovery rates in 100 Monte Carlo runs. Likewise, Figure 7a,b depicts the OSPA distances and NTEs of the filters for varying failure rates with 100 Monte Carlo runs. The results from Figures 6 and 7 clearly demonstrate that the ET-HT filter exhibits superior tracking performance compared to the ET-PHD, ET-PMBM, ET-JPDA, and ET-NFPHD filters irrespective of the transformation of the measurement recovery and loss rates. This is because the ET-HT filter associates the tracklets into long tracks through both low-level and high-level associations, thereby reducing the impact of intermittent measurements.
On the other hand, the experimental results of the ET-JPDA filter show that the OSPA distance of the ET-JPDA filter tends to decrease as the measurement recovery rate q s increases, as shown in Figure 6a. However, if the measurement loss rate p f increases, as shown in Figure 7a, it may result in a mismatch between the tracking gate and the measurement, thereby increasing the OSPA distance. Similarly, the results obtained using the ET-PHD filter shown in Figures 6b and 7b demonstrate that with an increasing recovery rate, some pieces of clutter are incorrectly identified as targets, causing errors in target estimation and an increase in the NTE. Conversely, a higher loss rate p f leads to greater data loss, resulting in a decrease in the effective data volume at each time step. In the ET-PHD filters, significant data loss implies that fewer measurement data are associated with targets, leading to a decrease in the estimated number of targets and NTE. The presence of clutter impacts the partitioning of measurements, causing the OSPA distance and NTE of the ET-PMBM filter to exceed those of the ET-HT filter. The depicted results in Figures 6 and 7 demonstrate the consistent superiority of the ET-NFPHD filter over the ET-PHD filter, irrespective of variations in intermittent probability. These findings serve as evidence supporting the notion that the integration of the flow network method effectively enhances the accuracy of the PHD filter.

Time Complexity
The time complexity of the ET-HT filter consists of two main components, namely the adaptive spectral clustering algorithm and the worst-case scenario of the search subgraph. The time complexity of the former is O(n × (N tc − n tc )) + O n 3 tc + O(K tc N tc I tc ) · O(n tc × (N tc − n tc )), where K tc is the number of nodes in the undirected graph, n tc is the sample points, N tc is the number of trajectories, and I tc is the iteration number of the K-means algorithm. The worst time complexity of the searching sub-graph is O(n 2 max ) [31]. The time complexity of the ET-JPDA filter is O(N tc × n 2 tc ), where N tc is the initial number of targets, and n tc is the sampling time. The time complexity of the ET-PHD filter is O(N tc × n 2 tc ) + O(n tc × M tc × N 2 tc × d 2 tc ), where the parameter M tc is generally greater than 1000, and d tc is the dimension of the state vector. The time complexity of the ET-PMBM filter is O(N 3 tc × H 3 tc ), where H tc represents the number of hypothetical combinations. The time complexity of the ET-NFPHD filter is O(N tc × n 2 tc ) + O(K tc N tc I tc ). The proposed algorithm and existing algorithms are coded with MatlabR2019b, and the experiments are run on a computer with an Intel Core i74770k CPU at 3.5 GHz. The running times of the five filters for 100 Monte Carlo runs are shown in Table 1.  Table 1 demonstrates that the ET-HT filter exhibits a reduced running time in comparison to both the ET-PHD filter and the ET-PMBM filter. This indicates that the ET-HT filter possesses a lower time complexity, leading to significant enhancements in computational efficiency. Although the running time of the ET-HT filter exceeds that of the ET-JPDA filter, its accuracy in estimating the state and number of targets is significantly superior. Hence, it is crucial to prioritize accuracy in tracking rather than solely emphasizing the attainment of low time complexity.

Case 2: Real Data Verification
To evaluate the applicability of the proposed filter, case 2 incorporates real scenarios. These selected scenes encompass more randomness and a higher number of moving targets, reflecting real situations. The positional data of each pedestrian were obtained with a pedestrian detector with discriminatively trained part-based models [32]. Firstly, a classifier that uses a decision rule learned from the data was constructed to determine whether the image window contains pedestrians. Next, the image was divided into multiple sub windows and each sub window was described using features that were invariant to lighting effects. Then, each sub window was provided to the classifier, which determined whether the sub-window contained pedestrians, and their position coordinates were extracted.
The input for the ET-HT filter consisted of a collection of measurement points gathered from all frames during the motion sequence. In situations where pedestrians were obscured by a streetlight, the measurement points could not be extracted, thereby simulating intermittent observation scenarios.
The ET-HT filter framework was applied to a scenario from the PETS 2009S0CC dataset. In this dataset, we focused on a specific scene featuring six pedestrians walking. Their paths intersected at specific frames (295, 305, 345, 365, and 367). Additionally, in frames 319, 339, and 351, the pedestrians were obstructed by a street light, as illustrated in Figure 8a. Figure 8b shows the extracted measurements. Figure 9 shows the tracking results of the ET-HT filter. Following a similar approach as in case 1, the PETS2009S0CC dataset was utilized to evaluate the performance of the ET-HT filter, ET-JPDA filter, ET-PHD filter, ET-PMBM, and ET-NFPHD filters of a single example run. The OSPA distances and NTEs were computed for each filter with the time step t = 10 s. The obtained results are illustrated in Figure 10.
As illustrated in Figure 9, the ET-HT filter achieves a better performance when dealing with real data with intermittent measurements and is able to accurately estimate the trajectories of the targets. The results in Figure 10a,b show that the tracking error of the ET-JPDA filter increases with the increase in clutter number, while the ET-PHD, ET-PMBM, and ET-NFPHD filters have large errors in dealing with real scenes. On the contrary, the ET-HT filter has better robustness and accuracy for tracking targets in real scenes.  The validation of the ET-HT filter on real data demonstrates its advantages in addressing target tracking problems with intermittent measurements. These findings have significant implications for practical applications that demand precise target tracking, such as intelligent traffic systems, video surveillance, and object recognition.

Discussion
The primary focus of this paper is to tackle the challenge of tracking multiple extended targets when dealing with intermittent measurements. The proposed filter in this study, ET-HT, is designed to extract all potential short tracks and employs a network flow model to compute the trajectories of these targets. Comparative experiments have been conducted to validate the algorithm's performance, demonstrating its superior robustness and accuracy in handling intermittent measurements.However, it is important to note that this paper exclusively addresses the tracking method for extended targets and does not explore the issue of tracking group targets with intermittent measurements. Additionally, the implications of target separation and merging phenomena on tracking performance in the presence of intermittent measurements have not been thoroughly investigated in this paper. Further exploration of this aspect is warranted and represents a crucial area for future research.

Conclusions
In this paper, we present a hierarchical network-based tracklet data association framework for tracking multiple extended targets with intermittent measurements. At low association levels, all possible tracklets are extracted and utilized to construct a high-level network flow model. The trajectories of the multiple extended targets can then be obtained using the A* search algorithm. The effectiveness of the proposed ET-HT filter is demonstrated through simulations and experimental results, particularly in challenging scenarios involving clutter, newborn targets, and occlusion. Moving forward, our future research aims to expand the algorithm's capabilities by considering observation delay, integrating it into multi-object trackers, and exploring the potential of incorporating data from multiple sensors to enhance estimation accuracy and overall tracking performance.