Performance Evaluation Metrics and Approaches for Target Tracking: A Survey

Performance evaluation (PE) plays a key role in the design and validation of any target-tracking algorithm. In fact, it is often closely related to the definition and derivation of the optimality/suboptimality of an algorithm; for example, minimum mean-squared error estimators are defined by minimizing the mean-squared error of the estimate. In this paper, we review both classic and emerging PE metrics and approaches in the context of estimation and target tracking. First, we briefly review the evaluation metrics commonly used for target tracking, classified into three groups corresponding to three of the most important qualities of a tracking algorithm, namely correctness, timeliness, and accuracy. Then, comprehensive evaluation (CE) approaches such as cloud barycenter evaluation, fuzzy CE, and grey clustering are reviewed. Finally, we demonstrate the use of these PE metrics and CE approaches in representative target-tracking scenarios.


Introduction
Target tracking is involved in many problems of significance, such as military defense, automated/driverless transportation, intelligent robots, and so on. There are many outstanding works that provide guidance on the implementation of target-tracking algorithms. For the problem of multitarget tracking, the International Society of Information Fusion (ISIF) took track estimation, data association, and performance evaluation into account many years ago. The textbook [1], a tutorial on the known target-tracking algorithms, and the class material [2] can assist researchers in re-implementing these methods and developing advanced ones. Common toolboxes include the Recursive Bayesian Estimation Library (ReBEL) [3][4][5], the nonlinear estimation framework [6][7][8], and the Tracker Component Library [9]. The Open Source Tracking and Estimation Working Group (OSTEWG), a working group of the ISIF, consolidated the currently widely used methods into an open-source framework [10,11], named Stone Soup and available from https://github.com/dstl/Stone-Soup/ (accessed on 10 November 2021). Another ISIF working group, the Evaluation of Techniques of Uncertainty Reasoning Working Group (ETURWG), develops uncertainty descriptions in the domain of target tracking and introduced the Uncertainty Representation and Reasoning Evaluation Framework (URREF) [12] into Stone Soup. This paper focuses on the performance evaluation (PE) of target-tracking algorithms, which plays a key role both in comparing existing algorithms and in putting forward new ones. PE refers to the assessment of various performance metrics of a system [13,14]; its significance lies in providing evaluation results of the system performance [15], as well as a reference basis for optimizing that performance.
The basic process of tracking evaluation is as follows: when both truth targets and tracks are available, the first step is to find an association between true targets and tracks so that performance measures can be computed. In this paper, we assume that a unique association exists; association algorithms can be found in [16][17][18][19]. Once the tracks are assigned to targets, the various performance measures are computed to analyze the target-tracking algorithms and to optimize them.
For the practitioner, the PE problem can be divided into two stages: the first is to choose the relevant effective metrics, and the second is to produce a single score from these metrics [20]. In this paper, we review measures for evaluating the performance of target-tracking systems. These metrics are grouped into the categories of correctness, timeliness, and accuracy. The assessed results of each metric are weighted and combined to give an overall performance measure. Further, we design simulations that employ several PE approaches based on these metrics to illustrate their use in the target-tracking problem.
The rest of this paper is organized as follows. Section 2 introduces general evaluation metrics and sorts them into categories by their characteristics. Section 3 reviews classic PE approaches. Simulation results in the context of target tracking are given in Section 4. Section 5 concludes the paper and discusses remaining challenges.

A Classification of the Comprehensive Evaluation Metrics
In the context of target tracking, a variety of evaluation metrics with physical significance have been proposed, which can evaluate the practicability of a tracking algorithm and the consistency between the expected and assessed results. These metrics can be divided into three categories, namely correctness, timeliness, and accuracy, as can be seen in [21][22][23][24]; this paper follows the same division for convenience. Correctness [18,25-27] usually captures the number of missed/false targets, etc. Timeliness assesses the time performance of the estimated track [23,28], which is a crucial measure for online target tracking. The accuracy metrics can be defined in different ways according to the scenario requirements; the (root-) mean-squared error ((R)MSE) is commonly used for the trajectory error (TE), tracking position error (TPE), and tracking velocity error (TVE) [29], and other accuracy metrics are surveyed in [28]. References [30,31] combined the multiple-object-tracking precision (MOTP) and the multiple-object-tracking accuracy (MOTA) to describe the effectiveness and timeliness of multiple-object-tracking systems, but they disregarded the effect of the error. In addition, measures such as cross-platform commonality, track purity, and processor loading were considered in [7,26,32-36]. The comprehensive evaluation (CE) metric system is shown in Figure 1.

Correctness Measures
The correctness measures capture the numerical characteristics of the acquired data and count how many mistakes the tracker makes in terms of misses, false tracks, and so forth, as briefly depicted in Figure 2, where the small and large circles represent the truth target and the estimated track, respectively, and the solid and dashed lines denote the trajectory of the target and the curve of the track, respectively. Given a time interval t ∈ [t_1, t_2], the correspondence between the target and the track is established in Figure 2. These measures are explained in detail below.

Timeliness Measures
This class of performance measures provides information about track persistence, which is also an indispensable part of the evaluation metrics [37]. Some timeliness metrics for the PE are given as follows:
• Rate of false alarms (RFA): The RFA [38] is defined as the NFT per time step, i.e., RFA = NFT/K, where K is the number of time steps in [t_1, t_2];
• Track probability of detection (TPD): In the time interval [t_1, t_2], let t_i^first and t_i^last be the first and last times that the ith target is present, respectively. According to [39], the TPD of each target is TPD_i = t̄_i/(t_i^last − t_i^first), where t̄_i denotes the total duration for which the ith target is assigned to a valid track;
• Rate of track fragmentation (RTF): The track obtained by a tracking algorithm may not always be continuous. For the track segments assigned to the ith truth, RTF_i is defined as the number of times that the continuous track becomes fragmented. The smaller the RTF is, the more persistent the track estimated by the algorithm is [39];
• Track latency (TL): The TL, the delay from the moment the target enters the view of the sensor to the moment the target is detected by the tracker, is a measure of track timeliness;
• Total execution time (TET): The computational cost is another important factor in the PE of target tracking; the total time taken to run each tracking algorithm is recorded as its TET.
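Several of these timeliness measures can be computed directly from a per-step assignment history. The sketch below is a minimal illustration, assuming a boolean mask marking the steps at which a target holds a valid track; the function names and example mask are ours, for illustration only.

```python
import numpy as np

def rate_of_false_alarms(n_false_tracks, n_steps):
    # RFA: number of false tracks per time step
    return n_false_tracks / n_steps

def track_probability_of_detection(assigned):
    # TPD: fraction of the target's lifetime covered by a valid track
    return float(np.mean(assigned))

def track_fragmentation(assigned):
    # RTF count: number of valid -> invalid transitions (track breaks)
    m = np.asarray(assigned, dtype=int)
    return int(np.sum((m[:-1] == 1) & (m[1:] == 0)))

def track_latency(assigned):
    # TL: delay (in steps) from target appearance to the first valid track
    return int(np.argmax(np.asarray(assigned, dtype=bool)))

# Hypothetical assignment history over one target's lifetime
mask = [0, 0, 1, 1, 1, 0, 1, 1, 0, 1]
```

For this mask, the target is covered 60% of the time, its track breaks twice, and the tracker picks it up two steps after it appears.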

Accuracy Measures
The accuracy measures, favored by the majority of researchers, are a primary choice in evaluating target tracking; several of them can be defined based on the type of distance between the set of truths and the set of tracks:
• (R)MSE: The RMSE is defined in terms of the estimation error e_k = X̂_k − X_k, the difference between the estimated state X̂_k and the truth state X_k, as RMSE = ((1/n) Σ_k ||e_k||^2)^{1/2}, where n denotes the number of targets detected at the kth time step and ||e_k||^2 = e_k^T e_k. The MSE/RMSE has long been the dominant quantitative performance metric in the field of signal processing. Traditional target-tracking algorithms aim to minimize it between the target truth and the estimated track [40]; however, it is not suitable for track assignments that lack a one-to-one correspondence. At present, widely used set-based CE metrics include the Hausdorff distance [41], the Wasserstein distance [42,43], and the optimal subpattern assignment (OSPA) distance [20,44-46];
• Hausdorff distance: The Hausdorff distance is a common method of measuring the distance between two sets of objects, which can be used to measure the similarity between tracks and is given by d_H(X, X̂) = max{ max_{x∈X} min_{x̂∈X̂} d(x, x̂), max_{x̂∈X̂} min_{x∈X} d(x, x̂) }, where X = {x_1, x_2, ..., x_k}, X̂ = {x̂_1, x̂_2, ..., x̂_k}; x_i and x̂_i are the ith target and the ith track, respectively, and d(x, x̂) is the Euclidean distance between x and x̂. The Hausdorff distance has proven very useful in assessing multitarget data fusion algorithms [47][48][49]. Meanwhile, the distance is relatively insensitive to differences in the numbers of objects;
• Wasserstein distance: The Wasserstein metric was initially used to measure the similarity of probability distributions [50] and was proposed for sets of targets in [51]. The Wasserstein distance of order p between X (with n elements) and X̂ (with m elements) is d_W(X, X̂) = min_C (Σ_i Σ_j C_ij d(x_i, x̂_j)^p)^{1/p}, where the minimum is taken over all transportation matrices C with nonnegative entries C_ij such that Σ_j C_ij = 1/n for each i and Σ_i C_ij = 1/m for each j. The Wasserstein distance extends, and provides a rigorous theoretical basis for, a natural multitarget miss distance.
However, it lacks a physically consistent interpretation when the sets have different cardinalities [52];
• OSPA distance: The OSPA was proposed to overcome this insensitivity; its parameters handle the case in which the numbers of elements in the two sets do not match [53]. For |X| = m ≤ |X̂| = n, the OSPA metric between X and X̂ is d_OSPA^{(c,p)}(X, X̂) = ((1/n)(min_{π∈Π_n} Σ_{i=1}^m d_c(x_i, x̂_{π(i)})^p + c^p (n − m)))^{1/p}, where d_c(x, x̂) = min{d(x, x̂), c} is the cut-off distance between x and x̂, c denotes the truncation parameter, p is the OSPA metric order parameter, and Π_n is the set of permutations of {1, 2, ..., n}. The choice of parameters was discussed in [53]. The OSPA distance has been used widely in the literature [44,54-56] and has better properties for multitarget error evaluation than the Hausdorff metric.
In addition, there are various improved methods based on the OSPA [57], which can be enumerated as follows:
• Generalized OSPA (GOSPA) metric: In the GOSPA metric, we look for an optimal assignment between the truth targets and the estimated tracks, leaving missed and false targets unassigned [58]. The GOSPA metric penalizes localization errors for properly detected targets, the NMT, and the NFT [59], and can be represented as an optimization over assignment sets: d_p^{(c,α)}(X, X̂) = (min_{γ∈Γ} [Σ_{(i,j)∈γ} d(x_i, x̂_j)^p + (c^p/α)(m − |γ|) + (c^p/α)(n − |γ|)])^{1/p}, where γ ⊆ {1, 2, ..., n} × {1, 2, ..., m} with (i, j), (i, j′) ∈ γ ⇒ j = j′ and (i, j), (i′, j) ∈ γ ⇒ i = i′. Γ denotes the set of all possible assignment sets γ. α is an additional parameter controlling the cardinality mismatch penalty; in general, α = 2. The terms (c^p/α)(m − |γ|) and (c^p/α)(n − |γ|) represent the costs (to the pth power) for the NMT and the NFT, respectively;
• OSPA-on-OSPA (OSPA(2)) metric: The OSPA(2) metric [60,61] is a distance between two sets of tracks, which establishes an assignment between the real and the estimated trajectories that is not allowed to change with time, thereby capturing the tracking errors of fragmentation and track switching. It is also simple to compute and flexible enough to capture many important aspects of tracking performance. The OSPA(2) distance is the OSPA distance computed with a base distance between a pair of trajectories f and g given by the weighted time average d̃(f, g) = (Σ_k w_k d(f_k, g_k)^q)^{1/q}, where q is the order of the base distance and w is a collection of convex weights.
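To make the set distances above concrete, the following sketch implements the Hausdorff, OSPA, and GOSPA (with α = 2) distances for finite point sets; for α = 2, the GOSPA optimization reduces to a linear assignment over cut-off costs, since assigning a pair farther apart than c costs the same as leaving both elements unassigned. Function names and parameter defaults are our own choices, not from the cited references.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def hausdorff(X, Y):
    # d_H = max of the two directed (max-min) distances between point sets
    D = cdist(X, Y)
    return float(max(D.min(axis=1).max(), D.min(axis=0).max()))

def ospa(X, Y, c=10.0, p=2):
    # OSPA of order p with cut-off c; X, Y are (count, dim) arrays
    m, n = len(X), len(Y)
    if m == 0 and n == 0:
        return 0.0
    if m == 0 or n == 0:
        return float(c)
    if m > n:                                  # keep X as the smaller set
        X, Y, m, n = Y, X, n, m
    D = np.minimum(cdist(X, Y), c)             # cut-off base distance d_c
    r, col = linear_sum_assignment(D ** p)     # optimal permutation
    cost = (D[r, col] ** p).sum() + (n - m) * c ** p
    return float((cost / n) ** (1.0 / p))

def gospa(X, Y, c=10.0, p=2):
    # GOSPA with alpha = 2: each unassigned target/track costs c^p / 2
    m, n = len(X), len(Y)
    if m == 0 or n == 0:
        return float((max(m, n) * c ** p / 2) ** (1.0 / p))
    M = np.minimum(cdist(X, Y) ** p, c ** p)   # far pair = leave both out
    r, col = linear_sum_assignment(M)
    cost = M[r, col].sum() + abs(m - n) * c ** p / 2
    return float(cost ** (1.0 / p))
```

Note that, unlike the OSPA, the GOSPA value is not normalized by the larger cardinality, so missed and false targets contribute additively to the total cost.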
Finally, we introduce the single integrated air picture (SIAP) metrics. Despite the terminology, they are applicable to tracking in general, not just to an air picture. The SIAP score is made up of multiple individual metrics [62,63] and requires an association between tracks and targets; we use a unique association in this paper. A description of the key SIAP metrics is given in Table 1.

Metric — Description
Ambiguity — a measure of the number of tracks assigned to each true object
Completeness — the percentage of live objects with tracks on them
LS — the percentage of time spent tracking true objects across the dataset
LT — 1/R, where R is the average number of excess tracks assigned; the higher this value, the better
Positional Accuracy — the average positional error of the track relative to the truth
Spuriousness — the percentage of tracks not assigned to any true object
Velocity Accuracy — the average velocity error of the track relative to the truth
Number of Targets — the total number of targets
Number of Tracks — the total number of tracks

CE Approaches
The above metrics provide the criteria for the PE of the tracking algorithm, which are combined to give the overall performance by the CE model. In this section, we review several CE approaches to analyze and judge the performance of target tracking.

The Weight of Each Evaluation Metric Set
The analytic hierarchy process (AHP) is used to ascertain the weight of each metric; it models and analyzes the evaluation metrics of the PE layer by layer [64][65][66]. The specific steps are as follows.
Step 1 The assessment metric system: According to Section 2, the top-level evaluation metrics are established, and the remaining levels can be established in the same manner;
Step 2 The comparison matrix A = (a_ij): Using the numbers 1-9 as a scale, the importance of each influencing metric U_i in the metric set U relative to the others is quantified pairwise to form the comparison matrix;
Step 3 The maximum eigenvalue λ_max of A and the corresponding normalized eigenvector W: W is denoted as W = (W_1, W_2, ..., W_n), where W_i denotes the weight of the ith evaluation metric.
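Steps 2 and 3 can be sketched with NumPy: the principal eigenvector of a reciprocal pairwise comparison matrix yields the weights, and the consistency index checks the coherence of the expert judgments. The matrix entries below are hypothetical, not taken from the paper.

```python
import numpy as np

# Hypothetical reciprocal pairwise comparison matrix on the 1-9 scale
A = np.array([[1.0, 3.0, 5.0],
              [1 / 3, 1.0, 3.0],
              [1 / 5, 1 / 3, 1.0]])

eigvals, eigvecs = np.linalg.eig(A)
k = np.argmax(eigvals.real)                  # principal eigenvalue lambda_max
W = np.abs(eigvecs[:, k].real)
W = W / W.sum()                              # normalized weight vector

# Consistency index CI = (lambda_max - n) / (n - 1); CI near 0 is consistent
CI = (eigvals.real[k] - len(A)) / (len(A) - 1)
```

For a consistent matrix, λ_max equals the matrix dimension n and CI is zero; in AHP practice, a ratio of CI to a random-index baseline below 0.1 is usually accepted.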

Cloud Barycenter Evaluation
Based on traditional fuzzy set theory and probability theory, cloud theory provides a powerful method for combining qualitative information with quantitative data [67]. As a mathematical model, it completely describes the mapping between quality and quantity through a relation that is both fuzzy and stochastic [68]. The cloud barycenter evaluation developed from cloud theory is a CE method that has been used extensively in numerous complex systems, especially in the military field [67]; it is a qualitative-and-quantitative method that achieves the transformation between concepts and data.
The cloud is represented by three digital characteristics: the expected value E_x, the entropy E_n, and the hyper-entropy H_e [67,69], where E_x is the central value of the fuzzy concept in the defined domain, E_n represents the degree of fuzziness of the qualitative concept, and H_e, the entropy of E_n, measures the uncertainty of the qualitative concept itself.
The cloud barycenter evaluation is realized by establishing the cloud model of each metric. The specific evaluation process is as follows:
Step 1 The cloud model of the comment set: The comment set of metrics is ascertained by experts. For example, we set S = {excellent, good, fair, worse, poor} to denote the comment set of target tracking, as shown in Table 2, and map the comments onto the continuous number field interval [0, 1]. Each qualitative comment is represented by a cloud model with characteristics E_xi0 and E_ni0, the expected value and entropy of the comment, respectively;
Step 2 The quantitative and qualitative variables for the given metric set:
(a) The cloud model of quantitative metrics: The values of a quantitative metric supplied by n experts, E_x11, E_x21, ..., E_xn1, are aggregated into one cloud model;
(b) The cloud model of qualitative metrics: In the same way, every qualitative metric, represented by a linguistic value, can also be described by a cloud model, where E_x12, E_x22, ..., E_xn2 and E_n12, E_n22, ..., E_nn2 denote the expected values and entropies of the individual expert clouds, respectively;
Step 3 The weighted departure degree: S = (S_1, S_2, ..., S_n) is the n-dimensional integrated barycenter vector, each element of which is calculated by S_i = g_i × h_i (i = 1, 2, ..., n), where g_i is the cloud barycenter position taken from (E_x1, E_x2, ..., E_xn) and h_i is the cloud barycenter height taken from the AHP weights (W_1, W_2, ..., W_n). S^0 = (S_1^0, S_2^0, ..., S_n^0) denotes the ideal cloud vector. The synthesized vector is normalized as S_i^G = (S_i − S_i^0)/S_i^0, and the weighted departure degree is θ = Σ_{i=1}^n W_i S_i^G;
Step 4 Result analysis: The comment set is placed in a consecutive interval, and each comment value is realized by a cloud model; the cloud-generator model can be established as Figure 3 shows. The comment set is divided into five categories: excellent, good, fair, worse, and poor. For a specific case, the assessment result is output by inputting 1 + θ into the cloud-generator model.
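The computation of the departure degree can be sketched numerically as follows. This is a minimal sketch, assuming the departure degree is the weighted sum of per-metric relative departures from the ideal cloud vector; all numeric values (barycenter positions, weights, and the ideal expected value of 0.9) are hypothetical.

```python
import numpy as np

g = np.array([0.70, 0.85, 0.60])   # hypothetical cloud barycenter positions E_x
w = np.array([0.5, 0.3, 0.2])      # AHP weights (barycenter heights), sum to 1

S = g * w                          # integrated barycenter vector, S_i = g_i * h_i
S0 = 0.9 * w                       # ideal cloud vector (assumed ideal E_x = 0.9)

SG = (S - S0) / S0                 # normalized relative departure per metric
theta = float(np.sum(w * SG))      # weighted departure degree (assumed form)
score = 1 + theta                  # value fed into the cloud-generator model
```

A negative θ indicates that the evaluated system falls short of the ideal state, and 1 + θ locates the result within the [0, 1] comment interval.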

Fuzzy CE Method
The fuzzy CE is based on fuzzy mathematics, which quantitatively expresses the objective attributes of uncertain things [70][71][72]. The specific process is as follows:
Step 1 The metric set: We analyze the result of target tracking and establish the evaluation metric set U = {U_1, U_2, ..., U_n}, where U_i is the ith evaluation metric;
Step 2 The evaluation level set: The evaluation level set is given by V = {V_1, V_2, ..., V_m}, where V_j is the jth level; V is the remark collection, which is made up of remarks on the research object;
Step 3 The evaluation matrix: Starting from a single factor, we determine the degree of membership of the evaluation object with respect to the evaluation level set and make the single-factor fuzzy evaluation. Combining the single-factor sets, the multi-factor evaluation matrix is given by R = (r_ij), an n × m matrix whose entry r_ij denotes the membership degree of U_i corresponding to V_j;
Step 4 The fuzzy CE value: C = B ∘ R, where C is the fuzzy CE set and B is the weight vector of the metrics.
According to the principle of the maximum membership degree, the comprehensive value of the PE is obtained; thereby, the corresponding performance levels [70] are calculated.
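A minimal numeric sketch of Steps 3 and 4 follows, using the weighted-average operator as one common choice for the composition B ∘ R; the weights, membership degrees, and level grades below are hypothetical.

```python
import numpy as np

B = np.array([0.3, 0.4, 0.3])          # metric weights (hypothetical)
R = np.array([[0.2, 0.5, 0.3],         # r_ij: membership of metric i in level j
              [0.1, 0.6, 0.3],
              [0.3, 0.3, 0.4]])

C = B @ R                              # fuzzy CE set via the weighted average
C = C / C.sum()                        # normalize so memberships sum to one
level = int(np.argmax(C))              # maximum-membership principle

grades = np.array([90.0, 80.0, 70.0])  # scores attached to the levels
D = float(C @ grades)                  # overall performance score
```

Other composition operators (e.g., max-min) are also used in fuzzy CE; the weighted average retains the contribution of every metric rather than only the dominant one.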

Grey Clustering
Grey theory is a useful methodology for incomplete information systems. Grey relational analysis can be used to analyze the relationship between the uncertainty and the grey categories [73,74]. The main steps of the method are as follows:
Step 1 The whitenization weight functions: Triangular whitenization weight functions f_j^k(·) are established, where d_ij (i = 1, 2, ..., n; j = 1, 2, ..., m) is the sample of the ith algorithm for the jth evaluation metric and c_j^k is the midpoint of the jth clustering metric belonging to the kth grey category. The type of measure determines the choice among three function forms: if the measure is an upper measure (larger is better), f_1j^k(d_ij) is selected; if it is a moderate measure, f_2j^k(d_ij) is selected; and if it is a lower measure (smaller is better), f_3j^k(d_ij) is the first choice;
Step 2 The clustering coefficient: σ_i^k = Σ_{j=1}^m f_j^k(d_ij) w_j and δ_i^k = σ_i^k / Σ_{k=1}^s σ_i^k, where w_j is the weight of the jth clustering metric, determined by the AHP; σ_i^k is the weighted cluster coefficient of the ith algorithm for the kth grey category; and δ_i^k is the normalized weighted cluster coefficient [75];
Step 3 The integrated clustering coefficient: η_i for each algorithm with respect to the s grey categories is calculated as η_i = Σ_{k=1}^s k · δ_i^k [76];
Step 4 The evaluation result: The value range of the integrated clustering coefficient is divided into s intervals of the same length. The tracking algorithm is judged to belong to the kth grey category when η_i ∈ [1 + (k − 1)(s − 1)/s, 1 + k(s − 1)/s].
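The steps above can be sketched for a single hypothetical algorithm with two metrics and two grey categories; the triangular whitenization function and all numeric values below are illustrative assumptions.

```python
import numpy as np

def tri_whiten(d, left, mid, right):
    # moderate-measure triangular whitenization weight function
    if d <= left or d >= right:
        return 0.0
    if d <= mid:
        return (d - left) / (mid - left)
    return (right - d) / (right - mid)

samples = [3.0, 6.0]                            # d_ij for one algorithm (j = 1, 2)
centres = [(0.0, 2.0, 6.0), (2.0, 6.0, 10.0)]   # (left, mid, right) per category k
w = np.array([0.6, 0.4])                        # AHP weights of the two metrics

# f[k, j] = f_j^k(d_ij); sigma_k = sum_j f_j^k(d_ij) * w_j
f = np.array([[tri_whiten(d, *centres[k]) for d in samples] for k in range(2)])
sigma = f @ w                                   # weighted clustering coefficients
delta = sigma / sigma.sum()                     # normalized coefficients
eta = float(np.sum((np.arange(len(delta)) + 1) * delta))  # integrated coefficient
```

The integrated coefficient η locates the algorithm on the 1-to-s category axis, which Step 4 then discretizes into equal-length intervals.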

Rating and Overall Performance
Most simulations are run as Monte Carlo scenarios to characterize the performance metrics. In [21], the analysis and assessment of the tracking algorithm were performed with both simulated and real data, where the real data were measured with the Multi-Static Primary Surveillance Radar (MSPSR) L-Band demonstrator, and metrics such as the mean and variance statistics of the NMT, RTF, RMSE, etc., were calculated for the performance evaluation. Refs. [25,77] calculated the GOSPA for varying values of c and γ using a multiple-target-tracking example in MATLAB code. The COCO 2017 validation set and the MOTChallenge (MOT17) dataset were used to compare the Hausdorff, Wasserstein, and OSPA metrics [78]. In this paper, the metrics mentioned in Section 2 are combined into a score or a membership value by the aforementioned CE approaches. In the cloud barycenter evaluation, the three classes of measures are taken together to judge the efficiency of target tracking (the synthetic measures do not involve simulations), while the fuzzy CE and the grey clustering are realized in scenarios that lack the timeliness and the correctness measures, respectively. To demonstrate this more vividly, our simulation was performed using a graphical user interface (GUI).

Application of Cloud Theory for Target Tracking
In this section, we discuss the application of cloud theory to target tracking. In this scenario, all three categories of performance metrics are involved; the last, accuracy, is divided into the TE, TPE, and TVE. The judgment matrix was given by experts and is displayed in Figure 4, from which the metric weights were ascertained. According to the evaluation of the experts, S = {excellent, good, fair, worse, poor} was placed in [0, 1]; Table 3 presents the resulting cloud models of the comments.
S is the decision vector, which denotes the integrated cloud gravity center. Combining Equations (13) and (14) and Table 3, the cloud model of the parameter status is calculated and given in Table 4; the conclusion can then be obtained in the "performance evaluation" part. W = [0.2022, 0.0990, 0.0509, 0.0639, 0.0530, 0.0271, 0.0435, 0.0140, 0.0886, 0.0349, 0.0260, 0.0277, 0.0725, 0.0633]. In the ideal state, E = [0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9], and the ideal integrated evaluation vector of the cloud barycenter is S^0 = W × E. In the ideal state, θ for the PE is zero; however, the actual θ is −0.20477. When 1 + θ = 0.79523 is input into the cloud-generator model, the cloud drop falls close to the "good" cloud object, so the PE result of target tracking is deemed good. This suggests the conjecture that the accuracy and correctness measures are more informative than the timeliness measure in target tracking, which we verify in the following simulations.

Application of Fuzzy CE for Target Tracking
Fuzzy CE uses the fuzzy mathematics tool to depict the influence of the various metrics and was applied here in the scenario without the timeliness measure. Here, U = {NMT U_1, NVT U_2, NST U_3, NFT U_4, TSE U_5, TBE U_6, TE U_7}, and B = (0.1621, 0.1284, 0.0773, 0.1120, 0.0836, 0.2000, 0.2365) is the weight vector of the seven metrics. V = {V_1, V_2, V_3, V_4} denotes the comment set, corresponding to the four levels "excellent", "good", "medium", and "poor". The membership degrees were determined by expert assessment, from which R and C were obtained. The simulation result is given in Figure 5. According to the metric information, we obtained the fuzzy CE of all the evaluation information; by the maximum-membership principle, the fuzzy CE set indicated "medium". Combining the analysis above, the total score of the PE for target tracking is D = 0.1494 × 90 + 0.4120 × 80 + 0.4386 × 70 = 77.108, and the result shows that the performance of the target tracking was medium.
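The overall score can be reproduced directly from the reported fuzzy CE memberships and the grades attached to the levels:

```python
# Fuzzy CE memberships and level grades as reported in the text
C = [0.1494, 0.4120, 0.4386]
grades = [90, 80, 70]

D = sum(c * g for c, g in zip(C, grades))  # weighted overall score
# D evaluates to 77.108, matching the score reported in the text
```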

Application of Grey Clustering for Target Tracking
Performance metrics were then computed from the given tracking results, which are listed in Table 5. According to Table 5, we can determine the whitenization weight functions in Table 6. Taking the first metric (TPE) as an example, four whitenization weight functions were constructed: f^k_1j(c^k_1j, ∞), an upper-measure function, was applied for the excellent grey category; the good and medium grey categories were selected by the moderate-measure function f^k_2j(−, c^k_2j, +); and f^k_3j(0, c^k_3j), a lower-measure whitenization weight function, was used for the poor grey category (for the TPE, these functions are zero outside [0, 9]). From the weights of the evaluation metrics and the grey clustering, the weighted clustering coefficient matrix σ can be determined. The simulation result is shown in Figure 6. The integrated clustering coefficients of the six algorithms were calculated; for example, the integrated clustering coefficient of the PS (the first algorithm) was η_1 = Σ_{k=1}^4 k · δ_1^k = 1 × 0.3645 + 2 × 0.4019 + 3 × 0.2336 + 4 × 0 = 1.8691, and the other coefficients were η_2 = 1.7397 (PRO), η_3 = 1.6448 (KKT), η_4 = 1.9222 (KKT_KF), η_5 = 1.6927 (UKF), and η_6 = 1.5634 (T-FoT). The value range of the integrated clustering coefficient was divided into four intervals of the same length. According to Step 4 in Section 3.4, η_2, η_3, η_5, η_6 ∈ [1, 1 + 3/4] and η_1, η_4 ∈ [1 + 3/4, 1 + 6/4]; hence, the PRO, KKT, UKF, and T-FoT methods had "excellent" performance and the PS and KKT_KF methods "good" performance. Comparing the above algorithms, the CE method for target tracking achieved a satisfactory result. The comparative analysis showed that the three classes of measures contain sufficient information for evaluation, although there were slight differences between the scenario lacking correctness and the scenario lacking timeliness.
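As a check, the integrated clustering coefficient reported for the PS algorithm can be reproduced from its normalized coefficients:

```python
# Normalized clustering coefficients delta_1^k for the PS algorithm (from the text)
delta_1 = [0.3645, 0.4019, 0.2336, 0.0]

# Integrated clustering coefficient eta_1 = sum_k k * delta_1^k
eta_1 = sum((k + 1) * d for k, d in enumerate(delta_1))
# eta_1 evaluates to 1.8691, as reported
```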
The results of grey clustering were consistent with cloud theory, and the fuzzy CE showed poor results. In other words, for this case, it can be concluded that the timeliness measure was less informative than the correctness and accuracy measures.
So far, on the one hand, PE research has tended to focus on improving the OSPA family of metrics; on the other hand, PE metrics have been redefined using different association algorithms. More effective PE methods remain to be explored to enhance evaluation efficiency.

Conclusions and Remaining Challenges
This paper reviewed PE metrics for target tracking and some CE approaches. The measures were divided into three categories and described for each category. Finally, the simulation result showed that a combination of metrics from different classes can provide a criterion for PE in target tracking.
Instead of estimating the discrete-time state of the target, it is often of greater interest to estimate the continuous-time trajectory of the target, which contains more information than a discrete time series of point estimates. The T-FoT framework [85,87,88] is promising and powerful, as it completely describes the movement pattern/dynamic behavior of the targets over time and enables the use of many curve-fitting tools, such as Gaussian processes and neural networks, in addition to various flexible parametric regression methods. However, under this framework, the output of the estimator or tracker is a spatiotemporal trajectory function, or several such functions in the case of multiple targets; how to evaluate such trajectory-function estimates remains an open challenge.

Data Availability Statement: Data will be made available upon request.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: