5.1. Experimental Design
To investigate the efficiency of the proposed approach, a random sample of 15 videos was selected from the public Hanyang visual tracker benchmark [44], together with the respective ground truth of hand-marked targets. The benchmark focuses on tracking a single target online. It presents 100 videos with generic scenarios, ground-truth annotations for all frames, and annotations of the attributes that affect a tracker's performance in identifying the targets.
The public Hanyang benchmark was designed as a collection of the video sequences most commonly used in object tracking. It contains videos from various datasets, such as VIVID [45] and CAVIAR (http://homepages.inf.ed.ac.uk/rbf/CAVIARDATA1, accessed on 24 November 2020). Other benchmarks, such as PAMI, share some public videos in common with the Hanyang benchmark.
Table 2 shows the selected videos with the following information: the video's size (in number of frames), the image resolution (in number of pixels), and the main challenges present in the scene. The challenges are rotation in the image plane (IPR), rotation outside the image plane (OPR), fast movement (FM), blurred movement (BM), low resolution (LR), scale variation (SV), deformation of the target (DEF), confusion between the target and the background of the image (BC), and occlusion (OCC).
The Hanyang benchmark also includes most of the publicly available tracker codes and disseminates performance metrics for in-depth analysis of tracking algorithms. The metrics proposed in [44] are the AUC of the success rate per Pascal metric (success rate) and the Euclidean distance between the central points of the bounding boxes (accuracy).
The Pascal metric [46] is defined according to

$P = \dfrac{\mathrm{area}(B_g \cap B_c)}{\mathrm{area}(B_g \cup B_c)},$

where $B_g$ is the bounding box that corresponds to the ground truth, and $B_c$ is the bounding box that corresponds to the candidate target.
The Pascal metric measures the quality of tracking by quantifying the percentage of pixels that are shared between the bounding boxes, i.e., the overlap of the targets. The Pascal metric ranges from 0.0, when there is no overlap between the bounding boxes, to 1.0, when there is total overlap between the targets. A target is considered to be detected when the Pascal measure of the candidate target is equal to or greater than a predetermined threshold (in this work, the Pascal threshold is 0.5).
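The overlap measure and the detection test above can be sketched as follows (an illustrative Python sketch; the function names and the (x, y, w, h) box format are assumptions, not taken from the paper's MATLAB code):

```python
def pascal_overlap(box_gt, box_cand):
    """Pascal overlap (intersection over union) of two boxes given as (x, y, w, h)."""
    xg, yg, wg, hg = box_gt
    xc, yc, wc, hc = box_cand
    # Width and height of the intersection rectangle (zero if the boxes do not overlap).
    ix = max(0.0, min(xg + wg, xc + wc) - max(xg, xc))
    iy = max(0.0, min(yg + hg, yc + hc) - max(yg, yc))
    inter = ix * iy
    union = wg * hg + wc * hc - inter
    return inter / union if union > 0 else 0.0

def detected(box_gt, box_cand, threshold=0.5):
    """A candidate counts as a detection when its overlap meets the threshold."""
    return pascal_overlap(box_gt, box_cand) >= threshold
```

Total overlap yields 1.0, disjoint boxes yield 0.0, and two same-size boxes shifted by half their width share one third of their union.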
The success rate per Pascal metric is the curve formed by the percentages of frames in which the target was detected in a given video, with the threshold of the Pascal metric varying from 0.0 to 1.0. The advantage of observing the curve is that the tracker's performance is visualized for all thresholds of the Pascal measurement. Therefore, the AUC of the success rate per Pascal metric is a more robust and complete measure for assessing the quality of the tracker's performance than a value at a single fixed threshold. The AUC ranges from 0.0 to 1.0, and the closer it is to 1.0, the better the tracker's performance. More details on the Pascal metric and the success rate per Pascal metric are provided in [44,46].
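For a single video, the success-rate curve and its AUC can be sketched as follows (an illustrative Python sketch, not the benchmark's code; `overlaps` is assumed to hold the per-frame Pascal overlaps):

```python
def success_curve(overlaps, num_thresholds=101):
    """Fraction of frames whose overlap meets each threshold swept over [0, 1]."""
    n = len(overlaps)
    thresholds = [i / (num_thresholds - 1) for i in range(num_thresholds)]
    rates = [sum(o >= t for o in overlaps) / n for t in thresholds]
    return thresholds, rates

def auc(thresholds, rates):
    """Area under the success-rate curve by the trapezoidal rule (0.0 to 1.0)."""
    return sum((thresholds[i] - thresholds[i - 1]) * (rates[i] + rates[i - 1]) / 2
               for i in range(1, len(thresholds)))
```

The curve is non-increasing in the threshold, and perfect overlap on every frame gives an AUC of 1.0.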
There are three tests to assess the robustness of the trackers in [44]: the OPE (One-Pass Evaluation), which runs the tracker from the first to the last frame of the video, with the template taken from the ground truth of the first frame; the TRE (Temporal Robustness Evaluation), which runs the tracker on sequences starting from an arbitrary frame and ending at the last one; and the SRE (Spatial Robustness Evaluation), in which the template is scaled from 0.8 to 1.2 of its original size and initialized at 12 different locations in the first frame.
In this work, the performance of the trackers is evaluated by the OPE robustness of the AUC of the success rate per Pascal metric and by the average processing time per frame.
The overall results will be summarized by the mean, median, and coefficient of variation. The coefficient of variation ($CV$) is the ratio between the sample standard deviation, $s$, and the sample mean, $\bar{x}$, of an observed variable:

$CV = \dfrac{s}{\bar{x}}.$

The $CV$ is a dimensionless measure of dispersion and can be expressed as a percentage of variation.
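These summary statistics can be computed with the Python standard library (a sketch for illustration; the paper's own computations were done in MATLAB):

```python
import statistics

def coefficient_of_variation(values):
    """CV = s / x_bar: sample standard deviation over the sample mean."""
    return statistics.stdev(values) / statistics.mean(values)

def summarize(values):
    """Mean, median, and CV, as used to summarize the per-video results."""
    return (statistics.mean(values),
            statistics.median(values),
            coefficient_of_variation(values))
```

A constant sample has CV = 0; the larger the CV, the less representative the mean is of the individual videos.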
5.3. Analysis of Results
The results of the main experiment are summarized in Table 3 and Table 4. The values presented in the tables correspond to the average of six executions of each video for each of the trackers: PSO, ADSO, SFLA, DSFLA, and SSA, always in that order. The replications of all the videos for each tracker were coded in MATLAB and executed on the same processor (Intel Pentium Dual-Core, 1.86 GHz, 2 GB DDR2, 160 GB HDD) so that the average processing time per frame could be compared (it is worth noting that the program codes are not optimized). The mean, median, and $CV$ of the 15 videos appear in the last three rows of the tables.
Table 3 shows the performance of the trackers according to the average processing time per frame of each video.
As observed in Table 3, the SSA tracker takes the most time to execute and the ADSO tracker is the fastest, while the DSFLA tracker is, on average, the second fastest. The values are representative given that the $CV$ of the trackers is low, except for the PSO tracker.
Figure 2 shows the box-and-whisker plot (output of MATLAB's internal boxplot function) of each tracker for the observed data of the average processing time per frame. Each boxplot segment corresponds to 25% of the observed values, and the small circles correspond to the outliers. The darker central region of the boxplot represents the interquartile range, $IQR = Q_3 - Q_1$, where $Q_3$ is the third quartile and $Q_1$ is the first quartile. The central point of this region corresponds to the median of the observed values, and the triangles represent the extremes of the 95% confidence interval centered on the median [47], which is calculated according to

$Q_2 \pm 1.57 \dfrac{IQR}{\sqrt{n}},$

where $Q_2$ is the second quartile, i.e., the median, and $n$ is the size of the observed sample.
If the intervals do not overlap, then we can conclude with 95% confidence that there is a significant difference between the medians; this is equivalent to a statistical test in which the hypothesis of no difference between the medians is rejected at the 5% significance level.
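The notch-based comparison of medians can be sketched in Python, assuming the standard notch convention $Q_2 \pm 1.57\,IQR/\sqrt{n}$ (note that `statistics.quantiles` computes quartiles slightly differently from MATLAB, so this is illustrative only):

```python
import math
import statistics

def median_notch_interval(values):
    """Approximate 95% CI on the median: Q2 +/- 1.57 * IQR / sqrt(n)."""
    q1, q2, q3 = statistics.quantiles(values, n=4)  # sample quartiles
    half_width = 1.57 * (q3 - q1) / math.sqrt(len(values))
    return q2 - half_width, q2 + half_width

def medians_differ(sample_a, sample_b):
    """Non-overlapping intervals imply a median difference at the 5% level."""
    lo_a, hi_a = median_notch_interval(sample_a)
    lo_b, hi_b = median_notch_interval(sample_b)
    return hi_a < lo_b or hi_b < lo_a
```

The interval is centered on the median, and two widely separated samples yield disjoint intervals.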
From the graph in Figure 2, all the processing times are significantly different except those of the SFLA and DSFLA trackers. However, empirically, the proposed DSFLA tracker is systematically about 10% faster than the SFLA tracker.
Table 4 shows the performance of the trackers with respect to tracking quality, according to the AUC of the success rates per Pascal metric. We can see from Table 4 that the $CV$ of all the trackers indicates a low variation of the results, except for the PSO tracker. Therefore, we can say that the trackers are satisfactorily stable.
Table 4 also shows that the videos 2, 8, 10 and 15 presented the most difficult challenges for all trackers.
Figure 3 shows the boxplots for the AUC of the success rates per Pascal metric. The DSFLA tracker is significantly superior to the PSO and ADSO trackers, since the 95% confidence interval of the DSFLA tracker does not overlap the confidence intervals of the PSO and ADSO trackers. It is not possible to reject the hypothesis of comparable quality between the DSFLA, SFLA, and SSA trackers. However, Table 4 shows empirically that the DSFLA tracker's results are consistently better than those of the SFLA and SSA trackers, with an AUC about 7.2% higher on average.
Figure 4 and Figure 5 show two examples, chosen at random, of tracking performance given by the success rate per Pascal metric for all the trackers, for videos 4 and 7, respectively. In Figure 4, the curve representing the performance of the DSFLA tracker lies largely above the other curves, which indicates that DSFLA has higher target detection rates for most Pascal metric thresholds.
The graphs of the success rate per Pascal metric for most of the other videos reflect the performance of the trackers shown in Figure 5.
The DSFLA tracker produced the best results with videos 1, 4, 7, 11, and 12. The DSFLA tracker is effective in tracking targets with fast movements or when there are blurred images or rotations of the target.
The videos in which all the trackers performed poorly are those with ambient light variation, those in which the scale of the target varies widely, as in video 10 (Dog), and those with moderate occlusion, as in video 15 (Walking2). A common weakness of all the trackers analyzed in this work is sensitivity to variation in ambient lighting. This is probably due to the use of the standardized color histogram to represent the target characteristics. The color histogram is sensitive to any variation of light in the environment, and it can also easily lose the target when the characteristics of the target and background are similar. A change in the illumination of the target in the scene alters the histogram frequencies substantially and non-proportionally, since the change in pixel intensity is not linear.
A possible strategy to overcome this problem is to include a target characteristic based on the shape of the object.
When the target and the background have similar characteristics, bounding boxes of different sizes can contain a similar proportion of pixels of the same intensity, and thus, the histograms are similar in appearance. Therefore, the solution space has several local minima whose objective function values are very close. This case can reduce the quality of tracking since candidate targets of different window sizes have a chance of being the estimated target.
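This ambiguity can be illustrated with a toy sketch (illustrative code, not the paper's implementation): over a near-uniform region, candidate windows of different sizes produce identical normalized histograms, so a histogram-based objective cannot separate them.

```python
def normalized_histogram(pixels, bins=8, max_val=256):
    """Normalized intensity histogram of a window's pixels."""
    hist = [0] * bins
    for p in pixels:
        hist[p * bins // max_val] += 1
    return [h / len(pixels) for h in hist]

def l1_distance(h1, h2):
    """Simple L1 distance between two normalized histograms."""
    return sum(abs(a - b) for a, b in zip(h1, h2))

# Two candidate windows of different sizes over the same flat region:
small_window = [100] * (20 * 20)   # 20x20 window, constant intensity 100
large_window = [100] * (40 * 40)   # 40x40 window, same intensity
h_small = normalized_histogram(small_window)
h_large = normalized_histogram(large_window)
# l1_distance(h_small, h_large) is 0: the objective rates both windows
# as equally good candidate targets despite their different sizes.
```

Each such window is then a local minimum with a nearly identical objective value, which is exactly the situation described above.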
The following analyses check the quality of the RDES model predictions and how useful the delimitation of the solution space is for tracking.
Table 5 shows the RMSE values (in number of pixels) of the forecasts for the $x$ and $y$ coordinates and the Euclidean distance between the predicted and estimated target positions.
Using the data in Table 5 and Equation (31), 95% confidence intervals were computed for the RMSEs of the $x$ and $y$ coordinates. The prediction error does not exceed 50 pixels of RMSE; that is, the forecasts are reasonably homogeneous and only slightly skewed.
Similarly, the 95% confidence interval for the Euclidean distance between the predicted and estimated targets shows that this distance does not exceed 15 pixels. The largest image diagonal of the observed videos varies from 400 to 800 pixels. Therefore, the largest forecast error does not exceed 3.8% of the image diagonal; that is, we can conclude that the predictions of the RDES model are quite accurate.
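The relative-error figure can be checked with a small worked example (assuming, for illustration, a 320x240 frame, whose diagonal is exactly 400 pixels, the smallest diagonal mentioned above):

```python
import math

def relative_error(distance_error, width, height):
    """Prediction error as a fraction of the image diagonal."""
    return distance_error / math.hypot(width, height)

# Worst case: a 15-pixel error on the smallest diagonal (400 px) is 3.75%,
# which rounds to the 3.8% bound reported in the text.
worst_case = relative_error(15, 320, 240)
```

On the largest diagonal (800 px) the same 15-pixel error is under 1.9%.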
The same experimental design used to calibrate the parameters of the trackers was used to investigate whether video target tracking benefits from restricting the solution space by the proposed region.
Table 6 shows the global average and median of the four videos for the variables’ AUC of success rates per Pascal metric and processing time per frame. Two versions of the DSFLA tracker were assessed: version 1 delimits the solution space, as proposed in this work, and version 2 does not delimit the solution space.
Table 6 shows that the median AUC for version 2 is about 87% of that for version 1, and that the median time to process a frame for version 2 is about 14% longer than that for version 1. Therefore, consistent empirical evidence suggests that the restrictions proposed in this work help to increase the AUC of the success rate per Pascal metric and reduce the processing time.
To conclude, future work to improve tracker performance involves multiple particle populations acting in different regions of the solution space (in the particle selection process) and an adaptive scheme for quantifying the number of particles to be used in the transfer of knowledge, based on the similarity of the frames.
Other representations of the target will also be tested to improve target recognition in environments with varying lighting, including a target appearance model with the HOG characteristic [35], for instance.