In this section, we evaluate our proposed techniques for object tracking on plenoptic image sequences.
6.2. Evaluation Protocol
All the tested methods are evaluated using two accuracy measures for object tracking. For a sequence of estimated bounding boxes $\{B_t\}_{t=1}^{T}$, the distance score is defined as the average distance between the centers of the estimated and ground-truth bounding boxes:
$$ S_{\mathrm{dist}} = \frac{1}{T} \sum_{t=1}^{T} \left\| \mathbf{c}_t - \mathbf{c}_t^{gt} \right\|_2, $$
where $\mathbf{c}_t$ and $\mathbf{c}_t^{gt}$ are the center positions of the estimated and ground-truth boxes at frame index $t$, respectively. The second accuracy measure, the overlap score, evaluates the intersection of the estimated and ground-truth boxes. Since we used a constant-sized bounding box for both the estimation and the ground truth in the experiments, the overlap score is defined as follows:
$$ S_{\mathrm{ovl}} = \frac{1}{T} \sum_{t=1}^{T} \frac{n\left(B_t \cap B_t^{gt}\right)}{n\left(B_t^{gt}\right)}, $$
where $n(\cdot)$ is the number of pixels contained in the region. Note that the ground-truth bounding boxes were labeled manually. We ran the experiment more than ten times for each sequence with varying initial bounding boxes; specifically, the numbers of trials on Seq #1, #2, and #3 are 18, 12, and 14, respectively. The reported accuracies and timings are averages over these trials.
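For concreteness, the two measures can be computed as in the following Python sketch; the (x, y, w, h) box convention with (x, y) as the top-left corner is an assumption of the sketch, not a convention stated in the paper.

```python
import numpy as np

def distance_score(est_centers, gt_centers):
    """Average Euclidean distance between estimated and ground-truth
    box centers over all frames (lower is better)."""
    est = np.asarray(est_centers, dtype=float)  # shape (T, 2)
    gt = np.asarray(gt_centers, dtype=float)    # shape (T, 2)
    return float(np.linalg.norm(est - gt, axis=1).mean())

def overlap_score(est_boxes, gt_boxes):
    """Average ratio of the intersection area to the (constant) box
    area; boxes are (x, y, w, h) with (x, y) the top-left corner."""
    ratios = []
    for (ex, ey, ew, eh), (gx, gy, gw, gh) in zip(est_boxes, gt_boxes):
        iw = max(0.0, min(ex + ew, gx + gw) - max(ex, gx))
        ih = max(0.0, min(ey + eh, gy + gh) - max(ey, gy))
        ratios.append((iw * ih) / (gw * gh))
    return float(np.mean(ratios))
```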
We mainly compare the following five methods:
Baseline #1 [2]: This method uses only the f-score to select a focus index and the Boosting tracker [5].
Baseline #2 [3]: An extension of Baseline #1. This method additionally performs an image-sharpening process and generates proposals based on a fixed size ratio. We set the ratio to 3%, as suggested in [3].
Ours(B): Our method, which uses the fi-score (Section 4) to select a focus index, produces proposals based on the scaling and motion factors (Section 5), and uses the Boosting tracker [5].
Ours(M): Identical to Ours(B) but using the MOSSE tracker [6].
Ours(MC): Identical to Ours(M) but with the MOSSE tracker using CNN features [7].
All the tested methods are implemented in Python, and all the experiments were conducted on a machine with an Intel i7-8700K CPU, 16 GB of main memory, and an NVIDIA GTX 1080 GPU.
In our focus index selection algorithm (Section 4), we employ image features to estimate the visual similarity among image patches. Specifically, we use deep features extracted from the last fully connected layer of a publicly available pre-trained VGG-16 model [27]. To train the correlation filters of the MOSSE tracker with CNN features, we use the activated feature map of the first convolution layer, as suggested in [7]. Our proposal technique can produce multiple candidate boxes by applying different values of the scaling and motion parameters (Section 5). In the experiments, we mostly used three values, 0, 1, and 2, for each parameter, and thus considered nine proposals in total.
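As an illustration, the deep-feature similarity could be computed as in the sketch below using torchvision's pre-trained VGG-16; taking the 4096-d activation before the final classification layer and using cosine similarity are assumptions of this sketch, since the paper only specifies "the last fully connected layer".

```python
import torch
import torchvision.models as models
import torchvision.transforms as T

# Pre-trained VGG-16; we keep the classifier up to (but not including)
# the final 1000-way layer, yielding 4096-d fully connected features.
vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).eval()
fc_head = torch.nn.Sequential(*list(vgg.classifier.children())[:-1])

preprocess = T.Compose([
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def patch_feature(patch):
    """L2-normalized deep feature of one PIL image patch."""
    x = preprocess(patch).unsqueeze(0)
    x = vgg.avgpool(vgg.features(x)).flatten(1)
    return torch.nn.functional.normalize(fc_head(x), dim=1).squeeze(0)

def visual_similarity(patch_a, patch_b):
    """Cosine similarity between two patches' deep features."""
    return float(patch_feature(patch_a) @ patch_feature(patch_b))
```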
6.3. Experimental Results
Table 2 shows the object-tracking accuracies of the tested methods on three different benchmarks. Among the techniques using the Boosting tracker (Baseline #1, Baseline #2, and Ours(B)), Ours(B) provided the highest accuracy on all the tested datasets. Moreover, the second-best method, Baseline #2, is tightly coupled with the Boosting tracker: its proposal algorithm is designed and optimized for the Boosting tracker, so Baseline #2 cannot accommodate other trackers such as MOSSE. In contrast, our method can use any advanced tracker, since it makes no assumptions about the underlying tracker.
Ours(M) and Ours(MC), which apply the MOSSE tracker and its CNN-feature extension to our technique, consistently outperformed the other tested methods, including Ours(B). These experimental results verify that our method can provide even higher accuracy when paired with a more accurate 2D tracker.
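To make this tracker-agnostic design concrete, the following minimal sketch treats the 2D tracker as a black box with an init/update interface; the use of OpenCV's contrib trackers here is an illustrative assumption, not the paper's actual implementation.

```python
import cv2

# Any tracker exposing init/update can be plugged into the pipeline;
# OpenCV's legacy contrib trackers offer exactly that interface.
TRACKERS = {
    'boosting': cv2.legacy.TrackerBoosting_create,
    'mosse': cv2.legacy.TrackerMOSSE_create,
}

def run_tracker(kind, frames, init_box):
    """Track init_box=(x, y, w, h) through a list of frames and return
    one estimated box per frame; swapping 'kind' changes the tracker
    without touching the rest of the pipeline."""
    tracker = TRACKERS[kind]()
    tracker.init(frames[0], init_box)
    boxes = [init_box]
    for frame in frames[1:]:
        ok, box = tracker.update(frame)
        boxes.append(box if ok else boxes[-1])  # keep last box on failure
    return boxes
```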
To validate the benefits of our focus index selection technique (Section 4), we compared the tracking results of two different focus index scoring schemes, the f-score (Section 4.1) and the fi-score (Section 4.3); the results are given in Table 3. As reported there, our fi-score, which is based on both the focus measure and visual similarity, provides significantly higher tracking accuracy than the f-score used in Baselines #1 and #2. Note that we did not use any proposal algorithm in this experiment, for a fair comparison.
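The sketch below illustrates the idea behind the fi-score; the variance-of-Laplacian focus measure, the normalization, and the additive weighting by alpha are assumptions for illustration only, as the actual definitions are given in Sections 4.1 and 4.3.

```python
import cv2
import numpy as np

def focus_measure(patch_gray):
    """Variance-of-Laplacian focus measure, a common sharpness proxy
    (the paper's f-score is defined in Section 4.1 and may differ)."""
    return cv2.Laplacian(patch_gray, cv2.CV_64F).var()

def select_focus_index(patches_gray, similarities, alpha=0.5):
    """Choose the focus index maximizing an fi-score-like combination
    of focus measure and visual similarity to the target template;
    the normalization and additive weighting (alpha) are assumptions."""
    focus = np.array([focus_measure(p) for p in patches_gray], dtype=float)
    focus /= focus.max() + 1e-12          # scale focus values to [0, 1]
    scores = alpha * focus + (1.0 - alpha) * np.asarray(similarities)
    return int(np.argmax(scores))
```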
We also report the qualitative results corresponding to Table 3. Figure 7 shows the image patches at the position of the target object for the focus indices selected by the f-score and the fi-score. Since the f-score relies only on the focus measure, the focus index it selects can be wrong once the target object moves behind an occluder. As shown in Figure 7a, the f-score-based technique identified the focus index where the occluder is maximally focused. In contrast, our proposed fi-score robustly found appropriate focus indices across frames, since it considers not only the focus measure but also the visual consistency. The tracking results corresponding to this comparison are reported in Figure 8. The tracking process based on the f-score failed to locate the target object once it was occluded, whereas the one based on our fi-score estimated the target object correctly. Note that the proposal algorithms were excluded and the MOSSE tracker was used in these experiments.
We also conducted experiments with our method with and without proposals; the results are shown in Table 4. As reported, our candidate box proposal algorithm consistently improved the accuracy. The improvement on Seq #2 was the largest among the three sequences, mainly because Seq #2 contains more varied motions and object sizes than Seqs #1 and #3. These results validate that our proposal algorithm, based on the scaling and motion factors, assists trackers in accurately locating the target object.
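As an illustration of how such proposals could be generated, the sketch below enumerates candidate boxes from three values of each factor (3 × 3 = 9 proposals, matching the setting above); the scale step and the use of a per-frame velocity estimate are assumptions, not the paper's exact formulation.

```python
def propose_boxes(box, velocity, scale_step=1.1, values=(0, 1, 2)):
    """Enumerate candidate boxes from a scaling and a motion factor.
    box: current estimate (x, y, w, h); velocity: per-frame center
    displacement (vx, vy), e.g., from the last two estimates. Three
    values per factor yield 3 x 3 = 9 proposals, as in our setting."""
    x, y, w, h = box
    cx, cy = x + w / 2.0, y + h / 2.0
    vx, vy = velocity
    proposals = []
    for s in values:                      # scaling factor
        sw, sh = w * scale_step ** s, h * scale_step ** s
        for m in values:                  # motion factor
            ncx, ncy = cx + m * vx, cy + m * vy
            proposals.append((ncx - sw / 2.0, ncy - sh / 2.0, sw, sh))
    return proposals
```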
Figure 9 presents a qualitative analysis of our proposal algorithm, comparing Baseline #1 and Ours(B) on Seq #1; here we mainly investigate the motion factor in the candidate box proposal stage. Figure 9a,b shows the tracking results of Baseline #1 and Ours(B), respectively. Baseline #1 failed to locate the target object when it moved fast. Since most trackers, including Boosting-based techniques, incrementally learn the appearance of the object, they can hardly recover their tracking capability once they lose the object. In contrast, our proposal algorithm, by considering the motion factor, can help the tracker recover. Our tracker lost the object between the first and second columns of Figure 9b; however, it immediately relocated the target object based on the proposal bounding box, as shown in the third column. Note that the selected proposals are drawn as red boxes.
Another qualitative result evaluating our proposal algorithm is given in Figure 10. Figure 10a,b shows the tracking results of Ours(B) without and with our proposal algorithm on Seq #2, respectively. When the target object becomes smaller, as shown in the third column of Figure 10a, Ours(B) without the proposals fails to estimate the correct location. In contrast, our proposal algorithm generated suitably sized proposals (red boxes), and thus we could precisely locate the target object even under scale changes.
Since our proposed focus index selection and candidate proposal techniques use image features to capture the content of the images, the choice of feature extractor is important. In all the aforementioned experiments, we used the VGG-16 network [27] as the feature extractor; specifically, we used the last fully connected layer of VGG-16 for the focus index selection and the first convolution layer to train the MC (MOSSE + CNN) tracker. VGG-16 provides high accuracy in image classification; however, its feature extraction is known to be quite slow. To assess the performance of our proposed techniques with a lightweight feature extractor, we conducted experiments with SqueezeNet [40]. We also included Inception-ResNet v2 [30], which recently achieved state-of-the-art classification performance. We report the tracking results with the three feature extractors, VGG-16, SqueezeNet, and Inception-ResNet v2, in Table 5. The tracking accuracies are not significantly different. Since we used publicly available pre-trained models for feature extraction, their capability for particular objects can vary, which causes inconsistent performance tendencies; nevertheless, our method still significantly outperformed the baseline techniques.
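A minimal sketch of how the feature extractor can be swapped is shown below; it uses torchvision's VGG-16 and SqueezeNet (Inception-ResNet v2 is not in torchvision and would require a third-party implementation such as timm), and the global pooling is an assumption of the sketch.

```python
import torch
import torchvision.models as tvm

# Interchangeable backbones; VGG-16 and SqueezeNet ship with torchvision,
# while Inception-ResNet v2 would need a third-party implementation
# (e.g., the timm package), so it is omitted from this sketch.
BACKBONES = {
    'vgg16': lambda: tvm.vgg16(weights=tvm.VGG16_Weights.IMAGENET1K_V1),
    'squeezenet': lambda: tvm.squeezenet1_1(
        weights=tvm.SqueezeNet1_1_Weights.IMAGENET1K_V1),
}

@torch.no_grad()
def extract_features(name, batch):
    """Run the chosen backbone's convolutional trunk on a preprocessed
    image batch and return globally pooled, L2-normalized features."""
    model = BACKBONES[name]().eval()
    fmap = model.features(batch)          # both models expose .features
    pooled = torch.nn.functional.adaptive_avg_pool2d(fmap, 1).flatten(1)
    return torch.nn.functional.normalize(pooled, dim=1)
```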
Finally, let us report the average tracking speed. Baselines #1 and #2 took 0.10 and 0.16 s per frame, respectively. Meanwhile, Ours(B), Ours(M), and Ours(MC) took 0.37, 0.32, and 0.70 s per frame on average when we used the VGG-16 network as the feature extractor. As reported in Table 6, the tracking times of the MOSSE tracker with VGG-16 and with Inception-ResNet are similar. The MC tracker with Inception-ResNet v2 provides the slowest tracking speed, because its first convolution layer is very large, so updating the correlation filters takes longer. On the other hand, our proposed techniques can be accelerated by using computationally lighter models such as SqueezeNet. For instance, when we combined our proposed techniques with the MOSSE tracker and SqueezeNet as the feature extractor, our method achieved both higher accuracy and faster speed than Baseline #2. Moreover, the fastest method among the tested combinations, our focus index selection method combined with SqueezeNet, provided significantly higher accuracy than Baselines #1 and #2.
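For reference, per-frame timings such as those above can be measured with a simple harness like the following sketch; the tracker_step callable and the warm-up policy are hypothetical, not the measurement procedure stated in the paper.

```python
import time

def average_frame_time(tracker_step, frames, warmup=5):
    """Average per-frame processing time of tracker_step (any callable
    handling one frame; hypothetical), skipping warm-up frames so that
    model loading and caching do not skew the measurement."""
    for frame in frames[:warmup]:
        tracker_step(frame)
    start = time.perf_counter()
    for frame in frames[warmup:]:
        tracker_step(frame)
    return (time.perf_counter() - start) / max(1, len(frames) - warmup)
```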