Applied Sciences
  • Article
  • Open Access

3 November 2020

Visual Tracking Using Wang–Landau Reinforcement Sampler

School of Computer Science and Engineering, Chung-Ang University, Seoul 156-756, Korea
* Author to whom correspondence should be addressed.
This article belongs to the Section Computing and Artificial Intelligence

Abstract

In this study, we present a novel tracking system in which the tracking accuracy is considerably enhanced by state prediction. Accordingly, we present a new Q-learning-based reinforcement method augmented by Wang–Landau sampling. In the proposed method, reinforcement learning predicts a target configuration for the subsequent frame, while the Wang–Landau sampler balances the exploitation and exploration degrees of the prediction. Our method adaptively controls the randomness of the policy using statistics on the number of visits to a particular state. Thus, our method considerably enhances conventional Q-learning performance, which in turn enhances visual tracking performance. Numerical results demonstrate that our method substantially outperforms other state-of-the-art visual trackers and runs in real time because it contains no complicated deep neural network architectures.

1. Introduction

Visual tracking is a fundamental computer vision algorithm [1,2,3,4,5,6] with several applications, including autonomous driving, surveillance systems, and robotic systems. Conventional visual tracking methods aim to accurately predict a target state using observations up to the current time. To predict the target state with greater accuracy, in this paper, we define multiple actions in a reinforcement learning framework and move the current state according to the selected action. Figure 1 illustrates how state prediction is related to the actions in reinforcement learning.
Figure 1. Basis of the proposed Wang–Landau reinforcement sampler. Reinforcement learning proposes one of the action choices (i.e., $a_{\mathrm{left}}$, $a_{\mathrm{right}}$, $a_{\mathrm{up}}$, and $a_{\mathrm{down}}$) to move the target state. For example, if $a_{\mathrm{right}}$ is selected by our visual tracker, we move the target state to the right. The proposed visual tracker combines this reinforcement learning with Wang–Landau sampling to balance the exploitation and exploration degrees of the prediction.

1.1. Basic Idea

We further enhance prediction accuracy by balancing the exploitation and exploration abilities of reinforcement learning. The exploitation procedure further simulates movements at states around the current local optimum, which the tracker has extensively explored. For example, assume that our visual tracker observes that the target in Figure 1 usually moves to the right up to the current frame. We then likely need to further exploit possible states on the right side of the current state. Our visual tracker can accurately predict the target state using exploitation, especially when the target moves smoothly. In contrast, the exploration procedure simulates movements at states far from the current local optimum, which the tracker has only minimally explored. For example, the target in Figure 1 moves randomly and inconsistently in some cases; thus, we need to explore unvisited states far from the current state. Our visual tracker can predict the target state using exploration, especially when the target is fast-moving. Traditional reinforcement learning methods have difficulty scheduling the exploitation and exploration procedures. In contrast, our visual tracker overcomes this problem by introducing Wang–Landau sampling [7], in which exploitation and exploration compete against each other in a sampling framework and reach an equilibrium.

1.2. Our Contributions

  • We propose a new Q-learning algorithm, augmented by Wang–Landau sampling, in which the exploitation and exploration abilities of reinforcement learning are balanced when searching for target states. Conventional Q-learning methods typically select an action that maximizes the current action-value for exploitation, whereas they choose an action at random with a probability $\epsilon$ for exploration. However, it is nontrivial to determine the optimal $\epsilon$ that balances the exploitation and exploration abilities. In contrast, the proposed method balances the exploitation and exploration processes based on the Wang–Landau algorithm. The method adaptively controls the randomness of the policy using statistics on the number of visits to a particular state. Thus, our method considerably enhances conventional Q-learning performance, which in turn enhances visual tracking performance.
  • We present a novel visual tracking system based on the Wang–Landau reinforcement sampler. We exhaustively evaluate the proposed visual tracker and numerically demonstrate the effectiveness of the Wang–Landau reinforcement sampler.
  • Our visual tracker shows state-of-the-art performance in terms of frames per second (FPS) and runs in real time because our method contains no complicated deep neural network architectures.
The remainder of this paper is organized as follows. In Section 2, we introduce relevant visual tracking algorithms. We explain reinforcement learning-based visual tracking in Section 3.2 and enhance the proposed visual tracking using Wang–Landau sampling in Section 3.3. In Section 4, we evaluate visual tracking algorithms quantitatively and qualitatively. In Section 5, we conclude the paper.

3. Proposed Visual Tracking System

3.1. Bayesian Visual Tracking

The visual tracking system aims to accurately infer target configurations over frames. This inference problem can be formulated using the posterior probability $p(X_t \mid Y_{1:t})$:
$\hat{X}_t = \arg\max_{X_t} p(X_t \mid Y_{1:t}), \qquad (1)$
where $\hat{X}_t$ is the best state at time t. We can accurately estimate $p(X_t \mid Y_{1:t})$ in (1) by adopting Bayesian filtering, which updates the posterior distribution $p(X_t \mid Y_{1:t})$ using the following rule:
$p(X_t \mid Y_{1:t}) \propto p(Y_t \mid X_t) \int p(X_t \mid X_{t-1})\, p(X_{t-1} \mid Y_{1:t-1})\, dX_{t-1} \approx p(Y_t \mid X_t)\, p(X_t \mid X_{t-1})\, p(X_{t-1} \mid Y_{1:t-1}), \qquad (2)$
where $p(Y_t \mid X_t)$ denotes the likelihood, i.e., the probability of coincidence between the target object and the observation at the proposed state, and $p(X_t \mid X_{t-1})$ represents the transition kernel that proposes the next state $X_t$ based on the previous state $X_{t-1}$. In (2), $p(Y_t \mid X_t)$ is defined as
$p(Y_t \mid X_t) = e^{f\left(Y_t(X),\, Y_{\mathrm{gt}}\right)}, \qquad (3)$
where f measures the similarity between $Y_t(X)$, the observed features [33] of the image region described by $X$, and the ground truth $Y_{\mathrm{gt}}$. We design f similarly to the matching function used in [29]. However, it is intractable to integrate over all possible values of $X_{t-1}$. Alternatively, we can sample a small number of values of $X_{t-1}$ to approximate the integral. If we could use an infinite number of samples, the approximation would produce zero error; however, because this is impractical in real-world implementations, it is important to determine a limited number of good samples that produce accurate posterior probabilities.
Then, visual tracking aims to accurately approximate the posterior probability using a mathematical expectation with a limited number of samples [34]:
$\hat{X}_t = \arg\max_{X_t} \mathbb{E}_{q(X_t \mid Y_{1:t})}\left[\log p(X_t \mid Y_{1:t})\right], \qquad (4)$
where $q(X_t \mid Y_{1:t})$ is the function that outputs $X_t$ given $Y_{1:t}$. In (4), $q(X_t \mid Y_{1:t})$ is designed by selecting the optimal transition kernel $p^*(X_t \mid X_{t-1})$, as follows:
$q(X_t \mid Y_{1:t}) = p(Y_t \mid X_t)\, p^*(X_t \mid X_{t-1})\, p(X_{t-1} \mid Y_{1:t-1}). \qquad (5)$
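To make this sampling-based approximation concrete, the following is a minimal Python sketch (not the authors' implementation) that propagates a small set of samples of $X_{t-1}$ through a transition kernel and reweights them with the exponential likelihood of (3). The callables `extract_features`, `f_sim`, and `transition` are placeholders for the feature extractor [33], the matching function of [29], and the transition kernel, none of which are specified here.

```python
import numpy as np

def likelihood(X, frame, Y_gt, extract_features, f_sim):
    """Exponential likelihood p(Y_t | X_t) = exp(f(Y_t(X), Y_gt)) from (3).
    `extract_features` and `f_sim` are placeholder callables."""
    Y_X = extract_features(frame, X)          # observed features of the region at state X
    return np.exp(f_sim(Y_X, Y_gt))

def approximate_posterior(samples_prev, weights_prev, transition, frame, Y_gt,
                          extract_features, f_sim):
    """Sample-based approximation of (2) and (4): propagate a small set of samples
    of X_{t-1} through a transition kernel and reweight them by the likelihood."""
    samples_t, weights_t = [], []
    for X_prev, w_prev in zip(samples_prev, weights_prev):
        X_t = transition(X_prev)                              # X_t ~ p(X_t | X_{t-1})
        w_t = w_prev * likelihood(X_t, frame, Y_gt, extract_features, f_sim)
        samples_t.append(X_t)
        weights_t.append(w_t)
    weights_t = np.asarray(weights_t)
    weights_t = weights_t / weights_t.sum()                   # approximate p(X_t | Y_{1:t})
    best = samples_t[int(np.argmax(weights_t))]               # MAP estimate, as in (1)
    return samples_t, weights_t, best
```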

3.2. Reinforcement Learning for Visual Tracking

In this study, $p^*(X_t \mid X_{t-1})$ in (5) is implemented by selecting the optimal action $a_t$ in a reinforcement learning framework, in which $X_{t+1} \sim a_t(\cdot \mid X_t)$. We compute the reward R by measuring the improvement in the log-posterior probability in (2), as follows:
$R(s_t, a_t, s_{t+1}) = \log p(X_{t+1} \mid Y_{1:t+1}) - \log p(X_t \mid Y_{1:t}), \qquad (6)$
where $s_t = \{X_{1:t}\}$. In (6), $s_t, s_{t+1} \in \mathcal{S}$ and $a_t \in \mathcal{A}$, where $\mathcal{S}$ and $\mathcal{A}$ denote the spaces of states and actions, respectively. $\mathcal{A}$ consists of four possible actions, $\{a_{\mathrm{left}}, a_{\mathrm{right}}, a_{\mathrm{up}}, a_{\mathrm{down}}\}$:
$\mathcal{A} = \left\{\, a_{\mathrm{left}}: p_x \leftarrow p_x - 1,\ p_y \leftarrow p_y;\quad a_{\mathrm{right}}: p_x \leftarrow p_x + 1,\ p_y \leftarrow p_y;\quad a_{\mathrm{up}}: p_y \leftarrow p_y - 1,\ p_x \leftarrow p_x;\quad a_{\mathrm{down}}: p_y \leftarrow p_y + 1,\ p_x \leftarrow p_x \,\right\}, \qquad (7)$
where $p_x$ and $p_y$ are the pixel indices along the x and y axes, respectively.
Our visual tracker aims to find the optimal policy $\pi^*: \mathcal{S} \rightarrow \mathcal{A}$ that maximizes the expected future reward at time t:
$\pi^* = \arg\max_{\pi} \mathbb{E}_{\pi}\left[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots + \gamma^{\tau - t - 1} R_{\tau}\right], \qquad (8)$
for a single episode of length $\tau$, where $\gamma < 1$ is a discount factor that weights immediate rewards more heavily than later ones. The expected cumulative reward in (8) is efficiently implemented by Q-learning [35] with the following update rule:
$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ R_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right], \qquad (9)$
where $\gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t)$ indicates the maximum update of the action-value function $Q(s_t, a_t)$ at time t. This update is caused by a state change from $s_t$ to $s_{t+1}$ through action a. In (9), $R_{t+1}$ is the reward at time $t+1$ and $\alpha$ is a weighting parameter.
The optimal policy in (8) can then be determined by maximizing the Q-values:
$\pi^*(\cdot \mid s) = \arg\max_{a} Q(s, a), \qquad (10)$
where the action $a_t$ is sampled from $\pi^*(\cdot \mid s_t)$ and the next state $s_{t+1}$ is determined by (7). However, when using (10), there is a risk of choosing suboptimal actions because we select an action that maximizes only the current action value. This problem causes our visual tracker to explore only already-visited states and to become trapped in a local optimum.
We mitigate this problem by adopting the $\epsilon$-greedy algorithm, in which we usually select the action that maximizes the current action-value and, with probability $\epsilon$, choose an action at random. This $\epsilon$-greedy rule can be expressed as
$a \leftarrow \begin{cases} \arg\max_{a} Q(s, a), & \text{with probability } 1 - \epsilon \\ \mathrm{random}(\mathcal{A}), & \text{with probability } \epsilon, \end{cases} \qquad (11)$
where $\mathrm{random}(\mathcal{A})$ returns an action chosen uniformly at random. However, a difficulty with $\epsilon$-greedy is that the randomness in (11) yields a surplus of actions, which complicates identifying the optimal solution. Therefore, we propose a semirandom strategy based on the Wang–Landau algorithm [7], in which we control the randomness of the policy using statistics on the number of visits to a particular state. This approach is explained in the following section.
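As an illustration, a minimal tabular sketch of the pixel-shift action space of (7), the $\epsilon$-greedy rule of (11), and the Q-learning update of (9) could look as follows. The encoding of a state as a pixel position and the values of $\alpha$ and $\epsilon$ are our own assumptions, and `log_post` is a hypothetical stand-in for the log posterior of (2).

```python
import random
from collections import defaultdict

# Action space of (7): each action moves the target position by one pixel.
ACTIONS = {
    "left":  (-1, 0),
    "right": (+1, 0),
    "up":    (0, -1),
    "down":  (0, +1),
}

def epsilon_greedy(Q, s, epsilon):
    """Select an action with the epsilon-greedy rule of (11)."""
    if random.random() < epsilon:
        return random.choice(list(ACTIONS))                   # explore: random(A)
    return max(ACTIONS, key=lambda a: Q[s].get(a, 0.0))       # exploit: argmax_a Q(s, a)

def q_learning_step(Q, s, a, s_next, reward, alpha=0.5, gamma=0.9):
    """Tabular Q-learning update of (9); alpha = 0.5 is an arbitrary choice."""
    q_sa = Q[s].get(a, 0.0)
    best_next = max(Q[s_next].values()) if Q[s_next] else 0.0
    Q[s][a] = q_sa + alpha * (reward + gamma * best_next - q_sa)

# Example: one tracking step with a hypothetical log-posterior function `log_post`.
Q = defaultdict(dict)
state = (10, 20)                          # (p_x, p_y) of the current target state
action = epsilon_greedy(Q, state, epsilon=0.1)
dx, dy = ACTIONS[action]
next_state = (state[0] + dx, state[1] + dy)
# reward of (6): improvement in the log posterior between consecutive states
# reward = log_post(next_state) - log_post(state)
# q_learning_step(Q, state, action, next_state, reward)
```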

3.3. Wang–Landau Reinforcement Sampler for Visual Tracking

The proposed Wang–Landau sampler encourages exploration in reinforcement learning by estimating the density of states (DOS) [15], in which the DOS value approximates the frequency of visits to each state using Monte Carlo simulations. Based on the DOS, we determine whether a particular state has been sufficiently explored. If a state has a small DOS value, the Wang–Landau sampler guides the reinforcement learning to explore that state. Otherwise, the sampler refrains from exploring that state.
For Wang–Landau sampling, we define $D = \{d_i\}_{i=1}^{|\mathcal{S}|}$ and $V = \{v_i\}_{i=1}^{|\mathcal{S}|}$, in which $d_i$ and $v_i$ are the DOS score and the number of visits for the i-th state, respectively, and $|\mathcal{S}|$ is the total number of states. Then, we update $d_i$ whenever the visual tracker visits the i-th state:
$d_i \leftarrow d_i \times w, \qquad (12)$
where $w > 1$ is a weighting parameter. $v_i$ is updated as follows:
$v_i \leftarrow v_i + 1, \qquad (13)$
where $d_i$ and $v_i$ are initialized to 1 and 0, respectively. As the iterations proceed, the Wang–Landau sampling adopts a coarse-to-fine strategy to attain more accurate DOS values. In early iterations, we use a large value of w in (12), which increases the update speed. In later iterations, we use a smaller value of w to fine-tune the updates. Accordingly, we decrease the value of w, $w \leftarrow \sqrt{w}$, whenever the current iteration satisfies the following condition, i.e., the semiflat status:
$\forall i, \quad v_i \geq 0.8 \times \frac{1}{|\mathcal{S}|} \sum_{i} v_i, \qquad (14)$
which indicates that all states have been at least partially explored. After the modification of w, the values $v_i$ are reinitialized to 0.
Owing to the need to balance the exploration of reinforcement learning with its exploitation, we present a new scheduling approach for reinforcement learning, as follows:
$a \leftarrow \begin{cases} \arg\max_{a} Q(s, a), & \text{with probability } 1 - \epsilon_{\mathrm{new}} \\ \mathrm{random}(\mathcal{A}), & \text{with probability } \epsilon_{\mathrm{new}}, \end{cases} \qquad (15)$
with
$\epsilon_{\mathrm{new}} = \min\left(1, \frac{d_i}{p_i(Y_t \mid X_t)}\right), \qquad (16)$
where $p_i(Y_t \mid X_t)$ is the likelihood at the i-th state, as defined in (3). In (16), exploration ($d_i$) and exploitation ($p_i(Y_t \mid X_t)$) compete with each other. For example, if $d_i$ increases relative to $p_i(Y_t \mid X_t)$, reinforcement learning explores diverse states with a high probability; otherwise, it tends to exploit the current state with a high probability.
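A minimal sketch of this Wang–Landau bookkeeping might look as follows; the square-root refinement of w and the clipping of $\epsilon_{\mathrm{new}}$ to a valid probability reflect our reading of (12)–(16), not a specification taken verbatim from the paper.

```python
import math

def wang_landau_update(d, v, i, w):
    """Update the DOS score (12) and visit count (13) for the visited state i."""
    d[i] *= w
    v[i] += 1

def is_semiflat(v, ratio=0.8):
    """Semiflat condition (14): every state has been visited at least
    `ratio` times the average number of visits."""
    mean_visits = sum(v) / len(v)
    return all(vi >= ratio * mean_visits for vi in v)

def refine_modification_factor(w, v):
    """Coarse-to-fine schedule: when the visit histogram is semiflat, shrink w
    (a square root is assumed here, the usual Wang-Landau choice) and reset visits."""
    if is_semiflat(v):
        w = math.sqrt(w)
        for i in range(len(v)):
            v[i] = 0
    return w

def epsilon_new(d_i, p_i):
    """Exploration probability of (16): the DOS score d_i (exploration) competes
    with the likelihood p_i (exploitation); the ratio is clipped to [0, 1]."""
    return min(1.0, d_i / p_i)
```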
Algorithm 1 illustrates the complete process of the proposed method.
Algorithm 1 Wang–Landau reinforcement sampler
1: Input: $\hat{X}_t$, $d_i = 1$, $v_i = 0$, $\forall i$, and $w > 1$.
2: Output: $\hat{X}_{t+1}$
3: for $j = 1$ to $N$ do
4:   • Reinforcement learning
5:   Propose an action a according to (15).
6:   Find the next state $X_{t+1}$ using the proposed action in (7).
7:   Compute the posterior probability $p(X_{t+1} \mid Y_{1:t+1})$ in (2).
8:   Estimate the reward $R_{t+1}$ using $p(X_{t+1} \mid Y_{1:t+1})$ in (6).
9:   Update the Q value using $R_{t+1}$ in (9).
10:  Determine the optimal policy function $\pi$ using Q in (10).
11:  • Wang–Landau Monte Carlo sampling
12:  Find the index i of the state $X_{t+1}$.
13:  Update the DOS score $d_i$ using (12).
14:  Update the number of visits $v_i$ using (13).
15:  if $v_i$ reaches the semiflat status in (14) then
16:    $w \leftarrow \sqrt{w}$.
17:    $v_i = 0$, $\forall i$.
18:  end if
19: end for
20: Find the best configuration $\hat{X}_{t+1}$ using (1).
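For illustration, the following Python skeleton assembles the steps of Algorithm 1 under the same assumptions as the earlier sketches; `state_index`, `log_posterior`, and `likelihood_at` are placeholder callables, and the initial w, $\alpha$, and the square-root refinement are our own choices ($\gamma = 0.9$ follows Section 4.1).

```python
import math
import random
from collections import defaultdict

ACTIONS = {"left": (-1, 0), "right": (+1, 0), "up": (0, -1), "down": (0, +1)}  # (7)

def wang_landau_reinforcement_sampler(X_hat_t, n_states, state_index, log_posterior,
                                      likelihood_at, N=2000, w=2.0,
                                      alpha=0.5, gamma=0.9):
    """Illustrative skeleton of Algorithm 1; all callables are placeholders."""
    d = [1.0] * n_states              # DOS scores, initialized to 1
    v = [0] * n_states                # visit counts, initialized to 0
    Q = defaultdict(dict)
    X = X_hat_t
    best_X, best_logp = X, log_posterior(X)

    for _ in range(N):
        # --- Reinforcement learning ---
        i = state_index(X)
        eps = min(1.0, d[i] / likelihood_at(X))                       # (16)
        if random.random() < eps:                                     # (15)
            a = random.choice(list(ACTIONS))
        else:
            a = max(ACTIONS, key=lambda a_: Q[X].get(a_, 0.0))
        dx, dy = ACTIONS[a]
        X_next = (X[0] + dx, X[1] + dy)                               # (7)
        logp_next = log_posterior(X_next)                             # (2)
        reward = logp_next - log_posterior(X)                         # (6)
        q_sa = Q[X].get(a, 0.0)
        best_next = max(Q[X_next].values()) if Q[X_next] else 0.0
        Q[X][a] = q_sa + alpha * (reward + gamma * best_next - q_sa)  # (9)

        # --- Wang-Landau Monte Carlo sampling ---
        j = state_index(X_next)
        d[j] *= w                                                     # (12)
        v[j] += 1                                                     # (13)
        if all(vi >= 0.8 * (sum(v) / len(v)) for vi in v):            # (14), semiflat
            w = math.sqrt(w)                                          # assumed refinement
            v = [0] * n_states

        if logp_next > best_logp:                                     # track arg max of (1)
            best_X, best_logp = X_next, logp_next
        X = X_next

    return best_X
```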

4. Experiments

4.1. Experimental Settings

The proposed method was compared with 30 non-deep learning methods (e.g., SCM [36], STRUCK [37], ASLA [38], TLD [39], CXT [40], VTD [41], VTS [42], CSK [43], MEEM [44], Staple [45], and SRDCF [46]) on 50 test sequences in the OTB dataset [47]. Furthermore, our method was compared to state-of-the-art deep learning-based methods: C-COT [48], SINT [29], SINT-op [29], ECO [49], ECO-HC [49], SiamRPN++ [32], TADT [50], DAT [51], and SiamDW [52]. Moreover, the proposed visual tracker was compared on the VOT2017 [53] dataset with 10 recent visual trackers, including CFWCR [54], CFCF [55], CSRDCF [56], MCCT [57], and LSART [58]. Our proposed visual tracker was also compared on the LaSOT [59] dataset with 8 state-of-the-art visual trackers, namely ECO [49], SiamRPN++ [32], GlobalTrack [31], ATOM [60], DASiam [30], CFNet [61], SPLT [62], and StructSiam [63]. Please note that we compared the proposed method with 30 non-deep learning visual trackers but reported only the top 10 visual trackers in Figure 2 to visualize the precision and success curves for each tracker more clearly.
Figure 2. Quantitative comparison with non-deep learning based methods. We evaluate the methods using precision and success plots in (a,b), respectively.
We used the precision, success rate, and AUC as evaluation metrics for these methods [47]. For precision, we calculated the $\ell_2$-norm distance between the estimated bounding box $E_t$ and the ground-truth bounding box $G_t$. We then depicted the precision plot, which shows the percentage of frames for which the $\ell_2$-norm distance is less than a specific threshold. For the success rate, we calculated the intersection over union, $IoU = \frac{|E_t \cap G_t|}{|E_t \cup G_t|}$, where $|\cdot|$ indicates the number of pixels. We considered visual tracking at each frame to be successful if IoU is greater than a specific threshold. The success rate is then the ratio of the number of successful frames to the total number of frames. We illustrated the success plot, which presents the success rates at different thresholds, and calculated the area under the curve (AUC).
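A small sketch of these metrics, assuming axis-aligned (x, y, w, h) boxes and precomputed center-distance errors as inputs, could be written as follows; the box-geometry IoU stands in for the pixel-count definition above.

```python
import numpy as np

def iou(box_e, box_g):
    """Overlap between estimated and ground-truth boxes given as (x, y, w, h)."""
    x1, y1 = max(box_e[0], box_g[0]), max(box_e[1], box_g[1])
    x2 = min(box_e[0] + box_e[2], box_g[0] + box_g[2])
    y2 = min(box_e[1] + box_e[3], box_g[1] + box_g[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = box_e[2] * box_e[3] + box_g[2] * box_g[3] - inter
    return inter / union if union > 0 else 0.0

def precision_plot(center_errors, thresholds=np.arange(0, 51)):
    """Percentage of frames whose center-location (l2) error is below each threshold."""
    errors = np.asarray(center_errors, dtype=float)
    return np.array([(errors <= t).mean() for t in thresholds])

def success_plot(ious, thresholds=np.linspace(0.0, 1.0, 101)):
    """Success rate at each overlap threshold and the AUC (mean success rate)."""
    overlaps = np.asarray(ious, dtype=float)
    rates = np.array([(overlaps > t).mean() for t in thresholds])
    return rates, rates.mean()
```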
For fair comparison, we used the best visual tracking results reported by the authors in the original papers and followed their experimental settings. For example, SiamRPN++ [32] was pretrained on ImageNet [64] and used ResNet [33] as a backbone network. In addition, SiamRPN++ was trained using the training datasets of COCO [65], ImageNet DET [64], ImageNet VID, and the YouTube-Bounding Boxes dataset [66]. All experiments were conducted using a desktop with an Intel CPU i7 3.60 GHz and GeForce Titan XP graphics card for the proposed method. Throughout the experiments, hyperparameters were fixed as follows: γ = 0.9 in (8), N = 2000 in Algorithm 1 and w = 0.8 in (12).

4.2. Ablation Study

We examined the effectiveness of Wang–Landau sampling for reinforcement learning and the sensitivity of our method to hyperparameters. Table 1 compares two variants of our method: reinforcement learning-based visual trackers with $\epsilon$-greedy and with Wang–Landau sampling. Table 1 demonstrates that the accuracy of reinforcement learning-based visual trackers can be considerably increased if the exploration and exploitation abilities are balanced by Wang–Landau sampling. Table 2 shows the visual tracking results of our method for different hyperparameter values. If γ in (8) increases, our method imposes more weight on future rewards. If N in Algorithm 1 is larger, our method can estimate the DOS score and Q-values more accurately, at the cost of additional computation. If w in (12) is larger, our method can estimate the DOS score within a smaller number of iterations but less accurately. As shown in Table 2, our method is insensitive to these hyperparameters: even when their values change substantially, our method consistently produces accurate visual tracking results, because the proposed Q-learning-based reinforcement method accurately predicts target states regardless of the hyperparameter values.
Table 1. Ablation study: Effectiveness of Wang–Landau sampling for reinforcement learning-based visual tracking.
Table 2. Ablation study: Sensitivity to hyperparameters, γ in (8), N in Algorithm 1, and w in (12).

4.3. Quantitative Comparison

Figure 2 shows the quantitative comparison with non-deep learning-based visual trackers. Our method considerably outperformed the second-best trackers, MEEM and Staple, in terms of precision and success rate, respectively. Our method was able to track the target accurately despite severe appearance changes of the target. In particular, the proposed Wang–Landau Monte Carlo sampling improves the exploration of unvisited states, enabling our tracker to cover abrupt motion changes of the target.
Figure 3 shows the quantitative comparisons with recent deep-learning-based visual trackers. SiamDW was the best in terms of the precision plot, and ECO was the best in terms of the success plot. As shown in Figure 3, our method was competitive with deep learning-based visual trackers. In particular, our method produced accurate tracking results in terms of the success plot but relatively inaccurate tracking results in terms of the precision plot, implying that our method can be improved by adopting multiscale approaches.
Figure 3. Quantitative comparison with deep learning based methods in terms of precision and success rate.
In Figure 4, we highlighted experiments on test sequences that contain examples of interrupted and recovered tracking. For example, the "Out of view" and "Occlusion" sequences contain interrupted and recovered tracking scenarios. In these sequences, target objects frequently disappear owing to occlusion and out-of-view attributes, which causes conventional trackers to miss the target trajectories. After a long time, the targets reappear and the trackers need to recover the target trajectories. In this situation, the proposed tracker efficiently recovers missing trajectories using the proposed exploration mechanism. As shown in Figure 4, the proposed visual tracker considerably outperforms other state-of-the-art deep-learning visual trackers, which demonstrates the effectiveness of the proposed exploration mechanism based on Wang–Landau sampling.
Figure 4. Quantitative comparison in terms of the success plot for interrupted and recovered tracking scenarios.
Figure 5 quantitatively evaluates the proposed method (ours) and recent state-of-the-art deep-learning-based visual trackers using the VOT2017 dataset. Our method considerably outperforms the other methods in terms of accuracy. Staple shows the second-best accuracy; however, its robustness is significantly worse than that of the proposed method, implying that our method rarely missed the targets while preserving accuracy over frames. Gnet is the best in terms of robustness, while its accuracy is lower than ours.
Figure 5. Quantitative comparison of our method with recent state-of-the-art deep-learning visual trackers using the VOT2017 dataset.
Table 3 quantitatively evaluates the proposed method and deep-learning-based visual tracking methods using the LaSOT dataset. Our method and GlobalTrack present state-of-the-art tracking performance. GlobalTrack performs explicit redetection, which requires an accurate object detector. In contrast, the performance of the proposed method stems from reinforcement learning with Wang–Landau-based exploration; our method implicitly searches for the targets without any object detector.
Table 3. Quantitative comparison of our method with recent deep-learning-based visual tracking methods using the LaSOT dataset.
Table 4 quantitatively evaluates the computational costs of recent visual trackers using the LaSOT dataset. Our visual tracker shows state-of-the-art performance in terms of FPS and can run in real time because our method contains no complicated deep neural network architectures.
Table 4. Computational costs of visual trackers in terms of frames per second (FPS).
Figure 6 reports the recovery rate for eight state-of-the-art visual trackers, namely ECO, SiamRPN++, GlobalTrack, ATOM, DASiam, CFNet, SPLT, and StructSiam, using the LaSOT dataset. We counted the average number of frames for which IoU is zero (i.e., the average number of interruptions), which means that the trackers missed the targets. After each frame with IoU = 0, we counted the average number of frames for which IoU becomes nonzero again (i.e., the average number of recovered trajectories), which means that the trackers recovered the targets. The recovery rate was calculated by dividing the average number of recovered trajectories by the average number of interruptions. As shown in Figure 6, our method considerably surpasses the other methods in terms of the recovery rate, which demonstrates the effectiveness of the proposed Wang–Landau reinforcement sampler.
Figure 6. Comparison of the recovery rate. Blue, orange, and gray bars denote the average number of interruptions, the average number of recovered trajectories, and the recovery rate (i.e., $\frac{\text{average number of recovered trajectories}}{\text{average number of interruptions}} \times 100$), respectively.
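The following sketch computes these quantities for a single sequence under one plausible, event-based reading of the protocol described above; averaging over sequences and trackers would then reproduce the bars of Figure 6.

```python
def recovery_rate(ious):
    """Count interruptions (IoU dropping to 0) and recoveries (IoU becoming nonzero
    again) for one sequence, then return the recovery rate used in Figure 6."""
    interruptions, recoveries, lost = 0, 0, False
    for overlap in ious:
        if not lost and overlap == 0:
            interruptions += 1          # tracker has just missed the target
            lost = True
        elif lost and overlap > 0:
            recoveries += 1             # tracker has recovered the trajectory
            lost = False
    rate = 100.0 * recoveries / interruptions if interruptions else 0.0
    return interruptions, recoveries, rate
```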

4.4. Qualitative Comparison

Figure 7 qualitatively compares our method with the method based on conventional $\epsilon$-greedy reinforcement learning using the OTB dataset. Although the test sequences contained abrupt motions (e.g., the Deer, Shaking, MotorRolling, and Biker sequences), severe deformation (e.g., the Ironman, Diving, Jump, Skiing, and Surfer sequences), occlusion (the Soccer sequence), and illumination changes (e.g., the Matrix, Shaking, and Skating1 sequences), our method accurately tracked the targets. However, conventional reinforcement learning with $\epsilon$-greedy frequently failed to track the targets when abrupt motions occurred because it could not sufficiently explore unvisited states.
Figure 7. Qualitative comparisons. White boxes present the estimated bounding boxes of our method, while red boxes represent the results of our method without Wang–Landau sampling.
Figure 8 shows qualitative visual tracking results of the proposed method with and without the Wang–Landau algorithm using the LaSOT dataset. The video sequences include tiny objects (e.g., the boat-12, crocodile-3, drone-13, elephant-18, fox-3, and frog-9 sequences), background clutter (e.g., the chameleon-6, crab-18, and fox-3 sequences), nonrigid objects (e.g., the bear-17, bird-17, cattle-7, crocodile-3, fox-3, frog-9, and giraffe-10 sequences), motion blur (e.g., the bus-5 and crab-18 sequences), and rotation (e.g., the bottle-1 sequence). Despite these challenging visual tracking environments, the proposed method accurately tracked the targets. These results indicate that the proposed Wang–Landau-based reinforcement learning is helpful for finding unexplored states and recovering missed trajectories. As shown in Figure 8, the proposed Wang–Landau algorithm helps our visual tracker recover inaccurate bounding boxes.
Figure 8. Qualitative evaluation. White and green boxes present the predicted bounding boxes of the proposed method with and without the Wang–Landau algorithm, respectively.
In summary, the proposed method works better than other state-of-the-art methods for the following reasons. Our method predicts the target state with greater accuracy by defining multiple actions in a reinforcement learning framework and moving the current state according to the selected action. In addition, we further enhance prediction accuracy by improving reinforcement learning performance with Wang–Landau sampling, in which exploitation and exploration compete against each other in a sampling framework and reach an equilibrium.

5. Conclusions

In this study, we presented a visual tracking system based on reinforcement learning, in which the tracking accuracy is considerably enhanced by predicting the target configuration for the subsequent frame. Our visual tracker is improved by Wang–Landau sampling, which efficiently schedules the exploration and exploitation of reinforcement learning. The experimental results demonstrate that our method significantly outperforms non-deep learning-based visual tracking methods. Our method is competitive with deep learning-based visual trackers, while being the fastest algorithm among the compared visual trackers. For future work, we will adopt deep Q-learning, a well-known deep learning-based reinforcement learning approach, to improve visual tracking accuracy.
Our method can fail to track the targets if the target motions are highly random. In this case, the proposed reinforcement learning method inaccurately predicts the target position and degrades visual tracking performance. For future research, we plan to integrate an explicit object detector into the proposed framework to handle random motions.

Author Contributions

Conceptualization, D.K. and J.K.; validation, D.K.; writing—original draft preparation, D.K.; writing—review and editing, J.K.; supervision, J.K. All authors have read and agreed to the published version of the manuscript.

Funding

This research was partly supported by the Chung-Ang University Graduate Research Scholarship Grants in 2019 and partly supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (NRF-2020R1C1C1004907).

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Sui, Y.; Zhang, L. Visual Tracking via Locally Structured Gaussian Process Regression. IEEE SPL 2015, 22, 1331–1335. [Google Scholar] [CrossRef]
  2. Wang, L.; Lu, H.; Wang, D. Visual Tracking via Structure Constrained Grouping. IEEE SPL 2014, 22, 794–798. [Google Scholar] [CrossRef]
  3. Xu, Y.; Ni, B.; Yang, X. When Correlation Filters Meet Convolutional Neural Networks for Visual Tracking. IEEE SPL 2016, 23, 1454–1458. [Google Scholar]
  4. Kwon, J.; Dragon, R.; Gool, L.V. Joint Tracking and Ground Plane Estimation. IEEE SPL 2016, 23, 1514–1517. [Google Scholar] [CrossRef]
  5. Kim, H.; Jeon, S.; Lee, S.; Paik, J. Robust Visual Tracking Using Structure-Preserving Sparse Learning. IEEE SPL 2017, 24, 707–711. [Google Scholar] [CrossRef]
  6. Xu, Y.; Wang, J.; Li, H.; Li, Y.; Miao, Z.; Zhang, Y. Patch-based Scale Calculation for Real-time Visual Tracking. IEEE SPL 2016, 23, 40–44. [Google Scholar]
  7. Wang, F.; Landau, D. Efficient, multiple-range random walk algorithm to calculate the density of states. Phys. Rev. Lett. 2001, 86, 2050–2053. [Google Scholar] [CrossRef]
  8. Yun, S.; Choi, J.; Yoo, Y.; Yun, K.; Choi, J.Y. Action-Decision Networks for Visual Tracking with Deep Reinforcement Learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
  9. Supancic, J., III; Ramanan, D. Tracking as Online Decision-Making: Learning a Policy From Streaming Videos with Reinforcement Learning. In Proceedings of the IEEE International Conference on Computer Vision ICCV, Venice, Italy, 22–29 October 2017. [Google Scholar]
  10. Huang, C.; Lucey, S.; Ramanan, D. Learning Policies for Adaptive Tracking with Deep Feature Cascades. In Proceedings of the IEEE International Conference on Computer Vision ICCV, Venice, Italy, 22–29 October 2017. [Google Scholar]
  11. Choi, J.; Kwon, J.; Lee, K.M. Real-time Visual Tracking by Deep Reinforced Decision Making. Comput. Vis. Image Underst. 2018, 171, 10–19. [Google Scholar] [CrossRef]
  12. Williams, R.J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Mach. Learn. 1992, 8, 229–256. [Google Scholar] [CrossRef]
  13. Kwon, J.; Lee, K.M. Tracking of Abrupt Motion using Wang-Landau Monte Carlo Estimation. In Proceedings of the 10th European Conference on Computer Vision, Marseille, France, 12–18 October 2008. [Google Scholar]
  14. Zhou, X.; Lu, Y. Abrupt Motion Tracking via Adaptive Stochastic. Approximation Monte Carlo Sampling. In Proceedings of the 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Francisco, CA, USA, 13–18 June 2010. [Google Scholar]
  15. Kwon, J.; Lee, K.M. Wang-Landau Monte Carlo-based Tracking Methods for Abrupt Motions. IEEE Trans. Pattern Anal. Mach. Intell. 2013, 35, 1011–1024. [Google Scholar] [CrossRef]
  16. Liu, J.; Zhou, L.; Zhao, L. Advanced wang-landau monte carlo-based tracker for abrupt motions. IEEJ Trans. Electr. Electron. Eng. 2019, 14, 877–883. [Google Scholar] [CrossRef]
  17. LeCun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 1998, 86, 2278–2324. [Google Scholar] [CrossRef]
  18. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. Commun. ACM 2017, 60, 84–90. [Google Scholar] [CrossRef]
  19. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 1137–1149. [Google Scholar] [CrossRef] [PubMed]
  20. Ma, C.; Huang, J.B.; Yang, X.; Yang, M.H. Hierarchical Convolutional Features for Visual Tracking. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015. [Google Scholar]
  21. Nam, H.; Han, B. Learning Multi-Domain Convolutional Neural Networks for Visual Tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
  22. Wang, L.; Ouyang, W.; Wang, X.; Lu, H. Stct: Sequentially training convolutional networks for visual tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
  23. Wang, N.; Yeung, D.Y. Learning a deep compact image representation for visual tracking. In Advances in Neural Information Processing Systems; The MIT Press: Cambridge, MA, USA, 2013. [Google Scholar]
  24. Kwon, J. Robust Visual Tracking based on Variational Auto-encoding Markov Chain Monte Carlo. Inf. Sci. 2020, 512, 1308–1323. [Google Scholar] [CrossRef]
  25. Chatfield, K.; Simonyan, K.; Vedaldi, A.; Zisserman, A. Return of the devil in the details: Delving deep into convolutional nets. arXiv 2014, arXiv:1405.3531. [Google Scholar]
  26. Bertinetto, L.; Valmadre, J.; Henriques, J.F.; Vedaldi, A.; Torr, P.H. Fully-Convolutional Siamese Networks for Object Tracking. arXiv 2016, arXiv:1606.09549. [Google Scholar]
  27. Chen, K.; Tao, W. Once for All: A Two-flow Convolutional Neural Network for Visual Tracking. arXiv 2016, arXiv:1604.07507. [Google Scholar] [CrossRef]
  28. Held, D.; Thrun, S.; Savarese, S. Learning to Track at 100 FPS with Deep Regression Networks. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 8–16 October 2016. [Google Scholar]
  29. Tao, R.; Gavves, E.; Smeulders, A.W. Siamese Instance Search for Tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
  30. Zhu, Z.; Wang, Q.; Li, B.; Wu, W.; Yan, J.; Hu, W. Distractor-Aware Siamese Networks for Visual Object Tracking. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018. [Google Scholar]
  31. Huang, L.; Zhao, X.; Huang, K. GlobalTrack: A Simple and Strong Baseline for Long-term Tracking. arXiv 2019, arXiv:1912.08531. [Google Scholar]
  32. Li, B.; Wu, W.; Wang, Q.; Zhang, F.; Xing, J.; Yan, J. SiamRPN++: Evolution of Siamese Visual Tracking with Very Deep Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019. [Google Scholar]
  33. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
  34. Shi, T.; Steinhardt, J.; Liang, P. Learning Where to Sample in Structured Prediction. In Artificial Intelligence and Statistics; The MIT Press: Cambridge, MA, USA, 2015. [Google Scholar]
  35. Watkins, C.; Dayan, P. Q-learning, in Machine learning. Mach. Learn. 1992, 8, 279–292. [Google Scholar] [CrossRef]
  36. Zhong, W.; Lu, H.; Yang, M.H. Robust Object Tracking via Sparsity-based Collaborative Model. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012. [Google Scholar]
  37. Hare, S.; Saffari, A.; Torr, P.H.S. Struck: Structured Output Tracking with Kernels. In Proceedings of the 2011 International Conference on Computer Vision, Barcelona, Spain, 6–13 November 2011. [Google Scholar]
  38. Jia, X.; Lu, H.; Yang, M.H. Visual Tracking via Adaptive Structural Local Sparse Appearance Model. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012. [Google Scholar]
  39. Kalal, Z.; Mikolajczyk, K.; Matas, J. Tracking-Learning-Detection. IEEE Trans. Pattern Anal. Mach. Intell. 2012, 34, 1409–1422. [Google Scholar] [CrossRef]
  40. Dinh, T.B.; Vo, N.; Medioni, G. Context Tracker: Exploring Supporters and Distracters in Unconstrained Environments. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 20–25 June 2011. [Google Scholar]
  41. Kwon, J.; Lee, K.M. Visual Tracking Decomposition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, San Francisco, CA, USA, 13–18 June 2010. [Google Scholar]
  42. Kwon, J.; Lee, K.M. Tracking by Sampling Trackers. In Proceedings of the International Conference on Computer Vision, Barcelona, Spain, 6–13 November 2011. [Google Scholar]
  43. Henriques, J.; Caseiro, R.; Martins, P.; Batista, J. Exploiting the circulant structure of tracking-by-detection with kernels. In European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2012. [Google Scholar]
  44. Zhang, J.; Ma, S.; Sclaroff, S. MEEM: Robust tracking via multiple experts using entropy minimization. In European Conference on Computer Vision; Springer: Cham, Switzerland, 2014. [Google Scholar]
  45. Bertinetto, L.; Valmadre, J.; Golodetz, S.; Miksik, O.; Torr, P. Staple: Complementary Learners for Real-Time Tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
  46. Danelljan, M.; Häger, G.; Khan, F.; Felsberg, M. Learning Spatially Regularized Correlation Filters for Visual Tracking. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015. [Google Scholar]
  47. Wu, Y.; Lim, J.; Yang, M.H. Online object tracking: A benchmark. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Portland, Oregon, 23–28 June 2013. [Google Scholar]
  48. Danelljan, M.; Robinson, A.; Khan, F.; Felsberg, M. Beyond Correlation Filters: Learning Continuous Convolution Operators for Visual Tracking. In European Conference on Computer Vision; Springer: Cham, Switzerland, 2016. [Google Scholar]
  49. Danelljan, M.; Bhat, G.; Shahbaz Khan, F.; Felsberg, M. ECO: Efficient Convolution Operators for Tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
  50. Li, X.; Ma, C.; Wu, B.; He, Z.; Yang, M.H. Target-Aware Deep Tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019. [Google Scholar]
  51. Pu, S.; Song, Y.; Ma, C.; Zhang, H.; Yang, M.H. Deep Attentive Tracking via Reciprocative Learning. In Advances in Neural Information Processing Systems; The MIT Press: Cambridge, MA, USA, 2018. [Google Scholar]
  52. Zhang, Z.; Peng, H. Deeper and Wider Siamese Networks for Real-Time Visual Tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019. [Google Scholar]
  53. Kristan, M.; Leonardis, A.; Matas, J.; Felsberg, M.; Pflugfelder, R.; Čehovin Zajc, L.; Vojir, T.; Häger, G.; Lukežič, A.; Eldesokey, A.; et al. The Visual Object Tracking VOT2017 Challenge Results. In Proceedings of the IEEE International Conference on Computer Vision Workshops, Venice, Italy, 22–29 October 2017. [Google Scholar]
  54. He, Z.; Fan, Y.; Zhuang, J.; Dong, Y.; Bai, H. Correlation Filters with Weighted Convolution Responses. In Proceedings of the IEEE International Conference on Computer Vision Workshops, Venice, Italy, 22–29 October 2017. [Google Scholar]
  55. Gundogdu, E.; Alatan, A.A. Good Features to Correlate for Visual Tracking. arXiv 2017, arXiv:1704.06326. [Google Scholar] [CrossRef]
  56. Lukezic, A.; Vojir, T.; Zajc, L.C.; Matas, J.; Kristan, M. Discriminative Correlation Filter with Channel and Spatial Reliability. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
  57. Wang, N.; Zhou, W.; Tian, Q.; Hong, R.; Wang, M.; Li, H. Multi-Cue Correlation Filters for Robust Visual Tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018. [Google Scholar]
  58. Sun, C.; Wang, D.; Lu, H.; Yang, M.H. Learning Spatial-Aware Regressions for Visual Tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018. [Google Scholar]
  59. Fan, H.; Lin, L.; Yang, F.; Chu, P.; Deng, G.; Yu, S.; Bai, H.; Xu, Y.; Liao, C.; Ling, H. LaSOT: A High-quality Benchmark for Large-scale Single Object Tracking. arXiv 2018, arXiv:1809.07845. [Google Scholar]
  60. Danelljan, M.; Bhat, G.; Khan, F.S.; Felsberg, M. Atom: Accurate tracking by overlap maximization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019. [Google Scholar]
  61. Valmadre, J.; Bertinetto, L.; Henriques, J.; Vedaldi, A.; Torr, P.H.S. End-To-End Representation Learning for Correlation Filter Based Tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
  62. Yan, B.; Zhao, H.; Wang, D.; Lu, H.; Yang, X. ‘Skimming-Perusal’ Tracking: A Framework for Real-Time and Robust Long-term Tracking. In Proceedings of the IEEE International Conference on Computer Vision, Seoul, Korea, 27 October–2 November 2019. [Google Scholar]
  63. Zhang, Y.; Wang, L.; Qi, J.; Wang, D.; Feng, M.; Lu, H. Structured siamese network for real-time visual racking. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018. [Google Scholar]
  64. Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; et al. ImageNet Large Scale Visual Recognition Challenge. Int. J. Comput. Vis. 2015, 115, 211–252. [Google Scholar] [CrossRef]
  65. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollar, P.; Zitnick, C.L. Microsoft coco: Common objects in context. In Proceedings of the European Conference on Computer Vision, Zurich, Switzerland, 6–12 September 2014. [Google Scholar]
  66. Real, E.; Shlens, J.; Mazzocchi, S.; Pan, X.; Vanhoucke, V. Youtube-boundingboxes: A large high-precision human-annotated data set for object detection in video. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
