Visual Tracking Using Wang–Landau Reinforcement Sampler

Abstract: In this study, we present a novel tracking system in which the tracking accuracy is considerably enhanced by state prediction. To this end, we present a new Q-learning-based reinforcement method augmented by Wang–Landau sampling. In the proposed method, reinforcement learning is used to predict the target configuration for the subsequent frame, while the Wang–Landau sampler balances the exploitation and exploration degrees of the prediction. Our method adaptively controls the randomness of the policy using statistics on the number of visits to a particular state. Thus, our method considerably enhances the performance of the conventional Q-learning algorithm, which in turn enhances visual tracking performance. Numerical results demonstrate that our method substantially outperforms other state-of-the-art visual trackers and runs in realtime because it contains no complicated deep neural network architectures.


Introduction
Visual tracking is a fundamental computer vision problem [1][2][3][4][5][6] with several applications, including autonomous driving, surveillance systems, and robotic systems. Conventional visual tracking methods aim to accurately predict a target state using observations up to the current time.
To predict the target state with greater accuracy, in this paper, we define multiple actions in a reinforcement learning framework and move the current state according to the selected action. Figure 1 illustrates how state prediction is related to the actions in reinforcement learning.

Basic Idea
We further enhance prediction accuracy by balancing the exploitation and exploration abilities of reinforcement learning. The exploitation procedure simulates movements at states around the current local optimum, which the tracker has already explored extensively. For example, assume that our visual tracker observes that the target in Figure 1 has usually moved to the right up to the current frame. We then likely need to further exploit possible states on the right side of the current state. Using exploitation, our visual tracker can accurately predict the target state, especially when the target moves smoothly. In contrast, the exploration procedure simulates movements at states far from the current local optimum, which the tracker has only minimally explored. For example, the target in Figure 1 moves randomly and inconsistently in some cases; thus, we need to explore unvisited states far from the current state. Using exploration, our visual tracker can predict the target state, especially when the target moves fast. Traditional reinforcement learning methods have difficulty scheduling the exploitation and exploration procedures. In contrast, our visual tracker overcomes this problem by introducing Wang-Landau sampling [7], in which exploitation and exploration compete against each other in a sampling framework and reach a balance.
Figure 1. We define four actions ($a_{left}$, $a_{right}$, $a_{up}$, and $a_{down}$) to move the target state. For example, if $a_{right}$ is selected by our visual tracker, we move the target state to the right. The proposed visual tracker combines this reinforcement learning with Wang-Landau sampling to balance the exploitation and exploration degrees of the prediction.

• We propose a new Q-learning algorithm, augmented by Wang-Landau sampling, in which the exploitation and exploration abilities of reinforcement learning are balanced in searching target states. Conventional Q-learning methods typically select the action that maximizes the current action-value for exploitation, whereas they choose an action at random with a probability $\epsilon$ for exploration. However, it is nontrivial to determine the optimal $\epsilon$ that balances the exploitation and exploration abilities. In contrast, the proposed method balances the exploitation and exploration processes based on the Wang-Landau algorithm. The method adaptively controls the randomness of the policy using statistics on the number of visits to a particular state. Thus, our method considerably enhances the performance of the conventional Q-learning algorithm, which in turn enhances visual tracking performance.

• We present a novel visual tracking system based on the Wang-Landau reinforcement sampler. We exhaustively evaluate the proposed visual tracker and numerically demonstrate the effectiveness of the Wang-Landau reinforcement sampler.

• Our visual tracker shows state-of-the-art performance in terms of frames per second (FPS) and runs in realtime because our method contains no complicated deep neural network architectures.
The remainder of this paper is organized as follows. In Section 2, we introduce relevant visual tracking algorithms. We explain reinforcement learning-based visual tracking in Section 3.2 and enhance the proposed visual tracking using Wang-Landau sampling in Section 3.3. In Section 4, we evaluate visual tracking algorithms quantitatively and qualitatively. In Section 5, we conclude the paper.

Related Work
In this section, we discuss the advantages and disadvantages of the relevant visual tracking methods, which can be categorized into three groups: tracking methods based on reinforcement learning, tracking methods based on Wang-Landau sampling, and general visual tracking methods.

Tracking Methods Based on Reinforcement Learning
For visual tracking, Yun et al. [8] adopted a policy network and defined the actions to localize the target in a current frame. Supancic et al. [9] trained a Q function using the YouTube video dataset and defined the actions used to reinitialize a visual tracker and modify the appearance model of the tracker. However, these trackers could not run in realtime. Thus, Huang et al. [10] enhanced the speed of reinforcement-learning-based visual trackers and maintained their accuracy. They defined actions to determine whether the tracker easily tracks the target. If it is easy, the method tracks the target using inexpensive features; otherwise, the method tracks the target using expensive deep features. Choi et al. [11] ran their tracker at a real-time speed of 43 FPS. They presented lighter-weight deep neural networks and optimized deep neural architectures for matching and policy networks.
In contrast to these methods, our method aims to improve reinforcement learning accuracy by incorporating the Wang-Landau sampling. Please note that Choi et al. [11] did not improve conventional reinforcement learning algorithms but efficiently applied an existing REINFORCE [12] method to target the appearance-updating problem. In contrast, we improve conventional reinforcement learning algorithms (i.e., Q-learning) using Wang-Landau sampling. We enhance conventional Q-learning algorithms to balance the exploitation and exploration abilities of reinforcement learning. Moreover, we adaptively control the randomness of policy using statistics on the number of visits in a particular state.

Tracking Methods Based on Wang-Landau Sampling
For visual tracking, Kwon and Lee [13] adopted Wang-Landau Monte Carlo (WLMC) sampling to control significant changes in the target's positions. Zhou et al. [14] enhanced the Wang-Landau samplers and presented a stochastic approximation Monte Carlo (SAMC) sampling-based visual tracker, in which the density of states (DOS) was more accurately estimated with low computational cost. Kwon and Lee [15] extended the WLMC sampling into N-fold Wang-Landau (NFWL) sampling, in which the N-fold algorithm was used to enable the accurate estimation of the DOS with a relatively small number of samples. The NFWL-based visual tracking method can handle significant changes in both the positions and scales of the target. However, these methods do not contain a feedback process and cannot reflect the current visual tracking environment and results. Liu et al. [16] combined WLMC sampling with a visual background extractor, considerably reducing the state space of the target. They independently dealt with scale changes in the target using a fast scale estimation algorithm.
In contrast to these methods, we applied WLMC sampling to reinforcement learning and balanced the exploitation and exploration degrees of the target prediction.

General Visual Tracking Methods
Because of the representation power of deep neural networks [17][18][19], recent visual trackers, referred to as deep learning-based visual tracking methods, have extracted useful features and considerably increased their accuracy [20][21][22][23]. Wang et al. [23] presented the first visual tracker that adopted deep features, in which a stacked denoising autoencoder was used to extract generic features, and a classification layer was utilized to determine whether a current image patch is the foreground. Nam et al. [21] considered visual tracking problems as binary classification problems and divided visual tracking videos into multiple domains to extract multidomain features. Ma et al. [20] improved visual tracking accuracy by training deep neural networks using object recognition datasets and extracting hierarchical features. Kwon et al. [24] extracted deep features using the VGG-m network [25] and combined variational autoencoders with the particle Markov chain Monte Carlo method for multiple variable inferences.
Siamese network-based visual trackers [26][27][28] transformed visual tracking problems into matching problems, in which exemplar patches were matched to search window patches through a cross-correlation operation. For this purpose, two deep neural networks were designed with similar architectures that share parameters. Because matching is typically faster than classification, Siamese network-based visual trackers have demonstrated superior speed. Bertinetto et al. [26] implemented a Siamese network using only convolutional layers for visual tracking. Held et al. [28] proposed a Siamese network-based visual tracker that can run at a real-time speed of 100 FPS. Tao et al. [29] introduced a novel approach for visual tracking, in which no model updating was required. They argue that visual tracking accuracy is derived from a powerful matching function, which can be learned using a sufficiently large amount of training data. However, Siamese network-based visual trackers easily miss the targets if there are severe occlusions and background clutter. They lack both an explicit process to obtain feedback from the environment to recover erroneous trajectories and an exploration mechanism to sufficiently search the state space.
To handle long-term video sequences, DASiam [30] considered larger search areas than conventional methods, if target objects have high confidence values. GlobalTrack [31] and SiamRPN [32] have explicit redetection processes, in which whole regions are searched to recover missed target trajectories. However, GlobalTrack and SiamRPN require high computational costs because these methods employ full-search strategies. In contrast, our proposed tracker presents an efficient exploration technique based on Wang-Landau sampling. Thus, our tracker is significantly faster than the aforementioned methods and runs in realtime.
In contrast to the aforementioned methods, we incorporated reinforcement learning into visual tracking problems formulated as action-decision frameworks, in which the proposed method can recover erroneous trajectories using feedback from the environment and sufficiently explore the state space to capture abrupt target motions.

Bayesian Visual Tracking
The visual tracking system aims to accurately infer target configurations over frames. This inference problem can be formulated using the posterior probability $p(X_t|Y_{1:t})$, as in (1), where $\hat{X}_t$ is the best state at time $t$. We can accurately estimate $p(X_t|Y_{1:t})$ in (1) by adopting Bayesian filtering, which updates the posterior distribution $p(X_t|Y_{1:t})$ using the rule in (2), where $p(Y_t|X_t)$ denotes the likelihood, i.e., the probability of coincidence between the target object and the observation at the proposed state, and $p(X_t|X_{t-1})$ represents the transition kernel that proposes the next state $X_t$ based on the previous state $X_{t-1}$. The likelihood is defined in (3), where $f$ measures the similarity between the observed features [33] of the image described by $X$, $Y_t(X)$, and the ground truth $Y_{gt}$. We design $f$ to be similar to the matching function used in [29]. However, it is intractable to integrate the probabilities in (2) over all possible values of $X_{t-1}$. Alternatively, we can sample a small number of values of $X_{t-1}$ to approximate the integration. With an infinite number of samples, the approximation error would vanish. However, because it is impractical to use an infinite number of samples in real-world implementations, it is important to determine a limited number of good samples that produce accurate posterior probabilities.
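For reference, the estimate in (1) and the filtering rule in (2) are described above only in words. The block below writes out the standard maximum a posteriori estimate and the standard recursive Bayesian filtering rule that match this description, followed by the sample-based approximation of the integral. The prior term $p(X_{t-1} \mid Y_{1:t-1})$, the proportionality sign, the sample count $M$, and the sample notation $X_{t-1}^{(m)}$ come from the textbook form and are assumptions about the exact notation.

```latex
% (1) best state as the maximum a posteriori estimate
\hat{X}_t = \arg\max_{X_t} \; p(X_t \mid Y_{1:t})

% (2) recursive Bayesian filtering: likelihood times predicted prior
p(X_t \mid Y_{1:t}) \propto p(Y_t \mid X_t)
  \int p(X_t \mid X_{t-1}) \, p(X_{t-1} \mid Y_{1:t-1}) \, dX_{t-1}

% Sample-based approximation of the integral with M samples
% X_{t-1}^{(m)} drawn from p(X_{t-1} \mid Y_{1:t-1})
p(X_t \mid Y_{1:t}) \approx p(Y_t \mid X_t) \, \frac{1}{M}
  \sum_{m=1}^{M} p\big(X_t \mid X_{t-1}^{(m)}\big)
```

The approximation becomes exact only as $M$ grows without bound, which is why the choice of a small set of good samples matters in practice.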
Visual tracking then aims to approximate the posterior probability accurately using a mathematical expectation over a limited number of samples [34], as in (4), where $q(X_t|Y_{1:t})$ is the function that outputs $X_t$ given $Y_{1:t}$. In (4), $q(X_t|Y_{1:t})$ is designed by selecting the optimal transition kernel $p^*(X_t|X_{t-1})$, as in (5).

Reinforcement Learning for Visual Tracking
In this study, $p^*(X_t|X_{t-1})$ in (5) is implemented by selecting the optimal action $a_t$ in a reinforcement learning framework, in which $X_{t+1} \sim a_t(\cdot|X_t)$. We compute the reward $R$ in (6) by measuring the improvement in the log-posterior probability in (2), where $s_t = \{X_{1:t}\}$. In (6), $s_t, s_{t+1} \in S$ and $a_t \in A$, where $S$ and $A$ denote the spaces of states and actions, respectively. $A$ contains four possible actions, $\{a_{left}, a_{right}, a_{up}, a_{down}\}$, which are defined in (7), where $p_x$ and $p_y$ are the pixel indexes along the $x$ and $y$ axes, respectively. Our visual tracker aims to find the optimal policy $\pi^*: S \to A$ that maximizes the expected future reward at time $t$ in (8) for a single episode of length $\tau$, where $\gamma < 1$ is a discounting parameter that down-weights future rewards relative to rewards that can be received immediately. The expected cumulative reward in (8) is efficiently implemented by Q-learning [35] with the updating rule in (9), where $\gamma \max_a Q(s_{t+1}, a) - Q(s_t, a_t)$ indicates the maximum update of the action-value function $Q(s_t, a_t)$ at time $t$. This update is caused by a state change from $s_t$ to $s_{t+1}$ through action $a$. In (9), $R_{t+1}$ is the reward at time $t+1$ and $\alpha$ is a weighting parameter.
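The Q-learning update described above can be illustrated with a minimal tabular sketch. It uses the standard textbook update rule, which matches the terms named for (9). Several details are assumptions made for illustration only: the state is reduced to the current pixel position rather than the full history $s_t = \{X_{1:t}\}$, each action moves the target by one pixel, the image y axis points downward, and `alpha=0.5` is a hypothetical value ($\gamma = 0.9$ is the value reported in the experiments).

```python
from collections import defaultdict

ACTIONS = ("left", "right", "up", "down")
# Assumption: one-pixel steps, with the image y axis pointing downward.
MOVES = {"left": (-1, 0), "right": (1, 0), "up": (0, -1), "down": (0, 1)}


def apply_action(state, action):
    """Move the target position (p_x, p_y) according to the selected action."""
    dx, dy = MOVES[action]
    return (state[0] + dx, state[1] + dy)


def q_update(q, state, action, reward, next_state, alpha=0.5, gamma=0.9):
    """Standard tabular Q-learning update.

    Q(s, a) <- Q(s, a) + alpha * (R + gamma * max_a' Q(s', a') - Q(s, a))
    """
    best_next = max(q[(next_state, a)] for a in ACTIONS)
    td_error = reward + gamma * best_next - q[(state, action)]
    q[(state, action)] += alpha * td_error
    return q[(state, action)]


# Usage: one update from an all-zero table (hypothetical reward of 1.0).
q_table = defaultdict(float)
s = (10, 10)
s_next = apply_action(s, "right")
new_value = q_update(q_table, s, "right", reward=1.0, next_state=s_next)
```

In a tracker, the reward passed to `q_update` would come from the change in log-posterior described for (6); here it is a fixed number so that the update can be checked by hand.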
The optimal policy in (8) can be determined by maximizing the Q-values, as in (10), where the next action $a_{t+1}$ is sampled by $\pi^*(a_t|s_t)$ and the next state $s_{t+1}$ is determined by (7). However, when using (10), there is a risk of choosing suboptimal actions because we select the action that maximizes only the current action value. This problem causes our visual tracker to explore only already-visited states and thus to become trapped in a local optimum.
We can mitigate this problem by adopting the $\epsilon$-greedy algorithm, in which we usually select the action that maximizes the current action-value but, with a probability $\epsilon$, choose an action at random. This $\epsilon$-greedy algorithm is expressed in (11), where $\mathrm{random}(A)$ returns an action chosen at random. However, one difficulty with $\epsilon$-greedy is that the randomness in (11) produces a surplus of actions, which complicates the identification of the optimal solution. Therefore, we propose a semirandom strategy based on the Wang-Landau algorithm [7], in which we control the randomness of the policy using statistics on the number of visits to a particular state. This approach is further explained in the following section.
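The $\epsilon$-greedy rule described above can be sketched as follows. This is the conventional form of the rule, not a reproduction of (11); the action names, the dictionary layout of `q_values`, and the optional `rng` argument are illustrative assumptions.

```python
import random

ACTIONS = ("left", "right", "up", "down")


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the greedy action.

    q_values maps each action name to its current action value Q(s, a).
    """
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)  # exploration: random(A)
    return max(ACTIONS, key=lambda a: q_values[a])  # exploitation: argmax_a Q(s, a)


# Usage with hypothetical action values.
q_values = {"left": 0.1, "right": 0.7, "up": 0.2, "down": 0.0}
greedy_action = epsilon_greedy(q_values, epsilon=0.0)  # never explores
seeded_rng = random.Random(0)
sampled = [epsilon_greedy(q_values, epsilon=1.0, rng=seeded_rng) for _ in range(200)]
```

With a fixed $\epsilon$, the amount of random exploration is the same in every state, regardless of how often that state has been visited. This is the limitation that the visit-count statistics are meant to address.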

Wang-Landau Reinforcement Sampler for Visual Tracking
The proposed Wang-Landau sampler can be used to encourage the exploration of reinforcement learning by estimating the DOS [15], in which the DOS value approximates the frequency of visits to each state using Monte Carlo simulations. Based on the DOS, we determine whether a particular state is sufficiently explored. If a state has a small DOS value, the Wang-Landau sampler guides the reinforcement learning to explore that state. Otherwise, the sampler refrains from exploring that state.
The sampler maintains $\{(d_i, v_i)\}_{i=1}^{|S|}$, in which $d_i$ and $v_i$ are the DOS score and the number of visits for the $i$-th state, respectively, and $|S|$ is the total number of states. We update $d_i$ using (12) if the visual tracker visits the $i$-th state, where $w > 1$ is a weighting parameter. $v_i$ is updated using (13), where $d_i$ and $v_i$ are initialized to 1 and 0, respectively. As the iterations proceed, Wang-Landau sampling adopts a coarse-to-fine strategy to attain more accurate DOS values. In the early iterations, we use a large value of $w$ in (12), which increases the update speed. In the later iterations, we use a smaller value of $w$ to fine-tune the updates. Accordingly, we decrease the value of $w$, $w \leftarrow \sqrt{w}$, if the current iteration satisfies the semiflat condition in (14), under which we have, at least partially, explored all states. After the modification of $w$, the value $v_i$ is reinitialized to 0.
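The bookkeeping described above can be sketched in a few lines. The initial values, the $w \leftarrow \sqrt{w}$ refinement, and the reset of the visit counts follow the text. The multiplicative DOS update, the unit increment of the visit count, and the flatness test (every state visited, and no count below a fraction `flatness` of the mean) are assumptions borrowed from the conventional Wang-Landau algorithm, because (12) to (14) are not reproduced here. The class name and the parameter `flatness` are hypothetical.

```python
import math


class WangLandauDOS:
    """Density-of-states (DOS) scores and visit counts over a finite state set."""

    def __init__(self, num_states, w=math.e, flatness=0.8):
        self.d = [1.0] * num_states  # DOS scores d_i, initialized to 1
        self.v = [0] * num_states    # visit counts v_i, initialized to 0
        self.w = w                   # modification factor, w > 1
        self.flatness = flatness     # assumed threshold for the semiflat test

    def is_semiflat(self):
        """Assumed criterion: all states visited, none far below the mean count."""
        mean_visits = sum(self.v) / len(self.v)
        return min(self.v) > 0 and min(self.v) >= self.flatness * mean_visits

    def visit(self, i):
        """Record a visit to state i and refine w once the histogram is semiflat."""
        self.d[i] *= self.w  # assumed multiplicative form of the DOS update
        self.v[i] += 1       # assumed unit increment of the visit count
        if self.is_semiflat():
            self.w = math.sqrt(self.w)  # w <- sqrt(w)
            self.v = [0] * len(self.v)  # visits are reinitialized to 0


# Usage: three states, each visited once, which triggers one refinement of w.
dos = WangLandauDOS(num_states=3, w=4.0)
for state_index in (0, 1, 2):
    dos.visit(state_index)
```

For long runs, implementations of Wang-Landau sampling commonly store the logarithm of the DOS to avoid overflow; the sketch keeps raw values for readability.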
Owing to the need to balance the exploration of reinforcement learning with its exploitation, we present a new scheduling approach for reinforcement learning in (16), where $p_i(Y_t|X_t)$ is the likelihood at the $i$-th state, which is defined in (3). In (16), the exploration term $d_i$ and the exploitation term $p_i(Y_t|X_t)$ compete with each other. For example, if $d_i$ increases with respect to $p_i(Y_t|X_t)$, reinforcement learning explores diverse states with a high probability. If not, it tends to exploit the current state with a high probability. Algorithm 1 illustrates the complete process of the proposed method.
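The competition between the two terms can be illustrated with a small sketch. Equation (16) is not reproduced here, so the ratio used below is an assumed stand-in chosen only to show the qualitative behaviour described above: the exploration probability grows as the DOS score grows relative to the likelihood. The function names and the normalization of both inputs to a comparable scale are hypothetical.

```python
import random

ACTIONS = ("left", "right", "up", "down")


def exploration_probability(dos_score, likelihood):
    """Assumed stand-in for the scheduling rule: share of the DOS term.

    Both inputs are expected to be positive and on a comparable scale,
    for example a DOS score divided by the largest DOS score.
    """
    return dos_score / (dos_score + likelihood)


def select_action(q_values, dos_score, likelihood, rng=random):
    """Explore at random when the DOS term wins, otherwise exploit."""
    if rng.random() < exploration_probability(dos_score, likelihood):
        return rng.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: q_values[a])


# Usage with hypothetical normalized values.
p_low_dos = exploration_probability(dos_score=0.1, likelihood=0.9)
p_high_dos = exploration_probability(dos_score=0.9, likelihood=0.1)
chosen = select_action(
    {"left": 0.1, "right": 0.7, "up": 0.2, "down": 0.0},
    dos_score=0.5,
    likelihood=0.5,
    rng=random.Random(1),
)
```

Unlike a fixed $\epsilon$, the exploration probability here changes from state to state as visit statistics accumulate.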

Algorithm 1 (steps 11 to 16)
11: • Wang-Landau Monte Carlo sampling
12: Find the index $i$ of the state $X_{t+1}$.
13: Update the DOS score $d_i$ using (12).
14: Update the number of visits $v_i$ using (13).
15: if $v_i$ reaches the semiflat status in (14) then
16:     $w \leftarrow \sqrt{w}$.
We used the precision, success rate, and area under the curve (AUC) as evaluation metrics for testing these methods [47]. For precision, we calculated the $\ell_2$-norm distance between the estimated bounding box $E_t$ and the ground-truth bounding box $G_t$. We then drew the precision plot, which shows the percentage of frames in which the $\ell_2$-norm distance is less than a specific threshold. For the success rate, we calculated the intersection over union, $\mathrm{IoU} = \frac{|E_t \cap G_t|}{|E_t \cup G_t|}$, where $|\cdot|$ indicates the number of pixels. We considered visual tracking at a frame to be successful if the IoU is greater than a specific threshold. The success rate is the ratio of the number of successful frames to the total number of frames. We then drew the success plot, which presents the success rates at different thresholds, and calculated the AUC.
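The evaluation metrics above can be sketched as follows. The box format `(x, y, width, height)`, the use of box centers for the $\ell_2$ distance, the 20-pixel default threshold, and the 21 evenly spaced IoU thresholds used to approximate the AUC are assumptions based on common benchmark practice, not settings stated in this paper.

```python
import math


def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x, y, width, height)."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    intersection = inter_w * inter_h
    union = aw * ah + bw * bh - intersection
    return intersection / union if union > 0 else 0.0


def center_distance(box_a, box_b):
    """Euclidean distance between the centers of two boxes (assumed reading)."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    return math.hypot((ax + aw / 2) - (bx + bw / 2), (ay + ah / 2) - (by + bh / 2))


def precision(distances, threshold=20.0):
    """Fraction of frames whose distance is less than the threshold."""
    return sum(d < threshold for d in distances) / len(distances)


def success_rate(ious, threshold):
    """Fraction of frames whose IoU is greater than the threshold."""
    return sum(v > threshold for v in ious) / len(ious)


def success_auc(ious, num_thresholds=21):
    """Mean success rate over evenly spaced thresholds in [0, 1]."""
    thresholds = [k / (num_thresholds - 1) for k in range(num_thresholds)]
    return sum(success_rate(ious, t) for t in thresholds) / num_thresholds


# Usage on three hypothetical frames.
estimated = [(0, 0, 10, 10), (5, 0, 10, 10), (20, 20, 10, 10)]
ground_truth = [(0, 0, 10, 10)] * 3
ious = [iou(e, g) for e, g in zip(estimated, ground_truth)]
distances = [center_distance(e, g) for e, g in zip(estimated, ground_truth)]
```

In this toy example the second frame overlaps the ground truth by half of its width, and the third frame does not overlap at all, so the two metrics disagree on how many frames count as correct at typical thresholds.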
For fair comparison, we used the best visual tracking results reported by the authors in the original papers and followed their experimental settings. For example, SiamRPN++ [32] was pretrained on ImageNet [64] and used ResNet [33] as a backbone network. In addition, SiamRPN++ was trained using the training datasets of COCO [65], ImageNet DET [64], ImageNet VID, and the YouTube-Bounding Boxes dataset [66]. All experiments were conducted using a desktop with an Intel CPU i7 3.60 GHz and GeForce Titan XP graphics card for the proposed method. Throughout the experiments, hyperparameters were fixed as follows: γ = 0.9 in (8), N = 2000 in Algorithm 1 and w = 0.8 in (12).

Ablation Study
We examined the effectiveness of Wang-Landau sampling for reinforcement learning and the sensitivity of our method to its hyperparameters. Table 1 compares two variants of our method: reinforcement learning-based visual trackers with $\epsilon$-greedy and with Wang-Landau sampling. Table 1 demonstrates that the accuracy of reinforcement learning-based visual trackers can be considerably increased if the exploration and exploitation abilities are balanced by Wang-Landau sampling.
Table 2 shows the visual tracking results of our method using different hyperparameter values. If $\gamma$ in (8) increases, our method places more weight on rewards further in the future. If $N$ in Algorithm 1 is larger, our method estimates the DOS score and Q-values more accurately at a higher computational cost. If $w$ in (12) is larger, our method can estimate the DOS score with fewer iterations but less accurately. As shown in Table 2, our method is insensitive to the hyperparameters. Even when their values change substantially, our method consistently produces accurate visual tracking results, because the proposed Q-learning-based reinforcement method accurately predicts target states regardless of the hyperparameter values.

Quantitative Comparison
Figure 2 shows the quantitative comparisons with non-deep learning-based visual trackers. Our method considerably outperformed the second-best trackers, MEEM and Staple, in terms of precision and success rate, respectively. Our method was able to accurately track the target despite severe appearance changes of the target. In particular, the proposed Wang-Landau Monte Carlo sampling improves the exploration of unvisited states, enabling our tracker to cover abrupt motion changes of the target.
Figure 3 shows the quantitative comparisons with recent deep learning-based visual trackers. SiamDW was the best in terms of the precision plot, and ECO was the best in terms of the success plot. As shown in Figure 3, our method was competitive with deep learning-based visual trackers. In particular, our method produced accurate tracking results in terms of the success plot but relatively inaccurate tracking results in terms of the precision plot, implying that our method can be improved by adopting multiscale approaches.
In Figure 4, we highlight experiments on test sequences that contain examples of interrupted and recovered tracking. For example, the "Out of view" and "Occlusion" sequences contain interrupted and recovered tracking scenarios. In these sequences, target objects frequently disappear because of occlusion or because they move out of view, which causes conventional trackers to miss the target trajectories. After a long time, the targets reappear and the trackers need to recover the target trajectories. In this situation, the proposed tracker efficiently recovers missing trajectories using the proposed exploration mechanism. As shown in Figure 4, the proposed visual tracker considerably outperforms other state-of-the-art deep learning visual trackers, which demonstrates the effectiveness of the proposed exploration mechanism based on Wang-Landau sampling.
Figure 5 quantitatively evaluates the proposed method (ours) and the recent state-of-the-art deep learning-based visual trackers using the VOT2014 dataset. Our method considerably outperforms the other methods in terms of accuracy. Staple shows the second-best accuracy; however, its robustness is significantly worse than that of the proposed method, implying that our method rarely missed the targets while preserving accuracy over frames. Gnet is the best in terms of robustness, but its accuracy is lower than ours.
Table 3 quantitatively evaluates the proposed method and deep learning-based visual tracking methods using the LaSOT dataset. Our method and GlobalTrack present state-of-the-art tracking performance. GlobalTrack has an explicit redetection process, which requires accurate object detectors. In contrast, the efficient performance of the proposed method stems from reinforcement learning with Wang-Landau-based exploration. Our method implicitly searches for the targets without any object detector.
Table 4 quantitatively evaluates the computational costs of recent visual trackers using the LaSOT dataset.
Our visual tracker shows state-of-the-art performance in terms of FPS and can run in realtime because our method contains no complicated deep neural network architectures.
Figure 6 measures the recovery rates of eight state-of-the-art visual trackers, namely ECO, SiamRPN++, GlobalTrack, ATOM, DASiam, CFNet, SPLT, and StructSiam, using the LaSOT dataset. We counted the average number of frames in which the IoU is zero (i.e., the average number of interruptions), which means that the trackers missed the targets. After each frame with IoU = 0, we counted the average number of frames in which the IoU becomes nonzero again (i.e., the average number of recovered trajectories), which means that the trackers recovered the targets. The recovery rate was calculated by dividing the average number of recovered trajectories by the average number of interruptions. As shown in Figure 6, our method considerably surpasses the other methods in terms of the recovery rate, which demonstrates the effectiveness of the proposed Wang-Landau reinforcement sampler.
Figure 7 qualitatively compares our method with the method based on conventional $\epsilon$-greedy reinforcement learning using the OTB dataset. Although the test sequences contained abrupt motions (e.g., Deer, Shaking, MotorRolling, and Biker sequences), severe deformation (e.g., Ironman, Diving, Jump, Skiing, and Surfer sequences), occlusion (Soccer sequence), and illumination changes (e.g., Matrix, Shaking, and Skating1 sequences), our method accurately tracked the targets. However, conventional reinforcement learning with $\epsilon$-greedy frequently failed to track the targets when abrupt motions occurred because it could not sufficiently explore unvisited states.
Figure 8 shows qualitative visual tracking results of the proposed method with and without the Wang-Landau algorithm using the LaSOT dataset. The video sequences include tiny objects (e.g., boat-12, crocodile-3, drone-13, elephant-18, fox-3, and frog-9 sequences), background clutter (e.g., chameleon-6, crab-18, and fox-3 sequences), nonrigid objects (e.g., bear-17, bird-17, cattle-7, crocodile-3, fox-3, frog-9, and giraffe-10 sequences), motion blur (e.g., bus-5 and crab-18 sequences), and rotation (e.g., bottle-1 sequence). Despite these challenging visual tracking environments, the proposed method accurately tracked the targets. These results indicate that the proposed Wang-Landau-based reinforcement learning is helpful for finding unexplored states and recovering missed trajectories. As shown in Figure 8, the proposed Wang-Landau algorithm helps our visual tracker to recover from inaccurate bounding boxes.
Figure 8 panels: (a) airplane-13, (b) bear-17, (c) bicycle-9, (d) bird-17, (e) boat-12, (f) bottle-1, (g) bus-5, (h) cattle-7, (i) chameleon-6, (j) crocodile-3, (k) crab-18, (l) drone-13, (m) elephant-18, (n) fox-3, (o) frog-9, and (p) giraffe-10 sequences.
In summary, the proposed method works better than other state-of-the-art methods for the following reasons. Our method can predict the target state with greater accuracy by defining multiple actions in a reinforcement learning framework and moving the current state according to the selected action. In addition, we further enhance prediction accuracy by improving reinforcement learning performance using Wang-Landau sampling, in which exploitation and exploration compete against each other in a sampling framework and reach a balance.

Conclusions
In this study, we present a visual tracking system based on reinforcement learning, in which the tracking accuracy is considerably enhanced by predicting the target configuration for the subsequent frame. Our visual tracker is improved by Wang-Landau sampling, in which the exploration and exploitation of reinforcement learning are efficiently scheduled. The experimental results demonstrate that our method significantly outperforms non-deep learning-based visual tracking methods. Our method is competitive with deep learning-based visual trackers and is the fastest algorithm among the compared visual trackers. In future work, we plan to adopt deep Q-learning, a well-known deep learning-based reinforcement learning approach, to improve the visual tracking accuracy.
Our method can fail to track the targets if the target motions are highly random. In this case, the proposed reinforcement learning method inaccurately predicts the target position and degrades visual tracking performance. In future research, we plan to integrate an explicit object detector into the proposed framework to handle random motions.