Adaptive Object Tracking via Multi-Angle Analysis Collaboration

Although tracking research has achieved excellent performance from a mathematical perspective, it is still meaningful to analyze tracking problems from multiple perspectives. This motivation not only promotes the independence of tracking research but also increases the flexibility of practical applications. This paper presents a tracking framework based on multi-dimensional state–action space reinforcement learning, termed multi-angle analysis collaboration tracking (MACT). MACT comprises a basic tracking framework and a strategic framework that assists it. In particular, the strategic framework is extensible and currently includes a feature selection strategy (FSS) and a movement trend strategy (MTS). These strategies are abstracted from a multi-angle analysis of tracking problems (the observer’s attention and the object’s motion). The content of the analysis corresponds to specific actions in the multi-dimensional action space. Concretely, the tracker, regarded as an agent, is trained with the Q-learning algorithm and an ϵ-greedy exploration strategy, where we adopt a customized reward function to encourage robust object tracking. Comparative experiments on the OTB50 benchmark demonstrate the effectiveness of the strategies and the improvement in speed and accuracy of the MACT tracker.


Introduction
Vision sensors, especially ordinary cameras, are a direct source of computer vision information. Visual tracking, which obtains the object position quickly and accurately in a continuous video sequence, is an important topic in visual sensor research. During the tracking process, many challenges from the object itself and its surroundings need to be addressed, such as illumination variation, scale variation, occlusion, fast motion, background clutter, low resolution, deformation, and in-plane rotation [1]. Furthermore, a tracking system can be broken into multiple constituent components: motion model, feature extractor, observation model, model updater and ensemble post-processor [2].
Many current research methods have performed well in tracking research, but most of them benefit from powerful deep neural networks and excellent machine learning methods; the former lacks reliable explanation, and the latter is only a method application. Tracking research should return to its essence that computer vision tracking is a simulation of human visual tracking.
To solve the above challenges, we first need to analyze the tracking process from the perspective of human vision. For instance, suppose we need to concentrate on a pedestrian (wearing a white shirt and black pants) who walks alone from the forest to the crowd on the side of the road. Tracking in the directions (up, down, left, and right). Specifically, in our strategic framework, we use Q-learning and the ϵ-greedy exploration algorithm to obtain the optimal return of different actions in each state, thus forming the corresponding strategies. Furthermore, we use these strategies, which include feature selection and movement trends, to assist and guide the basic tracking framework introduced in Section 3.1.
An overview of our multi-angle analysis collaboration tracking (MACT) is illustrated in Figure 1. In the motion model component, the current frame is no longer subject to simple random sampling but to purposeful sampling, in which the corresponding region is given a higher weight under the guidance of the movement trend strategy. Unlike most tracking methods, which use a fixed image feature description, the feature extractor component selects the image feature dynamically according to the feature selection strategy.
In summary, MACT's main contributions are:
1. We confirm that tracking research should focus on the nature of the tracking problem, not just the classification method or network structure.
2. We propose a strategic framework that forms a one-to-one correspondence between the details of different perspectives and the action space of multi-dimensional state-action space reinforcement learning. This strategic framework can be extended according to different tracking tasks.
3. We obtain strategies from multi-angle analysis with reinforcement learning and apply them to specific traditional tracking frameworks.
Finally, we validated MACT on the public dataset OTB50 [1]. The experimental results show that MACT effectively improves the speed and accuracy of tracking.

Figure 1. The overview of our Multi-angle Analysis Collaboration Tracking (MACT). By using the movement trend strategy, we can predict the position of the target in the motion model component, and then increase the weight of this part of the sample (the yellowish rectangle is the area where the target may appear, and the yellow dashed box is the high weight sample area). By exploiting the feature selection strategy, we can choose the more appropriate image representation for the current frame in the feature extractor component (the current frame selects the HOG feature). Through the mutual cooperation of the strategies obtained after multi-angle thinking, MACT enables more accurate and efficient tracking.

Related Work
Visual tracking has been a fundamental research topic in the field of computer vision over the past decade. As surveyed in [4,5], many researchers have achieved impressive results on mainstream visual tracking benchmarks [1,6–9].

Visual Object Tracking
Traditional visual tracking algorithms are usually divided into two categories [10]. One approach constructs a generative model with previous experience to find the most matching area in the next frame. The other utilizes a discriminative model to separate the target from the background.
Generative tracking algorithms, which focus on describing the target, received extensive attention in early tracking research. For example, the Meanshift tracker [11], a tracking method based on probability density distribution, makes the search always follow the direction of the rising probability gradient and iteratively converges to the local peak of the probability density distribution; the Particle Filter tracker [12] is a method based on particle distribution statistics, which models the tracking object and defines a similarity measure to determine the similarity of each particle to the target; the Kalman Filter tracker [13] describes the motion model of the object to estimate its position in the next frame; and DRLTracker [14] models targets and backgrounds separately for collaborative tracking.
Discriminative approaches to visual tracking treat the target as the foreground and use a detector, learned online or trained offline, to distinguish the foreground object from the background. The main representatives of the tracking-by-detection approach are TLD [15], which applies multi-level classifiers to improve detection capabilities, and Struck [16], which uses structured SVM methods for online learning. It is worth mentioning that Henriques et al. proposed a kernel tracking method, CSK, based on circulant matrices, which solved the problem of dense sampling mathematically [17]. Several improved correlation filter based tracking algorithms have since been proposed, such as the Kernelized Correlation Filters tracker (KCF) [18] and the Discriminative Scale Space Tracker (DSST) [19].
For tracking based on deep learning, on the one hand, because a deep network model trained on big data can provide a more expressive feature representation, deep learning techniques are widely used in computer vision research, including visual tracking. In early deep learning tracking research, researchers directly integrated the features learned by the network into correlation filtering or other tracking frameworks to obtain better tracking results, as in DeepSRDCF [20]. Although such complex features are more expressive than HOG or other conventional image features, they also bring a large amount of computation. Therefore, in later research, it became common practice to combine conventional features with deep features. These methods typically use conventional features in simple tracking scenarios and select deep features in complex tracking scenarios, such as C-COT [21] and ECO [22]. On the other hand, another major advantage of deep learning is end-to-end output, which allows multiple tasks to be trained together, especially combining image feature networks with detection and classification networks, which is well suited to tracking research. Representative tracking methods include GOTURN [23], SiameseFC [24] and CFNet [25].

Visual Tracking with Reinforcement Learning
Reinforcement learning is a learning mechanism that simulates the learning behavior of humans and higher animals. It emphasizes constant trial and error and improvement through interaction with the environment. As an important method in machine learning, reinforcement learning learns the optimal strategy of dynamic systems by perceiving environmental state information [26]. It enables expert-free online learning without a specialized system model.
At present, several scholars have applied reinforcement learning to the field of visual tracking. However, these applications are mostly limited to the improvement of the method, such as using reinforcement learning to mine deep expressions of deep neural networks. Specifically, Yun et al. [27] controlled the tracking strategy through actions that are trained by deep reinforcement learning. Zhang et al. proposed a fully end-to-end approach to predict the bounding box position for the object. They formulated tracking model as a recurrent convolutional neural network agent that interacts with a video over time [28]. Huang et al. used an adaptive approach to tracking with deep feature cascades and developed adaptive tracking issues as a decision process [29].
However, the core of reinforcement learning is to imitate human learning behavior, and the essence of tracking research is a simulation of human behavior. Therefore, different from the above methods, we use reinforcement learning to simulate the different perspectives of people, namely the strategic framework. Further, we use the independent strategic framework to guide the tracking framework.

Our Method
In this section, we divide the multi-angle analysis collaboration tracking (MACT) into two parts, the tracking framework and the strategic framework. The former consists of a basic tracking model [2], and the latter is implemented by a multi-dimensional state-action space reinforcement learning framework.

Tracking Framework with Basic Tracker
In our MACT tracker, the tracking model is only responsible for the basic tracking process, so the tracking framework only has basic tracking capabilities. Our tracking framework is inspired by the basic tracker proposed by Wang et al. [2], and consists of five parts: motion model, feature extractor, observation model, model updater and ensemble post-processor.
From the analysis in the Introduction, it can be seen that the motion model and the feature extractor correspond, respectively, to the movement trend, which reflects the object's motion, and to feature selection, which reflects the observer's attention. In the basic tracker [2], the feature extractor component selects only one fixed feature representation (the HOG feature), and the motion model component usually uses a sliding window to simply consider all possible candidates within a square neighborhood.
Different from the above two methods, to match the implementation of the strategic framework, we make some changes as follows. According to Wang et al. [2], the overall performance of the HOG feature is superior to other features (raw color and raw grayscale) in the tracking research.
For the other three components, we use the most basic methods available in [2]:
• First, for the observation model component, we use the simplest logistic regression with ℓ2 regularization, and only employ simple gradient descent to update the model online.
• Second, for the model updater component, we adopt the common practice of setting a threshold [30]. The model is updated when the difference between the confidence of the target and the confidence of the background is below the threshold.
• Finally, for the ensemble post-processor component, we consider the reliability of each tracker as a hidden variable with reference to the study by Wang et al. [31], and then solve the problem of determining the tracking result by a factorial hidden Markov model.
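The observation model described in the list above can be illustrated with a minimal sketch of ℓ2-regularized logistic regression updated by plain gradient descent. The learning rate, regularization weight, number of steps, and the synthetic two-dimensional samples are all illustrative assumptions, not values from the paper or from [2].

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict(w, x):
    # Confidence that sample x belongs to the target (label 1).
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)))

def loss(w, X, y, lam=0.01):
    # Mean negative log-likelihood plus the l2 penalty 0.5 * lam * ||w||^2.
    eps = 1e-12
    nll = 0.0
    for x, t in zip(X, y):
        p = min(max(predict(w, x), eps), 1.0 - eps)
        nll -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return nll / len(X) + 0.5 * lam * sum(wi * wi for wi in w)

def gradient_step(w, X, y, lr=0.1, lam=0.01):
    # One gradient-descent step; lr and lam are assumed values.
    grad = [lam * wi for wi in w]
    for x, t in zip(X, y):
        err = predict(w, x) - t
        for j, xj in enumerate(x):
            grad[j] += err * xj / len(X)
    return [wi - lr * gi for wi, gi in zip(w, grad)]

# Synthetic stand-in for target (1) and background (0) feature samples.
random.seed(0)
pos = [[random.gauss(1.0, 1.0), random.gauss(1.0, 1.0)] for _ in range(50)]
neg = [[random.gauss(-1.0, 1.0), random.gauss(-1.0, 1.0)] for _ in range(50)]
X, y = pos + neg, [1] * 50 + [0] * 50

w0 = [0.0, 0.0]
w = w0
for _ in range(50):
    w = gradient_step(w, X, y)
```

In an online setting, `gradient_step` would be called with the samples collected from each new frame, subject to the update threshold described for the model updater.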
For a complete tutorial about the basic tracker, we refer the readers to [2] for details. It can be seen that our tracking framework is a simple tracking method without the aid of the strategic framework. The purpose of this design is to prove that ordinary tracking can be greatly improved after having multi-angle analysis cooperation.

Strategic Framework with Reinforcement Learning
Unlike conventional tracking methods, MACT uses a strategic framework to guide the basic tracking framework described above. As shown in Figure 2, we treat the tracking process as a Markov Decision Process (MDP), in which the agent takes a series of reasonable and effective actions for the motion model and the feature extractor. This agent can predict where the target might appear and learn how to choose the appropriate image representation for the current frame. The agent learns the corresponding strategies by reinforcement learning. Therefore, we need to design dedicated states, an action space, and a reward function for reinforcement learning.
For simple verification, in our study, the state S is defined by the time frame. During the training phase (see Equation (2)), there are two possible states for each frame: judgment of tracking result (tracking success or tracking failure) and image similarity of the target τ. In the operational phase, there is only one state: image similarity, as shown in Equation (3).
Here, τ_t denotes the image similarity hash vector of the current frame in the t-th state, computed as described in [32], and h_t is the mean value of the historical similarity hashes.
In addition, to reduce the amount of calculation, MACT defines a state every five frames.
In the Introduction, we propose to link the angle of thinking to the action space. Therefore, in MACT, we implement two tracking analyses (the observer's attention and the object's motion) in the multi-dimensional action space. Specifically, the observer's attention corresponds to feature selection and the object's motion corresponds to the movement trend. Here, we only use three features (raw color, HOG and raw grayscale) and four directions (up, down, right and left). Thus, the action space A for any state in S combines a feature choice with a direction choice:

$$A = \{\text{raw color}, \text{HOG}, \text{raw grayscale}\} \times \{\text{up}, \text{down}, \text{left}, \text{right}\}$$

The state transition function and the reward function R are defined on these states and actions; the reward is detailed in the Reward Function Design section. The strategy (or policy) is denoted by π : S × A → [0, 1], which maps states and actions to a probability. The probability of choosing an action k according to policy π is π(k). A strategy is deterministic or pure if the probability of playing one action is 1 while the probability of playing every other action is 0 (i.e., ∃k: π(k) = 1 and ∀l ≠ k, π(l) = 0); otherwise the strategy is stochastic or mixed.
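As an illustration of the two-dimensional action space and of the pure versus mixed distinction, the following sketch enumerates the joint actions and checks whether a per-state policy is pure. Treating the action space as the Cartesian product of features and directions is an assumption drawn from the description above; the identifiers are hypothetical.

```python
from itertools import product

FEATURES = ("raw_color", "hog", "raw_gray")    # observer's attention
DIRECTIONS = ("up", "down", "left", "right")   # object's motion

# Joint actions: one feature choice paired with one direction choice.
ACTIONS = list(product(FEATURES, DIRECTIONS))

def is_pure(policy, tol=1e-9):
    """Return True if the policy puts probability 1 on a single action.

    policy maps each action to its probability for one fixed state.
    """
    probs = list(policy.values())
    if abs(sum(probs) - 1.0) > tol:
        raise ValueError("probabilities must sum to 1")
    return any(abs(p - 1.0) < tol for p in probs)

# A pure strategy: always pick HOG and expect leftward motion.
pure = {a: 0.0 for a in ACTIONS}
pure[("hog", "left")] = 1.0

# A mixed strategy: uniform over all joint actions.
mixed = {a: 1.0 / len(ACTIONS) for a in ACTIONS}
```

With three features and four directions, `ACTIONS` contains twelve joint actions, so a Q-table over this space has twelve columns per state.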
The goal of a reinforcement learning algorithm is to find a strategy for every state in S to optimize the expected reward, which is defined by long-term expected reward of the policy. Formally, it has two representations: the state value function, and the state-action value function, where γ is the discount factor. Further, this strategy can be divided into feature selection strategy (FSS) and movement trend strategy (MTS) according to different dimension actions (feature selection and direction selection).
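For reference, the two representations are conventionally written as below. These are the standard textbook forms, given here as an assumption about notation rather than as the exact equations of this paper; $r_{t+k+1}$ denotes the reward received $k+1$ steps after time $t$, and $\mathbb{E}_\pi$ is the expectation when actions are drawn from $\pi$.

```latex
v_\pi(s) = \mathbb{E}_\pi\!\left[\sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k+1} \,\middle|\, s_t = s\right],
\qquad
q_\pi(s, a) = \mathbb{E}_\pi\!\left[\sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k+1} \,\middle|\, s_t = s,\ a_t = a\right]
```

The two are linked by $v_\pi(s) = \sum_{a} \pi(s, a)\, q_\pi(s, a)$, so a policy that is greedy with respect to $q_\pi$ can be read off directly from the state-action values.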

Q-learning and Exploration Strategy
Considering maturity and reliability, we use Q-learning to find the optimal policy in this work [33]. The Q-learning algorithm is a classical value function-based reinforcement algorithm. Because it does not need to establish an environment model and guarantees convergence under certain conditions, it is the most widely used algorithm in reinforcement learning. The main steps of Q-learning are summarized in Algorithm 1.

Algorithm 1 Q-Learning: An Off-policy TD Control Algorithm
Initialize Q(s, a), ∀s ∈ S, a ∈ A(s), arbitrarily, and Q(terminal-state, ·) = 0
Repeat (for each episode):
    Initialize S
    Repeat (for each step of episode):
        Choose A from S using policy derived from Q (e.g., ϵ-greedy)
        Take action A, observe R, S′
        Q(S, A) ← Q(S, A) + α[R + γ max_a Q(S′, a) − Q(S, A)]
        S ← S′
    until S is terminal

The core of Q-learning is the Q-table. The rows and columns of the Q-table represent the states and actions, respectively. The Q-table value Q(s, a) records the estimate of the state-action value function q_π(s, a). During the training process, the algorithm uses the ϵ-greedy [34] exploration strategy to select actions. The ϵ-greedy exploration strategy is an improvement of the greedy strategy, and it specifies the probability distribution used when selecting actions.
In the current state s, the agent selects an action at random with probability ϵ, which is called the exploration process, and selects the action with the maximum Q-value with probability 1 − ϵ, which is called the exploitation process. Another well-known exploration strategy is the Boltzmann exploration strategy, which uses the value function to dynamically adjust the balance between exploration and exploitation in action selection.
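A minimal sketch of ϵ-greedy selection over one row of a Q-table follows. The dictionary representation of the row and the example Q-values are illustrative assumptions.

```python
import random

def epsilon_greedy(q_row, epsilon, rng=random):
    """Pick an action from q_row, a dict mapping action -> Q-value.

    With probability epsilon a uniformly random action is returned
    (exploration); otherwise the action with the largest Q-value is
    returned (exploitation).
    """
    actions = list(q_row)
    if rng.random() < epsilon:
        return rng.choice(actions)
    return max(actions, key=q_row.get)

# Hypothetical Q-values for the three feature choices in one state.
q_row = {"raw_gray": 0.2, "raw_color": 0.5, "hog": 0.9}
greedy_choice = epsilon_greedy(q_row, epsilon=0.0)
random_choice = epsilon_greedy(q_row, epsilon=1.0, rng=random.Random(0))
```

Setting `epsilon` to 0 recovers the purely greedy strategy, while 1 gives uniform random exploration; values in between trade the two off.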
Considering the fast convergence requirements of the model, we use ϵ-greedy in our model. After selecting the action, the agent observes the return r and the next state s′ from the environment, and then the Q-learning algorithm uses the Bellman equation to update the Q-table, i.e.,

$$Q(s, a) = r + \gamma \max_{a'} Q(s', a')$$

The Bellman equation can be understood as follows: Q(s, a) is expressed as the immediate return r after taking action a in the current state s, plus the maximum expected return max_{a′} Q(s′, a′) after discounting by γ. The Bellman equation, also known as the dynamic programming equation, is a necessary condition for optimality in mathematical optimization methods such as dynamic programming, and is also a basic concept for multi-state problems in reinforcement learning.
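To make the update concrete, here is a small tabular Q-learning sketch on a toy problem. The five-state chain environment, the step size `alpha`, and all hyperparameter values are assumptions chosen for illustration; they are not the tracking environment or settings of MACT. The update moves Q(s, a) toward the target "immediate return plus discounted maximum next value" described in the text, and with `alpha = 1` it assigns that target directly.

```python
import random
from collections import defaultdict

ACTIONS = ("left", "right")
GOAL = 4  # terminal state of a hypothetical 5-state chain 0..4

def step(state, action):
    """Toy environment: move along the chain; reward 1 on reaching GOAL."""
    nxt = max(0, state - 1) if action == "left" else state + 1
    return nxt, (1.0 if nxt == GOAL else 0.0), nxt == GOAL

def q_learning(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    q = defaultdict(float)  # Q(s, a), zero-initialized; terminal values stay 0
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            # Epsilon-greedy selection with random tie-breaking.
            if rng.random() < epsilon:
                a = rng.choice(ACTIONS)
            else:
                best = max(q[(s, b)] for b in ACTIONS)
                a = rng.choice([b for b in ACTIONS if q[(s, b)] == best])
            s2, r, done = step(s, a)
            # Target: immediate return plus discounted best next value.
            target = r + gamma * max(q[(s2, b)] for b in ACTIONS)
            q[(s, a)] += alpha * (target - q[(s, a)])
            s = s2
    return q

q = q_learning()
greedy_policy = {s: max(ACTIONS, key=lambda b: q[(s, b)]) for s in range(GOAL)}
```

After training, the greedy policy moves right in every non-terminal state, which is the shortest route to the rewarded state in this toy chain.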

Reward Function Design
During the training process, the agent receives a reward r for taking an action a in the current state s. The reward is closely linked to the specific task. A good reward function not only speeds up the learning process, but also increases the value of decision making. In MACT, the reward r consists of three parts: the tracking quality reward $r_{tq}$, the feature selection reward $r_{fss}$, and the movement trend reward $r_{mts}$:

$$r = r_{tq} + r_{fss} + r_{mts} \qquad (11)$$

In particular, $r_{tq}$ is not only part of the reward r, but also determines the scores of $r_{fss}$ and $r_{mts}$. $r_{tq}$ is defined in Equation (12) by comparing the overlap score IoU with a threshold $t_{iou}$, and it takes the value +10 when tracking succeeds. The IoU in Equation (12) is the overlap score defined by Wu et al. [1]:

$$\mathrm{IoU} = \frac{|p \cap g|}{|p \cup g|} \qquad (13)$$

where ∩ and ∪ denote the intersection and union of two regions (p indicates the current object position and g represents the ground truth bounding box), and |·| denotes the number of pixels in the region. We set the threshold $t_{iou} = 0.5$ according to Wu et al. [1]. The definitions of $r_{fss}$ and $r_{mts}$ are related to $r_{tq}$ as follows:

$$r_{fss} = \begin{cases} r_{fss}^{s} & \text{if } r_{tq} = +10 \\ r_{fss}^{f} & \text{otherwise} \end{cases} \qquad (14)$$

$$r_{mts} = \begin{cases} r_{mts}^{s} & \text{if } r_{tq} = +10 \\ r_{mts}^{f} & \text{otherwise} \end{cases} \qquad (15)$$

In Equation (14), $r_{fss}^{s}$ and $r_{fss}^{f}$ represent the value of $r_{fss}$ when the tracking succeeds or fails, respectively. The same definition applies to $r_{mts}^{s}$ and $r_{mts}^{f}$. The specific score distribution scheme is shown in Tables 1 and 2. The feature selection strategy reward (FSS Reward) in Table 1 indicates the score obtained when the agent selects different feature expressions in different states, and the movement trend strategy reward (MTS Reward) in Table 2 is the score corresponding to the action in different directions. It is worth noting that, when the tracking framework selects the HOG feature, the reward we designed is relatively low in the state of successful tracking, and the penalty is relatively high in the state of tracking failure.
The reason for this design is that, under the premise of ensuring effectiveness, MACT encourages the tracking framework to use simple and effective feature representation as much as possible to improve tracking efficiency.
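The reward composition described above can be sketched as follows. Only the success value +10, the threshold 0.5, and the qualitative ordering for HOG (lower reward on success, higher penalty on failure) come from the text. The failure value of the tracking quality reward, the strict inequality against the threshold, and every number in the two lookup tables are hypothetical placeholders standing in for Equation (12) and Tables 1 and 2.

```python
def iou(p, g):
    """Overlap score |p ∩ g| / |p ∪ g| for boxes given as (x, y, w, h)."""
    ix = max(0.0, min(p[0] + p[2], g[0] + g[2]) - max(p[0], g[0]))
    iy = max(0.0, min(p[1] + p[3], g[1] + g[3]) - max(p[1], g[1]))
    inter = ix * iy
    union = p[2] * p[3] + g[2] * g[3] - inter
    return inter / union if union > 0 else 0.0

T_IOU = 0.5            # threshold from the text
R_TQ_SUCCESS = 10.0    # success value referenced by Equations (14) and (15)
R_TQ_FAILURE = -10.0   # ASSUMPTION: the failure value is not given here

# HYPOTHETICAL placeholders: (reward on success, reward on failure).
FSS_REWARD = {"raw_gray": (3.0, -1.0), "raw_color": (2.0, -2.0), "hog": (1.0, -3.0)}
MTS_REWARD = {d: (1.0, -1.0) for d in ("up", "down", "left", "right")}

def total_reward(pred_box, gt_box, feature, direction):
    """r = r_tq + r_fss + r_mts, with r_tq selecting the success or failure branch."""
    success = iou(pred_box, gt_box) > T_IOU  # ASSUMPTION: strict inequality
    r_tq = R_TQ_SUCCESS if success else R_TQ_FAILURE
    idx = 0 if success else 1
    return r_tq + FSS_REWARD[feature][idx] + MTS_REWARD[direction][idx]

hit = total_reward((0, 0, 10, 10), (0, 0, 10, 10), "hog", "up")
miss = total_reward((0, 0, 10, 10), (20, 20, 10, 10), "hog", "up")
```

Because the cheaper features earn more on success in this placeholder table, an agent maximizing the reward is nudged toward them whenever they are sufficient, which mirrors the stated design intent.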
Therefore, the goal of training the agent is to maximize the sum of rewards R throughout the video sequence.

Mutual Cooperative
The tracking framework in Section 3.1 and the strategic framework in Section 3.2 together form our multi-angle analysis collaboration tracking (MACT). As illustrated in Figure 3, the strategic framework models two thinking processes (the observer's attention and the object's motion), and it can be expanded further. After training by reinforcement learning, the strategic framework obtains the corresponding strategies. Specifically, our MACT adopts the movement trend strategy (MTS) to guide the motion model toward purposeful sampling, and uses the feature selection strategy (FSS) to choose a more appropriate image feature.
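One way to picture purposeful sampling under a movement trend is sketched below: candidate positions in a square neighborhood are enumerated as in a sliding window, and those lying in the predicted direction receive a higher weight. The weighting rule, the `boost` factor, and the function name are hypothetical; the paper does not specify how the weights are assigned.

```python
# Image coordinates: x grows to the right, y grows downward.
DIRECTION_VECTORS = {"up": (0, -1), "down": (0, 1), "left": (-1, 0), "right": (1, 0)}

def weighted_candidates(center, radius, step, direction, boost=2.0):
    """Return [(position, weight)] with weights normalized to sum to 1.

    Offsets whose projection on the trend direction is positive get
    `boost` times the base weight (boost is an assumed parameter).
    """
    dir_x, dir_y = DIRECTION_VECTORS[direction]
    raw = []
    for dy in range(-radius, radius + 1, step):
        for dx in range(-radius, radius + 1, step):
            along = dx * dir_x + dy * dir_y
            weight = boost if along > 0 else 1.0
            raw.append(((center[0] + dx, center[1] + dy), weight))
    total = sum(w for _, w in raw)
    return [(pos, w / total) for pos, w in raw]

candidates = weighted_candidates((50, 50), radius=4, step=2, direction="right")
weights = dict(candidates)
```

Here the candidate two steps to the right of the previous position carries twice the weight of its mirror image on the left, so the observation model's scores are biased toward the expected motion.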

[Figure panels: Strategic Framework; Tracking Framework]

Discussion
Psychological research has found that targets that are significantly different from the surrounding area are likely to attract the viewer's visual attention [35], and the study of visual attention is visual saliency research [36]. Therefore, we believe that, in the tracking process, people are more likely to be attracted by the target when the background and target differ more (there are no similar distractors around the target, or the background is not cluttered); otherwise, people need to find more differences between the target and the background. As shown in Figure 4, when there are no other pedestrians around the target (the walking woman), its saliency value (the visual attention measure calculated in [36]) stands out clearly from the surroundings. Once there are some similar pedestrians nearby, the target's saliency value no longer stands out and is even lower than that of the other pedestrians. Therefore, when the tracking environment is complex, the tracking process becomes difficult, and it is necessary to obtain more information to distinguish the target from the background. In MACT, we analyze the tracking process from different angles to obtain different discriminating information.
The HOG feature is the last choice in MACT's feature selection strategy (FSS). The raw grayscale feature simply converts the image to grayscale and then uses the pixel values as the feature. Although the processing is simple, in some suitable tracking scenarios this feature can achieve good tracking results. Raw color is basically the same as raw grayscale except that the image is represented in the color space instead of in grayscale. This feature is effective when the object and the background are clearly distinguishable in color. However, when these conditions are not satisfied, the effectiveness of raw grayscale and raw color is greatly reduced. Because it shows excellent overall performance [2], especially the ability to describe the local shape of the object [3], HOG is adopted by FSS. As shown in Figure 5, in the MotorRolling video sequence, raw grayscale and raw color cannot capture the target well due to factors such as illumination effects and target blur; in the grayscale Football video sequence, raw grayscale has difficulty coping because of similar distractors around the target.

We use the Subway sequence to demonstrate the advantages of the feature selection strategy. From the results shown in Figure 6, we observe that, from Frame 70 to Frame 80, there is only one passerby in white clothes near the tracked pedestrian. Because the colors of their clothes are very different, it is easy to use color features to distinguish the object from the distractor. In contrast, between Frame 90 and Frame 100, there are distractors with similar colors around the object. At this time, the color features no longer have good discriminability. Under the guidance of the strategic framework, the tracking framework uses the HOG feature, which has high computational complexity but strong expressiveness and better overall performance, to perform feature processing.
In the later stages of the video, there are basically no similar distractors around the target, i.e., the object and background are very different; therefore, simple grayscale features can achieve good tracking results. It can be seen that a good feature selection strategy can not only improve the tracking speed, but also improve the tracking accuracy to a certain extent.

Figure 6. Example sequence of the strategy taken to match the most appropriate features to the current scene. Feature usage is determined by the option with the maximum score on the score map. Our agent learns to act wisely upon the score maps. When the score maps are ambiguous, the agent postpones the decision and uses all features according to the more unambiguous score map at the next frame. Further selections of image features are performed with more balance and stronger feature confidence.

Figure 7 shows the importance of the movement trend strategy. In the CarScale sequence, the car travels in one direction, and accurate movement trend estimation allows the motion model in the tracking framework to select samples better. Of course, in most tracking videos, the trajectory of the object is not predetermined. However, over a short period of time, the object's movement trend is still predictable due to inertia.

Experimental Results
In this section, we validate our MACT tracker on the OTB50 dataset using the CVPR 2013 benchmark evaluation method [1], and then compare the test results with current mainstream tracking methods. All experiments were performed on a personal computer running MATLAB R2014b, with an Intel i7-4790 CPU and 4 GB of DDR3 memory.
To facilitate tracking evaluation, we employed the classic CVPR 2013 benchmark evaluation system. The evaluation consists of two indicators: the precision plot and the success plot. The former is based on the center location error, i.e., the Euclidean distance between the center locations of the tracked targets and the manually labeled ground truths, while the latter shows the ratio of successful frames as the overlap threshold varies from 0 to 1 [1]. For the robustness evaluation of a video sequence, we adopted one-pass evaluation (OPE), which reports the average precision or success rate over the entire video sequence after the tracker is initialized with the ground truth position in the first frame (see [1] for details).
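The two indicators can be sketched as follows for a single sequence. The box format (x, y, w, h), the 20-pixel precision threshold commonly reported with this benchmark, the number of overlap thresholds, and the approximation of the area under the success curve by the mean success rate are assumptions of this sketch rather than details stated in the text.

```python
import math

def center_error(p, g):
    """Euclidean distance between the centers of boxes given as (x, y, w, h)."""
    return math.hypot((p[0] + p[2] / 2) - (g[0] + g[2] / 2),
                      (p[1] + p[3] / 2) - (g[1] + g[3] / 2))

def overlap(p, g):
    """Overlap score: intersection area divided by union area."""
    ix = max(0.0, min(p[0] + p[2], g[0] + g[2]) - max(p[0], g[0]))
    iy = max(0.0, min(p[1] + p[3], g[1] + g[3]) - max(p[1], g[1]))
    inter = ix * iy
    union = p[2] * p[3] + g[2] * g[3] - inter
    return inter / union if union > 0 else 0.0

def precision_at(preds, gts, threshold=20.0):
    """Fraction of frames whose center error is within `threshold` pixels."""
    errors = [center_error(p, g) for p, g in zip(preds, gts)]
    return sum(e <= threshold for e in errors) / len(errors)

def success_curve(preds, gts, steps=21):
    """Success rate for each overlap threshold from 0 to 1."""
    thresholds = [i / (steps - 1) for i in range(steps)]
    scores = [overlap(p, g) for p, g in zip(preds, gts)]
    rates = [sum(s > t for s in scores) / len(scores) for t in thresholds]
    return thresholds, rates

def area_under_curve(rates):
    """Approximate AUC as the mean success rate over the thresholds."""
    return sum(rates) / len(rates)

# Two-frame toy sequence: one perfect frame and one complete miss.
preds = [(0, 0, 10, 10), (0, 0, 10, 10)]
gts = [(0, 0, 10, 10), (30, 0, 10, 10)]
thresholds, rates = success_curve(preds, gts)
```

Sweeping `threshold` in `precision_at` over a range of pixel distances yields the precision plot, while `success_curve` yields the success plot directly.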

Strategic Framework Test
To validate the effectiveness of the strategic framework, as shown in Table 3, we compared MACT with three specially designed trackers. The Basic tracker is largely consistent with the tracking framework of MACT (see Section 3.1 for details), the difference being that the former uses random sampling and the fixed HOG feature. The MACT_FSS tracker is consistent with MACT, except that its strategic framework only contains the feature selection strategy (FSS), and its motion model uses the same random sampling as the Basic tracker. Finally, MACT_MTS is similar to MACT but removes the feature selection strategy and adopts the HOG feature in the feature extractor. We ran these four trackers on all 50 videos of OTB50. We compared the tracking speed of the four methods, as shown in Figure 8. The MACT_FSS tracker has the best tracking speed (21.445 frames per second), and MACT is 5.23% slower than that. The reason is that MACT_FSS does not need to predict the target's movement trend, which also shows that the impact of the movement trend strategy on tracking speed is very small. Because computing the HOG feature is expensive, the tracking speeds of the other two trackers are much lower.

To better understand how the strategic framework improves tracking accuracy, we compared the MACT_MTS tracker with the Basic tracker to verify the validity of the movement trend strategy. As shown in Figures 9 and 10, MACT_MTS's OPE score is superior to that of the Basic tracker in both the success plot and the precision plot, and MACT_MTS had the better score for most of the 11 attributes (background clutter, out-of-plane rotation, illumination variation, in-plane rotation, motion blur, fast motion, deformation, occlusion, out of view, low resolution, and scale variation). This shows that the movement trend strategy (MTS) can improve the accuracy of tracking.
For the validity verification of the feature selection strategy (FSS), we chose the MACT_FSS tracker to compare with the Basic tracker. As shown in Figures 11 and 12, we found that, with the guidance of the FSS, the accuracy of the MACT_FSS tracker has been greatly improved; specifically, the OPE success plot score increased by 4.16%, and the precision AUC score increased by 1.57%.
The single-strategy experiments show that FSS can not only greatly improve the tracking speed, but also improve the tracking accuracy; although MTS has a slight negative influence on the tracking speed, it also improves the tracking accuracy. When these two strategies act in the strategic framework at the same time, which gives our MACT tracker, the experimental data show that the cooperation between FSS and MTS improves both the speed (compared with MACT_MTS) and the accuracy, as shown in Figure 13.
Although MTS reduces the tracking speed, it has achieved the goal of jointly improving tracking accuracy in cooperation with FSS. Therefore, we believe that the computational burden of MTS is acceptable.

Figure 12. Detailed AUC scores on 11 attributes on OPE protocol.

MACT Tracker Test
After verifying the effectiveness of the strategy, we compare the performance of the MACT with the other tracking methods in two aspects: quantitative comparison and qualitative comparison.
As can be seen from the success and precision plots of OPE in Figures 14 and 15, our MACT has considerable advantages. On most of the 11 attributes, our method leads the other trackers.

In Table 4, we compare MACT with four representative state-of-the-art competitors in VOT-2015 [48]. Although MACT has a lower accuracy score than MDNet [49] and Staple [50], it runs much faster than all four trackers in terms of normalized speed. Compared with DSST [19] and MEEM [51], MACT leads in all three indicators. Without sophisticated optimization strategies or high-precision feature representations, MACT still achieves good tracking performance. It can be seen that, with the assistance of the strategic framework, MACT is basically in the leading position among the traditional tracking methods.

Qualitative Comparison
Since MACT is superior to most traditional tracking methods, we directly chose to compare with MUlti-Store Tracker (MUSTer) [52] in a qualitative comparison. MUSTer's concept is very clever (a dual-component: short-term memory and long-term memory store), and the features selected are more complicated: 31-dimensional HOG descriptors. Because of its clever design, more expressive features, and excellent program optimization, MUSTer is the leader among current tracking methods based on non-depth feature descriptions.
On the OTB50 dataset, we tested both the MACT and MUSTer methods. MUSTer's overall performance is even better. As shown in Figure 16, both methods perform well in most video sequences, such as basketball, bolt, boy, car4, carScale, crossing, mountainbike, and walking.

Figure 17 shows that MUSTer performs well in some challenging video sequences. Specifically, in the couple video sequence, both methods capture the object during frames #1 to #90, whereas in frames #99 and #100 the target is severely occluded and the distractor moves consistently with the target's movement trend. MACT drifted due to the lack of more precise feature options and corresponding processing mechanisms. In addition, in the football video sequence, when there is a very similar distractor around the tracked player, MACT begins to show poor performance, even though MACT has a movement trend strategy. Although MACT is slightly inferior to MUSTer, we found that MACT has certain advantages in other challenging video sequences. As shown in Figure 18, in the motorRolling video, the object is a motorcycle. The difficulty of this video is that the object moves extremely fast and the number of video frames is very small, and the feature is relatively unimportant in this video, so MUSTer does not have an advantage, but MACT shows strong robustness because of the guidance of the movement trend strategy (MTS). The same situation occurs in another video sequence (shaking): the gray objects have similar colors to the dark background, so the color and gray features are not very discriminative. In addition, there are similar distractors around the target (for example, piano and guitar players), so the HOG feature does not have an advantage. At this time, MACT can correct the target drift thanks to the guidance of the movement trend strategy (MTS).

Figure 18. Example of tracking results with MACT advantages.

Conclusions
In this paper, we incorporate a novel strategic framework into a traditional tracking framework based on multi-angle analysis collaboration. We believe that visual tracking research should not only consider machine learning methods or deep neural networks, but should also consider the nature of the tracking problem from different perspectives.
In our method, two thinking angles, namely the observer's attention and the object's motion, are selected from a simple case study to handle tracking better. Specifically, we choose the image feature to implement the observer's attention angle, and adopt the object's movement trend to reflect the object's motion angle. The selection of suitable image features and the prediction of the current object movement trend are determined by the strategy pool in the strategic framework. It is worth noting that the types of features and the directions of the movement trend correspond one-to-one with the actions in the action space. The strategies are learned by Q-learning with ϵ-greedy exploration. Our MACT tracker is thus a fusion of the strategic framework and the basic tracking framework.
Experiments on the OTB50 benchmark demonstrate that our MACT tracker achieves high evaluation scores and avoids drift to some extent. The motivation for the paper is simple: to return tracking research to thinking about tracking behavior. Mapping states in the multi-dimensional state-action space and thinking about tracking from different angles can help to solve tracking problems in different tracking environments and thinking modes.