Multi-Complementary Model for Long-Term Tracking

In recent years, video target tracking algorithms have been widely used. However, many tracking algorithms do not achieve satisfactory performance, especially when dealing with object occlusion, background clutter, motion blur, low-illumination color images, and sudden illumination changes in real scenes. In this paper, we incorporate an object model based on contour information into the Staple tracker, which combines a correlation filter model and a color model, to greatly improve tracking robustness. Since each model is responsible for tracking specific features, the three complementary models combine for more robust tracking. In addition, we propose an efficient object detection model based on contour and color histogram features, which has good detection performance and better detection efficiency than traditional target detection algorithms. Finally, we optimize the traditional scale calculation, which greatly improves the tracking execution speed. We evaluate our tracker on the Object Tracking Benchmark 2013 (OTB-13) and Object Tracking Benchmark 2015 (OTB-15) datasets. On OTB-13, our algorithm improves on the classic LCT (Long-term Correlation Tracking) algorithm by 4.8%, 9.6%, and 10.9% on the success plots of OPE, TRE, and SRE, respectively. On OTB-15, it improves on LCT by 10.4%, 12.5%, and 16.1% on the success plots of OPE, TRE, and SRE, respectively. Moreover, owing to the high computational efficiency of the color model, the efficient data structures used by the object detection model, and the speed advantage of correlation filters, our tracking algorithm still achieves good tracking speed.


Introduction
Video tracking is an important part of computer vision and is widely used across a variety of fields including intelligent transportation, man-machine interaction, and military guidance [1,2]. This paper focuses on how to quickly and effectively address the tracking drift problem in object tracking processes when confronted with similarly colored backgrounds, object occlusions, low illumination color images, and sudden illumination changes.
In recent years, correlation filters have attracted growing attention for their efficiency and robustness. In [3], a video tracking algorithm was proposed based on a correlation filter that minimizes the output sum of squared error. Subsequently, Henriques et al. [4] proposed the circulant structure of tracking-by-detection with kernels (CSK) algorithm, which exploits the circulant structure of densely sampled patches to train a nonlinear regularized least squares (RLS) classifier. CSK was later extended to the Kernelized Correlation Filter (KCF) [5], which uses Histogram of Oriented Gradients (HOG) [6] features. Danelljan et al. [7] introduced spatial regularization into filter learning, penalizing the filter coefficients according to their spatial position. Danelljan et al. [8] extended CSK with multi-channel color features.
The main contributions of this paper are as follows:
1. By incorporating the object model into the Staple algorithm, which combines a correlation filter and a color model, the tracking robustness is greatly improved. Each model is responsible for tracking specific features, and the three complementary models are combined for robust visual tracking.
2. Unlike traditional classifier-based object detection, an efficient object detection model with contour features and color histogram features is proposed for the first time, which significantly improves detection efficiency and detection speed.
3. The redundant calculation of image features across the scales of the correlation filter module is removed to improve the execution speed of the algorithm.

Multi-Complementary Model Tracking
Our baseline is the Staple (Sum of Template and Pixel-wise Learners) algorithm. Staple divides tracking into a translation tracking phase and a scale tracking phase. Translation tracking gives the position estimate, and scale tracking computes the target scale using a 1D correlation filter. During the translation tracking phase, Staple combines the response scores of the color model and the correlation filter model (using HOG features) to achieve good tracking performance. However, as color information is easily disturbed by factors such as the environment and lighting, the performance of the tracker is limited. It is therefore desirable to complement the color feature with additional features. In this paper, we propose MMT (Multi-Complementary Model Tracking), which incorporates an object model based on Edge Boxes [18] into the translation tracking phase of Staple. Edge Boxes [18] relies on object contour (edge) information and adapts well to lighting and background changes. By fusing the response scores of these complementary features, the diversity and discriminative power of the sample information can be exploited to a greater extent, which improves the generalizability of the tracker. In addition, we optimize the scale calculation, greatly reducing redundant operations and increasing the tracking speed.
Staple adopts the tracking-by-detection paradigm and estimates location and scale separately. In frame $t+1$, translation tracking obtains the new target position using the fixed target size $s_t$ from the previous frame, and scale tracking then updates the scale at the new position computed by translation tracking. Therefore, in translation tracking of frame $t+1$, given a search patch $d_{t+1}$ extracted around the previous target position and the fixed target size $s_t$, Staple chooses the target bounding box $p_{t+1}$, which gives the target location, from the set $\Upsilon = \{\Pi : l_{i,j}(\Pi) \in r,\ \vartheta(\Pi) = s_t\}$ so as to maximize the fused score:
$$p_{t+1} = \arg\max_{\Pi \in \Upsilon} \left[ \gamma_{M_0} f_0\big(T_0(d_{t+1}, \Pi); M_{0,t}\big) + \gamma_{M_1} f_1\big(T_1(d_{t+1}, \Pi); M_{1,t}\big) \right] \qquad (1)$$
where $r$ is the valid inner region, $r \subset d_{t+1}$; $l_{i,j}(\Pi) \in r$ states that the location $(i, j)$ of bounding box $\Pi$ lies in region $r$; and $\vartheta(\Pi) = s_t$ states that the size of $\Pi$ equals the fixed size $s_t$. The functions $T_0$ and $T_1$ are image transformations such that $f_0(T_0(d_{t+1}, \Pi); M_{0,t})$ and $f_1(T_1(d_{t+1}, \Pi); M_{1,t})$ assign scores to the bounding box $\Pi$ according to the color model parameters $M_{0,t}$ and the filter model parameters $M_{1,t}$, respectively. Both $M_{0,t}$ and $M_{1,t}$ are trained from the previous target states and images, and $\gamma_{M_0}$ and $\gamma_{M_1}$ are the combination coefficients of the color model and filter model, respectively. In this paper, by incorporating the object model into Staple, we obtain the target bounding box $p_{t+1}$ by maximizing a new fused score of the three complementary models.
The score function can be written as:
$$p_{t+1} = \arg\max_{\Pi \in \Upsilon} \sum_{i=0}^{N_m - 1} \gamma_{M_i} f_i\big(T_i(d_{t+1}, \Pi); M_{i,t}\big) \qquad (2)$$
where $N_m$, which is three in this paper, is the number of models; the function $T_i$ is an image transformation such that $f_i(T_i(d_{t+1}, \Pi); M_{i,t})$ assigns a score to the bounding box $\Pi$ according to the model parameters $M_{i,t}$ trained from the previous target states and images; and $\gamma_{M_0}$, $\gamma_{M_1}$, and $\gamma_{M_2}$ are the combination coefficients of the color model, filter model, and object model scores, respectively. In the rest of this paper, they are denoted $\tau_c$, $\tau_f$, and $\tau_o$, respectively.
Before we introduce the specific models, we first introduce the concept of the inner region, which is used by all three models under the tracking-by-detection principle. Given a search area $d_{t+1}$ for tracking and a bounding box of fixed size for sliding-window detection, not every bounding box centered at a position in $d_{t+1}$ can be evaluated. As shown in Figure 1, the bounding box $\Pi_2$ is centered outside the yellow area and therefore has many pixels outside the overall detection area $d_{t+1}$, so such bounding boxes are not evaluated.
We call the set of center positions of these bounding boxes the outer region, and the region in which a bounding box center can be evaluated the inner region. The inner region is labeled in yellow in Figure 1. The set $\Upsilon = \{\Pi : l_{i,j}(\Pi) \in r,\ \vartheta(\Pi) = s_t\}$ therefore contains all boxes (of fixed size $s_t$) centered at the different positions in region $r$, and constitutes the whole search sample space. Sections 2.1 and 2.2 describe the filter model and the color model of Staple, and Section 2.3 describes the object model based on Edge Boxes [18] that we incorporated into Staple. In Section 2.4, we describe the method of fusing the predicted response scores of the multiple models, which follows Staple. Section 2.5 presents how we optimized the scale calculation to significantly improve the speed of the algorithm.

Learning of Filter Model
In Staple, the filter model is a tracking-by-detection model. During training, a rectangular patch $T_{d,t}$ is sampled from the $t$-th frame, and the corresponding $d$-dimensional Histogram of Oriented Gradients (HOG) feature map $f^l$, $l \in \{1, \ldots, d\}$, is extracted from $T_{d,t}$. A set of filters $h^l$, one per feature dimension, is then trained by minimizing the loss function:
$$\varepsilon = \left\| \sum_{l=1}^{d} h^l * f^l - g \right\|^2 + \lambda \sum_{l=1}^{d} \left\| h^l \right\|^2 \qquad (3)$$
where $*$ denotes circular correlation; $h^l$ is the filter for the $l$-th feature dimension; and $g$ is the desired correlation output, generally chosen as a Gaussian function with a maximum value of 1. The parameter $\lambda \geq 0$ is the coefficient of the regularization term. Using Parseval's theorem, a fast solution is obtained in the frequency domain:
$$H^l = \frac{\overline{G}\, F^l}{\sum_{k=1}^{d} \overline{F^k}\, F^k + \lambda} \qquad (4)$$
where capital letters denote discrete Fourier transforms (DFTs); $\overline{G}$ is the complex conjugate of the DFT of the Gaussian response $g$; and $\overline{F^k} F^k$ is the element-wise product of $F^k$, the DFT of the $k$-th feature dimension of the image patch $T_{d,t}$, with its complex conjugate. Based on the above model, we rename the filter $H^l$, $l = 1, \ldots, d$, used in the translation phase as the translation filter $R_c$, with numerator $A^l$ and denominator $B$. $R_c$ is updated with a learning factor $\eta_f$:
$$A_t^l = (1 - \eta_f) A_{t-1}^l + \eta_f\, \overline{G}\, F_t^l, \qquad B_t = (1 - \eta_f) B_{t-1} + \eta_f \sum_{k=1}^{d} \overline{F_t^k}\, F_t^k \qquad (5)$$
where $t$ is the index of the frame.

Calculating the Filter Scores
Given a search patch $d_{t+1}$ of size $\hat{c} \times \hat{l}$ and its inner region $r$ in the $(t+1)$-th frame, the HOG feature map of $d_{t+1}$ is extracted. Let $z_{t+1}^l$ denote the $l$-th dimension of this feature map and $Z_{t+1}^l$ its DFT. The correlation scores $S_{(f,t+1)}$ of the search area $d_{t+1}$ are obtained by correlating the feature map $z_{t+1}$ with the translation filter $R_c$ obtained in the previous frame with Equation (5):
$$S_{(f,t+1)} = \mathcal{F}^{-1} \left\{ \frac{\sum_{l=1}^{d} \overline{A_t^l}\, Z_{t+1}^l}{B_t + \lambda} \right\} \qquad (6)$$
where $A_t^l$ and $B_t$ are the numerator and denominator of the translation filter $R_c$ obtained in the previous frame, and $\mathcal{F}^{-1}$ is the inverse DFT operator.
As the filter model is based on the tracking-by-detection principle, the value at each position of the response map is the score of the bounding box centered at that position of the search patch $d_{t+1}$. Unlike the sliding-window detection of the color and object models, the filter model relies on circular correlation, so its response has the same size as the feature template (the size of $d_{t+1}$). As Figure 1 shows, the inner region $r$ lies inside the search patch $d_{t+1}$. The filter response $y_{t+1}^f$ of the inner region $r$ can therefore be obtained by cropping the filter response $S_{(f,t+1)}$ of the search patch $d_{t+1}$, and it contains the scores of all the bounding boxes in the set $\Upsilon = \{\Pi : l_{i,j}(\Pi) \in r,\ \vartheta(\Pi) = s_t\}$.
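The learning and scoring steps of the filter model can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: random arrays stand in for HOG features, the helper names (`gaussian_label`, `train_filter`, `update_filter`, `filter_response`, `crop_inner`) and the default `sigma` are hypothetical, and the desired output is assumed to peak at index (0, 0) with wrap-around. The formulas follow the standard multi-channel correlation filter solution, with numerator $\overline{G} F^l$ and denominator $\sum_k \overline{F^k} F^k$.

```python
import numpy as np

def gaussian_label(h, w, sigma=2.0):
    """Desired output g: Gaussian with peak value 1 at index (0, 0), wrapping around."""
    ys = np.arange(h)
    xs = np.arange(w)
    ys = np.minimum(ys, h - ys)  # circular distance to row 0
    xs = np.minimum(xs, w - xs)  # circular distance to column 0
    return np.exp(-(ys[:, None] ** 2 + xs[None, :] ** 2) / (2.0 * sigma ** 2))

def train_filter(f, g):
    """f: (h, w, d) feature map, g: (h, w) desired output.
    Returns the numerator A (h, w, d) and the denominator B (h, w)."""
    F = np.fft.fft2(f, axes=(0, 1))
    G = np.fft.fft2(g)
    A = np.conj(G)[:, :, None] * F
    B = np.sum(np.conj(F) * F, axis=2).real
    return A, B

def update_filter(A_prev, B_prev, A_new, B_new, eta_f):
    """Running average of numerator and denominator with learning factor eta_f."""
    A = (1.0 - eta_f) * A_prev + eta_f * A_new
    B = (1.0 - eta_f) * B_prev + eta_f * B_new
    return A, B

def filter_response(z, A, B, lam):
    """Correlation scores of a search feature map z (h, w, d), same size as z."""
    Z = np.fft.fft2(z, axes=(0, 1))
    spectrum = np.sum(np.conj(A) * Z, axis=2) / (B + lam)
    return np.real(np.fft.ifft2(spectrum))

def crop_inner(response, hp, wp):
    """Center the zero-shift position, then crop the central hp x wp inner region."""
    centered = np.fft.fftshift(response)
    top = (centered.shape[0] - hp) // 2
    left = (centered.shape[1] - wp) // 2
    return centered[top:top + hp, left:left + wp]
```

When the search features are a circularly shifted copy of the training features, the peak of the response moves by the same shift, which is what the translation step reads off.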

Learning of Color Model
The color model is also based on the widely used tracking-by-detection principle, which selects the bounding box (gives the target position) with the highest score from the bounding box set as the final test result to localize the object of interest within a new frame. As with other classifier-based approaches, color models obtain parameters by learning both the positive and negative samples simultaneously.
We follow Staple, where the color features are based on RGB colors and the color histograms are computed in a 32 × 32 × 32 bin space. To obtain sparse features that speed up the calculation, the color model of Staple maps each pixel $u$ of the image, represented in RGB space, to an index feature $j = \varphi(u)$ in the 32 × 32 × 32 bin space.

As shown in Figure 2, during the model training phase in frame $t$, given a rectangular patch $T_{d,t}$ sampled around the estimated location in frame $t$, Staple divides $T_{d,t}$ into the foreground area $O$ (which has the size of the estimated target of the previous frame) and the background area $B$. These areas are used to calculate the proportion of each index feature (in the 32 × 32 × 32 bin space) in the foreground area $O$ and the background area $B$, respectively. For a region $\Omega \in \{O, B\}$, the proportion of each index feature $j$ in $\Omega$ is $\rho^j(\Omega) = N^j(\Omega)/|\Omega|$, where $N^j(\Omega) = |\{u \in \Omega : \varphi(u) = j\}|$ is the number of pixels in $\Omega$ with index feature $j$ and $|\Omega|$ is the total number of pixels in $\Omega$. For an online model, $\rho(O)$ and $\rho(B)$ are updated as in Staple:
$$\rho_t(O) = (1 - \eta_c)\, \rho_{t-1}(O) + \eta_c\, \rho'_t(O), \qquad \rho_t(B) = (1 - \eta_c)\, \rho_{t-1}(B) + \eta_c\, \rho'_t(B) \qquad (7)$$
where $\rho_t(\Omega)$ is the vector of $\rho_t^j(\Omega)$ for $j = 1, \ldots, M$; $\rho'_t(\Omega)$ is the proportion vector computed from frame $t$ alone; $M$ is the dimension of the mapped space (32 × 32 × 32 bins); and $\eta_c$ is a learning rate parameter.
Once the proportions of each index feature $j$ in the foreground area $O$ and the background area $B$ have been calculated, the weight coefficient $\beta_t^j$ of each index feature $j$ is updated, as in Staple, by:
$$\beta_t^j = \frac{\rho_t^j(O)}{\rho_t^j(O) + \rho_t^j(B) + \lambda}, \qquad j = 1, \ldots, M \qquad (8)$$
where $\lambda$ is a small regularization constant.
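The color model learning step can be sketched as follows. This is an illustration under assumptions, not the Staple code: the function names are hypothetical, pixels are assumed to be 8-bit RGB, and the weight formula (including the small constant `lam`) is the Staple-style ratio of foreground proportion to the sum of foreground and background proportions.

```python
import numpy as np

N_BINS = 32  # bins per RGB channel, giving a 32 x 32 x 32 bin space

def bin_index(pixels):
    """Map uint8 RGB pixels of shape (..., 3) to a flat index j = phi(u) in [0, 32**3)."""
    q = (pixels.astype(np.int64) * N_BINS) // 256
    return (q[..., 0] * N_BINS + q[..., 1]) * N_BINS + q[..., 2]

def proportions(pixels):
    """rho^j(Omega) = N^j(Omega) / |Omega| for every bin j of a region Omega."""
    idx = bin_index(pixels).ravel()
    return np.bincount(idx, minlength=N_BINS ** 3) / idx.size

def update_proportions(rho_prev, rho_new, eta_c):
    """Online update of a proportion vector with learning rate eta_c."""
    return (1.0 - eta_c) * rho_prev + eta_c * rho_new

def bin_weights(rho_fg, rho_bg, lam=1e-3):
    """Weight of each bin: close to 1 for foreground-only colors, 0 for background-only colors.
    lam is an assumed small regularizer that also avoids division by zero for empty bins."""
    return rho_fg / (rho_fg + rho_bg + lam)
```

A bin that occurs only in the foreground receives a weight close to 1, a bin that occurs only in the background receives 0, and bins never observed in either region also receive 0.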
Figure 3. Given an RGB search patch $d_{t+1}$ (a) in frame $t+1$, the per-pixel scores $S_{t+1,\beta}$ (heat map in (b)) of the search patch are first obtained by table lookup. Then, the color response $y_{t+1}^c$ (heat map in (c)) of the inner region $r$ is obtained. The dotted yellow box in $d_{t+1}$ marks the inner region $r$, and the red box in (b) is the sliding bounding box $\Pi$, which generates all the boxes centered in the inner region $r$ of $d_{t+1}$.
Given a search patch $d_{t+1}$, which is extracted around $p_t$ (with a magnification relative to the target bounding box $p_t$) and has a size of $\hat{c} \times \hat{l}$, and the target size $s_t$ used for fixed-size target detection, we can obtain its inner region $r$ and the corresponding bounding box set $\Upsilon = \{\Pi : l_{i,j}(\Pi) \in r,\ \vartheta(\Pi) = s_t\}$ in frame $t+1$. From Equation (2), our goal is to calculate the color response scores of all bounding boxes in $\Upsilon$. First, however, we need a score matrix that gives the score of each pixel $u$ in the search patch $d_{t+1}$; Staple calls this matrix the per-pixel scores. It is computed as follows. The score (weight $\beta_t^j$) of each index feature $j$ was obtained during training in the previous frame $t$, so the score $\beta_t^{\varphi(u)}$ of each pixel $u$ (in RGB space) of $d_{t+1}$ is obtained directly by table lookup, which forms the per-pixel score matrix $S_{t+1,\beta} = [\beta_t^{\varphi(u)}]_{u \in d_{t+1}}$. An example of per-pixel scores is shown as the heat map in Figure 3b. Next, we calculate the color score of each bounding box $\Pi$ in $\Upsilon$. The color score $S_c(\Pi)$ is a pixel-based average, that is, the average of the weights $\beta_t^{\varphi(u)}$ of all pixels $u \in \Pi$:
$$S_c(\Pi) = \frac{1}{|\Pi|} \sum_{u \in \Pi} \beta_t^{\varphi(u)} \qquad (9)$$
By sliding the bounding box of fixed size $s_t$ over the score matrix $S_{t+1,\beta}$, we can calculate the color scores of all bounding boxes in $\Upsilon$ with Equation (9). This yields a second score matrix, the color response $y_{t+1}^c = [S_c(i, j)]$, where $S_c(i, j)$ is the color score, computed with Equation (9), of the bounding box centered at position $(i, j)$ in the inner region $r$. $y_{t+1}^c$ has the same size $\hat{h}_p \times \hat{w}_p$ as the inner region $r$ and is shown as the heat map in Figure 3c. Supposing the size of $s_t$ is $\hat{h}_s \times \hat{w}_s$, we have $\hat{h}_p = \hat{h} - \hat{h}_s + 1$ and $\hat{w}_p = \hat{w} - \hat{w}_s + 1$.
The score response of the sliding window can be computed efficiently, for example with integral images or by convolving the per-pixel score map with a box filter, as in Staple. For more details, one can refer to the code of the Staple algorithm.
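The sliding-window average of Equation (9) can be sketched with an integral image (summed-area table). The function names below are hypothetical, and `idx_map` is assumed to already hold the bin index $\varphi(u)$ of every pixel of the search patch.

```python
import numpy as np

def per_pixel_scores(idx_map, beta):
    """Table lookup: S[i, j] is the weight beta of the bin that pixel (i, j) falls into."""
    return beta[idx_map]

def color_response(S, hs, ws):
    """Mean of S over every hs x ws window.
    For an h x w input, the output has size (h - hs + 1) x (w - ws + 1)."""
    ii = np.zeros((S.shape[0] + 1, S.shape[1] + 1))
    ii[1:, 1:] = S.cumsum(axis=0).cumsum(axis=1)  # integral image with a zero border
    total = ii[hs:, ws:] - ii[:-hs, ws:] - ii[hs:, :-ws] + ii[:-hs, :-ws]
    return total / (hs * ws)
```

Each output entry costs four lookups regardless of the window size, and the output size matches the inner-region size relation given above (patch size minus target size plus one in each dimension).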

Learning of Object Model
Similar to the filter model and color model, the object model is also based on the tracking-by-detection principle to locate the target. In Section 2.2, we find that, in the color model, the score of each bounding box is calculated based on the average of all pixel weight scores in the bounding box. In this section, we describe another approach, which is based on contour information to measure the probability score of the bounding box as a target.
In Edge Boxes [18], the likelihood that a bounding box contains an object is based on the number of contours that are wholly contained in the bounding box. Using efficient data structures, millions of bounding boxes can be evaluated in a fraction of a second. Furthermore, this model does not require an additional training process: given the image and the location and size of a bounding box, the score of the box can be calculated efficiently. The Edge Boxes [18] score is introduced below.
Given a search area $d_{t+1}$, which is extracted around $p_t$ (with a magnification relative to the target bounding box $p_t$) and has a size of $\hat{c} \times \hat{l}$, and the target size $s_t$ used for fixed-size target detection, we can obtain, as for the color model, its inner region $r$ and the corresponding bounding box set $\Upsilon = \{\Pi : l_{i,j}(\Pi) \in r,\ \vartheta(\Pi) = s_t\}$ in frame $t+1$. Again as for the color model, our goal is to calculate the response scores of all bounding boxes in $\Upsilon$. Before calculating these scores, however, we must obtain the edge response and then the edge groups of the search area $d_{t+1}$ with the method of Edge Boxes. Examples of the edge response and edge groups are shown in Figure 4b,c, respectively. Specific calculations can be found in Edge Boxes. After all the edge groups $g_i$ have been calculated, the object score $S_o(\Pi)$ of each bounding box $\Pi$ can, following Edge Boxes, be expressed as:
$$S_o(\Pi) = \frac{\sum_i \varsigma_i m_i - \sum_{r \in b_{in}} m_r}{2 (b_w + b_h)^{\kappa_o}} \qquad (10)$$
where $b_w$ and $b_h$ are the width and height of the bounding box $\Pi$, respectively; $b_{in}$ is a central region of $\Pi$; $r$ is an edge (corresponding to a pixel) with edge magnitude $m_r$, and $m_i$ is the sum of the edge magnitudes $m_r$ of all edges $r$ in $g_i$; $\varsigma_i \in [0, 1]$ is a continuous value indicating the probability that $g_i$ belongs to the bounding box $\Pi$; and $\kappa_o$ is the penalty coefficient for the size of $\Pi$.
For more details, one can refer to the code of the Edge Boxes algorithm. We can thus calculate the object score of each bounding box in $\Upsilon$ by sliding the bounding box. This yields another score matrix of the inner region $r$, the object response $y_{t+1}^o = [S_o(i, j)]$, which has the same size $\hat{h}_p \times \hat{w}_p$ as the inner region $r$ and is shown as the heat map in Figure 4d. Here $S_o(i, j)$, computed with Equation (10), is the object score of the bounding box in $\Upsilon$ that has position $(i, j)$ in region $r$ and size $s_t$.

Final Response Scores Calculation of MMT
Given a search area $d_{t+1}$ of size $\hat{c} \times \hat{l}$ and the fixed target size $s_t$ of the previous frame, we can obtain, as for the color and object models, its inner region $r$ and the corresponding bounding box set $\Upsilon = \{\Pi : l_{i,j}(\Pi) \in r,\ \vartheta(\Pi) = s_t\}$ in frame $t+1$. From the above, we obtain three scores: the color score $S_c(\Pi)$, the filter score $S_f(\Pi)$, and the object score $S_o(\Pi)$, computed by the color model, filter model, and object model, respectively. Each score gives, under a different model, the likelihood that a bounding box $\Pi \in \Upsilon$ (of fixed size $s_t$, centered at a position in the inner region $r$) is the target. Since the three response scores all lie between 0 and 1 (1 for the object and 0 otherwise), their magnitudes are compatible. Therefore, we follow the Staple algorithm and fuse the three response scores by linear weighting. According to Equation (2), the final fused score of each bounding box $\Pi \in \Upsilon$ is calculated by weighting $S_c(\Pi)$, $S_o(\Pi)$, and $S_f(\Pi)$ as:
$$S(\Pi) = \tau_f S_f(\Pi) + \tau_c S_c(\Pi) + \tau_o S_o(\Pi) \qquad (11)$$
where $\tau_c$ and $\tau_o$ are the merge coefficients of the color model and object model scores, set to 0.2 and 0.25 in the experiments, respectively, and $\tau_f$ is the merge coefficient of the filter model score. Specific parameters of the experiment are presented below. When the fused scores of all the bounding boxes centered in the inner region $r$ have been computed, they form the fused score matrix $y_{t+1} = [S(i, j)]$, where $S(i, j)$, computed with Equation (11), is the fused score of the bounding box in $\Upsilon$ that has position $(i, j)$ in region $r$ and size $s_t$. After selecting the target bounding box $p_{t+1}$ with the largest score in the final response $y_{t+1}$, the target scale is calculated by the scale filter.
The translation tracking part of the MMT algorithm, which incorporates the object model into Staple, is shown in Figure 5 (learning procedure) and Figure 6 (evaluation procedure).
Figure 5. Learning procedure. In the $t$-th frame, a training patch $T_{d,t}$ is extracted at the location of $p_t$ to update the numerator and denominator of the translation filter with Equation (5). The target's foreground area $O$ and background area $B$ are obtained from $T_{d,t}$ and are used to calculate the proportion vectors $\rho_t(O)$ and $\rho_t(B)$ with Equation (7), respectively. Then, the score (weight) $\beta_t^j$ of each index feature is updated with Equation (8).
From Equation (2), $p_t$ denotes the estimated position of the target in the $t$-th frame and $p_{t+1}$ denotes the predicted position of the target in the $(t+1)$-th frame. Given the search patch $d_{t+1}$, which is extracted around $p_t$ (with a magnification relative to the target bounding box $p_t$), and the size $s_t$ of the target in the previous frame, we can obtain the inner region $r \subset d_{t+1}$ and the box set $\Upsilon$, which is our search sample space. The purpose of translation tracking with the multi-complementary model is to choose the new target bounding box $p_{t+1}$ with the maximum fused score from the box set $\Upsilon$. Therefore, once the final response $y_{t+1}$, which gives the fused scores of all the boxes in $\Upsilon$, has been calculated with Equation (11), $p_{t+1}$ is estimated at the peak of $y_{t+1}$. When the new position has been obtained, the scale of the target is computed by the 1D correlation filter. Specific details about scale estimation can be found in the code of Staple. The responses $y_{t+1}^f$, $y_{t+1}^c$, and $y_{t+1}^o$ are computed by the filter model, color model, and object model, respectively. They are combined into the final response $y_{t+1}$ with Equation (11), and their calculation is explained below.
(1) Filter-related. In the $(t+1)$-th frame, the search patch $d_{t+1}$ is extracted around $p_t$ (with a magnification relative to the target $p_t$). HOG feature maps are extracted from $d_{t+1}$ and correlated with the translation filter $R_c$ through Equation (6) to calculate the filter response of $d_{t+1}$. Because the inner region $r \subset d_{t+1}$, the filter response $y_{t+1}^f$ of the inner region $r$ can be obtained by cropping the filter response (score matrix) of $d_{t+1}$. $y_{t+1}^f$ is a matrix whose elements are the scores of all the bounding boxes (of size $s_t$, $\vartheta(\Pi) = s_t$) centered at the different positions in the inner region $r$.
(2) Color-related. In the $(t+1)$-th frame, the per-pixel scores $S_{t+1,\beta}$, which give the scores of the pixels at the different positions of $d_{t+1}$, are obtained by looking up the table of weights $\beta_t^j$. Then the color response $y_{t+1}^c$, which gives the color scores of all the bounding boxes in the set $\Upsilon$, is obtained with Equation (9).
(3) Object-related. In the $(t+1)$-th frame, we calculate the edge response and then the edge groups of the search area $d_{t+1}$ with the method of Edge Boxes [18]. Then, the object response $y_{t+1}^o$, which gives the object scores of all the bounding boxes in the set $\Upsilon$, is calculated with Equation (10).
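The fusion of the three responses and the choice of the new position can be sketched as below. The sketch assumes that the three response maps have already been cropped to the inner region and share one shape. The default `tau_f = 1 - tau_c - tau_o` is an assumption modeled on the convex combination used by Staple; the text gives only the values of $\tau_c$ and $\tau_o$. The function names are hypothetical.

```python
import numpy as np

def fuse_responses(y_f, y_c, y_o, tau_c=0.2, tau_o=0.25, tau_f=None):
    """Linear fusion of the filter, color, and object responses over the inner region."""
    if tau_f is None:
        # Assumption: the three weights sum to 1 (not stated in the text).
        tau_f = 1.0 - tau_c - tau_o
    return tau_f * y_f + tau_c * y_c + tau_o * y_o

def best_position(y):
    """Index (i, j) in the inner region of the box with the largest fused score."""
    return np.unravel_index(np.argmax(y), y.shape)
```

The returned index is relative to the inner region, so it still has to be converted to image coordinates by adding the offset of the inner region inside the search patch.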

Scale Calculation Optimization
Once the target bounding box $p_{t+1}$, which denotes the predicted position of the target in frame $t+1$, has been obtained by translation tracking, the object scale can be calculated using the one-dimensional scale filter proposed in [15]. The scale sizes are selected as follows:
$$\kappa^n w_{t-1} \times \kappa^n h_{t-1}, \qquad n \in \left\{ \left\lfloor -\frac{\nu - 1}{2} \right\rfloor, \ldots, \left\lfloor \frac{\nu - 1}{2} \right\rfloor \right\} \qquad (12)$$
where $w_{t-1}$ and $h_{t-1}$ are the width and height of the object in the previous frame, respectively; $\kappa$ is the scale factor; and $\nu$ is the number of scales. We follow Staple and set $\kappa$ and $\nu$ to 1.02 and 33, respectively. With the classical scale calculation method in [15] (which is also used in Staple), the HOG feature maps of 33 scaled image patches must be calculated both during training and during testing, which is computationally expensive. In frame $t$, the feature maps for scale testing and scale training are all extracted at the same coordinates, which are obtained from the translation tracking of frame $t$. Scale testing extracts the features of 33 scaled image patches relative to the target scale of the $(t-1)$-th frame, whereas scale training extracts the features of 33 scaled image patches relative to the target scale of the $t$-th frame, and the scale change between two consecutive frames is usually small or even zero. If the scale changes by $n$ levels between the two frames, the image features of $33 - n$ patches are calculated twice, which is a significant waste. In this paper, we therefore reuse, during the scale update, the features of the scaled image patches that were already computed during scale testing. This optimization greatly improves the execution speed of the tracker.
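The reuse of scale features can be sketched with a small cache keyed by the absolute scale exponent. This is a sketch under assumptions rather than the authors' code: it assumes the estimated scale always moves by an integer number of steps of the scale factor, so that the patches for testing and training at the same position coincide whenever their exponents coincide. `extract` is a hypothetical callable that crops and describes the patch at a given exponent, and the class and function names are illustrative.

```python
def scale_exponents(center, nu=33):
    """Exponents n of the scale factor for the nu scales around a center exponent."""
    half = (nu - 1) // 2
    return [center + n for n in range(-half, half + 1)]

class ScaleFeatureCache:
    """Caches per-scale features computed at one target position."""

    def __init__(self, extract):
        self.extract = extract  # hypothetical extractor: exponent -> feature
        self.cache = {}
        self.calls = 0          # number of real feature extractions

    def features(self, exponents):
        out = []
        for e in exponents:
            if e not in self.cache:
                self.cache[e] = self.extract(e)
                self.calls += 1
            out.append(self.cache[e])
        return out

    def reset(self):
        """Call when the target position changes, since cached patches no longer apply."""
        self.cache.clear()
```

For example, if scale testing uses the exponents centered at 0 and the estimated scale moves by 2 levels, scale training centered at 2 needs only 2 new extractions instead of 33.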

Multi-Complementary Model for Long-Term Tracking
With the MMT tracker and the newly proposed detection method, we constructed the MMLT (Multi-Complementary Model for Long-term Tracking) tracker. In the following, we introduce the online detection module in detail. In Section 3.1, we describe the proposed online detector, which produces the candidate bounding boxes. In Section 3.2, we present how we evaluate the confidence score of each candidate bounding box. In Section 3.3, we present how we obtain the redetected target and how we decide whether to use it to reinitialize the tracker. In Section 3.4, we describe how the MMT tracker and the online detection module work together in MMLT and introduce the high-confidence update mechanism of the detection module.

The Online Detector
A detection module is necessary for a long-term tracking method to redetect the target when tracking fails because of long-term occlusion or the target leaving the view. However, learning a classifier online and using it to search with a sliding window has high time complexity. Unlike previous works [26,30], in which an online classifier needs to be trained, in this paper we combine an object detection method based on object contours [18] with the color detection method [25] to generate a score prediction response over the search area and redetect the target $p_{t+1}$. In this way, the diversity and discriminative power of the sample information can be exploited to a greater extent. By integrating the two prediction scores, we form a fused prediction response score for an object with greater confidence, resulting in better detection. This greatly improves the generalizability of the tracker.

Suppose we are given a detection area, which is extracted around $p_t$ (with a magnification relative to the target $p_t$) and has a size of $a \times c$, and the fixed target size $s_t$ obtained from the previous frame. As in the tracking module, we can obtain its inner region $r$ and the corresponding bounding box set $\Upsilon = \{\Pi : l(\Pi) \in r,\ \vartheta(\Pi) = s_t\}$. Our goal is again to calculate the response scores of all bounding boxes in $\Upsilon$ with the color and object model parameters.
For the object detection model, following the method in Section 2.3, the edge response and then the edge groups of $q_{o,t+1}$ are computed. Then, we obtain the object response matrix
$$\begin{bmatrix} S_o(1, 1) & \cdots & S_o(1, w_p) \\ \vdots & \ddots & \vdots \\ S_o(h_p, 1) & \cdots & S_o(h_p, w_p) \end{bmatrix}$$
with Equation (10), which gives the object scores of all the bounding boxes centered in the inner region $r$. At the same time, from the per-pixel scores, the color response matrix
$$\begin{bmatrix} S_c(1, 1) & \cdots & S_c(1, w_p) \\ \vdots & \ddots & \vdots \\ S_c(h_p, 1) & \cdots & S_c(h_p, w_p) \end{bmatrix}$$
which gives the color scores of the bounding boxes (of size $s_t$) centered at the different positions in the inner region $r$, is computed efficiently through integral images (Equation (9)).
After we obtain the predicted response scores of different models, the approach of fusing the predictions of the multi-complementary detection model is as follows.
Let us assume that the prediction results of the different models are independent. Let $\Pi \in \Upsilon$ be a box chosen from the box set $\Upsilon$, and let $\upsilon \in \{-, +\}$ denote a foreground-background label. $M_\theta$ denotes the parameters of model $\theta$, which depend on the previous object state and the previous frame. The likelihood that the candidate bounding box $\Pi$ belongs to the object under the model parameters $M_\theta$ can be expressed as $p(\Pi = +; M_\theta)$, $\theta = 1, \ldots, n$, where $n$ is the total number of models. Since the models are assumed to be independent of each other, we have the following decomposition:
$$p(\Pi = +; M_1, \ldots, M_n) = \prod_{\theta = 1}^{n} p(\Pi = +; M_\theta) \qquad (13)$$
Here we have two independent sets of model parameters: the color model parameters $M_0$ and the object model parameters $M_1$. According to the color model parameters $M_0$, the probability that the candidate bounding box $\Pi$ is the object can be calculated by
$$p(\Pi = +; M_0) = S_c(\Pi) \qquad (14)$$
Likewise, according to the object detection model parameters $M_1$, the probability that the candidate bounding box $\Pi$ is the object can be calculated by
$$p(\Pi = +; M_1) = S_o(\Pi) \qquad (15)$$
Therefore, according to Equation (13), the predictions of the independent models are fused into the score of each bounding box $\Pi$ as:
$$p(\Pi = +; M_0, M_1) = S_c(\Pi)\, S_o(\Pi) \qquad (16)$$
Once the scores of all the bounding boxes in $\Upsilon$ have been obtained, it follows from Equation (16) that the final response matrix $d_{t+1}$ is:
$$d_{t+1} = \left[ S_c(i, j) \right] \odot \left[ S_o(i, j) \right], \qquad i = 1, \ldots, h_p,\ j = 1, \ldots, w_p \qquad (17)$$
where $\odot$ denotes element-wise multiplication. $d_{t+1}$ contains the final response scores of all the bounding boxes $\Pi$ (of size $s_t$, $\vartheta(\Pi) = s_t$) centered at the different positions in the inner region $r$.
In general, the probability that Π is the object is large only when both the color score and the object detection score of Π are high. If Π gets a low score from either model, it is considered unlikely to be the target. Therefore, in this paper, the color score and the object detection score are merged by multiplication to obtain more reliable detection results.
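The multiplicative fusion described above can be illustrated with a minimal NumPy sketch. The function name `fuse_responses` and the toy score values are assumptions made for illustration only; the paper's actual score maps come from Equations (9) and (10).

```python
import numpy as np

def fuse_responses(color_scores, object_scores):
    """Fuse two per-position score maps by element-wise multiplication.

    Both inputs are hypothetical stand-ins for the color response and the
    object response over the inner region r; they must share one shape.
    """
    color_scores = np.asarray(color_scores, dtype=float)
    object_scores = np.asarray(object_scores, dtype=float)
    if color_scores.shape != object_scores.shape:
        raise ValueError("score maps must have the same shape")
    return color_scores * object_scores

# Toy example: a position only wins if BOTH models score it highly.
color = np.array([[0.9, 0.2],
                  [0.8, 0.1]])
obj = np.array([[0.1, 0.9],
                [0.7, 0.2]])
fused = fuse_responses(color, obj)
peak = np.unravel_index(np.argmax(fused), fused.shape)
```

In this toy case, position (0, 0) has the highest color score and position (0, 1) the highest object score, yet the fused peak falls on (1, 0), the only position that both models rate highly.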
Having calculated the final response d t+1 of the inner region r (in detection area X t+1 ), the peaks of d t+1 can be used to obtain the possible object locations. The purpose of the detector is then to detect the top-w (w = 10) most confident bounding boxes from Υ (corresponding to the inner region r). First, we add the bounding box l max centered at the highest peak (with the largest response score M s in the final response d t+1 ) to the candidate bounding box set T. When the ratio between the response score of a bounding box centered at another peak and M s is greater than a threshold ξ, the corresponding bounding box is also added to T. Similar to CCT, the bounding box l 0 centered at the position in the previous frame is also added to T as a candidate bounding box for evaluation. At the same time, we limit the total number of candidate bounding boxes so that it does not exceed w = 10. Finally, we obtain the candidate bounding box set T = {l 0 , l 1 , l 2 , · · · , l k }.
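The candidate selection described above can be sketched as follows. This is an illustrative reading, not the authors' implementation: the 8-neighbour definition of a peak, the function names, and the representation of a bounding box by its center position are all assumptions, and the response is assumed to have a positive maximum so that the ratio test is meaningful.

```python
import numpy as np

def local_peaks(response):
    """Positions strictly greater than all 8 neighbours (assumed peak rule)."""
    response = np.asarray(response, dtype=float)
    h, w = response.shape
    padded = np.pad(response, 1, mode="constant", constant_values=-np.inf)
    is_peak = np.ones((h, w), dtype=bool)
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            if dy == 0 and dx == 0:
                continue
            neighbour = padded[1 + dy:1 + dy + h, 1 + dx:1 + dx + w]
            is_peak &= response > neighbour
    return [(int(r), int(c)) for r, c in np.argwhere(is_peak)]

def select_candidates(response, prev_pos, xi, w=10):
    """Build the candidate set T = {l_0, l_1, ..., l_k} of box centers.

    l_0 is the previous-frame position; the strongest peak is always added;
    other peaks are added when score / M_s > xi; at most w candidates.
    """
    response = np.asarray(response, dtype=float)
    peaks = sorted(local_peaks(response), key=lambda p: response[p], reverse=True)
    candidates = [tuple(prev_pos)]  # l_0
    if not peaks:
        return candidates
    m_s = response[peaks[0]]  # largest response score M_s (assumed > 0)
    for rank, p in enumerate(peaks):
        if len(candidates) >= w:
            break
        if p in candidates:
            continue
        if rank == 0 or response[p] / m_s > xi:
            candidates.append(p)
    return candidates

# Toy response with three isolated peaks of height 1.0, 0.8 and 0.3.
resp = np.zeros((5, 5))
resp[1, 1], resp[3, 3], resp[1, 3] = 1.0, 0.8, 0.3
T = select_candidates(resp, prev_pos=(2, 2), xi=0.5)
```

With ξ = 0.5 the weakest peak (ratio 0.3) is discarded, so the set contains the previous position followed by the two strong peaks.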

Candidate Bounding Box Evaluation
After the detection module detects the candidate bounding box set T, a robust mechanism is needed to measure the confidence score of each candidate bounding box l i ∈ T. To measure the confidence score of each bounding box effectively, we follow the Collaborative Correlation Tracker (CCT) [31] and consider not only the target area corresponding to the bounding box, but also the information of its background area. That is, for each candidate bounding box l i , we extract the image region sample EB-patch l i , which is centered at the location of l i and has the same magnification relative to the candidate bounding box as the translation filter. The EB-patch l i is evaluated by a well-trained confidence filter R c (similar to the translation filter) to obtain the confidence score s i of the corresponding candidate bounding box l i . First, we obtain the candidate EB-patch set S = {l 0 , l 1 , l 2 · · · l k } corresponding to T = {l 0 , l 1 , l 2 · · · l k }. For each EB-patch l i ∈ S, i ∈ {0, 1, · · · , k}, its HOG feature map J l i is calculated and convolved with the confidence filter R c in Equation (6) to obtain the confidence filter response ŷ i .
We then use the maximum value s i = max(ŷ i ) of ŷ i as the confidence score of the candidate bounding box l i . Finally, the candidate confidence scores S = { s 0 , s 1 , s 2 · · · s k } are obtained. The calculation of R c is given in Section 3.3.

Redetected Result Decision
When the candidate confidence scores S are obtained, the index of the final redetected target can be calculated as $i = \arg\max_{i \in \{0, \ldots, k\}} s_i$. Then, the bounding box l i is the final redetected target p t+1 . When i ≠ 0 and s i is higher than a certain threshold χ, p t+1 is accepted, and is then used to initialize the tracker. When i = 0 or s i is lower than χ, we consider that p t+1 is not correct, and it is not accepted.
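A minimal sketch of this decision rule is given below. It assumes that index 0 always refers to the previous-position box l 0 and that a redetection is accepted only when a different candidate wins with a score above χ; the function name and the example scores are hypothetical.

```python
def redetection_decision(scores, chi):
    """Return the index of the accepted candidate, or None if rejected.

    scores[0] is assumed to belong to l_0 (the previous-frame position);
    a redetection is accepted only for i != 0 with scores[i] > chi.
    """
    if not scores:
        return None
    best = max(range(len(scores)), key=lambda i: scores[i])
    if best != 0 and scores[best] > chi:
        return best
    return None

accepted = redetection_decision([0.3, 0.7, 0.5], chi=0.6)
rejected_prev = redetection_decision([0.8, 0.7], chi=0.6)
rejected_low = redetection_decision([0.2, 0.5], chi=0.6)
```

Returning `None` in the rejected cases lets the caller keep the current tracker state instead of reinitializing it.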

Multi-Complementary Model for Long-Term Tracking
Clearly, a robust long-term tracking algorithm requires a re-detection module in case of tracking failure. Similar to LCT, we use the threshold τ m as the activation confidence to activate the detector. First, MMT performs the target tracking process in each frame. When the activation confidence max(y h t ) < τ m , the detector is activated to redetect the target, where y h t is the filter response of the translation tracking, which is computed in Equation (6). In the detection process, the detector first computes the possible candidate bounding box set T = {l 0 , l 1 , l 2 · · · l k }, and then the candidate EB-patch set S = {l 0 , l 1 , l 2 · · · l k } corresponding to set T. Then, the confidence score

Performance Evaluation
We carried out our experiments on the OTB-13 [33] and OTB-15 [34] benchmark datasets. All of the video sets with challenging factors were used in three experiments to evaluate performance: one-pass evaluation (OPE), temporal robustness evaluation (TRE), and spatial robustness evaluation (SRE). OPE is the traditional method of evaluation: the tracker begins tracking at the true position in the initial frame, and the precision and success rate (SR) are calculated. As pointed out by [33], the traditional one-pass evaluation cannot fully reflect the robustness of a tracker, and sometimes even a small disturbance can lead to very different tracking results. The TRE and SRE are different: the TRE randomizes the start frame and runs the tracker on the rest of the sequence, and the SRE verifies the performance of a tracker by tracking an object after shifting and scaling the initial object box. All three kinds of evaluation report performance through precision and success plots, which indicate the percentage of frames that the tracker tracks successfully under different thresholds.
In this section, we first evaluate MMLT with the improvements from the online detector and multi-complementary model on OTB-13. Then, we compare MMLT with nine of the most related and state-of-the-art trackers on the OTB-13 and OTB-15 benchmark datasets. Finally, we analyze the effect of different merge parameters on the tracking performance. All of the tracking results used the reported results to ensure a fair comparison.
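To make the threshold-based plots concrete, the following sketch computes a success curve and a precision value from per-frame results. The threshold grid (0 to 1 in steps of 0.05) and the 20-pixel precision threshold follow common OTB practice and are assumptions here, as are the function names and the toy per-frame values.

```python
import numpy as np

def success_curve(overlaps, thresholds=None):
    """Fraction of frames whose overlap exceeds each threshold."""
    overlaps = np.asarray(overlaps, dtype=float)
    if thresholds is None:
        thresholds = np.linspace(0.0, 1.0, 21)  # assumed OTB-style grid
    rates = np.array([(overlaps > t).mean() for t in thresholds])
    return thresholds, rates

def precision_at(center_errors, pixel_threshold=20.0):
    """Fraction of frames whose center location error is within the threshold."""
    center_errors = np.asarray(center_errors, dtype=float)
    return float((center_errors <= pixel_threshold).mean())

# Toy per-frame results for a four-frame sequence.
frame_overlaps = [0.9, 0.6, 0.3, 0.0]
frame_errors = [5.0, 25.0, 10.0, 40.0]

_, rates = success_curve(frame_overlaps, thresholds=[0.0, 0.5])
_, full_rates = success_curve(frame_overlaps)
auc = float(full_rates.mean())  # area under the success curve as a summary
precision = precision_at(frame_errors)
```

For TRE and SRE, the same per-frame measures would be averaged over runs started from different frames or from perturbed initial boxes, respectively.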

Experimental Configuration and Parameter Settings
To evaluate the performance and efficiency of the proposed algorithm, our tracker was implemented in MATLAB on a machine with a Core i7 4.0 GHz CPU and 8 GB RAM. MMLT runs faster than 50 frames per second (FPS). The color model used a 32-bit RGB channel color model. HOG features were selected as the features of interest, the cell size was set to four, and the number of gradient orientation bins was nine. As in [25], the image patch for the translation filter was normalized to a fixed area, which was set to 150 × 150.
Following Staple, we searched only in the area around the previous position for both translation and scale in the tracking module, and adopted the translation filter R C and the scale filter for tracking, using a Hann window during the search as well as for training. Also following Staple, we normalized the translation search patch to a fixed area, and weighted the extracted feature channels of the target and context patch by a cosine window. Thirty-three scales with a scale factor of 1.02 were used in the scale model. All search patches extracted around the previous position included both the target and the surrounding context. In addition, we adopted the confidence filter R C in the evaluation of the candidate detection EB-patches, using a Hann window during confidence evaluation as well as for training, and normalized the EB-patch to a fixed area. The specific parameters are listed in Table 1. More detailed parameter settings of the Staple algorithm can be found in the Staple code.
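As an illustration of two of these settings, the sketch below builds a symmetric set of 33 scale factors with step 1.02 and a 2-D Hann window. The symmetric exponent layout is the conventional DSST-style choice and is an assumption here; it does not reproduce the optimized scale calculation proposed in this paper, and the function names are hypothetical.

```python
import numpy as np

def scale_factors(num_scales=33, step=1.02):
    """Scale factors step**n for n symmetric around 0 (assumed layout)."""
    exponents = np.arange(num_scales) - (num_scales - 1) / 2
    return step ** exponents

def hann_window_2d(height, width):
    """Separable 2-D Hann (cosine) window used to weight a feature patch."""
    return np.outer(np.hanning(height), np.hanning(width))

factors = scale_factors()
window = hann_window_2d(5, 5)

# Weighting a (hypothetical) single-channel feature patch before filtering.
patch = np.ones((5, 5))
weighted_patch = patch * window
```

The window suppresses the patch borders, which reduces boundary effects when the correlation is computed in the Fourier domain.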

Analysis of MMLT Improvement
To validate the contribution of the object model to tracking and the effectiveness of the detection module, we also compared several baselines of MMLT on all 51 videos of OTB-13. MMLT-NN in Table 2 is an accelerated version of the Staple algorithm that incorporates our scale optimization. MMT denotes the tracker that has the object model in translation tracking but no online detector module. MMLT-N1 denotes the tracker that has an online detector module but no object model in translation tracking. The tracking performance on the OTB-13 benchmark datasets is shown in Table 2. To measure the execution efficiency of the tracking algorithms more fully, the tracking speed listed for each tracker in Table 2 is the average over the 51 videos in the OTB-13 benchmark datasets. It is worth noting that the Staple results in Table 2 were obtained with the source code provided by the authors and the parameters given in Staple [25]. As seen in Table 2, MMLT's tracking accuracy and tracking success rate were superior to those of all other tracker versions in all three metrics of OPE, TRE, and SRE. The average tracking accuracy and the average success rate of MMLT improved by more than 10% over Staple, while real-time tracking at 50 FPS was still achieved. Staple performed much worse than the other tracker versions in this paper since only the filter model and the color model are integrated in its translation tracking stage and it has no detection model. As shown in Table 2, MMLT-NN reached a tracking speed of 135 FPS, a large speed-up over Staple due to the optimization of the scale calculation.
MMT had an average improvement of 6% in tracking precision and of over 7% in tracking success rate when compared with Staple, due to the inclusion of the object detection model in the translational tracking phase. The MMLT-N1 tracker also performed well due to the detection mechanism added for the case of tracking failure: compared with Staple, it improved the tracking precision and tracking success rate by 6% and 7%, respectively. Table 2 also shows that the tracking speed of MMLT-N1 was lower than that of MMLT, which adopts both the target detection model and the object model in translation tracking. Without the object model in translation tracking, the tracking robustness of MMLT-N1 was relatively low and the tracker was prone to drift, so MMLT-N1 needed more re-detection processes and its tracking speed decreased. This indirectly reflects the importance of integrating the object detection model in the translation tracking phase. In summary, the experimental results show that both the multi-complementary model proposed for the translation phase and the proposed target re-detection algorithm provided a significant improvement in the performance of the algorithm.

MMLT Experiment
We compared our algorithm with state-of-the-art methods including SRDCF (Spatially Regularized Discriminative Correlation Filter Tracker [7]), DeepSRDCF (Spatially Regularized Discriminative Correlation Filter Tracker Based on Deep Features [7]), DSST (Discriminative Scale Space Tracker [15]), MEEM (Multiple Experts Using Entropy Minimization tracker [21]), Staple (Sum of Template and Pixel-wise Learners tracker [25]), LCT (Long-term Correlation Tracking [30]), LMCF (Large Margin Object Tracking with Circulant Feature Maps [32]), ECO-HC (Efficient Convolution Operators for Tracking based on HOG and CN Features [35]), SAMF (Scale Adaptive with Multiple Features tracker [36]), and DLSSVM (Dual Linear SSVM [37]). The tracking success rate of the top 10 trackers was evaluated on all 51 videos in the OTB-13 [33] and all 100 videos in the OTB-15 [34] benchmark datasets. These results are shown in Figure 8. As shown, MMLT performed well across the OPE, TRE, and SRE indicators. When the Staple algorithm was first proposed, it performed very well in comparison to other algorithms. Our approach achieved a 9.7% improvement on the success plots of OPE, a 7.8% improvement on the success plots of TRE, and an 8% improvement on the success plots of SRE over Staple on OTB-13, and showed a similar improvement over Staple on the OTB-15 dataset. We emphasize that our approach also ran at a high speed of 50 FPS. In addition, MMLT performed as well as DeepSRDCF, which uses deep features. However, DeepSRDCF tracked at less than 1 FPS, while MMLT tracked at approximately 50 times that speed. At the same time, while using the same detection mechanism as the LCT algorithm, MMLT achieved an average improvement of 10% on the success plots. Our approach also had an average improvement of 1% over the ECO-HC algorithm, which likewise does not use deep features but has excellent tracking performance.
Furthermore, it should be stressed that we compared the tracking performance using the published data, whereas the average FPS of each tracker in the legend is the average tracking speed over the 100 videos in the OTB-15 benchmark datasets, measured on our computer. In summary, the MMLT algorithm performed well against the listed trackers, both in tracking performance and in tracking speed.

Attribute Based Evaluation
The video set provided in [33] is annotated with 11 attributes, such as object deformation and occlusion, which allow for further analysis of tracker performance. Figure 9 shows the results for the eight most challenging video attributes in OTB-13 [33]. As shown, the DeepSRDCF algorithm demonstrated strong overall performance against the existing tracking algorithms that do not use deep features, i.e., on in-plane rotation (59.6%), out-of-plane rotation (63.0%), and scale variation (62.8%). The MMLT algorithm achieved a success rate of 60.2%, 64.1%, and 61.9% on in-plane rotation, out-of-plane rotation, and scale variation, respectively. ECO-HC achieved a success rate of 59.5%, 65.6%, and 63.9% on illumination variation, occlusion, and out of view, respectively, while MMLT achieved 62.3%, 64.3%, and 64.0%, respectively. LMCF offered a 62.5% success rate on background clutter, which was matched by MMLT. LCT offered a 66.8% success rate on deformation, while MMLT performed well with a success rate of 67.5%.

Figure 9. The success plots of eight challenging attributes including background clutter, deformation, illumination variation, in-plane rotation, occlusion, out-of-plane rotation, out-of-view, and scale variation. The legend illustrates the ranking scores for each tracker. The proposed MMLT algorithm has five attributes ranked first, and three attributes ranked second.

Qualitative Comparison
We compared our proposed tracking algorithm (MMLT) with six other state-of-the-art trackers, namely Staple (Sum of Template and Pixel-wise Learners tracker [25]), MEEM (Multiple Experts Using Entropy Minimization tracker [21]), LCT (Long-term Correlation Tracking [30]), TLD (Tracking-Learning-Detection [26]), Struck (Structured Output Tracking with Kernels [38]), and KCF (Kernelized Correlation Filters [5]), on ten challenging sequences (Figure 10). The Staple algorithm takes advantage of complementary sample information by fusing the predictions of the filter model and the color model, and therefore handled significant deformation and fast motion well (Tiger2 and Deer). However, it drifted when the target objects underwent heavy illumination variation, occlusion, and background clutter (Shaking, Lemming, and Couple). As the color model is susceptible to changes in light and motion blur, even with the filter model as a complementary model, the tracker's robustness was still low under severe lighting changes and similar background colors; moreover, under serious occlusion, the tracking results were also prone to drift, and Staple did not re-detect targets in the case of tracking failure (Couple and Jogging-2). The MEEM tracker selected the best prediction among models collected from the past according to the minimum entropy criterion, but still did not perform well in the presence of heavy occlusion (Walking2) or of both significant scale change and fast motion (Carscale). The Struck tracker did not perform well under background clutter (Shaking), fast motion (Deer), heavy occlusion, or out-of-view (Walking2, Tiger2, and Jogging-2). The KCF tracker is based on a correlation filter learned from HOG features, so it drifted when the target objects underwent heavy occlusion (Lemming) and motion blur (Tiger2).
In addition, the KCF tracker failed to handle background clutter (Shaking) since it is difficult to achieve robust tracking with a single feature classifier model in complex scenes. When tracking failed, the TLD tracker could re-detect the target object. However, the TLD approach did not take full advantage of temporal motion cues and therefore did not follow targets undergoing significant deformation and fast motion (Tiger2 and Shaking) well. Moreover, the TLD method updates its detector frame-by-frame, leading to drifting. Overall, the proposed MMLT tracker performed well in estimating both the scales and positions of target objects on these challenging sequences, which can be attributed to three reasons. First, our tracker effectively combined three separate models, each dealing with different features and complementing the others, and thus took full advantage of the diversity of sample information. Therefore, it tracks better under target deformation, illumination change, motion blur, and similar conditions. Second, our confidence filter was updated only when the confidence level was high, which restrained template drift in the detection module to a certain extent, and the object detection model incorporated into the tracker also reduced the problem of template drift to a certain extent. Finally, we added a detection algorithm based on the color model and the object detection model, which could quickly re-detect the target after a tracking failure.
Figure 10. Qualitative results of our MMLT algorithm, Staple [25], MEEM [21], LCT [30], TLD [26], Struck [38], and the KCF [5] methods on ten challenging sequences. The targets in these sequences underwent heavy occlusion, motion blur, illumination variation, scale variation, and background clutter, respectively.
In addition, the center location error and the average overlap rate were used to evaluate the proposed tracker. The average center location error is the average of the center location errors over all the video sequences, where the center location error is the distance between the center of the predicted bounding box R t and that of the ground truth G t , and can be expressed by the criterion $\mathrm{CLE} = \lVert L_t - L_{g,t} \rVert_2$, where L t and L g,t represent the locations of R t and G t , respectively. The average overlap rate is the average overlap between the bounding box R t and the ground truth G t over all the video sequences, where the overlap is given by the criterion overlap = |R t ∩ G t |/|R t ∪ G t |, with ∩ denoting the intersection and ∪ the union. The average center location error and average overlap rate of the proposed algorithm on the ten sequences are shown in Tables 3 and 4, respectively, which show the good tracking performance of our proposed tracker against the other trackers. The best result and the next best result are marked in red and blue, respectively, in Tables 3 and 4. Moreover, we report the center location errors frame-by-frame on the ten sequences in Figure 11, which shows that our tracking algorithm performed well against the existing trackers.
Table 3. Average center location errors of the proposed method compared to other trackers (pixels).

Columns: Ours, Staple, MEEM, LCT, TLD, Struck, KCF.
Figure 11. Frame-by-frame comparison of the center location errors (in pixels) on the ten challenging sequences in Figure 10. Based on the experimental results, our algorithm was able to track targets accurately and stably.
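The two evaluation criteria defined above, the center location error and the overlap rate, can be computed per frame as in the following sketch. The (x, y, w, h) box convention with (x, y) as the top-left corner is an assumption made for illustration, as are the function names.

```python
import math

def center_location_error(pred_center, gt_center):
    """Euclidean distance between the predicted and ground-truth centers."""
    return math.hypot(pred_center[0] - gt_center[0],
                      pred_center[1] - gt_center[1])

def overlap(box_a, box_b):
    """Intersection over union of two boxes given as (x, y, w, h)."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

cle = center_location_error((0.0, 0.0), (3.0, 4.0))
iou_partial = overlap((0, 0, 2, 2), (1, 1, 2, 2))
iou_same = overlap((0, 0, 2, 2), (0, 0, 2, 2))
iou_disjoint = overlap((0, 0, 1, 1), (5, 5, 1, 1))
```

Averaging these per-frame values over a sequence gives the quantities reported in tables such as Tables 3 and 4.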

Analysis of Merge Parameter
As shown in Figure 12, we randomly selected 25 sequences with different attributes from the OTB-13 benchmark datasets [33] and tested the effect of different merge parameters τ c and τ o on the TRE success rate by the cross-intersection method. Figure 12 shows that tracking with the color regression model, the filter regression model, and the object response model together performed better than tracking with only the filter model and the color model, or with the filter model alone. The tracker performed best when the weighting coefficients of the three models reached the proper ratio. Figure 12 also shows that the tracking success rate remained relatively high in a region around the optimal parameters; that is, the performance of the tracker was stable under small parameter changes. Therefore, the fusion strategy based on multiple models is a good way to improve the performance of trackers. Clearly, the best performance was achieved at 0.

Conclusions
In this paper, by incorporating the object model based on contour information into the translational tracking of Staple (which combines a correlation filter and a color model), tracking robustness is significantly improved. Each model is responsible for tracking specific features, and the three complementary models are then combined to form a more robust tracking model. At the same time, we design a target detection method that combines an object detection model based on contour features with a color model based on histogram features, and that performs well in detection efficiency and detection accuracy when compared with traditional classifier-based detection methods. In addition, we optimize the traditional scale calculation method. The experimental results show that the proposed algorithm offers favorable improvements in efficiency, accuracy, and robustness.
