Alleviate Similar Object in Visual Tracking via Online Learning Interference-Target Spatial Structure

Correlation Filter (CF) based trackers have demonstrated superior performance to many complex scenes in smart and autonomous systems, but similar object interference is still a challenge. When the target is occluded by a similar object, they not only have similar appearance feature but also are in same surrounding context. Existing CF tracking models only consider the target’s appearance information and its surrounding context, and have insufficient discrimination to address the problem. We propose an approach that integrates interference-target spatial structure (ITSS) constraints into existing CF model to alleviate similar object interference. Our approach manages a dynamic graph of ITSS online, and jointly learns the target appearance model, similar object appearance model and the spatial structure between them to improve the discrimination between the target and a similar object. Experimental results on large benchmark datasets OTB-2013 and OTB-2015 show that the proposed approach achieves state-of-the-art performance.


Introduction
Research interest in visual object tracking comes from the fact that it is widely used in smart and autonomous systems, e.g., anomaly detection, smart video compression, and driver intelligent assistance systems. The main challenges of visual tracking are how the tracker can online adapt to the large appearance variations including occlusion, abrupt motion, deformation, illumination variations, in plane rotation, out of plane rotation, similar object interference, etc.
To get a more general adaptive tracker, researchers have proposed many tracking approaches using various visual representations. These approaches, which are mainly focused on the research of the object appearance model, can be divided into two categories: generative models and discriminative models. Generative models [1][2][3][4][5] search the closest description in model space as the target observation to estimate target state. These models adopt an appearance model to describe the target appearance state without considering the background information of the target effectively. Therefore, they have low discrimination when scene is complex. Compared with generative models, discriminative models [6][7][8] have better discrimination and generalization ability. These models formulate object tracking as a binary classification problem that does not estimate target specific location directly, and their accuracy is limited by the number of candidate tests.
Recently, Correlation Filter (CF) based trackers have gained much attention in visual tracking, and exhibited outstanding performance in both speed and accuracy. These trackers can be performed effectively by employing all circular shifts of the positive sample in the Fourier domain, and overcome the shortcomings of traditional discriminative model using densely-sampled samples. Bolme et al. [9] (1) We propose to add ITSS constraint into existing CF tracking model for alleviating similar object interference. The proposed approach jointly learns the target appearance model, similar object appearance model and the spatial structure between the target and similar objects. (2) We propose to introduce interference degree weight into the model, which makes our approach switch between the baseline model and the constraint model (integrating the baseline model and ITSS constraint) dynamically. (3) During the model update, instead of only updating the target appearance model, we collaboratively update the target model and the interference object model. Moreover, we propose combining the target model and the interference model for re-detecting the lost target when the target is almost totally occluded by a similar object.
The rest of this paper is organized as follows. Section 2 presents the proposed method. Simulation results are presented in Section 3, and conclusions are drawn in Section 4.

Proposed Approach
We found that the spatial structure constructed by similar object and target is very useful information that can help to distinguish target from similar objects effectively. When the target and a similar object overlap, we can effectively identify the occlusion relationship between the target and a similar object by their spatial structure. Once the spatial structure is obtained, the model update can be controlled reasonably to alleviate occlusion of similar target. Thus, we propose utilizing online learning interference-target spatial structure constraint for alleviating similar object occlusion. Figure 1 shows the flowchart of the proposed approach. We divide the approach into three stages: tracking based on ITSS model, update appearance model and re-detection the target. The similar object detected will be regarded as the potential interference object which is used to construct interference-target spatial structure graph. The edges in the graph reflect the spatial structure between the target and each similar object, and can be used to determine whether the target overlaps with a similar object. The graph's nodes describe the target's and similar object's appearance models. If there exists overlapping between the target and a similar object, we will use our occlusion analysis strategy to determine whether the target is occluded by the similar object. Then, we accord the result of the occlusion analysis to update models including the target model and similar object model. If the target is severely or totally occluded, the flow will be into re-detection stage to detect the lost target. If all the similar objects are outside the warning area in a frame, the process will switch to the baseline tracker. similar object overlap, we can effectively identify the occlusion relationship between the target and a similar object by their spatial structure. Once the spatial structure is obtained, the model update can be controlled reasonably to alleviate occlusion of similar target. Thus, we propose utilizing online learning interference-target spatial structure constraint for alleviating similar object occlusion. Figure 1 shows the flowchart of the proposed approach. We divide the approach into three stages: tracking based on ITSS model, update appearance model and re-detection the target. The similar object detected will be regarded as the potential interference object which is used to construct interference-target spatial structure graph. The edges in the graph reflect the spatial structure between the target and each similar object, and can be used to determine whether the target overlaps with a similar object. The graph's nodes describe the target's and similar object's appearance models. If there exists overlapping between the target and a similar object, we will use our occlusion analysis strategy to determine whether the target is occluded by the similar object. Then, we accord the result of the occlusion analysis to update models including the target model and similar object model. If the target is severely or totally occluded, the flow will be into re-detection stage to detect the lost target. If all the similar objects are outside the warning area in a frame, the process will switch to the baseline tracker.

Detecting Potential Interference
The interference from a similar object is gradually formed when similar objects get closer to the target. Therefore, we assume that there exists a warning area where similar object is considered as the potential interference. The warning area is divided into eight overlapping search regions. Each search region has a partial overlap with the target, but the size of the overlapping area is controlled less than half the size of the target. Similar targets are detected simultaneously by parallel computing in the eight search regions. The spatial layout has several advantages. First, the maximum response of the detector will be not on the target when similar object has attended in the search area. Second, each search area is formulated conveniently by the location and size of the target, and the eight search areas can be proportionally synchronized when the target size changes.
Usually, the similar object and target will deform during tracking, and the degree of the deformation is random. To overcome the problem, we consider two types of features, one is sensitive to deformation and the other is insensitive. Inspired by Staple tracker [16], we integrate a global color histogram and HOG features to train detector for detecting similar object. Our method is different from Staple tracker as follows. First, the detector model only models the target appearance

Detecting Potential Interference
The interference from a similar object is gradually formed when similar objects get closer to the target. Therefore, we assume that there exists a warning area where similar object is considered as the potential interference. The warning area is divided into eight overlapping search regions. Each search region has a partial overlap with the target, but the size of the overlapping area is controlled less than half the size of the target. Similar targets are detected simultaneously by parallel computing in the eight search regions. The spatial layout has several advantages. First, the maximum response of the detector will be not on the target when similar object has attended in the search area. Second, each search area is formulated conveniently by the location and size of the target, and the eight search areas can be proportionally synchronized when the target size changes.
Usually, the similar object and target will deform during tracking, and the degree of the deformation is random. To overcome the problem, we consider two types of features, one is sensitive to deformation and the other is insensitive. Inspired by Staple tracker [16], we integrate a global color histogram and HOG features to train detector for detecting similar object. Our method is different from Staple tracker as follows. First, the detector model only models the target appearance using correlation filters without including any surrounding context. Second, our method is not a simple combination of two types of features by a fixed weight but dividing the detection process into coarse detection and fine detection. This strategy is beneficial since similar object detection is different from target tracking. Target tracking assumes that the target has existed in the search region. The assumption is invalid in object detection, because if the object is not in the search region, the location of maximal correlation response is not the location of the object.
Specifically, during coarse detection, we take the location of maximal correlation response as the center of a similar object. The correlation response is obtained by the correlation between each search region and the detector trained by HOG features extracted from the estimated target. The confidence map constructed from the correlation response reflects the structural similarity between the estimated target and the testing sample from the search area. The second stage is called fine detection that gains reliable similar objects and eliminates unreliable ones from potential candidate similar objects. In this stage, the color statistical feature is extracted by a color statistical model trained by current estimated target and surrounding context. For more details of the color statistical model training, we refer to [17]. The difference is that we only extract the color feature in each potential target region which does not include any surrounding context. Then, we estimate the similarity between the target and each potential similar object by comparing the similarity of their color histogram.

Interference-Target Spatial Structure Constraint
Zhang [26] proposed tracking multi-target by preserving structural model. Motivated by this approach, we dynamically maintain the spatial structure between the target and a similar object for online supervising them. The purpose is to distinguish the similar object and the target when they occlude each other. Different from [26], we estimate the score of each object appearance (including similar objects and the target) by CF model which takes advantage of surrounding context to improve discrimination efficiently. Moreover, we introduce the weight constraint into the model to reflect the differences of the interference degree from each similar object.

The Score Function of Spatial Structure Model
Interference-target Spatial Structure (ITSS) can be viewed as a topological graph figuratively. We define a graph G = (V, E). Node set V is constructed by the target appearance and all similar object appearance. The edges set E include all connection between the target and each similar object. The edges of E can be viewed as springs that constraint between the target and each similar object. The length of the edge indicates the relative distance between the target and each similar object. Meanwhile, the thickness of the edge means the strength of a similar object interferes with the target. Now, our aim is to define a score function for describing the structure of graph G and design an optimization method to obtain the structural parameters when the score function reaches a maximum value.
The score function of the configuration C t = {B t 0 , B t 1 , . . . , B t |V−1| ,} is defined as The variable c j and h j represent row and column circular shifts from the bounding box of j-th similar object in frame j , H t j } denotes the bounding box of corresponding circular shifts from j-th similar object in frame I t . W t j , H t j and vector P t c j, h j denote the width, the high and the center pixel coordinate of bounding box B t c j, h j , respectively. The vector Φ 0 and Φ s denote the feature of the target and similar object. Specifically, when there are no similar objects around the target, Φ 0 is corresponding to the feature used in base tracker. Contrarily, if there are similar objects around the target, Φ 0 and Φ s are both HOG feature. Moreover, the vector Ψ j (p t j , p t 0 ) = p t j − p t 0 , which is a 2D vector including two attributes of size and direction, denotes the relative location between the j-th similar object and the target. The parameter Θ is denoted as Θ = {α 0 , α 1 , . . . , α |V|−1 , β 1 , . . . , α |V|−1 ,}, and |V| is node count in G. We use the distance between each similar object and the target to represent the degree of interference. The closer the distance is, the severer the interference is. This relationship is represented as T, which is a constant determined by the size of the searching region and the position relative to a target, is the max-distance between the similar object and the target.
These weight parameters introduced into Equation (1) have two important functions. (1) They control the importance of the corresponding similar object and reflect the interference degree of the similar object to the target. The greater the weight is, the more serious the disturbance is. (2) These weights also maintain a fixed function structure of Equation (1). We do not need to change the form of Equation (1) when the number of similar targets in the warning area changes.

Online Learning for Structured Prediction
We train the parameter Θ by minimizing the mixture loss function L(Θ; C). The score of configuration Equation (1) is rewritten into two components including appearance and deformation cost. We define L(Θ; C) as Here, a is the cost of appearance component and a consists of two parts, the target appearance loss A 0 and the loss A j of similar object appearance The structure labels y c j ,h j and y c 0 ,h 0 are computed by two-dimensional Gaussian function. The function b is the deformation cost constraint which captures the special structure between the similar object and the target. Mathematically, this amount is described as Here, S(C; I, β) is deformation score function described as Herein, the task-loss ∆(C,Ĉ) is defined based on the amount of overlap between the correct configuration C and the incorrect configurationĈ.
After the transformation above, the problem is how to optimize Equation (3) for learning parameter Θ. Two components α and β from Θ are divided into the first term and the second term of Equation (3). Therefore, we can optimize α and β alternatively by Equations (4) and Equation (7). First, as far as α is concerned, Equation (4) is a linear combination of Equations (5) and (6) which are both conventional correlation filtering operation. We can optimize them by kernel trick and circulant matrix. Conversely, when α is fixed, we can get the solution of β by optimizing Equation (7). Thus, calculating parameter β is transformed to a structured SVM problem. Equation (7) is the maximum of a set of affine functions and does not contain quadratic terms. Thus, it is a convex function. We use a similar method with Pegasos-based algorithm [27] to optimize the problem.

Target Detection with Spatial Constraints
The detection task is to find a configuration with the best score over all possible configurations which maximizes Equation (1) when the model parameter Θ is given in new frame. We first construct a minimum spanning tree model to build the graph G which determines the edge joint between the target and each similar object. The root of the tree is set to the estimated target. Then, we search for a set of connections to construct the tree. Once the structure of the graph is obtained, we find the best score by dynamic programming Equation (10).
All parameters in Equation (10) have the same physical meaning as in Equation (1). The score of the best configuration from this recursive form corresponds to the max score of Equation (1), e.g., T(Ĉ) = (Ĉ; 0). The algorithm automatically finds a best configuration for the target and every similar object.

Update Model
Our model update method is controlled by the occlusion estimation, and the model is updated when there is no occlusion. In our approach, the occlusion estimation is divided into two steps: occlusion detection and occlusion identification.

Occlusion Detection
We dynamically maintain a topological graph, which consists of similar objects and the target, by the method proposed in Section 2.2. The relative positive vector between the target and each similar object is equivalent to provide us with an identifier that identifies the target and the similar object. We estimate the occlusion between B i and B 0 by the overlap between them when w t i = max{ w t j j = 1, 2, . . . , |V| − 1 }. The overlap is estimated by Equation (11).
By the Equation (11), we can compute the positional relation of abscissa and ordinate between B i and B 0 , respectively. We use f (x) to represent the positional relation of abscissa between the two image areas. In this case, P t i (x) − P t 0 (x) 2 2 represents the distance of abscissa between center point P t i and P t 0 . B t i (x) + B t 0 (x) represents the sum of the width of B i and B 0 . Likewise, we utilize f (y) to represent the positional relation of ordinate between B i and B 0 .

Occlusion Identification
If f (x) < 0 and f (y) < 0, there exists overlap between B i and B 0 . We need to figure out whether B 0 is occluded by B i or not. The occluded object has more significant changes in appearance feature than the not occluded object. Hence, corresponding confidence response map will show more dramatic changes. To reduce the interference information from surrounding context, we only estimate the confidence response map corresponding to B i and B 0 , in which background information is not included. We use the combination [19] of the average peak-to-correlation energy E i and E 0 and maximum response score of the response map R i,max and R 0,max to estimate the change of both the confidence maps.
We compare R i , R 0 , E i and E 0 in the current frame with their respective historical average values R ha i , R ha 0 , E ha i and E ha 0 . These four parameters are estimated before the overlap between B i and B 0 , which are considered as the feedback from non-polluting tracking results.
Otherwise it is contrary. Next, we can update the appearance model of the target and the similar object according to the result of the occlusion detection. We optimize these models by kernel trick and circulant matrix, so our model update formula is defined as Equation (13).
Here, represents the Fourier Transform. The superscript d corresponds to a dimensional component of HOG feature. The subscript i is the appearance model identifier, e.g., i = 0 represents object appearance model. η is a learning rate fixed to 0.075, and t denotes the t-th frame. The bar-represents complex conjugation.

Target Re-Detection
When the target is almost completely occluded by a similar object, the distance between the target and the occlusion object Ψ( i , 0 ) 2 2 will become very small. In this case, ITSS constraint will lose its effectiveness. To avoid the target being erroneously positioned on the occlusion object, we collaboratively employ the target appearance model and similar object appearance model to re-detect the lost target. Specifically, the re-detection strategy is divided into three parts: detection model, startup condition and termination condition.

Detection Model
When f (x) = 0 and f (y) = 0, B i and B 0 are both in the edge position of their overlapping and non-overlapping area. Currently, B i and B 0 are believed to be reliable and free from contamination from each other. We save the appearance model D 0 trained by B 0 . D i and D 0 will be used to detect the B 0 , when B 0 is lost for the occlusion of B i . Specifically, our approach tracks B i and updates its model online, and detects B 0 in surrounding context of B i .

Startup Condition
We activate the re-detection mechanism when the target is lost. In our tracking framework, the spatial structure constraint can alleviate the part occlusion. However, when B 0 is almost completely occluded by B i , the max values of the two response maps from tracking B i and B 0 will point to nearly the same location in B i , and the spatial structure constraint becomes disabled. We define this criterion according to Equation (14).
When Equation (14) is established, we need to activate re-detection strategy to detect the lost target.

Termination Condition
If R 0 > k R R history 0 and E 0 > k E E history 0 , we consider the detection result reliable and then terminate detection and switch task to baseline tracker. Here, k R and k E are predefined ratios. In this paper, the two ratios are 0.65 and 0.42, respectively.

Experiments
Our experimental observations are reported from two aspects: special performance evaluation and comprehensive performance evaluation. In special performance evaluation, we evaluate the proposed tracking algorithm and other six state-of-the-art tracking algorithms on nine challenging sequences including occlusion from a similar object to demonstrate the effectiveness of the proposed approach for alleviating similar object interference. In comprehensive performance evaluation, the goal is to demonstrate the comprehensive performance of the proposed algorithms in various challenges including serious occlusion, noise disturbance, non-rigid shape deformation, out-of-plane object rotation, pose variation, similar object interference, etc.
To achieve a latest comparison, we report the results of several recent trackers by adding these trackers to the testing framework of visual tracking benchmark. In experiments, each tracker uses the source code provided by the author. The parameter settings from authors are kept for all the test sequences. All testing sequences are from the OTB-2015.

Baseline Tracker
The Staple tracker combines template and histogram scores to alleviate deformation, and significantly outperforms many state-of-the-art trackers in comprehensive performance. However, color histogram weakens structural information while alleviates deformation, which leads to the poor performance of Staple tracker when there exists similar object in the scenes. Thus, we choose the Staple tracking algorithms as the baseline tracker. The new tracker integrates the basic tracker and the proposed ITSS.

Specific Performance Evaluation
To fully reflect the merits of our approach, we evaluate the specific performance of our approach from two aspects: qualitative evaluation and quantitative evaluation. Figure 2 reports the qualitative evaluation results divided into four columns from left to right in each row. The first column is the initial frame where the target is marked by red solid rectangle. The second column is a frame before occlusion occurs. The third column is the frame where there is occlusion between the target and similar objects. The fourth column indicates that the occlusion is removed. The qualitative evaluation is carried out by comparing these four columns in each sequence. We compare the proposed approach with six state-of-the-art trackers: CSK [10], DAT [17], KCF [11], Staple [16], MOSSE [9] and DSST [12]. To demonstrate the effectiveness of the proposed algorithm, we select nine typical sequences. These sequences contain some objects that are not only similar in color but also similar in structure to the target, and there is severe occlusion between the target and the similar object.

Qualitative Evaluation
(1) Basketball, Girl and Walking2: in the Basketball, Girl and Walking2 sequences, target and occlusion object have similar appearance feature including structure and color. Staple uses color statistical feature to alleviate deformation, so the tracking result is easy to drift to occlusion object when the occlusion object has similar color feature with the target. Although the DAT algorithm also uses color statistical feature to resist deformation, but it does not consider the structural information of the target. It shows a poor performance on this video. Our tracker integrates ITSS to the Staple, which alleviates similar object interference, can resist deformation. In contrast, the Staple tracker does not obtain the spatial structure information of between the similar object and the target, and do not distinguish accurately the target and similar targets. Rest trackers rely too much on structural information and lead to drift when the target undergoes severe deformation. (2) Football and Girl: in the Football sequence and the Girl sequence, the target is almost fully occluded by nearly the similar object. Thus, these trackers relying on structural feature mistakenly estimate the target location to the interference. In this case, our ITSS model is powerless because the structural features that the model relies on disappear. However, our tracker introduces a re-detection strategy which can effectively distinguish the target and nearly similar objects when the target appears again, then reconstruct the interference-target spatial structure. The experiment demonstrates that it is necessary to introduce the re-detection into our tracking framework to alleviate almost full occlusion. (3) Shaking1: the target and interference have similar structure and color in the Shaking1 sequence.
Compared to our algorithm, KCF and DSST exhibit varying degrees of drift because they have no ability to resist deformation. The Staple tracker the model, which can resist deformation by color statistical features, have no enough prior information for distinguishing target from similar objects. Our tracker combines effectively advantages of ITSS and Staple. When similarity interference occurs, using ITSS distinguishes the target and the interference. When the interference is removed, our algorithm switches to Staple. Therefore, our tracker can resist similarity interference as well as deformation. (4) Coupon: the Coupon sequence is different from other sequences, where the target occludes the similar object. In addition, the target and the interference are obviously different in structural feature, but similar in color statistical feature in the sequence. Thus, except DAT, all other algorithms track the target robustly. Our tracking model not only gains the color feature of target appearance but also obtains its structure feature; therefore, our tracker gains satisfactory performance on this type of sequence. (5) Liquor: there exist multiple structures similar targets in the Liquor sequence. In addition, the target is occluded repeatedly by different similar objects. Only our approach and Staple keep correct tracking throughout the tracking process. Staple utilizes the color difference between the target and other objects to distinguish them, while our approach uses ITSS constraint to discriminate the target from similar objects. (6) Bolt2 and Deer: These two sequences are different from other sequences. The target is not occluded by a similar object, but the similar object is occluded by it. When the similar object gradually approaches the target, some tracker track the similar target incorrectly, e.g., DSST, while our tracker can use the learned ITSS constraint to correctly differentiate target and interference.
Overall, our approach provides promising results compared to the six other trackers on these sequences including similar object interference.
Sensors 2017, 17, 2382 9 of 16 similar object and the target, and do not distinguish accurately the target and similar targets. Rest trackers rely too much on structural information and lead to drift when the target undergoes severe deformation. (2) Football and Girl: in the Football sequence and the Girl sequence, the target is almost fully occluded by nearly the similar object. Thus, these trackers relying on structural feature mistakenly estimate the target location to the interference. In this case, our ITSS model is powerless because the structural features that the model relies on disappear. However, our tracker introduces a re-detection strategy which can effectively distinguish the target and nearly similar objects when the target appears again, then reconstruct the interference-target spatial structure. The experiment demonstrates that it is necessary to introduce the re-detection into our tracking framework to alleviate almost full occlusion. (3) Shaking1: the target and interference have similar structure and color in the Shaking1 sequence.
Compared to our algorithm, KCF and DSST exhibit varying degrees of drift because they have no ability to resist deformation. The Staple tracker the model, which can resist deformation by color statistical features, have no enough prior information for distinguishing target from similar objects. Our tracker combines effectively advantages of ITSS and Staple. When similarity interference occurs, using ITSS distinguishes the target and the interference. When the interference is removed, our algorithm switches to Staple. Therefore, our tracker can resist similarity interference as well as deformation. (4) Coupon: the Coupon sequence is different from other sequences, where the target occludes the similar object. In addition, the target and the interference are obviously different in structural feature, but similar in color statistical feature in the sequence. Thus, except DAT, all other algorithms track the target robustly. Our tracking model not only gains the color feature of target appearance but also obtains its structure feature; therefore, our tracker gains satisfactory performance on this type of sequence. (5) Liquor: there exist multiple structures similar targets in the Liquor sequence. In addition, the target is occluded repeatedly by different similar objects. Only our approach and Staple keep correct tracking throughout the tracking process. Staple utilizes the color difference between the target and other objects to distinguish them, while our approach uses ITSS constraint to discriminate the target from similar objects. (6) Bolt2 and Deer: These two sequences are different from other sequences. The target is not occluded by a similar object, but the similar object is occluded by it. When the similar object gradually approaches the target, some tracker track the similar target incorrectly, e.g., DSST, while our tracker can use the learned ITSS constraint to correctly differentiate target and interference.
Overall, our approach provides promising results compared to the six other trackers on these sequences including similar object interference.

Quantitative Evaluation
In addition, we provide a quantitative comparison of our tracker and the six trackers on the nine challenging sequences mentioned above. Our quantitative evaluations with fame-by-frame

Quantitative Evaluation
In addition, we provide a quantitative comparison of our tracker and the six trackers on the nine challenging sequences mentioned above. Our quantitative evaluations with fame-by-frame comparison are two metrics: center location errors (CLE) and overlap ratio (OR). CLE is measured by the Euclidean distance between the ground-truth center location and the estimated target center location. OR is described as OR = S(B T ∩ B GT )/S(B T ∪ B GT ), where B GT and B T are the ground truth bounding box and the tracking bounding box, respectively, and S(B) is a function used to calculate the area of the bounding box B. We compare the CLE and VOR frame-by-frame on the nine sequences in Figures 3 and 4, respectively. Generally, our method is comparable to the best performer on the sequence Basketball, Skating1, Football, Coupon, Girl, Liquor, Bolt2, Deer and Walking2. In particular, on the sequence Basketball and Skating1, our tracker slightly drifts in the 200th and 230th frames, respectively, due to in plane rotation of the target and deformation of the nearest similar object appear simultaneously. comparison are two metrics: center location errors (CLE) and overlap ratio (OR). CLE is measured by the Euclidean distance between the ground-truth center location and the estimated target center location. OR is described as OR = S( ∩ )/S( ∪ ), where and are the ground truth bounding box and the tracking bounding box, respectively, and S(B) is a function used to calculate the area of the bounding box B. We compare the CLE and VOR frame-by-frame on the nine sequences in Figures 3 and 4, respectively. Generally, our method is comparable to the best performer on the sequence Basketball, Skating1, Football, Coupon, Girl, Liquor, Bolt2, Deer and Walking2. In particular, on the sequence Basketball and Skating1, our tracker slightly drifts in the 200th and 230th frames, respectively, due to in plane rotation of the target and deformation of the nearest similar object appear simultaneously.

Ours
KCF Staple CSK DAT DSST MOSSE Figure 3. Frame-by-frame comparison of 7 trackers on 9 video sequences in terms of average center location errors (in pixels) in descending order. The horizontal axis represents the frame number, and the vertical axis represents the center location error. The smaller is the center location error, the better is the tracking accuracy.
Commented replaced with  Tables 1 and 2 report the comparison of the average CLE and the average OR of seven trackers in the nine sequences mentioned above. Our approach achieves the best results in six of the nine sequences in average CLE, and achieves the best performance in seven of the nine sequences in the average OR. These results demonstrate that our approach obtains superior performance than the six other trackers in alleviating similar object interference. In Coupon and Basketball sequences, our tracker gains the second and the third biggest, respectively, which is very close to the best ones. The overall results present that the proposed approach significantly outperforms the others in average CLE and average OR.  Tables 1 and 2 report the comparison of the average CLE and the average OR of seven trackers in the nine sequences mentioned above. Our approach achieves the best results in six of the nine sequences in average CLE, and achieves the best performance in seven of the nine sequences in the average OR. These results demonstrate that our approach obtains superior performance than the six other trackers in alleviating similar object interference. In Coupon and Basketball sequences, our tracker gains the second and the third biggest, respectively, which is very close to the best ones. The overall results present that the proposed approach significantly outperforms the others in average CLE and average OR.  Table 3 shows the time consumption. In the experiment, the main time consumption of our approach is embodied in three aspects: the Staple tracking model (basic tracker), the similarity target detection and the ITSS model. In the Staple tracking model, the main factor affecting the computation is the parameter of the HOG feature and the color histogram feature. In practice, the cell size of HOG features is set into 4 × 4 pixels. We set the area of the samples to 4 2 times of the target area. Samples are multiplied by a Hann window. Then, samples are normalized to a fixed size by a 50 × 50 square, so that the fps is unaffected by the height and width of video sequence. Moreover, bin color histogram is set into 32 × 32 × 32. In the similarity target detection, time consumption of this part is similar with Staple model. In this part, the parameters of HOG feature and color histogram feature are similar to the parameters in Staple model. In the ITSS model, the time consumption is related to the number of nodes in the graph. We set that the max number of nodes in the graph to 8 and the fps to 24 when only running ITSS model. In practice, the number of nodes is controlled by w in Equation (1), and is usually less than 8. Moreover, our approach is switched dynamically between the baseline model and the constraint model by weight w in Equation (1). Thus, the time consumption of the approach should be between the Staple and the ITSS. We run our approach approximately for 35 frames per second on a computer with an Intel I7-4790 CPU (3.6 GHz) and 4 GB RAM. Therefore, our algorithm satisfies the real-time applications.
We present the results of OTB-2013 and OTB-2015 in Figures 5 and 6, respectively. Among these methods, our approach performs well with overall success (64.5% on OTB-2013 and 59.1% on OTB-2015) and precision plots (85.7% on OTB-2013 and 80.4% on OTB-2015). Moreover, our approach achieves higher performance than Staple tracker on these two datasets. Specifically, the OSR of the proposed approach is 1.5% and 0.8% higher than Staple tracker on these two datasets, respectively. The DP of our approach is 2.4% and 1.3% higher than Staple tracker on these two datasets, respectively. Because our approach integrates the advantages of Staple tracker and ITSS constraint, it would implicitly mitigate the weaknesses of HOG and color features, and resist similar object interference at the same time. In other words, the overall evaluation proves that our approach improves the performance (alleviating interference from similar object) of Staple algorithm and does not weaken its existing advantages. approach achieves higher performance than Staple tracker on these two datasets. Specifically, the OSR of the proposed approach is 1.5% and 0.8% higher than Staple tracker on these two datasets, respectively. The DP of our approach is 2.4% and 1.3% higher than Staple tracker on these two datasets, respectively. Because our approach integrates the advantages of Staple tracker and ITSS constraint, it would implicitly mitigate the weaknesses of HOG and color features, and resist similar object interference at the same time. In other words, the overall evaluation proves that our approach improves the performance (alleviating interference from similar object) of Staple algorithm and does not weaken its existing advantages.  Comparing Figure 5 with Figure 6, the performances of all trackers in Figure 5, including success plots of OPE and precision plots of OPE, are better than those in Figure 6, because those sequences reflecting performance of every tracker makes up a smaller proportion in OTB-2015 than in OTB-2013. Similarly, our algorithm also performs better in dataset OTB-2013 than in dataset OTB-2015.

Conclusions
In this paper, we propose a generic approach for alleviating similar object interference. The approach integrates the interference-target spatial structure (ITSS) constraint to existing CF tracking algorithm for improving the robustness of the algorithm when the target is severely occluded by a similar object. When a similar object exists around the target, the proposed ITSS constraint is online learning by using a minimum spanning tree model to optimize the structured SVM, which provides an effective strategy to identify the target and similar object when there is occlusion between them. Moreover, when the target is almost completely occluded by a similar object, we combine the target model and the interference model for re-detecting the missing target. The experimental results demonstrate that the proposed algorithm performs favorably against the state-of-the-art methods.  approach achieves higher performance than Staple tracker on these two datasets. Specifically, the OSR of the proposed approach is 1.5% and 0.8% higher than Staple tracker on these two datasets, respectively. The DP of our approach is 2.4% and 1.3% higher than Staple tracker on these two datasets, respectively. Because our approach integrates the advantages of Staple tracker and ITSS constraint, it would implicitly mitigate the weaknesses of HOG and color features, and resist similar object interference at the same time. In other words, the overall evaluation proves that our approach improves the performance (alleviating interference from similar object) of Staple algorithm and does not weaken its existing advantages.  Comparing Figure 5 with Figure 6, the performances of all trackers in Figure 5, including success plots of OPE and precision plots of OPE, are better than those in Figure 6, because those sequences reflecting performance of every tracker makes up a smaller proportion in OTB-2015 than in OTB-2013. Similarly, our algorithm also performs better in dataset OTB-2013 than in dataset OTB-2015.

Conclusions
In this paper, we propose a generic approach for alleviating similar object interference. The approach integrates the interference-target spatial structure (ITSS) constraint to existing CF tracking algorithm for improving the robustness of the algorithm when the target is severely occluded by a similar object. When a similar object exists around the target, the proposed ITSS constraint is online learning by using a minimum spanning tree model to optimize the structured SVM, which provides an effective strategy to identify the target and similar object when there is occlusion between them. Moreover, when the target is almost completely occluded by a similar object, we combine the target model and the interference model for re-detecting the missing target. The experimental results demonstrate that the proposed algorithm performs favorably against the state-of-the-art methods.  Comparing Figure 5 with Figure 6, the performances of all trackers in Figure 5, including success plots of OPE and precision plots of OPE, are better than those in Figure 6, because those sequences reflecting performance of every tracker makes up a smaller proportion in OTB-2015 than in OTB-2013. Similarly, our algorithm also performs better in dataset OTB-2013 than in dataset OTB-2015.

Conclusions
In this paper, we propose a generic approach for alleviating similar object interference. The approach integrates the interference-target spatial structure (ITSS) constraint to existing CF tracking algorithm for improving the robustness of the algorithm when the target is severely occluded by a similar object. When a similar object exists around the target, the proposed ITSS constraint is online learning by using a minimum spanning tree model to optimize the structured SVM, which provides an effective strategy to identify the target and similar object when there is occlusion between them.
Moreover, when the target is almost completely occluded by a similar object, we combine the target model and the interference model for re-detecting the missing target. The experimental results demonstrate that the proposed algorithm performs favorably against the state-of-the-art methods.