The Kernel Based Multiple Instances Learning Algorithm for Object Tracking

Abstract: To realize real-time object tracking in complex environments, a kernel-based MIL (KMIL) algorithm is proposed. KMIL employs the Gaussian kernel function to replace the inner product used in the weighted MIL (WMIL) algorithm. The method avoids computing the positive and negative likelihoods many times, which results in a much faster tracker. To track objects with different motion, the search area for cropping instances is varied according to the object's size. Furthermore, an adaptive classifier updating strategy is presented to handle occlusion, pose variations, and illumination changes. A similarity score range is defined with respect to two given thresholds and the similarity score from the second frame. The learning rate of the classifier is then switched between two values, depending on whether the similarity score in the current frame falls inside or outside this range. Finally, we compare its performance with that of state-of-the-art algorithms on several classical videos. The experimental results show that the presented KMIL algorithm is faster and is robust to partial occlusion, pose variations, and illumination changes.


Introduction
Object tracking is a fundamental task in fields such as surveillance, robotics, and human-computer interaction. Researchers have recently proposed many successful algorithms. However, the problem remains challenging due to factors such as occlusion, appearance variations, abrupt motion, and illumination changes [1].
Existing object tracking algorithms can be broadly classified into two groups: generative methods and discriminative methods [2]. A generative method learns an object model in the first frame and searches for the area with the most similar appearance in the successive frames [3]. The MS tracker [4], IVT tracker [5], and VTD tracker [6] are well-known generative object tracking algorithms.
A discriminative tracking method learns and updates a binary classifier through online training. Tracking-by-detection algorithms stem directly from the discriminative methods: they train a classifier online and find the most likely location among many candidate image patches as tracking evolves [7][8][9][10][11]. The Online-Boosting tracker updates a classifier by treating the tracked result as the positive sample [10]. However, it often fails when the tracked results drift from the real object location. An improved algorithm (the semi-supervised tracker) was then proposed by Grabner [11]. This algorithm labels the positive samples only in the first frame. However, the ambiguity problem still exists in the tracking process. Zhang [3] describes a real-time CT tracker that uses compressed features to train a Bayes classifier. In recent years, research has focused mainly on real-time object tracking algorithms. Correlation filter (CF) based tracking algorithms have been proposed and provide excellent performance. These algorithms include MOSSE [12], CSK [13], KCF [14], DCF [15], CN [16], DAT [17], Staple [18], and DSST [19]. These trackers are computationally efficient, which is especially useful for real-time applications. The KCF tracker [14] generates samples by applying a circulant matrix, which speeds up the matrix computations. As a result, the algorithm runs at hundreds of FPS [14]. However, it often suffers from drift due to occlusion, illumination changes, and pose variations. Deep learning has also been widely studied in computer vision, including object tracking. Trackers such as DeepSRDCF [20], GOTURN [21], C-COT [22], and ECO [23] learn a network model from large amounts of data. Their experimental results show that the learned convolutional features have a powerful representation ability.
However, training the networks is time consuming because the networks are complex. To realize real-time object tracking, a deep learning algorithm normally runs on a GPU, where it can achieve about 100 FPS (e.g., the GOTURN tracker). On a CPU only, however, GOTURN operates at 2.7 FPS [21]. To overcome the ambiguity problem in the online-boosting algorithm, MIL-related trackers were proposed, which learn a strong classifier from multiple instances in positive and negative bags [24,25]. The WMIL tracker [25] runs at about 14 FPS on a single CPU. Moreover, the WMIL tracker outperforms the CF-related trackers in handling occlusion, illumination variations, and pose changes.
In this paper, we propose a kernel-based MIL (KMIL) object tracking algorithm. To further reduce the computational cost, the Gaussian kernel function is used in place of the inner product employed in the WMIL algorithm. The WMIL algorithm often fails to track objects that move at different speeds. To deal with this problem, the search area for cropping instances is varied according to the object's size. Finally, an adaptive classifier updating strategy is presented. The similarity score of the tracking result in the second frame is stored as a reference. Two thresholds are then defined, and a range is obtained with respect to the reference. As tracking evolves, the updating rate of the classifier is adjusted to suit the appearance changes whenever the maximum similarity score of the samples in the current frame is out of the range. Lastly, the proposed algorithm is compared with state-of-the-art algorithms.
The paper is organized as follows. Section 1 is the introduction. The WMIL tracker is detailed in Section 2. Section 3 presents the KMIL tracker. The experimental results are shown in Section 4. Section 5 summarizes the paper.

The WMIL Tracker
Babenko [24] proposed an online MIL Boosting method for visual tracking. It detects an object by maximizing the bag likelihood function. However, it is time consuming because the bag probability and instance probability are computed many times before the most discriminative weak classifier is selected. To deal with this problem, Zhang [25] proposed an efficient online approach (the WMIL algorithm) that approximately maximizes the bag likelihood function. In this algorithm, "positive" and "negative" bags are extracted for training classifiers. The positive bag is constructed from the instances extracted within a circle centered at the object's location, while the negative bag is obtained by cropping instances from an annular region around the object's location. A strong classifier is then trained in the online boosting framework by using the positive and negative bags. In the successive frame, candidate samples are cropped around the object's position in the previous frame, and the trained classifier selects the candidate sample with the maximum score as the final result. Furthermore, to deal with occlusion, illumination changes, and pose variations, the classifier is updated by using positive and negative bags constructed from instances cropped around the tracked result in the current frame.
Let $l_t(x)$ denote the location of instance $x$ and $l_t^*$ the location of the object in the $t$-th frame. The instances for constructing a positive bag $X^+$ are cropped as $X^+ = \{x : \|l_t(x) - l_t^*\| < r\}$, where $r$ is the search radius centered at the location $l_t^*$, while the instances for constructing a negative bag $X^-$ are cropped from an annular region $X^- = \{x : r < \|l_t(x) - l_t^*\| < \beta\}$, where $r$ and $\beta$ ($r < \beta$) are the inner and outer radii of the annular region. In the online boosting framework, the algorithm trains $K$ weak classifiers $\phi = \{h_1, h_2, \cdots, h_K\}$, which are defined as:
where $v(x) = (v_1(x), \cdots, v_k(x), \cdots, v_K(x))^T$ is the feature vector of an instance and $y_i$ is the label of the bag $X_i$; $y = 1$ means the bag is positive, while $y = 0$ means the bag is negative. Then, $M$ discriminative weak classifiers are selected to generate a strong classifier:
The learned strong classifier detects the area with the maximum similarity score as the final tracking location:
After detecting the object, the weak classifiers are updated according to the new location with a constant learning rate:
where $\lambda$ is the learning rate and $\bar{\mu}$ is the feature mean over the instances in the positive bag extracted around the tracking result in the current frame.
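The bag construction described above can be illustrated with a minimal sketch. It only enumerates instance locations on a pixel grid; the function name `crop_bag_locations` and the `step` parameter are hypothetical, and a real tracker would typically also subsample the negative instances and extract features at each location.

```python
import math

def crop_bag_locations(center, r, beta, step=1):
    """Return (positive, negative) lists of instance locations.

    center: (x, y) object location l*_t in the current frame
    r:      radius of the positive circle
    beta:   outer radius of the negative annulus (requires r < beta)
    step:   grid spacing in pixels (hypothetical parameter)
    """
    if not r < beta:
        raise ValueError("expected r < beta")
    cx, cy = center
    positive, negative = [], []
    reach = int(math.ceil(beta))
    for dx in range(-reach, reach + 1, step):
        for dy in range(-reach, reach + 1, step):
            dist = math.hypot(dx, dy)
            if dist < r:
                # Inside the circle: instance goes to the positive bag.
                positive.append((cx + dx, cy + dy))
            elif r < dist < beta:
                # Inside the annulus: instance goes to the negative bag.
                negative.append((cx + dx, cy + dy))
    return positive, negative

pos, neg = crop_bag_locations(center=(100, 80), r=4, beta=8)
print(len(pos), len(neg))
```

Locations at a distance of exactly `r` belong to neither bag, which follows from the strict inequalities in the two set definitions.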

The KMIL Object Tracking System
This section details the presented KMIL object tracking algorithm illustrated in Figure 1. Positive and negative bags are extracted around the object's location. A strong classifier is then learned from the first frame by using the online boosting WMIL algorithm [25], and the object image is saved as a reference. In the successive frame, the classifier is used to detect the most similar sample as the tracking result. Finally, the classifiers are updated according to the reference frame and the current tracking result.
In the WMIL algorithm, an inner product is used for selecting the weak classifier with the most discriminative ability, which reduces the computational time by avoiding repeated computation of the bag probability and instance probability [25]. However, the inner product is still computed M times, which is itself time consuming. Recently, kernel-based approaches have been proposed for real-time object tracking [26]. Inspired by the ideas in the WMIL [25] and DLSSVM [26] algorithms, we present a kernel-based inner product method for selecting the most discriminative weak classifiers, which further reduces the computational complexity.
As tracking evolves, the sample with the maximum similarity score in the candidate area is detected by using the learned classifier. Normally, it is assumed that the object moves at a steady speed and appears near its location in the previous frame. Therefore, the candidate samples are extracted in a fixed circle around the previous tracking location. To account for the size of the object, we change this circle adaptively. After tracking an object, the weak classifiers are updated with the newly cropped samples to handle appearance changes. Normally, a constant learning rate is used, which may lead to "over-updating" or "less-updating". To handle these problems, an adaptive strategy for updating the weak classifiers is presented.

The Kernel Based MIL Tracker
We assume that there are N instances in the positive bag and L instances in the negative bag.
The tracking location in the current frame is denoted as l 10 . The positive bag and negative bag {X + , X − } are considered as the training data. Unlike the method of computing the bag probability used in the WMIL algorithm, we compute the bag probability from the probabilities of the instances it contains.
This means that every instance contributes to the bag probability with equal weight, through its own instance probability. The bag probability therefore depends mainly on the instances with higher probability. In particular, when tracking drift occurs, the instances that are near the real object but far from the center of the previous location contribute more to the bag probability.
Similar to the positive bag, the probability of the negative bag is computed from its instances with equal weights.
where h is the weak classifier in the classifier pool and H is the learned strong classifier constructed by K − 1 selected weak classifiers.
The inner product is computed as:
where $\sigma(z) = 1/(1 + e^{-z})$ is the sigmoid function and $\sigma(H(x_{ij}))$ is the probability of the $j$-th instance in the $i$-th bag. The results of the WMIL tracker [25] show that this criterion is efficient because it avoids computing the bag probability and instance probability K times before selecting a weak classifier. Therefore, it is more efficient than the log-likelihood function used in the MIL tracker [24].
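As an illustration of the instance probability $\sigma(H(x_{ij}))$ and of an equal-weight bag probability, the following sketch may help. Taking the bag probability as the plain average of the instance probabilities is an assumption made here for illustration, since the exact expression is not reproduced in this text; the function names are hypothetical.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function 1 / (1 + exp(-z))."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def instance_probabilities(scores):
    """Map strong-classifier scores H(x_ij) to instance probabilities."""
    return [sigmoid(s) for s in scores]

def bag_probability(scores):
    """Equal-weight bag probability (assumed form: mean of instance probabilities)."""
    probs = instance_probabilities(scores)
    return sum(probs) / len(probs)

# Hypothetical classifier scores for the instances of one bag.
print(bag_probability([2.0, 0.5, -1.0]))
```

With equal weights, an instance far from the previous center influences the bag probability as much as a central one with the same score, which matches the motivation given for the drift case.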
However, computing the inner product involves high-dimensional vectors, which is also time consuming. To handle this problem, we use a kernel function. The inputs $h$ and $\nabla\eta(H)$ are mapped to the feature space by $\varphi(h)$ and $\varphi(\nabla\eta(H))$, where $\varphi(\cdot)$ is the Hilbert space mapping of the inputs. The kernel is defined as $k(h, \nabla\eta(H)) = \langle \varphi(h), \varphi(\nabla\eta(H)) \rangle$. In practice, we choose to use the Gaussian kernel [26]:
where $\rho$ is the bandwidth of the Gaussian function.
As new frames arrive, candidate samples are extracted around the tracking result: $X^s = \{x : \|l_{t+1}(x) - l_t^*\| < s\}$, where $s$ ($r < s < \beta$) is the radius. Then, the learned strong classifier detects the sample with the maximum similarity score as the tracking result. In the WMIL, MIL, and CT trackers, it is assumed that the object moves around the previous tracking location, and the candidate samples are cropped in a fixed circle. These trackers have shown consistent accuracy when tracking large objects with continuous motion. However, they often fail to track a small object because of its fast motion. To deal with this problem, we vary the radius for extracting the candidate samples with respect to the target's size. The radius is set to 25 if the object is big and to 35 if the object is small.
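A minimal sketch of the two ideas in this passage follows. It assumes the common Gaussian form $\exp(-\|u - v\|^2 / (2\rho^2))$ (the exact scaling used by the authors is not shown here), represents each weak classifier by its vector of responses on the training instances, and uses a hypothetical area threshold `small_area` to decide whether an object counts as small. All function names are illustrative.

```python
import math

def gaussian_kernel(u, v, rho):
    """Gaussian kernel exp(-||u - v||^2 / (2 * rho^2)); the scaling is an assumption."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-sq_dist / (2.0 * rho ** 2))

def select_weak_classifier(pool_responses, gradient, rho):
    """Return the index of the pool entry with the largest kernel value.

    pool_responses: one response vector per weak classifier in the pool
    gradient:       vector standing in for the gradient term over the same instances
    """
    return max(
        range(len(pool_responses)),
        key=lambda k: gaussian_kernel(pool_responses[k], gradient, rho),
    )

def search_radius(width, height, small_area):
    """Radius 35 for a small object, 25 otherwise; small_area is hypothetical."""
    return 35 if width * height < small_area else 25

pool = [[0.1, 0.9, 0.2], [0.8, 0.7, 0.1], [0.5, 0.5, 0.5]]
print(select_weak_classifier(pool, gradient=[0.9, 0.6, 0.0], rho=1.0))
print(search_radius(20, 20, small_area=1000))
```

The values 25 and 35 are the radii stated in the text; how an object is classified as big or small is not specified there, so the area test is only a placeholder.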

The Classifiers Update Strategy
After detecting an object, the classifier is updated to deal with occlusion, pose variations, and illumination changes. Normally, a learning rate is set to balance the classifier parameters learned from the previous frames against the observations in the current frame [25]. We tried the updating method with a fixed learning rate and found that the experimental results are unstable when there are appearance variations. With a small learning rate, the parameters of the classifier are updated mainly with the mean and variance of the features at the newly tracked location. As a result, interference from the background is introduced into the classifier, which results in "over-updating". If the learning rate is too large, the newly tracked area hardly affects the parameters of the classifier. The classifier is then "less-updating" and cannot deal with illumination changes and pose variations.
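The effect of the learning rate on the classifier parameters can be sketched as follows. The update form below (a running mean and variance blended with weight $\lambda$ on the old parameters) is the one commonly used in the MIL and CT family of trackers and is assumed here, since the update equation itself is not reproduced in this text; `update_gaussian` is a hypothetical name.

```python
import math

def update_gaussian(mu, sigma, samples, lam):
    """Blend stored Gaussian parameters with statistics of new samples.

    mu, sigma: stored mean and standard deviation of one feature
    samples:   feature values of the instances in the new positive bag
    lam:       learning rate; lam near 1 keeps the old parameters,
               lam near 0 follows the new samples
    """
    n = len(samples)
    mean_new = sum(samples) / n
    var_new = sum((s - mean_new) ** 2 for s in samples) / n
    mu_next = lam * mu + (1.0 - lam) * mean_new
    var_next = (
        lam * sigma ** 2
        + (1.0 - lam) * var_new
        + lam * (1.0 - lam) * (mu - mean_new) ** 2
    )
    return mu_next, math.sqrt(var_next)

# Small rate: parameters move mostly to the new samples.
print(update_gaussian(0.0, 1.0, [2.0, 4.0], lam=0.25))
# Large rate: parameters stay close to the stored values.
print(update_gaussian(0.0, 1.0, [2.0, 4.0], lam=0.85))
```

Under this convention, a small rate corresponds to the "over-updating" risk and a large rate to the "less-updating" risk described above.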
To address the problem mentioned above, we present an adaptive classifier updating strategy. From the experimental results of the WMIL tracker, we found that the similarity scores of the tracking results vary frequently in the beginning frames. Therefore, we define the similarity score $S_0$ of the tracking location in the second frame as a reference. Two thresholds $H_1$ and $H_2$ are also defined. A similarity score range $(S_0 - H_1, S_0 + H_2)$ is then obtained with respect to the two thresholds and the reference. As tracking evolves, we treat a tracking result whose similarity score is outside the defined range as a case of appearance change, and a large learning rate is used for updating the parameters of the classifier. If the similarity score of the tracking result is within the defined range, we use a small learning rate to update the classifier. The learning rate is defined as follows:
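The switching rule can be written as a short sketch. The two rates 0.25 and 0.85 are the values reported in the parameter settings of this paper; the reference score and threshold values in the usage lines are hypothetical, and `adaptive_learning_rate` is an illustrative name.

```python
def adaptive_learning_rate(score, s0, h1, h2, rate_in=0.25, rate_out=0.85):
    """Pick the learning rate from the similarity score of the current frame.

    score:    maximum similarity score among the candidate samples
    s0:       reference similarity score from the second frame
    h1, h2:   thresholds defining the open range (s0 - h1, s0 + h2)
    rate_in:  rate used when the score lies inside the range
    rate_out: rate used when the score lies outside the range
    """
    lower, upper = s0 - h1, s0 + h2
    return rate_in if lower < score < upper else rate_out

# Hypothetical reference score and thresholds, for illustration only.
print(adaptive_learning_rate(score=10.0, s0=9.0, h1=3.0, h2=3.0))
print(adaptive_learning_rate(score=2.0, s0=9.0, h1=3.0, h2=3.0))
```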

Parameters Setting
In these algorithms, the parameters r, α, and β determine the instances in the positive and negative bags, and s determines the area for cropping the candidate samples. Larger values of r, α, β, and s yield more instances, which helps the algorithms track an object but increases the computing time. K is the number of weak classifiers, while M is the number of selected discriminative weak classifiers used to construct the strong classifier. A classifier with larger K and M discriminates an object more easily, but at a higher computational cost. The parameters are set to be the same as in the original papers [3,7,19,25] and are listed in Table 1; the results in those papers show that the algorithms perform best with these parameters. In the KMIL tracker, the radius for extracting candidate samples is set to 25 for a big object and to 35 for a small object. Unlike the fixed learning rate of the CT, MIL, and WMIL trackers, the learning rate of the KMIL tracker is adapted according to the maximum score of the candidate samples in the current frame. When the maximum score is in the given range, the learning rate is set to 0.25; otherwise it is 0.85. The number of weak classifiers is 150, while the number of selected discriminative weak classifiers is 50. The KCF tracker employs the HOG feature and a Gaussian kernel. In the first frame, a model is learned with the image patch centered at the initial position. In the successive frames, the position with the maximum response is detected over the candidate patch. Finally, a new model is trained at the new tracking position [14].
The CT, MIL, WMIL, KMIL, KCF, and DSST trackers are evaluated on the mentioned videos. After the classifiers are trained, the area with the maximum similarity score is taken as the tracking result. Then, the classifiers are updated to cope with occlusion, pose variations, and illumination variations.

Tracking Object Location
Here, we detail the object locations produced by the above trackers on the eight classical videos. The results are shown in Figure 2. The MIL tracker is time consuming because of its classifier selection strategy. As a result, it often fails in long-term object tracking. The KCF tracker uses the circulant matrix and a kernel function to achieve real-time object tracking. However, its search area and learning rate are constant. Therefore, it often fails to track the object in complex environments, especially when the object is small and moves fast. The WMIL and CT trackers update the classifier with a constant learning rate, which leads to tracking drift in complex environments. Furthermore, the WMIL tracker computes the bag probability according to the instances' distances, so the tracking drift accumulates. Benefiting from the adaptive learning rate, the adaptive search radius, and the bag probability, the KMIL tracker can deal with the drift problem in complex environments. The object locations in Figure 2 demonstrate that the KMIL tracker outperforms the CT, MIL, WMIL, KCF, and DSST trackers.
Figure 2. Illustration of the tracking locations on the sequences "Deer", "Tiger2", "Faceocc2", "Sylvester", "Football1", "Shaking", "Tiger1", and "Lemming".

Quantitative Analysis
We use precision curves to evaluate the performance of the proposed algorithm. A precision curve shows the percentage of correctly tracked frames for a range of distance thresholds [14]. A correctly tracked frame is one in which the target center is within a distance threshold of the ground truth. A tracker with higher precision at low thresholds is more accurate. Similar to previous works [7,13,27], we choose 20 pixels as the threshold. The experimental results are shown in Figure 3. The precision of the KMIL tracker at 20 pixels is higher than that of the other algorithms for the "Tiger2", "Lemming", "Deer", "Sylvester", and "Faceocc1" sequences. For the "Tiger1" sequence, the KMIL tracker achieves the second-highest precision at 20 pixels. The MIL tracker has the highest precision at 20 pixels on the "Football1" sequence shown in Figure 3h. All of the algorithms achieve a precision of 1 at 35 pixels. The experimental results show the effectiveness of the proposed KMIL tracker. In particular, the KMIL tracker achieves higher precision than the popular kernel-based trackers KCF and DSST.
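The precision measure described above can be computed as in the following sketch, which assumes that the tracked and ground-truth target centers are given as (x, y) pairs per frame. The function names and the sample coordinates are hypothetical.

```python
import math

def center_errors(tracked, truth):
    """Euclidean distance between tracked and ground-truth centers, per frame."""
    return [
        math.hypot(tx - gx, ty - gy)
        for (tx, ty), (gx, gy) in zip(tracked, truth)
    ]

def precision_at(tracked, truth, threshold=20.0):
    """Fraction of frames whose center error is within the threshold."""
    errors = center_errors(tracked, truth)
    return sum(e <= threshold for e in errors) / len(errors)

# Hypothetical centers for three frames.
tracked = [(0, 0), (30, 0), (3, 4)]
truth = [(0, 0), (0, 0), (0, 0)]
print(precision_at(tracked, truth, threshold=20.0))
print(precision_at(tracked, truth, threshold=35.0))
```

Evaluating `precision_at` over a range of thresholds gives the points of one precision curve.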
Another performance criterion we chose for our evaluation is the success plot. In the tracking process, the tracking result is denoted as a bounding box $a$, and the real position is the ground-truth bounding box $b$. An overlap score is then defined with respect to the two regions as $|a \cap b| / |a \cup b|$,
where $\cap$ and $\cup$ denote the intersection and union, respectively, and $|\cdot|$ counts the number of pixels. Tracked targets with an overlap score larger than the given threshold (0.5 is used) are considered successful results. The success curve shows the ratio of successful frames to the total number of frames.
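A minimal sketch of the overlap score and the resulting success rate is given below. It assumes axis-aligned boxes in (x, y, width, height) form, which is a representation chosen here for illustration; the function names are hypothetical.

```python
def overlap_score(a, b):
    """Intersection over union of two boxes given as (x, y, width, height)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def success_rate(tracked, truth, threshold=0.5):
    """Fraction of frames whose overlap score exceeds the threshold."""
    scores = [overlap_score(a, b) for a, b in zip(tracked, truth)]
    return sum(s > threshold for s in scores) / len(scores)

# Hypothetical boxes for two frames.
tracked = [(0, 0, 10, 10), (5, 0, 10, 10)]
truth = [(0, 0, 10, 10), (0, 0, 10, 10)]
print(success_rate(tracked, truth))
```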
The experimental results are shown in Figure 4. The higher the success curve, the stronger the tracking ability of the tracker. The red curve, obtained with the KMIL tracker, is higher than the other curves on the sequences "Tiger2", "Lemming", "Shaking", "Deer", "Sylvester", "Faceocc1", and "Tiger1". In Figure 4h, the highest success curve is obtained by the MIL tracker on the "Football1" sequence. However, that tracker is time consuming, as detailed in the next section. Furthermore, the well-known fast KCF and DSST trackers achieve a low success rate, especially when tracking a small object (e.g., on the "Tiger1" sequence).

Computational Cost
This section details the computational cost of the above trackers. The average computing time for processing an image is defined as $t_{avr} = t_{all} / N_{fra}$, where $t_{all}$ is the total computing time for processing all of the images in the video sequence, $N_{fra}$ is the total number of frames, and $t_{avr}$ is the resulting average computing time. Four factors contribute to the computational cost of the CT, MIL, WMIL, and KMIL trackers. The first factor is the total number of weak classifiers in the classifier pool. The second is the number of selected discriminative weak classifiers. The third is the number of instances, which grows with the search area. The last is the method for selecting discriminative weak classifiers, which involves high-dimensional matrix computations. We ran the experiments with the parameters in Table 1. The frames per second (FPS) of all the trackers are listed in Table 2. The KCF and DSST trackers run faster than the MIL, CT, WMIL, and KMIL trackers. However, they have low precision and success rates. Benefiting from the kernel function, the KMIL tracker avoids computing the inner product between $h$ and $\nabla\eta(H)$ over the instances $x_{ij}$ M times. As a result, it requires less computing time than the MIL and WMIL trackers.
Table 2. The FPS (frames per second) of the different algorithms on the sequences "Tiger2", "Lemming", "Shaking", "Deer", "Sylvester", "Faceocc1", "Tiger1", and "Football1".
Figure 4. Success plot for the sequences "Deer", "Tiger2", "Faceocc2", "Sylvester", "Football1", "Shaking", "Tiger1", and "Lemming".

Conclusions
In this paper, we revisit the core of the WMIL formulation to counter the issues of computational complexity and drift. We introduce a Gaussian kernel based multiple instance learning algorithm for real-time vision applications. We also suggest a simple yet effective strategy for updating the search circle, which is especially suitable for small, fast-moving objects. Lastly, we present a classifier update method that handles appearance changes with respect to two thresholds and a reference similarity score. The experiments conducted on several classical videos demonstrate that the KMIL tracker is efficient in terms of computing time and is robust.
Acknowledgments: Visual Tracker Benchmark http://cvlab.hanyang.ac.kr/tracker_benchmark/datasets.html.

Conflicts of Interest:
The authors declare no conflict of interest. The founding sponsors had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, and in the decision to publish the results.

Abbreviations
The following abbreviations are used in this manuscript: