A Target Model Construction Algorithm for Robust Real-Time Mean-Shift Tracking

Mean-shift tracking has attracted growing interest because it allows a real-time and reliable tracker implementation. In order to reduce the interference of background clutter in mean-shift object tracking, this paper proposes a novel indicator function generation method. The proposed method takes advantage of two 'a priori' knowledge elements that are inherent to the kernel support used for initializing a target model. Based on the assured background labels, a gradient-based label propagation is performed, yielding a number of objects differentiated from the background. The proposed region growing scheme then picks the largest target object near the center of the kernel support. The grown object region constitutes the proposed indicator function, which allows an exact target model construction for robust mean-shift tracking. Simulation results demonstrate that the proposed exact target model can significantly enhance the robustness as well as the accuracy of mean-shift object tracking.


Introduction
Object tracking, one of the most fundamental tasks in computer vision, has many applications such as video indexing, automated surveillance, and human-computer interaction. For this important task, there has been a tremendous amount of research, which is generally categorized into three methodologies, i.e., point-, kernel-, and silhouette-based tracking [1]. Among these approaches, kernel-based tracking has attracted growing interest thanks to its feasibility of reliable real-time tracker implementation. This feasibility is known to come mainly from the incorporated mean-shift procedure, which helps trackers quickly move their tracking position to a nearby optimal point.
As an efficient tool for finding the nearest dominant mode of feature space, the mean-shift procedure has been successfully applied to many object tracking algorithms. In [2], Comaniciu et al. proposed a color-histogram-based object tracker using the Bhattacharyya similarity measure and Epanechnikov ellipsoidal kernel. They developed a mean-shift target localization scheme, which quickly and reliably optimizes the similarity between the kernel-weighted histogram of a target and that of the candidate region. After this pioneering work, numerous improvements have been reported in the last decade. They are largely schemes using new object representation models [3][4][5], trackers simultaneously estimating object location as well as kernel scale and orientation [6][7][8], algorithms based on new similarity measures other than the Bhattacharyya coefficient [9][10][11], and methods incorporating histogram bin weights for reducing the effect of background clutter [12][13][14]. Among these, we focus on the problem of background clutter in this paper.
The background clutter problem in kernel-based tracking is that the background colors around a target object inevitably become part of the target representation model. This happens because the kernel shape, which is usually a simple rectangle or an ellipsoid, does not perfectly match the shape of the target object. In order to reduce the effect of such background clutter, Comaniciu et al. proposed a background-weighted histogram method [2], where the target model is modified such that the bins of background colors contribute less to mean-shift optimization. Unfortunately, this effort to eliminate the dominant background features from the target model has been shown not to affect the target localization process, because the same background weights were also incorporated into the modification of the candidate histogram. Ning et al. pointed out this problem in [12] and proposed using the background weights only for target model construction, which they called a corrected background-weighted histogram method. Moreover, in [13], Li and Feng proposed an adaptive kernel method, where the background weights developed in [2] were also used to further adapt the kernel. In effect, this adaptive kernel method modifies only the target model, as in [12], but with different bin weights (i.e., the squares of the weights used in [12]). Instead of using such a background color probability, Jeyakar et al. employed a likelihood of object color to modify the target representation model [14]. By defining the likelihood as the log of the ratio between the color distributions inside and outside the kernel region, they designed two separate weights for modifying the target and candidate models, respectively.
All these schemes demonstrated more reliable tracking results by appropriately modifying the target histogram, but such a histogram weighting method itself seems to be an intuitive and indirect method in the sense of background elimination from a target model. For more direct and strict background elimination, an indicator function in an asymmetric kernel could be exploited [7,15]. Since the indicator function directly specifies whether a pixel belongs to the target object or not, background pixels within a kernel can be strictly excluded from the target model. However, to the best knowledge of the authors, it has not been detailed so far how such an indicator function could be created and what amount of gains could be achieved by incorporating the function. Usually the indicator function has been assumed given perfectly from another stage of application, but this assumption is generally far from reality and such object detection is just another challenging problem.
In order to reduce the interference of background clutter, this paper proposes a novel indicator function generation method and demonstrates an exact target model construction that could significantly enhance the robustness as well as the accuracy of kernel-based tracking results. The proposed method takes advantage of two 'a priori' knowledge elements, which are inherent to a kernel support for initializing a target model. More specifically, based on the 'a priori' knowledge that the outside of an initial kernel is background, background labels are propagated to inside the kernel, and then by considering the 'a priori' knowledge that there is only one target object near the center of given kernel, the proposed algorithm grows a region using an object (foreground) seed near the kernel center. The grown region constitutes the proposed indicator function and this allows an exact target model construction for robust mean-shift object tracking.
The rest of this paper is organized as follows: Section 2 explains the conventional method of mean-shift object tracking with various previous background weighting algorithms to explain how an indicator function is used for constructing the target model of the proposed tracking scheme. Then, in Section 3, we describe the proposed method for generating the indicator function, where the proposed gradient-based label propagation and foreground region merging methods are detailed. Simulations for an extensive comparison study and their results are presented in Section 4. Finally, we conclude this paper in Section 5.

Mean-Shift Object Tracking
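In conventional mean-shift object tracking, the tracking position $\mathbf{x}$ is updated recursively to the next position $\mathbf{x}'$ by:

$$\mathbf{x}' = \mathbf{x} + \Delta\mathbf{x} \tag{1}$$

where:

$$\Delta\mathbf{x} = \frac{\sum_{i=1}^{n_c} (\mathbf{x}_i - \mathbf{x})\, w(\mathbf{x}_i)\, g\left(\|\mathbf{x}_i - \mathbf{x}\|^2\right)}{\sum_{i=1}^{n_c} w(\mathbf{x}_i)\, g\left(\|\mathbf{x}_i - \mathbf{x}\|^2\right)} \tag{2}$$

In Equation (2), $g(\cdot)$ is the shadow of the kernel profile $k(x)$ (i.e., $g(x) = -k'(x)$) and $w(\mathbf{x}_i)$ is the back-projection weight, given as:

$$w(\mathbf{x}_i) = \sum_{u=1}^{m} \sqrt{\frac{q_u}{p_u(\mathbf{x})}}\; \delta[b(\mathbf{x}_i) - u] \tag{3}$$

where $\delta[\cdot]$ is the Kronecker delta function, $b(\mathbf{x}_i)$ is the bin number associated with the pixel location $\mathbf{x}_i$, and $m$ is the number of bins of the target (candidate) histogram. $q_u$ and $p_u(\mathbf{x})$ denote the $u$-th bins of the target and the candidate histogram, respectively, which are defined by:

$$q_u = C_t \sum_{i=1}^{n_t} v_t(u, \mathbf{x}_i)\, k\left(\|\mathbf{x}_i - \mathbf{x}_0\|^2\right) \delta[b(\mathbf{x}_i) - u] \tag{4}$$

$$p_u(\mathbf{x}) = C_c \sum_{i=1}^{n_c} v_c(u, \mathbf{x}_i)\, k\left(\|\mathbf{x}_i - \mathbf{x}\|^2\right) \delta[b(\mathbf{x}_i) - u] \tag{5}$$

where $n_t$ and $n_c$ denote the numbers of pixels in the target and candidate regions, $\mathbf{x}_0$ and $\mathbf{x}$ are the centers of the target and candidate regions, and $C_t$ and $C_c$ are normalization constants. $v_t(u, \mathbf{x}_i)$ and $v_c(u, \mathbf{x}_i)$ are the weights for the $u$-th bin and the pixel location $\mathbf{x}_i$ of the target and the candidate histogram, respectively.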
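The update described above can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: it assumes an Epanechnikov profile k(r) = 1 − r on a circular support of a hypothetical `radius` (so that g(·) is constant inside the support), unit histogram weights, and pixels supplied as coordinate/bin pairs. The names `histogram` and `mean_shift_step` are made up for this example.

```python
import math

def histogram(pixels, bins, center, radius, m):
    """Kernel-weighted, normalized m-bin histogram (Epanechnikov profile)."""
    hist = [0.0] * m
    for (x, y), b in zip(pixels, bins):
        r2 = ((x - center[0]) ** 2 + (y - center[1]) ** 2) / radius ** 2
        if r2 < 1.0:
            hist[b] += 1.0 - r2           # k(r2) = 1 - r2 inside the support
    total = sum(hist)
    return [h / total for h in hist] if total > 0 else hist

def mean_shift_step(pixels, bins, q, center, radius):
    """One position update: centroid of the back-projection weights."""
    p = histogram(pixels, bins, center, radius, len(q))
    num_x = num_y = den = 0.0
    for (x, y), b in zip(pixels, bins):
        r2 = ((x - center[0]) ** 2 + (y - center[1]) ** 2) / radius ** 2
        if r2 >= 1.0 or p[b] <= 0.0:
            continue                       # outside the support: g(.) = 0
        w = math.sqrt(q[b] / p[b])         # back-projection weight of this pixel
        num_x += (x - center[0]) * w       # g(.) is constant for this profile
        num_y += (y - center[1]) * w
        den += w
    if den == 0.0:
        return center
    return (center[0] + num_x / den, center[1] + num_y / den)

# Toy scene: a "red ball" (bin 1) on a "blue" background (bin 0).
W = H = 21
pixels = [(x, y) for y in range(H) for x in range(W)]
bins = [1 if (x - 12) ** 2 + (y - 10) ** 2 <= 9 else 0 for (x, y) in pixels]

q = histogram(pixels, bins, (12, 10), 5.0, 2)        # target model on the ball
new_center = mean_shift_step(pixels, bins, q, (10, 10), 5.0)
```

In this toy scene the ball sits two pixels to the right of the starting position, so one update moves the center toward it along x while leaving y unchanged.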

Figure 1. An example of mean-shift tracking refinement: (a) video frame at time t with the kernel support for the target; (b) a zoomed part of the video frame at time t + 1 with the initial candidate kernel support and the first refinement result; (c) a zoomed part at time t + 1 with the shifted candidate kernel support after the first refinement and the second refinement result; (d) the back-projection weights for the first refinement; and (e) the back-projection weights for the second refinement.

Figure 1 shows an example of the above iterative tracking with the weights $v_t(u, \mathbf{x}_i) = v_c(u, \mathbf{x}_i) = 1$ for all $u$ and $\mathbf{x}_i$. Figure 1a,b show the video frames at time t and t + 1 with green boxes indicating the regions of support for the kernel k(·) in Equations (4) and (5). The blue dot in the green box of Figure 1b is the starting tracking position $\mathbf{x}$ in Equation (1), and the red arrow shows $\Delta\mathbf{x}$. From Equation (2), this refinement is done by finding the gravity center of $w(\mathbf{x}_i)$, and Figure 1d,e show the images of this back-projection weight (see Equation (3)) for the first two iterations. In these images, brighter pixels represent higher values of $w(\mathbf{x}_i)$, and thus the mean-shift refinement of Equation (1) pushes the tracking position $\mathbf{x}$ toward the brighter part of the image via Equation (2). As can be seen from the figures, for reliable mean-shift tracking, a large portion of the brighter pixels should belong to the baseball player, which is the target object to be tracked. This proposition can be easily understood by taking the simple example of a synthesized video in which a red ball moves against a blue background. Let us assume that we track the red ball and that a tight kernel window enclosing the red ball is given. In the following frame, the tracking procedure starts at the position of the kernel window of the previous frame, and thus only a part of the moving red ball is included in the kernel window.
This situation makes the target and candidate histogram models such that $q_u > p_u$ when $u$ denotes the red color, and vice versa for the blue color. This results in a high back-projection weight $w(\mathbf{x}_i)$ for the red ball and a low weight for the background via Equation (3). The high back-projection weight on the target object pushes the tracking position toward the red ball until the ball is fully contained in the kernel window. However, if we assume that a red spot is on the blue background, the refinement of the tracking position stops when the number of red pixels in the kernel window reaches the number of pixels of the red ball area, even though the kernel window does not fully include the red ball. This perturbs accurate tracking, and the amount of perturbation depends on the size of the red spot on the blue background. By the same principle, the bright clutter outside the target object in Figure 1d,e perturbs the tracking result via the gravity center calculation of Equation (2). In order to reduce this background clutter interference, the background weights $v_t(\cdot,\cdot)$ and $v_c(\cdot,\cdot)$ have been designed in several ways. After obtaining the normalized histogram $\{o_u\}_{u=1,\ldots,m}$ of the neighboring pixels of the kernel support (i.e., the neighboring outside region of the green box in Figure 1a), Comaniciu et al. set up the background weights in [2] as:

$$v_t(u, \mathbf{x}_i) = v_c(u, \mathbf{x}_i) = \min\left(\frac{o^*}{o_u},\, 1\right) \tag{6}$$

where $o^*$ is the smallest nonzero value of $o_u$. Note that the background histogram of the target region (i.e., $\{o_u\}_{u=1,\ldots,m}$) is also used for setting up $v_c(u, \mathbf{x}_i)$ (the background weights for the candidate).
Although this usage is based on the general assumption that the background histogram does not change severely within an interval of a few video frames, the same weights in the numerator and the denominator of the back-projection weight (see Equation (3)) cancel each other out in the mean-shift procedure. Noting this problem, Ning et al. used only $v_t(u, \mathbf{x}_i)$ of Equation (6), while leaving $v_c(u, \mathbf{x}_i) = 1$ for all $u$ and $\mathbf{x}_i$ in [12], and Li and Feng used this background histogram not only for $v_t(u, \mathbf{x}_i)$ and $v_c(u, \mathbf{x}_i)$, but also for further modifying the back-projection weight as [13]:

$$w(\mathbf{x}_i) = \sum_{u=1}^{m} v_t(u, \mathbf{x}_i) \sqrt{\frac{q_u}{p_u(\mathbf{x})}}\; \delta[b(\mathbf{x}_i) - u] \tag{7}$$

This weighting scheme, however, can be replaced by using the weights:

$$v_t(u, \mathbf{x}_i) = \left(\min\left(\frac{o^*}{o_u},\, 1\right)\right)^2, \qquad v_c(u, \mathbf{x}_i) = 1 \tag{8}$$

with the back-projection weight calculated by Equation (3), where $o_u$ is the normalized background histogram and $o^*$ is its smallest nonzero value, as in Equation (6). This is because the same weights, $v_t(\cdot,\cdot) = v_c(\cdot,\cdot)$, for $q_u$ and $p_u$ have no effect on the mean-shift procedure of Equation (2). Instead of designing the weights $v_t(\cdot,\cdot)$ and $v_c(\cdot,\cdot)$ based only on the background histogram, Jeyakar et al. also incorporated the foreground histogram to define such weights in [14]. They defined, in Equation (9), a likelihood that a particular color belongs to the foreground or the background as the log of the ratio between the color distributions inside and outside the kernel region, where $\{h_u\}_{u=1,\ldots,m}$ is the normalized histogram of the neighboring background of the candidate kernel support (i.e., the neighboring outside region of the green box in Figure 1b,c). Then, using this likelihood, they proposed the weights of Equation (10), where X is one of t and c for the target and candidate models, respectively, and a and b are control parameters, whose typical values are all 1.
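As a small worked example of the background weighting discussed above, the following sketch assumes the weight form used in the background-weighted histogram of [2], i.e., min(o*/o_u, 1) with o* the smallest nonzero background bin, and the squared variant attributed to [13] in the Introduction. The function name `background_weights` and the histogram values are hypothetical.

```python
def background_weights(o):
    """Per-bin weights min(o_star / o_u, 1); o_star = smallest nonzero bin of o."""
    nonzero = [value for value in o if value > 0]
    if not nonzero:
        return [1.0] * len(o)              # no background evidence: leave bins as is
    o_star = min(nonzero)
    # Bins absent from the background keep full weight.
    return [1.0 if value == 0 else min(o_star / value, 1.0) for value in o]

# Hypothetical normalized background histogram with four color bins.
o = [0.6, 0.3, 0.1, 0.0]
v = background_weights(o)                  # frequent background colors are damped
v_squared = [w * w for w in v]             # squared variant of the same weights
```

Colors that dominate the background (the first bin here) receive the smallest weight, so they contribute least to the target model.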
In contrast to the above conventional background weights, the proposed scheme employs a position-dependent weight for target model construction:

$$v_t(u, \mathbf{x}_i) = v(u)\, I(\mathbf{x}_i) \tag{11}$$

where $v(u)$ denotes the bin weights, which can be any of the above background weights including the case of always 1, and $I(\mathbf{x}_i)$ is the indicator function, specifying whether the pixel location $\mathbf{x}_i$ belongs to the target object or not. As for the candidate background weights, we set $v_c(u, \mathbf{x}_i) = 1$ for all $u$ and $\mathbf{x}_i$, in consideration of object deformation and the uncertainty of the candidate background region.
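The effect of a position-dependent weight can be illustrated with a one-dimensional toy example. The sketch below assumes that the target weight is the product of a per-bin weight and a binary indicator value per pixel, as described above; the function `target_histogram` and all numeric values are hypothetical.

```python
def target_histogram(bins, kernel, indicator, bin_weights):
    """Normalized target histogram with per-bin weights and a per-pixel indicator."""
    hist = [0.0] * len(bin_weights)
    for b, k, ind in zip(bins, kernel, indicator):
        # A pixel contributes only if the indicator marks it as target (ind = 1).
        hist[b] += bin_weights[b] * ind * k
    total = sum(hist)
    return [h / total for h in hist] if total > 0 else hist

# Six pixels in a row: background color (bin 0) at both ends, target (bin 1) inside.
bins = [0, 0, 1, 1, 1, 0]
kernel = [0.2, 0.5, 1.0, 1.0, 0.5, 0.2]    # kernel weight of each pixel
indicator = [0, 0, 1, 1, 1, 0]             # 1 = pixel belongs to the target
unit = [1.0, 1.0]                          # bin weights of "always 1"

q_plain = target_histogram(bins, kernel, [1] * 6, unit)
q_indicator = target_histogram(bins, kernel, indicator, unit)
```

Without the indicator, about a quarter of the model mass falls on the background color; with it, the background bin is empty.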

Generation of Indicator Function
In many object tracking applications, the kernel support for the target model is usually initialized by user interaction or given by an object detector in the form of a simple region. This region can be assumed, without loss of generality, to be larger than the target object and to contain one such target object near its center. These assumptions translate the problem of indicator function generation into a bi-label segmentation problem given the two 'a priori' knowledge elements: (1) the outside of the region is background for sure; and (2) one target object is located near the center of the region. Hence, we first initialize the indicator function $I(\mathbf{x}_i)$ such that pixels outside of the region are marked as 'background':

$$I(\mathbf{x}_i) = \begin{cases} 0, & \mathbf{x}_i \notin R \\ -1, & \text{otherwise} \end{cases} \tag{12}$$

where R denotes the region for the target model. Then, this set of initial background labels is propagated into the region R, so as to leave one or possibly multiple candidate target objects within the region. After completing this background label propagation, region growing is performed to find the largest connected region near the center of R, resulting in the proposed indicator function.

Gradient-Based Background Label Propagation
Basically, the proposed background label propagation is performed by iteratively investigating the pixels contiguous to the border of the background region. This border is defined by the set B of background pixels that are neighbors of unsettled pixels:

$$B = \{\mathbf{x}_i \mid I(\mathbf{x}_i) = 0 \text{ and } \exists\, \mathbf{x}_j \in N_4(\mathbf{x}_i) \text{ with } I(\mathbf{x}_j) = -1\} \tag{13}$$

where $N_4(\mathbf{x}_i)$ is the set of the 4 neighbors of $\mathbf{x}_i$. Then, for each unsettled pixel $\mathbf{x}_k$ (i.e., $I(\mathbf{x}_k) = -1$) that is adjacent to the set B (a pixel $\mathbf{x}_k$ is said to be adjacent to a set B when B contains one or more pixels contiguous to $\mathbf{x}_k$), we find the most similar neighboring pixel by:

$$\mathbf{x}^* = \arg\min_{\mathbf{x}_j \in N_B(\mathbf{x}_k)} \left| G(\mathbf{x}_k) - G(\mathbf{x}_j) \right| \tag{14}$$

where $N_B(\mathbf{x}_k)$ is the set of the neighboring background pixels of $\mathbf{x}_k$, defined by:

$$N_B(\mathbf{x}_k) = \{\mathbf{x}_j \mid \mathbf{x}_j \in N_4(\mathbf{x}_k) \text{ and } \mathbf{x}_j \in B\} \tag{15}$$

and $G(\cdot)$ in Equation (14) is the magnitude of the gradient, defined by:

$$G(\mathbf{x}) = \left\| \frac{\partial f(\mathbf{x})}{\partial x}\, \mathbf{u}_x + \frac{\partial f(\mathbf{x})}{\partial y}\, \mathbf{u}_y \right\| \tag{16}$$

where $f(\cdot)$ denotes the intensity of a pixel, and $\mathbf{u}_x$ and $\mathbf{u}_y$ are the unit vectors in the x and y directions, respectively. Then, the pixel $\mathbf{x}_k$ is classified as 'background' when $G(\mathbf{x}_k)$ and $G(\mathbf{x}^*)$ are close enough to each other, i.e.:

$$I(\mathbf{x}_k) = \begin{cases} 0, & \left| G(\mathbf{x}_k) - G(\mathbf{x}^*) \right| < \tau \\ 1, & \text{otherwise} \end{cases} \tag{17}$$

where τ is the threshold of the proposed label propagation (although the threshold τ was selected manually in this paper, this type of gradient threshold has been widely studied in the image segmentation field, and readers may refer to [16] for automatic control of this parameter). Once all the unsettled pixels adjacent to the set B have been investigated, the set B is renewed so that it contains only the newly found 'background' pixels, and then the label propagation via Equations (14) and (17) is repeated. This propagation and renewal process is iterated until the set B is renewed as an empty set, leaving a few unsettled regions enclosed by pixels having I(xi) = 1. As a result of this gradient-based background propagation, candidate target objects within the region R are represented by the set of pixels having I(xi) = 1 for contour pixels and I(xi) = −1 for inner pixels. In particular, if an object was cut by the given kernel (i.e., the region R), the object remains as a thin contour line consisting of pixels having I(xi) = 1.
Figure 2b shows an example of this background label propagation result. As explained before, target object near the center of R is represented as a large connected area (the pixels of I(xi) = 1 or I(xi) = −1 are depicted as white in the figure) and the pedestrian separated by given kernel support (the green box in Figure 2a) is shown as a thin line. The clutter in the off-center area of Figure 2b comes from the high intensity activities of edge pixels, and will be eliminated from the proposed indicator function by the following region growing method.
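The propagation procedure of this subsection can be sketched as follows. This is a simplified illustration under stated assumptions rather than the authors' code: labels are taken to be 0 for background, 1 for contour and −1 for unsettled pixels, the gradient is approximated with central differences, a pixel becomes background when its gradient magnitude differs by less than `tau` from its most similar neighboring border pixel, and the function names are hypothetical.

```python
def gradient_magnitude(img):
    """Central-difference gradient magnitude (image borders are replicated)."""
    h, w = len(img), len(img[0])
    g = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            gx = (img[y][min(x + 1, w - 1)] - img[y][max(x - 1, 0)]) / 2.0
            gy = (img[min(y + 1, h - 1)][x] - img[max(y - 1, 0)][x]) / 2.0
            g[y][x] = (gx * gx + gy * gy) ** 0.5
    return g

def propagate_background(img, region, tau):
    """region = (x0, y0, x1, y1), inclusive. Returns the label map."""
    h, w = len(img), len(img[0])
    x0, y0, x1, y1 = region
    G = gradient_magnitude(img)
    # Outside the region is background (0); inside is unsettled (-1).
    I = [[-1 if (x0 <= x <= x1 and y0 <= y <= y1) else 0 for x in range(w)]
         for y in range(h)]

    def n4(x, y):
        steps = ((1, 0), (-1, 0), (0, 1), (0, -1))
        return [(x + dx, y + dy) for dx, dy in steps
                if 0 <= x + dx < w and 0 <= y + dy < h]

    # Border set: background pixels with at least one unsettled 4-neighbor.
    B = {(x, y) for y in range(h) for x in range(w)
         if I[y][x] == 0 and any(I[ny][nx] == -1 for nx, ny in n4(x, y))}
    while B:
        candidates = {p for (x, y) in B for p in n4(x, y) if I[p[1]][p[0]] == -1}
        new_B = set()
        for (x, y) in sorted(candidates):
            # Smallest gradient difference to a neighboring border pixel.
            diff = min(abs(G[y][x] - G[py][px]) for px, py in n4(x, y) if (px, py) in B)
            if diff < tau:
                I[y][x] = 0                # propagate the background label
                new_B.add((x, y))
            else:
                I[y][x] = 1                # gradient jump: contour pixel
        B = new_B                          # keep only newly found background
    return I

# Toy image: a bright 3x3 square on a flat background, kernel region inside.
img = [[200 if 3 <= x <= 5 and 3 <= y <= 5 else 10 for x in range(9)] for y in range(9)]
labels = propagate_background(img, (1, 1, 7, 7), tau=10.0)
```

In the toy image the flat background is absorbed from the outside in, a ring of contour pixels (label 1) forms where the gradient changes, and the pixels inside the ring stay unsettled (label −1).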

Region Growing
Region growing is a procedure that assigns pixels to a region based on predefined criteria of growth. The approach basically starts with a set of 'seed' pixels, and the set grows by recursively appending similar neighboring pixels. In our method, the set is initialized with a single pixel, which is not background and lies near the kernel center. Then the set grows into a large connected foreground region, which is a candidate for the target object. After completing the region growing, the proposed scheme investigates the pixels in the kernel center area to find any remaining pixels that are not background and not yet assigned to a foreground region. If such a pixel is found, region growing is conducted again using the detected seed pixel, and this seed exploration and region growing process is repeated until no more seed pixels can be found. This repeated region growing may produce multiple foreground regions, and thus we choose the largest one as the proposed target object based on the 'a priori' knowledge that only one target object is located near the center of a given kernel support. Now, in order to formally describe the proposed region growing algorithm, let C be the small rectangular region co-centered with the kernel support (in this paper, for simulations, we set the size of C to one ninth of the given kernel support area). After selecting a pixel $\mathbf{x}_s$ from C whose label is $I(\mathbf{x}_s) = 1$ or $I(\mathbf{x}_s) = -1$, two sets $S^0$ and $O^0$ are initialized by:

$$S^0 = \{\mathbf{x}_s\}, \qquad O^0 = \{\mathbf{x}_s\} \tag{18}$$

where $S^0$ and $O^0$ are the initial sets for seed and object pixels, respectively. Then, the proposed region growing is performed by appending pixels according to the following criteria:

$$S^{k+1} = \{\mathbf{x}_j \mid \mathbf{x}_j \in N_8(\mathbf{x}_i),\ \mathbf{x}_i \in S^k,\ I(\mathbf{x}_j) \in \{1, -1\},\ \mathbf{x}_j \notin O^k\} \tag{19}$$

and the set of object pixels is updated by:

$$O^{k+1} = O^k \cup S^{k+1} \tag{20}$$

where $N_8(\mathbf{x}_i)$ is the set of the eight neighbors of $\mathbf{x}_i$.
Note here that the proposed region growing resorts to an eight-neighbor system to merge the parts of an object that are connected by a thin line, while the proposed background label propagation employs a four-neighbor system to leave the inner parts of an object intact. As k increases, this iterative procedure gathers more object pixels until the set $S^k$ becomes empty. After completion, an identifier is given to the grown object pixels by:

$$I(\mathbf{x}_i) = l, \quad \forall\, \mathbf{x}_i \in O^K \tag{21}$$

where l is the initial value for the object identifier and the iterative region growing is assumed to be completed at stage K. Once this identifier allocation has been done, we increase the object identifier l by 1, and then check for any new seed pixel $\mathbf{x}_s$ from C again. Note that, because of the above identifier allocation, no more seed pixels can be found when there exists only one object near the center of the given kernel. However, once a new seed pixel is found, the above region growing (i.e., Equations (18)-(20)) is performed again and the increased identifier l + 1 is allocated to the newly grown object pixels. This identifier increment, seed pixel exploration, region growing and identifier allocation are repeated until no more seed pixels can be found in C.
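Finally, once all the foreground pixels in C have been processed, we choose only one target object by:

$$L^* = \arg\max_{L \in \{l_1, \ldots, l_n\}} |I|_L \tag{22}$$

where $l_1, \ldots, l_n$ are the allocated object identifiers and $|\cdot|_L$ is the cardinality of the set of pixels having the identifier value L. Now the proposed indicator function is generated by:

$$I(\mathbf{x}_i) = \begin{cases} 1, & I(\mathbf{x}_i) = L^* \\ 0, & \text{otherwise} \end{cases} \tag{23}$$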
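The seed exploration, eight-neighbor growing and largest-object selection described in this subsection can be sketched as below. The sketch makes some simplifying assumptions of its own: identifiers are kept in a separate map instead of being written into the label map, growing is implemented with a queue, and the center area C is passed as a hypothetical inclusive box `center_box`. Labels follow the convention 0 = background, 1 = contour, −1 = inner pixel.

```python
from collections import deque

def grow_largest_object(I, center_box):
    """Return a binary indicator map of the largest object seeded in center_box."""
    h, w = len(I), len(I[0])
    x0, y0, x1, y1 = center_box
    ident = [[0] * w for _ in range(h)]    # 0 means "no identifier yet"
    sizes = {}
    next_id = 1
    for sy in range(y0, y1 + 1):
        for sx in range(x0, x1 + 1):
            if I[sy][sx] == 0 or ident[sy][sx] != 0:
                continue                   # background or already assigned
            queue = deque([(sx, sy)])      # new seed pixel
            ident[sy][sx] = next_id
            count = 0
            while queue:
                x, y = queue.popleft()
                count += 1
                for dy in (-1, 0, 1):      # eight-neighbor system
                    for dx in (-1, 0, 1):
                        nx, ny = x + dx, y + dy
                        if (0 <= nx < w and 0 <= ny < h
                                and I[ny][nx] != 0 and ident[ny][nx] == 0):
                            ident[ny][nx] = next_id
                            queue.append((nx, ny))
            sizes[next_id] = count
            next_id += 1
    if not sizes:
        return [[0] * w for _ in range(h)]
    best = max(sizes, key=sizes.get)       # identifier with the most pixels
    return [[1 if ident[y][x] == best else 0 for x in range(w)] for y in range(h)]

# Toy label map: a 3x3 object and an isolated clutter pixel at (x=5, y=3).
I = [
    [0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0, 0, 0],
    [0, 1, -1, 1, 0, 0, 0],
    [0, 1, 1, 1, 0, 1, 0],
    [0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
]
indicator = grow_largest_object(I, (2, 2, 5, 3))
```

Both the object and the clutter pixel are reached from seeds inside the center box, but only the larger region survives in the final indicator map.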

Simulation Results
In order to show the feasibility of the proposed scheme, we implemented the previous background clutter reduction algorithms explained in Section 2. The implemented bin weights will be denoted as the A1 ([2]), A2 ([12]), A3 ([14]) and A4 ([13]) methods, respectively. More specifically, A1 sets up the background weights $v_t(u, \mathbf{x}_i)$ and $v_c(u, \mathbf{x}_i)$ using Equation (6), while A2 uses the same $v_t(u, \mathbf{x}_i)$ as A1, but sets $v_c(u, \mathbf{x}_i) = 1$ for all $u$ and $\mathbf{x}_i$. A3 and A4 use Equations (10) and (8), respectively, for the computation of such background weights. Then, each of the background weights in the methods A1, A2, A3 and A4 is combined with the proposed indicator function via Equation (11), resulting in weights that depend on the position as well as the histogram bin. We denote these combined methods by P1, P2, P3 and P4 (i.e., Px combines the bin weights of Ax with the proposed indicator function, where x = 1, 2, 3 and 4). In other words, the proposed methods set the bin weights for the target model by multiplying $v_t(u, \mathbf{x}_i)$ of A1, A2, A3 and A4 by $I(\mathbf{x}_i)$ of Equation (23), and set $v_c(u, \mathbf{x}_i) = 1$ for all $u$ and $\mathbf{x}_i$. We investigated the tracking errors of each tracker (with and without the proposed indicator function). Here, the tracking error is defined as the distance from the center of the ground truth object to the center of the tracked object (or kernel) for each video frame.
To conduct such comparative tests, we carefully selected five video sequences, which have distinctive characteristics in the amount of scale change, deformation, and occlusion of the target object. Three of the selected videos (Campus, EgTest05, and Bike) are publicly available and can be obtained from the Performance Evaluation of Tracking System (PETS) [17,18] and the OpenCV Cookbook datasets [19], and the other two (Baseball-1 and Baseball-2) were produced by the authors using a Sony CX560 camcorder. The properties of each test video sequence are summarized in Table 1 and a representative image of each sequence with its target object is shown in Figure 3. The numbers in parentheses in the 'image size' column of Table 1 represent the numbers of video frames used for this simulation. For all tests, the Red, Green and Blue (RGB) color space was employed to create a histogram, which was quantized into 16 × 16 × 16 bins for real-time implementation. The kernel for the target object was set with the Epanechnikov profile [2] in three different sizes, where the smallest size was a tight fit to the target object, and the other two were 1.3 times and 1.7 times as large as the smallest kernel in both width and height. These different sizes of kernel support simulate the imperfect initialization of target specification and allow us to study the effect of background clutter caused by this imperfection. Note that, in some applications, the initial kernel is usually given by user interaction via equipment such as mice and tablets. Figure 4 shows these three sizes of kernel support for the test video 'Baseball-1'.
In this figure, we can see that the background colors (outside the kernel support) can vary significantly with the choice of kernel size, which means that the performance of the algorithms A1, A2, A3, and A4 (i.e., the tracking schemes employing a background histogram) could be seriously influenced by the kernel size.

Table 2. Tracking results before and after applying the proposed indicator function by Equation (11). (F) means tracking failure and the number in brackets represents the improvement ratio of tracking accuracy caused by applying the proposed indicator function.

Table 2 summarizes the simulation results, where each number represents the tracking error averaged over all the tested frames of each video. Also, in the table, bold numbers indicate the best tracking results for each row (i.e., for a given kernel size and test sequence), an (F) next to a tracking error marks a case of tracking failure, and the floating point numbers in brackets (e.g., [11.5]) denote the Accuracy Improvement (AI) ratios, which were computed by:

$$AI = \frac{E_A - E_P}{E_A} \times 100\ (\%) \tag{24}$$
where $E_A$ and $E_P$ are the averaged tracking errors before and after applying the proposed indicator function, respectively. First of all, from the table, we can see that the application of the proposed indicator function always improves the tracking performance, except for the two cases of P2 with the small kernel size and P4 with the normal kernel size for the test sequence Baseball-2. However, the difference in tracking errors in these cases is far less than 1 pixel (0.3 for the small kernel case and 0.1 for the normal kernel case). On the other hand, the benefit of the proposed indicator function is observed to be substantial: the largest gain reaches up to 56.6% of AI (in the case of P2 with the large kernel for the test sequence Campus) and up to 19.7 pixels (in the case of P1 with the large kernel for the test sequence EgTest05), even without considering tracking failure cases. For the tracking failure cases, the AIs in the table are generally less meaningful, because the kernel support is usually stuck in some background part where tracking failure happened, and thus higher AIs do not always imply better tracking performance (nevertheless, positive AIs in these failure cases usually imply that the proposed method is able to track the given target object for more video frames). Instead, to show the robustness improvement by the proposed scheme, we counted the number of cases where the tracking failure of the previous algorithms has been overcome by the proposed indicator function. More specifically, we define the Robustness Improvement (RI) ratio as:

$$RI = \frac{F_A - F_P}{T} \times 100\ (\%) \tag{25}$$

where $F_A$ and $F_P$ are the numbers of tracking failure cases of the previous and the corresponding proposed trackers, respectively, and T is the number of total trials (i.e., 15 in our case for each tracker). Table 3 summarizes these RIs and the averaged AIs for each proposed scheme. Here, the average has been computed over all 15 cases, but without the tracking failure cases (i.e., the AIs with (F) in the columns of P1 and P3 of Table 2).
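The two ratios can be illustrated with a short calculation. The sketch assumes that AI is the relative reduction of the averaged tracking error in percent and that RI is the share of trials, in percent, in which a tracking failure was fixed; the function names and the sample numbers are illustrative only and are not taken from the tables.

```python
def accuracy_improvement(e_a, e_p):
    """AI in percent: relative reduction of the averaged tracking error."""
    return 100.0 * (e_a - e_p) / e_a

def robustness_improvement(f_a, f_p, trials):
    """RI in percent: share of trials whose tracking failure was fixed."""
    return 100.0 * (f_a - f_p) / trials

# Illustrative values: the error drops from 10 to 5 pixels, and
# 6 of 15 trials fail before but none after applying the indicator function.
ai = accuracy_improvement(10.0, 5.0)
ri = robustness_improvement(6, 0, 15)
```

With 15 trials per tracker, an RI of 40% corresponds to six failure cases that were fixed.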
Hence the average AI represents the gain when the previous and the proposed methods successfully track the given target object, while RI shows the gain when the previous method fails to (but the proposed scheme successfully) track(s) the object. As can be seen from the table, the proposed indicator function significantly improves the tracking robustness as well as accuracy. Especially, in the cases of P1 and P3, the ratios of fixed tracking failure reached 40% of the total trial, and in the case of P1 tracking accuracy has been improved by about 30% on average. Here, the performance gain of the proposed indicator function is observed to be relatively higher in P1 and P3. As can be seen from the Table 2, this is because A2 and A4 are relatively more accurate and robust than the methods A1 and A3. To be more specific about the rationale of this different performance gain, we analyzed the changes of back-projection weight (i.e., w(xi) in Equation (3)) before and after applying the proposed indicator function. This is important because, as can be seen from Equation (2), the amount of position update for tracking eventually corresponds to the weighted average of pixel coordinates based on this weight. Figure 5 shows the back-projection weight of each algorithm for the first frame of the Campus sequence, where brighter pixels mean higher back-projection weight values. As can be seen from the figure, the back-projection weights of A2 and A4 have more bright pixels on the target object, and this explains the more accurate and robust tracking performance of A2 and A4 compared with that of A1 and A3. Similarly, the changes of back-projection weight from Figure 5b,d to Figure 5f,h explain the higher gains of P1 and P3 by applying the proposed indicator function. 
On the other hand, in the case of Figure 5g,i, only slight changes from Figure 5c,e can be observed, and this accounts for the relatively small gains of 4.1 pixels and 1.1 pixels in accuracy improvement for P2 and P4, respectively. These slight changes of back-projection weight are due to the fact that the colors of the background area (i.e., the outside region of the kernel support) are very similar to the background colors inside the given kernel, and thus the background clutter reduction methods of A2 and A4 were able to successfully suppress the effect of such background colors. However, this situation is not always guaranteed, and one serious exception can be found in the test sequence Bike. Figure 6 shows the changes of back-projection weight from A2 to P2 and from A4 to P4. In this case, the bright white pixels at the upper part of the given kernel are hardly eliminated from the back-projection weight image, because their color is very similar to the upper part of the given target object but no such color exists outside the given kernel. Thus the bright pixels at the upper part of Figure 6b,d push the tracked position of A2 and A4 upward, resulting in tracking failure for all kernel sizes. However, in the case of Figure 6c, we can see that these bright pixels at the upper part are successfully removed, and in the case of Figure 6e, many more white pixels are observed on the target object. These changes help the schemes P2 and P4 successfully track the given target object until the end of the test sequence Bike. Figure 7 shows the tracking results of the test sequence Bike when the algorithms A2 and A4 start to fail.

Finally, in order to identify the computational complexity of the proposed indicator function, we measured the execution time of each algorithm.
For a more specific analysis of the complexity, each tracking algorithm is regarded as comprising three stages (i.e., target model construction, target candidate model construction and new centroid computation). We then measured the execution time of each stage for each algorithm, together with the number of centroid computation iterations. In order to eliminate the influence of the operating system, all five test sequences with three different sizes of kernel support (i.e., 15 cases in total) were tested for the initial 40 frames of each test sequence, and we averaged all the measured times for each processing stage. All the measurements were done on the same system using the Windows 7 operating system with an Intel Core i7 2.67 GHz CPU and 4 GB RAM. Table 4 summarizes the experimental results. As can be seen from the table, the average time for target model construction increased by about 65.1%, but since the execution time for tracking is longer than that for target model construction, this complexity increase corresponds to only about 19% of the average total execution time per frame. Moreover, if we consider that the target model construction is done only once at the initialization step, the added computation to track the whole 40 frames becomes only about 0.47%. With the average execution times in Table 4, even though the target model construction

Conclusions
Mean-shift tracking is a well-known practical method to reliably track an irregular object in real time. For more accurate and robust performance of such mean-shift tracking, this paper introduced a target model construction based on an indicator function and provided an algorithm to create the indicator function. After formulating the generation of the indicator function as a two-label segmentation problem, we developed a method of gradient-based background label propagation and region growing. Using two 'a priori' knowledge elements inherent to an initial kernel, the developed method extracts an object segment from the given kernel, and this segmented region constitutes the proposed indicator function. Since the proposed method to generate the indicator function is performed only once in the whole tracking procedure, its required computations do not affect the real-time implementation of the mean-shift tracker, provided that the tracker without the indicator function is fast enough in itself.
The proposed indicator function provides an improved target object localization, which is very useful for alleviating the problem of background clutter. In order to verify such effectiveness of the target model using the indicator function, we performed an extensive comparative study using five carefully selected video sequences and three different kernel sizes. For all the compared representative previous algorithms, which deal with the problem of background clutter, it has been observed that the proposed target model significantly improves the performance of the tracking robustness as well as the accuracy with a negligible computational complexity increment. The improved tracking accuracy gain reached up to 56.6% and the maximum improvement of tracking robustness was 40% at the cost of less than 0.47% of tracking time increment.
Finally, although the proposed indicator function was employed only for the construction of a target representation model in this paper, it is also expected to be very useful in dealing with various other problems of mean-shift tracking, such as the scale estimation of target objects, target model updating, and occlusion detection.