A Fast MEANSHIFT Algorithm-Based Target Tracking System

Tracking moving targets in complex scenes using an active video camera is a challenging task. Tracking accuracy and efficiency are two key yet generally incompatible aspects of a Target Tracking System (TTS). This paper studies a compromise between the two. A fast mean-shift-based target tracking scheme is designed and realized, which is robust to partial occlusion and changes in object appearance. The physical simulation shows that the image signal processing speed is >50 frames/s.


Introduction
Visual tracking plays an important role in various computer vision applications, such as surveillance [1,2], firing systems [3], vehicle navigation [4] and missile guidance [5]. Target tracking using an active video camera is a challenging task mainly due to three reasons [6][7][8]: (1) the tracking system should have good robustness to the targets' pose variation and occlusion; (2) tracking requires properly dealing with video camera motion through suitable estimation and compensation techniques; (3) most applications would introduce some real-time constraints, which require tracking techniques to reduce the computational time [5].
Target tracking approaches can be divided into two main types according to their properties: feature-based and optical flow-based approaches. Optical flow is the vector field which describes how the image changes with time [9]. The amplitude and direction of the optical flow vector of each pixel are usually computed by the Lucas-Kanade algorithm. Shi and Tomasi [10] also proposed the well-known Shi-Tomasi-Kanade (STK) tracker, which iteratively computes the translation of a region centered on an interest point [9]. However, optical flow computation is too complicated to meet real-time requirements, and it is sensitive to illumination changes and noise, which limits its practical application.
Feature-based algorithms were originally developed for tracking a small number of salient features in an image sequence. These features include color, texture, contour and detection operators such as the scale-invariant feature transform (SIFT) [9] or the histogram of oriented gradients (HOG) [11]. Feature-based algorithms extract regions of interest in the images and then locate the target in the individual images of the sequence. Typical feature-based tracking algorithms are: multiple hypothesis tracking (MHT) [12], Template Matching (TM) [13][14][15][16], Mean-Shift (MS) [17][18][19], Kalman filtering (KF) [20] and particle filter (PF) [21,22].
TM is a simple and popular technique in target tracking, which is widely used in civilian and military automatic target recognition systems. Given an input image and a template image, the matching algorithm finds the partial image that most closely matches the template image in terms of some specific criterion, such as the Euclidean distance or cross correlation. Conventional template matching methods consume a large amount of computational time. A number of techniques have been investigated with the intent of speeding up template matching, and have given good results [14,15]. However, TM does not achieve robust performance in complex scenes, especially in the case of clutter and occlusion [3]. The Kalman filter and the particle filter, which are used to estimate the target location in the next frame, have also been extensively studied. Compared with the Kalman filter, the particle filter is more robust for nonlinear and non-Gaussian problems because it simulates the posterior distribution. Many efforts have been made to speed up the particle filter. Martinez-del-Rincon et al. [21] proposed a new particle filter algorithm based on two sampling techniques, which substantially improves the efficiency of the filter. Sullivan et al. [23] proposed layered sampling using multiscale processing of images. These solutions significantly reduce the computational costs, but further efforts are desirable for better efficiency.
In image sequences, the target appearances are strongly correlated. Among all appearance-based tracking models, one popular subset is the "subspace model". Black [24] used a set of orthogonal vectors to describe the target image. Principal Component Analysis (PCA) and other classic dimensionality reduction methods provide an effective tool to compute the set of orthogonal vectors. Levy and Lindenbaum [25] presented a novel incremental PCA algorithm (Sequential Karhunen-Loeve, SKL) that updates the eigen-basis when new data is available, with greatly reduced computation and memory requirements. Lin applied Fisher linear discriminant analysis in subspace tracking to take the background into account [26]; however, it does not perform well for non-Gaussian distributions.
The MS-based tracker is robust to variations in translation, rotation and scale. The MS algorithm is a nonparametric density gradient estimation approach to local mode seeking, and it was originally invented for data clustering. Comaniciu [18] was the first to apply it to target tracking. The tracker needs a target model, which is obtained from the color histogram of the moving object. The target candidate is obtained in the same way at a location specified by the MS algorithm. The similarity measure between the target candidate and the target model is computed using the Bhattacharyya coefficient.
One of MS's drawbacks is that it often converges slowly. To the best of our knowledge, few attempts have been made to speed up the convergence of MS. The k-d tree can be used to reduce the large number of nearest-neighbor queries. Although this achieves a dramatic decrease in computational time for high-dimensional clustering, such techniques are not attractive for relatively low-dimensional problems such as visual tracking. Cheng [27] showed that mean shift is gradient ascent with an adaptive step size, but the theory behind the step sizes remains unclear.
The main contribution of this paper is a novel fast and robust tracking algorithm that combines MS with template matching (TM), balancing robustness against real-time performance. A fast MS-based target tracking scheme is designed and implemented, which is robust to target pose variation and partial occlusion. The hardware-in-loop simulation shows that the image signal processing speed is >50 frames/s. The paper is organized as follows: the target tracking system is described in Section 2, the hardware composition is presented in Section 3, the software structure and algorithm are described in detail in Section 4, Section 5 reports tests and results, and Section 6 describes future work.

System Description
As shown in Figure 1, the target tracking system in this paper mainly consists of the following parts: video camera, signal processing module, monitor and 2D-turntable. To meet practical application requirements, the TTS must have the following two performance features:
(1) Robustness. In a complex background, most applications require the tracker to be robust to partial occlusion, clutter and changes in object appearance.
(2) Real-time performance. The TTS needs to complete image signal pre-processing, target tracking, target location prediction, 2D-turntable control and other computational tasks. This requires an image processing speed of >25 frames/s, and for some special applications >50 frames/s.
The signal flow diagram of a typical target tracking system is shown in Figure 2. The TTS obtains the target image from a video camera. The tracker obtains the target location $X$ in the current image and sends it to the predictor, which predicts the target location $X_p$ in the next frame. The predicted result $\theta_c$, the desired angle $\theta$ and the feedback angle $\theta_m$ are used to control the 2D-turntable.

Signal Processing Module
The video signal processor used in this paper is the TDS642EVM multi-channel real-time image processing platform produced by Texas Instruments (TI). Its main performance features are listed in Table 1. The structure of the TDS642EVM is shown in Figure 3. The red line denotes video signal flow; the green line denotes control signal flow.

Video Camera and 2D Turntable
The pitch and yaw axes of the 2D-turntable (as shown in Figure 4) are each linked to the output shaft of a stepping motor. The 2D-turntable is controlled by controlling the two stepping motors. The turntable controller receives control instructions from the TDS642EVM through a UART and generates the pulse signal that drives the stepping motors. The rotation angle of the turntable, measured by a potentiometer, is used as the feedback for the closed-loop control system. The performance characteristics of the 2D-turntable are given in Table 2.

Software Structure and Algorithm
The structure of the TTS software is shown in Figure 5. The TTS software mainly includes the following two parts: the image tracking algorithm and the target prediction algorithm.
(1) The tracking algorithm identifies the location of the target in the current image. A fast robust MS-based target tracking algorithm is presented.
(2) The target prediction algorithm predicts the location of the target in the next image from the image sequence. Many algorithms can achieve this goal, such as the Kalman filter, the particle filter and the linear prediction method. Although the Kalman filter and particle filter [20,21] have obtained good results, both are comparatively inefficient. In this paper we use a linear prediction method to implement the target location prediction.

Mean-Shift Basis [19]
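Kernel density estimation is a nonparametric method that extracts information about the underlying structure of a data set when no appropriate parametric model is available. Given n data points $x_i$, i = 1, …, n, in the d-dimensional space $\mathbb{R}^d$, the kernel density estimate at the location x with bandwidth h can be computed, in the standard form of [19], by:
$$\hat{f}(x) = \frac{c_k}{n h^d} \sum_{i=1}^{n} k\left(\left\|\frac{x - x_i}{h}\right\|^2\right) \qquad (1)$$
where k(·) is the profile of the kernel function K(·) and $c_k$ is a normalization constant. The local modes are sought by setting the gradient of (1) equal to zero. Thus, we can derive the following equation:
$$m_G(x) = \frac{\sum_{i=1}^{n} x_i\, g\left(\left\|\frac{x - x_i}{h}\right\|^2\right)}{\sum_{i=1}^{n} g\left(\left\|\frac{x - x_i}{h}\right\|^2\right)} - x \qquad (2)$$
where g(x) = −k′(x) and $m_G(x)$ is the MS vector.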
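The mode-seeking procedure can be illustrated with a minimal Python sketch. It assumes the Epanechnikov profile, for which g = −k′ is constant inside the bandwidth window, so the MS vector reduces to the offset between x and the mean of the neighboring points. The function names, the synthetic data and the bandwidth value are illustrative choices, not part of the TTS implementation.

```python
import numpy as np

def mean_shift_vector(x, points, h):
    """MS vector m_G(x) for the Epanechnikov profile (g = 1 inside the window)."""
    d2 = np.sum(((points - x) / h) ** 2, axis=1)
    g = (d2 <= 1.0).astype(float)
    if g.sum() == 0:
        return np.zeros_like(x)
    return (g[:, None] * points).sum(axis=0) / g.sum() - x

def seek_mode(x0, points, h, tol=1e-6, max_iter=200):
    """Repeat x <- x + m_G(x) until the shift is negligible."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        m = mean_shift_vector(x, points, h)
        x = x + m
        if np.linalg.norm(m) < tol:
            break
    return x

# Synthetic data: one cluster around (3, -1); the start point is off-center.
rng = np.random.default_rng(0)
points = rng.normal(loc=[3.0, -1.0], scale=0.5, size=(500, 2))
mode = seek_mode([2.0, 0.0], points, h=1.5)
```

Each iteration moves x to the weighted mean of the samples inside the window, which is exactly the point where the MS vector vanishes for the current neighborhood.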

Target Description and Distance Metric
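According to the classical MS tracking algorithm [19], the target and candidate target feature vectors are computed as follows.
Target feature vector:
$$q_u = C \sum_{i=1}^{n} k\left(\|x_i\|^2\right) \delta[b(x_i) - u], \quad u = 1, \dots, m \qquad (3)$$
Candidate target feature vector:
$$p_u(y) = C_h \sum_{i=1}^{n_h} k\left(\left\|\frac{y - x_i}{h}\right\|^2\right) \delta[b(x_i) - u], \quad u = 1, \dots, m \qquad (4)$$
where the $x_i$ are the pixel locations (normalized and centered on the target in (3)), $n_h$ is the number of pixels in the candidate window, h is the bandwidth, δ is the Kronecker delta function, $b(x_i)$ is the quantized bin index of the pixel value at $x_i$ in the quantized feature space, and C and $C_h$ are the normalization constants.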
The similarity function defines a distance between the target model and the candidates. To accommodate comparisons among various targets, this distance should have a metric structure. We define the distance between two discrete distributions as:
$$d(y) = \sqrt{1 - \rho(y)} \qquad (5)$$
where
$$\rho(y) = \rho[p(y), q] = \sum_{u=1}^{m} \sqrt{p_u(y)\, q_u} \qquad (6)$$
is the Bhattacharyya coefficient.
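As an illustration of the feature vectors and the distance metric, the following Python sketch builds a kernel-weighted histogram and evaluates the Bhattacharyya coefficient and the distance. It is a simplified sketch under stated assumptions: an 8-bit grayscale patch is used instead of a color histogram, m = 16 bins is an arbitrary choice, and the Epanechnikov profile is scaled so that its support spans the patch.

```python
import numpy as np

def kernel_histogram(patch, m=16):
    """Kernel-weighted, normalized histogram of an 8-bit grayscale patch."""
    rows, cols = patch.shape
    yy, xx = np.mgrid[0:rows, 0:cols]
    # Normalized squared distance of each pixel from the patch center.
    r2 = ((yy - (rows - 1) / 2) / (rows / 2)) ** 2 \
       + ((xx - (cols - 1) / 2) / (cols / 2)) ** 2
    k = np.clip(1.0 - r2, 0.0, None)              # Epanechnikov profile
    bins = (patch.astype(np.int64) * m) // 256    # quantization b(x_i)
    hist = np.bincount(bins.ravel(), weights=k.ravel(), minlength=m)
    return hist / hist.sum()

def bhattacharyya(p, q):
    """Bhattacharyya coefficient between two discrete distributions."""
    return float(np.sum(np.sqrt(p * q)))

def distance(p, q):
    """Distance d = sqrt(1 - rho); clipped at zero against rounding error."""
    return float(np.sqrt(max(0.0, 1.0 - bhattacharyya(p, q))))
```

Identical distributions give a coefficient of 1 and a distance of 0, while distributions with no common bins give a coefficient of 0 and a distance of 1.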

Tracking Algorithm
To find the location of the target in the current frame, the distance (5) should be minimized as a function of y. The tracking starts from the location of the target in the previous frame and searches in the neighborhood. Minimizing the distance (5) is equivalent to maximizing the Bhattacharyya coefficient ρ(y).
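Thus, the probabilities $\{p_u(y_0)\}_{u=1,\dots,m}$ of the target candidate at location $y_0$ in the current frame are computed first. Following [19], a Taylor expansion around $p_u(y_0)$ gives the linear approximation of the Bhattacharyya coefficient (6), obtained after some manipulations as:
$$\rho[p(y), q] \approx \frac{1}{2}\sum_{u=1}^{m}\sqrt{p_u(y_0)\, q_u} + \frac{1}{2}\sum_{u=1}^{m} p_u(y)\sqrt{\frac{q_u}{p_u(y_0)}} \qquad (7)$$
This approximation is satisfactory when the candidate $\{p_u(y)\}$ does not change drastically from the initial $\{p_u(y_0)\}$. Substituting (4) into (7) gives:
$$\rho[p(y), q] \approx \frac{1}{2}\sum_{u=1}^{m}\sqrt{p_u(y_0)\, q_u} + \frac{C_h}{2}\sum_{i=1}^{n_h} w_i\, k\left(\left\|\frac{y - x_i}{h}\right\|^2\right) \qquad (8)$$
In which:
$$w_i = \sum_{u=1}^{m} \sqrt{\frac{q_u}{p_u(y_0)}}\; \delta[b(x_i) - u] \qquad (9)$$
In this way, minimizing d(y) amounts to maximizing the second term of Equation (8), which is the kernel density estimate computed with k(x) at y in the current frame, with the data weighted by $w_i$. In this process, the kernel shifts from the current location $y_0$ to the new location $y_1$. Thus we can use the MS procedure to find the maximum of the density estimate in the neighborhood:
$$y_1 = \frac{\sum_{i=1}^{n_h} x_i\, w_i\, g\left(\left\|\frac{y_0 - x_i}{h}\right\|^2\right)}{\sum_{i=1}^{n_h} w_i\, g\left(\left\|\frac{y_0 - x_i}{h}\right\|^2\right)} \qquad (10)$$
The general MS algorithm steps are as follows [19]. Given: the target model $\{q_u\}_{u=1,\dots,m}$ at $y_0$ in the previous frame; $y_1$ is the new location of the spot. The flow of the MS algorithm is then: set the spot with a feature vector $\{q_u\}_{u=1,\dots,m}$ at $y_0$ in the previous frame.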
(1) Compute the feature vector of the candidate spot.
References [27,28] show that MS is actually a bound maximization: one step of the MS iteration finds the exact maximum of a lower bound of the objective function. The existing literature [21,29-33] also shows that MS is a gradient ascent algorithm with adaptive step size. Hence, its convergence rate is better than that of conventional fixed-step gradient algorithms, and no step-size parameters need to be tuned [17]. From the viewpoint of bound optimization, the learning rate can be over-relaxed to make convergence faster.
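One localization step of the classical MS tracker [19] can be sketched in Python as follows. Each pixel in the window receives the weight sqrt(q_u / p_u(y_0)) of its own histogram bin u, and the new location is the weighted centroid of the window. The sketch rests on several assumptions that are not taken from the TTS implementation: an 8-bit grayscale frame, M = 16 bins, a square window that lies fully inside the frame, and the Epanechnikov profile (so g = 1 inside the kernel support). All function names are hypothetical.

```python
import numpy as np

M = 16  # number of histogram bins (hypothetical choice)

def profile(size):
    """Epanechnikov profile k over a size x size window, zero outside the support."""
    c = (size - 1) / 2.0
    yy, xx = np.mgrid[0:size, 0:size]
    r2 = ((yy - c) ** 2 + (xx - c) ** 2) / (size / 2.0) ** 2
    return np.clip(1.0 - r2, 0.0, None)

def bins(patch):
    """Quantization b(x_i): map 8-bit gray levels to M bins."""
    return (patch.astype(np.int64) * M) // 256

def histogram(patch):
    """Kernel-weighted, normalized histogram of a square patch."""
    h = np.bincount(bins(patch).ravel(),
                    weights=profile(patch.shape[0]).ravel(), minlength=M)
    return h / h.sum()

def mean_shift_step(frame, q, center, half):
    """One localization step: per-pixel weights, then the weighted centroid."""
    r, c = center
    patch = frame[r - half:r + half + 1, c - half:c + half + 1]
    p = histogram(patch)
    b = bins(patch)
    inside = profile(patch.shape[0]) > 0           # g = 1 inside the support
    w = np.sqrt(q[b] / np.maximum(p[b], 1e-12)) * inside
    yy, xx = np.mgrid[r - half:r + half + 1, c - half:c + half + 1]
    return np.array([(w * yy).sum(), (w * xx).sum()]) / w.sum()
```

In a full tracker this step would be repeated from the new location until the shift falls below a threshold; when the candidate histogram equals the model, all weights are 1 and the window stays where it is.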
From another point of view, bound optimization methods always adopt conservative bounds in order to guarantee that the cost function value increases at each iteration [17]. A lot of work has been done to speed up bound optimization methods. In [17,29], it was shown that acceleration can be achieved by over-relaxing the step size. Supposing $M_G$ is the MS shift vector, the over-relaxed bound optimization iteration is given by:
$$y^{(k+1)} = y^{(k)} + \alpha\, M_G\left(y^{(k)}\right) \qquad (11)$$
When α = 1, the over-relaxed optimization reduces to the standard MS algorithm. When α > 1, acceleration is realized, but for a fixed α no convergence is guaranteed and it is hard to find the optimal α [17]. References [17,31] prove that, for a general bound optimization model, convergence of the over-relaxed iteration can be secured when the candidate is close to a local maximum and 0 < α < 2. Based on this proposition, an adaptive over-relaxed bound optimization is readily available: α can be adjusted by evaluating the cost function. When the cost function becomes worse for some α > 1, α has been set too large and needs to be reduced. By immediately resetting α = 1, convergence can be achieved.
In this paper, we present the accelerated MS algorithm as follows:
1. Initialization: Set the iteration index k = 1, the step parameter β > 1, and α = 1.
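The adaptive over-relaxation idea can be sketched in Python on a generic density. The update rule used here is an assumption in the spirit of adaptive over-relaxed bound optimization, not a transcription of the algorithm steps of this paper: after every step that improves the density, α is multiplied by β; when a step with the current α fails to improve the density, the standard MS step is taken instead and α is reset to 1. A Gaussian kernel and synthetic data are used for illustration.

```python
import numpy as np

def density(y, points, h):
    """Gaussian kernel density estimate at y (up to a constant factor)."""
    d2 = np.sum((points - y) ** 2, axis=1) / h ** 2
    return np.exp(-0.5 * d2).sum()

def shift(y, points, h):
    """MS vector M_G(y) for the Gaussian profile."""
    w = np.exp(-0.5 * np.sum((points - y) ** 2, axis=1) / h ** 2)
    return (w[:, None] * points).sum(axis=0) / w.sum() - y

def accelerated_mean_shift(y0, points, h, beta=1.5, tol=1e-6, max_iter=500):
    """Adaptive over-relaxed MS; beta = 1 reproduces the standard MS iteration."""
    y = np.asarray(y0, dtype=float)
    alpha = 1.0
    for k in range(1, max_iter + 1):
        m = shift(y, points, h)
        if np.linalg.norm(m) < tol:
            return y, k
        y_try = y + alpha * m                      # over-relaxed step
        if density(y_try, points, h) > density(y, points, h):
            y, alpha = y_try, alpha * beta         # accept and grow alpha
        else:
            y, alpha = y + m, 1.0                  # fall back to the standard step
    return y, max_iter

rng = np.random.default_rng(0)
pts = rng.normal(size=(400, 2))
y_std, n_std = accelerated_mean_shift([2.0, 1.0], pts, h=1.0, beta=1.0)
y_fast, n_fast = accelerated_mean_shift([2.0, 1.0], pts, h=1.0, beta=1.5)
```

Both runs stop at the same mode; the over-relaxed run typically needs fewer iterations, at the price of one extra density evaluation per iteration.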

Case Study
We compare the performance of the accelerated MS algorithm with that of the standard MS algorithm on real images (as shown in Figure 6). In the experiments, all code runs on the EVM642 mentioned in Section 3. All tests are repeated 10 times, and the average CPU time is reported in Table 3. From the test results we can conclude that the fast MS is at least three times faster than the standard MS.
Occlusion is a technical challenge in the image tracking field, and many methods have been proposed to solve this problem. In this paper, the Bhattacharyya coefficient ρ is used to determine whether the target is occluded or lost. Given two thresholds T1 < T2, the target is considered to be occluded if T1 < ρ < T2 and lost if ρ < T1. In addition, because of changes in environment illumination and target appearance, the Bhattacharyya coefficient of the target candidate is, in general, a local maximum rather than the global maximum. When the target is occluded, the distance between the local maximum and the global maximum increases, so a special method is needed to improve the tracking robustness.
The Local Template Matching (LTM) method is used in this article to solve this problem. Template Matching (TM) is an existing algorithm and is usually applied as a global matching technique. In this paper template matching is carried out only in the region of the candidate target, so here it is called Local Template Matching. The final location (x, y) of the target is computed over a region of interest (ROI) surrounding the candidate location derived from the fast MS, as shown in Figure 7. The LTM algorithm computes a distance D(x, y) between the template and the search area, where S(x, y) is the pixel value at (x, y) in the template image, R(u+x, v+y) is the pixel value at (u+x, v+y) in the search area, and (u, v) is the candidate location derived from the fast MS.
D(x, y) is the distance in the feature space, and a smaller value indicates a higher correlation. The minimum distance $D_{Min}(x, y)$ and the corresponding location (x, y) are then determined.
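The occlusion test and the local search can be sketched in Python as follows. The sum of absolute differences is used as the distance D; this is an assumption made for illustration, since other matching criteria (such as the Euclidean distance or cross correlation) would fit the description equally well. The threshold values, the search radius and the function names are hypothetical.

```python
import numpy as np

def classify(rho, t1, t2):
    """Target state from the Bhattacharyya coefficient rho, with t1 < t2."""
    if rho < t1:
        return "lost"
    if rho < t2:
        return "occluded"
    return "tracked"

def local_template_match(image, template, u, v, radius):
    """Search offsets within +/-radius of (u, v).

    Returns the top-left corner (row, col) with the smallest sum of
    absolute differences and that minimum distance.
    """
    th, tw = template.shape
    t = template.astype(np.int64)
    best, best_d = None, None
    for x in range(-radius, radius + 1):
        for y in range(-radius, radius + 1):
            r, c = u + x, v + y
            if r < 0 or c < 0 or r + th > image.shape[0] or c + tw > image.shape[1]:
                continue                      # window would leave the image
            d = int(np.abs(image[r:r + th, c:c + tw].astype(np.int64) - t).sum())
            if best_d is None or d < best_d:
                best, best_d = (r, c), d
    return best, best_d
```

Because the search is restricted to a small neighborhood of the MS result, the number of candidate positions is (2·radius + 1)², which keeps the cost far below that of a global template search.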

Target Prediction Algorithm
In order to improve the TTS response speed it is necessary to use the prediction method in the tracking scheme. Compared to the Kalman filter and particle filter, the linear prediction algorithm is less complex and offers moderate performance. In this paper we use the linear prediction method to get the predicted angular position of the target.
The location of the target in the next image is estimated by a simple linear prediction from its locations in preceding frames; the prediction coefficients b … are a group of fixed coefficients which are set offline.
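A linear predictor of this kind can be sketched in Python as follows. The coefficient values are hypothetical: b = [2, −1] corresponds to constant-velocity extrapolation from the last two locations and merely stands in for the offline-tuned coefficients, which are not reproduced here.

```python
import numpy as np

def predict_next(history, b):
    """Predict the next (x, y) as sum_j b[j] * history[-1 - j]."""
    h = np.asarray(history, dtype=float)
    b = np.asarray(b, dtype=float)
    recent = h[::-1][:len(b)]          # most recent location first
    return (b[:, None] * recent).sum(axis=0)

# Hypothetical coefficients: constant-velocity extrapolation.
track = [(100, 50), (104, 52), (108, 54)]
predicted = predict_next(track, [2.0, -1.0])   # location (112, 56)
```

The prediction costs only a few multiply-add operations per frame, which is why it suits a real-time loop better than a Kalman or particle filter.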
The pitch and yaw angular deviations of the 2D-turntable, $\Delta\theta_x$ and $\Delta\theta_y$ respectively, are then obtained from the predicted target location.
A reliable PD controller is used for the tracking system, and the angular deviation Δθ obtained from linear prediction is used as feed-forward compensation. The final control law therefore combines the PD action on the error between $\theta$ and $\theta_m$ with the feed-forward term Δθ, where $\theta$ represents the commanded angle and $\theta_m$ represents the feedback angle. The scheme of the feed-forward compensation based PD controller is shown in Figure 8.
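A discrete PD controller with feed-forward compensation can be sketched in Python as follows. The gains, the sample period and the class name are hypothetical, and the derivative is approximated by a backward difference; the sketch only illustrates how the feed-forward term is added to the PD output for one axis.

```python
class FeedForwardPD:
    """PD controller on (theta - theta_m) plus a feed-forward term delta_theta."""

    def __init__(self, kp, kd, dt):
        self.kp, self.kd, self.dt = kp, kd, dt
        self.prev_error = 0.0

    def update(self, theta, theta_m, delta_theta):
        error = theta - theta_m                          # commanded minus feedback angle
        derivative = (error - self.prev_error) / self.dt  # backward difference
        self.prev_error = error
        return self.kp * error + self.kd * derivative + delta_theta

# Hypothetical gains and a 20 ms sample period (50 frames/s).
controller = FeedForwardPD(kp=2.0, kd=0.1, dt=0.02)
command = controller.update(theta=10.0, theta_m=8.0, delta_theta=0.5)
```

One controller instance would be used per axis (pitch and yaw), each fed with its own predicted angular deviation.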

Parameter Setting
The kernel function has an important influence on the experimental results. In this paper the Epanechnikov kernel profile is used, where z = 128 is the bandwidth of the MS tracking algorithm, which is determined by the size of the target.
x actually represents the distance between the effective pixels and the center of the tracking region.
The quantization function b maps each pixel value to a bin of the quantized feature space. The region of interest (ROI) is 20 × 20.

Experimental Results
Four experiments have been carried out to test the above target tracking scheme. A wireless remote control car (as shown in Figure 9) is used to simulate a moving target. The experiments cover four cases: tracking with the traditional MS (as shown in Figure 10), tracking under pose variation with the proposed method (as shown in Figure 11), tracking under partial occlusion with the proposed method (as shown in Figure 12), and tracking under pose variation in a complex scene with the proposed method (as shown in Figure 13).
Two rectangular boxes appear in each tracking image sequence. One represents the center of the optical system; the other represents the target location in the current image. The distance between the two rectangular boxes is used as the error signal to control the 2D-turntable. When the target is stationary, the two rectangular boxes should overlap.
From the experimental results, we can conclude that the TTS designed in this paper is robust to target pose variation and occlusion. The system processes an image in 18.21 ms in total, of which the fast MS consumes 14.6 ms, TM consumes 1.83 ms, and the other algorithms consume 1.78 ms (14.6 + 1.83 + 1.78 = 18.21 ms, or about 54.9 frames/s). The time consumption of the target tracking scheme is summarized in Table 4. The final image processing speed is therefore >50 frames/s. The experimental results indicate that our approach to tracking a moving target is fast and robust. However, the proposed algorithm still needs to be comprehensively evaluated on a wider database. Although the tracking results are promising in certain situations, further development and more evaluation are needed for severe image clutter and occlusion.

Conclusions and Future Work
In this paper, a scheme that balances the robustness and real-time performance of a TTS is presented. A novel robust tracking algorithm combining MS with template matching (TM) has been proposed, which is robust to target pose variation and partial occlusion, and a fast MS-based target tracking scheme has been designed and implemented. The hardware-in-loop simulation shows that the image signal processing speed is >50 frames/s. The TTS presented in this paper uses a common CCD camera to acquire images, but some special applications use infrared CCD sensors or heterogeneous sensors, so fast target tracking techniques based on IR CCD or heterogeneous sensors would be a future research direction.