Visual Object Tracking Robust to Illumination Variation Based on Hyperline Clustering

Abstract: Color histogram-based trackers have obtained excellent performance in many challenging situations. However, since the appearance of color is sensitive to illumination, they tend to achieve lower accuracy when illumination varies severely throughout a sequence. To overcome this limitation, we propose a novel hyperline clustering-based discriminant model, an illumination-invariant model that is able to distinguish the object from its surrounding background. Furthermore, we exploit this model and propose an anchor-based scale estimation to cope with shape deformation and scale variation. Numerous experiments on recent online tracking benchmark datasets demonstrate that our approach achieves favorable performance compared with several state-of-the-art tracking algorithms. In particular, our approach achieves higher accuracy than the compared methods in sequences with illumination variation and shape deformation.


Introduction
Visual object tracking, which aims at estimating the locations of a target object in an image sequence, is an important problem in computer vision. It plays a critical role in many applications, such as visual surveillance, robot navigation, activity recognition, intelligent user interfaces, and sensor networks [1][2][3][4][5]. Although significant progress has been made in recent years, developing a robust tracker for complex scenes remains challenging because of appearance changes caused by partial occlusions, background clutter, shape deformation, illumination changes, and other variations.
For visual tracking, a feature-based appearance model is of prime importance for representing and locating the object of interest in each frame. Most state-of-the-art trackers rely on features such as color [6,7], intensity [8,9], texture [1], Haar features [10,11], and HOG features [12,13]. Color features are insensitive to shape variation and robust to object deformation, and numerous effective color-based representation schemes have been proposed for robust visual tracking. One common method is to adopt color statistics as an appearance description. The color histogram is the most commonly used descriptor for representing an object [14]. The Distractor-Aware Tracker (DAT) uses the color histograms of distractors to distinguish object and background pixels [7]. Another successful method is to transform the color space. The Adaptive Color attributes Tracker (ACT) [6] maps the RGB value of each pixel to a probabilistic 11-dimensional color representation and learns a kernelized classifier to locate the target using the multi-dimensional color feature.
Although these color-based trackers have achieved state-of-the-art results on recent tracking benchmark datasets [15], they fail to cope with scenes where color features vary significantly, particularly under illumination changes. To solve this problem, we propose an illumination-invariant Hyperline Clustering-based Tracker (HCT). The main components of the proposed HCT are shown in Figure 1. We exploit the observation that the color distributions of the same object under different illuminations lie on the same line [16]. Using a hyperline clustering algorithm, which is able to identify the direction vectors of all hyperlines, we can track the object throughout a sequence in which the illumination is not consistent. Moreover, to distinguish the object pixels from the surrounding background regions, a Bayes classifier is trained to suppress the background hyperlines, reducing the drift problem. Owing to its robustness, the proposed approach is well suited to illumination-varying scenes such as the Singer1, Singer2, and Trans sequences.
The contributions of this paper are as follows. First, we present a light yet discriminative object observation model in which the representation of the object is formulated as the direction of hyperlines. Although it relies on direction initialization, this representation is able to distinguish the object of interest from the background and achieves competitive performance on many challenging tracking sequences. Second, a Bayes classifier is trained in advance to identify and suppress the background hyperlines, which improves the tracking robustness. Third, we adopt an anchor box scale estimation, which allows us to cope with large variations of target scale and appearance (e.g., the Trans sequence). Finally, we evaluate our approach on multiple tracking benchmarks, demonstrating favorable results against several state-of-the-art trackers.
Notation: A boldface capital letter $\mathbf{Y}$ denotes a matrix and a boldface lowercase letter $\mathbf{y}$ a vector. $\|\mathbf{Y}\|$ denotes the Euclidean norm of $\mathbf{Y}$. The transpose and complex conjugate are denoted by $\mathbf{Y}^{\top}$ and $\mathbf{Y}^{*}$, respectively. The inner product is denoted by $\langle \cdot, \cdot \rangle$. The element-wise product is denoted by $\odot$. $\mathcal{F}$ denotes the Discrete Fourier Transform (DFT) and $\mathcal{F}^{-1}$ denotes the Inverse Discrete Fourier Transform (IDFT).

Related Work
Based on object appearance models, visual tracking approaches can be classified into two families: generative and discriminative approaches. Generative approaches tackle the tracking problem by searching for the image region that is most similar to the target template. Such trackers rely on either templates or subspace models. Comaniciu et al. [17] present a histogram-based generative model with attraction of the local maxima to handle appearance changes of the object. Sevilla-Lara et al. [18] present a distribution field-based generative tracking method. Shen et al. [19] propose a generalized kernel-based mean shift tracker whose template model can be built from a single image and adaptively updated during tracking. In [20], a sparse representation-based generative model is adopted to locate the target using a sparse linear combination of the templates. Although it is robust to various occlusions, the many time-consuming sparse representation computations lead to a low frame rate. Additionally, these generative approaches fail to use background information, which is likely to alleviate drift and improve tracking accuracy.
Discriminative approaches typically train a classifier to separate the target object from the background. For example, Zhang et al. [10] train a naive Bayes classifier to locate the target in a compressive projection in which the features of the target appearance are efficiently extracted. Liu et al. [21] propose a robust tracking algorithm using a sparse representation-based voting map and sparsity-constraint-regularized mean shift. Yang et al. [22] present a discriminative appearance model based on superpixels, which is able to distinguish the target from the background with mid-level cues.
Recent benchmark evaluations [15,23,24] have demonstrated that Discriminative Correlation Filter (DCF) based visual tracking approaches achieve state-of-the-art results while operating in real time [25]. The Circulant Structure with Kernels (CSK) tracker employs a dense sampling strategy while exploiting the circulant structure with the Fast Fourier Transform to learn and track the object [26]. Its extension, the kernelized correlation filter (KCF) [12], incorporates multi-channel features via a linear kernel, achieving excellent performance while running at more than 100 frames per second. However, the standard KCF is robust only to linear scale changes, which implies inferior performance when the target undergoes large scale variations. To address this problem, Danelljan et al. [25] propose Discriminative Scale Space Tracking (DSST), which learns an explicit scale filter from the target appearance at various scales. Despite its competitive performance and efficient implementation, DSST starts to drift when the object deforms non-rigidly. To further improve the robustness of the tracker to deformation, both the Distractor-Aware Tracker (DAT) [7] and the Adaptive Color attributes Tracker (ACT) [6] adopt color-based representations that are invariant to significant shape deformation. Sum of Template And Pixel-wise LEarners (Staple) [14] combines a template model, which discriminates the object, with a color-based model, which copes with deformation, in a ridge regression framework, outperforming many sophisticated trackers. However, the color distribution is sensitive to varying illumination. Thus, color-based trackers are likely to drift when the illumination changes significantly throughout a sequence.

Hyperline Clustering-Based Tracking
The proposed Hyperline Clustering-based Tracking (HCT) is motivated by the observation that the RGB values of the same color under different illuminations lie on the same line [16]. Thus, the representation of the object can be cast as a hyperline clustering problem, as shown in Section 3.1. Furthermore, a Bayes classifier-based discriminative model, which is capable of separating the target from the background, is proposed in Section 3.2. Section 3.3 describes accurate object localization and update. Inspired by recent state-of-the-art object detection [27,28], we propose an anchor box-based scale estimation to achieve accurate tracking, as described in Section 3.4.

Hyperline Clustering Representation
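Hyperline clustering has been successfully applied in sparse component analysis [29] and image segmentation [16]. Consider a set of observed data points $\{\mathbf{y}_i\}_{i=1}^{T}$ that lie on $K$ hyperlines $L(\mathbf{l}_k)$, where $\mathbf{l}_k$ is the directional vector of the corresponding hyperline and $k = 1, \cdots, K$ (see Figure 2). K-hyperline clustering (K-HLC) aims to estimate the directional vectors $\{\mathbf{l}_k\}_{k=1}^{K}$. Mathematically, K-HLC can be cast into the following optimization problem [30]:
$$\min_{\{\mathbf{l}_k\}} \sum_{k=1}^{K} \sum_{i=1}^{T} I_{i \in \Omega_k}\, d^{2}(\mathbf{y}_i, \mathbf{l}_k), \tag{1}$$
where the indicator function $I_{i \in \Omega_k}$ is given by
$$I_{i \in \Omega_k} = \begin{cases} 1, & i \in \Omega_k, \\ 0, & \text{otherwise}, \end{cases} \tag{2}$$
and $\Omega_k$ denotes the $k$-th cluster set. For a unit-norm directional vector $\mathbf{l}_k$, the distance $d(\mathbf{y}, \mathbf{l}_k)$ from $\mathbf{y}$ to $L(\mathbf{l}_k)$ is
$$d(\mathbf{y}, \mathbf{l}_k) = \sqrt{\|\mathbf{y}\|^{2} - \langle \mathbf{l}_k, \mathbf{y} \rangle^{2}}. \tag{3}$$
A robust K-hyperline clustering algorithm was proposed in [30]. After initialization, it proceeds in a similar way to K-means clustering by alternating two steps: the cluster assignment and the cluster centroid update. In the cluster assignment step, each observed data point in $\{\mathbf{y}_i\}_{i=1}^{T}$ is assigned to the nearest hyperline:
$$\Omega_k = \{\, i : d(\mathbf{y}_i, \mathbf{l}_k) \le d(\mathbf{y}_i, \mathbf{l}_j),\ \forall j \ne k \,\}. \tag{4}$$
In the second step, the cluster centroid, i.e., the directional vector of each cluster, is obtained by eigenvalue decomposition (EVD).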
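The two alternating steps can be illustrated with a minimal Python sketch. It is not the algorithm of [30]: the multilayer initialization is replaced by user-supplied initial directions, and the EVD is replaced by a power iteration that returns the dominant eigenvector of the scatter matrix, which is adequate for small, non-negative RGB data. All function names are hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    n = math.sqrt(dot(v, v))
    return [a / n for a in v]

def dist_sq(y, l):
    """Squared distance from y to the line through the origin with unit direction l."""
    return max(dot(y, y) - dot(l, y) ** 2, 0.0)

def principal_direction(points, iters=100):
    """Dominant eigenvector of sum(y y^T) by power iteration (stand-in for an EVD)."""
    dim = len(points[0])
    scatter = [[sum(p[r] * p[c] for p in points) for c in range(dim)]
               for r in range(dim)]
    v = normalize([1.0] * dim)
    for _ in range(iters):
        v = normalize([dot(row, v) for row in scatter])
    return v

def k_hyperline(points, init_dirs, iters=10):
    """Alternate cluster assignment and direction update for a fixed number of rounds."""
    dirs = [normalize(d) for d in init_dirs]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest hyperline.
        labels = [min(range(len(dirs)), key=lambda k: dist_sq(y, dirs[k]))
                  for y in points]
        # Update step: refit each direction to the points assigned to it.
        for k in range(len(dirs)):
            members = [y for y, lab in zip(points, labels) if lab == k]
            if members:  # keep the old direction if a cluster is empty
                dirs[k] = principal_direction(members)
    return dirs, labels

# Two synthetic colors, each observed at five brightness levels (assumed data).
red = [0.9, 0.3, 0.1]
blue = [0.1, 0.4, 0.8]
brightness = [0.2, 0.4, 0.6, 0.8, 1.0]
pixels = ([[s * c for c in red] for s in brightness]
          + [[s * c for c in blue] for s in brightness])
dirs, labels = k_hyperline(pixels, init_dirs=[[1, 0, 0], [0, 0, 1]])
```

In this toy example every brightness-scaled copy of a color receives the same label, and the recovered directions coincide with the normalized base colors, which is the property the tracker relies on under illumination change.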
In the RGB color model, three primary colors (red, green, and blue) are combined to reproduce a broad array of colors [31]. Thus, the RGB vector of a color pixel can be represented by
$$\mathbf{y} = [r, g, b]^{\top}. \tag{5}$$
To illustrate the ability of hyperlines to represent color, we show a scatter plot of three regions of an image from the Trans sequence (see Figure 3). The foreground and two background regions are represented by yellow, blue, and black arrow lines, respectively. As can be observed, the pixel distributions of the three regions lie approximately on three different hyperlines. Hence, the representation of color can be treated as a hyperline clustering problem. Moreover, under an illumination change, the distribution of the same color still lies on the same hyperline (see the red arrow).
Besides, the prior probability can be approximated as $P(x \in O) \approx |O|/(|O| + |S|)$ and, analogously, $P(x \in S) \approx |S|/(|O| + |S|)$. Then, Equation (6) can be simplified as follows:
$$P(x \in O \mid O, S, b) \approx \frac{d(x, \mathbf{l}_{k_o})}{d(x, \mathbf{l}_{k_o}) + d(x, \mathbf{l}_{k_s})}. \tag{7}$$
Applying model (7), we are able to distinguish object pixels from the background region, as illustrated in Figure 4. The proposed model is capable of eliminating the effect of the background and reducing the risk of drifting. In addition, because model (7) estimates the likelihood using only the hyperline directional vectors of the object and background regions, it requires less memory to obtain an accurate estimate than other color-based algorithms.
However, it is dangerous to learn the model solely from the image regions of the first frame. To adaptively represent the changing object appearance and capture the object under different illuminations, we develop an update scheme in which the object and surrounding image hyperlines are handled independently. Because the color distribution discards the spatial position of image pixels, the proposed object hyperline representation is robust to shape deformation. Thus, the object hyperlines are fixed during the tracking process. For the surrounding hyperlines, the update scheme is summarized in Algorithm 1.

Algorithm 1. The update scheme of surrounding hyperlines.
Require: Surrounding image pixels $\{\mathbf{y}_i\}_{i=1}^{T}$, the surrounding hyperlines $L_S(\mathbf{l}_k)$, and the number of hyperlines $K$.
Ensure: The updated surrounding hyperlines $L_S(\mathbf{l}_k)$.

Localization
Similar to state-of-the-art trackers that adopt the tracking-by-detection principle [6,7,12], we iteratively localize the object in each new frame after initializing the tracker in the first frame.
At frame $t$, we use a trained classifier to predict the object location $O_t$ based on the previous object location $O_{t-1}$. Instead of using a gray-level [19] or HOG [12] representation, we perform correlation filtering on the likelihood map directly. The classifier is trained by finding a function $f(x) = \langle \mathbf{w}, \phi(x) \rangle$ that minimizes the squared error
$$\min_{\mathbf{w}} \sum_{i} \big( f(\mathbf{s}_i) - g_i \big)^{2} + \lambda \|\mathbf{w}\|^{2}, \tag{8}$$
where $\mathbf{s}_i$ are the sample likelihood patches, which are cyclic shifts of the patch at the previous location $O_{t-1}$, $g_i$ are the regression targets, and $\lambda$ is a regularization constant. The work of [12] shows that the ridge regression (8) can be simplified by calculating
$$\mathbf{w} = \sum_{i} \alpha_i\, \phi(\mathbf{s}_i),$$
where $\phi$ is the Hilbert space mapping induced by the Gaussian kernel $\kappa$. Then we derive
$$\boldsymbol{\alpha} = (\mathbf{K} + \lambda \mathbf{I})^{-1} \mathbf{g},$$
where $\mathbf{K}$ is the kernel matrix. The elements of $\mathbf{K}$ are
$$k_{ij} = \kappa(\mathbf{s}_i, \mathbf{s}_j).$$
Since $\kappa$ is shift invariant, $\mathbf{K}$ can be computed efficiently using the Fast Fourier Transform (FFT).
In the detection step, the base patch $\mathbf{z}$ is first cropped from the likelihood map $P_t$ at the previous location $O_{t-1}$. The candidate patches are cyclic shifts of $\mathbf{z}$. Then, the kernel matrix $\mathbf{K}^{z}$ is calculated by $k^{z}_{ij} = \kappa(\mathbf{z}_i, \mathbf{s}_j)$. The detection scores are obtained by
$$\hat{\mathbf{y}} = \mathbf{K}^{z} \boldsymbol{\alpha}.$$
Finally, the target location in frame $t$ is obtained by maximizing the score $\hat{\mathbf{y}}$.
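The training and detection steps can be sketched on a one-dimensional likelihood profile. The sketch below makes several simplifying assumptions that are not part of the proposed tracker: it uses a linear kernel instead of the Gaussian kernel, a naive DFT instead of an FFT, and a synthetic 1-D signal instead of a 2-D likelihood map. Function names are hypothetical.

```python
import cmath
import math

def dft(x, inverse=False):
    """Naive O(n^2) discrete Fourier transform (the inverse divides by n)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = sum(x[t] * cmath.exp(sign * 2j * math.pi * k * t / n) for t in range(n))
        out.append(acc / n if inverse else acc)
    return out

def train(s, g, lam=1e-3):
    """DFT of the dual coefficients alpha for base sample s and regression targets g."""
    k_hat = [abs(v) ** 2 for v in dft(s)]  # DFT of the linear-kernel autocorrelation
    g_hat = dft(g)
    return [gv / (kv + lam) for gv, kv in zip(g_hat, k_hat)]

def detect(alpha_hat, s, z):
    """Detection scores for all cyclic shifts of the candidate patch z."""
    s_hat, z_hat = dft(s), dft(z)
    k_hat = [sv.conjugate() * zv for sv, zv in zip(s_hat, z_hat)]  # cross-correlation
    scores = dft([kv * av for kv, av in zip(k_hat, alpha_hat)], inverse=True)
    return [v.real for v in scores]

n = 32
# Synthetic likelihood profile: a single bump centred at index 10.
s = [math.exp(-((t - 10) ** 2) / (2 * 3.0 ** 2)) for t in range(n)]
# Regression targets: a Gaussian peaked at zero shift (wrapped around the ends).
g = [math.exp(-(min(t, n - t) ** 2) / (2 * 2.0 ** 2)) for t in range(n)]
alpha_hat = train(s, g)

# The object moves 5 samples to the right in the next frame.
z = [s[(t - 5) % n] for t in range(n)]
scores = detect(alpha_hat, s, z)
shift = max(range(n), key=lambda t: scores[t])
```

Because all cyclic shifts are handled in the Fourier domain, no kernel matrix is ever formed explicitly; the index of the maximum score gives the estimated displacement.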

Anchor Box Based Scale Estimation
Motivated by recent advances in anchor boxes for object detection [27,28] and visual tracking [32], we present an anchor box-based scale estimation to predict the size of the object. A standard strategy for localizing the object at different scales is to perform scale estimation at multiple resolutions [33]. To account for scale change and geometrical deformation of the target, a feature pyramid is first extracted from a rectangular likelihood map centered on the target. At each scale of the pyramid, we predict multiple region proposals, which are called anchors. An anchor is centered at the target location and is associated with an aspect ratio. As shown in Figure 5, we first sample the likelihood map around the previously estimated target location at $S$ different scales. At each scale, we use $R$ aspect ratios, yielding $a = S \times R$ anchors over the feature pyramid. However, resampling the feature map at multiple resolutions is computationally demanding. To boost the speed of the tracker, we rescale the anchors instead of the feature map. The scale $s$ and aspect ratio $r$ of the target in the current likelihood map $P_t$ are obtained by searching for the anchor with the highest vote score, where $y$ denotes the pixel location and $A(s, r)$ denotes the anchor region at scale $s$ and aspect ratio $r$. The vote score averages the likelihood over the anchor region while penalizing the maximal region likelihood.
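The anchor search can be sketched as follows. The vote score used here is a hypothetical stand-in, not the score of the proposed method: it takes the mean likelihood inside the anchor minus the mean likelihood in a thin ring around it, so that anchors that are too small or too large are both penalized. The scale factors, the way the aspect ratio is applied (width only), the ring margin, and all function names are likewise assumptions.

```python
def make_anchors(base_w, base_h, scales, ratios):
    """One (width, height) anchor per scale and aspect-ratio pair."""
    return [(s * base_w * r, s * base_h) for s in scales for r in ratios]

def box_sum(lik, x0, y0, x1, y1):
    """Sum and pixel count of the likelihood over [x0, x1) x [y0, y1), clipped to the map."""
    rows, cols = len(lik), len(lik[0])
    x0, x1 = max(x0, 0), min(x1, cols)
    y0, y1 = max(y0, 0), min(y1, rows)
    total = sum(lik[y][x] for y in range(y0, y1) for x in range(x0, x1))
    return total, max(x1 - x0, 0) * max(y1 - y0, 0)

def vote_score(lik, center, size, margin=2):
    """Hypothetical score: mean likelihood inside the anchor minus mean in a surrounding ring."""
    cx, cy = center
    w, h = size
    x0, x1 = int(round(cx - w / 2)), int(round(cx + w / 2))
    y0, y1 = int(round(cy - h / 2)), int(round(cy + h / 2))
    in_sum, in_cnt = box_sum(lik, x0, y0, x1, y1)
    out_sum, out_cnt = box_sum(lik, x0 - margin, y0 - margin, x1 + margin, y1 + margin)
    ring_sum, ring_cnt = out_sum - in_sum, out_cnt - in_cnt
    inside = in_sum / in_cnt if in_cnt else 0.0
    ring = ring_sum / ring_cnt if ring_cnt else 0.0
    return inside - ring

# Synthetic 60 x 60 likelihood map: the object now covers 20 x 30 pixels.
lik = [[1.0 if 20 <= x < 40 and 15 <= y < 45 else 0.0 for x in range(60)]
       for y in range(60)]

scales = [k / 20 for k in range(15, 27)]   # 12 assumed scale factors, 0.75 to 1.30
ratios = [0.8, 1.0, 1.2]                   # 3 aspect ratios
anchors = make_anchors(16, 24, scales, ratios)  # previous size estimate: 16 x 24
best = max(anchors, key=lambda a: vote_score(lik, (30, 30), a))
```

With 12 scales and 3 aspect ratios the search evaluates 36 anchors on a single likelihood map, which avoids building a feature pyramid; in this example the selected anchor matches the new object extent of 20 x 30 pixels.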

Experiments
We validate the effectiveness of our Hyperline Clustering-based Tracker (HCT) on the recent tracking evaluation benchmark [15]. In Section 4.1, we describe the parameters and machine used in our experiments. Section 4.2 presents the benchmark datasets and evaluation protocols. Section 4.3 shows a comprehensive comparison of our HCT with state-of-the-art color-based methods.

Implementation Details
We set the detection region to three times the size of the previous object hypothesis $O_{t-1}$, and the surrounding region to twice the size of $O_t$. The number of scales $S$ is set to 12 and the aspect ratios $R$ to 0.8:1:1.2; thus, we have 12 × 3 = 36 anchors in each frame. Additionally, we use the regularization parameter λ = 0.001. All algorithms are tested in MATLAB R2015b and run on a Lenovo laptop with a 3.4 GHz Intel i7 CPU under Windows 7 Professional.

Experiment Setup
To test the ability of HCT to handle illumination change, we employ the color sequences from the OTB-100 dataset [15] that pose illumination variation challenges, namely Basketball, Singer1, Singer2, CarScale, Woman, and Trans. These sequences also involve other challenges, such as scale variation, occlusion, deformation, and background clutter.
To report the results of HCT, we use two standard evaluation metrics: precision and success plots. The precision plot shows the distance precision over a range of center location error thresholds. Given the center location of the tracked object $(x_t, y_t)$ and of the ground truth $(x_g, y_g)$, the location error is defined as
$$\text{location error} = \sqrt{(x_t - x_g)^{2} + (y_t - y_g)^{2}}.$$
In the success plot, the overlap precision is plotted over a range of overlap thresholds. Given the tracked bounding box $r_t$ and the ground truth bounding box $r_g$, the overlap is defined as
$$\text{overlap} = \frac{|r_t \cap r_g|}{|r_t \cup r_g|},$$
where $\cap$ and $\cup$ represent the intersection and union of two regions, respectively, and $|\cdot|$ denotes the number of pixels in the region.
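Both metrics are straightforward to compute for axis-aligned boxes. The following Python sketch assumes boxes given as (x, y, w, h) with (x, y) the top-left corner, measures areas instead of counting pixels, and uses hypothetical function names and toy data.

```python
import math

def center_error(box_t, box_g):
    """Euclidean distance between the centers of two (x, y, w, h) boxes."""
    (xt, yt, wt, ht), (xg, yg, wg, hg) = box_t, box_g
    return math.hypot((xt + wt / 2) - (xg + wg / 2), (yt + ht / 2) - (yg + hg / 2))

def overlap(box_t, box_g):
    """Intersection area divided by union area of two (x, y, w, h) boxes."""
    (xt, yt, wt, ht), (xg, yg, wg, hg) = box_t, box_g
    iw = max(0.0, min(xt + wt, xg + wg) - max(xt, xg))
    ih = max(0.0, min(yt + ht, yg + hg) - max(yt, yg))
    inter = iw * ih
    union = wt * ht + wg * hg - inter
    return inter / union if union > 0 else 0.0

def precision(tracked, truth, threshold):
    """Fraction of frames whose center location error is within the threshold."""
    errors = [center_error(t, g) for t, g in zip(tracked, truth)]
    return sum(e <= threshold for e in errors) / len(errors)

def success(tracked, truth, threshold):
    """Fraction of frames whose overlap exceeds the threshold."""
    overlaps = [overlap(t, g) for t, g in zip(tracked, truth)]
    return sum(o > threshold for o in overlaps) / len(overlaps)

# Three toy frames: a perfect hit, a 2-pixel offset, and a complete miss.
truth = [(10, 10, 20, 20)] * 3
tracked = [(10, 10, 20, 20), (12, 10, 20, 20), (40, 40, 20, 20)]
p20 = precision(tracked, truth, 20)   # two of three frames within 20 pixels
s50 = success(tracked, truth, 0.5)    # two of three frames above 0.5 overlap
```

Sweeping the threshold argument over a range of values and plotting the returned fractions yields the precision and success curves.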

Comparison with State-of-the-Art
We compare the proposed method with the aforementioned state-of-the-art trackers, including three color-based trackers, namely Staple [14], DAT [7], and ACT [6], and three gray pixel-based correlation filtering trackers, namely DSST [25], KCF [12], and CSK [26]. For all methods, we use the publicly available code and the parameters suggested by the authors.
Figure 6a shows the precision plot, illustrating the mean location error over the 6 sequences. For clarity, we only report the result of one-pass evaluation, in which the trackers are initialized in the first frame. From the figure, we can observe that the proposed HCT performs favorably compared with existing trackers. Staple, which has been shown to achieve the best performance in a recent benchmark [24], also outperforms the other existing trackers in our experiment. However, HCT outperforms the Staple tracker by 16% in average precision.
Figure 6b shows the success plots, which contain the overlap precision. The mean precision scores for each tracker are presented in the legends. Again, our proposed HCT outperforms Staple by 30% and the baseline KCF tracker by 90% in mean success rate.
Finally, we analyze the running time of HCT. For hyperline clustering, the most time-consuming calculation is the multilayer initialization [30], which involves many matrix decompositions. However, owing to the efficient localization and scale estimation, our pure MATLAB prototype of HCT runs at 15 frames per second. Additionally, since HCT only stores the hyperline vectors of the object and background, it requires less memory than other trackers. Thus, HCT can be used in time-critical applications and on embedded development platforms.
Figure 7. Comparison of the proposed approach with state-of-the-art trackers on sequences with illumination changes. The results of the distractor-aware tracker (DAT) [7], adaptive color attributes tracker (ACT) [6], kernelized correlation filter (KCF) [12], Staple [14], and the proposed approach are represented by green, yellow, blue, magenta, and red, respectively.

Conclusions
In this work, we investigate the distribution of the RGB values of color pixels under different illuminations and propose a novel Hyperline Clustering-based Tracker (HCT). Unlike other color-based trackers, which predominantly apply a simple color histogram and are sensitive to illumination changes, the proposed HCT directly extracts hyperlines from both the object and the surrounding regions to build the likelihood map, enhancing its robustness to illumination changes. The location of the object is estimated by correlation filtering. Furthermore, we propose an anchor-based scale estimation to deal with scale variation and shape deformation. Numerous experiments on the Online Tracking Benchmark demonstrate the favorable performance of the proposed HCT compared with several state-of-the-art trackers. An interesting direction for future work is to apply HCT to multi-object tracking [34] and multi-camera tracking [35], which require an illumination-invariant representation of the object.

Figure 3. Scatter plot of three different regions, which are represented by blue, yellow, and black arrows, respectively. The red arrow represents the foreground region under a different illumination. The directional vectors of the hyperlines that represent the foreground under different illuminations are approximately the same, as shown by the red lines.

Discriminative Model
To build a discriminative model that is capable of distinguishing the object from its surroundings, we propose a hyperline clustering-based Bayes classifier for visual tracking. Let $x$ denote an object pixel in a rectangular object region $O$ and let $S$ denote the surrounding region of the object. Additionally, let $b^{x}_{I,k}$ denote that pixel $x$ is assigned to the $k$-th hyperline of image $I$. Thus, we formulate the object likelihood at location $x$ by
$$P(x \in O \mid O, S, b) \approx \frac{P(b^{x}_{O,k} \mid x \in O)\, P(x \in O)}{\sum_{\Omega \in \{O, S\}} P(b^{x}_{\Omega,k} \mid x \in \Omega)\, P(x \in \Omega)}. \tag{6}$$
In particular, the likelihood terms are estimated directly from the distance in Equation (3), i.e., $P(b^{x}_{O,k_o} \mid x \in O) \approx d(x, \mathbf{l}_{k_o})/|O|$ and $P(b^{x}_{S,k_s} \mid x \in S) \approx d(x, \mathbf{l}_{k_s})/|S|$, where $|\cdot|$ denotes the cardinality, and $k_o$ and $k_s$ index the $k_o$-th and $k_s$-th hyperlines of the image regions $O$ and $S$, respectively.

Figure 4. Exemplary object likelihood map for the discriminant model illustrating the object region O and surrounding region S.

Figure 5. Visualization of anchor regions based on the feature pyramid. Red rectangles represent anchors at different aspect ratios.

Figure 6. Quantitative comparison of the proposed HCT with several state-of-the-art trackers.

Figure 7 shows a qualitative comparison of the proposed approach with existing trackers. In the Basketball sequence, all of the compared trackers perform well. However, only HCT and Staple are able to estimate the size of the object accurately. In the Singer1 sequence, the target undergoes severe illumination change and scale variation. The color-based trackers, such as ACT and DAT, start drifting from the target when the singer undergoes a severe illumination change in frame #100. Staple works well owing to its combination of HOG and color-based models. HCT is also able to estimate the size of the target, which can be attributed to the anchor-based scale estimation. In the Singer2 sequence, ACT and DAT are less effective in handling illumination change, while both HCT and Staple are able to track the target accurately. In the CarScale and Woman sequences, the targets undergo occlusion and illumination change. HCT and Staple perform well on these sequences, with higher overlap scores than the other methods. The target object in the Trans sequence undergoes shape deformation and severe illumination change. Both the color-based and the gray-based trackers, including Staple, fail to cope with shape deformation and illumination change simultaneously. However, HCT is able to track the object accurately despite the appearance change caused by deformation and illumination variation. This further confirms the effectiveness of the proposed hyperline clustering discriminative model and anchor-based scale estimation.