Context-Aware and Occlusion Handling Mechanism for Online Visual Object Tracking

Abstract: Object tracking remains an intriguing task because the target undergoes significant appearance changes due to illumination, fast motion, occlusion and shape deformation. Background clutter and numerous other environmental factors are further constraints that make developing a robust and effective tracking algorithm a riveting challenge. In the present study, an adaptive Spatio-temporal context (STC)-based algorithm for online tracking is proposed by combining a context-aware formulation, the Kalman filter, and an adaptive model learning rate. To enhance the performance of the seminal STC tracker, several contributions are made in the proposed study. Firstly, a context-aware formulation is incorporated in the STC framework to make it computationally less expensive while achieving better performance. Secondly, accurate tracking is maintained by employing the Kalman filter when the target undergoes occlusion. Finally, an adaptive update scheme is incorporated in the model to make it more robust by coping with changes in the environment. The state of the object during tracking is determined from the maximum value of the response map between consecutive frames; the Kalman filter prediction is then used as the object position in the next frame. The average difference between consecutive frames is used to update the target model adaptively. Experimental results on image sequences taken from the Temple Color (TC)-128, OTB2013, and OTB2015 datasets indicate that the proposed algorithm performs better than various algorithms, both qualitatively and quantitatively.


Introduction
Visual Object Tracking (VOT) is an active research topic in computer vision and machine learning due to extensive applications in areas including gesture recognition [1], sports analysis [2], visual surveillance [3], medical diagnosis [4], autonomous vehicles [5,6] and radar navigation systems [7][8][9]. Various factors such as partial or full occlusion, background clutter, illumination variation, deformation and other environmental conditions complicate the tracking problem [10][11][12]. Tracking methods are categorized as generative [13] and discriminative methods [14]. Generative tracking methods focus on constructing an appearance model for target representation and take search regions with high scores as results. Discriminative tracking methods treat object tracking as a classification problem by distinguishing the target from its background. Both types of tracking approaches are widely referred to in the literature and have their own pros and cons in various scenarios. Generative trackers perform better when only small training data are available. However, these trackers consider only object similarity, which leads to loss of useful information around the target and may cause the tracker to drift when the target undergoes occlusion or scale variation. Discriminative trackers, in contrast, perform better with large training data, but they cannot adapt adequately when the appearance of the target changes; consequently, tracking is affected when the target changes its shape or size during motion [15].

Related Work
With recent advancement in visual tracking, various competitive methods have been proposed for target tracking. Zhang et al. [16] proposed network padding, stride and respective field size-based network architecture for Siamese trackers. Rahman et al. [17] proposed a Siamese network-based tracker, which utilizes an attention module inside the feature refine network to discriminate between the target and background. Zhang et al. [18] proposed a tracking method which constructs a correlation filter learning model by using handcrafted features extracted from a convolutional neural network and uses hierarchical peak to side lobe ratio (PSR) for activation of the classifier. Dai et al. [19] presented adaptive regularization in a correlation filter which can learn and update the target model according to appearance variations during tracking. Javed et al. [20] proposed a deep correlation filter-based tracking method, by utilizing both forward and backward tracking information between the regression target and response map. Despite the fact that deep learning-based methods achieve favorable results, the complexity of these methods is still higher with the requirement of offline training.
Zhang et al. [21] proposed a fast algorithm which effectively uses Spatio-temporal context (STC) information for online tracking by modeling the Spatio-temporal relationships between the target and its local contexts in a Bayesian framework. The tracking problem is resolved by maximizing the confidence map, which uses target location prior information. Tian et al. [22] proposed an enhanced STC tracker to address occlusion through the incorporation of a patch-based occlusion detection mechanism in the STC framework. Chen et al. [23] proposed an improved STC tracker to address occlusion by incorporating a Kalman filter for prediction of the target location in case of occlusion. Munir et al. [24] proposed a modified STC tracker to address occlusion by incorporating a Kalman filter for prediction of the target location in case of occlusion and implemented it for a real-time eye tracking application. Cui et al. [25] proposed an amended STC tracker to address the limitation of full occlusion. They incorporated an occlusion detection mechanism consisting of three stages, during which motion and template update information is stored and used when the target is occluded. Yang et al. [26] proposed an enhanced STC tracker to address occlusion by incorporating a PSR-based occlusion feedback mechanism for the model and scale update in the STC framework. Yang et al. [27] proposed an improved STC tracker to address occlusion by incorporating a Kalman filter for prediction of the target location and using Euclidean distance to detect occlusion. Zhang et al. [28] proposed a motion aware correlation filter (MACF) which predicts the position and scale of the target in the next frame by utilizing instantaneous motion estimation.
Lu et al. [29] proposed RetinaTrack, an efficient joint model for detection and tracking which modifies single stage RetinaNet to instance level embedding training. Henriques et al. [30] proposed a tracking by detection framework with a kernel trick and histogram of oriented gradients feature to track the object. Ahmed et al. [31] proposed a real-time correlation-based tracking framework by utilizing open loop control strategy, so that the target is always at the center of frame. Moreover, a video stabilization method was incorporated to eliminate the vibration at low computational cost. Ma et al. [32] proposed a long-term correlation filter tracker (LCT) which decomposed the tracking problem into estimation of translation and scale, and redetects the target by online training of a random fern classifier. Masood et al. [33] proposed tracking framework which uses a maximum average correlation height (MACH) filter for detection and proximal gradient algorithm-based particle filter for tracking.
Zhou et al. [34] proposed an STC learning algorithm with multichannel features and an improved adaptive scheme for scale by using a histogram of oriented gradients feature along with color naming and using kernel methods in the STC framework to improve tracking performance. Khan et al. [35] proposed an improved tracking algorithm based on LCT. They incorporated the Kalman filter in the LCT framework for occlusion handling and PSR of the response map for occlusion detection. Ali et al. [36] proposed a tracking algorithm that combines the mean-shift tracker, Kalman filter, and correlation filter heuristically. It updates the template based on the change in the appearance model of the target and computes similarity for each forthcoming frame based on the current frame similarity value.
Mueller et al. [37] proposed a context-aware framework for correlation filter trackers by reformulating the original optimization problem for single and multidimensional features in both primal and dual domains. Qi et al. [38] proposed an improved STC algorithm through incorporation of a context-aware correlation filter in the STC framework. Zhang et al. [39] proposed an improved STC algorithm by incorporating color naming and histogram of oriented gradients features in the STC framework, along with an improved scale strategy and an adaptive model update scheme. Shin et al. [40] proposed an improved KCF-based tracking algorithm. They incorporated a module for detection of tracking failure, a mechanism for re-tracking in multiple search windows, and analysis of motion vectors for selecting the search window in the KCF framework. Based on the literature presented, it can be concluded that significant modifications have been made to the STC algorithm in terms of model updates, incorporation of occlusion detection and handling mechanisms, utilization of contextual information, fusion of various cues and features such as histogram of oriented gradients and color naming, combination with deep learning techniques, and incorporation of adaptive learning rate mechanisms.
The STC algorithm proposed by Zhang et al. [21] utilizes the fast Fourier transform for detection, and context information around the target plays a vital role in the tracking. The basic idea of STC is to use background information around the target area in consecutive frames; the target model is updated based on spatial context information. However, STC cannot cope effectively when the model is updated on inaccurate measurements caused by occlusion, background clutter or fast motion. A context-aware formulation can be applied efficiently to deal with background clutter, and the maximum value of the response map can be used to detect occlusion, after which the Kalman filter can be applied for occlusion handling. The model update can also be related to the motion of the target: the STC model is updated at a fixed learning rate, making it vulnerable to target motion, so the tracking model should be updated adaptively on the basis of target motion.

Our Contributions
In this paper, an improved Spatio-temporal context-based tracking algorithm is proposed. It combines with STC a context-aware formulation, the Kalman filter, and an adaptive learning rate mechanism based on the average difference between consecutive frames. Our approach utilizes a correlation filter-based context-aware formulation, making it effective at utilizing context information while remaining computationally inexpensive. In addition, the Kalman filter is fused into the tracking framework for occlusion handling. Moreover, an adaptive learning rate mechanism is incorporated to update the model according to changes in the environment. Experimental results are presented on de facto standard videos to show the efficacy of the proposed ideas in comparison with various state-of-the-art tracking methods.

Paper Outline
The rest of the article is organized as follows: a brief explanation of Spatio-temporal context tracking and correlation filtering is given in Section 2. Section 3 describes context-aware tracking, the Kalman filter, the occlusion detection mechanism and the adaptive model learning rate through an explanation of the proposed method for online tracking. The experiments and performance analysis are discussed in Sections 4 and 5. Section 6 provides experimental results, Section 7 provides a discussion, and Section 8 concludes the article.

STC Based Tracking
In visual object tracking, the target is characterized by the objects present around it in the current frame; the area around the target is called the context. Within this context, various temporal and spatial relationships exist across continuous frames. The STC tracking algorithm is based on a Bayesian framework that finds the target location accurately on the basis of background knowledge, formulating the task as maximizing a confidence map in every frame. In the current frame, the target location is represented by x*, with its context feature set defined as X^c = {y(i) = (I(i), i) | i ∈ Ω_c(x*)}, where I(i) is the image grey-scale value at location i and Ω_c(x*) is the context region around the target center x*, as shown in Figure 1. The confidence map of the target location is described in (1).
where j denotes the target, P(y(i)|j) is the context prior model that represents the appearance of the context, and P(x, y(i)|j) is the spatial context model that formulates the spatial relation between the object location and its context information. It helps identify and resolve ambiguities arising from different image measurements. The goal of this tracking problem is to learn the spatial context model P(x, y(i)|j).
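For reference, a sketch of the decomposition referenced as (1), following the original STC formulation [21]; the confidence map is written c(x) here to avoid clashing with the context features y(i):

```latex
% confidence map: likelihood of the target being at location x, marginalized
% over the context features and factored into spatial relation x prior
c(x) = P(x \mid j)
     = \sum_{y(i) \in X^{c}} P\big(x, y(i) \mid j\big)
     = \sum_{y(i) \in X^{c}} P\big(x \mid y(i), j\big)\, P\big(y(i) \mid j\big)
```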

Confidence Map
The confidence map function y(x) is presented in (2).
where r is a normalization constant, α is a scale parameter and ξ is a shape parameter. The problem of location ambiguity occurs frequently in object tracking; appropriate selection of the shape parameter can resolve it and is helpful in learning the spatial context model. Setting ξ > 1 results in over-smoothing of the confidence map near the center, thereby increasing location ambiguity, whereas ξ < 1 generates a sharp peak response in which few positions are activated while learning the spatial context. For these reasons, STC uses ξ = 1.
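A sketch of the confidence map model in (2), following [21], with the symbols defined above:

```latex
% confidence map: peaked at the target center x*, with scale alpha and
% shape parameter xi controlling the sharpness of the peak
y(x) = r\, e^{-\left|\frac{x - x^{*}}{\alpha}\right|^{\xi}}
```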

Context Prior Model
To learn the spatial context model, the context prior model needs to be calculated first. It is modeled using an image intensity function to represent the target appearance, together with a Gaussian weighting function, as given in (3) and (4).
where d is a normalization constant that restricts (4) to the range 0–1 and σ is a scale parameter. The closer a context location i is to the current target location x*, the larger the weight set for predicting the target location in the next frame.
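A sketch of (3) and (4) as they appear in the STC formulation [21], using the symbols above; the context prior weights the image intensity by a Gaussian centered on the target:

```latex
% (3) context prior: intensity weighted by a Gaussian around the target center
P\big(y(i) \mid j\big) = I(i)\,\omega_{\sigma}\big(i - x^{*}\big)

% (4) Gaussian weighting function with normalization constant d and scale sigma
\omega_{\sigma}(i) = d\, e^{-|i|^{2} / \sigma^{2}}
```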

Learning Spatial Context Model
The spatial context model is defined by the conditional probability function presented in (5).
where ⊗ is the convolution operator in (6). To improve calculation speed, the fast Fourier transform (FFT) is used, as presented in (7).
where F denotes the FFT operation and ⊙ denotes element-wise multiplication. Solving (7) for the spatial context model gives (8).
where F^(-1) denotes the inverse FFT in (8). The spatial context model h^sc learns the relative spatial relations between different pixels in the Bayesian framework.
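The chain from (5) to (8) can be sketched as follows, following [21], with c(x) the confidence map, ⊗ convolution and ⊙ element-wise multiplication:

```latex
% (5)-(6): confidence map as a convolution of the spatial context model
% with the weighted context prior
c(x) = \sum_{i \in \Omega_{c}(x^{*})} h^{sc}(x - i)\, I(i)\, \omega_{\sigma}(i - x^{*})
     = h^{sc}(x) \otimes \big( I(x)\, \omega_{\sigma}(x - x^{*}) \big)

% (7): convolution becomes element-wise multiplication in the Fourier domain
F\big(c(x)\big) = F\big(h^{sc}(x)\big) \odot F\big( I(x)\, \omega_{\sigma}(x - x^{*}) \big)

% (8): closed form of the spatial context model
h^{sc}(x) = F^{-1}\!\left( \frac{F\big(c(x)\big)}{F\big( I(x)\, \omega_{\sigma}(x - x^{*}) \big)} \right)
```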

Model Update
In the STC model, tracking is considered a detection task. The target position is initialized in the first frame. At the t-th frame, the STC model H^stc_(t+1)(x) is updated using the spatial context model h^sc_t(x). The target center position x*_(t+1) in the (t + 1)-th frame is then attained by computing the maximum of the confidence map given in (9).
The confidence map y_(t+1)(x) at frame t + 1 is calculated as described in (10).
Here, H^stc_(t+1) derives from the spatial context model h^sc_t and is able to reduce the noise caused by abrupt appearance changes of I_(t+1). The STC model is updated as mentioned in (11).
where ρ is the learning rate and h^sc_t is the spatial context model computed in (8).
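A sketch of the update equations (9)–(11), following [21]:

```latex
% (9): new target center = maximum of the confidence map over the context region
x^{*}_{t+1} = \arg\max_{x \in \Omega_{c}(x^{*}_{t})} y_{t+1}(x)

% (10): confidence map at frame t+1 from the learned STC model
y_{t+1}(x) = F^{-1}\!\left( F\big(H^{stc}_{t+1}(x)\big) \odot
             F\big( I_{t+1}(x)\, \omega_{\sigma_{t}}(x - x^{*}_{t}) \big) \right)

% (11): temporal low-pass update of the STC model with learning rate rho
H^{stc}_{t+1} = (1 - \rho)\, H^{stc}_{t} + \rho\, h^{sc}_{t}
```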

Correlation Filter Tracking
Correlation filters use sampling methods to discriminate the target position from the region of interest in consecutive frames at low computational cost. They model all possible translations of the target in the search window as circular shifts and concatenate them to form a square matrix A_0, which facilitates computing the Fourier-domain solution to the ridge regression problem given in (12).
In (12), the learned correlation filter is denoted by the vector w. The square matrix A_0 contains all circular shifts of the image patch, and the regression target y is a vectorized image of a 2D Gaussian. Let x(j) be the j-th component of the vector x and x* its complex conjugate; its Fourier transform F^H x is denoted x̂. Equation (12) can then be solved using (13).
The objective in (12) is convex and has a unique global minimum. Equating its gradient to zero leads to a closed-form solution for the filter, as given in (14).
As A_0 is circulant, (14) can be diagonalized and its solution in the Fourier domain is given in (15). The location of the target coincides with the location of the maximum response when (15) is convolved with the search window of the next frame. The detection formula is given in (16).
where Z is the search window circulant matrix.
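The Fourier-domain solution (15) and detection step (16) can be sketched in a few lines of NumPy. This is an illustrative single-channel implementation, not the authors' MATLAB code; the function names and the regularization value are assumptions.

```python
import numpy as np

def train_filter(patch, target, lam=1e-3):
    """Closed-form ridge regression in the Fourier domain, cf. (15):
    diagonalizing the circulant matrix A_0 gives, per frequency,
    w_hat = conj(x_hat) * y_hat / (conj(x_hat) * x_hat + lam)."""
    x_hat = np.fft.fft2(patch)
    y_hat = np.fft.fft2(target)
    return np.conj(x_hat) * y_hat / (np.conj(x_hat) * x_hat + lam)

def detect(w_hat, window):
    """Detection step, cf. (16): correlate the learned filter with a search
    window; the peak of the real-valued response map is the target location."""
    response = np.real(np.fft.ifft2(w_hat * np.fft.fft2(window)))
    return np.unravel_index(np.argmax(response), response.shape)
```

Correlating the filter with the very patch it was trained on approximately recovers the Gaussian regression target, so the detected peak sits at the target center.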

Proposed Solution
In this section, the proposed tracker is introduced in detail. First, a context-aware object tracking model is investigated. Second, a Kalman filter-based motion estimation model is discussed. Third, a model update scheme based on the average difference of consecutive frames is presented. Finally, the tracker is summarized in Algorithm 1. Figure 2 shows the flowchart of the proposed algorithm.

Context-Aware Tracking Framework
Context information around the target elevates tracking performance. Therefore, it is added to the solution of the context-aware correlation filter, as given in (17).
It should be noted that there are other possible choices for incorporating the context term; however, they lead to constrained convex optimization requiring an iterative solution, which is quite slow. Once the position for the current frame has been computed by STC, the filter w is trained such that its response on the background context patches A_i is as small as possible. The objective function can be rewritten by forming a new data matrix B ∈ R^((k+1)n×n), which consists of the target and context patches, as given in (18).
Similar to the correlation filter, the function in (18) is convex and minimized by setting the gradient to zero. It is presented in (19).
Similar to (12), using (13) the Fourier-domain closed-form solution is determined as described in (20). The target window and its position are updated according to (20). Based on the target position, the confidence map and the STC model in (9) and (11) are updated accordingly.
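Under the same single-channel assumptions as before, the closed form (20) differs from (15) only by the extra context energy in the denominator; a minimal sketch follows, where the λ1 and λ2 values are illustrative:

```python
import numpy as np

def train_ca_filter(target_patch, context_patches, target, lam1=1e-3, lam2=0.25):
    """Context-aware closed form, cf. (20) and Mueller et al. [37]: responses
    on the k context patches A_1..A_k are driven toward zero by adding
    lam2 * |a_i_hat|^2 terms to the ridge regression denominator."""
    a0_hat = np.fft.fft2(target_patch)
    y_hat = np.fft.fft2(target)
    denom = np.conj(a0_hat) * a0_hat + lam1
    for patch in context_patches:
        a_hat = np.fft.fft2(patch)
        denom = denom + lam2 * np.conj(a_hat) * a_hat
    return np.conj(a0_hat) * y_hat / denom
```

The filter's peak response on the target patch stays at the regression-target center, while its response on the context patches is suppressed.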

Kalman Filter-Based Motion Estimation Model
The Kalman filter is an optimal filter which minimizes the difference between the true and estimated states. It consists of four processes: (1) initial guess of the state vector and state error covariance; (2) forward propagation of the state vector and state error covariance by one time step; (3) estimation of the Kalman gain based on the state error covariance and measurement noise covariance; (4) update of the state vector and state error covariance based on the estimated output and the Kalman gain [41]. A constant-velocity motion model is used due to its simplicity and effectiveness in describing the motion of the target. The filter operates in two stages: prediction and correction.

Kalman Filter Prediction
During this stage, uncertainty about the target is determined by both the state and covariance predictions. The current system state predicts the position based on the previous state. Similarly, the predicted covariance is calculated by propagating the covariance matrix from the previous iteration through the state transition matrix and adding the process noise Q. The prediction equations are described in (21) and (22).
where X_t is the target state vector, A is the state transition matrix and B u_(t−1) is the control input term (treated here as noise).
where S_t is the predicted error covariance and Q is the covariance of the process noise.
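A sketch of the prediction equations (21) and (22) in standard Kalman filter notation, matching the symbols above:

```latex
% (21): state prediction from the previous state
X_{t} = A\, X_{t-1} + B\, u_{t-1}

% (22): covariance prediction with process noise Q
S_{t} = A\, S_{t-1} A^{\mathsf{T}} + Q
```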

Kalman Filter Correction
The position of the target obtained from STC is used as the measurement value Y_t. By combining it with the predicted result, the Kalman gain can be calculated as described in (23).
where R is the measurement noise covariance. The estimate is updated by combining the predicted estimate with the measurement, as given in (24).
The difference (Y_t − H X_t) is called the measurement innovation, or residual; it reflects the discrepancy between the predicted measurement H X_t and the actual measurement Y_t. The error covariance is updated using (25).
where S_(t+1) is the updated error covariance, H is the measurement matrix relating the state to the measurement, and K_t is the Kalman gain.
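Equations (21)–(25) with a constant-velocity model can be sketched as follows; the state layout, unit time step and noise magnitudes are illustrative assumptions, not the paper's tuned values:

```python
import numpy as np

class ConstantVelocityKF:
    """Constant-velocity Kalman filter for a 2D target position, a sketch of
    (21)-(25). State X = [x, y, vx, vy]; measurement Y = [x, y]."""

    def __init__(self, x0, y0, q=1e-2, r=1.0):
        self.X = np.array([x0, y0, 0.0, 0.0])
        self.S = np.eye(4)                      # state error covariance
        self.A = np.array([[1, 0, 1, 0],        # state transition (dt = 1)
                           [0, 1, 0, 1],
                           [0, 0, 1, 0],
                           [0, 0, 0, 1]], dtype=float)
        self.H = np.array([[1, 0, 0, 0],        # measurement matrix
                           [0, 1, 0, 0]], dtype=float)
        self.Q = q * np.eye(4)                  # process noise covariance
        self.R = r * np.eye(2)                  # measurement noise covariance

    def predict(self):
        # (21)-(22): propagate state and covariance forward one frame
        self.X = self.A @ self.X
        self.S = self.A @ self.S @ self.A.T + self.Q
        return self.X[:2]

    def correct(self, Y):
        # (23): Kalman gain from predicted covariance and measurement noise
        K = self.S @ self.H.T @ np.linalg.inv(self.H @ self.S @ self.H.T + self.R)
        # (24): update the estimate with the measurement innovation (Y - H X)
        self.X = self.X + K @ (np.asarray(Y) - self.H @ self.X)
        # (25): update the error covariance
        self.S = (np.eye(4) - K @ self.H) @ self.S
        return self.X[:2]
```

Fed consistent constant-velocity measurements, the predicted position converges to the true trajectory, which is what makes the filter usable as a stand-in position source during occlusion.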

Occlusion Detection
When the target undergoes occlusion, the STC model is updated incorrectly and the target is lost. To detect occlusion, the maximum value of the response map is used, which changes with the state of the target: if the target is occluded, the value of the response map is small; when the target reappears, its value increases. The value of the response map therefore determines whether the target is tracked by STC or by the Kalman filter. For a given input image sequence, first the confidence map is calculated in the frequency domain, then the Spatio-temporal model is learned for tracking. If the target is severely occluded, the Kalman filter predicts the position for the next frame and updates the STC model through a feedback loop; the context-aware filter template is updated accordingly, and the Kalman filter prediction is taken as the observation of the target position for the next frame.
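The switching logic above can be summarized in a small helper; the numeric threshold is an assumption for illustration, since the section does not state the value used:

```python
def select_tracker(response_peak, occlusion_threshold=0.1):
    """Response-map-based occlusion test: a low confidence peak is taken to
    mean the target is occluded, so the Kalman prediction replaces the STC
    estimate for the next frame. The threshold value is an assumption."""
    return "kalman" if response_peak < occlusion_threshold else "stc"
```

In the tracking loop, the returned flag decides both which position is reported and whether the model update is driven by the STC detection or by the filter's feedback.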

Adaptive Learning Rate
During object tracking, the motion of the target changes in each frame of the image sequence. Therefore, it is necessary to update the target model correctly. In STC, the learning rate is fixed, making it vulnerable to appearance changes in the environment. To make it adaptive, a mechanism based on the average difference of two consecutive frames is incorporated [39], as given in (26).
where I_ij is the pixel value and M × N is the size of the image. The learning rate is adjusted as given in (27).
The value of the learning rate ρ is assigned on the basis of er using (27). Algorithm 1 summarizes the overall procedure: for frames 1 to n, it begins by calculating the context prior model using (3).
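Equation (26) and a rule in the spirit of (27) can be sketched as follows; the breakpoints and rate values are illustrative assumptions, not the paper's exact settings:

```python
import numpy as np

def average_frame_difference(frame_prev, frame_curr):
    """er in (26): mean absolute pixel difference between consecutive frames,
    i.e. (1 / (M * N)) * sum |I_t(i, j) - I_(t-1)(i, j)|."""
    diff = np.abs(frame_curr.astype(float) - frame_prev.astype(float))
    return diff.mean()

def adaptive_rho(er, rho_base=0.075, low=2.0, high=10.0):
    """Illustrative piecewise rule in the spirit of (27): a small inter-frame
    change keeps the base rate, a large change damps the model update.
    The breakpoints low/high are assumptions."""
    if er < low:
        return rho_base
    if er < high:
        return rho_base / 2
    return rho_base / 4
```

Damping the update when consecutive frames differ strongly keeps inaccurate measurements (e.g., during fast motion) from overwriting the target model.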

Experiments
To verify the performance of the proposed tracker both qualitatively and quantitatively, it is tested on several image sequences with complex conditions such as occlusion, illumination variation, deformation and clutter background. The proposed method is implemented in MATLAB 2016a. The experimental setup is Intel Core i3 2.30 GHz CPU with 4GB RAM.

Evaluation Criteria
Two criteria were used to evaluate the algorithm: the center location error (CLE) and the distance precision rate (DPR). The CLE is defined as the Euclidean distance between the target position estimated by the tracking algorithm and the ground truth. The calculation formula is given in (28).
where (x_i, y_i) are the tracker positions and (x_gt, y_gt) are the ground-truth values.
The distance precision rate (DPR) is the percentage of frames in which the distance between the estimated location and the ground truth is within a threshold of 20 pixels.
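Both criteria follow directly from (28); a short sketch with illustrative function names:

```python
import numpy as np

def center_location_error(preds, gts):
    """CLE (28): Euclidean distance between predicted and ground-truth centers,
    computed per frame."""
    preds = np.asarray(preds, dtype=float)
    gts = np.asarray(gts, dtype=float)
    return np.sqrt(((preds - gts) ** 2).sum(axis=-1))

def distance_precision_rate(preds, gts, threshold=20.0):
    """DPR: fraction of frames whose CLE is within the threshold (20 px)."""
    errors = center_location_error(preds, gts)
    return float((errors <= threshold).mean())
```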

Performance Analysis
The DPR comparison is given in Table 1. In the sequences Cardark, Cup, Jogging-1, Juice, and Man, the proposed tracker outperforms MOSSE_CA, STC, MACF, and DCF_CA. In the sequences Carchasing_ce3 and Plate_ce2, all tracking methods have similar performance. In the sequence Busstation_ce2, the proposed tracker has a slightly lower precision value. However, the proposed tracker has a higher mean value than the other tracking methods.

The average center location error comparison is given in Table 2. In the sequences Busstation_ce2, Cup, Jogging-1, and Man, the proposed tracker outperforms STC, MOSSE_CA, MACF, and DCF_CA. In the sequences Carchasing_ce3, Cardark, Juice, and Plate_ce2, the proposed tracker has a slightly higher error value. However, the proposed tracker has the lowest mean error among the compared tracking methods.

The precision plots are shown in Figure 3. These plots provide frame-by-frame precision over entire image sequences. Since precision gives the mean value over an entire sequence, the tracker may drift for a few frames and then track the target correctly again; these plots are therefore presented to show the efficacy of the tracking method. Various challenges were present in the sequences, such as occlusion, illumination variations and background clutter. In the sequences Carchasing_ce3, Cardark, Cup, Jogging-1, Juice, Man, and Plate_ce2, the proposed tracker has the highest precision over the entire sequence. In the sequence Busstation_ce2, the proposed tracker has slightly lower precision.

The location error plots are shown in Figure 4. These plots provide frame-by-frame error over entire image sequences. Since the average center location error gives the mean error over an entire sequence, the tracker may drift for a few frames and then track the target correctly again; these plots are therefore presented to show the effectiveness of the tracking method. Various challenges were present in the sequences, such as occlusion, illumination variations and deformation. In the sequences Busstation_ce2, Cup, Jogging-1, and Man, the proposed tracker has the lowest error over the entire sequence. In the sequences Carchasing_ce3, Cardark, Juice, and Plate_ce2, the proposed tracker has a slightly higher error.

Experimental Results
Qualitative results of the proposed tracker and four state-of-the-art trackers over eight image sequences are shown in Figure 5. The sequences involve various challenges such as partial or full occlusion, illumination variations and background clutter. DCF_CA and MOSSE_CA contain tracking components similar to our approach, i.e., correlation filtering and a context-aware formulation. However, the correlation filter in MOSSE_CA and DCF_CA is not robust to motion blur (cup), illumination variations (man, cardark) or occlusion (jogging-1, busstation_ce2). In carchasing_ce3 and plate_ce2, where the target undergoes scale variations, both MOSSE_CA and DCF_CA perform similarly to the proposed tracker. With the joint use of an instantaneous motion model and a Kalman filter in a discriminative scale-space tracking framework, MACF performs better on various challenging sequences. However, MACF tends to drift when the target undergoes occlusion and fails to recover from tracking failures (jogging-1). Although STC can estimate scale, it does not perform well under motion blur (juice) or scale variations (cup), because it only uses intensity features and estimates scale from the response map of a single translation filter. Moreover, it does not deal effectively with occlusion (jogging-1, busstation_ce2), as no occlusion handling mechanism is present, and its target model is updated at a fixed learning rate, making it vulnerable to the background environment. The proposed tracker performs well in all these challenging sequences. This performance can be attributed to three reasons. First, the context-aware formulation incorporated in the STC framework makes it less sensitive to illumination variation (cardark, man) and motion blur (juice, man, cup). Second, occlusion detection based on the response map and occlusion handling using the Kalman filter make it effective against partial or full occlusion (jogging-1, busstation_ce2). Third, the fusion of an adaptive learning rate into the model update is effective in dealing with scale variation and fast motion (plate_ce2).

Discussion
We discuss several observations from the experimental and quantitative analysis. First, trackers with a context-aware formulation in the correlation filter outperform trackers without it; this can be attributed to the fact that correlation filters regress all circular shifts of the target appearance model, so explicitly including context patches suppresses background responses. Second, trackers with occlusion detection and handling modules outperform trackers without them, since the occlusion detection and handling mechanism prevents the tracker from drifting. Third, trackers with an adaptive learning rate mechanism perform better than those with a fixed learning rate, because the tracking model copes with changes in the environment.

Conclusions
In the present article, an adaptive Spatio-temporal context (STC)-based algorithm for online tracking is presented, which combines a context-aware formulation, the Kalman filter, response map-based occlusion detection, and an average difference-based adaptive model update in the STC framework. The algorithm performs better than various other algorithms in scenarios such as full occlusion, illumination variation, deformation, and background clutter, while achieving efficient performance on the evaluated datasets. Even though the tracker achieves the desired performance, the target may still be lost in some cases, such as motion blur, fast motion, and scale variation. This problem could be addressed by developing neural network-based algorithms [7][8][9] to improve robustness and tracking accuracy.