Spatio-Temporal Context, Correlation Filter and Measurement Estimation Collaboration Based Visual Object Tracking

Despite eminent progress in recent years, object tracking algorithms still face various challenges such as scale variations, partial or full occlusions, background clutter, and illumination variations, which must be resolved with improved estimation for real-time applications. This paper proposes a robust and fast object tracking algorithm based on spatio-temporal context (STC). A pyramid representation-based scale correlation filter is incorporated to overcome the STC tracker's inability to handle rapid changes in target scale. It learns the appearance induced by variations in the target scale, sampled at a set of different scales. During occlusion, most correlation filter trackers start drifting due to incorrect sample updates. To prevent the target model from drifting, an occlusion detection and handling mechanism is incorporated. Occlusion is detected from the peak correlation score of the response map. After occlusion is successfully detected, an extended Kalman filter is used for occlusion handling: it continuously predicts the target location during occlusion and passes it to the STC tracking model. This decreases the chance of tracking failure, as the Kalman filter continuously updates itself and the tracking model. The model is further improved by fusing an average peak to correlation energy (APCE) criterion, which automatically updates the target model to deal with environmental changes. Extensive evaluations on benchmark datasets demonstrate the efficacy of the proposed tracking method against the state of the art.


Introduction
Visual object tracking (VOT) has emerged as a dynamic study area due to its utilization in a wide range of applications such as human action recognition [1][2][3], traffic monitoring [4,5], pellet ore phase [6], smart city [7], embedded system [8], surveillance [9][10][11] and medical diagnosis [12,13]. While significant progress has been made in recent years, accurate estimation for tracking an object in a video sequence remains a challenge due to various factors such as scale variations, occlusion, deformation, and background clutter, to name a few [14][15][16]. Target tracking methods, classified as generative [17] and discriminative [18], are widely referred to in the literature with prominent applications. Generative tracking methods learn the appearance model of the target and search for the highest matching score; they achieve good tracking results at the expense of computational cost. Discriminative tracking methods treat tracking as a binary classification problem and achieve favorable results. However, their tracking performance may degrade when the training data are limited.
Tracking algorithms based on STC utilize the fast Fourier transform (FFT) to accelerate calculations. Zhang et al. [19] proposed a spatio-temporal context (STC) based tracking model by formulating the temporal relation between the target and its context in a Bayesian framework; the model's confidence map is maximized to determine the target location, after which the tracking model and scale are updated. Building on [19], Zhang et al. proposed an adaptive STC model for online tracking by incorporating histogram of oriented gradients (HoG) features and color naming (CN) features in the STC framework. They also used the average difference between adjacent frames to adjust the learning rate when the model is updated [20]. To further improve tracking performance in the STC framework, Wang et al. [21] proposed an improved tracking model that combines STC with a convolutional neural network (CNN) to extract deep CNN features online without training. A motion vector-based mechanism for predicting the target position under motion has been incorporated into the STC framework to improve the STC scale estimate. A scale correlation filter has also been combined with STC to extract samples at different scales around the target, using the HoG operator to form a pyramid of scale features [22,23].

Tracking by Correlation Filter
Correlation filters have been broadly applied in object tracking [24][25][26][27][28]. To solve scale estimation in correlation filtering, Danelljan et al. [29] proposed a tracker based on correlation filters for translation and scale in an image scale pyramid representation. The implementation in [29] was optimized in [30] using various strategies for reducing computational cost. Zhang et al. [31] used [30] as their base tracker and proposed a motion aware correlation filters (MACF) based tracking method by incorporating a joint motion estimation-based Kalman filter in discriminative correlation filters, using a confidence of squared response map (CSRM) criterion for model update and occlusion detection. Ma et al. [32] used the implementation in [30] and proposed a fast and accurate scale estimation method by incorporating average peak to correlation energy (APCE) in a multiresolution translation filter. Li et al. [33] proposed a scale-adaptive tracking method in the KCF framework, addressing the issue of fixed template size in KCF and incorporating HoG and CN features.

Tracking by Kalman Filter
Kalman filters are widely utilized for occlusion handling in various trackers [34][35][36][37][38]. Yang et al. [39] proposed an improved STC algorithm that combines the Kalman filter with STC, making it more robust, and used Euclidean distance to detect occlusion. Mehmood et al. [40] proposed a tracking algorithm similar to [39]; in their implementation, they incorporated a context-aware formulation, combined a Kalman filter with the STC framework, and used the maximum value of the response map for occlusion detection. Khan et al. [41] proposed an improved tracker based on long-term correlation tracking (LCT), incorporating the Kalman filter in the LCT framework for occlusion handling and the peak to side ratio (PSR) of the response map for occlusion detection.
Based on the presented literature, it can be concluded that significant modifications have been made to the STC algorithm. These modifications concern occlusion detection and handling mechanisms, target model update mechanisms, incorporation of scale update schemes, and the fusion of various cues and features, along with deep learning techniques and adaptive learning rate mechanisms. All tracking results of the proposed method are available at: https://drive.google.com/drive/folders/1nRiUyLfXkBk6tYcSuJaqkW1WAEiyDOX2?usp=sharing.

Our Contributions
Based on related work, this article proposes an object tracking algorithm that enhances STC under scale variations, background clutter, occlusion, illumination variation, and deformation. The main contributions of this article are as follows: (1) We propose a scale correlation filter-based pyramid representation mechanism to accurately extract the target without accumulating the scale model's error. We use a combination of spatio-temporal context and a scale correlation filter to achieve accurate object tracking. (2) We introduce an effective method in which the object can be tracked accurately by utilizing extended Kalman filter (EKF) prediction for nonlinear target motion. We also use the response map's peak value to measure the reliability of the current estimated position; if the tracking result is unreliable, this method can regain the target position to continue tracking. (3) We propose an adaptive learning rate mechanism for the target appearance model based on average peak to correlation energy (APCE). This method effectively prevents the tracking model from being corrupted by incorrect appearance updates. (4) Experimental results are presented on de facto standard videos to show the efficacy of the proposed method over STC [19], DCF CA [26], Modified KCF [28], MACF [31], Modified STC [40], and AFAM-PEC [41].
A correlation filter-based discriminative scale mechanism is incorporated into spatio-temporal learning in the proposed work, making it robust and effective in scenarios such as cluttered background, illumination variation, scale variations, and fast motion. The adaptive learning rate mechanism is based on the APCE between consecutive frames. It is fused into this framework so that the tracking model can be updated according to the target's shape and motion. If the model is updated at a fixed learning rate, it cannot cope with changes in the target's shape, thereby losing the target in subsequent frames.
The extended Kalman filter is utilized in the current study when the target undergoes occlusion. Whether the target is occluded is decided from the response map's maximum value. In the proposed tracker, the extended Kalman filter is not only applied, but a mechanism is also devised for its activation in the STC framework, making the tracker better both qualitatively and quantitatively than various trackers.
The current study focuses on addressing limitations in the spatio-temporal context framework by incorporating efficient scale-space formulation, occlusion detection and handling, and adaptive learning rate modules.

Paper Outline
This paper's organization is as follows: a brief explanation of spatio-temporal context tracking is given in Section 2. Section 3 defines the scale correlation filter, extended Kalman filter, occlusion detection method, and adaptive learning rate mechanism while explaining the proposed method for online tracking. Experiment parameters are discussed in Section 4. Performance analysis is discussed in Section 5. Section 6 includes a discussion, while Section 7 concludes the paper.

Spatio-Temporal Context Tracking
The STC tracking algorithm is based on a Bayesian framework that finds the target location by utilizing context information. In every frame, the confidence map is maximized to compute the target center. The feature set around the target location in each frame is defined as X_c = {n(o) = (I(o), o) | o ∈ Ω_c(x*)}, where I(o) is the image greyscale value at location o and Ω_c(x*) is the context region around the target center x*. It is shown in Figure 1.
To formulate the tracking problem, the confidence map is computed to estimate the likelihood of the target location, as given in (1), where x denotes the target coordinates and j denotes the target. P(n(o)|j) is the context prior model, which represents the features of the context appearance. P(x, n(o)|j) is the spatial context model, which formulates the spatial relation between the target position and its context; it identifies and resolves ambiguities arising from different image measurements. The confidence map function n(x) is defined in (2), where m is a normalization constant, ξ is a shape parameter, and θ is a scale parameter. Appropriate selection of the shape parameter helps the spatial context model learn: setting ξ > 1 oversmooths the confidence map near the center, while ξ < 1 generates a sharp peak response when learning the spatial context. Due to these issues, STC uses ξ = 1. The context prior model must be calculated before learning the spatial context model. It is modeled by the image intensity function and a Gaussian weighting function, as given in (3) and (4).
where σ is the scale representation. (4) is restricted between 0 and 1 by its normalization constant c. The closer a context location o is to the current target location x*, the larger its weight should be set to predict the target location in the next frame. (5) defines the spatial context model.
Solving for the spatial context model gives (6), where ⊗ denotes the convolution operation. Transforming into the frequency domain with the fast Fourier transform (FFT) improves speed, as given in (7) and (8), where ⊙ denotes element-wise multiplication. Solving (8) for the spatial context model yields (9).
where F^{-1} denotes the inverse FFT in (9). In the STC model, the target position is initialized at the first frame. The spatial context model h^sc learns the relative spatial relations between different pixels in the Bayesian framework. For subsequent frames, the STC model H^stc_{t+1}(x) is updated using the spatial context model h^sc_t(x). By computing the maximum of the confidence map, the target center position x*_{t+1} at frame (t + 1) can be obtained, as given in (10).
Similarly, a confidence map can be calculated from (11).
The STC model is updated with learning rate ρ, as given in (12), where h^sc_t is the spatial context model computed in (9).
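For reference, the relations referenced as (1)–(12) above can be written out explicitly. The following is a reconstruction based on the original STC formulation [19], transcribed into this section's notation (n, m, ξ, θ, w_σ, h^sc, H^stc); the equation numbering is aligned to the surrounding text and should be checked against [19]:

```latex
\begin{align}
n(\mathbf{x}) &= P(\mathbf{x}\mid j)
  = \sum_{n(o)\in X_c} P(\mathbf{x}\mid n(o), j)\,P(n(o)\mid j) && (1)\\
n(\mathbf{x}) &= m\,e^{-\left|\frac{\mathbf{x}-\mathbf{x}^*}{\theta}\right|^{\xi}} && (2)\\
P(n(o)\mid j) &= I(o)\,w_{\sigma}(o-\mathbf{x}^*) && (3)\\
w_{\sigma}(z) &= c\,e^{-\frac{|z|^2}{\sigma^2}} && (4)\\
P(\mathbf{x}\mid n(o), j) &= h^{sc}(\mathbf{x}-o) && (5)\\
n(\mathbf{x}) &= h^{sc}(\mathbf{x}) \otimes \left(I(\mathbf{x})\,w_{\sigma}(\mathbf{x}-\mathbf{x}^*)\right) && (6)\\
\mathcal{F}\!\left(n(\mathbf{x})\right) &= \mathcal{F}\!\left(h^{sc}(\mathbf{x})\right)
  \odot \mathcal{F}\!\left(I(\mathbf{x})\,w_{\sigma}(\mathbf{x}-\mathbf{x}^*)\right) && (7)\\
\mathcal{F}\!\left(h^{sc}(\mathbf{x})\right) &=
  \frac{\mathcal{F}\!\left(m\,e^{-\left|\frac{\mathbf{x}-\mathbf{x}^*}{\theta}\right|^{\xi}}\right)}
       {\mathcal{F}\!\left(I(\mathbf{x})\,w_{\sigma}(\mathbf{x}-\mathbf{x}^*)\right)} && (8)\\
h^{sc}(\mathbf{x}) &= \mathcal{F}^{-1}\!\left(
  \frac{\mathcal{F}\!\left(m\,e^{-\left|\frac{\mathbf{x}-\mathbf{x}^*}{\theta}\right|^{\xi}}\right)}
       {\mathcal{F}\!\left(I(\mathbf{x})\,w_{\sigma}(\mathbf{x}-\mathbf{x}^*)\right)}\right) && (9)\\
\mathbf{x}^*_{t+1} &= \arg\max_{\mathbf{x}\in\Omega_c(\mathbf{x}^*_t)} n_{t+1}(\mathbf{x}) && (10)\\
n_{t+1}(\mathbf{x}) &= \mathcal{F}^{-1}\!\left(
  \mathcal{F}\!\left(H^{stc}_{t+1}(\mathbf{x})\right)
  \odot \mathcal{F}\!\left(I_{t+1}(\mathbf{x})\,w_{\sigma_t}(\mathbf{x}-\mathbf{x}^*_t)\right)\right) && (11)\\
H^{stc}_{t+1} &= (1-\rho)\,H^{stc}_t + \rho\,h^{sc}_t && (12)
\end{align}
```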

Proposed Tracker
In this section, the proposed tracker is discussed. First, a correlation filter-based adaptive scale scheme is described. Second, an extended Kalman filter-based occlusion handling mechanism is investigated. Third, an adaptive learning rate scheme is presented. The execution scheme of the proposed tracker is shown in Figure 2. In each image sequence, the location of the target of interest is initialized manually in the first frame from the given ground truth. Afterward, the target confidence map is calculated, and sample patches at a set of different scales are estimated from the STC confidence map. Then, the maximum value of the response map is calculated. If the response map's value is less than a fixed threshold, the extended Kalman filter is activated; it predicts the target location for the next frame and updates the tracking model during this entire period. Once the response map's value exceeds the fixed threshold, the Kalman filter is deactivated. Afterward, the learning rate is updated, and the entire tracking model is updated based on the calculated position.
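The per-frame decision logic described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the callables standing in for the EKF and the model update are placeholders, and the threshold value is illustrative.

```python
import numpy as np

def track_frame(response_map, ekf_predict, ekf_correct, model_update,
                threshold=0.15):
    """One iteration of the tracking loop: the peak of the response map
    decides between normal tracking and EKF-based occlusion handling.
    The threshold value is illustrative, not the paper's."""
    peak = float(response_map.max())
    if peak < threshold:
        # Occlusion suspected: the EKF predicts the position for this frame,
        # and the appearance model is not updated with unreliable samples.
        return ekf_predict(), False
    # Reliable frame: take the response peak as the target position,
    # keep the EKF in sync, and update the tracking model.
    position = np.unravel_index(int(response_map.argmax()), response_map.shape)
    ekf_correct(position)
    model_update(position)
    return position, True
```

The returned flag signals whether the model update was performed; during occlusion it is False, so the learning-rate update is skipped for that frame.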
Different variables and notations used in the following sections are presented in Table 1: A, the numerator of the correlation filter at the t-th frame; B, the denominator of the correlation filter at the t-th frame; G, the two-dimensional Gaussian function; y, the response map of the correlation filter; and x, the predicted state at the t-th frame.

Scale Space Tracking
Discriminative correlation filters are widely used in visual object tracking. To estimate the target scale, a scale correlation filter-based tracking model is used. It first extracts samples at different scales around the target position; then, an HOG feature pyramid is extracted from these samples. To find the optimal correlation filter, the cost function given in (13) is minimized,
where g is the desired output (a two-dimensional Gaussian function), λ is the regularization term, * is the circular convolution operator, h^l is the l-th dimension of the HOG features extracted from the sample, d is the total number of HOG feature dimensions, and f is the correlation filter. The solution of (13) in the frequency domain is given in (14).
By minimizing the output error over training patches, an optimal filter can be obtained. However, this is not suitable for online tracking because of its computational cost. For efficient tracking, the numerator A and denominator B of the correlation filter H^l are updated separately, as given in (15) and (16).
where γ is the learning rate. By maximizing the correlation score, the target state can be determined as given in (17), where Z^l denotes the HOG features extracted at the predicted target location.
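The scale filter relations (13)–(17) follow the discriminative scale space tracking formulation of [29]. The following is a reconstruction in this section's notation (capital letters denote DFTs, a bar denotes complex conjugation), with equation numbering aligned to the text and to be checked against [29]:

```latex
\begin{align}
\varepsilon &= \left\| \sum_{l=1}^{d} f^{l} * h^{l} - g \right\|^{2}
  + \lambda \sum_{l=1}^{d} \left\| f^{l} \right\|^{2} && (13)\\
F^{l} &= \frac{\overline{G}\,H^{l}}{\sum_{k=1}^{d} \overline{H^{k}}\,H^{k} + \lambda} && (14)\\
A_{t}^{l} &= (1-\gamma)\,A_{t-1}^{l} + \gamma\,\overline{G}\,H_{t}^{l} && (15)\\
B_{t} &= (1-\gamma)\,B_{t-1} + \gamma \sum_{k=1}^{d} \overline{H_{t}^{k}}\,H_{t}^{k} && (16)\\
y &= \mathcal{F}^{-1}\!\left\{ \frac{\sum_{l=1}^{d} \overline{A_{t}^{l}}\,Z^{l}}{B_{t} + \lambda} \right\},
  \qquad s_{t} = \arg\max y && (17)
\end{align}
```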

Extended Kalman Filter
Within the visual object tracking research area, the EKF is widely used to estimate the system state. The target localization problem can be viewed as an estimation problem, as the EKF provides measurement-based prediction. For the current estimate, the EKF linearizes the nonlinear equations and is then applied to the linearized model [42]. The EKF involves two steps: prediction and correction. During prediction, the state and covariance estimates for the current frame are computed using (18) and (19).
The Kalman gain is computed in (20), where J_H is the measurement Jacobian and R is the measurement noise covariance. The state estimate is then updated using the prior estimate and the error between the measurement and the predicted measurement, as given in (21).
The difference z_t − J_H x̂⁻_t is called the measurement innovation, or residual; it reflects the discrepancy between the predicted measurement J_H x̂⁻_t and the actual measurement z_t. The a posteriori estimate of the error covariance is given in (22).
where P_t is the updated error covariance, J_H is the matrix relating the state to the measurement, and K_t is the updated Kalman gain.
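The predict/correct cycle of (18)–(22) can be sketched for a constant-velocity motion model, where the state is [x, y, vx, vy] and the measurement is the observed target center. With this linear model, the Jacobians J_F and J_H are the constant matrices F and H, so the EKF reduces to the standard Kalman filter; this is an illustrative sketch, not the paper's exact parameterization.

```python
import numpy as np

class ConstantVelocityEKF:
    """EKF for a constant-velocity target model. State: [x, y, vx, vy].
    For this linear model the Jacobians J_F, J_H are the constant F, H."""

    def __init__(self, q=1e-2, r=1.0):
        self.F = np.array([[1, 0, 1, 0],
                           [0, 1, 0, 1],
                           [0, 0, 1, 0],
                           [0, 0, 0, 1]], dtype=float)   # J_F (motion model)
        self.H = np.array([[1, 0, 0, 0],
                           [0, 1, 0, 0]], dtype=float)   # J_H (measurement)
        self.Q = q * np.eye(4)   # process noise covariance
        self.R = r * np.eye(2)   # measurement noise covariance
        self.x = np.zeros(4)     # state estimate
        self.P = np.eye(4)       # error covariance

    def predict(self):
        # (18)-(19): prior state and covariance estimates
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.x[:2]        # predicted target center

    def correct(self, z):
        # (20): Kalman gain
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)
        # (21): update state with the measurement innovation z - H x
        self.x = self.x + K @ (np.asarray(z, dtype=float) - self.H @ self.x)
        # (22): a posteriori error covariance
        self.P = (np.eye(4) - K @ self.H) @ self.P
        return self.x[:2]
```

During occlusion, only `predict()` is called each frame, so the filter coasts on the last estimated velocity until the response-map peak recovers.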

Occlusion Detection
When the target undergoes occlusion, the STC model is updated incorrectly, thereby losing the target. The maximum value of the response map is used to detect occlusion, as it varies with the target's state: if the target is occluded, the value of the response map is small; when the target reappears, its value increases. The value of the response map therefore determines whether the target is tracked by the improved STC or by the EKF. For a given input image sequence, the confidence map is first computed in the frequency domain. If the target is severely occluded, the EKF predicts the position and updates the improved STC through a feedback loop for the next frame.

Adaptive Learning Rate
The model is updated adaptively by using average peak to correlation energy (APCE) [43]. It is defined in (23).
where f_max is the maximum response value, f_min is the minimum response value, and f_{w,h} is the response value at row w and column h of the response map. APCE quantifies the degree of fluctuation of the response map and the reliability of the detected target. (24) gives the expression for the model update,
where APCE_t is the value at the t-th frame, APCE_0 is the value at the initial frame, and z_0 is the threshold used to decide the learning rate.
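The APCE criterion (23) and the resulting learning-rate switch in the spirit of (24) can be sketched as follows; the threshold ratio `z0` and the two learning-rate values are illustrative placeholders, not the paper's settings.

```python
import numpy as np

def apce(response):
    """Average peak-to-correlation energy (23):
    APCE = |f_max - f_min|^2 / mean_{w,h}((f_{w,h} - f_min)^2)."""
    f_max = float(response.max())
    f_min = float(response.min())
    return (f_max - f_min) ** 2 / float(np.mean((response - f_min) ** 2))

def adaptive_learning_rate(response, apce_0, z0=0.5, lr_high=0.075, lr_low=0.0):
    """Update the model at the normal rate only when the current APCE is a
    sufficient fraction of the initial-frame APCE; otherwise suppress the
    update (z0 and the two rates are illustrative values)."""
    return lr_high if apce(response) >= z0 * apce_0 else lr_low
```

A sharp single-peaked response map yields a high APCE (reliable detection), while a multi-modal or flat map yields a low APCE, so the model update is suppressed exactly when the detection is doubtful.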
Algorithm 1 is presented below.

Experiments
To evaluate the performance of the proposed tracker both qualitatively as well as quantitatively, extensive experiments were conducted on image sequences selected from Temple Color (TC)-128 [44], OTB2013 [45], OTB2015 [46], and UAV123 [47] datasets. Challenging factors associated with these sequences are scale variations, deformation, partial or full occlusions, background clutter, illumination variations, and fast motion.

Evaluation Criteria
The proposed tracker was compared quantitatively with existing tracking methods based on distance precision rate (DPR) and center location error (CLE). CLE is defined as the Euclidean distance between the tracked position and the ground-truth position of the target; the calculation formula is given in (25),
where (x_i, y_i) is the position calculated by the tracking algorithm and (x_gt, y_gt) is the ground-truth value. DPR is the percentage of frames in which the estimated CLE is below a given distance threshold.
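The two metrics can be computed as follows; this is a straightforward sketch of (25) and the DPR definition, with the default threshold matching the 20-pixel setting used in the experiments.

```python
import numpy as np

def center_location_error(pred, gt):
    """CLE (25): Euclidean distance between the predicted and ground-truth
    target centers; works on a single (x, y) pair or an array of pairs."""
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    return np.sqrt(np.sum((pred - gt) ** 2, axis=-1))

def distance_precision_rate(preds, gts, threshold=20.0):
    """DPR: percentage of frames whose CLE is below the distance threshold."""
    cle = center_location_error(preds, gts)
    return 100.0 * float(np.mean(cle <= threshold))
```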

Parameter Settings
We set the same parameter values as in [19] and [29]. The map function scale parameter θ and shape parameter ξ were set to 2.25 and 1, respectively [19]. The regularization parameter λ was set to 0.01. The standard deviation of the desired scale filter output was 0.25, the number of scales was 33, and the scale factor was 1.02 [29]. These values turned out to be the best setting in our implementation; changing them leads to inferior tracker performance. The DPR threshold is 20 pixels.

Quantitative Analysis
The DPR comparison is given in Table 2. In sequences (Baby_ce, Car9, Carchasing_ce4, Crossing, Jogging2, Ring_ce, Singer1, Tennis_ce2, and Tennis_ce3), the proposed tracker outperforms Modified KCF, STC, MACF, and DCF CA. In sequences (Building3, Carchasing_ce3, Cardark, Cup, Juice, Man, Plate_ce2, and Sunshade), all tracking methods have similar performance. In sequences (Bike3, Busstation_ce2, Car4, Girl2, Guitar_ce2, Human3, Jogging1, Skating2, and Walking2), the proposed tracker has a slightly lower precision value. However, the proposed tracker has a higher mean value than the other tracking methods.

The average center location error comparison is given in Table 3. In sequences (Baby_ce, Car4, Carchasing_ce4, Cardark, Crossing, Plate_ce2, Singer1, Tennis_ce2, and Tennis_ce3), the proposed tracker outperforms Modified KCF, STC, MACF, and DCF CA. In sequences (Bike3, Building3, Busstation_ce2, Carchasing_ce3, Cup, Girl2, Guitar_ce2, Human3, Jogging1, Jogging2, Juice, Man, Ring_ce, Skating2, Sunshade, and Walking2), the proposed tracker has a slightly higher error value. However, the proposed tracker has the lowest mean error compared to the other tracking methods.

The frames per second (FPS) comparison is given in Table 4. In sequences (Baby_ce, Building3, Car4, Carchasing_ce3, Carchasing_ce4, Cardark, Crossing, Cup, Jogging2, Man, Plate_ce2, Tennis_ce2, and Tennis_ce3), the proposed tracker outperforms Modified KCF, STC, MACF, AFAM-PEC, Modified STC, and DCF CA in terms of accuracy. However, the frame rate of the proposed tracker is low in comparison with the other tracking methods.

The precision plots are shown in Figure 3. Table 2 provides the mean precision value of each tracker over an entire image sequence; however, a tracker might drift for a few frames and then recover. Therefore, to review tracker performance over the whole image sequence, these plots are presented. Various challenges were present in the sequences, such as occlusion, scale variations, and deformation.
In sequences (Baby_ce, Carchasing_ce3, Car4, Cardark, Carchasing_ce4, Crossing, Cup, Jogging1, Jogging2, Guitar_ce2, Man, Plate_ce2, Ring_ce, Singer1, Sunshade, Tennis_ce2, and Tennis_ce3), the proposed tracker has the highest precision over the entire sequence. In sequences (Bike3, Building3, Busstation_ce2, Car9, Girl2, Human3, Juice, Skating2, and Walking2), the proposed tracker has slightly lower precision.

The location error plots are shown in Figure 4. Table 3 gives the average center location error for each image sequence, which indicates tracker performance but does not incorporate all the information necessary to review it. A tracker might drift for a few frames, producing a high error during the drift; when it recovers and tracks the target accurately, the per-frame error is low, yet the average value remains high. Therefore, these plots are presented to review tracker performance on each frame. The proposed tracker performs consistently for sequences (Baby_ce, Car4, Car9, Cardark, Crossing, Carchasing_ce3, Cup, Guitar_ce2, Juice, Jogging2, Ring_ce, and Tennis_ce3) over the entire duration. In sequences (Girl2, Human3, Skating2, and Walking2), the tracker drifts between frames but recovers after a few frames; for the majority of frames in these sequences the proposed tracker accurately tracks the target, although the accumulated error is high where drift occurred. In sequences (Bike3, Building3, Busstation_ce2, Jogging1, Man, Plate_ce2, Singer1, Sunshade, and Tennis_ce2), the proposed method performs similarly to the compared trackers.
Figure 5 depicts the qualitative results of the proposed tracker against four state-of-the-art trackers over 26 image sequences involving various challenges such as partial or full occlusions, scale variations, and background clutter. MACF contains tracking components similar to our approach, i.e., a scale correlation filter and a Kalman filter. Even though MACF performs favorably in sequences involving scale variations, it does not deal effectively with sequences involving occlusions (Girl2, Human3, Jogging1, Jogging2, and Skating2). STC uses intensity features and the response of a single translation filter to estimate scale. This makes STC a comparatively fast tracker; however, it has no occlusion detection or handling mechanism, due to which its tracking results suffer in sequences (Busstation_ce2, Girl2, Human3, Jogging1, and Jogging2). Moreover, because of the single translation filter, its tracking results are also affected in (Car9, Crossing, and Tennis_ce3). DCF CA combines correlation filtering with a context-aware formulation. However, it is not robust to occlusions, scale variations, and deformation; therefore, DCF CA does not perform well in sequences (Car9, Carchasing_ce4, Girl2, Human3, Jogging1, Jogging2, Skating2, and Tennis_ce3).
Modified KCF performs significantly well in sequences involving occlusions. However, it does not perform well in scale variation sequences (Baby_ce, Car9, Carchasing_ce4, Guitar_ce2, Ring_ce, Singer1, Tennis_ce2 and Tennis_ce3).

Qualitative Analysis
It can be seen that the proposed tracking method outperforms the other trackers in these sequences. In sequences (Baby_ce, Car4, Carchasing_ce4, Crossing, Cup, Jogging1, Jogging2, Guitar_ce2, Plate_ce2, Ring_ce, Singer1, Tennis_ce2, and Tennis_ce3), the proposed method accurately tracks the target for the entire image sequence. In sequences (Bike3, Busstation_ce2, Girl2, Human3, Skating2, and Walking2), the tracker cannot perform accurately for the entire sequence. In sequences (Building3, Carchasing_ce3, Cardark, Juice, Man, and Sunshade), all trackers have similar performance.

Discussion
It can be seen from Figure 5 that the proposed tracking method outperforms the other trackers in these sequences. We discuss several observations from the performance analysis. The performance can be attributed to three reasons. First, the scale correlation filter incorporated in the STC framework handles scale changes more effectively than STC's original scale estimation. This scale filter learns the target appearance at different scales, enabling accurate tracking under scale variation, as seen in sequences (Baby_ce, Car4, Car9, Carchasing_ce3, Carchasing_ce4, Plate_ce2, and Ring_ce). Second, the incorporation of an extended Kalman filter makes the tracker robust to occlusions. When the target undergoes partial or full occlusion, the EKF predicts the target state and updates the tracking model, as seen in sequences (Girl2, Jogging1, and Jogging2). Third, the fusion of the APCE-based adaptive learning rate further elevates tracking performance under illumination variations, motion blur, and cluttered backgrounds. In sequences (Building3, Cardark, Crossing, Cup, Guitar_ce2, Juice, Man, Singer1, Sunshade, Tennis_ce2, and Tennis_ce3), the tracker accurately follows the target because its appearance model copes with environmental changes by utilizing the information in each frame.
Even though the proposed tracker performs significantly better than the compared trackers overall, the per-sequence results can be summarized as follows. In sequences (Baby_ce, Car4, Carchasing_ce4, Crossing, Cup, Jogging1, Jogging2, Guitar_ce2, Plate_ce2, Ring_ce, Singer1, Tennis_ce2 and Tennis_ce3), the proposed method accurately tracks the target for the entire image sequence. In sequences (Building3, Carchasing_ce3, Cardark, Juice, Man, and Sunshade), all trackers have similar performance. However, there are a few sequences (Bike3, Busstation_ce2, Girl2, Human3, Skating2, and Walking2) in which the tracker cannot follow the target accurately for the entire sequence. In Bike3 the tracker fails due to fast movement combined with scale variation; in Skating2 it fails due to deformation of the target; and in (Busstation_ce2, Human3, and Walking2) it fails due to occlusions, fast motion, and motion blur. These limitations can be addressed in a few directions, such as developing a better occlusion detection and handling mechanism, extending aspect-ratio adaptability, and incorporating a context-aware formulation.
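The occlusion-handling idea discussed in this section, predicting the target centre from a motion model while the response peak is unreliable, can be illustrated with the sketch below. The paper uses an extended Kalman filter; for brevity this sketch assumes linear constant-velocity dynamics (an ordinary Kalman filter with state [x, y, vx, vy] and position measurements), which is sufficient to show the predict-during-occlusion mechanism. All matrix values here are illustrative assumptions.

```python
import numpy as np

class ConstantVelocityKF:
    """Simplified linear Kalman filter with a constant-velocity model.

    State: [x, y, vx, vy]; measurement: [x, y] target centre from the
    tracker. During occlusion, only predict() is called, so the filter
    keeps extrapolating the target position for the tracking model."""

    def __init__(self, x, y, dt=1.0, q=1e-2, r=1.0):
        self.s = np.array([x, y, 0.0, 0.0])   # state estimate
        self.P = np.eye(4)                     # state covariance
        self.F = np.eye(4)                     # transition: x += vx*dt, y += vy*dt
        self.F[0, 2] = self.F[1, 3] = dt
        self.H = np.zeros((2, 4))              # observe position only
        self.H[0, 0] = self.H[1, 1] = 1.0
        self.Q = q * np.eye(4)                 # process noise
        self.R = r * np.eye(2)                 # measurement noise

    def predict(self):
        self.s = self.F @ self.s
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.s[:2]                      # predicted centre

    def correct(self, zx, zy):
        z = np.array([zx, zy])
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)
        self.s = self.s + K @ (z - self.H @ self.s)
        self.P = (np.eye(4) - K @ self.H) @ self.P
        return self.s[:2]
```

A tracking loop would call `predict()` every frame and `correct()` only when the response peak (or an APCE-style score) indicates a reliable detection; during occlusion the predicted centre is passed to the STC model instead.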

Conclusions
This article presents a robust tracking algorithm based on STC that incorporates a pyramid-representation-based scale correlation filter for adaptive scale estimation, an extended Kalman filter for occlusion handling, and the APCE criterion for an adaptive learning rate of the tracking model. Experimental results indicate that the proposed tracking algorithm performs better than various state-of-the-art trackers, both qualitatively and quantitatively. Although the tracker achieves the desired performance, the target may still be lost in some cases, such as occlusions, motion blur, and fast motion. To address these limitations, our future work includes extending the current framework to context-aware and target-adaptation formulations, developing occlusion-judgment criteria, incorporating more features to learn the target appearance, and extending aspect-ratio adaptability.