Efficient Online Object Tracking Scheme for Challenging Scenarios

Visual object tracking (VOT) is a vital component of many computer vision applications, such as surveillance, unmanned aerial vehicles (UAVs), and medical diagnostics. In recent years, substantial progress has been made on the main challenges of VOT, such as scale change, occlusion, motion blur, and illumination variation. This paper proposes a tracking algorithm in the spatiotemporal context (STC) framework. To overcome the limitations of STC under scale variation, a max-pooling-based scale scheme is incorporated by maximizing over the posterior probability. To prevent the target model from drifting, an efficient occlusion handling mechanism is proposed. Occlusion is detected from an average peak-to-correlation energy (APCE)-based analysis of the response map between consecutive frames. Upon successful occlusion detection, a fractional-gain Kalman filter is activated to handle the occlusion. A further extension of the model applies APCE criteria to adapt the target model under motion blur and other factors. Extensive evaluation indicates that the proposed algorithm achieves significant results against various tracking methods.

Tracking methods can be categorized as generative and discriminative. Generative tracking methods have a high computational cost and adapt poorly to environmental factors, due to which they may fail in background clutter [20][21][22]. Discriminative tracking methods perform better against cluttered backgrounds since they treat tracking as a binary classification problem. However, they are slow, making them unsuitable for real-time applications [23][24][25].

Related Work
The STC tracker [27] has been widely used in recent years due to its computational efficiency. STC integrates spatial context information around the target of interest and considers prior information from previous frames, computing the extremum of the confidence map by using the Fourier transform. Die et al. [28] combined a correlation filter (CF) and STC. They extracted histogram of oriented gradients (HOG), color naming (CN), and gray features for learning correlation filters. Then, the responses of the CF and STC are fused. Yang et al. [29] proposed an improved tracking method by incorporating a peak-to-sidelobe ratio (PSR)-based occlusion detection mechanism and model update scheme in the STC framework. Zhang et al. [30] proposed a tracking method by incorporating HOG and CN features and an adaptive learning rate mechanism based on the average difference of frames in the spatiotemporal context framework. Zhang et al. [31] suggested a tracking method by incorporating a selection update mechanism in the spatiotemporal context framework. Song et al. [32] presented an improved STC-based tracking method by combining a scale filter and a loss function criterion for better performance in UAV applications.
During the past decade, significant progress has been made toward accurate scale estimation in VOT [33][34][35][36][37][38]. Danelljan et al. [39] proposed a tracking-by-detection framework by learning separate filters for translation and scale estimation based on a pyramid representation. Li et al. [40] incorporated an adaptive scale scheme in the kernelized correlation filter (KCF) tracker using HOG and CN features. Bibi et al. [41] modified the KCF tracker by maximizing the posterior distribution over a grid of scales and updating the filter by fixed-point optimization. Lu et al. [42] combined KCF and the Fourier-Mellin transform to deal with rotation and scale variation of the target. Yin et al. [43] modified the scale adaptive with multiple features (SAMF) tracker by using the APCE-based rate of change between consecutive frames to control the scale size. Ma et al. [44] incorporated APCE in discriminative correlation filters to address the fixed template size. A Kalman filter is used in various tracking algorithms for occlusion handling [45][46][47][48][49]. Kaur et al. [50] suggested a real-time tracking approach using a fractional-gain Kalman filter for nonlinear systems. Soleh et al. [51] proposed the Hungarian Kalman filter (HKF) for multiple target tracking. Farahi et al. [52] proposed a probabilistic Kalman filter (PKF) by incorporating an extra stage that estimates the target position by applying the Viterbi algorithm to a probabilistic graph. Gunjal et al. [53] proposed a Kalman filter-based tracking algorithm for moving targets in surveillance applications. Ali et al. [54] addressed VOT issues such as fast maneuvering of the target, occlusions, and deformation by combining a Kalman filter, CF, and adaptive mean shift in a heuristic framework. Kaur et al. [55] proposed a modified fractional-gain Kalman filter for vehicle tracking by incorporating a fractional feedback loop and cost function minimization. Zhou et al. [56] addressed VOT issues such as occlusions, motion blur, and background clutter by incorporating a Kalman filter in a compressive tracking framework.
Summarizing the current methods, it can be seen that significant work has been done to develop robust tracking algorithms by incorporating scale update schemes, model update mechanisms, and occlusion detection and handling techniques into different tracking frameworks. The STC algorithm proposed in [27] uses the FFT for detection and context information for the model update. However, it cannot effectively deal with occlusions, scale variations, and motion blur.

Our Contributions
To address the limitations of the STC, this paper proposes a robust tracking algorithm suitable for various image processing applications, such as surveillance and autonomous vehicles. The contributions can be listed concisely as follows.

1. We introduce novel criteria for detecting occlusion by utilizing APCE, model update rules, and the previous history of the modified response map to prevent the tracking model from wrong updates.
2. We introduce an effective occlusion handling mechanism by incorporating a modified feedback-based fractional-gain Kalman filter into the spatiotemporal context framework to track an object's motion.
3. We incorporate a max-pooling-based scale scheme by maximizing over the posterior probability in the detection stage of the STC framework. The combination of STC and max-pooling attains higher accuracy.
4. We introduce an APCE-based adaptive learning rate mechanism that utilizes information from the current frame and previous history to reduce error accumulation and avoid updates from a wrong appearance of the target.

Organization
The organization of this paper is as follows: brief explanations of STC and fractional calculus are provided in Section 2. In Section 3, the tracking modules of the proposed tracker are explained. Section 4 presents the performance analysis. A discussion is given in Section 5, and Section 6 concludes the paper.

STC Tracking
The STC tracking algorithm formulates the relation between the target of interest and its context in a Bayesian framework. The context feature set is X_c = {l(r) = (I(r), r) | r ∈ Ω_c(x*)}, where I(r) is the image intensity at location r and Ω_c(x*) is the local context region around the target center x*. The spatial relation between the target and its context is presented in Figure 2. The confidence map is given in (1):

l(x) = P(x|k) = Σ_{l(r)∈X_c} P(x|l(r), k) P(l(r)|k),   (1)

where P(l(r)|k) is the context prior model and P(x|l(r), k) is the spatial context model. The confidence map function l(x) is given in (2):

l(x) = v e^{−|(x − x*)/α|^β},   (2)

where v is the normalization constant, β is a parameter for shape, and α is a parameter for scale. The context prior model uses the intensity of the image and a weighted Gaussian function, given in (3) and (4):

P(l(r)|k) = I(r) w_σ(r − x*),   (3)
w_σ(r) = a e^{−|r|²/σ²},   (4)

where a is a normalization constant and σ is a scale parameter. Equation (5) describes the spatial context model:

P(x|l(r), k) = h^{sc}(x − r).   (5)

Substituting (3) and (5) into (1), the confidence map over the spatial context becomes:

l(x) = Σ_{r∈Ω_c(x*)} h^{sc}(x − r) I(r) w_σ(r − x*),   (6)

which is a convolution:

l(x) = h^{sc}(x) ⊗ (I(x) w_σ(x − x*)).   (7)

Applying the fast Fourier transform (FFT), denoted F(·), the convolution can be calculated as an element-wise product:

F(l(x)) = F(h^{sc}(x)) ⊙ F(I(x) w_σ(x − x*)).   (8)

The solution of (8) follows:

h^{sc}(x) = F^{−1}( F(v e^{−|(x − x*)/α|^β}) / F(I(x) w_σ(x − x*)) ).   (9)

As presented in (10), the new target location x*_{t+1} can be obtained by computing the extremum of the confidence map:

x*_{t+1} = argmax_{x∈Ω_c(x*_t)} l_{t+1}(x).   (10)

The confidence map is computed from (11):

l_{t+1}(x) = F^{−1}( F(H^{stc}_{t+1}(x)) ⊙ F(I_{t+1}(x) w_{σ_t}(x − x*_t)) ).   (11)

The spatiotemporal context model is updated with learning rate ρ, as given in (12):

H^{stc}_{t+1} = (1 − ρ) H^{stc}_t + ρ h^{sc}_t.   (12)

Figure 2. The spatial relation between the object and its context. The picture in the figure is part of the OTB-100 dataset [26].
Sensors 2021, 21, 8481
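The learning, detection, and update steps of (9)-(12) can be sketched with NumPy's FFT routines. This is a minimal illustration, not the authors' implementation; the function names and the parameter values (alpha, beta, sigma, rho) are assumptions.

```python
import numpy as np

def learn_context_model(frame, center, sigma, alpha=2.25, beta=1.0):
    """Learn the spatial context model h^sc of Eq. (9) via FFT division.
    Illustrative parameter values; center is (row, col)."""
    h, w = frame.shape
    ys, xs = np.mgrid[0:h, 0:w]
    dist = np.hypot(xs - center[1], ys - center[0])
    # Desired confidence map, Eq. (2): sharply peaked at the target center.
    conf = np.exp(-np.abs(dist / alpha) ** beta)
    # Context prior, Eqs. (3)-(4): intensity weighted by a Gaussian window.
    prior = frame * np.exp(-(dist ** 2) / sigma ** 2)
    # Eq. (9): deconvolution in the frequency domain (eps for stability).
    return np.fft.ifft2(np.fft.fft2(conf) / (np.fft.fft2(prior) + 1e-8)).real

def detect(frame, center, H_stc, sigma):
    """Eqs. (10)-(11): confidence map of the new frame and its argmax."""
    h, w = frame.shape
    ys, xs = np.mgrid[0:h, 0:w]
    dist = np.hypot(xs - center[1], ys - center[0])
    prior = frame * np.exp(-(dist ** 2) / sigma ** 2)
    conf = np.fft.ifft2(np.fft.fft2(H_stc) * np.fft.fft2(prior)).real
    return np.unravel_index(np.argmax(conf), conf.shape), conf

def update_model(H_stc, h_sc, rho=0.075):
    """Eq. (12): linear interpolation with learning rate rho."""
    return (1 - rho) * H_stc + rho * h_sc
```

Note that the circular convolution implied by the FFT matches the paper's formulation only up to boundary effects; a practical implementation would window the context region.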

Fractional Calculus
In this work, the Grünwald-Letnikov definition [60] is used for calculating the fractional difference, defined in (13):

Δ^n x(k) = (1/h^n) Σ_{q=0}^{k} (−1)^q (n q) x(k − q),   (13)

where n is the fractional order, h is the sampling interval, k is the number of samples of the given signal x, and the generalized binomial coefficient (n q) is obtained using (14):

(n q) = 1 for q = 0,  (n q) = n(n − 1)(n − 2) ⋯ (n − q + 1)/q! for q > 0.   (14)
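The fractional difference of (13)-(14) follows directly from the recurrence for the generalized binomial coefficients; a minimal sketch in Python (function names are ours):

```python
def gl_coefficients(n, k):
    """Generalized binomial coefficients C(n, q), q = 0..k, for fractional
    order n (Eq. (14)), via the recurrence C(n, q) = C(n, q-1)*(n-q+1)/q."""
    c = [1.0]
    for q in range(1, k + 1):
        c.append(c[-1] * (n - q + 1) / q)
    return c

def gl_fractional_difference(x, n, h=1.0):
    """Grünwald-Letnikov fractional difference of order n (Eq. (13)),
    evaluated at the last sample of the signal x."""
    k = len(x) - 1
    c = gl_coefficients(n, k)
    return sum((-1) ** q * c[q] * x[k - q] for q in range(k + 1)) / h ** n
```

For integer n the formula reduces to the ordinary finite difference, which gives a quick sanity check.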

Proposed Solution
In this section, tracking modules are elaborated. First, the max-pooling-based scale mechanism is presented. Second, the APCE-based occlusion detection mechanism is discussed. Third, the fractional-gain Kalman filter-based mechanism for occlusion handling is examined. Fourth, an APCE-based modified learning rate mechanism is explained. The flowchart of the proposed tracker is displayed in Figure 3.
As presented in Figure 3, for each sequence, the ground truth of the target is manually initialized in the first frame. Afterward, the confidence map of the target is calculated. Then, the scale of the target is estimated by maximizing the posterior probability. Next, the APCE of the response map is calculated, along with the difference in APCE between consecutive frames. If the occlusion criteria are met, the fractional-gain Kalman filter is activated and predicts the location of the target. Afterward, the learning rate of the tracking model is updated by utilizing the current target position and the previous history of APCE values.

Scale Integration Scheme
One limitation of STC is its inability to handle rapid changes of scale. During the detection phase of STC, we apply max-pooling over multiple scales by maximizing the posterior probability, as given in (15):

r* = argmax_{r_i} P(y|r_i) P(r_i),   (15)

where r_i represents the ith scale and P(y|r_i) is the maximum detection likelihood response at the ith scale. The prior term P(r_i) is a Gaussian distribution whose standard deviation is set through experimentation. It allows for a smooth scale transition between frames, given that the target scale does not vary much from one frame to the next.
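Given a peak response per candidate scale, (15) reduces to a weighted argmax. A small sketch (the function name and the Gaussian prior width are illustrative assumptions):

```python
import numpy as np

def estimate_scale(responses, scales, prev_scale, prior_sigma=0.05):
    """Max-pooling over candidate scales, Eq. (15): pick the scale that
    maximizes the detection likelihood times a Gaussian prior centered on
    the previous scale. responses[i] is the peak confidence at scales[i];
    prior_sigma is an illustrative, experimentally tuned width."""
    scales = np.asarray(scales, dtype=float)
    likelihood = np.asarray(responses, dtype=float)
    prior = np.exp(-((scales - prev_scale) ** 2) / (2 * prior_sigma ** 2))
    return scales[np.argmax(likelihood * prior)]
```

The narrow prior suppresses implausible scale jumps, which is what yields the smooth transition described above.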

Occlusion Detection Mechanism
The performance of any tracking algorithm is affected by various factors, of which the most common is occlusion. It is essential to create a mechanism for the detection of occlusion. In the present work, an occlusion feedback mechanism is presented, which detects occlusion and updates the target model by evaluating the tracking status of each frame.
Average peak-to-correlation energy (APCE) [61] indicates tracker effectiveness. The value of APCE changes according to the occlusion state of the target: small values of APCE indicate tracking failure or target occlusion. It is given in (16):

APCE = |g_max − g_min|² / mean_{w,h}( (g_{w,h} − g_min)² ),   (16)

where g_max and g_min are the maximum and minimum response values, respectively, and g_{w,h} is the response map value at index (w, h). The occlusion detection criteria are built as given in (17) and (18):

δ = APCE_t − APCE_{t−1},   (17)
occlusion: δ ≤ 0 and APCE_t < th,   (18)

where APCE_t and APCE_{t−1} are the APCE values at frames t and (t − 1), respectively, δ is the difference of the APCE between two sequential frames, and th is a threshold value acquired by performing multiple experiments. The rules for occlusion and model update follow: 1. When δ ≤ 0 and APCE_t ≥ th, it indicates that the target is coming out of the shelter, and both the tracking and model updates are based on STC.

2. When δ ≤ 0 and APCE_t < th, it indicates that the target is in the occlusion state and tracking is based on the fractional-gain Kalman filter. The tracking model is also updated based on the Kalman filter prediction.
3. When δ > 0 and APCE_t < th, it indicates that the target occludes, and both the tracking and model update are based on STC.
4. When δ > 0 and APCE_t ≥ th, it indicates that the target tracking is good and that both the tracking and model update are based on STC.
As seen in Figure 4a, without occlusion, both APCE and δ are high; therefore, no occlusion is detected. However, when both APCE and δ are low, as shown in Figure 4b, occlusion occurs and the occlusion handling mechanism is activated. By using this mechanism, the proposed tracker achieves significant results on the occlusion challenge.
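The APCE measure of (16) and the occlusion test of rule 2 are straightforward to compute from a response map; a sketch with assumed function names:

```python
import numpy as np

def apce(response):
    """Average peak-to-correlation energy of a response map, Eq. (16)."""
    g_max, g_min = response.max(), response.min()
    return (g_max - g_min) ** 2 / np.mean((response - g_min) ** 2)

def occlusion_occurred(apce_t, apce_prev, th):
    """Rule 2 of Section 3.2: the Kalman filter takes over only when APCE
    is both non-increasing and below the threshold th (set experimentally)."""
    delta = apce_t - apce_prev
    return delta <= 0 and apce_t < th
```

A sharply peaked response map (confident detection) yields a large APCE, while a flat or multi-modal map (occlusion, blur) yields a small one.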

Figure 4. APCE analysis of the response map: (a) without occlusion; (b) with occlusion.

Fractional-Gain Kalman Filter
The Kalman filter is widely used in VOT research. A modified discrete-time linear system can be characterized by Equations (19) and (20):

x_{k+1} = A x_k + B u_k,   (19)
z_k = H x_k + v_k,   (20)

where x_k is the state vector, z_k is the system output, u_k is the system input, and v_k is the output noise. A, B, and H are the transition, control, and measurement matrices, respectively. The innovation is the difference between the estimated output ẑ_k = H x̂⁻_k and the actual output z_k, defined in (21):

ỹ_k = z_k − H x̂⁻_k,   (21)

where x̂⁻_k is the a priori state. The estimation of the next state x̂_k with a modified gain is given in (22) and (23):

x̂_k = x̂⁻_k + K_new ỹ_k,   (22)
K_new = K_k + Δ^γ K_k,   (23)

where Δ^γ K_k is the fractional derivative of the previous Kalman gain. The a priori error ê⁻_k between the actual and estimated state and its covariance P⁻_k are given in (24) and (25):

ê⁻_k = x_k − x̂⁻_k,   (24)
P⁻_k = E[ê⁻_k (ê⁻_k)ᵀ],   (25)

The a posteriori error e_k between the actual and estimated state and its covariance P_k are given in (26) and (27):

e_k = x_k − x̂_k,   (26)
P_k = E[e_k e_kᵀ],   (27)

The Kalman gain K is calculated by minimizing the a posteriori error covariance P_k, as given in (28):

P_k = (I − K_k H) P⁻_k,   (28)

which yields the value of K in (29):

K_k = P⁻_k Hᵀ (H P⁻_k Hᵀ + R)⁻¹,   (29)

where R = E[v_k v_kᵀ]. K_new can then be written as in (30):

K_new = K_k + (1/k) Σ_{q=1}^{k} (−1)^{q+1} (γ q) K_{k−q},   (30)

The modified Kalman gain K_new consists of two terms. The first term is the standard Kalman gain, and the second is the mean of the fractional difference of the previous gains. The factor (−1)^{q+1} keeps the mean value nominal.
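A hedged sketch of the filter on a constant-velocity motion model follows. Since the displayed equations are reconstructed, the exact correction term, the noise covariances, the fractional order, and the memory length below are illustrative assumptions rather than the paper's tuned settings.

```python
import numpy as np

def frac_coeffs(gamma_, m):
    """Generalized binomial coefficients C(gamma, q), q = 1..m (cf. Eq. (14))."""
    c, out = 1.0, []
    for q in range(1, m + 1):
        c *= (gamma_ - q + 1) / q
        out.append(c)
    return out

class FractionalGainKF:
    """Constant-velocity Kalman filter whose gain is augmented with the mean
    signed fractional difference of previous gains, a sketch of Eqs. (19)-(30)."""
    def __init__(self, x0, gamma_=0.2, m=3):
        self.x = np.asarray(x0, float)                  # state: [px, py, vx, vy]
        self.P = np.eye(4)
        self.A = np.eye(4); self.A[0, 2] = self.A[1, 3] = 1.0   # Eq. (19)
        self.H = np.eye(2, 4)                           # Eq. (20): observe position
        self.Q = 0.01 * np.eye(4)                       # process noise (assumed)
        self.R = 0.1 * np.eye(2)                        # measurement noise (assumed)
        self.gains, self.c = [], frac_coeffs(gamma_, m)

    def step(self, z):
        # A priori state and covariance (Eqs. (24)-(25)).
        x_pr = self.A @ self.x
        P_pr = self.A @ self.P @ self.A.T + self.Q
        # Standard gain (Eqs. (28)-(29)).
        K = P_pr @ self.H.T @ np.linalg.inv(self.H @ P_pr @ self.H.T + self.R)
        # Eq. (30): add the mean signed fractional difference of past gains.
        n = min(len(self.gains), len(self.c))
        if n:
            K = K + sum((-1) ** (q + 1) * self.c[q - 1] * self.gains[-q]
                        for q in range(1, n + 1)) / len(self.c)
        self.gains.append(K)
        # Innovation and a posteriori state (Eqs. (21)-(22)).
        self.x = x_pr + K @ (np.asarray(z, float) - self.H @ x_pr)
        self.P = (np.eye(4) - K @ self.H) @ P_pr
        return self.x[:2]
```

During occlusion, rule 2 would feed the filter its own predictions instead of the (unreliable) STC detections.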

Adaptive Learning Rate
The motion of the target changes in each frame during tracking. It is, therefore, necessary to update the target model adaptively rather than with a fixed learning rate. To better cope with environmental changes occurring during tracking, we use an APCE-based degree indicator. Since raw APCE values can be very large, we normalize the current APCE by the maximum of the historical APCE values. The degree indicator d_APCE is defined in (31):

d_APCE = APCE_t / max_{i∈[t_s, t−1]} APCE_i,   (31)

where t_s is the index of the start frame. The value of the learning rate is adjusted as in (32), according to whether d_APCE exceeds a threshold τ_th acquired by performing multiple experiments. Figure 5a shows that, without motion blur, both APCE and d_APCE are high; therefore, the learning rate of the tracking model should be fast. However, when motion blur occurs, both APCE and d_APCE give low values, as shown in Figure 5b. Thus, in that case, the model should be updated slowly due to the appearance change of the target. By using this mechanism, the proposed tracker achieves significant results on the motion blur challenge. The overall procedure is given in Algorithm 1.
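The degree indicator of (31), and one plausible form of the adjustment in (32), can be sketched as follows. Since (32) is not reproduced here, the piecewise rule, the base rate, and the threshold are assumptions, not the paper's definition:

```python
def degree_indicator(apce_history):
    """d_APCE of Eq. (31): current APCE normalized by the maximum of the
    historical APCE values (history must contain at least two entries)."""
    return apce_history[-1] / max(apce_history[:-1])

def adapt_learning_rate(d_apce, rho=0.075, tau_th=0.5):
    """A hypothetical form of Eq. (32): keep the base rate rho when the
    response is reliable, slow the update proportionally otherwise.
    Both rho and tau_th are illustrative values."""
    return rho if d_apce >= tau_th else rho * d_apce
```

Any monotone rule with the same two regimes (fast update when d_APCE is high, slow update when it is low) would realize the behavior described for Figure 5.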

Figure 5. APCE analysis of the response map: (a) without motion blur; (b) with motion blur.

Algorithm 1. Proposed tracking algorithm (for each frame):
1. Compute the confidence map by using (11).
2. Compute the center of the target location.
3. Check the four rules of occlusion detection given in Section 3.2.
4. If rule 2 occurs, activate the fractional-gain Kalman filter and compute the fractional Kalman gain by using (30).

Performance Analysis
Comprehensive assessments were conducted on videos taken from the OTB-2015 [26] dataset for the qualitative and quantitative evaluation of the proposed tracking method. These sequences include scale variation, motion blur, and fast motion challenges.

Evaluation Criteria
The proposed algorithm is compared with other tracking methods on two evaluation criteria: distance precision rate (DPR) and center location error (CLE). CLE is the Euclidean distance between the predicted and ground-truth target centers, and DPR is the percentage of frames whose CLE falls below a given threshold. The calculation formula for CLE is given in (33):

CLE = sqrt( (x_p − x_g)² + (y_p − y_g)² ),   (33)

where (x_p, y_p) is the predicted target center and (x_g, y_g) is the ground-truth center.
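Both criteria are easy to compute from per-frame centers; a short sketch (the 20-pixel DPR threshold is the customary OTB setting, assumed here rather than stated above):

```python
import math

def center_location_error(pred, gt):
    """CLE of Eq. (33): Euclidean distance between the predicted and
    ground-truth target centers (x, y)."""
    return math.hypot(pred[0] - gt[0], pred[1] - gt[1])

def distance_precision_rate(preds, gts, threshold=20.0):
    """DPR: fraction of frames whose CLE falls within `threshold` pixels."""
    errors = [center_location_error(p, g) for p, g in zip(preds, gts)]
    return sum(e <= threshold for e in errors) / len(errors)
```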

Quantitative Analysis
DPR evaluation is presented in Table 1. In the videos Blurcar1, Car2, Human7, Jogging1, and Jogging2, the proposed algorithm outperforms Modified KCF, MOSSE_CA, MACF, KCF_MTSA, and STC. For the sequences Blurcar3, Blurcar4, Boy, Dancer2, and Suv, the proposed tracker has a marginally lower precision value. Overall, the proposed algorithm has a higher mean value than the other algorithms. The precision and error plots are presented in Figures 6 and 7, respectively. These plots provide a frame-by-frame comparison over the entire image sequences. Since precision and location error give the mean over an entire sequence, it is possible that an algorithm loses the target for a few frames but correctly tracks it again; these plots therefore show the effectiveness of the tracking method frame by frame. In the videos Blurcar1, Human7, Jogging1, and Jogging2, the proposed algorithm has the highest precision over the entire video. It has slightly lower accuracy in the Blurcar3, Blurcar4, Boy, Car2, Dancer2, and Suv videos. The proposed algorithm has the lowest error in the Blurcar1, Human7, Jogging1, and Jogging2 videos. It has a marginally higher error compared with a few trackers for the Blurcar3, Blurcar4, Boy, Car2, Dancer2, and Suv sequences.
Frames per second (fps) analysis is presented in Table 3. In the Blurcar1, Car2, Dancer2, Human7, and Jogging1 videos, the proposed algorithm outperforms Modified KCF, MOSSE_CA, MACF, KCF_MTSA, and STC in terms of precision and error at the expense of a modest frame rate. The computational time for the learning rate module is presented in Table 4. It can be seen that the proposed tracker takes less time on motion blur sequences. However, the overall speed of the tracker is slightly slow, as given in Table 3. Combining the different tracking modules presented in Section 3, the performance of the proposed tracker is significant, as each module is specifically designed and incorporated into the STC framework, making it efficient in terms of low error and high precision for different challenging attributes in VOT.
Figure 8 depicts the qualitative analysis of the proposed tracker against five state-of-the-art trackers. Modified KCF and KCF_MTSA are extensions of KCF [62]-based tracking methods. However, Modified KCF is not robust to motion blur (Blurcar1, Blurcar3, and Human7), whereas the performance of KCF_MTSA is affected by occlusion (Jogging2) and motion blur (Human7). MACF is an improved version of fast discriminative scale space tracking [63] and achieves favorable results on various challenges of VOT. However, it does not perform well under motion blur (Blurcar1) and occlusion (Jogging1 and Jogging2). MOSSE_CA is an improved context-aware formulation of the MOSSE [64] tracker. Its results are exceptional except in the Jogging1 and Human7 sequences. STC is the baseline tracker of the proposed method and achieves favorable results. However, it can be seen that it does not address occlusion (Jogging1 and Jogging2) or motion blur (Blurcar1, Blurcar3, Blurcar4, Boy, and Human7).

Qualitative Analysis
It can be seen that the proposed tracker outperforms the other tracking methods in these sequences. This performance is attributed to three factors. First, a max-pooling-based scale scheme is incorporated, making the tracker less sensitive to scale variations (Boy). Second, the incorporation of the APCE-based modified occlusion detection mechanism and fractional-gain Kalman filter-based occlusion handling makes it effective against occlusions (Jogging1, Jogging2, and Suv). Third, the combination of APCE criteria in the learning rate allows the proposed algorithm to update its model effectively, making it efficient under motion blur (Blurcar1, Blurcar3, Blurcar4, Boy, and Human7) and illumination variations (Car2 and Dancer2).

Discussion
We discuss several observations from the performance analysis. First, the max-pooling-based scale formulation in the spatiotemporal context outperforms trackers without this formulation. This can be attributed to estimating the maximum likelihood by sampling the target appearance at a set of different scales. Second, trackers which utilize modules for occlusion detection and handling outperform trackers without these modules. This can be attributed to the fractional-gain Kalman filter and the APCE-based occlusion detection mechanism preventing the tracker from drifting. Third, trackers with an adaptive learning rate perform better than those with a fixed learning rate.

Conclusions
This paper contributes an STC-based accurate tracking algorithm incorporating max-pooling, a fractional-gain Kalman filter, and APCE measures for occlusion detection and tracking model update. These measures improve the adaptability of the target model and prevent error accumulation. Evaluations show that the proposed tracker achieves enhanced results in various complicated scenarios. However, some problems remain: (1) tracking performance is severely affected by dense occlusion; (2) the tracker may lose the target of interest under deformation and fast motion; and (3) the frame rate of the proposed tracking method is slow. These three points will be the focus of follow-up research. Additionally, considering the challenges of VOT, we plan to conduct in-depth research on feature fusion and better prediction estimation mechanisms, and to carry out Raspberry Pi, FPGA, and DSP-based hardware implementations for practical applications.