Robust Visual Tracking Based on Fusional Multi-Correlation-Filters with a High-Confidence Judgement Mechanism

Abstract: Visual object trackers based on correlation filters have recently demonstrated substantial robustness to challenging conditions such as illumination variation and motion blur. Nonetheless, these models depend strongly on the spatial layout and are highly sensitive to deformation, scale variation, and occlusion. In this paper, colour attributes are combined with the template features because their complementary characteristics handle variations in shape well. In addition, a novel approach for robust scale estimation is proposed for mitigating the problems caused by fast motion and scale variations. Moreover, feedback from high-confidence tracking results is utilized to prevent model corruption. The evaluation results demonstrate that our tracker performs outstandingly in terms of both precision and accuracy, with enhancements of approximately 25% and 49%, respectively, on authoritative benchmarks compared to other popular correlation-filter-based trackers. Finally, the proposed tracker demonstrates strong robustness, which enables online object tracking under various scenarios at a real-time frame rate of approximately 65 frames per second (FPS).


Introduction
Robust visual object tracking has been attracting substantial attention. It is a significant problem in computer vision, as evidenced by its numerous applications in robotics, services, monitoring, and human-machine interaction. Posada et al. [1] clearly defined computer vision as a key enabling technology for Industry 4.0. Segura et al. [2] and Posada et al. [3] showed the challenges and examples of computer vision technology in the field of robotics and human-robot collaboration. In typical scenarios, the target is specified in the first frame only (e.g., by defining a rectangle), and the task is to track the target object in subsequent frames. This tracking can be directly applied in warehouse automation [4], human-robot handovers [5], safety design improvement for human-robot collaboration [4,6,7,8,9] and human-robot synchronization [10,11]. However, visually tracking an object remains challenging because of factors such as deformation, illumination variation, scale variation, and partial occlusion.
Most available methods for visual tracking follow one of two strategies. The first strategy is to use an efficient algorithm to construct generative [12][13][14] or discriminative [15][16][17] models. This strategy is commonly used to devise a filter or classifier that tracks the object and updates the model at each frame by using the information in subsequent frames as training samples. It might lead to model drift because a small error can accumulate into a significant error when the model learns from its own predictions. This strategy is applied primarily in scenarios that lack training samples. The second strategy is to exploit the features that are extracted from a deep convolutional neural network (CNN) [18][19][20][21], which is trained either online or on recognition datasets. Although these approaches can substantially improve the performance, the use of more complicated tracking algorithms or features enormously increases the computational complexity, which might render the model unsuitable for real-time visual object tracking.
Recently, many popular trackers [21][22][23][24] that are based on correlation filters (CFs) have been proposed; they can track many objects of interest and offer remarkable computational performance. By computing the correlation in the Fourier domain via the fast Fourier transform (FFT), the storage and computational requirements are both reduced by several orders of magnitude. Bolme et al. [23] proposed the minimum output sum of squared error (MOSSE) tracker, which uses adaptive correlation filters and can outperform more complicated algorithms. Due to the high performance and efficiency of CFs, many trackers have been designed by building on MOSSE. Henriques et al. [22] proposed the circulant structure of tracking-by-detection with kernels (CSK), which extends MOSSE with dense sampling and introduces the kernel trick. In 2015, Henriques et al. [25] put forward a solution for high-speed tracking with kernelized correlation filters (KCFs), which supports multiple feature channels and improves the performance of the tracker by utilizing the histogram of oriented gradients (HOG) feature while preserving the real-time performance. Danelljan et al. [16] introduced the multiple feature channels of colour names (CNs), which build on CSK and have received an excellent response from the industry. However, although the trackers that are discussed above exhibit outstanding performance, they cannot solve the scale estimation problem that arises from fast motion or other factors. Long et al. [26] proposed an omnidirectional modified Laplacian operator with an adaptive window size. Danelljan et al. [27] (DSST) solved the difficult scale estimation problem by learning discriminative correlation filters on a scale pyramid representation. Yingzhong et al. [28] used various fusion rules to combine different features for better description.
For correlation filters, boundary effects might lead to detection failure; hence, Danelljan et al. [29] (SRDCF) added a spatial regularization term to penalize the CF coefficients around the boundary. The SRDCF yields excellent tracking results; however, the real-time performance is degraded enormously, with a reported speed of only 5 FPS. In the development of trackers that are based on CF, the discrimination performance should be improved and the real-time performance requirement should be satisfied.
Due to their strong feature representation performance, CNNs have realized significant success on visual tracking tasks and in many other scenarios. Many recent trackers [18][19][20][21][30][31][32] have demonstrated high performance on benchmarks. Lee et al. [32] and Wang et al. [31] proposed the best-performing solutions for the visual object tracking (VOT) [33] long-term and short-term tracking challenges, respectively. Ma et al. [30] proposed a method for enhancing the precision and robustness by learning from features that are extracted from a deep CNN that is trained on object recognition datasets. Danelljan et al. [21] exploited an implicit interpolation model with the objective of solving the learning problem in the continuous spatial domain and proposed an innovative method for efficiently combining multi-resolution feature maps. Nam et al. [20] exploited a network that incorporates domain-specific layers and shared layers to obtain generic target representations. Wenbin et al. [34] proposed a generative system that is based on a CNN, which has realized satisfactory performance. Yang Liu et al. [35] proposed a novel hierarchical feature learning framework, and Dongdong et al. [36] revisited the standard SRDCF formulation and introduced padless correlation filters (PCFs), which could completely remove boundary effects. The studies that are discussed above have demonstrated the high power of CNNs for target representation, at the expense of high computational complexity and time consumption.
In recent years, the practice of evaluating tracking algorithms has substantially improved. In the past, researchers were limited to evaluating the tracking performance on a small number of sequences [37,38]. Benchmarks such as VOT and the object tracking benchmark (OTB) [39,40] emphasize the importance of testing methods on a wider range of sequences that cover a variety of object categories and challenges. OTB contains 25% grayscale sequences, while VOT contains only colour sequences. OTB also includes runs that start from a random frame and are initialized with random perturbations, whereas VOT is initialized and run from the first frame.
In this paper, an understandable and efficient method is proposed for solving the problems that are described above. The main contributions can be summarized as follows: Two image representations, namely, template and colour characteristics, are combined to address illumination changes and shape variations. The discriminative CF is exploited on a scale pyramid representation to solve the scale estimation problem. A high-confidence judgement mechanism is explored for avoiding model corruption. Figure 1 shows the flow of the target tracking algorithm in this paper. In Figure 1, the blue line is template-related. In frame $t$, the HOG features are extracted from the estimated location and used to update the numerator $A_t^l$ and denominator $B_t$ in (14). In frame $t + 1$, features are extracted from the predicted target location $p_t$ and convolved with $H^l$ to obtain the template response via (15). The green line is histogram-related. In frame $t$, features of the target object and background are used to update $\rho_t(O)$ and $\rho_t(B)$ in (20); thus, the coefficient $\beta_t$ in (19) can be obtained. The histogram response is computed via (5). Then, the template response and the histogram response are combined via (3) to obtain the integrated response map without adjusting the scale. The orange line is scale-related. In frame $t$, a scale-space filter is trained by using the features from the previous location $p_t$. In frame $t + 1$, the scale filter is combined with the response from (3) to obtain the final response $p_{t+1}$. The red line is high-confidence judgement-related. The high-confidence mechanisms in (21) and (22) are used to judge whether updating the model in the current frame is necessary, which prevents model corruption.

Problem Formulation
In this paper, the tracking-by-detection principle is adopted. The main objective is to obtain a classifier that can discriminate the object of interest from its surroundings in real time when a new frame is received. In frame $t$, the rectangle $p_t$ that represents the object position in image $x_t$ is selected from a candidate set $C_t$ to maximize a score:

$$p_t = \arg\max_{p \in C_t} f(T(x_t, p); \theta_{t-1}) \qquad (1)$$

where $f(T(x, p); \theta)$ denotes the score of the rectangular window $p$ in image $x$ under the model parameters $\theta$, and the function $T$ denotes an image transformation. Moreover, the model parameters should be selected to minimize the loss function $L(\theta; \chi_t)$, which is based on the preceding images and the positions of the target object in these images:

$$\theta_t = \arg\min_{\theta \in O} \left\{ L(\theta; \chi_t) + \gamma R(\theta) \right\} \qquad (2)$$

The space of model parameters $\theta$ is represented by $O$. The regularization term $R(\theta)$ with weight coefficient $\gamma$ is used to restrict the model complexity and to avoid over-fitting. To realize real-time performance, the problems in (1) and (2) must be solved efficiently, and the functions $f$ and $L$ should be selected such that the location of the target object is estimated reliably and accurately.
A score function is proposed that is a linear combination of template and histogram scores, in which the template score is obtained from the HOG feature and the histogram score is obtained from the CN feature:

$$f(x) = \gamma_{tmpl} f_{tmpl}(x) + \gamma_{hist} f_{hist}(x) \qquad (3)$$

The template score is a linear function of an $N$-channel feature image $\phi_x : \Gamma \to \mathbb{R}^N$ that is extracted from $x$ and defined on a finite grid $\Gamma \subset \mathbb{Z}^2$. The weight vector $\alpha$ is another $N$-channel image:

$$f_{tmpl}(x; \alpha) = \sum_{u \in \Gamma} \alpha[u]^T \phi_x[u] \qquad (4)$$

The histogram score is obtained from an $M$-channel feature image $\psi_x : H \to \mathbb{R}^M$ that is extracted from $x$ and defined on a finite grid $H \subset \mathbb{Z}^2$:

$$f_{hist}(x; \beta) = g(\psi_x; \beta) \qquad (5)$$

In contrast to the template score, the histogram score is invariant to the spatial arrangement because the colour distribution of an object is relatively constant. A linear function of the average feature pixel is used:

$$g(\psi; \beta) = \beta^T \left( \frac{1}{|H|} \sum_{u \in H} \psi[u] \right) \qquad (6)$$

It can be expressed as the average of a score image $\zeta_{(\beta, \psi)}[u] = \beta^T \psi[u]$:

$$g(\psi; \beta) = \frac{1}{|H|} \sum_{u \in H} \zeta_{(\beta, \psi)}[u] \qquad (7)$$

It is important that the feature transform commutes with translation, $\phi_{T(x)} = T(\phi_x)$, such that the features can be shared between overlapping windows and the template score can be calculated by adopting the fast approaches that are commonly used for convolution. The histogram score is obtained efficiently by using a single integral image.
The parameters of the whole model are $\theta = (\alpha, \beta)$, since the coefficients $\gamma_{tmpl}$ and $\gamma_{hist}$ can be absorbed into $\alpha$ and $\beta$. The loss function $L(\theta, \chi_t)$ that is optimized over the parameters is a weighted linear combination of per-image losses with weights $w_i$:

$$L(\theta, \chi_t) = \sum_{i=1}^{t} w_i \, \ell(x_i, p_i, \theta) \qquad (8)$$

The form of the per-image loss function is as follows:

$$\ell(x, p, \theta) = \Delta d\left(p, \arg\max_{q \in C} f(T(x, q); \theta)\right) \qquad (9)$$

where $\Delta d(p, q)$ represents the cost of selecting the rectangle $q$ while the true rectangle is $p$. Since this function is non-convex, the optimization problem is exceedingly expensive to solve, which limits the number of training samples and features that can be used. In contrast, correlation filters adopt a simple least-squares loss function and create many samples by using cyclic shifts. Moreover, all circulant matrices can be diagonalized by the discrete Fourier transform (DFT), which reduces the amount of computation substantially.
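The saving from cyclic shifts can be illustrated with a minimal one-dimensional sketch. It is not the tracker's implementation: the helper names `dft`, `idft`, `correlate_direct` and `correlate_dft` are hypothetical, and a naive O(n²) DFT is used only to keep the example dependency-free (a real tracker would use an FFT). The sketch checks that scoring a filter against every cyclic shift of a sample gives the same result as one element-wise product in the Fourier domain.

```python
import cmath
import math
from typing import List


def dft(x: List[float]) -> List[complex]:
    """Naive discrete Fourier transform (illustration only; use an FFT in practice)."""
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * math.pi * m * k / n) for k in range(n))
            for m in range(n)]


def idft(spectrum: List[complex]) -> List[complex]:
    """Naive inverse discrete Fourier transform."""
    n = len(spectrum)
    return [sum(spectrum[m] * cmath.exp(2j * math.pi * m * k / n) for m in range(n)) / n
            for k in range(n)]


def correlate_direct(f: List[float], z: List[float]) -> List[float]:
    """Score f against every cyclic shift of z explicitly: c[s] = sum_k f[k] * z[(k + s) mod n]."""
    n = len(f)
    return [sum(f[k] * z[(k + s) % n] for k in range(n)) for s in range(n)]


def correlate_dft(f: List[float], z: List[float]) -> List[float]:
    """Same circular correlation via the DFT: c = IDFT(conj(F) * Z)."""
    f_hat, z_hat = dft(f), dft(z)
    product = [a.conjugate() * b for a, b in zip(f_hat, z_hat)]
    return [value.real for value in idft(product)]
```

With an FFT in place of the naive transform, the Fourier-domain route costs O(n log n) instead of O(n²), which is the source of the efficiency that the paragraph above attributes to correlation filters.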
To maintain the efficiency and performance of the correlation filter without discarding the information that can be captured by a permutation-invariant histogram score, we propose to learn the model by solving two independent ridge regression problems, one for $\alpha$ and one for $\beta$. By applying the correlation filter formulation, the parameters $\alpha$ can be obtained easily. Although the dimension of $\beta$ may be smaller than that of $\alpha$, $\beta$ can still be more difficult to determine. This is because it cannot be learned from circular shifts; the corresponding system involves a general matrix instead of a circulant matrix.
Finally, the two scores are combined as a convex combination by setting $\gamma_{hist} = \varepsilon$ and $\gamma_{tmpl} = 1 - \varepsilon$, where $\varepsilon$ is a coefficient that was selected on a validation set.
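The combination step can be sketched as follows. This is an illustrative sketch only: the function names `merge_responses` and `argmax2d` are hypothetical, and the two response maps are assumed to have already been computed on the same grid.

```python
from typing import List, Tuple

Grid = List[List[float]]


def merge_responses(tmpl: Grid, hist: Grid, eps: float) -> Grid:
    """Convex combination (1 - eps) * template + eps * histogram, element by element."""
    if not 0.0 <= eps <= 1.0:
        raise ValueError("eps must lie in [0, 1]")
    return [[(1.0 - eps) * t + eps * h for t, h in zip(row_t, row_h)]
            for row_t, row_h in zip(tmpl, hist)]


def argmax2d(grid: Grid) -> Tuple[int, int]:
    """Return (row, column) of the largest entry, i.e. the estimated translation."""
    best_row, best_col = 0, 0
    for r, row in enumerate(grid):
        for c, value in enumerate(row):
            if value > grid[best_row][best_col]:
                best_row, best_col = r, c
    return best_row, best_col
```

Setting `eps` to 0 recovers the pure template tracker and setting it to 1 the pure histogram tracker, which is why a single validated coefficient suffices to balance the two cues.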

Obtaining the Template Score
According to the least-squares formulation of the correlation filter, the loss function is:

$$\epsilon = \left\| \sum_{l} h^l \star f^l - g \right\|^2 + \lambda \sum_{l} \left\| h^l \right\|^2 \qquad (12)$$

where $h^l$ represents channel $l$ of the multi-channel filter $h$, $\star$ denotes circular correlation, and $g$ is the expected correlation output that corresponds to the training sample $f$. The regularization parameter $\lambda$ is used to restrict the model complexity and to alleviate the problem of zero-frequency components in the spectrum of $f$. Since (12) considers only one training sample, its solution can be obtained in closed form:

$$H^l = \frac{\overline{G} F^l}{\sum_{k} \overline{F^k} F^k + \lambda} \qquad (13)$$

Here, capital letters denote the DFTs of the corresponding quantities, and $\overline{G}$ denotes the complex conjugate of $G$; the same holds for $\overline{F^k}$. The filter can be optimized by minimizing the output error over all training samples, but the computation for solving this problem is enormous. To obtain an efficient and convenient approximation, the numerator $A_t^l$ and denominator $B_t$ of the correlation filter $H_t^l$ in (13) are independently updated as follows:

$$A_t^l = (1 - \eta) A_{t-1}^l + \eta \, \overline{G_t} F_t^l, \qquad B_t = (1 - \eta) B_{t-1} + \eta \sum_{k} \overline{F_t^k} F_t^k \qquad (14)$$

Here, $\eta$ is the learning rate. The correlation scores $y$ on a rectangular region $z$ of a feature map are computed as:

$$y = \mathcal{F}^{-1} \left\{ \frac{\sum_{l} \overline{A^l} Z^l}{B + \lambda} \right\} \qquad (15)$$

The new target position is found by maximizing the score $y$, and $\mathcal{F}^{-1}$ denotes the inverse DFT operator.
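The update and detection steps can be sketched in one dimension with a single feature channel. The sketch assumes the usual MOSSE/DSST-style forms, in which the numerator is the conjugate of the label spectrum times the sample spectrum, the denominator is the sample's power spectrum, and both are blended with learning rate η; all function names are hypothetical, and a naive DFT replaces the FFT to avoid dependencies.

```python
import cmath
import math
from typing import List, Tuple


def dft(x: List[float]) -> List[complex]:
    """Naive DFT (illustration only; a real tracker would use an FFT)."""
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * math.pi * m * k / n) for k in range(n))
            for m in range(n)]


def idft(spectrum: List[complex]) -> List[complex]:
    """Naive inverse DFT."""
    n = len(spectrum)
    return [sum(spectrum[m] * cmath.exp(2j * math.pi * m * k / n) for m in range(n)) / n
            for k in range(n)]


def gaussian_label(n: int, sigma: float) -> List[float]:
    """Desired output g: a Gaussian whose peak sits at index 0 (circular distance)."""
    return [math.exp(-0.5 * (min(k, n - k) / sigma) ** 2) for k in range(n)]


def train_filter(f: List[float], g: List[float]) -> Tuple[List[complex], List[float]]:
    """Numerator conj(G) * F and denominator |F|^2 for one single-channel sample."""
    f_hat, g_hat = dft(f), dft(g)
    numerator = [gv.conjugate() * fv for gv, fv in zip(g_hat, f_hat)]
    denominator = [abs(fv) ** 2 for fv in f_hat]
    return numerator, denominator


def update_filter(a_prev: List[complex], b_prev: List[float],
                  a_new: List[complex], b_new: List[float],
                  eta: float) -> Tuple[List[complex], List[float]]:
    """Running average of numerator and denominator with learning rate eta."""
    a = [(1.0 - eta) * p + eta * q for p, q in zip(a_prev, a_new)]
    b = [(1.0 - eta) * p + eta * q for p, q in zip(b_prev, b_new)]
    return a, b


def detect(a: List[complex], b: List[float], z: List[float], lam: float) -> List[float]:
    """Response y = IDFT(conj(A) * Z / (B + lambda)); its argmax is the estimated shift."""
    z_hat = dft(z)
    y_hat = [av.conjugate() * zv / (bv + lam) for av, bv, zv in zip(a, b, z_hat)]
    return [value.real for value in idft(y_hat)]
```

If the test sample is the training sample shifted cyclically by some offset, the response is approximately the Gaussian label shifted by the same offset, so the position of the maximum recovers the displacement.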

Obtaining the Histogram Score
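The histogram score is learned from the samples that are acquired from each image, where $W$ represents a collection of pairs $(q, g)$ of a rectangular window $q$ and its related regression output $g \in \mathbb{R}$, and the loss function is:

$$\ell_{hist}(x, p, \beta) = \sum_{(q, g) \in W} \left( \beta^T \left[ \sum_{u \in H} \psi_{T(x, q)}[u] \right] - g \right)^2 \qquad (16)$$

For an $N$-channel feature transform $\psi$, the solution can be acquired by solving an $N \times N$ system of equations, which consumes $O(N^2)$ memory and $O(N^3)$ time. This is a challenging task if the number of features is enormous.
Instead, features of the form $\psi[u] = e_{k[u]}$ are put forward, where $e_i$ is a one-hot vector that is one at index $i$ and zero elsewhere, and $k[u]$ is the index of the feature that is active at pixel $u$. Moreover, this form is utilized in the PLT method, which is demonstrated in [41]. The features are selected as quantized RGB colours, even though local binary patterns would be a suitable alternative. To render this approach more efficient and convenient, linear regression is conducted on every feature pixel of the target object and background areas $O$ and $B \subset \mathbb{Z}^2$, and the per-image loss function is transformed to:

$$\ell_{hist}(x, p, \beta) = \frac{1}{|O|} \sum_{u \in O} \left( \beta^T \psi[u] - 1 \right)^2 + \frac{1}{|B|} \sum_{u \in B} \left( \beta^T \psi[u] \right)^2 \qquad (17)$$

Here, $\psi$ abbreviates $\psi_{T(x, p)}$. With the one-hot encoding, the formula above decomposes into independent terms per feature dimension:

$$\ell_{hist}(x, p, \beta) = \sum_{j=1}^{N} \left[ \frac{M^j(O)}{|O|} \left( \beta^j - 1 \right)^2 + \frac{M^j(B)}{|B|} \left( \beta^j \right)^2 \right] \qquad (18)$$

where $M^j(A) = |\{u \in A : k[u] = j\}|$ is the number of pixels in area $A$ for which feature $j$ is non-zero. Then, the solution of the corresponding ridge regression problem with regularization parameter $\lambda$ is:

$$\beta_t^j = \frac{\rho^j(O)}{\rho^j(O) + \rho^j(B) + \lambda} \qquad (19)$$

Here, for every feature dimension $j = 1, \ldots, N$, $\rho^j(A) = M^j(A)/|A|$ is the proportion of pixels in the area $A$ for which feature $j$ is non-zero. The parameters of the model are updated online with learning rate $\eta_{hist}$, where $\rho'_t$ denotes the proportions that are measured in frame $t$:

$$\rho_t(O) = (1 - \eta_{hist}) \rho_{t-1}(O) + \eta_{hist} \rho'_t(O), \qquad \rho_t(B) = (1 - \eta_{hist}) \rho_{t-1}(B) + \eta_{hist} \rho'_t(B) \qquad (20)$$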
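A minimal sketch of the per-bin computation is given below. It assumes the closed form commonly used for one-hot colour histograms, in which the weight of bin j is the object proportion divided by the sum of the object proportion, the background proportion and a regularizer; this form, the function names and the regularizer value are assumptions for illustration, not a statement of the authors' exact implementation. Pixels are represented only by their bin index k[u].

```python
from typing import List, Sequence


def bin_proportions(bins: Sequence[int], n_bins: int) -> List[float]:
    """rho_j(A): fraction of pixels in region A whose one-hot feature is bin j."""
    counts = [0] * n_bins
    for b in bins:
        counts[b] += 1
    total = len(bins)
    return [c / total for c in counts]


def update_proportions(prev: Sequence[float], new: Sequence[float], eta: float) -> List[float]:
    """Online running average of the proportions with learning rate eta."""
    return [(1.0 - eta) * p + eta * q for p, q in zip(prev, new)]


def histogram_weights(rho_obj: Sequence[float], rho_bg: Sequence[float], lam: float) -> List[float]:
    """Assumed per-bin ridge solution: beta_j = rho_j(O) / (rho_j(O) + rho_j(B) + lam)."""
    return [o / (o + b + lam) for o, b in zip(rho_obj, rho_bg)]


def histogram_score(window_bins: Sequence[int], beta: Sequence[float]) -> float:
    """Average of the per-pixel scores beta[k[u]] over the window."""
    return sum(beta[b] for b in window_bins) / len(window_bins)
```

Because each bin is solved independently, the cost is linear in the number of bins rather than cubic, which is the motivation given above for the one-hot feature choice.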

Combining the Scale Space Filter
In the preceding section, combining colour attributes with the translation filter of the main approach was proposed for alleviating the effects of deformation. In this section, a solution is proposed for effectively overcoming the problem of scale estimation. In contrast to the traditional scale estimation approaches, an adaptive scale estimation approach that is based on the established object location is proposed for avoiding the high complexity that is caused by exhaustive search.

Combining the Scale Space Filter
A convenient and simple strategy is proposed for incorporating scale estimation via a 3-dimensional scale-space filter. With this approach, the translation and scale can be estimated together by computing the correlation scores in a box-shaped region of a scale pyramid representation, and the scale can be estimated by maximizing this score.
First, a feature pyramid is constructed in a rectangular region around the specified target location. The feature pyramid is constructed such that the target size at the level that corresponds to the spatial filter has M × N dimensions. The training sample $f_t$ is a rectangular cuboid of dimensions M × N × S that is centred on the object position and scale, where S represents the number of scales in the scale-space filter. The filter is updated via (14), and the expected correlation output g is obtained by using a 3-dimensional Gaussian function. The building of the training samples is illustrated in Figure 2. During detection, a feature pyramid is constructed based on the preceding target location and scale. The rectangular cuboid of size M × N × S that is centred on this position is used as the test sample that corresponds to z in (15). Afterward, the correlation scores y are computed via (15).

Iterative Scale Space Filter
As described in Section 3.1, the feature pyramid is constructed around the previously estimated target location and scale. This might introduce a shearing component into the transformation of the test sample z.
The effects of the scale shearing distortions can be alleviated by iterating the detection procedure of the tracker. Thus, the joint scale-space filter can be applied iteratively. When a new frame arrives, the filter is first applied at the preceding object location and scale.
Afterward, the current object location and scale are updated to the values at which the translation correlation filter and the scale-space filter attain their respective maximum scores. Then, the detection procedure is iterated by building the feature pyramid around the current object estimate. The process always converges due to the alleviated shearing distortion, which is regarded as a parameter, when the accuracy of location estimation is enhanced.

High-Confidence Judgement Mechanism
The model update strategy significantly impacts the precision and robustness of the tracking algorithm because the appearance of the target object varies during tracking. If the model that is used to detect the position in the current frame is never changed, the changes in the appearance of the object cannot be captured. Thus, a model update strategy is necessary. However, if the update frequency is too high, the problems of occlusion and motion blur cannot be handled effectively, and if the frequency is too low, the model cannot learn new features from the surroundings in time. To overcome these challenges, a high-confidence judgement mechanism is proposed.
In Figure 3, the left column contains two frames of the sequence basketball from OTB-15. The red bounding boxes represent the tracking results of our algorithm with the high-confidence judgement mechanism, and the blue bounding boxes represent the tracking results that are obtained when the proposed judgement mechanism is not utilized and the model is updated in every frame. The middle column describes the scenario of severe occlusion of the target, under which the model is not updated. The right column presents the response map that is obtained when the model is updated in the same scenario. Most current trackers update their tracking models at each frame without considering the detection precision. This might lead to tracking failure if inaccurate detection, severe occlusion or complete absence of the object occurs. In this section, the response from the tracking results is utilized to judge whether it is necessary to update the tracking model.
The peak score and the fluctuation level of the response map represent the confidence level of the tracking outputs.
The ideal response map should have only one readily observable peak and should be smooth in all other regions if the tracking output matches the accurate target location and scale. The sharper the correlation peak is, the higher the location precision is.
Otherwise, as shown in the first row of Figure 3, the response map will fluctuate violently, and its shape will differ entirely from that of normal response maps. If the tracking model continues to be updated when such a failure occurs, it will be corrupted, as shown in the second row of Figure 3. Therefore, a high-confidence judgement mechanism is proposed for avoiding this scenario, which considers two standards. The first standard is the maximum response score $f_{max}$ of the response map $f(x, p; \theta)$, as given in (21). The second standard, namely, the average fluctuate-except-peak energy (AFEPE) in (22), is a novel standard that expresses the fluctuation level of the response map and the confidence degree of the detected target. In its definition, $f_{max}$ represents the maximum, $f_{min}$ represents the minimum, and $f_{w,h}$ represents the element in the w-th row and h-th column of $f(x, p; \theta)$.
In the ideal scenario, in which the complete target appears in the detection scope, AFEPE becomes large, and the response map has only a single readily observable peak and is smooth in all other regions. In contrast, AFEPE will dramatically decrease when the target disappears or is occluded.
When the two standards of the current frame, namely, $f_{max}$ and AFEPE, are both larger than their respective historical mean values, each scaled by its own ratio, the tracking result in this frame is regarded as a high-confidence result. Then, the tracking model will be updated online via (14), (20), and (21). Figure 3 shows the main advantage of the proposed method. When the object is occluded severely, the fluctuations of the response map intensify and AFEPE decreases to approximately 10, although $f_{max}$ is still sufficiently large. In this scenario, the high-confidence judgement mechanism does not update the model. With this approach, the tracking model will not be corrupted, and the model can track the target object successfully once again in the subsequent frames. Otherwise, the target would be lost, and the desired peak would gradually vanish.
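The decision rule can be sketched as follows. The exact expression for AFEPE is not reproduced in this text, so the sketch assumes a peak-to-fluctuation ratio of the form (f_max − f_min)² divided by the mean of (f_{w,h} − f_min)² over the map, which uses exactly the three quantities named above; this form, the class name `HighConfidenceGate`, the two ratio values and the choice to average only over accepted frames are all assumptions for illustration.

```python
from typing import List, Sequence

Grid = Sequence[Sequence[float]]


def peak_value(resp: Grid) -> float:
    """First standard: the maximum of the response map."""
    return max(v for row in resp for v in row)


def afepe(resp: Grid) -> float:
    """Assumed form: (f_max - f_min)^2 / mean((f_wh - f_min)^2); 0.0 for a flat map."""
    flat = [v for row in resp for v in row]
    f_max, f_min = max(flat), min(flat)
    energy = sum((v - f_min) ** 2 for v in flat) / len(flat)
    if energy == 0.0:
        return 0.0
    return (f_max - f_min) ** 2 / energy


class HighConfidenceGate:
    """Allow a model update only when both standards exceed scaled historical means."""

    def __init__(self, ratio_peak: float = 0.6, ratio_afepe: float = 0.45) -> None:
        self.ratio_peak = ratio_peak      # hypothetical ratio for f_max
        self.ratio_afepe = ratio_afepe    # hypothetical ratio for AFEPE
        self.peaks: List[float] = []      # history of accepted f_max values
        self.energies: List[float] = []   # history of accepted AFEPE values

    def should_update(self, resp: Grid) -> bool:
        peak, score = peak_value(resp), afepe(resp)
        if self.peaks:
            mean_peak = sum(self.peaks) / len(self.peaks)
            mean_score = sum(self.energies) / len(self.energies)
            confident = (peak > self.ratio_peak * mean_peak
                         and score > self.ratio_afepe * mean_score)
        else:
            confident = True  # no history yet: trust the first frame
        if confident:
            self.peaks.append(peak)
            self.energies.append(score)
        return confident
```

A map with one sharp peak yields a large value under this assumed form, whereas a map with several comparable peaks yields a small one even when its maximum is unchanged, which mirrors the occlusion case described for Figure 3.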

Experiments
To evaluate the performance of our proposed tracking algorithm, we conduct experiments on the OTB-13 [39] and OTB-15 [40] benchmark datasets. OTB is an authoritative benchmark that is used by many visual tracking researchers to evaluate the feasibility and efficiency of their proposed approaches. The evaluation methods for OTB-13 and OTB-15 are the same. OTB-13 has 50 sequences, while OTB-15 has 100 sequences. These sequences differ in terms of their challenging conditions; hence, they can be used to evaluate trackers more comprehensively.
All test sequences of OTB have been tagged with 11 attributes which represent challenging conditions in various scenarios, namely, background clutter (BC), motion blur (MB), illumination variation (IV), in-plane rotation (IPR), low resolution (LR), occlusion (OCC), out-of-plane rotation (OPR), scale variation (SV), deformation (DEF), fast motion (FM) and out-of-view (OV). Using these 11 attributes, we evaluate the performance of our approach under various scenarios. Furthermore, we utilize the protocols of OPE (one-pass evaluation) and SRE (spatial robustness evaluation). The success score is the area under the curve (AUC) of the success plot, which shows the fraction of frames in which the overlap between the estimated and ground-truth bounding boxes exceeds a threshold as the threshold ranges from 0 to 1. The precision score is the percentage of frames in which the estimated centre position is within 20 pixels of the ground-truth centre position.
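The two scores can be sketched as follows. Boxes are assumed to be given as (x, y, width, height) tuples, the function names are hypothetical, and the AUC is approximated by averaging the success rate over 21 evenly spaced overlap thresholds, which is an assumption about the sampling rather than a description of the official toolkit.

```python
import math
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x, y, width, height)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix = max(0.0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0.0 else 0.0


def centre_distance(a: Box, b: Box) -> float:
    """Euclidean distance between the box centres, in pixels."""
    ax, ay = a[0] + a[2] / 2.0, a[1] + a[3] / 2.0
    bx, by = b[0] + b[2] / 2.0, b[1] + b[3] / 2.0
    return math.hypot(ax - bx, ay - by)


def precision_score(pred: Sequence[Box], gt: Sequence[Box], max_dist: float = 20.0) -> float:
    """Fraction of frames whose centre error is within max_dist pixels."""
    hits = sum(1 for p, g in zip(pred, gt) if centre_distance(p, g) <= max_dist)
    return hits / len(gt)


def success_auc(pred: Sequence[Box], gt: Sequence[Box], n_thresholds: int = 21) -> float:
    """Mean success rate over overlap thresholds evenly spaced in [0, 1]."""
    overlaps: List[float] = [iou(p, g) for p, g in zip(pred, gt)]
    rates = []
    for i in range(n_thresholds):
        threshold = i / (n_thresholds - 1)
        rates.append(sum(1 for o in overlaps if o > threshold) / len(overlaps))
    return sum(rates) / len(rates)
```

The precision score is sensitive only to the centre location, whereas the success score also penalizes wrong box sizes, which is why scale estimation mainly shows up in the latter.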
In this paper, we compare our algorithm with 7 state-of-the-art high-performance trackers on OTB-13 and OTB-15. We use the experimental results of 7 trackers that have been published by their authors to guarantee a fair comparison. Our algorithm is represented by Algorithm 1.
In Table 1, we list the important parameters that we used in our experiments. The values of these parameters were selected by conducting experiments with various candidate values and choosing the best-performing ones. Our tracker is implemented in MATLAB on a notebook with an Intel i5-6200U @ 2.3 GHz processor.

Algorithm 1 (excerpt)
Translation estimation:
5: Extract translation samples z_trans from x_t at p_{t−1} and s_{t−1}.
8: Set p_t to the target position in the current frame that maximizes f(x_t).
Scale estimation:
10: Extract scale samples z_scale from x_t at p_{t−1} and s_{t−1}.
12: Set s_t to the target scale that maximizes y_scale.

Analyses of Our Approach
To evaluate the performance of our proposed approach with three strategies, we evaluate four versions of our algorithm on OTB-13: Ours-NCN, in which colour names are not combined; Ours-NSP, in which the scale space filter is not utilized; Ours-NHCM, which lacks the high-confidence judgement mechanism; and Ours-N3, in which none of the three strategies are utilized. The characteristics and comparison results are presented in Table 2.
Compared with Ours-NHCM, Ours improves the precision and success rate by 6.8% and 13.0%, respectively. In addition, the FPS increases by 22.6%. Thus, the high-confidence judgement mechanism can alleviate the model corruption that is caused by occlusion and other problems, and the strategy of updating only when necessary instead of updating every frame substantially improves the speed.
Without combining the colour features, the tracker Ours-NCN shows poorer performance. Compared with Ours-NCN, the precision of Ours increased by 15.7% and the success rate increased by 30.5%, but at the expense of 11 FPS. Ours-N3, as might be expected, performs worst due to the absence of the benefits of the three strategies. Moreover, its speed is the fastest due to its lower complexity.
According to Table 2, tracker Ours outperforms all the other versions in terms of precision and success rate, and the speed can reach 65 FPS, which satisfies real-time requirements. As discussed above, the three mechanisms that are proposed in this paper realize satisfactory efficiency and feasibility.

Overall Performance Evaluation
In this section, we compare our algorithm with 7 state-of-the-art trackers, namely, Staple [24], fDSST [19], KCF [25], CN [16], CSK [22], GOTURN [42] and CCOT [43], as listed in Table 3. Among them, Staple, fDSST, KCF, CN, and CSK are correlation-filter-based algorithms, and GOTURN and CCOT are deep-learning-based algorithms. The data of GOTURN and CCOT are obtained from their original papers. As shown in Table 3, CCOT outperforms all the other trackers, and compared to our approach, it realizes average improvements of approximately 10.2% in precision on OTB-13 and 5.6% in success on OTB-15. However, the two trackers perform almost the same in terms of success on OTB-13, and the speed of CCOT is only 0.3 FPS, which cannot satisfy the real-time requirements. In contrast, our algorithm runs at 65 FPS, which is more than 216 times faster than CCOT. Moreover, the tracker GOTURN, which is also based on deep learning, runs very fast at 165 FPS and is second only to KCF in terms of speed; however, a large disparity in performance is observed compared with our tracker. Our approach realizes average enhancements of approximately 25% in precision and 49% in success. Figure 4 presents the precision and success plots of the top six trackers on both OTB-13 and OTB-15. The first row in Figure 4 presents the comparison results on OTB-13, and the second row presents the comparison results on OTB-15. These six trackers are all correlation-filter-based algorithms, and their characteristics are presented in detail in Table 4. KCF and CSK only use HOG features to describe the object model, and they do not utilize the three strategies. CN uses colour names to handle object deformation, and fDSST focuses mainly on scale estimation; their speeds are 78 FPS and 54 FPS, respectively. Our tracker fully utilizes the three mechanisms and performs best in terms of most metrics.
CSK is a famous tracker that pioneered the application of correlation filters (CFs) in visual tracking; hence, it is a satisfactory representative of previous classical trackers. Our approach significantly improves on CSK, with an average improvement of approximately 60% in precision. KCF adopts cyclic shifts for dense sampling and incorporates multi-channel HOG features to make the tracker more robust. It can run very fast at 172 FPS. Our tracker significantly outperforms KCF, with average enhancements of the success rate of approximately 46.5% on OTB-13 and 47% on OTB-15. CN utilizes the multi-channel colour names feature on the basis of CSK, and fDSST is the fast version of DSST with a scale space filter. They have realized satisfactory improvements, but because neither combines colour names, scale estimation and a high-confidence judgement mechanism, their performance is worse than that of our approach. By combining HOG features and colour names, Staple realizes relatively satisfactory performance. On the basis of this multi-feature fusion, our tracker additionally employs adaptive scale estimation and a high-confidence judgement mechanism. Our tracker outperforms Staple by 5.3% in terms of precision and 8.5% in terms of the success rate on average. Figure 5 presents a visualization of the tracking results of our tracker and of other famous trackers on various test sequences. The "Liquor", "Skating", "Jogging-2" and "Subway" sequences all contain occlusions, deformations, and background clutter, which lead CSK and KCF to miss the object completely, whereas our tracker can track the object accurately.

Robustness Evaluation
In the sequences "Skating" and "Jogging-2", which contain illumination variations, scale variations, occlusions, deformations, out-of-plane rotations, and background clutter, all trackers identify the target in the first several frames, but most trackers lose the object over time, and only our approach tracks the object correctly throughout. We can see that the success rates in the scenarios of background clutter, motion blur, deformation and illumination variation in (b), (c), and (d), respectively, are approximately twice those of the original correlation filter, which is obtained by combining colour attributes. In (i), our approach realizes a larger improvement of approximately 14.2% compared to Staple, which does not utilize adaptive scale estimation. In (g), it demonstrates strong performance, which is due to the effect of the high-confidence judgement mechanism when encountering update problems.
According to these experiments, our proposed fusional multi-correlation-filters with the high-confidence judgement mechanism can outperform state-of-the-art trackers in most scenarios.

Experiment with the Target Tracking Robot
Target tracking robots are a potential application of human-robot collaboration. They can be deployed in service areas such as industrial plants, offices, supermarkets, and airports, where they collaboratively provide information, guidance and/or physical assistance to people. The main challenge for the target tracking robot is to track the user accurately and in real time.
The robot platform used in this experiment was a GQY robot. The GQY robot is a comprehensive experimental platform, which is equipped with a monocular camera, lidar, an ultrasonic sensor, and other sensors. We used the GQY robot to experiment with our tracker.
If the robot cannot track the experimenter in real time, it will fail to follow the experimenter along an S-shaped path. Figure 7 shows the target tracking robot following the experimenter along an S-shaped path. We repeated the experiment more than 20 times. Thanks to the real-time performance of our tracker, the target tracking robot can track the target stably in real time.
As shown in Figure 8, the illumination variation during the process is very obvious. Our tracker, which incorporates colour attributes, is more robust in scenes with illumination variation, and illumination variation does not affect the accuracy and stability of robot tracking.
In a variety of challenging environments, the target tracking robot can still perform effective and accurate tracking, which illustrates the effectiveness of our tracker and shows that it meets the requirements of real-time target tracking.

Conclusions
In this paper, a novel tracker is proposed for overcoming challenges such as deformation, scale variation, and occlusion in the field of visual tracking. Colour attributes are combined with the template features, and due to their complementary characteristics, they handle variations in shape well. Correlation-filter-based trackers have theoretical limitations. For example, large shape changes can cause more background to be learned, which degrades correlation-filter-based trackers. To mitigate the problems that are caused by background clutter, fast motion and scale variations, an innovative scale estimation filter is utilized. Furthermore, a high-confidence mechanism is proposed for preventing model corruption. This tracker not only performs excellently but also satisfies the real-time performance requirement of online object tracking. Deep learning has advantages in feature representation, and a trained network can achieve high accuracy. The target tracking algorithm proposed in this paper is based on correlation filtering and ensures real-time performance; however, its accuracy is lower than that of deep-learning-based target tracking algorithms. We are combining our method with deep learning to further increase accuracy.
Author Contributions: In this work, W.W., C.L. and B.X. conceived the main idea, designed the main algorithms, experiments and wrote the paper. L.L., Y.T., and W.C. analyzed the data, performed the simulation experiments and reviewed the paper. All authors have read and agreed to the published version of the manuscript.

Conflicts of Interest:
The authors declare no conflict of interest.