Multi-Templates Based Robust Tracking for Robot Person-Following Tasks

Abstract: While robotics technology has not yet reached full automation, robot following is a common and crucial capability in robotic applications because it reduces the need for dedicated teleoperation. To achieve this task, the target must first be robustly and consistently perceived. In this paper, a robust visual tracking approach is proposed. The approach adopts a scene analysis module (SAM) to identify the real target and similar distractors by leveraging the statistical characteristics of cross-correlation responses. Positive templates are collected based on the tracking confidence constructed by the SAM, and negative templates are gathered from the recognized distractors. Based on the collected templates, response fusion is performed. As a result, the responses of the target are enhanced and false responses are suppressed, leading to robust tracking results. The proposed approach is validated on an outdoor robot person-following dataset and a collection of public person-tracking datasets. The results show that our approach achieves state-of-the-art tracking performance in terms of both robustness and AUC score.


Introduction
While robotics technology has not yet achieved full automation, human-robot collaboration scenarios have arisen in diverse domains, such as manufacturing, health care, and entertainment. The major advantage of adopting person-following robots is that they reduce the demand for dedicated teleoperation. In all person-following applications, robustly recognizing the target to follow is the most important aspect of the following system. Perception sensors used in person-following systems include cameras, laser range-finders, LiDARs, infrared and thermal sensors, and sonars. The RGB camera is widely used for its rich information, compactness, and cost-effectiveness.
To perform following tasks, the robot must perceive the relative position of the target in its operating environment, which can be considered a tracking task. There are many situations in which the robot may lose track in a dynamic environment, e.g., occlusion, illumination variation, scale variation, and deformation. Therefore, the target must be tracked in real time without critical failures. Attempts to use visual tracking techniques have flourished over the past decade. In previous approaches, tracking algorithms detected specific features in the input feature space [1]. Schlegel et al. [2] and Hu et al. [3] developed tracking methods in the RGB image space. Shin et al. [4] presented a model-free tracking algorithm using optical flow. Koide et al. [5] tracked people using color, height, and gait features. Satake et al. [6] established a distance-dependent appearance model using the SIFT feature. Kwolek [7] tracked targets using a color histogram. Satake [8] combined depth histogram features with support vector machines for robust tracking. Chen et al. [9] applied the Ada-boosting algorithm to person tracking. Wang et al. [10] adopted the kernelized correlation filter (KCF) as the tracking module in a following mission. Tracking with traditional features can work under certain circumstances but does not perform well in long-term and complex environments.
Recently, Siamese-based approaches that adopt discriminative correlation of deep features have been proposed to address these issues. In Siamese-based trackers, it is common (sometimes stated in the papers but not implemented in the released code) to use only the first frame as the template to guarantee template reliability, which achieves good performance on short-term datasets such as OTB [11] and VOT [12].
However, in experiments, a fixed template can perform well for a certain duration, but over time, variations in appearance, illumination, scale, deformation, etc., reduce the intensity of the responses to the tracked target, and eventually tracking is lost. The intuitive solution is to keep incorporating the latest target information, but Zhang et al. [13] showed that tracking performance only worsens if a non-discriminative template update strategy is applied throughout the tracking process. We believe the reason is the introduction of false-positive templates. Therefore, a tracking reliability criterion is needed to safely incorporate new templates.
Providing new target information only enhances the responses of target tracking; false responses caused by similar objects remain. As illustrated in the ground-truth score map row of Figure 1, even though the people have different appearances, they receive high responses in the score map. When the real target crosses or is occluded by these objects, tracking is easily lost due to these interferences. It is therefore also important to eliminate false responses. Motivated by the aforementioned analysis, we propose a robust tracking approach to enable robot person-following tasks. A scene analysis module (SAM) is proposed that leverages the statistical characteristics of the cross-correlation responses. The density distribution of the responses is estimated using a Gaussian mixture model. Based on the mutual information of the mixture components, the responses are segmented into instance-aware clusters. As a result, a tracking reliability criterion is proposed based on the size of the center cluster, and distractors that produce false responses are extracted as negative templates. By collecting the positive and negative templates, a score fusion strategy is applied to enhance the responses of target tracking and to eliminate false responses, leading to robust person tracking.
Our main contributions can be summarized as follows: (1) We propose a tracking reliability criterion based on the variance of center responses. With this criterion, the most recent reliable results can be safely extracted as positive templates, avoiding template pollution. (2) We propose a score fusion strategy that generates the final score map by combining the responses of the ground-truth template, positive templates, and negative templates. As a result, the target responses are enhanced and distractors are suppressed, reducing the chance of incorrect positioning. (3) The proposed method was incorporated into two state-of-the-art approaches, SiamRPN [14] and SiamRPN++ [15], and validated on person-following datasets as well as public datasets. The results show that our approaches outperform their base approaches and rank highly when competing with other state-of-the-art approaches.

Related Work
The tracking performance of traditional features is severely restricted when tracking scenarios are complex. Distinct from handcrafted features, the emergence of deep-learning-based approaches has provided a significant increase in performance. Tracking algorithms based on deep feature representations have achieved state-of-the-art accuracy.
Although these techniques perform well on benchmarks, they often suffer from tracking drift caused by the accumulation of errors. Recently, derived from the idea of tracking by detection, trackers based on the Siamese network have received wide attention. Siamese-based trackers formulate the tracking problem as a similarity learning function and predict the object location by comparing the similarity between the template image and the search image. The Siamese networks are trained offline on large-scale image pairs.
The pioneering method SiamFC [16] uses the Siamese network as a feature extractor and introduces a cross-correlation layer to generate a single-channel response map. The correlation can be seen as a similarity calculation, and the response map reflects the similarity between the template and the search region. Following this similarity-learning work, Li et al. proposed SiamRPN [14], which enhances tracking performance by integrating a region proposal network (RPN) into SiamFC. The RPN has two branches: a classification branch in charge of scoring probabilities, and a regression branch responsible for estimating the coordinates of bounding boxes. Based on SiamRPN, DaSiamRPN [17] addresses the imbalance between non-semantic negative examples and semantic distractors in the training data through data augmentation. UpdateNet [13] further improves upon DaSiamRPN by incorporating a small network that learns the appearance change of tracked targets. SiamDW [18] takes advantage of deeper neural networks by eliminating the negative impact of padding. SiamRPN++ [15] further improves object-tracking performance using deeper networks and multi-layer fusion, achieving better accuracy while maintaining fair speed. To eliminate the negative effects of anchors, SiamCAR [19] adopts two subnetworks for feature extraction and regression, respectively, and proposes an anchor-free framework; SiamBAN [20] directly classifies objects and regresses bounding boxes by taking advantage of a unified fully convolutional network. Avoiding pre-defined anchors sidesteps tricky hyper-parameter tuning, easing the effect of human intervention.
To solve the problem that the size of the object feature region needs to be determined in advance, and the cross-correlation method either retains a lot of unfavorable background information or loses a lot of foreground information, SiamGAT [21] proposes a Graph Attention Module (GAM) to establish a partial correspondence between an object and a search region as a complete bipartite graph.
In long-term tracking, robustness is a common weak point. Siam R-CNN [22] has a two-stage Siamese re-detection architecture and re-detects images by comparing region proposals with the template region. LTMU [23] proposed a meta-updater that guides the tracker update, forming a long-term tracking framework along with an online local tracker, an online verifier, and a SiamRPN-based re-detector. These methods significantly improve tracking precision but have a low tracking frame rate even on high-end desktops. Wang et al. [24] presented a long-term target tracking method by combining adaptive discriminative correlation filters with a support vector machine-based component. Siam-RM [25] is an object-tracking framework that uses the Siamese network and adopts the Siamese instance search tracker as the re-detection network. Zhang et al. [26] deployed local-global multiple correlation filters for tracking and a Kalman filter re-detection model for re-detection when the correlation filters are unreliable. Mechanisms such as online updaters, re-detection modules, hierarchical search, and multi-stage frameworks are commonly used to handle robustness issues in long-term tracking. However, the introduced modules inevitably deteriorate the real-time performance of these approaches.
In this paper, we use SiamRPN-based trackers as the front end of the following system. Because using only the first frame as the template is easily impacted and may lose the target, we adopt a scene analysis module that can safely produce positive and negative templates in the tracking scenes. By fusing the scores of the templates, the target responses are enhanced and noises are suppressed, leading to robust tracking. Figure 2 presents the flow chart of our approach. The base tracker part follows the common steps of Siamese-based trackers: for each frame, the features of the template image and the search image are extracted using a shared-weight deep convolutional backbone. The two extracted features are then cross-correlated to produce a score map that consists of the probabilities of similarity between the template and the search image. Usually, Siamese-based trackers apply non-maximum suppression to the scores and choose the regressed bounding box corresponding to the highest value as the tracking result, but this does not work well in situations such as occlusion and appearance change. To further exploit the information in the score map, we take the scores as a whole and process them using our scene analysis module. Objects with highly similar responses are collected as negative templates. Next, we estimate the tracking reliability by leveraging the outcomes of the SAM and collect confident tracking results as positive templates. Finally, the scores of the ground-truth, negative, and positive templates are fused, and the tracking box is regressed from the fused score map.

Framework
To improve tracking robustness, we take the scores as a whole and further exploit the information provided by the score map using our scene analysis module (SAM). The SAM analyzes the score map by estimating the score densities and segmenting the scores into instance-aware clusters. The SAM provides two contributions. First, a tracking reliability criterion is proposed using the statistical characteristics of the score distribution; if the tracking is determined to be reliable, the tracking target is extracted and collected as a positive template. Second, because of the limitations of the backbone networks, objects that are similar to the tracking target also respond with high values in the score map, which strongly interferes with tracking accuracy. Since the SAM segments the scores into instance-aware clusters and each cluster represents a potential tracking target, the targets other than the one being tracked are determined to be false-positive targets and collected as negative templates.
Finally, the score maps of the ground truth and negative and positive templates are fused together and the tracking box is regressed from the fused score map. The positive templates provide more recent information, enhancing the responses of the target-tracking. The negative templates are in charge of suppressing the interference due to similar objects.

Scene Analysis Module
In Siamese trackers, only the maximum value of the responses is used to predict the candidate target position. However, this outcome may be unreliable in complicated scenes, such as out-of-view and occlusion situations. Nevertheless, SiamRPN-based trackers provide discriminative responses on foregrounds and backgrounds (Figure 3a). After the score map is generated, we take the map as a whole and analyze the statistical characteristics of the response distribution. The estimated distribution is further segmented into instance-aware clusters, where each cluster corresponds to a potential object that is similar to the tracked target. The distribution variance of the objects and their bounding boxes are used to establish a tracking confidence criterion and to fit false-positive objects. Figure 3a shows a tracking frame and its corresponding responses after cross-correlation. The tracking frames were obtained from the OTB dataset [11], and the backbone from SiamRPN++ [15]. Figure 3b shows the responses in three dimensions. We can see that, with improved training [17] and deeper networks [15], the responses distribute densely within the potential tracking targets. Therefore, we treat the responses as a distribution and first sample from it using the accept-reject algorithm. Specifically, given the response distribution d(x), we select a known proposal distribution q(x) and a sufficiently large constant m such that m q(x) ≥ d(x) for all x. Then, we repeatedly draw a candidate x_i from q(x) and a value u_i from the uniform distribution U(0, 1). If the ith sample satisfies

u_i ≤ d(x_i) / (m q(x_i)),

we accept x_i as a sample, and reject it otherwise. Figure 3c shows the sampled points from the score map.

Density Estimation

After sampling from the score map, we adopt the Gaussian mixture model (GMM) to estimate the probability density. The GMM is a parametric probability density function represented as a weighted sum of Gaussian densities:

p(x) = Σ_{c=1}^{C} ρ_c N(x | µ_c, Σ_c),

where N(x | µ, Σ) is the multivariate Gaussian density, whose parameters µ ∈ R^2 and Σ ∈ R^{2×2} are the mean vector and the covariance matrix, and the scalar ρ_c is the weight of the cth Gaussian component, with Σ_c ρ_c = 1.

Since there is no closed-form solution for the GMM, the expectation-maximization (EM) algorithm [27] is commonly used to find a solution by iteratively maximizing the data likelihood until the average data log-likelihood converges. Figure 3c depicts the fitting outcome for the example image; the components of the GMM are visualized as different-colored ellipses.
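The sampling and density-estimation steps above can be sketched as follows. This is a numpy-only illustration: the function names, the uniform choice of q(x), and the lightweight EM loop are our own assumptions, not the paper's released implementation.

```python
import numpy as np

def accept_reject_sample(score_map, n_samples=500, seed=0):
    """Accept-reject sampling: draw 2-D points whose density follows the
    response map d(x), using a uniform proposal q(x) over the map."""
    rng = np.random.default_rng(seed)
    h, w = score_map.shape
    d_max = score_map.max()  # with uniform q, the envelope m*q(x) reduces to d_max
    pts = []
    while len(pts) < n_samples:
        x, y = rng.integers(0, w), rng.integers(0, h)  # candidate x_i ~ q(x)
        u = rng.uniform()                              # u_i ~ U(0, 1)
        if u <= score_map[y, x] / d_max:               # accept if u_i <= d/(m q)
            pts.append((x, y))
    return np.asarray(pts, dtype=float)

def gauss_pdf(x, mu, cov):
    """Density of a 2-D Gaussian evaluated at the rows of x."""
    diff = x - mu
    inv = np.linalg.inv(cov)
    norm = 1.0 / (2.0 * np.pi * np.sqrt(np.linalg.det(cov)))
    return norm * np.exp(-0.5 * np.einsum("ni,ij,nj->n", diff, inv, diff))

def fit_gmm(samples, n_components=6, n_iter=50, seed=0):
    """Minimal EM fit of a full-covariance 2-D GMM (numpy-only sketch)."""
    rng = np.random.default_rng(seed)
    n, d = samples.shape
    means = samples[rng.choice(n, n_components, replace=False)]
    covs = np.stack([np.cov(samples.T) + 1e-6 * np.eye(d)] * n_components)
    weights = np.full(n_components, 1.0 / n_components)
    for _ in range(n_iter):
        # E-step: responsibilities from weighted component densities
        dens = np.stack([weights[c] * gauss_pdf(samples, means[c], covs[c])
                         for c in range(n_components)], axis=1) + 1e-12
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, covariances
        nk = resp.sum(axis=0)
        weights = nk / n
        means = (resp.T @ samples) / nk[:, None]
        for c in range(n_components):
            diff = samples - means[c]
            covs[c] = (resp[:, c, None] * diff).T @ diff / nk[c] + 1e-6 * np.eye(d)
    return weights, means, covs
```

In practice, a library routine (e.g., a full-covariance GMM with six components, as used in the experiments) would replace the hand-rolled EM loop.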

Instance Segmentation
The density of the response map is estimated by the GMM; however, the GMM components are not discriminative across instances. Biemann [28] adopted the Chinese whispers algorithm to solve clustering problems on undirected, weighted graphs, which can be used to segment the GMM into instance-aware mixtures.
We define G = (V, E) as a graph with nodes v_i ∈ V and weighted edges (v_i, v_j, w_ij) ∈ E. The adjacency matrix W of graph G is a square matrix whose entries w_ij denote the weight of the edge between v_i and v_j. Since the segmentation is conducted on probability densities, we use the Kullback-Leibler divergence (KL divergence) as the metric to set up the weights of the graph edges.
Given two components f = ρ_f N(µ_f, Σ_f) and g = ρ_g N(µ_g, Σ_g) of a mixture, according to the definition, their KL divergence is given by

D_KL(f ‖ g) = ∫ f(x) log( f(x) / g(x) ) dx,

and, for Gaussian densities, a closed-form solution is derived as

D_KL(f ‖ g) = (1/2) [ log(|Σ_g| / |Σ_f|) − d + tr(Σ_g^{-1} Σ_f) + (µ_g − µ_f)^T Σ_g^{-1} (µ_g − µ_f) ],

where d = 2 is the dimension. We take the GMM components as the nodes in G. If the KL divergence between two nodes is below a threshold, an edge is established, and the reciprocal of the divergence is set at the corresponding position in the adjacency matrix W. The algorithm then iteratively segments the graph by grouping nodes that have the maximum mutual weights. Figure 3d presents an example outcome of applying instance segmentation: the GMM components are segmented into four clusters, where each cluster corresponds to a potential object in Figure 3a.
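The divergence computation and graph clustering can be sketched as follows. This is an illustrative numpy implementation: we assume edges connect components whose divergence falls below the threshold (with the reciprocal divergence as the weight), and `chinese_whispers` follows Biemann's generic label-propagation formulation rather than the paper's exact code.

```python
import numpy as np

def kl_gaussians(mu_f, cov_f, mu_g, cov_g):
    """Closed-form KL divergence D(f || g) between two multivariate Gaussians."""
    d = mu_f.shape[0]
    inv_g = np.linalg.inv(cov_g)
    diff = mu_g - mu_f
    return 0.5 * (np.log(np.linalg.det(cov_g) / np.linalg.det(cov_f)) - d
                  + np.trace(inv_g @ cov_f) + diff @ inv_g @ diff)

def build_adjacency(means, covs, kl_thresh=2.0):
    """Edges between components whose divergence is below the threshold;
    the weight is the reciprocal of the divergence (assumed interpretation)."""
    n = len(means)
    W = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            kl = kl_gaussians(means[i], covs[i], means[j], covs[j])
            if kl < kl_thresh:
                W[i, j] = 1.0 / max(kl, 1e-9)
    return W

def chinese_whispers(W, n_iter=20, seed=0):
    """Label propagation on a weighted adjacency matrix (Chinese whispers)."""
    rng = np.random.default_rng(seed)
    n = W.shape[0]
    labels = np.arange(n)
    for _ in range(n_iter):
        for i in rng.permutation(n):
            nbrs = np.nonzero(W[i])[0]
            if nbrs.size == 0:
                continue
            # adopt the neighbor label with the largest total incident weight
            scores = {}
            for j in nbrs:
                scores[labels[j]] = scores.get(labels[j], 0.0) + W[i, j]
            labels[i] = max(scores, key=scores.get)
    return labels
```

Two nearby components end up sharing a label (one instance), while a distant component keeps its own cluster.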

Reliability Estimation
The instance segmentation clusters the responses into instance-aware GMM mixtures. When tracking under reliable circumstances, each object has its own cluster, as illustrated in Figure 3d, and the size of the cluster remains stable. However, when a potential occlusion occurs, the instance clusters merge together given their small divergence values, resulting in a large size variation. The upper row of Figure 4 illustrates example scenes before and after a potential occlusion. We define the standard deviation matrix of the ith instance as s_i, and σ_i as the maximum eigenvalue of s_i. We introduce a reliability parameter τ:

τ = σ_c / SIZE_score,

where σ_c corresponds to the center (tracked-target) cluster and SIZE_score is the size of the response map. Figure 4 shows the values of τ over the frames of Girl2 from the OTB dataset. When a potential occlusion occurs, τ presents a peak. We set a threshold parameter τ_t. When τ ≥ τ_t, a potential occlusion is indicated (see the τ values around frames 50, 70, 100, and 120) and the tracking results are unreliable. Conversely, if τ satisfies τ ≤ τ_t for N_r successive tracking frames, the result is considered reliable.
Figure 4. The plot of τ values on Girl2 of the OTB dataset (bottom row), heatmaps of the score map in potential occlusion scenes, and the corresponding negative templates (top row). In the τ plot, the peaks indicate potential occlusions, i.e., unreliable tracking results.
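A minimal sketch of the reliability criterion follows. We assume τ is the ratio of the center cluster's largest standard-deviation eigenvalue to the response-map side length (the exact definition of SIZE_score is not spelled out in the text), and the monitor class is our own illustrative wrapper around the τ ≤ τ_t streak logic.

```python
import numpy as np

def reliability_tau(cov_target, map_size):
    """Reconstructed tau: largest eigenvalue of the center cluster's standard
    deviation matrix divided by the response-map size (assumed: side length)."""
    sigma = np.sqrt(np.linalg.eigvalsh(cov_target)).max()
    return sigma / map_size

class ReliabilityMonitor:
    """Declares the result reliable after n_r successive frames with tau <= tau_t."""
    def __init__(self, tau_t=0.19, n_r=5):
        self.tau_t, self.n_r, self.streak = tau_t, n_r, 0

    def update(self, tau):
        # a peak (tau > tau_t) resets the streak: potential occlusion
        self.streak = self.streak + 1 if tau <= self.tau_t else 0
        return self.streak >= self.n_r
```

With a 25 × 25 score map, a cluster standard deviation of about five cells would put τ near the 0.19 threshold used in the experiments, which is what motivates the side-length assumption.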

Positive Templates
In Section 3.3, we proposed a tracking reliability criterion. Based on this criterion, we can safely extract the latest reliable tracking target as a positive template. We define N_pos as the maximum number of positive templates stored during a tracking task. The templates are stored as a queue; if the number of templates exceeds N_pos, the template at the top of the queue is removed.
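The bounded queue described above maps directly onto a fixed-length deque (a sketch; the class and method names are illustrative):

```python
from collections import deque

class PositiveTemplateQueue:
    """Bounded FIFO of recent reliable templates; once N_pos is exceeded,
    the oldest entry (the top of the queue) is dropped automatically."""
    def __init__(self, n_pos=2):
        self.queue = deque(maxlen=n_pos)

    def add(self, template):
        self.queue.append(template)

    def templates(self):
        return list(self.queue)
```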

Negative Templates
The SAM segments the GMM into instance-aware mixtures. Based on the number of segmented mixtures, we can infer the number of similar objects in the current tracking scene; then, except for the tracked target, the bounding box of each object is regressed and the object image is cropped as a negative template. As such, the false-positive responses can be suppressed in the score-map fusion step.
We set N_neg as the maximum number of negative templates stored during tracking tasks. When a new negative template arrives, the Euclidean distances between the new template's feature and the stored template features are calculated; the stored template with the smallest distance is then removed, and the new template is added.
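The replacement policy can be sketched as follows (illustrative; `NegativeTemplateStore` and its method names are our own):

```python
import numpy as np

class NegativeTemplateStore:
    """Keeps at most n_neg distractor template features. When full, the stored
    feature closest (Euclidean) to the incoming one is replaced, so the pool
    stays diverse while still reflecting the latest distractors."""
    def __init__(self, n_neg=3):
        self.n_neg = n_neg
        self.features = []

    def add(self, feat):
        feat = np.asarray(feat, dtype=float)
        if len(self.features) < self.n_neg:
            self.features.append(feat)
            return
        # replace the nearest stored template with the new one
        dists = [np.linalg.norm(feat - f) for f in self.features]
        self.features[int(np.argmin(dists))] = feat
```

Replacing the nearest neighbor rather than the oldest entry avoids filling the pool with near-duplicates of a single distractor.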

Score Fusion
Equation (5) describes the score fusion process:

S_fused = ϕ(f_gt) + (1/N_pos) Σ_{i=1}^{N_pos} ϕ(f_pos^i) − (1/N_neg) Σ_{j=1}^{N_neg} ϕ(f_neg^j),    (5)

where N_pos and N_neg are the numbers of collected positive and negative templates, respectively; f_pos and f_neg are the features extracted using the backbone network; and ϕ(·) denotes the cross-correlation with the search-image feature.
The idea of score fusion is to enhance the responses of target tracking using positive templates and to suppress false responses using negative templates. Figure 1 presents examples of the outcome of our score fusion procedure. Each column is a tracking frame; from top to bottom, the rows illustrate the fused score map, the ground-truth score map, and the score maps of two negative templates. The ground-truth score map not only responds to the tracked target but also produces high responses in many other areas where the tracker is easily misled. After applying our fusion, these unrelated responses are sufficiently suppressed.
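A sketch of the fusion step under the assumed form of Equation (5), i.e., adding averaged positive-template responses and subtracting averaged negative-template responses (the exact weighting in the paper may differ):

```python
import numpy as np

def fuse_scores(s_gt, pos_scores, neg_scores):
    """Fuse the ground-truth score map with positive/negative template maps.
    pos_scores and neg_scores are lists of score maps of the same shape."""
    fused = s_gt.copy()
    if pos_scores:
        fused += np.mean(pos_scores, axis=0)  # enhance target responses
    if neg_scores:
        fused -= np.mean(neg_scores, axis=0)  # suppress distractor responses
    return np.clip(fused, 0.0, None)          # keep responses non-negative
```

Averaging inside each group keeps the fused map on a scale comparable to a single correlation response, regardless of how many templates are stored.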

Evaluation
To evaluate the performance of our method, we tested it on two collections of datasets: the UGV dataset and a set of public sequences. The UGV dataset includes 17 image sequences of outdoor person-following tasks recorded by a small unmanned ground vehicle (UGV). The purpose of the person-following system is to reduce the workload of the teleoperator. To conduct a more comprehensive evaluation, we further selected 27 image sequences that involve person tracking from the OTB and UAV [29] datasets. Unlike other popular public tracking datasets, the sequences of the UAV dataset were captured from the aerial viewpoint of low-altitude UAVs.
In the experiments, we applied the designed algorithm to two representative Siamese trackers: SiamRPN and SiamRPN++. SiamRPN adopts AlexNet as the backbone and takes the feature of the final layer for the correlation. SiamRPN++ uses ResNet50 as the backbone and outputs features by fusing the outputs of multiple layers. We applied our framework to these two approaches and observed the performance improvement. The applied networks and pre-trained weights were obtained from https://github.com/STVIR/pysot (accessed on 20 April 2021). We further included DaSiamRPN (https://github.com/foolwood/DaSiamRPN, accessed on 20 April 2021) and its update-based variant UpdateNet (https://github.com/zhanglichao/updatenet, accessed on 20 April 2021) for comparison. SiamRPN and our improvement, SiamRPN++ and our improvement, and DaSiamRPN and UpdateNet share weights respectively and can thus be seen as three comparison groups.
As the evaluation method, we used one pass evaluation (OPE) [11]. The OPE criterion scores tracker performance using center location error and the bounding box overlap, which yield a precision plot and a success plot according to the threshold, respectively. The success plots are calculated as the percentage of frames with an intersection-over-union (IOU) overlap exceeding a threshold and scored using the area under the curve (AUC) score.
Since our approach provides improvements in terms of robustness instead of localization accuracy, the precision plots were omitted.
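The OPE success plot, the AUC score, and the fraction of frames with any overlap (IOU > 0, used later as the robustness measure) can be computed as in the following self-contained sketch, with boxes given as (x, y, w, h):

```python
import numpy as np

def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x, y, w, h)."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    ix1, iy1 = max(ax, bx), max(ay, by)
    ix2, iy2 = min(ax + aw, bx + bw), min(ay + ah, by + bh)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def success_auc(pred_boxes, gt_boxes, thresholds=np.linspace(0, 1, 21)):
    """Success-plot values over overlap thresholds and the AUC score."""
    ious = np.array([iou(p, g) for p, g in zip(pred_boxes, gt_boxes)])
    success = np.array([(ious > t).mean() for t in thresholds])
    return success, success.mean()

def robustness(pred_boxes, gt_boxes):
    """Fraction of frames with IOU > 0 (success rate at overlap threshold 0)."""
    return float(np.mean([iou(p, g) > 0 for p, g in zip(pred_boxes, gt_boxes)]))
```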
The experiments were conducted on a desktop with an NVIDIA RTX3090 GPU and an Intel i7 CPU. We set the number of GMM fitting components to six, the KL divergence threshold for the adjacency matrix to two, the threshold parameter τ_t = 0.19, N_pos = 2, and N_neg = 3.

UGV Dataset
The UGV dataset is a self-constructed dataset that contains images from a small unmanned ground vehicle performing person-following tasks in outdoor environments. The robot followed a single person in a campus environment under varying road conditions (e.g., brick roads, cement roads, snowy roads, and grasslands) and illumination conditions (backlight, shadow, dawn, and night). The vehicle servoed in accordance with the relative position of the tracked target. The followed person was set up to pose different challenging situations, such as walking with people wearing similar clothes and undergoing partial and full occlusion. The images were collected by an Intel RealSense D435i camera rigidly attached to the robot. The camera collected images at 30 fps during the following tasks; we downsampled the frame rate to 10 fps in our dataset. The image resolution is 640 × 480 pixels. The robot followed the target person at speeds of up to 2 m/s. The dataset contains 17 image sequences that vary in the appearance of the tracking targets, the appearance and number of distractors, road conditions, weather, and experiment duration. Detailed information on each subset is provided in Table 1. The distractor information states the attributes of the different subsets, including campus environments with pedestrians (PED), the number of pedestrians with different-colored clothes actively interfering (#DAI), the number of pedestrians with similar-colored clothes actively interfering (#SAI), illumination variation (IV), and low illumination (LI). The results are divided into two groups: the short-term group presents the results of the sequences shorter than 100 s, and the long-term group presents the results of the rest. Figure 5 illustrates the success plots of the short-term tasks. Figure 6 presents the qualitative results of UGV02, UGV03, and UGV14.
For UGV02 and UGV03, we can see that the tracking boxes usually drift when a distracting person passes through and temporarily occludes the target. For UGV14, the variation in light severely impacts the trackers' performance. In general, our approach performs well and provides improvements in both AUC scores and robustness. Figure 7 presents the success plots of the long-term following tasks. Our approach ranks highly amongst all considered methods. Without fine-tuning the networks, the outcomes of DaSiamRPN and UpdateNet are poor on our dataset; UpdateNet, the update-based variant of DaSiamRPN, does not provide an improvement over DaSiamRPN. Despite the already excellent performance of SiamRPN and SiamRPN++, the AUC score of our approach is improved. The qualitative results are provided in Figure 8. In robot following situations, the robot moves according to the target location given by the tracking result: if the tracker tracks the wrong target, the real target will soon be out of view, leading to failure of the following mission. Even though the tracker sometimes does not return a precise bounding box of the followed target, a rough result (IOU > 0) can still keep the target within the tracking view, allowing a chance to recover the target. Therefore, we define tracking robustness as the percentage of bounding boxes that satisfy IOU > 0 (namely, the value of the success rate at overlap threshold 0 in the success plots). In our opinion, discussing tracking robustness in following missions is even more meaningful than the AUC score. Since the SiamRPN-based approaches regress bounding boxes from pre-defined anchors, the size adjustment is minor, so the robustness criterion is not biased by overly large bounding boxes. As shown in the success plots, our approach provides a substantial gain in terms of tracking robustness compared to the base trackers.
For UGV04 (Figure 7a) and UGV07 (Figure 7d), even though the target and distracting person have sufficient disparity in appearance (Figure 8a,d), the tracking box drifts to the distracting person frequently after crossing. By eliminating the responses of the distractors, our approach provides a substantial improvement compared to the other approaches.
For UGV05 and UGV06, the target and distractor person dress similarly (both wearing a black coat), and the following tasks were conducted under intense light variation (see Figure 8b,c). With the light change, the target appearance varies significantly. The competing trackers were impacted and their results presented a random pattern. By continuing to obtain the latest positive templates, our approach distinguishes the target and distractor more robustly, resulting in stable and better performance.
In UGV08 (Figure 7e), the following was conducted on a cloudy day, without light variation; our approach produced stable results and outperformed the others even with a similar-appearance distracting person. UGV16 ( Figure 7i) and UGV17 (Figure 7j) were conducted in the evening (Figure 8f,g); similar to UGV14 and UGV15, the success rates of the other approaches were heavily decayed. Our approach exhibited its strength in these situations, where both AUC score and tracking robustness outperformed the corresponding approaches.
For other UGV tasks, our approach yielded better or comparable results.

Statistical Significance
The test results above show that our approaches generally improve over their base approaches. We further perform a statistical test to determine whether the improvements are statistically significant. Specifically, we set the null hypothesis that the differences of the paired data come from a normal distribution with a mean equal to zero and unknown variance, and employ the paired-sample t-test. If the p-value falls below the 0.05 significance level, the null hypothesis is rejected; otherwise, it is accepted. Table 2 presents the statistical significance and the corresponding p-value for the three comparison groups. The results show that the improvements are statistically significant. We additionally compare our approaches with several recent state-of-the-art approaches: SiamCAR [19], SiamGAT [21], SiamBAN [20], DiMP [30], and PrDiMP [31]. The overall success plot of the dataset is depicted in Figure 9. Compared with the base methods, our approaches bring a substantial gain in terms of both AUC score and robustness. Among all 11 competing methods, our two approaches rank first and fifth in AUC score, and second and third in robustness. The PrDiMP method also shows good tracking performance on this dataset.
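The paired-sample t-test described above can be run as follows (a sketch using scipy; the per-sequence scores in the usage example are made-up illustration values, not the paper's data):

```python
from scipy import stats

def paired_improvement_test(base_scores, ours_scores, alpha=0.05):
    """Paired-sample t-test: the null hypothesis is that the paired
    differences have zero mean. Returns the p-value and whether the
    null hypothesis is rejected at the given significance level."""
    t_stat, p_value = stats.ttest_rel(ours_scores, base_scores)
    return float(p_value), bool(p_value < alpha)
```

For example, consistent per-sequence AUC gains of a few points across the dataset would yield a small p-value and a rejected null hypothesis.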

Public Dataset
All approaches were further tested on a collection of 27 public sequences involving person tracking. The selected sequences, their sources, and the tracking robustness results are listed in Table 3. The up and down arrows indicate the relative change provided by our approach compared to the base approaches. Red, green, and blue denote the methods that ranked first, second, and third in the experimental results, respectively.

Conclusions
In this paper, a robust visual tracking approach aimed at enabling robot person-following tasks was proposed. To solve the problem that fixed templates cannot meet the robustness demands of long-term tracking, a multi-template tracking method was proposed. Confident (positive) templates and distractor (negative) templates are collected during tracking by leveraging the distribution of the central responses. By merging the responses of the ground-truth, confident, and distractor templates, the responses of target tracking are enhanced and false responses are suppressed, leading to robust tracking. The proposed method was incorporated into two state-of-the-art approaches, SiamRPN and SiamRPN++, and validated on a robot person-following dataset as well as a collection of public person-tracking datasets. The results showed that our approaches outperform their base approaches in terms of both AUC score and tracking robustness. Furthermore, the approaches were compared with seven state-of-the-art methods. On the UGV dataset, among 11 approaches, ours rank first and fifth in terms of AUC score and second and third in terms of tracking robustness. On the public dataset, they rank first and second.
Building on the proposed robust visual tracking approach, in the future, we will continue to explore human-robot interaction and failure recovery methods to construct an autonomous, control-terminal-free person-following system, so that it can be applied to police patrols, factory manufacturing, and other scenarios.