Attention and Pixel Matching in RGB-T Object Tracking



Introduction
Visual object tracking is a significant research direction in the field of computer vision, with widespread applications in domains such as video surveillance, unmanned aerial vehicle (UAV) navigation, and other related fields [1]. The objective of visual object tracking is to estimate the position and size of an object's bounding box throughout a complete video sequence, given the ground truth box in the initial frame. Most tracking methodologies rely on RGB images, using features extracted from RGB images to estimate and predict the position of objects. Nonetheless, results obtained solely from visible light images are suboptimal in challenging scenarios with limited lighting, such as night-time, rain, and fog. RGB-T object tracking is a methodology that incorporates information from both RGB images and thermal infrared (TIR) images to predict the location of an object [2]. RGB and TIR images possess complementary attributes: RGB images are susceptible to illumination changes while providing more detailed information, whereas TIR images are unaffected by illumination but lack texture and detail. Consequently, RGB-T object tracking that leverages the complementary features of RGB and TIR images exhibits superior performance.
Currently, MDNet-based [3] RGB-T trackers have achieved good tracking accuracy [4][5][6][7], but their tracking speed cannot meet real-time requirements; Siamese-based trackers are faster and can meet real-time requirements [8][9][10], but there is a certain performance gap compared with some advanced trackers. Hence, most current RGB-T trackers find it difficult to simultaneously satisfy the requirements of robustness and speed. We designed an RGB-T tracker based on the Siamese network, which achieves good accuracy while meeting real-time speed requirements.
Our contributions can be summarized as follows:
1. A multi-modal weight penalty module is proposed to fully exploit the advantages of the two modal features and deal with various complex illumination challenges.
2. A pixel-matching module and an improved anchor-free position prediction network are proposed to suppress the influence of cluttered backgrounds on the localized object and to locate the object accurately and quickly for tracking.
3. A new end-to-end RGB-T tracker based on a Siamese network is proposed, which satisfies both robustness and real-time tracking requirements. The experimental results on two standard datasets show that our new tracker is effective.
The rest of this paper is arranged as follows. Section 2 reviews related works on RGB-T object tracking and briefly introduces some existing RGB-T trackers. In Section 3, our new tracker is described in detail from three aspects. In Section 4, the related experiments and results for the proposed tracker are presented. Section 5 discusses the advantages and disadvantages of our tracker. In Section 6, the conclusion and future work are given.

Related Works
RGB-T object tracking algorithms can be divided into traditional algorithms, algorithms based on correlation filtering, and algorithms based on deep learning. Traditional algorithms mainly employ hand-crafted features such as the Histogram of Oriented Gradients (HOG) and the Scale-Invariant Feature Transform (SIFT), coupled with motion estimation algorithms such as the Kalman filter [11] and the particle filter [12], to achieve tracking.
Correlation filter-based algorithms obtain an output response by correlating a filter template with the features of the object candidate region and determine the object position from the response peak. Zhai et al. [13] proposed an RGB-T tracking algorithm based on a cross-pattern correlation filter, in which correlation filtering was applied to each modality, and a low-rank constraint was introduced to jointly learn the filters for modality collaboration. Yun et al. [14] proposed a discriminative fusion correlation learning model, which obtains the fusion learning filter through early estimation and late fusion to improve the tracking performance of a discriminative correlation filter. Xiong et al. [15] proposed an RGB-T dual-modal weighted correlation filter tracking algorithm, in which a weight map is jointly solved from dual-modal information and guides the solution of the correlation filter for inferring object occlusion. Due to the limited representational capacity of hand-crafted features, the accuracy and robustness of these two classes of tracking algorithms are limited.
Deep learning algorithms build trackers with powerful feature representation capabilities on top of data-driven deep neural networks, significantly improving accuracy and robustness compared with the previous two classes of algorithms. Xu et al. [16] proposed an RGB-T pixel-level fusion tracking algorithm based on convolutional neural networks, taking the thermal infrared image as the fourth channel of the visible light image. Li et al. [17] proposed a multi-adapter convolutional network, which extracts shared and modality-specific information of the RGB-T pair through different adapters. Zhu et al. [18] designed feature aggregation and pruning modules for an RGB-T tracking network: the aggregation module provides a rich RGB-T feature representation for the target object, and the pruning module removes noisy or redundant features from the aggregated RGB-T features. Lu et al. [19] proposed a dual-selection conditional network to fully utilize bimodal discriminative information and suppress data noise, and designed a flow-based resampling strategy to cope with sudden camera movement. Due to the additional modality, additional computation is required. Moreover, the above algorithms mostly adopt the idea of generating candidate regions, which requires multiple forward passes of the network and thus limits the tracking speed.
In recent years, Siamese networks have achieved high accuracy and speed in visible light object tracking, such as SiamFC [20], SiamRPN++ [21], and SiamBAN [22], and the same type of method has also been applied to RGB-T object tracking. Zhang et al. [23] proposed a deep learning tracking method based on pixel-level fusion, which first fuses the visible light and thermal infrared images and then inputs them into a Siamese network for tracking. Zhang et al. [24] extracted features of visible light and thermal infrared images separately using two Siamese networks, then fused multi-layer features of the two modalities and used the multi-layer fused features for tracking.

Method
In this section, we first describe the Siamese network architecture of the RGB-T tracker in detail and introduce the improved anchor-free position prediction network. Then, we introduce the two main modules of our tracker: the multi-modal weight penalty module and the pixel-matching module.

Siamese Network Architecture
The overall network framework of our RGB-T Siamese tracker is shown in Figure 1. The Siamese network comprises a template branch and a search area branch. Each branch is further divided into an RGB image branch and a thermal infrared image branch, which extract the features of the two image types, respectively. The two branches have the same structure and share parameters. ResNet50 combined with an FPN [25] is employed as the backbone, extracting features from the second, third, and fourth convolutional layers. Firstly, the multi-modal weight penalty module weights the RGB and TIR modality features. Then, the template and search features of the two modalities are sent to the pixel-matching module for the matching operation, and the response maps of the two modalities are fused. The fused maps are directed into a regression and classification subnetwork similar to SiamCAR [26]. Diverging from SiamCAR, distinct strategies are employed for positive sample determination, and improvements are made to the regression branch.

When the location (x, y) falls into the center region of any ground truth box, it is considered a positive sample, as shown in the red part of Figure 2.
The part inside the labeled box near the edge often belongs to the background, and such points, which actually belong to the negative samples, would be incorrectly classified as positive samples. This leads to classification errors in the classification branch and hinders model learning. In addition, for this anchor-free method, the convolutional features in the central part are richer, and the object edge can be well predicted from the central features alone. Therefore, we define the center region of the box centered at (cx, cy) as the sub-box (cx − rs, cy − rs, cx + rs, cy + rs), where s is the stride of the current FPN layer and r is a hyperparameter set to 1.5, as shown in the yellow part of Figure 2. In this way, most anchor points that lie inside the label box but close to its edge and fall on the background are labeled as negative samples, so our model converges faster and the final performance is better.
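The center-region labeling rule above can be sketched in a few lines. This is a minimal illustrative implementation, not the paper's code; the function name and interface are assumptions.

```python
def assign_labels(points, gt_box, stride, r=1.5):
    """Label each anchor point 1 (positive) if it falls inside the shrunk
    center sub-box of the ground-truth box, else 0 (negative).
    points: list of (x, y) locations mapped back to the search area;
    gt_box: (x0, y0, x1, y1); stride: stride s of the current FPN layer."""
    x0, y0, x1, y1 = gt_box
    cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    rs = r * stride  # half-size of the center region
    sub = (cx - rs, cy - rs, cx + rs, cy + rs)
    labels = []
    for x, y in points:
        inside = sub[0] <= x <= sub[2] and sub[1] <= y <= sub[3]
        labels.append(1 if inside else 0)
    return labels

# A point at the box center is positive; a point just inside the box edge,
# which likely lies on background, is now labeled negative.
labels = assign_labels([(50, 50), (12, 12)], (10, 10, 90, 90), stride=8)
```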
The regression branch of the network produces a regression feature map A_reg with dimensions of 25 × 25 × 4. Each position (i, j) in A_reg can be mapped back to the corresponding location (x, y) in the search area. The regression target of A_reg at (i, j) is a four-dimensional vector t(i,j) = (l, t, r, b), which can be calculated as:

l = x − x0, t = y − y0, r = x1 − x, b = y1 − y, (1)

where l, t, r, b are the distances from position (x, y) to the four sides of the object bounding box, and (x0, y0) and (x1, y1) denote the upper left and lower right corners of the ground truth bounding box. To better adapt to the size of the FPN, the total stride before the feature mapping is increased.
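The regression target in Equation (1) is a direct computation; a minimal sketch (function name is illustrative):

```python
def regression_target(x, y, gt_box):
    """Distances (l, t, r, b) from location (x, y) to the four sides of
    the ground-truth box gt_box = (x0, y0, x1, y1), per Equation (1)."""
    x0, y0, x1, y1 = gt_box
    return (x - x0, y - y0, x1 - x, y1 - y)
```

Note that all four components are positive exactly when (x, y) lies inside the box, which is what makes the vector usable as a regression target for positive samples.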
Using the four-dimensional vector t(i,j), the Generalized Intersection over Union (GIOU) metric can be computed to measure the similarity between the predicted and ground truth bounding boxes. The GIOU comprehensively assesses the spatial overlap between the predicted and true bounding boxes, considering their size, location, and shape. The regression loss can then be obtained as:

L_reg = (1 / Σ(i,j) I(t(i,j))) Σ(i,j) I(t(i,j)) · L_GIOU(A_reg(i,j), t(i,j)), (2)

where L_GIOU denotes the GIOU loss function, as defined in [27], and I(·) is the indicator function introduced in [26], which is 1 for positive samples and 0 otherwise.
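The GIOU term of the loss can be sketched on corner-format boxes as follows. This follows the standard GIOU definition [27]; the function names are illustrative, not the paper's code.

```python
def giou(box_a, box_b):
    """Generalized IoU between two (x0, y0, x1, y1) boxes: IoU minus the
    fraction of the smallest enclosing box not covered by the union."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    area_a = (ax1 - ax0) * (ay1 - ay0)
    area_b = (bx1 - bx0) * (by1 - by0)
    union = area_a + area_b - inter
    # Smallest enclosing box of the two boxes
    c_area = (max(ax1, bx1) - min(ax0, bx0)) * (max(ay1, by1) - min(ay0, by0))
    return inter / union - (c_area - union) / c_area

def giou_loss(pred_box, gt_box):
    """GIOU loss: 0 for a perfect match, up to 2 for distant boxes."""
    return 1.0 - giou(pred_box, gt_box)
```

Unlike plain IoU, GIOU stays informative for non-overlapping boxes (it goes negative as the boxes move apart), which gives the regression branch a gradient even for poor predictions.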
The final score S_x,y at tracking time is defined as the square root of the product of the predicted centerness Cen_x,y and the corresponding classification score Cls_x,y, as in Equation (3):

S_x,y = sqrt(Cen_x,y · Cls_x,y). (3)

Multi-Modal Weight Penalty Module (MWPM)
The RGB image features and thermal infrared image features extracted from the backbone contribute differently to object tracking. Previous research on attention modules for RGB object tracking focused on the importance of each feature channel, assigning higher weights to channels with relatively larger contributions and lower weights to those with smaller contributions. These methods successfully utilize mutual information from features of different dimensions. However, they do not consider the contributing factors of the weights, which could further suppress unimportant channels or pixels. We use the contributing factors of the weights to represent the contribution of each modal feature: the batch normalization scale factor is used, whose standard deviation reflects the importance of the weights. To comprehensively consider the contribution of all features in the RGB and TIR modalities to the object representation, we design a multi-modal weight penalty module, which evaluates the features of the two modalities as a whole and then assigns corresponding weights to the deep features.
The MWPM utilizes the variance of the trained model weights to highlight salient features. Compared with previous attention mechanisms, no additional calculations or parameters, such as fully connected or convolutional layers, are required. The scaling factor in batch normalization (BN) [28] is directly used to calculate the attention weight, and non-salient features are further suppressed by adding a regularization term. For the channel attention submodule, the scaling factor reflects the magnitude of the variation in each channel and indicates the channel's importance, as shown in Equation (4).
where x_i is the normalized feature value, and γ and β are learnable reconstruction parameters that allow our network to recover the feature distribution of the original network. The scaling factor corresponds to the variance in BN: a larger variance indicates that the channel varies more dramatically and contains richer information, and is therefore more important, while a channel with little variation carries singular, less important information.
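The idea of turning BN scale factors into channel importance can be sketched as a simple normalization over the per-channel γ values. This is an illustrative stand-in for the tensor operation; the function name is an assumption.

```python
def channel_weights(gammas):
    """Normalize the absolute BN scale factors so they sum to 1.
    Channels with larger |gamma| (i.e., more variation after BN, hence
    richer information) receive proportionally larger weights."""
    total = sum(abs(g) for g in gammas)
    return [abs(g) / total for g in gammas]
```

No extra learned parameters are introduced: the weights are read off parameters BN already trains, which is the efficiency argument made above.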
The channel attention is shown in Equation (5), where C_in denotes the input features, C_out denotes the final output features, and γ_i is the scaling factor of each channel. If the same normalization method is applied to each pixel in space, the spatial attention in Equation (6) can be obtained.
As shown in Figure 3, in the channel attention module, the RGB and TIR features are first concatenated along the channel dimension to obtain a joint feature representation U_F of the template and search features. Then, the feature weight penalty is applied to U_F to produce the output, where δ(·) denotes the sigmoid activation and Split is an operation that splits features along the channel dimension. For integrating the two submodules, we adopt a scheme in which the channel attention comes first and the spatial attention second: the channel attention module first reduces the weight of less salient feature channels, and the spatial attention module then suppresses background noise.
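The concatenate-reweight-split flow of the channel-attention half of the MWPM can be sketched in NumPy as follows. This is a hedged sketch under simplifying assumptions: it shows only the channel branch (the spatial branch would reweight per-pixel analogously), and the function name and weighting details are illustrative rather than the paper's implementation.

```python
import numpy as np

def mwpm_channel(rgb_feat, tir_feat, gamma):
    """Channel-attention sketch of the MWPM: concatenate RGB and TIR
    features along the channel axis, reweight channels by sigmoid of the
    normalized BN scale factors, then split back into the two modalities.
    rgb_feat, tir_feat: arrays of shape (C, H, W);
    gamma: per-channel BN scale factors, length 2*C."""
    u = np.concatenate([rgb_feat, tir_feat], axis=0)   # joint feature U_F, (2C, H, W)
    w = np.abs(gamma) / np.abs(gamma).sum()            # contribution of each channel
    att = 1.0 / (1.0 + np.exp(-w))                     # sigmoid gating
    u = u * att[:, None, None]                         # penalize weak channels
    c = rgb_feat.shape[0]
    return u[:c], u[c:]                                # Split back into RGB / TIR
```

Because both modalities are normalized jointly over the concatenated channels, a TIR channel can outweigh an RGB channel (or vice versa), which is the cross-modal evaluation the module is designed for.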

Pixel Matching Module (PMM)
RGB object tracking algorithms based on Siamese networks perform well in accuracy and speed, but when matching template features with search features, the commonly used cross-correlation operation introduces a lot of background information in deep networks, resulting in inaccurate matching. One observation is that in RGB-T tracking tasks the objects to be tracked are mostly small, especially in the GTOT [29] and RGBT234 [30] datasets, while a relatively large template is taken to obtain robust features. Therefore, when a 127 × 127 template image is inputted, the proportion of the object in the template is very small. As the network deepens, especially in deep networks such as ResNet50 [31], each feature point of the final output corresponds to a large receptive field of the input. The template feature is large, and the corresponding true matching region is much larger than the ideal matching region. Therefore, a large amount of background information is introduced and overwhelms the features of the object, making it difficult to distinguish the object from similar objects in the background.
To solve the above problems, we use the pixel-matching module to calculate the similarity between each pixel of the search feature and each pixel of the template feature. The template feature is transformed into 1 × 1 kernels, so that the matching area is only of size 1 × 1 and background information is greatly reduced. The spatial kernels focus on information from each region of the template, while the channel kernels pay more attention to the overall information of the template. Decomposing the template feature into spatial and channel kernels of size 1 × 1 reduces the matching area, suppresses background interference, and accurately locates the response points of the target area.
As shown in Figure 4, the template feature Z, with dimensions H_z × W_z × C, is split into H_z × W_z spatial kernels of size 1 × 1 × C and C channel kernels of size 1 × 1 × n (n = H_z × W_z). Then, the search frame feature X is passed through these two sets of filters. Firstly, the similarity between each pixel of the search feature and the spatial template kernels is calculated, as in Equation (9). Then, the similarity with the channel kernels is calculated, as in Equation (10). Finally, the output feature P2 is obtained. This operation requires no convolution, only matrix multiplication, which improves the speed of our tracker.
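The spatial half of the pixel-matching idea reduces to a single matrix multiplication, as sketched below. This is an illustrative sketch, not the paper's code: it shows only the spatial-kernel step (the channel-kernel step of Equation (10) would be an analogous product over the other axis), and the function name is an assumption.

```python
import numpy as np

def pixel_match_spatial(z, x):
    """Treat every template pixel as a 1x1xC kernel and correlate it with
    every search-feature pixel in one matrix multiplication (no convolution).
    z: template feature (C, Hz, Wz); x: search feature (C, Hx, Wx).
    Returns per-pixel similarities of shape (Hz*Wz, Hx, Wx)."""
    c, hz, wz = z.shape
    _, hx, wx = x.shape
    kernels = z.reshape(c, hz * wz).T      # (Hz*Wz, C): one row per 1x1 kernel
    search = x.reshape(c, hx * wx)         # (C, Hx*Wx)
    sim = kernels @ search                 # all pairwise dot products at once
    return sim.reshape(hz * wz, hx, wx)
```

Because each kernel covers a single template pixel, the effective matching region is 1 × 1, which is exactly how the module avoids dragging in the large background context that a full cross-correlation window would.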

Results
Our network was implemented in Python using PyTorch and was trained and tested on an RTX 3060 Ti using three public benchmark datasets: GTOT, RGBT234, and LasHeR [32]. We used LasHeR as the training dataset to train our model and then tested it on GTOT and RGBT234, respectively. To evaluate the algorithms, this paper uses two common evaluation indicators: precision rate (PR) and success rate (SR). PR represents the percentage of frames in a video sequence whose distance from the center of the prediction box to the center of the ground truth box is less than a predetermined threshold; the threshold for the GTOT dataset is 5 pixels, and that for the RGBT234 dataset is 20 pixels. SR represents the percentage of frames whose overlap rate between the predicted box and the ground truth box is greater than a threshold; an overlap rate greater than 0.7 indicates successful tracking. We compared the performance of the proposed tracker with advanced RGB-T and RGB trackers, evaluated our main components, and analyzed their effectiveness.
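The two metrics are simple frame-level ratios; a minimal sketch (function names are illustrative):

```python
def precision_rate(center_dists, threshold=5.0):
    """PR: fraction of frames whose predicted-to-GT center distance is
    below the threshold (5 px for GTOT, 20 px for RGBT234)."""
    return sum(d < threshold for d in center_dists) / len(center_dists)

def success_rate(overlaps, threshold=0.7):
    """SR: fraction of frames whose predicted/GT box overlap (IoU)
    exceeds the threshold."""
    return sum(o > threshold for o in overlaps) / len(overlaps)
```

In the benchmark plots, PR and SR are evaluated over a sweep of thresholds; the representative numbers reported in the text use the fixed thresholds above.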
Figure 5 shows the curves of our tracker compared with other trackers on the GTOT dataset; curves with different colors and line styles represent different trackers. It can be clearly seen that our tracker outperforms the other seven trackers. Specifically, the PR of the proposed tracker is 1.6% higher than that of MANet, and the SR differs from MANet's by 0.2%; the PR and SR are 2.5% and 1.7% higher than those of MACNet, respectively. This proves that our proposed tracker achieves robust tracking. In addition, our tracker runs very efficiently: compared with the better-performing trackers such as MANet, DAPNet, and MACNet, our tracker achieves the leading efficiency with a real-time speed of 34 FPS, which proves that it meets the real-time standard for object tracking.
The GTOT dataset contains seven different attributes: occlusion (OCC), large-scale variation (LSV), fast motion (FM), low illumination (LI), thermal crossover (TC), small object (SO), and deformation (DEF). The attribute-based comparison shows the capability of our proposed tracker to handle these challenging situations. As shown in Table 1, our tracker achieves a clear lead in performance under the challenges of large-scale variation and low illumination while obtaining the best overall performance. In addition, our tracker also performs well under four attributes: fast motion, thermal crossover, deformation, and small objects. This shows that our tracker can continuously adapt to changes of the object during tracking by using the pixel-matching module and the improved fully convolutional anchor-free tracking head to reduce the negative impact of background information, thus improving the robustness of object tracking. At the same time, it shows that the multi-modal weight penalty module can fully exploit the information of the two RGB-T modalities and use the interaction of dual-modal information to better cope with motion blur caused by fast motion and with poor single-modal imaging caused by environmental factors.
We also tested our tracker on the RGBT234 dataset and compared it with eight advanced trackers, including DAPNet, MDNet, SGT, DAT, ECO, C-COT [38], and SiamDW+RGBT.
Figure 6 shows that the experimental results of our tracker on the RGBT234 dataset are PR = 77.5% and SR = 54.8%, which is excellent performance compared with other algorithms. Specifically, the PR of the proposed tracker is 0.9% higher than that of DAPNet, which ranks second in the figure, and the SR is 1.1% higher than that of DAPNet. Comparing the experimental results of the various trackers on the above two RGB-T datasets, it can be concluded that our tracker achieves better performance in terms of both precision and success rate. This proves that our proposed tracker can cope with various complex environments and achieve robust tracking.
To explicitly show the tracking performance of our tracker, we provide three sequences for comparison, covering different challenging attributes of the RGBT234 dataset, in Figure 7. As shown in sequence A, our method performs well, whereas the other trackers lose the object when occlusion occurs or the camera moves with the object. This indicates that our PMM and the improved positive sample selection strategy are effective, enabling our tracker to continuously adapt to environmental changes and reducing the negative impact of background information, thereby enhancing the robustness of object tracking. In sequence B, the illumination in the object region is low, and the object has a temperature similar to other objects or the background, making it almost invisible in the thermal image. Nevertheless, our algorithm outperforms the other trackers, indicating that the MWPM enables our tracker to fully use multi-modal information. In sequence C, our algorithm still achieves precise localization and predicts the best bounding box even when the object size changes during its motion.

Discussion
The results on the above datasets show that our tracker performs well under challenging conditions such as large-scale changes, fast motion, deformation, and small objects. This shows that the combination of the pixel-matching module and the improved fully convolutional anchor-free position prediction network effectively distinguishes the object from other similar objects in the scene and compensates for the Siamese network's limited ability to distinguish the tracked object from a similar background.
In addition, the outstanding performance in low illumination, thermal crossover, and other scenarios shows that the multi-modal weight penalty module enables us to make full use of the correlation and complementarity of information between the RGB and TIR modalities, which improves the quality of the fused features and the performance of the tracker. This is particularly important in scenarios where RGB information alone may not be sufficient for accurate tracking.
Although our tracker achieves excellent performance in most scenarios, it still has room for improvement. For example, when the target object is partially or completely occluded by other objects in the scene, the performance of the tracker drops significantly. This is because our tracker relies on a combination of appearance and motion cues, both of which are disrupted by occlusion. We will take measures such as online template updating to minimize the impact of occlusion on tracking performance. Furthermore, our proposed method currently requires RGB-T data, and extending it to other modalities needs further study.
Overall, our results show that the proposed tracker is effective in dealing with various challenges encountered in object tracking. Meanwhile, its running speed of 34 FPS fully meets real-time requirements.

Conclusions
This paper proposes a novel high-speed, robust RGB-T tracker. A multi-modal weight penalty module is designed, which enables the new tracker to take full advantage of the two modal features to cope with various lighting challenges. Combined with the proposed pixel-matching module and an improved anchor-free bounding box prediction network, the new tracker can effectively suppress the effects of cluttered backgrounds and locate objects more accurately and quickly. Extensive experiments demonstrate that the new tracker achieves advanced performance on two publicly available RGB-T benchmark datasets.
In future work, we will continue to explore how to better integrate multi-modal information for object tracking and design stronger algorithms for handling object occlusion.

Figure 1 .
Figure 1. The overall architectural diagram of the proposed algorithm. The overall network consists of four main components: a double Siamese network for feature extraction, an MWPM for enhancing multi-modal features, a PMM for generating fusion response maps, and an anchor-free position prediction network for generating bounding boxes.

Figure 2 .
Figure 2. Schematic diagram of the selected positive samples. The red part is the originally defined positive sample area, and the yellow part is our improved positive sample area.
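A minimal sketch of how an enlarged positive-sample region could be defined, assuming positives are feature-map cells falling inside an ellipse centred on the ground-truth box; the ellipse shape, the `scale` parameter, and all names here are hypothetical, since the paper's exact geometry is not given in this excerpt.

```python
import numpy as np

def positive_mask(H, W, cx, cy, rw, rh, scale=1.0):
    """Mark feature-map cells as positive samples when they fall inside
    an ellipse centred on the ground-truth box centre (cx, cy) with
    semi-axes proportional to the box size (rw, rh). `scale` > 1
    enlarges the region, mimicking a widened positive area.
    Returns a boolean (H, W) mask.
    """
    ys, xs = np.mgrid[0:H, 0:W]
    d = ((xs - cx) / (rw * scale)) ** 2 + ((ys - cy) / (rh * scale)) ** 2
    return d <= 1.0
```

Enlarging the positive region feeds more object-centred locations to the classification branch during training, which is one plausible way to realize the improved selection strategy sketched in the figure.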

Figure 3 .
Figure 3. The architecture of MWPM. First, the deep features of the RGB and thermal infrared modalities are concatenated, and then weights are assigned to all channels.
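The concatenate-then-reweight step in the figure can be sketched as a squeeze-and-excitation style channel gate. This is an illustrative reconstruction, not the paper's implementation: the two-layer gating MLP, its weight shapes `w1`/`w2`, and the function names are assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_weighting(rgb_feat, tir_feat, w1, w2):
    """Concatenate RGB and TIR feature maps along the channel axis and
    reweight every channel with a squeeze-and-excitation style gate.

    rgb_feat, tir_feat: (C, H, W) feature maps.
    w1: (C_mid, 2C) and w2: (2C, C_mid) projections of a two-layer
    gating MLP (hypothetical shapes for this sketch).
    """
    x = np.concatenate([rgb_feat, tir_feat], axis=0)      # (2C, H, W)
    squeezed = x.mean(axis=(1, 2))                        # global average pool -> (2C,)
    gate = sigmoid(w2 @ np.maximum(w1 @ squeezed, 0.0))   # (2C,) weights in (0, 1)
    return x * gate[:, None, None]                        # channel-reweighted features
```

Because the gate is computed from both modalities jointly, channels from an uninformative modality (e.g., TIR during thermal crossover) can be suppressed while the other modality's channels are emphasized.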

Figure 4 .
Figure 4. The process of pixel matching. First, the similarity between each pixel point of the search feature and the spatial kernel is calculated, and then the similarity between it and the channel kernel is calculated.
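A simplified, single-kernel sketch of the per-pixel matching idea: every channel vector of the search feature is compared against one template-derived kernel by cosine similarity, producing a response map. The paper's PMM additionally matches against a channel kernel; that second stage, and all names here, are outside this sketch and assumed.

```python
import numpy as np

def pixel_match(search, kernel):
    """Cosine similarity between every pixel (channel vector) of the
    search-region feature and a single kernel vector pooled from the
    template feature.

    search: (C, H, W) feature map; kernel: (C,) vector.
    Returns an (H, W) similarity map in [-1, 1].
    """
    C, H, W = search.shape
    pixels = search.reshape(C, -1)                      # (C, H*W): one column per pixel
    norms = np.linalg.norm(pixels, axis=0) * np.linalg.norm(kernel) + 1e-12
    sim = (kernel @ pixels) / norms                     # cosine similarity per pixel
    return sim.reshape(H, W)
```

Because each pixel is matched independently, the response stays sharp around the object even when nearby distractors share a similar global appearance.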

Figure 5 .
Figure 5. Precision rate (PR) and success rate (SR) curves of different tracking results on GTOT, where the representative PR and SR scores are presented in the legend.
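The PR and SR scores in these curves follow the standard tracking evaluation: PR is the fraction of frames whose predicted centre lies within a pixel threshold of the ground truth (5 pixels is the common choice on GTOT, 20 on RGBT234), and SR is the area under the success curve over IoU thresholds. A minimal reference implementation:

```python
import numpy as np

def precision_rate(center_errors, threshold=20.0):
    """Fraction of frames whose predicted-vs-ground-truth centre
    distance is at most `threshold` pixels."""
    e = np.asarray(center_errors, dtype=float)
    return float((e <= threshold).mean())

def success_rate(ious):
    """Area under the success curve: mean fraction of frames whose IoU
    exceeds each overlap threshold sampled in [0, 1]."""
    ious = np.asarray(ious, dtype=float)
    thresholds = np.linspace(0.0, 1.0, 21)
    return float(np.mean([(ious > t).mean() for t in thresholds]))
```

The number of threshold samples (21 here) varies slightly between toolkits but has little effect on the reported AUC.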

Figure 6 .
Figure 6. Precision rate (PR) and success rate (SR) curves of different tracking results on RGBT234, where the representative PR and SR scores are presented in the legend.

Figure 7 .
Figure 7. The different colored boxes in the figure correspond to different trackers: the red box is the result of our tracker, the green box is SGT, the blue box is C-COT, and the yellow box is ECO. (A): results of sequence greyman; (B): results of sequence woman6; (C): results of sequence car66.

Table 1 .
Attribute-based precision and success rates (PR/SR) obtained by different trackers on the GTOT dataset. Red, green, and blue numbers represent the best, second-best, and third-best results, respectively.
