SCA-MMA: Spatial and Channel-Aware Multi-Modal Adaptation for Robust RGB-T Object Tracking

Abstract: RGB and thermal (RGB-T) object tracking is challenging, especially under target changes caused by deformation, abrupt motion, background clutter and occlusion. It is critical to exploit the complementary nature of visual RGB and thermal infrared data. In this work, we address the RGB-T object tracking task with a novel spatial- and channel-aware multi-modal adaptation (SCA-MMA) framework, which builds an adaptive feature learning process to better mine this object-aware information in a unified network. For each modality, a spatial-aware adaptation mechanism is introduced to dynamically learn the location-based characteristics of specific tracking objects at multiple convolution layers. Further, a channel-aware multi-modal adaptation mechanism is proposed to adaptively learn the feature fusion/aggregation of different modalities. To perform object tracking, we employ a binary classification module with two fully connected layers to predict the bounding boxes of specific targets. Comprehensive evaluations on the GTOT and RGBT234 datasets demonstrate the superiority of the proposed SCA-MMA for robust RGB-T object tracking. In particular, the precision rate (PR) and success rate (SR) reach 90.5%/73.2% on GTOT and 80.2%/56.9% on RGBT234, significantly higher than state-of-the-art algorithms.


Introduction
Object tracking, an important yet challenging task in computer vision, has been widely applied in video surveillance, traffic monitoring, self-driving, etc. Although general object tracking with single-modal RGB data has achieved significant advances during the past few years [1][2][3][4][5], difficulties remain under challenges such as low illumination, smog and darkness. Meanwhile, thermal infrared data are insensitive to lighting conditions and have a strong ability to penetrate haze and smog [6], but they cannot represent targets as well as visual images in good lighting conditions. Recently, the RGB-T object tracking problem [7][8][9][10] has received more and more attention, and researchers have aimed to integrate visible and thermal infrared data for robust object tracking in the severe conditions mentioned above.
Much previous work has been devoted to RGB-T object tracking to improve the target representation with visible and thermal data [7][8][9][10]. One stream aims to extract important and expressive information from multi-modal data and then designs target descriptors for boosting the tracking performance [10][11][12]. For example, Li et al. chose more reliable deep feature layers to construct a target descriptor [11], and learned the weights of patches cropped from multi-modal data as nodes to represent the target as a graph [10,12]. These methods use only a portion of all features to reconstruct the target descriptor and omit a lot of background information, which may limit the potential tracking performance. Another stream aims to learn modality weights to achieve adaptive feature fusion for robust object tracking [8,13,14,15]. Early weight-based methods added [14] or concatenated [15] multi-modal information together directly. Lan et al. [7] used the max-margin principle to optimize the modality weights according to classification scores, and Zhu et al. [13] adaptively learned the modality weights via convolutional neural networks. Although these methods consider the reliability of different modalities to a certain extent, they do not fully consider how to adaptively perform the feature fusion and aggregation of heterogeneous modalities to achieve robust target representation.
To solve the above-mentioned problem, we propose a novel spatial- and channel-aware multi-modal adaptation (named SCA-MMA) framework for boosting the performance of the RGB-T tracking task. The SCA-MMA can not only dynamically focus on the spatial location information of specific targets, but also adaptively learn the weight of each channel based on the assumption that the feature channels have different reliabilities [16,17]. Here, the channel weights of the multi-modal representation are adaptively learned in the process of feature aggregation. We further integrate the spatial-aware mechanism in our SCA-MMA framework, which dynamically learns the location-based characteristics of specific tracking objects at multiple convolution layers and decreases the suppression of background for robust target representation [18,19]. To complete the object tracking task, we employ a binary classification module with two fully connected layers to predict the bounding boxes of specific targets, and K branches in the training stage to learn multi-domain knowledge [4]. Extensive experiments on two public datasets demonstrate that our SCA-MMA framework achieves state-of-the-art performance on the RGB-T tracking problem. We summarize the major contributions of this work as follows.

• We propose a novel spatial- and channel-aware multi-modal adaptation (SCA-MMA) framework for robust RGB-T object tracking in an end-to-end fashion. The proposed SCA-MMA can dynamically learn the location-based characteristics of specific tracking objects and simultaneously adopt channel-aware multi-modal adaptation for better consideration of the complementarity of RGB and thermal information.
• We introduce a feature aggregation mechanism to adaptively reconstruct the target descriptor for performing RGB-T object tracking. In particular, our proposed spatial-aware mechanism can adaptively learn spatial awareness to enhance the target appearance. Furthermore, we present a channel-aware multi-modal adaptation mechanism to aggregate visual RGB and thermal infrared data, which can adaptively learn the reliable degree of each channel and then better integrate the global information.
• We evaluate the proposed SCA-MMA framework on large-scale datasets (including GTOT [8] and RGBT234 [20]). The SCA-MMA achieves 90.5%/73.2% and 80.2%/56.9% in PR/SR performance, and reaches state-of-the-art performance when compared with other RGB-T trackers [9,13,21].

Related Work
According to its relevance to our work, we review related work in the following two aspects: feature aggregation methods for RGB-T object tracking and multi-domain object tracking.

Feature Aggregation Methods for RGB-T Object Tracking
RGB-T object tracking, which is a sub-branch of visual object tracking, aims to aggregate visible and thermal infrared images for robust object tracking in challenging conditions such as low illumination, heavy occlusion and significant appearance changes [8,20]. Existing methods focus on robust target representation via integrating multi-modal source data [10,12,13,22]. One research stream aims to reconstruct the target descriptor via extracting effective features from multi-modal data. To perform the object tracking, Li et al. [10] proposed a weighted sparse representation regularized graph learning algorithm by constructing the specific target as a graph-based descriptor. A two-stage modality-graph regularized manifold ranking algorithm [22] was proposed to rank all patches of multimodal data for robust target representation. A cross-modal manifold ranking algorithm [12] was then proposed to rank cropped patches from the target while considering the heterogeneous property between different modalities and noise effects. Li et al. [11] proposed FusionNet to calculate the partial derivative of loss on channels and selected these higher parts for target representation.
Another stream aims to adaptively learn modality weights, and then concatenates them together as the target descriptor [14,15]. For example, Li et al. [8,12] regularized the modality weight via reconstruction residues, Lan et al. [7] used the max-margin principle to optimize modality weights according to classification scores, and Zhu et al. [13] adaptively learned modality weights via a convolutional neural network. To employ the temporal continuity in a video sequence, the history information was integrated to obtain fusion features by computing the adaptive weights of previous frames [23]. Tang et al. [24] proposed multiple fusion strategies from different perspectives (including pixel-level, feature-level and decision-level) to boost the performance of multi-modal object tracking in video.

Multi-Domain Object Tracking
Visual object tracking is one of the fundamental branches of computer vision and has received more and more attention in the last few decades. A pivotal branch of visual object tracking regards the object tracking problem as a one-shot binary classification task [3,4,25]. For example, Nam et al. [4] proposed a multi-domain learning framework across multiple tracking sequences in the training stage and then detected the foreground in the tracking stage. Park et al. [25] exploited a meta-learning algorithm in the MDNet [4] framework, which adjusted an initial model via temporal information in tracking sequences for quick optimization in the tracking stage. Jung et al. [3] introduced the RoIAlign method to extract more accurate representations for the specific target. In [26,27], multi-domain feature representation networks were proposed to perform information fusion across frame and event domains for improving the performance of the visual object tracking task. A semi-supervised multi-domain tracking framework [28] was proposed to learn domain-invariant and domain-specific representations by employing an adversarial regularization. Further, a filtering-based multi-sensor data fusion technique [29] was proposed to obtain improved navigational data for unmanned surface vehicle navigation. For the radar tracking problem, the decentralized fusion of Kalman and neural filters was proposed to deal with the multi-sensor tracking of marine targets [30]. In [31], an adaptive fusion strategy was used to integrate multiple feature cues into an observation model for improved underwater target tracking.

Proposed Framework
In this section, we introduce the details of the proposed SCA-MMA framework, including the network architecture and spatial-aware and channel-aware multi-modal adaptation mechanisms.

Network Architecture
The overall network architecture of the SCA-MMA model is shown in Figure 1. The SCA-MMA mainly consists of three parts: a feature extraction sub-network, a feature aggregation sub-network and a binary classification sub-network. In particular, each feature extraction sub-network is built with three convolutional layers to extract a target representation. The spatial-aware block is employed after the first two layers to obtain spatial awareness and enhance the target representation. The feature aggregation sub-network first integrates the features extracted from the visible and thermal images, then adaptively learns channel-wise weights via the channel-aware block and aggregates the features channel by channel. As in [4], the binary classification sub-network, which is adopted to distinguish the specific target from background information, has K branches after two fully connected layers to learn multi-domain knowledge in the training stage. After the multi-domain learning is finished, the multiple branches of domain-specific layers are replaced by a single branch in the tracking stage.

Spatial-Aware Mechanism
As shown in Figure 1, we employ the spatial-aware block in the first two convolutional layers. The details of the spatial-aware block are shown in Figure 2a. The first convolutional layer is followed by the rectified linear unit (ReLU) and local response normalization (LRN), and the sigmoid function is adopted after the second convolutional layer to generate spatial awareness. Here, we summarize the operations of the spatial-aware block in the following equations: where Conv, ReLU, LRN and Sigmoid denote the convolutional layer, rectified linear unit, local response normalization and sigmoid function, and input and output are the input and output of the spatial-aware block. The whole feature extraction sub-network can be summarized as follows: where SAB denotes the spatial-aware block and $F_m$ denotes the output feature of the m-th modality source data, with $m \in M = \{\mathrm{rgb}, \mathrm{thermal}\}$ in the experiment.
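One plausible reading of the spatial-aware block and the feature extraction sub-network described above can be written out as the following sketch. The subscripts on Conv, the attention-map symbol $A$, the input symbol $X_m$, the element-wise product $\odot$ used to apply the map to the block input, and the placement of SAB directly after the first two convolutional layers are all assumptions made for illustration; they are not taken from the authors' formulation.

```latex
% Hypothetical sketch of the spatial-aware block (SAB); notation is assumed.
\begin{aligned}
A &= \mathrm{Sigmoid}\Big(\mathrm{Conv}_{b}\big(\mathrm{LRN}(\mathrm{ReLU}(\mathrm{Conv}_{a}(\mathit{input})))\big)\Big), \\
\mathit{output} &= A \odot \mathit{input}, \\[4pt]
% Hypothetical sketch of the three-layer feature extraction sub-network.
F_m &= \mathrm{Conv}_{3}\Big(\mathrm{SAB}\big(\mathrm{Conv}_{2}(\mathrm{SAB}(\mathrm{Conv}_{1}(X_m)))\big)\Big),
\qquad m \in M = \{\mathrm{rgb}, \mathrm{thermal}\}.
\end{aligned}
```

Under this reading, $A$ takes values in $(0, 1)$ at every spatial location, so the block can only rescale the input activations rather than change their sign.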

Channel-Aware Multi-Modal Adaptation Method
The channel-aware multi-modal adaptation method can better account for the heterogeneity of channel-wise weights within each modality. As shown in Figure 2b, the channel-aware block concatenates the feature maps extracted from the two modalities. The concatenated features are fed into two fully connected layers, each with 1024 output units and followed by a ReLU function. In the last layer, the dropout and softmax functions are applied along each channel dimension to obtain the channel-wise weights. Finally, under the guidance of these channel-aware weights, we fuse the learned features to construct a target descriptor. Here, we summarize the operations of the feature aggregation sub-network in the following equations: where , ⊗ and ⊕ denote the concatenation, channel weighting and element-wise fusion processes, fc(·) refers to a fully connected layer followed by ReLU as well as the dropout operation, and softmax(·) denotes the softmax function. $\omega_R$, $\omega_T$ and $F$ denote the learned channel-wise weights and the target descriptor reconstructed from multi-modal data.
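The channel weighting and fusion step can be illustrated with a minimal pure-Python sketch. It assumes that the softmax is taken across the two modalities for each channel, so that the RGB and thermal weights of a channel sum to one; this is one possible reading of the text, not a confirmed detail. The per-channel logits stand in for the output of the fully connected layers, and the function names `modality_softmax` and `fuse_channels` are hypothetical.

```python
import math


def modality_softmax(logit_rgb, logit_thermal):
    """Softmax over the two modalities for one channel (assumed normalization)."""
    m = max(logit_rgb, logit_thermal)  # subtract the max for numerical stability
    e_r = math.exp(logit_rgb - m)
    e_t = math.exp(logit_thermal - m)
    return e_r / (e_r + e_t), e_t / (e_r + e_t)


def fuse_channels(feat_rgb, feat_thermal, logits_rgb, logits_thermal):
    """Weight each channel of each modality, then add the results element-wise.

    feat_rgb / feat_thermal: list of channels, each a flat list of activations.
    logits_rgb / logits_thermal: one score per channel (stand-in for the
    fully connected layers, which are not modeled here).
    """
    fused = []
    for f_r, f_t, l_r, l_t in zip(feat_rgb, feat_thermal, logits_rgb, logits_thermal):
        w_r, w_t = modality_softmax(l_r, l_t)
        fused.append([w_r * a + w_t * b for a, b in zip(f_r, f_t)])
    return fused
```

With equal logits the fused descriptor is the plain average of the two modalities; a large logit gap makes the descriptor follow the more reliable modality for that channel.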

Experiment Setting
We evaluate the proposed SCA-MMA framework on two large-scale benchmarks: the GTOT [8] and RGBT234 [20] datasets. GTOT is an RGB-T tracking benchmark proposed in [8]. It has 50 video sequences with well-labeled visible and thermal image pairs. It is annotated with seven attributes and thus partitioned into seven subsets for analyzing the attribute-sensitive performance of RGB-T tracking approaches. RGBT234 is a large RGB-T tracking dataset, extended from the RGBT210 [10] dataset. It contains 234 video sequences, reaching approximately 23,400 frames in total, with 8000 frames for the longest video. It is annotated with 12 attributes. We use the precision rate (PR) and success rate (SR) to evaluate the quantitative performance on these two datasets. PR is the percentage of frames whose predicted location is within a threshold distance of the ground truth. SR is the percentage of frames whose overlap ratio between the predicted location and the ground truth is larger than a threshold. Following the same protocols as in [8,9,13,15,20], we set the threshold to 5 pixels for the GTOT dataset and 20 pixels for the RGBT234 dataset to evaluate PR performance. We employ the area under the curve (AUC) of the success rate as SR for quantitative performance evaluations.
The whole network is trained in an end-to-end manner. We first initialize the parameters of the convolutional layers (Conv1-Conv3) in each feature extraction sub-network using the pre-trained MDNet model [4] and randomly initialize the parameters of all the remaining layers. Then, we randomly crop positive and negative samples in the training sequences and minimize the cross-entropy loss with the stochastic gradient descent (SGD) algorithm, where each domain is handled separately. In each iteration, we randomly choose 8 frames and crop 32 positive as well as 96 negative samples in each frame to construct a minibatch for each video sequence. Positive samples have an IoU overlap ratio with the ground truth in the range 0.7 to 1.0, while negative samples have an IoU in the range 0 to 0.5. For the multi-domain learning, we set K branches for K video sequences and train the network for 100K iterations. In the first 10K iterations, we set the learning rate to 0.0001 for the feature extraction sub-network and 0.001 for both the feature aggregation sub-network and the binary classification sub-network. In the remaining iterations, we change the learning rate of the feature aggregation sub-network from 0.001 to 0.0001. The weight decay and momentum are fixed to 0.0005 and 0.9, respectively.
In the tracking stage, the K branches in the binary classification sub-network for multi-domain learning are replaced by a single branch for each test sequence. We then fine-tune the pre-trained network on the first frame pair and update the model on subsequent frame pairs. In the fine-tuning stage, we crop 500 positive samples and 5000 negative samples around the given ground-truth bounding box. Positive samples have an IoU overlap ratio in the range 0.7 to 1.0, while negative samples have an IoU in the range 0 to 0.5. We fit all parameters of the feature extraction sub-network and feature aggregation sub-network. For the binary classification sub-network, we set the learning rate to 0.0001 for the first two fully connected layers and 0.001 for the last layer. We fine-tune the whole network end-to-end for 30 iterations, and train a bounding box regression model. For the given t-th frame, we crop 256 samples as candidates $\{x_t^i\}$ guided by the predicted result in the (t − 1)-th frame, and then obtain positive scores $\{f^+(x_t^i)\}$ and negative scores $\{f^-(x_t^i)\}$. The candidate with the maximum positive score can be found as $\arg\max_{x_t^i} f^+(x_t^i)$. We find the top k candidates (i.e., k = 5). Bounding box regression is employed to improve target localization accuracy, and the optimal target state $x^*$ is taken as the mean value of these top k candidates.
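The two metrics can be made concrete with a short sketch. It assumes boxes are given as `(x, y, w, h)` tuples, that the "predicted location" is the box center, and that the success curve is sampled at 21 evenly spaced overlap thresholds; the sampling density and all function names are illustrative choices, not taken from the benchmark toolkits.

```python
import math


def center_error(box_a, box_b):
    """Euclidean distance between the centers of two (x, y, w, h) boxes."""
    ax, ay = box_a[0] + box_a[2] / 2.0, box_a[1] + box_a[3] / 2.0
    bx, by = box_b[0] + box_b[2] / 2.0, box_b[1] + box_b[3] / 2.0
    return math.hypot(ax - bx, ay - by)


def iou(box_a, box_b):
    """Intersection over union of two (x, y, w, h) boxes."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[0] + box_a[2], box_b[0] + box_b[2])
    y2 = min(box_a[1] + box_a[3], box_b[1] + box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0


def precision_rate(preds, gts, threshold):
    """Fraction of frames whose center error is within `threshold` pixels."""
    hits = sum(center_error(p, g) <= threshold for p, g in zip(preds, gts))
    return hits / len(preds)


def success_rate_auc(preds, gts, num_thresholds=21):
    """Area under the success curve, approximated by its mean over thresholds."""
    overlaps = [iou(p, g) for p, g in zip(preds, gts)]
    thresholds = [i / (num_thresholds - 1) for i in range(num_thresholds)]
    curve = [sum(o > t for o in overlaps) / len(overlaps) for t in thresholds]
    return sum(curve) / len(curve)
```

Here `precision_rate(preds, gts, 5)` would correspond to the GTOT setting and `precision_rate(preds, gts, 20)` to the RGBT234 setting.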
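The sampling rule and the learning-rate schedule above can be summarized in a small sketch. It takes the description literally (32 positive and 96 negative samples in each of the 8 frames) and assumes 0-based iteration counting, so that iterations 0 to 9999 are the first 10K; the function and dictionary key names are hypothetical.

```python
FRAMES_PER_BATCH = 8
POS_PER_FRAME = 32
NEG_PER_FRAME = 96


def label_by_iou(overlap):
    """Return 1 for a positive sample, 0 for a negative one, None if unused."""
    if 0.7 <= overlap <= 1.0:
        return 1
    if 0.0 <= overlap <= 0.5:
        return 0
    return None  # overlaps between 0.5 and 0.7 are not used for training


def minibatch_counts():
    """Number of (positive, negative) samples in one minibatch."""
    return FRAMES_PER_BATCH * POS_PER_FRAME, FRAMES_PER_BATCH * NEG_PER_FRAME


def learning_rates(iteration, first_stage_iters=10_000):
    """Per-sub-network learning rates at a given 0-based training iteration."""
    rates = {
        "feature_extraction": 1e-4,
        "feature_aggregation": 1e-3,
        "binary_classification": 1e-3,
    }
    if iteration >= first_stage_iters:
        # Only the feature aggregation sub-network is lowered after the first stage.
        rates["feature_aggregation"] = 1e-4
    return rates
```

Samples whose IoU falls strictly between 0.5 and 0.7 are left out, which keeps ambiguous boxes from being used as either class.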
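The candidate selection step can be sketched as follows, assuming candidates are `(x, y, w, h)` boxes with one positive score each. The sketch only ranks candidates and averages the top k; the bounding box regression step is omitted, and the name `select_target` is hypothetical.

```python
def select_target(candidates, positive_scores, k=5):
    """Return the best-scoring candidate and the mean box of the top-k candidates.

    candidates: list of (x, y, w, h) boxes sampled around the previous result.
    positive_scores: one positive classification score per candidate.
    """
    ranked = sorted(
        range(len(candidates)),
        key=lambda i: positive_scores[i],
        reverse=True,
    )
    top = [candidates[i] for i in ranked[:k]]
    best = top[0]  # candidate with the maximum positive score
    mean_box = tuple(sum(box[d] for box in top) / len(top) for d in range(4))
    return best, mean_box
```

Averaging the top k boxes rather than taking the single best one makes the estimate less sensitive to one spuriously high score.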

Result Comparisons
We utilize the full RGBT234 [20] dataset to construct training data and train our model for the experiment on the GTOT [8] dataset. We compare the proposed SCA-MMA with state-of-the-art trackers, including FANet [13], SGT [9], MDNet+RGBT, Struck+RGBT, L1-PF [15], ECO [32] and KCF [2]. We concatenate the features from the RGB and thermal modalities as the RGB-T input of the corresponding tracking algorithms [8]. Figure 3 shows that the SCA-MMA performs clearly better than the other trackers on the GTOT dataset, gaining 2.0%/3.4% in PR/SR over the second-best tracker. This performance demonstrates that the proposed SCA-MMA can track the target robustly even in challenging conditions.
For the experiment on the RGBT234 dataset, we construct training data using the full GTOT dataset and train the model. We compare the proposed framework with state-of-the-art trackers, including single-modal trackers such as MDNet [4], ECO [32], C-COT [33], SOWP [34], SRDCF [35], CSR-DCF [36] and CFNet [37], as well as RGB-T trackers such as SGT [9], FANet [13], MDNet+RGBT, SOWP+RGBT, CSR-DCF+RGBT, L1-PF [15] and CFNet+RGBT. Here, we only display the top 12 trackers. As shown in Figure 4, the SCA-MMA framework performs the best under the different evaluation metrics. It achieves 80.2%/56.9% in PR/SR, a 3.8%/3.7% improvement over the second-best tracker and an 8.0%/7.4% improvement over the baseline MDNet+RGBT.
The attribute-based results on the RGBT234 dataset are shown in Table 1. The best, second and third results are in red, green and blue colors, respectively. The table contains all 12 attributes annotated on the RGBT234 dataset: no occlusion (NO), partial occlusion (PO), heavy occlusion (HO), low illumination (LI), low resolution (LR), thermal crossover (TC), deformation (DEF), fast motion (FM), scale variation (SV), motion blur (MB), camera moving (CM) and background clutter (BC). As shown in Table 1, our framework achieves the best performance in all attributes. Compared with the baseline MDNet+RGBT, we obtain over 10%/8% PR/SR improvement in the DEF and BC challenges, which involve sharp target appearance changes. This demonstrates that the proposed spatial-aware mechanism can learn spatial awareness adaptively and enhance the target information for robust object tracking. In challenging conditions such as HO and TC, the performance of the SCA-MMA framework is significantly higher than the baseline, which demonstrates the effectiveness of the proposed channel-aware multi-modal adaptation mechanism.
Figure 5 presents a qualitative comparison of our proposed framework against state-of-the-art RGB-T trackers, including SGT and MDNet+RGBT, on four video sequences. Overall, our SCA-MMA framework is effective in handling challenging conditions such as low illumination, occlusion, thermal crossover, deformation, background clutter and appearance change. For the elecbike10 sequence, our framework performs well in low illumination and heavy occlusion conditions, while other trackers lose the target when occlusion happens. For the fog sequence, when occlusion and bad weather occur, our framework can track the target robustly by adaptively aggregating the visible and thermal data.
To demonstrate the effectiveness of our proposed channel-aware adaptation and spatial-aware adaptation methods, we perform ablation experiments under two experimental settings on the RGBT234 dataset: the object tracker with only channel-aware adaptation (named "Ours-CA") and the tracker with only spatial-aware adaptation (named "Ours-SA"). The detailed performance comparisons are shown in Figure 4. The object tracker with only channel-aware adaptation achieves 79.1% and 55.5% in terms of PR and SR, which are lower by 0.9% and 1.1% than the SCA-MMA. The tracker with only spatial-aware adaptation obtains 78.4% and 55.4% in terms of PR and SR, which is lower than both the "Ours-CA" and SCA-MMA methods. This clearly demonstrates that both the channel-aware adaptation and spatial-aware adaptation mechanisms improve the performance of the RGB-T object tracking task to some extent. From the attribute-based performance shown in Table 2, we can see that in the challenges of low illumination and thermal crossover, the CA framework performs better than the SA framework. In background clutter and deformation conditions with large target appearance changes, the SA framework is far more robust than the CA framework. This demonstrates that the spatial-aware mechanism can enhance the target appearance in the feature extraction stage, and the channel-aware multi-modal adaptation method can handle the target reconstruction task by learning channel-wise weights in challenging conditions. Our proposed framework integrates both spatial- and channel-aware feature adaptation and achieves state-of-the-art performance.

Algorithm Analysis
We implement the proposed SCA-MMA framework in PyTorch on a platform with an E5-2620 V4 CPU @2.10GHz and an NVIDIA TITAN Xp GPU. As shown in Table 3, the mean speed of our framework on the GTOT and RGBT234 datasets reaches 1.3 FPS, while MDNet and MDNet+RGBT run at 3.2 FPS and 1.6 FPS, respectively. Compared with the MDNet+RGBT tracker, the SCA-MMA framework gains an 8.0%/7.4% improvement on the RGBT234 dataset and a 10.5%/9.5% improvement on the GTOT dataset at a comparable tracking speed (1.3 FPS versus 1.6 FPS).
Table 3. PR/SR score (%) and runtime of our framework against the baseline MDNet+RGBT on the GTOT and RGBT234 datasets.

Conclusions
In this work, we have proposed a novel spatial- and channel-aware multi-modal adaptation (SCA-MMA) framework for boosting the performance of RGB-T object tracking. In particular, we have built an adaptive and effective learning process to explore the complementarity between two heterogeneous modalities. SCA-MMA introduces a spatial-aware mechanism to enhance the feature representations of objects of interest in the spatial domain. Further, we adaptively learn channel-wise weights with the channel-aware multi-modal adaptation mechanism to obtain the final enhanced features of tracking targets. Extensive experiments on the RGBT234 and GTOT datasets have demonstrated that the proposed SCA-MMA achieves state-of-the-art performance on the RGB-T tracking problem. In the future, we will focus on how to design more robust feature learning methods via meta-learning for multi-modal understanding tasks, such as multi-modal object recognition, detection and object tracking.
Author Contributions: C.X.: conceptualization, supervision and project administration, data analysis, writing and editing manuscript; R.S.: mathematical formulation, data analysis and experiment, writing manuscript; C.W.: data analysis and interpretation, editing manuscript; G.Z.: checked the numerical results and corrected the manuscript. All authors have read and agreed to the published version of the manuscript.