Lightweight Cross-Modal Information Mutual Reinforcement Network for RGB-T Salient Object Detection

RGB-T salient object detection (SOD) has made significant progress in recent years. However, most existing works are based on heavy models, which are not applicable to mobile devices. Additionally, there is still room for improvement in the design of cross-modal feature fusion and cross-level feature fusion. To address these issues, we propose a lightweight cross-modal information mutual reinforcement network for RGB-T SOD. Our network consists of a lightweight encoder, the cross-modal information mutual reinforcement (CMIMR) module, and the semantic-information-guided fusion (SIGF) module. To reduce the computational cost and the number of parameters, we employ lightweight modules in both the encoder and the decoder. Furthermore, to fuse the complementary information between two-modal features, we design the CMIMR module to enhance the two-modal features. This module effectively refines the two-modal features by absorbing previous-level semantic information and inter-modal complementary information. In addition, to fuse the cross-level features and detect multiscale salient objects, we design the SIGF module, which effectively suppresses the background noisy information in low-level features and extracts multiscale information. We conduct extensive experiments on three RGB-T datasets, and our method achieves competitive performance compared with 15 state-of-the-art methods.


Introduction
Salient object detection (SOD) is a computer vision technique that segments the most visually interesting objects from an image, mimicking attention mechanisms. It is important to note that SOD differs from object detection tasks, which aim to predict object bounding boxes. SOD has been employed as a preprocessing step in many computer vision tasks, such as image fusion [1], perceptual video coding [2], compressed video sensing [3], image quality assessment [4], and so on.
Traditional methods for RGB SOD were initially proposed, but they could not achieve optimal performance. With the advent of CNNs [5] and U-Nets [6], deep-learning-based methods became popular in SOD. For example, multiscale information was extracted in PoolNet [7] and MINet [8]. The edge feature was generated and supplemented to the object feature in EGNet [9] and EMFINet [10]. Later, depth maps were introduced into SOD, which is called RGB-D SOD. In this field, the depth-enhanced module [11] was designed to fuse two-modal features. However, the RGB-D datasets still have some shortcomings: some depth maps are not accurate due to the limitations of the acquisition equipment. Researchers therefore turned to introducing thermal infrared images into SOD, called RGB-T SOD.
RGB-T SOD has seen significant progress in recent years. For example, CBAM [12] is employed in [13] to fuse two-modal features. To capture multiscale information, the FAM module is employed in [13], and the SGCU module is designed in CSRNet [14]. Despite these outstanding efforts in RGB-T SOD, there are still some problems that need to be addressed. Most of the existing works are based on a heavy model, which is unsuitable for mobile devices. Besides, there is still room for research on effectively integrating the complementary information between two-modal features. Figure 1 shows some examples where PCNet [15] and TAGF [16] cannot produce good detection results. Another problem is how to fuse two-level features and explore multiscale information during the decoding stage. Based on the aforementioned discussions, we propose a lightweight network for RGB-T SOD. Specifically, we employ the lightweight backbone MobileNet-V2 [17] in the encoder and the depth-separable convolution [18] in the decoder. To address the problem of two-modal feature fusion, we introduce the CMIMR module. We enhance the two-modal features by transferring semantic information into them using the previous-level decoded feature. After this enhancement, we mutually reinforce the two-modal features by communicating complementary information between them. Additionally, we design the SIGF module to aggregate two-level features and explore multiscale information during the decoding stage. Unlike RFB [11,19] and FAM [7], we employ the visual attention block (VAB) [20] to explore the multiscale information of the fused feature in the decoder.
Our main contributions are summarized as follows:
1. We propose a lightweight cross-modal information mutual reinforcement network for RGB-T salient object detection. Our network comprises a lightweight encoder, the cross-modal information mutual reinforcement (CMIMR) module, and the semantic-information-guided fusion (SIGF) module.
2. To fuse complementary information between two-modal features, we introduce the CMIMR module, which effectively refines the two-modal features.
3. Extensive experiments conducted on three RGB-T datasets demonstrate the effectiveness of our method.

Related Works

Salient Object Detection
Numerous works have been proposed for SOD [21][22][23]. Initially, prior knowledge and manually designed features [24] were employed. With the advent of deep learning, CNN-based methods have made significant strides. For instance, many methods have attempted to capture multiscale information in images (RFB [19,25] and FAM [7]). Additionally, many works have focused on refining the edge details of salient objects [9,26,27]. Furthermore, several lightweight methods have been proposed to adapt to mobile devices [28,29]. While these methods have made great progress in RGB SOD, they do not perform as well when the RGB image has cluttered backgrounds, low contrast, or object occlusion.
RGB-D SOD is a technique that uses depth maps to provide complementary information to RGB images. To fuse two-modal features, several methods have been proposed, including the depth-enhanced module [11], selective self-mutual attention [30], the cross-modal depth-weighted combination block [31], the dynamic selective module [32], the cross-modal information exchange module [33], the feature-enhanced module [34], the cross-modal disentanglement module [35], the unified cross dual-attention module [36], and inverted bottleneck cross-modality fusion [37]. Despite the progress made by RGB-D SOD, it performs poorly on low-quality examples, where some depth maps are inaccurate due to the limitations of the acquisition equipment.
In addition to depth maps, thermal infrared images have been employed to provide complementary information to RGB images, which is called RGB-T SOD. Many works have made efforts in this area [38,39]. To fuse two-modal features, several methods have been proposed, including CBAM [12,13], the complementary weighting module [40], the cross-modal multi-stage fusion module [41], the multi-modal interactive attention unit [42], the effective cross-modality fusion module [43], the semantic constraint provider [44], the modality difference reduction module [45], the spatial complementary fusion module [46], and the cross-modal interaction module [15]. To fuse two-level features during the decoding stage, the FAM module [13] and interactive decoders [47] were proposed. Additionally, lightweight networks [14,48] have been proposed to meet the requirements of mobile devices.

Architecture Overview
We present the overall architecture of our method in Figure 2, which is a typical encoder-decoder structure. In the encoder part, we adopted the lightweight MobileNet-V2 (E1∼E5) [17] as the backbone to extract the five-level features F_i^R and F_i^T (i = 1, ..., 5) from the two-modal inputs, respectively. To explore the complementary information between the two-modal features, we designed the cross-modal information mutual reinforcement module to fuse them. To detect multiscale objects and fuse the two-level features, we designed the semantic-information-guided fusion module to suppress interfering information and explore multiscale information. Additionally, we employed two auxiliary decoder branches. On the one hand, they guide the two-modal encoders to extract modality-specific information [49] for the two-modal inputs, which helps make the feature learning process more stable. On the other hand, they provide supplementary information in the form of single-channel saliency features. The decoder modules of the two auxiliary decoder branches have a simple structure, namely concatenation followed by a 3 × 3 depth-separable convolution (DSConv) [18]. Finally, a 1 × 1 convolution is applied to the three decoded features, yielding three single-channel saliency features S_1^Fd, S_2^Td, and S_2^Rd. After that, the sigmoid activation function is applied to obtain the saliency maps S^F, S^T, and S^R. To fuse the complementary information between the three decoder branches, we sum the three single-channel saliency features and apply the sigmoid function to obtain the saliency map S_test during the testing stage. The above processes can be formulated as follows:

S^F = σ(S_1^Fd), S^T = σ(S_2^Td), S^R = σ(S_2^Rd), where S_1^Fd = Conv_1×1(F_1^Fd), S_2^Td = Conv_1×1(F_2^Td), S_2^Rd = Conv_1×1(F_2^Rd),
S_test = σ(S_1^Fd + S_2^Td + S_2^Rd),

where Conv_1×1 denotes the 1 × 1 convolution and σ is the sigmoid function, which maps a single-channel saliency feature to a saliency map. F_1^Fd, F_2^Td, and F_2^Rd are the output features of the primary decoder and the two auxiliary decoders.
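The three-branch prediction and test-time fusion described above can be sketched in PyTorch as follows. The channel count C and the equal spatial resolution of the three decoded features are assumptions for illustration only; in the full network, the features would first be brought to a common resolution.

```python
import torch
import torch.nn as nn

# Hypothetical channel count for the decoded features (not specified here).
C = 16
head_f = nn.Conv2d(C, 1, kernel_size=1)  # primary decoder head
head_t = nn.Conv2d(C, 1, kernel_size=1)  # thermal auxiliary decoder head
head_r = nn.Conv2d(C, 1, kernel_size=1)  # RGB auxiliary decoder head

def predict(f_fd1, f_td2, f_rd2, testing=True):
    # 1x1 convolutions produce the three single-channel saliency features.
    s_f, s_t, s_r = head_f(f_fd1), head_t(f_td2), head_r(f_rd2)
    if testing:
        # Test-time fusion: sum the three features, then apply the sigmoid.
        return torch.sigmoid(s_f + s_t + s_r)
    # During training, each branch yields its own saliency map.
    return torch.sigmoid(s_f), torch.sigmoid(s_t), torch.sigmoid(s_r)
```

Summing the logits before the sigmoid (rather than summing the three saliency maps) lets each branch contribute evidence on an unbounded scale before the final squashing.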

Cross-Modal Information Mutual Reinforcement Module
Fusing complementary information between two-modal features is an essential question in RGB-T SOD. Two-modal features often contain noisy and inconsistent information, which can hinder the learning of saliency features. To address these issues, we designed the CMIMR module to suppress noisy information in the two-modal features and mutually supply valuable information. The structure of the CMIMR module is illustrated in Figure 3. Specifically, we used the previous-level decoded feature, which contains accurate semantic and location information, to enhance each of the two-modal features via a concatenation-convolution operation. This guides the two-modal features to concentrate more on valuable information and alleviates background noise. However, this enhancement operation may weaken the beneficial information in the two-modal features. To address this issue, we added residual connections to the two-modal enhanced features. This process can be described as follows:

F_i^Rse = F_i^R ⊕ Conv_1×1(Cat(F_i^R, F_(i+1)^Fd)),
F_i^Tse = F_i^T ⊕ Conv_1×1(Cat(F_i^T, F_(i+1)^Fd)),

where ⊕ means elementwise summation, Cat(·) is channel-wise concatenation, and Conv_1×1 is the 1 × 1 convolution block consisting of a 1 × 1 convolution layer and a batch normalization layer. F_i^Rse and F_i^Tse are the previous-level-information-enhanced two-modal features, and F_(i+1)^Fd is the decoded feature at the (i + 1)th level. The semantic and location information from the previous-level decoded feature helps suppress noisy information in the two-modal features, which facilitates the exploration of complementary information in the subsequent process.
After the aforementioned enhancement, we further exchanged the complementary information between the two-modal features. Since two-modal features contain both complementary and misleading information, directly concatenating them can harm the fusion. Taking the RGB feature as an example, we intended to utilize the thermal feature to enhance it. Considering that spatial attention [50] can adaptively highlight regions of interest and filter out noisy information, we utilized the spatial attention map of the RGB feature to filter misleading information in the thermal feature. This is because we wanted to preserve the valuable information in the thermal feature that is complementary to the RGB feature. After that, we concatenated the spatial-attention-filtered thermal feature with the RGB feature to supplement beneficial information into the RGB feature. Through this operation, the complementary information in the thermal feature can adaptively flow into the RGB feature, thereby yielding a cross-modal information-enhanced RGB feature. The enhancement process for the thermal feature is similar to that of the RGB feature. Finally, we combined the two-modal enhanced features by elementwise summation:

F_i^Rme = DSConv_3×3(Cat(F_i^Rse, SA(F_i^Rse) ⊙ F_i^Tse)),
F_i^Tme = DSConv_3×3(Cat(F_i^Tse, SA(F_i^Tse) ⊙ F_i^Rse)),
F_i^F = F_i^Rme ⊕ F_i^Tme,

where DSConv_3×3 is the 3 × 3 DSConv layer [18], ⊙ represents elementwise multiplication, and SA denotes spatial attention [50]. F_i^Tme and F_i^Rme are the cross-modal-enhanced two-modal features, and F_i^F is the two-modal fused feature. In summary, the CMIMR module can effectively suppress background noise in the two-modal features under the guidance of previous-level semantic information. Furthermore, it can supplement valuable information to each modal feature, which helps to effectively fuse the two-modal features.
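The two stages of the CMIMR module described above (previous-level enhancement, then spatial-attention-guided mutual reinforcement) can be sketched in PyTorch as follows. The 7 × 7 spatial-attention kernel, the exact normalization placement, and the channel counts are assumptions, since the text names only the operations:

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    # Common spatial attention: channel-wise mean/max pooling -> conv -> sigmoid.
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)
        mx, _ = x.max(dim=1, keepdim=True)
        return torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))

def dsconv3x3(cin, cout):
    # 3x3 depth-separable convolution: depthwise 3x3 followed by pointwise 1x1.
    return nn.Sequential(
        nn.Conv2d(cin, cin, 3, padding=1, groups=cin),
        nn.Conv2d(cin, cout, 1),
        nn.BatchNorm2d(cout),
        nn.ReLU(inplace=True),
    )

class CMIMR(nn.Module):
    """Sketch of the CMIMR module: previous-level decoded-feature enhancement
    with residual connections, then mutual reinforcement where each modality's
    spatial attention filters the other modality's feature."""
    def __init__(self, c):
        super().__init__()
        self.enh_r = nn.Sequential(nn.Conv2d(2 * c, c, 1), nn.BatchNorm2d(c))
        self.enh_t = nn.Sequential(nn.Conv2d(2 * c, c, 1), nn.BatchNorm2d(c))
        self.sa_r, self.sa_t = SpatialAttention(), SpatialAttention()
        self.fuse_r = dsconv3x3(2 * c, c)
        self.fuse_t = dsconv3x3(2 * c, c)

    def forward(self, fr, ft, fd_prev):
        # fd_prev: previous-level decoded feature (assumed already upsampled).
        fr_se = fr + self.enh_r(torch.cat([fr, fd_prev], dim=1))
        ft_se = ft + self.enh_t(torch.cat([ft, fd_prev], dim=1))
        # Filter the other modality with this modality's spatial attention map.
        fr_me = self.fuse_r(torch.cat([fr_se, self.sa_r(fr_se) * ft_se], dim=1))
        ft_me = self.fuse_t(torch.cat([ft_se, self.sa_t(ft_se) * fr_se], dim=1))
        return fr_me + ft_me  # two-modal fused feature
```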

Semantic-Information-Guided Fusion Module
How to design the two-level feature aggregation module for the decoding stage is a crucial question in SOD; it determines whether we can recover the elaborate details of salient objects. Since low-level features contain much noisy information, directly concatenating the two-level features would inevitably introduce disturbing information into the fused features. To rectify the noisy information in the low-level feature, we transmitted the semantic information in the high-level feature into it. Besides, multiscale information is vital in SOD tasks: salient objects in different scenes have various sizes and shapes, and an ordinary 3 × 3 convolution cannot accurately detect all of them. Inspired by the great success of multiscale information-capture modules (e.g., RFB [11,19] and FAM [7]) in SOD, we employed the visual attention block (VAB) [20] to capture multiscale features. The VAB was initially designed as part of a lightweight feature-extraction backbone for many visual tasks.
The SIGF module structure is shown in Figure 4. Specifically, to suppress the background noisy information in the low-level feature, we utilized the high-level feature to refine the feature representation of the low-level feature: we concatenated the high-level feature with the low-level feature to enhance it. In this feature-enhancement process, valuable information in the low-level feature may be diluted, so we introduced residual connections to preserve it. This process can be expressed as follows:

F_i^Fe = F_i^F ⊕ Conv_1×1(Cat(F_i^F, F_(i+1)^Fd)),

where F_i^Fe is the semantic-information-enhanced feature, F_(i+1)^Fd is the decoded feature at the (i + 1)th level, and F_i^F is the two-modal fused feature. Then, to enable our method to detect salient objects of various sizes and shapes, we used the VAB to extract the multiscale information contained in the fused feature:

F_i^Fd = VAB(F_i^Fe),

where VAB is the visual attention block [20] and F_i^Fd is the decoded feature at the ith level. The VAB consists of two parts: large kernel attention (LKA) and a feed-forward network (FFN) [51]. In the large kernel attention, a depth-separable convolution, a depth-separable dilation convolution with dilation d, and a 1 × 1 convolution are successively stacked to capture multiscale information:

LKA(F) = F ⊙ Conv_1×1(DSConv_d(DSConv(F))),

where DSConv_d is the depth-separable convolution with dilation d and F stands for the feature being processed. In summary, our module can rectify noisy information in the low-level feature under the guidance of high-level accurate semantic information. Meanwhile, the VAB successfully extracts multiscale information, which is beneficial for detecting multiscale salient objects.
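A minimal PyTorch sketch of the SIGF module follows, assuming the common LKA configuration (5 × 5 depthwise convolution, 7 × 7 depthwise convolution with dilation 3, then 1 × 1 convolution); the kernel sizes, the FFN expansion ratio, and the residual placements are assumptions based on [20] rather than details stated here:

```python
import torch
import torch.nn as nn

class LKA(nn.Module):
    """Large kernel attention: depthwise conv -> dilated depthwise conv ->
    1x1 conv; the result re-weights the input multiplicatively."""
    def __init__(self, c, dilation=3):
        super().__init__()
        self.dw = nn.Conv2d(c, c, 5, padding=2, groups=c)
        self.dwd = nn.Conv2d(c, c, 7, padding=3 * dilation,
                             dilation=dilation, groups=c)
        self.pw = nn.Conv2d(c, c, 1)

    def forward(self, x):
        return x * self.pw(self.dwd(self.dw(x)))

class SIGF(nn.Module):
    """Sketch of the SIGF module: high-level semantic enhancement with a
    residual connection, followed by a VAB (LKA + feed-forward network)."""
    def __init__(self, c):
        super().__init__()
        self.enh = nn.Sequential(nn.Conv2d(2 * c, c, 1), nn.BatchNorm2d(c))
        self.lka = LKA(c)
        self.ffn = nn.Sequential(nn.Conv2d(c, 4 * c, 1), nn.GELU(),
                                 nn.Conv2d(4 * c, c, 1))

    def forward(self, ff, fd_prev):
        # ff: two-modal fused feature; fd_prev: upsampled (i+1)th decoded feature.
        fe = ff + self.enh(torch.cat([ff, fd_prev], dim=1))  # residual enhancement
        x = fe + self.lka(fe)   # attention sub-block with residual
        return x + self.ffn(x)  # feed-forward sub-block with residual
```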

Loss Function
The deep supervision strategy [52] is adopted in our method. Specifically, the saliency predictions of the deep features F_i^Fd (i = 1, ..., 5) are supervised, as shown in Figure 2. Additionally, the saliency predictions of the two auxiliary decoders' output features F_2^Td and F_2^Rd are also supervised. The BCE loss [53] and IoU loss [54] are employed to calculate the losses between the saliency predictions and the GT:

L = Σ_(i=1)^5 [ℓ_bce(S_i^F, G) + ℓ_IoU(S_i^F, G)] + ℓ_bce(S^T, G) + ℓ_IoU(S^T, G) + ℓ_bce(S^R, G) + ℓ_IoU(S^R, G),

where S_i^F, S^T, and S^R denote the saliency predictions of the deep features F_i^Fd, F_2^Td, and F_2^Rd, respectively, and G denotes the ground truth. ℓ_bce and ℓ_IoU denote the BCE loss and IoU loss, respectively.
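A minimal sketch of the hybrid BCE + IoU loss term for one prediction follows; working on logits and the soft (sigmoid) IoU formulation are common implementation choices, assumed here rather than taken from the text:

```python
import torch
import torch.nn.functional as F

def iou_loss(pred_logits, gt, eps=1e-6):
    # Soft IoU loss: pred_logits are raw logits, gt is a {0,1} mask.
    p = torch.sigmoid(pred_logits)
    inter = (p * gt).sum(dim=(1, 2, 3))
    union = (p + gt - p * gt).sum(dim=(1, 2, 3))
    return (1.0 - (inter + eps) / (union + eps)).mean()

def hybrid_loss(pred_logits, gt):
    # One l_bce + l_IoU term of the total deep-supervision loss.
    return F.binary_cross_entropy_with_logits(pred_logits, gt) + iou_loss(pred_logits, gt)
```

The full objective sums this term over the five primary-decoder predictions and the two auxiliary-decoder predictions.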

Experiments

Experiment Settings

Datasets
There are three RGB-T SOD datasets that have been widely employed in existing works: VT821 [55], VT1000 [56], and VT5000 [13]. VT821 consists of 821 manually registered RGB-T image pairs. VT1000 is composed of 1000 well-aligned RGB-T image pairs. VT5000 has 5000 RGB-T image pairs, containing complex scenes and diverse objects. Following the setting of previous works [47], 2500 samples from VT5000 were selected as the training dataset. The other 2500 samples from VT5000 and all samples from VT821 and VT1000 served as the testing datasets. To avoid overfitting, the training dataset was augmented by random flipping and random rotation [11].

Implementation Details
The model was trained on a GeForce RTX 2080 Ti (11 GB memory). The PyTorch framework was employed for the implementation. The encoders were initialized with the pre-trained MobileNet-V2 [17], while the other parameters were initialized with the Kaiming uniform distribution [57]. The input images were resized to 224 × 224 for both the training and testing stages. The number of training epochs and the batch size were set to 120 and 20, respectively. The Adam optimizer was employed to minimize the loss. The learning rate was set to 1 × 10−4 and decayed to 1 × 10−5 after 90 epochs.
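The optimizer and learning-rate schedule above can be sketched as follows; the stand-in model and the use of `MultiStepLR` to realize the single decay at epoch 90 are assumptions for illustration:

```python
import torch

# Stand-in model; in the paper, the encoders are initialized from pre-trained
# MobileNet-V2 and the remaining parameters with Kaiming uniform.
model = torch.nn.Conv2d(3, 1, 3, padding=1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
# Decay the learning rate from 1e-4 to 1e-5 after epoch 90 (gamma = 0.1).
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[90], gamma=0.1)

for epoch in range(120):
    # ... iterate over batches of size 20, compute the loss, optimizer.step() ...
    scheduler.step()
```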

Evaluation Metrics
To compare the performance of our method with that of other methods, four numeric evaluation metrics were employed: the mean absolute error (M), F-measure (F_β) [58], E-measure (E_ξ) [59], and structure-measure (S_α) [60]. Besides, the PR curve and F-measure curve are plotted to show the evaluation results.

M
The mean absolute error M calculates the mean absolute error between the prediction and the GT:

M = (1 / (W × H)) Σ_(i=1)^W Σ_(j=1)^H |S(i, j) − G(i, j)|,

where G(i, j) and S(i, j) denote the ground truth and the saliency map, respectively, and W and H are the width and height of the image.
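The M metric above is straightforward to compute; a minimal NumPy sketch:

```python
import numpy as np

def mae(saliency, gt):
    """Mean absolute error M between a saliency map and the ground truth,
    both (H, W) arrays with values in [0, 1]."""
    saliency = np.asarray(saliency, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    return float(np.abs(saliency - gt).mean())
```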

F_β
The F-measure (F_β) is the weighted harmonic mean of recall and precision, which is formulated as

F_β = ((1 + β²) × Precision × Recall) / (β² × Precision + Recall),

where β² was set to 0.3, referring to [58].
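A minimal sketch of F_β at a single binarization threshold follows. Note that SOD benchmarks typically sweep thresholds or use an adaptive threshold (e.g., twice the mean saliency); the fixed threshold here is a simplifying assumption:

```python
import numpy as np

def f_measure(saliency, gt, beta2=0.3, threshold=0.5):
    """F_beta between a saliency map and a binary ground truth, for one
    fixed binarization threshold (a simplification of benchmark practice)."""
    pred = np.asarray(saliency) >= threshold
    gt = np.asarray(gt).astype(bool)
    tp = np.logical_and(pred, gt).sum()
    precision = tp / (pred.sum() + 1e-8)
    recall = tp / (gt.sum() + 1e-8)
    return float((1 + beta2) * precision * recall /
                 (beta2 * precision + recall + 1e-8))
```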

E_ξ
The E-measure (E_ξ) evaluates the global and local similarities between the ground truth and the prediction:

E_ξ = (1 / (W × H)) Σ_(i=1)^W Σ_(j=1)^H φ(i, j),

where φ is the enhanced alignment matrix.

S_α
The structure-measure (S_α) evaluates the structural similarity of salient objects between the ground truth and the prediction:

S_α = α × S_o + (1 − α) × S_r,

where S_r and S_o denote the region-aware and object-aware structural similarity, respectively, and α was set to 0.5, referring to [60].

Quantitative Comparison
We compared the performance with the heavy-model-based methods in Table 1 and with the lightweight methods in Table 2. The PR and F-measure curves of the compared methods on the three RGB-T datasets are plotted in Figure 5. Our method outperformed most methods in terms of the four metrics, except for S_α, on which it was slightly inferior to some methods. Compared to the heavy-model-based methods, as shown in Table 1, our method improved M, F_β, and E_ξ by 6.9%, 2.0%, and 1.1% on VT5000, respectively. Although our method was not as good as some methods in terms of S_α, it requires only 6.1 M parameters and 1.5 G FLOPs and can easily be deployed on mobile devices. The inference speed of our method was mediocre on a professional GPU (GeForce RTX 2080 Ti, Santa Clara, CA, USA), at 34.9 FPS. However, given that mobile devices often only have access to a CPU, our method outperformed the other methods with 6.5 FPS on a CPU (AMD Ryzen 7 5800H, Santa Clara, CA, USA). Besides, we compare our method with existing lightweight methods in Table 2. Our method outperformed the other methods on most metrics, except for S_α on VT1000 and VT821, improving M, F_β, and E_ξ by 12.5%, 2.3%, and 1.2% on VT5000, respectively. Among the lightweight methods, the FLOPs and FPS of our method were not as good as those of LSNet, but our method performed better. In addition, we plot the PR and F-measure curves in Figure 5 to visually compare the performance of all methods. The precision of our method was higher than that of the other methods on VT5000 and VT821 when the recall was not very close to 1. The F-measure curves consider the trade-off between precision and recall, and our method obtained better F-measure scores on VT5000 and VT821. We also evaluate the IoU and Dice scores of our method in Table 3, with reference to most image segmentation tasks. Our method performed better on VT1000 than on VT5000 and VT821. Additionally, our method outperformed the compared method LSNet on all three datasets.
To demonstrate the significance of the performance improvement of our method, a t-test was performed. We retrained our method and obtained six sets of experimental results, shown in Table 4. Concretely, assuming the metric X ∼ N(µ, σ²), the test statistic was

t = (X̄ − µ_0) / (S / √n),

where X̄ is the sample mean, S² is an unbiased estimate of σ², and t follows the Student distribution with n − 1 degrees of freedom. Therefore, the t-test was used in our hypothesis test. For the evaluation metric M, a left-sided test was performed, i.e., the H_0 hypothesis was that the M of our method is greater than that of the compared method. For the other five metrics F_β, S_α, E_ξ, IoU, and Dice, a right-sided test was performed, i.e., the H_0 hypothesis was that the corresponding result of our method is less than that of the compared method. The p-value is reported for each t-test. Three significance levels α were used in the t-tests, i.e., 0.01, 0.05, and 0.1. Generally speaking, if p-value ≤ 0.01, the test is highly significant; if 0.01 < p-value ≤ 0.05, the test is significant; if 0.05 < p-value ≤ 0.1, the test is marginally significant; and if p-value > 0.1, there is no reason to reject the H_0 hypothesis. As shown in Table 5, the p-value of our method was less than 0.01 for M, F_β, and E_ξ on the three datasets, indicating that the t-test was highly significant. Param means the number of parameters. FLOPs means floating point operations. FPS means frames per second, which was tested on two types of processors, i.e., the professional graphics processing unit GeForce RTX 2080 Ti (GPU) and the central processing unit AMD Ryzen 7 5800H @ 3.2 GHz (CPU), respectively. The top three results are marked in red, green, and blue in each column, respectively. ↑ and ↓ mean a larger value is better and a smaller value is better, respectively.
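The one-sided t-test above can be sketched as follows; the sample values in the usage assertions are illustrative, not the paper's actual retrained scores:

```python
import numpy as np
from scipy import stats

def one_sided_p(samples, mu0, side="right"):
    """One-sample t-test p-value. samples: the n repeated-run scores of our
    method (n = 6 in the paper); mu0: the compared method's score.
    side='right' tests H0: mean <= mu0 (for F_beta, S_alpha, E_xi, IoU, Dice);
    side='left' tests H0: mean >= mu0 (for M)."""
    x = np.asarray(samples, dtype=np.float64)
    n = x.size
    # t = (X̄ - µ0) / (S / √n), with S the unbiased sample standard deviation.
    t = (x.mean() - mu0) / (x.std(ddof=1) / np.sqrt(n))
    if side == "right":
        return float(stats.t.sf(t, df=n - 1))
    return float(stats.t.cdf(t, df=n - 1))
```

For example, six runs clearly above a compared F_β score give a p-value below 0.01, i.e., a highly significant improvement.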

Table 5. The t-test of our method with the compared methods on the RGB-T datasets. For the evaluation metric M, the left-sided test was performed, while for the other three metrics F_β, S_α, and E_ξ, the right-sided test was performed. The p-value is reported in this table. ↑ and ↓ mean a larger value is better and a smaller value is better, respectively.

Qualitative Comparison
To demonstrate the effectiveness of our method, we also provide visual comparisons with other methods in Figure 6. In this figure, the challenging scenes include small objects (1st and 2nd rows), multiple objects (3rd and 4th rows), a misleading RGB image (5th row), and misleading thermal images (6th, 7th, and 8th rows). As seen in Figure 6, our method detects salient objects better than the other methods. For example, in the first and second rows, our method accurately detects the small salient objects, while methods like MMNet and MIADPD failed in these cases. In the third and fourth rows, our method detects the multiple objects in the scene, but the other methods performed poorly. In the fifth row, our method detects the salient object effectively despite the low contrast in the RGB image, while the other methods were disturbed by the noisy information in the RGB image. In the sixth and seventh rows, the salient objects have apparent contrast in the RGB image but are similar to other background objects in the thermal image; the thermal images provide misleading information, which is easily handled by our method. In summary, our method overcomes the challenges in these scenarios thanks to the better fusion of the complementary information between the two-modal features and its multiscale information extraction.

Ablation Study

Effectiveness of Cross-Modal Information Mutual Reinforcement Module
To demonstrate the effectiveness of the CMIMR module, we performed several ablation experiments, reported in Table 6. First, we removed the CMIMR module, i.e., the two-modal features were directly concatenated and followed by the 3 × 3 DSConv, referred to as w/o CMIMR. Compared with this variant, our method improved M and F_β by 5.0% and 1.7% on VT5000, respectively. This suggests that our method can effectively fuse the complementary information between two-modal features by enhancing them with previous-level and inter-modality guidance information. To demonstrate that the performance improvement of each module is significant, we performed the t-test in Table 7. As shown in Table 7, the p-value of our method was less than 0.01 for all four metrics compared to the variant w/o CMIMR, so the test was highly significant. To demonstrate that the CMIMR module outperforms the modules that play the same role in existing methods, we replaced it with the two-modal feature fusion module of ADF [13], abbreviated as w ADF-TMF. Compared to this variant, our method improved M and F_β by 2.4% and 0.8% on VT5000, respectively, and the p-value of our method was less than 0.01 for F_β and S_α on VT5000, so the test was highly significant. This suggests that the design of the CMIMR module is sound.

Table 6. Ablation studies of our method on three RGB-T datasets. The best result is marked in red in each column. ↑ and ↓ mean a larger value is better and a smaller value is better, respectively.

Second, we removed the previous-level decoded feature enhancement, abbreviated as w/o PDFE, i.e., the two-modal features were not enhanced by the previous-level decoded feature but were directly fed into the cross-modal information mutual reinforcement component of the CMIMR module. Compared to this variant, our method improved M and F_β by 2.1% and 0.8% on VT5000, respectively, and the p-value of our method was less than 0.01 for F_β, S_α, and E_ξ on VT5000; therefore, the test was highly significant. This shows that the PDFE component is conducive to suppressing noisy information in the two-modal features. Finally, we removed the cross-modal information mutual reinforcement component, abbreviated as w/o IMR, i.e., after the PDFE component, the two-modal features were fused by concatenation followed by the 3 × 3 DSConv. Compared to this variant, our method improved M and F_β by 3.0% and 0.8% on VT5000, respectively, and the p-value of our method was less than 0.01 for F_β, S_α, and E_ξ on VT5000, so the test was highly significant. This suggests that the IMR component helps the two modalities transfer complementary information to each other and suppresses the distracting information in each modality. We also show the saliency maps of the ablation experiments in Figure 7. In the first row, the holly is conspicuous in the RGB image, and the ablation variants mistook it for a salient object. In the second row, the potato in the thermal image is similar to the salient objects, and the ablation variants could not distinguish it accurately. However, with the CMIMR module, our method can eliminate this misleading information. In conclusion, the CMIMR module can effectively fuse the complementary information between the two-modal features and mitigate the adverse effects of distracting information.

Effectiveness of Semantic-Information-Guided Fusion Module
To demonstrate the effectiveness of the semantic-information-guided fusion module, we conducted three ablation experiments; the results are shown in Table 6. First, we removed the SIGF module from our method, abbreviated as w/o SIGF, i.e., the two-level features were directly concatenated and followed by the 3 × 3 DSConv. Compared to this variant, our method improved M and F_β by 3.9% and 1.2% on VT5000, respectively. This demonstrates that the SIGF module helps suppress interfering information and explore multiscale information. To demonstrate that the performance improvement of the SIGF module is significant, we performed the t-test in Table 7. Compared to the variant w/o SIGF, the p-value of our method was less than 0.01 for the four metrics on VT5000, so the test was highly significant; the exception was S_α on VT821, for which the p-value was less than 0.05, which was significant. To demonstrate that the SIGF module outperforms the modules that play the same role in existing methods, we replaced it with the decoder module of ADF [13], abbreviated as w ADF-Decoder. Compared to this variant, our method improved M and F_β by 2.4% and 1.0% on VT5000, respectively, and the p-value of our method was less than 0.01 for F_β on VT5000, so the test was highly significant. This suggests that the design of the SIGF module is sound.
Second, we removed the previous-level semantic information enhancement in the SIGF module, abbreviated as w/o SIE, i.e., the two-level features were directly concatenated in the SIGF module. Compared with this variant, our method improved M and F_β by 1.8% and 0.7% on VT5000, respectively, which demonstrates that the SIE component helps suppress interfering information. Compared to the variant w/o SIE, the p-value of our method was less than 0.05 for F_β, S_α, and E_ξ on VT5000, so the test was significant. Next, we removed the VAB component in the SIGF module, abbreviated as w/o VAB, i.e., the VAB component was removed while the other components were retained. Compared to this variant, our method improved M and F_β by 2.7% and 0.8% on VT5000, respectively, which shows that the VAB is capable of capturing the multiscale information of salient objects. Compared to the variant w/o VAB, the p-value of our method was less than 0.01 for F_β and S_α on VT5000, so the test was highly significant. Besides, we also replaced the VAB in the SIGF module with the RFB and FAM, abbreviated as w SIGF-RFB and w SIGF-FAM, respectively. Compared to the RFB variant, our method improved M and F_β by 2.1% and 0.6% on VT5000, respectively, and the p-value of our method was less than 0.05 for F_β and E_ξ on VT5000, so the test was significant. Compared to the FAM variant, our method improved M and F_β by 2.1% and 0.6% on VT5000, respectively. These results indicate that the VAB slightly outperforms the RFB and FAM in capturing multiscale context information. We also show visual comparisons of the ablation experiments in Figure 8. In the first row, the variants are disturbed by the tire; in the second row, the other variants are unable to detect small objects. With the SIGF module, our method effectively addresses these challenges. In summary, the SIGF module can effectively suppress interfering information and capture multiscale information.

Effectiveness of Hybrid Loss and Auxiliary Decoder
To demonstrate the effectiveness of the hybrid loss and auxiliary decoder, we conducted two ablation experiments.The results are presented in Table 6.First, we removed the IoU loss, which is abbreviated as w/o IoU, i.e., only the BCE loss was employed in training our model.Compared to this variant, our method improved the M and F β by 3.0% and 1.4% on VT5000, respectively.Compared to the variant w/o IoU, the p-value of our method was less than 0.01 for the F β and E ξ on VT5000, so the test was highly significant.This demonstrates that the IoU loss is conducive to boosting the performance from the perspective of integral consistency.As shown in Figure 9b, the variant w/o IoU is susceptible to background noise.To demonstrate of the effectiveness of summing three single-channel saliency features, we employed three learnable parameters to weight them and, then, summed the weighted features, abbreviated as w LPW.Compared to this variant, our method improved the M and F β by 4.2% and 1.8% on VT5000, respectively.Compared to the variant w LPW, the p-value of our method was less than 0.01 for M, F β , and E ξ on VT5000, so the test was highly significant.However, our method failed to perform in the S α , i.e., the learnable parameters can improve the S α , but it did not perform as well as our method on the other metrics.Besides, we also conducted an experiment on the summation of three saliency maps, abbreviated as S F + S R + S T .The results were even worse than those only employing S F .Compared to this variant, our method improved the M and F β by 20.1% and 10.6% on VT5000, respectively.Compared to the variant S F + S R + S T , the p-value of our method was less than 0.01 for four metrics on VT5000, so the test was highly significant.This suggests that summing the three saliency maps together can have a detrimental effect.In Table 6, we also report the evaluation results of the three saliency maps, abbreviated as S F , S R , and S T , respectively.Note that we wished 
to evaluate the contribution of the three saliency maps (S F, S R, and S T) in the same setup as our full method; therefore, the network parameters remained unchanged. The primary decoder saliency map S F was much better than the two auxiliary decoder saliency maps S R and S T. Compared to S F, our method improved the M and F β by 1.8% and 0.8% on VT5000, respectively. This suggests that summing the three single-channel saliency features can also provide beneficial information for S F. Unfortunately, however, this strategy had an adverse effect on the S α, reducing the S α by 0.6% on VT5000. We also conducted experiments employing only one modality as the input, abbreviated as RGB and T. That is, the two auxiliary decoders were removed, the CMIMR module was removed, and no two-modal feature fusion was required since only one modality was used as the input. We input the RGB image and the thermal image into the modified network separately. Then, the SIGF module was employed to decode the two-level features from top to bottom. Employing only the RGB image as the input was better than employing only the T image, but our method greatly improved on both. Compared to the variant RGB, our method improved the M and F β by 23.4% and 4.4% on VT5000, respectively. Compared to the variant RGB, the p-value of our method was less than 0.01 for all four metrics on VT5000, so the test was highly significant.
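The IoU term removed in the w/o IoU ablation above targets whole-object overlap rather than per-pixel agreement, which is why it helps integral consistency. Below is a minimal NumPy sketch of the standard soft-IoU and BCE formulation; the function names are ours, and this is an illustration of the common definitions, not the paper's exact implementation.

```python
import numpy as np

def bce_loss(pred, gt, eps=1e-7):
    # Binary cross-entropy averaged over all pixels;
    # pred and gt are assumed to lie in [0, 1].
    pred = np.clip(pred, eps, 1 - eps)
    return float(-(gt * np.log(pred) + (1 - gt) * np.log(1 - pred)).mean())

def iou_loss(pred, gt, eps=1e-7):
    # Soft IoU loss: 1 minus the soft intersection-over-union,
    # penalizing the prediction at the whole-object level.
    inter = (pred * gt).sum()
    union = (pred + gt - pred * gt).sum()
    return float(1.0 - (inter + eps) / (union + eps))

def hybrid_loss(pred, gt):
    # Hybrid loss: per-pixel BCE plus object-level IoU.
    return bce_loss(pred, gt) + iou_loss(pred, gt)
```

A perfect prediction drives both terms toward zero, while a prediction that misses the object entirely keeps the IoU term near its maximum of 1 even when most background pixels are correct.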
Besides, to demonstrate the necessity of the two auxiliary decoders, we removed them, abbreviated as w/o AD, i.e., only the primary decoder was retained in our modified model. Compared to this variant, our method improved the M and F β by 10.8% and 2.0% on VT5000, respectively. Compared to the variant w/o AD, the p-value of our method was less than 0.01 for all four metrics on VT5000, so the test was highly significant. This demonstrates that the two auxiliary decoders can guide the two-modal encoders to extract modality-specific information and supplement valuable information at the single-channel saliency feature level. Unfortunately, the auxiliary decoders did not perform well in all cases, but considering that they boosted most metrics, their failure case on the S α is acceptable. Note that since the network structure was modified in these three cases (w/o AD, RGB, and T), we needed to retrain the network to obtain the saliency maps, which is a different experimental setup from the ablation experiments S F, S R, and S T. As shown in Figure 9c, the variant w/o AD failed to guide the two encoders to extract beneficial information. On the contrary, our entire model performed well in these cases.
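The three strategies for combining the primary and auxiliary decoder outputs compared in these ablations can be contrasted schematically. The sketch below assumes a "saliency feature" is a pre-sigmoid single-channel logit map and a "saliency map" is its sigmoid output; the arrays and the fixed weights standing in for the learnable parameters are purely illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
# Hypothetical single-channel saliency features from the primary
# decoder (F) and the two auxiliary decoders (R, T).
f_f, f_r, f_t = rng.normal(size=(3, 64, 64))

# Ours: sum the three saliency *features*, then squash once.
s_ours = sigmoid(f_f + f_r + f_t)

# 'w LPW' variant: scalar weights on each feature before the sum
# (learnable during training; fixed here for the sketch).
w = np.array([0.6, 0.2, 0.2])
s_lpw = sigmoid(w[0] * f_f + w[1] * f_r + w[2] * f_t)

# 'S_F + S_R + S_T' variant: sum the *maps* after the sigmoid.
# The result is no longer bounded by 1, one plausible reason this
# variant scores worst in the ablation.
s_maps = sigmoid(f_f) + sigmoid(f_r) + sigmoid(f_t)
```

Summing before the squashing function keeps the final map a proper probability-like output, whereas summing the already-squashed maps does not.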

Scalability on RGB-D Datasets
To demonstrate the scalability of our method, we retrained it on the RGB-D datasets. Following the settings in [66], we employed 1485 images from NJU2K [67] and 700 images from NLPR [68] as the training datasets. The remaining images of NJU2K and NLPR, together with all images of SIP [66] and STERE1000 [69], were used as the testing datasets. Note that when testing on DUT [70], an extra 800 images from DUT were also added to the training datasets, namely a total of 2985 images for training on DUT.
To demonstrate the effectiveness of our method, we compared it with 10 SOTA methods: S2MA [30], AFNet [71], ICNet [31], PSNet [72], DANet [73], DCMF [35], MoADNet [37], CFIDNet [34], HINet [33], and LSNet [48]. As shown in Table 8, our method improved the M and E ξ by 3.2% and 0.5% on the NJU2K dataset. Besides, our method improved the M and F β by 0.8% and 0.9% on the NLPR dataset. This demonstrates that our method has a preferable generalization ability on the RGB-D datasets. To demonstrate that the performance improvement of our method was significant, the t-test was performed in Table 9. We retrained our method and obtained six sets of experiment results. As shown in Table 9, compared to the other methods, the p-values of M, F β, and E ξ on NJU2K were less than 0.01; therefore, the t-test was highly significant. The p-values of M and F β on NLPR were less than 0.01; therefore, the test was highly significant.
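The significance tests reported here (six retrained runs compared against a baseline score, left-sided for M where smaller is better and right-sided for the other metrics) can be sketched with a one-sample t-test. The numbers below are illustrative only, not results from the paper, and the sketch assumes SciPy's `ttest_1samp` with its `alternative` parameter (available in SciPy 1.6+).

```python
from scipy.stats import ttest_1samp

def significance(our_runs, baseline, smaller_is_better=False):
    """One-sample, one-sided t-test of repeated runs against a baseline score.

    Left-sided for metrics like M (smaller is better),
    right-sided for F_beta, S_alpha, and E_xi (larger is better).
    """
    side = "less" if smaller_is_better else "greater"
    return ttest_1samp(our_runs, popmean=baseline, alternative=side).pvalue

# Illustrative numbers only (not the paper's actual results).
m_runs = [0.031, 0.032, 0.030, 0.031, 0.033, 0.030]  # MAE: smaller is better
p_m = significance(m_runs, baseline=0.040, smaller_is_better=True)
```

A p-value below 0.01, as reported in Tables 3, 7, and 9, indicates that the improvement over the baseline is highly significant rather than run-to-run noise.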

Discussion
This paper further identifies three important issues in RGB-T SOD: two-modal feature fusion, two-level feature fusion, and the saliency information fusion of the three decoder branches. It also provides feasible solutions to these issues, which researchers can use to make further improvements. Our method has three advantages. First, in the two-modal feature fusion, the supplementary information is retained and the interfering information is filtered. Second, in the two-level feature fusion, the guidance of the semantic information helps to suppress noisy information in the low-level features. Third, the auxiliary decoders can guide the two encoders to extract modality-specific information. However, there are limitations to our method. First, the summation of the three single-channel saliency features improves the other metrics, but degrades the S α. Second, while the full CMIMR and SIGF modules bring significant improvements to our method, their subcomponents do not largely improve the metrics. We will further address these limitations in future work. There are several directions for future development in this field. First, boundary information should be taken into account to recover clearer boundaries of salient objects. Second, although existing methods have made great progress, their structures are complex, and simpler, more-effective solutions need to be explored. Finally, the solutions for two-modal feature fusion and two-level feature fusion need further improvement.

Conclusions
In this paper, we propose a lightweight cross-modal information mutual reinforcement network for RGB-T salient object detection. Our proposed method consists of the cross-modal information mutual reinforcement module and the semantic-information-guided fusion module. The former module fuses the complementary information between the two-modal features by enhancing them with the semantic information of the previous-level decoded feature and the inter-modal complementary information. The latter module fuses the two-level features and mines the multiscale information from the deep features by rectifying the low-level feature with the previous-level decoded feature and inserting the VAB to obtain the global contextual information. In summary, our method can effectively fuse the complementary information between the two-modal features and recover the details of salient objects. We conducted extensive experiments on three RGB-T datasets, and the results showed that our method is competitive compared with 15 state-of-the-art methods.

Figure 2.
Figure 2. Overall architecture of our lightweight cross-modal information mutual reinforcement network for RGB-T salient object detection. 'E1∼E5' are the five modules of the encoder. 'TDec' and 'RDec' are the decoder modules of the auxiliary decoders. 'CMIMR' is the cross-modal information mutual reinforcement module. 'SIGF' is the semantic-information-guided fusion module.

Figure 5.
Figure 5. PR curves and F-measure curves of the compared methods on the RGB-T datasets.

Figure 9.
Figure 9. Visual comparisons with ablation experiments on the effectiveness of the IoU loss and auxiliary decoder. (a) Ours. (b) w/o IoU. (c) w/o AD.

Author Contributions: Conceptualization, C.L. and B.W.; methodology, C.L. and B.W.; software, Y.S.; validation, Y.S. and J.Z.; formal analysis, J.Z.; investigation, X.Z.; resources, J.Z.; writing-original draft preparation, C.L.; writing-review and editing, B.W.; visualization, C.L.; supervision, X.Z. and C.Y.; project administration, X.Z. and C.Y.; funding acquisition, C.Y. All authors have read and agreed to the published version of the manuscript.

Funding: This research was funded by the National Natural Science Foundation of China under Grants 62271180, 62171002, 62031009, U21B2024, 62071415, and 62001146; the "Pioneer" and "Leading Goose" R&D Program of Zhejiang Province (2022C01068); the Zhejiang Province Key Research and Development Program of China under Grants 2023C01046 and 2023C01044; the Zhejiang Province Nature Science Foundation of China under Grants LZ22F020003 and LDT23F01014F01; the 111 Project under Grant D17019; and the Fundamental Research Funds for the Provincial Universities of Zhejiang under Grant GK219909299001-407.

Institutional Review Board Statement: Not applicable.

Table 1.
Quantitative comparisons with the heavy-model-based methods on the RGB-T datasets.

Table 2.
Quantitative comparisons with the lightweight methods on the RGB-T datasets. The best result is marked in red in each column. ↑ and ↓ mean a larger value is better and a smaller value is better, respectively.

Table 3.
The t-test of our method with the compared methods on the RGB-T datasets. For the evaluation metric M, the left-sided test was performed. For the other three metrics F β, S α, and E ξ, the right-sided test was performed. The p-value is reported in this table. ↑ and ↓ mean a larger value is better and a smaller value is better, respectively.

Table 4.
Six sets of experiment results of our method on the RGB-T datasets. ↑ and ↓ mean a larger value is better and a smaller value is better, respectively.

Table 7.
The t-test of our method with the ablation experiments on the RGB-T datasets. For the evaluation metric M, the left-sided test was performed. For the other three metrics F β, S α, and E ξ, the right-sided test was performed. The p-value is reported in this table. ↑ and ↓ mean a larger value is better and a smaller value is better, respectively.

Table 8.
Quantitative comparisons with 10 methods on the RGB-D datasets. The top three results are marked in red, green, and blue in each row, respectively. ↑ and ↓ mean a larger value is better and a smaller value is better, respectively.