CAE-Net: Cross-Modal Attention Enhancement Network for RGB-T Salient Object Detection

Abstract: RGB salient object detection (SOD) performs poorly in low-contrast and complex-background scenes. Fortunately, the thermal infrared image captures the heat distribution of a scene as complementary information to the RGB image, so RGB-T SOD has recently attracted increasing attention. Many researchers have committed to accelerating the development of RGB-T SOD, but some problems remain unsolved. For example, defective samples and interfering information contained in the RGB or thermal image hinder the model from learning proper saliency features; meanwhile, low-level features with noisy information result in incomplete salient objects or false-positive detections. To solve these problems, we design a cross-modal attention enhancement network (CAE-Net). First, we design a cross-modal fusion (CMF) module to fuse cross-modal features, where a cross-attention unit (CAU) is employed to enhance the two modal features, and channel attention is used to dynamically weigh and fuse them. Then, we design the joint-modality decoder (JMD) to fuse cross-level features, where the low-level features are purified by higher-level features, and multi-scale features are sufficiently integrated. Besides, we add two single-modality decoder (SMD) branches to preserve more modality-specific information. Finally, we employ a multi-stream fusion (MSF) module to fuse the three decoders' features. Comprehensive experiments are conducted on three RGB-T datasets, and the results show that our CAE-Net is comparable to the other methods.


Introduction
Salient object detection (SOD) attempts to imitate the human attention mechanism, which can discover the most attractive objects in an image at first glance, to segment out the salient objects in the image. SOD can be applied in many downstream computer vision tasks, such as object tracking [1], image quality assessment [2], scene classification [3], image fusion [4], and so on. Due to its superior performance in downstream tasks, SOD has received more and more attention in recent years.
RGB SOD has been studied for many years. In the beginning, researchers proposed many traditional methods, which involve designing handcrafted features to estimate saliency maps. These methods cannot explore the high-level semantic information contained in the image, which leads to unsatisfactory results. Benefiting from their powerful feature representation ability, convolutional neural networks (CNNs) [5] have received more and more attention in computer vision applications. In particular, after fully convolutional networks (FCN) [6] and Unet [7] were proposed for image segmentation tasks, researchers gradually turned to deep learning-based methods for SOD, and many works have been proposed. For example, to take into account the long-range correlation of deep features between different positions, many works [8,9] employed ASPP [10], RFB [11], or PPM [12] modules. By using these modules, the context information of salient objects can be fully exploited. Similarly, Pang [13] employed multi-branch feature interaction to fully explore multi-scale information. Besides, edge features were also explicitly explored by many works to portray sharp boundaries of salient objects [14,15]. Though great progress has been made in recent years, RGB SOD suffers interference from low-contrast or complex-background images, resulting in poor-quality saliency maps. With the development of sensor technology, depth and thermal sensors have become affordable. The depth image describes the spatial arrangement of the scene; by introducing depth information, we can easily distinguish objects at different depths. However, due to the vulnerability of depth sensors to environmental changes, low-quality depth maps exist in RGB-D datasets, resulting in degraded RGB-D SOD performance. Different from depth information, the thermal infrared image depicts the radiated heat of objects in the scene, so it can help us easily distinguish salient objects.
RGB-T SOD faces the problem of multi-modal feature fusion. Previous works have explored cross-modal complementary information. In [16], a multi-interactive block is designed to fuse the previous layer's decoded features with the two modal features, respectively, which are afterwards concatenated to perform cross-modal fusion. In [17], context-guided cross-modality fusion is designed to fuse the two modal features using element-wise multiplication and addition at each level, and the results are then fed into a stacked refinement network for decoding. Nevertheless, direct concatenation or element-wise addition/multiplication cannot fully explore the complementary information between the two modal features. Besides, there are some poor-quality samples among the RGB or thermal infrared images, as shown in Figure 1. If we indiscriminately concatenate or add the two modal features together, the bad-quality samples will mislead the saliency model, resulting in incorrect predictions. Therefore, we need to carefully design a module to appropriately merge the two modal features. In addition, similar to RGB SOD, many works have been committed to exploring the multi-scale information embedded in deep features. For example, in [17], a surrounding and global context unit was proposed to capture context information. Each feature level contains information at a different scale: high-level features contain more semantic and holistic information, while low-level features contain more detailed and local information. Properly aggregating the cross-level features while reducing the impact of noise is worth further investigation. To solve the problems mentioned above, we propose a novel cross-modal attention enhancement network (CAE-Net) for RGB-T salient object detection, which is shown in Figure 2.
Benefiting from three key components (i.e., cross-modal fusion (CMF), single-/joint-modality decoders (SMD/JMD), and multi-stream fusion (MSF)), the CAE-Net can fully exploit cross-modal information and suitably fuse it. Besides, it can adequately aggregate cross-level features in a gradually refined manner. Concretely, to fuse cross-modal features, we design a cross-modal fusion (CMF) module, where the cross-attention unit (CAU) is constructed to enhance one modal feature using the attention from the other modal feature, and then we employ channel attention to adaptively emphasize the significant modal features and restrain the deficient ones. Then, to better fuse the cross-level features, we design the joint-modality decoder (JMD), where high-level features refine low-level features to suppress noisy information, and multi-scale features are sufficiently gathered. Besides, we add two independent single-modality decoder (SMD) branches to preserve more modality-specific information [18] contained in the RGB and thermal images, respectively. Finally, we design the multi-stream fusion (MSF) module to fully explore the complementary information between the different decoder branches. With this elaborate design, the proposed model can better explore complementary information between cross-modal features and appropriately aggregate cross-level features.
Overall, we summarize the main contributions of our paper as follows:
1. We propose a novel RGB-T salient object detection model, called the cross-modal attention enhancement network (CAE-Net), which consists of the cross-modal fusion (CMF) module, the single-/joint-modality decoders (SMD/JMD), and the multi-stream fusion (MSF) module.
2. To fuse the cross-modal features, we design a cross-modal fusion (CMF) module, where the cross-attention unit (CAU) is employed to filter incompatible information, and channel attention is used to emphasize the significant modal features.
3. To fuse cross-level features, we design the joint-modality decoder (JMD), where multi-scale features are extracted and aggregated, and noisy information is filtered. Besides, two independent single-modality decoder (SMD) branches are employed to preserve more modality-specific information.
4. To fully explore the complementary information between different decoder branches, we design a multi-stream fusion (MSF) module.
(Figure 2 caption: the CMF fuses the two modal features into {F F i } (i=3,4,5). The JMD fuses cross-level features into the decoded feature F Fd 3 , while two independent SMD branches preserve more modality-specific information, producing F Rd 3 and F Td 3 , respectively. The MSF fuses the three decoder branches into the final fused feature F Sd 3 , and the final saliency map S is obtained by applying one 1 × 1 convolution on F Sd 3 . The supervision losses of S and the intermediate features are denoted as ls i (i=1,...,4) and are marked with red arrows in the figure.)
We organize the remaining part of this paper as follows. We briefly review related works on salient object detection in Section 2. In Section 3, we describe the proposed model in detail. In Section 4, we present comprehensive experiments and detailed analyses. Finally, this article is concluded in Section 5.

Related Works
In recent years, a large number of works have been proposed for salient object detection.Here, we briefly introduce RGB saliency models, RGB-D saliency models, and RGB-T saliency models.

RGB Salient Object Detection
In the beginning, researchers employed hand-crafted features and a variety of prior knowledge to determine saliency. For instance, the center-surround discrepancy mechanism [19] was employed to distinguish salient objects. Afterward, traditional machine learning models were developed. In [20], multiple types of features, consisting of multi-scale contrast, spatial color distribution, and a center-surround histogram, were combined by learning a conditional random field. In [21], the saliency score is predicted by fusing multi-level regional feature vectors through supervised learning. Convolutional neural networks (CNNs) [5] have been widely used in many applications due to their powerful representation learning ability. In particular, after Unet [7] and fully convolutional networks (FCN) [6] were proposed for image segmentation tasks, CNN-based models came to dominate saliency detection. For example, Wu et al. [8] designed a cascaded partial decoder, where low-level features are refined by initial saliency maps that are predicted by exploiting high-level features. Besides, many researchers have worked to recover the boundary details of saliency maps [15]. In [22], a boundary-aware loss function and a refinement module are used to depict boundaries and purify coarse prediction maps, which effectively makes the boundaries clearer. In [14], fine-detail saliency maps are predicted by integrating salient object features and edge features, which are produced by exploiting global and edge information. Wan et al. [23] designed a deeper feature extraction module, in which a bidirectional feature extraction unit is constructed, to enhance the deep feature representation. Liu et al. [9] employed parallel multi-scale pooling to capture objects of different scales. Xu et al. [24] proposed a center-pooling algorithm, where the receptive field is dynamically modified to take into account the different importance of different regions. In [25], dense attention mechanisms were employed in the decoder to guide the low-level features to concentrate on the defect regions.
Although researchers have made great progress in RGB saliency detection, complex scenes, such as cluttered backgrounds and low contrast, still degrade the performance of RGB saliency models.

RGB-D Salient Object Detection
In recent years, depth information of scenes has become easy to obtain with the development of hardware such as laser scanners and Kinect. With the help of a depth map, the challenge that complex scenes pose to saliency models can be overcome by understanding spatial layout cues, and many researchers have worked to promote the progress of RGB-D SOD. In [26], the final saliency map is produced by employing the center-dark channel map. Recently, many CNN-based models have been proposed. For example, in [27], residual connections are used to fuse the RGB and depth complementary information, and depth features are combined with multi-scale features to single out salient objects. Wang et al. [28] designed two streams to generate saliency maps for the depth and RGB inputs, respectively. Then, a switch map, which is learned by the saliency fusion module, fuses the two saliency maps. In [29], the RGB image is processed by a master network, the depth is fully exploited by a sub-network, and the depth-based features are incorporated into the master network. The two modal high-level features, including the depth features and RGB features, are fused by a selective self-mutual attention module in [30], and the depth decoder features are fused into the RGB branch by introducing a residual fusion module. Multi-level features are fused by densely cooperative fusion (DCF), and collaborative features are learned by joint learning (JL) in [31]. In [32], attention maps were generated from depth cues to intensify salient regions. Besides, in [33], the multi-modal features are fused by employing cross-modality feature modulation, which consists of spatial selection and channel selection. Wen et al. [34] designed a bi-directional gated pooling module to strengthen the multi-scale information, and gated selection to optimize cross-level information. Generally, existing RGB-D saliency models present encouraging performance, but inaccurate depth maps still degrade their results.

RGB-T Salient Object Detection
The thermal infrared image provides the temperature field distribution of a scene, so it plays a positive role when the depth map cannot differentiate salient objects from backgrounds. In the beginning, traditional methods were proposed. In [35], the reliability of each modality was described by a weight, and the weight was integrated into a graph-based manifold ranking method to achieve adaptive fusion of the different source data. Tu et al. [36] segmented RGB and thermal images into multi-scale superpixels. Then, these superpixels were used as graph nodes, and manifold ranking was performed to obtain saliency maps. In [37], superpixels were used as graph nodes, and hierarchical features were then used to learn graph affinity and node saliency. With the development of CNNs, deep learning-based methods became broadly employed. Zhang et al. [38] employed multi-branch group fusion to fuse the cross-modal features and designed joint-attention guided bi-directional message passing to integrate multi-level features. In [39], feature representations were explored and integrated using cross-modal multi-stage fusion. Then, a bi-directional multi-scale decoder was proposed to learn the combination of multi-level fused features. Tu et al. [16] built a dual decoder to conduct interactions of global contexts, two modalities, and multi-level features. Huo et al. [17] established context-guided cross-modality fusion to explore the complementary information of the two modalities, and the features were refined using a stacked refinement network via spatial and semantic information interaction. In [40], multi-level features were extracted and aggregated with attention mechanisms, and an edge loss was used to portray boundaries.
Although much work has been performed on RGB-T SOD, many problems have not been fully explored. The majority of RGB-T SOD models employ concatenation or element-wise addition/multiplication to fuse the cross-modal features, but these fusion methods do not take into account the distinct significance of the two modal features, leading to suboptimal results. Moreover, by employing a vanilla Unet to decode cross-level features, saliency models cannot sufficiently excavate the global context information embedded in deep features and are easily disturbed by the noise in low-level features. To solve these problems, we propose a novel cross-modal attention enhancement network (CAE-Net), where the cross-modal complementary information is fully explored and fused, and the cross-level features are effectively aggregated.

The Proposed Method
In this section, the architecture of our proposed cross-modal attention enhancement network (CAE-Net) is introduced in Section 3.1. The cross-modal fusion (CMF) and single-/joint-modality decoder (SMD/JMD) are described in Sections 3.2 and 3.3, respectively. We present the multi-stream fusion (MSF) in Section 3.4. The loss functions are illustrated in Section 3.5.

Architecture Overview
The architecture of the proposed cross-modal attention enhancement network (CAE-Net) is shown in Figure 2. Firstly, we use a two-stream encoder to extract the multi-level features of the RGB image I R and the thermal infrared image I T , respectively. Here, we use VGG16 [41] as the backbone of the encoder, where we remove its last pooling layer and three fully connected layers. After deploying the encoder, we obtain the multi-level features {F R i } and {F T i } (i=1,...,5) for the two modal inputs, whose resolutions are 1, 1/2, 1/4, 1/8, and 1/16 of the original input image, respectively. Then, we design a cross-modal fusion (CMF) module, which consists of a cross-attention unit (CAU) and channel-attention weighted fusion, to adequately explore the cross-modal complementary information, obtaining the fused features {F F i } (i=3,4,5). After that, we design the joint-modality decoder (JMD) to fuse the cross-level features, obtaining the decoded feature F Fd 3 . The JMD can effectively extract multi-scale information and filter the noisy information in the low-level features. Furthermore, we add two independent single-modality decoder (SMD) branches to preserve more modality-specific information, obtaining the decoded features F Rd 3 and F Td 3 , respectively. Finally, we design a multi-stream fusion (MSF) module to fully fuse the complementary information between the different decoder branches, obtaining the final fused feature F Sd 3 . Then, one 1 × 1 convolution followed by a sigmoid function is applied on F Sd 3 to generate the final saliency map S.

Cross-Modal Fusion
Digging out the complementary information between two modal features is a major problem in RGB-T SOD. Here, we design the cross-modal fusion (CMF) module shown in Figure 3 to tackle this problem. The majority of existing methods simply concatenate or element-wise add the two modal features together. However, these methods cannot avoid the performance degradation caused by misleading information in the two modal inputs (i.e., low-quality input images and noisy information). Hence, we employ an attention mechanism to suppress the noisy information contained in the two modal features. Different from the frequently used self-attention, we design the cross-attention unit (CAU-R/CAU-T) shown in Figure 3 to filter one modal feature using the attention generated from the other modal feature, which helps enhance the shared features of the two modalities. Concretely, taking CAU-R as an example, we separately feed the thermal features F T i into a channel attention [42] and a spatial attention [43] module to produce the channel and spatial attention values of F T i , respectively. Then, we sequentially multiply the RGB features F R i with these two attention values. To avoid the RGB features being diluted by bad-quality thermal samples, we introduce a residual connection for F R i . In this way, we obtain the cross-attention enhanced RGB features F Re i . Similar to CAU-R, we also deploy a CAU-T to enhance the thermal features F T i . The whole process is formulated as follows, where CA and SA are channel attention and spatial attention, respectively, GMP s is global max pooling along the spatial dimension, GMP c is global max pooling along the channel dimension, Relu is the nonlinear activation function, MLP is a fully connected layer, σ is the sigmoid activation function, and Conv 7×7 is a convolution layer with a 7 × 7 kernel. More details of channel attention and spatial attention can be found in [42,43].
⊗ is element-wise multiplication and ⊕ is element-wise addition. F Re i and F Te i are the enhanced RGB and thermal features, respectively.
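To make the CAU concrete, below is a minimal PyTorch sketch of CAU-R, assuming the GMP-based channel and spatial attention described above. All names (`ChannelAttention`, `SpatialAttention`, `CAU_R`) and the reduction ratio are our own illustrative choices, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    # Channel attention in the spirit of [42]: GMP_s over H,W, then MLP + sigmoid.
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = torch.amax(x, dim=(2, 3))           # GMP_s: B x C
        w = torch.sigmoid(self.mlp(w))          # channel weights in (0, 1)
        return w.view(b, c, 1, 1)

class SpatialAttention(nn.Module):
    # Spatial attention in the spirit of [43]: GMP_c over channels, then 7x7 conv.
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(1, 1, kernel_size=7, padding=3)

    def forward(self, x):
        w = torch.amax(x, dim=1, keepdim=True)  # GMP_c: B x 1 x H x W
        return torch.sigmoid(self.conv(w))

class CAU_R(nn.Module):
    # Cross-attention unit: thermal attention enhances the RGB feature; the
    # residual keeps original RGB content if the thermal sample is poor.
    def __init__(self, channels):
        super().__init__()
        self.ca = ChannelAttention(channels)
        self.sa = SpatialAttention()

    def forward(self, f_r, f_t):
        out = f_r * self.ca(f_t)   # weight RGB by thermal channel attention
        out = out * self.sa(f_t)   # then by thermal spatial attention
        return out + f_r           # residual connection for F_R^i
```

CAU-T is symmetric: swap the roles of `f_r` and `f_t`.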
After refining the two modal features, we attempt to fuse them appropriately. Existing methods indiscriminately fuse two modal features using concatenation or element-wise addition, but they do not take into account the different importance of the two modal features; when encountering a bad-quality sample, this leads to a failed saliency prediction. With the help of channel attention, we can explicitly estimate the dynamic importance of the RGB feature F Re i and the thermal feature F Te i . Concretely, we concatenate these two features along the channel dimension, and then we feed them into the channel attention module to obtain a channel-wise importance weight indicating which modal feature is more valuable. After that, we multiply this weight with the concatenated features, and then we employ a 1 × 1 convolution to reduce the channel number of the concatenated features. The above calculation process is expressed as follows, where cat means the concatenation operation, and Conv 1×1 means a 1 × 1 convolution and a BN layer [44]. F F i means the fused features of the two modalities at the i-th level.
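The channel-attention weighted fusion step can be sketched as follows; this is a hypothetical rendering in PyTorch, where the class name, the reduction ratio, and the GMP-based channel attention are our assumptions.

```python
import torch
import torch.nn as nn

class CMFFusion(nn.Module):
    # Fuse the two enhanced modal features with a channel-wise importance
    # weight, then reduce channels with a 1x1 conv + BN [44].
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * channels, 2 * channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(2 * channels // reduction, 2 * channels),
        )
        self.reduce = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=1),
            nn.BatchNorm2d(channels),
        )

    def forward(self, f_re, f_te):
        x = torch.cat([f_re, f_te], dim=1)                  # B x 2C x H x W
        b, c, _, _ = x.shape
        w = torch.sigmoid(self.mlp(torch.amax(x, dim=(2, 3))))
        x = x * w.view(b, c, 1, 1)                          # reweight channels
        return self.reduce(x)                               # fused feature F_F^i
```

Channels belonging to the less reliable modality receive small weights, so a defective sample contributes less to the fused feature.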

Single-/Joint-Modality Decoder
The Unet [7] has been widely used in SOD research. However, considering that low-level features contain a lot of noisy information, directly concatenating low-level encoder features with decoder features is not an optimal method. Under the guidance of high-level features, we can filter the noisy information contained in low-level features. Furthermore, multi-scale modules (PPM [12], ASPP [10], and RFB [11]) have been proven powerful for context information extraction. Different from [8], we use the RFB in the feature decoding phase. This is because, after concatenating the encoder feature with the previous layer's decoder feature, the RFB can learn a more accurate and robust feature representation. In addition, considering that a single joint-modality decoder (JMD) may be biased toward one of the two modal features, we also add two single-modality decoder (SMD) branches to preserve more modality-specific information; namely, the SMD helps each modal encoder extract effective and specific information. Concretely, taking the SMD shown in Figure 4 as an example, we first feed the fifth-level feature F R 5 into the RFB [11] to capture global context information, obtaining the decoded feature F Rd 5 . Then, we multiply the fourth-level encoder feature F R 4 with F Rd 5 to filter the noisy information in the low-level feature. Next, we concatenate the filtered feature with F Rd 5 and feed the result into the RFB to obtain F Rd 4 , which is enriched with multi-scale information. The third-level decoding is similar to the above process. However, it should be noted that, in the third-level feature decoding process, we also add one skip connection from F Rd 5 to avoid the high-level feature being diluted. The above calculation processes are formulated as follows, where RFB means the RFB module and F Rd i means the i-th level decoded features. Conv 3×3 denotes a 3 × 3 convolution followed by a BN layer. UP ×2 and UP ×4 mean 2× and 4× bilinear interpolation upsampling, respectively. Our JMD is similar to the SMD, but we replace the RFB operation in the SMD with the context module (CM) shown in Figure 4, where we employ two parallel branches with RFB and Nonlocal [45] operations to further enhance the global context information. Notably, before feeding the feature into the Nonlocal module, we employ a 1 × 1 convolution to compress the feature channels to 64 to reduce the computation cost of the Nonlocal operation. The above calculation processes are formulated as follows, where CM is the context module shown in Figure 4; CM itself can be formulated as follows, where Nonlocal means the Nonlocal operation.
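The SMD decoding steps above can be sketched as follows. This is an illustrative skeleton, not the authors' code: `multi_scale` is a placeholder factory for the RFB module [11] (any context block with matching channels would fit), and uniform channel counts across levels are assumed.

```python
import torch
import torch.nn as nn

def conv3x3(cin, cout):
    # 3x3 convolution followed by a BN layer, as in the decoder equations
    return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1), nn.BatchNorm2d(cout))

class SMD(nn.Module):
    # Single-modality decoder sketch over encoder levels 3-5.
    def __init__(self, channels, multi_scale):
        super().__init__()
        self.ms5 = multi_scale(channels)     # stand-in for RFB at level 5
        self.ms4 = multi_scale(channels)
        self.ms3 = multi_scale(channels)
        self.fuse4 = conv3x3(2 * channels, channels)
        self.fuse3 = conv3x3(3 * channels, channels)
        self.up2 = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)
        self.up4 = nn.Upsample(scale_factor=4, mode="bilinear", align_corners=False)

    def forward(self, f3, f4, f5):
        d5 = self.ms5(f5)                    # global context at level 5
        g4 = f4 * self.up2(d5)               # purify level-4 encoder feature
        d4 = self.ms4(self.fuse4(torch.cat([g4, self.up2(d5)], dim=1)))
        g3 = f3 * self.up2(d4)               # purify level-3 encoder feature
        # extra skip from d5 so high-level cues are not diluted at level 3
        d3 = self.ms3(self.fuse3(torch.cat([g3, self.up2(d4), self.up4(d5)], dim=1)))
        return d3
```

For the JMD, `multi_scale` would be replaced by the context module (parallel RFB and Nonlocal branches).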

Multi-Stream Fusion (MSF)
If we only use the joint-modality decoder output F Fd 3 as the final saliency result, we may lose some distinctive information contained in the RGB or thermal modality. Based on this observation, we aggregate the three branches of decoded features again, as shown in Figure 2. We first concatenate the three decoded features F Rd 3 , F Td 3 , and F Fd 3 . Then, we upsample the resulting features by a factor of two and employ a 3 × 3 convolution to enhance the upsampled features, and we repeat this operation once more, obtaining the final saliency features F Sd 3 . The above calculation processes are formulated as follows, where Conv 3×3 means a 3 × 3 convolution layer and a BN layer. Finally, we apply a 1 × 1 convolution to F Sd 3 , followed by a sigmoid function, obtaining the final saliency map S. This process is formulated as follows, where σ is the sigmoid activation function. Furthermore, we employ deep supervision [46] in our model, as shown in Figure 2, where F Rd 3 , F Td 3 , and F Fd 3 are also fed into a 1 × 1 convolutional layer followed by the sigmoid activation function to predict saliency results, respectively. Their losses, which are marked as {ls i } i=2,3,4 , are calculated between the saliency results and the GT.
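The MSF pipeline described above can be sketched in a few lines; the class name and channel width are illustrative assumptions. With the level-3 features at 1/4 of the input resolution, two 2× upsampling steps restore the full resolution.

```python
import torch
import torch.nn as nn

class MSF(nn.Module):
    # Multi-stream fusion sketch: concatenate the three decoded features,
    # apply (2x upsample -> 3x3 conv + BN) twice to get F_Sd3, then a
    # 1x1 conv + sigmoid to predict the final saliency map S.
    def __init__(self, channels):
        super().__init__()
        self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)
        self.conv1 = nn.Sequential(nn.Conv2d(3 * channels, channels, 3, padding=1),
                                   nn.BatchNorm2d(channels))
        self.conv2 = nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1),
                                   nn.BatchNorm2d(channels))
        self.head = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, f_rd3, f_td3, f_fd3):
        x = torch.cat([f_rd3, f_td3, f_fd3], dim=1)  # fuse the three streams
        x = self.conv1(self.up(x))                   # first 2x upsample + conv
        x = self.conv2(self.up(x))                   # second 2x -> F_Sd3
        return torch.sigmoid(self.head(x))           # final saliency map S
```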

Loss Functions
We adopt the hybrid loss [22] to supervise our CAE-Net, where ℓ bce , ℓ ssim , and ℓ iou are the binary cross-entropy loss [47], SSIM loss [48], and IoU loss [49], respectively. G and S denote the ground truth and the saliency map, respectively. N indicates the total number of pixels in the image, and i means the i-th pixel. For the SSIM loss, the image is cropped into m patches, and µ x , µ y , σ x , and σ y are the means and standard deviations of the GT and prediction patches, respectively; σ xy is their covariance. C 1 and C 2 are set to 0.01² and 0.03² by default. Finally, the total loss ls total of the proposed CAE-Net is defined as follows, where the ls i are shown in Figure 2 and calculated using Equation (9).
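The loss equations themselves appear to have been lost in extraction. For reference, the standard forms from [22,47–49], written with the symbols defined above, are:

```latex
\ell = \ell_{bce} + \ell_{ssim} + \ell_{iou}

\ell_{bce} = -\frac{1}{N}\sum_{i=1}^{N}\bigl[G_i \log S_i + (1-G_i)\log(1-S_i)\bigr]

\ell_{ssim} = 1 - \frac{1}{m}\sum_{p=1}^{m}
  \frac{(2\mu_x\mu_y + C_1)(2\sigma_{xy} + C_2)}
       {(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)}

\ell_{iou} = 1 - \frac{\sum_{i=1}^{N} S_i G_i}{\sum_{i=1}^{N} (S_i + G_i - S_i G_i)}
```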

Experiments
In this section, the datasets and implementation details are presented in Section 4.1. The evaluation metrics are described in Section 4.2. In Section 4.3, our model is quantitatively and qualitatively compared with 18 state-of-the-art models. The ablation studies are shown in Section 4.4. Finally, we analyze the scalability of our model on RGB-D datasets in Section 4.5.

Datasets and Implementation Details
To evaluate the performance of the proposed CAE-Net, we employ three widely used RGB-T datasets: VT821 [35], VT1000 [37], and VT5000 [40], which contain 821, 1000, and 5000 RGB-T image pairs, respectively. For a fair comparison, we follow the setting in [16], where 2500 samples from VT5000 are chosen as the training set, and the remaining samples are used for testing. To avoid overfitting, we augment the training data using random flipping.
We implement our model using the PyTorch toolbox [50] on a PC equipped with one RTX 2080Ti GPU. We resize the input images to 224 × 224 before training. The encoders of the RGB and thermal branches are initialized with pretrained VGG16 [41]. We train our model with the Adam optimizer, where the initial learning rate is set to 1 × 10⁻⁴, the batch size is 14, and the total number of training epochs is 250. We decrease the learning rate to 1 × 10⁻⁵ after 200 epochs.

MAE
The mean absolute error (MAE) is expressed as follows, where G(i, j) and S(i, j) denote the ground truth and the predicted saliency map, respectively.
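The MAE equation seems to be missing from the extracted text; its standard definition, consistent with the notation above (with W and H the image width and height), is:

```latex
\mathrm{MAE} = \frac{1}{W \times H}\sum_{i=1}^{W}\sum_{j=1}^{H}\bigl|S(i,j) - G(i,j)\bigr|
```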

F β
The F-measure (F β ) is the weighted harmonic mean of precision and recall, formulated as follows, where β² is set to 0.3 following [51].
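The formula itself appears to have been dropped during extraction; the standard F-measure definition is:

```latex
F_\beta = \frac{(1+\beta^2)\,\mathrm{Precision} \times \mathrm{Recall}}
               {\beta^2\,\mathrm{Precision} + \mathrm{Recall}}
```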

E ξ
The E-measure (E ξ ) evaluates the global and local similarities between the ground truth and the predicted saliency map. Concretely, it is formulated as follows, where φ indicates the enhanced alignment matrix.
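The equation appears to be missing from the extracted text; the standard E-measure definition, averaging the enhanced alignment matrix φ over all pixels, is:

```latex
E_\xi = \frac{1}{W \times H}\sum_{i=1}^{W}\sum_{j=1}^{H}\phi(i,j)
```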

S α
The structure-measure (S α ) is employed to evaluate the structural similarity between salient objects in the ground truth and the predicted saliency map, where S r and S o mean region-aware and object-aware structural similarity, respectively, and α is set to 0.5, referring to [48].
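The combining equation seems to have been lost in extraction; the standard structure-measure definition, consistent with the terms above, is:

```latex
S_\alpha = \alpha \cdot S_o + (1-\alpha) \cdot S_r
```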

Quantitative Comparison
We present the PR curves and F-measure curves in Figure 5. For the PR curves, our model is closest to the upper-right corner compared with the other models, except on VT821, where it is slightly inferior to CSRNet. For the F-measure curves, our model outperforms the other models on VT5000 and VT1000, occupying the top position in the figure on these two datasets, and it is comparable to CSRNet on VT821. In addition, the quantitative comparison results, including MAE, F β , E ξ , and S α , are presented in Table 1, where the adaptive F-measure and adaptive E-measure are reported. As can be seen from Table 1, our model outperforms most models on the three datasets; the exception is VT821, where our model ranks second in terms of F β and S α . To be specific, the traditional RGB-T methods M3S-NIR, MTMR, and SGDL perform poorly, which demonstrates the powerful representation learning ability of CNNs. Besides, our model surpasses the best RGB method CPD and the RGB-D method PDNet by a large margin. This result indicates that our carefully designed model is effective. Compared to the competitive RGB-T model CSRNet, our model improves the MAE, F β , E ξ , and S α by 10.0%, 1.7%, 1.2%, and 1.4% on VT5000, respectively. Table 1. Quantitative comparisons with 18 models on three RGB-T datasets. The top three results are marked with red, green, and blue in each column. ↑ and ↓ denote that larger and smaller values are better, respectively. * denotes a traditional method; the others are deep learning methods.

Complexity Analysis
In Table 2, we report the number of parameters and floating-point operations (FLOPs) of the compared models. We also visualize the accuracy against FLOPs in Figure 6, where the area of each circle denotes the relative size of the parameter quantity. A model located toward the top-left achieves a better trade-off between accuracy and model complexity. We can see that the lightweight model CSRNet has the fewest parameters and FLOPs while ranking second in terms of the F β score. Our model has a moderate number of parameters (38.8 M) and fewer FLOPs (47.1 G) while ranking first in terms of the F β score. From Figure 6, we can see that our model is located at the topmost and second-leftmost position, showing that it achieves a better trade-off between accuracy and model complexity.

Qualitative Comparison
We show the qualitative results in Figure 7, which displays some representative samples containing bad-quality thermal images and small objects (the 1st row), bad-quality RGB images (the 8th row), low-contrast RGB images (the 5th row), multiple objects (the 6th and 8th rows), and a vimineous object (the 10th row). Concretely, in Figure 7 (first and eighth rows), even though a bad-quality thermal or RGB image exists, our method can highlight the salient objects without being disturbed by the bad-quality sample. In the fifth row, our model can detect the bulb with the help of the thermal image, while other models are misled by the low-contrast RGB image. In the sixth and eighth rows, our model can detect both salient objects, but other models either detect only one object or detect objects with blurry boundaries. Especially in the first and sixth rows, the salient objects are small, but our model can still detect them. In the 10th row, the vimineous stick is integrally detected by our model. Generally, compared with other models, our model can detect small objects with less noise and can adaptively mitigate the distraction from low-quality samples. (Figure 7 caption fragment, compared methods: [17], (f) MIDD [16], (g) MMNet [39], (h) CPD [8], (i) PDNet [29], (j) ADF [40], (k) JLDCF [31], (l) AFNet [28], (m) EGNet [14], (n) S2MA [30], (o) BASNet [22], (p) FMCF [38], (q) R3Net [53], (r) PoolNet [9], (s) SGDL [37], (t) MTMR [35], (u) M3S-NIR [36], (v) DMRA [27].)

Ablation Studies
To demonstrate the effectiveness of each component in the proposed CAE-Net, we conduct several ablation experiments, covering the CMF module, the SMD/JMD decoders, the MSF module, the backbone, and the loss functions. We provide the quantitative results in Table 3 and the visualization results in Figures 8 and 9.
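The percentage gains quoted in the following subsections are relative improvements over each ablated variant; a short sketch of the MAE metric and the (assumed) relative-change computation, using hypothetical numbers rather than values from the paper:

```python
# Sketch of MAE and relative improvement. For MAE, lower is better, so the
# improvement is the reduction relative to the variant's score. The example
# values below are illustrative, not taken from the paper's tables.

def mae(pred, gt):
    """Mean absolute error between a flattened saliency map and ground truth."""
    return sum(abs(p - g) for p, g in zip(pred, gt)) / len(pred)

def relative_gain(variant, ours, lower_is_better=True):
    """Percentage improvement of 'ours' over 'variant'."""
    if lower_is_better:
        return (variant - ours) / variant * 100.0
    return (ours - variant) / variant * 100.0

# e.g. variant MAE 0.050 vs. our MAE 0.044 -> a 12.0% relative improvement
gain = relative_gain(0.05, 0.044)
```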
Table 3. Ablation studies are implemented on three datasets, where the best result in each column is marked in red. Here, "↓" means that the smaller the better.

Effectiveness of Cross-Modal Fusion (CMF)
To verify the effectiveness of feature fusion at the intermediate layers, we conduct a comparative experiment by concatenating the two modal inputs at the input stage, abbreviated as "CI" in Table 3 (No.1). Concretely, we directly concatenate the two modal input images I R and I T along the channel dimension and then feed the result into a single-branch saliency prediction network (i.e., the bottom stream in Figure 2). From Table 3, we can see that our model improves the MAE by 12.7% on VT5000, which demonstrates the effectiveness of fusing features at the intermediate level. The visual results shown in Figure 8e support the same conclusion. This is because the early fusion scheme (i.e., concatenating the two inputs) fails to fully explore the deep complementary cues between the two modalities. Next, we verify the effectiveness of the CMF module by removing it, shown in Table 3 (No.1 w/o CMF). Namely, we replace the CMF module by simply concatenating the two modal features F R i and F T i along the channel dimension, followed by a 3 × 3 convolution layer to produce the fused features F F i ; the other parts are kept the same as in our full model. Compared to this variant, our model improves the MAE, F β , E ξ , and S α by 3.8%, 1.6%, 1.0%, and 0.8% on VT5000, respectively. As can be seen from Figure 8f, the model "w/o CMF" cannot suppress the background noise. This proves that the design of the CMF is beneficial, since the CMF can suppress the noisy information in the two modal features with the help of an attention module. To verify the effectiveness of cross attention in the CMF, we replace it with self-attention, abbreviated as "Self" in Table 3 (No.1). That is, in CAU-R, we employ the CA and SA of the RGB feature F R i to enhance itself, rather than the CA and SA of the thermal feature F T i , and, in CAU-T, the thermal feature F T i likewise employs attention from itself. Compared to this variant, our model improves the MAE, F β , E ξ , and S α by 2.5%, 0.9%, 0.7%, and 0.4% on VT5000, respectively. From Figure 8g, we can see that the ablation model "Self" is easily affected by background noise. This suggests that the cross attention can highlight the shared information and suppress distracting information in the other modality's features.
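The cross-attention idea can be illustrated with a toy channel-attention example, where the weights applied to the RGB feature are derived from the thermal feature rather than from the RGB feature itself (a simplified sketch; the paper's CAU also involves spatial attention and learned layers, which are omitted here):

```python
import math

# Toy sketch of cross channel attention: the RGB feature is reweighted by
# channel statistics of the thermal feature. Features are represented as
# lists of 2D maps (one per channel). This is our simplification, not the
# exact CAU of the paper.

def gap(feat):
    """Global average pooling: one scalar per channel."""
    return [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in feat]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def cross_channel_attention(f_rgb, f_thermal):
    """Reweight RGB channels with attention computed from the thermal feature,
    emphasizing structure shared by both modalities (cross, not self, attention)."""
    weights = [sigmoid(v) for v in gap(f_thermal)]
    return [[[w * v for v in row] for row in ch]
            for w, ch in zip(weights, f_rgb)]
```

Swapping `f_thermal` for `f_rgb` in the weight computation recovers the "Self" ablation variant discussed above.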

Effectiveness of Single-/Joint-Modality Decoder (SMD/JMD)
To verify the effectiveness of the SMD and JMD, we perform an ablation experiment by removing both of them, shown in Table 3 (No.2 Unet). Concretely, we use three simple Unet [7] structures to fuse the cross-level features of the three branches, respectively, where the cross-level features F X 5 , F X 4 , and F X 3 are concatenated and then fused layer by layer with 3 × 3 convolutions. Compared to this variant, our model improves the MAE, F β , E ξ , and S α by 15.7%, 7.2%, 3.3%, and 3.3% on VT5000, respectively. As can be seen from Figure 8h, the ablation model "Unet" yields poor predictions, because a simple Unet can neither capture long-range context information nor filter out cross-level interfering information. Besides, we remove the two SMDs, retaining only the JMD, abbreviated as "w/o SMD". Specifically, the two single-modality decoders for the RGB and thermal branches are removed, leaving only the joint-modality decoder for the joint branch, so the multi-stream fusion module is also removed, and the saliency maps are predicted from F Fd 3 . Compared to this variant, our model improves the MAE, F β , E ξ , and S α by 2.8%, 1.0%, 0.9%, and 0.3% on VT5000, respectively. As shown in Figure 8i, the ablation model "w/o SMD" is easily affected by the inverted reflection of the cup in the first row. The reason is that the SMDs help the two encoders extract more modality-specific information, so the cross-modal features contain more valuable information to be fused. We further verify the effectiveness of the RFB in the SMD/JMD (Table 3 No.2 w/o RFB). That is, in the SMD and JMD, the RFB module is replaced by a 3 × 3 convolution, while the Nonlocal branch in the CM remains unchanged. Compared to this setting, our model improves the MAE, F β , E ξ , and S α by 6.2%, 2.9%, 2.0%, and 1.1% on VT5000, respectively. We also show the visual comparison in Figure 8j. The reason is that the RFB can effectively capture long-range context information, which is more beneficial for depicting the salient objects. Besides, we verify the effectiveness of the Nonlocal branch in the context module (No.2 w/o Nonlocal). Namely, we remove the Nonlocal branch in the CM and only keep the RFB branch; at this point, the CM module is identical to the RFB. Compared to this setting, our model improves the MAE, F β , E ξ , and S α by 2.8%, 0.9%, 0.6%, and 0.7% on VT5000, respectively. As can be seen from Figure 8k, the ablation model "w/o Nonlocal" is disturbed by the vehicle wheel, which is prominent in the thermal image. This proves that the Nonlocal module is effective in the CM, because it can capture long-range relationships between different pixel positions.
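The non-local operation referred to above can be sketched in one dimension, with the embedding transforms (theta/phi/g) taken as identities for brevity (an assumption, not the paper's exact block):

```python
import math

# Minimal 1D sketch of a non-local operation: each position attends to every
# other position via softmax-normalized pairwise affinities, which is what
# lets the context module capture long-range relationships between pixels.

def nonlocal_1d(x):
    """x: list of scalar feature values, one per spatial position."""
    out = []
    for xi in x:
        scores = [xi * xj for xj in x]            # pairwise affinities
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]  # numerically stable softmax
        z = sum(exps)
        out.append(sum(e / z * xj for e, xj in zip(exps, x)))
    return out
```

Unlike a fixed-kernel convolution (or the dilated branches of an RFB), every output here depends on all input positions, regardless of distance.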

Effectiveness of Multi-Stream Fusion (MSF)
To verify the effectiveness of the MSF, we remove it and retrain the variant under the supervision of ls 2 , ls 3 , and ls 4 . In this ablation model, there are three saliency outputs corresponding to the features F Fd 3 , F Rd 3 , and F Td 3 , so we evaluate their individual contributions. First, we evaluate the contribution of the joint-modality decoder branch (i.e., the middle stream in Figure 2), denoted as "Only-J" in Table 3 (No.3); that is, the saliency map is predicted from F Fd 3 . Compared to this variant, our model improves the MAE, F β , E ξ , and S α by 1.8%, 1.3%, 0.8%, and 0.6% on VT5000, respectively. Second, we evaluate the contribution of the RGB branch (i.e., the bottom stream in Figure 2), with the saliency map predicted from F Rd 3 , marked as "Only-R" in Table 3. Third, we evaluate the contribution of the thermal branch (i.e., the top stream in Figure 2), with the saliency map predicted from F Td 3 , marked as "Only-T" in Table 3. We can see that the RGB branch contributes more than the thermal branch on VT5000, with MAE (↓) 0.0442 vs. 0.0509. However, our model largely outperforms either single branch, which shows that single-modality information is deficient. By fusing the two modal features together (i.e., Only-J), the performance is boosted, but it is still inferior to our full model. Finally, we average the three saliency predictions from F Fd 3 , F Rd 3 , and F Td 3 , labeled as "Both-Avg". It turns out that simply averaging the three predictions does not yield better results, whereas our model with the MSF can further explore the complementary relationships among the three branches by fusing them at the feature level with two 3 × 3 convolution layers. The visual comparisons shown in Figure 8l-o also consistently prove the effectiveness of the MSF module.
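The difference between the "Both-Avg" baseline and feature-level fusion can be illustrated on toy per-pixel scores; here a learned weighted sum stands in for the MSF's two 3 × 3 convolutions (an illustrative assumption, not the actual module):

```python
# "Both-Avg" vs. learned fusion, sketched on flattened per-pixel scores.
# Averaging fixes all branch weights to 1/3; feature-level fusion can learn
# unequal, data-dependent combinations of the three decoder streams.

def average_fusion(preds):
    """The Both-Avg baseline: plain mean of the three saliency predictions."""
    n = len(preds)
    return [sum(p[i] for p in preds) / n for i in range(len(preds[0]))]

def weighted_fusion(preds, weights):
    """Stand-in for learned fusion: a weighted sum with trainable weights."""
    return [sum(w * p[i] for w, p in zip(weights, preds))
            for i in range(len(preds[0]))]
```

When one branch is unreliable (e.g., a degraded thermal input), a learned combination can down-weight it, which a fixed average cannot.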

Effectiveness of Backbone
In Table 3 (No.4), we verify the effectiveness of the encoder backbone. First, we replace VGG16 with ResNet50 [54] as the backbone of the encoders for the two modal inputs I R and I T , abbreviated as "Res50". Compared to this variant, our model improves the MAE, F β , E ξ , and S α by 28.2%, 9.9%, 4.7%, and 5.2% on VT5000, respectively. From Figure 8p, we can see that the model "Res50" predicts inferior saliency results, which indicates that our model is better matched to the VGG16 backbone than to ResNet50. Second, we share the parameters of the two encoders for the RGB and thermal branches, abbreviated as "PS"; that is, Conv1-Conv5 of the RGB branch share the same parameters as Conv1-Conv5 of the thermal branch. Compared to this variant, our model improves the MAE, F β , E ξ , and S α by 3.8%, 1.6%, 1.2%, and 0.7% on VT5000, respectively. The visual results are shown in Figure 8q. These results show that two parameter-independent encoders can learn more diverse feature representations for each modality.

Effectiveness of Loss Functions
In Table 3 (No.5), we verify the effectiveness of the loss functions. First, we use only the BCE loss during training. Compared to this setting, our model improves the MAE, F β , E ξ , and S α by 4.5%, 3.9%, 2.2%, and 0.3% on VT5000, respectively. Second, we combine the BCE loss with the IoU loss, i.e., simultaneously employing the BCE and IoU losses to train our model. Compared to employing only the BCE loss, this variant improves the MAE, F β , E ξ , and S α by 3.0%, 2.7%, 1.2%, and 0.3% on VT5000, respectively. Third, we combine the BCE loss with the SSIM loss, i.e., simultaneously employing the BCE and SSIM losses to train our model. Compared to employing only the BCE loss, this variant improves the MAE, F β , E ξ , and S α by 2.0%, 1.5%, 1.0%, and 0.1% on VT5000, respectively. Compared to BCE+IoU and BCE+SSIM, our model improves the MAE by 1.5% and 2.5%, respectively. As can be seen from Figure 9, our full model shows superiority in all cases. The results show that either the IoU or the SSIM loss can help the model learn more helpful information, and by simultaneously employing the BCE, IoU, and SSIM losses, our model achieves the best results.
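The BCE and IoU terms of the combined loss follow standard definitions; a minimal sketch on flattened prediction/ground-truth maps (the SSIM term is omitted for brevity, and this is not the authors' exact training code):

```python
import math

# Standard BCE (per-pixel) and IoU (region-level) losses, as commonly combined
# in SOD training. p and g are flattened prediction / ground-truth maps with
# values in [0, 1]; eps guards the log and the division.

def bce_loss(p, g, eps=1e-7):
    """Per-pixel binary cross-entropy, averaged over the map."""
    return -sum(gi * math.log(pi + eps) + (1 - gi) * math.log(1 - pi + eps)
                for pi, gi in zip(p, g)) / len(p)

def iou_loss(p, g, eps=1e-7):
    """1 - soft intersection-over-union: penalizes region-level disagreement."""
    inter = sum(pi * gi for pi, gi in zip(p, g))
    union = sum(pi + gi - pi * gi for pi, gi in zip(p, g))
    return 1.0 - inter / (union + eps)

def total_loss(p, g, w_bce=1.0, w_iou=1.0):
    """Weighted combination; the weights here are illustrative defaults."""
    return w_bce * bce_loss(p, g) + w_iou * iou_loss(p, g)
```

BCE treats every pixel independently, while the IoU term scores the predicted region as a whole, which is why combining them tends to sharpen object completeness.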

Scalability Analysis
We also verify the adaptability of our CAE-Net on four RGB-D datasets: NJU2K (1985 image pairs) [55], NLPR (1000 image pairs) [56], STERE (1000 image pairs) [57], and DUT (1200 image pairs) [27]. Following previous work settings [58,59], 1485 images from NJU2K and 700 images from NLPR are used for training when testing on NJU2K, NLPR, and STERE. Additionally, following the widely adopted training strategy in [60,61], an additional 800 image pairs from DUT are used for training when testing on DUT.
We provide the quantitative results of 10 SOTA RGB-D methods in Table 4, including JLDCF [31], DCMF [62], SSF [63], DANet [61], A2dele [60], DMRA [27], ICNet [64], S2MA [30], AFNet [28], and CPFP [65]. For methods whose code is not available or whose authors do not provide saliency results, we mark the corresponding entries with the symbol "−" in Table 4. From the quantitative comparisons, we can see that our CAE-Net is comparable to these SOTA RGB-D methods. In general, our model ranks in the top three on most datasets, except on STERE in terms of S α , where it ranks fourth. Specifically, our model improves the MAE by 8.6% and 2.9% on NJU2K and STERE, respectively. These results show that our model can be successfully adapted to RGB-D datasets, demonstrating its favorable generalization ability.

Conclusions
In this paper, we propose a cross-modal attention enhancement network (CAE-Net), which consists of cross-modal fusion (CMF), single-/joint-modality decoders (SMD/JMD), and multi-stream fusion (MSF), to accurately detect salient objects. First, we design the cross-modal fusion (CMF) module to fuse cross-modal features, where a cross-attention unit (CAU) is employed to refine the two modal features, and channel-attention-weighted fusion is used to merge them. The CMF can effectively enhance features and reduce the disturbance from bad-quality samples. Then, we design the joint-modality decoder (JMD) to fuse cross-level features, where the low-level features are purified using the high-level decoded features. The JMD effectively filters the noise in low-level features and captures wider context information. Besides, we add two single-modality decoder (SMD) branches to preserve more modality-specific information. Finally, we employ multi-stream fusion (MSF) to fuse the three branches of decoded features, which further aggregates the effective information across the three decoder branches. Extensive experiments are performed on three public datasets, and the results show that our CAE-Net is comparable to 18 state-of-the-art saliency models.
Grants D17019; and the Fundamental Research Funds for the Provincial Universities of Zhejiang under Grants GK219909299001-407.

Figure 1 .
Figure 1. Some bad-quality examples of RGB or thermal infrared images. (a,b) are two samples with bad-quality thermal images, and (c,d) are two samples with bad-quality RGB images. GT denotes the ground truth, and Ours indicates the saliency maps predicted by our proposed method.

Figure 2 .
Figure 2. The overall architecture of our cross-modal attention enhancement network (CAE-Net). First, we use a double-stream encoder to extract multi-level features from the RGB image I R and the thermal infrared image I T , respectively, producing five levels of deep features {F R i , F T i } (i=1,••• ,5). Then, we design a cross-modal fusion (CMF) module, consisting of a cross-attention unit (CAU) and channel-attention-weighted fusion, to fuse the two modal deep features, obtaining the fused features {F F i } (i=3,4,5). After that, we design the joint-modality decoder (JMD) to fuse cross-level features and obtain the decoded feature F Fd 3 . We also add two independent single-modality decoder (SMD) branches to preserve more modality-specific information, obtaining the decoded features F Rd 3 and F Td 3 , respectively. Finally, we design a multi-stream fusion (MSF) module to fully fuse the complementary information between the different decoder branches and obtain the final fused feature F Sd 3 . S is the final saliency map, obtained by applying one 1 × 1 convolution to F Sd 3 . The supervision losses on S and the intermediate features are denoted as ls i (i=1,••• ,4) and are marked with red arrows in this figure.

Figure 5 .
Figure 5. PR and F-measure curves of different models. (a) Results on the VT5000 dataset. (b) Results on the VT1000 dataset. (c) Results on the VT821 dataset.

Figure 6 .
Figure 6. The accuracy and complexity of each model. The horizontal axis indicates FLOPs, while the vertical axis indicates the accuracy, measured by the F β score on VT5000. The area of each circle represents the relative number of parameters of the corresponding model. A model located toward the top-left achieves a better trade-off between accuracy and FLOPs.

Table 2 .
Comparison of model complexity across different models. Here, "↓" means that the smaller the better.