Global Guided Cross-Modal Cross-Scale Network for RGB-D Salient Object Detection

RGB-D saliency detection aims to accurately localize salient regions using the complementary information of a depth map. Global contexts carried by deep layers are key to salient object detection, but they are diluted when transferred to shallower layers. Moreover, depth maps may contain misleading information due to the limitations of depth sensors. To tackle these issues, in this paper, we propose a new cross-modal cross-scale network for RGB-D salient object detection, where global context information provides global guidance to boost performance in complex scenarios. First, we introduce a global guided cross-modal cross-scale module named G2CMCSM to realize global guided cross-modal cross-scale fusion. Then, we employ feature refinement modules for progressive refinement in a coarse-to-fine manner. In addition, we adopt a hybrid loss function to supervise the training of G2CMCSNet over different scales. With all these modules working together, G2CMCSNet effectively enhances both salient object details and salient object localization. Extensive experiments on challenging benchmark datasets demonstrate that our G2CMCSNet outperforms existing state-of-the-art methods.


Introduction
The goal of salient object detection (SOD) is to identify the most distinctive object in a given image or video. As an important pre-processing method, SOD is widely used in various computer vision tasks, including image understanding [1], video detection and segmentation [2], semantic segmentation [3], object tracking [4], etc.
With the development of deep learning, numerous RGB SOD models have been proposed and have achieved significant success [5][6][7][8][9]. However, when dealing with complicated scenes with low texture contrast or cluttered backgrounds, the performance of RGB SOD models deteriorates. RGB-D SOD models, which extract objects from paired RGB images and depth maps, have attracted growing interest due to their complementary information and accessibility.
Although many new CNN-based SOD approaches [10][11][12][13][14][15][16][17][18] have been proposed for RGB-D data and achieve better performance than before, several issues still affect existing RGB-D SOD methods. First, although depth maps provide complementary information, they occasionally contain misleading information due to the limitations of depth sensors [12,13,19], which deteriorates the performance of RGB-D saliency models. Second, the empirical receptive field of a CNN is much smaller than the theoretical one, especially in high-level layers, and the semantic information carried by deep layers may be gradually diluted when transmitted to shallower layers [8,9,17]. The lack of global context increases the chance of failure in object localization. Third, previous methods usually enhance RGB features with corresponding-level depth features rather than cross-scale depth features [15,16]. Zhao et al. [17] found that utilizing global scene clues effectively reduces errors caused by objects with similar appearances. Liu et al. [8] proposed a receptive field

• We exploit cross-modal cross-scale feature fusion under the guidance of global context information to suppress distractors in lower layers. This strategy is based on the observation that high-level features contain global context information, which helps eliminate distractors in lower layers.

• To fully capture the complementary information in the depth map and effectively fuse RGB features and depth features, we introduce a depth enhancement module (DEM), which exploits the complementarity between RGB features and depth features, and an RGB enhancement module (REM), which exploits the information of RGB features to improve the details of salient object detection.

• We propose a Global Guided Cross-Modal Cross-Scale Network (G2CMCSNet) to detect RGB-D salient objects, which outperforms 12 SOTA methods on five public datasets. Different from other models, our model not only considers feature continuity but also exploits the intrinsic structure of RGB features and the global context information of high-level features. The performance of the proposed method is evaluated on five popular public datasets under four evaluation metrics and compared with 12 state-of-the-art RGB-D SOD methods.
The remainder of this paper is structured as follows: We first describe the present status of salient object detection and RGB-D salient object detection in Section 2. The overall architecture, components, and loss function of the network are outlined in Section 3. Additionally, Section 4 provides the outcomes of our experiments. Finally, Section 5 presents our conclusions.

Related Work
Deep Learning-Based RGB Salient Object Detection. Owing to the development of deep learning, a great number of deep learning-based RGB salient object detection methods have been proposed in recent years. Compared with traditional methods, deep learning-based methods have become mainstream due to their superior extraction accuracy. For example, Wang et al. [18] developed a recurrent fully convolutional network for salient object detection that incorporates saliency prior knowledge for more accurate prediction. Liu et al. [20] proposed a novel end-to-end deep hierarchical saliency network that first produces coarse saliency maps and then refines them recurrently and hierarchically with local context information. To compensate for the diluted global semantic information, Liu et al. [9] introduced a global guidance module and a feature aggregation module into the U-shape network. Chen et al. [21] applied side-output residual learning for refinement in a top-down manner under the guidance of a reverse attention block, which led to significant performance improvement. Wang et al. [22] presented a salient object detection method that integrates both top-down and bottom-up saliency inference iteratively and cooperatively, which encourages saliency information to effectively flow in a bottom-up, top-down, and intra-layer manner. Wu et al. [23] proposed a novel cascaded partial decoder framework for fast and accurate salient object detection that discards higher-resolution features of shallower layers for acceleration and integrates features of deeper layers for a precise saliency map. To detect objects in cluttered scenes, Zhang et al. [24] utilized image captioning to boost semantic feature learning for salient object detection, which encodes the embedding of a generated caption to capture the semantic information of major objects and incorporates the captioning embedding with local-global visual contexts for predicting the saliency map.
The works mentioned above perform better than traditional approaches, but they fall short when it comes to complex scenes. To address this issue, deep learning-based RGB-D salient object detection is proposed.
Deep Learning-Based RGB-D Salient Object Detection. Numerous CNN-based RGB-D SOD methods [5][6][7][8][9]11,13,15,16,18,[25][26][27][28][29][30][31][32][33][34][35][36][37] have been proposed with the development of depth sensors. Recent RGB-D SOD models mainly focus on CNN architectures and cross-modal fusion strategies to improve the performance of salient object detection. Fan et al. [13] proposed a multi-level, multi-modality learning framework that splits the multi-level features into teacher and student features and utilizes depth-enhanced modules to excavate informative parts of depth cues from the channel and spatial views. Li et al. [14] introduced three RGB-depth interaction modules to enhance low-, middle-, and high-level cross-modal information fusion. Zhang et al. [26] proposed a multi-stage cascaded learning-based RGB-D saliency detection framework that explicitly models complementary information between RGB images and depth data by minimizing the mutual information between modalities. Wang et al. [34] introduced correlation fusion, which fuses long-range cross-modality correlations and local depth correlations, to predict saliency maps. Bi et al. [35] introduced a cross-modal hierarchical interaction network that boosts salient object detection by excavating cross-modal feature interaction and progressive multi-level feature fusion. Zhang et al. [37] utilized a dynamic enhanced module to dynamically enhance the intra-modality features and a scene-aware dynamic fusion module to realize dynamic feature selection between the two modalities.
Apart from the fusion architectures, the quality of the depth map also affects the performance of salient object detection. Ji et al. [24] designed a depth calibration strategy to correct the potential noise from unreliable raw depth. Sun et al. [29] introduced a depth-sensitive RGB feature modeling scheme using the depth-wise geometric prior to reduce background distraction. Zhang et al. [30] designed a convergence structure that effectively selects the most valuable supplementary information from the RGB and depth modalities to obtain a more discriminative cross-modality saliency prediction feature. Wu et al. [31] proposed a new fusion architecture for RGB-D saliency detection, which improves robustness against inaccurate and misaligned depth inputs.

Methodologies
In this section, we present the overall architecture and motivation of the proposed Global Guided Cross-Modal Cross-Scale Network (G2CMCSNet) in Section 3.1. In Section 3.2, we introduce its main components in detail, including G2CMCSM and the feature refinement modules.

The Overall Architecture and Motivation
The proposed G2CMCSNet follows the encoder-decoder structure. As illustrated in Figure 1, G2CMCSNet consists of a feature encoder, global guided cross-modal cross-scale interaction, and global guided feature refinement.
Feature Encoder. We employ ResNet-50, pre-trained on the ImageNet dataset, as the backbone to extract RGB and depth features. As shown in Figure 1, the RGB image and depth map are encoded separately through the two-stream ResNet-50. The encoded blocks of the RGB image and depth map are denoted by f_n^R and f_n^D (n ∈ {1, 2, 3, 4, 5} is the block index), respectively. The input resolutions of the RGB image and depth map are set to 256 × 256 × 3 and 256 × 256 × 1, respectively.
Global Guided Cross-Modal Cross-Scale Module. It has been shown in [14] that fusing adjacent low- and mid-level features effectively captures feature continuity and promotes cross-modal and cross-scale fusion, while high-level semantic features help discover the specific locations of salient objects. Based on these observations, we propose global guided cross-modal cross-scale modules to capture the exact positions of salient objects while sharpening their details. Features extracted from deep layers contain rich semantic and textural information [17], whereas depth maps occasionally contain misleading information that deteriorates the performance of RGB-D saliency models. Since the Pyramid Pooling Module (PPM) provides global contextual information, it is introduced as global guidance and embedded after the final RGB feature layer, and its output is employed to reinforce global context information.
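As an illustration of what the PPM computes, the following NumPy sketch pools the input at several scales, upsamples each pooled map, and concatenates everything along the channel axis. This is not the authors' implementation; the bin sizes (1, 2, 3, 6) follow the original PPM design and are an assumption here.

```python
import numpy as np

def adaptive_avg_pool(feat, bins):
    """Average-pool a (C, H, W) feature map down to (C, bins, bins)."""
    C, H, W = feat.shape
    out = np.zeros((C, bins, bins))
    for i in range(bins):
        for j in range(bins):
            h0, h1 = i * H // bins, (i + 1) * H // bins
            w0, w1 = j * W // bins, (j + 1) * W // bins
            out[:, i, j] = feat[:, h0:h1, w0:w1].mean(axis=(1, 2))
    return out

def nearest_upsample(feat, H, W):
    """Nearest-neighbour upsample a (C, h, w) map to (C, H, W)."""
    _, h, w = feat.shape
    rows = np.arange(H) * h // H
    cols = np.arange(W) * w // W
    return feat[:, rows][:, :, cols]

def ppm(feat, bin_sizes=(1, 2, 3, 6)):
    """Pyramid Pooling Module sketch: pool at several scales, upsample
    each pooled map back, and concatenate with the input channel-wise."""
    _, H, W = feat.shape
    pyramids = [feat]
    for b in bin_sizes:
        pyramids.append(nearest_upsample(adaptive_avg_pool(feat, b), H, W))
    return np.concatenate(pyramids, axis=0)
```

In the real network a convolution would follow the concatenation to compress channels; the bin-1 branch reduces to the global average of each channel, which is exactly the "global context" the text refers to.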
Cascaded Decoder. As is well known, shallow layers contain low-level structural cues, while deep layers contain global semantic information. We therefore adopt a cascaded decoder to refine the saliency maps progressively. As shown in Figure 1, the cascaded decoder consists of five decoder blocks, which receive the outputs of G2CMCSM for refinement. The initial saliency map S_5, produced from high-level features, is progressively refined by low-level features, which contain abundant detailed information.

Global Guided Cross-Modal Cross-Scale Module
The global guided cross-modal cross-scale module is the key component of G2CMCSNet; it integrates cross-modal and cross-scale information under the guidance of global context information. The details of G2CMCSM are illustrated in Figure 2.
First, depth maps may contain misleading information due to their low quality. Second, global information is diluted when transferred to shallower levels. Based on these facts, our fusion strategy exploits the details of RGB features and the localization cues of depth features. G2CMCSM therefore comprises three parts: a Depth Enhancement Module (DEM), an RGB Enhancement Module (REM), and Global Guidance (GG).

where  denotes element-wise multiplication. RGB Enhancement Module (REM). RGB features contain information such as color and texture, which helps sharpen the details of SOD. Therefore, we propose an RGB enhancement module (REM) to enhance the details of SOD.
For channel reduction, the two features are each fed into a 3 × 3 convolutional layer with BatchNorm and ReLU activation, as shown in Figure 2, yielding the normalized feature maps f_n^R and f_m^D. The depth branch h_m^D contains two branches to enlarge the receptive field and a residual connection to retain the original information.
where Conv_3×3(·) denotes a 3 × 3 convolution with BatchNorm and ReLU(·) is the ReLU activation function. Then, the DEM representation is given as follows: where ⊗ denotes element-wise multiplication. RGB Enhancement Module (REM). RGB features contain information such as color and texture, which helps sharpen the details of SOD. Therefore, we propose an RGB enhancement module (REM) to enhance these details.
Specifically, f_n^R is fed into a 3 × 3 convolutional layer with BatchNorm and ReLU activation and is enhanced by itself. The REM representation is defined as follows: where BConv_3×3(·) denotes a 3 × 3 convolution with BatchNorm and ReLU activation. Global Guidance (GG). PPM is employed as the global guidance module for its global contextual information: it is placed on top of the RGB backbone to capture global context, and skip connections deliver the global context carried by the highest level to the low-level parts of the network to remedy its dilution.
To preserve the original information carried by the RGB feature, residual connections are adopted to combine the enhanced features with the original RGB features. We apply element-wise summation to fuse the features, and the cross-modal cross-scale enhanced feature f_n^RD is defined as follows: where + denotes element-wise summation and Up(·) denotes the upsampling operation. A 3 × 3 convolutional layer follows to smooth the features. With these skip connections, G2CMCSM can capture the exact positions of salient objects while sharpening their details.
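The precise DEM and REM equations appear only in Figure 2, so the fusion can only be sketched here. The following NumPy sketch implements one plausible reading of the text; the specific gating forms, the self-enhancement in REM, and the 2× scale gap between adjacent levels are all assumptions, not the authors' definitions.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def dem(f_r, f_d):
    # Depth Enhancement Module (assumed form): the depth feature gates and
    # is gated by the RGB feature via element-wise multiplication, with a
    # residual connection retaining the original depth information.
    return f_d + relu(f_r) * relu(f_d)

def rem(f_r):
    # RGB Enhancement Module (assumed form): the RGB feature is
    # "enhanced by itself", read here as self-gating plus a residual.
    return f_r + relu(f_r) * f_r

def upsample2x(x):
    # nearest-neighbour 2x upsampling of a (C, H, W) map
    return np.repeat(np.repeat(x, 2, axis=-2), 2, axis=-1)

def g2cmcsm_fuse(f_r, f_d_coarse, g):
    # f_r: RGB feature at level n; f_d_coarse: depth feature from the
    # adjacent, coarser level m; g: global guidance from the PPM, already
    # resized to f_r's resolution. Fusion is by element-wise summation.
    fused = f_r + dem(f_r, upsample2x(f_d_coarse)) + rem(f_r) + g
    return fused  # a 3x3 convolution would follow in the real model
```

The residual connections mean that zeroing one modality leaves the other intact, which matches the paper's goal of tolerating low-quality depth maps.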

Cascaded Decoder
To effectively leverage the features, the network is decoded with a cascaded refinement mechanism. This mechanism first produces an initial saliency map S_5 from high-level features and then improves its details with low-level features, which contain abundant detailed information. Using this mechanism, our model can iteratively refine the details in the low-level features. Corresponding to the five-level cross-modal cross-scale fusion, the cascaded decoder is a five-level module consisting of five decoders, as shown in Figure 1. Each decoder contains 3 × 3 convolution layers and upsampling layers. Finally, we obtain the final prediction map S_1, which can be denoted as follows: where D(·) denotes the decoder operation, S_n denotes the prediction map at level n, Up(·) denotes the upsampling operation, and S_1 denotes the final output.
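The coarse-to-fine refinement loop above can be sketched as follows. This is a simplified NumPy sketch: the decoder block is a stand-in for the 3 × 3 convolutions, and the 2× resolution gap between adjacent levels is an assumption.

```python
import numpy as np

def upsample2x(feat):
    # nearest-neighbour 2x upsampling of a (C, H, W) map
    return np.repeat(np.repeat(feat, 2, axis=-2), 2, axis=-1)

def decoder_block(feat):
    # stand-in for the 3x3 convolution layers of a real decoder block
    return np.maximum(feat, 0.0)

def cascaded_decode(features):
    # features: [f1, ..., f5] from fine (high resolution) to coarse
    # (low resolution); f_n has twice the spatial size of f_{n+1}.
    s = decoder_block(features[-1])        # initial coarse map S5
    for f in reversed(features[:-1]):      # refine through S4 ... S1
        s = decoder_block(f + upsample2x(s))
    return s                               # final prediction S1
```

Each iteration injects the finer-level features before decoding, which is how low-level detail progressively sharpens the coarse localization map.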

Loss Function
We supervise multi-level outputs to take full advantage of the fused information. The loss is composed of the Binary Cross-Entropy (BCE) loss L_BCE and the Intersection-over-Union (IoU) loss L_IOU [38]. As shown in Figure 1, each S_n is supervised by the ground truth (GT) with both the BCE loss and the IoU loss. The total loss function L can be formulated as follows:
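A plausible form of this total loss, assuming equal weighting of the BCE and IoU terms across the five supervised scales, is:

```latex
L = \sum_{n=1}^{5} \left( L_{BCE}(S_n, G) + L_{IOU}(S_n, G) \right)
```

where G denotes the ground-truth saliency map and S_n the prediction at scale n.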

Datasets and Evaluation Metrics
Datasets. To verify the effectiveness of our method, we evaluate it on five widely used benchmark datasets: STEREO [39], NJU2K [38], NLPR [40], SSD [41], and SIP [42]. For a fair comparison, we follow the same training settings as existing works [12], using 1485 samples from NJU2K and 700 samples from NLPR for training. Evaluation Metrics. We evaluate the performance of our method and the compared methods using four widely used evaluation metrics: maximum F-measure (F_β) [43], mean absolute error (MAE, M) [44], S-measure (S_λ, λ = 0.5) [45], and maximum E-measure (E_ξ) [46]. For a fair comparison, we use the evaluation tools provided by [16] for each SOTA method.
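For concreteness, two of these metrics can be computed as below. This is an illustrative sketch, not the evaluation tool of [16]; β² = 0.3 follows common SOD practice, and the maximum F-measure reported in the paper is the maximum of this quantity over all thresholds.

```python
import numpy as np

def mae(pred, gt):
    """Mean Absolute Error between a saliency map and the ground truth,
    both with values in [0, 1]."""
    return np.abs(pred.astype(float) - gt.astype(float)).mean()

def f_measure(pred, gt, beta2=0.3, thresh=0.5):
    """F-measure at a single binarization threshold."""
    binary = pred >= thresh
    positives = gt > 0.5
    tp = np.logical_and(binary, positives).sum()
    precision = tp / max(binary.sum(), 1)
    recall = tp / max(positives.sum(), 1)
    if precision + recall == 0:
        return 0.0
    return (1 + beta2) * precision * recall / (beta2 * precision + recall)
```

S-measure and E-measure additionally account for structural and enhanced alignment and are omitted here for brevity.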

Implementation Details
Our model is implemented in PyTorch and trained on one NVIDIA A4000 GPU. ResNet-50, pre-trained on ImageNet, is used as the backbone network. Because RGB and depth images have different numbers of channels, the input channel of the depth encoder is set to 1. The proposed model is optimized with the Adam algorithm; the initial learning rate is set to 10^-4 and is divided by 10 every 60 epochs. All training and testing images are resized to 256 × 256. The training images are augmented using various strategies, including random flipping, rotation, and border clipping. The batch size is set to 10, and the model is trained for 120 epochs.
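The step schedule described above can be written as a small helper; `learning_rate` is an illustrative, hypothetical name, not taken from the released code.

```python
def learning_rate(epoch, base_lr=1e-4, step=60, gamma=0.1):
    """Step decay matching the described schedule: the initial learning
    rate 1e-4 is divided by 10 every 60 epochs."""
    return base_lr * (gamma ** (epoch // step))
```

Over the 120-epoch run this gives 1e-4 for epochs 0-59 and 1e-5 for epochs 60-119.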

Comparison with State-of-the-Art
Quantitative Comparison. We compare the proposed network with 12 state-of-the-art (SOTA) CNN-based methods: DMRA [47], CMW [14], PGAR [12], HAINet [15], JLDCF [33], DCF [27], DSA2F [29], DCMF [34], HINet [35], CFIDNet [36], SPSNet [28], and C2DFNet [37]. The saliency maps of all compared methods are provided by the authors or obtained by running their released code. Table 1 shows the quantitative comparison in terms of four evaluation metrics on five datasets. It can be seen that G2CMCSNet significantly outperforms the competing methods across all datasets on most metrics. In particular, G2CMCSNet outperforms all other methods by a clear margin on the LFSD and SIP datasets, which are considered more challenging, and consistently surpasses the other state-of-the-art methods on the five datasets in terms of overall performance. Overall, our proposed G2CMCSNet obtains promising performance in locating salient object(s) in a given scene.
Table 1. Quantitative comparison with SOTA models using MAE (M), S-measure (S_m), max F-measure (F_m), and max E-measure (E_m). ↑(↓) denotes that higher (lower) is better. The best two results are highlighted in red and green.
Qualitative Comparison. Figure 3 shows qualitative comparisons with seven representative methods on various challenging scenarios for SOD. The first row represents a scene with low contrast; our method and DSA2F accurately capture the salient object while the others fail. In the second and third rows, our method successfully locates small objects against complex backgrounds. The fourth row represents a scene with a low-quality depth map; our method captures the salient object by suppressing the misleading information of the low-quality depth map. The sixth and seventh rows show scenes with multiple objects, where our method produces the most reliable results.
Thanks to the proposed G2CMCSM and refinement strategy, our method can accurately highlight salient objects regardless of complicated scenes and low-quality depth maps.

Ablation Study
To verify the relative contribution of different components of our model, we carry out ablation studies by removing or replacing them from our full model.
Effectiveness of G2CMCSM. G2CMCSM plays an important role in the proposed G2CMCSNet. To verify its effectiveness, we replaced the feature fusion module G2CMCSM with direct summation, as shown in Figure 4. We denote this evaluation as "A1" in Table 2. As can be seen in Table 2, G2CMCSM shows better overall performance than A1, which validates the effectiveness of G2CMCSM.

Effectiveness of DEM in G2CMCSM. To explore the effectiveness of the DEM, we remove it from our full model. We denote this evaluation as "A2" in Table 2. From the results in Table 2, it can be observed that the performance degrades without the DEM, which indicates its effectiveness.
Effectiveness of REM in G2CMCSM. To explore the effectiveness of the REM, we remove it from our full model. We denote this evaluation as "A3" in Table 2. It can be observed that the performance degrades without the REM, which indicates its effectiveness.
Effectiveness of Global Guidance in G2CMCSM. To verify the effectiveness of the global guidance, we remove the PPM from our full model. We denote this evaluation as "A4" in Table 2. As can be seen in Table 2, the full G2CMCSM shows better overall performance than the variant without global guidance, which supports our claim that the global guidance should be strengthened.
Effectiveness of Cross-Scale Strategy in G2CMCSM. To verify the effectiveness of the cross-scale strategy, we replace cross-scale fusion with corresponding-level fusion, i.e., RGB features are enhanced by corresponding-level depth features rather than adjacent-level depth features. We denote this evaluation as "A5" in Table 2. G2CMCSM shows better overall performance than the model with corresponding-level fusion, which validates the importance of the cross-scale strategy.

Conclusions
In this paper, we have proposed a new G2CMCSNet for RGB-D SOD, which suppresses the distractors in the depth map and effectively fuses RGB and depth features. G2CMCSNet consists of three main components: the DEM, which fuses the cross-modal cross-scale features extracted from the encoder; the REM, which enhances the details of SOD; and the GG, which suppresses the noise from unreliable raw depth. With all these modules working together, G2CMCSNet can effectively detect salient objects in complex scenarios. Extensive comparison experiments and ablation studies show that the proposed G2CMCSNet achieves superior performance on five widely used SOD benchmark datasets. In the future, we will extend the proposed model to other cross-modal tasks.

Conflicts of Interest:
The authors declare no conflict of interest.