Absolute and Relative Depth-Induced Network for RGB-D Salient Object Detection

Detecting salient objects in complicated scenarios is a challenging problem. In addition to the semantic features of the RGB image, the spatial information of the depth image also provides sufficient cues about the objects. Therefore, it is crucial to rationally integrate RGB and depth features for the RGB-D salient object detection task. Most existing RGB-D saliency detectors modulate RGB semantic features with absolute depth values. However, they ignore the appearance contrast and structural knowledge indicated by the relative depth values between pixels. In this work, we propose a depth-induced network (DIN) for RGB-D salient object detection, to take full advantage of both absolute and relative depth information and, further, to enforce the in-depth fusion of the RGB-D cross-modalities. Specifically, an absolute depth-induced module (ADIM) is proposed, to hierarchically integrate absolute depth values and RGB features, allowing the interaction between the appearance and structural information in the encoding stage. A relative depth-induced module (RDIM) is designed, to capture detailed saliency cues, by exploring contrastive and structural information from relative depth values in the decoding stage. By combining the ADIM and RDIM, we can accurately locate salient objects with clear boundaries, even in complex scenes. The proposed DIN is a lightweight network, and its model size is much smaller than those of state-of-the-art algorithms. Extensive experiments on six challenging benchmarks show that our method outperforms most existing RGB-D salient object detection models.


Introduction
Salient object detection aims to locate and segment the most visually distinctive objects in an image. It often serves as a pre-processing step that focuses attention on semantically meaningful regions and provides informative visual cues to downstream tasks, such as scene classification [1,2], visual tracking [3,4], foreign object detection [5], etc. In recent years, the rapid development of convolutional neural networks (CNNs) has facilitated significant improvements in various computer vision tasks, e.g., object detection [6], brain tumor segmentation [7], and point cloud segmentation [8]. Salient object detection also benefits from the powerful representation ability of CNNs [9]. However, many problems remain to be solved to accurately detect salient objects in complex image scenes, such as similar appearances between the foreground and background areas, low-intensity environments, multiple objects of different sizes, etc. Due to the limited representation ability of existing saliency models, it is challenging to discriminate salient objects from cluttered background regions with only RGB images. Recently, benefiting from the development of depth sensors, it has become convenient to obtain dense depth images. Pixel-wise depth values provide spatial information and geometric cues about the scene, which are complementary to the appearance features of the RGB data. Compared with only using appearance features from RGB images, saliency models based on the RGB-D cross-modalities can capture more relevant information about the objects and avoid redundant noise. For example, in the very complex scene in Figure 1, where the salient objects are not distinctive in the local area, we can observe an obvious contrast between the foreground and background regions in the depth space (Figure 1b). This paper proposes a cross-modal fusion strategy for RGB-D salient object detection.
Specifically, we consider two kinds of depth information: pixel-wise absolute depth values from depth images, and relative depth values (also known as spatial distances in 3D space) between pair-wise pixels. The interaction between absolute depth information and RGB images is a hot topic in scene understanding, such as person re-identification [10], 3D object detection [11][12][13], etc. Most existing methods learn to extract depth and RGB features with separate networks and directly fuse them at each scale with different fusion modules [14][15][16], ignoring the message transmission across different scales. Moreover, because of the large gap between the distributions of the RGB and depth data, this will inevitably introduce noisy responses in the prediction results. For example, the saliency map in Figure 1d, which is generated by the baseline network (simply integrating RGB and depth features with concatenation and convolution blocks), cannot effectively capture the entire salient object in the scene. In comparison, the network with the proposed ADIM (Figure 1e) can respond to most salient regions, by reasonably combining the RGB-D cross-modal features. Besides the information provided by the absolute depth values, we argue that the relative depth values between short-range pixels also contribute to recognizing distinctive saliency cues in the local area. However, these have rarely been studied in recent work. For example, the left person in Figure 1a is less discriminative in the RGB feature space, but presents an obvious disparity around the object boundary in the depth space. Therefore, by introducing relative depth information, the proposed method manages to recognize the person as a salient object (Figure 1f).
To this end, we propose a depth-induced network (DIN) for RGB-D salient object detection, which consists of two main components, the absolute depth-induced module (ADIM) and the relative depth-induced module (RDIM). Unlike directly fusing features from the depth and RGB images, the ADIM enforces the in-depth interaction of the two modalities in a coarse-to-fine manner. Specifically, a set of gated recurrent units (GRUs) [17] is employed, to hierarchically integrate the depth and RGB features across multiple scales. The gate structure adaptively selects informative features from the RGB and depth images, and thus controls the fusion of the multi-modalities and avoids introducing cluttered noise, caused by the asynchronous properties of the two feature spaces. The ADIM is implemented in a recursive manner. The fusion results at the shallower layer are subsequently input into the next integration step at the deeper layer, ensuring effective information transfer across different scales. By this means, the degree of integration deepens through the network, and we can explore saliency cues from the combined features at different scales. The RDIM aims to capture the spatial structure information of the image and detailed saliency cues, by utilizing the relative depth information. Since pixels that are adjacent in 2D space may not be strongly associated in the 3D point cloud space, we project the image pixels into 3D space, based on their spatial positions and depth values. Then, a graph model is constructed at the feature map level, to enforce information propagation in the local area, according to the relative depth relationships. We implement it with a spatial graph convolutional network (GCN), based on the relative depth and semantic affinities between pair-wise pixels. The feature representation ability is successively enhanced by exploring the spatial structure and geometric information across multiple scales.
By this means, detailed saliency cues are exploited by the RDIM, which facilitates the accurate prediction of the final results. Unlike the commonly used two-stream networks, which encode the RGB and depth images with the same architecture [18][19][20][21], the proposed DIN is a single-stream network. It thus dramatically reduces the computation costs, without sacrificing the model performance.
In summary, the main contributions of this work are three-fold:
• We propose an ADIM, which adopts a GRU-based method and adaptively integrates absolute depth values and RGB features, to combine the geometric and semantic information from the multi-modalities.
• We propose an RDIM, which employs a spatial GCN to explore semantic affinities and contrastive saliency cues, by leveraging the relative depth relationships.
• The proposed DIN for RGB-D salient object detection is a lightweight network and outperforms most state-of-the-art algorithms on six challenging datasets.

Related Works
This section reviews some representative works on salient object detection and RGB-D salient object detection, respectively. We also briefly discuss the graph convolutional network.

Salient Object Detection
Early salient object detection approaches [22][23][24][25] mainly used hand-crafted features, such as brightness, color, and texture, to locate and segment salient objects in an image. In recent years, thanks to the development of CNNs, various deep learning-based saliency models [26][27][28] have been proposed, and they outperform the traditional methods by a large margin. Zhang et al. [26] explored multi-level features at each scale and recursively generated saliency maps. Feng et al. [27] proposed an attentive feedback module, to better explore the structure of salient objects. Kong et al. [28] designed propagation modules, to combine the multi-scale features of the network. The authors of [29] proposed an integrity cognition network, to enhance the integrity of the predicted salient objects. In [30], a dual graph model was established, to guide the focal stack fusion for light field salient object detection. The authors of [31] developed a salient object detector without any human annotations, with a novel supervision synthesis scheme.
Although the above works study various multi-scale fusion models and learning strategies, they still face challenges when the scenarios are complicated. To address this issue, we resort to depth images, to explore the spatial structure and geometric information of the scene, and thus improve the effectiveness and robustness of the network.

RGB-D Salient Object Detection
Traditional hand-crafted RGB-D salient object detection methods [32][33][34] have an inferior ability to represent semantic and geometric features. Recently, CNN-based methods have been developed that are powerful in modeling the RGB-D multi-modalities, and thus improve the detection performance to a large extent. Effectively integrating depth and RGB features is one of the critical issues in this task. In [35], Piao et al. hierarchically integrated the depth and RGB images, and refined the final saliency map with a recurrent attention model. The work in [19] designed an asymmetric two-stream network for learning the multi-scale and multi-modal features, with a ladder-shaped module and attention strategies. Sun et al. [20] explored the depth-wise geometric prior, to refine the RGB features, and employed automatic architecture search, to improve the performance of the saliency model. To reduce the model size and improve the performance, Zhao et al. [36] proposed a one-stream network for RGB-D salient object detection and designed effective attention models, to combine the multi-modalities. The work in [37] proposed a novel mutual attention model, to fuse the cross-modal information. Besides, effective learning strategies are also crucial for a high-quality detection model. Ji et al. [38] proposed a novel collaborative learning framework, to enhance the interaction of the edge, depth, and saliency cues. The authors of [18] trained a depth distiller, which modulated the RGB representation with the features of the depth stream. In [39], two kinds of self-supervised pre-training processes were conducted, to learn semantic information and reduce the inconsistency between the multi-modalities.
The mutual learning method was employed in [40,41], to align the RGB and depth features. Liu et al. [42] proposed a unified transformer architecture for both RGB and RGB-D salient object detection, to propagate global information among image patches. The work in [43] proposed a transformer-based network, to learn implicit class knowledge for RGB-D co-saliency detection. Although saliency models can benefit from depth information and distinguish objects in cluttered scenes, sometimes depth images are inaccurate or hard to obtain, thus introducing inevitable noise in the prediction results. Considering this situation, Hussain et al. [44] proposed to leverage only RGB images during both the training and test stages, first predicting the depth values and then generating the final saliency maps based on the intermediate depth information. This method employed a combination of a transformer and a CNN, leading to satisfactory prediction results. Most of these methods exploit discriminative cues from the absolute depth values, but ignore the structural information indicated by the relative depth values. In our work, we make full use of both the absolute and relative depth information in images, with a single-stream architecture, to facilitate the accurate saliency detection of RGB images.

Graph Convolutional Network
A GCN aims to learn geometric information on non-Euclidean structural data. Because of the flexible modeling of the relationships between nodes, GCNs have received more and more research interest in recent years. Generally speaking, GCNs can be categorized into spectral approaches and spatial approaches. Kipf et al. [45] proposed a spectral GCN, which performed the convolution operation in the spectral domain on the constructed graphs, with the help of the Fourier transformation. Velickovic et al. [46] designed a spatial GCN, which fused the features between neighboring nodes with an attention mechanism. A multi-layer perceptron was trained to learn the affinity relationship between adjacent nodes. Battaglia et al. [47] utilized multiple blocks, to alternately update and transfer the information on the graph. Due to the effectiveness of GCNs, many researchers employ them in computer vision tasks. Yao et al. [48] used a GCN to integrate both the semantic and spatial relationships of objects, to generate more accurate image captions. Qi et al. [49] represented the image as a graph model with the depth prior. By constraining the GCN with invisible depth information, more accurate image segmentation results were obtained. In [50], a GCN was employed to model the semantic relationships on both the language and visual modalities, and then inferred the referred regions in the image. Inspired by these works, we design a spatial GCN for the RGB-D saliency detection task. This network makes full use of the relative depth information, to explore detailed saliency cues in the local area, and obtains satisfactory detection results with much clearer object boundaries.

Algorithm
We propose a DIN that leverages depth images to induce spatial relationships for RGB saliency detection. We first introduce the overall architecture in Section 3.1. We then discuss the ADIM, which fuses the image and absolute depth features, in Section 3.2, and we elaborate on the RDIM, which refines the fused features under the guidance of the relative depth information, in Section 3.3.

Overall Architecture
The overall architecture of the proposed DIN is shown in Figure 2. We employ a ResNet [51]-based network as the backbone, to encode the input RGB image. The backbone network consists of five convolution blocks, Conv_1, Conv_2, . . ., Conv_5. Fed with an RGB image I of size W × H, it generates multi-scale feature maps f_l^I of size (W/2^(l−1)) × (H/2^(l−1)), l = 1, 2, . . . , 5, respectively. We removed the fully connected layers of the original ResNet [51] to fit this task. The depth image D is first encoded with a set of convolutional layers and then successively warped and integrated with the hierarchical image feature maps by the proposed ADIM, generating the fused feature maps f_l^A, l = 1, 2, . . . , 5. Since features from different levels represent meaningful information, we recursively integrate multiple feature maps in the decoding stage. Specifically, the image feature maps f_l^I are first fed into a convolutional layer with kernel size 1 × 1, to reduce the channel number to 64, and the generated side output feature maps are denoted as f_l^S, l = 1, 2, . . . , 5. Then the multi-level feature maps of the RGB and depth images are integrated in a top-down manner, f_l^P = ReLU(Conv(C(f_l^S, U(f_(l+1)^P)))) + f_l^S, in which ReLU(·) is the ReLU activation function, Conv(·) is the convolution layer, C(·, ·) is the concatenation operation that concatenates feature maps on the channel dimension, U(·) is the up-sampling operation with bilinear interpolation, and + represents the element-wise addition operation. The feature map f_l^P denotes the integrated RGB-D features at the l-th scale.
To further exploit the spatial structure information and detailed saliency cues, we refine the integrated feature maps f_l^P with the proposed RDIMs, by considering the relative depth values. The output feature maps are denoted as f_l^R. To balance the computation costs and performance, we apply the RDIM at the 3rd and 4th levels. Then, the integrated feature maps are fed into a set of convolutional layers, with kernel size 3 × 3, to generate saliency maps S_l, l = 1, 2, . . . , 5, at multiple scales. The proposed depth-induced network (DIN) is trained in an end-to-end manner. All the saliency maps are directly supervised by the ground truth, and the losses are summed and optimized jointly. Considering that the feature map f_1^P incorporates both high-level semantic information and low-level detailed knowledge, we choose S_1 as the final prediction result.
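One top-down decoding step described above can be sketched in PyTorch as follows. This is a minimal sketch, not the authors' implementation: the class name `DecoderFusion`, the channel width, and the exact kernel size of the fusion convolution are assumptions; the operations (bilinear up-sampling, channel concatenation, convolution, ReLU, element-wise addition) follow the description.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoderFusion(nn.Module):
    """One top-down decoding step: fuse the 64-channel side-output feature
    with the up-sampled deeper decoded feature (a sketch; names assumed)."""
    def __init__(self, channels=64):
        super().__init__()
        self.conv = nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1)

    def forward(self, f_s, f_p_deeper):
        # Up-sample the deeper feature to the current resolution (bilinear).
        up = F.interpolate(f_p_deeper, size=f_s.shape[2:],
                           mode="bilinear", align_corners=False)
        # Concatenate on the channel dimension, convolve, activate,
        # then add the side-output feature element-wise.
        fused = torch.relu(self.conv(torch.cat([f_s, up], dim=1)))
        return fused + f_s

# Example: 64-channel maps at two adjacent scales of a 256 x 256 input.
fusion = DecoderFusion(64)
f_s3 = torch.randn(1, 64, 64, 64)   # side output at the 3rd scale
f_p4 = torch.randn(1, 64, 32, 32)   # decoded feature at the 4th scale
f_p3 = fusion(f_s3, f_p4)
```

The residual addition of the side-output feature keeps low-level detail available even when the convolved fusion saturates.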

Absolute Depth-Induced Module
The goal of the absolute depth-induced module (ADIM) is to integrate appearance features from the RGB image and depth information from the depth image. Since there is a large gap between the distributions of the RGB and depth data, flat feature fusion methods will introduce cluttered noise into the final prediction. Therefore, we propose to recursively fuse the features of the two modalities, by employing a series of ADIMs to enforce the in-depth interaction between the RGB and depth features. As shown in Figure 2, the depth information is embedded in the hierarchical RGB feature maps and updated step-by-step in the encoding stage. Specifically, given the feature map of the RGB image at the l-th level, f_l^I, and the updated depth feature map at the previous level, f_(l−1)^D, the ADIM integrates the RGB-D multi-modalities as follows, in which f_l^A is the integrated feature map, f_l^D is the updated depth feature, and f_0^D is the depth image.
The detailed structure of the ADIM is shown in Figure 3. The implementation of the ADIM is inspired by the gated recurrent unit (GRU), which is designed for dealing with sequential problems. We formulate the multi-scale feature integration process as a sequence problem and treat each scale as a time step. By this means, the ADIM iteratively updates the depth features of the previous state, and selectively fuses meaningful cues of the two modalities with the memory mechanism. In each time step, we feed the image features f_l^I into the GRU, and the depth features f_(l−1)^D can be regarded as the hidden state of the previous step. The output feature maps f_l^D and f_l^A in Equation (2) are the updated states. First, the feature maps f_l^I and f_(l−1)^D are each encoded by a convolutional layer and a ReLU activation layer. Then the two encoded feature maps are concatenated and transformed by a global max-pooling (GMP) operation, generating a feature vector. Subsequently, two separate fully connected layers, each followed by the sigmoid function, are applied to the feature vector, to generate the reset gate r and the update gate z. In effect, the gate r controls the integration of the depth and RGB features, and z controls the update step of f_l^D. Based on them, the fused multi-modal feature f_l^A and the updated depth representation f_l^D are output. Formally, in the above process, FC_θ(·) denotes the fully connected layer with learnable parameters θ, GMP(·) is the global max-pooling operation, and ⊗ is the channel-wise multiplication operation.
The feature map f_l^A, generated by the hidden layer, memorizes valuable multi-modal information from the previous scales, which is adaptively integrated with the features of the current scale. Such an operation enhances the interaction between the cross-modalities as the network goes deeper. The depth feature f_l^D is also updated according to the corresponding appearance information in f_l^A, and further facilitates the cross-modal learning at the next scale.
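The gating scheme above can be sketched as a small PyTorch module. The exact gating equations are not reproduced in this text, so the two lines computing `f_a` and `f_d` are an assumed, plausible GRU-style combination; the encoding, global max-pooling, and two sigmoid-gated fully connected layers follow the description.

```python
import torch
import torch.nn as nn

class ADIM(nn.Module):
    """Sketch of the absolute depth-induced module. The reset gate r fuses
    the two modalities and the update gate z refreshes the depth state;
    the final combination formulas are assumptions, not the paper's exact ones."""
    def __init__(self, channels):
        super().__init__()
        self.enc_rgb = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU())
        self.enc_dep = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU())
        self.fc_r = nn.Linear(2 * channels, channels)  # reset-gate FC
        self.fc_z = nn.Linear(2 * channels, channels)  # update-gate FC

    def forward(self, f_rgb, f_dep):
        fi, fd = self.enc_rgb(f_rgb), self.enc_dep(f_dep)
        # Global max-pooling over the concatenated maps gives one vector.
        v = torch.cat([fi, fd], dim=1).amax(dim=(2, 3))
        r = torch.sigmoid(self.fc_r(v))[:, :, None, None]  # reset gate
        z = torch.sigmoid(self.fc_z(v))[:, :, None, None]  # update gate
        f_a = fi + r * fd                 # fused multi-modal feature (assumed form)
        f_d = z * fd + (1 - z) * f_a      # updated depth state (assumed form)
        return f_a, f_d

adim = ADIM(64)
f_a, f_d = adim(torch.randn(2, 64, 32, 32), torch.randn(2, 64, 32, 32))
```

Broadcasting the pooled gate vectors over the spatial dimensions realizes the channel-wise multiplication ⊗ of the text.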

Relative Depth-Induced Module
Besides the absolute depth values, we can learn contrastive and structural information about the scene from the relative depth values in the local area. Relative depth information can be extracted from the depth image and reveals the spatial relationship between pixels. Intuitively, pixels that are closer in the depth space should have a more compact feature interaction, as they tend to share the same saliency label. This observation is essential for separating salient objects in extremely cluttered image scenes. In this work, we propose an RDIM that utilizes a GCN to ensure message propagation, by using the relative depth information. As shown in Figure 2, RDIMs are employed at the 3rd and 4th levels of the decoding stage. Given the feature map f_l^P in Equation (1) and the depth image, the RDIM refines the integrated multi-scale feature maps, to boost the performance of the saliency model.
Graph Construction. To explore the relative depth relationship between pixels, we represent the generated feature map f_l^P as an undirected graph G = (V, E), with node set V and edge set E. First, the depth image is resized to the size of f_l^P. Then, each pixel in f_l^P is regarded as a node in the graph, and the node set is denoted as V = {n_1, n_2, . . . , n_K}, where K is the total pixel number. Each node n_i corresponds to a 3D coordinate (x_i, y_i, d_i) and a feature vector f_(l,i)^P, where (x_i, y_i) is the 2D spatial coordinate in the feature map, d_i is the depth value of the pixel, and f_(l,i)^P is the feature vector, on the channel dimension, of the i-th pixel in the feature map f_l^P. To allow message transmission in the local area, we define edges between each node and its m nearest neighbors, according to their 3D coordinates. The weight on edge e_ij ∈ E is defined as the relative depth value w_ij = ‖(x_i, y_i, d_i) − (x_j, y_j, d_j)‖, to measure the spatial correlation between the nodes n_i and n_j.
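The graph construction step can be sketched with NumPy as follows. This is an illustrative sketch, not the paper's code: the function name, the small m, and the equal scaling of (x, y) versus d are assumptions; the m-nearest-neighbor rule in 3D and the Euclidean edge weights w_ij follow the description.

```python
import numpy as np

def build_depth_graph(depth, m=8):
    """Build the m-nearest-neighbour graph over an (H, W) depth map.
    Each pixel becomes a node with 3D coordinate (x, y, d); the edge weight
    to each neighbour is the Euclidean distance between the 3D coordinates."""
    H, W = depth.shape
    ys, xs = np.mgrid[0:H, 0:W]
    coords = np.stack([xs.ravel(), ys.ravel(), depth.ravel()], axis=1)  # (K, 3)
    # Dense pairwise 3D distances; fine for small maps
    # (a KD-tree would be used for full-resolution feature maps).
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    np.fill_diagonal(dist, np.inf)              # exclude self-edges
    neighbours = np.argsort(dist, axis=1)[:, :m]          # m nearest nodes
    weights = np.take_along_axis(dist, neighbours, axis=1)  # w_ij
    return neighbours, weights

nbrs, w = build_depth_graph(np.random.rand(8, 8), m=4)
```

Because the neighborhood is computed in 3D rather than in the 2D image plane, pixels on opposite sides of a depth discontinuity are unlikely to be connected, which is exactly the contrast cue the RDIM exploits.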
Graph Convolutional Layer. The proposed spatial GCN consists of a series of stacked graph convolutional layers (GCLs). For each GCL, we first define the semantic affinity a_ij for the edge e_ij, to characterize the semantic discrepancy between the nodes n_i and n_j. Specifically, to further consider the global contextual information of the image, a global average pooling (GAP) operation is applied on the feature map f_l^P, to extract high-level semantic information, and the output feature vector is denoted as f_g. The semantic affinity a_ij is predicted by a fully connected layer FC_θ1(·), with learnable parameters θ1, from the feature vectors f_(l,i)^P and f_(l,j)^P of the two nodes and the global feature f_g. Then the feature f_(l,i)^P of each node n_i is updated by a fully connected layer that aggregates the features of its neighbors N(n_i). In the updating process, both the semantic and spatial affinities are considered, which helps to improve the discrimination ability of the features.
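A single GCL update can be sketched in NumPy. The exact affinity and update equations are not given in this text, so the formulas below are assumptions: the affinity is a sigmoid of a learned projection of the two node features plus the global context, and neighbors are aggregated with that affinity down-weighted by the spatial distance w_ij; all weight shapes are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gcl_update(feats, nbrs, w, W_aff, W_upd):
    """One graph convolutional layer over node features of shape (K, C).
    a_ij is predicted from (f_i, f_j, f_g); neighbours are aggregated with
    a_ij * exp(-w_ij) so both semantic and spatial affinity are considered.
    All formulas here are assumed stand-ins for the paper's equations."""
    K, C = feats.shape
    f_g = feats.mean(axis=0)                  # global average-pooled context
    out = np.empty_like(feats)
    for i in range(K):
        fj = feats[nbrs[i]]                   # (m, C) neighbour features
        pair = np.concatenate(
            [np.repeat(feats[i][None], len(fj), 0), fj,
             np.repeat(f_g[None], len(fj), 0)], axis=1)
        a = sigmoid(pair @ W_aff)             # (m,) semantic affinities
        coef = a * np.exp(-w[i])              # damped by spatial distance
        out[i] = (coef[:, None] * fj).sum(axis=0) @ W_upd + feats[i]
    return out

K, C, m = 16, 8, 4
feats = np.random.rand(K, C)
nbrs = np.random.randint(0, K, size=(K, m))   # toy neighbour indices
w = np.random.rand(K, m)                      # toy spatial weights
W_aff = np.random.randn(3 * C)                # affinity projection (assumed shape)
W_upd = np.random.randn(C, C) * 0.1           # update projection
refined = gcl_update(feats, nbrs, w, W_aff, W_upd)
```

The residual term `+ feats[i]` keeps each node's own feature dominant, so the layer refines rather than replaces the decoder features.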
In the RDIM, three GCLs are sequentially applied, to update the global semantic feature f_g, the semantic affinities a_ij, and the node features {f_(l,i)^P}, i = 1, . . . , K. We adopt the output feature of the last GCL as the refined one and denote it as f_(l,i)^R. Note that f_(l,i)^R is the channel-wise feature vector at the location (x_i, y_i). We then re-arrange the features of all nodes to form a feature map f_l^R, which has the same size as f_l^P. According to Equation (4), the feature map f_l^R is the final output of the RDIM at scale l.
Intuitively, the graph constructed on the feature map f_l^P reveals the affinity between nodes, in terms of both spatial correlation and visual association. By transferring messages between nodes using the GCN, the feature of each node is refined according to its affinity with its short-range neighbors. This encourages similar nodes (in both the spatial and visual spaces) to have the same saliency labels.
The feature maps f_l^R generated by the RDIM are then input into the next decoding stage. To balance the computation costs and performance, we employ the RDIM at the 3rd and 4th scales in the decoding stage. We set m to 64 at the 3rd scale and 32 at the 4th scale.

Training and Inference
To constrain the network and learn effective saliency cues, we generate a saliency map at each scale of the network and supervise it with the ground truth image. Specifically, the feature map f_l^P (l = 1, 2, . . . , 5) of the l-th level in the decoding stage is transformed by a 3 × 3 convolutional layer into one channel. This is followed by an up-sampling operation, to resize the output to the size of the input image, and we denote this result as the saliency map of the l-th level, s_l. This saliency prediction is supervised by the ground truth image ŝ, with the cross-entropy loss function L(s_l, ŝ) = −∑_(i,j) [ŝ(i, j) log(s_l(i, j)) + (1 − ŝ(i, j)) log(1 − s_l(i, j))].
The total loss is the summation of the loss in each scale, and the proposed DIN is trained in an end-to-end manner, by minimizing the total loss.
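The per-scale cross-entropy loss and its summation over scales can be written directly from the formula above. A minimal NumPy sketch (the clipping constant `eps` is a numerical-stability assumption, not part of the formula):

```python
import numpy as np

def bce_loss(pred, gt, eps=1e-7):
    """Pixel-wise cross-entropy L(s_l, s_hat) between a predicted saliency
    map and the ground truth, both in [0, 1]."""
    pred = np.clip(pred, eps, 1.0 - eps)  # avoid log(0)
    return -np.sum(gt * np.log(pred) + (1.0 - gt) * np.log(1.0 - pred))

def total_loss(preds, gt):
    """Sum the per-scale losses over all five decoder scales, with each
    prediction already up-sampled to the ground-truth resolution."""
    return sum(bce_loss(p, gt) for p in preds)

gt = (np.random.rand(16, 16) > 0.5).astype(float)
preds = [np.clip(gt + 0.1 * np.random.randn(16, 16), 0.01, 0.99)
         for _ in range(5)]
loss = total_loss(preds, gt)
```

Because every scale is supervised against the same ground truth, gradients reach the shallow decoder levels directly rather than only through the final prediction.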
Considering that the saliency map s_1, generated by the feature map f_1^P, incorporates both multi-level and multi-modal information, we employ it as the final prediction result of the network.

Experiment Setup
Implementation Details. The parameters of the backbone network are initialized from ResNet-50 [51] pre-trained on the ImageNet [52] dataset. The rest of the parameters are randomly initialized. The input images are resized to 256 × 256 by bilinear interpolation. We utilize the same data augmentation methods as in [35], including random flipping, cropping, and rotation, to prevent the network from overfitting. The Adam algorithm [53] is employed to optimize the loss function. The base learning rate is set to 5 × 10^−5 and is divided by 10 every 20 epochs. Our network converges within 40 epochs. The weight decay is 0.0001, and the batch size is four. All experiments are implemented on the PyTorch platform, with a single NVIDIA 2080Ti GPU. The model size of the proposed DIN is 99.55 MB, and the inference speed is 11 FPS.
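The optimization schedule described above maps directly onto standard PyTorch utilities. A sketch under the stated hyper-parameters (the stand-in model and the empty epoch body are placeholders, not the DIN):

```python
import torch

# Adam with base learning rate 5e-5 and weight decay 1e-4, with the rate
# divided by 10 every 20 epochs, as described in the implementation details.
model = torch.nn.Conv2d(3, 1, 3, padding=1)  # stand-in for the DIN
optimizer = torch.optim.Adam(model.parameters(), lr=5e-5, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.1)

for epoch in range(40):          # the network converges within 40 epochs
    # ... one training epoch over 256 x 256 inputs with batch size 4 ...
    scheduler.step()             # decay happens at epochs 20 and 40
```

After the 40 epochs, the learning rate has been decayed twice, to 5e-7.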

Datasets.
We evaluate saliency models on six large-scale datasets, including NLPR [33], NJUD [34], STERE [54], SIP [55], LFSD [56], and SSD [57]. NLPR includes 1000 images, most of which contain multiple salient objects, and the corresponding depth images are captured by Kinect. NJUD incorporates 1985 images, which are collected from the internet, 3D movies, and photographs taken by a Fuji W3 stereo camera. STERE consists of 1000 pairs of binocular images. SIP includes 1000 images, with salient persons in real-world scenes. LFSD is composed of 100 light field images, which are taken by the Lytro light field camera. SSD contains 80 images that are extracted from stereo movies. As suggested in [35], we employ 1485 images from the NJUD and 700 images from the NLPR as the training set. The rest of the images are used as the test set.
Evaluation Metrics. Saliency maps are first binarized with a set of thresholds ranging from 0 to 255 and compared with the ground truth images, and the precision and recall values are computed at each threshold. The F-measure considers both precision and recall, to comprehensively evaluate the performance of saliency models, and is defined as the weighted harmonic mean of the two, F_β = ((1 + β²) · Precision · Recall) / (β² · Precision + Recall), where β² is set to 0.3 following [58]. The S-measure evaluates the structural similarity between the objects detected in the saliency maps and those in the ground truth images. It combines the object-aware similarity S_object and the region-aware similarity S_region linearly, S = µ · S_object + (1 − µ) · S_region, in which µ is a constant that balances the importance of the two terms. The E-measure computes the enhanced alignment matrix φ_FM, to capture the pixel-level matching and image-level statistics of the foreground map, and uses the mean value of φ_FM to reflect the quality of the saliency prediction. The MAE computes the mean absolute difference between the prediction result s and the ground truth image ŝ.
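The F-measure and MAE can be computed directly from their definitions. A NumPy sketch (the single fixed threshold is for illustration; the reported F_max takes the maximum over thresholds 0 to 255):

```python
import numpy as np

def f_measure(pred, gt, beta2=0.3, threshold=0.5):
    """F-measure at one binarization threshold, with beta^2 = 0.3
    following the common convention of [58]."""
    binary = pred >= threshold
    tp = np.logical_and(binary, gt).sum()
    precision = tp / max(binary.sum(), 1)
    recall = tp / max(gt.sum(), 1)
    if precision + recall == 0:
        return 0.0
    return (1 + beta2) * precision * recall / (beta2 * precision + recall)

def mae(pred, gt):
    """Mean absolute error between the saliency map and the ground truth."""
    return np.abs(pred - gt).mean()

# Toy example: a perfect prediction of a square salient region.
gt = np.zeros((8, 8), dtype=bool)
gt[2:6, 2:6] = True
pred = gt.astype(float)
```

A perfect binary prediction yields an F-measure of 1 and an MAE of 0, which is a quick sanity check when wiring up an evaluation script.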
Quantitative Evaluation. We compare the proposed DIN model with the evaluated methods on six large-scale datasets. The max F-measure, S-measure, E-measure, and MAE values are reported in Table 1. The quantitative experiments show that our method achieves competitive performance against recent state-of-the-art algorithms on most datasets, demonstrating the effectiveness of the proposed method. Especially on the challenging NJUD [34] and LFSD [56] datasets, which incorporate semantically complicated images, the performance of our method is superior to that of the other algorithms, indicating that the proposed DIN is able to learn informative semantic cues from cluttered scenarios. Table 1. Quantitative results of the evaluated methods in terms of the F-measure (F_max), S-measure (S_µ), E-measure (E_β), and MAE on six datasets. The top three scores of each metric are marked in red, green, and blue. Higher scores of the F-measure, S-measure, and E-measure are better, and lower MAE scores are better.

Table 2 shows the model sizes of the evaluated methods. The model size of the proposed DIN is much smaller than those of the comparison saliency models, except for PGAR, which employs VGG-16 [69] as the backbone network. Figure 4 illustrates scatter diagrams of the model size and performance of the evaluated methods, in terms of the F-measure, S-measure, E-measure, and MAE. The comparisons in Figure 4 intuitively demonstrate that the DIN achieves satisfactory performance with fewer parameters. This observation indicates the potential of the DIN to be deployed on mobile devices. Qualitative results are shown in Figure 5. Compared with other algorithms, our method accurately detects entire salient objects, with well-defined boundaries. As shown in Figure 5, the proposed DIN model is effective in various challenging scenarios, including multiple objects (1st column), salient objects with different colors (2nd column), distractors in the background (3rd and 6-7th columns), low contrast between the foreground and background (5th column), and cluttered backgrounds (3rd, 4th, and 6-7th columns). For example, in the 1st column in Figure 5, most other algorithms cannot capture all salient regions of the two objects. In contrast, our method consistently highlights multiple salient objects. In the 2nd example, although there are various appearances in the foreground, the proposed DIN is able to detect all parts of the salient regions. In the 3rd and 4th examples, our model successfully suppresses the distractors in the background regions. In the low-contrast and low-illumination scenarios (5th and 7th examples), the proposed DIN can segment the entire salient regions from the background, because of the combination of the RGB and depth information. The foreground in the 6th example is not salient in the depth space. However, our method can still capture the salient regions, according to the contrast cues in the RGB and semantic spaces.
The above visual results verify the effectiveness and superiority of the proposed DIN method against the comparison algorithms.

Effectiveness of ADIM
The goal of the ADIM is to integrate the absolute depth information and the RGB image, and to explore complementary cues from the multi-modalities. In order to verify the effectiveness of the ADIM, two networks are trained for comparison. Baseline: we utilize the backbone network as the baseline network, which replaces the ADIM with a concatenation and convolution block and removes the RDIM from the DIN. The baseline network takes the RGB and depth images as inputs and generates a final saliency map.
+ADI M: referred to as the +ADI M network, which takes both RGB and depth images as inputs, and employs ADI M to fuse multi-modal features hierarchically in the encoding stage based on the baseline network.
As shown in Table 3 (1st and 2nd rows), +ADIM outperforms the baseline network by up to 3% on F-measure, 3% on S-measure, 4% on E-measure, and 32% on MAE, verifying the effectiveness of the proposed ADIM. Figure 6 shows visual results of the baseline and +ADIM networks. In the 1st example in Figure 6, the baseline network captures two persons as salient objects (Figure 6d); however, the left one is not salient in either the semantic or the depth space. Thanks to the ADIM, the saliency map of the +ADIM network (Figure 6e) focuses more on the true object. In the 2nd example, the saliency map generated by the baseline network (Figure 6d) wrongly responds to the background regions. In contrast, the ADIM alleviates this drawback by adaptively integrating the RGB and depth features, resulting in more accurate saliency predictions (Figure 6e).

Table 3. Ablation studies in terms of F-measure, S-measure, E-measure, and MAE on the NLPR, NJUD, and STERE datasets.

We also compare the proposed ADIM with existing cross-modal fusion modules in the encoding stage, including the depth attention module (DAM) [19], cross reference module (CRM) [63], and cross-enhanced integration module (CIM) [21]. Specifically, we replace the ADIM in the +ADIM network with DAM, CRM, and CIM, and denote the resulting networks as +DAM, +CRM, and +CIM, respectively. Table 4 (b-d) reports their performance in terms of F-measure, S-measure, E-measure, and MAE. By contrast, the performance of +ADIM in Table 4 is higher than that of the comparison modules by up to 1%, 1%, 2%, and 20% on the four metrics, respectively. We attribute this improvement to the message transmission of latent features across different scales, which is implemented by the GRU-based structure of the ADIM; in contrast, DAM, CRM, and CIM only focus on the information transfer between the two modalities.
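To make the GRU-based cross-scale message transmission concrete, the following is a minimal numpy sketch of a GRU-style gated fusion step that carries a fused state from coarser to finer scales. The dimensions, weight shapes, and the exact gating form are illustrative assumptions, not the paper's precise ADIM definition.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_fusion_step(h_prev, rgb_feat, depth_feat, W_z, W_r, W_h):
    """One GRU-style gated fusion step (illustrative, not the exact ADIM).

    h_prev:     fused state carried over from the previous (coarser) scale
    rgb_feat:   RGB feature vector at the current scale
    depth_feat: absolute-depth feature vector at the current scale
    """
    x = np.concatenate([rgb_feat, depth_feat])       # cross-modal input
    z = sigmoid(W_z @ np.concatenate([h_prev, x]))   # update gate
    r = sigmoid(W_r @ np.concatenate([h_prev, x]))   # reset gate
    h_cand = np.tanh(W_h @ np.concatenate([r * h_prev, x]))
    return (1 - z) * h_prev + z * h_cand             # state passed to the next scale

# Toy dimensions: state size 4, concatenated modality features size 8.
rng = np.random.default_rng(0)
d, m = 4, 8
W_z, W_r, W_h = (rng.standard_normal((d, d + m)) * 0.1 for _ in range(3))
h = np.zeros(d)
for scale in range(3):  # message transmission across three scales
    h = gru_fusion_step(h, rng.standard_normal(4), rng.standard_normal(4), W_z, W_r, W_h)
print(h.shape)  # (4,)
```

The gating lets each scale decide how much of the coarser-scale fused state to keep versus overwrite with the current RGB-depth evidence, which is the mechanism the comparison modules (DAM, CRM, CIM) lack across scales.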

Effectiveness of RDIM
The RDIM aims to refine the multi-modal features by exploring the local contrast information indicated by relative depth values. To verify the effectiveness of the RDIM, we train the +RDIM and +SDIM networks for comparison.
+RDIM: removes all ADIMs in the encoding stage of the DIN.
+SDIM: compared with the +RDIM, in the GCN, the weight on edge e_ij is defined as the spatial distance in 2D space, w_ij = ||(x_i, y_i) − (x_j, y_j)||.
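The distinction between the two variants can be sketched as follows: the +SDIM edge weight uses only the 2D spatial distance above, while a relative-depth-aware weight additionally separates pixels that lie at different depths. The depth-aware form shown here (spatial distance plus a scaled relative-depth term with a hypothetical coefficient `alpha`) is an illustrative assumption, not the paper's exact RDIM formula.

```python
import numpy as np

def spatial_weight(pi, pj):
    """+SDIM edge weight: Euclidean distance between 2D pixel coordinates."""
    (xi, yi), (xj, yj) = pi, pj
    return np.hypot(xi - xj, yi - yj)

def depth_aware_weight(pi, pj, di, dj, alpha=1.0):
    """Illustrative relative-depth-aware weight (assumed form): augment the
    spatial distance with the relative depth |d_i - d_j| between pixels."""
    return spatial_weight(pi, pj) + alpha * abs(di - dj)

# Two pixels at the same 2D distance from a reference pixel but at
# different depths: only the relative-depth term tells them apart.
p0, p1, p2 = (0, 0), (3, 4), (4, 3)
d0, d1, d2 = 0.2, 0.25, 0.9
print(spatial_weight(p0, p1), spatial_weight(p0, p2))  # 5.0 5.0
print(depth_aware_weight(p0, p1, d0, d1), depth_aware_weight(p0, p2, d0, d2))
```

Under the spatial-only weight the two neighbors are indistinguishable; the relative-depth term assigns a larger distance to the pixel on a different depth plane, which is exactly the contrast cue the +SDIM ablation removes.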
As shown in Table 3, compared with the baseline network (1st row), the +RDIM (3rd row) achieves up to 3%, 3%, 4%, and 34% improvements on F-measure, S-measure, E-measure, and MAE, respectively. It is also superior to the +ADIM network (2nd row). This observation indicates that local contrast information is important for exploring detailed image cues and contributes to more accurate predictions. Figure 6 shows the visual results of the +RDIM network. Compared with the saliency results of the baseline and +ADIM networks, those of the +RDIM (Figure 6f) present more accurate details around the edges of the objects. This is because the RDIM employs relative depth information to learn contrastive cues in the local area and improves the discriminative ability of the multi-modal features through GCNs.
We also investigate the importance of relative depth values. In the +SDIM in Table 3, we construct the graph in the RDIM without considering the relative depth values; in other words, we only use the spatial distances of pair-wise pixels to reflect their affinities. According to the quantitative results in Table 3, the performance of +SDIM is lower than that of +RDIM, demonstrating the contribution of the relative depth values. The saliency detection results of the +SDIM in Figure 6g also present cluttered noise around the object boundaries and the background.
We compare the proposed RDIM with two existing modules in the decoding stage, namely the multi-modal feature aggregation (MFA) module [21] and the consistency-difference aggregation (CDA) module [39]. These two modules are applied to the decoding stage of the network to further integrate and enhance the feature representations of the multi-modalities. For a fair comparison, we replace the RDIM in the +RDIM network with the MFA and CDA modules, and denote the resulting networks as +MFA and +CDA, respectively. Table 4 (f-g) reports the quantitative performance of these networks. It can be seen that +RDIM outperforms the comparison networks by up to 1%, 1%, 2%, and 16% in terms of F-measure, S-measure, E-measure, and MAE, respectively. This is because the RDIM utilizes relative depth values to update the feature representation in consideration of its spatial and semantic affinities with other regions, modeling the structural information of the scene, which is crucial for saliency prediction. By comparison, MFA and CDA only make use of absolute depth values and integrate them with RGB features, inevitably introducing cluttered noise because of the large gap between the modalities.
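The affinity-based feature update described above can be sketched as a single graph-propagation step over pixel nodes. The Gaussian affinity combining 2D spatial proximity and relative-depth similarity, and the bandwidths `sigma_s` and `sigma_d`, are illustrative assumptions; the paper's exact RDIM affinity is not reproduced here.

```python
import numpy as np

def gcn_refine(feats, coords, depths, sigma_s=2.0, sigma_d=0.1):
    """One graph-propagation step over pixel nodes (illustrative sketch).

    The affinity between two nodes combines 2D spatial proximity and
    relative-depth similarity (assumed Gaussian forms); features are then
    averaged along the row-normalized affinity graph.
    """
    n = len(feats)
    A = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ds = np.sum((coords[i] - coords[j]) ** 2)   # squared 2D distance
            dd = (depths[i] - depths[j]) ** 2           # squared relative depth
            A[i, j] = np.exp(-ds / sigma_s**2) * np.exp(-dd / sigma_d**2)
    A /= A.sum(axis=1, keepdims=True)  # row-normalize: D^{-1} A
    return A @ feats                   # propagate features along the graph

rng = np.random.default_rng(1)
coords = rng.uniform(0, 4, size=(6, 2))   # 2D positions of 6 pixel nodes
depths = rng.uniform(0, 1, size=6)        # depth value per node
feats = rng.standard_normal((6, 8))       # 8-dim feature per node
refined = gcn_refine(feats, coords, depths)
print(refined.shape)  # (6, 8)
```

Because the depth term shrinks the affinity between nodes on different depth planes, each refined feature aggregates mostly from spatially close nodes at a similar depth, which is how relative depth injects structural contrast into the decoding features.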

Effectiveness of the Combination of ADIM and RDIM
The DIN in Table 3 (5th row) is the full method proposed in this work, which employs both the ADIM and the RDIM. As shown in the quantitative results, the DIN achieves the best performance among all of the evaluated baseline networks, which indicates that the ADIM and RDIM are complementary.
Compared with the saliency maps of the baseline, +ADIM, and +RDIM networks (Figure 6d-f), the prediction results generated by the DIN (Figure 6g) suppress the responses on cluttered background regions and accurately capture entire salient objects with well-defined boundaries. In the 4th example in Figure 6, the baseline network (Figure 6d) is confused by the image background regions and outputs blurry predictions. The +ADIM (Figure 6e) captures both the man and the helicopter as salient objects, since both are salient in the depth space. The +RDIM (Figure 6f) alleviates the erroneous predictions on the image background by exploring the local contrast in both the RGB and depth spaces; however, its prediction is still inaccurate compared with the ground truth. Owing to the reasonable interaction between the ADIM and RDIM, the DIN (Figure 6g) accurately captures the true salient object and eliminates the incorrect responses in the background regions. This observation verifies that the DIN takes full advantage of the complementary nature of the absolute and relative depth information, and thus achieves satisfactory performance.

Figure 7 shows some failure cases of our method. In the first example, the salient object has low depth contrast with its surroundings. In the second example, the depth values of the salient object are not correctly captured by the sensor. The salient object in the third example is hard to recognize in the RGB space. Our proposed DIN fails to detect the true objects with clear boundaries in these scenarios. This is because, in these situations, the depth maps do not provide valuable information and even introduce noisy responses, which limits the accuracy of the saliency model. Moreover, the RDIM, which is based on relative depth values, will magnify such prediction errors.

Conclusions
In this work, we propose a DIN for RGB-D salient object detection. The DIN consists of two main components, the ADIM and the RDIM. The ADIM utilizes a GRU-based method to successively integrate the RGB features and absolute depth values at multiple scales. In the RDIM, we propose a spatial GCN to explore detailed saliency cues with the help of relative depth values and semantic relationships. These two modules are complementary and lead to an effective saliency model that detects entire salient objects with well-defined boundaries. The DIN is a lightweight model because of its single-stream architecture. Extensive experiments show that the performance of the DIN is competitive with state-of-the-art algorithms on six large-scale benchmarks. In future work, we will improve the robustness of the ADIM and RDIM in complex scenarios, and extend the proposed modules to other cross-modal tasks. Moreover, we will explore further applications of RGB-D salient object detection in engineering practice.