SLMSF-Net: A Semantic Localization and Multi-Scale Fusion Network for RGB-D Salient Object Detection

Salient Object Detection (SOD) in RGB-D images plays a crucial role in the field of computer vision, with its central aim being to identify and segment the most visually striking objects within a scene. However, optimizing the fusion of multi-modal and multi-scale features to enhance detection performance remains a challenge. To address this issue, we propose a network model based on semantic localization and multi-scale fusion (SLMSF-Net), specifically designed for RGB-D SOD. Firstly, we design a Depth Attention Module (DAM), which extracts valuable depth feature information from both channel and spatial perspectives and efficiently merges it with RGB features. Subsequently, a Semantic Localization Module (SLM) is introduced to enhance the top-level modality fusion features, enabling the precise localization of salient objects. Finally, a Multi-Scale Fusion Module (MSF) is employed to perform inverse decoding on the modality fusion features, thus restoring the detailed information of the objects and generating high-precision saliency maps. Our approach has been validated across six RGB-D salient object detection datasets. The experimental results indicate improvements of 0.20~1.80%, 0.09~1.46%, 0.19~1.05%, and 0.0002~0.0062, respectively, in the maxF, maxE, S, and MAE metrics, compared to the best competing methods (AFNet, DCMF, and C2DFNet).


Introduction
Salient Object Detection (SOD) plays a crucial role in the field of computer vision, with its primary objective being the identification and accentuation of the most visually engaging objects within a scene [1,2]. These objects typically draw the majority of observer attention and play a vital role in image and video processing tasks, such as object tracking [3,4], image segmentation [5,6], and scene understanding [7,8]. With the rapid advancement of depth sensor technology, RGB-D salient object detection has elicited significant interest among researchers. Compared to using only RGB images, RGB-D datasets offer a richer array of information, including color and depth details, which are invaluable in enhancing the performance of salient object detection. However, achieving accurate salient object detection in complex scenarios, with multi-scale objects and noise interference, continues to present a substantial challenge. Current research is confronted with two main issues:
1. The Modality Fusion Problem: Undoubtedly, depth information opens up significant possibilities for enhancing detection performance. The distance information it provides between objects aids in clearly distinguishing the foreground from the background, thereby endowing the algorithm with robustness in complex scenarios. However, an urgent open challenge is how to fully exploit this depth information and effectively integrate it with the color, texture, and other features of RGB images to extract richer and more discriminative features. This challenge becomes particularly pressing when dealing with incomplete depth information and noise interference, which necessitate further exploration and research.
2. The Multi-Level Feature Integration Problem: To integrate multi-level features more effectively, it is vital to fully consider the characteristics of both high-level and low-level features. High-level features contain discriminative semantic information, which aids in the localization of salient objects, while low-level features are rich in detailed information, beneficial for optimizing object edges. Traditional RGB-D salient object detection methods often fuse features from different levels directly, disregarding their inherent differences. This approach can lead to semantic information loss and make the method vulnerable to noise and background interference. Therefore, there is a need to explore more refined feature fusion techniques that fully account for the characteristics of features at different levels, aiming to boost the performance of salient object detection.
To address the aforementioned challenges, we propose a Semantic Localization and Multi-Scale Fusion Network (SLMSF-Net) for RGB-D salient object detection. SLMSF-Net consists of two stages: encoding and decoding. During the encoding phase, SLMSF-Net utilizes the ResNet50 network to separately extract features from RGB and depth images and employs a depth attention module for modal feature fusion. In the decoding phase, SLMSF-Net first accurately localizes salient objects through a semantic localization module, and then constructs a reverse decoder using a Multi-Scale Fusion Module to restore the detailed information of the salient objects. Our main contributions can be summarized as follows:

1. We propose a depth attention module that leverages channel and spatial attention mechanisms to fully explore the effective information of depth images and enhance the matching ability between RGB and depth feature maps.

2. We propose a semantic localization module that constructs a global view for the precise localization of salient objects.

3. We propose a reverse decoding network based on multi-scale fusion, which implements reverse decoding on modal fusion features and generates detailed information on salient objects through multi-scale feature fusion.
The design of SLMSF-Net addresses key issues in the current RGB-D salient object detection domain and provides new research insights for other tasks within the field of computer vision. Extensive experimental results demonstrate that SLMSF-Net exhibits excellent performance on RGB-D SOD tasks, enhancing the accuracy and effectiveness of salient object detection.

Related Works
In this section, we review research related to the RGB-D salient object detection method that we propose. These related studies can be broadly divided into two categories: salient object detection based on RGB images and salient object detection based on RGB-D images.

Salient Object Detection Based on RGB Images
Salient object detection based on RGB images mainly focuses on visual cues such as color, texture, and contrast. Early saliency detection methods primarily depended on handcrafted features and heuristic rules. For instance, Itti et al. [17] proposed a saliency detection model based on the biological visual system, which estimates saliency by calculating the local contrast of color, brightness, and directional features. Achanta et al. [18] introduced a frequency-tuned salient region detection method, which extracts global contrast features in the frequency domain of the image to detect salient regions. Tong et al. [19] combined global and local cues for salient object detection, using a variety of cues (such as color, texture, and contrast) to handle complex scenarios.
In recent years, deep learning technology has achieved significant success in the field of salient object detection. For example, the deep learning saliency model proposed by Chen et al. [20] accomplishes hierarchical representation of saliency features to realize end-to-end salient object detection. Cong et al. [21] proposed a salient object detection method based on a Fully Convolutional Network (FCN), which uses global contextual information and local detail information for saliency prediction. Hou et al. [22] developed a deeply supervised network for salient object detection, improving upon the Holistically Nested Edge Detector (HED) architecture. They introduced short connections between network layers, enhancing salient object detection by combining low-level and high-level features. Zhao et al. [23] proposed GateNet, a new network architecture for salient object detection. This model introduced multilevel gate units to balance encoder block contributions, suppressing non-salient features and providing context for the decoder. They also included Fold-ASPP to gather multiscale semantic information, enhancing atrous convolution for better feature extraction. Zhang et al. [24] combined neural network layer features to improve salient object detection accuracy in images. Their approach used both coarse and fine image details and incorporated edge-aware maps to enhance boundary detection. Wu et al. [25] proposed a cascaded partial decoder that discarded low-level features to reduce computational complexity while refining high-level features for accuracy.
Moreover, some researchers have applied attention mechanisms to RGB-based salient object detection models, such as [26][27][28]. These methods enable the models to concentrate their attention on the visually prominent regions of the image. Chen et al. [26] presented an approach for enhancing salient object detection through the use of reverse attention and side-output residual learning. This method aimed to refine saliency maps, with a particular focus on improving resolution and reducing the model's size. Wang et al. [27] presented PAGE-Net, a model for salient object detection. The model utilized a pyramid attention module to enhance saliency representation by incorporating multi-scale information, thereby effectively boosting detection accuracy. Additionally, it featured a salient edge detection module, which sharpened the detection of salient object boundaries. Wang et al. [28] introduced PiNet, a salient object detection model designed for enhancing feature extraction and the progressive refinement of saliency. The model incorporated level-specific feature extraction mechanisms and employed a coarse-to-fine process for refining saliency features, which helped overcome common issues in existing methods such as noise accumulation and spatial detail dilution. Although methods based on RGB images can achieve good performance in many situations, they lack the ability to handle depth information.

Salient Object Detection Based on RGB-D Images
With the advancement of depth sensors, RGB-D images (which contain both color and depth information) have been widely applied in salient object detection. For instance, Lang et al. [29] investigated the impact of depth cues on saliency detection, finding that depth information holds significant value for salient object detection. Based on this, many researchers have begun to explore how to fully utilize depth information for salient object detection.
Peng et al. [30] proposed a multi-modal fusion framework that improves saliency detection performance by fusing local and global depth features with color and texture features. Zhang et al. [31] presented a new RGB-D salient object detection model, addressing challenges with depth image quality and foreground-background consistency. The model introduced a two-stage approach: firstly, an image generation stage that created high-quality, foreground-consistent pseudo-depth images, and secondly, a saliency reasoning stage that utilized these images for enhanced depth feature calibration and cross-modal fusion. Ikeda et al. [32] introduced a model for RGB-D salient object detection that integrated saliency and edge features with reverse attention. This approach effectively enhanced object boundary detection and saliency in complex scenes. The model also incorporated a Multi-Scale Interactive Module for improved global image information understanding and utilized supervised learning to enhance accuracy in salient object and boundary areas. Xu et al. [33] introduced a new approach to RGB-D salient object detection, addressing the object-part relationship dilemma in SOD. The proposed CCNet model utilized a convolutional capsule network based on feature extraction and integration to efficiently explore the object-part relationship in RGB-D SOD with reduced computational demand. Cong et al. [34] presented a comprehensive approach to RGB-D salient object detection, focusing on enhancing the interaction and integration of features from both RGB and depth modalities. It introduced a new network architecture that efficiently combined these modalities, addressing challenges in feature representation and fusion. However, these methods overlook the feature differences between different modalities, resulting in insufficient information fusion.
To address this issue, Qu et al. [35] introduced a simple yet effective deep learning model, which learns the interaction mechanism between RGB and depth-induced saliency features. Yi et al. [36] proposed a Cross-stage Multi-scale Interaction Network (CMINet), which intertwines features at different stages with the use of a Multi-scale Spatial Pooling (MSP) module and a Cross-stage Pyramid Interaction (CPI) module. They then designed an Adaptive Weight Fusion (AWF) module for balancing the importance of multi-modal features and fusing them. Liu et al. [37] proposed a cross-modal edge-guided salient object detection model for RGB-D images. This model extracts edge information from cross-modal color and depth information and integrates it into cross-modal color and depth features, generating a saliency map with clear boundaries. Sun et al. [38] introduced an RGB-D salient object detection method that combined cross-modal interactive fusion with global awareness. This method embedded a transformer network within a U-Net structure to merge global attention mechanisms with local convolution, aiming for enhanced feature extraction. It utilized a U-shaped structure for extracting dual-stream features from RGB and depth images, employing a multi-level information reconstruction approach to suppress lower-layer disturbances and minimize redundant details. Peng et al. [39] introduced MFCG-Net, an RGB-D salient object detection method that leveraged multimodal fusion and contour guidance to improve detection accuracy. It incorporated attention mechanisms for feature optimization and designed an interactive feature fusion module to effectively integrate RGB and depth image features. Additionally, the method utilized contour features to guide the detection process, achieving clearer boundaries for salient objects. Sun et al. [40] introduced a new approach for RGB-D salient object detection, leveraging a cascaded and aggregated transformer network structure to enhance feature extraction and fusion. They employed three key modules: the Attention Feature Enhancement Module (AFEM) for multi-scale semantic information, the Cross-Modal Fusion Module (CMFM) to address depth map quality issues, and the Cascaded Correction Decoder (CCD) to refine feature scale differences and suppress noise. Although significant results have been achieved in existing research, it remains a formidable challenge to achieve accurate salient object detection in complex scenes through cross-modal and cross-level feature fusion.

Proposed Method
In this section, we first provide an overview of our method in Section 3.1. Following that, in Section 3.2, we elaborate on the proposed depth attention module, which is used to mine valuable depth information. In Sections 3.3 and 3.4, we introduce the semantic localization module and the reverse decoding network based on the Multi-Scale Fusion Module, respectively. Finally, in Section 3.5, we discuss the loss function.

Overview of SLMSF-Net
Figure 1 displays the overall network structure of SLMSF-Net. Without loss of generality, we adopt ResNet50 [41] as the backbone network to extract features from RGB images and depth images separately. ResNet50 comprises five convolution stages; we remove the final pooling layer and the fully connected layer, resulting in a fully convolutional neural network, and use the outputs of the five convolution blocks as feature outputs. These output feature maps are denoted as M1, M2, M3, M4, and M5, with sizes of 1/2, 1/4, 1/8, 1/16, and 1/32 of the original image, respectively.

1. Modal Feature Fusion: As shown in Figure 1, we propose a Depth Attention Module. This module performs a modal fusion of RGB image features and depth image features, forming the modal fusion features F^Fuse_i (i = 1, ..., 5).

2. Semantic Localization: The top-level modal fusion feature is fed into a Semantic Localization Module, which builds a global view of the image to precisely locate the salient object.

3. Multi-Scale Fusion Decoding: After performing semantic localization, we predict the clear boundaries of the salient object through reverse multi-level feature integration from front to back. To accomplish this multi-level feature integration, we construct a Multi-Scale Fusion Module, which effectively fuses features at all levels.

Depth Attention Module
In the process of fusing RGB and depth features, we need to address two main issues. The first is the modal mismatch problem, which requires us to resolve the modal differences between the two types of features. The second is the information complementarity problem; since RGB and depth features often capture different aspects of object information, we need to consider how to let the two types of features complement each other, aiming to enhance the accuracy and robustness of object detection. Inspired by [42], we designed a depth attention module to improve the matching and complementarity of multi-modal features.
Specifically, F^RGB_i denotes the ith RGB image feature and F^Depth_i the ith depth image feature, where i is a natural number from 1 to 5. As shown in Figure 2, the depth attention module first enhances the depth image feature through channel attention. The enhanced result is then multiplied element-wise with the RGB image feature to obtain the channel-enhanced fusion feature F^CE_i. Next, the channel-enhanced fusion feature undergoes spatial attention enhancement, and the enhanced result is multiplied element-wise with the RGB image feature, yielding the modal fusion feature F^Fuse_i. To enhance the matching of depth features, we stack a depth attention module behind each depth feature branch; by introducing attention units, we strengthen the saliency representation ability of the depth features. The fusion process of the two modal features can be expressed as follows:

F^CE_i = CA(F^Depth_i) × F^RGB_i
F^Fuse_i = SA(F^CE_i) × F^RGB_i

Herein, CA(•) symbolizes the channel attention operation, SA(•) the spatial attention operation, and × element-wise multiplication.
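As an illustration, this fusion can be sketched in PyTorch. The internal layout of the attention units (squeeze-and-excitation-style channel attention with reduction ratio 16, a 7 × 7 spatial-attention kernel) is our assumption, not the authors' exact design:

```python
import torch
import torch.nn as nn


class DepthAttentionModule(nn.Module):
    """Sketch of the Depth Attention Module (DAM): channel attention on the
    depth feature gates the RGB feature, then spatial attention on the
    channel-enhanced result gates the RGB feature again."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        # Channel attention over the depth feature (squeeze-and-excitation style).
        self.channel_att = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )
        # Spatial attention over the channel-enhanced fusion feature.
        self.spatial_att = nn.Sequential(
            nn.Conv2d(2, 1, kernel_size=7, padding=3),
            nn.Sigmoid(),
        )

    def forward(self, f_rgb: torch.Tensor, f_depth: torch.Tensor) -> torch.Tensor:
        # F_CE = CA(F_Depth) x F_RGB  (channel-enhanced fusion feature)
        f_ce = self.channel_att(f_depth) * f_rgb
        # Spatial attention map from channel-wise mean and max of F_CE.
        sa_in = torch.cat([f_ce.mean(dim=1, keepdim=True),
                           f_ce.max(dim=1, keepdim=True).values], dim=1)
        # F_Fuse = SA(F_CE) x F_RGB  (modal fusion feature)
        return self.spatial_att(sa_in) * f_rgb
```

One such module is instantiated per encoder level, matching the channel width of the corresponding feature pair.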

Semantic Localization Module
In the process of salient object localization, high-level features play a crucial role. Compared to low-level features, high-level features are capable of capturing more abstract information, which aids in highlighting the location of salient objects. Therefore, we introduce a semantic localization module designed to effectively learn the global view of the entire image, thereby achieving more precise salient object localization. As depicted in Figure 3, the semantic localization process is divided into three stages: the first stage downsamples the top-level modal fusion features to compute a global view; the second stage carries out coordinate localization on the global view; the third stage fuses the localization information with the global view. In the first stage, we apply a 1/2-scale downsample operation to the top-level modal fusion feature F^Fuse_5, followed by two ConvBR_3×3 operations, obtaining the first layer of the global feature map F^ov_1. We then apply the same 1/2-scale downsample and two ConvBR_3×3 operations to F^ov_1, yielding the second layer of the global feature map F^ov_2. These two global feature maps possess a significantly large receptive field, enabling them to serve as the global view of the entire image. The computation of the global view can be described as follows:

F^ov_1 = ConvBR_3×3(ConvBR_3×3(DownS_1/2(F^Fuse_5)))
F^ov_2 = ConvBR_3×3(ConvBR_3×3(DownS_1/2(F^ov_1)))

Herein, DownS_1/2(•) denotes a 1/2-scale downsample of the input feature map, and ConvBR_3×3(•) represents a convolution with a kernel size of 3 × 3 followed by batch normalization and an activation, where the activation function is ReLU:

ConvBR_3×3(X) = Relu(BN(Conv(X)))

Herein, Conv(•) symbolizes the convolution operation, BN(•) batch normalization, and Relu(•) the ReLU activation function.
In the second stage, for the second layer of the global feature map F^ov_2, we use a pooling kernel of size (1, W) to perform average pooling along the vertical coordinate of the feature map, followed by a convolution with a kernel size of 1 × 1, producing the height-oriented feature map T^H. Simultaneously, we use a pooling kernel of size (H, 1) to perform average pooling along the horizontal coordinate of F^ov_2, followed by a 1 × 1 convolution, producing the width-oriented feature map T^W. This can be described as:

T^H = Conv_1×1(AvgPool_(1,W)(F^ov_2))
T^W = Conv_1×1(AvgPool_(H,1)(F^ov_2))

Feature map T^H is then expanded in the width direction, while T^W is expanded in the height direction. The two expanded feature maps undergo pixel-wise multiplication, and a Sigmoid activation forms the coordinate localization feature map F^C:

F^C = Sigmoid(K(T^H) × K(T^W))

Herein, K(•) expands the input feature map in the width or height direction to match the size of F^ov_2, and Sigmoid(•) signifies the Sigmoid activation function.
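The coordinate localization of the second stage can be sketched as follows; keeping the channel count unchanged through the 1 × 1 convolutions is our assumption:

```python
import torch
import torch.nn as nn


class CoordinateLocalization(nn.Module):
    """Sketch of stage two of the Semantic Localization Module: directional
    average pooling produces height- and width-oriented maps T_H and T_W,
    which are expanded, multiplied, and passed through a sigmoid to yield
    the coordinate localization map F_C."""

    def __init__(self, channels: int):
        super().__init__()
        self.conv_h = nn.Conv2d(channels, channels, kernel_size=1)
        self.conv_w = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, f_ov: torch.Tensor) -> torch.Tensor:
        _, _, h, w = f_ov.shape
        # T_H: (1, W) average pooling, then a 1x1 convolution -> B x C x H x 1
        t_h = self.conv_h(f_ov.mean(dim=3, keepdim=True))
        # T_W: (H, 1) average pooling, then a 1x1 convolution -> B x C x 1 x W
        t_w = self.conv_w(f_ov.mean(dim=2, keepdim=True))
        # K(.) expands both maps to H x W; their product passes through a sigmoid.
        f_c = torch.sigmoid(t_h.expand(-1, -1, h, w) * t_w.expand(-1, -1, h, w))
        return f_c
```

Because F_C lies in (0, 1), it acts as a gating map in the third stage.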
In the third stage, we view the localization feature map as a self-attention mechanism for calibrating the global view. Specifically, we perform a pixel-wise multiplication between the localization feature map F^C and the second layer of the global feature map F^ov_2, followed by a pyramid feature fusion operation on the result, yielding feature map F^of*. Subsequently, we upscale the localization feature map F^C by a factor of two and multiply it pixel-wise with the first layer of the global feature map F^ov_1. The result is stacked with the upscaled F^of*, and the stacked result undergoes a pyramid feature fusion operation, finally producing the global localization fusion feature F^of:

F^of* = Pyramid(F^C × F^ov_2)
F^of = Pyramid(concat(UP(F^C) × F^ov_1, UP(F^of*)))

Herein, Pyramid(•) represents the pyramid feature fusion operation, concat(•) signifies the stacking operation along the channel dimension, and UP(•) denotes upscaling by a factor of two.
The pyramid feature fusion operation is depicted in Figure 4. Initially, we conduct a convolution with a kernel size of 1 × 1, adjusting the number of channels in the input feature map X to 32, which yields the feature map Y. We then extract three dilated features from Y:

P_1 = Y,  P_(i+1) = ConvBR^(2i−1)_3×3(Y),  i ∈ {1, 2, 3}

Herein, ConvBR^(2i−1)_3×3(•) represents a dilated convolution with a kernel size of 3 × 3 and a dilation rate of 2i − 1. We concatenate Y with the three extracted features along the channel dimension, conduct a convolution on the concatenation result with a kernel size of 1 × 1 to restore the channel count of the input feature map, and finally establish a residual connection with the input feature map:

Pyramid(X) = ConvBR_1×1(concat(P_1, P_2, P_3, P_4)) + X
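A minimal PyTorch sketch of the pyramid feature fusion operation, assuming the concatenated features are the reduced map Y itself plus three dilated ConvBR branches with dilation rates 1, 3, and 5; the 32-channel width follows the text:

```python
import torch
import torch.nn as nn


def conv_br(in_ch: int, out_ch: int, k: int, dilation: int = 1) -> nn.Sequential:
    """ConvBR: convolution + batch normalization + ReLU, as defined in the text."""
    pad = dilation * (k // 2)  # keep spatial size unchanged
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, k, padding=pad, dilation=dilation),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )


class PyramidFusion(nn.Module):
    """Sketch of Pyramid(X): reduce channels to 32, extract dilated features,
    concatenate, restore the channel count, and add a residual connection."""

    def __init__(self, channels: int, mid: int = 32):
        super().__init__()
        self.reduce = nn.Conv2d(channels, mid, kernel_size=1)
        # Three dilated ConvBR branches with dilation rates 2i - 1 = 1, 3, 5.
        self.branches = nn.ModuleList(
            [conv_br(mid, mid, 3, dilation=2 * i - 1) for i in (1, 2, 3)]
        )
        self.restore = nn.Conv2d(4 * mid, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.reduce(x)
        feats = [y] + [b(y) for b in self.branches]   # P1..P4
        return self.restore(torch.cat(feats, dim=1)) + x
```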

Multi-Scale Fusion Module and the Reverse Decoding Process
Following semantic localization, we integrate multi-layer features from front to back to delineate the intricate details of the salient object. To achieve this multi-layer feature integration, we designed and constructed a Multi-Scale Fusion Module. The reverse decoder operates in five stages, each accepting the output from the preceding stage for reverse multi-scale fusion decoding. Importantly, the input for the fifth stage of the decoder is the global localization fusion feature F^of. The process of the reverse decoder can be described as follows:

Decode_5 = MSF(concat(F^Fuse_5, F^of))
Decode_i = MSF(concat(F^Fuse_i, UP(Decode_(i+1)))),  i ∈ {1, 2, 3, 4}

Herein, MSF(•) stands for the Multi-Scale Fusion Module. We upscale the output of the first decoder stage to the size of the input image, thereby obtaining the final saliency prediction map:

S = Sigmoid(Conv_1×1(UP_in(Decode_1)))

Herein, S represents the saliency prediction map, UP_in(•) denotes upscaling the feature map to the size of the input image, and Conv_1×1(•) signifies a single-channel convolution with a kernel size of 1 × 1, whose primary purpose is to reduce the channel count of the feature map to 1.
As illustrated in Figure 5, the Multi-Scale Fusion Module comprises four parallel branches and a residual connection. Initially, we employ a convolution with a kernel size of 1 × 1 to reduce the number of channels in the input feature map to 64. In the first branch, we sequentially apply a convolution with a kernel of size 1 × 1, followed by another with a kernel of size 3 × 3. For the i-th (i ∈ {2, 3, 4}) branch, the procedure commences with a convolution involving a kernel of size (2i − 1) × 1, proceeds with another convolution with a kernel of size 1 × (2i − 1), and ends with a dilated convolution with a kernel of size 3 × 3 and a dilation rate of 2i − 1. This design strategy extracts multi-scale information from the multi-modal fusion features, thereby enriching the representational power of the model. Next, the outputs of the four branches are stacked along the channel dimension, and the channel count of the stacked output is adjusted to match that of the input feature map using a convolution with a kernel size of 1 × 1. Finally, the adjusted result is connected to the input feature map via a residual connection. The entire fusion process can be described as follows:

MSF(x) = Conv_1×1(concat(branch_1(x), branch_2(x), branch_3(x), branch_4(x))) + x

Herein, branch_i(x) denotes the ith parallel branch, while x symbolizes the input feature map.
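The module above can be sketched in PyTorch as follows; the branch layout follows the text, while the padding choices are ours, made so that all branch outputs keep the input's spatial size:

```python
import torch
import torch.nn as nn


class MultiScaleFusion(nn.Module):
    """Sketch of the Multi-Scale Fusion Module (MSF): a 1x1 reduction to 64
    channels, four parallel branches with growing receptive fields, a 1x1
    restoration of the channel count, and a residual connection."""

    def __init__(self, channels: int, mid: int = 64):
        super().__init__()
        self.reduce = nn.Conv2d(channels, mid, kernel_size=1)
        # Branch 1: 1x1 convolution followed by a 3x3 convolution.
        branches = [nn.Sequential(nn.Conv2d(mid, mid, 1),
                                  nn.Conv2d(mid, mid, 3, padding=1))]
        # Branches 2-4: (2i-1)x1, then 1x(2i-1), then dilated 3x3 (rate 2i-1).
        for i in (2, 3, 4):
            k = d = 2 * i - 1
            branches.append(nn.Sequential(
                nn.Conv2d(mid, mid, (k, 1), padding=(k // 2, 0)),
                nn.Conv2d(mid, mid, (1, k), padding=(0, k // 2)),
                nn.Conv2d(mid, mid, 3, padding=d, dilation=d),
            ))
        self.branches = nn.ModuleList(branches)
        self.restore = nn.Conv2d(4 * mid, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.reduce(x)
        out = torch.cat([b(y) for b in self.branches], dim=1)
        return self.restore(out) + x
```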

Loss Function
As depicted in Figure 1, at each stage of the decoder, the decoded output is upsampled to the size of the input image, passed through a single-channel convolution with a kernel size of 1 × 1, and then through a sigmoid activation to generate a predicted saliency map. The saliency maps predicted at the five stages are denoted as O_i (i = 1, 2, ..., 5). Following the same process, we also generate the predicted saliency map O^of corresponding to the output of the semantic localization module. This process can be described as follows:

O_i = Sigmoid(Conv_1×1(UP_in(Decode_i))),  i = 1, 2, ..., 5

Assuming the predicted saliency map is denoted as O and the ground truth saliency map as GT, the loss value of a prediction is calculated as follows:

Loss(O, GT) = Bce(O, GT) + Dice(O, GT)

Herein, Bce(•) represents the binary cross-entropy loss function and Dice(•) the Dice loss function [43]. The total loss function during the training phase is:

Loss_total = Σ_{i=1}^{5} α_i × Loss(O_i, GT) + Loss(O^of, GT)

wherein α_i represents the weight coefficients. During the testing phase, O_1 is the final prediction result of the model.
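A minimal sketch of the per-map loss Loss(O, GT) = Bce(O, GT) + Dice(O, GT); the smoothing constant in the Dice term is our assumption:

```python
import torch


def dice_loss(pred: torch.Tensor, gt: torch.Tensor, eps: float = 1.0) -> torch.Tensor:
    """Soft Dice loss on probability maps; eps smooths the empty-mask case."""
    inter = (pred * gt).sum(dim=(1, 2, 3))
    union = pred.sum(dim=(1, 2, 3)) + gt.sum(dim=(1, 2, 3))
    return (1 - (2 * inter + eps) / (union + eps)).mean()


def saliency_loss(pred: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
    """Loss(O, GT) = Bce(O, GT) + Dice(O, GT), applied to sigmoid outputs."""
    bce = torch.nn.functional.binary_cross_entropy(pred, gt)
    return bce + dice_loss(pred, gt)
```

The BCE term supervises each pixel independently, while the Dice term counteracts the foreground-background imbalance typical of saliency masks.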

Experiments
Section 4.1 provides a detailed description of the implementation details, Section 4.2 discusses the sources of the datasets used, Section 4.3 introduces the evaluation metrics, Section 4.4 presents the comparison with current state-of-the-art (SOTA) methods, and Section 4.5 discusses the ablation experiments. Together, these sections form the experimental analysis and evaluation part of the paper, comprehensively demonstrating the effectiveness and reliability of the proposed method.

Implementation Details
The salient object detection method proposed in this paper is implemented in the PyTorch framework [44,45], and all experiments were carried out on a single NVIDIA RTX A6000 GPU (NVIDIA, Santa Clara, CA, USA). The initialization parameters of the backbone, ResNet50, are derived from a model pre-trained on ImageNet [46]. Specifically, both the RGB image branch and the depth image branch use a ResNet50 model for feature extraction, the only difference being that the input channel number of the depth image branch is 1. To enhance the model's generalization capability, augmentation strategies such as random flipping, rotation, and boundary cropping were applied to all training images. Throughout training, the Adam optimizer was employed, with parameters set to β1 = 0.9 and β2 = 0.999, and a batch size of 10. The initial learning rate was set to 1 × 10^−4 and divided by 10 every 50 epochs. The input images were all resized to 768 × 768. The model converged within 200 epochs. To show the training process of our model more clearly, we report the training and validation loss curves of our network in Figure 6.

Evaluation Metrics
We employed four widely used evaluation metrics to compare SLMSF-Net with previous state-of-the-art methods, namely E-measure, F-measure, S-measure, and MAE.

E-measure (E_ξ) is a saliency map evaluation method based on cognitive vision, capable of integrating statistical information at both the image level and the local pixel level. This measurement strategy was proposed by [52] and is defined as follows:

E_ξ = (1 / (W × H)) Σ_{x=1}^{W} Σ_{y=1}^{H} ξ(x, y)

Here, W and H represent the width and height of the saliency map, respectively, while ξ signifies the enhanced alignment matrix. E-measure has three variants: maximum, adaptive, and average E-measure. In our experiments, we used the maximum E-measure (maxE) as the evaluation criterion.

F-measure (F_β) serves as a weighted harmonic mean of Precision and Recall. It is defined as follows:

F_β = ((1 + β²) × Precision × Recall) / (β² × Precision + Recall)

Here, β² is a parameter used to balance Precision and Recall; in this study, we set β² to 0.3. Similar to E-measure, F-measure has three variants: maximum, adaptive, and average F-measure. In our experiments, we reported the maximum F-measure (maxF).

S-measure (S_α) is a method for evaluating structural similarity. It assesses from two perspectives: region awareness (S_r) and object awareness (S_o). It is defined as follows:

S_α = α × S_o + (1 − α) × S_r

Here, α ∈ [0, 1] is a hyperparameter used to balance S_o and S_r; in our experiments, α is set to 0.5.

MAE (Mean Absolute Error) represents the average per-pixel absolute error between the predicted saliency map S and the ground truth map GT. It is defined as follows:

MAE = (1 / (W × H)) Σ_{x=1}^{W} Σ_{y=1}^{H} |S(x, y) − GT(x, y)|

Here, W and H denote the width and height of the saliency map, respectively. MAE is normalized to a value in the [0, 1] interval.
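Two of these metrics, MAE and maxF, can be sketched directly in NumPy. This is a minimal sketch, not the paper's evaluation code: the function names and the 255-step threshold sweep used to obtain the maximum F-measure are our assumptions.

```python
import numpy as np

BETA2 = 0.3  # beta^2 from the text

def mae(sal, gt):
    """Mean absolute error between a [0, 1] saliency map and binary ground truth."""
    return float(np.mean(np.abs(sal - gt)))

def max_f_measure(sal, gt, steps=255):
    """Sweep binarization thresholds over [0, 1] and keep the best F-measure (maxF)."""
    best = 0.0
    for t in np.linspace(0.0, 1.0, steps):
        pred = sal >= t                          # binarize at threshold t
        tp = np.logical_and(pred, gt > 0.5).sum()
        precision = tp / (pred.sum() + 1e-8)
        recall = tp / ((gt > 0.5).sum() + 1e-8)
        f = (1 + BETA2) * precision * recall / (BETA2 * precision + recall + 1e-8)
        best = max(best, float(f))
    return best
```

A perfect prediction yields an MAE of 0 and a maxF approaching 1; the small epsilon terms only guard against division by zero.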

Quantitative Comparison
Figure 7 presents the comparison of PR curves for the different methods, while Table 1 presents the quantitative comparison results on the four evaluation metrics. As shown in the figure and table, our PR curve outperforms those of all other comparison methods on the NJU2K, NLPR, DES, SIP, SSD, and STERE datasets. This advantage is largely attributed to our semantic localization and multi-scale fusion strategies, which, respectively, achieve precise localization of salient objects and capture detailed boundary information. Additionally, our depth attention module can effectively exploit depth information to enhance the model's segmentation performance. The table data reflect the same conclusion: our method outperforms all comparison methods on the NJU2K, NLPR, DES, SIP, SSD, and STERE datasets. Compared with the best competing methods (AFNet, C2DFNet, and DCMF), we improve the MAE, maxF_β, maxE_ξ, and S_α metrics by 0.0002~0.0062, 0.2~1.8%, 0.09~1.46%, and 0.19~1.05%, respectively. Therefore, both the PR curves and the evaluation metrics affirm the effectiveness and superiority of our proposed method for the RGB-D SOD task.
Table 1. Comparison of results for four evaluation metrics (mean absolute error (MAE), maximum F-measure (maxF), maximum E-measure (maxE), and S-measure (S)) across six datasets. The symbol "↑" indicates that a higher value is better for the metric, while "↓" indicates that a lower value is better. The best performance in each row is highlighted in bold.

Qualitative Comparison
For a qualitative comparison, we present a selection of representative visual examples in Figure 8. Our method demonstrates superior performance in several challenging scenarios compared with other methods, including scenes where the foreground and background colors are similar (rows 1-2), complex environments (rows 3-4), scenes with multiple objects (rows 5-6), small object detection (rows 7-8), and low-quality depth images (rows 9-10). These visual examples show that our method can more precisely locate salient objects and generate more accurate saliency maps.

Ablation Studies
As shown in Table 2, we conducted an in-depth ablation analysis to verify the effectiveness of each module. DAM denotes the Deep Attention Module, SLM the Semantic Localization Module, and MSFM the Multi-Scale Fusion Module. "Without DAM", "without SLM", and "without MSFM" refer to the models obtained after removing the DAM, SLM, or MSFM module from SLMSF-Net, respectively. Comparing the third column with the sixth shows that introducing the DAM module significantly improves the performance of the model. Similarly, comparing the fourth and sixth columns shows that the SLM module significantly enhances performance, and comparing the fifth and sixth columns shows that adding the MSFM module improves performance as well. These results confirm the importance of the three modules: DAM introduces depth image information, SLM realizes precise semantic localization of salient objects, and MSFM fuses multi-scale features to refine the boundaries of salient objects. Each of the three modules yields a clear gain in model performance, and the last column shows that the full SLMSF-Net incorporating all three modules achieves the best results.

Table 2. Comparison of ablation study results. The symbol "↑" indicates that a higher value is better for the metric, while "↓" indicates that a lower value is better. The best performance in each row is highlighted in bold.

Conclusions
In complex scenarios, achieving precise RGB-D salient object detection with multi-scale objects and noisy backgrounds remains a daunting task. Current research primarily faces two major challenges: modality fusion and multi-level feature integration. To address these challenges, we propose an innovative RGB-D salient object detection network, the Semantic Localization and Multi-Scale Fusion Network (SLMSF-Net). The network comprises two main stages: encoding and decoding. In the encoding stage, SLMSF-Net uses ResNet50 to extract features from RGB and depth images and employs a depth attention module for the effective fusion of modal features. In the decoding stage, the network precisely locates salient objects through the semantic localization module and restores their detailed information in the reverse decoder via the multi-scale fusion module. Rigorous experimental validation shows that SLMSF-Net exhibits superior accuracy and robustness on multiple RGB-D salient object detection datasets, outperforming existing methods. In the future, we plan to further optimize the model, improve the attention mechanism, refine edge details, and explore its application to RGB-T salient object detection tasks.
We proposed a Semantic Localization Module. This module first downsamples the top-level modal fusion feature to compute a global view, then performs coordinate localization on the global view, and finally fuses the localization information with the global view, thereby precisely locating the salient object. Denoting the semantic localization module by the function SLM, its output can be written as F_of = SLM(F_Fuse).
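The data flow of the three steps above can be followed in a parameter-free NumPy sketch. To be clear about assumptions: the real module uses learned convolutions and attention, so every operation here (average-pool downsampling as the "global view", axis-wise pooling as "coordinate localization", and sigmoid gating as the "fusion") is an illustrative stand-in, not the paper's implementation, and the names `slm` and `f_fuse` are ours.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def slm(f_fuse, pool=2):
    """Illustrative sketch of the SLM data flow for a (C, H, W) feature map."""
    c, h, w = f_fuse.shape
    # 1. Downsample the top-level fusion feature into a coarse "global view"
    #    via non-overlapping average pooling.
    gv = f_fuse.reshape(c, h // pool, pool, w // pool, pool).mean(axis=(2, 4))
    # 2. "Coordinate localization": pool along each spatial axis to obtain
    #    per-row and per-column descriptors of where the object lies.
    row = gv.mean(axis=2, keepdims=True)   # shape (C, H/pool, 1)
    col = gv.mean(axis=1, keepdims=True)   # shape (C, 1, W/pool)
    # 3. Fuse the localization information back into the global view
    #    as a multiplicative gate.
    return gv * sigmoid(row) * sigmoid(col)
```

Applied to a (C, H, W) input, the sketch returns a (C, H/2, W/2) map in which positions supported by both their row and column descriptors are emphasized, mirroring the localize-then-fuse idea described above.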

Figure 1. The overall network architecture of SLMSF-Net.

S represents the saliency prediction map, UP(·) denotes the upscaling of the feature map to the size of the input image, and Conv_{1×1}(·) signifies a single 1 × 1 convolution.

Figure 6. Training and validation loss curve.

Figure 7. Comparison of precision-recall (P-R) curves for different methods across six RGB-D datasets. Our SLMSF-Net method is represented by a solid red line.


Figure 8. Visual comparison between SLMSF-Net and state-of-the-art RGB-D models.
