Lightweight Spatial-Frequency Collaborative Interaction Network for RGB-D Salient Object Detection

Lu, Yitong; Cui, Ziguan

doi:10.3390/s26123708

by Yitong Lu ¹ and Ziguan Cui ²,*

¹ Portland Institute, Nanjing University of Posts and Telecommunications, Nanjing 210023, China

² College of Telecommunications and Information Engineering, Nanjing University of Posts and Telecommunications, Nanjing 210003, China

* Author to whom correspondence should be addressed.

Sensors 2026, 26(12), 3708; https://doi.org/10.3390/s26123708

Submission received: 16 April 2026 / Revised: 26 May 2026 / Accepted: 9 June 2026 / Published: 10 June 2026

(This article belongs to the Section Sensing and Imaging)

Abstract

RGB-D salient object detection (SOD) aims to segment the most prominent objects from the background given a pair of RGB and depth images. Existing RGB-D methods usually rely on heavy backbones to achieve high accuracy, while current lightweight methods struggle to maintain competitive performance. To break this intractable trade-off between effectiveness and model complexity, we propose a Lightweight Spatial-Frequency Collaborative Interaction Network (SFCINet), a unified and highly efficient framework. The core of SFCINet resides in the synergy between spatial-domain features and frequency-domain global priors. Specifically, we introduce the Spatial-Frequency Synergy (SFS) module, which shifts the perspective to a joint complex Fourier domain. By adaptively learning and optimizing the decoupled amplitude and phase components, it effectively isolates clutter to yield a purified global frequency-synergized prior, which modulates the spatial branches to eliminate cross-modal discrepancies for subsequent feature fusion while supplementing global information during decoding. To alleviate the interference caused by cross-modal representation discrepancies, we design the Cross-Modal Guidance Interaction (CMGI) module, which employs a reciprocal anchoring mechanism. Each modality guides its counterpart to filter irrelevant noise and select task-relevant information, achieving fusion in an efficient manner. Finally, we present a Calibrated Hierarchical Decoder (CHD), which injects frequency-synergized global priors into the hierarchical decoding process. It re-establishes the connection between the frequency and spatial domains, ultimately achieving global-local consistency. Extensive experiments demonstrate that SFCINet delivers superior performance over state-of-the-art methods.

Keywords:

salient object detection; lightweight; frequency domain; RGB-D images

1. Introduction

Salient object detection (SOD) focuses on detecting the most distinctive objects or regions within an image and is a crucial task in computer vision. It simulates the human visual system (HVS) and segments the most prominent parts from the background, which benefits many applications, including quality assessment [1], autonomous driving [2], semantic segmentation [3], visual tracking [4], and video summarization [5].

In the past several years, RGB-D salient object detection methods have developed rapidly, using depth cues to enhance the recognition and localization ability of SOD methods. They effectively mitigate the challenges posed by low texture contrast or complex backgrounds. Yi et al. [6] designed a novel method for RGB-D salient object detection that achieves bidirectional selection between RGB and depth information. It fully exploits the rich textural information from RGB and the strong structural information embedded in depth. Wei et al. [7] proposed PDNet to overcome the limitations of traditional RGB-based methods in complex background environments by leveraging the abundant spatial and positional information provided by depth images to enhance the separation of salient objects. Although these methods perform well on standard metrics, most of them rely on large backbones, and their substantial computational overhead creates high deployment barriers. Consequently, balancing accuracy and model complexity remains a major issue. Lightweight RGB-D SOD methods have been developed to address the problem by adopting and improving more efficient frameworks. However, compared with heavyweight networks, their performance remains unsatisfactory, and they struggle to handle complex scenes. This motivates us to explore complementary cues beyond the spatial domain to compensate for the limited representation capacity of lightweight architectures.

Specifically, frequency information has proven effective in capturing global contrast and structural boundaries. Traditional models based on early hand-crafted features successfully leveraged this characteristic, yielding remarkable results. For example, Achanta et al. [8] proposed a frequency-tuned approach that computes pixel-level saliency by measuring the color difference between the global mean image feature and Gaussian-blurred pixel vectors. It is simple to implement and is computationally efficient, which retains more boundary information. Ma et al. [9] introduced wavelet transforms into saliency analysis, decomposing images into multiple frequency sub-bands. It possesses the capacity to simultaneously extract local high-frequency edge details and global low-frequency spatial structures. However, these traditional models are inherently constrained by their reliance on hand-crafted features, even when incorporating frequency-domain analysis. Lacking high-level semantic guidance, the representation capacity of these hand-crafted features is insufficient to distinguish genuine salient objects from high-frequency background interference. With the rise of deep learning, researchers have begun integrating frequency analysis into neural architectures to enhance semantic representation. For instance, Jin et al. [10] proposed FCMNet, which utilizes frequency-aware attention to strengthen cross-modal interaction. It achieves synchronization of multi-modal features by highlighting shared salient components in the spectral domain. However, this approach typically employs frequency information only for global weighting and lacks a mechanism to distinguish modal-specific spectral discrepancies. Specifically, the inherent high-frequency noise in depth maps is often indiscriminately amplified during spectral interaction, leading to severe feature contamination. Consequently, achieving a purified and efficient spatial-frequency synergy remains a critical challenge.

To overcome the aforementioned limitations and achieve an effective spatial-frequency synergy, we identify several core bottlenecks that must be resolved. The first major challenge in lightweight RGB-D SOD hinges on the intrinsic conflict between global perception and computational complexity. Due to the limited receptive field of convolution, CNN-based architectures are incapable of effectively capturing global contextual features, failing to model the holistic contours and global structures of salient objects. Although Transformer-based models have alleviated the limitation by benefiting from self-attention mechanisms, their computational overhead remains prohibitively high for practical deployment. To break the bottleneck, we propose the Spatial-Frequency Synergy (SFS) module, shifting the perspective from the spatial domain to the frequency domain. By employing the Fast Fourier Transform (FFT), SFS maps the spatial features into a decoupled amplitude-phase spectral space, effectively capturing global signals in the spectral domain with much lower complexity. Unlike conventional uniform frequency processing, SFS exploits the distinct physical properties of these frequency components. In the Fourier spectral domain, the amplitude spectrum governs global energy distribution, macro-textures, and illumination styles, while the phase spectrum preserves the critical geometric skeletons, absolute object boundaries, and spatial topological structures. Depth maps lack fine texture and suffer from hole noise and artifacts on their high-frequency bands, while their low-frequency components hold macro-contours. Conversely, the high-frequency bands of RGB maps embrace sharp structural boundaries while their lower frequencies are heavily susceptible to deceptive semantic clutter and background distractions in complex scenes. 
Through joint optimization in the complex frequency domain, the amplitude path adaptively acts as a targeted spectral filter to suppress the high-frequency noise contaminating the depth amplitude spectrum, thereby recovering the pristine global structural energy. The phase path leverages the discriminative RGB phase spectrum as an immutable geometric compass to structurally rectify and align the misaligned depth contours. Upon returning to the spatial domain, instead of naively fusing these features prematurely, SFS yields a purified global frequency-synergized prior alongside two enhanced single-modal spatial representations. This global prior is actively deployed to mutually modulate the spatial branches of both modalities, eliminating inherent feature discrepancies. Consequently, it ensures that the subsequent CMGI module can perform cross-modal feature fusion with enhanced efficiency and robustness across a structurally aligned spatial space. Concurrently, this prior is explicitly preserved to supplement macro-structural cues during the top-down decoding process.

The second challenge concerns the effective fusion of multi-modal features. It is well known that depth maps often contain substantial noise and exhibit low-texture characteristics. Direct fusion of depth and RGB features leads to feature contamination, significantly degrading the performance and robustness of the model. To handle this, we introduce the Cross-modal Guidance Interaction (CMGI) module. We adopt a reciprocal anchoring mechanism to constrain feature interaction. Utilizing the structurally consistent features refined by the SFS module, we derive a spatial confidence map for each modality. Rather than self-modulation, the spatial confidence derived from one modality acts as an immutable physical ‘anchor’ to explicitly constrain and regularize the valid feature activation space of another modality. The RGB branch enforces a gated suppression on the corrupted region where depth maps suffer from inherent noise. Meanwhile, where the RGB features inevitably encounter deceptive semantic clutter or foreground-background confusion, the invariant anchor from the depth branch forcefully constrains the RGB feature space, preventing its attention from drifting. This complementary bidirectional bounding enforces a topological interception of the information flow, blocking the modal-specific clutter from both sides.

Furthermore, to obtain high-quality saliency maps, we design the Calibrated Hierarchical Decoder (CHD) to integrate the global frequency-synergized prior from the SFS module into the hierarchical decoding process. Unlike standard decoders, it uses these spectral priors as a holistic geometric reference to govern the progressive feature reconstruction, preventing structural drift during multi-scale feature fusion. Rather than merely recovering lost global context, this global-to-local guidance mechanism ensures that fine-grained spatial details are always regularized by the holistic structural consensus, making SFCINet a unified framework in which spectral-domain priors and spatial-domain features operate in organic synergy. Our main contributions can be summarized as follows:

We propose SFCINet, an efficient and unified framework for RGB-D SOD that enables spatial-frequency information interaction and synergistic fusion. By establishing a robust global-to-local guidance mechanism, it injects a global frequency-synergized prior into the decoding procedure, effectively mitigating structural drift during multi-scale feature fusions.
We present the SFS module, which shifts the perspective to the complex Fourier domain to capture holistic contextual features. By exploiting the distinct physical properties of amplitude and phase frequency spectral components, SFS suppresses depth noise while capturing sharp RGB structural boundaries, resolving the multi-modal spectral heterogeneity. It crystallizes a purified global frequency-synergized prior, and simultaneously calibrates the spatial features.
We introduce the CMGI module to achieve cross-modal fusion without contamination. Driven by a reciprocal anchoring mechanism, it utilizes the spatial confidence derived from the SFS-enhanced features as an invariant structural constraint to mutually gate both modalities, successfully blocking localized noise and deceptive semantic clutter.

2. Related Work

2.1. RGB Salient Object Detection

SOD for RGB images has been extensively studied for many years, and existing methods can be divided into two main categories: traditional methods and deep learning methods. Traditional RGB SOD methods mainly rest on hand-crafted features, such as boundary background priors and texture. However, the constrained representation capacity of hand-crafted features restricts the performance of traditional SOD methods, which has shifted the focus of SOD towards sophisticated deep learning architectures. Early Convolutional Neural Network (CNN)-based SOD models show outstanding performance in extracting local textures. MENet [11] achieved impressive results by implementing multiple enhancement strategies across pixels, regions, and objects to refine feature representations. However, due to the limited receptive field of the convolution kernel, CNNs have insufficient ability to capture global context features. Over the past few years, Vision Transformers (ViTs) have been employed to solve the problem. For example, VST [12] leveraged self-attention mechanisms for robust long-range dependency modeling, which excels at capturing global context. However, the significant computational cost of these heavyweight models often limits their practicality. Recently, various lightweight SOD methods have emerged to strike a better balance between efficiency and accuracy. For example, Wang et al. proposed LARNet [13], which introduces a brain-inspired context gating module to realize deep multi-level feature fusion at the global level. Wang et al. [14] presented an extremely lightweight wavelet neural network named ELWNet, which integrates wavelet transform modules (WTMs) and fusion modules (WTFMs) into a convolutional architecture. Although the aforementioned methodologies are primarily designed for single RGB images, they provide profound implications for RGB-D SOD research.

2.2. RGB-D Salient Object Detection

Distinct from RGB SOD [15], RGB-D SOD [16] methods integrate depth maps as a supplementary modality, leveraging spatial geometry and structural information to facilitate salient object detection. Modern high-performance models often employ dual-stream backbones, CNN or Transformer, to extract features independently before performing complex cross-modal feature fusion, which relies on large model size and high computational cost. For example, Wei et al. [17] utilized dual ResNet-50 backbones to explore the consistency and complementarity between modalities through a modal-aware interaction mechanism. Zhou et al. [18] developed IRFR-Net, which reshaped features recursively to refine saliency maps across multiple levels. To capture long-range dependencies, Liu et al. proposed VST [12], a pure Transformer-based framework that leverages multi-level Transformer blocks to unify feature extraction and fusion. Furthermore, Wang et al. [13] presented an adaptive fusion bank for multi-modal SOD, which dynamically coordinated diverse fusion strategies to enhance the robustness of feature integration. Though these methods demonstrated remarkable accuracy, their architectures often face two fundamental bottlenecks. First, capturing global contextual dependencies via spatial self-attention intrinsically incurs significant parameter redundancy and high computational latency, posing insurmountable latency challenges for real-time inference on resource-constrained devices. Second, most existing fusion mechanisms ignore the inherent information discrepancy between the two modalities. When conducting spatial alignments, the inherent high-frequency noise and structural inaccuracies in depth maps easily act as misleading guidance. This false induction misdirects the RGB branch to extract erroneous features, inevitably leading to severe cross-modal contamination.

2.3. Lightweight RGB-D Salient Object Detection

The majority of RGB-D SOD methods employ heavyweight architectures, such as ResNet101, Swin Transformer, and ViT, to achieve higher accuracy, posing a heavy burden on resource-constrained edge devices. To address this challenge, lightweight RGB-D SOD has emerged as a critical research direction. These methods replace heavyweight encoders with efficient backbones like MobileNetV2 [19], ShuffleNet [20], or MobileViT [21]. For instance, Jin et al. proposed MoADNet [22], which introduces a mobile asymmetric dual-stream encoder to minimize computational redundancy. By assigning disparate computational budgets to the RGB and depth branches based on their information density, MoADNet preserves essential modal features while significantly lowering the overall FLOPs compared to conventional symmetric dual-stream counterparts. Shifting away from the traditional dual-branch paradigm, Zhang et al. [23] designed a robust single-stream architecture. This approach streamlines the feature extraction and fusion process into a unified pipeline, effectively eliminating the parameter redundancy inherent in parallel encoders and achieving high-speed real-time inference on resource-constrained edge devices. Furthermore, Zhang et al. [24] introduced the MCCNet. It leverages a suite of cross-modal complementation modules to exploit the fine-grained, reciprocal interactions between RGB and depth cues, refining saliency maps in a computationally efficient manner. These methodologies underscore that through meticulous structural optimization and targeted feature interaction, lightweight models can achieve competitive performance while remaining viable for industrial deployment.

2.4. Applications of Frequency Information

Recently, the exploration of frequency information has witnessed remarkable advancements in image processing. It shifts the perspective from the spatial domain to the frequency domain, endowing the model with excellent global awareness. Unlike spatial convolutions that are restricted by local receptive fields, the Fast Fourier Transform (FFT) allows models to capture long-range dependencies across the entire image in the frequency domain. In contrast to the self-attention mechanism, FFT has a much lower computational complexity of $O(N \log N)$ instead of $O(N^{2})$. Driven by this efficiency, several efforts have been made to integrate spectral analysis into deep learning frameworks. For instance, DeepRFT [25] designs Fourier convolution-based residual blocks to capture multi-level frequency information, effectively expanding the receptive field for high-quality image deblurring. Qiao et al. [26] proposed a Fourier-based framework for unpaired image restoration, which learns depth-density priors within the spectral domain to effectively recover structural details and suppress complex degradations. In specific detection tasks, Zhao et al. [27] highlighted that objects and backgrounds with high spatial similarity become more discriminative when processed in the frequency domain. Furthermore, for multi-modal scenarios, Zhou et al. [28] introduced a lightweight framework for RGB-Thermal SOD that jointly mines spatial, channel, and frequency-domain cues, validating the viability of frequency-aware mining for resource-constrained multi-modal tasks. FCMNet [10] extends the frequency channel attention mechanism into the RGB-D pipeline by computing 2D DCT coefficients independently for each single-modal stream. However, these existing multi-modal frequency methods treat the frequency spectrum as a unified whole, focusing on capturing and aggregating information across isolated high- or low-frequency bands. Because different image modalities exhibit entirely different physical properties in their respective high- and low-frequency components, treating the spectrum as a uniform whole poses a severe risk of cross-modal feature contamination. Specifically, in RGB-D tasks, the high-frequency band of the depth stream is heavily dominated by localized noise and measurement artifacts, whereas the high-frequency band of the RGB stream contains pristine, highly discriminative object boundaries. Without a joint spectral space to coordinate the cross-modal consensus, these frameworks cannot decouple the intertwined modal anomalies, inevitably causing severe noise or deceptive background clutter to propagate into the subsequent multi-modal fusion.
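To give a feel for the complexity gap between self-attention and the FFT cited above, the following minimal sketch compares the two growth rates for a few feature-map sizes. The sizes are hypothetical and constant factors are ignored, so the numbers indicate orders of magnitude only, not measured cost.

```python
import math


def attention_pairwise_cost(num_positions: int) -> int:
    """Pairwise interactions in full self-attention: N^2."""
    return num_positions * num_positions


def fft_cost(num_positions: int) -> float:
    """Order-of-magnitude FFT cost N * log2(N); constants are ignored."""
    return num_positions * math.log2(num_positions)


# Hypothetical square feature maps; N is the number of spatial positions.
for side in (8, 16, 32, 64):
    n = side * side
    ratio = attention_pairwise_cost(n) / fft_cost(n)
    print(f"{side}x{side}: N={n}, N^2={attention_pairwise_cost(n)}, "
          f"N*log2(N)={fft_cost(n):.0f}, ratio={ratio:.1f}")
```

The ratio grows as $N / \log_2 N$, which is why the gap widens quickly at higher feature resolutions.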

To overcome these limitations, our proposed SFS module executes a targeted amplitude-phase separation to adaptively optimize these fundamental frequency spectral components. Crucially, instead of treating the refined results as isolated descriptive features, we formulate them as a robust global frequency-synergized prior that governs twofold core functions. On one hand, this global prior is actively deployed to mutually modulate the spatial representations of both modalities instead of fusing with them. It aims to derive a reliable cross-modal consensus, thereby seamlessly guiding and enabling the subsequent CMGI module to execute a far more optimal and robust feature fusion. On the other hand, it concurrently acts as an overarching global prior directly injected into the subsequent Decoder stage to explicitly supplement the structural consensus during top-down feature reconstruction. By explicitly partitioning and purifying these components, the SFS effectively isolates single-modal clutter while faithfully retaining the multi-modal structural consensus and global frequency-synergized priors.

3. Proposed Method

In this section, we first elaborate on the overall architecture of SFCINet. Then, we introduce the main modules in turn, including the SFS module, the CMGI module, and the CHD. Finally, we present the loss functions.

3.1. Overall Architecture

As shown in Figure 1, the overall architecture of SFCINet follows an encoder-decoder design, which is mainly composed of four components: a dual-branch encoder, SFS, CMGI, and CHD. The depth images are copied into three channels to match the dimensions of RGB images. We employ MobileViT as the backbone to extract multi-scale features from RGB and depth images, denoted as $r_{i}$ and $d_{i}$ ($i = 1, 2, 3, 4$). These features are fed into the SFS module in pairs, obtaining frequency-aware refined spatial features $r_{i}^{e}$ and $d_{i}^{e}$, as well as a global frequency-synergized prior $\mathrm{Prior}$, enhancing and calibrating features under the coordination of the frequency and spatial domains. Later, they flow into the CMGI module, resulting in the fused feature $f_{i}$. Here, the two branches perform reciprocal guidance to purify these features by autonomously selecting and suppressing the characteristics of the other modality. By interacting on these already-purified features instead of raw inputs, CMGI avoids spreading single-modal noise into the other stream. Finally, CHD is employed to integrate the multi-level fused features and the global frequency-synergized prior $\mathrm{Prior}$, which acts as a reliable global layout to correct any remaining structural errors during the hierarchical decoding process, ultimately achieving global-local consistency.

3.2. Spatial-Frequency Synergy (SFS) Module

The traditional convolutional receptive field is limited, while the self-attention mechanism is computationally expensive, which is not conducive to lightweight networks. To capture global contextual dependencies efficiently, the SFS module maps the spatial features into the frequency domain. As shown in Figure 2, we first fuse $r_{i}$ and $d_{i}$ ($i = 1, 2, 3, 4$) to form a cross-modal representation $f_{i}^{cat}$, and a channel attention operator is used to reweight the concatenated channels. Then, we apply the 2D Fast Fourier Transform (FFT) to transform it from the spatial domain to the frequency domain, which can be formulated as follows:

$$f_{i}^{cat} = \mathrm{Cat}(r_{i} + d_{i}, r_{i}, d_{i}), \tag{1}$$

$$f_{i}' = \mathrm{Conv}_{1 \times 1}(\mathrm{CA}(f_{i}^{cat}) \otimes f_{i}^{cat}), \tag{2}$$

$$A, P = \mathrm{FFT}(f_{i}'), \tag{3}$$

where $\mathrm{Cat}(\cdot)$ denotes concatenation, $\mathrm{CA}(\cdot)$ denotes channel attention, and $\mathrm{Conv}_{1 \times 1}(\cdot)$ represents $1 \times 1$ convolution. $\mathrm{FFT}(\cdot)$ denotes the 2D Fast Fourier Transform, which generates the amplitude $A$ and the phase $P$ of $f_{i}'$ in the frequency domain.
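The amplitude-phase decomposition in Eq. (3) can be illustrated on a single-channel toy map. The sketch below is not the authors' implementation: it uses a naive pure-Python 2D DFT instead of an FFT library call, and the `feature` values and helper names are hypothetical. It only shows that splitting a spectrum into amplitude and phase loses nothing, since the spatial map is recovered after merging them again.

```python
import cmath
import math


def dft2(x):
    """Naive 2D discrete Fourier transform of a real H x W grid."""
    h, w = len(x), len(x[0])
    return [[sum(x[m][n] * cmath.exp(-2j * math.pi * (u * m / h + v * n / w))
                 for m in range(h) for n in range(w))
             for v in range(w)] for u in range(h)]


def idft2(spec):
    """Naive inverse 2D DFT; returns the real part of the result."""
    h, w = len(spec), len(spec[0])
    return [[sum(spec[u][v] * cmath.exp(2j * math.pi * (u * m / h + v * n / w))
                 for u in range(h) for v in range(w)).real / (h * w)
             for n in range(w)] for m in range(h)]


def split(spec):
    """Decompose a complex spectrum into amplitude A and phase P."""
    amp = [[abs(c) for c in row] for row in spec]
    pha = [[cmath.phase(c) for c in row] for row in spec]
    return amp, pha


def merge(amp, pha):
    """Rebuild the complex spectrum A * exp(jP)."""
    return [[cmath.rect(a, p) for a, p in zip(ra, rp)]
            for ra, rp in zip(amp, pha)]


# Hypothetical 4 x 4 single-channel feature map.
feature = [[0.0, 1.0, 2.0, 1.0],
           [1.0, 3.0, 4.0, 2.0],
           [0.0, 2.0, 5.0, 1.0],
           [1.0, 0.0, 1.0, 0.0]]

amp, pha = split(dft2(feature))
restored = idft2(merge(amp, pha))
max_err = max(abs(a - b) for ra, rb in zip(feature, restored)
              for a, b in zip(ra, rb))
print(f"DC amplitude = {amp[0][0]:.1f}, round-trip error = {max_err:.2e}")
```

The zero-frequency amplitude equals the sum of all entries of the map, which is one concrete sense in which the amplitude spectrum carries global energy. In a real network the amplitude and phase would each pass through learned layers before the inverse transform.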

Afterward, the amplitude and phase are learned and optimized separately in a dual-path network, generating a global frequency-synergized prior $\mathrm{Prior}$. $\mathrm{Prior}$ is used for decoding, and it also generates a set of gating weights $\{m_{r}, m_{d}\}$ through convolutional learning. The above process can be expressed as follows:

$$A_{e} = \mathrm{DWConv}_{3 \times 3}(\mathrm{BN}(\mathrm{ReLU}(A))), \tag{4}$$

$$P_{e} = \mathrm{DWConv}_{3 \times 3}(\mathrm{BN}(\mathrm{ReLU}(P))), \tag{5}$$

$$\mathrm{Prior} = \mathrm{IFFT}(A_{e}, P_{e}), \tag{6}$$

$$\{m_{r}, m_{d}\} = \sigma(\mathrm{DWConv}_{3 \times 3}(\mathrm{BN}(\mathrm{ReLU}(\mathrm{Prior})))), \tag{7}$$

where $\mathrm{DWConv}_{3 \times 3}(\cdot)$ denotes $3 \times 3$ depthwise convolution; $\mathrm{BN}(\cdot)$ denotes batch normalization; $\mathrm{ReLU}(\cdot)$ represents the rectified linear unit; $\mathrm{IFFT}(\cdot)$ denotes the 2D inverse Fast Fourier Transform; and $\sigma$ represents the sigmoid activation function.

The raw inputs also undergo convolutional learning and are then modulated by the gating weights. A residual connection is also employed to enhance the representation. This process reconciles spatial textures with global frequency layouts, yielding the enhanced and calibrated features, which can be formulated as follows:

$$r_{i}' = \mathrm{DWConv}_{3 \times 3}(\mathrm{BN}(\mathrm{ReLU}(r_{i}))), \tag{8}$$

$$d_{i}' = \mathrm{DWConv}_{3 \times 3}(\mathrm{BN}(\mathrm{ReLU}(d_{i}))), \tag{9}$$

$$r_{i}^{e} = r_{i}' \otimes m_{r} + r_{i}', \tag{10}$$

$$d_{i}^{e} = d_{i}' \otimes m_{d} + d_{i}'. \tag{11}$$
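Equations (10) and (11) can be read as a per-element gain of $1 + m$: because the sigmoid output lies in $(0, 1)$, each refined feature is kept at least at its original strength and at most doubled. The following minimal sketch illustrates this on toy values; the feature and logit values are hypothetical, and the learned convolutions of Eqs. (7) to (9) are omitted.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def gated_residual(feat, gate_logits):
    """Element-wise feat * sigmoid(logit) + feat, the form of Eqs. (10)-(11)."""
    return [[f * sigmoid(g) + f for f, g in zip(row_f, row_g)]
            for row_f, row_g in zip(feat, gate_logits)]


# Hypothetical refined RGB feature and pre-sigmoid gate values.
r_prime = [[1.0, -2.0], [0.5, 4.0]]
logits = [[-6.0, 0.0], [6.0, 0.0]]

r_enh = gated_residual(r_prime, logits)
for row in r_enh:
    print([round(v, 4) for v in row])
```

A strongly negative logit leaves the feature almost unchanged, a logit of zero scales it by 1.5, and a strongly positive logit nearly doubles it, so the gate emphasizes positions the prior deems reliable without ever zeroing a feature out.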

The SFS module serves as a spectral-spatial bridge that reconciles local modality features with global structural consensus. By enriching the spatial features with frequency-domain characteristics, it narrows the discrepancy between the two modalities ahead of the subsequent cross-modal interaction.

3.3. Cross-Modal Guidance Interaction (CMGI) Module

Depth maps and RGB images have distinct weaknesses: the depth map often contains substantial noise, while the RGB image may have a cluttered background. Simple addition or concatenation therefore often leads to feature contamination and loss. To alleviate these problems, we design the CMGI module, as illustrated in Figure 3, which utilizes the spatial confidence of one modality to anchor the other, ensuring the efficient transmission of information.

First, the processed features $r_{i}^{e}$ and $d_{i}^{e}$ are refined by a set of convolutions. Then, we introduce a spatial attention (SA) mechanism to derive the spatial confidence maps $\{M_{r}, M_{d}\}$. The RGB and depth branches are then each constrained by the other, in order to reach a consensus on features. The process can be expressed as follows:

$$R_{i}^{e} = \mathrm{DWConv}_{3 \times 3}(\mathrm{BN}(\mathrm{ReLU}(r_{i}^{e}))), \tag{12}$$

$$D_{i}^{e} = \mathrm{DWConv}_{3 \times 3}(\mathrm{BN}(\mathrm{ReLU}(d_{i}^{e}))), \tag{13}$$

$$M_{r} = \mathrm{SA}(R_{i}^{e}), \tag{14}$$

$$M_{d} = \mathrm{SA}(D_{i}^{e}), \tag{15}$$

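$$R_{i}^{D} = M_{d} \otimes R_{i}^{e}, \tag{16}$$

$$D_{i}^{R} = M_{r} \otimes D_{i}^{e}, \tag{17}$$

where $\mathrm{SA}(\cdot)$ denotes the spatial attention mechanism. In line with the reciprocal anchoring described above, the confidence map of each modality gates the features of the other: $R_{i}^{D}$ is the RGB feature anchored by depth, and $D_{i}^{R}$ is the depth feature anchored by RGB.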

After concatenation, the gated features are compressed by a $1 \times 1$ convolution and further refined, and the result is finally combined with residual connections from both branches, thereby enhancing the feature representation and stability, which can be formulated as follows:

$$F_{i} = \mathrm{Conv}_{1 \times 1}(\mathrm{Cat}(R_{i}^{D}, D_{i}^{R})), \tag{18}$$

$$F_{i}' = \mathrm{DWConv}_{3 \times 3}(\mathrm{BN}(\mathrm{ReLU}(F_{i}))), \tag{19}$$

$$f_{i} = F_{i}' + R_{i}^{e} + D_{i}^{e}. \tag{20}$$

Through cross-modal transmission and optimization of information, CMGI provides the decoder with consensus-driven local details. This ensures that the decoder can focus purely on top-down decoding with the global spectral priors for final saliency restoration.
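The reciprocal anchoring idea can be sketched on toy tensors. The code below follows the textual description, in which each modality is gated by the spatial confidence of the other; this reading is an interpretation rather than a confirmed implementation. Several simplifications are assumptions: features are stored as channels of flattened positions, `spatial_confidence` replaces the learned spatial attention with a sigmoid of the channel mean, and a plain average replaces the learned $1 \times 1$ and depthwise convolutions.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def spatial_confidence(feat):
    """Stand-in for SA: sigmoid of the channel-wise mean at each position."""
    channels = len(feat)
    return [sigmoid(sum(feat[c][p] for c in range(channels)) / channels)
            for p in range(len(feat[0]))]


def gate(feat, conf):
    """Multiply every channel by a per-position confidence map."""
    return [[v * m for v, m in zip(channel, conf)] for channel in feat]


def cross_guided_fusion(rgb, depth):
    m_r = spatial_confidence(rgb)
    m_d = spatial_confidence(depth)
    rgb_by_depth = gate(rgb, m_d)    # depth confidence anchors RGB
    depth_by_rgb = gate(depth, m_r)  # RGB confidence anchors depth
    # Stand-in for the learned fusion layers: average the two gated streams.
    fused = [[0.5 * (a + b) for a, b in zip(cr, cd)]
             for cr, cd in zip(rgb_by_depth, depth_by_rgb)]
    # Residual connections from both branches, in the form of Eq. (20).
    return [[f + a + b for f, a, b in zip(cf, cr, cd)]
            for cf, cr, cd in zip(fused, rgb, depth)]


# Hypothetical features: 2 channels x 2 spatial positions.
rgb = [[2.0, 2.0], [2.0, 2.0]]
depth = [[4.0, -4.0], [4.0, -4.0]]

rgb_anchored = gate(rgb, spatial_confidence(depth))
fused = cross_guided_fusion(rgb, depth)
print([round(v, 3) for v in rgb_anchored[0]])
```

In this toy case the depth branch is confident at the first position and not at the second, so the RGB response survives at the first position and is strongly suppressed at the second.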

3.4. Calibrated Hierarchical Decoder (CHD)

The CHD is responsible for the progressive integration of the global frequency-synergized prior $\mathrm{Prior}^{i}$ and the interactive feature $f_{i}$. To prevent the accumulation of redundant information and alleviate the semantic gap during upsampling and concatenation, we introduce channel attention followed by a $1 \times 1$ convolution to perform feature compression and dimensionality reduction. Subsequently, the result is added to the upsampled output of level $i + 1$, which can be expressed as follows:

$$f^{i} = \mathrm{Cat}(f_{i}, \mathrm{Prior}^{i}), \tag{21}$$

$$f_{ca}^{i} = \mathrm{Conv}_{1 \times 1}(f^{i} \otimes \mathrm{CA}(f^{i})), \tag{22}$$

$$f_{final}^{i} = f_{ca}^{i} + \mathrm{Up}(f_{final}^{i + 1}), \tag{23}$$

where $\mathrm{Up}(\cdot)$ denotes bilinear interpolation. This skip-connection-style integration allows the network to effectively propagate saliency information across different resolutions while maintaining structural integrity.

By anchoring the progressive decoding process with the global frequency-synergized prior, the network ensures that global frequency insights and local spatial precision operate in organic synergy, forming a unified SFCINet framework.
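The top-down recursion of Eq. (23) can be sketched with single-channel toy maps. This is an illustration under stated assumptions, not the decoder itself: `merge_with_prior` replaces concatenation, channel attention, and the $1 \times 1$ convolution with a plain average, `upsample2x` uses nearest-neighbour upsampling instead of bilinear interpolation, and every level is assumed to be exactly half the resolution of the previous one.

```python
def upsample2x(x):
    """Nearest-neighbour 2x upsampling (stand-in for bilinear interpolation)."""
    out = []
    for row in x:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def merge_with_prior(feat, prior):
    """Stand-in for Cat + channel attention + 1x1 conv: plain average."""
    return [[0.5 * (f + p) for f, p in zip(rf, rp)]
            for rf, rp in zip(feat, prior)]


def decode(feats, priors):
    """Top-down decoding; feats and priors are ordered fine to coarse."""
    final = None
    outputs = []
    for feat, prior in zip(reversed(feats), reversed(priors)):
        current = merge_with_prior(feat, prior)
        if final is not None:
            up = upsample2x(final)
            current = [[c + u for c, u in zip(rc, ru)]
                       for rc, ru in zip(current, up)]
        final = current
        outputs.append(final)
    return outputs[::-1]  # back to fine-to-coarse order


# Hypothetical two-level pyramid: a 2 x 2 fine level and a 1 x 1 coarse level.
feats = [[[1.0, 1.0], [1.0, 1.0]], [[4.0]]]
priors = [[[3.0, 3.0], [3.0, 3.0]], [[2.0]]]

outputs = decode(feats, priors)
print(outputs[0], outputs[1])
```

The coarsest level has no deeper output to add, so it is just the merged feature; each finer level then receives the upsampled result from the level below it.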

3.5. Loss Function

In the training phase, we adopt a hybrid loss consisting of the BCE loss and the Dice loss, together with a multi-scale supervision strategy that supervises $f_{final}^{k}$ ($k = 1, 2, 3, 4$), ensuring the generation of high-quality saliency maps.

The BCE loss is widely adopted for pixel-wise classification, independently evaluating the discrepancy between the predicted probability and the ground truth at each pixel. The BCE loss at the k-th level can be formulated as follows:

$$L_{BCE}(G_{s}^{k}, P^{k}) = -\sum_{i = 1}^{H} \sum_{j = 1}^{W} \left[ G_{s}^{k}(i, j) \log(P^{k}(i, j)) + (1 - G_{s}^{k}(i, j)) \log(1 - P^{k}(i, j)) \right]. \tag{24}$$

However, calculating the loss pixel by pixel often neglects the global structural integrity of the salient objects. To alleviate this, we introduce the Dice loss, which directly measures the overlap similarity between the predicted region and the ground truth at the image level, thereby maintaining the overall consistency of the objects. The Dice loss is formulated as follows:

$$L_{Dice}(G_{s}^{k}, P^{k}) = 1 - \frac{2 \sum_{i, j} P^{k}(i, j) G_{s}^{k}(i, j)}{\sum_{i, j} P^{k}(i, j) + \sum_{i, j} G_{s}^{k}(i, j)}. \tag{25}$$

Therefore, the overall loss is obtained as:

$$L_{sal} = \sum_{k = 1}^{4} \left[ L_{BCE}(G_{s}^{k}, P^{k}) + L_{Dice}(G_{s}^{k}, P^{k}) \right], \tag{26}$$

where $H$ and $W$ denote the height and width of the image; $(i, j)$ denotes the pixel coordinate; $P^{k}(i, j)$ and $G_{s}^{k}(i, j)$ denote the predicted saliency probability and the ground truth label, respectively, at the k-th level.
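As a worked example of the hybrid loss, the sketch below evaluates the BCE and Dice terms on a toy $2 \times 2$ map and sums them over supervision levels. Two details are assumptions added for numerical safety and illustration: predictions are clamped by a small `eps` before taking logarithms, and the toy maps themselves are hypothetical. The BCE term is summed over pixels, matching the form of Eq. (24), rather than averaged.

```python
import math


def bce_loss(gt, pred, eps=1e-7):
    """Pixel-wise binary cross-entropy, summed over the map."""
    total = 0.0
    for row_g, row_p in zip(gt, pred):
        for g, p in zip(row_g, row_p):
            p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
            total -= g * math.log(p) + (1.0 - g) * math.log(1.0 - p)
    return total


def dice_loss(gt, pred):
    """One minus the Dice overlap; assumes the denominator is non-zero."""
    inter = sum(g * p for rg, rp in zip(gt, pred) for g, p in zip(rg, rp))
    denom = sum(p for rp in pred for p in rp) + sum(g for rg in gt for g in rg)
    return 1.0 - 2.0 * inter / denom


def saliency_loss(gts, preds):
    """Sum of BCE + Dice over all supervised levels."""
    return sum(bce_loss(g, p) + dice_loss(g, p) for g, p in zip(gts, preds))


# Hypothetical ground truth and a maximally uncertain prediction.
gt = [[1.0, 0.0], [0.0, 1.0]]
uncertain = [[0.5, 0.5], [0.5, 0.5]]

print(round(bce_loss(gt, uncertain), 4))   # 4 * ln 2
print(dice_loss(gt, uncertain))            # 1 - 2 * 1 / (2 + 2)
print(round(saliency_loss([gt], [uncertain]), 4))
```

For the uncertain prediction every pixel contributes $\ln 2$ to the BCE term, while the Dice term depends only on the aggregate overlap, which is why the two losses complement each other.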

4. Experimental Results and Analysis

4.1. Datasets

We select five mainstream RGB-D datasets for experiments: SIP [29] (929 image pairs), NJU2K [30] (1985 image pairs), NLPR [31] (1000 image pairs), STERE [32] (1000 image pairs), and DUT-RGBD [33] (1200 image pairs). SIP is a human-centric dataset which focuses on salient person detection in outdoor environments. The depth maps were captured using the dual-camera sensor system of a Huawei Mate 10 smartphone, sourced from Huawei Technologies Co., Ltd. (Shenzhen, China). NJU2K, collected from the Internet and 3D movies, contains diverse indoor and outdoor scenes. The depth maps in this dataset were acquired via the Fujifilm FinePix Real 3D W3 camera sensor, sourced from Fujifilm Holdings Corporation (Tokyo, Japan), or extracted directly from 3D movies to obtain diverse spatial structures. NLPR is captured by a Microsoft Kinect sensor, which was sourced from Microsoft Corporation (Redmond, WA, USA), including various objects under different lighting conditions. STERE focuses on multiple persons in complex poses and real-world occlusions. For this dataset, depth maps were generated through computational stereo matching algorithms rather than direct sensor capture, specifically targeting complex and unconstrained in-the-wild scenarios. DUT-RGBD involves multiple salient objects, low-contrast foregrounds, and complex background clutter. The depth maps were captured with a Lytro light field camera sensor, sourced from Lytro, Inc. (Mountain View, CA, USA), to effectively handle transparent objects and intricate structures. For fair comparison, the training datasets are selected from NJU2K, NLPR, and DUT-RGBD, which include 1485, 700, and 800 image pairs, respectively, while the remaining images are used for testing and evaluation. These training and testing splits fully and strictly follow the publicly available standard partitions universally adopted in the RGB-D SOD community.
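The test-set sizes implied by the split above follow from simple subtraction. The short sketch below only restates the figures given in the text; the dictionary names are illustrative.

```python
# Image-pair counts as stated in the text.
totals = {"SIP": 929, "NJU2K": 1985, "NLPR": 1000, "STERE": 1000, "DUT-RGBD": 1200}
train = {"NJU2K": 1485, "NLPR": 700, "DUT-RGBD": 800}

# Datasets without a training portion are used entirely for testing.
test = {name: totals[name] - train.get(name, 0) for name in totals}
train_total = sum(train.values())

print(train_total)
print(test)
```

This gives 2985 training pairs in total, with 500, 300, and 400 pairs left for testing in NJU2K, NLPR, and DUT-RGBD, while SIP and STERE are used only for evaluation.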

4.2. Evaluation Metrics

For a fair and comprehensive comparison, we adopt five widely accepted metrics: Precision-Recall (PR) curves, F-measure (F_{β}) [8], mean absolute error (M) [34], S-measure (S) [35], and E-measure (E) [36]. PR curves plot precision against recall, visualizing the model’s performance across binarization thresholds. F_{β} is a weighted harmonic mean that balances precision and recall. M quantifies the average pixel-wise discrepancy between the predicted saliency map and the ground truth. S focuses on structural similarity, while E reflects enhanced alignment and global foreground consistency. Together, they provide a comprehensive assessment of the model’s performance.
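To make two of these metrics concrete, the sketch below computes M and F_{β} on a tiny saliency map in plain Python. It is an illustration under stated assumptions, not the evaluation code used in the paper: the fixed binarization threshold of 0.5 and the weight β² = 0.3 (a common convention in the SOD literature) are not specified in this section, and the function names are hypothetical.

```python
def mae(pred, gt):
    """Mean absolute error between a saliency map in [0, 1] and a binary mask."""
    flat_p = [p for row in pred for p in row]
    flat_g = [g for row in gt for g in row]
    return sum(abs(p - g) for p, g in zip(flat_p, flat_g)) / len(flat_p)


def f_measure(pred, gt, threshold=0.5, beta_sq=0.3):
    """F-measure of the thresholded prediction.

    Assumptions: a fixed threshold and beta^2 = 0.3; the paper may instead
    use adaptive thresholds or report the maximum over thresholds.
    """
    flat_p = [p >= threshold for row in pred for p in row]
    flat_g = [g >= 0.5 for row in gt for g in row]
    tp = sum(1 for p, g in zip(flat_p, flat_g) if p and g)
    pred_pos = sum(flat_p)
    gt_pos = sum(flat_g)
    precision = tp / pred_pos if pred_pos else 0.0
    recall = tp / gt_pos if gt_pos else 0.0
    denom = beta_sq * precision + recall
    return (1 + beta_sq) * precision * recall / denom if denom else 0.0


# Toy 2 x 2 example: one true salient pixel, one false positive at 0.6.
pred = [[0.9, 0.2], [0.6, 0.1]]
gt = [[1, 0], [0, 0]]
print("M   =", mae(pred, gt))
print("F_b =", f_measure(pred, gt))
```

In the toy example, the false positive halves the precision while recall stays at 1, so F_{β} drops well below 1 even though M remains moderate, which is why the metrics are reported together.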

4.3. Implementation Details

We adopt MobileViT as our dual-branch backbone, initialized with the official weights pre-trained on ImageNet, and remove the operations after layer 5. SFCINet is trained and evaluated on a single NVIDIA RTX 4060 GPU (8 GB VRAM; NVIDIA Corporation, Santa Clara, CA, USA). The input RGB and depth images are resized to 256 × 256 for both training and testing. SFCINet is trained for 200 epochs with an initial learning rate of 1 × 10⁻⁴ and a weight decay of 0.1 every 50 epochs. The batch size is set to 8. To improve the robustness of our model, we adopt mainstream data augmentation strategies such as image flipping and random rotation.
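The phrase "0.1 every 50 epochs" can be read as a step schedule. The sketch below shows what such a schedule would look like if the factor 0.1 multiplies the learning rate every 50 epochs. This is an interpretation, not a confirmed detail of the training setup, since the text describes the quantity as weight decay. The function name and defaults are illustrative.

```python
def step_lr(epoch, base_lr=1e-4, gamma=0.1, step_size=50):
    """Learning rate at `epoch` under step decay.

    Assumption: the rate is multiplied by `gamma` every `step_size` epochs.
    The source text labels the 0.1 factor as weight decay, so this reading
    may not match the authors' actual configuration.
    """
    return base_lr * gamma ** (epoch // step_size)


# Print the rate at the start and end of each 50-epoch stage of a 200-epoch run.
for epoch in (0, 49, 50, 99, 100, 149, 150, 199):
    print(f"epoch {epoch:3d}: lr = {step_lr(epoch):.1e}")
```

Under this reading, the 200-epoch run has four stages, with the rate falling from 1 × 10⁻⁴ in the first stage to 1 × 10⁻⁷ in the last.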

4.4. Performance Comparison

To comprehensively evaluate the performance of SFCINet, we select 16 state-of-the-art methods for comparison. These methods are categorized into two groups: (1) heavyweight models, shown in Table 1, including CDNet [37], CCAFNet [38], CAVER [39], MPDNet [40], AMINet [41], HiDANet [42], TPCL [43], and DMGNet [44]; (2) lightweight models, shown in Table 2, including MoADNet [22], MMF [45], LSNet [46], AirSOD [47], HENet [48], MAGNet [49], FasterSal [23], and BTNet [50].

(1): Quantitative Comparison

Compared with heavyweight models, as shown in Table 1, SFCINet demonstrates superior performance across most benchmarks. Notably, our model achieves the highest F-measure and E-measure on the SIP, NJU2K, NLPR, and STERE datasets, consistently outperforming models with significantly larger parameter scales. The gain is most pronounced on the highly challenging SIP dataset, where F_{β} increases from 0.892 to 0.911. Compared with these heavyweight models, our method has significantly fewer parameters and much lower computational complexity (FLOPs) while delivering better overall performance. For example, compared with TPCL (129.47M Params, 212.02G FLOPs), our network contains only 10.08M parameters and requires 4.27G FLOPs, reducing both the parameter count and the computational burden by over 90%, yet it outperforms the competitors in F-measure by approximately 1% on average. This demonstrates its model efficiency and potential for practical deployment.
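The "over 90%" figure and the SIP gain can be checked from the numbers quoted in the text. The short sketch below does the arithmetic. The helper names are illustrative, and only values stated in this section are used.

```python
def relative_reduction(baseline, ours):
    """Fraction by which `ours` is smaller than `baseline`."""
    return 1.0 - ours / baseline


def relative_gain(old, new):
    """Relative improvement of `new` over `old`."""
    return (new - old) / old


# TPCL versus SFCINet, values as quoted in the text.
params_cut = relative_reduction(129.47, 10.08)  # parameters, in millions
flops_cut = relative_reduction(212.02, 4.27)    # FLOPs, in G

# F-measure on SIP: previous best heavyweight score versus SFCINet.
sip_gain = relative_gain(0.892, 0.911)

print(f"parameter reduction: {params_cut:.1%}")
print(f"FLOPs reduction:     {flops_cut:.1%}")
print(f"relative F gain:     {sip_gain:.2%}")
```

The reductions come out at roughly 92% for parameters and 98% for FLOPs, which is consistent with the "over 90%" claim. The absolute F_{β} gain of 0.019 on SIP corresponds to a relative gain of about 2.1%.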

Table 2 presents the comparison with eight advanced lightweight methods. SFCINet ranks first across all metrics on the SIP and DUT-RGBD datasets and achieves the highest F-measure and E-measure on all datasets. Instead of concatenating depth and RGB images and processing them with a single shared backbone, our method uses two separate backbones to model RGB and depth features. Consequently, our model is not competitive in parameters and FLOPs against some state-of-the-art lightweight models, but it achieves a considerable performance gain thanks to the efficient processing of cross-modal information in the dual-backbone network. For instance, we achieve a relative improvement of approximately 6.55% over AirSOD on SIP, although AirSOD has fewer parameters. This indicates that SFCINet achieves a desirable trade-off between efficiency and accuracy.

Additionally, we plot the Precision-Recall (PR) curves on the five benchmark datasets, as illustrated in Figure 4. Our PR curves stay above those of the other methods, demonstrating the outstanding performance of SFCINet.
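A PR curve of this kind is obtained by binarizing the predicted saliency map at a sweep of thresholds and recording precision and recall at each one. The following minimal sketch shows the idea on a flattened toy map. The threshold set, the convention of reporting precision 1.0 when nothing is predicted positive, and the function name are assumptions for illustration. Benchmark toolkits typically sweep 256 gray levels.

```python
def pr_curve(pred, gt, thresholds):
    """Return (threshold, precision, recall) triples for a flattened map.

    `pred` holds saliency values in [0, 1]; `gt` holds binary labels.
    """
    gt_pos = sum(gt)
    points = []
    for t in thresholds:
        binary = [p >= t for p in pred]
        tp = sum(1 for b, g in zip(binary, gt) if b and g)
        pred_pos = sum(binary)
        # Convention (assumption): precision is 1.0 when no pixel is predicted.
        precision = tp / pred_pos if pred_pos else 1.0
        recall = tp / gt_pos if gt_pos else 0.0
        points.append((t, precision, recall))
    return points


pred = [0.9, 0.8, 0.4, 0.1]
gt = [1, 1, 0, 0]
curve = pr_curve(pred, gt, thresholds=(0.0, 0.5, 0.95))
for t, p, r in curve:
    print(f"t={t:.2f}  precision={p:.2f}  recall={r:.2f}")
```

A curve that stays above another has higher precision at every recall level, which is the sense in which one method dominates another in Figure 4.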

(2): Qualitative Comparison

For a qualitative comparison, we select several representative detection results, as illustrated in Figure 5. These examples contain several challenging scenarios, including situations with complex backgrounds (rows 1–2), poor depth maps (rows 3–4), intricate structures (rows 5–6), and low contrast (rows 7–8). Specifically, in rows 1–2, the patterns behind the door ring and the foliage beneath the pine needles create a complex background distraction. Our SFCINet effectively suppresses complex background clutter and extracts fine structures, while others often suffer from background confusion, leading to blurred boundaries and false detections. In rows 3–4, poor depth maps have significant noise and low contrast, causing significant misguidance for RGB features. Our SFCINet eliminates the distracting information and maintains the integrity and clear boundaries of the salient objects. In rows 5–6, depicting a spiral handrail and a bench armrest with intricate topologies, our model sharply preserves the internal hollow regions, while other methods merge these details into blurred blobs.

Finally, in rows 7–8, the car logo and green leaves have colors similar to their respective backgrounds. SFCINet successfully distinguishes the salient objects under these extreme low-contrast conditions, while other models suffer from structural disintegration and object fragmentation. These visual results demonstrate that our proposed method locates salient objects more precisely across various extreme scenarios.

4.5. Ablation Studies

To objectively evaluate the effectiveness of the individual modules, we design three variant models, (a), (b), and (c), and conduct experiments on the challenging DUT-RGBD and SIP datasets. Variant (d) is our full SFCINet. The results are shown in Table 3 and visualized in Figure 6. Because SFCINet is a unified framework, in which, for example, the global frequency-synergized prior P_{i} is generated by the SFS and flows to the CHD, we cannot simply remove a single module. The core of the SFS is to shift the perspective from the spatial domain to the frequency domain, thereby obtaining global frequency information. For variant (a), we remove the frequency-domain branch in the SFS while keeping the spatial-domain operations, and we also exclude the injection of global features into the CHD. For variant (b), we restore the SFS module but block the global frequency-synergized prior from flowing to the CHD. For variant (c), we replace the cross-modal spatial attention interaction mechanism with concatenation while leaving the other modules unchanged.
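The four variants can be summarized as a small configuration table, which makes it easy to see which single factor each pairwise comparison isolates. The sketch below is one reading of the description above. The field names are hypothetical, and the assumption that (a) and (b) keep the CMGI attention is inferred rather than stated.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class AblationVariant:
    frequency_branch: bool  # frequency-domain branch inside the SFS
    prior_to_decoder: bool  # global prior P_i injected into the CHD
    cmgi_attention: bool    # False: plain concatenation replaces the attention


# Assumption: variants (a) and (b) keep the CMGI attention unchanged.
VARIANTS = {
    "a": AblationVariant(frequency_branch=False, prior_to_decoder=False, cmgi_attention=True),
    "b": AblationVariant(frequency_branch=True, prior_to_decoder=False, cmgi_attention=True),
    "c": AblationVariant(frequency_branch=True, prior_to_decoder=True, cmgi_attention=False),
    "d": AblationVariant(frequency_branch=True, prior_to_decoder=True, cmgi_attention=True),
}


def differing_fields(x, y):
    """List the settings that differ between two variants."""
    dx, dy = asdict(VARIANTS[x]), asdict(VARIANTS[y])
    return [name for name in dx if dx[name] != dy[name]]


for pair in (("a", "b"), ("b", "d"), ("c", "d")):
    print(pair, "->", differing_fields(*pair))
```

Under this reading, (a) versus (b) isolates the frequency branch of the SFS, (b) versus (d) isolates the prior injection in the CHD, and (c) versus (d) isolates the CMGI attention.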

(1): The effectiveness of the SFS module

The effectiveness of the SFS is verified by comparing variants (a) and (b). In contrast to variant (a), variant (b) has global perception from a frequency-domain perspective, which suppresses background noise and captures global texture information. Variant (b) is superior to (a) in all metrics. For example, the F-measure improves from 0.939 to 0.942 on DUT-RGBD. As shown in Figure 6, variant (a) completely loses the thin pole due to a lack of global context, while (b) successfully captures the whole salient object. These results demonstrate the effectiveness of the SFS, which combines spatial- and frequency-domain processing with local and global information.

Furthermore, we provide a detailed quantitative analysis in Table 4 to demonstrate the superior global modeling capability of our frequency-domain SFS module over conventional lightweight attention mechanisms, as well as the individual contributions of the amplitude and phase branches. Specifically, we construct variant (1), CA + SA, in which the frequency-domain branch within the SFS is replaced by a standard combination of Channel Attention (CA) and Spatial Attention (SA), while the rest of the architecture is kept identical. Ours comprehensively outperforms it, improving the F-measure from 0.902 to 0.911 on SIP, which indicates that frequency-domain learning captures global context better than these lightweight attention modules. Additionally, disabling either the amplitude branch (2) or the phase branch (3) causes a distinct performance drop. Crucially, with either branch disabled, variants (2) and (3) perform roughly on par with or slightly better than variant (1), whereas enabling both branches jointly (Ours) brings a pronounced performance gain.
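The amplitude and phase branches discussed above operate on the two components of a Fourier spectrum. The sketch below illustrates that decomposition on a 1-D signal with a naive discrete Fourier transform: the spectrum is split into amplitude and phase, the amplitude is rescaled, and the result is transformed back. It is not the SFS module: the fixed weights stand in for learned parameters, the signal is 1-D rather than a 2-D feature map, and all names are hypothetical.

```python
import cmath
import math


def dft(signal):
    """Naive discrete Fourier transform of a real 1-D signal."""
    n = len(signal)
    return [sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Inverse transform; returns the real part of each sample."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)).real / n
            for t in range(n)]


def split_amp_phase(spectrum):
    """Decouple a complex spectrum into amplitude and phase lists."""
    return [abs(c) for c in spectrum], [cmath.phase(c) for c in spectrum]


def recombine(amplitude, phase):
    """Rebuild the complex spectrum from amplitude and phase."""
    return [cmath.rect(a, p) for a, p in zip(amplitude, phase)]


signal = [0.0, 1.0, 0.5, -0.25, 0.75, 0.0, -1.0, 0.3]
amp, pha = split_amp_phase(dft(signal))

# Unmodified components reconstruct the input (up to rounding).
restored = idft(recombine(amp, pha))

# Hypothetical fixed weights stand in for a learned amplitude branch:
# keep the DC term and halve every other frequency component.
weights = [1.0] + [0.5] * (len(amp) - 1)
modulated = idft(recombine([a * w for a, w in zip(amp, weights)], pha))

print([round(v, 3) for v in restored])
print([round(v, 3) for v in modulated])
```

Because every frequency coefficient depends on all input samples, a single per-frequency weight changes the whole signal at once. Here, halving the non-DC amplitudes pulls every sample halfway toward the signal mean, which gives an intuition for why frequency-domain modulation acts as a global operation.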

(2): The effectiveness of the CMGI module

To evaluate the CMGI, we compare variant (c) with our full SFCINet (d). The performance of (c) deteriorates in all metrics. Depth and RGB features differ significantly in representation because the underlying image sensors are fundamentally different. For instance, noise in the depth map can easily contaminate the salient regions. The results show that our cross-modal spatial attention interaction mechanism effectively alleviates this issue. Figure 6 also shows that (d) localizes salient objects more accurately.

(3): The effectiveness of the CHD module

The CHD injects the global signal into the decoding process, and its effect can be verified by comparing (b) with (d). Our full SFCINet leads comprehensively in the evaluation metrics, which suggests that part of the global signal is lost during upsampling in the decoding process. Our unified framework alleviates this issue by leveraging the global frequency-synergized prior from the SFS. Moreover, Figure 6 shows that (d) captures the human subjects and the main body area of the figurine more completely, demonstrating the effectiveness of our design.

(4): Efficiency Analysis

We further analyze the computational complexity and inference efficiency of the proposed modules, as detailed in Table 5. The inference speed (FPS) is benchmarked on a single NVIDIA GeForce RTX 4060 GPU with an input resolution of 256 × 256 and a batch size of 1 under FP16 mixed precision. To measure pure forward inference latency after sufficient warm-up, the entire network execution is accelerated via CUDA Graph, while disk I/O and data pre-processing overheads are excluded. Variant (a) is the lightest architecture, with 9.88M parameters and 277 FPS. Compared with (a), our full method achieves a significant increase in performance at a minor cost of 0.2M Params and 0.25G FLOPs. Comparing the variants shows that the proposed modules are extremely lightweight and that most of the parameters and computation come from the backbone network. Our full SFCINet (d) achieves the best accuracy across all datasets while running at 236 FPS, verifying its efficiency and potential for practical deployment. These results indicate that our architectural design is highly efficient and achieves a favorable trade-off between detection accuracy and deployment efficiency.
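The measurement protocol described above (warm-up first, then timing forward passes only) can be sketched as a generic harness. The version below times a CPU stand-in workload with the standard library. It does not reproduce the FP16 or CUDA Graph setup, and on a GPU the device would additionally need to be synchronized before each clock read. All names are illustrative.

```python
import time


def measure_fps(forward, warmup=10, iters=50):
    """Time `iters` calls of `forward` after `warmup` untimed calls."""
    for _ in range(warmup):
        forward()  # warm-up runs are excluded from the measurement
    start = time.perf_counter()
    for _ in range(iters):
        forward()
    elapsed = time.perf_counter() - start
    return iters / elapsed


def latency_ms(fps):
    """Per-frame latency in milliseconds at a given frame rate."""
    return 1000.0 / fps


def dummy_forward():
    # Hypothetical stand-in for a network forward pass with batch size 1.
    return sum(i * i for i in range(1000))


print(f"stand-in workload: {measure_fps(dummy_forward):.0f} FPS")

# Latencies implied by the frame rates reported in the text.
print(f"variant (a), 277 FPS: {latency_ms(277):.2f} ms per frame")
print(f"full model,  236 FPS: {latency_ms(236):.2f} ms per frame")
```

Expressed as latency, the reported frame rates correspond to about 3.6 ms per frame for variant (a) and about 4.2 ms for the full model, so the added modules cost roughly 0.6 ms per frame.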

4.6. Failure Case Analysis

As illustrated in Figure 7, our SFCINet encounters performance bottlenecks when dealing with multi-scale semantic ambiguity, specifically where the ground truth (GT) targets are restricted to highly localized, fine-grained textual signs. Depth sensors fail to highlight salient objects on the surface of the macroscopic regions (rows 2 and 3) or in adjacent regions (row 1), as they share the same depth values. Since our SFS module explicitly leverages the amplitude spectrum to capture macro-geometric energy, it understandably prioritizes the holistic structural consistency of the object over pixel-level local areas when facing deceptive depth maps. The subsequent network follows this deceptive consensus, especially the injection of the output prior into the decoding stage, failing to segment localized pixel-level regions. Consequently, dynamically decoupling macro-structural layouts from fine-grained semantic masks remains a vital direction for our future improvement.

5. Conclusions

In this paper, we propose a novel and lightweight network for RGB-D salient object detection, termed SFCINet. We employ MobileViT to independently extract multi-scale features. To better capture the global context information and calibrate the features of RGB and depth images, we design the SFS module, which shifts the perspective from the spatial domain to the frequency domain, efficiently perceiving global signals. Given the representation discrepancy between the depth and RGB features, we further develop the CMGI module with a reciprocal anchoring mechanism. It helps select useful information and filter out irrelevant noise, achieving effective integration. During the decoding process, global information is prone to being lost during upsampling. To alleviate this issue, we utilize the global frequency-synergized prior from the SFS module to participate in the decoder, forming the SFCINet into a unified framework. We conduct experiments on five widely accepted datasets and adopt five evaluation metrics to comprehensively verify the outstanding performance of our proposed method. The results show that the SFCINet achieves a desirable trade-off between detection performance and efficiency.

Author Contributions

Y.L. and Z.C. were responsible for the conceptualization and architectural design of the SFCINet. Y.L. implemented the software, conducted the experiments, and performed the formal analysis of the results. Z.C. provided critical technical support and supervised the research progress. Y.L. prepared the original manuscript, which was subsequently reviewed and edited by Z.C. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Innovation and Entrepreneurship Project of Nanjing University of Posts and Telecommunications (CXXZD2025020).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data are contained within the article. The code of our proposed SFCINet can be found in https://github.com/luyitong530-cloud/SFCINet.git (accessed on 8 June 2026).

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Yang, N.; Zhong, Q.; Li, K.; Cong, R.; Zhao, Y.; Kwong, S. A reference-free underwater image quality assessment metric in frequency domain. Signal Process. Image Commun. 2021, 94, 116218.
2. Lee, H.; Yoon, J.; Jeong, Y.; Yi, K. Moving object detection and tracking based on interaction of static obstacle map and geometric model-free approach for urban autonomous driving. IEEE Trans. Intell. Transp. Syst. 2021, 22, 3275–3284.
3. Guo, Y.; Liu, Y.; Georgiou, T.; Lew, M.S. A review of semantic segmentation using deep neural networks. Int. J. Multimed. Inf. Retr. 2018, 7, 87–93.
4. Zhu, X.-F.; Wu, X.-J.; Xu, T.; Feng, Z.-H.; Kittler, J. Robust visual object tracking via adaptive attribute-aware discriminative correlation filters. IEEE Trans. Multimed. 2022, 24, 301–312.
5. Oprea, S.; Martinez-Gonzalez, P.; Garcia-Garcia, A.; Castro-Vargas, J.A.; Orts-Escolano, S.; Garcia-Rodriguez, J.; Argyros, A. A review on deep learning techniques for video prediction. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 2806–2826.
6. Yi, K.; Li, Y.; Tang, H.; Xu, J. Adaptive Depth Enhancement Network for RGB-D Salient Object Detection. IEEE Signal Process. Lett. 2025, 32, 176–180.
7. Wei, L.; Zhu, Z.; Mi, Y.; Hu, W. PDNet: Pluralistic depth-aware network for RGB-D salient object detection. Signal Process. 2026, 239, 110271.
8. Achanta, R.; Hemami, S.; Estrada, F.; Susstrunk, S. Frequency-tuned salient region detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: Piscataway, NJ, USA, 2009; pp. 1597–1604.
9. Ma, X.; Xie, X.; Lam, K.-M.; Zhong, Y. Efficient saliency analysis based on wavelet transform and entropy theory. J. Vis. Commun. Image Represent. 2015, 30, 201–207.
10. Jin, X.; Guo, C.; He, Z.; Xu, J.; Wang, Y.; Su, Y. FCMNet: Frequency-aware cross-modality attention networks for RGB-D salient object detection. Neurocomputing 2022, 491, 414–425.
11. Wang, Y.; Wang, R.; Fan, X.; Wang, T.; He, X. Pixels, regions, and objects: Multiple enhancement for salient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: Piscataway, NJ, USA, 2023; pp. 10031–10040.
12. Liu, N.; Zhang, N.; Wan, K.; Shao, L.; Han, J. Visual saliency transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV); IEEE: Piscataway, NJ, USA, 2021; pp. 4702–4712.
13. Wang, Z.; Zhang, Y.; Liu, Y.; Qin, C.; Coleman, S.A.; Kerr, D. LARNet: Towards Lightweight, Accurate and Real-Time Salient Object Detection. IEEE Trans. Multimed. 2024, 26, 5207–5222.
14. Wang, Z.; Zhang, Y.; Liu, Y.; Zhu, D.; Coleman, S.A.; Kerr, D. ELWNet: An Extremely Lightweight Approach for Real-Time Salient Object Detection. IEEE Trans. Circuits Syst. Video Technol. 2023, 33, 6404–6417.
15. Wen, F.; Wang, Q.; Zou, R.; Wang, Y.; Liu, F.; Chen, Y.; Yu, L.; Du, S.; Yuan, C. A Salient Object Detection Method Based on Boundary Enhancement. Sensors 2023, 23, 7077.
16. Peng, Y.; Zhai, Z.; Feng, M. SLMSF-Net: A Semantic Localization and Multi-Scale Fusion Network for RGB-D Salient Object Detection. Sensors 2024, 24, 1117.
17. Wei, L.; Zhu, Z. Modal-Aware Interaction Network for RGB-D Salient Object Detection. IEEE Trans. Instrum. Meas. 2025, 74, 5026212.
18. Zhou, W.; Zhu, Y.; Lei, J.; Wan, J.; Yu, L. IRFR-Net: Interactive recursive feature-reshaping network for detecting salient objects in RGB-D images. IEEE Trans. Circuits Syst. Video Technol. 2022, 32, 7820–7833.
19. Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.-C. MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: Salt Lake City, UT, USA, 2018; pp. 4510–4520.
20. Zhang, X.; Zhou, X.; Lin, M.; Sun, J. ShuffleNet: An extremely efficient convolutional neural network for mobile devices. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: Salt Lake City, UT, USA, 2018; pp. 6848–6856.
21. Mehta, S.; Rastegari, M. MobileViT: Light-weight, general-purpose, and mobile-friendly vision transformer. arXiv 2021, arXiv:2110.02178.
22. Jin, X.; Yi, K.; Xu, J. MoADNet: Mobile asymmetric dual-stream networks for real-time and lightweight RGB-D salient object detection. IEEE Trans. Circuits Syst. Video Technol. 2022, 32, 7632–7645.
23. Zhang, J.; Dai, Y.; Zhang, T.; Harandi, M.; Barnes, N.; Hartley, R. FasterSal: Robust and real-time single-stream architecture for RGB-D salient object detection. IEEE Trans. Multimed. 2025, 27, 2477–2488.
24. Zhang, C.; Chen, F.; Huang, L.; Peng, Z.; Hu, X. Multiple cross-modal complementation network for lightweight RGB-D salient object detection. J. Vis. Commun. Image Represent. 2025, 113, 104622.
25. Mao, X.; Liu, Y.; Liu, F.; Li, Q.; Shen, W.; Wang, Y. Intriguing findings of frequency selection for image deblurring. In Proceedings of the 37th AAAI Conference on Artificial Intelligence (AAAI); AAAI: Washington, DC, USA, 2023; pp. 1905–1913.
26. Qiao, Y.; Shao, M.; Wang, L.; Zuo, W. Learning depth-density priors for Fourier-based unpaired image restoration. IEEE Trans. Circuits Syst. Video Technol. 2024, 34, 2604–2618.
27. Zhao, R.; Wang, M.; Wang, F.; Sun, F.; Li, H. Spatial-frequency collaborative learning for camouflaged object detection. IEEE Trans. Multimed. 2025, 27, 7756–7768.
28. Zhou, H.; Hong, W.; Zhang, Z.; Liu, X.; Wu, X.-J. Lightweight Spatial-Channel-Frequency Network for RGB-Thermal Salient Object Detection. IEEE Signal Process. Lett. 2025, 32, 4009–4013.
29. Fan, D.-P.; Lin, Z.; Zhang, Z.; Zhu, M.; Cheng, M.-M. Rethinking RGB-D salient object detection: Models, datasets, and large-scale benchmarks. IEEE Trans. Neural Netw. Learn. Syst. 2021, 32, 2075–2089.
30. Ju, R.; Ge, L.; Geng, W.; Ren, T.; Wu, G. Depth saliency based on anisotropic center-surround difference. In Proceedings of the 2014 IEEE International Conference on Image Processing (ICIP); IEEE: Piscataway, NJ, USA, 2014; pp. 1115–1119.
31. Peng, H.; Li, B.; Xiong, W.; Hu, W.; Ji, R. RGBD salient object detection: A benchmark and algorithms. In Computer Vision—ECCV 2014; Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T., Eds.; Lecture Notes in Computer Science; Springer International Publishing: Cham, Switzerland, 2014; Volume 8691, pp. 92–109.
32. Niu, Y.; Geng, Y.; Li, X.; Liu, F. Leveraging stereopsis for saliency analysis. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: Piscataway, NJ, USA, 2012; pp. 454–461.
33. Piao, Y.; Ji, W.; Li, J.; Zhang, M.; Lu, H. Depth-induced multi-scale recurrent attention network for saliency detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV); IEEE: Piscataway, NJ, USA, 2019; pp. 7253–7262.
34. Perazzi, F.; Krähenbühl, P.; Pritch, Y.; Hornung, A. Saliency filters: Contrast based filtering for salient region detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: Piscataway, NJ, USA, 2012; pp. 733–740.
35. Fan, D.-P.; Cheng, M.-M.; Liu, Y.; Li, T.; Borji, A. Structure-measure: A new way to evaluate foreground maps. In Proceedings of the IEEE International Conference on Computer Vision (ICCV); IEEE: Piscataway, NJ, USA, 2017; pp. 4548–4557.
36. Fan, D.-P.; Gong, C.; Cao, Y.; Ren, B.; Cheng, M.-M.; Borji, A. Enhanced-alignment measure for binary foreground map evaluation. In Proceedings of the 27th International Joint Conference on Artificial Intelligence (IJCAI); IJCAI: Stockholm, Sweden, 2018; pp. 698–704.
37. Jin, W.-D.; Xu, J.; Han, Q.; Zhang, Y.; Cheng, M.-M. CDNet: Complementary depth network for RGB-D salient object detection. IEEE Trans. Image Process. 2021, 31, 3376–3390.
38. Zhou, W.; Zhu, Y.; Lei, J.; Wan, J.; Yu, L. CCAFNet: Crossflow and cross-scale adaptive fusion network for detecting salient objects in RGB-D images. IEEE Trans. Multimed. 2022, 24, 2192–2204.
39. Pang, Y.; Zhao, X.; Zhang, L.; Lu, H. CAVER: Cross-modal view-mixed transformer for bi-modal salient object detection. IEEE Trans. Image Process. 2023, 32, 892–904.
40. Zhang, X.; Xu, Y.; Wang, T.; Liao, T. Multi-prior driven network for RGB-D salient object detection. IEEE Trans. Circuits Syst. Video Technol. 2024, 34, 9209–9222.
41. Wang, R.; Wang, F.; Su, Y.; Sun, J.; Sun, F.; Li, H. Attention-guided multi-modality interaction network for RGB-D salient object detection. ACM Trans. Multimed. Comput. Commun. Appl. 2024, 20, 68.
42. Wu, Z.; Allibert, G.; Meriaudeau, F.; Ma, C.; Demonceaux, C. HiDAnet: RGB-D Salient Object Detection via Hierarchical Depth Awareness. IEEE Trans. Image Process. 2023, 32, 2160–2173.
43. Wu, J.; Hao, F.; Liang, W.; Xu, J. Transformer fusion and pixel-level contrastive learning for RGB-D salient object detection. IEEE Trans. Multimed. 2024, 26, 1011–1026.
44. Tang, Y.; Li, M. DMGNet: Depth mask guiding network for RGB-D salient object detection. Neural Netw. 2024, 180, 106751.
45. Huang, N.; Jiao, Q.; Zhang, Q.; Han, J. Middle-level feature fusion for lightweight RGB-D salient object detection. IEEE Trans. Image Process. 2022, 31, 6621–6634.
46. Zhou, W.; Zhu, Y.; Lei, J.; Yang, R.; Yu, L. LSNet: Lightweight spatial boosting network for detecting salient objects in RGB-thermal images. IEEE Trans. Image Process. 2023, 32, 1329–1340.
47. Zeng, Z.; Liu, H.; Chen, F.; Tan, X. AirSOD: A lightweight network for RGB-D salient object detection. IEEE Trans. Circuits Syst. Video Technol. 2024, 34, 1656–1669.
48. Gao, H.; Wang, F.; Wang, M.; Sun, F.; Li, H. Highly efficient RGB-D salient object detection with adaptive fusion and attention regulation. IEEE Trans. Circuits Syst. Video Technol. 2025, 35, 3104–3118.
49. Zhong, M.; Sun, J.; Ren, P.; Wang, F.; Sun, F. MAGNet: Multi-scale awareness and global fusion network for RGB-D salient object detection. Knowl. Based Syst. 2024, 299, 112126.
50. Ren, P.; Bai, T.; Sun, F. Bio-inspired two-stage network for efficient RGB-D salient object detection. Neural Netw. 2025, 185, 107244.

Figure 1. The architecture of the proposed SFCINet.

Figure 2. Structure of the Spatial-Frequency Synergy (SFS) module.

Figure 3. Structure of the Cross-modal Guidance Interaction (CMGI) module.

Figure 4. PR curves of the proposed method and state-of-the-art methods on five datasets.

Figure 5. Qualitative comparisons with state-of-the-art methods.

Figure 6. Visual comparisons for showing the effectiveness of different components.

Figure 7. Visual examples of failure cases.

Table 1. Results of our model compared with 8 heavyweight methods. The best results are in bold. The symbol “

↑