Salient Object Detection for Optical Remote Sensing Images Based on Gated Differential Unit

Sun, Mingsi; Lan, Ting; Wang, Wei; Liu, Pingping

doi:10.3390/rs18030389

Open AccessArticle

Salient Object Detection for Optical Remote Sensing Images Based on Gated Differential Unit

¹

College of Computer Science and Technology, Changchun University, Changchun 130022, China

²

School of Mechanical and Electrical Engineering, Changchun University of Science and Technology, Changchun 130022, China

³

College of Computer Science and Technology, Jilin University, Changchun 130012, China

⁴

Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun 130012, China

^*

Author to whom correspondence should be addressed.

Remote Sens. 2026, 18(3), 389; https://doi.org/10.3390/rs18030389

Submission received: 28 November 2025 / Revised: 6 January 2026 / Accepted: 21 January 2026 / Published: 23 January 2026

Download

Browse Figures

Versions Notes

Highlights

What are the main findings?

Proposes the GDUFormer model, which combines the ViT backbone with Gated Differential Units to effectively address core issues such as noise suppression and scale differences in Optical Remote Sensing Image Salient Object Detection.
The GDU incorporates Full-Dimensional Gated Attention and Hierarchical Differential Dynamic Convolution, enabling full-dimensional feature purification and synergistic modeling of long-range dependencies and local details.

What are the implications of the main findings?

The innovative design of the dual-branch gated attention and hierarchical differential mechanism provides an efficient paradigm for the integration of Transformers and convolutions, enhancing the foreground-background discrimination ability in complex scenarios.
It achieves the optimal comprehensive performance on the EORSSD and ORSI-4199 datasets, with a balanced trade-off between parameters and computational complexity, offering reliable preprocessing support for downstream remote sensing image tasks.

Abstract

Salient object detection in optical remote sensing images has attracted extensive research interest in recent years. However, CNN-based methods are generally limited by local receptive fields, while ViT-based methods suffer from common defects in noise suppression, channel selection, foreground-background distinction, and detail enhancement. To address these issues and integrate long-distance contextual dependencies, we introduce GDUFormer, an ORSI-SOD detection method based on the ViT backbone and Gated Differential Units (GDU). Specifically, the GDU consists of two key components—Full-Dimensional Gated Attention (FGA) and Hierarchical Differential Dynamic Convolution (HDDC). FGA consists of two branches aimed at filtering effective features from the information flow. The first branch focuses on aggregating spatial local information under multiple receptive fields and filters the local feature maps via a grouping mechanism. The second branch imitates the Vision Mamba to acquire high-level reasoning and abstraction capabilities, enabling weak channel filtering. HDDC primarily utilizes distance decay and hierarchical intensity difference capture mechanisms to generate dynamic kernel spatial weights, thereby facilitating the convolution kernel to fully mix long-range contextual dependencies. Among these, the intensity difference capture mechanism can adaptively divide hierarchies and allocate parameters according to kernel size, thus realizing varying levels of difference capture in the kernel space. Extensive quantitative and qualitative experiments demonstrate the effectiveness and rationality of GDUFormer and its internal components.

Keywords:

salient object detection; optical remote sensing images; full-dimensional gated attention; differential dynamic convolution

1. Introduction

With the widespread use of optical remote sensing satellites and high-altitude drones, both the quantity and quality of optical remote sensing images (ORSI) have significantly improved. In this context, computer vision, as a rapidly developing technique for image processing, recognition, and detection, has found extensive applications in ORSI, such as remote sensing image change detection [1,2], salient object detection [3,4,5], and land cover segmentation [6]. Among these, ORSI salient object detection (ORSI-SOD) has attracted increasing attention from researchers due to its unique task characteristics.

ORSI-SOD aims to highlight the most prominent and salient objects in optical remote sensing images. This not only facilitates the rapid and accurate localization of key targets but also promotes the development and progress of related downstream tasks. Compared to RGB salient object detection, ORSI-SOD faces challenges such as large object scale differences, similar land cover features, and a low proportion of salient object pixels, making the task highly challenging.

In the face of these challenges, many methods have explored solutions in their own ways. CNN-based ORSI-SOD methods mostly follow the local-to-local or local-to-contextual paradigm [7]. While these methods have advanced ORSI-SOD, the convolutional backbone’s local nature limits performance, prompting the proposal of ViT-based methods [5,7,8,9]. These methods generally first use the ViT backbone to extract global information, then process and enhance global features at different scales through additional network structures, and finally use a prediction head to obtain the ORSI-SOD results. However, it is important to note that these methods still have certain limitations. Firstly, these approaches fail to adaptively suppress redundant noise features from complex backgrounds in remote sensing images and cannot effectively strengthen the key feature channels for objects across different scales, which results in object localization errors or omissions. Secondly, although existing ViT-based methods have proposed diverse strategies to tackle related issues, including discriminative knowledge extraction, hybrid feature alignment for foreground-background separation, and multi-scale mechanisms for small object details, they generally do not integrate the enhancement of object-background distinction, long-range dependency encoding and small object local detail preservation into a single coherent pipeline, leading to the obscuring of certain object cues. As an alternative solution, our work is designed to fill this gap with a targeted framework.

To address the aforementioned issues, we propose an ORSI-SOD detection method based on the Gated Differential Unit and ViT backbone, termed GDUFormer. We first employ PVT v2 [10] as the encoder to extract global features at four different scales. Then, features from adjacent scales are fed into each stage of the decoder, which is composed of Gated Differential Units (GDUs), for further processing. Finally, the outputs of the decoder are passed into a predictor to obtain the detection results of ORSI-SOD. Our main contribution lies in the design of the GDU, which incorporates two core modules: Full-Dimensional Gated Attention (FGA) and Hierarchical Differential Dynamic Convolution (HDDC). Specifically, the former takes into account the varying scales and large contrast differences in objects in ORSI. It first uses two branches to collect multi-receptive field spatial local information and encodes channel relationships through a Vision Mamba (ViM)-inspired linear attention. Following this, a grouping mechanism is employed to filter local feature maps, and an internal gated control mechanism is used to filter out redundant or noisy channels. The outputs from the two branches are then combined and activated by the activation function, serving as the module-level gating output applied to the HDDC results. In this process, FGA significantly enhances its ability for advanced reasoning and abstraction of tokens by fine-tuning the linear self-attention mechanism in the inspiration of ViM, without introducing excessive computational burden. The design motivation behind FGA is consistent with our previous analysis. The two branches can suppress redundant features and enhance the emphasis on key channels, respectively, thereby achieving the gated representation of salient object features across all dimensions.

The key function of HDDC is to fully integrate long-range contextual dependencies through convolution kernels and dynamically control the kernel spatial weights. This allows the kernel to capture hierarchical spatial intensity differences while considering spatial distance characteristics, thereby improving the model’s ability to distinguish between foreground and background and to capture subtle details. The component first arranges the weight intensities at each position in the kernel space using the distance decay mechanism, then adaptively divides the hierarchy based on kernel size and sets different parameters to capture varying intensity levels. Finally, the combined outputs are applied to the dynamically defined convolution kernel, facilitating context mixing. As the main information processing unit in GDU, the primary design motivation of HDDC is to achieve the discrimination between objects and backgrounds while enhancing additional attention to small objects. This not only facilitates the construction of a sound feature parsing and processing paradigm in the multi-level architecture of the decoder, but also improves the model’s ability to maintain accuracy and reliability under hard sample scenarios.

FGA and HDDC complement each other, with the former primarily assisting in filtering the ineffective features within the latter and controlling the flow of information. Through their combined effect, the model can fully utilize the global features extracted by the ViT backbone, alleviating the issues faced by ORSI-SOD and improving detection performance. GDUFormer adopts novel techniques and achieves excellent comparative results across multiple datasets without significantly increasing model parameters. As shown in Figure 1, the salient objects predicted by GDUFormer demonstrate advantages over the main competitor in terms of accurate localization, fewer mispredictions, and clearer object boundaries. Our contributions can be summarized in the following four aspects:

(1): We propose a Transformer-based solution for ORSI-SOD, called GDUFormer. This method accurately identifies salient object features in ORSI by utilizing a gating mechanism across all dimensions and dynamic convolutions that capture hierarchical intensity variations. To our knowledge, we are the first to propose the specific combination of the full-dimensional gating mechanism and hierarchical Differential Dynamic Convolution.
(2): We introduce a Full-Dimensional Gated Attention mechanism. To aggregate and capture information across all dimensions, we employ a dual-branch structure to perceive local spatial cues and incorporate the ViM-inspired linear attention, enabling the selection of key spatial positions and strong channel relationships. The resulting outputs are used to control the information flow in subsequent operations. Through this module, the model can effectively alleviate the noise interference and large target scale differences encountered in ORSI-SOD.
(3): We propose a Hierarchical Differential Dynamic Convolution. To solve the problem of low foreground-background discrimination and the tendency to overlook small-scale salient features, this component constructs kernel space weights through distance decay and the capture of hierarchical intensity differences. It encodes long-range dependencies while maintaining the inherent local inductive bias, resulting in improved integration with the ViT backbone.
(4): Extensive comparative and ablation experiments on the benchmark datasets demonstrate the effectiveness and rationality of the proposed GDUFormer, FGA, and HDDC.

Figure 1. The visual comparison examples of GDUFormer with the key competitor. The red dashed boxes highlight the key regions for comparison. As observed, HFANet and DKETFormer show obvious false and missed detections, while GDUFormer achieves precise salient object localization and smooth edges.

2. Related Work

2.1. Salient Object Detection Methods for Conventional Images

CNN-based Methods. Over the past decade, researchers have proposed a large number of CNN-based SOD methods for conventional images, aiming to address various practical challenges in SOD tasks through rich local feature extraction. Wei et al. [11] proposed a cross-feature module and a cascaded feedback decoder, which can adaptively select complementary components from input features while eliminating differences between features of adjacent layers. Pang et al. [12] designed aggregate interaction modules and self-interaction modules to integrate features of adjacent levels and extract more efficient multi-scale features from the fused features. Zhang et al. [13] proposed a lossless feature reflection and a weighted structural loss function to enable the model to efficiently learn complementary salient features, while solving the problem of blurred boundaries in salient object prediction. GateNet [14] uses multi-level gating units to address the long-standing issues of lacking an interference control mechanism and failing to consider the contribution differences in different encoder modules. The DC-Net structure [15] is based on the divide-and-conquer concept: it first adopts a multi-encoder architecture to decompose different subtasks, and then realizes the progressiveness of model prediction through a decoder composed of ResASPP2 modules. Drawing on several relevant characteristics of the human visual system, MENet [16] has a core goal of improving the accuracy and robustness of salient object detection in complex scenes with multiple objects.

ViT-based Methods. Motivated by the booming development of Transformers in the field of computer vision, many SOD methods for conventional images have begun to adopt the ViT backbone, while designing subsequent decoder and prediction head structures to adapt to the global characteristics of Transformers. VST [17] is a model based on the ViT architecture. Through operations such as global context propagation, multi-level token fusion, and token upsampling, it has successfully achieved the prediction of salient objects in conventional images. ICON [18] enhances features via diversified feature aggregation modules and integrity channel enhancement components, and adopts a part-whole verification method to improve micro-integrity, thus outperforming other models on benchmark datasets. SelfReformer [19] explicitly learns global context through the ViT network and utilizes a pixel shuffle mechanism to preserve feature details, ultimately realizing accurate prediction of salient objects. The BBRF framework [20] expands the receptive field by separating semantic features and detail features in the encoder, and employs multi-module dynamic feature filtering and multi-path feature compensation to solve the detection problem of salient object scale variations. MDSAM [21] drives visual foundation models to learn multi-scale information with an extremely small number of parameters through an adapter, and combines a fusion module and a detail enhancement module to obtain salient prediction results containing fine-grained details. AL-SAM [22] optimizes the adapter and fusion module to help the unified segmentation model [23] perceive and adapt to multi-scale features. It not only encodes the global information of features at different scales but also enhances the unified modeling of cross-scale features.

2.2. Salient Object Detection Methods for Optical Remote Sensing Images

CNN-based Methods. ORSI-SOD methods and SOD methods for conventional images exhibit the same technical trajectory in terms of technical foundation. In the context where deep learning frameworks are commonly used to solve computer vision tasks, CNN-based ORSI-SOD methods occupy a dominant position. BASNet [24] first extracts multi-scale salient features through a densely supervised encoder–decoder network combined with a residual optimization module, then uses a hybrid loss function to supervise and guide feature maps, and finally improves the segmentation accuracy and boundary clarity of salient object regions. PiCANet [25] generates attention maps to selectively integrate effective contextual features. Meanwhile, to optimize the localization of salient objects, it further integrates global and local attention. PA-KRN [26] completes the localization and segmentation of objects in a coarse-to-fine phased manner, and introduces an attention sampler and a knowledge review network to solve corresponding problems in this process. The main contribution of ERPNet [27] focuses on a dual-path parallel decoder, which realizes edge extraction and feature fusion, and can use edge information to enhance positional attention, thereby calibrating the decoding process in a cyclic manner. Li et al. [3] constructed a parallel top-down fusion network. By virtue of high- and low-level features in the same path and multi-resolution features across paths, this network effectively distinguishes multi-scale objects and suppresses cluttered backgrounds. By designing a network structure that preserves edge information, enhances feature interaction, and optimizes through adaptive feedback, BSCGNet [28] significantly improves the model’s emphasis on local information and final-layer features. SFANet [29] designs a global semantic enhancement block and an uncertainty-aware refinement module, aiming to address the problems existing in salient object detection for remote sensing images. SeaNet [30] adopts a lightweight backbone network to extract basic features, and simultaneously fuses a dynamic semantic matching module and an edge self-alignment module to activate the positions of salient objects and optimize their local details.

ViT-based Methods. With its advantage of encoding long-range feature dependencies, ViT is highly compatible with the image properties of ORSI, and has attracted extensive attention from relevant researchers in recent years. Nevertheless, as the field started relatively late, there are currently few ViT-based ORSI-SOD methods, providing substantial room for progress. Based on the ViT backbone, GeleNet [8] obtains global image representations. It not only enhances the perception of local details and directional information but also strengthens feature interaction across different levels, thus effectively boosting detection effectiveness. WeightNet [7] proposes a two-stage processing framework: multi-scale weight information is generated first, followed by feature integration via indirect fusion. This approach significantly improves the performance of salient object detection in ORSIs. HFANet [31] addresses typical challenges in remote sensing images through a modular design. It balances global context comprehension, multi-scale feature refinement, and cross-layer feature alignment, and additionally incorporates a mutual supervision mechanism between saliency and edge tasks to achieve more accurate detection outcomes. The TS-BiT model [32] employs a two-stage design. It replaces the gradient of the traditional Sign function with a center-aware Softmax binarization method and an extensible hyperbolic tangent function, successfully reducing the high computational requirement of ViT in ORSI-SOD. For DKETFormer [5], its decoder is composed of a cross-spatial knowledge extraction module and an inter-layer feature transfer module. This decoder not only better aggregates and captures pixel-level pairwise relationships but also realizes the modeling and transfer of discriminative features.

In addition, several excellent RGB-T SOD methods have emerged recently. ConTriNet [33] adopts a divide-and-conquer triple-flow architecture to address the inter-modality disparities and noise interference from defective modalities in RGB-T SOD. Mamba4SOD [34] fuses the merits of Swin Transformer V2 [35] and Mamba to construct robust multi-modal representations. HyPSAM [36] and KAN-SAM [37] are both methods based on the unified segmentation model. The difference lies in that the former mainly leverages the zero-shot generalization capability of SAM [23], while the latter extends SAM2 [38] by introducing thermal features as guiding prompts. Notably, DiMSOD [39] treats RGB, RGB-D, and RGB-T multi-modal SOD as a conditional mask generation task via the diffusion model, addressing the limitations of separate task handling and inadequate cross-modal fusion. Although the aforementioned methods have successfully applied ViT or Mamba to the SOD task, defects such as insufficient utilization of global features, lack of noise filtering, failure to effectively distinguish foreground from background, and inadequate enhancement of local details still exist. These defects will result in suboptimal SOD detection performance for optical remote sensing images. To this end, we propose the corresponding solution: GDUFormer.

3. GDUFormer

3.1. Overall Architecture

The overall architecture of GDUFormer is shown in Figure 2. Inspired by GeleNet [8] and DKETFormer [5], we adopt pvt v2-b2 [10] as the encoder for salient features in GDUFormer. As a hierarchical Transformer, pvt v2 is capable of encoding global cues from the feature map at four different scales, with the output at each stage denoted as

P = [P_{1}, P_{2}, P_{3}, P_{4}]

. Specifically, these feature maps are downsampled to 1/4, 1/8, 1/16 and 1/32 of the input image, with 64, 128, 320 and 512 channels, respectively. P is subsequently fed into the GDU-based decoder to generate scale-specific outputs. To enhance the interaction between adjacent scale feature maps, we input both

P_{i}

and

P_{i + 1} (1 \leq i \leq 3)

into the decoder layer corresponding to

P_{i}

. Each decoder layer is composed of a fixed number of GDUs, and the upsampled results of

P_{i + 1}

and

P_{i}

are iteratively processed within them to obtain the decoder output

D_{i}

. Since the highest layer (

P_{4}

) does not interact with higher layers, we apply an additional

1 \times 1

convolution to process it and obtain

D_{4}

. Notably, the dimensions of

D_{i}

and

P_{i}

are exactly identical before the final

1 \times 1

convolution is performed; after that, the number of channels of all

D_{i}

is uniformly reduced to 32. The GDU suppresses noise, enhances key features, improves foreground-background distinction, and captures fine-grained details. Its structure is shown in Figure 3. It can be observed that our main contributions, Full-Dimensional Gated Attention and Hierarchical Differential Dynamic Convolution, are both integrated into the first half of the GDU, which is the key factor behind its powerful performance. The second half of the GDU is composed of ConvFFN [10] and involves the splitting and iterative process of the feature maps from the upper and lower layers.

D_{1}, D_{2}, D_{3}, D_{4}

are then fed into the predictor to obtain the predicted result

O_{r e s}

for ORSI-SOD. It is worth mentioning that, similar to previous work [5,8], we select the effective partial decoder [40] as the prediction head, and like DKETFormer [5], we add an extra branch to accommodate the four scale inputs from the decoder. The predictor primarily drives efficient fusion of multi-scale feature maps via dense element-wise multiplication and concatenation operations, and ultimately yields a prediction map with 1 output channel and a scale matching the input image, referred to as

O_{r e s}

in this paper. Finally,

O_{r e s}

obtains the guidance from the ground truth map under the supervision of the balanced cross-entropy (BCE) loss function and intersection-over-union (IoU) loss function. This process can be formulated as

L_{t o t a l} (O_{r e s}, G T) = L_{b c e} (O_{r e s}, G T) + L_{i o u} (O_{r e s}, G T),

(1)

where

G T

denotes the ground truth map with binary annotations for the salient objects. The detailed explanation of the overall structure of GDUFormer is complete. Next, we will focus on discussing the FGA and HDDC components within the GDU.

3.2. Full-Dimensional Gated Attention

Given ORSIs’ large object scale variations and noise susceptibility, designing dedicated modules is essential. Previous methods tried to address such problems by introducing ViT encoders to extract global cues; nevertheless, this approach neither explored nor encoded the spatial and channel dimensions at the same time, nor did it filter and strengthen full-dimensional information. To this end, we put forward our solution—Full-Dimensional Gated Attention (FGA)—within the GDU, with its detailed structure illustrated in Figure 4.

FGA is primarily composed of two branches, responsible for filtering spatial local information and channel relationships separately. The left branch first mines spatial local cues through multiple depthwise convolutions with different receptive fields, and then uses the grouping mechanism to filter local features. This process helps the model identify noise information in features and determines the key cues that should be retained in ORSI-SOD by performing weighted summation on the stacked receptive fields. Specifically, the receptive field stacking process can be formulated as

\begin{matrix} S_{r f} = [S i L U (D W C_{3 \times 3} ((X)), S i L U (D W C_{5 \times 5} ((X)), S i L U (D W C_{7 \times 7} ((X))], \end{matrix}

(2)

where

[.]

represents the concatenation operation,

S i L U

denotes the activation function, and

D W C_{3 \times 3}

stands for the

3 \times 3

depthwise convolution. X is the result obtained after concatenation and normalization of

P_{i}

and

P_{i + 1}

. The obtained

S_{r f}

contains abundant local spatial features. Combined with the fixed-scale global features carried in

P_{i}

and

P_{i + 1}

, the spatial information is already highly sufficient. Nevertheless, PVT inherently lacks a filtering mechanism—it only regulates the richness of semantic information by adjusting scales. Meanwhile, depthwise convolutions merely extract and capture receptive field information within individual channels. For this reason, the direct combination approach will leave these spatial features still cluttered. This part was validated in our ablation experiments. Fortunately, for local regions, the stacked receptive fields can partition the noise still present in the global features into multiple candidate windows, and under the influence of the parameter matrix, the noise is appropriately stripped away. Given this, the focus of the current work has shifted from suppressing noise to better integrating and summarizing the candidate windows. This transition process not only filters out noisy pixels to some extent, but also integrates the global features in the encoder and the local information mined by depthwise convolutions, thus mitigating the issue of large spatial scale differences in ORSI. The construction of this solution unifies both situations. This idea is realized through an operation called the grouping mechanism. Specifically, before concatenation in Equation (2), we expand each feature map’s dimensions and concatenate along these dimensions to obtain

S_{r f} \in R^{B \times 3 \times C \times H \times W}

. A

1 \times 1

pointwise convolution then expands

S_{r f}

’s parameter space; the result is activated with Softmax for weights, followed by weighted summation with

S_{r f}

. This process can be formulated as

F_{s} = \sum_{d i m 1} S_{r f} * S o f t m a x (P W C (S_{r f})),

(3)

where

P W C

stands for pointwise convolution, and

\sum_{d i m 1}

denotes summation along dimension 1 with that dimension compressed. Notably, dimension expansion and

S_{r f}

reconstruction aim to enable the grouping mechanism to fully utilize Softmax-derived normalized weights via the extra dimension, thus aggregating diverse receptive field feature representations in

S_{r f}

.

S o f t m a x

weights are the ratio of the model’s total confidence resources (summing to 1) assigned to each candidate object according to its competitiveness. Through the element-wise multiplication in Equation (3), the feature space of

S_{r f}

expanded by the extra pointwise convolution can be applied back to

S_{r f}

via the

S o f t m a x

weights, and the summation along dimension 1 enables the weights to complete feature integration, thus successfully achieving the objectives analyzed above. Through this operation,

F_{s}

integrates candidate clues from different local receptive fields without introducing a large number of parameters, thus achieving the goals of suppressing noise and alleviating large-scale differences. It is worth noting that the gating mechanism, as an attention mechanism, should not introduce an excessive number of parameters or computational burden, which is why the structure of the left branch is relatively simple. Additionally, the previous analysis of noise suppression and alleviating large-scale differences justifies the rationality and innovation of this design. For the right branch, we focus primarily on channel selection. Different channels can be viewed as copies of the feature map, so removing weak channels greatly enhances the effectiveness of the key features in the map. The specific structure of this operation is shown on the right side of Figure 4.

To be specific, the right branch of FGA is constructed based on a multi-head linear self-attention mechanism, which is theoretically linked to Vision Mamba via the linear attention equivalence proposed in MLLA [41]. As demonstrated by Han et al. [41], ViM’s selective scanning mechanism can be approximated by linear self-attention with iterative feature updates, where the state transition of ViM corresponds to the key-value interaction in linear attention. To better explain this process, we formulate the selective scanning mechanism of ViM as

\begin{matrix} S_{i} = A_{i} * S_{i - 1} + B_{i} (▿_{i} * X_{i}); \\ Y_{i} = C_{i} * S_{i} + D * X_{i}, \end{matrix}

(4)

where

A_{i}

,

B_{i}

,

C_{i}

, and D are parameter matrices that control the input

X_{i}

and output

Y_{i}

. The first half of Equation (4) describes how the state evolves (via Matrix

A_{i}

) according to the impact of the input on the state (via Matrix

B_{i}

), while the second half illustrates how the state is converted into the output (via Matrix

C_{i}

) and how the input directly influences the output (via Matrix D). MLLA [41] argues that

S_{i - 1}

and

S_{i}

can be regarded as upper and lower iterative feature maps of linear self-attention,

B_{i}

and

X_{i}

can be abstracted as keys and values for matrix multiplication, and

C_{i}

can serve as query vectors. In addition, the original MLLA [41] uses multiple positional encodings to replace the traditional forget gate mechanism (

A_{i}

), aiming to make the ViM-inspired linear attention more suitable for visual tasks.

However, we believe that applying the gating mechanism to the lower-layer feature map

S_{i - 1}

is equivalent to implementing a controlled residual connection, which is not conducive to the filtering of channel relationships and makes it more difficult for the advanced reasoning and abstraction capabilities brought by ViM to play a role. For this reason, we position the gating mechanism on

B_{i} (▿_{i} * X_{i})

—the outcome of matrix multiplication between the key and value vectors in the linear self-attention mechanism. This approach not only helps overcome the representational bottleneck of linear models but also promotes the achievement of a dynamic balance between memory and forgetting. For ease of understanding, we shift the perspective from the inspiration of ViM back to the multi-head linear attention mechanism in the following description. The result of the matrix multiplication performed by the key and value vectors is denoted as

{S i m}_{k v} \in R^{B \times h \times C \times C}

, where h represents the number of heads in the attention mechanism. In the last two dimensions of

{S i m}_{k v}

, the relationship between each pair of channels has been explicitly encoded, making it highly suitable to deploy the gating mechanism for channel filtering in this process. To this end, we use an internal gating attention on the similarity matrix

{S i m}_{k v}

. To avoid disrupting channel relationships, we employ pointwise convolution in the internal gating attention to mine the complementary relationships between heads, and use the GELU activation function to filter out poor channel relationship pairs. This process can be formulated as

{S i m}_{k v} = {S i m}_{k v} * G E L U (P W C ({S i m}_{k v})),

(5)

where

G E L U

can filter out most negative pixel values, thereby ensuring that active channel relationships are preserved. Following this operation,

{S i m}_{k v}

undergoes activation using Softmax. After performing matrix multiplication with the query and establishing a residual connection with the value vector,

F_{c}

—the result of channel relationship filtering—is obtained. As far as we know, our work is one of the early efforts in computer vision to perform convolution and activation on the similarity matrix and feed the output back to it. This procedure improves the abstract reasoning, cue representation, and dynamic balance abilities of the ViM-inspired linear attention embedded in the module. Notably, during the residual connection with the value vector, we adopt pointwise convolution to implement a simple gating mechanism, which in turn enhances the function of the parameter matrix D in Equation (4). In fact, the ViM-inspired linear attention structure shares similar principles with the original Mamba architecture but differs greatly in structural design; in contrast, it has a close structural connection with the standard linear attention, yet there are distinct discrepancies in theoretical interpretation. Specifically, according to the theory of MLLA [41], the scanning process of the ViM-inspired linear attention structure achieves the function of the selective scanning mechanism of the original Mamba architecture to a certain extent. Nevertheless, there are notable gaps between them in the detailed structures, including tensor operations, weight generation and feature application, which is why we define it as a ViM-inspired linear attention architecture. In comparison with the standard linear attention, the structural modifications we have made are relatively limited, and our main contributions focus on the innovative thinking and interpretation of the ViM-inspired linear attention method. By applying the gating mechanism to the Key-Value product, the rank of the similarity matrix can be raised to a certain extent, which further changes the representational strength of the linear model and enhances the dynamic balance of memory and forgetting. The operations we adopted have not been tried in prior ViM-inspired linear attention methods, and the extra interpretable processing of the similarity matrix itself makes certain theoretical and structural contributions. To sum up, the proposed ViM-inspired linear attention structure is distinctly different from the original Mamba architecture and the standard linear attention. Thanks to the internal gating attention in Equation (5), weak channel relationships have been eliminated, and key object features have been strengthened across multiple functional layers of the selective scanning mechanism. The combination of

F_{s}

and

F_{c}

can be formulated as

\begin{matrix} F_{G A} = P W C ([F_{s}, F_{c}]); \\ F_{G A} = S i L U (B N (F_{G A})), \end{matrix}

(6)

where

B N

denotes the batch normalization mechanism, and

S i L U

is the activation function. Via the operations in Equation (6), the filtered local spatial information and channel relationships are integrated. Pixels and channels with high noise, repetition, redundancy, or misidentification are also excluded from FGA’s information flow. At the same time, the module’s ability to represent, abstract, and balance key salient object features is further enhanced. Based on this, the obtained

F_{G A}

can be transformed into an external-oriented overall gating mechanism via

B N

and

S i L U

, applying high-standard full-dimensional information to HDDC as attention. This design strictly suppresses the majority of abnormal cues in HDDC’s output, thereby ensuring the efficiency and reliability of GDU during decoding.

In summary, FGA’s dual-branch design, which distinguishes itself from single-dimensional attention mechanisms, achieves full-dimensional feature purification and lays the groundwork for subsequent feature enhancement based on HDDC. It should be noted that FGA is actually an auxiliary component of HDDC, whose main purpose is to control the results of HDDC. Therefore, HDDC functions as the core component responsible for information flow processing within GDU, underscoring its critical importance. Further details of HDDC will be provided in Section 3.3.

3.3. Hierarchical Differential Dynamic Convolution

Before formally introducing HDDC, it is necessary to clarify its critical role within the GDU and, more broadly, within the entire decoder. For HDDC, its input consists of global features from adjacent scales of the encoder. Although global cues have already been extracted, these feature maps generally lack foreground-background distinction and tend to overlook the local details of small salient objects. Applying only conventional convolutions to capture local context may impair the salient cues embedded in the global features. Therefore, it is necessary to design a module structure that can effectively model long-range dependencies and, at the same time, focus on key positions in the feature map through weight parameter control.

Overlock [42] proposes a dynamic convolution with context mixing capability. Its core idea is to characterize the correlation between a single token at the center of a region and its context by leveraging the token itself and the affinity values between it and all other tokens. Subsequently, these affinity values are aggregated, and a token-based dynamic convolution kernel is constructed through a learnable mechanism, thereby integrating context information into each weight of the convolution. While this is indeed an excellent and efficient method for context mixing, we must point out that its structural design still has some shortcomings for our needs. For one thing, although it constructs a token-based dynamic convolution kernel to meet the need for modeling long-range dependencies, the implementation of this operation still relies on encoding pairwise relationships between pixel pairs without introducing any differentiation, exploration, or recognition mechanisms. This easily leads to similar foregrounds and backgrounds being mistakenly identified as a single category (salient or non-salient). For another, despite the structure being called a convolution kernel, the foundation for deploying kernel weights is essentially token relationships. Under such circumstances, how to strengthen attention to specific key pixel positions, excavate potential salient objects, and improve the accurate recognition of details like small salient objects and salient object outlines has become another key aspect requiring consideration.

According to the aforementioned analysis, we impose attention weights that rely on distance decay and hierarchical intensity difference capture (HIDC) on ContMix’s [42] kernel space. The goal is to capture pixel differences in local regions and strengthen the model’s feature discrimination and mining capabilities. The overall structure of HDDC with applied weights is illustrated in Figure 5. Specifically, similar to ContMix [42], we first re-split the concatenated encoder global features along the channel dimension. We allow the upper-layer features to store context information and the lower-layer features to enhance weight attributes. The purpose of this operation is that the decoder outputs results at the current level; thus, it is a more appropriate choice for the lower layer (i.e., the current layer) to amplify salient features. Subsequently, we perform matrix multiplication on the upper- and lower-layer global features processed by the weight matrix. After undergoing scale transformation, the result serves as the initial ContMix convolution kernel

{K e r n e l}_{o r i}

, with a scale of

B \times K^{2} \times h e a d \times N

. In this case,

K^{2}

stands for the initial kernel size,

h e a d

is the number of heads of the multi-head self-attention mechanism, and N denotes the product of image width and height. Next, we generate two independent kernels to characterize the correlation between tokens and their context, so as to realize the subsequent aggregation of affinity values. This process can be formulated as

{K e r n e l}_{1}, {K e r n e l}_{2} = S p l i t ({C o n v}_{k e r n e l} ({K e r n e l}_{o r i})),

(7)

where

{K e r n e l}_{1}

and

{K e r n e l}_{2}

are the generated independent kernels,

S p l i t

refers to splitting the results along the kernel dimension, while

{C o n v}_{k e r n e l}

extends the original

K^{2}

to

K_{1}^{2} + K_{2}^{2}

. Through this operation, the two generated convolution kernels are relatively independent and can capture local information under different receptive fields. Additionally, as they are based on contextual relationships, they can effectively facilitate wide-range context mixing. During implementation, the kernel space scales of

{K e r n e l}_{1}

and

{K e r n e l}_{2}

are configured to

7 \times 7

and

5 \times 5

, respectively. After that, to balance the relationship between contextual information and kernel-space weights, we perform dimensionality reduction in

{K e r n e l}_{1}

and

{K e r n e l}_{2}

by averaging across spatial dimensions, yielding

{W e i g h t}_{1}

and

{W e i g h t}_{2}

, which serve as the foundation for generating kernel-space attention.

Specifically, inspired by RMT [43], we plan to adopt the distance paradigm to influence the weight distribution of the kernel space. Given that

{K e r n e l}_{1}

and

{K e r n e l}_{2}

perform identical kernel space attention, we only detail the weight application of

{K e r n e l}_{1}

for brevity. The spatial decay matrix based on Manhattan distance proposed by RMT introduces explicit spatial information before self-attention, thus providing the attention mechanism with a clear spatial prior. Considering that the

H W \times H W

space of self-attention can achieve improved performance under the guidance of spatial prior, is it feasible to incorporate this prior mechanism into the kernel space? To tackle this problem, we first derive the formulated representation of RMT’s 2D spatial decay matrix as

S_{n m}^{2 d} = γ^{| x_{n} - x_{m} | + | y_{n} - y_{m} |},

(8)

where

(x_{n}, y_{n})

and

(x_{m}, y_{m})

denote the coordinates of any two points in the 2D space. Based on these 2D coordinates, the spatial decay matrix can represent the Manhattan distance between each pair of tokens via

S_{n m}^{2 d}

, which ultimately acts on the self-attention mechanism. The critical reason for the feasibility of this approach lies in the fact that the

H W \times H W

similarity matrix guarantees the encoding of the relationship between any two tokens. However, there is no

K_{1}^{2} \times K_{1}^{2}

similarity matrix in the kernel space, and constructing such a matrix from scratch cannot act on the initial attention

{W e i g h t}_{1}

, which has collapsed into

B \times K_{1}^{2} \times h e a d

. For this reason, we convert the spatial decay matrix into a distance decay mechanism targeting the diagonals of the kernel space. We first construct a

K_{1} \times K_{1}

correlation matrix, and then the 2D spatial decay matrix calculates the Manhattan distance between the

K_{1}

elements. Under this operation, the spatial prior leads to the highest weights on the diagonals of the kernel space. In each row, the greater the distance from the current diagonal pixel, the more drastic the weight decay. This grants the distance decay directional characteristics, facilitating strengthened capture of discriminative information and detailed cues on the diagonals. To balance the distance decay relationship, we also add this operation to the other diagonal of the kernel space. This process can be formulated as

\begin{matrix} W_{i j} = γ^{| x_{i} - x_{j} | + | y_{i} - y_{j} |}; \\ W_{i j}^{2 d} = W_{i j} + W_{i j}^{T}, \end{matrix}

(9)

where

W_{i j}^{T}

denotes the transpose of

W_{i j}

along the row dimension. Although the kernel space distance relationships on both diagonals are encoded, the module still lacks enhancement for the overall kernel space. For this reason, motivated by PidiNet [44], we propose a hierarchical intensity difference capture mechanism. It can adaptively extract the pixel intensity difference at each level within the kernel window, and construct global kernel space weights accordingly. Specifically, drawing on APDC [44], the mechanism takes the convolution kernel center as the origin and one pixel as the hierarchical division distance, computing the intensity difference between each pixel and its clockwise adjacent pixel. The convolution kernel difference results will be hierarchically heterogeneous in subsequent steps and further highlighted, so as to realize the perception of the overall relational status of the kernel space. The pixel intensity difference calculation for a specific level is formulated as

r e s = \sum_{n = 0}^{N} ω_{n} * (x_{n} - x_{j}),

(10)

where N stands for the number of pixels included in the current window level,

x_{n}

denotes the pixel value of the n-th point,

x_{j}

is the pixel connected clockwise to

x_{n}

within the hierarchical window, and

ω_{n}

is the corresponding kernel weight. Similarly to APDC [44], we transform the equation from

ω_{n}

-dominated to

x_{n}

-dominated, enabling intensity difference capture for each window level without operating on feature map pixels.

Taking the

7 \times 7

window size of

{K e r n e l}_{1}

as an example, we first remove the central pixel of the window and divide it into three levels, containing 8, 16, and 24 pixels, respectively. Subsequently, we calculate the pixel intensity differences within each level and convert the calculation process into a kernel operation. So far, the weight difference expression for each level is still linear—that is, regardless of the level, the base of the intensity difference calculation is entirely consistent. However, according to the analysis in the previous section, introducing spatial prior is beneficial, and what the module currently lacks is precisely the enhancement of the overall kernel space relationships. It should be noted that this enhancement process needs to pay special attention to the discrimination between foreground and background and the capture of local details. To achieve this function, we introduce an adaptive hierarchical operator (AHO) and an intensity difference highlighting mechanism (IDHM), which can be formulated as

W_{d i f f} = φ * (e^{r e s_{k}} - 1),

(11)

where

φ

denotes the adaptive hierarchical operator, and each level of the kernel space window from inner to outer is assigned a different

φ

. In practical operation, still taking the

7 \times 7

window size as an example, the initial values of

φ

for each level from inside to outside are set to 2, 1, and 0.75, respectively. The reason for this operation is that when the window moves to the current central point, to achieve accurate foreground-background discrimination, the pixels closest to the center are the ones that require the most attention and vigilance. Meanwhile, the importance of pixels farther away naturally decreases layer by layer. This aligns with the core idea of the distance decay mechanism analyzed in this section, which is also the reason for constructing a hierarchical structure for the kernel space initially. It is worth mentioning that

φ

is learnable, allowing it to better adapt to the iteration and optimization of the convolution kernel.

r e s_{k}

is the outcome of the

x_{n}

-dominated calculation process, numerically identical to

r e s

.

e^{r e s_{k}} - 1

represents the intensity difference highlighting mechanism, which can further expose important encodings in the kernel space and generate higher inner products accordingly. As weights, these will also produce higher activation responses, thereby promoting the discovery, excavation, and capture of local details. Further analysis shows that after the hierarchical construction of the convolution kernel, adaptive hierarchical heterogeneity, and intensity difference highlighting, the weight

W_{d i f f}

that can characterize, activate, and enhance the overall kernel space relationships is ready. It should be noted that in Equation (11),

W_{d i f f}

represents the difference weight corresponding to the

r e s_{k}

level. Since hierarchical construction does not affect the overall use of weights, we integrate the weight results of each level after the above operations into a single weight, denoted as

W_{d i f f}^{*}

. This result is first added to

W_{i j}^{2 d}

of the corresponding kernel size, then undergoes residual connection with

{W e i g h t}_{1}

, and finally acts on

{K e r n e l}_{1}

after activation. This process can be formulated as

{D y}_{k e r n e l}^{1} = {K e r n e l}_{1} * σ (W_{d i f f}^{*} + W_{i j}^{2 d} + {W e i g h t}_{1}),

(12)

where

σ

denotes the Sigmoid activation function, and

{D y}_{k e r n e l}^{1}

represents the dynamic convolution kernel after applying the kernel space weights. This dynamic convolution kernel not only achieves fine-grained recognition of objects and textures as well as accurate perception of local details under the combined action of

W_{d i f f}^{*}

and

W_{i j}^{2 d}

, but also urges the convolution operation to additionally focus on the detailed directional cues and relevant discriminative information along the kernel window diagonals. As a result, this allows the ContMix mechanism to exert stronger feature aggregation capabilities and be more suitable for the ORSI-SOD task. We refer to the entire weight extraction, construction, application, and convolution aggregation process as HDDC, which serves as the key information processing component in GDU. Notably,

{K e r n e l}_{2}

performs the same operations as

{K e r n e l}_{1}

, and their respective results will be concatenated in subsequent steps to augment GDU’s feature reserve under different local receptive fields.

The innovation of HDDC centers on task-tailored kernel space optimization, bridging long-range dependency modeling associated with Transformers and local detail capture typical of convolution, effectively tackling a major limitation faced by existing hybrid methods.

4. Experiments

4.1. Datasets

EORSSD [45] dataset is an extension of the ORSSD dataset [46], comprising numerous semantically meaningful yet challenging images. The dataset contains a total of 2000 images along with corresponding ground truth annotations, among which 1400 images are used for model training, and the remaining 600 images for testing and validation.

ORSI-4199 [47] dataset is larger in scale, including 4199 remote sensing images in total, all of which are provided with accurate pixel-level saliency annotations. This dataset is divided into 2199 training images and 2000 testing images, featuring rich object categories, irregular shapes, and large-scale variations. It is currently a mainstream dataset for evaluating the performance of ORSI-SOD models.

4.2. Implementation Details

GDUFormer is implemented using the PyTorch (2.3.1) framework, with the entire training process conducted on a single NVIDIA GeForce RTX 3090 Ti GPU. For the model’s basic architecture, the encoder employs the PVT-v2-b2 [10] structure. To enhance training efficiency, the pre-trained model parameters of this encoder are directly loaded at the initialization stage. The key hyperparameters configured during training are detailed as follows: the training and testing resolution is

352 \times 352

, and the initial learning rate is 1 × 10⁻⁴, the batch size is fixed at 12, the optimizer is Adam, the maximum number of epochs is 45, and data augmentation is performed via rotation and flipping. Additionally, a learning rate decay strategy is adopted, reducing the learning rate to 1/10 of its current value every 10 epochs.

To ensure fair and comprehensive evaluation of ORSI-SOD performance, seven widely accepted metrics in the field are used as follows: S-measure, F-measure (maximum, mean, and adaptive modes), and E-measure (maximum, mean, and adaptive modes). It is worth noting that given that MAE values are generally quite close in the latest studies, which to a certain extent loses the significance of comparison, we do not select it as an evaluation metric.

4.3. Comparative Experiments

4.3.1. Comparative Results on the EORSSD Dataset

To fully validate the performance of GDUFormer, we conducted comparisons with numerous conventional image SOD methods (R3Net [48], PoolNet [49], EGNet [50], MINet [12], GateNet [14], CSNet [51], SAMNet [52], HVPNet [53], PA-KRN [26], VST [17], and ICON [18]) and ORSI-SOD methods. Given that GDUFormer adopts the PVT-v2-b2 [10] as the backbone, we categorized the ORSI-SOD methods into CNN-based methods (LVNet [46], ACCoNet [54], MJRBM [47], CorrNet [55], ERPNet [27], SARNet [4], MCCNet [56], EMFINet [57], SeaNet [30], SFANet [29], and MRBINet [58]) and ViT-based methods (HFANet [31], LightEMNet [59], BCAR-Net [60], LGIPNet [61], and DKETFormer [5]) for more refined comparisons. Moreover, since DKETFormer [5] had not made its results on EORSSD [45] publicly available, we retrained, tested, and evaluated the model on this dataset. It should be clarified that to maintain the fairness of the comparison, we retrained DKETFormer with exactly the same experimental environment and hyperparameters as GDUFormer, i.e., the detailed settings presented in Section 4.2. The comparative experimental results are shown in Table 1. Before conducting the analysis, we briefly introduce the comparison methods listed in Table 1. Methods in the first part are conventional image SOD methods, with the last two (VST [17] and ICON [18]) being ViT-based ones. The methods in the second and third parts all belong to ORSI-SOD methods; the former are CNN-based, whereas the latter are ViT-based. Given that GDUFormer leverages PVT-v2-b2 [10] to extract basic ORSI-SOD feature maps, the third part is our top priority for comparison, and the secondary ones are the methods in the second section plus the ViT-based methods in the first section. This comparison strategy not only guarantees fairness, but also pinpoints the discrepancies between different backbones and those between different tasks.

Based on the Table 1 results, the following conclusions can be drawn: (1) The performance of conventional image SOD methods is generally not as good as that of ORSI-SOD methods. For example, ViT-based conventional methods VST [17] and ICON [18] consistently score lower than CNN-based ORSI-SOD methods SeaNet [30] and SFANet [29] across metrics. This indicates that there is a significant gap in semantic expression between ORSIs and conventional images, highlighting the necessity and challenge of the ORSI-SOD task. (2) For ORSI-SOD, CNN-based methods are considerably inferior to ViT-based methods in multiple metrics. Taking LGIPNet [61] and SFANet [29] as examples, the former outperforms the latter by 0.46%, 1.03%, and 0.38% in the

S_{α}

,

F_{β}^{m a x}

and

E_{β}^{m a x}

metrics, respectively. This phenomenon not only reveals the significant impact of the backbone on model performance but also proves the superiority of the ViT backbone over the CNN backbone for the ORSI-SOD task. (3) GDUFormer achieves the best comprehensive performance among all comparative methods. Building on the previous two conclusions, we focus GDUFormer’s comparison on ViT-based ORSI-SOD methods. According to Table 1 data, GDUFormer ranks first in all metrics except

S_{α}

and

E_{β}^{m e a n}

, with a more balanced performance distribution. For example, GDUFormer outperforms the second-ranked method by 0.23% and 0.36% in

F_{β}^{a d p}

and

E_{β}^{a d p}

, respectively, demonstrating its robust salient object detection capability. It should be noted that with the gradual improvement of model performance in recent years, the values of various methods on evaluation metrics have generally exceeded 0.9, approaching 1. Therefore, the performance improvement achieved by GDUFormer is worthy of attention. In addition, although GDUFormer’s performance on

S_{α}

and

E_{β}^{m e a n}

is slightly weaker, it still achieves competitive results. Thus, it can be fully illustrated that GDUFormer has a distinct leading advantage in comprehensive performance compared with other methods.

We also compared the visual results of various SOD methods in Figure 6. According to the visual comparison results, compared with major competitors, GDUFormer is less affected by noise interference, features more accurate salient object localization and finer edge contour segmentation, and performs better in handling small object details. This outcome not only verifies the superior performance of GDUFormer visually but also proves its effectiveness in tackling the challenges in ORSI—including noise interference, significant object scale variations, poor foreground-background distinguishability, and the easy oversight of small-scale salient features. In turn, it indirectly confirms the rationality of GDUFormer’s internal module design. In addition, we present some prediction failure cases of GDUFormer, as shown in Figure 7. Despite the fact that GDUFormer has achieved remarkable advances in ORSI salient object prediction over the comparative methods, its performance is still somewhat restricted when dealing with test images containing multiple objects and strong interference. All images in Figure 7 are hard samples selected from EORSSD. We can see that most predictions of GDUFormer are precise, with the main limitations summarized as follows: the smallest aircraft in the first image is overlooked, the rectangular-shaped details in the second image fail to be represented, and the prediction for the tail of the aircraft at the bottom of the third image is insufficiently accurate. Careful observation of the input images shows that these failure cases result from the inconsistent scales of multiple objects and interference from light changes. This demonstrates that GDUFormer performs excellently in salient object prediction under non-extreme conditions, but its performance on complex and hard samples requires further exploration and enhancement in subsequent research.

4.3.2. Comparative Results on the ORSI-4199 Dataset

Since some methods have not published their evaluation results on the ORSI-4199 [47] dataset, we removed these methods and additionally added three salient object detection methods (PiCANet [25], BASNet [24], and WeightNet [8]) for comparison. It is worth noting that given the small number of ViT-based ORSI-SOD methods, we used SeaNet [30] and SFANet [29] with their backbones replaced by PVT as comparison objects. The relevant experimental results are derived from the official data of DKETFormer [5]. In addition, to maintain consistency with Section 4.3.1, we also retrained, tested, and evaluated DKETFormer [5] on the ORSI-4199 [47] dataset.

As presented in Table 2, the comparative experimental results reveal that the performance comparison pattern of SOD methods on ORSI-4199 [47] is largely consistent with that on EORSSD [45]. The key observations are as follows: (1) The advantage of CNN-based ORSI-SOD methods over conventional image SOD methods are not as obvious as in EORSSD [45]. On this dataset, SeaNet [30] and SFANet [29] lag behind VST [17] and ICON [18] in overall performance, only gaining leading scores in specific metrics. This indicates that ORSI-4199 [47], as a challenging dataset, has narrowed the performance gap between methods. (2) After replacing their backbones with PVT, SeaNet [30] and SFANet [29] have caught up with other ViT-based ORSI-SOD methods in performance. This further confirms our discussion in Section 4.3.1 that the backbone has a significant impact on model performance and that the ViT backbone is superior to the CNN backbone. (3) GDUFormer still demonstrates the best overall performance compared with all other methods on this dataset. It is worth noting that WeightNet [8] shows a considerable advantage in

S_{α}

and

F_{β}^{a d p}

metrics, and its scores in

F_{β}^{m a x}

and

F_{β}^{m e a n}

are close to those of our method. However, we would like to note that the performance superiority of WeightNet [8] mainly comes from the dual PVT backbones provided by its two-stage architecture, which not only extends the parameter space but also extracts more feature expressions. Even facing such a large model size, GDUFormer still achieves an obvious lead over WeightNet [8] in most metrics, further illustrating the superiority of GDUFormer. Meanwhile, from the cross-dataset comparison, GDUFormer has quite similar performance advantages on EORSSD and ORSI-4199, which fully proves the rationality, effectiveness, and generalization of the proposed method.

Furthermore, although GDUFormer achieves the optimal comprehensive evaluation performance on both the EORSSD [45] and ORSI-4199 [47] datasets, its comparative advantage over the main competitors is not significant. This is mainly attributed to the following two reasons: (1) ORSI-SOD is a highly challenging task. In recent years, a large number of ORSI-SOD methods have been proposed, yet we can observe that the performance gap among the remaining methods is not significant even without considering GDUFormer, e.g., LGIPNet [61] and DKETFormer [5], both of which are works published in 2025. From this observation, we can draw the conclusion that the performance of ORSI-SOD methods has reached a bottleneck to a certain extent, and the requirement for the magnitude of performance improvement should be appropriately lowered. In this context, GDUFormer attains the best results across multiple metrics on the benchmark datasets, and its improvement range is consistent with or even exceeds the performance gap among methods in recent years. This fully demonstrates the performance superiority and architectural validity of GDUFormer. (2) GDUFormer has a moderate computational complexity, which is a reasonable trade-off for its performance gains. After conducting the comparison in terms of accuracy, it is necessary to analyze the efficiency of the model. The comparison results of parameters and FLOPs with major competitors are presented in Table 3. As indicated in Table 3, the comparative methods involve remarkably complex networks, including WeightNet [8] and LGIPNet [61]. From a comprehensive comparison perspective, GDUFormer ranks at a moderate level in both parameters and FLOPs. Hence, it can be concluded that the performance merits of GDUFormer are not obtained at the cost of computational efficiency. In conclusion, the performance improvement realized by GDUFormer is justified, and it also attains competitive results in terms of model efficiency.

4.4. Ablation Study

All ablation experiments were conducted on the EORSSD dataset, and the experimental settings are exactly the same as those in Section 4.2.

Exploration of Core Components. As the basic layer of the decoder, GDU plays a crucial role in the performance improvement of the proposed method. To verify the rationality of its internal structure, it is necessary to conduct an ablation study on the two core components in GDU. The experimental results are shown in Table 4, where each row from top to bottom represents the baseline (without using the decoder), removing HDDC, removing FGA, and the final GDUFormer. According to the experimental results, removing HDDC from GDU has a more significant negative impact than removing FGA. This result corroborates our previous description that HDDC, as the main component for processing information flow in GDU, plays a more critical role. Meanwhile, HDDC is capable of enhancing foreground-background distinguishability and reinforcing the detailed features of small objects, thereby contributing more to the improvement of salient object detection accuracy. On the other hand, FGA is mainly designed to suppress noise and mitigate object scale differences. Since these problems have been partially addressed in the hierarchical encoder–decoder structure, removing FGA does not significantly affect the model’s performance.

It is worth mentioning that excluding FGA results in the model achieving higher scores than GDUFormer in

S_{α}

and

E_{β}^{m e a n}

. This demonstrates that the adoption of FGA is harmful to the aspects evaluated by the two metrics. In addition, this detrimental phenomenon has also been verified in the comparative experiments. In Table 1 and Table 2, GDUFormer’s advantages in the

S_{α}

and

E_{β}^{m e a n}

metrics are not as obvious as those in other metrics. Given FGA’s excellent overall performance, we believe it plays a positive role consistent with the original design intent. In summary, as the core modules of GDU, HDDC and FGA have significantly improved the overall performance of GDUFormer.

Study on the Internal Structure of FGA. Though the overall performance of FGA has been confirmed, in-depth exploration of its internal structure remains essential. For this reason, we carried out multiple experiments as displayed in Table 5. Based on the experimental results of the first, second, and fourth rows, removing either the left or right branch of FGA will impose an adverse impact on the detection results, and the impact of removing the right branch is more notable. In detail, the fourth row surpasses the second row only in

F_{β}^{m a x}

and

E_{β}^{m a x}

, which reveals the considerable drawbacks caused by the absence of the ViM-inspired linear attention mechanism and the operation of removing weak channel relationships. In contrast, the left branch is mainly used to perceive and aggregate multi-scale local receptive fields, serving as a further step of modeling and exploration for image dimensions. Due to the requirement of keeping the overall architecture lightweight, its structural design is rather concise, resulting in relatively restricted effectiveness.

The experiments in the third and fifth rows mainly focus on the key internal elements of the two branches. The third row indicates that FGA does not use the grouping mechanism shown in Equation (3), but instead directly aggregates the extracted local feature maps. Based on the results, eliminating the grouping mechanism indeed brings a relatively distinct adverse effect on the model’s detection performance. This reveals that the grouping mechanism better merges and induces local candidate windows, thereby achieving effective filtering of noisy pixels and enhancing the efficient integration of global encoder features and local decoder features, which further mitigates the problem of large object scale differences in ORSI. The fifth row explores the internal gating attention of the right branch (Equation (5)). When gating attention is not applied to

{S i m}_{k v}

, the representational bottleneck of the linear model will reappear, and the dynamic balance of memory and forgetting will regress. As a result, weak channel relationships cannot be screened, and the acquisition of advanced reasoning and abstract capabilities is insufficient, ultimately leading to an overall decline in the performance of FGA and even the entire model. Therefore, it can be concluded that the internal structure adopted by FGA is reasonable and effective, and no other variants can achieve equivalent substitution for FGA.

Study on the Internal Structure of HDDC. Compared with FGA, HDDC is more critical and has a more complex internal structure. To this end, we performed ablation experiments on its internal configuration as presented in Table 6. The experiments in the second and third rows, respectively, indicate not using the distance decay mechanism at all in the kernel space and not using the distance decay mechanism for the transposed diagonal. The experimental results demonstrate the effectiveness of the proposed distance decay mechanism: it can enhance the convolution kernel’s extraction of discriminative information and detailed clues by leveraging directional characteristics. Additionally, the performance of using the distance decay mechanism on only one side is notably inferior to that of using it on both sides. This shows that a balanced distance decay relationship will help improve the overall efficiency of the convolution kernel, which also provides certain experimental support for the subsequent application of the hierarchical intensity difference capture (HIDC) mechanism to strengthen the overall kernel space. The fourth row of Table 6 presents the results of removing HIDC. According to the data, removing HIDC leads to a more severe negative impact than removing the distance decay mechanism. This comparative result explicitly proves the necessity of constructing global kernel space weights—using only the distance decay mechanism cannot introduce sufficient spatial priors into the convolution kernel weights, thereby affecting the model’s ability to distinguish between foreground and background.

The experiment in the fifth row is an extension of the fourth row. Without changing other structures, we canceled the hierarchical segmentation in HIDC. All weights of the convolution kernel will be treated as a whole to linearly capture the intensity differences within the receptive field. The results indicate that hierarchical segmentation is reasonable—capturing as a whole can not highlight the importance of features close to the central point, eventually leading to the model’s imprecision in distinguishing foreground from background. The sixth row is an exploration of the intensity difference highlighting mechanism (IDHM) in Equation (11). It can be found that this mechanism is effective: it can generate higher activation responses for the kernel space, thereby improving the model’s ability to capture detailed features of small objects. Without this mechanism, the overall performance of the model is even worse than that after removing hierarchical segmentation, which indirectly illustrates the effectiveness of the IDHM.

Experiments in Rows 7 to 10 all investigate the validity of the Adaptive Hierarchical Operator (AHO). When the AHO’s hyperparameter configuration is shifted from inside-out decreasing to equal or increasing, the model’s detection performance consistently degrades. This outcome validates the correctness of our argument in Section 3.3 about the layer-wise gradual decrease in feature importance. Furthermore, the experimental result of Row 8 being inferior to that of Row 7 indirectly confirms this conclusion. Rows 9 and 10 further convert the hyperparameters from learnable to static while adopting different hyperparameter settings. As the results of these two rows are inferior to those of Row 1 and Row 8, respectively, it is evident that using learnable parameters to adapt to the convolution kernel’s optimization and iteration process is rational and correct. Based on the comprehensive analysis of experimental data in Table 6, it can be concluded that the internal structure of HDDC is fully rational and capable of effectively boosting the model’s performance.

5. Discussion

GDUFormer addresses the problems of limited local receptive fields in CNNs and noise interference as well as detail loss in ViTs for Optical Remote Sensing Image Salient Object Detection. It establishes an encoder–decoder framework with ViT as the backbone and GDU as the core, where the core design logic is well-aligned with the experimental results.

From the perspective of model structure, the dual-branch design of FGA forms complementary advantages: the left branch effectively mitigates the problem of large object scale differences in remote sensing images through multi-receptive field aggregation and a grouping mechanism; the right branch achieves weak channel filtering based on Vision Mamba-inspired linear attention and internal gating, enhancing the abstract reasoning capabilities of features. As the core of information processing, HDDC makes a more critical contribution to foreground-background discrimination and small object detail capture. Its distance decay mechanism and hierarchical intensity difference capture mechanism successfully integrate spatial priors into dynamic convolution kernels, balancing long-range dependency modeling and local inductive bias.

Experimental results show that GDUFormer achieves superior comprehensive performance compared to existing CNN-based and ViT-based methods on the two benchmark datasets (EORSSD and ORSI-4199), particularly excelling in the

F_{β}^{a d p}

and

E_{β}^{a d p}

metrics, which verifies the model’s adaptability to complex scenarios. Ablation experiments also demonstrate that removing either FGA or HDDC leads to performance degradation. However, the model still has limitations: in hard samples with multiple objects and strong illumination interference, there are still a small number of false positive predictions and insufficiently accurate boundary details. This is closely related to the characteristics of remote sensing images, such as irregular object distribution and complex background textures.

Future optimizations can be carried out in two aspects: first, introducing an adaptive sample weighting strategy to improve the model’s robustness to hard samples; second, simplifying the branch structure of FGA to further reduce computational overhead while maintaining performance, making it more suitable for large-scale remote sensing image processing scenarios. Overall, through modular innovative design, GDUFormer provides an efficient and balanced solution for the ORSI-SOD task, and the design ideas of its core modules can also serve as a reference for other remote sensing image understanding tasks.

6. Conclusions

Considering the common issues of existing methods in the ORSI-SOD task, we propose our solution, GDUFormer, in this paper to suppress noise, highlight effective channels, enhance foreground-background discrimination, and optimize the detailed features of small objects. Specifically, we first apply a ViT backbone to encode global features at multiple scales. Subsequently, we use a decoder stacked with GDUs to capture salient object features in the information flow, and generate the model’s final prediction results through the predictor. The main contributions of GDU focus on FGA and HDDC. Among them, the former uses two independent branches to filter noisy spatial local cues and weak channel relationships, respectively. The specific filtering process involves the grouping mechanism and internal gating attention, and the channel filtering process realizes effective ViM-inspired linear attention, thereby enhancing the model’s advanced reasoning and abstraction capabilities. For HDDC, it generates dynamic kernel space weights through the distance decay mechanism and the hierarchical intensity difference capture mechanism, which prompts the convolution kernel to additionally focus on detailed directional cues and relevant discriminative information. The intensity difference capture mechanism utilizes the hierarchical properties of the convolution kernel window and configures different learnable parameters for each level, thus realizing the diversified perception of pixel intensity difference by the convolution. We verified the effectiveness and rationality of GDUFormer on multiple benchmark datasets. Comparative experimental results show that GDUFormer achieves the best comprehensive performance. Ablation experimental results also confirm the effectiveness of FGA, HDDC, and their specific internal designs.

Author Contributions

Conceptualization, M.S. and P.L.; methodology, M.S.; software, M.S.; validation, M.S., T.L. and W.W.; formal analysis, M.S.; investigation, T.L.; resources, W.W.; data curation, T.L.; writing—original draft preparation, M.S.; writing—review and editing, M.S.; visualization, T.L.; supervision, P.L.; project administration, P.L.; funding acquisition, W.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Jilin Provincial Department of Education Science and Technology Research Project, grant number JJKH20251376KJ, and the Jilin Provincial Natural Science Foundation Project, grant numbers YDZJ202501ZYTS589 and YDZJ202501ZYTS619.

Data Availability Statement

Data will be made available on request.

Acknowledgments

The authors are grateful to the anonymous reviewers for their insightful comments, which have certainly improved this paper.

Conflicts of Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Abbreviations

The following abbreviations are used in this manuscript:

ORSI-SOD	Optical Remote Sensing Images Salient Object Detection
GDU	Gated Differential Unit
FGA	Full-Dimensional Gated Attention
HDDC	Hierarchical Differential Dynamic Convolution
ViM	Vision Mamba
HIDC	Hierarchical Intensity Difference Capture
IDHM	Intensity Difference Highlighting Mechanism
AHO	Adaptive Hierarchical Operator

References

Chen, H.; Shi, Z. A spatial-temporal attention-based method and a new dataset for remote sensing image change detection. Remote Sens. 2020, 12, 1662. [Google Scholar] [CrossRef]
Zhang, C.; Wang, L.; Cheng, S.; Li, Y. SwinSUNet: Pure transformer network for remote sensing image change detection. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5224713. [Google Scholar] [CrossRef]
Li, C.; Cong, R.; Guo, C.; Li, H.; Zhang, C.; Zheng, F.; Zhao, Y. A parallel down-up fusion network for salient object detection in optical remote sensing images. Neurocomputing 2020, 415, 411–420. [Google Scholar] [CrossRef]
Huang, Z.; Chen, H.; Liu, B.; Wang, Z. Semantic-guided attention refinement network for salient object detection in optical remote sensing images. Remote Sens. 2021, 13, 2163. [Google Scholar] [CrossRef]
Sun, Y.; Zhao, H.; Zhou, J. DKETFormer: Salient object detection in optical remote sensing images based on discriminative knowledge extraction and transfer. Neurocomputing 2025, 625, 129558. [Google Scholar] [CrossRef]
Xu, R.; Wang, C.; Zhang, J.; Xu, S.; Meng, W.; Zhang, X. RSSFormer: Foreground saliency enhancement for remote sensing land-cover segmentation. IEEE Trans. Image Process. 2023, 32, 1052–1064. [Google Scholar] [CrossRef]
Li, G.; Bai, Z.; Liu, Z.; Zhang, X.; Ling, H. Salient object detection in optical remote sensing images driven by transformer. IEEE Trans. Image Process. 2023, 32, 5257–5269. [Google Scholar] [CrossRef]
Di, L.; Zhang, B.; Wang, Y. Multiscale and multidimensional weighted network for salient object detection in optical remote sensing images. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5625114. [Google Scholar] [CrossRef]
Zhang, X.; Yu, Y.; Li, D.; Wang, Y. Progressive Self-Prompting Segment Anything Model for Salient Object Detection in Optical Remote Sensing Images. Remote Sens. 2025, 17, 342. [Google Scholar] [CrossRef]
Wang, W.; Xie, E.; Li, X.; Fan, D.P.; Song, K.; Liang, D.; Lu, T.; Luo, P.; Shao, L. Pvt v2: Improved baselines with pyramid vision transformer. Comput. Vis. Media 2022, 8, 415–424. [Google Scholar] [CrossRef]
Wei, J.; Wang, S.; Huang, Q. F³Net: Fusion, feedback and focus for salient object detection. In Proceedings of the Thirty-Fourth AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 12321–12328. [Google Scholar]
Pang, Y.; Zhao, X.; Zhang, L.; Lu, H. Multi-scale interactive network for salient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 9413–9422. [Google Scholar]
Zhang, P.; Liu, W.; Lu, H.; Shen, C. Salient object detection with lossless feature reflection and weighted structural loss. IEEE Trans. Image Process. 2019, 28, 3048–3060. [Google Scholar] [CrossRef]
Zhao, X.; Pang, Y.; Zhang, L.; Lu, H.; Zhang, L. Suppress and balance: A simple gated network for salient object detection. In Computer Vision—ECCV 2020, Proceedings of the 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part II 16; Springer: Cham, Switzerland, 2020; pp. 35–51. [Google Scholar]
Zhu, J.; Qin, X.; Elsaddik, A. Dc-net: Divide-and-conquer for salient object detection. Pattern Recognit. 2025, 157, 110903. [Google Scholar] [CrossRef]
Wang, Y.; Wang, R.; Fan, X.; Wang, T.; He, X. Pixels, regions, and objects: Multiple enhancement for salient object detection. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 10031–10040. [Google Scholar]
Liu, N.; Zhang, N.; Wan, K.; Shao, L.; Han, J. Visual saliency transformer. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 4722–4732. [Google Scholar]
Zhuge, M.; Fan, D.P.; Liu, N.; Zhang, D.; Xu, D.; Shao, L. Salient object detection via integrity learning. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 3738–3752. [Google Scholar] [CrossRef]
Yun, Y.K.; Lin, W. Selfreformer: Self-refined network with transformer for salient object detection. arXiv 2022, arXiv:2205.11283. [Google Scholar]
Ma, M.; Xia, C.; Xie, C.; Chen, X.; Li, J. Boosting broader receptive fields for salient object detection. IEEE Trans. Image Process. 2023, 32, 1026–1038. [Google Scholar] [CrossRef]
Gao, S.; Zhang, P.; Yan, T.; Lu, H. Multi-scale and detail-enhanced segment anything model for salient object detection. In Proceedings of the 32nd ACM International Conference on Multimedia, Melbourne, VIC, Australia, 28 October–1 November 2024; pp. 9894–9903. [Google Scholar]
Sun, Y.; Zhao, H.; Zhou, J. Segment Anything Model for detecting salient objects with accurate prompting and Ladder Directional Perception. Pattern Recognit. Lett. 2025, 196, 184–190. [Google Scholar] [CrossRef]
Kirillov, A.; Mintun, E.; Ravi, N.; Mao, H.; Rolland, C.; Gustafson, L.; Xiao, T.; Whitehead, S.; Berg, A.C.; Lo, W.Y.; et al. Segment anything. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision, Paris, France, 2–3 October 2023; pp. 4015–4026. [Google Scholar]
Qin, X.; Zhang, Z.; Huang, C.; Gao, C.; Dehghan, M.; Jagersand, M. Basnet: Boundary-aware salient object detection. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–21 June 2019; pp. 7479–7489. [Google Scholar]
Liu, N.; Han, J.; Yang, M.H. PiCANet: Pixel-wise contextual attention learning for accurate saliency detection. IEEE Trans. Image Process. 2020, 29, 6438–6451. [Google Scholar] [CrossRef]
Xu, B.; Liang, H.; Liang, R.; Chen, P. Locate globally, segment locally: A progressive architecture with knowledge review network for salient object detection. In Proceedings of the 2021 AAAI Conference on Artificial Intelligence, Virtually, 2–9 February 2021; Volume 35, pp. 3004–3012. [Google Scholar]
Zhou, X.; Shen, K.; Weng, L.; Cong, R.; Zheng, B.; Zhang, J.; Yan, C. Edge-guided recurrent positioning network for salient object detection in optical remote sensing images. IEEE Trans. Cybern. 2022, 53, 539–552. [Google Scholar] [CrossRef]
Feng, D.; Chen, H.; Liu, S.; Liao, Z.; Shen, X.; Xie, Y.; Zhu, J. Boundary-semantic collaborative guidance network with dual-stream feedback mechanism for salient object detection in optical remote sensing imagery. IEEE Trans. Geosci. Remote Sens. 2023, 61, 4706317. [Google Scholar] [CrossRef]
Quan, Y.; Xu, H.; Wang, R.; Guan, Q.; Zheng, J. ORSI Salient Object Detection via Progressive Semantic Flow and Uncertainty-aware Refinement. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5608013. [Google Scholar] [CrossRef]
Li, G.; Liu, Z.; Zhang, X.; Lin, W. Lightweight salient object detection in optical remote-sensing images via semantic matching and edge alignment. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5601111. [Google Scholar] [CrossRef]
Wang, Q.; Liu, Y.; Xiong, Z.; Yuan, Y. Hybrid feature aligned network for salient object detection in optical remote sensing imagery. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5624915. [Google Scholar] [CrossRef]
Zhang, J.; Liu, T.; Zhang, J.; Liu, L. TS-BiT: Two-Stage Binary Transformers for ORSI Salient Object Detection. IEEE Geosci. Remote Sens. Lett. 2025, 22, 6003905. [Google Scholar] [CrossRef]
Tang, H.; Li, Z.; Zhang, D.; He, S.; Tang, J. Divide-and-Conquer: Confluent Triple-Flow Network for RGB-T Salient Object Detection. IEEE Trans. Pattern Anal. Mach. Intell. 2025, 47, 1958–1974. [Google Scholar] [CrossRef]
Xu, Y.; Hou, R.; Qi, Z.; Ren, T. Mamba4SOD: RGB-T Salient Object Detection Using Mamba-Based Fusion Module. IET Comput. Vis. 2025, 19, e70033. [Google Scholar] [CrossRef]
Liu, Z.; Hu, H.; Lin, Y.; Yao, Z.; Xie, Z.; Wei, Y.; Ning, J.; Cao, Y.; Zhang, Z.; Dong, L.; et al. Swin transformer v2: Scaling up capacity and resolution. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 12009–12019. [Google Scholar]
Hou, R.; Li, X.; Ren, T.; Zhou, D.; Wu, G.; Cao, J. HyPSAM: Hybrid Prompt-driven Segment Anything Model for RGB-Thermal Salient Object Detection. IEEE Trans. Circuits Syst. Video Technol. 2025. [Google Scholar] [CrossRef]
Li, X.; Hou, R.; Ren, T.; Wu, G. KAN-SAM: Kolmogorov-Arnold Network Guided Segment Anything Model for RGB-T Salient Object Detection. arXiv 2025, arXiv:2504.05878. [Google Scholar]
Ravi, N.; Gabeur, V.; Hu, Y.T.; Hu, R.; Ryali, C.; Ma, T.; Khedr, H.; Rädle, R.; Rolland, C.; Gustafson, L.; et al. Sam 2: Segment anything in images and videos. arXiv 2024, arXiv:2408.00714. [Google Scholar] [PubMed]
Zhang, S.; Huang, J.; Tang, W.; Wu, Y.; Hu, T.; Xu, X.; Liu, J. DiMSOD: A Diffusion-Based Framework for Multi-Modal Salient Object Detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Philadelphia, PA, USA, 25 February–4 March 2025; Volume 39, pp. 10103–10111. [Google Scholar]
Wu, Z.; Su, L.; Huang, Q. Cascaded partial decoder for fast and accurate salient object detection. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 3907–3916. [Google Scholar]
Han, D.; Wang, Z.; Xia, Z.; Han, Y.; Pu, Y.; Ge, C.; Song, J.; Song, S.; Zheng, B.; Huang, G. Demystify mamba in vision: A linear attention perspective. Adv. Neural Inf. Process. Syst. 2024, 37, 127181–127203. [Google Scholar]
Lou, M.; Yu, Y. OverLoCK: An Overview-first-Look-Closely-next ConvNet with Context-Mixing Dynamic Kernels. In Proceedings of the 2025 Computer Vision and Pattern Recognition Conference, Nashville, TN, USA, 11–15 June 2025; pp. 128–138. [Google Scholar]
Fan, Q.; Huang, H.; Chen, M.; Liu, H.; He, R. Rmt: Retentive networks meet vision transformers. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 5641–5651. [Google Scholar]
Su, Z.; Liu, W.; Yu, Z.; Hu, D.; Liao, Q.; Tian, Q.; Pietikäinen, M.; Liu, L. Pixel difference networks for efficient edge detection. In Proceedings of the Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 5117–5127. [Google Scholar]
Zhang, Q.; Cong, R.; Li, C.; Cheng, M.M.; Fang, Y.; Cao, X.; Zhao, Y.; Kwong, S. Dense attention fluid network for salient object detection in optical remote sensing images. IEEE Trans. Image Process. 2020, 30, 1305–1317. [Google Scholar] [CrossRef]
Li, C.; Cong, R.; Hou, J.; Zhang, S.; Qian, Y.; Kwong, S. Nested network with two-stream pyramid for salient object detection in optical remote sensing images. IEEE Trans. Geosci. Remote Sens. 2019, 57, 9156–9166. [Google Scholar] [CrossRef]
Tu, Z.; Wang, C.; Li, C.; Fan, M.; Zhao, H.; Luo, B. ORSI salient object detection via multiscale joint region and boundary model. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5607913. [Google Scholar] [CrossRef]
Deng, Z.; Hu, X.; Zhu, L.; Xu, X.; Qin, J.; Han, G.; Heng, P.A. R3net: Recurrent residual refinement network for saliency detection. In Proceedings of the 27th International Joint Conference on Artificial Intelligence, Stockholm, Sweden, 13–19 July 2018; AAAI Press: Menlo Park, CA, USA, 2018; Volume 684690. [Google Scholar]
Liu, J.J.; Hou, Q.; Cheng, M.M.; Feng, J.; Jiang, J. A simple pooling-based design for real-time salient object detection. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 3917–3926. [Google Scholar]
Zhao, J.X.; Liu, J.J.; Fan, D.P.; Cao, Y.; Yang, J.; Cheng, M.M. EGNet: Edge guidance network for salient object detection. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision, Seoul, Korea, 27 October–2 November 2019; pp. 8779–8788. [Google Scholar]
Gao, S.H.; Tan, Y.Q.; Cheng, M.M.; Lu, C.; Chen, Y.; Yan, S. Highly efficient salient object detection with 100k parameters. In Proceedings of the European Conference on Computer Vision; Springer: Cham, Switzerland, 2020; pp. 702–721. [Google Scholar]
Liu, Y.; Zhang, X.Y.; Bian, J.W.; Zhang, L.; Cheng, M.M. SAMNet: Stereoscopically attentive multi-scale network for lightweight salient object detection. IEEE Trans. Image Process. 2021, 30, 3804–3814. [Google Scholar] [CrossRef]
Liu, Y.; Gu, Y.C.; Zhang, X.Y.; Wang, W.; Cheng, M.M. Lightweight salient object detection via hierarchical visual perception learning. IEEE Trans. Cybern. 2020, 51, 4439–4449. [Google Scholar] [CrossRef]
Li, G.; Liu, Z.; Zeng, D.; Lin, W.; Ling, H. Adjacent context coordination network for salient object detection in optical remote sensing images. IEEE Trans. Cybern. 2022, 53, 526–538. [Google Scholar] [CrossRef]
GongyangLi, Z.; Bai, Z.; Lin, W.; Ling, H. Lightweight salient object detection in optical remote sensing images via feature correlation. IEEE Trans. Geosci. Remote Sens 2022, 60, 5617712. [Google Scholar]
Li, G.; Liu, Z.; Lin, W.; Ling, H. Multi-content complementation network for salient object detection in optical remote sensing images. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5614513. [Google Scholar] [CrossRef]
Zhou, X.; Shen, K.; Liu, Z.; Gong, C.; Zhang, J.; Yan, C. Edge-Aware Multiscale Feature Integration Network for Salient Object Detection in Optical Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5634819. [Google Scholar] [CrossRef]
Jia, Y.; Zhao, J.; Ma, L.; Yu, L. Multistrategy Region and Boundary Interaction Network for Salient Object Detection in Optical Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5633116. [Google Scholar] [CrossRef]
Xing, G.; Wang, M.; Wang, F.; Sun, F.; Li, H. Lightweight Edge-Aware Mamba-Fusion Network for Weakly Supervised Salient Object Detection in Optical Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5631813. [Google Scholar] [CrossRef]
Gu, Y.; Chen, S.; Sun, X.; Ji, J.; Zhou, Y.; Ji, R. Optical remote sensing image salient object detection via bidirectional cross-attention and attention restoration. Pattern Recognit. 2025, 164, 111478. [Google Scholar] [CrossRef]
Sun, L.; Liu, H.; Wang, X.; Zheng, Y.; Chen, Q.; Wu, Z.; Fu, L. Local–Global Information Perception Network for Salient Object Detection in Optical Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5625518. [Google Scholar] [CrossRef]

Figure 2. The overall architecture of GDUFormer. Except for

P_{4}

, the outputs of the decoder stacked by GDU are also processed through

1 \times 1

convolution to generate

D_{1}

,

D_{2}

and

D_{3}

. The lower half of the figure conceptually illustrates the roles that FGA and HDDC, the two core components of GDU, are capable of performing. It should be noted that to more clearly demonstrate the uniqueness of each component, we use red in the lower half to represent unwanted feature representations and teal for desired ones.

Figure 2. The overall architecture of GDUFormer. Except for

P_{4}

, the outputs of the decoder stacked by GDU are also processed through

1 \times 1

convolution to generate

D_{1}

,

D_{2}

and

D_{3}

. The lower half of the figure conceptually illustrates the roles that FGA and HDDC, the two core components of GDU, are capable of performing. It should be noted that to more clearly demonstrate the uniqueness of each component, we use red in the lower half to represent unwanted feature representations and teal for desired ones.

Figure 3. The structure of GDU. This module consists of two core components, FGA and HDDC, which also represent our main contributions.

Figure 4. The structure of FGA. The left and right branches of this component, respectively, utilize the grouping mechanism and the ViM-inspired linear attention to filter spatial local information and weak channel relationships. Among them, the latter can also bring additional advanced reasoning and abstract capabilities to the module.

Figure 5. The structure of HDDC. The design emphasis of this module lies in the enhancement of convolution kernel weights. HDDC adopts the distance decay mechanism and hierarchical intensity difference capture mechanism to accomplish the overall enhancement of the convolution kernel space, which in turn helps the model capture related discriminative information and directional clues.

Figure 6. The visual comparison examples of GDUFormer with comparative methods on EORSSD. * denotes the retrained methods. It can be seen that the prediction results of GDUFormer are clear and reliable, achieving accurate foreground-background separation, which visually proves the effectiveness of the proposed method.

Figure 7. The failure cases of GDUFormer predictions. The main drawback of the proposed method is that there are still some false positives in the prediction maps, and the boundaries and details of some objects are not accurate and clear enough.

Table 1. The comparative experimental results on the EORSSD dataset. * denotes retrained methods, - indicates that the original score of the method for the metric is unpublished, and the top two results under each metric are marked in red and blue, respectively.

Method	$S_{α}$	$F_{β}^{\max}$	$F_{β}^{mean}$	$F_{β}^{adp}$	$E_{β}^{\max}$	$E_{β}^{mean}$	$E_{β}^{adp}$
R3Net	0.8184	0.7498	0.6302	0.4165	0.9483	0.8294	0.6462
PoolNet	0.8207	0.7545	0.6406	0.4611	0.9292	0.8193	0.6836
EGNet	0.8601	0.7880	0.6967	0.5379	0.9570	0.8775	0.7566
MINet	0.9040	0.8344	0.8174	0.7705	0.9442	0.9346	0.9243
GateNet	0.9114	0.8566	0.8228	0.7109	0.9610	0.9385	0.8909
CSNet	0.8364	0.8341	0.7656	0.6319	0.9535	0.8929	0.8339
SAMNet	0.8622	0.7813	0.7214	0.6114	0.9421	0.8700	0.8284
HVPNet	0.8734	0.8036	0.7377	0.6202	0.9482	0.8721	0.8270
PA-KRN	0.9192	0.8639	0.8358	0.7993	0.9616	0.9536	0.9416
VST	0.9208	0.8716	0.8263	0.7089	0.9743	0.9442	0.8941
ICON	0.9185	0.8622	0.8371	0.8065	0.9687	0.9619	0.9497
LVNet	0.8630	0.7794	0.7328	0.6284	0.9254	0.8801	0.8445
ACCoNet	0.9290	0.8837	0.8552	0.7969	0.9727	0.9653	0.9450
MJRBM	0.9197	0.8656	0.8239	0.7066	0.9646	0.9350	0.8897
CorrNet	0.9289	0.8778	0.8620	0.8311	0.9696	0.9646	0.9593
ERPNet	0.9210	0.8632	0.8304	0.7554	0.9603	0.9401	0.9228
SARNet	0.9240	0.8719	0.8541	0.8304	0.9620	0.9555	0.9536
MCCNet	0.9327	0.8904	0.8604	0.8137	0.9755	0.9685	0.9538
EMFINet	0.9290	0.8720	0.8486	0.7984	0.9711	0.9604	0.9501
SeaNet	0.9208	0.8649	0.8519	0.8304	0.9710	0.9651	0.9602
SFANet	0.9349	0.8833	0.8680	-	0.9769	0.9726	-
MRBINet	0.9351	0.8852	0.8768	0.8626	0.9766	0.9721	0.9729
HFANet	0.9380	0.8876	0.8681	0.8365	0.9740	0.9679	0.9644
LightEMNet	0.8870	0.8470	0.8350	-	0.9620	0.9440	-
BCAR-Net	0.9360	-	-	-	0.9761	-	-
LGIPNet	0.9392	0.8924	0.8770	0.8623	0.9806	0.9753	0.9719
DKETFormer *	0.9304	0.8831	0.8734	0.8542	0.9778	0.9676	0.9705
GDUFormer	0.9336	0.8933	0.8813	0.8646	0.9827	0.9712	0.9764

Table 2. The comparative experimental results on the ORSI-4199 dataset. * denotes retrained methods, among which the backbones of SeaNet and SFANet are replaced with PVT v2; - indicates that the original score of the method for the metric is unpublished, and the top two results under each metric are marked in red and blue, respectively.

Method	$S_{α}$	$F_{β}^{\max}$	$F_{β}^{mean}$	$F_{β}^{adp}$	$E_{β}^{\max}$	$E_{β}^{mean}$	$E_{β}^{adp}$
R3Net	0.8142	0.7847	0.7790	0.7776	0.888	0.8722	0.8645
PiCANet	0.7114	0.6461	0.5684	0.5933	0.7946	0.6927	0.7511
PoolNet	0.8271	0.8010	0.7779	0.7382	0.8964	0.8676	0.8531
EGNet	0.8464	0.8267	0.8041	0.7650	0.9161	0.8947	0.8620
BASNet	0.8341	0.8157	0.8042	0.7810	0.9069	0.8881	0.8882
CSNet	0.8241	0.8124	0.7674	0.7162	0.9096	0.8586	0.8447
SAMNet	0.8409	0.8249	0.8029	0.7744	0.9186	0.8938	0.8781
HVPNet	0.8471	0.8295	0.8041	0.7652	0.9201	0.8956	0.8687
PA-KRN	0.8491	0.8415	0.8324	0.8200	0.9280	0.9168	0.9063
VST	0.8790	0.8717	0.8524	0.7947	0.9481	0.9348	0.8997
ICON	0.8752	0.8763	0.8664	0.8531	0.9521	0.9438	0.9239
MJRBM	0.8593	0.8493	0.8309	0.7995	0.9311	0.9102	0.8891
MCCNet	0.8746	0.8690	0.8630	0.8592	0.9413	0.9348	0.9182
CorrNet	0.8623	0.8560	0.8513	0.8534	0.9330	0.9206	0.9142
ERPNet	0.8670	0.8553	0.8374	0.8024	0.9290	0.9149	0.9024
ACCoNet	0.8775	0.8686	0.8620	0.8581	0.9412	0.9342	0.9167
EMFINet	0.8675	0.8584	0.8479	0.8186	0.9340	0.9257	0.9136
SeaNet	0.8722	0.8653	0.8591	0.8556	0.9426	0.9363	0.9197
SFANet	0.8761	0.8710	0.8659	-	0.9447	0.9385	-
MRBINet	0.8824	0.8800	0.8755	0.8755	0.9489	0.9425	0.9220
HFANet	0.8767	0.8700	0.8624	0.8323	0.9431	0.9336	0.9191
WeightNet	0.8884	0.8870	0.8813	0.8802	0.9534	0.9456	0.9283
SeaNet *	0.8860	0.8842	0.8783	0.8754	0.9533	0.9479	0.9250
SFANet *	0.8865	0.8858	0.8803	0.8782	0.9539	0.9484	0.9275
LGIPNet	0.8854	0.8791	0.8668	0.8268	0.9499	0.9432	0.9213
DKETFormer *	0.8848	0.8856	0.8798	0.8750	0.9539	0.9465	0.9259
GDUFormer	0.8873	0.8873	0.8818	0.8782	0.9560	0.9500	0.9296

Table 3. The results of efficiency comparison with major competitors. “Hybrid” indicates that the model adopts a hybrid backbone based on CNN and Transformer, while “2 × PVT” means the two-stage network employs the PVT backbone twice. * denotes retrained methods, among which the backbones of SeaNet and SFANet are replaced with PVT v2.

Method	Backbone	Params (M)	FLOPs (G)
SeaNet	MobileNet	2.76	1.70
SFANet	Res2Net	25.10	7.70
MRBINet	Res2Net	32.40	42.80
HFANet	Hybrid	60.53	68.30
WeightNet	2 × PVT	66.34	21.18
SeaNet *	PVT	25.57	11.59
SFANet *	PVT	25.77	12.74
LGIPNet	PVT	52.10	47.00
DKETFormer	PVT	25.97	13.77
GDUFormer	PVT	40.73	27.56

Table 4. The ablation study for validating the effectiveness of core components in GDUFormer. The “w/o” in the two middle rows represents “without”, which means excluding FGA or HDDC from GDU.

	$S_{α}$	$F_{β}^{\max}$	$F_{β}^{mean}$	$F_{β}^{adp}$	$E_{β}^{\max}$	$E_{β}^{mean}$	$E_{β}^{adp}$
Baseline	0.9235	0.8832	0.8687	0.8506	0.9765	0.9621	0.9712
GDU w/o HDDC	0.9322	0.8892	0.8748	0.8563	0.9796	0.9689	0.9709
GDU w/o FGA	0.9363	0.8911	0.8769	0.8594	0.9822	0.9735	0.9739
GDUFormer	0.9336	0.8933	0.8813	0.8646	0.9827	0.9712	0.9764

Table 5. The ablation study on the internal structure of FGA. These experiments mainly investigate the necessity and effectiveness of the left and right branches in FGA, and the key designs inside both branches have also been further verified.

	$S_{α}$	$F_{β}^{\max}$	$F_{β}^{mean}$	$F_{β}^{adp}$	$E_{β}^{\max}$	$E_{β}^{mean}$	$E_{β}^{adp}$
GDUFormer	0.9336	0.8933	0.8813	0.8646	0.9827	0.9712	0.9764
Non Left Branch	0.9318	0.8892	0.8756	0.8605	0.9805	0.9702	0.9747
Direct Combination	0.9351	0.8912	0.8786	0.8642	0.9807	0.9707	0.9756
Non Right Branch	0.9309	0.8904	0.8743	0.8553	0.9825	0.9694	0.9747
Non Internal Gating Attention	0.9339	0.8879	0.8766	0.8611	0.9798	0.9700	0.9738

Table 6. The ablation study on the internal structure of HDDC. We pay close attention to the distance decay mechanism and hierarchical intensity difference capture mechanism in HDDC, and the intensity difference highlighting mechanism and adaptive hierarchical operator of the latter have also been thoroughly explored.

	$S_{α}$	$F_{β}^{\max}$	$F_{β}^{mean}$	$F_{β}^{adp}$	$E_{β}^{\max}$	$E_{β}^{mean}$	$E_{β}^{adp}$
GDUFormer	0.9336	0.8933	0.8813	0.8646	0.9827	0.9712	0.9764
Non Distance Decay	0.9335	0.8863	0.8746	0.8590	0.9778	0.9688	0.9723
Non Decay Transpose	0.9314	0.8895	0.8751	0.8596	0.9797	0.9683	0.9735
Non HIDC	0.9327	0.8872	0.8727	0.8547	0.9805	0.9709	0.9717
Non Kernel Level	0.9295	0.8874	0.8769	0.8618	0.9784	0.9656	0.9743
Non IDHM	0.9287	0.8890	0.8753	0.8593	0.9790	0.9646	0.9744
AHO Equal	0.9340	0.8924	0.8794	0.8607	0.9815	0.9708	0.9774
AHO Reverse	0.9314	0.8894	0.8798	0.8644	0.9784	0.9672	0.9761
Non AHO Learnable	0.9315	0.8892	0.8760	0.8617	0.9804	0.9677	0.9749
Non AHO Reverse Learnable	0.9295	0.8858	0.8744	0.8606	0.9747	0.9634	0.9723

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Sun, M.; Lan, T.; Wang, W.; Liu, P. Salient Object Detection for Optical Remote Sensing Images Based on Gated Differential Unit. Remote Sens. 2026, 18, 389. https://doi.org/10.3390/rs18030389

AMA Style

Sun M, Lan T, Wang W, Liu P. Salient Object Detection for Optical Remote Sensing Images Based on Gated Differential Unit. Remote Sensing. 2026; 18(3):389. https://doi.org/10.3390/rs18030389

Chicago/Turabian Style

Sun, Mingsi, Ting Lan, Wei Wang, and Pingping Liu. 2026. "Salient Object Detection for Optical Remote Sensing Images Based on Gated Differential Unit" Remote Sensing 18, no. 3: 389. https://doi.org/10.3390/rs18030389

APA Style

Sun, M., Lan, T., Wang, W., & Liu, P. (2026). Salient Object Detection for Optical Remote Sensing Images Based on Gated Differential Unit. Remote Sensing, 18(3), 389. https://doi.org/10.3390/rs18030389

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Salient Object Detection for Optical Remote Sensing Images Based on Gated Differential Unit

Highlights

Abstract

1. Introduction

2. Related Work

2.1. Salient Object Detection Methods for Conventional Images

2.2. Salient Object Detection Methods for Optical Remote Sensing Images

3. GDUFormer

3.1. Overall Architecture

3.2. Full-Dimensional Gated Attention

3.3. Hierarchical Differential Dynamic Convolution

4. Experiments

4.1. Datasets

4.2. Implementation Details

4.3. Comparative Experiments

4.3.1. Comparative Results on the EORSSD Dataset

4.3.2. Comparative Results on the ORSI-4199 Dataset

4.4. Ablation Study

5. Discussion

6. Conclusions

Author Contributions

Funding

Data Availability Statement

Acknowledgments

Conflicts of Interest

Abbreviations

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI