Article

TSFANet: Trans-Mamba Hybrid Network with Semantic Feature Alignment for Remote Sensing Salient Object Detection

1 School of Computer Science, Northwestern Polytechnical University, Dongxiang Road, Chang’an District, Xi’an 710129, China
2 School of Earth Sciences and Engineering, Hohai University, Xikang Road, Gulou District, Nanjing 210098, China
3 School of Artificial Intelligence, Tianjin University of Science and Technology, Dagu South Road, Hexi District, Tianjin 300457, China
* Author to whom correspondence should be addressed.
Remote Sens. 2025, 17(11), 1902; https://doi.org/10.3390/rs17111902
Submission received: 8 April 2025 / Revised: 22 May 2025 / Accepted: 22 May 2025 / Published: 30 May 2025

Abstract

Recent advances in deep learning have witnessed the wide application of convolutional neural networks (CNNs), Transformer models, and Mamba models in optical remote sensing image (ORSI) analysis, particularly for salient object detection (SOD) tasks in disaster warning, urban planning, and military surveillance. Although existing methods improve detection accuracy by optimizing feature extraction and attention mechanisms, they still face limitations when dealing with the inherent challenges of ORSI. These challenges mainly manifest as complex backgrounds, extreme scale variations, and topological irregularities, which severely affect detection performance. More fundamentally, the underlying issue lies in how to effectively align and integrate local detail features with global semantic information. To tackle these issues, we propose the Trans-Mamba Hybrid Network with Semantic Feature Alignment (TSFANet), a novel architecture that exploits intrinsic correlations between semantic information and detail features. Our network comprises three key components: (1) a Trans-Mamba Semantic-Detail Dual-Stream Collaborative Module (TSDSM) that combines CNNs-Transformer and CNNs-Mamba in a hybrid dual-branch encoder to capture both global context and multi-scale local features; (2) an Adaptive Semantic Correlation Refinement Module (ASCRM) that leverages semantic-detail feature correlations for guided feature optimization; and (3) a Semantic-Guided Adjacent Feature Fusion Module (SGAFF) that aligns and refines multi-scale semantic features. Extensive experiments on three public RSI-SOD datasets demonstrate that our method consistently outperforms 30 state-of-the-art approaches, effectively accomplishing the task of salient object detection in remote sensing imagery.

1. Introduction

Remote sensing image salient object detection (RSI-SOD) aims to identify and extract visually prominent regions and objects from complex remote sensing scenes [1,2,3,4]. By analyzing image features such as texture, edges, and color patterns, SOD effectively suppresses background interference while highlighting objects of interest. With the advancement of Transformer architectures [5], RSI-SOD has demonstrated remarkable success in various applications, including terrain change detection [6], disaster early warning [7], and military object surveillance [8]. Different from natural scene images (NSI) captured by conventional cameras, RSIs are acquired by aerial platforms and satellites from high altitudes, providing broader coverage and higher resolution. However, this unique imaging perspective introduces several critical challenges:
(1) Complex Backgrounds: As illustrated in Figure 1 (first row), conventional NSI-SOD models struggle with RSI-SOD detection, often failing to differentiate between background and foreground elements, leading to numerous false positives; (2) Extreme Scale Variations: The second and third rows of Figure 1 demonstrate how the NSI-SOD model BASNet [9], while effective for large-scale salient region object extraction, significantly underperforms in detecting small-scale salient objects; (3) Irregular Topological Structures: The fourth and fifth rows of Figure 1 reveal that existing models can only roughly extract salient regions from objects with complex topologies and intricate edge details, resulting in significant loss of fine-grained features.
Recent approaches to address these challenges can be categorized into CNNs-based, Transformer-based, Mamba-based, and hybrid architectures. The CNNs-based R3Net [10] (Figure 1e) focuses on local feature learning but struggles with global context integration. The Transformer-based GeleNet [11] (Figure 1f) excels in capturing contextual relationships but lacks precision in edge detection. The Mamba-based VMamba [12] (Figure 1c) shows improved detail extraction through sequential modeling but lacks comprehensive scene understanding. While the hybrid CNNs-Transformer model HFANet [13] (Figure 1g) combines local and global feature processing, it still suffers from false detections. The feature visualization analysis in Figure 2 reveals that current RSI-SOD models, while showing improved object region focus compared to NSI-SOD approaches, still struggle with background suppression, inconsistent multi-scale target localization, and boundary preservation. For example, the suppression of interfering backgrounds such as the runway target region in the lower right corner of the first row and the red building above the second row is still insufficient, the positioning of the large-scale building region and the small-scale car region in the third row is incorrect, and the extraction of the boundary integrity of the road region in the fourth row is insufficient.
To address these limitations, we propose the Trans-Mamba Hybrid Network with Semantic Feature Alignment (TSFANet) to align and integrate local detail features with global semantic representations. By combining the strengths of Transformer-based global modeling and Mamba-based local detail extraction, our TSFANet is able to capture richer and more robust feature representations, leading to more accurate and reliable saliency prediction. This hybrid design is motivated by the complementary characteristics of both architectures, where global semantics help distinguish salient objects from cluttered backgrounds and local details refine object boundaries and small-scale structures. The primary contributions of this work are summarized as follows:
  • To achieve comprehensive feature extraction, we designed the Trans-Mamba Semantic-Detail Dual-Stream Collaborative Module (TSDSM), a novel dual-stream architecture that synergistically combines CNNs-Transformer and CNNs-Mamba branches. This hybrid structure effectively leverages Transformer’s global modeling capabilities and Mamba’s local processing advantages, enabling more accurate salient detection in complex remote sensing scenes.
  • For effective alignment of local details and global semantic features, we constructed the Adaptive Semantic Correlation Refinement Module (ASCRM). This module models the correlation between semantic and local features, utilizing matrix reshaping and SoftMax activation to accurately capture the spatial information of significant regions, thereby enhancing the precision of optical remote sensing salient object detection.
  • To better integrate semantic features with local details, we designed the Semantic-Guided Adjacent Feature Fusion Module (SGAFF). This module extracts the overall semantic framework using a global attention mechanism and fuses it with local features to enhance the semantic information. Through the global semantic fluid information, SGAFF effectively filters out background noise, highlights target details, and improves object detection accuracy.
  • Extensive experiments on three public RSI-SOD datasets demonstrate that our method consistently outperforms 26 state-of-the-art approaches. Detailed ablation studies verify that our proposed modules effectively address the challenges of complex backgrounds, scale variations, and irregular topologies in RSI-SOD tasks.
The article is structured as follows. Section 2 provides an overview of the related works. Section 3 introduces the proposed TSFANet framework and its associated components. The experimental results and discussion analysis are presented in Section 4. Lastly, the conclusion is drawn in Section 5.

2. Related Works

2.1. State Space Models

Mamba has emerged as the most advanced state space model (SSM) variant, achieving Transformer-comparable modeling capabilities through its hardware-aware design while maintaining superior efficiency for long sequence processing. Recent developments in Mamba-based architectures can be categorized into several key research directions: Architectural Innovations: Early works focused on enhancing Mamba’s basic architecture. Matten [14] pioneered bidirectional processing with spatiotemporal attention, while MambaVision [15] introduced a restructured architecture that effectively integrates with Transformer mechanisms. nnMamba [16] advanced the field by incorporating Channel-Spatial Siamese learning with CNN integration, significantly improving long-range modeling capabilities. DualMamba [17] proposed a hybrid approach combining cross-attention modules for global modeling with residual learning for local feature extraction. Vision-Specific Adaptations: Several works have focused on adapting Mamba for vision tasks. VMamba [12] introduced the SS2D module for 2D selective scanning, enhancing contextual information collection. Weak-Mamba-UNet [18] developed a comprehensive framework combining CNN, Transformer, and Mamba-based encoders for multi-level feature processing. LocalMamba [19] proposed a dynamic, layer-wise scanning strategy for localized feature capture while maintaining global context. VIM [20] enhanced visual representation through position-aware bidirectional modeling. Task-Specific Applications: Recent research has expanded Mamba’s applications across various domains. Graph-Mamba [21] enhanced graph network modeling through input-dependent node selection. VMRNN [22] and SpikeMba [23] focused on temporal dynamics and neural information processing. CU-Mamba [24] and VMambaMorph [25] addressed medical imaging challenges through specialized architectures. MedMamba [26] developed multi-modal medical image analysis capabilities, while Pan-Mamba [27] and Vmambair [28] focused on cross-modal interaction and image restoration, respectively. Despite these advances in Mamba’s applications across various computer vision tasks, its potential in remote sensing image analysis remains largely unexplored, particularly in the context of salient object detection. Most existing Mamba-based methods for remote sensing images focus solely on semantic segmentation tasks, leaving a significant gap in RSI-SOD applications.

2.2. SOD in Natural Scene Images

Traditional Natural Scene Image Salient Object Detection (NSI-SOD) primarily relied on classical machine learning algorithms, leveraging handcrafted features such as regularization techniques [4], color variation [29], and background priors [30]. While these methods achieved initial progress, they were limited by their dependence on manually designed features and inability to extract deep semantic information. The advent of deep learning brought revolutionary advances to saliency detection. Itti et al. [1] pioneered computational visual attention with a center-surround disparity mechanism, efficiently integrating multi-scale features for salient region identification. Subsequently, CNNs-based approaches demonstrated remarkable improvements. R3Net [10] enhanced detection through recurrent networks and residual learning, alternating between low- and high-level features. AFNet [31] improved boundary accuracy by combining boundary perception with attention mechanisms, while MDF [32] strengthened feature representation through multi-level feature cascading. Amulet [33] introduced effective feature fusion strategies, and MSIN [34] enhanced feature extraction through multi-scale interaction. Edge information processing saw significant diversification. C2SNet [35] utilized boundary information as auxiliary signals, while BASNet [9] improved performance in challenging scenarios through boundary-aware mechanisms. EGNet [36] and ITSD [37] enhanced boundary prediction through edge guidance and two-stream encoding. TRACER [38] optimized feature selection through extreme attention, and PiCANet [39] employed contextual attention for dynamic feature fusion. Recent advances include PFANet’s [40] pyramid feature extraction, BBRFNet’s [41] multi-scale receptive fields, and VST’s [5] Transformer-based long-range dependency modeling. Despite these advances in NSI-SOD, optical remote sensing image saliency detection faces unique challenges due to imaging limitations, including resolution variations, illumination changes, and atmospheric interference. Direct application of existing models to remote sensing images remains problematic, highlighting the need for specialized approaches that effectively integrate multi-scale global and local features while addressing domain-specific challenges.

2.3. SOD in Optical RSIs

RSI-SOD addresses more complex remote sensing scenes compared to NSI-SOD, leading to significant methodological advances organized in four main development stages: (1) The foundational stage focused on establishing datasets and basic frameworks. Zhao et al. [42] introduced a sparsity-guided SOD method with initial RSI datasets. Li et al. [43] developed the Optical Remote Sensing for Salient Object Detection Dataset (ORSSD) and LVNet, combining nested networks with pyramid structures. Zhang et al. [44] extended this with the higher-resolution Extended Optical Remote Sensing for Salient Object Detection Dataset (EORSSD) and proposed DAFNet featuring dense attention mechanisms. Tu et al. [45] contributed the challenging Optical Remote Sensing Image Dataset with 4199 Samples (ORSI-4199) and the MJRBM model. (2) CNNs-based methods then emerged to address RSI-specific challenges. EMFINet [46] combined edge awareness with multi-scale feature pyramids, while MCCNet [47] tackled complex backgrounds through multi-content completion. SARNet [48] introduced coarse-to-fine detection with semantic guidance, and ACCoNet [49] leveraged neighborhood context coordination. ICON [50] focused on boundary preservation and feature diversity enhancement. (3) Recent advances have centered on Transformer-based and hybrid architectures. GeleNet [11], the pioneering Transformer-based method, introduced direction-aware spatial attention. Subsequent works like RAGRNet [51], BSCGNet [52], and ESGNet [53] enhanced feature interaction through various graph-based and attention mechanisms. Ma et al. [54] proposed an end-to-end framework that integrates superpixel generation and region merging for remote sensing image segmentation. ASTTNet [55] and IDELNet [56] further refined Transformer-based feature extraction. (4) The latest hybrid approaches have shown promising results. HFANet [13] combined CNN and Transformer encoders for multi-scale modeling. ASNet [57] and HFCNet [58] focused on balancing global-local feature integration. PROFILE [59] and WeightNet [60] introduced sophisticated feature enhancement and noise suppression mechanisms. Ma et al. [61] introduced a deep superpixel-wise segmentation approach for remote sensing images, which leverages task-specific superpixel sampling and soft graph convolution to improve both accuracy and computational efficiency. FPS-U2Net [62] focuses on gradually extracting salient information from the image by aggregating multi-scale and multi-level features extracted from encoders at different stages. Despite these advances, achieving optimal accuracy in complex scenarios remains challenging, particularly for regions with intricate backgrounds, scale variations, and irregular topologies. Our proposed TSFANet addresses these limitations by introducing a novel dual-encoder architecture combining CNNs-Mamba for local features and CNNs-Transformer for global representation, unified by correlation-guided multi-scale feature fusion.

3. Methodology

This section describes the proposed TSFANet. Firstly, we introduce the overall structure. Then, the details of TSDSM, ASCRM, and SGAFF are described respectively. Finally, the loss function for network training is introduced.

3.1. Overall Structure

The overall architecture of TSFANet is illustrated in Figure 3. Built upon the foundational structure of HFANet [13], TSFANet enhances the classic encoder-decoder framework to improve feature extraction and integration. The model employs a dual encoder architecture comprising CNNs-Transformer and CNNs-Mamba components. Specifically, the CNNs-Transformer module captures complex global correlations across spatial positions, while the CNNs-Mamba module integrates these global correlations with local feature details, thereby enhancing the effectiveness of feature extraction.
At each encoding stage, the ASCRM module leverages global features to guide attention shifts within the local feature maps, optimizing them to emphasize detailed characteristics of salient target regions. Additionally, to further enrich semantic content and support robust feature interaction and fusion, the SGAFF module is integrated. This module provides semantic cues through deep-layer features and detailed structural information through shallow-layer features, facilitating accurate small-target recognition and effective background suppression. Algorithm 1 details the implementation process of the method. For an input image $I \in \mathbb{R}^{3 \times 448 \times 448}$, the TSDSM encoding process operates as follows: the CNNs-Transformer branch extracts global feature maps $G_1 \in \mathbb{R}^{64 \times 224 \times 224}$, $G_2 \in \mathbb{R}^{128 \times 112 \times 112}$, $G_3 \in \mathbb{R}^{256 \times 56 \times 56}$, $G_4 \in \mathbb{R}^{320 \times 28 \times 28}$, and $G_5 \in \mathbb{R}^{512 \times 14 \times 14}$. Concurrently, the CNNs-Mamba branch captures local features $L_1 \in \mathbb{R}^{64 \times 224 \times 224}$, $L_2 \in \mathbb{R}^{128 \times 112 \times 112}$, $L_3 \in \mathbb{R}^{256 \times 56 \times 56}$, $L_4 \in \mathbb{R}^{320 \times 28 \times 28}$, and $L_5 \in \mathbb{R}^{512 \times 14 \times 14}$. For each global feature $G_i$ and local feature $L_i$ ($i \in \{1, 2, 3, 4, 5\}$) at different stages, the ASCRM module is utilized to obtain fused features $F_i$. These fused features are then processed through the GF-ASPP and S6 modules, resulting in selectively enhanced detailed features $D_5 \in \mathbb{R}^{1024 \times 7 \times 7}$. Using the SGAFF module, the fused features $F_i$ ($i \in \{1, 2, 3, 4, 5\}$) and deep features $D_i$ ($i \in \{1, 2, 3, 4, 5\}$) are combined to produce deep fused features $D_{i-1}$ for each stage. Finally, the deep fused features $D_{i-1}$ are fed into a convolutional module to generate the outputs $O_i$ ($i \in \{1, 2, 3, 4, 5\}$).
Algorithm 1: Training Framework for TSFANet
(Algorithm 1 is provided as an image in the published article.)
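The following is a minimal, assumption-laden PyTorch sketch of the staged data flow described above and summarized in Algorithm 1. It is not the authors' code: plain strided convolutions stand in for the CNNs-Transformer and CNNs-Mamba stages, a 1 × 1 convolution stands in for ASCRM, and an upsample-and-merge block stands in for SGAFF and the GF-ASPP/S6 bottleneck; only the channel and resolution schedule is taken from Section 3.1.

```python
# Minimal sketch of the staged data flow in Algorithm 1 (not the authors' code):
# plain convolutions replace the TSDSM branches, ASCRM, SGAFF, GF-ASPP and S6.
import torch
import torch.nn as nn
import torch.nn.functional as F

CH = [64, 128, 256, 320, 512]          # per-stage channel widths from Section 3.1

def stage(c_in, c_out):
    # stride-2 stage stub standing in for one CNNs-Transformer / CNNs-Mamba stage
    return nn.Sequential(nn.Conv2d(c_in, c_out, 3, 2, 1),
                         nn.BatchNorm2d(c_out), nn.ReLU(inplace=True))

class TSFANetSketch(nn.Module):
    def __init__(self):
        super().__init__()
        ins = [3] + CH[:-1]
        self.trans_branch = nn.ModuleList([stage(i, o) for i, o in zip(ins, CH)])  # global
        self.mamba_branch = nn.ModuleList([stage(i, o) for i, o in zip(ins, CH)])  # local
        self.ascrm = nn.ModuleList([nn.Conv2d(2 * c, c, 1) for c in CH])    # fusion stub
        # bottleneck stub for GF-ASPP + S6 (Section 3.1: D5 is 1024 x 7 x 7)
        self.bottleneck = nn.Sequential(nn.Conv2d(CH[-1], 1024, 3, 2, 1),
                                        nn.ReLU(inplace=True))
        d_in = [1024] + CH[::-1][:-1]                       # deep channels entering each stage
        self.sgaff = nn.ModuleList([nn.Conv2d(d + c, c, 3, padding=1)
                                    for d, c in zip(d_in, CH[::-1])])       # decoder stub
        self.heads = nn.ModuleList([nn.Conv2d(c, 1, 1) for c in CH[::-1]])  # O_5 .. O_1

    def forward(self, x):
        g, l, fused = x, x, []
        for t_stage, m_stage, fuse in zip(self.trans_branch, self.mamba_branch, self.ascrm):
            g, l = t_stage(g), m_stage(l)
            fused.append(fuse(torch.cat([g, l], dim=1)))    # F_1 .. F_5
        d = self.bottleneck(fused[-1])                      # D_5
        outs = []
        for f, merge, head in zip(reversed(fused), self.sgaff, self.heads):
            up = F.interpolate(d, size=f.shape[-2:], mode='bilinear', align_corners=False)
            d = merge(torch.cat([up, f], dim=1))            # SGAFF-like upsample-and-merge
            outs.append(head(d))                            # side output O_i
        return outs

outputs = TSFANetSketch()(torch.randn(1, 3, 448, 448))
print([tuple(o.shape) for o in outputs])                    # five maps, 14x14 up to 224x224
```

Running the snippet prints five single-channel side outputs from 14 × 14 up to 224 × 224, mirroring the five supervised stages described above.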

3.2. Trans-Mamba Semantic-Detail Dual-Stream Collaborative

To capture the detailed features of target regions in remote sensing images (RSI), existing methods commonly integrate CNNs and Transformers to construct encoders for feature extraction. Typically, CNNs are employed for pixel-level local feature information modeling, followed by Transformers for pixel-block-level global feature information modeling. However, this encoder architecture fails to effectively leverage both local and global features simultaneously. Inspired by Ref. [17], we propose the Trans-Mamba Semantic-Detail Dual-Stream Collaborative Module (TSDSM), which integrates CNNs-Transformer and CNNs-Mamba to form a dual-path encoder. The CNNs-Transformer branch is for global feature modeling, capturing global relationships and long-range dependencies within the image while guiding the CNNs-Mamba branch in local feature modeling. Although the two branches operate independently in feature extraction, their collaboration is achieved through the Adaptive Semantic Correlation Refinement Module (ASCRM), which is applied after each encoding stage (as illustrated in Figure 3). The ASCRM module explicitly models the intrinsic correlations between the global semantic features from the CNNs-Transformer branch and the local detail features from the CNNs-Mamba branch by learning a cross-attention weight matrix. This mechanism enables information exchange and mutual guidance between the branches, allowing global semantic cues to refine local features and vice versa. The resulting fused features integrate both holistic context and local structure, forming a unified representation for subsequent decoding and prediction.

3.2.1. CNNs-Transformer

As illustrated in Figure 3, the Res encoder is utilized to extract low-level features of RSI targets, while the Transformer encoder acquires global semantic information. The Res encoder is defined as follows
$$R(x) = \mathrm{Conv}_{1\times 1}\big(\mathrm{Conv}_{3\times 3}\big(\mathrm{Conv}_{1\times 1}(x)\big)\big) \oplus x$$
where $\mathrm{Conv}_{1\times 1}$ denotes a convolution operation with a kernel size of $1\times 1$, and $\oplus$ represents element-wise addition. The Transformer captures global attention by constructing query ($Q$), key ($K$), and value ($V$) matrices from input feature image patches. However, because its computational complexity grows with image resolution, the Transformer is employed only in the 3rd and 4th stages. As shown in Figure 3, the Transformer encoder partitions the input image into patches through patch embedding. After normalization, it computes the $Q$, $K$, and $V$ matrices for these patches. Multi-head attention is then applied to assign attention weights to each image patch. Finally, the feed-forward layer and convolutional layers are integrated to further extract and aggregate features, producing a sequence of token features. The computation process is defined as follows:
$$T(x) = \mathrm{FC}\Big(\mathrm{Conv}_{DW}\big(\mathrm{FC}\big(\mathrm{LN}(G_{MH})\big)\big)\Big) \oplus G_{MH}$$
$$G_{MH} = A_{MH}\big(\mathrm{LN}(\mathrm{PE}(x))\big) \oplus x$$
where $\mathrm{FC}$ denotes a fully connected feed-forward layer, $\mathrm{Conv}_{DW}$ represents a depth-wise convolution with a kernel size of $3\times 3$, $\mathrm{LN}$ represents layer normalization, $\mathrm{PE}$ indicates the patch embedding operation, $G_{MH}$ represents the multi-head attention feature map, and $A_{MH}$ is the multi-head attention operation. The multi-head attention is calculated as:
$$A_{MH}(Q, K, V) = \mathrm{SoftMax}\left(\frac{Q K^{\top}}{\sqrt{d}}\right) V$$
where $\mathrm{SoftMax}(\cdot)$ is the activation function, $\top$ denotes matrix transposition, and $d$ is the dimensionality of the input features divided by the number of attention heads.
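As an illustration of the equations for $G_{MH}$ and $T(x)$, the snippet below sketches one Transformer stage with patch embedding, multi-head attention, and a feed-forward layer containing a 3 × 3 depth-wise convolution. It is a hedged approximation rather than the released implementation: the class name, head count, and patch size are illustrative choices, and `nn.MultiheadAttention` is used in place of a hand-written attention.

```python
# Illustrative sketch of one Transformer stage (patch embedding -> multi-head attention
# -> FC / depth-wise conv / FC feed-forward); head count and patch size are assumptions.
import torch
import torch.nn as nn

class TransformerStageSketch(nn.Module):
    def __init__(self, c_in, dim, num_heads=4, patch=2):
        super().__init__()
        self.pe = nn.Conv2d(c_in, dim, kernel_size=patch, stride=patch)       # PE
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)   # A_MH
        self.norm2 = nn.LayerNorm(dim)
        self.fc1 = nn.Linear(dim, dim * 4)
        self.dw = nn.Conv2d(dim * 4, dim * 4, 3, padding=1, groups=dim * 4)   # Conv_DW
        self.fc2 = nn.Linear(dim * 4, dim)

    def forward(self, x):
        tokens = self.pe(x)                                 # B x dim x H' x W'
        b, c, h, w = tokens.shape
        seq = tokens.flatten(2).transpose(1, 2)             # B x (H'W') x dim
        normed = self.norm1(seq)
        g = seq + self.attn(normed, normed, normed)[0]      # G_MH with a residual
        y = self.fc1(self.norm2(g))                         # FC
        y = self.dw(y.transpose(1, 2).reshape(b, -1, h, w)) # depth-wise conv on the map
        y = y.flatten(2).transpose(1, 2)
        t = g + self.fc2(y)                                 # T(x) with a residual
        return t.transpose(1, 2).reshape(b, c, h, w)        # back to a feature map

feat = torch.randn(1, 256, 56, 56)
print(TransformerStageSketch(256, 320)(feat).shape)         # -> 1 x 320 x 28 x 28
```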

3.2.2. CNNs-Mamba

As depicted in Figure 3, the CNNs-Mamba branch employs ResBlock to capture low-level features such as edges, textures, and corners of the target, thereby reducing the complexity burden on the Mamba coding module. The Mamba encoder is responsible for synthesizing the local and global information captured by the model in deeper layers. Global information guides the construction of long-range dependencies between different local features, refining the local features in key regions. As shown in Figure 3, the Mamba encoder utilizes the State Space Module (SSM) in conjunction with the Omnidirectional Sensing Module (ODPM) to perform selective scanning in various directions, achieving linear complexity and a globally effective receptive field while extracting local details by integrating global semantic information. For the input features $F \in \mathbb{R}^{H \times W \times C}$, after layer normalization and dimensionality reduction via a $1\times 1$ convolution, spatial features of each input channel are extracted through depth-wise convolution. Subsequently, the ODPM selectively scans in different directions to extract distinct detail features, which are adjusted using the normalized features. To prevent gradient loss during model operations, the features are added back to the input features after dimensionality restoration via a $1\times 1$ convolution, supplementing the foreground information overlooked in local feature extraction.
$$A_i = \mathrm{ODPM}\Big(\mathrm{DConv}\big(\mathrm{Conv}_{1\times 1}(\mathrm{LN}(F_i))\big)\Big)$$
$$L_i = \mathrm{Conv}_{1\times 1}\big(\mathrm{Conv}_{1\times 1}(A_i) \times \mathrm{LN}(F_i) + \mathrm{LN}(F_i)\big)$$
In contrast to VMamba’s 2D selective scanning, which performs forward and backward scanning in horizontal and vertical directions, the proposed ODPM introduces additional diagonal and anti-diagonal direction scanning to address the anisotropy inherent in remote sensing images. It conducts global modeling in all directions and processes each scanning sequence independently through the SSM module. This approach selectively propagates key features while discarding background noise, enabling global modeling in specific directions. Finally, all sequences are aggregated and summed, capturing detailed features from different directions within the remote sensing image. For the input feature map $I \in \mathbb{R}^{h \times w}$, where $I[m, n]$ denotes the token located at the $m$-th row and $n$-th column of the feature map, the horizontal scanning operations used by VMamba are defined as follows
$$s_1[x] = I[x \bmod w][\lfloor x / w \rfloor]$$
$$s_2[x] = I[(N - x - 1) \bmod w][\lfloor (N - x - 1) / w \rfloor]$$
where $N = h \times w$, $x$ represents the position index of an element in the flattened feature map, ranging from $0$ to $N - 1$, and $s_1, s_2 \in \mathbb{R}^{N \times C}$ denote the one-dimensional token sequences obtained by flattening along the horizontal scanning direction. The vertical scanning operations are defined as
$$s_3[x] = I[\lfloor x / h \rfloor][x \bmod h], \quad (0 \le x < N)$$
$$s_4[x] = I[\lfloor (N - x - 1) / h \rfloor][(N - x - 1) \bmod h]$$
where $s_3, s_4 \in \mathbb{R}^{N \times C}$ represent the one-dimensional token sequences obtained by flattening along the vertical scanning direction. The diagonal scanning operations introduced by ODPM are computed as follows:
$$s_5[x] = I[x \bmod w][x \bmod h], \quad (0 \le x < N)$$
$$s_6[x] = I[(N - x - 1) \bmod w][(N - x - 1) \bmod h]$$
where $s_5, s_6 \in \mathbb{R}^{N \times C}$ denote the one-dimensional token sequences obtained by flattening along the diagonal scanning direction. The anti-diagonal scanning operations are defined as
$$s_7[x] = I[(N - x - 1) \bmod w][x \bmod h]$$
$$s_8[x] = I[x \bmod w][(N - x - 1) \bmod h]$$
where $s_7, s_8 \in \mathbb{R}^{N \times C}$ represent the one-dimensional token sequences obtained by flattening along the anti-diagonal scanning direction. After completing the scanning in the horizontal, vertical, diagonal, and anti-diagonal directions, the one-dimensional token sequences obtained from forward and backward scanning in each direction are input into the SSM (S6) block to learn attention features for significant targets in each direction. Subsequently, the corresponding inverse operations are applied to each direction
$$I_k[m][n] = \begin{cases} s_k[n \times w + m], & k = 1, 7 \\ s_k[N - 1 - n \times w - m], & k = 2, 8 \\ s_k[m \times h + n], & k = 3, 5 \\ s_k[N - 1 - m \times h - n], & k = 4, 6 \end{cases}$$
where $k = 1, \dots, 8$ represents the eight scanning directions. The final ODPM output is obtained by summing the projections from all directions as $I = \sum_{k=1}^{8} I_k$.
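To make the multi-directional scanning concrete, the sketch below flattens a feature map along several traversal orders, runs each sequence through a sequence model, scatters the results back, and sums them, which is the overall pattern of ODPM. It is an assumption-level illustration, not the released code: a tiny GRU stands in for the S6/SSM block, and the diagonal and anti-diagonal orders are realized by sorting positions rather than by the closed-form indices above.

```python
# Assumption-level sketch of multi-directional scanning: flatten along several traversal
# orders, model each 1-D sequence, scatter back, and sum. A small GRU replaces S6/SSM.
import torch
import torch.nn as nn

def scan_orders(h, w, device):
    """Eight index permutations: horizontal, vertical, diagonal, and anti-diagonal
    traversals, each in a forward and a backward variant."""
    idx = torch.arange(h * w, device=device)
    r = torch.div(idx, w, rounding_mode='floor')
    c = idx % w
    row_major = idx                                            # horizontal
    col_major = idx.view(h, w).t().reshape(-1)                 # vertical
    diag = torch.argsort((r + c) * max(h, w) + c)              # diagonal (ordered by r + c)
    anti = torch.argsort((r + (w - 1 - c)) * max(h, w) + c)    # anti-diagonal
    base = [row_major, col_major, diag, anti]
    return base + [o.flip(0) for o in base]                    # add backward scans

class ODPMSketch(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.seq_model = nn.GRU(channels, channels, batch_first=True)   # SSM stand-in

    def forward(self, x):                        # x: B x C x H x W
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)    # B x (HW) x C
        out = torch.zeros_like(tokens)
        for order in scan_orders(h, w, x.device):
            seq, _ = self.seq_model(tokens[:, order, :])       # scan one direction
            inverse = torch.empty_like(order)
            inverse[order] = torch.arange(order.numel(), device=x.device)
            out = out + seq[:, inverse, :]       # undo the traversal and aggregate
        return out.transpose(1, 2).reshape(b, c, h, w)

x = torch.randn(1, 64, 8, 8)
print(ODPMSketch(64)(x).shape)                   # -> 1 x 64 x 8 x 8
```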

3.3. Adaptive Semantic Correlation Refinement

The CNNs-Transformer and CNNs-Mamba encoders extract feature information with dimensional discrepancies. While many existing methods address this issue through feature fusion, they often overlook the intrinsic correlations within the features extracted by different encoders. Inspired by Ref. [57], we recognize that such intrinsic correlations exist within the feature information extracted by different encoders. To refine local feature details by leveraging the correlation between semantic and detailed features, we propose the Adaptive Semantic Correlation Refinement Module (ASCRM).
As illustrated in Figure 4, for the input semantic feature $G_i \in \mathbb{R}^{C \times H \times W}$ and local feature $L_i \in \mathbb{R}^{C \times H \times W}$, matrix reshaping is first applied to obtain $G_i' \in \mathbb{R}^{C \times HW}$ and $L_i' \in \mathbb{R}^{C \times HW}$. By learning a weight matrix $W_C \in \mathbb{R}^{C \times C}$, the reshaped semantic features are projected into the semantic correlation matrix $A_r \in \mathbb{R}^{C \times HW}$. Matrix multiplication is then utilized to compute the correlation matrix $r$, representing the correlation between the semantic and local features. The correlation matrix $r$ is calculated as follows:
$$r = A_r^{\top} \otimes L_i' = \big(G_i'\big)^{\top} W_C \otimes L_i'$$
where $\top$ denotes the transpose operation and $\otimes$ represents matrix multiplication. The correlation matrix $r$ is subsequently processed using the SoftMax function for activation and normalization. This processing enables the weighted semantic correlation matrix to capture the spatial location information of significant regions within each feature. The corrected semantic feature $G_i^{\mathrm{corr}} \in \mathbb{R}^{C \times H \times W}$ and local feature $L_i^{\mathrm{corr}} \in \mathbb{R}^{C \times H \times W}$ are defined as follows:
$$G_i^{\mathrm{corr}} = \mathrm{Reshape}\big(\mathrm{Reshape}(G_i) \otimes \mathrm{SoftMax}(r)\big)$$
$$L_i^{\mathrm{corr}} = \mathrm{Reshape}\big(\mathrm{Reshape}(L_i) \otimes \mathrm{SoftMax}(r)\big)$$
where $\mathrm{Reshape}(\cdot)$ denotes the matrix reshaping operation, and $\mathrm{SoftMax}(\cdot)$ represents the activation function. Subsequently, location information is generated for $G_i^{\mathrm{corr}}$ and $L_i^{\mathrm{corr}}$ through a $1\times 1$ convolution, resulting in $G_i^{\mathrm{map}} \in [0, 1]^{1 \times H \times W}$ and $L_i^{\mathrm{map}} \in [0, 1]^{1 \times H \times W}$, respectively. Based on these mapped values, redundant information is filtered out to obtain $G_i^{\mathrm{gate}}$ and $L_i^{\mathrm{gate}}$, calculated as follows:
$$G_i^{\mathrm{gate}} = G_i^{\mathrm{map}} \odot G_i^{\mathrm{corr}} = \mathrm{Sigmoid}\big(\mathrm{Conv}(G_i^{\mathrm{corr}})\big) \odot G_i^{\mathrm{corr}}$$
$$L_i^{\mathrm{gate}} = L_i^{\mathrm{map}} \odot L_i^{\mathrm{corr}} = \mathrm{Sigmoid}\big(\mathrm{Conv}(L_i^{\mathrm{corr}})\big) \odot L_i^{\mathrm{corr}}$$
where $\mathrm{Sigmoid}(\cdot)$ denotes the sigmoid activation function, which maps feature values to the range $[0, 1]$, $\mathrm{Conv}$ represents the $1\times 1$ convolution operation, and $\odot$ denotes element-wise multiplication. Through residual connections, $G_i^{\mathrm{gate}}$ is fused with $G_i$, and $L_i^{\mathrm{gate}}$ is fused with $L_i$, resulting in $\hat{G}_i^{\mathrm{corr}}$ and $\hat{L}_i^{\mathrm{corr}}$:
$$\hat{G}_i^{\mathrm{corr}} = \mathrm{Conv}_{3\times 3}\big(G_i^{\mathrm{gate}} \oplus G_i\big)$$
$$\hat{L}_i^{\mathrm{corr}} = \mathrm{Conv}_{3\times 3}\big(L_i^{\mathrm{gate}} \oplus L_i\big)$$
where $\mathrm{Conv}_{3\times 3}$ denotes the $3\times 3$ depthwise separable convolution, and $\oplus$ represents element-wise addition. For the two effective components, $\hat{G}_i^{\mathrm{corr}}$ and $\hat{L}_i^{\mathrm{corr}}$, concatenation is performed followed by a $3\times 3$ convolution to extract the location features of their significant regions. These features are then transformed into position information $P_i$ using a $1\times 1$ convolution and the sigmoid function:
$$P_i = \mathrm{Sigmoid}\Big(\mathrm{Conv}_{1\times 1}\big(\mathrm{Conv}_{3\times 3}\big(\hat{G}_i^{\mathrm{corr}} \,\Theta\, \hat{L}_i^{\mathrm{corr}}\big)\big)\Big)$$
where $\Theta$ denotes channel-wise concatenation. By applying the extracted position information $P_i$ to the local feature $L_i$, the refined local feature $F_i$ is obtained as follows:
$$F_i = P_i \odot L_i \oplus L_i$$
This direct feature modulation approach provides precise collaborative semantic guidance for refining local features, thereby laying a solid foundation for the fine-grained processing of remote sensing images.
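A compact sketch of this refinement, under our own reading of the equations above (not the released ASCRM code), is given below; the learnable matrix `w_c` plays the role of $W_C$, and the gating and position branches are reduced to single convolutions.

```python
# Sketch of correlation-guided refinement: a learnable C x C matrix projects the semantic
# stream, a spatial correlation is formed against the local stream, both streams are
# re-weighted and gated, and a position cue modulates the local feature.
import torch
import torch.nn as nn

class ASCRMSketch(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.w_c = nn.Parameter(torch.randn(channels, channels) * 0.02)   # W_C
        self.gate_g = nn.Conv2d(channels, 1, 1)
        self.gate_l = nn.Conv2d(channels, 1, 1)
        self.refine_g = nn.Conv2d(channels, channels, 3, padding=1)
        self.refine_l = nn.Conv2d(channels, channels, 3, padding=1)
        self.pos = nn.Sequential(nn.Conv2d(2 * channels, channels, 3, padding=1),
                                 nn.Conv2d(channels, 1, 1), nn.Sigmoid())

    def forward(self, g, l):                           # g: semantic, l: local (B x C x H x W)
        b, c, h, w = g.shape
        g2, l2 = g.flatten(2), l.flatten(2)            # B x C x HW
        r = torch.einsum('bch,cd,bdw->bhw', g2, self.w_c, l2)   # HW x HW correlation r
        att = torch.softmax(r, dim=-1)
        g_corr = torch.bmm(g2, att).view(b, c, h, w)   # semantic stream re-weighted by r
        l_corr = torch.bmm(l2, att).view(b, c, h, w)   # local stream re-weighted by r
        g_gate = torch.sigmoid(self.gate_g(g_corr)) * g_corr    # filter redundant responses
        l_gate = torch.sigmoid(self.gate_l(l_corr)) * l_corr
        g_hat = self.refine_g(g_gate + g)              # residual connection to the input
        l_hat = self.refine_l(l_gate + l)
        p = self.pos(torch.cat([g_hat, l_hat], dim=1)) # position cue P_i
        return p * l + l                               # refined local feature F_i

g, l = torch.randn(2, 64, 14, 14), torch.randn(2, 64, 14, 14)
print(ASCRMSketch(64)(g, l).shape)                     # -> 2 x 64 x 14 x 14
```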

3.4. Semantic-Guided Adjacent Feature Fusion

The remote sensing image contains rich semantic information that can assist the model in accurately locating target positions. When combined with local features, this semantic information can better identify targets of varying sizes within complex backgrounds. Although many existing methods effectively learn feature information at different scales, significant semantic differences persist across layer features. Inspired by Ref. [48], we constructed the Semantic-Guided Adjacent Feature Fusion Module (SGAFF) to align and fuse features by effectively utilizing deep semantic features and details across scales.
As illustrated in Figure 5, the deep input feature $D_i \in \mathbb{R}^{C \times H \times W}$ from the encoder is first transformed into $D_i' \in \mathbb{R}^{C \times HW}$ through matrix operations for computational convenience. Subsequently, the Global Attention Mechanism (GAM) is employed to capture the overall semantic framework of the feature, $E_i \in \mathbb{R}^{HW \times N}$, which is calculated as follows:
$$E_i = \mathrm{SoftMax}\Big(\mathrm{Transpose}\big(\mathrm{GAM}(D_i')\big)\Big)$$
where $\mathrm{SoftMax}(\cdot)$ represents the activation function, and $\mathrm{Transpose}(\cdot)$ denotes the matrix transpose operation. Semantic features aid in understanding the overall context and meaning of the image but are susceptible to noise and complex backgrounds. To mitigate these issues, the semantic framework is fused with the local features $D_i'$ extracted from the input features to obtain the global semantic fluid information $S_i \in \mathbb{R}^{C \times N}$. This provides detailed information beneficial for distinguishing similar objects in different regions. The $S_i$ is calculated as:
$$S_i = \mathrm{Conv}_{1\times 1}\big(D_i'\big) \oplus \mathrm{Conv}_{1\times 1}\big(D_i'\big) \otimes E_i$$
where $\mathrm{Conv}_{1\times 1}$ represents the $1\times 1$ convolution, $\oplus$ denotes element-wise addition, and $\otimes$ represents matrix multiplication. To facilitate the fusion with the shallow input features $F_{i-1} \in \mathbb{R}^{(C/2) \times 2H \times 2W}$ from the encoder, the deep-layer input features $D_i$ are upsampled and reshaped into a feature tensor $\tilde{D}_i$ with dimensions $N \times 2H \times 2W$:
$$\tilde{D}_i = \mathrm{Reshape}\big(\mathrm{Up}(D_i)\big)$$
where $\mathrm{Reshape}(\cdot)$ represents the matrix reshaping operation, and $\mathrm{Up}(\cdot)$ denotes the upsampling operation. The semantic information $S_i$ is projected to the corresponding positions, aggregated, and integrated into the semantic feature $A_i \in \mathbb{R}^{C \times kH \times kW}$:
$$A_i = \mathrm{Reshape}\big(\tilde{D}_i \oplus \mathrm{Conv}_{1\times 1}(\tilde{D}_i) \otimes S_i\big)$$
where $\mathrm{Reshape}(\cdot)$ transforms the feature matrix from $C \times k^2 HW$ to $C \times kH \times kW$ dimensions to facilitate subsequent computation and processing; $\otimes$ represents matrix multiplication, and $\oplus$ denotes element-wise addition. Since different channel dimensions contain varying feature details of the target and include background noise interference, it is essential to highlight the detailed parts of the target. To achieve this, channel-dimension features are resampled to obtain attention across different channels, thereby enhancing the shallow input features $F_{i-1}$ through channel attention. To prevent feature information loss while filtering out background noise, the semantic features $A_i$ are used to guide and obtain the channel attention projection $C_i' \in \mathbb{R}^{C \times k^2 HW}$. Through matrix reshaping, the channel attention feature $C_i \in \mathbb{R}^{C \times kH \times kW}$ is obtained:
$$C_i = \mathrm{Reshape}\Big(\mathrm{SoftMax}\big(F_{i-1} \otimes A_i\big) \otimes F_{i-1}\Big)$$
where $\mathrm{SoftMax}(\cdot)$ represents the softmax activation function, and $\otimes$ denotes matrix multiplication. The Sigmoid function is applied to the semantic feature $A_i$ and multiplied with the channel feature $C_i$, leveraging the semantic features to further select the target’s feature information. Then, $A_i$ is element-wise added to the result and passed through a $1\times 1$ convolution and normalization to obtain the aggregated feature $D_{i-1} \in \mathbb{R}^{(C/2) \times 2H \times 2W}$:
$$D_{i-1} = \phi\Big(\mathrm{Conv}_{1\times 1}\big(C_i \odot \mathrm{Sigmoid}(A_i) \oplus A_i\big)\Big)$$
where ⊙ represents element-wise matrix multiplication, and ϕ denotes the batch normalization and ReLU operations.
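The sketch below is a heavily simplified, assumption-level illustration of the SGAFF idea rather than the module itself: the GAM and the matrix projections are collapsed into a spatial softmax over the upsampled deep feature, and the channel attention is reduced to a squeeze-and-gate branch, keeping only the pattern of deep semantics guiding the adjacent shallower feature.

```python
# Heavily simplified SGAFF-style sketch (assumptions throughout): upsampled deep
# semantics gate the channels of the adjacent shallower feature before merging.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SGAFFSketch(nn.Module):
    def __init__(self, c_deep, c_shallow):
        super().__init__()
        self.sem = nn.Conv2d(c_deep, c_shallow, 1)            # project deep semantics
        self.channel_gate = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                          nn.Conv2d(c_shallow, c_shallow, 1), nn.Sigmoid())
        self.merge = nn.Sequential(nn.Conv2d(2 * c_shallow, c_shallow, 1),
                                   nn.BatchNorm2d(c_shallow), nn.ReLU(inplace=True))

    def forward(self, d_deep, f_shallow):
        # global semantic summary: a spatial softmax emphasizes the dominant responses
        a = self.sem(F.interpolate(d_deep, size=f_shallow.shape[-2:],
                                   mode='bilinear', align_corners=False))
        b, c, h, w = a.shape
        a = torch.softmax(a.flatten(2), dim=-1).view(b, c, h, w) * a + a
        # semantic-guided channel attention selects target details in the shallow feature
        f = f_shallow * self.channel_gate(a)
        return self.merge(torch.cat([f, a], dim=1))           # aggregated feature D_{i-1}

deep, shallow = torch.randn(1, 320, 28, 28), torch.randn(1, 256, 56, 56)
print(SGAFFSketch(320, 256)(deep, shallow).shape)             # -> 1 x 256 x 56 x 56
```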

3.5. Loss Function

The proposed TSFANet extends HFANet [13] by employing the designed IG Loss and weighted Intersection over Union (wIoU) as the loss functions. As shown in Figure 3, IG Loss is calculated as follows:
$$\mathcal{L}_{IG} = \frac{1}{n}\sum_{i=1}^{n}\Big[\big(p_i^{E}\cdot m_1 + 1\big)\,\mathcal{L}_{BCE}\big(p_i^{S}, g_i^{S}\big) + \big((1 - p_i^{S} \cdot g_i^{E})\cdot m_2 + 1\big)\,\mathcal{L}_{BCE}\big(p_i^{E}, g_i^{E}\big)\Big]$$
where $p_i^{S}$ and $p_i^{E}$ are the predicted saliency map and predicted edge map, respectively; $g_i^{S}$ and $g_i^{E}$ are the ground truth saliency map and edge map; $i$ represents the index of each pixel; $n$ is the total number of pixels; $m_1$ and $m_2$ are balancing weight factors (both set to 1 in our implementation, following Ref. [13]) that control the contribution of the edge and region terms in the loss function; and $\mathcal{L}_{BCE}(\cdot)$ is the Binary Cross Entropy (BCE) loss. The BCE loss is defined as follows:
$$\mathcal{L}_{BCE}(x, y) = -y\log(x) - (1 - y)\log(1 - x)$$
By introducing the wIoU loss into the IG loss, $\mathcal{L}_i^{wIoU}$ is calculated as follows:
$$\mathcal{L}_i^{wIoU} = 1 - \frac{\sum_{j=1}^{n} p_j^{s} \cdot g_j^{s} + 1}{\sum_{j=1}^{n} \big(p_j^{s} + g_j^{s} - p_j^{s} \cdot g_j^{s}\big) + 1}$$
where $p_j^{s} + g_j^{s}$ and $p_j^{s} \cdot g_j^{s}$ represent the sum and element-wise multiplication of the predicted saliency value and the ground truth value at pixel $j$, respectively. The total loss for each stage can be expressed as
$$\mathcal{L}_i = \lambda_1 \mathcal{L}_i^{IG} + \lambda_2 \mathcal{L}_i^{wIoU}$$
$$\mathcal{L}_{\mathrm{total}} = \mathcal{L}_1 + \sum_{i=2}^{5} \frac{1}{2^{i-2}} \mathcal{L}_i$$
For the model training process, high-quality saliency prediction maps are generated in five stages, and the model is supervised using both image labels and edge labels. To better balance the five detection stages of the model, the losses of the five stages are weighted and summed as in Equation (35). In addition, we find that setting both hyperparameters $\lambda_1$ and $\lambda_2$ to 1 works best in our experiments.
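The wIoU term and the stage weighting of the total loss translate directly into code; the sketch below is illustrative rather than the authors' implementation, and it approximates the IG loss with plain BCE on the saliency map, omitting the edge term and the balance factors $m_1$, $m_2$.

```python
# Illustrative sketch of the loss (not the authors' implementation): plain BCE stands in
# for the IG term; the wIoU term and the stage weighting follow the formulas above.
import torch
import torch.nn.functional as F

def wiou_loss(pred, gt, eps=1.0):
    """1 - (sum(p*g) + 1) / (sum(p + g - p*g) + 1), per image, then averaged."""
    p = torch.sigmoid(pred)
    inter = (p * gt).flatten(1).sum(-1)
    union = (p + gt - p * gt).flatten(1).sum(-1)
    return (1.0 - (inter + eps) / (union + eps)).mean()

def stage_loss(pred, gt, lam1=1.0, lam2=1.0):
    # BCE stands in for the IG term; lambda_1 = lambda_2 = 1 as stated in the paper
    return lam1 * F.binary_cross_entropy_with_logits(pred, gt) + lam2 * wiou_loss(pred, gt)

def total_loss(side_outputs, gt):
    """side_outputs: [O_1, ..., O_5]; deeper stages are down-weighted by 1 / 2^(i-2)."""
    losses = [stage_loss(F.interpolate(o, size=gt.shape[-2:], mode='bilinear',
                                       align_corners=False), gt) for o in side_outputs]
    return losses[0] + sum(l / (2 ** (i - 2)) for i, l in enumerate(losses[1:], start=2))

preds = [torch.randn(2, 1, s, s) for s in (224, 112, 56, 28, 14)]
gt = (torch.rand(2, 1, 224, 224) > 0.5).float()
print(total_loss(preds, gt).item())
```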

4. Experiments and Analysis

4.1. Dataset Description

To comprehensively evaluate the performance of the proposed model, we conducted extensive experimental comparisons on three publicly available Remote Sensing Image Saliency Object Detection (RSI-SOD) datasets: ORSSD, EORSSD, and ORSI-4199.
  • ORSSD: The first publicly available RSI-SOD dataset, ORSSD, comprises 800 optical RSI images with corresponding annotations. Of these, 600 images are designated for training, and 200 images for testing.
  • EORSSD: An extended version of ORSSD, EORSSD incorporates more challenging scenarios to better assess model robustness. It includes 1400 training samples and 600 testing samples.
  • ORSI-4199: The most diverse saliency detection dataset in terms of scene complexity, ORSI-4199 contains 2000 training samples and 2199 testing samples.
Following the standard protocol, the training sets of each dataset are utilized for model training, while performance evaluations are conducted on their respective test sets.

4.2. Implementation Details and Evaluation Metrics

We developed TSFANet using the PyTorch 2.0.0 framework and Python 3.8.0. Model inference is accelerated using an NVIDIA RTX 4090 GPU. To prevent overfitting and enhance the generalization performance of the model, we employed various data augmentation techniques, including random rotations, color transformations, and noise addition. For model parameter optimization, we set the initial learning rate to $1 \times 10^{-5}$, the batch size to 16, and used the Adam optimizer to update the network parameters. A polynomial learning rate strategy is applied to automatically adjust the learning rate during training. The model is trained for 300 epochs. In both the training and testing phases, input images are resized to $448 \times 448$ pixels to maintain consistency.
To comprehensively assess the performance of our method, we adopt five widely used evaluation metrics: S-measure ($S_\alpha$), F-measure ($F_\beta$), E-measure ($E_\xi$), mean absolute error (MAE), and the precision-recall (P-R) curve. The S-measure jointly considers region-aware and object-aware similarities to reflect both global and structural correspondence between the predicted saliency map and the ground truth. The F-measure calculates the weighted harmonic mean of precision and recall, with a stronger emphasis on precision, to evaluate the overall detection quality. The E-measure combines pixel-level precision and image-level structural information for a balanced measurement of saliency map accuracy. The MAE quantifies the average pixel-wise error between the predicted and true saliency maps, providing an intuitive indicator of prediction error. Finally, the P-R curve illustrates the model’s performance under different binarization thresholds by plotting precision versus recall, offering a comprehensive view of detection capability across varying criteria.
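For reference, two of these metrics are simple enough to state in a few lines; the sketch below computes MAE and a thresholded F-measure. The $\beta^2 = 0.3$ weighting and the fixed 0.5 threshold are common conventions assumed here, not values restated by the paper, and published evaluations typically sweep multiple or adaptive thresholds.

```python
# Illustrative sketch of two metrics (not the official evaluation toolbox); beta^2 = 0.3
# and the fixed 0.5 threshold are assumed conventions, not values stated in the paper.
import torch

def mae(pred, gt):
    """Mean absolute error between a [0, 1] saliency map and a binary ground truth."""
    return (pred - gt).abs().mean().item()

def f_measure(pred, gt, beta2=0.3, threshold=0.5):
    """Precision/recall of the thresholded map, combined into the weighted F-measure."""
    binary = (pred >= threshold).float()
    tp = (binary * gt).sum()
    precision = tp / (binary.sum() + 1e-8)
    recall = tp / (gt.sum() + 1e-8)
    return ((1 + beta2) * precision * recall / (beta2 * precision + recall + 1e-8)).item()

pred = torch.rand(448, 448)
gt = (torch.rand(448, 448) > 0.5).float()
print(mae(pred, gt), f_measure(pred, gt))
```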

4.3. Comparison with State-of-the-Art Methods

To comprehensively evaluate the proposed method, TSFANet is benchmarked against state-of-the-art models on three public Remote Sensing Image Saliency Object Detection (RSI-SOD) datasets. The comparison involves two Mamba-based semantic image segmentation (SSI) models, VMamba [12] and Samba [63] (configured with a single class in the experiments), nine Natural Scene Image (NSI) models (PicaNet [39], R3Net [10], RAS [64], BASNet [9], CPDNet [65], EGNet [36], PoolNet [66], U2Net [67], and VST [5]), and nineteen Remote Sensing Image (RSI) models (CSNet [68], CoorNet [69], GeleNet [11], HVPNet [70], SAMNet [71], MCCNet [47], SUCA [72], ACCoNet [49], DNTD [73], MJRBM [45], ICON [50], EMFINet [46], HFANet [13], DPORTNet [74], PA-KRN [75], ERPNet [76], DBINet [77], DCNet [78], and DSINet [79]). To ensure experimental fairness, results for the 30 comparison methods were derived from publicly available data provided by the authors, with some reproduced using publicly available source code.

4.3.1. Quantitative Comparison

The quantitative comparison results of TSFANet and the other 30 latest methods on three RSI-SOD datasets are shown in Table 1, Table 2 and Table 3. Clearly, our method outperforms these advanced methods overall on each dataset. Specifically, for the simpler ORSSD dataset, TSFANet achieves three optimal results and one third-best result. Compared with the top-performing traditional CNN-based NSI-SOD model R3Net, the $S_\alpha$, $F_\beta$, and $E_\xi$ scores are increased by 4.37%, 5.59%, and 5.79%, respectively, while the MAE is reduced by 0.93%. Although it trails behind the leading RSI-SOD model MCCNet by 0.67% on the $F_\beta$ metric, TSFANet leads by 0.18%, 0.59%, and 0.1% on the $S_\alpha$, $E_\xi$, and MAE metrics, respectively. On the extended and more challenging EORSSD dataset, TSFANet slightly lags behind HFANet by 0.67% and 0.77% on the $S_\alpha$ and $F_\beta$ indicators, respectively, but surpasses it by 0.07% and 0.1% on the $E_\xi$ and MAE indicators, respectively. On the most challenging ORSI-4199 dataset, TSFANet achieves the best results with an $S_\alpha$ of 88.76%, $F_\beta$ of 86.17%, $E_\xi$ of 94.53%, and MAE of 2.77%. To visually illustrate the model’s overall performance, Figure 6 presents the precision-recall (P-R) curves and F-measure curves of the comparison methods across the three datasets. The comparison of the curves reveals that TSFANet’s P-R curve is closest to the coordinate (1, 1), demonstrating its ability to achieve high precision while maintaining high recall. Additionally, TSFANet achieves the largest area under the F-measure curve compared to all other methods. The computational complexity analysis of the various methods is presented in Table 4. It can be observed that the introduction of multiple Transformer modules leads to a large number of parameters in the baseline model HFANet, while the proposed TSFANet improves on it by discarding some of the modules with a large number of parameters, enhancing the capture of feature details and the understanding of semantic information. Although the number of parameters is the largest, the computational complexity is at a medium level, which can guarantee high performance while providing a better price–performance ratio in terms of hardware resources and practical applications. While TSFANet consistently outperforms existing methods overall, there remain certain cases, such as images containing extremely small, low-contrast, or densely clustered salient objects, where the improvement is less significant. These cases reveal that further enhancement is needed to address the detection of subtle and highly challenging targets in complex remote sensing scenes.

4.3.2. Qualitative Comparison

Qualitative evaluations across different RSI scenarios highlight the advantages of our proposed method under complex conditions. The visualization results are presented in Figure 7. Typical scenarios for analysis and comparison include (1) Objects in complex cluttered backgrounds; (2) Small-scale salient objects; (3) Topologically irregular objects. By comparing the prediction results generated by each test image, ground truth labels, and different models, it is evident that our TSFANet more closely aligns with real-world scenarios than other methods. Specifically, the advantages of saliency detection in different scenarios are as follows:
(1) Objects in Complex Cluttered Backgrounds: As shown in Figure 7a–c, human activities introduce abundant clutter information similar to the target in RSI, making it difficult for some models to achieve high-precision detection. For example, in Figure 7a, only EMFINet, SUCA, CorrNet, ACCoNet, ERPNet, MCCNet, and DBSINet effectively filter out the interference from the terminal building, which has a similar color to the target airplane on the right. In Figure 7b, BASNet, R3Net, ENFNet, RAS, EMFINet, EGNet, GeleNet, and CPD learn some non-significant object features, leading to the recognition of the right guiding surface as part of the target. In Figure 7c, due to the excessive similarity between the salient target and background features, most models fail to completely extract the road target, with only Smamba, GeleNet, VIT, HFANet, DBNet, DCNet, and DSINet extracting the road region relatively comprehensively. In contrast, our method filters redundant information by effectively extracting and fusing local features with global semantic information, enhancing edge details after identifying the foreground position.
(2) Small-Scale Salient Objects: As shown in Figure 7d–f, the large imaging range and high spatial resolution of RSI often contain numerous small-scale targets, making it challenging for some models to accurately locate and finely extract these targets. For example, in Figure 7d,e, VMamba, R3Net, and BASNet suffer from severe localization deficiencies, resulting in the loss of six ships and four airplanes. While Samba, PA-KRN, and HFANet achieve more comprehensive salient target extraction than other methods, they still lack edge details and encounter false alarms and misclassifications, such as in the road extraction scenario shown in Figure 7f. Dual coding models DBNet, DCNet and DSINet all show good ability of small-scale target location, but due to different information integration abilities of the models, the determination of target boundary still needs to be improved. The comparison results indicate that our model provides more accurate and detailed extraction of small-scale targets, underscoring the effectiveness of integrating local and global feature information for saliency detection in such scenarios.
(3) Topologically Irregular Objects: As shown in Figure 7g–i, the diversity and complexity of natural terrain in RSI lead to objects like mountains, rivers, and landmark buildings often having irregular topology and complex edge features. In Figure 7g, due to the complexity of the mall structure designed for aesthetics, except for BASNet and DPORTNet, which do not fully capture the entire mall, other models exhibit varying degrees of loss in mall edge details. In Figure 7h, the staircases on different floors of the building lead to severe misjudgments by models such as VMamba, Samba, R3Net, RAS, PoolNet, GeleNet, VIT, SUCA, and DNTD, mistakenly identifying them as part of the main target. In Figure 7i, due to the detailed lake area and background complexity, as well as irregular topology, the detection results of each model are suboptimal. Although EGNet, GeleNet, and DSINet achieve more accurate detections, there are still some false alarms. In comparison, our model excels in addressing topologically irregular targets.
In summary, our method demonstrates excellent detection accuracy in scenarios with cluttered backgrounds, multi-scale targets, and topologically irregular objects. Despite the strong qualitative performance, we observe that TSFANet can still miss very subtle objects or confuse background clutter with salient regions under extreme conditions, such as heavy shadow, severe occlusion, or ambiguous object boundaries. To address these limitations, future research will explore more robust feature fusion strategies and adaptive attention mechanisms tailored to complex and diverse remote sensing environments.

4.4. Ablation Study

As the baseline model HFANet already achieves strong results on the ORSSD and EORSSD datasets, it provides an adequate basis of comparison for validating the new method. However, ORSI-4199 is a more challenging dataset, particularly with respect to complex backgrounds and extreme scale variations, and the baseline's performance on it still leaves considerable room for improvement. Therefore, we conducted the ablation experiments only on the ORSI-4199 dataset to better assess the performance and improvement of the model on more difficult tasks. Based on the quantitative results in Table 5, the evaluation metrics progressively improve with the sequential incorporation of different modules. Additionally, Figure 8 presents visualization results for the sequential integration of these modules.

4.4.1. Effect of TSDSM

We constructed a hybrid feature encoder structure combining CNN-Transformer and CNN-Mamba architectures to analyze the effectiveness of different encoder combinations.
As shown in Table 6, various encoder architectures and their pairwise combinations were evaluated, including the CNN, Transformer, and Mamba encoders as well as the proposed TSDSM. The TSDSM hybrid coding structure achieved the highest evaluation scores, with $F_\beta$ and $E_\xi$ reaching 0.8617 and 0.9453, respectively. Figure 9 visualizes the feature maps obtained by each encoder structure. The CNN coding structure alone accurately captures the salient target features but lacks global semantic information, leading to significant attention to background interference. The Transformer encoder effectively captures global information but struggles with detailed feature processing. The Mamba encoder offers better global understanding through state space modeling but is insufficient for the RSI-SOD task. Combining CNNs with Transformer or Mamba reduces background interference but either lacks edge completeness or misclassifies regions. In contrast, the TSDSM structure effectively captures both global contextual and fine-grained local features, leading to superior performance.

4.4.2. Effect of ASCRM

ASCRM is designed to guide the fusion of global semantic features from the CNN-Transformer encoder and local detailed features from the CNN-Mamba encoder by leveraging global-local feature relationships, enabling precise salient region localization and extraction. Ablation experiments were performed by replacing ASCRM in TSFANet with alternative structures numbered No. 1–No. 10 in Figure 10.
As shown in Table 7, the No.10 (ASCRM) structure achieved the highest quantitative evaluation scores. This is attributed to ASCRM’s learnable weight matrix, which effectively establishes relationships between global and local features, guiding their optimization and fusion for accurate salient region extraction. Figure 11 illustrates the feature visualization results of the ASCRM ablation experiments. Structures No.1 to No.9 exhibit various limitations, such as insufficient global understanding, low confidence in detailed features, and incomplete edge extraction. In contrast, No. 10 (ASCRM) demonstrates the best detection effect, effectively focusing on salient target regions and extracting fine-grained edge features. To further validate ASCRM’s effectiveness, it was compared with advanced attention mechanisms, including AAM [39], BAM [80], CAM [81], CBAM [82], CoT-Attention [83], ECA-Attention [84], NLAM [10], RSAM [85], and Self-Attention [86]. As shown in Table 8, ASCRM achieves higher scores across all evaluation metrics. Figure 12 provides heatmaps of various attention mechanisms, where blue indicates lower model attention and red indicates higher attention. ASCRM effectively guides the model to focus on salient regions, enabling precise extraction of fine-grained edge features for targets like vehicles, airplanes, and buildings.

4.4.3. Effect of SGAFF

To effectively integrate different levels of feature information, we constructed the Semantic-Guided Adjacent Feature Fusion (SGAFF) module to align and fuse detailed features between different scales. As shown in Table 9, SGAFF outperforms existing feature fusion methods, including element-wise summation, element-wise multiplication, feature channel concatenation, and AFAM [13], achieving the best evaluation scores. Figure 13 presents feature visualization analysis comparing SGAFF with other fusion methods. Element-wise summation introduces redundant information and biases towards common regions, while element-wise multiplication enhances commonly attended regions but excludes salient areas lacking sufficient features. Channel concatenation alleviates some defects but remains limited by feature richness and confidence. AFAM improves channel concatenation by incorporating deformable convolution and residual connections but is constrained by fused feature semantics. In contrast, SGAFF effectively learns and sorts semantic information in the deep network and supplements shallow local information, resulting in optimized feature fusion between different levels.

4.4.4. Effect of the Loss Function

To evaluate the effectiveness of the IG Loss + IoU Loss hybrid function, we compared various loss functions and their combinations, including F-measure Loss, BCE Loss, CT Loss, IG Loss, and their combinations with IoU Loss.
Table 10 shows that F-measure Loss performs poorly in metrics other than $F_\beta$, as it focuses solely on optimizing the $F_\beta$ metric. CT Loss assigns higher weights to boundary pixels using fixed or adaptive thresholds but lacks consistent mutual supervision between edge and salient region predictions, resulting in only slight performance improvements over F-measure Loss. BCE Loss, which calculates cross-entropy supervision between each pixel and the label, achieved a sub-optimal evaluation score with $S_\alpha$, $F_\beta$, $E_\xi$, and MAE reaching 0.8748, 0.8543, 0.8797, and 0.0361, respectively. IG Loss adaptively adjusts edge and salient region prediction losses using learnable parameters during training, improving over the second-best results by 0.07%, 0.44%, 1.15%, and 0.53%, respectively. When combined with IoU Loss, IG Loss + IoU Loss achieves the best results on the ORSI-4199 dataset, further validating the effectiveness of the hybrid loss function.

5. Conclusions

This study addresses the SOD task within the RSI domain by introducing TSFANet, which comprises three core components: TSDSM, ASCRM, and SGAFF. To enhance the model’s ability to extract salient region features, TSDSM integrates CNNs-Mamba and CNNs-Transformer dual encoders. The CNNs-Mamba encoder effectively captures local features, while the CNNs-Transformer encoder models global semantic information. ASCRM leverages the correlation consistency between local detail features and global semantic features to guide and optimize the fusion of semantic information and detailed features. Addressing significant semantic differences in cross-layer features, SGAFF is designed to align and fuse deep semantic features with multi-scale details, ensuring comprehensive feature integration. Through extensive evaluations against 26 state-of-the-art methods on multiple public RSI-SOD datasets, our method has demonstrated superior performance in various rating indicators and prediction quality metrics. Additionally, ablation experiments have validated the specific contributions of each component within TSFANet, underscoring the effectiveness of the integrated modules in improving SOD performance. The successful implementation of TSFANet highlights its potential for advancing salient object detection in complex remote sensing environments.

Author Contributions

Conceptualization, J.L. and Z.W.; methodology, J.L. and C.Z.; software, Z.W. and N.X.; validation, N.X. and C.Z.; formal analysis, J.L. and Z.W.; investigation, Z.W.; resources, Z.W. and C.Z.; data curation, J.L.; original draft preparation, J.L.; review and editing, Z.W. and C.Z.; visualization, Z.W.; supervision, Z.W. and C.Z.; project administration, Z.W. and N.X.; funding acquisition, Z.W. and N.X. All authors have read and agreed on the published version of the manuscript.

Funding

This work was supported in part by the Youth Talent Support Program of Shaanxi Science and Technology Association under Grant 23JK0701, in part by the Xi’an Science and Technology Planning Projects under Grant 20240103, and in part by the China Postdoctoral Science Foundation under Grant 2024M754225.

Data Availability Statement

The original contributions presented in the study are included in the article, further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Itti, L.; Koch, C.; Niebur, E. A model of saliency-based visual attention for rapid scene analysis. IEEE Trans. Pattern Anal. Mach. Intell. 1998, 20, 1254–1259. [Google Scholar] [CrossRef]
  2. Borji, A. What is a salient object? A dataset and a baseline model for salient object detection. IEEE Trans. Image Process. 2014, 24, 742–756. [Google Scholar] [CrossRef]
  3. Li, C.; Yuan, Y.; Cai, W.; Xia, Y.; Dagan Feng, D. Robust saliency detection via regularized random walks ranking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 2710–2717. [Google Scholar]
  4. Yuan, Y.; Li, C.; Kim, J.; Cai, W.; Feng, D.D. Reversion correction and regularized random walk ranking for saliency detection. IEEE Trans. Image Process. 2017, 27, 1311–1322. [Google Scholar] [CrossRef] [PubMed]
  5. Liu, N.; Zhang, N.; Wan, K.; Shao, L.; Han, J. Visual Saliency Transformer. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 11–17 October 2021; pp. 4702–4712. [Google Scholar]
  6. Lv, Z.; Huang, H.; Li, X.; Zhao, M.; Benediktsson, J.A.; Sun, W.; Falco, N. Land cover change detection with heterogeneous remote sensing images: Review, progress, and perspective. Proc. IEEE 2022, 110, 1976–1991. [Google Scholar] [CrossRef]
  7. Sarkar, A.; Chowdhury, T.; Murphy, R.R.; Gangopadhyay, A.; Rahnemoonfar, M. Sam-vqa: Supervised attention-based visual question answering model for post-disaster damage assessment on remote sensing imagery. IEEE Trans. Geosci. Remote Sens. 2023, 61, 4702716. [Google Scholar] [CrossRef]
  8. Han, Y.; Liao, J.; Lu, T.; Pu, T.; Peng, Z. KCPNet: Knowledge-driven context perception networks for ship detection in infrared imagery. IEEE Trans. Geosci. Remote Sens. 2022, 61, 5000219. [Google Scholar] [CrossRef]
  9. Qin, X.; Zhang, Z.; Huang, C.; Gao, C.; Dehghan, M.; Jagersand, M. BASNet: Boundary-aware salient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–19 June 2019; pp. 7479–7489. [Google Scholar]
  10. Deng, Z.; Hu, X.; Zhu, L.; Xu, X.; Qin, J.; Han, G.; Heng, P.A. R3Net: Recurrent residual refinement network for saliency detection. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), Stockholm, Sweden, 13–19 July 2018; pp. 684–690. [Google Scholar]
  11. Li, G.; Bai, Z.; Liu, Z.; Zhang, X.; Ling, H. Salient object detection in optical remote sensing images driven by transformer. IEEE Trans. Image Process. 2023, 32, 5257–5269. [Google Scholar] [CrossRef]
  12. Liu, Y.; Tian, Y.; Zhao, Y.; Yu, H.; Xie, L.; Wang, Y.; Ye, Q.; Liu, Y. Vmamba: Visual state space model. arXiv 2024, arXiv:2401.10166. [Google Scholar]
  13. Wang, Q.; Liu, Y.; Xiong, Z.; Yuan, Y. Hybrid feature aligned network for salient object detection in optical remote sensing imagery. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5624915. [Google Scholar] [CrossRef]
  14. Gao, Y.; Huang, J.; Sun, X.; Jie, Z.; Zhong, Y.; Ma, L. Matten: Video Generation with Mamba-Attention. arXiv 2024, arXiv:2405.03025. [Google Scholar]
  15. Hatamizadeh, A.; Kautz, J. MambaVision: A Hybrid Mamba-Transformer Vision Backbone. arXiv 2024, arXiv:2407.08083. [Google Scholar]
  16. Gong, H.; Kang, L.; Wang, Y.; Wan, X.; Li, H. nnmamba: 3D biomedical image segmentation, classification and landmark detection with state space model. arXiv 2024, arXiv:2402.03526. [Google Scholar]
  17. Sheng, J.; Zhou, J.; Wang, J.; Ye, P.; Fan, J. DualMamba: A Lightweight Spectral-Spatial Mamba-Convolution Network for Hyperspectral Image Classification. arXiv 2024, arXiv:2406.07050. [Google Scholar] [CrossRef]
  18. Wang, Z.; Ma, C. Weak-Mamba-UNet: Visual Mamba Makes CNN and ViT Work Better for Scribble-based Medical Image Segmentation. arXiv 2024, arXiv:2402.10887. [Google Scholar]
  19. Huang, T.; Pei, X.; You, S.; Wang, F.; Qian, C.; Xu, C. LocalMamba: Visual state space model with windowed selective scan. arXiv 2024, arXiv:2403.09338. [Google Scholar]
  20. Zhu, L.; Liao, B.; Zhang, Q.; Wang, X.; Liu, W.; Wang, X. Vision Mamba: Efficient visual representation learning with bidirectional state space model. arXiv 2024, arXiv:2401.09417. [Google Scholar]
  21. Wang, C.; Tsepa, O.; Ma, J.; Wang, B. Graph-mamba: Towards long-range graph sequence modeling with selective state spaces. arXiv 2024, arXiv:2402.00789. [Google Scholar]
  22. Tang, Y.; Dong, P.; Tang, Z.; Chu, X.; Liang, J. VMRNN: Integrating Vision Mamba and LSTM for Efficient and Accurate Spatiotemporal Forecasting. arXiv 2024, arXiv:2403.16536. [Google Scholar]
  23. Li, W.; Hong, X.; Fan, X. Spikemba: Multi-modal spiking saliency mamba for temporal video grounding. arXiv 2024, arXiv:2404.01174. [Google Scholar]
  24. Deng, R.; Gu, T. CU-Mamba: Selective State Space Models with Channel Learning for Image Restoration. arXiv 2024, arXiv:2404.11778. [Google Scholar]
  25. Wang, Z.; Zheng, J.Q.; Ma, C.; Guo, T. Vmambamorph: A visual mamba-based framework with cross-scan module for deformable 3D image registration. arXiv 2024, arXiv:2404.05105. [Google Scholar]
  26. Yue, Y.; Li, Z. Medmamba: Vision mamba for medical image classification. arXiv 2024, arXiv:2403.03849. [Google Scholar]
  27. He, X.; Cao, K.; Yan, K.; Li, R.; Xie, C.; Zhang, J.; Zhou, M. Pan-Mamba: Effective pan-sharpening with State Space Model. arXiv 2024, arXiv:2402.12192. [Google Scholar] [CrossRef]
  28. Shi, Y.; Xia, B.; Jin, X.; Wang, X.; Zhao, T.; Xia, X.; Xiao, X.; Yang, W. Vmambair: Visual state space model for image restoration. arXiv 2024, arXiv:2403.11423. [Google Scholar] [CrossRef]
  29. Kim, J.; Han, D.; Tai, Y.W.; Kim, J. Salient region detection via high-dimensional color transform. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 883–890. [Google Scholar]
  30. Zhu, W.; Liang, S.; Wei, Y.; Sun, J. Saliency optimization from robust background detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 2814–2821. [Google Scholar]
  31. Feng, M.; Lu, H.; Ding, E. Attentive feedback network for boundary-aware salient object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 1623–1632. [Google Scholar]
  32. Li, G.; Yu, Y. Visual saliency based on multiscale deep features. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 5455–5463. [Google Scholar]
  33. Zhang, P.; Wang, D.; Lu, H.; Wang, H.; Ruan, X. Amulet: Aggregating multi-level convolutional features for salient object detection. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 202–211. [Google Scholar]
  34. Pang, Y.; Zhao, X.; Zhang, L.; Lu, H. Multi-scale interactive network for salient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 9413–9422. [Google Scholar]
  35. Zhang, Z.; Li, S.; Li, H. C2SNet: Contour-to-Saliency Network for Salient Object Detection. IEEE Trans. Image Process. 2020, 29, 3076–3088. [Google Scholar]
  36. Zhao, J.; Liu, J.J.; Fan, D.P.; Cao, Y.; Yang, J.; Cheng, M.M. EGNet: Edge guidance network for salient object detection. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 8779–8788. [Google Scholar]
  37. Zhou, H.; Xie, X.; Lai, J.H.; Chen, Z.; Yang, L. Interactive two-stream decoder for accurate and fast saliency detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 9138–9147. [Google Scholar]
  38. Lee, M.S.; Shin, W.; Han, S.W. TRACER: Extreme attention guided salient object tracing network. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), Online, 22 February–1 March 2022. [Google Scholar]
  39. Liu, N.; Han, J.; Yang, M.H. Picanet: Learning pixel-wise contextual attention for saliency detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 3089–3098. [Google Scholar]
  40. Zhao, T.; Wu, X. Pyramid feature attention network for saliency detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 3085–3094. [Google Scholar]
  41. Ma, M.; Xia, C.; Xie, C.; Chen, X.; Li, J. Boosting broader receptive fields for salient object detection. IEEE Trans. Image Process. 2023, 32, 1026–1038. [Google Scholar] [CrossRef]
  42. Zhao, J.; Wang, J.; Shi, J.; Jiang, Z. Sparsity-guided saliency detection for remote sensing images. J. Appl. Remote Sens. 2015, 9, 095055. [Google Scholar] [CrossRef]
  43. Li, C.; Cong, R.; Hou, J.; Zhang, S.; Qian, Y.; Kwong, S. Nested network with two-stream pyramid for salient object detection in optical remote sensing images. IEEE Trans. Geosci. Remote Sens. 2019, 57, 9156–9166. [Google Scholar] [CrossRef]
  44. Zhang, Q.; Cong, R.; Li, C.; Cheng, M.M.; Fang, Y.; Cao, X.; Zhao, Y.; Kwong, S. Dense attention fluid network for salient object detection in optical remote sensing images. IEEE Trans. Image Process. 2021, 30, 1305–1317. [Google Scholar] [CrossRef]
  45. Tu, Z.; Wang, C.; Li, C.; Fan, M.; Zhao, H.; Luo, B. ORSI salient object detection via multiscale joint region and boundary model. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5607913. [Google Scholar] [CrossRef]
  46. Zhou, X.; Shen, K.; Liu, Z.; Gong, C.; Zhang, J.; Yan, C. Edge-aware multiscale feature integration network for salient object detection in optical remote sensing images. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5605315. [Google Scholar] [CrossRef]
  47. Li, G.; Liu, Z.; Lin, W.; Ling, H. Multi-content complementation network for salient object detection in optical remote sensing images. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5614513. [Google Scholar] [CrossRef]
  48. Huang, Z.; Chen, H.; Liu, B.; Wang, Z. Semantic-guided attention refinement network for salient object detection in optical remote sensing images. Remote Sens. 2021, 13, 2163. [Google Scholar] [CrossRef]
  49. Li, G.; Liu, Z.; Lin, D.; Ling, H. Adjacent context coordination network for salient object detection in optical remote sensing images. IEEE Trans. Cybern. 2023, 53, 526–538. [Google Scholar] [CrossRef] [PubMed]
  50. Zhuge, M.; Fan, D.P.; Liu, N.; Zhang, D.; Xu, D.; Shao, L. Salient object detection via integrity learning. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 3738–3752. [Google Scholar] [CrossRef]
  51. Zhao, J.; Jia, Y.; Ma, L.; Yu, L. Recurrent adaptive graph reasoning network with region and boundary interaction for salient object detection in optical remote sensing images. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5630720. [Google Scholar] [CrossRef]
  52. Feng, D.; Chen, H.; Liu, S.; Liao, Z.; Shen, X.; Xie, Y.; Zhu, J. Boundary-semantic collaborative guidance network with dual-stream feedback mechanism for salient object detection in optical remote sensing imagery. IEEE Trans. Geosci. Remote Sens. 2023, 61, 4706317. [Google Scholar] [CrossRef]
  53. Gong, A.; Nie, J.; Niu, C.; Yu, Y.; Li, J.; Guo, L. Edge and skeleton guidance network for salient object detection in optical remote sensing images. IEEE Trans. Circuits Syst. Video Technol. 2023, 33, 7109–7120. [Google Scholar] [CrossRef]
  54. Ma, F.; Zhang, F.; Xiang, D.; Yin, Q.; Zhou, Y. Fast task-specific region merging for SAR image segmentation. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5222316. [Google Scholar] [CrossRef]
  55. Gao, L.; Liu, B.; Fu, P.; Xu, M. Adaptive spatial tokenization transformer for salient object detection in optical remote sensing images. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5602915. [Google Scholar] [CrossRef]
  56. Liu, K.; Zhang, B.; Lu, J.; Yan, H. Towards integrity and detail with ensemble learning for salient object detection in optical remote sensing images. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5624615. [Google Scholar] [CrossRef]
  57. Yan, R.; Yan, L.; Geng, G.; Cao, Y.; Zhou, P.; Meng, Y. ASNet: Adaptive semantic network based on transformer-CNN for salient object detection in optical remote sensing images. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5608716. [Google Scholar] [CrossRef]
  58. Liu, Y.; Xu, M.; Xiao, T.; Tang, H.; Hu, Y.; Nie, L. Heterogeneous feature collaboration network for salient object detection in optical remote sensing images. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5635114. [Google Scholar] [CrossRef]
  59. Han, P.; Zhao, B.; Li, X. Progressive feature interleaved fusion network for remote-sensing image salient object detection. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5500414. [Google Scholar] [CrossRef]
  60. Di, L.; Zhang, B.; Wang, Y. Multi-scale and multi-dimensional weighted network for salient object detection in optical remote sensing images. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5625114. [Google Scholar] [CrossRef]
  61. Ma, F.; Zhang, F.; Yin, Q.; Xiang, D.; Zhou, Y. Fast SAR image segmentation with deep task-specific superpixel sampling and soft graph convolution. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5214116. [Google Scholar] [CrossRef]
  62. Fang, W.; Fu, Y.; Sheng, V.S. FPS-U2Net: Combining U2Net and Multi-level Aggregation Architecture for Fire Point Segmentation in Remote Sensing Images. Comput. Geosci. 2024, 189, 105628. [Google Scholar] [CrossRef]
  63. Zhu, Q.; Cai, Y.; Fang, Y.; Yang, Y.; Chen, C.; Fan, L.; Nguyen, A. Samba: Semantic Segmentation of Remotely Sensed Images with State Space Model. arXiv 2024, arXiv:2404.01705. [Google Scholar] [CrossRef]
  64. Chen, S.; Tan, X.; Wang, B.; Hu, X. Reverse attention for salient object detection. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 234–250. [Google Scholar]
  65. Wu, Z.; Su, L.; Huang, Q. Cascaded partial decoder for fast and accurate salient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 3907–3916. [Google Scholar]
  66. Liu, J.J.; Hou, Q.; Cheng, M.M.; Feng, J.; Jiang, J. A simple pooling-based design for real-time salient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 3917–3926. [Google Scholar]
  67. Qin, X.; Zhang, Z.; Huang, C.; Dehghan, M.; Zaiane, O.R.; Jagersand, M. U2-net: Going deeper with nested u-structure for salient object detection. Pattern Recognit. 2020, 106, 107404. [Google Scholar] [CrossRef]
  68. Cheng, M.M.; Gao, S.H.; Borji, A.; Tan, Y.Q.; Lin, Z.; Wang, M. A highly efficient model to study the semantics of salient object detection. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 8006–8021. [Google Scholar] [CrossRef]
  69. Giglietto, F.; Righetti, N.; Rossi, L.; Marino, G. COORNET: An Integrated Approach to Surface Problematic Content, Malicious Actors, and Coordinated Networks. Aoir Sel. Pap. Internet Res. 2021, 21, 13–16. [Google Scholar] [CrossRef]
  70. Chen, X.; Zhang, N.; Li, L.; Yao, Y.; Deng, S.; Tan, C.; Huang, F.; Si, L.; Chen, H. Good visual guidance makes a better extractor: Hierarchical visual prefix for multimodal entity and relation extraction. arXiv 2022, arXiv:2205.03521. [Google Scholar]
  71. Liu, Y.; Zhang, X.Y.; Bian, J.W.; Zhang, L.; Cheng, M.M. Samnet: Stereoscopically attentive multi-scale network for lightweight salient object detection. IEEE Trans. Image Process. 2021, 30, 3804–3814. [Google Scholar] [CrossRef]
  72. Li, J.; Pan, Z.; Liu, Q.; Wang, Z. Stacked U-shape network with channel-wise attention for salient object detection. IEEE Trans. Multimed. 2020, 23, 1397–1409. [Google Scholar] [CrossRef]
  73. Fang, C.; Tian, H.; Zhang, D.; Zhang, Q.; Han, J.; Han, J. Densely nested top-down flows for salient object detection. Sci. China Inf. Sci. 2022, 65, 182103. [Google Scholar] [CrossRef]
  74. Liu, Y.; Zhang, D.; Liu, N.; Xu, S.; Han, J. Disentangled capsule routing for fast part-object relational saliency. IEEE Trans. Image Process. 2022, 31, 6719–6732. [Google Scholar] [CrossRef]
  75. Xu, B.; Liang, H.; Liang, R.; Chen, P. Locate globally, segment locally: A progressive architecture with knowledge review network for salient object detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 2–9 February 2021; Volume 35, pp. 3004–3012. [Google Scholar]
  76. Zhou, X.; Shen, K.; Weng, L.; Cong, R.; Zheng, B.; Zhang, J.; Yan, C. Edge-guided recurrent positioning network for salient object detection in optical remote sensing images. IEEE Trans. Cybern. 2022, 53, 539–552. [Google Scholar] [CrossRef] [PubMed]
  77. Fang, W.; Fu, Y.; Sheng, V.S. Dual Backbone Interaction Network For Burned Area Segmentation in Optical Remote Sensing Images. IEEE Geosci. Remote Sens. Lett. 2024, 21, 6008805. [Google Scholar] [CrossRef]
  78. Fu, Y.; Fang, W.; Sheng, V.S. Burned Area Segmentation in Optical Remote Sensing Images Driven by U-shaped Multi-stage Masked Autoencoder. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 10770–10780. [Google Scholar] [CrossRef]
  79. Ge, Y.; Liang, T.; Ren, J.; Chen, J.; Bi, H. Enhanced salient object detection in remote sensing images via dual-stream semantic interactive network. Vis. Comput. 2024, 44, 5153–5169. [Google Scholar] [CrossRef]
  80. Park, J.; Woo, S.; Lee, J.Y.; Kweon, I.S. BAM: Bottleneck Attention Module. arXiv 2018, arXiv:1807.06514. [Google Scholar]
  81. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141. [Google Scholar]
  82. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional Block Attention Module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  83. Li, Y.; Yao, T.; Pan, Y.; Mei, T. Contextual Transformer Networks for Visual Recognition. arXiv 2021, arXiv:2107.12292. [Google Scholar] [CrossRef] [PubMed]
  84. Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient Channel Attention for Deep Convolutional Neural Networks. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 11531–11539. [Google Scholar]
  85. Pan, J.; Canton Ferrer, C.; McGuinness, K.; O’Connor, N.E.; Torres, J.; Sayrol, E.; Giro-i Nieto, X. SalGAN: Visual Saliency Prediction with Generative Adversarial Networks. arXiv 2018, arXiv:1701.01081. [Google Scholar]
  86. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NeurIPS), Long Beach, CA, USA, 4–9 December 2017; pp. 6000–6010. [Google Scholar]
Figure 1. Performance comparison between our method and the advanced Mamba-based, CNNs-based, Transformer-based, and hybrid CNNs-Transformer methods in classic challenging scenarios. (a) RSI. (b) GT. (c) VMamba. (d) BASNet. (e) R3Net. (f) GeleNet. (g) HFANet. (h) Ours.
Figure 2. Feature map visualization of the object region positioning deviation of different SOD models. (a) RSI. (b) GT. (c) VMamba. (d) BASNet. (e) R3Net. (f) GeleNet. (g) HFANet. (h) Ours.
Figure 3. The overall architecture of TSFANet, which includes the Res Encoder, Mamba Encoder, Transformer Encoder, Adaptive Semantic Correlation Refinement Module (ASCRM), and Semantic-Guided Adjacent Feature Fusion Module (SGAFF).
Figure 4. Architecture of Adaptive Semantic Correlation Refinement Module.
Figure 5. Architecture of Semantic-Guided Adjacent Feature Fusion Module.
Figure 6. The P-R and F-measure curves of different comparison methods on the ORSSD, EORSSD, and ORSI-4199 datasets. (a) P-R curves on ORSSD. (b) P-R curves on EORSSD. (c) P-R curves on ORSI-4199. (d) F-measure curves on ORSSD. (e) F-measure curves on EORSSD. (f) F-measure curves on ORSI-4199.
Figure 7. Visualization comparison of different RSI-SOD models on typical challenging scenarios. (a1–c1,a2–c2): Objects in complex cluttered backgrounds; (d1–f1,d2–f2): Small-scale salient objects; (g1–i1,g2–i2): Topologically irregular objects. The compared methods include Mamba-based semantic image segmentation models (VMamba [12] and Samba [63]), NSI models (PicaNet [39], R3Net [10], RAS [64], BASNet [9], CPDNet [65], EGNet [36], PoolNet [66], U2Net [67], and VST [5]), and RSI models (CSNet [68], CoorNet [69], GeleNet [11], HVPNet [70], SAMNet [71], MCCNet [47], SUCA [72], ACCoNet [49], DNTD [73], MJRBM [45], ICON [50], EMFINet [46], HFANet [13], DPORTNet [74], PA-KRN [75], ERPNet [76], DBINet [77], DCNet [78], and DSINet [79]).
Figure 8. Comparison of ablation study visualization results. (a) RSI. (b) GT. (c) Baseline. (d) Baseline + TSDSM. (e) Baseline + TSDSM + ASCRM. (f) Baseline + TSDSM + ASCRM + SGAFF.
Figure 9. Feature visualization comparison of different encoder structures. (a) RSI. (b) GT. (c) CNNs. (d) Transformer. (e) Mamba. (f) CNNs + Transformer. (g) CNNs + Mamba. (h) TSDSM.
Figure 10. The Framework of various ASCRM ablation experiment modules. (a) Baseline module without semantic correlation; (b) Module with single-branch semantic guidance; (c) Module with additional convolution layer; (d) Module with alternative activation function; (e) Module using channel attention only; (f) Module using spatial attention only; (g) Module with modified correlation calculation; (h) Module with different normalization strategy; (i) Module combining multiple attention mechanisms; (j) The proposed ASCRM module.
Figure 11. Feature visualization comparison of different ASCRM ablation modules. (a) RSI. (b) GT. (c) NO.1. (d) NO.2. (e) NO.3. (f) NO.4. (g) NO.5. (h) NO.6. (i) NO.7. (j) NO.8. (k) NO.9. (l) NO.10.
Figure 12. Feature visualization comparison of different attention modules. (a) RSI. (b) GT. (c) + Baseline. (d) + AAM. (e) + BAM. (f) + CAM. (g) + CBAM. (h) + CoT-Attention. (i) + ECA-Attention. (j) + NLAM. (k) + RSAM. (l) + Self-Attention. (m) + ASCRM.
Figure 13. Results of different feature fusion methods. (a) RSI. (b) GT. (c) Baseline. (d) + Element-wise Summation. (e) + Element-wise Multiplication. (f) + Channel Concatenation. (g) + SGAFF.
Table 1. Quantitative comparisons of our method with 30 state-of-the-art methods on the ORSSD dataset. The Top-3 results in each column are highlighted in red, blue, and green. “-” denotes the authors do not provide the corresponding results.
Methods | Publication | Type | Backbone | S_α | F_β | E_ξ | MAE
PicaNet | 2018 CVPR | C-NSI | ResNet | 0.8689 | 0.7922 | 0.9005 | 0.0260
RAS | 2018 ECCV | C-NSI | ResNet | 0.8829 | 0.8229 | 0.9306 | 0.0169
R3Net | 2018 IJCAI | C-NSI | ResNet | 0.9009 | 0.8615 | 0.9238 | 0.0170
PoolNet | 2019 CVPR | C-NSI | ResNet | 0.8098 | 0.7051 | 0.8513 | 0.0469
BASNet | 2019 CVPR | C-NSI | U-Net | 0.8931 | 0.8231 | 0.9206 | 0.0277
CPDNet | 2019 CVPR | C-NSI | ResNet | 0.8829 | 0.8363 | 0.9133 | 0.0171
EGNet | 2019 ICCV | C-NSI | VGG | 0.8725 | 0.7603 | 0.8959 | 0.0217
U2Net | 2020 PR | C-NSI | U-Net | 0.8716 | 0.7962 | 0.9014 | 0.0222
VIT | 2021 ICCV | T-NSI | Transformer | 0.9174 | 0.8613 | 0.9415 | 0.0125
CSNet | 2020 TIP | C-RSI | ResNet | 0.8928 | 0.8437 | 0.9181 | 0.0181
HVPNet | 2021 TGRS | C-RSI | ResNet | 0.8585 | 0.7431 | 0.8601 | 0.0227
SAMNet | 2021 TGRS | C-RSI | ResNet | 0.8721 | 0.7559 | 0.8690 | 0.0221
CoorNet | 2021 TGRS | C-RSI | ResNet | 0.9392 | 0.9188 | 0.9746 | 0.0098
MCCNet | 2021 TGRS | C-RSI | VGG | 0.9428 | 0.9241 | 0.9758 | 0.0087
SUCA | 2021 TGRS | C-RSI | ResNet | 0.8989 | 0.8398 | 0.9391 | 0.0145
ACCoNet | 2022 TCYB | C-RSI | VGG | 0.9424 | 0.9157 | 0.9754 | 0.0088
DNTD | 2022 TGRS | C-RSI | ResNet | 0.8696 | 0.8163 | 0.9065 | 0.0218
MJRBM | 2022 TGRS | C-RSI | ResNet | 0.9194 | 0.8749 | 0.9418 | 0.0163
ICON | 2022 TGRS | C-RSI | ResNet | 0.9251 | 0.8851 | 0.9637 | 0.0116
EMFINet | 2022 TGRS | C-RSI | ResNet | 0.9365 | 0.9038 | 0.9671 | 0.0109
DPORTNet | 2023 TGRS | C-RSI | ResNet | 0.8823 | 0.8327 | 0.9119 | 0.0221
PA-KRN | 2023 TGRS | C-RSI | ResNet | 0.9231 | 0.8909 | 0.9620 | 0.0139
ERPNet | 2023 TGRS | C-RSI | ResNet | 0.9247 | 0.8926 | 0.9566 | 0.0135
GeleNet | 2023 TIP | T-RSI | PVT | 0.9075 | 0.8619 | 0.9463 | 0.0133
HFANet | 2022 TGRS | H-RSI | Hybrid | 0.9393 | 0.9165 | 0.9711 | 0.0092
VMamba | 2024 arXiv | - | Mamba | 0.7060 | 0.5047 | 0.6684 | 0.0584
Samba | 2024 arXiv | - | Mamba | 0.8628 | 0.8077 | 0.9060 | 0.0327
DBINet | 2024 GRSL | H-RSI | Hybrid | 0.8818 | 0.8220 | 0.9297 | 0.0173
DCNet | 2024 JSTARS | H-RSI | Hybrid | 0.8821 | 0.8359 | 0.9128 | 0.0177
DSINet | 2024 Vis Comput | H-RSI | Hybrid | 0.9387 | 0.9007 | 0.9709 | 0.0093
TSFANet | Ours | H-RSI | Hybrid | 0.9446 | 0.9174 | 0.9817 | 0.0077
Table 2. Quantitative comparisons of our method with 30 state-of-the-art methods on the EORSSD dataset. The Top 3 results in each column are highlighted in red, blue, and green. “-” denotes the authors do not provide the corresponding results.
Methods | Publication | Type | Backbone | S_α | F_β | E_ξ | MAE
PicaNet | 2018 CVPR | C-NSI | ResNet | 0.8797 | 0.7762 | 0.8902 | 0.0119
RAS | 2018 ECCV | C-NSI | ResNet | 0.8847 | 0.8052 | 0.9265 | 0.0115
R3Net | 2018 IJCAI | C-NSI | ResNet | 0.8974 | 0.8268 | 0.9154 | 0.0124
PoolNet | 2019 CVPR | C-NSI | ResNet | 0.8556 | 0.7486 | 0.8742 | 0.0169
BASNet | 2019 CVPR | C-NSI | U-Net | 0.9031 | 0.8207 | 0.9266 | 0.0114
CPDNet | 2019 CVPR | C-NSI | ResNet | 0.8488 | 0.7672 | 0.8794 | 0.0156
EGNet | 2019 ICCV | C-NSI | VGG | 0.8605 | 0.7065 | 0.8631 | 0.0109
U2Net | 2020 PR | C-NSI | U-Net | 0.8619 | 0.7457 | 0.8581 | 0.0167
VIT | 2021 ICCV | T-NSI | Transformer | 0.9183 | 0.8407 | 0.9366 | 0.0074
CSNet | 2020 TIP | C-RSI | ResNet | 0.8399 | 0.7812 | 0.8860 | 0.0142
HVPNet | 2021 TGRS | C-RSI | ResNet | 0.8697 | 0.7430 | 0.8552 | 0.0112
SAMNet | 2021 TGRS | C-RSI | ResNet | 0.8597 | 0.7286 | 0.8543 | 0.0133
CoorNet | 2021 TGRS | C-RSI | ResNet | 0.9298 | 0.8890 | 0.9646 | 0.0083
MCCNet | 2021 TGRS | C-RSI | VGG | 0.9323 | 0.8874 | 0.9685 | 0.0066
SUCA | 2021 TGRS | C-RSI | ResNet | 0.8985 | 0.8168 | 0.9251 | 0.0097
ACCoNet | 2022 TCYB | C-RSI | VGG | 0.9285 | 0.8823 | 0.9655 | 0.0074
DNTD | 2022 TGRS | C-RSI | ResNet | 0.8954 | 0.8176 | 0.9196 | 0.0114
MJRBM | 2022 TGRS | C-RSI | ResNet | 0.9200 | 0.8504 | 0.9354 | 0.0099
ICON | 2022 TGRS | C-RSI | ResNet | 0.9196 | 0.8632 | 0.9619 | 0.0073
EMFINet | 2022 TGRS | C-RSI | ResNet | 0.9299 | 0.8751 | 0.9601 | 0.0084
DPORTNet | 2023 TGRS | C-RSI | ResNet | 0.8937 | 0.7945 | 0.8907 | 0.0152
PA-KRN | 2023 TGRS | C-RSI | ResNet | 0.9186 | 0.8621 | 0.9537 | 0.0104
ERPNet | 2023 TGRS | C-RSI | ResNet | 0.9201 | 0.8565 | 0.9399 | 0.0089
GeleNet | 2023 TIP | T-RSI | PVT | 0.8849 | 0.8138 | 0.9273 | 0.0090
HFANet | 2022 TGRS | H-RSI | Hybrid | 0.9390 | 0.8951 | 0.9678 | 0.0070
VMamba | 2024 arXiv | - | Mamba | 0.7423 | 0.5104 | 0.6991 | 0.0273
Samba | 2024 arXiv | - | Mamba | 0.8976 | 0.8397 | 0.9416 | 0.0124
DBINet | 2024 GRSL | H-RSI | Hybrid | 0.8830 | 0.8152 | 0.9258 | 0.0129
DCNet | 2024 JSTARS | H-RSI | Hybrid | 0.8482 | 0.8148 | 0.8753 | 0.0161
DSINet | 2024 Vis Comput | H-RSI | Hybrid | 0.9315 | 0.8776 | 0.9571 | 0.0076
TSFANet | Ours | H-RSI | Hybrid | 0.9323 | 0.8874 | 0.9685 | 0.0060
Table 3. Quantitative comparisons of our method with 30 state-of-the-art methods on the ORSI-4199 dataset. The Top 3 results in each column are highlighted in red, blue, and green. “-” denotes the authors do not provide the corresponding results.
Methods | Publication | Type | Backbone | S_α | F_β | E_ξ | MAE
PicaNet | 2018 CVPR | C-NSI | ResNet | 0.8359 | 0.7711 | 0.8860 | 0.0418
RAS | 2018 ECCV | C-NSI | ResNet | 0.8326 | 0.7879 | 0.8962 | 0.0433
R3Net | 2018 IJCAI | C-NSI | ResNet | 0.8586 | 0.8220 | 0.9116 | 0.0353
PoolNet | 2019 CVPR | C-NSI | ResNet | 0.7926 | 0.7133 | 0.8491 | 0.0643
BASNet | 2019 CVPR | C-NSI | U-Net | 0.8369 | 0.7862 | 0.8863 | 0.0500
CPDNet | 2019 CVPR | C-NSI | ResNet | 0.8361 | 0.8035 | 0.8943 | 0.0403
EGNet | 2019 ICCV | C-NSI | VGG | 0.8464 | 0.7959 | 0.8943 | 0.0455
U2Net | 2020 PR | C-NSI | U-Net | 0.8384 | 0.7915 | 0.8970 | 0.0433
VIT | 2021 ICCV | T-NSI | Transformer | 0.8764 | 0.8354 | 0.9288 | 0.0307
CSNet | 2020 TIP | C-RSI | ResNet | 0.8215 | 0.7540 | 0.8549 | 0.0545
HVPNet | 2021 TGRS | C-RSI | ResNet | 0.8468 | 0.7932 | 0.8930 | 0.0438
SAMNet | 2021 TGRS | C-RSI | ResNet | 0.8407 | 0.7920 | 0.8908 | 0.0452
CoorNet | 2021 TGRS | C-RSI | ResNet | 0.8621 | 0.8421 | 0.9204 | 0.0382
MCCNet | 2021 TGRS | C-RSI | VGG | 0.8747 | 0.8541 | 0.9346 | 0.0332
SUCA | 2021 TGRS | C-RSI | ResNet | 0.8795 | 0.8500 | 0.9352 | 0.0320
ACCoNet | 2022 TCYB | C-RSI | VGG | 0.8775 | 0.8531 | 0.9341 | 0.0330
DNTD | 2022 TGRS | C-RSI | ResNet | 0.8446 | 0.8113 | 0.9041 | 0.0441
MJRBM | 2022 TGRS | C-RSI | ResNet | 0.8593 | 0.8226 | 0.9102 | 0.0390
ICON | 2022 TGRS | C-RSI | ResNet | 0.8753 | 0.8577 | 0.9435 | 0.0299
EMFINet | 2022 TGRS | C-RSI | ResNet | 0.8674 | 0.8390 | 0.9254 | 0.0346
DPORTNet | 2023 TGRS | C-RSI | ResNet | 0.7475 | 0.7609 | 0.8676 | 0.0586
PA-KRN | 2023 TGRS | C-RSI | ResNet | 0.8492 | 0.8237 | 0.9164 | 0.0399
ERPNet | 2023 TGRS | C-RSI | ResNet | 0.8673 | 0.8286 | 0.9147 | 0.0373
GeleNet | 2023 TIP | T-RSI | PVT | 0.7665 | 0.6887 | 0.8216 | 0.0711
HFANet | 2022 TGRS | H-RSI | Hybrid | 0.8766 | 0.8534 | 0.9334 | 0.0330
VMamba | 2024 arXiv | - | Mamba | 0.7228 | 0.5853 | 0.7448 | 0.0829
Samba | 2024 arXiv | - | Mamba | 0.8467 | 0.8211 | 0.9132 | 0.0446
DBINet | 2024 GRSL | H-RSI | Hybrid | 0.8318 | 0.7870 | 0.8952 | 0.0441
DCNet | 2024 JSTARS | H-RSI | Hybrid | 0.8358 | 0.8032 | 0.8939 | 0.0410
DSINet | 2024 Vis Comput | H-RSI | Hybrid | 0.8644 | 0.8554 | 0.9284 | 0.0297
TSFANet | Ours | H-RSI | Hybrid | 0.8876 | 0.8617 | 0.9453 | 0.0277
Table 4. Computational complexity of different methods. The best results are marked in bold.
Model | Params (M) ↓ | Flops (G) ↓
PicaNet | 47.219 | 59.786
RAS | 45.31 | 60.19
R3Net | 71.89 | 70.99
PoolNet | 75.67 | 80.83
BASNet | 46.10 | 36.57
CPDNet | 69.99 | 75.69
EGNet | 85.25 | 90.91
U2Net | 44.01 | 115.313
VIT | 126.62 | 75.44
CSNet | 65.58 | 70.12
HVPNet | 100.42 | 120.11
SAMNet | 95.66 | 114.96
CoorNet | 88.98 | 104.69
MCCNet | 67.65 | 358.77
SUCA | 117.34 | 171.66
ACCoNet | 167.01 | 177.21
DNTD | 94.89 | 103.79
MJRBM | 85.94 | 90.25
ICON | 94.79 | 89.91
EMFINet | 102.45 | 132.47
DPORTNet | 93.73 | 88.68
PA-KRN | 87.62 | 73.65
ERPNet | 72.34 | 87.55
GeleNet | 48.94 | 25.45
HFANet | 320.31 | 78.46
VMamba | 66.80 | 121.31
Samba | 99.16 | 123.57
DBINet | 57.61 | 47.15
DCNet | 101.19 | 82.09
DSINet | 109.89 | 89.69
TSFANet | 334.82 | 115.24
Table 5. Ablation experiments on the ORSI-4199 dataset. TSDSM, ASCRM, and SGAFF denote Trans-Mamba Semantic-Detail Dual-Stream Collaborative Encoder Module, Adaptive Semantic Correlation Refinement Attention Module and Semantic-Guided Adjacent Feature Fusion Module, respectively. The best results are marked in bold.
No. | TSDSM | ASCRM | SGAFF | S_α | F_β | E_ξ | MAE
0 |   |   |   | 0.8766 | 0.8534 | 0.9334 | 0.0330
1 | ✓ |   |   | 0.8792 | 0.8573 | 0.9369 | 0.0313
2 |   | ✓ |   | 0.8779 | 0.8550 | 0.9354 | 0.0321
3 |   |   | ✓ | 0.8788 | 0.8567 | 0.9362 | 0.0317
4 | ✓ | ✓ |   | 0.8858 | 0.8603 | 0.9418 | 0.0287
5 | ✓ |   | ✓ | 0.8849 | 0.8596 | 0.9403 | 0.0292
6 |   | ✓ | ✓ | 0.8835 | 0.8590 | 0.9388 | 0.0296
7 | ✓ | ✓ | ✓ | 0.8876 | 0.8617 | 0.9453 | 0.0277
Table 6. Comparative experiments of different encoder structures on the ORSI-4199 dataset. The best results are marked in bold.
Encoder | S_α | F_β | E_ξ | MAE
Baseline | 0.8835 | 0.8590 | 0.9388 | 0.0296
+ CNN | 0.8839 | 0.8593 | 0.9391 | 0.0294
+ Transformer | 0.8853 | 0.8602 | 0.9417 | 0.0283
+ Mamba | 0.8847 | 0.8599 | 0.9412 | 0.0287
+ CNN-Transformer | 0.8872 | 0.8614 | 0.9446 | 0.0279
+ CNN-Mamba | 0.8868 | 0.8609 | 0.9440 | 0.0280
+ TSDSM | 0.8876 | 0.8617 | 0.9453 | 0.0277
Table 7. Comparative experiments of different ASCRM ablation modules on the ORSI-4199 dataset. The best results are marked in bold.
Module | S_α | F_β | E_ξ | MAE
No.1 | 0.8849 | 0.8596 | 0.9403 | 0.0292
No.2 | 0.8849 | 0.8597 | 0.9404 | 0.0291
No.3 | 0.8850 | 0.8599 | 0.9406 | 0.0290
No.4 | 0.8851 | 0.8600 | 0.9410 | 0.0288
No.5 | 0.8855 | 0.8602 | 0.9428 | 0.0286
No.6 | 0.8864 | 0.8605 | 0.9442 | 0.0283
No.7 | 0.8866 | 0.8607 | 0.9444 | 0.0282
No.8 | 0.8873 | 0.8614 | 0.9451 | 0.0279
No.9 | 0.8871 | 0.8610 | 0.9449 | 0.0280
No.10 | 0.8876 | 0.8617 | 0.9453 | 0.0277
Table 8. Comparative experiments of different attention modules on the ORSI-4199 dataset. The best results are marked in bold.
Attention Module | S_α | F_β | E_ξ | MAE
Baseline | 0.8766 | 0.8534 | 0.9334 | 0.0330
+ AAM | 0.8770 | 0.8539 | 0.9338 | 0.0327
+ BAM | 0.8767 | 0.8535 | 0.9335 | 0.0329
+ CAM | 0.8767 | 0.8536 | 0.9336 | 0.0329
+ CBAM | 0.8769 | 0.8538 | 0.9338 | 0.0328
+ CoT-Attention | 0.8772 | 0.8540 | 0.9341 | 0.0326
+ ECA-Attention | 0.8773 | 0.8542 | 0.9342 | 0.0325
+ NLAM | 0.8775 | 0.8544 | 0.9350 | 0.0324
+ RSAM | 0.8776 | 0.8547 | 0.9351 | 0.0324
+ Self-Attention | 0.8778 | 0.8548 | 0.9353 | 0.0322
+ ASCRM | 0.8779 | 0.8550 | 0.9354 | 0.0321
Table 9. Comparative experiments of different feature fusion modules on the ORSI-4199 dataset. The best results are marked in bold.
Fusion Method | S_α | F_β | E_ξ | MAE
Baseline | 0.8766 | 0.8534 | 0.9334 | 0.0330
+ Element-wise Summation | 0.8769 | 0.8538 | 0.9339 | 0.0328
+ Element-wise Multiplication | 0.8774 | 0.8542 | 0.9343 | 0.0325
+ Channel Concatenation | 0.8779 | 0.8557 | 0.9359 | 0.0321
+ SGAFF | 0.8788 | 0.8567 | 0.9362 | 0.0317
Table 10. Comparative experiments of different loss functions on the ORSI-4199 dataset. The best results are marked in bold.
Loss Function | S_α | F_β | E_ξ | MAE
F-measure Loss | 0.8734 | 0.8566 | 0.8802 | 0.0395
CT Loss | 0.8745 | 0.8491 | 0.8871 | 0.0389
BCE Loss | 0.8748 | 0.8543 | 0.9097 | 0.0367
IG Loss | 0.8755 | 0.8587 | 0.9212 | 0.0314
F-measure Loss + IoU Loss | 0.8761 | 0.8609 | 0.9385 | 0.0292
CT Loss + IoU Loss | 0.8852 | 0.8603 | 0.9418 | 0.0295
BCE Loss + IoU Loss | 0.8868 | 0.8610 | 0.9444 | 0.0281
IG Loss + IoU Loss | 0.8876 | 0.8617 | 0.9453 | 0.0277
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
