Article

Harnessing Foundation Models for Optical–SAR Object Detection via Gated–Guided Fusion

1 School of Artificial Intelligence, Guangzhou Maritime University, Guangzhou 510725, China
2 School of Science, Nanjing University of Posts and Telecommunications, Nanjing 210023, China
* Author to whom correspondence should be addressed.
ISPRS Int. J. Geo-Inf. 2026, 15(4), 160; https://doi.org/10.3390/ijgi15040160
Submission received: 20 January 2026 / Revised: 27 March 2026 / Accepted: 3 April 2026 / Published: 8 April 2026

Abstract

Remote sensing object detection is fundamental to Earth observation, yet remains challenging when relying on a single sensing modality. While optical imagery provides rich spatial and textural details, it is highly sensitive to illumination and adverse weather; conversely, Synthetic Aperture Radar (SAR) offers robust all-weather acquisition but suffers from speckle noise and limited semantic interpretability. To address these limitations, we leverage the potential of foundation models for optical–SAR object detection via a novel gated–guided fusion approach. By integrating transferable and generalizable representations from foundation models into the detection pipeline, we enhance semantic expressiveness and cross-environment robustness. Specifically, a gated–guided fusion mechanism is designed to selectively merge cross-modal features with foundational priors, enabling the network to prioritize informative cues while suppressing unreliable signals in complex scenes. Furthermore, we propose a dual-stream architecture incorporating attention mechanisms and State Space Models (SSMs) to simultaneously capture local and long-range dependencies. Extensive experiments on the large-scale M4-SAR dataset demonstrate that our method achieves state-of-the-art performance, significantly improving detection accuracy and robustness under challenging sensing conditions.

1. Introduction

Remote sensing object detection plays a pivotal role in diverse military and civilian applications. While Synthetic Aperture Radar (SAR) offers robust all-weather day-and-night imaging capabilities, its interpretability is often degraded by intrinsic speckle noise; conversely, optical imagery provides rich spectral and textural information but is highly susceptible to adverse weather and illumination variations. Driven by these complementary characteristics, the synergistic fusion of SAR and optical data has emerged as a promising research avenue, aiming to achieve enhanced detection reliability by capitalizing on the distinct advantages of each modality.
The advancement of deep learning [1,2,3] has significantly propelled the progress of remote sensing object detection. Classical Convolutional Neural Networks (CNNs) [4,5,6] excel at capturing local features but are inherently limited in modeling long-range dependencies. To address this, transformer architectures [7,8,9] with global self-attention mechanisms have been introduced, enhancing contextual understanding at the expense of higher computational complexity. This shift has led to the development of hybrid CNN–transformer models [10,11,12] which integrate local detail extraction with global relationship modeling. Recently, more efficient architectures based on State Space Models (SSMs) [13] have emerged, such as VMamba [14]. Utilizing selective SSMs, these architectures enable efficient long-sequence modeling and offer a compelling direction for processing large-scale remote sensing imagery [15,16].
Despite these advancements, existing methods frequently suffer from limited generalization: they often fail to adapt to diverse geographic environments and sensor variations, resulting in significant performance degradation on unseen data. To mitigate this, conventional approaches rely on extensive domain-specific data augmentation or costly retraining [17,18], both of which suffer from poor scalability. This limitation has sparked interest in foundation models (e.g., HyperSIGMA [19], GeoRSCLIP [20], SkySense++ [21]) pretrained on large-scale datasets to acquire universal and transferable representations. However, effectively integrating these powerful foundational representations into multimodal detection pipelines to specifically enhance optical–SAR synergy remains an underexplored challenge.
To address this, we propose a novel framework that harnesses the potential of foundation models for optical–SAR detection. Our primary contributions are summarized as follows:
  • We introduce a Foundation Model-Guided Feature Injection strategy, which seamlessly embeds transferable representations into the detection network to augment generalization and adaptability across diverse environments.
  • We design an adaptive Dual-Stream Fusion Architecture incorporating Low- and High-Frequency Mamba (LHF-Mamba) blocks along with a Gated–Guided Fusion mechanism. This design facilitates the simultaneous modeling of long-range dependencies while dynamically balancing information from SAR and optical inputs to suppress noise and highlight reliable cues.
  • We conduct extensive evaluations on the large-scale M4-SAR dataset [22]. Our approach achieves state-of-the-art performance, significantly improving robustness and accuracy under complex sensing conditions.

2. Related Work

2.1. Application of Foundation Models

In recent years, the field of computer vision has widely adopted foundation models, leveraging their formidable generalization capabilities and transferable knowledge acquired through large-scale pretraining. While these models significantly bolster performance in downstream tasks, their direct adaptation to remote sensing imagery is impeded by substantial domain gaps. These gaps primarily stem from unique imaging perspectives, variable spatial scales, complex geospatial features, and the inherent diversity of sensor modalities.
To address these challenges, researchers have actively explored methods for adapting foundation models tailored for remote sensing tasks, achieving progress in various directions. For instance, Li et al. [23] proposed a bi-temporal adapter network as a universal framework for applying foundation models to change detection tasks. While their work addresses the compatibility between the Bi-TAB model and existing visual models, it leaves the embedding of remote sensing foundation models unexplored. Further advancing this line of work, Ding et al. [24] exploited pretrained priors to embed multitemporal land cover and land use representations, although the cost of the calculation remained relatively high. In a similar vein, Wang et al. [25] developed a tri-level prompt encoder to improve tuning performance on HyperSIGMA [19] by integrating prompts across multiple network layers, though this approach can lead to increased model complexity and computational cost. Parallel efforts by Wang et al. facilitated knowledge transfer via feature alignment loss [26], focusing on consistency across representations.
Collectively, these studies underscore that leveraging large-scale pretrained models has emerged as a potent paradigm for enhancing RS generalization. However, critical hurdles remain, particularly regarding the seamless alignment of multimodal features and the integration of multi-scale semantic information to maintain cross-modal consistency.

2.2. Optical and SAR Image Fusion

In the field of remote sensing object detection, relying on a single modality often precludes comprehensive scene perception. Optical data are notoriously susceptible to atmospheric conditions, while SAR imagery is inherently plagued by speckle noise.
To reconcile these modalities, conventional fusion methods typically employ parallel architectures to integrate features extracted from different modalities or scales. For instance, Cao et al. [27] utilized channel switching and spatial attention to merge optical and SAR features of different scales. Zhang et al. [28] adopted simple element-wise summation of modal features at each scale. However, a major drawback of these approaches is the paucity of iterative interaction. Although Liu et al. [29] employed a hierarchical transformer with a Swin-Fusion module, their architecture remains vulnerable to SAR noise propagating into the optical stream.
From the perspective of information complementarity, Mao et al. [30] drew inspiration from pan-sharpening models to guide the reconstruction of optical images by effectively incorporating structural information from SAR data, while Song et al. [31] designed a cross-modal alignment detector. Despite their ingenuity, both approaches overlook the synergy of multi-scale feature hierarchies. Further advancing this field, Wei et al. [32] and Fang et al. [33] explicitly decomposed features into modality-shared and modality-specific components; nevertheless, these methods offer only superficial treatment of the inherent physical discrepancies between optical and SAR data, potentially compromising spatial contextual integrity. More recently, Wang et al. introduced E2E-OSDet [22] and tested it on the large-scale M4-SAR dataset, addressing alignment through specialized interaction modules.
Despite these advancements, a persistent limitation lies in the perfunctory nature of cross-modal interactions. This constrains the model’s ability to achieve semantically coherent feature integration, a gap that our proposed gated–guided mechanism aims to fill.

3. Methods

3.1. Network Structure

Fusion of optical and SAR data has been widely explored in remote sensing object detection contexts due to their complementary imaging characteristics; however, despite notable progress, existing approaches often struggle to effectively bridge the semantic gap between heterogeneous modalities, particularly under complex scenes and varying imaging conditions.
Motivated by the strong representation capability of recent foundation models, we introduce a foundation model-guided dual-stream framework for cross-modal remote sensing object detection (Figure 1). The framework adopts different branches for optical and SAR inputs, then integrates multiple feature extraction and fusion modules to progressively encode multi-scale representations via hierarchical downsampling. Through comprehensive cross-branch interactions, the proposed method yields more robust and discriminative cross-modal feature representations.
Within the feature encoding and fusion modules, we introduce the LHF-Mamba block to efficiently encode features from both optical and SAR modalities. LHF-Mamba adopts a dual-branch design in which low-frequency components are modeled by a Mamba-based state-space module to capture global structures and long-range dependencies, while high-frequency components are processed by convolutional operations to preserve local textures and fine-grained details. This frequency-aware design effectively enhances the robustness and discriminability of remote sensing target representations.
Subsequently, a Modality Fusion Module (MFM) is employed to facilitate effective interaction between the dual-modal features, ensuring feature complementarity and semantic alignment across modalities. To further leverage the strong representation capability of foundation models, prior information derived from HyperSIGMA is incorporated into the proposed architecture to provide high-level semantic guidance for cross-modal feature learning. In addition, an Adaptive Prior Gating (APG) module equipped with a forgetting gate is introduced to adaptively regulate the integration of prior information from the HyperSIGMA foundation model with backbone features. By selectively controlling the contribution of prior features, the APG module effectively suppresses irrelevant or noisy information, thereby improving feature representation quality and integration efficiency.
Finally, the refined dual-modal features are fed into the neck and head of YOLO11 [34] to enable precise and end-to-end multimodal object detection.
Specifically, given input optical and SAR images $I_{\mathrm{opt}}, I_{\mathrm{sar}} \in \mathbb{R}^{H \times W \times 3}$ (each SAR image in the M4-SAR dataset is stored as three identical RGB channels to maintain compatibility with standard input formats, although it is inherently single-polarized), where $H$ and $W$ respectively denote the height and width of the input images, the Stem module first performs dimensional adjustment to generate the initial feature maps $X_{\mathrm{opt}}^{0}, X_{\mathrm{sar}}^{0} \in \mathbb{R}^{\frac{H}{8} \times \frac{W}{8} \times C_0}$.
The network comprises three sequential stages, each containing a feature encoding and fusion module. These modules progressively integrate and interact the dual-stream features. Let $X_{\mathrm{opt}}^{i}$ and $X_{\mathrm{sar}}^{i}$ denote the output features of the $i$-th stage ($i = 1, 2, 3$). For each stage where $i \geq 2$, the input features from the previous stage ($X_{\mathrm{opt}}^{i-1}$ and $X_{\mathrm{sar}}^{i-1}$) are first downsampled and then fed into the subsequent encoding and fusion module.
At the i-th stage, the spatial resolution of the feature maps for both modalities follows the standard downsampling scheme used in CNN-based architectures:
$$S_{\mathrm{opt}}^{i} = \frac{H}{2^{2+i}} \times \frac{W}{2^{2+i}} \times C_i,$$
$$S_{\mathrm{sar}}^{i} = \frac{H}{2^{2+i}} \times \frac{W}{2^{2+i}} \times C_i,$$
where $C_i$ represents the number of feature channels in the $i$-th stage and $C_i = 2^{6+i}$. This progression of channel numbers and feature map sizes follows the standard design for multi-scale feature extraction in CNN-based detection architectures. Increasing the number of channels at deeper stages allows the network to capture higher-level semantic information while preserving sufficient capacity for discriminative feature representation. At the same time, maintaining a moderate channel width at shallow stages ensures that fine-grained details and spatial information are preserved, which is crucial for detecting small or subtle objects in complex remote sensing scenes.
By combining this stage-wise channel allocation with our dual-stream design, LHF-Mamba blocks, and HyperSIGMA-guided fusion, the network achieves a balanced representation that integrates both local details and global context, leading to improved detection performance across diverse object categories.
Since each stage follows an identical processing pipeline except for the input feature resolution, the stage index $i$ is omitted when there is no ambiguity. In the following, $X_{\mathrm{opt}}$ and $X_{\mathrm{sar}}$ are used to represent the optical and SAR feature maps within a generic stage, respectively.
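The stage-wise geometry above can be sketched numerically; the helper below is purely illustrative (the function name is ours, not from a released implementation):

```python
# Sketch of the stage-wise feature-map geometry described above.

def stage_shape(H, W, i):
    """Spatial size and channel width of stage i (i = 1, 2, 3).

    Resolution follows H / 2**(2 + i) x W / 2**(2 + i), and the
    channel width follows C_i = 2**(6 + i), as in the equations above.
    """
    s = 2 ** (2 + i)
    return H // s, W // s, 2 ** (6 + i)

# For a 512 x 512 input (the M4-SAR crop size):
shapes = [stage_shape(512, 512, i) for i in (1, 2, 3)]
# Stage 1: 64 x 64 x 128, Stage 2: 32 x 32 x 256, Stage 3: 16 x 16 x 512
```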

3.2. Low- and High-Frequency Mamba Block

In remote sensing image analysis, spatial information can be broadly categorized by frequency: high-frequency components capture rapid intensity changes such as edges, textures, and object boundaries, while low-frequency components correspond to smooth regions and large-scale structures. Conventional CNNs process spatial features in a unified manner, implicitly mixing different frequency components without explicit differentiation. This undifferentiated treatment limits the network’s ability to separately model fine-grained details and global semantic structures, a limitation that becomes particularly critical in multimodal tasks such as SAR–optical fusion.
Optical and SAR images exhibit fundamentally different spatial characteristics due to their imaging mechanisms. Optical imagery emphasizes coherent low-frequency structures related to object shape and layout, whereas SAR imagery is rich in high-frequency patterns caused by scattering effects, surface roughness, and speckle. Effective fusion of these modalities requires a network architecture capable of handling such complementary frequency characteristics in a structured and adaptive way.
To address this challenge, we propose the LHF-Mamba block, a frequency-aware feature extraction module that explicitly separates and processes low- and high-frequency components through two parallel pathways. As illustrated in Figure 2, given an input feature map $X \in \{X_{\mathrm{opt}}, X_{\mathrm{sar}}\}$, these two parallel branches are designed to process global structural information and local detailed patterns, respectively.
Specifically, to emphasize global semantic consistency, suppress high-frequency noise, and preserve the semantic coherence characteristic of optical imagery, the upper branch first applies an average pooling operation to the input feature map. This serves as an explicit low-pass filtering process that suppresses rapid spatial variations and highlights smooth structures. The pooled features are then fed into a 2D State Space Model (2D-SSM) to capture long-range spatial dependencies and global state evolution, enabling the modeling of global contextual information. A layer normalization operation is further employed to stabilize the state dynamics. Finally, the features are upsampled to restore the original spatial resolution. The resulting representation is formulated as follows:
$$R_{\mathrm{low}} = \mathrm{Up}(\mathrm{LN}(\mathrm{SSM}_{2D}(\mathrm{AvgPool}(X)))),$$
where $\mathrm{SSM}_{2D}(\cdot)$ denotes the two-dimensional state space operator and $\mathrm{Up}(\cdot)$ represents bilinear upsampling.
In parallel, the lower high-frequency branch employs multi-scale depthwise convolutions to focus on localized spatial variations. This design preserves and enhances fine details such as edges, scattering centers, and textures, which are essential for representing structural information in SAR data. Concretely, a 3 × 3 depthwise convolution is first applied to capture fine-scale variations, followed by a 9 × 9 depthwise convolution to incorporate broader local context. A subsequent 1 × 1 convolution performs channel-wise projection, and a ReLU activation is adopted to generate a non-negative modulation response. This process can be expressed as follows:
$$R_{\mathrm{high}} = \mathrm{ReLU}\big(\mathrm{Conv}_{1\times1}(\mathrm{DWConv}_{9\times9}(\mathrm{DWConv}_{3\times3}(X)))\big).$$
To integrate the complementary information captured by the two branches, the LHF-Mamba block adopts an element-wise multiplicative interaction in which the high-frequency response adaptively modulates the global state representation:
$$R_{\mathrm{int}} = R_{\mathrm{low}} \odot R_{\mathrm{high}},$$
where ⊙ denotes element-wise multiplication. This design allows local details to selectively enhance or suppress global features without introducing semantic interference.
Subsequently, a 1 × 1 convolution is applied to facilitate cross-channel interaction, and a residual connection is employed to preserve the original information flow. The final output of the LHF-Mamba block is given by
$$F = \mathrm{Conv}_{1\times1}(R_{\mathrm{int}}) + X.$$
By explicitly disentangling and recombining low- and high-frequency components before applying tailored operators to each, the LHF-Mamba block provides a more structured and effective approach to multimodal representation learning. This design enhances the network’s ability to integrate complementary information from SAR and optical imagery, leading to improved robustness and accuracy in object detection tasks.
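As a rough illustration of this data flow, the NumPy sketch below wires the two branches together. The 2D-SSM, depthwise convolutions, and 1×1 convolutions are deliberately stubbed (identity and ReLU placeholders), so it reproduces only the structure of the equations above, not learned behavior; even spatial dimensions are assumed.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # normalize over the channel axis
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def ssm2d_stub(x):
    # placeholder for the learned 2D selective state-space scan
    return x

def lhf_mamba(x):
    """Structural sketch of the LHF-Mamba block for x of shape (H, W, C)."""
    H, W, C = x.shape
    # low-frequency branch: AvgPool -> 2D-SSM -> LN -> Upsample
    pooled = x.reshape(H // 2, 2, W // 2, 2, C).mean(axis=(1, 3))
    r_low = layer_norm(ssm2d_stub(pooled))
    r_low = r_low.repeat(2, axis=0).repeat(2, axis=1)   # nearest upsample
    # high-frequency branch: depthwise/1x1 convs stubbed, ReLU kept
    r_high = np.maximum(x, 0.0)
    r_int = r_low * r_high           # element-wise multiplicative modulation
    return r_int + x                 # 1x1 conv stubbed as identity; residual
```

For a constant input the normalized low-frequency response vanishes, so the residual path dominates; with learned operators the two branches instead trade off global structure against local detail.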

3.3. Modality Fusion Module

To effectively integrate the multi-scale frequency features extracted by the LHF-Mamba block, we propose the Modality Fusion Module (MFM). The MFM is designed to mitigate the inherent domain gap and feature misalignment between optical and SAR sensors by employing a sequential strategy of intra-modal re-weighting followed by cross-modal gated selection. Initially, the module focuses on enhancing the discriminative power of individual modality features. Given that optical and SAR data contain varying levels of informative content across spatial and channel dimensions, we implement a self-adaptive re-weighting mechanism. Independent convolutional layers are utilized to project the input features into a latent representation space in which nonlinear transformations identify salient regions. The generated attention scores are then normalized through a Softmax function to produce competitive weight distribution maps $W_{\mathrm{opt}}$ and $W_{\mathrm{sar}}$ that reflect the varying importance of features. This process amplifies significant structural landmarks while suppressing modality-specific noise. The re-weighted features $\tilde{F}_{\mathrm{opt}}$ and $\tilde{F}_{\mathrm{sar}}$ are formulated as follows:
$$W_{\mathrm{opt}} = \mathrm{Softmax}(\mathrm{Conv}_{1\times1}(F_{\mathrm{opt}})),$$
$$W_{\mathrm{sar}} = \mathrm{Softmax}(\mathrm{Conv}_{1\times1}(F_{\mathrm{sar}})),$$
$$\tilde{F}_{\mathrm{opt}} = W_{\mathrm{opt}} \cdot F_{\mathrm{opt}}, \quad \tilde{F}_{\mathrm{sar}} = W_{\mathrm{sar}} \cdot F_{\mathrm{sar}},$$
where $F_{*}$ is the output of the LHF-Mamba block from each modality and $\mathrm{Softmax}(\cdot)$ denotes the normalization function that ensures a competitive allocation of attention resources.
Subsequently, the weighted features of both modalities are integrated to achieve a preliminary information fusion $F_{\mathrm{pre}} = \tilde{F}_{\mathrm{opt}} + \tilde{F}_{\mathrm{sar}}$. The fused feature is then processed by the Feature Fusion Selection Module (FFSM), which sequentially applies a convolutional-ReLU layer and a convolutional-Sigmoid layer to learn complex fused feature representations and generate a dynamic gating mask $F_M$ with values ranging between 0 and 1. This mask acts as a semantic filter to adaptively select the most valuable fused information for the current modality. The entire process of initial self-enhancement followed by fusion modulation effectively promotes the alignment and synthesis of multimodal features in a more discriminative subspace. The process is defined as
$$F_M = \sigma(\mathrm{Conv}_{1\times1}(\mathrm{ReLU}(\mathrm{Conv}_{1\times1}(F_{\mathrm{pre}})))),$$
$$M_{\mathrm{opt}} = F_M \odot F_{\mathrm{opt}},$$
$$M_{\mathrm{sar}} = F_M \odot F_{\mathrm{sar}},$$
where $\sigma(\cdot)$ is the Sigmoid activation. Through a sequential strategy of intra-modal feature enhancement followed by cross-modal fusion and gating modulation, the overall process effectively promotes alignment and complementary integration of the multimodal features. This approach not only preserves the unique characteristics of each modality but also fully leverages complementary information between sensors, providing robust multimodal representations for downstream tasks.
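A minimal NumPy sketch of this re-weight/fuse/gate sequence is given below; the 1×1 convolutions are stubbed as identities, and applying the Softmax over the channel axis is our reading of the text, not a confirmed implementation detail.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mfm(f_opt, f_sar):
    """Structural sketch of the MFM for features of shape (H, W, C)."""
    w_opt = softmax(f_opt)              # intra-modal re-weighting (convs stubbed)
    w_sar = softmax(f_sar)
    f_pre = w_opt * f_opt + w_sar * f_sar          # preliminary fusion
    # FFSM: Conv-ReLU -> Conv-Sigmoid, convs stubbed as identities
    mask = sigmoid(np.maximum(f_pre, 0.0))         # gating mask in (0, 1)
    return mask * f_opt, mask * f_sar
```

Because the mask is bounded in (0, 1), each gated output is element-wise no larger in magnitude than the corresponding input feature.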

3.4. Adaptive Prior Gating Module

To fully harness the transferable and generalized representations of foundation models for effective multimodal data integration, we propose the Adaptive Prior Gating (APG) module. The APG is specifically designed to leverage the rich semantic priors from HyperSIGMA through a dual-path adaptive gated interaction mechanism, facilitating the seamless alignment of raw modality features with high-level structural knowledge.
The input to the APG module consists of two distinct components: (1) the modality features extracted from raw data, denoted as $M \in \{M_{\mathrm{opt}}, M_{\mathrm{sar}}\}$, and (2) the structured prior knowledge generated by the HyperSIGMA module.
Among these, HyperSIGMA takes the raw optical image $I_{\mathrm{opt}} \in \mathbb{R}^{H \times W \times 3}$ and SAR image $I_{\mathrm{sar}} \in \mathbb{R}^{H \times W \times 3}$ as input, then extracts hierarchical multi-scale feature representations at three scales, $F^{1/8}$, $F^{1/16}$, and $F^{1/32}$, where the superscript indicates the downsampling factor (stride) relative to the original input size. These scales are intentionally aligned with the multi-level feature pyramid ($S_{\mathrm{opt}}^{i}$ and $S_{\mathrm{sar}}^{i}$) of the detection backbone in order to ensure structural compatibility. At a given stage, we uniformly use $P$ to represent the output features of HyperSIGMA at that layer.
As a foundation model pretrained on large-scale remote sensing datasets, HyperSIGMA generates structured prior knowledge that captures universal semantic concepts and cross-modal relationships, thereby providing high-level guidance for the downstream detection task. This prior representation P encodes transferable semantic knowledge that complements the modality-specific features extracted by the backbone.
To ensure dimensional compatibility, the APG module employs convolutional layers to align the channel dimensions between the HyperSIGMA prior features and the modality-specific features produced by the MFM module. After channel alignment, gated modulation is performed to adaptively regulate the contribution of prior knowledge to each scale.
To establish a unified feature space for interaction, a nonlinear projection is first applied to both inputs. The projected modality features M and prior features P are then decomposed into Query (Q), Key (K), and Value (V) vectors to support the cross-attention mechanism:
$$Q_m = M W_q^m, \quad K_m = M W_k^m, \quad V_m = M W_v^m,$$
$$Q_p = P W_q^p, \quad K_p = P W_k^p, \quad V_p = P W_v^p,$$
where $W_{*}^{m}, W_{*}^{p} \in \mathbb{R}^{d \times d}$ are learnable parameter matrices dedicated to modality-specific and prior-specific subspaces, respectively.
Referencing the Forgetting Transformer [35], our dual-path cross-attention mechanism incorporates a scalar forget gate $f_t$ to control the retention and attenuation of information. This adaptive gating allows the model to dynamically regulate the influence of prior knowledge based on the current contextual evidence. The module facilitates bidirectional interaction through two distinct branches: the first branch treats prior knowledge as the Query to filter perceptual evidence from modal features that is most consistent with the learned priors, while the second branch utilizes modal features as the Query to perform active semantic retrieval from the prior knowledge base. In both branches, the forget gate mechanism modulates the attention weights to emphasize critical cross-modal correlations while suppressing redundant content. Subsequently, the gated attention weights are applied to generate enhanced output features for each path.
We first compute a scalar forget gate $f_t$ for each timestep $t$:
$$f_t = \sigma(\omega_f^{\top} x_t + b_f) \in (0, 1),$$
where $\omega_f$ and $b_f$ are learnable parameters. To incorporate the forget gate into the attention mechanism, we construct a decay matrix $D \in \mathbb{R}^{L \times L}$, as defined in Equation (16), where $L$ is the sequence length after flattening the spatial dimensions. Each entry $D_{ij}$ represents the cumulative information decay from position $j$ to $i$, computed as the log-sum of forget gates along the path. This ensures that when computing attention from position $i$ to earlier positions $j$, the attention weight is discounted by the accumulated forget gates, modeling how information gradually fades over the sequence.
$$D_{ij} = \begin{cases} \sum_{l=j+1}^{i} \log f_l, & i \geq j, \\ -\infty, & i < j. \end{cases}$$
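The decay matrix can be built directly from the per-step forget gates via cumulative log-sums; the sketch below follows the piecewise definition above (function and variable names are ours):

```python
import numpy as np

def decay_matrix(f):
    """Log-space decay matrix D from forget gates f (shape (L,), values in (0, 1)).

    D[i, j] = sum_{l=j+1}^{i} log f_l for i >= j, and -inf for future
    positions (i < j), matching the piecewise definition above.
    """
    L = len(f)
    # csum[k] = sum of log f_l for the first k gates
    csum = np.concatenate([[0.0], np.cumsum(np.log(f))])
    D = np.full((L, L), -np.inf)
    for i in range(L):
        for j in range(i + 1):
            D[i, j] = csum[i + 1] - csum[j + 1]
    return D
```

The diagonal is zero (no decay from a position to itself), and entries further below the diagonal accumulate more negative log-decay, so older positions receive geometrically discounted attention.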
The enhanced output features for the modality path ($O_m$) and prior path ($O_p$) are then computed as
$$O_m = \mathrm{Softmax}(Q_p K_m^{\top} + D_m) V_m,$$
$$O_p = \mathrm{Softmax}(Q_m K_p^{\top} + D_p) V_p.$$
Finally, the outputs are integrated through concatenation and dimension reshaping to produce the final representation $O$, which encapsulates the deep synergy between modality features and prior knowledge:
$$O = \mathrm{Reshape}(\mathrm{Concat}(O_m, O_p)),$$
where $\mathrm{Reshape}(\cdot)$ and $\mathrm{Concat}(\cdot)$ denote the reshaping and concatenation operations, respectively. By adaptively modulating the interaction via the log-space decay matrix $D_{*}$, the APG ensures a robust and semantically aligned multimodal representation.
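A minimal sketch of the dual-path interaction follows, with the linear projections stubbed as the raw feature matrices and a zero-decay causal mask standing in for the learned decay matrices $D_m$, $D_p$ (all names here are illustrative, not from a released implementation):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def gated_cross_attention(Q, K, V, D):
    # softmax(Q K^T + D) V; -inf entries in D zero out masked positions
    return softmax(Q @ K.T + D) @ V

rng = np.random.default_rng(0)
L, d = 4, 8
M = rng.normal(size=(L, d))                 # modality features (projections stubbed)
P = rng.normal(size=(L, d))                 # prior features from the foundation model
D = np.triu(np.full((L, L), -np.inf), k=1)  # zero decay (f_t = 1), causal mask only
O_m = gated_cross_attention(P, M, M, D)     # prior queries modality evidence
O_p = gated_cross_attention(M, P, P, D)     # modality queries the prior
O = np.concatenate([O_m, O_p], axis=-1)     # Concat (Reshape stubbed)
```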

3.5. Loss Function

The objective function in this work is formulated based on the YOLO11 [34] architecture, refined to accommodate the complexities of oriented object detection in optical and SAR imagery. The comprehensive loss function $\mathcal{L}$ integrates localization accuracy, classification confidence, and distribution refinement, as defined in Equation (20):
$$\mathcal{L} = \lambda_1 \mathcal{L}_{\mathrm{reg}} + \lambda_2 \mathcal{L}_{\mathrm{cls}} + \lambda_3 \mathcal{L}_{\mathrm{dfl}},$$
where $\lambda_1$, $\lambda_2$, and $\lambda_3$ are balancing coefficients. Following YOLO11, we set $\lambda_1 = 7.5$, $\lambda_2 = 0.5$, and $\lambda_3 = 1.5$, values that have proven effective for balancing localization precision and classification accuracy in rotated bounding box detection. The three loss components $\mathcal{L}_{\mathrm{reg}}$ (regression loss), $\mathcal{L}_{\mathrm{cls}}$ (classification loss), and $\mathcal{L}_{\mathrm{dfl}}$ (distribution focal loss) are defined as follows.
First, $\mathcal{L}_{\mathrm{reg}}$ represents the bounding box regression loss. To address the orientation sensitivity of remote sensing targets, we incorporate the oriented bounding box (OBB) regression loss [36]. This term extends the traditional IoU metric by accounting for angular deviation and geometric alignment, ensuring precise localization of skewed objects. Second, $\mathcal{L}_{\mathrm{cls}}$ denotes the classification loss, which employs Binary Cross-Entropy (BCE) to facilitate multi-label classification and maintain robust detection under complex backgrounds. Finally, $\mathcal{L}_{\mathrm{dfl}}$ refers to the Distribution Focal Loss (DFL) [37]. The DFL refines the probability distribution of the bounding box boundaries, enabling the model to better handle the ambiguity of object edges in heterogeneous multimodal data. This joint optimization strategy ensures that the network can simultaneously achieve high semantic consistency and spatial precision.
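As a purely numeric illustration of the weighted objective with the stated coefficients (the per-term loss values below are placeholders, not measurements):

```python
# Weighted combination of the three loss terms with the stated coefficients.
LAMBDA_REG, LAMBDA_CLS, LAMBDA_DFL = 7.5, 0.5, 1.5

def total_loss(l_reg, l_cls, l_dfl):
    return LAMBDA_REG * l_reg + LAMBDA_CLS * l_cls + LAMBDA_DFL * l_dfl

# e.g. total_loss(0.2, 0.4, 0.1) = 7.5*0.2 + 0.5*0.4 + 1.5*0.1 = 1.85
```

The large regression weight makes localization errors dominate the gradient, reflecting the emphasis on precise oriented localization.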

4. Experiments and Results

4.1. Dataset

The M4-SAR dataset [22] is used for the experiments in this paper. M4-SAR is a large-scale benchmark for optical–SAR fusion object detection. This dataset is constructed based on SAR data from the Sentinel-1 satellite and optical data from the Sentinel-2 satellite, containing 112,184 precisely aligned image pairs and 981,862 object instances with arbitrary orientation annotations. All images are ultimately cropped to a size of 512 × 512 pixels, with an average of 8.75 instances per image, exhibiting the dense object distribution characteristic typical of remote sensing scenes.
In this dataset, optical images are provided at two resolutions, 10 m and 60 m, while the SAR data support both VH and VV polarization modes. In the M4-SAR dataset, each SAR image is stored as a three-channel RGB image in which the same grayscale intensity value is replicated across all three channels. This storage format follows common practices in remote sensing datasets to maintain compatibility with standard image loading pipelines and pretrained models that expect three-channel inputs.
To improve data quality and ensure precise alignment with optical data, the SAR images were subjected to noise suppression and geometric correction. A semi-supervised optical-assisted annotation strategy ensures labeling quality; in addition, the dataset encompasses complex scenes including cloud-covered, low-light, and low-resolution conditions, providing comprehensive support for evaluating the robustness of fusion detection algorithms.
The dataset annotates six categories of key ground targets: airport, harbor, bridge, playground, wind turbine, and oil tank. These targets exhibit significant diversity in terms of aspect ratio, angle, area, and imaging characteristics, enabling a comprehensive evaluation of algorithm performance across different land cover types. The distribution of instances by category also reflects the actual occurrence frequency of typical targets in coastal areas, with the bridge and oil tank categories accounting for the largest proportions. Furthermore, the oil tank and harbor categories, as typical high-density man-made targets, demand strong fine-grained feature extraction capability from detection algorithms. Due to the top-view perspective and SAR speckle noise, the wind turbine category exhibits blurred boundaries and concentrated orientation features, necessitating robust cross-modal complementary feature mining and precise localization.

4.2. Experimental Settings

All experiments were conducted on a workstation equipped with four NVIDIA RTX 4090 GPUs. The proposed model was optimized using the AdamW optimizer with an initial learning rate of $2 \times 10^{-4}$ and a weight decay of 0.01. The training process lasted for 250 epochs with a batch size of 32, and all input images were resized to 512 × 512 to ensure a consistent spatial resolution.
For comparison, as in M4-SAR, we adopted several representative oriented object detection methods as baselines, including Rotated FCOS, Rotated ATSS, Oriented R-CNN, Oriented RepPoints, RTMDet, PSC, and LSKNet. During evaluation, we follow the standard COCO evaluation protocol [38] and report commonly used detection metrics, including $\mathrm{AP}_{50}$, $\mathrm{AP}_{75}$, and mAP. In addition, we report the per-class $\mathrm{AP}_{50}$ in order to provide a detailed analysis of the detection performance for each category. For the M4-SAR dataset, the six categories (bridge, harbor, oil tank, playground, airport, and wind turbine) are denoted as BD, HB, OT, PG, AP, and WT, respectively.
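For reference, COCO-style mAP averages AP over the ten IoU thresholds 0.50:0.05:0.95 ($\mathrm{AP}_{50}$ and $\mathrm{AP}_{75}$ are the entries at 0.50 and 0.75). The sketch below shows only this averaging step; the per-threshold AP values are placeholders, not results from the paper.

```python
import numpy as np

def coco_map(ap_per_threshold):
    """Mean of per-threshold AP over IoU thresholds 0.50:0.05:0.95."""
    thresholds = np.linspace(0.50, 0.95, 10)
    assert len(ap_per_threshold) == len(thresholds)
    return float(np.mean(ap_per_threshold))

# Placeholder per-threshold APs, decreasing as the IoU threshold tightens:
aps = [0.9 - 0.05 * k for k in range(10)]
m = coco_map(aps)
```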

4.3. Performance Comparison

We evaluate the performance of the proposed method on the M4-SAR dataset against several state-of-the-art optical–SAR multimodal object detection approaches. The quantitative results are summarized in Table 1. As demonstrated, our method achieves substantial improvements across key metrics, with AP50, AP75, and mAP reaching 88.0%, 74.9%, and 65.6%, respectively. Notably, the proposed framework outperforms the state-of-the-art E2E-OSDet [22] by 4.6% in AP75 and 4.2% in mAP. These significant gains underscore the effectiveness of our foundation model-guided feature injection strategy and the dual-stream fusion architecture, which synergistically facilitate adaptive modal interactions and robust feature alignment in complex environments. Furthermore, while the detection accuracy for certain specific categories exhibits minor fluctuations, our method achieves exceptional performance in challenging classes such as harbors, playgrounds, and airports, leading to superior overall precision and cross-modal robustness.
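The headline margins over E2E-OSDet can be verified directly from the Table 1 values:

```python
# Values taken from Table 1 (ours vs. E2E-OSDet [22]).
ours = {"AP50": 88.0, "AP75": 74.9, "mAP": 65.6}
e2e_osdet = {"AP50": 85.7, "AP75": 70.3, "mAP": 61.4}

gains = {k: round(ours[k] - e2e_osdet[k], 1) for k in ours}
print(gains)  # {'AP50': 2.3, 'AP75': 4.6, 'mAP': 4.2}
```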
As shown in Table 1, our model contains 31.4 M parameters with an average inference time of 28.4 ms per image. Although integrating HyperSIGMA introduces some computational overhead, the optimized implementation ensures practical inference speeds for high-performance fusion models. This acceptable tradeoff allows us to leverage large pretrained models for enhanced representational capacity and cross-modal generalization.
To further validate the localization efficacy, Figure 3 presents visualization results for five representative methods alongside our proposed approach. The visual evidence confirms that our method yields the most precise localization across diverse environmental conditions. In scenarios featuring harbors, playgrounds, and airports, our model significantly mitigates missed detections, consistent with the quantitative findings in Table 1. Particularly under cloudy conditions, although optical sensors are severely degraded by cloud cover, our method successfully identifies targets such as wind turbines by leveraging SAR semantic cues. For densely distributed targets such as oil tanks, while minor omissions occur due to severe spatial occlusion, our approach consistently identifies objects that remain undetected by existing methods, further demonstrating its superior discriminative capability.

4.4. Generalization Experiments

To evaluate the generalization capacity and data efficiency of the proposed method, we conducted experiments on the M4-SAR dataset using varying training proportions of 25%, 50%, and 70%. The performance was subsequently evaluated on the complete test set, with the results summarized in Table 2. As observed, our method consistently achieves the highest AP50 and mAP across all training scales, demonstrating its superior adaptability and robustness under constrained data conditions.
Furthermore, all evaluated methods exhibit a positive correlation between training data volume and detection accuracy. Notably, the performance margin of our algorithm remains stable across all proportions, maintaining a consistent lead of 1.3% to 1.4% in both metrics over the state-of-the-art E2E-OSDet [22]. This persistent superiority across different data regimes validates the ability of our foundation model-guided approach to effectively capture intrinsic cross-modal features, thereby reducing the model’s dependence on large-scale labeled datasets while ensuring stable generalization in diverse training scenarios.
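The subset protocol can be reproduced with a seeded random split. The helper below is illustrative only; the paper does not specify how the 25%/50%/70% subsets were drawn, and the dataset size used here is a placeholder:

```python
import random

def training_subset(num_images, fraction, seed=0):
    """Draw a reproducible random subset of training-image indices."""
    rng = random.Random(seed)  # fixed seed so every method sees the same subset
    k = round(num_images * fraction)
    return sorted(rng.sample(range(num_images), k))

full = 10_000  # hypothetical training-set size; the real M4-SAR split differs
sizes = [len(training_subset(full, f)) for f in (0.25, 0.50, 0.70)]
print(sizes)  # [2500, 5000, 7000]
```

Fixing the seed matters for fairness: all compared methods in Table 2 must be trained on identical subsets, otherwise the margins would partly reflect sampling noise.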

4.5. Ablation Experiments

To evaluate the contribution of each proposed component, we conducted a series of ablation studies on the M4-SAR dataset. The quantitative results are summarized in Table 3 and the visualization results are shown in Figure 4. Specifically, we investigated the impact of three core modules: the foundation model-based prior extraction (HyperSIGMA), the APG module, and the MFM. In the ablation setup, “w/o APG” implies that the prior knowledge and modality features are directly concatenated without the dual-path gated interaction, while “w/o MFM” denotes the absence of the adaptive re-weighting and selection mechanism between optical and SAR features.
As indicated in Table 3, integrating HyperSIGMA, APG, and MFM consistently enhances model performance across all metrics (AP50, AP75, and mAP). Notably, even without the APG interaction, the mere inclusion of prior knowledge extracted via HyperSIGMA improves the mAP by 9.5% over the baseline and by 2.9% when MFM is already present. This validates the ability of the semantic priors provided by the foundation model to offer a more robust representation than raw data alone. As illustrated in the fourth and fifth columns of Figure 4, HyperSIGMA significantly increases the recall for challenging targets such as oil tanks and bridges while also refining the localization precision for large-scale objects such as airports.
Building upon the prior extraction, introducing the APG module for bidirectional feature modulation yields an additional mAP gain of 2.1% without MFM and 5.9% with MFM. These results demonstrate that the gated interaction mechanism successfully bridges the gap between high-level priors and low-level perceptual features. A comparison between the corresponding columns in Figure 4 further reveals that the APG module effectively mitigates missed detections and false positives by suppressing redundant cross-modal noise.
Furthermore, the MFM module proves critical for cross-modal alignment. MFM-based selective fusion improves mAP regardless of whether prior information is introduced: by 7.0% over the baseline and by 4.2% on top of the HyperSIGMA + APG configuration. The visual evidence in Figure 4 confirms that MFM enhances the detection rate for categories with high intra-class variance, such as wind turbines and playgrounds. These findings collectively demonstrate that enhancing cross-modal interaction and fusion promotes a better information balance, significantly contributing to the overall detection robustness.
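To make the gating idea concrete, the sketch below implements a generic sigmoid-gated fusion of optical and SAR feature vectors. It is a simplified stand-in, not the actual APG or MFM design: the weight matrix W, bias b, and the per-channel convex combination are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(f_opt, f_sar, W, b):
    """Generic sigmoid-gated fusion: a learned gate decides, per channel,
    how much to trust the optical stream versus the SAR stream."""
    gate = sigmoid(np.concatenate([f_opt, f_sar]) @ W + b)
    return gate * f_opt + (1.0 - gate) * f_sar

d = 4
f_opt, f_sar = np.ones(d), np.zeros(d)
# With zero weights the gate is 0.5 everywhere, i.e. plain averaging;
# training would push the gate toward the more reliable modality per scene
# (e.g. toward SAR under cloud cover, toward optical in clear conditions).
fused = gated_fusion(f_opt, f_sar, W=np.zeros((2 * d, d)), b=np.zeros(d))
print(fused)  # [0.5 0.5 0.5 0.5]
```

The ablation comparison "direct concatenation vs. gated interaction" amounts to replacing the learned gate with a fixed identity mapping on the stacked features, which is why removing APG degrades mAP in Table 3.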

5. Discussion

The experimental results on the M4-SAR dataset demonstrate that our proposed framework achieves state-of-the-art performance, particularly manifesting substantial gains in mAP, AP50, and AP75. These advancements are primarily attributed to the foundation model-guided feature injection strategy and the dual-stream architecture designed for synergistic modal interaction. Specifically, the integration of high-level semantic priors from HyperSIGMA empowers the model to maintain robust recognition capabilities under adverse conditions such as cloud-obscured optical imagery or heterogeneous SAR resolutions and polarizations. The APG module further refines this integration by adaptively gating the prior information with modality-specific features, significantly mitigating missed detections and enhancing localization precision for critical categories such as harbors, playgrounds, and airports. Complementing this, the MFM module facilitates the alignment and selective fusion of optical and SAR features to ensure a balanced cross-modal representation that underpins the overall robustness of the detector.
Compared with existing methodologies, our approach not only elevates global detection metrics but also exhibits superior adaptability across diverse imaging scenarios, including varying optical resolutions (10 M/60 M), SAR polarizations (VH/VV), and disparate illumination attributes (bright, dark, and cloudy). Ablation studies confirm the consistent performance contributions of HyperSIGMA, APG, and MFM. Notably, even in the absence of complex gating, the mere inclusion of foundation model priors yields substantial improvements, underscoring the indispensable value of external semantic knowledge in remote sensing tasks. Visual comparisons further validate that our method excels in reducing omissions, a critical requirement for real-world applications characterized by high environmental variability. These findings suggest that the convergence of foundation model guidance and adaptive multimodal fusion represents a potent direction for advancing object detection in complex remote sensing scenes.

6. Conclusions

This paper presents a foundation model-guided optical–SAR fusion framework designed to leverage the complementary strengths of multimodal data, addressing the inherent limitations of single-modality approaches. Driven by the requirement for robust detection in challenging environments, we introduce a feature injection strategy anchored in large-scale foundation models. This strategy significantly enhances semantic comprehension and generalization, leading to superior detection performance across varying data scales. Furthermore, our dual-stream architecture incorporating the LHF-Mamba block and the Modality Fusion Module effectively bridges the modality gap by achieving deep and frequency-aware feature fusion. This design transcends the shallow cross-modal interactions prevalent in conventional methods. Extensive experiments demonstrate that our approach significantly advances the state-of-the-art in complex scenario object detection. This work highlights the transformative potential of integrating pretrained foundation models with adaptive fusion mechanisms, providing a robust paradigm for future research in multimodal remote sensing analysis.

Author Contributions

Conceptualization, Qianyin Jiang, Jianshang Liao and Junkang Zhang; methodology, Qianyin Jiang and Junkang Zhang; validation, Qiuyu Lin; writing—original draft preparation, Qianyin Jiang and Junkang Zhang; writing—review and editing, Junkang Zhang; visualization, Junkang Zhang; supervision, Junkang Zhang; funding acquisition, Jianshang Liao and Qianyin Jiang. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Natural Science Foundation of Guangdong Province (Grant number 2024A1515510030) and the Guangzhou Municipal Education Bureau’s University Scientific Research Project (Grant number 2024312014).

Data Availability Statement

The original data presented in the study are openly available on GitHub at https://github.com/wchao0601/M4-SAR.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
2. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. In Proceedings of the 3rd International Conference on Learning Representations (ICLR 2015), San Diego, CA, USA, 7–9 May 2015; pp. 1–14.
3. Yang, J.; Liang, Z.; Li, J.; Gan, Y.; Zhong, J. A Novel Copy–Move Forgery Detection Algorithm via Gradient-Hash Matching and Simplified Cluster-Based Filtering. Int. J. Pattern Recognit. Artif. Intell. 2023, 37, 2350011.
4. Andrew, O.; Apan, A.; Paudyal, D.R.; Perera, K. Convolutional Neural Network-Based Deep Learning Approach for Automatic Flood Mapping Using NovaSAR-1 and Sentinel-1 Data. ISPRS Int. J. Geo. Inf. 2023, 12, 194.
5. Guo, P.; Celik, T.; Liu, N.; Li, H.C. Piecewise Self-Adaption Weighted attention for the detection of concentrated distributions of ships in SAR images. Remote Sens. Lett. 2025, 16, 200–210.
6. Wan, S.; Yeh, M.L.; Ma, H.L. An Innovative Intelligent System with Integrated CNN and SVM: Considering Various Crops through Hyperspectral Image Data. ISPRS Int. J. Geo. Inf. 2021, 10, 242.
7. Yu, L.; Wu, H.; Liu, L.; Hu, H.; Deng, Q. TWC-AWT-Net: A transformer-based method for detecting ships in noisy SAR images. Remote Sens. Lett. 2023, 14, 512–521.
8. Aleissaee, A.A.; Kumar, A.; Anwer, R.M.; Khan, S.; Cholakkal, H.; Xia, G.S.; Khan, F.S. Transformers in Remote Sensing: A Survey. Remote Sens. 2023, 15, 1860.
9. Zhang, J.; Zhao, H.; Li, J. TRS: Transformers for Remote Sensing Scene Classification. Remote Sens. 2021, 13, 4143.
10. Wang, J.; Li, H.; Li, Y.; Qin, Z. A Lightweight CNN-Transformer Implemented via Structural Re-Parameterization and Hybrid Attention for Remote Sensing Image Super-Resolution. ISPRS Int. J. Geo. Inf. 2025, 14, 8.
11. Ding, K.; Wang, Y.; Wang, C.; Ma, J. A New Subject-Sensitive Hashing Algorithm Based on Multi-PatchDrop and Swin-Unet for the Integrity Authentication of HRRS Image. ISPRS Int. J. Geo. Inf. 2024, 13, 336.
12. Jiang, M.; Shao, H. A CNN-Transformer Combined Remote Sensing Imagery Spatiotemporal Fusion Model. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 13995–14009.
13. Gu, A.; Dao, T. Mamba: Linear-Time Sequence Modeling with Selective State Spaces. arXiv 2024, arXiv:2312.00752.
14. Liu, Y.; Tian, Y.; Zhao, Y.; Yu, H.; Xie, L.; Wang, Y.; Ye, Q.; Jiao, J.; Liu, Y. VMamba: Visual State Space Model. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 9–15 December 2024; Curran Associates, Inc.: Red Hook, NY, USA, 2024; Volume 37, pp. 103031–103063.
15. Liao, J.; Wang, L. SpecSpatMamba: An efficient hyperspectral image classification method integrating spectral-spatial dual-path and state space model. Egypt. J. Remote Sens. Space Sci. 2025, 28, 628–644.
16. Chen, K.; Chen, B.; Liu, C.; Li, W.; Zou, Z.; Shi, Z. RSMamba: Remote Sensing Image Classification with State Space Model. IEEE Geosci. Remote Sens. Lett. 2024, 21, 8002605.
17. Wang, Q.; Ye, H.; Liang, D.; Huang, S.J. Diffusion-Noise-Based Augmentation for Long-Tailed Remote Sensing Image Classification. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5626114.
18. Hao, X.; Liu, L.; Yang, R.; Yin, L.; Zhang, L.; Li, X. A Review of Data Augmentation Methods of Remote Sensing Image Target Recognition. Remote Sens. 2023, 15, 827.
19. Wang, D.; Hu, M.; Jin, Y.; Miao, Y.; Yang, J.; Xu, Y.; Qin, X.; Ma, J.; Sun, L.; Li, C.; et al. HyperSIGMA: Hyperspectral Intelligence Comprehension Foundation Model. IEEE Trans. Pattern Anal. Mach. Intell. 2025, 47, 6427–6444.
20. Zhang, Z.; Zhao, T.; Guo, Y.; Yin, J. RS5M and GeoRSCLIP: A Large-Scale Vision-Language Dataset and a Large Vision-Language Model for Remote Sensing. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5642123.
21. Wu, K.; Zhang, Y.; Ru, L.; Dang, B.; Lao, J.; Yu, L.; Luo, J.; Zhu, Z.; Sun, Y.; Zhang, J.; et al. A semantic-enhanced multi-modal remote sensing foundation model for Earth observation. Nat. Mach. Intell. 2025, 7, 1235–1249.
22. Wang, C.; Lu, W.; Li, X.; Yang, J.; Luo, L. M4-SAR: A Multi-Resolution, Multi-Polarization, Multi-Scene, Multi-Source Dataset and Benchmark for Optical-SAR Fusion Object Detection. arXiv 2025, arXiv:2505.10931.
23. Li, K.; Cao, X.; Meng, D. A New Learning Paradigm for Foundation Model-Based Remote-Sensing Change Detection. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5610112.
24. Ding, L.; Zhu, K.; Peng, D.; Tang, H.; Yang, K.; Bruzzone, L. Adapting Segment Anything Model for Change Detection in VHR Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5611711.
25. Wang, G.; Ma, Y.; Zhou, F.; Wang, Y.; Yan, Y.; Geng, H. RFHP-CD: A Prompt-Driven Fine-Tuning Framework of Remote Sensing Foundation Model for Building and Cropland Change Detection. IEEE Access 2025, 13, 121601–121615.
26. Wang, K.; Li, Z.; Guo, J.; Wang, Y. Incremental Classification of Cross-Scene Hyperspectral Images Based on Dual Constraints and Knowledge Transfer. IEEE Geosci. Remote Sens. Lett. 2025, 22, 5505005.
27. Cao, Y.; Bin, J.; Hamari, J.; Blasch, E.; Liu, Z. Multimodal Object Detection by Channel Switching and Spatial Attention. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Vancouver, BC, Canada, 18–22 June 2023; pp. 403–411.
28. Zhang, J.; Cao, M.; Xie, W.; Lei, J.; Li, D.; Huang, W.; Li, Y.; Yang, X. E2E-MFD: Towards end-to-end synchronous multimodal fusion detection. In Proceedings of the 38th International Conference on Neural Information Processing Systems (NIPS’24), Red Hook, NY, USA, 9–15 December 2024.
29. Liu, B.; Ren, B.; Hou, B.; Gu, Y. Multi-Source Fusion Network for Remote Sensing Image Segmentation with Hierarchical Transformer. In Proceedings of the IGARSS 2023—2023 IEEE International Geoscience and Remote Sensing Symposium, Pasadena, CA, USA, 16–21 July 2023; pp. 6318–6321.
30. Mao, R.; Li, H.; Ren, G.; Yin, Z. Cloud Removal Based on SAR-Optical Remote Sensing Data Fusion via a Two-Flow Network. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022, 15, 7677–7686.
31. Song, K.; Xue, X.; Wen, H.; Ji, Y.; Yan, Y.; Meng, Q. Misaligned Visible-Thermal Object Detection: A Drone-Based Benchmark and Baseline. IEEE Trans. Intell. Veh. 2024, 9, 7449–7460.
32. Wei, T.; Chen, H.; Wang, J.; Liu, W. MDFNet: Multimodal Feature Decomposition and Fusion Network for Multimodal Remote Sensing Image Semantic Segmentation. In Proceedings of the 2024 IEEE International Conference on Signal, Information and Data Processing (ICSIDP), Zhuhai, China, 20–22 December 2024; pp. 1–5.
33. Fang, Q.; Wang, Z. Cross-Modality Attentive Feature Fusion for Object Detection in Multispectral Remote Sensing Imagery. Pattern Recognit. 2022, 130, 108786.
34. Khanam, R.; Hussain, M. YOLOv11: An Overview of the Key Architectural Enhancements. arXiv 2024, arXiv:2410.17725.
35. Lin, Z.; Nikishin, E.; He, X.; Courville, A. Forgetting Transformer: Softmax Attention with a Forget Gate. In Proceedings of the Thirteenth International Conference on Learning Representations, Singapore, 24–28 April 2025; Volume 2025, pp. 69704–69738.
36. Chen, K.; Wang, J.; Pang, J.; Cao, Y.; Xiong, Y.; Li, X.; Sun, S.; Feng, W.; Liu, Z.; Xu, J.; et al. MMDetection: Open MMLab Detection Toolbox and Benchmark. arXiv 2019, arXiv:1906.07155.
37. Li, X.; Wang, W.; Wu, L.; Chen, S.; Hu, X.; Li, J.; Tang, J.; Yang, J. Generalized focal loss: Learning qualified and distributed bounding boxes for dense object detection. In Proceedings of the 34th International Conference on Neural Information Processing Systems (NIPS’20), Red Hook, NY, USA, 6–12 December 2020; pp. 21002–21012.
38. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common Objects in Context. In Proceedings of the European Conference on Computer Vision (ECCV), Zürich, Switzerland, 6–12 September 2014; pp. 740–755.
39. He, X.; Tang, C.; Zou, X.; Zhang, W. Multispectral Object Detection via Cross-Modal Conflict-Aware Learning. In Proceedings of the 31st ACM International Conference on Multimedia, New York, NY, USA, 29 October–3 November 2023; pp. 1465–1474.
40. Shen, J.; Chen, Y.; Liu, Y.; Zuo, X.; Fan, H.; Yang, W. ICAFusion: Iterative cross-attention guided feature fusion for multispectral object detection. Pattern Recognit. 2024, 145, 109913.
41. Zeng, Y.; Liang, T.; Jin, Y.; Li, Y. MMI-Det: Exploring Multi-Modal Integration for Visible and Infrared Object Detection. IEEE Trans. Circuits Syst. Video Technol. 2024, 34, 11198–11213.
Figure 1. Framework of the proposed method.
Figure 2. Structure of LHF-Mamba block.
Figure 3. Visualization of the detection results on M4-SAR. GT denotes the ground truth, while the quadrilaterals represent the detected object locations; their colors indicate the recognized categories (dark blue for bridges, green for harbors, red for oil tanks, light blue for playgrounds, magenta for airports, and yellow for wind turbines). In the first column, 10 M/60 M denotes the resolution of the optical image, whereas VH/VV represents the polarization of the SAR image.
Figure 4. Visualization of the detection results in the ablation experiments. The abbreviations have the same meanings as in Figure 3.
Table 1. Results on the M4-SAR dataset. The metric #P represents the number of trainable parameters of the model. The best results for each metric are shown in bold red.
| Method | #P (M) | Inf.T (ms) | BD (%) | HB (%) | OT (%) | PG (%) | AP (%) | WT (%) | AP50 (%) | AP75 (%) | mAP (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| CFT [33] | 53.8 | 40.6 | 75.8 | 92.5 | 61.3 | 91.6 | 90.3 | 96.3 | 84.6 | 68.9 | 59.9 |
| CLANet [39] | 48.2 | 37.1 | 74.8 | 92.2 | 60.7 | 91.3 | 91.6 | 97.2 | 84.6 | 68.5 | 59.6 |
| CSSA [27] | 13.5 | 29.1 | 73.3 | 91.7 | 59.3 | 88.9 | 91.6 | 95.8 | 83.4 | 66.4 | 58.0 |
| CMADet [31] | 41.5 | 12.3 | 70.9 | 90.7 | 52.0 | 86.4 | 91.7 | 97.1 | 81.5 | 63.5 | 55.7 |
| ICAFusion [40] | 29.0 | 23.6 | 74.7 | 91.9 | 60.9 | 91.0 | 91.8 | 96.7 | 84.5 | 67.3 | 58.8 |
| MMIDet [41] | 53.8 | 41.9 | 74.9 | 92.6 | 61.1 | 91.7 | 91.4 | 97.0 | 84.8 | 68.6 | 59.8 |
| E2E-MFD [28] | 31.3 | 37.1 | 76.1 | 91.9 | 61.1 | 91.8 | 91.3 | 97.2 | 84.9 | 69.5 | 60.5 |
| E2E-OSDet [22] | 27.5 | 20.9 | 77.7 | 90.7 | 64.3 | 91.8 | 92.1 | 97.8 | 85.7 | 70.3 | 61.4 |
| Ours | 31.4 | 28.4 | 77.6 | 96.8 | 62.9 | 94.7 | 99.1 | 96.7 | 88.0 | 74.9 | 65.6 |

Note: BD–Bridge, HB–Harbor, OT–Oil Tank, PG–Playground, AP–Airport, WT–Wind Turbine.
Table 2. Results of the generalization experiments. The best results for each metric are shown in bold red.
| Method | AP50 (25%) | mAP (25%) | AP50 (50%) | mAP (50%) | AP50 (70%) | mAP (70%) |
|---|---|---|---|---|---|---|
| CFT [33] | 51.8 | 31.1 | 64.8 | 42.4 | 76.4 | 54.1 |
| CLANet [39] | 50.9 | 30.5 | 63.9 | 41.5 | 75.8 | 53.4 |
| CSSA [27] | 48.2 | 28.5 | 61.1 | 39.4 | 73.2 | 50.9 |
| CMADet [31] | 47.9 | 27.2 | 59.7 | 38.1 | 71.8 | 49.5 |
| ICAFusion [40] | 49.5 | 29.8 | 62.5 | 40.8 | 74.5 | 52.2 |
| MMIDet [41] | 52.1 | 31.5 | 65.1 | 42.9 | 76.9 | 54.5 |
| E2E-MFD [28] | 53.4 | 32.4 | 66.2 | 43.7 | 77.5 | 55.2 |
| E2E-OSDet [22] | 54.5 | 33.2 | 67.1 | 44.5 | 78.8 | 56.3 |
| Ours | 55.9 | 34.6 | 68.4 | 45.8 | 80.1 | 57.7 |

Note: All values are percentages; the parenthesized percentage is the training proportion.
Table 3. Results of the ablation experiments. The symbols ✓ and × indicate the inclusion and exclusion, respectively, of the module in the experimental setup. The best results for each metric are shown in bold red.
| HyperSIGMA | APG | MFM | AP50 (%) | AP75 (%) | mAP (%) |
|---|---|---|---|---|---|
| × | × | × | 77.9 | 59.1 | 49.8 |
| ✓ | × | × | 83.7 | 68.4 | 59.3 |
| ✓ | ✓ | × | 85.9 | 69.7 | 61.4 |
| × | × | ✓ | 83.4 | 64.7 | 56.8 |
| ✓ | × | ✓ | 84.8 | 69.0 | 59.7 |
| ✓ | ✓ | ✓ | 88.0 | 74.9 | 65.6 |
