DEMC: A Diffusion-Enhanced Mutual Consistency Framework for Cross-Domain Object Detection in Optical and SAR Imagery

Luo, Cheng; Zhang, Yueting; Guo, Jiayi; Zhou, Guangyao; You, Hongjian; Li, Peifeng; Ning, Xia

doi:10.3390/rs18091358

by Cheng Luo ^1,2,3, Yueting Zhang ^1,2,3,*, Jiayi Guo ^1,2,3, Guangyao Zhou ^1,2, Hongjian You ^1,2,3, Peifeng Li ^1,2 and Xia Ning ^1,2

¹ Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100094, China
² Key Laboratory of Technology in Geo-Spatial Information Processing and Application System, Chinese Academy of Sciences, Beijing 100190, China
³ School of Electronic, Electrical and Communication Engineering, University of Chinese Academy of Sciences, Beijing 100049, China
* Author to whom correspondence should be addressed.

Remote Sens. 2026, 18(9), 1358; https://doi.org/10.3390/rs18091358

Submission received: 15 March 2026 / Revised: 21 April 2026 / Accepted: 27 April 2026 / Published: 28 April 2026

(This article belongs to the Special Issue Deep Learning-Based Analysis of High-Resolution Remote Sensing Images: Registration, Fusion, and Change Detection)


Highlights

What are the main findings?

A novel Diffusion-enhanced Mutual Consistency (DEMC) framework is introduced for cross-domain object detection, specifically addressing the modality gap between optical and SAR imagery.
The framework utilizes a Diffusion-Based Domain Alignment (DBDA) module and Dual-Student Mutual Verification (DSMV) mechanism to enhance detection accuracy and reduce pseudo-label noise in SAR environments.

What are the implications of the main finding?

The proposed method significantly improves detection performance in SAR image datasets, with a notable increase in recall and reduced false negatives compared to existing domain adaptation methods.
The integration of diffusion models into SAR image generation provides a promising approach to solving the data scarcity and annotation challenges in remote sensing applications.
Extensive experiments on four benchmark datasets (HRSC2016/ShipRSImageNet to SSDD/HRSID) demonstrate that the proposed method achieves state-of-the-art performance, with significant improvements in detection recall and AP50 compared to existing approaches.

Abstract

Cross-domain object detection from optical to Synthetic Aperture Radar (SAR) imagery addresses the challenges of SAR data scarcity and high annotation costs, enabling crucial capabilities for persistent maritime surveillance and reconnaissance. However, the substantial modality gap resulting from distinct imaging mechanisms and severe coherent speckle noise significantly hampers knowledge transfer. Existing Unsupervised Domain Adaptation (UDA) methods, which primarily rely on adversarial feature alignment or static pseudo-labeling, struggle to replicate the physical backscattering properties of SAR data and often fall prey to confirmation bias due to intense background clutter. To overcome these limitations, this paper introduces the Diffusion-Enhanced Mutual Consistency (DEMC) framework. DEMC introduces a novel two-stage adaptation paradigm. The first stage, the Diffusion-Based Domain Alignment (DBDA) module, generates a physics-aware intermediate domain. By integrating step-efficient diffusion generation with physical refinement, this module effectively reduces the cross-modal visual discrepancy while preserving the semantic structure of the optical source. In the second stage, this paper tackles the pervasive issue of pseudo-label noise with the Dual-Student Mutual Verification (DSMV) mechanism. Guided by Cross-Agent Spatial Consensus (CASC) and Adaptive Thresholding (AIT), this mechanism dynamically refines pseudo-labels through geometric overlap validation, effectively recovering faint, low-contrast targets that would typically be discarded by standard thresholds. Extensive evaluations across four benchmark tasks (HRSC2016/ShipRSImageNet to SSDD/HRSID) demonstrate that DEMC establishes a new state-of-the-art. Notably, the framework significantly enhances detection recall and reduces omission errors in complex coastal environments, offering a robust solution for zero-tolerance, all-weather surveillance tasks.

Keywords:

unsupervised domain adaptation; object detection; synthetic aperture radar (SAR); diffusion models; mutual consistency learning

1. Introduction

Synthetic Aperture Radar (SAR) has become an essential tool in modern ocean monitoring and military reconnaissance due to its all-weather, day-and-night imaging capabilities [1,2,3]. While data-driven approaches, particularly Convolutional Neural Networks (CNNs), have significantly advanced SAR target detection, their effectiveness is heavily reliant on large-scale, high-quality annotated datasets [4,5,6]. Unlike optical remote sensing imagery, which offers intuitive visual cues, SAR data is governed by complex coherent imaging mechanisms. The combination of non-intuitive scattering features and severe speckle noise makes precise annotation extremely challenging for non-specialists. This dependence on expert knowledge leads to prohibitive labor costs, leaving SAR datasets abundant in raw data but lacking sufficient annotations. As a result, Unsupervised Domain Adaptation (UDA), which transfers knowledge from a well-annotated optical domain (source) to an unlabeled SAR domain (target), has become a vital strategy to address the SAR data bottleneck and enable intelligent monitoring [7,8,9].

Despite the successes of UDA in computer vision, the significant modality disparity between optical and SAR domains limits the effectiveness of existing methods. For example, conventional image translation techniques, such as Generative Adversarial Networks (GANs), struggle to bridge this gap. These models often fail to preserve geometric structures while simulating realistic coherent speckle noise, resulting in synthetic data that lacks both diversity and physical realism. Furthermore, traditional self-training methods for prediction alignment rely on static thresholding for pseudo-label filtering, overlooking the dynamic uncertainty across training phases [10]. Given the faint target signatures and intense background clutter typical in SAR imagery, fixed thresholds may discard valuable hard positives with low confidence, while inaccurate high-confidence predictions can still be retained. This weakens the supervisory signals available for self-training and hinders optimization in complex SAR environments [10,11]. In the broader remote sensing community, techniques such as feature extraction from spectral images based on multi-threshold binarization have proven highly effective for improving classification accuracy and reducing sample dependency. Rusyn et al. demonstrated the effectiveness of multi-threshold binarization in extracting discriminative features with limited training samples [12]. Zhang et al. optimized multi-threshold segmentation using an improved black-winged kite algorithm to enhance segmentation precision [13]. Although these methods were developed for spectral feature extraction and segmentation rather than pseudo-label learning, they highlight the broader utility of adaptive threshold design in remote sensing.

To address these challenges, this paper presents a novel perspective aimed at improving pixel-level generation fidelity and pseudo-label robustness. At the pixel level, this paper moves beyond traditional GANs and global transformation strategies by introducing Diffusion models [14,15]. These models excel at capturing complex data distributions, allowing us to generate pseudo-SAR images that adhere to SAR’s physical scattering characteristics [16]. This results in more precise domain alignment at the input level. In the instance mining stage, this paper proposes Mutual Consistency Learning, inspired by cross-view learning [17], to mitigate pseudo-label noise. The core idea is that predictions from a single network in complex SAR backgrounds are often unreliable. However, if two independent branches can agree on the localization predictions for the same spatial region, it is highly likely that this region contains a true target. This spatial consensus-based approach provides a reliable foundation for detecting low Signal-to-Noise Ratio targets [18].

This paper introduces the Diffusion-Enhanced Mutual Consistency (DEMC) framework. To distinguish it from general machine learning image enhancement models that may share a similar acronym, DEMC is an architecture tailored to robust cross-domain transfer through a two-stage collaboration:

1. To address the significant gap in texture distribution between optical and SAR images, this paper employs a one-step diffusion model to construct an intermediate domain. By introducing zero-convolution skip connections (Zero-Conv) along with spectral consistency, denoising, and contrast enhancement as physical post-processing steps, the generated images retain the semantic layout of the optical source domain while faithfully replicating the texture information of SAR targets.
2. To tackle the noise commonly present in pseudo-labels, a co-learning architecture is designed, comprising a teacher network, a main student, and a proxy student. By leveraging the proposed Cross-Agent Spatial Consensus (CASC) strategy, pseudo-labels are dynamically corrected through IoU consistency across multiple networks. This strategy effectively mitigates the “confirmation bias” of a single network in low-contrast SAR regions, recovering a significant number of potential true positive samples. Along with the Adaptive Threshold (AIT) mechanism, the model can autonomously optimize the training sample pool, supporting continued performance improvement during training.

Although this study focuses on the ship class due to the availability of well-established optical-to-SAR maritime benchmarks, the proposed DEMC framework is not restricted to ships at the algorithmic level. In principle, it can be extended to other coastal objects, such as offshore platforms, oil tanks, port facilities, and coastal vehicles, provided that labeled optical source data and unlabeled SAR target images are available. However, its effectiveness depends on whether the target category presents distinguishable SAR scattering structures and sufficiently clear contours; extremely dense inshore scenes or ultra-low-SNR targets may reduce the reliability of spatial consensus.

The primary contributions of this work are summarized as follows:

(1): Novel UDA Paradigm: We propose the DEMC framework, which integrates generative alignment with mutual verification learning. This innovative approach combines the generative capabilities of diffusion models with the error-correction ability of mutual learning, effectively addressing the challenges of transferring knowledge from the optical domain to the heterogeneous SAR domain.
(2): Physics-Aware Diffusion Synthesis: By combining one-step diffusion techniques with physics-based post-processing, we generate high-fidelity intermediate domain data with strong scattering features, significantly alleviating distribution shifts between modalities.
(3): Robust Pseudo-Label Mining: We introduce the CASC strategy, replacing the conventional confidence threshold with spatial consensus constraints, greatly enhancing the model’s recall ability for weak targets in complex backgrounds.
(4): Superior Experimental Performance: Experimental results on cross-domain benchmark tasks from HRSC2016 and ShipRSImageNet (optical) to SSDD and HRSID (SAR) demonstrate that the proposed method achieves state-of-the-art performance in core metrics, including AP50 and F1-score.

1.1. Object Detection in SAR Images

Convolutional Neural Network (CNN)-based approaches have greatly advanced remote sensing target detection. Early studies mainly focused on adapting optical detection models, such as Faster R-CNN, YOLO, and SSD, to the Synthetic Aperture Radar (SAR) domain. However, SAR imagery presents unique challenges such as speckle noise and target scale variations, which call for SAR-specific designs. To address these, researchers have integrated attention mechanisms and multi-scale feature fusion techniques. Kang et al. [19] improved small-scale target detection through multi-layer fusion and contextual information. Cao et al. [8] developed SAR-Net, a multi-scale direction-aware attention network that improves detection across various object scales. Other research has explored dynamic multi-scale fusion with transformer-style modules to better address complex marine environments [20].

1.2. Image Translation: From GANs to Diffusion Models

Pixel-level domain adaptation seeks to bridge the modality gap by converting source-domain images to match the target domain’s style. Generative Adversarial Networks (GANs), especially CycleGAN and its variants, have been widely used for this purpose due to their ability to generate realistic remote sensing samples. However, GANs often struggle with issues such as instability, hyperparameter sensitivity, and failure to preserve geometric structures during optical-to-SAR image translation, resulting in artifacts that degrade image quality. Recently, denoising diffusion probabilistic models (DDPMs) have emerged as a promising alternative, offering superior generative quality and diversity. These models have shown success in tasks like text-to-image and 3D generation [21,22,23,24,25,26]. Zhang et al. [21] reviewed text-to-image diffusion models, while Cao et al. [22] examined generative diffusion models. Wang et al. [23] introduced ProlificDreamer for high-fidelity 3D generation, and Poole et al. [24] demonstrated 3D generation with DreamFusion. Yi et al. [25] proposed GaussianDreamer for rapid text-to-3D generation, highlighting the potential of diffusion models in complex generative tasks. For SAR image generation, the use of denoising diffusion probabilistic models has shown promising results. Specifically, ref. [27] presents a method that generates high-fidelity SAR images from limited samples, bridging the modality gap between optical and SAR domains. Nevertheless, substantial research efforts are still needed to fully explore its application in SAR image synthesis.

1.3. Unsupervised Domain Adaptation for SAR Detection

To address the challenge of limited SAR annotations, unsupervised domain adaptation (UDA) has emerged as a crucial approach. Huang et al. [28] introduced a transfer learning strategy that utilizes multi-source data to minimize discrepancies between optical and SAR domains, and between various SAR target types. Shi et al. [29] further proposed a progressive transfer strategy for ship detection from optical to SAR images, which gradually reduces the cross-domain discrepancy during adaptation. Zhang et al. [30] explored both global structural and local instance-level alignment to improve feature consistency between optical and SAR images. Yuan et al. [31] introduced a CSD module that decomposes classifiers into domain-common and domain-specific components, enhancing SAR target recognition. Zheng et al. [32] introduced a dual-teacher framework to separate cross-domain and semi-supervised tasks, reducing interference between optical and SAR supervision. Similarly, regarding the significant challenge of domain shift, Zhou et al. [33] introduced a novel technique utilizing cross-domain feature interaction and data contribution balance, which aligns with our approach of leveraging mutual verification to overcome domain-specific challenges. Finally, Han et al. [34] designed a frequency-enhanced feature alignment module to simplify the attention mechanism while capturing domain-specific information for improved SAR object detection. Moreover, phased self-training pipelines have recently demonstrated significant potential in other cross-domain remote sensing tasks, such as semantic segmentation [35,36].

Recent studies have further explored remote sensing object detection generalization from complementary perspectives. Zhang et al. [37] used Stable Diffusion and CLIP for controllable generative few-shot detection, Zhang et al. [38] introduced Fourier contour parametric learning to unify different geometric annotations, and Xie et al. [39] incorporated LLaMA-based language priors for open-vocabulary remote sensing detection. These works improve generalization through generative augmentation, geometric representation, and semantic priors.

In contrast, our DEMC focuses on annotation-free optical-to-SAR adaptation, where the physical modality gap and SAR pseudo-label noise remain the main bottlenecks. In this setting, standard Mean Teacher or co-training frameworks typically rely on rigid, high-confidence score filtering to generate pseudo-labels, which often leads to severe “confirmation bias” in noisy SAR backgrounds: the model confidently learns from false positives while faint true targets are discarded. To address this issue, our DSMV framework differs from traditional co-training methods by enforcing an IoU-based Cross-Agent Spatial Consensus. This mechanism allows DEMC to validate and recover geometrically consistent, low-confidence target proposals that standard confidence-based self-training methods would discard.

2. Materials and Methods

The proposed DEMC framework is specifically designed to facilitate cross-domain object detection from optical to SAR imagery, targeting the critical requirements of all-weather maritime surveillance and persistent military reconnaissance. While optical satellite imagery (the source domain) benefits from ease of acquisition and lower annotation costs, its operational efficacy is severely compromised by atmospheric attenuation, such as cloud cover and fog, as well as diurnal limitations. Conversely, SAR imaging provides robust, all-weather, and day-and-night sensing capabilities. Despite these advantages, the physical complexity of SAR scattering renders the annotation process highly specialized, incurring prohibitive temporal and economic costs.

To tackle these challenges, this paper proposes an unsupervised domain adaptation framework that leverages existing optical image datasets. This framework employs adaptive transfer learning to facilitate the migration of object detection tasks from the source domain (optical images) to the target domain (SAR images). The core concept of this framework is the generation of intermediate domain images, which allows for the transfer of knowledge from the source domain to the target domain without the need for target domain annotations. Additionally, a pseudo-label mechanism is incorporated to further enhance detection accuracy. The overall structure of the proposed framework is illustrated in Figure 1. For clarity, the overall DEMC training pipeline can be summarized in three sequential steps. First, DBDA translates each labeled optical source image into a SAR-like intermediate image while preserving its original bounding-box annotations, thereby providing image-level modality alignment. Second, the detector is warmed up using both the original optical images and the generated intermediate-domain images. Third, unlabeled SAR images are introduced for self-training, where the Teacher provides candidate pseudo-labels, the Proxy Student verifies low-confidence proposals through CASC, and AIT dynamically adjusts the confidence threshold according to the evolving prediction statistics. Thus, DBDA addresses the image-level modality gap, whereas DSMV, CASC, and AIT jointly address pseudo-label noise at the instance level.

Given a source domain dataset $D_{s} = \{(x_{s}^{i}, y_{s}^{i})\}_{i=1}^{N_{s}}$, where $x_{s}^{i}$ represents a source domain optical image and $y_{s}^{i}$ denotes the corresponding target bounding boxes and class labels, and an unlabeled target domain (SAR) dataset $D_{t} = \{x_{t}^{j}\}_{j=1}^{N_{t}}$, where $x_{t}^{j}$ is a single-channel SAR image in the target domain, the objective is to address the distribution shift between the source and target domains ($P(x_{s}) \neq P(x_{t})$), caused by significant differences in imaging mechanisms and noise characteristics. Therefore, this paper seeks to design a model $F_{\theta}$ such that an object detector trained on the source domain can effectively perform inference on the target domain.

The objective is to minimize the generalization error in the target domain:

$$\min_{\theta} R_{t}(F_{\theta}) = \mathbb{E}_{(x_{t}, y_{t}) \sim D_{t}} [L(F_{\theta}(x_{t}), y_{t})], \tag{1}$$

where $L$ denotes the object detection loss function. Since the target-domain labels $y_{t}$ are unavailable, this paper approximates this objective by generating intermediate domain data and employing pseudo-labeling techniques.

2.1. Diffusion-Based Physics-Aware Domain Alignment: DBDA

The DBDA module aims to generate an intermediate domain $D_{mid} = \{(x_{mid}^{i}, y_{s}^{i})\}_{i=1}^{N_{s}}$, where the labels $y_{s}^{i}$ originate from the source domain, and the intermediate domain images $x_{mid}^{i}$ retain the geometric structure of the source domain while incorporating the physical texture characteristics of the target domain. The visual transition from the optical source to the SAR-like intermediate domain is illustrated in Figure 2. As shown, the DBDA module effectively bridges the modality gap: the generated images in the second column exhibit the characteristic bright-spot scattering and speckle-like textures of SAR imagery, while maintaining the precise bounding box alignment of the original optical ships. Through this module, the generated intermediate domain not only preserves the semantic information of the source domain but also integrates the physical features of the target domain. This approach effectively reduces the visual discrepancy between the source and target domains, enabling more accurate cross-domain object detection.

(1) One-Step Diffusion Backbone

To enhance generation speed and reduce computational burden, this paper adopts SD-Turbo as the generation backbone $G$. Trained via Adversarial Diffusion Distillation, this model maps random noise to images in a single forward pass. Specifically, the generator $G$ receives an optical image $x_{s}$ as structural conditioning, along with random noise $\epsilon$ and a text prompt $c$ (e.g., “top-down view of SAR ship with speckle noise”) as stylistic conditioning, to generate an image $x_{gen}$ in the target domain style:

$$x_{gen} = G(x_{s}, \epsilon, c; \phi), \tag{2}$$

where $\phi$ represents the generator parameters fine-tuned via Low-Rank Adaptation (LoRA).

(2) Zero-Conv Skip Connections

To ensure the generated images maintain geometric consistency, this paper introduces zero-convolution residual connections within the generative network. This architecture enables residual connections between the encoder $E$ and decoder $D$, effectively preserving the source image’s geometric information:

$$z = E(x_{s}), \quad \Delta z = Z(z), \quad x_{gen} = D(z + \Delta z, c). \tag{3}$$

Here, $Z(\cdot)$ denotes zero-initialized convolution, stabilizing the generation process (particularly during the early training phase) by enforcing the preservation of source image contours and positions.
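As a minimal illustration of why zero initialization helps early in training, the sketch below applies a zero-initialized layer to a single latent vector. It assumes the zero convolution is a 1×1 convolution (a per-location linear map), which the text does not specify; `conv1x1`, `zero_conv_params`, and `skip_with_zero_conv` are hypothetical names introduced only for this example. With all weights and biases at zero, $\Delta z = 0$, so the decoder input $z + \Delta z$ equals the encoder output $z$ until training moves the weights away from zero.

```python
def conv1x1(features, weight, bias):
    """Apply a 1x1 convolution at one spatial location: out = W @ features + bias."""
    return [
        sum(w * f for w, f in zip(row, features)) + b
        for row, b in zip(weight, bias)
    ]


def zero_conv_params(channels):
    """Return all-zero weights and biases (assumed form of Z(.) at initialization)."""
    weight = [[0.0] * channels for _ in range(channels)]
    bias = [0.0] * channels
    return weight, bias


def skip_with_zero_conv(z, weight, bias):
    """Compute z + delta_z, where delta_z = Z(z) as in Eq. (3)."""
    delta_z = conv1x1(z, weight, bias)
    return [zi + di for zi, di in zip(z, delta_z)]


# At initialization the residual branch contributes nothing.
latent = [0.5, -1.2, 3.0]
weight, bias = zero_conv_params(len(latent))
decoder_input = skip_with_zero_conv(latent, weight, bias)
```

Once gradients update `weight` and `bias`, the branch starts to contribute a learned correction, so the residual path is introduced gradually rather than as random noise.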

(3) Physics-Aware Refinement

Directly generated images often contain noise that does not align with the physical characteristics of SAR imaging. To better align generated images with SAR imaging principles, this paper employs a three-step refinement process:

Step I: Spectral Consistency

To align the image with SAR’s single-band characteristics while preserving the generative priors of the pre-trained SD-Turbo model, which is optimized for 3-channel RGB synthesis, this paper generates RGB images first rather than modifying the network’s output dimensions. The RGB images are then converted to grayscale using standard ITU-R BT.601 coefficients:

$$x_{gray} = 0.299 \cdot R + 0.587 \cdot G + 0.114 \cdot B. \tag{4}$$

These coefficients are derived from human visual perception standards, providing a stable, physically consistent mapping to a single-channel representation without introducing arbitrary empirical hyperparameters. This approach ensures compatibility with both the pretrained network and the subsequent physics-aware SAR refinement steps.
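The conversion in Equation (4) can be sketched in a few lines. This is a dependency-free illustration, assuming the image is stored as nested lists of `(R, G, B)` tuples; the function name `to_grayscale_bt601` is hypothetical, and a real pipeline would more likely use an array library.

```python
BT601_WEIGHTS = (0.299, 0.587, 0.114)  # coefficients from Eq. (4)


def to_grayscale_bt601(rgb_image):
    """Map an H x W image of (R, G, B) tuples to an H x W single-channel image."""
    wr, wg, wb = BT601_WEIGHTS
    return [
        [wr * r + wg * g + wb * b for (r, g, b) in row]
        for row in rgb_image
    ]


# A 1 x 2 toy image: one pure-red pixel and one black pixel.
toy_rgb = [[(255, 0, 0), (0, 0, 0)]]
toy_gray = to_grayscale_bt601(toy_rgb)
```

Because the three weights sum to one, a neutral gray pixel keeps its intensity, while saturated colors are mapped according to their perceived brightness.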

Step II: Non-Local Means Denoising

To remove uniform noise introduced during processing while preserving the strong reflective edges of ships, Non-Local Means (NLM) denoising is applied:

$$x_{den}(p) = \frac{1}{Z(p)} \sum_{q \in N(p)} w(p, q) \cdot x_{gray}(q), \tag{5}$$

where $N(p)$ is the neighborhood of pixel $p$, $w(p, q)$ is a similarity-based weighting function, and $Z(p)$ is a normalization factor.
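To make Equation (5) concrete, the following toy sketch applies the same weighted average to a 1-D signal instead of an image. It rests on assumptions not stated in the text: the weight $w(p, q)$ is taken to be the common Gaussian of the squared patch distance with a filtering parameter `h`, borders are handled by clamping indices, and the name `nlm_denoise_1d` and all default values are hypothetical.

```python
import math


def nlm_denoise_1d(signal, search_radius=3, patch_radius=1, h=10.0):
    """Non-Local Means on a 1-D signal, following the structure of Eq. (5)."""
    n = len(signal)

    def patch(center):
        # Clamp indices at the borders so every patch has the same length.
        return [
            signal[min(max(center + k, 0), n - 1)]
            for k in range(-patch_radius, patch_radius + 1)
        ]

    denoised = []
    for p in range(n):
        patch_p = patch(p)
        weighted_sum = 0.0
        norm = 0.0  # plays the role of Z(p)
        for q in range(max(0, p - search_radius), min(n, p + search_radius + 1)):
            dist2 = sum((a - b) ** 2 for a, b in zip(patch_p, patch(q)))
            w = math.exp(-dist2 / (h * h))  # assumed form of w(p, q)
            weighted_sum += w * signal[q]
            norm += w
        denoised.append(weighted_sum / norm)
    return denoised
```

Samples whose surrounding patch looks different receive a near-zero weight, which is why a sharp step (such as a bright ship edge against the sea) is averaged only with samples on its own side when `h` is small.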

Step III: Contrast-Limited Adaptive Histogram Equalization

To simulate the strong scattering characteristics of metallic vessels in SAR images, this paper enhances the image using Contrast-Limited Adaptive Histogram Equalization (CLAHE). In SAR imaging mechanisms, metallic structures (e.g., ships) typically form dihedral or trihedral corner reflectors, yielding extremely intense local backscattering that appears as prominent specular highlights. Unlike global histogram equalization, which often washes out localized high-intensity points, CLAHE dynamically amplifies these discrete specular highlights while restricting noise amplification in uniform background regions (such as the sea surface). Therefore, CLAHE acts not merely as a visual enhancer, but as a physical simulator of SAR backscattering.

$$x_{sar} = \mathrm{CLAHE}(x_{den}; \beta), \quad \beta = 1.5. \tag{6}$$

(4) Channel Adaptation

Since mainstream detection networks typically utilize 3-channel inputs, the single channel of the SAR image is replicated into three channels to facilitate compatibility with existing pre-trained detectors:

$$x_{mid} = \mathrm{Rep}_{3}(x_{sar}) \in \mathbb{R}^{H \times W \times 3}. \tag{7}$$

2.2. Dual-Student Mutual Verification

We design a co-learning architecture comprising a Teacher network ($F_{T}$), a Main Student ($F_{S}$), and a Proxy Student ($F_{A}$). Architecturally, the Teacher and the Main Student adopt the same Faster R-CNN detector with a ResNet-50 backbone, including a feature extractor, a region proposal network, and RoI classification/regression heads. The Teacher is initialized from the Main Student and updated only through the exponential moving average (EMA), without direct back-propagation. The Proxy Student shares the ResNet-50 feature extraction backbone with the Main Student to reduce computational overhead, but employs an independent, decoupled Mining Agent Head for classification and bounding-box regression, which generates auxiliary supervisory signals. Therefore, DSMV introduces prediction-level diversity while maintaining a shared feature representation. During inference, only the Main Student detector is retained.

(1) Mean Teacher Update

During training, the Teacher network’s parameters are updated via the exponential moving average (EMA) of the Main Student’s parameters. Specifically, the Main Student network $F_{S}$ updates its parameters $\theta_{S}$ through backpropagation, while the parameters of the Teacher network $F_{T}$ are updated according to the following formula:

$$\theta_{T}^{(t)} = \alpha \theta_{T}^{(t-1)} + (1 - \alpha) \theta_{S}^{(t)}, \tag{8}$$

where $\alpha$ is the smoothing coefficient, typically set to 0.999. Importantly, only the Main Student’s parameters ($\theta_{S}$) are incorporated into the EMA update. The Proxy Student is deliberately designed to explore low-confidence proposals, and including its parameters in the EMA would inject excessive noise, destabilizing the Teacher network and reducing the reliability of high-confidence pseudo-labels. This design ensures that the Teacher provides stable supervisory signals while allowing the Proxy Student to mine hard examples for mutual verification.
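The update in Equation (8) can be sketched as follows. This is an illustrative sketch only: parameters are represented as a flat dictionary of scalar values, the parameter names are made up, and `ema_update` is a hypothetical helper. Iterating over the Teacher's own keys means that any Proxy Student head parameters present in the student dictionary are ignored, which mirrors the exclusion described above.

```python
def ema_update(teacher_params, student_params, alpha=0.999):
    """Return new Teacher parameters: alpha * teacher + (1 - alpha) * student (Eq. 8).

    Only names that exist in the Teacher are updated, so extra entries in
    student_params (e.g. a proxy head) never enter the average.
    """
    return {
        name: alpha * teacher_params[name] + (1.0 - alpha) * student_params[name]
        for name in teacher_params
    }


# Hypothetical scalar "parameters" for illustration.
teacher = {"backbone.w": 1.0, "roi_head.w": 0.0}
student = {"backbone.w": 0.0, "roi_head.w": 1.0, "proxy_head.w": 5.0}
teacher = ema_update(teacher, student, alpha=0.999)
```

With $\alpha = 0.999$, each step moves the Teacher only 0.1% of the way toward the Main Student, so the Teacher changes slowly and smooths out step-to-step fluctuations.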

(2) Cross-Agent Spatial Consensus (CASC)

While high-confidence predictions from the Teacher directly supervise the Main Student, the Proxy Student acts as a validator for low-confidence proposals. The CASC strategy mines target-domain SAR images ($x_{t}$) by establishing geometric consistency. Specifically, if the Teacher and Proxy Student independently predict the same semantic category at overlapping locations, the prediction is verified via spatial consensus rather than raw confidence scores. This is formulated as:

$$\hat{P}_{A}(x_{t}) = \{(b_{T}^{i}, c_{T}^{i}) \mid \exists j,\ \mathrm{IoU}(b_{T}^{i}, b_{A}^{j}) > \tau_{loc},\ c_{T}^{i} = c_{A}^{j}\}, \tag{9}$$

where $(b_{T}^{i}, c_{T}^{i})$ and $(b_{A}^{j}, c_{A}^{j})$ are the bounding boxes and category labels predicted by the Teacher and the Proxy Student, respectively, and $\tau_{loc}$ is the localization (IoU) threshold. By enforcing this Intersection over Union (IoU) consensus, CASC effectively recovers latent true-positive samples that would otherwise be discarded due to low contrast.
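A minimal sketch of the consensus rule in Equation (9) is given below. It assumes axis-aligned boxes in `(x1, y1, x2, y2)` format and predictions stored as `(box, category)` pairs; the helper names `iou` and `casc_filter` are hypothetical, and the default `tau_loc=0.5` is a placeholder rather than a value taken from the text.

```python
def iou(box_a, box_b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def casc_filter(teacher_preds, proxy_preds, tau_loc=0.5):
    """Keep Teacher predictions confirmed by some Proxy prediction (Eq. 9).

    A Teacher box is kept if at least one Proxy box has the same category
    and overlaps it with IoU strictly greater than tau_loc.
    """
    return [
        (box_t, cat_t)
        for box_t, cat_t in teacher_preds
        if any(
            cat_t == cat_a and iou(box_t, box_a) > tau_loc
            for box_a, cat_a in proxy_preds
        )
    ]


teacher_preds = [((0, 0, 10, 10), "ship"), ((50, 50, 60, 60), "ship")]
proxy_preds = [((1, 1, 10, 10), "ship")]
verified = casc_filter(teacher_preds, proxy_preds)
```

In this toy case only the first Teacher box survives: the Proxy box overlaps it with an IoU of 0.81, while the second Teacher box has no Proxy counterpart. Note that confidence scores do not appear in the rule at all, which is what allows faint targets to pass.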

(3) Adaptive Thresholding Correction

To adapt to the gradual improvement of the model’s capabilities during training, an adaptive thresholding mechanism (AIT) is introduced. Let the mean and standard deviation of the student network’s confidence scores in the $k$-th iteration be $\mu_{k}$ and $\sigma_{k}$, respectively. The threshold for the next phase is updated as follows:

$$\tau_{cls}^{(k+1)} = \lambda \tau_{cls}^{(k)} + (1 - \lambda)(\mu_{k} - \sigma_{k}), \tag{10}$$

where $\lambda$ is the smoothing coefficient, typically set to 0.05.
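The threshold update in Equation (10) is simple enough to sketch directly. The function name `update_threshold` is hypothetical, and the use of the population standard deviation is an assumption, since the text does not say which estimator is used.

```python
from statistics import mean, pstdev


def update_threshold(tau_cls, scores, lam=0.05):
    """Return the next confidence threshold following Eq. (10).

    scores: confidence scores produced by the student in the current iteration.
    lam:    smoothing coefficient (lambda in the text).
    """
    mu = mean(scores)
    sigma = pstdev(scores)  # assumption: population standard deviation
    return lam * tau_cls + (1.0 - lam) * (mu - sigma)


# Toy example: current threshold 0.8, two predictions scored 0.6 and 0.8.
next_tau = update_threshold(0.8, [0.6, 0.8])
```

With the stated value $\lambda = 0.05$, the previous threshold contributes only 5% of the new one, so the threshold tracks the current score statistics $\mu_{k} - \sigma_{k}$ closely rather than changing slowly.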

Synergy of AIT and CASC for Noise Filtering: It is important to note that the framework identifies and filters noisy pseudo-labels not through an isolated binary classifier, but via the joint dual-guard mechanism of AIT and CASC. AIT acts as the score-level defense, dynamically elevating or relaxing the confidence barrier to block obvious background clutter as the model evolves. As the model learns over time, AIT adjusts the confidence threshold based on the statistical distribution of predictions, ensuring that only those predictions that meet an evolving confidence level are passed forward.

Subsequently, CASC acts as the spatial-level defense. Because random SAR clutter lacks consistent structural geometry, it is highly improbable for both the Teacher and the Proxy Student to generate high-confidence false positives at the exact same location. Therefore, any proposal that passes the AIT confidence barrier but fails the CASC spatial agreement (IoU $\leq \tau_{loc}$) is inherently identified as confirmation-bias noise and systematically discarded. This synergistic operation ensures that CASC serves to verify the consistency of the pseudo-labels in spatial terms, while AIT ensures that the labels are filtered based on score-level confidence. Together, these mechanisms continuously purify the supervisory signals, improving the robustness and precision of the model by preventing confirmation bias from corrupting the learning process.

2.3. Overall Loss Function

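The entire framework is optimized end-to-end via a weighted combination of source-domain supervised loss, target-domain unsupervised loss, and auxiliary proxy loss:

$$L_{total} = L_{sup} + \gamma_{u} L_{unsup} + \gamma_{p} L_{A}, \tag{11}$$

where $\gamma_{u}$ and $\gamma_{p}$ weight the unsupervised target loss and the auxiliary proxy loss, respectively.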

(1) Source-Domain Supervised Loss

The supervised loss is derived from both the original optical images ($x_{s}$) and the synthesized intermediate images ($x_{mid}$):

$$L_{sup} = \sum_{i} L_{det}(F_{S}(x_{s}^{i}), y_{s}^{i}) + \eta \sum_{i} L_{det}(F_{S}(x_{mid}^{i}), y_{s}^{i}), \tag{12}$$

where $L_{det}$ denotes the standard Faster R-CNN detection objective, which is composed of a classification loss and a bounding-box regression loss:

$$L_{det} = L_{cls} + L_{box},$$

where $L_{cls}$ is the cross-entropy loss for object category classification, and $L_{box}$ is the Smooth-$L_{1}$ loss for bounding-box regression computed on positive proposals. The same $L_{det}$ formulation is used for both the source-domain supervised loss in Equation (12) and the target domain pseudo-label loss in Equation (13). The coefficient $\eta$ controls the contribution of the intermediate domain.

(2) Unsupervised Target Loss

For the unlabeled target domain, the network learns from heavily augmented SAR images ($\mathrm{aug}(x_{t})$) guided by the CASC-filtered pseudo-labels ($\hat{P}_{A}$):

$$L_{unsup} = \sum_{x_{t} \in D_{t}} I(\hat{P}_{A}(x_{t}) \neq \emptyset) \, L_{det}(F_{S}(\mathrm{aug}(x_{t})), \hat{P}_{A}(x_{t})), \tag{13}$$

where $I(\cdot)$ is an indicator function that ensures the pseudo-label set is non-empty before the detection loss is computed.

(3) Auxiliary Proxy Loss

The auxiliary proxy loss $L_{A}$ is introduced to explicitly optimize the decoupled Mining Agent Head of the Proxy Student. Although the Proxy Student shares the feature-extraction backbone with the Main Student, its detection head is independent and is responsible for producing auxiliary predictions used in Cross-Agent Spatial Consensus. To avoid unreliable supervision from noisy low-confidence samples, $L_{A}$ is computed using the high-confidence pseudo-labels generated by the Teacher after AIT filtering. Let $P_{T}^{h}(x_{t})$ denote the high-confidence teacher pseudo-label set:

$$P_{T}^{h}(x_{t}) = \{(b_{T}^{i}, c_{T}^{i}) \mid s_{T}^{i} \geq \tau_{cls}\}, \tag{14}$$

where $b_{T}^{i}$, $c_{T}^{i}$, and $s_{T}^{i}$ represent the bounding box, category label, and confidence score of the $i$-th teacher prediction, respectively. The auxiliary proxy loss is formulated as:

$$L_{A} = \sum_{x_{t} \in D_{t}} I(P_{T}^{h}(x_{t}) \neq \emptyset) \, L_{det}(F_{A}(\mathrm{aug}(x_{t})), P_{T}^{h}(x_{t})), \tag{15}$$

where $F_{A}$ denotes the Proxy Student and $L_{det}$ consists of the standard classification and bounding-box regression losses. This term does not introduce additional target-domain annotations. Instead, it provides stable supervision for the auxiliary mining head, enabling the Proxy Student to produce discriminative and geometrically reliable proposals for CASC. During training, $L_{A}$ is weighted by $\gamma_{p}$, and the parameters of the Proxy Student head are excluded from the EMA update of the Teacher. This design prevents noisy auxiliary mining signals from being accumulated in the Teacher while maintaining sufficient diversity for mutual verification.
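To make the structure of Equations (11), (14), and (15) concrete, the following sketch shows how the three terms could be combined once the per-batch detection losses have been computed. It is illustrative only: the helper names are hypothetical, the detection losses are passed in as plain numbers, and the values used for $\gamma_{u}$ and $\gamma_{p}$ are placeholders because this section does not report them.

```python
def high_confidence(teacher_preds, tau_cls):
    """Eq. (14): keep (box, label) pairs whose teacher score is >= tau_cls.

    teacher_preds is an assumed list of (box, label, score) tuples.
    """
    return [(box, label) for (box, label, score) in teacher_preds if score >= tau_cls]


def gated(loss_value, pseudo_labels):
    """Indicator in Eqs. (13) and (15): an image with an empty pseudo-label
    set contributes nothing to the loss."""
    return loss_value if pseudo_labels else 0.0


def total_loss(l_sup, l_unsup, l_aux, gamma_u, gamma_p):
    """Eq. (11): weighted sum of supervised, unsupervised, and proxy terms."""
    return l_sup + gamma_u * l_unsup + gamma_p * l_aux


# Hypothetical batch: one confident and one weak teacher prediction.
preds = [((10, 10, 50, 50), "ship", 0.93), ((80, 80, 90, 90), "ship", 0.41)]
p_high = high_confidence(preds, tau_cls=0.8)

l_aux = gated(4.0, p_high)   # proxy loss is active: the set is non-empty
l_unsup = gated(2.0, [])     # no CASC-filtered labels: term is zeroed out
# gamma_u and gamma_p below are placeholder values, not the paper's settings.
loss = total_loss(1.0, l_unsup, l_aux, gamma_u=0.5, gamma_p=0.25)
```

The gating step matters in practice because early in the burn-up stage many SAR images may yield no pseudo-labels at all, and treating those images as all-background would push the detector toward missed detections.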

2.4. Optimization Schedule

To synthesize the aforementioned modules into a cohesive pipeline, the complete training procedure is formalized in Algorithm 1. This algorithm explicitly details the temporal scheduling of the framework, starting with the physics-aware domain alignment (Phase 1), proceeding to the supervised burn-in stage for foundational knowledge transfer (Phase 2), and culminating in the unsupervised burn-up stage (Phase 3), where the Dual-Student Mutual Verification and Adaptive Thresholding mechanisms are dynamically engaged to ensure robust domain adaptation.

Algorithm 1 Training Procedure of the Proposed DEMC Framework

Require: Source domain data $D_{s} = \{(x_{s}^{i}, y_{s}^{i})\}$;
Target domain data $D_{t} = \{x_{t}^{j}\}$;
Diffusion Generator $G$ (SD-Turbo with LoRA);
Teacher $F_{T}$, Main Student $F_{S}$, Proxy Student $F_{A}$;
Hyperparameters: EMA factor $\alpha$, Thresholds $\tau_{cls}, \tau_{loc}$, AIT smoothing coefficient $\lambda$.

Ensure: Optimized Student Model $F_{S}$.

1: Phase 1: Physics-Aware Domain Alignment (DBDA)
2: for each $x_{s} \in D_{s}$ do
3:     Generate intermediate sample $x_{mid} \leftarrow G(x_{s}, \epsilon, c)$ via Equations (2) and (3);
4:     Apply refinement: $x_{mid} \leftarrow \mathrm{Refine}(x_{mid})$ via Equations (4) and (7);
5:     Add to intermediate set: $D_{mid} \leftarrow D_{mid} \cup \{(x_{mid}, y_{s})\}$;
6: end for
7: Phase 2: Burn-in Stage (Supervised Warm-up)
8: for iteration $k = 1$ to $K_{burn\_in}$ do
9:     Sample batch from $D_{s} \cup D_{mid}$;
10:     Update $F_{S}$ by minimizing $L_{sup}$ via Equation (12);
11: end for
12: Initialize Teacher and Proxy: $\theta_{T} \leftarrow \theta_{S}, \theta_{A} \leftarrow \theta_{S}$;
13: Phase 3: Burn-up Stage (UDA with Mutual Verification)
14: for iteration $k = 1$ to $K_{burn\_up}$ do
15:     Sample target batch $x_{t} \in D_{t}$;
16:     // Teacher Prediction
17:     Generate proposals $P_{T}$ from $F_{T}(x_{t})$;
18:     if $k > 0.4 \times K_{burn\_up}$ then
19:         // Dual-Student Mutual Verification (DSMV)
20:         Generate proxy proposals $P_{A}$ from $F_{A}(x_{t})$;
21:         Filter pseudo-labels via CASC:
22:         $\hat{P} \leftarrow \{p \in P_{T} \mid \exists p' \in P_{A}, \mathrm{IoU}(p, p') > \tau_{loc}, \mathrm{score}(p) > \tau_{cls}\}$;
23:     else
24:         Filter pseudo-labels using only static confidence $\tau_{cls}$;
25:     end if
26:     Update $F_{S}$ and the proxy head using $L_{total} = L_{sup} + \gamma_{u} L_{unsup} + \gamma_{p} L_{A}$ via Equation (11);
27:     if $k > 0.6 \times K_{burn\_up}$ then
28:         // Adaptive Thresholding (AIT)
29:         Calculate statistics $\mu_{k}, \sigma_{k}$ of student predictions;
30:         Update threshold: $\tau_{cls} \leftarrow \lambda \tau_{cls} + (1 - \lambda)(\mu_{k} - \sigma_{k})$ via Equation (10);
31:     end if
32:     Update Teacher via EMA: $\theta_{T} \leftarrow \alpha \theta_{T} + (1 - \alpha) \theta_{S}$;
33: end for
34: return Final Student Model $F_{S}$
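The scheduling logic and the two update rules in Phase 3 of Algorithm 1 can be illustrated with plain Python. This is a sketch under stated assumptions, not the released training code: parameters are represented as flat lists of floats, the function names are hypothetical, and the population standard deviation is used for $\sigma_{k}$ because the text does not specify which estimator is applied.

```python
import statistics


def ema_update(theta_t, theta_s, alpha=0.9996):
    """Teacher update: theta_T <- alpha * theta_T + (1 - alpha) * theta_S."""
    return [alpha * t + (1.0 - alpha) * s for t, s in zip(theta_t, theta_s)]


def ait_update(tau_cls, student_scores, lam):
    """Threshold update: tau <- lam * tau + (1 - lam) * (mu - sigma).

    mu and sigma are the mean and (assumed population) standard deviation
    of the student's confidence scores in the current iteration.
    """
    mu = statistics.fmean(student_scores)
    sigma = statistics.pstdev(student_scores)
    return lam * tau_cls + (1.0 - lam) * (mu - sigma)


def active_mechanisms(k, k_burn_up):
    """Which mechanisms are engaged at burn-up iteration k (1-based)."""
    return {"casc": k > 0.4 * k_burn_up, "ait": k > 0.6 * k_burn_up}


# Illustrative values only (alpha and lam are chosen for easy arithmetic).
teacher = ema_update([1.0, 0.0], [0.0, 1.0], alpha=0.5)
tau_next = ait_update(0.8, [0.9, 0.9, 0.9], lam=0.5)
early = active_mechanisms(10_000, 50_000)
middle = active_mechanisms(25_000, 50_000)
late = active_mechanisms(40_000, 50_000)
```

With a burn-up length of 50,000 iterations, this schedule leaves the first 20,000 iterations on static confidence filtering, enables CASC afterwards, and starts adapting $\tau_{cls}$ only after iteration 30,000.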

2.5. Dataset Description and Implementation Details

To evaluate the cross-domain transferability of the proposed DEMC, this paper conducts comprehensive experiments on four benchmark datasets for ship detection. These include two optical archives (HRSC2016, ShipRSImageNet) and two SAR repositories (SSDD, HRSID).

HRSC2016 [40]: This benchmark is specifically curated for object localization. It comprises 1061 optical images acquired from Google Earth, featuring extreme variations in image dimensions (ranging from $300 \times 300$ to $1500 \times 900$ pixels). With nearly 3000 ship instances distributed across complex offshore and inshore environments, this entire dataset is utilized as a foundational source domain.

ShipRSImageNet [41]: A large-scale, multi-source library incorporating 3435 samples captured under diverse sensor types and seasonal weather conditions. Following standard preprocessing, all patches are resized to $930 \times 930$ pixels. This paper aggregates the training and validation subsets to establish a robust and diverse optical source distribution.

SSDD [9]: Representing a classic SAR benchmark, SSDD contains 1160 images derived from RadarSat-2, TerraSAR-X, and Sentinel-1. It encompasses multi-scale resolutions and diverse coastal topographies (e.g., ports, islands, and open sea). The dataset is officially partitioned into 928 training and 232 testing units.

HRSID [42]: To validate the model’s performance in high-density and high-resolution scenarios, this paper utilizes HRSID. Constructed from Sentinel-1B, TerraSAR-X, and TanDEM-X satellites, this dataset includes 5604 images at a uniform size of $800 \times 800$ pixels. It provides 16,951 precisely annotated ship instances under varying polarization modes.

In our experimental configuration, the primary object detector is a Faster R-CNN with a ResNet-50 backbone, implemented end-to-end in PyTorch. The DBDA module is pre-trained for 5000 epochs so that the synthesized intermediate samples align closely with authentic SAR imagery in terms of physical textures. To maintain consistent feature scales across datasets, all input images are resampled to $640 \times 640$ pixels. All experiments run on a single NVIDIA GeForce RTX 3090 GPU with a batch size of 4. Training is partitioned into two complementary phases of 50,000 iterations each. During the initial Burn-in Stage, the model prioritizes knowledge transfer from optical representations to the target domain, with the student network’s learning rate initialized at $4 \times 10^{-2}$. The subsequent Burn-up Stage performs the Unsupervised Domain Adaptation task, in which the teacher network is updated via an Exponential Moving Average (EMA) with a smoothing coefficient of $\alpha = 0.9996$.

Regarding the hyperparameters of the DEMC framework, the baseline thresholds for classification ($\tau_{cls}$) and localization ($\tau_{loc}$) are both set to 0.8. The auxiliary verification components, designed to mine latent true positives and reinforce pseudo-label robustness, are activated at the 40% mark of the burn-up stage. To adapt to evolving pseudo-labels later in training, the AIT mechanism updates $\tau_{cls}$ with a smoothing coefficient of $\lambda = 5 \times 10^{-2}$ after 60% of the burn-up stage has elapsed. This staged activation is intended to maintain a balance between detection precision and recall throughout the cross-modal learning process.
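For readers who wish to reproduce the setup, the hyperparameters stated above can be collected in a single configuration object. The sketch below is only one possible organization: the class name `DEMCConfig` and all field names are hypothetical, while the values are those reported in this subsection.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DEMCConfig:
    """Hyperparameters reported in Section 2.5 (field names are illustrative)."""
    image_size: int = 640          # inputs resampled to 640 x 640
    batch_size: int = 4
    burn_in_iters: int = 50_000
    burn_up_iters: int = 50_000
    student_lr: float = 4e-2       # initial learning rate of the student
    ema_alpha: float = 0.9996      # Teacher EMA smoothing coefficient
    tau_cls: float = 0.8           # baseline classification threshold
    tau_loc: float = 0.8           # baseline localization (IoU) threshold
    ait_lambda: float = 5e-2       # AIT smoothing coefficient
    casc_start_frac: float = 0.4   # mutual verification enabled after 40%
    ait_start_frac: float = 0.6    # adaptive thresholding enabled after 60%

    @property
    def casc_start_iter(self) -> int:
        """Burn-up iteration after which CASC is engaged."""
        return round(self.casc_start_frac * self.burn_up_iters)

    @property
    def ait_start_iter(self) -> int:
        """Burn-up iteration after which AIT is engaged."""
        return round(self.ait_start_frac * self.burn_up_iters)


cfg = DEMCConfig()
total_iters = cfg.burn_in_iters + cfg.burn_up_iters
```

Under these settings the full schedule spans 100,000 detector iterations, with CASC and AIT engaged after burn-up iterations 20,000 and 30,000, respectively.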

In terms of computational complexity, it is important to distinguish between the offline adaptation phase and the online inference phase. During offline domain alignment, the DBDA module introduces the major additional cost because the SD-Turbo-based generator contains 1.528 B parameters and requires 19.15 TFLOPs. However, this cost is incurred only once before detector adaptation: each optical source image is translated into the intermediate domain and then cached for subsequent detector training. Therefore, DBDA is not executed during online inference.

During the adaptation stage, the DSMV architecture contains the Teacher, Main Student, and Proxy Student branches, increasing the training-time parameter scale to 434.17 M and requiring additional forward passes for pseudo-label verification. This additional cost is used only to improve pseudo-label reliability during training. During deployment, both the Teacher and Proxy Student are discarded, and only the Main Student detector is retained. The final inference model contains 165.18 M parameters and requires 427.73 GFLOPs per inference pass, which is comparable to the standard Faster R-CNN detector with a ResNet backbone.

This design reflects an efficiency-performance trade-off: DEMC introduces additional offline training and generation costs to obtain more reliable SAR pseudo-labels and stronger cross-domain generalization, but it does not increase the deployed inference pipeline. Therefore, the framework is more suitable for offline model adaptation followed by operational deployment, rather than real-time on-board domain adaptation.

2.6. Evaluation Metrics

To rigorously assess the cross-domain efficacy of the DEMC framework, we employ standard object detection protocols. Given the inherent challenges of SAR imagery, specifically intense background clutter and faint target scattering, we utilize Precision (P), Recall (R), and Average Precision (AP) to quantify the model’s resilience to modality shifts.

Precision and Recall. These twin metrics evaluate the detector’s operational limits in high-clutter environments. Precision quantifies the model’s false-alarm rejection capability, measuring the proportion of genuine targets among all positive predictions. Recall, conversely, assesses the sensitivity to weak scatterers, representing the proportion of ground-truth targets successfully recovered by the model. They are formulated as:

$$P = \frac{TP}{TP + FP}, \quad R = \frac{TP}{TP + FN}, \tag{16}$$

where $TP$ (True Positives), $FP$ (False Positives), and $FN$ (False Negatives) denote correctly localized ships, background clutter incorrectly detected as ships, and undetected targets, respectively. Adhering to the Horizontal Bounding Box protocol, a prediction is counted as a True Positive if its Intersection-over-Union (IoU) with the ground truth exceeds the $0.5$ threshold.

Average Precision (AP50). As Precision and Recall typically trade off against each other, pointwise estimates provide an incomplete performance picture. To capture the global performance dynamic across varying confidence thresholds, we compute the Average Precision ($AP$), which corresponds to the area under the Precision-Recall (P-R) curve:

$$AP = \int_{0}^{1} P(R) \, dR. \tag{17}$$

Specifically, we report $AP_{50}$ (calculated at an IoU threshold of 0.5) following the PASCAL VOC benchmark. This serves as a robust holistic indicator of both classification accuracy and localization precision during the optical-to-SAR adaptation process.
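Equations (16) and (17) can be evaluated numerically as sketched below. The code assumes that each detection has already been matched against the ground truth at IoU 0.5 and is given as a `(score, is_true_positive)` pair; the function names are hypothetical, and the all-point interpolation of the P-R curve is one common VOC-style choice rather than a detail confirmed by the text. The F1-score, which is reported in the results but not defined in this subsection, is included using its standard definition as the harmonic mean of Precision and Recall.

```python
def precision_recall(tp, fp, fn):
    """Eq. (16): Precision and Recall from detection counts."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return p, r


def f1_score(p, r):
    """Harmonic mean of Precision and Recall."""
    return 2 * p * r / (p + r) if p + r else 0.0


def average_precision(detections, num_gt):
    """Eq. (17): area under the P-R curve (all-point interpolation).

    detections: list of (score, is_tp) pairs; num_gt: number of
    ground-truth objects (assumed to be positive).
    """
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    tp = fp = 0
    recalls, precisions = [], []
    for _, is_tp in ranked:
        if is_tp:
            tp += 1
        else:
            fp += 1
        recalls.append(tp / num_gt)
        precisions.append(tp / (tp + fp))
    # Replace each precision by the maximum precision at any higher recall.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    ap, prev_recall = 0.0, 0.0
    for r, p in zip(recalls, precisions):
        ap += (r - prev_recall) * p  # rectangle between consecutive recalls
        prev_recall = r
    return ap


# Toy example: three ranked detections against two ground-truth ships.
ap50 = average_precision([(0.9, True), (0.8, False), (0.7, True)], num_gt=2)
p, r = precision_recall(tp=8, fp=2, fn=2)
f1 = f1_score(0.605, 0.512)  # Precision and Recall of the full model in Section 3.2.1
```

As a consistency check, the Precision of 60.5% and Recall of 51.2% reported for the full model on HRSC2016→SSDD correspond to an F1-score of roughly 55.5%.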

3. Results

3.1. Comparison Experiments

To validate the effectiveness and superiority of the DEMC framework for cross-domain object detection from optical to Synthetic Aperture Radar (SAR), we conduct a comprehensive comparison of our proposed method with several state-of-the-art domain-adaptive object detection approaches. These comparisons are performed on the HRSC2016, ShipRSImageNet, SSDD, and HRSID datasets.

Source-Only Baseline (Lower Bound): To quantify the magnitude of the domain shift, we establish a theoretical lower bound using a standard Faster R-CNN detector. This model is trained exclusively on the optical source domain and directly evaluated on the SAR target domain without any adaptation, explicitly demonstrating the severe performance degradation caused by cross-modal discrepancies.

DA-Faster [43]: This pioneering work in domain-adaptive object detection addresses feature distribution alignment by introducing adversarial discriminators at both the image and instance levels. It represents a classic approach to feature-level alignment.

SWDA [44]: This method proposes a strong-weak distribution alignment strategy that enhances robustness in complex backgrounds by strongly aligning local features while weakly aligning global features. This approach is particularly relevant for optical-to-SAR transfer tasks with significantly different textures, and it represents the performance upper bound for traditional adversarial transfer learning.

HTCN [45]: This method balances transferability and discriminability by introducing interpolated images as intermediate domains to assist in feature alignment. It represents a more refined strategy for feature decoupling and alignment.

SSDA-YOLO [46]: Although the original paper targets semi-supervised scenarios, its core approach, self-training with pseudo-labels, is highly relevant to unsupervised settings. We adapted this method to an unsupervised self-training mode for comparison, where it represents the pseudo-label-based self-training strategy.

Table 1 comprehensively reports the quantitative adaptation performance of the proposed DEMC framework across four benchmark cross-domain tasks. The Source-only baseline, trained exclusively on optical data (HRSC2016 or ShipRSImageNet) and tested directly on SAR imagery without adaptation, suffers significant performance degradation, highlighting the substantial modality gap between optical and SAR domains. Traditional feature-alignment models, such as DA-Faster and SWDA, show only modest improvements over this baseline. For instance, in the HRSC2016→SSDD task, SWDA increases the AP50 score from 29.0% to only 34.5%. This limited gain illustrates the challenges posed by fundamental differences in imaging mechanisms, as traditional methods struggle to capture the weak backscattering signatures inherent in SAR data.

In contrast, the DEMC framework achieves substantial improvements across all datasets, with AP50 scores of 48.5%, 50.6%, 51.2%, and 53.4% across the four respective tasks. Notably, compared to the state-of-the-art SSDA-YOLO, DEMC delivers an average F1-score improvement of 3.2% and an average AP50 gain of 3.7%. These results demonstrate that integrating diffusion-based generative alignment with mutual learning effectively mitigates the domain shift that traditional methods fail to overcome.

Importantly, DEMC exhibits a significant improvement in Recall across all transfer tasks, achieving a maximum Recall of 54.5%. This indicates that the proposed method significantly reduces false negatives (omission errors), particularly in SAR images with high background clutter. While Precision experiences a slight decrease in some scenarios, this reflects the classic Precision-Recall trade-off in object detection. From a practical standpoint, particularly in maritime surveillance and military reconnaissance, exhaustive target retrieval is more critical than achieving absolute precision. By reducing missed detections and mitigating background interference, DEMC offers considerable value for zero-tolerance reconnaissance operations.

Figure 3 visualizes the qualitative detection results for the HRSC2016→SSDD and HRSC2016→HRSID tasks. We selected scenes characterized by complex backgrounds and varying target sizes to test model robustness. Visual inspection reveals that while the baseline models successfully detect large targets in isolated areas, they consistently fail in more complex coastal environments, resulting in false negatives and misclassifications. Conversely, DEMC successfully localizes small targets, even under high-noise conditions, and maintains high precision in challenging coastal environments. Furthermore, DEMC maintains its performance when transitioning from the SSDD to the more challenging HRSID domain, demonstrating that the DBDA module effectively bridges the semantic gap via physics-aware synthesis, while the DSMV mechanism reduces background clutter and improves target detection under severe domain shifts. The qualitative observations in Figure 3 are consistent with the quantitative results in Table 1. For the HRSC2016→SSDD task, DEMC improves AP50 from 43.2% to 48.5% and Recall from 41.5% to 51.2% compared with SSDA-YOLO. For the HRSC2016→HRSID task, DEMC improves AP50 from 48.9% to 51.2% and Recall from 46.2% to 53.6%. These gains explain the reduced number of red boxes in Figure 3, showing that DEMC is particularly effective in suppressing false negatives and recovering faint SAR targets.
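The per-task gains quoted above can be tabulated directly from the reported figures. The snippet below is a small arithmetic aid that uses only the SSDA-YOLO and DEMC values stated in this paragraph; the dictionary layout and the name `gains` are illustrative.

```python
# (SSDA-YOLO, DEMC) values quoted in the text, in percent.
reported = {
    "HRSC2016->SSDD": {"AP50": (43.2, 48.5), "Recall": (41.5, 51.2)},
    "HRSC2016->HRSID": {"AP50": (48.9, 51.2), "Recall": (46.2, 53.6)},
}


def gains(table):
    """Absolute improvement in percentage points for every task and metric."""
    return {
        task: {name: round(new - old, 1) for name, (old, new) in metrics.items()}
        for task, metrics in table.items()
    }


deltas = gains(reported)
```

This gives AP50 gains of 5.3 and 2.3 percentage points and Recall gains of 9.7 and 7.4 percentage points on the two tasks. In both cases the Recall gain exceeds the AP50 gain, which is consistent with the observation that the improvement comes mainly from recovering previously missed targets.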

3.2. Ablation Experiment

3.2.1. Component Analysis

To quantify the independent contributions of the proposed DEMC modules, a comprehensive ablation study was conducted on the HRSC2016→SSDD benchmark. As shown in Table 2, comparing different component combinations reveals their synergistic impact on cross-domain performance. The first row evaluates the standalone impact of DBDA. With only DBDA, the model achieves an AP50 of 38.6%, which is 9.6 percentage points higher than the Source-only baseline of 29.0%. The Diffusion-Based Domain Alignment module serves as the cornerstone of the framework; omitting DBDA leads to a significant $AP_{50}$ degradation from $48.5\%$ to $40.5\%$. This substantial drop highlights the importance of diffusion-based generative techniques in bridging the modality gap between optical and radar sensors.

Moreover, introducing the Dual-Student Mutual Verification mechanism on the DBDA-only setting greatly enhances recall, particularly for targets submerged in complex background clutter. This elevates the recall metric from $38.2\%$ to $45.4\%$. When all components are integrated, the Cross-Agent Spatial Consensus and Adaptive Thresholding modules reduce pseudo-label noise by enforcing spatial consistency, leading to a Precision of 60.5% and a Recall of 51.2%. Ultimately, the complete DEMC framework achieves optimal robustness by coupling pixel-level physical alignment with instance-level mutual verification.

3.2.2. Efficacy of DBDA in Mitigating Instance-Level Modality Gap

To provide a tangible assessment of the DBDA module’s capability in harmonizing heterogeneous modalities, we scrutinized the instance-level feature manifolds utilizing t-SNE dimensionality reduction. The visualization in Figure 4 offers a compelling contrast between the raw domain shift and the aligned feature space.

The left panel of Figure 4 exposes the fundamental challenge inherent in direct transfer. We observe a pronounced segregation where the optical source instances (gray) and the target SAR instances (red) occupy mutually exclusive regions. This disjoint distribution underscores the severe “feature alienation” caused by the fundamental disparity between optical reflectance and SAR coherent scattering mechanisms. Without intervention, the detector trained on optical data fails to generalize to the SAR domain due to this vast divergence.

Conversely, the right panel illustrates the transformative impact of the DBDA module. A distinct phenomenon emerges: the generated intermediate features (green) do not merely approximate but actively interlace with the real SAR distribution (red), exhibiting a high degree of spatial overlap. This assimilation corroborates that DBDA successfully injects SAR-specific physical textures—such as speckle patterns and strong scattering points—while rigorously retaining the structural integrity of the source objects. By effectively compressing the manifold distance between the source and target domains, the proposed module empowers the detector to cultivate robust, domain-invariant representations that bridge the modality gap.

From a distributional perspective, the baseline t-SNE map shows low cross-domain density overlap: optical instances form compact gray clusters that are spatially separated from the red SAR clusters. After DBDA, the intermediate-domain features are no longer isolated from the SAR manifold; instead, they are dispersed around and interlaced with SAR samples. This qualitative density change indicates that DBDA not only shifts the global source-domain centroid toward the target domain but also improves local neighborhood consistency at the instance level.

3.2.3. Joint Sensitivity and Pseudo-Label Quality Analysis

Beyond independent component analysis, we scrutinized the interplay between the classification safety margin ($\tau_{cls}$) and the localization tolerance ($\tau_{loc}$), as visualized in Figure 5. This analysis exposes a critical dependency between the two parameters for ensuring pseudo-label integrity.

Figure 5a maps the $AP_{50}$ sensitivity, revealing that performance gains do not simply follow threshold increments. Instead, the results favor a synchronized configuration where $\tau_{cls}$ and $\tau_{loc}$ converge at 0.8. Specifically, fixing $\tau_{loc}$ at 0.8 serves as a spatial filter, effectively stripping away background clutter and SAR speckle noise that accidentally overlap with weak predictions. The classification threshold, however, requires a delicate balance: values below 0.7 introduce confirmation-bias errors, whereas values exceeding 0.9 inadvertently discard valid but difficult samples. Consequently, the maximal $AP_{50}$ of $48.5\%$ is realized only when classification confidence is cross-verified by spatial consensus, with both thresholds set to 0.8.

The temporal efficacy of the proposed method is captured in Figure 5b. While the Burn-in Stage yields uniform gains across all variants, the Burn-up Stage highlights the limitation of static thresholding, where the baseline (gray line) suffers from performance saturation. Conversely, our framework (red line) avoids this stagnation, and we attribute its sustained growth to the AIT mechanism. By dynamically steering $\tau_{cls}$ based on the student network’s learning statistics, AIT effectively balances the precision-recall trade-off. This adaptive regulation is crucial for recovering targets in noisy SAR environments, preventing the model from collapsing into a local optimum driven by easy samples.

The individual roles of CASC and AIT can be further interpreted from their different positions in the pseudo-label refinement process. CASC provides spatial-level verification: a low-confidence Teacher proposal is retained only when the Proxy Student predicts the same class at an overlapping location, i.e., when the IoU exceeds $\tau_{loc}$. Therefore, CASC mainly suppresses geometrically inconsistent background clutter and reduces confirmation-bias noise. In contrast, AIT provides score-level curriculum control by dynamically updating $\tau_{cls}$ according to the evolving prediction statistics. As shown in Figure 5b, the variant without AIT still uses the spatial verification mechanism but relies on a fixed confidence threshold, leading to earlier performance saturation. The full DEMC framework achieves more stable AP50 growth, indicating that AIT helps recover hard positive SAR targets after the detector becomes more reliable. Thus, CASC and AIT are complementary: CASC controls geometric reliability, whereas AIT controls the adaptive admission of pseudo-labels during training.

4. Discussion

The experimental results presented in Section 3 validate the efficacy of the proposed DEMC framework in bridging the substantial modality gap between optical and SAR domains. To provide a comprehensive understanding of the framework’s performance, we delve into the underlying mechanisms of physics-aware alignment, the robustness of mutual verification, and the inherent limitations of the current approach.

4.1. Mechanism of Physics-Aware Domain Alignment

One of the key findings from the ablation study (Table 2) is that the DBDA module plays a crucial role in the framework’s success. Removing DBDA resulted in a significant performance drop, with $AP_{50}$ decreasing from 48.5% to 40.5%. Unlike traditional GAN-based translation methods, such as CycleGAN, which often suffer from geometric distortion or “hallucinations” when generating small targets, our Diffusion-based approach leverages Zero-Conv skip connections. This architectural choice is pivotal: it ensures that the generated intermediate images ($x_{mid}$) maintain pixel-level semantic consistency with the source optical images ($x_{s}$), effectively decoupling “content” from “style”.

Furthermore, the t-SNE visualization (Figure 4) empirically confirms that the physics-aware refinement (Spectral Consistency, NLM, CLAHE) successfully shifts the feature distribution of optical data towards the SAR manifold. By simulating coherent speckle noise and metallic scattering characteristics, DBDA enables the detector to learn domain-invariant features before seeing any real SAR data, significantly lowering the difficulty of the subsequent adaptation task.

Robustness Against Distorting Artifacts: Importantly, the architectural design of the DBDA module inherently equips the framework with strong robustness against common distorting artifacts encountered in harsh remote sensing environments. First, against affine transformations and geometric distortions, the Zero-Conv skip connections strictly anchor the spatial features, ensuring that the structural integrity and bounding box coordinates of the source domain are preserved without warping during diffusion. Second, against low contrast, the localized enhancement mechanism of CLAHE dynamically rescues faint, low-visibility targets by amplifying their discrete specular highlights against dark backgrounds. Finally, against blur and uniform noise, the NLM denoising step suppresses non-structural interference while preserving the sharp reflective boundaries of metallic vessels. Together, these physics-aware refinement steps ensure that the intermediate domain is not only visually realistic but also structurally resilient.

4.2. Mitigating Confirmation Bias via Spatial Consensus

A persistent challenge in Unsupervised Domain Adaptation (UDA) is confirmation bias, where the model amplifies its errors due to noisy pseudo-labels. This issue is particularly pronounced in Synthetic Aperture Radar (SAR) ship detection, where small, weak targets are often misclassified as background clutter due to low confidence scores. Existing dual-teacher or co-training UDA methods in remote sensing and computer vision often utilize auxiliary branches for feature mining but predominantly rely on confidence-based validation such as averaging scores or applying static high-confidence thresholds. However, this standard approach is ineffective in SAR imagery, where heavy speckle noise consistently suppresses the raw confidence scores of true targets, leading to the loss of critical supervisory signals.

Our DSMV framework, particularly the Cross-Agent Spatial Consensus (CASC) strategy, fundamentally transforms the pseudo-labeling paradigm. In contrast to traditional co-training frameworks, where an auxiliary branch merely serves as a secondary confidence scorer, our Proxy Student functions as an independent geometric validator. The key innovation of CASC is its shift from using raw confidence to IoU-based spatial consensus as the verification criterion. Rather than relying solely on confidence thresholds, CASC validates predictions based on the geometric overlap between the Teacher network and the Proxy Student. As illustrated in the qualitative results (Figure 3), this mechanism enables DEMC to recover “hard positives,” which are targets that are visually distinct but have low confidence. Ultimately, this spatial consensus allows DSMV to achieve a capability that standard Mean Teacher methods cannot: the successful recovery of faint, structurally intact targets that are typically discarded by confidence-based filters, significantly enhancing recall in complex coastal backgrounds.

4.3. Dynamics of Adaptive Thresholding

The convergence curves (Figure 5) illustrate that the Adaptive Thresholding (AIT) mechanism is crucial during the “Burn-up” stage. Static thresholds often lead to performance saturation: thresholds that are too high filter out valid targets (low recall), while those that are too low introduce noise (low precision). By dynamically decaying the classification threshold $\tau_{cls}$ based on the statistical distribution of the student network’s predictions, AIT allows the model to prioritize precision early in training and gradually shift towards higher recall in later stages. This dynamic curriculum learning strategy ensures that the pseudo-labels evolve stably, preventing the model from collapsing due to noise accumulation in the early training stages.

While DEMC significantly enhances recall, which is crucial for zero-tolerance reconnaissance, this inherently introduces a trade-off with precision. In certain operational scenarios, such as automated port traffic management or routine resource monitoring, prioritizing higher recall might introduce operational risks by overwhelming human operators or downstream tracking systems with false alarms. Fortunately, the DEMC framework is highly adaptable to varied mission requirements. For applications demanding extreme precision, operators can readily fine-tune the system by elevating the foundational classification threshold ($\tau_{cls}$) or the spatial consensus threshold ($\tau_{loc}$) within the DSMV module. Furthermore, restricting the AIT smoothing coefficient $\lambda$ maintains a stricter confidence barrier throughout the training process, thereby suppressing false positives and shifting the model’s operational focus toward high precision.

4.4. Limitations and Future Directions

Despite the state-of-the-art performance, the proposed DEMC framework has limitations that warrant further investigation. First, while the framework significantly improves recall, the absolute F1-score remains constrained (currently peaking at 58.4%). This performance ceiling is inherently limited by the extreme background clutter and low Signal-to-Noise Ratio (SNR) characteristic of SAR imagery, coupled with the absence of target-domain annotations in the UDA setting. Regarding the operational limits in these noisy environments, the DEMC model operates effectively as long as the targets present marginally distinguishable structural contours or corner reflector scattering against the background. Under such low-SNR conditions, the CASC mechanism successfully recovers the targets. However, in ultra-low SNR conditions (e.g., severe sea state clutter completely masking the scattering signatures of small vessels), the spatial consensus process lacks sufficient visual cues, leading to performance degradation. Breaking this performance bottleneck may depend on increasing the volume and diversity of training data. Interestingly, our experimental results across the four benchmark tasks (Table 1) already empirically validate this trajectory: models trained with the larger ShipRSImageNet source domain (3435 images) consistently yielded higher F1-scores than those trained with the smaller HRSC2016 dataset (1061 images) across both SAR target domains. Therefore, a direct and imperative pathway for future work to elevate the F1-score to a more reliable performance level is to substantially scale up the volume, diversity, and modality of the foundation training datasets.

Second, the computational overhead of the DBDA module is non-negligible. Although SD-Turbo is used to accelerate generation, the diffusion process is still more computationally intensive than GANs or direct feature alignment, potentially limiting its use in real-time on-board processing. Third, while the physics-aware refinement improves visual realism, it is currently a hand-crafted post-processing step. Future work could explore integrating these physical constraints (e.g., Rayleigh scattering models) directly into the loss function of the diffusion model for fully end-to-end training. Lastly, our experiments focused on the ship category in offshore and coastal scenes. Future work will investigate extending DEMC to other coastal object categories and broader SAR monitoring scenarios once reliable cross-modality benchmark datasets become available. The performance of DEMC in extremely dense inshore scenarios (e.g., crowded ports with significant side-lobe interference) also remains an open challenge for future research.
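
To give a sense of what a physics-based loss term of this kind might look like, the sketch below computes the mean negative log-likelihood of pixel amplitudes under a Rayleigh distribution, p(x) = (x / σ²) exp(−x² / (2σ²)). This is only an illustrative assumption about one possible formulation, written in plain Python rather than a deep learning framework; the function names and the choice of a single global scale σ are hypothetical and are not taken from the DEMC implementation.

```python
import math
import random


def rayleigh_nll(amplitudes, sigma, eps=1e-8):
    """Mean negative log-likelihood of amplitudes under Rayleigh(sigma).
    Per sample: -log(x) + 2*log(sigma) + x^2 / (2*sigma^2)."""
    total = 0.0
    for x in amplitudes:
        x = max(x, eps)  # guard against log(0)
        total += -math.log(x) + 2.0 * math.log(sigma) + x * x / (2.0 * sigma * sigma)
    return total / len(amplitudes)


def rayleigh_scale_mle(amplitudes):
    """Closed-form maximum-likelihood scale: sqrt(mean(x^2) / 2)."""
    return math.sqrt(sum(x * x for x in amplitudes) / (2.0 * len(amplitudes)))


def sample_rayleigh(sigma, n, seed=0):
    """Draw n Rayleigh(sigma) samples by inverse-transform sampling."""
    rng = random.Random(seed)
    return [sigma * math.sqrt(-2.0 * math.log(1.0 - rng.random())) for _ in range(n)]


if __name__ == "__main__":
    pixels = sample_rayleigh(sigma=2.0, n=10000)
    sigma_hat = rayleigh_scale_mle(pixels)
    # The NLL is lowest at the fitted scale and grows for mismatched scales.
    print(sigma_hat, rayleigh_nll(pixels, sigma_hat), rayleigh_nll(pixels, 1.0))
```

In an end-to-end setting, a term like this would penalize generated background regions whose amplitude statistics depart from the assumed speckle model; how to weight it against the diffusion objective is left open here.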

5. Conclusions

Cross-domain object detection from optical to Synthetic Aperture Radar (SAR) imagery presents a significant challenge due to the inherent discrepancies in imaging mechanisms. To address this modality gap, we propose the Diffusion-enhanced Mutual Consistency (DEMC) framework for Unsupervised Domain Adaptation (UDA). Unlike conventional feature alignment methods, DEMC initiates adaptation at the pixel level by leveraging diffusion models to generate a physics-aware intermediate domain. This approach effectively neutralizes severe visual and statistical shifts between the source and target domains. However, residual domain discrepancies inevitably lead to pseudo-label noise. To mitigate this, we introduce the Dual-Student Mutual Verification (DSMV) mechanism, guided by Cross-Agent Spatial Consensus (CASC) and Adaptive Thresholding (AIT). DSMV dynamically filters confirmation bias and refines pseudo-labels through geometric constraints between cross-network predictions. Extensive benchmark experiments demonstrate that DEMC sets a new state-of-the-art standard in cross-domain object detection. Quantitatively, the framework achieves peak AP50 scores of 48.5%, 50.6%, 51.2%, and 53.4% across the HRSC2016→SSDD, ShipRSImageNet→SSDD, HRSC2016→HRSID, and ShipRSImageNet→HRSID tasks, respectively. Furthermore, DEMC significantly improves recall, with a maximum recall gain of over 15% compared to the source-only baseline. It also achieves an absolute F1-score of 58.4% under extreme unsupervised conditions. Ablation studies confirm that the synergistic coupling of physical alignment and mutual consistency learning yields unparalleled robustness against severe background clutter and faint target scattering. Ultimately, DEMC establishes a highly reliable paradigm for detecting weak and small targets, advancing the capabilities of all-weather maritime surveillance systems.

Author Contributions

Conceptualization, C.L. and Y.Z.; methodology, C.L., J.G. and Y.Z.; validation, G.Z., H.Y. and X.N.; writing—original draft preparation, C.L. and Y.Z.; supervision, Y.Z. and P.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China, grant numbers 61991420 and 61991421.

Data Availability Statement

The HRSID dataset is available at https://github.com/chaozhong2010/HRSID (accessed on 17 November 2025). The HRSC2016 dataset is available at https://link.zhihu.com/?target=https//sites.google.com/site/hrsc2016/ (accessed on 17 November 2025). The SSDD dataset is available at https://github.com/TianwenZhang0825/Official-SSDD (accessed on 17 November 2025). The ShipRSImageNet dataset is available at https://github.com/zzndream/ShipRSImageNet (accessed on 17 November 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

AIT	Adaptive Thresholding
AP50	Average Precision at IoU = 0.5
CASC	Cross-Agent Spatial Consensus
CLAHE	Contrast-Limited Adaptive Histogram Equalization
DBDA	Diffusion-Based Domain Alignment
DEMC	Diffusion-Enhanced Mutual Consistency
DSMV	Dual-Student Mutual Verification
EMA	Exponential Moving Average
IoU	Intersection over Union
LoRA	Low-Rank Adaptation
NLM	Non-Local Means
SAR	Synthetic Aperture Radar
SNR	Signal-to-Noise Ratio
UDA	Unsupervised Domain Adaptation
VAE	Variational Autoencoder


Figure 1. Architecture of the DEMC framework. Green, black, and blue arrows denote supervised data flow, unsupervised data flow, and negative feedback, respectively. (1) DBDA: Transforms optical images into a SAR-like intermediate domain (x_{mid}) by integrating physical textures while preserving geometric structures. (2) DSMV: A collaborative Teacher-Student paradigm. To minimize computational overhead, the Proxy Student shares the backbone with the Main Student, functioning via an independent Mining Agent Head to validate low-confidence proposals. (3) AIT: Dynamically calibrates classification thresholds to generate high-quality pseudo-labels for the target SAR domain (x_{t}).

Figure 2. Visual examples of the DBDA-generated intermediate domain. The first column displays the original optical images (Source); the second column presents the generated intermediate domain images which bridge the modality gap; the third column shows the corresponding SAR images (Target) for reference.

Figure 3. Visual comparison of cross-domain ship detection. The first three rows show the HRSC2016→SSDD task, and the last three rows show the HRSC2016→HRSID task. (a) Ground Truth, (b) Source-only, (c) SWDA, (d) SSDA-YOLO, and (e) Proposed DEMC. Bounding box color codes: Green denotes true positives (correct detections), Red denotes false negatives (omission errors), and Blue denotes false positives (false alarms).

Figure 4. Red dots denote real SAR instances. In (a), gray dots denote original optical instances, which form several separated clusters away from the SAR distribution, indicating a clear modality gap. In (b), green dots denote DBDA-generated intermediate-domain instances, which show stronger overlap and local interlacing with SAR instances. The increased distributional overlap suggests that DBDA reduces the instance-level feature discrepancy between optical and SAR domains.

Figure 5. Joint sensitivity analysis and convergence performance. (a) The heat map illustrates the synergistic effect of classification threshold τ_{cls} and localization threshold τ_{loc} on detection performance. (b) The convergence curves demonstrate that the full DEMC framework significantly outperforms the baseline and the version without AIT, particularly showing stable AP50 gains during the Burn-up stage through dynamic pseudo-label refinement.

Table 1. Cross-Domain Object Detection Performance Comparison on Four Benchmark Tasks.

Task	Method	F1-Score (%)	Recall (%)	Precision (%)	AP50 (%)
HRSC2016→SSDD	Source-only	34.9	25.3	56.2	29.0
	DA-Faster	42.6	34.5	55.8	36.6
	SWDA	40.3	32.8	52.1	34.5
	HTCN	44.5	36.2	57.8	40.2
	SSDA-YOLO	50.0	41.5	62.8	43.2
	DEMC [Ours]	55.5	51.2	60.5	48.5
ShipRSImageNet→SSDD	Source-only	37.2	27.5	57.4	30.2
	DA-Faster	42.5	35.2	53.6	36.8
	SWDA	45.1	37.9	55.8	39.5
	HTCN	48.9	41.6	59.2	42.4
	SSDA-YOLO	53.8	46.5	63.8	45.9
	DEMC [Ours]	56.6	52.4	61.5	50.6
HRSC2016→HRSID	Source-only	36.2	26.5	57.5	30.2
	DA-Faster	40.4	33.2	51.5	37.2
	SWDA	42.8	35.8	53.2	39.8
	HTCN	46.8	39.5	57.4	44.5
	SSDA-YOLO	53.8	46.2	64.5	48.9
	DEMC [Ours]	57.5	53.6	61.9	51.2
ShipRSImageNet→HRSID	Source-only	38.0	28.1	58.5	32.8
	DA-Faster	45.3	38.4	55.2	39.6
	SWDA	47.5	40.5	57.4	42.1
	HTCN	51.2	44.2	60.8	46.5
	SSDA-YOLO	56.3	49.5	65.2	50.8
	DEMC [Ours]	58.4	54.5	63.0	53.4
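
As a quick consistency check on Table 1, the F1-score can be recomputed from the reported precision and recall using the standard harmonic-mean definition, F1 = 2PR / (P + R). The snippet below is a small sketch that does this for the four DEMC rows; the dictionary layout and the 0.1-point rounding tolerance are choices made here for illustration.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


# (precision, recall, reported F1) for the DEMC rows of Table 1, in percent.
demc_rows = {
    "HRSC2016->SSDD": (60.5, 51.2, 55.5),
    "ShipRSImageNet->SSDD": (61.5, 52.4, 56.6),
    "HRSC2016->HRSID": (61.9, 53.6, 57.5),
    "ShipRSImageNet->HRSID": (63.0, 54.5, 58.4),
}

if __name__ == "__main__":
    for task, (precision, recall, reported) in demc_rows.items():
        recomputed = f1_score(precision, recall)
        matches = abs(recomputed - reported) < 0.1  # allow for rounding to one decimal
        print(f"{task}: recomputed {recomputed:.2f}, reported {reported}, consistent={matches}")
```

For example, the ShipRSImageNet→HRSID row gives 2 × 63.0 × 54.5 / 117.5 ≈ 58.44, which rounds to the reported 58.4.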

Table 2. Ablation study on the components of the DEMC framework.

	DBDA	DSMV	CASC+AIT	F1-Score (%)	Recall (%)	Precision (%)	AP50 (%)
DEMC	✓			44.9	38.2	54.5	38.6
	✓	✓		48.4	45.4	51.8	41.2
		✓	✓	48.9	42.1	58.2	40.5
	✓	✓	✓	55.5	51.2	60.5	48.5

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Luo, C.; Zhang, Y.; Guo, J.; Zhou, G.; You, H.; Li, P.; Ning, X. DEMC: A Diffusion-Enhanced Mutual Consistency Framework for Cross-Domain Object Detection in Optical and SAR Imagery. Remote Sens. 2026, 18, 1358. https://doi.org/10.3390/rs18091358
