1. Introduction
Synthetic Aperture Radar (SAR) has become an essential tool in modern ocean monitoring and military reconnaissance due to its all-weather, day-and-night imaging capabilities [1,2,3]. While data-driven approaches, particularly Convolutional Neural Networks (CNNs), have significantly advanced SAR target detection, their effectiveness is heavily reliant on large-scale, high-quality annotated datasets [4,5,6]. Unlike optical remote sensing imagery, which offers intuitive visual cues, SAR data is governed by complex coherent imaging mechanisms. The combination of non-intuitive scattering features and severe speckle noise makes precise annotation extremely challenging for non-specialists. This dependence on expert knowledge leads to prohibitive labor costs, leaving SAR datasets abundant in raw data but lacking sufficient annotations. As a result, Unsupervised Domain Adaptation (UDA), which transfers knowledge from a well-annotated optical domain (source) to an unlabeled SAR domain (target), has become a vital strategy to address the SAR data bottleneck and enable intelligent monitoring [7,8,9].
Despite the successes of UDA in computer vision, the significant modality disparity between optical and SAR domains limits the effectiveness of existing methods. For example, conventional image translation techniques, such as Generative Adversarial Networks (GANs), struggle to bridge this gap. These models often fail to preserve geometric structures while simulating realistic coherent speckle noise, resulting in synthetic data that lacks both diversity and physical realism. Furthermore, traditional self-training methods for prediction alignment rely on static thresholding for pseudo-label filtering, overlooking the dynamic uncertainty across training phases [10]. Given the faint target signatures and intense background clutter typical in SAR imagery, fixed thresholds may discard valuable hard positives with low confidence, while inaccurate high-confidence predictions can still be retained. This weakens the supervisory signals available for self-training and hinders optimization in complex SAR environments [10,11]. In the broader remote sensing community, techniques such as feature extraction from spectral images based on multi-threshold binarization have proven highly effective for improving classification accuracy and reducing sample dependency. Rusyn et al. demonstrated the effectiveness of multi-threshold binarization in extracting discriminative features with limited training samples [12]. Zhang et al. optimized multi-threshold segmentation using an improved black-winged kite algorithm to enhance segmentation precision [13]. Although these methods were developed for spectral feature extraction and segmentation rather than pseudo-label learning, they highlight the broader utility of adaptive threshold design in remote sensing.
To address these challenges, this paper presents a novel perspective aimed at improving pixel-level generation fidelity and pseudo-label robustness. At the pixel level, this paper moves beyond traditional GANs and global transformation strategies by introducing diffusion models [14,15]. These models excel at capturing complex data distributions, allowing us to generate pseudo-SAR images that adhere to SAR’s physical scattering characteristics [16]. This results in more precise domain alignment at the input level. In the instance mining stage, this paper proposes Mutual Consistency Learning, inspired by cross-view learning [17], to mitigate pseudo-label noise. The core idea is that predictions from a single network in complex SAR backgrounds are often unreliable. However, if two independent branches agree on the localization predictions for the same spatial region, it is highly likely that this region contains a true target. This spatial consensus-based approach provides a reliable foundation for detecting low Signal-to-Noise Ratio targets [18].
This paper introduces the Diffusion-enhanced Mutual Consistency Learning (DEMC) framework. To distinguish it from general machine learning image enhancement models that might share similar acronyms, note that DEMC is an architecture tailored to robust cross-domain transfer through a dual-stage collaboration:
1. To address the significant gap in texture distribution between optical and SAR images, this paper employs a one-step diffusion model to construct an intermediate domain. By introducing zero-convolution skip connections (Zero-Conv) along with spectral consistency, denoising, and contrast enhancement as physical post-processing steps, the generated images retain the semantic layout of the optical source domain while faithfully replicating the texture information of SAR targets.
2. To tackle the noise problem commonly present in pseudo-labels, a co-learning architecture is designed, comprising a teacher network, a main student, and a proxy student. By leveraging the proposed Cross-Agent Spatial Consensus (CASC) strategy, pseudo-labels are dynamically corrected through IoU consistency across multiple networks. This strategy effectively mitigates the “confirmation bias” of a single network in low-contrast SAR regions, recovering a significant number of potential true positive samples. Along with the Adaptive Threshold (AIT) mechanism, the model can autonomously optimize the training sample pool, ensuring continuous performance improvement.
Although this study focuses on the ship class due to the availability of well-established optical-to-SAR maritime benchmarks, the proposed DEMC framework is not restricted to ships at the algorithmic level. In principle, it can be extended to other coastal objects, such as offshore platforms, oil tanks, port facilities, and coastal vehicles, provided that labeled optical source data and unlabeled SAR target images are available. However, its effectiveness depends on whether the target category presents distinguishable SAR scattering structures and sufficiently clear contours; extremely dense inshore scenes or ultra-low-SNR targets may reduce the reliability of spatial consensus.
The primary contributions of this work are summarized as follows:
- (1) Novel UDA Paradigm: We propose the DEMC framework, which integrates generative alignment with mutual verification learning. This innovative approach combines the generative capabilities of diffusion models with the error-correction ability of mutual learning, effectively addressing the challenges of transferring knowledge from the optical domain to the heterogeneous SAR domain.
- (2) Physics-Aware Diffusion Synthesis: By combining one-step diffusion techniques with physics-based post-processing, we generate high-fidelity intermediate domain data with strong scattering features, significantly alleviating distribution shifts between modalities.
- (3) Robust Pseudo-Label Mining: We introduce the CASC strategy, replacing the conventional confidence threshold with spatial consensus constraints, greatly enhancing the model’s recall ability for weak targets in complex backgrounds.
- (4) Superior Experimental Performance: Experimental results on benchmark tasks such as HRSC2016, ShipRSImageNet, SSDD, and HRSID demonstrate that the proposed method achieves state-of-the-art performance in core metrics, including AP50 and F1-score.
1.1. Object Detection in SAR Images
Convolutional Neural Network (CNN)-based approaches have greatly advanced remote sensing target detection. Early studies mainly focused on adapting optical detection models, like Faster R-CNN, YOLO, and SSD, to the Synthetic Aperture Radar (SAR) domain. However, SAR imagery presents unique challenges such as speckle noise and target scale variations, necessitating additional methods. To address these, researchers have integrated attention mechanisms and multi-scale feature fusion techniques. Kang et al. [19] improved small-scale target detection through multi-layer fusion and contextual data. Cao et al. [8] developed SAR-Net, a multi-scale direction-aware attention network that improves detection across various object scales. Other research has explored dynamic multi-scale fusion with transformer-style modules to better address complex marine environments [20].
1.2. Image Translation: From GANs to Diffusion Models
Pixel-level domain adaptation seeks to bridge the modality gap by converting source-domain images to match the target domain’s style. Generative Adversarial Networks (GANs), especially CycleGAN and its variants, have been widely used for this purpose due to their ability to generate realistic remote sensing samples. However, GANs often struggle with issues such as instability, hyperparameter sensitivity, and failure to preserve geometric structures during optical-to-SAR image translation, resulting in artifacts that degrade image quality. Recently, denoising diffusion probabilistic models (DDPMs) have emerged as a promising alternative, offering superior generative quality and diversity. These models have shown success in tasks like text-to-image and 3D generation [21,22,23,24,25,26]. Zhang et al. [21] reviewed text-to-image diffusion models, while Cao et al. [22] examined generative diffusion models. Wang et al. [23] introduced ProlificDreamer for high-fidelity 3D generation, and Poole et al. [24] demonstrated 3D generation with DreamFusion. Yi et al. [25] proposed GaussianDreamer for rapid text-to-3D generation, highlighting the potential of diffusion models in complex generative tasks. For SAR image generation, the use of denoising diffusion probabilistic models has shown promising results. Specifically, ref. [27] presents a method that generates high-fidelity SAR images from limited samples, bridging the modality gap between optical and SAR domains. Nevertheless, substantial research efforts are still needed to fully explore its application in SAR image synthesis.
1.3. Unsupervised Domain Adaptation for SAR Detection
To address the challenge of limited SAR annotations, unsupervised domain adaptation (UDA) has emerged as a crucial approach. Huang et al. [28] introduced a transfer learning strategy that utilizes multi-source data to minimize discrepancies between optical and SAR domains, and between various SAR target types. Shi et al. [29] further proposed a progressive transfer strategy for ship detection from optical to SAR images, which gradually reduces the cross-domain discrepancy during adaptation. Zhang et al. [30] explored both global structural and local instance-level alignment to improve feature consistency between optical and SAR images. Yuan et al. [31] introduced a CSD module that decomposes classifiers into domain-common and domain-specific components, enhancing SAR target recognition. Zheng et al. [32] introduced a dual-teacher framework to separate cross-domain and semi-supervised tasks, reducing interference between optical and SAR supervision. Similarly, regarding the significant challenge of domain shift, Zhou et al. [33] introduced a novel technique utilizing cross-domain feature interaction and data contribution balance, which aligns with our approach of leveraging mutual verification to overcome domain-specific challenges. Finally, Han et al. [34] designed a frequency-enhanced feature alignment module to simplify the attention mechanism while capturing domain-specific information for improved SAR object detection. Moreover, phased self-training pipelines have recently demonstrated significant potential in other cross-domain remote sensing tasks, such as semantic segmentation [35,36].
Recent studies have further explored remote sensing object detection generalization from complementary perspectives. Zhang et al. [37] used Stable Diffusion and CLIP for controllable generative few-shot detection, Zhang et al. [38] introduced Fourier contour parametric learning to unify different geometric annotations, and Xie et al. [39] incorporated LLaMA-based language priors for open-vocabulary remote sensing detection. These works improve generalization through generative augmentation, geometric representation, and semantic priors. In contrast, our DEMC focuses on annotation-free optical-to-SAR adaptation, where the physical modality gap and SAR pseudo-label noise remain the main bottlenecks. In this setting, standard Mean Teacher or co-training frameworks typically rely on rigid, high-confidence score filtering to generate pseudo-labels, which often leads to severe “confirmation bias” in noisy SAR backgrounds: the model learns confidently from false positives while faint true targets are discarded. To address this issue, the Dual-Student Mutual Verification (DSMV) module of DEMC differs from traditional co-training methods by enforcing an IoU-based Cross-Agent Spatial Consensus. This mechanism allows DEMC to validate and recover geometrically consistent, low-confidence target proposals that standard confidence-based self-training methods would discard.
2. Materials and Methods
The proposed DEMC framework is specifically designed to facilitate cross-domain object detection from optical to SAR imagery, targeting the critical requirements of all-weather maritime surveillance and persistent military reconnaissance. While optical satellite imagery (the source domain) benefits from ease of acquisition and lower annotation costs, its operational efficacy is severely compromised by atmospheric attenuation, such as cloud cover and fog, as well as diurnal limitations. Conversely, SAR imaging provides robust, all-weather, and day-and-night sensing capabilities. Despite these advantages, the physical complexity of SAR scattering renders the annotation process highly specialized, incurring prohibitive temporal and economic costs.
To tackle these challenges, this paper proposes an unsupervised domain adaptation framework that leverages existing optical image datasets. This framework employs adaptive transfer learning to facilitate the migration of object detection tasks from the source domain (optical images) to the target domain (SAR images). The core concept of this framework is the generation of intermediate domain images, which allows for the transfer of knowledge from the source domain to the target domain without the need for target domain annotations. Additionally, a pseudo-label mechanism is incorporated to further enhance detection accuracy. The overall structure of the proposed framework is illustrated in Figure 1.
For clarity, the overall DEMC training pipeline can be summarized in three sequential steps. First, DBDA translates each labeled optical source image into a SAR-like intermediate image while preserving its original bounding-box annotations, thereby providing image-level modality alignment. Second, the detector is warmed up using both the original optical images and the generated intermediate-domain images. Third, unlabeled SAR images are introduced for self-training, where the Teacher provides candidate pseudo-labels, the Proxy Student verifies low-confidence proposals through CASC, and AIT dynamically adjusts the confidence threshold according to the evolving prediction statistics. Thus, DBDA addresses the image-level modality gap, whereas DSMV, CASC, and AIT jointly address pseudo-label noise at the instance level.
Given a source domain dataset $\mathcal{D}_S = \{(x_i^s, y_i^s)\}_{i=1}^{N_s}$, where $x_i^s$ represents a source domain optical image and $y_i^s$ denotes the corresponding target location bounding box and class label, and an unlabeled target domain (SAR) dataset $\mathcal{D}_T = \{x_j^t\}_{j=1}^{N_t}$, where $x_j^t$ is a single-channel SAR image in the target domain, the objective is to address the distribution shift between the source and target domains ($P_S \neq P_T$), caused by significant differences in imaging mechanisms and noise characteristics. Therefore, this paper seeks to design a model such that an object detector trained on the source domain can effectively perform inference on the target domain.
The objective is to minimize the generalization error in the target domain:
$$\min_{\theta}\ \mathbb{E}_{(x^t, y^t) \sim P_T}\left[\mathcal{L}_{det}\big(F_{\theta}(x^t), y^t\big)\right] \tag{1}$$
where $F_{\theta}$ is the detector with parameters $\theta$ and $\mathcal{L}_{det}$ denotes the object detection loss function. Since the target domain data $\mathcal{D}_T$ is unlabeled, this paper approximates this objective by generating intermediate domain data and employing pseudo-labeling techniques.
2.1. Diffusion-Based Physics-Aware Domain Alignment: DBDA
The DBDA module aims to generate an intermediate domain $\mathcal{D}_I = \{(x_i^{int}, y_i^s)\}_{i=1}^{N_s}$, where the labels $y_i^s$ originate from the source domain, and the intermediate domain images $x_i^{int}$ retain the geometric structure of the source domain while incorporating the physical texture characteristics of the target domain. The visual transition from the optical source to the SAR-like intermediate domain is illustrated in Figure 2. As shown, the DBDA module effectively bridges the modality gap: the generated images in the second column exhibit the characteristic bright-spot scattering and speckle-like textures of SAR imagery, while maintaining the precise bounding box alignment of the original optical ships. Through this module, the generated intermediate domain not only preserves the semantic information of the source domain but also integrates the physical features of the target domain. This approach effectively reduces the visual discrepancy between the source and target domains, enabling more accurate cross-domain object detection.
(1) One-Step Diffusion Backbone
To enhance generation speed and reduce computational burden, this paper adopts SD-Turbo as the generation backbone $G$. Trained via Adversarial Diffusion Distillation, this model maps random noise to images in a single forward pass. Specifically, the generator $G$ receives an optical image $x^s$ as structural conditioning, along with random noise $z$ and a text prompt $c$ (e.g., “top-down view of SAR ship with speckle noise”) as stylistic conditioning, to generate an image $\hat{x}$ in the target domain style:
$$\hat{x} = G_{\phi}(x^s, z, c) \tag{2}$$
where $\phi$ represents the generator parameters fine-tuned via Low-Rank Adaptation (LoRA).
(2) Zero-Conv Skip Connections
To ensure the generated images maintain geometric consistency, this paper introduces zero-convolution residual connections within the generative network. This architecture enables residual connections between the encoder $E$ and decoder $D$, effectively preserving the source image’s geometric information:
$$h_D^{(l)} \leftarrow h_D^{(l)} + \mathcal{Z}\big(h_E^{(l)}\big) \tag{3}$$
Here, $h_E^{(l)}$ and $h_D^{(l)}$ are the encoder and decoder features at level $l$, and $\mathcal{Z}$ denotes zero-initialized convolution, stabilizing the generation process (particularly during the early training phase) by enforcing the preservation of source image contours and positions.
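The effect of zero initialization can be illustrated with a small NumPy sketch. It is a minimal illustration under stated assumptions, not the SD-Turbo implementation: the class name `ZeroConvSkip`, the 1×1 kernel, and the feature shapes are hypothetical choices. At initialization the skip branch contributes nothing, so the pre-trained decoder is left undisturbed; encoder structure only enters once the weights move away from zero.

```python
import numpy as np


def conv1x1(features, weight, bias):
    """Apply a 1x1 convolution to a (C_in, H, W) feature map."""
    # weight has shape (C_out, C_in); bias has shape (C_out,)
    return np.einsum("oc,chw->ohw", weight, features) + bias[:, None, None]


class ZeroConvSkip:
    """Encoder-to-decoder skip whose 1x1 projection starts at exactly zero."""

    def __init__(self, channels):
        self.weight = np.zeros((channels, channels))
        self.bias = np.zeros(channels)

    def __call__(self, decoder_feat, encoder_feat):
        # Decoder features plus the (initially zero) projected encoder features.
        return decoder_feat + conv1x1(encoder_feat, self.weight, self.bias)


rng = np.random.default_rng(0)
enc = rng.normal(size=(4, 8, 8))
dec = rng.normal(size=(4, 8, 8))

skip = ZeroConvSkip(channels=4)
out_init = skip(dec, enc)        # equals dec: the skip is silent at initialization

skip.weight = 0.1 * np.eye(4)    # stand-in for weights after some training
out_trained = skip(dec, enc)     # encoder structure now flows into the decoder
```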
(3) Physics-Aware Refinement
Directly generated images often contain noise that does not align with the physical characteristics of SAR imaging. To better align generated images with SAR imaging principles, this paper employs a three-step refinement process:
Step I: Spectral Consistency
To align the image with SAR’s single-band characteristics while preserving the generative priors of the pre-trained SD-Turbo model, which is optimized for 3-channel RGB synthesis, this paper generates RGB images first rather than modifying the network’s output dimensions. The RGB images are then converted to grayscale using standard ITU-R BT.601 coefficients:
$$I_{gray} = 0.299\,R + 0.587\,G + 0.114\,B \tag{4}$$
These coefficients are derived from human visual perception standards, providing a stable, physically consistent mapping to a single-channel representation without introducing arbitrary empirical hyperparameters. This approach ensures compatibility with both the pretrained network and the subsequent physics-aware SAR refinement steps.
Step II: Non-Local Means Denoising
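To remove uniform noise introduced during processing while preserving the strong reflective edges of ships, Non-Local Means (NLM) denoising is applied:
$$\hat{I}(p) = \frac{1}{C(p)} \sum_{q \in \Omega_p} w(p, q)\, I(q)$$
where $\Omega_p$ is the search window around pixel $p$, $w(p, q)$ is a similarity-based weighting function, and $C(p) = \sum_{q \in \Omega_p} w(p, q)$ is a normalization factor.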
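The weighted average above can be made concrete with a naive NumPy sketch. It is meant only for small arrays and is not the implementation used in the paper; the Gaussian weight on mean squared patch difference and the parameter names `patch`, `search`, and `h` are assumptions chosen for illustration.

```python
import numpy as np


def nlm_denoise(img, patch=1, search=2, h=0.1):
    """Naive Non-Local Means for a 2-D float image (small inputs only)."""
    pad = patch + search
    padded = np.pad(img, pad, mode="reflect")
    out = np.zeros_like(img, dtype=float)
    rows, cols = img.shape
    for i in range(rows):
        for j in range(cols):
            ci, cj = i + pad, j + pad
            ref = padded[ci - patch:ci + patch + 1, cj - patch:cj + patch + 1]
            weighted_sum, norm = 0.0, 0.0
            for di in range(-search, search + 1):
                for dj in range(-search, search + 1):
                    qi, qj = ci + di, cj + dj
                    cand = padded[qi - patch:qi + patch + 1,
                                  qj - patch:qj + patch + 1]
                    # w(p, q): similar patches receive weights close to 1
                    w = np.exp(-np.mean((ref - cand) ** 2) / (h ** 2))
                    weighted_sum += w * padded[qi, qj]
                    norm += w          # accumulates C(p)
            out[i, j] = weighted_sum / norm
    return out


rng = np.random.default_rng(0)
noisy = 0.5 + 0.05 * rng.normal(size=(8, 8))
denoised = nlm_denoise(noisy, h=0.2)
```

Because the weights depend on patch similarity rather than spatial distance alone, pixels across a strong edge contribute little to each other, which is the property the text relies on for preserving ship boundaries.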
Step III: Contrast-Limited Adaptive Histogram Equalization
To simulate the strong scattering characteristics of metallic vessels in SAR images, this paper enhances the image using Contrast-Limited Adaptive Histogram Equalization (CLAHE). In SAR imaging mechanisms, metallic structures (e.g., ships) typically form dihedral or trihedral corner reflectors, yielding extremely intense local backscattering that appears as prominent specular highlights. Unlike global histogram equalization, which often washes out localized high-intensity points, CLAHE dynamically amplifies these discrete specular highlights while restricting noise amplification in uniform background regions (such as the sea surface). Therefore, CLAHE acts not merely as a visual enhancer, but as a physical simulator of SAR backscattering.
(4) Channel Adaptation
Since mainstream detection networks typically utilize 3-channel inputs, the single channel of the SAR image is replicated into three channels to facilitate compatibility with existing pre-trained detectors:
$$x^{int} = \mathrm{Concat}(I, I, I) \tag{7}$$
where $I$ is the refined single-channel image.
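The two purely arithmetic steps of the refinement, the BT.601 grayscale conversion of Step I and the channel replication described here, can be sketched in a few lines of NumPy. The function names and the channel-first output layout are illustrative assumptions; the NLM and CLAHE steps that sit between them are omitted.

```python
import numpy as np

# ITU-R BT.601 luma weights for R, G, B
BT601 = np.array([0.299, 0.587, 0.114])


def rgb_to_gray(rgb):
    """(H, W, 3) RGB image -> (H, W) single-channel luminance."""
    return rgb.astype(np.float64) @ BT601


def replicate_channels(gray):
    """(H, W) -> (3, H, W), repeating the single channel three times."""
    return np.repeat(gray[None, :, :], 3, axis=0)


rgb = np.zeros((2, 2, 3), dtype=np.uint8)
rgb[0, 0] = (255, 255, 255)   # white pixel
rgb[0, 1] = (255, 0, 0)       # pure red pixel

gray = rgb_to_gray(rgb)
detector_input = replicate_channels(gray)
```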
2.2. Dual-Student Mutual Verification
We design a co-learning architecture comprising a Teacher network ($F_T$), a Main Student ($F_S$), and a Proxy Student ($F_P$). The Teacher and the Main Student adopt the same Faster R-CNN detector with a ResNet-50 backbone, including a feature extractor, a region proposal network, and RoI classification/regression heads. The Teacher is initialized from the Main Student and updated only through the exponential moving average (EMA), without direct back-propagation. The Proxy Student shares the ResNet-50 feature extraction backbone with the Main Student to reduce computational overhead, but employs an independent, decoupled Mining Agent Head for classification and bounding-box regression, which generates the auxiliary supervisory signals. Therefore, DSMV introduces prediction-level diversity while maintaining a shared feature representation. During inference, only the Main Student detector is retained.
(1) Mean Teacher Update
During training, the Teacher’s parameters are updated via the exponential moving average (EMA) of the Main Student’s parameters. Specifically, the Main Student network $F_S$ updates its parameters $\theta_S$ through backpropagation, while the Teacher network $F_T$’s parameters $\theta_T$ are updated according to the following formula:
$$\theta_T \leftarrow \alpha\,\theta_T + (1 - \alpha)\,\theta_S \tag{8}$$
where $\alpha$ is the smoothing coefficient, typically set to 0.999. Importantly, only the Main Student’s parameters ($\theta_S$) are incorporated into the EMA update. The Proxy Student is deliberately designed to explore low-confidence proposals, and including its parameters in the EMA would inject excessive noise, destabilizing the Teacher network and reducing the reliability of high-confidence pseudo-labels. This design ensures that the Teacher provides stable supervisory signals while allowing the Proxy Student to mine hard examples for mutual verification.
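A minimal sketch of this update is given below, with parameters represented as a plain dictionary of scalars for readability. The parameter names are hypothetical. Iterating over the Teacher's own keys is one simple way to guarantee that the Proxy Student head never enters the average.

```python
def ema_update(teacher, student, alpha=0.999):
    """Return the Teacher parameters after one EMA step toward the student.

    Only names present in the Teacher are updated, so any extra student-side
    parameters (here, the hypothetical proxy head) are ignored.
    """
    return {name: alpha * teacher[name] + (1.0 - alpha) * student[name]
            for name in teacher}


teacher = {"backbone.w": 1.0, "head.w": 0.0}
student = {"backbone.w": 0.0, "head.w": 1.0, "proxy_head.w": 5.0}

teacher = ema_update(teacher, student)
```

With the smoothing coefficient at 0.999, each step moves the Teacher only 0.1% of the way toward the Main Student, which is why the Teacher's pseudo-labels change slowly even when the student is updated on noisy batches.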
(2) Cross-Agent Spatial Consensus (CASC)
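While high-confidence predictions from the Teacher directly supervise the Main Student, the Proxy Student acts as a validator for low-confidence proposals. The CASC strategy mines target-domain SAR images ($x^t$) by establishing geometric consistency. Specifically, if the Teacher and Proxy Student independently predict the same semantic category at overlapping locations, the prediction is verified via spatial consensus rather than raw confidence scores. This is formulated as:
$$\hat{Y}^t = \left\{ (b_i^T, c_i^T) \;\middle|\; \exists\, j:\ \mathrm{IoU}(b_i^T, b_j^P) \ge \tau_{loc},\ c_i^T = c_j^P \right\} \tag{9}$$
where $(b_i^T, c_i^T)$ and $(b_j^P, c_j^P)$ are the boxes and class labels predicted by the Teacher and the Proxy Student, respectively, and $\tau_{loc}$ is the localization threshold.
By enforcing this Intersection over Union (IoU) consensus, CASC effectively recovers latent true-positive samples that would otherwise be discarded due to low contrast.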
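The consensus test itself can be sketched in plain Python. This is an illustrative sketch rather than the authors' code: boxes are assumed to be `(x1, y1, x2, y2)` tuples, detections are `(box, label, score)` triples, and the default `tau_loc=0.8` follows the localization threshold reported in the implementation details. How the result is combined with the confidence threshold is left to the caller.

```python
def iou(a, b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def spatial_consensus(teacher_dets, proxy_dets, tau_loc=0.8):
    """Return Teacher detections confirmed by a same-class Proxy box."""
    confirmed = []
    for box, label, score in teacher_dets:
        if any(label == p_label and iou(box, p_box) >= tau_loc
               for p_box, p_label, _ in proxy_dets):
            confirmed.append((box, label, score))
    return confirmed


# Two low-confidence Teacher proposals; the Proxy agrees with only the first.
teacher_dets = [((10, 10, 50, 50), "ship", 0.35),
                ((100, 100, 140, 140), "ship", 0.40)]
proxy_dets = [((11, 10, 51, 50), "ship", 0.30)]

recovered = spatial_consensus(teacher_dets, proxy_dets)
```

In this toy case the first proposal would fail a 0.8 confidence threshold with a score of 0.35, yet it is retained because the Proxy box overlaps it with an IoU of about 0.95.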
(3) Adaptive Thresholding Correction
To adapt to the gradual improvement of the model’s capabilities during training, an adaptive thresholding mechanism (AIT) is introduced. Let the mean and standard deviation of the student network’s confidence scores in the $k$-th iteration be $\mu_k$ and $\sigma_k$, respectively. The threshold for the next phase is updated as follows:
where $\beta$ is the smoothing coefficient, typically set to 0.05.
Synergy of AIT and CASC for Noise Filtering: The framework identifies and filters noisy pseudo-labels not through an isolated binary classifier, but through the joint dual-guard mechanism of AIT and CASC. AIT acts as the score-level defense: it adjusts the confidence threshold according to the statistical distribution of predictions, raising or relaxing the confidence barrier to block obvious background clutter as the model evolves.
CASC then acts as the spatial-level defense. Because random SAR clutter lacks consistent structural geometry, it is highly improbable for both the Teacher and the Proxy Student to generate high-confidence false positives at the exact same location. Therefore, any proposal that passes the AIT confidence barrier but fails the CASC spatial agreement (IoU below the localization threshold) is treated as confirmation-bias noise and discarded. Together, the two mechanisms continuously purify the supervisory signals and prevent confirmation bias from corrupting the learning process.
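To make the score-level guard more tangible, the sketch below shows one plausible form of a statistics-driven threshold update. It is an assumption for illustration only: the target `mean + std`, the exponential smoothing form, and the clamping bounds `floor` and `ceil` are hypothetical and should not be read as the exact AIT rule of this paper. Only the smoothing coefficient of 0.05 and the use of the mean and standard deviation of student confidence scores come from the text.

```python
from statistics import mean, pstdev


def update_threshold(tau, scores, beta=0.05, floor=0.5, ceil=0.95):
    """One hypothetical adaptive-threshold step.

    tau    : current confidence threshold
    scores : confidence scores of the student's predictions in this phase
    beta   : smoothing coefficient (0.05 in the text)
    floor, ceil : illustrative safety bounds, not taken from the paper
    """
    if not scores:
        return tau                      # nothing observed, keep the threshold
    target = mean(scores) + pstdev(scores)
    new_tau = (1.0 - beta) * tau + beta * target
    return min(max(new_tau, floor), ceil)


tau = 0.8
tau_confident = update_threshold(tau, [0.9, 0.9, 0.9])   # scores above tau
tau_uncertain = update_threshold(tau, [0.2, 0.2])        # scores below tau
```

Under these assumptions a small smoothing coefficient lets the threshold drift by at most a few hundredths per update, so a single noisy batch cannot abruptly open or close the pseudo-label filter.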
2.3. Overall Loss Function
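The entire framework is optimized end-to-end via a weighted combination of source-domain supervised loss, target-domain unsupervised loss, and auxiliary proxy loss:
$$\mathcal{L}_{total} = \mathcal{L}_{sup} + \lambda_{unsup}\,\mathcal{L}_{unsup} + \lambda_{aux}\,\mathcal{L}_{aux} \tag{11}$$
where $\lambda_{unsup}$ and $\lambda_{aux}$ are the weights of the unsupervised and auxiliary terms.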
(1) Source-Domain Supervised Loss
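The supervised loss is derived from both the original optical images ($x^s$) and the synthesized intermediate images ($x^{int}$):
$$\mathcal{L}_{sup} = \mathcal{L}_{det}\big(F_S(x^s), y^s\big) + \lambda_{int}\,\mathcal{L}_{det}\big(F_S(x^{int}), y^s\big) \tag{12}$$
where $F_S$ is the Main Student and $\mathcal{L}_{det}$ denotes the standard Faster R-CNN detection objective, which is composed of a classification loss and a bounding-box regression loss:
$$\mathcal{L}_{det} = \mathcal{L}_{cls} + \mathcal{L}_{reg}$$
where $\mathcal{L}_{cls}$ is the cross-entropy loss for object category classification, and $\mathcal{L}_{reg}$ is the Smooth-$L_1$ loss for bounding-box regression computed on positive proposals. The same $\mathcal{L}_{det}$ formulation is used for both the source-domain supervised loss in Equation (12) and the target domain pseudo-label loss in Equation (13). The coefficient $\lambda_{int}$ controls the contribution of the intermediate domain.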
(2) Unsupervised Target Loss
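For the unlabeled target domain, the network learns from heavily augmented SAR images ($\tilde{x}^t$) guided by the CASC-filtered pseudo-labels ($\hat{Y}^t$):
$$\mathcal{L}_{unsup} = \mathbb{1}\big[\hat{Y}^t \neq \emptyset\big]\;\mathcal{L}_{det}\big(F_S(\tilde{x}^t), \hat{Y}^t\big) \tag{13}$$
where $F_S$ is the Main Student and $\mathbb{1}[\cdot]$ is an indicator function that ensures the pseudo-label is non-empty before computing the detection loss.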
(3) Auxiliary Proxy Loss
The auxiliary proxy loss $\mathcal{L}_{aux}$ is introduced to explicitly optimize the decoupled Mining Agent Head of the Proxy Student. Although the Proxy Student shares the feature-extraction backbone with the Main Student, its detection head is independent and is responsible for producing auxiliary predictions used in Cross-Agent Spatial Consensus. To avoid unreliable supervision from noisy low-confidence samples, $\mathcal{L}_{aux}$ is computed using the high-confidence pseudo-labels generated by the Teacher after AIT filtering. Let $\mathcal{T}_{high}$ denote the high-confidence teacher pseudo-label set:
$$\mathcal{T}_{high} = \left\{ (b_i, c_i, s_i) \;\middle|\; s_i \ge \tau_{cls} \right\}$$
where $b_i$, $c_i$, and $s_i$ represent the bounding box, category label, and confidence score of the $i$-th teacher prediction, respectively, and $\tau_{cls}$ is the current AIT-adjusted confidence threshold. The auxiliary proxy loss is formulated as:
$$\mathcal{L}_{aux} = \mathcal{L}_{det}\big(F_P(x^t), \mathcal{T}_{high}\big)$$
where $F_P$ denotes the Proxy Student and $\mathcal{L}_{det}$ consists of the standard classification and bounding-box regression losses. This term does not introduce additional target-domain annotations. Instead, it provides stable supervision for the auxiliary mining head, enabling the Proxy Student to produce discriminative and geometrically reliable proposals for CASC. During training, $\mathcal{L}_{aux}$ is weighted by $\lambda_{aux}$, and the parameters of the Proxy Student head are excluded from the EMA update of the Teacher. This design prevents noisy auxiliary mining signals from being accumulated in the Teacher while maintaining sufficient diversity for mutual verification.
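How the three terms combine for one training step can be sketched as follows. This is a schematic under stated assumptions: the per-term loss values are taken as already computed scalars, the weight names `lam_int`, `lam_unsup`, and `lam_aux` and their default values of 1.0 are placeholders, and whether the auxiliary term is also gated on a non-empty label set is not specified in the text, so it is left ungated here.

```python
def total_loss(l_src, l_int, l_tgt, l_aux, n_pseudo,
               lam_int=1.0, lam_unsup=1.0, lam_aux=1.0):
    """Combine the source, target, and auxiliary terms for one step.

    l_src, l_int : detection losses on optical and intermediate images
    l_tgt        : detection loss on augmented SAR images with pseudo-labels
    l_aux        : proxy-head loss on high-confidence Teacher labels
    n_pseudo     : number of pseudo-labels that survived filtering
    """
    supervised = l_src + lam_int * l_int
    # Indicator: the target term is dropped when no pseudo-label survives.
    unsupervised = l_tgt if n_pseudo > 0 else 0.0
    return supervised + lam_unsup * unsupervised + lam_aux * l_aux


with_labels = total_loss(1.0, 0.5, 2.0, 0.25, n_pseudo=3)
without_labels = total_loss(1.0, 0.5, 2.0, 0.25, n_pseudo=0)
```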
2.4. Optimization Schedule
To synthesize the aforementioned modules into a cohesive pipeline, the complete training procedure is formalized in Algorithm 1. This algorithm explicitly details the temporal scheduling of the framework, starting with the physics-aware domain alignment (Phase 1), proceeding to the supervised burn-in stage for foundational knowledge transfer (Phase 2), and culminating in the unsupervised burn-up stage (Phase 3), where the Dual-Student Mutual Verification and Adaptive Thresholding mechanisms are dynamically engaged to ensure robust domain adaptation.
Algorithm 1 Training Procedure of the Proposed DEMC Framework
Require: Source domain data $\mathcal{D}_S$; target domain data $\mathcal{D}_T$; diffusion generator $G$ (SD-Turbo with LoRA); Teacher $F_T$, Main Student $F_S$, Proxy Student $F_P$; hyperparameters: EMA factor $\alpha$, thresholds $\tau_{cls}$ and $\tau_{loc}$, AIT smoothing coefficient $\beta$.
Ensure: Optimized Student Model $F_S$.
Phase 1: Physics-Aware Domain Alignment (DBDA)
for each $(x^s, y^s) \in \mathcal{D}_S$ do
  Generate intermediate sample $\hat{x}$ via Equations (2) and (3);
  Apply refinement: $x^{int} \leftarrow \mathrm{Refine}(\hat{x})$ via Equations (4) and (7);
  Add to intermediate set: $\mathcal{D}_I \leftarrow \mathcal{D}_I \cup \{(x^{int}, y^s)\}$;
end for
Phase 2: Burn-in Stage (Supervised Warm-up)
for iteration $k = 1$ to $N_{in}$ do
  Sample batch from $\mathcal{D}_S \cup \mathcal{D}_I$;
  Update $\theta_S$ by minimizing $\mathcal{L}_{sup}$ via Equation (12);
end for
Initialize Teacher and Proxy: $\theta_T \leftarrow \theta_S$, $\theta_P \leftarrow \theta_S$;
Phase 3: Burn-up Stage (UDA with Mutual Verification)
for iteration $k = 1$ to $N_{up}$ do
  Sample target batch $x^t \sim \mathcal{D}_T$;
  // Teacher Prediction
  Generate proposals $P_T$ from $F_T(x^t)$;
  if $k \ge 0.4\,N_{up}$ then
    // Dual-Student Mutual Verification (DSMV)
    Generate proxy proposals $P_P$ from $F_P(x^t)$;
    Filter pseudo-labels via CASC: $\hat{Y}^t \leftarrow \mathrm{CASC}(P_T, P_P; \tau_{loc})$;
  else
    Filter pseudo-labels using only static confidence $\tau_{cls}$;
  end if
  Update $\theta_S$ and the proxy head using $\hat{Y}^t$ via Equation (11);
  if $k \ge 0.6\,N_{up}$ then
    // Adaptive Thresholding (AIT)
    Calculate statistics $(\mu_k, \sigma_k)$ of student predictions;
    Update threshold $\tau_{cls}$ via Equation (10);
  end if
  Update Teacher via EMA: $\theta_T \leftarrow \alpha\,\theta_T + (1 - \alpha)\,\theta_S$;
end for
return Final Student Model $F_S$
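The temporal gating of Phase 3 can be summarized by a small helper that reports which mechanisms are active at a given burn-up step. This is a sketch, not the training code: the function name and the dictionary keys are hypothetical, while the 40% and 60% switch-on points follow the implementation details reported for the burn-up stage.

```python
def active_mechanisms(step, n_up, dsmv_start=0.4, ait_start=0.6):
    """Report which burn-up mechanisms are enabled at a 0-based step.

    step       : current burn-up iteration
    n_up       : total number of burn-up iterations
    dsmv_start : fraction of the stage after which CASC verification is used
    ait_start  : fraction of the stage after which the threshold is adapted
    """
    progress = step / n_up
    return {
        "static_threshold_only": progress < dsmv_start,
        "dsmv_casc": progress >= dsmv_start,
        "ait": progress >= ait_start,
    }


n_up = 50_000
early = active_mechanisms(10_000, n_up)   # 20% of the stage
middle = active_mechanisms(25_000, n_up)  # 50% of the stage
late = active_mechanisms(45_000, n_up)    # 90% of the stage
```

Staggering the two mechanisms in this way means that spatial verification is already filtering pseudo-labels before the confidence threshold is allowed to move.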
2.5. Dataset Description and Implementation Details
To evaluate the cross-domain transferability of the proposed DEMC, this paper conducts comprehensive experiments on four benchmark datasets for ship detection. These include two optical archives (HRSC2016, ShipRSImageNet) and two SAR repositories (SSDD, HRSID).
HRSC2016 [40]: This benchmark is specifically curated for object localization. It comprises 1061 optical images acquired from Google Earth, featuring extreme variations in image dimensions (ranging from 300 × 300 to 1500 × 900 pixels). With nearly 3000 ship instances distributed across complex offshore and inshore environments, this entire dataset is utilized as a foundational source domain.
ShipRSImageNet [41]: A large-scale, multi-source library incorporating 3435 samples captured under diverse sensor types and seasonal weather conditions. Following standard preprocessing, all patches are resized to 930 × 930 pixels. This paper aggregates the training and validation subsets to establish a robust and diverse optical source distribution.
SSDD [9]: Representing a classic SAR benchmark, SSDD contains 1160 images derived from RadarSat-2, TerraSAR-X, and Sentinel-1. It encompasses multi-scale resolutions and diverse coastal topographies (e.g., ports, islands, and open sea). The dataset is officially partitioned into 928 training and 232 testing images.
HRSID [42]: To validate the model’s performance in high-density and high-resolution scenarios, this paper utilizes HRSID. Constructed from Sentinel-1B, TerraSAR-X, and TanDEM-X satellites, this dataset includes 5604 images at a uniform size of 800 × 800 pixels. It provides 16,951 precisely annotated ship instances under varying polarization modes.
In our experimental configuration, the primary object detector is constructed using the Faster R-CNN framework with a ResNet-50 backbone, developed end-to-end within the PyTorch environment. The DBDA module undergoes 5000 epochs of rigorous pre-training to ensure that the synthesized intermediate samples achieve deep distribution alignment with authentic SAR imagery in terms of physical textures. To maintain consistency in feature scales across diverse datasets, all input imagery is uniformly resampled to pixels. The hardware infrastructure comprises a single NVIDIA GeForce RTX 3090 GPU, with the batch size configured at 4. The training sequence is meticulously partitioned into two functionally complementary phases, each encompassing 50,000 iterations. During the initial Burn-in Stage, the model prioritizes knowledge transfer from optical representations to the target domain, with the student network’s learning rate initialized at . The subsequent Burn-up Stage focuses on deepening the Unsupervised Domain Adaptation task, where the teacher network undergoes robust evolution via an Exponential Moving Average (EMA) mechanism with a smoothing coefficient of . Regarding the hyperparameter configuration of the DEMC framework, the baseline secure thresholds for classification () and localization () are both established at 0.8. To enhance the model’s adaptability to evolving pseudo-labels in the latter training phases, an AIT mechanism is introduced, applying the AIT smoothing coefficient of to update after 60% of the burn-up stage has elapsed. Furthermore, auxiliary verification components designed to mine latent true positives and reinforce pseudo-label robustness are formally integrated at the 40% mark of the burn-up stage, thereby maintaining a dynamic equilibrium between detection precision and recall throughout the complex cross-modal learning process.
In terms of computational complexity, it is important to distinguish between the offline adaptation phase and the online inference phase. During offline domain alignment, the DBDA module introduces the major additional cost because the SD-Turbo-based generator contains 1.528 B parameters and requires 19.15 TFLOPs. However, this cost is incurred only once before detector adaptation: each optical source image is translated into the intermediate domain and then cached for subsequent detector training. Therefore, DBDA is not executed during online inference.
During the adaptation stage, the DSMV architecture contains the Teacher, Main Student, and Proxy Student branches, increasing the training-time parameter scale to 434.17 M and requiring additional forward passes for pseudo-label verification. This additional cost is incurred only to improve pseudo-label reliability during training. During deployment, both the Teacher and Proxy Student are discarded, and only the Main Student detector is retained. The final inference model contains 165.18 M parameters and requires 427.73 GFLOPs per inference pass, which is comparable to the standard Faster R-CNN detector with a ResNet backbone.
This design reflects an efficiency-performance trade-off: DEMC introduces additional offline training and generation costs to obtain more reliable SAR pseudo-labels and stronger cross-domain generalization, but it does not increase the deployed inference pipeline. Therefore, the framework is more suitable for offline model adaptation followed by operational deployment, rather than real-time on-board domain adaptation.
2.6. Evaluation Metrics
To rigorously assess the cross-domain efficacy of the DEMC framework, we employ standard object detection protocols. Given the inherent challenges of SAR imagery (specifically, intense background clutter and faint target scattering), we utilize Precision ($P$), Recall ($R$), and Average Precision ($AP$) to quantify the model’s resilience to modality shifts.
Precision and Recall. These twin metrics evaluate the detector’s operational limits in high-clutter environments. Precision quantifies the model’s false-alarm rejection capability, measuring the proportion of genuine targets among all positive predictions. Recall, conversely, assesses the sensitivity to weak scatterers, representing the proportion of ground-truth targets successfully recovered by the model. They are formulated as:
$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN},$$
where $TP$ (True Positives), $FP$ (False Positives), and $FN$ (False Negatives) denote correctly localized ships, misclassified background clutter, and undetected targets, respectively. Adhering to the Horizontal Bounding Box protocol, a prediction is validated as a True Positive if its Intersection-over-Union (IoU) with the ground truth exceeds the $0.5$ threshold.
Average Precision (AP50). As Precision and Recall typically trade off against each other, pointwise estimates provide an incomplete performance picture. To capture the global performance dynamic across varying confidence thresholds, we compute the Average Precision ($AP$), which corresponds to the area under the Precision-Recall (P-R) curve:
$$AP = \int_{0}^{1} P(R)\, \mathrm{d}R.$$
Specifically, we report $AP_{50}$ (calculated at an IoU threshold of 0.5) following the PASCAL VOC benchmark. This serves as a robust holistic indicator of both classification accuracy and localization precision during the optical-to-SAR adaptation process.
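As an illustration of how these metrics can be computed for horizontal bounding boxes, the following minimal sketch evaluates a single image. It is not the evaluation code used in our experiments: the function names, the greedy one-to-one matching, and the all-point interpolation of the P-R curve are assumptions made for the example, and a benchmark evaluation would aggregate detections over the whole test set.

```python
def iou(a, b):
    """IoU of two horizontal boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def evaluate(preds, gts, iou_thr=0.5):
    """preds: list of (score, box); gts: list of boxes. Returns (P, R, AP)."""
    ranked = sorted(preds, key=lambda p: p[0], reverse=True)
    matched, flags = set(), []
    for _, box in ranked:
        # Greedy matching: best still-unmatched ground-truth box.
        best_j, best_iou = -1, 0.0
        for j, gt in enumerate(gts):
            overlap = iou(box, gt)
            if j not in matched and overlap > best_iou:
                best_j, best_iou = j, overlap
        is_tp = best_j >= 0 and best_iou > iou_thr  # "exceeds" the threshold
        if is_tp:
            matched.add(best_j)
        flags.append(is_tp)

    tp = fp = 0
    recalls, precisions = [], []
    for is_tp in flags:
        tp += 1 if is_tp else 0
        fp += 0 if is_tp else 1
        recalls.append(tp / len(gts) if gts else 0.0)
        precisions.append(tp / (tp + fp))
    precision = precisions[-1] if precisions else 0.0
    recall = recalls[-1] if recalls else 0.0

    # Area under the P-R curve using the monotone precision envelope.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    ap, prev_r = 0.0, 0.0
    for r, p in zip(recalls, precisions):
        ap += (r - prev_r) * p
        prev_r = r
    return precision, recall, ap


gts = [(0, 0, 10, 10), (20, 20, 30, 30)]
preds = [(0.9, (0, 0, 10, 10)), (0.8, (50, 50, 60, 60)), (0.6, (21, 21, 30, 30))]
p, r, ap50 = evaluate(preds, gts)
```

In this toy case two of the three detections match a ship, so the precision is 2/3 and the recall is 1, while the false alarm ranked second lowers the area under the curve to 5/6.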
4. Discussion
The experimental results presented in
Section 3 validate the efficacy of the proposed DEMC framework in bridging the substantial modality gap between optical and SAR domains. To provide a comprehensive understanding of the framework’s performance, we delve into the underlying mechanisms of physics-aware alignment, the robustness of mutual verification, and the inherent limitations of the current approach.
4.1. Mechanism of Physics-Aware Domain Alignment
One of the key findings from the ablation study (
Table 2) is that the DBDA module plays a crucial role in the framework’s success. Removing DBDA resulted in a significant performance drop, with
decreasing from 48.5% to 40.5%. Unlike traditional GAN-based translation methods, such as CycleGAN, which often suffer from geometric distortion or “hallucinations” when generating small targets, our Diffusion-based approach leverages Zero-Conv skip connections. This architectural choice is pivotal: it ensures that the generated intermediate images (
) maintain pixel-level semantic consistency with the source optical images (
), effectively decoupling “content” from “style”.
Furthermore, the t-SNE visualization (Figure 4) indicates that the physics-aware refinement (Spectral Consistency, NLM, CLAHE) shifts the feature distribution of optical data towards the SAR manifold. By simulating coherent speckle noise and metallic scattering characteristics, DBDA enables the detector to learn domain-invariant features before seeing any real SAR data, significantly lowering the difficulty of the subsequent adaptation task.
Robustness Against Distorting Artifacts: Importantly, the architectural design of the DBDA module inherently equips the framework with strong robustness against common distorting artifacts encountered in harsh remote sensing environments. First, against affine transformations and geometric distortions, the Zero-Conv skip connections strictly anchor the spatial features, ensuring that the structural integrity and bounding box coordinates of the source domain are preserved without warping during diffusion. Second, against low contrast, the localized enhancement mechanism of CLAHE dynamically rescues faint, low-visibility targets by amplifying their discrete specular highlights against dark backgrounds. Finally, against blur and uniform noise, the NLM denoising step suppresses non-structural interference while preserving the sharp reflective boundaries of metallic vessels. Together, these physics-aware refinement steps ensure that the intermediate domain is not only visually realistic but also structurally resilient.
4.2. Mitigating Confirmation Bias via Spatial Consensus
A persistent challenge in Unsupervised Domain Adaptation (UDA) is confirmation bias, where the model amplifies its errors due to noisy pseudo-labels. This issue is particularly pronounced in Synthetic Aperture Radar (SAR) ship detection, where small, weak targets are often misclassified as background clutter due to low confidence scores. Existing dual-teacher or co-training UDA methods in remote sensing and computer vision often utilize auxiliary branches for feature mining but predominantly rely on confidence-based validation such as averaging scores or applying static high-confidence thresholds. However, this standard approach is ineffective in SAR imagery, where heavy speckle noise consistently suppresses the raw confidence scores of true targets, leading to the loss of critical supervisory signals.
Our DSMV framework, particularly the Cross-Agent Spatial Consensus (CASC) strategy, fundamentally transforms the pseudo-labeling paradigm. In contrast to traditional co-training frameworks, where an auxiliary branch merely serves as a secondary confidence scorer, our Proxy Student functions as an independent geometric validator. The key innovation of CASC is its shift from using raw confidence to IoU-based spatial consensus as the verification criterion. Rather than relying solely on confidence thresholds, CASC validates predictions based on the geometric overlap between the Teacher network and the Proxy Student. As illustrated in the qualitative results (
Figure 3), this mechanism enables DEMC to recover “hard positives,” which are targets that are visually distinct but have low confidence. Ultimately, this spatial consensus allows DSMV to achieve a capability that standard Mean Teacher methods cannot: the successful recovery of faint, structurally intact targets that are typically discarded by confidence-based filters, significantly enhancing recall in complex coastal backgrounds.
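The idea of replacing a pure confidence test with an IoU-based agreement test can be sketched as follows. This is a simplified illustration and not the CASC implementation: the function names, the data layout, and the consensus threshold `consensus_iou=0.7` are hypothetical, and only the baseline confidence threshold of 0.8 is taken from the experimental configuration.

```python
def iou(a, b):
    """IoU of two horizontal boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def spatial_consensus_filter(teacher_preds, proxy_boxes,
                             conf_thr=0.8, consensus_iou=0.7):
    """Select pseudo-labels from Teacher predictions.

    teacher_preds: list of (score, box) from the Teacher.
    proxy_boxes:   list of boxes predicted by the Proxy Student.
    A box is kept if it is confident on its own, or if a Proxy Student
    box agrees with it geometrically (IoU >= consensus_iou).
    consensus_iou is a placeholder value chosen for this example.
    """
    kept = []
    for score, box in teacher_preds:
        if score >= conf_thr:
            kept.append((box, "confident"))
        elif any(iou(box, p) >= consensus_iou for p in proxy_boxes):
            kept.append((box, "consensus"))
    return kept


teacher_preds = [
    (0.92, (0, 0, 10, 10)),     # confident detection
    (0.35, (20, 20, 30, 30)),   # faint target, low score
    (0.30, (50, 50, 60, 60)),   # low score, no agreement
]
proxy_boxes = [(20, 20, 30, 31), (80, 80, 90, 90)]
pseudo_labels = spatial_consensus_filter(teacher_preds, proxy_boxes)
```

Here the second Teacher box would be discarded by a confidence filter at 0.8, but it is retained because the Proxy Student localizes nearly the same region; the third box has neither sufficient confidence nor geometric support and is dropped.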
4.3. Dynamics of Adaptive Thresholding
The convergence curves (
Figure 5) illustrate that the Adaptive Thresholding (AIT) mechanism is crucial during the “Burn-up” stage. Static thresholds often lead to performance saturation: thresholds that are too high filter out valid targets (low recall), while those that are too low introduce noise (low precision). By dynamically decaying the classification threshold
based on the statistical distribution of the student network’s predictions, AIT allows the model to prioritize precision early in training and gradually shift towards higher recall in later stages. This dynamic curriculum learning strategy ensures that the pseudo-labels evolve stably, preventing the model from collapsing due to noise accumulation in the early training stages.
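A decaying threshold of this kind can be illustrated with a small sketch. The update rule below is an assumption made for the example, namely an exponential moving average of the threshold toward the mean confidence of the student’s predictions, clamped between a floor and the 0.8 baseline; it is not claimed to be the exact AIT formulation. `BETA` and `TAU_FLOOR` are placeholder values.

```python
TAU_BASE = 0.8    # baseline classification threshold from the configuration
TAU_FLOOR = 0.5   # hypothetical lower bound, not taken from the paper
BETA = 0.9        # placeholder smoothing coefficient, not the paper's value


def ait_update(tau, student_scores, beta=BETA,
               tau_base=TAU_BASE, tau_floor=TAU_FLOOR):
    """Blend the current threshold with the mean student confidence."""
    if not student_scores:
        return tau  # nothing observed in this step, keep the threshold
    mean_score = sum(student_scores) / len(student_scores)
    blended = beta * tau + (1.0 - beta) * mean_score
    # Never exceed the baseline and never fall below the floor.
    return min(tau_base, max(tau_floor, blended))


tau = TAU_BASE
history = []
for batch_scores in ([0.6, 0.7, 0.5], [0.55, 0.65], [0.4, 0.5, 0.6]):
    tau = ait_update(tau, batch_scores)
    history.append(tau)
```

With these toy inputs the threshold decreases gradually from the 0.8 baseline, which mirrors the intended behavior: strict filtering first, then a progressively more permissive threshold as training advances. A larger smoothing coefficient slows this decay and keeps the filter stricter for longer.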
While DEMC significantly enhances recall, which is crucial for zero-tolerance reconnaissance, this inherently introduces a trade-off with precision. In certain operational scenarios, such as automated port traffic management or routine resource monitoring, prioritizing higher recall might introduce operational risks by overwhelming human operators or downstream tracking systems with false alarms. Fortunately, the DEMC framework is highly adaptable to varied mission requirements. For applications demanding extreme precision, operators can readily fine-tune the system by elevating the foundational classification threshold () or the spatial consensus threshold() within the DSMV module. Furthermore, restricting the AIT smoothing coefficient maintains a stricter confidence barrier throughout the training process, thereby suppressing false positives and shifting the model’s operational focus toward high precision.
4.4. Limitations and Future Directions
Despite the state-of-the-art performance, the proposed DEMC framework has limitations that warrant further investigation. First, while the framework significantly improves recall, the absolute F1-score remains constrained (currently peaking at 58.4%). This performance ceiling is inherently limited by the extreme background clutter and low Signal-to-Noise Ratio (SNR) characteristic of SAR imagery, coupled with the absence of target-domain annotations in the UDA setting. Regarding the operational limits in these noisy environments, the DEMC model operates effectively as long as the targets present marginally distinguishable structural contours or corner reflector scattering against the background. Under such low-SNR conditions, the CASC mechanism successfully recovers the targets. However, in ultra-low SNR conditions (e.g., severe sea state clutter completely masking the scattering signatures of small vessels), the spatial consensus process lacks sufficient visual cues, leading to performance degradation. Breaking this performance bottleneck may depend on increasing the volume and diversity of training data. Interestingly, our experimental results across the four benchmark tasks (
Table 1) already support this direction: models trained with the larger ShipRSImageNet source domain (3435 images) consistently yielded higher F1-scores than those trained with the smaller HRSC2016 dataset (1061 images) across both SAR target domains. Substantially scaling up the volume, diversity, and modality coverage of the source training data is therefore a direct route to higher F1-scores in future work.
Second, the computational overhead of the DBDA module is non-negligible. Although SD-Turbo is used to accelerate generation, the diffusion process is still computationally more intensive than GANs or direct feature alignment, potentially limiting its application in real-time on-board processing scenarios. Third, while the physics-aware refinement improves visual realism, it is currently a hand-crafted post-processing step. Future work could explore integrating these physical constraints (e.g., Rayleigh scattering models) directly into the loss function of the diffusion model for fully end-to-end training. Lastly, our experiments focused on the ship category in offshore and coastal scenes. Future work will investigate the extension of DEMC to other coastal object categories and broader SAR monitoring scenarios once reliable cross-modality benchmark datasets become available. The performance of DEMC in extremely dense inshore scenarios (e.g., crowded ports with significant side-lobe interference) also remains a challenging frontier for future research.