Article

A Multi-Modal Approach for Robust Oriented Ship Detection: Dataset and Methodology

1 Technology and Engineering Center for Space Utilization, Chinese Academy of Sciences, Beijing 100094, China
2 School of Aeronautics and Astronautics, University of Chinese Academy of Sciences, Beijing 100049, China
* Author to whom correspondence should be addressed.
Remote Sens. 2026, 18(2), 274; https://doi.org/10.3390/rs18020274
Submission received: 22 December 2025 / Revised: 9 January 2026 / Accepted: 10 January 2026 / Published: 14 January 2026

Highlights

What are the main findings?
  • We constructed MOS-Ship, a novel, high-resolution (sub-meter), spatially aligned optical-SAR dataset designed to assess multimodal detection accuracy. It uniquely includes a synthetic degradation benchmark (featuring multi-level cloud and fog) to specifically evaluate model robustness under realistic adverse weather conditions.
  • We propose MOS-DETR, a novel detection framework built upon a query-based architecture featuring an innovative multimodal encoding backbone. This design effectively integrates optical textures and SAR scattering signatures at the feature level, and is further coupled with an adaptive probabilistic fusion mechanism to ensure high accuracy under adverse weather.
What are the implications of the main findings?
  • Our MOS-Ship dataset and benchmark bridge the gap between idealized research settings and real-world operational challenges, providing a critical resource for developing and validating truly all-weather multimodal algorithms.
  • The proposed multi-modal and adaptive fusion approach offers a practical and resilient solution for robust maritime surveillance, ensuring reliable ship detection even when optical satellite imagery is obscured or degraded by poor weather.

Abstract

Maritime ship detection is a critical task for security and traffic management. To advance research in this area, we constructed a new high-resolution, spatially aligned optical-SAR dataset, named MOS-Ship. Building on this, we propose MOS-DETR, a novel query-based framework. This model incorporates an innovative multi-modal Swin Transformer backbone to extract unified feature pyramids from both RGB and SAR images. This design allows the model to jointly exploit optical textures and SAR scattering signatures for precise, oriented bounding box prediction. We also introduce an adaptive probabilistic fusion mechanism. This post-processing module dynamically integrates the detection results generated by our model from the optical and SAR inputs, synergistically combining their complementary strengths. Experiments validate that MOS-DETR achieves highly competitive accuracy and significantly outperforms unimodal baselines, demonstrating superior robustness across diverse conditions. This work provides a robust framework and methodology for advancing multimodal maritime surveillance.

1. Introduction

Maritime ship detection from remote sensing imagery has become a critical research focus with broad implications for national security, maritime traffic monitoring, port management, fisheries regulation, environmental protection, and disaster response [1,2]. As global maritime trade continues to expand and shipping routes become increasingly congested, the ability to automatically detect and track vessels in complex coastal environments is essential for maintaining situational awareness, preventing illegal activities such as smuggling and piracy, and ensuring navigation safety. In particular, in port areas where ships of various types and sizes frequently enter, dock, and depart, accurate detection is indispensable for efficient traffic scheduling, berth allocation, and surveillance of high-value maritime assets. The growing strategic demand for real-time maritime intelligence has therefore stimulated the development of advanced computer vision algorithms tailored to remote sensing imagery.
The proliferation of high-resolution optical satellites, including the WorldView, GaoFen, and SuperView series, has significantly advanced maritime observation capabilities. This imagery provides a wealth of spatial, textural, and color information, enabling the fine-grained identification of ship structures, hull contours, and wake patterns. Consequently, the adaptation of powerful deep learning-based object detectors has demonstrated substantial performance gains over traditional methods. By leveraging hierarchical feature representations and large-scale annotated datasets, these models achieve high precision in distinguishing ships from surrounding clutter [3,4,5,6,7,8]. However, the performance of optical sensors is fundamentally constrained by environmental and illumination conditions; the presence of clouds, haze, or sea glare can severely obscure target features and degrade accuracy. Furthermore, complex maritime scenes often feature densely docked or partially overlapping vessels, demanding models capable of handling arbitrary orientations. To address this specific challenge, specialized techniques have been developed, such as the Orientation-Aware Sampling Learning (OASL) method, which improves detection accuracy in these dense scenarios [9].
In contrast, Synthetic Aperture Radar (SAR) provides an alternative modality that is capable of all-weather and all-day imaging. Operating in the microwave spectrum, SAR actively transmits electromagnetic pulses and records backscattered echoes, making it largely immune to cloud cover and illumination variations. This enables consistent monitoring of maritime targets regardless of weather or time of day. SAR imagery has therefore been widely employed for ship detection, evolving from traditional CFAR-based statistical detectors to modern deep neural network approaches. SAR data capture the unique scattering characteristics of metallic ship structures, allowing them to stand out against the relatively homogeneous sea surface [10]. Advanced networks have been designed to explicitly leverage this [11,12,13], for instance by fusing scattering information to improve the detection of oriented ships [14]. Nevertheless, SAR images are not without challenges. Speckle noise and radiometric variations often obscure small targets, while geometric distortions induced by side-looking imaging can cause shifts in object appearance. Furthermore, strong backscatter from human-made coastal structures can easily be confused with ships, and the relatively long revisit periods of SAR satellites can limit continuous surveillance.
Given the complementary properties of optical and SAR modalities, multimodal fusion has emerged as a promising solution for enhancing ship detection robustness. Optical data provide intuitive spatial and textural cues, while SAR contributes structural and backscattering information that is resilient to weather degradation. By integrating these complementary sources, multimodal detection frameworks can leverage the strengths of both modalities to achieve superior accuracy and generalization [15,16,17]. Despite this promise, current research still faces two major limitations. First, most existing multimodal studies assume clean optical conditions and do not account for the inevitable degradation that occurs in real maritime monitoring scenarios. As a result, models trained on ideal data often fail when exposed to clouds or haze. Second, there is a notable lack of high-resolution, spatially aligned, and temporally synchronized optical–SAR datasets specifically designed for ship detection. Public datasets such as SSDD [18] and AIR-SARShip [19] provide valuable resources but are all unimodal. Although OGSOD [20] contains paired optical–SAR imagery, its spatial resolution is limited to about 10 m, and the targets mainly include static ground facilities such as bridges, oil tanks, and ports, rather than moving maritime vessels. This restricts its applicability to dynamic ship detection and spatio-temporal modeling tasks.
To overcome these limitations, this study constructs a new multimodal optical–SAR ship detection dataset focused on high-value port areas. The dataset is built through active tasking of optical and SAR satellites to ensure near-synchronous imaging of the same region of interest. By exploiting temporal overlap between the two acquisition schedules, we minimize discrepancies caused by ship movement. Each optical–SAR pair undergoes rigorous spatial alignment using advanced image registration algorithms such as LightGlue [21], followed by manual verification to ensure sub-pixel accuracy. Compared with existing datasets, this collection features sub-meter resolution imagery of dynamic maritime targets and critical military ports, enabling detailed analysis of ship appearance, scattering, and contextual semantics across modalities. Moreover, it provides a realistic foundation for studying multimodal fusion under spatial–temporal asynchrony—conditions that are typical in real satellite observation systems due to non-overlapping orbits and acquisition intervals.
To further enhance the dataset’s utility, we introduce a synthetic optical degradation benchmark designed to evaluate multimodal robustness under challenging conditions. Controlled cloud and fog effects are simulated at multiple severity levels and applied exclusively to optical images while keeping the corresponding SAR data unchanged. This yields parallel clean and degraded test sets that facilitate fair, quantitative comparisons of fusion algorithms under varying levels of optical corruption. The benchmark offers a systematic platform for analyzing the resilience of multimodal fusion networks against adverse weather, bridging the gap between experimental settings and real-world maritime monitoring.
Building upon this dataset, we propose a novel multimodal detection framework that addresses the challenges of oriented target localization and feature fusion in high-resolution remote sensing imagery. The network architecture is inspired by the query-based design of OrientedFormer [22] but is substantially extended with a multimodal feature encoding backbone capable of extracting consistent representations from registered optical and SAR inputs. The encoded features are then fused through adaptive cross-attention modules that learn modality-specific dependencies and align complementary semantics. This design enables precise oriented bounding box prediction even in the presence of geometric distortions or partial occlusions. During inference, an adaptive probabilistic integration mechanism is further introduced to combine the detection outputs from both modalities. When optical imagery is degraded, the integration automatically emphasizes the SAR-based predictions, whereas under clear conditions, the model can operate effectively with single-modality input. This flexible and self-adaptive detection paradigm ensures high accuracy, robustness, and practical deployability across diverse operational scenarios.
The main contributions of this work are summarized as follows:
1. We construct MOS-Ship, a novel, high-quality multimodal optical–SAR ship detection dataset. Its primary advantages over existing datasets include precise spatial alignment between modalities, sub-meter resolution imagery, and extensive coverage of complex maritime environments such as major ports and straits. MOS-Ship captures temporally non-synchronous, dynamic targets and is augmented with a synthetic benchmark featuring multi-level cloud and fog simulations to support robust multimodal evaluation.
2. We propose MOS-DETR (Multi-modal Oriented-Ship DEtection TRansformer), a novel detection framework that integrates multimodal feature encoding into an oriented query-based detector to achieve precise and robust ship detection under spatial–temporal asynchrony.
3. We develop a probabilistic decision integration mechanism that adaptively fuses detection results from optical and SAR modalities according to image quality, ensuring reliable performance under both clear and degraded conditions.
4. Extensive experiments on our proposed dataset demonstrate the framework’s effectiveness. Our method achieves high accuracy on multimodal (84.1 AP50), RGB-only (88.8 AP50), and SAR-only (77.0 AP50) test splits. Crucially, under simulated adverse weather, our probabilistic fusion mechanism improves detection accuracy by 17.7 percentage points over the optical-only baseline (74.4% vs. 56.7%), confirming its robustness.

2. Materials and Methods

2.1. Multi-Modal Optical-SAR Ship Dataset

To address the scarcity of public datasets suitable for multi-modal, oriented ship detection, we constructed a new, high-quality dataset, which we name the “MOS-Ship” (Multi-modal Optical-SAR Ship) dataset. The entire construction process, illustrated in Figure 1, involves three primary stages: strategic data acquisition, high-precision spatio-temporal co-registration, and final dataset generation with robustness enhancement.

2.1.1. Data Acquisition and Collection Strategy

The acquisition of consistent multi-modal satellite imagery is fraught with challenges, including significant operational costs, the inherent mobility of maritime targets, and narrow temporal windows constrained by weather and satellite trajectories. To mitigate these issues, our collection strategy focused on key maritime chokepoints (e.g., major ports and canals) to maximize target density. We prioritized the imaging of anchored or berthed vessels with minimal time intervals between acquisitions, thereby reducing motion-induced artifacts and ensuring the high fidelity of the source data, which is crucial for subsequent precise annotation and model training. This effort yielded a total of 28 large-scale, multi-modal image pairs, comprising optical imagery with a 0.75-m GSD and SAR imagery with a 1-m GSD.

2.1.2. Spatio-Temporal Co-Registration

A critical prerequisite for any meaningful multi-modal analysis is the establishment of precise pixel-level correspondence between heterogeneous images. Our collected optical and SAR pairs exhibit significant misalignments due to disparate sensor characteristics and acquisition geometries. To rectify this, we implemented a high-precision, feature-based registration pipeline, using the optical image $I_O$ as the reference and the SAR image $I_S$ as the moving image for each pair.
The pipeline begins with the extraction of keypoint sets $K_O = \{(p_i^O, d_i^O)\}_{i=1}^{N_O}$ and $K_S = \{(p_j^S, d_j^S)\}_{j=1}^{N_S}$, where $p$ denotes the coordinates and $d$ the associated local feature descriptor. These are then processed by the deep feature matcher LightGlue [21], chosen for its robustness to the significant radiometric differences between optical and SAR data. The matcher, denoted by a function $\mathcal{M}(\cdot)$, yields a set of high-confidence correspondences:
$$\mathcal{M} = \mathcal{M}(K_O, K_S) = \{(p_i^O, p_j^S)\}.$$
From these correspondences, we estimate an affine transformation matrix $H$ that maps coordinates from the SAR image to the optical image. The optimal transformation $H^*$ is determined by minimizing the reprojection error over an inlier set $\tilde{\mathcal{M}} \subseteq \mathcal{M}$ identified via RANSAC:
$$H^* = \arg\min_{H} \sum_{(p_i^O,\, p_j^S) \in \tilde{\mathcal{M}}} \big\| p_i^O - H \tilde{p}_j^S \big\|_2^2,$$
where $\tilde{p}_j^S$ denotes $p_j^S$ in homogeneous coordinates. The estimated transformation $H^*$ is then applied to warp the SAR image $I_S$, aligning it with the coordinate system of $I_O$ while simultaneously resampling it to match the optical image’s resolution. This process ensures geometric congruence, a cornerstone for all subsequent fusion tasks.
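As a concrete illustration of this estimation step, the sketch below fits an affine matrix to synthetic correspondences with a minimal RANSAC loop. This is a NumPy-only toy (function names, iteration counts, and the 2-pixel inlier threshold are our own choices); the actual pipeline operates on LightGlue matches with a production-grade RANSAC.

```python
import numpy as np

def fit_affine_lstsq(src, dst):
    """Least-squares affine fit mapping src -> dst.
    src, dst: (N, 2) arrays of matched keypoint coordinates."""
    n = src.shape[0]
    X = np.hstack([src, np.ones((n, 1))])   # (N, 3) homogeneous source coords
    # Solve X @ H.T = dst for the 2x3 affine matrix H
    H, *_ = np.linalg.lstsq(X, dst, rcond=None)
    return H.T                              # (2, 3)

def ransac_affine(src, dst, iters=300, thresh=2.0, seed=0):
    """Minimal RANSAC loop: sample 3 correspondences, fit, count inliers,
    then refit on the best inlier set."""
    rng = np.random.default_rng(seed)
    best_inliers = np.zeros(len(src), dtype=bool)
    for _ in range(iters):
        idx = rng.choice(len(src), size=3, replace=False)
        H = fit_affine_lstsq(src[idx], dst[idx])
        proj = src @ H[:, :2].T + H[:, 2]
        inliers = np.linalg.norm(proj - dst, axis=1) < thresh
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    return fit_affine_lstsq(src[best_inliers], dst[best_inliers]), best_inliers
```

With exact inlier correspondences the refit recovers the ground-truth transform to numerical precision, while grossly shifted outlier matches are rejected by the threshold.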

2.1.3. Dataset Finalization and Augmentation

To generate a dataset suitable for deep learning models, the co-registered large-scale image pairs were systematically tiled into 1024 × 1024 pixel patches using a sliding window with a 500-pixel overlap. This synchronous tiling preserves the established spatial correspondence. Object annotations were retained for a patch if their Intersection-over-Foreground (IoF) ratio exceeded 0.7, with their coordinates translated to the local patch frame.
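The tiling and annotation-retention rule can be sketched as follows. This is a simplified, axis-aligned illustration (the real annotations are oriented boxes, and the helper names are ours); the window stride follows from the stated 1024-pixel patch size and 500-pixel overlap.

```python
def window_origins(size, win=1024, overlap=500):
    """Top-left origins for sliding-window tiling along one axis.
    Stride = win - overlap (524 px for the MOS-Ship settings); a final
    window is added flush with the border so no pixels are dropped."""
    stride = win - overlap
    xs = list(range(0, max(size - win, 0) + 1, stride))
    if size > win and xs[-1] != size - win:
        xs.append(size - win)
    return xs

def iof(box, patch):
    """Intersection-over-Foreground: clipped box area / full box area."""
    x1, y1, x2, y2 = box
    px1, py1, px2, py2 = patch
    inter = max(0.0, min(x2, px2) - max(x1, px1)) * \
            max(0.0, min(y2, py2) - max(y1, py1))
    area = (x2 - x1) * (y2 - y1)
    return inter / area if area > 0 else 0.0

def keep_annotation(box, patch, threshold=0.7):
    """Retain an object for a patch only if its IoF exceeds the threshold."""
    return iof(box, patch) > threshold
```

For example, a ship box lying entirely inside a patch has IoF 1.0 and is kept, while one with less than 70% of its area inside the patch is dropped from that patch's annotation file.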
Furthermore, to enhance the model’s robustness against real-world atmospheric interference, we applied a procedural augmentation to generate a complete secondary set of optical images with synthetically rendered clouds and mist. Specifically, we employed a randomized brush stroke algorithm guided by Perlin noise paths to simulate the irregular distribution of cloud masses. The synthesis process involves multi-layer rendering where cloud density is controlled by varying opacity (0.6–1.0) and flow parameters, followed by a Gaussian blur operation (kernel size ranging from 21 to 51 pixels) to emulate the soft edges characteristic of atmospheric scattering. The synthetic cloud layers are composited onto the original optical images using a screen blending mode, which mathematically approximates the additive nature of light scattering in foggy conditions. As illustrated in Figure 2, this simulation method produces visual degradation highly consistent with real-world adverse weather scenarios, effectively challenging the model to learn robust features under occlusion.
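The screen-blending step at the core of this compositing can be sketched in a few lines (NumPy, with intensities normalized to [0, 1]; the Perlin-noise brush strokes and Gaussian blur that shape the cloud layer are omitted here):

```python
import numpy as np

def screen_blend(base, cloud):
    """Screen blending: result = 1 - (1 - base) * (1 - cloud).
    Brightens the scene wherever the cloud layer is non-zero, which
    approximates the additive nature of light scattering in haze."""
    return 1.0 - (1.0 - base) * (1.0 - cloud)
```

Note the two defining properties: a zero-opacity cloud leaves the image unchanged, and a fully opaque cloud saturates the pixel to white, matching the multi-layer rendering with opacity varied between 0.6 and 1.0.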
The final “MOS-Ship” dataset follows the DOTA format [23]. All images are in PNG format; optical images are 8-bit, 3-channel RGB, while SAR images are single-channel, 32-bit floating-point to preserve the full dynamic range of backscatter information. Each annotation provides the four vertex coordinates of an object’s oriented bounding box, its category, and a difficulty score. The “MOS-Ship” dataset is partitioned into training and validation subsets, with detailed statistics provided in Table 1. To facilitate reproducibility and further research, the dataset and the corresponding implementation code will be made publicly available upon acceptance at https://github.com/yyy7777777/MOS-Ship-Dataset (accessed on 1 February 2025).

2.2. Multi-Modal Oriented-Ship Detection Transformer

To address the challenges of oriented object detection in multi-modal remote sensing imagery, we propose a novel detection framework, termed MOS-DETR (Multi-modal Oriented-Ship DEtection TRansformer). This framework is built upon the query-based architecture of OrientedFormer, but innovatively incorporates a multi-modal feature encoding backbone. The architecture of MOS-DETR is systematically designed to first extract robust and consistent features from registered optical (RGB) and Synthetic Aperture Radar (SAR) image pairs, and then to perform high-precision oriented object detection. The methodology is detailed in the following subsections, covering the overall framework, the multi-modal backbone, and the decoder with adaptive fusion post-processing.

2.2.1. Overall Framework

The architecture of our proposed MOS-DETR, depicted in Figure 3, is a query-based, end-to-end detector. Adhering to the DETR paradigm, its core framework is designed to directly optimize a set of learnable object queries, culminating in the generation of final detection results.
The operational workflow of MOS-DETR commences with a Swin-Transformer backbone specifically engineered for multi-modal data processing. This network incorporates multi-head tokenizers for the extraction of multi-scale feature pyramids from input RGB and SAR images, thereby providing a unified and informative cross-modal feature representation [24]. The dimensionality of these feature maps is subsequently unified by means of a Channel Mapper. Following this, a set of object queries is initialized, with each query being disentangled into distinct content and positional components. In the final stage, the decoder is supplied with the feature pyramid and the initialized object queries. It then iteratively refines the class and location attributes of these queries through multiple decoder layers, ultimately yielding the final detection boxes and their corresponding categories.
To enhance the robustness of ship detection, particularly when optical imagery is degraded, we propose a post-processing framework named Adaptive Max-Confidence Fusion. This late-fusion approach ensembles the detection outputs that the MOS-DETR model generates from co-located, near-simultaneous Synthetic Aperture Radar (SAR) and optical images. The fusion process is designed to synergistically combine the complementary strengths of the sensors—the all-weather capability of SAR and the high-resolution detail of optical imagery.

2.2.2. Backbone

The Swin Transformer (Hierarchical Vision Transformer using Shifted Windows) [25] is a general-purpose vision backbone designed to construct hierarchical feature representations with linear computational complexity. As illustrated in Figure 3, the model primarily consists of a Patch Partition module followed by four sequential stages (Stages 1–4).
The input image is processed by the Patch Partition module, which divides the image into non-overlapping patches (e.g., 4 × 4 pixels), thereby transforming raw pixel data into serialized embedding vectors. Subsequently, the feature maps pass through four stages, each containing a series of Swin Transformer Blocks. Crucially, these blocks are designed to operate in alternating pairs to achieve a balance between computational efficiency and global modeling capability. The first block in each pair utilizes Window-based Multi-head Self-Attention (W-MSA), which partitions feature maps into static non-overlapping windows and restricts attention computation locally. While this design significantly reduces computational complexity from quadratic to linear with respect to image size, it isolates information within local boundaries. To address this, the immediately following block employs Shifted Window-based Multi-head Self-Attention (SW-MSA). By shifting the window partitioning configuration, SW-MSA bridges the boundaries of the preceding windows, enabling cross-window information exchange and effectively expanding the receptive field. Due to this strict pairwise dependency, the number of blocks in any stage must always be even. This structural constraint is explicitly reflected in the annotations “×2” and “×6” in Figure 3, which denote the total count of stacked blocks (representing 1 and 3 pairs, respectively). Such variation in depth across stages facilitates the learning of semantic features at different levels of abstraction.
To construct hierarchical feature maps, a Patch Merging layer is introduced between adjacent stages. This layer functions similarly to pooling operations in convolutional neural networks; it concatenates neighboring patches, halving the height and width of the feature map while doubling the number of channels. Through this progressive Patch Merging operation, the Swin Transformer effectively builds a multi-scale hierarchical feature pyramid, making it highly adaptable not only for image classification but also for dense prediction tasks such as object detection and semantic segmentation.
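The Patch Merging operation described above can be sketched as follows (NumPy; a hypothetical reduction matrix `W_red` stands in for the learned linear layer, and the exact neighbor-concatenation order of the reference Swin implementation is simplified):

```python
import numpy as np

def patch_merging(x, W_red):
    """x: (H, W, C) feature map. Concatenate each 2x2 neighborhood into a
    single token of 4C channels -> (H/2, W/2, 4C), then linearly reduce to
    2C with W_red: (4C, 2C), halving spatial size and doubling width."""
    H, W, C = x.shape
    x = x.reshape(H // 2, 2, W // 2, 2, C).transpose(0, 2, 1, 3, 4)
    x = x.reshape(H // 2, W // 2, 4 * C)
    return x @ W_red
```

Applied between stages, this produces the familiar pyramid: an (8, 8, 16) map becomes (4, 4, 32), then (2, 2, 64), and so on.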
As shown in Figure 4, our backbone builds on Swin Transformer with two key modifications: (i) a dual-branch patch embedding that selects a modality-specific tokenizer, and (ii) modality-conditioned Low-Rank Adaptation (LoRA) applied to all main linear projections in self-attention and multilayer perceptron (MLP), while keeping the original pretrained weights frozen.
Dual-Branch Patch Embedding (DPE)
To effectively handle the inherent heterogeneity between multi-modal inputs, we design a dual-branch architecture acting as domain-specific experts. Since optical (RGB) and Synthetic Aperture Radar (SAR) imagery exhibit significant distributional shifts—RGB focuses on color and texture, while SAR captures surface roughness and geometry—forcing a single tokenizer to process both would lead to suboptimal feature projection.
Therefore, we introduce two distinct, learnable tokenizers, $E_{\text{RGB}}$ and $E_{\text{SAR}}$. For an input image $I_i \in \mathbb{R}^{H \times W \times 3}$ with a modality indicator $c_i \in \{\text{RGB}, \text{SAR}\}$, the routing function $m(c_i)$ selects the corresponding tokenizer to map the raw pixels into a sequence of patch tokens:
$$X_i = E_{m(c_i)}(I_i) \in \mathbb{R}^{N \times C},$$
where $H$ and $W$ denote the image height and width, $N$ is the resulting number of patch tokens, and $C$ represents the embedding dimension (channel width). This decoupled design ensures that the unique low-level physical characteristics of each modality are preserved before entering the shared backbone.
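A minimal sketch of this routing behavior, with a toy NumPy tokenizer standing in for the learned patch-embedding layers (all dimensions, seeds, and names are illustrative):

```python
import numpy as np

class PatchTokenizer:
    """Toy tokenizer: 4x4 patchify + linear projection to C channels."""
    def __init__(self, patch=4, embed=32, seed=0):
        rng = np.random.default_rng(seed)
        self.patch = patch
        self.W = rng.standard_normal((patch * patch * 3, embed)) * 0.02
    def __call__(self, img):                      # img: (H, W, 3)
        H, W, _ = img.shape
        p = self.patch
        patches = img.reshape(H // p, p, W // p, p, 3).transpose(0, 2, 1, 3, 4)
        tokens = patches.reshape(-1, p * p * 3)   # (N, p*p*3)
        return tokens @ self.W                    # (N, C)

# Two domain-specific experts, E_RGB and E_SAR, with independent weights
tokenizers = {"RGB": PatchTokenizer(seed=1), "SAR": PatchTokenizer(seed=2)}

def embed(img, modality):
    """Route the image to its modality-specific tokenizer E_{m(c_i)}."""
    return tokenizers[modality](img)
```

The same image tokenized through the RGB and SAR branches yields the same token geometry but different embeddings, which is exactly the decoupling the dual-branch design targets.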
Modality-Conditioned LoRA (MC-LoRA)
To enable the frozen backbone to adapt to the specific statistical properties of each modality, we introduce a Modality-Conditioned LoRA mechanism.
Formally, for any linear layer with frozen pre-trained weights $W_0 \in \mathbb{R}^{d_{\text{in}} \times d_{\text{out}}}$, we inject a trainable side branch that is dynamically selected based on the input modality. The forward pass is formulated as:
$$y = x W_0 + \frac{\alpha}{r}\, x A_{m(c_i)} B_{m(c_i)},$$
where $x \in \mathbb{R}^{1 \times d_{\text{in}}}$ is the input feature. The terms $A_m \in \mathbb{R}^{d_{\text{in}} \times r}$ and $B_m \in \mathbb{R}^{r \times d_{\text{out}}}$ are the trainable low-rank matrices specific to modality $m$. Following the standard configuration in [26], we set the rank $r = 16$ and the hyperparameter $\alpha = 16$; consequently, the scaling coefficient simplifies to $\alpha/r = 1$. This setting is adopted to align the magnitude of the adapter updates with the pre-trained weights, thereby ensuring stable optimization dynamics during the fine-tuning process. The mechanism acts as a routing switch: if the input is RGB, the parameters $\{A_{\text{RGB}}, B_{\text{RGB}}\}$ are activated; otherwise, $\{A_{\text{SAR}}, B_{\text{SAR}}\}$ are used.
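The modality-conditioned forward pass can be sketched as follows (NumPy; the dimensions are illustrative, and `B` is zero-initialized as is standard for LoRA, so the adapted layer starts out exactly matching the frozen pretrained behavior):

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r, alpha = 64, 64, 16, 16
W0 = rng.standard_normal((d_in, d_out)) * 0.02      # frozen pretrained weight
lora = {m: (rng.standard_normal((d_in, r)) * 0.01,  # A_m
            np.zeros((r, d_out)))                   # B_m, zero-initialized
        for m in ("RGB", "SAR")}

def mc_lora_linear(x, modality):
    """y = x W0 + (alpha/r) * x A_m B_m, with the branch chosen by modality."""
    A, B = lora[modality]
    return x @ W0 + (alpha / r) * (x @ A @ B)
```

Because the branch is selected per sample, mixed-modality batches simply route each row through its own adapter while sharing the frozen $W_0$.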
This mechanism is applied to the two core components of each Swin Transformer block:
(i) Window-based Multi-head Self-Attention (W-MSA)/Shifted Window-based Multi-head Self-Attention (SW-MSA):
In this module, input tokens $X \in \mathbb{R}^{N \times C}$ are projected into three subspaces: Query ($Q$), Key ($K$), and Value ($V$). Physically, $Q$ encodes the target retrieval information, $K$ serves as the indexing attribute for matching, and $V$ carries the semantic content to be aggregated. The projections, incorporating both frozen pre-trained weights $W$ and learnable modality-specific LoRA updates $\Delta W$, are formulated as:
$$[Q, K, V] = X W^{qkv} + \frac{\alpha}{r}\, X \Delta W^{qkv}_{m(c_i)},$$
$$\mathrm{Attn} = \mathrm{Softmax}\!\left( \frac{Q K^{\top}}{\sqrt{d_k}} + B_{\text{rel}} + M \right),$$
$$Z = (\mathrm{Attn}\, V)\, W^{o} + \frac{\alpha}{r}\, (\mathrm{Attn}\, V)\, \Delta W^{o}_{m(c_i)}.$$
Here, $W^{qkv}$ and $W^{o}$ denote the frozen projection weights for the attention mechanism and the output linear layer, respectively, while $\Delta W^{qkv}$ and $\Delta W^{o}$ represent their corresponding trainable LoRA adaptations (decomposed into $A_m B_m$). Additionally, $d_k$ is the dimension of the key vectors (scaling factor), $B_{\text{rel}}$ is the relative position bias, and $M$ is the attention mask.
(ii) Feed-Forward Network (FFN):
The output is then processed by an MLP with LoRA-adapted linear layers:
$$H = \mathrm{LN}(X + \mathrm{DropPath}(Z)),$$
$$U = \sigma\!\left( H W_1 + \frac{\alpha}{r}\, H \Delta W^{1}_{m(c_i)} \right),$$
$$Y = U W_2 + \frac{\alpha}{r}\, U \Delta W^{2}_{m(c_i)},$$
$$X_{\text{out}} = H + \mathrm{DropPath}(Y),$$
where $\mathrm{LN}(\cdot)$ denotes Layer Normalization and $\sigma(\cdot)$ is the GELU activation. In this formulation, $W_1$ and $W_2$ represent the frozen weights for the expansion and reduction linear layers, respectively, while $\Delta W^{1}$ and $\Delta W^{2}$ denote their corresponding learnable modality-specific LoRA adaptations.
By stacking these adaptive Swin blocks, interspersed with patch merging layers, the backbone constructs a modality-aware hierarchical feature pyramid, denoted as $\{F_\ell\}_{\ell=1}^{L}$. In this notation, $L$ represents the total number of stages (typically $L = 4$), and $\ell$ denotes the index of each specific stage. Consequently, $F_\ell$ corresponds to the feature map output at the $\ell$-th stage, containing rich semantic information at progressively reduced spatial resolution, which is subsequently passed to the detection head.
Parameter-Efficiency Analysis
The primary advantage of LoRA is its parameter efficiency. For a single linear projection with dimensions $C \times C'$, LoRA introduces only $(C + C')r$ trainable parameters, as opposed to the $C C'$ parameters of the original weight matrix. For a square matrix where $C' = C$, this results in a trainable parameter ratio of:
$$\rho \approx \frac{2Cr}{C^2} = \frac{2r}{C} \ll 1.$$
By applying this adaptation to all key linear projections (i.e., the QKV and output projections in the attention module, and both layers of the FFN), the overall fraction of trainable parameters remains minimal. This enables an efficient adaptation to modality-specific statistics while preserving the core knowledge of the pretrained model.
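For the channel widths of a typical Swin-Tiny backbone (96 to 768 across the four stages, an assumption on our part about the configuration used), this ratio works out to a few percent at most:

```python
def lora_param_ratio(C, r):
    """rho = 2*C*r / C^2 = 2r/C for a square C x C projection."""
    return 2 * r / C

# Assumed Swin-Tiny stage widths; r = 16 as in the text
ratios = {C: lora_param_ratio(C, 16) for C in (96, 192, 384, 768)}
```

Even at the narrowest stage (C = 96) the adapter adds only one third of the projection's parameter count as trainable weights, and at C = 768 the fraction drops to about 4%.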
Architecturally, our design decouples the learning of modality-specific adjustments from the shared, general-purpose backbone. This is achieved through a two-level strategy: the dual-branch tokenizer is tasked with bridging the low-level distributional gap between optical and SAR imagery, while the modality-conditioned LoRA modules are responsible for aligning the mid- to high-level feature representations. This sample-wise adaptation mechanism allows for the processing of mixed-modality batches seamlessly, without requiring architectural modifications or incurring additional computational overhead on the frozen backbone during inference.

2.2.3. Detection Head

The detection head architecture adopts the robust design from OrientedFormer [22], comprising a preprocessing module for feature unification followed by a cascade of decoder layers.
Multi-Scale Feature Unification and Query Construction
As illustrated in Figure 3, before entering the decoder layers, the multi-scale feature maps from the backbone (Stages 1–4) are processed by the Multi-Scale Feature Unification and Query Construction module. First, to handle the resolution variance from the backbone, the multi-scale features are projected to a uniform channel dimension $D$ via a channel mapper, facilitating multi-scale interactions. These unified features serve as the key-value pairs for the subsequent cross-attention operations. Second, no Transformer encoder is used; instead, queries are initialized via the query-enhancement method of [27]. A set of $N$ (e.g., 300) high-confidence proposals is selected from the unified feature maps to initialize the object queries. Each query is disentangled into a content part $Q_c \in \mathbb{R}^{N \times D}$ and a positional part $Q_p \in \mathbb{R}^{N \times 5}$, parameterizing an oriented box $(x, y, w, h, \theta)$.
Decoder Layer Update
The initialized queries then pass through several cascaded decoder layers for progressive refinement. Within each layer, queries are updated via three key modules:
First, a Wasserstein Self-Attention module suppresses redundant detections by modeling inter-query relationships. It uniquely incorporates geometric structure by augmenting the standard attention mechanism with a Gaussian Wasserstein distance score between query boxes. The update rule is formulated as:
$$\mathrm{WAttn}(Q_c, \varphi, G) = \mathrm{Softmax}\!\left( \frac{(Q_c + \varphi)(Q_c + \varphi)^{\top}}{\sqrt{d_q}} + G \right) Q_c,$$
where $Q_c \in \mathbb{R}^{N \times D}$ denotes the content query matrix ($N$ is the number of queries, $D$ is the channel dimension), and $\varphi$ represents the Gaussian Positional Embedding (Gaussian PE) derived from the positional queries $Q_p$. Distinct from standard encodings that overlook orientation, Gaussian PE models each oriented box $(x, y, w, h, \theta)$ as a 2-D Gaussian distribution $\mathcal{N}(\mu, \Sigma)$. This formulation unifies heterogeneous geometric parameters (i.e., spatial coordinates in pixels and rotation angles in radians) into a consistent metric space, enabling the attention mechanism to capture rotation-sensitive geometric structures. $G \in \mathbb{R}^{N \times N}$ is the Gaussian Wasserstein distance score matrix that quantifies the geometric similarity between pairs of queries (where a lower score implies closer proximity). The term $d_q$ serves as the scaling factor (typically equal to $D$) to normalize the dot-product attention scores.
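The box-to-Gaussian conversion and the Gaussian Wasserstein distance underlying both the positional embedding and the score matrix $G$ can be sketched as follows (NumPy; a direct implementation of the closed-form 2-Wasserstein distance between the two box Gaussians, with helper names of our own):

```python
import numpy as np

def box_to_gaussian(x, y, w, h, theta):
    """Model an oriented box as N(mu, Sigma) with
    Sigma = R diag(w^2/4, h^2/4) R^T, R the rotation by theta."""
    c, s = np.cos(theta), np.sin(theta)
    R = np.array([[c, -s], [s, c]])
    S = R @ np.diag([w * w / 4.0, h * h / 4.0]) @ R.T
    return np.array([x, y]), S

def sqrtm_psd(M):
    """Matrix square root of a symmetric PSD matrix via eigendecomposition."""
    vals, vecs = np.linalg.eigh(M)
    return vecs @ np.diag(np.sqrt(np.clip(vals, 0.0, None))) @ vecs.T

def gwd2(b1, b2):
    """Squared 2-Wasserstein distance between two oriented-box Gaussians:
    ||mu1 - mu2||^2 + Tr(S1 + S2 - 2 (S1^1/2 S2 S1^1/2)^1/2)."""
    m1, S1 = box_to_gaussian(*b1)
    m2, S2 = box_to_gaussian(*b2)
    rS1 = sqrtm_psd(S1)
    cross = sqrtm_psd(rS1 @ S2 @ rS1)
    return np.sum((m1 - m2) ** 2) + np.trace(S1 + S2 - 2.0 * cross)
```

Two useful sanity properties fall out directly: identical boxes have zero distance, and a square box is invariant under rotation, illustrating how the metric fuses position, size, and angle into one score.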
Subsequently, an Oriented Cross-Attention module refines the content queries by aligning features from the unified multi-scale feature maps with the oriented proposals. The updated content query $Q_c'$ is obtained as:
$$Q_c' = \mathrm{Oriented\text{-}Cross\text{-}Attn}\big(\{f_l\}_{l=1}^{L}, Q_c, Q_p\big).$$
In this process, $\{f_l\}_{l=1}^{L}$ represents the set of unified multi-scale feature maps extracted from the backbone stages ($L$ is the number of levels), and $Q_p \in \mathbb{R}^{N \times 5}$ denotes the positional queries parameterizing the oriented boxes $(x, y, w, h, \theta)$, which are used to generate rotation-aware sampling points for precise feature extraction.
Finally, two separate Feed-Forward Network (FFN) heads predict the classification score and the bounding-box refinement delta from the updated query features. Training is supervised by a composite loss function $\mathcal{L}$, defined as:
$$\mathcal{L} = \lambda_{\mathrm{cls}} \mathcal{L}_{\mathrm{cls}} + \lambda_{\mathrm{reg}} \mathcal{L}_{\mathrm{reg}} + \lambda_{\mathrm{iou}} \mathcal{L}_{\mathrm{iou}},$$
where $\mathcal{L}_{\mathrm{cls}}$, $\mathcal{L}_{\mathrm{reg}}$, and $\mathcal{L}_{\mathrm{iou}}$ denote the Focal Loss for classification, the L1 loss for parameter regression, and the Rotated IoU loss for geometric alignment, respectively. The hyperparameters $\lambda_{\mathrm{cls}}$, $\lambda_{\mathrm{reg}}$, and $\lambda_{\mathrm{iou}}$ are the corresponding balancing coefficients. This process repeats across all decoder layers, enabling coarse-to-fine refinement of the oriented detection results.
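The composite objective is a straightforward weighted sum. A minimal sketch follows, with a scalar binary focal-loss term for illustration; the actual losses operate on batched predictions, and the weights match the settings reported in Section 3.1:

```python
import numpy as np

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Binary focal loss for a single predicted probability p and label y
    in {0, 1}: confident correct predictions are strongly down-weighted."""
    pt = p if y == 1 else 1 - p
    a = alpha if y == 1 else 1 - alpha
    return -a * (1 - pt) ** gamma * np.log(max(pt, 1e-12))

def composite_loss(l_cls, l_reg, l_iou,
                   lam_cls=2.0, lam_reg=5.0, lam_iou=2.0):
    """Weighted sum of classification, L1 regression, and rotated-IoU
    losses; default weights follow the paper's training recipe."""
    return lam_cls * l_cls + lam_reg * l_reg + lam_iou * l_iou

# A confident correct prediction incurs far less focal loss than a weak one
print(focal_loss(0.9, 1) < focal_loss(0.5, 1))        # True
print(round(composite_loss(0.5, 0.2, 0.1), 6))        # 2.2
```

The relatively large regression weight ($\lambda_{\mathrm{reg}} = 5.0$) emphasizes precise box-parameter fitting, which matters for elongated, arbitrarily oriented ship targets.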

2.2.4. Adaptive Max-Confidence Fusion

To systematically integrate the detection sets from both modalities, $D_{\mathrm{SAR}}$ and $D_{\mathrm{RGB}}$, into a unified and more reliable output $D_{\mathrm{fused}}$, we propose an adaptive max-confidence fusion algorithm. This method diverges from conventional ensembling techniques by employing a winner-takes-all principle, guided by geometric constraints specific to oriented objects. The entire procedure is formally detailed in Algorithm 1.
Algorithm 1: Adaptive Max-Confidence Fusion
Geometry-Aware Iterative Matching
The fusion process begins by associating detections that correspond to the same physical object. All detections from both modalities are first pooled into a single set, $D_{\mathrm{pool}}$, and sorted in descending order of confidence. The algorithm then iteratively processes this pool: for the current highest-scoring detection, denoted the reference detection $d_{\mathrm{ref}}$, we identify a match set $M$ comprising all other detections that are geometrically consistent with it. If multiple detections share the identical highest confidence score, the tie is broken deterministically by input index order (prioritizing the detection appearing later in the pooled sequence).
A key aspect of our approach is a dual-constraint matching strategy that considers both spatial overlap and orientation similarity, which is critical for maritime targets. A detection $d_i$ is matched with the reference $d_{\mathrm{ref}}$ if and only if both the Rotated Intersection over Union (RIoU) and the absolute angular difference satisfy:
$$\mathrm{RIoU}(d_i, d_{\mathrm{ref}}) > \tau_{\mathrm{IoU}} \quad \text{and} \quad |\Delta\theta(d_i, d_{\mathrm{ref}})| < \tau_{\mathrm{angle}},$$
where $\tau_{\mathrm{IoU}}$ represents the minimum spatial overlap required to associate two bounding boxes, and $\tau_{\mathrm{angle}}$ denotes the maximum permissible angular deviation that ensures orientation consistency. The term $|\Delta\theta|$ is the shortest angular distance, accounting for the periodicity of orientation: $|\Delta\theta| = \min(|\theta_1 - \theta_2|,\ 180^{\circ} - |\theta_1 - \theta_2|)$. This dual-constraint mechanism is crucial for disambiguating closely moored vessels with different headings in dense harbor scenes. After a match set is formed and fused, all its constituent detections are removed from $D_{\mathrm{pool}}$. Detections that remain unmatched throughout the process are considered unique and are preserved directly in the final set $D_{\mathrm{fused}}$.
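The periodic angular distance and the dual-constraint predicate can be written compactly. In the sketch below, the RIoU value is assumed to be computed externally by a rotated-IoU routine and is passed in as a number; all names are illustrative:

```python
def angle_diff(theta1, theta2):
    """Shortest angular distance in degrees, honouring the 180-degree
    periodicity of oriented-box headings."""
    d = abs(theta1 - theta2) % 180.0
    return min(d, 180.0 - d)

def is_match(riou, theta1, theta2, tau_iou=0.1, tau_angle=45.0):
    """Dual-constraint association: spatial overlap AND orientation
    agreement must both hold for two detections to be matched."""
    return riou > tau_iou and angle_diff(theta1, theta2) < tau_angle

print(angle_diff(5.0, 175.0))    # 10.0 (wrap-around, not 170)
print(is_match(0.4, 5.0, 175.0)) # True: overlapping and nearly parallel
print(is_match(0.4, 0.0, 90.0))  # False: orientations disagree
```

Note how two boxes at $5^{\circ}$ and $175^{\circ}$ are treated as nearly parallel, which is exactly the behaviour needed for ship headings.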
Max-Confidence and Argmax Selection
For each identified match set $M = \{d_1, d_2, \dots, d_K\}$, we adopt a max-confidence principle for fusion. The validity of this comparison relies on the inherent consistency of the confidence scores. Since both detection sets $D_{\mathrm{SAR}}$ and $D_{\mathrm{RGB}}$ are generated by the same MOS-DETR model, sharing an identical detection head and trained with a unified Focal Loss objective, their output probability distributions are implicitly aligned within a common semantic space. This shared optimization landscape makes the confidence scores statistically comparable across modalities without complex post hoc calibration.
Motivated by this alignment and the complementary strengths of the sensors (e.g., SAR’s robustness to weather vs. optical’s high resolution), we prioritize the most certain prediction. Simple averaging risks diluting a high-confidence signal from the more reliable modality, particularly when one modality fails to detect the target due to environmental degradation (yielding a near-zero score). Therefore, we select the detection with the maximum confidence score to represent the fused entity:
$$k^{*} = \underset{k \in \{1, \dots, K\}}{\arg\max}\; s_k .$$
Both the fused confidence score and the fused bounding box are inherited from this single most confident detection:
$$s_{\mathrm{fused}} = s_{k^{*}} \qquad \text{and} \qquad b_{\mathrm{fused}} = b_{k^{*}}.$$
This argmax selection of the bounding box is critical for preserving geometric integrity. Naively averaging oriented boxes can produce geometrically invalid results, particularly because the angle parameter is circular (e.g., averaging $5^{\circ}$ and $175^{\circ}$ erroneously yields $90^{\circ}$, even though the two headings are only $10^{\circ}$ apart). Our approach avoids such inconsistencies and ensures that the fused box's parameters, including the crucial ship orientation, are inherited from the most reliable available prediction.
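The full fusion procedure of Algorithm 1 can be summarized in a few lines. This sketch assumes an external rotated-IoU function (here replaced by a trivial placeholder for demonstration) and uses illustrative data structures, not the authors' implementation:

```python
def fuse(dets_sar, dets_rgb, riou_fn, tau_iou=0.1, tau_angle=45.0):
    """Adaptive max-confidence fusion (sketch). Each detection is a
    (score, box) pair with box = (x, y, w, h, theta[deg]); riou_fn
    computes the rotated IoU of two boxes and is assumed to exist."""
    def angle_diff(a, b):
        d = abs(a - b) % 180.0
        return min(d, 180.0 - d)

    pool = sorted(dets_sar + dets_rgb, key=lambda d: d[0], reverse=True)
    fused = []
    while pool:
        ref = pool.pop(0)                     # current highest-confidence det
        matched, rest = [ref], []
        for d in pool:
            if (riou_fn(d[1], ref[1]) > tau_iou
                    and angle_diff(d[1][4], ref[1][4]) < tau_angle):
                matched.append(d)             # geometrically consistent
            else:
                rest.append(d)
        pool = rest
        fused.append(max(matched, key=lambda d: d[0]))  # winner-takes-all

    return fused

def toy_riou(b1, b2):
    # Placeholder only: 1 if centres coincide, else 0 (stand-in for RIoU)
    return 1.0 if (b1[0], b1[1]) == (b2[0], b2[1]) else 0.0

sar = [(0.9, (10, 10, 8, 3, 0))]
rgb = [(0.4, (10, 10, 8, 3, 5)), (0.8, (50, 50, 8, 3, 90))]
out = fuse(sar, rgb, toy_riou)
print([d[0] for d in out])  # [0.9, 0.8]: the weak optical 0.4 is absorbed
```

The weak (e.g., cloud-occluded) optical detection at score 0.4 merges into the strong SAR detection of the same vessel, while the unmatched detection survives unchanged, which is exactly the behaviour described above.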

3. Results

To comprehensively evaluate the effectiveness of our proposed Multi-Modal Oriented Ship Target Detection network, we conducted a series of systematic experiments.

3.1. Implementation Details

Our framework is implemented with the MMRotate toolbox on PyTorch. The backbone is a Swin-Tiny (patch size 4, window size 7) initialized with ImageNet pre-trained weights. The model processes 1024 × 1024 inputs with an embedding dimension of 96. To balance model capacity and parameter efficiency, we set the LoRA rank $r = 16$ and scaling factor $\alpha = 16$. We train for 100 epochs with the AdamW optimizer (initial learning rate $5 \times 10^{-5}$, weight decay $1 \times 10^{-6}$). A differential learning-rate strategy is employed, with multipliers of 0.1 (backbone), 5.0 (RGB components), and 3.0 (SAR components). The learning rate follows a multi-step decay, reduced by a factor of 10 at epochs 50 and 70.
Following standard DETR-based protocols, we set the loss balancing coefficients to $\lambda_{\mathrm{cls}} = 2.0$, $\lambda_{\mathrm{reg}} = 5.0$, and $\lambda_{\mathrm{iou}} = 2.0$. For the multi-modal fusion process, the thresholds are set to $\tau_{\mathrm{IoU}} = 0.1$ and $\tau_{\mathrm{angle}} = 45^{\circ}$ to accommodate spatial misalignment while ensuring orientation consistency. Training is conducted on 4 NVIDIA RTX 3090 GPUs with a total batch size of 16 (4 per GPU), converging in approximately 12 h.
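The differential learning-rate schedule described above can be sketched as follows. The module-name substrings ("backbone", "rgb", "sar") are illustrative; real parameter names depend on the implementation:

```python
BASE_LR = 5e-5
LR_MULT = {"backbone": 0.1, "rgb": 5.0, "sar": 3.0}  # per-component multipliers

def group_lr(param_name):
    """Effective learning rate for a parameter, chosen by which
    (hypothetical) module-name substring it contains."""
    for key, mult in LR_MULT.items():
        if key in param_name:
            return BASE_LR * mult
    return BASE_LR

def step_lr(epoch, lr):
    """Multi-step decay: scale by 0.1 at each passed milestone (50, 70)."""
    for milestone in (50, 70):
        if epoch >= milestone:
            lr *= 0.1
    return lr

print(round(group_lr("backbone.stage1.weight"), 10))  # 5e-06
print(round(step_lr(60, 5e-5), 10))                   # 5e-06
print(round(step_lr(80, 5e-5), 10))                   # 5e-07
```

In practice this corresponds to passing several parameter groups (each with its own `lr`) to the AdamW optimizer, so the frozen-ish backbone adapts slowly while the modality-specific adapters learn faster.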

3.2. Ablation Study Results

We first conducted an ablation study on the MOS-Ship dataset to validate the individual contributions of our proposed components: the DPE (Dual-branch Patch Embedding) and MC-LoRA (Modality-Conditioned LoRA) modules. The quantitative results are presented in Table 2.
The baseline model (lacking both DPE and MC-LoRA) achieved an AP50 of 73.3% on the mixed set and 58.4% on the SAR-only set. The addition of DPE improved AP50 to 74.7% (MIX) and 61.5% (SAR_ONLY). Incorporating MC-LoRA alone yielded higher gains, reaching 79.5% (MIX) and 66.5% (SAR_ONLY). The full model, integrating both components, achieved the best performance with 84.1% AP50 on the mixed set and 77.0% on the SAR-only set.

3.3. Comparison with Other Object Detectors

We compared our full model, MOS-DETR, against a suite of other oriented object detectors. The quantitative results are detailed in Table 3.
MOS-DETR achieved 84.1% AP50 on the MIX test set, 88.8% on RGB_ONLY, and 77.0% on SAR_ONLY. It outperformed the second-best method (OrientedFormer) by 3.8% in AP50 on the MIX set and 4.3% on the SAR_ONLY set. Qualitative results are presented in Figure 5.

3.4. Robustness Analysis in Adverse Weather

To quantitatively assess robustness, we conducted experiments under simulated adverse weather conditions where optical data is degraded by occlusions. Figure 6 visualizes the detection results.
Table 4 reports the accuracy, precision, and recall metrics. The optical-only baseline achieved 34.5% recall. Conventional fusion strategies such as probabilistic ensembling (probEn) and score averaging (avg) achieved recalls of 63.9% and 69.6%, respectively. The proposed max-confidence score fusion (max) combined with argmax box selection achieved 86.8% recall and 77.7% accuracy.
We also performed a sensitivity analysis on the Adaptive Max-Confidence Fusion mechanism’s key hyperparameters ( τ IoU and τ angle ), as shown in Table 5. The model achieves peak Accuracy (77.7%) at τ IoU = 0.1 . Increasing τ IoU to 0.5 decreased accuracy to 67.3%. Varying τ angle from 15 to 60 resulted in minor accuracy fluctuations between 77.1% and 77.7%.

4. Discussion

4.1. Architectural Component Analysis

The ablation study demonstrates that the proposed components are critical for effective multi-modal processing. The baseline model struggles significantly with SAR-only data (58.4% AP50), likely because a single set of shared weights cannot adapt to modality-specific feature statistics. The MC-LoRA module provides the most substantial individual performance boost, confirming its role in effectively calibrating features for different modalities. The integration of DPE with MC-LoRA yields the best results (77.0% AP50 on SAR_ONLY), suggesting a synergistic effect where the DPE's specialized processing as a dual-head tokenizer enhances MC-LoRA's ability to perform modality-specific adaptation. This enables the model to effectively process and fuse information from disparate sources.

4.2. Performance vs. Other Methods

The comparative analysis confirms the superior performance of MOS-DETR. While unimodal methods often degrade on the SAR_ONLY split due to the lack of color and texture information, our model maintains robust performance. Qualitative visualizations in Figure 5 suggest that our model extracts salient, modality-specific features—such as textural and scattering properties in SAR or intricate structural details in optical imagery—rather than relying on a generalized, modality-agnostic template based merely on simple geometric properties. This is evidenced by its precise localization in challenging SAR scenes with high speckle noise or ambiguous land-sea boundaries, where other methods often fail.

4.3. Interpretation of Robustness in Adverse Weather

The simulated adverse weather experiments highlight the vulnerability of unimodal systems, where the optical-only baseline’s recall collapses to 34.5%. The comparison of fusion strategies reveals why our proposed method excels. Averaging-based methods (e.g., avg, probEn) dilute the critical, high-confidence signal from the unobscured modality (SAR) with the low-confidence signal from the occluded optical view. In contrast, our max strategy preserves the high-confidence detections, boosting recall to 86.8%.
The sensitivity analysis further clarifies the fusion mechanism’s behavior. The performance drop when increasing τ IoU (from 0.1 to 0.5) indicates that a looser spatial constraint is beneficial for associating multi-modal targets that may exhibit minor spatial misalignment due to sensor differences or calibration errors. Conversely, the method’s insensitivity to τ angle suggests that the orientation predictions are generally consistent across modalities, or that the fusion logic is robust to minor angular deviations. Consequently, the configuration of τ IoU = 0.1 and τ angle = 45 effectively balances spatial tolerance with orientation consistency.

5. Conclusions

In this paper, we introduced a novel framework for robust multi-modal (optical/SAR) ship detection. The core innovation lies in a parameter-efficient adaptive backbone, featuring a dual-branch tokenizer for specialized input embedding and modality-conditioned LoRA modules injected into a frozen Swin Transformer. This design enables efficient, fine-grained adaptation to heterogeneous data while preserving pretrained knowledge. Complementing the detector, we proposed an adaptive max-confidence fusion algorithm for post-processing, which intelligently integrates detections from both modalities and significantly enhances robustness, particularly under adverse weather conditions. Experiments demonstrate state-of-the-art performance, especially on SAR and mixed-modality data, validating our approach's ability to achieve true multi-modal adaptation. Looking ahead, the model's demonstrated capacity for precise, modality-specific feature extraction can be extended to more fine-grained tasks, such as specific ship classification, further enhancing the accuracy and specificity of maritime monitoring.

Author Contributions

Conceptualization, J.Y. and Y.L. (Yixuan Lv); Methodology, J.Y.; Validation, J.Y., S.L. (Silei Liu), K.Z. and Y.L. (Yuxuan Liu); Formal analysis, J.Y., S.L. (Silei Liu), K.Z. and Y.L. (Yuxuan Liu); Investigation, J.Y., Y.L. (Yixuan Lv), S.L. (Silei Liu), K.Z. and Y.L. (Yuxuan Liu); Resources, J.Y.; Data curation, J.Y.; Writing—original draft, J.Y. and Y.L. (Yixuan Lv); Writing—review & editing, S.L. (Shengyang Li); Visualization, J.Y.; Supervision, S.L. (Shengyang Li); Project administration, Y.L. (Yixuan Lv) and S.L. (Shengyang Li); Funding acquisition, S.L. (Shengyang Li). All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Key Research Program of the Chinese Academy of Sciences grant number KGFZD-145-23-18.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Li, X.; Liu, B.; Zheng, G.; Ren, Y.; Zhang, S.; Liu, Y.; Gao, L.; Liu, Y.; Zhang, B.; Wang, F. Deep-learning-based information mining from ocean remote-sensing imagery. Natl. Sci. Rev. 2020, 7, 1584–1605. [Google Scholar] [CrossRef] [PubMed]
  2. Demir, B.; Bovolo, F.; Bruzzone, L. Detection of land-cover transitions in multitemporal remote sensing images with active-learning-based compound classification. IEEE Trans. Geosci. Remote Sens. 2012, 50, 1930–1941. [Google Scholar] [CrossRef]
  3. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. Adv. Neural Inf. Process. Syst. 2015, 28, 91–99. [Google Scholar] [CrossRef]
  4. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar]
  5. Khanam, R.; Hussain, M. What is YOLOv5: A deep look into the internal features of the popular object detector. arXiv 2024, arXiv:2407.20892. [Google Scholar] [CrossRef]
  6. Wang, A.; Chen, H.; Liu, L.; Li, Z.; Chai, S.; Zhang, H.; Yu, G. YOLOv10: Real-time end-to-end object detection. Adv. Neural Inf. Process. Syst. 2024, 37, 107984–108011. [Google Scholar]
  7. Wang, R.; You, Y.; Zhang, Y. Ship detection in foggy remote sensing image via scene classification R-CNN. In Proceedings of the 2018 International Conference on Network Infrastructure and Digital Content (IC-NIDC), Guiyang, China, 22–24 August 2018; IEEE: New York, NY, USA, 2018; pp. 81–85. [Google Scholar]
  8. Zhang, Z.; Zheng, H.; Cao, J.; Liu, W. FRS-Net: An efficient ship detection network for thin-cloud and fog-covered high-resolution optical satellite imagery. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022, 15, 2326–2340. [Google Scholar] [CrossRef]
  9. Zhao, Z.; Li, S. OASL: Orientation-aware adaptive sampling learning for arbitrary oriented object detection. Int. J. Appl. Earth Obs. Geoinf. 2024, 128, 103740. [Google Scholar] [CrossRef]
  10. Zhang, T.; Zhang, X.; Liu, C. Balance learning for ship detection from synthetic aperture radar remote sensing imagery. ISPRS J. Photogramm. Remote Sens. 2021, 182, 190–207. [Google Scholar] [CrossRef]
  11. Zhang, X.; Yang, X.; Li, Y.; Yang, J.; Cheng, M.-M.; Li, X. Rsar: Restricted state angle resolver and rotated sar benchmark. In Proceedings of the Computer Vision and Pattern Recognition Conference, Nashville, TN, USA, 11–15 June 2025; pp. 7416–7426. [Google Scholar]
  12. Yu, H.; Liu, B.; Wang, L.; Li, T. LD-Det: Lightweight Ship Target Detection Method in SAR Images via Dual Domain Feature Fusion. Remote Sens. 2025, 17, 1562. [Google Scholar] [CrossRef]
  13. Sun, Z.; Leng, X.; Zhang, X.; Zhou, Z.; Xiong, B.; Ji, K.; Kuang, G. Arbitrary-Direction SAR Ship Detection Method for Multiscale Imbalance. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5208921. [Google Scholar]
  14. Wang, H.; Liu, S.; Lv, Y.; Li, S. Scattering Information Fusion Network for Oriented Ship Detection in SAR Images. IEEE Geosci. Remote Sens. Lett. 2023, 20, 4013105. [Google Scholar]
  15. Dong, J.; Feng, J.; Tang, X. OptiSAR-Net: A Cross-Domain Ship Detection Method for Multi-Source Remote Sensing Data. IEEE Trans. Geosci. Remote Sens. 2024, 62, 4709311. [Google Scholar]
  16. He, J.; Su, N.; Xu, C.; Li, H. A cross-modality feature transfer method for target detection in SAR images. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5213615. [Google Scholar] [CrossRef]
  17. Zhang, Z.; Zhang, L.; Wu, J. Optical and synthetic aperture radar image fusion for ship detection and recognition: Current state, challenges, and future prospects. IEEE Geosci. Remote Sens. Mag. 2024, 12, 132–168. [Google Scholar] [CrossRef]
  18. Zhang, T.; Zhang, X.; Li, J. SAR ship detection dataset (SSDD): Official release and comprehensive data analysis. Remote Sens. 2021, 13, 3690. [Google Scholar] [CrossRef]
  19. Xian, S.; Wu, Z.; Sun, Y.; Zhang, Q. AIR-SARShip-1.0: High-resolution SAR ship detection dataset. J. Radars 2019, 8, 852–863. [Google Scholar]
  20. Ruan, R.; Yang, K.; Zhao, Z. OGSOD-2.0: A challenging multimodal benchmark for optical-SAR object detection. In Proceedings of the Sixteenth International Conference on Graphics and Image Processing (ICGIP 2024), Nanjing, China, 8–10 November 2024; SPIE: Bellingham, WA, USA, 2025; pp. 11–21. [Google Scholar]
  21. Lindenberger, P.; Sarlin, P.-E.; Pollefeys, M. LightGlue: Local Feature Matching at Light Speed. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 17581–17592. [Google Scholar]
  22. Zhao, J.; Ding, Z.; Zhou, Y.; Xu, Y. OrientedFormer: An end-to-end transformer-based oriented object detector in remote sensing images. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5640816. [Google Scholar] [CrossRef]
  23. Xia, G.S.; Bai, X.; Ding, J.; Zhu, Z.; Belongie, S.; Luo, J.; Datcu, M.; Pelillo, M.; Zhang, L. DOTA: A Large-scale Dataset for Object Detection in Aerial Images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 3974–3983. [Google Scholar]
  24. Wang, H.; Li, S.Y.; Yang, J.; Liu, Y.; Lv, Y.; Zhou, Z. Cross-modal Ship Re-Identification via Optical and SAR Imagery: A Novel Dataset and Method. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 11–15 June 2025. [Google Scholar]
  25. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 11–17 October 2021; pp. 9992–10002. [Google Scholar]
  26. Hu, E.J.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; Chen, W. LoRA: Low-Rank Adaptation of Large Language Models. Int. Conf. Learn. Represent. 2022, 1, 3. [Google Scholar]
  27. Zhang, S.; Wang, X.; Wang, J.; Pang, J.; Lyu, C.; Zhang, W.; Luo, P.; Chen, K. Dense Distinct Query for End-to-End Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 18–22 June 2023; pp. 7329–7338. [Google Scholar]
  28. Ding, J.; Xue, N.; Long, Y.; Lu, G. Learning RoI transformer for oriented object detection in aerial images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 2849–2858. [Google Scholar]
  29. Xu, Y.; Fu, M.; Wang, Q.; Wang, X. Gliding vertex on the horizontal bounding box for multi-oriented object detection. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 43, 1452–1459. [Google Scholar] [CrossRef]
  30. Yang, S.; Pei, Z.; Zhou, F.; Yu, L. Rotated Faster R-CNN for oriented object detection in aerial images. In Proceedings of the 2020 3rd International Conference on Robot Systems and Applications, Tokyo, Japan, 26–29 December 2020; ACM: New York, NY, USA, 2020; pp. 35–39. [Google Scholar]
  31. Yang, X.; Yan, J.; Feng, Z. R3Det: Refined single-stage detector with feature refinement for rotating object. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 2–9 February 2021; Volume 35, pp. 3163–3171. [Google Scholar]
  32. Xie, X.; Cheng, G.; Wang, J.; Shi, X. Oriented R-CNN for object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 3520–3529. [Google Scholar]
  33. Han, J.; Ding, J.; Li, J.; Li, H. Align deep features for oriented object detection. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5602511. [Google Scholar]
  34. Han, J.; Ding, J.; Xue, N.; Luo, H. ReDet: A rotation-equivariant detector for aerial object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; pp. 2786–2795. [Google Scholar]
  35. Yang, X.; Zhang, G.; Li, W.; Liu, T. H2RBox: Horizontal box annotation is all you need for oriented object detection. arXiv 2022, arXiv:2210.06742. [Google Scholar]
  36. Lee, S.; Lee, S.; Song, B.C. CFA: Coupled-hypersphere-based feature adaptation for target-oriented anomaly localization. IEEE Access 2022, 10, 78446–78454. [Google Scholar]
  37. Pu, Y.; Wang, Y.; Xia, Z.; Zhu, X. Adaptive rotated convolution for rotated object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 6589–6600. [Google Scholar]
  38. Li, Z.; Hou, B.; Wu, Z.; Chen, X. FCOSR: A simple anchor-free rotated detector for aerial object detection. Remote Sens. 2023, 15, 5499. [Google Scholar] [CrossRef]
  39. Lee, W.; Chang, H.; Moon, J.; Lee, J.; Kim, M. ABBSPO: Adaptive Bounding Box Scaling and Symmetric Prior based Orientation Prediction for Detecting Aerial Image Objects. In Proceedings of the Computer Vision and Pattern Recognition Conference, Nashville, TN, USA, 11–15 June 2025; pp. 8848–8858. [Google Scholar]
  40. Chen, Y.T.; Shi, J.; Ye, Z.; Mertz, C.; Ramanan, D.; Kong, S. Multimodal Object Detection via Probabilistic Ensembling. In Computer Vision—ECCV 2022; Springer: Cham, Switzerland, 2022; pp. 145–163. [Google Scholar]
Figure 1. The proposed pipeline for multi-modal data preparation and augmentation.
Figure 2. Qualitative comparison of the cloud simulation strategy. (a) The clear optical patch from our dataset. (b) The result of our procedural augmentation, where synthetic clouds are generated using randomized noise paths and screen blending. (c) A real-world remote sensing image with natural cloud cover. The visual similarity in texture, transparency, and edge softness between (b,c) validates the realism of our simulation method for training robust detectors.
Figure 3. An illustration of the proposed multi-modal oriented object detection framework. Its core components include a multi-modal Swin Transformer backbone for feature extraction, a module for multi-scale feature unification and initial query construction, and a cascaded decoder for iterative query refinement.
Figure 4. The Swin Transformer-based multi-modal backbone. Lightweight modality adapters (LoRA) injected into the frozen backbone enable adaptive processing of RGB and SAR images with minimal parameter overhead.
Figure 5. Comparison of our method against other methods on challenging samples. The confidence threshold is set to 0.3. Green solid lines represent ground truth (GT). Red solid lines represent correct predictions (True Positives), while red dashed lines represent false predictions (False Positives). Yellow dashed lines indicate missed detections (False Negatives). First row: optical targets in complex backgrounds. Second row: SAR targets in complex backgrounds. Third and fourth rows: optical and SAR targets from the same background region. (a) Ground truth. (b) Our method. (c) Single-stage method (H2RBox). (d) Two-stage method (Faster R-CNN). (e) End-to-end method (OrientedFormer).
Figure 6. Advantage of multi-modal fusion under synthetic fog. (Top) Optical-only detection successfully identifies non-occluded targets (green solid boxes) but fails in occluded areas (dashed yellow boxes). (Bottom) Our method successfully recovers these targets (solid orange boxes) by incorporating all-weather SAR information via Adaptive Max-Confidence Fusion.
Table 1. Detailed Statistical Distribution of the MOS-Ship Dataset. This table summarizes the data partition, scene complexity, and object instance scales to demonstrate the dataset's representativeness.

Part I: General Data Partition

| Subset | Scenes (Image Pairs) | Description | Total Instances |
|---|---|---|---|
| Training Set | 996 + 996 | SAR + RGB (Clear) | 4389 |
| Validation Set | 250 + 250 | SAR + RGB (Clear) | 1083 |
| Validation Set | 250 | RGB (Cloud-Augmented) | 549 |
| Total | 1246 + 1246 + 250 | SAR + RGB (Clear) + RGB (Cloud-Augmented) | 6021 |

Part II: Scene Distribution

| Scene Category | Count (Image Pairs) | Percentage | Note |
|---|---|---|---|
| In-shore (Ports) | 712 | 57.14% | Heavy clutter, land interference |
| Off-shore (Open Sea) | 534 | 42.86% | Sea clutter, pure background |

Part III: Ship Category/Scale Distribution

| Ship Category * | Instance Count | Scale (Pixels) | Ratio |
|---|---|---|---|
| Small Ship | 1133 | Area < 64² | 20.71% |
| Medium Ship | 3598 | 64² ≤ Area ≤ 128² | 65.75% |
| Large Ship | 741 | Area > 128² | 13.54% |

* Due to the difficulty of distinguishing fine-grained vessel classes in SAR–optical pairs, we classify targets by instance scale. The definitions of Small, Medium, and Large ships are adapted from the standard COCO metrics to accommodate the high resolution of the dataset.
Table 2. Ablation study of DPE and MC-LoRA modules on the MOS-Ship dataset. The symbols '✓' and '×' denote the inclusion and exclusion of the corresponding module, respectively. The best performance metrics are highlighted in bold.

| DPE | MC-LoRA | MIX Recall | MIX AP50 | RGB_ONLY Recall | RGB_ONLY AP50 | SAR_ONLY Recall | SAR_ONLY AP50 |
|---|---|---|---|---|---|---|---|
| × | × | 98.5 | 73.3 | 99.4 | 83.1 | 97.6 | 58.4 |
| ✓ | × | 97.0 | 74.7 | 97.0 | 84.9 | 97.0 | 61.5 |
| × | ✓ | 98.7 | 79.5 | 99.1 | 88.0 | 98.2 | 66.5 |
| ✓ | ✓ | **99.1** | **84.1** | **99.7** | **88.8** | **99.1** | **77.0** |
Table 3. Comparison with state-of-the-art methods on our proposed dataset. All methods are retrained for a fair comparison. The best results are highlighted in bold.

| Method | Year | MIX AP25 | MIX AP50 | MIX AP75 | RGB_ONLY AP25 | RGB_ONLY AP50 | RGB_ONLY AP75 | SAR_ONLY AP25 | SAR_ONLY AP50 | SAR_ONLY AP75 |
|---|---|---|---|---|---|---|---|---|---|---|
| RoI Transformer [28] | 2019 | 79.5 | 71.5 | 58.8 | 83.4 | 83.3 | 75.1 | 74.1 | 63.8 | 39.2 |
| Gliding Vertex [29] | 2020 | 41.1 | 31.3 | 11.8 | 46.2 | 37.7 | 8.9 | 36.1 | 25.7 | 1.9 |
| Rotated Faster R-CNN [30] | 2020 | 77.5 | 61.3 | 12.2 | 83.6 | 69.5 | 17.5 | 68.9 | 46.0 | 9.4 |
| R3Det [31] | 2021 | 66.0 | 50.2 | 13.3 | 71.2 | 60.6 | 16.6 | 60.5 | 39.2 | 6.5 |
| Oriented R-CNN [32] | 2021 | 80.0 | 77.5 | 52.7 | 82.8 | 82.8 | 71.2 | 76.1 | 67.0 | 33.6 |
| S2A-Net [33] | 2021 | 84.6 | 79.5 | 38.8 | 86.9 | 86.8 | 57.7 | 81.7 | 69.3 | 16.9 |
| ReDet [34] | 2021 | 67.9 | 59.5 | 34.9 | 73.1 | 68.2 | 50.8 | 63.6 | 48.5 | 21.0 |
| ORENet [34] | 2021 | 67.9 | 59.5 | 34.9 | 73.1 | 68.2 | 50.8 | 63.6 | 48.5 | 21.0 |
| H2RBox [35] | 2022 | 78.1 | 59.4 | 20.2 | 81.5 | 69.5 | 26.1 | 73.7 | 53.2 | 15.3 |
| CFA [36] | 2022 | 61.3 | 45.1 | 19.6 | 68.0 | 53.8 | 27.4 | 52.8 | 34.6 | 11.7 |
| Rotated RetinaNet [37] | 2023 | 73.4 | 57.3 | 28.3 | 79.5 | 69.6 | 38.4 | 67.4 | 54.1 | 19.0 |
| Rotated FCOS [38] | 2023 | 73.7 | 63.7 | 22.0 | 78.0 | 68.9 | 34.9 | 69.3 | 53.5 | 13.3 |
| OptiSAR-Net [15] | 2024 | 69.7 | 62.6 | 35.0 | 78.9 | 74.4 | 51.1 | 60.7 | 50.9 | 18.9 |
| OrientedFormer [22] | 2024 | 81.7 | 80.3 | 56.6 | 85.5 | 85.4 | 72.8 | 77.3 | 72.7 | 39.4 |
| ABBSPO [39] | 2025 | 85.8 | 74.7 | 37.5 | 88.3 | 86.4 | 47.7 | 82.6 | 67.5 | 23.0 |
| UCR [11] | 2025 | **86.4** | 71.9 | 22.3 | 88.6 | 84.1 | 34.1 | **83.4** | 56.2 | 13.6 |
| MOS-DETR (Ours) | 2025 | 84.6 | **84.1** | **62.8** | **88.8** | **88.8** | **82.8** | 78.5 | **77.0** | **49.6** |
Table 4. Performance comparison under adverse weather conditions, demonstrating fusion effectiveness.

| Score-Fusion | Box-Fusion | Accuracy | Precision | Recall |
|---|---|---|---|---|
| baseline | — | 56.7 | 53.6 | 34.5 |
| probEn [40] | avg | 60.1 | 73.4 | 63.9 |
| probEn [40] | s-avg | 60.1 | 73.4 | 63.9 |
| probEn [40] | argmax | 60.1 | 73.4 | 63.9 |
| avg | avg | 65.6 | 77.2 | 69.7 |
| avg | s-avg | 65.6 | 77.1 | 69.6 |
| avg | argmax | 65.6 | 77.1 | 69.6 |
| max | avg | 77.6 | 84.1 | 86.7 |
| max | s-avg | 77.6 | **84.2** | **86.8** |
| max | argmax | **77.7** | **84.2** | **86.8** |
Table 5. Sensitivity analysis of the Adaptive Max-Confidence Fusion on MOS-Ship by varying IoU ($\tau_{\mathrm{IoU}}$) and Angle ($\tau_{\mathrm{angle}}$) thresholds. Default settings are marked in bold.

| Variable Parameter | τ_IoU | τ_angle (°) | Accuracy | Precision | Recall |
|---|---|---|---|---|---|
| Varying IoU Threshold | **0.1** | **45** | 77.7 | 84.2 | 86.8 |
| | 0.2 | 45 | 76.3 | 82.6 | 86.9 |
| | 0.3 | 45 | 74.6 | 80.5 | 86.9 |
| | 0.5 | 45 | 67.3 | 72.3 | 86.9 |
| Varying Angle Threshold | 0.1 | 15 | 77.1 | 83.4 | 86.8 |
| | 0.1 | 30 | 77.6 | 84.2 | 86.8 |
| | **0.1** | **45** | 77.7 | 84.2 | 86.8 |
| | 0.1 | 60 | 77.6 | 84.2 | 86.8 |