Abstract
Maritime ship detection is a critical task for security and traffic management. To advance research in this area, we construct MOS-Ship, a new high-resolution, spatially aligned optical-SAR dataset. Building on it, we propose MOS-DETR, a query-based detection framework whose multi-modal Swin Transformer backbone extracts unified feature pyramids from paired RGB and SAR images, allowing the model to jointly exploit optical textures and SAR scattering signatures for precise oriented bounding box prediction. We further introduce an adaptive probabilistic fusion mechanism, a post-processing module that dynamically integrates the detections produced from the optical and SAR inputs, combining their complementary strengths. Experiments show that MOS-DETR achieves highly competitive accuracy, significantly outperforms unimodal baselines, and remains robust across diverse imaging conditions. This work provides a solid framework and methodology for advancing multimodal maritime surveillance.