With the advancement of satellite technology, remote sensing platforms have developed rapidly. Synthetic aperture radar (SAR) and optical sensors are the two primary payloads of remote sensing satellites, providing researchers with a wealth of information. SAR operates in all weather conditions, day and night, although its imagery is difficult for humans to interpret. Optical imagery, on the other hand, is well suited to human visual recognition but is limited to daytime, clear-weather operation. SAR and optical imagery are therefore highly complementary. Ship detection is crucial to maritime search and rescue, trade, and traffic planning. Consequently, developing a highly accurate and computationally efficient algorithm that can identify ships with arbitrary orientations in both optical and SAR images holds significant research value.
The field of object detection has witnessed remarkable progress with the rapid development of deep learning, particularly in general image detection scenarios. Deep learning-based approaches are broadly categorized into two types: single-stage methods, known for their efficiency (e.g., the YOLO series [4,5,6,7,8,9,10,11,12]), and two-stage methods, which generally offer higher accuracy (e.g., Faster R-CNN [13]). However, directly applying these general object detection algorithms to ship detection in remote sensing images presents significant challenges. Unlike natural images, remote sensing images exhibit distinct data distributions, imaging geometries, diverse resolutions, and complex maritime backgrounds, which limit the direct applicability of models trained on natural image datasets. Furthermore, many general object detection algorithms entail high computational demands and large parameter counts, further constraining their deployment for remote sensing image analysis. Consequently, substantial research has focused on developing efficient and accurate deep learning algorithms tailored specifically for ship detection in remote sensing imagery.
Sun et al. [14] introduced BiFA-YOLO, a YOLO-based detection algorithm for arbitrarily oriented ships in SAR images. This method incorporates a bidirectional feature fusion module (Bi-DFFM) for multi-scale feature aggregation and an angular classification structure for precise angle detection. Additionally, a random-rotation mosaic data augmentation technique is used to enhance detection performance. Zhou et al. [15] developed PVT-SAR, a novel SAR ship detection framework leveraging a pyramid vision transformer (PVT) to capture global dependencies via self-attention. The framework includes overlapping patch embedding and mixed transformer encoder modules to handle dense targets and limited data. A multi-scale feature fusion module is employed to improve small-target detection, and a normalized Gaussian Wasserstein distance loss is used to reduce scattering interference. Yu et al. [16] presented AF2Det, an anchor-free and angle-free detector, which uses bounding-box projection (BBP) to predict the angles of target objects’ oriented bounding boxes (OBBs). This approach avoids boundary discontinuity and employs an anchor-free architecture with deformable convolution and bottom-up feature fusion to enhance detection capabilities. Zhou et al. [17] designed a novel ellipse parameter representation for arbitrarily oriented objects. This method embeds the object’s angle within the ellipse’s focal vector, enabling a single numerical representation and reducing uncertainty in bounding-box regression. The algorithm uses a 2D Gaussian distribution for initial sample selection, employs Kullback–Leibler divergence loss as the regression loss function, and utilizes SimOTA as the label assignment strategy. Song et al. [18] implemented SRDF, a single-stage rotated-object detector that integrates an instance-level false-positive suppression module (IFPSM) to mitigate non-target responses through spatial feature encoding. A hybrid classification and regression approach represents object orientation, and traditional post-processing is replaced with a 2D probability distribution model for more precise bounding-box extraction. Wan et al. [19] developed a small-target azimuth detector for SAR ships via Gaussian label matching and semantic flow feature alignment. The method introduces the FAM module into an FPN to align deep and shallow semantics and incorporates attention mechanisms within an adaptive boundary enhancement module. A Gaussian distribution-based label matching strategy enables effective regression learning even when bounding boxes do not overlap. Zhang et al. [20] proposed SPG-OSD, a novel oriented-ship detection method for SAR images. To improve detection accuracy and reduce false alarms and missed detections, SPG-OSD incorporates three mechanisms: first, an oriented two-stage detection module based on scattering characteristics; second, a scattering-point-guided region proposal network (RPN) to predict potential key scattering points; and third, a region-of-interest (RoI) contrastive loss function to enhance inter-class distinction and reduce intra-class variance. Experimental results on Gaofen-3 satellite imagery demonstrate that the algorithm achieves advanced performance. Liu et al. [21] presented YOLOv7oSAR, a SAR ship detection model that uses a rotated-box mechanism and a KLD loss function to enhance accuracy. The model incorporates a Bi-former attention mechanism for small-target detection and a lightweight P-ELAN structure to reduce model size and computational requirements. Meng et al. [22] proposed LSR-Det, a lightweight object detection algorithm for rotated-ship detection in SAR images. This method employs a contour-guided backbone network to reduce model parameters while maintaining strong feature extraction capabilities and introduces a lightweight adaptive feature pyramid network (FPN) to enhance cross-layer feature fusion. Additionally, a rotated detection head with shared CNN parameters is designed to improve the precision of multi-scale ship target detection. Huang et al. [23] developed NST-YOLO11, an arbitrarily oriented-ship detection algorithm based on YOLO11. It uses an improved Swin Transformer module and a Cross-Stage connected Spatial Pyramid Pooling-Fast (CS-SPPF) module. To eliminate information redundancy from the Swin Transformer module’s local window self-attention, NST-YOLO11 employs a unified spatial-channel attention mechanism. Additionally, an advanced SPPF module is designed to further enhance the detection algorithm’s performance. Li et al. [24] derived novel conditions for encoding methods and loss functions, introducing the Coordinate Decomposition Method (CDM) and developing a joint optimization paradigm to improve detection performance and address boundary discontinuity. Qin et al. [25] proposed DSA-Net based on SkewCIoU, using a CLM for spatial attention and a GCM for channel attention to enhance ship discrimination. The method introduces a SkewCIoU loss to improve the detection of slender ships. Li et al. [26] improved the S2A-Net ship detection method by embedding pyramid squeeze attention to focus on key features and designing a context information module to enhance the network’s contextual understanding. To improve image quality before detection, they used a dehazing network based on fog density and depth decomposition and applied an image weight sampling strategy to handle imbalanced ship category distributions. Liu et al. [27] proposed the TS2Anet model, leveraging the S2Anet rotated-box object detection network with the PVTv2 structure as its backbone. The method employs the “cutout” image pre-processing technique to simulate occlusion caused by clouds and uses the GHM loss function to reduce the influence of outlier samples. Liang et al. [28] introduced MidNet, an anchor- and angle-free detector that represents each oriented object using a center and four midpoints. The method employs symmetric deformable convolution to enhance midpoint features and adaptively matches the center and midpoints by predicting centripetal shifts and matching radii. A concise analytical geometry algorithm ensures the accuracy of the oriented bounding boxes. Yan et al. [29] enhanced ReDet with three key improvements, resulting in a more powerful ship detection model, ReBiDet. The enhancements include replacing the FPN structure with a ReBiFPN to capture multi-scale features, utilizing K-means clustering to adjust anchor sizes according to ground-truth aspect ratios, and employing a DPRL sampler to address scale imbalance. Fang et al. [30] proposed YOLO-RSA, composed of a feature extraction backbone, a multi-scale feature pyramid, and a rotated detection head. Tests on HRSC2016 and DOTA show that it outperforms other methods in recall, precision, and mAP; ablation studies quantify each component’s contribution, and generalization tests demonstrate its robustness across diverse scenarios. Huang et al. [31] enhanced the YOLOv11 algorithm for ship detection in remote sensing images by introducing a lightweight and efficient multi-scale feature expansion neck module. This method uses a multi-scale expansion attention mechanism to capture semantic details and combines cross-stage partial connections to boost spatial semantic interaction. Additionally, it employs the GSConv module to minimize feature transmission loss, resulting in a ship detection model that is both lightweight and high-precision. Gao et al. [32] proposed a YOLOv5-based oriented-ship detector for remote sensing images. It introduces a cross-stage partial context transformer (CSP-COT) module to capture global contextual spatial relationships, adds an angle classification prediction branch to the YOLOv5 head network for detecting targets in any direction, and designs a probability distribution loss function (ProbIoU) to optimize the regression effect. Additionally, a lightweight task-specific context decoupling (LTSCODE) module replaces the original YOLOv5 head to resolve the accuracy loss caused by YOLOv5’s coupling of classification and localization tasks. Sun et al. [33] proposed MSDFF-Net, which integrates a multi-scale large-kernel block (MSLK) to enhance noisy features, a dynamic feature fusion (DFF) module to suppress clutter, and a Gaussian probability loss (GPD) for elliptical-ship regression. This method achieves a state-of-the-art 91.55% AP50 in SAR ship detection tasks. Zhang et al. [34] proposed a cross-sensor SAR image detector based on dynamic feature discrimination and center-aware calibration. It incorporates a dynamic feature discrimination module (DFDM) to alleviate regression offset via bidirectional spatial feature aggregation and multi-scale feature enhancement, and a center-aware calibration module (CACM) to reduce feature misalignment by modeling target salience and focusing on the target’s perception center. Experiments on the MiniSAR and FARAD datasets show improvements of over 6–20% in mAP and F1 score, validating its effectiveness.
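Several of the methods above ([15], [17], [19], [33]) share a common device: an oriented bounding box is modeled as a 2D Gaussian distribution, so that distribution-based losses (Wasserstein or Kullback–Leibler) remain smooth where direct angle regression is discontinuous. A minimal NumPy sketch of this conversion and of the KL divergence between two such Gaussians is given below; it is illustrative only, and the cited works differ in their exact normalization and loss formulations.

```python
import numpy as np

def obb_to_gaussian(cx, cy, w, h, theta):
    """Map an oriented box (center, width, height, angle in radians)
    to a 2D Gaussian: mean = box center, covariance from size and angle."""
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    S = np.diag([(w / 2) ** 2, (h / 2) ** 2])  # axis-aligned variances
    return np.array([cx, cy]), R @ S @ R.T     # mean, covariance

def gaussian_kld(mu0, sigma0, mu1, sigma1):
    """KL(N0 || N1) for two 2D Gaussians (closed form)."""
    inv1 = np.linalg.inv(sigma1)
    diff = mu1 - mu0
    return 0.5 * (np.trace(inv1 @ sigma0)
                  + diff @ inv1 @ diff
                  - 2.0
                  + np.log(np.linalg.det(sigma1) / np.linalg.det(sigma0)))
```

Because a rectangle rotated by π/2 with its width and height swapped maps to the same Gaussian, the divergence between two such parameterizations of the same physical box is zero; this is precisely how Gaussian representations sidestep the boundary-discontinuity problem that several of the above detectors target.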
While deep learning has significantly advanced ship detection, two fundamental limitations persist. First, state-of-the-art detectors often entail substantial computational costs; for instance, S2ANet demands 56.22 GFLOPs on the RSDD-SAR dataset, reflecting an inefficient computational design. Second, complex backgrounds in remote sensing images, particularly in rotated-object detection, pose significant challenges and often lead to false positives, as current methods struggle to decouple target signatures from background clutter. This creates a pervasive accuracy–efficiency dilemma: two-stage methods achieve high precision through cascaded optimization but suffer from notably longer inference times (e.g., 62.89 ms for ReDet [35] on an NVIDIA V100), significantly exceeding those of single-stage detectors such as S2ANet [36] (42.63 ms); conversely, single-stage methods, despite their speed, often struggle in complex scenarios (e.g., dense port ships and coastline interference) because they lack a fine-grained regression step. To address this, we introduce EfficientRDet, a novel "pseudo-two-stage" paradigm that integrates lightweight refinement modules into a single-stage architecture, enabling two-stage-level precision without sequential computational overhead. This paradigm is further bolstered by three novel mechanisms. On the RSDD-SAR dataset, EfficientRDet achieves 93.58% AP50 at 14.08 ms latency, surpassing ReDet by 1.29% AP50 while operating at 4.47× its frame rate, thereby breaking the conventional accuracy–efficiency trade-off.
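The quoted speedup follows directly from the per-image latencies, since frame rate is the reciprocal of latency. A short sketch of the arithmetic (the helper name `fps` is ours; the latency values are those reported above):

```python
def fps(latency_ms):
    """Frames per second from a per-image latency in milliseconds."""
    return 1000.0 / latency_ms

efficientrdet_fps = fps(14.08)           # ~71.0 FPS
redet_fps = fps(62.89)                   # ~15.9 FPS
speedup = efficientrdet_fps / redet_fps  # = 62.89 / 14.08, i.e. ~4.47x
```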