4.1. Datasets and Implementation
We compare our model with representative detectors and conduct ablation studies to highlight the contribution of each component. We use standard evaluation metrics, including mean Average Precision (mAP), Precision, and Recall at a specified IoU threshold. Unless otherwise stated, the backbone is an ImageNet-pretrained ResNet-50. In GDConv layers we set the number of groups to G = 4. Our full configuration, termed DVDNet, consists of the backbone together with GDConv inserted at conv3 and conv4, the LBP enhancement on P3, and the vectorized regression head. In addition, for the two main datasets, HRSID and SSDD, we also report performance under different target sizes. Following the COCO scale definition, targets are divided into small, medium, and large categories based on pixel area, and AP_S, AP_M, and AP_L are calculated accordingly. Each of these is averaged over IoU thresholds from 0.5 to 0.95 with a step size of 0.05 and is computed only for targets at the corresponding scale.
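The size-based split can be illustrated with a minimal sketch, assuming the usual COCO area thresholds of 32² and 96² pixels. The function names are hypothetical and are not taken from our code base, and the treatment of targets lying exactly on a boundary is a choice made for this sketch.

```python
# Minimal sketch of the COCO-style scale split (helper names are hypothetical).
SMALL_MAX_AREA = 32 ** 2    # below this area a target counts as small
MEDIUM_MAX_AREA = 96 ** 2   # above this area a target counts as large


def coco_scale(width, height):
    """Return 'small', 'medium', or 'large' for a box of the given pixel size."""
    area = width * height
    if area < SMALL_MAX_AREA:
        return "small"
    if area <= MEDIUM_MAX_AREA:  # boundary handling is an assumption of this sketch
        return "medium"
    return "large"


def coco_iou_thresholds():
    """IoU thresholds 0.50, 0.55, ..., 0.95 over which AP is averaged."""
    return [round(0.5 + 0.05 * i, 2) for i in range(10)]


if __name__ == "__main__":
    for size in [(20, 20), (40, 40), (100, 100)]:
        print(size, coco_scale(*size))
    print(coco_iou_thresholds())
```

Under this convention a 20 × 20 px vessel contributes only to the small-target AP, whereas a 100 × 100 px vessel contributes only to the large-target AP.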
Our model is implemented in PyTorch 2.3.1. GDConv layers are inserted in the ResNet-50 at conv3 and conv4, replacing the conv in those blocks with a deformable conv of the same size and 4 groups. The LBP module is applied to the P3 feature: we reduce it to 1 channel and compute LBP over the 8-neighborhood. The 8 binary maps are concatenated and passed through a conv to restore 256 channels. For vector decomposition, our ROI head outputs 6 regression values per ROI instead of 4. We still use class-agnostic regression for simplicity.
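The 8-neighborhood LBP step can be illustrated on a single-channel map with the following pure-Python sketch. It is not the actual module, which operates on feature tensors and is followed by the conv that restores 256 channels. The neighbor ordering, the "greater than or equal to" comparison, and the zero padding at the border are assumptions of this sketch, and the function names are hypothetical.

```python
# Sketch of 8-neighbourhood LBP on a 2D list of numbers (hypothetical helper names).
# Offsets (dy, dx), clockwise from the top-left neighbour; the ordering is an assumption.
NEIGHBOURS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]


def lbp_binary_maps(feature):
    """Return 8 binary maps; map k is 1 where neighbour k >= the centre value.

    Neighbours outside the map are treated as 0 (zero padding, an assumption).
    """
    h, w = len(feature), len(feature[0])
    maps = []
    for dy, dx in NEIGHBOURS:
        plane = [[0] * w for _ in range(h)]
        for y in range(h):
            for x in range(w):
                ny, nx = y + dy, x + dx
                inside = 0 <= ny < h and 0 <= nx < w
                neighbour = feature[ny][nx] if inside else 0
                plane[y][x] = 1 if neighbour >= feature[y][x] else 0
        maps.append(plane)
    return maps


def lbp_code(maps, y, x):
    """Pack the 8 binary responses at (y, x) into the classic 8-bit LBP code."""
    return sum(maps[k][y][x] << k for k in range(8))


if __name__ == "__main__":
    feature = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    maps = lbp_binary_maps(feature)
    print(len(maps), lbp_code(maps, 1, 1))
```

Keeping the 8 binary maps as separate channels, rather than packing them into a single code, lets the subsequent conv learn how to weight each direction.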
All methods in the comparison experiments are initialized from publicly available pre-trained weights and are trained and evaluated independently on each dataset, without mixed-dataset training. Evaluation is performed uniformly on single-scale inputs with a non-maximum suppression threshold of 0.5, and we report mAP, Precision, and Recall.
The YOLO family consists of YOLOv5, YOLOv6, YOLOv7, YOLOv8, and YOLOX-s, as well as E2YOLOX-VFL, YOLOv7oSAR, and Light-YOLOv8. These models strictly follow the default configuration of their official implementations for training and inference. We use each repository’s default optimizer and learning rate schedule, employ the default strong data augmentation schemes such as multi-scale training, random affine, and mosaic with mixup or cutmix, and leave anchors and EMA in their official on or off state. Input resolution and batch size take their default values, and the best weights are selected by validation set metrics without test-time augmentation.
Other families of methods include RetinaNet, FCOS, CenterNet, Cascade R-CNN, Libra R-CNN, Sparse R-CNN, Dynamic R-CNN, GCNet, SSD300, DETR, Deformable DETR, and others. These are trained according to the recommended recipe of each paper or official code. The optimizers and learning rate schedules from the corresponding papers are used, with regular data augmentation consisting of random scale jitter and horizontal flipping. AdamW is used for the DETR series and SGD for the remaining methods; an ImageNet-pretrained ResNet-50 serves as the backbone, and the remaining hyperparameters keep their official defaults. These settings are designed to let each method reach its native performance while ensuring a consistent evaluation process and reproducible results. For a fair comparison and reproducible FLOPs, the input resolution is fixed at 800 × 800 on all four datasets.
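For reference, the role of the 0.5 threshold in non-maximum suppression can be sketched as follows. This is a greedy NMS over axis-aligned boxes written for illustration only; it is not the implementation used by any of the compared detectors, and rotated boxes would require a rotated IoU instead. The function names are hypothetical.

```python
# Greedy NMS sketch for axis-aligned boxes (x1, y1, x2, y2); names are hypothetical.

def iou(a, b):
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Return indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap a kept box by more than the threshold.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


if __name__ == "__main__":
    boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
    scores = [0.9, 0.8, 0.7]
    print(nms(boxes, scores))
```

In this toy example the second box overlaps the first with an IoU of about 0.68 and is suppressed, while the distant third box is kept.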
The HRSID dataset contains 5604 high-resolution images of 800 × 800 pixels with 16,951 vessels. Its sources are a combination of Sentinel-1B (C-band) and TerraSAR-X and TanDEM-X (X-band) satellites, with pixel resolutions of 0.5 m, 1 m, and 3 m, covering a wide range of imaging angles and sea states. Horizontal bounding boxes, instance segmentation masks, and “in-shore/off-shore” scene labels are also provided, and 400 pure background images are included for robustness testing. To ensure reproducibility, we use a fixed random seed of 42 to divide the data into training, validation, and test sets at 60%, 20%, and 20%. The input size is kept at 800 × 800. If an originally provided scene image is not square, it is sliced into 800 × 800 tiles according to the official slicing method before being used for training and evaluation.
SSDD [30] is a SAR small target detection dataset. It includes a total of 1160 images containing 2456 vessels, an average of about 2.1 vessels per image. The annotations are horizontal bounding boxes, and the official new version also adds rotated boxes and pixel-level segmentation, which is convenient for studying small targets, dense targets, and fine localization. The images are mainly derived from the radar satellites RadarSat-2, TerraSAR-X, and Sentinel-1. The polarization modes cover HH/HV/VH/VV with a spatial resolution of 1 m–15 m, and the scenes cover a wide range of sea states in both near-shore and far-shore areas. The original image scale is not uniform. In this paper, all samples are letterboxed to 800 × 800 and divided into training, validation, and test sets at 60%, 20%, and 20% with a random seed of 42. The annotations remain single-class rotated boxes, and the evaluation protocol is consistent with that of HRSID.
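The letterbox operation mentioned above can be sketched as a uniform rescale followed by padding to the 800 × 800 canvas. The example below is an assumption about a typical letterbox, not a copy of our preprocessing code; centering the image within the padding and the 500 × 350 input size are illustrative choices, and the function names are hypothetical.

```python
# Letterbox geometry sketch (hypothetical helpers; centred padding is an assumption).

def letterbox_params(width, height, target=800):
    """Return (scale, (new_w, new_h), (left, top, right, bottom)) for a square canvas."""
    scale = min(target / width, target / height)   # equal scaling on both axes
    new_w, new_h = round(width * scale), round(height * scale)
    pad_x, pad_y = target - new_w, target - new_h
    left, top = pad_x // 2, pad_y // 2
    return scale, (new_w, new_h), (left, top, pad_x - left, pad_y - top)


def map_point(x, y, scale, left, top):
    """Map an annotation coordinate from the original image to the letterboxed one."""
    return x * scale + left, y * scale + top


if __name__ == "__main__":
    scale, size, pad = letterbox_params(500, 350)
    print(scale, size, pad)
    print(map_point(250, 175, scale, pad[0], pad[1]))
```

Because the same scale is applied to both axes, the aspect ratio of elongated vessels is preserved, and box coordinates only need the scale and offset shown in `map_point`.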
The HRSC2016 dataset is a challenging remote sensing benchmark focused on ship detection in aerial images. It contains high-resolution images with significant variations in scale, orientation, and aspect ratio. One notable feature of HRSC2016 is that objects are labeled with oriented bounding boxes, making it an ideal test environment for evaluating the effectiveness of orientation-aware detection methods. The data are divided into training, validation, and test sets at a ratio of 6:2:2, using a fixed random seed of 42 to ensure reproducibility. As with the other datasets, the images are converted to a resolution of 800 × 800.
In addition, we acquired and annotated a dataset of 2400 SAR small ship images with a spatial resolution of 1–10 m, covering multiple scene types. The polarization is mainly VV, with a small amount of VH, and the annotations are rotated bounding rectangles with a single ship class, with the scene type identifier retained. To ensure consistency with the publicly available benchmark datasets, all samples are letterboxed to 800 × 800 after equal scaling and divided at a 7:1:2 ratio into 1680 training, 240 validation, and 480 test images, again using a fixed random seed of 42. Samples from the HRSC2016 and SAR small ship datasets are shown in Figure 6.
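A seeded split of the kind described for the four datasets can be sketched as follows. This is only an illustration under stated assumptions: the shuffling routine, the rounding of subset sizes, and the function name are hypothetical, so the resulting index assignment will not necessarily match the split used in our experiments.

```python
# Reproducible ratio split sketch (hypothetical helper; rounding rule is an assumption).
import random


def split_indices(n, ratios, seed=42):
    """Shuffle range(n) with a fixed seed and cut it into train/val/test index lists."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)           # local RNG, global state untouched
    total = sum(ratios)
    n_train = round(n * ratios[0] / total)
    n_val = round(n * ratios[1] / total)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]               # remainder goes to the test set
    return train, val, test


if __name__ == "__main__":
    train, val, test = split_indices(2400, (7, 1, 2))
    print(len(train), len(val), len(test))
```

With 2400 images and a 7:1:2 ratio, the subset sizes are 1680, 240, and 480, matching the counts given for the self-built dataset; calling the function again with the same seed returns the same partition.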
4.2. Comparative Experiment
From the results shown in Table 1, Table 2 and Table 3, our proposed model consistently outperforms a broad set of state-of-the-art detection methods across all benchmark datasets. Superior performance is achieved on the SAR small target detection datasets HRSID and SSDD, as well as on the optical remote sensing dataset HRSC2016 and the self-built SAR small ship dataset. The improvements are evident across mAP, precision, and recall. These gains can be attributed to the synergy between three architectural innovations in our design: Grouped Deformable Convolution (GDConv), the Local Binary Pattern (LBP) enhancement module, and the Vector Decomposition-based bounding box regression head.
SAR small target detection is characterized by high clutter, low signal-to-noise ratio, and very small target scales, and our method maintains high precision and recall on both mainstream datasets under these conditions. As shown in Table 1, the mAP of our method on the HRSID dataset is 90.9%, and the mAP on the SSDD dataset is 87.2%, which exceeds the other mainstream state-of-the-art models. Although both HRSID and SSDD are SAR small target detection datasets, their scenarios differ greatly: HRSID features complex harbor and nearshore scenes, whereas SSDD is dominated by monotonous offshore backgrounds. The performance of our method is stable across the two, with a cross-domain fluctuation of less than 4 percentage points. In addition, our method achieves 86.2% precision and 91.7% recall on HRSID, and 90.4% precision and 90.7% recall on SSDD. SAR applications are particularly sensitive to false alarms such as ship shadows, and high precision directly reduces the burden of back-end screening. High recall, which avoids missed detections, is equally critical in maritime surveillance; compared with the 83.4% of the traditional two-stage Faster R-CNN, recall improves by about 8 percentage points. E2YOLOX-VFL, YOLOv7oSAR, and Light-YOLOv8 are models designed specifically for small ship detection in remote sensing, and the experimental results show that the detection performance of the proposed DVDNet remains superior to that of Light-YOLOv8. On HRSID, the recall of DVDNet shows the clearest advantage, and its mAP is also ahead or equal. This indicates that GDConv and LBP improve proposal quality and small-target separability in nearshore scenes with strong clutter and dense small targets. On SSDD, mAP is level or slightly ahead, but Precision and Recall are significantly ahead, which indicates that DVDNet better suppresses sea surface noise, wave crests, and other bright false targets and produces significantly fewer false detections.
Our method maintains low drift under scene switching on both mainstream datasets, HRSID and SSDD. Compared with other mainstream models, it demonstrates robustness to resolution differences, shoreline clutter, and strong scattering in the SAR small target detection task. Relative to conventional two-stage methods such as Faster R-CNN, this paper replaces the conventional FPN neck with an augmented structure consisting of GDConv on C3 and C4 and LBP texture enhancement on P3. Together with the vectorized regression head, this specifically enhances the response to, and robustness for, small targets and arbitrarily oriented ships, resulting in an improvement of approximately 20% in precision and approximately 10% in recall. In contrast to Transformer and DETR-like methods, we keep the CNN backbone together with lightweight rotation attention, suppress background with a local prior, converge quickly, and do not rely on tens of thousands of warm-up steps.
It can be seen that, even though mAP alone is not always maximized, our method achieves a favorable balance among combined precision and recall, cross-domain stability, and computational cost. For tasks such as real-time maritime surveillance, vessel capturing, and satellite-borne SAR patrols, our approach offers a lower risk of missed detections and the overall advantage of being deployable on edge hardware.
To better examine the advantage of DVDNet in small target remote sensing detection, the two remote sensing small target datasets were further divided according to the COCO criteria to verify the performance of the model on small, medium, and large targets. From the overall results in Table 2, DVDNet achieves the highest AP_S in both datasets, 80.1 for HRSID and 73.8 for SSDD, a stable advantage over the three YOLO variants designed for small vessels in remote sensing. For medium and large targets, DVDNet is almost equal to the strongest lightweight YOLO models: on HRSID its AP_M is 92.1, equal to the 92.1 of Light-YOLOv8, and its AP_L is 94.1, slightly lower than the 94.6 of Light-YOLOv8. The AP_M and AP_L on SSDD are 90.0 and 92.9, which are also in the top tier. Additionally, as the model size of the YOLO series increases, as with YOLOX-l and YOLOv5x, the detection performance of YOLO-based models improves. This is evident both in the overall performance shown in Table 1 and in the AP values for different sizes in Table 2. In particular, the AP values for medium and large targets show a significant increase and surpass those of the proposed DVDNet. Combining Table 1 and Table 2, it can be seen that, since both datasets are small object detection datasets, DVDNet performs best on small objects and in overall performance. Compared with the YOLO series methods, DVDNet achieves an mAP nearly equal to that of YOLOv8x and YOLOv5x with only a moderate parameter count, while saving approximately 40% in FLOPs compared with YOLOv8x.
Traditional two-stage and early one-stage methods generally have low AP_S. For example, the AP_S of Faster R-CNN is 65.5 on HRSID and 54.2 on SSDD, which indicates that it is difficult to fully capture the details in nearshore clutter and small target scenes by relying only on regular sampling and standard regression. The global representation of the Transformer family of methods is better suited to medium and large targets, but it is still inferior to DVDNet in AP_S. For example, Deformable DETR reaches 77.1 on HRSID and 67.0 on SSDD.
In terms of efficiency, the YOLO family trades a very low number of parameters and FLOPs for high throughput, such as 3.2 M parameters and 13.6 G FLOPs for YOLOv8n. The overall computational cost of the two-stage family is higher, with DVDNet’s 55.3 M parameters and 339.8 G FLOPs being of the same order of magnitude as the standard two-stage backbone. Compared with the 326.7 G of Faster R-CNN, DVDNet adds only a small amount of overhead but gains significantly in AP_S and overall AP on both datasets. Considering accuracy and overhead together, DVDNet has the most prominent advantage in small target detection while maintaining medium and large target performance comparable to its strongest rivals, demonstrating greater robustness under complex sea conditions and multi-scale conditions. This can be explained by the combination of module designs: GDConv on C3 and C4 provides adaptive sampling of elongated and arbitrarily oriented ships, LBP on P3 strengthens texture and edge details, and vectorized regression improves the stability of rotated box fitting. The synergy of the three is directly reflected in the sustained lead in AP_S.
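The size of the overhead relative to Faster R-CNN follows directly from the two FLOPs figures quoted above, as the short calculation below shows. The helper name is hypothetical; only the 339.8 G and 326.7 G values come from the text.

```python
# Worked arithmetic for the FLOPs overhead quoted in the text (helper name is hypothetical).

def relative_overhead(new, base):
    """Percentage increase of `new` over `base`."""
    return (new - base) / base * 100.0


dvdnet_gflops = 339.8        # DVDNet, from the text
faster_rcnn_gflops = 326.7   # Faster R-CNN, from the text

extra = dvdnet_gflops - faster_rcnn_gflops
pct = relative_overhead(dvdnet_gflops, faster_rcnn_gflops)
print(f"{extra:.1f} GFLOPs extra, {pct:.1f}% over Faster R-CNN")
# prints "13.1 GFLOPs extra, 4.0% over Faster R-CNN"
```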
On the HRSC2016 dataset, which presents unique challenges in remote sensing due to high-resolution imagery and densely packed, arbitrarily oriented ship targets, our model achieves the highest mAP of 80.7%, surpassing all other methods, including strong baselines such as Cascade R-CNN (80.6%), Sparse R-CNN (79.7%), and YOLOX-s (79.5%). Our model also achieves the best precision, 85.0%, and recall, 83.0%, indicating both accurate localization and strong object coverage. These results confirm the effectiveness of our architecture in capturing fine-grained textures, geometric variations, and rotation patterns, which are key traits of ship detection in remote sensing imagery.
On the SAR small ship dataset, within the YOLO family in Table 3, YOLOv6-n reaches the highest mAP, 94.8, but its precision is only 91.8, indicating that recalling more targets also brings relatively more false detections. The mAP values of Light-YOLOv8 and YOLOv7oSAR are 91.5 and 92.2, one tier behind. Among the two-stage and Transformer-based methods, Deformable DETR has an mAP of 94.2 with a precision of 95.7, the highest in the whole table. However, with a recall of 93, its strategy is more conservative and produces slightly more misses. The mAP of DVDNet is 95.1, and its precision and recall are 93.8 and 94.1, respectively, which is the best combination of the two and gives a more balanced trade-off between false detection control and detection capability. From the perspective of task requirements, harbor and near-shore scenarios place more emphasis on low false alarms, and DVDNet’s high-precision advantage is more valuable in practice. On the whole, DVDNet combines high precision with high recall while maintaining near-optimal mAP, which demonstrates the stable gain of GDConv and LBP for small, elongated vessels with variable orientations.
Cross-domain Generalization. The consistent performance on HRSID, SSDD, HRSC2016, and the self-built SAR small ship dataset demonstrates that our model not only excels in the remote sensing domain but also exhibits strong generalization ability across natural scenes and varied detection challenges. The model can accurately detect extremely small vessels (10–30 px) in near-shore, high-clutter scenes, and it can also rapidly lock onto sparse targets across wide-area offshore imagery. While many detectors perform well only in specific domains, our method achieves top performance across all four datasets, making it a robust and transferable detection solution. This cross-domain capability is critical for real-world deployment, where model reliability across diverse environments is essential.
Figure 7 and Figure 8 show a visual comparison of DVDNet and Faster R-CNN on the HRSID and SSDD remote sensing ship detection datasets, respectively. The visualization results jointly indicate that in small, dense, and arbitrarily oriented ship scenes, DVDNet improves both recall and localization accuracy compared with Faster R-CNN. In HRSID, with strong near-shore clutter, DVDNet detects small boats of 10–30 px more completely, and the rotated box aligns better with the boat’s longitudinal axis. This significantly reduces false positives in which coastlines and bright spots are misidentified as boats. In SSDD, with weaker textures further offshore, DVDNet provides higher and more stable confidence scores. The angles and aspect ratios of slender vessels are better matched, and small bright spots in the open sea are effectively suppressed. These advantages correspond directly to the model design. In the C3 and C4 stages, GDConv learns deformation-adaptive and orientation-adaptive sampling to mitigate mismatches caused by flat and elongated objects. LBP texture enhancement amplifies the subtle contrasts of edges and corners in speckled backgrounds, making small objects more distinguishable. Vectorized regression in the ROI head avoids angular discontinuities, improving IoU stability for objects in any direction. Overall, DVDNet’s performance improvement is most significant at the small scale, highlighting its structural advantages for small object detection.
It can be seen that, both for high-density, multi-directional ships in ports and for daily multi-category scenarios, the visualization results show tight contours and few false detections, which directly corroborates the complementary advantages of the three modules for different domain characteristics. Moreover, there are minimal differences in confidence and box quality in the visualization results on different datasets. This confirms the ability of the model to transfer across SAR, remote sensing, and natural images. The 20–30 px miniature boat in Figure 7 and the distant basketball frame with bicycle in Figure 8 are all detected in their entirety. This contrasts with conventional detectors, which tend to miss targets or produce drifting boxes at similar sizes, and highlights the significant improvement of the proposed method in the detection of small 10–30 px targets.
Figure 9 and Figure 10 show the visual comparison on the optical remote sensing dataset HRSC2016 and the self-built SAR small ship dataset, respectively. From the two sets of visualizations, it is evident that DVDNet significantly outperforms Faster R-CNN in locating vessels with elongated hulls in congested harbor areas. In the HRSC2016 dataset, DVDNet’s bounding boxes for elongated vessels align more closely with the vessel’s orientation and have tighter boundaries, effectively separating small vessels in adjacent berths from dock structures. The overall confidence level is predominantly above 0.9. Faster R-CNN, in contrast, suffers from inaccurate orientation estimates, bounding boxes that are too short or too wide, and missed detections and minor false positives in dense areas. In the SAR small ship dataset, DVDNet can still reliably detect extremely small targets with weak echoes in scenes with strong coastal scattering, speckle noise, and low contrast. Small ships near the coastline and against dark backgrounds are also correctly detected with higher confidence. Faster R-CNN is more prone to missing weak echo targets in open waters and lacks precise angle and scale regression for elongated targets. These differences are consistent with the deformation alignment and multi-scale receptive fields of GDConv, the weak texture enhancement of LBP, and the improved stability of vectorized regression for small targets in our method, demonstrating DVDNet’s superior recall and localization quality for small targets and complex backgrounds.
4.3. Ablation Experiment
To better understand the contribution of each component, we conducted ablation experiments on all benchmark datasets, reported in Table 4 and Table 5. Bold in the tables indicates the best results. We start from a concise two-stage baseline consisting of a ResNet-50 backbone, RPN, and standard box regression, without FPN, LBP, or GDConv. FPN, LBP, GDConv, and their combinations are then enabled incrementally, and finally the full DVDNet is evaluated, in which GDConv is placed on conv3 and conv4, LBP is used on P3, and the detection head uses vectorized regression. Each row in Table 4 and Table 5 corresponds to one configuration of enabled modules, which separates the effect of each component under a unified training and evaluation protocol.
Starting from the baseline, the model without any enhancement achieves relatively low performance, with an mAP of 62.2% on HRSID and 59.6% on SSDD. When we introduce FPN alone, we observe clear improvements across all datasets. For example, on HRSC2016 the mAP rises from 55.2% to 66.6%, confirming that multi-scale feature fusion is essential for capturing variability in object size.
Incorporating the LBP module alone also brings measurable gains. For instance, the mAP on the SAR small-target ship dataset increases from 60.5% to 68.7%, and precision rises to 69.2%. This supports our hypothesis that local binary patterns enhance low-level contour information, improving object boundary discrimination.
The use of GDConv in isolation produces the most significant gain among single modules. On SSDD, GDConv boosts mAP to 77.3%, with precision and recall also improving substantially. This demonstrates the effectiveness of spatially adaptive sampling and group-wise specialization for handling geometric variations in SAR targets.
The combination of LBP and GDConv leads to even larger improvements. For example, on HRSID, mAP increases to 89.5%, an absolute improvement of about 27 percentage points over the baseline. Similarly, on SSDD the combined model achieves 85.9% mAP, outperforming all partial variants.
Finally, the full model (FPN + LBP + GDConv), denoted as Ours, achieves the best performance across all benchmarks, with 90.9% mAP on HRSID, 87.2% on SSDD, 80.7% on HRSC2016, and 94.1% on the SAR small-target ship dataset. Notably, the precision on HRSID reaches 90.9%, demonstrating that our method not only detects more targets but does so with fewer false positives.
The ablation study demonstrates the incremental and complementary value of each proposed module. GDConv contributes most to spatial adaptability, LBP enhances texture sensitivity, and FPN provides strong multi-scale representations. Together, they form a synergistic and robust architecture that achieves superior performance on all three types of datasets: SAR, remote sensing, and natural imagery.
To evaluate how the position at which GDConv is inserted in ResNet-50 affects the trade-off between accuracy and computational overhead, further ablation experiments were conducted on the two major open-source remote sensing small target datasets, HRSID and SSDD. The results are shown in Table 6, where the experimental setup is as explained in Section 4.1. The trends on the HRSID and SSDD datasets are essentially the same. The greatest gains are seen when GDConv is placed in the middle layers, C3 and C4, which gives the largest boost to both the overall mAP and AP_S. Placing GDConv only in the deepest layer, C5, yields limited gain, while moving it down to the shallow layer C2 gives a slight improvement but increases the computational overhead steeply, so its cost-effectiveness is not as good as that of C3 + C4. This is because C5 has too low a resolution and contributes little to small targets, while C2 has the highest resolution but also the heaviest speckle and background noise and is computationally expensive. C3 and C4 combine sufficient semantics with usable resolution, and it is at this level that GDConv’s deformation adaptation is most effective, complementing the LBP texture enhancement on P3.
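The resolution argument can be made concrete with a small calculation, assuming the standard ResNet-50 stage strides of 4, 8, 16, and 32 for C2 to C5 and the 800 × 800 input used in this paper. The helper names are hypothetical, and the 10 px and 30 px object sizes are taken from the small-vessel range discussed earlier.

```python
# Feature-map footprint of small vessels per ResNet stage (helper names are hypothetical).
STRIDES = {"C2": 4, "C3": 8, "C4": 16, "C5": 32}   # standard ResNet-50 stage strides


def feature_size(input_size, stride):
    """Side length of the feature map for a square input."""
    return input_size // stride


def cells_spanned(object_px, stride):
    """How many feature cells an object of the given length covers along one axis."""
    return object_px / stride


if __name__ == "__main__":
    for name, stride in STRIDES.items():
        side = feature_size(800, stride)
        small, large = cells_spanned(10, stride), cells_spanned(30, stride)
        print(f"{name}: {side}x{side} map, 10-30 px ship spans {small:.2f}-{large:.2f} cells")
```

At C5 a 10–30 px vessel covers less than one cell of the 25 × 25 map, which is consistent with the limited gain observed there, whereas C2 operates on a 200 × 200 map and therefore processes 16 times as many spatial positions as C4.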
As can be seen in Table 7, as the number of groups increases from 1 to 4, the model’s overall accuracy and small-target performance steadily improve on both datasets, while the number of parameters and the computational cost decrease. As the number continues to increase to 8, the accuracy begins to fall back slightly. On the HRSID dataset, mAP increases from 90.1 to 90.9, AP_S increases from 79.4 to 80.1, and AP_M and AP_L increase from 91.3 and 93.8 to 92.1 and 94.1, respectively. mAP decreases to 90.5 when the number of groups is 8, which suggests that excessive grouping weakens the inter-channel coupling and destabilizes the estimation of the deformation offsets. SSDD shows the same pattern: mAP increases from 86.7 to 87.2, AP_S increases from 72.8 to 73.8, and mAP falls back to 86.9 when the number of groups is 8. In terms of efficiency, the number of parameters decreases from 56.4 M to 45.3 M and the FLOPs decrease from 324 G to 249.8 G. A further increase to 8 groups brings only a small additional decrease of 1.2 G, accompanied by a loss of accuracy. Balancing accuracy and overhead, the best compromise is achieved with 4 groups. This setting improves the overall mAP on the two datasets, especially strengthening the small-target AP_S, while keeping the parameter and computational cost manageable. Therefore, 4 groups is used as the default configuration of GDConv.
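The diminishing savings beyond 4 groups are consistent with how the weight count of a grouped convolution scales, which can be sketched as follows. The 256-channel width and 3 × 3 kernel are illustrative assumptions rather than the actual GDConv layer shapes, the offset-prediction branch of the deformable conv is not counted, and the function name is hypothetical.

```python
# Weight count of a grouped convolution (hypothetical helper; bias and offsets ignored).

def grouped_conv_weights(c_in, c_out, kernel, groups):
    """Number of weights in a kernel x kernel conv with the given number of groups."""
    assert c_in % groups == 0 and c_out % groups == 0, "channels must divide by groups"
    return (c_in // groups) * c_out * kernel * kernel


if __name__ == "__main__":
    previous = None
    for g in (1, 2, 4, 8):
        weights = grouped_conv_weights(256, 256, 3, g)   # assumed layer shape
        saved = "" if previous is None else f" (saves {previous - weights})"
        print(f"G={g}: {weights} weights{saved}")
        previous = weights
```

Because the count is inversely proportional to the number of groups, moving from 1 to 4 groups removes three quarters of the weights in such a layer, while moving from 4 to 8 removes only a further eighth of the original total.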