Small Object Detection in Synthetic Aperture Radar with Modular Feature Encoding and Vectorized Box Regression

Du, Xinmiao; Wu, Xihong

doi:10.3390/rs17173094

by Xinmiao Du and Xihong Wu *

School of Intelligence Science and Technology, Peking University, Beijing 100080, China

* Author to whom correspondence should be addressed.

Remote Sens. 2025, 17(17), 3094; https://doi.org/10.3390/rs17173094

Submission received: 12 July 2025 / Revised: 25 August 2025 / Accepted: 2 September 2025 / Published: 5 September 2025

(This article belongs to the Special Issue Deep Learning Techniques and Applications of MIMO Radar Theory)

Highlights

What are the main findings?

The LBP enhancement module integrates manual texture features.
Vectorized bounding box avoids the problem of angular periodicity.

What is the implication of the main finding?

The LBP enhances the visibility of small targets in complex backgrounds.
Vectorized bounding box achieves precise framing of targets in any direction.

Abstract

Object detection in synthetic aperture radar (SAR) imagery poses significant challenges due to low resolution, small objects, arbitrary orientations, and complex backgrounds. Standard object detectors often fail to capture sufficient semantic and geometric cues for such tiny targets. To address this issue, we propose a new Convolutional Neural Network (CNN) framework called the Deformable Vectorized Detection Network (DVDNet), specifically designed for detecting small, oriented, and densely packed objects in SAR images. DVDNet consists of Grouped-Deformable Convolution for adaptive receptive field adjustment to diverse object scales, a Local Binary Pattern (LBP) Enhancement Module that enriches texture representations and enhances the visibility of small or camouflaged objects, and a Vector Decomposition Module that enables accurate regression of oriented bounding boxes via learnable geometric vectors. DVDNet is embedded in a two-stage detection architecture and is particularly effective in preserving the fine-grained features critical for small object localization. The performance of DVDNet is validated on two SAR small target detection datasets, HRSID and SSDD, where it achieves 90.9% mAP and 87.2% mAP, respectively. The generalizability of DVDNet is also verified on a self-built SAR ship dataset and the optical remote sensing dataset HRSC2016. All these experiments show that DVDNet outperforms the standard detector. Notably, our framework shows substantial gains in precision and recall for small object subsets, validating the importance of combining deformable sampling, texture enhancement, and vector-based box representation for high-fidelity small object detection in complex SAR scenes.

Keywords:

synthetic aperture radar; small object; arbitrary orientations; grouped-deformable convolution; local binary pattern; vector decomposition

1. Introduction

Object detection in SAR images is a crucial task for applications in surveillance, urban planning, traffic monitoring, and disaster management. In recent years, deep learning has led to remarkable advances in generic object detection on natural images [1,2,3,4,5]. Two-stage detectors like Faster R-CNN [2] and one-stage detectors like YOLO [3] achieve high accuracy and speed on benchmarks such as COCO. However, affected by speckle noise, complex imaging geometry and other factors, SAR images present very different characteristics from optical images, and it is often difficult to obtain ideal results by directly migrating the above methods.

There are many challenges in SAR target detection tasks. SAR images typically contain objects at vastly different scales, in arbitrary orientations, in very dense distributions, and against highly complex backgrounds. Standard detectors assume fixed-size receptive fields and mostly axis-aligned bounding boxes, which struggle with these conditions. Small objects may be missed by deep networks optimized for larger objects, rotated objects may not be well enclosed by axis-aligned boxes, densely packed objects can cause overlapping detections to be merged or missed, and background clutter can lead to many false positives. Notably, small object detection remains particularly challenging in SAR scenarios due to limited resolution and scale variance, making it a crucial focus of this work.

Various approaches have been explored to handle individual aspects of these issues. For instance, multi-scale feature pyramids [6] or multi-scale training strategies are used to detect objects of different sizes. For arbitrary orientations, some methods introduce rotated anchors or rotated bounding box regression [7,8,9,10,11] to better localize angled objects. Nevertheless, existing detectors typically address these challenges in isolation and still encounter limitations when faced with the full complexity of SAR imagery. There is a need for a unified approach that can concurrently tackle scale variation, orientation, density, and background clutter [12,13,14,15,16,17].

In this paper, we propose a comprehensive solution that combines three key innovations in a single framework. First, Grouped-Deformable Convolution (GDConv) layers provide adaptive receptive fields for multi-scale and deformable object shapes while keeping the model efficient. Second, an LBP Enhancement Module injects rich local texture features (via the classic Local Binary Pattern descriptor) into the CNN feature maps to help distinguish objects from complex backgrounds. Third, a Vector Decomposition Module predicts each object’s bounding box using a pair of vectors, enabling an effective and continuous representation of oriented boxes without angular ambiguities. By integrating these components into a standard CNN-based detection pipeline, our method directly addresses the four challenges described above. GDConv handles scale and shape variation, LBP features improve detection of small or low-contrast objects in clutter, and the vector decomposition handles arbitrary object orientations and aspect ratios. This design is especially beneficial for detecting small targets, which are prevalent and critical in SAR imagery applications.

We design a novel SAR target detection framework, the Deformable Vectorized Detection Network (DVDNet). It combines grouped deformable convolutions, an LBP-based texture enhancement, and a vector decomposition-based bounding box encoding. To our knowledge, this is the first work to unify these three components in a single detector. In summary, the contributions of this work are as follows.

(1) We design Grouped Deformable Convolution (GDConv), which merges deformable convolution [18] with grouped convolution [19]. This module allows the network to learn adaptive sampling of features while controlling the number of parameters and computational cost.

(2) We design an LBP Enhancement Module that integrates Local Binary Pattern computation into the CNN. The module extracts local texture patterns from feature maps and fuses them with learned features, providing complementary cues that improve the detection of small, densely packed objects and reduce false detections in textured backgrounds.

(3) We design a Vector Decomposition Module for bounding box regression. Instead of predicting width, height, and orientation angles directly, our detector predicts two vectors from the object’s center to its bounding box sides. This yields a flexible representation for rotated boxes, avoiding issues of angle periodicity and enabling more precise localization of oriented objects.

Finally, we conduct ablation and comparison experiments on two SAR small target detection datasets, HRSID and SSDD, to evaluate the proposed DVDNet. To validate the generalization performance of the model, we also conduct extensive experiments on a self-built SAR small ship dataset and the optical remote sensing dataset HRSC2016. DVDNet achieves much higher mAP than the baseline detector on all datasets, with a significant improvement in precision-recall performance. Detailed ablation studies and analyses are also provided to demonstrate the effectiveness of each proposed component.

The remainder of this paper is organized as follows. Section 2 reviews related work in SAR object detection, deformable/grouped convolutions, and texture/orientation modeling in CNNs. Section 3 presents the proposed method, detailing the overall architecture and each module. Section 4 reports experimental results on four datasets, including quantitative comparisons and discussions. Finally, Section 5 concludes the paper with remarks and future directions.

2. Related Work

This section reviews existing research on small vessel detection in remote sensing imagery along three threads: task, paradigm, and supporting evidence. We first introduce the data characteristics of small vessel detection and then discuss existing research on SAR-based small vessel detection in detail. After summarizing and comparing the limitations of existing research, we motivate the modules designed in this paper along four dimensions: scale, deformation, texture, and bounding box.

2.1. Small Object Detection

Small object detection is one of the most persistent challenges in computer vision, particularly in aerial and remote sensing imagery [20,21]. Compared to medium or large objects, small targets occupy fewer pixels and often lack sufficient contextual cues, making them easily overlooked by standard detectors [22]. Deep CNN architectures tend to lose spatial detail in higher layers due to successive downsampling, which further degrades the detectability of tiny objects [15,23].
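To make the effect of successive downsampling concrete, the short sketch below computes how many feature-map cells an object spans at each stride. The strides 8, 16, and 32 are an assumption (typical of the later stages of ResNet-style backbones), and the 12-pixel object size is a hypothetical example rather than a value taken from any dataset discussed here.

```python
def footprint_in_cells(object_size_px, strides=(8, 16, 32)):
    """Return {stride: object extent measured in feature-map cells}."""
    return {stride: object_size_px / stride for stride in strides}


# A hypothetical 12-pixel target shrinks below one cell at stride 16 and beyond.
for stride, cells in footprint_in_cells(12).items():
    print(f"stride {stride:2d}: {cells:.3f} cells")
```

Under these assumptions, the target covers less than a single cell from stride 16 onward, which is one way to see why higher-resolution pyramid levels matter for tiny objects.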

Small-scale vessel detection is commonly employed in remote sensing, primarily through two imaging categories, including SAR and optical imaging. SAR imaging can operate under all-weather and all-time conditions; however, its characteristics, such as speckle noise, strong scattering, and coastal clutter, significantly increase detection difficulty [24]. Optical imagery offers richer texture and color information but is constrained by lighting conditions, weather, and cloud interference [25]. Different imaging modes have their own advantages and disadvantages in ship detection, and how to effectively combine and adapt them has become a key focus of research.

2.2. SAR Object Detection

In recent years, target detection in SAR images has garnered increasing attention. Early methods typically applied general-purpose detectors to SAR datasets. For example, Fast/Faster R-CNN [2,26] and SSD were evaluated on SAR images. However, these methods struggled to address the unique characteristics of SAR data, such as coherent speckle noise and variations in target scattering properties, leading to the development of specialized methods and benchmarks. Datasets such as MSTAR [27] and SSDD [28] have been introduced as benchmarks for SAR target detection, where many targets are annotated with rotated bounding boxes or horizontal boxes. This has led to the development of specialized detectors for arbitrary target orientations, small target sizes, and complex backgrounds in SAR images.

Numerous studies in recent years have focused on detecting small vessel targets in remote sensing imagery. Moser et al. [29] proposed a semi-automatically constructed classification dataset, Argo, targeting small vessels (<20 m) and aimed at supporting the identification of refugee boats in humanitarian scenarios. Samples were generated by cross-referencing AIS data with high-resolution optical imagery (PlanetScope 3.5 m/px) and manually reviewed. The final recall rate for small vessels reached 91%. Most current studies have developed models for small vessel detection based on the YOLO series [30]. Among them, for satellite edge computing, a small model based on YOLOv8 [31] was implemented on an FPGA (AMD/Xilinx Kria KV260, <10 W) to achieve near-real-time ship detection on xView3-SAR, with accuracy only about 2–3% lower than state-of-the-art GPU-based models but a model size reduced by a factor of 10²–10³, processing 40,000 km² scenes in under 1 min. However, while this method is adapted for near-shore clutter or small targets, it still suffers from minor accuracy losses. Long-term robustness under extreme sea conditions or multi-source domains requires further validation. Shi et al. [32] proposed an anchor-free SAR ship detection model incorporating brain-inspired attention mechanisms. A visual attention module is added at the shallow layer to suppress similar scattering backgrounds near the coast, while dense connections are added at the deep layer to enhance semantic information. A width-height prediction constraint is proposed to suppress edge artifacts caused by speckle noise. The AP on SSDD and HRSID reached 68.2% and 62.2%, respectively.

Another important research direction focuses on multi-scale and dense object detection in synthetic aperture radar (SAR) imagery, particularly for small ship targets. Feature Pyramid Network (FPN) architectures [6] have been widely adopted to fuse multi-scale feature maps, thereby enhancing the representation of fine-grained features and improving the detection of small ships. To further address the complexity of crowded maritime scenes, researchers have explored contextual information and attention mechanisms. For instance, some models integrate global context modules or introduce higher-resolution feature branches to ensure that tiny vessels are not overlooked. However, detecting very small ships in the presence of large vessels remains challenging, as the feature representation must remain sufficiently discriminative and robust across multiple scales.

Yu et al. [33] advocate a pyramid-based, single-level detection approach and use residual asymmetric dilated convolution (RADC) blocks to expand the receptive field and enhance semantic information. They propose center-based uniform matching for label assignment to balance the detection of large and small objects, and achieve performance comparable to that of general detectors on SSDD and HRSID with lower computational complexity.

Research that builds on YOLO-series baselines generally focuses on reducing information loss when dealing with small target vessels. For example, for complex sea-land backgrounds and multi-scale vessels, E2YOLOX-VFL [34] introduces ECA channel attention into the backbone, modifies the localization loss to EFIoU, uses Varifocal Loss for confidence, and replaces NMS with BG-NMS to mitigate false detections caused by dense occlusions. This resulted in a 9.28% improvement in mAP on the HRSC2016 dataset. Additionally, models such as the Oriented Bounding Box (OBB) detector [35], snake-shaped convolutions, and multi-scale feature enhancement modules [36] have been introduced to enhance the accuracy of small-target vessel detection.

In summary, if the focus is on localization accuracy and complex background suppression, rotated boxes and point sets can better enclose long, narrow, or directionally inconsistent small boats. At the same time, the detection network needs to incorporate scale adaptation to prevent subsampling from causing the loss of small target information. In addition, deformation and directional changes should be considered to handle the challenges posed by ships oriented in arbitrary directions. Texture and background suppression can also reduce interference from sea waves and coastal structures. By jointly considering scale, deformation, texture, and bounding box design, the accuracy and recall of small target detection can be effectively ensured.

2.3. Multi-Scale and Deformable Convolutions

To address these challenges, various techniques have been proposed. Multi-scale feature fusion frameworks [37,38] enhance the representation of objects at different sizes, while context modeling [39] and super-resolution modules [40] help preserve or recover fine-grained details that are critical for identifying small ships. Nevertheless, detecting small targets in complex SAR maritime scenes remains highly challenging because of extreme scale variation, dense ship distribution, and severe background clutter from sea waves and coastal environments.

Modern CNN-based detectors rely on learned convolutional features, and numerous modifications to the convolution operation have been proposed to enhance feature representation. Deformable Convolutional Networks (DCN) [18] introduced a learnable offset for each convolution kernel position, enabling the sampling grid to shift dynamically according to the input content. This mechanism allows the network to adaptively capture variations in object shape and orientation. DCN has shown significant improvements in detection accuracy, particularly for non-rigid or rotated targets, and has been incorporated into various detection frameworks. Subsequent extensions such as DCNv2 [41] further refine the offset learning process through enhanced modulation and offset adjustment. In the context of SAR ship detection, these properties are especially valuable for handling vessels with arbitrary orientations and deformations caused by imaging conditions.

While deformable convolutions increase modeling flexibility, they also introduce additional parameters and computational cost due to the extra offset fields. In parallel, grouped convolutions have been explored as an efficient alternative. Originally introduced in AlexNet to facilitate GPU parallelization and later popularized in ResNet [19], grouped convolution divides the input channels into groups and performs convolution independently within each group, thereby reducing the total number of connections. An extreme case is the depthwise separable convolution, as used in MobileNets, which drastically reduces computation while maintaining competitive accuracy. Beyond efficiency, grouped convolutions also enable sets of filters to specialize in different subsets of features, providing a balance between performance and computational cost.

Combining the benefits of deformable and grouped convolutions is a promising idea that has begun to gain traction. By applying deformable convolution in a grouped manner, one can retain the adaptive sampling capability while limiting the parameter growth. For example, if the feature maps are split into G groups, each group can learn its own offset fields and filters focusing on a portion of the channels. This approach, referred to as Grouped Deformable Convolution (GDConv), yields an efficient yet flexible layer [42]. Recent works on efficient detection in remote sensing suggest that grouped deformable convolution can improve the accuracy–cost trade-off. However, a systematic integration of GDConv in a general object detection architecture has been little explored [43,44]. In our work, we incorporate GDConv into the backbone of the detector to handle multi-scale object features. By doing so, we allow different groups of feature channels to adjust to different object scales or orientations through learned offsets, while keeping the overall model complexity manageable. This is particularly beneficial for SAR images, where some feature channels or groups may specialize in large structures such as buildings and others in small objects like vehicles [4,8,45].

In summary, significant improvements such as deformable convolution and multi-scale feature fusion have been introduced in the related work of SAR object detection. However, most existing approaches tend to address these challenges in isolation. Building upon these ideas, our work proposes a unified framework that simultaneously tackles multiple difficulties. Specifically, our network addresses rotation through a novel bounding box parameterization, enhances multi-scale detection via deformable convolutions and feature augmentation, and leverages texture cues to better separate objects from complex backgrounds. Small target detection, long recognized as a challenge in SAR imagery due to resolution limitations and background clutter, is explicitly considered in our design. By integrating grouped deformable convolution with LBP-enhanced features, our model substantially improves the detectability of tiny and densely distributed objects, as further demonstrated by our results on the HRSID and SSDD datasets.

2.4. Texture Descriptors and Oriented Bounding Boxes

Before the deep learning era, hand-crafted texture descriptors such as SIFT, HOG and Local Binary Patterns (LBP) [46] played a key role in object detection and recognition tasks [47,48]. LBP, in particular, is a simple yet powerful operator that thresholds the neighborhood of each pixel to generate a binary pattern encoding local texture [49,50]. It has the advantages of being illumination-invariant and computationally efficient, which made it widely used in tasks such as face detection and texture classification. With the rise of CNNs, these hand-crafted features were largely replaced by learned features. Nevertheless, there is evidence that they can still provide complementary information to CNNs. For instance, some studies have combined LBP with CNN features for applications such as face mask detection, yielding improved precision [51]. In aerial images, where objects like roofs, roads, or vehicles exhibit distinctive texture patterns, incorporating LBP descriptors can enhance the feature representation. In our work, we design an LBP enhancement module that explicitly computes local binary patterns on feature maps and merges them with the network’s learned features, thereby leveraging classical domain knowledge in a deep model [52,53].
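For reference, the basic LBP operator described above is commonly written as follows. The notation ($g_c$ for the center value, $g_p$ for the $P$ neighbors sampled on a circle of radius $R$) follows the usual textbook form and is not taken from this paper.

```latex
\mathrm{LBP}_{P,R}(x_c, y_c) = \sum_{p=0}^{P-1} s(g_p - g_c)\, 2^{p},
\qquad
s(z) =
\begin{cases}
1, & z \ge 0 \\
0, & z < 0
\end{cases}
```

Because only the sign of $g_p - g_c$ enters the code, any strictly increasing transformation of the intensities leaves the code unchanged, which is the source of the illumination invariance mentioned above.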

This motivates our proposed method, which explicitly enhances small object detectability by employing grouped deformable convolution for adaptive feature extraction and an LBP-based module for capturing local textures. These components are particularly valuable in remote sensing applications, where small vehicles, ships, and infrastructure need to be reliably localized within large-scale images.

One line of research addresses the detection of rotated objects. For example, Ding et al. propose the ROI Transformer [7], which refines horizontal region proposals into rotated regions of interest, thereby enhancing orientation sensitivity in two-stage detectors. Many approaches further extend anchor-based detectors by incorporating rotated anchor boxes or angle prediction branches to model oriented bounding boxes. Although these methods improve accuracy on rotated objects, they often increase complexity through multi-branch outputs or additional angle parameters, which require careful handling to avoid periodicity issues.

Another relevant line of work concerns the representation of oriented bounding boxes. Most object detectors represent boxes by their center coordinates and side lengths. However, directly regressing an angle can be tricky due to the discontinuity at 360° = 0°. Some approaches like SCRDet [9] mitigate this by using alternative parameterizations or loss functions that are smooth over angle changes. A recent trend is to represent oriented boxes with vectors or key points rather than an angle. For example, Yang et al. [9] predict boundary vectors from the center to each of the four sides to avoid direct angle prediction. Similarly, Zhou et al. [54] propose a vector decomposition-based method, which achieves high accuracy without angle regression by predicting a set of vectors for each box. These studies demonstrate that vector-based representations are effective for modeling orientation [55,56].
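The angular discontinuity can be illustrated numerically. The sketch below is only an illustration under simple assumptions: it compares a naive angle difference with a wrapped one, and uses a unit direction vector as a hypothetical stand-in for a vector-style target. The function names are illustrative, and it is not the regression target used by any of the cited methods.

```python
import math


def naive_angle_error(pred_deg, target_deg):
    """Absolute difference of the raw angle values, ignoring periodicity."""
    return abs(pred_deg - target_deg)


def wrapped_angle_error(pred_deg, target_deg, period=360.0):
    """Smallest rotation (in degrees) that maps one angle onto the other."""
    diff = (pred_deg - target_deg) % period
    return min(diff, period - diff)


def direction_vector_error(pred_deg, target_deg):
    """Euclidean distance between the unit direction vectors of the two angles."""
    p, t = math.radians(pred_deg), math.radians(target_deg)
    return math.hypot(math.cos(p) - math.cos(t), math.sin(p) - math.sin(t))


# Two orientations that are physically 2 degrees apart, on either side of the wrap.
print(naive_angle_error(359.0, 1.0))       # large raw difference
print(wrapped_angle_error(359.0, 1.0))     # small true rotation
print(direction_vector_error(359.0, 1.0))  # small, with no wrap handling needed
```

The raw angle difference is large even though the two orientations almost coincide, whereas the vector distance stays small without any special handling of the wrap-around point.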

Inspired by these ideas, we implement a vector decomposition module in our detector. Unlike BBAVectors, which predicts four vectors, or other complex schemes, we opt for a minimal representation including two vectors originating from the box center to define the box. This is equivalent to specifying the oriented box by its center, one vector that describes the half-width along one side, and another vector that describes the half-height in the perpendicular direction. Such a representation is compact and avoids redundancy. It inherently encodes the orientation and size of the object and aligns well with regression objectives in detection heads. Our approach integrates this representation into the head of a standard two-stage detector, differing from prior anchor-free implementations [54], and couples it with the aforementioned modules for a more holistic solution [57,58].
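The equivalence stated above can be written out explicitly. The symbols $c$ (box center), $w$, $h$, and $\theta$ are introduced here for illustration only, and the two vectors are assumed to be perpendicular, as they are for a rectangular box.

```latex
\vec{v_{1}} = \tfrac{w}{2}\,(\cos\theta,\ \sin\theta), \qquad
\vec{v_{2}} = \tfrac{h}{2}\,(-\sin\theta,\ \cos\theta), \qquad
\vec{v_{1}} \cdot \vec{v_{2}} = 0

\text{corners} = \{\, c \pm \vec{v_{1}} \pm \vec{v_{2}} \,\}, \qquad
w = 2\,\lVert \vec{v_{1}} \rVert, \quad
h = 2\,\lVert \vec{v_{2}} \rVert, \quad
\theta = \operatorname{atan2}(v_{1y},\ v_{1x})
```

Under these assumptions, the center and the two vectors carry the same information as $(c, w, h, \theta)$, while the angle itself never appears as a regression target.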

3. Methods

In this section, we describe the architecture of the proposed detection network and detail each of its main components. These are the overall framework, the grouped deformable convolutional layers, the LBP enhancement module, the vector decomposition module for bounding box regression, and the loss function for training.

3.1. Overall Framework

The overall network architecture is illustrated in Figure 1. It follows a two-stage detection paradigm built on a CNN backbone with a feature pyramid network, augmented by our proposed modules. The framework consists of the following parts.

First, we use a deep CNN backbone, ResNet-50, to extract the feature map from the input image. We merge GDConv layers into the backbone. Specifically, we replace standard convolutions in certain intermediate layers with grouped deformable convolutions. Based on experimental validation, as shown in the ablation study in a later section, we insert GDConv layers into the last layer of the conv3 and conv4 blocks of ResNet-50. This modification allows those layers to adapt their receptive field according to object geometry. The output of the backbone is a set of feature maps at different resolutions. We employ a Feature Pyramid Network (FPN) structure [6] to combine high-level and low-level features, producing a pyramid of feature maps ($P_{3}$ through $P_{5}$) that represent the image at multiple scales.

To enhance the low-level features with texture information, we then design an LBP module (Section 3.3). We apply this module to one of the higher-resolution feature maps in the pyramid such as $P_{3}$, which has relatively fine spatial granularity. The LBP module computes local binary pattern codes for each spatial location on that feature map, yielding an additional set of channels that capture local texture patterns, such as edges, corners, etc. These LBP feature channels are then concatenated with the original feature map channels. A $1 \times 1$ convolution is used to fuse and reduce the dimensions, producing an enhanced feature map that now contains both learned features and LBP features. This enhanced feature map is fed downstream for proposal generation and detection.
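The data flow described above (LBP codes per location, concatenation, $1 \times 1$ fusion) can be sketched in plain Python. This is a minimal sketch under several assumptions that are not confirmed by the text: an 8-neighbor LBP applied to each channel separately, zero padding at the borders, codes scaled to [0, 1], and hand-picked fusion weights in place of learned ones. The function names are hypothetical. The hard threshold is also not differentiable, so a trainable module would need extra care.

```python
def lbp_map(channel):
    """8-neighbor LBP codes for one 2D channel (list of rows); borders are zero-padded."""
    h, w = len(channel), len(channel[0])
    # Neighbor offsets (dy, dx), clockwise from the top-left neighbor.
    neighbors = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]
    codes = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            center, code = channel[y][x], 0
            for bit, (dy, dx) in enumerate(neighbors):
                ny, nx = y + dy, x + dx
                value = channel[ny][nx] if 0 <= ny < h and 0 <= nx < w else 0.0
                if value >= center:  # threshold the neighbor against the center
                    code |= 1 << bit
            codes[y][x] = code
    return codes


def fuse_1x1(channels, weights):
    """1x1 convolution: each output channel is a per-pixel weighted sum of the inputs."""
    h, w = len(channels[0]), len(channels[0][0])
    fused = []
    for out_weights in weights:  # one weight per input channel
        fused.append([[sum(wt * ch[y][x] for wt, ch in zip(out_weights, channels))
                       for x in range(w)] for y in range(h)])
    return fused


def lbp_enhance(feature_maps, weights):
    """Concatenate every channel with its scaled LBP map, then fuse with a 1x1 convolution."""
    lbp_channels = [[[c / 255.0 for c in row] for row in lbp_map(ch)] for ch in feature_maps]
    return fuse_1x1(feature_maps + lbp_channels, weights)
```

With `C` input channels, `weights` holds one row of `2 * C` values per output channel, so choosing `C` rows reduces the concatenated tensor back to the original channel count.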

As shown in Figure 2, we use a Region Proposal Network (RPN) [2] operating on the multi-scale feature maps to generate candidate object proposals. Although an RPN can be extended to handle rotated proposals, ours uses a small set of anchor boxes at each position that are, by default, axis-aligned and span various scales and aspect ratios. Given that our backbone features are already rotation-sensitive thanks to GDConv and LBP, we found it sufficient to use standard horizontal anchors. The RPN produces a set of axis-aligned proposal boxes with objectness scores. We apply non-maximum suppression (NMS) to filter overlapping proposals.
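The NMS step on axis-aligned proposals can be sketched as the usual greedy procedure. The box format `(x1, y1, x2, y2)` and the default IoU threshold of 0.7 are assumptions for illustration; the text does not state the threshold actually used.

```python
def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.7):
    """Greedy NMS: return the indices of the kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with any kept box.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep
```

A lower threshold removes more overlapping proposals. In dense scenes this trades duplicate suppression against the risk of discarding neighboring objects.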

For each proposal, we extract a fixed-size ROI feature using ROI Align on the appropriate FPN level. These ROI features are then processed by the detection head, which has two branches, classification and regression. The classification branch, which typically consists of two fully connected layers followed by a softmax or sigmoid output, predicts the object class or background. The regression branch is modified in our approach to include the Vector Decomposition Module described in Section 3.4. Instead of predicting the standard 4 regression offsets for bounding box $(d_{x}, d_{y}, d_{w}, d_{h})$, our regression head predicts a set of vector parameters that describe the bounding box. Specifically, it predicts ($\Delta x$, $\Delta y$) for the center offset relative to the proposal center, and ($v_{1x}$, $v_{1y}$, $v_{2x}$, $v_{2y}$), which are components of two vectors $\vec{v_{1}}$ and $\vec{v_{2}}$ emanating from the box center. These vectors, when combined with the center, define an oriented bounding box. The vector decomposition module takes these predictions and constructs the final detected bounding box as an oriented box, which can be converted to a standard axis-aligned box for evaluation if needed. During training, this head is supervised to match the ground-truth boxes’ vector representations.
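The decoding step described above can be sketched as follows. This is a sketch under stated assumptions: the center offsets are taken as raw pixel offsets added to the proposal center (any normalization by proposal size is not specified here), the two vectors are treated as half-extents, and all function names are hypothetical.

```python
import math


def decode_box(proposal_center, dx, dy, v1, v2):
    """Return the four corners of the oriented box, in order around the rectangle."""
    cx, cy = proposal_center[0] + dx, proposal_center[1] + dy
    signs = [(1, 1), (-1, 1), (-1, -1), (1, -1)]
    return [(cx + s1 * v1[0] + s2 * v2[0], cy + s1 * v1[1] + s2 * v2[1])
            for s1, s2 in signs]


def enclosing_axis_aligned_box(corners):
    """Smallest axis-aligned box (x1, y1, x2, y2) that contains all corners."""
    xs = [p[0] for p in corners]
    ys = [p[1] for p in corners]
    return (min(xs), min(ys), max(xs), max(ys))


def size_and_angle(v1, v2):
    """Recover (width, height, angle in degrees) from the two half-extent vectors."""
    width = 2.0 * math.hypot(v1[0], v1[1])
    height = 2.0 * math.hypot(v2[0], v2[1])
    return width, height, math.degrees(math.atan2(v1[1], v1[0]))


corners = decode_box((10, 10), dx=1, dy=-1, v1=(2, 0), v2=(0, 1))
print(corners)
print(enclosing_axis_aligned_box(corners))
```

Negating both vectors yields the same set of corners, so the decoded box does not depend on which of the two opposite directions the head happens to predict.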

In summary, our framework enhances a standard two-stage detector by incorporating GDConv layers embedded in the backbone to improve feature extraction, an LBP module for texture enhancement, and a vector-based bounding box regression head for orientation. The design is modular, and each component can be ablated or modified independently, as we will analyze in experiments.

3.2. Grouped Deformable Convolution (GDConv)

The grouped deformable convolution is a central component of our model, aiming to improve the backbone’s ability to handle objects of varying shape and scale without incurring excessive computation. We build on the original deformable convolution operation [18]. A standard deformable convolution adds learnable offsets to the sampling locations of a convolution kernel. Formally, for a given convolution layer with kernel $K$ of spatial size $k \times k$ applied to an input feature map $x$, a normal convolution at location $p_{0}$ computes as shown in Equation (1).

$$y(p_{0}) = \sum_{p_{i} \in K} w(p_{i}) \cdot x(p_{0} + p_{i}), \tag{1}$$

where $w(p_{i})$ are the weights and $p_{i}$ ranges over the $k^{2}$ relative positions in the kernel. In a deformable convolution, each sampling location is shifted by an offset $\Delta p_{i}$ that is learned via an offset prediction convolution. The operation becomes Equation (2).

$$y(p_{0}) = \sum_{p_{i} \in K} w(p_{i}) \cdot x(p_{0} + p_{i} + \Delta p_{i}), \tag{2}$$

where $\Delta p_{i}$ is a fractional offset vector specific to position $p_{i}$. These offsets are output by an auxiliary convolutional layer that takes the same input $x$ and produces $2k^{2}$ channels for the horizontal and vertical offsets of each kernel point. This mechanism allows the convolution to sample features from a flexible receptive field shape, such as aligning along an object’s contour or focusing on a small region if the object is small.

As shown in Figure 3, our proposed GDConv modifies this by introducing the concept of groups $G$ in the deformable convolution filter. Instead of one monolithic deformable convolution over all input channels, we partition the input channels into $G$ groups, and similarly the output channels into $G$ groups for simplicity, such that each group of input channels maps to a group of output channels. For each group $g$, where $g = 1, \dots, G$, a separate deformable convolution is applied, as in Equation (3).

$$y_{g}(p_{0}) = \sum_{p_{i} \in K} w_{g}(p_{i}) \cdot x_{g}(p_{0} + p_{i} + \Delta p_{i,g}), \tag{3}$$

where $x_{g}$ and $y_{g}$ denote the input and output feature maps restricted to the channels of group $g$, and $w_{g}$ are the weights for group $g$. Importantly, the offset $\Delta p_{i,g}$ can also be group-specific. In implementation, this means we have $G$ separate offset fields, each output by its own small convolution, or a unified convolution that outputs $2k^{2} \times G$ values, which we split per group.
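The bookkeeping implied by the unified offset convolution can be sketched as follows. The memory layout is an assumption made for illustration (group-major, then kernel point, with each offset stored as an interleaved `(dy, dx)` pair); the text does not specify the actual layout, and the function names are hypothetical.

```python
def split_channels(channels, groups):
    """Partition a list of channels into `groups` contiguous, equally sized groups."""
    if len(channels) % groups != 0:
        raise ValueError("channel count must be divisible by the number of groups")
    size = len(channels) // groups
    return [channels[g * size:(g + 1) * size] for g in range(groups)]


def split_offsets(offset_vector, k, groups):
    """Split the 2*k*k*G offset values predicted at one location into per-group (dy, dx) lists."""
    per_group = 2 * k * k
    if len(offset_vector) != per_group * groups:
        raise ValueError("expected 2 * k^2 * G offset values")
    result = []
    for g in range(groups):
        chunk = offset_vector[g * per_group:(g + 1) * per_group]
        result.append([(chunk[2 * i], chunk[2 * i + 1]) for i in range(k * k)])
    return result
```

Each group then applies Equation (3) to its own slice of channels with its own list of $k^{2}$ offsets, and the $G$ group outputs are concatenated along the channel axis.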

The benefit of grouping is twofold. The first benefit is parameter reduction. In a standard convolution layer with

C_{in}

input and

C_{out}

output channels, the number of weights is

C_{out} \times C_{in} \times k^{2}

. With G groups, each output channel connects to only

\frac{C_{in}}{G}

input channels. Assuming

C_{in}

and

C_{out}

are divisible by

G

, the weight count becomes

C_{out} \times \frac{C_{in}}{G} \times k^{2}

, which is

\frac{1}{G}

of the original. Thus, grouping can substantially reduce the parameters and computation when

G

is large. We choose a moderate

G

to balance efficiency and representation power. In this paper, G = 4 was selected after comparing ablation experiments.
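The weight-count argument above can be checked with a few lines of arithmetic. This is a minimal sketch: the 256-channel layer is an illustrative assumption, not a configuration taken from the paper, and the helper names are hypothetical.

```python
def conv_weight_count(c_in, c_out, k, groups=1):
    """Number of weights in a k x k convolution (bias terms ignored)."""
    if c_in % groups or c_out % groups:
        raise ValueError("c_in and c_out must both be divisible by groups")
    return c_out * (c_in // groups) * k * k


def offset_channels(k, groups):
    """Channels of the offset field: an (x, y) pair per kernel point per group."""
    return 2 * k * k * groups


# Illustrative layer: 256 -> 256 channels, 3 x 3 kernel.
standard = conv_weight_count(256, 256, 3)           # 589824 weights
grouped = conv_weight_count(256, 256, 3, groups=4)  # 147456 weights, 1/4 of the above
offsets = offset_channels(3, 4)                     # 72 offset channels
```

The offset branch grows with G while the main kernel shrinks with G, which is one reason a moderate group count is a reasonable compromise.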

The second benefit is that different groups can learn to handle different types of features. In the context of SAR images, one group of filters and offsets might specialize in capturing large-scale structures such as buildings and roads by learning offsets with a wider spread, while another group might focus on small objects such as cars and signals by learning offsets that home in on fine details. Group-specific deformable offsets mean that each group can deform its sampling grid in a manner tailored to the subset of features it processes.

We apply GDConv in intermediate layers of the backbone, where feature resolution and channel depth make it most beneficial. The conv3 and conv4 stages, which produce medium-sized feature maps, are good candidates in the ResNet backbone. These layers see objects of different sizes and shapes, so increased deformability helps, while grouping reduces the added cost. By the conv5 stage, the feature map is spatially very coarse and the channel count is high; one might still use GDConv there, though we found that most benefits come from the earlier stages.

GDConv is implemented by modifying the convolution layers in the network definition. During training, the offset convolution for each group learns to produce offsets that minimize the detection loss indirectly, through backpropagation from the detection objective. We initialize the offsets to zero, so the layer initially behaves like a normal convolution, and let the network gradually learn deformations. In our experiments, GDConv demonstrated the ability to deform so as to align with object orientations and scales. We observed offset vectors that elongate along vehicles and adjust around building boundaries, which validates the expected behavior.

3.3. LBP Enhancement Module

To better detect small objects and distinguish objects from background clutter, we incorporate Local Binary Pattern (LBP) features into the CNN. LBP is a classical operator that captures local texture by comparing each pixel with its neighbors [38]. In its basic form with an 8-connected neighborhood, the LBP value at pixel

(i, j)

is computed by thresholding the

3 \times 3

neighborhood centered at it against the center pixel, as shown in Equation (4).

\mathrm{LBP} (i, j) = \sum_{n = 0}^{7} 1 \{I_{n} \geq I_{c}\} \cdot 2^{n},

(4)

where

I_{c}

is the intensity of the center pixel,

I_{n}

are the intensities of the 8 neighboring pixels, and

1 \{\cdot\}

is an indicator function that outputs 1 if the condition is true (neighbor pixel brighter or equal to center) or 0 otherwise. The result is an 8-bit code (0–255) describing the local pattern. This code can also be interpreted as a set of 8 binary features.
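As an illustration of Equation (4), the sketch below computes the 8 comparison bits and the resulting LBP code for one interior pixel. The ordering of the neighbors (clockwise from the top-left) is an assumption made for this example, since the text does not fix which neighbor corresponds to which bit.

```python
# Assumed neighbour order for n = 0..7: clockwise, starting at the top-left pixel.
NEIGHBOUR_OFFSETS = [
    (-1, -1), (-1, 0), (-1, 1),
    (0, 1), (1, 1), (1, 0),
    (1, -1), (0, -1),
]


def lbp_bits(image, i, j):
    """Return the 8 indicator values 1{I_n >= I_c} for interior pixel (i, j)."""
    center = image[i][j]
    return [
        1 if image[i + di][j + dj] >= center else 0
        for di, dj in NEIGHBOUR_OFFSETS
    ]


def lbp_code(image, i, j):
    """Pack the 8 bits into the 0..255 code of Equation (4)."""
    return sum(bit << n for n, bit in enumerate(lbp_bits(image, i, j)))


patch = [
    [9, 1, 9],
    [1, 5, 1],
    [9, 1, 9],
]
print(lbp_bits(patch, 1, 1))  # [1, 0, 1, 0, 1, 0, 1, 0]
print(lbp_code(patch, 1, 1))  # 85
```

A perfectly flat patch yields the code 255, because every neighbor satisfies the "greater than or equal" test.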

As shown in Figure 4, our LBP module operates on a feature map to extract texture information that the CNN’s learned filters might not explicitly capture. We choose a relatively high-resolution feature map from the backbone or FPN for this purpose. This map has a rich amount of detail about edges and small structures. We first transform this feature map to a single-channel representation suitable for LBP. One simple way is to take the intensity or the average across channels. In practice, we found that applying LBP on each channel separately and then aggregating was unnecessary; instead, we apply a

1 \times 1

convolution to collapse the feature map into one channel and then compute LBP on that.

The LBP computation is implemented as a fixed, non-learnable operation within the network. For each spatial location of the chosen feature map, we evaluate the 8 comparisons with its immediate neighbors. This yields 8 binary maps, one for each bit of the LBP code. We treat these 8 binary maps as 8 additional feature channels. These binary feature maps highlight local texture transitions. For example, one channel might indicate where a pixel is darker than its right neighbor, another where it is darker than its top neighbor. Patterns like edges will produce distinctive streaks in these channels.

The principle behind converting binary numbers to decimal numbers lies in the expansion of bit weights. In the binary system, each bit can only take the values 0 or 1, with corresponding weights that are powers of 2, starting from the rightmost bit as

2^{0}, 2^{1}, 2^{2}

, and so on until the leftmost bit. For example, an eight-bit binary number

b_{7} b_{6} b_{5} b_{4} b_{3} b_{2} b_{1} b_{0}

can be converted to its decimal equivalent as shown in Equation (5).

\mathrm{Decimal} = b_{7} \times 2^{7} + b_{6} \times 2^{6} + b_{5} \times 2^{5} + b_{4} \times 2^{4} + b_{3} \times 2^{3} + b_{2} \times 2^{2} + b_{1} \times 2^{1} + b_{0} \times 2^{0},

(5)

where

b_{i}

represents the binary value of the i-th digit, which can be either 0 or 1. For example, 00111010 expands to

0 \times 2^{7} + 0 \times 2^{6} + 1 \times 2^{5} + 1 \times 2^{4} + 1 \times 2^{3} + 0 \times 2^{2} + 1 \times 2^{1} + 0 \times 2^{0}

, which equals 58.

Next, we fuse the LBP features with the original CNN features. We concatenate the 8 LBP channels with the original feature map’s channels. This creates an augmented feature map with 256 + 8 channels. To avoid a disproportionate increase in dimensionality and to allow the network to learn how to best use these new features, we apply a

1 \times 1

convolution after concatenation. This

1 \times 1

convolution compresses the feature back to the original number of channels (256) or another desired dimension, and it learns the optimal weighting between original features and LBP features. In essence, this provides the network a chance to attend to LBP cues as needed. During training, the gradients flow through this fusion layer into both the previous CNN layers and the fixed LBP computation.

The effect of the LBP enhancement is that certain fine details become more salient. For example, in a SAR image, a small vehicle might produce a particular LBP pattern that helps it stand out from a similarly colored road. The CNN alone might not pick this up if the vehicle is only a few pixels, but the LBP will generate a strong binary pattern indicating an object boundary. By integrating this, our detector sees an improved feature representation. This typically leads to higher precision and higher recall for tiny objects. We will show in experiments that adding the LBP module yields a noticeable improvement especially on datasets with small objects.

It is worth noting that the LBP module adds negligible computational load. It involves simple comparisons and bitwise operations. There are no learned parameters in generating the LBP maps, and only a small overhead in memory for the additional channels. Therefore, it is an attractive plug-in module for improving feature richness without the risk of overfitting.

3.4. Vector Decomposition Module for Bounding Boxes

The vector decomposition module is our approach to predicting object bounding boxes, particularly to handle oriented objects. Traditional detectors parameterize a bounding box by

(t_{x}, t_{y}, t_{w}, t_{h})

, which are the relative offsets of the box center and the log-scale offsets of width and height with respect to some reference [2]. Some oriented detectors add a fifth parameter for the angle

θ

. However, predicting

θ

directly can be problematic due to the discontinuity at

2 π

and because a small change in angle can cause a large change in box coordinates when the box is almost square. Instead, we opt for a vector-based representation that is smooth and continuous.

In our module, an object’s bounding box is represented by two vectors

\vec{v_{1}}

and

\vec{v_{2}}

originating from the box center as shown in Figure 1. These can be thought of as the directions towards two adjacent sides of the rotated rectangle. More concretely, let

(x_{c}, y_{c})

be the coordinates of the box center. We define

\vec{v_{1}} = (v_{1 x}, v_{1 y})

as the vector from the center to the midpoint of the box’s first side, and

\vec{v_{2}} = (v_{2 x}, v_{2 y})

as the vector from the center to the midpoint of the second side (which is the adjacent side, 90 degrees rotated from the first). In an oriented rectangle, these two vectors should be perpendicular and their lengths correspond to half of the box’s width and height, respectively. The four corners of the bounding box can be recovered as Equation (6).

\mathrm{corner}_{1,2} = (x_{c}, y_{c}) \pm \vec{v_{1}} \quad \text{and} \quad \mathrm{corner}_{3,4} = (x_{c}, y_{c}) \pm \vec{v_{2}},

(6)

where

\vec{v_{1}}

corresponds to width and

\vec{v_{2}}

to height. However, we do not enforce orthogonality or any other explicit constraint in the prediction; this is learned implicitly. If the object is axis-aligned,

\vec{v_{1}}

might align with the horizontal axis, such as

(\frac{w}{2}, 0)

, and

\vec{v_{2}}

with the vertical axis, such as

(0, \frac{h}{2})

. If the object is rotated by some angle,

\vec{v_{1}}

will rotate accordingly and

\vec{v_{2}}

should rotate by the same angle. In cases where the object does not have a clear orientation (or orientation is not annotated), the network is free to output any consistent pair

\vec{v_{1}}, \vec{v_{2}}

that covers the object.
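The geometry of this representation can be sketched in a few lines. Because the two vectors are defined as pointing from the center to the midpoints of adjacent sides, the sketch below takes the four corners to be the center plus or minus both vectors; this reading, the counter-clockwise angle convention, and the helper names `box_vectors` and `box_corners` are assumptions made for illustration, not details confirmed by the text.

```python
import math


def box_vectors(w, h, theta):
    """Half-extent vectors of a w x h rectangle rotated by theta radians.

    v1 points from the centre to the midpoint of one side (half the width),
    v2 to the midpoint of the adjacent side (half the height).
    """
    v1 = (0.5 * w * math.cos(theta), 0.5 * w * math.sin(theta))
    v2 = (-0.5 * h * math.sin(theta), 0.5 * h * math.cos(theta))
    return v1, v2


def box_corners(center, v1, v2):
    """Corners taken as centre +/- v1 +/- v2 (assumed reading of the definition)."""
    xc, yc = center
    signs = ((1, 1), (-1, 1), (-1, -1), (1, -1))
    return [
        (xc + s1 * v1[0] + s2 * v2[0], yc + s1 * v1[1] + s2 * v2[1])
        for s1, s2 in signs
    ]
```

For an axis-aligned 4 x 2 box centered at (10, 10), `box_vectors(4, 2, 0.0)` gives the vectors (2, 0) and (0, 1), and `box_corners` returns the expected rectangle with corners at x = 8 or 12 and y = 9 or 11.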

During training, we generate target vectors for each ground-truth box. For a horizontal ground-truth box, we set

{\vec{v}}_{1}^{*} = (\frac{w^{*}}{2}, 0)

and

{\vec{v}}_{2}^{*} = (0, \frac{h^{*}}{2})

in the coordinate frame of the proposal, relative to the proposal center and axes. Here

w^{*}

,

h^{*}

are the ground truth box width and height. For an oriented ground truth, available in a SAR dataset with oriented annotation, we compute

{\vec{v}}_{1}^{*}

and

{\vec{v}}_{2}^{*}

as half-length vectors along the ground truth box’s orientation. These serve as regression targets for the network’s predictions.

Our detection head’s regression branch is thus modified to output 6 values per object

\{Δ x, Δ y, v_{1 x}, v_{1 y}, v_{2 x}, v_{2 y}\}

, or 6 per class in the class-specific case; we use class-agnostic regression for simplicity.

Δ x, Δ y

are the center offsets of the predicted box relative to the proposal’s center. Their ground-truth targets are encoded as

Δ x^{*} = \frac{x_{c}^{*} - x_{p}}{w_{p}}

,

Δ y^{*} = \frac{y_{c}^{*} - y_{p}}{h_{p}}

(with

p

denoting the proposal and

*

the ground truth). For the vectors, we similarly normalize by the proposal size, as in Equation (7).

t_{v 1 x} = \frac{v_{1 x}^{*}}{w_{p}}, t_{v 1 y} = \frac{v_{1 y}^{*}}{h_{p}}, t_{v 2 x} = \frac{v_{2 x}^{*}}{w_{p}}, t_{v 2 y} = \frac{v_{2 y}^{*}}{h_{p}}

(7)

as target values. The network outputs corresponding predictions

({\hat{t}}_{v 1 x}, {\hat{t}}_{v 1 y}, {\hat{t}}_{v 2 x}, {\hat{t}}_{v 2 y})

which are then scaled back by

w_{p}, h_{p}

to yield the predicted

\vec{v_{1}} \text{ and } \vec{v_{2}}

in absolute terms. We treat each component as an independent regression task.
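The encoding and decoding of the six regression values can be sketched as a round trip. The sketch assumes the usual R-CNN convention for the center offset (difference between ground-truth and proposal centers divided by the proposal size) and represents a proposal as a hypothetical tuple `(x_p, y_p, w_p, h_p)` with `(x_p, y_p)` its center.

```python
def encode_targets(gt_center, gt_v1, gt_v2, proposal):
    """Normalise ground-truth centre and vectors by the proposal size."""
    xp, yp, wp, hp = proposal
    return (
        (gt_center[0] - xp) / wp,  # centre offset in x
        (gt_center[1] - yp) / hp,  # centre offset in y
        gt_v1[0] / wp,             # t_v1x
        gt_v1[1] / hp,             # t_v1y
        gt_v2[0] / wp,             # t_v2x
        gt_v2[1] / hp,             # t_v2y
    )


def decode_prediction(t, proposal):
    """Scale the six predicted values back to absolute coordinates."""
    xp, yp, wp, hp = proposal
    dx, dy, t1x, t1y, t2x, t2y = t
    center = (xp + dx * wp, yp + dy * hp)
    v1 = (t1x * wp, t1y * hp)
    v2 = (t2x * wp, t2y * hp)
    return center, v1, v2
```

Decoding the encoded targets recovers the original center and vectors up to floating-point rounding, which is a quick sanity check for any implementation of this parameterization.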

One advantage of this approach is that it sidesteps the difficulty of angle prediction. If an object rotates gradually, the predicted vectors can rotate gradually as well, without any abrupt changes or wrapping around. Another advantage is that by predicting the full vectors, the model can, in principle, handle non-rectangular shapes or slight deviations if needed by adjusting

\vec{v_{1}} \text{ and } \vec{v_{2}}

.

At inference time, once we have

\hat{Δ x}, \hat{Δ y}, \hat{{\vec{v}}_{1}}, \hat{{\vec{v}}_{2}}

for a proposal, we reconstruct the detection box. The predicted center is

(\hat{x_{c}}, \hat{y_{c}}) = (x_{p} + \hat{Δ x} \cdot w_{p}, y_{p} + \hat{Δ y} \cdot h_{p})

. The predicted vectors in absolute coordinates are

\hat{{\vec{v}}_{1}} = (\hat{v_{1 x}}, \hat{v_{1 y}}) = ({\hat{t}}_{v 1 x} \cdot w_{p}, {\hat{t}}_{v 1 y} \cdot h_{p})

, and similarly

\hat{{\vec{v}}_{2}}

. We can directly output the oriented bounding box defined by

(\hat{x_{c}}, \hat{y_{c}})

and

\hat{{\vec{v}}_{1}}, \hat{{\vec{v}}_{2}}

. For evaluation on datasets that expect axis-aligned boxes, we compute the axis-aligned bounding box that minimally encloses the oriented box.
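The conversion to an axis-aligned box can be written in closed form. The sketch below assumes that the oriented box's corners lie at the center plus or minus both vectors, so the extreme coordinates are reached by adding the absolute components of the two vectors; the function name and the `(x_min, y_min, x_max, y_max)` output layout are illustrative choices.

```python
def enclosing_axis_aligned_box(center, v1, v2):
    """Smallest axis-aligned box containing the oriented box (centre, v1, v2).

    Assumes corners at centre +/- v1 +/- v2, so the half-extent along each
    axis is the sum of the absolute components of the two vectors.
    """
    xc, yc = center
    half_w = abs(v1[0]) + abs(v2[0])
    half_h = abs(v1[1]) + abs(v2[1])
    return (xc - half_w, yc - half_h, xc + half_w, yc + half_h)
```

For an unrotated box this returns the box itself, and for a 45-degree rotation it returns the larger square that circumscribes the rotated rectangle.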

The vector decomposition module adds a bit more output to the network, but this overhead is minor. We found that the network learns to predict perpendicular vectors for most objects, effectively learning the concept of oriented boxes. In cases where orientation is ambiguous, the network may output arbitrary perpendicular vectors, which still produce a valid bounding box. This representation proved robust in our experiments, especially enhancing the localization of rotated objects. Additionally, even on datasets with only horizontal annotations, the vector approach did not harm performance. In fact, it slightly improved localization accuracy, likely because the model was able to adjust corners in a more flexible way than the standard width-height parameterization.

3.5. Loss Function

We train the entire network end-to-end using a multi-task loss that combines classification and localization terms, following the standard practice in two-stage detectors [2]. The overall loss

L

is a sum of four components, as shown in Equation (8).

L = L_{\mathrm{RPN-cls}} + L_{\mathrm{RPN-reg}} + L_{\mathrm{ROI-cls}} + L_{\mathrm{ROI-reg}},

(8)

where RPN and ROI (region of interest) denote the proposal network and the second-stage detection head, respectively. We describe the two ROI terms in detail because they pertain to our contributions; the RPN uses the standard losses of Faster R-CNN.

For the classification branch of the ROI head, we use a cross-entropy loss over

N_{c}

classes plus background. For each predicted region

i, L_{\mathrm{ROI-cls}}^{i} = - \log p_{i, c^{*}}

, where

p_{i, c^{*}}

is the softmax probability for the ground truth class

c^{*}

, or the background class if the proposal does not match any object with

\mathrm{IoU} > 0.5

. We sum this over all ROI examples.
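The per-ROI classification term is the negative log of a softmax probability. A minimal sketch in plain Python follows; it works on raw class scores (logits) and uses the log-sum-exp trick for numerical stability, which is an implementation detail assumed here rather than described in the text.

```python
import math


def roi_cls_loss(logits, target_index):
    """Cross-entropy -log softmax(logits)[target_index] for one ROI.

    logits holds one score per class, background included; which index is
    the background class is left to the caller.
    """
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target_index]
```

With uniform scores over three classes the loss is log 3 for any target, and it decreases as the score of the target class grows relative to the others.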

For the regression branch, we define a smooth L1 loss on the vector parameters. Let

\hat{t} = (\hat{Δ x}, \hat{Δ y}, {\hat{t}}_{v 1 x}, {\hat{t}}_{v 1 y}, {\hat{t}}_{v 2 x}, {\hat{t}}_{v 2 y})

be the predicted transformations for a given ROI, and

t^{*}

the ground-truth transformation as defined in Section 3.4. We use Equation (9).

L_{\mathrm{ROI-reg}}^{i} = \mathrm{SmoothL1} (\hat{Δ x} - Δ x^{*}) + \mathrm{SmoothL1} (\hat{Δ y} - Δ y^{*}) + \mathrm{SmoothL1} ({\hat{t}}_{v 1 x} - t_{v 1 x}^{*}) + \mathrm{SmoothL1} ({\hat{t}}_{v 1 y} - t_{v 1 y}^{*}) + \mathrm{SmoothL1} ({\hat{t}}_{v 2 x} - t_{v 2 x}^{*}) + \mathrm{SmoothL1} ({\hat{t}}_{v 2 y} - t_{v 2 y}^{*})

(9)

We apply this regression loss only to positive ROIs, that is, those matched to a ground-truth object. The SmoothL1 loss, a Huber-type loss, is defined in Equation (10).

\mathrm{SmoothL1} (d) = \begin{cases} 0.5 d^{2}, & \text{if } |d| < 1, \\ |d| - 0.5, & \text{otherwise}, \end{cases}

(10)

which is less sensitive to outliers than L2 and is standard for box regression tasks.
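Equations (9) and (10) translate directly into code. The sketch below is a plain-Python illustration for a single positive ROI, with the six-element tuples ordered as in Equation (9); the function names are hypothetical.

```python
def smooth_l1(d):
    """SmoothL1 of Equation (10): quadratic near zero, linear elsewhere."""
    a = abs(d)
    return 0.5 * d * d if a < 1 else a - 0.5


def roi_reg_loss(pred, target):
    """Equation (9): sum of SmoothL1 over the six regression components."""
    return sum(smooth_l1(p - t) for p, t in zip(pred, target))


pred = (0.5, 0.0, 0.0, 0.0, 0.0, 2.0)
target = (0.0, 0.0, 0.0, 0.0, 0.0, 0.0)
print(roi_reg_loss(pred, target))  # 0.125 + 1.5 = 1.625
```

The two branches meet at |d| = 1 with the value 0.5, so the loss is continuous, and large residuals grow only linearly, which is the source of its robustness to outliers.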

The RPN is trained with a similar loss, consisting of binary cross-entropy for objectness classification and SmoothL1 for regressing proposal coordinates. We weight the losses such that classification and regression contributions are roughly balanced. In practice, we set

λ = 1

as the weight for the ROI regression loss relative to ROI classification, after normalizing by the number of examples, which worked well (this follows the practice of Fast R-CNN [59]). The RPN loss is also weighted with

λ_{R P N} = 1

.

No additional loss terms are needed for the LBP module or GDConv, as they are implicitly learned through the effect they have on the detection accuracy. One could consider an auxiliary loss to encourage

\vec{v_{1}} ⊥ \vec{v_{2}}

or a similar constraint, but we found it unnecessary as the network naturally learns near-perpendicular vectors for rectangular objects to minimize regression error.

4. Experiments

As shown in Figure 5, the HRSID and SSDD datasets used in this paper cover multiple radar platforms and sea states and provide inshore/offshore scene labels, instance masks, rotated boxes, and pure background samples, which makes it possible to assess the robustness of the model on small targets and complex backgrounds systematically. After validating the model on these two public main datasets, we further validate its cross-domain and cross-label generalization on the optical remote sensing dataset HRSC2016 and on a self-built SAR small ship dataset, respectively. Because the self-built SAR ship dataset is subject to copyright and confidentiality constraints that make it difficult to reproduce publicly, this paper mainly reports results on widely used public benchmarks.

4.1. Datasets and Implementation

We compare our model with representative detectors and conduct ablation studies to highlight the contribution of each component. We use standard evaluation metrics, including mean Average Precision, Precision, and Recall at a specified IoU threshold. Unless otherwise stated, the backbone is an ImageNet-pretrained ResNet-50. In GDConv layers we set the number of groups to G = 4. Our full configuration, termed DVDNet, consists of the backbone together with GDConv inserted at conv3 and conv4, the LBP enhancement on P3, and the vectorized regression head. In addition, for the two main datasets, HRSID and SSDD, we also report performance by target size. Following the COCO convention, targets are divided into three categories based on pixel area, and

\mathrm{AP}_{S}

,

\mathrm{AP}_{M}

, and

\mathrm{AP}_{L}

are calculated accordingly: AP is averaged over IoU thresholds from 0.5 to 0.95 with a step size of 0.05, using only the targets at the corresponding scale.
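For reference, the COCO size convention mentioned above can be sketched as follows. The area limits of 32 × 32 and 96 × 96 pixels come from the COCO evaluation protocol; how ties exactly at the boundaries are assigned is an assumption of this sketch, and the function names are illustrative.

```python
def coco_size_bucket(area):
    """Assign a box to a COCO scale category by its pixel area."""
    if area < 32 ** 2:
        return "small"
    if area < 96 ** 2:
        return "medium"
    return "large"


def coco_iou_thresholds():
    """IoU thresholds 0.50, 0.55, ..., 0.95 over which AP is averaged."""
    return [round(0.5 + 0.05 * i, 2) for i in range(10)]
```

Under this convention a ship occupying 20 × 20 pixels counts toward the small-target AP, while a 50 × 50 ship counts toward the medium-target AP.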

Our model is implemented in PyTorch 2.3.1. GDConv layers are inserted in the ResNet-50 at layers

\mathrm{res3\_3}, \mathrm{res4\_6}

, replacing the

3 \times 3

convolution in those blocks with a deformable convolution of the same size and 4 groups. The LBP module is applied on the P3 feature map; we reduce it to 1 channel and compute LBP using the 8-neighborhood. The 8 binary maps are concatenated with the P3 features and passed through a

1 \times 1

convolution to get back to 256 channels. For vector decomposition, our ROI head outputs 6 regression values per ROI instead of 4, again with class-agnostic regression for simplicity.

All methods in the comparison experiments are initialized from publicly available pre-trained weights and are trained and evaluated independently on each dataset, without mixed-dataset training. Evaluation is uniformly performed on single-scale inputs with a non-maximum suppression threshold of 0.5, and we report mAP, Precision, and Recall.

The YOLO family consists of YOLOv5, YOLOv6, YOLOv7, YOLOv8, and YOLOX-s, as well as E2YOLOX-VFL, YOLOv7oSAR, and Light-YOLOv8. These models strictly follow the default configurations of their official implementations for training and inference: we use each repository’s default optimizer and learning rate schedule, employ the default strong data augmentation schemes (such as multi-scale training, random affine transforms, and mosaic with mixup or cutmix), and leave anchors and EMA in their official on/off state. Input resolution and batch size take their default values, and the best weights are selected by validation-set metrics without test-time augmentation.

Other families of methods include RetinaNet, FCOS, CenterNet, Cascade R-CNN, Libra R-CNN, Sparse R-CNN, Dynamic R-CNN, GCNet, SSD300, DETR, Deformable DETR, and so on. These are trained according to the recommended recipe of each paper or official code: optimizers and learning rate schedules are taken from the corresponding papers, and the regular data augmentation consists of random scale jitter and horizontal flipping. AdamW is used for the DETR series and SGD for the remaining methods; the backbone is an ImageNet-pretrained ResNet-50, and the remaining hyperparameters are kept at the official defaults.

These settings are designed to let each method reach its native performance while keeping the evaluation process consistent and the results reproducible. For a fair comparison with reproducible FLOPs, the same input resolution of 800 × 800 is fixed on all four datasets.

The HRSID dataset contains 5604 high-resolution images of 800 × 800 pixels with 16,951 vessels. Its sources are the Sentinel-1B (C-band) and TerraSAR-X and TanDEM-X (X-band) satellites, with pixel resolutions of 0.5 m, 1 m, and 3 m, covering a wide range of imaging angles and sea states. Horizontal bounding boxes, instance segmentation masks, and inshore/offshore scene labels are provided, and 400 pure background images are included for robustness testing. To ensure reproducibility, this paper uses a fixed random seed of 42 to divide the data into training, validation, and test sets of 60%, 20%, and 20%. The input size is kept at 800 × 800. If an original scene image is not square, it is sliced into 800 × 800 tiles according to the official slicing method before being used in training and evaluation.

SSDD [30] is a SAR small target detection dataset. It includes 1160 images containing 2456 vessels, an average of about 2.1 vessels per image. The original annotations are horizontal bounding boxes, and the official new version adds rotated boxes and pixel-level segmentation, which is convenient for studying small targets, dense targets, and fine localization. The images are mainly derived from the radar satellites RadarSat-2, TerraSAR-X, and Sentinel-1. The polarization modes cover HH, HV, VH, and VV with a spatial resolution of 1 m–15 m, and the scenes cover a wide range of sea states both inshore and offshore. The original image sizes are not uniform. In this paper, all samples are letterboxed to 800 × 800 and divided into training, validation, and test sets of 60%, 20%, and 20% with random seed 42. The annotation is a single class of rotated boxes, and the evaluation protocol is consistent with that of HRSID.
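The resizing of non-uniform images to 800 × 800 can be illustrated with the geometry of a letterbox transform. The sketch assumes that "letterbox" means an aspect-preserving resize followed by centered padding; the padding placement and the helper names are assumptions, since the text does not specify them.

```python
def letterbox_params(width, height, target=800):
    """Scale factor, resized size, and (left, top) padding for a letterbox resize."""
    scale = min(target / width, target / height)
    new_w, new_h = round(width * scale), round(height * scale)
    pad_left = (target - new_w) // 2
    pad_top = (target - new_h) // 2
    return scale, (new_w, new_h), (pad_left, pad_top)


def map_point(x, y, scale, pad):
    """Map an annotation point from the original image to the letterboxed image."""
    return x * scale + pad[0], y * scale + pad[1]
```

A 500 × 350 image, for example, is scaled by 1.6 to 800 × 560 and padded with 120 pixels above and below; box annotations must be mapped with the same scale and padding.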

The HRSC2016 dataset is a challenging remote sensing benchmark focused on ship detection in aerial images. It contains high-resolution images with significant variations in scale, orientation, and aspect ratio. One notable feature of HRSC2016 is that objects are labeled with oriented bounding boxes, making it an ideal test environment for evaluating the effectiveness of orientation-aware detection methods. A 6:2:2 split into training, validation, and test sets is used, with a fixed random seed of 42 to ensure reproducibility. As with the other datasets, the images are converted to a resolution of 800 × 800.

In addition, we acquired and annotated a dataset of 2400 SAR small ship images with a spatial resolution of 1–10 m, covering multiple scene types. The polarization is mainly VV, with a small amount of VH, and the annotations are rotated bounding rectangles with a single ship class, with the scene type identifier retained. To ensure consistency with the public benchmark datasets, all samples are letterboxed to 800 × 800 after aspect-preserving scaling and divided in a 7:1:2 ratio (1680 for training, 240 for validation, and 480 for testing) using a fixed random seed of 42. Samples from the HRSC2016 and SAR small ship datasets are shown in Figure 6.
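A seeded split of this kind can be reproduced in a few lines. The sketch below is one possible procedure, not the authors' script: it shuffles image indices with Python's `random.Random(42)` and cuts them by integer ratios, so the exact assignment of images to subsets will differ from the paper's unless the same generator and ordering were used.

```python
import random


def split_indices(n, ratios=(6, 2, 2), seed=42):
    """Shuffle range(n) with a fixed seed and cut it into train/val/test parts."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    total = sum(ratios)
    n_train = n * ratios[0] // total
    n_val = n * ratios[1] // total
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # remainder goes to the test set
    return train, val, test


train, val, test = split_indices(2400, ratios=(7, 1, 2))
print(len(train), len(val), len(test))  # 1680 240 480
```

Integer arithmetic is used for the cut points so that the subset sizes do not depend on floating-point rounding.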

4.2. Comparative Experiment

From the results shown in Table 1, Table 2 and Table 3, our proposed model consistently outperforms a broad set of state-of-the-art detection methods across all benchmark datasets. Superior performance is achieved on the SAR small target detection datasets HRSID and SSDD, as well as on the remote sensing optical HRSC2016 dataset and the self-built SAR small ship dataset. The improvements are evident across

\mathrm{mAP}_{50}

, precision, and recall. These gains can be attributed to the synergy between three architectural innovations in our design: Grouped Deformable Convolution (GDConv), the Local Binary Pattern (LBP) enhancement module, and the Vector Decomposition-based bounding box regression head.

In the high-clutter, low signal-to-noise, and very-small-scale setting of SAR small target detection, our method maintains high precision and recall on both mainstream datasets. As shown in Table 1, the

\mathrm{mAP}_{50}

of our method is 90.9% on the HRSID dataset and 87.2% on the SSDD dataset, which exceeds the other mainstream SOTA models. Although both HRSID and SSDD are SAR small target detection datasets, their scenes differ greatly: HRSID is dominated by complex harbor and nearshore scenes, whereas SSDD has monotonous offshore backgrounds. The performance of our method is nevertheless stable, with a cross-domain fluctuation of less than 4%. In addition, our method achieves 86.2% precision and 91.7% recall on HRSID, and 90.4% precision and 90.7% recall on SSDD. SAR applications are particularly sensitive to false alarms such as ship shadows, and high precision directly reduces the burden of back-end screening. At the same time, a higher recall avoids missed detections, which is also critical in maritime surveillance; compared with the 83.4% of the traditional two-stage Faster R-CNN, this is an improvement of about 8%. E2YOLOX-VFL, YOLOv7oSAR, and Light-YOLOv8 are models designed specifically for small ship detection in remote sensing, and the experimental results show that the detection performance of DVDNet is still superior to that of Light-YOLOv8. On HRSID, recall is the most obvious strength of DVDNet, which indicates that GDConv and LBP improve the proposal quality and small-target separability in nearshore scenes with strong clutter and dense small targets; its

\mathrm{mAP}_{50}

is also ahead or equal. On SSDD,

\mathrm{mAP}_{50}

is level or slightly ahead, but Precision and Recall are significantly ahead, which indicates that DVDNet has better suppression of sea surface noise, wave crests, and other bright false targets, and significantly fewer false detections.

Our method maintains low drift when switching scenes between the two mainstream datasets, HRSID and SSDD. Compared with other mainstream models, it demonstrates robustness to resolution differences, shoreline clutter, and strong scattering in the SAR small target detection task. Relative to conventional two-stage methods such as Faster R-CNN, this paper replaces the conventional FPN neck with an augmented structure consisting of GDConv for C3 and C4 and LBP texture enhancement for P3. Together with the vectorized regression head, this specifically enhances the response to and robustness for small targets and arbitrarily oriented ships, and results in approximately 20% improvement in precision and approximately 10% improvement in recall. In contrast to Transformer and DETR-like methods, we keep the CNN backbone together with lightweight rotation attention, suppress background with a local prior, converge quickly, and do not rely on tens of thousands of warm-up steps.

It can be seen that a favorable trade-off is achieved among the combined precision and recall scores, cross-domain stability, and computational cost, even though the mAP alone is not maximized. For tasks such as real-time maritime surveillance, vessel capturing, and satellite-borne SAR cruises, our approach offers a lower risk of missed detections and the overall advantage of being deployable on edge hardware.

To better observe the advantage of DVDNet in small target remote sensing detection, the two small target datasets were further divided according to the COCO criteria to verify the performance of the model on small, medium, and large targets. From the overall results in Table 2, DVDNet achieves the highest

\mathrm{AP}_{S}

for small targets on both datasets, 80.1 on HRSID and 73.8 on SSDD, a stable advantage over the three YOLO variants designed for remote sensing of small vessels. For medium and large targets, DVDNet is almost equal to the strongest lightweight YOLO models. For example, on HRSID its

\mathrm{AP}_{M}

is 92.1, equal to the 92.1 of Light-YOLOv8, and its

\mathrm{AP}_{L}

is 94.1, slightly lower than the 94.6 of Light-YOLOv8. The

\mathrm{AP}_{M}

and

\mathrm{AP}_{L}

on SSDD are 90.0 and 92.9, which are also in the first echelon. Additionally, as the model size of the YOLO series increases, as with YOLOX-l and YOLOv5x, the detection performance of YOLO-based models improves. This is evident both in the overall performance shown in Table 1 and in the AP values for different sizes in Table 2. In particular, the AP values for medium and large targets increase significantly and surpass those of DVDNet. Combining Table 1 and Table 2, it can be seen that, since both datasets are small object detection datasets, DVDNet performs best on small objects and in overall performance. Compared with the YOLO series, DVDNet achieves an mAP nearly equal to that of YOLOv8x and YOLOv5x with only a moderate parameter count, while saving approximately 40% in FLOPs compared with YOLOv8x.

Traditional two-stage and early one-stage methods generally have low

\mathrm{AP}_{S}

. Faster R-CNN, for example, reaches 65.5 on HRSID and 54.2 on SSDD, which indicates that it is difficult to fully capture the details in nearshore clutter and small target scenes by relying only on regular sampling and standard regression. The global modeling of the Transformer family is friendlier to medium and large targets, but these methods remain inferior to DVDNet in

\mathrm{AP}_{S}

. For example, Deformable DETR reaches 77.1 on HRSID and 67.0 on SSDD.

On the efficiency dimension, the YOLO family trades a very low number of parameters and FLOPs for high throughput, such as 3.2M parameters and 13.6G FLOPs for YOLOv8n. The overall computational cost of the two-stage family is higher, with DVDNet’s 55.3M parameters and 339.8G FLOPs being of the same order of magnitude as the standard two-stage baseline. Compared with the 326.7G FLOPs of Faster R-CNN, it adds only a small amount of overhead but gains significantly in

\mathrm{AP}_{S}

and overall AP on both datasets. Weighing accuracy against overhead, DVDNet has the most prominent advantage in small target detection while maintaining medium and large target performance comparable to its strongest rivals, demonstrating greater robustness under complex sea conditions and multi-scale conditions. This can be explained by the combination of module designs: GDConv at C3 and C4 provides adaptive sampling of elongated and arbitrarily oriented ships, LBP at P3 strengthens texture and edge details, and vectorized regression improves the stability of rotated box fitting. The synergy of the three is directly reflected in the sustained lead in

\mathrm{AP}_{S}

.

On the HRSC2016 dataset, which presents unique challenges in remote sensing due to high-resolution imagery and densely packed, arbitrarily oriented ship targets, our model achieves the highest

\mathrm{mAP}_{50}

of 80.7%, surpassing all other methods, including strong baselines such as Cascade R-CNN (80.6%), Sparse R-CNN (79.7%), and YOLOX-s (79.5%). Our model also achieves the best precision (85.0%) and recall (83.0%), indicating both accurate localization and strong object coverage. These results confirm the effectiveness of our architecture in capturing fine-grained textures, geometric variances, and rotation patterns, key traits of SAR target detection.

On the SAR small ship dataset, YOLOv6-n reaches an mAP of 94.8, the highest within the YOLO family in Table 3, but its precision is only 91.8, indicating that recalling more targets also brings relatively more false detections. The mAP values of Light-YOLOv8 and YOLOv7oSAR are 91.5 and 92.2, one tier behind. Among the two-stage and Transformer-based methods, Deformable DETR has an mAP of 94.2 with a precision of 95.7, the highest in the whole table. However, with a recall of 93, its strategy is more conservative and it misses slightly more targets. DVDNet reaches an mAP of 95.1 with a precision of 93.8 and a recall of 94.1, giving the best balance between false detection control and detection capability. From the perspective of task requirements, harbor and near-shore scenarios place more emphasis on low false alarm rates, so DVDNet’s high precision is especially valuable in practice. On the whole, DVDNet combines high precision and high recall with the highest mAP, which demonstrates the stable gain of GDConv and LBP for small, elongated vessels with variable orientations.
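The balance between precision and recall discussed above can be summarized with the F1 score, the harmonic mean of the two. The paper does not report F1, so the following is only an illustrative sketch: the precision and recall values are copied from the SAR small ship columns of Table 3, and the helper name `f1_score` is an assumption introduced here.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


# (precision, recall) in percent, copied from Table 3, SAR small ship columns.
table3_sar_small_ship = {
    "YOLOv6-n": (91.8, 93.5),
    "Faster R-CNN": (93.6, 93.9),
    "DVDNet": (93.8, 94.1),
}

f1_by_method = {
    name: f1_score(p, r) for name, (p, r) in table3_sar_small_ship.items()
}
best_method = max(f1_by_method, key=f1_by_method.get)

for name, value in f1_by_method.items():
    print(f"{name}: F1 = {value:.2f}")
print("best balance:", best_method)
```

Because the harmonic mean penalizes a gap between precision and recall, a method with slightly lower recall but much lower precision scores worse than one whose two values are close together.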

Cross-domain Generalization. The consistent performance on HRSID, SSDD, HRSC2016, and the self-built SAR small ship dataset demonstrates that our model not only excels on SAR imagery but also generalizes well to optical remote sensing imagery and varied detection challenges. The model can accurately detect extremely small vessels (10–30 px) in near-shore, high-clutter scenes, and it can also pick out sparse targets across wide-area offshore imagery. While many detectors perform well only in specific domains, our method achieves the highest $mAP_{50}$ on all four datasets (tied with YOLOv8x on SSDD), making it a robust and transferable detection solution. This cross-domain capability is critical for real-world deployment, where model reliability across diverse environments is essential.

Figure 7 and Figure 8 show a visualization comparison of the HRSID and SSDD remote sensing ship detection datasets, respectively, for DVDNet and Faster R-CNN. The visualization results jointly indicate that in small, dense, and arbitrarily oriented ship scenes, DVDNet improves both recall and localization accuracy compared to Faster R-CNN. In HRSID with strong near-shore clutter, DVDNet can detect small boats of 10–30 px more completely, and the rotation box aligns better with the boat’s longitudinal axis. This significantly reduces false positives where coastlines and bright spots are misidentified as boats. In SSDD with weaker textures further offshore, DVDNet provides higher and more stable confidence scores. The angles and aspect ratios of slender vessels are better matched, and small bright spots in the open sea are effectively suppressed. These advantages correspond directly to the model design. In the C3 and C4 stages, GDConv learns deformation-adaptive and orientation-adaptive sampling to mitigate mismatches caused by flat and elongated objects. LBP texture enhancement amplifies the subtle contrasts of edges and corners in speckled backgrounds, making small objects more distinguishable. Vectorized regression of the ROI head avoids angular discontinuities, improving IoU stability for objects in any direction. Overall, DVDNet’s performance improvement is most significant at the small scale, highlighting its structural advantages for small object detection.
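The angular discontinuity mentioned above can be illustrated numerically. The sketch below is not DVDNet’s regression head. It assumes a simple parameterization in which a rotated box is described by its center and two center-to-edge vectors along the box axes, and all function names are hypothetical. It shows that two nearly identical boxes can have angle labels almost 180° apart, while the corners decoded from the vectors differ by less than a tenth of a pixel.

```python
import math


def box_to_vectors(w, h, theta):
    """Center-to-edge vectors along the width and height axes (theta in radians)."""
    v1 = (0.5 * w * math.cos(theta), 0.5 * w * math.sin(theta))
    v2 = (-0.5 * h * math.sin(theta), 0.5 * h * math.cos(theta))
    return v1, v2


def vectors_to_corners(cx, cy, v1, v2):
    """Decode the four corners as center +/- v1 +/- v2."""
    return [
        (cx + s1 * v1[0] + s2 * v2[0], cy + s1 * v1[1] + s2 * v2[1])
        for s1 in (1, -1)
        for s2 in (1, -1)
    ]


def corner_set_gap(a, b):
    """Largest distance from a corner of one box to the nearest corner of the other."""
    def one_way(src, dst):
        return max(min(math.dist(p, q) for q in dst) for p in src)
    return max(one_way(a, b), one_way(b, a))


# Two labels for almost the same 40 x 10 px box, on either side of the
# +/-90 degree boundary of an angle-based representation.
theta_a = math.radians(89.9)
theta_b = math.radians(-89.9)
corners_a = vectors_to_corners(100.0, 50.0, *box_to_vectors(40.0, 10.0, theta_a))
corners_b = vectors_to_corners(100.0, 50.0, *box_to_vectors(40.0, 10.0, theta_b))

angle_gap_deg = abs(math.degrees(theta_a - theta_b))
geometry_gap_px = corner_set_gap(corners_a, corners_b)
print(f"angle label gap: {angle_gap_deg:.1f} deg")
print(f"decoded corner gap: {geometry_gap_px:.3f} px")
```

An angle regression loss would treat these two labels as far apart even though the boxes nearly coincide. In this sketch the raw vectors themselves can still differ in sign between the two labels, so a practical head also needs a consistent vector ordering, which is not covered here.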

The visualization results show tight contours and few false detections both in ports with high-density, multi-directional ships and in the remaining scenes, which directly corroborates the complementary advantages of the three modules under different scene characteristics. In addition, confidence scores and box quality differ only minimally across the datasets, which supports the ability of the model to transfer between SAR and optical remote sensing images. The 20–30 px miniature boats in Figure 7 and the small distant targets in Figure 8 are all detected in their entirety. This contrasts with conventional detectors, which tend to miss targets or produce drifting boxes at similar sizes, and highlights the significant improvement of the proposed method in detecting small 10–30 px targets.

Figure 9 and Figure 10 show the visualization comparison on the remote sensing optical dataset HRSC2016 and the self-built remote sensing small boat dataset, respectively. From the two sets of visualizations, it is evident that DVDNet significantly outperforms Faster R-CNN in locating vessels with elongated hulls in congested harbor areas. In the HRSC2016 dataset, DVDNet’s bounding boxes for elongated vessels align more closely with the vessel’s orientation and have tighter boundaries, effectively separating small vessels in adjacent berths from dock structures. The overall confidence level is predominantly above 0.9. However, Faster R-CNN suffers from inaccurate direction detection, short or wide bounding boxes, and missed detections and minor false positives in dense areas. In the SAR small ship dataset, DVDNet can still reliably detect extremely small targets with weak echoes in scenes with strong coastal scattering, speckle noise, and low contrast. Small ships near the coastline and dark backgrounds can also be correctly annotated with higher confidence. Faster R-CNN is more prone to missing weak echo targets in open waters and lacks precise angle and scale regression for elongated targets. These differences align with the deformation alignment and multi-scale receptive fields of GDConv, the weak texture enhancement of LBP, and the improved stability of vectorized regression for small targets in our method, demonstrating DVDNet’s superior recall and localization quality for small targets and complex backgrounds.

4.3. Ablation Experiment

To better understand the contribution of each component, we conducted ablation experiments on all benchmark datasets. The results are reported in Table 4 and Table 5, where bold indicates the best results. We start from a concise two-stage baseline consisting of a ResNet-50 backbone, an RPN, and standard box regression, without FPN, LBP, or GDConv. FPN, LBP, GDConv, and their combinations are then enabled incrementally, and finally the full DVDNet is evaluated, in which GDConv is placed on conv3 and conv4, LBP is applied on P3, and the detection head uses vectorized regression. Each row in Table 4 and Table 5 corresponds to one configuration of enabled modules, which separates the effect of each component under a unified training and evaluation protocol.

The baseline model without any enhancement achieves relatively low performance, with an $mAP_{50}$ of 62.2% on HRSID and 59.6% on SSDD. When we introduce FPN alone, we observe clear improvements across all datasets. For example, on HRSC2016 the mAP rises from 55.2% to 66.6%, confirming that multi-scale feature fusion is essential for capturing variability in object size.

Incorporating the LBP module alone also brings measurable gains. For instance, the mAP on the SAR small-target ship dataset increases from 60.5% to 68.7%, and precision rises to 69.2%. This supports our hypothesis that local binary patterns enhance low-level contour information, improving object boundary discrimination.
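To illustrate the kind of low-level contour information that local binary patterns encode, here is a minimal sketch of the classic 3 × 3 LBP operator in plain Python. It is the textbook hand-crafted descriptor, not the LBP Enhancement Module of DVDNet, whose integration into the CNN may differ; the function name `lbp_3x3` and the clockwise bit order are assumptions made for this example.

```python
def lbp_3x3(img):
    """Classic 8-neighbor LBP code for every interior pixel of a 2D list."""
    h, w = len(img), len(img[0])
    # Neighbors visited clockwise, starting at the top-left pixel.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    codes = [[0] * (w - 2) for _ in range(h - 2)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            center = img[y][x]
            code = 0
            for bit, (dy, dx) in enumerate(offsets):
                # Set the bit when the neighbor is at least as bright as the center.
                if img[y + dy][x + dx] >= center:
                    code |= 1 << bit
            codes[y - 1][x - 1] = code
    return codes


flat_patch = [[5, 5, 5], [5, 5, 5], [5, 5, 5]]
bright_spot = [[1, 1, 1], [1, 9, 1], [1, 1, 1]]
vertical_edge = [[0, 9, 9], [0, 9, 9], [0, 9, 9]]

print(lbp_3x3(flat_patch))     # all neighbors >= center
print(lbp_3x3(bright_spot))    # no neighbor >= center
print(lbp_3x3(vertical_edge))  # only the bright-side neighbors set their bits
```

The code depends only on the sign of local intensity differences, so an isolated bright scatterer, a flat region, and an edge map to clearly different codes regardless of absolute brightness.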

The use of GDConv in isolation produces the most significant gain among single modules. On SSDD, GDConv boosts $mAP_{50}$ to 73.5%, with precision and recall also improving substantially. This demonstrates the effectiveness of spatially adaptive sampling and group-wise specialization for handling geometric variations in SAR targets.

The combination of LBP and GDConv leads to even larger improvements. For example, on HRSID, $mAP_{50}$ increases to 89.5%, an absolute improvement of 27.3 percentage points over the 62.2% baseline. Similarly, on SSDD the combined model achieves 85.9% mAP, outperforming all partial variants.
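The absolute gains over the baseline can be recomputed directly from the ablation table. The short sketch below copies the HRSID $mAP_{50}$ column of Table 4 and subtracts the baseline value; the variable names are introduced only for this example.

```python
# mAP50 (%) on HRSID, copied from Table 4.
hrsid_map50 = {
    "None": 62.2,
    "FPN": 75.0,
    "LBP": 64.3,
    "GDConv": 76.6,
    "LBP + GDConv": 89.5,
    "DVDNet": 90.9,
}

baseline = hrsid_map50["None"]
# Gains are differences in percentage points, rounded to one decimal.
gains = {
    name: round(value - baseline, 1)
    for name, value in hrsid_map50.items()
    if name != "None"
}

for name, gain in gains.items():
    print(f"{name}: +{gain:.1f} points over the baseline")
```

Reporting the gains in percentage points rather than as relative percentages keeps them directly comparable across rows of the table.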

Finally, the full model (FPN + LBP + GDConv), listed as DVDNet in Table 4 and Table 5, achieves the best performance across all benchmarks, with an mAP of 90.9% on HRSID, 87.2% on SSDD, 80.7% on HRSC2016, and 95.1% on the SAR small-target ship dataset. Notably, the precision on HRSID reaches 86.2%, demonstrating that our method not only detects more targets but does so with fewer false positives.

The ablation study demonstrates the incremental and complementary value of each module. GDConv contributes most to spatial adaptability, LBP enhances texture sensitivity, and FPN provides strong multi-scale representations. Together, they form a synergistic and robust architecture that achieves superior performance on both the SAR datasets and the optical remote sensing dataset.

To evaluate how the stage at which GDConv is inserted into ResNet-50 affects accuracy and computational overhead, further ablation experiments were conducted on the two public small-target datasets, HRSID and SSDD. The results are shown in Table 6, with the experimental setup as described in Section 4.1. The trends on HRSID and SSDD are essentially the same. The greatest gains appear when GDConv is placed in the middle stages, C3 and C4, which gives the largest boost to both the overall $mAP_{50}$ and $AP_{S}$. Placing GDConv only in the deepest stage, C5, yields limited gain, while moving it down to the shallow stage C2 brings a slight improvement at a steeply higher computational cost, so its cost-effectiveness is lower than that of C3 + C4. The reason is that C5 has too low a resolution to contribute much to small targets, whereas C2 has the highest resolution but also the heaviest speckle and background noise and is computationally expensive. C3 and C4 combine sufficient semantics with usable resolution, and it is at this level that GDConv’s deformation adaptation is most effective, complementing the LBP texture enhancement at P3.

As can be seen in Table 7, as the number of groups increases from 1 to 4, the overall accuracy and the small-target performance steadily improve on both datasets, while the number of parameters and the computational cost decrease. When the number of groups is further increased to 8, the accuracy begins to fall back slightly. On the HRSID dataset, $mAP_{50}$ increases from 90.1 to 90.9, $AP_{S}$ increases from 79.4 to 80.1, and $AP_{M}$ and $AP_{L}$ increase from 91.3 and 93.8 to 92.1 and 94.1, respectively. $mAP_{50}$ decreases to 90.5 when the number of groups is 8, which suggests that excessive grouping weakens the inter-channel coupling and destabilizes the estimation of the deformation offsets. SSDD shows the same pattern: $mAP_{50}$ increases from 86.7 to 87.2, $AP_{S}$ increases from 72.8 to 73.8, and $mAP_{50}$ falls back to 86.9 when the number of groups is 8. In terms of efficiency, the number of parameters decreases from 56.4 M to 45.3 M and the FLOPs decrease from 324 G to 249.8 G. A further increase to 8 groups brings only a small additional decrease of 1.2 G, accompanied by a loss of accuracy. Weighing accuracy against overhead, 4 groups offer the best compromise: this setting improves the overall $mAP_{50}$ on both datasets, especially strengthening the small-target $AP_{S}$, while keeping the parameter count and computational cost manageable. Therefore, 4 groups are used as the default configuration of GDConv.
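The diminishing savings from more groups follow from how grouped convolutions are parameterized. The sketch below counts only the main kernel weights of a single grouped 3 × 3 convolution, assuming the usual rule that each output channel sees `c_in / groups` input channels. The 256-channel width is illustrative, and the offset branch and every other part of the network are ignored, so the numbers are not meant to reproduce the totals in Table 7.

```python
def grouped_conv_weights(c_in, c_out, kernel_size, groups):
    """Number of kernel weights in a grouped 2D convolution (bias excluded)."""
    if c_in % groups or c_out % groups:
        raise ValueError("channel counts must be divisible by groups")
    return (c_in // groups) * c_out * kernel_size * kernel_size


# Illustrative layer: 256 -> 256 channels with a 3 x 3 kernel.
counts = {g: grouped_conv_weights(256, 256, 3, g) for g in (1, 2, 4, 8)}

for g, n in counts.items():
    print(f"groups={g}: {n:,} weights")

saving_1_to_4 = counts[1] - counts[4]
saving_4_to_8 = counts[4] - counts[8]
print(f"saved going from 1 to 4 groups: {saving_1_to_4:,}")
print(f"saved going from 4 to 8 groups: {saving_4_to_8:,}")
```

Because the weight count scales as 1/groups, most of the absolute saving is already obtained by 4 groups, which is consistent with the small additional reduction reported for 8 groups.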

5. Conclusions

In this paper, we propose DVDNet, a CNN-based unified framework for SAR target detection that addresses small targets and arbitrary orientations in SAR images. The framework combines grouped deformable convolution, an LBP texture enhancement module, and a bounding box representation based on vector decomposition. By jointly handling multi-scale variations, arbitrary object orientations, and background complexity, DVDNet significantly improves the detection performance of standard architectures through three major improvements. First, grouped deformable convolution (GDConv) allows the network to sample features adaptively while controlling parameters and computational overhead. Second, the Local Binary Pattern (LBP) is embedded into the CNN to enhance the texture representation, improve the detection rate of small and dense targets, and reduce false detections on textured backgrounds. Third, bounding box vector decomposition represents the rotated box with two vectors pointing from the center to the edges, which avoids angular periodicity and locates targets in any direction more accurately. Extensive experiments on two mainstream SAR small target detection datasets, HRSID and SSDD, demonstrate that each of the DVDNet components contributes to improved accuracy, providing state-of-the-art results among ResNet-50 level detectors. Notably, by combining fine-grained texture encoding with adaptive receptive fields, DVDNet shows particular strength in detecting small objects, an area where many existing detectors perform poorly, as evidenced by the results on the HRSID and SSDD datasets. The generalization performance is also validated on the optical remote sensing dataset HRSC2016 and the self-built SAR small ship dataset.

Author Contributions

Conceptualization, X.D.; Methodology, X.D.; Software, X.W.; Validation, X.D.; Formal analysis, X.W.; Investigation, X.W.; Resources, X.D.; Data curation, X.D.; Writing—original draft, X.D.; Writing—review & editing, X.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflict of interest.

Correction Statement

This article has been republished with a minor correction to the Data Availability Statement. This change does not affect the scientific content of the article.

Figure 1. Overview of the proposed detection framework.

Figure 2. The RPN infrastructure.

Figure 3. The structure of the GDConv.

Figure 4. The overview of the LBP.

Figure 5. Samples on the HRSID and SSDD datasets.

Figure 6. Samples on the HRSC2016 and SAR small ship datasets.

Figure 7. Visualization comparison of Faster R-CNN and DVDNet on the HRSID dataset.

Figure 8. Visualization comparison of Faster R-CNN and DVDNet on the SSDD dataset.

Figure 9. Visualization comparison of Faster R-CNN and DVDNet on the HRSC2016 dataset.

Figure 10. Visualization comparison of Faster R-CNN and DVDNet on the SAR small ship dataset.

Table 1. Comparison of various detection methods on the HRSID and SSDD test sets. Results are reported as mAP, precision, and recall at IoU = 0.5.

Method	HRSID mAP₅₀	HRSID Precision	HRSID Recall	SSDD mAP₅₀	SSDD Precision	SSDD Recall
YOLOv5s	89.5	83.1	91.3	72.9	58.8	77.9
YOLOv6-n	89.9	85.5	91.4	82	87.2	86.4
YOLOv7-tiny	83.6	85.5	74.6	83.7	81.1	84.9
YOLOv8n	90.4	86	91.7	87	85	84
YOLOX-s	88.9	82.6	90.4	74	66	77.4
YOLOX-l	90.6	85.9	91.4	86.8	88.2	89.5
YOLOv8l	90.7	86	91.5	86.9	89.1	89.9
YOLOv5x	90.8	86	91.6	87.1	89.5	90.2
YOLOv8x	90.8	86.1	91.7	87.2	90.1	90.4
CenterNet	83.8	66.3	86.8	72.6	61.5	76.8
FCOS	82.1	76.2	85.1	69.5	53.1	76
SSD300	80	70.5	84.5	67.2	52.3	75.4
RetinaNet	82.5	72	85	71.8	58.2	73.6
Faster R-CNN	80.5	66.9	83.4	69.5	67.2	74.2
Cascade R-CNN	86	74	87.8	76.4	70	78.8
DETR	81	70	80	71	60.2	78.3
Deformable DETR	89.1	84.3	90	79.8	84.2	81.9
Sparse R-CNN	85	78	86	78	73	79
Dynamic R-CNN	82	72	82	73.9	70	77
GCNet	81.5	70	82	75	65.2	79
Libra R-CNN	82.5	71	83	75.5	66.8	78.3
E2YOLOX-VFL	88.4	80	88.2	86.5	79.4	81
YOLOv7oSAR	89.2	84.1	86.4	86.8	81.5	83.4
Light-YOLOv8	90.5	85.9	87.2	87.1	85.2	84.8
DVDNet (Ours)	90.9	86.2	91.7	87.2	90.4	90.7

Table 2. Comparison of different scale AP and complexity of each method on HRSID and SSDD dataset.

Method	HRSID $AP_{S}$	HRSID $AP_{M}$	HRSID $AP_{L}$	SSDD $AP_{S}$	SSDD $AP_{M}$	SSDD $AP_{L}$	Params (M)	FLOPs (G)	FPS
YOLOv5s	78.9	90.9	93.6	59.7	75.7	78.6	7.2	26.6	56.4
YOLOv6-n	79.3	91.3	94	68.8	84.8	87.7	11.3	34	44.1
YOLOv7-tiny	73	85	87.7	70.5	86.5	89.4	6.2	20	75
YOLOv8n	79.8	91.8	94.5	73.8	89.8	92.7	3.2	15	92
YOLOX-s	78.3	90.3	93	60.8	76.8	79.7	9	41.9	35.8
YOLOX-l	79.7	92.4	95.2	69.4	90.1	93.3	46.5	245	6.1
YOLOv8l	79.9	92.9	95.5	71.5	90.3	93.5	43.7	250	6
YOLOv5x	79.8	93.1	95.7	72.6	90.6	93.8	86	313	4.8
YOLOv8x	80	93.4	96	73.2	90.8	94	68.2	395	3.8
CenterNet	70.3	85	87.6	58.6	75.1	77.9	32.5	252	6
FCOS	68.6	83.3	85.9	55.5	72	74.8	33.1	256	5.9
SSD300	64	80.8	84	50.7	69.4	72.5	24	230	6.5
RetinaNet	68.5	83.5	86.3	57.6	74.4	77.3	36	282	4.8
Faster R-CNN	65.5	82	85	54.2	71.9	74.9	51.3	326.7	4.6
Cascade R-CNN	73	87.6	90.4	62.4	78.9	81.8	69	365	4.2
DETR	64.5	82.2	85.2	54.5	73.3	76.3	41	325	4.5
Deformable DETR	77.1	90.7	93.3	67	82.3	85	43	328	5
Sparse R-CNN	71	86.4	89.2	63.8	80.4	83.2	102	360	4.4
Dynamic R-CNN	67	83.5	86.5	58.6	76.3	79.3	51	330	4.5
GCNet	67.5	82.9	85.7	60.8	77.4	80.2	52	340	4.4
Libra R-CNN	69.5	84.1	86.9	61.5	78	80.9	61	345	4.3
E2YOLOX-VFL	78.4	90.2	92.8	73.9	89.1	91.9	9.5	45	33.3
YOLOv7oSAR	79.2	91	93.6	74.2	89.4	92.2	12	35	42.9
Light-YOLOv8	79.8	92.1	94.6	72.7	89.8	92.7	5.8	15	87
DVDNet (Ours)	80.1	92.1	94.1	73.8	90	92.9	45.3	249.8	6.2

Table 3. Comparison of detection methods on the HRSC2016 and SAR small ship test sets. Results are reported as mAP, precision, and recall at IoU = 0.5.

Method	HRSC2016 mAP₅₀	HRSC2016 Precision	HRSC2016 Recall	SAR Small Ship mAP₅₀	SAR Small Ship Precision	SAR Small Ship Recall
YOLOv5s	78.4	83.1	80.6	91.9	90.8	90.2
YOLOv6-n	78.1	82.7	80.2	94.8	91.8	93.5
YOLOv7-tiny	78.9	83.4	80.9	93.7	92.2	91.3
YOLOv8n	79	83.3	80.4	93	92.7	93.3
YOLOX-s	79.5	84.1	81.2	90.8	92.9	91.6
YOLOX-l	80.2	84.6	81.5	92.2	92.6	93
YOLOv8l	80.3	84.8	81.7	92.4	92.7	93.1
YOLOv5x	80.5	84.9	81.8	92.6	93.1	93.5
YOLOv8x	80.6	85	82.3	92.8	93.5	93.8
CenterNet	77.2	81.2	78.8	90.8	91.2	92.6
FCOS	78.3	82.6	80.2	90.3	93.1	92.7
SSD300	74.1	79	76.3	94.3	93.6	90.9
RetinaNet	77.8	82.3	79.6	93	90.3	92.8
Faster R-CNN	79.2	84	81.1	93.5	93.6	93.9
Cascade R-CNN	80.6	84.5	82.5	90.1	91	93.7
DETR	78.5	83	80.3	94.8	90.4	92.5
Deformable DETR	79.8	84.2	81.9	94.2	92.7	93
Sparse R-CNN	79.7	84.3	81.6	91.1	92.8	93.6
Dynamic R-CNN	79.3	83.7	81	90.9	91.9	90.4
GCNet	78.7	82.9	80.5	90.9	91.8	91
Libra R-CNN	78.8	83	80.8	91.5	90.6	90.2
E2YOLOX-VFL	73.8	82	80	92.6	93.1	91.6
YOLOv7oSAR	74.2	81.5	81.2	92.2	92.6	91.9
Light-YOLOv8	80.2	84.5	82.1	91.5	90.7	91.4
DVDNet (Ours)	80.7	85	83	95.1	93.8	94.1
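
The precision and recall columns above are reported at IoU = 0.5. As a rough illustration of what that threshold means, the following minimal sketch matches predictions to ground-truth boxes greedily by score and counts a prediction as a true positive when its IoU with an unmatched ground truth is at least 0.5. It assumes axis-aligned boxes for simplicity, whereas the paper regresses oriented boxes; the function names `iou` and `precision_recall` and the example boxes are hypothetical and are not taken from the authors' evaluation code.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), axis-aligned (assumption)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def precision_recall(
    preds: List[Tuple[Box, float]], gts: List[Box], thr: float = 0.5
) -> Tuple[float, float]:
    """Greedy matching: highest-scoring predictions first, each ground truth used once."""
    matched = set()
    tp = 0
    for box, _score in sorted(preds, key=lambda p: p[1], reverse=True):
        best_j, best_iou = -1, thr
        for j, gt in enumerate(gts):
            if j in matched:
                continue
            value = iou(box, gt)
            if value >= best_iou:
                best_j, best_iou = j, value
        if best_j >= 0:
            matched.add(best_j)
            tp += 1
    fp = len(preds) - tp  # predictions with no ground truth at IoU >= thr
    fn = len(gts) - tp    # ground truths that were never matched
    precision = tp / (tp + fp) if preds else 0.0
    recall = tp / (tp + fn) if gts else 0.0
    return precision, recall


# Hypothetical example: two ships, three predictions (one spurious).
example_gts = [(0, 0, 10, 10), (20, 20, 30, 30)]
example_preds = [((0, 0, 10, 10), 0.9), ((21, 21, 31, 31), 0.8), ((50, 50, 60, 60), 0.7)]
p, r = precision_recall(example_preds, example_gts)
```

For oriented boxes the same matching logic would apply, with the rectangle overlap replaced by a rotated-polygon intersection.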

Table 4. Ablation experiment on the HRSID and SSDD test sets. Results are reported as mAP, precision, and recall at IoU = 0.5.

Combination	FPN	LBP	GDConv	HRSID			SSDD
Combination	FPN	LBP	GDConv	mAP₅₀	Precision	Recall	mAP₅₀	Precision	Recall
None				62.2	61.4	64.5	59.6	64.3	63.8
FPN	√			75.0	62.1	65.5	72.0	65.1	64.8
LBP		√		64.3	62.9	65.7	61.7	65.9	65.0
GDConv			√	76.6	73.8	78.1	73.5	77.4	77.3
LBP + GDConv		√	√	89.5	74.9	79.0	85.9	78.6	78.1
DVDNet	√	√	√	90.9	86.2	91.7	87.2	90.4	90.7

Table 5. Ablation experiment on the HRSC2016 and SAR small-target ship test sets. Results are reported as mAP, precision, and recall at IoU = 0.5.

Combination	FPN	LBP	GDConv	HRSC2016			SAR Small-Target Ship
Combination	FPN	LBP	GDConv	mAP₅₀	Precision	Recall	mAP₅₀	Precision	Recall
None				55.2	60.5	58.4	60.5	61.0	60.0
FPN	√			66.6	61.2	59.3	70.2	71.5	70.0
LBP		√		57.1	62.0	59.5	68.7	69.2	68.0
GDConv			√	68.0	72.8	70.7	75.3	76.5	74.8
LBP + GDConv		√	√	79.5	73.9	71.5	85.5	86.0	84.7
DVDNet	√	√	√	80.7	85.0	83.0	95.1	93.8	94.1

Table 6. Effects of inserting GDConv at different backbone stages on the HRSID and SSDD datasets.

Method	GDConv	HRSID				SSDD				Params (M)	FLOPs (G)
Method	GDConv	mAP₅₀	$AP_{S}$	$AP_{M}$	$AP_{L}$	mAP₅₀	$AP_{S}$	$AP_{M}$	$AP_{L}$	Params (M)	FLOPs (G)
Baseline + LBP + Vec	All ordinary 3 × 3	88.5	76.8	90.2	93.1	84.7	70.1	87.9	91.2	41.3	236.7
GD-C5	only conv5_x	88.8	77	90.4	93.2	85	70.5	88.1	91.4	41.5	237.1
GD-C4	only conv4_x	89.2	77.8	90.8	93.4	85.5	71.3	88.6	91.7	41.9	241.6
GD-C3	only conv3_x	89.4	78.2	91	93.5	85.6	71.6	88.7	91.8	43	243.3
GD-C3 + C4 (Ours)	conv3_x and conv4_x	90.9	80.1	92.1	94.1	87.2	73.8	90	92.9	45.3	249.8
GD-C2 + C3 + C4	conv2_x, conv3_x and conv4_x	91	80.3	92.2	94.1	87.3	73.9	90.1	93	48.1	251.9
GD-C3 + C4 + C5	conv3_x, conv4_x and conv5_x	90.9	80	92.1	94.2	87.2	73.7	90	93	45.6	241.2

Table 7. Effects of the number of subgroups G for performance-efficiency tradeoffs on the HRSID and SSDD datasets.

G	HRSID				SSDD				Params (M)	FLOPs (G)
G	mAP₅₀	$AP_{S}$	$AP_{M}$	$AP_{L}$	mAP₅₀	$AP_{S}$	$AP_{M}$	$AP_{L}$	Params (M)	FLOPs (G)
1	90.1	79.4	91.3	93.8	86.7	72.8	89.4	92.5	56.4	324
2	90.6	79.8	91.7	94	87	73.3	89.7	92.7	47.8	266.3
4	90.9	80.1	92.1	94.1	87.2	73.8	90	92.9	45.3	249.8
8	90.5	79.7	91.9	94	86.9	73.5	89.8	92.8	45.1	248.6
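
The drop in parameters as G grows in Table 7 is consistent with how grouped convolutions scale in general: splitting the channels into G groups divides the weight count of that layer by G. The sketch below shows this for a single 3 × 3 layer. The channel width of 256 and the function name `grouped_conv_weight_params` are assumptions chosen for illustration, and the offset-prediction branches of GDConv are not modeled, so the numbers are not expected to reproduce the totals in the table.

```python
def grouped_conv_weight_params(c_in: int, c_out: int, k: int, groups: int) -> int:
    """Weight count (no bias) of a k x k convolution split into `groups` groups."""
    if c_in % groups or c_out % groups:
        raise ValueError("channel counts must be divisible by the number of groups")
    # Each output channel only sees c_in / groups input channels.
    return (c_in // groups) * c_out * k * k


# Hypothetical layer width of 256 channels, for illustration only.
for g in (1, 2, 4, 8):
    print(g, grouped_conv_weight_params(256, 256, 3, g))
# 1 589824
# 2 294912
# 4 147456
# 8 73728
```

Because only the grouped layers shrink while the rest of the network is unchanged, the whole-model totals in Table 7 fall far more slowly than 1/G (56.4 M at G = 1 versus 45.1 M at G = 8).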

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
