This study utilizes the Ultralytics framework, with the model architecture shown in Figure 4. The backbone, based on YOLOv10, extracts features, which are then processed by the Neck module and passed to the detection head for localization and classification. The RTDETR detection head (DHead) handles dense scenes with small objects, leveraging its robust global relationship modeling across multiple instances. The HEConv module is integrated into both the HSPP and GradDynFPN modules: HSPP optimizes the pooling process, enhancing feature generalization, while GradDynFPN manages multi-scale features, enabling cross-scale interactions. The detection task is completed through iterative updates in the decoder.
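To make the data flow concrete, the following is a minimal sketch of how the components described above could be wired together; `Backbone`, `Neck`, and `Head` here are hypothetical placeholders standing in for the actual Ultralytics modules, not the authors' code.

```python
import torch
import torch.nn as nn

class HawkEyeDetectorSketch(nn.Module):
    # Hypothetical wiring of the pipeline described above: a YOLOv10-style
    # backbone, a GradDynFPN neck, and an RTDETR-style decoding head.
    def __init__(self, backbone: nn.Module, neck: nn.Module, head: nn.Module):
        super().__init__()
        self.backbone = backbone  # multi-scale feature extraction
        self.neck = neck          # cross-scale interaction (GradDynFPN)
        self.head = head          # query-based localization + classification

    def forward(self, images: torch.Tensor):
        feats = self.backbone(images)  # e.g., a list of multi-scale maps
        feats = self.neck(feats)       # fused multi-scale features
        return self.head(feats)        # boxes and class scores
```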
3.2.1. HawkEye Conv
The principle of conventional convolution is illustrated in Equation (5), which derives the output by flipping the weights of the sampling points in the convolution kernel and summing the results:

$$O(i, j) = \sum_{m}\sum_{n} W(m, n)\, I(i + m,\, j + n) \tag{5}$$

where $O(i, j)$ represents the value of the output feature map at position $(i, j)$, $W(m, n)$ is the weight matrix of the convolution kernel, and $I(i + m, j + n)$ denotes the corresponding pixel values in the input feature map. The size of the convolution kernel is denoted as $k$. Its receptive field is described in Equation (6), where $S_i$ is the stride of the $i$-th layer and $R_l$ is the receptive field at the $l$-th layer of convolution:

$$R_l = R_{l-1} + (k - 1)\prod_{i=1}^{l-1} S_i \tag{6}$$
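As a quick check of Equation (6), the recursion can be evaluated numerically; the sketch below assumes a simple chain of convolutions given per-layer kernel sizes and strides.

```python
def receptive_field(kernel_sizes, strides):
    """Evaluate Equation (6): R_l = R_{l-1} + (k_l - 1) * prod(S_1..S_{l-1})."""
    r, jump = 1, 1  # receptive field and cumulative stride product
    for k, s in zip(kernel_sizes, strides):
        r += (k - 1) * jump
        jump *= s
    return r

# Three stacked 3x3 convolutions with stride 2 each: R = 1 + 2 + 4 + 8 = 15
print(receptive_field([3, 3, 3], [2, 2, 2]))  # -> 15
```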
As illustrated in Figure 5, the receptive field of standard convolution is inherently limited and fixed. This constraint hampers the complete learning of target features with a finite number of sampling points. Furthermore, the rigidity of the learning region inevitably introduces interference from the background and other targets during information collection, adversely affecting the model's learning capability. Moreover, existing convolutional variants are rarely designed specifically for small targets. While these variants may account for the position and shape of objects to some extent, their divergent sampling points can introduce significant noise during critical feature extraction. This issue is exacerbated in dense scenes with occluded small targets, where current dynamic convolutions often perform poorly.
HawkEye Conv (HEConv), shown in Figure 6, is designed to address the aforementioned challenges. The upper stable sampling region utilizes predefined fixed sampling points arranged in various shapes, with parameters ensuring stability during local feature extraction. In contrast, the lower region involves dynamic offsets and random selection, derived from standard convolution sampling points that fall outside the predefined shapes. Initially, dynamic points are randomly generated, after which a dynamic offset network adjusts these points by creating adaptive offsets, enabling flexible sampling across the feature map. By combining adaptive dynamic points with fixed points from non-feature areas, the model adapts to target shapes, improving sensitivity and robustness to input variations.
Unlike traditional channel-grouping convolution strategies, which operate with separate convolutional operators for each group and reintegrate information by channel, our approach directly segments the convolution kernel from the fundamental sampling points. The Diamond and X versions serve as the primary convolution modules, while their mixed version forms the channel-grouped convolution. The effectiveness of this design is validated through detailed experimental analysis and comparisons presented in the subsequent section.
- A. Stable Sampling Area
In the stable sampling region, we aim to maximize the convolution kernel's ability to extract information from small targets. We modify the standard convolution kernel shape to a special-shape convolution (Diamond- or X-shaped) composed of k sampling points with stable sampling relationships. These shapes are utilized to extract information from fixed positions. The Diamond convolution is applicable to symmetric shapes with concentrated features, such as vehicles and buildings, while the X-shaped convolution is suited to targets like plants and pedestrians that exhibit significant aspect ratios. The corresponding sampling point operation is defined as follows in Equation (7):

$$O(i, j) = \sum_{(m, n)\in P_{\text{shape}}} W(m, n)\, I(i + m,\, j + n) \tag{7}$$

where $P_{\text{shape}}$ denotes the set of $k$ fixed sampling offsets determined by the selected shape (Diamond or X).
As shown in Equation (8), the parameter shape governs the convolution sampling process and is contingent upon the aspect ratio α: when shape = 0, a Diamond-shaped configuration is employed for fixed-region sampling, applicable in scenarios where 0.5 < α < 2; when shape = 1, an X-shaped configuration is utilized for fixed sampling:

$$\text{shape} = \begin{cases} 0 \ (\text{Diamond}), & 0.5 < \alpha < 2 \\ 1 \ (\text{X}), & \text{otherwise} \end{cases} \tag{8}$$

Furthermore, acknowledging contemporary innovations that leverage channel-grouping techniques (distinct convolution units for feature processing, followed by attention modules for weighted fusion), and considering that both shapes may be necessary in complex and diverse scenes, we design three modes: (a) X-shaped convolution, (b) Diamond-shaped convolution, and (c) a dual-branch structure that processes both shapes concurrently, culminating in a mixed convolution structure with lightweight attention-weighted fusion. A minimal sketch of the shape-selection rule follows.
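The sketch below implements Equation (8) together with one plausible set of 3 × 3-grid offsets for the two fixed shapes; the exact coordinates are our assumption (four points per shape, consistent with the half fixed-shape ratio stated later), not the authors' published definition.

```python
def select_shape(alpha: float) -> int:
    # Equation (8): Diamond (0) for near-square boxes, X (1) otherwise
    return 0 if 0.5 < alpha < 2 else 1

# Assumed fixed sampling offsets on the 3x3 grid, center at (0, 0):
DIAMOND_POINTS = [(-1, 0), (0, -1), (0, 1), (1, 0)]  # edge midpoints
X_POINTS = [(-1, -1), (-1, 1), (1, -1), (1, 1)]      # corners

boxes = {"vehicle": 1.3, "pedestrian": 3.0}          # aspect ratios (illustrative)
for name, alpha in boxes.items():
    shape = select_shape(alpha)
    points = DIAMOND_POINTS if shape == 0 else X_POINTS
    print(name, "->", "Diamond" if shape == 0 else "X", points)
```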
- B. Dynamic Offset and Random Selection Area
The dynamic sampling adjustment section is designed to accommodate the shapes of small targets while restricting excessive flexibility in dynamic offsets. Initially, the convolution kernel is allowed to move sampling points along the edges of the image, creating a combination of original and offset sampling points, which together form the range for selecting dynamic sampling points, termed the dynamic sampling point repository. It is crucial to emphasize that in the dynamic sampling section, the extent of point offsets varies according to the complexity of the convolution and the usage scenario. Random offsets occur within the individual shapes (X-shaped or Diamond), while a convolutional network is added in the mixed shape to facilitate learnable dynamic offsets. Here, $R$ represents the sampling points of the standard convolution minus the points in the stable sampling area (A), $Q$ denotes the new sampling points generated by dynamic offsets, and $D = R \cup Q$ forms the dynamic sampling point repository. The dynamic random extraction method involves selecting $N$ sampling locations from $D$, computed as follows in Equation (9):

$$P_{\text{dyn}} = \operatorname{Sample}(D,\, N), \qquad D = R \cup Q \tag{9}$$
The computation of dynamic offset points is illustrated in Equation (10):

$$Q = \{\, p + \Delta p \mid p \in R \,\} \tag{10}$$

where $\Delta p$ is the displacement applied to each candidate point, drawn randomly within the shape-specific range or predicted by the offset network in the mixed configuration.
This extraction method effectively balances the computational load by randomly selecting positions within a specified range, enabling dynamic sampling points while also allowing dynamic extraction from certain fixed sampling points. These points are then combined with those from Group A to create new convolution units. Consequently, the resulting convolution exhibits the characteristics of 1/4 dynamic points, 1/4 random fixed points, and 1/2 special-shape fixed points. It is essential to clarify that in the final sampling configuration, the Group A fixed points must always be present and remain fixed; the Group B points in R may not always be selected but are fixed in position; whereas the points in Q are neither guaranteed to be selected nor fixed. This approach not only enhances the model's adaptability but also improves its robustness in various scenarios.
The final positions and number of sampling points are as indicated in Equation (11):

$$P_{\text{final}} = P_{\text{shape}} \cup P_{\text{dyn}}, \qquad |P_{\text{final}}| = |P_{\text{shape}}| + N \tag{11}$$
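The following sketch walks through Equations (9)–(11) for a 3 × 3 kernel; the specific offset range and the choice of four extracted points are assumptions drawn from the ratios stated above.

```python
import random

def build_final_points(shape_points, n_select=4, max_shift=1):
    # Standard 3x3 kernel offsets, center at (0, 0)
    full_grid = [(m, n) for m in (-1, 0, 1) for n in (-1, 0, 1)]
    # R: standard taps outside the fixed shape (Group B)
    R = [p for p in full_grid if p not in shape_points]
    # Q: candidates displaced by (here random) dynamic offsets, Eq. (10)
    Q = [(m + random.randint(-max_shift, max_shift),
          n + random.randint(-max_shift, max_shift)) for (m, n) in R]
    D = R + Q                             # dynamic sampling point repository
    dynamic = random.sample(D, n_select)  # random extraction, Eq. (9)
    return shape_points + dynamic         # final point set, Eq. (11)

DIAMOND_POINTS = [(-1, 0), (0, -1), (0, 1), (1, 0)]  # assumed Diamond taps
print(build_final_points(DIAMOND_POINTS))  # 4 fixed + 4 dynamically chosen taps
```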
The receptive field of HEConv is significantly expanded compared to standard convolution, as specified in Equation (12):

$$R_l = R_{l-1} + (k - 1 + \Delta)\prod_{i=1}^{l-1} S_i \tag{12}$$

where $\Delta$ denotes the convolution kernel's displacement, which is learned and subject to dynamic variations.
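Extending the earlier helper, Equation (12) can be evaluated the same way; the additive form with displacement Δ follows our reconstruction above.

```python
def heconv_receptive_field(kernel_sizes, strides, delta):
    # Equation (12): each layer's effective kernel extent grows by the
    # learned displacement range delta (per our reconstruction above)
    r, jump = 1, 1
    for k, s in zip(kernel_sizes, strides):
        r += (k - 1 + delta) * jump
        jump *= s
    return r

# With delta = 2, three stride-2 3x3 layers: R = 1 + 4 + 8 + 16 = 29
print(heconv_receptive_field([3, 3, 3], [2, 2, 2], delta=2))  # -> 29
```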
The HawkEye Conv implementation is provided for clarity in Algorithm 1. It demonstrates the dual-branch hybrid convolution, where a single branch corresponds to the single-version convolution. The algorithm outlines how the convolution operation integrates fixed sampling points with dynamically adjusted offset points, detailing the process of generating these points. By using SimAM for the weighted fusion of the fixed and dynamic components, this approach enhances the model's ability to capture small-target shapes, improving detection performance.
Algorithm 1 Pseudocode for the HEConv algorithm.

Algorithm 1: HawkEye Conv
Input: input feature map x
Output: fused output feature map y
  Initialize deformable convolution layer and offset prediction network
  Define fixed sampling points (Diamond or X shape based on sampling)
  Define B group points from the 3 × 3 grid
  Step 1: Calculate Dynamic Offsets
    dynamic_offsets ← Offset_Prediction_Network(x)
  Step 2: Generate Dynamic Sampling Points
    Adjust B group points with dynamic_offsets
    Randomly sample 4 points from the dynamic repository
  Step 3: Apply Deformable Convolution
    deform_output ← Deformable_Convolution(x, dynamic_points)
  Step 4: Apply Fixed Sampling Convolution
    fixed_output ← Convolution using fixed sampling points
  Step 5: Weighted Fusion with SimAM
    attention_weights ← SimAM(fixed_output, deform_output)
    y ← fixed_output × attention_weights + deform_output × (1 − attention_weights)
  Return y
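For concreteness, below is a minimal PyTorch sketch of Algorithm 1, not the authors' implementation: it pairs a deformable branch (via torchvision's deform_conv2d) with a fixed Diamond-masked branch and gates the fusion with a parameter-free SimAM-style weight. The Diamond tap positions and the derivation of the gate from the fixed branch alone are our assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.ops import deform_conv2d

def simam_gate(x, lam=1e-4):
    # Parameter-free SimAM energy squashed to a (0, 1) fusion gate
    n = x.shape[2] * x.shape[3] - 1
    d = (x - x.mean(dim=(2, 3), keepdim=True)).pow(2)
    v = d.sum(dim=(2, 3), keepdim=True) / n
    return torch.sigmoid(d / (4 * (v + lam)) + 0.5)

class HEConvSketch(nn.Module):
    """Dual-branch HEConv sketch: fixed Diamond taps + deformable taps."""
    def __init__(self, c_in, c_out, k=3):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(c_out, c_in, k, k))
        nn.init.kaiming_normal_(self.weight)
        # Offset prediction network: 2 values (dy, dx) per kernel tap
        self.offset_net = nn.Conv2d(c_in, 2 * k * k, k, padding=k // 2)
        # Assumed Diamond mask: keep only the four edge-midpoint taps
        mask = torch.zeros(1, 1, k, k)
        for r, c in [(0, 1), (1, 0), (1, 2), (2, 1)]:
            mask[0, 0, r, c] = 1.0
        self.register_buffer("diamond", mask)

    def forward(self, x):
        # Steps 1-3: dynamic branch with predicted offsets
        offsets = self.offset_net(x)
        deform_out = deform_conv2d(x, offsets, self.weight, padding=1)
        # Step 4: fixed branch restricted to the Diamond sampling points
        fixed_out = F.conv2d(x, self.weight * self.diamond, padding=1)
        # Step 5: SimAM-gated weighted fusion (gate from the fixed branch)
        g = simam_gate(fixed_out)
        return g * fixed_out + (1 - g) * deform_out

y = HEConvSketch(16, 16)(torch.randn(1, 16, 32, 32))
print(y.shape)  # torch.Size([1, 16, 32, 32])
```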
3.2.3. Gradual Dynamic Feature Pyramid Network
FPN has demonstrated exceptional performance in various small object detection tasks, primarily due to its effective feature interaction mechanism, which addresses the challenges of multi-scale object detection. The Gradual Dynamic Feature Pyramid Network (GradDynFPN) significantly enhances detection capabilities by incorporating richer scale information than traditional feature pyramid networks (FPNs). Classic models such as BiFPN [36], AFPN [39], and ContextGuideFPN [41] have continuously optimized the horizontal interactions among features at the same scale and the vertical interaction paths across different scales.
Through an in-depth analysis of the principles underlying the existing mainstream FPNs, we found that introducing more scale information can effectively enhance detection performance. However, while feature interactions across non-adjacent scales can broaden the information fusion across a wider field of view, they may also introduce blurring interference issues after scale transformations.
In the process of feature fusion, we do not directly apply a single attention mechanism. Our analysis indicates that when the features from the upper layer are sampled and resized to match the size of the lower layer, they carry a large amount of semantic information that can enrich the original lower-layer features, but the sampling operation itself blurs the feature information. Therefore, we rely on the accurate spatial modeling capability of the original low-level features to guide the fused features toward targeted learning in precise regions. Similarly, when the features from the lower layer are sampled and resized to match the upper layer's feature map size, their spatial information can effectively address the challenge of recognizing small targets at low resolution; however, the lack of semantic information in the lower layers can interfere with the upper-layer features. As a result, we rely on the semantic relationship capture ability of the high-level features to guide the fused features in strengthening their understanding of the relationships between targets.
As illustrated in Figure 8a, our proposed Gradual Dynamic Feature Pyramid Network (GradDynFPN) emphasizes the importance of interactions between adjacent scales for information fusion in the middle layer. Therefore, we employ interaction operations between adjacent features to fully leverage the complementary information of deeper and shallower features, thereby enhancing the richness and precision of feature representation.
According to the structural details in Figure 8b, during the three-layer feature interaction process, we first design the sampling operations for the different layers with careful consideration. The three layers of features are processed as follows: lightweight upsampling is performed using CARAFE [46], channel transformation is achieved through a 1 × 1 convolution, and downsampling employs our designed HEConv. Subsequently, we utilize the features from the upper and lower layers along with the middle layer to complete the initial step of sampling fusion. During the fusion process, the middle layer extracts spatial and channel weights, applying spatial attention for upward interactions and channel attention for downward interactions. This guides the fused features through matrix multiplication, effectively addressing the blurring issues encountered when aligning features of different scales. Finally, the mixed features from the upper and lower layers and the middle layer are fused using adaptive weights, completing the final concatenation.
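The following PyTorch sketch illustrates the three-layer fusion just described, under simplifying assumptions: bilinear interpolation stands in for CARAFE, a strided 3 × 3 convolution stands in for the HEConv downsampling, and the attention modules are generic single-layer gates rather than the authors' exact designs.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GradDynFusionSketch(nn.Module):
    # Adjacent-scale fusion guided by the middle layer, all at channel width c.
    def __init__(self, c):
        super().__init__()
        self.align = nn.Conv2d(c, c, 1)        # 1x1 channel transformation
        self.down = nn.Conv2d(c, c, 3, 2, 1)   # stand-in for HEConv downsampling
        self.channel_att = nn.Sequential(      # channel weights from the middle layer
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(c, c, 1), nn.Sigmoid())
        self.spatial_att = nn.Sequential(      # spatial weights from the middle layer
            nn.Conv2d(c, 1, 7, padding=3), nn.Sigmoid())
        self.fusion_w = nn.Parameter(torch.ones(3))  # adaptive fusion weights

    def forward(self, upper, mid, lower):
        # upper: deeper, low-resolution map; lower: shallower, high-resolution map
        up = F.interpolate(self.align(upper), size=mid.shape[2:], mode="bilinear")
        dn = self.down(lower)                  # resize lower layer to the mid scale
        up = up * self.spatial_att(mid)        # spatial guidance for the semantic path
        dn = dn * self.channel_att(mid)        # channel guidance for the spatial path
        w = torch.softmax(self.fusion_w, dim=0)
        return w[0] * up + w[1] * mid + w[2] * dn

f = GradDynFusionSketch(32)
out = f(torch.randn(1, 32, 8, 8), torch.randn(1, 32, 16, 16), torch.randn(1, 32, 32, 32))
print(out.shape)  # torch.Size([1, 32, 16, 16])
```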