1. Introduction
Object detection [1] is a fundamental task in computer vision, widely applied in areas such as autonomous driving, surveillance, agriculture, and medical analysis. With the development of deep learning, object detection methods [2,3] have achieved remarkable progress, driven by increasingly sophisticated feature extraction techniques and efficient detector designs. Traditional convolutional neural networks (CNNs) have been the cornerstone of feature extraction, enabling the modeling of local patterns. However, standard convolutions are limited by their fixed receptive fields, which constrain their ability to adapt to objects of varying scales, shapes, and contextual dependencies.
To address these challenges, advanced modules such as Deformable Convolutional Networks (DCNs) [4,5] and dilated convolutions have been introduced. DCNs allow the dynamic adjustment of sampling positions to better accommodate geometric variations, while dilated convolutions expand receptive fields without increasing the number of parameters, facilitating multi-scale context capture. Additionally, attention mechanisms, including self-attention modules [6], have been widely adopted to model long-range dependencies and enhance feature representation.
Recently, Transformer-based [6] object detectors have been proposed to overcome the limitations of fixed receptive fields and restricted context modeling in CNN-based methods. For example, Swin Transformer [7] employs a shifted window attention mechanism that enables hierarchical and efficient multi-scale feature learning, while ViTDet [8] leverages Vision Transformers to improve dense prediction tasks through rich global context modeling. In addition, Deformable DETR [9] introduces deformable attention modules that dynamically attend to a sparse set of key locations, effectively capturing geometric variations and contextual cues with reduced computational complexity compared to vanilla DETR [10]. These approaches demonstrate strong performance by combining attention mechanisms with adaptive receptive field learning. However, despite their effectiveness, such Transformer-based models often come with higher computational and memory requirements, posing challenges for real-time deployment. Moreover, their attention mechanisms typically operate over fixed or pre-defined windows or query points, limiting the granularity of spatial deformation and scale adaptation in certain scenarios. In contrast, our proposed D2FM provides a lightweight yet highly adaptive alternative. By decoupling the prediction of dilation coefficients and spatial offsets, D2FM enables fine-grained, content-aware receptive field adaptation that directly operates on convolutional grids.
In Figure 1a, the standard DCN [4] directly learns spatial offsets for each position based on the original 3 × 3 convolution grid, enabling the adaptive sampling of the input features to handle geometric variations. However, the receptive field (the region of the input image that influences a single pixel in the neural network) is still limited by the size of the original grid, and the flexibility to capture multi-scale context is also restricted. In contrast, Figure 1b illustrates our proposed two-stage deformable process. First, a 1 × 1 convolution is used to predict an expansion factor for each position. Based on this expansion factor, the 3 × 3 sampling locations are dynamically expanded or contracted, allowing adaptive control over the receptive field size. Then, the features at these expanded positions are used to predict the final spatial offsets. This separation of expansion prediction and offset prediction enables the network to first determine an appropriate scale for feature interaction and then fine-tune the sampling positions according to local geometric relationships. This design enhances the model's ability to handle objects of varying sizes and complex shapes, and provides more precise control over multi-scale feature aggregation compared to the standard DCN.
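To make this two-stage idea concrete, the snippet below sketches only the first stage, the expansion of the 3 × 3 sampling grid by a per-position factor; the function name expanded_grid and the tensor layout are illustrative assumptions rather than details of our implementation.

```python
import torch

def expanded_grid(alpha: torch.Tensor) -> torch.Tensor:
    """Scale the base 3x3 sampling grid by a per-position expansion factor.

    alpha: (B, 1, H, W) expansion factor predicted by a 1x1 convolution.
    Returns per-tap (dy, dx) displacements of shape (B, 9, 2, H, W).
    """
    # Base 3x3 grid of a standard convolution: offsets in {-1, 0, 1} x {-1, 0, 1}.
    ys, xs = torch.meshgrid(torch.arange(-1, 2), torch.arange(-1, 2), indexing="ij")
    base = torch.stack([ys, xs], dim=-1).reshape(9, 2).to(alpha.dtype)  # (9, 2)
    # Each position scales its own copy of the grid; alpha > 1 dilates it, alpha < 1 contracts it.
    return alpha[:, :, None, :, :] * base[None, :, :, None, None]      # (B, 9, 2, H, W)
```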
In this work, we propose the Deformable and Dilated Feature Fusion Module (D2FM) to further enhance the adaptability and flexibility of feature extraction in object detection tasks. D2FM dynamically predicts both dilation coefficients and spatial offsets, enabling the model to better capture multi-scale and context-dependent patterns. Moreover, a self-attention mechanism is introduced to effectively fuse geometry-aware and enhanced local features. To integrate D2FM into modern object detectors with minimal overhead, we designed the D2FM-HierarchyEncoder, which employs hierarchical channel reduction and depth-dependent stacking of D2FM blocks, balancing the trade-off between expressive power and computational cost.
We embedded the proposed modules into the YOLOv11 [11] detection framework, resulting in a new model named D2YOLOv11. Extensive experiments on the COCO2017 [12], Pest24 [13], and Urised11 [14] datasets demonstrate that D2YOLOv11 consistently outperforms the baseline methods, achieving superior detection accuracy across different object scales and categories. Our results validate the effectiveness and generalization ability of the proposed D2FM and D2FM-HierarchyEncoder, offering a promising direction for future object detection system designs.
The main contributions of this paper are as follows:
1. We propose the Deformable and Dilated Feature Fusion Module (D2FM), which splits the deformable convolution into separate dilation and offset prediction processes, enhancing the feature extraction ability for multi-scale deformable targets.
2. We design the D2FM-HierarchyEncoder, which adopts a grouping strategy that reduces the number of channels exponentially across groups and fuses the high-dimensional features, improving computational efficiency.
3. We apply the D2FM-HierarchyEncoder to the detection head of YOLOv11; detection results on multiple datasets demonstrate the effectiveness of the proposed method.
3. Approaches
3.1. Revisiting DCN
Regular convolutions, as well as related operations such as RoI pooling, have significant limitations when modeling irregular objects because of their fixed sampling geometry, making it difficult to adapt to geometric deformations. The DCN addresses this issue by replacing the fixed sampling grid of conventional convolution modules in existing CNNs with learnable, input-dependent sampling locations.
Assume $x \in \mathbb{R}^{c \times h \times w}$ is a feature map, where $p_0$ denotes the spatial location of a pixel. Let $w$ be a convolution kernel with kernel size $3 \times 3$, whose regular sampling grid is $\mathcal{R} = \{(-1,-1), (-1,0), \ldots, (1,1)\}$. Then, the convolution results $y(p_0)$ between $x$ and $w$ can be computed using Equation (1):

$$y(p_0) = \sum_{p_k \in \mathcal{R}} w(p_k) \cdot x(p_0 + p_k) \tag{1}$$

Building upon this, the DCN adaptively offsets the convolutional sampling locations in $\mathcal{R}$ to handle object irregularity. We denote the predicted offset and modulation scalar as $\Delta p_k$ and $\Delta m_k$, respectively, both obtained through convolutional operations over $x$. The term $p_0 + p_k + \Delta p_k$ represents the adjusted sampling location in the feature map. The final output $y(p_0)$ of the DCN is then computed via Equation (2):

$$y(p_0) = \sum_{p_k \in \mathcal{R}} \Delta m_k \cdot w(p_k) \cdot x(p_0 + p_k + \Delta p_k) \tag{2}$$
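For reference, the modulated deformable convolution of Equation (2) is exposed by torchvision as deform_conv2d; the sketch below shows one plausible wiring in which the offsets and modulation scalars are predicted from x by ordinary convolutions, with the class name and initialization chosen by us for illustration.

```python
import torch
import torch.nn as nn
from torchvision.ops import deform_conv2d

class DeformConvBlock(nn.Module):
    """Modulated deformable 3x3 convolution (DCNv2-style), for reference."""

    def __init__(self, c_in: int, c_out: int, k: int = 3):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(c_out, c_in, k, k))
        nn.init.kaiming_uniform_(self.weight, a=5 ** 0.5)
        # Offsets (2 per sampling point) and modulation scalars (1 per point)
        # are predicted from the input feature map itself.
        self.offset_conv = nn.Conv2d(c_in, 2 * k * k, k, padding=k // 2)
        self.mask_conv = nn.Conv2d(c_in, k * k, k, padding=k // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        offset = self.offset_conv(x)              # Δp_k in Equation (2)
        mask = torch.sigmoid(self.mask_conv(x))   # Δm_k in Equation (2), in (0, 1)
        return deform_conv2d(x, offset, self.weight, padding=1, mask=mask)

# Example: y = DeformConvBlock(64, 64)(torch.randn(1, 64, 32, 32))
```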
3.2. Revisiting the Deformable and Dilated Feature Fusion Module (D2FM)
Figure 2 shows the pipeline of D2FM. We define one 3 × 3 convolution $W_{\mathrm{out}}$, where $W_{\mathrm{out}} \in \mathbb{R}^{c \times c \times 3 \times 3}$, along with three 1 × 1 convolutions $W_{\mathrm{dil}}$, $W_{\mathrm{ref}}$, and $W_{\mathrm{off}}$. $W_{\mathrm{dil}}$ is used to generate dynamic dilation coefficients for each feature point. $W_{\mathrm{ref}}$ is employed to refine features within the dilated regions. Since it uses a 1 × 1 convolution kernel, it introduces minimal computational overhead. $W_{\mathrm{off}}$ is responsible for predicting the offset within the dilated region, and the features sampled from these offset positions are fused with the output of $W_{\mathrm{ref}}$ via a self-attention mechanism. Finally, the 3 × 3 convolution $W_{\mathrm{out}}$ is applied to produce the final output.
Given a feature map $x \in \mathbb{R}^{c \times h \times w}$, we aim to extract features from its position $p_0$. First, we predict the dilation coefficient $\alpha(p_0)$ for $p_0$ using $W_{\mathrm{dil}}$ (input channel is $c$, output channel is 1), as in Equation (3):

$$\alpha(p_0) = W_{\mathrm{dil}} \ast x(p_0) \tag{3}$$

Based on $\alpha(p_0)$, we obtain the 3 × 3 convolutional processing positions $P(p_0)$ by Equation (4):

$$P(p_0) = \{\, p_0 + \alpha(p_0) \cdot p_k \mid p_k \in \mathcal{R} \,\} \tag{4}$$

Using $P(p_0)$, we index the features $x_{\mathrm{dil}}$ from $x$. Next, we apply convolution with $W_{\mathrm{off}}$ to $x_{\mathrm{dil}}$ to compute the offset $\Delta p_k$ for each feature interaction region by Equation (5):

$$\Delta p_k = W_{\mathrm{off}} \ast x_{\mathrm{dil}}(p_k), \quad p_k \in \mathcal{R} \tag{5}$$

This offset is then used to index the corresponding features $x_{\mathrm{off}}$ from $x$. Subsequently, we refine the features of $x_{\mathrm{dil}}$ using $W_{\mathrm{ref}}$, obtaining the enhanced features $x_{\mathrm{ref}}$ by Equation (6):

$$x_{\mathrm{ref}} = W_{\mathrm{ref}} \ast x_{\mathrm{dil}} \tag{6}$$

To enable the interaction between the features $x_{\mathrm{off}}$ sampled at the offset positions and the enhanced features $x_{\mathrm{ref}}$ produced by $W_{\mathrm{ref}}$ for richer semantic information extraction, we process them using a self-attention mechanism, ultimately generating the updated features $F(p_0)$ through Equation (7):

$$F(p_0) = \operatorname{softmax}\big(x_{\mathrm{off}} \otimes x_{\mathrm{ref}}^{\top}\big) \otimes x_{\mathrm{ref}} \tag{7}$$

where "$\otimes$" represents matrix multiplication. The final feature output $y(p_0)$ at position $p_0$ is obtained by convolving $F(p_0)$ with $W_{\mathrm{out}}$ by Equation (8):

$$y(p_0) = W_{\mathrm{out}} \ast F(p_0) \tag{8}$$
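The following simplified PyTorch-style sketch walks through Equations (3)–(8) for intuition. Several details are assumptions of ours rather than the exact implementation: bilinear indexing via grid_sample, averaging the nine dilated taps before the 1 × 1 convolutions of Equations (5) and (6), and a scaled dot-product form for the self-attention of Equation (7).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class D2FMSketch(nn.Module):
    """Simplified reading of D2FM (Equations (3)-(8)); tensor layouts and the
    attention form are assumptions, not the authors' implementation."""

    def __init__(self, c: int):
        super().__init__()
        self.w_dil = nn.Conv2d(c, 1, 1)             # Eq. (3): dilation coefficient per position
        self.w_off = nn.Conv2d(c, 2 * 9, 1)         # Eq. (5): (dy, dx) offsets for the 9 taps
        self.w_ref = nn.Conv2d(c, c, 1)             # Eq. (6): refine dilated-region features
        self.w_out = nn.Conv2d(c, c, 3, padding=1)  # Eq. (8): final 3x3 convolution
        ys, xs = torch.meshgrid(torch.arange(-1, 2), torch.arange(-1, 2), indexing="ij")
        self.register_buffer("base", torch.stack([ys, xs], -1).reshape(9, 2).float())

    def _sample(self, x, pos):
        """Bilinearly sample x at absolute (y, x) positions pos of shape (B, 9, 2, H, W)."""
        b, c, h, w = x.shape
        gy = pos[:, :, 0] / max(h - 1, 1) * 2 - 1          # normalize to [-1, 1]
        gx = pos[:, :, 1] / max(w - 1, 1) * 2 - 1
        grid = torch.stack([gx, gy], -1).reshape(b, 9 * h, w, 2)
        out = F.grid_sample(x, grid, align_corners=True)   # (B, C, 9*H, W)
        return out.reshape(b, c, 9, h, w)

    def forward(self, x):
        b, c, h, w = x.shape
        p0 = torch.stack(torch.meshgrid(
            torch.arange(h, device=x.device), torch.arange(w, device=x.device),
            indexing="ij"), 0).float()                                   # (2, H, W): positions p_0
        alpha = self.w_dil(x)                                            # Eq. (3)
        pos = p0[None, None] + alpha[:, :, None] * self.base[None, :, :, None, None]  # Eq. (4)
        x_dil = self._sample(x, pos)                                     # features on dilated grid
        # Eq. (5): offsets from the dilated features (taps averaged here for simplicity).
        delta = self.w_off(x_dil.mean(dim=2)).reshape(b, 9, 2, h, w)
        x_off = self._sample(x, pos + delta)                             # offset-position features
        x_ref = self.w_ref(x_dil.mean(dim=2))                            # Eq. (6), one map per position
        # Eq. (7): attention of the 9 offset taps (queries) over the refined features.
        q = x_off.permute(0, 3, 4, 2, 1).reshape(b * h * w, 9, c)
        kv = x_ref.permute(0, 2, 3, 1).reshape(b * h * w, 1, c)
        attn = torch.softmax(q @ kv.transpose(1, 2) / c ** 0.5, dim=1)   # (BHW, 9, 1)
        fused = (attn * q).sum(dim=1).reshape(b, h, w, c).permute(0, 3, 1, 2)
        return self.w_out(fused)                                         # Eq. (8)
```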
The core motivation behind D2FM is to enhance the adaptability and flexibility of convolutional operations by dynamically adjusting both the sampling positions and interaction patterns. Traditional convolutional networks use fixed grids to extract local features, which limits their ability to capture geometric variations or context-dependent patterns. In contrast, D2FM first learns a dilation coefficient to determine the receptive field dynamically, allowing the network to adapt to content-specific scales. Additionally, by introducing a learnable offset for each interaction region, the network can focus on more informative or semantically aligned areas.
Furthermore, the incorporation of a self-attention mechanism between the geometry-aware sampled features and the enhanced local context enables a refined fusion of spatially dynamic and content-adaptive information. By unifying dilation prediction, spatial offset learning, and self-attention into a coherent framework, D2FM offers a more expressive and adaptive mechanism for visual feature extraction.
3.3. Architecture Design of D2FM-HierarchyEncoder
We aim to enhance the feature extraction capability of the encoder by adopting a group-wise architecture. As shown in Figure 3, we initially designed an encoder named D2FM-GroupEncoder. It processes input features by dividing them into three equally shaped groups, each of which is passed through an identical 1 × 1 convolution followed by two sequential D2FM blocks. All branches use the same number of channels, and their outputs are concatenated and fused via a final 1 × 1 convolution layer.
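A compact sketch of this grouped layout is given below; it reuses the D2FMSketch block from the previous sketch, and the split and fusion details are our reading of the description above rather than the exact implementation.

```python
import torch
import torch.nn as nn

class D2FMGroupEncoder(nn.Module):
    """Three equal groups -> 1x1 conv -> two D2FM blocks each -> concat -> 1x1 fuse."""

    def __init__(self, c_in: int, c_out: int):
        super().__init__()
        assert c_in % 3 == 0, "input channels are split into three equal groups"
        g = c_in // 3
        self.branches = nn.ModuleList([
            nn.Sequential(nn.Conv2d(g, g, 1), D2FMSketch(g), D2FMSketch(g))
            for _ in range(3)
        ])
        self.fuse = nn.Conv2d(3 * g, c_out, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        groups = torch.chunk(x, 3, dim=1)                 # split into three equal groups
        outs = [branch(g) for branch, g in zip(self.branches, groups)]
        return self.fuse(torch.cat(outs, dim=1))          # concatenate and fuse with a 1x1 conv
```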
However, performing feature extraction in the manner of the D2FM-GroupEncoder inevitably introduces significant computational and time overhead. To better capture multi-scale and hierarchical feature dependencies, we redesigned the D2FM-HierarchyEncoder, an enhanced feature extraction module built upon the baseline D2FM-GroupEncoder.
In contrast, the D2FM-HierarchyEncoder (Figure 4) introduces two critical modifications:
(1) Hierarchical Channel Reduction: Instead of uniform feature divisions, the input is split into three feature groups with progressively decreasing channel dimensions: c, c/2, and c/4. This design enables the encoder to allocate a greater representational capacity to richer features while still capturing lightweight contextual cues at a reduced cost.
(2) Depth-Dependent D2FM Stacking: Each feature group is assigned a different number of stacked D2FM blocks, with the processing depth increasing as the channel capacity decreases. The highest-capacity path (with channel size c) passes through a single D2FM block, while the c/2 and c/4 paths are processed through two and three D2FM blocks, respectively. This introduces an implicit hierarchy in processing depth, encouraging the deeper modeling of lower-dimensional but potentially more abstract features.
Finally, the outputs from all branches are aggregated and unified via a 1 × 1 convolution that projects the concatenated features (with total channel size 2c) onto the desired output dimension k.
This hierarchical strategy allows D2FM-HierarchyEncoder to model multi-level dependencies more effectively, balancing expressiveness and computational efficiency.
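The hierarchical variant can be sketched in the same style. Here the input is assumed to carry exactly c + c/2 + c/4 channels so that it can be split into the three groups described above; the class and parameter names are illustrative.

```python
import torch
import torch.nn as nn

class D2FMHierarchyEncoder(nn.Module):
    """Hierarchical channel reduction (c, c/2, c/4) with depth-dependent
    D2FM stacking (1, 2, and 3 blocks, respectively)."""

    def __init__(self, c: int, k: int):
        super().__init__()
        self.widths = [c, c // 2, c // 4]     # progressively decreasing group widths
        depths = [1, 2, 3]                    # deeper stacks on the narrower groups
        self.branches = nn.ModuleList([
            nn.Sequential(nn.Conv2d(w, w, 1), *[D2FMSketch(w) for _ in range(d)])
            for w, d in zip(self.widths, depths)
        ])
        self.fuse = nn.Conv2d(sum(self.widths), k, 1)   # project concatenated groups to k channels

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        groups = torch.split(x, self.widths, dim=1)     # unequal channel split: c, c/2, c/4
        outs = [branch(g) for branch, g in zip(self.branches, groups)]
        return self.fuse(torch.cat(outs, dim=1))
```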
3.4. Application of D2FM-HierarchyEncoder in YOLOv11
D2FM-HierarchyEncoder can be easily integrated into various object detection algorithms to enhance their representation learning capabilities. In this paper, we use the widely adopted YOLOv11 detector as the application platform and refer to the resulting model as D2YOLOv11.
The difference between D2YOLOv11 and YOLOv11 lies in the detection head: in D2YOLOv11 we introduce the D2FM-HierarchyEncoder module together with additional refinement convolutional layers for feature enhancement, whereas YOLOv11 directly uses a small set of convolutional layers to output coarse detection results.
As shown in Figure 5, D2YOLOv11 is composed of three main components: the backbone, neck, and detection head. The backbone is responsible for extracting features from the input image, the neck refines and fuses multi-scale features, and the detection head outputs bounding boxes, class predictions, and confidence scores.
D2FM-HierarchyEncoder is integrated into the detection head. Initially, the neck produces three feature maps at different scales. To unify their channel dimensions and refine semantic context, each feature map is first processed by a 1 × 1 convolution followed by a 3 × 3 convolution, which ensures that all three feature maps have the same channel count. Next, D2FM-HierarchyEncoder is used to extract features; note that the three feature maps share a single D2FM-HierarchyEncoder to improve computational efficiency. Finally, the feature maps are passed through another 3 × 3 convolution and a 1 × 1 convolution to restore their channel dimensions, and the final detection results are generated.
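The sketch below outlines how such a detection head could be wired, with the shared encoder passed in as a module (for example, the D2FMHierarchyEncoder sketch above, assumed here to preserve its input width); the channel sizes and the final prediction convolution are placeholders rather than the exact YOLOv11 configuration.

```python
import torch
import torch.nn as nn
from typing import List

class D2YOLOHeadSketch(nn.Module):
    """Rough wiring of the modified head: per-scale channel unification,
    a single shared encoder, and per-scale restoration/prediction."""

    def __init__(self, in_channels: List[int], mid: int, num_outputs: int,
                 shared_encoder: nn.Module):
        super().__init__()
        # 1x1 followed by 3x3 convolutions bring every scale to the same width `mid`.
        self.reduce = nn.ModuleList([
            nn.Sequential(nn.Conv2d(c, mid, 1), nn.Conv2d(mid, mid, 3, padding=1))
            for c in in_channels
        ])
        self.encoder = shared_encoder  # one encoder shared by all scales (assumed width-preserving)
        # 3x3 followed by 1x1 convolutions restore each scale and emit predictions.
        self.restore = nn.ModuleList([
            nn.Sequential(nn.Conv2d(mid, mid, 3, padding=1), nn.Conv2d(mid, num_outputs, 1))
            for _ in in_channels
        ])

    def forward(self, feats: List[torch.Tensor]) -> List[torch.Tensor]:
        out = []
        for i, f in enumerate(feats):          # three neck outputs at different scales
            f = self.reduce[i](f)              # unify channels, refine context
            f = self.encoder(f)                # shared encoder
            out.append(self.restore[i](f))     # restore channels, predict per scale
        return out
```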