3.1. The Framework of DAU-YOLO
Figure 1 illustrates the architecture of the proposed network, DAU-YOLO, which builds upon YOLOv11. YOLOv11 has achieved state-of-the-art (SOTA) performance across various domains, offering enhanced feature extraction, optimized efficiency, and competitive accuracy with fewer parameters than YOLOv8, making it a strong baseline. Our goal is to further improve small-object detection accuracy without significantly altering the network architecture. The backbone consists of one stem layer and four stage layers. The stem layer is an RFA module, and each stage layer consists of one RFA module and one C3k2 module. The RFA module, which introduces the Receptive-Field Attention mechanism into standard convolution, retains more small-object information during the original 3 × 3 convolutional downsampling. In the neck, we add the Dynamic Attention Upsampling (DAU) module to the standard PAFPN [15] to better integrate features across scales. Among these features, the shallow ones are especially crucial, since they preserve a large amount of small-object information; the module therefore applies dynamic attention mechanisms to extract shallow features more effectively. Finally, to minimize the parameter count, we experimented with removing the deep-feature detection head after adding the shallow-feature detection head, keeping the number of parameters as small as possible. To accommodate varying accuracy requirements and hardware constraints, DAU-YOLO is released in multiple versions, including nano (n), small (s), medium (m), large (l), and extra-large (x), mirroring YOLOv11. All versions share the same overall architecture, differing only in network depth and the number of parameters per layer. The parameter size of each version is detailed in Table 1.
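To make the backbone layout concrete, the following PyTorch-style sketch wires one RFA stem and four RFA + C3k2 stages as described above. The `RFAConv` and `C3k2` classes are assumed to match the sketches given in Section 3.2; the channel widths are illustrative assumptions, not the released configurations.

```python
import torch.nn as nn

# Hypothetical wiring of the DAU-YOLO backbone described above:
# an RFA stem followed by four (RFA downsample + C3k2) stage layers.
# RFAConv and C3k2 are sketched in Section 3.2; widths are illustrative.
class DAUYOLOBackbone(nn.Module):
    def __init__(self, widths=(64, 128, 256, 512, 1024)):
        super().__init__()
        self.stem = RFAConv(3, widths[0])               # stem layer
        self.stages = nn.ModuleList([
            nn.Sequential(
                RFAConv(widths[i], widths[i + 1]),      # stride-2 downsampling
                C3k2(widths[i + 1], widths[i + 1]),     # feature extraction
            )
            for i in range(4)
        ])

    def forward(self, x):
        x = self.stem(x)
        feats = []                                      # P1..P4 for the neck
        for stage in self.stages:
            x = stage(x)
            feats.append(x)
        return feats
```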
3.2. RFA Module
In DAU-YOLO, the Receptive-Field Attention module and the C3k2 module are utilized to ensure high-quality feature extraction and effective image downsampling. The C3k2 module is a newly introduced feature extraction component in YOLOv11. It divides the input features into two parts: one part is processed directly through standard convolution operations, while the other passes through multiple C3k structures or bottleneck structures with variable convolutional kernels (e.g., 3 × 3). Finally, the two feature streams are concatenated and fused using a 1 × 1 convolution. This design maintains a lightweight structure while effectively extracting features in complex scenarios. The structure of the C3k2 block is shown in Figure 2.
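The split–transform–concatenate pattern described above can be sketched as follows. This is a minimal PyTorch illustration of the described data flow, not the exact Ultralytics implementation; the bottleneck standing in for the C3k structures is a plain residual block for brevity.

```python
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    # Plain residual bottleneck standing in for the C3k/bottleneck structures.
    def __init__(self, c, k=3):
        super().__init__()
        self.conv1 = nn.Conv2d(c, c, k, padding=k // 2, bias=False)
        self.conv2 = nn.Conv2d(c, c, k, padding=k // 2, bias=False)
        self.bn1, self.bn2 = nn.BatchNorm2d(c), nn.BatchNorm2d(c)
        self.act = nn.SiLU(inplace=True)

    def forward(self, x):
        return x + self.bn2(self.conv2(self.act(self.bn1(self.conv1(x)))))

class C3k2(nn.Module):
    # Sketch of the C3k2 data flow: split the input features, transform one
    # branch with stacked bottlenecks, then fuse both with a 1x1 convolution.
    def __init__(self, c_in, c_out, n=2):
        super().__init__()
        c_mid = c_in // 2
        self.split = nn.Conv2d(c_in, 2 * c_mid, 1, bias=False)   # produces both branches
        self.blocks = nn.Sequential(*[Bottleneck(c_mid) for _ in range(n)])
        self.fuse = nn.Conv2d(2 * c_mid, c_out, 1, bias=False)   # 1x1 fusion

    def forward(self, x):
        a, b = self.split(x).chunk(2, dim=1)   # one part kept, one part transformed
        return self.fuse(torch.cat((a, self.blocks(b)), dim=1))
```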
The Receptive-Field Attention module replaces the standard convolution in YOLOv11, providing an efficient approach to feature extraction and downsampling. This approach not only highlights the importance of the various features within the receptive-field window but also enhances the spatial representation of receptive-field features. The ultra-small objects to be detected span only a single-digit number of pixels in a 1920 × 1080 image, and after the image is resized to 640 × 640 for network input, their pixel count becomes smaller still. During downsampling, the weight distribution within each sliding window is therefore crucial. Accordingly, we drew inspiration from Receptive-Field Attention and integrated it into the backbone, enabling maximal extraction and differentiation of local small-object information.
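As a concrete illustration (assuming aspect-preserving letterbox resizing, so the horizontal scale factor governs the reduction):

$$ s = \frac{640}{1920} = \frac{1}{3}, \qquad \text{so a } 12 \times 12 \text{ px object occupies only about } 4 \times 4 \text{ px at the network input.} $$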
As shown in Figure 3, the Receptive-Field Attention module consists of two paths. The first path computes the attention map. If the convolutional kernel size is $k \times k$, then after downsampling and extraction of receptive-field spatial characteristics, the input feature $X \in \mathbb{R}^{b \times c \times h \times w}$ becomes a receptive-field feature $X_{rf} \in \mathbb{R}^{b \times ck^2 \times h \times w}$, where $b$, $c$, $h$, and $w$ denote the batch size, channels, height, and width, respectively. To reduce computational complexity and accelerate training, we reshape the feature and use AvgPool2d to aggregate the global information of each receptive-field feature. We then apply softmax to highlight the importance of each feature within the receptive-field representation, obtaining the attention map $A_{rf}$. The calculation formula is given in Equation (1):

$$A_{rf} = \mathrm{Softmax}\left(\mathrm{AvgPool2d}\left(X_{rf}\right)\right) \qquad (1)$$
The other path obtains the receptive-field feature. We apply a CBR operation (Convolution + BatchNorm + ReLU) to the input $X$ and reshape the result to $F_{rf} \in \mathbb{R}^{b \times ck^2 \times h \times w}$. The convolution is a $k \times k$ group convolution, designed to let the information within each receptive field interact. The calculation formula is given in Equation (2):

$$F_{rf} = \mathrm{ReLU}\left(\mathrm{BN}\left(\mathrm{Conv}_{k \times k}(X)\right)\right) \qquad (2)$$

Finally, the results of the two paths are combined by element-wise multiplication (denoted $\otimes$), giving the receptive-field attention feature $F$. The computation of RFA can generally be formulated as Equation (3):

$$F = A_{rf} \otimes F_{rf} = \mathrm{Softmax}\left(\mathrm{AvgPool2d}\left(X_{rf}\right)\right) \otimes \mathrm{ReLU}\left(\mathrm{BN}\left(\mathrm{Conv}_{k \times k}(X)\right)\right) \qquad (3)$$
Subsequently, $F$ needs to be reshaped and processed through a convolutional layer to transform its shape into $(b, c_n, h, w)$, which is then fed into the C3k2 module. The size of $c_n$ is set manually, typically as $c_n = 2c_{n-1}$ (doubling at each stage), where $n$ represents the index of the stage layer.
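Putting Equations (1)–(3) together, the following PyTorch sketch shows one way to realize the RFA module with stride-2 downsampling. It follows the two-path structure above, but the specific layer choices (the group convolution that extracts the receptive fields, the local average pooling, and the final stride-$k$ fusion convolution) are our assumptions, not the exact released code.

```python
import torch
import torch.nn as nn

class RFAConv(nn.Module):
    """Sketch of the RFA module (Eqs. (1)-(3)) with stride-2 downsampling.
    The k x k group convolution and the final stride-k fusion convolution
    are assumptions consistent with the description, not the released code."""

    def __init__(self, c_in, c_out, k=3, stride=2):
        super().__init__()
        self.k = k
        # Receptive-field extraction: a k x k group conv expands every input
        # channel into its k*k receptive-field positions -> (b, c*k*k, h, w).
        self.unfold = nn.Conv2d(c_in, c_in * k * k, k, stride=stride,
                                padding=k // 2, groups=c_in, bias=False)
        self.bn = nn.BatchNorm2d(c_in * k * k)
        self.act = nn.ReLU(inplace=True)
        # Eq. (1): AvgPool2d aggregates each receptive-field feature before softmax.
        self.pool = nn.AvgPool2d(kernel_size=3, stride=1, padding=1)
        # Fusion: rearrange to (b, c, k*h, k*w) and fuse with a stride-k conv,
        # yielding the (b, c_n, h, w) tensor that feeds the C3k2 module.
        self.fuse = nn.Sequential(
            nn.Conv2d(c_in, c_out, k, stride=k, bias=False),
            nn.BatchNorm2d(c_out),
            nn.SiLU(inplace=True),
        )

    def forward(self, x):
        b, c, k = x.shape[0], x.shape[1], self.k
        x_rf = self.unfold(x)                              # (b, c*k*k, h, w)
        h, w = x_rf.shape[2], x_rf.shape[3]
        # Eq. (1): attention over the k*k positions of each receptive field.
        a_rf = self.pool(x_rf).view(b, c, k * k, h, w)
        a_rf = torch.softmax(a_rf, dim=2)
        # Eq. (2): CBR path producing the receptive-field feature F_rf.
        f_rf = self.act(self.bn(x_rf)).view(b, c, k * k, h, w)
        # Eq. (3): element-wise product of attention map and feature.
        f = (a_rf * f_rf).view(b, c, k, k, h, w)
        # Rearrange receptive fields spatially, then fuse/downsample.
        f = f.permute(0, 1, 4, 2, 5, 3).reshape(b, c, h * k, w * k)
        return self.fuse(f)
```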
3.3. DAU Module
In the proposed network, the neck retains a PAFPN architecture. PAFPN adds a bottom–up pathway after the top–down pathway of FPN, reinforcing the feature hierarchy by propagating precise localization signals from lower layers through bottom–up path augmentation. Building upon the two original top–down and bottom–up layers of YOLOv11, we make two major modifications. First, a novel module named DAU (Dynamic Attention Upsampling) is introduced at the top of the neck. This module enhances small-object feature extraction through upsampling and maximizes information utilization via spatial-diffusion and task-aware attention mechanisms. Second, the positions of the two bottom–up layers are adjusted: the bottom–up layer at the deepest stage is removed, and a new one is added at the top. This allows full interaction between the DAU module and the other neck modules while keeping the increase in parameters small.
In the backbone, apart from the stem layer, there are four stage layers, each performing downsampling. We refer to the outputs of these four layers as P1, P2, P3, and P4. In YOLOv11, the features P2, P3, and P4 are concatenated with the output features of the next lower top–down layer in the neck, followed by a C3k2 operation. It is generally understood that, during downsampling and feature extraction, shallow layers undergo fewer downsampling operations and thus preserve richer fine details of small objects, whereas deep layers primarily capture global information about objects and their backgrounds. The network therefore aims to fully utilize the shallow features of P1. To achieve this, an initial upsampling operation is performed within the DAU module. The DAU module fully integrates the feature P1 with the output of top–down layer 2 and processes them through the C3k2 module for effective feature extraction. The extracted features are then enhanced by a spatial-diffusion block and a task-aware block. The structure of the DAU module is shown in Figure 4.
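A minimal sketch of this wiring is given below, assuming the `C3k2` sketch from Section 3.2 and the `SpatialDiffusionBlock` and `TaskAwareBlock` sketches that follow Equations (4) and (5); the channel sizes and the 2× upsampling factor are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DAUModule(nn.Module):
    # Sketch of the DAU module: upsample the top-down feature to P1 resolution,
    # fuse it with the shallow P1 feature via C3k2, then refine the result with
    # spatial-diffusion and task-aware attention (applied sequentially).
    def __init__(self, c_p1, c_td, c_out):
        super().__init__()
        self.upsample = nn.Upsample(scale_factor=2, mode="nearest")
        self.c3k2 = C3k2(c_p1 + c_td, c_out)
        self.spatial = SpatialDiffusionBlock(c_out)   # Eq. (4)
        self.task = TaskAwareBlock(c_out)             # Eq. (5)

    def forward(self, p1, top_down):
        x = torch.cat((p1, self.upsample(top_down)), dim=1)  # fuse shallow + deep
        x = self.c3k2(x)
        return self.task(self.spatial(x))             # sequential attention
```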
The use of these two attention mechanisms stems from a careful consideration of the feature characteristics before the detection layer. First, since the PAFPN structure is employed, selecting the most appropriate scale among multiple scales is crucial; however, because this method specifically targets small-object detection, and subsequent experiments demonstrate that the P1 layer provides the optimal scale, no scale-attention block is added, which reduces parameter complexity. Second, in the spatial domain, regions of interest should receive higher attention weights to obtain a good semantic representation. Finally, the detection head serves multiple tasks: YOLO-based detectors must output the box, cls, and dfl losses. We therefore adopt a task-aware attention mechanism, enabling adaptive channel-wise attention that effectively prioritizes different tasks. The two attention blocks are applied sequentially.
To better distinguish target objects from adjacent objects and the background, we introduce a learnable offset into the standard convolution. Considering the high dimensionality of the spatial domain, the spatial-diffusion block is divided into two steps: (1) utilizing deformable convolution [40] to enable the attention mechanism to learn sparser representations and (2) aggregating features within the same spatial region. The computation of the spatial-diffusion block can generally be formulated as Equation (4):

$$\pi_S(F) \cdot F = \frac{1}{N} \sum_{k=1}^{N} F\left(p_k + \Delta p_k\right) \cdot \Delta m_k \qquad (4)$$

where $N$ is the number of sparse sampling locations, $p_k + \Delta p_k$ is a location shifted by the self-learned spatial offset $\Delta p_k$ to focus on a discriminative region, and $\Delta m_k$ is a self-learned importance scalar at location $p_k$.
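One way to realize this block is with a modulated deformable convolution, where the offsets play the role of $\Delta p_k$ and the modulation scalars the role of $\Delta m_k$. The sketch below uses torchvision's `deform_conv2d`; the offset and modulation prediction branches are our assumptions.

```python
import torch
import torch.nn as nn
from torchvision.ops import deform_conv2d

class SpatialDiffusionBlock(nn.Module):
    # Sketch of Eq. (4): a modulated deformable convolution whose offsets
    # (delta p_k) and modulation scalars (delta m_k) are predicted from the
    # input feature, sampling N = k*k sparse locations per output position.
    def __init__(self, c, k=3):
        super().__init__()
        self.k = k
        self.weight = nn.Parameter(torch.randn(c, c, k, k) * 0.01)
        # Predict 2 offsets (x, y) and 1 modulation scalar per sampling location.
        self.offset = nn.Conv2d(c, 2 * k * k, 3, padding=1)
        self.mask = nn.Conv2d(c, k * k, 3, padding=1)

    def forward(self, x):
        offset = self.offset(x)                    # self-learned delta p_k
        mask = torch.sigmoid(self.mask(x))         # self-learned delta m_k
        return deform_conv2d(x, offset, self.weight,
                             padding=self.k // 2, mask=mask)
```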
Task-aware attention is essentially a form of channel attention. We integrate a task-aware block after the spatial-diffusion block to facilitate joint learning and enhance the generalization of the object representation. It adaptively selects the optimal activation function for each channel, dynamically switching channels of features on and off to prioritize different tasks. The computation of the task-aware block can be formulated as Equation (5):

$$\pi_C(F) \cdot F = \max\left(\alpha^1(F) \cdot F_c + \beta^1(F),\; \alpha^2(F) \cdot F_c + \beta^2(F)\right) \qquad (5)$$

where $F_c$ is the feature slice at the $c$-th channel and $[\alpha^1, \beta^1, \alpha^2, \beta^2]^{T} = \theta(\cdot)$ is a hyper-function that learns to control the activation thresholds. The inspiration for the hyper-function originates from Dynamic ReLU [41].
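A compact sketch of this block, following the Dynamic-ReLU-style formulation that Equation (5) cites, is shown below; the pooling-based hyper-function and its identity-centered initialization are our assumptions.

```python
import torch
import torch.nn as nn

class TaskAwareBlock(nn.Module):
    # Sketch of Eq. (5): a hyper-function theta (global pool + small MLP)
    # predicts per-channel (alpha1, beta1, alpha2, beta2), and the output is
    # the channel-wise max of the two learned linear activations.
    def __init__(self, c, reduction=4):
        super().__init__()
        self.theta = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(c, c // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(c // reduction, 4 * c),
        )

    def forward(self, x):
        b, c = x.shape[0], x.shape[1]
        # Squash hyper-function outputs into (-1, 1), centered so the block
        # starts near the identity mapping (alpha1 ~ 1, others ~ 0).
        params = 2 * torch.sigmoid(self.theta(x)) - 1
        a1, b1, a2, b2 = params.view(b, 4, c, 1, 1).unbind(1)
        a1 = a1 + 1.0                                   # default slope of 1
        return torch.max(a1 * x + b1, a2 * x + b2)      # Eq. (5)
```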
Since the DAU module introduces an additional upsampling, a further top–down layer is required in the PAFPN, resulting in four detection layers. However, experimental results indicate that removing the lowest-resolution (deepest) detection layer has little impact on accuracy. Therefore, in this work, we omit that detection layer along with its corresponding bottom–up layer, minimizing the parameter count without compromising accuracy.