3.1. Architecture
This study aims to enhance the performance of YOLOv11n in UAV-based infrared small target detection tasks through targeted architectural improvements. The proposed model, named IFD-YOLO, specifically addresses three major challenges in infrared scenarios: blurred small-target features, strong background interference, and the trade-off between detection accuracy and inference speed. Overall, IFD-YOLO follows the typical one-stage detection architecture of YOLOv11, consisting of a backbone for feature extraction, a neck for multi-scale feature fusion, and a detection head for final prediction. As illustrated in Figure 1, IFD-YOLO improves three key components of YOLOv11n: the feature extractor, the feature enhancement module, and the bounding-box regression loss, which are redesigned as follows.
In the feature extraction stage, the original backbone network is replaced with the lightweight convolutional architecture RepViT (version M0.9). By decoupling the token mixer and channel mixer within a MetaFormer-like framework and introducing structural re-parameterization, RepViT strengthens feature extraction for infrared small targets. It helps the network capture subtle target cues more effectively than conventional backbones under low-contrast infrared conditions.
To further enhance feature representation capability, the original C3k2 module is replaced with the C3k2_DyGhost module. This module integrates dynamic convolution into the Ghost structure, allowing convolutional kernel weights to adapt dynamically according to the input feature distribution. Consequently, it improves fine-grained feature representation of infrared small targets while maintaining a lightweight architecture, and it enhances discrimination in complex backgrounds.
In the bounding box regression stage, an Adaptive Fusion-IoU (AF-IoU) loss function is employed to replace conventional IoU-based losses. AF-IoU introduces a Focal Box [32]-based dynamic scaling mechanism that adaptively adjusts the weights of prediction boxes with different quality levels. It also employs an annealing strategy controlled by a ratio hyperparameter: the loss focuses more on low-quality boxes in the early training stages to accelerate convergence and gradually shifts its emphasis to high-quality boxes to improve regression precision. In addition, an optimized attention allocation mechanism is designed for infrared detection scenarios to alleviate the insensitivity of traditional IoU losses to small-target regression at high IoU thresholds, thereby improving localization accuracy.
In summary, IFD-YOLO achieves a well-balanced trade-off between detection accuracy and efficiency through the synergistic optimization of the backbone, feature transformation, and loss function. This framework provides a more effective solution for small target detection in complex infrared UAV scenarios.
3.2. RepViT Backbone Network
To enhance feature extraction capability and inference efficiency in UAV infrared small-target detection tasks, this study introduces the lightweight backbone network RepViT (Re-parameterized Vision Transformer) to replace the original convolutional backbone in YOLOv11. Infrared images are often characterized by small target sizes, low signal-to-noise ratios, and complex backgrounds. Traditional convolutional backbones, constrained by limited receptive fields, struggle to preserve fine-grained details and global semantic information during downsampling. Moreover, their large parameter counts and computational costs hinder real-time deployment on resource-constrained UAV platforms.
RepViT integrates the local feature extraction advantages of convolutional neural networks (CNNs) with the global modeling capability of Vision Transformers (ViTs) [33]. During training, it uses structural re-parameterization [34] to adopt a multi-branch architecture that enriches feature representation. At inference, these branches are fused into a single convolutional path, which reduces latency and computational cost while preserving detection accuracy. Similar to the spectral–spatial interaction mechanism explored in SiamBSI [35], this design strengthens the joint modeling of local details and global semantics, enabling the network to capture fine-grained variations in small infrared targets under cluttered scenes.
The overall architecture of RepViT follows a hierarchical feature extraction strategy, consisting of three main components: the Stem, Stage, and Downsample modules, enabling progressive modeling from shallow local textures to deep semantic representations. In the Stem stage, the model first applies two 3 × 3 convolutions and a stride-2 downsampling operation to reduce spatial resolution and extract basic texture information, laying the foundation for subsequent high-level feature learning. Then, the features pass through multiple Stages and Downsample modules in an alternating manner for feature extraction and compression. Each Stage is composed of several RepViT Blocks or RepViT SE Blocks, as illustrated in Figure 2 and Figure 3, which show the structure of the RepViT Block and the overall RepViT network, respectively.
In each RepViT Block, the network performs feature modeling by combining depthwise convolution and pointwise convolution operations. Specifically, in the i-th RepViT Block of the l-th layer, the input feature is denoted as X_i^l. The Token Mixer module performs spatial feature mixing and recombination through a 3 × 3 depthwise separable convolution, and its output can be formulated as:

$$Y_i^l = \sigma\left(\mathrm{BN}\left(\mathrm{DWConv}_{3\times 3}\left(X_i^l\right)\right)\right) \tag{1}$$

where DWConv3×3 denotes the 3 × 3 depthwise convolution operation, BN represents batch normalization, and σ(·) is the ReLU activation function. Note that DWConv3×3 refers to a standard 3 × 3 depthwise convolution (also used later in the DyGhost module) and is independent of the C3 or C3k2 block in YOLOv11. In addition, the Squeeze-and-Excitation (SE) [36] attention mechanism is incorporated to adaptively adjust channel weights, thereby emphasizing salient regional features of small targets in infrared images.
The Channel Mixer module consists of two 1 × 1 convolutional layers with GELU [37] activation functions. Functionally, it is equivalent to the Feed-Forward Network (FFN) [38] in Transformer architectures. By performing a process of “channel expansion–nonlinear mapping–channel compression,” it enables cross-channel feature modeling and nonlinear enhancement, thereby improving the richness and discriminative power of feature representations. Its computation can be expressed as:

$$Z_i^l = W_2\,\delta\left(W_1 Y_i^l\right) \tag{2}$$

where W_1 and W_2 denote the weight matrices of the two 1 × 1 convolutional layers, and δ(·) represents the GELU activation function.
Finally, for blocks with stride = 1, a residual connection is introduced outside the Channel Mixer to preserve shallow feature integrity and alleviate the vanishing gradient problem. The output after incorporating the residual connection can be expressed as:

$$\hat{X}_i^l = Y_i^l + Z_i^l \tag{3}$$

When stride = 1, the residual connection ensures that shallow texture information is preserved, mitigates gradient vanishing, and improves feature propagation efficiency.
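To make the block structure concrete, the following PyTorch-style sketch assembles the Token Mixer, Channel Mixer, and residual path of Equations (1)–(3). The class name RepViTBlockSketch and its arguments are illustrative, and the SE branch and the training-time multi-branch structure are omitted for brevity, so this should be read as a minimal sketch rather than the official RepViT implementation.

```python
import torch
import torch.nn as nn

class RepViTBlockSketch(nn.Module):
    """Minimal stride-1 RepViT block: Token Mixer + Channel Mixer, Eqs. (1)-(3).

    The SE branch and the training-time re-parameterization branches are omitted.
    """
    def __init__(self, channels: int, expansion: int = 2):
        super().__init__()
        # Token Mixer: 3x3 depthwise convolution + BN + ReLU (Eq. (1))
        self.token_mixer = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, groups=channels, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )
        # Channel Mixer: 1x1 expansion -> GELU -> 1x1 compression (Eq. (2))
        hidden = channels * expansion
        self.channel_mixer = nn.Sequential(
            nn.Conv2d(channels, hidden, 1, bias=False),
            nn.GELU(),
            nn.Conv2d(hidden, channels, 1, bias=False),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.token_mixer(x)            # Y = sigma(BN(DWConv_3x3(X)))
        return y + self.channel_mixer(y)   # residual outside the Channel Mixer (Eq. (3))
```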
Through this hierarchical modeling and multi-scale feature fusion strategy, RepViT effectively retains fine-grained features of infrared small targets during downsampling while capturing global contextual information. Consequently, it significantly enhances detection accuracy and feature adaptability under a lightweight design.
Finally, the backbone network outputs multi-scale feature maps to provide complementary high-resolution and high-semantic features for the YOLO detection head, thereby enabling precise recognition and localization of infrared small targets.
Although RepViT was originally introduced as a lightweight backbone for general vision tasks, it exhibits particularly pronounced advantages in infrared small-target detection. Infrared targets typically contain extremely weak texture information and minimal structural details, making them highly susceptible to irreversible feature loss during the downsampling process in traditional convolutional networks. By leveraging the MetaFormer architecture, RepViT strengthens global contextual modeling while maintaining a lightweight design, enabling the network to better preserve faint thermal signatures across layers and thereby improving the discriminability of weak targets.
In addition, its structural re-parameterization mechanism allows the network to employ multi-branch structures during training to enhance the fusion of high-frequency and low-frequency information, while collapsing into a single-branch convolution during inference. This achieves a favorable balance between enhanced representational capability and real-time efficiency. Such characteristics align well with the “weak signal, strong background” nature of infrared imaging and effectively reduce the blurring and loss of small-target information during downsampling, ultimately improving detection stability and reliability.
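The inference-time collapse described above rests on the standard convolution + BatchNorm folding identity. The sketch below shows this folding for a single convolution and BatchNorm pair, assuming PyTorch modules; the helper name fuse_conv_bn is illustrative and does not cover the full multi-branch fusion rules used in RepViT.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def fuse_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    """Fold a BatchNorm layer into the preceding convolution for inference."""
    fused = nn.Conv2d(conv.in_channels, conv.out_channels, conv.kernel_size,
                      stride=conv.stride, padding=conv.padding,
                      groups=conv.groups, bias=True)
    # BN(x) = gamma * (x - mean) / sqrt(var + eps) + beta
    scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)
    fused.weight.copy_(conv.weight * scale.reshape(-1, 1, 1, 1))
    bias = conv.bias if conv.bias is not None else torch.zeros_like(bn.running_mean)
    fused.bias.copy_((bias - bn.running_mean) * scale + bn.bias)
    return fused

# The fused convolution reproduces the conv + BN outputs in evaluation mode.
conv = nn.Conv2d(16, 16, 3, padding=1, groups=16, bias=False)
bn = nn.BatchNorm2d(16).eval()
x = torch.randn(1, 16, 32, 32)
assert torch.allclose(bn(conv(x)), fuse_conv_bn(conv, bn)(x), atol=1e-5)
```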
3.3. C3k2_DyGhost Module
In YOLOv11, the C3k2 block is a built-in Cross Stage Partial (CSP) module used in both the backbone and neck. It follows the standard C3 framework, where the input feature map is split into two branches: one branch bypasses the block, and the other branch passes through a sequence of bottleneck layers. The suffix “k2” indicates that two 3 × 3 convolution-based bottleneck units are used inside the C3 block, which enhances feature extraction capacity while maintaining a lightweight structure. In this work, we adopt the original YOLOv11 C3k2 block as the baseline and replace its bottleneck part with the proposed DyGhostBottleneck to construct the C3k2_DyGhost module.
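For reference, the sketch below lays out this split-and-merge pattern in simplified PyTorch form. The class name C3k2Sketch, the channel split, and the plain convolutional bottlenecks are illustrative assumptions rather than the exact Ultralytics implementation; in C3k2_DyGhost the bottleneck units would be replaced by the DyGhostBottleneck described below.

```python
import torch
import torch.nn as nn

def conv_bn_act(ch: int) -> nn.Sequential:
    """A plain 3x3 convolutional bottleneck unit used here as a stand-in."""
    return nn.Sequential(
        nn.Conv2d(ch, ch, 3, padding=1, bias=False),
        nn.BatchNorm2d(ch),
        nn.SiLU(inplace=True),
    )

class C3k2Sketch(nn.Module):
    """Simplified CSP-style layout: split, process one branch with n bottlenecks, concatenate, fuse."""
    def __init__(self, channels: int, n: int = 2):
        super().__init__()
        half = channels // 2
        self.split = nn.Conv2d(channels, channels, 1, bias=False)
        # In C3k2_DyGhost these plain bottlenecks are replaced by DyGhostBottleneck units.
        self.bottlenecks = nn.Sequential(*[conv_bn_act(half) for _ in range(n)])
        self.fuse = nn.Conv2d(channels, channels, 1, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        bypass, main = self.split(x).chunk(2, dim=1)   # two branches after a 1x1 projection
        return self.fuse(torch.cat([bypass, self.bottlenecks(main)], dim=1))
```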
To further enhance the feature extraction capability of the model for infrared small-target detection tasks, this study designs an improved C3k2_DyGhost module based on the original C3k2 module in the YOLOv11n network. As shown in Figure 4, the module inherits the residual structure of the C3 framework to maintain the continuity of feature flow and the stability of gradient propagation. Here, the term “C3 framework” refers to the Cross Stage Partial C3 block used in the official YOLOv11 implementation. The core idea is to replace the original Bottleneck unit with a DyGhostBottleneck, which combines Ghost feature generation with dynamic convolution [39]. This structure enables adaptive feature modeling while preserving lightweight extraction efficiency. Similar to the dynamic spectral–spatial perception mechanism explored in DSP-Net [40], the proposed DyGhost module leverages adaptive convolutional kernels to adjust feature extraction according to the input distribution, thereby improving its resilience to complex infrared backgrounds.
The traditional Ghost Module generates intrinsic features using a small number of standard convolutions and then produces redundant features through linear transformations (such as depthwise or pointwise convolutions), thereby obtaining rich feature representations at a low computational cost. However, since its convolutional kernel weights are fixed, it lacks adaptability to variations in input feature distributions and struggles to effectively capture weak small targets in complex infrared backgrounds.
To address this limitation, this study introduces a dynamic convolution mechanism into the Ghost feature generation stage, enabling input-adaptive feature modeling through multiple sets of learnable convolution kernels. The structure of the DyGhost module is shown in Figure 5. Specifically, the DyGhost module predefines K sets of convolutional kernel parameters and employs a lightweight attention branch to generate the corresponding weighting coefficients α_k (for k = 1, 2, …, K). These coefficients are produced by a Softmax function to ensure the balance and stability of kernel combinations.

Let the input feature be X. The DyGhost module generates multiple feature responses using the K learnable convolution kernels W_k as follows:

$$F_k = W_k \ast X, \quad k = 1, 2, \ldots, K \tag{4}$$

Here, W_k denotes the k-th set of convolutional kernel parameters, F_k represents the corresponding feature response, and * indicates the convolution operation.
To achieve input-adaptive feature modeling, the DyGhost module introduces a lightweight routing branch, which first extracts global contextual information from the input via Global Average Pooling (GAP). Then, a linear mapping function φ(·) is applied to generate the weighting coefficients α_k:

$$\alpha_k = \mathrm{Softmax}\left(\phi\left(\mathrm{GAP}(X)\right)\right)_k, \quad k = 1, 2, \ldots, K \tag{5}$$

The final output feature F is obtained by computing the weighted summation of the convolution results from all kernel groups:

$$F = \sum_{k=1}^{K} \alpha_k F_k \tag{6}$$

This mechanism allows the model to adjust convolutional responses according to the input feature distribution, improving its ability to handle diverse targets and complex infrared backgrounds.
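A minimal PyTorch sketch of this routing-and-weighting step, corresponding to Equations (4)–(6), is given below. The class name DynamicConvSketch and hyperparameters such as num_kernels are illustrative assumptions rather than the exact DyGhost implementation. Note that, by linearity of convolution, weighting the K responses is equivalent to weighting the kernels before a single convolution, which the sketch exploits.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicConvSketch(nn.Module):
    """Input-adaptive convolution with K kernel sets and softmax routing, Eqs. (4)-(6)."""
    def __init__(self, in_ch: int, out_ch: int, kernel_size: int = 3, num_kernels: int = 4):
        super().__init__()
        # K sets of learnable kernel parameters W_k
        self.kernels = nn.Parameter(
            torch.randn(num_kernels, out_ch, in_ch, kernel_size, kernel_size) * 0.02)
        # Lightweight routing branch: GAP -> linear mapping phi -> Softmax
        self.route = nn.Linear(in_ch, num_kernels)
        self.padding = kernel_size // 2

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # alpha_k = Softmax(phi(GAP(X))), one set of coefficients per sample (Eq. (5))
        alpha = torch.softmax(self.route(x.mean(dim=(2, 3))), dim=1)      # (B, K)
        # Weighted combination over kernel sets; equivalent to summing the K responses (Eqs. (4), (6))
        agg = torch.einsum('bk,koihw->boihw', alpha, self.kernels)        # per-sample kernels
        outputs = [F.conv2d(x[i:i + 1], agg[i], padding=self.padding) for i in range(x.size(0))]
        return torch.cat(outputs, dim=0)
```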
Subsequently, during the Ghost feature generation stage, the redundant features produced by the linear transformation Φ(·) are concatenated with the primary features:

$$G = \mathrm{Concat}\left(F, \Phi(F)\right) \tag{7}$$

Here, G denotes the concatenated Ghost feature map. The input features are then fused through batch normalization (BN) and residual connection:

$$Y = \mathrm{BN}(G) + X \tag{8}$$
When the stride is set to 1, the residual branch directly connects the input and output to achieve the fusion of shallow and deep features. When the stride is 2, the input is first processed by a downsampling convolution to match the spatial dimensions.
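The Ghost-style concatenation of Equation (7) and the stride-1 residual fusion of Equation (8) can be sketched as follows. For self-containment, the primary branch is shown as a plain 1 × 1 convolution standing in for the dynamic convolution above, and the class name DyGhostSketch is illustrative.

```python
import torch
import torch.nn as nn

class DyGhostSketch(nn.Module):
    """Ghost-style generation (Eq. (7)) with stride-1 residual fusion (Eq. (8)).

    The primary branch is a plain 1x1 convolution here; in DyGhost it would be
    the dynamic convolution sketched above. `channels` is assumed to be even.
    """
    def __init__(self, channels: int):
        super().__init__()
        half = channels // 2
        self.primary = nn.Conv2d(channels, half, 1, bias=False)                     # intrinsic features F
        self.cheap = nn.Conv2d(half, half, 3, padding=1, groups=half, bias=False)   # cheap transform Phi(F)
        self.bn = nn.BatchNorm2d(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f = self.primary(x)
        g = torch.cat([f, self.cheap(f)], dim=1)   # G = Concat(F, Phi(F))  (Eq. (7))
        return self.bn(g) + x                      # Y = BN(G) + X  (Eq. (8), stride = 1)
```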
As illustrated in Figure 6, when the stride equals 1, the main branch consists of two stacked DyGhost modules, each following the standard Convolution Layer → Batch Normalization (BN) [41] → ReLU [42] activation structure to enable efficient feature extraction and nonlinear representation. The residual branch directly adds the input features (after BN) to the output of the main branch, thereby achieving effective fusion of shallow details and deep semantic information while maintaining computational efficiency.
In contrast, when the stride is 2, the main branch performs downsampling through a Depthwise Dynamic Convolution [43], resulting in the feature map resolution being reduced by half. Although this process helps enlarge the receptive field and capture higher-level semantic information, it inevitably leads to a significant loss of fine-grained textures and low-contrast thermal target details in UAV-based infrared small object detection tasks. In extreme cases, certain minute targets may even vanish completely from the feature maps. Therefore, this study adopts a stride = 1 configuration within the DyGhost module to maximally preserve spatial resolution and detailed features, ensuring stronger response capability and higher localization accuracy when detecting weak and small-scale infrared targets.
This structure maintains computational efficiency while enabling an effective integration of local details and global semantics during cross-layer feature propagation. By embedding dynamic convolution into the Ghost feature generation unit, the module enhances its capability to capture high-frequency texture and edge information while remaining lightweight. This property is particularly beneficial in infrared small target detection scenarios, which are often characterized by low contrast and high noise. The DyGhost module adaptively adjusts convolutional weights according to the saliency distribution of input features, thereby amplifying small target responses and suppressing background interference.
In summary, infrared scenes typically exhibit low contrast, strong background noise, and a large number of pseudo-targets, making the extraction of stable and discriminative small-target features particularly challenging. Although the Ghost module provides a lightweight computational framework, its fixed convolution kernels lack the ability to adapt to the complex and rapidly varying background characteristics of infrared imagery. As a result, the weak structural cues of small infrared targets are easily overwhelmed by dominant background textures or noise patterns, leading to missed detections and reduced feature reliability. By introducing dynamic convolution, the C3k2_DyGhost module enables the network to adaptively select the most suitable kernel combinations according to the input feature distribution. This mechanism helps retain weak target cues under low signal-to-noise conditions and reduces interference from non-target regions such as background heat sources and reflections.
With this adaptive feature modeling strategy, C3k2_DyGhost improves feature accuracy and stability compared with the original C3k2 structure, while keeping computational complexity and parameters at a low level. More importantly, its dynamic response characteristics closely align with the physical properties of infrared imaging, which relies on local energy fluctuations rather than the rich color or texture cues present in visible-light imagery. This makes C3k2_DyGhost particularly suitable for UAV-based infrared small-target detection, as it provides a robust, flexible, and highly discriminative feature extraction foundation for subsequent feature fusion and detection head prediction. The enhanced adaptability exhibited by this module offers valuable insights for the future design of lightweight feature extraction architectures tailored to infrared detection tasks.
3.4. Adaptive Fusion-IoU Loss
In object detection tasks, Bounding Box Regression (BBR) [44] is a critical component that directly determines detection accuracy, with the design of the loss function playing a central role. Traditional IoU-based losses (such as GIoU, DIoU, and CIoU) primarily describe the discrepancy between predicted and ground-truth boxes in terms of geometric metrics like center distance, aspect ratio, and bounding area. Although these methods alleviate issues such as gradient vanishing and scale insensitivity inherent in the original IoU formulation, improvements based purely on geometric constraints have gradually reached saturation. Moreover, the coupling among different geometric terms limits the optimization potential.
In UAV infrared small-target scenarios, high-quality prediction boxes are limited, and low-quality ones dominate the gradient updates. This imbalance slows convergence and reduces localization accuracy. To address this issue, this paper proposes an Adaptive Fusion-IoU Loss (AF-IoU), which redefines the weighting mechanism of predicted box quality. AF-IoU enables the model to dynamically adjust its focus between high- and low-quality prediction boxes at different training stages, thereby achieving a better balance between detection accuracy and training efficiency.
The core idea of AF-IoU goes beyond refining geometric metrics and establishes a “coarse-to-fine” training process through dynamic weight allocation. Its key mechanisms include a Scaled Focal Box strategy, annealing-based attention adjustment, and a confidence-weighted scheme [45]. Specifically, AF-IoU indirectly modifies the IoU value by scaling the predicted and ground-truth boxes, thereby adjusting the loss weights of predictions with varying quality. Let the original IoU between the predicted box B and the ground-truth box B^gt be defined as:

$$\mathrm{IoU} = \frac{\left|B \cap B^{gt}\right|}{\left|B \cup B^{gt}\right|} \tag{9}$$
When the predicted box is shrunk, the intersection area decreases, IoU becomes smaller, and the loss increases; conversely, when the predicted box is expanded, IoU increases and the loss decreases. The scaling ratio r determines the focus of the model: when r < 1, the model emphasizes high-quality prediction boxes; when r > 1, it emphasizes low-quality ones. This approach achieves dynamic weight redistribution solely through scale variation, without introducing additional geometric computations, allowing the model to adaptively focus on more representative samples throughout different training stages.
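The scaled-box IoU underlying this mechanism can be sketched as below, assuming axis-aligned boxes in (x1, y1, x2, y2) format that are scaled isotropically about their own centers; the function name scaled_iou is illustrative rather than taken from the authors' code.

```python
import torch

def scaled_iou(pred: torch.Tensor, gt: torch.Tensor, r: float = 1.0) -> torch.Tensor:
    """IoU after scaling both boxes about their centers by the ratio r.

    Boxes are (..., 4) tensors in (x1, y1, x2, y2) format. r < 1 shrinks the boxes
    (emphasising high-quality predictions); r > 1 enlarges them (emphasising low-quality ones).
    """
    def scale(box: torch.Tensor) -> torch.Tensor:
        cx = (box[..., 0] + box[..., 2]) / 2
        cy = (box[..., 1] + box[..., 3]) / 2
        w = (box[..., 2] - box[..., 0]) * r
        h = (box[..., 3] - box[..., 1]) * r
        return torch.stack([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2], dim=-1)

    p, g = scale(pred), scale(gt)
    lt = torch.maximum(p[..., :2], g[..., :2])   # top-left corner of the intersection
    rb = torch.minimum(p[..., 2:], g[..., 2:])   # bottom-right corner of the intersection
    inter = (rb - lt).clamp(min=0).prod(dim=-1)
    area_p = (p[..., 2] - p[..., 0]) * (p[..., 3] - p[..., 1])
    area_g = (g[..., 2] - g[..., 0]) * (g[..., 3] - g[..., 1])
    return inter / (area_p + area_g - inter + 1e-7)
```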
To balance the rapid convergence in early training with precise regression in later stages, AF-IoU introduces a dynamic annealing strategy based on box attention. This strategy controls the scaling ratio through a dynamic hyperparameter ratio, which gradually decreases over training epochs, smoothly shifting the attention of the model from low-quality to high-quality prediction boxes. The evolution pattern follows a cosine annealing schedule [46], expressed as:

$$r(t) = r_{\min} + \frac{1}{2}\left(r_{\max} - r_{\min}\right)\left(1 + \cos\frac{\pi t}{T}\right) \tag{10}$$

Here, t denotes the current epoch and T the total number of training epochs. The dynamic scaling ratio r decreases from r_max = 2.0 to r_min = 0.5 over the total training epochs following this cosine-annealing schedule. A larger ratio at the early training stage enlarges the scaled-box range and guides the model to focus on low-quality prediction boxes, thereby accelerating convergence. The range of 2.0 → 0.5 was empirically determined through preliminary experiments to achieve a stable balance between convergence speed and localization accuracy. As training progresses and the ratio gradually decreases, the model progressively shifts its attention toward high-quality prediction boxes, achieving more precise bounding box localization. This coarse-to-fine dynamic adjustment strategy effectively mitigates the issue of sample quality imbalance commonly encountered in object detection tasks.
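A direct implementation of the schedule in Equation (10), assuming the standard cosine form with r_max = 2.0 and r_min = 0.5, is shown below; the function name ratio_schedule is illustrative.

```python
import math

def ratio_schedule(epoch: int, total_epochs: int,
                   r_max: float = 2.0, r_min: float = 0.5) -> float:
    """Cosine-annealed scaling ratio r(t) of Eq. (10), decreasing from r_max to r_min."""
    return r_min + 0.5 * (r_max - r_min) * (1 + math.cos(math.pi * epoch / total_epochs))

# r starts at 2.0 (focus on low-quality boxes) and ends at 0.5 (focus on high-quality boxes).
print(ratio_schedule(0, 300), ratio_schedule(150, 300), ratio_schedule(300, 300))  # 2.0, 1.25, 0.5
```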
In addition, AF-IoU incorporates the concept of Focal Loss by introducing a confidence-weighted term to further optimize the loss distribution. Let the confidence score of a predicted box be denoted as conf; the overall formulation of AF-IoU can be expressed as:

$$\mathcal{L}_{\mathrm{AF\text{-}IoU}} = \left(1 - \mathrm{conf}\right)^{\gamma}\left(1 - \mathrm{IoU}_{r}^{\alpha}\right) \tag{11}$$

In Equation (11), IoU_r denotes the IoU computed after isotropically scaling both the predicted box and the ground-truth box by the ratio r, where r represents the dynamic scaling factor of Equation (10). The confidence term (1 − conf)^γ acts as a soft weighting factor that increases the loss for low-confidence predictions. The parameter γ (set to 0.5 in this study) follows the focal-loss principle, which uses an exponential coefficient to emphasize hard samples. Following the concept proposed in Alpha-IoU [47], the parameter α (set to 1.5 in this study) acts as the IoU exponent that adjusts the weighting of high-quality predictions through a power-scaling mechanism. A larger α increases the gradient contribution of high-IoU samples and penalizes low-IoU predictions more strongly, thereby improving the precision of bounding-box regression. By jointly modeling IoU and confidence, AF-IoU achieves unified optimization of geometric scale, prediction confidence, and sample quality, leading to a more balanced weighting of high- and low-quality predictions throughout different training phases.
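Putting the pieces together, Equation (11) can be sketched as the short function below. It assumes the scaled IoU and the annealed ratio are computed elsewhere (for example with the scaled_iou and ratio_schedule sketches above), and the name af_iou_loss is illustrative.

```python
import torch

def af_iou_loss(iou_r: torch.Tensor, conf: torch.Tensor,
                gamma: float = 0.5, alpha: float = 1.5) -> torch.Tensor:
    """AF-IoU loss of Eq. (11): confidence-weighted, exponentiated scaled IoU.

    iou_r : IoU computed after scaling both boxes by the annealed ratio r,
            e.g. scaled_iou(pred, gt, ratio_schedule(epoch, total_epochs)).
    conf  : predicted confidence of each box, assumed to lie in [0, 1].
    """
    conf_weight = (1.0 - conf).clamp(min=0.0) ** gamma         # larger weight for low-confidence boxes
    return conf_weight * (1.0 - iou_r.clamp(min=0.0, max=1.0) ** alpha)
```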
Infrared small targets typically occupy only a few pixels, making their bounding boxes extremely sensitive to even slight localization deviations. Under such conditions, traditional IoU-based loss functions often suffer from gradient domination by low-quality samples and insufficient optimization in the high-IoU region, which further amplifies localization errors for tiny targets. AF-IoU addresses this issue by introducing a dynamically weighted scaling mechanism that adaptively redistributes optimization focus according to the quality of prediction boxes. Its core design employs a cosine annealing schedule to smoothly adjust the scaling ratio throughout training, enabling the model to concentrate on low-quality predictions during the early stages to accelerate convergence toward coarse spatial structures. As training progresses, the weighting gradually shifts toward high-quality predictions, enhancing fine-grained regression accuracy for infrared small targets whose bounding boxes are highly sensitive to displacement.
This coarse-to-fine optimization strategy effectively mitigates gradient imbalance caused by sample quality disparity and ensures smoother, more stable gradient variation across the entire training process. Moreover, AF-IoU incorporates IoU exponentiation and confidence modulation, allowing the model to penalize boxes with larger positional deviations more strongly, thereby improving localization reliability in noisy infrared environments. Given that even a minimal regression error may render an infrared small target completely undetectable, the dynamic nature of AF-IoU makes it inherently more suitable for infrared small-target localization than conventional IoU-based losses. At the same time, AF-IoU maintains full compatibility with standard IoU loss formulations and can be directly substituted for CIoU in the YOLO series without any architectural modification, offering both refined optimization capability and strong generalizability for high-quality bounding-box regression tasks.