Article

Research on Small Object Detection in Degraded Visual Scenes: An Improved DRF-YOLO Algorithm Based on YOLOv11

1 College of Mechanical and Automotive Engineering, Shanghai University of Engineering Science, Shanghai 201620, China
2 Department of Civil Engineering, School of Civil Engineering and Geomatics, Shandong University of Technology, Zibo 255000, China
3 Department of Engineering and Management, International College, Krirk University, No. 3 Soi Ramintra 1, Ramintra Road, Anusaowaree, Bangkhen, Bangkok 10220, Thailand
* Author to whom correspondence should be addressed.
World Electr. Veh. J. 2025, 16(11), 591; https://doi.org/10.3390/wevj16110591
Submission received: 15 August 2025 / Revised: 6 October 2025 / Accepted: 17 October 2025 / Published: 23 October 2025

Abstract

Object detection in degraded environments such as low-light and nighttime conditions remains a challenging task, as conventional computer vision techniques often fail to achieve high precision and robust performance. With the increasing adoption of deep learning, this paper aims to enhance object detection under such adverse conditions by proposing an improved version of YOLOv11, named DRF-YOLO (Degradation-Robust and Feature-enhanced YOLO). The proposed framework incorporates three innovative components: (1) a lightweight Cross Stage Partial Multi-Scale Edge Enhancement (CSP-MSEE) module that combines multi-scale feature extraction with edge enhancement to strengthen feature representation; (2) a Focal Modulation attention mechanism that improves the network’s responsiveness to target regions and contextual information; and (3) a self-developed Dynamic Interaction Head (DIH) that enhances detection accuracy and spatial adaptability for small objects. In addition, a lightweight unsupervised image enhancement algorithm, Zero-DCE (Zero-Reference Deep Curve Estimation), is introduced prior to training to improve image contrast and detail, and Generalized Intersection over Union (GIoU) is employed as the bounding box regression loss. To evaluate the effectiveness of DRF-YOLO, experiments are conducted on two representative low-light datasets: ExDark and the nighttime subset of BDD100K, which include images of vehicles, pedestrians, and other road objects. Results show that DRF-YOLO achieves improvements of 3.4% and 2.3% in mAP@0.5 compared with the original YOLOv11, demonstrating enhanced robustness and accuracy in degraded environments while maintaining lightweight efficiency.

1. Introduction

With the rapid advancement of intelligent driving technology in recent years, object detection has been increasingly applied in autonomous driving, intelligent surveillance, and related fields. However, in complex degraded environments such as nighttime, low-light, haze, or polluted road surfaces (e.g., harbor pavements with tire marks or unclean conditions), the significant degradation of image quality makes it difficult for traditional detection algorithms to effectively extract meaningful features, leading to a marked decline in both detection accuracy and robustness [1,2]. In particular, “small objects”—defined in this study as targets occupying relatively few pixels in an image, such as distant pedestrians, traffic signs, and far-away vehicles—are more likely to be overlooked due to their low resolution and inconspicuous features [3,4,5]. This makes small-object detection under degraded visual conditions a critical weakness in current computer vision systems. Moreover, since autonomous driving and surveillance applications require reliable detection across different scales of objects (from pedestrians to large vehicles), robust multi-scale feature representation is equally important. Therefore, accurately locating and recognizing both small and larger targets in low-quality images has become a key challenge in contemporary visual perception research, motivating the need for more effective and consistent detection frameworks [6,7,8].
Recent studies on small object detection have primarily focused on feature fusion and lightweight network designs. Methods such as ASFF [9] and CSPNet [10] have achieved notable results in general scenarios. However, issues such as limited adaptability and inefficient multi-scale feature interaction remain prominent in degraded environments. To enhance model robustness, Chen et al. [11] proposed Dual Perturbation Optimization (DPO), which minimizes loss function sharpness by simultaneously applying adversarial perturbations to both model weights and the input feature space, significantly improving the generalization of detection models in noisy and degraded conditions. Similarly, Shi et al. [12] introduced ASG-YOLOv5, which enhances the detection accuracy and real-time performance for small objects in UAV remote sensing images by integrating a dynamic context attention module and a spatial gating fusion module. Khalili and Smyth [5] presented SOD-YOLOv8, combining an efficient generalized feature pyramid network (GFPN), a high-resolution detection layer, and an EMA attention mechanism, while proposing a novel Powerful IoU loss to improve the detection accuracy of small and medium-sized objects in traffic scenes at minimal computational cost.
Object detection, as a fundamental task in computer vision, aims to automatically locate and classify objects within images or videos. Its core techniques include backbone network design, data augmentation strategies, loss function optimization, and model compression and deployment. Among these, the YOLO (You Only Look Once) series has garnered significant attention as a representative one-stage object detection framework due to its high efficiency and real-time performance.
Moreover, several researchers have turned their attention to object detection in degraded and low-resolution environments. Liu et al. [13] proposed Image-Adaptive YOLO (IA-YOLO), based on the YOLOv3 [14] framework, by incorporating a differentiable image enhancement branch that enables adaptive preprocessing and joint training with the detection head, thereby improving performance in hazy and dim conditions. Further, Liu et al. [15] introduced Dark YOLO, which integrates SimAM local attention and dimensional complementary attention, alongside a SCINet-based cascade illumination enhancement structure, enabling robust detection in extremely dark scenarios. Lan et al. [16] employed decoupled contrast translation to enhance detection accuracy in nighttime surveillance. RestoreDet, proposed by Cui et al. [17], improves detection stability on low-resolution degraded images by designing a degradation-equivariant representation mechanism and integrating a super-resolution reconstruction branch for auxiliary training. Recently, Wang et al. [18] proposed UniDet-D, which incorporates a unified dynamic spectral attention mechanism to adaptively focus on critical spectral components under complex conditions such as rain, fog, darkness, and dust. This design achieves end-to-end fusion of image enhancement and object detection, ensuring consistent high performance across various degraded environments. Additionally, Hong et al. [19] introduced an illumination-invariant learning strategy that significantly enhances robustness under low-light conditions by decoupling feature extraction from illumination cues, while Tran et al. [20] combined a low-light enhancement framework with fisheye camera detection for intelligent surveillance, further emphasizing the importance of illumination-adaptive modules in real-world perception systems.
In recent years, the integration of LiDAR and image-based perception has gained significant attention in autonomous driving and UAV research. LiDAR-based detection excels at providing accurate depth and spatial geometry, whereas image-based methods are superior in semantic and texture representation. Recent surveys such as Wang et al. [21] highlight that multimodal fusion of LiDAR and visual cues can substantially enhance robustness under adverse weather or low-light conditions. However, the computational cost and sensor calibration complexity often limit the deployment of LiDAR-based systems in small-scale UAVs or embedded platforms. Conversely, image-based detectors, such as DRF-YOLO, can achieve competitive performance through efficient illumination adaptation and multi-scale enhancement, providing a lightweight yet effective alternative for small-object detection tasks. Furthermore, Nikouei et al. [3] and Mukherjee et al. [22] provide comprehensive overviews and multimodal strategies for small object detection under occlusion, blur, and illumination degradation—issues that this work directly addresses through architectural and loss function optimization.
Since the introduction of YOLOv1 by Redmon et al. [23], the YOLO family has evolved rapidly. It pioneered the integration of object localization and classification into a unified end-to-end neural network, eliminating the need for region proposal stages as used in traditional two-stage detectors. Unlike two-stage detectors such as Faster R-CNN [24], YOLO directly divides the image into grids for joint object classification and bounding box regression, significantly improving inference speed while maintaining competitive accuracy—an advantage that makes YOLO particularly suitable for industrial applications requiring real-time feedback.
The evolution of the YOLO family has led to notable improvements in accuracy, speed, and architectural efficiency. YOLOv2 [25] introduced Batch Normalization (BN) in all convolutional layers and adopted high-resolution inputs (448 × 448), leading to an overall mAP improvement of approximately 6.4%. Specifically, BN provided regularization and faster convergence (+2.4% mAP), and a high-resolution classifier pretrained on ImageNet contributed a further +4% gain. YOLOv3 adopted a new backbone, Darknet-53, and a multi-scale detection strategy (from 13 × 13 to 52 × 52 grid sizes), alongside residual connections, resulting in a performance boost of around 12% on the PASCAL VOC dataset [14].
YOLOv5 incorporated Automatic Mixed Precision (AMP) training [26], which reduced training time by approximately 40% using FP16/FP32 computation. YOLOv8 further innovated by unifying detection, segmentation, and pose estimation into a single architecture, while supporting dynamic model scaling from 2.3 M to 43.7 M parameters to adapt across various computational platforms. YOLOv9 introduced Programmable Gradient Information (PGI) and the Generalized Efficient Layer Aggregation Network (GELAN) [27], enhancing gradient modeling and detection accuracy while maintaining computational efficiency. YOLOv10 removed the conventional Non-Maximum Suppression (NMS) module and introduced a dual assignment strategy and a lightweight classification head, further improving inference speed [28]. The most recent YOLOv11 optimized its data augmentation policies, network structure, and loss functions, achieving a better trade-off between accuracy and efficiency in challenging environments.
Despite considerable progress, existing detection algorithms still face challenges when detecting small objects under combined degradation conditions such as low light, haze, and nighttime scenarios. The performance degradation and limited generalization in such environments motivate the need for more robust models. In this work, we propose DRF-YOLO (Degradation-Robust and Feature-enhanced YOLO), a novel detection framework based on YOLOv11, specifically tailored for small object detection in degraded environments. Our main contributions are summarized as follows:
  • Lightweight Feature Module: We propose CSP-MSEE (CSP Multi-Scale Edge Enhancement), a lightweight module based on the original C3k2 block from YOLOv11. This module integrates multi-scale pooling and edge-aware enhancements to improve feature representation and boundary perception while reducing parameter count and computational complexity.
  • Multi-Scale Attention Mechanism: We replace the original SPPF module with the Focal Modulation [29] mechanism. This attention-based design enhances the model’s sensitivity to both local and global semantic contexts and improves detection robustness for small objects across varying scales.
  • Dynamic Head Design: A novel Dynamic Interaction Head (DIH) is designed, integrating task alignment and multi-task interaction. It utilizes shared convolutions, scale-aware encoding [30], Group Normalization [31], and Deformable Convolution v2 (DCNv2) [32], enabling more flexible feature fusion between classification and regression tasks, especially for targets with varying sizes and shapes.
  • Robustness in Degraded Environments: We incorporate an unsupervised image enhancement algorithm, Zero-DCE [33], during training to improve visibility in low-light and low-contrast conditions. Additionally, we replace traditional IoU/CIoU with the GIoU loss [34] to improve bounding box localization in foggy and blurred edge conditions.
In summary, DRF-YOLO achieves a balanced trade-off among accuracy, robustness, and computational efficiency through structural innovation and feature enhancement strategies, offering a promising solution for small object detection under various degraded environmental conditions.

2. YOLOv11 Model Architecture and Characteristics

YOLOv11, the latest lightweight object detection model in the YOLO series, not only inherits the real-time performance and accuracy advantages of its predecessors but also introduces significant optimizations in network structure and computational efficiency. While maintaining the series’ hallmark of rapid inference, YOLOv11 substantially improves detection precision and processing speed through enhancements in neural architecture and innovative training strategies. These improvements make it highly suitable for complex scenarios such as security surveillance, autonomous driving, and aerial inspection.
As illustrated in Figure 1, the overall architecture of YOLOv11 consists of three primary components: Backbone, Neck, and Head.
  • Backbone: A modified CSPDarknet-based architecture utilizing a lightweight C3k2 module is adopted. Through the integration of cross-stage partial connections and bottleneck structures, the network effectively enhances feature extraction while minimizing redundant computation. The adoption of the SiLU activation function and fused convolution strategies further improves the representational efficiency of shallow layers.
  • Neck: The Neck component retains the multi-scale feature aggregation mechanisms from PANet (Path Aggregation Network) and FPN (Feature Pyramid Network). By establishing horizontal connections across different hierarchical layers, it strengthens the model’s capacity to represent objects at various scales. Additionally, channel dimensions and path structures are optimized to meet edge deployment requirements.
  • Head: The detection head preserves the classic YOLO-style dense prediction framework while incorporating Distribution Focal Loss (DFL) to achieve more accurate bounding box regression. Each feature layer simultaneously outputs object confidence scores and class probabilities. Nonetheless, under challenging conditions—especially in small object detection—the head still exhibits limitations in feature responsiveness and spatial perception.
In terms of training strategy, YOLOv11 benefits from mature data augmentation techniques inherited from prior versions, including Mosaic, MixUp, and HSV perturbations. It also employs a combination of BCEWithLogitsLoss and GIoU/CIoU loss functions, significantly enhancing its robustness and generalization capability. Although YOLOv11 performs exceptionally well under normal lighting conditions, its performance in detecting small objects and handling low-light or degraded environments still leaves room for improvement [3,35,36].
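For reference, the DFL-based box decoding used in the detection head can be summarized with a short sketch: the head predicts a discrete probability distribution over a fixed number of bins for each of the four box distances, and the decoded offset is the expectation of that distribution. The bin count (reg_max = 16) and tensor layout below are illustrative assumptions based on common YOLOv8/YOLOv11-style implementations, not values reported in this paper.

```python
import torch
import torch.nn.functional as F

def dfl_decode(box_logits: torch.Tensor, reg_max: int = 16) -> torch.Tensor:
    """Decode DFL box logits into continuous offsets.

    box_logits: (batch, 4 * reg_max, num_anchors) raw regression output.
    Returns:    (batch, 4, num_anchors) expected left/top/right/bottom offsets.
    """
    b, _, a = box_logits.shape
    # Split the 4*reg_max channels into 4 sides, each with a reg_max-bin distribution.
    logits = box_logits.view(b, 4, reg_max, a)
    probs = F.softmax(logits, dim=2)
    # The expectation over bin indices 0..reg_max-1 gives a sub-bin continuous offset.
    bins = torch.arange(reg_max, dtype=probs.dtype).view(1, 1, reg_max, 1)
    return (probs * bins).sum(dim=2)

# Example: decode a dummy prediction for 8400 anchor points.
offsets = dfl_decode(torch.randn(1, 64, 8400))
print(offsets.shape)  # torch.Size([1, 4, 8400])
```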

3. Improved DRF-YOLO Algorithm

To enhance the robustness and accuracy of object detection models under nighttime, low-light, and various degraded conditions, this paper proposes a novel lightweight detection algorithm named DRF-YOLO (Degradation-Robust and Feature-enhanced YOLO), based on YOLOv11, as illustrated in Figure 2. The proposed algorithm introduces systematic improvements in network architecture, feature enhancement strategies, and environmental adaptability.
First, a self-designed module named CSP-MSEE (CSP Multi-Scale Edge Enhancement) is introduced to replace the original C3k2 structure in the backbone. This module integrates multi-scale feature extraction with edge enhancement. By employing multi-scale pooling techniques such as nn.AdaptiveAvgPool2d, the model captures features across various receptive fields. Coupled with the EdgeEnhancer sub-module, it precisely highlights object contours. The extracted multi-scale features are then aligned and fused through convolutional layers, significantly enhancing both edge sensitivity and multi-scale feature representation.
Second, to further improve the model’s ability to perceive key object regions, the original Spatial Pyramid Pooling Fast (SPPF) module is replaced with a Focal Modulation attention mechanism. Unlike conventional spatial pooling, Focal Modulation leverages both local spatial cues and global context to flexibly construct semantic representations of salient regions. This not only strengthens contextual understanding in cluttered scenes but also improves detection of small objects and enhances robustness in low-contrast and noisy conditions.
For the detection head, a new architecture called Dynamic Interaction Head (DIH) is designed. It incorporates the Adaptive Spatial Feature Fusion (ASFF) mechanism along with a P2 layer for small object detection [37], enabling cross-scale feature fusion with precise spatial alignment. This yields improvements in both localization precision and detail recovery in small object scenarios.
At the image preprocessing stage, an unsupervised enhancement algorithm called Zero-DCE is employed. It performs end-to-end brightness and contrast correction for low-light inputs, thereby optimizing the input quality in nighttime environments.
In terms of the loss function, the original Complete IoU (CIoU) regression loss [38] is replaced with Generalized IoU (GIoU) to improve bounding box localization and mitigate spatial ambiguity in degraded images.
Through these enhancements, DRF-YOLO achieves improved detection accuracy for small objects and exhibits strong robustness in adverse environments while maintaining a lightweight architecture. Experimental results demonstrate its superior performance in degraded conditions such as nighttime, low-light, and foggy scenarios.

3.1. CSP-MSEE Module

To enhance the network’s ability to perceive and represent multi-scale edge information, this paper introduces a lightweight feature enhancement module named CSP-MSEE (CSP Multi-Scale Edge Enhancement), as illustrated in Figure 3. The module is integrated into the C3k and C3k2 components of the YOLOv11 backbone, combining three core advantages—multi-scale feature extraction, edge information enhancement, and an efficient convolutional structure. This design effectively reduces computational overhead while significantly improving detection performance for small targets and blurred boundaries in degraded environments.
The CSP-MSEE module comprises two primary components:
  • Multi-Scale Feature Extraction Channel: This sub-module employs adaptive average pooling at multiple scales (e.g., 3 × 3, 6 × 6, 9 × 9, and 12 × 12) to capture contextual information across diverse receptive fields. Channel adaptation and local feature modeling are achieved through 1 × 1 pointwise convolutions followed by 3 × 3 depthwise separable convolutions, ensuring efficient processing.
  • EdgeEnhancer Module: As shown in Figure 4, this sub-module first applies local average pooling to extract low-frequency background responses from the input feature maps. It then subtracts this smoothed representation from the original feature map to isolate high-frequency edge details. The resulting edge signals are further amplified using a Sigmoid-activated convolutional layer. Finally, the enhanced edge features are combined with the original input via residual fusion to enrich the final representation.
All scale-enhanced feature maps are upsampled to a unified spatial resolution, concatenated with local feature branches, and fused through a 3 × 3 convolution layer to produce the module’s output.
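For concreteness, the following is a minimal PyTorch sketch of the two components described above, with multi-scale adaptive pooling branches and the EdgeEnhancer fused by a 3 × 3 convolution. The kernel sizes, channel widths, and exact fusion order are illustrative assumptions; the released CSP-MSEE module may differ in detail.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EdgeEnhancer(nn.Module):
    """Isolate high-frequency edges by subtracting a locally smoothed map,
    re-weight them with a Sigmoid-gated 1x1 convolution, and add them back."""
    def __init__(self, channels: int):
        super().__init__()
        self.smooth = nn.AvgPool2d(kernel_size=3, stride=1, padding=1)
        self.gate = nn.Sequential(nn.Conv2d(channels, channels, 1), nn.Sigmoid())

    def forward(self, x):
        edges = x - self.smooth(x)           # high-frequency residual
        return x + self.gate(edges) * edges  # residual fusion of amplified edges

class MSEE(nn.Module):
    """Multi-scale context pooling (3/6/9/12) plus edge enhancement, fused by a 3x3 conv."""
    def __init__(self, channels: int, pool_sizes=(3, 6, 9, 12)):
        super().__init__()
        self.pool_sizes = pool_sizes
        self.branches = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(channels, channels, 1),                              # pointwise channel adaptation
                nn.Conv2d(channels, channels, 3, padding=1, groups=channels),  # depthwise local modeling
            ) for _ in pool_sizes
        )
        self.edge = EdgeEnhancer(channels)
        self.fuse = nn.Conv2d(channels * (len(pool_sizes) + 1), channels, 3, padding=1)

    def forward(self, x):
        h, w = x.shape[-2:]
        feats = [self.edge(x)]
        for size, branch in zip(self.pool_sizes, self.branches):
            ctx = branch(F.adaptive_avg_pool2d(x, size))
            feats.append(F.interpolate(ctx, size=(h, w), mode="bilinear", align_corners=False))
        return self.fuse(torch.cat(feats, dim=1))
```

In the full CSP-MSEE module, a block of this kind sits inside the C3k/C3k2 structure, which retains the residual connection and channel-splitting strategy described in this section.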
In terms of design philosophy, CSP-MSEE inherits the residual connection and channel-splitting strategies from the original C3 module in YOLOv11, maintaining a balance between network depth and width. This architecture not only improves the model’s sensitivity to multi-scale and edge-level details but also preserves computational efficiency, thereby offering robust support for visual perception tasks under complex lighting and degraded environmental conditions.

3.2. Focal Modulation Module

To enhance the model’s ability to selectively model critical information across diverse receptive fields, we propose the Focal Modulation Attention Enhancement Module as a key feature interaction component. This module leverages a gating-guided multi-scale contextual modeling mechanism to effectively regulate local–global information fusion while maintaining low computational complexity. Consequently, it improves the network’s adaptability to complex backgrounds and multi-scale object scenarios.
Unlike traditional self-attention mechanisms that rely on pairwise token similarity calculations, the proposed Focal Modulation module employs a combination of focal context modulation, gated aggregation, and element-wise affine transformations. Specifically, it utilizes multi-scale contextual convolutions to extract features from varying receptive fields. Learnable gating functions are then applied to adaptively adjust the weighting of contextual features across layers, enhancing the model’s focus on task-relevant regions.
A global context aggregation branch—comprising global average pooling and non-linear activation—is further incorporated to strengthen long-range dependency modeling, enabling more comprehensive visual field control. The architecture of the proposed module is illustrated in Figure 5.
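A simplified sketch of this mechanism, following the focal modulation formulation of Yang et al. [29], is given below. The number of focal levels, kernel sizes, and normalization choices are illustrative assumptions rather than the exact configuration used in DRF-YOLO.

```python
import torch
import torch.nn as nn

class FocalModulation(nn.Module):
    """Simplified focal modulation: hierarchical depthwise context, gated
    aggregation, a global-context branch, and element-wise modulation of a query."""
    def __init__(self, dim: int, focal_levels: int = 3, kernel: int = 3):
        super().__init__()
        self.levels = focal_levels
        # One projection produces the query, the context stem, and per-level gates.
        self.proj_in = nn.Conv2d(dim, 2 * dim + focal_levels + 1, 1)
        self.context = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(dim, dim, kernel + 2 * l, padding=(kernel + 2 * l) // 2, groups=dim),
                nn.GELU(),
            ) for l in range(focal_levels)
        )
        self.to_modulator = nn.Conv2d(dim, dim, 1)
        self.proj_out = nn.Conv2d(dim, dim, 1)

    def forward(self, x):
        q, ctx, gates = torch.split(
            self.proj_in(x), [x.size(1), x.size(1), self.levels + 1], dim=1
        )
        agg = 0
        for l, layer in enumerate(self.context):
            ctx = layer(ctx)                                 # growing receptive field per level
            agg = agg + ctx * gates[:, l:l + 1]              # gated aggregation of local context
        global_ctx = ctx.mean(dim=(2, 3), keepdim=True)      # global average context branch
        agg = agg + nn.functional.gelu(global_ctx) * gates[:, self.levels:]
        return self.proj_out(q * self.to_modulator(agg))     # element-wise modulation of the query

# Example: modulate a 64-channel feature map.
out = FocalModulation(64)(torch.randn(1, 64, 40, 40))
print(out.shape)  # torch.Size([1, 64, 40, 40])
```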

3.3. Self-Developed Task-Aligned Dynamic Interaction Head (DIH)

To further improve the generalization ability and localization accuracy of detection heads under multi-scale targets and complex backgrounds, we propose a self-developed Dynamic Interaction Head (DIH). DIH integrates key techniques such as task decomposition mechanisms, dynamic convolutional alignment, offset-guided attention, and class probability enhancement. Its design aims to efficiently decouple task-specific features, achieve spatial alignment, and enable precise multi-scale fusion.
The DIH adopts adaptive spatial feature fusion strategies similar to ASFF and the improved YOLOv8 detection head. It is further enhanced with multi-scale feature pyramid networks and attention mechanisms to improve detection robustness across various target sizes. The architecture follows the processing pipeline of “sharing, decoupling, alignment, and fusion”, as illustrated in Figure 6. The key components of DIH are described below:
  • Shared Feature Extraction Module: A two-layer Conv_GN structure extracts shallow features from the input and forms a unified feature representation shared by downstream tasks.
  • Task Decomposition Module: Inspired by DyHead, two TaskDecomposition modules are used to decouple classification and regression branches. Global average pooling is used to guide cross-channel attention, allowing task-specific adaptation.
  • Offset-Guided Dynamic Convolution Alignment Module: The regression branch incorporates DyDCNv2, combining offsets and masks for spatially adaptive convolution. A 3 × 3 convolutional layer dynamically generates spatial offsets and masks to adjust sampling points and aggregation weights, improving spatial alignment and edge localization (a minimal sketch of this alignment step is given after this list).
  • Class Probability Alignment Module (CLSProbAlign): A class probability heatmap is explicitly constructed via two convolution layers. This heatmap is element-wise multiplied with classification features to guide them toward target regions and reduce background interference.
  • Prediction Branches:
    • The regression branch outputs bounding box logits (prior to DFL decoding) through the cv2 convolution branch.
    • The classification branch predicts class confidence scores through the cv3 convolution branch.
  • Scale Regulator (Scale): Before final output, each feature layer includes a learnable Scale parameter to adaptively fine-tune the offset amplitude of predicted bounding boxes.
  • DFL Decoder: A Distribution Focal Loss (DFL) decoder is employed to transform the regression logits into continuous bounding box offset values.
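The offset-guided alignment step in the regression branch can be sketched as follows, using torchvision’s modulated deformable convolution as a stand-in for DyDCNv2. The channel layout of the predicted offsets and masks follows the DCNv2 convention; the GroupNorm width and activation are assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class OffsetGuidedAlign(nn.Module):
    """A 3x3 conv predicts per-location sampling offsets and modulation masks,
    which drive a modulated deformable convolution (DCNv2-style) for spatial alignment."""
    def __init__(self, in_ch: int, out_ch: int, kernel: int = 3):
        super().__init__()
        pad = kernel // 2
        # 2 offset channels (x, y) plus 1 mask channel per kernel sampling point.
        self.offset_mask = nn.Conv2d(in_ch, 3 * kernel * kernel, kernel, padding=pad)
        self.dcn = DeformConv2d(in_ch, out_ch, kernel, padding=pad)
        self.norm = nn.GroupNorm(16, out_ch)

    def forward(self, x):
        om = self.offset_mask(x)
        k2 = om.size(1) // 3
        offset, mask = om[:, : 2 * k2], om[:, 2 * k2:].sigmoid()  # bounded modulation weights
        return self.norm(self.dcn(x, offset, mask)).relu()

# Example: align a P3-level feature map (assumed 64 channels, 80x80).
feat = torch.randn(1, 64, 80, 80)
aligned = OffsetGuidedAlign(64, 64)(feat)
print(aligned.shape)  # torch.Size([1, 64, 80, 80])
```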
Although the proposed DIH draws inspiration from dynamic detection heads such as DyHead and ATSS, it introduces several key differences. Unlike DyHead, which primarily relies on iterative attention layers for task interaction, DIH employs a task decomposition mechanism that explicitly separates classification and regression through dedicated modules, ensuring clearer task boundaries and reduced feature entanglement. Compared to ATSS Head, which focuses on adaptive sample selection to balance positive and negative anchors, DIH emphasizes offset-guided dynamic convolution alignment, allowing spatially precise feature aggregation at the pixel level. Moreover, the integration of a Class Probability Alignment branch is unique to DIH, guiding classification features toward target regions while suppressing background noise. These design choices collectively differentiate DIH from existing approaches, highlighting its stronger adaptability to degraded environments and small-object detection scenarios.
By combining DyDCNv2 with offset-guided attention, DIH achieves pixel-level precise spatial alignment, enhancing edge regression accuracy. The TaskDecomposition modules enable classification to focus on semantic structure, while regression emphasizes boundary geometry. Furthermore, the Class Probability Alignment mechanism enhances spatial attention guidance by highlighting target regions and suppressing background noise, ultimately improving detection precision in complex scenarios.

3.4. Design Rationale and Synergistic Effect

The proposed DRF-YOLO is not merely a collection of isolated modules but a carefully designed architecture where each component addresses specific challenges while complementing others. The CSP-MSEE module strengthens multi-scale feature extraction and edge enhancement, ensuring that even in low-contrast regions, object contours are preserved. Building upon this foundation, the Focal Modulation module adaptively emphasizes task-relevant regions by combining local and global contextual cues, which enhances semantic understanding in cluttered or degraded scenes. The Dynamic Interaction Head (DIH) then leverages these enriched representations by performing precise cross-scale feature fusion and task-aligned dynamic alignment, thereby improving both classification confidence and bounding box localization, especially for small or blurred objects. Finally, the Zero-DCE preprocessing step ensures that the input images themselves are optimized for subsequent feature extraction, further amplifying the effectiveness of downstream modules.
By integrating these modules into a unified framework, DRF-YOLO achieves a synergistic improvement: edge-aware multi-scale features are contextually refined and dynamically aligned across detection tasks, while input enhancement reduces the burden on feature extractors. This holistic design philosophy explains why DRF-YOLO demonstrates superior robustness and accuracy in adverse conditions such as nighttime, low-light, and foggy environments, without significantly increasing model complexity.

3.5. GIoU Loss Function

YOLOv11 employs the Complete Intersection over Union (CIoU) loss as the default for bounding box regression. CIoU models target geometry, bounding box overlap, and center-to-center distance, and it provides relatively stable optimization under standard lighting conditions. However, in scenarios involving small-scale targets, complex backgrounds, or sparse object distributions, CIoU becomes vulnerable to interference. Its robustness deteriorates significantly on low-quality inputs, such as images affected by poor illumination or degraded weather, resulting in a notable decline in bounding box regression accuracy.
To address this limitation and enhance object detection performance in adverse visual environments, this paper adopts the Generalized Intersection over Union (GIoU) loss as a substitute for the CIoU metric. GIoU improves training feedback especially for non-overlapping or partially overlapping bounding boxes. Unlike conventional IoU-based metrics, GIoU incorporates an additional penalty term based on the smallest enclosing box, allowing effective gradient propagation even when predicted and ground truth boxes do not intersect. This characteristic is particularly beneficial in scenarios with occlusion or localization uncertainty.
The Generalized Intersection over Union (GIoU) loss, proposed by Rezatofighi et al. [34], extends the traditional IoU metric by incorporating the spatial relationship between the predicted box and the ground truth box. Unlike the standard IoU, which only measures the ratio of intersection over union between the predicted bounding box and the ground truth, GIoU introduces a penalty based on the area of the smallest enclosing box that contains both boxes.
The GIoU is defined as:
$$\mathrm{GIoU} = \mathrm{IoU} - \frac{A_{c} - A_{\mathrm{union}}}{A_{c}}$$
where $\mathrm{IoU} = \frac{A_{\mathrm{intersection}}}{A_{\mathrm{union}}}$, and $A_{c}$ denotes the area of the smallest enclosing box covering both the predicted box and the ground truth box. The subtraction term penalizes non-overlapping regions, enabling effective gradient updates even when there is no intersection between boxes.
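As a worked example of the formula above, a minimal implementation for axis-aligned boxes in (x1, y1, x2, y2) format is shown below; it also illustrates that two disjoint boxes still produce a negative, informative GIoU value.

```python
import torch

def giou(box1: torch.Tensor, box2: torch.Tensor) -> torch.Tensor:
    """GIoU for boxes in (x1, y1, x2, y2) format; supports broadcasting over leading dims."""
    # Intersection area
    lt = torch.max(box1[..., :2], box2[..., :2])
    rb = torch.min(box1[..., 2:], box2[..., 2:])
    inter = (rb - lt).clamp(min=0).prod(dim=-1)
    # Union area
    area1 = (box1[..., 2:] - box1[..., :2]).prod(dim=-1)
    area2 = (box2[..., 2:] - box2[..., :2]).prod(dim=-1)
    union = area1 + area2 - inter
    iou = inter / union
    # Smallest enclosing box
    lt_c = torch.min(box1[..., :2], box2[..., :2])
    rb_c = torch.max(box1[..., 2:], box2[..., 2:])
    area_c = (rb_c - lt_c).prod(dim=-1)
    return iou - (area_c - union) / area_c

# Two disjoint boxes still receive a useful (negative) GIoU signal:
print(giou(torch.tensor([0., 0., 2., 2.]), torch.tensor([3., 3., 5., 5.])))  # ~ -0.68
```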
This design allows the model to receive gradient updates even in challenging cases such as object occlusion or extreme aspect ratios. By incorporating GIoU into the loss computation, YOLOv11 demonstrates improved performance in detecting targets under poor lighting conditions, enhancing its overall robustness in degraded scenarios.
As illustrated in Figure 7, the structure of the Generalized Intersection over Union (GIoU) loss function is defined as follows: the union area is determined by the intersection between the predicted bounding box and the ground truth box, while the minimum enclosing region (denoted as MWER) refers to the smallest region that simultaneously encloses both boxes. By incorporating the area of the MWER, the GIoU metric not only captures the overlapping region but also accounts for spatial discrepancies in shape, position, and scale between the two boxes.
Specifically, GIoU exhibits the following properties: when the predicted and ground truth boxes perfectly overlap, GIoU reaches 1; when the boxes do not intersect, GIoU drops below 0 and approaches −1 as the predicted box diverges further from the ground truth in position, shape, or size, reflecting poor localization performance.
The GIoU loss function offers notable advantages in complex environments such as low-light conditions and degraded images. First, in scenarios where environmental noise causes bounding box predictions to deviate from their true positions, the enclosing region penalty of GIoU effectively guides the optimization direction and suppresses regression divergence during training. Furthermore, since GIoU evaluates geometric consistency, it is less sensitive to noisy features and yields more robust gradient signals than traditional IoU.
Second, in degraded scenes where the contrast between foreground objects and background is low, models are prone to false positives. GIoU introduces stronger shape constraints that prevent predicted boxes from expanding arbitrarily into background regions, thus enhancing detection robustness. For small object detection, even minor displacement of bounding boxes can result in accuracy loss. Traditional IoU often suffers from gradient vanishing in such low-overlap conditions. In contrast, by considering the area of the minimum enclosing rectangle, GIoU still provides meaningful gradients, thereby improving optimization efficiency and localization precision during training.
In conclusion, by introducing a geometric penalty through the enclosing region, the GIoU loss function addresses the limitations of traditional IoU, particularly in degraded environments and small-object detection tasks. Its robust anti-interference capability and stable gradient design make it an effective and reliable objective function for enhancing target detection accuracy in complex scenarios.

3.6. Zero-DCE: An Unsupervised Image Enhancement Algorithm

In low-light environments, the overall brightness, contrast, and detail of images are often significantly degraded, which directly affects the accuracy and robustness of subsequent object detection models. To address this issue, image enhancement has become a vital preprocessing step. Traditional methods—such as histogram equalization and Retinex-based techniques [39]—typically rely on handcrafted rules, which limits their adaptability and generalization capability in diverse scenarios.
In recent years, deep learning-based enhancement methods have gained increasing attention for their superior performance. Among them, Zero-DCE (Zero-Reference Deep Curve Estimation) stands out as an end-to-end, reference-free image enhancement algorithm specifically designed for low-light conditions. It leverages a lightweight convolutional network to effectively suppress noise and artifacts, thereby improving image brightness, contrast, and detail without requiring paired supervision.
Proposed by Guo et al. [33], Zero-DCE redefines low-light enhancement as a curve estimation task rather than a conventional image-to-image translation problem. It learns a set of pixel-wise, content-aware illumination adjustment curves under fully unsupervised conditions. These curves are applied to the input image to produce an enhanced output with better perceptual quality. Importantly, Zero-DCE requires no paired training data and instead relies on intrinsic brightness statistics and structural regularities for self-supervision.
The core idea is to iteratively refine the image through a learned set of pixel-wise curves $\{A_n(x)\}$, where each $A_n(x)$ is a curve parameter corresponding to pixel location $x$. These curves are applied to each pixel of the original image to yield the enhanced result. The formulation of the enhancement operation is defined in Equation (2).
$$LE_n(x) = LE_{n-1}(x) + A_n(x)\,LE_{n-1}(x)\,\big(1 - LE_{n-1}(x)\big)$$
Here, $LE_{n-1}(x)$ denotes the pixel value at position $x$ from the previous iteration (with $LE_0(x)$ being the input image), and $A_n(x)$ are the curve parameters learned by the deep network. This formulation enables Zero-DCE to adaptively enhance a wide variety of image types—including indoor scenes, outdoor environments, and nighttime images—making it a powerful tool in fields such as computer vision, image recognition, and artificial intelligence.
In this study, the number of enhancement iterations $n$ is set to 8, which yields relatively optimal enhancement performance. At each iteration, $LE_n(x)$ represents the enhanced image obtained from the previous result $LE_{n-1}(x)$, while $A_n(x)$ denotes the corresponding curve parameter map, which has the same spatial dimensions as the input image. The overall structure of the Zero-DCE enhancement process is illustrated in Figure 8.
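The iterative application of Equation (2) can be sketched as follows, assuming a curve-estimation network that outputs 3n curve-parameter maps (one RGB map per iteration, with n = 8 in this study); the network itself is omitted, and random maps stand in for its output purely for illustration.

```python
import torch

def apply_le_curves(image: torch.Tensor, curves: torch.Tensor, n: int = 8) -> torch.Tensor:
    """Iteratively apply the light-enhancement curves of Equation (2).

    image:  (B, 3, H, W) low-light input in [0, 1].
    curves: (B, 3 * n, H, W) per-pixel curve parameters A_n(x) predicted by the curve network.
    """
    le = image
    for i in range(n):
        a = curves[:, 3 * i: 3 * (i + 1)]   # A_n(x) for this iteration (one map per RGB channel)
        le = le + a * le * (1.0 - le)       # LE_n = LE_{n-1} + A_n * LE_{n-1} * (1 - LE_{n-1})
    return le.clamp(0.0, 1.0)

# Example with random curve maps standing in for the network output.
img = torch.rand(1, 3, 256, 256) * 0.2      # a dark image
enhanced = apply_le_curves(img, torch.rand(1, 24, 256, 256) * 0.5)
print(enhanced.mean() > img.mean())         # brightness increased
```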
Another key feature of Zero-DCE is its unsupervised loss function design, which does not rely on reference images. It consists of four components:
  • Spatial Consistency Loss ($L_{\mathrm{spa}}$)
    This loss evaluates the variation in pixel differences between adjacent regions before and after enhancement. The goal is to preserve the spatial consistency of local image structures. Specifically, the image is divided into K local regions, and the pixel differences between each region and its four neighboring regions (top, bottom, left, and right) are computed and averaged. Let I be the input image and Y the enhanced image. The average pixel values are obtained via average pooling, typically using 4 × 4 regions implemented with convolutional layers.
    $$L_{\mathrm{spa}} = \frac{1}{K}\sum_{i=1}^{K}\sum_{j\in\Omega(i)}\Big(\big|Y_i - Y_j\big| - \big|I_i - I_j\big|\Big)^2$$
  • Exposure Control Loss ($L_{\mathrm{exp}}$)
    This loss constrains the image exposure level to avoid over- or under-enhancement. A target exposure value is predefined, and the brightness difference between the local average and this target value is penalized. Each local region is of size 16 × 16 . Let M denote the total number of such regions.
    $$L_{\mathrm{exp}} = \frac{1}{M}\sum_{k=1}^{M}\big|Y_k - E\big|$$
  • Color Constancy Loss ($L_{\mathrm{col}}$)
    This loss enforces the Gray-World color constancy hypothesis: for a natural color image, the average values of the R, G, and B channels should tend toward the same gray level. Deviations between the channel means of the enhanced image are therefore penalized, as shown below.
    $$L_{\mathrm{col}} = \sum_{(p,q)\in\varepsilon}\big(J^{p} - J^{q}\big)^2,\quad \varepsilon = \{(R,G),(R,B),(G,B)\}$$
  • Illumination Smoothness Loss ($L_{\mathrm{tvA}}$)
    This loss enforces spatial smoothness in the learned enhancement curves by constraining the gradient magnitude of the curve parameter map $A_n(x)$ along the horizontal and vertical directions. Let $N$ be the total number of iterations, and $\nabla_x$, $\nabla_y$ denote the horizontal and vertical gradient operators, respectively.
    $$L_{\mathrm{tvA}} = \frac{1}{N}\sum_{n=1}^{N}\sum_{c\in\delta}\Big(\big|\nabla_x A_n^{c}\big| + \big|\nabla_y A_n^{c}\big|\Big)^2,\quad \delta = \{R,G,B\}$$
The overall loss function is a weighted sum of the above components:
$$L_{\mathrm{total}} = L_{\mathrm{spa}} + L_{\mathrm{exp}} + W_{\mathrm{col}}\,L_{\mathrm{col}} + W_{\mathrm{tvA}}\,L_{\mathrm{tvA}}$$
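As an illustration of how two of these terms are computed and combined, the sketch below implements the exposure-control and color-constancy losses and their weighted sum. The target exposure E = 0.6, the 16 × 16 patch size, and the weights are common choices from the Zero-DCE literature, used here as assumptions; the spatial-consistency and smoothness terms are omitted for brevity.

```python
import torch
import torch.nn.functional as F

def exposure_loss(enhanced: torch.Tensor, target_e: float = 0.6, patch: int = 16) -> torch.Tensor:
    """Mean deviation of local (16x16) brightness from a target exposure level E."""
    gray = enhanced.mean(dim=1, keepdim=True)   # per-pixel brightness
    local = F.avg_pool2d(gray, patch)           # average over non-overlapping patches
    return (local - target_e).abs().mean()

def color_constancy_loss(enhanced: torch.Tensor) -> torch.Tensor:
    """Gray-world prior: the R, G, B channel means should be close to each other."""
    r, g, b = enhanced.mean(dim=(2, 3)).unbind(dim=1)
    return ((r - g) ** 2 + (r - b) ** 2 + (g - b) ** 2).mean()

# Weighted combination as in the total-loss equation above (illustrative weights).
enhanced = torch.rand(2, 3, 256, 256)
w_col = 0.5
l_total = exposure_loss(enhanced) + w_col * color_constancy_loss(enhanced)
print(float(l_total))
```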
In this study, Zero-DCE is integrated as a preprocessing module before the YOLOv11 detection pipeline, particularly suited for image enhancement in extreme conditions such as nighttime or low-light scenarios. Experimental results demonstrate that images enhanced by Zero-DCE exhibit significantly improved brightness, sharper target edges, and substantial gains in both detection accuracy and recall. Quantitatively, the enhanced images achieve a PSNR of 12.57 dB and an SSIM of 0.5616, confirming notable improvements in image quality. The specific enhancement effects are visualized in Figure 9.

4. Experimental Results and Analysis

4.1. Experimental Environment and Configuration

All experiments in this study were conducted under a unified configuration: Ubuntu 22.04 operating system, Python 3.10 environment, PyTorch 2.1.2, and CUDA 11.8. The hardware platform utilized an NVIDIA GeForce RTX 3080 Ti GPU. The detailed experimental settings are presented in Table 1.

4.2. Datasets

In this study, two representative nighttime vision datasets are utilized to train and evaluate the proposed model: the nighttime subset of BDD100K and the ExDark dataset.
The BDD100K (Berkeley DeepDrive 100K) dataset [40] was developed by the Berkeley DeepDrive project team at UC Berkeley. It is a large-scale autonomous driving dataset containing over 100,000 driving videos under various weather conditions, time periods, and road types. A notable portion of the dataset includes extensive nighttime driving footage, which serves as valuable training material for object detection in complex environments. For this study, images labeled as “night” were extracted to form a nighttime subset, resulting in a training set of 27,445 images and a validation set of 4394 images. The images have a resolution of 1280 × 720 pixels and include annotations for diverse object categories such as vehicles, pedestrians, and traffic signs, offering both diversity and realism.
The ExDark (Exclusively Dark) dataset is specifically designed for object detection in low-light conditions. It contains 7363 real-world nighttime images across 12 object categories, including typical environments such as streets, indoor scenes, seaports, and rural areas. The dataset poses significant challenges due to low illumination, small objects, complex backgrounds, and occlusions, thereby providing a robust benchmark for evaluating detection accuracy and model resilience in extreme conditions.
Together, these two datasets complement each other in lighting conditions and scene composition, offering a comprehensive foundation for validating the effectiveness of the proposed algorithm in diverse nighttime scenarios.

4.3. Model Evaluation Metrics

Model performance is assessed using both quantitative and qualitative approaches. Quantitatively, the following metrics are computed: Precision (P), Recall (R), mean Average Precision (mAP), mAP@0.5, number of parameters (Params), and computational complexity (GFLOPs). Qualitatively, visual inspection is conducted on detection results over real-world nighttime images to assess the model’s practical effectiveness.

4.4. Ablation and Comparative Experiments

To evaluate the detection performance of the proposed DRF-YOLO algorithm in low-light environments, comprehensive comparative experiments were conducted against several mainstream object detection models, including Faster R-CNN, SSD, YOLOv3, YOLOv5n, YOLOv8n, YOLOv10n, and YOLOv11n.
Experiments were carried out on two typical nighttime datasets: ExDark and the BDD100K Night Subset. Evaluation metrics included mAP@0.5, Precision, Recall, Params, and GFLOPs. The results are detailed in Table 2 and Table 3 and Figure 10.
As shown in Table 2, DRF-YOLO achieves superior performance on the ExDark dataset, outperforming all compared methods across all metrics. For instance, compared to YOLOv3, DRF-YOLO reduces parameters by approximately 96% and computational cost by 96.3%, while achieving improvements of 7.2% in Precision, 14.2% in Recall, and 16.7% in mAP@0.5. Compared to the mainstream lightweight model YOLOv11n, DRF-YOLO only increases parameter count by 1.3 M and GFLOPs by 3.4, yet delivers gains of 2.9%, 2.3%, and 3.4% in Precision, Recall, and mAP@0.5 respectively.
Compared to YOLOv11s, DRF-YOLO achieves better detection accuracy while reducing parameters by 58.9% and GFLOPs by 55%. Similar performance trends are observed on the BDD100K Night Subset (Table 3), confirming the robustness and effectiveness of DRF-YOLO in low-light conditions without significant increases in model size or computational burden. Furthermore, we conducted cross-dataset generalization experiments (training on ExDark and testing on BDD100K, and vice versa), and observed consistent improvements, demonstrating that DRF-YOLO maintains strong generalization capability across different low-light datasets.

4.5. Detection Performance in Complex Nighttime Environments

The detection performance of the proposed DRF-YOLO algorithm in complex nighttime environments is visually illustrated in Figure 11, Figure 12, Figure 13 and Figure 14, using samples from the ExDark dataset and the nighttime subset of BDD100K. These images encompass typical low-light scenarios and urban road conditions at night, serving to evaluate the algorithm’s effectiveness under real-world degraded environments.
As shown in Figure 11 and Figure 12 (ExDark dataset), DRF-YOLO demonstrates high-confidence detection of common nighttime targets such as pedestrians and various objects. Under extreme low-light conditions, DRF-YOLO achieves up to 34% higher confidence in pedestrian detection compared to mainstream algorithms like YOLOv11 and YOLOv8, while significantly reducing both false positives and false negatives. Even when faced with occluded objects or blurred edges, the algorithm consistently generates accurate bounding boxes and category labels, highlighting its strong adaptability in low-illumination scenes.
Figure 13 and Figure 14 further present the detection performance of DRF-YOLO on nighttime urban road images from the BDD100K dataset. In these complex environments, interference factors such as strong reflections, streetlight glare, and motion blur impose significant challenges on object detection. Experimental results reveal that DRF-YOLO can precisely detect small vehicles, pedestrians, and non-motorized road users. Compared to lightweight models like YOLOv8n, DRF-YOLO shows a 16% improvement in small object recognition and a notable reduction in false detection rates under occlusion.
In summary, the DRF-YOLO algorithm exhibits outstanding object detection capabilities across both the ExDark and BDD100K night datasets. It not only enhances detection accuracy in low-light and complex scenarios but also significantly improves performance in detecting small and occluded objects. These results confirm the algorithm’s robustness and practical generalization ability in real-world nighttime applications.

4.6. Ablation Study

To validate the effectiveness and necessity of each module in DRF-YOLO under degraded scenarios, systematic ablation experiments were conducted on the ExDark dataset. As a benchmark dataset for low-light object detection, ExDark encompasses various typical target categories and challenging nighttime scenarios, making it ideal for evaluating structural improvements.

4.6.1. Backbone Improvement

As shown in Table 4, replacing the original C3k2 module in the YOLOv11 backbone with the proposed CSP-MSEE (CSP Multi-Scale Edge Enhancement) module resulted in a significant reduction in parameters (approximately 0.4 M) and improved computational efficiency (0.3 GFLOPs), while maintaining stable detection accuracy. This demonstrates the module’s capability to achieve a lightweight design without sacrificing feature representation quality.

4.6.2. Attention Enhancement

Integrating the Focal Modulation attention mechanism led to modest yet consistent improvements in both precision (P) and mAP@0.5, underscoring its advantages in small object localization and global contextual modeling.

4.6.3. Detection Head Optimization

The introduction of the Dynamic Interaction Head (DIH) further enhanced detection performance by leveraging cross-scale semantic enhancement and spatial attention fusion. This architecture significantly boosted detection accuracy for small and edge-region objects in complex nighttime scenes, with an observed gain of over 1.5% in mAP@0.5.

4.6.4. Loss Function Analysis

Different IoU-based regression losses were evaluated to enhance localization accuracy and convergence stability. As shown in Table 5, the baseline YOLOv11 adopted CIoU, which considers overlap, center distance, and aspect ratio. However, it still suffers from unstable optimization when object aspect ratios vary significantly. To address this, several variants were tested.
DIoU improves upon CIoU by adding a penalty for the distance between the predicted and ground-truth box centers, accelerating convergence. EIoU further decouples width and height regression to improve bounding box aspect ratio alignment. SIoU introduces a geometric decomposition of the loss into angle, distance, and shape components, leading to smoother gradients and faster convergence. WIoU adaptively reweights loss contributions based on object scale and localization uncertainty, helping balance the learning between large and small targets. Finally, GIoU extends IoU by introducing a geometric penalty based on the area of the smallest enclosing box, improving robustness under occlusion and partial visibility.
In DRF-YOLO, these losses were applied only to the localization branch, while the classification and confidence branches used the Varifocal Loss. Empirically, GIoU achieved the best overall balance between precision and recall, with an mAP@0.5 of 65.2%, indicating superior generalization in low-light and occluded environments.

4.6.5. Image Enhancement

Finally, to enhance visual features in raw low-light images, the lightweight unsupervised enhancement algorithm Zero-DCE was employed during preprocessing. Without adding inference cost, it substantially improved image visibility, leading to further detection performance gains.
DRF-YOLO’s performance gains arise from both the individual strengths of CSP-MSEE, Focal Modulation, and DIH, and their synergistic interplay. CSP-MSEE enriches multi-scale and edge features, guiding Focal Modulation to focus on key regions, which in turn enables DIH to achieve precise spatial alignment and task-specific feature fusion. This coordinated design effectively enhances detection accuracy for small objects in degraded conditions, yielding a mAP@0.5 of 65.2% in low-light and complex nighttime environments while maintaining computational efficiency.

5. Conclusions

This study addresses the challenge of small-object detection in degraded visual conditions, such as nighttime and low-light environments in autonomous driving. We present DRF-YOLO, an improved YOLOv11-based model featuring CSP-MSEE for multi-scale edge enhancement, Focal Modulation for context-aware attention, and a Dynamic Interaction Head for precise small-object localization. The design is further reinforced by GIoU loss and Zero-DCE image enhancement to boost robustness in adverse lighting.
On the ExDark and BDD100K datasets, DRF-YOLO improves mAP@0.5 by 3.4% and 2.3% over YOLOv11. Although the parameter count and GFLOPs increase moderately (from 2.6 M to 3.9 M and 6.4 to 9.8, respectively), the architecture remains lightweight and efficient. Ablation studies confirm the contribution of each module, and visual results demonstrate stable detection under occlusion, blur, and low illumination. These improvements translate into practical benefits in real-world applications, such as earlier obstacle detection by several meters, faster reaction times, and enhanced road safety in autonomous driving scenarios.
Despite its strengths, DRF-YOLO has limitations. The current evaluation primarily uses existing nighttime datasets, which may not fully reflect performance on completely unseen environments or diverse object types. Future work could explore cross-dataset generalization, extend the model to UAV or traffic surveillance imagery, and further optimize the architecture to balance accuracy with computational cost. Additionally, investigating adaptive mechanisms for extremely small or densely packed objects could further enhance detection robustness.
In summary, DRF-YOLO provides an efficient, accurate, and robust solution for small-object detection in degraded environments, offering tangible advantages for intelligent vision systems in autonomous driving and surveillance.

Author Contributions

Conceptualization, Y.G.; methodology, Y.G.; software, Y.G.; validation, Y.G. and L.C.; formal analysis, Y.G.; investigation, Y.G.; data curation, Y.G.; writing—original draft preparation, Y.G.; visualization, Y.G.; resources, L.C. and T.S.; writing—review and editing, L.C. and T.S.; supervision, L.C.; project administration, L.C. All authors have read and agreed to the published version of the manuscript.

Funding

The authors declare that no funds, grants, or other support were received during the preparation of this manuscript.

Data Availability Statement

Data are contained within the article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Zhao, D.; Shao, F.; Zhang, S.; Yang, L.; Zhang, H.; Liu, S.; Liu, Q. Advanced Object Detection in Low-Light Conditions. Remote Sens. 2024, 16, 4493. [Google Scholar] [CrossRef]
  2. Morawski, I.; Chen, Y.-A.; Lin, Y.-S.; Hsu, W.H.H. NOD: Taking a Closer Look at Detection under Extreme Low-Light Conditions with Night Object Detection Dataset. arXiv 2021, arXiv:2110.10364. [Google Scholar] [CrossRef]
  3. Nikouei, M.; Baroutian, B.; Nabavi, S.; Taraghi, F.; Aghaei, A.; Sajedi, A.; Moghaddam, M.E. Small Object Detection: A Comprehensive Survey on Challenges, Techniques and Real-World Applications. Intell. Syst. Appl. 2025, 27, 200561. [Google Scholar] [CrossRef]
  4. Li, S.; Wang, S.; Wang, P. A Small Object Detection Algorithm for Traffic Signs Based on Improved YOLOv7. Sensors 2023, 23, 7145. [Google Scholar] [CrossRef]
  5. Khalili, B.; Smyth, A.W. Small Object Detection YOLOv8 (SOD-YOLOv8): Enhancing YOLOv8 for Small Object Detection in Traffic Scene. arXiv 2024, arXiv:2408.04786. [Google Scholar]
  6. Zhang, H.; Liang, M.; Wang, Y. YOLO-BS: A traffic sign detection algorithm based on YOLOv8. Sci. Rep. 2025, 15, 7558. [Google Scholar] [CrossRef]
  7. Peng, D.; Ding, W.; Tong, Z. A novel low light object detection method based on the YOLOv5 fusion feature enhancement. Sci. Rep. 2024, 14, 4486. [Google Scholar] [CrossRef]
  8. Li, J.; Liang, X.; Wei, Y.; Xu, T.; Feng, J.; Yan, S. Perceptual Generative Adversarial Networks for Small Object Detection. arXiv 2017, arXiv:1706.05274. [Google Scholar] [CrossRef]
  9. Liu, S.; Huang, D.; Wang, Y. Learning spatial fusion for single-shot object detection. arXiv 2019, arXiv:1911.09516. [Google Scholar] [CrossRef]
  10. Wang, C.-Y.; Liao, H.-Y.M.; Wu, Y.-H.; Chen, P.-Y.; Hsieh, J.-W.; Yeh, I.-H. CSPNet: A new backbone for enhancing CNN learning capabilities. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Seattle, WA, USA, 14–19 June 2020; pp. 1571–1580. [Google Scholar]
  11. Chen, Z.; Wang, Z.; Luo, Y.; Wang, S.; Huang, Z. DPO: Dual-perturbation optimization for test-time adaptation in 3D object detection. In Proceedings of the 32nd ACM International Conference on Multimedia, Melbourne, VIC, Australia, 28 October–1 November 2024; pp. 4138–4147. [Google Scholar]
  12. Shi, H.; Yang, W.; Chen, D.; Wang, M. ASG-YOLOv5: An improved YOLOv5 UAV remote sensing aerial image scenario for small object detection based on attention and spatial gating. PLoS ONE 2024, 19, e0298698. [Google Scholar] [CrossRef]
  13. Liu, W.; Ren, G.; Yu, R.; Guo, S.; Zhu, J.; Zhang, L. Image-adaptive YOLO for object detection in adverse weather conditions. arXiv 2022, arXiv:2112.08088. [Google Scholar] [CrossRef]
  14. Redmon, J.; Farhadi, A. YOLOv3: An incremental improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar] [CrossRef]
  15. Liu, Y.; Li, S.; Zhou, L.; Liu, H.; Li, Z. Dark-Yolo: A low-light object detection algorithm integrating multiple attention mechanisms. Appl. Sci. 2025, 15, 5170. [Google Scholar] [CrossRef]
  16. Lan, G.; Zhao, B.; Li, X. Decoupled contrastive image translation for nighttime surveillance. arXiv 2023, arXiv:2307.05038. [Google Scholar]
  17. Cui, Z.; Zhu, Y.; Gu, L.; Qi, G.-J.; Li, X.; Gao, P.; Zhang, Z.; Harada, T. RestoreDet: Degradation equivariant representation for object detection in low resolution images. arXiv 2022, arXiv:2201.02314. [Google Scholar] [CrossRef]
  18. Wang, Y.; Yang, H.; Zhang, W.; Lu, S. UniDet-D: A unified dynamic spectral attention model for object detection under adverse weather. arXiv 2025, arXiv:2506.12324. [Google Scholar]
  19. Hong, M.; Cheng, S.; Huang, H.; Fan, H.; Liu, S. You Only Look Around: Learning Illumination Invariant Feature for Low-light Object Detection. arXiv 2024, arXiv:2410.18398. [Google Scholar] [CrossRef]
  20. Tran, D.Q.; Aboah, A.; Jeon, Y.; Shoman, M.; Park, M.; Park, S. Low-Light Image Enhancement Framework for Improved Object Detection in Fisheye Lens Datasets. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Seattle, WA, USA, 17–18 June 2024; pp. 7056–7065. [Google Scholar] [CrossRef]
  21. Wang, H.; Liu, J.; Dong, H.; Shao, Z. A Survey of the Multi-Sensor Fusion Object Detection Task. Sensors 2025, 25, 2794. [Google Scholar] [CrossRef]
  22. Mukherjee, S.; Beard, C.; Li, Z. MODIPHY: Multimodal Obscured Detection for IoT using Phantom Convolution-Enabled Faster YOLO. arXiv 2024, arXiv:2402.07894. [Google Scholar]
  23. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
  24. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. arXiv 2015, arXiv:1506.01497. [Google Scholar] [CrossRef]
  25. Redmon, J.; Farhadi, A. YOLO9000: Better, faster, stronger. arXiv 2016, arXiv:1612.08242. [Google Scholar] [CrossRef]
  26. Micikevicius, P.; Narang, S.; Alben, J.; Diamos, G.; Elsen, E.; Garcia, D.; Ginsburg, B.; Houston, M.; Kuchaiev, O.; Venkatesh, G.; et al. Mixed precision training. arXiv 2017, arXiv:1710.03740. [Google Scholar] [CrossRef]
  27. Wang, C.-Y.; Yeh, I.-H.; Liao, H.-Y.M. YOLOv9: Learning what you want to learn using programmable gradient information. arXiv 2024, arXiv:2402.13616. [Google Scholar] [CrossRef]
  28. Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J.; Ding, G. YOLOv10: Real-time end-to-end object detection. arXiv 2024, arXiv:2405.14458. [Google Scholar]
  29. Yang, J.; Li, C.; Dai, X.; Yuan, L.; Gao, J. Focal modulation networks. arXiv 2022, arXiv:2203.11926. [Google Scholar] [CrossRef]
  30. Gao, S.-H.; Cheng, M.-M.; Zhao, K.; Zhang, X.-Y.; Yang, M.-H.; Torr, P.H. Res2Net: A new multi-scale backbone architecture. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 43, 652–662. [Google Scholar] [CrossRef]
  31. Wu, Y.; He, K. Group normalization. arXiv 2018, arXiv:1803.08494. [Google Scholar] [CrossRef]
  32. Cubuk, E.D.; Zoph, B.; Mane, D.; Vasudevan, V.; Le, Q.V. AutoAugment: Learning augmentation strategies from data. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 113–123. [Google Scholar]
  33. Guo, C.; Li, C.; Guo, J.; Loy, C.C.; Hou, J.; Kwong, S.; Cong, R. Zero-reference deep curve estimation for low-light image enhancement. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 1777–1786. [Google Scholar]
  34. Rezatofighi, H.; Tsoi, N.; Gwak, J.; Sadeghian, A.; Reid, I.; Savarese, S. Generalized intersection over union: A metric and a loss for bounding box regression. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 658–666. [Google Scholar]
  35. Weng, T.; Niu, X. Enhancing UAV Object Detection in Low-Light Conditions with ELS-YOLO: A Lightweight Model Based on Improved YOLOv11. Sensors 2025, 25, 4463. [Google Scholar] [CrossRef]
  36. Han, Z.; Yue, Z.; Liu, L. 3L-YOLO: A Lightweight Low-Light Object Detection Algorithm. Appl. Sci. 2025, 15, 90. [Google Scholar] [CrossRef]
  37. Bochkovskiy, A.; Wang, C.-Y.; Liao, H.-Y.M. YOLOv4: Optimal speed and accuracy of object detection. arXiv 2020, arXiv:2004.10934. [Google Scholar] [CrossRef]
  38. Zheng, Z.; Wang, P.; Liu, W.; Li, J.; Ye, R.; Ren, D. Distance-IoU loss: Faster and better learning for bounding box regression. arXiv 2020, arXiv:1911.08287. [Google Scholar] [CrossRef]
  39. Wei, C.; Wang, W.; Yang, W.; Liu, J. Deep Retinex decomposition for low-light enhancement. arXiv 2018, arXiv:1808.04560. [Google Scholar] [CrossRef]
  40. Yu, F.; Chen, H.; Wang, X.; Xian, W.; Chen, Y.; Liu, F.; Madhavan, V.; Darrell, T. BDD100K: A diverse driving dataset for heterogeneous multitask learning. arXiv 2020, arXiv:1805.04687. [Google Scholar] [CrossRef]
Figure 1. Overall architecture of the YOLOv11 model.
Figure 2. Overall architecture of the proposed DRF-YOLO.
Figure 3. Architecture of the proposed CSP-MSEE module.
Figure 4. Structure of the EdgeEnhancer module.
Figure 5. Architecture of the Focal Modulation Attention Enhancement Module.
Figure 6. Overall architecture of the proposed Dynamic Interaction Head (DIH).
Figure 7. Illustration of the GIoU loss structure.
Figure 8. Schematic diagram of the Zero-DCE enhancement process.
Figure 9. Original low-light image (left) and the image enhanced by our method (right).
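As context for Figures 8 and 9, Zero-DCE [33] brightens an image by repeatedly applying a learned quadratic curve, LE(x) = x + α·x·(1 − x), where the per-pixel curve parameters α are predicted by a lightweight network (DCE-Net) and the curve is applied for several iterations (eight in the original paper). The following minimal sketch illustrates only the curve-application step; the constant curve maps stand in for DCE-Net outputs and are not part of the authors' implementation.

```python
import numpy as np

def apply_zero_dce_curves(image: np.ndarray, curve_maps: list[np.ndarray]) -> np.ndarray:
    """Iteratively apply the Zero-DCE light-enhancement curve.

    image:      H x W x 3 array with values in [0, 1].
    curve_maps: one H x W x 3 array of curve parameters (alpha in [-1, 1]) per iteration;
                in Zero-DCE these come from the DCE-Net, here they are supplied by the caller.
    """
    x = image.copy()
    for alpha in curve_maps:
        # LE(x) = x + alpha * x * (1 - x), applied per pixel and per channel
        x = x + alpha * x * (1.0 - x)
    return np.clip(x, 0.0, 1.0)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    dark = rng.uniform(0.0, 0.3, size=(4, 4, 3))            # toy "low-light" image
    alphas = [np.full((4, 4, 3), 0.8) for _ in range(8)]    # 8 iterations, as in Zero-DCE
    enhanced = apply_zero_dce_curves(dark, alphas)
    print(enhanced.mean() > dark.mean())                    # brightness increases: True
```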
Figure 10. Precision–Recall (P-R) curve comparison between YOLOv11 and DRF-YOLO on the ExDark dataset.
Figure 11. Comparison of detection results using the DRF-YOLO algorithm on the ExDark dataset.
Figure 12. Comparison of heatmaps generated by the DRF-YOLO algorithm on the ExDark dataset.
Figure 13. Comparison of detection results using the DRF-YOLO algorithm on the BDD100K dataset.
Figure 14. Comparison of heatmaps generated by the DRF-YOLO algorithm on the BDD100K dataset.
Table 1. Experimental parameter settings for DRF-YOLO.
Parameter | Set Value
Number of training epochs | 300
Batch size | 16
Image size (imgsz) | 640 × 640
Initial learning rate (lr0) | 0.01
Final learning rate (lrf) | 0.0001
Learning rate scheduler | Cosine annealing
Momentum (momentum) | 0.937
Weight decay coefficient (weight_decay) | 0.0005
Warmup bias learning rate | 0.1
Pretrained backbone | ImageNet-pretrained
IoU threshold for NMS | 0.7
Automatic Mixed Precision (AMP) | False
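The settings in Table 1 correspond to standard training arguments. The snippet below is a minimal sketch assuming an Ultralytics-style training interface; the model definition file drf-yolo.yaml and dataset file exdark.yaml are hypothetical placeholders rather than files released with this article, and lrf is expressed as a fraction of lr0 so that the final learning rate equals 0.0001.

```python
from ultralytics import YOLO

# Hypothetical model/dataset YAMLs; substitute the actual DRF-YOLO definition and dataset config.
model = YOLO("drf-yolo.yaml")

model.train(
    data="exdark.yaml",      # dataset configuration (hypothetical path)
    epochs=300,              # number of training epochs
    batch=16,                # batch size
    imgsz=640,               # input image size
    lr0=0.01,                # initial learning rate
    lrf=0.01,                # final LR as a fraction of lr0: 0.01 * 0.01 = 0.0001
    cos_lr=True,             # cosine-annealing learning-rate schedule
    momentum=0.937,          # SGD momentum
    weight_decay=0.0005,     # weight decay coefficient
    warmup_bias_lr=0.1,      # warmup bias learning rate
    iou=0.7,                 # IoU threshold used by NMS during validation
    amp=False,               # automatic mixed precision disabled
    pretrained=True,         # initialize backbone from pretrained weights (per Table 1)
)
```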
Table 2. Performance comparison of different models on the ExDark dataset.
Model | P (%) | R (%) | mAP@0.5 (%) | mAP@0.5:0.95 (%) | FPS (3080Ti)
Faster R-CNN | 92.5 | 65.8 | 49.2 | 27.4 | 18
SSD | 88.1 | 63.4 | 42.9 | 23.6 | 37
Deformable-DETR | 90.3 | 77.4 | 54.2 | 30.9 | 22
YOLOv3 | 90.6 | 70.1 | 48.5 | 27.0 | 40
YOLOv5n | 89.4 | 80.3 | 54.4 | 31.2 | 215
YOLOv8n | 89.7 | 81.4 | 56.6 | 32.4 | 203
YOLOv10n | 93.3 | 77.0 | 53.7 | 30.8 | 210
YOLOv11n | 94.9 | 82.0 | 61.8 | 34.1 | 190
YOLOv11s | 95.7 | 82.2 | 62.3 | 34.5 | 142
YOLO-BIFPN | 91.4 | 79.3 | 53.1 | 29.7 | 160
YOLO-timm | 90.5 | 81.7 | 56.0 | 31.8 | 115
DRF-YOLO | 97.8 | 84.3 | 65.2 | 36.5 | 175
Table 3. Performance comparison of different models on the BDD100K night subset.
Model | P (%) | R (%) | mAP@0.5 (%) | mAP@0.5:0.95 (%) | FPS (3080Ti)
Faster R-CNN | 91.2 | 65.0 | 36.5 | 19.8 | 16
SSD | 86.4 | 62.7 | 31.2 | 17.2 | 34
Deformable-DETR | 89.8 | 75.1 | 38.9 | 22.4 | 21
YOLOv3 | 89.5 | 71.0 | 35.4 | 19.3 | 36
YOLOv5n | 88.1 | 77.2 | 36.8 | 20.8 | 208
YOLOv8n | 89.6 | 78.5 | 37.9 | 21.7 | 196
YOLOv10n | 91.2 | 74.6 | 36.7 | 20.5 | 202
YOLOv11n | 92.3 | 79.8 | 37.1 | 22.1 | 184
YOLOv11s | 93.0 | 80.1 | 37.4 | 22.4 | 139
YOLO-BIFPN | 90.1 | 76.0 | 35.2 | 20.1 | 153
YOLO-timm | 90.7 | 78.2 | 36.3 | 21.0 | 108
DRF-YOLO | 95.0 | 80.0 | 39.4 | 23.8 | 174
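For reference, the P, R, and mAP@0.5 columns in Tables 2 and 3 follow the usual detection metrics: a prediction counts as a true positive when its IoU with a previously unmatched ground-truth box of the same class reaches 0.5, precision and recall are accumulated over confidence-ranked predictions, and AP@0.5 is the area under the resulting precision–recall curve (mAP averages AP over classes). The sketch below computes AP@0.5 for a single class under these assumptions; it is illustrative only and is not the evaluation code used in the experiments.

```python
import numpy as np

def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def ap_at_50(predictions, ground_truths):
    """AP@0.5 for one class.

    predictions:   list of (image_id, confidence, box)
    ground_truths: dict image_id -> list of boxes
    """
    n_gt = sum(len(v) for v in ground_truths.values())
    matched = {img: [False] * len(boxes) for img, boxes in ground_truths.items()}
    tp, fp = [], []
    # Rank predictions by confidence and greedily match each to its best ground-truth box.
    for img, conf, box in sorted(predictions, key=lambda p: -p[1]):
        gts = ground_truths.get(img, [])
        ious = [iou(box, g) for g in gts]
        best = int(np.argmax(ious)) if ious else -1
        if best >= 0 and ious[best] >= 0.5 and not matched[img][best]:
            matched[img][best] = True
            tp.append(1)
            fp.append(0)
        else:
            tp.append(0)
            fp.append(1)
    tp, fp = np.cumsum(tp), np.cumsum(fp)
    recall = tp / max(n_gt, 1)
    precision = tp / np.maximum(tp + fp, 1e-9)
    # Integrate the precision envelope over recall (all-points interpolation).
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([1.0], precision, [0.0]))
    p = np.maximum.accumulate(p[::-1])[::-1]
    return float(np.sum((r[1:] - r[:-1]) * p[1:]))
```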
Table 4. Results of ablation experiments on the ExDark dataset for DRF-YOLO.
Baseline | CSP-MSEE | Focal Modulation | DIH | GIoU | Zero-DCE | mAP@0.5 (%) | Params (M) | FLOPs (G)
 | | | | | | 61.8 | 2.6 | 6.4
 | | | | | | 62.0 | 2.2 | 6.1
 | | | | | | 62.5 | 2.7 | 6.7
 | | | | | | 64.5 | 2.9 | 7.0
 | | | | | | 63.0 | 2.6 | 6.4
 | | | | | | 63.6 | 2.6 | 6.4
 | | | | | | 63.2 | 2.4 | 6.3
 | | | | | | 64.7 | 3.1 | 7.8
 | | | | | | 65.0 | 3.3 | 8.4
 | | | | | | 65.2 | 3.9 | 9.8
Table 5. Comparison of different loss functions on the ExDark dataset.
Loss Function | Precision (%) | Recall (%) | mAP@0.5 (%)
CIoU | 96.4 | 83.2 | 64.1
DIoU | 95.7 | 82.5 | 63.4
EIoU | 96.9 | 83.6 | 64.4
SIoU | 97.1 | 83.9 | 64.7
WIoU | 96.5 | 83.4 | 64.0
GIoU | 97.8 | 84.3 | 65.2
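As used in Table 5, GIoU [34] augments IoU with a penalty based on the smallest enclosing box C of the predicted box A and the ground-truth box B, GIoU = IoU − |C \ (A ∪ B)| / |C|, and the regression loss is 1 − GIoU, which remains informative even when the two boxes do not overlap. The following is a minimal sketch for axis-aligned boxes, not the training implementation used here.

```python
def giou_loss(pred, target):
    """GIoU loss (1 - GIoU) for two axis-aligned boxes given as (x1, y1, x2, y2)."""
    # Intersection area
    ix1, iy1 = max(pred[0], target[0]), max(pred[1], target[1])
    ix2, iy2 = min(pred[2], target[2]), min(pred[3], target[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    # Union area and IoU
    area_p = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_t = (target[2] - target[0]) * (target[3] - target[1])
    union = area_p + area_t - inter
    iou = inter / (union + 1e-9)

    # Smallest enclosing box C
    cx1, cy1 = min(pred[0], target[0]), min(pred[1], target[1])
    cx2, cy2 = max(pred[2], target[2]), max(pred[3], target[3])
    area_c = (cx2 - cx1) * (cy2 - cy1)

    giou = iou - (area_c - union) / (area_c + 1e-9)
    return 1.0 - giou

# Non-overlapping boxes still receive a useful penalty: IoU = 0, GIoU = -1/3, loss ≈ 1.333.
print(giou_loss((0, 0, 1, 1), (2, 0, 3, 1)))
```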
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
