1. Introduction
In the expansive field of computer vision, object detection has emerged as a cornerstone research focus, achieving remarkable progress in recent years and garnering significant attention. This technology has found widespread applications across numerous domains, including vehicle-assisted driving [1,2], security surveillance [3,4], pedestrian detection [5,6,7], industrial quality control [8], and assistive technologies for the visually impaired [9]. The practical deployment of object detection systems is heavily influenced by the quality of illumination in their operating environments. Under optimal lighting conditions, characterized by abundant photons and uniform illumination, object features are distinctly discernible. This clarity enables a clear separation between the foreground (objects of interest) and the background, providing favorable conditions for accurate detection. In low-light scenarios, however, such as nighttime or dimly lit environments, the photon count drops sharply, resulting in diminished contrast and inadequate brightness that complicate the identification of objects and facial features [10]. The increased noise and interference in low-light conditions further exacerbate these difficulties. These factors undermine the performance of existing object detection algorithms, which often fail to identify objects consistently and accurately or to meet expected levels of detection reliability. This limitation poses a substantial barrier to progress in critical fields such as security surveillance, autonomous driving, and intelligent transportation systems [11,12]. Consequently, research into object detection under low-light conditions is of paramount importance. Robust and accurate detection methods that perform well in low-light environments broaden the technology’s applicability and benefit the many domains that rely on dependable object detection.
In the field of low-light object detection, deep learning methods can be divided into two categories based on whether candidate boxes are generated: two-stage algorithms and one-stage algorithms. Two-stage algorithms, such as Faster R-CNN (Faster Region-based Convolutional Neural Network) [13], first generate candidate regions and then refine the recognition. Although they achieve high accuracy, their processing speed is limited. Wang et al. [14] proposed an object detection algorithm for nighttime environments that combines DCGAN with Faster R-CNN, incorporating a feature fusion module and a multi-scale pooling module to enhance detection performance under low-light conditions. Xu et al. [15] combined deformable convolutional networks with Faster R-CNN, employing perceptual boundary localization to achieve more accurate position learning of objects, thereby improving detection performance in dim environments.
One-stage algorithms, such as SSD [16] and the YOLO series [17,18,19,20], perform detection directly, omitting the candidate box generation step. This yields faster detection but demands a more comprehensive feature extraction process. Qin et al. [21] proposed the DE-YOLO framework, which integrates DENet and YOLOv3 into a unified enhancement-detection system; however, because the two networks process the image independently, the enhanced image details may not be fully exploited during feature extraction. Cui et al. [22] proposed a self-supervised method for low-light object detection using a multi-task autoencoder transformation model. Nevertheless, the accuracy of self-supervised learning can be compromised when sufficient contextual information is unavailable, particularly in low-light environments where the absence of contextual details further degrades detection accuracy. Zou et al. [23] optimized the backbone network of YOLOv5 and applied RetinexNet for data augmentation to improve occluded object detection, but RetinexNet’s image enhancement required manual parameter tuning to achieve optimal results. Wang et al. [24] introduced an enhancement strategy combining block matching with three-dimensional filtering, effectively improving the signal-to-noise ratio in nighttime images; however, the approach did not fully address the limitations of convolutional neural networks in capturing global information.
To address the aforementioned challenges, researchers have proposed a variety of solutions. For instance, Hong et al. [25] introduced a framework called YOLA, which enhances object detection performance in low-light conditions by learning illumination-invariant features via the Lambertian image formation model. While this approach leverages inter-channel and spatial relationships to handle diverse lighting, it underemphasizes the integration of global contextual information, which is critical for resolving ambiguities in low-contrast scenes where object-background boundaries blur. Hashmi et al. [26] developed the FeatEnHancer module, employing hierarchical multi-scale feature fusion guided by task-specific losses and multi-head attention. Although effective for cross-scale feature integration, this method relies on parameterized attention mechanisms that may overlook lightweight, dynamic feature selection tailored to low-light sparsity, potentially compromising efficiency in resource-constrained scenarios. Yin et al. [27] proposed PE-YOLO, integrating a Pyramid Enhancement Network (PENet) with YOLOv3 to decompose images into Laplacian pyramid components for detail enhancement. However, the static pooling strategy in their feature pyramid risks losing fine-grained local details, which is especially problematic in low-light environments where texture and edge information are already degraded. Ding et al. [28] introduced SDNIA-YOLO to improve YOLO robustness under extreme weather via Neural Style Transfer (NST)-synthesized images. While this boosts generalization to adverse conditions, it lacks a dedicated mechanism for the specific challenges of low-light illumination, such as noise amplification and uneven brightness, which require more than style adaptation. Liu et al. [29] proposed IA-YOLO with a Differentiable Image Processing (DIP) module to adapt to varying weather. However, the independent operation of the DIP module and the detection network may hinder seamless fusion of enhanced features, as observed in Qin et al.’s [21] earlier work, where parallel processing limited feature utilization efficiency.
Han et al. [30] developed 3L-YOLO, a lightweight model that avoids explicit image enhancement by introducing switchable atrous convolution. While this reduces computational load, it overlooks the critical role of preprocessing in restoring lost low-light details, an omission that becomes pronounced in extremely dark environments with minimal initial information. Bhattacharya et al. [31] presented D2BGAN for unsupervised low-light-to-bright image conversion, effective against motion blur and noise in driving scenes. Yet the separation of image enhancement and detection into distinct stages may not optimize end-to-end feature learning, limiting the model’s ability to prioritize task-relevant details during enhancement. Zhang et al. [32] proposed NUDN for nighttime domain adaptation via knowledge distillation, reducing costs for small objects. However, this approach does not explicitly model the reciprocal relationship between spatial and channel-wise features, a key deficit in low-light scenarios where cross-dimensional dependencies (e.g., channel-wise contrast and spatial structure) are critical for accurate detection. Wang et al. [33] improved YOLOv5 with DK_YOLOv5, incorporating enhancement algorithms and attention mechanisms. While effective, the sequential integration of modules may lead to suboptimal information flow, as each component (enhancement, attention, detection) operates in isolation rather than dynamically reinforcing the others. Li et al. [34] introduced YOLO_GD with cross-layer fusion and dual-layer spatial attention, mitigating inter-layer information loss. However, the spatial-only attention mechanism struggles to capture long-range dependencies across channels, which are essential for reconstructing color and texture in low-light images where channel correlations weaken. Zhao et al. [35] enhanced YOLOv7 with agile hybrid convolutions and deformable attention, improving edge extraction. Yet the lack of a unified framework that balances local detail extraction (e.g., edges) and global context (e.g., object layout) remains an issue, one amplified in low-light scenes where both types of information are degraded but interdependent. Mei et al. [36] developed GOI-YOLO with group convolutions to isolate edge interference, enhancing efficiency. However, its exclusive focus on spatial feature grouping neglects the potential of channel-wise attention to recover low-light spectral information, such as restoring discriminative color features muted by insufficient illumination. Abu Awwad et al. [5] proposed multi-stage image enhancement for anomaly detection in smart cities, reducing false positives. Nevertheless, the static enhancement pipeline does not adapt to diverse low-light sub-scenarios (e.g., near-infrared versus visible light), limiting its generality across illumination conditions.
Although the aforementioned studies have mitigated some of the challenges associated with object detection in low-light conditions, several unresolved issues persist: (1) insufficient integration of spatial-local and channel-global information, (2) static feature extraction mechanisms that are ill-suited to low-light sparsity, and (3) insufficient emphasis on contextual information. In particular, existing methodologies in low-light object detection predominantly concentrate on enhancing the detection of individual objects while neglecting contextual information. They rely primarily on image processing techniques, such as denoising, contrast enhancement, and edge detection, to improve the visibility and detectability of objects in low-light scenarios; despite the advances these techniques bring, the lack of attention to context limits the overall performance and accuracy of low-light detection systems. In addition, the capacity to extract local detail features from images captured in low-illumination environments remains insufficient. Local detail features are crucial for accurately identifying and classifying objects, but in low-light conditions the reduced photon count and increased noise produce blurred and distorted images, making reliable local detail features difficult to extract.
To address these challenges, this paper proposes Dark-YOLO, a unified framework in which adaptive enhancement, dynamic feature selection, and reciprocal attention mechanisms are optimized collaboratively, ensuring robust performance under complex low-light constraints. The main contributions of this paper are as follows:
(1) We introduce two complementary attention mechanisms. The first is SimAM in the Partial Convolutional Spatial Module (PSM), which applies pixel-level attention to key features, enabling the model to prioritize critical local information. The second is the Dimensional Reciprocal Attention Module (D-RAMiT), which models global-local relationships through bidirectional spatial-channel attention, integrating long-range dependencies across spatial layouts and channel correlations. Together, they enhance the model’s perception of contextual information in low-light scenarios by explicitly decomposing “contextual information” into two complementary dimensions: global context and local context. The global context is captured by D-RAMiT’s spatial self-attention (SPSA), which encodes the overall image layout (such as the relative positions of objects) to understand scene-level structure and spatial dependencies. The local context is preserved through the cross-overlapping pooling of the Cross-Spatial Pyramid Pooling Feature (CSPPF) module and the partial convolution of the PSM. These operations mitigate noise and retain spatial hierarchies, preserving fine-grained details such as edges and textures that degrade under low-light conditions. SimAM dynamically ranks the importance of local features to focus on discriminative pixels, while D-RAMiT constructs global contextual representations by modeling the interdependencies between spatial locations and channel dimensions. Their synergy forms a multi-attention framework that balances the prioritization of local details with the modeling of global context, enabling robust feature learning in challenging low-light environments.
(2) The proposed method introduces SCINet, an image enhancement module that cascades multiple illumination learning blocks. Each block learns and corrects specific aspects of the image’s illumination. This cascading structure enables the network to progressively refine the illumination estimate, restoring finer details in low-light images and enriching the feature representation (a minimal illustrative sketch of such a cascade is provided after this list).
(3) An enhanced spatial feature pyramid module, termed CSPPF, is proposed. This module uses cross-overlapping average pooling and max pooling to minimize the loss of local information, improving the retention of critical spatial details. Max pooling captures salient edges (and is robust to noise), while average pooling retains global brightness gradients; their combination mitigates the detail loss caused by a single pooling operator in low-contrast images (see the CSPPF sketch after this list).
(4) A dynamic feature extraction module, PSM, is introduced. By integrating the selective information of partial convolution with a parameter-free attention mechanism, PSM strengthens the network’s feature extraction capacity and object localization capability. In low-light images, partial convolution (PConv) ignores invalid or noisy pixels in the receptive field, reducing interference from corrupted data, while SimAM directs attention to the scarce valid features (see the PSM sketch after this list).
(5) The method employs the Dimensional Reciprocal Attention Module (D-RAMiT), which computes self-attention in parallel along the spatial and channel dimensions. This establishes long-range relationships between pixels, ensures comprehensive utilization of both local and global information, and ultimately boosts the network’s detection accuracy in low-light conditions. In dim scenes, local features are sparse and ambiguous; D-RAMiT’s spatial self-attention (SPSA) recovers fine-grained details, while its channel self-attention (CHSA) models inter-channel dependencies, enabling the model to infer object identities from limited local cues using global scene statistics (see the D-RAMiT sketch after this list).
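To make the enhancement stage of contribution (2) concrete, the following is a minimal, simplified sketch of a cascaded illumination-learning enhancer in the spirit of SCINet. The block depths, channel widths, and the Retinex-style re-lighting rule (output = input / estimated illumination) are illustrative assumptions rather than the exact architecture used in this paper.

```python
import torch
import torch.nn as nn

class IlluminationBlock(nn.Module):
    """One illumination-learning stage: predicts an illumination map as a
    residual on top of the current image (Retinex-style assumption)."""
    def __init__(self, channels: int = 16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, 3, 3, padding=1), nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # keep the illumination estimate away from zero to avoid division blow-ups
        return torch.clamp(x + self.net(x), min=1e-3, max=1.0)

class CascadedEnhancer(nn.Module):
    """Cascade of illumination blocks: each stage refines the illumination
    estimate and re-lights the image via I_out = I_in / illumination."""
    def __init__(self, stages: int = 3):
        super().__init__()
        self.blocks = nn.ModuleList(IlluminationBlock() for _ in range(stages))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x is a low-light RGB image scaled to [0, 1]
        for block in self.blocks:
            x = torch.clamp(x / block(x), 0.0, 1.0)
        return x

# usage: enhanced = CascadedEnhancer()(torch.rand(1, 3, 256, 256))
```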
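The cross-overlapping pooling idea behind CSPPF (contribution 3) can be illustrated as follows. The number of pyramid stages, the 5x5 kernel, and the additive fusion of the max- and average-pooled responses are assumptions made for illustration, not the exact layout of the proposed module.

```python
import torch
import torch.nn as nn

class CSPPF(nn.Module):
    """Sketch of a cross-overlapping pooling pyramid: each stage pools the
    previous output with both a max and an average 5x5 window (stride 1, so
    windows overlap), and the two responses are summed so that salient edges
    (max) and smooth brightness gradients (average) are both retained."""
    def __init__(self, channels: int, k: int = 5, stages: int = 3):
        super().__init__()
        self.maxp = nn.MaxPool2d(k, stride=1, padding=k // 2)
        self.avgp = nn.AvgPool2d(k, stride=1, padding=k // 2)
        self.stages = stages
        # 1x1 conv fuses the input plus all pooled stages back to `channels`
        self.fuse = nn.Conv2d(channels * (stages + 1), channels, 1, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = [x]
        for _ in range(self.stages):
            prev = feats[-1]
            feats.append(self.maxp(prev) + self.avgp(prev))  # cross-fused pooling
        return self.fuse(torch.cat(feats, dim=1))

# usage: out = CSPPF(channels=256)(torch.rand(1, 256, 20, 20))
```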
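A minimal sketch of a PSM-style block (contribution 4) follows, assuming PConv takes the FasterNet-style form that convolves only a fraction of the channels, combined with the standard parameter-free SimAM weighting; the split ratio and kernel size are illustrative choices.

```python
import torch
import torch.nn as nn

class SimAM(nn.Module):
    """Parameter-free attention: pixels whose activations deviate strongly from
    the per-channel spatial mean receive higher weights (lower 'energy')."""
    def __init__(self, eps: float = 1e-4):
        super().__init__()
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n = x.shape[2] * x.shape[3] - 1
        d = (x - x.mean(dim=(2, 3), keepdim=True)).pow(2)   # squared deviation
        v = d.sum(dim=(2, 3), keepdim=True) / n             # spatial variance
        e_inv = d / (4 * (v + self.eps)) + 0.5               # inverse energy
        return x * torch.sigmoid(e_inv)

class PSM(nn.Module):
    """PSM-style block: a partial convolution updates only a fraction of the
    channels (the rest pass through untouched), then SimAM re-weights pixels."""
    def __init__(self, channels: int, ratio: int = 4):
        super().__init__()
        self.conv_ch = channels // ratio
        self.pconv = nn.Conv2d(self.conv_ch, self.conv_ch, 3, padding=1, bias=False)
        self.attn = SimAM()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x1, x2 = x[:, :self.conv_ch], x[:, self.conv_ch:]
        x = torch.cat((self.pconv(x1), x2), dim=1)
        return self.attn(x)

# usage: out = PSM(channels=128)(torch.rand(1, 128, 40, 40))
```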
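Finally, the dimension-reciprocal idea of contributions (1) and (5), computing self-attention over spatial positions and over channels in parallel and then fusing the two branches, can be sketched as below. The head count, the transposed (channel-by-channel) attention formulation, and the 1x1 fusion are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class DRAMiT(nn.Module):
    """Dimension-reciprocal attention sketch: one branch attends over the H*W
    spatial positions (SPSA), the other over the C channel maps (CHSA); the two
    branches run in parallel and are fused by a 1x1 convolution."""
    def __init__(self, channels: int, heads: int = 4):
        super().__init__()
        self.spsa = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.qkv = nn.Conv2d(channels, channels * 3, 1, bias=False)
        self.fuse = nn.Conv2d(channels * 2, channels, 1, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape

        # SPSA: tokens are pixels, so attention links distant spatial locations
        tokens = x.flatten(2).transpose(1, 2)              # (B, H*W, C)
        spatial, _ = self.spsa(tokens, tokens, tokens)
        spatial = spatial.transpose(1, 2).reshape(b, c, h, w)

        # CHSA: a C x C attention map models inter-channel dependencies
        q, k, v = self.qkv(x).flatten(2).chunk(3, dim=1)   # each (B, C, H*W)
        attn = torch.softmax(q @ k.transpose(1, 2) / (h * w) ** 0.5, dim=-1)
        channel = (attn @ v).reshape(b, c, h, w)

        return self.fuse(torch.cat((spatial, channel), dim=1))

# usage: out = DRAMiT(channels=64)(torch.rand(1, 64, 32, 32))  # channels % heads == 0
```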
The remainder of this paper is organized as follows. Section 2 presents the implementation details of Dark-YOLO. Section 3 discusses the experimental results, including a variety of ablation studies and comparisons with different algorithms on the ExDark and DarkFace datasets, highlighting the advantages and robustness of Dark-YOLO. Finally, Section 4 concludes the paper, summarizing Dark-YOLO, elucidating its strengths, and discussing its limitations along with potential improvements and future enhancements.