Explicit Illumination Modeling for Object Detection in Low-Light Environments

Cao, Wenkang; Yang, Peng; Lyu, Wensheng

doi:10.3390/electronics15102057


by Wenkang Cao ¹, Peng Yang ¹,²,* and Wensheng Lyu ²

¹ Beijing Key Laboratory of Information Service Engineering, Beijing Union University, Beijing 100101, China
² School of Civil and Resource Engineering, University of Science and Technology Beijing, Beijing 100083, China
* Author to whom correspondence should be addressed.

Electronics 2026, 15(10), 2057; https://doi.org/10.3390/electronics15102057

Submission received: 16 April 2026 / Revised: 8 May 2026 / Accepted: 9 May 2026 / Published: 12 May 2026


Abstract

Under complex lighting conditions, particularly in low-light environments, general object detectors often suffer from degraded detection performance due to insufficient brightness, severe noise, and loss of discriminative details. This issue is especially critical in underground mining scenarios, where weak illumination, complex backgrounds, dust interference, and frequent small or partially occluded targets make reliable visual perception highly challenging. To address this issue, we propose an Illumination-Aware Detection Network (IADNet) for object detection in low-light environments. Specifically, an Illumination Modeling Subnetwork (IMS) is designed to extract illumination-aware and degradation-aware auxiliary features from low-light images. Within the IMS, an Adaptive Weighted Downsampling (AWD) layer is introduced to reduce noise interference during feature downsampling and enhance illumination-aware representation learning. Furthermore, a Global Feature Enhancement Module (GFEM) is incorporated to strengthen global context modeling and improve feature representation capability in complex scenes. In addition, an extra contrastive loss is introduced to constrain the optimization of the IMS, and weighting factors are employed to balance the detection loss and the contrastive loss during training. Extensive experiments conducted on multiple datasets demonstrate the effectiveness of the proposed method. On the public ExDark dataset, IADNet achieves an mAP@50 of 80.3%, outperforming the baseline YOLO11m by 3.4 percentage points. On the self-constructed mining low-light dataset Lowlight_Mine, the proposed method achieves 92.3% Precision, 82.0% Recall, 89.3% mAP@50, and 57.8% mAP@50:95, showing favorable performance in object detection tasks under mining-related low-light scenarios. On the DARK FACE dataset, IADNet achieves 54.6% mAP@50 and 31.2% mAP@50:95, further indicating its robustness under real low-light conditions. 
On the synthetic low-light Dark_VOC dataset, IADNet attains an mAP@50 of 91.6%, and on the normal-light VOC dataset, it achieves an mAP@50 of 93.0%, suggesting that the proposed method maintains stable detection performance under the evaluated illumination conditions. These results indicate that IADNet improves low-light object detection performance and provides a useful experimental reference for object detection tasks in mining-related low-light scenarios.

Keywords:

object detection; low-light environment; illumination-aware detection; contrastive learning; mining scene; intelligent perception

1. Introduction

Object detection, as a core task in computer vision, aims to localize and classify all target objects in an image. In recent years, significant progress has been made due to the rapid development of deep convolutional neural networks. From two-stage approaches such as Faster R-CNN [1], to one-stage methods like the YOLO series [2,3,4] and RetinaNet [5], and more recently to Transformer-based models such as DETR [6] and DINO [7], general object detection has demonstrated outstanding performance in diverse domains including natural scenes, traffic surveillance, and autonomous driving. However, most of these mainstream detectors rely heavily on good lighting conditions. Their performance typically degrades severely under low-light environments, limiting their applicability in real-world scenarios such as nighttime surveillance or mining operations.

Low-light images often suffer from a range of degradation factors including insufficient brightness, low contrast, severe noise, and color distortion. These issues result in the loss of critical visual information, leading to blurred object contours, missing textures, and indistinct semantic features. Consequently, traditional detectors struggle to extract effective representations, frequently leading to missed or incorrect detections.

A straightforward solution is to enhance the low-light image before detection using dedicated image enhancement networks. This makes the input more similar to well-lit images typically seen in training datasets. However, most generic detectors are trained on large-scale datasets such as COCO [8] or PASCAL VOC [9], which do not contain corresponding low-light versions. As a result, enhancement networks and detection networks must be trained on different datasets, leading to significant domain gaps. In many cases, this mismatch can even reduce the overall detection performance.

Some recent works have explored weakly supervised approaches to improve detection in low-light conditions. For example, Zhang et al. [10] proposed a Low-Light Weakly Supervised Object Detection framework that progressively trains detection models on low-light images using common datasets captured under normal lighting. Hui et al. [11] introduced WSA-YOLO, a weakly supervised adaptive low-light object detector based on YOLOv7, which leverages adaptive enhancement techniques and decomposes the image into reflectance and illumination maps for separate enhancement.

Other approaches attempt to integrate image enhancement and detection into a unified framework. IA-YOLO [12], for instance, incorporates a Differentiable Image Processing (DIP) module whose parameters are adaptively predicted by a lightweight CNN (CNN-PP) based on the input image. This module is trained jointly with YOLOv3, enabling weakly supervised optimization of image enhancement for the detection task under adverse conditions such as fog or low light. Similarly, GDIP [13] embeds an image enhancement module within the detection network by running various preprocessing operations (e.g., brightness adjustment, contrast enhancement) in parallel and combining them via a learnable gating mechanism. GDIP further introduces a multi-stage enhancement strategy and a lightweight regularized version that can be removed during inference to improve deployment efficiency without sacrificing accuracy.

In this work, as shown in Figure 1, we observe that low-light images exhibit significant variation in lighting sources and intensities, resulting in diverse degradation levels. This observation motivates us to design an Illumination-Aware Detection Network (IADNet), which explicitly models complex illumination variations to significantly improve object detection performance under low-light conditions. By incorporating illumination-aware mechanisms, IADNet enables the network to adaptively adjust feature representations across diverse lighting scenarios, thereby enhancing robustness to illumination changes. Specifically, we first construct an Illumination Modeling Subnetwork (IMS) to explicitly capture illumination-related information from input images. The IMS is composed of multiple cascaded Adaptive Residual Blocks (ARB), where each ARB replaces conventional downsampling operations with an Adaptive Weighted Downsampling Convolution (AWD). This design effectively suppresses noise interference during feature downsampling and alleviates feature degradation caused by noise amplification in low-light environments. Subsequently, a Feature Fusion Module (FFM) is employed to deeply integrate the illumination-aware features extracted by the IMS with the semantic features produced by the detection backbone. Through this fusion process, the backbone features are dynamically modulated to better align with the imaging characteristics under the current illumination conditions. To further strengthen illumination modeling, the IMS is optimized using a contrastive learning strategy. By enforcing contrastive constraints on features from images captured under different lighting conditions, the network is guided to learn illumination-aware discriminative representations. Meanwhile, a weighting factor is introduced to balance the contrastive loss and the detection loss, ensuring collaborative optimization between illumination modeling and detection performance.

In addition, we incorporate a Global Feature Enhancement Module (GFEM) at the high-level stages of the IADNet backbone to strengthen global context modeling under low-light conditions. This design is motivated by the fact that the original YOLO backbone is mainly built upon convolutional operations, which are effective for local pattern extraction but inherently limited in modeling long-range dependencies and global contextual information. In low-light scenes, object appearance is often severely degraded by insufficient illumination, low contrast, and background noise, making it difficult to distinguish foreground targets from dark or cluttered surroundings based solely on local features. To address this issue, the GFEM enhances global semantic representation and long-range dependency modeling, enabling the network to capture more comprehensive scene-level context for both semantic reasoning and target-background disambiguation. As a result, it improves the robustness of feature representation and the stability of object detection in complex low-light environments.

Our contributions are summarized as follows:

We propose a novel object detection network, termed IADNet, which is designed to enhance the illumination perception capability of detection models. Extensive experiments conducted on multiple low-light datasets demonstrate that IADNet consistently outperforms existing state-of-the-art methods.
We design an Illumination Modeling Subnetwork (IMS) to explicitly model illumination information in images through contrastive learning. The illumination-aware features captured by the IMS are then used to dynamically modulate the semantic features extracted by the detection backbone, thereby enhancing the illumination perception capability of the detection network.
We propose a Global Feature Enhancement Module (GFEM) to strengthen the global context modeling capability of the detection network. By enhancing long-range dependency modeling and global semantic representation, GFEM further improves feature representation quality and detection stability under complex illumination conditions.

2. Related Work

2.1. Object Detection

Object detection, a fundamental task in computer vision, aims to accurately locate and classify all objects of interest in images or videos. With the rapid development of deep learning, convolutional neural network (CNN)-based object detection methods [14,15,16] have achieved remarkable progress. Existing mainstream object detection algorithms can be broadly categorized into two groups: two-stage methods and one-stage methods. Two-stage methods, represented by the Faster R-CNN series [1,17,18], first generate region proposals using a Region Proposal Network (RPN), and then perform classification and regression on each candidate box. These methods offer high accuracy but suffer from considerable computational overhead, making them less suitable for real-time applications. In contrast, one-stage methods, such as the YOLO series and RetinaNet, unify the proposal generation, classification, and regression tasks into a single network, achieving higher inference speed through an end-to-end architecture. Notably, recent versions like YOLOv11 and YOLOv12 strike a favorable balance between accuracy and efficiency, rendering them widely adopted in industrial and edge computing scenarios. In recent years, the introduction of Transformer-based architectures has further advanced the field of object detection. Representative models such as DETR (Detection Transformer) and DINO overcome the locality limitation of CNNs by employing self-attention mechanisms to model long-range dependencies within images, thereby improving detection performance in complex scenes. However, these models typically require substantial computational resources and exhibit slower inference speeds. Despite the impressive performance of general-purpose object detection methods on standard benchmarks, their effectiveness deteriorates significantly in real-world scenarios such as nighttime or low-light environments. 
Under these challenging conditions, issues such as severe image degradation, blurred object boundaries, and low contrast lead to a noticeable drop in detection accuracy.

2.2. Low-Light Image Object Detection

Although mainstream object detection algorithms have made significant progress under natural lighting conditions, their performance degrades considerably in low-light environments due to issues such as low contrast, high noise, and loss of fine details. To address these challenges, researchers have proposed various methods specifically designed for object detection in low-light scenes. A common approach is to decouple the tasks of image enhancement and object detection. In such pipelines, low-light images are first preprocessed using enhancement algorithms such as Retinex-based models [19] and Zero-DCE [20], and then passed into standard object detectors like YOLO or Faster R-CNN. While this strategy is straightforward to implement, the enhancement process is not aligned with the detection objective, which often leads to artifacts or information shifts that ultimately limit detection performance. To overcome these limitations, an increasing number of studies have shifted toward end-to-end joint optimization frameworks that integrate differentiable image enhancement modules with detection networks. These frameworks enable the model to learn illumination-adaptive features directly from the data. For example, IA-YOLO introduces a learnable image enhancement module whose parameters are predicted by an auxiliary network and jointly optimized with YOLOv3, achieving a synergy between perceptual enhancement and object detection. Similarly, the LLVIP [21] dataset and related low-light detection models emphasize the deep integration of semantic understanding and perceptual enhancement under low-light conditions. Recent studies have also improved low-light object detection through detection-oriented enhancement and YOLO-based detector optimization [22,23].

Some studies [10,11] have explored the use of weakly supervised methods to improve object detection performance under low-light conditions. Zhang et al. [10] proposed a Low-Light Weakly Supervised Object Detection (LL-WSOD) framework, which gradually trains the model under low-light environments using commonly available datasets captured under normal illumination. Hui et al. [11] introduced WSA-YOLO, a weakly supervised adaptive object detection algorithm based on YOLOv7 for low-light scenes. This method leverages adaptive enhancement techniques to effectively improve detection performance in dark environments. Specifically, it employs a decomposition network to separate the image into a reflectance map and an illumination map, which are then enhanced individually.

Unlike previous low-light object detection methods, we explicitly model varying illumination conditions to enhance the model’s illumination awareness, thereby improving its robustness under different lighting environments.

2.3. Low-Light Image Enhancement

The goal of low-light image enhancement is to improve the visibility and quality of images captured under insufficient lighting conditions, thereby providing more reliable inputs for subsequent high-level vision tasks such as object detection, recognition, and segmentation. Low-light image enhancement methods can be broadly categorized into two classes: traditional image processing techniques and deep learning-based approaches. Traditional methods primarily rely on techniques such as histogram equalization, gamma correction, and Retinex theory. Retinex theory [24] provides a flexible theoretical framework for enhancing low-light images. Subsequent works include single-scale Retinex methods [25], multi-scale Retinex methods [26], and variational Retinex models [27]. However, these algorithms often struggle to prevent issues such as color distortion and visual artifacts. With the advancement of deep learning, an increasing number of studies have adopted convolutional neural networks (CNNs) and Transformer architectures to learn complex nonlinear mappings for end-to-end image enhancement. Ren et al. [28] proposed a trainable hybrid network to improve the visibility of degraded images. The proposed architecture comprises two distinct branches, enabling the simultaneous learning of global content and salient structures of clear images within a single network. Guo et al. [29] proposed Zero-Reference Deep Curve Estimation (Zero-DCE), which formulates the enhancement task as an image-specific curve estimation problem based on deep networks. Xu et al. [30] introduced a novel low-light enhancement framework that jointly models both the appearance and structure information of images. Their method leverages structural features to guide appearance enhancement, resulting in clearer and more realistic outputs.

3. Method

3.1. The Architecture of IADNet

IADNet is an improved version built upon YOLO11m [31], and its overall architecture is illustrated in Figure 2. IADNet consists of three major components: the illumination modeling subnetwork, the backbone, and the neck and head of the detection network. During training, the input first undergoes data augmentation techniques such as random rotation, creating an augmented input that pairs with the original input to form a sample pair. Both the augmented input and the original input are then simultaneously fed into the illumination modeling subnetwork to learn the illumination information of the image. At the same time, the original input is passed through the backbone of the detection network. The illumination features learned by the illumination modeling subnetwork are then fused with the backbone features to enhance the backbone’s ability to perceive lighting conditions. These fused features are subsequently input to the neck and head of the network for further processing.

3.2. Illumination Modeling Subnetwork

As shown in Figure 1, we observe that illumination variations in low-light environments are highly complex. Differences in light sources and illumination intensity often result in varying degrees of image degradation. This observation motivates the design of an illumination modeling subnetwork to enhance the illumination awareness of the detection network. The proposed subnetwork consists of one initial convolutional layer and five Adaptive Residual Blocks (ARB). The initial convolution embeds the input into a feature space, while the five ARB modules are designed to extract multi-scale features that facilitate effective fusion with the multi-level features of the backbone network. The structure of the ARB module is illustrated in Figure 3. Specifically, the input feature map is first downsampled via an AWD layer, followed sequentially by Batch Normalization (BN), a SiLU activation, a convolutional layer, and another BN. A residual connection is then established by adding the output of this sequence to a shortcut path that consists of a convolutional layer and BN applied directly to the input. The ARB module can be formulated as follows:

$$\mathrm{ARB}(X) = Y_{1} + Y_{2} \tag{1}$$

$$Y_{1} = \mathrm{BN}(\mathrm{Conv}(\mathrm{SiLU}(\mathrm{BN}(\mathrm{AWD}(X))))) \tag{2}$$

$$Y_{2} = \mathrm{BN}(\mathrm{Conv}(X)) \tag{3}$$

The illumination modeling subnetwork is optimized using a contrastive loss. Specifically, given an input image $x$, a data-augmented version $x'$ is generated through random flipping and rotation. Both $x$ and $x'$ are then fed into the illumination modeling subnetwork, producing two feature vectors: $q_{x}$ and $p_{x'}$. The feature vector $q_{x}$, extracted from the original low-light image, is treated as the query sample $q$. Since $p_{x'}$ shares the same illumination intensity and light source as $q_{x}$, it is regarded as the positive sample. In contrast, other low-light images from the dataset that exhibit different illumination conditions are considered negative samples $N$. To address the need for a large number of negative samples in contrastive learning, a memory queue is constructed to store representations of negative samples. The negative-sample memory queue is constructed as a feature storage matrix $Q \in \mathbb{R}^{K \times \dim}$, where $K$ denotes the queue capacity and $\dim$ represents the dimensionality of the negative-sample features. As new negative samples are continuously enqueued, the queue is updated using a first-in, first-out (FIFO) strategy, whereby the earliest stored negative samples are automatically removed to maintain a fixed capacity. Based on these sample representations, the contrastive loss is computed as follows:

$$L_{ctr} = -\log \frac{\exp(q_{x} \cdot p_{x'} / \tau)}{\sum_{i=1}^{K} \exp(q_{x} \cdot Q_{i} / \tau)} \tag{4}$$

where $Q_{i}$ denotes the $i$-th negative-sample feature stored in the queue $Q$, and $\tau$ denotes the temperature coefficient, which controls the smoothness of the distribution.
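To make the interplay between the memory queue and Equation (4) concrete, the following is a minimal NumPy sketch rather than the authors' implementation. The names `NegativeQueue`, `l2_normalize`, and `contrastive_loss` are hypothetical, the L2 normalization of features and the temperature value of 0.07 are assumptions that the text does not specify, and the loss follows Equation (4) as written, with only the queued negatives in the denominator.

```python
from collections import deque

import numpy as np


class NegativeQueue:
    """Fixed-capacity FIFO store for negative-sample features (K x dim)."""

    def __init__(self, capacity):
        # A deque with maxlen discards the oldest entry once it is full.
        self.buffer = deque(maxlen=capacity)

    def enqueue(self, features):
        for feature in features:
            self.buffer.append(np.asarray(feature, dtype=float))

    def as_matrix(self):
        # Stack the stored features into the matrix Q of shape (K, dim).
        return np.stack(list(self.buffer))


def l2_normalize(v):
    """Scale a feature vector to unit length (an assumed preprocessing step)."""
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v)


def contrastive_loss(q, p, negatives, tau=0.07):
    """Equation (4): -log( exp(q.p / tau) / sum_i exp(q.Q_i / tau) )."""
    pos_logit = np.dot(q, p) / tau
    neg_logits = negatives @ q / tau
    return float(-pos_logit + np.log(np.exp(neg_logits).sum()))
```

In this sketch, a positive sample that is well aligned with the query yields a lower loss than a misaligned one, and enqueuing more than `capacity` features silently drops the oldest entries, which mirrors the FIFO update described above.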

Finally, the multi-scale illumination-aware features extracted by the IMS are fused with the multi-scale features obtained from the detector backbone to achieve effective feature integration for subsequent detection tasks. As illustrated in Figure 4, the features from the backbone $F_{Backbone}$ and the features from the IMS module $F_{IMS}$ are first concatenated along the channel dimension to obtain the fused feature representation:

$$F_{c} = \mathrm{Concat}(F_{Backbone}, F_{IMS}) \tag{5}$$

Subsequently, the fused feature $F_{c}$ is sequentially processed by a convolution operation followed by a Sigmoid activation function to generate the degradation-guided weights. These weights are then multiplied with the input features from the backbone in a pixel-wise manner, enabling degradation-guided feature modulation and yielding the final output feature $F_{out}$. The overall feature fusion process can be formulated as shown in Equation (6):

$$F_{out} = \sigma(\mathrm{Conv}_{1 \times 1}(F_{c})) \odot F_{Backbone} \tag{6}$$

where $\sigma$ denotes the Sigmoid function, $\mathrm{Conv}_{1 \times 1}$ represents the $1 \times 1$ convolution operation, and $\odot$ denotes pixel-wise multiplication.
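The fusion in Equations (5) and (6) can be sketched for a single scale as follows. This is an illustrative NumPy version under stated assumptions, not the published code: the function name `feature_fusion` is hypothetical, the $1 \times 1$ convolution is written as a per-pixel linear map over channels, and its output channel count is assumed to equal that of the backbone feature so that the pixel-wise product is well defined.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def feature_fusion(f_backbone, f_ims, weight, bias):
    """Gate backbone features with weights predicted from the fused features.

    f_backbone: (C_b, H, W) backbone feature map
    f_ims:      (C_i, H, W) illumination feature map at the same scale
    weight:     (C_b, C_b + C_i) parameters of the assumed 1x1 convolution
    bias:       (C_b,) bias of the assumed 1x1 convolution
    """
    # Equation (5): concatenate along the channel dimension.
    f_c = np.concatenate([f_backbone, f_ims], axis=0)
    # A 1x1 convolution mixes channels independently at every pixel.
    logits = np.einsum("oc,chw->ohw", weight, f_c) + bias[:, None, None]
    # Equation (6): Sigmoid weights modulate the backbone features pixel-wise.
    return sigmoid(logits) * f_backbone
```

Because the Sigmoid output lies in (0, 1), this form of modulation can only attenuate backbone responses; with zero-initialized convolution parameters every weight equals 0.5 and the output is simply half of the backbone feature.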

3.3. Adaptive Weighted Downsampling Layer

Traditional downsampling is typically implemented using a convolution with a kernel size of $1 \times 1$ and a stride of 2. However, this approach only considers individual pixels and fails to suppress noise inherent in low-light image features. To address this limitation, Chen et al. [32] proposed the Adaptive Weighted Downsampling (AWD) layer as an alternative to conventional downsampling convolutions. In our illumination modeling subnetwork, we replace all standard downsampling convolutions with AWD layers. The detailed architecture of AWD is illustrated in Figure 5. AWD takes an input feature map of size $C \times H \times W$ and processes it through two parallel branches to jointly generate spatially adaptive convolution kernels and perform feature modulation. The upper branch applies a convolution followed by Batch Normalization (BN) to produce the kernel weights. In the lower branch, spatial average pooling is first performed, followed by a sequence of convolution, ReLU activation, and another convolution. The outputs of the two branches are then multiplied element-wise to generate dynamic convolution kernels with dimensions $C \times \frac{H}{2} \times \frac{W}{2} \times k \times k$, where $k$ denotes the kernel size. Finally, the input feature map undergoes padding and an unfold operation to extract overlapping local image patches. These patches are then convolved with the corresponding dynamic kernels in a position-wise manner, enabling context-aware adaptive downsampling.
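The last step of AWD, applying a different $k \times k$ kernel at every output position, can be illustrated with the NumPy sketch below. It covers only the position-wise application and takes the dynamic kernels as a given array instead of generating them with the two branches. The function name `apply_dynamic_kernels`, the choice $k = 3$, and the use of zero padding are assumptions made for illustration.

```python
import numpy as np


def apply_dynamic_kernels(x, kernels, k=3, stride=2):
    """Downsample x with one k x k kernel per channel and output position.

    x:       (C, H, W) input feature map
    kernels: (C, H/2, W/2, k, k) dynamic kernels (assumed to be given)
    returns: (C, H/2, W/2) downsampled feature map
    """
    pad = k // 2
    # Zero padding (an assumption) so that border positions see full patches.
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    channels, out_h, out_w = kernels.shape[:3]
    out = np.empty((channels, out_h, out_w), dtype=float)
    for i in range(out_h):
        for j in range(out_w):
            # Explicit equivalent of "unfold": the patch around position (i, j).
            top, left = i * stride, j * stride
            patch = xp[:, top:top + k, left:left + k]
            # Weight the patch with the kernel predicted for this position.
            out[:, i, j] = (patch * kernels[:, i, j]).sum(axis=(1, 2))
    return out
```

If every kernel is set to the constant $1/k^{2}$, the operation reduces to strided average pooling, which makes the contrast with input-dependent kernels explicit: the weights that average out noise are fixed in pooling but predicted per position in AWD.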

3.4. Global Feature Enhancement Module

The backbone of YOLO11m is mainly composed of convolutional structures, which are effective for local feature extraction but inherently limited in modeling global contextual information. Such a limitation is particularly critical in low-light scenes, where object appearance is often severely degraded by insufficient illumination, low contrast, and background interference. Under these conditions, relying solely on local features may lead to ambiguous target perception and unstable feature representation. To alleviate this problem, we design a Transformer module based on a transposed self-attention mechanism, termed the Global Feature Enhancement Module (GFEM), to strengthen the detector’s ability to model global context. By introducing long-range dependency modeling and richer scene-level semantic reasoning, GFEM enables the network to better distinguish foreground targets from dark or noisy backgrounds, thereby improving the robustness of object detection in complex low-light environments.

Unlike conventional Transformer modules, GFEM adopts a transposed self-attention mechanism, which enables global context modeling with relatively lower computational complexity. As illustrated in Figure 6, the input feature map $X$ is first processed by two deformable convolution layers to enhance the adaptability of feature transformation, allowing the network to handle targets with diverse deformations and structural characteristics more flexibly:

$$F_{d} = \mathrm{DeformableConv}_{\times 2}(X) \tag{7}$$

Subsequently, the transformed feature $F_{d}$ captures global contextual information through Transposed Self-Attention (TSA). The detailed computation of the transposed self-attention [33] mechanism is illustrated in Figure 7. Specifically, the input feature $F_{d}$ is first normalized using Layer Normalization, and then mapped into the query $Q$, key $K$, and value $V$ representations via a $1 \times 1$ convolution and a $3 \times 3$ depthwise separable convolution, respectively. Next, a dot-product operation is performed between the transposed $Q$ and $K$, followed by a Softmax function to obtain the transposed attention map. By computing attention along the channel dimension, this strategy effectively preserves global dependency modeling while significantly reducing computational complexity. The overall TSA process can be formulated as follows:

$$F_{d'} = W_{1 \times 1}^{1}\, \mathrm{Attention}(Q, K, V) + F_{d} \tag{8}$$

$$\mathrm{Attention}(Q, K, V) = V \cdot \mathrm{Softmax}(K \cdot Q / \alpha) \tag{9}$$

where $W_{1 \times 1}^{1}$ denotes a $1 \times 1$ convolution, and $\alpha$ is the temperature parameter. Finally, $F_{d'}$ is further processed by sequentially applying Layer Normalization and a Multi-Layer Perceptron (MLP) to introduce nonlinear transformations, and is combined with residual connections and Dropout operations to enhance gradient flow and improve the generalization capability of the model.
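The attention core of Equation (9) can be sketched in NumPy as follows, leaving out Layer Normalization, the convolutional projections, and the output $1 \times 1$ convolution. The tensor layout is an assumption, since the text does not state it: $Q$ and $V$ are taken as $HW \times C$ matrices and $K$ as a $C \times HW$ matrix, so that $K \cdot Q$ is a $C \times C$ map, and the Softmax is assumed to act along the last axis. The function names are hypothetical.

```python
import numpy as np


def channel_attention_map(q, k, alpha=1.0):
    """Softmax(K . Q / alpha): a (C, C) map over channels, not over pixels.

    q: (HW, C) flattened query, k: (C, HW) flattened key (assumed layout).
    """
    logits = (k @ q) / alpha
    # Subtract the row maximum for numerical stability before exponentiating.
    logits = logits - logits.max(axis=-1, keepdims=True)
    weights = np.exp(logits)
    return weights / weights.sum(axis=-1, keepdims=True)


def transposed_attention(q, k, v, alpha=1.0):
    """Equation (9): V . Softmax(K . Q / alpha), with v of shape (HW, C)."""
    return v @ channel_attention_map(q, k, alpha)
```

The attention map has $C^{2}$ entries regardless of the spatial resolution, whereas spatial self-attention would require $(HW)^{2}$ entries; this is the source of the lower complexity noted above, while each channel descriptor still aggregates information from all $HW$ positions.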

3.5. Loss Function

The overall loss function of YOLO11m consists of the bounding box regression loss, confidence loss, and classification loss. Since the proposed network incorporates the IMS module, an additional contrastive loss is introduced to impose explicit constraints on it. To balance the contributions of different loss terms, weighting factors $\lambda_{det}$ and $\lambda_{ctr}$ are introduced into the total loss, and the detection loss and contrastive loss are combined through a weighted summation:

$$L_{total} = \lambda_{det} \times L_{det} + \lambda_{ctr} \times L_{ctr} \tag{10}$$

where $L_{det}$ denotes the sum of the bounding box regression loss, confidence loss, and classification loss, and $L_{ctr}$ represents the contrastive loss defined in Equation (4). In the ablation studies, we further evaluate the impact of different settings of $\lambda_{det}$ and $\lambda_{ctr}$ on the overall detection performance to determine the optimal loss balancing strategy.

4. Experiment

4.1. Implementation Details

The proposed IADNet is implemented based on the PyTorch framework and trained using the SGD optimizer. The initial learning rate is set to 0.02, and the model is trained for 100 epochs with a batch size of 16. A cosine annealing strategy is adopted to gradually decay the learning rate during training. All experiments are conducted on an NVIDIA GeForce RTX 4090 GPU. The detailed hardware and software environment, as well as the hyperparameter settings, are listed in Table 1.
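For reference, a cosine annealing schedule with the stated initial learning rate of 0.02 over 100 epochs can be written as the short sketch below. The function name `cosine_annealing_lr` is hypothetical, and the final learning rate of zero is an assumption; the value actually used may differ and is expected to be among the settings listed in Table 1.

```python
import math


def cosine_annealing_lr(epoch, total_epochs=100, lr_init=0.02, lr_min=0.0):
    """Learning rate at a given epoch under cosine annealing.

    lr_min = 0.0 is an assumed final value, not one stated in the text.
    """
    progress = epoch / total_epochs
    return lr_min + 0.5 * (lr_init - lr_min) * (1.0 + math.cos(math.pi * progress))


# Learning rate at the start of each epoch, plus the value reached at the end.
schedule = [cosine_annealing_lr(epoch) for epoch in range(101)]
```

Under these assumptions the rate starts at 0.02, passes through 0.01 at the midpoint of training, and decays smoothly toward the final value, with the slowest change at the beginning and end of training.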

4.2. Evaluation Metrics

To evaluate the effectiveness of the proposed method, we adopt Precision, Recall, mAP@50, and mAP@50:95 as the evaluation metrics. The computation of Precision and Recall is described as follows:

$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{11}$$

$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{12}$$

$$AP = \int_{0}^{1} P(R)\, dR \tag{13}$$

$$mAP = \frac{1}{N} \sum_{i=1}^{N} AP_{i} \tag{14}$$

where

TP (True Positives): Number of correctly detected objects (correct class and IoU above the threshold);
FP (False Positives): Number of incorrectly detected objects (wrong class or IoU below the threshold);
FN (False Negatives): Number of ground-truth objects missed by the detector.
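Equations (11) to (14) can be illustrated with the plain-Python sketch below. The integral in Equation (13) must be approximated from a finite set of precision-recall points; the version here uses all-point interpolation with a monotone precision envelope, which is a common convention but an assumption on our part, since the text does not state which approximation the evaluation code uses. All function names are hypothetical.

```python
def precision(tp, fp):
    """Equation (11); returns 0.0 when there are no detections."""
    return tp / (tp + fp) if (tp + fp) > 0 else 0.0


def recall(tp, fn):
    """Equation (12); returns 0.0 when there are no ground-truth objects."""
    return tp / (tp + fn) if (tp + fn) > 0 else 0.0


def average_precision(recalls, precisions):
    """Approximate Equation (13) from points sorted by increasing recall."""
    r = [0.0] + list(recalls) + [1.0]
    p = [0.0] + list(precisions) + [0.0]
    # Monotone envelope: precision at recall r is the best precision at >= r.
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    # Sum the rectangles under the envelope between consecutive recall values.
    return sum((r[i + 1] - r[i]) * p[i + 1] for i in range(len(r) - 1))


def mean_average_precision(ap_per_class):
    """Equation (14): the mean of the per-class AP values."""
    return sum(ap_per_class) / len(ap_per_class)
```

In this convention, mAP@50 is obtained by counting a detection as a true positive when its IoU with a ground-truth box is at least 0.5, and mAP@50:95 averages the result over IoU thresholds from 0.5 to 0.95 in steps of 0.05.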

In addition, we adopt the number of parameters (Params), giga floating-point operations (GFLOPs), and inference time as evaluation metrics to assess the computational efficiency of the proposed method. The calculation of GFLOPs is given as follows:

$$\mathrm{GFLOPs} = \frac{\mathrm{FLOPs}}{10^{9}} \tag{15}$$

4.3. Dataset

In this paper, the ExDark dataset, the self-constructed mining low-light dataset Lowlight_Mine, the DARK FACE dataset, and the PASCAL VOC dataset are employed to evaluate the effectiveness of the proposed method.

ExDark [34]: ExDark is a publicly available dataset specifically designed for visual tasks under low-light conditions. It consists of 7363 real-world low-light images covering 12 object categories: bicycle, boat, bottle, bus, car, cat, chair, cup, dog, motorbike, people, and table. These images cover a wide range of nighttime scenarios with complex and diverse illumination conditions, resulting in varying degrees of image degradation. To present the data distribution more intuitively, we visualize the number of instances for each category in the ExDark dataset, as shown in Figure 8. In addition, some sample instances from the ExDark dataset are shown in Figure 9.

Lowlight_Mine: Since publicly available benchmark datasets specifically designed for object detection in low-light underground mining environments are still limited, existing datasets cannot fully satisfy the requirements for training and evaluating detection algorithms in such safety-critical scenarios. To address this limitation and further evaluate the proposed method in mining-related low-light conditions, we constructed a mining low-light experimental dataset, termed Lowlight_Mine, by combining on-site image acquisition at Jiaojia Gold Mine with publicly accessible Internet images related to underground mining scenarios.

The Lowlight_Mine dataset contains 1753 images and 5714 annotated object instances. Among them, 1081 images with 3682 object instances were collected on site at Jiaojia Gold Mine, accounting for 61.7% of the images and 64.4% of the annotated instances. The remaining 672 images with 2032 object instances were collected from publicly accessible Internet sources related to underground mining scenarios, accounting for 38.3% of the images and 35.6% of the annotated instances. The Internet images were mainly collected from mining-related web pages, news reports, safety-production materials, and image-search results to enrich scene diversity and illumination variation.

In terms of category composition, the annotated object categories include underground personnel and mining-related equipment/vehicles. Specifically, the dataset contains 3726 underground personnel instances and 1988 mining-related equipment/vehicle instances, corresponding to the major safety-critical visual targets in underground mining operations. In terms of scene composition, the dataset includes five major types of mining operation scenes: underground roadways, personnel operation areas, equipment transportation scenes, inspection or maintenance scenarios, and mixed mining scenes. Specifically, the dataset contains 426 images of underground roadways, 548 images of personnel operation areas, 394 images of equipment transportation scenes, 231 images of inspection or maintenance scenarios, and 154 images of mixed mining scenes.

Overall, Lowlight_Mine reflects several typical visual challenges in deep mining environments, such as weak illumination, non-uniform lighting, strong background interference, local reflection, occlusion, low contrast, and distant small targets. Some sample images from the Lowlight_Mine dataset are shown in Figure 10.

All images were manually annotated using bounding boxes. To ensure annotation quality, a two-stage quality-control protocol was adopted. Specifically, the initial annotations were first produced by one annotator and then reviewed and corrected by another reviewer. Ambiguous samples, such as heavily occluded targets or objects under extremely dark illumination, were further inspected to improve annotation consistency. The dataset was divided into training, validation, and test sets at an approximate ratio of 8:1:1, corresponding to 1402 images for training, 175 images for validation, and 176 images for testing.
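The split sizes above can be reproduced with a simple index split. The following is a minimal sketch, not the authors' procedure: the text does not state how the split was drawn, so the random shuffle, the fixed seed, the floor rounding, and the function name `split_indices` are all assumptions made for illustration.

```python
import random


def split_indices(n_images, seed=0):
    """Shuffle image indices and split them 8:1:1 into train/val/test.

    Assumption: train and val sizes are floored and the remainder goes to test.
    """
    indices = list(range(n_images))
    random.Random(seed).shuffle(indices)
    n_train = int(0.8 * n_images)  # floor of 80%
    n_val = int(0.1 * n_images)    # floor of 10%
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # whatever is left
    return train, val, test


train, val, test = split_indices(1753)
print(len(train), len(val), len(test))  # 1402 175 176
```

Under these rounding assumptions the sizes match the reported 1402/175/176 partition, which is why the ratio is described as approximate rather than exact.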

DARK FACE [35]: The DARK FACE dataset is a real-world low-light face detection dataset, containing 6000 low-light images with annotated human face bounding boxes. The scenes cover various complex nighttime environments, including teaching buildings, streets, bridges, overpasses, and parks. In our experiments, 5400 images were used for training, while the remaining 600 images were used for testing.

PASCAL VOC [9]: The PASCAL VOC dataset is a classic benchmark for visual recognition, led by the University of Oxford and other institutions, and has been widely used for tasks such as image classification, object detection, and semantic segmentation. To evaluate the generalization capability of the proposed method, we construct a Dark_VOC dataset by applying gamma correction to the original PASCAL VOC images. The corresponding formulation is given as follows:

$$I_{out} = I_{in}^{\gamma} \tag{16}$$

Here, $I_{in}$ denotes the input pixel value normalized to the range $[0, 1]$, and the parameter $\gamma$ is used to control the illumination intensity: when $\gamma > 1$, the image becomes darker, whereas when $\gamma < 1$, the image becomes brighter. In this study, $\gamma$ is set within the range $[1.8, 3.0]$ to simulate varying degrees of low-light conditions. Specifically, we construct the VOC_clean_train dataset by selecting 8111 images containing five object categories (person, car, bus, bicycle, and motorbike) from the VOC2007_trainval and VOC2012_trainval datasets. Similarly, the VOC_clean_test dataset is built by selecting 2734 images from the VOC2007_test dataset using the same criteria. Based on the aforementioned gamma transformation, we further generate the low-light datasets Dark_VOC_train and Dark_VOC_test from VOC_clean_train and VOC_clean_test, respectively.
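The gamma transformation of Equation (16) can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions: the text gives only the range of γ, so drawing one γ per image uniformly from [1.8, 3.0] is an assumption, and the function names `darken` and `random_darken` are hypothetical.

```python
import numpy as np


def darken(image_uint8, gamma):
    """Apply I_out = I_in ** gamma to an 8-bit image (gamma > 1 darkens)."""
    normalized = image_uint8.astype(np.float32) / 255.0  # map to [0, 1]
    darkened = np.power(normalized, gamma)
    return np.clip(np.rint(darkened * 255.0), 0, 255).astype(np.uint8)


def random_darken(image_uint8, rng, low=1.8, high=3.0):
    """Assumed sampling scheme: one gamma per image, uniform in [low, high]."""
    gamma = rng.uniform(low, high)
    return darken(image_uint8, gamma), gamma


rng = np.random.default_rng(0)
image = np.full((4, 4, 3), 128, dtype=np.uint8)  # mid-grey test patch
dark, gamma = random_darken(image, rng)
```

Because the input is normalized to [0, 1], pure black and pure white are fixed points of the transform, and every intermediate value is pushed toward black when γ > 1.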

4.4. Ablation Study

To verify the effectiveness of the proposed modules for low-light object detection, we conduct ablation experiments on the ExDark dataset. The overall ablation results are summarized in Table 2. The baseline model achieves an mAP@50 of 76.9% and an mAP@50:95 of 52.6%, with 20.1 M parameters, 68.0 GFLOPs, and an inference time of 7.7 ms. Adding either the IMS module or the GFEM to this baseline improves performance, to different extents, indicating that the two key modules proposed in this work can effectively enhance object detection performance in low-light scenarios.

From the single-module results, the IMS module brings a more significant performance gain. After incorporating IMS into the baseline model, mAP@50 increases from 76.9% to 79.1%, and mAP@50:95 increases from 52.6% to 54.9%, corresponding to improvements of 2.2 and 2.3 percentage points, respectively. Meanwhile, the number of parameters increases from 20.1 M to 20.6 M, GFLOPs increase from 68.0 to 72.5, and the inference time rises from 7.7 ms to 8.2 ms. Combined with the method design presented in the previous section, this result indicates that IMS explicitly models illumination variation information in low-light images by constructing an illumination modeling subnetwork, and further enhances the model’s perception of degraded illumination features through contrastive learning, thereby effectively improving detection accuracy under low-light conditions. To further analyze whether IMS actually responds to illumination variations, we apply gamma transformations with different intensities to a set of normally illuminated images and visualize the feature representations produced by IMS using t-SNE. As shown in Figure 11, the IMS features corresponding to different illumination levels exhibit relatively clear separability in the embedding space. Since the underlying image content remains unchanged and only the illumination intensity is adjusted, this visualization provides supplementary evidence that IMS can capture illumination-related degradation characteristics rather than merely extracting generic visual features.

In comparison, when only the GFEM is introduced, the model achieves an mAP@50 of 78.2% and an mAP@50:95 of 53.8%, which are 1.3 and 1.2 percentage points higher than the baseline, respectively. Meanwhile, the number of parameters increases to 21.2 M, GFLOPs increase to 68.5, and the inference time becomes 8.0 ms. This result demonstrates that GFEM also contributes positively to detection performance. Its improvement mainly comes from enhancing the global contextual modeling capability of the network through the transposed self-attention mechanism, thereby improving feature representation and detection stability in complex scenes. This is particularly important for low-light detection because local textures are often weakened, object boundaries may become blurred, and foreground targets can be easily confused with dark or noisy backgrounds. Under such conditions, relying only on local convolutional features may be insufficient for robust recognition. By modeling long-range dependencies and strengthening global semantic reasoning, GFEM helps the detector infer object identity from broader spatial context and improves target-background discrimination under degraded illumination conditions.
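To make the idea of transposed self-attention concrete, the sketch below computes attention across channels rather than across spatial positions, so the attention map is C × C instead of HW × HW. It is a generic illustration and not the authors' GFEM: the plain matrix projections (standing in for 1 × 1 convolutions), the L2 normalization, the `temperature` parameter, and all names are assumptions.

```python
import numpy as np


def transposed_self_attention(x, w_q, w_k, w_v, temperature=1.0):
    """Channel-wise (transposed) self-attention over a (C, H, W) feature map.

    w_q, w_k, w_v are (C, C) projections, used here in place of 1x1 convolutions.
    """
    c, h, w = x.shape
    tokens = x.reshape(c, h * w)  # (C, N) with N = H * W
    q, k, v = w_q @ tokens, w_k @ tokens, w_v @ tokens
    # L2-normalize each channel so the logits are bounded cosine similarities.
    q = q / (np.linalg.norm(q, axis=1, keepdims=True) + 1e-6)
    k = k / (np.linalg.norm(k, axis=1, keepdims=True) + 1e-6)
    logits = (q @ k.T) / temperature  # (C, C) channel-to-channel scores
    logits = logits - logits.max(axis=1, keepdims=True)  # stable softmax
    attn = np.exp(logits)
    attn = attn / attn.sum(axis=1, keepdims=True)
    out = attn @ v  # every output channel mixes all spatial positions
    return out.reshape(c, h, w), attn


rng = np.random.default_rng(0)
x = rng.standard_normal((8, 5, 6))
w_q, w_k, w_v = (rng.standard_normal((8, 8)) for _ in range(3))
out, attn = transposed_self_attention(x, w_q, w_k, w_v)
```

Each row of `attn` weights whole channels, each of which spans the full image, so global context is aggregated at a cost that grows with the number of channels rather than with the square of the number of pixels.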

As shown in Table 2, the model achieves the best overall performance when IMS and GFEM are introduced together. Specifically, mAP@50 reaches 80.3% and mAP@50:95 reaches 56.0%, both 3.4 percentage points higher than the baseline. Meanwhile, the number of parameters, GFLOPs, and inference time are 21.7 M, 73.0, and 8.6 ms, respectively. This result indicates that IMS and GFEM are complementary. The former focuses on illumination modeling and degradation perception under low-light conditions, whereas the latter emphasizes global semantic modeling. More specifically, IMS enhances the detector’s sensitivity to illumination degradation, as supported by the t-SNE visualization, while GFEM compensates for the insufficiency of local features by introducing global contextual dependencies. Their combination therefore further improves detection performance in complex low-light environments.
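The gains quoted in the ablation discussion can be checked directly from the reported figures. The snippet below is only a bookkeeping sketch: the numbers are copied from the text describing Table 2, while the dictionary layout and the helper name `gains` are illustrative choices.

```python
# Columns: mAP@50 (%), mAP@50:95 (%), Params (M), GFLOPs, inference time (ms)
ablation = {
    "baseline": (76.9, 52.6, 20.1, 68.0, 7.7),
    "+IMS": (79.1, 54.9, 20.6, 72.5, 8.2),
    "+GFEM": (78.2, 53.8, 21.2, 68.5, 8.0),
    "+IMS+GFEM": (80.3, 56.0, 21.7, 73.0, 8.6),
}


def gains(name):
    """Difference between a configuration and the baseline, column by column."""
    base = ablation["baseline"]
    return tuple(round(v - b, 1) for v, b in zip(ablation[name], base))


for name in ("+IMS", "+GFEM", "+IMS+GFEM"):
    print(name, gains(name))
```

On mAP@50 the single-module gains (2.2 and 1.3 points) sum to 3.5 points, close to the 3.4 points obtained when both modules are used, which is consistent with the two modules addressing largely different weaknesses.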

To further validate the effectiveness of AWD in the IMS module, we conduct an additional ablation study by replacing AWD with a standard downsampling convolution while keeping the overall structure of IMS unchanged. As shown in Table 3, when standard convolution is used for downsampling, the model achieves an mAP@50 of 78.5%, with 22.7 M parameters and 71.4 GFLOPs. After replacing it with AWD, mAP@50 increases to 79.1%, the number of parameters decreases to 20.6 M, and GFLOPs rise slightly from 71.4 to 72.5. Compared with the standard downsampling convolution, AWD therefore improves mAP@50 by 0.6 percentage points while reducing the number of parameters by 2.1 M, at the cost of 1.1 additional GFLOPs. This indicates that AWD can perform feature downsampling and information preservation more effectively under low-light conditions, thereby enhancing the model’s feature extraction capability in degraded scenes. Since low-light images usually contain amplified noise and weakened structural details, the adaptive downsampling process is beneficial for preserving useful local information while reducing the negative influence of redundant or noisy responses.

In addition, we further conduct ablation experiments on the two loss weight coefficients, $\lambda_{det}$ and $\lambda_{ctr}$, to analyze the influence of different parameter combinations on detection performance. As shown in Table 4, the best detection performance is achieved when $\lambda_{det} = 0.2$ and $\lambda_{ctr} = 0.1$, yielding an mAP@50 of 80.3%. When the parameter settings are (0.5, 0.1), (0.2, 0.3), (0.1, 0.1), and (0.4, 0.2), the corresponding mAP@50 values are 79.5%, 79.9%, 79.4%, and 80.1%, respectively. These results indicate that different loss weight settings have a noticeable impact on model performance, and that the combination of $\lambda_{det} = 0.2$ and $\lambda_{ctr} = 0.1$ provides a better balance between the detection loss and the contrastive loss, thereby leading to superior detection performance. This also suggests that the detection loss should remain the dominant optimization objective because it directly supervises object localization and classification, whereas the contrastive loss mainly serves as an auxiliary regularization term for illumination-aware feature learning. If the contrastive loss is over-weighted, it may interfere with the task-oriented optimization of the detector.
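The role of the two coefficients can be illustrated with a small sketch. It assumes the total loss is a plain weighted sum of the detection and contrastive losses, which is the usual reading of "weighting factors" but is stated here as an assumption; the function name `total_loss` is hypothetical, and the mAP@50 values are those quoted above for Table 4.

```python
def total_loss(det_loss, ctr_loss, lambda_det=0.2, lambda_ctr=0.1):
    """Assumed form: weighted sum of detection and contrastive losses."""
    return lambda_det * det_loss + lambda_ctr * ctr_loss


# (lambda_det, lambda_ctr) -> mAP@50 (%), as quoted in the text for Table 4
table4 = {
    (0.2, 0.1): 80.3,
    (0.5, 0.1): 79.5,
    (0.2, 0.3): 79.9,
    (0.1, 0.1): 79.4,
    (0.4, 0.2): 80.1,
}

best = max(table4, key=table4.get)
ratios = {pair: pair[0] / pair[1] for pair in table4}
print(best, ratios[best])  # (0.2, 0.1) 2.0
```

The two highest-scoring settings, (0.2, 0.1) and (0.4, 0.2), share the same 2:1 ratio, which may hint that the relative weighting matters more than the absolute scale, although five settings are too few to draw a firm conclusion.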

Overall, the proposed IMS, GFEM, and AWD designs all contribute effectively to improving object detection performance under low-light conditions. Among them, IMS serves as the primary source of performance gain by enhancing illumination-aware representation learning, GFEM further strengthens global contextual modeling and target-background discrimination, and AWD reduces the number of parameters while maintaining improved detection accuracy. Together with a proper setting of the loss weight coefficients, the proposed model achieves a favorable balance between detection performance and optimization behavior, which validates the rationality and effectiveness of the proposed method design.

4.5. Experimental Results

Experimental Results on the ExDark Dataset. To further validate the effectiveness of the proposed method in low-light scenarios, we compare it with several state-of-the-art detectors on the ExDark dataset, including YOLOv9m, YOLOv10m, YOLOv11m, IA-YOLO, GDIP, DETR, DE-YOLO, MAET, Zero-DCE, YOLA, and PE-YOLO. For the public ExDark dataset, the results of several representative low-light detection or enhancement-based methods, namely DE-YOLO, MAET, Zero-DCE, YOLA, and PE-YOLO, are taken from the corresponding published studies. To keep the comparison as fair as possible, the input resolution, data partitioning strategy, and evaluation metrics adopted in our experiments are kept consistent with those used in these previous methods. In addition, the results of IA-YOLO, GDIP, and DETR are reproduced by us under the same experimental setting. For the experiments on Lowlight_Mine, DARK FACE, Dark_VOC, and VOC, all comparative results are reproduced by us under consistent experimental settings, including the same training strategy, input resolution, data partitioning strategy, and evaluation metrics. The quantitative results on ExDark are summarized in Table 5. It can be observed that the proposed method achieves the highest detection accuracy in the bicycle, cat, chair, dog, motorbike, people, and table categories, and also obtains competitive performance in the car category. In addition, it obtains the best overall detection performance among all compared methods, with an mAP@50 of 80.3%. Specifically, in terms of the overall mAP@50, the proposed method outperforms YOLOv9m, YOLOv10m, YOLOv11m, IA-YOLO, GDIP, DETR, DE-YOLO, MAET, Zero-DCE, YOLA, and PE-YOLO by 2.8, 4.4, 3.4, 6.0, 4.7, 5.6, 3.0, 2.6, 3.4, 5.1, and 2.3 percentage points, respectively. These results indicate that the proposed method can effectively improve object detection performance in complex low-light scenes, with the advantage holding across multiple categories.

To provide a more intuitive comparison of detection performance, we further visualize the detection results of the baseline model YOLO11m and the proposed method on several representative images from the ExDark test set, as shown in Figure 12. In particular, Figure 12a, Figure 12b, and Figure 12c present the ground-truth annotations (GT), the detection results of YOLO11m, and the detection results of the proposed method, respectively. In the first example, YOLO11m incorrectly localizes the chair object, whereas the proposed method is able to accurately detect its position, demonstrating superior illumination-aware localization capability. In the second example, YOLO11m mistakenly identifies a background region as a person, resulting in a false positive, while the proposed method correctly detects the car without producing false detections. In the third example, the proposed method consistently yields higher confidence scores for detected objects than YOLO11m, indicating stronger feature representation and discrimination ability. Overall, these qualitative results clearly and intuitively demonstrate that the proposed method achieves higher detection accuracy and stronger robustness under low-light conditions.

Figure 13 further illustrates the differences between YOLO11m and the proposed method in terms of heatmap visualization. (1) In the first example, the attention of YOLO11m is mainly concentrated on local regions, causing some foreground objects to be overlooked. In contrast, the proposed method is able to focus more comprehensively on key target areas such as vehicles and pedestrians, resulting in a more balanced and coherent heatmap response. (2) In the second example, the attention distribution of YOLO11m is noticeably affected by background interference, whereas the proposed method primarily concentrates its attention on the target objects, demonstrating stronger regional discrimination capability. (3) In the third example, under complex scene conditions, the proposed method effectively identifies and focuses on the target regions, while the heatmap of YOLO11m appears relatively scattered, with weaker responses on certain critical targets. Overall, the heatmaps of the proposed method are more consistently concentrated within the detection bounding boxes, exhibiting stronger feature focus and semantic consistency. These observations indicate that, compared with YOLO11m, the proposed method is more effective in capturing and attending to salient features for object detection.

Experimental Results on the Lowlight_Mine Dataset. To further validate the effectiveness of the proposed method in mining low-light scenarios, we conduct comparative experiments on the self-constructed mining low-light dataset, Lowlight_Mine. The quantitative results are summarized in Table 6. It can be observed that the proposed method achieves the best performance in terms of Precision, Recall, mAP@50, and mAP@50:95, reaching 92.3%, 82.0%, 89.3%, and 57.8%, respectively. Specifically, compared with YOLOv9m, the proposed method improves Precision, Recall, mAP@50, and mAP@50:95 by 3.5, 6.2, 4.1, and 4.3 percentage points, respectively. Compared with YOLOv10m, the corresponding improvements are 4.2, 6.0, 4.2, and 3.6 percentage points, respectively. Compared with YOLOv11m, the proposed method further improves these four metrics by 3.1, 4.6, 3.7, and 3.6 percentage points, respectively. In addition, compared with IA-YOLO, GDIP, DETR, DE-YOLO, MAET, Zero-DCE, YOLA, and PE-YOLO, the proposed method improves mAP@50 by 5.0, 4.5, 5.4, 3.8, 3.3, 4.4, 4.3, and 3.0 percentage points, respectively. For the stricter mAP@50:95 metric, the corresponding improvements are 5.1, 4.7, 5.4, 3.8, 3.4, 5.0, 4.6, and 2.8 percentage points, respectively. These results indicate that the proposed method not only improves target recognition capability under a relatively loose IoU threshold, but also enhances localization quality under stricter IoU thresholds. Overall, the proposed method can more effectively improve target recognition and localization capability in mining low-light environments, while showing clear advantages in reducing missed detections and enhancing detection stability, thereby achieving superior overall detection performance.
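The distinction between the loose and strict IoU criteria can be made concrete with a short example. This is a generic sketch of the standard definitions, not code from the paper: mAP@50 counts a detection as correct at IoU ≥ 0.50, whereas mAP@50:95 averages over the thresholds 0.50, 0.55, ..., 0.95. The boxes and helper names are made up for illustration.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# The ten IoU thresholds averaged by mAP@50:95
THRESHOLDS = [round(0.50 + 0.05 * i, 2) for i in range(10)]


def thresholds_passed(pred, gt):
    """Thresholds at which the prediction counts as a true positive."""
    overlap = iou(pred, gt)
    return [t for t in THRESHOLDS if overlap >= t]


gt_box = (0, 0, 100, 100)
pred_box = (10, 0, 110, 100)  # same size, shifted 10 px to the right
passed = thresholds_passed(pred_box, gt_box)
print(round(iou(pred_box, gt_box), 3), len(passed))  # 0.818 7
```

A 10-pixel shift on a 100-pixel box still counts as correct for mAP@50 but fails the three strictest thresholds, so gains on mAP@50:95 reflect tighter box localization rather than merely finding more objects.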

To provide a more intuitive comparison of detection performance in mining low-light scenes, we further visualize the detection results of YOLOv11m and IADNet on several representative images selected from the Lowlight_Mine test set, as shown in Figure 14. In particular, Figure 14a and Figure 14b present the detection results of YOLOv11m and the proposed method, respectively. Overall, under complex low-light mining environments, IADNet exhibits stronger target perception capability and more stable detection performance.

In the first example, the scene combines insufficient illumination, closely adjacent targets, and distant small targets. Although YOLOv11m detects the nearby targets stably, it still fails to detect the distant small target. In contrast, IADNet detects all miner targets in the image, including the distant weak-response target, which demonstrates stronger small-target perception capability.

In the second example, the personnel on the right side of the image are densely distributed, while the background contains obvious shadows and local strong reflection interference, making the target contours blurred and reducing the separability between foreground and background. Although both YOLOv11m and IADNet exhibit a certain degree of missed detection in the densely populated region on the right, the overall detection performance of IADNet remains superior to that of YOLOv11m. Specifically, for a miner located in the dark region on the left side of the image, YOLOv11m recognizes it but still misses the farthest small-scale miner. In contrast, IADNet successfully detects both targets. This indicates that IADNet has stronger feature representation and target recognition capability in weak-light regions and distant small-target scenarios.

In the third example, both YOLOv11m and IADNet detect the true targets accurately and stably. Even so, IADNet still shows better overall performance in terms of target localization and detection reliability.

Overall, these three groups of qualitative results clearly and intuitively demonstrate that, compared with YOLOv11m, IADNet exhibits stronger detection capability in mining low-light scenes and can more effectively reduce missed detections while maintaining stable detection performance. This further supports the effectiveness of the proposed method in mining-related low-light detection scenarios.

Experimental Results on the DARK FACE Dataset. To further verify the robustness and generalization ability of IADNet in real low-light environments, we additionally conduct comparative experiments on the challenging DARK FACE dataset. This dataset contains a large number of real-world low-illumination face images, in which facial details are severely degraded by weak lighting, blur, noise, and low contrast. Therefore, it can effectively evaluate the model’s capability for target perception and localization under extremely weak illumination conditions.

As shown in Table 7, IADNet achieves the best overall performance among all compared methods, with 68.3% Precision, 56.5% Recall, 54.6% mAP@50, and 31.2% mAP@50:95. Compared with YOLOv11m, IADNet improves Precision, Recall, mAP@50, and mAP@50:95 by 1.6, 1.5, 1.4, and 0.4 percentage points, respectively. Compared with PE-YOLO, which is the strongest competing method in terms of mAP@50 among the low-light enhancement-based methods, IADNet further improves mAP@50 and mAP@50:95 by 1.7 and 0.7 percentage points, respectively. These results indicate that IADNet can enhance feature representation in complex low-light scenes and improve detection stability under severe illumination degradation. In particular, the improvement in Recall suggests that the proposed method can reduce missed detections to some extent, while the improvement in mAP@50:95 indicates better localization quality under stricter IoU thresholds.

In addition, we visualize the detection results of YOLO11m and IADNet, as shown in Figure 15. It can be observed that IADNet is able to detect small face targets hidden in dark regions more reliably, whereas YOLO11m still suffers from missed detections under extremely weak lighting conditions. The qualitative results further demonstrate the effectiveness of IADNet in challenging real-world low-light face detection scenarios.

Experimental Results on the Dark_VOC Dataset. To further examine the robustness of the proposed method under controlled synthetic illumination degradation, we conduct experiments on the synthesized low-light dataset Dark_VOC. We compare IADNet with several representative methods, including YOLOv9m, YOLOv10m, YOLOv11m, IA-YOLO, GDIP, DE-YOLO, MAET, and Zero-DCE. As reported in Table 8, IADNet achieves the best overall performance, with a total mAP@50 of 91.6%. Specifically, IADNet obtains the highest detection accuracy in the Person, Car, and Bicycle categories, reaching 90.1%, 93.2%, and 92.8%, respectively. For the Bus and Motorbike categories, IADNet also maintains competitive performance, although DE-YOLO achieves slightly higher results in these two categories. In terms of the overall mAP@50, IADNet outperforms the second-best method DE-YOLO by 0.9 percentage points, indicating that the proposed method maintains stable detection performance under controlled low-light degradation.

In addition, we provide qualitative comparisons between the baseline model YOLO11m and the proposed IADNet, as illustrated in Figure 16. In the first example, YOLO11m fails to detect a vehicle located in a dark region in the lower-right corner, whereas IADNet successfully identifies the object with accurate localization. In the second example, under extremely low illumination conditions, YOLO11m again misses a vehicle, while IADNet remains capable of detecting the target correctly. These qualitative results further show that IADNet can improve target perception and localization under synthesized low-light conditions. It should be noted that Dark_VOC is generated through synthetic illumination degradation and cannot fully reproduce real-world low-light factors such as sensor noise, scattering, local glare, and occlusion. Therefore, the Dark_VOC experiment is used as a supplementary robustness validation under controlled degradation, while real-scene validation is mainly supported by ExDark, Lowlight_Mine, and DARK FACE.

Experimental Results on the VOC Dataset. To evaluate the generalization capability of IADNet under normal illumination conditions, we further conduct experiments on the VOC dataset. The quantitative results are summarized in Table 9. It can be observed that IADNet achieves the highest detection accuracy in four categories, namely Person, Car, Bus, and Motorbike, and obtains the best overall detection performance with an mAP@50 of 93.0%. Specifically, IADNet outperforms YOLOv9m, YOLOv10m, and YOLOv11m by 1.2, 1.6, and 0.5 percentage points in terms of overall mAP@50, respectively. These results indicate that IADNet maintains strong detection performance and good generalization capability even under normal lighting conditions.

In addition, we further provide qualitative comparisons between the baseline model YOLO11m and IADNet, as illustrated in Figure 17. In the first example, YOLO11m fails to detect a partially visible vehicle on the right side of the image, whereas IADNet successfully identifies the target. In the second example, YOLO11m detects only three vehicles and misses one instance, while IADNet accurately detects all four vehicles. Overall, these qualitative results further demonstrate that IADNet exhibits robust and reliable detection performance not only in low-light environments but also under normal illumination conditions.

Comparison of Computational Efficiency. To comprehensively evaluate the computational efficiency of the proposed model, we compare different network architectures in terms of the number of parameters (Params), computational complexity (GFLOPs), and inference time. The experimental results are summarized in Table 10. As shown, the proposed IADNet achieves competitive detection accuracy while maintaining a relatively low computational cost. Moreover, under the same hardware environment, IADNet attains an average inference time of 8.6 ms, which remains within an acceptable range for practical applications. These results demonstrate a favorable trade-off between accuracy and efficiency, highlighting the effectiveness and practicality of the proposed architecture for real-world deployment.
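For reference, the reported latencies can be converted into throughput and relative overhead. The sketch assumes the inference times are per image at batch size 1, which the text does not state explicitly; the baseline figure of 7.7 ms is the one given in the ablation study, and the helper names are illustrative.

```python
def fps(latency_ms):
    """Frames per second implied by a per-image latency in milliseconds."""
    return 1000.0 / latency_ms


def relative_increase(new, old):
    """Percentage increase of `new` over `old`."""
    return 100.0 * (new - old) / old


baseline_ms, iadnet_ms = 7.7, 8.6  # values quoted in the text
print(round(fps(baseline_ms), 1))  # 129.9
print(round(fps(iadnet_ms), 1))    # 116.3
print(round(relative_increase(iadnet_ms, baseline_ms), 1))  # 11.7
```

Under that assumption, IADNet runs at roughly 116 frames per second on the test hardware, about 12% slower than the baseline.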

5. Conclusions

In this paper, we propose a novel object detection framework for low-light environments, termed IADNet. The proposed method introduces an Illumination Modeling Subnetwork (IMS) to extract illumination-aware and degradation-aware auxiliary features from low-light images, thereby improving the illumination perception capability of the detection network. In addition, an Adaptive Weighted Downsampling (AWD) layer is incorporated into the IMS to reduce noise interference and preserve useful local information under degraded lighting conditions. Furthermore, a Global Feature Enhancement Module (GFEM) is designed to strengthen the global context modeling capability of the network and improve feature representation in complex scenes.

Extensive experiments on the ExDark, Lowlight_Mine, DARK FACE, Dark_VOC, and VOC datasets demonstrate that IADNet achieves superior performance compared with the baseline model and several representative detectors. In particular, the results on the Lowlight_Mine dataset demonstrate that the proposed method performs well in object detection tasks under underground mining scenarios, where weak illumination, complex background interference, and small distant targets commonly coexist. These results show that IADNet can provide a useful experimental basis for mining-scene object detection tasks, including underground target perception, personnel monitoring, equipment inspection, and safety management.

Although the proposed method achieves promising experimental results, there are still issues that require further research. Low-light object detection datasets are still relatively scarce, especially for underground mining scenarios. Although the Lowlight_Mine dataset provides representative samples of mining-related low-light scenes, its scale and scene coverage can be further expanded to better support research and application. Therefore, in future work, we will continue to improve the object detection model and further collect low-light datasets related to underground scenarios, including more on-site images under different illumination levels, equipment types, and operational conditions. These efforts will provide a more sufficient data basis and model support for future research on object detection in mining scenarios.

Author Contributions

Conceptualization, W.C. and P.Y.; methodology, W.C. and P.Y.; software, W.C.; validation, W.C. and W.L.; formal analysis, W.C. and W.L.; investigation, W.C.; data curation, W.C.; writing—original draft preparation, W.C.; writing—review and editing, P.Y. and W.L.; supervision, P.Y.; funding acquisition, P.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Key R&D Program of China, Grant Number 2021YFC3001302.

Data Availability Statement

Publicly available datasets were analyzed in this study. The Exclusively Dark Image Dataset (ExDark) is available at https://github.com/cs-chan/Exclusively-Dark-Image-Dataset (accessed on 11 May 2025). Additional data used in this study were collected from industrial mining scenarios and are not publicly available due to confidentiality restrictions.

Acknowledgments

The authors would like to thank Beijing Key Laboratory of Information Service Engineering, Beijing Union University, and the School of Civil and Resource Engineering, University of Science and Technology Beijing, for their support during this research.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 1137–1149.
2. Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 7464–7475.
3. Wang, C.Y.; Yeh, I.H.; Liao, H.Y.M. YOLOv9: Learning what you want to learn using programmable gradient information. In Proceedings of the European Conference on Computer Vision; Springer: Cham, Switzerland, 2024; pp. 1–21.
4. Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J.; Ding, G. YOLOv10: Real-time end-to-end object detection. Adv. Neural Inf. Process. Syst. 2024, 37, 107984–108011.
5. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988.
6. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Proceedings of the European Conference on Computer Vision; Springer: Cham, Switzerland, 2020; pp. 213–229.
7. Zhang, H.; Li, F.; Liu, S.; Zhang, L.; Su, H.; Zhu, J.; Ni, L.M.; Shum, H.-Y. DINO: DETR with improved denoising anchor boxes for end-to-end object detection. arXiv 2022, arXiv:2203.03605.
8. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common objects in context. In Proceedings of the Computer Vision-ECCV 2014; Springer: Cham, Switzerland, 2014; pp. 740–755.
9. Everingham, M.; Van Gool, L.; Williams, C.K.I.; Winn, J.; Zisserman, A. The PASCAL visual object classes (VOC) challenge. Int. J. Comput. Vis. 2010, 88, 303–338.
10. Zhang, H.; Wang, Y.; Yang, Y. LL-WSOD: Weakly supervised object detection in low-light. J. Vis. Commun. Image Represent. 2024, 98, 104010.
11. Hui, Y.; Wang, J.; Li, B. WSA-YOLO: Weak-supervised and adaptive object detection in the low-light environment for YOLOv7. IEEE Trans. Instrum. Meas. 2024, 73, 1–12.
12. Liu, W.; Ren, G.; Yu, R.; Guo, S.; Zhu, J.; Zhang, L. Image-adaptive YOLO for object detection in adverse weather conditions. Proc. AAAI Conf. Artif. Intell. 2022, 36, 1792–1800.
13. Kalwar, S.; Patel, D.; Aanegola, A.; Konda, K.R.; Garg, S.; Krishna, K.M. GDIP: Gated differentiable image processing for object detection in adverse conditions. In Proceedings of the 2023 IEEE International Conference on Robotics and Automation (ICRA); IEEE: New York, NY, USA, 2023; pp. 7083–7089.
14. Zhang, H.; Yu, H.; Tao, Y.; Zhu, W.; Zhang, K. Improvement of ship target detection algorithm for YOLOv7-tiny. IET Image Process. 2024, 18, 1710–1718.
15. Dong, P.; Wang, Y.; Yu, Q.; Feng, W.; Zong, G. AMC-YOLO: Improved YOLOv8-based defect detection for cigarette packs. IET Image Process. 2024, 18, 4873–4886.
16. Zhang, D.; Fang, W.; Liu, Y.; Lyu, Z.; Xiong, C.; Wang, Z. Two-stage video anomaly detection based on dual-stream networks and multi-instance learning. IET Image Process. 2024, 18, 4843–4851.
17. Girshick, R. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1440–1448.
18. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2961–2969.
19. Cai, Y.; Bian, H.; Lin, J.; Wang, H.; Timofte, R.; Zhang, Y. Retinexformer: One-stage Retinex-based transformer for low-light image enhancement. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 12504–12513.
20. Li, C.; Guo, C.; Chen, C.L. Learning to enhance low-light image via zero-reference deep curve estimation. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 4225–4238.
21. Jia, X.; Zhu, C.; Li, M.; Tang, W.; Zhou, W. LLVIP: A visible-infrared paired dataset for low-light vision. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 3496–3504.
22. Li, X.; Li, Z.; Zhou, L.; Huang, Z. FOLD: Low-Level Image Enhancement for Low-Light Object Detection Based on FPGA MPSoC. Electronics 2024, 13, 230.
23. Li, J.; Wang, X.; Chang, Q.; Wang, Y.; Chen, H. Research on low-light environment object detection algorithm based on YOLO_GD. Electronics 2024, 13, 3527.
24. Land, E.H. The Retinex theory of color vision. Sci. Am. 1977, 237, 108–129.
25. Jobson, D.J.; Rahman, Z.; Woodell, G.A. Properties and performance of a center/surround Retinex. IEEE Trans. Image Process. 1997, 6, 451–462.
26. Jobson, D.J.; Rahman, Z.; Woodell, G.A. A multiscale Retinex for bridging the gap between color images and the human observation of scenes. IEEE Trans. Image Process. 1997, 6, 965–976.
27. Kimmel, R.; Elad, M.; Shaked, D.; Keshet, R.; Sobel, I. A variational framework for Retinex. Int. J. Comput. Vis. 2003, 52, 7–23.
Ren, W.; Liu, S.; Ma, L.; Xu, Q.; Xu, X.; Cao, X.; Du, J.; Yang, M.-H. Low-light image enhancement via a deep hybrid network. IEEE Trans. Image Process. 2019, 28, 4364–4375. [Google Scholar] [CrossRef]
Guo, C.; Li, C.; Guo, J.; Loy, C.C.; Hou, J.; Kwong, S.; Cong, R. Zero-reference deep curve estimation for low-light image enhancement. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 1780–1789. [Google Scholar]
Xu, X.; Wang, R.; Lu, J. Low-light image enhancement via structure modeling and guidance. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 9893–9903. [Google Scholar]
Khanam, R.; Hussain, M. YOLOv11: An overview of the key architectural enhancements. arXiv 2024, arXiv:2410.17725. [Google Scholar] [CrossRef]
Chen, L.; Fu, Y.; Wei, K.; Zheng, D.; Heide, F. Instance segmentation in the dark. Int. J. Comput. Vis. 2023, 131, 2198–2218. [Google Scholar] [CrossRef]
Zamir, S.W.; Arora, A.; Khan, S.; Hayat, M.; Khan, F.S.; Yang, M.-H. Restormer: Efficient transformer for high-resolution image restoration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 5728–5739. [Google Scholar]
Loh, Y.P.; Chan, C.S. Getting to know low-light images with the exclusively dark dataset. Comput. Vis. Image Underst. 2019, 178, 30–42. [Google Scholar] [CrossRef]
Yang, W.; Yuan, Y.; Ren, W.; Liu, J.; Scheirer, W.J.; Wang, Z.; Zhang, T.; Zhong, Q.; Xie, D.; Pu, S.; et al. Advancing image understanding in poor visibility environments: A collective benchmark study. IEEE Trans. Image Process. 2020, 29, 5737–5752. [Google Scholar] [CrossRef] [PubMed]
Qin, Q.; Chang, K.; Huang, M.; Li, G. DENet: Detection-driven enhancement network for object detection under adverse weather conditions. In Proceedings of the Asian Conference on Computer Vision, Macao, China, 4–8 December 2022; pp. 2813–2829. [Google Scholar]
Cui, Z.; Qi, G.J.; Gu, L.; You, S.; Zhang, Z.; Harada, T. Multitask AET with orthogonal tangent regularity for dark object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 2553–2562. [Google Scholar]
Hong, M.; Cheng, S.; Huang, H.; Fan, H.; Liu, S. You only look around: Learning illumination-invariant feature for low-light object detection. Adv. Neural Inf. Process. Syst. 2024, 37, 87136–87158. [Google Scholar]
Yin, X.; Yu, Z.; Fei, Z.; Lv, W.; Gao, X. PE-YOLO: Pyramid enhancement network for dark object detection. In Proceedings of the International Conference on Artificial Neural Networks; Springer: Cham, Switzerland, 2023; pp. 163–174. [Google Scholar]

Figure 1. Under low-light conditions, variations in illumination intensity and light source types can lead to diverse and complex forms of image degradation.

Figure 2. The framework of IADNet.

Figure 3. Adaptive Residual Block.

Figure 4. Feature Fusion Module.

Figure 5. Architecture of the adaptive weighted downsampling layer.

Figure 6. Architecture of the Global Feature Enhancement Module.

Figure 7. Transposed Self-Attention.

Figure 8. The number of instances for all categories in the ExDark dataset.

Figure 9. Sample images from the ExDark dataset.

Figure 10. Sample images from the Lowlight_Mine dataset.

Figure 11. t-SNE visualization of features extracted by IMS from low-light images with varying illumination intensities.

Figure 12. Comparison of YOLO11m and the proposed algorithm on the ExDark dataset.

Figure 13. Comparison of heatmap visualization results between YOLO11m and the proposed algorithm.

Figure 14. Comparison of YOLO11m and the proposed algorithm on the Lowlight_Mine dataset.

Figure 15. Comparison of YOLO11m and the proposed algorithm on the DARK FACE dataset.

Figure 16. Comparison of YOLO11m and the proposed algorithm on the Dark_VOC dataset.

Figure 17. Comparison of YOLO11m and the proposed algorithm on the VOC dataset.

Table 1. Experimental Environment and Hyperparameter Settings.

Experimental Environment and Hyperparameters	Version/Value
Operating System	Ubuntu 20.04.6
CPU	13th Gen Intel® Core™ i9-13900K
GPU	NVIDIA GeForce RTX 4090 24 GB
Python	3.9.18
PyTorch	2.0.1
CUDA	11.8
Learning Rate Decay Strategy	Cosine Annealing
Optimizer	SGD
Batch Size	16
Initial Learning Rate	0.02
Number of Training Epochs	100
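
To make the learning-rate settings in Table 1 concrete, the following minimal sketch evaluates a standard cosine annealing schedule using the initial learning rate (0.02) and epoch count (100) from the table. The floor value `ETA_MIN = 0.0`, the per-epoch (rather than per-iteration) update, and the function name `cosine_lr` are assumptions for illustration; Table 1 does not specify them, and the training framework may use a different cosine variant.

```python
import math

INITIAL_LR = 0.02   # from Table 1
EPOCHS = 100        # from Table 1
ETA_MIN = 0.0       # assumption: the final learning rate is not stated in Table 1


def cosine_lr(epoch: int, total: int = EPOCHS,
              lr0: float = INITIAL_LR, eta_min: float = ETA_MIN) -> float:
    """Standard cosine annealing: decays lr0 to eta_min over `total` epochs."""
    return eta_min + 0.5 * (lr0 - eta_min) * (1.0 + math.cos(math.pi * epoch / total))


if __name__ == "__main__":
    # Print the learning rate at a few checkpoints of the schedule.
    for e in (0, 25, 50, 75, 100):
        print(f"epoch {e:3d}: lr = {cosine_lr(e):.5f}")
```

Under these assumptions the rate starts at 0.02, reaches half of that value at the midpoint of training, and decays to the floor at the final epoch.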

Table 2. Ablation study for IADNet on the ExDark dataset.

Baseline	IMS	GFEM	mAP@50/%	mAP@50:95/%	Params/M	GFLOPs	Inference Time (ms)
√			76.9	52.6	20.1	68.0	7.7
√	√		79.1	54.9	20.6	72.5	8.2
√		√	78.2	53.8	21.2	68.5	8.0
√	√	√	80.3	56.0	21.7	73.0	8.6
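
The accuracy and cost trade-off in Table 2 can be read off more easily as differences from the baseline row. The sketch below copies the four rows from the table and computes those differences; the dictionary keys and the helper name `delta_vs_baseline` are illustrative, not taken from the paper.

```python
# Rows copied from Table 2:
# (mAP@50 %, mAP@50:95 %, Params M, GFLOPs, inference time ms)
ABLATION = {
    "baseline": (76.9, 52.6, 20.1, 68.0, 7.7),
    "+IMS": (79.1, 54.9, 20.6, 72.5, 8.2),
    "+GFEM": (78.2, 53.8, 21.2, 68.5, 8.0),
    "+IMS+GFEM": (80.3, 56.0, 21.7, 73.0, 8.6),
}


def delta_vs_baseline(name: str) -> dict:
    """Return accuracy gains and cost increases of a variant relative to the baseline row."""
    base = ABLATION["baseline"]
    row = ABLATION[name]
    return {
        "map50_gain_pp": round(row[0] - base[0], 1),
        "map5095_gain_pp": round(row[1] - base[1], 1),
        "params_increase_pct": round(100 * (row[2] - base[2]) / base[2], 1),
        "gflops_increase_pct": round(100 * (row[3] - base[3]) / base[3], 1),
        "latency_increase_ms": round(row[4] - base[4], 1),
    }


if __name__ == "__main__":
    for variant in ("+IMS", "+GFEM", "+IMS+GFEM"):
        print(variant, delta_vs_baseline(variant))
```

For the full model this gives a gain of 3.4 percentage points in mAP@50 for roughly 8.0% more parameters, 7.4% more GFLOPs, and 0.9 ms of additional inference time. The individual mAP@50 gains of IMS (2.2) and GFEM (1.3) sum to 3.5, close to the combined gain of 3.4.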

Table 3. Ablation study for IMS on the ExDark dataset.

IMS with Conv	IMS with AWD	mAP@50/%	Params/M	GFLOPs
√		78.5	22.7	71.4
	√	79.1	20.6	72.5

Table 4. Effect of different λ_det and λ_ctr settings on detection performance.

λ_det	λ_ctr	mAP@50/%
0.2	0.1	80.3
0.5	0.1	79.5
0.2	0.3	79.9
0.1	0.1	79.4
0.4	0.2	80.1
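
Table 4 varies the two weights that balance the detection loss and the contrastive loss. As a minimal sketch, assuming the training objective is the plain weighted sum λ_det · L_det + λ_ctr · L_ctr (the exact form is given in Section 3.5 and is not reproduced here), the code below shows how the weights would enter the objective and picks the best-performing row of the table. The function names and the loss values used in the example call are hypothetical.

```python
# (lambda_det, lambda_ctr, mAP@50 %) rows copied from Table 4
TABLE_4 = [
    (0.2, 0.1, 80.3),
    (0.5, 0.1, 79.5),
    (0.2, 0.3, 79.9),
    (0.1, 0.1, 79.4),
    (0.4, 0.2, 80.1),
]


def total_loss(det_loss: float, ctr_loss: float,
               lambda_det: float, lambda_ctr: float) -> float:
    """Assumed objective: weighted sum of detection and contrastive losses."""
    return lambda_det * det_loss + lambda_ctr * ctr_loss


def best_setting(rows=TABLE_4):
    """Return the (lambda_det, lambda_ctr, mAP@50) row with the highest mAP@50."""
    return max(rows, key=lambda r: r[2])


if __name__ == "__main__":
    lam_det, lam_ctr, map50 = best_setting()
    print(f"best weights: lambda_det={lam_det}, lambda_ctr={lam_ctr}, mAP@50={map50}")
    # Placeholder loss values, chosen only to show how the weights combine.
    print(total_loss(det_loss=2.0, ctr_loss=1.0, lambda_det=lam_det, lambda_ctr=lam_ctr))
```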

Table 5. Comparison of different methods on the ExDark dataset.

Model	Bicycle	Boat	Bottle	Bus	Car	Cat	Chair	Cup	Dog	Motorbike	People	Table	Total
YOLOv9m [3]	85.6	66.8	72.0	89.7	85.8	83.0	70.1	71.5	84.7	78.6	78.3	63.7	77.5
YOLOv10m [4]	85.8	66.9	72.0	89.0	81.4	81.1	68.1	69.7	82.9	77.0	76.4	60.9	75.9
YOLOv11m [31]	80.5	68.2	72.5	88.4	82.6	80.9	70.3	75.1	79.3	77.2	80.1	67.7	76.9
IA-YOLO [12]	79.1	65.8	70.4	85.6	80.2	78.3	68.5	72.6	76.9	74.8	77.5	62.9	74.3
GDIP [13]	80.2	66.9	71.7	86.4	81.1	79.2	69.8	73.5	77.8	75.6	78.4	66.6	75.6
DETR [6]	78.5	66.1	70.3	85.9	80.6	78.6	68.2	72.9	76.9	75.2	77.6	65.6	74.7
DE-YOLO [36]	80.4	79.7	77.9	91.2	82.7	72.8	69.9	80.1	77.2	76.7	82.0	57.2	77.3
MAET [37]	83.1	78.5	75.6	92.9	83.1	73.4	71.3	79.0	79.8	77.2	81.1	57.0	77.7
Zero-DCE [29]	84.1	77.6	78.3	93.1	83.7	70.3	69.8	77.6	77.4	76.3	81.0	53.6	76.9
YOLA [38]	83.9	78.7	75.3	88.8	79.0	73.4	69.9	71.9	86.8	66.3	78.3	49.8	75.2
PE-YOLO [39]	84.7	79.2	79.3	92.5	83.9	71.5	71.7	79.7	79.7	77.3	81.8	55.3	78.0
Ours	87.3	73.3	73.9	91.8	83.9	84.7	73.6	76.2	87.5	79.7	82.1	71.0	80.3

Table 6. Comparison of different methods on the Lowlight_Mine dataset.

Model	Precision/%	Recall/%	mAP@50/%	mAP@50:95/%
YOLOv9m	88.8	75.8	85.2	53.5
YOLOv10m	88.1	76.0	85.1	54.2
YOLOv11m	89.2	77.4	85.6	54.2
IA-YOLO	87.4	75.0	84.3	52.7
GDIP	87.8	75.5	84.8	53.1
DETR	86.9	74.6	83.9	52.4
DE-YOLO	88.7	76.6	85.5	54.0
MAET	88.9	77.0	86.0	54.4
Zero-DCE	87.6	76.2	84.9	52.8
YOLA	88.4	76.4	85.0	53.2
PE-YOLO	89.0	77.8	86.3	55.0
Ours	92.3	82.0	89.3	57.8

Table 7. Comparison of different methods on the DARK FACE dataset.

Model	Precision/%	Recall/%	mAP@50/%	mAP@50:95/%
YOLOv9m	65.8	54.1	52.5	30.4
YOLOv10m	66.1	54.4	52.8	30.2
YOLOv11m	66.7	55.0	53.2	30.8
IA-YOLO	65.2	53.5	51.8	30.1
GDIP	65.5	53.9	52.3	31.0
DETR	65.1	53.6	51.9	29.8
DE-YOLO	66.0	54.2	52.6	30.2
MAET	65.9	54.6	52.6	29.5
Zero-DCE	65.4	53.8	51.8	27.3
YOLA	66.2	54.5	52.7	30.2
PE-YOLO	66.4	54.8	52.9	30.5
Ours	68.3	56.5	54.6	31.2

Table 8. Comparison of different methods on the Dark_VOC dataset.

Model	Person	Car	Bus	Bicycle	Motorbike	Total
YOLOv9m	89.3	91.5	88.7	90.9	90.1	90.1
YOLOv10m	87.3	89.5	86.8	90.2	89.2	88.6
YOLOv11m	86.2	88.7	85.9	89.3	87.4	87.5
IA-YOLO	84.1	86.5	83.8	87.2	84.9	85.3
GDIP	84.8	87.1	84.6	87.9	86.1	86.1
DE-YOLO	89.2	91.8	90.5	88.6	93.4	90.7
MAET	89.5	91.3	88.7	92.1	89.4	90.2
Zero-DCE	88.5	90.3	87.2	92.7	89.3	89.6
Ours	90.1	93.2	89.7	92.8	92.1	91.6

Table 9. Comparison of different methods on the VOC dataset.

Model	Person	Car	Bus	Bicycle	Motorbike	Total
YOLOv9m	90.5	92.1	91.3	92.8	92.3	91.8
YOLOv10m	90.2	92.5	89.8	93.1	91.4	91.4
YOLOv11m	91.3	93.7	90.8	94.2	92.5	92.5
Ours	91.9	94.8	91.9	93.9	92.6	93.0

Table 10. Comparison of parameters, GFLOPs, and inference time.

Model	Params/M	GFLOPs	Inference Time (ms)
YOLOv9m	20.1	76.8	9.3
YOLOv10m	16.5	59.1	7.8
YOLOv11m	20.1	68.0	7.7
Ours	21.7	73.0	8.6
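
The inference times in Table 10 can be converted into an approximate throughput. The sketch below does this under the assumption that each reported time is the latency for a single image; the table does not state the batch size or whether pre- and post-processing are included, so the resulting frame rates are indicative only. The helper name `fps` is illustrative.

```python
# (Params M, GFLOPs, inference time ms) rows copied from Table 10
TABLE_10 = {
    "YOLOv9m": (20.1, 76.8, 9.3),
    "YOLOv10m": (16.5, 59.1, 7.8),
    "YOLOv11m": (20.1, 68.0, 7.7),
    "Ours": (21.7, 73.0, 8.6),
}


def fps(latency_ms: float) -> float:
    """Frames per second implied by a per-image latency (assumes one image per pass)."""
    return 1000.0 / latency_ms


if __name__ == "__main__":
    for model, (_, _, latency_ms) in TABLE_10.items():
        print(f"{model:9s} {latency_ms:4.1f} ms  ->  {fps(latency_ms):6.1f} FPS")
```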

Share and Cite

MDPI and ACS Style

Cao, W.; Yang, P.; Lyu, W. Explicit Illumination Modeling for Object Detection in Low-Light Environments. Electronics 2026, 15, 2057. https://doi.org/10.3390/electronics15102057

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers.
