Abstract
Low-light video enhancement remains challenging, largely because paired low-light video data are difficult to acquire. This paper proposes Zero-3DCE, a 3D extension of Zero-DCE. Zero-3DCE differs from Zero-DCE by (i) introducing 3D separable convolutions for temporal consistency, (ii) integrating spatial attention for region-specific enhancement, and (iii) combining MS-SSIM and edge-based losses for structural preservation. Separable convolutions are used to capture 3D data while maintaining real-time speed, while a spatial attention module guides the model to regions that require enhancement by adaptively weighting spatial regions across all channels. Coupled with YOLOv11m, Zero-3DCE improves detection accuracy under low-light conditions. The model is trained with a combination of single-frame and multi-frame data. Results show that Zero-3DCE outperforms other low-light enhancers on both 2D and 3D data while achieving real-time speeds. Zero-3DCE outperforms Zero-DCE by +3.4 dB in PSNR and achieves up to 0.11 higher SSIM, demonstrating substantial perceptual and structural improvement.
1. Introduction
Most crimes occur under low-light conditions where both human perception and computer-vision (CV) systems struggle to capture sufficient visual details for detection or classification. Hardware-based solutions such as external illumination or night-vision sensors introduce high costs, maintenance requirements, glare, and light pollution.
Replacing entire camera systems is infeasible, motivating software-based light enhancement methods. Many solutions focus on single-frame enhancement and ignore temporal relationships, leading to flickering and color inconsistency in enhanced videos. This temporal-inconsistency problem is the primary motivation for the proposed Zero-3DCE model. Low-light images often contain noise that existing enhancers inadvertently amplify. Consequently, low-light enhancement requires models that recover illumination without amplifying noise.
Current state-of-the-art solutions to this challenge involve low-light image and video enhancement (LLIE and LLVE) models. LLIE models aim to adjust the luminosity of 2D data while maintaining a high PSNR. The majority of research outputs in LLIE focus on supervised models [1,2,3], which, during training, are exposed to paired low-light and normal-light images. These models often outperform their counterparts, such as semi-supervised, zero-reference, and unsupervised learners, leading to their standardization in LLIE. With the rise of more powerful architectures such as Transformers and Generative Adversarial Networks (GANs), more LLIE researchers have shifted away from CNNs. As noted in [4], the superiority of transformer-based models over CNNs lies in their ability to adaptively adjust to varying lighting conditions in a single image, facilitating more balanced enhancement.
LLIE models are specifically designed to process and enhance images, rendering them impractical for many real-world applications that require video processing. LLIE designers resort to creating image-based models that process videos as a series of images. A major consequence of this approach is the loss of frame-wise temporal information, which results in improperly enhanced videos. LLVE has the added difficulty of a hard-to-acquire dataset. Acquiring paired low-light and normal-light images is a non-trivial task and often results in the use of synthetic data as a workaround. When this issue is expanded to paired video data, the problems encountered are compounded. With natural low-light video data, each frame often exhibits unique lighting patterns. This phenomenon can be replicated in images but proves more challenging in videos, leading to even less meaningful data being available for LLVE models to be trained on.
As a counter to the challenges of dataset acquisition, zero-reference enhancers have been proposed [5,6]. Zero-reference learners rely on specialized loss functions instead of paired data to learn mappings from low-light data to its normal-light equivalent. These models also generalize well to real-world data, making them strong candidates for standardization in real-world LLE applications.
This paper proposes Zero-3DCE, a 3D extension of Zero-DCE for simultaneous LLIE and LLVE. As a proof of concept, it is coupled with a YOLOv11m detector [7] to demonstrate that Zero-3DCE can be utilized for nighttime surveillance and crime detection.
Zero-DCE [8] works by utilizing light enhancement curves, which the model estimates to map low-light input pixels to their enhanced counterpart. The advantage of the model is its zero-reference learning, which means that it does not require paired and labeled data, unlike its supervised counterparts [3,9,10]. Zero-3DCE builds upon Zero-DCE by incorporating edge-preserving and multi-scale structural losses, spatial attention, and separable 3D convolutions to capture temporal consistency efficiently. It is evaluated with a YOLOv11m algorithm to detect weapons (knives and guns) and people.
The contributions of this paper are as follows:
- Extend Zero-DCE to multi-frame (video) enhancement via separable 3D convolutions.
- Introduce edge-aware and MS-SSIM loss functions for structure and contrast preservation.
- Demonstrate that zero-reference learning can match supervised models while achieving real-time inference.
- Integrate Zero-3DCE with YOLOv11 to enable robust nighttime object and crime detection.
2. Related Work
2.1. Image Enhancers
The domain of light enhancement features traditional and deep learning-based models. Traditional enhancement methods, such as gray-level transformations [11], histogram equalization-based models [12], and Retinex theory [13], are simple and require few computational resources. However, they are hindered by their lack of local enhancement capabilities and amplification of noise during image illumination adjustment.
Deep learning-based enhancers dominate recent research, owing to their superior performance and advanced noise-suppression techniques. Deep learners can be categorized into four groups: supervised learners, unsupervised learners, semi-supervised learners, and zero-reference learners.
Supervised learning enhancers dominate because they learn explicit mappings of low-light images to their normal-light pairs. Models such as those proposed in [9,14,15] have produced state-of-the-art results, making supervised learners the benchmark in image enhancement. Unlike supervised learners, unsupervised low-light enhancement (LLE) models [16,17,18] are trained on both poorly lit and optimally lit visual data, without requiring the data to be from the same scene. Semi-supervised enhancers, such as those proposed in [19,20], leverage elements of paired and unpaired learning by utilizing both paired and unpaired data for training. To counter the need for paired data, zero-reference learning models such as those in [21,22] have been proposed. Zero-reference learners do not require paired and labeled data; instead, they rely on non-reference loss functions, which use perceptual and statistical metrics to guide the model to desired results. However, these single-frame approaches overlook temporal relationships, which are critical for video enhancement.
2.2. Video Enhancers
LLVE is critical for video analysis but has received relatively less focus. An obstacle that exists in LLVE, not in LLIE, is the need to preserve frame-wise relationships, which, when lost, results in flickering and related artifacts. Standard LLVE networks employ 3D convolutions to capture inter-frame information by leveraging adjacent frames [23]. MBLLVEN [9] extends MBLLEN by using 3D kernels instead of 2D ones for improved temporal awareness, at the cost of real-time performance. For improved efficiency, AdaEnlight [24] does away with iterative enhancement and instead proposes a temporal-consistency loss, allowing close-to-real-time speeds and improving artifact suppression. Binary neural networks (BNNs), a lightweight alternative [25], use spatial–temporal shift operations and distribution-aware convolutions to capture motion cues across frames and correct misalignment. Despite these advances, many LLVE models still rely on large architectures and fully guided learning, resulting in algorithms ill-prepared for real-world applications. The pursuit of fast, zero-reference LLVE models that generalize well to real-world data has motivated the proposed Zero-3DCE framework.
3. Materials and Methods
This section outlines how the framework of Zero-3DCE expands on the work performed in the Zero-DCE model [8]. Subsequently, the expanded algorithm is coupled with a fine-tuned YOLO model that detects weapons and people. The section also outlines the training data and tools used for developing Zero-3DCE.
3.1. Low-Light Enhancement
In LLVE, spatial features are shared across neighboring frames, making the modeling of temporal correlation essential for effective artifact reduction, as illustrated in Figure 1. Figure 1a presents the low-light input video, while Figure 1b displays its enhanced counterpart produced through frame-wise LLE. The enhanced video exhibits color artifacts resulting from insufficient processing, which occurs because the model’s frame-wise enhancement approach does not adequately capture temporal correlations.
Figure 1.
Illustration of frame-wise enhancement in video data.
Conventional 3D kernels allow a neural network to capture temporal information but incur high computational costs. To address this, Zero-3DCE adopts factorized 3D kernels to maintain frame-wise data at a reduced cost.
To illustrate this, let $C_{in}$ and $C_{out}$ represent the number of input and output channels, respectively. For 3D convolutions (separable or otherwise), the kernel size is defined as $k \times k \times k$. The total number of operations per output voxel for a standard 3D convolution is given in Equation (1), where Ops is the abbreviated term for operations,

$$\mathrm{Ops}_{3DC} = k^{3} \cdot C_{in} \cdot C_{out}. \quad (1)$$

When the standard 3D convolution is transformed into a separable convolution, the depth-wise convolution is computed first, as in Equation (2),

$$\mathrm{Ops}_{dw} = k^{3} \cdot C_{in}. \quad (2)$$

The input channels are then combined into the output channels using a 1 × 1 × 1 convolution. Therefore, the number of point-wise operations is described in Equation (3),

$$\mathrm{Ops}_{pw} = C_{in} \cdot C_{out}, \quad (3)$$

and the total number of operations is thus given in Equation (4),

$$\mathrm{Ops}_{3DSC} = \mathrm{Ops}_{dw} + \mathrm{Ops}_{pw} = k^{3} \cdot C_{in} + C_{in} \cdot C_{out}. \quad (4)$$

Comparing the total number of operations in the standard 3D and separable 3D convolutions, the ratio in Equation (5) is obtained, where 3DSC and 3DC represent the 3D separable convolution and the standard 3D convolution, respectively,

$$\frac{\mathrm{Ops}_{3DSC}}{\mathrm{Ops}_{3DC}} = \frac{k^{3} \cdot C_{in} + C_{in} \cdot C_{out}}{k^{3} \cdot C_{in} \cdot C_{out}} = \frac{1}{C_{out}} + \frac{1}{k^{3}}. \quad (5)$$
Zero-3DCE has 32 input and output channels and a kernel size of k = 3. Using these provided values, the standard 3D convolution has 27,648 operations per voxel, while the separable convolution has 1888 operations per voxel. The separable convolution thus only contains 6.83% of the total number of operations in the standard 3D convolution, calculated using Equation (5).
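The operation counts above can be checked with a few lines of Python; the helper below is a minimal sketch (not from the paper) that reproduces the per-voxel counts and the ratio in Equation (5) for C_in = C_out = 32 and k = 3.

```python
def conv3d_ops(c_in: int, c_out: int, k: int) -> tuple[int, int]:
    """Per-voxel operation counts for standard vs. depth-wise separable 3D convolutions."""
    standard = (k ** 3) * c_in * c_out           # Equation (1)
    separable = (k ** 3) * c_in + c_in * c_out   # Equations (2)-(4): depth-wise + point-wise
    return standard, separable


std, sep = conv3d_ops(32, 32, 3)
print(std, sep, f"{100 * sep / std:.2f}%")       # 27648 1888 6.83%
```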
Zero-3DCE utilizes the same enhancement technique as Zero-DCE, iteratively enhancing the input and subsequently applying residual enhancement. The input is enhanced using the curve shown in Equation (6),

$$LE\big(I(x, y, t)\big) = I(x, y, t) + \alpha(x, y, t)\, I(x, y, t)\, \big(1 - I(x, y, t)\big), \quad (6)$$

where $LE\big(I(x, y, t)\big)$ denotes the enhanced frames at pixel coordinates $(x, y, t)$. The 2D spatial coordinates are given by $(x, y)$, and the 1D temporal coordinate is given by $t$. The term $\alpha(x, y, t)$ adjusts the magnitude of the curves and is bound within [−1, 1].
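For illustration, the iterative application of the curve in Equation (6) can be sketched in PyTorch as follows; the use of eight iterations with one RGB alpha map each is an assumption based on the 24 curve-parameter channels described in Section 3.1.1.

```python
import torch


def apply_curves(frames: torch.Tensor, curve_maps: torch.Tensor, n_iter: int = 8) -> torch.Tensor:
    """Iteratively apply the quadratic enhancement curve of Equation (6).

    frames:     (B, 3, D, H, W) low-light clip with values in [0, 1].
    curve_maps: (B, 3 * n_iter, D, H, W) per-pixel alpha maps in [-1, 1].
    """
    enhanced = frames
    for i in range(n_iter):
        alpha = curve_maps[:, 3 * i:3 * (i + 1)]              # one RGB alpha map per iteration
        enhanced = enhanced + alpha * enhanced * (1.0 - enhanced)
    return enhanced
```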
3.1.1. 3DCE Network
The proposed Zero-3DCE network, illustrated in Figure 2, follows an encoder–decoder architecture with depth-wise separable 3D convolutions and spatial attention. The input to the network is a low-light video clip $I \in \mathbb{R}^{C \times D \times H \times W}$; Zero-3DCE thus operates on short video clips rather than single frames. The temporal depth is set to D = 2, meaning that two consecutive frames are processed jointly using 3D convolutions.
Figure 2.
3D deep curve estimation network. “C” defines the number of channels, “D” is the temporal depth, and “H” defines the spatial height, while “W” is the spatial width.
The encoder consists of three depth-wise separable 3D convolutional blocks (Conv-1, Conv-2 and Conv-3), each producing 32 feature maps, and each followed by a spatial attention module and 3D max-pooling. A fourth depth-wise separable 3D convolution (Conv-4) operates at the bottleneck resolution.
In each spatial attention module, the channel-wise average and maximum projections are computed and concatenated along the channel dimension. The resulting two-channel tensor is processed by a 3D convolution followed by a sigmoid activation to obtain an attention mask, which is then applied to the feature maps. This mechanism guides the network to focus on under-exposed regions across both the temporal and spatial domains.
The decoder mirrors the encoder with two 3D upsampling stages. At the lowest resolution, the feature maps from the bottleneck are concatenated with the encoder features and processed by Conv-5, followed by 3D upsampling (Upscale-1). The resulting tensor is concatenated with the mid-level encoder features and passed through Conv-6, followed by Upscale-2. Finally, the upsampled features are concatenated with the shallow encoder features and processed by Conv-7 to produce 24 output channels, corresponding to three sets of eight enhancement-curve parameter maps (one set per color channel). The curves are upsampled to the input resolution and applied iteratively using the zero-reference enhancement formula described in Equation (6).
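A minimal PyTorch sketch of the basic building block is given below, assuming the layer ordering described above (depth-wise separable 3D convolution, spatial attention, 3D max-pooling); the activation, pooling shape, and other details are illustrative assumptions rather than the authors' exact configuration.

```python
import torch.nn as nn


class SeparableConv3d(nn.Module):
    """Depth-wise 3D convolution followed by a 1 x 1 x 1 point-wise convolution."""

    def __init__(self, c_in: int, c_out: int, k: int = 3):
        super().__init__()
        self.depthwise = nn.Conv3d(c_in, c_in, kernel_size=k, padding=k // 2, groups=c_in)
        self.pointwise = nn.Conv3d(c_in, c_out, kernel_size=1)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.pointwise(self.depthwise(x)))


# One encoder stage as described in the text (attention module sketched in Section 3.1.2).
# Pooling only over the spatial axes is an assumption, given the temporal depth of D = 2.
encoder_stage = nn.Sequential(
    SeparableConv3d(3, 32),
    nn.MaxPool3d(kernel_size=(1, 2, 2)),
)
```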
3.1.2. Spatial Attention Module
The purpose of the spatial attention module is to guide the network to the enhancement of extreme regions that require stronger adjustment, namely, the under-exposed areas. The convolutional layers capture spatial and temporal features and treat all regions equally. To counter this, the attention module adaptively weights various 2D regions across frames, enabling the network to emphasize the enhancement of crucial areas.
The module takes as input a 5D tensor $X \in \mathbb{R}^{B \times C \times D \times H \times W}$, where $B$ is the batch size, $C$ is the number of channels, the temporal depth is represented by $D$, and the 2D spatial dimensions are represented by $H$ and $W$. Average and maximum pooling are applied across the channel dimension, producing two pooled tensors that preserve spatial structure. These pooled tensors are concatenated and passed through a 3D convolutional layer, which learns to identify dependencies across neighboring frames. The attention map is constrained to [0, 1] via a sigmoid activation, and the input tensor is finally reweighted by element-wise multiplication with the map. The map encodes the importance of image regions via its weights, where less important parts, such as the background, receive low weights.
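A possible implementation of the spatial attention module, following the description above; the kernel size of the attention convolution is not recoverable from the text and is assumed here.

```python
import torch
import torch.nn as nn


class SpatialAttention3d(nn.Module):
    """Reweights 5D feature maps with a per-voxel attention mask in [0, 1]."""

    def __init__(self, kernel_size: int = 7):          # kernel size is an assumption
        super().__init__()
        self.conv = nn.Conv3d(2, 1, kernel_size=kernel_size, padding=kernel_size // 2)

    def forward(self, x):                               # x: (B, C, D, H, W)
        avg = x.mean(dim=1, keepdim=True)               # channel-wise average projection
        mx, _ = x.max(dim=1, keepdim=True)              # channel-wise maximum projection
        mask = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return x * mask                                  # element-wise reweighting
```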
3.1.3. Zero-Reference Loss Functions
Color Constancy Loss
The color constancy loss ensures that the enhanced frame preserves its natural color. To achieve this, the mean of the RGB channels is calculated over the spatial dimensions for each frame. Channel differences are computed for each frame and averaged across all frames in the video, as formulaically illustrated in Equation (7),
where $\mu_{R}^{(f)}$, $\mu_{G}^{(f)}$, and $\mu_{B}^{(f)}$ represent the means of the respective color channels, and $\Delta_{pq}^{(f)}$ represents the difference between channels $p$ and $q$ for frame index $f$. Thus, there are three channel-difference calculations. The color loss can then be expressed as shown in Equation (8). To prevent convergence to zero, an epsilon value ($\epsilon$) is added to the loss.
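A sketch of the frame-averaged color constancy loss as described above; the exact placement of the epsilon term is an assumption.

```python
import torch


def color_constancy_loss(enhanced: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """enhanced: (B, 3, D, H, W). Penalizes deviations between per-frame RGB channel means."""
    means = enhanced.mean(dim=(3, 4))                    # (B, 3, D): per-frame channel means
    r, g, b = means[:, 0], means[:, 1], means[:, 2]
    diff = (r - g) ** 2 + (r - b) ** 2 + (g - b) ** 2    # three channel-difference terms
    return diff.mean() + eps                              # averaged over frames and batch
```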
Spatial Consistency Loss
The spatial consistency loss preserves the spatial features of each frame. The loss is obtained from the delta measurement between the spatial gradients of the enhanced and original frames. The 3D spatial consistency loss is the 2D spatial consistency loss formulated in [8], extended to average across all the frames. Averaging ensures that the loss is independent of the length of the video. The 3D spatial consistency loss is expressed in Equation (9),
where $D$ is the temporal dimension and $K$ is the number of 4 × 4 patches. The neighboring regions centered at patch $i$ are denoted by $\Omega(i)$. The average pixel intensity values of the enhanced and original frames are denoted by $Y$ and $I$, respectively. The computed loss is averaged over all frames, starting from frame $f = 1$.
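A simplified sketch of the 3D spatial consistency loss: frames are folded into the batch, intensities are averaged over 4 × 4 patches, and the contrast toward the four spatial neighbors is compared between the enhanced and original clips. The wrap-around handling of border patches via torch.roll is a simplification.

```python
import torch
import torch.nn.functional as F


def spatial_consistency_loss(enhanced: torch.Tensor, original: torch.Tensor) -> torch.Tensor:
    """enhanced, original: (B, C, D, H, W). Returns a scalar averaged over all frames."""

    def patch_means(x: torch.Tensor) -> torch.Tensor:
        b, c, d, h, w = x.shape
        gray = x.mean(dim=1).reshape(b * d, 1, h, w)     # fold frames into the batch
        return F.avg_pool2d(gray, kernel_size=4)         # 4 x 4 patch intensities

    ye, yo = patch_means(enhanced), patch_means(original)
    loss = 0.0
    for dy, dx in [(-1, 0), (1, 0), (0, -1), (0, 1)]:    # four spatial neighbors
        diff_e = torch.abs(ye - torch.roll(ye, shifts=(dy, dx), dims=(2, 3)))
        diff_o = torch.abs(yo - torch.roll(yo, shifts=(dy, dx), dims=(2, 3)))
        loss = loss + ((diff_e - diff_o) ** 2).mean()
    return loss / 4.0
```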
Exposure Control Loss
The exposure control loss prevents the over- and under-enhancement of regions, which occurs when already bright regions are further enhanced or extremely dark regions are not enhanced enough. The 3D exposure loss extends the 2D exposure loss by averaging the loss over all frames in a video. In Equation (10), $E$ is the target exposure level desired for the enhanced intensity $Y_{k}^{(f)}$ at pixel $k$ of frame $f$, $D$ denotes the total number of frames, and $M$ gives the total number of pixels per frame.
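A sketch of the frame-averaged exposure control loss; the per-pixel comparison follows the description above, and the target level of 0.6 is an illustrative placeholder.

```python
import torch


def exposure_loss(enhanced: torch.Tensor, target: float = 0.6) -> torch.Tensor:
    """enhanced: (B, 3, D, H, W). Mean absolute deviation of pixel intensity from the target
    exposure level E, averaged over all pixels and frames."""
    gray = enhanced.mean(dim=1)                          # per-pixel intensity, (B, D, H, W)
    return torch.abs(gray - target).mean()
```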
Illumination Smoothness Loss
The illumination smoothness loss ensures that neighboring pixels, in both the temporal and spatial domains, exhibit smooth transitions, thus preserving the naturalness of the frame. To achieve this, the loss computes the squared differences between neighboring pixels. The loss is formulated in Equation (11), where $\Delta_{t}$, $\Delta_{v}$, and $\Delta_{h}$ denote the squared differences computed along the temporal, vertical, and horizontal directions, respectively. The batch size is denoted by B, and C is the number of channels. The number of frames is denoted by D, while the spatial dimensions are denoted by H and W.
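A sketch of the smoothness term, computing squared differences between temporally, vertically, and horizontally adjacent positions of a 5D tensor; whether it is applied to the estimated curve-parameter maps (as in Zero-DCE) or to the enhanced frames is an implementation choice left to the caller.

```python
import torch


def illumination_smoothness_loss(x: torch.Tensor) -> torch.Tensor:
    """x: (B, C, D, H, W), e.g. the estimated curve-parameter maps.
    Mean squared difference between temporal, vertical, and horizontal neighbors."""
    d_t = (x[:, :, 1:] - x[:, :, :-1]) ** 2              # temporal differences
    d_v = (x[:, :, :, 1:] - x[:, :, :, :-1]) ** 2        # vertical differences
    d_h = (x[:, :, :, :, 1:] - x[:, :, :, :, :-1]) ** 2  # horizontal differences
    return d_t.mean() + d_v.mean() + d_h.mean()
```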
Edge Detection Loss
The edge loss ensures that edges in the enhanced frame are not degraded and that the edge consistencies present in the input frame are maintained. The loss function detects intensity changes across the 2D spatial and 1D temporal dimensions. The computed loss is the mean absolute difference of the gradients across all input and output frames. Although both the spatial and edge losses focus on preserving structural integrity, the spatial loss preserves contrast patterns, whereas the edge loss preserves the strength of these contrast patterns. The edge detection loss is formulated in Equation (12), which defines the edge loss as the gradient difference between the input frames $I$ and the enhanced frames $Y$ along the two spatial dimensions and the temporal dimension.
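A sketch of the edge detection loss as the mean absolute difference between finite-difference gradients of the input and enhanced clips along the two spatial axes and the temporal axis.

```python
import torch


def edge_loss(enhanced: torch.Tensor, original: torch.Tensor) -> torch.Tensor:
    """enhanced, original: (B, C, D, H, W)."""

    def gradients(x: torch.Tensor):
        g_t = x[:, :, 1:] - x[:, :, :-1]                 # temporal gradient
        g_v = x[:, :, :, 1:] - x[:, :, :, :-1]           # vertical gradient
        g_h = x[:, :, :, :, 1:] - x[:, :, :, :, :-1]     # horizontal gradient
        return g_t, g_v, g_h

    loss = 0.0
    for ge, go in zip(gradients(enhanced), gradients(original)):
        loss = loss + torch.abs(ge - go).mean()
    return loss / 3.0
```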
MS-SSIM Loss
The human visual system is highly attuned to structures in images; thus, to measure how well a human perceives an image, a structural similarity measure is required. The structural similarity index computed at a single scale cannot capture details at different resolutions. To capture low-frequency details, such as contrast, and high-frequency details, such as textures, a multi-scale approach is required, as proposed in [26]. The multi-scale structural similarity index measure (MS-SSIM) of two frame patches $x$ and $y$ is defined in Equation (13), and the corresponding loss is taken as $1 - \mathrm{MS\text{-}SSIM}(x, y)$. The original frame has the scale index $j = 1$, which is the finest scale, while $j = M$ is the coarsest. The MS-SSIM consists of three metrics: the contrast measure ($c_{j}$) and the structural measure ($s_{j}$), which are both calculated at each scale, and the luminance measure ($l_{M}$), which is calculated only at the coarsest scale. Each metric is weighted with a unique exponent ($\alpha_{M}$, $\beta_{j}$, $\gamma_{j}$),

$$\mathrm{MS\text{-}SSIM}(x, y) = \big[l_{M}(x, y)\big]^{\alpha_{M}} \prod_{j=1}^{M} \big[c_{j}(x, y)\big]^{\beta_{j}} \big[s_{j}(x, y)\big]^{\gamma_{j}}. \quad (13)$$
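In practice the MS-SSIM term can be computed with an off-the-shelf implementation; the sketch below assumes the third-party pytorch-msssim package and applies the 2D MS-SSIM per frame by folding the temporal axis into the batch, which may differ from the authors' implementation.

```python
import torch
from pytorch_msssim import ms_ssim   # third-party package, assumed available (pip install pytorch-msssim)


def ms_ssim_loss(enhanced: torch.Tensor, original: torch.Tensor) -> torch.Tensor:
    """1 - MS-SSIM between enhanced and original clips of shape (B, C, D, H, W) in [0, 1].
    Frames are assumed large enough for the default five scales."""
    b, c, d, h, w = enhanced.shape
    x = enhanced.permute(0, 2, 1, 3, 4).reshape(b * d, c, h, w)   # fold frames into the batch
    y = original.permute(0, 2, 1, 3, 4).reshape(b * d, c, h, w)
    return 1.0 - ms_ssim(x, y, data_range=1.0, size_average=True)
```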
The total loss, which guides the network toward illumination adjustment, frame structure preservation, and temporal information preservation, is defined in Equation (14) as the weighted sum of the individual losses,

$$L_{total} = w_{spa} L_{spa} + w_{exp} L_{exp} + w_{col} L_{col} + w_{tv} L_{tv} + w_{edge} L_{edge} + w_{ms} L_{MS\text{-}SSIM}, \quad (14)$$

where the $w$ terms weight the contribution of each loss.
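A sketch of how the individual terms might be combined; the weights shown are illustrative placeholders, not the values used in the paper.

```python
import torch


def total_loss(l_spa: torch.Tensor, l_exp: torch.Tensor, l_col: torch.Tensor,
               l_tv: torch.Tensor, l_edge: torch.Tensor, l_msssim: torch.Tensor,
               w_spa: float = 1.0, w_exp: float = 1.0, w_col: float = 0.5,
               w_tv: float = 20.0, w_edge: float = 1.0, w_ms: float = 1.0) -> torch.Tensor:
    """Weighted sum of the zero-reference losses (weights are placeholders)."""
    return (w_spa * l_spa + w_exp * l_exp + w_col * l_col
            + w_tv * l_tv + w_edge * l_edge + w_ms * l_msssim)
```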
3.2. Detection
To assess Zero-3DCE applications beyond enhancement, the network was coupled with a fine-tuned YOLOv11m model. This demonstrates how frame illumination adjustment can improve the performance of CV systems under low luminosity. In the setup, Zero-3DCE serves as a pre-processing model that enhances frames prior to detection.
The detection algorithm was trained on a dataset containing people, guns, and knives. To ensure that the model was fine-tuned to its expected real-world applications, the training dataset included low-light data and data captured from security systems, in addition to data captured under normal lighting conditions.
Building on the findings of [27,28], only the first 10 layers of the standard detection network were frozen during training; as their results showed, the best detection scores were achieved under these training conditions. The researchers attributed this to the early layers learning features common to most objects, so freezing these layers allows networks to be retrained on limited data without overfitting while reducing computational cost.
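With the Ultralytics API, freezing the first 10 layers during fine-tuning can be expressed as below; the dataset configuration file, epoch count, and image size are placeholders, not the training settings used in the paper.

```python
from ultralytics import YOLO

# Fine-tune YOLOv11m with the first 10 layers frozen (placeholder dataset and hyperparameters).
model = YOLO("yolo11m.pt")
model.train(data="weapons_people.yaml", epochs=100, imgsz=640, freeze=10)
```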
4. Results
4.1. Training Configuration
Zero-3DCE was trained on a combination of public and custom low-light datasets, including DICM [29], LIME [30], LOL [31], SICEMix [32], LLVIP [33], BVI-RLV [34], and Dark48 [35]. This diverse collection ensures that the model is exposed to a wide range of illumination levels and noise patterns. All frames were resized to a common resolution and normalized to the range [0, 1]. An NVIDIA GeForce RTX 4070 SUPER GPU was used for all training and testing.
4.2. Ablation Study
To evaluate the contributions of different parts of the model to the algorithm, an ablation study was conducted.
4.2.1. Edge Detection Loss
The model was retrained with the edge detection loss omitted. Without the loss, the enhanced inferred frame of Figure 3b appears over-exposed and without key image features. This is particularly evident in the blue and purple squares, where in Figure 3b, due to over-exposure, there is a loss of detail. The challenges observed in Figure 3b are not present in the frames enhanced using the complete model, as seen in Figure 3c. The results illustrate how the edge loss assists the network in maintaining local contrasts and preventing over-exposure.
Figure 3.
Ablation study on edge detection loss.
4.2.2. MS-SSIM Loss
Without the MS-SSIM loss, the enhanced frame in Figure 4b loses some of its details, as observed in the blue square. The corresponding blue square in Figure 4c contains more detail and texture, illustrating the importance of the MS-SSIM loss. As the results show, the MS-SSIM loss helps the model maintain textures and sharpness, making the image appear more natural.
Figure 4.
Ablation study on MS-SSIM loss.
4.2.3. Spatial Attention
Without spatial attention, the inferred frame, as demonstrated in Figure 5b, is noisy and over-enhanced. This demonstrates the attention mechanism’s role in enabling the model to focus on key parts of the image, specifically under-exposed regions, resulting in balanced enhancement, as seen in Figure 5c.
Figure 5.
Ablation study on spatial attention.
4.2.4. Separable Convolution
Zero-3DCE was retrained using standard 3D convolutions in place of depth-wise-separable convolutions and evaluated on the enhancement of 50 frames from a single video. The average per-frame processing time was recorded and compared with that of the original Zero-3DCE under identical testing conditions. The results indicate that Zero-3DCE operates 14.55% faster than the version employing standard 3D convolutions.
4.3. Enhancement Results
Zero-3DCE was tested on both single-frame and multi-frame data. Figure 6 shows how Zero-3DCE can handle varying illumination patterns and enhance images properly without incurring artifacts and under- or over-exposing parts of the image. This pattern is further demonstrated in Figure 7, where Zero-3DCE was used to enhance videos, and, as observed, accomplished the task with desired results. In contrast, Figure 8 shows that several comparison models exhibit visible artifacts under the same conditions.
Figure 6.
Zero-3DCE performance on single-frame data. The left column displays single-frame inputs, while their enhanced versions, utilizing Zero-3DCE, are shown in the right column.
Figure 7.
Zero-3DCE performance on multi-frame data.
Figure 8.
Single-frame enhancement comparison of Zero-3DCE with other low-light enhancers.
When tested against other unsupervised low-light enhancers on single-frame data from the LOL-eval15 dataset [36], Zero-3DCE achieved consistently strong performance across all metrics, ranking either first or second in every case, as illustrated in Table 1. The results also highlight Zero-3DCE’s superior performance over Zero-DCE.
Visual comparisons with other low-light enhancement models [8,14,16,37,38,39,40,41] are presented in Figure 8 and Figure 9. Figure 8 presents the single-frame comparisons, while Figure 9 presents the multi-frame comparisons. In Figure 8, EnlightenGAN, RUAS, Night Enhancement, and KinD++ enhance the single frame, but in doing so, the resultant image appears discolored. GSAD over-enhances the image, causing the dark background to appear lighter than it should. Zero-DCE only slightly enhances the image, leaving behind areas of low contrast. KANT and Zero-3DCE achieve similar results, but as presented in Table 2, Zero-3DCE achieves these results faster.
Figure 9.
Multi-frame enhancement comparison of Zero-3DCE with other low-light enhancers.
Table 1.
Comparison of Zero-3DCE with other unsupervised learners on single-frame data. Zero-3DCE is observed to be the best-performing unsupervised learner overall, achieving either the best or second-best scores in all the metrics.
| Model | PSNR (↑) | SSIM (↑) | LPIPS (↓) | NIQE (↓) | BRISQUE (↓) |
|---|---|---|---|---|---|
| EnlightenGAN [16] | 17.5557 | 0.66567 | 0.3878 | 4.5814 | **10.3452** |
| RUAS [37] | 9.5754 | 0.3194 | 0.5452 | 6.6551 | 27.4994 |
| Zero-DiDCE [41] | 17.9406 | 0.5841 | 0.3782 | 7.8945 | 28.9683 |
| Night Enhancement [38] | **21.5212** | **0.7629** | **0.3595** | *4.4899* | 27.1202 |
| Zero-DCE [8] | 14.8607 | 0.5624 | 0.3852 | 7.7657 | 27.4029 |
| Zero-3DCE | *18.2723* | *0.6676* | *0.3687* | **3.5057** | *13.6800* |
The best-performing model for each metric is shown in bold, and the second-best is shown in italics. The arrows indicate whether lower or higher values are desired for each metric.
Table 2.
Computational cost comparison of Zero-3DCE with other LLE networks.
| Model | Runtimes in Seconds |
|---|---|
| EnlightenGAN | 0.2692 |
| GSAD | 2.6917 |
| RUAS | 0.0838 |
| KinD++ | 3.6136 |
| KANT | 55.7471 |
| Zero-DiDCE | 0.2291 |
| Night Enhancement | 0.4520 |
| Zero-3DCE | **0.0621** |
Bold indicates the best-performing model (shortest runtime).
The multi-frame comparison in Figure 9 demonstrates that the models tested against Zero-3DCE are ill-suited for multi-frame enhancement: none of them enhance the video without the frames accumulating artifacts, and GSAD and KANT fail to enhance the video altogether. Zero-3DCE enhances the frames while avoiding over-enhancement and added noise.
Zero-3DCE was also compared with other models on inference times. All the models were tested on the same single-frame dataset, and the averaged results are presented in Table 2. Zero-3DCE achieved the fastest inference time, highlighting how the model attains the results presented in Figure 8 and Figure 9 at real-time speeds.
The model was coupled with a fine-tuned YOLOv11 model and tested on its ability to enhance and detect objects using [42,43,44] along with the DarkFace dataset [45]. Qualitative results are shown in Figure 10, and quantitative detection metrics are summarized in Table 3. The results show that Zero-3DCE improves the detection rates of the YOLO model, supporting the hypothesis that the light enhancement network can improve downstream CV tasks.
Figure 10.
Results of the coupled enhancement and detection model. More accurate detections are made when the YOLO algorithm is coupled with Zero-3DCE.
Table 3.
Zero-3DCE coupled with YOLOv11 detection results.
5. Discussion
The results presented demonstrate that the proposed model, Zero-3DCE, achieves its intended goal of zero-reference low-light image and video enhancement. It achieved PSNR and SSIM values of 11.575 and 0.49, outperforming Zero-DCE by +3.3 dB and +0.18, respectively. Moreover, Zero-3DCE achieves this without producing noise-ridden frames and, owing to its use of separable convolutions, does so at real-time speeds, enhancing one frame every 0.06 s. Zero-3DCE was shown to outperform Zero-DCE and other LLE networks in the enhancement of images and videos, achieving a NIQE score 1.325 lower than the next best, illustrating perceptual improvement. When integrated with a detection algorithm, the nighttime detection capabilities of the algorithm improved, with increases observed in precision, recall, F1, and mean average precision scores.
The results presented in this paper suggest that future research on low-light enhancement should focus more on 3D models, making these models faster and more effective. The 3D model presented in this paper outperforms 2D light enhancers in processing both 2D and 3D data. Continued reliance on 2D enhancers limits the real-world applications of these models; thus, a major shift is required in low-light enhancement research to design models capable of handling the variety of real-world data.
Finally, research needs to move away from guided models due to their over-reliance on paired and labeled data. Zero-3DCE has demonstrated that unguided models can outperform guided learners while avoiding over-reliance on paired training data, which is often challenging to acquire for low-light enhancement.
Author Contributions
Conceptualization, M.M.T.; methodology, M.M.T.; software, M.M.T.; validation, M.M.T.; writing—original draft preparation, M.M.T.; writing—review and editing, M.M.T., R.C.M., and P.K.; supervision and project administration, R.C.M. and P.K. All authors have read and agreed to the published version of the manuscript.
Funding
This research was funded by the National Research Foundation of South Africa (Grant Number PMDS230505102760).
Data Availability Statement
The data presented in this study are available as follows: for low-light enhancement, https://github.com/ShenZheng2000/LLIE_Survey (accessed on 10 June 2024); and for recognition and detection, https://www.kaggle.com/datasets/constantinwerner/human-detection-dataset (accessed on 10 September 2024), https://www.kaggle.com/datasets/ankan1998/weapon-detection-dataset (accessed on 10 September 2024), and https://universe.roboflow.com/school-fin7c/person-weapon-datasets (accessed on 10 September 2024).
Conflicts of Interest
The authors declare no conflicts of interest.
Abbreviations
The following abbreviations are used in this manuscript:
| Abbreviation | Meaning |
|---|---|
| CV | Computer Vision |
| GAN | Generative Adversarial Network |
| LLIE | Low-Light Image Enhancement |
| LLVE | Low-Light Video Enhancement |
References
- Cai, Y.; Bian, H.; Lin, J.; Wang, H.; Timofte, R.; Zhang, Y. Retinexformer: One-stage retinex-based transformer for low-light image enhancement. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 12504–12513. [Google Scholar]
- Liu, Y.; Huang, T.; Dong, W.; Wu, F.; Li, X.; Shi, G. Low-light image enhancement with multi-stage residue quantization and brightness-aware attention. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 12140–12149. [Google Scholar]
- Xu, X.; Wang, R.; Lu, J. Low-light image enhancement via structure modeling and guidance. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 9893–9903. [Google Scholar]
- Liu, F.; Fan, L. A review of advancements in low-light image enhancement using deep learning. arXiv 2025, arXiv:2505.0575. [Google Scholar] [CrossRef]
- Yu, C.; Han, G.; Pan, M.; Wu, X.; Deng, A. Zero-TCE: Zero Reference Tri-Curve Enhancement for Low-Light Images. Appl. Sci. 2025, 15, 701. [Google Scholar] [CrossRef]
- He, J.; Xue, M.; Ning, A.; Song, C. Zero-reference lighting estimation diffusion model for low-light image enhancement. arXiv 2024, arXiv:2403.02879. [Google Scholar]
- Khanam, R.; Hussain, M. Yolov11: An overview of the key architectural enhancements. arXiv 2024, arXiv:2410.17725. [Google Scholar] [CrossRef]
- Guo, C.; Li, C.; Guo, J.; Loy, C.C.; Hou, J.; Kwong, S.; Cong, R. Zero-reference deep curve estimation for low-light image enhancement. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Washington, DC, USA, 14–19 June 2020; pp. 1780–1789. [Google Scholar]
- Lv, F.; Lu, F.; Wu, J.; Lim, C. MBLLEN: Low-Light Image/Video Enhancement Using CNNs. In Proceedings of the BMVC, Newcastle, UK, 3–6 September 2018; p. 4. [Google Scholar]
- Guo, X.; Hu, Q. Low-light image enhancement via breaking down the darkness. Int. J. Comput. Vis. 2023, 131, 48–66. [Google Scholar] [CrossRef]
- Baidoo, E.; Kwesi Kontoh, A. Implementation of gray level image transformation techniques. Int. J. Mod. Educ. Comput. Sci. 2018, 10, 44. [Google Scholar] [CrossRef]
- Jebadass, J.R.; Balasubramaniam, P. Low light enhancement algorithm for color images using intuitionistic fuzzy sets with histogram equalization. Multimed. Tools Appl. 2022, 81, 8093–8106. [Google Scholar] [CrossRef]
- Chen, Y.; Wen, C.; Liu, W.; He, W. A depth iterative illumination estimation network for low-light image enhancement based on retinex theory. Sci. Rep. 2023, 13, 19709. [Google Scholar] [CrossRef]
- Zhang, Y.; Zhang, J.; Guo, X. Kindling the darkness: A practical low-light image enhancer. In Proceedings of the 27th ACM International Conference on Multimedia, Nice, France, 21–25 October 2019; pp. 1632–1640. [Google Scholar]
- Lore, K.G.; Akintayo, A.; Sarkar, S. LLNet: A deep autoencoder approach to natural low-light image enhancement. Pattern Recognit. 2017, 61, 650–662. [Google Scholar] [CrossRef]
- Jiang, Y.; Gong, X.; Liu, D.; Cheng, Y.; Fang, C.; Shen, X.; Yang, J.; Zhou, P.; Wang, Z. Enlightengan: Deep light enhancement without paired supervision. IEEE Trans. Image Process. 2021, 30, 2340–2349. [Google Scholar] [CrossRef]
- Zhang, J.; Li, H.; Huo, Z. Unsupervised Boosted Fusion Network for Single Low-light Image Enhancement; IEEE Access: Piscataway, NJ, USA, 2024. [Google Scholar]
- Ming, F.; Wei, Z.; Zhang, J. Unsupervised low-light image enhancement in the fourier transform domain. Appl. Sci. 2023, 14, 332. [Google Scholar] [CrossRef]
- Chen, J.; Wang, Y.; Han, Y. A semi-supervised network framework for low-light image enhancement. Eng. Appl. Artif. Intell. 2023, 126, 107003. [Google Scholar] [CrossRef]
- Jiang, N.; Cao, Y.; Zhang, X.-Y.; Wang, D.-H.; He, Y.; Wang, C.; Zhu, S. Low-light image enhancement with quality-oriented pseudo labels via semi-supervised contrastive learning. Expert Syst. Appl. 2025, 276, 127106. [Google Scholar] [CrossRef]
- Li, C.; Guo, C.; Loy, C.C. Learning to enhance low-light image via zero-reference deep curve estimation. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 4225–4238. [Google Scholar] [CrossRef]
- Zhang, L.; Zhang, L.; Liu, X.; Shen, Y.; Zhang, S.; Zhao, S. Zero-shot restoration of back-lit images using deep internal learning. In Proceedings of the 27th ACM International Conference on Multimedia, Nice, France, 21–25 October 2019; pp. 1623–1631. [Google Scholar]
- Zhu, L.; Yang, W.; Chen, B.; Zhu, H.; Meng, X.; Wang, S. Temporally consistent enhancement of low-light videos via spatial-temporal compatible learning. Int. J. Comput. Vis. 2024, 132, 4703–4723. [Google Scholar] [CrossRef]
- Liu, S.; Li, X.; Zhou, Z.; Guo, B.; Zhang, M.; Shen, H.; Yu, Z. AdaEnlight: Energy-aware low-light video stream enhancement on mobile devices. Proc. ACM Interact. Mob. Wearable Ubiquitous Technol. 2023, 6, 1–26. [Google Scholar] [CrossRef]
- Zhang, G.; Zhang, Y.; Yuan, X.; Fu, Y. Binarized low-light raw video enhancement. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Washington, DC, USA, 17–21 June 2024; pp. 25753–25762. [Google Scholar]
- Wang, Z.; Simoncelli, E.P.; Bovik, A.C. Multiscale structural similarity for image quality assessment. In The Thrity-Seventh Asilomar Conference on Signals, Systems & Computers; IEEE: Piscataway, NJ, USA, 2003; pp. 1398–1402. [Google Scholar]
- Gandhi, V.; Gandhi, S. Fine-Tuning Without Forgetting: Adaptation of YOLOv8 Preserves COCO Performance. arXiv 2025, arXiv:2505.01016. [Google Scholar] [CrossRef]
- Rafi, A.N.Y.; Yusuf, M. Improving Vehicle Detection in Challenging Datasets: YOLOv5s and Frozen Layers Analysis. Int. J. Inform. Comput. 2023, 5, 31–45. [Google Scholar]
- Lee, C.; Lee, C.; Kim, C.-S. Contrast enhancement based on layered difference representation of 2D histograms. IEEE Trans. Image Process. 2013, 22, 5372–5384. [Google Scholar] [CrossRef]
- Guo, X.; Li, Y.; Ling, H. LIME: Low-light image enhancement via illumination map estimation. IEEE Trans. Image Process. 2016, 26, 982–993. [Google Scholar] [CrossRef]
- Wei, C.; Wang, W.; Yang, W.; Liu, J. Deep retinex decomposition for low-light enhancement. arXiv 2018, arXiv:1808.04560. [Google Scholar] [CrossRef]
- Zheng, S.; Ma, Y.; Pan, J.; Lu, C.; Gupta, G. Low-light image and video enhancement: A comprehensive survey and beyond. arXiv 2022, arXiv:2212.10772. [Google Scholar]
- Li, C.; Guo, C.; Han, L.; Jiang, J.; Cheng, M.-M.; Gu, J.; Loy, C.C. Low-light image and video enhancement using deep learning: A survey. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 9396–9416. [Google Scholar] [CrossRef] [PubMed]
- Anantrasirichai, N.; Lin, R.; Malyugina, A.; Bull, D. BVI-Lowlight: Fully Registered Benchmark Dataset for Low-Light Video Enhancement. arXiv 2024, arXiv:2402.01970. [Google Scholar]
- Tu, Z.; Liu, Y.; Zhang, Y.; Mu, Q.; Yuan, J. DTCM: Joint optimization of dark enhancement and action recognition in videos. IEEE Trans. Image Process. 2023, 32, 3507–3520. [Google Scholar] [CrossRef] [PubMed]
- Yang, W.; Wang, S.; Fang, Y.; Wang, Y.; Liu, J. From fidelity to perceptual quality: A semi-supervised approach for low-light image enhancement. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Washington, DC, USA, 14–19 June 2020; pp. 3063–3072. [Google Scholar]
- Liu, R.; Ma, L.; Zhang, J.; Fan, X.; Luo, Z. Retinex-inspired unrolling with cooperative prior architecture search for low-light image enhancement. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 10561–10570. [Google Scholar]
- Jin, Y.; Yang, W.; Tan, R.T. Unsupervised night image enhancement: When layer decomposition meets light-effects suppression. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 404–421. [Google Scholar]
- Hou, J.; Zhu, Z.; Hou, J.; Liu, H.; Zeng, H.; Yuan, H. Global structure-aware diffusion process for low-light image enhancement. Adv. Neural Inf. Process. Syst. 2024, 36, 79734–79747. [Google Scholar]
- Brateanu, A.; Balmez, R.; Orhei, C.; Ancuti, C.; Ancuti, C. Enhancing Low-Light Images with Kolmogorov–Arnold Networks in Transformer Attention. Sensors 2025, 25, 327. [Google Scholar] [CrossRef] [PubMed]
- Mi, A.; Luo, W.; Qiao, Y.; Huo, Z. Rethinking zero-DCE for low-light image enhancement. Neural Process. Lett. 2024, 56, 93. [Google Scholar] [CrossRef]
- Verner, K. Human Detection Dataset CCTV Footage of Humans. Available online: https://www.kaggle.com/datasets/constantinwerner/human-detection-dataset (accessed on 10 September 2024).
- Sharma, A. Weapon Detection Dataset Weapon Detection Including KNIFE, Gun, Pistol etc. Available online: https://www.kaggle.com/datasets/ankan1998/weapon-detection-dataset (accessed on 10 September 2024).
- School. Person, Weapon Datasets Dataset. Available online: https://universe.roboflow.com/school-fin7c/person-weapon-datasets (accessed on 10 September 2024).
- Yang, W.; Yuan, Y.; Ren, W.; Liu, J.; Scheirer, W.J.; Wang, Z.; Zhang, T.; Zhong, Q.; Xie, D.; Pu, S. Advancing image understanding in poor visibility environments: A collective benchmark study. IEEE Trans. Image Process. 2020, 29, 5737–5752. [Google Scholar] [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).