Low-Illumination Image Enhancement Based on Deep Learning Techniques: A Brief Review

Abstract: As a critical preprocessing technique, low-illumination image enhancement has a wide range of practical applications. It aims to improve the visual perception of a given image captured without sufficient illumination. Conventional low-illumination image enhancement methods are typically implemented by improving image brightness, enhancing image contrast, and suppressing image noise simultaneously. Nevertheless, recent advances in this area are dominated by deep-learning-based solutions, and consequently, various deep neural networks have been proposed and applied to this field. Therefore, this paper briefly reviews the latest low-illumination image enhancement methods, ranging from related algorithms to unsolved open issues. Specifically, current deep-learning-based low-illumination image enhancement methods are first sorted out and divided into four categories: supervised learning methods, unsupervised learning methods, semi-supervised learning methods, and zero-shot learning methods. Then, existing low-light image datasets are summarized and analyzed. In addition, various quality assessment indices for low-light image enhancement are introduced in detail. We also compare 14 representative algorithms in terms of both objective and subjective evaluation. Finally, the future development trends of low-illumination image enhancement and its open issues are summarized and prospected.


Introduction
With the popularity of multimedia devices, the applications of videos and images are becoming increasingly widespread [1][2][3]. However, due to the limitations of imaging equipment or sub-optimal lighting conditions (e.g., moonlight, firelight, lamplight), captured images often suffer a serious reduction in visibility and contrast and are accompanied by large amounts of noise. Consequently, it is hard to extract the reliable information we want [4][5][6][7]. Figure 1 shows some typical examples of low-illumination (or low-light) images. From the perspective of human perception, some images, especially those with very low brightness, yield hardly any valuable information. For a computer, however, the numerical differences in this visually imperceptible matrix data are the key to recovering the image content. Put simply, the invisibility to human eyes reflects only a partial perception; a computer can exploit the differences and correlations in the numerical distribution of the image to obtain a clearer understanding of it. Converting such invisible information into visible information is precisely the purpose of existing deep-learning-based low-illumination image enhancement techniques [8,9]. Some video shooting devices even focus on the ability to take pictures at night [10,11]. With such a wide range of applications, low-illumination image enhancement has become a challenging and active research area for the image-processing community [12,13]. There are many traditional low-illumination image enhancement methods, including gray transformation, histogram equalization [14], the Retinex model [15], frequency-domain processing [16], the image fusion model [17], the defogging model [18], etc.
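To make one of these traditional baselines concrete, the sketch below implements global histogram equalization with NumPy. The image and its value range are synthetic stand-ins, not data from any dataset discussed in this review.

```python
import numpy as np

def equalize_histogram(gray: np.ndarray) -> np.ndarray:
    """Global histogram equalization for a non-constant 8-bit grayscale image."""
    hist = np.bincount(gray.ravel(), minlength=256)
    cdf = hist.cumsum()
    cdf_min = cdf[cdf > 0].min()  # CDF value of the darkest occurring intensity
    # Map each occurring intensity so the output CDF is approximately uniform.
    lut = np.round((cdf - cdf_min) / (cdf[-1] - cdf_min) * 255).astype(np.uint8)
    return lut[gray]

# A synthetic "low-light" image: all values crowded into the range [0, 60).
dark = (np.random.default_rng(0).random((64, 64)) * 60).astype(np.uint8)
bright = equalize_histogram(dark)
```

After equalization, the crowded dark histogram is stretched over the full [0, 255] range, which is exactly the contrast improvement the traditional methods above aim for, at the cost of possibly amplifying noise.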
Among them, low-illumination image enhancement based on the Retinex model is the most prevalent, and many scholars have made a series of improvements to it that effectively raise the quality of the enhanced images. Retinex is a widely used image formation theory grounded in scientific experiments and analysis. Retinex-based methods decompose a low-illumination image into an illumination image and a reflectance image, reduce or even remove the influence of the illumination component through specific algorithms, and retain the reflectance image that captures the intrinsic properties of the objects. However, this approach has some limitations. Taking the reflectance map as the enhancement result leads to loss of detail, color distortion, and other problems. Moreover, the model ignores noise, even though a great deal of noise appears in the reflectance map. Computing the reflectance map is itself an ill-posed problem that can only be solved by approximate estimation.
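The decomposition just described can be written compactly. Retinex models the observed image $S$ as the pixel-wise product of reflectance $R$ and illumination $L$; taking logarithms turns the product into a sum, and single-scale Retinex approximates the illumination with a Gaussian-smoothed copy of the image:

```latex
S(x,y) = R(x,y)\cdot L(x,y), \qquad
\log S = \log R + \log L, \qquad
r(x,y) = \log S(x,y) - \log\big[(G_{\sigma} * S)(x,y)\big]
```

Recovering two unknowns ($R$ and $L$) from a single observation is what makes the problem ill-posed: the smoothing kernel $G_{\sigma}$ acts only as an approximate prior on the illumination.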
In recent years, deep learning has been widely used in the field of image processing, and various experiments have shown that it has clear advantages in many low-level vision tasks. With the advent of LLNet [19] in 2017, low-illumination image enhancement based on deep learning began to receive the most attention from researchers in this field. From 2017 to 2022, hundreds of deep-learning-based low-illumination image enhancement algorithms were proposed. This fully reflects that deep-learning-based methods offer better accuracy, robustness, and speed. Table 1 shows recent deep-learning-based low-illumination image enhancement algorithms with the corresponding timelines.
In general, the existing low-illumination image enhancement algorithms based on deep learning are dominant, but they still face many problems that need to be overcome by researchers. Therefore, it is necessary to review and analyze existing methods comprehensively. To this end, we introduce the current progress of low-illumination image enhancement based on deep learning and analyze its datasets and objective evaluation indicators. Furthermore, the challenges and application defects of existing methods are also discussed in this paper, aiming to contribute to future research and development on this topic.
The rest of this paper is organized as follows: in Section 2, low-illumination enhancement algorithms based on deep learning are divided into four categories according to their learning and training fashions, and the characteristics of each algorithm are analyzed in detail. In Section 3, existing low-illumination image datasets are listed, and the representative datasets and the most frequently used datasets are examined in detail. In Section 4, a subjective visual comparison and objective evaluation of 14 representative enhancement algorithms are conducted. In the last section, existing bottlenecks in this field are summarized, and future research directions are prospected.
Photonics 2023, 10, 198

Classification of Low-Illumination Image Enhancement Methods
According to the learning method used, we can classify existing low-illumination enhancement methods into four categories, i.e., supervised learning, unsupervised learning, semi-supervised learning, and zero-shot learning methods.

Supervised Learning Methods
The most crucial feature of supervised learning is that the required datasets are labeled. Supervised learning learns a model from labeled training data and then uses that model to predict labels for new data. The closer the predicted labels are to the given labels, the better the supervised learning algorithm. Low-illumination image enhancement methods based on supervised learning can be roughly divided into end-to-end methods and Retinex-theory-based methods.
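As an illustration of this setting, the toy example below fits a single gain parameter by gradient descent so that darkened inputs match their labeled normal-light references. The one-parameter "model" is a deliberately trivial stand-in for a deep network, and all data are synthetic.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic paired data: normal-light references (labels) and darkened inputs.
reference = rng.random((64, 8, 8))     # "normal-light" patches in [0, 1)
low_light = reference / 3.0            # inputs darkened by an unknown gain

# One-parameter model: enhanced = a * input. Supervised training adjusts a
# to minimize the mean squared error against the labeled references.
a, lr = 1.0, 1.0
for _ in range(200):
    pred = a * low_light
    grad = 2.0 * np.mean(low_light * (pred - reference))  # d(MSE)/da
    a -= lr * grad

# Gradient descent recovers the inverse of the darkening gain, a ≈ 3.
```

Deep supervised methods follow the same recipe with millions of parameters and richer losses, but the core idea is unchanged: the paired label defines the target the prediction is pulled toward.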

End-to-End Methods
Back in 2017, a deep-encoder-based method (LLNet) was first proposed [19], which used a variant of the stacked sparse denoising autoencoder to identify features in a low-light image and adaptively enhance and denoise it. This was the first end-to-end low-illumination enhancement method based on deep learning. Lv et al. [22] proposed a multi-branch image enhancement network, MBLLEN, which consists of three types of modules: the feature extraction module (FEM), the enhancement module (EM), and the fusion module (FM). In this method, different image features are extracted at different levels and enhanced by multiple sub-networks, and the output image is finally generated by multi-channel fusion. Image noise and artifacts in low-light areas are suppressed well by this method; in addition, 3D convolution can be used instead of 2D convolution for low-illumination video enhancement. Wang et al. [66] proposed a Global Awareness and Detail Retention network (GLADNet), which feeds the input image into an encoder-decoder structure to generate a global illumination estimate and then uses a convolutional network to reconstruct the image based on the global illumination estimate and the original input. To prove the effectiveness of this method, a target recognition test was run on the Google Cloud Vision API. The results show that, compared with the original image, Google Cloud Vision can detect and label more details in the enhanced image. Lu et al. [37] proposed a multi-exposure fusion network that applies transfer functions in two sub-networks to generate two enhanced images; noise is then reduced by a simple average fusion, and the result is further polished by a refinement unit. PRIEN [43] is a progressive enhancement network that continuously extracts features from low-illumination images using a recursive unit composed of recursive layers and residual blocks.
This method is the first to feed low-illumination images directly into a dual-attention model to extract features. To maintain network performance with fewer parameters, it recursively reuses a single residual block. Although the network structure is relatively simple, the enhancement effect is still evident. C-LIENet [67] comprises an encoder-decoder structure built from convolutional layers and introduces a unique multi-context feature extraction module that extracts contextual features layer by layer. It proposes a three-part loss function that has advantages in local features, structural similarity, and details. Lim et al. [41] proposed a Deep Stacked Laplacian Restorer (DSLR) for low-illumination image enhancement, which can separately adjust the global luminance and local details of the input image and progressively fuse them in image space. The proposed multi-scale Laplacian residual block makes training more efficient through rich connections of higher-order residuals defined over the multi-scale structure embedded in the feature space.

Deep Retinex-Based Methods
Compared with direct end-to-end network learning, methods combining deep learning with Retinex theory sometimes achieve a better enhancement effect. Wei et al. [24] proposed the first data-driven deep network based on Retinex theory, which decomposes an image into an illumination-independent reflectance image and a structure-aware illumination image through a decomposition network. The illumination is then enhanced over large areas by an enhancement network, and the overall result is improved by clipping the local distribution and applying BM3D noise reduction. A brief flow diagram of this algorithm is shown in Figure 2. Liang et al. [20] modeled MSR (Multi-Scale Retinex) and proposed MSR-Net, which includes three modules: multi-scale logarithmic transformation, difference-of-convolution, and color restoration. It directly learns an end-to-end mapping from low-illumination to normal images by training on synthesized low-illumination pairs adjusted with Photoshop. Zhang et al. [29] constructed a deep network based on Retinex theory that is divided into two branches, one for adjusting illumination and one for removing degradation; it includes three modules: layer decomposition, reflectance restoration, and illumination adjustment. A multi-scale illumination enhancement module is used to avoid noise and distortion in the enhanced image. R2RNet [42] is a new Real-low to Real-normal Network based on Retinex theory. It includes a decomposition module, a noise reduction module, and a brightness enhancement module, and it presents a large real-world image-pair dataset (LSRW). Wei et al. [46] proposed a novel Retinex-based module, NCBC, which consists of a convolutional neural network and two loss functions and simultaneously suppresses noise and controls color deviation. Its distinctive feature is that the NCBC module computes its loss functions only during training, which makes the method faster than other algorithms at test time.
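Without any learning, the decompose-adjust-recombine pipeline these networks implement can be sketched with a classical illumination-map heuristic: estimate the illumination as the per-pixel channel maximum (as in LIME-style methods), brighten it with a gamma curve, and recombine with the reflectance. This is an illustrative baseline, not the decomposition any of the cited networks actually learns.

```python
import numpy as np

def enhance_via_illumination(img: np.ndarray, gamma: float = 0.4) -> np.ndarray:
    """img: float RGB image in [0, 1] with shape (H, W, 3)."""
    # Rough illumination estimate: per-pixel maximum over the color channels.
    illum = np.clip(img.max(axis=2, keepdims=True), 1e-3, 1.0)
    reflectance = img / illum            # Retinex: S = R * L  =>  R = S / L
    adjusted = illum ** gamma            # gamma < 1 brightens the illumination
    return np.clip(reflectance * adjusted, 0.0, 1.0)

# A flat synthetic "dark" image for demonstration.
dark = np.full((4, 4, 3), 0.1)
out = enhance_via_illumination(dark)
```

The learned methods above replace both the crude channel-maximum estimate and the fixed gamma curve with networks, and add explicit noise handling that this sketch lacks.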


Deep Transformer-Based Methods
Vision Transformers are currently a research hotspot for supervised tasks. Cui et al. [68] proposed the Illumination-Adaptive Transformer (IAT), an ultra-lightweight network trained in a fully supervised manner. Different from the above solutions, this network adopts the idea of the object detection network DETR (Detection Transformer) [69] and designs an end-to-end transformer to overcome the impact of low illumination on visual quality; its performance is very good, and its speed is its most prominent feature. Wang et al. [64] proposed a Low-Light Transformer-based Network (LLFormer). The core components of LLFormer are an axis-based multi-head self-attention and a cross-layer attention fusion block, which reduce the computational complexity. Meanwhile, Wang et al. built a benchmark dataset of 4K and 8K UHD images (UHD-LOL) to evaluate the proposed LLFormer, which was also the first method to attempt the UHD-LLIE task.
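The complexity saving behind axis-based attention can be seen in a minimal single-head sketch that restricts attention to each image row: the score matrix is W × W per row rather than HW × HW over the whole feature map. This is a schematic in NumPy, not LLFormer's actual block, which is multi-head and operates in its own feature space.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def row_axis_attention(feat, Wq, Wk, Wv):
    """Single-head self-attention restricted to each image row.

    feat: (H, W, C) feature map. Attending within rows costs O(H * W^2)
    instead of O((H * W)^2) for full spatial self-attention.
    """
    q, k, v = feat @ Wq, feat @ Wk, feat @ Wv                    # (H, W, C) each
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(feat.shape[-1])  # (H, W, W)
    return softmax(scores, axis=-1) @ v                          # (H, W, C)

rng = np.random.default_rng(0)
H, W, C = 6, 8, 4
feat = rng.standard_normal((H, W, C))
Wq, Wk, Wv = (rng.standard_normal((C, C)) for _ in range(3))
out = row_axis_attention(feat, Wq, Wk, Wv)
```

Running the same operation a second time along columns gives both axes global context while keeping the per-step cost linear in the other dimension, which is what makes attention tractable at UHD resolutions.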

Unsupervised Learning Methods
The aforementioned methods have two limitations. First, the paired images in the datasets are limited. Second, training models on paired datasets can cause over-fitting. To address these issues, scholars began to use unsupervised learning methods for enhancement; unsupervised learning is characterized by learning from unlabeled data. Jiang et al. [28] proposed the first unsupervised learning method, called EnlightenGAN. A brief flow diagram of this algorithm is shown in Figure 3. EnlightenGAN is the first to successfully introduce unpaired training to this field. The method has two improved parts: first, a global-local discriminator structure deals with spatially varying illumination conditions in the input image; second, a self-feature-preserving loss and a self-regularized attention mechanism keep the image content features unchanged before and after enhancement. Fu et al. [70] proposed a low-illumination enhancement network (LE-GAN) using an identity-invariant loss and an attention module. They used an illumination-aware attention module to strengthen image feature extraction, which improved visual quality while achieving noise reduction and detail enhancement; at the same time, the identity-invariant loss addresses the problem of overexposure. The paper also established and released a sizeable unpaired low-illumination/normal-illumination image dataset called PNLI. Ni et al. [71] proposed an unsupervised enhancement method called UEGAN. The model is based on a single deep GAN incorporating an attention mechanism to capture more global and local features. It introduces a fidelity loss and a quality loss for unsupervised image enhancement: the fidelity loss ensures that the content of the enhanced image matches that of the input image, and the quality loss assigns the desired characteristics to the image. Zhang et al. [72] proposed an unsupervised low-illumination image enhancement method (HEP) using histogram equalization. This method introduced a noise separation module (NDM), which separates the noise and content in the reflectance map using unpaired images. Through histogram equalization and the NDM module, the texture information and brightness details of the image can be enhanced well, and the noise in dark and yellowish areas of the image can be suppressed.

Semi-Supervised Learning Methods
Semi-supervised learning was proposed relatively recently. It combines the advantages of supervised and unsupervised learning, requiring labeled data as well as unlabeled data. The characteristic of semi-supervised learning is to train the model with many unlabeled samples and a small number of labeled samples, which suits the common situation of scarce labeled samples and abundant unlabeled samples. Yang et al. [44] proposed a semi-supervised low-illumination image enhancement method (DRBN) based on band representations. A brief flow diagram of this algorithm is shown in Figure 4. The network is a deep recursive band network that first recovers a linear band representation of the enhanced image from paired low/normal-light images and then obtains an improved band representation by recomposing the given bands; this representation can both remove noise and restore image details. A semi-supervised learning network called HybridNet was proposed by Robert et al. [27]. The network is divided into two branches: the first receives the supervised signal and extracts the invariant components, while the second is entirely unsupervised and reconstructs, from the input data, the model information discarded by the first branch.
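Schematically, methods in this category minimize a supervised reconstruction term on the paired data plus a weighted unsupervised term (for example, an adversarial or quality prior) on the unpaired data; the symbols below are generic notation, not taken from any one of the cited papers:

```latex
\mathcal{L}_{\mathrm{total}}
  = \sum_{(x_i,\, y_i) \in \mathcal{D}_{\mathrm{paired}}} \ell\big(f_\theta(x_i),\, y_i\big)
  \; + \; \lambda \sum_{x_j \in \mathcal{D}_{\mathrm{unpaired}}} \mathcal{R}\big(f_\theta(x_j)\big)
```

Here $f_\theta$ is the enhancement network, $\ell$ a supervised loss such as L1 or MSE, $\mathcal{R}$ an unsupervised regularizer evaluated without a reference image, and $\lambda$ a balancing weight.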

Zero-Shot Learning Methods
The advantage of zero-shot learning is that training data are not required. Such methods need no training in advance and can directly take the low-light image to be enhanced as input. Zhu et al. [39] proposed a new three-branch fully convolutional neural network called RRDNet, which decomposes the input image into three components: illumination, reflectance, and noise. By iterating over a loss function, the noise is estimated and the lighting is effectively restored, allowing the noise to be predicted explicitly and thus eliminated. RRDNet proposes a new loss to optimize the decomposition, which estimates the noise in dark areas according to the image brightness distribution so that the noise there is not over-amplified. A brief flow diagram of this algorithm is shown in Figure 5. Inspired by super-resolution models, Zhang et al. [33] proposed a CNN (ExCNet) that is used only at test time. During testing, the network estimates the parametric curve best suited to the back-lit test image, which can then be used to enhance the image directly. The advantage of ExCNet is that it copes well with complex shooting environments and severe backlighting. In video enhancement, each frame's estimation is guided by the parameters of the previous frame to avoid artifacts.
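The parametric-curve idea can be illustrated with the quadratic adjustment curve used by zero-shot curve-estimation methods; the specific curve family ExCNet estimates may differ, so treat this as a generic sketch. The curve LE(x) = x + αx(1 − x) fixes 0 and 1, so iterating it lifts dark values without overflow for α ∈ [0, 1].

```python
import numpy as np

def apply_curve(img: np.ndarray, alpha: float, iterations: int = 4) -> np.ndarray:
    """Iteratively apply the quadratic curve LE(x) = x + alpha * x * (1 - x).

    img: float image in [0, 1]; alpha in [0, 1]. Positive alpha brightens
    dark values while leaving 0 and 1 fixed, so no pixel over- or underflows.
    """
    out = img
    for _ in range(iterations):
        out = out + alpha * out * (1.0 - out)
    return out

dark = np.linspace(0.0, 1.0, 11)     # a synthetic ramp of pixel values
bright = apply_curve(dark, alpha=0.8)
```

In a learned zero-shot method, α (possibly one value per pixel and per iteration) is what the network predicts from the input image itself, with no reference image involved.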


Low-Illumination Image Datasets
Recently, driven by the demands of algorithm development, the number of low-illumination image datasets has gradually increased. Table 2 shows the representative datasets. It can be seen from the table that datasets have evolved from single to diverse and from simple to complex, and the number of images in a single dataset has also grown significantly, from 10 images in LIME in 2017 to 55,148 in LIL-Phone in 2021; these low-illumination image datasets provide solid data support for developing various algorithms. However, the existing datasets are generally suitable only for low-level image processing tasks. At the same time, collecting datasets of comparable size is not easy, since it is difficult to capture low-illumination/normal image pairs of specific objects. Therefore, generating existing paired data requires manual processing, but the manual labels are often not ideal, which considerably affects the enhancement performance of the algorithms.
Early datasets were generally used by traditional, model-based methods, which demand less of datasets in both quantity and quality. Early datasets consist solely of single images, but with the development of deep learning for low-illumination imaging, paired low-illumination image datasets such as SID-RAW, LOL, and LSRW have been successively proposed. Among them, the LOL (Low-Light) dataset collected and organized by Wei et al. in 2018 is the most representative one. Figure 6 shows low-light images from some sample datasets.


Low-illumination image datasets are summarized in Table 2. The MIT-Adobe FiveK dataset was taken with an SLR camera and includes 5000 images covering a wide range of scene information and lighting conditions; in addition, trained professionals used Adobe Lightroom to adjust the tone so that the photos are visually pleasing to the human eye. SICE is a large-scale multi-exposure image dataset containing multi-scene high-resolution image sequences.
The NPE dataset consists of 46 images taken with a digital camera and 110 images downloaded from various websites; the collected images are characterized by low contrast in local areas. LIME is a small dataset containing only 10 images, mainly used for evaluating the quality of enhancement results. The DPED is a large dataset of real images taken with three mobile phones and a single-lens reflex camera. The SID dataset, collected and presented in 2018, was captured with two cameras, the Sony Alpha 7S II and the Fuji X-T2, and contains 5094 raw short-exposure images, each with a corresponding long-exposure reference image. The LOL dataset consists of 500 low-light and normal-light image pairs, divided into 485 training pairs and 15 test pairs; the low-light images contain noise generated during capture, most depict indoor scenes, and all have a resolution of 400 × 600. BDD-100K is the largest driving-video dataset and contains 100K videos; it accounts for geographical, environmental, and other factors and is useful for training models that are less affected by the environment. The ExDark dataset contains low-light images from 10 different conditions, from dusk to very low light, annotated with 12 object categories. SMOID consists of video pairs synchronized in space and time; it also remedies the defect that previous low-illumination video datasets contained only low-illumination videos without normal-brightness counterparts. The DRV dataset contains both still videos and the corresponding long-exposure references and is derived as an extension of the SID dataset.
The Dark-Face dataset consists of real low-light images taken in multiple scenes. Every image in this dataset has its face regions annotated with bounding boxes, and it is used as the main training or validation set. NightCity is the largest real-world night-time semantic segmentation dataset with pixel-level labels. The ELD dataset consists of noisy images synthesized by a noise model and is collected with various camera devices; under extremely low illumination conditions, the quality of its images is equivalent to real data. The LLVIP dataset pairs visible images with infrared images and contains 27 dark scenes and two daytime scenes. LSRW is the first large-scale real-world low-illuminance/normal-brightness dataset to be collected. Both VE-LOL-H and VE-LOL-L belong to the VE-LOL dataset: VE-LOL-L is built for low-level visual tasks, and VE-LOL-H for high-level visual tasks. VE-LOL-L consists of composite images covering a variety of backgrounds, while VE-LOL-H is the largest low-light detection dataset for high-level visual tasks and can be used in joint enhancement and detection algorithms. Oli-Phone is a large-scale dataset containing both low-light images and low-light videos, which are taken by mobile phones of different brands under different lighting conditions. VV is a low-illumination dataset that has been used in the field of low-illumination image enhancement.
In general, most of the existing low-illumination image datasets consist of single images in natural environments, and the number of paired image datasets with references is small. Moreover, artificial labeling often cannot avoid the influence of physical factors. Therefore, the datasets for low-illumination image enhancement lag behind those of other image processing tasks. In addition, in experiments, researchers tend to use test data collected by themselves, because some datasets were not originally intended for low-illumination image enhancement, and except for paired datasets, there is no specific evaluation index for the enhancement effect. Since datasets are the key to developing low-illumination image enhancement technology and remain relatively underdeveloped compared with other image processing datasets, how to get rid of the dependence on paired datasets is the key to developing deep-learning-based low-illumination image enhancement.

Objective Evaluation Indices
There are many objective evaluation indicators for images, with different evaluation standards from different perspectives; each standard has its corresponding advantages and disadvantages. Thus far, there is no evaluation index designed explicitly for low-light image enhancement. The existing Image Quality Assessment (IQA) methods can be divided into full-reference evaluation indices and no-reference evaluation indices. Table 3 lists several classic and recent quality evaluation indices. The following is a detailed description of several commonly used indicators.
Table 3. Abstract of objective quality evaluation indices of images.

Abbreviation                                                Full-/Non-Reference
MSE (Mean Square Error) [86,87]                             Full-Reference
MAE (Mean Absolute Error) [88]                              Full-Reference
SNR (Signal to Noise Ratio) [89]                            Full-Reference
PSNR (Peak-Signal to Noise Ratio) [90]                      Full-Reference
LPIPS (Learned Perceptual Image Patch Similarity) [91]      Full-Reference
IFC (Information Fidelity Criterion) [92]                   Full-Reference
VIF (Visual Information Fidelity) [93]                      Full-Reference
SSIM (Structural Similarity Index) [94]                     Full-Reference
IE (Information Entropy) [95]                               Non-Reference
NIQE (Natural Image Quality Evaluator) [96]                 Non-Reference
LOE (Lightness Order Error) [97]                            Full-Reference
PI (Perceptual Index) [98]                                  Non-Reference
MUSIQ (Multi-scale Image Quality Transformer) [99]          Non-Reference
NIMA (Neural Image Assessment) [100]                        Non-Reference
SPAQ (Smartphone Photography Attribute and Quality) [101]   Non-Reference

(i) Peak-Signal to Noise Ratio (PSNR) is one of the most widely used indicators; its unit is dB. It measures the difference between two images, e.g., between a compressed image and the original image (compressed image quality evaluation), or between a restored image and the actual image (restoration algorithm performance evaluation). The higher the PSNR value, the less the distortion. The formula of PSNR is:

PSNR = 10 \log_{10}\left(\frac{MaxValue^2}{MSE}\right)

where MSE is the mean square error of the two images and MaxValue is the maximum value of the image pixels.
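The PSNR definition above can be computed directly with NumPy. The following is a minimal sketch (the function name `psnr` and its arguments are our own naming, not from the paper):

```python
import numpy as np

def psnr(reference, enhanced, max_value=255.0):
    """Peak Signal-to-Noise Ratio (dB) between two same-sized images."""
    reference = reference.astype(np.float64)
    enhanced = enhanced.astype(np.float64)
    mse = np.mean((reference - enhanced) ** 2)   # mean square error of the pair
    if mse == 0:                                 # identical images: PSNR is infinite
        return float("inf")
    return 10.0 * np.log10(max_value ** 2 / mse)
```

For 8-bit images MaxValue is 255; identical images give an infinite PSNR, and a fully inverted black/white pair gives 0 dB.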
(ii) Structural Similarity (SSIM) is the most widely used standard for referenced image quality evaluation. SSIM compares the brightness, contrast, and structural similarity between two images; its value ranges from 0 to 1, and the closer to 1, the more similar the two images are. Assuming that x and y are the two input images, the formula is:

SSIM(x, y) = [l(x, y)]^{\alpha} \, [C(x, y)]^{\beta} \, [S(x, y)]^{\gamma}

where l(x, y) is the brightness comparison, C(x, y) is the contrast comparison, and S(x, y) is the structural comparison. α, β, and γ are greater than 0 and adjust the weights of the three parts. l(x, y), C(x, y), and S(x, y) are defined as:

l(x, y) = \frac{2\mu_x \mu_y + c_1}{\mu_x^2 + \mu_y^2 + c_1}, \quad C(x, y) = \frac{2\sigma_x \sigma_y + c_2}{\sigma_x^2 + \sigma_y^2 + c_2}, \quad S(x, y) = \frac{\sigma_{xy} + c_3}{\sigma_x \sigma_y + c_3}

where \mu_x and \mu_y represent the average values of the two images, \sigma_x and \sigma_y represent their standard deviations, and \sigma_{xy} represents their covariance. c_1, c_2, and c_3 are constants to avoid a zero denominator.
(iii) Mean Square Error (MSE) is also one of the most commonly used indicators to measure image quality. It refers to the expected value of the squared difference between the estimated value and the true value. In an image processing algorithm, it is the mean of the squared differences between the processed image's pixel values and the original pixel values. The expression is as follows:

MSE = \frac{1}{MN} \sum_{x=1}^{M} \sum_{y=1}^{N} \left[ g(x, y) - \hat{g}(x, y) \right]^2

where M is the height of the image and N is its width, and g(x, y) and \hat{g}(x, y) denote the original and enhanced image, respectively. The smaller the value of MSE, the better the image quality.
(iv) Information Entropy (IE) reflects the amount of information carried by an image. The greater the information entropy, the richer the image information and the better the quality. IE is used to compare the difference in information content between images; for example, images taken of the same area will have different information content when shot at different times. The expression is as follows:

IE = -\sum_{k=0}^{M} p(k) \log_2 p(k)

where p(k) is the probability density of gray level k, and M is the maximum gray level.
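The entropy above can be estimated from the normalized gray-level histogram. A minimal sketch (our own naming; empty bins are skipped since 0·log 0 is taken as 0):

```python
import numpy as np

def information_entropy(image, levels=256):
    """Shannon entropy (bits) of an image's gray-level distribution."""
    hist, _ = np.histogram(image, bins=levels, range=(0, levels))
    p = hist / hist.sum()   # probability density p(k) of each gray level
    p = p[p > 0]            # drop empty bins: 0 * log2(0) is defined as 0
    return float(-np.sum(p * np.log2(p)))
```

A constant image carries no information (entropy 0), while an image split evenly between two gray levels carries exactly 1 bit.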
(v) Standard Deviation (STD) represents the average distance of each data item from the mean and is the square root of the variance; it reflects the degree of dispersion of a dataset. For images, STD reflects the dispersion of pixel values around the average value and measures image contrast within a specific range. The larger the standard deviation, the more information the image contains and the better the visual effect. The formula is:

STD = \sqrt{\frac{1}{MN} \sum_{i=1}^{M} \sum_{j=1}^{N} \left[ f(i, j) - \delta \right]^2}

where M is the height of the image, N is the width, and f(i, j) is the gray value of the pixel at (i, j). \delta represents the average gray value of the image, and its expression is as follows:

\delta = \frac{1}{MN} \sum_{i=1}^{M} \sum_{j=1}^{N} f(i, j)

where M, N, and f(i, j) denote the same content as above.
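Worked directly from the two formulas above, the computation reduces to the population standard deviation of the pixel array (a sketch with our own naming; it agrees with NumPy's built-in `np.std`):

```python
import numpy as np

def image_std(image):
    """Standard deviation of pixel values around the image mean (delta)."""
    f = image.astype(np.float64)
    delta = f.mean()                              # average gray value
    return float(np.sqrt(np.mean((f - delta) ** 2)))
```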
(vi) Lightness Order Error (LOE) measures the difference in the relative order of lightness between an image and its enhanced version; the illumination change is evaluated through the change in the relative lightness order within the neighborhood. LOE reflects the naturalness-preservation ability of the enhancement: a smaller value indicates that the image better preserves the lightness order and looks more natural. The formula is:

LOE = \frac{1}{MN} \sum_{i=1}^{M} \sum_{j=1}^{N} RD(i, j), \qquad RD(x, y) = \sum_{i=1}^{M} \sum_{j=1}^{N} U\big(L(x, y), L(i, j)\big) \oplus U\big(L_e(x, y), L_e(i, j)\big)

where M is the height of the image, N is its width, \oplus is the XOR operator, U(p, q) equals 1 if p \geq q and 0 otherwise, and L(i, j) and L_e(i, j) represent the maximum value over the three color channels of the original and enhanced images, respectively. The smaller the value of LOE, the better the image preserves the lightness order.
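A naive sketch of the LOE computation follows. Since the exact definition is quadratic in the number of pixels, pixels are subsampled with a `stride` parameter here; this subsampling and the function naming are our own simplifications, not the reference implementation:

```python
import numpy as np

def loe(original, enhanced, stride=8):
    """Lightness Order Error (naive subsampled sketch).

    L is the per-pixel maximum over the three color channels.  For every
    sampled pixel pair, we check whether the enhanced image preserves the
    relative lightness order of the original; disagreements are counted
    with XOR, matching the U(.,.) ⊕ U(.,.) term in the definition.
    """
    L = original.max(axis=2).astype(np.float64)[::stride, ::stride].ravel()
    Le = enhanced.max(axis=2).astype(np.float64)[::stride, ::stride].ravel()
    order_orig = L[:, None] >= L[None, :]    # U(L(x), L(y)) for all pairs
    order_enh = Le[:, None] >= Le[None, :]   # U(Le(x), Le(y)) for all pairs
    rd = np.logical_xor(order_orig, order_enh).sum(axis=1)  # RD per pixel
    return float(rd.mean())
```

An enhancement that applies any monotonically increasing mapping to the lightness (e.g., uniform scaling) preserves all order relations and therefore scores a perfect LOE of 0.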

Subjective Evaluation Indices
In addition to the aforementioned objective evaluation indices, there are subjective ones: the Mean Opinion Score (MOS) and the Differential Mean Opinion Score (DMOS). Among them, MOS is the most widely used subjective IQA method. Different observers subjectively compare the original image and the enhanced image to give MOS scores, which are then averaged. MOS scores range from 1 to 5, with higher scores indicating better subjective enhancement. The formula is:

MOS = \frac{1}{N} \sum_{i=1}^{N} R_i

where R_i is the satisfaction score given by the i-th evaluator for the image, and N is the total number of evaluators.
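The averaging above amounts to a simple mean over the evaluators' ratings; a small sketch (our own naming, with a range check on the 1-5 scale):

```python
import numpy as np

def mos(scores):
    """Mean Opinion Score: average of per-evaluator ratings on a 1-5 scale."""
    scores = np.asarray(scores, dtype=np.float64)
    assert ((scores >= 1) & (scores <= 5)).all(), "ratings must lie in [1, 5]"
    return float(scores.mean())
```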

Summary
Although PSNR, MSE, SSIM, and IE are classical and popular image evaluation indicators, they are far from reaching the evaluation ability of human visual perception. At the same time, although these evaluation indices can assess image quality, they cannot express the relationship between the low-illumination enhanced image and the actual image. Therefore, for the evaluation indices of low-illumination image enhancement, we think much effort is still needed to balance human visual effects and machine perception.

Algorithm Comparisons and Result Analysis
We selected several datasets to compare the effects of various deep-learning-based low-illumination image enhancement algorithms and made an objective evaluation of the enhanced images. To ensure the fairness of the test, all methods were tested in the same hardware environment after reaching their optimal level of training. Training environment: Intel i7-8700 CPU, 32 GB RAM, and an NVIDIA GeForce RTX 2080 Ti GPU; the network frameworks were built with TensorFlow/PyTorch using PyCharm and an Anaconda Python 3.7 interpreter. Test environment: Intel(R) Core(TM) i5-6300HQ CPU @ 2.3 GHz, Windows 10, 12 GB RAM, PyCharm, and an Anaconda Python 3.7 interpreter. Experiments were conducted on the unpaired LIME and SICE datasets and on the paired LOL and VE-LOL-L datasets, respectively. We analyzed LLNet, RetinexNet, MBLLEN, EnlightenGAN, ExCNet, RRDNet, DRBN, TBEFN, IceNet, KinD++ [30], SCI, URetinexNet, LCDPNet, and Zero-DCE++ [50], a total of 14 deep-learning-based low-illumination image enhancement algorithms. LLNet, RetinexNet, MBLLEN, TBEFN, KinD++, URetinexNet, and LCDPNet belong to the supervised learning methods; EnlightenGAN and SCI are unsupervised models; DRBN is a semi-supervised model; and ExCNet, RRDNet, Zero-DCE++, and IceNet are zero-shot learning methods. The enhancement effects of these algorithms are shown in Figures 7-10. In addition, we use the aforementioned indicators to objectively evaluate the 14 methods; the evaluation results are shown in Tables 4-7.

Subjective Analysis
We applied the above 14 methods to the LIME, SCIE, LOL, and VE-LOL-L datasets to demonstrate their enhancement effects, as shown in Figures 7-10. Figure 7 shows the enhancement results on the LIME dataset for an outdoor image. Some of the enhanced images have obvious color deviation, among which ExCNet and RetinexNet show the most serious color distortion. We also found that RRDNet yields a lower brightness increase than the other methods, while EnlightenGAN and URetinexNet over-expose the bright areas of the image. Figure 8 shows the enhancement results on the SCIE dataset for an indoor image. The enhancement is not noticeable for RRDNet, Zero-DCE++, or SCI. Moreover, RetinexNet, ExCNet, IceNet, and KinD++ all show artifacts, among which RetinexNet is the most serious; URetinexNet has a better visual effect than the other methods. Figure 9 shows the enhancement results on the LOL dataset for an outdoor image. EnlightenGAN and MBLLEN obviously leave serious color distortion: the former leaves a yellowish cast, and its colors are too bright. In the enhanced images obtained by LLNet and RetinexNet, details are seriously lost, and the railings almost disappear. Figure 10 shows the enhancement results on the VE-LOL-L dataset for an indoor image. The lower-left corner of Figure 10 is obviously red, but the color presented by LLNet is more orange. Similarly, the color deviation of the desktop area in the image produced by RetinexNet is also large, and the overall color of the image is inconsistent. LLNet, RetinexNet, TBEFN, and KinD++ all produce a black halo around the light in the picture.
In general, the existing deep-learning-based low-illumination image enhancement methods can effectively improve image brightness and contrast, but their visual effects differ. The existing methods still leave much room for improvement, such as how to suppress the noise generated while improving the brightness and how to avoid color distortion. Some existing methods solve one problem effectively but often ignore others.

Objective Analysis
As shown in Table 4, the NIQE, IE, and LOE indices were used to objectively evaluate the enhancement effects of the 14 deep-learning-based low-illumination image enhancement methods on the LIME dataset. As can be seen from the table, methods proposed in recent years achieve better results on these indicators. SCI ranks in the top three on both IE and LOE, and EnlightenGAN also performs well on NIQE and IE. The other five algorithms each have their own advantages, ranking in the top three once each.
As shown in Tables 5-7, we used PSNR, SSIM, MSE, NIQE, IE, and LOE to objectively evaluate the enhancement results of the 14 deep-learning-based methods. Table 5 presents the objective evaluation on the SCIE dataset. As can be seen from the table, MBLLEN performs best, achieving first place in PSNR, SSIM, MSE, and IE; URetinexNet does almost as well. Table 6 is the objective evaluation on the LOL dataset, which is the most widely used dataset. Although many algorithms use LOL as their training set, this does not give them an advantage in testing. URetinexNet achieves excellent results on all full-reference indices but not-so-good results on the non-reference ones. Table 7 is the objective evaluation on the VE-LOL-L dataset, which is the latest large paired dataset. It can be seen from the table that LCDPNet has an obvious advantage over the other algorithms, while the remaining rankings are scattered.
In general, we can see from the above four tables that all the methods improve the indicators relative to the input image. Meanwhile, the supervised-learning-based methods score significantly better on the objective evaluation indicators than the methods based on unsupervised learning, semi-supervised learning, and zero-shot learning. We also found that RRDNet achieves the best LOE score on all four datasets, while KinD++ scores the worst. However, based on the visual results in Figures 7-10, the RRDNet enhancement is insignificant and its brightness increase is low, whereas the KinD++ enhancement effect is relatively good. Therefore, we believe that the LOE index is not well suited to low-illumination image enhancement. Similarly, we found that IceNet is at an obvious disadvantage on the NIQE index, yet in terms of visual effect, it is noticeably better than methods such as RetinexNet and RRDNet. Therefore, we believe that most non-reference indicators may not be applicable to the field of low-illumination image enhancement.

Low-Illumination Face Detection Performance
To better reflect the influence of low-illumination images on high-level visual tasks, we adopted the DSFD [102] method to carry out a low-illumination face detection experiment on the DarkFace dataset using the above 14 methods, as shown in Figure 11. It can be seen from the figure that all the deep-learning-based low-illumination image enhancement methods effectively improve the accuracy of face detection. We also conducted a quantitative analysis of each enhanced image, as shown in Figure 12. As can be seen, the number of false detections after enhancement is less than before enhancement, while the number of correct detections generally increases. Moreover, compared with earlier methods, recent methods improve the number of correct detections more significantly. The trend of missed detections is opposite to that of correct detections, and recent methods miss fewer detections. The logarithmic trend line in Figure 12 also illustrates that recent methods yield better results for high-level visual tasks, showing that these enhancement methods can benefit such tasks.

Time Complexity
In Table 8, we compare the computational complexity of the 14 algorithms by the average running time of image enhancement on the VE-LOL-H dataset. We can see from Table 8 that the gap in running time between the methods is significant. More specifically, EnlightenGAN achieves the fastest speed, and RRDNet runs the slowest. Although most of the methods can run within 1 s, it is still challenging to meet the requirements of real-time video processing.

Table 8. Average running time (in seconds) of image enhancement on VE-LOL-H. The best result is bold in red, the second in blue, and the third in yellow.


Summary
It is not difficult to see that the objective evaluation values of the test images in the LOL and VE-LOL-L datasets agree better with human visual perception than those in the SCIE and LIME datasets. It can also be seen that algorithms trained with paired datasets achieve a more obvious enhancement effect. We can see from the data that the supervised-learning-based methods achieve better results than the other methods. Still, unsupervised learning, zero-shot learning, and semi-supervised learning are the current development trends, mainly because it is difficult to obtain paired images in a low-illumination environment. Fully supervised methods have lower generalization ability and poorer applicability than the other methods. Moreover, unsupervised learning methods can introduce more prior knowledge reflecting environmental characteristics by designing proper loss functions and network structures. The enhancement algorithms with better subjective visual effects have not achieved good objective evaluation results, which is precisely where the objective evaluation indices need to be improved.

Conclusions
This paper summarizes the low-illumination image enhancement algorithms based on deep learning and introduces representative algorithms from four categories. Then, we present the existing image datasets, point out their shortcomings, and describe the current subjective and objective evaluation indicators. We present the enhancement results of 14 typical algorithms and analyze them in detail to further reveal the advantages and disadvantages of each algorithm. Our findings are summarized as follows:
(1) Most of the existing low-illumination image datasets consist of single images in natural environments, and the number of paired image datasets with references is small. The low-illuminance/standard-light image pairs composed by synthetic methods often have deviations, and artificial labeling often fails to avoid the influence of physical factors, so the labels are inaccurate. The existing datasets are also not suitable for high-level visual tasks and often fail to meet their requirements. Therefore, enriching the low-illumination image datasets is a direction of future research.
(2) The existing objective evaluation indicators are borrowed from other research fields, such as image defogging, rain removal, and noise reduction. These metrics are not designed for low-light image enhancement, and they are far from achieving the natural perception of human beings. The quality indicated by objective metrics often differs from the evaluation of the human eye, so it is urgent to design objective evaluation indicators specifically for low-illumination image enhancement.
(3) From the series of algorithm results, it can be seen that algorithms with fusion and model frameworks often have better generalization ability, and the unsupervised learning methods are more robust and stable than the supervised learning methods.
However, the purpose of low-illumination image enhancement algorithms is to prepare for higher-level visual tasks. Hence, current research aims at universal low-illumination image enhancement algorithms that can serve higher-level visual tasks.
(4) Low-illumination video enhancement decomposes a video into frames, enhances the decomposed images, and then fuses them back together. However, it is difficult for the various algorithms to reach a speed of 30 frames per second. How to speed up the algorithms without weakening the enhancement effect still needs further research.
In conclusion, deep-learning-based low-illumination image enhancement algorithms can hardly achieve the optimal level in brightness enhancement, noise reduction, and contrast enhancement simultaneously. Therefore, we need to choose the optimal method for each practical application. We hope that, through this review of the existing deep-learning-based low-illumination image enhancement methods, research in this direction can be raised to a higher level so that it can play a more critical role in high-level visual tasks.