FDMLNet: A Frequency-Division and Multiscale Learning Network for Enhancing Low-Light Image

Low-illumination images exhibit low brightness, blurry details, and color casts, which produce an unnatural visual experience and negatively affect other visual applications. Data-driven approaches show tremendous potential for lighting up the image brightness while preserving its visual naturalness. However, these methods introduce hand-crafted halos and noise enlargement or over/under enhancement and color deviation. To mitigate these challenging issues, this paper presents a frequency division and multiscale learning network named FDMLNet, including two subnets, DetNet and StruNet. This design first applies the guided filter to separate the high and low frequencies of authentic images; DetNet and StruNet are then developed to process them, respectively, to fully explore their information at different frequencies. In StruNet, a feasible feature extraction module (FFEM), composed of a multiscale learning block (MSL) and a dual-branch channel attention mechanism (DCAM), is injected to promote its multiscale representation ability. In addition, three FFEMs are connected in a new dense connectivity scheme meant to utilize multilevel features. Extensive quantitative and qualitative experiments on public benchmarks demonstrate that our FDMLNet outperforms state-of-the-art approaches, benefiting from its stronger multiscale feature expression and extraction ability.


Introduction
Photos captured under insufficient illumination conditions, such as nighttime, uneven lighting, or underexposure, deliver an undesired visual experience or compromised information for other computer vision tasks, due to their low contrast and lightness and blurry details [1-5]. In particular, high-level computer vision tasks perform unsatisfactorily on these low-light photos, for example through inaccurate face or object recognition [6,7]. Hence, it is necessary to restore the quality of low-illumination pictures. Low-light image enhancement (LLIE) [1,8-14] is an efficient way to yield visually pleasing images with moderate lightness, vivid color, and clearer details, so as to further improve the performance of face detection, object recognition, and other tasks. Therefore, LLIE [1-3,15] is an indispensable technology in low-level computer vision applications for generating the desired images.
In past decades, a great many LLIE approaches, including histogram-based [3,16,17], Retinex-based [8-10,18,19], fusion-based [20,21], and physical-model-based [3,22-26] methods, have been reported. Histogram-based methods, which are simple and highly efficient, introduce over- or underenhancement because they neglect the spatial relationship among pixels. Retinex-based methods consider that an image consists of illumination and reflection components, but the enhanced images exhibit color distortion. Fusion-based models yield visually appealing images by fusing multiple images with various characteristics. However, the enhanced results suffer from detail loss and artificial halos. Dehazing-model-based approaches [25] are the most typical representatives of physical-model-based methods, yet they fail to create satisfying, haze-free images. Recently, data-driven methods [1,27-30] have been introduced to overcome the inappropriate enhancement of classical methods, owing to their powerful feature extraction capability. However, existing approaches carry heavy computing burdens and are time-consuming, limiting their real-world applications. Furthermore, most of them rarely take hierarchical features and a multiscale representation into account [15].
To cope with these issues, we propose a new LLIE method based on frequency division and multiscale learning, called FDMLNet, for improving the quality of images acquired in suboptimal lighting conditions. Differing from most CNN-based and GAN-based methods, we perform different operations on the image's high and low frequencies rather than the whole picture to fully explore its hierarchical features. Additionally, we present a feasible feature extraction module (FFEM) based on a multiscale learning (MSL) block with a dual-branch channel attention mechanism (DCAM) to obtain self-adapting multiscale features. The former can adaptively exploit information at different scale spaces, and the latter makes our FDMLNet model focus on more valuable features while enhancing its multiscale learning capacity. Simultaneously, a dense connection strategy is introduced in our model to merge multilevel features adequately. Figure 1 shows the enhanced results via the presented method for images obtained in different lighting conditions. With the help of our FDMLNet, all enhanced images consistently show a pleasing visual appearance. In conclusion, the primary contributions of this work are as follows.
(1) We present a novel LLIE approach for creating visually satisfying images. The superior performance of this FDMLNet is verified by extensive experiments on several public benchmarks.
(2) We design a residual multiscale structure named MSAM, which is based on a residual multiscale learning (MSL) block and a dual-branch channel attention mechanism (DCAM). The former promotes the multiscale feature learning ability of the FDMLNet, and the latter, including spatial attention and pixel attention, makes our model focus on the areas that best characterize the image.
(3) Finally, we merge three MSAMs in a novel dense skip-connection way to build an FFEM for fully exploring the image's hierarchical information. In addition, we apply the dense connection strategy among FFEMs to further integrate multilevel features adequately.
We organize the rest of this paper as follows. The relevant works on LLIE are briefly reviewed in Section 2. In Section 3, the framework of our model is elaborated. We also present the relation between existing models and our method. In Section 4, we analyze ablation studies and the performance of our FDMLNet in detail. In the end, we report the conclusions and discussions about this work in Section 5.

Related Works
LLIE plays an irreplaceable role in recovering inherent color and details as well as suppressing the noise of low-illumination images. In what follows, we review previous low-light image enhancement works, including conventional approaches and learning-based approaches.

Traditional Approaches
In the early stage, specialized high-performance hardware, such as professional low-light circuits, charge-coupled devices (CCDs), and complementary metal-oxide-semiconductor (CMOS) sensors, was employed in imaging systems to generate visually satisfying pictures. However, these devices are prohibitively expensive and difficult to operate. Alternatively, the gathered images can be processed by LLIE methods. Histogram-equalization-based methods, including global histogram equalization (GHE) [16,17] and local histogram equalization (LHE) [3-5], directly adjust the image pixel values to redistribute them at the global and local levels. Swarm intelligence algorithms, image decomposition, the Rayleigh distribution, and other technologies [31-33] were employed to optimize the previous HE-based approaches. Additionally, gamma, S-shape, logarithmic, and other improved nonlinear functions [34-36] can also restore the inherent color and details of excessively dark images through pixel transformation. Unfortunately, the above-listed methods either amplify noise or yield improper exposure. Recently, some scholars [37-40] have handled LLIE issues in the wavelet domain, gradient domain, NSST domain, etc., rather than the spatial domain.
Contrary to pixel transformation approaches, Retinex-inspired methods [8,18,19] typically assume that an image consists of illumination and reflection components and that the reflection component remains consistent during processing. Hence, the LLIE problem can be viewed as an illumination component estimation problem. On the basis of this assumption, LR3M [18], a fast Retinex-based algorithm [8], a Poisson-noise-aware Retinex model [9], a Retinex-based variational framework [10], and other methods [11,41] have been reported to yield satisfying images. However, the enhanced results exhibit observable color distortion, noise enlargement, or fuzzy details. Differing from the above approaches, physical-model-based approaches enhance low-light images from the perspective of the imaging principle. The dehazing model [25], atmospheric scattering model [22,24], and prior-knowledge-based model [23,26] are typical representatives. However, the processed images suffer from hand-crafted halos and local darkness, due to inappropriate prior information under some low-light conditions. Moreover, fusion-based methods [3,20,21], which fuse a variety of frequency images or multifeature maps to fully exploit the hierarchical features of the image, can also effectively recover visually satisfactory photos from subpar illumination images. Similar to these, we perform frequency division on low-luminosity images to obtain high- and low-frequency information, and then integrate the frequency images processed by different operations.

Learning-Based Approaches
In recent years, learning-based methods, covering supervised and unsupervised learning strategies, have outperformed traditional approaches in feature representation and extraction and have been applied in object detection, image processing, and other computer vision tasks [42-45]. LLNet [27], a groundbreaking work for LLIE, stacked sparse denoising autoencoders to improve lightness and denoise at once. Lv et al. [46] designed MBLLEN, consisting of feature extraction, enhancement, and fusion modules, to improve on the performance of LLNet. EEMEFN [47] and TBEFN [48] generated normal-light pictures by fusing multiexposure images. Subsequently, the pyramid network [49,50], residual network [51], image semantic network [52], semantically contrastive learning [52], and recursive learning network [53] were introduced to enhance the feature representation and extraction of previously reported models. Moreover, the Retinex theory and learning-based models were combined to achieve an appealing performance. For example, Retinex-Net [54] applied Enhance-Net to adjust the light of illumination maps generated by Decom-Net. A regularized sparse gradient was introduced into Retinex-Net to build a more robust LLIE approach. Wang et al. [55] applied local and global features extracted by DeepUPE to learn the mapping relationship from the original image to the illumination image. Zhang et al. [28] designed an enhancement framework (named KinD) that included three stages: layer decomposition, reflectance recovery, and illumination adjustment. They [56] then injected a multiscale illumination attention module into the earlier KinD model to further promote its capacity. However, these Retinex-inspired learning methods also inevitably introduce a color deviation or hand-crafted halos due to an inaccurately estimated illumination.
Additionally, a frequency-based decomposition-and-enhancement model [21] was reported, relying on the assumption that noise exhibits different contrast at different frequency layers. Understandably, supervised methods require substantial extra computing resources to process paired (normal/abnormal) datasets for training. However, these paired images cannot be easily gathered in the real world; they must be carefully produced by artificial synthesis or by altering the exposure time and ISO rating of cameras.
Conversely, unsupervised methods are trained on unpaired images captured under various lighting conditions and scenes rather than paired images [1,29,53]. Jiang et al. [29] established EnlightenGAN, a typical GAN-based method containing a global and local discriminator, self-regularized perception, and an attention mechanism. Yu et al. [57] designed DeepExposure relying on reinforcement adversarial learning. However, these unsupervised methods need carefully selected unpaired images for training and inevitably introduce observable color casts. To fully explore the advantages of unsupervised and supervised methods, Yang et al. [58] presented a semisupervised approach named DRBN [59] for light enhancement. In this model, supervised learning restored the linear band representation of an enhanced image, and perceptual-quality-driven adversarial learning rearranged these linear bands to yield visually satisfying normal-light images. In [59], a network pretrained on an aesthetic dataset and an introduced LSTM module further optimized the DRBN [59]. More recently, zero-reference-based methods have proved highly efficient and cost-effective while needing fewer images, which has attracted considerable attention in the field of LLIE. For example, RRDNet [60] decomposed an image into illumination, reflectance, and noise; the Retinex reconstruction loss, texture enhancement loss, and illumination-guided noise estimation loss were then carefully designed to drive zero-reference-based learning. Inspired by Retinex, Zhao et al. [30] created RetinexDIP, and Liu et al. [61] designed the RUAS network for boosting low-illumination images. Li et al. [62] employed high-order nonlinear curve mapping to adjust the image pixel values for recovering satisfying images. Afterward, they demonstrated a faster and more lightweight network called Zero DCE++ [1].

Methodology
This section first analyzes the motivation of this design. After that, the overall model framework and its main components, including frequency division (FD), the feasible feature extraction module (FFEM), and the loss function, are described in detail. We discuss the relation to other learning-based methods at the end of this section.

Motivation
Images captured in insufficient light visibly exhibit a color deviation, blurry details, and unsatisfactory brightness. Traditional LLIE methods based on HE, the Retinex theory, a fusion framework, a physical model, etc., can solve these issues to a certain extent. Still, they perform unsatisfactorily in terms of robustness. Most significantly, [17,21] showed that the detail, edge, and noise are described in the high frequencies, while the main information is carried in the low frequencies. A frequency division operation can extract feature maps at different frequencies to achieve the goal of preserving detail and suppressing noise. Recently, data-driven approaches based on generative adversarial networks (GANs) or convolutional neural networks (CNNs) have shown strong feature representation capability and have been widely applied in image enhancement, image super-resolution, object recognition, and so on [42-45,63]. Unfortunately, although these LLIE methods significantly promote contrast, saturation, and brightness, remove the color deviation, and highlight the structural details, they depend heavily on computing resources owing to the depth or width of the network. Additionally, multiscale learning is rarely considered in these learning-based LLIE methods.
As a consequence, based on the above analysis, we combined traditional methods with a CNN to design a novel LLIE method with fewer parameters and a high efficiency. Specifically, we first perform frequency division on input images to obtain feature maps at high and low frequencies. Then, we propose a feasible feature extraction module containing an attention mechanism and a multiscale learning structure to improve the representation ability of our proposed CNN-based method.

The Overall Model Framework
To tackle the unsatisfactory contrast and brightness, blurry details, and color deviation of low-light images, we present a new LLIE approach based on the theory that different information in an image is displayed at different frequencies. The overall framework of our FDMLNet, including its three main parts, i.e., frequency division (FD), DetNet, and StruNet, is illustrated in Figure 2. Among these components, FD is employed to separate the high and low frequencies of the input images; DetNet, made up of a 7 × 7 Conv, a 3 × 3 Conv, and a 1 × 1 Conv, processes the high frequencies of the input images to preserve inherent detail and suppress the noise; StruNet, which consists of three feasible feature extraction modules (FFEMs), processes the low frequencies of the input images to promote their brightness and contrast and remove the color deviation.

Frequency Division
Different frequency information plays distinct roles in the whole image: pixels with drastic changes in intensity, such as edges, details, and noise, are distributed in the high frequencies, whereas pixels with a gentle change in intensity, such as the image structure, background, and other information, are spread over the low frequencies [21]. Based on this mechanism, this work employs a guided filter (GF) [64], an edge-preserving filter based on the local linear model, to process authentic pictures and create low- and high-frequency feature maps.
Suppose that $Q_n$ is the $n$th input image and $I_n$ is the corresponding guided image. The relationship between the output image $O_n$ and $I_n$ in the local window $w_k$ is assumed to be linear, i.e.,

$$O_n^i = a_k I_n^i + b_k, \quad \forall i \in w_k, \tag{1}$$

where $w_k$ is a local window with a size of $r \times r$. $a_k$ and $b_k$ are constants, and their values can be calculated by minimizing the squared error between $O_n$ and $Q_n$, that is,

$$E(a_k, b_k) = \sum_{i \in w_k} \left( \left( a_k I_n^i + b_k - Q_n^i \right)^2 + \varepsilon a_k^2 \right), \tag{2}$$

where $\varepsilon$ is a regularization parameter. Thus, the values of $a_k$ and $b_k$ are, respectively, defined as

$$a_k = \frac{\frac{1}{|w|} \sum_{i \in w_k} I_n^i Q_n^i - \mu_k \bar{Q}_{n,k}}{\delta_k + \varepsilon}, \qquad b_k = \bar{Q}_{n,k} - a_k \mu_k. \tag{3}$$

In Equation (3), $\mu_k$ and $\delta_k$ are the pixels' mean value and variance of the local window $w_k$ in the guided image, respectively. $|w|$ is the total number of pixels in $w_k$. $\bar{Q}_{n,k}$ is the pixels' mean value of the $n$th input image in $w_k$.
Since one pixel is contained in multiple windows, the average values of $a_k$ and $b_k$ are used, and Equation (1) can be rewritten as

$$O_n^i = \bar{a}_i I_n^i + \bar{b}_i, \tag{4}$$

where $\bar{a}_i$ and $\bar{b}_i$ are the averages of $a_k$ and $b_k$ over all windows containing pixel $i$, and $O_n$ is the low-frequency feature map of the input image. Therefore, its high-frequency feature map $P_n$ is

$$P_n = Q_n - O_n. \tag{5}$$
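The frequency division step can be illustrated with a minimal NumPy sketch of a guided filter followed by a subtraction. This is not the authors' implementation: it assumes a single-channel image used as its own guidance image, a square window of radius `r` (so the window side is `2r + 1`), edge padding at the borders, and a high-frequency map obtained as the input minus the low-frequency map. The function names and the default values of `r` and `eps` are hypothetical.

```python
import numpy as np

def box_mean(x, r):
    """Mean of x over a (2r+1) x (2r+1) window around each pixel (edge padding)."""
    k = 2 * r + 1
    padded = np.pad(x, r, mode="edge")
    windows = np.lib.stride_tricks.sliding_window_view(padded, (k, k))
    return windows.mean(axis=(-2, -1))

def guided_filter(guide, src, r=4, eps=1e-2):
    """Guided filter: fit the local linear model O = a * I + b in every window."""
    mean_i = box_mean(guide, r)                    # window mean of the guide
    mean_q = box_mean(src, r)                      # window mean of the input
    corr_iq = box_mean(guide * src, r)
    var_i = box_mean(guide * guide, r) - mean_i ** 2
    a = (corr_iq - mean_i * mean_q) / (var_i + eps)
    b = mean_q - a * mean_i
    # Each pixel lies in several windows, so average the coefficients.
    return box_mean(a, r) * guide + box_mean(b, r)

def frequency_division(image, r=4, eps=1e-2):
    """Split a 2-D image into low- and high-frequency maps (assumed: high = image - low)."""
    low = guided_filter(image, image, r, eps)      # assumption: self-guided filtering
    high = image - low
    return low, high
```

In this sketch, a larger `eps` smooths more aggressively, which moves more edge and noise content into the high-frequency map.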

Feasible Feature Extraction Module
We now analyze the structure of the feasible feature extraction module (FFEM), which is depicted in Figure 3, in detail. This module stacks three MSAMs in an updated dense skip-connection way to promote the learning ability of FDMLNet and fully explore features at different levels.
Multiscale learning structure: Generally, an image exhibits different characteristics at various scales, and a multiscale representation can effectively extract its information at different scales and promote the performance of learning-based methods [15,56]. As a result, the multiscale learning strategy has been broadly applied to object identification, pose recognition, face detection, and other computer vision tasks [42-45]. However, this strategy is rarely considered in most state-of-the-art LLIE models. In the proposed FDMLNet, we built an efficient multiscale learning structure called MSAM, which consists of a multiscale learning block and a dual-branch channel attention mechanism. This MSAM consists of small convolution kernel groups with a size of 3 × 3 and different dilation rates, i.e., 1, 2, 3, and 5. Figure 4 demonstrates its structure in detail. The feature dimensionality is first reduced by a 1 × 1 convolution operation to alleviate the computational load. Then, we extract multiscale information through four parallel branches, which are made up of 3 × 3 convolutions with dilation rates r = 1, 2, 3, and 5, respectively. Notably, the features extracted by the previous branch are injected into the next branch to adequately utilize the image's potential multiscale information. We then integrate the results of the four branches by concatenating them, and a 1 × 1 convolution operation processes the concatenated results. Finally, the dual-branch channel attention mechanism processes the convolution results, and the output features are then injected into the input to exploit more inherent global and local information.
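To make the effect of the dilation rates concrete, the short sketch below computes the effective kernel size of each 3 × 3 dilated convolution, using the standard relation k_eff = k + (k - 1)(r - 1), and the receptive field that accumulates when each branch also receives the previous branch's output. It assumes stride 1 and that the injection feeds the previous branch's features into the next convolution; the helper names are hypothetical.

```python
def effective_kernel(k, dilation):
    """Side length covered by a k x k kernel with the given dilation rate."""
    return k + (k - 1) * (dilation - 1)

def cascaded_receptive_fields(k, dilations):
    """Receptive field after each branch when branches are chained (stride 1)."""
    fields = []
    rf = 1
    for d in dilations:
        rf += effective_kernel(k, d) - 1
        fields.append(rf)
    return fields

rates = [1, 2, 3, 5]
print([effective_kernel(3, d) for d in rates])   # [3, 5, 7, 11]
print(cascaded_receptive_fields(3, rates))       # [3, 7, 13, 23]
```

Under these assumptions, four lightweight 3 × 3 kernels cover contexts from 3 × 3 up to 23 × 23 pixels, which is the multiscale behavior the block is designed to provide without using large kernels.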
Dual-branch channel attention mechanism: The human brain selectively focuses on key information while ignoring the rest of the visible information [1,7,21,29,43]. The attention mechanism, a strategy mimicking the human brain, has been widely used for generating attention-aware features and extracting key information, promoting the ability of CNN-based methods by adaptively rearranging weights. We designed a dual-branch channel attention mechanism, containing pixel and spatial attention mechanisms, to further enhance the performance of the proposed FDMLNet, and Figure 5 shows its structure in detail. This design can fully exploit the image features in different channels. Specifically, we send the input data into a spatial attention branch to extract both the background and texture of the image. Firstly, average pooling and max pooling operations are used to process the input data, and then we fuse them in an additive manner. Supposing that the size of the input data is H × W, the united feature map $z_c$ is defined as

$$z_c = H_{avgp}(u_c) + H_{maxp}(u_c),$$

where $H_{avgp}$ and $H_{maxp}$ are the average pooling and max pooling operations, respectively, and $u_c(i, j)$ is the pixel value at position $(i, j)$ in the input data. Then, the 7 × 7 Conv with an activation function (sigmoid) is used to calculate the spatial weight map $W_s$, i.e.,

$$W_s = sig\left( Conv_{7 \times 7}(z_c) \right),$$

where $Conv_{7 \times 7}$ is a convolution with a size of 7 × 7 and $sig$ is the sigmoid activation function. A channel shuffle is introduced to handle the communication of feature maps among different groups. Then, we extract the image's spatial feature $F_s$ by multiplying the input data with the weight map, namely $F_s = u_c \times W_s$.
In the pixel attention branch, the feature map $z_c$ that fuses the features generated by the average pooling and max pooling operations is added to the input data $u_c$ to avoid the influence of the spatial relationship, and the result is recorded as $v_c$. Then, three 1 × 1 Conv operations are applied to $v_c$, and the result of the top branch is processed by a transpose operation. To obtain the weighting matrix, the transposed result is multiplied by the result of the second branch and then processed by a softmax function. Subsequently, the result of the final branch is multiplied by this weighting matrix to calculate the pixel weighted map $W_p$. The pixel weighted map $W_p$ and the spatial weight map $W_s$ are integrated by a sum operation to obtain attention-aware feature maps. Furthermore, the input data are fused with the attention-aware feature maps to fully explore the inherent information, yielding the output feature $F$.
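The spatial attention branch described above can be sketched in a few lines of NumPy. This is an illustration under stated assumptions rather than the authors' code: pooling is taken along the channel axis (as in common spatial attention designs), the 7 × 7 kernel is supplied by the caller instead of being learned, zero padding keeps the spatial size, and the channel shuffle and the pixel attention branch are omitted. All function names are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def conv2d_same(x, kernel):
    """Single-channel 2-D cross-correlation with zero padding (output keeps H x W)."""
    kh, kw = kernel.shape
    padded = np.pad(x, ((kh // 2, kh // 2), (kw // 2, kw // 2)))
    windows = np.lib.stride_tricks.sliding_window_view(padded, (kh, kw))
    return np.einsum("ijkl,kl->ij", windows, kernel)

def spatial_attention(u, kernel):
    """u: features of shape (C, H, W); kernel: (7, 7) weights (assumed given)."""
    # Fuse average- and max-pooled maps additively (assumption: pooled over channels).
    z = u.mean(axis=0) + u.max(axis=0)
    # Spatial weight map W_s with values in (0, 1).
    w_s = sigmoid(conv2d_same(z, kernel))
    # Spatial feature F_s = u x W_s, broadcast over the channel axis.
    return u * w_s, w_s
```

With an all-zero kernel, every weight equals 0.5 and the branch simply halves its input, which is a convenient sanity check before plugging in trained weights.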

Loss Function
To guarantee that our method performs satisfactorily in LLIE, we carefully devised a hybrid loss function containing a structure similarity (SSIM) loss, $L_1$ loss, total variation (TV) loss, and color constancy (CC) loss to assess the discrepancy between the output and authentic images. These four loss functions are described in detail as follows:
$L_1$-norm loss: We first calculate the mean absolute error (i.e., $l_1$-norm) between the output result $I_{out}$ and the normal-light image $I_{nl}$ to measure their difference. It can be calculated as follows:

$$L_{l_1} = \frac{1}{N} \sum_{i=1}^{N} \left| I_{out}(i) - I_{nl}(i) \right|,$$

where $N$ is the number of pixels.
Structure similarity (SSIM) loss: The $L_1$-norm loss can make our model generate high-illumination images, but over- or underenhancement and other structural distortions are introduced in the enhanced images. To address these challenging issues, we injected the SSIM loss to examine the structure similarity. The formula of the SSIM loss is shown below:

$$L_{SSIM} = 1 - \frac{(2\mu_x \mu_y + c_1)(2\sigma_{xy} + c_2)}{(\mu_x^2 + \mu_y^2 + c_1)(\sigma_x^2 + \sigma_y^2 + c_2)},$$

where $\mu_x$ and $\mu_y$ are the mean values of the pixels in the output and input images, respectively. $\sigma_x^2$ and $\sigma_y^2$ stand for the pixels' variance of the output and input images, respectively, and $\sigma_{xy}$ is their covariance. $c_1$ and $c_2$ are constants, which were empirically set as 0.0001 and 0.0009.
Total variation (TV) loss: Although most data-driven approaches effectively light up low-illumination images, they inevitably generate observable noise. To suppress the image noise, the TV loss was applied to smooth the output image by minimizing its gradient in our method, and it can be written in the standard form

$$L_{TV} = \frac{1}{HW} \sum_{i=1}^{H} \sum_{j=1}^{W} \sqrt{(P_{i+1,j} - P_{i,j})^2 + (P_{i,j+1} - P_{i,j})^2},$$

where $H$ and $W$ are the image height and width, $P$ is a pixel value, and $i$ and $j$ are the pixel indexes in the enhanced image.
Color constancy (CC) loss: Generally speaking, low-light images suffer from a color deviation, which leads to an unsatisfactory visual appearance. This work introduced the CC loss function proposed in [62] to fully explore the relationship among the R, G, and B channels and correct the distorted color. The CC loss function can be defined as

$$L_{CC} = \sum_{(p,q) \in \{(R,G),(R,B),(G,B)\}} \left( J^p - J^q \right)^2,$$

where $J^p$ ($J^q$) is the mean value of the $p$ ($q$) channel in the output result, and $(p, q)$ stands for a pair of channels.
Total loss: We integrated the above-listed four loss functions to design a total loss function, named $L_{total}$, defined as:

$$L_{total} = L_{l_1} + L_{SSIM} + \omega_{TV} L_{TV} + \omega_{CC} L_{CC},$$

where $L_{l_1}$, $L_{SSIM}$, $L_{TV}$, and $L_{CC}$ are the $l_1$-norm, SSIM, TV, and CC losses, respectively. $\omega_{TV}$ and $\omega_{CC}$ are the weights, set as 0.8 and 0.4.
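The hybrid loss can be sketched in NumPy as follows. Several simplifications are assumptions rather than details taken from the paper: SSIM is computed from global image statistics instead of local windows, the TV term uses an isotropic finite-difference gradient averaged over interior pixels, the $L_1$ and SSIM terms carry unit weights, and images are float arrays of shape (H, W, 3) in [0, 1]. The function names are hypothetical.

```python
import numpy as np

def l1_loss(out, ref):
    """Mean absolute error between the output and the normal-light image."""
    return np.mean(np.abs(out - ref))

def ssim_loss(out, ref, c1=1e-4, c2=9e-4):
    """1 - SSIM from global statistics (simplification: a single window)."""
    mu_x, mu_y = out.mean(), ref.mean()
    var_x, var_y = out.var(), ref.var()
    cov_xy = np.mean((out - mu_x) * (ref - mu_y))
    ssim = ((2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))
    return 1.0 - ssim

def tv_loss(out):
    """Mean isotropic gradient magnitude (assumed form of the TV term)."""
    dh = out[1:, :-1] - out[:-1, :-1]     # vertical differences
    dw = out[:-1, 1:] - out[:-1, :-1]     # horizontal differences
    return np.sqrt(dh ** 2 + dw ** 2).mean()

def cc_loss(out):
    """Squared differences between the mean R, G, and B values."""
    r, g, b = out[..., 0].mean(), out[..., 1].mean(), out[..., 2].mean()
    return (r - g) ** 2 + (r - b) ** 2 + (g - b) ** 2

def total_loss(out, ref, w_tv=0.8, w_cc=0.4):
    """Weighted sum with the TV and CC weights stated in the text."""
    return (l1_loss(out, ref) + ssim_loss(out, ref)
            + w_tv * tv_loss(out) + w_cc * cc_loss(out))
```

For a flat gray output that equals its reference, every term vanishes, so `total_loss` returns zero; any brightness error, structural mismatch, noise, or color cast increases the corresponding term.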

Relation to Other Learning-Based Methods
Relation to Xu et al. [21]: The proposed method relies on the same mechanism (i.e., the image exhibits different features at various frequency layers) as [21]. However, there are three apparent differences between these two methods: (1) The way the frequency division was performed: Xu et al. [21] employed a learning-based way, the attention to context encoding (ACE) module, to adaptively decompose the high and low frequencies of the input image. In contrast, a guided filter, a traditional edge-preserving filter, was applied to obtain the image's high and low frequencies in our work. (2) The way the enhancement was performed: Xu et al. [21] suppressed the inherent noise and highlighted the details by the cross-domain transformation (CDT) model. In contrast, we designed two subnets, i.e., DetNet and StruNet, to enhance the image; the former processed the high-frequency components of the image to highlight its detail, while the latter processed its low-frequency components to generate visually pleasing structural images. (3) Furthermore, we injected spatial attention and pixel attention mechanisms into our FDMLNet to fully exploit the inherent information in the image. In addition, the multiscale structure was also embedded to promote the multiscale representation ability of the proposed model.
Relation to PRIEN [50]: PRIEN [50] employed a dual-attention mechanism to promote its performance in LLIE. In this paper, we created a dual-branch channel attention module integrating spatial and pixel relationships. Notably, a channel shuffle was introduced in the spatial attention branch to achieve communication among all channels, and the pixels' spatial relationship of the image was injected into the pixel attention branch. In addition, [50] only considered the SSIM loss function, which may magnify the inherent noise or distort the image color. In contrast, the SSIM loss, TV loss, L1 loss, and color loss functions were all brought into our model to remove the color deviation, preserve the details, and suppress the inherent noise.

Experimental Results and Analysis
In this part, we describe the experimental results and analysis in detail. Firstly, we briefly present the implementation details and experimental settings. Then, ablation studies, as well as qualitative and quantitative assessments on paired and unpaired datasets, are presented. Finally, an application test is analyzed.

Experimental Settings
In the following, we state the comparison approaches, public benchmarks, and assessment criteria in detail.
Public benchmarks: We performed verification experiments on two paired datasets (LOL and MIT-Adobe FiveK) and four unpaired datasets (LIME, MEF, NPE, and VV) to test the performance of our method in light enhancement. The LOL dataset was captured by changing the exposure time and ISO of a camera and contains 500 pairs of abnormal/normal light RGB images with a size of 400 × 600. The MIT-Adobe FiveK benchmark contains 5000 RAW images processed by five professional photographers. Adobe Lightroom was used to transform these images from the RAW to the RGB format to train the LLIE models. The LIME, MEF, NPE, and VV benchmarks contain 10, 17, 84, and 24 images, respectively.
Assessment criteria: We adopted four commonly used full-reference criteria, including the mean square error (MSE), peak signal-to-noise ratio (PSNR), structural similarity index measure (SSIM) [66], and learned perceptual image patch similarity (LPIPS) [67], to assess the LLIE comparison methods on the LOL and MIT-Adobe FiveK datasets. For these criteria, a lower MSE or LPIPS [67] value, as well as a higher PSNR or SSIM value, indicated a better visual perception. Furthermore, two nonreference criteria, i.e., the natural image quality evaluator (NIQE) [13] and patch-based contrast quality index (PCQI), were employed to assess the performance of these LLIE methods on the LIME, MEF, NPE, and VV public benchmarks, and a lower NIQE [13] or higher PCQI score suggested more satisfying enhanced images.
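As a small illustration of the two simplest full-reference criteria, the sketch below computes MSE and PSNR with NumPy using their standard definitions. It assumes 8-bit images with a peak value of 255; the function names are hypothetical, and this is not the evaluation script used in the experiments.

```python
import numpy as np

def mse(ref, test):
    """Mean square error between a reference image and an enhanced image."""
    ref = np.asarray(ref, dtype=np.float64)
    test = np.asarray(test, dtype=np.float64)
    return np.mean((ref - test) ** 2)

def psnr(ref, test, peak=255.0):
    """Peak signal-to-noise ratio in dB (assumed peak of 255 for 8-bit images)."""
    err = mse(ref, test)
    if err == 0:
        return float("inf")            # identical images
    return 10.0 * np.log10(peak ** 2 / err)
```

Because PSNR is a decreasing function of MSE, a lower MSE always corresponds to a higher PSNR, which matches the direction of the criteria stated above.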

Training Details
We implemented and trained our model on a platform with two 2080Ti GPUs, a Windows 10 operating system, 128 GB of RAM, and an Intel(R) Core(TM) i7-9700K CPU @ 3.60 GHz. The proposed network was coded in PyTorch and optimized by stochastic gradient descent (SGD). The batch size was 8, the learning rate was 0.0001, and the activation function was ReLU. We randomly selected 485 paired images from the LOL dataset for training our model. Finally, the MIT-Adobe, LOL test, LIME, MEF, NPE, and VV benchmarks were selected for the testing experiment.

Ablation Studies
Ablation studies on the frequency division, multiscale learning, dual-branch channel attention mechanism, and loss and activation functions were conducted to fully understand the FDMLNet. These ablation studies are detailed as follows:
Study of the frequency division: Figure 6 shows the visual enhancement results used to verify the effectiveness of the frequency division (FD) operation in our FDMLNet model. Among them, -w/o FD represents our model without the FD operation, and FD_mf and FD_gf stand for our model employing a mean filter (mf) and a guided filter (gf), respectively, to separate the image's high and low frequencies. From the results, we find that FD could avoid color casts, whereas FD_mf inevitably introduced observable noise. In contrast, FD_gf simultaneously suppressed the inherent noise and lit up the image.

Figure 6 panels: Input, -w/o FD, FD_mf, FD_gf.
Study of the multiscale learning structure: To examine the multiscale learning (MSL) structure of our method, MSL was removed (named -w/o MSL). That is to say, our model only extracted the image information at a single scale. Notice that -w/o MSL yielded unwanted light and color casts in the enhanced images, as shown in Figure 7. Additionally, from Table 1, we see that FDMLNet generated higher PSNR and SSIM scores on both the LOL and MIT-Adobe FiveK benchmarks. Thus, MSL clearly improved the ability of our model in LLIE.
Study of the dual-branch channel attention mechanism: -w/o DCAM indicates that the attention mechanism was not taken into account in our model. As depicted in Figure 7, -w/o DCAM failed to enhance local details and remove the color deviation as well as hand-crafted halos. In contrast, the output image generated by our method showed a high brightness, vivid colors, and clearer details. The PSNR and SSIM [66] of the different operations on the LOL and MIT-Adobe datasets are shown in Table 1; it can be seen that our method generated the highest scores of the two evaluation criteria on the selected public datasets.
Study of the loss function: We studied the roles of the mentioned loss functions in our design. Here, -w/o L1, -w/o TV, -w/o SSIM, and -w/o CC indicate that the L1 loss, TV loss, SSIM loss, and CC loss were removed from our loss function, respectively. Figure 8 shows the images improved by our model with different loss functions, and Table 2 shows the PSNR and SSIM [66] scores of two public benchmarks processed by our FDMLNet model with different operations. Compared with the other operations, only our full design exhibited the best performance in both quantitative and qualitative analyses for light enhancement.

Figure 7 panels: Input, -w/o MSL, -w/o DCAM, Ours.
Study of the activation function: To study the performance of the FDMLNet with different activation functions, Figure 9 shows the images processed by our method with LeakyReLU, Mish, and ReLU. LeakyReLU amplified the inherent noise in dark areas, and Mish was unsatisfactory at enhancing local dark areas. ReLU, however, suppressed the image noise and lit up the whole image simultaneously. Consistent with these visual results, the images enhanced by FDMLNet achieved the best PSNR and SSIM [66] values on both the LOL and MIT-Adobe FiveK datasets, as shown in Table 2.
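For reference, the three activation functions compared above can be written in a few lines of NumPy. The LeakyReLU negative slope of 0.01 is an assumed default, since the value used in the ablation is not stated here. The functions differ mainly in how they treat negative pre-activations: ReLU sets them to zero, whereas LeakyReLU and Mish let a small negative signal through.

```python
import numpy as np


def relu(x):
    """ReLU: max(x, 0)."""
    return np.maximum(x, 0.0)


def leaky_relu(x, slope=0.01):
    """LeakyReLU: x for x > 0, slope * x otherwise (slope is an assumed value)."""
    return np.where(x > 0, x, slope * x)


def mish(x):
    """Mish: x * tanh(softplus(x)), with softplus(x) = log(1 + exp(x))."""
    return x * np.tanh(np.logaddexp(0.0, x))


x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print("relu:      ", relu(x))
print("leaky_relu:", leaky_relu(x))
print("mish:      ", mish(x))
```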

Comprehensive Assessment on Paired Datasets
Qualitative evaluation: We first applied the FDMLNet and the comparison LLIE methods to the MIT-Adobe FiveK and LOL paired benchmarks to validate their effectiveness in terms of light enhancement. Figure 10 shows the results of each comparison LLIE method on an image randomly selected from the MIT-Adobe paired benchmark. The following observations can be made. Overall, the LLIE methods succeeded in lighting up low-illumination images, indicating that image enhancement is an effective way to tackle the issues of these images. However, SRIE [19], BIMEF [20], and LR3M [18] could not generate images with a satisfactory visual appearance. RetinexNet [54] improved the illumination of images but yielded an unnatural visual appearance. KinD [28] failed to recover the inherent details and introduced unsatisfactory color casts in local dark regions of the image. SCL-LLE [52] generated images with an unnatural visual appearance (see picture g in Figure 10). MIRNet [52] succeeded in improving the image brightness, but the enhanced images exhibited color deviation and low contrast. DSLR-enhanced images had a blocking effect, and DRBN-enhanced pictures suffered from color distortion (see the sky regions of images h and j in Figure 10). EnlightenGAN [29] failed to remove artifact halos and blocking effects. We also found that DLN [14] was unsatisfactory in removing the whitish tone and correcting color distortion. Although Zero DCE++ [1] could successfully light up the image, it introduced an unnatural appearance and blurry details. Compared with the twelve state-of-the-art LLIE methods, only our method showed an impressive performance in rebuilding artifact-free images with a visually pleasing appearance, clearer details, and vivid colors.
[19], (c) BIMEF [20], (d) LR3M [18], (e) RetinexNet [54], (f) KinD [28], (g) SCL-LLE [52], (h) DSLR [49], (i) EnlightenGAN [29], (j) DRBN [59], (k) Zero DCE++ [1], (l) DLN [14], (m) MIRNet [52], and (n) Ours.
Quantitative evaluation: In addition to the visual comparison above, a quantitative evaluation was performed on the LOL and MIT-Adobe public benchmarks to further validate our model comprehensively. The average MSE, SSIM [66], PSNR, and LPIPS [67] scores produced by the aforementioned LLIE models on these two public datasets are shown in Table 3. For the four reference criteria, we can readily see that SRIE [19], BIMEF [20], and LR3M [18] were inferior to some data-driven approaches, which empirically indicates that the latter perform impressively in LLIE owing to their strong ability for feature representation and extraction. Among all the aforementioned methods, our FDMLNet generated competitive MSE, SSIM [66], PSNR, and LPIPS [67] scores on these two datasets. This means that our proposed method performed well in lighting up the brightness, preserving inherent details, and suppressing the noise of low-light images in terms of both quantitative and qualitative evaluations.

Table 3. Quantitative analysis of different state-of-the-art LLIE methods on public paired benchmarks (column groups: LOL, MIT-Adobe). Red/green text means the best/second-best performance. ↓ and ↑ indicate that a smaller or a larger value, respectively, means better performance.
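Two of the full-reference criteria above, MSE and PSNR, have simple closed forms, with PSNR = 10 log10(MAX^2 / MSE). The sketch below assumes images scaled to [0, 1], so MAX = 1; the function names are illustrative. SSIM and LPIPS are more involved (LPIPS additionally requires a pretrained network) and are normally computed with existing library implementations, so they are omitted here.

```python
import numpy as np


def mse(output, reference):
    """Mean squared error between an enhanced image and its reference."""
    output = np.asarray(output, dtype=np.float64)
    reference = np.asarray(reference, dtype=np.float64)
    return float(np.mean((output - reference) ** 2))


def psnr(output, reference, max_value=1.0):
    """Peak signal-to-noise ratio in dB; max_value=1.0 assumes images in [0, 1]."""
    err = mse(output, reference)
    if err == 0.0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_value ** 2 / err))


# Toy example: the output differs from the reference by 0.1 at every pixel.
reference = np.full((4, 4), 0.5)
output = reference + 0.1
print("MSE: ", mse(output, reference))
print("PSNR:", psnr(output, reference), "dB")
```

Lower MSE and higher PSNR both indicate that the enhanced image is closer to the reference, which matches the ↓ and ↑ markers used in Table 3.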

Comprehensive Assessment on Unpaired Datasets
Qualitative evaluation: To comprehensively examine the light enhancement capability of the state-of-the-art comparison methods and our FDMLNet, four unpaired benchmarks (i.e., LIME, MEF, NPE, and VV) were also used for validation experiments. Randomly selected results generated by these approaches on the LIME, MEF, NPE, and VV benchmarks are shown in Figure 12, Figure 13, Figure 14, and Figure 15, respectively. From these enhanced images, the following observations can be made. BIMEF [20], a fusion-strategy-based method, tried to produce well-lit images by fusing multiexposure images. Notably, this method failed to light up the dark regions of some pictures and introduced observable over- or underenhancement. Both LR3M [18] and SRIE [19] notably improved the image brightness and contrast, but LR3M-enhanced images suffered from unsatisfactory structural details, and SRIE [19] excessively enhanced some images, causing local overexposure. RetinexNet [54] introduced unsatisfactory artifact holes, and DSLR [49] generated an unnatural visual appearance, blocking effects, and color casts. Zero DCE++ [1] and DLN [14] effectively enhanced low-illumination images with blurry details and low contrast, but both introduced an additional whitish tone in the enhanced images. Additionally, the former generated unwanted hand-crafted holes and blurry edges in some enhanced images, and the latter was not satisfactory when tackling color distortion. SCL-LLE [52] generated visually unnatural images, and MIRNet [65] failed to address the local darkness of the enhanced images. Although EnlightenGAN [29] and DRBN [59] were satisfactory for lighting up the brightness of low-light images, they inevitably left some local underenhancement or darkness and unsatisfactory edges.
On the contrary, our proposed method performed satisfactorily in lighting up the illumination, preserving edges and structural details, and avoiding color distortion and over- or underenhancement on the LIME, MEF, NPE, and VV unpaired benchmarks. In short, our method outperformed all the aforementioned comparison approaches in lighting up low-light images.

[19], (c) BIMEF [20], (d) LR3M [18], (e) RetinexNet [54], (f) KinD [28], (g) SCL-LLE [52], (h) DSLR [49], (i) EnlightenGAN [29], (j) DRBN [59], (k) Zero DCE++ [1], (l) DLN [14], (m) MIRNet [52], and (n) Ours.

Comprehensive Analysis of Computational Complexity
Table 5 lists the computational complexity of all the above methods and their average execution time on the LOL benchmark. Zero DCE++ [1] had the fewest parameters and FLOPs and the fastest speed, because it estimates the parameters of a high-order curve with a lightweight network. Apart from Zero DCE++ [1], DRBN [59], and RetinexNet [54], our FDMLNet had fewer parameters and a faster speed than the remaining comparison approaches. Nevertheless, although it is not the most lightweight model, all the validation experiments showed that our FDMLNet outperformed all comparison methods in LLIE.
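As a rough guide to how figures of this kind are obtained, the sketch below counts the parameters and multiply-accumulate operations (MACs) of a standard 2D convolution and averages wall-clock time over a set of inputs. The layer sizes and the `average_runtime` helper are hypothetical and do not describe FDMLNet. Note that conventions differ: some tools report FLOPs as twice the MAC count.

```python
import time


def conv2d_params(c_in, c_out, k, bias=True):
    """Learnable parameters of a k x k convolution."""
    return k * k * c_in * c_out + (c_out if bias else 0)


def conv2d_macs(c_in, c_out, k, h_out, w_out):
    """Multiply-accumulate operations for one forward pass of the layer."""
    return k * k * c_in * c_out * h_out * w_out


def average_runtime(fn, inputs, warmup=1):
    """Average seconds per call of fn over inputs, after a short warm-up."""
    for x in inputs[:warmup]:
        fn(x)
    start = time.perf_counter()
    for x in inputs:
        fn(x)
    return (time.perf_counter() - start) / len(inputs)


# Hypothetical three-layer network on a 400 x 600 image (stride 1, same padding).
layers = [(3, 32, 3), (32, 32, 3), (32, 3, 3)]
total_params = sum(conv2d_params(ci, co, k) for ci, co, k in layers)
total_macs = sum(conv2d_macs(ci, co, k, 400, 600) for ci, co, k in layers)
print("parameters:", total_params)
print("GMACs:     ", total_macs / 1e9)
print("avg time:  ", average_runtime(lambda n: sum(range(n)), [10_000] * 5), "s")
```

When timing a network on a GPU, the device must be synchronized before reading the clock, otherwise asynchronous kernel launches make the measured time too short.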

Comprehensive Assessment on Real Images
To demonstrate the applicability of our method to real-world images, we applied our FDMLNet to real low-light images captured with Mate 20 Pro and Vivo X60 phones. The results yielded by our FDMLNet are depicted in Figure 16. The enhanced images consistently exhibited a visually pleasing appearance, vivid colors, and more apparent details. Therefore, our proposed FDMLNet model can be applied to improve the quality of images taken with a common phone camera, such as that of a Mate 20 Pro or Vivo X60. Additionally, to test our method further, we processed compressed low-light images created by setting the compression ratio to 0.2, 0.5, 0.8, and 1. The enhanced images and the NIQE scores (original/enhanced images) are shown in Figure 17. The proposed FDMLNet generated more satisfactory images and had lower NIQE scores under a variety of compression ratios. Unfortunately, our proposed method failed to remove the hand-crafted halos, especially at a compression ratio of 0.2 (see picture a in Figure 17).

Discussion and Limitation
Low-illumination images not only exhibit an unsatisfactory visual appearance but also deliver compromised information to other high-level computer vision applications. Hence, improving their quality is both urgent and practical. Our FDMLNet required relatively few parameters, ran quickly, and performed well in generating visually pleasing images in most cases, but it still showed some limitations in certain scenes. For example, Figure 18 shows visual comparisons of the FDMLNet tested on different low-light images; our method failed to restore the quality of images with excessive noise, colored light, or local overexposure. The most probable reason is that our DetNet has no denoising operation and directly processes the high frequencies of the image, which contain the inherent noise. Moreover, some special scenes, such as images with colored light, were not included when training our model. In the future, we will tackle these challenging issues by fusing semantic information and building a more diverse dataset to train the model.

Conclusions
We presented a novel and highly efficient method for tackling the challenging issues of low-illumination photos. The proposed FDMLNet first employs a guided filter to separate the high and low frequencies of the image. The DetNet and StruNet are then used separately to process them for enhancing low-light images. In StruNet, a multiscale learning block with a dual-branch channel attention strategy was injected to fully exploit the information at different scales. Then, the FFEM was composed of three MSAMs connected in an improved skip-connection manner to utilize the hierarchical and inherent features. Furthermore, the FFEMs were connected by means of a dense connection to guarantee that the multilevel information was completely assimilated. Extensive experimental validation results on several public paired/unpaired benchmarks showed that our FDMLNet was superior to state-of-the-art approaches in terms of LLIE. However, our method did not effectively recover the color and brightness of images with excessive noise or colored light; we will tackle these remaining problems in the future.

Conflicts of Interest:
The authors declare no conflict of interest.