Adaptive Dual Aggregation Network with Normalizing Flows for Low-Light Image Enhancement

Low-light image enhancement (LLIE) aims to improve the visual quality of images taken under complex low-light conditions. Recent works focus on carefully designing Retinex-based methods or end-to-end networks based on deep learning for LLIE. However, these works usually utilize pixel-level error functions to optimize models and have difficulty effectively modeling the real visual errors between the enhanced images and the normally exposed images. In this paper, we propose an adaptive dual aggregation network with normalizing flows (ADANF) for LLIE. First, an adaptive dual aggregation encoder is built to fully explore the global properties and local details of the low-light images for extracting illumination-robust features. Next, a reversible normalizing flow decoder is utilized to model real visual errors between enhanced and normally exposed images by mapping images into underlying data distributions. Finally, to further improve the quality of the enhanced images, a gated multi-scale information transmitting module is leveraged to introduce the multi-scale information from the adaptive dual aggregation encoder into the normalizing flow decoder. Extensive experiments on paired and unpaired datasets have verified the effectiveness of the proposed ADANF.


Introduction
Insufficient light in complex imaging environments can lead to low brightness, low contrast, high noise, and poor details in captured images [1,2]. Low-light image enhancement (LLIE) aims to solve the problems of insufficient visibility and low contrast in low-light images while suppressing noise and restoring structures and colors [3]. LLIE can effectively improve the performance of downstream tasks such as object detection and scene understanding at night or in low-light conditions [4].
Over the past decades, many low-light image enhancement methods have been proposed [5,6]. Previous methods are usually based on hand-designed features and processing steps such as histogram equalization [7,8] and gamma transformation [9]. These methods are simple and fast, but they usually amplify noise while enhancing the image and often cannot restore the color and details of low-light images well [10]. The widely popular Retinex theory [11] provides an intuitive and easy-to-understand framework for LLIE by decomposing the image into reflection and illumination components [12,13]. However, for complex illumination properties in practice, it is challenging to design priors and regularizations that are always valid for accurate decomposition of the reflection and illumination components [14,15]. Improper decomposition can lead to unrealistic details, undesirable artifacts, and color distortion in enhanced images [16].
Inspired by the successful application of deep learning in object recognition, detection, etc. [17,18], researchers have focused on building various deep learning frameworks suitable for the LLIE task [6,19,20]. In addition, the development of paired datasets [21,22] has been a critical step in enabling the application of deep learning to the LLIE task. Recent LLIE methods based on deep learning can be roughly divided into deep-Retinex-based methods and end-to-end methods [5,23,24].
Deep-Retinex-based methods are also based on the human visual system's Retinex theory and use neural networks to simulate the process of separating the reflectance and illumination components [25]. These methods aim to combine the advantages of both Retinex theory and deep learning, enabling an interpretable low-light image enhancement paradigm [26,27]. Under low-light conditions, the boundary between the reflectance component and the illumination component can become blurred, making it more difficult to accurately separate them. Even when deep learning models are used for Retinex decomposition, the reflectance and illumination components still cannot be separated accurately, which may lead to noise amplification and image stylization in the enhanced results [28]. To solve these problems, researchers are constantly exploring ways to improve deep learning models, such as using more complex network structures or introducing regularization techniques [10,29].
End-to-end methods typically use deep neural networks to directly learn the nonlinear relationships between low-light images and their corresponding normally exposed images [30,31]. By removing the need for explicit separation of the reflectance and illumination components, end-to-end methods focus on designing a variety of novel neural network structures for LLIE [32]. End-to-end methods have the advantage of being less dependent on physical models and can directly learn the desired mapping between low-light and normal-light images [33]. However, they may not be as interpretable as deep-Retinex-based methods due to their black-box nature.
Recent methods based on deep learning have made good progress in LLIE. However, these methods generally use pixel-level error functions, such as the L1 or L2 norm, as the objective function for training deep networks [5,10]. Pixel-level error functions cannot measure the real visual errors between enhanced images and normally exposed images, such as errors in complex structures and textures [34,35]. Moreover, they have difficulty providing effective regularization for local structures in various complex backgrounds.
To alleviate the above problem, we propose an adaptive dual aggregation network with normalizing flows (ADANF) for low-light image enhancement. Different from previous methods that use pixel-level error functions to measure the difference between enhanced and normally exposed images in the image domain, we adopt a normalizing flow framework to map enhanced and normally exposed images to the underlying data distribution, which can effectively express the structural details of complex images [36]. In addition, we use the errors between the data distributions for enhanced and normally exposed images as the objective function to effectively measure the visual distance.
In the proposed ADANF, an adaptive dual aggregation encoder is first exploited to extract illumination-robust features by fully exploring the global properties and local details of the low-light images. Next, a reversible normalizing flow decoder is leveraged to recover normally exposed images from the illumination-robust features. Here, we exploit the inverse process capabilities of the normalizing flow decoder to reconstruct brighter, more detailed images. Finally, to further improve the quality of image enhancement, a gated multi-scale information transmitting module is designed to introduce the multi-scale features from the adaptive dual aggregation encoder into the normalizing flow decoder. Extensive experiments on paired and unpaired datasets verify the effectiveness of the proposed ADANF.
The contributions of this paper mainly include:
• An adaptive dual aggregation encoder is leveraged to fully capture the global properties and local details of low-light images for extracting illumination-robust features.
• To measure real visual errors between enhanced and normally exposed images, a reversible normalizing flow decoder is used to map enhanced and normally exposed images to latent distributions, and the difference between the distributions is used as the objective function for training.
• A gated multi-scale information transmitting module is designed to introduce the multi-scale features from the adaptive dual aggregation encoder into the normalizing flow decoder to further improve the quality of enhanced images.
The rest of the manuscript is organized as follows. Recent related works are introduced in Section 2. Section 3 gives the details of the proposed ADANF. Section 4 reports experimental results. Finally, the conclusion is provided in Section 5.

Traditional Methods
Previous methods usually study hand-designed features for LLIE. Histogram equalization is one of the most classic low-light image enhancement methods [37]. Reza [38] designed a block-based histogram equalization method to model lighting changes in local areas. Lee et al. [39] calculated the 2D histogram by considering the relationship between neighboring pixels within local regions. They utilized the layered difference approach for enhancing contrast. In addition, some researchers attempted to combine image quality assessment with histogram equalization to improve performance. Gu et al. [40] used subjective and objective evaluation guidance to improve the histogram to correct image brightness and contrast to the level of normal exposure.
Retinex theory is also very popular in low-light image enhancement, and researchers have carefully designed many decomposition methods based on the Retinex theory. Kimmel et al. [41] proposed to introduce the lighting component gradient into a variational framework for LLIE. Ren et al. [12] designed a low-rank prior regularized Retinex decomposition model to alleviate the noise amplification problem. Gu et al. [13] proposed a fractional-order variational structure that regularizes both the reflectance and illumination components. Liang et al. [42] combined nonlinear diffusion techniques and Retinex decomposition to estimate lighting components to improve estimation results. These methods are sensitive to illumination changes. In low-light environments, illumination changes may lead to inaccurate feature extraction and affect the enhancement effect.

Deep-Learning-Based Methods
Recent methods mainly study the design of deep learning frameworks for LLIE, including deep-Retinex-based methods and end-to-end methods [5,10]. Deep-Retinex-based methods combine the advantages of Retinex theory and deep learning to provide an interpretable solution for low-light image enhancement. Wei et al. [21] proposed a Retinex-Net including Decom-Net and Enhance-Net. Decom-Net is responsible for decomposing the input low-light image into reflection and illumination parts, while Enhance-Net is responsible for enhancing the illumination part to obtain normally exposed images. Zhang et al. [43] proposed a KinD network to utilize images under different exposure conditions for training. Fan et al. [29] introduced a semantic segmentation sub-network into the Retinex model to use semantic priors to guide image enhancement. Liu et al. [27] employed unrolling and adjustment to exploit global and local brightness of images for LLIE.
End-to-end methods focus on carefully designing different networks to learn the mapping between low-light and normally exposed images. Lore et al. [20] designed the first deep network for LLIE, LLNet, which is a sparse denoising autoencoder structure. Yang et al. [33] proposed to exploit a transformer-based network to extract the global information of low-light images. Ren et al. [44] utilized an encoder-decoder network to extract global content and a recurrent neural network to preserve edge details. Xu et al. [45] proposed a frequency-based model that uses low-frequency layers to restore content and high-frequency layers to restore image details. Xu et al. [31] considered that the amount of information differs across image areas and designed a signal-to-noise-ratio-aware transformer for LLIE. However, recent deep-learning-based methods usually employ the pixel-level L1 or L2 norm as the objective function to optimize deep networks, which cannot effectively measure the real visual errors between the enhanced image and the normally exposed image.

Methods
LLIE aims at generating the normally exposed image $X_n \in \mathbb{R}^{H \times W \times 3}$ from a low-light image $X_l \in \mathbb{R}^{H \times W \times 3}$, where $W$ and $H$ represent the width and height, respectively. Previous methods focus on studying different networks, directly utilizing MSE [20], L1 [46], or color loss [47] as objective functions to perform supervised training on paired training samples $\{X_l, X_{gt}\}$, where $X_{gt} \in \mathbb{R}^{H \times W \times 3}$ is the ground truth normally exposed image. However, there are two problems with previous methods. First, it is difficult for these methods to adaptively utilize the global and local information of the image $X_l$ to improve the visual effect and suppress noise. Second, the loss functions of these methods focus on pixel-level or local errors, and it is difficult to fully utilize visual properties to measure the real visual errors between the generated image $X_n$ and the ground truth $X_{gt}$ [35].
To alleviate the above two problems, an adaptive dual aggregation network with normalizing flows (ADANF) is proposed for LLIE. The overall structure of ADANF is shown in Figure 1. First, an adaptive dual aggregation encoder is employed to fully exploit the global properties and local details of the image $X_l$ to extract illumination-robust features. Then, an invertible normalizing flow decoder is used to recover the normally exposed image $X_n$ from the illumination-robust features. Finally, a gated multi-scale information transmitting module is designed to introduce the multi-scale features of the adaptive dual aggregation encoder into the normalizing flow decoder to further improve the quality of image enhancement.

Preprocessing
Low-light images often have local or global dark areas, resulting in poor contrast and unclear detail. In addition, insufficient light may also cause problems such as noise and artifacts. If the original low-light images are input directly into the model, the model may have difficulty distinguishing low-contrast areas and noisy areas. By performing histogram equalization on $X_l$, we redistribute its pixel intensities so that they occupy the entire possible intensity range. The histogram-equalized image $h(X_l) \in \mathbb{R}^{H \times W \times 3}$ has higher contrast, and the model can more easily identify and perceive different areas in the image. In addition, we use a color map $c(X_l) \in \mathbb{R}^{H \times W \times 3}$ to enhance the contrast and visibility of the low-light image $X_l$ and highlight details in dark areas, where $c(X_l) = X_l / \mathrm{mean}_p(X_l)$ and $\mathrm{mean}_p(X_l)$ represents the mean value of each pixel in $X_l$. Finally, we use the gradient map $g(X_l) \in \mathbb{R}^{H \times W \times 3}$ to explicitly capture the noisy areas in the low-light image $X_l$, where $g(X_l) = \max(|\nabla_x c(X_l)|, |\nabla_y c(X_l)|)$, and $\nabla_x$ and $\nabla_y$ are the gradients in the x and y directions, respectively. To improve the model's sensitivity to noisy areas in low-contrast and dark regions, $h(X_l)$, $c(X_l)$, $g(X_l)$, and $X_l$ are stacked by channel as the input $X_{in} = [h(X_l), c(X_l), g(X_l), X_l]$ of the subsequent network.
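The preprocessing described above can be sketched in NumPy as follows. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the function names `hist_equalize` and `build_input`, the per-channel equalization, the reading of $\mathrm{mean}_p$ as the per-pixel mean over color channels, the forward-difference gradients, and the small constant `eps` are all assumptions introduced here.

```python
import numpy as np

def hist_equalize(channel):
    """Histogram-equalize one uint8 channel through its cumulative histogram."""
    hist = np.bincount(channel.ravel(), minlength=256)
    cdf = hist.cumsum()
    cdf_min = cdf[cdf > 0][0]
    denom = max(int(cdf[-1] - cdf_min), 1)  # guard against constant images
    lut = np.clip(np.round((cdf - cdf_min) / denom * 255.0), 0, 255)
    return lut.astype(np.uint8)[channel]

def build_input(x_l, eps=1e-4):
    """x_l: uint8 image of shape (H, W, 3). Returns X_in of shape (H, W, 12)."""
    x = x_l.astype(np.float32) / 255.0
    # h(X_l): histogram equalization, applied per channel (an assumption).
    h = np.stack([hist_equalize(x_l[..., c]) for c in range(3)], axis=-1)
    h = h.astype(np.float32) / 255.0
    # c(X_l) = X_l / mean_p(X_l), read here as the per-pixel mean over channels.
    c = x / (x.mean(axis=-1, keepdims=True) + eps)
    # g(X_l) = max(|grad_x c|, |grad_y c|), using forward differences.
    gx = np.abs(np.diff(c, axis=1, append=c[:, -1:, :]))
    gy = np.abs(np.diff(c, axis=0, append=c[-1:, :, :]))
    g = np.maximum(gx, gy)
    # Stack by channel: X_in = [h(X_l), c(X_l), g(X_l), X_l].
    return np.concatenate([h, c, g, x], axis=-1)
```

With this layout, a 3-channel low-light image yields a 12-channel network input.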

Global-Local Adaptive Aggregation Module
In the adaptive dual aggregation encoder, two 3 × 3 convolutions are first used to transform the image $X_{in}$ into the feature space to obtain the shallow feature $F_s \in \mathbb{R}^{H \times W \times C_s}$, where $C_s$ is the channel number. Then, global-local adaptive aggregation blocks are used to extract the illumination-robust feature $F_i \in \mathbb{R}^{H \times W \times C_i}$. The global-local adaptive aggregation block is the key module of the adaptive dual aggregation encoder, and we take one global-local adaptive aggregation block as an example to introduce its details.
First, spatial-window self-attention [48,49] is utilized to explore the global information of the image. We generate query features $Q \in \mathbb{R}^{H \times W \times C_i}$, key features $K \in \mathbb{R}^{H \times W \times C_i}$, and value features $V \in \mathbb{R}^{H \times W \times C_i}$ from the shallow feature $F_s$ by using convolutions:
$$Q = F_s W_Q, \quad K = F_s W_K, \quad V = F_s W_V \quad (1)$$
where $W_Q, W_K, W_V \in \mathbb{R}^{1 \times 1 \times C_i}$ are the weights of 1 × 1 convolutions, and biases are omitted. Since performing self-attention directly on the global image would introduce a huge amount of calculation, we follow SwinTransformer [50] and perform spatial-window self-attention to reduce the amount of calculation. $Q$, $K$, and $V$ are divided into non-overlapping spatial windows $Q^j_{sw}$, $K^j_{sw}$, and $V^j_{sw} \in \mathbb{R}^{H_{sw} \times W_{sw} \times C_i}$, respectively, where $H_{sw} \times W_{sw}$ is the size of the spatial window. The features of each spatial window are calculated using Equation (2), in which $P_j$ is the relative position encoding of the j-th spatial window. The outputs of the spatial-window self-attention over all $n$ windows form the global feature $F_g \in \mathbb{R}^{H \times W \times C_i}$, where $n = (H/H_{sw})^2$ is the number of spatial windows. In addition, shift window operations [50] are utilized to extract the global spatial feature of the image.
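To make the window partitioning concrete, the following NumPy sketch computes self-attention inside non-overlapping windows and merges the results back into a feature map. It is a hedged illustration rather than the authors' code: the helper names are hypothetical, the $1/\sqrt{C_i}$ scaling of the logits follows the usual scaled dot-product convention and is an assumption here, the relative position term is passed in as a plain additive bias, and the shifted-window step is omitted.

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def window_partition(x, hs, ws):
    """(H, W, C) -> (n_windows, hs*ws, C), non-overlapping windows."""
    H, W, C = x.shape
    x = x.reshape(H // hs, hs, W // ws, ws, C).transpose(0, 2, 1, 3, 4)
    return x.reshape(-1, hs * ws, C)

def window_merge(w, H, W, hs, ws):
    """Inverse of window_partition: (n_windows, hs*ws, C) -> (H, W, C)."""
    C = w.shape[-1]
    x = w.reshape(H // hs, W // ws, hs, ws, C).transpose(0, 2, 1, 3, 4)
    return x.reshape(H, W, C)

def window_attention(f_s, w_q, w_k, w_v, hs, ws, pos_bias=None):
    """f_s: (H, W, C_s); w_q, w_k, w_v: (C_s, C_i) weights of 1x1 convolutions."""
    H, W, _ = f_s.shape
    # A 1x1 convolution without bias is a per-pixel matrix product over channels.
    q, k, v = f_s @ w_q, f_s @ w_k, f_s @ w_v
    qw, kw, vw = (window_partition(t, hs, ws) for t in (q, k, v))
    # Scaled dot-product logits per window (the scaling is an assumption).
    logits = qw @ kw.transpose(0, 2, 1) / np.sqrt(qw.shape[-1])
    if pos_bias is not None:
        logits = logits + pos_bias  # stands in for the relative position term P_j
    out = softmax(logits, axis=-1) @ vw
    return window_merge(out, H, W, hs, ws)  # plays the role of F_g
```

Because attention is computed per window, the cost grows with the window size rather than with the full image size, which is the motivation given above for following SwinTransformer.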
Second, to capture details and textures in low-light images, a local branch uses depthwise convolution (DWC) operations to extract local features $F_l = \mathrm{DWC}(V) \in \mathbb{R}^{H \times W \times C_i}$ from the value features $V$ in Equation (1).
Third, to fully utilize the global and local information of the image $X_l$ to generate illumination-robust features, an adaptive interaction aggregation (AIA) module is designed. Since $F_g$ carries the global information of the image and $F_l$ carries its local features, $F_g$ and $F_l$ are misaligned features. In this case, simple weighted combination or concatenation cannot fully integrate global and local information. In the AIA module, we first use the local features $F_l$ to refine the texture detail information of the global features $F_g$ by exploiting the attention mechanism. The spatial attention map $S(F_l) \in \mathbb{R}^{H \times W \times 1}$ of the local features is calculated as
$$S(F_l) = \varphi\big(W_{sa2}\, \sigma(W_{sa1} F_l)\big)$$
where $\varphi$ is the sigmoid activation, $\sigma$ is the ReLU activation, $W_{sa1}, W_{sa2} \in \mathbb{R}^{1 \times 1 \times C_i}$ are the weights of 1 × 1 convolutions, $W_{sa1}$ contains $C_i$ kernels, and $W_{sa2}$ contains one kernel. Then, we obtain the refined global feature $\tilde{F}_g = F_g \odot S(F_l)$, where $\odot$ is the Hadamard product and $\tilde{F}_g \in \mathbb{R}^{H \times W \times C_i}$. Next, the AIA module utilizes the rich channel information of the global features $F_g$ to suppress redundant channels of the local features $F_l$. A channel attention map $C(F_g)$ is computed from $F_g$, and the refined local feature is $\tilde{F}_l = F_l \odot C(F_g)$, $\tilde{F}_l \in \mathbb{R}^{H \times W \times C_i}$. Finally, the refined global and local features $\tilde{F}_g$ and $\tilde{F}_l$ are aggregated by element-wise addition as the output. Multiple global-local adaptive aggregation blocks are repeated to generate the illumination-robust feature $F_i$.
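The cross-refinement in the AIA module can be illustrated with a short NumPy sketch. The spatial branch follows the description above (two 1 × 1 convolutions with ReLU and sigmoid). The exact form of the channel attention is not given in the text, so the squeeze-and-excitation style used below (global average pooling followed by two linear maps) is purely an assumption, as are the function name `aia` and the weight names `w_ca1` and `w_ca2`.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def aia(f_g, f_l, w_sa1, w_sa2, w_ca1, w_ca2):
    """f_g, f_l: (H, W, C). w_sa1: (C, C), w_sa2: (C, 1), w_ca1, w_ca2: (C, C)."""
    # Spatial attention from local features: sigmoid(conv1x1(ReLU(conv1x1(F_l)))).
    s = sigmoid(np.maximum(f_l @ w_sa1, 0.0) @ w_sa2)      # (H, W, 1)
    f_g_refined = f_g * s                                   # F_g (Hadamard) S(F_l)
    # Channel attention from global features (ASSUMED squeeze-excitation form).
    pooled = f_g.mean(axis=(0, 1))                          # (C,)
    c = sigmoid(np.maximum(pooled @ w_ca1, 0.0) @ w_ca2)    # (C,)
    f_l_refined = f_l * c                                   # F_l (Hadamard) C(F_g)
    # Aggregate the two refined features by element-wise addition.
    return f_g_refined + f_l_refined
```

The point of the sketch is the direction of the interaction: the local branch gates the global feature spatially, and the global branch gates the local feature per channel.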

Normalizing Flow Decoder
During real imaging, changes in lighting conditions (e.g., different time, weather, or light sources) can cause even the same scene to look completely different in low-light images. That is, a normally exposed image will correspond to many different low-light images. A good LLIE method should be able to adapt to changes in lighting conditions. In this paper, we propose to exploit a normalizing flow decoder to recover normally exposed images from the illumination-robust feature $F_i$.
In the proposed ADANF, the normalizing flow decoder is an invertible network, whose purpose is to learn a one-to-many mapping relationship for LLIE. In the training phase, the normalizing flow decoder aims to learn the mapping of normally exposed images to the feature $F_i$ of low-light images [51,52]. The normalizing flow network can adapt to various characteristics of the same scene under different lighting conditions. During the testing phase, the inverse of the learned mapping can be exploited to generate normally exposed images from low-light image features $F_i$.
The normalizing flow decoder has three levels, with a squeeze layer and 12 flow steps at each level. A squeeze layer reduces the spatial resolution of the input data, which helps to reduce the computational complexity of the network. The flow steps are the main part of the invertible network, where the invertible mapping from normally exposed images to the feature of low-light images is learned.
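A squeeze layer of this kind is commonly realized as a space-to-depth rearrangement, which trades spatial resolution for channels without losing information and is therefore trivially invertible. The NumPy sketch below assumes a factor of 2 per level; the factor and the function names are assumptions, since the text does not specify them.

```python
import numpy as np

def squeeze(x, factor=2):
    """(H, W, C) -> (H/factor, W/factor, C*factor^2) by space-to-depth."""
    H, W, C = x.shape
    x = x.reshape(H // factor, factor, W // factor, factor, C)
    x = x.transpose(0, 2, 1, 3, 4)
    return x.reshape(H // factor, W // factor, factor * factor * C)

def unsqueeze(y, factor=2):
    """Exact inverse of squeeze."""
    h, w, c = y.shape
    C = c // (factor * factor)
    y = y.reshape(h, w, factor, factor, C).transpose(0, 2, 1, 3, 4)
    return y.reshape(h * factor, w * factor, C)
```

Under this assumption, three levels applied to a 160 × 160 input reduce the spatial size to 80, 40, and then 20 pixels per side.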
As shown in Figure 1, a flow step is composed of an activation normalization (ActNorm) layer, an invertible 1 × 1 convolution, and an affine coupling component. The ActNorm layer is similar to batch normalization. It uses the per-channel bias $\mu \in \mathbb{R}^{1 \times 1 \times C_i}$ and scale $\sigma \in \mathbb{R}^{1 \times 1 \times C_i}$ parameters to perform the transformation $Y_i = (F_i - \mu)/\sigma \in \mathbb{R}^{H \times W \times C_i}$ as preprocessing, whose purpose is to make the input data $F_i$ have zero mean and unit variance. The parameters $\mu$ and $\sigma$ of the ActNorm layer are learnable and are initialized using the mean and variance of batch features.
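A minimal NumPy sketch of an ActNorm layer with data-dependent initialization is given below. It is illustrative only: the class name, the batch layout (B, H, W, C), and the small constant `eps` are assumptions, and in a real network $\mu$ and $\sigma$ would be updated by gradient descent after initialization.

```python
import numpy as np

class ActNorm:
    """Per-channel affine normalization Y = (F - mu) / sigma."""

    def __init__(self, eps=1e-6):
        self.mu = None
        self.sigma = None
        self.eps = eps

    def initialize(self, f):
        # Data-dependent init from a batch of features of shape (B, H, W, C).
        self.mu = f.mean(axis=(0, 1, 2))
        self.sigma = f.std(axis=(0, 1, 2)) + self.eps

    def forward(self, f):
        if self.mu is None:
            self.initialize(f)
        y = (f - self.mu) / self.sigma
        # Log-determinant per sample: each of the H*W positions is scaled by 1/sigma.
        h, w = f.shape[1:3]
        logdet = -h * w * np.sum(np.log(self.sigma))
        return y, logdet

    def inverse(self, y):
        return y * self.sigma + self.mu
```

After initialization on a batch, the output of `forward` on that batch has approximately zero mean and unit variance per channel, and `inverse` recovers the input.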
After the ActNorm layer, an invertible 1 × 1 convolution is used to increase the information interaction between the feature channels of $Y_i$ to obtain $\bar{Y}_i \in \mathbb{R}^{H \times W \times C_i}$. In an invertible 1 × 1 convolution, given the output data and the convolution kernel, we can accurately recover the original input data. In this way, we can reconstruct the normally exposed image from the feature $F_i$ based on the inverse of the learned mapping. To make the traditional 1 × 1 convolution invertible, its weight matrix is initialized as a random orthogonal matrix [51].
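The invertible 1 × 1 convolution can be sketched in NumPy as a per-pixel matrix product with a square channel-mixing matrix. The sketch below is an illustration under assumptions: the function names are hypothetical, the orthogonal matrix is drawn through a QR decomposition of a Gaussian matrix, and the inverse is computed with an explicit matrix inverse for clarity.

```python
import numpy as np

def init_orthogonal(c, rng):
    """Random orthogonal (C, C) matrix from the QR decomposition of a Gaussian matrix."""
    q, _ = np.linalg.qr(rng.normal(size=(c, c)))
    return q

def conv1x1_forward(y, w):
    """y: (H, W, C), w: (C, C). Returns the mixed features and log|det J|."""
    h, w_sp, _ = y.shape
    out = y @ w
    _, logabsdet = np.linalg.slogdet(w)
    # The same matrix acts on every pixel, so the log-determinant scales with H*W.
    return out, h * w_sp * logabsdet

def conv1x1_inverse(out, w):
    """Recover the input exactly from the output and the kernel."""
    return out @ np.linalg.inv(w)
```

At initialization the matrix is orthogonal, so its log-determinant is zero and the layer starts as a volume-preserving channel rotation.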
The affine coupling component is a special reversible transformation component that can effectively map input data to different feature spaces. It transforms existing channels through multiplication and addition operations and can effectively help the normalizing flow decoder learn the mapping from normally exposed images to the feature $F_i$ during the training phase. In the affine coupling component, a split operation is first utilized to divide the input data $\bar{Y}_i$ into two parts, $\bar{Y}_i^1 \in \mathbb{R}^{H \times W \times C_i/2}$ and $\bar{Y}_i^2 \in \mathbb{R}^{H \times W \times C_i/2}$, along the channel dimension. Then, we perform an identity transformation on $\bar{Y}_i^1$ to obtain $H_1 = \bar{Y}_i^1$ and perform an affine transformation on $\bar{Y}_i^2$ to obtain $H_2$,
$$H_2 = NN_s(\bar{Y}_i^1) \odot \bar{Y}_i^2 + NN_b(\bar{Y}_i^1)$$
where $NN_s(\bar{Y}_i^1)$ and $NN_b(\bar{Y}_i^1)$ are shallow three-layer convolutional neural networks that learn the scale and bias from $\bar{Y}_i^1$ for the affine transformation. Next, $H_1$ and $H_2$ are concatenated by channel and input into invertible 1 × 1 convolutions for information interaction among channels. Similar to recent methods [51,52], the flow step is repeated 12 times to learn the mapping.
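The reason the affine coupling component is invertible is that the scale and bias depend only on the half that passes through unchanged. The NumPy sketch below illustrates this; it is not the authors' network. The functions `nn_s` and `nn_b` are hypothetical single-layer stand-ins for the three-layer convolutional networks, and forcing the scale to be positive through `exp(tanh(.))` is an assumption made here so that the inverse always exists.

```python
import numpy as np

def nn_s(y1, w):
    """Hypothetical stand-in for the scale network; output is strictly positive."""
    return np.exp(np.tanh(y1 @ w))

def nn_b(y1, w):
    """Hypothetical stand-in for the bias network."""
    return y1 @ w

def coupling_forward(y, w_s, w_b):
    """y: (H, W, C) with even C. Returns [H_1, H_2] and log|det J|."""
    y1, y2 = np.split(y, 2, axis=-1)
    s, b = nn_s(y1, w_s), nn_b(y1, w_b)
    h2 = s * y2 + b                      # affine transform of the second half
    logdet = np.sum(np.log(s))           # Jacobian is triangular with diagonal s
    return np.concatenate([y1, h2], axis=-1), logdet

def coupling_inverse(h, w_s, w_b):
    """Exact inverse: H_1 is unchanged, so s and b can be recomputed from it."""
    h1, h2 = np.split(h, 2, axis=-1)
    s, b = nn_s(h1, w_s), nn_b(h1, w_b)
    return np.concatenate([h1, (h2 - b) / s], axis=-1)
```

Note that neither `nn_s` nor `nn_b` needs to be invertible itself, which is why arbitrary convolutional networks can be used in their place.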

Mapping Learning Aided by Multi-Scale Features
Due to complex low-light conditions, the detailed information of the image at different scales will be lost or obscured, or the areas at different scales will be too dark or too bright [53]. The multi-scale information of the image is important for LLIE, but the above normalizing flow decoder cannot effectively utilize its multi-scale information. In the proposed ADANF, a gated multi-scale information transmitting module is used to introduce the multi-scale features of the adaptive dual aggregation encoder into the normalizing flow decoder to further improve the quality of image enhancement.
The detailed structure of the gated multi-scale information transmitting module is shown in Figure 2. Three dilated convolutions with different dilation rates (e.g., 1, 2, and 4) are first used in parallel to extract features at different scales [54]. Then, these features are concatenated and fed to 1 × 1 convolutions to generate multi-scale features $F_{ms} \in \mathbb{R}^{H \times W \times C_i}$. Next, Global Average-Pooling (GAP) and Global Max-Pooling (GMP) operations in the channel dimension are utilized to extract the spatial information $\mathrm{GAP}(F_{ms})$ and $\mathrm{GMP}(F_{ms})$. Convolution operations with sigmoid activation are used to generate attention weights from $F_{ms}$ to control the multi-scale information passed to the normalizing flow decoder. The output of the gated multi-scale information transmitting module is the gated multi-scale feature $\tilde{F}_{ms} \in \mathbb{R}^{H \times W \times C_i}$.
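The gating step can be illustrated with the following NumPy sketch, which covers only the attention gate and not the dilated convolutions. It assumes that the pooled maps are concatenated, passed through a 1 × 1 convolution, squashed by a sigmoid, and multiplied with $F_{ms}$. This combination rule, the function name `gate_multiscale`, and the weight `w_gate` are assumptions, since the exact gating formula is not reproduced in the text.

```python
import numpy as np

def gate_multiscale(f_ms, w_gate, b_gate=0.0):
    """f_ms: (H, W, C). w_gate: (2, 1) weights of an assumed 1x1 convolution."""
    # Pooling along the channel dimension keeps one value per spatial position.
    avg = f_ms.mean(axis=-1, keepdims=True)        # GAP(F_ms), shape (H, W, 1)
    mx = f_ms.max(axis=-1, keepdims=True)          # GMP(F_ms), shape (H, W, 1)
    stats = np.concatenate([avg, mx], axis=-1)     # (H, W, 2)
    # Sigmoid gate in (0, 1) controlling how much information is passed on.
    gate = 1.0 / (1.0 + np.exp(-(stats @ w_gate + b_gate)))
    return f_ms * gate
```

A gate value near 0 suppresses the multi-scale feature at that position, while a value near 1 passes it to the decoder almost unchanged.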

Loss Function
In the ADANF, we use the normalizing flow decoder to capture the conditional distribution $P_{NFD}(X_{gt} \mid X_l, \theta)$ of a normally exposed image $X_{gt}$ under its low-light image condition $X_l$, where $\theta$ represents the parameters of the normalizing flow decoder. Since the normalizing flow decoder is an invertible network, it can map a normally exposed image $X_{gt}$ to a latent variable $z = NFD_\theta(X_{gt}; X_l)$ under the low-light image condition $X_l$ and can also reversibly map the latent variable $z$ to the normally exposed image $X_{gt} = NFD_\theta^{-1}(z; X_l)$. In ADANF, the latent variable $z$ refers to the illumination-robust feature $F_i$. Similar to recent work [36], the latent variable $z$ can be assumed to follow a Gaussian distribution $P_z(z)$.
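According to the change-of-variables theorem, the conditional distribution $P_{NFD}(X_{gt} \mid X_l, \theta)$ can be calculated as:
$$P_{NFD}(X_{gt} \mid X_l, \theta) = P_z\big(NFD_\theta(X_{gt}; X_l)\big) \left| \det \frac{\partial NFD_\theta}{\partial X_{gt}}(X_{gt}; X_l) \right|$$
The normalizing flow decoder $NFD_\theta$ is sequentially composed of $N$ invertible layers, $h^{n+1} = NFD_\theta^n(h^n; ADAE^n(X_l))$, where $NFD_\theta^n$ is the n-th layer, $n = 0, 1, \cdots, N-1$, $h^0 = X_{gt}$, and $h^N = z$. $ADAE^n(X_l)$ denotes the latent image features from the adaptive dual aggregation encoder.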
According to Equation (7), we can use the negative log-likelihood as a loss function $L$ to optimize the parameters of the proposed ADANF. By using the chain rule, $L$ is formulated as:
$$L = -\log P_{NFD}(X_{gt} \mid X_l, \theta) = -\log P_z(z) - \sum_{n=0}^{N-1} \log \left| \det \frac{\partial NFD_\theta^n}{\partial h^n}\big(h^n; ADAE^n(X_l)\big) \right|$$
Since $P_z(z)$ is assumed to follow a Gaussian distribution, $P_z(NFD_\theta(X_{gt}; X_l))$ can be calculated in closed form from the Gaussian density. In the testing phase, low-light images are input to the adaptive dual aggregation encoder to obtain illumination-robust features, and then these features are input to the normalizing flow decoder through the inverse mapping $NFD_\theta^{-1}$ to generate normally exposed images.
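The way the negative log-likelihood accumulates over invertible layers can be sketched as follows. This is a toy illustration under assumptions: the latent prior is taken to be a standard Gaussian, the layers are simple element-wise affine maps created by the hypothetical helper `make_scale_layer`, and the conditioning on encoder features is omitted.

```python
import numpy as np

def gaussian_log_prob(z):
    """log N(z; 0, I); the zero-mean, unit-variance prior is an assumption."""
    return -0.5 * np.sum(z ** 2 + np.log(2.0 * np.pi))

def nll(x_gt, layers):
    """layers: list of callables mapping h -> (h_next, log|det J|)."""
    h, total_logdet = x_gt, 0.0
    for layer in layers:          # h^0 = X_gt, ..., h^N = z
        h, logdet = layer(h)
        total_logdet += logdet    # chain rule: log-determinants add up
    return -(gaussian_log_prob(h) + total_logdet)

def make_scale_layer(a, b):
    """Toy invertible layer h -> a*h + b with log|det J| = h.size * log|a|."""
    def layer(h):
        return a * h + b, h.size * np.log(abs(a))
    return layer
```

In the full model, each layer would be an ActNorm, invertible 1 × 1 convolution, or affine coupling step, each returning its own log-determinant in the same way.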

Datasets and Evaluation Metrics
Paired datasets. LOLv1 [21] is one of the most commonly used datasets in LLIE. This dataset is collected from real scenes and contains 500 pairs of low-light and normally exposed images under different lighting conditions. Among them, 485 pairs of images are used for training and 15 pairs of images are used for testing.
LOLv2 [22] contains two subsets, namely LOLv2-real and LOLv2-synthetic. LOLv2-real contains image pairs of different brightness in real scenes obtained by adjusting exposure time and ISO settings. These image pairs are intended to study illumination changes in real application scenarios. Specifically, LOLv2-real contains 689 image pairs for training and 100 image pairs for testing. LOLv2-synthetic synthesizes low-light images from RAW images by analyzing the lighting distribution of low-light images. It contains 1000 image pairs, of which 900 pairs are used for training and 100 pairs are used for testing.
Unpaired datasets. The DICM [55], LIME [3], MEF [56], NPE [57], and VV [46] (https://sites.google.com/site/vonikakis/datasets, accessed on 4 September 2023) datasets consist of real captured images and do not contain normally exposed images as references. Therefore, these datasets cannot be used for training. We only test the performance of the proposed ADANF on these datasets.
Evaluation metrics. For paired datasets like LOLv1 and LOLv2, peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) [58] are used as evaluation metrics. PSNR measures the peak signal-to-noise ratio between the reference image and the enhanced image, while SSIM takes into account the structural and textural information of the image. In addition, learned perceptual image patch similarity (LPIPS) [59] is also used as an evaluation metric, which uses deep features to measure the perceptual similarity of images. This indicator is learned through deep learning methods. Compared with PSNR and SSIM, LPIPS more faithfully reflects the human eye's perception of image quality.
For unpaired datasets such as DICM, LIME, MEF, NPE, and VV, direct evaluation using PSNR, SSIM, or LPIPS is not possible because there are no paired normally exposed images. We use the model parameters trained on LOLv2-synthetic to directly infer the enhanced image. In this case, the Natural Image Quality Evaluator (NIQE) is employed to evaluate the results. For PSNR and SSIM, the larger the value, the better the enhancement quality. For LPIPS and NIQE, the smaller the value, the better the enhancement quality.
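As a concrete reference for the paired metrics, PSNR can be computed from the mean squared error as in the NumPy sketch below. The function name and the assumption that images are scaled to [0, 1] (so that `max_val` is 1.0) are choices made for this example only.

```python
import numpy as np

def psnr(reference, enhanced, max_val=1.0):
    """Peak signal-to-noise ratio in dB between two images of equal shape."""
    diff = reference.astype(np.float64) - enhanced.astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)
```

For example, a uniform error of 0.1 on images in [0, 1] gives an MSE of 0.01 and therefore a PSNR of 20 dB.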

Implementation Details
In the proposed ADANF, the number of global-local adaptive aggregation modules in the adaptive dual aggregation encoder is 24, and the normalizing flow decoder has three levels with a squeeze layer and 12 flow steps at each level. The batch size on the LOLv1, LOLv2-real, and LOLv2-synthetic datasets is 8. We train the ADANF for 40,000 iterations using the Adam optimizer with the initial learning rate set to 0.0005, multiplying the learning rate by 0.5 at 20,000, 30,000, 36,000, and 38,000 iterations. The input image size is set to 160 × 160. For unpaired data, we use the parameters trained on LOLv2-synthetic to perform inference and obtain results. All experiments are completed on a dual-card NVIDIA RTX 4090 server.
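The step-wise learning rate schedule described above can be written as a small helper. This is a plain-Python sketch of the schedule only; the function name is hypothetical, and it assumes that the decay takes effect at the milestone iteration itself.

```python
def learning_rate(iteration,
                  base_lr=5e-4,
                  milestones=(20_000, 30_000, 36_000, 38_000),
                  gamma=0.5):
    """Learning rate after multiplying by gamma at every milestone already reached."""
    passed = sum(1 for m in milestones if iteration >= m)
    return base_lr * gamma ** passed
```

Under this schedule, the learning rate is 0.0005 for the first 20,000 iterations and 0.00003125 for the final 2,000 iterations.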
The quantitative results of the proposed ADANF and comparison methods on the paired LOLv1, LOLv2-real, and LOLv2-synthetic datasets are reported in Tables 1-3, respectively. Low-light images often suffer from color distortion and low contrast, which make it difficult to extract effective features. Previous methods such as LIME [3], RetinexNet [21], and KinD [43] usually use classic structures when extracting low-light image features, which makes it difficult to effectively model the complex distribution of images captured under different low-light conditions and thus limits performance. From Tables 1-3, we can see that our ADANF achieves clear improvements in PSNR, SSIM, and LPIPS. It is worth noting that the improvement is most pronounced in PSNR, indicating that the proposed ADANF can obtain higher quality enhanced images. Compared with the recent method LLFormer [16], the PSNR of our ADANF on the LOLv1, LOLv2-real, and LOLv2-synthetic datasets has increased by 0.91%, 1.81%, and 0.66%, respectively. This may be because the adaptive dual aggregation encoder can effectively extract the global properties and local details from the low-light images. The proposed gated multi-scale information transmitting module can effectively transfer the latent features from the input image to the normalizing flow decoder so that the enhanced image has a more stable quality. The normalizing flow decoder can effectively model the distribution of normally exposed images to reconstruct high-quality images from illumination-robust features.
We also conduct experiments on unpaired datasets. Due to the lack of reference images for comparison, we mainly use the NIQE to quantify the performance of each method and use the visual results for qualitative analysis. The NIQE results for the different datasets are shown in Table 4. The proposed ADANF shows better performance than other methods on the LIME, MEF, and VV datasets. On the DICM and NPE datasets, the proposed ADANF achieves comparable performance.

Visualization
Visual results of different image enhancement methods on paired datasets. To verify that our method can generate better quality illumination-enhanced images, we compare some of the images it generates with the results of other low-light image enhancement algorithms. As shown in Figure 3, our ADANF obtains a more realistic restoration effect. Compared with some methods, it produces images with lower noise and more realistic colors. In addition, it has clearer details at the boundary between light and dark regions. These results suggest that the designed modules extract the features of the original image more completely, which helps the final enhancement result show more details in the transition areas and thereby yields better enhancement results.
Visual results of different image enhancement methods on unpaired datasets. As can be seen from Figure 4, our method has better color performance in different scenarios. Compared with other methods, the image colors obtained by our method are more realistic. The results are neither so dark that details cannot be seen nor so bright that the colors become unrealistic. These visual results show that the proposed method is effective not only on paired datasets but also in complex scenarios with only unpaired datasets. Experimental results on the unpaired DICM, LIME, MEF, NPE, and VV datasets show that the proposed ADANF has good generalization ability.

Ablation Study
To verify the effectiveness of the different modules designed in this method, we conducted ablation experiments on the LOLv1 dataset to test the effects of the introduced adaptive dual aggregation encoder (ADAE) and gated multi-scale information transmitting module (GMITM). We replaced the ADAE in the proposed ADANF with multi-layer convolutions and removed the GMITM to form a baseline method. Then, the ADAE and GMITM modules are respectively added to the baseline method to conduct experiments. The experimental results are shown in Table 5. Compared with the baseline method, introducing either the ADAE or the GMITM alone improves all three evaluation metrics. The improvement brought by the ADAE arises because this module can fully utilize the latent features of the input image in both global and local aspects. The improvement obtained by further combining the GMITM and ADAE arises because the image features extracted by the ADAE are better transferred to the normalizing flow decoder to assist in image enhancement.

Conclusions
In this paper, we propose an adaptive dual aggregation network with normalizing flows for low-light image enhancement. First, an adaptive dual aggregation encoder fully exploits the global properties and local details of the image to extract illumination-robust features. Next, a reversible normalizing flow decoder recovers normally exposed images from these features, using the inverse pass of the flow to reconstruct brighter, more detailed images from low-light inputs. Finally, a gated multi-scale information transmitting module introduces the multi-scale features of the adaptive dual aggregation encoder into the normalizing flow decoder to further improve the quality of the enhanced images. Extensive experiments on paired and unpaired datasets verify the effectiveness of the proposed ADANF. In the future, we will study lightweight low-light image enhancement networks to meet the needs of real-time low-light image processing applications.
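The invertibility that the decoder relies on can be illustrated with a single conditional affine coupling layer, a standard building block of normalizing flows. This is a minimal sketch under stated assumptions, not the ADANF decoder: the toy linear `conditioner`, the weight matrices `w_s` and `w_t`, and the vector `cond` (standing in for encoder features) are all hypothetical.

```python
import math


def conditioner(x_a, cond, w_s, w_t):
    """Toy conditioner (assumption): log-scale and shift from the untouched half and the condition."""
    h = x_a + cond  # list concatenation
    log_s = [math.tanh(sum(w * v for w, v in zip(row, h))) for row in w_s]  # bounded log-scale
    t = [sum(w * v for w, v in zip(row, h)) for row in w_t]
    return log_s, t


def coupling_forward(x, cond, w_s, w_t):
    """Map x to z; the first half passes through, the second half is scaled and shifted."""
    half = len(x) // 2
    x_a, x_b = x[:half], x[half:]
    log_s, t = conditioner(x_a, cond, w_s, w_t)
    z_b = [xb * math.exp(ls) + ti for xb, ls, ti in zip(x_b, log_s, t)]
    return x_a + z_b, sum(log_s)  # sum(log_s) is the log-determinant of the Jacobian


def coupling_inverse(z, cond, w_s, w_t):
    """Exactly undo coupling_forward given the same condition and weights."""
    half = len(z) // 2
    z_a, z_b = z[:half], z[half:]
    log_s, t = conditioner(z_a, cond, w_s, w_t)
    x_b = [(zb - ti) * math.exp(-ls) for zb, ls, ti in zip(z_b, log_s, t)]
    return z_a + x_b


# Round trip on made-up numbers.
x = [0.2, -0.5, 0.7, 0.1]
cond = [0.3, 0.9]
w_s = [[0.1, -0.2, 0.3, 0.05], [0.4, 0.1, -0.3, 0.2]]
w_t = [[0.5, 0.0, -0.1, 0.2], [-0.2, 0.3, 0.1, 0.0]]
z, log_det = coupling_forward(x, cond, w_s, w_t)
x_rec = coupling_inverse(z, cond, w_s, w_t)
print(max(abs(a - b) for a, b in zip(x, x_rec)))
```

Because the first half of the input passes through unchanged, the inverse can recompute the same scale and shift and undo the transform exactly. This property is what lets a flow be trained in one direction and used for reconstruction in the other.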

Figure 1. Detailed structure of ADANF. In the testing phase, a low-light image is first fed to the adaptive dual aggregation encoder to fully exploit the global properties and local details for extracting illumination-robust features. Then, a gated multi-scale information transmitting module is designed to introduce the multi-scale features of the adaptive dual aggregation encoder into the normalizing flow decoder. Finally, an invertible normalizing flow decoder is used to recover the normally exposed image from the illumination-robust features.

Figure 2. Detailed structure of the gated multi-scale information transmitting module.

Figure 3. Some visualization results of the proposed ADANF and recent state-of-the-art methods on the LOLv1, LOLv2-real, and LOLv2-synthetic datasets. Panel labels: Ours, Input, Zero-DCE, KinD, EnlightenGAN, LLFlow.
Figure 4.

Table 1. Quantitative results of the proposed ADANF and state-of-the-art methods on the LOLv1 dataset. ↑ (↓) indicates that a larger (smaller) value means better quality. GFLOPs denotes giga floating-point operations. Params denotes the number of weight parameters.

Table 2. Quantitative results of the proposed ADANF and state-of-the-art methods on the LOLv2-real dataset. ↑ (↓) indicates that a larger (smaller) value means better quality. GFLOPs denotes giga floating-point operations. Params denotes the number of weight parameters.

Table 3. Quantitative results of the proposed ADANF and state-of-the-art methods on the LOLv2-synthetic dataset. ↑ (↓) indicates that a larger (smaller) value means better quality. GFLOPs denotes giga floating-point operations. Params denotes the number of weight parameters.

Table 4. Quantitative results of the proposed ADANF and state-of-the-art methods on the unpaired DICM, LIME, MEF, NPE, and VV datasets. The evaluation metric is NIQE.

Table 5. Ablation studies of the proposed ADANF on the LOLv1 dataset. ↑ (↓) indicates that a larger (smaller) value means better quality.