DANet: A Domain Alignment Network for Low-Light Image Enhancement

Abstract: We propose restoring low-light images suffering from severe degradation using a deep-learning approach. A significant domain gap exists between low-light and real images, which previous methods have failed to bridge through domain alignment. To tackle this, we introduce a domain alignment network that leverages dual encoders and a domain alignment loss. Specifically, we train dual encoders to transform low-light and real images into two latent spaces and align these spaces using a domain alignment loss. Additionally, we design a Convolution-Transformer Module (CTM) for the encoding process to comprehensively extract both local and global features. Experimental results on four benchmark datasets demonstrate that the proposed Domain Alignment Network (DANet) outperforms state-of-the-art methods.


Introduction
Enhancing low-light images is a significant challenge in computer vision. The objective is to address poor visibility, low contrast, and various degradations such as artifacts, color distortion, and noise in low-light images. These problems not only hinder human visual perception but also affect other computer vision tasks such as image processing [1,2], object detection [3], and autonomous driving [4].
With the advancement of deep learning, numerous algorithms have emerged to tackle low-light image enhancement. Despite these efforts, existing methods have limitations. Most algorithms use an encoder-decoder or recurrent structure to process low-light images, aligning them with well-exposed images only at the output. Because this alignment of data distributions occurs only at the output, it often fails to restore low-light images to properly exposed ones, so the output is either too dark (Figure 1c,f) or over-exposed (Figure 1b,j). Some methods aim to match the output to the reference image as closely as possible, or to learn the residual between them, which yields results that are excessively dark compared with the reference. Other methods use Generative Adversarial Networks (GANs) to fit the distribution at the output end, such as EnlightenGAN [5] and Signal-to-Noise-Ratio (SNR) [6]. However, because they align only at the output and ignore the alignment of intermediate-layer features, their results can be unstable and sometimes overexposed, as shown in Figure 1e. Furthermore, most convolution-based methods tend to overlook long-range dependencies, while transformer-based methods ignore the local perceptual capability of convolutions. Consequently, many methods produce locally blurry enhancement results or introduce noise into the image. Although certain methods attempt to combine long-range and short-range features, insufficient fusion often leads to inadequate texture details in the results.

In this paper, to address the above drawbacks, we propose a novel network named the Domain Alignment Network (DANet) for low-light image enhancement. Unlike traditional encoder-decoder structures, we introduce a dual-encoder, single-decoder architecture. Dual encoders can extract deep feature representations from images, which are highly useful for understanding the content and textural structure of the images. In the context of image enhancement, dual encoders can be utilized for domain adaptation by learning the mapping between the source domain (low-light images) and the target domain (real images), thereby improving enhancement quality. They can also improve robustness by learning stable feature representations in the presence of noise, compression artifacts, or other forms of degradation. Specifically, DANet consists of dual encoders (an Enhancement Encoder and a Reconstruction Encoder) and a decoder. Each encoder is composed of convolutions and a Convolution-Transformer Module (CTM). The dual encoders take the ground-truth and low-light images as inputs: one encoder extracts ground-truth features for reconstruction, while the other extracts low-light features for illumination enhancement. The decoder simultaneously reconstructs images under normal and low-light conditions, producing two reconstructed images. To better extract features, we capture long-range dependencies and local perceptual features by integrating convolution and the CTM. In addition, to reduce the gap between the low-light and normal-light data domains and to enable better feature fusion and complementarity, we propose a domain alignment loss function that improves the enhancement of low-light images. Together, these modules extract relevant features from different scales and regions of the input images; by leveraging feature interactions, we achieve an effective fusion of the extracted features, improving feature representation and enhancement performance.
Experiments on benchmark datasets indicate that the proposed method performs well compared with state-of-the-art techniques. It produces clear, natural images and achieves brightness levels close to the ground-truth images. Our contributions can be summarized as follows:

1. We propose a novel framework, the Domain Alignment Network (DANet), which combines dual encoders with domain alignment for enhancing low-light images.

2. We propose dual encoders and a domain alignment loss function. The dual encoders encode images from two different domains separately, and the domain alignment loss maps the feature distributions of the two latent spaces into the same domain, achieving feature alignment between the two domains.

3. We design a Convolution-Transformer Module (CTM) comprising local and non-local branches. The local branch extracts local features, while the non-local branch extracts global features. The features are then deeply fused, enriching the image's texture details and improving its quality.

4. Experimental results show that our approach outperforms other methods in both objective and subjective metrics.

Related Works

Non-Learning-Based Low-Light Image Enhancement
Non-learning methods for low-light image enhancement primarily include histogram equalization and Retinex-theory approaches. For histogram equalization, Zhuang et al. [7] introduced an enhancement technique based on entropy-adaptive histogram equalization, while Mun et al. [8] proposed an edge-enhanced dual histogram equalization method using guided image filters. These methods often produce unwanted artifacts in enhanced real-world images while losing image details. Compared with histogram-equalization-based methods, image enhancement algorithms based on Retinex theory can achieve better enhancement effects. Retinex-based methods decompose an image into reflectance and illumination components, using the reflectance component as a reliable basis for enhancement. Subsequently, illumination correction is applied to suppress artifacts [9,10], leading to more realistic and natural results. Li et al. [11] also considered noise in the Retinex model, improving the model's robustness. Ghosh et al. [12] presented a variational filtering solution based on the Retinex model. However, when enhancing complex real-world images, these methods often introduce local color distortions [13], resulting in over-enhancement and local artifacts.

Learning-Based Low-Light Image Enhancement
In recent years, a surge of learning-based methods has been proposed for enhancing low-light images [14-17]. Jiang et al. [5] introduced EnlightenGAN, designed for no-reference low-light image enhancement and reducing dependence on paired datasets. Guo et al. [18] introduced the Zero method, which reformulates the nonlinear relationship between low and normal light as a task of learning image-specific curve estimation. SNR [6] employs transformers for signal-to-noise-ratio perception and a CNN model with spatially varying operations to achieve spatially variant enhancement of low-light images. Zheng et al. [19] presented UTVNet, which leverages balanced parameter learning in a model-based denoising approach; UTVNet guides the noise-layer map to recover finer details and suppress noise in realistically captured low-light scenes. Ma et al. [20] proposed CSDNet, a novel context-sensitive decomposition network architecture that utilizes scene-level context for spatial-scale dependency and constructs illumination-guided, spatially variant changes with edge-aware smoothing properties, enhancing image brightness and details. Ma et al. [21] introduced the self-calibrated illumination (SCI) learning framework for rapidly, flexibly, and robustly improving image brightness in real-world low-light scenarios. Unsupervised methods have also been explored [18,22,23]; for instance, Guo et al. [18] constructed a lightweight network for pixel estimation. Prior methods overlooked domain alignment when handling low-light images. We therefore propose a method that uses dual encoders and a novel domain alignment loss. Our approach not only produces clear images but also exhibits brightness closer to real images.

Enhancement of Low-Light Images Using Vision Transformers
Since being proposed by Vaswani et al. [24] in 2017, Transformers and their variants have been applied to numerous computer vision tasks, including image classification [25,26], semantic segmentation [27,28], object detection [29-31], and image restoration [32-34]. In particular, since the introduction of the Vision Transformer (ViT), efforts have been made to better adapt Transformers to visual tasks. Many works have focused on reducing the quadratic computational cost of global self-attention within Transformers, and some [35-37] concentrate on establishing pyramid Transformer architectures similar to convnet-based structures. For instance, Xu et al. [6] introduced the SNR hybrid network for low-light image enhancement; however, due to the high computational burden of the original global Transformer, SNR uses only a single global Transformer layer at the lowest resolution of a U-shaped CNN. While Transformer-based methods have shown promising results in various computer vision tasks, their full potential in low-light image enhancement remains underexplored.
Unlike previous works, our proposed method enhances low-light images using dual encoders and a novel domain alignment network. The framework comprises two new encoders with Convolution-Transformer Modules and a decoder, transforming low-light and real images into two latent spaces and enhancing low-light images through domain alignment.

Proposed Method
In this section, we first present the overall framework of the proposed DANet, followed by an introduction to the key components of the CTM. Finally, we describe the loss functions used.
The overall framework of the proposed Domain Alignment Network (DANet) is illustrated in Figure 2. Suppose the image pairs are given as (I_ll, I_gt), where I_gt represents the ground-truth image and I_ll represents its corresponding low-light image. The image pairs are fed into the dual encoders E_ll and E_gt for separate processing. In the deepest layer of each encoder, we propose the CTM to model local and long-range dependencies using a parallel dual-branch structure: one branch employs a Transformer structure, while the other uses a convolutional structure. Two latent features, l_ll and l_gt, are generated by the dual encoders. A domain alignment loss function is designed to map l_ll and l_gt, which are distributed differently, onto the same latent space during training; this reduces the discrepancy between the two domains and enables complementary feature fusion. The decoder performs concurrent reconstruction of both l_ll and l_gt, yielding two reconstructed images, I_en and I_rec, respectively. This stage can be formulated as:

I_en = H_de(H_CTM(H_en(I_ll))),   I_rec = H_de(H_CTM(H'_en(I_gt))),

where H_en(·) and H'_en(·) represent the dual encoders, H_CTM denotes the proposed CTM, H_de(·) is the decoder, and I_en and I_rec represent the light-enhanced and reconstructed images.
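As a concrete illustration, the dual-encoder, single-decoder pipeline above can be sketched in PyTorch. This is a minimal sketch under our own assumptions: the module names, channel widths, and layer counts are placeholders, not the authors' released code.

```python
import torch
import torch.nn as nn

class TinyEncoder(nn.Module):
    """Stand-in for E_ll / E_gt: convolutional downsampling to a latent feature."""
    def __init__(self, ch=16):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(3, ch, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(ch, ch, 3, stride=2, padding=1), nn.ReLU(),
        )
    def forward(self, x):
        return self.body(x)

class TinyDecoder(nn.Module):
    """Stand-in for the shared decoder D: upsamples a latent back to RGB."""
    def __init__(self, ch=16):
        super().__init__()
        self.body = nn.Sequential(
            nn.ConvTranspose2d(ch, ch, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(ch, 3, 4, stride=2, padding=1),
        )
    def forward(self, z):
        return self.body(z)

E_ll, E_gt, D = TinyEncoder(), TinyEncoder(), TinyDecoder()
I_ll = torch.rand(1, 3, 64, 64)      # low-light input
I_gt = torch.rand(1, 3, 64, 64)      # ground-truth input
l_ll, l_gt = E_ll(I_ll), E_gt(I_gt)  # two latent features to be aligned
I_en, I_rec = D(l_ll), D(l_gt)       # enhanced and reconstructed images
```

The key design point is that the two latents share one decoder, so aligning l_ll with l_gt directly constrains the enhancement path.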

Dual Encoders with Convolution-Transformer Module
Traditional low-light image enhancement algorithms typically employ encoder-decoder architectures or iterative structures (such as diffusion models) to reconstruct images, aligning the output with the ground-truth images. However, aligning only at the output stage prevents the deepest layers of the network from learning the distribution of the corresponding domain's high-level features, resulting in low-quality outcomes.
To solve these problems, we devise a dual-encoder, single-decoder architecture, as shown in Figure 2. Here, {E_gt, D} represents a reconstruction network that maps the real image I_gt to latent-space features and reconstructs it back to the original image, while {E_ll, D} represents a low-light enhancement network that restores the low-light image I_ll to an illuminated image resembling I_gt as closely as possible. The structures of E_ll and E_gt are similar, so for illustration we focus on E_ll. Given a low-light image I_ll ∈ R^{H×W×3}, E_ll first uses a sequence of three convolutional layers to extract features and reduce the spatial dimensions, yielding the feature F_ll. Afterward, F_ll is fed into the Convolution-Transformer Module (CTM) to obtain the latent-space feature l_ll. The decoder then applies three deconvolutional layers to restore l_ll to the enhanced prediction I_en. The CTM is detailed below.

Convolution-Transformer Module (CTM)
Traditional methods often employ deep residual convolutional networks to extract features but overlook long-range dependencies. In recent years, vision transformer (ViT) models have effectively addressed this problem. However, pure ViT networks lack the local perceptual capability of convolutions, leading to poor texture details in the results. Although some methods, such as SNR [6], consider both long-range and local features, the lack of effective interaction hinders proper feature fusion, leaving some noise in the results.
To address the aforementioned issues, we designed the Convolution-Transformer Module (CTM), as illustrated in Figure 3. Given an input set of features, we split them into two branches: a non-local branch and a local branch. The non-local branch divides the features into patches and feeds them into a transformer to extract global features, while the local branch employs a deep residual convolution block to extract local features, capturing the detailed nuances of local image regions. The resulting features are then deeply fused through a series of transformers and residual networks for comprehensive feature fusion. This stage can be formulated as:

F̂ = H_ctran(H_tran(F), H_res(F)),

where F is the input feature of the CTM, and H_ctran(·), H_tran(·), and H_res(·) represent transformers with cross-attention, common transformers, and resnet blocks, respectively.
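The dual-branch idea can be sketched as follows. This is our reading of the text rather than the authors' implementation: the attention layer stands in for the transformer branch, a residual convolution stack for the local branch, and a 1×1 convolution for the fusion step (the paper fuses with further transformers and residual blocks).

```python
import torch
import torch.nn as nn

class CTMSketch(nn.Module):
    """Illustrative dual-branch module: global attention + local residual conv."""
    def __init__(self, ch=32, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(ch, heads, batch_first=True)
        self.local = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(ch, ch, 3, padding=1),
        )
        self.fuse = nn.Conv2d(2 * ch, ch, 1)  # placeholder for the deep fusion stage

    def forward(self, f):
        b, c, h, w = f.shape
        # non-local branch: flatten spatial positions into tokens for self-attention
        tokens = f.flatten(2).transpose(1, 2)          # (b, h*w, c)
        g, _ = self.attn(tokens, tokens, tokens)
        g = g.transpose(1, 2).reshape(b, c, h, w)
        # local branch: residual convolution preserves fine texture
        l = f + self.local(f)
        return self.fuse(torch.cat([g, l], dim=1))

f = torch.rand(2, 32, 8, 8)
out = CTMSketch()(f)   # same shape as the input feature
```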

Domain Alignment Loss
Existing alignment loss functions. Existing loss functions such as cross-entropy loss, triplet loss, and contrastive loss can be utilized. The cross-entropy loss aligns the distribution of F_ll towards that of F_gt. Assume that, during training, another negative sample F^-_gt is drawn from the same batch. The triplet loss can then be written as:

L_tri = max(||F_ll − F_gt||^2 − ||F_ll − F^-_gt||^2 + a, 0),

where F_gt and F_ll represent the two latent-space vectors to be aligned, F^-_gt is the negative sample, and a is the margin hyperparameter.
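A minimal sketch of this margin-based triplet alignment term, assuming squared Euclidean distances (the function name and default margin are our choices for illustration):

```python
import torch
import torch.nn.functional as F

def triplet_align(f_ll, f_gt, f_gt_neg, a=0.3):
    """Triplet alignment: pull (f_ll, f_gt) together, push f_gt_neg away by margin a."""
    pos = F.mse_loss(f_ll, f_gt)       # distance to the aligned ground-truth latent
    neg = F.mse_loss(f_ll, f_gt_neg)   # distance to the negative sample
    return torch.clamp(pos - neg + a, min=0.0)
```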
In our experiments, we observed that as the number of training epochs increased, both the cross-entropy loss and the triplet loss exhibited mode collapse. A possible reason is that both functions require cross-channel input features, yet in the latent space the training of distinct channel features may differ between low-light and ground-truth images. Contrastive loss, on the other hand, tends to show its effect only after extensive, large-scale training over many epochs.
Domain alignment loss function. Based on these observations, we designed a latent-space discriminator as our domain alignment loss. Specifically, as shown in Figure 2, the discriminator consists of three linear layers and outputs a value representing the probability that a latent vector is authentic. The loss takes the standard adversarial form:

L_dan = −E[log D(F_ll)] − E[log(1 − D(F_gt))],

where F_gt serves as the fake sample and F_ll as the true sample.
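The three-linear-layer latent discriminator described above can be sketched as follows; the hidden width and activation are our assumptions, since the paper only specifies the layer count and the probability output.

```python
import torch
import torch.nn as nn

class LatentDiscriminator(nn.Module):
    """Three linear layers scoring a flattened latent vector's authenticity."""
    def __init__(self, dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden), nn.LeakyReLU(0.2),
            nn.Linear(hidden, hidden), nn.LeakyReLU(0.2),
            nn.Linear(hidden, 1),   # authenticity logit
        )
    def forward(self, z):
        return self.net(z)

disc = LatentDiscriminator(dim=64)
score = torch.sigmoid(disc(torch.rand(4, 64)))   # probabilities in (0, 1)
```

During training the encoders would be updated adversarially against this discriminator so that the two latent distributions become indistinguishable.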
The reconstruction loss helps DANet produce results with complete image structures and can be formulated as:

L_rec = ||I_e − I_n||_1,

where I_n represents the target image and I_e denotes the light-enhanced image.
The overall loss function. A new loss function was designed to optimize DANet in terms of image structure, feature-domain alignment, and human visual perception:

L = L_rec + L_dan + λ L_tri,

where L_rec, L_tri, and L_dan denote the reconstruction loss, triplet loss, and domain alignment loss, respectively, and λ is the weight coefficient used to balance the triplet loss. We set λ = 0.3.
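The combined objective can be sketched directly. Note the assumptions: the reconstruction term uses an L1 distance (a common choice; the paper does not state the norm explicitly), and the function names are ours.

```python
import torch
import torch.nn.functional as F

def reconstruction_loss(i_e, i_n):
    """L1 distance between the enhanced image i_e and the target i_n (assumed norm)."""
    return F.l1_loss(i_e, i_n)

def total_loss(l_rec, l_tri, l_dan, lam=0.3):
    """Overall objective: L = L_rec + L_dan + lambda * L_tri, with lambda = 0.3."""
    return l_rec + l_dan + lam * l_tri
```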
Experiments

Datasets. LOLv1 consists of 485 pairs of low-light and normal-light images for training and 15 pairs for testing, each pair comprising a low-light input image and its corresponding normal-illumination image. In contrast, the LOLv2-synthetic dataset generates low-light images from RAW images by analyzing the illumination distribution; it includes 1000 pairs of low-light and normal images, with 900 pairs for training and 100 pairs for testing.
Implementation Details. Our method is implemented in PyTorch and was trained and tested on a PC with a single 1080Ti GPU. The model was trained with the Adam optimizer (β1 = 0.9, β2 = 0.999) for 2.5 × 10^5 iterations. The initial learning rate was set to 2 × 10^−4 and decayed to 1 × 10^−6 using a cosine annealing schedule for stability. Training samples were randomly cropped to 128 × 128 patches from the low-light/normal-light image pairs, with a batch size of 16. Data augmentation techniques, such as random rotation and flipping, were used to enrich the training data.
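The stated training schedule maps directly onto standard PyTorch components; the sketch below uses a placeholder network, since the optimizer setup is independent of the model definition.

```python
import torch

model = torch.nn.Conv2d(3, 3, 3, padding=1)   # placeholder for the full DANet
opt = torch.optim.Adam(model.parameters(), lr=2e-4, betas=(0.9, 0.999))
sched = torch.optim.lr_scheduler.CosineAnnealingLR(
    opt, T_max=250_000, eta_min=1e-6)          # 2.5e5 iterations, decay to 1e-6
# Training loop (per iteration): compute loss, loss.backward(),
# opt.step(), then sched.step() to advance the cosine schedule.
```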
Evaluation Metrics. We employed three full-reference distortion metrics to evaluate the performance of the proposed method: Peak Signal-to-Noise Ratio (PSNR) [40], Structural Similarity Index (SSIM) [40], and Mean Absolute Error (MAE). Additionally, we used a perceptual metric, the Fréchet Inception Distance (FID) [41], commonly used with generative adversarial networks, to assess the visual quality of the enhanced results.
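For reference, the two simplest distortion metrics can be computed as follows for images scaled to [0, 1] (these are textbook definitions, not the exact evaluation code used in the paper):

```python
import numpy as np

def psnr(x, y, peak=1.0):
    """Peak Signal-to-Noise Ratio in dB between images x and y in [0, peak]."""
    mse = np.mean((x - y) ** 2)
    return 10 * np.log10(peak ** 2 / mse)

def mae(x, y):
    """Mean Absolute Error between images x and y."""
    return np.mean(np.abs(x - y))
```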

Quantitative Analysis
We compared our method with other approaches on the LOLv1, LOLv2-synthetic [38], LSRW-Huawei, and LSRW-Nikon [39] test sets. The performance comparisons on these four public datasets are reported in Tables 1 and 2. Overall, DANet achieves either optimal or sub-optimal performance across all metrics, demonstrating the superiority of the proposed approach. In Table 1, our method significantly outperforms previous state-of-the-art methods. For the distortion metrics, we surpass the current SOTA methods, indicating that our results contain more high-frequency details and structure. In particular, compared with the second-best method, DSHGAL, our method improves PSNR by 1.534 dB, and we achieve sub-optimal SSIM and MAE on the LSRW-Huawei test set. For the FID metric, we achieve sub-optimal results on the LSRW-Nikon test set. As shown in Table 2, we observe similar improvements on the LOLv2-synthetic test set, i.e., 0.82 dB in PSNR and 0.007 in SSIM. On the LOLv1 test set, our method improves the MAE and FID metrics by at least 0.007 and 4.31, respectively, and achieves sub-optimal PSNR. These outcomes demonstrate that the proposed method delivers satisfactory visual quality for high-resolution low-light image restoration, proving its effectiveness.

Figure 4 displays the enhancement results of each method on the LOLv1 dataset. The top row shows the enhanced RGB images produced by several methods, and the bottom row shows the corresponding color histograms. In terms of color, our method closely recovers the color histogram of the real image, whereas the enhancement results of the other methods are not fully recovered. Figure 5 displays a further comparison on the LOLv1 dataset, where we observed the following trends: (1) SNR cannot effectively integrate local and global features, resulting in inconsistent texture details, and URetinex produces blurry details; (2) SCI and KinD exhibit locally dim lighting, neglecting long-range dependencies and domain alignment, which also leads to overly bright regions and artifacts; (3) due to the lack of domain alignment in the latent space, CSCDGAN, Zero, and PairLIE show uneven brightness distributions in their outputs, while EnlightenGAN suffers from local overexposure. Similarly, the noise and missing local texture details in KinD's results stem from the absence of long-range or local dependency modeling, which weakens feature extraction.

Figure 6 illustrates the comparison on the LSRW-Nikon dataset. EnlightenGAN, Zero, and KinD exhibit color distortion, while SNR and URetinex produce blurred details, all due to insufficient extraction of effective features. Figure 7 displays the comparison on the LSRW-Huawei dataset. EnlightenGAN, Zero, and KinD again show color distortion, while URetinex, LIME, and CSDGAN over-enhance the distant sunlight due to a lack of long-range dependencies in feature extraction. Figure 8 presents the comparison on the LOLv2-synthetic dataset. The sky and clouds in EnlightenGAN and CSDGAN suffer from overexposure, whereas in KinD and Zero the cloud illumination is insufficient. Overall, the other methods lack domain alignment in the latent space, resulting in locally overexposed or underexposed regions in their outputs. In contrast, our approach leverages domain alignment and extracts both long-range and short-range dependencies, yielding clear, locally coherent images whose lighting more closely resembles the ground truth.

Ablation Study
To showcase the effectiveness of the domain alignment network, the CTM module, and the domain alignment loss function, we conducted the following three ablation experiments on the LSRW-Huawei dataset.
The effectiveness of DANet. We removed the aligned encoder and used a single encoder and decoder for image reconstruction, which corresponds to a traditional encoder-decoder structure (referred to as "w/o DANet"). The results for the various metrics are reported in row 3 of Table 3, with visual results shown in Figure 9. Compared with the real images, the contrast and saturation appear relatively poor, and the enhancement results are not entirely satisfactory. Without the domain alignment method, the brightness distribution of the results becomes unbalanced: because the proposed domain alignment loss is not used, the illumination of the images is not properly distinguished, degrading the contrast and saturation of the enhanced images. In the full DANet, which includes the domain alignment loss and the dual encoders, one encoder reconstructs images under normal illumination while the other enhances images captured under weak lighting. The domain alignment loss matches the features of the normally illuminated images, yielding enhanced images with better contrast, saturation, and texture details that are also closer to the real images.

The effectiveness of CTM. We removed the CTM module (referred to as "w/o CTM"). The results for each metric are shown in row 4 of Table 3, and the visual results are shown in Figure 9. Without the CTM, the enhanced image exhibits local blurriness and some noise. This is because long-range dependencies are ignored and global features are not sufficiently extracted alongside the local features, leading to local blurriness in the enhanced image.

The effectiveness of the loss function. The evaluation of the loss functions is shown in Figure 9. To validate the effectiveness of the proposed domain alignment loss, we replaced it during training with contrastive loss (referred to as "RP con"), triplet loss (referred to as "RP tri"), and cross-entropy loss (referred to as "RP cro"). The results in Figure 9 show that with RP con, RP cro, and RP tri, the contrast and saturation of the enhanced results are poor, and the contrast is somewhat distorted compared with the ground truth. With our proposed domain alignment loss, the contrast and saturation of the DANet results are better, the details are clearer and closer to the real image, and the output looks more natural. These experimental results show that the domain alignment loss effectively improves the quality of the model's enhanced images.

Conclusions
This paper proposed DANet, a network for low-light image enhancement comprising dual encoders with a Convolution-Transformer Module (CTM). The encoders reduce the spatial dimensions of the low-light and normal-light images, mapping them into two latent spaces, which are then aligned. The CTM captures long-range dependencies, with separate branches extracting global and local features. These innovations not only enhance detail in low-light images but also achieve illumination closer to real images, mitigating over- and under-enhancement. Experimental results on four public datasets demonstrated the superior performance of our method compared with state-of-the-art techniques. Future research will explore Transformer-based methods for low-light image enhancement in more depth, emphasizing detail and illumination improvements.

Figure 1. Examples of image enhancement results in both brightness and texture details.

Figure 2. Overview of the Domain Alignment Network.

Figure 9. Visual results for the ablation study on the LSRW-Huawei dataset. "W/o DANet" indicates the absence of the aligned encoder, using only a single encoder and decoder to reconstruct the image. "W/o CTM" indicates the removal of the CTM module. "RP con", "RP tri", and "RP cro" denote substituting the contrastive, triplet, and cross-entropy losses, respectively, for the domain alignment loss.

Table 3. Effect of the DANet, CTM, and loss choices on LSRW-Huawei in terms of PSNR, SSIM, MAE, and FID. "W/o DANet" and "W/o CTM" indicate removal of the aligned encoder and the CTM module, respectively. "RP con", "RP tri", and "RP cro" denote the contrastive, triplet, and cross-entropy losses. The best results are in bold and the second-best are underlined. '↑' indicates higher is better; '↓' indicates lower is better.
F_ll and F_gt are features of the low-light and normal-light images, respectively. H_en(·) and H'_en(·) represent the dual encoders. H_CTM and H_cnn denote the proposed CTM and the convolutional network. H_de(·) is the decoder. I_en and I_rec represent the light-enhanced and reconstructed images.

Table 1. Quantitative evaluation of different methods on the LSRW-Huawei and LSRW-Nikon test sets. The best results are in bold and the second-best are underlined. '↑' indicates higher is better; '↓' indicates lower is better.

Table 2. Quantitative evaluation of different methods on the LOLv1 and LOLv2-synthetic test sets. The best results are in bold and the second-best are underlined. '↑' indicates higher is better; '↓' indicates lower is better. To facilitate a clearer comparison, we provide visual results of all methods separately in Figures 4-8 across the LOLv1, LSRW-Nikon, LSRW-Huawei, and LOLv2-synthetic datasets.