RSCNN: A CNN-Based Method to Enhance Low-Light Remote-Sensing Images

Abstract: Image enhancement (IE) technology can enhance the brightness of remote-sensing images to obtain better interpretation and visualization effects. Convolutional neural networks (CNNs), such as the Low-light CNN (LLCNN) and Super-resolution CNN (SRCNN), have achieved great success in image enhancement, image super resolution, and other image-processing applications. Therefore, we adopt CNNs to propose a new neural network architecture with an end-to-end strategy for low-light remote-sensing IE, named remote-sensing CNN (RSCNN). In RSCNN, an upsampling operator is adopted to help learn more multi-scaled features. With respect to the lack of labeled training data in remote-sensing image datasets for IE, we first train on real natural image patches and then fine-tune the model with simulated remote-sensing image pairs. Reasonably designed experiments are carried out, and the results quantitatively show the superiority of RSCNN in terms of structural similarity index (SSIM) and peak signal-to-noise ratio (PSNR) over conventional techniques for low-light remote-sensing IE. Furthermore, the results of our method have obvious qualitative advantages in denoising and in maintaining the authenticity of colors and textures.


Introduction
Remote-sensing images play a significant role in large-scale spatial analysis and visualization, including climate change detection [1], urban 3D modelling [2], and global surface monitoring [3]. However, due to the effects of remotely sensed devices, undesirable weather conditions, such as haze, blizzards, storms, clouds, etc. [4], have a great negative impact on the visibility and interpretability of remote-sensing images. Low-light images create more difficulties for many practical tasks such as marine disaster monitoring and night monitoring. Therefore, it is a great necessity to enhance the contrast and brightness of low-light images automatically when we want to achieve a high-quality remote-sensing image dataset with large scale and long time series.
The purpose of image enhancement (IE) is to improve the visual interpretation of images and to provide better clues for further processing and analysis [4][5][6]. Over time, many low-light IE methods have been proposed and have achieved great success in the image-processing and remote-sensing fields. Histogram Equalization (HE) [7] and its variants, such as Dynamic Histogram Equalization (DHE) [8], Brightness Preserving Dynamic Histogram Equalization (BPDHE) [9], and Contrast Limited Adaptive Histogram Equalization (CLAHE) [10], are classic traditional contrast-enhancement methods. The purpose of HE is to increase the contrast of the entire image by expanding its dynamic range. It is a global adjustment process that does not consider local changes in brightness and is therefore prone to local overexposure and color distortion. This kind of method can automatically obtain enhanced images with stronger contrast, better brightness, and better sharpness, but it may handle noise poorly.

For deep-learning-based methods, a proper layer size is required to adequately capture the characteristics of the training data while reducing the risk of a vanishing gradient as much as possible. Low-light CNN (LLCNN) [26] was the first to introduce convolutional layers into low-light IE and achieves better results in terms of peak signal-to-noise ratio (PSNR) and structural similarity index (SSIM) than LLNet and many other traditional methods. LLCNN utilizes a specially designed convolutional module and residual learning to achieve a deeper network while coping with the vanishing gradient problem. It adopts SSIM as the training loss to obtain better texture preservation. As in [24], a gamma degradation method with the parameter randomly set in the range (2, 5) is used to generate low-light images for training. The Multi-branch Low-light Enhancement Network (MBLLEN) [27] uses a CNN-based module to extract and enhance feature maps at different levels and fuses them to obtain the final result. The authors of [28] trained purely fully convolutional end-to-end networks that operate directly on the raw sensor data of extremely low-light images to obtain an enhanced result.
With respect to remote-sensing low-light image enhancement, most researchers still focus on traditional and machine-learning methods. For example, the authors of [29] applied HE for contrast enhancement, the authors of [30] used dominant brightness level analysis and adaptive intensity transformation to enhance remote-sensing images, the authors of [21] proposed DWT-based methods for remote-sensing IE tasks, and the work in [4] enhanced low-visibility aerial images using the Retinex representation method. Deep learning methods have not received enough attention yet.
As discussed above, convolutional networks have shown great superiority in low-light image processing. Therefore, in this paper, we propose a purely CNN-based architecture called remote-sensing CNN (RSCNN) for low-light remote-sensing IE. Different kernels in RSCNN are used to capture various features, such as the textures, edges, contours, and deep features of low-light images. Then, all the feature maps are integrated to obtain properly enhanced final images. It is well known that the definition of the loss function of a neural network is crucial. The L1 loss is very popular for measuring the overall similarity of two images. In addition, the SSIM loss is also applied in this paper to retain more accurate image textures. The sum of the L1 and SSIM loss functions is adopted as the overall loss function to take advantage of both. With respect to the lack of a training dataset for remote-sensing IE, we adopt transfer learning: RSCNN is pretrained on a natural-image enhancement dataset and then fine-tuned for remote-sensing IE with simulated low-light and normal-light remote-sensing image pairs.
Experiments are carried out on two datasets. Compared to 10 baselines, both quantitative and qualitative results illustrate that RSCNN has great advantages over other methods for low-light remote-sensing IE.

Network Architecture
The framework of RSCNN is shown in Figure 1. A deep CNN-based model extracts abstract features and learns detailed information from the input low-light images. Since CNN-based models can directly process multi-channel images without color-space conversion, all the information of the input images can be retained, and the complex nonlinear relation patterns between low-light and normal-light image pairs can be well learned, thereby generating images with proper light, stronger contrast, and natural textures. In detail, there are four main types of components in the network, as described below.
(1) Convolution layer The whole network has 8 convolution layers. Each layer consists of multiple kernels, and the weights of these kernels do not change during the convolution process, i.e., there is weight sharing. With the convolution operation, RSCNN extracts different features of the input images at different convolution levels. The output of the first CNN layer roughly depicts the locations of low-level features (edges and curves) in the original image. On this basis, another convolution operation is carried out, and its output is an activation map representing higher-level features [31]. Such features can be semicircles (a combination of curves and lines) or quadrilaterals (a combination of several lines). The more convolution layers there are, the more complex the feature activation maps that are obtained. Several parameters need to be determined for each layer, such as the kernel size K, padding P, and stride S. The number of kernels N is the number of output feature maps. W and H denote the width and height of the images, respectively. Thus, the size of the output feature maps can be calculated as follows:

$$W_{out} = \frac{W - K + 2P}{S} + 1, \quad (1)$$

$$H_{out} = \frac{H - K + 2P}{S} + 1. \quad (2)$$

Since we want to fix the tensor size of the input and the output for each convolution layer, we set K = 5, S = 1, and P = 2 for the first convolution layer and K = 3, S = 1, and P = 1 for the rest.
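As a quick sanity check, these settings can be verified with a short Python snippet (a sketch; the 256 × 256 patch size is taken from the training details later in the paper):

```python
def conv_out_size(w, k, s, p):
    """Output size from Equations (1) and (2): (W - K + 2P) / S + 1."""
    return (w - k + 2 * p) // s + 1

# Both configurations preserve the spatial size of a 256 x 256 patch:
assert conv_out_size(256, k=5, s=1, p=2) == 256  # first convolution layer
assert conv_out_size(256, k=3, s=1, p=1) == 256  # remaining layers
```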
(2) Activation layer The activation layer is vital in a deep CNN because its nonlinearity introduces nonlinear characteristics into a system that has just undergone linear computation, gives RSCNN stronger representational power, and avoids gradient saturation during training. We adopt the rectified linear unit (ReLU) because it improves the training speed of RSCNN without obvious changes in accuracy. The activation layer is applied over the output of the previous layer: every value obtained from an upstream convolution layer is activated by ReLU before it is input into the downstream convolution layer.
(3) Upsampling operation Inspired by CNN-based super-resolution methods [32][33][34], in RSCNN, we adopt interpolation to magnify the image by two times for a larger receptive field and then add another CNN layer after that in order to learn more complex features at different scales. We use bicubic interpolation in this operation to help preserve clearer edges [35].

(4) Max-pooling operation
We adopt the pooling operation in RSCNN for two purposes: Firstly, the pooling operation helps reduce the number of parameters and resizes the feature maps back to the original patch size, decreasing the training cost to a meaningful extent. Secondly, the pooling operation cuts down the possibility of overfitting and helps suppress noise.
In RSCNN, we set the kernel size to 2 for each max-pooling operation.
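To make the four components concrete, the following PyTorch sketch assembles them in one possible arrangement. It is only a sketch: the exact ordering of the 8 convolutions, the upsampling, and the pooling follows Figure 1 (not reproduced here), and the number of feature maps (64) is an assumption.

```python
import torch.nn as nn

class RSCNN(nn.Module):
    """Sketch of the four components described above; the layer ordering and
    the 64-channel width are assumptions, not the paper's exact design."""

    def __init__(self, channels=3, features=64):
        super().__init__()
        layers = [
            nn.Conv2d(channels, features, 5, 1, 2),  # first layer: K=5, S=1, P=2
            nn.ReLU(inplace=True),
            nn.Upsample(scale_factor=2, mode="bicubic", align_corners=False),
        ]
        for _ in range(6):  # six middle convolutions: K=3, S=1, P=1
            layers += [nn.Conv2d(features, features, 3, 1, 1), nn.ReLU(inplace=True)]
        layers += [
            nn.MaxPool2d(2),                         # kernel size 2: back to input size
            nn.Conv2d(features, channels, 3, 1, 1),  # 8th convolution produces the output
        ]
        self.net = nn.Sequential(*layers)
        for m in self.modules():                     # kaiming_normal initialization [45]
            if isinstance(m, nn.Conv2d):
                nn.init.kaiming_normal_(m.weight)

    def forward(self, x):
        return self.net(x)
```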

Loss Function
A combination of the SSIM loss function and the L1 loss function is adopted in RSCNN. The L1 loss function, denoted $L_{l1}$, is given in Equation (3):

$$L_{l1} = \frac{1}{|P|} \sum_{p \in P} |o(p) - e(p)|, \quad (3)$$
where p and P represent the index of the pixel and the patch, respectively, |P| is the number of pixels in the patch, and o(p) and e(p) represent the values of the pixels in the processed patch and the target one, respectively. The L1 loss can preserve pixel-wise relations between the target images and the enhanced ones of every training pair, helping the enhanced images have light intensity similar to the target. However, it gives less consideration to the overall structure of the whole image, resulting in a lack of textural details. Additionally, low-light capture usually causes structural distortions such as blurs and artifacts, which are visually salient but cannot be well handled by pixel-wise loss functions such as the mean squared error.
The SSIM loss function, however, is helpful in this situation. The SSIM value for patch P is defined in Equation (4):

$$\mathrm{SSIM}(x, y) = \frac{(2\mu_x \mu_y + c_1)(2\sigma_{xy} + c_2)}{(\mu_x^2 + \mu_y^2 + c_1)(\sigma_x^2 + \sigma_y^2 + c_2)}, \quad (4)$$
where x is the original normal-light image, y is the enhanced one, $\mu_x$ and $\mu_y$ are the respective pixel-value means, $\sigma_x^2$ and $\sigma_y^2$ are the respective variances, $\sigma_{xy}$ is the covariance, and $c_1$ and $c_2$ are constants that prevent the denominator from being zero. A larger SSIM means better quality of the processed image. Therefore, $L_{ssim}$ is defined as $1 - \mathrm{SSIM}$.
For the overall loss L, we combine $L_{ssim}$ and $L_{l1}$ as Equation (5):

$$L = L_{l1} + p \cdot L_{ssim}. \quad (5)$$
The value of p is set to 0.1 in L. The training target is to minimize L.
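A minimal PyTorch sketch of this loss is shown below. The SSIM term is computed globally per image for brevity (published SSIM implementations use a sliding Gaussian window), the weighting follows Equation (5) as reconstructed above, and c1 and c2 use the conventional SSIM constants for images scaled to [0, 1].

```python
import torch
import torch.nn.functional as F

def ssim(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """Simplified global SSIM per image; real implementations use a sliding window."""
    dims = (1, 2, 3)  # average over channels and spatial positions
    mu_x, mu_y = x.mean(dim=dims), y.mean(dim=dims)
    var_x, var_y = x.var(dim=dims), y.var(dim=dims)
    cov = ((x - mu_x.view(-1, 1, 1, 1)) * (y - mu_y.view(-1, 1, 1, 1))).mean(dim=dims)
    return ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))

def rscnn_loss(enhanced, target, p=0.1):
    l_l1 = F.l1_loss(enhanced, target)            # Equation (3)
    l_ssim = 1.0 - ssim(enhanced, target).mean()  # L_ssim = 1 - SSIM
    return l_l1 + p * l_ssim                      # Equation (5), p = 0.1
```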

Training
(1) Datasets Two datasets are used in this work: the DeepISP dataset [36] and the UCMerced dataset [37]. Their descriptions are as follows.
DeepISP: A total of 110 pairs of normal-exposure and low-exposure images are included, 77 for training and 33 for testing. The scenes, captured with a Samsung S7 rear camera, include indoor and outdoor images under both sunlight and artificial light. The images in each pair are almost the same, except that the low-light one has 1/4 of the exposure time of the normal one. The resolution of each image is 3024 × 4032. The original images are divided into patches of size 256 × 256. Figure 2 illustrates representative images of every type. This dataset is named Dataset1.
UCMerced: A land-use scene dataset containing 21 classes with 100 images per class; each image has 256 × 256 pixels. Figure 3 shows some representative images of the UCMerced dataset for every type [37]. The construction of Dataset2 from these images is described below.
As far as we know, there is no specific open dataset for low-light remote-sensing image-enhancement training. Faced with this dilemma, a set of natural low-light and normal-light image pairs generated from an ordinary image dataset, namely Dataset1 in this paper, is adopted for pretraining.
Then, because the light-source angle and camera angle of remote-sensing imaging equipment have obvious characteristics of their own compared with natural images, it is not proper to directly apply a model trained on natural image pairs to remote-sensing images. Therefore, a fine-tuning process is indispensable. First, we choose the "dense residential" images from the UCMerced dataset because, compared with other categories, these images have more diverse features, richer textures, more complex shadows, and blurrier boundaries. These complex conditions make low-light images more difficult to enhance. Then, we follow the methods of [19,29] to set the original image as the ground truth and use a degradation method to generate the corresponding low-light image, as sketched below. Each low-light image and its corresponding original are used as the input and label for RSCNN training and testing. A random gamma adjustment is used to simulate the low-light images. The parameter gamma is randomly set in the range (2, 5), enabling RSCNN to adaptively enhance images and to generalize better. Finally, a total of 100 pairs of normal-exposure and low-exposure images are used, split into 80 pairs for training and 20 pairs for testing. This dataset is named Dataset2.
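A minimal sketch of this degradation step, assuming images are stored as floating-point arrays scaled to [0, 1]:

```python
import numpy as np

def simulate_low_light(image, rng=None):
    """Darken a normal-light image (float array in [0, 1]) with a random gamma."""
    rng = rng or np.random.default_rng()
    gamma = rng.uniform(2.0, 5.0)  # gamma randomly drawn from (2, 5)
    return image ** gamma          # gamma > 1 darkens values in [0, 1]
```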
(2) Evaluation criteria PSNR, SSIM [11], and CIEDE2000 [38] are used to evaluate the performance of RSCNN. Since SSIM has already been described, here we briefly describe the PSNR evaluator as follows:

$$\mathrm{PSNR} = 10 \cdot \log_{10}\!\left(\frac{MAX^2}{\mathrm{MSE}(X, Y)}\right), \quad (6)$$
where X is the normal-light image and Y is the enhanced one generated from the low-light image. MAX represents the maximum signal value that exists in X. The higher the PSNR, the better RSCNN performs. According to Equation (6), PSNR is a variant of the mean squared error (MSE). It is a pixel-wise full-reference quality metric, computed by averaging the squared intensity differences of the enhanced result and the reference image pixels [11]. It is easy to calculate and has clear physical meaning but is not sensitive to changes in image structure and is not completely in accordance with human visual characteristics. SSIM makes up for PSNR. According to Equation (4), SSIM focuses on image structure similarity and measures image similarity in terms of brightness ($\mu_x$, $\mu_y$), contrast ($\sigma_x^2$, $\sigma_y^2$), and structure ($\sigma_{xy}$). PSNR and SSIM are widely used to evaluate the performance of low-light image-processing methods [22,24,26,39,40] and remote-sensing image-processing methods [20,41,42]. With the help of PSNR and SSIM, we can effectively evaluate the color retention and structural differences between enhanced images and reference images.
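For reference, a direct NumPy transcription of Equation (6) (assuming 8-bit images, so MAX = 255):

```python
import numpy as np

def psnr(reference, enhanced, max_val=255.0):
    """PSNR in dB between the reference image X and the enhanced image Y."""
    mse = np.mean((reference.astype(np.float64) - enhanced.astype(np.float64)) ** 2)
    return 10.0 * np.log10(max_val ** 2 / mse)
```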
Furthermore, we adopt CIEDE2000 as an evaluation criterion. It is a color-difference equation based on the CIELAB color space and was published by the International Commission on Illumination (CIE) in Publication 142-2001. It helps us evaluate the degree of color difference between the ground-truth image and the enhanced image: the smaller CIEDE2000 is, the closer the result image is to the ground truth. We use the "imcolordiff" function in MATLAB R2020b to compute CIEDE2000; it is based on [43].
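For readers without MATLAB, a roughly equivalent Python sketch using scikit-image (not the implementation used in the paper) is:

```python
import numpy as np
from skimage import color

def mean_ciede2000(reference_rgb, enhanced_rgb):
    """Mean CIEDE2000 difference between two RGB images scaled to [0, 1]."""
    lab_ref = color.rgb2lab(reference_rgb)
    lab_enh = color.rgb2lab(enhanced_rgb)
    return float(np.mean(color.deltaE_ciede2000(lab_ref, lab_enh)))
```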

Implementation Details
There are three kinds of CONVs: 1-D, 2-D, and 3-D CNNs. Since we want to treat the input image patches as wholes with their spatial information, we choose 2-D CONVs for our network [44]. The configuration of each convolution layer is shown in Figure 1. The weights of each CONV layer are initialized using kaiming_normal [45].
During training, the patch size is set to 256 × 256, and the depth of the whole network is 8. In addition, Adam optimization is adopted with a weight decay of 0.0001. The base learning rate is 0.001, and the batch size is 8. Our model is trained using PyTorch.
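Putting these details together, a hypothetical training loop might look as follows; RSCNN and rscnn_loss refer to the sketches above, while train_pairs (a dataset of 256 × 256 low-light/normal-light patch pairs) and num_epochs are assumed to be defined elsewhere.

```python
import torch
from torch.utils.data import DataLoader

model = RSCNN()  # sketch from the architecture section
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
loader = DataLoader(train_pairs, batch_size=8, shuffle=True)

for epoch in range(num_epochs):
    for low, normal in loader:
        optimizer.zero_grad()
        loss = rscnn_loss(model(low), normal)  # L1 + 0.1 * (1 - SSIM)
        loss.backward()
        optimizer.step()
```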

Baselines
Ten different methods, which are shown in Table 1, are compared with our proposed method.
As observed, different types of models are considered. Each model uses the default settings suggested by its authors.

Comparison Results on Dataset1
The experiment is first carried out on Dataset1, and 9 different methods are compared with RSCNN. Detailed results are presented in Table 2. In the experimental results, the SSIMs of DHE and CLAHE are significantly improved compared to ordinary HE, and the PSNR of DHE is the best among the baselines. Compared with the histogram-equalization algorithms, the Retinex algorithms achieve better indicator results. Among them, the SSIM of the MSRCR method is about 12% higher than that of DHE but, because its adjustment method is not globally pixel-wise, its PSNR is 8% lower than that of DHE. LIME and BIMEF, compared with the traditional histogram and Retinex methods, are better at maintaining the overall visual characteristics and the pixel-wise results. DWT-SVD is often used for low-light remote-sensing image enhancement; its results are similar to those of the enhancement algorithms based on luminance estimation. Obviously, from the perspective of quantitative analysis indicators, RSCNN outperforms the various traditional low-light enhancement algorithms and can be applied to low-light remote-sensing image-enhancement tasks. For example, the SSIM of RSCNN is 0.825, which is 0.2 higher than that of the widely used DWT-SVD algorithm. As for the PSNR, our method achieves 28.123 dB, which is much better than all of these baselines, whose PSNRs are lower than 20 dB.
As shown in Figure 4, in general, all the methods are able to obtain brighter images with stronger contrast. However, the results of many methods are not sufficient and satisfactory. For example, HE-based methods such as HE, DHE, and CLAHE can inappropriately enhance the dark background (too bright or too dark) and can cause color distortions. Retinex-based methods (i.e., SSR, MSR, and MSRCR) and LIME are able to appropriately enhance the dark background, but the color distortions are also very severe, and the background is enhanced to blue instead of its actual dark color.
As for color distortion, CLAHE, DWT-SVD, and RSCNN work relatively better, and the backgrounds of the enhanced images are very close to those of the target images. However, DWT-SVD and CLAHE suffer from over-enhancement and insufficient brightness, respectively, in the high-contrast region, which is not as natural as our proposed RSCNN. In addition, the HE- and DHE-enhanced images have significant noise, and SSR and MSR generate images that appear to be covered by haze. Meanwhile, the images enhanced by our proposed method are sharper and have better brightness than those of the other methods thanks to its powerful feature-extraction and learning ability.

Comparison Results on Dataset2
To evaluate the performance of RSCNN on low-light remote-sensing images, we fine-tuned the trained model and tested it on Dataset2. The results are presented in Table 3. In addition, Figure 5 shows the visual results comparing the proposed method with the other methods. In remote-sensing image enhancement, preserving accurate textural and structural information is very important for many applications, including scene classification [46] and object detection [47]. In addition, obtaining images with natural colors is also of great significance for visual discrimination and further analysis.

As we can see from Table 3, the comparison results indicate that RSCNN has the best performance of all the compared low-light image-enhancement methods. Specifically, the SSIM, PSNR, and CIEDE2000 of our method are 0.791, 20.936 dB, and 19.496, respectively. To complement these quantitative results, visual comparison and analysis are also needed. Figure 5 shows the image-enhancement results obtained using different methods for qualitative comparison. In addition, the patches in the two red boxes are enlarged to show detailed information. As shown in Figure 5, all the methods obtain images with stronger contrast and brightness. However, the results of CLAHE, BIMEF, and DWT-SVD may not be sufficiently enhanced since their brightness is still somewhat dim. In addition, different methods have different characteristics, resulting in different effects.
For example, in terms of image colors, the buildings obtained by HE, DHE, and LIME are enhanced to different colors, which are far from the standard natural images. The images generated by SSR, MSR, and RSCNN are much better in this respect. As for detailed information such as edges and textures in dark regions, HE, DHE, and LIME are able to obtain clear cars. However, several other methods cannot accurately replicate the detailed information. For example, the textures of the cars generated by CLAHE, BIMEF, and DWT-SVD are very dark and blurred, making it hard to figure out their shapes, and even the trees cannot be visually recognized since they are nearly black. Additionally, although the results from MSR and SSR are free of apparent color distortion, they suffer from apparent grid-like veins, which are avoided by our method. As a whole, the visual effects of RSCNN are the closest to the original image in both color and texture. For instance, RSCNN preserves the details of trees and cars and enhances the remote-sensing image with little information loss, thus making the images more realistic than those of other methods.

Conclusions
An end-to-end RSCNN model is proposed in this paper to obtain brighter images from degraded low-light images and is applied to remote-sensing images. A CNN architecture is used to achieve end-to-end enhancement for low-light remote-sensing images. The upsampling and downsampling operators are designed to learn deep features at different scales so that the enhanced images retain more detailed features. Compared to other traditional methods, our method achieves more natural results with more realistic textures and vivid details while revealing the edge and structural features as much as possible. It can greatly help subsequent high-level remote-sensing image information-discovery tasks.