Single Remote Sensing Image Dehazing Using a Prior-Based Dense Attentive Network

Remote sensing image dehazing is a challenging problem due to the irregular and non-uniform distribution of haze. In this paper, a prior-based dense attentive dehazing network (DADN) is proposed for single remote sensing image haze removal. The proposed network, built from dense blocks and attention blocks, adopts an encoder-decoder architecture that directly learns the mapping between the input images and the corresponding haze-free images, without depending on the traditional atmospheric scattering model (ASM). To better handle non-uniform hazy remote sensing images, we propose to combine a haze density prior with deep learning, where an initial haze density map (HDM) is first extracted from the original hazy image and is subsequently fed into the network together with the original hazy image. Meanwhile, a large-scale hazy remote sensing dataset, containing both uniform and non-uniform, synthetic and real hazy remote sensing images, is created for training and testing the proposed method. Experimental results on the created dataset show that the developed dehazing method achieves significant improvements over the state-of-the-art methods.


Introduction
With the advances of remote sensing technology, remote sensing imagery is increasingly utilized in numerous application fields, such as agriculture and weather studies [1], land cover monitoring [2][3][4], and so on. However, remote sensing images are often degraded by atmospheric conditions such as cloud, fog, and haze, which lowers image quality and hampers downstream analysis in many applications. Therefore, remote sensing image haze removal is a crucial and indispensable pre-processing task.
For the image dehazing problem, earlier works utilized multiple images of the same scenery [5][6][7][8]. Despite some success, these methods are not practical in real life, since acquiring several images of the same scenery under different conditions is rather difficult. Subsequently, numerous single-image dehazing methods have been developed. Some of the earlier methods make use of image enhancement techniques, including histogram-based and contrast-based methods. In [9], Xu et al. presented a solution based on contrast-limited adaptive histogram equalization to remove haze from single color images. Narasimhan et al. [10] proposed a physics-based model to describe the appearance of scenery under uniform bad weather conditions and utilized a fast algorithm to recover the scene contrast. However, these enhancement methods do not take the causes of image degradation into account, leading to common over-estimation, under-estimation, and color shift problems.
Xie et al. [27] modified the DCP for remote sensing images and developed a novel dark channel saturation prior. Despite being physically grounded, these methods are mostly sensitive to a non-uniform haze distribution, which, however, is the most common state of haze in remote sensing images.
To handle these issues, we propose a prior-based dense attentive dehazing network (DADN) for single remote sensing image dehazing. Firstly, taking the non-uniform haze distribution of hazy remote sensing images into account, we propose to extract a haze density map (HDM) from the original hazy image at the first step, which can be regarded as a haze density prior, and we subsequently use the HDM together with the original hazy image as input of the network. The proposed network contains an encoder-decoder structure and directly learns the mapping from the original input images to the corresponding haze-free images, without any intermediate parameter estimation steps, enabling the network to measure the distortion of the clear image directly, rather than that of intermediate parameters. Dense blocks are carefully constructed to effectively mine the haze-relevant information, considering the advantages of dense networks. Meanwhile, both spatial and channel attention blocks are leveraged to recalibrate the extracted feature maps, thus allowing for more adaptive and efficient training.
Our main contributions are listed as follows: (1) A single hazy remote sensing image dehazing solution, which combines both a physical prior and deep learning technology, is presented to better describe the haze distribution in remote sensing images, and thus deal with non-uniform haze removal. In this solution, we first extract an HDM from the original hazy image, and subsequently leverage the HDM prior as input of the network together with the original hazy image. (2) An encoder-decoder structured dehazing framework is proposed to directly learn clear images from input images, without the estimation of any intermediate parameters. The proposed network is constructed from dense blocks and attention blocks for accurate clear image estimation. Furthermore, we leverage a discriminator at the end of the network to fine-tune the output and ensure that the estimated dehazed result is indistinguishable from the corresponding clear image. (3) A large-scale hazy remote sensing dataset is created as a benchmark which contains both uniform and non-uniform, high-resolution and low-resolution, synthetic and real hazy remote sensing images. Experimental results on the proposed dataset demonstrate the outstanding performance of the proposed method.
The remainder of the paper is organized as follows. Section 2 describes the degradation procedure caused by haze, as well as the details of the proposed dense attentive dehazing network (DADN). The experimental settings, results, and analysis are presented in Section 3 and a further discussion is presented in Section 4. Finally, our conclusions are given in Section 5.

Atmospheric Scattering Model (ASM)
The image degradation caused by the presence of fog and haze is formulated mathematically by the ASM [11,28,29] as:

I(x) = J(x) t(x) + A (1 - t(x)), (1)

where J(x) and I(x) respectively denote the true scene radiance and the observed hazy image; A denotes the global atmospheric light, indicating the ambient light intensity; t(x) is the transmission matrix, which indicates the proportion of light that successfully arrives at the sensor; and x is the pixel location. When the atmosphere is homogeneous, the transmission matrix t(x) can be expressed as:

t(x) = e^(-β d(x)), (2)

where d(x) and β denote the depth of field and the extinction coefficient of the atmosphere, respectively. To obtain a clear image J(x) as output, we rewrite the model in Equation (1) as:

J(x) = (I(x) - A) / t(x) + A. (3)

According to the classical ASM, a similar three-step methodology is adopted in most of the existing single-image dehazing solutions: (1) estimate the transmission map t(x) from the original hazy image I(x); (2) estimate the atmospheric light A using some other (often empirical) method; (3) compute the clean image J(x) via Equation (3). Despite being intuitive and physically grounded, this three-step methodology transforms the problem of clear image reconstruction into an estimation problem for the parameters t(x) and A, giving rise to a suboptimal image restoration quality. To deal with this problem, we develop an encoder-decoder dehazing framework which directly learns the clear haze-free image from the original hazy image, enabling the network to measure the distortion of the clear image directly. Figure 1 presents the overall structure of the proposed DADN. Inspired by the success of dense networks [30,31] and attention mechanisms [32,33] in numerous computer vision tasks, we carefully designed the network in an encoder-decoder structure with dense blocks and attention blocks.
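The forward synthesis in Equation (1) and the inversion in Equation (3) can be sketched numerically as follows; the array shapes, the constant depth, and the chosen values of β and A are purely illustrative.

```python
import numpy as np

def synthesize_haze(J, t, A):
    """Forward ASM (Eq. 1): I(x) = J(x) * t(x) + A * (1 - t(x))."""
    return J * t + A * (1.0 - t)

def dehaze(I, t, A, t_min=1e-3):
    """Inverse ASM (Eq. 3): J(x) = (I(x) - A) / t(x) + A.
    The transmission is floored at t_min to avoid division by zero."""
    return (I - A) / np.maximum(t, t_min) + A

# Round trip: a clear patch is hazed with a constant transmission,
# then recovered exactly when t and A are known.
J = np.random.rand(4, 4, 3)      # clear scene radiance in [0, 1]
beta, d = 0.6, 1.0               # extinction coefficient and (constant) depth
t = np.exp(-beta * d)            # Eq. 2: t(x) = exp(-beta * d(x))
I = synthesize_haze(J, t, A=0.9)
J_rec = dehaze(I, t, A=0.9)
```

The round trip illustrates the point made above: the quality of the recovered J depends entirely on the estimates of t(x) and A, which is why the proposed network avoids estimating them.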

Network Architecture
Specially designed for single remote sensing image dehazing, where non-uniform haze is the most common state, our solution combines a haze density prior with deep learning to better describe the haze distribution. In this solution, we first extract an initial HDM from the original hazy image, which can be regarded as a haze density prior, and subsequently use it as input of the network, together with the original hazy image. Furthermore, our network contains a discriminator at the end to fine-tune the dehazed output and ensure that the estimated dehazed result is indistinguishable from the corresponding clear haze-free image.

Haze Density Map (HDM)
For remote sensing images, the depth of field d(x) in Equation (2) can be regarded as a constant, since the distance between the sensor and the scene is always very large; thus, the haze intensity is mostly affected by the extinction coefficient β, which depends on the atmospheric conditions and is rather unpredictable. Meanwhile, for a single regular close-range image, the extinction coefficient can be regarded as a constant since the distance is limited, and thus the haze intensity is mostly affected by the depth of field, which is much easier to represent. The first two rows of Figure 2 compare close-range hazy images and remote sensing hazy images, where the haze in the remote sensing images is much more irregularly distributed. Therefore, the haze intensity in remote sensing images is much more difficult to describe. To deal with this issue, we combine a haze density prior with deep learning. Firstly, we extract a raw HDM from the original input hazy image. According to the assumption developed by Pan et al. in [34] that the minimal intensity value in hazy regions of a given image is higher than that in haze-free regions, we extract the minimal intensity among the R, G, B channels to roughly describe the distribution of haze in the original hazy image. Thus, the raw HDM is defined as:

M(x) = min_{c ∈ {R,G,B}} I^c(x), (4)

where the hazy image is normalized to [0,1] and represented as I. The saturation S(x), which indicates the purity of the color, is further utilized to make the HDM more precise, since the saturation in a haze-free region will be higher than that in a hazy region. Therefore, the modified HDM is expressed as:

M'(x) = M(x) (1 - S(x))^λ, (5)

where λ acts as an adjusting factor which controls how dark the haze-free regions will be, and is empirically set to 2 in this paper to ensure that the haze-free regions are sufficiently dark; R, G, and B respectively represent the three color channels.
Meanwhile, morphological opening [35] and a guided filter [36] are applied to reduce the impact of the scene texture, since the extracted HDM may retain some scene texture. Finally, the HDMs are extracted as shown in the third row of Figure 2. The extracted HDM, which is regarded as the haze density prior, is then fed into the developed network, together with the original hazy image, to help the network better extract the haze-relevant features that describe the haze distribution.

Encoder
The encoder, which maps the original inputs to an intermediate feature map, is carefully constructed with dense blocks and attention blocks.
(1) Dense Block. To tackle the issue of vanishing gradients, Huang et al. [37] developed a densely connected network, based on the observation that CNNs with shorter connections between front and back layers can be substantially deeper and trained much more effectively. The structure of a 6-layer dense block is presented in Figure 3. In a dense block, every layer is connected with all subsequent layers in a feed-forward fashion, which alleviates the vanishing gradient problem and at the same time strengthens feature propagation. Taking these advantages into account, we utilize dense blocks with different numbers of layers to construct our network.

(2) Attention Block. Inspired by the success of the attention mechanism in various computer vision problems [32,33,38], our network contains a residual channel-spatial attention block (RCSAB) to recalibrate the extracted feature maps, making the whole network focus more on important features and thus better describe the non-uniform haze distribution. The proposed RCSAB takes advantage of both channel attention blocks and spatial attention blocks, operating them in parallel. A residual block is further combined for better feature mining. The architecture of the RCSAB is presented in Figure 4, and the channel attention block is shown in Figure 5. Channel attention focuses on finding the most meaningful features among the input feature maps, since every channel of the feature maps can be regarded as a feature detector. To compute the channel attention map efficiently, the input feature maps are squeezed along the spatial dimension using both average-pooling and max-pooling. Convolutions with a kernel size of 1 × 1 are performed after the pooling, and element-wise addition is applied to combine the feature maps from the two pooling operations.
Finally, we obtain the output feature maps by multiplying the channel attention maps with the original input feature maps. Differing from channel attention, spatial attention is utilized to find out which part of the given input is informative. The most informative part, which usually contains vital information about the haze distribution, is then the focus of further learning. For its computation, max-pooling and average-pooling are performed along the channel axis. The pooled features are then concatenated to create a spatial attention map (see Figure 6). Similarly, multiplication is performed between the computed spatial attention map and the original input feature maps to obtain the final output. To better exploit the benefits of both blocks, we combine their output features by element-wise addition. Meanwhile, we further integrate the spatial and channel attention blocks with a residual block, focusing only on the residual part between the input and output for more effective feature extraction.
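The parallel channel/spatial gating with a residual connection can be sketched as follows. This is a simplified numpy illustration: the 1 × 1 convolutions after pooling and the convolution on the concatenated spatial maps are omitted (replaced by a direct addition of the pooled statistics), so it captures the data flow, not the learned parameters.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(F):
    """Squeeze spatial dims with average- and max-pooling, combine by
    addition, then gate each channel (1x1 convolutions omitted)."""
    w = sigmoid(F.mean(axis=(0, 1)) + F.max(axis=(0, 1)))   # (C,) weights
    return F * w                                            # broadcast over H, W

def spatial_attention(F):
    """Pool along the channel axis, then gate each spatial location
    (the convolution on the concatenated maps is omitted)."""
    w = sigmoid(F.mean(axis=2) + F.max(axis=2))             # (H, W) weights
    return F * w[..., None]

def rcsab(F):
    """Residual channel-spatial attention: run both branches in parallel,
    add their outputs, and keep a residual connection to the input."""
    return F + channel_attention(F) + spatial_attention(F)

F = np.random.rand(8, 8, 16)   # toy (H, W, C) feature map
out = rcsab(F)
```

The residual connection guarantees that the recalibration can only add information on top of the input features, which is the design intent described above.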
As presented in Figure 1, the encoder contains three dense blocks, the corresponding transition blocks for down-sampling, and the RCSAB. The three dense blocks have 6, 12, and 24 layers, respectively. Details of each layer are provided in Table 1. The feature size after the transition blocks is 1/32 of the input size, and the RCSAB does not change the feature size. Similarly, the decoder is made up of five dense blocks and the corresponding transition blocks for up-sampling (Figure 1). To better integrate features at different scales, a pyramid pooling block [39] is added at the end of the decoder, where four pooling operations with different kernel sizes (1/32, 1/16, 1/8, 1/4) are utilized. The features after the pooling operations are upsampled to the original size and then combined with the input feature to generate the result. Details of these layers are provided in Table 2. To make sure that the estimated dehazed result is almost indistinguishable from the corresponding clear image, a discriminator block is applied at the end of the net, where the abovementioned encoder-decoder architecture can be regarded as a generator network. In our discriminator, several convolutions with kernel size 4 × 4 are performed. Given a 512 × 512 input image, the output size is 62 × 62. The discriminator structure is shown in Figure 7 and details are provided in Table 3.
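The pool-upsample-concatenate flow of the pyramid pooling block can be sketched as below. The grid sizes `bins` and nearest-neighbour upsampling are illustrative stand-ins for the paper's four kernel sizes and its upsampling operator.

```python
import numpy as np

def pyramid_pooling(F, bins=(1, 2, 4, 8)):
    """Pyramid pooling sketch: average-pool the (H, W, C) feature map into
    several grid sizes, upsample each grid back to (H, W) with nearest
    neighbour, and concatenate everything with the input along channels."""
    H, W, C = F.shape
    outs = [F]
    for b in bins:
        # block-average into a (b, b, C) grid
        pooled = F.reshape(b, H // b, b, W // b, C).mean(axis=(1, 3))
        # nearest-neighbour upsample back to (H, W, C)
        up = pooled.repeat(H // b, axis=0).repeat(W // b, axis=1)
        outs.append(up)
    return np.concatenate(outs, axis=2)

F = np.random.rand(32, 32, 4)
out = pyramid_pooling(F)   # input channels plus one pooled copy per bin
```

Each appended copy summarizes the features at a coarser scale, which is what lets the decoder mix local detail with global context.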

Loss Function
For the discriminator, we train it to maximize the probability of assigning the correct label (0 or 1) to the training samples. The loss function for the proposed discriminator can be defined as [40]:

L_D = -E_x[log D(x)] - E_z[log(1 - D(G(z)))], (6)

where D(x) denotes the discriminator's estimate of the probability that a real data instance x is real, E_x is the expected value over all real data instances, G(z) is the generator's output when given input z, D(G(z)) denotes the discriminator's estimate of the probability that a fake instance is real, and E_z is the expected value over all generated fake instances G(z). For the front encoder-decoder generator architecture, the loss function is composed of the edge-preserving loss and the generator loss:

L = L_E + λ_GAN L_GAN, (7)

where the generator loss is expressed as:

L_GAN = -E_z[log D(G(z))]. (8)

The edge-preserving loss was developed by Zhang et al. [19] to tackle the halo artifacts commonly produced by the plain L2 loss, and contains three components: the L2 loss, the two-directional gradient loss (both horizontal and vertical), and the feature edge loss:

L_E = λ_{E,l2} L_{E,l2} + λ_{E,g} L_{E,g} + λ_{E,f} L_{E,f}, (9)

where L_E is the edge-preserving loss, L_{E,g} denotes the gradient loss, and L_{E,f} indicates the feature edge loss. The two-directional gradient loss is defined as:

L_{E,g} = ||H_h(G(I)) - H_h(J)||_2^2 + ||H_v(G(I)) - H_v(J)||_2^2, (10)

where G is the encoder-decoder network, I denotes the input hazy image, and J indicates the target clear image. H_h and H_v represent the gradient operators computed horizontally and vertically, respectively. The feature edge loss L_{E,f} is expressed as:

L_{E,f} = ||c_1(G(I)) - c_1(J)||_2^2 + ||c_2(G(I)) - c_2(J)||_2^2, (11)

where c_i represents a CNN feature extractor. In this paper, the layers before relu1-1 and relu2-2 of VGG-16 [41] are utilized as c_1 and c_2, respectively.

In summary, the proposed network adopts an encoder-decoder structure and is composed of dense blocks and attention blocks. Considering the non-uniform haze distribution in remote sensing imagery, we first extract an initial HDM from the original hazy image, which can be regarded as the haze density prior, and use it as input of the network, together with the original hazy image.
Furthermore, our network contains a discriminator at the end to fine-tune the dehazed result and guarantee that the final estimated result is indistinguishable from the corresponding clear image. For the loss function, we combine the edge-preserving loss and the generator loss. The network parameters are then obtained by minimizing this loss function.
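The edge-preserving combination of L2, gradient, and feature terms can be sketched in a few lines of numpy. The identity feature extractor stands in for the VGG-16 layers, and the default weights mirror the values reported in the training details; the term names are ours.

```python
import numpy as np

def gradient_loss(pred, target):
    """Two-directional gradient loss: squared distance between the
    horizontal and vertical finite differences of output and target."""
    dx_p, dy_p = np.diff(pred, axis=1), np.diff(pred, axis=0)
    dx_t, dy_t = np.diff(target, axis=1), np.diff(target, axis=0)
    return ((dx_p - dx_t) ** 2).mean() + ((dy_p - dy_t) ** 2).mean()

def edge_preserving_loss(pred, target, lam_l2=1.0, lam_g=0.5, lam_f=0.8,
                         feat=lambda x: x):
    """Weighted sum of L2, gradient, and feature losses; `feat` is a
    placeholder for the VGG-16 feature extractor used in the paper."""
    l2 = ((pred - target) ** 2).mean()
    lf = ((feat(pred) - feat(target)) ** 2).mean()
    return lam_l2 * l2 + lam_g * gradient_loss(pred, target) + lam_f * lf
```

The gradient term is what penalizes halo artifacts: a constant intensity offset leaves it at zero, while a spurious edge in the output is punished even if the pixel-wise L2 error is small.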

Datasets
A large-scale hazy remote sensing image dataset is created for this experiment. Since it is impractical to obtain paired haze-free and hazy remote sensing images of the same view and the same scene at the same time, synthetic hazy images were used for the network training. We synthesized both uniform and non-uniform hazy remote sensing image pairs as the training dataset, which contained a total of 13,980 images (12,000 for training and 1980 for validation) derived from the Aerial Image Dataset (AID) developed by Xia et al. [42], which was originally developed for aerial scene classification.
For the generation of uniform hazy images, we set the atmospheric light A for a single image uniformly in [0.5, 1], and selected the transmission t ∈ {0.4, 0.6}. The uniform hazy image was then generated through Equation (1), using clear images from the AID dataset. In this experiment, 720 uniform hazy remote sensing images were generated with a size of 512 × 512. For the non-uniform hazy images, we extracted 19 different transmission maps from real non-uniform remote sensing images using the method proposed by Pan et al. [34] and applied them to the clear images from the AID dataset, thus generating, in total, 8940 non-uniform hazy images with a size of 512 × 512. Moreover, we combined the generated uniform transmission maps and non-uniform transmission maps to imitate more complex environments, so that another 4320 hazy images with a size of 512 × 512 were generated. Therefore, a hazy remote sensing image dataset with a total of 13,980 image pairs (hazy image and corresponding clear image) was developed, containing both uniform and non-uniform hazy remote sensing images.
As for the test datasets, we constructed four kinds of datasets, containing both uniform and non-uniform, high-resolution and low-resolution, synthetic and real hazy remote sensing images, as listed below:  Test Dataset 1: Test dataset 1 consisted of 1650 synthetic uniform hazy remote sensing images. We simulated the images through the classical ASM in Equation (1).

Training Details
We used the PyTorch [43] framework for training and testing. The model was trained on an NVIDIA RTX 2080 Ti GPU. ADAM [44] was leveraged as the optimization algorithm, with a batch size of 1. Meanwhile, we chose λ_{E,l2} = 1, λ_{E,g} = 0.5, and λ_{E,f} = 0.8 for the edge-preserving loss, and λ_GAN = 0.25 for the generator loss.

Evaluation Criteria
We utilize the full-reference criteria of the peak signal-to-noise ratio (PSNR) and the structural similarity index (SSIM) to evaluate the dehazing results. For a single dehazed image, a higher PSNR value denotes a higher pixel-wise similarity between the result and the reference image, and a higher SSIM value denotes that the dehazed result is closer to the reference image in terms of structural properties.
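For reference, PSNR for images scaled to [0, 1] reduces to a few lines (SSIM, which additionally compares local luminance, contrast, and structure statistics, is usually taken from a library such as scikit-image rather than reimplemented); the toy images below are illustrative.

```python
import numpy as np

def psnr(ref, out, data_range=1.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(R^2 / MSE)."""
    mse = ((ref - out) ** 2).mean()
    if mse == 0:
        return float('inf')
    return 10.0 * np.log10(data_range ** 2 / mse)

ref = np.random.rand(16, 16)
noisy = np.clip(ref + 0.01 * np.random.randn(16, 16), 0, 1)
# A nearly identical image scores a high PSNR; heavier distortion lowers it.
```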

Experimental Results
In this section, the experimental results obtained with both synthetic hazy remote sensing images and real-world hazy remote sensing images are presented. We compare the proposed solution with five prevailing methods: DCP [12], BCCR [13], fast visibility restoration (FVR) [45], the All-in-One Dehazing Network (AOD-Net) [18], and DCPDN [19].

Results on Test Dataset 1
The results obtained with the synthetic uniform dataset are shown in Figure 9. The first column presents the synthetic hazy remote sensing images, the last column presents the ground truth images, and the other columns show the dehazed results of the different methods. DCP successfully removes most of the haze but tends to over-enhance the images, especially in the large white areas, since the DCP prior fails when the color of the object is close to the atmospheric light. BCCR exhibits the same over-enhancement problem and leads to some color distortion (see the first two images), which is mainly due to the underestimation of the transmission matrices. The results of DCP and BCCR indicate the disadvantage of the prior-based methods, in that they can only obtain an accurate estimation when their assumptions fit perfectly. Since these priors are statistically derived from natural images, they may not fit remote sensing images, leading to unsatisfactory dehazing results. FVR retains much of the haze and leads to obvious color distortions and artifacts, although it is computationally fast. The results of AOD-Net are not clear enough and tend to be dimmer than the ground truth. DCPDN achieves much more pleasing results on this uniform hazy dataset. However, unlike AOD-Net, DCPDN tends to lighten the dehazed results (see the last image), mainly because of the inaccurate estimation of the atmospheric light. In contrast, the developed method, which avoids the estimation of intermediate parameters (transmission and atmospheric light) and directly measures the distortion of the clear image rather than that of intermediate parameters, obtains the most pleasing results, with color and structural details that are the closest to the true haze-free images, verifying the advantage of the proposed network structure.
The PSNR and SSIM results are listed in Table 4. In accordance with the visual results, the developed method outperforms the other five methods, which demonstrates the outstanding performance of the proposed solution on uniform hazy remote sensing images.

Results on Test Dataset 2

The results obtained with the synthetic non-uniform hazy remote sensing images using the developed method and the other prevailing methods are presented in Figure 10. Similarly, BCCR and DCP tend to over-enhance the results. Meanwhile, when dealing with large-scale non-uniform haze (see the last four hazy images), both DCP and BCCR fail to detect and handle haze of different intensities, achieving rather unsatisfactory results, which indicates that these prior-based methods are limited when faced with non-uniform remote sensing images. FVR introduces serious color distortions and artifacts and cannot remove all the haze, but it appears to be less sensitive to non-uniform haze and thus achieves a better performance on this non-uniform dataset. The deep learning based AOD-Net and DCPDN are clearly sensitive to the non-uniform haze distribution, and retain obvious vestiges of non-uniform haze in their dehazed results (see the last five images), indicating that detecting and removing non-uniform haze from a single remote sensing image is rather difficult for these deep learning-based methods without any additional prior information. By contrast, the developed method, benefiting from the HDM prior, successfully removes all the non-uniform haze and achieves results that are the closest to the true clear images.

Similarly, we utilize PSNR and SSIM to evaluate the dehazing results (see Table 5). The table shows that the developed method obtains the highest PSNR and SSIM values, outperforming the other five methods and achieving the best dehazing results, in accordance with the visual analysis. Overall, the proposed method handles both uniform and non-uniform hazy images successfully and achieves the most visually pleasing dehazed results, without color distortions or artifacts.

Results on Test Dataset 3
The dehazing results obtained with the real hazy UAV images are shown in Figure 11. DCP and BCCR successfully remove most of the haze, but tend to over-enhance the image and lead to some color shift. FVR retains much of the haze in the result images, and introduces obvious color distortions and artifacts. AOD-Net and DCPDN again fail to remove all the haze, while the developed solution, with high contrast, vivid color, clear structure, and plausible results, obtains the most pleasing visual results, which demonstrates the effectiveness of the proposed method.

Results on Test Dataset 4
The results obtained with the second real hazy dataset of Landsat 8 OLI images are shown in Figure 12. As can be seen, DCP and BCCR are sensitive to the non-uniform haze, and obvious traces of the non-uniform haze can still be seen. BCCR leads to obvious color distortion (especially in the fourth image) and tends to over-enhance the image. The results of FVR do not retain many traces of non-uniform haze, indicating that FVR is less sensitive to non-uniform haze, but it cannot remove all the haze, and introduces serious artifacts and color distortions. AOD-Net and DCPDN fail to remove all the haze, with much non-uniform haze remaining. In contrast, the proposed method removes most of the haze successfully, and very few traces of haze remain in the results. Overall, the proposed method, which avoids estimating the transmission matrices and atmospheric light with the help of the HDM prior, performs better than the other five methods when handling non-uniform haze in remote sensing images.

Discussion
In this study, we proposed a dense attentive dehazing network (DADN) which combines a physical prior with deep learning technology to directly learn the mapping between the original input images and the corresponding haze-free images. Specially designed for single remote sensing image dehazing, the method first extracts an HDM from the original hazy image, which can be regarded as a haze density prior, and subsequently combines the HDM with the original hazy image as input of the network for a better description of the non-uniform haze distribution in hazy remote sensing images. Meanwhile, both spatial and channel attention blocks are carefully constructed in the network to recalibrate the extracted feature maps, thus allowing more adaptive and efficient training. To make sure that the estimated dehazed result is indistinguishable from the corresponding clear image, we further utilize a discriminator at the end of the network to refine the output.
To further validate the effectiveness of each module of the network, we conducted experiments on a network without the HDM (DADN_noHDM), a network without the discriminator (DADN_noDISCRI), and a network without the attention blocks (DADN_noRCSAB). The results are presented in Figure 13 and Table 6. DADN_noHDM and DADN_noRCSAB fail to detect the non-uniform haze, and obvious vestiges of haze remain, especially in the last two images, indicating that models without the HDM prior and the RCSAB lack the ability to mine high-level haze-relevant features, and thus fail to remove all the non-uniform haze. Meanwhile, for the PSNR and SSIM criteria in Table 6, the proposed DADN method considerably outperforms DADN_noHDM and DADN_noRCSAB, which demonstrates that the HDM haze density prior and the attention module (RCSAB) are important and effective in the detection and removal of the non-uniform haze existing in remote sensing images. In terms of visual effects, DADN_noDISCRI and DADN are the most competitive methods, with vivid color, clear structure, and most of the non-uniform haze removed, while in the quantitative results, DADN outperforms DADN_noDISCRI, with the PSNR improved by 0.5. The quantitative results on the large-scale test data further validate the effectiveness of the proposed discriminator. Furthermore, a comparison of the average processing time (per image) was conducted. As can be seen, our modules bring an obvious improvement in dehazing performance at the cost of less than a 0.06 s increase in time per image, which is acceptable.
Overall, all the modules, i.e., the HDM prior, the RCSAB, and the discriminator, are effective and necessary for single remote sensing image haze removal.

Conclusions
In this paper, we proposed a specialized solution for single remote sensing image dehazing which combines a haze density prior with deep learning technology. In this solution, the haze density prior (HDM) is first extracted from the original hazy image and subsequently used as input of the network, together with the original hazy image. The effectiveness of the HDM input has been further demonstrated through comparative experiments. A dense attentive dehazing network (DADN) is also presented, which is composed of dense blocks and attention blocks (both spatial and channel attention) and directly learns the mapping from the input images to the corresponding haze-free images. The whole network adopts an encoder-decoder architecture and has a discriminator at the end to further refine the dehazed results.
A large-scale hazy remote sensing dataset was created as a benchmark, containing both uniform and non-uniform, synthetic and real hazy remote sensing images. The experimental results on the created dataset demonstrated that DADN achieves better performance than the other prevailing dehazing algorithms, especially for non-uniform haze distributions. However, the proposed method may be challenged when handling large-scale dense haze. In future work, we will attempt to improve the performance of DADN by incorporating more background prior knowledge and further mining haze-relevant features, to better handle the large-scale dense haze existing in single remote sensing images.