Attention Enhanced Serial Unet++ Network for Removing Unevenly Distributed Haze

The purpose of image dehazing is to reduce the image degradation caused by suspended particles in order to support high-level visual tasks. Besides the atmospheric scattering model, convolutional neural networks (CNNs) have been used for image dehazing. However, existing image dehazing algorithms are limited in the face of unevenly distributed haze and dense haze in real-world scenes. In this paper, we propose a novel end-to-end convolutional neural network called attention enhanced serial Unet++ dehazing network (AESUnet) for single image dehazing. We build a serial Unet++ structure that adopts a serial strategy of two pruned Unet++ blocks based on residual connection. Compared with the simple Encoder–Decoder structure, the serial Unet++ module can better use the features extracted by encoders and promote contextual information fusion in different resolutions. In addition, we make several improvements to the Unet++ module, such as pruning, introducing the convolutional module with ResNet structure, and a residual learning strategy. Thus, the serial Unet++ module can generate more realistic images with less color distortion. Furthermore, following the serial Unet++ blocks, an attention mechanism is introduced to pay different attention to haze regions with different concentrations by learning weights in the spatial domain and channel domain. Experiments are conducted on two representative kinds of datasets: the large-scale synthetic dataset RESIDE and the small-scale real-world datasets I-HAZY and O-HAZY. The experimental results show that the proposed dehazing network is not only comparable to state-of-the-art methods on the RESIDE synthetic datasets, but also surpasses them by a very large margin on the I-HAZY and O-HAZY real-world datasets.


Introduction
When light spreads in dense suspended particles such as fog, haze, smoke, dust, etc., the image information collected by imaging sensors is seriously degraded due to the scattering of the particles, which causes the loss of a large amount of useful information and greatly limits high-level vision tasks. The purpose of image dehazing is to eliminate the influence of the atmospheric environment on image quality, increase the visibility of images, and provide support for downstream vision tasks such as classification, localization, and self-driving systems. In the past few decades, single image dehazing has been widely used for outdoor video surveillance systems, such as highway traffic, forest, and grassland ecology. As a foundational low-level vision task, single image dehazing has gained more and more attention from the computer vision community and artificial intelligence companies over the world.
Image dehazing methods can generally be divided into traditional methods and learning-based methods. Traditional image dehazing algorithms are mostly based on hypothetical models, among which the atmospheric scattering model introduced in [1,2] is one of the most successful. The atmospheric scattering model explains the formation of haze well, and therefore it also provides a theoretical basis for traditional dehazing algorithms.
The main contributions of this work are as follows:
• We propose a novel end-to-end attention enhanced serial Unet++ dehazing network. The serial Unet++ module extracts features in different resolutions and effectively fuses them to restore thick hazy images. An attention mechanism is introduced to pay different levels of attention to haze regions with different concentrations;
• We build a serial Unet++ structure that is responsible for fully extracting features of different resolutions and reconstructing them on different scales. The serial Unet++ structure directly transmits the original information of the shallow layers to the subsequent deeper layers, so that the deeper layers focus on residual learning while reusing shallow contextual information. Thus, the structure can not only avoid the degradation of the model, but also fuse shallow contextual information into deep features, which contributes to generating more realistic images with less color distortion in the face of dense haze regions;
• To remove the haze and restore the image information as much as possible, we make several improvements to the Unet++ module. First, the original Unet++ is pruned to avoid expansion of model parameters. Besides, in the down-sampling operation, we replace the simple convolutional layer with the convolutional module with ResNet structure in order to prevent the loss of original information in the transmission to the deep network;
• The different pixel values in the spatial domain and different feature channels show different sensitivities to haze regions with different concentrations. We introduce the attention module at the bottom of the decoder to assign different weights to different spaces and channels, which helps to pay different levels of attention to haze regions with different concentrations and further enables the network to learn the uneven haze in images.
The rest of the paper is organized in the following way. Section 2 describes recent studies related to our work. Section 3 presents the proposed network, including a Unet++-based structure of the Encoder-Decoder and the learnable attention modules. The experiment results are discussed in Section 4, while the ablation studies are drawn in Section 5. The conclusion and acknowledgement are put at the end.

Related Works
The atmospheric scattering model was first introduced by McCartney [1,2] and further developed by Narasimhan [27] and Nayar [28]. It is widely used for describing the formation of hazy images and is formulated as:
$$I(z) = J(z)\,t(z) + A\,\left(1 - t(z)\right)$$
where I(z) is the observed hazy image, J(z) is the haze-free image to be recovered, t(z) is the medium transmission map, and A is the global atmospheric light. When the atmospheric light A is homogeneous, the transmission map can be expressed as:
$$t(z) = e^{-\beta d(z)}$$
where β is the scattering coefficient of the atmosphere, and d(z) represents the scene depth. Given a hazy image I, the target haze-free image J can be calculated from the above two formulas. The formulas show that, in order to recover the haze-free image J, three key parameters must be estimated correctly. However, in practice, these parameters usually cannot be obtained directly. Therefore, many researchers rely on different priors to estimate them.
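The two formulas above can be illustrated on a single pixel. The following is a minimal sketch, assuming a known depth, scattering coefficient, and atmospheric light; the function names, the pixel values, and the lower clamp `t_min` on the transmission are illustrative assumptions and do not come from the paper.

```python
import math

def transmission(depth, beta):
    """t(z) = exp(-beta * d(z)), the transmission for a homogeneous atmosphere."""
    return math.exp(-beta * depth)

def add_haze(clear, depth, beta, airlight):
    """Forward model I = J * t + A * (1 - t), applied to each colour channel."""
    t = transmission(depth, beta)
    return [j * t + airlight * (1.0 - t) for j in clear]

def remove_haze(hazy, depth, beta, airlight, t_min=0.1):
    """Inverse model J = (I - A) / t + A.

    t_min is an assumed safeguard that keeps the division stable
    when the transmission is close to zero (very dense haze).
    """
    t = max(transmission(depth, beta), t_min)
    return [(i - airlight) / t + airlight for i in hazy]

# Hypothetical RGB pixel with values in [0, 1].
clear_pixel = [0.20, 0.50, 0.80]
hazy_pixel = add_haze(clear_pixel, depth=2.0, beta=0.5, airlight=0.9)
restored = remove_haze(hazy_pixel, depth=2.0, beta=0.5, airlight=0.9)
```

When t(z), A, and the depth are known exactly, the inversion recovers the clear pixel. In practice these quantities must be estimated, which is where the estimation errors discussed in this section come from.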
He et al. [3] discovered the DCP (dark channel prior) from statistical observation and used it to estimate the transmission map. However, DCP becomes invalid in regions with high brightness. Zhu et al. [4] introduced CAP (color attenuation prior) to describe the relationship among brightness, saturation, and the density of haze. Berman et al. [5] proposed a non-local prior, which states that the colors of a haze-free image form tight, non-local clusters in RGB space, and that their varying distances translate to different transmission coefficients in the presence of haze. Derived from a local linear model, He et al. further put forward a guided filtering method [6] which is cost-efficient in haze removal without the use of a complex atmospheric model. Although many improved atmospheric models [7][8][9][10][11] have achieved considerable success, they also show insufficient robustness in dealing with more complex real-world scenes. Moreover, the error of the prior still cannot be completely avoided, which directly causes color distortions in restored images.
Since image dehazing is a highly ill-posed problem, the existing methods often use strong priors or assumptions as additional constraints to restore the transmission map, global atmospheric light, and scene radiance. Due to unavoidable errors caused by the estimation of some middle parameters, the atmospheric scattering model has been gradually replaced by end-to-end models [29][30][31][32][33][34][35][36][37] to directly generate the dehazed image. For instance, Surez et al. [24] employed a triplet of Generative Adversarial Network (GAN) [38] to remove the haze on each color channel independently. A GAN-based enhanced pix2pix dehazing network (EPDN) [34] was designed to have a multi-resolution generator and a multi-scale discriminator followed by the pyramid pooling enhancer module. Dong et al. [35] also borrowed the structure of GAN for image dehazing. They introduced the frequency domain information into the generator network as a priori knowledge to deal with the problem of color distortion. Inspired by knowledge distillation, Wu et al. [36] designed a two-stream dehaze network, KTDN, to transfer the knowledge learned from abundant haze-free images. Chen et al. [37] adopted a smoothed dilation technique to help to remove the gridding artifacts and leverage a gated sub-network to fuse the features from different levels. The methods mentioned above have significantly improved the performance of dehazed images; however, these generic methods suffered from the problems of complex models, unevenly distributed haze and insufficient dehazing degree after reconstruction.
The Unet model was first proposed for application in biomedical image segmentation [39][40][41] and soon stretched to a variety of visual tasks [42,43]. On account of its mirrored down-and-up-sampling structure, the Unet structure can pay more attention to the contextual information in one image and restore the features' scale to the size of the original image, which is significant to the end-to-end tasks. Additionally, long connections are also used to fuse the features extracted by the previous down-sampling parts to the later up-sampling parts with the same resolution. Unet++ redesigns the network by adding more skip routes and short connections between different resolutions. Therefore, such an operation can improve the efficiency of feature utilization and avoid the introduction of too many parameters.

Architecture
In this section, we present the AESUnet in detail, including the pipeline and the stabilization of the whole network, the encoder-decoder structure of serial Unet++ with local residual learning, and the attention module.

Pipeline Overview
The pipeline of the network, shown in Figure 1, consists of two Unet++ blocks connected in series. The input of this network is hazy images. Two serial Unet++ blocks are responsible for fully extracting features of different resolutions and reconstructing them on different scales. When the output features (shallow features) of the first Unet++ block are passed to the second Unet++ block, they are also passed along a skip connection to be concatenated with the output features (deep features) of the second block. Through this residual connection, shallow contextual information can be used again; that is, the serial Unet++ structure allows the original information of the shallow layers to be directly transmitted to the subsequent deeper layers, so that the deeper layers can focus on residual learning and avoid the degradation of the model. After obtaining the concatenated features, we apply an attention module to pay different attention to haze regions with different concentrations and adopt two convolutional layers to reduce the channels to three. At last, we add the original hazy images to the finally extracted feature channels and obtain the haze-free images.
Figure 1. The whole structure of the AESUnet including two Unet++ blocks, an attention module, two convolutional layers, and some skip connections between them. The network is a fully end-to-end structure. Two Unet++ structures are used to extract shallow features and deep features, respectively, and these features are concatenated in channel dimension. Attention module is then employed to have the network learn the distribution of haze. Finally, perceptual loss and reconstruction loss are used to help the training process.
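The data flow described above can be sketched in PyTorch as follows. This is only a wiring sketch under stated assumptions: the two Unet++ blocks are replaced by single convolutions as hypothetical stand-ins, the attention module is replaced by `nn.Identity`, and the class name `SerialDehazeSketch`, the feature width `feat_ch = 16`, the 3 × 3 kernels, and the ReLU between the two output convolutions are illustrative choices rather than the authors' settings.

```python
import torch
import torch.nn as nn

class SerialDehazeSketch(nn.Module):
    """Wiring of the pipeline: block1 -> block2, concat, attention, two convs, global skip."""

    def __init__(self, block1, block2, attention, feat_ch):
        super().__init__()
        self.block1 = block1          # produces the shallow features
        self.block2 = block2          # produces the deep features
        self.attention = attention    # acts on the concatenated features
        # Two convolutional layers that reduce the channels to three (RGB).
        self.tail = nn.Sequential(
            nn.Conv2d(2 * feat_ch, feat_ch, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(feat_ch, 3, kernel_size=3, padding=1),
        )

    def forward(self, hazy):
        shallow = self.block1(hazy)
        deep = self.block2(shallow)
        # Shallow features are reused by concatenating them with the deep features.
        fused = torch.cat([shallow, deep], dim=1)
        fused = self.attention(fused)
        # The hazy input is added back, so the network learns a residual image.
        return hazy + self.tail(fused)

feat_ch = 16  # assumed feature width
model = SerialDehazeSketch(
    block1=nn.Conv2d(3, feat_ch, kernel_size=3, padding=1),        # stand-in for Unet++ block 1
    block2=nn.Conv2d(feat_ch, feat_ch, kernel_size=3, padding=1),  # stand-in for Unet++ block 2
    attention=nn.Identity(),                                       # stand-in for the attention module
    feat_ch=feat_ch,
)
out = model(torch.rand(1, 3, 64, 64))
```

Because every stage preserves the spatial size, the output has the same shape as the hazy input, which is what allows the final addition.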

Encoder-Decoder of the Serial Unet++ Structure
In order to remove the haze and restore the image as much as possible, the feature extractor must make full use of the information in one image. Inspired by several previous dehazing networks that utilized the encoder-decoder structure as the feature extractor and achieved great performance, we build a serial Unet++ block with an encoder-decoder structure. In particular, we use a variant of the original Unet model called Unet++, which adds more short connections and skip routes to promote the exchange and fusion of information. As shown in Figure 2, different from the original Unet++, we prune the model. Specifically, since the input patches are resized to 256 × 256 pixels, we cut the deepest layer of the Unet++ and keep just three layers to down-sample the resolution to 1/8 scale. Therefore, the expression of the j-th feature at the i-th layer is formulated as:
$$x^{i,j} = \begin{cases} G\left(x^{i-1,j}\right), & j = 0 \\ G\left(x^{i,0} \oplus \cdots \oplus x^{i,j-1} \oplus up\left(x^{i+1,j-1}\right)\right), & j > 0 \end{cases}$$
where G(·) means convolution layer, up(·) means Upsampling operation, and ⊕ represents the concat operation.
Moreover, in the Down-sampling operation, a convolutional module with ResNet [44] structure is used to replace the simple convolutional layer. As shown in Figure 3a, the Down-sampling operation contains three convolutional layers, each immediately followed by batch normalization (BN) and ReLU layers. To keep the gradient from vanishing, a residual learning strategy is introduced. The input features transferred from the upper encoder are pooled to half size and fed to the first two convolutional layers. The information extracted by these two series of convolutional, BN, and ReLU layers is then added to the input and sent to the next convolutional layer. The structure of the Up-sampling operation is similar to that of the Down-sampling operation, as shown in Figure 3b, except that the pool operation is replaced by interpolation to restore the feature size to the original resolution. By assigning different weights to different spaces and channels, the attention module at the bottom of the decoder helps to learn the uneven distribution of haze.
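The connectivity of the pruned Unet++ block can be made concrete by listing which features feed each node. The sketch below follows the standard Unet++ node layout with three down-sampling steps, as described above; the helper names `unetpp_inputs` and `scale` are hypothetical, and the layout is an illustration of the nested skip connections rather than the authors' exact implementation.

```python
def unetpp_inputs(depth=3):
    """Map each node (i, j) to the nodes that feed it in a pruned Unet++.

    i is the resolution level (0 = full resolution) and j is the position
    along the skip pathway of that level. Only nodes with i + j <= depth exist.
    """
    graph = {}
    for i in range(depth + 1):
        for j in range(depth + 1 - i):
            if j == 0:
                # Encoder backbone: fed by the level above, or by the image at i = 0.
                graph[(i, j)] = [(i - 1, 0)] if i > 0 else ["input"]
            else:
                # All earlier nodes on the same level, plus the up-sampled
                # node from the level below.
                same_level = [(i, k) for k in range(j)]
                graph[(i, j)] = same_level + [(i + 1, j - 1)]
    return graph

def scale(level, size=256):
    """Spatial size of the features at a given level for a size x size patch."""
    return size // (2 ** level)

graph = unetpp_inputs(depth=3)
for node in sorted(graph):
    print(node, "<-", graph[node], "at", scale(node[0]), "px")
```

With three down-sampling steps and 256 × 256 patches, the deepest level works at 32 × 32 pixels, which is the 1/8 scale mentioned in the text, and the final node (0, 3) receives every earlier full-resolution feature together with the up-sampled feature from level 1.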
Electronics 2021, 10, x FOR PEER REVIEW
Figure 2. The encoder-decoder architecture of the serial Unet++ block. Compared to the original Unet++ model, we reduced the layers of the network from four to three in consideration of fewer parameters and the resolution of the input. The three-layer down-sampling structure reduces the image resolution to one eighth of the original size. Through the Unet++ structure, the features of different resolutions can be fully fused, which gives the model the ability to capture deep features and retain shallow features.

Figure 3. The detailed structure of the Down-sampling operation and Up-sampling operation in serial Unet++ module. Compared to original network, we replaced the convolutional layer with residual convolutional layer. After the Down-sampling operation or the Up-sampling operation, the size of the features is reduced to half of the original size or doubled accordingly. An attention module is added at the bottom of the Up-sampling operation to facilitate learning the distribution of haze in different spaces or channels.
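The residual Down-sampling and Up-sampling operations described for the serial Unet++ module can be sketched in PyTorch as below. The text fixes the overall structure (resample, two convolution + BN + ReLU stages, residual addition, one more convolution stage); the 3 × 3 kernels, max pooling, bilinear interpolation, the channel widths, and the class name `ResidualResample` are assumptions made for illustration. The attention module that the paper places at the bottom of the Up-sampling operation is omitted here.

```python
import torch
import torch.nn as nn

def conv_bn_relu(in_ch, out_ch):
    """One convolutional layer immediately followed by BN and ReLU."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

class ResidualResample(nn.Module):
    """Resample, apply two conv stages with a residual connection, then a third conv stage."""

    def __init__(self, in_ch, out_ch, mode):
        super().__init__()
        if mode == "down":
            # Pooling halves the spatial size (max pooling is an assumption).
            self.resample = nn.MaxPool2d(kernel_size=2)
        else:
            # Interpolation doubles the spatial size (bilinear is an assumption).
            self.resample = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)
        # The first two stages keep the channel count so their output can be added to the input.
        self.body = nn.Sequential(conv_bn_relu(in_ch, in_ch), conv_bn_relu(in_ch, in_ch))
        self.out = conv_bn_relu(in_ch, out_ch)

    def forward(self, x):
        x = self.resample(x)
        return self.out(x + self.body(x))

down = ResidualResample(8, 16, mode="down")
up = ResidualResample(16, 8, mode="up")
features = torch.rand(2, 8, 32, 32)
halved = down(features)
restored = up(halved)
```

Keeping the channel count fixed inside the residual branch is one simple way to make the addition well defined; a 1 × 1 projection on the skip path would be an alternative if the widths had to differ.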

Attention Mechanism
In most cases, the distribution of haze is uneven, especially for dense haze. This makes it difficult to apply CNN-based dehazing networks to real scenes. At the same time, different feature channels also have different sensitivities to haze regions with different concentrations. Therefore, assigning different weights to corresponding channels also has an effect on the dehazing performance. Many works [45][46][47] have applied the attention mechanism to the Unet structure and achieved good results in different visual tasks. Inspired by [48,49], we introduce the attention mechanism into our network so that it can focus more on the dense haze areas when the distribution of haze is uneven in one image. As shown in Figure 4, in the process of keeping input features passed back, channel attention and spatial attention are multiplied in turn to obtain refined features as the output of the feature module.

In the channel attention module (see Figure 4), we first adopt an adaptive mean pooling operation to obtain the raw weight of each channel. Through the adaptive mean pooling operation, for the feature map of size H × W × C, we extract a feature matrix of size 1 × 1 × C, where each value is a weight of all the pixel values in the corresponding feature map. Then, the raw weights are sent to a learning module consisting of one convolutional layer and a ReLU activation function unit, followed by another convolutional layer and a Sigmoid activation function. Finally, the learned feature weights are channel-wise multiplied into the input features so that different channels have different degrees of attention to the haze.
After the channel attention module, a spatial attention module (see Figure 4) is employed to measure the degree of attention to different locations of the feature map. We first perform the max-pooling and mean-pooling operations along the channel axis on the feature map fused with channel attention. In this way, two spatial attention maps of H × W × 1 are obtained from the original feature map of H × W × C. Immediately after concatenating them, a convolutional layer and Sigmoid activation function are utilized to learn the distribution of haze in the whole image. At last, the spatial attention map is pixel-wise multiplied into the input features. In summary, the attention feature is computed as:
$$F_c = \delta\left(Conv\left(\sigma\left(Conv\left(AMP(F)\right)\right)\right)\right) \otimes F$$
$$F' = \delta\left(Conv\left(CAT\left(Max(F_c), Mean(F_c)\right)\right)\right) \otimes F_c$$
where F is the input features of the attention module, F_c is the features refined by channel attention, and F' is the output features with channel attention and spatial attention. Conv(·) denotes a convolutional layer and ⊗ denotes element-wise multiplication. δ(·) is the Sigmoid activation function and σ(·) is the ReLU activation function. AMP(·) represents the adaptive mean pooling operation and CAT(·) means the concatenation in channel dimension. Max and Mean denote max-pooling and mean-pooling operations along the channel axis, respectively.
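The channel attention and spatial attention steps described above can be sketched in PyTorch as follows. The sequence of operations follows the text; the channel reduction ratio of 8, the 1 × 1 kernels in the channel branch, the 7 × 7 kernel in the spatial branch, and the class names are assumptions for illustration and are not taken from the paper.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Adaptive mean pooling -> conv -> ReLU -> conv -> Sigmoid, then channel-wise scaling."""

    def __init__(self, channels, reduction=8):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)  # H x W x C -> 1 x 1 x C
        self.learn = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        # One learned weight per channel, broadcast over all spatial positions.
        return x * self.learn(self.pool(x))

class SpatialAttention(nn.Module):
    """Max and mean pooling along channels -> concat -> conv -> Sigmoid, then pixel-wise scaling."""

    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size=kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        max_map = x.max(dim=1, keepdim=True)[0]   # H x W x 1
        mean_map = x.mean(dim=1, keepdim=True)    # H x W x 1
        weights = torch.sigmoid(self.conv(torch.cat([max_map, mean_map], dim=1)))
        # One learned weight per spatial location, broadcast over all channels.
        return x * weights

features = torch.rand(2, 32, 16, 16)
channel_refined = ChannelAttention(32)(features)
refined = SpatialAttention()(channel_refined)
```

Both attention maps lie in (0, 1) because of the Sigmoid, so each stage rescales the features without changing their shape, first per channel and then per pixel.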

Loss Function
We use reconstruction loss L_r and perceptual loss L_p as the components of the integrated loss function L, which is formulated as:
$$L = \alpha L_r + \beta L_p$$
Reconstruction loss measures the mean absolute error (MAE), which is also called L1 Loss, between the ground truth and the corresponding dehazed image, and is formulated as:
$$L_r = \frac{1}{N} \sum_{i=1}^{N} \left\| G(I_i) - J_i \right\|_1$$
where I_i is the input hazy image, G(·) means the network operating on the input, J_i is the corresponding ground truth, and N is the number of image pairs. Perceptual loss was proposed in [50] to measure the perceptual similarity in feature space by calculating the mean square error, also called L2 Loss. It is defined as:
$$L_p = \frac{1}{N} \sum_{i=1}^{N} \left\| vgg\left(G(I_i)\right) - vgg\left(J_i\right) \right\|_2^2$$
where vgg(·) means the pretrained VGG16 [51] network. Finally, we use a weighted combination of the two loss functions. In this work, the parameters α and β are both set to 1.
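The weighted combination of the two losses can be illustrated with a small pure-Python sketch. The feature extractor is passed in as a function; `toy_features` below is a hypothetical stand-in for the pretrained VGG16 network, used only so that the example runs without any model weights, and the pixel values are made up.

```python
def l1_loss(pred, target):
    """Mean absolute error between two equally sized lists of values."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)

def l2_loss(pred, target):
    """Mean squared error between two equally sized lists of values."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def total_loss(pred, target, features, alpha=1.0, beta=1.0):
    """alpha * reconstruction loss + beta * perceptual loss."""
    reconstruction = l1_loss(pred, target)
    perceptual = l2_loss(features(pred), features(target))
    return alpha * reconstruction + beta * perceptual

def toy_features(pixels):
    """Hypothetical stand-in for vgg(.): differences between neighbouring pixels."""
    return [b - a for a, b in zip(pixels, pixels[1:])]

dehazed = [0.2, 0.4, 0.9]       # made-up network output
ground_truth = [0.1, 0.5, 0.7]  # made-up reference
loss = total_loss(dehazed, ground_truth, toy_features)
```

With α = β = 1, as in the text, the two terms contribute with equal weight; changing either weight shifts the balance between pixel fidelity and feature-space similarity.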

Experiments
In this section, we introduce the datasets used in training and testing our network and give the detailed parameters of the training process. Finally, we compare the results of the network with several representative methods on the same objective metrics.

Datasets and Metrics
Similar to existing learning-based dehazing methods, we utilized two of the most commonly used groups of dehazing datasets, the RESIDE datasets [52] and the I-HAZY and O-HAZY image dehazing datasets [53,54], for training our model.
The RESIDE dataset is a large-scale benchmark consisting of both synthetic and real-world hazy images. It is divided into five subsets, each serving different training or evaluation purposes. In our experiment, we used the Indoor Train Set (ITS) and Outdoor Train Set (OTS) as training datasets, and the Synthetic Objective Testing Set (SOTS) for evaluation. In ITS, there are 10,000 different clear indoor images, each with 10 corresponding synthesized hazy images. In OTS, there are 8970 different clear outdoor images, each with 35 corresponding synthesized hazy images. Therefore, there are 100,000 hazy images in total in ITS and 313,950 in OTS. In SOTS, there are 500 hazy images with corresponding ground truth images, part of which are used for calculating the metrics.
Compared with the RESIDE dataset, the I-HAZY and O-HAZY datasets are real-world datasets. They were proposed to address the limitation that both the assessment and the training of learning-based dehazing techniques rely exclusively on synthetic hazy images. They are composed of pairs of real hazy and corresponding haze-free images. The real hazy images are all generated by professional haze machines and captured under the same illumination parameters as the corresponding haze-free images, and therefore they are closer to actual applications. The I-HAZY dataset has 30 images, among which 25 are for training and 5 are for evaluation. The O-HAZY dataset has 45 images, among which 40 are for training and the rest are for evaluation.
To objectively evaluate the performance of the proposed method, we adopt two metrics widely used in image dehazing task: the Peak Signal to Noise Ratio (PSNR) and the Structural Similarity index (SSIM).
PSNR is the most common and widely used objective image evaluation index. It is based on the error between corresponding pixels, which makes it an error-sensitive image quality evaluation metric. PSNR can be formulated as:
$$PSNR = 10 \log_{10} \frac{\left(2^n - 1\right)^2}{MSE}$$
where n represents the bit width of the pixel. MSE stands for the mean squared error and it can be formulated as:
$$MSE = \frac{1}{h w} \sum_{i=1}^{h} \sum_{j=1}^{w} \left(I_{i,j} - J_{i,j}\right)^2$$
where h, w mean the height and width of the image, and I_{i,j}, J_{i,j} mean the pixel value of the input hazy image and the corresponding output haze-free image at position (i, j).
SSIM is also a full-reference image quality evaluation index, which measures image similarity from three aspects: brightness, contrast, and structure. SSIM can be formulated as:
$$SSIM(I, J) = l(I, J) \cdot c(I, J) \cdot s(I, J)$$
where l, c, and s stand for brightness, contrast, and structure, respectively.
To evaluate and compare the proposed model with previous methods from a more comprehensive perspective, in addition to the above two commonly used full-reference objective metrics, we also selected two additional evaluation metrics: Natural Image Quality Evaluator (NIQE), a non-reference image quality index, and Learned Perceptual Image Patch Similarity (LPIPS) [55], an index of perceptual similarity.
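The PSNR and MSE definitions above can be checked on a tiny example. This is a minimal pure-Python sketch for single-channel images stored as nested lists; the function names and the 2 × 2 sample images are made up for illustration, and returning infinity for identical images is a convention assumed here.

```python
import math

def mse(a, b):
    """Mean squared error between two images given as h x w nested lists."""
    h, w = len(a), len(a[0])
    return sum((a[i][j] - b[i][j]) ** 2 for i in range(h) for j in range(w)) / (h * w)

def psnr(a, b, bits=8):
    """PSNR in dB for pixels with the given bit width (peak value 2**bits - 1)."""
    error = mse(a, b)
    if error == 0:
        return math.inf  # identical images: the ratio is unbounded
    peak = 2 ** bits - 1
    return 10.0 * math.log10(peak ** 2 / error)

reference = [[10, 20], [30, 40]]
estimate = [[11, 19], [31, 41]]   # every pixel is off by exactly one grey level
value = psnr(reference, estimate)
```

Here every pixel differs by one grey level, so the MSE is 1 and the PSNR reduces to 20 log10(255), roughly 48.13 dB. Each doubling of the per-pixel error lowers the PSNR by about 6 dB.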
The design idea of NIQE is to construct a series of features to measure image quality and use these features to fit a multivariate Gaussian model. These features are extracted from simple and highly regular natural landscapes. The smaller the value of NIQE, the more closely the characteristics of the image conform to those of highly regular natural images, which means that its quality is better. LPIPS replaces pixel-wise distance measurement, which is not always consistent with people's subjective perception, with a similarity measurement on high-dimensional image features. In practical use, LPIPS uses a deep network pre-trained on the ImageNet dataset to extract the deep features of generated images and reference images. The lower the LPIPS value, the higher the feature similarity between the generated image and the corresponding reference image, and the more similar the subjective perception.

Implementation Details
We implement our framework in PyTorch 1.7.1 and train our model on a computer equipped with an RTX 2080Ti GPU and an Intel i9-9900K CPU. We utilize ADAM [56] as the optimizer, where β1 and β2 are set to 0.9 and 0.999. The default learning rate is set to 0.0001. To better adjust the learning rate, we adopt CosineAnnealingLR [57] as the scheduler.
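The shape of the cosine annealing schedule can be sketched with its closed-form expression. The base learning rate of 0.0001 comes from the text; the minimum learning rate of 0 and the choice of annealing over the full 1,000,000 training iterations are assumptions made here, and the function name is illustrative.

```python
import math

def cosine_annealing_lr(step, total_steps, base_lr=1e-4, min_lr=0.0):
    """Closed-form cosine annealing: decays from base_lr at step 0 to min_lr at total_steps."""
    cosine = math.cos(math.pi * step / total_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + cosine)

total = 1_000_000  # assumed annealing period, equal to the number of iterations
schedule = [cosine_annealing_lr(s, total) for s in (0, total // 4, total // 2, total)]
```

The learning rate starts at its base value, reaches half of it at the midpoint, and decays smoothly to the minimum at the end, which avoids the abrupt drops of a step schedule.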
Every image is randomly rotated by 0°, 90°, 180°, or 270° and flipped horizontally with 50% probability to improve the robustness of the model and prevent overfitting. The batch size is set to 2 and the number of CPU threads is set to 16. Other hyperparameters differ between datasets. When training on the RESIDE dataset, we randomly take a pair of images from the dataset and train our network for 1,000,000 iterations. All the patches transferred to the network are resized to 256 × 256 pixels. In the I-HAZY and O-HAZY datasets, all the images are resized to 512 × 512, and the size of patches transferred to the network is also 256 × 256. Due to the small number of samples in these datasets, we train for 125,000 iterations on each dataset. Code will be made available at https://github.com/kirqwer6666/Image-dehazing-pytorch.
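The rotation and flip augmentation described above can be sketched in pure Python on images stored as nested lists. Applying the same random transform to the hazy image and its ground truth is an assumption made here so that the pair stays aligned; the function names and the seeding scheme are illustrative.

```python
import random

def rotate90(img):
    """Rotate a 2D nested list by 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]

def hflip(img):
    """Flip a 2D nested list horizontally."""
    return [row[::-1] for row in img]

def augment(img, rng):
    """Rotate by 0, 90, 180, or 270 degrees, then flip horizontally with 50% probability."""
    for _ in range(rng.randrange(4)):
        img = rotate90(img)
    if rng.random() < 0.5:
        img = hflip(img)
    return img

def augment_pair(hazy, clear, seed):
    """Apply the same random transform to both images by reusing the seed (assumed pairing)."""
    return augment(hazy, random.Random(seed)), augment(clear, random.Random(seed))

sample = [[1, 2], [3, 4]]
hazy_aug, clear_aug = augment_pair(sample, sample, seed=0)
```

Rotations by multiples of 90° and horizontal flips only permute pixels, so no interpolation is involved and the augmented patches keep their original pixel values.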

Experiment on Synthetic RESIDE Datasets
The experimental results of our method AESUnet and the other comparative methods on the RESIDE dataset are shown in Figure 5 and Table 1. As shown in Table 1, AESUnet can achieve state-of-the-art performance in all four metrics. The performance of AESUnet on the indoor RESIDE dataset is comparable with GCANet. Moreover, when it comes to the outdoor dataset, AESUnet can reach significant improvements over the other comparative methods.
Table 1. Metrics comparisons of the dehazing results on SOTS dataset. In this table, "↑" and "↓" respectively mean that the larger the metric, the better and the smaller the metric, the better. The best results are shown in bold.
Columns: Method; Indoor: PSNR (dB) ↑, SSIM ↑, LPIPS ↓, NIQE ↓; Outdoor: PSNR (dB), …

Specifically, as seen from Figure 5, the DCP method can achieve relatively soft visual performance, but in areas with high brightness, such as the sky (the image in the first row) and the wall (the image in the sixth row), it causes serious color distortion compared with the ground truth. The convolutional module used in AODNet operates at a single resolution, so its ability to characterize features is weak. AODNet does not remove haze thoroughly, so its outputs still look hazy. The feature extraction module used in DCPDN comprehensively considers different resolutions, and multi-scale fusion is used in feature reconstruction. However, due to the error accumulation of parameter estimation in the atmospheric scattering model, DCPDN introduces obvious color distortion. This error is more obvious when the haze becomes thicker (see Figures 5d and 6d). Although DCPDN achieves good results on some images, there is still considerable color distortion that cannot be ignored, and a large amount of haze remains in some areas with high-density haze, such as the lower right area of the image in the third row. FDGAN and GCANet use more advanced feature extraction modules. The former uses a GAN with a deep encoder-decoder structure, while the latter uses a gated fusion module to optimize the features extracted by the encoder-decoder structure. FDGAN and GCANet perform well on the indoor dataset; however, their dehazing is not ideal on the outdoor dataset, especially in areas with obvious gradient changes, such as the junction of objects and the sky.
In comparison, the dehazed images generated by AESUnet are not only more visually faithful and closer to the ground truth, but their colors also change more smoothly, even in areas with dense haze.
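For reference, PSNR, one of the metrics reported in this comparison, can be computed from the mean squared error between a dehazed image and its ground truth. The sketch below uses the standard definition; the function name `psnr` and the assumption that images are scaled to [0, 1] are illustrative, and the evaluation code used for the reported numbers may follow different conventions (for example, value range or color space).

```python
import numpy as np

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB between two same-shaped images.

    max_val is the largest possible pixel value (1.0 for images in [0, 1],
    255 for 8-bit images). Illustrative helper, not the authors' code.
    """
    diff = pred.astype(np.float64) - target.astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        # Identical images: PSNR is unbounded.
        return float("inf")
    return 10.0 * np.log10(max_val ** 2 / mse)
```

Because PSNR is logarithmic in the mean squared error, a higher value means a smaller pixel-wise error with respect to the ground truth, which is why it is marked with "↑" in the tables.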

Experiment on Real-World I-HAZY and O-HAZY Datasets
Compared with the RESIDE datasets, the advantage of our method is more obvious on the more challenging I-HAZY and O-HAZY datasets. As shown in Table 2, we reach the best performance and surpass the second-best method by a very large margin: 4.425 dB in PSNR and 0.028 in SSIM on average. We also achieve the best results in LPIPS and NIQE.

Table 2. Metrics comparisons of the dehazing results on I-HAZY and O-HAZY dataset. In this table, "↑" and "↓" respectively mean that the larger the metric, the better and the smaller the metric, the better. The best results are shown in bold.

As seen from Figure 6, some previous methods, such as DCP, AODNet, and DCPDN, fail completely at the dehazing task on the real-world datasets. Because they lack an attention module to learn the distribution of uneven haze, FDGAN and GCANet have a certain effect, but their results are seriously degraded when dealing with unevenly distributed haze. Stacking more parameters in the feature extraction module (14.07 M for FDGAN and 9.61 M for GCANet) does not bring a qualitative change in handling uneven haze. As marked with red boxes in Figure 6, because of the dense haze attached to the surfaces of some objects, the outlines and texture details on the object surfaces cannot be clearly seen in the FDGAN and GCANet results in row 1 and row 5. In addition, FDGAN and GCANet cannot completely restore the original color of images covered by heavily dense haze. Finally, as shown in row 2 and row 3 in Figure 6, in the face of dense and uneven haze, FDGAN and GCANet almost completely fail. Their limited ability to capture deeper features prevents them from recovering the image information well. Compared with these methods, our model can not only adaptively remove haze in both low-density and high-density areas to the greatest extent, but can also restore more outlines and texture details with less color distortion.

Ablation Study
In order to analyze the effectiveness of each module in the proposed network, we conduct an ablation study that considers three main factors: (1) the Unet++ structure compared with the ordinary Unet structure; (2) the strategy of connecting modules in series; (3) the attention mechanism. Therefore, we design three different models in the ablation study:
• AES-simple-Unet: Serial Unet-based block with attention module;
• AEUnet: Single Unet++ block with attention module;
• SUnet: Serial Unet++ block without attention module.
In order to avoid the positive effect caused by parameter stacking, we adjust the convolutional layers of the three models in the ablation study so that their FLOPs and parameter counts are almost the same. When calculating the FLOPs and parameters, the size of the input is set to 1 × 3 × 256 × 256. We train the models on the RESIDE outdoor dataset and test them on the SOTS outdoor dataset. The other hyperparameter settings are also kept consistent.
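To make the notion of matched FLOPs and parameters concrete, the following sketch counts the parameters and multiply-accumulate operations (MACs) of plain 2D convolution layers for a 256 × 256 input. The three-layer stack is a hypothetical toy configuration, not any of the ablation models, and the paper does not state which counting tool or convention was used (some tools report MACs, others report 2 × MACs as FLOPs).

```python
def conv2d_cost(in_ch, out_ch, kernel, out_h, out_w, bias=True):
    """Return (parameters, MACs) of one standard 2D convolution layer.

    Assumes a square kernel and no grouping; out_h and out_w are the
    spatial dimensions of the layer's output feature map.
    """
    # Each output channel has in_ch * kernel * kernel weights plus an optional bias.
    params = out_ch * (in_ch * kernel * kernel + (1 if bias else 0))
    # One MAC per weight per output position (bias additions are ignored).
    macs = out_ch * in_ch * kernel * kernel * out_h * out_w
    return params, macs

# Hypothetical toy stack: (in_ch, out_ch, kernel, out_h, out_w) per layer,
# with "same" padding so the 256 x 256 resolution is preserved.
layers = [
    (3, 16, 3, 256, 256),
    (16, 16, 3, 256, 256),
    (16, 3, 3, 256, 256),
]

total_params = sum(conv2d_cost(*layer)[0] for layer in layers)
total_macs = sum(conv2d_cost(*layer)[1] for layer in layers)
print(f"parameters: {total_params}, MACs: {total_macs}")
```

Comparing such totals across variants is one way to check that a performance difference is not simply due to one model being larger than another.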
As shown in Table 3, each of the three factors brings a significant improvement to the network. This improvement mainly comes from the mechanism of these factors rather than from the stacking of parameters. In particular, the introduction of the attention module brings a more obvious performance improvement than the Unet++ structure. The serial strategy brings a larger improvement than the other two factors. On the one hand, this is due to the positive impact of parameters; on the other hand, it is brought by feature fusion between shallow and deep layers. The results are also reflected in Figure 7. Because it lacks the short connections and additional skip routes of the Unet++ structure, AES-simple-Unet performs poorly in some regions compared with AESUnet, even though more convolutional layers are added to extract features. In the red box of Figure 7b, the color of the sky area around the sun is clearly divided into three layers, while, in Figure 7e, the color changes more naturally and smoothly. By comparison, as marked with the red box in Figure 7, the image generated by AESUnet is closer to the ground truth. The performance of AEUnet is the worst among the four models in the ablation study. It can be clearly seen from Figure 7c that the color distortion in the sky area is very serious. This is because a single Unet++ block cannot capture deeper color information well enough to restore the original color. As for SUnet, because it lacks the attention module, haze in the high-density areas is left on the image (marked with the green box in Figure 7d), which seriously degrades the visual performance, whereas AESUnet removes it more effectively (marked with the green box in Figure 7e).

Conclusions and Future Work
In this paper, we propose a fully end-to-end convolutional neural network called Attention Enhanced Serial Unet++ Dehazing Network (AESUnet) for single image dehazing. To make full use of the extracted features, we employ a serial structure of two Unet++ blocks to replace the simple Encoder-Decoder structure. Moreover, an attention module is introduced to help the network learn the distribution of uneven haze. Compared with the existing dehazing methods, AESUnet can better remove dense haze in images with less color distortion. Experiments on both synthetic and real-world datasets show that our method achieves state-of-the-art performance and generates more visually pleasing results in the image dehazing task.
Although the proposed model achieves quite pleasing dehazing performance on real-world datasets, the following aspects are still worth discussing. Firstly, because the proposed model adopts the end-to-end Unet++ architecture, it is computationally expensive, which is not conducive to industrial deployment. Replacing the heavy-weight Unet structure with a more light-weight structure such as MobileNet [58] would give it broader application prospects [59]. Secondly, the proposed model does not run fast enough to meet the needs of real-time operation. In our experiment, the model can only process images in the RESIDE dataset at a speed of 14 FPS. Therefore, the model still needs to be improved to meet the needs of video dehazing. Considering the existing video image processing methods [60][61][62], feature fusion methods such as key frame, nearest neighbor frame, or time attention can be used to speed up the processing of video frames.