Single Image Dehazing Using End-to-End Deep-Dehaze Network

Atmospheric haze limits the performance of camera sensors, resulting in the capture of degraded hazy images. Removing this haze from the observed images is a complicated task because of its ill-posed nature. This study offers the Deep-Dehaze network to retrieve the haze-free image. For a given input, the proposed architecture uses four feature extraction modules to perform nonlinear feature extraction. We adapt the traditional Unet architecture and the residual network to design our architecture. We also introduce an L1 spatial-edge loss function, which enables our system to achieve better performance than the typical L1 and L2 loss functions. Unlike other learning-based approaches, our network does not use any fusion connection for image dehazing. The experimental results show that our proposed Deep-Dehaze architecture surpasses previous state-of-the-art single image dehazing methods both quantitatively and qualitatively. Our network achieves an outstanding average PSNR of 24.5 dB on the RESIDE dataset.


INTRODUCTION
such as radar tracking, weather forecasting, and traffic monitoring [14]. For image dehazing, most state-of-the-art methods widely use the atmospheric scattering model [3-12, 15-17, 20-23], as noted below:

I(x) = J(x)t(x) + A(1 − t(x))   (1)

Here, I(x) is the degraded hazy image, t(x) denotes the transmission map, J(x) is the haze-free image, and A stands for the global atmospheric light. The light incoming to the camera sensor is not independent of the transmission map, which explicitly controls the visual quality of the output image from the sensor. A hazy image's functional model fundamentally contains two parts: transmission information and the scattered atmospheric light. Together, in theory, they comprise the hazy image. Assuming the atmospheric light A is homogeneous, the expression for the transmission map is:

t(x) = e^(−βd(x))   (2)

where β represents the attenuation coefficient of the atmosphere and d(x) is the depth information of the scene. Based upon equation (1) and the previously mentioned assumption of a homogeneous distribution, many studies have been proposed. We can generalize these studies into two classes: model-based methods [1,2,13,18,24] and learning-based methods [6,12,15,16,19,22,23].
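The scattering model of equations (1) and (2) can be sketched directly in code; the following is a minimal numpy illustration (the parameter values are arbitrary examples, not settings from this study):

```python
import numpy as np

def synthesize_haze(J, d, A=0.9, beta=1.0):
    """Apply the atmospheric scattering model of equations (1)-(2).

    J    : clean image, float array in [0, 1], shape (H, W, 3)
    d    : scene depth map, shape (H, W)
    A    : global atmospheric light (assumed homogeneous)
    beta : atmospheric attenuation coefficient
    """
    t = np.exp(-beta * d)          # transmission map, equation (2)
    t = t[..., None]               # broadcast over the color channels
    I = J * t + A * (1.0 - t)      # hazy observation, equation (1)
    return I, t
```

At zero depth the transmission is 1 and the observation equals the clean image; as depth grows, the observation converges to the airlight A, which is why distant scene regions look washed out.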
In this study, we take a different approach to single image dehazing. At the first stage, our network takes the hazy input image without any preprocessing. This image is then fed into four feature extraction blocks that extract the haze information from the image: two residual-Unets with entirely different filter properties, a conventional residual block, and a sequential residual block. In our residual Unet, we utilize only one residual connection: a global residual connection that adds the input image to the extracted features at the end of the network. This is in contrast to the widely practiced local-residual Unet architecture.
For the conventional residual block, we perform 15 consecutive convolution operations upon the input image, followed by a global residual connection. In our residual block, we only use the ReLU activation function, and we do not use any batch normalization. For our sequential-residual block, we concatenate the outputs from the residual blocks; a simple convolution operation upon the concatenated tensor then gives us the desired features. We subtract each module's features from the input image and concatenate the results into a single tensor. A final convolution operation upon this tensor gives us the desired haze-free image. We can summarize the overall contributions of our study as follows:
• Our Deep-Dehaze network enjoys an end-to-end property for dehazing hazy input images. The proposed Deep-Dehaze system achieves state-of-the-art performance by collaborating four different feature extraction modules.
• In this work, we propose a sequential-residual module for feature extraction, which outperforms the traditional residual block, dense block, and residual-attention block.
• Our Residual-Unet uses only one global residual connection. This improvement on top of the Unet outperforms the original Unet-based module.
• Finally, we improve the usual L1 loss function by utilizing it in both the spatial and edge domains. This loss function helps our network retrieve fine details without using any fusion or gated-fusion operation. Also, our reconstructed outputs are free from the burning effect, unlike the prior-based approaches.
We have followed the usual sequence in documenting our study. The following section focuses on related work on dehazed image reconstruction. After this, we explain the network architecture and present the relevant figures. Then, we compare our performance with other dehazing studies on various datasets, followed by the conclusion.

RELATED WORK
Classical prior-based methods operate upon two things: the transmission map and the atmospheric light. At their core, these prior-dependent schemes try to improve the manipulation of the statistical properties of the brightest intensity region. It is common for these practices to focus on sky-region segmentation, adaptive thresholding, and air-light compression [19]. The earliest dehazing method was histogram equalization, which improves contrast by remapping the image histogram toward a near-uniform probability density function [11]. Following that, homomorphic filtering combines illumination and reflectance multiplicatively, which can result in bleaching of the image. Global contrast optimization schemes enhance the overexposed regions of SDR images by applying a compression curve. Multi-scale retinex [20] helps to increase the contrast in the luminance channel; however, its performance decreases further if the input image contains light noise. In [11], the authors combine single-scale Retinex theory with wavelet transformation to remove haze from the input image. The study in [7] uses a Spatio-Temporal Retinex-inspired Envelope with Stochastic Sampling scheme to approximate the haze-free image.
The dark channel prior is one of the oldest dehazing schemes [10]. This method is so influential that a large portion of image dehazing research builds upon its formula. The underlying statistical observation is that most outdoor image regions contain pixels with low intensity in at least one of the color channels [10]. Even though prior-dependent schemes are useful for image dehazing, they are not robust enough, due to their inherent limitations in prior selection. Their assumptions mostly cover atmospheric haze at some scale but fail for indoor fog or hazy urban images. The learning-based approaches, in contrast, use versatile datasets to learn haze properties in different domains.
Many recent dehazing studies use deep neural networks to learn haze information from a given dataset and apply it to remove haze from the image. These methods [6,12,16,22] use a deep convolutional network to estimate the transmission map and ambient light information from the given image, and later restore the clean image based on the learned information. The study in [6] proposes an end-to-end dehazing network to estimate the transmission map from hazy images. Patel et al. [22] use a pyramid dense network to estimate the clear image from the given corrupted input. AOD-Net [12] uses a CNN to estimate the dehazed image directly, without separately estimating the atmospheric light information. Ren et al. [16] use a coarse-to-fine scheme to accurately predict the transmission map for haze removal. The study in [15] uses a pix2pix network for image dehazing, converting the image dehazing problem into an image translation problem.
GAN-based methods have also been used to approximate the haze-free image from the given corrupted image. FFDnet [23] is the first GAN-based dehazing scheme; it removes haze from the picture by estimating a non-linear noise-level map. Cycle-Dehaze [9] performs dehazing on top of CycleGAN and uses perceptual and cycle-consistency losses. By exploiting the Tiramisu model, a conditional GAN [5] dehazes the hazy input image. CCGAN [21] uses a dehazing GAN and a classification network to predict the transmission map for accurate haze removal. CWGAN [8] learns to construct a clean image from the given corrupted image by using the fundamental features of the Wasserstein GAN.

NETWORK DETAILS
As shown in Figure 2, our network learns the nonlinear relationship between the hazy input image and the output image. Our end-to-end Deep-Dehaze system consists of different sub-networks, which allow it to learn image features, spatial properties, and haze patterns for the given input. We describe these sub-modules below.
The proposed ResUnet block differs from the original Unet architecture in that it uses a single global residual connection. Established Unets with residual connections include skip links at every layer; these links allow the network to learn better than the original Unet architecture. However, the channel count of a Unet increases exponentially after each pooling operation, which increases the cost of the skip connection at each layer. To reduce this cost, we use a single global skip connection for our Unet base architecture, and we have observed better performance for the Unet with a global skip connection. In our end-to-end Deep-Dehaze network, we use two ResUnet blocks for spatial feature learning. For the first ResUnet, the channel count starts from 32 and increases exponentially; for the second, it starts from 16.
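The single-global-skip design described above can be sketched in Keras as follows. This is a minimal illustration, not the exact Deep-Dehaze configuration: the encoder/decoder depth, layer counts, and kernel sizes here are assumptions; only the exponential channel growth from a base width and the single global residual connection follow the text.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_res_unet(base_channels=32, depth=2):
    """Unet-style encoder-decoder whose ONLY residual connection is a
    global one adding the input to the final feature map; the per-level
    skip links of the standard Unet are deliberately omitted."""
    inp = layers.Input(shape=(None, None, 3))
    x = inp
    # Encoder: channels grow exponentially from base_channels.
    for i in range(depth):
        x = layers.Conv2D(base_channels * 2 ** i, 3, padding='same',
                          activation='relu')(x)
        x = layers.MaxPooling2D(2)(x)
    # Decoder: upsample back to the input resolution.
    for i in reversed(range(depth)):
        x = layers.UpSampling2D(2)(x)
        x = layers.Conv2D(base_channels * 2 ** i, 3, padding='same',
                          activation='relu')(x)
    x = layers.Conv2D(3, 3, padding='same')(x)
    out = layers.Add()([inp, x])   # the single global residual connection
    return Model(inp, out)
```

Because only one addition crosses the whole network, no high-channel encoder tensors need to be kept alive for later skip concatenations, which is the cost saving the text refers to.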
Our residual block follows the typical design flow. We include the skip connection to learn the spatial haze patterns from the given input. It is well known that a residual connection can capture the underlying nonlinear relation between the input and the output image; for this reason, the residual module is a popular choice for ill-posed computer vision problems. In our residual block, we use 15 consecutive convolution operations before the skip connection. After learning the haze patterns through these 15 layers, we add the input image at the end of the module. We use 32 filters from the second to the 14th convolution layer, and three channels for the first and last layers. Our kernel width is 7 for each layer in the residual module.
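A Keras sketch of this block, following the stated settings (15 convolutions, kernel width 7, 3 channels at the first and last layers, 32 in between, ReLU only, no batch normalization, one global skip), might look like the following; any detail beyond those settings is an assumption:

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_residual_block(n_layers=15, filters=32, kernel=7):
    """15 consecutive convolutions with ReLU activations and no batch
    normalization, closed by a single global skip connection that adds
    the input image back to the learned residual."""
    inp = layers.Input(shape=(None, None, 3))
    # First layer: three channels, per the text.
    x = layers.Conv2D(3, kernel, padding='same', activation='relu')(inp)
    # Layers 2..14: 32 filters each.
    for _ in range(n_layers - 2):
        x = layers.Conv2D(filters, kernel, padding='same',
                          activation='relu')(x)
    # Last layer: back to three channels so the addition is valid.
    x = layers.Conv2D(3, kernel, padding='same')(x)
    out = layers.Add()([inp, x])   # global skip connection
    return Model(inp, out)
```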
The proposed Seq-Residual block is a simple extension of the typical residual block described above. In this module, we use three consecutive residual blocks, with one individual convolution operation after each residual block. We then concatenate the residual block outputs into a single tensor for global feature learning. Each residual block uses a different kernel width: 3, 5, and 7. For the individual convolution operations, the kernel width is 3. At the final stage, we perform a last convolution operation to extract the global feature information from the concatenated tensor.
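The Seq-Residual wiring can be sketched as below. The kernel widths (3, 5, 7 for the blocks, 3 for the per-branch convolutions) follow the text; the depth and filter count inside each small residual block, and the channel widths of the branch outputs, are illustrative assumptions:

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def _res_block(x, kernel, filters=16, n_convs=3):
    """Small residual block; depth and filter count here are illustrative."""
    skip = x
    for _ in range(n_convs - 1):
        x = layers.Conv2D(filters, kernel, padding='same',
                          activation='relu')(x)
    x = layers.Conv2D(3, kernel, padding='same')(x)   # match skip channels
    return layers.Add()([skip, x])

def build_seq_residual(kernels=(3, 5, 7)):
    """Three consecutive residual blocks with kernel widths 3, 5 and 7,
    each followed by an individual 3x3 convolution; the branch outputs
    are concatenated and fused by one final convolution."""
    inp = layers.Input(shape=(None, None, 3))
    x, branches = inp, []
    for k in kernels:
        x = _res_block(x, k)
        branch = layers.Conv2D(3, 3, padding='same', activation='relu')(x)
        branches.append(branch)
    fused = layers.Concatenate()(branches)
    out = layers.Conv2D(3, 3, padding='same')(fused)  # global feature fusion
    return Model(inp, out)
```

The varying kernel widths give each branch a different receptive field, so the concatenated tensor carries multi-scale information before the final fusion convolution.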
After extracting the desired features using the modules described above, we subtract the output of each module from the hazy input image. Following that, we concatenate all of the results into a single tensor. This tensor is the most crucial feature bundle that we extract from the corrupted input image in each epoch; it gathers the multi-scale spatial information, the inter-pixel relationships, and the transmission-light features. A final convolution operation upon this tensor completes our network's single image dehazing task.

For ill-posed computer vision problems, it is reasonable to choose the loss function from the regression loss family. Typically, the L1 or L2 loss function is used; however, L1 is preferable over L2 because, in practice, it is more robust to outliers. For our network, we use a combination of L1 loss functions over the spatial and edge domains. Usually, a system approximates the clean image from the given corrupted image based on the nonlinear relationship between them; as a result, the performance of the network varies from image to image, and the spatial complexity of the given image makes this approximation more challenging. For this reason, the resultant image commonly contains less inter-pixel variance than the ground-truth image. In other words, the reconstructed image is likely to be a smoothed-out version of the original. To address this, we use the L1 loss function in both the edge and the spatial domains, and we combine the two loss values in a weighted manner to constitute our proposed loss function.
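The weighted spatial-plus-edge L1 loss can be sketched as follows. The paper does not state its edge operator or the branch weights, so the finite-difference gradients and the `edge_weight` value below are assumptions:

```python
import numpy as np

def l1_spatial_edge_loss(pred, gt, edge_weight=0.5):
    """L1 loss in the spatial domain plus L1 loss between edge maps,
    combined with a weight. Edges are approximated here by simple
    finite-difference image gradients (an assumption)."""
    spatial = np.mean(np.abs(pred - gt))

    def grads(img):
        # Horizontal and vertical gradients as a cheap edge representation.
        gx = img[:, 1:, :] - img[:, :-1, :]
        gy = img[1:, :, :] - img[:-1, :, :]
        return gx, gy

    pgx, pgy = grads(pred)
    ggx, ggy = grads(gt)
    edge = np.mean(np.abs(pgx - ggx)) + np.mean(np.abs(pgy - ggy))
    return spatial + edge_weight * edge
```

Note that a uniform brightness shift leaves the edge term at zero while the spatial term penalizes it, whereas blurring leaves low-frequency content intact and is caught mainly by the edge term; this complementarity is what counters the smoothing effect discussed above.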

RESULT ANALYSIS
In this study, we primarily use the RESIDE dataset for single image dehazing. We select three subsets of the RESIDE dataset: the indoor training set, the outdoor training set, and the synthetic testing set. For validation, we use the indoor and outdoor validation sets from the RESIDE dataset. For synthetic haze validation, we also use the validation set from the DIV2K dataset. To evaluate extreme dehazing performance, we use the dataset curated by Ancuti et al. [3], which contains extremely hazy images with their respective ground truths. We use the following training setup for our network. We sample our training images into 128-by-128 patches to reduce the training complexity. We use the Keras framework to implement the dehazing network. For the ADAM optimizer, we keep the usual training parameters, and we train the Deep-Dehaze network with a batch size of 16.
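The 128-by-128 patch sampling can be sketched as below. The random-crop strategy and the seeding are assumptions, since the text does not specify how patches are drawn; only the patch size (128) and the batch size (16) come from the setup above.

```python
import numpy as np

def sample_patches(image, n_patches=16, size=128, rng=None):
    """Draw random square training patches from one image.

    image     : float array of shape (H, W, C) with H, W >= size
    n_patches : number of patches (e.g. one batch of 16)
    size      : patch side length (128 in this setup)
    """
    rng = rng or np.random.default_rng(0)
    h, w = image.shape[:2]
    patches = []
    for _ in range(n_patches):
        y = rng.integers(0, h - size + 1)   # top-left corner, inclusive range
        x = rng.integers(0, w - size + 1)
        patches.append(image[y:y + size, x:x + size])
    return np.stack(patches)                # shape (n_patches, size, size, C)
```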
For the comparative analysis, we evaluate three different methods against our study: the dark channel prior, the non-local prior, and the all-in-one dehazing network (AOD-Net). We use the PSNR, SSIM, and MSE metrics to demonstrate quantitative performance. The synthetic dataset from the RESIDE validation set, the DIV2K validation set, and the extreme dehazing validation dataset from Ancuti et al. [3] come with ground-truth images. Figure 6 shows the visual comparison between the proposed study and the methods mentioned above, and Figure 7 presents the comparative analysis for extreme dehazing. The left column of Figure 6 contains the input hazy images, so the figure gives a side-by-side comparison of single image dehazing performance. The rightmost column shows the dehazing performance of the dark channel prior [10]. As seen from the figure, the dark channel prior removes atmospheric haze from the given image effectively. However, it suffers from the burning effect, which is common for typical dehazing methods; due to this effect, many dehazing schemes reduce the overall light information, which removes vital details. This method also reconstructs the given input image with an unnatural contrast, which degrades the overall fidelity of the image.
In the second column from the right, the non-local scheme shows superior image dehazing performance over the dark channel prior; compared to the dark channel prior, this method is contrast-adaptive and more immune to the pixel burning effect. The middle column shows the image dehazing performance of AOD-Net. Even though its overall haze removal performance is better than the other methods, its reconstructed images are less sharp. The second column from the left shows the dehazing performance of the proposed study. We can see that the proposed method dehazes the green leaf region most effectively compared to the other studies; for those studies, the pixel burning effect is strongly visible in the leaf region of the house image. We can also see that the inherent contrast appears more natural for the proposed study. Figure 7 demonstrates the extreme image dehazing performance; it contains the hazy images and the ground-truth image from Ancuti et al. [3]. From this figure, our approach extracts the most pixel information from the given hazy image. For the bicycle image, a generator in the room is visible in the ground-truth image, and among the dehazed results the generator is most clearly visible for the proposed study.
For the quantitative analysis, we use the PSNR, SSIM, and MSE metrics for comparison. Table 1 shows the quantitative comparison on the validation dataset from the RESIDE dataset. In Table 2, we present the comparison on the extreme dehazing dataset. Table 3 contains the quantitative comparison on the validation dataset from the DIV2K dataset. From these tables, we can see that the proposed study achieves the best results on average for every dataset.

In summary, our model performs better than other state-of-the-art studies. The proposed study achieves better visual performance, as seen in Figure 6 and Figure 7, and demonstrates its efficacy in the quantitative analysis, as seen in the tables above. However, our method is not absolute in terms of performance; in some cases, other studies perform better than our scheme. For example, from Figure 6, we can see that our method dehazed the road image poorly: even though that image contains better contrast, more haze remains on the road for our model than for the other studies. In our future work, we would like to address issues like this. We hope to work with more complex haze scenarios to achieve better performance on the extreme dehazing dataset, and we would also like to improve the network's performance through instance learning, which would improve the overall adaptivity of the proposed network. Furthermore, our network is still limited on extreme dehazing tasks, which we will consider in future work.
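For reference, the MSE and PSNR metrics used in these tables can be computed as follows, assuming images scaled to [0, 1] (the choice of dynamic range is an assumption):

```python
import numpy as np

def mse(pred, gt):
    """Mean squared error between two images of the same shape."""
    return np.mean((pred - gt) ** 2)

def psnr(pred, gt, max_val=1.0):
    """Peak signal-to-noise ratio in dB for images in [0, max_val]."""
    m = mse(pred, gt)
    return np.inf if m == 0 else 10.0 * np.log10(max_val ** 2 / m)
```

For example, two images that differ by a uniform 0.1 everywhere have an MSE of 0.01 and hence a PSNR of 20 dB, which gives a feel for the 24.5 dB average reported in the abstract.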

CONCLUSION
In this study, we propose an end-to-end image dehazing network to recover the clean image from the given input. In our research, we have introduced several improvements over usual deep learning architectures, and we have improved the performance of the typical L1 loss function for low-level computer vision tasks. Our study covers image dehazing performance over several datasets, and in every case our scheme achieves better results compared to state-of-the-art studies. In our future work, we plan to extend this work to real-time video and detection tasks.