IFE-Net: An Integrated Feature Extraction Network for Single-Image Dehazing

Abstract: In recent years, numerous single-image dehazing algorithms have made significant progress; however, dehazing still presents a challenge, particularly in complex real-world scenarios. In fact, single-image dehazing is an inherently ill-posed problem, as scene transmission relies on unknown and nonhomogeneous depth information. This study proposes a novel end-to-end single-image dehazing method called the Integrated Feature Extraction Network (IFE-Net). Instead of estimating the transmission matrix and atmospheric light separately, IFE-Net directly generates the clean image using a lightweight CNN. During the dehazing process, texture details are often lost. To address this issue, an attention mechanism module is introduced in IFE-Net to handle different types of information according to their importance. Additionally, a new nonlinear activation function is proposed in IFE-Net, known as a bilateral constrained rectifier linear unit (BCReLU). Extensive experiments were conducted to evaluate the performance of IFE-Net. The results demonstrate that IFE-Net outperforms other single-image haze removal algorithms in terms of both PSNR and SSIM. On the SOTS dataset, IFE-Net achieves a PSNR value of 24.63 and an SSIM value of 0.905. On the ITS dataset, the PSNR value is 25.62, and the SSIM value reaches 0.925. The quantitative results on the synthesized images are either superior to or comparable with those obtained via other advanced algorithms. Moreover, IFE-Net also exhibits significant advantages in subjective visual quality.


Introduction
Obtaining a clear and haze-free image is crucial in photography and computer vision applications. When a camera captures images in an atmosphere that contains a large amount of dust, smoke, mist, or other floating particles, the resulting images often suffer from significant quality degradation. These degradations, in turn, may have a negative impact on the performance of many computer vision systems [1-4], such as detection, tracking, and classification. Therefore, restoring clean images from degraded inputs through image dehazing is extremely important in the field of computer vision.
To overcome quality issues caused by haze in captured images, the atmospheric scattering model [5-7] has been proposed to restore clean images; it can be formally written as follows:

$$I(x) = J(x)\,t(x) + \alpha\,(1 - t(x)) \tag{1}$$

where I(x) is the observed hazy image, J(x) is the true scene radiance, α is the global atmospheric light, t(x) is the medium transmission map, and x is the pixel index in the observed hazy image I. Furthermore, the medium transmission map can be expressed as follows:

$$t(x) = e^{-\beta d(x)} \tag{2}$$

where d(x) is the distance from the scene point to the camera, and β represents the attenuation coefficient of the atmosphere. From Equation (1), it can be seen that the dehazing process requires the accurate estimation of the transmission map and atmospheric light. A small portion of research mainly focuses on estimating atmospheric light [8-12]; however, the accuracy of the estimated atmospheric light directly affects the dehazing results, and excessive errors degrade the dehazing performance. Other algorithms focus more on accurately estimating transmission maps, and these mainly fall into two categories: prior-based methods [13,14] and learning-based methods [15,16]. In order to compensate for information loss during image processing, some methods use different priors to obtain atmospheric light and transmission maps. For example, Berman et al. [17] proposed a non-local prior-based dehazing algorithm based on the assumption that the colors of a clean image are well approximated by a few hundred distinct colors. Based on the difference between the brightness and the saturation of hazy images, the color attenuation prior (CAP) [18] was proposed to estimate scene depth. The image priors used by prior-based algorithms can easily be inconsistent with practice, which may lead to incorrect transmission approximations. Learning-based methods are effective and superior to prior-based algorithms, exhibiting significant performance improvements. In [19], two subnetworks were designed to estimate the transmission map and atmospheric light, respectively. In [20], the authors derived three different images from the hazy image and fused the results of the three images after dehazing. However, deep learning-based methods require training on a large number of real hazy images and their corresponding haze-free images. The methods that estimate atmospheric light and transmission maps separately have made significant progress, but they have limitations. On the one hand, the inaccurate estimation of transmission maps may lead to low image quality; on the other hand, the separate estimation of atmospheric light and transmission maps makes it difficult to find the inherent relationship between them.
In order to find the intrinsic relationship between the parameters of Equation (1), Li et al. [21] first proposed a dehazing model that does not estimate α and t(x) separately. This model directly generates clean images from hazy images, rather than relying on any separate intermediate parameter estimation steps. Recently, many methods have used end-to-end learning instead of atmospheric scattering models to directly obtain clean images from networks [22-25]. Another widely used approach predicts either the latent haze-free image or its residual relative to the hazy image, as this often achieves better performance [26-30]. Although these recent dehazing methods have made significant progress, the complex haze distribution and the difficulty of collecting image pairs for training mean that image details are easily lost when dehazing with a limited dataset.
Due to the difficulty in collecting image pairs for training, IFE-Net uses an end-to-end model, adaptively learns network features, and adopts multiscale feature extraction to better extract information. In addition, parallel convolutional layers of different sizes are used to extract features from the input image at different scales [31,32]. This feature extraction structure is conducive to preserving more information and reducing the loss of image details.
Considering the potential cumulative error caused by the separate estimation of atmospheric light and the transmission map, IFE-Net unifies atmospheric light and the transmission map into one parameter to directly obtain a clean image. In addition, attention mechanisms have been widely applied in the design of neural networks [19,33-36], where they provide additional flexibility. Inspired by these works and considering the different weights of features in different regions, a feature attention module called the attention mechanism (AM) is designed in the network, which processes different types of information more effectively.
In deep learning networks, the activation function is a nonlinear function that enables neural networks to learn and represent complex patterns and relationships. The selection of the final activation function has a significant impact on the output of the model, as different activation functions have different characteristics and applicable scenarios. We considered that the output of the last layer, which produces the dehazed image, should have upper and lower boundaries. In IFE-Net, we designed a new activation function called a bilateral constrained rectifier linear unit (BCReLU). The specific details of BCReLU and its comparison with other activation functions in the network are given in Section 3.2.3.
The main contributions are as follows:
1. IFE-Net directly produces the clean image from a hazy image, rather than estimating the transmission map and atmospheric light separately. All parameters of IFE-Net are estimated in a unified model.
2. We propose a novel attention mechanism (AM) module, which consists of a channel attention mechanism, a pixel attention mechanism, and texture attention. This module assigns different weights to different features and focuses more on strong features in areas with thick haze.
3. A bilateral constrained rectifier linear unit (BCReLU) is proposed in IFE-Net. To our knowledge, BCReLU has not been proposed previously. Its significance for image restoration is demonstrated through experiments.
4. The experiments show that IFE-Net performs well both qualitatively and quantitatively. The extensive experimental results also illustrate the effectiveness of IFE-Net.

Related Work
Recently, single-image dehazing has attracted widespread attention in the field of computer vision. Due to the unknown global atmospheric light and transmission map, single-image dehazing is an inherently ill-posed problem. Many different methods have been proposed to address the issue. These methods can be roughly divided into prior-based and learning-based methods. The main difference between them is that prior-based methods mainly utilize prior statistical knowledge and hand-crafted features to process hazy images, while learning-based methods automatically learn from the training set through a neural network and store what they learn in the network's weights.
Many single-image dehazing methods that rely on prior knowledge have emerged. The patch-based dark channel prior (DCP) [11] method proposed by He et al. is one of the representative prior-based methods. Based on the assumption that local patches of haze-free images have extremely low intensity in at least one color channel, DCP uses the atmospheric scattering model for haze removal. Pixel-based dehazing methods [37,38] provide another solution to estimate the transmission map; however, they may suffer from insufficient information and fail to estimate transmission maps. In addition, Zhu et al. [18] proposed a linear model based on a local prior to restore depth information. Although prior-based methods have achieved good results, the priors only hold under certain conditions. These hand-crafted priors are only applicable to specific cases and may not be applicable in changing scenarios.
The human brain is able to quickly distinguish hazy regions in natural images without other information, and this has inspired the application of convolutional neural networks to image dehazing. These learning-based methods demonstrate extremely strong capabilities in dehazing. For example, Cai et al. [31] proposed Dehaze-Net, which is a trainable end-to-end network consisting of four parts: feature extraction, multiscale mapping, local extremum, and nonlinear regression. It is used to estimate the transmission map, and the clean image is then restored from the estimated transmission map through the atmospheric scattering model. Ren et al. [39] further proposed a multiscale convolutional neural network (MSCNN) for estimating scene transmission maps. Qin et al. [36] proposed an end-to-end feature fusion attention network (FFA-Net) to directly recover clean images, taking into account different weighted information. Due to the difficulty in obtaining paired clean and hazy images in nature, Li et al. [40] studied image dehazing without training on ground-truth clean images. These learning-based methods have achieved good performance and are increasingly widely used in image dehazing.

The Proposed Method
In this section, we first introduce the transformed atmospheric scattering model. Then, a detailed introduction to the specific structure of the proposed IFE-Net is provided.

The Transformed Atmospheric Scattering Model
We can rewrite Equation (1) with the clean image as the output:

$$J(x) = \frac{1}{t(x)}\,I(x) - \alpha\,\frac{1}{t(x)} + \alpha \tag{3}$$

Existing works such as [19,31,41] usually utilize empirical rules to estimate α and deep learning models to estimate t(x). Estimating α and t(x) separately will lead to certain errors. The output clean image obtained by combining α and t(x) may have a greater cumulative error.
The transformed atmospheric scattering model was proposed in [21] to reduce the cumulative error caused by separate estimation. The two parameters, α and t(x), are unified into one formula to avoid the potential cumulative error caused by estimating the two parameters separately. Model (3) can be rewritten as follows:

$$J(x) = D(x)\,I(x) - D(x) + 1 \tag{4}$$

where

$$D(x) = \frac{\frac{1}{t(x)}\,(I(x) - \alpha) + (\alpha - 1)}{I(x) - 1} \tag{5}$$

It is worth noting that in Equation (5), α and t(x) together form a new variable D(x). The clean image can be obtained by estimating D(x). The unified variable D(x) can effectively reduce the cumulative error caused by estimating α and t(x) separately.
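As a quick numerical check of the transformed model, the following sketch compares the direct inversion J = (I − α)/t + α with the unified form J = D·I − D + 1 on a few scalar pixel values. It assumes that D(x) = [(I(x) − α)/t(x) + (α − 1)] / (I(x) − 1), i.e., that the constant bias in the unified form equals 1; the function names and sample values are illustrative and are not taken from the IFE-Net implementation.

```python
import math


def recover_direct(I, t, alpha):
    """Invert the atmospheric scattering model: J = (I - alpha) / t + alpha."""
    return (I - alpha) / t + alpha


def unified_d(I, t, alpha):
    """Unified variable D(x); assumes a constant bias of 1 and requires I != 1."""
    return ((I - alpha) / t + (alpha - 1.0)) / (I - 1.0)


def recover_unified(I, t, alpha):
    """Recover J from D alone: J = D * I - D + 1."""
    D = unified_d(I, t, alpha)
    return D * I - D + 1.0


# Hypothetical pixel values: (hazy intensity, transmission, atmospheric light).
samples = [(0.70, 0.50, 0.90), (0.40, 0.80, 0.95), (0.85, 0.30, 1.00)]
for I, t, alpha in samples:
    direct = recover_direct(I, t, alpha)
    unified = recover_unified(I, t, alpha)
    # Both routes should give the same clean intensity up to rounding.
    assert math.isclose(direct, unified, rel_tol=1e-9, abs_tol=1e-12)
```

The check illustrates that once D(x) is known, neither α nor t(x) has to be estimated explicitly; a network only needs to predict D(x).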

Network Design
The architecture of the proposed IFE-Net contains three essential parts: (i) a multiscale feature block, in which the outputs of filters of different sizes are concatenated; (ii) an attention mechanism composed of channel attention, pixel attention, and texture attention; and (iii) a bilateral constrained rectifier linear unit (BCReLU). As illustrated in Figure 1, the input image is first passed to the multiscale feature extraction block to produce multiscale features. Next, we process the multiscale features using an attention mechanism block. The combination of the multiscale feature block and the attention mechanism module forms the D(x) estimation block. Finally, we employ BCReLU to perform nonlinear regression on D(x), thus obtaining the clean image.

Multiscale Feature Extraction
Multiscale feature extraction has proven effective for dehazing, as it helps achieve scale invariance while extracting information [42-45]. Moreover, parallel convolutions with different filter sizes are used to capture features at different scales. To compensate for the information lost during convolution, we connect network features of different scales to each other before extracting information from the next feature layer. Inspired by these feature extraction methods, we use convolutional layers of different sizes to densely extract the features of the input image at different scales. As depicted in Figure 2, the multiscale feature extraction block of IFE-Net uses five convolutional operations, with filter sizes of 1 × 1, 3 × 3, 5 × 5, 7 × 7, and 9 × 9. "Conv1" uses a 1 × 1 convolution kernel to extract features, while "Conv2" uses a 3 × 3 convolution kernel; the "Conv1" and "Conv2" layers are then concatenated into the "concat1" layer. During forward propagation, a 5 × 5 convolution kernel is used to extract features from the "concat1" layer to obtain the "Conv3" layer. The "Conv2" layer and the "Conv3" layer are concatenated into the "concat2" layer, and a 7 × 7 convolution kernel is used to extract features from the "concat2" layer to obtain the "Conv4" layer. Then, the "Conv3" layer and the "Conv4" layer are concatenated into the "concat3" layer, and a 9 × 9 convolution kernel is used to extract features from the "concat3" layer to obtain the "Conv5" layer. Finally, the "Conv1", "Conv2", "Conv3", "Conv4", and "Conv5" layers are concatenated to obtain the output of the multiscale feature extraction block. Importantly, the multiscale design of IFE-Net reduces information loss during convolutions and captures features at different scales.
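The wiring of the multiscale block can be traced with a small bookkeeping sketch that lists, for each convolution, its kernel size, its inputs, and the resulting number of input channels. Several details here are assumptions rather than facts from the paper: "Conv1" and "Conv2" are taken to read the input image directly, every convolution is given the same number of output channels (`out_channels`, a hypothetical parameter), and stride-1 "same" padding is used so that feature maps can be concatenated.

```python
def multiscale_block_plan(in_channels=3, out_channels=3):
    """Trace the dense multiscale topology.

    Returns a list of (name, kernel, padding, sources, in_ch) tuples, one per
    convolution, and the channel count of the concatenated block output.
    The number of output channels per convolution is an assumption.
    """
    # (layer name, kernel size, layers concatenated to form its input)
    topology = [
        ("Conv1", 1, ["input"]),
        ("Conv2", 3, ["input"]),
        ("Conv3", 5, ["Conv1", "Conv2"]),  # concat1
        ("Conv4", 7, ["Conv2", "Conv3"]),  # concat2
        ("Conv5", 9, ["Conv3", "Conv4"]),  # concat3
    ]
    widths = {"input": in_channels}
    plan = []
    for name, kernel, sources in topology:
        in_ch = sum(widths[s] for s in sources)
        padding = kernel // 2  # "same" padding keeps height and width unchanged
        plan.append((name, kernel, padding, sources, in_ch))
        widths[name] = out_channels
    # The block output concatenates Conv1 through Conv5.
    block_out = sum(widths[name] for name, *_ in topology)
    return plan, block_out


plan, block_out = multiscale_block_plan()
for name, kernel, padding, sources, in_ch in plan:
    print(f"{name}: {kernel}x{kernel}, padding={padding}, input={'+'.join(sources)}, in_ch={in_ch}")
print("block output channels:", block_out)
```

Under these assumptions, each convolution from "Conv3" onward sees twice the per-layer width, and the block output has five times that width.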

Attention Mechanism
Most previous networks treat channel and pixel features equally during image dehazing, which leads to unsatisfactory results. Moreover, such networks cannot achieve good results for images with uneven haze distribution. In order to better handle different types of information, we designed a novel attention mechanism module, as shown in Figure 3. Compared to networks that treat channel and pixel features equally, the attention mechanism module of IFE-Net assigns different weights to different regions based on the importance of features. The more information the features contain, the greater their weight values. In this way, IFE-Net focuses more on learning important information with high weights. The attention mechanism module consists of three branches, namely the channel attention mechanism, the pixel attention mechanism, and texture attention, which are defined in terms of the following quantities: y is the output of the multiscale feature extraction block, which serves as the input of the attention mechanism module; $\mathrm{Conv}_6(y)$ and $\mathrm{Conv}_7(y)$ denote 1 × 1 convolution layers; $\mathrm{Conv}_8(y)$ and $\mathrm{Conv}_9(y)$ denote 3 × 3 convolution layers; pool represents the 5 × 5 channel max-pooling operation; $F_1(y)$ denotes the output of the channel attention mechanism; $F_2(y)$ denotes the output of the pixel attention mechanism; and $F_3(y)$ denotes the output of texture attention. The attention mechanism block effectively assigns different weights to the features of different regions, enabling the entire network architecture to better retain effective information while suppressing the impact of unimportant information.

Bilateral Constrained Rectifier Linear Unit
Common choices for the nonlinear activation function in deep networks include sigmoid [46], tanh [47], and the rectified linear unit (ReLU) [48]. Sigmoid saturates at both ends, but it has a high computational cost and easily suffers from vanishing gradients, which may cause training to converge to poor local optima. Compared to sigmoid, tanh has an output mean of 0, which leads to faster convergence and fewer iterations. However, tanh, like sigmoid, has soft saturation, which results in vanishing gradients. ReLU was proposed to alleviate the vanishing gradient problem of neural networks to a certain extent and to accelerate the convergence of gradient descent. It is worth noting that ReLU only suppresses one side and is unbounded for inputs greater than 0, which may lead to response overflow, especially in the final layer. For image restoration, the output of the last layer should have upper and lower boundaries, and the range of values should be relatively small. To this end, we propose a bilateral constrained rectifier linear unit (BCReLU) activation function to overcome the limitations of sigmoid and ReLU, as shown in Figure 4. As a novel linear unit, BCReLU keeps a bilateral constraint and local linearity. Its output is centered around zero, making the following layer of neurons less prone to bias and neuronal necrosis. In addition, BCReLU saves computational time and converges faster than other activation functions, which can help mitigate gradient attenuation as the number of layers increases. The boundary values of BCReLU are $y_{\max}$ and $y_{\min}$ ($y_{\max} = 1$ and $y_{\min} = -1$). BCReLU can be expressed as

$$\mathrm{BCReLU}(y) = \max\left(y_{\min}, \min\left(y_{\max}, y\right)\right)$$

We compared the activation functions of the last layer in the network. Table 1 shows the quantitative evaluation results of different activation functions in the last layer on the SOTS and ITS datasets (see Section 4.2 for details of the PSNR and SSIM indicators). When using BCReLU in the last layer, the network achieved the best results, which confirms its effectiveness in IFE-Net.
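The bilateral constraint can be illustrated with a minimal scalar sketch. It assumes that BCReLU is the identity between the two boundaries and clamps to $y_{\min}$ or $y_{\max}$ outside them, which is what the description above (bilateral constraint, local linearity, boundaries at −1 and 1) suggests; the function names are illustrative.

```python
def bcrelu(y, y_min=-1.0, y_max=1.0):
    """Clamp y to [y_min, y_max]; linear (identity) inside the interval."""
    return max(y_min, min(y_max, y))


def bcrelu_grad(y, y_min=-1.0, y_max=1.0):
    """Derivative of the clamped form: 1 inside the interval, 0 where it saturates."""
    return 1.0 if y_min < y < y_max else 0.0


for value in (-7.0, -0.4, 0.0, 0.3, 2.5):
    print(value, bcrelu(value), bcrelu_grad(value))
```

Unlike sigmoid or tanh, the gradient inside the interval is constant, and unlike ReLU, the output cannot exceed the upper boundary.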

Experiments
To verify the superiority of IFE-Net, the dehazing results of IFE-Net were qualitatively and quantitatively compared with those of existing advanced dehazing methods using real-world images and benchmark datasets.

Datasets and Implementation Details
We chose the ground truth images with depth meta-data from the indoor NYU2 Depth Database [49]. Over 1440 clean images were selected from the NYU2 database and used to create synthesized hazy images using Equation (1). We chose β ∈ {0.4, 0.6, 0.8, 1.0, 1.2, 1.4, 1.6}, and each channel was set with a different atmospheric light α, with a range of [0.6, 1.0]. The synthesized training set includes 27,193 hazy images, and the learning rate is set to 0.0001 during the training process. During the course of the experiments, we adopted the simple mean square error (MSE) loss function. Moreover, we utilized the BCReLU neuron in the last convolutional layer, as we found it more effective than other neurons in our specific environment. IFE-Net only needs a few epochs to converge and exhibits stability after approximately 10 epochs. In this study, we use the model parameters saved after 10 epochs of training for dehazing. We notice that an appropriately large batch size can yield good performance in the batch normalization layer [50]. Due to limited physical memory on the GPU cards, the batch size is set to 16 during training. All experiments are performed on an NVIDIA RTX 3060 Ti GPU (NVIDIA, Santa Clara, CA, USA).
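The synthesis procedure can be sketched for a single pixel as follows. The sketch applies the atmospheric scattering model with t = exp(−β·d) and draws β from the set {0.4, 0.6, ..., 1.6} and a per-channel atmospheric light from [0.6, 1.0]. Uniform sampling, intensities normalized to [0, 1], and the example depth value are assumptions; the function names are hypothetical and do not come from the authors' code.

```python
import math
import random

BETAS = [0.4, 0.6, 0.8, 1.0, 1.2, 1.4, 1.6]


def transmission(depth, beta):
    """Medium transmission t = exp(-beta * depth)."""
    return math.exp(-beta * depth)


def synthesize_hazy_pixel(clean_rgb, depth, beta, airlight_rgb):
    """Apply I = J * t + alpha * (1 - t) to each color channel."""
    t = transmission(depth, beta)
    return tuple(j * t + a * (1.0 - t) for j, a in zip(clean_rgb, airlight_rgb))


def sample_haze_params(rng):
    """Draw beta and a per-channel atmospheric light (uniform sampling is assumed)."""
    beta = rng.choice(BETAS)
    airlight = tuple(rng.uniform(0.6, 1.0) for _ in range(3))
    return beta, airlight


rng = random.Random(0)  # fixed seed so the sketch is repeatable
beta, airlight = sample_haze_params(rng)
hazy = synthesize_hazy_pixel((0.2, 0.5, 0.3), depth=2.0, beta=beta, airlight_rgb=airlight)
print(beta, airlight, hazy)
```

At zero depth the transmission is 1 and the pixel is unchanged; as depth grows, the pixel converges to the atmospheric light.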

Quantitative Results on Synthetic Images
We adopt the peak signal-to-noise ratio (PSNR) and structural similarity index measure (SSIM) [51] as image quality indicators for quantitative analysis. PSNR measures the ratio between the maximum possible signal value and the noise that distorts the image; the larger the value, the lower the image distortion. PSNR can be expressed as

$$\mathrm{PSNR} = 10 \log_{10}\left(\frac{\mathrm{MaxValue}^2}{\mathrm{MSE}}\right)$$

where MSE is the mean square error of two images, and MaxValue is the maximum value that can be obtained from image pixels. SSIM is an indicator that measures the similarity between two images. From the perspective of image composition, SSIM defines structural information as independent of brightness and contrast, reflecting the properties of object structures in the scene. SSIM models distortion as a combination of three different factors: brightness, contrast, and structure. It uses the mean as the estimate of brightness, the standard deviation as the estimate of contrast, and the covariance as the measure of structural similarity:

$$\mathrm{SSIM}(x, y) = \frac{(2 u_x u_y + C_1)(2 \sigma_{xy} + C_2)}{(u_x^2 + u_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)}$$

where $C_1 = (K_1 L)^2$ and $C_2 = (K_2 L)^2$ are used to avoid situations where the denominator is 0; L is equivalent to MaxValue in PSNR, and $K_1$ and $K_2$ are very small constants; $u_x$ and $u_y$ are the means of the two images; $\sigma_x^2$ and $\sigma_y^2$ are their variances; and $\sigma_{xy}$ is their covariance. Compared to PSNR, SSIM is more in line with human visual characteristics in evaluating image quality.
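The two indicators can be sketched on flattened lists of pixel intensities. The PSNR function uses 10·log10(MaxValue²/MSE). The SSIM function is a simplification: it computes the statistics once over the whole image rather than averaging over local windows as practical implementations usually do, and the constants K1 = 0.01 and K2 = 0.03 are commonly used defaults that are assumed here, not values stated in this paper.

```python
import math


def mse(x, y):
    """Mean square error between two equally sized pixel lists."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)


def psnr(x, y, max_value=1.0):
    """PSNR = 10 * log10(MaxValue^2 / MSE); infinite for identical images."""
    err = mse(x, y)
    if err == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / err)


def ssim_global(x, y, max_value=1.0, k1=0.01, k2=0.03):
    """Single-window SSIM computed from global statistics (a simplification)."""
    n = len(x)
    ux = sum(x) / n
    uy = sum(y) / n
    var_x = sum((a - ux) ** 2 for a in x) / n
    var_y = sum((b - uy) ** 2 for b in y) / n
    cov = sum((a - ux) * (b - uy) for a, b in zip(x, y)) / n
    c1 = (k1 * max_value) ** 2
    c2 = (k2 * max_value) ** 2
    numerator = (2 * ux * uy + c1) * (2 * cov + c2)
    denominator = (ux ** 2 + uy ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


# Hypothetical 4-pixel images with intensities in [0, 1].
reference = [0.10, 0.40, 0.60, 0.90]
restored = [0.12, 0.38, 0.63, 0.88]
print(psnr(reference, restored), ssim_global(reference, restored))
```

Identical images give an infinite PSNR and an SSIM of 1, and both scores fall as the restored image departs from the reference.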
High PSNR and SSIM scores indicate low image distortion and a more similar structure. We compare IFE-Net with powerful methods from recent years based on the PSNR and SSIM indicators. DCP [11] does not require precise physical modeling of haze in images but only relies on the dark channel prior to calculate the transmission matrix for image dehazing. Dehaze-Net [31] is an end-to-end system that utilizes prior knowledge to obtain atmospheric light, only learns the medium transmission map through the network, and ultimately obtains clean images. AOD-Net [21] is the first end-to-end trainable dehazing model, which does not separately estimate the parameters in Equation (1) but rather unifies all parameters into one and directly obtains a clean image from the hazy image. FFA-Net [36] has a feature fusion attention mechanism, and the design of the network allows it to perform well with dense haze, textures, and details. GCA-Net [52] applies a gated fusion subnetwork and smoothed dilated convolutions, which is beneficial for fusing features of different scales and removing possible grid artifacts. DWGAN [53] introduces the 2D discrete wavelet transform, aiming at restoring clear texture details and retaining sufficient high-frequency information. GUNet [54] significantly reduces overhead while effectively removing haze. The images in the RESIDE dataset [55] were selected for the experimental evaluation of our method.
Figure 5 shows the dehazing results of some randomly selected synthetic images from the SOTS dataset. DCP [11], Dehaze-Net [31], and DWGAN [53] successfully remove heavy haze, but they exhibit color distortion and increased brightness. There are also issues with brightness enhancement and contrast in the results generated via FFA-Net [36], GCA-Net [52], GUNet [54], and AOD-Net [21]. IFE-Net handles details better and maintains color consistency with the ground truth. It can be observed that the results of IFE-Net are significantly better than those of other networks in terms of the fidelity of image details and color. Table 2 shows the average quantitative results of the quality evaluation indicators in Figure 5, and the PSNR and SSIM values of IFE-Net are superior to those of the other methods. Tables 3 and 4 show the PSNR and SSIM results on the SOTS and ITS datasets, respectively. Meanwhile, Table 5 shows the average time it takes for different networks to process each image with a size of 548 × 412. The results in Tables 2-5 indicate that IFE-Net is effective and efficient.
In addition, we removed haze from a hazy image containing a large area of sky and compared the result with several other advanced methods. Most dehazing algorithms perform poorly on images containing large areas of sky, resulting in color distortion and uneven color blocks in the restored haze-free images. We show the results of several methods in Figure 8. Figure 8a shows the input hazy image, and Figure 8b-d show the dehazing results of GCA-Net, GUNet, and IFE-Net, respectively. From Figure 8b, it can be observed that haze on the ground is thoroughly removed, but uneven color blocks appear on the ground and also in the sky area. In Figure 8c, there are no issues with color distortion, but the dehazing effect in the ground and sky areas is not significant. In Figure 8d, the haze in the sky is suppressed without significant color distortion blocks. Simultaneously, the dehazing effect in the ground area is significant, and the results obtained are good in terms of dehazing and details.

Ablation Research
Both IFE-Net and AOD-Net unify the atmospheric light and transmission map in the atmospheric scattering model into one parameter, directly obtaining clean images. In order to evaluate the contribution of the AM module in the network, we compared networks with and without it for AOD-Net and IFE-Net, respectively. Table 6 shows the experimental results on two datasets, indicating that the addition of the AM module resulted in better PSNR and SSIM results. Figure 9 shows a comparison of the visual effects; networks without an AM module produce darker colors, while networks with an AM module achieve better visual effects. The quantitative and qualitative results in the ablation research demonstrate the effectiveness of the AM module in the networks.

Conclusions
We proposed a novel end-to-end adaptive enhancement dehazing network, called IFE-Net, to address the challenge of single-image dehazing. IFE-Net consists of a multiscale feature extraction block, an attention mechanism (AM) module, and a bilateral constrained rectifier linear unit (BCReLU). Considering the cumulative errors that may arise from estimating atmospheric light and transmission maps separately, IFE-Net estimates a single parameter that unifies both. Its novel network design effectively performs feature extraction. In addition, we designed an attention mechanism (AM) module to address the varying importance of information in different regions. The importance of BCReLU in image restoration was also demonstrated through experiments. We compared IFE-Net with other dehazing methods using PSNR and SSIM, and the results show that IFE-Net achieved good scores for both indicators. At the same time, we used subjective criteria to analyze the results obtained via different methods on natural hazy images. Our conclusion is that the proposed IFE-Net combines feature extraction blocks, an attention mechanism, and a BCReLU activation function, making it effective in both natural and synthetic image dehazing. Although IFE-Net has a simple structure, it shows strong capabilities in haze removal. The experimental results confirm the superiority and efficiency of IFE-Net. At present, IFE-Net has achieved good results in dehazing, and a promising direction for our future research is to apply it to image enhancement algorithms.


Figure 5. Quantitative comparison of IFE-Net with other methods on SOTS.

Figure 6 shows a comparison of the results between IFE-Net and other methods using real scenes. As shown in Figure 6, DCP and GCA-Net suffered from visual artifacts on the real hazy images. AOD-Net, FFA-Net, and GCA-Net produced unrealistic colors in one or several images, such as the results of AOD-Net and FFA-Net in the fourth row as well as the results of AOD-Net and GCA-Net in the fifth row. Dehaze-Net and FFA-Net retained a thin layer of haze, as shown in the second row. However, IFE-Net achieved excellent results in both thin and thick haze areas while maintaining colors consistent with real scenes. A similar result can be observed in the outdoor images shown in Figure 7. We enlarged the upper left corners of Figure 7a-g,h-n to show the details. The results of AOD-Net, FFA-Net, GCA-Net, and GUNet exhibited color distortions and many unnatural characteristics. Additional white haze appeared in the result of FFA-Net, indicating incomplete dehazing. The result of DWGAN contained too much white. GCA-Net and DWGAN showed unclear outlines of buildings in hazy areas near the sky. In contrast, IFE-Net successfully removed almost all of the haze while preserving the essential properties of the images, with obvious advantages in preserving edges, texture, contrast, brightness, and other image characteristics, as shown in Figure 7.

Figure 7. This figure shows our ability in detail and color processing compared to other networks.

Figure 8. Results of dehazing in sky areas with dense haze.

Figure 9. Comparison of the visual effects of the AM block.

Table 1. Quantitative results of quality evaluation indicators on SOTS and ITS datasets using different activation functions in the final layer.

Table 2. The average quantitative results of the quality evaluation indicators in Figure 5.

Table 3. Quantitative results of quality evaluation indicators on SOTS dataset.

Table 4. Quantitative results of quality evaluation indicators on ITS dataset.

Table 5. The average time taken by different networks to process each image.

Table 6. Effectiveness of AM module.