A Novel Transformer-Based Attention Network for Image Dehazing

Image dehazing is challenging because it involves ill-posed parameter estimation. Numerous prior-based and learning-based methods have achieved great success. However, most learning-based methods rely on the changes and connections between scale and depth in convolutional neural networks for feature extraction. Although their performance is greatly improved compared with the prior-based methods, their ability to extract detailed information is still limited. In this paper, we propose an image dehazing model built with a convolutional neural network and a Transformer, called Transformer for image dehazing (TID). First, we propose a Transformer-based channel attention module (TCAM), with a spatial attention module as its supplement. These two modules form an attention module that enhances channel and spatial features. Second, we use a multiscale parallel residual network as the backbone, which extracts feature information at different scales to achieve feature fusion. We experimented on the RESIDE dataset and then conducted extensive comparisons and ablation studies against state-of-the-art methods. Experimental results show that our proposed method effectively improves the quality of the restored image and that it outperforms existing attention modules.


Introduction
In severe weather, such as haze, fog, rain, or snow, capturing high-quality images is a challenging task due to reduced visibility. These conditions also reduce the performance of advanced vision tasks, such as image classification, object detection, and scene analysis. Therefore, it is of great significance to remove the influence of severe weather from images [1,2]. For example, image dehazing, deraining, and desnowing have received a lot of attention [3][4][5][6][7][8][9][10]. The field of image dehazing can be divided into general scenes and remote sensing images. For remote sensing images, preprocessing is required by wavelet-based denoising or newer versions of compressed sensing [11][12][13]. Our work proposes an image dehazing model combining Transformers and CNNs for general scenes.
In this paper, we propose a model for image dehazing. Image dehazing has been widely studied in recent years. Most methods implement haze removal through the atmospheric scattering model [14], as shown in Equation (1):

I(x) = J(x)t(x) + A(1 − t(x)), (1)

where I(x) is the hazy image formed by the scattering medium, J(x) is the restored haze-free image, t(x) is the transmission matrix, A is the global atmospheric light, and x is the pixel position.
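As a quick illustration (not part of the paper's method), the forward model of Equation (1) and its inversion can be sketched in NumPy; the haze-free image, transmission map, and atmospheric light below are synthetic toy values:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: a 4x4 "haze-free" image J, a transmission map t,
# and a scalar global atmospheric light A (names follow Equation (1)).
J = rng.uniform(0.0, 1.0, size=(4, 4))
t = rng.uniform(0.2, 1.0, size=(4, 4))   # transmission matrix t(x)
A = 0.9                                   # global atmospheric light

# Forward model, Equation (1): I(x) = J(x) t(x) + A (1 - t(x))
I = J * t + A * (1.0 - t)

# If t(x) and A were known exactly, the haze-free image could be inverted:
J_rec = (I - A * (1.0 - t)) / t

assert np.allclose(J, J_rec)
```

In practice t(x) and A are unknown, which is precisely the ill-posed estimation problem that prior-based and learning-based methods address.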

1.
We propose to apply Transformer as a channel attention module to the image dehazing task. We perform quantitative and qualitative comparisons with state-of-the-art methods on synthetic and real-world hazy image datasets, achieving better results on both.

2.
Our proposed Transformer-based channel attention module (TCAM) is a plug-and-play module that can be applied to other models or tasks, such as image classification, object detection, etc.

Related Work
The existing methods of haze removal can be divided into the prior-based method and the learning-based method according to different data processing methods [25].

Prior-Based Method
The prior-based method estimates the transmission matrix and the global atmospheric light by assuming scene and hand-crafted priors.
Fattal et al. [26] redefined the atmospheric scattering model by adding new surface shadow variables and assumed that the surface shadow and the transmission function are statistically independent. Tan et al. [27] assumed that the haze-free image has higher contrast than the hazy image, and they improved the image quality by enhancing the contrast of the image. He et al. [28] observed that at least one color channel has a very low pixel value in the local area of most haze-free outdoor images, and they proposed a dark channel prior dehazing algorithm. Tang et al. [29] studied haze-relevant priors based on the regression framework to identify the best prior combination. Berman et al. [30] observed that the pixels in the RGB space clusters of the haze-free image are usually nonlocal, and these clusters form haze lines in the hazy image. Zhu et al. [31] proposed an image dehazing framework based on artificial multiexposure image fusion, which first combines the global and local details of the gamma-corrected image, and then balances the image brightness and color saturation to obtain the corresponding haze-free image.
Since these prior-based methods use hand-crafted priors and specific scenes as preconditions, the performance of these methods in haze removal is inferior if the prior is invalid or insufficient. For example, DCP [28] is less effective in processing highlights or large white areas and sky areas.

Learning-Based Method
Recently, CNNs have achieved great success in the field of computer vision. Therefore, learning-based methods using CNN have been widely proposed, which solve the problem of prior-based methods that rely heavily on hand-crafted priors and restrictions on specific scenarios.
Cai et al. [32] proposed an end-to-end dehazing network named DehazeNet to predict the transmission matrix, where the global atmospheric light is a given fixed value. Ren et al. [33] proposed a model that extracts features through coarse-scale and fine-scale networks to estimate the transmission matrix, called single image dehazing via multiscale convolutional neural network (MSCNN). However, these end-to-end models only estimate the transmission matrix. If the transmission matrix or global atmospheric light is inaccurate, the quality of the restored image will be suboptimal. Li et al. [15] proposed the all-in-one dehazing network, which integrates the transmission matrix t(x) and the global atmospheric light A into a single parameter K(x). Zhang et al. [34] proposed a model that learns the transmission matrix and global atmospheric light separately, called densely connected pyramid dehazing network (DCPDN). It introduces a generative adversarial network to identify the restored image. Qu et al. [35] proposed a model that does not rely on the atmospheric scattering model, called enhanced Pix2pix dehazing network (EPDN). The model is composed of three modules: a multiresolution generator, a multiscale discriminator, and an enhancer. However, the generator has certain limitations in generating real detail features. Wu et al. [36] proposed a model consisting of an autoencoder-like dehazing network and contrastive regularization, called contrastive learning for compact single-image dehazing (AECR-Net). Nevertheless, contrastive learning requires a certain proportion of negative samples, which seriously slows down training. An attention mechanism, by contrast, attends to detailed features without slowing down training. Consequently, it is meaningful to introduce the attention module.

Proposed Network Framework
The overall architecture of our proposed Transformer for image dehazing (TID) is shown in Figure 1. The TID consists of two modules: the multiscale parallel residual module and the attention module.

Multiscale Parallel Residual Module
As shown in the dashed box in Figure 1, the multiscale parallel residual module [37] we used can be formulated by Equation (2):

X_{i+1} = ∂(Conv_1(X_i) ⊕ Conv_2(X_i)) + X_i, (2)

where Conv_i represents a convolutional operation with a given filter size, ⊕ represents channel concatenation, and ∂ represents the ReLU [38] function. Compared with the conventional residual module, the multiscale parallel residual module extracts feature information at multiple scales and superimposes the original input. Here, we extract only two feature maps of different scales. Then, we feed the obtained concatenated feature map into the attention module.

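The multiscale parallel residual module described above can be sketched in NumPy. The 3×3 and 5×5 filter sizes and the equal channel split between the two branches are assumptions for illustration, not values stated in the paper:

```python
import numpy as np

def conv2d(x, w):
    """'Same'-padded 2D convolution. x: (C_in, H, W), w: (C_out, C_in, k, k)."""
    c_out, c_in, k, _ = w.shape
    p = k // 2
    xp = np.pad(x, ((0, 0), (p, p), (p, p)))
    _, h, wd = x.shape
    out = np.zeros((c_out, h, wd))
    for i in range(h):
        for j in range(wd):
            # Contract the (C_in, k, k) window against each output filter.
            out[:, i, j] = np.tensordot(w, xp[:, i:i + k, j:j + k], axes=3)
    return out

def multiscale_parallel_residual(x, w3, w5):
    """Two parallel convs of different filter sizes, channel-concatenated,
    passed through ReLU, then added back to the original input."""
    y = np.concatenate([conv2d(x, w3), conv2d(x, w5)], axis=0)  # channel concat
    y = np.maximum(y, 0.0)                                      # ReLU
    return y + x                                                # residual add

rng = np.random.default_rng(1)
C, H, W = 8, 6, 6
x = rng.standard_normal((C, H, W))
# Each branch outputs C/2 channels so the concatenation matches x's C channels.
w3 = rng.standard_normal((C // 2, C, 3, 3)) * 0.1
w5 = rng.standard_normal((C // 2, C, 5, 5)) * 0.1
y = multiscale_parallel_residual(x, w3, w5)
assert y.shape == x.shape
```

The equal split keeps the concatenated output the same shape as the input, which the residual addition requires.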

Transformer-Based Channel Attention Module
The existing methods of calculating channel attention are all achieved by compressing the spatial dimension of the feature map. SE-Net [19] used the average-pooling operation to squeeze the spatial information of its feature map to obtain the channel attention map. CBAM [18] used both average-pooling and max-pooling to calculate the channel attention map. Compared with simply using average-pooling or max-pooling to squeeze the spatial dimension, we propose a module that uses the Transformer architecture to obtain channel attention, which is called the Transformer-based channel attention module (TCAM). The specific details are shown in Figure 2.

Transformer uses a one-dimensional sequence as the input token. Therefore, we need to perform patch embedding on the input feature map. First, we divide the input feature map into image patches with a resolution of P × P, and then each image patch is flattened into a one-dimensional sequence, which is called a token. If the input feature map is X_0 ∈ R^(H×W×C), the tokens after patch embedding are X_t ∈ R^(N×(P²·C)), where H, W, and C are the height, width, and number of channels, respectively, and N = HW/P² is the number of patches. Then, all tokens are linearly projected to a constant size D for the Transformer encoder to compute.
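The patch-embedding step above can be sketched as follows. The projection matrix here is a random stand-in for the learned weights, and the concrete sizes are arbitrary illustrative choices:

```python
import numpy as np

def patch_embed(x, P, W_proj):
    """Split an (H, W, C) feature map into N = HW/P^2 patches of size PxP,
    flatten each into a length P*P*C token, and linearly project to size D."""
    H, W, C = x.shape
    assert H % P == 0 and W % P == 0
    tokens = (x.reshape(H // P, P, W // P, P, C)
               .transpose(0, 2, 1, 3, 4)
               .reshape(-1, P * P * C))        # X_t: (N, P^2 * C)
    return tokens @ W_proj                      # projected tokens: (N, D)

rng = np.random.default_rng(2)
H, W, C, P, D = 8, 8, 3, 4, 16
x = rng.standard_normal((H, W, C))
W_proj = rng.standard_normal((P * P * C, D)) * 0.1  # learnable in practice
t = patch_embed(x, P, W_proj)
assert t.shape == (H * W // P**2, D)   # N = 4 tokens, each of size D
```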
In order to preserve the position information of each image patch, all tokens are position embedded, as shown in Equation (3):

T_0 = Linear(X_t) + E_p, (3)
where E_p ∈ R^(N×D) is a randomly initialized, learnable position embedding, and T_0 is the input token sequence of the Transformer encoder. The Transformer encoder [39] includes two layer-normalization (LN) layers, a multihead self-attention (MSA) block, and a multilayer perceptron (MLP) block, where LN is applied before MSA and MLP. In addition, to avoid vanishing gradients, the Transformer encoder uses residual connections after MSA and MLP [23,40]. The MLP is a two-layer perceptron with expansion ratio r. The processing of the l-th Transformer encoder can be expressed as

T′_l = MSA(LN(T_{l−1})) + T_{l−1}, l = 1, …, L, (4)
T_l = MLP(LN(T′_l)) + T′_l, l = 1, …, L, (5)

where L represents the number of Transformer encoders. Then, we perform the average-pooling (AvgPool) operation along the first dimension of the output T_L of the Transformer encoder. The channel attention map T_c ∈ R^(1×C) is obtained by linear projection, formulated as in Equation (6):

T_c = Linear(AvgPool(T_L)). (6)
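The pre-LN encoder and the average-pool-plus-projection head described above can be sketched in NumPy. For brevity this sketch uses a single attention head (the paper uses multihead attention) and a ReLU MLP (the MLP activation is not stated in the text); all weight shapes are illustrative assumptions:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def self_attention(t, Wq, Wk, Wv):
    """Single-head self-attention (a simplification of the paper's MSA)."""
    q, k, v = t @ Wq, t @ Wk, t @ Wv
    scores = q @ k.T / np.sqrt(k.shape[-1])
    return softmax(scores) @ v

def encoder_layer(t, Wq, Wk, Wv, W1, W2):
    """Pre-LN encoder: LN -> attention -> residual, then LN -> MLP -> residual."""
    t = t + self_attention(layer_norm(t), Wq, Wk, Wv)
    t = t + np.maximum(layer_norm(t) @ W1, 0.0) @ W2   # two-layer MLP
    return t

rng = np.random.default_rng(3)
N, D, C, r = 4, 16, 8, 2
t0 = rng.standard_normal((N, D))                 # tokens after Equation (3)
Wq, Wk, Wv = (rng.standard_normal((D, D)) * 0.1 for _ in range(3))
W1 = rng.standard_normal((D, r * D)) * 0.1       # expansion ratio r
W2 = rng.standard_normal((r * D, D)) * 0.1
W_out = rng.standard_normal((D, C)) * 0.1

tL = encoder_layer(t0, Wq, Wk, Wv, W1, W2)
# Average-pool over the N patch tokens, then linearly project to C channels.
T_c = tL.mean(axis=0) @ W_out                    # channel attention map
assert T_c.shape == (C,)
```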
The output feature map of TCAM can be expressed by Equation (7):

X_{i+1} = X_i ⊗ T_c, (7)

where ⊗ denotes pixel-level dot multiplication. In the dot multiplication, the channel attention map T_c expands along the spatial dimension through the broadcast mechanism. Compared with ViT [22] and DeiT [41], which use the class token as the output, our proposed TCAM performs the average-pooling operation on the patch tokens to calculate the channel attention values of the feature map, thereby enhancing detailed information. The experimental results demonstrate that TCAM is effective.

Spatial Attention Module
We use a spatial attention module as a supplement to TCAM. The spatial attention module from CBAM [18] performs average-pooling and max-pooling along the channel axis and then concatenates the two spatial attention maps. The spatial attention map T_s can be calculated by Equation (8):

T_s = σ(Conv_7×7(AvgPool(X) ⊕ MaxPool(X))), (8)
where σ denotes the sigmoid function, ⊕ represents channel concatenation, and Conv_7×7 represents a convolution operation with a kernel size of 7 × 7. As shown in Figure 3, the attention module includes two parts: TCAM and the spatial attention module. This module enhances the detailed information of the feature map, and the overall process can be expressed by Equation (9):

X_{i+1} = T_s ⊗ (T_c ⊗ X_i), (9)
where ⊗ denotes pixel-level dot multiplication. The overall architecture of our proposed model is shown in Figure 1; we use two multiscale parallel residual modules and two attention modules. Finally, we concatenate the feature maps output by the two attention modules along the first dimension and then obtain K(x) through a convolutional layer. Table 1 shows the parameters of all convolutional layers.
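The spatial attention module and its composition with a channel attention map can be sketched in NumPy. The channel map T_c below is a random stand-in for TCAM's output, and the 7×7 conv weights are illustrative:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def conv2d_same(x, w):
    """'Same'-padded conv producing one output channel.
    x: (H, W, C_in), w: (k, k, C_in)."""
    k = w.shape[0]
    p = k // 2
    xp = np.pad(x, ((p, p), (p, p), (0, 0)))
    H, W, _ = x.shape
    out = np.zeros((H, W))
    for i in range(H):
        for j in range(W):
            out[i, j] = np.sum(xp[i:i + k, j:j + k] * w)
    return out

def spatial_attention(x, w7):
    """CBAM-style spatial attention: avg- and max-pool along the channel
    axis, concatenate, 7x7 conv, sigmoid."""
    avg = x.mean(axis=-1, keepdims=True)
    mx = x.max(axis=-1, keepdims=True)
    pooled = np.concatenate([avg, mx], axis=-1)      # (H, W, 2)
    return sigmoid(conv2d_same(pooled, w7))          # T_s: (H, W), values in (0, 1)

rng = np.random.default_rng(4)
H, W, C = 6, 6, 8
x = rng.standard_normal((H, W, C))
T_c = sigmoid(rng.standard_normal(C))                # stand-in channel attention map
w7 = rng.standard_normal((7, 7, 2)) * 0.05

T_s = spatial_attention(x, w7)
# The channel map broadcasts over (H, W); the spatial map broadcasts over C.
out = x * T_c[None, None, :] * T_s[:, :, None]
assert out.shape == x.shape
```

The two broadcasts implement the pixel-level dot multiplications described above without ever materializing full-size attention tensors.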

Figure 3. Architecture of the proposed attention module.
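Since the network's final convolutional layer outputs K(x), the restored image can be recovered from it. Assuming the AOD-Net-style formulation of Li et al. [15] (the paper does not spell out its recovery step), the mapping is a sketch like:

```python
import numpy as np

def recover(I, K, b=1.0):
    """AOD-Net-style recovery [15]: J(x) = K(x) * I(x) - K(x) + b,
    where K(x) folds together the transmission matrix t(x) and the
    global atmospheric light A, and b is a constant bias."""
    return K * I - K + b

rng = np.random.default_rng(5)
I = rng.uniform(0.0, 1.0, size=(4, 4))   # hazy input
K = rng.uniform(0.5, 2.0, size=(4, 4))   # hypothetical network output K(x)
J = recover(I, K)
assert J.shape == I.shape
```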

Results
To verify the effectiveness of our proposed TID, we chose the indoor training set (ITS), synthetic objective testing set (SOTS), and hybrid subjective testing set (HSTS) in RESIDE [42] for the experiments. Among them, ITS was used as the training set, and SOTS and HSTS were used as test sets.


Quantitative and Qualitative Results on the Synthetic Dataset
In our experiment, we selected 500 and 10 images from SOTS and HSTS, respectively. All selected images are synthetic hazy images with haze-free ground truth.
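The quantitative comparisons below use PSNR and SSIM. A minimal PSNR implementation for images scaled to [0, 1] is shown here (SSIM is omitted for brevity); the test images are synthetic:

```python
import numpy as np

def psnr(ref, test, max_val=1.0):
    """Peak signal-to-noise ratio in dB between a reference image and a
    restored image, both scaled to [0, max_val]."""
    mse = np.mean((ref.astype(np.float64) - test.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")          # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)

rng = np.random.default_rng(6)
gt = rng.uniform(0.0, 1.0, size=(8, 8))
noisy = np.clip(gt + rng.normal(0.0, 0.05, size=gt.shape), 0.0, 1.0)

assert psnr(gt, gt) == float("inf")
# A lightly corrupted image scores higher than an unrelated random image.
assert psnr(gt, noisy) > psnr(gt, rng.uniform(0.0, 1.0, size=(8, 8)))
```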
The second row in Table 2 shows the quantitative evaluation on SOTS. Compared with these state-of-the-art methods, our method shows a significant improvement in both the PSNR and SSIM evaluation indicators. Furthermore, we also tested on HSTS outdoor synthetic data. As shown in the third row in Table 2, our proposed method also achieves better PSNR and SSIM results than the other methods on the HSTS dataset.
Figure 4 shows some dehazing results on the HSTS dataset. As shown in Figure 4b,c, the prior-based methods do not perform well. For the learning-based methods, as shown in Figure 4d, the result generated by DehazeNet [32] is significantly darker. As shown in Figure 4e, although the recovered images obtained by AOD-Net [15] have higher brightness, the ability to restore details is poor. For example, the city wall and the red flags on the city wall in the first image, the bridge deck in the fourth image, and the overall dehazing effect of the third image are not good. As shown in Figure 4f, the result obtained by EPDN [35] is close to the ground truth image, but its effect is not good in the sky area. As shown in Figure 4g, the result obtained by AECR-Net [36] is also close to the ground truth images, but its generalization ability to real-world images is not good, which we discuss in Section 4.1.2. The recovered image obtained by our proposed method is visually closer to the haze-free ground truth than those of the other methods.
Figure 4. Dehazing results on the HSTS dataset: (b) Fattal [26]; (c) FVR [43]; (d) DehazeNet [32]; (e) AOD-Net [15]; (f) EPDN [35]; (g) AECR-Net [36]; (h) TID (ours); (i) ground truth.

Qualitative Results in Real-World Hazy Images
To evaluate the generalization ability of our proposed method, we selected 10 real-world hazy images (without haze-free ground truth) from HSTS. As shown in Figure 5b, Fattal's [26] method is less effective than the other methods. As shown in Figure 5b,c, the prior-based methods do not perform well on details, especially in the red frame area of the first image. As shown in Figure 5d, DehazeNet [32] produces too-low brightness in the red frame area of the third image. As shown in Figure 5e, AOD-Net [15] performs well on real-world images, but it is inferior to our proposed method in detail. As shown in Figure 5f, EPDN [35] performs better in the red frame area but has severe color distortion in the sky area, such as in the first and third images. As shown in Figure 5g, AECR-Net [36] performs poorly overall on real-world images and has inferior generalization ability. In summary, our method generalizes well to real-world hazy images in visual quality and better preserves detailed information, especially in the areas shown in the red boxes.
Figure 5. Dehazing results on real-world hazy images: (b) Fattal [26]; (c) FVR [43]; (d) DehazeNet [32]; (e) AOD-Net [15]; (f) EPDN [35]; (g) AECR-Net [36]; (h) TID (ours).

Ablation Studies
In order to verify the effectiveness of our proposed attention module, we designed three ablation studies: (1) attention module versus non-attention module; (2) our proposed attention module compared with SE-Net [19] and CBAM [18]; (3) for the output of TCAM, the average-pooled patch token compared with the class token.

Attention Module and Non-Attention Module
We removed the attention module from our proposed network architecture. Then, the two networks were quantitatively analyzed using PSNR and SSIM on the SOTS test set. Figure 6 shows the validation curves. Compared with the non-attention module, the validation curve with the attention module has a smaller oscillation amplitude and a better fit. Table 3 shows the quantitative results on SOTS. The PSNR of the network using the attention module is 6.89% higher than that of the non-attention network, and the SSIM is 3.61% higher. This ablation study demonstrates that the network with attention modules can effectively improve haze removal performance.


Comparison of Our Proposed Attention Module with SE-Net and CBAM
We replaced our proposed attention module with SE-Net [19] and CBAM [18] and performed a quantitative analysis of the three networks. As shown in Figure 7 and Table 4, compared with SE-Net [19], the PSNR of our proposed attention module is 6.05% higher, and the SSIM is 2.80% higher; compared with CBAM [18], the PSNR is 3.29% higher, and the SSIM is 3.00% higher. This demonstrates that our proposed attention module outperforms SE-Net [19] and CBAM [18].

Average-Pooled Patch Token Compared with the Class Token
For the output of TCAM, we used the class token instead of the average-pooled patch token. Then, a quantitative analysis of these two networks was performed. As shown in Figure 8 and Table 5, compared with the class token, the PSNR of the average-pooled patch token is 1.92% higher, and the SSIM is 1.88% higher. This ablation study demonstrates that the average-pooled patch token is more effective than the class token.

Conclusions
In this paper, we propose a Transformer-based channel attention module (TCAM) combined with a spatial attention module to enhance a CNN-based backbone network. Our proposed TCAM utilizes the Transformer to address the limited local receptive fields of CNNs and then uses a spatial attention module as its complement, which enhances the detailed information of feature maps along both the channel and spatial dimensions. At the same time, we use a multiscale parallel residual module to extract features at different scales to achieve feature reuse. We perform quantitative and qualitative evaluations against state-of-the-art methods on the SOTS and HSTS datasets. Experimental results show that our proposed method has superior performance. Compared with AECR-Net, our proposed method improves PSNR by 4.34% and 4.64% and SSIM by 2.41% and 1.21% on SOTS and HSTS, respectively.
In addition, we designed three ablation studies to verify our proposed attention module. The results of comprehensive ablation experiments show that our proposed attention module can improve image dehazing performance and outperform existing attention modules.