Article

A Self-Attention CycleGAN for Unsupervised Image Hazing

School of Computer Science, Northeast Electric Power University, Jilin 132012, China
*
Author to whom correspondence should be addressed.
Big Data Cogn. Comput. 2025, 9(4), 96; https://doi.org/10.3390/bdcc9040096
Submission received: 23 February 2025 / Revised: 1 April 2025 / Accepted: 4 April 2025 / Published: 11 April 2025

Abstract
Real-world foggy scene images are costly and difficult to collect, so autonomous driving datasets contain few images captured in bad weather. This leaves autonomous driving systems insufficiently trained for such conditions, which can cause unsafe judgments and lead to traffic accidents. To improve the safety and robustness of autonomous driving systems, we improved the CycleGAN model to augment datasets with foggy images. First, by combining the self-attention mechanism with the residual network architecture, the sense of layering of the fog effect in the synthesized images was significantly refined. Then, LPIPS was employed to adjust the calculation of the cycle consistency loss so that the synthetic image is perceptually more similar to the original one. The experimental results showed that, for foggy images generated by the improved CycleGAN network, the FID index decreased by 3.34, the IS index increased by 15.8%, and the SSIM index increased by 0.1%. The modified method enhances foggy image generation while retaining more details of the original image and reducing content distortion.

1. Introduction

Autonomous driving [1] is a common example of an AI application in real life. However, the driving safety problems caused by autonomous driving deserve more attention and urgent solutions. Many factors affect the driving behavior of autonomous vehicles, one of which is the dataset. Including enough autonomous driving scenarios under different environmental conditions in a dataset is key to ensuring more accurate and safer driving behavior; conversely, an insufficient amount of data on autonomous driving scenarios, especially in bad weather, makes the autonomous driving system prone to wrong judgments, resulting in traffic accidents. Currently, most published autonomous driving datasets were collected in sunny or cloudy conditions, and few were collected in foggy, rainy, or snowy conditions. Moreover, it is difficult and costly to collect and annotate autonomous driving data under these harsh conditions.
Recently, image fogging technology has attracted significant attention in computer vision, automated driving, and remote sensing, mainly for data enhancement, algorithm robustness testing, and haze scene simulation. Current techniques fall into three main categories: image fogging based on atmospheric scattering models, image fogging based on deep learning, and image fogging based on depth maps. Methods based on atmospheric scattering models simulate optical properties using physical formulas and synthesize fog effects from transmittance and atmospheric light parameters; they are computationally efficient but rely on manual adjustments. Among them, the FoHIS model proposed by Zhang N et al. is based on the atmospheric scattering model and adds fog to a clear image pixel by pixel [2]. Image fogging methods based on deep learning often apply generative adversarial networks and variational autoencoders to synthesize fog images end to end. They learn the distribution features of many genuine fog photos to produce more realistic fog maps, but they place high requirements on both the volume and the quality of the training data. Sun H et al. proposed a fog map generation method based on a domain adaptation mechanism and employed a variational autoencoder to reduce the difference between the synthetic and real domains [3]. Methods based on depth maps incorporate scene depth information and generate a non-uniform fog effect to enhance spatial realism, making them suitable for scenes requiring geometric consistency, such as stereo vision or autonomous driving. Sakaridis C et al. further synthesized fog images by using the depth information of the image, integrated with the image's object and distance information [4]. Li Liang et al. used depth estimation to obtain the distance information of the scene and simulated the fog effect with a physical scattering model to achieve realistic foggy image generation [5]. However, this approach depends heavily on the quality of the acquired depth information; if the depth estimate is poor, the synthesized foggy image degrades accordingly.
At present, CycleGAN-based image generation is a commonly used method for data enhancement because of its great advantage of generating the desired images without a paired dataset. However, during training, the transformer part of the generator can only extract features from a limited window, so the network cannot fully capture the overall semantic information of the original image, leading to lost information in the generated image. Tommaso Dreossi et al. proposed an automatic driving scene image generator based on a CNN model [6]. This method probes for new loopholes in the automatic driving system by changing the brightness and saturation of the image, the position of objects in the image, and so on; however, the quality of the generated images still needs to be strengthened. Zhang et al. designed the DeepRoad framework [7], whose core function is to enhance the scene generalization ability of autonomous driving technology by synthesizing road images under complex meteorological conditions such as snow and rain. The haze image generation algorithm based on CycleGAN published by Xiao et al. can capture relationships between synthetic and hazy pixels during training, enabling mutual conversion between fog-free and foggy visual scenes [8]. Nevertheless, existing image fogging methods still cannot solve the image domain adaptation problem well, and there remains a certain gap between the generated images and real foggy scenes. Therefore, current research mainly focuses on enhancing the realism of the fog effect, improving computational efficiency, and combining multiple methods to optimize the fogging effect.
As a classic framework for unsupervised image-to-image translation, CycleGAN has demonstrated good performance in many tasks such as image defogging and style transfer. However, directly applying CycleGAN to image fogging still encounters many challenges, such as the global consistency of the generated images, the ability to preserve details, and insufficient perceptual quality. To handle these problems more effectively, we propose a CycleGAN-based data augmentation technique for enriching foggy scene datasets for autonomous driving systems. First, the self-attention mechanism was integrated into the residual network to enable global feature extraction from the input images. Then, the L1 loss in the cycle consistency computation was replaced by LPIPS to improve the perceptual similarity between the generated and real images. The experimental results demonstrate that the proposed method achieves better quantitative performance and improved image quality.

2. Materials and Methods

2.1. The CycleGAN Baseline

In 2017, Zhu et al. first proposed the Cycle Generative Adversarial Network (CycleGAN), which is a breakthrough framework for cross-domain image conversion without pairing training data [9]. CycleGAN shows broad application potential in fields such as art style transfer and medical image synthesis, as well as transfer learning in the field of computer vision. Unlike the traditional method, which relies on strictly aligned input and output samples, the model achieves adaptive migration between the source domain and the target domain through bidirectional mapping learning. The model is mainly composed of two generators and two discriminators. In the training process, the goal of the generator is to synthesize images that are sufficiently realistic to deceive the discriminator, while the discriminator is committed to distinguishing between real samples and the generated results. The iterative collaborative optimization of the two drives the system to be dynamically balanced, and finally, the generation and discrimination modules achieve optimal performance in their respective roles.
The generator is composed of three parts: the encoder, transformer, and decoder. The encoder extracts features by convolving the source image and converts them into feature vector representations. The transformer learns through a 9-layer residual network to realize the conversion of the image scene. The final output image is obtained by the decoder mapping the converted feature vectors back to the image space and reconstructing the feature representations using deconvolution operations.
The discriminator uses PatchGAN [10,11] to discriminate the generated image. PatchGAN is a fully convolutional network that evaluates each patch of the image to be discriminated and outputs a value in [0, 1] for each patch to judge whether it is real. The discriminant network captures the deep features of the image step by step through four convolutional layers and appends a final convolutional layer with one output channel as a classifier to determine the authenticity of the image.
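For illustration, the following is a minimal PyTorch sketch of a PatchGAN-style discriminator of the kind described above: four convolutional layers extract progressively deeper features, and a final one-channel convolution scores each patch. The channel widths, kernel sizes, and normalization layers are assumptions based on common PatchGAN implementations, not necessarily the exact configuration used in this paper.

```python
import torch
import torch.nn as nn

class PatchDiscriminator(nn.Module):
    """PatchGAN-style discriminator sketch: four feature-extracting convolutions
    followed by a 1-channel convolution that scores each patch as real or fake."""
    def __init__(self, in_channels=3, base_width=64):
        super().__init__()
        layers = []
        widths = [base_width, base_width * 2, base_width * 4, base_width * 8]
        prev = in_channels
        for i, w in enumerate(widths):
            stride = 2 if i < 3 else 1  # the last feature layer keeps resolution
            layers += [
                nn.Conv2d(prev, w, kernel_size=4, stride=stride, padding=1),
                nn.InstanceNorm2d(w) if i > 0 else nn.Identity(),
                nn.LeakyReLU(0.2, inplace=True),
            ]
            prev = w
        # final 1-channel convolution acts as the per-patch classifier
        layers.append(nn.Conv2d(prev, 1, kernel_size=4, stride=1, padding=1))
        self.model = nn.Sequential(*layers)

    def forward(self, x):
        return self.model(x)  # a grid of patch scores rather than a single scalar

scores = PatchDiscriminator()(torch.randn(1, 3, 256, 256))  # -> shape (1, 1, 30, 30)
```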
The loss function of CycleGAN mainly includes the adversarial loss [12,13], identity loss, and cycle consistency loss. The adversarial loss makes the generated image gradually more similar to the target-domain images through the competition between the generator and the discriminator. The identity loss is introduced to ensure that the generated image retains as much of the input image's color constancy as possible. The cycle consistency loss suppresses the generator's tendency to randomly sample from the real domain, ensuring semantic correspondence during cross-domain translation and keeping the generated image consistent with the semantics of the original image. The cycle consistency loss uses the L1 loss and can be expressed as follows:
$$L_{cyc}(G,F)=\mathbb{E}_{x\sim p_{data}(x)}\big[\left\|F(G(x))-x\right\|_1\big]+\mathbb{E}_{y\sim p_{data}(y)}\big[\left\|G(F(y))-y\right\|_1\big]$$
Thus, the total loss function of CycleGAN can be expressed as follows:
$$L(G,F,D_X,D_Y)=L_{GAN}(G,D_Y,X,Y)+L_{GAN}(F,D_X,Y,X)+\lambda_1 L_{Idt}(G,F)+\lambda_2 L_{cyc}(G,F)$$
Here, $\lambda_1$ and $\lambda_2$ are the weights of the identity loss $L_{Idt}$ and the cycle consistency loss $L_{cyc}$ in the total loss function, respectively.
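As a concrete illustration, the sketch below expresses the cycle consistency term and a combined generator objective in PyTorch. The least-squares form of the adversarial term and the weight values are assumptions drawn from common CycleGAN practice, not this paper's tuned settings.

```python
import torch
import torch.nn.functional as F

def cycle_consistency_loss(real_x, rec_x, real_y, rec_y):
    # L_cyc(G, F) = E[||F(G(x)) - x||_1] + E[||G(F(y)) - y||_1]
    return F.l1_loss(rec_x, real_x) + F.l1_loss(rec_y, real_y)

def generator_loss(d_y_fake, d_x_fake, real_x, rec_x, idt_x,
                   real_y, rec_y, idt_y, lambda_idt=5.0, lambda_cyc=10.0):
    # Least-squares adversarial terms (an assumption; the paper's GAN loss may differ).
    adv = F.mse_loss(d_y_fake, torch.ones_like(d_y_fake)) + \
          F.mse_loss(d_x_fake, torch.ones_like(d_x_fake))
    # Identity terms: G(y) should stay close to y, F(x) close to x, preserving color.
    idt = F.l1_loss(idt_y, real_y) + F.l1_loss(idt_x, real_x)
    cyc = cycle_consistency_loss(real_x, rec_x, real_y, rec_y)
    return adv + lambda_idt * idt + lambda_cyc * cyc
```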

2.2. The CycleGAN Structure Used for Image Dataset Augmentation

In the task of foggy scene dataset augmentation, the images generated by the trained neural network are usually blurred. Meanwhile, during network training, the generated images often suffer from information loss because the original model does not sufficiently extract the global semantic information of the images. To address these limitations of the existing deep learning algorithm, this paper optimizes the CycleGAN network. The improved overall network structure is shown in Figure 1.
As shown in Figure 1, this study enhances the CycleGAN generative network in two parts: first, a self-attention mechanism is added to the generator to strengthen the correlation of the global fog concentration distribution; second, a perceptual loss is added to the cycle consistency loss function to reduce the error between the input image and the reconstructed (twice-translated) image, so that the generator and the discriminator, through intense mutual confrontation, generate more realistic foggy images. The following sections elaborate on these two improvements.

2.2.1. Residual Network Based on Self-Attention Mechanism

The self-attention mechanism [14,15,16] mainly comes from the article “Attention Is All You Need” and is currently widely used in the field of natural language processing. The attention mechanism in deep learning imitates the human visual attention mechanism, that is, the ability to quickly filter out important information from a large amount of information with limited attention. The core of the self-attention mechanism is to obtain more detailed information related to the key information while appropriately ignoring other, useless information. The self-attention mechanism is calculated as follows:
$$\mathrm{Attention}(Q,K,V)=\mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$
where $d_k$ is the number of columns of the matrices Q and K; dividing by $\sqrt{d_k}$ prevents the inner products of Q and K from becoming so large that they distort the semantic relevance. The role of softmax [17] is to compute, for each feature of the image, the correlation coefficients with respect to all other features, that is, the attention weights. The query information matrix is denoted by Q, the key information matrix by K, and the value information matrix by V; these three matrices are generated from the input matrix X by linear transformations. Figure 2 illustrates the realization of the self-attention mechanism.
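A minimal sketch of this computation, assuming token-shaped inputs of size (batch, tokens, d_k):

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V for (batch, tokens, d_k) tensors."""
    d_k = Q.size(-1)
    weights = F.softmax(Q @ K.transpose(-2, -1) / d_k ** 0.5, dim=-1)
    return weights @ V

# Example: 64 spatial positions, 32-dimensional queries/keys/values.
x = torch.randn(1, 64, 32)
out = scaled_dot_product_attention(x, x, x)   # self-attention: Q, K, V share one source
```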
Traditional convolutional neural networks cannot effectively handle global features because of the limitation of local receptive fields. In the image fogging task, however, the distribution of fog often has global structure, and the self-attention mechanism enables the model to learn the fog distribution pattern effectively by computing the relationships between different regions of the image, thus improving the realism of the generated image. At the same time, to increase the perceptual quality of the fogged image and reduce the damage the fog causes to the original image's details, the self-attention mechanism can focus on important areas of the image, such as edges and object outlines.
Within CycleGAN's transition layer, the traditional CycleGAN learns through a 9-layer residual network [18,19] to achieve the conversion of the image scene. However, since the traditional residual network extracts features only from small windows, it cannot fully learn the features of the supplied image, resulting in the loss of some details in the generated image. In this paper, the global connectivity of the self-attention mechanism is combined with the residual network's ability to prevent network degradation, so that feature extraction over the image's global field of view is realized and multi-scale invariant features are added. The improved residual module structure is shown in Figure 3.
First, the network computes the three feature maps Q, K, and V through three convolutional layers. The attention weights are then obtained by dot multiplication and normalization, and the computed V is matrix-multiplied with the attention weights to obtain the output of the self-attention processing. Finally, the final feature map is obtained by adding the self-attention output, the input, and the input processed by the custom convolution block. The calculation can be expressed as follows:
$$Y = X + \mathrm{Conv}\big(\mathrm{Attention}(X)\big) + \mathrm{Conv}\big(\mathrm{Conv}(X)\big)$$
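The following PyTorch sketch illustrates one possible form of this block, with Q, K, and V produced by 1 × 1 convolutions as described above. The channel-reduction factor, kernel sizes, and normalization layers of the convolution branch are assumptions, not the exact configuration used in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention2d(nn.Module):
    """Conv-based self-attention over spatial positions: Q, K, V come from 1x1 convolutions."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.q = nn.Conv2d(channels, channels // reduction, 1)
        self.k = nn.Conv2d(channels, channels // reduction, 1)
        self.v = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.q(x).flatten(2).transpose(1, 2)               # (b, hw, c/r)
        k = self.k(x).flatten(2)                               # (b, c/r, hw)
        v = self.v(x).flatten(2)                               # (b, c, hw)
        attn = F.softmax(q @ k / (q.size(-1) ** 0.5), dim=-1)  # (b, hw, hw) attention weights
        out = v @ attn.transpose(1, 2)                         # weighted sum over positions
        return out.view(b, c, h, w)

class SAResBlock(nn.Module):
    """Sketch of the improved residual block: Y = X + Conv(Attention(X)) + Conv(Conv(X))."""
    def __init__(self, channels=256):
        super().__init__()
        self.attn = SelfAttention2d(channels)
        self.post_attn = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv_branch = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.InstanceNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, x):
        return x + self.post_attn(self.attn(x)) + self.conv_branch(x)

y = SAResBlock()(torch.randn(1, 256, 64, 64))   # output keeps the input shape
```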
The improved generator is still composed of three parts: the encoder, transformer, and decoder. In the transition layer, this paper replaces the traditional Resnet blocks with SA_Resnet blocks. Table 1 displays the improved generator's internal structure.

2.2.2. Cycle Consistency Loss Based on LPIPS

Learned Perceptual Image Patch Similarity (LPIPS), also known as “perceptual loss”, was introduced in “The Unreasonable Effectiveness of Deep Features as a Perceptual Metric” at CVPR 2018 [20]. The LPIPS index measures the difference between two images: the lower the LPIPS value, the more similar the two images are; conversely, the higher the LPIPS value, the greater the difference between them. In the past, most scholars used traditional measures such as PSNR, SSIM, FSIM [21], L2 loss [22], and L1 loss [23] to approximate human perception of pictures. However, the features learned by LPIPS on an unsupervised model, and the low-level perceptual similarity it captures, are more in line with human perception than PSNR and other loss functions. By learning to map the generated image back to the real image domain, LPIPS guides the generator in mastering the reconstruction from a synthetic image to a real one, emphasizing their perceptual similarity. Therefore, LPIPS has the smallest error among perceptual similarity functions and achieves better results than the traditional methods.
LPIPS loss has been shown to improve the visual perception quality in many image generation tasks, but its role in image fogging has still not been researched systematically. In the original CycleGAN framework, the cycle consistency loss employs the L1 paradigm to assess the discrepancy between the transformed and original images, primarily emphasizing pixel-level errors, and is likely to lead to the generation of an overly smooth image. The LPIPS loss, on the other hand, is able to measure the semantic consistency of an image in a higher-dimensional feature space, thus reducing the color distortion or structural bias that may occur during foggy image conversion.
In this paper, we add LPIPS to the cycle consistency loss function. The goal is not merely the quality of the generated image, but also bringing the reconstructed (twice-translated) image closer to the original so that it resembles it more closely. The calculation formula of LPIPS is as follows:
$$d(x,x_0)=\sum_{l}\frac{1}{H_l W_l}\sum_{h,w}\left\| w_l\odot\left(\hat{y}^{l}_{hw}-\hat{y}^{l}_{0,hw}\right)\right\|_2^2$$
where $d$ is the perceptual distance between $x$ and $x_0$, $\hat{y}^{l}$ and $\hat{y}^{l}_{0}$ are the normalized feature maps of layer $l$ (of spatial size $H_l \times W_l$), and $w_l$ is a channel-wise weight. The calculation of LPIPS is divided into two stages: feature extraction and distance calculation. First, a deep network (such as VGG [24,25] or Resnet) is employed to extract the image's features, and the corresponding feature vectors are normalized. The L2 distance between the corresponding feature vectors of the two input images is then computed. Finally, the results are averaged over space and summed across layers to obtain the LPIPS value. The total loss function of the enhanced network is calculated as follows:
$$L(G,F,D_X,D_Y)=L_{GAN}(G,D_Y,X,Y)+L_{GAN}(F,D_X,Y,X)+\lambda_1 L_{Idt}(G,F)+\lambda_2\big(\omega_1 L_{1,cyc}(G,F)+\omega_2 L_{LPIPS,cyc}(G,F)\big)$$
Here, $\lambda_1$, $\lambda_2$, $\omega_1$, and $\omega_2$ are the weights of the different losses, reflecting the relative importance of each loss in the network.
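For illustration, the hybrid cycle consistency term can be sketched with the publicly available lpips package (VGG backbone, inputs scaled to [−1, 1]); using this package is an implementation assumption, and the weights w_l1 and w_lpips below are placeholders rather than the tuned values used in this paper.

```python
import torch
import torch.nn.functional as F
import lpips  # pip install lpips; VGG backbone assumed, as described in the text

lpips_fn = lpips.LPIPS(net='vgg')  # expects image tensors scaled to [-1, 1]

def hybrid_cycle_loss(real_x, rec_x, real_y, rec_y, w_l1=1.0, w_lpips=1.0):
    """Cycle consistency combining pixel-level L1 and perceptual LPIPS terms."""
    l1 = F.l1_loss(rec_x, real_x) + F.l1_loss(rec_y, real_y)
    perceptual = lpips_fn(rec_x, real_x).mean() + lpips_fn(rec_y, real_y).mean()
    return w_l1 * l1 + w_lpips * perceptual
```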

3. Results

3.1. Experimental Environment

All experiments used the same configuration to prevent the experimental environment from affecting the generation of foggy images. The operating system was Windows 11; the development environment was Python 3.7 and PyCharm Community Edition 2022; the deep learning framework was PyTorch 1.12.1; the number of iterations was 200; the learning rate was 0.0002; the graphics card was an NVIDIA GeForce RTX 3050 Ti Laptop GPU (manufactured by NVIDIA Corporation, Santa Clara, CA, USA); and the optimizer was Adam.
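A minimal sketch of the optimizer setup implied by these settings is given below; the Adam betas and the placeholder networks are assumptions, since the paper does not state them.

```python
import itertools
import torch
import torch.nn as nn

EPOCHS, LR = 200, 2e-4   # as stated in the text

# Placeholder generators/discriminators; swap in the actual CycleGAN modules.
G_xy = nn.Conv2d(3, 3, 3, padding=1)   # clear -> foggy generator (stand-in)
G_yx = nn.Conv2d(3, 3, 3, padding=1)   # foggy -> clear generator (stand-in)
D_x, D_y = nn.Conv2d(3, 1, 4, stride=2), nn.Conv2d(3, 1, 4, stride=2)

opt_g = torch.optim.Adam(itertools.chain(G_xy.parameters(), G_yx.parameters()),
                         lr=LR, betas=(0.5, 0.999))   # betas are an assumption
opt_d = torch.optim.Adam(itertools.chain(D_x.parameters(), D_y.parameters()),
                         lr=LR, betas=(0.5, 0.999))
```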

3.2. Experimental Dataset

The KITTI dataset [26], O-HAZE dataset [27], Dense-HAZE dataset [28], I-HAZE dataset [29], and NH-HAZE dataset [30] were employed in this paper. The experimental dataset was divided into four parts: trainA, testA, trainB, and testB. The training set for the generative model consisted of trainA and trainB, and the test set consisted of testA and testB. Specifically, 400 real street view images captured during vehicle driving were selected from the image pool of the KITTI dataset to form trainA and testA. Since the KITTI dataset contains no corresponding foggy scene images and publicly available foggy image datasets are small, we selected 185 foggy images from the O-HAZE, Dense-HAZE, I-HAZE, and NH-HAZE datasets to compose trainB and testB.
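The snippet below sketches the assumed on-disk layout and a quick count of the four splits; the folder names follow the trainA/trainB/testA/testB convention described above, while the root path is hypothetical.

```python
from pathlib import Path

# Assumed on-disk layout for unpaired training; the root path is hypothetical.
root = Path("datasets/fog_kitti")
for split in ("trainA", "testA", "trainB", "testB"):
    n_images = len(list((root / split).glob("*.png")))
    print(f"{split}: {n_images} images")
# Domain A (trainA/testA): 400 clear street scenes selected from KITTI.
# Domain B (trainB/testB): 185 hazy images from O-HAZE, Dense-HAZE, I-HAZE, and NH-HAZE.
```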

3.3. Fog Image Generation Experiment

The difficulty of acquiring foggy images is the reason for the small number of publicly available foggy datasets for autonomous driving scenarios. To alleviate this lack of foggy data, this paper generates a large number of foggy images that more closely resemble a realistic environment through the improved CycleGAN network, which is conducive to training a stable automatic driving system. The network consists of two GAN networks: one converts sunny images into foggy images in an autonomous driving environment, while the other converts a foggy image in the autonomous driving environment into a fog-free image. The images created by the network during training are displayed in Figure 4.
Among them, the input real fog-free image is shown in Figure 4a, and the fake foggy one generated from the real fog-free image is shown in Figure 4b and can be sent to the discriminator to judge whether it is true or false. If it is false, the discriminator continues to train the generator to generate a more realistic foggy image. Figure 4c is a fake foggy picture that was re-entered into the generator to generate a fog-free picture, and Figure 4d is a real fog-free picture that was used to generate a fog-free picture. Similarly, Figure 4e is the input real foggy picture, while Figure 4f is the fake fog-free picture that was generated from the real foggy picture, which can be sent to the discriminator for discrimination. Figure 4g is the fake fog-free picture that was generated back to the foggy picture, while Figure 4h is the real foggy picture that was generated from the foggy picture. At the same time, Figure 5 and Figure 6 display the differences in the loss functions of the original model and the enhanced network model.
By comparing Figure 5 and Figure 6, we can observe that the enhanced model's loss values decrease at a faster rate. In particular, the losses of cycle_A, cycle_B, idt_A, and idt_B become more stable after 100 epochs, indicating that the improved model has better training convergence. This allows the model to reach a stable state within fewer training epochs, improving the training efficiency while reducing the risk of overfitting. Additionally, the original model exhibits higher overall values and greater fluctuations in cycle_A and cycle_B, suggesting significant errors when converting images back to the original domain. In contrast, the improved model shows clear reductions in the cycle_A and cycle_B values, with less fluctuation, indicating that the transformed images are of higher quality and more consistent in content with the original images. Since the identity loss reflects the model's capacity to maintain the initial features of unconverted images, the lower and more stable values of idt_A and idt_B in the improved model imply that it recognizes and retains the style of the input images better, enhancing the reliability of the style transfer. Finally, the values of G_A and G_B in the improved model gradually stabilize and remain within a lower range, indicating that the model has learned a more stable mapping relationship, effectively avoiding the mode collapse problem. As a result, the generated images appear more realistic and diverse.
However, the improved model also has certain limitations. Since incorporating a self-attention mechanism into the generator architecture inevitably expands the model's parameter count and elevates its computational demands, it requires more computing resources and time for training. At the same time, the LPIPS loss function must compute the feature difference between the generated image and the real image in the pre-trained network at every training iteration, which increases the overall training time. In our experiments, training the original model for 200 epochs took 60,823 s, while training the improved model for 200 epochs took 75,046 s.

3.3.1. Subjective Evaluation

To confirm that the enhanced network structure in this paper is effective, Figure 7 compares eight groups of foggy image generation experiments across the FoHIS model, the RadialFog model, Lightroom Classic (version 2024), the CycleGAN model, and the improved model in an autonomous driving environment.
As shown above, the experimental results of the FoHIS model, the RadialFog model, the Lightroom Classic software, the traditional CycleGAN model, and the improved CycleGAN model are compared. Among them, the FoHIS model is based on the atmospheric scattering model [31] and adds fog features to a fog-free image pixel by pixel. The Lightroom Classic software has a dehazing function, which can produce a fogging effect when its value is set to a negative number. The RadialFog model synthesizes and diffuses fog from a center point in the picture; the farther from the fog's center point, the weaker the synthesized fog effect. Figure 7a is the input content image, that is, the fog-free image in good weather. Figure 7b is the input style image, that is, the original image that will be converted into a foggy image. Figure 7c–g compare the results of the five models across the eight groups of fog image generation experiments. It can be observed with the naked eye that the FoHIS model retains the features of the input image most completely, but the added fog is too uniform and the tone is gray, so the generated foggy image is not realistic. The image fogged with the Lightroom Classic software has a hazy feel, but the whole image is white, and the third picture is a failed fogging attempt. The effect of the RadialFog model is similar to that of the FoHIS model, and the generated images are also grayish, but its fog is heavier and more detail is lost. Both the original and the improved CycleGAN models generate fog that is not overly uniform, making their results more realistic. However, the enhanced model retains more information from the input image than the original CycleGAN and generates fog features that appear more realistic and natural. Overall, the improved model outperforms the traditional CycleGAN, demonstrating its effectiveness.

3.3.2. Objective Evaluation

Because a subjective evaluation relies on direct observation by the human eye, it is easily affected by subjective factors. To further validate the efficacy of the enhanced model, this experiment took 40 pictures generated by each of the RadialFog model, the Lightroom Classic fogging software, the FoHIS model, the improved model, and the original model, and evaluated the fogging effect of these five models using the objective indicators of Fréchet Inception Distance (FID), Inception Score (IS), and structural similarity (SSIM). Table 2 displays the experimental results.
The FID comprehensively evaluates a model's performance by measuring the distribution difference between the generated samples and the real samples; the lower its value, the closer the model's generating ability is to the real data. The IS index uses a classification model to quantitatively analyze the diversity and clarity of the generated results; the higher the IS value, the better the quality and richness of the generated content. The SSIM judges generation quality by comparing structural similarity; the higher its value, the better the details of the generated result match the original input, and the more natural the visual effect.
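For reference, all three metrics can be computed with the torchmetrics package (the image metrics additionally require torch-fidelity); the sketch below uses dummy tensors and is illustrative only, not necessarily the evaluation code used in this study.

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance
from torchmetrics.image.inception import InceptionScore
from torchmetrics.image import StructuralSimilarityIndexMeasure

# Dummy uint8 image batches standing in for generated and real foggy images.
fake = torch.randint(0, 256, (16, 3, 256, 256), dtype=torch.uint8)
real = torch.randint(0, 256, (16, 3, 256, 256), dtype=torch.uint8)

fid = FrechetInceptionDistance(feature=2048)   # lower is better
fid.update(real, real=True)
fid.update(fake, real=False)

inception = InceptionScore()                   # higher is better; returns (mean, std)
inception.update(fake)

ssim = StructuralSimilarityIndexMeasure(data_range=1.0)  # higher is better
ssim_value = ssim(fake.float() / 255.0, real.float() / 255.0)

print(fid.compute(), inception.compute(), ssim_value)
```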
Table 2 demonstrates that the FID values of the three control methods, the RadialFog model, the Lightroom Classic fogging software, and the FoHIS model, are higher, indicating that the foggy images generated by these three methods differ considerably from real foggy images, which means they are not suitable for augmenting foggy image datasets. At the same time, the IS value of the original CycleGAN model is low, indicating that this network architecture reduces the clarity of the generated images and that the network structure needs to be optimized. In the evaluation and comparison of the five models, the improved model is superior on all three indicators and is the most suitable for foggy image dataset augmentation in a real-life autonomous driving scenario.
Then, based on the eight sets of experiments that were compared in the subjective evaluation above, we compared the structural similarity of the existing model and the improved model one by one, as shown in Table 3.
It can be seen from Table 3 that, except for the fourth set of experiments, the SSIM values of the improved model are higher than those of the existing CycleGAN model, indicating that the improved model better retains the information of the input image as a whole after integrating the fog features. This shows that the improved model does not weaken the connection between the generated image and the input image while generating a more realistic foggy image.
Lastly, we evaluated the improved model and the existing model overall. Table 4 displays the results.
Table 4 shows that, compared to the existing model, the FID of the improved model is 3.34 lower, indicating that the foggy images generated by the improved model differ less from, and are closer to, real foggy images. The IS increased by 15.8% and the SSIM by 0.1%, indicating that the improved model not only generates foggy images but also enhances the quality of the produced hazy images while preserving the original image's features. On all three objective measures, the method of this paper outperforms the traditional CycleGAN model, which demonstrates the effectiveness of the improved model.

3.4. Ablation Experiment

To further test whether each module in the improved network produces a performance gain, CycleGAN was used as the baseline model, and we compared the model with only the self-attention module added, the model with only the LPIPS-based cycle consistency loss added, and the model with both the self-attention module and the LPIPS-based cycle consistency loss added. All experimental parameters were set to the same values to guarantee fairness, and the comparison results of the ablation experiment are shown in Figure 8.
As can be seen from Figure 8b, the foggy image generated by the original CycleGAN is too foggy: the fog covers most of the useful information of the image, too little of the input image's information is retained, and the realism is low. Therefore, the foggy image generated by the original CycleGAN is not ideal. Figure 8c shows that after adding the self-attention mechanism, the fog in the generated image is more natural and realistic, and more of the input image's information is retained. Figure 8d shows that the fog generated after adding the LPIPS loss is also natural, but the generated picture is blurred. In Figure 8e, the foggy images generated by our method are better than those of the above methods: natural fog is generated without losing too much of the input image's information. We then used the objective evaluation indicators to evaluate the above experiments. Table 5 displays the results.
It can be seen from Table 5 that the self-attention module contributes to the three performance indicators to different degrees; its contribution to the SSIM is the largest, increasing the SSIM by 0.8%. Because feature extraction after adding the self-attention module has global connectivity, it better retains the detailed information of the source image and causes less distortion. The LPIPS-based cycle consistency loss makes a significant contribution to the IS and FID indexes, but it harms the SSIM index and blurs some details. The IS increases by 13.1% and the FID decreases by 2.53, mainly because LPIPS improves the perceptual similarity between images, so that the generated foggy image is closer to a real foggy image and its quality is improved. The experimental results show that integrating the self-attention module together with the LPIPS-based cycle consistency loss demonstrably enhances the network performance, confirming the reliability and effectiveness of the proposed improvements.

4. Discussion

Currently, many generative adversarial network methods perform well in image conversion tasks, among which CycleGAN is still a powerful benchmark model for unsupervised image fogging. Its unsupervised training mechanism makes it particularly suitable for tasks that lack paired training data, such as image fogging. However, the CycleGAN model has some limitations, such as excessively uniform fog in the generated images, an inadequate ability to retain details, and a lack of perceptual quality. Therefore, it is necessary to improve the adaptability of CycleGAN, which was the focus of this study. The modified method proposed in this paper is more suitable for training automatic driving system models. It can not only extract foggy features comprehensively but also generate foggy images that are closer to real life, thanks to the powerful self-attention mechanism, which captures image details more accurately and enhances the authenticity of the fog features. At the same time, the LPIPS perceptual loss further improves the realism of the foggy images by reducing the visual difference between the generated results and the real scene. The method described in this paper can effectively augment the autopilot dataset for foggy environments, adequately train the autopilot system, improve its safety and robustness, and reduce the risk of traffic accidents.
However, the method in this paper still needs to be perfected in the future. Even though the LPIPS loss and the self-attention mechanism successfully raise the overall quality of the generated foggy images, the cost is a significant increase in computational complexity, memory overhead, and optimization difficulty.
Bayraktar and Yigit proposed a novel pooling method called conditional pooling [32], which integrates the advantages of average pooling and maximum pooling to dynamically select the optimal pooling strategy based on the characteristics of the input data, thus reducing information loss and improving transmission efficiency. As an adaptive feature extraction method, conditional pooling can significantly improve a model's capacity to retain local features, improve its adaptability to different foggy scenes, and reduce the computational overhead to a certain extent. Therefore, in future research, we plan to further explore its application in image fogging tasks, continuously optimize the image generation model or design a new backbone network to generate more realistic foggy images, and conduct more in-depth research on how to balance the efficiency and quality of the model.

5. Conclusions

In this paper, an optimized CycleGAN model is proposed for unsupervised image fogging tasks. The self-attention mechanism and LPIPS perceptual loss were introduced to improve the quality of foggy images. The self-attention mechanism and residual network were first employed to improve the feature extraction of foggy images by the adversarial generative network, while retaining the ability of the residual network to prevent network degradation. Secondly, LPIPS was added to the cycle consistency loss due to its potent ability to detect the perceptual similarity between images, in order to preserve the full details of the original image as well as possible in the generated image. The experimental findings demonstrated that, compared to the conventional CycleGAN network, the enhanced CycleGAN network suggested in this paper performs better in producing images that are more akin to the actual foggy driving scene, while also retaining the detailed information of the input driving scene image better.

Author Contributions

Writing—original draft preparation, W.S.; writing—review and editing, W.S. and H.N.; supervision, H.N. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The datasets used in this study are publicly available in multiple repositories on GitHub (https://github.com/). These include the KITTI Data, O-HAZE Data, Dense-HAZE Data, I-HAZE Data and NH-HAZE Data, containing diverse foggy and clear images.

Acknowledgments

As my thesis is about to be completed, I would like to express my heartfelt thanks and high respect to my supervisor, Hongyin Ni! The work presented in this thesis was completed under the careful guidance of Ni. My supervisor’s professional knowledge, rigorous attitude towards research, working style of pursuing excellence, noble teacher ethics, strict discipline, leniency toward others, and approachable personality have had a profound impact on me. Not only did these characteristics enable me to establish ambitious academic goals and master basic research methods, they also helped me understand a number of truths about how to treat people and deal with the world. A teacher’s kindness is like the sea, and the student will never forget it! Secondly, I would like to thank all my fellow members of the 1105 laboratory; it is because of your help and support that I can overcome one difficulty and doubt after another until the successful completion of this paper. I would also like to thank my parents and other family and friends for their care, support, and understanding. Finally, I would like to thank the professors and experts who reviewed this paper as part of their busy schedules!

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
LPIPS   Learned Perceptual Image Patch Similarity
SSIM    Structural Similarity

References

  1. Wu, H.; Wang, H.; Su, X.; Li, M.; Xu, F.; Zhong, S. Security testing of visual perception module in autonomous driving system. J. Comput. Res. Dev. 2022, 59, 1133–1147. [Google Scholar]
  2. Zhang, N.; Zhang, L.; Cheng, Z. Towards simulating foggy and hazy images and evaluating their authenticity. In Neural Information Processing: 24th International Conference, ICONIP 2017, Guangzhou, China, 14–18 November 2017; Proceedings, Part III 24; Springer: Cham, Switzerland, 2017; pp. 405–415. [Google Scholar]
  3. Sun, H.; Zheng, Y.; Lang, Q. Domain adaptation for synthesis of hazy images. J. Comput. Commun. 2021, 9, 142–151. [Google Scholar] [CrossRef]
  4. Sakaridis, C.; Dai, D.; Van Gool, L. Semantic foggy scene understanding with synthetic data. Int. J. Comput. Vis. 2018, 126, 973–992. [Google Scholar] [CrossRef]
  5. Liang, L.; Qing, Y.; Jianping, L.; Yuze, L. Fog simulation method based on depth estimation. Laser Optoelectron. Prog. 2023, 60, 72–78. [Google Scholar]
  6. Dreossi, T.; Ghosh, S.; Sangiovanni-Vincentelli, A.; Seshia, S.A. Systematic testing of convolutional neural networks for autonomous driving. arXiv 2017, arXiv:1708.03309. [Google Scholar]
  7. Zhang, M.; Zhang, Y.; Zhang, L.; Liu, C.; Khurshid, S. DeepRoad: GAN-based metamorphic testing and input validation framework for autonomous driving systems. In Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering, Montpellier, France, 3–7 September 2018; pp. 132–142. [Google Scholar]
  8. Jinsheng, X.; Mengyao, S.; Junfeng, L. Image transformation algorithm of haze scene based on generative adversarial network. Chin. J. Comput. 2020, 43, 165–176. [Google Scholar]
  9. Zhu, J.-Y.; Park, T.; Isola, P.; Efros, A.A. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2223–2232. [Google Scholar]
  10. Pan, Y.; Pi, D.; Chen, J.; Meng, H. FDPPGAN: Remote sensing image fusion based on deep perceptual PatchGAN. Neural Comput. Appl. 2021, 33, 9589–9605. [Google Scholar] [CrossRef]
  11. Cheng, R.-R.; Zhao, X.-L.; Zhou, H.-J. Chinese font style transfer research based on font features and multi-scale patch generative adversarial network. J. Yunnan Univ. Nat. Sci. Ed. 2023, 45, 1228–1237. [Google Scholar]
  12. Chen, M.; Zhao, S.; Liu, H.; Cai, D. Adversarial-learned loss for domain adaptation. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 3521–3528. [Google Scholar]
  13. Mao, X.; Li, Q.; Xie, H.; Lau, R.Y.; Wang, Z.; Smolley, S.P. Least squares generative adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2794–2802. [Google Scholar]
  14. Vaswani, A. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 6000–6010. [Google Scholar]
  15. Lei, S.; Yi, W.; Ying, C.; Ruibin, W. Review of attention mechanism in natural language processing. Data Anal. Knowl. Discov. 2020, 4, 1–14. [Google Scholar]
  16. Guo, M.-H.; Lu, C.-Z.; Liu, Z.-N.; Cheng, M.-M.; Hu, S.-M. Visual attention network. Comput. Vis. Media 2023, 9, 733–752. [Google Scholar]
  17. Qin, Z.; Sun, W.; Deng, H.; Li, D.; Wei, Y.; Lv, B.; Yan, J.; Kong, L.; Zhong, Y. cosformer: Rethinking softmax in attention. arXiv 2022, arXiv:2202.08791. [Google Scholar]
  18. Shafiq, M.; Gu, Z. Deep residual learning for image recognition: A survey. Appl. Sci. 2022, 12, 8972. [Google Scholar] [CrossRef]
  19. Zhao, J.; Cheng, P.; Hou, J.; Fan, T.; Han, L. Short-term load forecasting of multi-scale recurrent neural networks based on residual structure. Concurr. Comput. Pract. Exp. 2023, 35, e7551. [Google Scholar] [CrossRef]
  20. Zhang, R.; Isola, P.; Efros, A.A.; Shechtman, E.; Wang, O. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 586–595. [Google Scholar]
  21. Zhang, L.; Zhang, L.; Mou, X.; Zhang, D. Fsim: A feature similarity index for image quality assessment. IEEE Trans. Image Process. 2011, 20, 2378–2386. [Google Scholar] [CrossRef] [PubMed]
  22. Barron, J.T. A general and adaptive robust loss function. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 4331–4339. [Google Scholar]
  23. Girshick, R. Fast r-cnn. arXiv 2015, arXiv:1504.08083. [Google Scholar]
  24. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
  25. Sengupta, A.; Ye, Y.; Wang, R.; Liu, C.; Roy, K. Going deeper in spiking neural networks: Vgg and residual architectures. Front. Neurosci. 2019, 13, 95. [Google Scholar] [CrossRef] [PubMed]
  26. Menze, M.; Geiger, A. Object scene flow for autonomous vehicles. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7 June 2015; pp. 3061–3070. [Google Scholar]
  27. Ancuti, C.O.; Ancuti, C.; Timofte, R.; De Vleeschouwer, C. O-haze: A dehazing benchmark with real hazy and haze-free outdoor images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Salt Lake City, UT, USA, 18–23 June 2018; pp. 754–762. [Google Scholar]
  28. Ancuti, C.O.; Ancuti, C.; Sbert, M.; Timofte, R. Dense-haze: A benchmark for image dehazing with dense-haze and haze-free images. In Proceedings of the 2019 IEEE International Conference on Image Processing (ICIP), Taipei, Taiwan, 22–25 September 2019; pp. 1014–1018. [Google Scholar]
  29. Ancuti, C.; Ancuti, C.O.; Timofte, R.; De Vleeschouwer, C. I-haze: A dehazing benchmark with real hazy and haze-free indoor images. In Advanced Concepts for Intelligent Vision Systems: 19th International Conference, ACIVS 2018, Poitiers, France, 24–27 September 2018; Proceedings 19; Springer: Cham, Switzerland, 2018; pp. 620–631. [Google Scholar]
  30. Ancuti, C.O.; Ancuti, C.; Timofte, R. Nh-haze: An image dehazing benchmark with non-homogeneous hazy and haze-free images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Seattle, WA, USA, 14–19 June 2020; pp. 444–445. [Google Scholar]
  31. Ju, M.; Zhang, D.; Wang, X. Single image dehazing via an improved atmospheric scattering model. Vis. Comput. 2017, 33, 1613–1625. [Google Scholar]
  32. Bayraktar, E.; Yigit, C.B. Conditional-pooling for improved data transmission. Pattern Recognit. 2024, 145, 109978. [Google Scholar] [CrossRef]
Figure 1. Improvement of the overall structure of the network diagram (the complete network structure consists of two parts, while this diagram only shows one part).
Figure 2. The implementation process of the self-attention mechanism. The purple, pink, and blue modules are the linearly transformed Q, K, and V, respectively.
Figure 3. Residual module construction based on self-attention mechanism.
Figure 4. Pictures generated during CycleGAN network training. (a) The input real fog-free image; (b) the fake foggy image generated from the real fog-free image; (c) the fake foggy picture re-entered into the generator to generate a fog-free picture; (d) the real fog-free picture used to generate a fog-free picture; (e) the input real foggy picture; (f) the fake fog-free picture generated from the real foggy picture; (g) the fake fog-free picture generated back to the foggy picture; (h) the real foggy picture generated from the foggy picture.
Figure 5. The change in the original model’s loss function.
Figure 6. The change in the improved model’s loss function.
Figure 7. Comparison of foggy image effects. (a) Input fog-free content image; (b) input fogged image; (c) fogged images generated by the Lightroom Classic software; (d) fogged images generated by the RadialFog model; (e) fogged images generated by the FOHIS model; (f) fogged images generated by the CycleGAN model; (g) fogged images generated by our model.
Figure 8. Comparison results of ablation experiments. (a) Input fog-free images; (b) fogged images generated by the CycleGAN model; (c) fogged images generated by the CycleGAN + self-attention model; (d) fogged images generated by the CycleGAN + LPIPS model; (e) fogged images generated by the CycleGAN + self-attention + LPIPS model.
Table 1. Generator’s internal composition.
Module        Network Layer      Kernel Size    Stride   Padding   Activation Function
encoder       Conv1              (7, 7, 64)     -        0         ReLU
              Conv2              (3, 3, 128)    2        1         ReLU
              Conv3              (3, 3, 256)    2        1         ReLU
transformer   SA_Resnet Block1   -
              …
              SA_Resnet Block9   -
decoder       DeConv1            (3, 3, 128)    2        1         ReLU
              DeConv2            (3, 3, 64)     2        1         ReLU
              DeConv3            (7, 7, 3)      -        0         Tanh
Table 2. Evaluation results of the different methods.
Methods             FID       IS      SSIM
RadialFog           250.27    1.891   0.358
Lightroom Classic   283.47    1.877   0.373
FoHIS               279.25    1.887   0.383
CycleGAN            189.49    1.834   0.378
Ours                184.35    1.948   0.398
Table 3. Comparison of structural similarity of 8 groups of comparative experiments.
Group   CycleGAN's SSIM   Our Model's SSIM
1       0.753             0.836
2       0.246             0.248
3       0.405             0.498
4       0.434             0.412
5       0.507             0.508
6       0.323             0.340
7       0.291             0.354
8       0.501             0.627
Table 4. Overall evaluation of existing models and improved models.
Methods     FID↓      IS↑     SSIM↑
CycleGAN    181.41    1.922   0.361
Our model   178.07    2.080   0.362
Table 5. Evaluation of ablation experiment findings.
Baseline (CycleGAN)   +Self-Attention   +LPIPS   IS      FID      SSIM
✓                     –                 –        1.922   181.41   0.361
✓                     ✓                 –        1.949   180.60   0.369
✓                     ✓                 ✓        2.080   178.07   0.362
