1. Introduction
In natural environments, fog and haze exhibit fundamentally distinct formation mechanisms. Fog arises when a sudden temperature drop induces phase transition of water vapor, forming micrometer-sized suspended water droplets that reduce visibility through Mie scattering. In contrast, haze is caused by sub-micrometer aerosol particles (e.g., PM2.5) that interact with light waves via Rayleigh scattering, resulting in more severe visual attenuation effects. With accelerating industrialization, precursor pollutants such as nitrogen oxides and volatile organic compounds from vehicle emissions and industrial activities undergo complex photochemical reactions to generate secondary aerosols, giving rise to composite fog–haze pollution. This hybrid pollution combines characteristics of both fog droplets and particulate matter, not only significantly increasing the risk of respiratory diseases in humans but also posing dual challenges to computer vision applications like autonomous vehicle perception systems and satellite remote sensing monitoring.
Current image dehazing algorithms face three technical bottlenecks: (1) the performance degradation of lightweight CNN dehazing models (such as the All-in-One Dehazing Network [1], abbreviated as AOD-Net) under extreme fog concentrations (experiments show that the PSNR drops by 37% when the atmospheric scattering coefficient β > 0.3); (2) the poor real-time performance of Transformer architectures [2] (e.g., the DehazeFormer algorithm), with a processing delay of 320 ms for 1080p images; and (3) edge artifacts caused by overly simplified loss function design (LPIPS measurements indicate that existing methods produce perceptual distortion in texture regions). The joint optimization of dehazing quality and real-time performance is therefore a mainstream research direction for meeting the deployment requirements of mobile and embedded devices in scenarios such as unmanned aerial vehicles and autonomous driving.
This paper selects AOD-Net and DehazeFormer [3] as benchmarks because they represent two typical architectures. The former is representative of lightweight CNN dehazing algorithms (with only 0.38 M parameters) and achieves real-time dehazing, but it often suffers from detail loss and color distortion in high-concentration fog and non-uniform lighting scenarios. The latter demonstrates the advantages of Transformers in modeling long-range dependencies, achieving significant breakthroughs in texture recovery and color fidelity while enabling efficient dehazing; however, its high computational complexity and large memory usage greatly limit its deployment in practical scenarios.
This paper focuses on the optimization improvements and performance comparison of these two cutting-edge algorithms, aiming to reveal their technical characteristics through systematic experiments and explore the performance boundaries of the two algorithms using a combined subjective and objective evaluation system, providing a theoretical basis for practical engineering applications.
The innovative contributions of this paper are mainly reflected in three aspects: (1) designing a composite loss function for AOD-Net, integrating perceptual loss and L1 regularization to significantly enhance the model’s adaptability to fog concentration gradient changes, with the loss weight combination determined through grid search; (2) proposing a multi-constraint optimization objective for DehazeFormer, integrating MSE (Mean Squared Error) and perceptual loss to effectively improve the color fidelity of the algorithm in complex scenarios; and (3) establishing a multi-dimensional evaluation framework covering computational efficiency, restoration quality (PSNR, SSIM, VSNR, and LPIPS), etc., to provide a scientific decision-making basis for the selection of industrial-grade dehazing systems.
The subsequent structure of this paper is organized as follows:
Section 2 systematically reviews the development and technical principles of image dehazing technologies;
Section 3 elaborates on the optimization methods for AOD-Net and DehazeFormer;
Section 4 constructs a multi-dimensional evaluation system and conducts comprehensive comparative experiments on both algorithms; and
Section 5 summarizes the research findings and provides an outlook on future research directions.
2. Literature Review
2.1. Physical Model-Based Image Defogging Algorithms
The physical modeling theory for image defogging can be traced back to the atmospheric light attenuation model proposed by McCartney [4], which was systematically developed into the atmospheric scattering model (ASM) by Narasimhan's team [5,6]. Its core lies in image restoration through parameter estimation of the transmittance and the atmospheric light A. After 2000, single-image defogging techniques based on the atmospheric scattering model made significant progress.
In 2008, Tan [7] proposed a Markov random field-based method to defog single images by maximizing local contrast, though it suffered from color distortion. In 2009, Fattal [8] introduced statistical priors and estimated scene transmittance by exploiting the local statistical independence between surface shading and transmittance, significantly improving defogging performance. A landmark breakthrough came from He et al. [9] with the Dark Channel Prior (DCP), which reveals the statistical law that at least one color channel in local regions of haze-free images approaches zero. In 2015, Zhu's team [10] further proposed the Color Attenuation Prior, discovering that the difference between brightness and saturation in hazy images changes regularly with haze concentration, expanding the application boundary of physical models.
However, physical model-based defogging methods still have limitations. First, they exhibit high computational complexity, requiring substantial computing resources for high-resolution images and resulting in poor real-time performance. Second, parameter estimation of transmittance and atmospheric light is vulnerable to environmental interference, where estimation errors in practical applications directly affect defogging results. Additionally, these algorithms show weak adaptability to non-uniform haze scenes, and their detail recovery capabilities under complex meteorological conditions still need improvement.
2.2. Image Processing-Based Defogging Enhancement Algorithms
Before the rise of deep learning, image enhancement methods without physical priors constituted the mainstream technical path in engineering applications. Such methods directly improve visual perception quality through signal processing, with three primary technical routes forming a tripartite framework:
- (1) Histogram equalization and its variants [11,12,13] expand the dynamic range via grayscale distribution reconstruction; adaptive local equalization methods use sliding-window mechanisms to mitigate the over-enhancement effects of global processing.
- (2) Retinex theory establishes a light–reflection component decoupling model [14,15,16], recovering the scene's intrinsic reflectance through illumination component correction; its multi-scale version (MSR) significantly enhances color fidelity.
- (3) Wavelet transform frameworks leverage multi-resolution analysis to separate low-frequency haze layers from high-frequency details [17,18,19], achieving selective suppression via frequency-domain filtering.
Although these methods can achieve real-time processing efficiency, their enhancement mechanisms lack physical constraints and produce severe side effects: gradient-reversal halos easily emerge in dense haze regions, and global contrast stretching causes saturation distortion in highlight areas. These inherent limitations restrict their application in precision vision systems.
2.3. Deep Learning-Based Defogging Reconstruction Algorithms
In recent years, the rapid development of deep learning technology has brought revolutionary breakthroughs to the field of computer vision, demonstrating significant advantages in tasks such as image restoration (denoising/dehazing/de-raining), object detection, and underwater enhancement. In the field of image dehazing, a new data-driven paradigm has emerged: by constructing large-scale paired datasets of hazy and clear images, deep networks can independently mine the nonlinear mapping relationship between haze degradation and clear scenes, breaking through traditional methods’ reliance on physical assumptions. Compared with algorithms based on handcrafted features, deep learning methods exhibit stronger representation capabilities in image structure analysis and semantic understanding.
The technological evolution of deep learning in image dehazing can be divided into three stages:
- 1. Physical Model-Driven Stage
Early research continued the physical modeling approach of traditional dehazing. In 2016, Cai's team [20] proposed DehazeNet, which first constructed an end-to-end parameter estimation framework. Although it used customized convolution kernels and nonlinear activation functions to fit the physical properties of transmittance, the method still faced error accumulation in the independent estimation of the atmospheric light value. In 2017, Li et al. [1] innovatively established a coupled equation of transmittance and atmospheric light in AOD-Net, transforming the dual-parameter estimation into a single-variable optimization problem. Their lightweight network design reduced error sensitivity while maintaining real-time processing capabilities. Although this algorithm reduces system complexity through parameter coupling, its linear atmospheric scattering model simplifies the multiple-scattering effects of actual scenes, leading to color distortion under dense haze. Additionally, the model's reliance on synthetic data limits its generalization ability in real-world scenarios.
- 2. Architectural Innovation Stage
With the emergence of breakthrough technologies such as ResNet [21] and attention mechanisms [22], research priorities shifted to network architecture optimization after 2019. Residual learning effectively alleviated the gradient vanishing problem in deep networks, while channel and spatial attention mechanisms achieved haze-concentration-adaptive regional enhancement through feature reweighting. Notably, Dosovitskiy's team [23] proposed the Vision Transformer (ViT) in 2021, breaking through the modal boundaries between vision and language tasks; its global receptive field provided a new paradigm for modeling long-range haze dependencies. Although ResNet and attention mechanisms have enhanced feature expression capabilities, several concerns remain: residual structures in deep networks tend to cause feature redundancy; the global modeling capability of ViT comes at the cost of an extremely high computational load (with parameter counts reaching 5–8 times those of CNN models); and its patch division may destroy the continuity priors of haze (such as gradient fog in sky regions).
- 3. Specialized Architecture Deepening Stage
In 2023, Song et al. [3] proposed DehazeFormer, marking the maturity of domain-specific architectures. This model integrates the multi-scale feature fusion mechanism of the Transformer with the local perception advantages of the CNN, achieving performance breakthroughs through a haze-concentration-aware module and a physically constrained loss function. However, such CNN–Transformer hybrid architectures suffer from inconsistencies in training objectives between shallow local feature extraction and global dependency modeling, resulting in slower convergence than pure CNN models. Furthermore, the current physical loss functions (such as atmospheric light smoothing constraints) act only at the pixel level and lack differentiated modeling for haze-related semantic regions (e.g., sky and building edges), which explains the over-saturation of sky regions observed on the RESIDE dataset.
In 2024, Wang et al. proposed UCL-Dehaze, which constructs semantically aware physical constraints through unsupervised contrastive learning and makes a breakthrough in addressing the over-saturation of sky regions, although the training cost increases roughly threefold [24]. Mo et al. proposed a lightweight dehazing network that compresses DehazeFormer using neural architecture search, achieving 23 fps real-time dehazing on a Jetson TX2; nevertheless, the quantized model exhibited 6.7% color distortion on the RESIDE dataset, revealing the sensitivity of the physical loss function to hardware adaptation [25].
In 2025, Wang et al. proposed ITW-DehazeFormer, which incorporates a turbulence physical model into DehazeFormer and improves the haze-concentration-aware module through vortex diffusion equation constraints. Experiments showed that its PSNR in water-mist scenarios was 2.1 dB higher than that of the original model, but it inherited the slow convergence of the CNN–Transformer hybrid architecture (training epochs increased by 37%) [26]. Qu et al. proposed UIEFormer, a lightweight vision Transformer for underwater image enhancement; experiments demonstrated that the model effectively restores the color and details of underwater images while remaining lightweight (with a parameter count < 1 M). This effective color restoration with a lightweight Transformer provides a useful direction for image dehazing [27].
3. Optimization of AOD-Net and DehazeFormer Algorithms
3.1. Overview of AOD-Net Algorithm
The AOD-Net algorithm, proposed by Li et al. [1], is an end-to-end image dehazing algorithm: the network takes a hazy image as input and directly outputs the dehazed image without estimating intermediate parameters. Its lightweight architecture and good dehazing quality give AOD-Net an advantage in real-time dehazing tasks, making it suitable for application scenarios with strict real-time requirements such as video surveillance and autonomous driving.
The core of any dehazing algorithm is to recover a haze-free image from a hazy one. The key improvement of the AOD-Net algorithm is integrating the two parameters that need to be estimated in the atmospheric scattering model, transmittance and atmospheric light value, into a single parameter for neural network estimation.
Modified Atmospheric Scattering Model: In the traditional atmospheric scattering model, researchers usually rely on prior assumptions or neural network estimation to determine the two parameters, the transmittance and the atmospheric light value. Errors in these estimates accumulate and degrade the generated dehazed image. The atmospheric scattering model, Equation (1), and the restoration form derived from it, Equation (2), are expressed as follows:

I(x) = J(x)t(x) + A(1 − t(x)),  (1)

J(x) = (I(x) − A) / t(x) + A,  (2)

where I(x) is the observed hazy image, J(x) is the haze-free scene radiance, t(x) is the transmission map, and A is the global atmospheric light.
Li et al. [1] proposed to integrate the atmospheric light value A and the transmission map t(x) into a unified variable K(x), resulting in Equation (3):

J(x) = K(x)I(x) − K(x) + b,  where  K(x) = [(I(x) − A)/t(x) + (A − b)] / (I(x) − 1),  (3)

and b is a constant bias (set to 1 by default). By estimating K(x) through a neural network, the impact of cumulative errors during the calculation process can be reduced.
Based on the above physical model, the network structure is designed as shown in Figure 1. The AOD-Net dehazing process mainly consists of two modules: a K(x) estimation module, which estimates K(x) from the input hazy image through five convolutional layers with multi-scale feature concatenation, and a clean image generation module, which reconstructs the haze-free image from K(x) and the input image via the element-wise operations of Equation (3).
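For illustration, a minimal PyTorch sketch of such a two-module network is given below. The layer widths and concatenation pattern follow commonly cited public implementations of AOD-Net (five convolution layers totaling roughly 1.76 K parameters); it is a reference sketch rather than the authors' released code.

```python
import torch
import torch.nn as nn

class AODNetSketch(nn.Module):
    """Two-module AOD-Net-style network: a K(x) estimation module followed by
    the clean-image generation step J = K * I - K + b (reference sketch)."""
    def __init__(self, b: float = 1.0):
        super().__init__()
        self.b = b
        # K(x) estimation module: five convolutions with multi-scale concatenation.
        self.conv1 = nn.Conv2d(3, 3, kernel_size=1)
        self.conv2 = nn.Conv2d(3, 3, kernel_size=3, padding=1)
        self.conv3 = nn.Conv2d(6, 3, kernel_size=5, padding=2)
        self.conv4 = nn.Conv2d(6, 3, kernel_size=7, padding=3)
        self.conv5 = nn.Conv2d(12, 3, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, hazy: torch.Tensor) -> torch.Tensor:
        x1 = self.relu(self.conv1(hazy))
        x2 = self.relu(self.conv2(x1))
        x3 = self.relu(self.conv3(torch.cat([x1, x2], dim=1)))
        x4 = self.relu(self.conv4(torch.cat([x2, x3], dim=1)))
        k = self.relu(self.conv5(torch.cat([x1, x2, x3, x4], dim=1)))
        # Clean-image generation module: Equation (3).
        return self.relu(k * hazy - k + self.b)

model = AODNetSketch()
print(sum(p.numel() for p in model.parameters()))    # about 1.76 K parameters
dehazed = model(torch.rand(1, 3, 480, 640))           # 480 x 640 RGB input
```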
3.2. Overview of DehazeFormer Algorithm
The image dehazing method based on DehazeFormer constructs an image dehazing network using a Transformer–CNN hybrid architecture. The core idea of this algorithm is to recover clear images from foggy images through multi-scale feature extraction and physical model-driven design. As shown in the overall flowchart of the DehazeFormer algorithm in Figure 3, the model adopts a U-Net-like symmetric structure in which the convolution blocks are replaced by DehazeFormer modules. It alternately extracts features through 3 × 3 convolutions and DehazeFormer modules and constructs multi-scale representations in combination with downsampling. In the decoding stage, residual connections transmit features across layers to preserve details, SKFusion dynamically fuses information from different branches, and the DehazeFormer module is applied again to enhance feature modeling. Finally, a clear image is output through upsampling and Soft Reconstruction. This design integrates the local perception advantages of CNNs with the global modeling capabilities of Transformers and, together with a selective feature fusion mechanism, effectively improves image recovery in complex degraded scenarios.
As shown in Figure 4, the DehazeFormer module, which replaces the convolution blocks of the traditional U-Net structure, is mainly divided into four parts. First, the improved Revised LayerNorm (RLN) is used to normalize the input features; it can dynamically adjust its parameters according to local image characteristics, flexibly rescale the normalized features, and supports gradient separation, avoiding the instability that normalization would otherwise introduce into the training of the convolutional branches. Second, the feature map is partitioned into windows (each window is padded and attends independently). Compared with traditional zero padding at the borders, Reflection Padding fills the border area with mirrored edge pixels, which better preserves edge information and avoids over-smoothing. Then, the Q, K, V triplet generated by a Linear layer performs self-attention within each of the windows; the pixels in each window model their mutual relationships to capture local-region information, and the window outputs are merged. In parallel, several convolution operations are applied, and the global features extracted by the attention mechanism are finally combined with the local features extracted by the convolutions. The calculation formulas are given in Equations (5) and (6).
Here, Q, K, and V are the aforementioned query, key, and value vectors; h is the number of attention heads; C is the number of feature channels; and d denotes the logarithmically spaced positional encoding. Finally, a feed-forward network (MLP) with a dynamic expansion ratio (2.0–4.0) is adopted, channel attention is introduced, and the features are integrated through a dual residual connection with learnable scaling coefficients.
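To make the window-partitioned attention concrete, the following is a simplified PyTorch sketch of self-attention computed independently within non-overlapping windows. It deliberately omits DehazeFormer's RLN, reflection padding, positional bias, and parallel convolution branch, so it should be read as an illustration of the mechanism rather than the module itself.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WindowAttentionSketch(nn.Module):
    """Self-attention computed independently inside non-overlapping windows.
    Simplified: no RLN, no reflection padding, no positional bias, and no
    parallel convolution branch, unlike the full DehazeFormer module."""
    def __init__(self, dim: int = 24, window: int = 8, heads: int = 3):
        super().__init__()
        self.window, self.heads = window, heads
        self.qkv = nn.Linear(dim, dim * 3)    # generates the Q, K, V triplet
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        ws = self.window                       # assumes h and w are divisible by ws
        # Partition the feature map into (ws x ws) windows of tokens.
        x = x.view(b, c, h // ws, ws, w // ws, ws)
        x = x.permute(0, 2, 4, 3, 5, 1).reshape(-1, ws * ws, c)
        q, k, v = [t.reshape(-1, ws * ws, self.heads, c // self.heads).transpose(1, 2)
                   for t in self.qkv(x).chunk(3, dim=-1)]
        attn = F.softmax(q @ k.transpose(-2, -1) / (c // self.heads) ** 0.5, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(-1, ws * ws, c)
        out = self.proj(out)
        # Reverse the window partition back to a (b, c, h, w) feature map.
        out = out.view(-1, h // ws, w // ws, ws, ws, c)
        return out.permute(0, 5, 1, 3, 2, 4).reshape(b, c, h, w)

y = WindowAttentionSketch()(torch.rand(1, 24, 64, 64))
```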
3.3. Optimization of Loss Functions for the AOD-Net Algorithm and DehazeFormer Algorithm
The AOD-Net is trained using the Mean Squared Error (MSE) loss as the primary supervision metric, while the baseline training of the DehazeFormer model relies solely on the L1 loss function. To optimize the network training and enhance dehazing performance, this paper introduces a composite supervision mechanism that combines the MSE loss, L1 loss, and perceptual loss. Both MSE and L1 are pixel-wise loss functions, whereas the perceptual loss focuses on semantic and perceptual similarity rather than pixel-level differences. Since the perceptual loss, computed via a pre-trained CNN, exhibits a significantly larger magnitude during training, we constrain its weighting factor to the interval [0.01, 0.2] to ensure balanced optimization with other loss terms (e.g., MSE and L1).
As the most commonly used regression loss function in deep learning, MSE quantifies the error by calculating the average of the squared differences between the predicted values and the true values. Its mathematical expression is shown in Equation (7):

L_MSE = (1/N) Σ_{i=1}^{N} (ŷ_i − y_i)²,  (7)

where ŷ_i and y_i denote the predicted and ground-truth values, respectively.
In image tasks, N is batch_size × C × H × W. Although MSE can provide pixel-level supervision signals to ensure that the generated image maintains an approximate numerical distribution with the target image, its gradient magnitude is linearly positively correlated with the error. When dealing with outliers, the squaring operation significantly amplifies the error impact, which may cause training process oscillations and easily lead to over-smoothing of the generated image.
The Mean Absolute Error (MAE), commonly referred to as the L1 loss, establishes a supervision constraint by calculating the average of the absolute differences between the model's predicted values and the true values. Its mathematical expression is shown in Equation (8):

L_L1 = (1/N) Σ_{i=1}^{N} |ŷ_i − y_i|.  (8)
In the image processing scenario, the defined dimensions of N are consistent with MSE. Compared with MSE, the gradient magnitude of L1 loss is constant, making it more robust to outliers. This characteristic enables it to effectively preserve image edge sharpness and texture detail features.
Perceptual loss is a loss function based on high-level semantic features, widely used in computer vision tasks such as image generation, style transfer, and image dehazing. By comparing the feature representations of images in a pre-trained deep model, it measures the visual perceptual similarity between the predicted image and the real image. Traditional loss functions (such as the MSE and L1 losses) only measure pixel-level differences, which easily leads to generated images that lack realism or are overly smooth. Perceptual loss extracts features through a pre-trained CNN (VGG16 in this paper's experiments) and computes the distance between the generated image and the real image in feature space, thereby better preserving texture details, realism, and high-level information (such as object shape and structure). The perceptual loss is given by Equation (9):

L_per = Σ_i (1 / (C_i H_i W_i)) ‖φ_i(ŷ) − φ_i(y)‖²,  (9)

where φ_i is the feature extractor of the i-th layer of the pre-trained CNN, and C_i × H_i × W_i are the dimensions of the i-th layer feature map.
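A compact sketch of the composite supervision described in this subsection is shown below, assuming a torchvision VGG16 backbone for the perceptual term; the weight values are placeholders within the ranges explored by the later grid search experiments.

```python
import torch
import torch.nn as nn
from torchvision.models import vgg16, VGG16_Weights

class CompositeDehazeLoss(nn.Module):
    """MSE + L1 + perceptual loss; the perceptual weight is kept small to
    offset its larger magnitude. Weights here are placeholders."""
    def __init__(self, w_mse=1.0, w_l1=1.0, w_per=0.1, feat_layers=16):
        super().__init__()
        self.w_mse, self.w_l1, self.w_per = w_mse, w_l1, w_per
        self.mse, self.l1 = nn.MSELoss(), nn.L1Loss()
        # Frozen VGG16 feature extractor (ImageNet normalization omitted for brevity).
        vgg = vgg16(weights=VGG16_Weights.IMAGENET1K_V1).features[:feat_layers].eval()
        for p in vgg.parameters():
            p.requires_grad_(False)
        self.vgg = vgg

    def forward(self, pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        perceptual = self.mse(self.vgg(pred), self.vgg(target))   # feature-space distance
        return (self.w_mse * self.mse(pred, target)
                + self.w_l1 * self.l1(pred, target)
                + self.w_per * perceptual)

# e.g., the ratio found for AOD-Net in Section 4.2.2:
# criterion = CompositeDehazeLoss(w_mse=1.5, w_l1=1.0, w_per=0.05)
```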
4. Experiments and Results
4.1. Experiments
4.1.1. Datasets
This study employs the RESIDE [28] dataset, an authoritative benchmark in the field of image dehazing, as the core experimental platform. This dataset constructs a multi-scenario evaluation system by integrating synthetic and real hazy images, with its core advantages embodied in three aspects:
Data Scale: It contains over 138,000 annotated samples.
Scenario Diversity: It covers six major scenarios, including urban streetscapes, natural landscapes, and indoor environments.
Physical Realism: Synthetic hazy images are generated based on the atmospheric scattering model by precisely controlling parameters such as the transmission map and atmospheric light.
These multi-dimensional data characteristics effectively support the evaluation of the model’s generalization capability.
The experimental configuration primarily relies on two core subsets: the indoor training set (ITS) and the Synthetic Objective Testing Set (SOTS).
The ITS contains 10,000 clear images of indoor scenes. A parameterized haze generation engine creates 10 fogged variants for each original image (β ∈ [0.6, 1.8], atmospheric light A ∈ [0.7, 1.0]), forming a total of 100,000 training samples. The dataset is divided into training/validation sets at a 9:1 ratio to ensure the stability of the model optimization process.
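For reference, hazy variants of this kind can be synthesized from a clear image and its depth map with the atmospheric scattering model, as in the following sketch (illustrative only; it is not the RESIDE generation code):

```python
import numpy as np

def synthesize_haze(clear: np.ndarray, depth: np.ndarray, beta: float, A: float) -> np.ndarray:
    """I(x) = J(x) t(x) + A (1 - t(x)) with t(x) = exp(-beta * d(x)).
    `clear` is an HxWx3 image in [0, 1]; `depth` is an HxW depth map."""
    t = np.exp(-beta * depth)[..., None]                  # transmission map
    return np.clip(clear * t + A * (1.0 - t), 0.0, 1.0)

# Ten fogged variants per clear image, beta in [0.6, 1.8], A in [0.7, 1.0].
rng = np.random.default_rng(0)
clear = rng.random((480, 640, 3))
depth = rng.random((480, 640))                            # placeholder; ITS uses real depth maps
variants = [synthesize_haze(clear, depth, rng.uniform(0.6, 1.8), rng.uniform(0.7, 1.0))
            for _ in range(10)]
```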
The testing phase adopts the SOTS dual-modal evaluation system:
The outdoor subset (SOTS-outdoor) contains 492 image pairs, with original clear images sourced from the Middlebury Stereo [29] depth benchmark dataset and high-quality images crawled from the web.
The indoor subset (SOTS-indoor) consists of 500 image pairs built on NYU Depth V2 [30]. It is worth noting that SOTS-indoor has a resolution heterogeneity issue: the hazy images (620 × 460) and haze-free reference images (640 × 480) differ in pixel dimensions. To address this, this study pads the hazy images to a unified resolution of 640 × 480 using bilinear interpolation at the borders, which may affect the reported indoor dehazing results to some extent.
4.1.2. Experimental Environment
The experimental environment in this study is introduced in two parts: hardware and software environments.
Hardware Environment:
GPU Model: RTX 4090 24 GB, 1 card.
CPU Model: Xeon(R) Platinum 8358P, 16 cores.
Memory Capacity: 120 GB.
Software Environment:
Operating System: Linux.
Deep Learning Frameworks and CUDA: Model training for the AOD-Net algorithm uses PyTorch 2.5.1, Python 3.12 (Ubuntu 22.04), and CUDA 12.4; model training for the DehazeFormer algorithm uses PyTorch 1.11.0, Python 3.8 (Ubuntu 20.04), and CUDA 11.3.
4.1.3. Evaluation Metrics
The PSNR [31] is a widely used objective quality evaluation metric in image processing and computer vision, primarily used to measure the distortion of images after processing by algorithms such as denoising, compression, and restoration. Its core idea is to quantify image quality loss by calculating pixel-level differences between the processed image and the original distortion-free image. The Peak Signal-to-Noise Ratio is defined in Equation (10):

PSNR = 10 · log₁₀(MAX² / MSE),  (10)

where MAX represents the maximum possible pixel value of the image (e.g., 255 for 8-bit images). The Mean Squared Error (MSE) calculates the average of the squared differences between corresponding pixels of the reference image I and the dehazed image K, serving as a pixel-level dissimilarity metric: for each pixel location (i, j), the difference between I(i, j) and K(i, j) is squared and then averaged, as formulated in Equation (11):

MSE = (1 / (M · N)) Σ_{i=1}^{M} Σ_{j=1}^{N} [I(i, j) − K(i, j)]².  (11)

Here, M and N represent the width and height of the two images, respectively. MSE measures the difference between predicted and actual data by averaging the squared differences between them; a smaller MSE indicates that the prediction is closer to the ground truth. The PSNR is calculated from the MSE and measures the image's signal-to-noise ratio on a logarithmic scale; a higher PSNR indicates greater similarity between the dehazed result and the reference and thus better image quality.
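Both quantities can be computed directly; a short sketch assuming 8-bit images stored as NumPy arrays:

```python
import numpy as np

def psnr(reference: np.ndarray, dehazed: np.ndarray, max_val: float = 255.0) -> float:
    """PSNR = 10 * log10(MAX^2 / MSE), with the MSE of Equation (11) averaged over all pixels."""
    mse = np.mean((reference.astype(np.float64) - dehazed.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(max_val ** 2 / mse)
```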
The Structural Similarity Index (SSIM) [32] is an objective metric for assessing the perceptual quality similarity between two images, proposed by Wang et al. in 2004. Unlike the PSNR, which only considers pixel-level errors, the SSIM better approximates human subjective perception by modeling the human visual system (HVS)'s sensitivity to luminance, contrast, and structure.
SSIM computation involves three independent comparisons that are ultimately combined into a composite score. These three components, luminance similarity, contrast similarity, and structural similarity, are calculated by Equations (12)–(14), respectively:

l(x, y) = (2 μ_x μ_y + C₁) / (μ_x² + μ_y² + C₁),  (12)

c(x, y) = (2 σ_x σ_y + C₂) / (σ_x² + σ_y² + C₂),  (13)

s(x, y) = (σ_xy + C₃) / (σ_x σ_y + C₃),  (14)

where x represents the dehazed image and y denotes the corresponding haze-free reference image; μ_x and μ_y are the mean pixel values of local image patches x and y, respectively; σ_x and σ_y are the standard deviations of patches x and y; σ_xy is the covariance between patches x and y; and C₁, C₂, and C₃ are constants, with C₃ typically set as C₂/2 for computational simplification. The composite SSIM score is then calculated by Equation (15):

SSIM(x, y) = [l(x, y)]^α · [c(x, y)]^β · [s(x, y)]^γ,  (15)

where α, β, and γ are weighting parameters typically set to one. The SSIM value normally ranges within [0, 1], where values closer to 1 indicate higher similarity between images, and 0 denotes completely dissimilar images.
The VSNR [33] is an image quality assessment metric based on human visual characteristics, with its complete formulation given by Equation (16). This method improves upon traditional approaches through two key mechanisms: first, it employs a contrast sensitivity function for frequency-domain weighting to emphasize the mid-frequency information that is most perceptible to human vision; second, it incorporates visual masking effects to reduce the error weighting in textured regions. In Equation (16), the reference energy term, the contrast-sensitivity-weighted error, and the masking-corrected error are defined by Equations (17), (18), and (19), respectively, in terms of the i-th pixel value of the original haze-free image, the difference between the dehazed image and the reference haze-free image, the contrast sensitivity function weight, and the local masking threshold determined by the local texture complexity of the reference image. In image dehazing evaluation, the VSNR effectively detects detail restoration and visual artifacts (e.g., halo effects). Compared to metrics like the SSIM, the VSNR places greater emphasis on quantifying perceptible noise. Generally, higher VSNR values indicate better visual quality.
LPIPS (Learned Perceptual Image Patch Similarity) [34] is a deep learning-based perceptual similarity metric for images. Unlike traditional mathematically derived metrics such as the PSNR (Peak Signal-to-Noise Ratio) and SSIM (Structural Similarity Index), the LPIPS employs deep convolutional neural networks to extract high-level image features. Its model parameters are learned through training on large-scale image datasets, enabling effective capture of the human visual system's perception of image differences.
The method takes image pairs as input and outputs perceptual similarity scores by computing weighted distances in feature space. The score ranges within [0, 1], where lower values indicate higher visual similarity. This study adopts the officially released benchmark LPIPS model implemented with a pre-trained AlexNet architecture.
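As an illustration of how the metric is typically queried, the following sketch uses the open-source lpips package with the AlexNet backbone, assuming input tensors scaled to [-1, 1] as that package expects:

```python
import torch
import lpips  # pip install lpips

loss_fn = lpips.LPIPS(net="alex")                 # AlexNet-based LPIPS, as used here

dehazed = torch.rand(1, 3, 256, 256) * 2 - 1      # placeholder images scaled to [-1, 1]
reference = torch.rand(1, 3, 256, 256) * 2 - 1
with torch.no_grad():
    score = loss_fn(dehazed, reference).item()    # lower = perceptually more similar
```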
4.2. Experimental Details for Improving the AOD-Net Algorithm
4.2.1. Experimental Parameters for AOD-Net
This section investigates the collaborative optimization of lightweight architecture design and physical model constraints on the integrated dehazing network AOD-Net. The experimental setup is as follows: the training set undergoes uniform preprocessing, with randomly extracted 480 × 640-pixel RGB image patches serving as the network input. The network training employs the Adam optimizer for end-to-end learning, with key parameters configured as follows: 10 training epochs (automatically saving the model with the lowest validation loss), a batch size of eight, an initial learning rate of 1 × 10−4, a weight decay coefficient of 1 × 10−4, and gradient clipping (threshold 0.1) to enhance training stability. The loss function adopts the Mean Squared Error (MSE) to constrain the pixel-level differences between the output dehazed image and the ground-truth haze-free image. The model parameters are controlled at the 1.76 K level, and mixed-precision training is employed to accelerate computation.
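A condensed sketch of this training configuration (Adam, learning rate 1 × 10−4, weight decay 1 × 10−4, gradient clipping at 0.1, mixed precision) is given below; the model, dataloader, and loss are assumed to be defined elsewhere, and norm-based clipping is used here although the clipping variant is not specified above.

```python
import torch
from torch.cuda.amp import GradScaler, autocast

def train_aodnet(model, train_loader, criterion, epochs=10, device="cuda"):
    """Adam (lr 1e-4, weight decay 1e-4), gradient clipping at 0.1, mixed
    precision; the best-model-by-validation-loss logic is omitted here."""
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-4)
    scaler = GradScaler()
    model.to(device).train()
    for _ in range(epochs):
        for hazy, clear in train_loader:          # 480 x 640 patches, batch size 8
            hazy, clear = hazy.to(device), clear.to(device)
            optimizer.zero_grad()
            with autocast():
                loss = criterion(model(hazy), clear)
            scaler.scale(loss).backward()
            scaler.unscale_(optimizer)
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.1)  # norm clipping assumed
            scaler.step(optimizer)
            scaler.update()
```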
To systematically evaluate the optimization effects of different loss functions, this study designs a three-tier progressive experiment:
Baseline experiment: This fully reproduces the original AOD-Net algorithm using a single MSE loss function.
Perceptual enhancement experiment: To validate the effectiveness of incorporating a perceptual loss function, we introduce it alongside the MSE loss with a weight ratio of 0.1.
Composite optimization experiment: A ternary composite loss function is constructed combining MSE, perceptual loss, and L1 regularization. The experiment employs grid search to systematically explore the optimal weight combinations.
4.2.2. Grid Search Experiment for AOD-Net
On the basis of the trained baseline AOD-Net model, a ternary composite loss function combining MSE, L1 regularization, and perceptual loss is constructed, with the corresponding loss weights denoted as λ_MSE, λ_L1, and λ_per, respectively. The experiment establishes a three-dimensional grid with λ_MSE ∈ {0.5, 1.0, 1.5}, λ_L1 ∈ {0.0, 0.5, 1.0}, and λ_per ∈ {0.05, 0.1, 0.2}, resulting in a total of 27 combinations. Since the MSE and L1 losses are of similar magnitude, a step size of 0.5 is used for their weights while the original baseline loss is retained. For the perceptual loss, which is calculated using a pre-trained VGG16 and has a magnitude approximately 6–8 times that of the MSE, three representative values within the range 0.01–0.2 are selected to maintain gradient balance. The remaining experimental parameters are unchanged from the original settings: the network is trained with 10,000 pairs of hazy–clear images from the ITS indoor training set, but each combination is trained for only three epochs to save time. After each epoch, the L1 loss is computed on the validation set; a smaller validation L1 loss indicates a smaller difference between the dehazed image predicted by the model and the real image, i.e., better dehazing performance. By comparing the minimum validation L1 loss of the 27 combinations (the best performance of each model across all epochs), the optimal weight ratio is selected. The experimental results are shown in Figure 5; the abscissa of the bar chart represents the weight combinations, and the ordinate represents the minimum L1 loss of each combination on the validation set. The figure displays the top 10 and bottom 5 combinations sorted by L1 loss in ascending order. Within the studied range, the optimal weight ratio λ_MSE : λ_L1 : λ_per is 1.5 : 1.0 : 0.05.
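The 27-combination sweep can be organized as in the following sketch, where train_and_validate is a hypothetical helper standing in for the three-epoch training and validation-L1 evaluation described above:

```python
from itertools import product

# Weight grids for the MSE, L1, and perceptual terms (27 combinations).
GRID_MSE = [0.5, 1.0, 1.5]
GRID_L1 = [0.0, 0.5, 1.0]
GRID_PER = [0.05, 0.1, 0.2]

def grid_search(train_and_validate):
    """`train_and_validate(w_mse, w_l1, w_per)` is assumed to train for three
    epochs and return the minimum validation L1 loss over those epochs."""
    results = {weights: train_and_validate(*weights)
               for weights in product(GRID_MSE, GRID_L1, GRID_PER)}
    best = min(results, key=results.get)          # best found here: (1.5, 1.0, 0.05)
    return best, results
```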
4.2.3. Quantitative Comparison of AOD-Net
In this study, the AOD-Net algorithm is strictly reproduced to conduct the perceptual enhancement and composite optimization experiments. A total of 10,000 pairs of hazy–clear images from the ITS indoor training set are used for network training, with a training cycle of 10 epochs for each experiment. During training, a dynamic model-saving strategy is implemented, automatically preserving the optimal model parameters based on the validation loss. To evaluate the model's generalization performance, cross-domain testing is conducted on both the indoor and outdoor subsets of the SOTS dataset. First, the dehazed images generated by the model are saved, followed by a quantitative analysis against the ground-truth haze-free images using four objective evaluation metrics (PSNR, SSIM, VSNR, and LPIPS). Through this systematic evaluation, the model's comprehensive performance on the dual-domain test set is presented in Table 1, where PSNR, SSIM, and LPIPS values are reported to four decimal places.
Table 1 presents a detailed evaluation of the comprehensive performance of the AOD-Net algorithm and its improved versions on the SOTS dataset. The experimental results demonstrate that the multi-stage loss function optimization strategy significantly enhances model performance in both outdoor and indoor scenarios.
In the outdoor subset test, the original AOD-Net model established its baseline performance with a PSNR of 28.9374 dB, SSIM of 0.8964, VSNR of 9.14 dB, and LPIPS of 0.1467, where the lower LPIPS value confirms the algorithm’s superiority in perceptual quality. When a perceptual loss (weight: 0.1) was introduced on top of the original MSE loss, the model showed improvements of 0.9908 dB in PSNR, 0.0719 in SSIM, and 0.73 dB in VSNR, while the LPIPS score decreased by 0.0513 to 0.0954, demonstrating the perceptual loss’s enhancement of feature representation. Further adopting a ternary composite loss function—combining MSE (1.5), perceptual loss (0.05), and L1 regularization (1.0)—the model achieved peak performance with 29.9912 dB PSNR, 0.9716 SSIM, and 10.02 dB VSNR, while the LPIPS improved by 0.0026 to 0.0928, marking a 36.74% reduction compared to the initial value. This highlights the effectiveness of multi-loss collaborative optimization.
In the indoor scenario test, the original model achieved baseline performance with 18.6089 dB PSNR, 0.8774 SSIM, 3.58 dB VSNR, and 0.2062 LPIPS. After the two-stage improvement, the single-loss optimization phase increased the PSNR, SSIM, and VSNR by 1.8804 dB, 0.0033, and 1.1 dB, respectively, while the LPIPS decreased by 0.112. The composite loss phase further boosted these metrics by 2.2895 dB, 0.0217, and 0.44 dB, respectively, with the LPIPS dropping by 0.0041. The final performance reached 22.7788 dB PSNR, 0.9092 SSIM, 5.12 dB VSNR, and 0.091 LPIPS—a 56.30% reduction from the initial value—validating the algorithm’s strong adaptability in complex indoor scenes.
Regarding computational efficiency, tests on an 11th Gen Intel Core i5-1135G7 processor showed that the original model took 0.82 s to process a 550 × 413-pixel image. With the introduction of the perceptual loss, the inference time decreased to 0.32 s, a 60.97% efficiency improvement. The model optimized with the triple loss function required 0.62 s for dehazing, a 1.3× speedup over the baseline. This change reveals the dual potential of loss function design for feature space compression and computational efficiency optimization.
4.2.4. Qualitative Comparison of AOD-Net
Figure 6 visually compares the dehazing effects of the AOD-Net model under different training strategies on the SOTS dataset. The experimental results demonstrate that the original reproduced model exhibits noticeable color distortion in both outdoor and indoor scenes, while the introduction of composite loss functions progressively optimizes visual perception.
In the SOTS-outdoor test, the original model’s joint estimation deviation of atmospheric light and transmittance leads to multiple visual anomalies: the sky region displays an unnatural dark blue hue, gray edge artifacts appear at the boundary between the sky and foreground, and excessive global contrast enhancement causes color shift in foreground objects, resulting in an overall darkened appearance. Notably, after incorporating the perceptual loss function, the blue tint in the sky region is significantly reduced, aligning more closely with the true colors of the haze-free reference image, while edge artifacts are effectively suppressed.
For the SOTS-indoor scenario, the original model exhibits compressed dynamic range in brightness: white walls and illuminated areas suffer from visual graying, and highly saturated color blocks (e.g., red walls) experience significant desaturation, resulting in a faded appearance. Through composite loss function optimization, the model successfully restores color saturation in red regions while maintaining balanced brightness on walls.
Further observation shows that, although the overall differences between the loss function configurations are not immediately obvious to the naked eye, visual comparison confirms the incremental gains in sky color accuracy, edge-artifact suppression, and color saturation described above.
These improvements validate the role of composite loss functions in optimizing color fidelity, providing visual evidence for the model’s performance enhancement.
4.3. Experimental Details for Improving the DehazeFormer Algorithm
4.3.1. Experimental Parameters for DehazeFormer
The training parameter design of the DehazeFormer model focuses on balancing efficient dehazing and detail preservation. The AdamW optimizer is employed for parameter updates, with an initial learning rate of 1 × 10−4 and a weight decay coefficient of 0.05. The learning rate scheduling adopts a cosine annealing strategy, enabling the model to converge rapidly in the early stage of training and perform fine adjustments near the local minimum in the later stage, thereby improving its generalization ability. The learning rate fluctuates along a cosine curve with the training process, gradually decaying from the maximum value to a minimum threshold of 1 × 10−6 in each cycle. Due to GPU memory constraints, the batch size is adjusted from the original paper’s 16 to 8, the training epochs are reduced from 200 in the original paper to 50, which is determined based on experiments, and input images are uniformly resized to a 256 × 256 resolution. Data augmentation strategies, including random horizontal flipping, grid-based random crop-and-stitch, and color jittering, are applied during training. Mixed-precision computing (AMP) is utilized to accelerate training, achieving stable convergence after completing the full training cycle on the RESIDE dataset.
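A sketch of this optimizer and schedule (AdamW, cosine annealing to a 1 × 10−6 floor, mixed precision) is given below; the model, dataloader, and augmentation pipeline are assumed to exist elsewhere.

```python
import torch
from torch.cuda.amp import GradScaler, autocast

def train_dehazeformer(model, loader, criterion, epochs=50, device="cuda"):
    """AdamW (lr 1e-4, weight decay 0.05) + cosine annealing to 1e-6 + AMP."""
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.05)
    sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=epochs, eta_min=1e-6)
    scaler = GradScaler()
    model.to(device).train()
    for _ in range(epochs):
        for hazy, clear in loader:                # 256 x 256 crops, batch size 8
            hazy, clear = hazy.to(device), clear.to(device)
            opt.zero_grad()
            with autocast():
                loss = criterion(model(hazy), clear)
            scaler.scale(loss).backward()
            scaler.step(opt)
            scaler.update()
        sched.step()
```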
The experimental design consists of three progressively improved groups:
Baseline Group (Group-I): This strictly reproduces the original DehazeFormer architecture, using only the L1 norm as the loss function.
Enhanced Group (Group-II): This introduces a perceptual loss term based on the pre-trained VGG-16 network for feature extraction, with its loss weight set to 0.1, in addition to the L1 loss.
Optimized Group (Group-III): This constructs a triple composite loss function (L1 + Perceptual + MSE), with the optimal weight ratios determined via grid search.
All experimental groups maintain identical hyperparameter configurations to ensure the fairness of comparative experiments.
4.3.2. Training Epoch Exploration Experiment
Due to differences in hardware environments and the training configurations of the original papers, this study designs a progressive model optimization strategy. As shown in Table 2, multi-stage training evaluations were conducted on the ITS indoor dataset: model snapshots were saved every 20 training epochs (epoch = 10/30/50/70), constructing an evaluation queue of four candidate models. Quantitative evaluation shows that the model trained for 50 epochs achieves the highest PSNR and SSIM on the ITS validation set: its average PSNR (37.83 dB) is 5.6% higher than that of the 30-epoch model, and its maximum SSIM exceeds the 0.999 threshold, indicating the optimal balance point for model convergence.
Figure 7 presents the visualization of the dehazing effect of the DehazeFormer model under different training epochs. As the number of training epochs increases, the dehazed images generated by the model show a clear improvement in visual clarity. The dehazed images predicted by the models trained for 5 and 10 epochs still exhibit obvious haze traces. When training reaches 30 epochs, the grayish-white, irregular synthetic haze traces can no longer be observed with the naked eye, and the DehazeFormer model already demonstrates excellent dehazing capability. Quantitative analysis shows that the dehazing effect at this stage approaches the imaging quality of real fog-free scenes in subjective visual evaluation. Based on these results, this study adopts 50 training epochs in the subsequent comparative experiments on loss function optimization for the DehazeFormer model.
4.3.3. Grid Search Experiment for DehazeFormer
Building on the training of the DehazeFormer model in the original paper, a ternary composite loss function combining L1 regularization, MSE, and perceptual loss is constructed, with the corresponding loss weights denoted as λ_L1, λ_MSE, and λ_per, respectively. The experiment constructs a three-dimensional grid with λ_L1 ∈ {0.5, 1.0, 1.5}, λ_MSE ∈ {0.0, 0.5, 1.0}, and λ_per ∈ {0.05, 0.1, 0.2}, resulting in a total of 27 combinations, consistent with the grid search experiment for the AOD-Net algorithm. The remaining experimental parameters are unchanged. Each combination is trained for only three epochs, and the average PSNR on the validation set is computed at the end of each epoch; a larger PSNR indicates higher quality of the dehazed images predicted by the model. By comparing the average validation PSNR of the 27 combinations, the optimal weight ratio is selected. The results are drawn as a heatmap in Figure 8, where a darker color indicates a larger PSNR and better dehazing performance. With the perceptual loss weight fixed, the heatmap shows the influence of the MSE and L1 losses on the model's dehazing performance, and the optimal weight ratio λ_L1 : λ_MSE : λ_per is 1.5 : 0.0 : 0.2. When the perceptual loss weight is 0.05, the overall PSNR ranges from 18.30 to 19.34; when it is increased to 0.1, the PSNR improves, ranging from 18.77 to 19.68; and when it is 0.2, the PSNR lies between 18.24 and 19.91, reaching the peak value but with larger spread among combinations. From this analysis, we conclude that no single loss term alone determines the trend of the model's dehazing performance; the performance results from the joint action of all loss terms during training. For the DehazeFormer model, however, the L1 loss has the greatest influence.
4.3.4. Quantitative Comparison of DehazeFormer
Table 3 systematically presents the comparison data of the dehazing performance of the DehazeFormer series models on the indoor and outdoor subsets of the SOTS dataset. The experimental results show that the basic version of the DehazeFormer model has demonstrated good performance in outdoor scenarios: the PSNR reaches 32.1919 dB, the SSIM index is 0.9902, the VSNR is 18.39 dB, and the LPIPS is as low as 0.0653, indicating that the dehazed image is close to the original fog-free image in terms of visual perception. In the indoor scene test, although its PSNR (21.3952 dB), SSIM (0.9791), and VSNR (8.61 dB) are lower than those in the outdoor scene, the LPIPS value of 0.1860 also maintains a relatively high perceptual similarity.
The improved models further reflect the optimization effect. Compared with the basic version, the 1.0L1 + 0.1Per combination increases the outdoor PSNR to 33.9387 dB (about 5.42% higher) and the indoor PSNR to 22.5949 dB (about 5.62% higher); the SSIM increases by 0.40% outdoors and 0.79% indoors; the VSNR increases by about 1.58% outdoors and about 1.86% indoors; and the LPIPS decreases by 0.0177 outdoors and 0.0034 indoors. The 1.5L1 + 0.2Per combination performs even better: the outdoor PSNR reaches 34.7611 dB, about 7.98% higher than the basic version, and the indoor PSNR reaches 23.8426 dB, an increase of about 11.45%; the SSIM increases by 0.61% outdoors and 0.82% indoors; the VSNR increases by about 2.77% outdoors and about 2.67% indoors; and the LPIPS decreases by 0.04 outdoors and 0.0104 indoors. The optimization effect in the outdoor scene therefore grows steadily with the improvements, indoor performance also improves, and the different improvements affect the individual metrics to different degrees.
In terms of computational efficiency testing, when processing a 550 × 413-pixel image on the Intel Core i5-1135G7 platform, the inference time of the basic version of DehazeFormer is 2.87 s; the DehazeFormer + 0.1Per with the added loss function component takes 6.72 s, and the time consumption increases by more than 134%; and the inference time of 1.5DehazeFormer + 0.2Per is 7.56 s. The data shows that the introduction of multi-component loss functions will increase the processing time per frame, bringing new research directions for model architecture optimization. In the future, it is necessary to find a balance between performance improvement and efficiency.
4.3.5. Qualitative Comparison of DehazeFormer
Figure 9 presents a visual comparison of the defogging effects of the DehazeFormer model on the SOTS dataset. Visual evaluation shows that the model improves along three dimensions, color fidelity, texture reconstruction, and edge sharpness, fully preserving the structural information of the image without color artifacts or detail blurring. However, further analysis reveals residual gray-scale artifacts in scene transition regions, particularly in areas with abrupt depth-of-field changes.
This set of comparison images presents the dehazing effects of DehazeFormer models optimized with different loss functions across various scenarios (urban vistas, architectures, interiors, etc.). Each row, from left to right, shows a hazy image, a prediction by the model trained with the basic DehazeFormer, a prediction by the model trained with an additional 0.1 perceptual loss, a prediction by the improved model with the optimal weight ratio explored via grid search, and the original fog-free image.
Visually, after being processed by the basic DehazeFormer model, the haze in the images is reduced and clarity is improved. However, issues like a pale sky color, artifacts at the junction of buildings and the sky, and unclear textures arise. Compared with the model trained with a single loss function, the model trained with the added perceptual loss makes the overall image more transparent, with better detail and color restoration, and no obvious unnatural transitions. Still, there are cases where the sky color is darker than the original. From the basic model to the improved ones, the dehazing effect is gradually optimized, image details and clarity are enhanced, and it becomes hard to visually detect haze traces and color block problems.
4.4. Algorithm Performance Comparative Analysis
In the previous research, experimental validation was conducted on two algorithms, AOD-Net and DehazeFormer: First, algorithm reproduction was completed, followed by optimization and improvement of the loss functions. The dehazing performance was then comparatively analyzed through quantitative metrics and visual effects. This section will present a performance comparison and analysis of these two algorithms.
4.4.1. Analysis of Basic Model Performance
As shown in Table 4, the AOD-Net algorithm adopts an ultra-simplified CNN architecture with only five convolutional layers and an extremely small number of parameters, which mainly come from convolution kernels and bias terms and total approximately 1.76 K. The FLOPs of this model are dominated by shallow convolutions with small kernels: with a 480 × 640 input, the total is about 205 GFLOPs, and the total memory footprint is approximately 1 MB, making it suitable for real-time deployment on edge devices.
The DehazeFormer model (the DehazeFormer-S version is used in this paper) integrates Transformer attention mechanisms and multi-scale feature fusion, resulting in a significant increase in the number of parameters to approximately 20 M, mainly from the attention projection matrices and DWConv layers in the DehazeFormer blocks. With a 256 × 256 input, the total is 65 GFLOPs, and the memory consumption during training is about 10 GB.
The memory consumption of DehazeFormer is approximately 10,000 times that of AOD-Net, limiting its deployment on low-resource devices. The fundamental difference between the two models reflects a trade-off between "efficiency and performance".
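Parameter totals of this kind can be reproduced with a simple tally over a model's trainable tensors, as sketched below; FLOPs estimation normally relies on a separate profiling tool and is omitted here.

```python
import torch

def count_parameters(model: torch.nn.Module) -> int:
    """Total trainable parameters (roughly 1.76 K for AOD-Net, 20 M for DehazeFormer-S)."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)
```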
4.4.2. Quantitative Comparison Between AOD-Net and DehazeFormer
Statistical significance tests were performed on the PSNR and SSIM metrics obtained from dehazing tests on the SOTS indoor and outdoor datasets with the reproduced and improved models. The indoor dataset had a sample size of 500 and the outdoor dataset a sample size of 492. Paired-sample t-tests were conducted and effect sizes were calculated, yielding the results shown in Table 5.
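The significance tests follow the standard paired design; a sketch with SciPy is given below, where metric_base and metric_improved are hypothetical per-image metric arrays of equal length (500 indoor or 492 outdoor samples).

```python
import numpy as np
from scipy import stats

def paired_test(metric_base: np.ndarray, metric_improved: np.ndarray):
    """Paired-sample t-test plus Cohen's d effect size on per-image metrics."""
    t_stat, p_value = stats.ttest_rel(metric_improved, metric_base)
    diff = metric_improved - metric_base
    cohens_d = diff.mean() / diff.std(ddof=1)     # effect size for paired samples
    return t_stat, p_value, cohens_d
```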
According to the data in the table, the improvements made in this paper to the algorithm’s loss function are statistically significant for both PSNR and SSIM metrics in both indoor and outdoor scenarios. However, the effects are generally weak, resulting in only a small improvement in actual image quality. For the outdoor dataset using DehazeFormer, the PSNR shows a noticeable improvement with a moderate effect size, representing a practically meaningful enhancement where the image quality improvement is perceptible to the human eye.
Table 6 presents the objective metrics for the defogging performance of several algorithms (including reproduced classic defogging algorithms and their improved versions with optimized loss functions) on the SOTS dataset, along with their inference times for processing the same image on a CPU. The quantitative analysis reveals the following:
AOD-Net: By incorporating the composite loss function (MSE + L1 + perceptual loss), it achieved a 22.41% improvement in indoor PSNR, a 56.30% reduction in LPIPS, and a 1.3× speedup in inference, validating the optimization potential of lightweight architectures.
DehazeFormer determines the optimal weight ratio through grid search and increases the proportions of L1 loss and perceptual loss. As a result, the PSNR and SSIM metrics on the indoor dataset are improved by 11.44% and 0.82%, respectively. However, its computational complexity increases accordingly, illustrating the performance–efficiency tradeoff.
Cross-Algorithm Comparison: DehazeFormer demonstrated superior fog density modeling, with its outdoor VSNR (18.90 dB) outperforming AOD-Net (10.02 dB) by 88.62%, attributed to its multi-scale attention mechanism for non-uniform haze.
Time-Sensitive Scenarios: The optimized AOD-Net achieved real-time processing at 0.63 s/frame, while DehazeFormer is better suited for offline high-precision tasks. This difference stems from their design philosophies: AOD-Net uses a parameterized physical model (1.78 K parameters), whereas DehazeFormer employs a deep Transformer architecture (20 M parameters) for end-to-end haze decomposition.
Table 6. Objective performance comparison of algorithms on the SOTS dataset.

| Algorithm | Version | Indoor PSNR (dB) | Indoor SSIM | Indoor VSNR (dB) | Indoor LPIPS | Outdoor PSNR (dB) | Outdoor SSIM | Outdoor VSNR (dB) | Outdoor LPIPS | Time |
|---|---|---|---|---|---|---|---|---|---|---|
| AOD-Net | Reproduced | 18.6089 | 0.8774 | 3.58 | 0.2062 | 28.9374 | 0.8964 | 9.14 | 0.1467 | 0.82 s |
| AOD-Net | Improved | 22.7788 | 0.9092 | 5.12 | 0.0901 | 29.9912 | 0.9716 | 10.02 | 0.0928 | 0.63 s |
| DehazeFormer | Reproduced | 21.3952 | 0.9791 | 8.61 | 0.1860 | 32.1919 | 0.9902 | 18.39 | 0.0653 | 2.87 s |
| DehazeFormer | Improved | 23.8426 | 0.9871 | 8.84 | 0.1756 | 34.7611 | 0.9962 | 18.90 | 0.0254 | 7.56 s |
4.4.3. Qualitative Comparison Between AOD-Net and DehazeFormer
Figure 10 and Figure 11 display the dehazed images generated by the algorithms used in this study on the SOTS-outdoor and SOTS-indoor datasets, respectively. Comparative observation shows that the traditional CNN-based dehazing algorithm underperforms the other two algorithms on these datasets: although the images it produces partially restore colors, textures, and structures, they generally exhibit blurred object contours.
The AOD-Net algorithm demonstrates satisfactory performance in recovering fine textures and details in images. However, it is prone to color distortion, particularly in large-area regions where uneven color distribution and unnatural transitions often occur.
The DehazeFormer algorithm outperforms the previous two methods in terms of color restoration and detail/texture recovery. Nevertheless, the model’s generalization capability is limited, as it tends to produce artifacts and uneven color blocks at scene boundaries.
The comparative experiments demonstrate that different dehazing algorithms each have their advantages. The AOD-Net algorithm is suitable for real-time lightweight applications, offering fast processing speed and good visual restoration, though its performance in high-density indoor fog scenes is slightly limited. The improved version of AOD-Net, through loss function optimization, has enhanced color and detail texture restoration to some extent.
The DehazeFormer algorithm excels in both types of scenarios, particularly in restoring complex textures and dense fog regions. However, its higher computational complexity results in slower inference speed. The improved DehazeFormer algorithm further enhances visual quality, with natural texture and color transitions, delivering superior realism.
Traditional CNN methods, while computationally efficient, exhibit significantly inferior dehazing performance compared to AOD-Net and DehazeFormer, often suffering from artifacts, color blocks, loss of high-frequency details, and excessive smoothing.
In summary, the AOD-Net algorithm is more suitable for real-time dehazing tasks, while the DehazeFormer algorithm prioritizes high-quality dehazing applications.
5. Discussion
This study conducted experiments, systematic reproduction, and improvements on three classical dehazing algorithms. While certain progress has been achieved, several directions warrant further exploration.
In the research and evaluation of image dehazing algorithms, synthetic datasets (typified by RESIDE) are widely used due to their advantages, such as accessibility to large-scale paired data and strong scene controllability. However, this reliance has also introduced significant domain gaps and potential biases.
Taking the RESIDE dataset as an example, the haze in the dataset is typically generated based on the atmospheric scattering model, which only simulates a single scattering process and overlooks the complex multiple scattering effects in real-world foggy conditions, as well as the dynamic impacts of environmental factors like humidity, temperature, and wind speed on haze distribution. This leads to issues such as color distortion and detail blurring when models process real-world dense haze.
Secondly, there is a problem of insufficient complexity and diversity in scenarios. Although the RESIDE dataset includes numerous scenarios such as skies, cities, and lakes, the scenarios are overly idealized, lacking natural uncertainties (e.g., variable lighting and irregular cloud distributions in the sky), and there is a slight prevalence of scene homogenization, with the same buildings appearing multiple times from different angles. When models are trained and evaluated on such synthetic datasets, they often exhibit high metric scores but struggle to meet real-world dehazing needs, and are prone to issues like excessive saturation and overexposed sky regions.
Moreover, synthetic data rarely incorporate the semantic associations present in real-world scenarios. In actual dehazing tasks, models need to process images with semantic awareness (e.g., “preserving the clarity of traffic signs” or “retaining facial details”). Evaluations on synthetic datasets fail to reflect a model’s ability to address such semantic requirements.
In light of these issues, we argue that current research needs to reduce its dependence on synthetic datasets, construct large-scale datasets containing real-world foggy image pairs, and design synthetic haze generation methods that more accurately reflect the distribution characteristics of real-world haze.
6. Conclusions
This paper systematically optimizes the classical dehazing algorithms AOD-Net and DehazeFormer through innovative loss function reconstruction. Based on the loss function optimization schemes of the two algorithms, a perceptual loss that can improve human visual quality is added, and the optimal weight ratio for each algorithm is explored through grid search. Through experimental control variables, it can be concluded that adding perceptual loss to optimize model training is helpful for improving the dehazing performance of the model. In the grid search, through the analysis of 27 combinations, it is found that a single loss function cannot determine the direction of model training optimization, and it is the result of the joint action of all loss functions.
The research in this paper still has three limitations: (1) the explored range of weight ratios is limited, so it cannot be conclusively determined whether the weight ratios used in this paper are optimal for each model; (2) the real-time bottleneck of the Transformer architecture has not been fundamentally solved; and (3) the method does not transfer directly to cross-modal vision tasks. For tasks such as deblurring and occlusion removal, for instance, Wang et al. innovatively adapted the feature fusion mechanism of dehazing Transformers to cross-modal occluded person re-identification [35], whereas the algorithms studied in this paper are designed specifically for dehazing and perform suboptimally on other vision tasks. Nevertheless, loss function optimization remains a viable direction for enhancing dehazing performance and is broadly applicable to vision tasks in general.
The research further clarifies the applicable boundaries of the different algorithm paradigms: the ultra-lightweight improved AOD-Net can be applied to real-time-sensitive scenarios such as vehicle-mounted platforms, with an end-to-end processing delay of less than 100 ms, meeting the ISO 26262 functional safety standard [36]; meanwhile, DehazeFormer maintains good SSIM stability under extreme fog concentrations, providing technical support for professional fields such as aerial remote sensing and meteorological monitoring. This differentiated performance offers engineering guidance for algorithm selection in different application scenarios.
Future research will focus on the following: (1) To address the problem of dynamic fog concentration adaptation, a differentiable atmospheric scattering model based on Neural Radiance Field (NeRF) will be constructed. At the same time, the problem that the haze distribution in the current synthetic dataset is too “regular” will be optimized to make the synthetic dataset more in line with real haze weather. (2) Lightweight computing architecture: To solve the deployment bottleneck of Transformer, the FlashAttention-2 optimization strategy will be adopted to migrate DehazeFormer knowledge to the CNN architecture.