4.1. Datasets
We evaluate our proposed DFFNet on three publicly available haze remote sensing datasets: StateHaze1k [52], RICE [53], and RRSHID [54]. The image resolution of StateHaze1k and RICE is , whereas that of RRSHID is .
StateHaze1k contains three subsets with different levels of haze density, namely haze1k_thin, haze1k_moderate, and haze1k_thick. Each subset includes 400 pairs of synthesized RGB remote sensing images, of which 320 pairs are used for training, 45 for testing, and 35 for validation.
In the haze1k_thick subset, most images are dominated by dense haze with relatively concentrated distribution. The haze1k_thin subset contains images with light haze, where some regions are nearly haze-free. In contrast, the haze1k_moderate subset exhibits the most complex haze distribution, featuring a mix of thin haze, dense haze, and clear areas, thus demonstrating significant non-uniformity.
RICE is a real-world dataset created by Google Earth for the task of remote sensing image dehazing. It is divided into two subsets: RICE1 and RICE2.
In RICE1, most images have evenly distributed haze, with only about 10 images showing uneven haze patterns. To ensure that both splits cover this characteristic, we allocate 5 uneven-haze images to each of the training and testing sets. In total, the training set consists of 402 image pairs, and the testing set includes 98 pairs.
RICE2 contains images of coastal and inland scenes obscured by real clouds. After manual selection, 24 representative images are designated for the testing set. In total, this subset contains 590 images for training and 146 for testing.
RRSHID is a large-scale real-world dataset featuring paired hazy and haze-free remote sensing images. Unlike synthetic datasets that rely on simplified atmospheric models, RRSHID captures authentic atmospheric phenomena, such as heterogeneous haze densities and spatial distributions within individual images, intricate interactions between haze and diverse land cover types, and color deviations caused by variations in natural lighting and atmospheric composition. This makes it well-suited for validating the robustness of models in real-world scenarios. It is stratified by haze density into three subsets: RRSHID_thin, RRSHID_moderate, and RRSHID_thick.
The RRSHID_thin subset includes 763 image pairs, with 610 for training, 76 for testing, and 77 for validation. The RRSHID_moderate subset, the largest among the three, contains 1526 pairs, allocated as 1220 for training, 152 for testing, and 154 for validation. The RRSHID_thick subset has 764 pairs, with 611 for training, 76 for testing, and 77 for validation.
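As a quick consistency check, the split sizes quoted above can be collected in one table and summed. The sketch below only restates numbers given in this section; the dictionary name `SPLITS` and the helper functions are illustrative and are not part of any released data loader.

```python
# Split sizes (image pairs) as reported in the text above.
# The names SPLITS, total_pairs and train_fraction are illustrative only.
SPLITS = {
    "haze1k_thin":     {"train": 320,  "test": 45,  "val": 35},
    "haze1k_moderate": {"train": 320,  "test": 45,  "val": 35},
    "haze1k_thick":    {"train": 320,  "test": 45,  "val": 35},
    "RICE1":           {"train": 402,  "test": 98},
    "RICE2":           {"train": 590,  "test": 146},
    "RRSHID_thin":     {"train": 610,  "test": 76,  "val": 77},
    "RRSHID_moderate": {"train": 1220, "test": 152, "val": 154},
    "RRSHID_thick":    {"train": 611,  "test": 76,  "val": 77},
}


def total_pairs(name):
    """Total number of image pairs in a subset, summed over its splits."""
    return sum(SPLITS[name].values())


def train_fraction(name):
    """Fraction of a subset's pairs that is used for training."""
    return SPLITS[name]["train"] / total_pairs(name)


if __name__ == "__main__":
    for name in SPLITS:
        print(f"{name:16s} total={total_pairs(name):5d} "
              f"train={train_fraction(name):.3f}")
```

The per-subset totals reproduce the figures stated for RRSHID (763, 1526, and 764 pairs), and every subset uses roughly 80% of its pairs for training.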
  4.2. Evaluation Metrics
We evaluate the quality of restored images using three standard metrics that are widely used in prior work (e.g., [37,55,56]): Peak Signal-to-Noise Ratio (PSNR), Structural Similarity Index (SSIM), and Learned Perceptual Image Patch Similarity (LPIPS). These metrics collectively assess the restoration performance from pixel-level, structural, and perceptual perspectives.
PSNR quantifies the pixel-wise error between the restored and reference images. Higher PSNR values indicate lower distortion and better fidelity. SSIM measures perceptual similarity by comparing luminance, contrast, and structural information; values closer to 1 denote higher structural consistency. LPIPS evaluates perceptual similarity based on deep features extracted from a pretrained neural network, capturing high-level semantic differences. Lower LPIPS scores correspond to better perceptual alignment.
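Together, these metrics provide a comprehensive evaluation of both global semantics and local detail preservation after dehazing. The formulas for each metric are as follows:
\[
\mathrm{PSNR}(x, y) = 10 \log_{10} \left( \frac{\mathrm{MAX}^2}{\frac{1}{HW} \sum_{i=1}^{H} \sum_{j=1}^{W} \left( x(i,j) - y(i,j) \right)^2} \right)
\]
where $\mathrm{MAX}$ is the maximum pixel value in the image, $\log_{10}$ is the base-10 logarithm, $x$ is the dehazed image produced by the model, $y$ is the ground truth, and $H$ and $W$ are the height and width of the image, respectively.
\[
\mathrm{SSIM}(x, y) = \frac{(2 \mu_x \mu_y + C_1)(2 \sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)}
\]
where $\mu$ denotes the mean, $\sigma^2$ is the variance, $\sigma_{xy}$ is the covariance of $x$ and $y$, and $C_1$ and $C_2$ are constants used to prevent division by zero.
\[
\mathrm{LPIPS}(x, y) = \sum_{l} w_l \left\| \phi_l(x) - \phi_l(y) \right\|_2^2
\]
where $\phi_l(x)$ and $\phi_l(y)$ represent the feature maps of images $x$ and $y$ extracted at the $l$-th layer of a pretrained network, and $w_l$ is the weight coefficient of that layer. In this paper, AlexNet is used as the backbone network.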
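To make the first two metrics concrete, the following NumPy sketch computes PSNR and a simplified SSIM for a pair of images. It is an illustration under stated assumptions, not the evaluation code used in this paper: images are assumed to be floating-point arrays scaled to [0, 1], the SSIM here is computed once over the whole image (practical implementations usually average it over local windows), and the constants follow the common convention $C_1 = (0.01\,\mathrm{MAX})^2$, $C_2 = (0.03\,\mathrm{MAX})^2$. LPIPS is omitted because it needs a pretrained network.

```python
import numpy as np


def psnr(x, y, max_val=1.0):
    """Peak signal-to-noise ratio in dB between a restored image x and reference y."""
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    mse = np.mean((x - y) ** 2)
    if mse == 0:
        return float("inf")  # identical images: zero error
    return float(10.0 * np.log10(max_val ** 2 / mse))


def global_ssim(x, y, max_val=1.0, k1=0.01, k2=0.03):
    """Single-window SSIM over the whole image (a simplification of windowed SSIM)."""
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    c1 = (k1 * max_val) ** 2  # stabilizes the luminance term
    c2 = (k2 * max_val) ** 2  # stabilizes the contrast/structure term
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = np.mean((x - mu_x) * (y - mu_y))
    num = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return float(num / den)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    reference = rng.random((64, 64))
    restored = np.clip(reference + 0.05 * rng.standard_normal((64, 64)), 0.0, 1.0)
    print("PSNR (dB):", psnr(restored, reference))
    print("SSIM     :", global_ssim(restored, reference))
```

Higher PSNR and an SSIM closer to 1 both indicate a restoration closer to the reference, in line with the descriptions above.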
  4.3. Implementation Details
We implement and train DFFNet using the PyTorch framework with an NVIDIA RTX 4090 GPU (24 GB). During training, RGB remote sensing images are randomly cropped to a size of , with a batch size of 4. During validation, evaluation is conducted on full-resolution images without cropping. On each subset, DFFNet is trained for 500 epochs, with the total training time being approximately 5–7 h. This time includes per-epoch validation on the training set, which can vary across datasets depending on their size.
In DFFNet, the number of Spa-Fre Blocks in each stage is set to , and the corresponding embedding channel dimensions are . We use the Adam optimizer with an initial learning rate of , , , , and without weight decay. The learning rate is adjusted using a cosine annealing schedule with  (i.e., the total number of training epochs) and a minimum learning rate .
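The cosine annealing schedule mentioned above can be sketched in closed form as follows. This is a minimal illustration of the standard schedule, not the training script of this paper: only the period of 500 epochs comes from the text, while `LR_INIT` and `LR_MIN` are normalized placeholder values chosen so that the output can be read as a fraction of the initial learning rate.

```python
import math


def cosine_annealing_lr(epoch, lr_init, lr_min, t_max):
    """Learning rate at a given epoch under cosine annealing from lr_init to lr_min."""
    cosine = math.cos(math.pi * epoch / t_max)
    return lr_min + 0.5 * (lr_init - lr_min) * (1.0 + cosine)


T_MAX = 500    # total number of training epochs, as stated in the text
LR_INIT = 1.0  # placeholder: normalized initial rate, NOT the value used in the paper
LR_MIN = 0.0   # placeholder: normalized minimum rate, NOT the value used in the paper

if __name__ == "__main__":
    for epoch in (0, 125, 250, 375, 500):
        rate = cosine_annealing_lr(epoch, LR_INIT, LR_MIN, T_MAX)
        print(f"epoch {epoch:3d}: {rate:.4f} x initial rate")
```

The schedule decays slowly at the start and end of training and fastest around the midpoint, where the rate is halfway between its initial and minimum values.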
  4.4. Experimental Results and Analysis
We compare DFFNet with classical dehazing methods and several recent high-impact approaches. These include the prior-based DCP [8]; the CNN-based AOD-Net [13], FFA-Net [40], and DCMPNet [57]; the Transformer-based DehazeFormer [18] and SFAN [23]; and the diffusion model-based RSHazeDiff [58]. It is worth noting that SFAN also incorporates frequency domain processing, which aligns it conceptually with our approach and makes the comparison more relevant. To ensure a fair comparison, we adopt the same batch size and input resolution as used in DFFNet, while keeping all other settings consistent with the original implementations of the respective methods. The experimental results are presented as follows.
(1) Results on StateHaze1k: Table 1 reports the quantitative results of the different methods on the StateHaze1k dataset. Our method achieves the best or second-best performance across all haze levels. Notably, it ranks first in both PSNR and LPIPS under light and medium haze and remains competitive under heavy haze, reflecting both its perceptual quality and its robustness to varying atmospheric degradation. The “Average” value reported in the table is the arithmetic mean across all haze-level subsets, consistent with standard practice in related work.
The qualitative visual results (see Figure 3) further support these conclusions. In light haze images, DCP exhibits noticeable color oversaturation, while AOD-Net suffers from color distortion and large areas of residual haze. Although FFA-Net, SFAN, and RSHazeDiff perform relatively well in general, there are still slight haze remnants along the left edge of the image. In contrast, both DehazeFormer and our method produce cleaner, more natural-looking images with better detail restoration and perceptual consistency.
In the medium haze scene, the input image contains haze that is unevenly distributed and varies in thickness. Except for our method, most others show varying degrees of residual haze in the denser lower part of the image. DCP, AOD-Net, and FFA-Net show obvious remnants with a bluish tint, while DCMPNet, SFAN, and RSHazeDiff leave only small residual areas but still show whitening artifacts. Our method, however, achieves complete haze removal and maintains an overall tone closer to the ground truth, aligning well with human visual perception.
In heavy haze images, although most methods can remove the majority of haze, some, such as DCMPNet and RSHazeDiff, still show whitening in grassy areas. Our method effectively restores true color in these regions, maintaining consistency with the surrounding context and demonstrating its precise modeling of contextual information in strongly degraded regions.
(2) Results on RICE: As shown in Table 2, DFFNet achieves the best average performance across PSNR, SSIM, and LPIPS on the RICE dataset. It consistently ranks among the top methods on both RICE1 and RICE2, demonstrating strong capabilities in brightness restoration, structural preservation, and perceptual quality.
We also provide qualitative results (Figure 4). For RICE1, we select a lightly hazed image with uneven haze distribution for comparison. Although DCP removes haze globally, it causes severe oversaturation. AOD-Net shows almost no change. While FFA-Net, SFAN, and RSHazeDiff manage to remove the haze, the brightness in the previously hazy areas remains largely unchanged. In contrast, DCMPNet, DehazeFormer, and our DFFNet maintain more uniform brightness across the image, effectively removing haze while preserving natural and consistent colors and contrast.
The RICE2 example depicts a scene with light haze and scattered cloud patches. DCP and AOD-Net mainly increase saturation rather than remove the haze, leaving cloud structures visible, and AOD-Net additionally tints them green. FFA-Net removes the clouds but leaves noticeable green patches in the areas where the clouds were. SFAN and RSHazeDiff show slight improvements but introduce smearing artifacts. By comparison, DCMPNet, DehazeFormer, and our DFFNet not only thoroughly eliminate the haze and clouds but also reasonably restore details in the ground and river regions, delivering a more natural and coherent visual effect.
(3) Results on RRSHID: According to the quantitative results presented in Table 3, DFFNet demonstrates superior dehazing performance across all three haze levels as well as in the overall average. It consistently achieves the highest values in PSNR and SSIM, along with the lowest LPIPS, particularly excelling under thick haze conditions. These results indicate that DFFNet possesses strong generalization ability and robustness across a wide range of challenging haze scenarios.
It is noteworthy that both DFFNet and SFAN incorporate frequency domain processing mechanisms, yet a clear performance gap exists between the two. Although SFAN maintains relatively high SSIM and low LPIPS under moderate and thick haze, its overall PSNR and average performance are inferior. This can be attributed to the more effective spatial–frequency integration in the architecture of DFFNet. RSHazeDiff also performs competitively across multiple metrics, with particularly strong perceptual quality as reflected by its LPIPS score. However, due to its diffusion-based structure, it incurs significantly higher training and inference costs, which may hinder its deployment in practical applications.
The qualitative visual results (Figure 5) demonstrate that our model effectively balances detail restoration and haze removal under all three haze levels. It also shows a relatively accurate recovery of ground object colors, indicating strong visual consistency across varying haze densities.
   4.6. Ablation Study
  4.6.1. Ablation Studies on Model Components
We conduct ablation experiments on the haze1k_moderate subset to evaluate the contributions of the CEU, FRU, and DDFFM modules, and to examine the effect of arranging CEU and FRU in series. All experiments use identical training settings.
As shown in Table 7, replacing DDFFM with vanilla attention, removing either FRU or CEU, or arranging FRU and CEU in series all lead to noticeable drops in PSNR and the other evaluation metrics compared to the complete DFFNet.
Specifically, replacing DDFFM degrades the effectiveness of spatial–frequency feature fusion, as vanilla attention only models global dependencies within each domain, ignoring local interactions. Removing FRU weakens haze suppression. Removing CEU leaves FRU to handle global haze suppression but results in inadequate restoration of local details, owing to the lack of spatial contextual information.
Moreover, serially connecting FRU and CEU within the same layer degrades performance significantly (PSNR drops by 4.61 dB, SSIM by 0.208). This arrangement introduces redundant noise from FRU that cannot be effectively refined by the subsequent CEU.
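To put the 4.61 dB figure in perspective, a PSNR difference can be converted into a ratio of mean squared errors, since PSNR is 10 times the base-10 logarithm of MAX² / MSE. The short calculation below assumes that both variants are evaluated with the same maximum pixel value and treats the reported drop as if it applied to a single image pair; because PSNR values in tables are typically averaged over images, the result should be read as an approximate, geometric-mean style ratio. The function name is illustrative.

```python
def mse_ratio_from_psnr_drop(drop_db):
    """Factor by which MSE grows when PSNR falls by drop_db (same MAX assumed)."""
    return 10.0 ** (drop_db / 10.0)


if __name__ == "__main__":
    ratio = mse_ratio_from_psnr_drop(4.61)  # PSNR drop reported for the serial variant
    print(f"MSE grows by a factor of about {ratio:.2f}")
    print(f"RMSE grows by a factor of about {ratio ** 0.5:.2f}")
```

Under these assumptions, the serial arrangement corresponds to roughly 2.9 times the mean squared error of the complete DFFNet, or about 1.7 times the root mean squared error.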
Figure 6 provides qualitative evidence that aligns with the analytical insights derived from the ablation results. Compared to variants (b–e), image (f) exhibits the least residual haze in the bottom-right corner and the most visually consistent tone with the surrounding areas, highlighting the superior restoration quality achieved by the complete DFFNet.
   4.6.2. Ablation on Learnable Fusion Weights
To evaluate the impact of the two learnable fusion weights in the proposed DDFFM module on model performance, we conducted an ablation study comparing results under different fixed values of these weights. Specifically, we ran controlled experiments under four parameter settings, in which each of the two weights was fixed to either 0.5 or 1. In each setting, all three DDFFM modules in the network used the same pair of values. The results are summarized in Table 8. Additionally, we recorded the learned values of the two weights during standard training, collected from DDFFM modules at different layers, as shown in Table 9.
From the results, we observe that the fixed setting  achieves the best performance, yielding the highest PSNR and SSIM as well as the lowest LPIPS. However, the learned values of  and  vary significantly across different layers, suggesting that the model adapts its fusion strategy based on spatial location or semantic content.
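The fixed-weight ablation grid described in this subsection can be enumerated as below. This sketch reads the four settings as every combination of the two values 0.5 and 1 for the two fusion weights, which is an interpretation of the text rather than a confirmed protocol; the names `FIXED_VALUES`, `fixed_weight_settings`, and `per_module_weights` are hypothetical, and how the weights enter the DDFFM computation is not modeled here.

```python
from itertools import product

FIXED_VALUES = (0.5, 1.0)  # candidate values for each fusion weight (from the text)
NUM_DDFFM = 3              # number of DDFFM modules in the network (from the text)


def fixed_weight_settings(values=FIXED_VALUES):
    """All (first_weight, second_weight) pairs tested in the fixed-weight ablation."""
    return list(product(values, repeat=2))


def per_module_weights(setting, num_modules=NUM_DDFFM):
    """Every DDFFM module shares the same fixed pair within one setting."""
    return [setting] * num_modules


if __name__ == "__main__":
    for setting in fixed_weight_settings():
        print(setting, "->", per_module_weights(setting))
```

Sharing one pair across all three modules is what separates the fixed settings from standard training, where each module learns its own values.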
  4.6.3. Component Effectiveness Analysis
In this subsection, a representative remote sensing image characterized by non-uniform haze distribution is selected as an input example, where the bottom-left region is heavily obscured by dense haze, while the remaining areas are relatively clear or haze-free. Figure 7 illustrates how the input is progressively processed across different layers of the network. Frequency domain and spatial domain feature maps are highlighted with orange and yellow borders, respectively. Layers 3 to 5 illustrate the feature representations before and after DDFFM fusion. From left to right, the columns correspond to input features from the frequency and spatial branches, concatenated dual-domain features, fused outputs after DDFFM processing, and separately recovered frequency domain and spatial domain outputs obtained via channel-wise decomposition.
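The concatenate, fuse, and decompose data flow described above can be sketched with plain arrays. This is only an illustration of the tensor bookkeeping, under several assumptions that are not taken from the paper: features are laid out as (channels, height, width), the frequency channels are placed first, and the fusion step is an identity placeholder standing in for DDFFM, whose internals are not reproduced here.

```python
import numpy as np


def concat_dual_domain(freq_feat, spa_feat):
    """Stack frequency and spatial features along the channel axis (axis 0)."""
    return np.concatenate([freq_feat, spa_feat], axis=0)


def placeholder_fusion(features):
    """Identity stand-in for DDFFM; the real module would mix the two domains."""
    return features


def split_dual_domain(fused, freq_channels):
    """Channel-wise decomposition back into frequency and spatial parts."""
    return fused[:freq_channels], fused[freq_channels:]


if __name__ == "__main__":
    freq = np.zeros((4, 8, 8))  # hypothetical frequency-branch features
    spa = np.ones((4, 8, 8))    # hypothetical spatial-branch features
    fused = placeholder_fusion(concat_dual_domain(freq, spa))
    freq_out, spa_out = split_dual_domain(fused, freq.shape[0])
    print(fused.shape, freq_out.shape, spa_out.shape)
```

Because the decomposition is a plain channel split, each branch gets back a tensor of its original shape, and any cross-domain exchange has to happen inside the fusion step.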
In the early feature extraction stages (Layer1–Layer2), the spatial and frequency branches process the input image independently. The spatial domain features preserve texture and structural details with well-defined ground object contours. In contrast, the frequency domain features contain less fine detail but begin to reflect the global distribution of haze across the image. This stage clearly highlights the representational differences between the two branches.
By Layer3, before DDFFM fusion, the frequency domain features begin to exhibit green-colored activations corresponding to hazy regions, indicating that the network has started responding to haze-specific signals. Meanwhile, spatial features remain focused on key structural regions such as roads, retaining stronger local details.
In Layer4, before DDFFM fusion, frequency domain features appear blurred and lack structural sharpness, whereas spatial features maintain rich contextual information due to residual connections from earlier layers (e.g., Layer2). Following fusion via the DDFFM module in Layer4, both branches show evidence of complementary enhancement. Although details in the densest haze regions remain partially suppressed, the overall structural coherence improves.
In Layer5, prior to DDFFM fusion, the frequency domain branch exhibits homogeneous green blobs over the dense haze regions, indicating a strong activation response. Conversely, the spatial branch features are more textured and edge-enhanced, which is likely due to skip connections that incorporate earlier feature representations. After the final fusion in Layer5, the outputs from both branches become more consistent in color and structure, with significant detail restoration in previously obscured haze regions.
This progression demonstrates the complementary nature of spatial and frequency modeling: while the frequency branch excels at capturing and suppressing global haze artifacts, it is less effective at fine-grained detail recovery. The spatial branch, providing contextual cues, enhances local structural reconstruction.
The proposed DDFFM module enables dynamic feature integration between the two domains, achieving both noise suppression and information enhancement. This mitigates information loss in the frequency branch and reduces interference during fusion, ultimately yielding superior reconstruction quality—particularly in dense haze regions. These observations validate the complementary advantages of dual-domain modeling and highlight the critical role of DDFFM in bridging local detail recovery with global semantic awareness.