PFENet: Physics-Informed Frequency-Enhanced YOLO for Object Detection in Hazy Scenes

Bai, Kun; Zhou, Zhigang; Yang, Jian; Zhang, Wenyue

doi:10.3390/app16104635

School of Information, Shanxi University of Finance and Economics, Taiyuan 030006, China

* Author to whom correspondence should be addressed.

Appl. Sci. 2026, 16(10), 4635; https://doi.org/10.3390/app16104635

Submission received: 31 March 2026 / Revised: 5 May 2026 / Accepted: 5 May 2026 / Published: 8 May 2026


Abstract

Object detection technology has been widely applied in fields such as autonomous driving and security surveillance, where it serves as a vital component of intelligent systems. However, under adverse hazy weather conditions, objects obscured by haze and their edges may lose detailed visual information. Existing object detection methods lack targeted mechanisms to address these challenges, resulting in a marked decline in detection accuracy in complex environments. To this end, this paper proposes an end-to-end robust detection framework, termed the Physics-informed Frequency-Enhanced Network (PFE-Net). We propose a physics-guided visibility enhancement module (PG-VEM) that leverages the atmospheric scattering model and dark channel prior to adaptively compensate for degraded image features under adverse weather conditions, thereby restoring image details and contrast. Meanwhile, the frequency-domain edge perception module (FD-EPM) explicitly enhances the geometric contours of haze-obscured objects through Fourier transform and high-pass filtering, thereby improving the discriminability of edge features. Comprehensive experiments were conducted on both the real-world RTTS dataset and the synthetic VOChaze dataset to validate the effectiveness of the proposed approach. The results indicate that the proposed method achieves significant improvements in accuracy, particularly under complex weather conditions, demonstrating excellent all-weather environmental perception capabilities. This method has important practical engineering value and can substantially enhance the safety of autonomous driving and security surveillance systems under adverse weather conditions.

Keywords: object detection; foggy scenes; YOLOv8; physics-informed; frequency-domain enhancement

1. Introduction

In recent years, object detection has become a major research focus and has been widely applied in autonomous driving, security surveillance, intelligent transportation, and robotics [1]. In autonomous driving systems, object detection plays a critical role in environmental perception [2]. It is necessary to recognize multiple categories of objects in road scenes in real time, including vehicles, pedestrians, traffic signs, traffic lights, and obstacles. The detection results directly affect the safety and reliability of path planning and decision-making control. In security surveillance systems, object detection is widely used for abnormal behavior recognition, personnel tracking, intrusion detection, and public safety warning [3]. For example, in scenarios such as urban surveillance, airports, and railway stations, detection algorithms are required to detect targets robustly under complex backgrounds, low illumination, and long-range conditions. However, in real-world scenarios, adverse weather conditions such as rain and fog often cause significant image degradation, which severely weakens the feature extraction and object recognition capabilities of deep learning models [4]. Fog-induced occlusion leads to reduced image contrast and loss of detail, while objects at different distances exhibit varying degrees of blur. This disrupts feature consistency and further increases the difficulty of feature learning. Meanwhile, object contours become smoother and edge information is gradually lost, which negatively affects convolutional features that rely on structural information, thereby reducing the accuracy of object detection and segmentation. Therefore, object detection under foggy conditions is of great practical importance in the field of computer vision and has attracted extensive attention. Related studies include haze removal for object detection [5,6], object detection in rainy conditions [7], and object detection in snowy conditions [8]. 
This paper mainly focuses on object detection in foggy conditions.

Beyond deep-learning-based feature extraction, there also exist classical preprocessing strategies that can be practically useful under degraded imaging conditions. A representative example is multi-threshold binarization, which has been shown to reduce the scale of training samples while preserving discriminative information in multi-spectral remote sensing images [9]. This direction is complementary to the present work: binarization-based methods reduce input complexity at the pixel-intensity level, whereas PFE-Net learns task-driven feature representations through physics-guided visibility enhancement and frequency-domain edge enhancement.

Early studies typically adopted a two-stage approach that decoupled the task into two steps: first, restoring the clear appearance of degraded images through a dehazing model; and second, performing object detection on the dehazed images for object prediction [10]. For example, Li et al. [5] proposed the BAD-Net two-stage architecture, which consists of a generative adversarial network-based dehazing method and a convolutional neural network-based object detection module. The single-shot detector (SSD) proposed by Bandeira et al. [11] employs a single convolutional neural network for object detection. Although BAD-Net adopts an end-to-end distributed strategy by first applying cross-node dehazing and feature fusion modules (i.e., dual-branch attention fusion), followed by cross-node object detection networks, thereby effectively addressing the issue of missing annotations in real hazy scenes, its dehazing model mainly focuses on visual quality and often introduces artifacts or noise that are detrimental to the detector [12]. In contrast to dehazing-based strategies, directly training on foggy scenes enables end-to-end joint optimization, which improves the robustness and efficiency of object detection under foggy conditions, avoids the accumulation of dehazing errors, and better adapts to complex and non-uniform haze environments [13]. For instance, the CGMDRNet model proposed by Chen et al. [13] incorporates a cross-modal information guidance mechanism within an end-to-end network architecture. Liu et al. [14] proposed a physics-guided method that constrains the learning process by embedding physical laws into the model. Although existing methods have significantly improved object detection performance, fog-induced occlusion still severely impairs the network’s ability to accurately recognize objects. Distant objects are heavily obscured, whereas nearby objects remain relatively clear. 
Such occlusion causes originally sharp object contours to become smoother, posing a major challenge for object detection in foggy conditions.

To address these challenges, this paper proposes a detection model that integrates physics-driven and frequency-domain enhancement techniques to improve object detection performance in foggy environments. First, to enable the network to recognize and discriminate objects in foggy scenes more accurately, we develop PG-VEM, which employs physical prior recalibration to correct distortions or blurring caused by adverse weather conditions such as haze, rain, and snow, thereby restoring visibility. In addition, to produce clearer geometric contours, we design FD-EPM, which sharpens object contours under foggy conditions by leveraging frequency-domain enhancement to capture geometric details and further extract critical edge features. Extensive quantitative and qualitative experiments verify the feasibility of the PG-VEM module and the effectiveness of the FD-EPM module.

The contributions of this paper can be summarized in the following three aspects:

(i) To address the difficulty of accurate object recognition and the blurring of distant objects caused by foggy conditions, this paper proposes a physics-fusion network that enables precise object detection in adverse weather environments such as fog.
(ii) This paper proposes PG-VEM and FD-EPM, which enhance the network’s perception of distant objects through feature recalibration based on physical priors and achieve geometric representation of object contours through frequency-domain enhancement, respectively.
(iii) Extensive quantitative and qualitative experiments are conducted on both synthetic and real-world datasets to verify the effectiveness of the proposed PG-VEM and FD-EPM modules.

2. Related Work

2.1. Object Detection

Mainstream object detection methods can generally be categorized into two-stage [15] and single-stage approaches [16]. Two-stage object detection methods first employ a Region Proposal Network (RPN) to preliminarily generate potential target regions and then perform refined classification and bounding box regression on these candidate proposals to predict object locations and categories [17]. For example, Li et al. [18] proposed Cascade-DETR++, which establishes more robust alignment relationships between multi-scale features and queries through a multi-stage progressive refinement mechanism, thereby improving localization accuracy and classification robustness in complex scenes. Zhang et al. [19] proposed Refine-RCNN (v2), which employs large-model distillation techniques to enhance the ability of the two-stage architecture to detect objects with long-tail distributions, allowing the student model to acquire additional knowledge from the teacher model and thus substantially improving detection accuracy. Although two-stage methods demonstrate notable advantages in detection accuracy, particularly for small object detection, as well as localization robustness, their complex pipeline incurs substantial computational overhead and slow inference speed, thereby limiting their applicability in real-time industrial scenarios. To improve inference speed, single-stage methods omit the candidate region generation step and directly predict object class probabilities and bounding box offsets on the global feature map through an end-to-end regression framework [20]. For example, Wang et al. [21] proposed YOLOv10, which incorporates an NMS-free training strategy into the traditional object detection framework. By optimizing the detection process, the model directly outputs final predictions without requiring the NMS stage for result filtering. 
The Ultralytics team [22] introduced YOLOv11, which integrates self-attention mechanisms, enabling the model to extract and enhance global object features more effectively under extreme conditions and to focus more on critical information regions, thereby improving feature extraction performance. This paper adopts a single-stage object detection method to efficiently denoise features and perform real-time object detection in images captured under hazy conditions, with the aim of addressing the low recognition accuracy and delayed response of traditional algorithms in low-visibility environments.

Recent YOLO-based frameworks have also been extended to specialized industrial inspection scenarios. For example, teacher–student frameworks can leverage large vision models for data pre-annotation and YOLO-based models for instance segmentation, which further demonstrates the flexibility of YOLO-style detectors in deployment-oriented visual perception tasks [23].

Recent transformer-based detectors further broaden the design space of object detection. DETR formulates object detection as a direct set-prediction problem and removes many hand-crafted post-processing components [24], while RT-DETR improves the efficiency of DETR-style detection and provides real-time performance competitive with YOLO-style detectors [25]. In addition, YOLOv10 introduces an end-to-end real-time detection pipeline with improved speed–accuracy trade-offs [26]. Although these recent detectors are not designed specifically for haze degradation, they provide strong state-of-the-art baselines for evaluating whether the proposed physics-informed and frequency-enhanced design remains effective beyond the original YOLOv8 backbone.

2.2. Image Dehazing

Image dehazing refers to the process of removing the effects of light scattering and absorption caused by atmospheric phenomena such as haze, thereby restoring the true colors and details of an image and consequently enhancing its clarity, contrast, and visibility [27]. Conventional prior-based methods estimate transmission and global atmospheric light in the atmospheric scattering model by extracting physical constraints, such as the dark channel prior, from the statistical properties of a large number of natural images, and then reconstruct the haze-free image [28]. For example, the classical dark channel prior (DCP) proposed by He et al. [29] can infer the influence of haze and restore true image details through local information, without requiring additional assumptions beyond the atmospheric scattering model. Yan et al. [30] proposed Adaptive-DCP++, which alleviates the halo artifacts that frequently occur in sky regions in conventional prior-based methods by introducing local contrast adaptive correction. Although these methods exhibit strong physical interpretability and do not require training data, the predefined priors often fail in complex and dynamic real-world scenes, resulting in dehazing outputs with color distortion or residual haze [31]. To improve robustness, deep learning-based end-to-end methods exploit the powerful feature extraction capabilities of convolutional neural networks (CNNs) or Vision Transformers (ViTs) to directly learn the nonlinear mapping from hazy images to clear images [32]. For instance, Song et al. [33] proposed DehazeFormer-V2, which leverages its strong global information modeling capability to better capture long-range spatial dependencies in images, thereby restoring subtle textures and detailed visual features. Wang et al. [34] proposed Diff-Dehaze, which incorporates a diffusion model widely used in image generation and demonstrates excellent ability to produce highly detailed and realistic image features, especially in recovering textures and structural components. In this paper, we propose a preprocessing approach for foggy images that combines atmospheric physical modeling with frequency-domain analysis. Specifically, by introducing the atmospheric scattering equation and employing the fast Fourier transform (FFT) to selectively suppress low-frequency fog noise in the frequency domain, the proposed method enables the object detection network to more accurately perceive geometric contours and texture features obscured by fog during feature abstraction, thereby enhancing robustness under adverse weather conditions.

2.3. Object Detection in Adverse Weather

In practical applications, object detection is often required to operate in adverse weather environments, including rainy and foggy conditions [35,36]. For example, autonomous driving in rainy and foggy weather requires accurate detection of vehicles and pedestrians to avoid collisions, while surveillance systems operating under such conditions rely on advanced image dehazing techniques to improve monitoring accuracy. Object detection methods for nighttime scenes enhance image contrast under extremely low-light conditions through low-light enhancement techniques or multimodal feature fusion, such as infrared and visible-light fusion, while also mitigating sensor noise caused by high gain [37]. For example, Hou et al. [38] proposed NightVision-Net, which significantly improves detection accuracy in nighttime traffic scenarios by employing a dynamic brightness adaptation module without introducing additional noise. Deep-Dark-YOLO, proposed by Liu et al. [39], utilizes generative adversarial networks (GANs) to perform feature-domain transfer from nighttime to daytime. Object detection methods for rainy scenes reconstruct the geometric contours and texture details of occluded objects by removing interference from raindrops and rain streaks and restoring spatial textures [40]. For instance, RS-DiffDet, proposed by Lu et al. [41], leverages the superior deraining capability of diffusion models to provide the detector with a refined latent feature space. Ye et al. [42] investigated DeepNOMA, which achieves end-to-end collaborative optimization of deraining preprocessing and object detection by decoupling rain-streak components from target features. Although these methods perform well in specific scenarios, they still face considerable challenges in handling global degradation caused by atmospheric scattering in foggy weather. 
Therefore, this paper systematically investigates several key challenges in object detection under foggy conditions, including severe feature blurring caused by atmospheric scattering, missed detections of small objects due to globally reduced contrast, and the high incidence of false positives in dense fog regions.

Another relevant line of work is realistic neural weather synthesis. Recent methods such as WeatherWeaver [43] and ClimateNeRF [44] can synthesize more visually realistic and controllable adverse-weather imagery than classical scattering-only simulation. These methods are valuable for improving the realism of training data and for studying synthetic-to-real generalization. In this paper, we retain the classical atmospheric scattering model for VOCh generation because it provides transparent physical control over haze density and is directly aligned with the physics-guided branch of PFE-Net; nevertheless, integrating neural weather synthesis with PFE-Net is an important future direction.

3. Method

In real-world dynamic scenarios, object detection models often face significant challenges caused by environmental degradation factors. In particular, under rainy and foggy weather conditions, atmospheric scattering leads to a substantial reduction in image contrast and is accompanied by severe blurring of geometric structures. This physical degradation not only diminishes the visual quality of images but also poses considerable challenges to convolutional neural network-based detectors. Traditional approaches are often limited by the inherent constraints of spatial-domain feature extraction, making it difficult to distinguish discriminative target features from blurred backgrounds. To fundamentally address this issue, this study proposes an integrated end-to-end robust detection framework that performs feature-level dynamic calibration within the network through physics-informed learning combined with frequency-domain feature enhancement, thereby maintaining high-accuracy recognition performance in adverse environments.

3.1. Overall Architecture

In foggy conditions, fog-induced occlusion significantly impairs the network’s ability to accurately recognize and interpret objects. The contours of distant objects are heavily obscured, whereas those of nearby objects remain relatively clear. As occlusion increases, originally sharp edges gradually become blurred and exhibit smoother characteristics, making object detection particularly challenging under such conditions.

To address this issue, we propose the overall architecture shown in Figure 1, termed the Physics-informed Frequency-Enhanced Network (PFE-Net). The framework adopts YOLOv8 as the backbone detector and mitigates rain and haze interference by integrating specially designed enhancement modules into its key feature propagation stages [45]. Specifically, the physics-guided visibility enhancement module (PG-VEM) is incorporated into the shallow layers of the backbone network, where it leverages the physical principles of image formation to perform energy compensation on early-stage features. Subsequently, after the feature fusion network responsible for multi-scale feature interaction, the frequency-domain edge perception module (FD-EPM) is introduced to explicitly restore object contour details through spectral operations.

3.2. Physics-Guided Visibility Enhancement Module (PG-VEM)

In atmospheric physics, the observed hazy image is strictly governed by the transmittance law. According to the classical atmospheric scattering model, the hazy image $I(x)$ can be expressed as a linear combination of the radiance of a clear scene $J(x)$ and the global atmospheric light $A$, i.e.,

$$I(x) = J(x)\,t(x) + A\,(1 - t(x)), \tag{1}$$

where $t(x)$ denotes the transmittance, which characterizes the proportion of the light signal retained after traversing the medium [46]. Equation (1) explicitly contains both a multiplicative term $J(x)\,t(x)$, where the medium transmission exponentially attenuates the scene radiance as a function of depth, and an additive term $A\,(1 - t(x))$, which represents the airlight component introduced by atmospheric scattering. Thus, haze affects the observed image through both attenuation and additive veiling light, rather than through a purely additive degradation process. In the feature space, locally low transmittance directly leads to abnormal attenuation of feature response intensity, substantially reducing the saliency of target regions. If the detection network performs blind search solely in the spatial domain, it often fails to capture effective semantic cues in low-contrast regions. Motivated by this observation, we argue that the physical degradation process should be explicitly modeled, and that accurate energy compensation should be applied to impaired features through estimation of the pixel-level transmittance distribution.
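
As a rough illustration of Equation (1), the following NumPy sketch synthesizes a hazy image from a clear one and then inverts the model when $t(x)$ and $A$ are known. It is not taken from the paper's code: the function names, the toy depth map, the scattering coefficient `beta`, and the relation $t(x) = e^{-\beta d(x)}$ (the usual way depth enters the scattering model) are assumptions made here for illustration.

```python
import numpy as np

def add_haze(J, depth, beta=1.0, A=0.9):
    """Apply Eq. (1): I = J * t + A * (1 - t).

    Assumption: t = exp(-beta * depth); beta and depth are illustrative.
    """
    t = np.exp(-beta * depth)[..., None]      # (H, W, 1), broadcast over RGB
    I = J * t + A * (1.0 - t)                 # attenuation term + airlight term
    return I, t

def remove_haze(I, t, A=0.9, t_min=0.1):
    """Invert Eq. (1) for J when t and A are known.

    t is floored at t_min (a hypothetical safeguard) to avoid dividing by ~0.
    """
    t = np.maximum(t, t_min)
    return (I - A) / t + A

rng = np.random.default_rng(0)
J = rng.uniform(0.0, 1.0, size=(4, 6, 3))             # toy clear image in [0, 1]
depth = np.tile(np.linspace(0.2, 2.0, 6), (4, 1))     # depth grows left to right
I, t = add_haze(J, depth)
J_rec = remove_haze(I, t)

# Contrast (std) in the far column is lower than in the near column.
near_contrast, far_contrast = I[:, 0].std(), I[:, -1].std()
```

In this toy setting the far column keeps only a small fraction of the original contrast, which mirrors the point made above: distant regions lose signal through attenuation and are simultaneously washed out by the additive airlight.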

Guided by the preceding physical rationale, we propose a Physics-Guided Visibility Enhancement Module (PG-VEM), as shown in Figure 2, which facilitates visibility restoration by constructing a feature recalibration branch based on physical priors. Let $X \in \mathbb{R}^{H \times W \times C}$ denote the input feature map of this module, where $H$ and $W$ represent the height and width of the feature map, respectively, and $C$ denotes the number of channels. To accurately capture the spatial distribution of haze, we first employ a local minimum pooling operation to simulate the extraction process of the dark channel prior, thereby obtaining dark channel features $X_{dark} \in \mathbb{R}^{H \times W \times C}$. These features exploit the physical characteristic that dark channel values tend to be higher in dense haze regions, thereby preliminarily identifying severely degraded pixels [47]. Subsequently, the original features are concatenated with the dark channel features along the channel dimension and then fed into a lightweight estimation subnetwork. This subnetwork employs a Sigmoid activation function $\sigma$ to generate a pixel-level pseudo-transmittance map with the same spatial dimensions as the input features, denoted by $t_{map} \in [0, 1]$, which can be formulated as

$$t_{map} = \sigma(F_{est}([X, X_{dark}])), \tag{2}$$

where $F_{est}$ denotes the feature extraction function incorporating dilated convolution, which is designed to capture a broader regional context of haze distribution [48]. After obtaining the pseudo-transmittance map, we further define a physics-informed enhancement weight to emulate signal amplification at the physical layer. The core idea of this weight is to assign a larger compensation coefficient to regions with lower transmittance. Its computation is given by

$$W = 1 + \tanh(\alpha) \cdot 2.0 \cdot (1 - t_{map}), \tag{3}$$

where $\alpha$ is a learnable channel-level gain vector designed to adaptively modulate the enhancement intensity for different semantic channels. In implementation, the gain is predicted from the input feature descriptor rather than treated as a manually fixed constant. Specifically, global average pooling first summarizes the input feature tensor, and a lightweight prediction branch maps this descriptor to a bounded gain range. This design allows PG-VEM to apply weak correction for lightly degraded inputs and stronger compensation for dense-haze regions. The coefficient 2.0 in Equation (3) is used as the initial enhancement center; the sensitivity analysis in Section 4.6 shows that the model is stable under a broad range of initial values. Finally, the calibrated feature output is obtained through element-wise multiplication, expressed as

$$X_{out} = \mathrm{Clamp}(X \odot W,\ \min = 1.0,\ \max = 3.0), \tag{4}$$

where $\odot$ denotes element-wise multiplication, and the $\mathrm{Clamp}$ operation effectively constrains the over-amplification of background noise by enforcing a maximum response ratio. The nonlinear gain behavior introduced by the learnable parameter $\alpha$ is illustrated in Figure 3.
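
The data flow of Equations (2)–(4) can be sketched in a few lines of NumPy. This is a forward-only illustration under several assumptions that are not from the paper: `local_min_pool`, `pg_vem`, and `W_est` are hypothetical names; $F_{est}$ is replaced by a single linear projection instead of the dilated-convolution subnetwork; $\alpha$ is passed in as a fixed vector rather than predicted from a pooled descriptor; and the clamp bounds 1.0 and 3.0 are applied to the gain $W$, reading the "maximum response ratio" as a bound on the weight, which may differ from the authors' implementation.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def local_min_pool(X, k=3):
    """Per-channel k x k minimum pooling with edge padding (dark-channel stand-in)."""
    p = k // 2
    Xp = np.pad(X, ((p, p), (p, p), (0, 0)), mode="edge")
    win = sliding_window_view(Xp, (k, k), axis=(0, 1))    # (H, W, C, k, k)
    return win.min(axis=(-2, -1))                         # (H, W, C)

def pg_vem(X, W_est, alpha, k=3):
    """Forward sketch of PG-VEM for X of shape (H, W, C)."""
    X_dark = local_min_pool(X, k)                         # dark-channel features
    feats = np.concatenate([X, X_dark], axis=-1)          # [X, X_dark], (H, W, 2C)
    t_map = sigmoid(feats @ W_est)                        # Eq. (2), (H, W, 1)
    gain = 1.0 + np.tanh(alpha) * 2.0 * (1.0 - t_map)     # Eq. (3), (H, W, C)
    gain = np.clip(gain, 1.0, 3.0)                        # assumption: clamp the gain
    return X * gain, t_map, gain                          # element-wise recalibration

rng = np.random.default_rng(0)
H, W, C = 8, 8, 4
X = rng.normal(size=(H, W, C))
W_est = rng.normal(scale=0.1, size=(2 * C, 1))            # stand-in for F_est
alpha = np.full(C, 0.5)                                   # hypothetical gain vector
X_out, t_map, gain = pg_vem(X, W_est, alpha)
```

With $\tanh(\alpha) \in [0, 1]$, the gain in Equation (3) already lies in $[1, 3]$, and it grows as the pseudo-transmittance falls, so positions estimated to be hazier receive stronger amplification.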

3.3. Frequency Domain Edge Perception Module (FD-EPM)

Beyond physics-based energy restoration, the model faces another critical challenge, namely the blurring of geometric structures. Rain and fog act as natural low-pass filters, causing the high-frequency components associated with object edges and textures to be obscured by the smooth low-frequency background. Due to their inherently limited local receptive fields, traditional convolution operators are often ineffective in handling such globally degraded edge information [49]. The motivation for citing convolution-operator theory is that a fixed spatial convolution kernel can only represent a restricted family of transformations; consequently, relying solely on local convolution may be insufficient for recovering globally degraded structures such as haze-veiled edges and large-scale frequency attenuation. Previous studies have shown that the Fourier transform can decouple spatial-domain information into distinct frequency components, thereby enabling the explicit separation of haze interference and the enhancement of object boundaries in the frequency domain [50]. The Cooley–Tukey FFT is cited here not only as an efficient computational tool but also as the practical basis that makes spectral decomposition feasible inside an end-to-end detector. To address this issue, we introduce the Frequency-Domain Edge Perception Module (FD-EPM), as illustrated in Figure 4, which is designed to capture and incorporate obscured geometric details in the spectral domain.

For precise edge information extraction, FD-EPM first applies the Fast Fourier Transform (FFT) to map the input spatial features $X$ into the frequency domain, yielding a frequency-domain representation $F(u, v)$, where $(u, v)$ denotes the frequency-domain coordinates. The complex spectral output is further decomposed into an amplitude spectrum $A(u, v)$, which encodes contrast information, and a phase spectrum $P(u, v)$, which captures geometric structural information, as formulated by

$$F(u, v) = A(u, v) \cdot e^{j P(u, v)}, \tag{5}$$

where $j$ denotes the imaginary unit [50]. To decouple edge signals from the cluttered haze spectrum, we construct a soft-mask high-pass filter $M$ with learnable parameters. This mask performs frequency selection by computing the normalized distance of each frequency point from the spectral center $(u_{0}, v_{0})$, denoted as $\mathrm{dist}(u, v)$, such that

$$M(u, v) = \sigma((\mathrm{dist}(u, v) - \mathrm{cutoff}) \cdot s), \tag{6}$$

where $\mathrm{cutoff}$ represents a dynamically learned cutoff-frequency threshold, and $s$ denotes a scaling parameter that controls the steepness of the filter so as to mitigate ringing artifacts induced by a hard cutoff. The cutoff threshold is learned on a per-sample basis. Given an input tensor $X \in \mathbb{R}^{B \times C \times H \times W}$, global pooling first produces a compact feature descriptor, and a lightweight predictor maps it to a scalar cutoff for each sample. The resulting two-dimensional radial mask $M(u, v) \in \mathbb{R}^{H \times W}$ is broadcast along the channel dimension and applied to the complex spectrum of each channel. Therefore, all channels of the same sample share the same frequency-selection geometry, while different samples may receive different masks according to their haze severity. After completing component filtering in the frequency domain, we employ the inverse Fast Fourier Transform (iFFT) to restore the processed frequency features to the spatial domain, thereby obtaining edge-enhanced features $X_{edge}$. To enable deep interaction between these edge features and the original semantic features, we design a high-dimensional feature mixing block that first expands the channel dimension to facilitate richer feature recombination and then projects it back to the original dimension. Finally, we introduce a residual connection to inject the enhanced boundary information into the main pathway, together with a learnable scaling factor $\gamma$ initialized to a near-zero value. The modulation process is defined as

$$X_{final} = X_{in} + H(X_{edge}) \cdot \gamma, \tag{7}$$

where $X_{in}$ is the module input and $H(\cdot)$ denotes the high-dimensional mixing function. This frequency-domain adaptive perception strategy encourages the network to reveal the true contours of targets by suppressing haze interference during training, thereby significantly improving the detector’s localization consistency on blurred boundaries. Representative learned frequency masks and cutoff distributions are shown in Figure 5.
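
To make the spectral path of Equations (5)–(7) concrete, the sketch below applies a radial sigmoid high-pass mask to a single-sample feature map with NumPy. It is an illustration under stated assumptions rather than the authors' implementation: `radial_highpass_mask` and `fd_epm` are hypothetical names; `cutoff`, `s`, and `gamma` are fixed constants here, whereas the paper predicts the cutoff per sample and learns $\gamma$; the distance is normalized so that the spectrum corner has distance 1; and the mixing function $H(\cdot)$ is replaced by the identity.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def radial_highpass_mask(H, W, cutoff, s):
    """Eq. (6) on a centred (fftshift-ed) spectrum; corner distance is 1 (assumption)."""
    v = np.arange(H) - H // 2
    u = np.arange(W) - W // 2
    vv, uu = np.meshgrid(v, u, indexing="ij")
    dist = np.sqrt((vv / (H / 2)) ** 2 + (uu / (W / 2)) ** 2) / np.sqrt(2)
    return sigmoid((dist - cutoff) * s)

def fd_epm(X, cutoff=0.2, s=20.0, gamma=1e-3):
    """X: real features of shape (C, H, W). Returns (X_final, X_edge, M)."""
    C, H, W = X.shape
    F = np.fft.fftshift(np.fft.fft2(X, axes=(-2, -1)), axes=(-2, -1))
    M = radial_highpass_mask(H, W, cutoff, s)             # shared by all channels
    F_hp = np.fft.ifftshift(F * M, axes=(-2, -1))         # filter, undo the shift
    X_edge = np.fft.ifft2(F_hp, axes=(-2, -1)).real       # back to spatial domain
    return X + gamma * X_edge, X_edge, M                  # Eq. (7) with H = identity

H, W = 32, 32
step = np.zeros((1, H, W))
step[:, :, W // 2:] = 1.0                                 # vertical step edge
out, X_edge, M = fd_epm(step)
edge_resp = np.abs(X_edge[0, :, W // 2]).mean()           # response at the edge
flat_resp = np.abs(X_edge[0, :, W // 4]).mean()           # response on a flat area
```

On this toy step image the filtered response is concentrated at the edge column and nearly vanishes on the flat region, which is the behavior the soft high-pass mask is meant to provide. Multiplying the complex spectrum by a real mask rescales the amplitude $A(u, v)$ while leaving the phase $P(u, v)$ unchanged.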

3.4. Detection-Driven End-to-End Optimization Strategy

To ensure that the above enhancement mechanisms remain consistent with the ultimate recognition objective, this framework discards complex auxiliary loss constraints and instead adopts an end-to-end optimization strategy driven solely by the native YOLO detection loss [45]. The rationale behind this design is that, under a single-objective optimization framework, the Physics-Guided Visibility Enhancement Module (PG-VEM) and the Frequency-Domain Edge Perception Module (FD-EPM) can autonomously learn feature representations that most effectively reduce detection errors during backpropagation. This mechanism avoids the negative impact of pixel-level reconstruction objectives on semantic feature extraction, which is common in conventional dehazing tasks. Specifically, the model optimizes all parameters by minimizing the total loss function $L_{total}$, which is defined in accordance with YOLOv8 as:

$$L_{total} = \lambda_{1} L_{box} + \lambda_{2} L_{cls}. \tag{8}$$

Here, $L_{box}$ denotes the bounding-box regression loss, which is used to optimize the spatial localization accuracy of the predicted bounding boxes, while $L_{cls}$ denotes the classification loss, which is designed to enhance the discriminability of category attributes. The symbols $\lambda_{1}$ and $\lambda_{2}$ represent the balancing coefficients for the respective loss terms. This detection-task-driven feature evolution paradigm ensures that the model achieves an effective balance between high recall and robust localization performance under complex hazy weather conditions [45].

4. Experiments

4.1. Dataset and Experimental Settings

Validating the robustness of object detection models under adverse weather conditions such as rain and fog is essential. To comprehensively evaluate the performance of the proposed method under complex meteorological conditions, the experimental design incorporates both real-world and synthetic scenarios.

RTTS (Real-world Task-driven Testing Set) is currently the largest annotated dataset for real-world hazy images, consisting of 4321 images collected from traffic scenes [51]. The dataset contains five major traffic object categories, namely pedestrians, cars, buses, bicycles, and motorcycles, with a total of 41,203 annotated bounding boxes. Among them, 11,606 bounding boxes are labeled as difficult samples, thereby providing an effective benchmark for evaluating the model’s generalization capability under low-visibility and occlusion conditions.

VOChaze is a synthetic haze dataset derived from the VOC dataset and is introduced to quantitatively evaluate the performance degradation of object detection models under different visibility levels [5]. This dataset simulates real-world low-visibility scenarios by randomly adding haze of varying intensities to clear images, thereby enabling the assessment of model robustness in hazy environments. Based on these synthetic data, the model’s performance under different haze levels can be systematically analyzed, which further supports model optimization and enhances its adaptability to adverse weather conditions.

For data partitioning, the experiments use VOCn (normal scenes) and VOCh (synthetic haze scenes) as the training sets to improve the model’s feature extraction capability and generalization performance across diverse environmental conditions. During training, the VOCn dataset provides clear and normal scenes, whereas the VOCh dataset simulates haze environments with varying intensities, thereby enhancing the model’s robustness under adverse weather conditions [52]. During the testing phase, evaluations are conducted on the VOCh-test, VOCn-test, and RTTS datasets. This experimental setting ensures both comprehensive quantitative analysis and validation in real-world scenarios, thus enabling a more thorough assessment of the model’s performance. The dataset statistics are summarized in Table 1, and representative examples are shown in Figure 6.

To ensure fairness and reproducibility, all algorithms were evaluated under the same software and hardware environment. The hardware platform consisted of an Intel Xeon Gold 6248R processor and an NVIDIA RTX 3090 GPU with 24 GB of VRAM. The software environment was built on the PyTorch 1.12.1 deep learning framework running on Ubuntu 20.04. All models were trained with the SGD optimizer, an initial learning rate of 0.01, a batch size of 16, and a schedule of 100 epochs. This standardized setting keeps the results of all algorithms comparable and provides a reliable basis for the subsequent analysis.

This paper evaluates performance using commonly adopted object detection metrics, including mean Average Precision at an IoU threshold of 0.5 (mAP_{50}) [53], precision [54], and recall [55]. These metrics are designed to comprehensively assess the model’s accuracy and robustness in object localization and classification tasks.

For mAP_{50}, mean Average Precision is computed by averaging the Average Precision values across all categories under an IoU threshold of 0.5. The formula is given as

mAP = \frac{1}{N} \sum_{i = 1}^{N} {AP}_{i},

(9)

where N represents the total number of categories, and AP_{i} denotes the Average Precision of the i-th category [53]. Precision quantifies the proportion of correctly predicted positive samples among all samples predicted as positive, and is defined as

Precision = \frac{TP}{TP + FP},

(10)

where TP and FP denote true positives and false positives, respectively [54]. Recall measures the proportion of actual positive samples that are correctly identified as positive, and is defined as

Recall = \frac{TP}{TP + FN},

(11)

where TP and FN denote true positives and false negatives, respectively [55]. Since precision and recall emphasize different error types, the F1-score is commonly used as a balanced criterion that jointly considers both measures [56]. Using these metrics, the performance of the object detection model can be comprehensively evaluated from multiple perspectives, thereby ensuring its effectiveness and stability in real-world applications.
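As a minimal sketch of how Equations (9)–(11) and the F1-score fit together, the following code counts true positives, false positives, and false negatives for one class in one image using greedy matching at an IoU threshold of 0.5, then derives precision, recall, and F1. The helper names and the greedy matching rule are illustrative assumptions, not the evaluation code used in the experiments, and the toy boxes are made up.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def count_tp_fp_fn(preds, gts, iou_thr=0.5):
    """preds: list of (score, box); gts: list of boxes (one class, one image)."""
    matched = set()
    tp = fp = 0
    # Visit predictions from most to least confident.
    for _, box in sorted(preds, key=lambda p: p[0], reverse=True):
        best_j, best_iou = -1, iou_thr
        for j, gt in enumerate(gts):
            if j in matched:
                continue
            v = iou(box, gt)
            if v >= best_iou:
                best_j, best_iou = j, v
        if best_j >= 0:
            matched.add(best_j)   # each ground-truth box is matched at most once
            tp += 1
        else:
            fp += 1
    return tp, fp, len(gts) - tp

def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0          # Equation (10)
    recall = tp / (tp + fn) if tp + fn else 0.0             # Equation (11)
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)                   # harmonic mean
    return precision, recall, f1

def mean_average_precision(ap_per_class):
    return sum(ap_per_class) / len(ap_per_class)            # Equation (9)

gts = [(0, 0, 10, 10), (20, 20, 30, 30)]
preds = [(0.9, (0, 0, 10, 9)), (0.8, (50, 50, 60, 60))]
tp, fp, fn = count_tp_fp_fn(preds, gts)
p, r, f1 = precision_recall_f1(tp, fp, fn)
```

In this toy case one prediction matches a ground-truth box, one prediction is spurious, and one ground-truth box is missed, so precision, recall, and F1 all equal 0.5. Computing each per-class AP additionally requires integrating the precision–recall curve over confidence thresholds, which is omitted here.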

4.2. Quantitative Results of Object Detection in Foggy Weather

To validate the effectiveness and robustness of the proposed PFE-Net under adverse hazy and foggy conditions, we conducted systematic quantitative comparative experiments on both synthetic hazy datasets and real-world traffic-scene datasets. The experiments employed the classical two-stage detector Faster R-CNN [57], as well as the current high-performance single-stage benchmark model YOLOv8, as comparative baselines. To fairly evaluate the contribution of the proposed enhancement modules to feature extraction, all models were trained under identical input resolutions, hyperparameter settings, and hardware configurations. We separately trained the models on the normal-image dataset VOCn and the synthetic hazy dataset VOCh, and then evaluated them on the normal test set (VOCn-test), the synthetic hazy test set (VOCh-test), and the highly challenging real-world hazy dataset RTTS in terms of precision (P), recall (R), and the key accuracy metrics mAP_{50} and mAP_{50:95}, where mAP_{50:95} averages mAP over IoU thresholds from 0.5 to 0.95 in steps of 0.05. The detailed quantitative results are presented in Table 2.

To provide a stronger comparison with recent object detectors, we additionally evaluate YOLOv10n [26] and RT-DETR-L [25] under the same VOCh training protocol and report results separately on VOCh and RTTS. The split comparisons in Table 3 and Table 4 show that PFE-Net achieves the best mAP_{50} on the real-hazy RTTS benchmark while introducing only a small computational increase over YOLOv8n.

The split comparison also makes the limitations of PFE-Net relative to strong recent detectors transparent. Although PFE-Net obtains the best mAP_{50} and Recall on the real-hazy RTTS benchmark in Table 4, it does not dominate every metric. On VOCh, YOLOv10n achieves slightly higher mAP_{50} and mAP_{50:95}, while RT-DETR-L obtains the highest Precision. On RTTS, YOLOv10n obtains the best mAP_{50:95} and RT-DETR-L obtains the highest Precision, although both have lower mAP_{50} than PFE-Net. These cases indicate that PFE-Net can still be less competitive when the evaluation emphasizes strict localization quality or conservative high-confidence predictions. A likely reason is that PG-VEM and FD-EPM are optimized to recover visibility and enhance weakened edge cues under haze, which improves recall and coarse localization for degraded targets but may introduce slight over-enhancement or boundary uncertainty for some samples. This would explain why PFE-Net is particularly effective on real hazy scenes while still leaving room for improvement against strong general detectors on selected metrics.

As shown in Table 2, PFE-Net, with the incorporation of the PG-VEM and FD-EPM modules, consistently outperforms the baseline YOLOv8-Base model in detection accuracy across nearly all experimental settings. This improvement is particularly evident in the cross-domain evaluation scenario, where the model is trained on VOCn and tested on the real foggy-weather dataset RTTS. In this setting, mAP_{50} increases from 0.392 to 0.414. This is an absolute gain of 0.022 (2.2 percentage points), which corresponds to a relative gain of (0.414 - 0.392) / 0.392 \times 100\% = 5.61\%; the 5.61% figure should therefore not be read as an absolute-point improvement. Moreover, even when the training set includes hazy samples (VOCh), our method still demonstrates superior performance on the real-world RTTS dataset. The primary reason is that physical signal attenuation coupled with the loss of high-frequency information under hazy conditions often leads to the failure of baseline networks. The PG-VEM module explicitly models the atmospheric scattering process and dynamically compensates for the energy of feature responses using the estimated pseudo-transmittance, thereby effectively mitigating semantic occlusion. Meanwhile, the FD-EPM module employs the Fourier transform to decouple features into the frequency domain, enabling the precise extraction and enhancement of target edge features blurred by haze [50]. This coupled enhancement strategy, which integrates physical gain and signal restoration, allows the network to more effectively penetrate complex haze layers and extract more discriminative object features.
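For readers unfamiliar with the prior that PG-VEM draws on, the sketch below shows the classic image-level dark channel prior of He et al. [29] and the transmission estimate t = 1 - ω · dark(I / A) derived from it. This is not the authors' module, which operates on feature maps and estimates a pseudo-transmittance. The function names, the 3 × 3 patch, and ω = 0.95 are assumptions chosen for illustration.

```python
import numpy as np

def dark_channel(img, patch=3):
    """Minimum over colour channels, then minimum over a patch x patch window."""
    min_rgb = img.min(axis=2)
    pad = patch // 2
    padded = np.pad(min_rgb, pad, mode="edge")
    h, w = min_rgb.shape
    out = np.empty_like(min_rgb)
    for i in range(h):
        for j in range(w):
            out[i, j] = padded[i:i + patch, j:j + patch].min()
    return out

def estimate_transmission(img, airlight, omega=0.95, patch=3):
    """Classic DCP estimate: t = 1 - omega * dark_channel(I / A)."""
    normalized = img / airlight
    return 1.0 - omega * dark_channel(normalized, patch)

# A fully hazed patch (every pixel equals the airlight) gets low transmission.
dense_haze = np.full((5, 5, 3), 0.8)
t_haze = estimate_transmission(dense_haze, airlight=0.8)

# A haze-free patch with one dark colour channel gets transmission 1.
clear = np.zeros((5, 5, 3))
clear[..., 0] = 0.9
t_clear = estimate_transmission(clear, airlight=0.8)
```

Low estimated transmission marks regions where the signal is strongly attenuated, which is where a compensation gain would be most needed.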

Moreover, under challenging hazy-weather evaluations such as RTTS, PFE-Net improves recall over the baseline models, for example from 0.339 to 0.362 under the VOCn training configuration, while precision exhibits only minor variation. This improvement likely stems from the enhanced frequency-domain edge-awareness mechanism, which increases sensitivity to degraded targets and enables the detection of small and distant objects that are otherwise obscured by dense haze, thereby reducing the missed-detection rate. However, this more aggressive feature exploration process may also generate a small number of false positives when dealing with extremely complex background noise, resulting in slight fluctuations in precision. In safety-critical traffic monitoring and autonomous driving tasks, improving recall under extremely low-visibility conditions offers greater practical engineering value and stronger safety assurance [58].

4.3. Mixed-Training and Per-Class Analysis

To examine whether adding real hazy images to the training set changes the conclusion, we further conduct a mixed-training ablation on RTTS. We split RTTS into 70% training images and 30% held-out test images with seed = 42. Setting A follows the original protocol and trains only on VOCh; Setting B trains on VOCh plus the 70% RTTS split and is evaluated on the same held-out 30% RTTS split. The results are shown in Table 5.
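A reproducible 70/30 split with a fixed seed can be sketched as follows. The function name `split_dataset`, the sort-then-shuffle procedure, and the image identifiers are hypothetical; the paper states only the ratio and seed = 42, not how the split was implemented.

```python
import random

def split_dataset(items, train_fraction=0.7, seed=42):
    """Return (train, test) lists from a seeded shuffle of the sorted items."""
    items = sorted(items)              # fix the order so the seed fully determines the split
    rng = random.Random(seed)          # local generator, does not touch global state
    rng.shuffle(items)
    n_train = int(round(train_fraction * len(items)))
    return items[:n_train], items[n_train:]

# Hypothetical identifiers standing in for the 4321 RTTS images.
image_ids = [f"img_{i:04d}" for i in range(4321)]
train_ids, test_ids = split_dataset(image_ids)
```

With this rounding rule the 4321 images divide into 3025 training and 1296 held-out images; the exact counts used by the authors are not stated and may differ. Keeping the held-out 30% fixed across Setting A and Setting B is what makes the two rows of the comparison directly comparable.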

As Table 5 shows, incorporating real hazy RTTS images into training substantially improves all metrics, confirming that real-haze supervision is beneficial when available. We retain the VOCh-only protocol as the main setting because it evaluates the more challenging synthetic-to-real generalization ability and avoids using RTTS test images during training.

The per-class results reveal that person and car obtain relatively strong AP on RTTS, whereas bus and motorbike remain more challenging under dense haze and class imbalance. Comparing Table 6 and Table 7, all five classes achieve higher AP on synthetic VOCh than on real RTTS, which highlights the remaining synthetic-to-real domain gap. Figure 7 further visualizes class-specific precision–recall curves on RTTS.

4.4. Qualitative Results of Object Detection in Foggy Weather

To intuitively demonstrate the perceptual capability of PFE-Net under extreme weather conditions, representative scenes from the synthetic foggy dataset VOChaze and the real-world foggy dataset RTTS were selected to compare the detections of the baseline YOLOv8, the classical detector Faster R-CNN, and the proposed PFE-Net. The corresponding detection results are shown in Figure 8.

As illustrated in Figure 8, in traffic scenarios severely obscured by dense fog, the baseline model YOLOv8 exhibits significant missed detections, particularly for small and distant objects such as pedestrians and remote vehicles, for which reliable bounding boxes often fail to be generated. This phenomenon may be attributed to the global contrast degradation and the “semantic occlusion” effect caused by fog, which prevent conventional convolution operators from extracting sufficiently discriminative features from degraded feature responses. In contrast, PFE-Net demonstrates clear superiority in these challenging regions by effectively capturing targets obscured by haze and producing more accurate and compact localization boxes. This improvement is mainly attributed to the proposed Physics-Guided Visibility Enhancement Module (PG-VEM), which explicitly models the atmospheric scattering process and performs pixel-level energy compensation on degraded features, thereby achieving a feature-level “see-through” effect [59]. Meanwhile, the Frequency-Domain Edge Perception Module (FD-EPM) enhances the geometric sharpness of object contours by reinforcing high-frequency components in the spectral domain, enabling the model to delineate the true shapes of objects more accurately even in blurred backgrounds with smooth boundaries, and thereby improving the detector’s recall.

To further clarify which object categories are detected by PFE-Net, we add normalized confusion matrices on RTTS and VOCh, as shown in Figure 9. PFE-Net is not restricted to car detection: it detects all 20 PASCAL VOC categories on VOCh and the five annotated outdoor categories on RTTS.

4.5. Ablation Experiments

To systematically investigate the independent contributions of each enhancement component in PFE-Net to detection performance, we conducted detailed ablation experiments by training on VOCn and testing on the RTTS dataset. Using the original YOLOv8 as the baseline, we quantitatively evaluated changes in detection accuracy under different configurations by progressively incorporating the Physics-Guided Visibility Enhancement Module (PG-VEM) and the Frequency-Domain Edge Perception Module (FD-EPM). The detailed ablation results are summarized in Table 8.

As shown in Table 8, when PG-VEM is introduced independently, the model’s performance in real foggy environments improves, with mAP_{50} increasing from 0.392 to 0.405. This improvement can be primarily attributed to PG-VEM’s explicit incorporation of the physical constraints imposed by the atmospheric scattering model. By estimating the pseudo-transmittance, this module provides the necessary energy compensation for deep features, alleviating the “semantic occlusion” effect caused by optical signal attenuation and thereby enhancing the network’s response intensity in dense fog regions. Meanwhile, independently deploying the FD-EPM module yields a 0.9% improvement in mAP_{50}. This gain mainly arises from the fact that haze behaves as a characteristic low-pass disturbance in the frequency domain. FD-EPM decouples features through the Fourier transform and employs a dynamic high-pass filtering mechanism to extract obscured object contours, namely high-frequency components. This geometric restoration strategy, from a signal-processing perspective, enhances the detector’s localization consistency on blurred boundaries.

Finally, when both modules operate jointly, the model achieves optimal detection performance. This demonstrates the synergistic interaction between the two components: the physics-guided module is responsible for energy restoration to ensure target recognizability, while the frequency-domain module facilitates edge delineation to guarantee precise localization. The dual-domain enhancement strategy, which combines physical-space modeling and global frequency-spectrum analysis, enables the model to learn highly discriminative representations for complex hazy environments without relying on clear-label supervision, thereby supporting the soundness of the PFE-Net architectural design.
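The frequency-domain idea can be illustrated with a minimal NumPy sketch: transform a single 2D feature map with the FFT, keep only components above a radial cutoff, and add the resulting high-frequency residual back to the input. This is a simplified stand-in under stated assumptions, not the FD-EPM implementation: the cutoff here is a fixed hypothetical value, whereas the paper describes a dynamically learned one, and the function name and `gain` parameter are invented for illustration.

```python
import numpy as np

def high_pass_enhance(feature, cutoff=0.1, gain=1.0):
    """Add a high-pass-filtered copy of a 2D feature map back onto itself.

    feature : 2D array of shape (H, W)
    cutoff  : radial frequency threshold in cycles per sample (0 to 0.5)
    gain    : weight of the high-frequency residual
    """
    h, w = feature.shape
    spectrum = np.fft.fft2(feature)
    fy = np.fft.fftfreq(h)[:, None]          # vertical frequencies
    fx = np.fft.fftfreq(w)[None, :]          # horizontal frequencies
    radius = np.sqrt(fx ** 2 + fy ** 2)
    mask = (radius > cutoff).astype(float)   # ideal (binary) high-pass mask
    high = np.real(np.fft.ifft2(spectrum * mask))
    return feature + gain * high

# A vertical step edge: columns 0-7 are 0, columns 8-15 are 1.
step = np.zeros((16, 16))
step[:, 8:] = 1.0
enhanced = high_pass_enhance(step)

edge_before = step[:, 8] - step[:, 7]
edge_after = enhanced[:, 8] - enhanced[:, 7]
```

On the step-edge example the jump across the edge is larger after enhancement than before, while a constant map passes through unchanged because it has no energy above the cutoff. A binary mask is used for brevity; smoother masks reduce ringing artifacts.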

4.6. Parameter Sensitivity and Computational Cost

We further conduct parameter-sensitivity analyses for the PG-VEM gain initialization and the range of α in Equation (3). All settings are trained for 50 epochs with seed = 42 to control computational cost, and are evaluated on both VOCh and RTTS.

Table 9 and Table 10, together with Figure 10, show that the proposed modules are not overly sensitive to the tested parameter ranges. This supports the robustness of the selected default settings.

As shown in Table 11, PFE-Net adds only 0.12 M parameters and 0.36 G FLOPs over YOLOv8n, while achieving higher RTTS mAP_{50}. Compared with RT-DETR-L, PFE-Net is substantially lighter and more suitable for deployment-oriented hazy-scene detection.

5. Discussion

5.1. Limitations

Although PFE-Net improves object detection under hazy conditions, it still has several limitations. First, under extremely heavy fog, where both visibility and high-frequency edge cues are almost completely lost, the physics-guided correction becomes under-constrained and FD-EPM has limited reliable edge information to enhance. Second, small and heavily occluded objects remain difficult because their spectral responses are weak and can be confused with haze-induced noise. Third, the training protocol based on synthetic VOCh and real RTTS evaluation still contains a synthetic-to-real domain gap, as reflected by the per-class performance difference between Table 6 and Table 7.

5.2. Future Work

Future work will proceed in three directions. First, we will integrate more realistic neural weather synthesis methods, such as WeatherWeaver [43] and ClimateNeRF [44], to generate more diverse and realistic foggy training data. Second, we will extend PFE-Net to video-level hazy-scene detection, where temporal consistency can help stabilize predictions under dynamic visibility changes. Third, we will further compress FD-EPM and PG-VEM for edge-device and vehicle-mounted deployment.

5.3. Normal-Image Generalization

When the model is trained on hazy images and tested on haze-free images, the recall of PFE-Net may slightly decrease compared with some other settings. This behavior is expected because PG-VEM and FD-EPM are explicitly optimized for haze-induced attenuation and blurred contours. When the test images are haze-free, aggressive enhancement may provide less benefit and can slightly alter the precision–recall balance. Nevertheless, the corresponding precision and mAP remain competitive, and this setting is not the primary deployment scenario of the proposed method.

6. Conclusions

This paper proposes PFE-Net, a robust framework that integrates physics-driven modeling with frequency-domain enhancement to address feature blurring, contrast degradation, and contour loss in foggy environments. Specifically, we introduce a Physics-Guided Visibility Enhancement Module (PG-VEM), which leverages atmospheric scattering principles and the dark channel prior to perform feature-level dynamic energy compensation, thereby effectively mitigating semantic occlusion. Complementing this design, the Frequency-Domain Edge Perception Module (FD-EPM) employs the Fourier transform to decouple features into the spectral domain and uses high-pass filtering to explicitly enhance the contours of haze-obscured targets, thereby improving localization consistency. Experiments on the real-world RTTS dataset and the synthetic VOChaze dataset demonstrate that PFE-Net significantly outperforms the YOLOv8 baseline. Notably, under challenging cross-domain scenarios, PFE-Net achieves a 5.61% relative gain in mAP_{50} together with improved recall, validating its effectiveness in low-visibility conditions. Ablation studies further confirm that the dual-domain coupling of physical energy restoration and frequency-domain geometric refinement enables the network to learn discriminative representations for complex weather conditions without requiring clear-weather labels. Additional comparisons with YOLOv10n and RT-DETR-L, mixed-training experiments with real hazy images, per-class analyses, confusion matrices, and parameter-sensitivity studies further demonstrate the effectiveness and limitations of the proposed design. In future work, we will explore joint detection–restoration optimization based on reinforcement learning, as well as lightweight architectural designs for real-time industrial deployment.

Author Contributions

Conceptualization, K.B. and Z.Z.; methodology, K.B.; software, K.B.; validation, K.B., J.Y. and W.Z.; formal analysis, K.B.; investigation, K.B.; resources, Z.Z.; data curation, K.B. and J.Y.; writing—original draft preparation, K.B.; writing—review and editing, Z.Z. and W.Z.; visualization, K.B.; supervision, Z.Z.; project administration, Z.Z.; funding acquisition, Z.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Zou, Z.; Chen, K.; Shi, Z.; Guo, Y.; Ye, J. Object Detection in 20 Years: A Survey. Proc. IEEE 2023, 111, 257–276.
2. Yao, S.; Guan, R.; Huang, X.; Li, Z.; Sha, X.; Yue, Y.; Lim, E.G.; Seo, H.; Man, K.L.; Zhu, X.; et al. Radar-Camera Fusion for Object Detection and Semantic Segmentation in Autonomous Driving: A Comprehensive Review. IEEE Trans. Intell. Veh. 2023, 9, 2094–2128.
3. Akcay, S.; Kundegorski, M.E.; Willcocks, C.G.; Breckon, T.P. Using Deep Convolutional Neural Network Architectures for Object Classification and Detection Within X-Ray Baggage Security Imagery. IEEE Trans. Inf. Forensics Secur. 2018, 13, 2203–2215.
4. Hassaballah, M.; Kenk, M.A.; Muhammad, K.; Minaee, S. Vehicle Detection and Tracking in Adverse Weather Using a Deep Learning Framework. IEEE Trans. Intell. Transp. Syst. 2020, 22, 4230–4242.
5. Li, C.; Zhou, H.; Liu, Y.; Yang, C.; Xie, Y.; Li, Z.; Zhu, L. Detection-Friendly Dehazing: Object Detection in Real-World Hazy Scenes. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 8284–8295.
6. Yang, X.; Li, H.; Fan, Y.L.; Chen, R. Single Image Haze Removal via Region Detection Network. IEEE Trans. Multimed. 2019, 21, 2545–2560.
7. Huang, S.C.; Le, T.H.; Jaw, D.W. DSNet: Joint Semantic Learning for Object Detection in Inclement Weather Conditions. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 43, 2623–2633.
8. Ding, Q.; Li, P.; Yan, X.; Shi, D.; Liang, L.; Wang, W.; Xie, H.; Li, J.; Wei, M. CF-YOLO: Cross Fusion YOLO for Object Detection in Adverse Weather with a High-Quality Real Snow Dataset. IEEE Trans. Intell. Transp. Syst. 2023, 24, 10749–10759.
9. Rusyn, B.; Lutsyk, O.; Kosarevych, R.; Maksymyuk, T.; Gazda, J. Features Extraction from Multi-Spectral Remote Sensing Images Based on Multi-Threshold Binarization. Sci. Rep. 2023, 13, 19655.
10. Zhou, Q.; Shahidehpour, M.; Paaso, A.; Bahramirad, S.; Alabdulwahab, A.; Abusorrah, A. Distributed Control and Communication Strategies in Networked Microgrids. IEEE Commun. Surv. Tutor. 2020, 22, 2586–2633.
11. Bandeira, F.O.; Alves, P.R.L.; Hennig, T.B.; Brancalione, J.; Nogueira, D.J.; Matias, W.G. Chronic Effects of Clothianidin to Non-Target Soil Invertebrates: Ecological Risk Assessment Using the Species Sensitivity Distribution (SSD) Approach. J. Hazard. Mater. 2021, 419, 126491.
12. An, T.; Gao, H.; Liu, R.; Dai, K.; Xie, T.; Li, R.; Wang, K.; Zhao, L. An MoE-Driven Unified Image Restoration Framework for Adverse Weather Conditions. IEEE Trans. Circuits Syst. Video Technol. 2026, 36, 3101–3116.
13. Chen, G.; Shao, F.; Chai, X.; Chen, H.; Jiang, Q.; Meng, X.; Ho, Y.S. CGMDRNet: Cross-Guided Modality Difference Reduction Network for RGB-T Salient Object Detection. IEEE Trans. Circuits Syst. Video Technol. 2022, 32, 6308–6323.
14. Liu, W.; Pang, J.; Zhang, B.; Wang, J.; Liu, B.; Tao, D. See Degraded Objects: A Physics-Guided Approach for Object Detection in Adverse Environments. IEEE Trans. Image Process. 2025, 34, 2198–2212.
15. Liu, Y.; Han, J.; Zhang, Q.; Wang, L. Salient Object Detection via Two-Stage Graphs. IEEE Trans. Circuits Syst. Video Technol. 2018, 29, 1023–1037.
16. Chen, K.; Lin, W.; Li, J.; See, J.; Wang, J.; Zou, J. AP-Loss for Accurate One-Stage Object Detection. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 43, 3782–3798.
17. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 1137–1149.
18. Ye, M.; Ke, L.; Li, S.; Tai, Y.W.; Tang, C.K.; Danelljan, M.; Yu, F. Cascade-DETR: Delving into High-Quality Universal Object Detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; IEEE: New York, NY, USA, 2023; pp. 6704–6714.
19. Liu, Y.; Li, J.; Wang, Y.; Li, X.; Jiao, Z.; Yang, J.; Gao, X. Refined Segmentation R-CNN: A Two-Stage Convolutional Neural Network for Punctate White Matter Lesion Segmentation in Preterm Infants. In International Conference on Medical Image Computing and Computer-Assisted Intervention, Shenzhen, China, 13–17 October 2019; Springer: Cham, Switzerland, 2019; pp. 193–201.
20. Lin, W.; Chu, J.; Leng, L.; Miao, J.; Wang, L. Feature Disentanglement in One-Stage Object Detection. Pattern Recognit. 2024, 145, 109878.
21. Gong, M.; Wang, D.; Zhao, X.; Guo, H.; Luo, D.; Song, M. A Review of Non-Maximum Suppression Algorithms for Deep Learning Target Detection. In Seventh Symposium on Novel Photoelectronic Detection Technology and Applications, Kunming, China, 5–7 November 2020; SPIE: Bellingham, WA, USA, 2021; Volume 11763, pp. 821–828.
22. Liu, L.; Xu, X. Self-Attention Mechanism at the Token Level: Gradient Analysis and Algorithm Optimization. Knowl.-Based Syst. 2023, 277, 110784.
23. Yang, H.; Wang, L.; Pan, Y.; Chen, J.J. A Teacher-Student Framework Leveraging Large Vision Model for Data Pre-Annotation and YOLO for Tunnel Lining Multiple Defects Instance Segmentation. J. Ind. Inf. Integr. 2025, 44, 100790.
24. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-End Object Detection with Transformers. In Proceedings of the European Conference on Computer Vision (ECCV), Glasgow, UK, 23–28 August 2020; Springer: Cham, Switzerland, 2020; Volume 12346, pp. 213–229.
25. Zhao, Y.; Lv, W.; Xu, S.; Wei, J.; Wang, G.; Dang, Q.; Liu, Y.; Chen, J. DETRs Beat YOLOs on Real-Time Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; IEEE: New York, NY, USA, 2024; pp. 16965–16974.
26. Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J.; Ding, G. YOLOv10: Real-Time End-to-End Object Detection. In Advances in Neural Information Processing Systems; Curran Associates, Inc.: Red Hook, NY, USA, 2024; Volume 37.
27. Fattal, R. Single Image Dehazing. ACM Trans. Graph. 2008, 27, 1–9.
28. Choi, L.K.; You, J.; Bovik, A.C. Referenceless Prediction of Perceptual Fog Density and Perceptual Image Defogging. IEEE Trans. Image Process. 2015, 24, 3888–3901.
29. He, K.; Sun, J.; Tang, X. Single Image Haze Removal Using Dark Channel Prior. IEEE Trans. Pattern Anal. Mach. Intell. 2010, 33, 2341–2353.
30. Yan, X.; Cao, J.; Zhou, J.; Ding, C.; Sun, H.; Sun, L.; Song, A. Dcp-ahs: A High-Performance Distributed Cooperative Positioning Model for Concave Networks. IEEE Trans. Mob. Comput. 2023, 23, 4334–4347.
31. Yeh, C.H.; Huang, C.H.; Kang, L.W. Multi-Scale Deep Residual Learning-Based Single Image Haze Removal via Image Decomposition. IEEE Trans. Image Process. 2019, 29, 3153–3167.
32. Khan, A.; Rauf, Z.; Sohail, A.; Khan, A.R.; Asif, H.; Asif, A.; Farooq, U. A Survey of the Vision Transformers and Their CNN-Transformer Based Variants. Artif. Intell. Rev. 2023, 56, 2917–2970.
33. Song, Y.; He, Z.; Qian, H.; Du, X. Vision Transformers for Single Image Dehazing. IEEE Trans. Image Process. 2023, 32, 1927–1941.
34. Wang, Y.; Yan, X.; Wang, F.L.; Xie, H.; Yang, W.; Zhang, X.P.; Qin, J.; Wei, M. UCL-Dehaze: Toward Real-World Image Dehazing via Unsupervised Contrastive Learning. IEEE Trans. Image Process. 2024, 33, 1361–1374.
35. Li, J.; Xu, R.; Liu, X.; Ma, J.; Li, B.; Zou, Q.; Ma, J.; Yu, H. Domain Adaptation Based Object Detection for Autonomous Driving in Foggy and Rainy Weather. IEEE Trans. Intell. Veh. 2025, 10, 900–911.
36. Han, X.J.; Qu, Z.; Wang, S.Y.; Xia, S.F. Object Detection With Physical Prior and AWConv in Foggy Weather for Traffic Scenes. IEEE Trans. Neural Netw. Learn. Syst. 2025, 36, 18722–18736.
37. Schutera, M.; Hussein, M.; Abhau, J.; Mikut, R.; Reischl, M. Night-to-Day: Online Image-to-Image Translation for Object Detection Within Autonomous Driving by Night. IEEE Trans. Intell. Veh. 2020, 6, 480–489.
38. Hou, J.; He, G. Redefining Night Vision: The Power of MSR-Driven Neural ISP. In ICASSP 2024—2024 IEEE International Conference on Acoustics, Speech and Signal Processing, Seoul, Republic of Korea, 14–19 April 2024; IEEE: New York, NY, USA, 2024; pp. 3100–3104.
39. Liu, Y.; Li, S.; Zhou, L.; Liu, H.; Li, Z. Dark-Yolo: A Low-Light Object Detection Algorithm Integrating Multiple Attention Mechanisms. Appl. Sci. 2025, 15, 5170.
40. Li, Y.; Monno, Y.; Okutomi, M. Dual-Pixel Raindrop Removal. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 10748–10762.
41. Lu, S.; Al-Dhahir, N. Coherent and Differential ICI Cancellation for Mobile OFDM with Application to DVB-H. IEEE Trans. Wirel. Commun. 2008, 7, 4110–4116.
42. Ye, N.; Li, X.; Yu, H.; Zhao, L.; Liu, W.; Hou, X. DeepNOMA: A Unified Framework for NOMA Using Deep Multi-Task Learning. IEEE Trans. Wirel. Commun. 2020, 19, 2208–2225.
43. Lin, C.-H.; Wang, Z.; Liang, R.; Zhang, Y.; Fidler, S.; Wang, S.; Gojcic, Z. Controllable Weather Synthesis and Removal with Video Diffusion Models. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Honolulu, HI, USA, 19–25 October 2025; IEEE: New York, NY, USA, 2025; pp. 13580–13591.
44. Li, Y.; Lin, Z.-H.; Forsyth, D.; Huang, J.-B.; Wang, S. ClimateNeRF: Extreme Weather Synthesis in Neural Radiance Field. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; IEEE: New York, NY, USA, 2023; pp. 3227–3238.
45. Giri, K.J. SO-YOLOv8: A Novel Deep Learning-Based Approach for Small Object Detection with YOLO Beyond COCO. Expert Syst. Appl. 2025, 280, 127447.
46. Ju, M.; Ding, C.; Ren, W.; Yang, Y.; Zhang, D.; Guo, Y.J. IDE: Image Dehazing and Exposure Using an Enhanced Atmospheric Scattering Model. IEEE Trans. Image Process. 2021, 30, 2180–2192.
47. Tan, D.; Niu, C.; Yang, Y.; Yang, D.; Tan, B. DC-BiNet: Towards Interpretable Generated Image Detection with Dark Channel Prior. Expert Syst. Appl. 2025, 280, 127508.
48. Menon, A.; Mehrotra, K.; Mohan, C.K.; Ranka, S. Characterization of a Class of Sigmoid Functions with Applications to Neural Networks. Neural Netw. 1996, 9, 819–835.
49. Hörmander, L. On the Range of Convolution Operators. Ann. Math. 1962, 76, 148–170.
50. Cochran, W.T.; Cooley, J.W.; Favin, D.L.; Helms, H.D.; Kaenel, R.A.; Lang, W.W.; Maling, G.C.; Nelson, D.E.; Rader, C.M.; Welch, P.D. What Is the Fast Fourier Transform? Proc. IEEE 1967, 55, 1664–1674.
51. Wang, L.; Yang, J.; Workman, M.; Wan, P. Effective Algorithms to Detect Stepping-Stone Intrusion by Removing Outliers of Packet RTTs. Tsinghua Sci. Technol. 2021, 27, 432–442.
52. Kumar, S.; Sharma, S.; Asghar, R.; Mohandas, R.; Brophy, T.; Sistu, G.; Grua, E.M.; Donzella, V.; Eising, C. Exploring Sensor Impact and Architectural Robustness in Adverse Weather on BEV Perception. IEEE Open J. Veh. Technol. 2025, 6, 2857–2875.
53. Wang, W.; Li, Q. TPM-EViT: Tri-Probability Map-Enhanced Vision Transformer Framework for UAV Object Detection. Knowl.-Based Syst. 2025, 325, 113983.
54. Struhl, K. A Paradigm for Precision. Science 2001, 293, 1054–1055.
55. Raghavan, V.; Bollmann, P.; Jung, G.S. A Critical Investigation of Recall and Precision as Measures of Retrieval System Performance. ACM Trans. Inf. Syst. 1989, 7, 205–229.
56. Huang, H.; Xu, H.; Wang, X.; Silamu, W. Maximum F1-Score Discriminative Training Criterion for Automatic Mispronunciation Detection. IEEE/ACM Trans. Audio Speech Lang. Process. 2015, 23, 787–797.
57. Luo, L.; Neihart, N.M.; Roy, S.; Allstot, D.J. A Two-Stage Sensing Technique for Dynamic Spectrum Access. IEEE Trans. Wirel. Commun. 2009, 8, 3028–3037.
58. Miraliev, S.; Abdigapporov, S.; Kakani, V.; Kim, H. Real-Time Memory Efficient Multitask Learning Model for Autonomous Driving. IEEE Trans. Intell. Veh. 2023, 9, 247–258.
59. Zheng, S.; Liu, W.; Guo, Y.; Zang, Y.; Shen, S.; Wang, C. A New Adversarial Perspective for LiDAR-Based 3D Object Detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Philadelphia, PA, USA, 25 February–4 March 2025; Volume 39, pp. 10608–10616.

Figure 1. The overall architecture of Physics-informed Frequency-Enhanced Network (PFE-Net). Boxes denote the main network components, and arrows indicate the feature propagation pathway.

Figure 2. The Physics-Guided Visibility Enhancement Module (PG-VEM).

Figure 3. Illustration of the PG-VEM nonlinear gain. The left plot shows the basic $\tanh(x)$ curve, and the right plot shows $α \cdot \tanh(x)$ under representative gain values $α = 0.5$, $1.5$, and $3.0$.
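
The curves described in the Figure 3 caption can be reproduced numerically. The following is a minimal sketch that only evaluates $α \cdot \tanh(x)$ for the three gain values named in the caption; the function name `scaled_tanh` and the sample points are illustrative assumptions, and the sketch does not reproduce how PG-VEM computes or learns $α$.

```python
import math


def scaled_tanh(x: float, alpha: float) -> float:
    """Return alpha * tanh(x); the result always lies strictly inside (-alpha, alpha)."""
    return alpha * math.tanh(x)


# Gain values taken from the Figure 3 caption; the sample points are arbitrary.
alphas = (0.5, 1.5, 3.0)
xs = (-3.0, -1.0, 0.0, 1.0, 3.0)

for alpha in alphas:
    row = ", ".join(f"{scaled_tanh(x, alpha):+.3f}" for x in xs)
    print(f"alpha={alpha}: {row}")
```

A larger $α$ raises the saturation level and steepens the slope around the origin, while the output stays bounded for any input.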

Figure 4. Frequency Domain Edge Perception Module (FD-EPM) architecture.

Figure 5. Visualization of the dynamically learned FD-EPM frequency masks and cutoff values. The learned cutoffs concentrate around a stable mean while still adapting to sample-specific degradation levels. (a) Learned high-pass masks generated by FD-EPM for representative RTTS inputs with different haze levels. (b) Distribution of learned cutoff values across RTTS test images.
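
To make the notion of a high-pass mask with a cutoff concrete, here is a minimal sketch of frequency-domain high-pass filtering with NumPy. It assumes an ideal circular mask with a fixed, hand-chosen cutoff in normalized frequency (cycles per pixel); the masks in Figure 5 are learned by FD-EPM and will differ. The function names `high_pass_mask` and `high_pass_filter` are hypothetical.

```python
import numpy as np


def high_pass_mask(height: int, width: int, cutoff: float) -> np.ndarray:
    """Binary mask for a centered spectrum: 0 within `cutoff` of DC, 1 elsewhere."""
    fy = np.fft.fftshift(np.fft.fftfreq(height))  # vertical frequencies, centered
    fx = np.fft.fftshift(np.fft.fftfreq(width))   # horizontal frequencies, centered
    radius = np.sqrt(fy[:, None] ** 2 + fx[None, :] ** 2)
    return (radius > cutoff).astype(np.float64)


def high_pass_filter(image: np.ndarray, cutoff: float) -> np.ndarray:
    """Suppress low frequencies of a 2-D image and return the real-valued result."""
    spectrum = np.fft.fftshift(np.fft.fft2(image))
    filtered = spectrum * high_pass_mask(image.shape[0], image.shape[1], cutoff)
    return np.fft.ifft2(np.fft.ifftshift(filtered)).real


# A vertical step edge: left half 0, right half 1 (illustrative input).
step = np.zeros((32, 32))
step[:, 16:] = 1.0
response = high_pass_filter(step, cutoff=0.1)
print(np.abs(response[:, 16]).mean(), np.abs(response[:, 8]).mean())
```

The filtered response is larger next to the step than in the flat region, which is the behavior an edge-oriented high-pass branch relies on.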

Figure 6. Some examples of foggy images. (a–c) Real-world hazy images from RTTS. (d–f) Synthetic hazy images from VOCh.

Figure 7. Class-specific precision–recall curves at IoU = 0.5 on RTTS.

Figure 8. Comparison of object detection in foggy weather using different methods. Each row shows the same scene with ground-truth annotations, Faster R-CNN, YOLOv8, and PFENet from left to right. Red boxes denote ground-truth annotations in the GT column and model-predicted detections in the other columns.

Figure 9. Normalized confusion matrices of PFE-Net on RTTS and VOCh.

Figure 10. Parameter-sensitivity curves for PG-VEM and the $α$ range. The results show that PFE-Net remains stable across reasonable parameter intervals.

Table 1. Dataset statistics used in the experiments.

Dataset	#Train	#Val/Test	#Classes	Role/Note
VOCh	16,551	4952	20	Main synthetic-haze training and in-domain test set.
RTTS (main)	–	4321	5	Held-out real-hazy test set.
RTTS (mixed ablation)	3024	1297	5	Random 70%/30% split with seed = 42.
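
The mixed-ablation split sizes in Table 1 are consistent with a seeded random 70%/30% partition of the 4321 RTTS images. The sketch below is one way such a split could be produced with the Python standard library; the function name `split_indices` and the use of floor rounding for the training count are assumptions, since the exact splitting procedure is not given here.

```python
import random


def split_indices(n_items: int, train_fraction: float = 0.7, seed: int = 42):
    """Shuffle indices with a fixed seed and cut them into train and test parts."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)      # local RNG, so the split is reproducible
    n_train = int(n_items * train_fraction)   # floor rounding (assumed)
    return indices[:n_train], indices[n_train:]


train_idx, test_idx = split_indices(4321)
print(len(train_idx), len(test_idx))  # 3024 1297
```

With floor rounding, 70% of 4321 gives 3024 training images and leaves 1297 for testing, matching the counts in Table 1.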

Table 2. Comparison of object detection performance in foggy weather.

Training Data	Test Data	Model	P	R	${mAP}_{50}$	${mAP}_{50 : 95}$
VOCn	VOCn-test (Normal)	Faster R-CNN	0.738	0.677	0.837	0.607
VOCn	VOCn-test (Normal)	YOLOv8	0.871	0.785	0.875	0.660
VOCn	VOCn-test (Normal)	PFENet	0.871	0.793	0.899	0.697
VOCn	VOCh-test (Synthetic Hazy)	Faster R-CNN	0.756	0.683	0.699	0.442
VOCn	VOCh-test (Synthetic Hazy)	YOLOv8	0.782	0.581	0.678	0.480
VOCn	VOCh-test (Synthetic Hazy)	PFENet	0.786	0.611	0.712	0.511
VOCn	RTTS (Real Hazy)	Faster R-CNN	0.889	0.409	0.409	0.242
VOCn	RTTS (Real Hazy)	YOLOv8	0.619	0.339	0.392	0.249
VOCn	RTTS (Real Hazy)	PFENet	0.618	0.362	0.414	0.263
VOCh	VOCn-test (Normal)	Faster R-CNN	0.741	0.717	0.745	0.462
VOCh	VOCn-test (Normal)	YOLOv8	0.830	0.762	0.850	0.627
VOCh	VOCn-test (Normal)	PFENet	0.861	0.743	0.854	0.631
VOCh	VOCh-test (Synthetic Hazy)	Faster R-CNN	0.710	0.723	0.735	0.448
VOCh	VOCh-test (Synthetic Hazy)	YOLOv8	0.841	0.781	0.864	0.643
VOCh	VOCh-test (Synthetic Hazy)	PFENet	0.855	0.779	0.871	0.650
VOCh	RTTS (Real Hazy)	Faster R-CNN	0.861	0.329	0.287	0.164
VOCh	RTTS (Real Hazy)	YOLOv8	0.652	0.396	0.451	0.281
VOCh	RTTS (Real Hazy)	PFENet	0.644	0.412	0.460	0.288

Table 3. Revised VOCh comparison with relative ${mAP}_{50}$ gain. Best result in each metric column is in boldface; second-best is underlined.

Method	${mAP}_{50}$	${mAP}_{50 : 95}$	Precision	Recall	Params (M)	FLOPs (G)	Rel. Gain (%)
YOLOv8n (base)	0.864	0.6426	0.8412	0.7811	3.01	8.20	0.00
PFE-Net (ours)	0.871	0.6503	0.8547	0.7787	3.13	8.56	+0.81
YOLOv10n	0.875	0.6740	0.8850	0.7840	2.78	8.74	+1.27
RT-DETR-L	0.847	0.6512	0.8906	0.7719	32.97	108.34	−1.97

Table 4. Revised RTTS comparison with relative ${mAP}_{50}$ gain. Best result in each metric column is in boldface; second-best is underlined.

Method	${mAP}_{50}$	${mAP}_{50 : 95}$	Precision	Recall	Params (M)	FLOPs (G)	Rel. Gain (%)
YOLOv8n (base)	0.451	0.2815	0.6522	0.3960	3.01	8.20	0.00
PFE-Net (ours)	0.460	0.2876	0.6453	0.4121	3.13	8.56	+2.00
YOLOv10n	0.457	0.2990	0.6726	0.3847	2.78	8.74	+1.33
RT-DETR-L	0.390	0.2585	0.7561	0.3506	32.97	108.34	−13.53
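
The "Rel. Gain (%)" column in Tables 3 and 4 appears to be the percentage change in ${mAP}_{50}$ relative to the YOLOv8n baseline. The sketch below recomputes the Table 4 column under that assumption, using the three-decimal ${mAP}_{50}$ values printed in the table; the helper name `relative_gain` is illustrative.

```python
def relative_gain(value: float, baseline: float) -> float:
    """Percentage change of `value` relative to `baseline`."""
    return (value - baseline) / baseline * 100.0


# mAP50 values on RTTS as printed in Table 4.
rtts_map50 = {
    "YOLOv8n (base)": 0.451,
    "PFE-Net (ours)": 0.460,
    "YOLOv10n": 0.457,
    "RT-DETR-L": 0.390,
}

baseline = rtts_map50["YOLOv8n (base)"]
for name, value in rtts_map50.items():
    print(f"{name}: {relative_gain(value, baseline):+.2f}%")
```

Rounded to two decimals, this reproduces the tabulated gains of +2.00, +1.33, and −13.53 for RTTS; applying the same function to the Table 3 values gives +0.81, +1.27, and −1.97 for VOCh.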

Table 5. Effect of adding real hazy data (RTTS) during training.

Setting	Training Data	Test Set	${mAP}_{50}$	${mAP}_{50 : 95}$	Precision	Recall
A (full-RTTS reference)	VOCh only	RTTS (full, 4321)	0.4603	0.2876	0.6453	0.4121
B (mixed training)	VOCh + 70% RTTS (3024)	RTTS 30% (1297)	0.6718	0.4490	0.7597	0.5968
$Δ$ (B–A)	–	–	+0.2115	+0.1614	+0.1144	+0.1847

Table 6. Per-class detection performance of PFE-Net on RTTS at IoU = 0.5.

Class	${AP}_{50}$	${AP}_{50 : 95}$	Precision	Recall
person	0.6617	0.4340	0.7327	0.6003
car	0.5442	0.3297	0.7136	0.4533
bicycle	0.4590	0.3097	0.6389	0.4327
motorbike	0.3948	0.2167	0.5632	0.3859
bus	0.2420	0.1477	0.5779	0.1883

Table 7. Per-class detection performance of PFE-Net on VOCh at IoU = 0.5 for the five RTTS-overlapping classes.

Class	${AP}_{50}$	${AP}_{50 : 95}$	Precision	Recall
person	0.8617	0.5742	0.8759	0.7400
car	0.9007	0.6947	0.8660	0.8249
bus	0.8492	0.7289	0.8157	0.7775
bicycle	0.8856	0.6438	0.8889	0.7861
motorbike	0.8564	0.6100	0.8270	0.7651

Table 8. Ablation experiment on the effectiveness of core components.

YOLOv8	PG-VEM	FD-EPM	P	R	${mAP}_{50}$	${mAP}_{50 : 95}$
✔			0.619	0.339	0.392	0.249
✔	✔		0.612	0.355	0.405	0.256
✔		✔	0.610	0.348	0.401	0.254
✔	✔	✔	0.618	0.362	0.414	0.263

Table 9. Sensitivity study of the initial value of the PG-VEM enhancement parameter.

Initial Value	${mAP}_{50}$ VOCh	${mAP}_{50 : 95}$ VOCh	${mAP}_{50}$ RTTS	${mAP}_{50 : 95}$ RTTS
0.5	0.8922	0.6810	0.5038	0.3245
1.0	0.8930	0.6815	0.5021	0.3195
1.5	0.8871	0.6840	0.4888	0.3156
2.0 (used)	0.8929	0.6794	0.4922	0.3169
2.5	0.8958	0.6875	0.4979	0.3208
3.0	0.8926	0.6807	0.4928	0.3191

Table 10. Sensitivity study of the $α$ range in Equation (3).

$α$ Range	${mAP}_{50}$ VOCh	${mAP}_{50 : 95}$ VOCh	${mAP}_{50}$ RTTS	${mAP}_{50 : 95}$ RTTS
$[0.5, 2.5]$	0.8845	0.6731	0.4968	0.3159
$[1.0, 3.0]$ (used)	0.8868	0.6753	0.4939	0.3158
$[1.5, 3.5]$	0.8870	0.6775	0.4872	0.3143

Table 11. Computational cost comparison at input size $1 \times 3 \times 640 \times 640$.

Method	Params (M)	FLOPs (G)	${mAP}_{50}$ VOCh	${mAP}_{50}$ RTTS
YOLOv8n (base)	3.01	8.20	0.864	0.451
PFE-Net (ours)	3.13	8.56	0.871	0.460
YOLOv10n	2.78	8.74	0.875	0.457
YOLOv10s	8.13	25.11	–	–
RT-DETR-L	32.97	108.34	0.847	0.390

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Bai, K.; Zhou, Z.; Yang, J.; Zhang, W. PFENet: Physics-Informed Frequency-Enhanced YOLO for Object Detection in Hazy Scenes. Appl. Sci. 2026, 16, 4635. https://doi.org/10.3390/app16104635

