1. Introduction
With the rapid advancement of modern smart grids, unmanned aerial vehicles (UAVs) have become a core technology for automated power line inspection to ensure the safe and stable operation of power systems. Compared to conventional manual methods, intelligent UAV-based inspection schemes have distinct advantages, such as wide coverage, high operational efficiency, flexible deployment, and low overall costs. Consequently, they have been widely applied to state recognition and defect detection for various power components, including insulators, power fittings, anti-vibration hammers, and conductor accessories [
1,
2]. UAV-based intelligent detection thus supports the maintenance of digital transmission lines and helps shift grid inspection from manual effort toward data-driven intelligence.
Deep-learning-based approaches have fundamentally reshaped power inspection in recent years. Early efforts relied on traditional image processing and handcrafted features, such as edge extraction and geometric analysis, to identify critical power lines and components [
3,
4]. However, these methods often lacked robustness when faced with complex backgrounds and varying illumination. With the rise of large-scale datasets, CNN-based frameworks have become the dominant approach due to their superior feature extraction capabilities. This progress encompasses both two-stage detectors, such as Faster R-CNN [
5] and Mask R-CNN [
6], and one-stage models, most notably RetinaNet [
7] and the YOLO series [
8,
9,
10,
11,
12,
13]. Current research has widely focused on specialized architectures [
14], feature enhancement for complex backgrounds [
15], lightweight designs for efficient deployment [
16], and adaptation to adverse weather [
17]. These methods have demonstrated strong potential for engineering applications under clear-weather conditions or when the training and testing distributions are relatively well matched. However, their generalization to unseen foggy target domains remains limited.
In practical applications, UAVs often encounter complex atmospheric conditions, particularly foggy weather. Atmospheric-optics studies have shown that suspended particles in foggy or bad-weather environments attenuate light during propagation, with the extinction process being closely related to wavelength, visibility, and meteorological conditions [
18]. Consequently, fog-induced atmospheric scattering causes a significant degradation in image contrast, the loss of high-frequency details, and noticeable color shifts [
19,
20]. These degradation phenomena obscure the fine structural features of power lines and their components, severely compromising the performance of defect detection models. Overcoming these weather-induced limitations is therefore pivotal to enabling robust UAV-based inspection under clear-to-fog cross-domain conditions.
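The contrast loss described above can be illustrated with the widely used atmospheric scattering model, I = J·t + A·(1 − t) with t = exp(−β·d). The following is a minimal sketch under stated assumptions: the function names, the airlight value, the scattering coefficient, and the scene depth are all illustrative choices, and a single uniform depth is assumed for the whole row of pixels.

```python
import math


def transmission(beta: float, depth: float) -> float:
    """Transmission t = exp(-beta * depth) for a homogeneous medium."""
    return math.exp(-beta * depth)


def add_fog(pixels, beta, depth, airlight=0.9):
    """Apply I = J * t + A * (1 - t) to intensities in [0, 1].

    `airlight`, `beta`, and `depth` are illustrative values (assumptions).
    """
    t = transmission(beta, depth)
    return [j * t + airlight * (1.0 - t) for j in pixels]


def contrast(pixels):
    """Simple contrast measure: brightest minus darkest intensity."""
    return max(pixels) - min(pixels)


# A dark conductor (0.1) crossing a brighter background (0.6); made-up values.
clear_row = [0.6, 0.6, 0.1, 0.6, 0.6]
foggy_row = add_fog(clear_row, beta=0.08, depth=20.0)

print("clear contrast:", round(contrast(clear_row), 3))
print("foggy contrast:", round(contrast(foggy_row), 3))
```

Under this model the intensity difference between a thin component and its background is scaled by t, so the edge evidence for slender structures shrinks as fog density or distance increases.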
Existing studies on foggy inspection detection mainly follow two directions. The first focuses on mitigating the effects of fog through image preprocessing or restoration, aiming to enhance the visibility of input images before feeding them into a downstream detector. Within this direction, several studies rely on physics-based atmospheric scattering and light-extinction models, using transmission estimation, ambient light modeling, visibility-aware extinction estimation, spectral band-dependent attenuation analysis, or prior constraints to restore clear images or infer scene information under degraded weather conditions [
21,
22]. Another line of work uses deep learning to learn a direct mapping from degraded images to clear ones, often improving both defogging quality and fine-detail reconstruction [
23,
24,
25]. While such methods can improve image quality, their optimization objectives are typically geared toward visual enhancement rather than aligning with the downstream task of object detection. Moreover, errors introduced during the restoration stage may propagate to the detector and compromise final performance. For fine-grained tasks such as transmission line inspection, where small objects rely heavily on structural details, relying solely on front-end image recovery is often insufficient to ensure robust generalization across unseen foggy target domains. The restoration process also carries the risk of introducing artifacts, over-enhancement, or local structural distortions [
26,
27], which can propagate to and undermine downstream defect recognition. Additionally, employing defogging as an independent preprocessing step leaves image restoration and object detection inherently decoupled, hindering end-to-end optimization for the final task.
Another direction emphasizes robust feature representation learning at the detector level. Instead of treating foggy degradation merely as an input quality issue, these approaches aim to enhance the detector’s resilience to weather-induced perturbations at the feature level. Some methods focus on improving the robustness of power-component features through architectural design. In particular, attention mechanisms, multi-scale modeling, and context aggregation have been introduced to strengthen object saliency and discriminative feature extraction under adverse weather conditions [
28,
29,
30]. Beyond power component detection, related feature extraction and fusion strategies have also been investigated in broader remote sensing perception tasks, including hyperspectral imaging, infrared–visible fusion, and multi-source visual sensing. For example, hyperspectral anomaly detection methods have employed local contrast modeling, spatial–spectral gradient feature fusion, and spectral–spatial information fusion to enhance low-contrast anomalous regions and suppress complex background interference [
31,
32]. Hyperspectral video tracking methods further exploit spectral–spatial angle mapping and material–motion cue fusion to improve target–background discrimination and tracking robustness [
33,
34]. In addition, infrared–visible fusion and multi-source remote sensing detection studies have explored detection-guided fusion, multi-branch feature extraction, and cross-modal complementary feature interaction to improve perception robustness under complex imaging conditions [
35,
36,
37]. These studies suggest that robust perception in degraded or cluttered scenes benefits from explicit modeling of local contrast, gradient cues, complementary modality information, and spatial structural relationships. However, most of these methods rely on multi-band or multi-sensor inputs, whereas foggy transmission line inspection in this study is addressed under an RGB-based clear-to-fog setting. Meanwhile, context aggregation modules, such as SPPCSPC, SPPELAN, and ASPP, are commonly used in modern detection and segmentation frameworks. These modules adopt spatial pyramid pooling and its variants to expand receptive fields and integrate multi-scale semantic information [
11,
38,
39]. However, while convolutions, pooling, and aggressive downsampling help capture high-level semantics, they may also suppress fine-grained local details. This issue becomes particularly evident in low-resolution images and small-object scenarios [
40,
41]. These findings suggest that relying solely on large receptive fields and enriched high-level semantics is insufficient to fully compensate for the loss of fine-grained structural cues caused by foggy conditions.
A further line of work attempts to reduce the gap between clear and foggy scenes through cross-domain learning. In particular, some approaches employ domain adaptation methods by introducing unlabeled foggy target-domain images during training. Some strategies, such as adversarial alignment, statistical distribution matching, or pseudo-label self-training, are usually applied to improve detection performance on the target domain [
42,
43,
44]. While such methods have proven effective in generic cross-domain scenarios, their application in power inspection, particularly for foggy defect detection, remains limited [
45,
46]. The scarcity of real-world foggy images has prompted some studies to adopt the domain generalization (DG) strategy [
47,
48,
49]. This approach trains models exclusively on labeled fog-free power line images and aims to generalize to unseen foggy target domains without access to target-domain training samples. Despite its potential, DG has seldom been explicitly employed for foggy condition detection in power systems. In 2025, a model named YOLOv8-eRFD-AP [
50] demonstrated the feasibility of DG in power inspection. However, its design centers on general architectural enhancements and models fog-induced degradation only to a limited extent.
Even with these advances, one issue remains unresolved in foggy transmission line inspection. Under foggy conditions, high-frequency information, including edges, textures, and fine structures, is suppressed far more severely than low-frequency components such as overall luminance and coarse contours. For transmission line components, this implies that critical discriminative evidence, such as local edge continuity, slender structural contours, and tiny connecting regions, is already compromised at the imaging stage. Moreover, these weakened responses can be further smoothed during feature extraction and context aggregation, especially in pooling-dominated designs. However, existing approaches lack a unified treatment of fog-induced high-frequency attenuation and structural over-smoothing.
To address this issue, we propose a robust detection model for foggy transmission line inspection that incorporates frequency cues related to fog-induced degradation. The model is built upon the vanilla YOLOv7 framework, which provides an efficient one-stage, multi-scale detection pipeline. It is trained exclusively on clear-weather source-domain samples, without access to any foggy target-domain images, and is evaluated directly on the foggy target domain. To address the loss of local discriminative evidence caused by fog-induced high-frequency attenuation, we design a fog-aware gated compensation (FAGC) module. This module integrates the discrete wavelet transform (DWT) into high-resolution detection features. It constructs modulation coefficients based on the energy distribution of high- and low-frequency components that are sensitive to fog-induced degradation. These coefficients adaptively control the compensation strength of local residual information. Furthermore, by combining the wavelet statistics maps with a low-frequency-dominant appearance proxy to generate a spatial gating map, the proposed model selectively enhances degradation-sensitive regions. This mechanism compensates more effectively for the loss of evidence in small objects and edge-sensitive structures.
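The kind of frequency-band energy statistic that FAGC relies on can be sketched with a single-level Haar DWT. This is only an illustration under stated assumptions, not the module itself: the Haar wavelet, the helper names (`haar_dwt2`, `hf_ratio`, `add_uniform_fog`), the 4 × 4 toy patch, and the uniform-transmission fog are all hypothetical, and the actual modulation coefficients are defined in the method section.

```python
def haar_dwt2(img):
    """Single-level orthonormal 2D Haar DWT of an even-sized nested list.

    Returns (LL, LH, HL, HH); sub-band naming conventions vary by library.
    """
    ll, lh, hl, hh = [], [], [], []
    for r in range(0, len(img), 2):
        rows = ([], [], [], [])
        for c in range(0, len(img[0]), 2):
            a, b = img[r][c], img[r][c + 1]
            d, e = img[r + 1][c], img[r + 1][c + 1]
            rows[0].append((a + b + d + e) / 2.0)  # local average
            rows[1].append((a + b - d - e) / 2.0)  # change between rows
            rows[2].append((a - b + d - e) / 2.0)  # change between columns
            rows[3].append((a - b - d + e) / 2.0)  # diagonal change
        for band, row in zip((ll, lh, hl, hh), rows):
            band.append(row)
    return ll, lh, hl, hh


def energy(band):
    """Sum of squared coefficients in one sub-band."""
    return sum(v * v for row in band for v in row)


def hf_ratio(img):
    """Share of total wavelet energy carried by the three detail bands."""
    ll, lh, hl, hh = haar_dwt2(img)
    high = energy(lh) + energy(hl) + energy(hh)
    return high / (high + energy(ll))


def add_uniform_fog(img, t, airlight=0.9):
    """I = J * t + A * (1 - t) with one transmission value (assumption)."""
    return [[v * t + airlight * (1.0 - t) for v in row] for row in img]


# Toy patch: a thin dark vertical structure on a brighter background.
clear = [[0.6, 0.1, 0.6, 0.6] for _ in range(4)]
foggy = add_uniform_fog(clear, t=0.3)

print("high-frequency share, clear:", round(hf_ratio(clear), 4))
print("high-frequency share, foggy:", round(hf_ratio(foggy), 4))
```

In this toy case the detail-band share of the energy drops sharply after fogging, which is the type of signal a frequency-aware gate could use to decide where compensation is needed.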
However, compensating for damaged edges only at the high-resolution stage remains insufficient. If the subsequent context aggregation module relies on pooling-dominated structures, then the compensated weak responses may again be smoothed out during deep feature propagation. Given that power line components exhibit strongly directional geometric characteristics, we introduce direction-sensitive convolutions to enhance the modeling capacity for orientation-specific gradients. For this reason, we replace SPPCSPC with a structural-positional enhancement pyramid (SPEP). SPEP first performs pooling-free structural enhancement through a structure-aware directional aggregation (SADA) unit composed of parallel branches with direction-sensitive and dilated convolutions, and then applies a coordinate-guided positional refinement (CGPR) unit. This design strengthens the continuity representation of slender structures and improves localization and discriminability under complex backgrounds.
The main contributions of this paper are summarized as follows:
A framework for transmission line defect detection in fog is proposed. This framework is trained exclusively on clear-weather data and tested on foggy scenes with varying fog densities.
A fog-aware gated compensation (FAGC) module is proposed to explicitly introduce fog-related frequency priors into the detection process. By leveraging frequency band energy statistics and a spatial gating mechanism, local residual information is selectively compensated, thereby improving model adaptability to fog-induced local evidence degradation.
A structural–positional enhancement pyramid (SPEP) is designed to replace traditional pooling-dominated context aggregation with a pooling-free, multi-branch architecture. By integrating direction-sensitive convolutions and coordinate attention, it effectively mitigates the over-smoothing of slender structures and small objects under foggy conditions.
Extensive experiments are conducted on a public transmission line inspection dataset. Results demonstrate that the proposed method outperforms various mainstream models across multiple fog density conditions, exhibiting particularly strong robustness in detecting slender structures.
The remainder of this paper is organized as follows.
Section 2 introduces the proposed FogTLD-YOLO framework in detail.
Section 3 presents the experimental settings and a comprehensive evaluation of the model performance, including comparative and ablation experiments.
Section 4 provides further discussion and analysis of the proposed design.
Section 5 summarizes the main findings of this study and discusses future work.
3. Experiments and Analysis
3.1. Experimental Settings
The experiments were conducted on the PTL-AI Furnas Dataset [
55], an aerial-image dataset designed for UAV-based power transmission line inspection. The dataset contains 6295 images covering five component categories, namely baliser, bird nest, insulator, spacer, and stockbridge. Representative annotated examples and label definitions are shown in
Figure 6. Among these categories, insulator, baliser, and spacer are further divided into several sub-states.
The dataset was split into training, validation, and test sets in an 8:1:1 ratio using stratified sampling to preserve the original class distribution. Under the clear-to-fog setting considered in this study, the clear-weather training and validation subsets constituted the source domain. The target domain was constructed from the original clear-weather test set by applying the synthetic fog generation algorithm proposed in [
56]. Specifically, three fog density levels were generated by setting the thickness parameter to 0.05, 0.07, and 0.09, respectively, and the foggy target domain was formed by evenly mixing samples synthesized under these three settings, with each level accounting for one-third of the target-domain samples.
Figure 7 shows the visual differences between representative clear-weather reference images and their fog-synthesized counterparts under different fog density levels.
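The even mixing of the three fog levels can be sketched as follows. This is a hypothetical illustration: the text does not state how images were assigned to levels, so a seeded shuffle followed by round-robin assignment is assumed, and the file names and the test-set size of 630 are made up for the example. The fog synthesis of [56] itself is not reproduced here.

```python
import random
from collections import Counter

# Thickness values as stated in the text.
THICKNESS_LEVELS = (0.05, 0.07, 0.09)


def assign_fog_levels(image_ids, levels=THICKNESS_LEVELS, seed=0):
    """Map each test image to one fog level so that levels are evenly used.

    Shuffle-then-round-robin is an assumed assignment scheme.
    """
    ids = list(image_ids)
    random.Random(seed).shuffle(ids)
    return {img_id: levels[i % len(levels)] for i, img_id in enumerate(ids)}


# Hypothetical file names and test-set size.
test_ids = [f"img_{i:04d}.jpg" for i in range(630)]
assignment = assign_fog_levels(test_ids)
counts = Counter(assignment.values())

for level in THICKNESS_LEVELS:
    print(f"thickness {level}: {counts[level]} images")
```

Fixing the seed keeps the mixed target domain identical across all compared detectors, which matters when the same foggy test set is reused for every model.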
Some commonly used evaluation metrics for the object detection task were adopted in this study, including precision (P), recall (R), average precision (AP), and mean average precision (mAP). In all experiments, mAP50 was adopted as the main evaluation metric and is referred to as ‘mAP’ for brevity. In addition, mAP50-95 was reported in
Table 1 to evaluate detection performance under stricter localization thresholds. These metrics are given as follows:
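In their standard form, with $N$ denoting the number of categories and $AP_i$ the AP of the $i$-th category (symbols assumed here), these metrics can be written as:

```latex
\begin{align}
P &= \frac{TP}{TP + FP}, \\
R &= \frac{TP}{TP + FN}, \\
AP &= \int_{0}^{1} P(R)\,\mathrm{d}R, \\
mAP &= \frac{1}{N} \sum_{i=1}^{N} AP_i .
\end{align}
```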
Here, true positive (TP) denotes a predicted bounding box that correctly matches a ground-truth object of the same category under the specified IoU threshold, false positive (FP) denotes an incorrect prediction, and false negative (FN) denotes a missed ground-truth object. AP denotes the area under the precision–recall curve for a given category, and mAP is the average AP over all categories.
All experiments were conducted on a workstation equipped with an NVIDIA GeForce RTX 4080 GPU (16 GB memory) running Ubuntu 22.04.4 LTS. The proposed model and all the comparative algorithms were implemented in PyTorch 2.9.1 with CUDA 12.6 acceleration, using Python 3.11.14.
To ensure a fair and reproducible comparison, all detectors, including YOLO-series one-stage detectors and two-stage detectors, were trained and evaluated using the same dataset and training protocol and were reproduced based on their official implementations [
5,
6,
7,
8,
9,
10,
11,
12,
13,
16,
17,
50]. Specifically, all detectors were trained for 200 epochs with a batch size of 16 and an input size of
pixels. The comparative algorithms used the default optimization settings of their official implementations, while the proposed model was trained using stochastic gradient descent (SGD) with Nesterov momentum and an initial learning rate of 0.01. No additional foggy images were used for model training or fine-tuning. For evaluation, the checkpoint that achieved the best performance on the source-domain validation set during training was selected for testing.
3.2. Experimental Results
Comparative Experiments and Model Performance Validation
To comprehensively evaluate the effectiveness of the proposed method, we compared FogTLD-YOLO with a set of representative detectors under unified training and testing protocols. The comparative algorithms consist of representative generic one-stage detectors, namely RetinaNet [
7], YOLOv5s [
8], YOLOv8s [
10], YOLOv9s [
11], YOLOv10n [
12], and YOLOv11n [
13]. To mitigate the influence of model-scale differences, larger YOLO variants, including YOLOv5l [
8], YOLOv8l [
10], and YOLOv11l [
13], are also incorporated into the comparison. In addition, two representative two-stage detectors, namely Faster R-CNN [
5] and Mask R-CNN [
6], are evaluated to provide a broader assessment across different detection paradigms. Finally, several task-oriented transmission line inspection models are also included, namely Lite-YOLO-ID [
16], CACS-YOLO [
17], and YOLOv8-eRFD-AP [
50]. The quantitative results are summarized in
Table 1.
In
Table 1, FogTLD-YOLO achieves the best overall detection performance in the mixed-fog target domain, with 78.7% precision, 79.7% recall, 82.1% mAP, and 55.1% mAP50-95. Among all comparative algorithms, YOLOv7 provides the most competitive baseline performance, achieving 79.3% mAP and 53.6% mAP50-95. Compared with YOLOv7, FogTLD-YOLO increases mAP and mAP50-95 to 82.1% and 55.1%, respectively, while reducing the number of parameters from 36.53 M to 34.64 M and slightly decreasing FLOPs from 103.30 G to 102.70 G.
Figure 8 provides an intuitive comparison of the performance–complexity trade-off across different detectors. In terms of inference speed, FogTLD-YOLO reaches 131.58 FPS, which places it in the middle-to-upper range among the compared detectors and indicates real-time processing capability.
General detectors still retain competitive values on some individual metrics, but their overall performance decreases in the unseen foggy target domain. YOLOv5s reaches a precision of 80.0%, whereas its recall and mAP are 53.1% and 56.9%, respectively. YOLOv8s, YOLOv9s, YOLOv10n, and YOLOv11n also exhibit relatively low overall performance in the clear-to-fog setting. The results of larger YOLO variants further suggest that increasing model capacity alone may not be sufficient to improve foggy-domain robustness. Specifically, YOLOv5l and YOLOv8l obtain mAP values of 48.6% and 54.3%, respectively, and do not outperform their smaller-scale counterparts. Although YOLOv11l improves upon YOLOv11n, its mAP remains at 64.1%, still considerably lower than the 82.1% achieved by FogTLD-YOLO. This performance gap is more evident in defect-related categories such as insulator_nok and spacer_nok. For spacer_nok, the AP values of YOLOv8s, YOLOv9s, YOLOv10n, and YOLOv11n are only 28.7%, 38.6%, 42.5%, and 25.6%, respectively. RetinaNet, Mask R-CNN, and Faster R-CNN show a similar tendency, especially on slender and small structures. Faster R-CNN obtains 37.9% mAP and 19.8% mAP50-95 in the foggy target domain. For spacer_nok, the AP values of Mask R-CNN and Faster R-CNN are only 5.45% and 8.33%, respectively. This indicates that proposal-based two-stage detectors are less stable for defect-sensitive categories under foggy degradation. Overall, categories that rely on thin structures, local boundaries, and small abnormal regions are the most affected under foggy conditions.
A similar trend is found in task-oriented transmission line inspection detectors. Lite-YOLO-ID and CACS-YOLO obtain recall values of 33.0% and 39.9% and mAP values of 38.9% and 44.3%, respectively. Their category-level performance is also low. For insulator_nok, the AP values are 44.7% and 46.8%. For spacer_nok, they further decrease to 25.1% and 30.8%. These results indicate difficulty in preserving local defect cues and slender component structures in the present clear-to-fog setting.
YOLOv8-eRFD-AP performs better than the other task-oriented detectors. Its recall reaches 53.3%, and its mAP reaches 58.1%. However, the AP values on insulator_nok and spacer_nok remain at 51.4% and 48.4%, respectively. Compared with FogTLD-YOLO, the gap appears not only in overall mAP but also in the categories that depend more strongly on local continuity and small structural deviations.
Figure 9 shows qualitative comparison results under foggy conditions. Ground Truth (GT) denotes the annotated reference results. Red, yellow, and green boxes denote false positives, missed detections, and correct detections, respectively. The compared models include YOLOv7, YOLOv9, CACS-YOLO, and YOLOv8-eRFD-AP. In each row of
Figure 9, the three columns correspond to a background-induced false-positive case, a small object missed detection case under heavy fog, and a category confusion case with adjacent structures, respectively.
The image-level results help explain where the quantitative differences come from. Around the upper Stockbridge dampers, several comparative detectors produce extra responses, and some predictions extend to nearby background regions. Under heavy fog, missed detections increase after the visibility of insulator strings and neighboring slender parts decreases. In densely arranged insulator structures and connection regions, some detectors generate redundant boxes or assign incorrect categories. In these regions, FogTLD-YOLO produces fewer false positives and missed detections, and its predicted boxes are closer to the annotated object locations.
Figure 10 presents the corresponding feature activation heatmaps [
57,
58,
59]. In each row of
Figure 10, the three columns correspond to a background interference scene, a dense slender-component scene under fog, and a critical-connection region scene, respectively. The color changes from blue to red as the feature response increases. In YOLOv7, high-response regions extend beyond the objects and spread to the tower body, the sky boundary, and other background areas. YOLOv9 shows stronger responses on part of the objects, but the activations remain scattered in regions containing adjacent slender structures. CACS-YOLO and YOLOv8-eRFD-AP place more response on insulator strings and component bodies, but the separation between neighboring slender structures is still limited, and the response on connection nodes is not always concentrated.
For FogTLD-YOLO, high-response regions are more concentrated on insulator strings, spacer bodies, and connection nodes. High responses in background regions, including tower structures, sky, and vegetation, are more limited. Along elongated component paths, the responses are also less fragmented. This observation is consistent with the visual results in
Figure 9, where missed detections, redundant boxes, and category confusion occur more often in the locations where other detectors show response spreading or response interruption.
3.3. Ablation Study
Ablation studies were conducted to assess the effectiveness of each proposed component. Three variants were designed as follows:
Variant I: The vanilla YOLOv7 model, where neither the fog-aware gated compensation module (FAGC) nor the structural–positional enhancement pyramid (SPEP) is introduced.
Variant II: Without the structural–positional enhancement pyramid (SPEP).
Variant III: Without the fog-aware gated compensation (FAGC) module.
As shown in
Table 2, the proposed FogTLD-YOLO achieves the best overall performance, reaching 82.1% mAP in the foggy target domain. Relative to the complete model, removing either FAGC or SPEP reduces mAP from 82.1% to 81.0%, whereas removing both modules further decreases mAP to 79.3%. These results indicate that both FAGC and SPEP make effective contributions to the detection task, while their joint integration yields the strongest overall performance.
The precision and recall values reflect different effects after removing the two modules. Without SPEP, precision drops from 78.7% to 71.0%, whereas recall increases slightly from 79.7% to 81.1%. This change suggests reduced suppression of structurally ambiguous responses during deep feature aggregation. Without FAGC, precision rises from 78.7% to 80.9%, but recall decreases from 79.7% to 76.4%. This result is more closely related to the loss of compensation for weakened local evidence under foggy conditions. When the two modules are used together, the model reaches 78.7% precision and 79.7% recall and achieves the highest overall mAP among all variants, with a more balanced precision–recall trade-off.
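One way to make the "more balanced trade-off" concrete is to combine each variant's precision and recall into an F1 score. F1 is not reported in the text, so the following is a derived illustration computed only from the precision and recall values quoted above; the dictionary keys are labels chosen here.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same units as the inputs)."""
    return 2.0 * precision * recall / (precision + recall)


# Precision and recall (%) in the foggy target domain, as quoted in the text.
variants = {
    "FogTLD-YOLO (FAGC + SPEP)": (78.7, 79.7),
    "without SPEP": (71.0, 81.1),
    "without FAGC": (80.9, 76.4),
}

f1 = {name: f1_score(p, r) for name, (p, r) in variants.items()}
for name, value in f1.items():
    print(f"{name}: F1 = {value:.1f}")
```

With these inputs the complete model has the highest F1 (about 79.2, versus about 75.7 without SPEP and 78.6 without FAGC), which is consistent with the mAP ranking of the variants.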
The category-wise results further support the different effects of the two modules. Compared with YOLOv7, FogTLD-YOLO improves five of the nine object categories, namely baliser_aok, insulator_ok, insulator_nok, spacer_ok, and spacer_nok. Among them, the most evident gains appear on insulator_nok and spacer_nok, with AP increases of 3.3 and 23.0 percentage points, respectively. These two categories are more dependent on local abnormal cues, edge continuity, and fine structural details, which are more easily weakened under foggy conditions. For categories with already high baseline AP or weaker dependence on local defect evidence, the changes are relatively small. For example, insulator_ok changes from 95.5% to 95.6%, and bird nest changes from 82.7% to 82.4%. This indicates that the proposed modules mainly improve fog-sensitive and defect-sensitive categories, rather than producing uniform AP increases across all categories. Relative to the complete model, Variant III shows a larger decrease on degradation-sensitive categories, with the largest drop of 3.5 percentage points on spacer_nok. Variant II shows more evident reductions on categories that depend on structural continuity and localization stability, including decreases of 2.8 percentage points on spacer_nok and 1.2 percentage points on spacer_ok. These category-level changes suggest that FAGC is more closely related to preserving degradation-sensitive local cues, whereas SPEP contributes more to structural continuity and localization stability during deep aggregation. Although minor decreases are observed in a few categories, the complete model still achieves the highest overall mAP, increasing from 79.3% to 82.1%. Together, these results support the complementary effects of the two modules in the final model.
4. Discussion
4.1. Impact of the Insertion Position of FAGC
FAGC was inserted separately into the P3, P4, and P5 detection branches, and also jointly into all three branches, to examine how the insertion position affects detection performance. In the following discussion, the latter is denoted as the joint P3–P5 configuration.
As shown in
Figure 11, P3 gives the highest mAP, reaching 81.0%. When FAGC is inserted into the deeper branches, the performance decreases slightly. The joint P3–P5 configuration yields 78.7% mAP, which is 2.3 percentage points lower than that of P3.
This result is related to the role of FAGC in local feature modulation under foggy conditions. In the high-resolution prediction branch P3, local edge continuity, slender structures, and small abnormal regions are still represented more explicitly, and the effect of frequency-aware local compensation is therefore easier to retain at this stage. Since fog-induced degradation primarily suppresses high-frequency components, such as edges and fine structural details, FAGC can produce a more evident compensation effect in P3. In P4 and P5, these local structural cues become weaker and more abstract as the spatial resolution decreases, and the gain brought by FAGC correspondingly becomes smaller.
A similar tendency can also be seen at the category level. The categories insulator_nok and spacer_nok are taken as representative examples because both are associated with slender structures or small abnormal regions. When FAGC is inserted only into P3, the AP of insulator_nok reaches 91.3%, and the AP of spacer_nok reaches 69.8%. These are the highest values among the compared insertion settings. Extending FAGC to all three levels leads to lower results, indicating that direct multi-level insertion does not provide additional benefit in the present setting.
These results support placing FAGC in the high-resolution P3 branch, where fog-weakened local cues are still retained more clearly. In deeper branches, the effect of FAGC becomes less evident.
4.2. Ablation Study on the Gating Designs in FAGC
Three gating designs are compared in
Table 3. One uses only wavelet-statistics maps, one uses only the low-frequency-dominant appearance proxy, and the third combines both cues.
The combined design yields the highest overall mAP at 81.0%. The wavelet-only and appearance-only settings reach 79.5% and 79.7%, respectively. This pattern indicates that the gate is more effective when the wavelet-statistics maps and the low-frequency-dominant appearance proxy are used together.
Differences become clearer at the category level. With wavelet statistics alone, spacer_nok reaches 74.5%, the highest value among the three designs. The corresponding results for bird nest and stockbridge_ok, however, are 79.1% and 75.6%. When only the appearance proxy is used, the distribution across categories becomes more even. In this case, insulator_nok, bird nest, and stockbridge_ok rise to 90.7%, 83.1%, and 77.8%, respectively, whereas spacer_nok falls to 60.9%, well below the wavelet-only setting.
These category-level shifts point to different roles of the two cues. The wavelet-only design shows a stronger response on spacer_nok, a category that depends more heavily on fine structures and local geometric abnormalities. The appearance-only design gives more balanced results on insulator_nok, bird nest, and stockbridge_ok, which suggests that the appearance proxy contributes more to local context discrimination and suppression of irrelevant background responses.
The joint design retains the advantages of both cues to a greater extent. In this design, insulator_nok, bird nest, and stockbridge_ok reach 91.3%, 83.3%, and 79.0%, respectively, which are the most favorable results among the three designs. Although the result for spacer_nok does not exceed that of the wavelet-only design, it remains high at 69.8%. These results indicate that using wavelet-statistics maps together with the low-frequency-dominant appearance proxy provides a more suitable gating design for FAGC. The former highlights regions with weakened structural responses. The latter helps suppress ineffective compensation in smooth foggy backgrounds.
4.3. Effect of Convolutional Branch Design in SADA
Figure 12 and
Figure 13 illustrate how detection performance varies with the convolutional branch design within the SADA module. In this analysis, the CGPR module and the dilated-context branch remain unchanged, while variations are introduced only in the first two structural branches. The compared settings include Variant I, with an isotropic design, Variant II, with a large-kernel isotropic design, Variant III, with a strong-directional design, and the SADA setting, which employs asymmetric 1 × 3 and 3 × 1 convolutional branches. The three variants are designed as follows:
Variant I: The two structural branches adopt the same configuration, namely successive and convolutional modules followed by a convolutional module.
Variant II: The two structural branches adopt the same configuration, namely successive and convolutional modules followed by a convolutional module.
Variant III: The two structural branches adopt different directional configurations: The first branch uses successive and convolutional modules followed by a convolutional module, whereas the second branch uses successive and convolutional modules followed by a convolutional module.
As shown in Figure 12, the adopted SADA setting achieves the highest mAP of 81.0%, whereas Variant I, Variant II, and Variant III reach 78.0%, 80.2%, and 79.2%, respectively. These results suggest that the effectiveness of SADA may not be fully explained by receptive field enlargement alone, but is more closely related to the joint effect of directional sensitivity and appropriate receptive field construction. Specifically, the larger-kernel isotropic design and the stronger directional design provide larger spatial or directional receptive fields than the adopted 1 × 3/3 × 1 branches, yet neither exceeds the adopted design, indicating that increasing the convolutional range alone may be insufficient for fog-degraded slender-component detection. Conventional isotropic convolutions provide limited structural discrimination for slender objects, while larger kernels may introduce additional neighboring responses and weaken fine structural details. Meanwhile, an overly strong directional constraint may reduce the flexibility needed to describe local connection regions and subtle defects. Therefore, the adopted 1 × 3/3 × 1 branches provide a more suitable balance between direction-sensitive local modeling and receptive field construction, enabling SADA to better preserve structural continuity under foggy conditions.
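The directional argument above can be illustrated with a toy example. The following is a minimal sketch, not the SADA implementation: it applies fixed averaging kernels of shape 1 × 3, 3 × 1, and 3 × 3 to a small binary map containing a one-pixel-wide horizontal line. The helper `conv2d_same`, the toy input, and the averaging weights are all assumptions made for illustration; the actual branches use learned weights.

```python
def conv2d_same(image, kernel):
    """Zero-padded 2D cross-correlation that keeps the input size (pure Python)."""
    h, w = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    ph, pw = kh // 2, kw // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for i in range(kh):
                for j in range(kw):
                    yy, xx = y + i - ph, x + j - pw
                    if 0 <= yy < h and 0 <= xx < w:
                        acc += image[yy][xx] * kernel[i][j]
            out[y][x] = acc
    return out


# Toy 5 x 5 map with a one-pixel-wide horizontal line (hypothetical stand-in
# for a slender component such as a conductor).
line = [[1.0 if y == 2 else 0.0 for x in range(5)] for y in range(5)]

k_1x3 = [[1 / 3, 1 / 3, 1 / 3]]          # horizontal 1 x 3 averaging kernel
k_3x1 = [[1 / 3], [1 / 3], [1 / 3]]      # vertical 3 x 1 averaging kernel
k_3x3 = [[1 / 9] * 3 for _ in range(3)]  # isotropic 3 x 3 averaging kernel

# Response at the line centre (row 2, column 2).
on_line_1x3 = conv2d_same(line, k_1x3)[2][2]
on_line_3x1 = conv2d_same(line, k_3x1)[2][2]
on_line_3x3 = conv2d_same(line, k_3x3)[2][2]

# Response one row above the line, i.e. leakage into the neighbourhood.
off_line_1x3 = conv2d_same(line, k_1x3)[1][2]
off_line_3x3 = conv2d_same(line, k_3x3)[1][2]
```

In this toy case, the kernel aligned with the line keeps the full response on the line and produces no response in the adjacent row, whereas the isotropic kernel dilutes the on-line response and spreads part of it to neighboring rows. This is only a qualitative picture of why direction-sensitive branches can help preserve slender structures.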
The same difference can also be seen in the category-level results in Figure 13. Under the adopted SADA setting, insulator_ok, insulator_nok, and spacer_nok reach 95.8%, 91.1%, and 69.1%, respectively, which are the highest values among the evaluated settings. Variant II, which uses larger isotropic kernels, still gives relatively high results on insulator categories. Its spacer_nok performance, however, drops to 65.9%. Variant III obtains 95.0% on insulator_ok, but its results on insulator_nok and spacer_nok are lower than those of the adopted SADA setting. These category-level changes indicate that an excessively strong directional bias is less favorable for preserving local defect sensitivity together with the continuity of slender structures.
These findings further elucidate the role of branch design within SADA. Its contribution is related not only to receptive-field expansion, but also to the way directional sensitivity is introduced into structural representation. The adopted asymmetric 1 × 3 and 3 × 1 convolutional design achieves a more suitable balance between structural continuity and local defect sensitivity, which is particularly important for slender structures and defect-sensitive categories in foggy scenes.
4.4. Impact of Proposed Modules on Different Detector Architectures
Table 4 compares the effects of FAGC and SPEP across the evaluated detector architectures. From single-module insertion to joint use of the two modules, overall mAP improves consistently across the evaluated YOLO-based one-stage detectors. The joint configuration reaches 60.9% for YOLOv5, 71.1% for YOLOv9, 57.1% for YOLOv10, and 82.1% for YOLOv7. To further examine the transferability of the proposed modules to a different detection paradigm, Faster R-CNN is additionally introduced as a representative two-stage detector. The results show that direct transfer to the two-stage framework leads to less consistent performance gains. Specifically, FAGC slightly improves the mAP of Faster R-CNN from 37.88% to 38.93%, suggesting that fog-aware local compensation can still provide some benefit for weakened local evidence. However, SPEP decreases the mAP to 36.52%.
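For reference, the Faster R-CNN changes quoted above can be expressed both in absolute percentage points and as relative changes. The short sketch below uses only the three mAP values stated in the text; the helper name `map_change` is hypothetical.

```python
def map_change(baseline, variant):
    """Return (absolute change in percentage points, relative change in percent)."""
    delta = variant - baseline
    return delta, 100.0 * delta / baseline


faster_rcnn_baseline = 37.88  # mAP (%) of Faster R-CNN, as reported in the text
with_fagc = 38.93             # mAP (%) after inserting FAGC
with_spep = 36.52             # mAP (%) after inserting SPEP

fagc_abs, fagc_rel = map_change(faster_rcnn_baseline, with_fagc)
spep_abs, spep_rel = map_change(faster_rcnn_baseline, with_spep)

print(f"FAGC: {fagc_abs:+.2f} points ({fagc_rel:+.1f}% relative)")
print(f"SPEP: {spep_abs:+.2f} points ({spep_rel:+.1f}% relative)")
```

Under these numbers, FAGC adds about 1.05 points and SPEP removes about 1.36 points, so both effects on the two-stage detector are small compared with the gains reported for the YOLO-based detectors.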
Category-level results further separate the roles of the two modules. FAGC brings larger gains on baliser_nok and insulator_nok, where recognition depends more strongly on local defects and fine structural evidence. This effect appears in all four detector groups and is more consistent with the role of FAGC in frequency-aware compensation of fog-weakened local cues. SPEP contributes more on bird nest, spacer_ok, and spacer_nok, categories that rely more on structural continuity and direction-aware deep aggregation. The difference appears more clearly in YOLOv5 and YOLOv7.
The joint configuration produces the largest overall gains among the evaluated YOLO-based one-stage detectors. Across the evaluated architectures, it yields the best overall mAP and keeps competitive performance on defect-sensitive and structure-sensitive categories. FAGC mainly compensates local discriminative cues under foggy conditions, whereas SPEP better preserves structural continuity during deep feature aggregation. Used together, the two modules show complementary effects and address fog-induced degradation more broadly across YOLO-based one-stage detector architectures. In contrast, the results on Faster R-CNN indicate that FAGC and SPEP are not directly transferable to all detection paradigms in a plug-and-play manner. One possible reason is that YOLO-based one-stage detectors directly use multi-scale neck features for classification and localization, whereas two-stage detectors rely on region proposal generation and RoI-level feature extraction. When directly integrated into a proposal-based pipeline, the proposed modules may alter the features used for proposal generation and RoI-level refinement. Therefore, architecture-specific adaptation may be required when extending FAGC and SPEP to two-stage detectors.
4.5. Evaluation Across Fog Densities and Training Protocols
To analyze the influence of fog density and training protocol on detection performance, two training protocols are compared. The first follows the basic domain generalization protocol of this study, where only clear-weather images are used for training. The second adopts supervised mixed-fog training, where the original clear-weather training images are converted into synthetic-fog images at fog densities of 0.05, 0.07, and 0.09. Each density accounts for one third of the generated training samples. The proposed model is then evaluated under clear-weather and fog density-specific test conditions.
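The mixed-fog training split can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' pipeline: the function names, file names, and random seed are hypothetical, and the fog rendering uses the standard atmospheric scattering model, I = J·t + A·(1 − t) with t = exp(−β·d), which this section does not explicitly specify. The depth value d and the airlight A are placeholders.

```python
import math
import random
from collections import Counter

FOG_DENSITIES = (0.05, 0.07, 0.09)  # fog densities named in the text


def assign_fog_densities(image_ids, densities=FOG_DENSITIES, seed=0):
    """Assign one density per image so each density covers an equal share (up to rounding)."""
    ids = list(image_ids)
    random.Random(seed).shuffle(ids)
    return {img: densities[i % len(densities)] for i, img in enumerate(ids)}


def foggy_pixel(clear_value, depth, beta, airlight=1.0):
    """Assumed atmospheric scattering model: I = J * t + A * (1 - t), t = exp(-beta * depth)."""
    transmission = math.exp(-beta * depth)
    return clear_value * transmission + airlight * (1.0 - transmission)


# Hypothetical file names; nine images give exactly three per density.
assignment = assign_fog_densities([f"img_{i:04d}.jpg" for i in range(9)])
counts = Counter(assignment.values())

# A dark pixel (0.2) at a hypothetical depth of 10 units becomes brighter as beta grows.
light = foggy_pixel(0.2, depth=10.0, beta=0.05)
dense = foggy_pixel(0.2, depth=10.0, beta=0.09)
```

Shuffling before the round-robin assignment keeps the three density groups the same size while avoiding any dependence on file order.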
As shown in Figure 14, the two training settings exhibit different performance tendencies across test conditions. Under clear-weather training, FogTLD-YOLO achieves 90.9% mAP on the clear-weather test set. As the fog density increases from 0.05 to 0.07 and 0.09, the mAP decreases from 88.7% to 82.1% and 76.1%, respectively. The result on the mixed-fog test set is 82.1%. This trend indicates that the original training protocol preserves favorable performance under clear-weather conditions, but becomes increasingly affected by stronger fog degradation. Under supervised mixed-fog training, the performance trend is different. The mAP on the clear-weather test set decreases to 79.3%, whereas the mAP values on fog densities of 0.05, 0.07, and 0.09 reach 86.9%, 89.6%, and 90.3%, respectively. The mAP on the mixed-fog test set is 90.1%. This result shows that supervised mixed-fog training improves adaptation to synthetic foggy inputs, especially under higher fog density test conditions, but reduces compatibility with clear-weather images.
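The trade-off between the two protocols can be tabulated directly from the mAP values quoted above. The sketch below only restates those numbers and computes the per-condition gap; the variable names are illustrative.

```python
conditions = ["clear", "fog 0.05", "fog 0.07", "fog 0.09", "mixed fog"]
clear_trained = [90.9, 88.7, 82.1, 76.1, 82.1]      # mAP (%) under clear-weather training
mixed_fog_trained = [79.3, 86.9, 89.6, 90.3, 90.1]  # mAP (%) under supervised mixed-fog training

# Positive gap: mixed-fog training is better; negative gap: clear-weather training is better.
gaps = {
    condition: round(mixed - clear, 1)
    for condition, clear, mixed in zip(conditions, clear_trained, mixed_fog_trained)
}

for condition, gap in gaps.items():
    print(f"{condition:>9}: {gap:+.1f} points (mixed-fog minus clear-weather training)")
```

With these values, mixed-fog training loses 11.6 points on clear-weather images and 1.8 points at density 0.05, but gains 7.5 and 14.2 points at densities 0.07 and 0.09, so the benefit is concentrated at the denser fog levels.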
To further analyze the feature-distribution discrepancy between clear-weather and foggy domains, we calculate the maximum mean discrepancy (MMD) [60] under different fog density conditions. Specifically, features from the high-resolution P3 branch before the detection head are extracted to calculate MMD between clear-weather-domain samples and foggy-domain samples. A smaller MMD value indicates a smaller feature distribution discrepancy. As shown in Table 5, FogTLD-YOLO consistently yields lower MMD values than YOLOv7 under all evaluated fog density conditions, reducing the MMD by at least 0.077 in absolute terms and by at least 38.1% in relative terms. This result indicates that FogTLD-YOLO reduces the feature-level gap between the clear-weather source domain and foggy target domains, thereby improving the robustness of clear-weather-trained features under different fog density test conditions.
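As a reminder of how MMD behaves, the following minimal sketch computes a biased estimate of the squared MMD with a Gaussian (RBF) kernel on toy feature vectors. The kernel choice, the bandwidth `sigma`, the estimator, and the toy vectors are assumptions for illustration; the text does not state which kernel or estimator was used, and real inputs would be pooled P3 features rather than two-dimensional points.

```python
import math


def rbf_kernel(x, y, sigma):
    """Gaussian (RBF) kernel between two feature vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def mmd_squared(source, target, sigma=1.0):
    """Biased estimate of squared MMD between two sets of feature vectors."""
    def mean_kernel(xs, ys):
        total = sum(rbf_kernel(x, y, sigma) for x in xs for y in ys)
        return total / (len(xs) * len(ys))

    return (
        mean_kernel(source, source)
        + mean_kernel(target, target)
        - 2.0 * mean_kernel(source, target)
    )


# Toy two-dimensional stand-ins for feature vectors (hypothetical values).
clear_feats = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2]]
light_fog_feats = [[0.1, 0.1], [0.3, 0.1], [0.2, 0.3]]
dense_fog_feats = [[0.9, 1.0], [1.1, 0.8], [1.0, 1.1]]

mmd_same = mmd_squared(clear_feats, clear_feats)
mmd_light = mmd_squared(clear_feats, light_fog_feats)
mmd_dense = mmd_squared(clear_feats, dense_fog_feats)
```

Identical feature sets give a value of zero, and the value grows as the foggy features drift further from the clear-weather ones, which is the sense in which a smaller MMD indicates a smaller domain gap.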
Figure 15 further shows the feature activation heatmap results of the mixed-fog-trained FogTLD-YOLO under different fog density inputs. For each row, the four images from left to right correspond to clear-weather, fog density 0.05, fog density 0.07, and fog density 0.09, respectively. As the fog density changes, the main high-response regions remain located around annotated transmission line objects, while diffuse responses in surrounding vegetation, tower structures, and cluttered backgrounds are not obviously amplified. This observation shows that the model maintains object-oriented spatial responses under different synthetic fog density conditions.
Overall, the MMD and feature activation heatmap analyses provide complementary evidence for the behavior of FogTLD-YOLO under different fog density conditions. MMD quantitatively measures the feature distribution discrepancy between clear-weather and foggy domains, while the feature activation heatmaps qualitatively show the spatial response behavior under different fog density inputs. Together with the performance results, these analyses show that the test performance of FogTLD-YOLO is affected by both fog density and training protocol. Clear-weather training better preserves the original-domain performance and still provides clear-to-fog robustness, whereas supervised mixed-fog training improves performance on synthetic foggy test subsets but sacrifices clear-weather compatibility.
5. Conclusions
This study proposes FogTLD-YOLO, an end-to-end framework to address the challenge of transmission line defect detection under foggy conditions with UAV-based inspection. By jointly modeling fog-induced degradation and structure-preserving feature aggregation, FogTLD-YOLO enhances weakened local evidence while maintaining the continuity and positional sensitivity of slender structures. In this way, the proposed model provides a targeted detection framework for UAV-based transmission line inspection when only clear-weather training data are available.
Extensive experiments under mixed fog density conditions demonstrate the effectiveness of FogTLD-YOLO. On the foggy target domain constructed with density levels of 0.05, 0.07, and 0.09, FogTLD-YOLO achieves 78.7% precision, 79.7% recall, and 82.1% mAP. Specifically, the proposed model achieves 92.0% AP on insulator_nok and 72.6% AP on spacer_nok, showing clear advantages on defect-sensitive and slender structure categories. Moreover, FogTLD-YOLO outperforms YOLOv7, the best-performing generic detector, by 2.8 percentage points in mAP. Compared with representative task-oriented detectors, the performance gain remains substantial, further indicating the robustness and practical value of the proposed design for foggy transmission line inspection.
Overall, the results indicate that explicit enhancement of degradation-sensitive local cues together with structure-aware aggregation is effective for robust defect detection in foggy inspection scenarios. Although the synthetic fog protocol enables controlled evaluation of clear-to-fog robustness under different fog densities, it cannot fully reproduce real foggy UAV inspection conditions. Real scenarios may involve spatially nonuniform fog, illumination variation, depth-dependent visibility changes, wind-induced motion, and sensor noise. Therefore, further evaluation on real foggy transmission line inspection data remains an important direction for future work.