4.1. Ablation Study
Ablation experiments are conducted to verify the contribution of each module. The comparison starts from single-modality YOLOv11 baselines and a simple concatenation-based RGB–T baseline (YOLO-concat), followed by progressively adding the MBConv lightweight backbone, SDFM, CGA, and NWD. The results are listed in Table 8.
The simple concatenation baseline already outperforms the single-modality models, indicating that cross-modal complementarity is beneficial. However, it also increases the computational burden and provides only limited gains under difficult conditions. Replacing the heavy backbone blocks with MBConv significantly improves computational efficiency: compared with YOLO-concat, Network 1 (MBConv only) reduces GFLOPs from 8.9 to 5.0 and increases FPS from 38 to 79. At the same time, its mAP@0.5 decreases from 90.1% to 87.3%, a drop of 2.8 percentage points. This decrease reflects an accuracy–efficiency trade-off caused by lightweight feature extraction: MBConv reduces computational cost and improves inference speed, but lightweight extraction alone weakens feature representation and cannot sufficiently handle cross-modal misalignment, weak smoke boundaries, and small-target localization.
After introducing SDFM, the mAP@0.5 increases from 87.3% to 90.4%, which not only recovers most of the accuracy loss caused by lightweighting but also slightly surpasses the simple concatenation baseline. This suggests that shallow cross-modal alignment and denoising help compensate for the reduced representation capacity of the lightweight backbone. The further addition of CGA improves mAP@0.5 to 92.9%, showing that adaptive RGB–TIR fusion can more effectively exploit complementary visible-light texture and thermal saliency under complex backgrounds. Finally, NWD-based regression further improves the localization robustness of small and distant fire spots, resulting in the complete YOLO-MMSC with 94.6% mAP@0.5, 6.4 GFLOPs, and 64 FPS. These results indicate that SDFM and CGA not only enhance cross-modal perception but also compensate for the accuracy loss introduced by lightweighting, achieving a better accuracy–efficiency balance for edge deployment.
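The stepwise changes described above can be recomputed from the figures quoted in the text. The following Python sketch is only an illustration: the stage labels and the helper name `pp_delta` are introduced here for convenience, and GFLOPs and FPS for the intermediate SDFM and CGA stages are left as `None` because they are not quoted in this section (see Table 8).

```python
# Ablation values quoted in the text: (stage, mAP@0.5 in %, GFLOPs, FPS).
# Stage labels are illustrative; None marks values not quoted in this section.
stages = [
    ("YOLO-concat", 90.1, 8.9, 38),
    ("+MBConv (Network 1)", 87.3, 5.0, 79),
    ("+SDFM", 90.4, None, None),
    ("+CGA", 92.9, None, None),
    ("+NWD (YOLO-MMSC)", 94.6, 6.4, 64),
]


def pp_delta(prev: float, curr: float) -> float:
    """Change in percentage points, rounded to one decimal place."""
    return round(curr - prev, 1)


# mAP@0.5 change contributed by each added component.
map_deltas = {
    stages[i][0]: pp_delta(stages[i - 1][1], stages[i][1])
    for i in range(1, len(stages))
}

# Efficiency of the lightweight backbone and of the full model vs. YOLO-concat.
base, light, full = stages[0], stages[1], stages[-1]
gflops_cut_light = round(100 * (base[2] - light[2]) / base[2], 1)
gflops_cut_full = round(100 * (base[2] - full[2]) / base[2], 1)
fps_gain_light = round(light[3] / base[3], 2)
fps_gain_full = round(full[3] / base[3], 2)
net_map_gain = pp_delta(base[1], full[1])

print(map_deltas)
print(gflops_cut_light, fps_gain_light)  # 43.8 2.08
print(gflops_cut_full, fps_gain_full)    # 28.1 1.68
print(net_map_gain)                      # 4.5
```

Read this way, the 2.8-point loss from MBConv alone is more than recovered by SDFM (+3.1), and the complete model ends 4.5 points above YOLO-concat while using about 28% fewer GFLOPs.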
4.4. Performance Comparison and Literature Positioning
Table 11 compares the proposed detector with representative baselines, including Faster R-CNN, YOLOv5, YOLOv8, single-modality YOLOv11 variants, and the simple RGB–T concatenation baseline. YOLO-MMSC achieves the best overall detection accuracy and precision among the evaluated methods while maintaining real-time speed on both the workstation and edge hardware.
Compared with single-modality YOLOv11, the proposed multimodal design clearly improves both recall and precision. Compared with the simple fusion baseline, YOLO-MMSC further shows that gains do not come merely from adding another modality; instead, they result from alignment-aware denoising, adaptive fusion, and small-target-oriented regression. More specifically, SDFM improves shallow cross-modal alignment and noise suppression under smoke and vegetation occlusion, CGA adaptively reallocates RGB and TIR contributions under nighttime and partially obscured conditions, and NWD reduces the sensitivity of small-box regression to slight localization offsets. This explains the consistent gains of YOLO-MMSC over the simple RGB–T concatenation baseline in the Night, Occlusion/Smoke, and Small-target subsets.
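To illustrate why a Wasserstein-based similarity is less sensitive than IoU to slight offsets of small boxes, the following Python sketch compares the two measures on a toy example. It uses the commonly cited normalized Gaussian Wasserstein distance, in which each box (cx, cy, w, h) is modeled as a 2D Gaussian. The exact formulation used in YOLO-MMSC is not given in this section, so the function names, the box sizes, and the normalizing constant `c = 12.8` are assumptions made purely for illustration.

```python
import math


def iou(a, b):
    """IoU of two axis-aligned boxes given as (cx, cy, w, h)."""
    ax1, ay1 = a[0] - a[2] / 2, a[1] - a[3] / 2
    ax2, ay2 = a[0] + a[2] / 2, a[1] + a[3] / 2
    bx1, by1 = b[0] - b[2] / 2, b[1] - b[3] / 2
    bx2, by2 = b[0] + b[2] / 2, b[1] + b[3] / 2
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def nwd(a, b, c=12.8):
    """Normalized Wasserstein distance between two boxes (cx, cy, w, h).

    Each box is treated as a Gaussian with mean (cx, cy) and standard
    deviations (w/2, h/2). `c` is a dataset-dependent constant; the
    default here is an assumed placeholder, not the authors' setting.
    """
    w2 = math.sqrt(
        (a[0] - b[0]) ** 2
        + (a[1] - b[1]) ** 2
        + ((a[2] - b[2]) / 2) ** 2
        + ((a[3] - b[3]) / 2) ** 2
    )
    return math.exp(-w2 / c)


small_gt = (50.0, 50.0, 6.0, 6.0)        # hypothetical 6x6 px fire spot
shift_3px = (53.0, 50.0, 6.0, 6.0)       # prediction offset by 3 px
shift_7px = (57.0, 50.0, 6.0, 6.0)       # prediction with no overlap

print(iou(small_gt, shift_3px), nwd(small_gt, shift_3px))  # approx. 0.33 vs 0.79
print(iou(small_gt, shift_7px), nwd(small_gt, shift_7px))  # 0.0 vs approx. 0.58
```

In this toy case a 3-pixel offset cuts IoU to about one third, and a 7-pixel offset removes the overlap entirely, whereas the Wasserstein-based score decays smoothly and remains informative even without overlap.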
Recent studies have also investigated RGB–thermal or multispectral object detection for UAV fire monitoring, UAV small-object detection, and real-time aerial perception. However, most existing methods focus on forest or wildland fire scenes, generic UAV multimodal small-object detection, drone-based target detection, or real-time aerial perception, rather than complex-background power transmission-corridor wildfire monitoring. Therefore, Table 12 provides a concise comparison with recent related methods from the perspective of application scenario, modality, technical focus, and deployment relevance. Because these studies are evaluated on different datasets and task definitions, their reported numerical results are not directly comparable with the proposed method on our corridor-oriented test set. Instead, this comparison is used to clarify the application-specific contribution of YOLO-MMSC.
The comparison with recent literature further reveals both convergences and divergences. On the one hand, our findings are consistent with recent RGB–thermal wildfire and UAV multimodal detection studies in that thermal information provides more stable cues under low illumination, smoke-obscured scenes, and weak target contrast, while visible images contribute texture and contextual boundaries. This convergence explains why RGB–T fusion generally improves over single-modality detection in the Night and Occlusion/Smoke subsets. On the other hand, the present study differs from most existing works in its application boundary. Recent RGB–thermal wildfire studies mainly focus on forest or wildland fire scenarios, and recent UAV multimodal small-object studies usually emphasize generic target detection. In contrast, power transmission-corridor wildfire monitoring requires simultaneous consideration of repeated line-like backgrounds, long-range small fire/smoke targets, power-infrastructure-related thermal distractors, edge-side latency, and continuous alarm stability. Therefore, the gains observed in this study should not be interpreted only as a general benefit of adding a thermal modality, but as the result of combining corridor-oriented data, shallow cross-modal alignment, adaptive RGB–TIR fusion, small-target regression, hard-negative training, and lightweight temporal consistency. This also explains why direct numerical comparison with published results obtained on different datasets may be misleading, whereas comparison under the same experimental setting and literature-level discussion provides a fairer assessment of the proposed framework.
Figure 6 presents representative qualitative comparisons. In the visual examples, the proposed detector produces fewer missed boxes for thin smoke and small flames, and it remains more stable when branches and background clutter partially cover the target.
4.6. Field Altitude Study and Failure Cases
A dedicated field study was conducted to identify a practical UAV operating altitude for transmission-corridor wildfire early warning. Using a DJI Matrice 4T platform, paired RGB–T sequences were acquired at six altitudes from 60 m to 210 m with a step of 30 m. The quantitative results are summarized in Table 15, and the corresponding visual comparison is shown in Figure 7. Overall, increasing altitude leads to a gradual reduction in apparent target size and edge clarity, which directly affects the detectability of weak fire and smoke signatures. This tendency is reflected by the monotonic decrease in Recall and CDR, together with the increase in jitter index J.
At lower altitudes (60–90 m), the model achieves the highest quantitative performance, with relatively large target projections and clear thermal boundaries. However, although low-altitude operation is beneficial for detectability, it also reduces coverage efficiency and may limit inspection productivity in long transmission corridors. As altitude increases to the medium range (120–180 m), the target occupies a smaller image area and smoke boundaries become less distinct, especially in scenes with vegetation occlusion or weak thermal contrast. Nevertheless, the detector still maintains relatively stable performance in this interval. Combined with the larger coverage area at these heights, this range provides a better compromise between detection reliability and inspection efficiency and is therefore recommended for routine patrol.
Within the recommended altitude range, the performance degradation remains within an acceptable operational margin. Specifically, when the UAV altitude increases from 120 m to 180 m, mAP@0.5 decreases from 94.4% to 93.4%, a reduction of 1.0 percentage point, while Recall decreases from 93.6% to 91.8%, a reduction of 1.8 percentage points. The CDR decreases from 95.0% to 93.0%, a reduction of 2.0 percentage points, and the jitter index J increases over the same interval (see Table 15). These changes indicate that the 120–180 m range still maintains acceptable detection and continuity performance, but the gradual increase in temporal jitter suggests that excessively high altitudes may weaken the stability of weak fire and smoke monitoring.
When the altitude further increases to 210 m, the adverse effect of long-range imaging becomes more evident. Fire and smoke targets become less separable from the background, weak smoke structures are more likely to be submerged by clutter, and thermal edges become increasingly blurred. This results in lower Recall and CDR, as well as a higher jitter index, indicating that excessive altitude weakens both instantaneous detectability and temporal stability. In nighttime high-gain mode, the thermal branch can better preserve weak fire signatures at medium altitude. By contrast, in nighttime low-gain mode, thermal contrast may be compressed and some weak fire cues become less distinguishable, especially when altitude increases and the target occupies only a very small area. These observations support the recommendation that high-gain mode should be preferred under nighttime high-interference conditions to reduce the risk of missing weak fire targets.
Figure 8 presents representative failure cases under extreme interference conditions. The first example corresponds to nighttime fireworks. In this case, bright visible-light emission, intense local heat radiation, and smoke plumes appear simultaneously, making the scene highly similar to actual wildfire behavior in both RGB and TIR modalities. The second example corresponds to hot-object interference under nighttime low-gain mode. Due to compressed thermal contrast, high-temperature objects may exhibit thermal responses similar to flames, while their surrounding contextual cues are insufficient to fully suppress false activation. These cases indicate that, although the proposed method performs robustly in most corridor scenarios, extreme semantic-neighbor interference remains a practical challenge. It should also be noted that the present failure-case analysis is mainly qualitative. Although hard-negative samples are included in the training and validation protocol, the current manuscript does not provide a fine-grained class-wise false-positive frequency analysis for fireworks, hot non-fire objects, and other thermal distractors. Therefore, the robustness conclusion under these rare but important interference sources should be interpreted with caution. The current results demonstrate that hard-negative training improves overall field performance, but they do not yet fully quantify how often each type of semantic-neighbor interference causes false alarms in practical operation.
The two failure cases also suggest different mitigation directions. For nighttime fireworks, the interference is usually short-lived and accompanied by abrupt visible-brightness and thermal-intensity changes; therefore, temporal persistence and multi-frame consistency can be used to distinguish transient fireworks from gradually developing fire or smoke events. For hot non-fire objects, the thermal response is often more spatially stationary and may be associated with fixed locations such as residential areas, industrial facilities, roads, or known heat sources.
From an engineering deployment perspective, model prediction should be combined with alarm-level decision strategies rather than used in isolation. First, multi-frame alarm triggering can effectively suppress transient false positives; in the current setting, requiring three consecutive detections introduces only a limited additional delay while significantly improving alarm reliability. Second, region-based whitelisting or masking can be applied to known stationary heat sources. Third, scene-level auxiliary priors, such as forested or dry-vegetation corridor regions, industrial facilities, residential areas, and known heat-source locations, can be used to adjust the verification requirement of candidate alarms. Fourth, adaptive gain and threshold strategies should be adopted according to nighttime operating conditions. These recommendations clarify the practical deployment boundary of the proposed method under challenging interference conditions.
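The first two strategies above, multi-frame alarm triggering and region-based masking, can be sketched in a few lines of Python. This is a simplified illustration and not the deployed logic: the class name `AlarmGate`, the box format, and the example coordinates are hypothetical, and the gate counts any unmasked detection in a frame without associating boxes across frames, which a practical system would likely add.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


@dataclass
class AlarmGate:
    """Raise an alarm only after N consecutive frames with a valid detection."""

    required_consecutive: int = 3            # three frames, as in the text
    masked_regions: List[Box] = field(default_factory=list)  # known heat sources
    streak: int = field(default=0, init=False)

    def _is_masked(self, box: Box) -> bool:
        # A detection is ignored if its center lies inside a masked region.
        cx = (box[0] + box[2]) / 2.0
        cy = (box[1] + box[3]) / 2.0
        return any(
            x1 <= cx <= x2 and y1 <= cy <= y2
            for x1, y1, x2, y2 in self.masked_regions
        )

    def update(self, boxes: List[Box]) -> bool:
        """Feed the detections of one frame; return True if the alarm fires."""
        valid = [b for b in boxes if not self._is_masked(b)]
        # Any frame without a valid detection resets the streak.
        self.streak = self.streak + 1 if valid else 0
        return self.streak >= self.required_consecutive


# Hypothetical example: one masked stationary heat source and a small fire box.
gate = AlarmGate(masked_regions=[(400.0, 300.0, 480.0, 360.0)])
fire = (100.0, 120.0, 112.0, 130.0)
hot_object = (420.0, 310.0, 440.0, 330.0)

frames = [[fire], [], [fire], [fire], [fire], [hot_object]]
alarms = [gate.update(f) for f in frames]
print(alarms)  # [False, False, False, False, True, False]
```

With a three-frame requirement, the alarm is raised two frame intervals after the first detection of an uninterrupted run, so the added delay depends on the frame rate achieved on the target hardware.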
4.7. Discussion
The experimental results indicate that the main benefit of the proposed framework does not come from simply adding a thermal modality to a visible-light detector. Instead, the improvement is produced by the combined effect of task-oriented data construction, shallow cross-modal alignment, adaptive RGB–TIR fusion, small-target-oriented regression, and lightweight temporal smoothing. The ablation results show that MBConv substantially improves inference efficiency but also introduces an accuracy–efficiency trade-off when used alone. This observation is important because lightweight design is not automatically beneficial for field monitoring unless the reduced representation capacity is compensated for by more effective multimodal interaction. In this work, SDFM and CGA play such a compensatory role: SDFM reduces shallow misalignment and noise coupling under smoke and vegetation occlusion, while CGA adaptively reallocates visible-texture and thermal-saliency information under nighttime or partially obscured conditions. NWD further reduces the sensitivity of small-box regression to slight localization offsets. Therefore, the final performance gain should be interpreted as the result of a coordinated multimodal design rather than an isolated module improvement.
Compared with recent RGB–thermal wildfire detection and UAV multimodal perception studies, the findings of this work show both convergence and divergence. The convergence is that thermal information is consistently useful under low illumination, smoke-obscured scenes, and weak target contrast, whereas visible images provide texture, shape, and contextual boundary information. This is consistent with recent RGB–thermal fire detection and UAV multimodal detection studies [8,14]. The divergence lies in the application boundary. Most existing RGB–thermal wildfire studies mainly focus on forest or wildland fire scenes, while many UAV multimodal small-object studies focus on generic objects, person detection, or general aerial multispectral perception [12,13]. In contrast, power transmission-corridor wildfire monitoring requires simultaneous consideration of repeated line-like backgrounds, long-range weak fire/smoke targets, power-infrastructure-related thermal distractors, edge-side latency, and temporal alarm stability. Therefore, the proposed framework should be regarded as an application-specific RGB–thermal monitoring solution for transmission corridors rather than a general-purpose wildfire detector. This also explains why direct numerical comparison with published results obtained on different datasets and task definitions may be misleading; fair assessment requires both same-setting baseline comparison and literature-level discussion.
The practical implication of the results is that YOLO-MMSC-T is suitable as an edge-side perception module for UAV patrol, but it should not be regarded as a complete alarm-decision system. The altitude study suggests that the 120–180 m range provides a reasonable compromise between target visibility, temporal stability, and inspection coverage. Within this range, the detector still maintains acceptable mAP@0.5 and continuous detection rate, although jitter gradually increases with altitude. This result is useful for engineering deployment because it links detection performance with UAV operating conditions rather than evaluating the detector only on static images. Nevertheless, the recommended altitude range should be interpreted as a field guideline under the collected corridor conditions, not as a universal rule for all transmission-line environments.
Several limitations remain. First, the self-collected dataset is geographically and seasonally restricted. Although it covers different time periods, nighttime conditions, smoke/occlusion, small targets, and hard-negative interference, the field data were mainly collected from transmission-line corridors in Central China during winter. Therefore, the current results should not be interpreted as evidence that the model can be directly generalized to all corridor environments without further validation. Different vegetation types, soil backgrounds, terrain morphology, humidity levels, seasonal appearances, and extreme weather conditions may change the visual texture, thermal contrast, smoke diffusion pattern, and fire-spread behavior. For example, a humid tropical forest corridor may exhibit stronger smoke attenuation and lower thermal contrast, whereas a dry savanna or grassland corridor may contain more fragmented vegetation texture, exposed soil background, and faster flame spread. These domain shifts may affect both RGB appearance and TIR saliency, and may further influence the reliability of cross-modal fusion. Thus, the present study should be regarded as a feasibility demonstration under the collected corridor conditions. Broader multi-region, multi-season, and multi-climate validation will be required before large-scale deployment in geographically diverse transmission corridors.
Second, the qualitative failure cases and the lack of subtype-level false-positive statistics show that appearance-based RGB–thermal fusion alone cannot fully resolve semantic-neighbor interference. Nighttime fireworks may simultaneously produce strong visible brightness, local heat radiation, and smoke-like plumes, making them similar to real wildfire events in both RGB and TIR modalities. Hot non-fire objects under low-gain thermal imaging may also generate flame-like responses when the surrounding contextual cues are weak. Similar limitations have also been reported in RGB–thermal fire detection studies, where thermal imagery improves weak-fire perception but may still suffer from false alarms under complex smoke-obscured or high-temperature interference conditions [8]. These cases suggest that reliable engineering deployment requires detector outputs to be coupled with an alarm-level decision mechanism. Multi-frame alarm triggering, temporal persistence checking, and spatial consistency constraints can suppress transient false positives, and related spatial–temporal strategies have been used to reduce false alarms in vision-based fire and smoke detection [26]. For fixed thermal distractors, region-based whitelisting or masking can be introduced using prior knowledge of industrial facilities, roads, residential heating areas, or known heat-source locations. A more rigorous evaluation of this issue will require a larger and more balanced hard-negative benchmark, in which fireworks, fixed hot objects, industrial heat sources, residential heating, vehicle lights, and smoke-like non-fire events are separately labeled. Future work will report subtype-level false-positive rate and false positives per image under unified confidence and NMS settings, so that the recurrence frequency of each interference type can be quantified more transparently.
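As a rough indication of how such subtype-level statistics could be tabulated, the following Python sketch computes false positives per image and an image-level false-positive rate for each hard-negative subtype. The function name, the record format, and the toy records are hypothetical and carry no experimental meaning. The image-level rate is defined here, as an assumption, as the fraction of hard-negative images of a subtype that contain at least one false positive.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def subtype_fp_stats(
    false_positives: Iterable[Tuple[str, str]],
    images_per_subtype: Dict[str, int],
) -> Dict[str, Dict[str, float]]:
    """Tabulate false-positive statistics per hard-negative subtype.

    false_positives: one (image_id, subtype) pair per false-positive box that
        survives the unified confidence threshold and NMS.
    images_per_subtype: number of hard-negative images labeled with each subtype.
    """
    records = list(false_positives)
    fp_count = Counter(subtype for _, subtype in records)
    images_hit: Dict[str, set] = {s: set() for s in images_per_subtype}
    for image_id, subtype in records:
        if subtype in images_hit:
            images_hit[subtype].add(image_id)

    stats = {}
    for subtype, n_images in images_per_subtype.items():
        if n_images <= 0:
            continue  # no images of this subtype, so nothing to report
        stats[subtype] = {
            "fppi": fp_count[subtype] / n_images,               # false positives per image
            "fp_image_rate": len(images_hit[subtype]) / n_images,  # images with >= 1 FP
        }
    return stats


# Synthetic toy records for illustration only (not measured results).
toy_fps = [
    ("img01", "fireworks"),
    ("img01", "fireworks"),
    ("img03", "fireworks"),
    ("img07", "fixed_hot_object"),
]
toy_counts = {"fireworks": 4, "fixed_hot_object": 5}

stats = subtype_fp_stats(toy_fps, toy_counts)
print(stats)
```

Reporting both quantities separates a subtype that triggers many boxes in a few images from one that triggers a single box in many images, which matters when choosing between temporal filtering and region masking.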
Finally, the temporal consistency term used in this work is intentionally lightweight and mainly suppresses short-term flickering detections and bounding-box jitter. It does not explicitly model long-term fire evolution, smoke transport, wind direction, or scene-level risk priors. Future work will therefore focus on three more specific directions. First, larger multi-region and multi-season RGB–T corridor datasets should be constructed to cover different vegetation types, terrain morphologies, humidity levels, seasonal backgrounds, UAV viewing angles, and industrial or residential thermal interference patterns. Such data would support domain generalization analysis and incremental adaptation when the model is transferred to new corridor environments. Second, more explicit spatiotemporal modeling can be explored, including trajectory-aware filtering for object-level alarm stabilization, lightweight Transformer-based temporal aggregation for inter-frame dependency modeling, and 3D convolutional modeling for short-clip fire/smoke motion feature extraction. Third, decision-level alarm fusion can be developed by combining detector confidence, temporal persistence, scene priors, geographic information system (GIS) data, known heat-source masks, corridor risk-zone maps, weather information, wind direction, humidity, and other non-visual sensor cues when available. These extensions would improve the generalizability, false-alarm suppression ability, and practical reliability of RGB–thermal UAV monitoring for transmission-corridor wildfire early warning.