Edge-Deployable RGB–Thermal UAV Monitoring for Wildfires in Power Transmission Corridors

Wang, Biao; Huang, Daochun; Lin, Yifeng; He, Xu; Guo, Zhengxian; Hong, Bo

doi:10.3390/rs18121869


by Biao Wang ¹, Daochun Huang ¹,²,*, Yifeng Lin ¹, Xu He ², Zhengxian Guo ¹ and Bo Hong ²

¹ Institute of Next Generation Power Systems and International Standards, Wuhan University, Wuhan 430072, China
² School of Electrical Engineering and Automation, Wuhan University, Wuhan 430072, China
* Author to whom correspondence should be addressed.

Remote Sens. 2026, 18(12), 1869; https://doi.org/10.3390/rs18121869

Submission received: 18 April 2026 / Revised: 19 May 2026 / Accepted: 27 May 2026 / Published: 6 June 2026


Highlights

What are the main findings?

A corridor-oriented RGB–thermal UAV wildfire monitoring framework was proposed by integrating multimodal fusion, lightweight detection, and temporal consistency.
The proposed method achieved real-time deployment on Jetson Orin NX and showed stable performance in nighttime, smoke/occlusion, and long-range small-target scenarios.

What is the implication of the main finding?

The results demonstrate the feasibility of edge-deployable RGB–thermal remote sensing for practical wildfire early warning in power transmission corridors.
The constructed corridor-oriented dataset and UAV altitude analysis provide field-oriented references for future multimodal infrastructure monitoring studies.

Abstract

Early wildfire monitoring in power transmission corridors requires reliable detection of weak fire and smoke cues under complex field conditions and strict edge-computing constraints. To address these issues, this paper proposes an edge-deployable RGB–thermal framework based on visible and thermal infrared (TIR) imaging for unmanned aerial vehicle (UAV)-based corridor monitoring, including a spatial detector, YOLO-MMSC, and a temporal-enhanced version, YOLO-MMSC-T. The study also establishes a self-collected corridor-oriented RGB–thermal (RGB–T) dataset to complement public wildfire data. Unlike existing RGB–thermal wildfire datasets that mainly focus on forest or wildland fire scenes, the proposed dataset is specifically organized for complex-background power transmission-corridor monitoring, including continuous UAV sequences, nighttime conditions, smoke/vegetation occlusion, long-range small targets, and hard-negative interference. To the best of our knowledge, this is the first self-collected RGB–thermal wildfire dataset designed for this specific application scenario. The framework integrates a mobile inverted bottleneck convolution (MBConv) lightweight backbone, a Shallow Detail Fusion Module (SDFM) for shallow cross-modal alignment and denoising, a Content-Guided Attention (CGA) module for adaptive fusion, and normalized Wasserstein distance (NWD)-based box regression for long-range small-target localization. Experiments on public and self-collected datasets show that YOLO-MMSC achieves 94.6% mAP@0.5, 95.0% precision, and 93.9% recall while running at 60 FPS on Jetson Orin NX. With temporal fine-tuning, YOLO-MMSC-T reaches a continuous detection rate (CDR) of 95.6% with a jitter index of 2.8 × 10⁻³. Field experiments using a DJI Matrice 4T further indicate a practical operating altitude of 120–180 m. These results support lightweight RGB–thermal remote sensing for real-time wildfire monitoring in complex transmission-corridor environments.

Keywords: remote sensing; unmanned aerial vehicle (UAV); RGB–thermal perception; wildfire monitoring; power transmission corridors; multimodal fusion; edge deployment; temporal consistency

1. Introduction

Wildfire activity has become an increasingly important concern for environmental monitoring and infrastructure safety under the combined influence of climate change and more frequent extreme hot and dry weather. Recent studies have shown that wildfire events are closely coupled with power-grid security, including wildfire–power grid interactions under a changing climate [1], probabilistic line-outage risk caused by wildfires [2], and resilience evaluation of power transmission line outages under wildfire disturbance [3]. For overhead transmission-line infrastructure, wildfire risk assessment also relies on multi-source data, risk models, and remote sensing support [4]. Therefore, timely identification of incipient fire and smoke signatures in complex transmission-corridor environments is important not only for disaster prevention but also for improving situational awareness over long linear monitoring regions.

Current corridor fire monitoring mainly relies on satellite remote sensing, tower-mounted cameras, and unmanned aerial vehicle (UAV) inspection. Satellite platforms provide wide-area coverage and support large-scale wildfire awareness and regional risk assessment [5], but their revisit interval and spatial resolution may be insufficient for detecting small incipient fire spots under local occlusion. Ground-based monitoring can provide continuous observation, yet it often suffers from limited viewing geometry and difficult deployment in mountainous or forested terrain. By contrast, UAV-based remote sensing offers flexible maneuverability, adjustable viewpoints, and high-resolution close-range observations, making it highly suitable for corridor monitoring tasks involving long-range small targets, semi-transparent smoke, vegetation occlusion, and nighttime low illumination [6,7,8]. However, practical UAV deployment still faces two major challenges. First, early wildfire targets are typically small and weakly contrasted, and their appearance is easily disturbed by cluttered backgrounds and smoke diffusion. Second, onboard applications require lightweight real-time processing under strict constraints on computing power, memory bandwidth, and energy consumption.

Deep-learning-based object detection provides an effective tool for visual wildfire monitoring. Two-stage detectors such as Faster R-CNN offer strong representation capability, whereas one-stage detectors such as the YOLO family are more attractive for real-time deployment because of their superior speed [9,10]. Recent work in remote sensing has highlighted the importance of multi-scale modeling, lightweight inference, and cross-modality learning for UAV imagery, especially when targets are small and background clutter is strong [11,12,13,14,15]. Public benchmark construction and cross-spectral alignment have also become increasingly important for thermal–visible UAV perception, as reflected by recent benchmark and alignment studies [16,17]. Nevertheless, the difficulty of UAV-based corridor wildfire monitoring is not limited to scale variation. Nighttime low illumination degrades visible-light information, while weak-contrast smoke and vegetation occlusion reduce target separability; moreover, long-range fire spots are highly sensitive to localization errors. These factors jointly lead to missed detections, false alarms, and unstable temporal predictions, making reliable alarm generation difficult in practice [18].

RGB–thermal multimodal perception provides a promising way to address these issues. Thermal infrared imaging offers more stable saliency for heat-radiating targets and can complement visible observations under nighttime and smoke-obscured conditions. Recent studies have shown the potential of RGB–thermal fusion for complex-scene parsing and cross-modal feature complementarity [19,20], while robust multimodal learning strategies can further improve tolerance to modality degradation [21]. In corridor wildfire monitoring, remote-sensing-based fire-risk assessment and UAV visible–infrared sensing have gradually moved toward multimodal observations [6,7,8,13,22,23]. However, two practical limitations remain. First, many existing methods rely on simple concatenation or shallow fusion, which are not sufficient to handle cross-modal misalignment and noise coupling under semi-transparent smoke and vegetation occlusion. Second, complex multimodal structures often introduce considerable intermediate feature maps and memory-access overhead, such that models with acceptable FLOPs may still be inefficient on embedded UAV devices [24,25]. In addition, the scarcity of publicly available RGB–thermal wildfire datasets, especially for transmission-corridor-related scenes and continuous UAV sequences, restricts the development and evaluation of robust multimodal monitoring methods [8,17,23].

Motivated by these considerations, this paper presents an edge-deployable RGB–thermal framework for UAV-based wildfire monitoring in power transmission corridors, including an improved spatial detector, YOLO-MMSC, and a temporal-enhanced version, YOLO-MMSC-T. The framework combines shallow cross-modal alignment and denoising, adaptive weighted fusion, a lightweight backbone, and small-target-oriented regression to improve robustness under occlusion, weak texture, and long-range imaging conditions. A lightweight temporal consistency constraint is further introduced to suppress flickering misses and bounding-box jitter during continuous monitoring. In terms of data resources, the study combines public RGB–T wildfire data with a self-collected corridor-oriented UAV RGB–T continuous-sequence dataset and hard negatives. The experimental analysis covers module ablation, data-source contribution, robustness under different nighttime gain settings, subset-level evaluation under Night, Occlusion/Smoke, and Small-target conditions, and field altitude analysis. The goal is to provide an effective remote-sensing solution for accurate, real-time, and continuously reliable wildfire monitoring in complex transmission-corridor environments.

2. Methodology

The proposed framework is designed for transmission-corridor wildfire early warning under complex field conditions. Paired RGB and thermal infrared (TIR) images are first processed by the Shallow Detail Fusion Module (SDFM) for shallow cross-modal alignment and denoising, so that weak fire and smoke cues are preserved under tree occlusion, semi-transparent smoke, and nighttime low-illumination conditions. A mobile inverted bottleneck convolution (MBConv)-based lightweight backbone then extracts multi-scale features while reducing parameter count and memory-access overhead, which benefits edge-side deployment. Next, the Content-Guided Attention (CGA) module performs adaptive RGB–TIR fusion to strengthen complementary thermal–texture responses and suppress redundant background activations. The fused features are sent to the detection head, where normalized Wasserstein distance (NWD)-based regression enhances localization robustness for long-range small targets. In addition, a lightweight temporal consistency constraint is introduced during sequence fine-tuning to alleviate flickering missed detections and bounding-box jitter across consecutive frames, thereby improving the reliability of continuous online warning.

2.1. YOLO-MMSC Structure

The proposed framework is built upon YOLOv11 and consists of a dual-branch RGB–thermal feature extractor, a lightweight multimodal fusion path, and an optional temporal enhancement stage for continuous monitoring. Figure 1 illustrates the overall pipeline. Visible and thermal infrared (TIR) images are first synchronized and preprocessed, and then forwarded to two mirrored shallow branches. The RGB branch preserves rich texture and edge information, while the TIR branch emphasizes thermal saliency under low illumination and smoke-obscured conditions. Cross-modal interaction is introduced at both shallow and middle layers to exploit thermal–texture complementarity while suppressing modality-specific noise.

Accordingly, the framework is organized around four task-oriented components: an SDFM-based shallow feature alignment and denoising unit, a CGA-based adaptive multimodal fusion unit, an MBConv-based lightweight backbone, and an NWD-guided regression strategy for small targets. On top of the frame-level detector, a lightweight temporal consistency module is further used during fine-tuning to improve alarm continuity and reduce flickering boxes in video monitoring.

2.2. MBConv-Based Lightweight Backbone

To reduce computational cost and memory access overhead on edge hardware, the original high-cost aggregation units in the baseline backbone are replaced by MBConv-style lightweight blocks. Each block follows an inverted residual design with pointwise expansion, depthwise convolution, and pointwise projection. Compared with conventional dense convolution blocks, MBConv substantially reduces parameter count and intermediate feature-map traffic, which is particularly beneficial for deployment on resource-constrained devices such as Jetson Orin NX.

In this paper, the lightweight replacement is implemented symmetrically in the RGB and TIR branches to maintain tensor compatibility before fusion. The resulting backbone preserves hierarchical semantic representation while releasing computational budget for the proposed multimodal modules. In practical deployment, this design helps alleviate the gap between FLOPs and real edge-side latency caused by intensive memory access.

The quantitative effect of this lightweight replacement is further verified in the ablation study. Compared with the simple RGB–T concatenation baseline, introducing MBConv reduces the computational cost from 8.9 GFLOPs to 5.0 GFLOPs, corresponding to a 43.8% reduction, and increases the inference speed from 38 FPS to 79 FPS under the same evaluation setting. This confirms that MBConv effectively reduces computational burden and improves inference efficiency. However, the mAP@0.5 decreases from 90.1% to 87.3%, indicating that lightweight replacement alone introduces an accuracy–efficiency trade-off and should be combined with cross-modal alignment and adaptive fusion modules. The detailed structure of the MBConv-based lightweight block is shown in Figure 2, where pointwise expansion, depthwise convolution, squeeze–excitation calibration, and projection are organized to reduce computation while retaining feature representation capacity.

2.3. SDFM and CGA Multimodal Fusion Modules

The SDFM module is introduced at shallow stages to align RGB texture cues and TIR thermal details at a fine-grained level. The motivation is that semi-transparent smoke, weak boundaries, and local misalignment between the two modalities can easily cause unstable fusion if only direct concatenation is applied. SDFM performs shallow cross-modal feature interaction and denoising to preserve fire edges and weak smoke structures while reducing irrelevant noise responses. As shown in Figure 3, the module refines shallow RGB and TIR features through cross-modal interaction and residual enhancement, which improves the preservation of weak smoke boundaries and small flame responses before deeper fusion.

After shallow refinement, a CGA-based adaptive fusion module is embedded at multi-scale fusion nodes in the neck; its structure is illustrated in Figure 4. Instead of using fixed concatenation weights, CGA performs content-guided channel–spatial attention and generates pixel-wise adaptive fusion weights for RGB and TIR features. This allows the network to emphasize thermal responses in low-light or smoke-obscured regions, while retaining visible-texture cues where thermal contrast is insufficient. As a result, the fused representation is more robust to occlusion, low illumination, and redundant background activations.

2.4. NWD-Based Regression and Temporal Consistency

Long-range wildfire targets typically occupy only a very small fraction of the image plane, making IoU-based box regression unstable when the overlap changes drastically under tiny localization offsets. To improve regression robustness for such small targets, the proposed detector adopts NWD-based box regression, which provides a smoother and more stable optimization signal under long-range small-target conditions.
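The contrast between IoU and NWD for tiny boxes can be illustrated with a minimal sketch. It assumes the commonly used NWD formulation, in which each box (cx, cy, w, h) is modelled as a 2D Gaussian and NWD = exp(−W₂/C); the paper does not state its exact formulation, so the normalizing constant `c = 12.8`, the function names, and the example boxes are assumptions chosen purely for illustration.

```python
import math


def iou(a, b):
    """IoU of two axis-aligned boxes given as (cx, cy, w, h)."""
    iw = min(a[0] + a[2] / 2, b[0] + b[2] / 2) - max(a[0] - a[2] / 2, b[0] - b[2] / 2)
    ih = min(a[1] + a[3] / 2, b[1] + b[3] / 2) - max(a[1] - a[3] / 2, b[1] - b[3] / 2)
    inter = max(0.0, iw) * max(0.0, ih)
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def nwd(a, b, c=12.8):
    """Normalized Wasserstein distance between two boxes (cx, cy, w, h).

    Each box is treated as a Gaussian with mean (cx, cy) and standard
    deviations (w/2, h/2). `c` is a dataset-dependent constant; the value
    used here is an assumption, not a setting reported in the paper.
    """
    w2 = math.sqrt(
        (a[0] - b[0]) ** 2
        + (a[1] - b[1]) ** 2
        + ((a[2] - b[2]) / 2) ** 2
        + ((a[3] - b[3]) / 2) ** 2
    )
    return math.exp(-w2 / c)


gt = (100.0, 100.0, 8.0, 8.0)     # hypothetical 8 x 8 px fire spot
near = (104.0, 100.0, 8.0, 8.0)   # prediction shifted by 4 px
far = (110.0, 100.0, 8.0, 8.0)    # prediction shifted by 10 px, no overlap

for name, pred in (("near", near), ("far", far)):
    print(name, round(iou(gt, pred), 3), round(nwd(gt, pred), 3))
```

In this example the IoU falls to one third after a 4-pixel shift and to zero once the boxes stop overlapping, so an IoU-based loss offers no useful gradient for the distant prediction, whereas NWD still decreases smoothly with distance. A regression loss would typically be taken as 1 − NWD.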

For continuous monitoring, a temporal consistency constraint is further introduced during fine-tuning. A short sequence of length T = 5 is sampled from each video clip, and adjacent-frame predictions are associated using maximum-IoU matching with threshold τ = 0.5. The overall loss is formulated as

L = \sum_{k = 0}^{4} L_{\mathrm{det}}^{\mathrm{NWD}} (P_{t + k}) + \lambda \sum_{k = 0}^{3} L_{\mathrm{temp}} (P_{t + k}, P_{t + k + 1}),    (1)

where λ = 0.2 is the temporal regularization weight. The first term denotes the frame-level detection loss with NWD-based regression, and the second term constrains the temporal smoothness of matched targets across adjacent frames. The sequence length T = 5 was selected to provide sufficient short-term temporal context for suppressing flickering detections caused by UAV motion, smoke diffusion, and partial occlusion, while avoiding excessive memory cost during sequence fine-tuning. A shorter sequence provides limited temporal constraints, whereas a longer sequence introduces more redundant frames and only marginally improves stability. The temporal weight λ = 0.2 was chosen to balance frame-level detection accuracy and temporal smoothness. A smaller weight weakens the suppression of bounding-box fluctuation, while an excessively large weight may over-constrain adjacent-frame predictions and slightly affect static localization accuracy. During training, the total loss is backpropagated over the entire sequence rather than a single frame. During inference, no heavy temporal network is introduced; instead, lightweight cross-frame smoothing is used to reduce flickering misses and bounding-box jitter.
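The structure of Equation (1) can be sketched as follows. The paper does not specify the form of the temporal term, so this sketch assumes a mean L1 difference between matched boxes in normalized (cx, cy, w, h) coordinates, uses a simple greedy maximum-IoU association, and takes the per-frame detection losses as precomputed numbers; all function names are hypothetical.

```python
def iou_xywh(a, b):
    """IoU of two boxes given as normalized (cx, cy, w, h)."""
    iw = min(a[0] + a[2] / 2, b[0] + b[2] / 2) - max(a[0] - a[2] / 2, b[0] - b[2] / 2)
    ih = min(a[1] + a[3] / 2, b[1] + b[3] / 2) - max(a[1] - a[3] / 2, b[1] - b[3] / 2)
    inter = max(0.0, iw) * max(0.0, ih)
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def match_max_iou(prev, curr, tau=0.5):
    """Greedy maximum-IoU association; keeps pairs with IoU >= tau."""
    pairs, used = [], set()
    for i, p in enumerate(prev):
        best_j, best_iou = None, tau
        for j, c in enumerate(curr):
            if j in used:
                continue
            value = iou_xywh(p, c)
            if value >= best_iou:
                best_j, best_iou = j, value
        if best_j is not None:
            used.add(best_j)
            pairs.append((i, best_j))
    return pairs


def temporal_term(prev, curr, tau=0.5):
    """Assumed L_temp: mean L1 change of matched boxes between two frames."""
    pairs = match_max_iou(prev, curr, tau)
    if not pairs:
        return 0.0
    total = sum(
        sum(abs(p - c) for p, c in zip(prev[i], curr[j])) for i, j in pairs
    )
    return total / len(pairs)


def sequence_loss(det_losses, frames, lam=0.2, tau=0.5):
    """Sum of per-frame detection losses plus lam times the temporal terms."""
    assert len(det_losses) == len(frames)
    temp = sum(
        temporal_term(frames[k], frames[k + 1], tau) for k in range(len(frames) - 1)
    )
    return sum(det_losses) + lam * temp


# Hypothetical T = 5 clips with one box per frame.
steady = [[(0.5, 0.5, 0.1, 0.1)] for _ in range(5)]
shaky = [[(0.5 + 0.01 * (k % 2), 0.5, 0.1, 0.1)] for k in range(5)]
print(sequence_loss([0.1] * 5, steady), sequence_loss([0.1] * 5, shaky))
```

With identical detection losses, the clip whose box oscillates between frames receives the larger total loss, which is the behaviour the temporal regularizer is intended to produce. Unmatched boxes contribute nothing to the temporal term in this sketch.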

3. Experimental Setup

3.1. Datasets and Subset Definition

The experimental data consist of a public RGB–T corpus and a self-collected corridor-oriented field corpus. Existing public RGB–thermal wildfire datasets provide valuable forest, wildland, and multi-scenario fire observations, but they are not specifically organized for power transmission-corridor monitoring, where repeated transmission line backgrounds, vegetation occlusion, nighttime inspection, long-range small fire/smoke targets, and power-infrastructure-related thermal distractors appear simultaneously. Therefore, to complement these public resources, this study constructs a self-collected corridor-oriented RGB–thermal wildfire dataset for complex-background power transmission-corridor monitoring. The dataset is designed to provide transmission-line-specific continuous UAV sequences, multi-period observations, and hard-negative interference samples that are weakly represented in existing public wildfire datasets. The public portion is compiled from three publicly available RGB–T wildfire-related datasets, namely FireMan-UAV-RGBT, FLAME-3, and RGBT-3M, resulting in 8214 cleaned paired frames after format unification and quality filtering. The self-collected corridor corpus contains 8127 paired RGB–T frames organized as continuous UAV sequences acquired in high-voltage transmission-line corridor environments. In addition, 587 paired hard-negative samples were collected from confusing but non-fire events, such as nighttime fireworks, heating charcoal fires, and hot non-fire objects. These hard-negative samples are regarded as a dedicated component of the self-collected corridor corpus, because they were acquired under the same corridor-oriented inspection background and are used to improve false-alarm resistance under practical interference conditions. Table 1 summarizes the dataset composition.

The self-collected corpus was acquired with a DJI Matrice 4T platform carrying synchronized visible and thermal imaging sensors. The visible stream was captured by the onboard wide-angle camera with a 1/1.3-inch CMOS sensor and 48 MP effective pixels, while the thermal stream was captured by the onboard uncooled vanadium oxide (VOx) infrared thermal camera. The thermal camera provides a native resolution of 640 × 512, a pixel pitch of 12 μm, a frame rate of 30 Hz, and an infrared spectral range of 8–14 μm. During nighttime acquisition, both high-gain and low-gain thermal modes were considered to analyze weak-fire visibility and hot-object interference under different thermal operating settings.

The field data were collected from high-voltage transmission-line corridors in Central China from late January to early February 2025. To protect operational information related to power-line infrastructure, the specific line identifiers and precise geographic coordinates are not disclosed. The acquisition covered four typical time periods: morning, noon, evening, and nighttime. The winter field environment was characterized by low temperature, large day–night temperature differences, and fluctuating visibility caused by thin fog and light haze. During the acquisition period, the daytime temperature was approximately 2–11 °C, with frequent cold-air activity, northerly winds of force 4–6, and occasional overcast or light-rain weather. Such low-temperature conditions further enlarged the visual difference between RGB images and thermal infrared images. In addition, the corridor vegetation was dry, with dry weeds, undergrowth branches, and other flammable materials distributed around the transmission corridor. Since part of the acquisition period overlapped with the Spring Festival, the surrounding environment also contained confusing non-fire sources, such as charcoal fires for residential heating, fireworks, smoke, strong visible light, and local heat sources. These factors naturally formed complex monitoring scenarios involving nighttime low illumination, smoke occlusion, thermal interference, and long-range small targets, which are close to practical wildfire inspection conditions along power transmission corridors.

For RGB–thermal pair construction and annotation, the visible and thermal streams were first matched according to acquisition timestamps. DJI Thermal Analysis Tool was then used to inspect the thermal images and assist in the verification of RGB–thermal pair consistency. The paired images were adjusted to a common image format before annotation. Fire and smoke instances were manually annotated using bounding boxes. For flame targets, the visible flame region or the corresponding high-temperature thermal response was labeled according to the paired RGB–TIR observation. For smoke targets, the visually distinguishable smoke plume region was annotated, while severely ambiguous smoke boundaries were checked with the paired thermal image and surrounding temporal frames. Each annotation was first labeled by one annotator and then reviewed by another annotator to reduce missed labels and inconsistent boundaries. Confusing but non-fire samples, such as fireworks, heating charcoal fires, and hot non-fire objects, were included as hard-negative samples without fire or smoke labels. For the self-collected continuous sequences, the train/validation/test split was performed at the video-segment level rather than at the frame level. Specifically, all annotated paired frames extracted from the same continuous UAV video segment were assigned to only one subset, and no video segment was shared among the training, validation, and test/field-verification subsets. The 1626 paired frames in the test/field-verification subset were therefore sampled from video segments completely independent of those used for training and validation. This strategy prevents temporally adjacent frames with highly similar backgrounds and target appearances from appearing in different subsets, thereby reducing the risk of temporal data leakage. The resulting image-level split is shown in Table 2, and the video-level split is summarized in Table 3.

In Table 3, the duration refers to the original UAV video duration before frame sampling. The annotated paired frames denote the RGB–T image pairs selected from the original videos for manual annotation and experimental evaluation, rather than all frames contained in the raw 30 Hz video streams. The 16 video segments cover different acquisition dates, time periods, thermal gain settings, flight viewpoints, smoke/occlusion conditions, long-range small targets, and hard-negative interference sources.
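The segment-level splitting principle described above can be sketched in a few lines. This is only an illustration of grouping by video segment: the split ratios, the random seed, the frame records, and the function name are assumptions, and the actual assignment used in the study is the one reported in Tables 2 and 3.

```python
import random


def split_by_segment(segment_ids, ratios=(0.7, 0.1, 0.2), seed=0):
    """Assign whole video segments to train/val/test subsets.

    Every frame later inherits the subset of its segment, so temporally
    adjacent frames can never end up in different subsets.
    """
    segments = sorted(set(segment_ids))
    random.Random(seed).shuffle(segments)
    n_train = round(ratios[0] * len(segments))
    n_val = round(ratios[1] * len(segments))
    names = ["train"] * n_train + ["val"] * n_val
    names += ["test"] * (len(segments) - len(names))
    return dict(zip(segments, names))


# Hypothetical frame records: (segment_id, frame_index) for 16 segments.
frames = [(seg, idx) for seg in range(16) for idx in range(10)]
assignment = split_by_segment([seg for seg, _ in frames])

subsets = {"train": set(), "val": set(), "test": set()}
for seg, _ in frames:
    subsets[assignment[seg]].add(seg)

print({name: len(segs) for name, segs in subsets.items()})
```

Because the mapping is built per segment rather than per frame, checking that the three segment sets are pairwise disjoint is sufficient to rule out frame-level leakage.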

A concise summary of the data sources used in this study is provided in Table 4. The three public datasets provide general RGB–thermal wildfire observations from forest, wildland, and multi-scenario fire-related scenes. However, they do not explicitly focus on power transmission corridors, continuous UAV inspection sequences, or power-infrastructure-related hard negatives. The self-collected corridor corpus therefore supplies transmission-line-specific continuous UAV sequences, multi-period observations, and hard-negative interference samples that are weakly represented in existing public wildfire datasets.

To better reflect the practical difficulty of corridor wildfire monitoring, three dedicated subsets were constructed. The Night subset contains nighttime frames collected under both high-gain and low-gain thermal settings. The Occlusion/Smoke subset contains scenes dominated by vegetation occlusion, thin smoke, or semi-transparent smoke. The Small-target subset contains targets whose bounding-box area is less than 0.3% of the image area. This threshold is used as a task-specific criterion for the long-range UAV monitoring scenario considered in this study. Under the letterboxed 640 × 640 detector input, 0.3% of the input image area corresponds to approximately 1229 pixels, which is close to a square target of about 35 × 35 pixels. In practical corridor inspection, incipient flames and thin smoke observed at medium-to-high UAV altitudes often occupy only such a small image region. Under this condition, slight localization offsets can cause large IoU variations, and target boundaries are easily disturbed by vegetation texture, smoke diffusion, and thermal blur. Therefore, the 0.3% threshold is adopted to isolate long-range weak targets for targeted robustness evaluation, rather than to define a universal small-object standard. This small-target criterion is defined by

r = \frac{w_{b} h_{b}}{W H},    (2)

where W and H are the image width and height, and w_b and h_b denote the target-box width and height. Representative samples of these challenging conditions are shown in Figure 5.
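The area-ratio criterion of Equation (2) is simple enough to check numerically. The following minimal sketch uses the symbols from the text (w_b, h_b, W, H) and the 0.3% threshold; the function names are hypothetical.

```python
def area_ratio(w_b, h_b, W=640, H=640):
    """Ratio r of the bounding-box area to the image area, as in Equation (2)."""
    return (w_b * h_b) / (W * H)


def is_small_target(w_b, h_b, W=640, H=640, threshold=0.003):
    """True if the box occupies less than 0.3% of the image area."""
    return area_ratio(w_b, h_b, W, H) < threshold


# Pixel budget implied by the threshold on a 640 x 640 input.
print(0.003 * 640 * 640)
print(is_small_target(35, 35), is_small_target(36, 36))
```

On a 640 × 640 input the threshold corresponds to about 1228.8 pixels, so a 35 × 35 box (1225 pixels) falls inside the Small-target subset while a 36 × 36 box (1296 pixels) does not.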

3.2. Implementation Details

Training and deployment settings are summarized in Table 5 and Table 6. The training server uses an Intel i7-14700KF CPU and an RTX 5070 GPU with 32 GB RAM. Edge deployment is carried out on a Jetson Orin NX equipped with an Arm Cortex-A78AE CPU, a 1024-core Ampere GPU, and 16 GB memory. The detector is trained in PyTorch 2.8.0 with CUDA 12.8, while edge-side evaluation is conducted in PyTorch 2.3.0 with CUDA 12.6 and JetPack 6.2.

The thermal stream is normalized in its native 640 × 512 format and then, together with the aligned visible image, is mapped to the detector input by letterbox padding to 640 × 640. This setting preserves the original aspect ratio while maintaining compatibility with the YOLO inference pipeline. During training, all geometric augmentations are applied synchronously to the paired RGB and TIR images to preserve cross-modal spatial correspondence. The adopted augmentations include random horizontal flipping, random scaling and translation, and mosaic augmentation. Mild photometric perturbation is applied only to the visible branch, while the thermal branch is normalized without color-space augmentation to avoid destroying thermal intensity semantics. No random augmentation is used during validation or inference. The final edge-side speed is measured on the Jetson Orin NX using TensorRT FP16 inference with batch size 1, and the reported FPS corresponds to stable online inference rather than one-off peak throughput.
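The letterbox geometry for the 640 × 512 thermal format can be worked out with a short sketch. It assumes centered padding, as is typical in YOLO-style pipelines but not stated explicitly in the paper; the function names and the example box are hypothetical. Applying the same scale and offsets to both the RGB and TIR images keeps the pair spatially aligned.

```python
def letterbox_params(src_w, src_h, dst=640):
    """Scale and padding that fit a src_w x src_h image into a dst x dst square.

    Returns (scale, (new_w, new_h), (left, top, right, bottom)).
    Centered padding is an assumption of this sketch.
    """
    scale = min(dst / src_w, dst / src_h)
    new_w, new_h = round(src_w * scale), round(src_h * scale)
    pad_x, pad_y = dst - new_w, dst - new_h
    left, top = pad_x // 2, pad_y // 2
    return scale, (new_w, new_h), (left, top, pad_x - left, pad_y - top)


def map_box(box, scale, left, top):
    """Map an (x1, y1, x2, y2) box from source pixels to letterboxed pixels."""
    x1, y1, x2, y2 = box
    return (x1 * scale + left, y1 * scale + top, x2 * scale + left, y2 * scale + top)


scale, size, pads = letterbox_params(640, 512)
print(scale, size, pads)
print(map_box((100, 100, 200, 200), scale, pads[0], pads[1]))
```

For the native thermal format no rescaling is needed: the image keeps its 640 × 512 size and receives 64 rows of padding above and below, so box coordinates shift only along the vertical axis.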

The detector is trained for 250 epochs using SGD with a batch size of 32, an initial learning rate of 0.0005, and momentum of 0.937. The main training, augmentation, and inference settings are summarized in Table 6. For temporal fine-tuning, short clips of length T = 5 are randomly sampled at the video-segment level. Adjacent-frame association uses maximum-IoU matching with threshold τ = 0.5, and the temporal regularization weight is set to λ = 0.2. Hard-negative sequences account for approximately 20% of temporal-training samples through oversampling, which improves the robustness of the classifier/objectness branch to semantic neighbors without forcing incorrect fire/smoke regression targets.
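One way to reach a roughly 20% share of hard-negative clips is sketched below. The paper reports only the resulting proportion, so both the repeat-factor formula and the random sampler are illustrative assumptions, as are the clip counts and names.

```python
import random


def hard_negative_repeat(n_fire, n_hard_neg, target_share=0.2):
    """Repeat factor k such that k * n_hard_neg / (n_fire + k * n_hard_neg)
    equals target_share."""
    return target_share * n_fire / ((1.0 - target_share) * n_hard_neg)


def sample_clips(fire, hard_neg, n, target_share=0.2, seed=0):
    """Draw n training clips, picking a hard negative with probability target_share."""
    rng = random.Random(seed)
    return [
        rng.choice(hard_neg) if rng.random() < target_share else rng.choice(fire)
        for _ in range(n)
    ]


# Hypothetical pools: 400 fire/smoke clips and 40 hard-negative clips.
fire = [f"fire_{i}" for i in range(400)]
hard_neg = [f"neg_{i}" for i in range(40)]

print(hard_negative_repeat(len(fire), len(hard_neg)))
batch = sample_clips(fire, hard_neg, 10000)
share = sum(name.startswith("neg_") for name in batch) / len(batch)
print(round(share, 2))
```

With these hypothetical counts each hard-negative clip would need to be repeated about 2.5 times; the probabilistic sampler achieves the same share without materializing duplicated clips.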

To further clarify the choice of temporal hyperparameters, a sensitivity analysis was conducted on the continuous video subset. As shown in Table 7, T = 5 and λ = 0.2 provide a favorable trade-off between frame-level detection accuracy, continuous detection stability, and bounding-box smoothness. Increasing the sequence length from 5 to 7 brings only marginal improvement in continuous detection rate (CDR) and jitter index (J), while increasing the sequence memory and training cost. In contrast, a shorter sequence provides insufficient temporal context for intermittent smoke and partially occluded targets. For the temporal regularization weight, λ = 0.1 weakens the smoothing effect, whereas λ = 0.3 slightly decreases mAP@0.5 because of over-smoothing. Therefore, T = 5 and λ = 0.2 are adopted in the final YOLO-MMSC-T model.

3.3. Evaluation Metrics

The overall detection performance is evaluated by mAP@0.5, precision (P), and recall (R). To characterize continuous monitoring reliability, two temporal metrics are additionally used. The continuous detection rate (CDR) measures the ratio of target trajectories that can be continuously detected within a clip. The jitter index (J) is used to quantify the temporal fluctuation of predicted bounding boxes across adjacent frames.

For a matched target trajectory with N valid detections, the jitter index is defined as the average normalized variation of the box center and size between adjacent frames:

J = \frac{1}{N - 1} \sum_{t = 1}^{N - 1} \left( \frac{| x_{t + 1} - x_{t} |}{W} + \frac{| y_{t + 1} - y_{t} |}{H} + \frac{| w_{t + 1} - w_{t} |}{W} + \frac{| h_{t + 1} - h_{t} |}{H} \right),    (3)

where (x_t, y_t, w_t, h_t) denotes the center coordinates, width, and height of the predicted bounding box at frame t, and W and H are the image width and height, respectively. For multiple target trajectories, the reported J is averaged over all valid trajectories in the evaluated video subset. A smaller J indicates lower temporal box fluctuation and more stable continuous monitoring. It should be noted that J is not intended to replace standardized multi-object tracking metrics such as MOTA or IDF1, because this work focuses on the temporal stability of detection boxes for fire/smoke alarm continuity rather than identity-preserving object tracking. Therefore, J is used as a task-specific diagnostic metric for comparing bounding-box smoothness under the same video sequences and evaluation protocol. The value 2.8 × 10⁻³ is not treated as a statistically justified universal threshold for “low” jitter; instead, it indicates lower normalized box fluctuation relative to the compared baselines on the same continuous video subset. On edge hardware, average FPS is reported to reflect practical deployability. In addition, subset-level mAP is used to quantify modality contribution under Night, Occlusion/Smoke, and Small-target conditions.
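Equation (3) translates directly into code. The sketch below computes J for one trajectory and averages it over trajectories; treating trajectories with fewer than two detections as invalid (and returning 0.0 for them) is an assumption of this sketch, as are the function names and the example track.

```python
def jitter_index(track, W, H):
    """Jitter index J of one trajectory, following Equation (3).

    track: list of (x, y, w, h) boxes in pixels for consecutive valid
    detections, where (x, y) is the box center.
    """
    n = len(track)
    if n < 2:
        return 0.0  # assumption: J is undefined for a single detection
    total = 0.0
    for (x0, y0, w0, h0), (x1, y1, w1, h1) in zip(track, track[1:]):
        total += (
            abs(x1 - x0) / W
            + abs(y1 - y0) / H
            + abs(w1 - w0) / W
            + abs(h1 - h0) / H
        )
    return total / (n - 1)


def mean_jitter(tracks, W, H):
    """Average J over all trajectories with at least two detections."""
    valid = [t for t in tracks if len(t) >= 2]
    if not valid:
        return 0.0
    return sum(jitter_index(t, W, H) for t in valid) / len(valid)


# Hypothetical tracks on a 640 x 640 frame.
static = [(100.0, 100.0, 20.0, 20.0)] * 3
drifting = [(100.0, 100.0, 20.0, 20.0), (100.64, 100.0, 20.0, 20.0), (101.28, 100.0, 20.0, 20.0)]
print(jitter_index(static, 640, 640), jitter_index(drifting, 640, 640))
```

A center drift of 0.64 pixels per frame on a 640-pixel-wide image gives J of about 1 × 10⁻³, which offers a sense of scale for the reported jitter values.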

4. Results and Discussion

4.1. Ablation Study

Ablation experiments are conducted to verify the contribution of each module. The comparison starts from single-modality YOLOv11 baselines and a simple concatenation-based RGB–T baseline (YOLO-concat), followed by progressively adding the MBConv lightweight backbone, SDFM, CGA, and NWD. The results are listed in Table 8.

The simple concatenation baseline already improves on the single-modality models, indicating that cross-modal complementarity is beneficial. However, it also increases the computational burden and provides only limited gains under difficult conditions. Replacing the heavy backbone blocks with MBConv significantly improves computational efficiency: compared with YOLO-concat, Network 1 (MBConv only) reduces GFLOPs from 8.9 to 5.0 and increases FPS from 38 to 79. However, its mAP@0.5 decreases from 90.1% to 87.3%, a drop of 2.8 percentage points. This decrease reflects the accuracy–efficiency trade-off caused by lightweight feature extraction: MBConv reduces computational cost and improves inference speed, but lightweight extraction alone weakens feature representation and cannot sufficiently handle cross-modal misalignment, weak smoke boundaries, and small-target localization.

After introducing SDFM, the mAP@0.5 increases from 87.3% to 90.4%, which not only recovers most of the accuracy loss caused by lightweighting but also slightly surpasses the simple concatenation baseline. This suggests that shallow cross-modal alignment and denoising help compensate for the reduced representation capacity of the lightweight backbone. The further addition of CGA improves mAP@0.5 to 92.9%, showing that adaptive RGB–TIR fusion can more effectively exploit complementary visible-light texture and thermal saliency under complex backgrounds. Finally, NWD-based regression further improves the localization robustness of small and distant fire spots, resulting in the complete YOLO-MMSC with 94.6% mAP@0.5, 6.4 GFLOPs, and 64 FPS. These results indicate that SDFM and CGA not only enhance cross-modal perception but also compensate for the accuracy loss introduced by lightweighting, achieving a better accuracy–efficiency balance for edge deployment.
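The accuracy–efficiency trade-off described above can be made explicit with a short calculation. The sketch below uses only the figures quoted in the text of this subsection; the labels "Network 2 (+SDFM)" and "Network 3 (+CGA)" are inferred from the order in which the modules are added, and GFLOPs/FPS values for those two intermediate networks are omitted because they are not stated in the text.

```python
# mAP@0.5 (%) for each ablation step, in the order the modules are added
map50 = {
    "YOLO-concat": 90.1,
    "Network 1 (MBConv)": 87.3,
    "Network 2 (+SDFM)": 90.4,   # label inferred from the text
    "Network 3 (+CGA)": 92.9,    # label inferred from the text
    "YOLO-MMSC (+NWD)": 94.6,
}
# (GFLOPs, FPS) where the text reports them
cost = {
    "YOLO-concat": (8.9, 38),
    "Network 1 (MBConv)": (5.0, 79),
    "YOLO-MMSC (+NWD)": (6.4, 64),
}

names = list(map50)
# Change in mAP@0.5 (percentage points) contributed by each step
step_gain = {cur: round(map50[cur] - map50[prev], 1) for prev, cur in zip(names, names[1:])}

base_gflops, base_fps = cost["YOLO-concat"]
# Relative GFLOPs reduction (%) and FPS speed-up factor with respect to YOLO-concat
relative = {
    name: (round(100 * (1 - g / base_gflops), 1), round(f / base_fps, 2))
    for name, (g, f) in cost.items()
    if name != "YOLO-concat"
}

if __name__ == "__main__":
    for name, gain in step_gain.items():
        print(f"{name}: {gain:+.1f} pp mAP@0.5")
    for name, (saving, speedup) in relative.items():
        print(f"{name}: {saving}% fewer GFLOPs, {speedup}x FPS vs. YOLO-concat")
```

With these numbers, SDFM, CGA, and NWD contribute +3.1, +2.5, and +1.7 percentage points, respectively, and the complete YOLO-MMSC ends 4.5 points above YOLO-concat while using about 28% fewer GFLOPs at roughly 1.68 times the frame rate.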

4.2. Contribution of Corridor-Oriented Data and Hard Negatives

To examine whether the performance gain mainly comes from architectural design or from better task-aligned data support, an additional data-source study was conducted on the field test subset. Three training configurations were compared: (1) public RGB–T data only, (2) public data plus the self-collected corridor-oriented sequences, and (3) the previous setting further augmented with hard negatives. The results are summarized in Table 9.

Training on public data only already provides a reasonable starting point, but the detector remains less stable in corridor scenes because repeated vegetation textures, weak smoke, and thermal distractors are underrepresented in generic wildfire data. After adding the self-collected sequences, all field-oriented indicators improve, especially CDR and the Night/Occlusion subsets, showing that corridor-specific data are necessary for stable deployment. The final addition of hard negatives brings a further precision gain and helps suppress residual false alarms in confusing nighttime scenes, while also improving temporal continuity. These observations support the claim that the proposed dataset is not merely additional data volume, but a task-aligned resource for corridor wildfire remote sensing.

4.3. Robustness Under Different Nighttime Gain Settings

Because thermal gain strongly affects weak-target visibility and hot-object suppression in practical UAV patrol, an additional comparison was performed on the self-collected nighttime subset under high-gain and low-gain settings. Table 10 reports the results within the recommended altitude interval of 120–180 m. The high-gain setting preserves weak fire and thin-smoke responses more reliably, leading to better recall and CDR. The low-gain setting reduces over-saturation in some hot scenes, but under weak-fire conditions, it also compresses thermal contrast and increases the risk of intermittent misses. These observations are consistent with the failure-case analysis and suggest that nighttime gain should be selected jointly with scene complexity and expected target strength rather than using a fixed setting for all patrol tasks.

4.4. Performance Comparison and Literature Positioning

Table 11 compares the proposed detector with representative baselines, including Faster R-CNN, YOLOv5, YOLOv8, single-modality YOLOv11 variants, and the simple RGB–T concatenation baseline. YOLO-MMSC achieves the best overall detection accuracy and precision among the evaluated methods while maintaining real-time speed on both the workstation and edge hardware.

Compared with single-modality YOLOv11, the proposed multimodal design clearly improves both recall and precision. Compared with the simple fusion baseline, YOLO-MMSC further shows that gains do not come merely from adding another modality; instead, they result from alignment-aware denoising, adaptive fusion, and small-target-oriented regression. More specifically, SDFM improves shallow cross-modal alignment and noise suppression under smoke and vegetation occlusion, CGA adaptively reallocates RGB and TIR contributions under nighttime and partially obscured conditions, and NWD reduces the sensitivity of small-box regression to slight localization offsets. This explains the consistent gains of YOLO-MMSC over the simple RGB–T concatenation baseline in the Night, Occlusion/Smoke, and Small-target subsets.
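To illustrate why an NWD-style measure is less sensitive than IoU to slight offsets of small boxes, the sketch below compares the two on a toy example. It follows the commonly used formulation in which each box is modeled as a 2D Gaussian and the similarity is exp(−W₂/C); the exact variant and constant used in YOLO-MMSC are not given in this section, so the function names and the value C = 12.8 are placeholders.

```python
import math

# Boxes are (x_center, y_center, width, height) in pixels


def iou_cxcywh(a, b):
    """Plain IoU between two center-format boxes."""
    ax1, ay1, ax2, ay2 = a[0] - a[2] / 2, a[1] - a[3] / 2, a[0] + a[2] / 2, a[1] + a[3] / 2
    bx1, by1, bx2, by2 = b[0] - b[2] / 2, b[1] - b[3] / 2, b[0] + b[2] / 2, b[1] + b[3] / 2
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def nwd(a, b, c=12.8):
    """Normalized Wasserstein distance; c is a dataset-dependent constant (placeholder value)."""
    w2 = math.sqrt(
        (a[0] - b[0]) ** 2
        + (a[1] - b[1]) ** 2
        + ((a[2] - b[2]) / 2) ** 2
        + ((a[3] - b[3]) / 2) ** 2
    )
    return math.exp(-w2 / c)


if __name__ == "__main__":
    small_gt = (50.0, 50.0, 8.0, 8.0)
    for shift in (4.0, 10.0):
        pred = (50.0 + shift, 50.0, 8.0, 8.0)
        print(f"shift={shift:>4} px  IoU={iou_cxcywh(small_gt, pred):.3f}  NWD={nwd(small_gt, pred):.3f}")
```

For an 8 × 8 px target, a 4 px offset already drops IoU to about 0.33, and a 10 px offset gives an IoU of exactly 0 (no overlap, hence no useful regression signal), whereas the NWD value decays smoothly from about 0.73 to about 0.46. A regression loss of the form 1 − NWD therefore still distinguishes near misses from far misses for long-range small fire spots.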

Recent studies have also investigated RGB–thermal or multispectral object detection for UAV fire monitoring, UAV small-object detection, and real-time aerial perception. However, most existing methods focus on forest or wildland fire scenes, generic UAV multimodal small-object detection, drone-based target detection, or real-time aerial perception, rather than complex-background power transmission-corridor wildfire monitoring. Therefore, Table 12 provides a concise comparison with recent related methods from the perspective of application scenario, modality, technical focus, and deployment relevance. Because these studies are evaluated on different datasets and task definitions, their reported numerical results are not directly comparable with the proposed method on our corridor-oriented test set. Instead, this comparison is used to clarify the application-specific contribution of YOLO-MMSC.

The comparison with recent literature further reveals both convergences and divergences. On the one hand, our findings are consistent with recent RGB–thermal wildfire and UAV multimodal detection studies in that thermal information provides more stable cues under low illumination, smoke-obscured scenes, and weak target contrast, while visible images contribute texture and contextual boundaries. This convergence explains why RGB–T fusion generally improves over single-modality detection in the Night and Occlusion/Smoke subsets. On the other hand, the present study differs from most existing works in its application boundary. Recent RGB–thermal wildfire studies mainly focus on forest or wildland fire scenarios, and recent UAV multimodal small-object studies usually emphasize generic target detection. In contrast, power transmission-corridor wildfire monitoring requires simultaneous consideration of repeated line-like backgrounds, long-range small fire/smoke targets, power-infrastructure-related thermal distractors, edge-side latency, and continuous alarm stability. Therefore, the gains observed in this study should not be interpreted only as a general benefit of adding a thermal modality, but as the result of combining corridor-oriented data, shallow cross-modal alignment, adaptive RGB–TIR fusion, small-target regression, hard-negative training, and lightweight temporal consistency. This also explains why direct numerical comparison with published results obtained on different datasets may be misleading, whereas comparison under the same experimental setting and literature-level discussion provides a fairer assessment of the proposed framework.

Figure 6 presents representative qualitative comparisons. In the visual examples, the proposed detector produces fewer missed boxes for thin smoke and small flames, and it remains more stable when branches and background clutter partially cover the target.

4.5. Subset Analysis and Temporal Stability

To better quantify modality contribution under key conditions, subset-level mAP is reported in Table 13. The RGB-only detector performs poorly in the Night and Occlusion/Smoke subsets, confirming that visible-light information degrades substantially under low illumination and smoke-obscured conditions. The TIR branch is more stable, while simple RGB–T concatenation yields only limited additional gains. The progressive improvements obtained by Network 2 and Network 3 demonstrate that SDFM and CGA are particularly beneficial in difficult operating conditions. YOLO-MMSC achieves the best results in all three subsets.

The subset analysis provides a more interpretable explanation of why RGB–T fusion is effective for remote sensing of corridor wildfire hazards. In nighttime scenes, thermal saliency dominates because visible texture is weak; under occlusion and smoke, the TIR branch offers more stable target cues but still needs visible information to preserve contour and contextual boundaries. The gains of SDFM and CGA therefore reflect not only higher accuracy, but also better cross-modal task allocation under different environmental disturbances.

Temporal stability is further examined on the continuous video subset. As shown in Table 14, YOLO-MMSC already yields better continuity and lower jitter than the other non-temporal baselines. After temporal fine-tuning, YOLO-MMSC-T further improves CDR from 93.1% to 95.6% (2.5 percentage points) and reduces the jitter index from 3.9 × 10⁻³ to 2.8 × 10⁻³ (a relative reduction of about 28%), while the frame-level mAP changes only slightly. These results indicate that the temporal module mainly enhances alarm continuity and suppresses box fluctuation rather than simply increasing static detection accuracy.

4.6. Field Altitude Study and Failure Cases

A dedicated field study was conducted to identify a practical UAV operating altitude for transmission-corridor wildfire early warning. Using a DJI Matrice 4T platform, paired RGB–T sequences were acquired at six altitudes from 60 m to 210 m with a step of 30 m. The quantitative results are summarized in Table 15, and the corresponding visual comparison is shown in Figure 7. Overall, increasing altitude leads to a gradual reduction in apparent target size and edge clarity, which directly affects the detectability of weak fire and smoke signatures. This tendency is reflected by the monotonic decrease in Recall and CDR, together with the increase in jitter index J.

At lower altitudes (60–90 m), the model achieves the highest quantitative performance, with relatively large target projections and clear thermal boundaries. However, although low-altitude operation is beneficial for detectability, it also reduces coverage efficiency and may limit inspection productivity in long transmission corridors. As altitude increases to the medium range (120–180 m), the target occupies a smaller image area and smoke boundaries become less distinct, especially in scenes with vegetation occlusion or weak thermal contrast. Nevertheless, the detector still maintains relatively stable performance in this interval. Combined with the larger coverage area at these heights, this range provides a better compromise between detection reliability and inspection efficiency and is therefore recommended for routine patrol.

Within the recommended altitude range, the performance degradation remains within an acceptable operational margin. Specifically, when the UAV altitude increases from 120 m to 180 m, mAP@0.5 decreases from 94.4% to 93.4% (1.0 percentage point), Recall decreases from 93.6% to 91.8% (1.8 percentage points), and CDR decreases from 95.0% to 93.0% (2.0 percentage points), while the jitter index increases from 3.2 × 10⁻³ to 4.1 × 10⁻³ (a relative increase of about 28%). These changes indicate that the 120–180 m range still maintains acceptable detection and continuity performance, but the gradual increase in temporal jitter suggests that excessively high altitudes may weaken the stability of weak fire and smoke monitoring.

When the altitude further increases to 210 m, the adverse effect of long-range imaging becomes more evident. Fire and smoke targets become less separable from the background, weak smoke structures are more likely to be submerged by clutter, and thermal edges become increasingly blurred. This results in lower Recall and CDR, as well as a higher jitter index, indicating that excessive altitude weakens both instantaneous detectability and temporal stability. In nighttime high-gain mode, the thermal branch can better preserve weak fire signatures at medium altitude. By contrast, in nighttime low-gain mode, thermal contrast may be compressed and some weak fire cues become less distinguishable, especially when altitude increases and the target occupies only a very small area. These observations support the recommendation that, under nighttime high-interference conditions, high-gain mode should be preferred to reduce the risk of missing weak fire targets.

Figure 8 presents representative failure cases under extreme interference conditions. The first example corresponds to nighttime fireworks. In this case, bright visible-light emission, intense local heat radiation, and smoke plumes appear simultaneously, making the scene highly similar to actual wildfire behavior in both RGB and TIR modalities. The second example corresponds to hot-object interference under nighttime low-gain mode. Due to compressed thermal contrast, high-temperature objects may exhibit thermal responses similar to flames, while their surrounding contextual cues are insufficient to fully suppress false activation. These cases indicate that, although the proposed method performs robustly in most corridor scenarios, extreme semantic-neighbor interference remains a practical challenge. It should also be noted that the present failure-case analysis is mainly qualitative. Although hard-negative samples are included in the training and validation protocol, the current manuscript does not provide a fine-grained class-wise false-positive frequency analysis for fireworks, hot non-fire objects, and other thermal distractors. Therefore, the robustness conclusion under these rare but important interference sources should be interpreted with caution. The current results demonstrate that hard-negative training improves overall field performance, but they do not yet fully quantify how often each type of semantic-neighbor interference causes false alarms in practical operation.

The two failure cases also suggest different mitigation directions. For nighttime fireworks, the interference is usually short-lived and accompanied by abrupt visible-brightness and thermal-intensity changes; therefore, temporal persistence and multi-frame consistency can be used to distinguish transient fireworks from gradually developing fire or smoke events. For hot non-fire objects, the thermal response is often more spatially stationary and may be associated with fixed locations such as residential areas, industrial facilities, roads, or known heat sources.

From an engineering deployment perspective, model prediction should be combined with alarm-level decision strategies rather than used in isolation. First, multi-frame alarm triggering can effectively suppress transient false positives; in the current setting, requiring three consecutive detections introduces only a limited additional delay while significantly improving alarm reliability. Second, region-based whitelisting or masking can be applied to known stationary heat sources. Third, scene-level auxiliary priors, such as forested or dry-vegetation corridor regions, industrial facilities, residential areas, and known heat-source locations, can be used to adjust the verification requirement of candidate alarms. Fourth, adaptive gain and threshold strategies should be adopted according to nighttime operating conditions. These recommendations clarify the practical deployment boundary of the proposed method under challenging interference conditions.
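The first two strategies (multi-frame triggering and masking of known stationary heat sources) can be sketched as a small post-processing gate placed after the detector. This is a simplified illustration under stated assumptions, not the deployed alarm logic: the class name `AlarmGate`, the pixel-space rectangular masks, and the frame-level (rather than per-target) streak counter are all hypothetical choices. Only the three-consecutive-detection requirement comes from the text.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Tuple

Rect = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max) in pixels
Point = Tuple[float, float]               # detection center (x, y) in pixels


@dataclass
class AlarmGate:
    """Raise an alarm only after N consecutive frames with an unmasked detection."""
    min_consecutive: int = 3                                 # three frames, as in the text
    masked_regions: List[Rect] = field(default_factory=list)  # known heat sources (assumed format)
    _streak: int = field(default=0, init=False, repr=False)

    def _is_masked(self, cx: float, cy: float) -> bool:
        return any(x0 <= cx <= x1 and y0 <= cy <= y1 for x0, y0, x1, y1 in self.masked_regions)

    def update(self, centers: Iterable[Point]) -> bool:
        """Feed the detection centers of one frame; return True while the alarm condition holds."""
        valid = [c for c in centers if not self._is_masked(*c)]
        # Any frame without a valid detection resets the streak
        self._streak = self._streak + 1 if valid else 0
        return self._streak >= self.min_consecutive


if __name__ == "__main__":
    gate = AlarmGate(masked_regions=[(0.0, 0.0, 100.0, 100.0)])
    frames = [
        [(300.0, 200.0)],  # single-frame flicker
        [],                # missed or absent
        [(300.0, 200.0)],
        [(302.0, 201.0)],
        [(301.0, 203.0)],  # third consecutive hit
        [(50.0, 50.0)],    # only a masked heat source
    ]
    print([gate.update(f) for f in frames])
    # [False, False, False, False, True, False]
```

A practical system would likely keep one streak per tracked target and define masks in geo-referenced coordinates rather than image pixels, since the UAV viewpoint changes along the corridor.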

4.7. Discussion

The experimental results indicate that the main benefit of the proposed framework does not come from simply adding a thermal modality to a visible-light detector. Instead, the improvement is produced by the combined effect of task-oriented data construction, shallow cross-modal alignment, adaptive RGB–TIR fusion, small-target-oriented regression, and lightweight temporal smoothing. The ablation results show that MBConv substantially improves inference efficiency but also introduces an accuracy–efficiency trade-off when used alone. This observation is important because lightweight design is not automatically beneficial for field monitoring unless the reduced representation capacity is compensated for by more effective multimodal interaction. In this work, SDFM and CGA play such a compensatory role: SDFM reduces shallow misalignment and noise coupling under smoke and vegetation occlusion, while CGA adaptively reallocates visible-texture and thermal-saliency information under nighttime or partially obscured conditions. NWD further reduces the sensitivity of small-box regression to slight localization offsets. Therefore, the final performance gain should be interpreted as the result of a coordinated multimodal design rather than an isolated module improvement.

Compared with recent RGB–thermal wildfire detection and UAV multimodal perception studies, the findings of this work show both convergence and divergence. The convergence is that thermal information is consistently useful under low illumination, smoke-obscured scenes, and weak target contrast, whereas visible images provide texture, shape, and contextual boundary information. This is consistent with recent RGB–thermal fire detection and UAV multimodal detection studies [8,14]. The divergence lies in the application boundary. Most existing RGB–thermal wildfire studies focus on forest or wildland fire scenes, while many UAV multimodal small-object studies focus on generic objects, person detection, or general aerial multispectral perception [12,13]. In contrast, power transmission-corridor wildfire monitoring requires simultaneous consideration of repeated line-like backgrounds, long-range weak fire/smoke targets, power-infrastructure-related thermal distractors, edge-side latency, and temporal alarm stability. Therefore, the proposed framework should be regarded as an application-specific RGB–thermal monitoring solution for transmission corridors rather than a general-purpose wildfire detector. This also explains why direct numerical comparison with published results obtained on different datasets and task definitions may be misleading; fair assessment requires both same-setting baseline comparison and literature-level discussion.

The practical implication of the results is that YOLO-MMSC-T is suitable as an edge-side perception module for UAV patrol, but it should not be regarded as a complete alarm-decision system. The altitude study suggests that the 120–180 m range provides a reasonable compromise between target visibility, temporal stability, and inspection coverage. Within this range, the detector still maintains acceptable mAP@0.5 and continuous detection rate, although jitter gradually increases with altitude. This result is useful for engineering deployment because it links detection performance with UAV operating conditions rather than evaluating the detector only on static images. Nevertheless, the recommended altitude range should be interpreted as a field guideline under the collected corridor conditions, not as a universal rule for all transmission-line environments.

Several limitations remain. First, the self-collected dataset is geographically and seasonally restricted. Although it covers different time periods, nighttime conditions, smoke/occlusion, small targets, and hard-negative interference, the field data were mainly collected from transmission-line corridors in Central China during winter. Therefore, the current results should not be interpreted as evidence that the model can be directly generalized to all corridor environments without further validation. Different vegetation types, soil backgrounds, terrain morphology, humidity levels, seasonal appearances, and extreme weather conditions may change the visual texture, thermal contrast, smoke diffusion pattern, and fire-spread behavior. For example, a humid tropical forest corridor may exhibit stronger smoke attenuation and lower thermal contrast, whereas a dry savanna or grassland corridor may contain more fragmented vegetation texture, exposed soil background, and faster flame spread. These domain shifts may affect both RGB appearance and TIR saliency, and may further influence the reliability of cross-modal fusion. Thus, the present study should be regarded as a feasibility demonstration under the collected corridor conditions. Broader multi-region, multi-season, and multi-climate validation will be required before large-scale deployment in geographically diverse transmission corridors.

Second, the qualitative failure cases and the lack of subtype-level false-positive statistics show that appearance-based RGB–thermal fusion alone cannot fully resolve semantic-neighbor interference. Nighttime fireworks may simultaneously produce strong visible brightness, local heat radiation, and smoke-like plumes, making them similar to real wildfire events in both RGB and TIR modalities. Hot non-fire objects under low-gain thermal imaging may also generate flame-like responses when the surrounding contextual cues are weak. Similar limitations have also been reported in RGB–thermal fire detection studies, where thermal imagery improves weak-fire perception but may still suffer from false alarms under complex smoke-obscured or high-temperature interference conditions [8]. These cases suggest that reliable engineering deployment requires detector outputs to be coupled with an alarm-level decision mechanism. Multi-frame alarm triggering, temporal persistence checking, and spatial consistency constraints can suppress transient false positives, and related spatial–temporal strategies have been used to reduce false alarms in vision-based fire and smoke detection [26]. For fixed thermal distractors, region-based whitelisting or masking can be introduced using prior knowledge of industrial facilities, roads, residential heating areas, or known heat-source locations. A more rigorous evaluation of this issue will require a larger and more balanced hard-negative benchmark, in which fireworks, fixed hot objects, industrial heat sources, residential heating, vehicle lights, and smoke-like non-fire events are separately labeled. Future work will report subtype-level false-positive rate and false positives per image under unified confidence and NMS settings, so that the recurrence frequency of each interference type can be quantified more transparently.

Finally, the temporal consistency term used in this work is intentionally lightweight and mainly suppresses short-term flickering detections and bounding-box jitter. It does not explicitly model long-term fire evolution, smoke transport, wind direction, or scene-level risk priors. Future work will therefore focus on three more specific directions. First, larger multi-region and multi-season RGB–T corridor datasets should be constructed to cover different vegetation types, terrain morphologies, humidity levels, seasonal backgrounds, UAV viewing angles, and industrial or residential thermal interference patterns. Such data would support domain generalization analysis and incremental adaptation when the model is transferred to new corridor environments. Second, more explicit spatiotemporal modeling can be explored, including trajectory-aware filtering for object-level alarm stabilization, lightweight Transformer-based temporal aggregation for inter-frame dependency modeling, and 3D convolutional modeling for short-clip fire/smoke motion feature extraction. Third, decision-level alarm fusion can be developed by combining detector confidence, temporal persistence, scene priors, geographic information system (GIS) data, known heat-source masks, corridor risk-zone maps, weather information, wind direction, humidity, and other non-visual sensor cues when available. These extensions would improve the generalizability, false-alarm suppression ability, and practical reliability of RGB–thermal UAV monitoring for transmission-corridor wildfire early warning.

5. Conclusions

This paper addressed UAV-based wildfire monitoring in complex transmission-corridor environments, where perception reliability, edge-side deployment, and continuous video stability must be considered together. To this end, an RGB–thermal multimodal framework, including YOLO-MMSC and its temporal-enhanced version YOLO-MMSC-T, was proposed. The main conclusions are summarized as follows.

First, from the perspective of model design, the proposed framework combines fine-grained cross-modal alignment and denoising, adaptive weighted fusion, a lightweight MBConv-based backbone, and NWD-based localization optimization for small targets. These modules jointly improve detection robustness in challenging corridor conditions, as supported by the dedicated subset evaluations.

Second, in terms of overall accuracy and deployment feasibility, YOLO-MMSC achieves 94.6% mAP@0.5 with 95.0% precision and 93.9% recall on the comprehensive test set, while maintaining real-time inference at 60 FPS on Jetson Orin NX. These results verify that the proposed lightweight fusion strategy achieves an effective trade-off among accuracy, speed, and deployability for transmission-corridor wildfire monitoring.

Third, regarding continuous monitoring and engineering application, YOLO-MMSC-T increases the continuous detection rate to 95.6% and reduces the jitter index to 2.8 × 10⁻³ on the temporal subset, indicating that the temporal consistency constraint effectively suppresses intermittent missed detections and improves temporal smoothness. The DJI Matrice 4T field study with altitude sweeping from 60 m to 210 m further suggests a recommended inspection range of 120–180 m, and the accompanying gain-setting and failure-case analyses clarify the practical deployment boundary of the proposed framework.

Overall, the main advancement of this work lies not only in the individual technical modules, but also in the integration of corridor-oriented RGB–thermal data construction, lightweight multimodal perception, small-target localization, temporal stability enhancement, and UAV field-operation analysis into a unified edge-deployable monitoring framework. In particular, the proposed framework improves the utilization of complementary cross-modal information by jointly exploiting visible-light texture and structural cues together with thermal infrared saliency. Through shallow feature alignment and denoising as well as adaptive RGB–TIR fusion, the detector can make fuller use of heterogeneous feature information than simple RGB–T concatenation, thereby improving robustness under nighttime low illumination, smoke/vegetation occlusion, and long-range weak-target conditions. Compared with conventional single-modality detectors and simple RGB–T fusion baselines, the proposed system provides a more application-specific solution for power transmission-corridor wildfire monitoring, where perception robustness, edge efficiency, and continuous alarm reliability must be considered simultaneously.

Although the present results are encouraging, broader validation is still needed for large-scale deployment. Future work will extend the dataset to more diverse corridor environments, further improve long-term temporal modeling, and integrate detection outputs with decision-level alarm logic for more robust remote-sensing-based early warning.

Author Contributions

Conceptualization, B.W. and D.H.; methodology, B.W. and D.H.; software, B.W.; validation, B.W., X.H. and Y.L.; formal analysis, B.W. and D.H.; investigation, B.W. and Z.G.; data curation, B.W.; writing—original draft preparation, B.W.; writing—review and editing, D.H., X.H. and B.H.; supervision, D.H.; project administration, D.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Key R&D Program of China, grant number 2023YFF0611300.

Data Availability Statement

The public datasets analyzed in this study are cited in the reference list. The self-collected RGB–T corridor dataset and related field sequences are not publicly released at this stage because of project and operational restrictions, but may be made available from the corresponding author upon reasonable request.

Acknowledgments

The authors thank the data-collection and field-test participants for their support in UAV RGB–thermal acquisition and annotation. This work is related to the National Key R&D Program of China project “Integrated demonstration of full-chain NQI technologies for fire prevention and control of key facilities in new power systems”.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Vahedi, S.; Zhao, J.; Pierre, B.; Lei, F.; Anagnostou, E.; He, K.; Jones, C.; Wang, B. Wildfire and power grid nexus in a changing climate. Nat. Rev. Electr. Eng. 2025, 2, 225–243.
Khan, I.; Ghassemi, M. A probabilistic approach for analysis of line outage risk caused by wildfires. Int. J. Electr. Power Energy Syst. 2022, 139, 108042.
Sayarshad, H.R.; Ghorbanloo, R. Evaluating the resilience of electrical power line outages caused by wildfires. Reliab. Eng. Syst. Saf. 2023, 240, 109588.
Xu, J.; Fang, C.; Cheng, Y. Wildfire Risk Assessment to Overhead Transmission-Line Based on Improved Analytic Hierarchy Process. Fire Mater. 2025, 49, 523–535.
Li, J.; Chen, J.; Yu, H.; Jiang, M.; Lu, Z.; Zhou, Y.; Wang, S.; Fan, J. Wildfire monitoring technologies of transmission-line corridors based on Fengyun-3E satellite imaging. Front. Energy Res. 2023, 11, 1265516.
Mu, L.; Yang, Y.; Wang, B.; Zhang, Y.; Feng, N.; Xie, X. Edge Computing-Based Real-Time Forest Fire Detection Using UAV Thermal and Color Images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 6760–6771.
Muksimova, S.; Umirzakova, S.; Mardieva, S.; Abdullaev, M.; Cho, Y.I. Revolutionizing Wildfire Detection Through UAV-Driven Fire Monitoring with a Transformer-Based Approach. Fire 2024, 7, 443.
Zhang, Y.; Rui, X.; Song, W. A UAV-Based Multi-Scenario RGB-Thermal Dataset and Fusion Model for Enhanced Forest Fire Detection. Remote Sens. 2025, 17, 2593.
Murat, A.A.; Kiran, M.S. A comprehensive review on YOLO versions for object detection. Eng. Sci. Technol. Int. J. 2025, 70, 102161.
Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149.
Boroujeni, S.P.H.; Razi, A.; Khoshdel, S.; Afghah, F.; Coen, J.L.; O’Neill, L.; Fule, P.Z.; Watts, A.; Kokolakis, N.M.T.; Vamvoudakis, K.G. A Comprehensive Survey of Research towards AI-Enabled UAV Systems for Wildfire Management. Inf. Fusion 2024, 108, 102369.
He, Z.; Zhao, B.; Wu, Y.; Jiang, Y.; Zhao, Q. A Progressive Target-Aware Network for Drone-Based Person Detection Using RGB-T Images. Remote Sens. 2025, 17, 3361.
Wang, Z.; Zhang, Q. Real-Time Aerial Multispectral Object Detection with Dynamic Modality-Balanced Pixel-Level Fusion. Sensors 2025, 25, 3039.
Zhuo, Z.; Lu, R.; Yao, Y.; Wang, S.; Zheng, Z.; Zhang, J.; Yang, X. TAF-YOLO: A Small-Object Detection Network for UAV Aerial Imagery via Visible and Infrared Adaptive Fusion. Remote Sens. 2025, 17, 3936.
Qi, Y.; Yang, S.; Chen, J.; Zhang, M.; Zhu, J.; Liu, X.; Zheng, H. A Modality Alignment and Fusion-Based Method for Around-the-Clock Remote Sensing Object Detection. Sensors 2025, 25, 4964.
Vasile, C.E.; Bîră, C.; Hobincu, R. UAV-TIRVis: A Benchmark Dataset for Thermal–Visible Image Registration from Aerial Platforms. J. Imaging 2025, 11, 432.
Hao, H.; Peng, Y.; Ye, Z.; Han, B.; Tang, W.; Kang, W.; Zhang, X.; Li, Q.; Liu, W. TMRGBT-D2D: A Temporal Misaligned RGB-Thermal Dataset for Drone-to-Drone Target Detection. Drones 2025, 9, 694.
Khan, S.; Muhammad, K.; Hussain, T.; Del Ser, J.; Cuzzolin, F.; Bhattacharyya, S.; Akhtar, Z.; de Albuquerque, V.H.C. DeepSmoke: Deep learning model for smoke detection and segmentation in outdoor environments. Expert Syst. Appl. 2021, 182, 115125.
Zhang, B.; Li, Z.; Sun, F.; Li, Z.; Dong, X.; Zhao, X.; Zhang, Y. SICFNet: Shared Information Interaction and Complementary Feature Fusion Network for RGB-T traffic scene parsing. Expert Syst. Appl. 2025, 276, 127171.
Hangloo, S.; Arora, B. Multimodal fusion techniques: Review, data representation, information fusion, and application areas. Neurocomputing 2025, 649, 130827.
Neverova, N.; Wolf, C.; Taylor, G.W.; Nebout, F. ModDrop: Adaptive multi-modal gesture recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 38, 1692–1706.
Hu, X.; Jiang, Y.; Xia, X.; Chen, C.; Liu, W.; Wan, P.; Bin, K.; Zhong, P. UAV-StrawFire: A visible and infrared dataset for real-time straw-fire monitoring with deep learning and image fusion. Int. J. Appl. Earth Obs. Geoinf. 2025, 141, 104586. [Google Scholar] [CrossRef]
Kularatne, S.D.M.W.; Álvarez Casado, C.; Rajala, J.; Hänninen, T.; Bordallo López, M.; Nguyen, L. FireMan-UAV-RGBT: A Novel UAV-Based RGB-Thermal Video Dataset for the Detection of Wildfires in the Finnish Forests. In Proceedings of the 2024 IEEE 29th International Conference on Emerging Technologies and Factory Automation (ETFA), Padova, Italy, 10–13 September 2024; pp. 1–8. [Google Scholar] [CrossRef]
Liu, H.I.; Galindo, M.; Xie, H.; Wong, L.K.; Shuai, H.H.; Li, Y.H.; Cheng, W.H. Lightweight Deep Learning for Resource-Constrained Environments: A Survey. ACM Comput. Surv. 2024, 56, 267. [Google Scholar] [CrossRef]
Wang, F.; Hao, M.; Yang, S.; Zhang, W. Deep Learning Workload Mapping Optimization on Jetson Platforms. ACM Trans. Archit. Code Optim. 2025, 22, 81. [Google Scholar] [CrossRef]
Lee, Y.; Shim, J. False Positive Decremented Research for Fire and Smoke Detection in Surveillance Camera Using Spatial and Temporal Features Based on Deep Learning. Electronics 2019, 8, 1167. [Google Scholar] [CrossRef]

Figure 1. Structure of the proposed YOLO-MMSC for RGB–thermal transmission-corridor wildfire detection.

Figure 2. Structure of the MBConv-based lightweight block used in the backbone and neck.

Figure 3. Structure of the SDFM module for shallow cross-modal alignment and denoising.

Figure 4. Structure of the CGA module for adaptive RGB–TIR fusion.

Figure 5. Representative samples of the dataset, including public RGB–T data, self-collected field sequences, and hard-negative examples.

Figure 6. Qualitative comparison of representative detectors in challenging corridor scenarios.

Figure 7. Representative RGB–T observations under different UAV inspection altitudes.

Figure 8. Representative failure cases under extreme interference conditions.

Table 1. Dataset composition and annotation statistics.

Source	RGB–T Pairs	Fire Boxes	Smoke Boxes	Hard-Negative Samples
Public RGB–T data	8214	10,572	8931	0
Self-collected corridor sequences	8127	9826	7644	0
Self-collected hard-negative samples	587	0	0	587

Table 2. Dataset split.

Source	Train	Validation	Test/Field Verification
Public RGB–T data	5750	2464	–
Self-collected corridor sequences	4876	1625	1626
Self-collected hard-negative samples	411	176	–

Table 3. Video-level split of the self-collected continuous corridor sequences.

Subset	Video Segments	Total Duration (min)	Annotated Paired Frames Used
Training	10	146.0	4876
Validation	3	44.1	1625
Test/field verification	3	44.5	1626
Total	16	234.6	8127
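
The counts reported in Tables 1–3 can be cross-checked with simple arithmetic. The following is a minimal sketch, with the figures transcribed by hand from the tables and with illustrative variable names that do not appear in the paper; it verifies that each split sums to its reported total and derives the approximate split ratio of the corridor sequences.

```python
# Figures transcribed from Tables 1-3; all names here are illustrative only.
splits = {
    "public": {"total": 8214, "parts": [5750, 2464]},          # train, val
    "corridor": {"total": 8127, "parts": [4876, 1625, 1626]},  # train, val, test
    "hard_negative": {"total": 587, "parts": [411, 176]},      # train, val
}


def split_is_consistent(entry):
    """True when the per-subset counts add up to the reported total."""
    return sum(entry["parts"]) == entry["total"]


def split_ratios(entry):
    """Fraction of the total assigned to each subset, rounded to 3 decimals."""
    return [round(p / entry["total"], 3) for p in entry["parts"]]


consistent = {name: split_is_consistent(e) for name, e in splits.items()}
corridor_ratios = split_ratios(splits["corridor"])

# Table 3: video-level totals for the self-collected sequences.
video_minutes = round(146.0 + 44.1 + 44.5, 1)
video_segments = 10 + 3 + 3

print(consistent, corridor_ratios, video_minutes, video_segments)
```

Under these figures the corridor sequences follow a split of roughly 60/20/20 at the frame level, and the per-subset durations and segment counts agree with the totals in Table 3.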

Table 4. Summary of the RGB–thermal data sources used in this study.

Data Source	Modality and Scene Type	Scale Used in This Study	Main Role in This Study
FireMan-UAV-RGBT	UAV RGB–thermal forest wildfire scenes	Part of 8214 cleaned public RGB–T pairs	Provides UAV-based RGB–thermal wildfire observations from forest-like environments.
FLAME-3	RGB and thermal outdoor wildland fire/smoke scenes	Part of 8214 cleaned public RGB–T pairs	Supplements open-field fire and smoke appearances under outdoor burning conditions.
RGBT-3M	Multi-scene RGB–thermal fire-, smoke-, and human-related scenes	Part of 8214 cleaned public RGB–T pairs	Improves multimodal scene diversity and cross-scene robustness.
Self-collected corridor RGB–T corpus	UAV RGB–thermal high-voltage transmission-line corridor scenes in Central China	8127 paired corridor frames plus 587 paired hard-negative samples	Provides corridor-specific continuous sequences covering different time periods, smoke/occlusion, long-range small targets, and practical thermal distractors.

Table 5. Hardware and software environment.

Item	Training Server	Edge Device
CPU	Intel i7-14700KF	Arm Cortex-A78AE
GPU	RTX 5070	1024-core Ampere
RAM	32 GB	16 GB
PyTorch	2.8.0	2.3.0
CUDA	12.8	12.6
Python	3.10.18	3.10.12
Other	–	JetPack 6.2

Table 6. Training, augmentation, and inference settings.

Item	Setting
Detector input size	$640 \times 640$ with letterbox padding
Native TIR size	$640 \times 512$
Epochs	250
Batch size	32
Optimizer	SGD
Initial learning rate	0.0005
Momentum	0.937
Geometric augmentation	Synchronized horizontal flip, scaling, and translation for RGB–TIR pairs
Mosaic augmentation	Applied during training; disabled during validation and inference
Photometric augmentation	Mild visible-branch perturbation; no color augmentation for TIR
Validation/inference augmentation	Disabled
Temporal sequence length	5
IoU threshold for matching	0.5
Temporal weight $λ$	0.2
Inference batch size	1
Inference precision	TensorRT FP16 on Jetson Orin NX
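
Table 6 lists a detector input of 640 × 640 with letterbox padding and a native TIR size of 640 × 512. The sketch below illustrates the geometry this implies, assuming that 640 × 512 means width × height and that the padding is split evenly between the two sides (a common convention, not confirmed by the table). The function names are hypothetical.

```python
def letterbox_params(src_w, src_h, dst=640):
    """Return (scale, new_w, new_h, pad_left, pad_top, pad_right, pad_bottom)
    for fitting a src_w x src_h image into a dst x dst canvas without
    changing its aspect ratio. Centered padding is an assumption."""
    scale = min(dst / src_w, dst / src_h)
    new_w, new_h = round(src_w * scale), round(src_h * scale)
    pad_w, pad_h = dst - new_w, dst - new_h
    pad_left, pad_top = pad_w // 2, pad_h // 2
    return (scale, new_w, new_h,
            pad_left, pad_top, pad_w - pad_left, pad_h - pad_top)


def box_to_letterbox(box, scale, pad_left, pad_top):
    """Map an (x1, y1, x2, y2) box from source pixels to canvas pixels."""
    x1, y1, x2, y2 = box
    return (x1 * scale + pad_left, y1 * scale + pad_top,
            x2 * scale + pad_left, y2 * scale + pad_top)


tir = letterbox_params(640, 512)
scale, _, _, pad_left, pad_top, _, _ = tir
shifted = box_to_letterbox((100, 200, 140, 230), scale, pad_left, pad_top)
print(tir, shifted)
```

With these assumptions the TIR frame is not rescaled; it only receives 64 rows of padding above and below, so annotated boxes shift vertically by 64 pixels and keep their size.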

Table 7. Sensitivity analysis of temporal hyperparameters on the continuous video subset.

T	$λ$	mAP@0.5	CDR (%)	J (×10⁻³)
3	0.2	94.8	94.5	3.2
5	0.2	94.9	95.6	2.8
7	0.2	94.8	95.7	2.7
5	0.1	94.9	94.9	3.2
5	0.3	94.6	95.4	2.7

Table 8. Ablation study of the proposed detector.

Model	MBConv	SDFM	CGA	NWD	mAP@0.5	GFLOPs	FPS
YOLOv11 (RGB)	–	–	–	–	87.9	8.2	41
YOLOv11 (IR)	–	–	–	–	88.8	8.2	41
YOLO-concat	–	–	–	–	90.1	8.9	38
Network 1	✓	–	–	–	87.3	5.0	79
Network 2	✓	✓	–	–	90.4	5.6	71
Network 3	✓	✓	✓	–	92.9	6.0	67
YOLO-MMSC	✓	✓	✓	✓	94.6	6.4	64
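
The incremental effect of each module in Table 8 can be read off as row-to-row differences. The sketch below computes them from the last four rows; the figures are transcribed from the table, and the variable names are illustrative only.

```python
# (model, mAP@0.5, GFLOPs, FPS) transcribed from the last four rows of Table 8.
ablation = [
    ("Network 1", 87.3, 5.0, 79),
    ("Network 2", 90.4, 5.6, 71),
    ("Network 3", 92.9, 6.0, 67),
    ("YOLO-MMSC", 94.6, 6.4, 64),
]
# Component added at each step, in table order.
added = ["SDFM", "CGA", "NWD"]

steps = []
for module, prev, cur in zip(added, ablation, ablation[1:]):
    steps.append((
        module,
        round(cur[1] - prev[1], 1),  # change in mAP@0.5 (percentage points)
        round(cur[2] - prev[2], 1),  # change in GFLOPs
        cur[3] - prev[3],            # change in FPS
    ))

for module, d_map, d_gflops, d_fps in steps:
    print(f"{module}: mAP {d_map:+.1f}, GFLOPs {d_gflops:+.1f}, FPS {d_fps:+d}")
```

These differences describe the cumulative ablation order used in the table; they do not isolate each module's contribution independently of the others.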

Table 9. Contribution of corridor-oriented data and hard negatives on the field test subset.

Training Configuration	mAP@0.5	P (%)	CDR (%)	Night mAP@0.5	Occ./Smoke mAP@0.5
Public RGB–T only	90.8	91.4	88.9	85.2	86.4
Public + self-collected corridor sequences	93.4	94.1	92.6	88.8	90.3
Public + self-collected corridor sequences + hard negatives	94.6	95.0	93.1	90.6	91.8

Table 10. Performance on the nighttime subset under different thermal gain settings (120–180 m).

Setting	mAP@0.5	Recall (%)	CDR (%)	J (×10⁻³)
Nighttime high gain	93.7	92.9	94.4	3.4
Nighttime low gain	92.5	91.1	92.8	3.9

Table 11. Comparison with existing methods.

Model	R (%)	P (%)	GFLOPs	mAP@0.5	FPS	Edge FPS
Faster R-CNN	87.2	86.5	112.9	86.7	15	–
YOLOv5	83.6	84.1	4.5	83.8	45	–
YOLOv8	85.4	86.2	8.7	86.1	35	25
YOLOv11 (RGB)	87.9	87.2	8.2	87.9	41	36
YOLOv11 (IR)	88.7	88.0	8.2	88.8	41	36
YOLO-concat	90.2	90.6	8.9	90.1	38	33
YOLO-MMSC	93.9	95.0	6.4	94.6	64	60
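
To relate the throughput figures in Table 11 to a per-frame time budget, the sketch below converts the Edge FPS column to milliseconds per frame and compares YOLO-MMSC with the YOLO-concat fusion baseline. The figures are transcribed from the table; it is assumed here that Edge FPS was measured on the Jetson Orin NX listed in Table 5, and the names are illustrative.

```python
# model: (GFLOPs, mAP@0.5, Edge FPS), transcribed from Table 11.
rows = {
    "YOLOv8": (8.7, 86.1, 25),
    "YOLOv11 (RGB)": (8.2, 87.9, 36),
    "YOLOv11 (IR)": (8.2, 88.8, 36),
    "YOLO-concat": (8.9, 90.1, 33),
    "YOLO-MMSC": (6.4, 94.6, 60),
}


def frame_ms(fps):
    """Average time per frame in milliseconds for a given frame rate."""
    return round(1000.0 / fps, 1)


latency = {name: frame_ms(v[2]) for name, v in rows.items()}

base, ours = rows["YOLO-concat"], rows["YOLO-MMSC"]
flops_saving_pct = round(100.0 * (base[0] - ours[0]) / base[0], 1)
map_gain = round(ours[1] - base[1], 1)

print(latency)
print(f"GFLOPs saving vs YOLO-concat: {flops_saving_pct}%, mAP gain: {map_gain}")
```

Frame rate is an average throughput figure, so the derived milliseconds per frame should be read as a mean budget rather than a worst-case latency.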

Table 12. Comparison with recent RGB–thermal, UAV small-object, and wildfire-detection studies.

Method	Main Task	Modality	Relation to This Study
Zhang et al. [8]	UAV forest fire detection	RGB–thermal	Provides a recent RGB–thermal wildfire dataset and fusion model, but mainly focuses on forest fire scenes rather than transmission-line corridors.
TAF-YOLO [14]	UAV multimodal small-object detection	Visible–infrared	Addresses small-object detection with adaptive multimodal fusion, but is not designed for wildfire monitoring in power transmission corridors.
PTANet [12]	Drone-based target detection	RGB–thermal	Focuses on RGB–thermal small-target detection for persons in UAV images, which differs from fire/smoke monitoring and alarm continuity.
FMPFNet [13]	Real-time aerial multispectral detection	Visible–infrared	Emphasizes real-time multispectral fusion for aerial object detection, while our work further considers corridor-specific wildfire data, hard negatives, and temporal stability.
YOLO-MMSC/YOLO-MMSC-T	Transmission-corridor wildfire monitoring	RGB–thermal	Integrates corridor-oriented data construction, shallow alignment, adaptive fusion, NWD-based small-target localization, edge deployment, and temporal consistency.
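
Several entries in Table 12 concern small-target localization, for which this study uses NWD-based regression. As a reminder of why NWD suits small boxes, the sketch below implements the commonly used form of the normalized Wasserstein distance, in which each box is modeled as a 2D Gaussian. It is not taken from the authors' code: the normalizing constant `c` is a placeholder, and the loss form `1 - NWD` is an assumption.

```python
import math


def nwd(box_a, box_b, c=12.8):
    """Normalized Wasserstein distance between boxes given as (cx, cy, w, h).
    c is a dataset-dependent constant; 12.8 is only a placeholder here."""
    cxa, cya, wa, ha = box_a
    cxb, cyb, wb, hb = box_b
    # Squared 2-Wasserstein distance between the two box Gaussians.
    w2_sq = ((cxa - cxb) ** 2 + (cya - cyb) ** 2
             + ((wa - wb) / 2) ** 2 + ((ha - hb) / 2) ** 2)
    return math.exp(-math.sqrt(w2_sq) / c)


def nwd_loss(box_a, box_b, c=12.8):
    """Assumed regression loss form: 1 - NWD."""
    return 1.0 - nwd(box_a, box_b, c)


def iou(box_a, box_b):
    """Plain IoU for (cx, cy, w, h) boxes, for comparison."""
    ax1, ay1 = box_a[0] - box_a[2] / 2, box_a[1] - box_a[3] / 2
    ax2, ay2 = box_a[0] + box_a[2] / 2, box_a[1] + box_a[3] / 2
    bx1, by1 = box_b[0] - box_b[2] / 2, box_b[1] - box_b[3] / 2
    bx2, by2 = box_b[0] + box_b[2] / 2, box_b[1] + box_b[3] / 2
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    return inter / (box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter)


gt = (100.0, 100.0, 8.0, 8.0)    # an 8 x 8 pixel target
near = (104.0, 100.0, 8.0, 8.0)  # prediction shifted by 4 pixels
far = (108.0, 100.0, 8.0, 8.0)   # prediction shifted by 8 pixels
print(iou(gt, near), nwd(gt, near))
print(iou(gt, far), nwd(gt, far))
```

For the 8-pixel shift the boxes no longer overlap and IoU falls to zero, whereas NWD still decreases smoothly with the offset, which is the property that makes it attractive for long-range small targets.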

Table 13. Subset-level performance under key operating conditions (mAP@0.5, %).

Model	Night	Occlusion/Smoke	Small-Target
YOLOv11 (RGB)	76.8	78.4	83.2
YOLOv11 (IR)	82.3	84.1	86.0
YOLO-concat	84.0	85.6	87.4
Network 2 (MBConv+SDFM)	86.2	87.9	89.5
Network 3 (+CGA)	88.9	90.4	91.6
YOLO-MMSC	90.6	91.8	92.7

Table 14. Temporal evaluation on the continuous video subset.

Model	mAP@0.5	CDR (%)	J (×10⁻³)
YOLOv11 (RGB)	88.3	85.6	6.1
YOLOv11 (IR)	89.6	88.7	5.2
YOLO-concat	90.8	89.5	5.0
YOLO-MMSC	94.7	93.1	3.9
YOLO-MMSC-T	94.9	95.6	2.8

Table 15. Performance at different UAV inspection altitudes.

Altitude (m)	mAP@0.5	Recall (%)	CDR (%)	J (×10⁻³)
60	95.1	94.8	96.0	2.5
90	94.7	94.1	95.0	2.9
120	94.4	93.6	95.0	3.2
150	94.0	92.8	94.0	3.6
180	93.4	91.8	93.0	4.1
210	92.7	90.6	92.0	4.7
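
The altitude trend in Table 15 can be summarized with an ordinary least-squares line. The sketch below fits mAP@0.5 and recall against altitude, using figures transcribed from the table and illustrative names; the linear fit is a descriptive summary only and is not part of the original analysis.

```python
# Figures transcribed from Table 15.
altitude_m = [60, 90, 120, 150, 180, 210]
map50 = [95.1, 94.7, 94.4, 94.0, 93.4, 92.7]
recall = [94.8, 94.1, 93.6, 92.8, 91.8, 90.6]


def ols_slope(xs, ys):
    """Slope of the least-squares line y = a + b * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Express the slopes per 30 m, the altitude step used in the table.
map_slope_per_30m = 30 * ols_slope(altitude_m, map50)
recall_slope_per_30m = 30 * ols_slope(altitude_m, recall)

# Step-by-step mAP drop between consecutive altitudes.
map_steps = [round(a - b, 1) for a, b in zip(map50, map50[1:])]

print(round(map_slope_per_30m, 2), round(recall_slope_per_30m, 2), map_steps)
```

The step-by-step differences show that the decline is not strictly linear: the mAP@0.5 drop per 30 m is larger above 150 m than below it, so the fitted slope understates the degradation at the highest altitudes.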

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

