Article

A UAV-Based Multi-Scenario RGB-Thermal Dataset and Fusion Model for Enhanced Forest Fire Detection

1 State Key Laboratory of Fire Science, University of Science and Technology of China, Hefei 230026, China
2 School of Emergency Management, Nanjing University of Information Science and Technology, Nanjing 210044, China
* Author to whom correspondence should be addressed.
Remote Sens. 2025, 17(15), 2593; https://doi.org/10.3390/rs17152593
Submission received: 29 April 2025 / Revised: 23 June 2025 / Accepted: 21 July 2025 / Published: 25 July 2025
(This article belongs to the Special Issue Advances in Spectral Imagery and Methods for Fire and Smoke Detection)

Abstract

UAVs are essential for forest fire detection because forest areas are vast and high-risk zones are often inaccessible; they enable both rapid long-range inspection and detailed close-range surveillance. However, aerial photography faces challenges such as multi-scale target recognition and adaptation to complex scenarios (e.g., deformation, occlusion, and lighting variations). RGB-Thermal fusion methods effectively integrate visible-light texture and thermal infrared temperature features, but current approaches are constrained by limited datasets and insufficient exploitation of cross-modal complementary information, and they ignore cross-level feature interaction. To address the scarcity of data for wildfire scenarios, we constructed a time-synchronized, multi-scene, multi-angle aerial RGB-Thermal dataset (RGBT-3M) with “Smoke–Fire–Person” annotations and modal alignment via the M-RIFT method. We further propose a CP-YOLOv11-MF fusion detection model based on the YOLOv11 framework, which progressively learns the complementary heterogeneous features of the two modalities. Experiments demonstrate the superiority of our method, with a precision of 92.5%, a recall of 93.5%, an mAP50 of 96.3%, and an mAP50-95 of 62.9%. The model’s RGB-Thermal fusion capability enhances early fire detection, offering a benchmark dataset and a methodological advance for intelligent forest conservation, with implications for AI-driven ecological protection.

1. Introduction

The frequency of forest fires has increased significantly in recent years, and extreme forest fire events have had a major impact on societies and ecosystems globally [1]. With the advantages of flexible mobility and multi-scale observation, UAVs have become important equipment for forest fire detection. In daily forest fire inspection, the composite strategy of long-range rapid inspection and close-range fine detection requires detection algorithms that are both lightweight and generalizable.
As shown in Figure 1, UAVs face two different application scenarios during forest fire detection: remote shooting and close shooting. This creates difficulties in multi-scale target detection and matching for forest fires. At the same time, aerial images are affected by changes in flight attitude and by vegetation occlusion, so targets exhibit significant multi-scale, deformation, and edge-blurring effects. Traditional fixed-shape detection models suffer from feature-matching deviation and scale sensitivity. RGB-Thermal fusion detection can therefore exploit the complementary advantages of visible-light (RGB) and thermal infrared (T) images and make up for the deficiencies of a single modality [2]. Visible-light images contain rich texture and color information, which can accurately identify the color characteristics of flames and provide a basis for detailed analysis of fire targets. Thermal infrared images are not limited by lighting conditions or vegetation occlusion and can sensitively capture high-temperature heat sources; even under heavy smoke or at night, they can accurately pinpoint fire locations. In remote shooting scenes, where the initial fire point is small, combining the high-temperature signal in the thermal infrared image with the smoke texture of the visible-light image allows potential fire sources to be detected in time. In close shooting scenes, RGB-Thermal fusion detection can quickly adapt to the scale changes and shape distortions of targets in complex vegetation environments and fire scenes with dramatic viewpoint changes.
In existing forest fire detection research, most publicly available forest fire image datasets are limited to visible-light images and lack real fire data. Publicly available RGB-Thermal image datasets are scarce, and accurate image alignment work is lacking. Most of these datasets focus on fire classification and segmentation tasks, leaving a gap in fire detection work [3]. The FLAME1 [4] and FLAME2 [5] datasets use an overhead view, which cannot reflect the multi-angle characteristics of UAVs during daily forest fire inspections. The well-labelled Corsican Fire Dataset [6] and the RGB-T wildfire dataset [2] contain limited amounts of data, all from a single experimental scenario, and therefore lack generalization. The FireMan-UAV-RGBT dataset [3] captures multi-scene forest fire images from multiple viewpoints, but it only supports the classification task and does not provide more refined information for forest fire detection. The dataset presented in this paper is comprehensive, considering the diversity of aerial viewpoints, the diversity of forest fire scenes, and the precision of the annotations. RGBT-3M was constructed to provide reliable data support for forest fire detection using RGB-Thermal imaging.
At the methodological level, the multimodal fusion strategies can be mainly categorized into three types: data-level fusion, feature-level fusion and decision-level fusion. They correspond to different stages of the algorithmic model inference process, as shown in Figure 2.
In the field of multimodal detection methods, deep learning networks mostly use intermediate fusion strategies [7,8], and researchers have developed a number of multimodal interaction and fusion strategies, which have proved to be effective in enhancing the design of modal interactions in the feature extraction phase [9,10]. CACFNet [11] mines complementary information from two modalities by designing cross-modal attention fusion modules, and uses cascaded fusion modules to decode multilevel features in an up–down manner; SICFNet [12] constructs a shared information interaction and complementary feature fusion network, which consists of three phases: feature extraction, information interaction, and feature calibration refinement; and the Thermal-induced Modality-interaction Multi-stage Attention Network (TMMANet [13]) leverages thermal-induced attention mechanisms in both the encoder and decoder stages to effectively integrate RGB and thermal modalities.
At present, preliminary progress has been made in forest fire identification based on RGB-Thermal fusion [2,3,5,6]. Although existing work applies deep learning frameworks such as LeNet [14], MobileViT [15], ResNet [16], and YOLO [17] to forest fire detection and improves the efficiency of forest fire recognition [5], algorithm designs that exploit RGB-Thermal correlation remain very limited. Chen et al. [5] explored RGB-Thermal early-fusion and late-fusion methods for the classification and detection of forest fire images. Rui et al. [2] proposed an adaptive-learning RGB-T bimodal image recognition framework for forest fires. Guo et al. [18] designed the SkipInception feature extraction module and the SFSeg sandwich structure to fuse visible and thermal infrared images for the flame segmentation task. Overall, these algorithms do not deeply consider the interaction and propagation of cross-modal features; they lack the ability to simultaneously calibrate shallow texture features and localize high-level semantic information at all scales, and thus remain deficient when analyzing diverse and challenging forest fire scenarios.
We propose a new forest fire detection framework. It employs parallel backbone networks to extract RGB and TIR features. A cross-modal feature interaction structure is established during multi-scale feature extraction to enhance information interaction and propagation between modalities. Channel and spatial attention mechanisms, together with a feature-branch selection strategy, are introduced to suppress noise from heterogeneous inter-modal features. The framework thereby effectively combines the complementary information of the two modalities.
In summary, this paper will explore how to effectively fuse the features of visible and thermal infrared images on existing deep learning models to improve the efficiency and effectiveness of forest fire target detection in complex environments. Based on the above background, the main contributions of this paper are as follows:
(1) A novel forest fire dataset is introduced, containing time-synchronized RGB-thermal video data from real fires and outdoor experiments in multiple Chinese forest areas. It provides high-quality, reliable data for classification and detection tasks via manual frame-splitting, image alignment, and annotation, supporting subsequent deep learning model training and testing. To the best of our knowledge, this is the first RGB-Thermal image detection dataset for forest fires.
(2) A fire detection method was carried out by combining multimodal fusion techniques and computer vision methods. We choose the well-known deep learning architecture YOLOv11, and add cross-modal feature fusion structure and attention mechanism under the dual backbone structure of RGB and TIR to guide the gradual fusion of modal heterogeneous information and improve the adaptability of the method in forest fire target detection.
(3) Our constructed model is evaluated in several challenging forest fire scenarios, effectively demonstrating the usability and robustness of our proposed dataset and deep learning approach in forest fire detection scenarios.

2. Dataset

2.1. Data Collection

The experimental equipment used for RGB-T image data collection comprises a DJI M300 RTK UAV (DJI Technology Co., Ltd., Shenzhen, China) equipped with the H20T gimbal camera and a DJI MAVIC 2 Enterprise (DJI Technology Co., Ltd., Shenzhen, China) with an integrated camera, as shown in Figure 3.
In the process of data collection, the specific shooting specifications are shown in Table 1.
In order to collect forest fire images covering a wide range of scenarios, the study carried out large-scale field environmental data collection in Anhui, Yunnan, and Inner Mongolia, including real fires or outdoor experimental data. In the data collection process, multiple UAV devices were used, and all devices were time-synchronized to simultaneously acquire visible and thermal infrared videos to ensure the consistency of the acquired data. Finally, from the large number of videos collected, we filtered out the videos with high representativeness of forest fire scenes, and the relevant information is shown in Table 2.

2.2. Data Pre-Processing

In the pre-processing stage, frames are extracted from the videos at 5 frames per second to reduce the similarity between image frame pairs. Images are divided into fire and non-fire frame pairs to facilitate image classification tasks. Considering the similarity of scenes, an additional frame-skipping strategy is adopted to further streamline the processing workflow. Finally, 17,862 frame pairs are obtained, of which 6642 pairs are non-fire frames and 11,220 pairs are fire frames.
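As a concrete illustration, the sketch below shows one way such frame extraction could be implemented with OpenCV. It is a minimal example under assumed file names and directory layout, not the authors' released pipeline.

```python
# Minimal sketch: keep roughly 5 frames per second from a 30-fps video with OpenCV.
import cv2

def extract_frames(video_path: str, out_dir: str, target_fps: float = 5.0) -> int:
    cap = cv2.VideoCapture(video_path)
    src_fps = cap.get(cv2.CAP_PROP_FPS) or 30.0        # fall back to 30 fps if metadata is missing
    step = max(int(round(src_fps / target_fps)), 1)     # e.g., keep every 6th frame for 30 -> 5 fps
    saved, idx = 0, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            cv2.imwrite(f"{out_dir}/frame_{saved:06d}.jpg", frame)
            saved += 1
        idx += 1
    cap.release()
    return saved

# Example (hypothetical file names):
# n = extract_frames("video_pair_1_rgb.mp4", "frames/rgb")
```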
The visible and thermal infrared images are usually captured by different sensors, and the modality gaps caused by different imaging systems or styles pose a great challenge to the matching task [19]. Although different imaging modalities provide complementary information, multimodal images obtained directly from the camera are not aligned and cannot be fused directly, as shown in Figure 4, which depicts the dual-camera imaging geometry: the two cameras are associated with independent left and right coordinate systems.
Image alignment refers to establishing pixel-level correspondences between images taken from two viewpoints: through a series of pre-processing and alignment steps, the images are transformed via spatial mapping relationships into a common reference frame or coordinate system, i.e., into a common representation in which they are spatially aligned and can be compared and analyzed on the same spatial scale. Image alignment merges the strengths of the different modalities, resulting in a more comprehensive, accurate, and robust characterization.
At the device hardware level, the temporal acquisition frame rates of visible and thermal infrared images are kept synchronized, i.e., they are already aligned in the temporal dimension. In the spatial dimension, the alignment between the visible and thermal infrared images can be realized by solving the homography matrix of the visible images and the thermal infrared images and performing affine transformations, i.e.,
$$ H = R + \frac{1}{d}\, T N^{\top}, \qquad X_2 = H X_1 \tag{1} $$
H is the homography matrix, and R and T denote the rotation and translation matrices between the two coordinate systems. X_1 and X_2 are the coordinates of a point on the plane p expressed in the two camera coordinate systems, N is the normal vector of the plane p, and d is the distance from the plane to the origin of the camera coordinate system.
During the construction of most forest fire RGB-T data sets, the image registration process is usually carried out by manually selecting feature points, or by using general feature point matching methods such as ORB [20] or SIFT [21]. We propose a two-stage bimodal image alignment framework, termed M-RIFT, to improve the accuracy and robustness of matching heterogeneous image data, as shown in Figure 5. In the rough alignment stage, manually selected feature points are used as the coarse alignment step for image resizing to quickly overcome the initial geometric distortion. In the fine alignment stage, we adopt the RIFT multimodal image matching method [22]. First, feature points in the image are detected via the maximum moment map. Then, the maximum value index in each direction is searched to construct the maximum index map. Next, the FREAK descriptor is used to generate the feature vector, and homonymous point pairs are obtained based on the nearest-neighbor strategy. After removing outliers, the affine transform model between images is derived. This approach enables the rapid and accurate establishment of feature correspondences and optimization of matching results.
Through the above method, the homography matrix can be computed: for each pair of matched points (x_i, y_i) and (x_i', y_i'), a system of equations is constructed based on the mathematical model of the perspective transformation, which is as follows:
$$ \begin{bmatrix} x' \\ y' \\ 1 \end{bmatrix} = H \begin{bmatrix} x \\ y \\ 1 \end{bmatrix} \tag{2} $$
H is the homography matrix. Two equations can be obtained after expansion:
$$ x' = \frac{h_{11}x + h_{12}y + h_{13}}{h_{31}x + h_{32}y + h_{33}}, \qquad y' = \frac{h_{21}x + h_{22}y + h_{23}}{h_{31}x + h_{32}y + h_{33}} \tag{3} $$
h denotes the elements of the homography matrix H arranged as a vector. The vector h is obtained by solving this homogeneous system of equations via singular value decomposition (SVD), from which the homography matrix is then recovered.
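For concreteness, the following NumPy sketch shows how the homography of Equations (2) and (3) can be estimated from matched points with the standard direct linear transform and SVD. It is an illustrative example, not the authors' implementation; an off-the-shelf routine such as cv2.findHomography with RANSAC could be used instead once outliers are removed.

```python
# Direct linear transform: each match (x, y) <-> (x', y') contributes two rows to A,
# and h is the right-singular vector of A associated with the smallest singular value.
import numpy as np

def estimate_homography(src_pts: np.ndarray, dst_pts: np.ndarray) -> np.ndarray:
    """src_pts, dst_pts: (N, 2) arrays of matched points, N >= 4."""
    rows = []
    for (x, y), (xp, yp) in zip(src_pts, dst_pts):
        rows.append([-x, -y, -1, 0, 0, 0, xp * x, xp * y, xp])
        rows.append([0, 0, 0, -x, -y, -1, yp * x, yp * y, yp])
    A = np.asarray(rows, dtype=float)
    _, _, Vt = np.linalg.svd(A)
    H = Vt[-1].reshape(3, 3)          # null-space vector reshaped into the 3x3 homography
    return H / H[2, 2]                # normalize so that h33 = 1

# Warping the thermal image into the RGB frame (hypothetical variable names):
# aligned = cv2.warpPerspective(thermal_img, H, (rgb_w, rgb_h))
```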
Traditional feature point matching methods cannot effectively handle the modal differences between cross-modal images, making it difficult to match heterogeneous information. Figure 6 compares the feature point matching results of our approach with those of other methods; the green lines connect corresponding matching points in the visible-light and thermal infrared images.
The traditional methods are only capable of identifying a limited number of matching points, frequently accompanied by issues of incorrect alignment. By contrast, the approach proposed in this paper successfully detects a substantial number of accurate matching points, thereby demonstrating the superiority of the proposed method.

2.3. Statistical Analysis of the Dataset

The multi-scene, multi-target, multimodal forest fire aerial photography dataset (RGBT-3M) contains 22,440 fire-frame images (i.e., 11,220 image pairs), which were annotated using LabelImg (version 1.8.6). The labeled targets are smoke, fire, and person, with 13,574, 11,315, and 5888 instances, respectively. Because thermal infrared images lack obvious smoke features, a separate label set excluding smoke targets is provided for them. The dataset is split 7:3 into a training set and a validation set; some representative scenes are shown in Figure 7, and the detailed statistics are given in Table 3. The dataset will be made available at https://complex.ustc.edu.cn/.

3. Method

3.1. Overall Architecture Design

Convolutional Neural Networks (CNNs) are powerful at feature extraction and modeling and have achieved outstanding performance in early forest fire recognition tasks. Forest fire images are natural images with strong local correlations, and CNNs, with their translation invariance, are good at extracting fire features and thus learning high-level (semantic) features. Many well-known object detection frameworks have been applied to fire detection tasks, such as the YOLO series [17,23] and the R-CNN series [24]. The architectural design of the YOLO series always centers on the core goal of fast detection, which meets the efficiency requirements of forest fire detection tasks.
Therefore, YOLOv11 is adopted as the baseline in this study; its classic three-stage architecture balances feature abstraction capability and computational efficiency, providing a robust baseline framework for subsequent model improvement. The input resolution of the visible and infrared images is denoted as W × H. The backbone networks extract features at different scales via repeated convolutional down-sampling. At the feature map resolutions {W/8 × H/8, W/16 × H/16, W/32 × H/32}, the feature interaction design performs cross-modal information interaction between the features extracted by the two modal networks. After passing through the C3k2 feature fusion module, the feature splicing design is applied before the features are fed into the neck network.
To boost the model’s cross-modal fusion efficiency and detection performance, we devise a novel cross-modal feature fusion algorithm within the RGB-Thermal fusion framework. This algorithm comprises two key components: the feature interaction design and the feature splicing design.
In the feature interaction design, we integrate the channel prior convolutional attention (CPCA) mechanism. Given the significant differences in feature representations between infrared and visible images, CPCA dynamically adjusts the importance of different channels. By emphasizing the complementary information and suppressing redundant or conflicting features, it enables effective cross-modal synergy. This process allows the model to fully leverage the unique advantages of each modality.
For the feature splicing design, we adopt the parallel patch-aware splicing (PPAS) method. PPAS divides the feature maps into patches and processes them in parallel, guiding the model to focus on critical regions. This approach enhances the model’s perception of the target by capturing local details and global context simultaneously. Moreover, it suppresses irrelevant background information, significantly improving detection accuracy and efficiency. The construction process is shown in Figure 8, where different font colors distinguish the components of the model name: the three fusion strategies are labeled in blue, orange, and green, and the first letters of the improvement methods are labeled in red.

3.2. Forest Fire Detection Frameworks Based on Mid-Term Fusion Strategies

The YOLOv11-MF network architecture consists of a dual-input layer, a dual-channel backbone network layer, a neck network layer, and a detection layer. The same modal feature splicing module is set up after the C3k2 module of each backbone network branch. Subsequently, the generated fused feature maps are input to the neck network layer. At the neck network layer, the features are further integrated, and the feature information processed by the neck network layer is finally passed to the detection layer to output the detection results. The network structure of YOLOv11-MF is shown in Figure 9.
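As an illustration of this mid-term fusion idea, the following PyTorch sketch concatenates same-scale RGB and thermal feature maps produced by two parallel backbone stages and passes the fused map onward, as the neck would receive it. The layer choices are simplified stand-ins for the C3k2 stages, not the released implementation.

```python
import torch
import torch.nn as nn

class MidFusionStage(nn.Module):
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        # stand-ins for one down-sampling stage of each backbone branch
        self.rgb_stage = nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, 2, 1), nn.SiLU())
        self.tir_stage = nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, 2, 1), nn.SiLU())
        # 1x1 convolution restores the channel count after concatenation
        self.fuse = nn.Conv2d(2 * out_ch, out_ch, 1)

    def forward(self, rgb: torch.Tensor, tir: torch.Tensor):
        f_rgb, f_tir = self.rgb_stage(rgb), self.tir_stage(tir)
        fused = self.fuse(torch.cat([f_rgb, f_tir], dim=1))  # modal splicing (Concat)
        return f_rgb, f_tir, fused                            # fused map goes to the neck

# Example:
# x_rgb, x_tir = torch.randn(1, 3, 640, 640), torch.randn(1, 3, 640, 640)
# _, _, p3 = MidFusionStage(3, 64)(x_rgb, x_tir)
```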

3.3. Cross-Modal Feature Fusion Algorithm Design

To better fuse the bimodal information, the feature interaction module is designed and the modal splicing module is optimized on top of the mid-term fusion framework (YOLOv11-MF). In this way, the model can pay more attention to features that contribute to the detection result while suppressing the influence of irrelevant or noisy features, which strengthens the fusion of different sources of information and improves forest fire target detection performance.
Finally, we design an RGB-Thermal fusion detection model, named CP-YOLOv11-MF as shown in Figure 10.

3.3.1. Feature Interaction Module Design

To address the visible and infrared feature interaction problem, we designed a feature interaction module. As shown in Equation (4), visible and infrared features are processed via channel prior convolutional attention [25] to compute channel and spatial attention, after which they are combined element-wise with the original visible and thermal infrared features:
$$ F = (F_{RGB} \oplus F_{T}) \otimes F_{CPCA} \tag{4} $$
The method dynamically assigns attention weights in channel and spatial dimensions to adaptively emphasize important features in different modalities, as shown in Figure 11.
For the channel attention computation, we borrow from the hybrid attention mechanism CBAM (Convolutional Block Attention Module): spatial information is collected from the feature maps by average pooling and maximum pooling, and the pooled descriptors are then fed into a shared MLP, as shown in Equation (5):
$$ \mathrm{CA}(F) = \sigma\big(\mathrm{MLP}(\mathrm{AvgPool}(F)) + \mathrm{MLP}(\mathrm{MaxPool}(F))\big) \tag{5} $$
The spatial relations between features are computed with the help of depth-wise separable convolution, which reduces computational complexity while preserving inter-channel relations, as shown in Equation (6):
$$ \mathrm{SA}(F) = \mathrm{Conv}_{1\times 1}\left(\sum_{i=0}^{3} \mathrm{Branch}_{i}\big(\mathrm{DwConv}(F)\big)\right) \tag{6} $$
DwConv denotes the depth-wise separable convolution, and Branch_i, i ∈ {0, 1, 2, 3}, denotes the i-th branch.
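To make Equations (5) and (6) concrete, the PyTorch sketch below implements a simplified channel-then-spatial attention of this kind. The reduction ratio, kernel sizes, and branch design are assumptions that only loosely follow CPCA [25]; this is not the authors' exact module.

```python
import torch
import torch.nn as nn

class ChannelSpatialAttention(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        # shared MLP of Eq. (5), implemented with 1x1 convolutions
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
        )
        # depth-wise convolutions standing in for DwConv and the branches of Eq. (6)
        self.dwconv = nn.Conv2d(channels, channels, 5, padding=2, groups=channels)
        self.branches = nn.ModuleList([
            nn.Conv2d(channels, channels, k, padding=k // 2, groups=channels)
            for k in (3, 7, 11)               # simplified stand-ins for the multi-scale branches
        ])
        self.proj = nn.Conv2d(channels, channels, 1)   # the 1x1 convolution in Eq. (6)

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        # channel attention: shared MLP over average- and max-pooled descriptors, Eq. (5)
        ca = torch.sigmoid(
            self.mlp(torch.mean(f, dim=(2, 3), keepdim=True))
            + self.mlp(torch.amax(f, dim=(2, 3), keepdim=True))
        )
        f = f * ca
        # spatial attention: branch 0 is the identity path over the depth-wise output
        x = self.dwconv(f)
        sa = self.proj(x + sum(b(x) for b in self.branches))
        return f * torch.sigmoid(sa)

# Example: attention over a fused 256-channel feature map
# out = ChannelSpatialAttention(256)(torch.randn(1, 256, 80, 80))
```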

3.3.2. Feature Splicing Module Design

Owing to the characteristic disparities between modalities, simple splicing alone introduces substantial noise and renders inter-modal information transfer ineffective. To address this challenge, this study draws inspiration from the multi-branch feature extraction strategy of HCF-Net [26] and designs a parallel patch-aware splicing (PPAS) module. The module employs a parallel multi-branch framework comprising local, global, and serial convolutional branches, which effectively suppresses extraneous information. Additionally, it leverages spatial and channel attention mechanisms for adaptive feature enhancement, as illustrated in Figure 12.
The outputs of the local, global, and serial convolutional branches, each of size H × W × C, are computed, and their weighted sum yields the intermediate feature $\tilde{F}$ of size H × W × C. The attention module consists of channel attention followed by spatial attention: $\tilde{F}$ is processed sequentially by the one-dimensional channel attention map M_c of size 1 × 1 × C and the two-dimensional spatial attention map M_s of size H × W × 1, as shown in Equation (7):

$$ F_c = M_c(\tilde{F}) \otimes \tilde{F}, \qquad F_s = M_s(F_c) \otimes F_c, \qquad F = \delta\big(B(\mathrm{dropout}(F_s))\big) \tag{7} $$

Here, ⊗ represents the element-wise product; F_c = M_c($\tilde{F}$) ⊗ $\tilde{F}$ and F_s = M_s(F_c) ⊗ F_c compute the selected features; and δ(·) and B(·) denote the rectified linear unit and batch normalization operations, respectively.
The distinction between the local and global branches is accomplished by controlling the patch size parameter, which is implemented via the aggregation and displacement of non-overlapping patches in the spatial dimension. Thereby, the attention matrix between non-overlapping patches is computed to facilitate local and global feature extraction and interaction, as depicted in Figure 13.
The feature map is partitioned into spatially contiguous patches, and each patch is channel-averaged. The channel-averaged patches are then transformed linearly by a feed-forward network (FFN). On this basis, an activation function is applied to obtain a probability distribution over the spatial dimension of the linearly transformed features and, for each token, its weight is adjusted so that task-relevant features are emphasized.
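The sketch below is one plausible PyTorch reading of this patch-aware weighting (an assumption about the design, not the authors' code): patches are channel-averaged, scored by a small FFN, turned into a softmax distribution, and used to re-weight the input.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PatchAware(nn.Module):
    def __init__(self, patch: int, hidden: int = 64):
        super().__init__()
        self.patch = patch                      # small patch -> "local" branch, large patch -> "global" branch
        self.ffn = nn.Sequential(nn.Linear(patch * patch, hidden),
                                 nn.GELU(),
                                 nn.Linear(hidden, patch * patch))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        p = self.patch
        # split (B, C, H, W) into a grid of non-overlapping p x p patches
        patches = x.view(b, c, h // p, p, w // p, p)
        desc = patches.mean(dim=1)                                  # channel averaging per patch
        desc = desc.permute(0, 1, 3, 2, 4).reshape(b, -1, p * p)    # one token per patch
        weights = F.softmax(self.ffn(desc), dim=-1)                 # probability over positions in a patch
        weights = weights.view(b, h // p, w // p, p, p).permute(0, 1, 3, 2, 4)
        weights = weights.reshape(b, 1, h, w)
        return x * weights                                          # re-weight task-relevant features

# Example: y = PatchAware(patch=4)(torch.randn(1, 64, 32, 32))
```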

4. Experiment

4.1. Experimental Settings

The experiments were conducted on the Ubuntu 18.04 operating system with an NVIDIA GeForce RTX 3090 graphics card, CUDA 11.1, and Python 3.9.19. The principle of consistency was followed when training all networks: the same core optimizer parameters, optimization algorithm, and training settings were used throughout. The detailed training parameters are listed in Table 4.
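For reference, the optimizer settings of Table 4 can be expressed in plain PyTorch as in the minimal sketch below; the placeholder module and the initial learning rate are assumptions, since the full training loop is not reproduced here.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 16, 3)        # placeholder module standing in for CP-YOLOv11-MF
optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.01,                       # initial learning rate: an assumption (not listed in Table 4)
    momentum=0.937,                # Table 4
    weight_decay=0.0005,           # Table 4
)
EPOCHS, BATCH_SIZE = 200, 4        # Table 4
```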

4.2. Evaluation Criteria

Precision (P), recall (R), and average precision (AP) are used to evaluate model performance:
$$ P = \frac{TP}{TP + FP} \tag{8} $$

$$ R = \frac{TP}{TP + FN} \tag{9} $$

$$ AP = \int_{0}^{1} P(R)\, \mathrm{d}R \tag{10} $$
mAP50 denotes the mean AP computed at an IoU threshold of 0.5, while mAP50-95 is the mean AP averaged over IoU thresholds from 0.5 to 0.95 with a step size of 0.05.
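A small NumPy sketch of these definitions is given below: precision and recall from TP/FP/FN counts and AP as the area under the precision-recall curve. It uses a simple step-wise integration for illustration; detection toolkits typically use an interpolated variant.

```python
import numpy as np

def precision_recall(tp, fp, fn):
    # Eqs. (8) and (9); guard against empty denominators
    p = tp / (tp + fp) if (tp + fp) else 0.0
    r = tp / (tp + fn) if (tp + fn) else 0.0
    return p, r

def average_precision(recalls, precisions):
    # Eq. (10): integrate P(R) over recall in [0, 1] (step-wise approximation)
    order = np.argsort(recalls)
    r, p = np.asarray(recalls)[order], np.asarray(precisions)[order]
    return float(np.sum(np.diff(r, prepend=0.0) * p))

# Example with a toy precision-recall curve:
# ap = average_precision([0.0, 0.5, 1.0], [1.0, 0.8, 0.6])
```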

4.3. Comparative Experiment

In this section, a comparative study is carried out for the single-modal detection models and the RGB-Thermal detection frameworks. It is worth noting that smoke is visible in visible-light images but difficult to recognize in thermal infrared images. This is due to the limited sensitivity of the thermal infrared camera carried by the UAV and, for remote observation, the large distance between the UAV and the forest fire target, which prevents effective capture of smoke information. Meanwhile, to focus on RGB-T bimodal fusion for small-target detection, the subsequent comparison experiments are carried out with the smoke label removed, and only the fire and person classes are analyzed. Table 5 compares visible-light detection performance when different numbers of target categories are used.
For single-modal comparison, we selected RTMdet [27], a single-stage target detection algorithm with similar model complexity to YOLOv11, and FasterRCNN [28], a well-known two-stage target detection algorithm, for comparison experiments. Table 6 presents a comparison of the effect of single-modal detection methods.
Compared with the single-stage detection model of similar complexity, YOLOv11 shows a clear performance advantage, exceeding RTMdet on all metrics by a large margin. Compared with the more complex two-stage detector, YOLOv11 differs little from FasterRCNN on most indicators but is slightly higher in recall, indicating that YOLOv11 performs well in reducing missed detections, which is exactly what the early forest fire detection task requires.

4.4. Ablation Experiment

To verify the effect of each improvement module on detection capability, we compare the contributions of the different improvements to model performance. The ablation results are shown in Table 7. The experiments adopt YOLOv11 as the baseline model for unimodal detection on visible and infrared images. On this basis, the mid-term fusion framework “YOLOv11-MF” is designed. Adding the CPCA-based cross-modal feature interaction design to YOLOv11-MF yields “YOLOv11-MF + feature interaction structure”. Finally, integrating the PPAS feature splicing module results in “CP-YOLOv11-MF”.
As shown in Table 7, early, mid-term, and late RGB-Thermal bimodal fusion frameworks are constructed according to the different multimodal fusion strategies: using a simple Concat function for modal splicing, the early fusion framework (YOLOv11-EF), the mid-term fusion framework (YOLOv11-MF), and the late fusion framework (YOLOv11-LF) are built on the YOLOv11 model. Simple bimodal feature splicing improves algorithm performance only slightly, while designing a cross-modal feature interaction module and optimizing the modal splicing module enhances inter-modal interactions, enabling deep complementarity of features and information across modalities. After this series of targeted improvements, the final model (CP-YOLOv11-MF) reaches 92.5% precision, 93.5% recall, 96.3% mAP50, and 62.9% mAP50-95, reflecting the effectiveness of the various improvements.
In the detection framework based on the early fusion strategy (YOLOv11-EF), the input layer is modified by adding two new input channels and introducing the Concat function for early bimodal feature splicing. First, the infrared image and the visible image are taken as input data, and a feature splicing operation is performed on the two modalities to generate a bimodal fused feature map. The fused feature map is then passed through the backbone network layer and the neck network layer, and the detection layer performs target detection on the resulting features and outputs the detection results. The network structure of YOLOv11-EF is shown in Figure 14.
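A minimal PyTorch sketch of this early-fusion input (an assumed illustration, not the released code) is shown below: the RGB and thermal images are concatenated along the channel axis and fed to a stem convolution widened to six input channels.

```python
import torch
import torch.nn as nn

rgb = torch.randn(1, 3, 640, 640)      # visible image tensor
tir = torch.randn(1, 3, 640, 640)      # thermal image tensor (replicated to 3 channels)
early_fused = torch.cat([rgb, tir], dim=1)                     # (1, 6, 640, 640)
stem = nn.Conv2d(6, 64, kernel_size=3, stride=2, padding=1)    # stem conv widened to 6 input channels
features = stem(early_fused)
```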
The detection framework based on late fusion strategies (YOLOv11-LF) consists of a dual-input layer, a dual-channel backbone network layer, a dual-channel neck network layer, and a detection layer. The visible channel backbone network and the thermal infrared channel backbone network perform feature extraction for the visible and thermal infrared images, and output the extracted features to the neck network layer. The features are enhanced in the neck network layer. A feature splicing module is embedded at the output position of the neck network layer to input the generated bimodal image fusion features to the detection layer. The network structure is shown in Figure 15.
In order to verify the effectiveness of the RGB-Thermal bimodal target detection algorithm model, in this section, the YOLOv11 model is utilized to train the infrared and visible images separately to obtain the detection results in a single modality, i.e., the YOLOv11 network model processes the two types of images, visible and thermal infrared, and directly outputs the detection results without any fusion. As shown in Table 8, the performance of each model under single-modal detection and different detection fusion frameworks is compared.
A comparison of visible and infrared image detection results in a single modality shows that visible images, despite richer information, contain more interference. Evaluation of detection performance reveals similar mAP50 values for both modalities. However, infrared images exhibit a significant advantage in flame detection under the stricter mAP50-95 metric, outperforming visible images by 5.4%. In contrast, detection accuracy of person with infrared is slightly lower than that with visible light.
Comparing the single-modal and dual-modal detection results, the RGB-T dual-modal detection methods are clearly better than the single-modal results on all three types of evaluation indexes (precision, recall, and mAP), which demonstrates the effectiveness of image fusion in improving target detection performance. An in-depth analysis of the fusion stages within the RGB-Thermal bimodal frameworks shows that later fusion has certain advantages. In the early fusion stage, the raw data have not yet been processed in depth, so a large amount of redundant information and potential noise has not been eliminated, and these interfering factors can negatively affect the subsequent analysis. Mid-term fusion suffers from a similar problem, as a certain degree of noise inevitably remains during processing; in this case, relying only on a simple splicing operation to integrate the multi-source data cannot fully exploit the intrinsic correlations between the modalities, making efficient fusion and utilization of the information difficult. In contrast, by the late fusion stage the data have undergone multiple rounds of rigorous screening that effectively reduce the impact of noise; integrating the information at this point allows the model to refine the key features more accurately, so the late fusion framework performs better than the mid-term fusion framework on this dataset.
Under the multiple RGB-Thermal bimodal fusion frameworks mentioned above, only the Concat function is used for modal splicing operations and, in order to better perform modal fusion interactions, a cross-modal feature interaction structure is designed. Since only a single backbone network exists in the early fusion framework (YOLOv11-EF), the cross-modal feature interaction structure is applied in this section to the mid-term fusion framework (YOLOv11-MF) and late fusion framework (YOLOv11-LF) for experiments, as shown in Table 9. After adding the cross-modal feature interaction structure, the improved mid-term fusion framework performs optimally.
The experimental results show that, after adding the cross-modal structure, all indicators under the late fusion framework decreased to some extent, whereas all indicators under the mid-term fusion framework improved significantly. The cross-modal structure introduced into the late fusion framework arrives at a late stage of data processing, where it cannot interact effectively with the already largely integrated features, making it difficult for the model to adapt to and exploit the cross-modal information. In the mid-term fusion stage, by contrast, the data have been only initially processed and the feature patterns are not yet solidified, so introducing the cross-modal structure at this point can capture the rich complementary information between the modalities in time. From the perspective of the network architecture, the interleaving of modal splicing and interaction enables inter-modal feature mapping, enhances information flow, and improves the characterization of complex scenarios and diverse targets. This significantly improves model performance, with especially obvious advantages in mAP50-95.
In the feature splicing module, we test the effectiveness of different attentions in the optimization of the feature splicing module. Various attentional mechanisms (SimAM [29], GAM [30], NAM [31], LCA [32], and our method) are adopted to improve the modal splicing approach, which are applied in the feature splicing module after the C3k2 feature extraction module of the backbone network, and the related results are shown in Table 10.
As shown in Table 10, our designed PPAS, enabled by its multi-branching structure, effectively filters noisy information, complements cross-modal information, and outperforms in all metrics. The GAM attention mechanism ranks second in several metrics. Similar to our approach, it leverages channel-spatial attention interaction to enhance feature extraction accuracy.

4.5. Lightweight Design

During the design of the modal splicing function, the overall complexity of the model increases greatly when all modal splicing modules are replaced with PPAS. Considering that the algorithm is mainly intended for UAVs performing forest fire detection tasks (especially on small targets), a lightweight design is adopted in this section: the original modal splicing function (Concat) is retained in the third modal splicing module, which corresponds to the third detection layer and is mainly used for detecting large targets.
This section presents comparative experiments to assess the detection performance and model complexity of the different modal splicing schemes. All schemes are evaluated within the mid-term fusion framework with the enhanced modal fusion structure (YOLOv11-MF + feature interaction structure). As detailed in Table 11, Scheme 1 replaces all modal splicing modules with PPAS; Scheme 2 substitutes only the first two modules with PPAS; and Scheme 3 replaces only the first module with PPAS.
As shown in Table 11, simplified modal splicing Scheme 2 (replacing only the first two modal splicing modules with PPAS) reduces model parameters and size by nearly 50% compared to full replacement. Notably, indicators show no significant decline, with slight improvements in accuracy and mAP50-95, verifying the effectiveness of the simplified design.
In order to further verify the balance between the detection effect and model complexity of the algorithmic models, the comparison of detection performance and complexity of the single-modal and RGB-T bimodal algorithmic models is shown in Table 12.
As shown in Table 12, the model is designed to handle RGB-T bimodal data with a dual backbone for cross-modal feature extraction, and it incorporates lightweight designs in the data input and in the algorithmic improvements. Although its parameter count and model size are slightly larger than those of the original single-modal detector, it achieves clearly better detection performance with only a modest increase in complexity.

4.6. Visual Analysis

To visualize the performance of the constructed model on the forest fire target detection task, Figure 16 shows partial detection results of CP-YOLOv11-MF, demonstrating its detection performance for UAV imagery captured from different viewpoints and in different scenes.
The blue box labeled “fire 0.8” indicates that the model predicts that the target is “fire” with 80% confidence. From the above figure, it can be seen that the CP-YOLOv11-MF algorithm model can fulfill the forest fire target detection task well.
At the same time, in order to further analyze the performance differences between different algorithm models, representative forest fire image detection samples (night environment, tree cover, smoke cover) are selected for visual analysis in this section, as shown in Figure 17, Figure 18 and Figure 19, to visualize the improvement effect of different algorithm models.
Figure 17 illustrates the detection performance of each model under nighttime conditions. While each model demonstrates proficiency in detecting fires with distinct characteristics, person detection may incur pixel-level displacement. This is because humans lack rich texture in thermal infrared images, leading to blurred detection box borders that hinder accurate localization. In the mid-term fusion framework (YOLOv11-MF), multiple detection boxes initially appear, but the final model—incorporating cross-modal feature fusion and splicing—achieves precise person detection with the highest confidence among all models.
The performance of visible images in flame detection under tree occlusion conditions is limited, as shown in Figure 18. For some fire objects, the detection confidence is only 30%, and the detection boxes have localization bias. Thermal infrared images can effectively recognize high-temperature target regions that stand out from the surrounding environment by virtue of their ability to perceive high-temperature areas in the scene. In the early fusion framework (YOLOv11-EF), the poor fusion of bimodal information initially generates multiple detection boxes. After model improvement, the confidence level of all target detections increases to 80%, demonstrating that the adopted algorithm model effectively enhances the accuracy and stability of forest fire target detection under tree occlusion conditions. This provides a more reliable solution for forest fire target detection in complex environments.
Thermal infrared images suffer from false alarms in smoke-occluded environments, as shown in Figure 19. Because some areas around the fire point have temperatures close to human body temperature, the thermal infrared image incorrectly identifies these areas as person targets. Visible-light images, with their rich texture information, can still detect flames and smoke under low visibility, but their detection boxes are less accurately localized. In addition, under the early and mid-term fusion frameworks, some actual targets in the visible-light image are missed. When the proposed method is used, the detection confidence for both critical targets, flames and people, increases to 80%, improving the accuracy and reliability of detection.
In summary, the algorithm model CP-YOLOv11-MF constructed is able to perform the target detection task more accurately in complex forest fire scenarios. Compared with single-modal detection methods, the model significantly reduces the false alarm rate and missed alarm rate, effectively overcoming the limitations of single-modal detection. Meanwhile, by designing the modal interaction structure and optimizing the modal splicing module, the model’s ability to detect targets in complex environments is enhanced significantly.

5. Conclusions

In this paper, a multi-target, multi-scene forest fire aerial photography dataset is constructed by collecting data at multiple locations with UAVs equipped with dual-optical (RGB-thermal) gimbal cameras, providing a more comprehensive visual dataset for subsequent forest fire prevention and management research. Based on the YOLOv11 detector, the early fusion framework (YOLOv11-EF), the mid-term fusion framework (YOLOv11-MF), and the late fusion framework (YOLOv11-LF) are constructed for the different multimodal fusion strategies, demonstrating the advantage of RGB-T bimodal target detection over single-modal detection. On this basis, a modal interaction structure is designed and the modal splicing module is optimized to enhance deep cross-modal interaction and fusion for RGB-Thermal bimodal target detection, and a lightweight design is incorporated during model improvement. The final RGB-T dual-modal detection model, CP-YOLOv11-MF, achieves 92.5%, 93.5%, 96.3%, and 62.9% in precision, recall, mAP50, and mAP50-95, respectively. Compared with single-modal visible-light detection, these metrics improve by 1.8%, 3.2%, 2.7%, and 7.9%; compared with single-modal thermal infrared detection, the improvements are 1.3%, 4.9%, 2.7%, and 4.7%.
This paper presents an optimized AI-driven framework for RGB-thermal fusion in wildfire detection, which significantly improves the accuracy and response efficiency of monitoring systems. In the context of the growing trend toward multi-source data fusion for forest fire detection, this study provides novel insights into the integration of diverse data modalities. Future work will further expand the scale and diversity of the multi-scenario fire dataset by continuing to collect data in forested areas with different geographic environments and climatic conditions, covering terrains such as mountains, hills, and plains as well as different seasons and day/night periods, so as to increase the dataset's coverage of complex real-world scenarios. At the algorithmic level, we will continue to study the cross-modal fusion mechanism in depth, explore additional modal interaction features, and improve the efficiency with which the model exploits the bimodal data, in order to achieve more stable and accurate detection in complex and changing forest fire scenarios.

Author Contributions

Conceptualization, Y.Z. and X.R.; methodology, Y.Z.; data curation, Y.Z.; writing—original draft preparation, Y.Z.; writing—review and editing, X.R.; supervision, W.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (program NO. 52321003) and the Startup Foundation for Introducing Talent of NUIST (1523142501164).

Data Availability Statement

The RGBT-3M dataset can be found at https://complex.ustc.edu.cn.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Cunningham, C.X.; Williamson, G.J.; Bowman, D.M.J.S. Increasing frequency and intensity of the most extreme wildfires on Earth. Nat. Ecol. Evol. 2024, 8, 1420–1425.
2. Rui, X.; Li, Z.; Zhang, X.; Li, Z.; Song, W. A RGB-Thermal based adaptive modality learning network for day–night wildfire identification. Int. J. Appl. Earth Obs. Geoinf. 2023, 125, 103554.
3. Kularatne, S.D.M.W.; Casado, C.Á.; Rajala, J.; Hänninen, T.; López, M.B.; Nguyen, L. FireMan-UAV-RGBT: A Novel UAV-Based RGB-Thermal Video Dataset for the Detection of Wildfires in the Finnish Forests. In Proceedings of the 2024 IEEE 29th International Conference on Emerging Technologies and Factory Automation (ETFA), Padova, Italy, 10–13 September 2024; pp. 1–8.
4. Shamsoshoara, A.; Afghah, F.; Razi, A.; Zheng, L.; Fulé, P.Z.; Blasch, E. Aerial imagery pile burn detection using deep learning: The FLAME dataset. Comput. Netw. 2021, 193, 108001.
5. Chen, X.; Hopkins, B.; Wang, H.; O’Neill, L.; Afghah, F.; Razi, A.; Fulé, P.; Coen, J.; Rowell, E.; Watts, A. Wildland Fire Detection and Monitoring Using a Drone-Collected RGB/IR Image Dataset. IEEE Access 2022, 10, 121301–121317.
6. Toulouse, T.; Rossi, L.; Campana, A.; Celik, T.; Akhloufi, M.A. Computer vision for wildfire research: An evolving image dataset for processing and analysis. Fire Saf. J. 2017, 92, 188–194.
7. Li, X.Y.; Chen, S.G.; Tian, C.N.; Zhou, H.; Zhang, Z.X. M2FNet: Mask-Guided Multi-Level Fusion for RGB-T Pedestrian Detection. IEEE Trans. Multimed. 2024, 26, 8678–8690.
8. Song, K.C.; Wen, H.W.; Xue, X.T.; Huang, L.M.; Ji, Y.Y.; Yan, Y.H. Modality Registration and Object Search Framework for UAV-Based Unregistered RGB-T Image Salient Object Detection. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5531015.
9. Jin, D.Z.; Shao, F.; Xie, Z.X.; Mu, B.Y.; Chen, H.W.; Jiang, Q.P. CAFCNet: Cross-modality asymmetric feature complement network for RGB-T salient object detection. Expert Syst. Appl. 2024, 247, 123222.
10. Lv, Y.; Liu, Z.; Li, G.Y. Context-Aware Interaction Network for RGB-T Semantic Segmentation. IEEE Trans. Multimed. 2024, 26, 6348–6360.
11. Zhou, W.J.; Dong, S.H.; Fang, M.X.; Yu, L. CACFNet: Cross-Modal Attention Cascaded Fusion Network for RGB-T Urban Scene Parsing. IEEE Trans. Intell. Veh. 2024, 9, 1919–1929.
12. Zhang, B.; Li, Z.L.; Sun, F.M.; Li, Z.H.; Dong, X.B.; Zhao, X.L.; Zhang, Y.R. SICFNet: Shared Information Interaction and Complementary Feature Fusion Network for RGB-T traffic scene parsing. Expert Syst. Appl. 2025, 276, 14.
13. Pang, Y.; Huang, Y.; Weng, C.Y.; Lyu, J.L.; Bai, C.Y.; Yu, X.S. Enhanced RGB-T saliency detection via thermal-guided multi-stage attention network. Vis. Comput. 2025, 41, 8055–8073.
14. Bin Azami, M.H.; Orger, N.C.; Schulz, V.H.; Oshiro, T.; Cho, M. Earth Observation Mission of a 6U CubeSat with a 5-Meter Resolution for Wildfire Image Classification Using Convolution Neural Network Approach. Remote Sens. 2022, 14, 1874.
15. Kumar, A.; Perrusquía, A.; Al-Rubaye, S.; Guo, W. Wildfire and smoke early detection for drone applications: A light-weight deep learning approach. Eng. Appl. Artif. Intell. 2024, 136, 108977.
16. Qurratulain, S.; Zheng, Z.Z.; Xia, J.; Ma, Y.; Zhou, F.R. Deep learning instance segmentation framework for burnt area instances characterization. Int. J. Appl. Earth Obs. Geoinf. 2023, 116, 103146.
17. Li, J.; Tang, H.; Li, X.; Dou, H.; Li, R. LEF-YOLO: A lightweight method for intelligent detection of four extreme wildfires based on the YOLO framework. Int. J. Wildland Fire 2024, 33, WF23044.
18. Guo, S.H.; Hu, B.; Huang, R. Real-Time Flame Segmentation based on RGB-Thermal Fusion. In Proceedings of the IEEE International Conference on Robotics and Biomimetics (IEEE ROBIO), Sanya, China, 27–31 December 2021; IEEE: Piscataway, NJ, USA; pp. 1435–1440.
19. Cui, S.; Ma, A.L.; Wan, Y.T.; Zhong, Y.F.; Luo, B.; Xu, M.Z. Cross-Modality Image Matching Network with Modality-Invariant Feature Representation for Airborne-Ground Thermal Infrared and Visible Datasets. IEEE Trans. Geosci. Remote Sens. 2022, 60, 3099506.
20. Rublee, E.; Rabaud, V.; Konolige, K.; Bradski, G. ORB: An efficient alternative to SIFT or SURF. In Proceedings of the 2011 International Conference on Computer Vision, Barcelona, Spain, 6–13 November 2011; pp. 2564–2571.
21. Burger, W.; Burge, M.J. Scale-Invariant Feature Transform (SIFT). In Digital Image Processing: An Algorithmic Introduction Using Java; Burger, W., Burge, M.J., Eds.; Springer: London, UK, 2016; pp. 609–664.
22. Li, J.; Hu, Q.; Ai, M. RIFT: Multi-Modal Image Matching Based on Radiation-Variation Insensitive Feature Transform. IEEE Trans. Image Process. 2020, 29, 3296–3310.
23. Gonçalves, L.A.O.; Ghali, R.; Akhloufi, M.A. YOLO-Based Models for Smoke and Wildfire Detection in Ground and Aerial Images. Fire 2024, 7, 140.
24. Ding, Y.H.; Wang, M.Y.; Fu, Y.J.; Wang, Q. Forest Smoke-Fire Net (FSF Net): A Wildfire Smoke Detection Model That Combines MODIS Remote Sensing Images with Regional Dynamic Brightness Temperature Thresholds. Forests 2024, 15, 839.
25. Huang, H.; Chen, Z.; Zou, Y.; Lu, M.; Chen, C.; Song, Y.; Zhang, H.; Yan, F. Channel prior convolutional attention for medical image segmentation. Comput. Biol. Med. 2024, 178, 108784.
26. Xu, S.; Zheng, S.; Xu, W.; Xu, R.; Wang, C.; Zhang, J.; Teng, X.; Li, A.; Guo, L. HCF-Net: Hierarchical Context Fusion Network for Infrared Small Object Detection. In Proceedings of the 2024 IEEE International Conference on Multimedia and Expo (ICME), Niagara Falls, ON, Canada, 15–19 July 2024; pp. 1–6.
27. Lyu, C.; Zhang, W.; Huang, H.; Zhou, Y.; Wang, Y.; Liu, Y.; Zhang, S.; Chen, K. RTMDet: An Empirical Study of Designing Real-Time Object Detectors. arXiv 2022, arXiv:2212.07784.
28. Ren, S.Q.; He, K.M.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149.
29. Yang, L.; Zhang, R.-Y.; Li, L.; Xie, X. SimAM: A Simple, Parameter-Free Attention Module for Convolutional Neural Networks. In Proceedings of the 38th International Conference on Machine Learning, Proceedings of Machine Learning Research, Virtual, 18–24 July 2021; pp. 11863–11874.
30. Liu, Y.; Shao, Z.; Hoffmann, N. Global Attention Mechanism: Retain Information to Enhance Channel-Spatial Interactions. arXiv 2021, arXiv:2112.05561.
31. Liu, Y.; Shao, Z.; Teng, Y.; Hoffmann, N. NAM: Normalization-based Attention Module. arXiv 2021, arXiv:2111.12419.
32. He, A.; Li, X.; Wu, X.; Su, C.; Chen, J.; Xu, S.; Guo, X. ALSS-YOLO: An Adaptive Lightweight Channel Split and Shuffling Network for TIR Wildlife Detection in UAV Imagery. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 17308–17326.
Figure 1. Schematic of forest fire detection imagery based on UAV.
Figure 2. Multimodal fusion strategies.
Figure 3. Data collection equipment (a) DJI Matrice 300 RTK with H20T; (b) DJI MAVIC 2 Enterprise.
Figure 4. RGB-thermal dual-optical camera imaging schematic.
Figure 5. Flowchart of M-RIFT image alignment method.
Figure 6. Feature point matching results of image matching methods.
Figure 7. Example of partial images of the dataset (a) visible images; (b) thermal infrared images.
Figure 8. Building process of CP-YOLOv11-MF.
Figure 9. Network structure of YOLOv11-MF.
Figure 10. The network structure of the proposed CP-YOLOv11-MF.
Figure 11. Schematic of Channel Prior Convolutional Attention.
Figure 12. Flowchart of PPAS.
Figure 13. Patch-Aware Flowchart.
Figure 14. Network structure of YOLOv11-EF.
Figure 15. Network structure of YOLOv11-LF.
Figure 16. Visualization of CP-YOLOv11-MF detection results.
Figure 17. Visualization of the detection performance of each model for nighttime conditions.
Figure 18. Visualization of the detection performance of each model for tree occlusion conditions.
Figure 19. Visualization of the detection performance of each model for smoke occlusion conditions.
Table 1. Video shooting specifications.

UAV | Camera | FPS | Resolution
DJI Matrice 300 RTK | H20T | 30 | visible image: 1920 × 1080; infrared image: 640 × 512
DJI MAVIC 2 Enterprise | All-in-one camera | 30 | visible image: 1920 × 1080; infrared image: 900 × 720
Table 2. Raw video information in the RGBT-3M dataset.

Video Item | UAV | Camera | Location | Scene | Time | Duration | File Size
Video pair 1 | DJI Matrice 300 RTK | H20T | Anhui, China | Outdoor Experiment | Night | 744 s | 2.68 GB, 135 MB
Video pair 2 | DJI Mavic 2 Enterprise | Integrated Camera | Yunnan, China | Real Fire | Daytime | 348 s | 1.46 GB, 389 MB
Video pair 3 | DJI Mavic 2 Enterprise | Integrated Camera | Yunnan, China | Real Fire | Daytime | 703 s | 2.96 GB, 716 MB
Video pair 4 | DJI Mavic 2 Enterprise | Integrated Camera | Yunnan, China | Real Fire | Daytime | 831 s | 3.5 GB, 742 MB
Video pair 5 | DJI Mavic 2 Enterprise | Integrated Camera | Yunnan, China | Real Fire | Daytime | 363 s | 1.53 GB, 361 MB
Video pair 6 | DJI Mavic 2 Enterprise Dual | Integrated Camera | Yunnan, China | No fire | Daytime | 554 s | 2.33 GB, 546 MB
Video pair 7 | DJI Mavic 2 Enterprise | Integrated Camera | Yunnan, China | No fire | Daytime | 91 s | 392 MB, 91.2 MB
Video pair 8 | DJI Mavic 2 Enterprise | Integrated Camera | Inner Mongolia, China | Outdoor Experiment | Daytime | 112 s | 486 MB, 74.2 MB
Video pair 9 | DJI Mavic 2 Enterprise | Integrated Camera | Inner Mongolia, China | Outdoor Experiment | Daytime | 218 s | 940 MB, 147 MB
Video pair 10 | DJI Matrice 300 RTK | H20T | Anhui, China | Outdoor Experiment | Daytime | 698 s | 2.52 GB, 140 MB
Table 3. RGBT-3M dataset content.

Dataset | Number of Images | Number of Labels | Smoke | Fire | Person
Training Set | 7854 | 21,550 | 9488 | 7914 | 4148
Validation Set | 3366 | 9227 | 4086 | 3401 | 1740
Total | 11,220 | 30,777 | 13,574 | 11,315 | 5888
Table 4. Training parameters.

Training Environment | Parameter Settings
CPU | Intel® Xeon(R) Gold 6226R CPU @ 2.90 GHz ×64
GPU | NVIDIA GeForce RTX 3090
Operating System | Ubuntu 20.04.6 LTS
Deep Learning Environment | Python: 3.8.19, torch: 1.8.0, CUDA: 11.1
Optimizer | Stochastic Gradient Descent (SGD)
Momentum | 0.937
Weight Decay | 0.0005
Training Epochs | 200
Batch Size | 4
Table 5. Comparison of the effect of different numbers of target categories for visible light detection.

Model | Class | P | R | mAP50 | mAP50-95
YOLOv11 (all objects) | smoke | 93.9% | 93.2% | 97% | 74.6%
YOLOv11 (all objects) | fire | 92.5% | 85.8% | 91.5% | 51.2%
YOLOv11 (all objects) | person | 90.5% | 88.4% | 93.8% | 57.2%
YOLOv11 (smoke object removed) | fire | 91.6% | 89.6% | 92.9% | 52.1%
YOLOv11 (smoke object removed) | person | 89.8% | 91% | 94.3% | 58%
Table 6. Model performance comparison of single-modal detection methods.

Model | Class | Precision | Recall | mAP50 | mAP50-95
YOLOv11 (visible image) | all | 90.7% | 90.3% | 93.6% | 55%
YOLOv11 (visible image) | fire | 91.6% | 89.6% | 92.9% | 52.1%
YOLOv11 (visible image) | person | 89.8% | 91% | 94.3% | 58%
RTMdet [27] (visible image) | all | 79% | 73.8% | 76% | 38.6%
RTMdet [27] (visible image) | fire | 82.2% | 71.9% | 73.5% | 33.1%
RTMdet [27] (visible image) | person | 75.7% | 75.7% | 78.5% | 44.1%
FasterRCNN [28] (visible image) | all | 90.6% | 89.3% | 93.6% | 55.5%
FasterRCNN [28] (visible image) | fire | 92.6% | 88.4% | 93.3% | 52.2%
FasterRCNN [28] (visible image) | person | 88.5% | 90.2% | 94% | 58.8%
YOLOv11 (infrared image) | all | 91.2% | 88.6% | 93.6% | 57.6%
YOLOv11 (infrared image) | fire | 92.2% | 86.5% | 92.4% | 57.5%
YOLOv11 (infrared image) | person | 90.2% | 90.8% | 94.9% | 57.6%
RTMdet [27] (infrared image) | all | 84.5% | 76.8% | 82.2% | 43.6%
RTMdet [27] (infrared image) | fire | 83.7% | 72.9% | 78.7% | 41.4%
RTMdet [27] (infrared image) | person | 85.4% | 80.6% | 85.7% | 45.9%
FasterRCNN [28] (infrared image) | all | 91.2% | 87.6% | 93.6% | 57.7%
FasterRCNN [28] (infrared image) | fire | 92.5% | 86.2% | 93.3% | 58%
FasterRCNN [28] (infrared image) | person | 89.9% | 89% | 94% | 57.5%
Table 7. Ablation experiments.

Model | Precision | Recall | mAP50 | mAP50-95
YOLOv11 (visible image) | 90.7% | 90.3% | 93.6% | 55%
YOLOv11 (infrared image) | 91.2% | 88.6% | 93.6% | 57.6%
YOLOv11-MF | 90.4% | 91.3% | 95.3% | 58.7%
YOLOv11-MF + feature interaction structure | 91.7% | 92.6% | 96% | 61.6%
YOLOv11-MF + feature interaction structure + PPAS (CP-YOLOv11-MF) | 92.5% | 93.5% | 96.3% | 62.9%
Table 8. Model performance comparison of single-modal and dual-modal algorithms.

Model | Class | Precision | Recall | mAP50 | mAP50-95
YOLOv11 (visible image) | all | 90.7% | 90.3% | 93.6% | 55%
YOLOv11 (visible image) | fire | 91.6% | 89.6% | 92.9% | 52.1%
YOLOv11 (visible image) | person | 89.8% | 91% | 94.3% | 58%
YOLOv11 (infrared image) | all | 91.2% | 88.6% | 93.6% | 57.6%
YOLOv11 (infrared image) | fire | 92.2% | 86.5% | 92.4% | 57.5%
YOLOv11 (infrared image) | person | 90.2% | 90.8% | 94.9% | 57.6%
YOLOv11-EF | all | 91.1% | 89.8% | 94.9% | 58.2%
YOLOv11-EF | fire | 91.6% | 88.8% | 94.1% | 57.5%
YOLOv11-EF | person | 90.7% | 90.7% | 95.6% | 59%
YOLOv11-MF | all | 90.6% | 91.2% | 95.3% | 58.6%
YOLOv11-MF | fire | 91% | 90.9% | 94.8% | 58.1%
YOLOv11-MF | person | 90.2% | 91.4% | 95.9% | 59.1%
YOLOv11-LF | all | 91.3% | 91.5% | 95.3% | 59.7%
YOLOv11-LF | fire | 92.3% | 91.5% | 95.3% | 59.7%
YOLOv11-LF | person | 90.2% | 91.4% | 95.4% | 59.7%
Table 11. Performance and complexity comparison of different modal splicing schemes.

Modal Splicing Scheme | Precision | Recall | mAP50 | mAP50-95 | Parameters (×10⁶) | Model Size
Scheme 1 | 92.1% | 93.9% | 96.4% | 62.7% | 20.51 | 39.7 MB
Scheme 2 | 92.5% | 93.5% | 96.3% | 62.9% | 11.83 | 23 MB
Scheme 3 | 92.4% | 93.4% | 96.3% | 62.4% | 9.66 | 18.9 MB
Table 12. Comparison of detection performance and complexity of the single-modal and RGB-T bi-modal algorithmic models.

Model | Precision | Recall | mAP50 | mAP50-95 | Parameters (×10⁶) | Model Size
YOLOv11 (visible image) | 90.7% | 90.3% | 93.6% | 55% | 9.41 | 18.3 MB
YOLOv11 (infrared image) | 91.2% | 88.6% | 93.6% | 57.6% | 9.41 | 18.3 MB
CP-YOLOv11-MF | 92.5% | 93.5% | 96.3% | 62.9% | 11.83 | 23 MB
Table 9. Comparison of the effects of cross-modal feature interaction structures on different detection frameworks.

Model | Class | Precision | Recall | mAP50 | mAP50-95
YOLOv11-LF | all | 91.3% | 91.5% | 95.3% | 59.7%
YOLOv11-LF | fire | 92.3% | 91.5% | 95.3% | 59.7%
YOLOv11-LF | person | 90.2% | 91.4% | 95.4% | 59.7%
YOLOv11-LF + feature interaction structure | all | 90.3% | 90% | 94.9% | 57.8%
YOLOv11-LF + feature interaction structure | fire | 90% | 89.4% | 94.4% | 57.4%
YOLOv11-LF + feature interaction structure | person | 90.6% | 90.7% | 95.5% | 58.1%
YOLOv11-MF | all | 90.6% | 91.2% | 95.3% | 58.6%
YOLOv11-MF | fire | 91% | 90.9% | 94.8% | 58.1%
YOLOv11-MF | person | 90.2% | 91.4% | 95.9% | 59.1%
YOLOv11-MF + feature interaction structure | all | 91.9% | 92.3% | 96% | 61.5%
YOLOv11-MF + feature interaction structure | fire | 92.4% | 92.4% | 96% | 61.7%
YOLOv11-MF + feature interaction structure | person | 91.4% | 92.2% | 96% | 61.3%
Table 10. Comparison of the effectiveness of different attention mechanisms in optimizing the feature splicing module.

Attention Mechanism | Precision | Recall | mAP50 | mAP50-95
- | 91.9% | 92.3% | 96% | 61.5%
SimAM [29] | 91.6% | 92.9% | 96.1% | 61.2%
GAM [30] | 91.3% | 93.5% | 96.2% | 61.8%
NAM [31] | 91.9% | 93% | 96.2% | 61.7%
LCA [32] | 90.9% | 92.4% | 95.8% | 61.3%
Ours | 92.1% | 93.9% | 96.4% | 62.7%
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
