1. Introduction
The frequency of forest fires has increased significantly in recent years, and extreme forest fire events have had a major impact on societies and ecosystems globally [
1]. With the advantages of flexible mobility and multi-scale observation, UAVs have become important equipment for forest fire detection. In daily forest fire inspection, the composite strategy of long-term rapid patrol and close-range fine detection requires the detection algorithm to be both lightweight and generalizable.
As shown in
Figure 1, UAVs face two distinct application scenarios during forest fire detection: remote shooting and close-range shooting. This makes multi-scale target detection and matching difficult. At the same time, aerial images are affected by changes in flight attitude and by vegetation occlusion, so targets exhibit pronounced multi-scale, deformation, and edge-blurring characteristics, and traditional fixed-shape detection models suffer from feature-matching deviation and scale sensitivity. RGB-Thermal fusion detection can therefore exploit the complementary advantages of the visible light image (RGB) and the thermal infrared image (T) and compensate for the deficiencies of a single modality [2]. The visible light image contains rich texture and color information, which can accurately identify the color characteristics of the flame and provide the basis for detailed analysis of the fire target. The thermal infrared image is not limited by lighting conditions or vegetation occlusion and can keenly capture high-temperature heat sources; even in heavy smoke or at night, it can accurately pinpoint fire points. In remote shooting scenes, the initial fire point is small, and combining the high-temperature signal in the thermal infrared image with the smoke texture features of the visible light image enables timely detection of potential fire sources. In close-range shooting scenes with complex vegetation and drastic viewpoint changes, RGB-Thermal fusion detection can quickly adapt to scale changes and shape distortion of the target.
In existing forest fire detection research, most publicly available forest fire image datasets are limited to visible light images and lack real fire data. Publicly available RGB-Thermal image datasets are scarce, and accurate image alignment is rarely carried out. Moreover, most of these datasets focus on fire classification and segmentation tasks, leaving a gap in fire detection work [
3]. The FLAME1 [
4] and FLAME2 [
5] datasets use an overhead view, which cannot reflect the multi-angle characteristics of UAVs during daily forest fire inspections. The well-labeled Corsican Fire Dataset [6] and the RGB-T wildfire dataset [2] contain limited data captured in single experimental scenarios, which limits generalization. The FireMan-UAV-RGBT dataset [3] captures multi-scene forest fire images from multiple viewpoints, but it only supports the classification task and does not provide finer-grained information for forest fire detection. In contrast, the dataset presented in this paper comprehensively considers the diversity of aerial viewpoints, the diversity of forest fire scenes, and the precision of the annotations; the resulting RGBT-3M dataset provides reliable data support for RGB-Thermal forest fire detection.
At the methodological level, the multimodal fusion strategies can be mainly categorized into three types: data-level fusion, feature-level fusion and decision-level fusion. They correspond to different stages of the algorithmic model inference process, as shown in
Figure 2.
In the field of multimodal detection methods, deep learning networks mostly use intermediate fusion strategies [
7,
8], and researchers have developed a number of multimodal interaction and fusion strategies, which have proved to be effective in enhancing the design of modal interactions in the feature extraction phase [
9,
10]. CACFNet [
11] mines complementary information from two modalities by designing cross-modal attention fusion modules, and uses cascaded fusion modules to decode multilevel features in an up–down manner; SICFNet [
12] constructs a shared information interaction and complementary feature fusion network, which consists of three phases: feature extraction, information interaction, and feature calibration refinement; and the Thermal-induced Modality-interaction Multi-stage Attention Network (TMMANet [
13]) leverages thermal-induced attention mechanisms in both the encoder and decoder stages to effectively integrate RGB and thermal modalities.
At present, preliminary progress has been made in forest fire identification based on the RGB-Thermal fusion method [
2,
3,
5,
6]. Although existing work applies deep learning frameworks such as LeNet [
14], MobileViT [
15], ResNet [
16], YOLO [
17], etc., to forest fire detection, which improves the efficiency of forest fire recognition [
5], the design of algorithms based on RGB-Thermal correlation is still very limited. Chen et al. [
5] explored RGB-Thermal based early fusion and late fusion methods for classification and detection of forest fire images. Rui et al. [
2] proposed an adaptive learning RGB-T bimodal image recognition framework for forest fires. Guo et al. [
18] designed the SkipInception feature extraction module and the SFSeg sandwich structure to fuse visible and thermal infrared images for the flame segmentation task. Overall, the above algorithms do not deeply consider the interaction and propagation of cross-modal features, cannot simultaneously calibrate shallow texture features and localize high-level semantic information across scales, and thus remain limited when analyzing diverse and challenging forest fire scenarios.
We propose a new forest fire detection framework. It employs a parallel backbone network to extract RGB and TIR features. A feature cross-fertilization structure is established in multi-scale feature extraction to enhance information interaction and propagation between modalities. The channel and spatial attention mechanisms, along with a feature branching selection strategy, are introduced to suppress noise from heterogeneous inter-modal features. Finally, it achieves effective combination of complementary relationships between modalities.
In summary, this paper will explore how to effectively fuse the features of visible and thermal infrared images on existing deep learning models to improve the efficiency and effectiveness of forest fire target detection in complex environments. Based on the above background, the main contributions of this paper are as follows:
(1) A novel forest fire dataset is introduced, containing time-synchronized RGB-thermal video data from real fires and outdoor experiments in multiple Chinese forest areas. It provides high-quality, reliable data for classification and detection tasks via manual frame-splitting, image alignment, and annotation, supporting subsequent deep learning model training and testing. To the best of our knowledge, this is the first RGB-Thermal image detection dataset for forest fires.
(2) A fire detection method is developed by combining multimodal fusion techniques with computer vision methods. We adopt the well-known YOLOv11 deep learning architecture and, under a dual RGB/TIR backbone structure, add a cross-modal feature fusion structure and attention mechanisms to guide the gradual fusion of heterogeneous modal information and improve the adaptability of the method to forest fire target detection.
(3) Our constructed model is evaluated in several challenging forest fire scenarios, effectively demonstrating the usability and robustness of our proposed dataset and deep learning approach in forest fire detection scenarios.
2. Dataset
2.1. Data Collection
The experimental equipment used for RGB-T image data collection consists of the DJI M300 RTK UAV (DJI Technology Co., Ltd., Shenzhen, China) equipped with the H20T, and the DJI MAVIC 2 Enterprise (DJI Technology Co., Ltd., Shenzhen, China) with an integrated camera, as shown in
Figure 3.
In the process of data collection, the specific shooting specifications are shown in
Table 1.
In order to collect forest fire images covering a wide range of scenarios, large-scale field data collection was carried out in Anhui, Yunnan, and Inner Mongolia, including data from real fires and outdoor experiments. During data collection, multiple UAV devices were used, and all devices were time-synchronized to acquire visible and thermal infrared videos simultaneously, ensuring the consistency of the acquired data. Finally, from the large number of videos collected, we selected those that are highly representative of forest fire scenes; the relevant information is shown in
Table 2.
2.2. Data Pre-Processing
In the pre-processing stage, frames are extracted from the videos at 5 frames per second to reduce similarity between image frame pairs. Images are divided into fire and non-fire frame pairs to facilitate the image classification task. Considering the similarity of scenes, an additional frame-skipping strategy is adopted to optimize the processing workflow. Finally, 17,862 frame pairs are obtained, of which 6642 pairs are non-fire frames and 11,220 pairs are fire frames.
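As a rough illustration of this sampling step (the file paths, output naming, and fall-back frame rate below are hypothetical, and the additional scene-similarity frame skipping is omitted), the extraction of synchronized frame pairs could look like:

```python
import cv2
import os

def extract_pairs(rgb_path, tir_path, out_dir, target_fps=5):
    """Sample time-synchronized RGB/TIR frame pairs at roughly `target_fps` frames per second."""
    os.makedirs(out_dir, exist_ok=True)
    cap_rgb, cap_tir = cv2.VideoCapture(rgb_path), cv2.VideoCapture(tir_path)
    src_fps = cap_rgb.get(cv2.CAP_PROP_FPS) or 25.0      # fall back if metadata is missing
    step = max(1, round(src_fps / target_fps))           # keep every `step`-th frame
    idx = saved = 0
    while True:
        ok_rgb, frame_rgb = cap_rgb.read()
        ok_tir, frame_tir = cap_tir.read()
        if not (ok_rgb and ok_tir):
            break
        if idx % step == 0:
            cv2.imwrite(os.path.join(out_dir, f"{saved:06d}_rgb.jpg"), frame_rgb)
            cv2.imwrite(os.path.join(out_dir, f"{saved:06d}_tir.jpg"), frame_tir)
            saved += 1
        idx += 1
    cap_rgb.release(); cap_tir.release()
    return saved

# e.g. extract_pairs("fire_rgb.mp4", "fire_tir.mp4", "frames/")  # illustrative paths
```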
The visible and thermal infrared images are usually captured by different sensors, and the modality gaps caused by different imaging systems or styles pose a great challenge to the matching task [
19]. Although complementary information is provided in different imaging modalities, multimodal images obtained directly from the camera are not aligned for direct fusion, as shown in
Figure 4, which depicts a stereo vision system framework with two cameras, each associated with its own (left and right) coordinate system.
Image alignment establishes pixel-level correspondences between images taken from two viewpoints: through a series of pre-processing and alignment steps, the images are mapped into a common reference frame or coordinate system, i.e., a common representation in which they are spatially aligned and can be compared and analyzed at the same spatial scale. Alignment allows the strengths of different modalities to be merged, yielding a more comprehensive, accurate, and robust characterization.
At the device hardware level, the temporal acquisition frame rates of visible and thermal infrared images are kept synchronized, i.e., they are already aligned in the temporal dimension. In the spatial dimension, the alignment between the visible and thermal infrared images can be realized by solving the homography matrix of the visible images and the thermal infrared images and performing affine transformations, i.e.,
$$\mathbf{x}_{2} \simeq \mathbf{H}\,\mathbf{x}_{1}, \qquad \mathbf{H} = \mathbf{R} + \frac{\mathbf{t}\,\mathbf{n}^{\top}}{d},$$
where $\mathbf{H}$ is the homography matrix, $\mathbf{R}$ and $\mathbf{t}$ denote the rotation matrix and translation vector between the two coordinate systems, $\mathbf{x}_{1}$ and $\mathbf{x}_{2}$ are the coordinates of a point on the plane (given in the world coordinate system) expressed in the two camera coordinate systems, $\mathbf{n}$ is the normal vector of the plane, and $d$ is the distance from the plane to the origin of the camera coordinate system.
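For concreteness, a small NumPy sketch of composing this plane-induced homography is given below; the rotation, translation, plane parameters, and camera intrinsics are placeholder values for illustration, not calibration results from the paper.

```python
import numpy as np

# Placeholder extrinsics/intrinsics between the RGB and thermal cameras (illustrative values only).
R = np.eye(3)                          # rotation between the two camera coordinate systems
t = np.array([0.05, 0.0, 0.0])         # translation (metres)
n = np.array([0.0, 0.0, 1.0])          # normal vector of the observed plane
d = 30.0                               # distance from the plane to the camera origin
K_rgb = np.array([[1200, 0, 960], [0, 1200, 540], [0, 0, 1]], float)
K_tir = np.array([[800, 0, 320], [0, 800, 256], [0, 0, 1]], float)

H_norm = R + np.outer(t, n) / d                    # H = R + t n^T / d (normalized coordinates)
H_pix = K_tir @ H_norm @ np.linalg.inv(K_rgb)      # pixel-level homography between the two images
H_pix /= H_pix[2, 2]                               # conventional normalization
print(H_pix)
```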
During the construction of most forest fire RGB-T data sets, the image registration process is usually carried out by manually selecting feature points, or by using general feature point matching methods such as ORB [
20] or SIFT [
21]. We propose a two-stage bimodal image alignment framework, termed M-RIFT, to improve the accuracy and robustness of matching heterogeneous image data, as shown in
Figure 5. In the coarse alignment stage, manually selected feature points are used to resize and roughly register the images, quickly overcoming the initial geometric distortion. In the fine alignment stage, we adopt the RIFT multimodal image matching method [
22]. First, feature points in the image are detected via the maximum moment map. Then, the maximum value index in each direction is searched to construct the maximum index map. Next, the FREAK descriptor is used to generate the feature vector, and homonymous point pairs are obtained based on the nearest-neighbor strategy. After removing outliers, the affine transform model between images is derived. This approach enables the rapid and accurate establishment of feature correspondences and optimization of matching results.
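A schematic sketch of this two-stage flow is given below; the `rift_match` callable is a placeholder for the RIFT + FREAK + nearest-neighbor matching step (which is not an OpenCV built-in), and the manually selected coarse points are assumed inputs, so this outlines the pipeline rather than reproducing the authors' implementation.

```python
import cv2
import numpy as np

def coarse_align(rgb, tir, pts_rgb, pts_tir):
    """Stage 1: coarse alignment from a few manually selected point pairs
    (scale/rotation/translation only) to remove the initial geometric distortion."""
    M, _ = cv2.estimateAffinePartial2D(np.float32(pts_tir), np.float32(pts_rgb))
    return cv2.warpAffine(tir, M, (rgb.shape[1], rgb.shape[0]))

def fine_align(rgb, tir_coarse, rift_match):
    """Stage 2: fine alignment from dense multimodal matches; `rift_match` is a
    placeholder returning homonymous point pairs (tir -> rgb)."""
    pts_tir, pts_rgb = rift_match(tir_coarse, rgb)
    H, inliers = cv2.findHomography(pts_tir, pts_rgb, cv2.RANSAC, 3.0)   # outlier removal
    aligned = cv2.warpPerspective(tir_coarse, H, (rgb.shape[1], rgb.shape[0]))
    return aligned, H
```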
Through the above method, the homography matrix can be computed: for each pair of matched points $(x_i, y_i)$ and $(x_i^{\prime}, y_i^{\prime})$, a system of equations is constructed based on the mathematical model of the perspective transformation,
$$s \begin{bmatrix} x_i^{\prime} \\ y_i^{\prime} \\ 1 \end{bmatrix} = \mathbf{H} \begin{bmatrix} x_i \\ y_i \\ 1 \end{bmatrix}, \qquad \mathbf{H} = \begin{bmatrix} h_{1} & h_{2} & h_{3} \\ h_{4} & h_{5} & h_{6} \\ h_{7} & h_{8} & h_{9} \end{bmatrix},$$
where $\mathbf{H}$ is the homography matrix and $s$ is a scale factor. Two equations are obtained after expansion:
$$x_i h_{1} + y_i h_{2} + h_{3} - x_i^{\prime}\,(x_i h_{7} + y_i h_{8} + h_{9}) = 0,$$
$$x_i h_{4} + y_i h_{5} + h_{6} - y_i^{\prime}\,(x_i h_{7} + y_i h_{8} + h_{9}) = 0,$$
where $\mathbf{h} = (h_{1}, h_{2}, \ldots, h_{9})^{\top}$ denotes the elements of the homography matrix expanded into a vector. Stacking these equations for all matched points yields a homogeneous system $\mathbf{A}\mathbf{h} = \mathbf{0}$, which is solved via singular value decomposition (SVD); the homography matrix is then recovered from $\mathbf{h}$.
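To make the SVD step concrete, a minimal NumPy sketch of this direct linear transform is shown below with a synthetic self-check; it is purely illustrative, and in practice the homography is typically estimated with RANSAC-based routines such as cv2.findHomography.

```python
import numpy as np

def homography_dlt(src, dst):
    """Estimate H from >= 4 matched points (src -> dst) via SVD of the homogeneous system A h = 0."""
    A = []
    for (x, y), (u, v) in zip(src, dst):
        A.append([x, y, 1, 0, 0, 0, -u * x, -u * y, -u])
        A.append([0, 0, 0, x, y, 1, -v * x, -v * y, -v])
    _, _, Vt = np.linalg.svd(np.asarray(A, float))
    h = Vt[-1]                          # right singular vector with the smallest singular value
    return (h / h[-1]).reshape(3, 3)    # normalize so that h9 = 1

# Quick self-check against a known homography (illustrative values).
H_true = np.array([[1.1, 0.02, 5.0], [0.01, 0.95, -3.0], [1e-4, 2e-4, 1.0]])
src = np.array([[0, 0], [640, 0], [640, 512], [0, 512], [320, 256]], float)
dst_h = (H_true @ np.c_[src, np.ones(len(src))].T).T
dst = dst_h[:, :2] / dst_h[:, 2:]
print(np.allclose(homography_dlt(src, dst), H_true, atol=1e-6))
```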
Traditional feature point matching methods cannot effectively handle the modal differences between cross-modal images, making it difficult to match heterogeneous information.
Figure 6 demonstrates a comparative analysis of the feature point matching results between our approach and other methodologies. The green line connects the matching points corresponding to the visible light image and the thermal infrared image.
The traditional methods are only capable of identifying a limited number of matching points, frequently accompanied by issues of incorrect alignment. By contrast, the approach proposed in this paper successfully detects a substantial number of accurate matching points, thereby demonstrating the superiority of the proposed method.
2.3. Statistical Analysis of the Dataset
The multi-scene, multi-target and multimodal forest fire aerial photography dataset (RGBT-3M) contains 22,440 fire-frame images (i.e., 11,220 image pairs), annotated using LabelImg (version 1.8.6). The labeled targets are smoke, fire, and person, with 13,574, 11,315, and 5888 instances, respectively. Because infrared images lack obvious smoke features, we also provide a label set excluding smoke targets. The dataset is divided into training and validation sets at a ratio of 7:3; some representative scenes are shown in
Figure 7, and the specific data are shown in
Table 3. The dataset will be released and updated at:
https://complex.ustc.edu.cn/.
4. Experiment
4.1. Experimental Settings
Experiments were conducted on the Ubuntu 18.04 operating system with an NVIDIA GeForce RTX 3090 graphics card, CUDA 11.1, and Python 3.9.19. For consistency, all networks were trained with the same optimizer, core optimizer parameters, and training settings. The detailed training parameters are shown in
Table 4.
4.2. Evaluation Criteria
Precision (P), recall (R), and average precision (AP) are used to evaluate model performance.
mAP50 denotes the mean average precision computed at an IoU threshold of 0.5, while mAP50-95 is the mean average precision averaged over IoU thresholds from 0.50 to 0.95 in steps of 0.05.
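As a brief illustration of how these IoU thresholds are applied (the boxes below are made-up values, and the precision-recall integration that yields AP is omitted):

```python
import numpy as np

def iou(box_a, box_b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

# mAP50 counts a prediction as correct when IoU >= 0.50; mAP50-95 averages AP over
# the IoU thresholds 0.50, 0.55, ..., 0.95 (10 thresholds in steps of 0.05).
thresholds = np.arange(0.50, 1.00, 0.05)
pred, gt = [100, 100, 200, 220], [105, 95, 205, 215]      # illustrative boxes
hits = [iou(pred, gt) >= t for t in thresholds]
print(f"IoU = {iou(pred, gt):.2f}, correct at {sum(hits)} of {len(thresholds)} thresholds")
```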
4.3. Comparative Experiment
In this section, a comparative study is carried out for the single-modal detection models described above as well as the RGB-Thermal detection framework. It is worth noting that smoke is visible in visible light images but difficult to recognize in thermal infrared images. This is due to the limited sensitivity of the thermal infrared camera carried by the UAV and, for remote observation, the large distance between the UAV and the forest fire target, which prevents effective capture of smoke information. Meanwhile, to focus on RGB-T bimodal fusion for small-target detection, the subsequent comparison experiments are carried out with the smoke label removed, and only flame and person targets are analyzed.
Table 5 compares visible light detection performance with different numbers of target categories.
For single-modal comparison, we selected RTMdet [
27], a single-stage target detection algorithm with similar model complexity to YOLOv11, and FasterRCNN [
28], a well-known two-stage target detection algorithm, for comparison experiments.
Table 6 presents a comparison of the effect of single-modal detection methods.
Compared with the single-stage target detection model of similar complexity, YOLOv11 shows a clear performance advantage, significantly outperforming RTMdet on all metrics. Compared with the more complex two-stage target detection model, YOLOv11 differs little from Faster R-CNN on most indicators but achieves slightly higher recall, indicating that it is better at reducing missed detections, which is essential for the early detection of forest fires.
4.4. Ablation Experiment
In order to verify the effect of each improvement module on the model detection capability, we compare the effects of different improvements on the model detection performance. Ablation test results of the algorithm model are shown in
Table 7. The experiments adopt YOLOv11 as the baseline model for unimodal detection in visible and infrared images. Based on this, the mid-term fusion framework “YOLOv11-MF” is designed. Adding a cross-modal feature interaction design based on CPCA to YOLOv11-MF yields “YOLOv11-MF+ feature interaction structure”. Finally, integrating the PPAS feature splicing module results in “CP-YOLOv11-MF”.
As shown in
Table 7, early, mid-term, and late RGB-Thermal bimodal fusion frameworks are constructed according to the three multimodal fusion strategies: using a simple Concat function for modal splicing, the early fusion framework (YOLOv11-EF), the mid-term fusion framework (YOLOv11-MF), and the late fusion framework (YOLOv11-LF) are built on the basis of the YOLOv11 model. Simple bimodal feature splicing slightly improves algorithm performance, while designing a cross-modal feature interaction module and optimizing the modal splicing module enhances inter-modal interaction, enabling deep feature and information complementarity across modalities. After this series of targeted improvements, the final algorithm model (CP-YOLOv11-MF) reaches 92.5% precision, 93.5% recall, 96.3% mAP50, and 62.9% mAP50-95, which reflects the effectiveness of the various improvements on model performance.
In the detection framework based on the early fusion strategy (YOLOv11-EF), the input layer is modified by adding two new input channels and introducing the Concat function for early bimodal feature splicing. First, the infrared and visible images are taken as input data, and a feature splicing operation on the two modalities generates a bimodal fusion feature map. The fused feature map is then fed into the backbone network layer and the neck network layer, and the detection layer performs target detection on the features from the previous layers and outputs the detection results. The network structure of YOLOv11-EF is shown in
Figure 14.
The detection framework based on late fusion strategies (YOLOv11-LF) consists of a dual-input layer, a dual-channel backbone network layer, a dual-channel neck network layer, and a detection layer. The visible channel backbone network and the thermal infrared channel backbone network perform feature extraction for the visible and thermal infrared images, and output the extracted features to the neck network layer. The features are enhanced in the neck network layer. A feature splicing module is embedded at the output position of the neck network layer to input the generated bimodal image fusion features to the detection layer. The network structure is shown in
Figure 15.
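The structural difference between the three frameworks is essentially where the Concat operation is applied. The PyTorch sketch below illustrates this with simple stand-in stages for the YOLOv11 backbone, neck, and detection layers; the channel widths and layer counts are arbitrary and do not reflect the actual networks.

```python
import torch
import torch.nn as nn

def stage(c_in, c_out):
    """Stand-in for a YOLOv11 backbone/neck stage (illustrative, not the real architecture)."""
    return nn.Sequential(nn.Conv2d(c_in, c_out, 3, 2, 1), nn.BatchNorm2d(c_out), nn.SiLU())

class EarlyFusion(nn.Module):      # YOLOv11-EF: concatenate the raw images, single backbone
    def __init__(self):
        super().__init__()
        self.backbone, self.neck = stage(6, 64), stage(64, 64)   # 3 RGB + 3 TIR input channels
        self.head = nn.Conv2d(64, 8, 1)                          # stand-in detection layer
    def forward(self, rgb, tir):
        return self.head(self.neck(self.backbone(torch.cat([rgb, tir], 1))))

class MidFusion(nn.Module):        # YOLOv11-MF: dual backbones, concatenate feature maps
    def __init__(self):
        super().__init__()
        self.b_rgb, self.b_tir = stage(3, 32), stage(3, 32)
        self.neck, self.head = stage(64, 64), nn.Conv2d(64, 8, 1)
    def forward(self, rgb, tir):
        return self.head(self.neck(torch.cat([self.b_rgb(rgb), self.b_tir(tir)], 1)))

class LateFusion(nn.Module):       # YOLOv11-LF: dual backbones and necks, concatenate before detection
    def __init__(self):
        super().__init__()
        self.b_rgb, self.b_tir = stage(3, 32), stage(3, 32)
        self.n_rgb, self.n_tir = stage(32, 32), stage(32, 32)
        self.head = nn.Conv2d(64, 8, 1)
    def forward(self, rgb, tir):
        f_rgb, f_tir = self.n_rgb(self.b_rgb(rgb)), self.n_tir(self.b_tir(tir))
        return self.head(torch.cat([f_rgb, f_tir], 1))

rgb, tir = torch.rand(1, 3, 640, 640), torch.rand(1, 3, 640, 640)
print(MidFusion()(rgb, tir).shape)   # the fusion point is the only structural difference
```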
In order to verify the effectiveness of the RGB-Thermal bimodal target detection model, in this section the YOLOv11 model is trained separately on the infrared and visible images to obtain single-modal detection results, i.e., the YOLOv11 network processes the visible and thermal infrared images independently and outputs detection results without any fusion. As shown in
Table 8, the performance of each model under single-modal detection and different detection fusion frameworks is compared.
A comparison of visible and infrared image detection results in a single modality shows that visible images, despite richer information, contain more interference. Evaluation of detection performance reveals similar mAP50 values for both modalities. However, infrared images exhibit a significant advantage in flame detection under the stricter mAP50-95 metric, outperforming visible images by 5.4%. In contrast, detection accuracy of person with infrared is slightly lower than that with visible light.
Comparing single-modal and dual-modal detection, the RGB-T dual-modal method is significantly better than single-modal detection on all three evaluation indexes (precision, recall, mAP), which demonstrates the effectiveness of image fusion in improving target detection performance. Analyzing the fusion stages within the RGB-Thermal bimodal frameworks shows that late fusion has certain advantages. In early fusion, the raw data has not yet been processed in depth, so a large amount of redundant information and potential noise is not eliminated, and these interfering factors can negatively affect the subsequent analysis. Mid-term fusion suffers from a similar problem, with some noise inevitably remaining during processing; in this case, relying only on simple splicing to integrate multi-source data cannot fully exploit the intrinsic correlations between the modalities, making efficient fusion and utilization of information difficult. In contrast, by the late fusion stage the data has undergone multiple rounds of screening, which effectively reduces the impact of noise; the subsequent integration allows the model to refine key features more accurately, so the late fusion framework performs better than the mid-term fusion framework on this dataset when only simple splicing is used.
In the RGB-Thermal bimodal fusion frameworks above, only the Concat function is used for modal splicing. To enable richer modal fusion interaction, a cross-modal feature interaction structure is designed. Since the early fusion framework (YOLOv11-EF) has only a single backbone network, the cross-modal feature interaction structure is applied to the mid-term fusion framework (YOLOv11-MF) and the late fusion framework (YOLOv11-LF) for experiments, as shown in
Table 9. After adding the cross-modal feature interaction structure, the improved mid-term fusion framework performs optimally.
The experimental results show that, after adding the cross-modal structure, all indicators under the late fusion framework decrease to some extent, while all indicators under the mid-term fusion framework improve significantly. In the late fusion framework, the cross-modal structure is introduced at a late stage of processing, where effective feature interaction can no longer take place, so the model struggles to adapt to and integrate the cross-modal information. In the mid-term fusion stage, the data has only been partially processed and the feature patterns have not yet solidified, so introducing the cross-modal structure at this point captures the rich complementary information between the modalities in time. From the network architecture perspective, modal splicing and interaction are interleaved, enabling inter-modal feature mapping, enhancing information flow, and improving the characterization of complex scenes and diverse targets. This significantly improves model performance, with a particularly clear advantage in mAP50-95.
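The exact CPCA-based interaction module is defined in the methods section; purely as a hedged illustration of the general idea, the sketch below lets channel and spatial attention derived from one modality re-weight the other, with residual connections preserving each branch's original features. The reduction ratio, the 7x7 spatial kernel, and the weight sharing between the two directions are assumptions made for this example.

```python
import torch
import torch.nn as nn

class CrossModalInteraction(nn.Module):
    """Illustrative cross-modal interaction: each branch is re-weighted by
    channel and spatial attention computed from the other branch."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.channel_att = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1), nn.SiLU(),
            nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid())
        self.spatial_att = nn.Sequential(nn.Conv2d(channels, 1, 7, padding=3), nn.Sigmoid())

    def forward(self, f_rgb, f_tir):
        # RGB features gated by attention from the thermal branch, and vice versa (residual form).
        rgb_out = f_rgb + f_rgb * self.channel_att(f_tir) * self.spatial_att(f_tir)
        tir_out = f_tir + f_tir * self.channel_att(f_rgb) * self.spatial_att(f_rgb)
        return rgb_out, tir_out

f_rgb, f_tir = torch.rand(1, 64, 80, 80), torch.rand(1, 64, 80, 80)
out_rgb, out_tir = CrossModalInteraction(64)(f_rgb, f_tir)
print(out_rgb.shape, out_tir.shape)
```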
In the feature splicing module, we test the effectiveness of different attentions in the optimization of the feature splicing module. Various attentional mechanisms (SimAM [
29], GAM [
30], NAM [
31], LCA [
32], and our method) are adopted to improve the modal splicing approach, which are applied in the feature splicing module after the C3k2 feature extraction module of the backbone network, and the related results are shown in
Table 10.
As shown in
Table 10, our designed PPAS, enabled by its multi-branch structure, effectively filters noisy information, complements cross-modal information, and outperforms the other attention mechanisms on all metrics. The GAM attention mechanism ranks second on several metrics; similar to our approach, it leverages channel-spatial attention interaction to enhance feature extraction accuracy.
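Since the internal structure of PPAS is specified in the methods section rather than here, the following is only a hedged sketch of an attention-gated splicing block in that spirit: parallel channel and spatial attention branches score the concatenated RGB-TIR features, a learned per-channel gate selects between the branches, and a 1x1 convolution projects back to the backbone width. The module name and all dimensions are illustrative.

```python
import torch
import torch.nn as nn

class AttentionSplice(nn.Module):
    """Hedged sketch of attention-gated modal splicing (PPAS-like, not the actual PPAS)."""
    def __init__(self, channels):
        super().__init__()
        c = 2 * channels                                   # concatenated RGB + TIR features
        self.branch_ch = nn.Sequential(                    # channel-attention branch
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(c, c, 1), nn.Sigmoid())
        self.branch_sp = nn.Sequential(                    # spatial-attention branch
            nn.Conv2d(c, 1, 7, padding=3), nn.Sigmoid())
        self.gate = nn.Parameter(torch.zeros(1, c, 1, 1))  # learned branch-selection weight
        self.proj = nn.Conv2d(c, channels, 1)              # fuse back to the backbone width

    def forward(self, f_rgb, f_tir):
        f = torch.cat([f_rgb, f_tir], dim=1)
        w = torch.sigmoid(self.gate)                       # per-channel branch preference
        f = f * (w * self.branch_ch(f) + (1 - w) * self.branch_sp(f))
        return self.proj(f)

print(AttentionSplice(64)(torch.rand(1, 64, 40, 40), torch.rand(1, 64, 40, 40)).shape)
```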
4.5. Lightweight Design
During the design of the modal splicing function, replacing all modal splicing modules with PPAS greatly increases the overall complexity of the model. Considering that the algorithmic model is mainly intended for UAV forest fire detection tasks (especially small targets), a lightweight design is adopted in this section: the original modal splicing function (Concat) is retained in the third modal splicing module, which corresponds to the third detection layer and is mainly responsible for detecting large targets.
This section presents comparative experiments assessing the detection performance and model complexity of different modal splicing schemes. All schemes are evaluated within the mid-term fusion framework with the enhanced modal fusion structure (YOLOv11-MF + feature interaction structure). As detailed in
Table 11, Scheme 1 replaces all modal splicing modules with PPAS; Scheme 2 substitutes only the first two modules with PPAS; and Scheme 3 replaces only the first module with PPAS.
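The three schemes amount to a per-scale choice between PPAS and plain Concat splicing. The helper below is a hypothetical sketch of how that choice could be wired; the module classes, channel widths, and the build_splices function are illustrative and do not reflect the paper's actual configuration.

```python
import torch
import torch.nn as nn

class ConcatSplice(nn.Module):
    """Plain Concat splicing followed by a 1x1 projection (stand-in module)."""
    def __init__(self, channels):
        super().__init__()
        self.proj = nn.Conv2d(2 * channels, channels, 1)
    def forward(self, f_rgb, f_tir):
        return self.proj(torch.cat([f_rgb, f_tir], dim=1))

# Three splice positions feed the three detection scales; the third serves large targets.
# Scheme 1 uses PPAS at all three positions, Scheme 2 at the first two, Scheme 3 at the first only.
SCHEMES = {1: (True, True, True), 2: (True, True, False), 3: (True, False, False)}

def build_splices(channels_per_scale, ppas_cls, scheme=2):
    """One splice module per scale: PPAS where flagged, otherwise the lighter Concat."""
    return nn.ModuleList([
        (ppas_cls if flag else ConcatSplice)(c)
        for c, flag in zip(channels_per_scale, SCHEMES[scheme])])

# e.g. build_splices([128, 256, 512], ppas_cls=ConcatSplice, scheme=2)  # pass the PPAS class in practice
```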
As shown in
Table 11, simplified modal splicing Scheme 2 (replacing only the first two modal splicing modules with PPAS) reduces model parameters and size by nearly 50% compared to full replacement. Notably, indicators show no significant decline, with slight improvements in accuracy and mAP50-95, verifying the effectiveness of the simplified design.
In order to further verify the balance between the detection effect and model complexity of the algorithmic models, the comparison of detection performance and complexity of the single-modal and RGB-T bimodal algorithmic models is shown in
Table 12.
From
Table 12, the model is designed to handle RGB-T bimodal data with a dual backbone for cross-modal feature extraction and incorporates lightweight designs in the data input and algorithmic improvements. Although its parameter count and model size are slightly larger than those of the original single-modal detector, it achieves improved performance with only a modest increase in complexity.
4.6. Visual Analysis
In order to visualize the performance of the algorithmic model established in the forest fire target detection task,
Figure 16 shows the partial detection results of the algorithmic model CP-YOLOv11-MF, which demonstrates the detection effect of the UAV from different viewpoints and in different scenes.
The blue box labeled “fire 0.8” indicates that the model predicts that the target is “fire” with 80% confidence. From the above figure, it can be seen that the CP-YOLOv11-MF algorithm model can fulfill the forest fire target detection task well.
At the same time, in order to further analyze the performance differences between different algorithm models, representative forest fire image detection samples (night environment, tree cover, smoke cover) are selected for visual analysis in this section, as shown in
Figure 17,
Figure 18 and
Figure 19, to visualize the improvement effect of different algorithm models.
Figure 17 illustrates the detection performance of each model under nighttime conditions. While each model demonstrates proficiency in detecting fires with distinct characteristics, person detection may incur pixel-level displacement. This is because humans lack rich texture in thermal infrared images, leading to blurred detection box borders that hinder accurate localization. In the mid-term fusion framework (YOLOv11-MF), multiple detection boxes initially appear, but the final model—incorporating cross-modal feature fusion and splicing—achieves precise person detection with the highest confidence among all models.
The performance of visible images in flame detection under tree occlusion conditions is limited, as shown in
Figure 18. For some fire objects, the detection confidence is only 30%, and the detection boxes show localization bias. Thermal infrared images, by virtue of their ability to perceive high-temperature areas in the scene, can effectively recognize high-temperature target regions that stand out from the surrounding environment. In the early fusion framework (YOLOv11-EF), the poor fusion of bimodal information initially generates multiple detection boxes. After the model improvements, the confidence of all target detections increases to 80%, demonstrating that the adopted algorithm model effectively enhances the accuracy and stability of forest fire target detection under tree occlusion conditions and provides a more reliable solution for forest fire target detection in complex environments.
There is a false alarm problem with thermal infrared images in smoke occlusion environments, as shown in
Figure 19. Because some areas around the fire point have temperatures close to human body temperature, thermal infrared images incorrectly identify these areas as person targets. Visible light images, with their rich texture information, can still detect flames and smoke under low visibility, but their detection boxes are less accurately localized. In addition, under the early fusion and mid-term fusion frameworks, the visible light imagery exhibits missed detections and some actual targets are not detected. When the proposed method is used, the detection confidence for both critical targets, flames and people, increases to 80%, improving the accuracy and reliability of detection.
In summary, the algorithm model CP-YOLOv11-MF constructed is able to perform the target detection task more accurately in complex forest fire scenarios. Compared with single-modal detection methods, the model significantly reduces the false alarm rate and missed alarm rate, effectively overcoming the limitations of single-modal detection. Meanwhile, by designing the modal interaction structure and optimizing the modal splicing module, the model’s ability to detect targets in complex environments is enhanced significantly.
5. Conclusions
In this paper, a multi-target, multi-scene forest fire aerial photography dataset is constructed by collecting data at multiple locations with UAVs equipped with dual-optical (visible-thermal) cameras, providing a more comprehensive visual dataset for subsequent forest fire prevention and management research. Meanwhile, the early fusion (YOLOv11-EF), mid-term fusion (YOLOv11-MF), and late fusion (YOLOv11-LF) detection frameworks are constructed on the basis of the YOLOv11 target detection model, demonstrating the advantage of RGB-T bimodal target detection networks over single-modal ones. On this basis, a modal interaction structure is designed and the modal splicing module is optimized to enhance deep cross-modal interaction and fusion for RGB-Thermal bimodal target detection, and a lightweight design is incorporated during model improvement. The final RGB-T dual-modal detection model, CP-YOLOv11-MF, achieves 92.5% precision, 93.5% recall, 96.3% mAP50, and 62.9% mAP50-95, corresponding to improvements of 1.8%, 3.2%, 2.7%, and 7.9% over single-modal visible light detection and 1.3%, 4.9%, 2.7%, and 4.7% over single-modal thermal infrared detection.
This paper presents an optimized AI-driven framework for RGB-thermal fusion in wildfire detection, which significantly improves the accuracy and response efficiency of monitoring systems. In the context of the growing trend of multi-source data fusion for forest fire detection, this study provides novel insights into the integration of diverse data modalities. Future work will focus on further enhancing the scale and diversity of the multi-scenario fire dataset by continuing to collect data in more forested areas with different geographic environments and climatic conditions, covering a wide range of terrains such as mountains, hills, and plains, as well as forested scenarios with different seasons and day/night time slots, in order to increase the dataset’s level of coverage of complex real-world scenarios. At the algorithmic research level, we continue to study the cross-modal fusion mechanism in depth, explore more potential modal interaction features, and improve the efficiency of the model in utilizing the bimodal data, so as to achieve more stable and accurate detection in the complex and changing forest fire scenarios.