Article

Multimodal Wildfire Classification Using Synthetic Night-Vision-like and Thermal-Inspired Image Representations

Mechatronic Engineering Department, Faculty of Engineering, Fırat University, Elazig 23119, Turkey
* Author to whom correspondence should be addressed.
Fire 2026, 9(3), 109; https://doi.org/10.3390/fire9030109
Submission received: 8 December 2025 / Revised: 1 February 2026 / Accepted: 25 February 2026 / Published: 2 March 2026

Abstract

In this study, a deep learning-based multimodal framework for forest fire detection is presented that synthetically generates night-vision-like, white-hot, and green-hot pseudo-thermal representations from RGB images. The synthetic modalities are derived directly from RGB data and integrated into a hardware-independent multimodal learning pipeline to increase visual diversity without relying on additional sensing hardware. Each modality is processed using an ImageNet-pretrained convolutional backbone, and modality-specific feature vectors are combined through feature-level concatenation before classification. The proposed framework was evaluated using multiple backbone architectures, including ResNet18, EfficientNet-B0, and DenseNet121, which were assessed independently under a unified experimental protocol. Experiments were conducted on two datasets with substantially different scales and characteristics: the FLAME dataset (39,375 images, binary classification) and the FireStage dataset (791 images, three-class classification). For both datasets, stratified 80–20% training–validation splits were employed, and online stochastic data augmentation was applied exclusively to the training sets. On the FLAME dataset, the proposed framework achieved consistently high performance across different backbone and modality configurations. The best-performing models reached an accuracy of 99.66%, precision of 99.80%, recall of 99.66%, F1-score of 99.73%, and a ROC AUC value of 0.9998. On the more challenging FireStage dataset, the framework demonstrated stable performance despite limited data availability, achieving an accuracy of 93.71% with RGB-only configurations and up to 93.08% with selected multimodal combinations, while macro-averaged F1-scores exceeded 0.92 and ROC AUC values reached up to 0.9919.
Per-class analysis further indicates that early-stage fire (Start Fire) patterns can be discriminated, achieving ROC AUC values above 0.96, depending on the backbone and modality combination. Overall, the results suggest that synthetic-modality-based multimodal learning can provide competitive performance for both large-scale and data-limited fire detection scenarios, offering a flexible and hardware-independent alternative for forest fire monitoring applications.

1. Introduction

1.1. Background

Forest fires not only disrupt ecosystem balance but also cause multidimensional environmental consequences such as reduced biodiversity, loss of natural habitats, degradation of soil mineral structure, and disruption of the water cycle. Post-fire soil erosion, sudden increases in carbon emissions, and prolonged ecosystem recovery periods contribute to long-lasting effects that may persist for years. Furthermore, the destruction of forested areas accelerates climate change by reducing natural carbon sinks and causes significant socio-economic losses associated with agriculture, tourism, and energy infrastructure [1,2,3].
More frequent heat waves, prolonged droughts, low humidity levels, and stronger winds associated with global climate change significantly increase both the likelihood of forest fires and their rate of spread. Megafires observed in recent years in regions such as the Mediterranean basin, the west coast of the United States, and Australia have clearly demonstrated that traditional observation and intervention methods are no longer sufficient [1,4,5]. The increase in the intensity and spread of fires has created a need for a new approach in which even minutes are critical during detection and intervention [6,7].
However, because existing camera-based detection systems are mostly designed to capture flame components, the low-density smoke, gray-toned gas clusters, and non-thermal early-stage heat signatures that characterize the initial stage of a fire are often not detected correctly [7,8,9]. Similarly, classical computer vision methods are easily affected by environmental factors such as lighting changes, shadows, haze, and sun glare [10,11]. Furthermore, although thermal cameras offer advantages for early detection, their high cost, energy consumption, and field installation requirements limit their applicability in large-scale areas [4,12,13].
For these reasons, it is critical to develop low-cost, hardware-independent, environmentally resilient, and scalable automated systems that can detect forest fires before flames form, i.e., at the smoke or temperature anomaly stage [5,14,15]. This study proposes a new multimodal approach that addresses this need by generating synthetic night-vision- and thermal-like modalities from existing RGB data, giving early-stage indicators more prominence.

1.2. Literature Review

Table 1 provides a comprehensive summary of recent deep learning-based studies on forest fire and smoke detection, clearly showing the dominant approaches, datasets, methods, and performance metrics in the literature.
It should be emphasized that the performance metrics summarized in Table 1 are derived from studies that address fundamentally different learning paradigms, including image-level classification and object-level detection tasks. In classification-oriented approaches, the term “Accuracy” conventionally denotes image-level classification accuracy, defined as the proportion of correctly classified images with respect to the total number of evaluated samples. Conversely, object detection studies—particularly those employing YOLO-based architectures—primarily assess performance using mean Average Precision (mAP), most commonly mAP@0.5, which quantifies localization and classification performance at the object level. In several studies, this metric is loosely referred to as “accuracy,” despite representing a conceptually different evaluation criterion. As a consequence, the high performance values reported for detection-based methods in Table 1 predominantly reflect successful object-level localization and recognition of visually salient fire or smoke regions, rather than holistic image-level classification performance. Accordingly, these metrics are not directly comparable to the classification-based results reported in this study and should be interpreted within the context of their respective task formulations.
A careful examination of the table reveals that the vast majority of existing studies focus on binary classification problems such as the presence or absence of fire or smoke. A significant portion of these studies report high accuracy rates (95–99% and above) based solely on single-modality RGB images. However, these high scores were mostly obtained under ideal conditions in which flames or dense smoke were clearly visible.
A significant portion of the studies in the literature rely on YOLO-based object detection architectures (YOLOv5–YOLOv11). These methods are quite effective in real-time detection of visually distinct flame and dense smoke areas in UAV and fixed camera scenarios. However, since these architectures are inherently focused on object-level detection, they may be limited in detecting low-contrast, weak smoke and heat traces that appear in the early stages of a fire. It is observed that in almost all of the studies listed in Table 1, the “Start Fire” class is not explicitly modeled as a separate class.
Another important limitation highlighted in Table 1 is that most multi-modal systems require actual thermal or infrared (IR) sensors. Although RGB + IR-based multimodal approaches offer advantages in night-vision and low-visibility conditions, these systems are impractical for large-scale applications due to their high hardware costs, energy consumption, field installation, and calibration requirements. Furthermore, the susceptibility of thermal images to environmental conditions (humidity, atmospheric noise, distance, etc.) can also limit model generalizability.
When examining the architectures listed in Table 1, it is evident that many advanced deep learning structures are used, such as CNN, CNN–RNN hybrids, Transformer-based networks, and ensemble approaches. Nevertheless, fire detection is mostly treated as a static image recognition problem, and the temporal development of a fire and early ignition dynamics are not explicitly modeled. Studies incorporating temporal information (LSTM, GRU, etc.) are limited in number and have generally evaluated binary classification scenarios only.
In terms of datasets, Table 1 shows that some large-scale datasets (e.g., FLAME, D-Fire, USTC_SmokeRS) are frequently used. However, the majority of these datasets only contain two-class labeling (fire/no-fire or smoke/no-smoke) and do not provide clear and balanced labeling for critical stages such as the early stages of a fire (start fire). This situation means that models are sensitive to fully developed fires but relatively weak against early-stage indicators.

1.3. Contribution and Novelty of Present Study

This study addresses several practical limitations reported in the existing fire detection literature by exploring the use of synthetic visual modalities derived from RGB images. Specifically, night-vision-like, white-hot, and green-hot pseudo-thermal representations were generated to create a multimodal input structure without relying on physical thermal or infrared sensors. This design choice aims to reduce hardware dependency and associated costs while enabling a controlled exploration of modality diversity within a unified framework.
While many previous studies have primarily focused on binary fire detection, the proposed framework was evaluated in both binary and multi-class scenarios. In addition to the FLAME dataset, which covers only the presence and absence of fire, the FireStage dataset was used to examine the model’s performance in a three-class setup (No Fire, Start Fire, and Fire). This approach enables an analysis of early-stage fire detection, a topic that is less explored in the literature due to data scarcity and annotation challenges.
As summarized in Table 1, the existing methods generally perform well in detecting fully developed fires but often rely on hardware-specific inputs and binary classification structures. In contrast, the approach proposed in this study investigates the feasibility of synthetic-modality-based multimodal learning under both large-scale and limited-data conditions, without the need for additional sensing equipment.
The main contributions of this study can be summarized as follows:
Unlike a large portion of existing works that formulate wildfire monitoring as an object detection or instance segmentation problem, the proposed approach is strictly designed as an image-level classification framework.
A hardware-independent multimodal dataset was constructed by generating synthetic night-vision-like, white-hot, and green-hot pseudo-thermal representations directly from RGB images.
A multimodal feature extraction and fusion framework was systematically evaluated using ImageNet-pretrained backbones (ResNet18, EfficientNet-B0, and DenseNet121).
The proposed framework was assessed in both binary (FLAME) and three-class (FireStage) classification settings, including early-stage fire scenarios.
The generalization ability of the approach was examined across datasets with substantially different sizes and characteristics.
Quantitative results revealed competitive performance across different backbone architectures and modality combinations, with detailed comparisons in terms of accuracy, precision, recall, F1-score, ROC/AUC, parameter count, and training time.
Overall, rather than presenting a definitive solution, this study offers an empirical investigation into the potential and limitations of synthetic-modality-based multimodal learning for fire detection, particularly in scenarios where access to specialized sensing hardware or large-scale labeled datasets is limited.

2. Materials and Methods

In this study, a multimodal dataset was developed by combining original RGB images with synthetically generated pseudo-thermal representations—specifically night-vision-like, white-hot, and green-hot variants—to enhance fire detection and classification performance. These multimodal inputs were processed using convolutional neural network backbones pre-trained on ImageNet, including ResNet-18, EfficientNet-B0, and DenseNet-121, depending on the experimental setup. An overview of the proposed methodological framework is provided in Figure 1.
The primary objective of this approach is to improve the model’s generalization ability by enriching visual input diversity without the need for additional sensing hardware. The framework is structured in three main stages:
(A)
Preprocessing and synthesis of multimodal visual inputs from RGB images;
(B)
Modality-specific feature extraction followed by feature-level fusion and classification through a lightweight fully connected layer;
(C)
Quantitative performance evaluation using standard classification metrics.
During the feature extraction phase, each modality is passed through a shared, pre-trained backbone network with frozen weights. The resulting feature vectors, specific to each modality, are then concatenated to create a unified multimodal representation. This fused representation enables the model to leverage complementary information across modalities, such as color composition, intensity distribution, texture, and pseudo-thermal features. To enhance the model’s robustness, particularly in challenging forest fire scenarios and early fire stages, the dataset was expanded in terms of both modality diversity and feature richness.

2.1. Dataset and Features

The methodological approach developed within the scope of this study was tested on two different datasets (FLAME and FireStage). The FLAME dataset (access link: https://ieee-dataport.org/open-access/flame-2-fire-detection-and-modeling-aerial-multi-spectral-image-dataset) (accessed on 1 November 2025) was selected because it has the highest number of data points for forest fire detection. However, a limitation of the FLAME dataset is that it only contains data labeled with two classes (fire present/fire absent). To address this issue, the FireStage dataset (access link: https://figshare.com/articles/dataset/Forest_Fire_Detection/14904522?file=28702587) (accessed on 1 November 2025) was included in this study. This dataset contains data labeled with three classes (“no fire,” “fire,” and “fire start”), and the data labeled as “fire start” includes not only images of fire but also images of light smoke. Figure 2 shows the numerical distribution of images in the two datasets according to classes.

2.2. Preprocessing and Augmentation

All input images from both the FLAME and FireStage datasets were processed using a unified preprocessing pipeline to ensure consistency across all modalities and backbone networks. The original RGB images were resized to a fixed resolution of 224 × 224 pixels and normalized using the mean and standard deviation values from the ImageNet dataset. This normalization was applied consistently to both training and validation sets to maintain compatibility with the ImageNet-pretrained backbone architectures.
Beyond the RGB inputs, three synthetic visual modalities—night-vision-like, white-hot, and green-hot representations—were generated directly from the RGB images. This was achieved through grayscale conversion, followed by histogram equalization and the application of predefined colormap mappings. These synthetic modalities were treated as independent input channels and subjected to the same preprocessing steps as the RGB images, ensuring a uniform input distribution across all modalities.
To mitigate the limitations posed by the relatively small size of the FireStage dataset and to enhance model generalization, an online stochastic data augmentation strategy was employed during training. At each epoch, the input images were randomly transformed using a mix of geometric and photometric augmentations. These included random resized cropping (with a scale range of 0.75 to 1.0), horizontal flipping (probability = 0.5), small-angle rotations within ±10 degrees, Gaussian blur (probability = 0.25), and color jitter (probability = 0.5). The purpose of these augmentations was to improve the model’s robustness to variations in viewpoint, lighting, and image quality, while preserving the semantic integrity of the scenes. To ensure a fair evaluation, no augmentation was applied to validation images, thus avoiding potential information leakage.
The data augmentation process was implemented online, meaning that augmented images were not generated or stored beforehand. Instead, each original image was transformed differently at every training epoch, allowing the model to encounter a wide variety of augmented versions throughout training. As a result, the effect of augmentation was quantified in terms of the number of training samples observed during optimization, rather than a static number of generated images.
For example, the FireStage dataset consisted of 632 training images after stratified splitting. Over the course of 10 epochs, this resulted in approximately 6320 effective training samples. Similarly, the FLAME dataset’s 31,500 training images yielded about 315,000 effective samples under the same training configuration. Table 2 and Table 3 provide a comparative summary of the real versus effective sample sizes for both datasets, illustrating the impact of online augmentation on the overall training volume.
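The effective-sample arithmetic above reduces to a simple product of training-set size and epoch count:

```python
# Effective training volume under online augmentation: each image is
# re-transformed at every epoch, so the model sees a fresh variant each time.
def effective_samples(n_train_images: int, n_epochs: int) -> int:
    """Number of (differently augmented) samples observed during training."""
    return n_train_images * n_epochs

firestage_effective = effective_samples(632, 10)     # FireStage: 6320
flame_effective = effective_samples(31_500, 10)      # FLAME: 315,000
```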

2.3. Data Synthesis

In this study, a multimodal input representation was constructed by generating three synthetic modalities—night-vision-like, white-hot, and green-hot—from each original RGB image. These modalities were not captured by physical infrared or thermal sensors; instead, they were derived deterministically from the RGB data to increase appearance diversity under limited-data conditions and to emulate common operational visualizations used in low-light and pseudo-thermal monitoring.

2.3.1. RGB-to-Grayscale Transformation

Let an RGB image be denoted as $I_{RGB}(x, y) = [R(x, y), G(x, y), B(x, y)]$, where $R, G, B \in [0, 255]$. First, the image was converted to a grayscale luminance map $I_{gray}(x, y)$ to preserve the intensity structure while removing color dependency. The grayscale conversion follows the standard luminance formulation:
$$I_{gray}(x, y) = 0.299\,R(x, y) + 0.587\,G(x, y) + 0.114\,B(x, y)$$
This step yields $I_{gray}(x, y) \in [0, 255]$ and provides a stable intensity basis for subsequent contrast and colormap operations.

2.3.2. Contrast Enhancement for Night-Vision-like Modality

To obtain a night-vision-like appearance, global histogram equalization was applied to the grayscale image to enhance contrast in dark regions. Let the histogram equalization operator be denoted as $H(\cdot)$. The enhanced intensity image is computed as follows:
$$I_{eq}(x, y) = H\big(I_{gray}(x, y)\big)$$
Histogram equalization improves the separability of low-intensity structures by redistributing intensity values over the available range. Finally, a predefined colormap $C_{NV}(\cdot)$ was applied to obtain the night-vision-like RGB representation:
$$I_{night}(x, y) = C_{NV}\big(I_{eq}(x, y)\big)$$
In practice, $C_{NV}$ corresponds to a green-dominant mapping that visually resembles low-light imaging.

2.3.3. White-Hot and Green-Hot Pseudo-Thermal Modalities

Two additional pseudo-thermal modalities were produced by applying distinct colormap transformations to the grayscale intensity image. The white-hot modality was generated by mapping high-intensity values to brighter tones using a colormap $C_{WH}(\cdot)$:
$$I_{white}(x, y) = C_{WH}\big(I_{gray}(x, y)\big)$$
Similarly, the green-hot modality was generated with a green-to-yellow intensity mapping using $C_{GH}(\cdot)$:
$$I_{green}(x, y) = C_{GH}\big(I_{gray}(x, y)\big)$$
These transformations are deterministic and preserve the spatial structure of the original image while modifying its appearance distribution, thereby providing complementary “views” of the same scene.
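The three synthesis steps (grayscale conversion, histogram equalization, colormap lookup) can be sketched in NumPy as below. The paper does not specify the implementation or the exact colormaps $C_{NV}$, $C_{WH}$, $C_{GH}$, so the green-dominant LUT here is an illustrative assumption; in practice, OpenCV's equalizeHist and applyColorMap perform the same operations:

```python
import numpy as np

def to_gray(rgb):
    """BT.601 luminance, matching the I_gray equation (H x W x 3, uint8)."""
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    return (0.299 * r + 0.587 * g + 0.114 * b).astype(np.uint8)

def equalize(gray):
    """Global histogram equalization H(.) (equivalent to cv2.equalizeHist)."""
    hist = np.bincount(gray.ravel(), minlength=256)
    cdf = hist.cumsum()
    cdf_min = cdf[cdf > 0].min()
    lut = (cdf - cdf_min) / max(cdf[-1] - cdf_min, 1) * 255.0
    return np.clip(np.round(lut), 0, 255).astype(np.uint8)[gray]

def colorize(gray, lut_rgb):
    """Apply a 256 x 3 colormap LUT (stand-in for cv2.applyColorMap)."""
    return lut_rgb[gray]

# Illustrative green-dominant LUT for C_NV; the exact colormaps used in
# the paper are not specified, so this mapping is an assumption.
ramp = np.arange(256, dtype=np.uint8)
zeros = np.zeros(256, dtype=np.uint8)
nv_lut = np.stack([zeros, ramp, zeros], axis=1)   # pure green ramp

# I_night = C_NV(H(I_gray)):
# night = colorize(equalize(to_gray(rgb_image)), nv_lut)
```

The white-hot and green-hot modalities follow the same pattern with their own LUTs, applied directly to the (non-equalized) grayscale image.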

2.3.4. Final Multimodal Sample Construction

For each original RGB image, the final multimodal sample is defined as a set of aligned modality tensors:
$$X(x, y) = \big\{\, I_{RGB}(x, y),\; I_{night}(x, y),\; I_{white}(x, y),\; I_{green}(x, y) \,\big\}$$
All modalities were resized to 224 × 224 and normalized with the ImageNet dataset’s mean and standard deviation to match the input distribution expected by the ImageNet-pretrained backbones. This modality construction enables consistent multimodal learning without requiring additional sensor hardware. Representative samples of RGB images and their synthesized modalities for the FLAME and FireStage datasets are shown in Figure 3.
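The per-modality normalization can be expressed as follows; the mean and standard deviation are the standard ImageNet statistics used by torchvision:

```python
import numpy as np

# Standard ImageNet channel statistics (torchvision convention).
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def normalize(img_uint8):
    """Scale an H x W x 3 uint8 image to [0, 1], then standardize per channel."""
    x = img_uint8.astype(np.float32) / 255.0
    return (x - IMAGENET_MEAN) / IMAGENET_STD
```

Applying the same statistics to every modality, including the synthetic ones, keeps the input distribution aligned with what the ImageNet-pretrained backbones expect.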

2.3.5. Interpretation of Modality-Specific Intensity Distributions

To investigate whether the synthesized modalities introduce statistically distinguishable patterns, pixel intensity histograms were generated for each class and modality, as shown in Figure 4. These histograms reveal that the synthetic transformations produce intensity distributions that differ noticeably from those of the original RGB images, reinforcing their value in expanding the feature space. Notably, the class-specific shifts observed in the pseudo-thermal representations suggest that early signs of fire—often visually subtle in RGB images—can be statistically highlighted through these transformations.
Table 4 contains data on the size of an original RGB image and the synthesized (night-vision-like, green-hot, and white-hot) images for both datasets.

2.4. Feature Extraction and Classification

The proposed fire detection model adopts a two-stage modular architecture that is specifically designed to integrate multimodal visual information and enhance classification performance. In the first stage, features are extracted separately from each modality, while the second stage performs classification using the combined feature representation. This architecture enables the model to leverage complementary visual cues from both the original RGB images and the synthetically generated modalities—namely, night-vision-like, white-hot, and green-hot—in a structured and interpretable way.

2.4.1. Modality-Wise Feature Extraction

During the feature extraction stage, each input modality is independently processed using a convolutional neural network backbone that has been pretrained on ImageNet. Depending on the specific experimental setup, ResNet-18, EfficientNet-B0, or DenseNet-121 is used as the feature extractor. In all cases, the final classification layers of these networks are removed, retaining only the convolutional body. This approach allows the model to extract high-level semantic features while preserving spatially aggregated information that is crucial for identifying fire-related patterns.
Let $I_m$ denote an input image corresponding to modality $m \in \{\text{RGB}, \text{Night}, \text{White}, \text{Green}\}$. Each modality is passed through a shared backbone network $F$ to obtain a modality-specific feature vector:
$$f_m = F(I_m), \quad f_m \in \mathbb{R}^d$$
where $d$ denotes the backbone-dependent feature dimensionality (e.g., $d = 512$ for ResNet-18). In the default setup, a shared backbone network is employed across all modalities to minimize model complexity and maintain consistent feature embeddings. All backbone parameters are kept frozen during training, a strategy that helps to stabilize the optimization process, especially in scenarios with limited training data.

2.4.2. Multimodal Feature Fusion

The modality-specific feature vectors are combined using feature-level concatenation, resulting in a unified multimodal representation:
$$f_{concat} = [\, f_{RGB} \,\Vert\, f_{Night} \,\Vert\, f_{White} \,\Vert\, f_{Green} \,]$$
where $\Vert$ denotes the concatenation operator. For four modalities and a backbone feature dimension of $d$, this produces a combined feature vector of size $4d$ (e.g., 2048 dimensions for ResNet-18). The fused representation effectively captures complementary information such as color composition, intensity distribution, texture, and pseudo-thermal responses, all of which are crucial for distinguishing between different stages of fire.
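In PyTorch terms, the fusion reduces to a single concatenation over the feature dimension:

```python
import torch

# Feature-level concatenation of the four modality embeddings.
# With a ResNet-18 backbone (d = 512), the fused vector has 4d = 2048 dims.
def fuse(f_rgb, f_night, f_white, f_green):
    return torch.cat([f_rgb, f_night, f_white, f_green], dim=1)
```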

2.4.3. Classification Head

The concatenated feature vector is passed to a lightweight, fully connected classification head responsible for the final prediction. First, a linear transformation reduces the feature dimensionality from $4d$ to 256, enabling compact representation learning. A ReLU activation introduces nonlinearity, followed by a dropout layer with a probability of 0.3 to mitigate overfitting. The final linear layer outputs class logits corresponding to either the binary (No Fire, Fire) or three-class (No Fire, Start Fire, Fire) classification task.
During training, class-weighted CrossEntropyLoss is employed to address class imbalance, particularly for the start_fire class in the FireStage dataset. Optimization is performed using the Adam optimizer, and only the parameters of the classification head are updated, while the backbone weights remain frozen. The layer-wise structure of the classification head is summarized in Table 5.
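The classification head and training setup described above can be sketched as follows; the class-weight values are placeholders, since the exact weights (derived from the class frequencies) are not reported:

```python
import torch
import torch.nn as nn

# Classification head: 4d -> 256 -> ReLU -> Dropout(0.3) -> num_classes.
# num_classes is 2 for FLAME and 3 for FireStage.
def make_head(d: int = 512, num_classes: int = 3) -> nn.Sequential:
    return nn.Sequential(
        nn.Linear(4 * d, 256),
        nn.ReLU(),
        nn.Dropout(p=0.3),
        nn.Linear(256, num_classes),
    )

head = make_head()

# Class-weighted loss to counter the under-represented start_fire class;
# these weight values are illustrative placeholders, not the authors' exact ones.
criterion = nn.CrossEntropyLoss(weight=torch.tensor([1.0, 2.0, 1.0]))

# Only the head's parameters are optimized; the backbone remains frozen.
optimizer = torch.optim.Adam(head.parameters())
```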

2.5. Model Performance Evaluation

The accuracy of the developed transfer learning- and feature fusion-based deep neural network model was calculated on both the training and validation (test) sets at the end of each epoch. After model training was completed, classification performance was evaluated on the validation set. For this purpose, accuracy, macro-averaged precision, recall, F1 score, and multi-class ROC AUC (One-vs.-Rest) metrics were calculated (Table 6). The results were visualized using class-based ROC curves and a confusion matrix. For the binary FLAME dataset, standard binary ROC AUC analysis was applied. For the three-class FireStage dataset, multi-class ROC AUC values were computed using a One-vs-Rest (OvR) strategy with macro-averaging to account for class imbalance. Additionally, a learning curve comparing training and validation accuracies for each epoch was constructed.
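The evaluation protocol above can be reproduced with scikit-learn as sketched below (function and variable names are illustrative, not taken from the authors' code):

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_recall_fscore_support, roc_auc_score)

def evaluate(y_true, y_prob):
    """Compute the Section 2.5 metrics.

    y_true: (N,) integer labels; y_prob: (N, C) class probabilities.
    """
    y_pred = y_prob.argmax(axis=1)
    acc = accuracy_score(y_true, y_pred)
    prec, rec, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="macro", zero_division=0)
    # One-vs-Rest macro ROC AUC, as used for the three-class FireStage setting.
    auc = roc_auc_score(y_true, y_prob, multi_class="ovr", average="macro")
    return {"accuracy": acc, "precision": prec, "recall": rec, "f1": f1,
            "roc_auc": auc, "confusion_matrix": confusion_matrix(y_true, y_pred)}
```

For the binary FLAME setting, `roc_auc_score` can instead be called with the positive-class probability column alone, which yields the standard binary ROC AUC.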

3. Results

The proposed model was evaluated on two datasets with substantially different characteristics and scales: the FireStage dataset and the FLAME dataset. All experiments were conducted using stratified training–validation splits to preserve class distributions and ensure fair performance assessment.
The FireStage dataset contains a total of 791 images, comprising three classes (No Fire, Start Fire, and Fire). Due to the limited size of the dataset, the available images were used as provided and organized into stratified training and validation splits to preserve class distributions across different fire progression stages. Following stratified splitting, 632 images (80%) were used for training, and 159 images (20%) were reserved for validation. This dataset was primarily employed to evaluate the model’s ability to distinguish different stages of fire development under data-scarce conditions (Table 7).
The FLAME dataset, which is considerably larger, consists of 39,375 images acquired by unmanned aerial vehicles (UAVs) under diverse environmental and terrain conditions. The dataset includes scenes with and without fire and supports robust evaluation for early fire detection. Using the same stratified splitting strategy, 31,500 images (80%) were used for training and 7875 images (20%) for validation. In addition to RGB images, synthetic modalities derived from RGB data were incorporated to enrich visual diversity and support multimodal learning.
For both datasets, data augmentation was applied exclusively to the training sets using an online stochastic strategy, while validation images remained unaltered. Performance results are reported using accuracy, precision, recall, F1-score, confusion matrices, and ROC/AUC metrics, with per-class analyses provided for the FireStage dataset to account for class imbalance and limited sample size.
Table 8 presents a detailed validation performance comparison of the proposed model on the FLAME and FireStage datasets in terms of different network architectures and modality combinations. For the FLAME dataset, the results indicate that RGB input consistently achieves the highest—or very close to the highest—performance across all architectures. In particular, the DenseNet121 architecture with RGB input achieved the best single performance, with an accuracy of 99.66%, an F1-score of 99.73%, and an AUC value of 0.9997 (as reported in Table 8). This finding indicates that for the binary classification problem, fire-related visual cues are largely discriminative within the RGB space.
On the FLAME dataset, using individual synthetic modalities—Green, Night, and White—resulted in only a slight performance drop compared to RGB. Among these, the White modality performed the best, delivering more competitive results than the other synthetic transformations. Notably, the combination of RGB and White achieved performance levels nearly identical to the RGB-only setup across all architectures, with high F1-scores and a well-balanced precision–recall trade-off. For instance, in the case of DenseNet121, this configuration reached an F1-score of 0.9963 and an AUC of 0.9997 (see Table 8). On the other hand, while the RGB + Green and RGB + Night combinations maintained relatively high precision, they suffered from reduced recall, indicating an increase in false negatives, which negatively impacted overall F1 performance.
The results obtained using the FireStage dataset, which poses a more complex and multi-class classification challenge, further highlight the limitations of relying solely on synthetic modalities. As shown in Table 8, RGB again proved to be the most effective standalone input across all tested architectures. The DenseNet121 + RGB configuration delivered the best performance, with an accuracy of 93.71% and an F1-score of 0.9272. In contrast, using the Green, Night, or White modality alone led to a significant drop in performance, suggesting that these synthetic transformations are insufficient by themselves for reliably distinguishing between fire stages. However, it is important to acknowledge certain limitations of the proposed synthetic modality approach. Since all synthetic channels are deterministically derived from RGB images, their effectiveness is inherently dependent on the quality of the original visual input. In scenarios involving environmental degradation—such as fog, smoke, clouds, or dust—the RGB signal may be significantly distorted, leading to misleading or ambiguous synthetic transformations. These conditions represent potential failure cases for the method and may limit its generalizability in real-world deployments.
However, when paired with RGB, the White modality demonstrated notable benefits on the FireStage dataset. For example, with the EfficientNet-B0 architecture, the RGB + White combination improved the F1-score by approximately 2.5% over the RGB-only setup and yielded more balanced precision and recall values (see Table 8). Similarly, both the DenseNet121 and ResNet18 architectures with the RGB + White combination demonstrated performance levels that closely matched their RGB-only baselines, while also exhibiting improved generalization. These findings suggest that the White modality enhances structural and contrast-related cues between fire stages, making it particularly beneficial in multi-class classification scenarios.
Finally, the fully multimodal configuration—combining RGB, Night, White, and Green modalities—did not lead to a clear performance gain on either dataset (see Table 6 and Table 8). Moreover, this setup substantially increased both the number of model parameters and the training time. For instance, for DenseNet121, the fully multimodal approach led to a significant increase in training duration without outperforming simpler configurations such as RGB or RGB + White in terms of F1-score. These outcomes emphasize that a selective, controlled modality fusion strategy is more effective and computationally efficient than indiscriminate multimodal expansion.
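The feature-level concatenation underlying these fusion configurations can be sketched in a few lines. The per-backbone dimension d = 512 below is assumed purely for illustration; it matches the "4d (e.g., 2048)" classifier input reported in Table 5:

```python
def fuse_features(feature_vectors):
    """Feature-level fusion: concatenate per-modality feature vectors into one."""
    fused = []
    for vec in feature_vectors:
        fused.extend(vec)
    return fused

# With four modalities and an assumed 512-dimensional backbone output,
# the fused vector has length 4d = 2048 -- the classification-head input
# size given in Table 5.
d = 512
rgb, night, white, green = ([0.0] * d for _ in range(4))
fused = fuse_features([rgb, night, white, green])
```

The sketch also makes the cost argument concrete: every added modality lengthens the fused vector (and requires its own backbone forward pass), so indiscriminate expansion grows parameters and training time linearly with the number of modalities.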
The confusion matrices of the FLAME dataset presented in Figure 5 illustrate the binary classification performance of the model under different backbone architectures (ResNet18, EfficientNet-B0, and DenseNet121) and different modality configurations (RGB-only, RGB + White, and fully multimodal). When only RGB images were used, all backbone architectures achieved high correct classification rates. In particular, the true positive rates for the Fire class are notably high, while false-negative predictions (Fire → No Fire) remain limited. This observation indicates that due to the large scale and visual diversity of the FLAME dataset, fire-containing scenes provide strong discriminative features.
The RGB + White modality combination shows a tendency to reduce false-positive rates, especially for the No Fire class. This suggests that white-hot-like transformations support brightness-based discrimination and contribute to a more consistent representation of non-fire scenes. In contrast, in the fully multimodal configuration (RGB + Night + White + Green), an increase in the number of cases where the Fire class is misclassified as No Fire is observed for some backbone architectures. This finding indicates that synthetic modalities do not always provide complementary benefits and that modality fusion must be carefully designed, particularly for large-scale datasets. Especially for EfficientNet- and DenseNet-based models, certain synthetic transformations may complicate the decision boundaries.
The confusion matrices for the FireStage dataset reflect model behavior in a three-class (No Fire, Fire, Start Fire) and data-limited scenario in detail. In this dataset, class imbalance and the limited number of samples become more pronounced, particularly for the Start Fire class. When only RGB images were used, all backbone architectures could distinguish the No Fire and Fire classes with relatively high accuracy; however, the Start Fire class was occasionally confused with these two classes. This is attributed to the visually ambiguous and low-contrast nature of early-stage fire scenes.
The RGB + White modality combination reduces misclassifications for the Fire class, while some confusion remains for the Start Fire class. This indicates that, although white-hot transformations are beneficial for scenes with strong flames, they do not always clearly separate early-stage fire cues. In the fully multimodal configuration (RGB + Night + White + Green), an overall improvement trend in the correct classification of the Start Fire class is observed. In particular, DenseNet- and EfficientNet-based models assign Start Fire samples to the correct class at higher rates. This suggests that synthetic modalities can enhance the visibility of low-intensity and early-stage fire patterns.
Nevertheless, some instances are also observed where the No Fire class is misclassified as Start Fire. This observation indicates that the limited sample size and visual overlap in the FireStage dataset can influence model decisions.
When both datasets are considered together, the confusion matrices show that the model exhibits more stable behavior on large-scale and balanced datasets, while being more sensitive to modality selection in data-limited and multi-class scenarios. In particular, for the FireStage dataset, synthetic modalities contribute to the separation of the early-stage fire class; however, the extent of this contribution varies depending on the backbone architecture and modality combination.
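The metrics discussed alongside these confusion matrices follow directly from the matrix counts. As a minimal sketch for the binary FLAME case (with illustrative counts, not values taken from Figure 5):

```python
def binary_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F1 from binary confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)   # fraction of predicted fires that were real
    recall = tp / (tp + fn)      # fraction of real fires that were caught
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Illustrative counts only: 90 detected fires, 95 correct non-fires,
# 5 false alarms (No Fire -> Fire), 10 missed fires (Fire -> No Fire).
acc, prec, rec, f1 = binary_metrics(tp=90, tn=95, fp=5, fn=10)
```

Reading the matrices through these definitions explains the trade-offs above: the RGB + Green and RGB + Night combinations keep fp low (high precision) but raise fn (lower recall), which drags down F1.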

4. Discussion and Limitations

Table 9 presents a comparative performance evaluation of the proposed method on the FLAME and FireStage datasets against recent studies in the literature. As the table shows, the majority of existing works focus on a binary classification setup (fire/no fire) and typically rely either on RGB images alone or on a limited number of hardware-based multimodal inputs (e.g., RGB + infrared). For instance, Goncalves et al. (2024) [9] reported an accuracy of 99.70% on the FLAME dataset using an RGB-based DenseNet architecture, while Wang et al. (2024) [13] achieved 98.99% accuracy with a hybrid SE-ResNet + SVM model. Similarly, Arteaga et al. (2020) [25], Mohammed et al. (2022) [26], and Benzekri et al. (2020) [2] reported accuracies approaching 99% in binary fire detection scenarios.
In contrast to these approaches, the method proposed in this study constructs a multi-input learning framework using synthetic modalities derived solely from RGB images, without relying on real thermal or infrared hardware. When Table 6 and Table 9 are considered together, the DenseNet121 architecture achieved an accuracy of 99.66% and an F1-score of 0.9973 on the FLAME dataset using RGB images, while the RGB + White modality combination yielded an accuracy of 99.53% and an F1-score of 0.9963. These results indicate that, despite being hardware-independent, the proposed approach delivers performance comparable to—and on some metrics, competitive with—the best-performing binary fire detection methods reported in the literature. This demonstrates that the method provides a low-cost, scalable, and practical solution suitable for real-world deployment.
The most notable contribution highlighted in Table 9 is that this study directly addresses a three-class classification problem (No Fire—Start Fire—Fire) on the FireStage dataset. In much of the existing literature, the “Start Fire” (early-stage fire) class is either neglected entirely or merged into the “Fire” or “No Fire” categories. In contrast, the proposed approach explicitly models the early stage of fire as a separate class, which is of critical importance for early intervention. As shown in Table 8, the DenseNet121 architecture achieved an accuracy of 93.71% and an F1-score of 0.9272 on the FireStage dataset. In addition, the RGB + White modality combination was observed to provide more balanced and generalizable performance across different backbone architectures.
A major limitation of this study is the relatively small size of the FireStage dataset compared to the FLAME dataset. While the FireStage dataset contains 791 images in total (632 for training and 159 for validation), the FLAME dataset comprises 39,375 images, providing a substantially richer training distribution. Although online data augmentation increased the effective number of training samples for the FireStage dataset to approximately 6320 across epochs, these augmented samples were derived from a limited set of original images and therefore could not fully replace the benefits of additional real-world data.
In particular, the Start Fire class is more sensitive to data scarcity, which may limit generalization performance. Consequently, the results for the FireStage dataset are reported based on per-class metrics and ROC/AUC analyses and should be interpreted with caution. Future work will focus on expanding the FireStage dataset with additional samples and increased scene diversity, which is expected to further improve the robustness of the proposed multimodal framework.
In practical deployment scenarios involving high-resolution UAV imagery, the proposed model is intended to operate within a patch-based inference framework. Instead of directly downscaling large aerial images to 224 × 224 pixels, high-resolution frames can be partitioned into smaller patches matching the network input size. Each patch is independently classified, and patch-level predictions can be aggregated using voting or confidence-based strategies to produce scene-level fire alerts. This approach preserves fine-grained visual cues, such as low-density smoke or early-stage fire signatures, which may be suppressed by global downscaling. Therefore, the proposed framework is well suited for integration into UAV-based early warning pipelines.
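A minimal sketch of this patch-based strategy is shown below. The inward shift of edge patches and the single-vote alert threshold are assumptions for illustration; the paper does not fix these details:

```python
def partition(width, height, patch=224):
    """Top-left corners of a patch grid covering the frame.

    Edge patches are shifted inward so every patch stays inside the image
    (a common tiling choice, assumed here rather than taken from the paper).
    """
    xs = list(range(0, width - patch + 1, patch))
    ys = list(range(0, height - patch + 1, patch))
    if xs[-1] != width - patch:
        xs.append(width - patch)   # extra column flush with the right edge
    if ys[-1] != height - patch:
        ys.append(height - patch)  # extra row flush with the bottom edge
    return [(x, y) for y in ys for x in xs]

def scene_alert(patch_fire_probs, threshold=0.5, min_votes=1):
    """Aggregate patch-level fire probabilities into a scene-level alert."""
    votes = sum(1 for p in patch_fire_probs if p >= threshold)
    return votes >= min_votes
```

Classifying each patch at native resolution is what preserves small, low-density cues: a 10-pixel smoke plume that vanishes when a 4K frame is downscaled to 224 × 224 still occupies a visible fraction of its own patch.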

5. Conclusions

In this study, a multi-input deep learning approach was developed to enable early-stage, hardware-independent detection of forest fires by leveraging synthetic night-vision-like, white-hot, and green-hot modalities derived from RGB images. The proposed method increases visual diversity without requiring real thermal or infrared sensors, thereby enhancing the model’s generalization capability. During feature extraction, ImageNet-pretrained backbones (ResNet18, EfficientNet-B0, and DenseNet121) were employed, and the feature representations obtained from the different modalities were combined in a controlled manner to perform classification.
The proposed approach was comprehensively evaluated on two different datasets: the binary FLAME dataset (fire/no fire) with 39,375 images, and the three-class FireStage dataset (No Fire—Start Fire—Fire) with 791 images. For both datasets, a stratified 80–20% training–validation split was applied. The experimental results showed that, on the FLAME dataset, the DenseNet121 + RGB configuration achieved an accuracy of 99.66%, an F1-score of 0.9973, and a ROC AUC value of 0.9997, indicating very high performance. In addition, the RGB + White modality combination produced results very close to the RGB-only configuration and emerged as a reliable alternative, providing more balanced performance across different architectures.
The results obtained on the three-class FireStage dataset clearly reflect the more challenging nature of this problem. On this dataset, the DenseNet121 architecture achieved its best accuracy of 93.71% and F1-score of 0.9272 with the RGB configuration, while the RGB + White modality combination showed more stable behavior in terms of precision–recall balance. Because the “Start Fire” class is characterized by low contrast, dense smoke, and weak flame cues, modeling it as a separate class significantly increases the difficulty of the classification task. Despite these challenges, the results demonstrate that the proposed method is effective in discriminating early-stage fire patterns.
The experiments also revealed that the fully multimodal configuration (RGB + Night + White + Green) substantially increased the number of parameters and training time, while not providing a meaningful performance gain. This finding indicates that controlled and selective modality fusion yields more efficient and generalizable results for fire detection problems compared to indiscriminate multimodal expansion.
In conclusion, this study goes beyond the predominantly binary fire detection problems addressed in the literature by directly modeling the early stage of fire as a separate class within a hardware-independent, synthetic-modality-supported deep learning framework. The findings show that the proposed approach achieves high generalization performance on large-scale datasets and delivers reasonable and operationally meaningful performance in multi-class scenarios with limited data. In this respect, the study offers a low-cost and scalable solution with strong potential for real-time forest fire monitoring and early warning systems.

Author Contributions

Conceptualization, B.T., A.B.T., A.K.T. and O.Y.; methodology, B.T. and A.B.T.; validation, B.T. and A.B.T.; software, B.T.; formal analysis, B.T., A.B.T. and A.K.T.; investigation, B.T.; resources, A.K.T. and B.T.; data curation, B.T.; writing—original draft preparation, B.T., A.B.T., O.Y. and A.K.T.; writing—review and editing, B.T., A.B.T. and A.K.T.; visualization, B.T. All authors have read and agreed to the published version of the manuscript.

Funding

This study was funded by the Scientific and Technological Research Council of Türkiye (TÜBİTAK) under the 1001 Program (Project No. 223M001) and by the Fırat University Scientific Research Projects (BAP) Coordination Unit (Project No. MF.25.122).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable. This study does not involve human participants.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author. Two open access datasets were used in this study. FLAME dataset access link: https://ieee-dataport.org/open-access/flame-2-fire-detection-and-modeling-aerial-multi-spectral-image-dataset; FireStage dataset access link: https://figshare.com/articles/dataset/Forest_Fire_Detection/14904522?file=28702587.

Acknowledgments

No additional material or financial support was received beyond the support stated in the funding section. The authors reviewed and edited all output and are fully responsible for the content of this publication. DeepL and Grammarly tools were used to improve the readability of the article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. de Almeida, R.V.; Crivellaro, F.; Narciso, M.; Sousa, A.I.; Vieira, P. Bee2Fire: A deep learning powered forest fire detection system. In Proceedings of the ICAART 2020—12th International Conference on Agents and Artificial Intelligence, Valletta, Malta, 22–24 February 2020; pp. 603–609. [Google Scholar]
  2. Benzekri, W.; El Moussati, A.; Moussaoui, O.; Berrajaa, M. Early forest fire detection system using wireless sensor network and deep learning. Int. J. Adv. Comput. Sci. Appl. 2020, 11, 496–503. [Google Scholar] [CrossRef]
  3. Mohammad, M.B.; Bhuvaneswari, N.; Koteswari, C.P.; Priya, V.B. Hardware implementation of forest fire detection system using deep learning architectures. In Proceedings of the International Conference on Edge Computing and Applications (ICECAA) 2022, Tamilnadu, India, 13–15 October 2022; pp. 1198–1205. [Google Scholar]
  4. Ryu, J.; Kwak, D. A study on a complex flame and smoke detection method using computer vision detection and convolutional neural network. Fire 2022, 5, 108. [Google Scholar] [CrossRef]
  5. Shirwaikar, R.; Narvekar, A.; Hosamani, A.; Fernandes, K.; Tak, K.; Parab, V. Real-time semi-occluded fire detection and evacuation route generation: Leveraging instance segmentation for damage estimation. Fire Saf. J. 2025, 152, 104338. [Google Scholar] [CrossRef]
  6. Chaoxia, C.; Shang, W.; Zhang, F. Information-guided flame detection based on Faster R-CNN. IEEE Access 2020, 8, 58923–58932. [Google Scholar] [CrossRef]
  7. Casas, E.; Ramos, L.; Bendek, E.; Rivas-Echeverria, F. Assessing the effectiveness of YOLO architectures for smoke and wildfire detection. IEEE Access 2023, 11, 96554–96583. [Google Scholar] [CrossRef]
  8. Sathishkumar, V.E.; Cho, J.; Subramanian, M.; Naren, O.S. Forest fire and smoke detection using deep learning-based learning without forgetting. Fire Ecol. 2023, 19, 9. [Google Scholar] [CrossRef]
  9. Goncalves, A.M.; Brandao, T.; Ferreira, J.C. Wildfire detection with deep learning—A case study for the CICLOPE project. IEEE Access 2024, 12, 82095–82110. [Google Scholar] [CrossRef]
  10. Yan, C.; Wang, J. MAG-FSNet: A high-precision robust forest fire smoke detection model integrating local features and global information. Measurement 2025, 247, 116813. [Google Scholar] [CrossRef]
  11. Wang, W.; Huang, Q.; Liu, H.; Jia, Y.; Chen, Q. Forest fire detection method based on deep learning. In Proceedings of the International Conference on Cyber-Physical Social Intelligence (ICCSI) 2022, Nanjing, China, 18–21 November 2022; pp. 23–28. [Google Scholar]
  12. Zhu, W.; Niu, S.; Yue, J.; Zhou, Y. Multiscale wildfire and smoke detection in complex drone forest environments based on YOLOv8. Sci. Rep. 2025, 15, 2399. [Google Scholar] [CrossRef]
  13. Wang, X.; Wang, J.; Chen, L.; Zhang, Y. Improving computer vision-based wildfire smoke detection by combining SE-ResNet with SVM. Processes 2024, 12, 747. [Google Scholar] [CrossRef]
  14. Li, L.; Liu, F.; Ding, Y. Real-time smoke detection with Faster R-CNN. In Proceedings of the 2nd International Conference on Artificial Intelligence and Information Systems, Chongqing, China, 28–30 May 2021; pp. 1–5. [Google Scholar]
  15. Bahhar, C.; Ksibi, A.; Ayadi, M.; Jamjoom, M.M.; Ullah, Z.; Soufiene, B.O.; Sakli, H. Wildfire and smoke detection using staged YOLO model and ensemble CNN. Electronics 2023, 12, 228. [Google Scholar] [CrossRef]
  16. Li, Y.; Zhang, W.; Liu, Y.; Jing, R.; Liu, C. An efficient fire and smoke detection algorithm based on an end-to-end structured network. Eng. Appl. Artif. Intell. 2022, 116, 105492. [Google Scholar] [CrossRef]
  17. Wang, C.; Li, Q.; Liu, S.; Cheng, P.; Huang, Y. Transformer-based fusion of infrared and visible imagery for smoke recognition in commercial areas. Comput. Mater. Contin. 2025, 84, 5157–5176. [Google Scholar] [CrossRef]
  18. Wang, Y.; Wang, Y.; Khan, Z.A.; Huang, A.; Sang, J. Multi-level feature fusion networks for smoke recognition in remote sensing imagery. Neural Netw. 2025, 184, 107112. [Google Scholar] [CrossRef] [PubMed]
  19. Alkhammash, E.H. A comparative analysis of YOLOv9, YOLOv10, YOLOv11 for smoke and fire detection. Fire 2025, 8, 26. [Google Scholar] [CrossRef]
  20. Xue, Z.; Kong, L.; Wu, H.; Chen, J. Fire and smoke detection based on improved YOLOv11. IEEE Access 2025, 13, 73022–73040. [Google Scholar] [CrossRef]
  21. He, L.; Zhou, Y.; Liu, L.; Zhang, Y.; Ma, J. Research and application of deep learning object detection methods for forest fire smoke recognition. Sci. Rep. 2025, 15, 16328. [Google Scholar] [CrossRef]
  22. Niu, K.; Wang, C.; Xu, J.; Liang, J.; Zhou, X.; Wen, K.; Lu, M.; Yang, C. Early forest fire detection with UAV image fusion: A novel deep learning method using visible and infrared sensors. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 6617–6629. [Google Scholar] [CrossRef]
  23. Jin, P.; Cheng, P.; Liu, X.; Huang, Y. From smoke to fire: A forest fire early warning and risk assessment model fusing multimodal data. Eng. Appl. Artif. Intell. 2025, 152, 110848. [Google Scholar] [CrossRef]
  24. Shang, L.; Hu, X.; Huang, Z.; Zhang, Q.; Zhang, Z.; Li, X.; Chang, Y. YOLO-DKM: A flame and spark detection algorithm based on deep learning. IEEE Access 2025, 13, 117687–117699. [Google Scholar] [CrossRef]
  25. Arteaga, B.; Diaz, M.; Jojoa, M. Deep learning applied to forest fire detection. In Proceedings of the IEEE International Symposium on Signal Processing and Information Technology (ISSPIT) 2020, Louisville, KY, USA, 9–11 December 2020. [Google Scholar]
  26. Mohammed, R.K. A real-time forest fire and smoke detection system using deep learning. Int. J. Nonlinear Anal. Appl. 2022, 13, 2053–2063. [Google Scholar]
  27. Mohnish, S.; Akshay, K.P.; Ram, S.G.; Vignesh, A.S.; Pavithra, P.; Ezhilarasi, S. Deep learning based forest fire detection and alert system. In Proceedings of the International Conference on Communication, Computing and Internet of Things (IC3IoT) 2022, Chennai, India, 10–11 March 2022; pp. 1–5. [Google Scholar]
  28. Ban, Y.; Zhang, P.; Nascetti, A.; Bevington, A.R.; Wulder, M.A. Near real-time wildfire progression monitoring with Sentinel-1 SAR time series and deep learning. Sci. Rep. 2020, 10, 1322. [Google Scholar] [CrossRef]
  29. Rahul, M.; Saketh, K.S.; Sanjeet, A.; Naik, N.S. Early detection of forest fire using deep learning. In Proceedings of the IEEE REGION 10 CONFERENCE (TENCON) 2020, Osaka, Japan, 16–19 November 2020; pp. 1136–1140. [Google Scholar]
  30. Jiang, Y.; Wei, R.; Chen, J.; Wang, G. Deep learning of Qinling forest fire anomaly detection based on genetic algorithm optimization. UPB Sci. Bull. Ser.-Electr. Eng. Comput. Sci. 2021, 83, 75–84. [Google Scholar]
  31. Li, M.; Zhang, Y.; Mu, L.; Xin, J.; Yu, Z.; Liu, H.; Xie, G. Early forest fire detection based on deep learning. In Proceedings of the 3rd International Conference on Industrial Artificial Intelligence (IAI) 2021, Shenyang, China, 8–11 November 2021; pp. 1–5. [Google Scholar]
  32. Khan, S.; Khan, A. FFireNet: Deep learning based forest fire classification and detection in smart cities. Symmetry 2022, 14, 2155. [Google Scholar] [CrossRef]
  33. Gayathri, S.; Ajay Karthi, P.V.; Sunil, S. Prediction and detection of forest fires based on deep learning approach. J. Pharm. Negat. Results 2022, 13, 429–433. [Google Scholar] [CrossRef]
  34. Kang, Y.; Jang, E.; Im, J.; Kwon, C. A deep learning model using geostationary satellite data for forest fire detection with reduced detection latency. GISci. Remote Sens. 2022, 59, 2019–2035. [Google Scholar] [CrossRef]
  35. Ghosh, R.; Kumar, A. A hybrid deep learning model combining CNN and RNN to detect forest fires. Multimed. Tools Appl. 2022, 81, 38643–38660. [Google Scholar] [CrossRef]
  36. Tahir, H.U.A.; Waqar, A.; Khalid, S.; Usman, S.M. Wildfire detection in aerial images using deep learning. In Proceedings of the 2nd International Conference on Digital Futures and Transformative Technologies (ICoDT2) 2022, Rawalpindi, Pakistan, 24–26 May 2022. [Google Scholar]
  37. Peng, Y.; Wang, Y. Automatic wildfire monitoring system based on deep learning. Eur. J. Remote Sens. 2022, 55, 551–567. [Google Scholar] [CrossRef]
  38. Mashraqi, A.M.; Asiri, Y.; Algarni, A.D.; Abu-Zinadah, H. Drone imagery forest fire detection and classification using modified deep learning model. Therm. Sci. 2022, 26, 411–423. [Google Scholar] [CrossRef]
  39. Almasoud, A.S. Intelligent deep learning enabled wild forest fire detection system. Comput. Syst. Sci. Eng. 2023, 44, 1485–1498. [Google Scholar] [CrossRef]
  40. Alice, K.; Thillaivanan, A.; Koteswara Rao, G.R.; Rajalakshmi, S.; Singh, K.; Rastogi, R. Automated forest fire detection using atom search optimizer with deep transfer learning model. In Proceedings of the 2nd International Conference on Applied Artificial Intelligence and Computing (ICAAIC), Salem, India, 4–6 May 2023; pp. 222–227. [Google Scholar]
  41. Xie, F.; Huang, Z. Aerial forest fire detection based on transfer learning and improved Faster R-CNN. In Proceedings of the 3rd International Conference on Information Technology, Big Data and Artificial Intelligence (ICIBA), Chongqing, China, 26–28 May 2023; pp. 1132–1136. [Google Scholar]
Figure 1. Transfer learning-based deep network architecture structure that takes the developed multimodal input fusion feature dataset.
Figure 2. (a) Distribution of image data by class in the FireStage dataset and (b) FLAME dataset.
Figure 3. Representative sample images from the (a) FireStage and (b) FLAME datasets, shown as RGB inputs and their synthetically generated modality counterparts (night-vision-like, white-hot, and green-hot).
Figure 4. Pixel intensity histograms of raw RGB images and their synthesized modalities for the (a) FLAME dataset and (b) FireStage dataset.
Figure 5. Confusion matrices obtained for different backbone architectures and modality configurations on the (a) FLAME and (b) FireStage datasets.
Table 1. Comparative literature summary of current deep learning-based studies on forest fire and smoke detection.
Author | Dataset | Objective | Method | Performance
de Almeida, R.V. et al. (2020) [1] | Bee2Fire dataset (associated with the CICLOPE scenario) | Forest fire classification (fire/no-fire) | CNN (ResNet-18) | Specificity ≈ 99%
Benzekri, W. et al. (2020) [2] | Forest fire image sequence | Time-dependent fire detection | Sequence models based on RNN, LSTM, GRU | Accuracy ≈ 99.89%
Mohammad, M.B. et al. (2022) [3] | Forest fire images (edge device-focused) | Real-time fire detection on hardware | CNN (ResNet50, GoogLeNet, CNN-9, MobileNet, InceptionV3, AlexNet) | Accuracy ≈ 99.42% (ResNet50/GoogLeNet-based)
Ryu, J. et al. (2022) [4] | Custom indoor/outdoor fire and smoke dataset | Mixed indoor/outdoor fire/smoke detection | Classic CV + CNN + InceptionV3-based classifier | ≈5–6% accuracy improvement over baseline methods
Shirwaikar, R. et al. (2025) [5] | Indoor semi-occluded fire dataset | Indoor disaster management; semi-occluded fire/spark detection + evacuation route | YOLOv8 (simplified) + instance segmentation-based damage estimation | Precision ≈ 0.73; F1 ≈ 0.81
Chaoxia, C. et al. (2020) [6] | BoWFire, PascalVOC, and Corsican datasets | Outdoor fire images | Color- and global-information-guided Faster R-CNN | Accuracy ≈ 99.50%
Casas, E. et al. (2023) [7] | Foggia fire/smoke CCTV dataset (outdoor environment, CCTV) | Comparison of different YOLO versions for early detection of smoke and wildfire | YOLOv5, YOLOv6, YOLOv7, YOLOv8, YOLO-NAS | F1 ≈ 0.95; recall ≈ 0.98 for the best model(s)
Sathishkumar et al. (2023) [8] | BoWFire + original forest fire image set (RGB) | Classification of forest fire and smoke images; transfer learning without forgetting old knowledge when switching to a new task | VGG16, InceptionV3, Xception + Learning without Forgetting (LwF) | Xception + LwF: ≈96.9% acc. (original), ≈91.4% acc.
Goncalves, A.M. et al. (2024) [9] | CICLOPE alarm images (Portugal; tower cameras) | Camera-based wildfire detection in a large (2.7 M ha) forest area | DenseNet-based feature extractor + detail-selective CNN classifier | Accuracy ≈ 99.7% (on CICLOPE alarm images)
Yan, C. et al. (2025) [10] | VIGP-FS forest smoke dataset | Local + global feature fusion for forest fire smoke detection | MAG-FSNet: CNN backbone + multi-scale attention + global feature fusion | Precision ≈ 88.4%; Recall ≈ 83.4%; mAP@0.5 ≈ 89.3%
Wang, W. et al. (2022) [11] | Forest fire image dataset | Forest fire object detection | YOLO-based CNN (forest fire detection) | Accuracy ≈ 83.9%
Zhu, W. et al. (2025) [12] | D-Fire drone dataset (drone-based) | UAV wildfire detection in complex forest environments | YOLOv8 (multiscale feature learning) | Precision ≈ 93.6%; Recall ≈ 88.5%
Wang, X. et al. (2024) [13] | Various public wildfire smoke image/video datasets | Outdoor wildfire smoke detection; classical CV + DL hybrid | SE-ResNet feature extractor + SVM classifier | Acc ≈ 98.99%; F1 ≈ 99%
Li, L. et al. (2021) [14] | Factory/indoor smoke dataset (ICAIIS 2021, real factory scenes) | Real-time detection of indoor (factory) smoke | Faster R-CNN-based smoke detector | Accuracy ≈ 99.0%
Bahhar, C. et al. (2023) [15] | UAV and open datasets (wildfire, smoke; multi-source, RGB) | Staged YOLO architecture for UAV-based forest fire/smoke detection | Staged YOLO (different YOLO versions) + Ensemble CNN | Acc ≈ 99%; mAP ≈ 0.85 for smoke class
Li, Y. et al. (2022) [16] | Wildfire image dataset (~35 k images, tower/camera-focused) | Forest fire monitoring (smoke/fire) from surveillance towers | ResNet, EfficientNet-based end-to-end network + Grad-CAM visualization | AUROC ≈ 0.949
Wang, C. et al. (2025) [17] | IR + visible (commercial area/business park) smoke dataset | Smoke detection in commercial/urban areas using IR + RGB fusion | Transformer-based fusion (IR + visible feature fusion) | Accuracy ≈ 90.9%; precision ≈ 98.4%; recall ≈ 92.4%; FP/FN < 5%
Wang, Y. et al. (2025) [18] | USTC_SmokeRS, E_SmokeRS, Aerial RS smoke datasets | Smoke detection in remote sensing images | ConvNeXt backbone + AFEM (attention feature enhancement module) + BFFM | Accuracy ≈ 98.9%; false alarm rate (FAR) ≈ 3.3%
Alkhammash, E.H. (2025) [19] | Smoke + D-Fire-like open-source datasets | Smoke/fire detection by comparing different YOLO generation models | Comparative analysis of YOLOv9, YOLOv10, and YOLOv11 | Precision ≈ 0.845; Recall ≈ 0.801
Xue, Z. et al. (2025) [20] | Baidu Paddle wildfire + additional indoor/outdoor datasets | Indoor/outdoor multi-scenario fire/smoke detection | YOLOv11-DH3 (enhanced YOLOv11 derivative) | Precision ≈ 91.6%; Recall ≈ 90%
He, L. et al. (2025) [21] | WD + FFS forest fire smoke datasets | Analysis of different object detectors for forest fire smoke detection | YOLOv11x + optimized loss function (loss redesign) | Precision ≈ 0.949; Recall ≈ 0.850; high mAP@0.5
Niu, K. et al. (2025) [22] | UAV fusion dataset (2752 RGB–IR image pairs) | Early forest fire detection using visible + IR fusion with UAV | YOLOv5s-based lightweight detector + image fusion | ≈10% increase in precision for small fire objects
Jin, P. et al. (2025) [23] | Multimodal dataset (3352 paired samples: imagery + environmental) | Smoke detection + early warning + risk assessment | YOLOv8n + MSDBlock (multiscale dual-branch block) | Accuracy ≈ 93.1%; ≈18.8% higher than traditional baselines
Shang, L. et al. (2025) [24] | Custom dataset (20,044 images; indoor, industrial, forest) | Multi-scenario (industrial + forest + indoor/outdoor) flame and spark detection | YOLO-DKM (improved YOLOv8; deformable conv + key modules) | Precision ≈ 82.1%; Recall ≈ 71.8%
Arteaga, B. et al. (2020) [25] | Forest fire images | Fire presence/absence classification | CNN (ResNet + VGG-based deep classifiers) | Accuracy ≈ 99.5%
Mohammed, R.K. (2022) [26] | Forest fire image dataset | Fire/smoke image classification | Inception-ResNet-based CNN | Accuracy ≈ 99.09%
Mohnish, S. et al. (2022) [27] | Forest fire image dataset | Fire image detection (bounding box level) | CNN-based detection (custom, single-stage) | Accuracy ≈ 92.20%
Ban, Y. et al. (2020) [28] | Sentinel-1 SAR time series (district/region-based) | Near real-time monitoring of fire progression | CNN-based time series model (on SAR images) | Accuracy ≈ 83.53%
Rahul, M. et al. (2020) [29] | Forest fire images | Fire/no-fire classification | CNN (ResNet50, VGG16, DenseNet121) | Accuracy ≈ 92.27%
Jiang, Y. et al. (2021) [30] | Qinling forest fire anomaly dataset | Forest anomaly/fire detection | CNN + BP NN, GA, SVM, GA-BP optimization | Accuracy ≈ 95%
Li, M. et al. (2021) [31] | Forest fire image dataset | Early forest fire detection (object detection) | h-EfficientDet (EfficientDet + h-EfficientDet architecture) | Accuracy ≈ 98.35%
Khan, S. et al. (2022) [32] | Forest fire detection dataset (Fire/No-Fire) | Fire/no-fire classification (smart city scenario) | FFireNet (MobileNetV2-based CNN) + MobileNetV2 comparison | Accuracy ≈ 98.42% (FFireNet + MobileNetV2)
Gayathri, S. et al. (2022) [33] | Forest fire image dataset | Forest fire prediction and detection | CNN-based classifier | Accuracy ≈ 96%
Kang, Y. et al. (2022) [34] | Geostationary satellite data (GEO; multi-temporal) | Early forest fire detection using GEO satellite data | CNN + Random Forest hybrid approach | Accuracy ≈ 98%
Ghosh, R.; Kumar, A. (2022) [35] | Forest fire image dataset | Spatial + temporal pattern learning for fire detection | CNN + RNN hybrid (e.g., LSTM layers) | Accuracy ≈ 99.62%
Tahir, H.U.A. et al. (2022) [36] | UAV wildfire imagery | Wildfire detection in UAV images | YOLOv5-based detection | F1 ≈ 94.44%
Peng, Y.; Wang, Y. (2022) [37] | Wildfire monitoring image dataset | Deep learning-based automatic wildfire monitoring system | CNN (SqueezeNet1.1, AlexNet, MobileNet, ResNet18, VGG16 comparisons) | Accuracy ≈ 99.28%
Mashraqi, A.M. et al. (2022) [38] | Drone imagery forest fire dataset | Fire classification from drone images | DIFFDC-MDL hybrid (LSTM-RNN + MobileNetV2) | Accuracy ≈ 99.38%
Almasoud, A.S. (2023) [39] | Forest fire image dataset | Smart, DL-based wild forest fire detection | IWFFDA-DL, ACNN-BLSTM + YOLOv3 combination | Accuracy ≈ 99.56%
Alice, K. et al. (2023) [40] | Forest fire image dataset | Automatic forest fire detection | Deep transfer learning (QRNN, ResNet50 + Atom Search Optimizer) | Accuracy ≈ 97.33%
Xie, F.; Huang, Z. (2023) [41] | UAV wildfire dataset | Fire/smoke detection in UAV images | Transfer learning + Faster R-CNN (ResNet50 backbone + fusion and attention) | Accuracy ≈ 93.7%
Table 2. Data augmentation strategy applied during training.

Augmentation Type | Applied Set | Probability/Range | Purpose
Random Resized Crop | Training only | Scale: 0.75–1.0 | Scale and viewpoint variation
Horizontal Flip | Training only | p = 0.5 | Viewpoint invariance
Rotation | Training only | ±10° | Orientation robustness
Gaussian Blur | Training only | p = 0.25 | Sensor noise simulation
Color Jitter | Training only | p = 0.5 | Illumination variation
Normalization | Training and Validation | ImageNet mean/std | Stable optimization
Table 3. Dataset size and effective training samples.

Dataset | Split | Real Images | Epochs | Effective Training Samples
FireStage | Training | 632 | 10 | ~6320
FireStage | Validation | 159 | – | 159
FLAME | Training | 31,500 | 10 | ~315,000
FLAME | Validation | 7875 | – | 7875
Table 4. Input tensor size per modality after preprocessing (both datasets).

Dataset | RGB | Night Vision | White | Green
FireStage | (3, 224, 224) | (3, 224, 224) | (3, 224, 224) | (3, 224, 224)
FLAME | (3, 224, 224) | (3, 224, 224) | (3, 224, 224) | (3, 224, 224)
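All four modality tensors in Table 4 share the RGB shape because each synthetic representation is derived from the RGB image itself. The exact pseudo-thermal mappings are not reproduced in this section, so the functions below (`white_hot`, `green_hot`, `night_vision_like`, the luminance weights, gamma, and gain) are illustrative assumptions showing how such modalities can be synthesized from RGB alone without extra sensors.

```python
import numpy as np

def to_gray(rgb):
    """Luminance from an RGB array in [0, 1], shape (H, W, 3) -> (H, W)."""
    return rgb @ np.array([0.299, 0.587, 0.114])

def white_hot(rgb):
    """White-hot style: bright pixels read as 'hot'; grayscale in all channels."""
    g = to_gray(rgb)
    return np.stack([g, g, g], axis=-1)

def green_hot(rgb):
    """Green-hot style: intensity mapped to the green channel only."""
    g = to_gray(rgb)
    z = np.zeros_like(g)
    return np.stack([z, g, z], axis=-1)

def night_vision_like(rgb, gain=1.5):
    """Night-vision-like: gamma-boosted luminance with a green tint.
    The gamma (0.7) and gain are hypothetical parameter choices."""
    g = np.clip(to_gray(rgb) ** 0.7 * gain, 0.0, 1.0)
    return np.stack([0.2 * g, g, 0.2 * g], axis=-1)

if __name__ == "__main__":
    rgb = np.random.default_rng(0).random((224, 224, 3))
    for f in (white_hot, green_hot, night_vision_like):
        assert f(rgb).shape == (224, 224, 3)
```

Each output keeps three channels so that the same ImageNet-pretrained backbones can process every modality unchanged.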
Table 5. Layer structure for the classification stage.

Stage | Layer | Input Size | Output Size | Description
1 | Fully connected | 4d (e.g., 2048) | 256 | Dimensionality reduction
2 | Activation | 256 | 256 | Nonlinear transformation (ReLU function)
3 | Dropout | 256 | 256 | Probability = 0.3
4a | Fully connected | 256 | 2 | Binary classification (No Fire, Fire)
4b | Fully connected | 256 | 3 | Three-class classification (No Fire, Start Fire, Fire)
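The head in Table 5 maps the concatenated feature vector (4d when four d-dimensional modality vectors are fused) down to the class logits. A minimal PyTorch sketch, assuming the paper's layer order and dropout rate:

```python
import torch
from torch import nn

def make_head(feat_dim: int, num_classes: int) -> nn.Sequential:
    """Classification stage of Table 5: FC -> ReLU -> Dropout(0.3) -> FC.
    feat_dim is the fused feature size, e.g. 4 * 512 = 2048 when four
    512-d modality vectors are concatenated."""
    return nn.Sequential(
        nn.Linear(feat_dim, 256),      # stage 1: dimensionality reduction
        nn.ReLU(),                     # stage 2: nonlinearity
        nn.Dropout(p=0.3),             # stage 3: regularization
        nn.Linear(256, num_classes),   # stage 4a/4b: 2 or 3 logits
    )

if __name__ == "__main__":
    head = make_head(2048, 3)                 # FireStage: three classes
    logits = head(torch.randn(8, 2048))
    print(tuple(logits.shape))                # (8, 3)
```

Stage 4a (binary, FLAME) and 4b (three-class, FireStage) differ only in the final layer's output size, so the same constructor serves both datasets.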
Table 6. Performance metrics.

Metric | Definition | Formula
Accuracy | The ratio of examples correctly classified by the model to the total number of examples. | Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision | The proportion of correct predictions among the examples predicted as positive. | Precision = TP / (TP + FP)
Recall | The proportion of true positive examples that are correctly predicted. | Recall = TP / (TP + FN)
F1 Score | The harmonic mean that balances recall and precision; well suited to class-imbalanced data. | F1 = 2 · (Precision · Recall) / (Precision + Recall)
AUC | The area under the Receiver Operating Characteristic (ROC) curve, representing the model's ability to discriminate between classes across all classification thresholds. Higher values indicate better separability. | AUC = ∫₀¹ TPR(t) d(FPR(t))
ROC Curve | A graphical representation of the trade-off between the True Positive Rate (TPR) and the False Positive Rate (FPR) across decision thresholds. | TPR = TP / (TP + FN), FPR = FP / (FP + TN)
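The metrics in Table 6 correspond directly to standard scikit-learn calls; the tiny label vectors below are illustrative only. For the three-class FireStage setting, the paper reports macro averages, which corresponds to passing `average="macro"` to the precision, recall, and F1 calls.

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

y_true  = [0, 0, 1, 1]           # illustrative ground truth
y_pred  = [0, 1, 1, 1]           # hard predictions (TP=2, TN=1, FP=1, FN=0)
y_score = [0.1, 0.6, 0.8, 0.9]   # predicted probability of the positive class

acc  = accuracy_score(y_true, y_pred)    # (TP+TN)/(TP+TN+FP+FN) = 3/4
prec = precision_score(y_true, y_pred)   # TP/(TP+FP) = 2/3
rec  = recall_score(y_true, y_pred)      # TP/(TP+FN) = 2/2 = 1.0
f1   = f1_score(y_true, y_pred)          # 2·P·R/(P+R) = 0.8
auc  = roc_auc_score(y_true, y_score)    # every positive outscores every negative -> 1.0

# Multi-class variant (e.g., FireStage): average="macro" for P, R, F1,
# and multi_class="ovr" with per-class probabilities for roc_auc_score.
```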
Table 7. Distribution of each dataset across the training and test splits.

Dataset | Number of Images | Training Set | Test Set
FireStage | 791 | 632 | 159
FLAME | 39,375 | 31,500 | 7875
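The stratified 80–20% split underlying Table 7 can be reproduced with scikit-learn; the toy label vector and the `random_state` value are illustrative, not taken from the paper.

```python
from sklearn.model_selection import train_test_split

# Toy labels: 60 "no fire" and 40 "fire" samples (illustrative counts).
labels = [0] * 60 + [1] * 40
indices = list(range(len(labels)))

train_idx, test_idx, y_train, y_test = train_test_split(
    indices, labels,
    test_size=0.20,     # 80/20 split as in Table 7
    stratify=labels,    # preserve class proportions in both splits
    random_state=42,    # fixed seed for reproducibility (assumed value)
)

print(len(train_idx), len(test_idx))  # 80 20
print(sum(y_test))                    # 8 -> the 40% class ratio is preserved
```

Stratification matters most for the small FireStage dataset (791 images across three classes), where a plain random split could leave a class underrepresented in the 159-image test set.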
Table 8. Validation performance of the models on the FLAME and FireStage datasets.

Dataset | Backbone | Modality | Acc | Prec | Recall | F1 | AUC | Params | Train Time (s)
FLAME | DenseNet121 | RGB | 0.996571 | 0.997999 | 0.996603 | 0.997300 | 0.999723 | 7.22 M | 1668
FLAME | DenseNet121 | Green | 0.987429 | 0.991976 | 0.988209 | 0.990089 | 0.998736 | 7.22 M | 1271
FLAME | DenseNet121 | Night | 0.983365 | 0.989355 | 0.984412 | 0.986878 | 0.998800 | 7.22 M | 1203
FLAME | DenseNet121 | White | 0.990603 | 0.991820 | 0.993405 | 0.992612 | 0.999256 | 7.22 M | 1223
FLAME | DenseNet121 | RGB + Green | 0.860190 | 0.998977 | 0.780775 | 0.876500 | 0.997340 | 7.48 M | 1804
FLAME | DenseNet121 | RGB + Night | 0.833143 | 0.998918 | 0.738209 | 0.849000 | 0.996452 | 7.48 M | 1806
FLAME | DenseNet121 | RGB + White | 0.995302 | 0.995214 | 0.997402 | 0.996307 | 0.999719 | 7.48 M | 1818
FLAME | DenseNet121 | RGB + N + W + G | 0.943238 | 0.996947 | 0.913469 | 0.953384 | 0.997075 | 8.00 M | 6804
FLAME | EfficientNet-B0 | RGB | 0.995429 | 0.997796 | 0.995004 | 0.996398 | 0.999822 | 4.34 M | 1579
FLAME | EfficientNet-B0 | Green | 0.986032 | 0.994543 | 0.983413 | 0.988947 | 0.998722 | 4.34 M | 1157
FLAME | EfficientNet-B0 | Night | 0.982603 | 0.989342 | 0.983213 | 0.986268 | 0.997983 | 4.34 M | 1227
FLAME | EfficientNet-B0 | White | 0.989333 | 0.992788 | 0.990408 | 0.991597 | 0.998870 | 4.34 M | 1233
FLAME | EfficientNet-B0 | RGB + Green | 0.810794 | 0.998864 | 0.703038 | 0.825240 | 0.995737 | 4.66 M | 1718
FLAME | EfficientNet-B0 | RGB + Night | 0.841651 | 0.996301 | 0.753597 | 0.858118 | 0.993732 | 4.66 M | 1744
FLAME | EfficientNet-B0 | RGB + White | 0.994032 | 0.995007 | 0.995604 | 0.995305 | 0.999671 | 4.66 M | 1723
FLAME | EfficientNet-B0 | RGB + N + W + G | 0.846730 | 0.996600 | 0.761391 | 0.863260 | 0.994949 | 5.32 M | 6519
FLAME | ResNet18 | RGB | 0.991746 | 0.997382 | 0.989608 | 0.993480 | 0.999608 | 11.31 M | 1548
FLAME | ResNet18 | Green | 0.973206 | 0.992398 | 0.965228 | 0.978624 | 0.997508 | 11.31 M | 1157
FLAME | ResNet18 | Night | 0.974222 | 0.974126 | 0.985612 | 0.979835 | 0.996641 | 11.31 M | 1300
FLAME | ResNet18 | White | 0.989206 | 0.987512 | 0.995604 | 0.991541 | 0.998954 | 11.31 M | 1186
FLAME | ResNet18 | RGB + Green | 0.962413 | 0.944486 | 0.999600 | 0.971262 | 0.995450 | 11.44 M | 1723
FLAME | ResNet18 | RGB + Night | 0.975873 | 0.977959 | 0.984213 | 0.981076 | 0.997171 | 11.44 M | 1896
FLAME | ResNet18 | RGB + White | 0.987556 | 0.981169 | 0.999600 | 0.990299 | 0.999438 | 11.44 M | 1707
FireStage | DenseNet121 | RGB | 0.937107 | 0.923737 | 0.931411 | 0.927202 | 0.991925 | 7.22 M | 176
FireStage | DenseNet121 | Green | 0.805031 | 0.811814 | 0.777452 | 0.789884 | 0.930476 | 7.22 M | 99
FireStage | DenseNet121 | Night | 0.767296 | 0.769202 | 0.735745 | 0.745529 | 0.922191 | 7.22 M | 104
FireStage | DenseNet121 | White | 0.798742 | 0.816472 | 0.755947 | 0.772497 | 0.927145 | 7.22 M | 100
FireStage | DenseNet121 | RGB + Green | 0.748428 | 0.787677 | 0.788856 | 0.737219 | 0.968725 | 7.48 M | 129
FireStage | DenseNet121 | RGB + Night | 0.842767 | 0.831060 | 0.854350 | 0.827998 | 0.974003 | 7.48 M | 132
FireStage | DenseNet121 | RGB + White | 0.930818 | 0.925240 | 0.920658 | 0.922773 | 0.989529 | 7.48 M | 129
FireStage | DenseNet121 | RGB + N + W + G | 0.893082 | 0.874812 | 0.900782 | 0.882311 | 0.979556 | 8.00 M | 441
FireStage | EfficientNet-B0 | RGB | 0.905660 | 0.890713 | 0.889052 | 0.889729 | 0.966492 | 4.34 M | 170
FireStage | EfficientNet-B0 | Green | 0.773585 | 0.773148 | 0.736722 | 0.744017 | 0.913076 | 4.34 M | 100
FireStage | EfficientNet-B0 | Night | 0.773585 | 0.763612 | 0.764255 | 0.755687 | 0.922556 | 4.34 M | 102
FireStage | EfficientNet-B0 | White | 0.798742 | 0.791329 | 0.768003 | 0.774298 | 0.916783 | 4.34 M | 101
FireStage | EfficientNet-B0 | RGB + Green | 0.729560 | 0.774043 | 0.772890 | 0.724710 | 0.942445 | 4.66 M | 128
FireStage | EfficientNet-B0 | RGB + Night | 0.867925 | 0.848146 | 0.868524 | 0.854238 | 0.960164 | 4.66 M | 131
FireStage | EfficientNet-B0 | RGB + White | 0.930818 | 0.915282 | 0.915282 | 0.915282 | 0.961032 | 4.66 M | 125
FireStage | EfficientNet-B0 | RGB + N + W + G | 0.861635 | 0.841486 | 0.864451 | 0.844856 | 0.960082 | 5.32 M | 416
FireStage | ResNet18 | RGB | 0.924528 | 0.934972 | 0.903877 | 0.916292 | 0.984546 | 11.31 M | 169
FireStage | ResNet18 | Green | 0.729560 | 0.743901 | 0.710166 | 0.720018 | 0.871148 | 11.31 M | 97
FireStage | ResNet18 | Night | 0.723270 | 0.724347 | 0.706419 | 0.702695 | 0.874561 | 11.31 M | 104
FireStage | ResNet18 | White | 0.742138 | 0.720044 | 0.738514 | 0.715710 | 0.900683 | 11.31 M | 97
FireStage | ResNet18 | RGB + Green | 0.874214 | 0.851534 | 0.861193 | 0.854235 | 0.963100 | 11.44 M | 124
FireStage | ResNet18 | RGB + Night | 0.823899 | 0.804250 | 0.821114 | 0.806133 | 0.943965 | 11.44 M | 132
FireStage | ResNet18 | RGB + White | 0.905660 | 0.911584 | 0.867221 | 0.880889 | 0.978778 | 11.44 M | 122
FireStage | ResNet18 | RGB + N + W + G | 0.880503 | 0.863298 | 0.890681 | 0.867819 | 0.962310 | 11.70 M | 416
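The multimodal rows in Table 8 use feature-level concatenation: one backbone per modality, with the resulting feature vectors joined before the classification head. The sketch below illustrates that wiring; `TinyBackbone` is a hypothetical stand-in for the ImageNet-pretrained networks (ResNet18, EfficientNet-B0, DenseNet121) so the example runs self-contained without downloading weights.

```python
import torch
from torch import nn

class TinyBackbone(nn.Module):
    """Hypothetical stand-in for a pretrained CNN; emits one feature vector."""
    def __init__(self, feat_dim: int = 64):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv2d(16, feat_dim, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),   # global average pooling
            nn.Flatten(),              # (B, feat_dim)
        )

    def forward(self, x):
        return self.features(x)

class MultimodalFireNet(nn.Module):
    """One backbone per modality; features concatenated before the head."""
    def __init__(self, num_modalities: int, num_classes: int, feat_dim: int = 64):
        super().__init__()
        self.backbones = nn.ModuleList(
            TinyBackbone(feat_dim) for _ in range(num_modalities))
        self.head = nn.Sequential(               # Table 5 head structure
            nn.Linear(num_modalities * feat_dim, 256),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(256, num_classes),
        )

    def forward(self, xs):
        # xs: list of (B, 3, 224, 224) tensors, one per modality
        feats = torch.cat([bb(x) for bb, x in zip(self.backbones, xs)], dim=1)
        return self.head(feats)

if __name__ == "__main__":
    model = MultimodalFireNet(num_modalities=4, num_classes=2)  # RGB + N + W + G
    xs = [torch.randn(2, 3, 224, 224) for _ in range(4)]
    print(tuple(model(xs).shape))  # (2, 2)
```

The parameter counts in Table 8 grow with each added modality for the same reason: every extra branch contributes its own feature extractor plus a wider first fully connected layer.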
Table 9. Comparative performance analysis of the proposed multimodal DenseNet121-based framework with state-of-the-art fire detection studies on the FLAME and FireStage datasets.

Study | Dataset | Method | Modality | Classes | Accuracy (%)
Benzekri, W. et al. (2020) [2] | Fire Detection (image sequence) | LSTM/GRU | RGB (Sequence) | 2 | 99.89
Goncalves, A.M. et al. (2024) [9] | FLAME Dataset | DenseNet + CNN | RGB | 2 | 99.70
Wang, X. et al. (2024) [13] | Fire Detection (public wildfire smoke datasets) | SE-ResNet + SVM | RGB | 2 | 98.99
Arteaga, B. et al. (2020) [25] | Fire Detection (image dataset) | CNN (ResNet + VGG) | RGB | 2 | 99.50
Mohammed, R.K. (2022) [26] | Fire Detection (image dataset) | Inception-ResNet | RGB | 2 | 99.09
Mohnish, S. et al. (2022) [27] | Fire Detection (image dataset) | CNN Detection | RGB | 2 | 92.20
This Study (Taşar et al.) | FireStage Dataset | DenseNet121 | RGB | 3 | 93.71
This Study (Taşar et al.) | FireStage Dataset | DenseNet121 | RGB + White | 3 | 93.08
This Study (Taşar et al.) | FireStage Dataset | DenseNet121 | RGB + Night + White + Green | 3 | 89.31
This Study (Taşar et al.) | FLAME Dataset | DenseNet121 | RGB | 2 | 99.66
This Study (Taşar et al.) | FLAME Dataset | DenseNet121 | RGB + White | 2 | 99.53
This Study (Taşar et al.) | FLAME Dataset | DenseNet121 | RGB + Night + White + Green | 2 | 94.32

Taşar, B.; Tatar, A.B.; Tanyildizi, A.K.; Yakut, O. Multimodal Wildfire Classification Using Synthetic Night-Vision-like and Thermal-Inspired Image Representations. Fire 2026, 9, 109. https://doi.org/10.3390/fire9030109
