Article

Multimodal Wildfire Classification Using Synthetic Night-Vision-like and Thermal-Inspired Image Representations

Mechatronic Engineering Department, Faculty of Engineering, Fırat University, Elazig 23119, Turkey
* Author to whom correspondence should be addressed.
Fire 2026, 9(3), 109; https://doi.org/10.3390/fire9030109
Submission received: 8 December 2025 / Revised: 1 February 2026 / Accepted: 25 February 2026 / Published: 2 March 2026

Abstract

In this study, a deep learning-based multimodal framework for forest fire detection is presented that synthetically generates night-vision-like, white-hot, and green-hot pseudo-thermal representations from RGB images. The synthetic modalities are derived directly from RGB data and integrated into a hardware-independent multimodal learning pipeline to increase visual diversity without relying on additional sensing hardware. Each modality is processed using an ImageNet-pretrained convolutional backbone, and modality-specific feature vectors are combined through feature-level concatenation before classification. The proposed framework was evaluated using multiple backbone architectures, including ResNet18, EfficientNet-B0, and DenseNet121, which were assessed independently under a unified experimental protocol. Experiments were conducted on two datasets with substantially different scales and characteristics: the FLAME dataset (39,375 images, binary classification) and the FireStage dataset (791 images, three-class classification). For both datasets, stratified 80–20% training–validation splits were employed, and online stochastic data augmentation was applied exclusively to the training sets. On the FLAME dataset, the proposed framework achieved consistently high performance across different backbone and modality configurations. The best-performing models reached an accuracy of 99.66%, precision of 99.80%, recall of 99.66%, F1-score of 99.73%, and a ROC AUC value of 0.9998. On the more challenging FireStage dataset, the framework demonstrated stable performance despite limited data availability, achieving an accuracy of 93.71% with RGB-only configurations and up to 93.08% with selected multimodal combinations, while macro-averaged F1-scores exceeded 0.92 and ROC AUC values reached up to 0.9919.
Per-class analysis further indicates that early-stage fire (Start Fire) patterns can be discriminated, achieving ROC AUC values above 0.96, depending on the backbone and modality combination. Overall, the results suggest that synthetic-modality-based multimodal learning can provide competitive performance for both large-scale and data-limited fire detection scenarios, offering a flexible and hardware-independent alternative for forest fire monitoring applications.

1. Introduction

1.1. Background

Forest fires not only disrupt ecosystem balance but also cause multidimensional environmental consequences such as reduced biodiversity, loss of natural habitats, degradation of soil mineral structure, and disruption of the water cycle. Post-fire soil erosion, sudden increases in carbon emissions, and prolonged ecosystem recovery periods contribute to long-lasting effects that may persist for years. Furthermore, the destruction of forested areas accelerates climate change by reducing natural carbon sinks and causes significant socio-economic losses associated with agriculture, tourism, and energy infrastructure [1,2,3].
More frequent heat waves, prolonged droughts, low humidity levels, and stronger winds associated with global climate change significantly increase both the likelihood of forest fires and their rate of spread. Megafires observed in recent years in regions such as the Mediterranean basin, the west coast of the United States, and Australia have clearly demonstrated that traditional observation and intervention methods are no longer sufficient [1,4,5]. The increase in the intensity and spread of fires has created a need for a new approach in which even minutes are critical during detection and intervention [6,7].
However, because existing camera-based detection systems are mostly designed to capture flame components, the low-density smoke, gray-toned gas clusters, and non-thermal early-stage heat signatures that characterize the initial stage of a fire are often not detected correctly [7,8,9]. Similarly, classical computer vision methods are easily affected by environmental factors such as lighting changes, shadows, haze, and sun glare [10,11]. Furthermore, although thermal cameras offer advantages for early detection, their high cost, energy consumption, and field installation requirements limit their applicability in large-scale areas [4,12,13].
For these reasons, it is critical to develop low-cost, hardware-independent, environmentally resilient, and scalable automated systems that can detect forest fires before flames form, i.e., at the smoke or temperature anomaly stage [5,14,15]. This study proposes a new multimodal approach that addresses this need by generating synthetic night-vision- and thermal-like modalities from existing RGB data, giving early-stage indicators more prominence.

1.2. Literature Review

Table 1 provides a comprehensive summary of recent deep learning-based studies on forest fire and smoke detection, clearly showing the dominant approaches, datasets, methods, and performance metrics in the literature.
It should be emphasized that the performance metrics summarized in Table 1 are derived from studies that address fundamentally different learning paradigms, including image-level classification and object-level detection tasks. In classification-oriented approaches, the term “Accuracy” conventionally denotes image-level classification accuracy, defined as the proportion of correctly classified images with respect to the total number of evaluated samples. Conversely, object detection studies—particularly those employing YOLO-based architectures—primarily assess performance using mean Average Precision (mAP), most commonly mAP@0.5, which quantifies localization and classification performance at the object level. In several studies, this metric is loosely referred to as “accuracy,” despite representing a conceptually different evaluation criterion. As a consequence, the high performance values reported for detection-based methods in Table 1 predominantly reflect successful object-level localization and recognition of visually salient fire or smoke regions, rather than holistic image-level classification performance. Accordingly, these metrics are not directly comparable to the classification-based results reported in this study and should be interpreted within the context of their respective task formulations.
A careful examination of the table reveals that the vast majority of existing studies focus on binary classification problems such as the presence or absence of fire or smoke. A significant portion of these studies report high accuracy rates (95–99% and above) based solely on single-modality RGB images. However, these high scores were mostly obtained under ideal conditions in which flames or dense smoke were clearly visible.
A significant portion of the studies in the literature rely on YOLO-based object detection architectures (YOLOv5–YOLOv11). These methods are quite effective in real-time detection of visually distinct flame and dense smoke areas in UAV and fixed camera scenarios. However, since these architectures are inherently focused on object-level detection, they may be limited in detecting low-contrast, weak smoke and heat traces that appear in the early stages of a fire. It is observed that in almost all of the studies listed in Table 1, the “Start Fire” class is not explicitly modeled as a separate class.
Another important limitation highlighted in Table 1 is that most multi-modal systems require actual thermal or infrared (IR) sensors. Although RGB + IR-based multimodal approaches offer advantages in night-vision and low-visibility conditions, these systems are impractical for large-scale applications due to their high hardware costs, energy consumption, field installation, and calibration requirements. Furthermore, the susceptibility of thermal images to environmental conditions (humidity, atmospheric noise, distance, etc.) can also limit model generalizability.
When examining the architectures listed in Table 1, it is evident that many advanced deep learning structures are used, such as CNN, CNN–RNN hybrids, Transformer-based networks, and ensemble approaches. Nevertheless, fire detection is mostly treated as a static image recognition problem, and the temporal development of a fire and early ignition dynamics are not explicitly modeled. Studies incorporating temporal information (LSTM, GRU, etc.) are limited in number and have generally evaluated binary classification scenarios only.
In terms of datasets, Table 1 shows that some large-scale datasets (e.g., FLAME, D-Fire, USTC_SmokeRS) are frequently used. However, the majority of these datasets only contain two-class labeling (fire/no-fire or smoke/no-smoke) and do not provide clear and balanced labeling for critical stages such as the early stages of a fire (start fire). This situation means that models are sensitive to fully developed fires but relatively weak against early-stage indicators.

1.3. Contribution and Novelty of Present Study

This study addresses several practical limitations reported in the existing fire detection literature by exploring the use of synthetic visual modalities derived from RGB images. Specifically, night-vision-like, white-hot, and green-hot pseudo-thermal representations were generated to create a multimodal input structure without relying on physical thermal or infrared sensors. This design choice aims to reduce hardware dependency and associated costs while enabling a controlled exploration of modality diversity within a unified framework.
While many previous studies have primarily focused on binary fire detection, the proposed framework was evaluated in both binary and multi-class scenarios. In addition to the FLAME dataset, which covers only the presence and absence of fire, the FireStage dataset was used to examine the model’s performance in a three-class setup (No Fire, Start Fire, and Fire). This approach enables an analysis of early-stage fire detection, a topic that is less explored in the literature due to data scarcity and annotation challenges.
As summarized in Table 1, the existing methods generally perform well in detecting fully developed fires but often rely on hardware-specific inputs and binary classification structures. In contrast, the approach proposed in this study investigates the feasibility of synthetic-modality-based multimodal learning under both large-scale and limited-data conditions, without the need for additional sensing equipment.
The main contributions of this study can be summarized as follows:
Unlike a large portion of existing works that formulate wildfire monitoring as an object detection or instance segmentation problem, the proposed approach is strictly designed as an image-level classification framework.
A hardware-independent multimodal dataset was constructed by generating synthetic night-vision-like, white-hot, and green-hot pseudo-thermal representations directly from RGB images.
A multimodal feature extraction and fusion framework was systematically evaluated using ImageNet-pretrained backbones (ResNet18, EfficientNet-B0, and DenseNet121).
The proposed framework was assessed in both binary (FLAME) and three-class (FireStage) classification settings, including early-stage fire scenarios.
The generalization ability of the approach was examined across datasets with substantially different sizes and characteristics.
Quantitative results revealed competitive performance across different backbone architectures and modality combinations, with detailed comparisons in terms of accuracy, precision, recall, F1-score, ROC/AUC, parameter count, and training time.
Overall, rather than presenting a definitive solution, this study offers an empirical investigation into the potential and limitations of synthetic-modality-based multimodal learning for fire detection, particularly in scenarios where access to specialized sensing hardware or large-scale labeled datasets is limited.

2. Materials and Methods

In this study, a multimodal dataset was developed by combining original RGB images with synthetically generated pseudo-thermal representations—specifically night-vision-like, white-hot, and green-hot variants—to enhance fire detection and classification performance. These multimodal inputs were processed using convolutional neural network backbones pre-trained on ImageNet, including ResNet-18, EfficientNet-B0, and DenseNet-121, depending on the experimental setup. An overview of the proposed methodological framework is provided in Figure 1.
The primary objective of this approach is to improve the model’s generalization ability by enriching visual input diversity without the need for additional sensing hardware. The framework is structured in three main stages:
(A)
Preprocessing and synthesis of multimodal visual inputs from RGB images;
(B)
Modality-specific feature extraction followed by feature-level fusion and classification through a lightweight fully connected layer;
(C)
Quantitative performance evaluation using standard classification metrics.
During the feature extraction phase, each modality is passed through a shared, pre-trained backbone network with frozen weights. The resulting feature vectors, specific to each modality, are then concatenated to create a unified multimodal representation. This fused representation enables the model to leverage complementary information across modalities, such as color composition, intensity distribution, texture, and pseudo-thermal features. To enhance the model’s robustness, particularly in challenging forest fire scenarios and early fire stages, the dataset was expanded in terms of both modality diversity and feature richness.

2.1. Dataset and Features

The methodological approach developed within the scope of this study was tested on two different datasets (FLAME and FireStage). The FLAME dataset (access link: https://ieee-dataport.org/open-access/flame-2-fire-detection-and-modeling-aerial-multi-spectral-image-dataset) (accessed on 1 November 2025) was selected because it has the highest number of data points for forest fire detection. However, a limitation of the FLAME dataset is that it only contains data labeled with two classes (fire present/fire absent). To address this issue, the FireStage dataset (access link: https://figshare.com/articles/dataset/Forest_Fire_Detection/14904522?file=28702587) (accessed on 1 November 2025) was included in this study. This dataset contains data labeled with three classes (“no fire,” “fire,” and “fire start”), and the data labeled as “fire start” includes not only images of fire but also images of light smoke. Figure 2 shows the numerical distribution of images in the two datasets according to classes.

2.2. Preprocessing and Augmentation

All input images from both the FLAME and FireStage datasets were processed using a unified preprocessing pipeline to ensure consistency across all modalities and backbone networks. The original RGB images were resized to a fixed resolution of 224 × 224 pixels and normalized using the mean and standard deviation values from the ImageNet dataset. This normalization was applied consistently to both training and validation sets to maintain compatibility with the ImageNet-pretrained backbone architectures.
Beyond the RGB inputs, three synthetic visual modalities—night-vision-like, white-hot, and green-hot representations—were generated directly from the RGB images. This was achieved through grayscale conversion, followed by histogram equalization and the application of predefined colormap mappings. These synthetic modalities were treated as independent input channels and subjected to the same preprocessing steps as the RGB images, ensuring a uniform input distribution across all modalities.
To mitigate the limitations posed by the relatively small size of the FireStage dataset and to enhance model generalization, an online stochastic data augmentation strategy was employed during training. At each epoch, the input images were randomly transformed using a mix of geometric and photometric augmentations. These included random resized cropping (with a scale range of 0.75 to 1.0), horizontal flipping (probability = 0.5), small-angle rotations within ±10 degrees, Gaussian blur (probability = 0.25), and color jitter (probability = 0.5). The purpose of these augmentations was to improve the model’s robustness to variations in viewpoint, lighting, and image quality, while preserving the semantic integrity of the scenes. To ensure a fair evaluation, no augmentation was applied to validation images, thus avoiding potential information leakage.
The data augmentation process was implemented online, meaning that augmented images were not generated or stored beforehand. Instead, each original image was transformed differently at every training epoch, allowing the model to encounter a wide variety of augmented versions throughout training. As a result, the effect of augmentation was quantified in terms of the number of training samples observed during optimization, rather than a static number of generated images.
For example, the FireStage dataset consisted of 632 training images after stratified splitting. Over the course of 10 epochs, this resulted in approximately 6320 effective training samples. Similarly, the FLAME dataset’s 31,500 training images yielded about 315,000 effective samples under the same training configuration. Table 2 and Table 3 provide a comparative summary of the real versus effective sample sizes for both datasets, illustrating the impact of online augmentation on the overall training volume.
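The effective-sample arithmetic above reduces to a simple product of training-set size and epoch count:

```python
# Effective training volume under online augmentation: each image is
# re-transformed at every epoch, so the model sees a fresh variant each time.
def effective_samples(n_train_images: int, n_epochs: int) -> int:
    """Number of (differently augmented) samples observed during training."""
    return n_train_images * n_epochs

firestage_effective = effective_samples(632, 10)     # FireStage: 6320
flame_effective = effective_samples(31_500, 10)      # FLAME: 315,000
```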

2.3. Data Synthesis

In this study, a multimodal input representation was constructed by generating three synthetic modalities—night-vision-like, white-hot, and green-hot—from each original RGB image. These modalities were not captured by physical infrared or thermal sensors; instead, they were derived deterministically from the RGB data to increase appearance diversity under limited-data conditions and to emulate common operational visualizations used in low-light and pseudo-thermal monitoring.

2.3.1. RGB-to-Grayscale Transformation

Let an RGB image be denoted as $I_{RGB}(x, y) = [R(x, y), G(x, y), B(x, y)]$, where $R, G, B \in [0, 255]$. First, the image was converted to a grayscale luminance map $I_{gray}(x, y)$ to preserve the intensity structure while removing color dependency. The grayscale conversion follows the standard luminance formulation:
$$I_{gray}(x, y) = 0.299\,R(x, y) + 0.587\,G(x, y) + 0.114\,B(x, y)$$
This step yields $I_{gray}(x, y) \in [0, 255]$ and provides a stable intensity basis for subsequent contrast and colormap operations.

2.3.2. Contrast Enhancement for Night-Vision-like Modality

To obtain a night-vision-like appearance, global histogram equalization was applied to the grayscale image to enhance contrast in dark regions. Let the histogram equalization operator be denoted as $H(\cdot)$. The enhanced intensity image is computed as follows:
$$I_{eq}(x, y) = H\big(I_{gray}(x, y)\big)$$
Histogram equalization improves the separability of low-intensity structures by redistributing intensity values over the available range. Finally, a predefined colormap $C_{NV}(\cdot)$ was applied to obtain the night-vision-like RGB representation:
$$I_{night}(x, y) = C_{NV}\big(I_{eq}(x, y)\big)$$
In practice, $C_{NV}$ corresponds to a green-dominant mapping that visually resembles low-light imaging.

2.3.3. White-Hot and Green-Hot Pseudo-Thermal Modalities

Two additional pseudo-thermal modalities were produced by applying distinct colormap transformations to the grayscale intensity image. The white-hot modality was generated by mapping high-intensity values to brighter tones using a colormap $C_{WH}(\cdot)$:
$$I_{white}(x, y) = C_{WH}\big(I_{gray}(x, y)\big)$$
Similarly, the green-hot modality was generated with a green-to-yellow intensity mapping using $C_{GH}(\cdot)$:
$$I_{green}(x, y) = C_{GH}\big(I_{gray}(x, y)\big)$$
These transformations are deterministic and preserve the spatial structure of the original image while modifying its appearance distribution, thereby providing complementary “views” of the same scene.
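The three synthesis steps (grayscale conversion, histogram equalization, colormap lookup) can be sketched in NumPy as below. The paper does not specify the implementation or the exact colormaps $C_{NV}$, $C_{WH}$, $C_{GH}$, so the green-dominant LUT here is an illustrative assumption; in practice, OpenCV's equalizeHist and applyColorMap perform the same operations:

```python
import numpy as np

def to_gray(rgb):
    """BT.601 luminance, matching the I_gray equation (H x W x 3, uint8)."""
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    return (0.299 * r + 0.587 * g + 0.114 * b).astype(np.uint8)

def equalize(gray):
    """Global histogram equalization H(.) (equivalent to cv2.equalizeHist)."""
    hist = np.bincount(gray.ravel(), minlength=256)
    cdf = hist.cumsum()
    cdf_min = cdf[cdf > 0].min()
    lut = (cdf - cdf_min) / max(cdf[-1] - cdf_min, 1) * 255.0
    return np.clip(np.round(lut), 0, 255).astype(np.uint8)[gray]

def colorize(gray, lut_rgb):
    """Apply a 256 x 3 colormap LUT (stand-in for cv2.applyColorMap)."""
    return lut_rgb[gray]

# Illustrative green-dominant LUT for C_NV; the exact colormaps used in
# the paper are not specified, so this mapping is an assumption.
ramp = np.arange(256, dtype=np.uint8)
zeros = np.zeros(256, dtype=np.uint8)
nv_lut = np.stack([zeros, ramp, zeros], axis=1)   # pure green ramp

# I_night = C_NV(H(I_gray)):
# night = colorize(equalize(to_gray(rgb_image)), nv_lut)
```

The white-hot and green-hot modalities follow the same pattern with their own LUTs, applied directly to the (non-equalized) grayscale image.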

2.3.4. Final Multimodal Sample Construction

For each original RGB image, the final multimodal sample is defined as a set of aligned modality tensors:
$$X(x, y) = \big\{\, I_{RGB}(x, y),\; I_{night}(x, y),\; I_{white}(x, y),\; I_{green}(x, y) \,\big\}$$
All modalities were resized to 224 × 224 and normalized with the ImageNet dataset’s mean and standard deviation to match the input distribution expected by the ImageNet-pretrained backbones. This modality construction enables consistent multimodal learning without requiring additional sensor hardware. Representative samples of RGB images and their synthesized modalities for the FLAME and FireStage datasets are shown in Figure 3.
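The per-modality normalization can be expressed as follows; the mean and standard deviation are the standard ImageNet statistics used by torchvision:

```python
import numpy as np

# Standard ImageNet channel statistics (torchvision convention).
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def normalize(img_uint8):
    """Scale an H x W x 3 uint8 image to [0, 1], then standardize per channel."""
    x = img_uint8.astype(np.float32) / 255.0
    return (x - IMAGENET_MEAN) / IMAGENET_STD
```

Applying the same statistics to every modality, including the synthetic ones, keeps the input distribution aligned with what the ImageNet-pretrained backbones expect.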

2.3.5. Interpretation of Modality-Specific Intensity Distributions

To investigate whether the synthesized modalities introduce statistically distinguishable patterns, pixel intensity histograms were generated for each class and modality, as shown in Figure 4. These histograms reveal that the synthetic transformations produce intensity distributions that differ noticeably from those of the original RGB images, reinforcing their value in expanding the feature space. Notably, the class-specific shifts observed in the pseudo-thermal representations suggest that early signs of fire—often visually subtle in RGB images—can be statistically highlighted through these transformations.
Table 4 contains data on the size of an original RGB image and the synthesized (night-vision-like, green-hot, and white-hot) images for both datasets.

2.4. Feature Extraction and Classification

The proposed fire detection model adopts a two-stage modular architecture that is specifically designed to integrate multimodal visual information and enhance classification performance. In the first stage, features are extracted separately from each modality, while the second stage performs classification using the combined feature representation. This architecture enables the model to leverage complementary visual cues from both the original RGB images and the synthetically generated modalities—namely, night-vision-like, white-hot, and green-hot—in a structured and interpretable way.

2.4.1. Modality-Wise Feature Extraction

During the feature extraction stage, each input modality is independently processed using a convolutional neural network backbone that has been pretrained on ImageNet. Depending on the specific experimental setup, ResNet-18, EfficientNet-B0, or DenseNet-121 is used as the feature extractor. In all cases, the final classification layers of these networks are removed, retaining only the convolutional body. This approach allows the model to extract high-level semantic features while preserving spatially aggregated information that is crucial for identifying fire-related patterns.
Let $I_m$ denote an input image corresponding to modality $m \in \{\text{RGB}, \text{Night}, \text{White}, \text{Green}\}$. Each modality is passed through a shared backbone network $F$ to obtain a modality-specific feature vector:
$$f_m = F(I_m), \quad f_m \in \mathbb{R}^d$$
where $d$ denotes the backbone-dependent feature dimensionality (e.g., $d = 512$ for ResNet-18). In the default setup, a shared backbone network is employed across all modalities to minimize model complexity and maintain consistent feature embeddings. All backbone parameters are kept frozen during training, a strategy that helps to stabilize the optimization process, especially in scenarios with limited training data.

2.4.2. Multimodal Feature Fusion

The modality-specific feature vectors are combined using feature-level concatenation, resulting in a unified multimodal representation:
$$f_{concat} = [\, f_{RGB} \,\Vert\, f_{Night} \,\Vert\, f_{White} \,\Vert\, f_{Green} \,]$$
where $\Vert$ denotes the concatenation operator. For four modalities and a backbone feature dimension of $d$, this produces a combined feature vector of size $4d$ (e.g., 2048 dimensions for ResNet-18). The fused representation effectively captures complementary information such as color composition, intensity distribution, texture, and pseudo-thermal responses, all of which are crucial for distinguishing between different stages of fire.
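In PyTorch terms, the fusion reduces to a single concatenation over the feature dimension:

```python
import torch

# Feature-level concatenation of the four modality embeddings.
# With a ResNet-18 backbone (d = 512), the fused vector has 4d = 2048 dims.
def fuse(f_rgb, f_night, f_white, f_green):
    return torch.cat([f_rgb, f_night, f_white, f_green], dim=1)
```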

2.4.3. Classification Head

The concatenated feature vector is passed to a lightweight, fully connected classification head responsible for the final prediction. First, a linear transformation reduces the feature dimensionality from $4d$ to 256, enabling compact representation learning. A ReLU activation introduces nonlinearity, followed by a dropout layer with a probability of 0.3 to mitigate overfitting. The final linear layer outputs class logits corresponding to either the binary (No Fire, Fire) or three-class (No Fire, Start Fire, Fire) classification task.
During training, class-weighted CrossEntropyLoss is employed to address class imbalance, particularly for the start_fire class in the FireStage dataset. Optimization is performed using the Adam optimizer, and only the parameters of the classification head are updated, while the backbone weights remain frozen. The layer-wise structure of the classification head is summarized in Table 5.
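The classification head and training setup described above can be sketched as follows; the class-weight values are placeholders, since the exact weights (derived from the class frequencies) are not reported:

```python
import torch
import torch.nn as nn

# Classification head: 4d -> 256 -> ReLU -> Dropout(0.3) -> num_classes.
# num_classes is 2 for FLAME and 3 for FireStage.
def make_head(d: int = 512, num_classes: int = 3) -> nn.Sequential:
    return nn.Sequential(
        nn.Linear(4 * d, 256),
        nn.ReLU(),
        nn.Dropout(p=0.3),
        nn.Linear(256, num_classes),
    )

head = make_head()

# Class-weighted loss to counter the under-represented start_fire class;
# these weight values are illustrative placeholders, not the authors' exact ones.
criterion = nn.CrossEntropyLoss(weight=torch.tensor([1.0, 2.0, 1.0]))

# Only the head's parameters are optimized; the backbone remains frozen.
optimizer = torch.optim.Adam(head.parameters())
```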

2.5. Model Performance Evaluation

The accuracy of the developed transfer learning- and feature fusion-based deep neural network model was calculated on both the training and validation (test) sets at the end of each epoch. After model training was completed, classification performance was evaluated on the validation set. For this purpose, accuracy, macro-averaged precision, recall, F1 score, and multi-class ROC AUC (One-vs.-Rest) metrics were calculated (Table 6). The results were visualized using class-based ROC curves and a confusion matrix. For the binary FLAME dataset, standard binary ROC AUC analysis was applied. For the three-class FireStage dataset, multi-class ROC AUC values were computed using a One-vs-Rest (OvR) strategy with macro-averaging to account for class imbalance. Additionally, a learning curve comparing training and validation accuracies for each epoch was constructed.
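The evaluation protocol above can be reproduced with scikit-learn as sketched below (function and variable names are illustrative, not taken from the authors' code):

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_recall_fscore_support, roc_auc_score)

def evaluate(y_true, y_prob):
    """Compute the Section 2.5 metrics.

    y_true: (N,) integer labels; y_prob: (N, C) class probabilities.
    """
    y_pred = y_prob.argmax(axis=1)
    acc = accuracy_score(y_true, y_pred)
    prec, rec, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="macro", zero_division=0)
    # One-vs-Rest macro ROC AUC, as used for the three-class FireStage setting.
    auc = roc_auc_score(y_true, y_prob, multi_class="ovr", average="macro")
    return {"accuracy": acc, "precision": prec, "recall": rec, "f1": f1,
            "roc_auc": auc, "confusion_matrix": confusion_matrix(y_true, y_pred)}
```

For the binary FLAME setting, `roc_auc_score` can instead be called with the positive-class probability column alone, which yields the standard binary ROC AUC.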

3. Results

The proposed model was evaluated on two datasets with substantially different characteristics and scales: the FireStage dataset and the FLAME dataset. All experiments were conducted using stratified training–validation splits to preserve class distributions and ensure fair performance assessment.
The FireStage dataset contains a total of 791 images, comprising three classes (No Fire, Start Fire, and Fire). Due to the limited size of the dataset, the available images were used as provided and organized into stratified training and validation splits to preserve class distributions across different fire progression stages. Following stratified splitting, 632 images (80%) were used for training, and 159 images (20%) were reserved for validation. This dataset was primarily employed to evaluate the model’s ability to distinguish different stages of fire development under data-scarce conditions (Table 7).
The FLAME dataset, which is considerably larger, consists of 39,375 images acquired by unmanned aerial vehicles (UAVs) under diverse environmental and terrain conditions. The dataset includes scenes with and without fire and supports robust evaluation for early fire detection. Using the same stratified splitting strategy, 31,500 images (80%) were used for training and 7875 images (20%) for validation. In addition to RGB images, synthetic modalities derived from RGB data were incorporated to enrich visual diversity and support multimodal learning.
For both datasets, data augmentation was applied exclusively to the training sets using an online stochastic strategy, while validation images remained unaltered. Performance results are reported using accuracy, precision, recall, F1-score, confusion matrices, and ROC/AUC metrics, with per-class analyses provided for the FireStage dataset to account for class imbalance and limited sample size.
Table 8 presents a detailed validation performance comparison of the proposed model on the FLAME and FireStage datasets in terms of different network architectures and modality combinations. For the FLAME dataset, the results indicate that RGB input consistently achieves the highest—or very close to the highest—performance across all architectures. In particular, the DenseNet121 architecture with RGB input achieved the best single performance, with an accuracy of 99.66%, an F1-score of 99.73%, and an AUC value of 0.9997 (as reported in Table 8). This finding indicates that for the binary classification problem, fire-related visual cues are largely discriminative within the RGB space.
On the FLAME dataset, using individual synthetic modalities—Green, Night, and White—resulted in only a slight performance drop compared to RGB. Among these, the White modality performed the best, delivering more competitive results than the other synthetic transformations. Notably, the combination of RGB and White achieved performance levels nearly identical to the RGB-only setup across all architectures, with high F1-scores and a well-balanced precision–recall trade-off. For instance, in the case of DenseNet121, this configuration reached an F1-score of 0.9963 and an AUC of 0.9997 (see Table 8). On the other hand, while the RGB + Green and RGB + Night combinations maintained relatively high precision, they suffered from reduced recall, indicating an increase in false negatives, which negatively impacted overall F1 performance.
The results obtained using the FireStage dataset, which poses a more complex and multi-class classification challenge, further highlight the limitations of relying solely on synthetic modalities. As shown in Table 8, RGB again proved to be the most effective standalone input across all tested architectures. The DenseNet121 + RGB configuration delivered the best performance, with an accuracy of 93.71% and an F1-score of 0.9272. In contrast, using the Green, Night, or White modality alone led to a significant drop in performance, suggesting that these synthetic transformations are insufficient by themselves for reliably distinguishing between fire stages. However, it is important to acknowledge certain limitations of the proposed synthetic modality approach. Since all synthetic channels are deterministically derived from RGB images, their effectiveness is inherently dependent on the quality of the original visual input. In scenarios involving environmental degradation—such as fog, smoke, clouds, or dust—the RGB signal may be significantly distorted, leading to misleading or ambiguous synthetic transformations. These conditions represent potential failure cases for the method and may limit its generalizability in real-world deployments.
However, when paired with RGB, the White modality demonstrated notable benefits on the FireStage dataset. For example, with the EfficientNet-B0 architecture, the RGB + White combination improved the F1-score by approximately 2.5% over the RGB-only setup and yielded more balanced precision and recall values (see Table 8). Similarly, both the DenseNet121 and ResNet18 architectures with the RGB + White combination demonstrated performance levels that closely matched their RGB-only baselines, while also exhibiting improved generalization. These findings suggest that the White modality enhances structural and contrast-related cues between fire stages, making it particularly beneficial in multi-class classification scenarios.
Finally, the fully multimodal configuration—combining RGB, Night, White, and Green modalities—did not lead to a clear performance gain on either dataset (see Table 6 and Table 8). Moreover, this setup substantially increased both the number of model parameters and the training time. For instance, for DenseNet121, the fully multimodal approach led to a significant increase in training duration without outperforming simpler configurations such as RGB or RGB + White in terms of F1-score. These outcomes emphasize that a selective, controlled modality fusion strategy is more effective and computationally efficient than indiscriminate multimodal expansion.
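The feature-level concatenation underlying these fusion configurations can be sketched in a few lines. The per-backbone dimension d = 512 below is assumed purely for illustration; it matches the "4d (e.g., 2048)" classifier input reported in Table 5:

```python
def fuse_features(feature_vectors):
    """Feature-level fusion: concatenate per-modality feature vectors into one."""
    fused = []
    for vec in feature_vectors:
        fused.extend(vec)
    return fused

# With four modalities and an assumed 512-dimensional backbone output,
# the fused vector has length 4d = 2048 -- the classification-head input
# size given in Table 5.
d = 512
rgb, night, white, green = ([0.0] * d for _ in range(4))
fused = fuse_features([rgb, night, white, green])
```

The sketch also makes the cost argument concrete: every added modality lengthens the fused vector (and requires its own backbone forward pass), so indiscriminate expansion grows parameters and training time linearly with the number of modalities.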
The confusion matrices of the FLAME dataset presented in Figure 5 illustrate the binary classification performance of the model under different backbone architectures (ResNet18, EfficientNet-B0, and DenseNet121) and different modality configurations (RGB-only, RGB + White, and fully multimodal). When only RGB images were used, all backbone architectures achieved high correct classification rates. In particular, the true positive rates for the Fire class are notably high, while false-negative predictions (Fire → No Fire) remain limited. This observation indicates that due to the large scale and visual diversity of the FLAME dataset, fire-containing scenes provide strong discriminative features.
The RGB + White modality combination shows a tendency to reduce false-positive rates, especially for the No Fire class. This suggests that white-hot-like transformations support brightness-based discrimination and contribute to a more consistent representation of non-fire scenes. In contrast, in the fully multimodal configuration (RGB + Night + White + Green), an increase in the number of cases where the Fire class is misclassified as No Fire is observed for some backbone architectures. This finding indicates that synthetic modalities do not always provide complementary benefits and that modality fusion must be carefully designed, particularly for large-scale datasets. Especially for EfficientNet- and DenseNet-based models, certain synthetic transformations may complicate the decision boundaries.
The confusion matrices for the FireStage dataset reflect model behavior in a three-class (No Fire, Fire, Start Fire) and data-limited scenario in detail. In this dataset, class imbalance and the limited number of samples become more pronounced, particularly for the Start Fire class. When only RGB images were used, all backbone architectures could distinguish the No Fire and Fire classes with relatively high accuracy; however, the Start Fire class was occasionally confused with these two classes. This is attributed to the visually ambiguous and low-contrast nature of early-stage fire scenes.
The RGB + White modality combination reduces misclassifications for the Fire class, while some confusion remains for the Start Fire class. This indicates that, although white-hot transformations are beneficial for scenes with strong flames, they do not always clearly separate early-stage fire cues. In the fully multimodal configuration (RGB + Night + White + Green), an overall improvement trend in the correct classification of the Start Fire class is observed. In particular, DenseNet- and EfficientNet-based models assign Start Fire samples to the correct class at higher rates. This suggests that synthetic modalities can enhance the visibility of low-intensity and early-stage fire patterns.
Nevertheless, some instances are also observed where the No Fire class is misclassified as Start Fire. This observation indicates that the limited sample size and visual overlap in the FireStage dataset can influence model decisions.
When both datasets are considered together, the confusion matrices show that the model exhibits more stable behavior on large-scale and balanced datasets, while being more sensitive to modality selection in data-limited and multi-class scenarios. In particular, for the FireStage dataset, synthetic modalities contribute to the separation of the early-stage fire class; however, the extent of this contribution varies depending on the backbone architecture and modality combination.
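The metrics discussed alongside these confusion matrices follow directly from the matrix counts. As a minimal sketch for the binary FLAME case (with illustrative counts, not values taken from Figure 5):

```python
def binary_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F1 from binary confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)   # fraction of predicted fires that were real
    recall = tp / (tp + fn)      # fraction of real fires that were caught
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Illustrative counts only: 90 detected fires, 95 correct non-fires,
# 5 false alarms (No Fire -> Fire), 10 missed fires (Fire -> No Fire).
acc, prec, rec, f1 = binary_metrics(tp=90, tn=95, fp=5, fn=10)
```

Reading the matrices through these definitions explains the trade-offs above: the RGB + Green and RGB + Night combinations keep fp low (high precision) but raise fn (lower recall), which drags down F1.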

4. Discussion and Limitations

Table 9 presents a comparative performance evaluation of the proposed method on the FLAME and FireStage datasets against recent studies in the literature. As the table shows, the majority of existing works focus on a binary classification setup (fire/no fire) and typically rely either on RGB images alone or on a limited number of hardware-based multimodal inputs (e.g., RGB + infrared). For instance, Goncalves et al. (2024) [9] reported an accuracy of 99.70% on the FLAME dataset using an RGB-based DenseNet architecture, while Wang et al. (2024) [13] achieved 98.99% accuracy with a hybrid SE-ResNet + SVM model. Similarly, Arteaga et al. (2020) [25], Mohammed et al. (2022) [26], and Benzekri et al. (2020) [2] reported accuracies approaching 99% in binary fire detection scenarios.
In contrast to these approaches, the method proposed in this study constructs a multi-input learning framework using synthetic modalities derived solely from RGB images, without relying on real thermal or infrared hardware. When Table 6 and Table 9 are considered together, the DenseNet121 architecture achieved an accuracy of 99.66% and an F1-score of 0.9973 on the FLAME dataset using RGB images, while the RGB + White modality combination yielded an accuracy of 99.53% and an F1-score of 0.9963. These results indicate that, despite being hardware-independent, the proposed approach delivers performance comparable to—and on some metrics, competitive with—the best-performing binary fire detection methods reported in the literature. This demonstrates that the method provides a low-cost, scalable, and practical solution suitable for real-world deployment.
The most notable contribution highlighted in Table 9 is that this study directly addresses a three-class classification problem (No Fire—Start Fire—Fire) on the FireStage dataset. In much of the existing literature, the “Start Fire” (early-stage fire) class is either neglected entirely or merged into the “Fire” or “No Fire” categories. In contrast, the proposed approach explicitly models the early stage of fire as a separate class, which is of critical importance for early intervention. As shown in Table 8, the DenseNet121 architecture achieved an accuracy of 93.71% and an F1-score of 0.9272 on the FireStage dataset. In addition, the RGB + White modality combination was observed to provide more balanced and generalizable performance across different backbone architectures.
A major limitation of this study is the relatively small size of the FireStage dataset compared to the FLAME dataset. While the FireStage dataset contains 791 images in total (632 for training and 159 for validation), the FLAME dataset comprises 39,375 images, providing a substantially richer training distribution. Although online data augmentation increased the effective number of training samples for the FireStage dataset to approximately 6320 across epochs, these augmented samples were derived from a limited set of original images and therefore could not fully replace the benefits of additional real-world data.
In particular, the Start Fire class is more sensitive to data scarcity, which may limit generalization performance. Consequently, the results for the FireStage dataset are reported based on per-class metrics and ROC/AUC analyses and should be interpreted with caution. Future work will focus on expanding the FireStage dataset with additional samples and increased scene diversity, which is expected to further improve the robustness of the proposed multimodal framework.
In practical deployment scenarios involving high-resolution UAV imagery, the proposed model is intended to operate within a patch-based inference framework. Instead of directly downscaling large aerial images to 224 × 224 pixels, high-resolution frames can be partitioned into smaller patches matching the network input size. Each patch is independently classified, and patch-level predictions can be aggregated using voting or confidence-based strategies to produce scene-level fire alerts. This approach preserves fine-grained visual cues, such as low-density smoke or early-stage fire signatures, which may be suppressed by global downscaling. Therefore, the proposed framework is well suited for integration into UAV-based early warning pipelines.
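A minimal sketch of this patch-based strategy is shown below. The inward shift of edge patches and the single-vote alert threshold are assumptions for illustration; the paper does not fix these details:

```python
def partition(width, height, patch=224):
    """Top-left corners of a patch grid covering the frame.

    Edge patches are shifted inward so every patch stays inside the image
    (a common tiling choice, assumed here rather than taken from the paper).
    """
    xs = list(range(0, width - patch + 1, patch))
    ys = list(range(0, height - patch + 1, patch))
    if xs[-1] != width - patch:
        xs.append(width - patch)   # extra column flush with the right edge
    if ys[-1] != height - patch:
        ys.append(height - patch)  # extra row flush with the bottom edge
    return [(x, y) for y in ys for x in xs]

def scene_alert(patch_fire_probs, threshold=0.5, min_votes=1):
    """Aggregate patch-level fire probabilities into a scene-level alert."""
    votes = sum(1 for p in patch_fire_probs if p >= threshold)
    return votes >= min_votes
```

Classifying each patch at native resolution is what preserves small, low-density cues: a 10-pixel smoke plume that vanishes when a 4K frame is downscaled to 224 × 224 still occupies a visible fraction of its own patch.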

5. Conclusions

In this study, a multi-input deep learning approach was developed to enable early-stage, hardware-independent detection of forest fires by leveraging synthetic night-vision-like, white-hot, and green-hot modalities derived from RGB images. The proposed method increases visual diversity without requiring real thermal or infrared sensors, thereby enhancing the model’s generalization capability. During feature extraction, ImageNet-pretrained backbones (ResNet18, EfficientNet-B0, and DenseNet121) were employed, and the feature representations obtained from the different modalities were combined in a controlled manner to perform classification.
The proposed approach was comprehensively evaluated on two different datasets: the binary FLAME dataset (fire/no fire) with 39,375 images, and the three-class FireStage dataset (No Fire—Start Fire—Fire) with 791 images. For both datasets, a stratified 80–20% training–validation split was applied. The experimental results showed that, on the FLAME dataset, the DenseNet121 + RGB configuration achieved an accuracy of 99.66%, an F1-score of 0.9973, and a ROC AUC value of 0.9997, indicating very high performance. In addition, the RGB + White modality combination produced results very close to the RGB-only configuration and emerged as a reliable alternative, providing more balanced performance across different architectures.
The results obtained on the three-class FireStage dataset clearly reflect the more challenging nature of this problem. On this dataset, the DenseNet121 architecture achieved its best accuracy of 93.71% and F1-score of 0.9272 with the RGB configuration, while the RGB + White modality combination showed more stable behavior in terms of precision–recall balance. Because the “Start Fire” class is characterized by low contrast, dense smoke, and weak flame cues, modeling it as a separate class significantly increases the difficulty of the classification task. Despite these challenges, the results demonstrate that the proposed method is effective in discriminating early-stage fire patterns.
The experiments also revealed that the fully multimodal configuration (RGB + Night + White + Green) substantially increased the number of parameters and training time, while not providing a meaningful performance gain. This finding indicates that controlled and selective modality fusion yields more efficient and generalizable results for fire detection problems compared to indiscriminate multimodal expansion.
In conclusion, this study goes beyond the predominantly binary fire detection problems addressed in the literature by directly modeling the early stage of fire as a separate class within a hardware-independent, synthetic-modality-supported deep learning framework. The findings show that the proposed approach achieves high generalization performance on large-scale datasets and delivers reasonable and operationally meaningful performance in multi-class scenarios with limited data. In this respect, the study offers a low-cost and scalable solution with strong potential for real-time forest fire monitoring and early warning systems.

Author Contributions

Conceptualization, B.T., A.B.T., A.K.T. and O.Y.; methodology, B.T. and A.B.T.; validation, B.T. and A.B.T.; software, B.T.; formal analysis, B.T., A.B.T. and A.K.T.; investigation, B.T.; resources, A.K.T. and B.T.; data curation, B.T.; writing—original draft preparation, B.T., A.B.T., O.Y. and A.K.T.; writing—review and editing, B.T., A.B.T. and A.K.T.; visualization, B.T. All authors have read and agreed to the published version of the manuscript.

Funding

This study was funded by the Scientific and Technological Research Council of Türkiye (TÜBİTAK) under the 1001 Program (Project No. 223M001) and by the Fırat University Scientific Research Projects (BAP) Coordination Unit (Project No. MF.25.122).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable. This study does not involve human participants.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author. Two open access datasets were used in this study. FLAME dataset access link: https://ieee-dataport.org/open-access/flame-2-fire-detection-and-modeling-aerial-multi-spectral-image-dataset; FireStage dataset access link: https://figshare.com/articles/dataset/Forest_Fire_Detection/14904522?file=28702587.

Acknowledgments

No additional material or financial support was received beyond the support stated in the funding section. The authors reviewed and edited all output and are fully responsible for the content of this publication. DeepL and Grammarly tools were used to improve the readability of the article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. de Almeida, R.V.; Crivellaro, F.; Narciso, M.; Sousa, A.I.; Vieira, P. Bee2Fire: A deep learning powered forest fire detection system. In Proceedings of the ICAART 2020—12th International Conference on Agents and Artificial Intelligence, Valletta, Malta, 22–24 February 2020; pp. 603–609. [Google Scholar]
  2. Benzekri, W.; El Moussati, A.; Moussaoui, O.; Berrajaa, M. Early forest fire detection system using wireless sensor network and deep learning. Int. J. Adv. Comput. Sci. Appl. 2020, 11, 496–503. [Google Scholar] [CrossRef]
  3. Mohammad, M.B.; Bhuvaneswari, N.; Koteswari, C.P.; Priya, V.B. Hardware implementation of forest fire detection system using deep learning architectures. In Proceedings of the International Conference on Edge Computing and Applications (ICECAA) 2022, Tamilnadu, India, 13–15 October 2022; pp. 1198–1205. [Google Scholar]
  4. Ryu, J.; Kwak, D. A study on a complex flame and smoke detection method using computer vision detection and convolutional neural network. Fire 2022, 5, 108. [Google Scholar] [CrossRef]
  5. Shirwaikar, R.; Narvekar, A.; Hosamani, A.; Fernandes, K.; Tak, K.; Parab, V. Real-time semi-occluded fire detection and evacuation route generation: Leveraging instance segmentation for damage estimation. Fire Saf. J. 2025, 152, 104338. [Google Scholar] [CrossRef]
  6. Chaoxia, C.; Shang, W.; Zhang, F. Information-guided flame detection based on Faster R-CNN. IEEE Access 2020, 8, 58923–58932. [Google Scholar] [CrossRef]
  7. Casas, E.; Ramos, L.; Bendek, E.; Rivas-Echeverria, F. Assessing the effectiveness of YOLO architectures for smoke and wildfire detection. IEEE Access 2023, 11, 96554–96583. [Google Scholar] [CrossRef]
  8. Sathishkumar, V.E.; Cho, J.; Subramanian, M.; Naren, O.S. Forest fire and smoke detection using deep learning-based learning without forgetting. Fire Ecol. 2023, 19, 9. [Google Scholar] [CrossRef]
  9. Goncalves, A.M.; Brandao, T.; Ferreira, J.C. Wildfire detection with deep learning—A case study for the CICLOPE project. IEEE Access 2024, 12, 82095–82110. [Google Scholar] [CrossRef]
  10. Yan, C.; Wang, J. MAG-FSNet: A high-precision robust forest fire smoke detection model integrating local features and global information. Measurement 2025, 247, 116813. [Google Scholar] [CrossRef]
  11. Wang, W.; Huang, Q.; Liu, H.; Jia, Y.; Chen, Q. Forest fire detection method based on deep learning. In Proceedings of the International Conference on Cyber-Physical Social Intelligence (ICCSI) 2022, Nanjing, China, 18–21 November 2022; pp. 23–28. [Google Scholar]
  12. Zhu, W.; Niu, S.; Yue, J.; Zhou, Y. Multiscale wildfire and smoke detection in complex drone forest environments based on YOLOv8. Sci. Rep. 2025, 15, 2399. [Google Scholar] [CrossRef]
  13. Wang, X.; Wang, J.; Chen, L.; Zhang, Y. Improving computer vision-based wildfire smoke detection by combining SE-ResNet with SVM. Processes 2024, 12, 747. [Google Scholar] [CrossRef]
  14. Li, L.; Liu, F.; Ding, Y. Real-time smoke detection with Faster R-CNN. In Proceedings of the 2nd International Conference on Artificial Intelligence and Information Systems, Chongqing, China, 28–30 May 2021; pp. 1–5. [Google Scholar]
  15. Bahhar, C.; Ksibi, A.; Ayadi, M.; Jamjoom, M.M.; Ullah, Z.; Soufiene, B.O.; Sakli, H. Wildfire and smoke detection using staged YOLO model and ensemble CNN. Electronics 2023, 12, 228. [Google Scholar] [CrossRef]
  16. Li, Y.; Zhang, W.; Liu, Y.; Jing, R.; Liu, C. An efficient fire and smoke detection algorithm based on an end-to-end structured network. Eng. Appl. Artif. Intell. 2022, 116, 105492. [Google Scholar] [CrossRef]
  17. Wang, C.; Li, Q.; Liu, S.; Cheng, P.; Huang, Y. Transformer-based fusion of infrared and visible imagery for smoke recognition in commercial areas. Comput. Mater. Contin. 2025, 84, 5157–5176. [Google Scholar] [CrossRef]
  18. Wang, Y.; Wang, Y.; Khan, Z.A.; Huang, A.; Sang, J. Multi-level feature fusion networks for smoke recognition in remote sensing imagery. Neural Netw. 2025, 184, 107112. [Google Scholar] [CrossRef] [PubMed]
  19. Alkhammash, E.H. A comparative analysis of YOLOv9, YOLOv10, YOLOv11 for smoke and fire detection. Fire 2025, 8, 26. [Google Scholar] [CrossRef]
  20. Xue, Z.; Kong, L.; Wu, H.; Chen, J. Fire and smoke detection based on improved YOLOv11. IEEE Access 2025, 13, 73022–73040. [Google Scholar] [CrossRef]
  21. He, L.; Zhou, Y.; Liu, L.; Zhang, Y.; Ma, J. Research and application of deep learning object detection methods for forest fire smoke recognition. Sci. Rep. 2025, 15, 16328. [Google Scholar] [CrossRef]
  22. Niu, K.; Wang, C.; Xu, J.; Liang, J.; Zhou, X.; Wen, K.; Lu, M.; Yang, C. Early forest fire detection with UAV image fusion: A novel deep learning method using visible and infrared sensors. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 6617–6629. [Google Scholar] [CrossRef]
  23. Jin, P.; Cheng, P.; Liu, X.; Huang, Y. From smoke to fire: A forest fire early warning and risk assessment model fusing multimodal data. Eng. Appl. Artif. Intell. 2025, 152, 110848. [Google Scholar] [CrossRef]
  24. Shang, L.; Hu, X.; Huang, Z.; Zhang, Q.; Zhang, Z.; Li, X.; Chang, Y. YOLO-DKM: A flame and spark detection algorithm based on deep learning. IEEE Access 2025, 13, 117687–117699. [Google Scholar] [CrossRef]
  25. Arteaga, B.; Diaz, M.; Jojoa, M. Deep learning applied to forest fire detection. In Proceedings of the IEEE International Symposium on Signal Processing and Information Technology (ISSPIT) 2020, Louisville, KY, USA, 9–11 December 2020. [Google Scholar]
  26. Mohammed, R.K. A real-time forest fire and smoke detection system using deep learning. Int. J. Nonlinear Anal. Appl. 2022, 13, 2053–2063. [Google Scholar]
  27. Mohnish, S.; Akshay, K.P.; Ram, S.G.; Vignesh, A.S.; Pavithra, P.; Ezhilarasi, S. Deep learning based forest fire detection and alert system. In Proceedings of the International Conference on Communication, Computing and Internet of Things (IC3IoT) 2022, Chennai, India, 10–11 March 2022; pp. 1–5. [Google Scholar]
  28. Ban, Y.; Zhang, P.; Nascetti, A.; Bevington, A.R.; Wulder, M.A. Near real-time wildfire progression monitoring with Sentinel-1 SAR time series and deep learning. Sci. Rep. 2020, 10, 1322. [Google Scholar] [CrossRef]
  29. Rahul, M.; Saketh, K.S.; Sanjeet, A.; Naik, N.S. Early detection of forest fire using deep learning. In Proceedings of the IEEE REGION 10 CONFERENCE (TENCON) 2020, Osaka, Japan, 16–19 November 2020; pp. 1136–1140. [Google Scholar]
  30. Jiang, Y.; Wei, R.; Chen, J.; Wang, G. Deep learning of Qinling forest fire anomaly detection based on genetic algorithm optimization. UPB Sci. Bull. Ser.-Electr. Eng. Comput. Sci. 2021, 83, 75–84. [Google Scholar]
  31. Li, M.; Zhang, Y.; Mu, L.; Xin, J.; Yu, Z.; Liu, H.; Xie, G. Early forest fire detection based on deep learning. In Proceedings of the 3rd International Conference on Industrial Artificial Intelligence (IAI) 2021, Shenyang, China, 8–11 November 2021; pp. 1–5. [Google Scholar]
  32. Khan, S.; Khan, A. FFireNet: Deep learning based forest fire classification and detection in smart cities. Symmetry 2022, 14, 2155. [Google Scholar] [CrossRef]
  33. Gayathri, S.; Ajay Karthi, P.V.; Sunil, S. Prediction and detection of forest fires based on deep learning approach. J. Pharm. Negat. Results 2022, 13, 429–433. [Google Scholar] [CrossRef]
  34. Kang, Y.; Jang, E.; Im, J.; Kwon, C. A deep learning model using geostationary satellite data for forest fire detection with reduced detection latency. GISci. Remote Sens. 2022, 59, 2019–2035. [Google Scholar] [CrossRef]
  35. Ghosh, R.; Kumar, A. A hybrid deep learning model combining CNN and RNN to detect forest fires. Multimed. Tools Appl. 2022, 81, 38643–38660. [Google Scholar] [CrossRef]
  36. Tahir, H.U.A.; Waqar, A.; Khalid, S.; Usman, S.M. Wildfire detection in aerial images using deep learning. In Proceedings of the 2nd International Conference on Digital Futures and Transformative Technologies (ICoDT2) 2022, Rawalpindi, Pakistan, 24–26 May 2022. [Google Scholar]
  37. Peng, Y.; Wang, Y. Automatic wildfire monitoring system based on deep learning. Eur. J. Remote Sens. 2022, 55, 551–567. [Google Scholar] [CrossRef]
  38. Mashraqi, A.M.; Asiri, Y.; Algarni, A.D.; Abu-Zinadah, H. Drone imagery forest fire detection and classification using modified deep learning model. Therm. Sci. 2022, 26, 411–423. [Google Scholar] [CrossRef]
  39. Almasoud, A.S. Intelligent deep learning enabled wild forest fire detection system. Comput. Syst. Sci. Eng. 2023, 44, 1485–1498. [Google Scholar] [CrossRef]
  40. Alice, K.; Thillaivanan, A.; Koteswara Rao, G.R.; Rajalakshmi, S.; Singh, K.; Rastogi, R. Automated forest fire detection using atom search optimizer with deep transfer learning model. In Proceedings of the 2nd International Conference on Applied Artificial Intelligence and Computing (ICAAIC), Salem, India, 4–6 May 2023; pp. 222–227. [Google Scholar]
  41. Xie, F.; Huang, Z. Aerial forest fire detection based on transfer learning and improved Faster R-CNN. In Proceedings of the 3rd International Conference on Information Technology, Big Data and Artificial Intelligence (ICIBA), Chongqing, China, 26–28 May 2023; pp. 1132–1136. [Google Scholar]
Figure 1. Transfer learning-based deep network architecture structure that takes the developed multimodal input fusion feature dataset.
Figure 2. (a) Distribution of image data by class in the FireStage dataset and (b) FLAME dataset.
Figure 3. Representative sample images from the (a) FireStage and (b) FLAME datasets, shown as RGB inputs and their synthetically generated modality counterparts (night-vision-like, white-hot, and green-hot).
Figure 4. Pixel intensity histograms of raw RGB images and their synthesized modalities for the (a) FLAME dataset and (b) FireStage dataset.
Figure 5. Confusion matrices obtained for different backbone architectures and modality configurations on the (a) FLAME and (b) FireStage datasets.
Table 1. Comparative literature summary of current deep learning-based studies on forest fire and smoke detection.
Author | Dataset | Objective | Method | Performance
de Almeida, R.V. et al. (2020) [1] | Bee2Fire dataset (associated with the CICLOPE scenario) | Forest fire classification (fire/no-fire) | CNN (ResNet-18) | Specificity ≈ 99%
Benzekri, W. et al. (2020) [2] | Forest fire image sequence | Time-dependent fire detection | Sequence models based on RNN, LSTM, GRU | Accuracy ≈ 99.89%
Mohammad, M.B. et al. (2022) [3] | Forest fire images (edge device-focused) | Real-time fire detection on hardware | CNN (ResNet50, GoogLeNet, CNN-9, MobileNet, InceptionV3, AlexNet) | Accuracy ≈ 99.42% (ResNet50/GoogLeNet-based)
Ryu, J. et al. (2022) [4] | Custom indoor/outdoor fire and smoke dataset | Mixed indoor/outdoor fire/smoke detection | Classic CV + CNN + InceptionV3-based classifier | ≈5–6% accuracy improvement over baseline methods
Shirwaikar, R. et al. (2025) [5] | Indoor semi-occluded fire dataset | Indoor disaster management; semi-occluded fire/spark detection + evacuation route | YOLOv8 (simplified) + instance segmentation-based damage estimation | Precision ≈ 0.73; F1 ≈ 0.81
Chaoxia, C. et al. (2020) [6] | BoWFire, PascalVOC, and Corsican datasets | Outdoor fire images | Color- and global-information-guided Faster R-CNN | Accuracy ≈ 99.50%
Casas, E. et al. (2023) [7] | Foggia fire/smoke CCTV dataset (outdoor environment, CCTV) | Comparison of different YOLO versions for early detection of smoke and wildfire | YOLOv5, YOLOv6, YOLOv7, YOLOv8, YOLO-NAS | F1 ≈ 0.95; recall ≈ 0.98 for the best model(s)
Sathishkumar et al. (2023) [8] | BoWFire + original forest fire image set (RGB) | Classification of forest fire and smoke images; transfer learning without forgetting old knowledge when switching to a new task | VGG16, InceptionV3, Xception + Learning without Forgetting (LwF) | Xception + LwF: ≈96.9% acc. (original), ≈91.4% acc.
Goncalves, A.M. et al. (2024) [9] | CICLOPE alarm images (Portugal; tower cameras) | Camera-based wildfire detection in a large (2.7 M ha) forest area | DenseNet-based feature extractor + detail-selective CNN classifier | Accuracy ≈ 99.7% (on CICLOPE alarm images)
Yan, C. et al. (2025) [10] | VIGP-FS forest smoke dataset | Local + global feature fusion for forest fire smoke detection | MAG-FSNet: CNN backbone + multi-scale attention + global feature fusion | Precision ≈ 88.4%; Recall ≈ 83.4%; mAP@0.5 ≈ 89.3%
Wang, W. et al. (2022) [11] | Forest fire image dataset | Forest fire object detection | YOLO-based CNN (forest fire detection) | Accuracy ≈ 83.9%
Zhu, W. et al. (2025) [12] | D-Fire drone dataset (drone-based) | UAV wildfire detection in complex forest environments | YOLOv8 (multiscale feature learning) | Precision ≈ 93.6%; Recall ≈ 88.5%
Wang, X. et al. (2024) [13] | Various public wildfire smoke image/video datasets | Outdoor wildfire smoke detection; classical CV + DL hybrid | SE-ResNet feature extractor + SVM classifier | Acc ≈ 98.99%; F1 ≈ 99%
Li, L. et al. (2021) [14] | Factory/indoor smoke dataset (ICAIIS 2021, real factory scenes) | Real-time detection of indoor (factory) smoke | Faster R-CNN-based smoke detector | Accuracy ≈ 99.0%
Bahhar, C. et al. (2023) [15] | UAV and open datasets (wildfire, smoke; multi-source, RGB) | Staged YOLO architecture for UAV-based forest fire/smoke detection | Staged YOLO (different YOLO versions) + Ensemble CNN | Acc ≈ 99%; mAP ≈ 0.85 for smoke class
Li, Y. et al. (2022) [16] | Wildfire image dataset (~35 k images, tower/camera-focused) | Forest fire monitoring (smoke/fire) from surveillance towers | ResNet, EfficientNet-based end-to-end network + Grad-CAM visualization | AUROC ≈ 0.949
Wang, C. et al. (2025) [17] | IR + visible (commercial area/business park) smoke dataset | Smoke detection in commercial/urban areas using IR + RGB fusion | Transformer-based fusion (IR + visible feature fusion) | Accuracy ≈ 90.9%; precision ≈ 98.4%; recall ≈ 92.4%; FP/FN < 5%
Wang, Y. et al. (2025) [18] | USTC_SmokeRS, E_SmokeRS, Aerial RS smoke datasets | Smoke detection in remote sensing images | ConvNeXt backbone + AFEM (attention feature enhancement module) + BFFM | Accuracy ≈ 98.9%; false alarm rate (FAR) ≈ 3.3%
Alkhammash, E.H. (2025) [19] | Smoke + D-Fire-like open-source datasets | Smoke/fire detection by comparing different YOLO generation models | Comparative analysis of YOLOv9, YOLOv10, and YOLOv11 | Precision ≈ 0.845; Recall ≈ 0.801
Xue, Z. et al. (2025) [20] | Baidu Paddle wildfire + additional indoor/outdoor datasets | Indoor/outdoor multi-scenario fire/smoke detection | YOLOv11-DH3 (enhanced YOLOv11 derivative) | Precision ≈ 91.6%; Recall ≈ 90%
He, L. et al. (2025) [21] | WD + FFS forest fire smoke datasets | Analysis of different object detectors for forest fire smoke detection | YOLOv11x + optimized loss function (loss redesign) | Precision ≈ 0.949; Recall ≈ 0.850; high mAP@0.5
Niu, K. et al. (2025) [22] | UAV fusion dataset (2752 RGB–IR image pairs) | Early forest fire detection using visible + IR fusion with UAV | YOLOv5s-based lightweight detector + image fusion | ≈10% increase in precision for small fire objects
Jin, P. et al. (2025) [23] | Multimodal dataset (3352 paired samples: imagery + environmental) | Smoke detection + early warning + risk assessment | YOLOv8n + MSDBlock (multiscale dual-branch block) | Accuracy ≈ 93.1%; ≈18.8% higher than traditional baselines
Shang, L. et al. (2025) [24] | Custom dataset (20,044 images; indoor, industrial, forest) | Multi-scenario (industrial + forest + indoor/outdoor) flame and spark detection | YOLO-DKM (improved YOLOv8; deformable conv + key modules) | Precision ≈ 82.1%; Recall ≈ 71.8%
Arteaga, B. et al. (2020) [25] | Forest fire images | Fire presence/absence classification | CNN (ResNet + VGG-based deep classifiers) | Accuracy ≈ 99.5%
Mohammed, R.K. (2022) [26] | Forest fire image dataset | Fire/smoke image classification | Inception-ResNet-based CNN | Accuracy ≈ 99.09%
Mohnish, S. et al. (2022) [27] | Forest fire image dataset | Fire image detection (bounding box level) | CNN-based detection (custom, single-stage) | Accuracy ≈ 92.20%
Ban, Y. et al. (2020) [28] | Sentinel-1 SAR time series (district/region-based) | Near real-time monitoring of fire progression | CNN-based time series model (on SAR images) | Accuracy ≈ 83.53%
Rahul, M. et al. (2020) [29] | Forest fire images | Fire/no-fire classification | CNN (ResNet50, VGG16, DenseNet121) | Accuracy ≈ 92.27%
Jiang, Y. et al. (2021) [30] | Qinling forest fire anomaly dataset | Forest anomaly/fire detection | CNN + BP NN, GA, SVM, GA-BP optimization | Accuracy ≈ 95%
Li, M. et al. (2021) [31] | Forest fire image dataset | Early forest fire detection (object detection) | h-EfficientDet (EfficientDet + h-EfficientDet architecture) | Accuracy ≈ 98.35%
Khan, S. et al. (2022) [32] | Forest fire detection dataset (Fire/No-Fire) | Fire/no-fire classification (smart city scenario) | FFireNet (MobileNetV2-based CNN) + MobileNetV2 comparison | Accuracy ≈ 98.42% (FFireNet + MobileNetV2)
Gayathri, S. et al. (2022) [33] | Forest fire image dataset | Forest fire prediction and detection | CNN-based classifier | Accuracy ≈ 96%
Kang, Y. et al. (2022) [34] | Geostationary satellite data (GEO; multi-temporal) | Early forest fire detection using GEO satellite data | CNN + Random Forest hybrid approach | Accuracy ≈ 98%
Ghosh, R.; Kumar, A. (2022) [35] | Forest fire image dataset | Spatial + temporal pattern learning for fire detection | CNN + RNN hybrid (e.g., LSTM layers) | Accuracy ≈ 99.62%
Tahir, H.U.A. et al. (2022) [36] | UAV wildfire imagery | Wildfire detection in UAV images | YOLOv5-based detection | F1 ≈ 94.44%
Peng, Y.; Wang, Y. (2022) [37] | Wildfire monitoring image dataset | Deep learning-based automatic wildfire monitoring system | CNN (SqueezeNet1.1, AlexNet, MobileNet, ResNet18, VGG16 comparisons) | Accuracy ≈ 99.28%
Mashraqi, A.M. et al. (2022) [38] | Drone imagery forest fire dataset | Fire classification from drone images | DIFFDC-MDL hybrid (LSTM-RNN + MobileNetV2) | Accuracy ≈ 99.38%
Almasoud, A.S. (2023) [39] | Forest fire image dataset | Smart, DL-based wild forest fire detection | IWFFDA-DL, ACNN-BLSTM + YOLOv3 combination | Accuracy ≈ 99.56%
Alice, K. et al. (2023) [40] | Forest fire image dataset | Automatic forest fire detection | Deep transfer learning (QRNN, ResNet50 + Atom Search Optimizer) | Accuracy ≈ 97.33%
Xie, F.; Huang, Z. (2023) [41] | UAV wildfire dataset | Fire/smoke detection in UAV images | Transfer learning + Faster R-CNN (ResNet50 backbone + fusion and attention) | Accuracy ≈ 93.7%
Table 2. Data augmentation strategy applied during training.

Augmentation Type | Applied Set | Probability/Range | Purpose
Random Resized Crop | Training only | Scale: 0.75–1.0 | Scale and viewpoint variation
Horizontal Flip | Training only | p = 0.5 | Viewpoint invariance
Rotation | Training only | ±10° | Orientation robustness
Gaussian Blur | Training only | p = 0.25 | Sensor noise simulation
Color Jitter | Training only | p = 0.5 | Illumination variation
Normalization | Training and Validation | ImageNet mean/std | Stable optimization
Table 3. Dataset size and effective training samples.

Dataset | Split | Real Images | Epochs | Effective Training Samples
FireStage | Training | 632 | 10 | ~6320
FireStage | Validation | 159 | – | 159
FLAME | Training | 31,500 | 10 | ~315,000
FLAME | Validation | 7875 | – | 7875
Table 4. Input tensor size per modality after preprocessing (both datasets).

Dataset | RGB | Night Vision | White | Green
FireStage | (3, 224, 224) | (3, 224, 224) | (3, 224, 224) | (3, 224, 224)
FLAME | (3, 224, 224) | (3, 224, 224) | (3, 224, 224) | (3, 224, 224)
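All four modality tensors in Table 4 share the RGB shape because each synthetic representation is derived from the RGB image itself. The exact pseudo-thermal mappings are not reproduced in this section, so the functions below (`white_hot`, `green_hot`, `night_vision_like`, the luminance weights, gamma, and gain) are illustrative assumptions showing how such modalities can be synthesized from RGB alone without extra sensors.

```python
import numpy as np

def to_gray(rgb):
    """Luminance from an RGB array in [0, 1], shape (H, W, 3) -> (H, W)."""
    return rgb @ np.array([0.299, 0.587, 0.114])

def white_hot(rgb):
    """White-hot style: bright pixels read as 'hot'; grayscale in all channels."""
    g = to_gray(rgb)
    return np.stack([g, g, g], axis=-1)

def green_hot(rgb):
    """Green-hot style: intensity mapped to the green channel only."""
    g = to_gray(rgb)
    z = np.zeros_like(g)
    return np.stack([z, g, z], axis=-1)

def night_vision_like(rgb, gain=1.5):
    """Night-vision-like: gamma-boosted luminance with a green tint.
    The gamma (0.7) and gain are hypothetical parameter choices."""
    g = np.clip(to_gray(rgb) ** 0.7 * gain, 0.0, 1.0)
    return np.stack([0.2 * g, g, 0.2 * g], axis=-1)

if __name__ == "__main__":
    rgb = np.random.default_rng(0).random((224, 224, 3))
    for f in (white_hot, green_hot, night_vision_like):
        assert f(rgb).shape == (224, 224, 3)
```

Each output keeps three channels so that the same ImageNet-pretrained backbones can process every modality unchanged.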
Table 5. Layer structure for the classification stage.

Stage | Layer | Input Size | Output Size | Description
1 | Fully connected | 4d (e.g., 2048) | 256 | Dimensionality reduction
2 | Activation | 256 | 256 | Nonlinear transformation (ReLU function)
3 | Dropout | 256 | 256 | Probability = 0.3
4a | Fully connected | 256 | 2 | Binary classification (No Fire, Fire)
4b | Fully connected | 256 | 3 | Three-class classification (No Fire, Start Fire, Fire)
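The head in Table 5 maps the concatenated feature vector (4d when four d-dimensional modality vectors are fused) down to the class logits. A minimal PyTorch sketch, assuming the paper's layer order and dropout rate:

```python
import torch
from torch import nn

def make_head(feat_dim: int, num_classes: int) -> nn.Sequential:
    """Classification stage of Table 5: FC -> ReLU -> Dropout(0.3) -> FC.
    feat_dim is the fused feature size, e.g. 4 * 512 = 2048 when four
    512-d modality vectors are concatenated."""
    return nn.Sequential(
        nn.Linear(feat_dim, 256),      # stage 1: dimensionality reduction
        nn.ReLU(),                     # stage 2: nonlinearity
        nn.Dropout(p=0.3),             # stage 3: regularization
        nn.Linear(256, num_classes),   # stage 4a/4b: 2 or 3 logits
    )

if __name__ == "__main__":
    head = make_head(2048, 3)                 # FireStage: three classes
    logits = head(torch.randn(8, 2048))
    print(tuple(logits.shape))                # (8, 3)
```

Stage 4a (binary, FLAME) and 4b (three-class, FireStage) differ only in the final layer's output size, so the same constructor serves both datasets.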
Table 6. Performance metrics.

Metric | Definition | Formula
Accuracy | The ratio of examples correctly classified by the model to the total number of examples. | Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision | The proportion of correct predictions among the examples predicted as positive. | Precision = TP / (TP + FP)
Recall | The proportion of true positive examples that are correctly predicted. | Recall = TP / (TP + FN)
F1 Score | The harmonic mean that balances recall and precision; well suited to class-imbalanced data. | F1 = 2 · (Precision · Recall) / (Precision + Recall)
AUC | The area under the Receiver Operating Characteristic (ROC) curve, representing the model's ability to discriminate between classes across all classification thresholds. Higher values indicate better separability. | AUC = ∫₀¹ TPR(t) d(FPR(t))
ROC Curve | A graphical representation of the trade-off between the True Positive Rate (TPR) and the False Positive Rate (FPR) across decision thresholds. | TPR = TP / (TP + FN), FPR = FP / (FP + TN)
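The metrics in Table 6 correspond directly to standard scikit-learn calls; the tiny label vectors below are illustrative only. For the three-class FireStage setting, the paper reports macro averages, which corresponds to passing `average="macro"` to the precision, recall, and F1 calls.

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

y_true  = [0, 0, 1, 1]           # illustrative ground truth
y_pred  = [0, 1, 1, 1]           # hard predictions (TP=2, TN=1, FP=1, FN=0)
y_score = [0.1, 0.6, 0.8, 0.9]   # predicted probability of the positive class

acc  = accuracy_score(y_true, y_pred)    # (TP+TN)/(TP+TN+FP+FN) = 3/4
prec = precision_score(y_true, y_pred)   # TP/(TP+FP) = 2/3
rec  = recall_score(y_true, y_pred)      # TP/(TP+FN) = 2/2 = 1.0
f1   = f1_score(y_true, y_pred)          # 2·P·R/(P+R) = 0.8
auc  = roc_auc_score(y_true, y_score)    # every positive outscores every negative -> 1.0

# Multi-class variant (e.g., FireStage): average="macro" for P, R, F1,
# and multi_class="ovr" with per-class probabilities for roc_auc_score.
```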
Table 7. Distribution of each dataset across the training and test splits.

Dataset | Number of Images | Training Set | Test Set
FireStage | 791 | 632 | 159
FLAME | 39,375 | 31,500 | 7875
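The stratified 80–20% split underlying Table 7 can be reproduced with scikit-learn; the toy label vector and the `random_state` value are illustrative, not taken from the paper.

```python
from sklearn.model_selection import train_test_split

# Toy labels: 60 "no fire" and 40 "fire" samples (illustrative counts).
labels = [0] * 60 + [1] * 40
indices = list(range(len(labels)))

train_idx, test_idx, y_train, y_test = train_test_split(
    indices, labels,
    test_size=0.20,     # 80/20 split as in Table 7
    stratify=labels,    # preserve class proportions in both splits
    random_state=42,    # fixed seed for reproducibility (assumed value)
)

print(len(train_idx), len(test_idx))  # 80 20
print(sum(y_test))                    # 8 -> the 40% class ratio is preserved
```

Stratification matters most for the small FireStage dataset (791 images across three classes), where a plain random split could leave a class underrepresented in the 159-image test set.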
Table 8. Validation performance of the models on the FLAME and FireStage datasets.

Dataset | Backbone | Modality | Acc | Prec | Recall | F1 | AUC | Params | Train Time (s)
FLAME | DenseNet121 | RGB | 0.996571 | 0.997999 | 0.996603 | 0.997300 | 0.999723 | 7.22 M | 1668
FLAME | DenseNet121 | Green | 0.987429 | 0.991976 | 0.988209 | 0.990089 | 0.998736 | 7.22 M | 1271
FLAME | DenseNet121 | Night | 0.983365 | 0.989355 | 0.984412 | 0.986878 | 0.998800 | 7.22 M | 1203
FLAME | DenseNet121 | White | 0.990603 | 0.991820 | 0.993405 | 0.992612 | 0.999256 | 7.22 M | 1223
FLAME | DenseNet121 | RGB + Green | 0.860190 | 0.998977 | 0.780775 | 0.876500 | 0.997340 | 7.48 M | 1804
FLAME | DenseNet121 | RGB + Night | 0.833143 | 0.998918 | 0.738209 | 0.849000 | 0.996452 | 7.48 M | 1806
FLAME | DenseNet121 | RGB + White | 0.995302 | 0.995214 | 0.997402 | 0.996307 | 0.999719 | 7.48 M | 1818
FLAME | DenseNet121 | RGB + N + W + G | 0.943238 | 0.996947 | 0.913469 | 0.953384 | 0.997075 | 8.00 M | 6804
FLAME | EfficientNet-B0 | RGB | 0.995429 | 0.997796 | 0.995004 | 0.996398 | 0.999822 | 4.34 M | 1579
FLAME | EfficientNet-B0 | Green | 0.986032 | 0.994543 | 0.983413 | 0.988947 | 0.998722 | 4.34 M | 1157
FLAME | EfficientNet-B0 | Night | 0.982603 | 0.989342 | 0.983213 | 0.986268 | 0.997983 | 4.34 M | 1227
FLAME | EfficientNet-B0 | White | 0.989333 | 0.992788 | 0.990408 | 0.991597 | 0.998870 | 4.34 M | 1233
FLAME | EfficientNet-B0 | RGB + Green | 0.810794 | 0.998864 | 0.703038 | 0.825240 | 0.995737 | 4.66 M | 1718
FLAME | EfficientNet-B0 | RGB + Night | 0.841651 | 0.996301 | 0.753597 | 0.858118 | 0.993732 | 4.66 M | 1744
FLAME | EfficientNet-B0 | RGB + White | 0.994032 | 0.995007 | 0.995604 | 0.995305 | 0.999671 | 4.66 M | 1723
FLAME | EfficientNet-B0 | RGB + N + W + G | 0.846730 | 0.996600 | 0.761391 | 0.863260 | 0.994949 | 5.32 M | 6519
FLAME | ResNet18 | RGB | 0.991746 | 0.997382 | 0.989608 | 0.993480 | 0.999608 | 11.31 M | 1548
FLAME | ResNet18 | Green | 0.973206 | 0.992398 | 0.965228 | 0.978624 | 0.997508 | 11.31 M | 1157
FLAME | ResNet18 | Night | 0.974222 | 0.974126 | 0.985612 | 0.979835 | 0.996641 | 11.31 M | 1300
FLAME | ResNet18 | White | 0.989206 | 0.987512 | 0.995604 | 0.991541 | 0.998954 | 11.31 M | 1186
FLAME | ResNet18 | RGB + Green | 0.962413 | 0.944486 | 0.999600 | 0.971262 | 0.995450 | 11.44 M | 1723
FLAME | ResNet18 | RGB + Night | 0.975873 | 0.977959 | 0.984213 | 0.981076 | 0.997171 | 11.44 M | 1896
FLAME | ResNet18 | RGB + White | 0.987556 | 0.981169 | 0.999600 | 0.990299 | 0.999438 | 11.44 M | 1707
FireStage | DenseNet121 | RGB | 0.937107 | 0.923737 | 0.931411 | 0.927202 | 0.991925 | 7.22 M | 176
FireStage | DenseNet121 | Green | 0.805031 | 0.811814 | 0.777452 | 0.789884 | 0.930476 | 7.22 M | 99
FireStage | DenseNet121 | Night | 0.767296 | 0.769202 | 0.735745 | 0.745529 | 0.922191 | 7.22 M | 104
FireStage | DenseNet121 | White | 0.798742 | 0.816472 | 0.755947 | 0.772497 | 0.927145 | 7.22 M | 100
FireStage | DenseNet121 | RGB + Green | 0.748428 | 0.787677 | 0.788856 | 0.737219 | 0.968725 | 7.48 M | 129
FireStage | DenseNet121 | RGB + Night | 0.842767 | 0.831060 | 0.854350 | 0.827998 | 0.974003 | 7.48 M | 132
FireStage | DenseNet121 | RGB + White | 0.930818 | 0.925240 | 0.920658 | 0.922773 | 0.989529 | 7.48 M | 129
FireStage | DenseNet121 | RGB + N + W + G | 0.893082 | 0.874812 | 0.900782 | 0.882311 | 0.979556 | 8.00 M | 441
FireStage | EfficientNet-B0 | RGB | 0.905660 | 0.890713 | 0.889052 | 0.889729 | 0.966492 | 4.34 M | 170
FireStage | EfficientNet-B0 | Green | 0.773585 | 0.773148 | 0.736722 | 0.744017 | 0.913076 | 4.34 M | 100
FireStage | EfficientNet-B0 | Night | 0.773585 | 0.763612 | 0.764255 | 0.755687 | 0.922556 | 4.34 M | 102
FireStage | EfficientNet-B0 | White | 0.798742 | 0.791329 | 0.768003 | 0.774298 | 0.916783 | 4.34 M | 101
FireStage | EfficientNet-B0 | RGB + Green | 0.729560 | 0.774043 | 0.772890 | 0.724710 | 0.942445 | 4.66 M | 128
FireStage | EfficientNet-B0 | RGB + Night | 0.867925 | 0.848146 | 0.868524 | 0.854238 | 0.960164 | 4.66 M | 131
FireStage | EfficientNet-B0 | RGB + White | 0.930818 | 0.915282 | 0.915282 | 0.915282 | 0.961032 | 4.66 M | 125
FireStage | EfficientNet-B0 | RGB + N + W + G | 0.861635 | 0.841486 | 0.864451 | 0.844856 | 0.960082 | 5.32 M | 416
FireStage | ResNet18 | RGB | 0.924528 | 0.934972 | 0.903877 | 0.916292 | 0.984546 | 11.31 M | 169
FireStage | ResNet18 | Green | 0.729560 | 0.743901 | 0.710166 | 0.720018 | 0.871148 | 11.31 M | 97
FireStage | ResNet18 | Night | 0.723270 | 0.724347 | 0.706419 | 0.702695 | 0.874561 | 11.31 M | 104
FireStage | ResNet18 | White | 0.742138 | 0.720044 | 0.738514 | 0.715710 | 0.900683 | 11.31 M | 97
FireStage | ResNet18 | RGB + Green | 0.874214 | 0.851534 | 0.861193 | 0.854235 | 0.963100 | 11.44 M | 124
FireStage | ResNet18 | RGB + Night | 0.823899 | 0.804250 | 0.821114 | 0.806133 | 0.943965 | 11.44 M | 132
FireStage | ResNet18 | RGB + White | 0.905660 | 0.911584 | 0.867221 | 0.880889 | 0.978778 | 11.44 M | 122
FireStage | ResNet18 | RGB + N + W + G | 0.880503 | 0.863298 | 0.890681 | 0.867819 | 0.962310 | 11.70 M | 416
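The multimodal rows in Table 8 use feature-level concatenation: one backbone per modality, with the resulting feature vectors joined before the classification head. The sketch below illustrates that wiring; `TinyBackbone` is a hypothetical stand-in for the ImageNet-pretrained networks (ResNet18, EfficientNet-B0, DenseNet121) so the example runs self-contained without downloading weights.

```python
import torch
from torch import nn

class TinyBackbone(nn.Module):
    """Hypothetical stand-in for a pretrained CNN; emits one feature vector."""
    def __init__(self, feat_dim: int = 64):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv2d(16, feat_dim, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),   # global average pooling
            nn.Flatten(),              # (B, feat_dim)
        )

    def forward(self, x):
        return self.features(x)

class MultimodalFireNet(nn.Module):
    """One backbone per modality; features concatenated before the head."""
    def __init__(self, num_modalities: int, num_classes: int, feat_dim: int = 64):
        super().__init__()
        self.backbones = nn.ModuleList(
            TinyBackbone(feat_dim) for _ in range(num_modalities))
        self.head = nn.Sequential(               # Table 5 head structure
            nn.Linear(num_modalities * feat_dim, 256),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(256, num_classes),
        )

    def forward(self, xs):
        # xs: list of (B, 3, 224, 224) tensors, one per modality
        feats = torch.cat([bb(x) for bb, x in zip(self.backbones, xs)], dim=1)
        return self.head(feats)

if __name__ == "__main__":
    model = MultimodalFireNet(num_modalities=4, num_classes=2)  # RGB + N + W + G
    xs = [torch.randn(2, 3, 224, 224) for _ in range(4)]
    print(tuple(model(xs).shape))  # (2, 2)
```

The parameter counts in Table 8 grow with each added modality for the same reason: every extra branch contributes its own feature extractor plus a wider first fully connected layer.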
Table 9. Comparative performance analysis of the proposed multimodal DenseNet121-based framework with state-of-the-art fire detection studies on the FLAME and FireStage datasets.

Study | Dataset | Method | Modality | Classes | Accuracy (%)
Benzekri, W. et al. (2020) [2] | Fire Detection (image sequence) | LSTM/GRU | RGB (Sequence) | 2 | 99.89
Goncalves, A.M. et al. (2024) [9] | FLAME Dataset | DenseNet + CNN | RGB | 2 | 99.70
Wang, X. et al. (2024) [13] | Fire Detection (public wildfire smoke datasets) | SE-ResNet + SVM | RGB | 2 | 98.99
Arteaga, B. et al. (2020) [25] | Fire Detection (image dataset) | CNN (ResNet + VGG) | RGB | 2 | 99.50
Mohammed, R.K. (2022) [26] | Fire Detection (image dataset) | Inception-ResNet | RGB | 2 | 99.09
Mohnish, S. et al. (2022) [27] | Fire Detection (image dataset) | CNN Detection | RGB | 2 | 92.20
This Study (Taşar et al.) | FireStage Dataset | DenseNet121 | RGB | 3 | 93.71
This Study (Taşar et al.) | FireStage Dataset | DenseNet121 | RGB + White | 3 | 93.08
This Study (Taşar et al.) | FireStage Dataset | DenseNet121 | RGB + Night + White + Green | 3 | 89.31
This Study (Taşar et al.) | FLAME Dataset | DenseNet121 | RGB | 2 | 99.66
This Study (Taşar et al.) | FLAME Dataset | DenseNet121 | RGB + White | 2 | 99.53
This Study (Taşar et al.) | FLAME Dataset | DenseNet121 | RGB + Night + White + Green | 2 | 94.32

Taşar, B.; Tatar, A.B.; Tanyildizi, A.K.; Yakut, O. Multimodal Wildfire Classification Using Synthetic Night-Vision-like and Thermal-Inspired Image Representations. Fire 2026, 9, 109. https://doi.org/10.3390/fire9030109
