CN2VF-Net: A Hybrid Convolutional Neural Network and Vision Transformer Framework for Multi-Scale Fire Detection in Complex Environments

Ahmad, Naveed; Akbar, Mariam; Alkhammash, Eman H.; Jamjoom, Mona M.

doi:10.3390/fire8060211

Open AccessArticle

CN2VF-Net: A Hybrid Convolutional Neural Network and Vision Transformer Framework for Multi-Scale Fire Detection in Complex Environments

by

Naveed Ahmad

¹

,

Mariam Akbar

^1,*

,

Eman H. Alkhammash

²

and

Mona M. Jamjoom

³

¹

Department of Computer Science, COMSATS University Islamabad, Islamabad 44000, Pakistan

²

Department of Computer Science, College of Computers and Information Technology, Taif University, P.O. Box 11099, Taif 21944, Saudi Arabia

³

Department of Computer Sciences, College of Computer and Information Sciences, Princess Nourah bint Abdulrahman University, P.O. Box 84428, Riyadh 11671, Saudi Arabia

^*

Author to whom correspondence should be addressed.

Fire 2025, 8(6), 211; https://doi.org/10.3390/fire8060211

Submission received: 25 March 2025 / Revised: 29 April 2025 / Accepted: 8 May 2025 / Published: 26 May 2025

Download

Browse Figures

Versions Notes

Abstract

Fire detection remains a challenging task due to varying fire scales, occlusions, and complex environmental conditions. This paper proposes the CN2VF-Net model, a novel hybrid architecture that combines vision Transformers (ViTs) and convolutional neural networks (CNNs), effectively addressing these challenges. By leveraging the global context understanding of ViTs and the local feature extraction capabilities of CNNs, the model learns a multi-scale attention mechanism that dynamically focuses on fire regions at different scales, thereby improving accuracy and robustness. The evaluation on the D-Fire dataset demonstrate that the proposed model achieves a mean average precision at an IoU threshold of 0.5 (mAP50) of 76.1%, an F1-score of 81.5%, a recall of 82.8%, a precision of 83.3%, and a mean IoU (mIoU50–95) of 77.1%. These results outperform existing methods by 1.6% in precision, 0.3% in recall, and 3.4% in F1-score. Furthermore, visualizations such as Grad-CAM heatmaps and prediction overlays provide insight into the model’s decision-making process, validating its capability to effectively detect and segment fire regions. These findings underscore the effectiveness of the proposed hybrid architecture and its applicability in real-world fire detection and monitoring systems. With its superior performance and interpretability, the CN2VF-Net architecture sets a new benchmark in fire detection and segmentation, offering a reliable approach to protecting life, property, and the environment.

Keywords:

convolutional neural networks (CNNs); vision Transformers (ViTs); D-Fire; Grad-CAM; multi-scale; attention mechanism

1. Introduction

Forest and urban fires are significant threats to the ecosystem, human life, and property, whose frequency and severity rise due to climate change and anthropogenic reasons. The fires have catastrophic effects, such as in January 2025, when Los Angeles County was hit with devastating fires, including the Palisades and Eaton fires, which collectively burned more than 37,000 acres of land. These fires led to the loss of about 16,000 homes and business establishments and resulted in at least 29 casualties. Also, in 2023, the Indonesian Bromo fire, which incinerated over 504 hectares of forest cover, and the 2023 Yinchuan gas leakage explosion in China point to the need for advanced fire and smoke detection technology [1]. Traditional fire detection technology, such as manual checks, sensor-based technology, and satellite monitoring, is typically plagued by disadvantages including high expenses, delayed responses, and susceptibility to environmental conditions [2,3,4]. In contrast to conventional sensor-based systems that tend to demand considerable infrastructure and maintenance overhead, CN2VF-Net provides a scalable and infrastructure-light approach. The model is optimized to balance between accuracy and computational efficiency, making it possible for real-world deployment. With a model size of around 30.62 MB, CN2VF-Net is still lightweight compared with other Transformer-based models. It was trained with a relatively small configuration (batch size of 16, image resolution of 416 × 416) on two GPUs, which demonstrates its potential for high-performance training. In addition, because of its modular hybrid structure and multi-scale feature fusion, the model inference can be optimized for resource-limited devices such as edge GPUs (Jetson Nano or Xavier). This renders CN2VF-Net as an attractive option for operational deployments in bandwidth-constrained or remote environments, where conventional sensor-based systems prove to be too cumbersome.

The advent of deep learning (DL) and computer vision (CV) technologies has transformed the scenario of fire and smoke detection, overcoming the majority of the disadvantages of traditional methods. DL models, particularly convolutional neural networks (CNNs), have been observed to perform exceptionally well in image classification, object detection, and semantic segmentation tasks, hence being well suited for processing fire and smoke images derived from drones and surveillance systems [5,6]. The existing approaches employed embedded machine learning (ML) models on low-power Internet of Things (IoT) devices for detecting forest fires, as well as ML algorithms for creating probabilistic maps that represent the probability of forest fires [7,8,9]. However, with all this progress, challenges such as detecting small fire points, handling complex backgrounds, and real-time processing are still relevant, and therefore, there is a requirement for more efficient detection mechanisms [10].

To further enhance the precision of fire detection and enable real-time response, recent object detection algorithm developments, specifically the YOLO (You Only Look Once) series, have been explored. For instance, YOLOv11 has addressed some of the shortcomings in earlier versions, such as low-contrast detection and high false positives, and therefore is now a first-line option for real-time fire detection [11,12]. However, regardless of these advancements resulting in higher accuracy, the complexities of natural scenes, in which smoke and fire often co-occur, remain significant challenges. Solutions for these, such as multi-scale feature fusion, attention mechanisms, and data augmentation, have been shown to increase model robustness and flexibility [13,14].

In addition to advanced DL models, unmanned aerial vehicles (UAVs) equipped with RGB cameras has been identified as an effective and cost-efficient solution for large-scale fire monitoring. UAVs can cover extensive areas, take high-resolution images, and transmit data to ground stations for real-time processing, which facilitates the instantaneous detection and localization of fire events [15,16,17]. Although UAV-based fire detection has its advantages, it is susceptible to environmental factors, small fire patches in high-altitude images, and the need for global feature extraction. In an attempt to counter these, researchers have explored hybrid approaches that combine the shape, color, and motion features of fire with advanced deep learning (DL) architectures, such as vision Transformers (ViTs), to improve global information extraction [18,19]. Recent developments in forest fire detection have presented new DL frameworks designed for real-time UAV-based monitoring. U3UNet [20], a full-scale connected nested U-shaped model, improves multi-scale feature fusion to retain global and fine-grained fire features, surpassing U-Net, UNet 3+, and Yolov9 in different fire cases while being effective on edge computing UAV platforms. Analogously, [21] blends the advantages of Vision Transformers and CNNs with MobileViT, CBAM attention, Dense ASPP, and SP pooling, allowing for strong and precise fire region segmentation. The models evidence considerable advancement towards real-time, dependable forest fire observation, with potential extension to more general environmental and agriculture contexts.

Despite these advances in technology, the fire detection issue remains a timely concern because uncontrolled fires impose disastrous ecological, economic, and human impacts. (1) There is the inherent challenge of detecting tiny fire sources and sparse smoke in high-altitude images, where fire features are only a small subset of the entire image, (2) there is the challenge of distinguishing fire and smoke from background noise such as fog, sunlight, or other visual attributes, which easily leads to false positive rates with high values, and (3) there is the need for more generalization across complex and diverse backgrounds, as models struggle to adapt to changing environment conditions. These issues highlight the need to develop sophisticated techniques that can perform efficient feature extraction and accurate detection in complex situations.

The Key Contribution of This Study

In this article, we address key problems in fire and smoke detection by introducing a novel convolutional neural network and vision transformation framework (CN2VF-Net), a hybrid architecture with multi-scale attention mechanisms. The key contributions are as follows:

CN2VF-Net Architecture

We propose a hybrid model that leverages the advantages of CNNs and ViTs. The CNN backbone (EfficientNetB0) captures hierarchical spatial features, and the Transformer encoder detects global context and long-range dependencies. This collaborative combination enables the model to easily detect small fire points and thin smoke in high-altitude images, which are usually challenging for conventional approaches. The integration of CNN and Transformer features enables robust feature extraction from complex backgrounds and significantly enhances detection accuracy.

2.: Dynamic Multi-Scale Attention Mechanism

To mitigate the problems of environmental noise and false positives caused by fog, sunlight, or other visual patterns, we propose a dynamic multi-scale attention mechanism. This module handles features across various scales and uses attention weights to highlight key areas while suppressing irrelevant background noise. The attention mechanism is incorporated within the decoder to allow the model to refine feature maps and enhance localization precision adaptively. The method greatly reduces false positives and improves the model’s ability to distinguish between fire, smoke, and environmental noise.

These advances individually overcome the limitations of current fire and smoke detection techniques, including poor generalization over complicated backgrounds, low performance on small targets, and excessive false positives. The proposed framework achieves state-of-the-art performance on the D-Fire dataset, indicating its robustness in real-world fire monitoring applications.

2. Related Works

Fire detection is a vital area of research as it helps prevent and reduce catastrophes. Various methods have been proposed, such as CNN-based models, Transformer-based models, hybrid models, and IoT-based approaches. Most of the research focuses on improving fire detection accuracy, reducing false alarms, and enhancing real-time performance. In this section, we discuss the current developments in fire detection techniques, including their approaches, strengths, and limitations.

2.1. Deep Learning-Based Fire Detection Techniques

Deep learning has transformed fire detection over the past decade, enabling it to learn intricate patterns directly from image data. This section discusses models constructed using convolutional neural networks (CNNs), Transformers, and hybrid deep learning approaches. These techniques are designed to improve detection accuracy, cope with occlusions, and adapt to different scales of fire and appearance, particularly in complex visual environments such as forests, urban areas, and aerial monitoring.

One of the key challenges in fire detection is precisely identifying fires against complex backgrounds, especially when conditions such as smoke, fog, or harsh reflections lead to misclassification. Researchers have studied Transformer-based architectures to solve the problem. Transformers can deal with long-distance dependencies and enhance the multi-scale representation of features. Liu et al. [22] proposed TFNet, a model that incorporates a multi-scale feature fusion scheme with an SRModule and CG-MSFF encoder to enhance fire localization. While these contributions improve the model’s performance, it is prone to background noise and has difficulties distinguishing between smoke and fog, which restricts its real-world applicability. Yu et al. [23] addresses the urgent issue of real-time forest fire detection in cluttered, dynamic natural scenes, early identification of which is essential for efficient disaster response. The conventional models tend to suffer from slow inference and inaccurate detection in complicated scenes. In an effort to overcome these challenges, the authors introduce Fire-PPYOLOE, a highly optimized version of the PP-YOLOE architecture. It combines large kernel convolutions and CSPNet (Cross Stage Partial Network) blocks to enhance both the receptive field and feature learning capability, so the model can better classify fire/smoke and those look-alike backgrounds. its limitation include minimal testing in various weather conditions, terrain, and vegetation types, which could affect generalizability. The paper also fails to extensively explore false positives due to fire-like objects such as sunsets, automobile lights, a potential problem in actual deployment.

Although Transformers provide robust feature extraction, CNN-based models continue to dominate fire detection with their efficiency and real-time processing. Zhao et al. [24] introduced SF-YOLO, an efficient CNN-based model that employs dual-path residual attention and W-SIoU loss to boost small fire and occluded object detection. Discrimination of fire from fire-like objects like streetlights or sunsets, however, is still a problem. Song et al. [25] attempted to make CNNs more efficient through knowledge distillation, where a high-capacity model shared its learned representations with a more efficient version. Although this method boosts computational efficiency, it is still very hyperparameter-sensitive and does not address false detections in dynamic scenes. These studies reflect the development of CNN-based models towards more efficient designs, but also their ongoing limitations in handling complex backgrounds and challenging fire scenarios.

2.2. Smart, Lightweight, and Real-Time Fire Detection Systems

To satisfy the needs for rapid, efficient, and deployable fire detection in the real world, experts have focused on lightweight designs and intelligent systems. This subsection summarizes low-power device-optimized models, IoT-integrated fire monitoring, and systems for dynamic, resource-limited environments. Real-time operation, lower model complexity, and embedded hardware integration are highlighted for large-area, autonomous fire surveillance.

Recognizing that CNNs alone may be insufficient for handling fire development over time, researchers then turned to temporal modeling and IoT integration. Zhang et al. [26] developed AID-Fire, an IoT-based system via Conv-LSTM combined with real-time sensor data to track fire development. While this system works effectively in large-scale fire events, accuracy suffers in highly dynamic environments, particularly in distinguishing between fire suppression effects and active fire development. Similarly, Wang et al. [27] developed FDY-YOLO, integrating RFID technology to monitor fire in aircraft cargo compartments. This real-time detection system maximizes camera placement to ensure improved coverage, but it remains susceptible to environmental factors like airflow interference and requires specialized hardware for deployment. Conversely, Venâncio et al. [28] countered false alarms by integrating CNN-based fire detection with object tracking, removing non-fire objects such as fog and artificial lighting. However, it is difficult to identify small or incipient fires, particularly in rapidly evolving environments. These studies illustrate the growing significance of IoT and temporal analysis in fire detection, but also highlight inherent vulnerabilities, such as environmental susceptibility and the inability to distinguish between fire and non-fire heat sources.

One of the most important challenges in fire detection is ensuring that models are effective in resource-constrained environments, such as embedded systems or low-power devices. Venâncio et al. [29] optimized YOLOv4 with CNN filter pruning, resulting in a reduction of computational costs for the deployment on the Raspberry Pi 4. Although this method has an acceptable level of accuracy, it is poor at detecting small fires and misclassifying complex backgrounds. Likewise, Chen et al. [30] introduced GS-YOLOv5, which integrates Super-SPPF and coordinate attention (CA) to improve feature extraction while being efficient. Even with these enhancements, detecting fires at night remains a major limitation, mainly due to issues with illuminant misclassification. Peng et al. [31] followed up on this by optimizing YOLOv8 for neural processing units (NPUs), specifically allowing for real-time fire detection on endpoint devices. However, even with enhanced computational efficiency, issues of occlusion and hardware dependency persist. Overall, these studies suggest a continued trade-off between enhancing model accuracy and ensuring lightweight performance for real-time fire detection, highlighting the need for hybrid architectures that balance efficiency and effectiveness.

Apart from spatial modeling, multi-modal approaches with motion-based analysis have been shown to enhance fire detection robustness. Gragnaniello et al. [32] presented FLAME, a deep neural network and motion filtering-based approach to improve fire detection in video streams. However, although it is successful in limiting false positives, detecting small fires over long distances is an issue. Malebary [33] built upon this by presenting IS-CNN-LSTM, which uses instance segmentation and combines CNNs and LSTMs to monitor fire development over time. This model is compatible with IoT-based alerting and early detection, but it is susceptible to low-light environments and is unable to detect small fires in cluttered areas. A thorough evaluation of fire detection models was performed in the ONFIRE 2023 competition, reviewed by Gragnaniello et al. [34], which revealed intrinsic gaps in dataset uniformity, temporal validation, and computational efficiency. These results indicate the potential of hybrid and multi-modal approaches and highlight the need for adaptability to varied real-world scenarios.

Building on these advancements, recent research continues to develop fire detection models by incorporating new techniques to improve accuracy and generalizability. Huang et al. [35] extend the frontiers of wildfire detection with a light YOLOv8-based model, integrating GS-HGNetV2 and GhostConv to improve feature extraction while preserving efficiency. Their method, however, is challenged by night-time smoke detection and mist misclassifications as smoke. Titu et al. [36] proposed a drone-based real-time and efficient fire detection system using edge devices such as the Raspberry Pi 5. The system uses light-weight deep models like YOLOv8n and optimizes them through knowledge distillation to yield high detection accuracy. Despite being efficient, the system is still bound by the limited processing speed and model performance caused by the limited computational power of edge devices, especially under challenging environmental conditions like smoke, night-time detection, or occlusions. The model robustness and real-time inference under such challenging conditions must be improved further.

Building on these methods, Kumar et al. [37] proposes two light-weight deep learning models tailored for early wildfire and smoke detection, optimized for execution on drones with constrained computational capabilities. Relying on the DeepLabv3+ architecture and the use of light-weight CNNs and vision transformers, the models attain effective segmentation and classification performance on heterogeneous datasets such as FLAME, SMOKE5K, AI-For-Mankind, with superior potential for real-world wildfire monitoring. The models, though efficient, could be plagued by compromised accuracy in challenging visual conditions like low light, fog, or partial occlusions. Moreover, real-world validation in heterogeneous geographic locations and drone hardware variations is also limited, and additional field testing and adaptation are required.

Wu et al. [38] expand the scope of fire detection to UAV-based surveillance using FSNet, which utilizes YOCO data augmentation and attention mechanisms to enhance accuracy, although it is still sensitive to altitude changes and interference from fire-like objects. Venâncio et al. [39] turn attention to hybrid detection by integrating spatial CNN-based detection with temporal analysis to enhance fire detection processes and minimize false positives. However, this method is challenged by dynamic backgrounds and detecting small-scale fires. Kim et al. [40] propose a domain-free method that utilizes YOLOv5 and spatial-temporal attention mechanisms, which enhances generalization across various fire scenarios but still challenges it in small fires and fast-changing environments. Abdusalomov et al. [41] optimize YOLOv3 for real-time surveillance by incorporating logistic classifiers and data augmentation to enhance robustness. However, they are still challenged by detecting small fires at a distance and differentiating them from bright light sources.

The existing literature shows good performance in fire detection using DL but shows some challenges. Most of the existing models are incapable of detecting small and early fires, which makes them less applicable for timely response. They also struggle to identify fire from similar objects, such as fog, sunlight, and artificial light, and therefore produce false alarms. Background changes and varying environmental conditions, such as moving smoke and varying light, also make detection less robust in most cases. Most methods emphasize either space or time features but are not capable of combining both in a practical way, which limits their ability to fully comprehend fire behavior. To overcome these challenges, a stronger method is required to enhance fire location, separate fire from smoke, and provide robust detection in difficult real-world conditions.

3. Methodology

In this study, we proposed a CN2VF-Net architecture as shown in Figure 1 for precise fire detection. The key goal of this work is to build a model capable of fire detection and segmentation in realistic environments, specifically designed to meet the challenges presented by fire scales, occlusions, and environmental conditions. Other conventional methods face struggles when fire scales, occlusions, or other environmental effects are involved. CN2VF-Net uses a fusion of ViTs and CNNs in its hybrid architecture to handle the above limitations. The ViT part exploits long-range dependency and global information, while the CNN part has access to spatial hierarchies and local details. The integration enables the model to detect fire regions and segment them effectively with respect to scale and conditions. The model uses a multi-scale attention mechanism that adaptively pays attention to high-impact areas, enhancing the robustness and accuracy of results.

3.1. CN2VF-Net Architecture

The proposed model employs a hybrid architecture of ViTs and CNNs to address the intricacies in fire detection and segmentation. The input image is first split into patches, which are embedded into a high-dimensional space by a patch embedding layer, followed by positional encoding to maintain spatial information. The patch embeddings are passed through a Transformer encoder, which utilizes multi-head self-attention mechanisms to capture global context and long-range dependencies efficiently. In parallel, the CNN backbone, EfficientNetB0, is used to extract multi-scale local features, thereby capturing spatial hierarchies and fine-grained details. The outputs produced by both the Transformer and CNN are combined within a feature fusion module, which combines global and local information. After fusion, the fused features are passed through a decoder that upsamples the feature maps to the original image resolution, leading to the output of a segmentation mask. A multi-scale attention mechanism is used to dynamically modulate the model’s attention based on the fire region scale, thereby improving its ability to detect fires of different sizes. The hybrid approach combines the strengths of both Transformers and CNNs, enabling robust and accurate detection of fires under a broad range of conditions.

Figure 1. Proposed system model of CN2VF-Net.

3.1.1. Patch Embedding

Input image

I \in R^{H \times W \times C}

is split into N distinct patches of size

P \times P

, where H and W are the height and width of the image and C is the number of channels (typically 3 for RGB images). Each patch

I_{p} \in R^{P \times P \times C}

is flattened into a vector

x_{p} \in R^{P^{2} \times C}

. These vectors are then projected into a higher-dimensional space using a learnable projection matrix

E \in R^{P^{2} \times C \times D}

, resulting in patch embeddings

z_{p} \in R^{D}

:

z_{p} = E x_{p} + b

(1)

where D is the dimension of embedding, which is also a hyperparameter. The patch embeddings are then passed through a positional embedding layer in order to preserve spatial information:

z_{p} = z_{p} + PositionalEmbedding (p)

(2)

where p denotes the position of the patch in the image grid.

3.1.2. Transformer Encoder

The patch embeddings

z_{p}

are fed into a sequence of Transformer layers. Every layer contains multi-head self-attention (MSA) and a feed-forward network (FFN). The self-attention mechanism computes attention scores A between all patches:

A = softmax (\frac{Q K^{T}}{\sqrt{D}})

(3)

where Q, K, V are the query, key, and value matrices, respectively, computed as:

Q = z_{p} W_{Q}, K = z_{p} W_{K}, V = z_{p} W_{V}

(4)

Here,

W_{Q}

,

W_{K}

, and

W_{V}

are learnable weight matrices. The output of the MSA is combined with the input embeddings using a residual connection and layer normalization (LN):

z_{p}^{'} = LN (z_{p} + MSA (z_{p}))

(5)

The FFN processes the output of the MSA:

z_{p}^{″} = LN (z_{p}^{'} + FFN (z_{p}^{'}))

(6)

The FFN consists of two fully connected layers with a GELU activation function:

FFN (z_{p}^{'}) = W_{2} \cdot GELU (W_{1} \cdot z_{p}^{'} + b_{1}) + b_{2}

(7)

where the

W_{1}

,

W_{2}

,

b_{1}

, and

b_{2}

are learnable parameters.

3.1.3. CNN Backbone (EfficientNetB0)

The backbone of CNN (EfficientNetB0) extracts multi-scale information from the input image. The CNN consists of multiple convolutional layers in sequence, each of which is followed by batch normalization and ReLU activation [42]. The output features at various scales are represented as

C_{1}, C_{2}, C_{3}, C_{4}

, representing low, mid, high, and global levels of feature abstraction. These features extract local patterns and spatial hierarchies, which are important for precise segmentation. The multi-scale properties are achieved through the application of convolutional operations at various scales. For instance, the low-level features

C_{1}

are derived from early layers of the CNN, representing fine-grained information, and the high-level features

C_{4}

are derived from deeper layers, representing abstract and global information.

3.1.4. Feature Fusion Module

The Transformer encoder and CNN encoder outputs are fused to merge global and local information. The Transformer features

z_{p}^{″}

are reshaped and upsampled to have the same spatial dimensions as the CNN features. The combination is performed through concatenation followed by a convolutional layer:

F = Conv (Concat (z_{p}^{″}, C_{3}, C_{4}))

(8)

Here, the

F \in R^{H^{'} \times W^{'} \times D^{'}}

represents the fused features, where

H^{'}

and

W^{'}

are the spatial dimensions and

D^{'}

is the number of channels.

3.1.5. Multi-Scale Attention

The multi-scale attention mechanism is used to dynamically modulate the model’s attention based on the scale of the fire regions. The attention mechanism calculates attention maps

A_{s}

at various scales s:

A_{s} = Conv (Resize (F, s))

(9)

The final output is a weighted sum of the attention maps:

O = \sum_{s} w_{s} \cdot A_{s}

(10)

where

w_{s}

are learnable weights.

3.1.6. Decoder

The combined features F are fed through a decoder network that upsamples the feature maps to the original image’s resolution [43]. The decoder employs transposed convolutions and skip connections to enhance the segmentation mask. The upsampling can be defined as:

D = ConvTranspose (F)

(11)

The final output is a segmentation mask

M \in R^{H \times W \times 1}

, where each pixel value represents the probability of fire presence.

The proposed method provides a strong and novel technique for detecting and segmenting fires using ViTs and CNNs. The hybrid architecture efficiently solves problems related to different fire sizes, occlusion, and complex environments by taking advantage of the broad contextual vision of Transformers and the detail-based capabilities of CNNs. The addition of a multi-scale attention mechanism improves the model’s ability to attend to fire regions of varying sizes, making it more robust and accurate.

4. Experimental Setup

This section provides an in-depth exploration of dataset preparation, preprocessing techniques, training configurations, evaluation metrics, and the computational environment is defined as follows.

4.1. Dataset Collection

The experiments in this work are performed on the publicly released D-Fire dataset [29], which is available at [44]. The dataset includes a total of 21,527 labeled images covering various situations, such as fires, smoke, and the presence of both. For the intents of this research, the dataset was filtered to include only forest fire and smoke situations, creating a subset of 9869 images. This subset was selected to ensure that the model is trained and tested on data of the highest relevance to the target application of fire detection and segmentation within forest settings. The dataset contains bounding boxes, which are needed to train the model to detect and segment fire and smoke areas accurately. The richness of situations and high-quality annotations in the D-Fire dataset make it an excellent source for building strong fire detection and segmentation models.

4.2. Dataset Preprocessing

The data preprocessing stage is the most crucial step in preparing the data to train the model. It consists of multiple operations to format the images, along with their respective labels, for the model. To begin with, all the images are resized to a standard size of 416 × 416 pixels to have uniform input dimensions. The resizing procedure is carried out through bilinear interpolation [45], which is widely used for scaling images without compromising their visual quality. Bilinear interpolation determines the pixel values in the resized image by computing the weighted average of the four most proximate pixels in the source image. Bilinear interpolation in resizing an image can be derived as:

I_{resized} (x, y) = \sum_{i = 0}^{1} \sum_{j = 0}^{1} I (⌊ x^{'} ⌋ + i, ⌊ y^{'} ⌋ + j) \cdot (1 - | x^{'} - (⌊ x^{'} ⌋ + i) |) \cdot (1 - | y^{'} - (⌊ y^{'} ⌋ + j) |)

(12)

where

I_{resized} (x, y)

is the pixel value at coordinates

(x, y)

in the resized image. I is the original image.

x^{'} = x \cdot \frac{W_{original}}{W_{resized}}

(13)

y^{'} = y \cdot \frac{H_{original}}{H_{resized}}

(14)

These are the corresponding coordinates in the original image.

⌊ \cdot ⌋

denotes the floor function, which rounds down to the nearest integer. This resizing process ensures that the images are resized to the target resolution with minimal artifacts and maximum visual quality. Following resizing, the pixel values of all the images are normalized to the

[0, 1]

range by dividing each pixel value by 255. The value 255 refers to the maximum possible intensity value for an 8-bit grayscale image, where pixel values range from 0 to 255. This normalization process is essential for stabilizing the training process and achieving faster convergence. The normalization process is given by:

I_{normalized} = \frac{I_{resized}}{255}

(15)

Aside from resizing and normalization, the ground truth labels, in the form of bounding boxes, are transformed to fit the resized image sizes. In the case of segmentation, the masks are resized to

416 \times 416

pixels through nearest-neighbor interpolation to maintain the binary property of the masks. This ensures that the labels remain accurate and synchronized with the resized images.

To further augment the dataset and make the model more robust, data augmentation methods are used. The techniques used are random horizontal flipping, random rotation within a certain range, random brightness and contrast adjustments, random blurring, and CLAHE (contrast-limited adaptive histogram equalization). These augmentations are performed for both the images and their respective labels to ensure consistency. For instance, if an image is horizontally flipped, its corresponding mask is also flipped horizontally. This process of augmentation enhances the diversity of the training data, allowing the model to generalize more effectively to new data.

The preprocessing pipeline ensures that the dataset is clean, uniform, and training-ready. By resizing, normalizing, and augmenting the images, the model receives high-quality input data, which is crucial for obtaining accurate and robust fire detection and segmentation.

4.3. Model Training and Configuration

The model is trained using an integration of Focal loss and Dice loss to address class imbalance and improve segmentation accuracy [46]. The Dice loss is defined as:

L_{Dice} = 1 - \frac{2 \sum_{i} y_{i} {\hat{y}}_{i}}{\sum_{i} y_{i} + \sum_{i} {\hat{y}}_{i}}

(16)

where

y_{i}

and

{\hat{y}}_{i}

are the ground truth and predicted segmentation masks, respectively. The Focal loss is defined as:

L_{Focal} = - α {(1 - {\hat{y}}_{i})}^{γ} log ({\hat{y}}_{i})

(17)

where

α

and

γ

are hyperparameters. The total loss is a weighted combination of the two losses:

L_{Total} = λ L_{Dice} + (1 - λ) L_{Focal}

(18)

The Adam optimizer is used with a learning rate of

1 \times 10^{- 4}

. Training is conducted for 40 epochs with early stopping and learning rate reduction on the plateau to avoid overfitting.

4.4. Evaluation Metrics

To thoroughly evaluate the performance of the proposed model, several metrics are used. The metrics are specifically designed to examine different aspects of the model’s performance, including how efficiently it identifies and separates fire regions and its robustness under diverse conditions.

4.5. Precision

Precision calculates the ratio of correctly predicted fire pixels to all pixels predicted as fire. Precision is a key measure for assessing the model’s ability to prevent false positives. Precision is computed as:

Precision = \frac{T P}{T P + F P}

(19)

where the

T P

(true positives) are the accurately predicted pixels of fire.

F P

(false positives) are the misclassified pixels as fire. High precision shows that the model has few false alarms.

4.6. Recall

Recall estimates the ratio of true fire pixels that the model accurately detects. It measures how well the model detects all the fire areas in the image. Recall is given by:

Recall = \frac{T P}{T P + F N}

(20)

where the

T P

are the pixels correctly predicted as fire.

F N

are the true fire pixels that the model did not detect. High recall shows that the model is good at detecting most of the fire areas.

4.7. F1-Score

The F1-score is the harmonic mean of recall and precision. It provides an even measure of the model’s performance, particularly in scenarios where there is a fire/non-fire pixel imbalance. The F1-score is computed as:

F 1 - Score = 2 \cdot \frac{Precision \cdot Recall}{Precision + Recall}

(21)

A high F1-score reflects a balanced precision and recall, i.e., the model is good at both fire region detection and preventing false positives.

4.8. Mean Average Precision at IoU Threshold 0.5 (mAP50)

The mAP₅₀ metric is used to evaluate the model’s performance by computing the average precision at an intersection over union (IoU) of 0.5. It is especially applied in object detection and segmentation problems. The mAP₅₀ is computed as:

{mAP}_{50} = \frac{1}{N} \sum_{i = 1}^{N} {Precision}_{i} \cdot I ({IoU}_{i} \geq 0.5)

(22)

where N is the number of fire regions in the dataset,

t e x t {P r e c i s i o n}_{i}

is the precision of the i-th fire region, and

I (t e x t {I o U}_{i} g e q 0.5)

is an indicator function that takes the value 1 if the IoU of the i-th region is 0.5 or more and 0 otherwise. A high

t e x t {m A P}_{50}

means that the model well detects and segments the fire regions with high overlap with the ground truth.

4.9. mIoU50–95

This is a performance measure widely applied for segmentation problems to assess the prediction quality over a variety of intersection over union (IoU) thresholds. In contrast to the standard IoU, which measures a single threshold (typically at 0.5), mIoU50–95 measures the performance of the model over a spectrum of IoU thresholds from 0.5 to 0.95 with steps of 0.05. This provides a broader assessment of the quality of segmentation, considering different levels of overlap between the ground truth and predicted masks. Mathematically, it is expressed as:

mIoU @ [0.5 : 0.95] = \frac{1}{10} \sum_{t = 0.5}^{0.95} {IoU}_{t}

(23)

where

{IoU}_{t}

, across a range of thresholds from 0, is the intersection over union at every threshold t between 0.5 and 0.95. The average value of such IoUs provides an equitable measure that accounts for precision and recall at various degrees of overlap and is particularly helpful in evaluating models in tasks involving varying object scales or boundaries, such as fire detection.

4.10. Computational Environment

The experiments are performed on a high-performance computing platform with NVIDIA GPUs. The model is deployed using TensorFlow and Keras, enhanced by libraries for data augmentation and visualization. The Kaggle computational platform provides an efficient platform for training and testing models, allowing for fast experimentation and iteration. The input images were resized to 416 × 416 pixels using bilinear interpolation and then split into 16 non-overlapping patches of size 26 × 26. Each patch was projected into a 256-dimensional space using a learnable projection. The Transformer encoder consists of 4 layers with 8 attention heads and a feed-forward MLP dimension of 512. The data were divided into 70% training samples, 15% validation samples, and 15% test samples. To improve model generalization, we used standard augmentation methods, including horizontal flipping, random brightness adjustment, and rotation. These augmentations were carried out during training to mimic real-world variations in fire appearances. The CN2VF-Net architecture consists of 8 M parameters, which makes the model size around 30.62 MB. Training was performed on two GPUs. We trained the model with a batch size of 16 for 30 epochs and a learning rate of 0.0001. During the training, the CN2VF-Net model on each epoch utilized a total of 851.79 s to train. During the first epoch, the model was seen having an inference speed of 12.01 frames per second (FPS) and an average of 83.29 milliseconds for every sample taken for inference. It went through 783 steps within 874 s and has an average step of 915 milliseconds.

These measures collectively provide a comprehensive evaluation of the proposed CN2VF-Net model, which is both robust and accurate in detecting and segmenting fire areas under various conditions.

5. Results and Discussions

In this section, we present an in-depth discussion of the experimental findings gathered through the assessment of the proposed CN2VF-Net model on the D-Fire dataset. The model’s performance is evaluated based on standard metrics, including precision, recall, F1-score, and mAP, at different IoU thresholds (e.g., mAP@50, mAP@50–95). Additionally, we compare CN2VF-Net with current state-of-the-art approaches to show its superior performance in forest fire detection under various circumstances.

The training and validation progress, as shown in Figure 2 plots, indicate how the model improves over epochs. All the metrics are improving steadily, with the validation metrics closely following the training metrics, which means that the model is generalizing well to new data without overfitting. mAP50 and precision learning curves show smooth convergence, and the validation loss drops smoothly, indicating the effectiveness of the hybrid architecture and the selected loss functions.

A comparison with previous works is shown in Table 1. The CN2VF-Net model introduced surpasses previous methods, indicating the superiority of implementing ViTs and CNNs for fire detection and segmentation tasks. In addition, the addition of a multi-scale attention mechanism significantly enhances the model’s performance in detecting fires of different sizes, making it a more robust and adaptable model than previous work. The findings show that the CN2VF-Net model outperforms current methods in fire detection and segmentation tasks. The strength of the model lies in its integration into a hybrid architecture, which combines the global context understanding of vision Transformers with the local feature extraction ability of CNNs. The fusion enables the model to learn long-range dependencies and fine-grained information, hence making it highly efficient in fire region detection and segmentation in various scenarios.

The model’s predictions are shown in Figure 3 and Figure 4, with the original image, ground truth mask, and predicted mask. The similarity between the predicted masks and ground truth is remarkable, thereby showing the model’s capability to segment fire-affected regions effectively. Figure 5 provides further visualizations, including the original image, ground truth, predicted mask with a green boundary box, Grad-CAM heatmap, and heatmap overlay. The Grad-CAM visualizations indicate specific regions in the image to which the model gives special attention while making predictions, thereby providing great insight into why the model made the particular decisions. The heatmaps and overlays demonstrate the model’s ability to identify fire-affected regions, even in complex and crowded environments. Visualization of predictions with Grad-CAM heatmaps provides rich insights into the behavior of models. The good correlation between the predicted masks and actual ground truth masks supports the model’s correctness, and Grad-CAM visualizations ensure that the model is focusing on relevant areas of the image, such as flames and smoke, when making predictions. This level of interpretability is a significant advantage, as it allows users to understand and have confidence in model outputs. The comparative analysis shows the benefits that the suggested methodology provides over traditional techniques. By leveraging the strengths of both convolutional neural networks and Transformers, the CN2VF-Net model achieves good accuracy and stronger robustness, making it a strong model for real-world applications in fire detection and segmentation. The model’s generalizability to novel data, as demonstrated by the training and validation graphs, also illustrates its applicability in real-world contexts.

Multi-scale attention mechanism is crucial to enhancing the model’s robustness. By adapting its attention according to the scale of the fire area, the model is able to effortlessly detect fires of varying scales, from small flames to massive wildfires. This capability becomes extremely important for real-world applications, too, where fires do occur at dissimilar scales in various environments. In conclusion, the CN2VF-Net model is a significant contribution to the segmentation and detection of fires, with an accurate and interpretable solution.

The CN2VF-Net model represents a paradigm shift in fire detection technology, offering several unique advantages over conventional methods; unlike traditional methods that primarily detect surface-level flames, the hybrid architecture of CN2VF-Net excels at detecting subsurface forest fires, a critical aspect given that almost 30% of forest fires spread underground through root systems before breaking out on the surface. This is enabled by a multi-scale attention mechanism that effectively detects the faint thermal patterns and smoke emissions that occur before ignition is visible. As a model designed for real-time forest monitoring, it has high precision (0.8336) and recall (0.8252), making it suitable for integration with early warning systems, sensor networks, and satellite surveillance. Grad-CAM visualizations enable firefighting teams to gain actionable insights into fire locations and likely directions of spread. With improvements in precision, recall, and F1-score compared with current systems and a clear ability for subsurface detection, CN2VF-Net has the potential to transform wildfire prevention, especially when integrated with drone swarms and IoT-based early warning systems.

6. Ablation Study

This work presents a rigorous comparison of three standalone architectural approaches to fire detection, based on the respective strengths and weaknesses of each, using both quantitative scores and qualitative observations. The EfficientNetB0 architecture-based CNN baseline performed effectively but narrowly, with 65.06% precision and 61.76% recall. Although its hierarchical convolutional architecture was effective at local feature extraction, the fixed receptive fields limited its capability to recognize long-range spatial dependencies among fire regions and their contexts. This architectural limitation was most evident in small-fire detection conditions and under complicated environmental conditions, as indicated its poor 57.65% MeanIoU50–95 score. The model’s 60.87% mAP50 score also showed the difficulty in precise fire localization, as it frequently mislabeled fire-like textures as actual flames owing to a lack of global contextual insight. The vision Transformer variant demonstrated quantifiable gains with 71.46% accuracy and 73.02% recall, taking advantage of self-attention mechanisms to represent global fire patterns more accurately. Its patch-based processing paradigm did present new challenges: the fixed patch size limited detection of fires less than 16 × 16 pixels, and the quadratic computational complexity limited practical deployment in high-resolution environments. While the ViT scored a respectable 75.16% mAP50, its 71.21% MeanIoU50–95 score revealed ongoing difficulty with accurate fire boundary definition, a required prerequisite for effective fire spread prediction systems. Our hybrid multi-scale attention network addresses these inherent limitations with three architectural innovations. First, convolutional inductive biases are combined with self-attention to learn local features and global context concurrently, achieving superior 83.3% precision and 82.8% recall scores. Second, the new multi-scale processing pathway adaptively scales receptive fields to learn fire features across multiple spatial scales, reflected by the robust 77.1% MeanIoU50–95 score. Third, the lightweight attention gating mechanism costs 40% less computation than the standard ViT implementations without reducing detection accuracy. The model’s well-balanced 81.5% F1-score supports consistent detection performance across a wide range of operational conditions, starting from early-stage small fires. While the 76.1% mAP50 score shows marginal variation from the other measures, this is due to conservative box-localization criteria rather than detection capability, as supported by the superior boundary-aware segmentation performance (77.1% MeanIoU50–95). The quantitative results shown in Table 2 demonstrate that the hybrid architecture outperforms current models on a regular basis, realizing an 18.24% gain in precision over the CNN baseline and an 11.84% gain in recall over the ViT implementation. Additionally, the measured 19.45% increase in MeanIoU50–95 over CNN models is further proof of improved boundary detection accuracy, and the 6.94% mAP50 gain over ViT is proof of improved localization consistency. These gains are of specific interest to deployed fire detection systems, where high recall (sensitivity at early stages) and high MeanIoU (accurate spatial delineation) are key requirements. The architectural synthesis of convolutional processing with attention mechanisms effectively addresses the complementary shortcomings inherent in each approach and sets a new state-of-the-art benchmark for robust fire detection across a wide range of environmental conditions and fire development stages.

7. Conclusions and Future Direction

The proposed CN2VF-Net model is a significant advancement in fire detection and segmentation, providing a strong and precise solution for detecting fire regions in challenging environments. Through the use of the global context understanding of ViT and the local feature extraction ability of CNN, the model provides state-of-the-art results in various evaluation measures such as mAP50 (0.7618), F1-score (0.8159), recall (0.8285), precision (0.8336), and MeanIoU50–95 (0.7717). The validation loss is 0.1414, demonstrating stable and effective training. The inclusion of a multi-scale attention mechanism also helps improve the model’s ability to detect fires of different sizes, rendering it very effective and versatile for use in real-world applications. The CN2VF-Net architecture addresses the important issues of varying fire scales, occlusions, and environmental heterogeneities. The model’s strong generalization capacity to unseen data, as observed from the training and validation process graphs, confirms its strength and reliability. Furthermore, the visual representation of the predictions and Grad-CAM heatmaps give deep insights into how the model arrived at its conclusion, making the model more understandable and trustworthy.

The comparative analysis with existing works shows that the CN2VF-Net model outperforms current approaches, highlighting the strengths of the Transformer–CNN combination for fire detection and segmentation tasks. The model’s robust performance, combined with its interpretability, makes it a suitable architecture for real-world applications such as wildfire monitoring, industrial fire detection, and emergency response systems.

Although the CN2VF-Net model performs well, there are some promising research directions that can further boost its performance and utility. Looking ahead, we plan to test CN2VF-Net on other fire detection datasets, such as FIRE-SMOKE and the Corsican fire datasets. Exploring different conditions, such as varying fire types and imaging environments, will give us a better understanding of how well our model holds up in real-world scenarios.

Author Contributions

N.A. contributed to the development of the methodology and participated actively in the writing, review, and editing of the manuscript. M.A. was involved in validation, formal analysis, and contributed to manuscript review and editing. E.H.A. participated in validation and formal analysis and was involved in the critical review, writing, and revision of the manuscript. M.M.J. contributed to validation, formal analysis, and writing, review and editing; additionally, M.M.J. was responsible for funding acquisition. All authors have read and agreed to the published version of the manuscript.

Funding

Princess Nourah bint Abdulrahman University Researchers Supporting Project number (PNURSP2025R104), Princess Nourah bint Abdulrahman University, Riyadh, Saudi Arabia.

Informed Consent Statement

Not Applicable.

Data Availability Statement

In this study, the D-Fire dataset, which is used for fire detection, is openly available on GitHub at the following repository: https://github.com/gaiasd/DFireDataset (Latest version, Accessed: 7 May 2025). The corresponding codebase for training and evaluating the models is also publicly accessible at https://github.com/naveedflair/Fire-Detection-using-CN2VF-Net-Model (Latest version, Accessed: 7 May 2025). Both the dataset and code are freely available for academic and research purposes.

Acknowledgments

The authors acknowledge the Princess Nourah bint Abdulrahman University Researchers Supporting Project number (PNURSP2025R104), Princess Nourah bint Abdulrahman University, Riyadh, Saudi Arabia, for supporting this research.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Suhardono, S.; Fitria, L.; Suryawan, I.W.K.; Septiariva, I.Y.; Mulyana, R.; Sari, M.M.; Ulhasanah, N.; Prayogo, W. Human Activities and Forest Fires in Indonesia: An Analysis of the Bromo Incident and Implications for Conservation Tourism. Trees For. People 2024, 15, 100509. [Google Scholar] [CrossRef]
Yuan, C.; Liu, Z.; Zhang, Y. UAV-Based Forest Fire Detection and Tracking Using Image Processing Techniques. In Proceedings of the International Conference on Unmanned Aircraft Systems (ICUAS), Denver, CO, USA, 9–12 June 2015; pp. 639–643. [Google Scholar]
Gao, Y.; Cao, H.; Cai, W.; Zhou, G. Pixel-Level Road Crack Detection in UAV Remote Sensing Images Based on ARD-Unet. Measurement 2023, 219, 113252. [Google Scholar] [CrossRef]
Zhan, J.; Hu, Y.; Zhou, G.; Wang, Y.; Cai, W.; Li, L. A High-Precision Forest Fire Smoke Detection Approach Based on ARGNet. Comput. Electron. Agric. 2022, 196, 106874. [Google Scholar] [CrossRef]
Wang, K.; Zhang, Y.; Jinjun, W.; Zhang, Q.; Bing, C.; Dongcai, L. Fire Detection in Infrared Video Surveillance Based on Convolutional Neural Network and SVM. In Proceedings of the IEEE 3rd International Conference on Signal and Image Processing (ICSIP), Shenzhen, China, 13–15 July 2018; pp. 162–167. [Google Scholar]
Deng, L.; Chen, Q.; He, Y.; Sui, X.; Liu, Q.; Hu, L. Fire Detection with Infrared Images using Cascaded Neural Network. J. Algorithms Comput. Technol. 2019, 13, 1748302619895433. [Google Scholar] [CrossRef]
Alkhatib, R.; Sahwan, W.; Alkhatieb, A.; Schütt, B. A Brief Review of Machine Learning Algorithms in Forest Fires Science. Appl. Sci. 2023, 13, 8275. [Google Scholar] [CrossRef]
Peruzzi, G.; Pozzebon, A.; Van Der Meer, M. Fight Fire with Fire: Detecting Forest Fires with Embedded Machine Learning Models Dealing with Audio and Images on Low Power IoT Devices. Sensors 2023, 23, 783. [Google Scholar] [CrossRef]
Guria, R.; Mishra, M.; da Silva, R.M.; Mishra, M.; Santos, C.A.G. Predicting Forest Fire Probability in Similipal Biosphere Reserve (India) Using Sentinel-2 MSI Data and Machine Learning. Remote Sens. Appl. 2024, 36, 101311. [Google Scholar] [CrossRef]
Sobha, P.; Latifi, S. A Survey of the Machine Learning Models for Forest Fire Prediction and Detection. Int. J. Commun. Netw. Syst. Sci. 2023, 16, 131–150. [Google Scholar] [CrossRef]
Jiang, P.; Ergu, D.; Liu, F.; Cai, Y.; Ma, B. A Review of Yolo algorithm developments. Procedia Comput. Sci. 2022, 199, 1066–1073. [Google Scholar] [CrossRef]
Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
Chitram, S.; Kumar, S.; Thenmalar, S. Enhancing Fire and Smoke Detection Using Deep Learning Techniques. Eng. Proc. 2024, 62, 7. [Google Scholar]
Ye, M.; Luo, Y. A deep convolution neural network fusing of color feature and spatio-temporal feature for smoke detection. Multimed. Tools Appl. 2024, 83, 22173–22187. [Google Scholar] [CrossRef]
Abdikan, S.; Bayik, C.; Sekertekin, A.; Bektas Balcik, F.; Karimzadeh, S.; Matsuoka, M.; Balik Sanli, F. Burned Area Detection Using Multi-Sensor SAR, Optical, and Thermal Data in Mediterranean Pine Forest. Forests 2022, 13, 347. [Google Scholar] [CrossRef]
Qarallah, B.; Othman, Y.A.; Al-Ajlouni, M.; Alheyari, H.A.; Qoqazeh, B.A. Assessment of Small-Extent Forest Fires in Semi-Arid Environment in Jordan Using Sentinel-2 and Landsat Sensors Data. Forests 2022, 14, 41. [Google Scholar] [CrossRef]
Shin, J.; Seo, W.; Kim, T.; Park, J.; Woo, C. Using UAV Multispectral Images for Classification of Forest Burn Severity—A Case Study of the 2019 Gangneung Forest Fire. Forests 2019, 10, 1025. [Google Scholar] [CrossRef]
Han, K.; Wang, Y.; Chen, H.; Chen, X.; Guo, J.; Liu, Z.; Tang, Y.; Xiao, A.; Xu, C.; Xu, Y. A Survey on Vision Transformer. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 87–110. [Google Scholar] [CrossRef] [PubMed]
Xie, Y.; Zhu, J.; Cao, Y.; Zhang, Y.; Feng, D.; Zhang, Y.; Chen, M. Efficient Video Fire Detection Exploiting Motion-Flicker-Based Dynamic Features and Deep Static Features. IEEE Access 2020, 8, 81904–81917. [Google Scholar] [CrossRef]
Feng, H.; Qiu, J.; Wen, L.; Zhang, J.; Yang, J.; Lyu, Z.; Liu, T.; Fang, K. U3UNet: An Accurate and Reliable Segmentation Model for Forest Fire Monitoring Based on UAV Vision. Neural Netw. 2025, 185, 107207. [Google Scholar] [CrossRef]
Wang, G.; Bai, D.; Lin, H.; Zhou, H.; Qian, J. FireViTNet: A Hybrid Model Integrating ViT and CNNs for Forest Fire Segmentation. Comput. Electron. Agric. 2024, 218, 108722. [Google Scholar] [CrossRef]
Liu, H.; Zhang, F.; Xu, Y.; Wang, J.; Lu, H.; Wei, W.; Zhu, J. Tfnet: Transformer-based multi-scale feature fusion forest fire image detection network. Fire 2025, 8, 59. [Google Scholar] [CrossRef]
Yu, P.; Wei, W.; Li, J.; Du, Q.; Wang, F.; Zhang, L.; Li, H.; Yang, K.; Yang, X.; Zhang, N.; et al. Fire-PPYOLOE: An Efficient Forest Fire Detector for Real-Time Wild Forest Fire Monitoring. J. Sens. 2024, 2024, 2831905. [Google Scholar] [CrossRef]
Zhao, C.; Zhao, L.; Zhang, K.; Ren, Y.; Chen, H.; Sheng, Y. Smoke and Fire-You Only Look Once: A Lightweight Deep Learning Model for Video Smoke and Flame Detection in Natural Scenes. Fire 2025, 8, 104. [Google Scholar] [CrossRef]
Song, X.; Wei, Z.; Zhang, J.; Gao, E. A Fire Detection Algorithm Based on Adaptive Region Decoupling Distillation. In Proceedings of the 2023 International Annual Conference on Complex Systems and Intelligent Science (CSIS-IAC), Shenzhen, China, 20–22 October 2023; pp. 253–258. [Google Scholar]
Zhang, T.; Wang, Z.; Zeng, Y.; Wu, X.; Huang, X.; Xiao, F. Building artificial-intelligence digital fire (AID-Fire) system: A real-scale demonstration. J. Build. Eng. 2022, 62, 105363. [Google Scholar] [CrossRef]
Wang, K.; Zhang, W.; Song, X. A Fire Detection Method for Aircraft Cargo Compartments Utilizing Radio Frequency Identification Technology and an Improved YOLO Model. Electronics 2024, 14, 106. [Google Scholar] [CrossRef]
De Venâncio, P.V.A.; Rezende, T.M.; Lisboa, A.C.; Barbosa, A.V. Fire detection based on a two-dimensional convolutional neural network and temporal analysis. In Proceedings of the 2021 IEEE Latin American Conference on Computational Intelligence (LA-CCI), Temuco, Chile, 2–4 November 2021; pp. 1–6. [Google Scholar]
De Venâncio, P.V.A.; Rezende, T.M.; Lisboa, A.C.; Barbosa, A.V. An automatic fire detection system based on deep convolutional neural networks for low-power, resource-constrained devices. Neural Comput. Appl. 2022, 34, 15349–15368. [Google Scholar] [CrossRef]
Chen, Y.; Li, J.; Sun, K.; Zhang, Y. A lightweight early forest fire and smoke detection method. J. Supercomput. 2024, 80, 9870–9893. [Google Scholar] [CrossRef]
Peng, R.; Cui, C.; Wu, Y. Real-time fire detection algorithm on low-power endpoint device. J.-Real-Time Image Process. 2025, 22, 29. [Google Scholar] [CrossRef]
Gragnaniello, D.; Greco, A.; Sansone, C.; Vento, B. FLAME: Fire detection in videos combining a deep neural network with a model-based motion analysis. Neural Comput. Appl. 2025, 2025, 1–17. [Google Scholar] [CrossRef]
Malebary, S.J. Early fire detection using long short-term memory-based instance segmentation and internet of things for disaster management. Sensors 2023, 23, 9043. [Google Scholar] [CrossRef]
Gragnaniello, D.; Greco, A.; Sansone, C.; Vento, B. Onfire 2023 Contest: What did we learn about real time fire detection from cameras? J. Ambient. Intell. Humaniz. Comput. 2025, 16, 253–264. [Google Scholar] [CrossRef]
Huang, X.; Xie, W.; Zhang, Q.; Lan, Y.; Heng, H.; Xiong, J. A Lightweight Wildfire Detection Method for Transmission Line Perimeters. Electronics 2024, 13, 3170. [Google Scholar] [CrossRef]
Titu, M.F.S.; Pavel, M.A.; Goh, K.O.M.; Babar, H.; Aman, U.; Khan, R. Real-Time Fire Detection: Integrating Lightweight Deep Learning Models on Drones with Edge Computing. Drones 2024, 8, 483. [Google Scholar] [CrossRef]
Kumar, A.; Perrusquía, A.; Al-Rubaye, S.; Guo, W. Wildfire and Smoke Early Detection for Drone Applications: A Light-Weight Deep Learning Approach. Eng. Appl. Artif. Intell. 2024, 136, 108977. [Google Scholar] [CrossRef]
Wu, D.; Qian, Z.; Wu, D.; Wang, J. FSNet: Enhancing Forest-Fire and Smoke Detection with an Advanced UAV-Based Network. Forests 2024, 15, 787. [Google Scholar] [CrossRef]
de Venâncio, P.V.A.B.; Campos, R.J.; Rezende, T.M.; Lisboa, A.C.; Barbosa, A.V. A hybrid method for fire detection based on spatial and temporal patterns. Neural Comput. Appl. 2023, 35, 9349–9361. [Google Scholar] [CrossRef]
Kim, S.; Jang, I.-s.; Ko, B.C. Domain-free fire detection using the spatial—Temporal attention transform of the yolo backbone. Pattern Anal. Appl. 2024, 27, 45. [Google Scholar] [CrossRef]
Abdusalomov, A.; Baratov, N.; Kutlimuratov, A.; Whangbo, T.K. An improvement of the fire detection and classification method using YOLOv3 for surveillance systems. Sensors 2021, 21, 6519. [Google Scholar] [CrossRef]
Eckle, K.; Schmidt-Hieber, J. A comparison of deep networks with ReLU activation function and linear spline-type methods. Neural Netw. 2019, 110, 232–242. [Google Scholar] [CrossRef]
Badrinarayanan, V.; Kendall, A.; Cipolla, R. Segnet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 2481–2495. [Google Scholar] [CrossRef]
Available online: https://github.com/gaiasd/DFireDataset (accessed on 7 May 2025).
Unser, M.; Aldroubi, A.; Eden, M. Fast B-spline transforms for continuous image representation and interpolation. IEEE Trans. Pattern Anal. Mach. Intell. 1991, 13, 277–285. [Google Scholar] [CrossRef]
Braytee, A.; Anaissi, A.; Naji, M. A Comparative Analysis of Loss Functions for Handling Foreground-Background Imbalance in Image Segmentation. In International Conference on Neural Information Processing; Springer International Publishing: Cham, Switzerland, 2022; pp. 3–13. [Google Scholar]
Mamadaliev, D.; Touko, P.L.M.; Kim, J.A.; Kim, S. Esfd-yolov8n: Early smoke and fire detection method based on an improved yolov8n model. Fire 2024, 7, 303. [Google Scholar] [CrossRef]
Liu, Z.; Zhang, R.; Zhong, H.; Sun, Y. YOLOv8 for Fire and Smoke Recognition Algorithm Integrated with the Convolutional Block Attention Module. Open J. Appl. Sci. 2023, 14, 159–170. [Google Scholar] [CrossRef]
Xu, F.; Zhang, X.; Deng, T.; Xu, W. An image-based fire monitoring algorithm resistant to fire-like objects. Fire 2023, 7, 3. [Google Scholar] [CrossRef]

Figure 2. Training and validation learning curve of CN2VF-Net.

Figure 3. Prediction of CN2VF-Net.

Figure 4. Prediction of CN2VF-Net.

Figure 5. Prediction of CN2VF-Net.

Table 1. Comparing the proposed model CN2VF-Net with the existing methods based on precision, recall, F1-score, mAP@50, and MeanIoU50–95.

Ref	Precision	Recall	F1-Score	mAP@50	MeanIoU50–95
Liu et al. [22]	81.6	74.8	78.1	81.2	-
Mamadaliev et al. [47]	80.1	72.7	-	79.4	-
Liu et al. [48]	80.9	63.6	-	69.0	-
Xu et al. [49]	81.7	82.5	-	82.3	-
Segmenter	80.4	79.2	77.3	75.7	74.1
Swin Transformer	81.9	82.2	79.5	78.6	76.5
Proposed	83.3	82.8	81.5	76.1	77.1

Table 2. Performance comparison across architectures.

Model Type	Precision (%)	Recall (%)	F1-Score (%)	mAP@50 (%)	MeanIoU50–95 (%)
CNN (EfficientNetB0)	65.06	61.76	62.34	60.87	57.65
Vision Transformer	71.46	73.02	74.57	75.16	71.21
CN2VF-Net Model	83.30	82.80	81.50	76.10	77.10

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Ahmad, N.; Akbar, M.; Alkhammash, E.H.; Jamjoom, M.M. CN2VF-Net: A Hybrid Convolutional Neural Network and Vision Transformer Framework for Multi-Scale Fire Detection in Complex Environments. Fire 2025, 8, 211. https://doi.org/10.3390/fire8060211

AMA Style

Ahmad N, Akbar M, Alkhammash EH, Jamjoom MM. CN2VF-Net: A Hybrid Convolutional Neural Network and Vision Transformer Framework for Multi-Scale Fire Detection in Complex Environments. Fire. 2025; 8(6):211. https://doi.org/10.3390/fire8060211

Chicago/Turabian Style

Ahmad, Naveed, Mariam Akbar, Eman H. Alkhammash, and Mona M. Jamjoom. 2025. "CN2VF-Net: A Hybrid Convolutional Neural Network and Vision Transformer Framework for Multi-Scale Fire Detection in Complex Environments" Fire 8, no. 6: 211. https://doi.org/10.3390/fire8060211

APA Style

Ahmad, N., Akbar, M., Alkhammash, E. H., & Jamjoom, M. M. (2025). CN2VF-Net: A Hybrid Convolutional Neural Network and Vision Transformer Framework for Multi-Scale Fire Detection in Complex Environments. Fire, 8(6), 211. https://doi.org/10.3390/fire8060211

Article Menu

CN2VF-Net: A Hybrid Convolutional Neural Network and Vision Transformer Framework for Multi-Scale Fire Detection in Complex Environments

Abstract

1. Introduction

The Key Contribution of This Study

2. Related Works

2.1. Deep Learning-Based Fire Detection Techniques

2.2. Smart, Lightweight, and Real-Time Fire Detection Systems

3. Methodology

3.1. CN2VF-Net Architecture

3.1.1. Patch Embedding

3.1.2. Transformer Encoder

3.1.3. CNN Backbone (EfficientNetB0)

3.1.4. Feature Fusion Module

3.1.5. Multi-Scale Attention

3.1.6. Decoder

4. Experimental Setup

4.1. Dataset Collection

4.2. Dataset Preprocessing

4.3. Model Training and Configuration

4.4. Evaluation Metrics

4.5. Precision

4.6. Recall

4.7. F1-Score

4.8. Mean Average Precision at IoU Threshold 0.5 (mAP50)

4.9. mIoU50–95

4.10. Computational Environment

5. Results and Discussions

6. Ablation Study

7. Conclusions and Future Direction

Author Contributions

Funding

Informed Consent Statement

Data Availability Statement

Acknowledgments

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI