Article

Reliable Indoor Fire Detection Using Attention-Based 3D CNNs: A Fire Safety Engineering Perspective

by Mostafa M. E. H. Ali * and Maryam Ghodrat *
School of Engineering and Technology, The University of New South Wales, Canberra, ACT 2600, Australia
*
Authors to whom correspondence should be addressed.
Fire 2025, 8(7), 285; https://doi.org/10.3390/fire8070285
Submission received: 31 May 2025 / Revised: 12 July 2025 / Accepted: 19 July 2025 / Published: 21 July 2025

Abstract

Despite recent advances in deep learning for fire detection, much of the current research prioritizes model-centric metrics over dataset fidelity, particularly from a fire safety engineering perspective. Commonly used datasets are often dominated by fully developed flames, mislabel smoke-only frames as non-fire, or lack intra-video diversity due to redundant frames from limited sources. Some works treat smoke detection alone as early-stage detection, even though many fires (e.g., electrical or chemical) begin with visible flames and no smoke. Additionally, attempts to improve model applicability through mixed-context datasets—combining indoor, outdoor, and wildland scenes—often overlook the unique false alarm sources and detection challenges specific to each environment. To address these limitations, we curated a new video dataset comprising 1108 annotated fire and non-fire clips captured via indoor surveillance cameras. Unlike existing datasets, ours emphasizes early-stage fire dynamics (pre-flashover) and includes varied fire sources (e.g., sofa, cupboard, and attic fires), realistic false alarm triggers (e.g., flame-colored objects, artificial lighting), and a wide range of spatial layouts and illumination conditions. This collection enables robust training and benchmarking for early indoor fire detection. Using this dataset, we developed a spatiotemporal fire detection model based on the mixed convolutions ResNets (MC3_18) architecture, augmented with Convolutional Block Attention Modules (CBAM). The proposed model achieved 86.11% accuracy, 88.76% precision, and 84.04% recall, along with low false positive (11.63%) and false negative (15.96%) rates. Compared to its CBAM-free baseline, the model exhibits notable improvements in F1-score and interpretability, as confirmed by Grad-CAM++ visualizations highlighting attention to semantically meaningful fire features. These results demonstrate that effective early fire detection is inseparable from high-quality, context-specific datasets. Our work introduces a scalable, safety-driven approach that advances the development of reliable, interpretable, and deployment-ready fire detection systems for residential environments.

1. Introduction

Fires remain a significant global hazard, responsible for approximately 265,000 fatalities each year. Fire-related burns are the fourth leading cause of unintentional injury worldwide and the third most common source of unintentional harm within households [1]. Notably, two-thirds of fire-related deaths occur in buildings without functioning smoke detectors [2]. In developed countries such as Australia, residential fires contribute disproportionately to fire-related injuries and fatalities [3], with more lives lost annually in house fires than from all natural hazards combined, including storms, floods, and bushfires [4]. Most fatalities occur during nighttime hours, and nearly one-third of victims are aged 65 or older. These statistics underscore the urgent need for reliable and context-aware fire prevention and early detection technologies, particularly in residential settings.
Conventional indoor fire detection systems primarily rely on physical smoke and flame sensors. Although widely deployed, these systems present notable limitations [5,6,7]. Their effectiveness is highly dependent on the sensor placement and fire development rate, which can result in delayed or failed alarm activation. In 2018, approximately 38% of fire alarms failed to operate as intended, with 45% of these failures attributed to incorrect installation [8]. Battery-powered detectors exhibit higher failure rates (38%) compared to mains-powered systems (21%) [8], raising concerns about their reliability in critical scenarios.
To address these challenges, computer vision-based fire detection—leveraging closed-circuit television (CCTV) and deep learning—has emerged as a promising alternative [9,10]. This approach offers advantages such as faster response times, broader spatial awareness, and the capacity to incorporate contextual scene information. However, much of the current research remains heavily model-centric, emphasizing metrics such as accuracy, inference speed, and computational efficiency. While important, this focus often comes at the expense of data quality, specifically in terms of environmental realism, scene diversity, and domain-specific relevance. Many existing datasets are characterized by an overrepresentation of fully developed flames [11,12,13,14], inadequate annotation of early smoke events [15], and low intra-video variability [16], leading to models that may perform well on benchmarks but struggle in real-world deployments, particularly during the early stages of fire development, when rapid detection is critical.
Another important limitation lies in the use of heterogeneous datasets comprising synthetic imagery, YouTube videos, and mixed-context fire scenarios (e.g., wildland, outdoor, and indoor scenes) [17,18]. While these sources increase data volume, they often sacrifice contextual specificity, especially for high-risk indoor environments. Realistic challenges, prevalent in residential interiors, are frequently underrepresented.
Table 1 summarizes recent studies in deep learning-based fire detection, highlighting dataset types, detection strategies (e.g., classification, localization, and segmentation), and performance metrics. Despite growing interest, few studies explicitly focus on indoor fire behavior. For example, Pincott et al. [19] proposed a CNN-based smoke and flame detection system tailored for indoor settings, utilizing a dataset that spans multiple fire development stages. Ahn et al. [20] introduced a small dataset of controlled indoor fire scenarios; however, questions remain regarding its realism and generalizability due to its limited availability and constrained scene diversity.
To our knowledge, no publicly available dataset sufficiently captures realistic, early-stage indoor fire scenarios from surveillance viewpoints. Existing datasets are often limited in scope, overly synthetic, or fail to reflect the environmental and temporal nuances of indoor residential fires, particularly during their early development stages.
In response to the heightened risks associated with residential fires—and the absence of datasets and models designed for realistic indoor scenarios—this study offers the following key contributions:
  • Development of a novel indoor fire video dataset, compiled from real-world CCTV footage across diverse residential environments.
  • Design of an enhanced fire detection model, based on the mixed convolution ResNets (MC3_18) architecture augmented with Convolutional Block Attention Modules (CBAM), to improve early-stage fire recognition by emphasizing salient spatiotemporal features.
  • Introduction of an indoor fire detection benchmark, constructed from realistic surveillance footage and focused explicitly on early-stage events, providing a reproducible evaluation framework for future 3D CNN-based research.
The remainder of this paper is organized as follows: Section 2 reviews existing fire detection datasets. Section 3 presents the proposed indoor dataset and architecture of the proposed detection framework. Section 4 discusses experimental results and analysis. Section 5 concludes the study and outlines directions for future work.
Table 1. Summary of the latest research on fire using deep learning approaches.
Reference | Approach | Training Dataset Type | Result
Pincott et al. [19] | Flame/smoke object detection | Indoor | Accuracy, precision, recall
Ahn et al. [20] | Smoke/flame object detection | Indoor | Precision, recall, mAP0.5, detection time
Khan et al. [17] | Flame object detection and segmentation | Multiple scenes of fires | Accuracy, F-measure, precision, recall
Jadon et al. [18] | Flame recognition | Multiple scenes of fires | Accuracy, F-measure, precision, recall
Wu et al. [21] | Flame object detection | Outdoor | Detection rate and speed
Khan et al. [15] | Smoke/flame recognition | Multiple scenes of fires | Accuracy, F-measure, precision, recall
Yuan et al. [22] | Smoke segmentation | Complex scenes | Intersection over Union and Mean Square Error
Tao et al. [23] | Smoke recognition | Real surveillance scenes | Detection rate, false alarm rate, and F1-score
Majid et al. [24] | Flame recognition | Multiple scenes of fires | Accuracy, F-measure, precision, recall
Chaoxia et al. [25] | Flame object detection | Multiple scenes of fires | Accuracy, F-measure, precision, recall
Li et al. [26] | Smoke/flame object detection | Multiple scenes of fires | Detection speed, precision
This study | Smoke/flame recognition | Indoor | Accuracy, F-measure, precision, recall

2. Related Work

In this section, we first review commonly used public datasets for indoor and general outdoor fire detection, highlighting their limitations in capturing early-stage fire events and realistic indoor scenarios. We then introduce our newly collected dataset, which addresses these gaps by incorporating more realistic fire and fire-like sources.
Despite the growing number of publicly available fire datasets, most suffer from one or more of the following critical limitations (see Figure 1 and Table 2):
  • Late-Stage Combustion Bias: Many datasets, including DFAN, BoWFire (Chino’s), and non-public collections like Pincott’s, predominantly feature images or video clips of fully developed flames (see Figure 1). While such vivid fire scenes enhance detection rates in later stages, they offer minimal training data for early fire detection, particularly pre-flashover fires, which are crucial for timely warnings.
  • Smoke Frames Labeled as Non-Fire: Foggia’s dataset explicitly treats all smoke frames—including those from genuine early-stage fires—as non-fire, conflating smoke-only events with background and undermining a model’s ability to learn true flame signatures.
  • Low Intra-Video Diversity: Datasets with large frame counts often derive many of their images from a handful of video sources, leading to near-duplicate frames that lack sufficient scene diversity for robust training. For example, Foggia’s dataset comprises 31 raw videos, from which over 57,000 frames are extracted, but the majority are temporally adjacent and visually redundant, limiting the model’s ability to generalize beyond a few camera angles and lighting conditions.
  • Smoke-Based Detection Focus: A subset of datasets (e.g., VSD, Smoke100k) emphasizes smoke rather than flame features. However, certain fire types (e.g., electrical faults, gas leaks, and chemical reactions) produce little or no visible smoke, rendering smoke-centric classifiers ineffective in those scenarios.
Considerable prior work has relied heavily on these imperfect resources, as shown in the usage count column in Table 2. Several studies have utilized these datasets, often combining multiple sources to create their own customized datasets. For instance, Huang et al. [27] explored fire detection in video surveillance using convolutional neural networks and wavelet transforms. Their dataset consisted of images from the Corsican Fire Database, supplemented by samples and augmented images from Foggia’s dataset, as well as additional fire and non-fire images with fire-like backgrounds sourced from the internet. Similarly, Khan et al. [15,17] relied heavily on Foggia’s dataset for training, addressing the computational challenges in fire detection, but still faced limitations related to dataset diversity and realism. Zheng et al. [28] constructed a dynamic CNN framework called DCN_Fire to evaluate forest fire danger, utilizing principal component analysis to improve inter-class discriminability and saliency detection to normalize flame images for training purposes. DCN_Fire attained an accuracy of 98.30% on its evaluation set. Wu et al. [21] presented the VFD model, trained on outdoor fire imagery, for detecting fires in manufacturing settings; their model achieved a detection accuracy of 98.40% at 36 frames per second (FPS). Khan et al. [29] proposed an edge intelligence-assisted method for smoke detection in foggy conditions, demonstrating the adaptability of DL models to complex environments.
Accurate and reliable fire detection remains a critical requirement in fire safety engineering, particularly for residential settings where early intervention can mitigate loss of life and property. However, many existing deep learning-based fire detection models are trained and evaluated on overly simplified datasets, typically dominated by outdoor scenes or fully developed flames. These datasets fail to reflect the complexity and ambiguity inherent to indoor environments, where lighting variability, visual occlusions, and early-stage combustion phenomena are common. As a result, many of these models exhibit high performance on benchmark datasets but lack robustness and generalizability in realistic deployment scenarios.
To address this gap, we constructed a dataset specifically focused on indoor residential fires, comprising two classes: indoor fire and non-fire. This binary classification task is inherently more challenging due to the presence of low-contrast smoke, artificial lighting artifacts, cluttered visual backgrounds, and partially occluded flame features. In this context, we benchmark our attention-augmented 3D CNN model.
Table 2. Commonly used datasets for fire detection (indoor and general outdoor fires). Note: The datasets listed below focus on general outdoor and indoor fire scenes, excluding datasets specific to wildland fires only.
Dataset Name | Description and Volume | Environment | Usage Count †
Töreyin [30] | 11 videos: 5 genuine fire scenes (e.g., garden, fireplace, box fire) and 6 fire-like scenarios (e.g., red clothing, vehicles, crowds). | Indoor and outdoor | 910
Foggia’s [16] | 31 videos: 13 fire, 16 non-fire, 2 mixed. 57,800 total frames (6311 fire; 51,489 non-fire). Smoke is labeled as non-fire. | Indoor and outdoor | 454
FireSense [31] | 49 videos: flame detection (11 positive, 16 negative), smoke detection (13 positive, 9 negative). | Indoor and outdoor | 233
BoWFire (Chino’s) [11] | 226 images: 119 real fire, 107 fire-colored distractors (e.g., sunsets, clothing, lights). | Indoor and outdoor | 234
VSD Dataset [32] | 6 videos converted into 4 image sets (e.g., leaf, cotton, wood fires). | Outdoor | 242
FD Dataset [33] | Aggregated from Foggia and BoWFire. | Indoor and outdoor | 100
Smoke100k [34] | 100,000 synthetic images across smoke levels: low (33 k), medium (36 k), high (33 k). | Synthetic images | 7
VisiFire [35] | 57 videos: 13 fire, 21 urban smoke, 21 forest smoke, 2 other. | Outdoor/forest | --
DFAN (Yar’s) [36] | 3403 fire images distributed across 12 classes: boat fire (338), building fire (305), bus fire (400), car fire (579), cargo fire (207), electric pole fire (300), forest fire (480), pickup fire (257), SUV fire (240), train fire (300), van fire (300), and 97 normal (non-fire) images. | Outdoor | 79
Pincott’s [19] | 600 images covering all fire development stages, from ignition to spread. | Indoor | Private dataset
Ahn’s [20] | 10,163 indoor images; 514 for false positive scenarios (e.g., welding sparks, light reflections). | Indoor | Private dataset
Our Dataset | 1108 indoor videos, described in detail in this section. | Indoor | N/A
† Usage Count refers to the number of studies citing or using the dataset, based on a survey of the existing literature.
Figure 1. Comparison of early-stage fire examples in legacy datasets versus our proposed dataset. The left panel shows predominantly late-stage, fully developed flames drawn from widely used collections (e.g., DFAN, VisiFire, and Töreyin), while the right panel highlights small-scale ignition scenes captured in the first two minutes post-ignition in our new dataset.

3. Proposed Framework

3.1. Dataset

To overcome the existing gaps in the current datasets, we compiled a new video dataset that simulates realistic indoor fire scenarios as captured by static surveillance cameras. Key features include:
  • Early-Stage Focus and Broad Scenario Coverage. All the clips are limited to the pre-flashover stage of fire development, ensuring that the flames remain physically small and challenging to detect. The videos encompass a wide variety of residential fire types—wall fires, sofa ignitions, cupboard blazes, and attic fires—as well as horizontal and vertical ventilation conditions, supporting robust generalization across household layouts. We collected real fire scenarios from credible and publicly available sources such as the Fire Safety Research Institute [37] to ensure the dataset accurately reflects authentic fire incidents.
  • Diverse Environmental Conditions. We vary the camera-to-fire distances, incorporate common indoor backgrounds featuring fire-colored objects, and capture across different lighting regimes to mimic real-world surveillance footage [38]. Additionally, the dataset includes occlusion cases, where flames are partially obscured by furniture or structural elements (e.g., table edges, sofa backs), and multi-view scenes from different camera angles, simulating realistic variations in fixed surveillance deployments.
To further illustrate dataset diversity, we categorized the fire scenarios into distinct subtypes (e.g., sofa fires, cupboard ignitions, attic fires, and wall-mounted equipment), with each subtype comprising multiple videos captured under varied lighting, distance, and background clutter conditions. More than 60% of the fire clips include partial occlusions or low visibility, and over 30% occur under low-light or nighttime settings. The non-fire videos also feature diverse false positive triggers, such as LED lights, television flicker, artificial flames, and moving curtains. Notably, approximately half of the non-fire samples were intentionally selected to represent false alarm sources, closely resembling real fire in color, motion, or intensity. This systematic variation ensures balanced exposure to visually similar but semantically distinct cases, promoting generalizable learning and robust fire/no-fire discrimination.
Table 3 presents the training, validation, and test splits of our dataset, which includes 1108 video clips distributed nearly evenly across the fire (537 clips) and non-fire (571 clips) categories.

3.2. Proposed Model

In the context of indoor fire detection, timely and reliable recognition is essential for minimizing damage and ensuring occupant safety. Fires exhibit dynamic visual patterns such as flame color changes, flickering intensity, and irregular motion. These features vary across time and space, making them inherently temporal in nature. Temporal cues are critical for distinguishing real fires from false positives triggered by lighting artifacts, reflections, or non-fire motion (e.g., moving shadows or waving curtains). Such spatiotemporal patterns are absent in still images, limiting the effectiveness of static image-based methods in safety-critical settings. Therefore, this study employs video sequences as input, enabling the model to leverage both spatial and temporal information through 3D CNNs. This approach aligns with the real-world operation of surveillance systems, which continuously capture video rather than isolated frames.
Previous work by Tran et al. [39] demonstrated that 3D CNNs, particularly when embedded within residual learning frameworks, outperform their 2D counterparts on video classification tasks, which similarly require spatiotemporal reasoning. However, selecting the appropriate architecture involves a trade-off between recognition performance and computational efficiency, especially in real-time deployment scenarios.
To address this, we explored several spatiotemporal architectures, including C3D and the Video ResNet family. Among the latter, R3D-18, MC3-18, and R(2+1)D-18 each offer distinct trade-offs. While R(2+1)D-18 achieves the highest recognition accuracy, it incurs greater computational cost due to its complex structure. In contrast, MC3-18 offers an optimal balance between accuracy and efficiency, making it a practical choice for real-time fire detection in resource-constrained environments such as smart cameras or embedded devices, as shown in Table 4.
Given its favorable performance-to-complexity ratio, MC3 was adopted as the base architecture in our framework. The MC3 network combines the advantages of both 2D and 3D convolutions. Its stem consists of a 3D convolutional layer with a kernel size of (3, 7, 7), padding of (1, 3, 3), and stride of (1, 2, 2), where the first dimension captures temporal features and the remaining two capture spatial features. This layer accepts a three-channel (RGB) input and produces 64 output channels.
Following the stem, the architecture includes four convolutional blocks. The first block uses 3D convolutions to model short-term motion, while the remaining three blocks use 2D convolutions to capture detailed spatial information. Each block contains two residual units with two convolutional layers each. Except for the first block, each is followed by a downsampling operation implemented with 3D convolutions (kernel size 1, stride 2), reducing the resolution of intermediate feature maps. Batch normalization and ReLU activation follow each convolutional layer to ensure stable and nonlinear learning.
This hybrid 2D–3D design allows the network to extract temporal patterns early and progressively refine spatial representations in deeper layers. A global average pooling (GAP) layer, followed by dropout and a fully connected (FC) layer, produces the final classification output (see Table 5 for details).
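For readers who want a concrete starting point, the following sketch shows how such a backbone could be instantiated from torchvision and adapted for binary fire/non-fire classification. It is a minimal illustration rather than the exact implementation used here: the Kinetics-400 pretrained weights, the dropout probability of 0.5, and the two-logit head are assumptions.

```python
# Minimal sketch (assumptions: recent torchvision video models, Kinetics-pretrained
# weights, dropout p=0.5, and a two-logit fire/non-fire head).
import torch
import torch.nn as nn
from torchvision.models.video import mc3_18, MC3_18_Weights

def build_baseline(num_classes: int = 2) -> nn.Module:
    # MC3_18: 3D convolutions in the stem and first stage, 2D-style convolutions afterwards.
    model = mc3_18(weights=MC3_18_Weights.KINETICS400_V1)
    # Replace the 400-class Kinetics head with a small fire/non-fire classifier.
    model.fc = nn.Sequential(nn.Dropout(p=0.5),
                             nn.Linear(model.fc.in_features, num_classes))
    return model

model = build_baseline()
clip = torch.randn(1, 3, 16, 224, 224)   # (batch, channels, frames, height, width)
logits = model(clip)                      # shape: (1, 2)
```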
To further improve model sensitivity and generalization, we incorporated attention mechanisms. Recent research has demonstrated that attention modules help CNNs focus on relevant features in video tasks with high inter-frame redundancy [40,41]. For instance, channel attention (CA) has been used effectively in image-based fire detection [24,33], while the dual attention design proposed by Yar et al. [36] enhances performance by combining channel and spatial cues.
Our choice of integrating the Convolutional Block Attention Module (CBAM) into the MC3_18 architecture is informed by both theoretical and empirical precedent. The original CBAM paper [42] demonstrated notable performance improvements when applied to 2D ResNet architectures, such as ResNet-50, in image recognition tasks. While CBAM has primarily been explored in 2D contexts, its lightweight design and modularity allow seamless integration with spatiotemporal models such as MC3_18. A prior study on Malaysian Sign Language recognition using CBAM-enhanced MC3_18 [43] further validated this approach, showing substantial performance gains over the baseline MC3_18 architecture.
CBAM applies channel and spatial attention sequentially to intermediate feature maps. The channel attention sub-module uses global average and max pooling to compute feature importance across channels, while the spatial attention sub-module generates a 2D attention map to highlight informative regions. Importantly, CBAM adds minimal computational overhead, preserving real-time feasibility. Motivated by these findings, we adopted CBAM in our indoor fire detection model to enhance spatiotemporal feature refinement while maintaining computational efficiency (see Figure 2 and Figure 3).
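To make the attention mechanism concrete, the snippet below sketches a CBAM variant operating on 5D (batch, channel, time, height, width) feature maps. It is an illustrative adaptation, assuming a channel-reduction ratio of 16 and a 7-element spatial-attention kernel applied jointly across time and space; the exact configuration used in our model follows Figure 3.

```python
# Sketch of a CBAM block adapted to 5D (B, C, T, H, W) feature maps; the reduction
# ratio and spatial kernel size are assumptions for illustration.
import torch
import torch.nn as nn

class CBAM3D(nn.Module):
    def __init__(self, channels: int, reduction: int = 16, spatial_kernel: int = 7):
        super().__init__()
        # Channel attention: shared MLP over global average- and max-pooled descriptors.
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels))
        # Spatial attention: convolution over channel-wise average and max maps.
        self.spatial = nn.Conv3d(2, 1, kernel_size=spatial_kernel,
                                 padding=spatial_kernel // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c = x.shape[:2]
        avg = self.mlp(x.mean(dim=(2, 3, 4)))            # (B, C)
        mx = self.mlp(x.amax(dim=(2, 3, 4)))             # (B, C)
        x = x * torch.sigmoid(avg + mx).view(b, c, 1, 1, 1)
        s = torch.cat([x.mean(dim=1, keepdim=True),
                       x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial(s))
```

As a rough wiring example, one could wrap each backbone stage, e.g., `model.layer1 = nn.Sequential(model.layer1, CBAM3D(64))`, and similarly for layer2 through layer4 with 128, 256, and 512 channels; inserting the module inside each residual block, as depicted in Figure 3, is the configuration adopted in this work.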
In summary, by leveraging spatiotemporal video input, a hybrid 2D–3D convolutional backbone, and lightweight attention modules, our proposed framework offers a robust, interpretable, and computationally efficient solution for early indoor fire detection. This approach is well-aligned with practical deployment needs and contributes to the growing body of real-time safety intelligence systems.

3.3. Preprocessing and Hyperparameters

All the video frames were resized to 256 × 256, followed by center cropping to 224 × 224. We sampled five clips per video, each consisting of 16 frames at a temporal resolution of 15 FPS, yielding a duration of approximately 1.07 s per clip. This configuration was selected to capture short-term temporal dynamics while maintaining training efficiency.
Regarding data augmentation strategy, during training, we applied a combination of spatial augmentations to enhance generalization and mimic realistic indoor variability. These included random horizontal flipping (p = 0.5), color jittering (brightness, contrast, saturation, and hue), and mild rotation (±10 degrees) to account for camera angle variations. We also employed random resized cropping within a scale of 0.9–1.0 to preserve spatial context while introducing variability in object positioning. Temporal jittering was achieved by sampling multiple clips at random starting frames from each video. This augmentation pipeline was designed to reduce overfitting while maintaining critical spatiotemporal patterns relevant to fire detection.
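As an illustration of how this pipeline could be realized, the sketch below decodes a clip with torchvision, resamples it to roughly 15 FPS, and applies the augmentations listed above with shared parameters across the 16 frames. The jitter magnitudes, the use of torchvision.io for decoding, and the handling of short videos are assumptions rather than the exact settings used here.

```python
# Sketch of the training-time clip pipeline (assumes a recent torchvision with
# tensor-based transforms and read_video's output_format argument).
import torch
from torchvision import transforms
from torchvision.io import read_video

train_tf = transforms.Compose([
    transforms.Resize(256),
    transforms.RandomResizedCrop(224, scale=(0.9, 1.0)),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomRotation(degrees=10),
    # Jitter magnitudes below are illustrative assumptions.
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2, hue=0.05),
])

def sample_clip(path: str, num_frames: int = 16, fps: int = 15) -> torch.Tensor:
    frames, _, meta = read_video(path, pts_unit="sec", output_format="TCHW")
    step = max(int(round(meta["video_fps"] / fps)), 1)       # resample to ~15 FPS
    start = torch.randint(0, max(len(frames) - step * num_frames, 1), (1,)).item()
    clip = frames[start : start + step * num_frames : step][:num_frames]
    # (Short videos would need padding to 16 frames; omitted here.)
    clip = train_tf(clip.float() / 255.0)                     # same params for all frames
    return clip.permute(1, 0, 2, 3)                           # (C, T, H, W) for the 3D CNN
```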
The model was trained for 50 epochs using a batch size of 16 and an initial learning rate of 1 × 10−4. These values were selected based on exploratory experiments evaluating training stability and validation accuracy. While we did not perform a full grid search, we empirically tested learning rates (1 × 10−3, 1 × 10−4, 1 × 10−5), batch sizes (8, 16), and dropout values (0.3, 0.5). We used fine-tuning from pretrained weights to accelerate convergence and enhance generalization.
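A minimal training loop consistent with these hyperparameters might look as follows; the optimizer (Adam) and the cross-entropy loss are assumptions of this sketch, as they are not specified above.

```python
# Hedged training sketch: epochs, batch size, and learning rate follow the values
# reported above; optimizer and loss are assumptions.
import torch
from torch.utils.data import DataLoader

def train(model, train_set, val_set, epochs=50, batch_size=16, lr=1e-4, device="cuda"):
    model = model.to(device)
    loader = DataLoader(train_set, batch_size=batch_size, shuffle=True, num_workers=4)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = torch.nn.CrossEntropyLoss()
    for epoch in range(epochs):
        model.train()
        for clips, labels in loader:                  # clips: (B, C, T, H, W)
            optimizer.zero_grad()
            loss = criterion(model(clips.to(device)), labels.to(device))
            loss.backward()
            optimizer.step()
        # ... evaluate on val_set each epoch and keep the best checkpoint ...
```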
All the video samples were manually annotated by the authors based on the visible presence or absence of fire. A clip was labeled as “fire” only if open flames were clearly observable in at least one frame. Non-fire clips included both typical indoor activities and visually similar false alarm scenarios, such as LED displays, stoves, or screen flicker. Ambiguous cases were cross-checked by a second annotator to ensure labeling consistency. This labeling process was designed to reflect realistic CCTV monitoring conditions and emphasize early-stage fire detection.

4. Experimental Results

4.1. Performance Metrics and Assessment

We evaluate the model in two complementary ways. The first uses three measures: accuracy, false positive (FP) rate, and false negative (FN) rate. Accuracy, defined in Equation (1), is the proportion of test samples that are classified correctly. The FP rate quantifies false alarms (non-fire samples predicted as fire), while the FN rate quantifies missed detections (fire samples predicted as non-fire). The second evaluation uses precision, recall, and F-measure (F-m). Precision (Equation (2)) is the number of true positives (TP) divided by the total number of samples predicted as positive (TP + FP). Recall (Equation (3)) is the proportion of actual fire samples that are correctly identified, i.e., TP divided by (TP + FN); it corresponds to the sensitivity of the fire monitoring system. Finally, the F-measure (Equation (4)) is the harmonic mean of precision and recall. These measures are used to compare our method against other models.
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{1}$$
$$\text{Precision} = \frac{TP}{TP + FP} \tag{2}$$
$$\text{Recall} = \frac{TP}{TP + FN} \tag{3}$$
$$\text{F-measure} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \tag{4}$$
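Equations (1)-(4) translate directly into a small evaluation helper; the sketch below assumes label 1 denotes the fire class.

```python
# Compute accuracy, precision, recall, and F-measure from binary predictions (1 = fire).
def classification_metrics(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f_measure": f_measure}
```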
To illustrate the model’s training behavior and generalization, Figure 4 shows the evolution of training and validation loss and accuracy across epochs. Both loss curves decrease steadily, while the training and validation accuracies consistently improve and eventually stabilize. This indicates successful convergence. The final training accuracy reaches approximately 97%, with the validation accuracy stabilizing around 90%, demonstrating that the model generalizes well without overfitting. The relatively small gap between the training and validation metrics further supports this.
On the held-out test set, the CBAM-enhanced MC3_18 model achieves an accuracy of 86.11%, with a false positive rate of 11.63% and a false negative rate of 15.96%. The precision, recall, and F1-score are 88.76%, 84.04%, and 86.34%, respectively, confirming strong generalization to previously unseen indoor fire scenarios.
To assess the effectiveness of the Convolutional Block Attention Module (CBAM), we compared the proposed model with the baseline MC3_18 model without attention. The CBAM-enhanced model achieved a 4.5% increase in accuracy, a 6.9% gain in recall, and a 4% improvement in F1-score, along with a significant reduction (5.5%) in the false positive rate. These results demonstrate that the attention mechanism helps the network focus more effectively on informative fire-related spatiotemporal features.

4.2. Performance Comparison with Earlier Approaches

Accurate and reliable fire detection remains a critical requirement in fire safety engineering, particularly in residential indoor environments where early intervention is essential to minimize loss of life and property. However, many existing deep learning-based fire detection models are trained and evaluated on idealized or overly simplified datasets, often dominated by outdoor scenes, fully developed flames, or synthetic fire footage. These datasets typically fail to reflect the complexity and ambiguity of real indoor environments, which often include early-stage combustion, low-contrast smoke, artificial lighting, visual occlusions, and cluttered settings.
To address this gap, we introduce a newly collected dataset focused exclusively on indoor residential fire scenarios. This dataset presents a challenging binary classification task—distinguishing indoor fire from non-fire scenes—captured under realistic indoor conditions and featuring numerous edge cases, such as partially occluded flames and fire-like visual distractors. In parallel, we also propose an attention-augmented 3D convolutional neural network (3D CNN) architecture, which, to the best of our knowledge, has not been previously applied to fire detection tasks. While our model achieves strong performance, the main focus of this paper is not model innovation, but to demonstrate how dataset realism and task context fundamentally influence detection performance.
The results presented in Table 6 and Table 7, and the accompanying Figure 5, are not intended as direct model benchmarking, where all the models are trained and tested on the same dataset. Instead, the aim is to illustrate how reported performance metrics are often inflated due to overly simplified or non-representative datasets, particularly those that neglect early-stage or indoor fire scenarios. This comparative framework provides a broader perspective on the limitations of conventional fire datasets and highlights the need for more contextually realistic data in evaluating model robustness and deployment readiness.
As shown in Table 6, several existing models report very high classification accuracy, often exceeding 95%. For example, Yar et al. [44] report 98.75% accuracy on a forest fire dataset, and Huang et al. [27] report 98.26% using a combination of the Corsican, Foggia, and Chino datasets. However, these datasets are predominantly composed of outdoor scenes or highly visible flame events. Even the work by Khan et al. [15], which achieves 94.39% accuracy, explicitly excludes smoke-only frames from the fire class, making it unsuitable for early-stage fire detection.
In contrast, our model achieves 86.11% accuracy, with a false positive rate of 11.63% and a false negative rate of 15.96%, evaluated on a deliberately challenging indoor dataset with early-stage combustion and realistic visual distractions. Although our accuracy is lower than that reported on idealized datasets, it reflects far more complex input conditions. Notably, the indoor-focused model by Pincott et al. [19]—one of the few comparable works—achieves only 82.63% accuracy and suffers from a false positive rate of 72.00%, which is impractical for real-world use due to alarm fatigue and desensitization.
As summarized in Table 7, our attention-augmented 3D CNN also delivers a well-balanced trade-off between precision (88.76%), recall (84.04%), and F-measure (86.34%). Compared to Khan et al. [15], who report higher recall (98.00%) but lower precision (82.00%), our model offers better control over false alarms—a critical requirement in residential settings where user trust and compliance are essential. Likewise, while Huang et al. [27] report a high F-measure of 96.86%, their evaluation is based on mixed-scene datasets with limited indoor realism, potentially overestimating model performance in actual indoor deployments.
These results reinforce the central message of this work: high reported accuracy on idealized or outdoor datasets does not necessarily indicate real-world readiness, especially in the context of early-stage indoor fire detection. Our study demonstrates that dataset context and realism are crucial determinants of model robustness, and we advocate for more rigorous dataset design and evaluation practices that reflect real deployment conditions. In this setting, our attention-augmented 3D CNN provides reliable and deployment-ready performance under conditions where conventional models often falter.

4.3. Grad-CAM++ Visualization and Interpretation

To gain deeper insight into the internal decision-making of the proposed 3D Attention–CNN model, we applied Grad-CAM++ to visualize activation responses during indoor fire detection tasks. Figure 6 presents original video frames alongside their corresponding Grad-CAM++ heatmaps for multiple fire scenarios, including both RGB and thermal surveillance footage.
Each row pair in the figure compares the original frames (top) with their respective heatmaps (bottom), highlighting the spatiotemporal regions the model considers most informative for classification. Across different conditions, from low-contrast thermal videos to fully visible flames in RGB scenes, the heatmaps consistently localize fire-relevant features such as ignition points, spreading flames, and zones of elevated thermal intensity.
In the first example, drawn from a thermal video with limited texture information, the model correctly focuses on compact heat sources and subtle motion cues, showing robustness to visual ambiguity. In the second and third RGB-based examples, where flames and smoke are visible, the model emphasizes dynamically changing fire regions and surrounding smoke dispersion. These responses illustrate the model’s ability to distinguish realistic indoor fire behavior from irrelevant or misleading visual stimuli.
Importantly, the model does not rely on superficial or fire-like visual textures alone. The Grad-CAM++ visualizations demonstrate that it captures meaningful temporal transitions and context-specific cues such as flickering light, partial occlusion by furniture, and interactions with cluttered indoor elements. This behavior reflects its training on realistic residential fire data and contributes to its resilience against false alarms, particularly in scenarios involving artificial lighting or innocuous heat sources.
Such focused and interpretable attention patterns strengthen the model’s transparency and explainability. By aligning visual evidence with semantically meaningful fire indicators, the model provides decision rationale that aligns with human expectations and domain knowledge. This enhances trust in its predictions and affirms its suitability for deployment in real-time surveillance systems where safety and reliability are paramount.
In summary, Grad-CAM++ results validate that our 3D Attention–CNN model not only performs well quantitatively but also visually localizes relevant fire cues in realistic indoor environments. Its strong visual reasoning capabilities—combined with false alarm mitigation and contextual awareness—make it a promising candidate for safety-critical applications in residential fire monitoring.
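For completeness, the sketch below shows a plain Grad-CAM-style computation for a 3D CNN, i.e., gradient-weighted activation maps from a chosen convolutional stage. It uses the simpler Grad-CAM weighting rather than the full Grad-CAM++ formulation applied in this study, and the choice of model.layer4 as the target layer and of index 1 as the fire class are assumptions.

```python
# Simplified Grad-CAM-style sketch for a 3D CNN (not the full Grad-CAM++ weighting).
import torch
import torch.nn.functional as F

def gradcam_3d(model, clip, target_layer, class_idx=1):
    feats, grads = {}, {}
    h1 = target_layer.register_forward_hook(lambda m, i, o: feats.update(a=o))
    h2 = target_layer.register_full_backward_hook(lambda m, gi, go: grads.update(a=go[0]))
    model.eval()
    score = model(clip)[0, class_idx]      # class score for the "fire" logit
    model.zero_grad()
    score.backward()
    h1.remove(); h2.remove()
    a, g = feats["a"], grads["a"]                      # (1, C, T', H', W')
    weights = g.mean(dim=(2, 3, 4), keepdim=True)      # channel weights from pooled gradients
    cam = F.relu((weights * a).sum(dim=1, keepdim=True))
    cam = F.interpolate(cam, size=clip.shape[2:], mode="trilinear", align_corners=False)
    return (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)   # (1, 1, T, H, W) in [0, 1]

# Example (hypothetical layer choice): heat = gradcam_3d(model, clip, model.layer4)
```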

4.4. Model Efficiency and Real-Time Inference Capability

Efficient real-time inference is essential for fire detection systems operating in safety-critical environments, where timely alerts can significantly impact emergency response and damage mitigation.
Table 8 presents a comparative analysis of several fire detection models based on three core efficiency metrics: frames per second (FPS), model size (MB), and parameter count (millions). These comparisons underscore the trade-offs between computational efficiency and detection reliability in practical deployments.
Our proposed model achieves 47.6 FPS on an NVIDIA RTX 4090 GPU, with a compact size of 44.18 MB and 11.58 million parameters. This configuration demonstrates a strong balance between high-speed inference and manageable model complexity, making it suitable for real-time applications on modern GPU-based surveillance systems.
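For reproducibility, throughput and footprint figures of the kind reported in Table 8 can be estimated with a short profiling routine such as the one below; the warm-up and iteration counts are arbitrary, and whether FPS counts individual frames or whole clips is an assumption of this sketch.

```python
# Sketch for measuring parameter count, model size, and inference throughput.
import time
import torch

def profile(model, clip_shape=(1, 3, 16, 224, 224), iters=100, device="cuda"):
    model = model.to(device).eval()
    params_m = sum(p.numel() for p in model.parameters()) / 1e6
    size_mb = sum(p.numel() * p.element_size() for p in model.parameters()) / 1024 ** 2
    clip = torch.randn(clip_shape, device=device)
    sync = torch.cuda.synchronize if device.startswith("cuda") else (lambda: None)
    with torch.no_grad():
        for _ in range(10):                  # warm-up iterations
            model(clip)
        sync()
        start = time.perf_counter()
        for _ in range(iters):
            model(clip)
        sync()
    clips_per_sec = iters / (time.perf_counter() - start)
    return {"params_M": params_m, "size_MB": size_mb,
            "clips_per_sec": clips_per_sec,
            "frames_per_sec": clips_per_sec * clip_shape[2]}
```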
In comparison, larger architectures such as ResNet50 [27] and ResNetFire [45] exceed 25 million parameters and 97 MB in size. While ResNet50 reports a higher FPS of 112, the absence of detailed hardware specifications in that study limits the fairness of direct comparisons. Additionally, such large models may demand greater memory bandwidth and computational resources, which can restrict deployment on embedded or resource-constrained platforms.
Other lightweight approaches, including MSAM [41] and DFAN [36], achieve 75.15 FPS and 70.55 FPS respectively, measured on RTX 2070 GPUs. Although these models offer excellent throughput, they do so with significantly fewer parameters (e.g., MSAM with only 3.17 million), which may compromise their ability to generalize to complex fire behaviors, especially in visually cluttered or low-contrast indoor scenes.
Similarly, MobileNet_V2 [29] demonstrates extreme compactness (13.23 MB) but records a lower 39.78 FPS on an NVIDIA TITAN X GPU, raising concerns about scalability for high-load surveillance environments, particularly in multi-camera deployments.
Taken together, our model offers a compelling trade-off: it retains a relatively small footprint while delivering robust inference speeds. This makes it well-suited for real-time fire monitoring systems that require both contextual accuracy and computational scalability in residential and indoor settings.

5. Conclusions

This study presents a novel contribution to vision-based fire detection by introducing a safety-driven, contextually diverse video dataset tailored to early-stage indoor fire dynamics. By addressing critical gaps in the existing datasets, such as the dominance of fully developed fires and the lack of scene specificity, our curated resource enables more realistic evaluation of detection systems in residential environments.
Leveraging this dataset, we implemented a spatiotemporal deep learning model based on the MC3_18 architecture, enhanced with Convolutional Block Attention Modules (CBAM). The proposed system delivers strong performance across accuracy, precision, and recall metrics, while maintaining low false positive and negative rates and achieving computational efficiency suitable for real-time deployment. Grad-CAM++ visualizations confirm the model’s attention to semantically relevant fire features, enhancing interpretability.
These results affirm that detection performance in safety-critical settings is tightly coupled with dataset quality. By aligning model design with the complexities of real-world indoor fires, this work offers a practical, interpretable, and scalable solution for early fire detection.
From a fire safety engineering standpoint, practical deployment and user acceptance are just as important as model performance. While our current work focuses on video classification, the proposed system could be extended to function as a complementary module alongside conventional fire detection systems, such as smoke or heat alarms. In such applications, video-based analysis could serve as a secondary verification layer to reduce false alarms and improve early-stage fire detection, particularly in residential settings where false positives (e.g., from cooking, fireplaces, or LED lighting) are common. We also acknowledge privacy concerns in indoor surveillance and note that future deployment could leverage local, edge-based processing to ensure that video data remains within the household. This would preserve privacy while supporting real-time, reliable detection in complex indoor environments.
While our dataset of 1108 videos covers diverse indoor fire scenarios, its size and domain focus may limit generalizability to outdoor or industrial settings. The system was not evaluated for outdoor settings, as this is beyond the intended application. Finally, although standard metrics were used, real-world deployment may introduce practical factors that are not fully captured in experimental testing. Future work will aim to expand the dataset further and assess performance across more diverse environments and longer video durations.
To further improve detection robustness and real-world applicability, future research will focus on integrating multimodal data sources, including thermal imaging, audio signals, and gas sensor inputs. Furthermore, lightweight model compression techniques will be explored to optimize performance on edge devices, ensuring efficient deployment in resource-constrained settings, like smart cameras and mobile safety systems. We also acknowledge that future work may include cross-dataset evaluation, provided that more contextually aligned and realistic indoor fire datasets become available.

Author Contributions

Conceptualization, M.M.E.H.A. and M.G.; methodology, M.M.E.H.A.; software, M.M.E.H.A.; validation, M.M.E.H.A.; formal analysis, M.M.E.H.A.; investigation, M.M.E.H.A.; data curation, M.M.E.H.A.; writing—original draft preparation, M.M.E.H.A.; writing—review and editing, M.G. and M.M.E.H.A.; visualization, M.M.E.H.A.; supervision, M.G.; funding acquisition, M.G. All authors have read and agreed to the published version of the manuscript.

Funding

The authors acknowledge the financial support of UNSW through the Tuition Fee Scholarship (TFS) program awarded to Mostafa M. E. H. Ali.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The video dataset used in this study is available from the corresponding author upon reasonable request (contact: mostafa.ali@unsw.edu.au).

Acknowledgments

The authors gratefully acknowledge Murat Tahtali of the University of New South Wales (UNSW) for providing access to high-performance computing resources used throughout this study.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Ghassempour, N.; Tannous, W.K.; Agho, K.E.; Avsar, G.; Harvey, L.A. Comparison of causes, characteristics and consequences of residential fires in social and non-social housing dwellings in New South Wales, Australia. Prev. Med. Rep. 2022, 28, 101860. [Google Scholar] [CrossRef] [PubMed]
  2. Dinaburg, J.; Gottuk, D. Smoke alarm nuisance source characterization: Review and recommendations. Fire Technol. 2016, 52, 1197–1233. [Google Scholar] [CrossRef]
  3. Mailstop, F.; Prevention, A.F.; Grants, S. Global Concepts in Residential Fire Safety Part 3–Best Practices from Canada, Puerto Rico, Mexico, and Dominican Republic; System Planning Corporation: Arlington, VA, USA, 2009. [Google Scholar]
  4. Coates, L.; Kaandorp, G.; Harris, J.; Van Leeuwen, J.; Avci, A.; Evans, J.; George, S.; Gissing, A.; van den Honert, R.; Haynes, K. Preventable Residential Fire Fatalities in Australia July 2003 to June 2017; Bushfire and Natural Hazards CRC: Melbourne, Australia, 2019. [Google Scholar]
  5. Liu, G.; Yuan, H.; Huang, L. A fire alarm judgment method using multiple smoke alarms based on Bayesian estimation. Fire Saf. J. 2023, 136, 103733. [Google Scholar] [CrossRef]
  6. Khan, F.; Xu, Z.; Sun, J.; Khan, F.M.; Ahmed, A.; Zhao, Y. Recent advances in sensors for fire detection. Sensors 2022, 22, 3310. [Google Scholar] [CrossRef] [PubMed]
  7. Gaur, A.; Singh, A.; Kumar, A.; Kulkarni, K.S.; Lala, S.; Kapoor, K.; Srivastava, V.; Kumar, A.; Mukhopadhyay, S.C. Fire sensing technologies: A review. IEEE Sens. J. 2019, 19, 3191–3202. [Google Scholar] [CrossRef]
  8. Smoke Alarms Fail in a Third of House Fires. Available online: https://www.bbc.co.uk/news/uk-england-50598387 (accessed on 30 April 2025).
  9. Jin, C.; Wang, T.; Alhusaini, N.; Zhao, S.; Liu, H.; Xu, K.; Zhang, J. Video Fire Detection Methods Based on Deep Learning: Datasets, Methods, and Future Directions. Fire 2023, 6, 315. [Google Scholar] [CrossRef]
  10. Xu, F.; Zhang, X.; Deng, T.; Xu, W. An image-based fire monitoring algorithm resistant to fire-like objects. Fire 2023, 7, 3. [Google Scholar] [CrossRef]
  11. Chino, D.Y.; Avalhais, L.P.; Rodrigues, J.F.; Traina, A.J. Bowfire: Detection of fire in still images by integrating pixel color and texture analysis. In Proceedings of the 2015 28th SIBGRAPI Conference on Graphics, Patterns and Images, Salvador, Brazil, 26–29 August 2015; pp. 95–102. [Google Scholar]
  12. Ahmad, N.; Akbar, M.; Alkhammash, E.H.; Jamjoom, M.M. CN2VF-Net: A Hybrid Convolutional Neural Network and Vision Transformer Framework for Multi-Scale Fire Detection in Complex Environments. Fire 2025, 8, 211. [Google Scholar] [CrossRef]
  13. Abdusalomov, A.; Umirzakova, S.; Tashev, K.; Sevinov, J.; Temirov, Z.; Muminov, B.; Buriboev, A.; Safarova Ulmasovna, L.; Lee, C. AI-Driven Boost in Detection Accuracy for Agricultural Fire Monitoring. Fire 2025, 8, 205. [Google Scholar] [CrossRef]
  14. Safarov, F.; Muksimova, S.; Kamoliddin, M.; Cho, Y.I. Fire and Smoke Detection in Complex Environments. Fire 2024, 7, 389. [Google Scholar] [CrossRef]
  15. Muhammad, K.; Ahmad, J.; Baik, S.W. Early fire detection using convolutional neural networks during surveillance for effective disaster management. Neurocomputing 2018, 288, 30–42. [Google Scholar] [CrossRef]
  16. Foggia, P.; Saggese, A.; Vento, M. Real-time fire detection for video-surveillance applications using a combination of experts based on color, shape, and motion. IEEE Trans. Circuits Syst. Video Technol. 2015, 25, 1545–1556. [Google Scholar] [CrossRef]
  17. Muhammad, K.; Ahmad, J.; Lv, Z.; Bellavista, P.; Yang, P.; Baik, S.W. Efficient deep CNN-based fire detection and localization in video surveillance applications. IEEE Trans. Syst. Man Cybern. Syst. 2018, 49, 1419–1434. [Google Scholar] [CrossRef]
  18. Jadon, A.; Varshney, A.; Ansari, M.S. Low-complexity high-performance deep learning model for real-time low-cost embedded fire detection systems. Procedia Comput. Sci. 2020, 171, 418–426. [Google Scholar] [CrossRef]
  19. Pincott, J.; Tien, P.W.; Wei, S.; Calautit, J.K. Indoor fire detection utilizing computer vision-based strategies. J. Build. Eng. 2022, 61, 105154. [Google Scholar] [CrossRef]
  20. Ahn, Y.; Choi, H.; Kim, B.S. Development of early fire detection model for buildings using computer vision-based CCTV. J. Build. Eng. 2023, 65, 105647. [Google Scholar] [CrossRef]
  21. Wu, H.; Wu, D.; Zhao, J. An intelligent fire detection approach through cameras based on computer vision methods. Process Saf. Environ. Prot. 2019, 127, 245–256. [Google Scholar] [CrossRef]
  22. Yuan, F.; Zhang, L.; Xia, X.; Huang, Q.; Li, X. A gated recurrent network with dual classification assistance for smoke semantic segmentation. IEEE Trans. Image Process. 2021, 30, 4409–4422. [Google Scholar] [CrossRef] [PubMed]
  23. Tao, H.; Duan, Q. An adaptive frame selection network with enhanced dilated convolution for video smoke recognition. Expert Syst. Appl. 2023, 215, 119371. [Google Scholar] [CrossRef]
  24. Majid, S.; Alenezi, F.; Masood, S.; Ahmad, M.; Gündüz, E.S.; Polat, K. Attention based CNN model for fire detection and localization in real-world images. Expert Syst. Appl. 2022, 189, 116114. [Google Scholar] [CrossRef]
  25. Chaoxia, C.; Shang, W.; Zhang, F. Information-guided flame detection based on faster R-CNN. IEEE Access 2020, 8, 58923–58932. [Google Scholar] [CrossRef]
  26. Li, Y.; Zhang, W.; Liu, Y.; Jing, R.; Liu, C. An efficient fire and smoke detection algorithm based on an end-to-end structured network. Eng. Appl. Artif. Intell. 2022, 116, 105492. [Google Scholar] [CrossRef]
  27. Huang, L.; Liu, G.; Wang, Y.; Yuan, H.; Chen, T. Fire detection in video surveillances using convolutional neural networks and wavelet transform. Eng. Appl. Artif. Intell. 2022, 110, 104737. [Google Scholar] [CrossRef]
  28. Zheng, S.; Gao, P.; Wang, W.; Zou, X. A highly accurate forest fire prediction model based on an improved dynamic convolutional neural network. Appl. Sci. 2022, 12, 6721. [Google Scholar] [CrossRef]
  29. Muhammad, K.; Khan, S.; Palade, V.; Mehmood, I.; De Albuquerque, V.H.C. Edge intelligence-assisted smoke detection in foggy surveillance environments. IEEE Trans. Ind. Inform. 2019, 16, 1067–1075. [Google Scholar] [CrossRef]
  30. Töreyin, B.U.; Dedeoğlu, Y.; Güdükbay, U.; Cetin, A.E. Computer vision based method for real-time fire and flame detection. Pattern Recognit. Lett. 2006, 27, 49–58. [Google Scholar] [CrossRef]
  31. Dimitropoulos, K.; Barmpoutis, P.; Grammalidis, N. Spatio-temporal flame modeling and dynamic texture analysis for automatic video-based fire detection. IEEE Trans. Circuits Syst. Video Technol. 2014, 25, 339–351. [Google Scholar] [CrossRef]
  32. Yuan, F. Video-based smoke detection with histogram sequence of LBP and LBPV pyramids. Fire Saf. J. 2011, 46, 132–139. [Google Scholar] [CrossRef]
  33. Li, S.; Yan, Q.; Liu, P. An efficient fire detection method based on multiscale feature extraction, implicit deep supervision and channel attention mechanism. IEEE Trans. Image Process. 2020, 29, 8467–8475. [Google Scholar] [CrossRef] [PubMed]
  34. Cheng, H.-Y.; Yin, J.-L.; Chen, B.-H.; Yu, Z.-M. Smoke 100k: A database for smoke detection. In Proceedings of the 2019 IEEE 8th Global Conference on Consumer Electronics (GCCE), Osaka, Japan, 15–18 October 2019; pp. 596–597. [Google Scholar]
  35. Cetin, A.E. Computer Vision Based Fire Detection Dataset. 2024. Available online: http://signal.ee.bilkent.edu.tr/VisiFire/ (accessed on 26 January 2025).
  36. Yar, H.; Hussain, T.; Agarwal, M.; Khan, Z.A.; Gupta, S.K.; Baik, S.W. Optimized dual fire attention network and medium-scale fire classification benchmark. IEEE Trans. Image Process. 2022, 31, 6331–6343. [Google Scholar] [CrossRef] [PubMed]
  37. Fire Safety Research Institute. Available online: https://www.youtube.com/@FireSafetyResearchInstitute (accessed on 26 January 2025).
  38. Oh, S.; Hoogs, A.; Perera, A.; Cuntoor, N.; Chen, C.-C.; Lee, J.T.; Mukherjee, S.; Aggarwal, J.K.; Lee, H.; Davis, L. A large-scale benchmark dataset for event recognition in surveillance video. In Proceedings of the CVPR 2011, Colorado Springs, CO, USA, 20–25 June 2011; pp. 3153–3160. [Google Scholar]
  39. Tran, D.; Wang, H.; Torresani, L.; Ray, J.; LeCun, Y.; Paluri, M. A closer look at spatiotemporal convolutions for action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6450–6459. [Google Scholar]
  40. Tang, J.; Shu, X.; Yan, R.; Zhang, L. Coherence constrained graph LSTM for group activity recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 44, 636–647. [Google Scholar] [CrossRef] [PubMed]
  41. Yar, H.; Khan, Z.A.; Rida, I.; Ullah, W.; Kim, M.J.; Baik, S.W. An efficient deep learning architecture for effective fire detection in smart surveillance. Image Vis. Comput. 2024, 145, 104989. [Google Scholar] [CrossRef]
  42. Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  43. Khan, R.U.; Wong, W.S.; Ullah, I.; Algarni, F.; Haq, U.; Inam, M.; bin Barawi, M.H.; Khan, M.A. Evaluating the Efficiency of CBAM-Resnet Using Malaysian Sign Language. Comput. Mater. Contin. 2022, 71, 2755–2772. [Google Scholar] [CrossRef]
  44. Yar, H.; Ullah, W.; Khan, Z.A.; Baik, S.W. An effective attention-based CNN model for fire detection in adverse weather conditions. ISPRS J. Photogramm. Remote Sens. 2023, 206, 335–346. [Google Scholar] [CrossRef]
  45. Sharma, J.; Granmo, O.-C.; Goodwin, M.; Fidje, J.T. Deep convolutional neural networks for fire detection in images. In Proceedings of the Engineering Applications of Neural Networks: 18th International Conference, EANN 2017, Athens, Greece, 25–27 August 2017; pp. 183–193. [Google Scholar]
Figure 2. Schematic of the proposed framework for reliably differentiating hazardous fire incidents from benign or non-fire scenarios.
Figure 3. Integration of the CBAM within a ResNet residual block, highlighting the exact insertion point where CBAM refines the convolutional feature maps in each block.
Figure 4. Training and validation loss and accuracy of the MC3_18–CBAM spatiotemporal fire detection model.
Figure 5. Performance comparison of the proposed model across key evaluation metrics: accuracy, recall, precision, and F-measure [15,19,27,36].
Figure 6. Grad-CAM++ overlay for the “indoor fire” class, showing strong activation on flame cores, rising smoke, and dynamic ember regions.
Table 3. Dataset split for the proposed indoor fire corpus.
Split | Fire | Non-Fire
Training | 337 | 357
Validation | 114 | 120
Test | 86 | 94
Total | 537 | 571
Overall: 1108 clips
Table 4. Recognition performance and parameter count of various ResNet-based video models.
Network | Number of Parameters | Recognition Accuracy (%)
R3D | 33.4 M | 64.2
R(2+1)D | 33.3 M | 68.0
rMC3 | 33.0 M | 65.0
R2D | 11.4 M | 58.9
MC3 | 11.7 M | 64.7
Table 5. Baseline MC3_18 architecture (without CBAM).
Layer Name | Output Size | MC3_18 Configuration
conv1 | L × 112 × 112 | 3 × 7 × 7, 64, stride 1 × 2 × 2, padding 1 × 3 × 3
conv2_x | L × 56 × 56 | [3D block: 3 × 3 × 3, 64] × 2
conv3_x | L × 28 × 28 | [2D block: 3 × 3, 128] × 2, stride 2 × 2
conv4_x | L × 14 × 14 | [2D block: 3 × 3, 256] × 2, stride 2 × 2
conv5_x | L × 7 × 7 | [2D block: 3 × 3, 512] × 2, stride 2 × 2
pool | 1 × 1 × 1 | Global average pooling
fc | 1 × 1 × 1 | Fully connected (512 → output classes)
Table 6. Comparison of accuracy, false positives, and false negatives across state-of-the-art fire detection methods.
Ref | Dataset | Classes | False Positives | False Negatives | Accuracy
[36] | DFAN | Outdoor: boat, car, forest, etc. | - | - | 88.00%
[41] | Custom dataset | Outdoor/forest fires vs. non-fire | - | - | 93.50%
[44] | Custom dataset | Outdoor/forest fires vs. non-fire | - | - | 98.75%
[15] | Foggia’s & Chino’s (smoke instances are labeled as non-fire) | Outdoor/forest fires vs. non-fire | 9.07% | 2.13% | 94.39%
[27] | Corsican, Foggia, Chino | Indoor and outdoor fires vs. non-fire | 0.64% | 1.13% | 98.26%
[19] | Custom dataset | Indoor smoke, flame, non-fire | 72.00% | 10.67% | 82.63%
This work | Our new dataset (realistic, early-stage) | Indoor fires vs. non-fire | 11.63% | 15.96% | 86.11%
Table 7. Comparison of precision, recall, and F-measure across fire detection methods.
Ref | Dataset | Classes | Recall | Precision | F-Measure
[36] | DFAN | Outdoor: boat, car, forest, etc. | 88.00% | 88.00% | 87.00%
[41] | Custom dataset | Outdoor/forest fires vs. non-fire | 93.51% | 93.57% | 93.51%
[44] | Custom dataset | Outdoor/forest fires vs. non-fire | 98.70% | 98.82% | 98.74%
[15] | Foggia’s & Chino’s (smoke instances are labeled as non-fire) | Outdoor/forest fires vs. non-fire | 98.00% | 82.00% | 89.00%
[27] | Corsican, Foggia, Chino | Indoor and outdoor fires vs. non-fire | 96.09% | 97.65% | 96.86%
[19] | Custom dataset | Indoor smoke, flame, non-fire | 88.21% | 83.21% | 84.79%
This work | Our new dataset (realistic, early-stage) | Indoor fires vs. non-fire | 84.04% | 88.76% | 86.34%
Table 8. Performance and computational efficiency of various fire detection models. Our proposed model demonstrates a strong balance between parameter count, model size, and real-time inference capability (FPS) on high-end GPUs.
Model | FPS | Model Size (MB) | Parameters (Millions) | System Specification
Our model | 47.6 | 44.18 | 11.58 | NVIDIA RTX 4090 24 GB GPU
ResNet50 [27] | 112 | 97.6 | 25.6 | -
DFAN [36] | 70.55 / 12.90 | 83.63 | 23.9 | NVIDIA RTX 2070 12 GB GPU / Intel Core i9, 3.60 GHz CPU
MSAM [41] | 75.15 | 25.20 | 3.17 | NVIDIA RTX 2070 12 GB GPU
MobileNet_V2 [29] | 39.78 | 13.23 | - | NVIDIA TITAN X (Pascal) 12 GB GPU
ResNetFire [45] | 57.3 | 98.0 | 25.6 | NVIDIA 12 GB GPU
