Article

Fire and Smoke Detection in Complex Environments

1 Department of Computer Engineering, Gachon University, Sujeong-gu, Seongnam-si 461-701, Republic of Korea
2 Department of Financial Accounting and Reporting, Tashkent State University of Economics, Tashkent 100066, Uzbekistan
* Authors to whom correspondence should be addressed.
Fire 2024, 7(11), 389; https://doi.org/10.3390/fire7110389
Submission received: 8 October 2024 / Revised: 18 October 2024 / Accepted: 28 October 2024 / Published: 29 October 2024

Abstract

Fire detection is a critical task in environmental monitoring and disaster prevention, with traditional methods often limited in their ability to detect fire and smoke in real time over large areas. The rapid identification of fire and smoke in both indoor and outdoor environments is essential for minimizing damage and ensuring timely intervention. In this paper, we propose a novel approach to fire and smoke detection by integrating a vision transformer (ViT) with the YOLOv5s object detection model. Our modified model leverages the attention-based feature extraction capabilities of ViTs to improve detection accuracy, particularly in complex environments where fires may be occluded or distributed across large regions. By replacing the CSPDarknet53 backbone of YOLOv5s with ViT, the model is able to capture both local and global dependencies in images, resulting in more accurate detection of fire and smoke under challenging conditions. We evaluate the performance of the proposed model using a comprehensive Fire and Smoke Detection Dataset, which includes diverse real-world scenarios. The results demonstrate that our model outperforms baseline YOLOv5 variants in terms of precision, recall, and mean average precision (mAP), achieving a mAP@0.5 of 0.664 and a recall of 0.657. The modified YOLOv5s with ViT shows significant improvements in detecting fire and smoke, particularly in scenes with complex backgrounds and varying object scales. Our findings suggest that the integration of ViT as the backbone of YOLOv5s offers a promising approach for real-time fire detection in both urban and natural environments.

1. Introduction

Fire detection is critical to environmental monitoring and disaster management, especially given the increasing prevalence of wildfires and industrial fires worldwide [1]. The rapid identification and localization of fire and smoke can significantly mitigate the damage caused by such incidents, thereby preserving both human life and natural resources. Traditional fire detection systems, such as smoke detectors and thermal sensors, are often limited by their reliance on proximity to the fire source [2]. These methods may fail to provide early warnings in large outdoor environments, where fires can spread rapidly across vast distances [3]. Therefore, integrating computer vision techniques with remote sensing platforms offers a promising alternative for fire detection, allowing early identification and intervention. Remote sensing technologies, such as drones, satellites, and stationary cameras, are widely used in fire monitoring applications due to their ability to cover large areas and capture real-time data from diverse environments [4]. These technologies are often deployed in forests, urban areas, and industrial sites to detect fire and smoke from a distance, offering a non-invasive and cost-effective solution for early fire detection [5]. However, accurately identifying fire and smoke in complex environments remains a challenge due to varying lighting conditions, the presence of other environmental factors (such as fog or dust), and the dynamic nature of fire itself. Advanced object detection models, particularly those based on deep learning, have shown great promise in addressing these challenges by enhancing the accuracy and speed of detection.
In recent years, deep learning models, particularly convolutional neural networks (CNNs), have demonstrated remarkable success in object detection tasks. The YOLO (You Only Look Once) family of models has become a popular choice for real-time object detection due to its ability to balance speed and accuracy [6]. YOLOv5, the latest iteration, has been widely adopted for various detection tasks in real-time applications, including fire and smoke detection [7]. However, while YOLOv5 excels in detecting objects quickly, it can sometimes struggle with complex scenes where multiple objects overlap or are partially occluded, which are common in fire detection scenarios [8]. In response to these limitations, there has been growing interest in leveraging attention-based models, such as ViT, to improve detection accuracy. ViT represents a significant departure from traditional CNN-based architectures by adapting the self-attention mechanism, originally designed for natural language processing, to image recognition tasks [9]. Unlike CNNs, which rely on hierarchical processing of local features, ViT views images as sequences of fixed-size patches and uses self-attention to model long-range dependencies between these patches. This capability allows ViT to capture both local and global features simultaneously, making it particularly effective in tasks involving complex scenes, such as fire detection [10]. However, despite its advantages, ViT typically requires large-scale pre-training to fully realize its potential, as it lacks the inductive biases inherent in CNNs, such as local spatial relationships.
In this paper, we propose a novel approach for fire and smoke detection by integrating ViT as the backbone of the YOLOv5s object detection model. Our modified model seeks to capitalize on the strengths of both architectures: the efficiency and real-time capabilities of YOLOv5s and the attention-based feature extraction of the ViT. By replacing the original CSPDarknet53 backbone of YOLOv5s with ViT, we aim to improve the model’s ability to detect fire and smoke, especially in complex environments with occlusions and varying object scales. The modified model processes images by dividing them into fixed-size patches, which are then fed into the ViT for feature extraction, followed by object detection through the YOLOv5 framework. This approach enables the model to capture long-range dependencies and spatial relationships within the image, resulting in improved detection accuracy. To evaluate the performance of our model, we utilize a comprehensive Fire and Smoke Detection Dataset, which includes images from various real-world scenarios, both indoors and outdoors, under different lighting conditions. We implement a series of experiments comparing our modified model with baseline YOLOv5 variants (YOLOv5n, YOLOv5s, and YOLOv5m) across several key metrics, including precision, recall, and mean average precision (mAP). Our results demonstrate that the integration of ViT as the backbone of YOLOv5s significantly enhances detection accuracy, particularly in challenging environments where fire and smoke are difficult to differentiate from the background. This paper makes the following contributions:
  • We introduce a novel approach for fire and smoke detection by integrating a ViT with YOLOv5s, combining the strengths of attention-based feature extraction and real-time object detection.
  • We provide a detailed analysis of the performance of the modified model in comparison to baseline YOLOv5 variants, highlighting improvements in precision, recall, and mAP.
  • We demonstrate the effectiveness of our approach in real-world fire and smoke detection scenarios using a comprehensive dataset, emphasizing the model’s robustness under varying environmental conditions.
The remainder of this paper is structured as follows. Section 2 discusses related work in the field of fire and smoke detection, object detection models, and the use of attention mechanisms in computer vision. Section 3 presents the proposed methodology, detailing the architecture and implementation of the modified YOLOv5s model. Section 4 describes the dataset, experimental setup, and results. Section 5 concludes the paper and outlines future directions for research.

2. Related Work

The detection of fire and smoke using machine learning and computer vision techniques has been the subject of extensive research over the past few decades. Traditional fire detection systems such as smoke detectors, thermal cameras, and flame sensors have long been utilized in a wide range of applications, including residential and industrial settings [11]. However, these methods are often limited by their inability to provide real-time alerts in large-scale outdoor environments, where fires can quickly spread before they are detected by conventional means. To address these limitations, researchers have increasingly focused on developing vision-based systems that use images or video feeds from surveillance cameras, drones, or satellites to detect fires [12]. These systems leverage advancements in deep learning and object detection to enhance the accuracy and speed of fire detection in complex environments.

2.1. Traditional Vision-Based Fire Detection Methods

Early approaches to vision-based fire detection primarily relied on conventional image processing techniques [13]. These methods often focused on detecting visual features of fire or smoke, such as color, motion, and shape. For example, several studies have used pixel-based color models to detect fire in video feeds, exploiting the distinct color patterns exhibited by flames under normal lighting conditions [14]. Methods based on optical flow and background subtraction have also been employed to identify the characteristic motion of smoke and flames [15]. While these techniques were effective in certain scenarios, they often suffered from false positives due to their sensitivity to environmental changes, such as lighting variations and moving objects unrelated to fire. To improve the robustness of these systems, researchers began incorporating machine learning algorithms into fire detection frameworks. Early machine learning-based approaches used classifiers such as support vector machines (SVM) and decision trees to distinguish fire from non-fire objects based on handcrafted features extracted from images [16]. These features typically included color histograms, texture descriptors, and motion vectors. Although these methods improved detection accuracy, they were limited by the quality of the handcrafted features, which often failed to capture the complex, dynamic nature of fire and smoke in real-world environments.

2.2. Deep Learning-Based Object Detection for Fire and Smoke

The advent of deep learning, particularly convolutional neural networks (CNNs), marked a significant breakthrough in the field of object detection, enabling models to automatically learn complex features from raw data without the need for manual feature extraction. CNNs quickly became the backbone of modern fire and smoke detection systems due to their ability to capture both low-level and high-level visual features, thereby improving detection accuracy across diverse scenarios. One of the earliest applications of CNNs for fire detection involved training the models on large datasets of fire and non-fire images, allowing them to learn the distinguishing characteristics of fire, such as its texture, color, and shape [17]. The development of more sophisticated CNN-based architectures, such as Faster R-CNN [18], Single Shot Multibox Detector (SSD) [19], and the You Only Look Once (YOLO) family of models [20], further advanced fire and smoke detection capabilities. YOLO, in particular, emerged as a popular choice for real-time fire detection due to its speed and efficiency. Unlike traditional object detection models, which involve a two-stage process of region proposal and classification, YOLO performs both object localization and classification in a single step, making it well-suited for real-time applications.
Several studies have successfully applied YOLO to fire detection tasks, demonstrating its ability to detect fire in both indoor and outdoor environments with high accuracy and low latency [21]. For instance, YOLO has been integrated with drone-based systems to monitor forest fires, leveraging its real-time detection capabilities to provide early warnings and facilitate rapid response. However, despite their success, CNN-based models such as YOLO face certain limitations, particularly in complex environments where fires may be occluded, small in size, or distributed across the image. Additionally, CNNs often struggle to capture long-range dependencies and global context within images, which can be critical for accurately detecting fire and smoke in large-scale outdoor scenes. To address these limitations, researchers have begun exploring attention-based models, such as transformers, which have shown significant promise in modeling complex spatial relationships and improving object detection performance.

2.3. Attention Mechanisms and Vision Transformers in Object Detection

Attention mechanisms, originally developed for natural language processing tasks, have gained traction in computer vision due to their ability to selectively focus on the most relevant parts of an input, enabling models to capture long-range dependencies and complex patterns in data. The self-attention mechanism, in particular, has been widely adopted in image recognition and object detection tasks, allowing models to weigh the importance of different regions within an image based on their relevance to the task at hand [22]. This has led to the development of ViTs, which directly apply the transformer architecture to image data, offering a powerful alternative to CNNs [23]. The ViT architecture processes images by dividing them into fixed-size patches and treating each patch as a token, similar to how words are represented in natural language processing tasks. These patches are then flattened and fed into a transformer model, which uses self-attention to capture both local and global dependencies between the patches. This approach bypasses the need for convolutional layers, allowing ViTs to learn visual features directly from the data without imposing inductive biases such as translation invariance or local receptive fields [9]. ViTs have demonstrated competitive performance with state-of-the-art CNNs on large-scale image classification tasks, and recent studies have explored their application to object detection. Despite their success, ViTs generally require large-scale pre-training on massive datasets to achieve optimal performance, as they lack the built-in inductive biases of CNNs, which help in smaller datasets or scenarios with limited data [24]. However, once pre-trained, ViTs can excel in capturing complex relationships between objects or features in an image, making them particularly well-suited for tasks involving large or distributed objects, such as fire and smoke detection in outdoor environments.
Given the complementary strengths of CNNs and transformers, recent research has focused on hybrid models that combine the two architectures. These models aim to leverage the strong feature extraction capabilities of CNNs and the global context modeling of transformers to improve detection performance. In the context of fire and smoke detection, hybrid models can potentially offer the best of both worlds: real-time detection efficiency from CNNs and improved accuracy in complex scenes from ViTs [25]. A notable example of this approach is the combination of YOLO with ViT as the backbone, where the ViT replaces the traditional convolutional backbone used in YOLO models [10]. This hybrid approach allows the model to maintain the real-time detection capabilities of YOLO while benefiting from the attention-based feature extraction of ViT [26]. By capturing both local and global dependencies within the image, this architecture can enhance the model’s ability to detect fire and smoke under challenging conditions, such as occlusions or varying object scales [27]. Several studies have explored similar hybrid architectures for various object detection tasks, demonstrating improvements in both precision and recall. In the domain of fire detection, however, the use of transformers remains relatively unexplored, offering a promising avenue for future research [28]. The body of work on fire and smoke detection has evolved from traditional image processing techniques to deep learning models, with CNNs such as YOLO leading the way in real-time applications. However, the limitations of CNNs in complex environments have spurred interest in attention-based models such as ViTs. This paper contributes to the growing body of research on hybrid models by integrating ViT with YOLOv5s, offering a novel approach to fire and smoke detection that leverages the strengths of both architectures.

3. Methodology

In this section, we present the proposed methodology for fire and smoke detection, which aims to identify the source of a fire by detecting either smoke or flames. The methodology is organized as follows: Section 3.1 provides an overview of the ViT and Section 3.2 reviews the baseline YOLOv5s model, detailing their core architectural components and operational mechanisms. Section 3.3 expands upon the proposed approach, offering a detailed explanation of its structure and implementation.

3.1. Vision Transformer

ViT represents a novel approach in adapting the transformer architecture, which has achieved widespread success in natural language processing, to the domain of image recognition. Unlike CNNs, ViT processes images by first partitioning them into fixed-size patches, such as 16 × 16 pixels, and treating each patch as an input token, similar to how words are represented in NLP tasks. These patches are flattened and projected into embeddings, which are subsequently fed into the transformer model. By incorporating positional encodings, ViT ensures that the spatial relationships between the patches are preserved, a critical factor for tasks involving image classification. The fundamental contribution of ViT lies in its interpretation of images as sequential data, enabling the self-attention mechanism of the transformer to model both local and global dependencies across the entire image. This method bypasses the need for convolutional layers, traditionally relied upon in CNNs, offering a more flexible framework that learns visual features directly from data without imposing inductive biases such as translation invariance or local receptive fields. A key advantage of ViT is its scalability, which allows it to perform competitively with, and often surpass, state-of-the-art CNNs when pre-trained on large-scale datasets such as ImageNet-21k or JFT-300M. In such cases, ViT matches or exceeds the performance of CNNs while being more computationally efficient. However, its effectiveness diminishes when trained on smaller datasets, where CNNs typically benefit from their intrinsic image-specific biases. Thus, large-scale pre-training is crucial for ViT to fully leverage its capabilities.
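To make the patch-embedding step concrete, the following minimal PyTorch sketch illustrates how an image can be split into 16 × 16 patches, projected into token embeddings, and combined with learnable positional embeddings. It is an illustrative reconstruction based on the description above, not the authors' implementation; the 640 × 640 input size and 768-dimensional embedding are assumptions.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into fixed-size patches and embed them as tokens."""
    def __init__(self, img_size=640, patch_size=16, in_channels=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2          # N = HW / P^2
        # A strided convolution is equivalent to flattening each patch and applying a linear projection.
        self.proj = nn.Conv2d(in_channels, embed_dim, kernel_size=patch_size, stride=patch_size)
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches, embed_dim))

    def forward(self, x):                       # x: (B, C, H, W)
        x = self.proj(x)                        # (B, D, H/P, W/P)
        x = x.flatten(2).transpose(1, 2)        # (B, N, D): sequence of patch tokens
        return x + self.pos_embed               # add positional information

tokens = PatchEmbedding()(torch.randn(1, 3, 640, 640))
print(tokens.shape)  # torch.Size([1, 1600, 768])
```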

3.2. YOLOv5s

YOLOv5s is a streamlined version of the YOLO object detection model, designed specifically for speed and efficiency, making it suitable for resource-constrained environments such as mobile and embedded devices. It employs the CSPDarknet53 backbone, which utilizes cross stage partial (CSP) connections to enhance feature extraction. This architecture divides feature maps, processing one part through dense layers and the other through residual blocks, subsequently merging them to improve computational efficiency without compromising accuracy. The neck of the model employs a path aggregation network (PANet) to facilitate feature fusion, allowing the detection of objects across multiple scales by integrating information from various layers of the network. The head of the model is responsible for object localization and classification in a single step, relying on predefined anchor boxes for more precise predictions. YOLOv5s is part of a family of YOLO models, which includes YOLOv5n, YOLOv5m, YOLOv5l, and YOLOv5x. YOLOv5s stands out due to its compact design, which prioritizes inference speed and lower memory usage at the cost of some accuracy, making it especially suited for real-time applications on edge devices.
The CSP architecture minimizes computation without reducing the representational capacity of the model, while the auto-anchor mechanism optimizes anchor box dimensions during training, enabling better generalization to objects of varying sizes and shapes. Deployment of YOLOv5s is versatile, with support for platforms such as TensorRT, ONNX, and CoreML, making it suitable for integration into various real-time systems operating on different hardware configurations, including GPUs, CPUs, and TPUs. Its compact nature allows it to be deployed on edge devices while maintaining acceptable accuracy levels. Despite these advantages, YOLOv5s does exhibit certain limitations. Notably, its accuracy may decline when detecting very large or small objects due to the trade-offs made in favor of faster inference speeds.
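As a point of reference, the stock YOLOv5s model can be loaded and run through the public Ultralytics torch.hub interface, as sketched below. This only illustrates the baseline detection pipeline; the pretrained weights cover generic COCO classes, so detecting fire and smoke requires retraining on the dataset described in Section 4, and the image path used here is a placeholder.

```python
import torch

# Load the small YOLOv5 variant (CSPDarknet53 backbone, PANet neck, anchor-based head).
model = torch.hub.load('ultralytics/yolov5', 'yolov5s', pretrained=True)
model.conf = 0.25                      # confidence threshold for reported detections

results = model('example_scene.jpg')   # placeholder path; accepts files, URLs, or arrays
results.print()                        # summary of detected classes and counts
print(results.xyxy[0])                 # tensor of [x1, y1, x2, y2, confidence, class]
```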

3.3. The Modified YOLOv5s

In our proposed model, we modify YOLOv5s because of its specific advantages, replacing the backbone part of the model. The backbone of YOLOv5s is primarily responsible for extracting features from the input image, and it plays a critical role in determining the efficiency and accuracy of the model. The original YOLOv5s utilizes CSPDarknet53 as its backbone, which is an enhanced version of the Darknet-53 architecture, optimized for real-time object detection tasks. In our modification, using ViT as the backbone instead of the CSPDarknet53 architecture significantly alters how the model extracts and processes features. ViT, originally designed for image classification and built on the transformer architecture, extracts features at the level of image patches, trading the fine-grained local detail of convolutional hierarchies for patch-level representations enriched with global context.
The performance of the proposed model, which integrates the ViT as the backbone of YOLOv5s, is significantly influenced by key parameter choices within the ViT architecture. Two critical parameters, the patch size and the number of attention heads, play a vital role in determining the model’s ability to capture both local and global dependencies effectively, which in turn impacts its detection performance. The patch size controls how the image is divided into smaller segments for processing by the transformer. Smaller patch sizes enable the model to capture finer details, which is beneficial for detecting small or occluded fires. However, smaller patches increase computational complexity, potentially slowing down the model’s inference time. In contrast, larger patches reduce computational load but may cause the model to miss smaller features crucial for accurate fire and smoke detection. In our experiments, a patch size of 16 × 16 was found to strike the best balance, allowing the model to capture sufficient detail while maintaining real-time performance. The number of attention heads determines how many different parts of the image the model can focus on simultaneously. More attention heads allow the model to capture a broader range of visual patterns and relationships across the image, improving its ability to detect fire and smoke in complex scenes with overlapping or occluded objects. However, increasing the number of attention heads also increases the model’s computational demands. In this work, we experimented with 8 and 12 attention heads, with 12 providing the best trade-off between accuracy and processing efficiency, particularly in scenes with varying object scales and occlusions. By carefully selecting these parameters, our modified YOLOv5s with ViT backbone achieved higher precision and recall compared to baseline models. These choices allow the model to effectively detect fire and smoke, even in challenging environments, while maintaining a reasonable inference speed suitable for real-time applications.
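The configuration sketch below summarizes these backbone choices in code form. The patch size and number of attention heads reflect the values discussed above; the embedding dimension and encoder depth are illustrative assumptions rather than reported settings.

```python
from dataclasses import dataclass

@dataclass
class ViTBackboneConfig:
    img_size: int = 640      # detector input resolution (assumed)
    patch_size: int = 16     # 16 x 16 patches balanced detail and speed in our experiments
    num_heads: int = 12      # 12 heads gave the best accuracy/efficiency trade-off
    embed_dim: int = 768     # token dimensionality D (assumed)
    depth: int = 12          # number of transformer encoder blocks (assumed)

    @property
    def num_patches(self) -> int:
        return (self.img_size // self.patch_size) ** 2   # N = HW / P^2

print(ViTBackboneConfig().num_patches)   # 1600 tokens per 640 x 640 image
```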
In our model, the input image $x_i \in \mathbb{R}^{H \times W \times C}$ feeds the ViT, whose inner patch generator receives the 2D input image; before this, the generator reshapes the image into flattened 2D patches $x_{patches} \in \mathbb{R}^{N \times (P^{2} \cdot C)}$, where $H$ and $W$ are the height and width, $C$ is the number of channels, and $P$ is the resolution of each image patch, as shown in Figure 1. $N = HW/P^{2}$ denotes the total number of patches for each image and also determines the actual input sequence length for the transformer. While CNNs rely on hierarchical processing (from local to global features), ViT computes relationships between all patches simultaneously using self-attention. This allows it to capture long-range interactions between objects or distant parts of an object more effectively, which can be particularly beneficial in detecting large objects or objects distributed across the image. The image $x_i$ is divided into fixed-size patches, which are then flattened and linearly transformed into $D$-dimensional vectors. The patch embedding in Equation (1) can be represented as:
$$F_{p\_e}(x_i) = x_p \cdot W_e$$
where $x_p$ denotes the flattened 2D image patches and $W_e$ represents the matrix for the linear projection. Then, to retain the positional information of the patches, positional embeddings are added to $F_{p\_e}$, since the transformer itself does not inherently understand order or position. Moreover, by using positional embeddings, ViT compensates for the lack of spatial hierarchies that convolutions naturally maintain. This helps our model capture positional information, enabling it to maintain spatial relationships between different parts of the image, which is crucial for object detection tasks that depend on understanding the location of objects:
$$Z_0 = F_{p\_e} + F_{pos\_e}$$
$Z_0$ denotes the initial sequence of input embeddings to the transformer encoder. The transformer encoder that follows consists of several sub-layers. The first is the multi-head self-attention (MHSA) mechanism, which allows the model to focus on different parts of the image patches and understand their relationships:
$$F_{MHSA}(F_{layer\_norm}(Z_0)) = F_{SDPA}\left(\mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right) V\right)$$
where $Q$, $K$, and $V$ are the query, key, and value matrices derived from the input embeddings and $d_k$ is the dimensionality of the keys and queries. In multi-head attention, this process is performed in parallel across different heads, each focusing on different parts of the input, as shown in the equation. The self-attention mechanism of the ViT allows the model to focus on the most relevant parts of the image. In object detection tasks, especially when there are multiple objects of varying sizes, the ViT attention mechanism can dynamically allocate more focus to important areas without losing context. This results in better detection of overlapping or partially occluded objects compared to convolutional backbones, as shown in Figure 2.
$$F_{SDPA\_out}(F_{MHSA}(F_{layer\_norm}(Z_0))) = \mathrm{Concat}(head_1, head_2, \ldots, head_n)\, W^{O}$$
where $head_n$ represents the individual attention heads and $W^{O}$ denotes the output linear transformation matrix. After the attention mechanism, the output undergoes a residual connection followed by layer normalization:
$$F_{MHSA\_out} = F_{layer\_norm}(Z_0 + F_{SDPA\_out})$$
After the MHSA layer comes the position-wise feed-forward network (FFN), present in every encoder layer, which applies two linear transformations with a ReLU or GELU activation between them:
$$F_{FFN}(F_{MHSA\_out}) = \max(0,\, x W_1 + b_1)\, W_2 + b_2$$
where $W_n$ and $b_n$ are the parameters of the linear transformations. The output of the FFN also undergoes a residual connection and layer normalization:
$$Output_{t\_e} = F_{layer\_norm}(F_{MHSA\_out} + F_{FFN}(F_{MHSA\_out}))$$
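A compact PyTorch sketch of one encoder block, following the MHSA and FFN equations above (attention on the layer-normalized input, then residual connections each followed by layer normalization), is shown below. The dimensions and the GELU activation are illustrative assumptions, not the exact configuration used in the experiments.

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One transformer encoder block: MHSA + FFN with residuals and layer norm."""
    def __init__(self, dim=768, num_heads=12, mlp_ratio=4.0):
        super().__init__()
        self.pre_norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_attn = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(                       # position-wise feed-forward network
            nn.Linear(dim, int(dim * mlp_ratio)),
            nn.GELU(),
            nn.Linear(int(dim * mlp_ratio), dim),
        )
        self.norm_ffn = nn.LayerNorm(dim)

    def forward(self, z0):                              # z0: (B, N, D) patch embeddings
        h = self.pre_norm(z0)
        attn_out, _ = self.attn(h, h, h)                # multi-head softmax(QK^T / sqrt(d_k)) V
        z = self.norm_attn(z0 + attn_out)               # residual connection + layer norm
        z = self.norm_ffn(z + self.ffn(z))              # FFN, residual connection + layer norm
        return z

out = EncoderBlock()(torch.randn(1, 1600, 768))
print(out.shape)  # torch.Size([1, 1600, 768])
```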
The output of the ViT feeds the neck part of the YOLOv5s. In our model, we have used three types of losses as a baseline: classification loss, bounding box regression loss, and objectness loss. These components together ensure that the model not only accurately classifies objects but also precisely localizes them within the image.
The classification loss in YOLOv5s is typically implemented using binary cross-entropy (BCE) for each class label. This part of the loss function is responsible for determining how well the model classifies each object into one of the predetermined categories. The BCE loss is calculated as follows for each class:
$$L_{BCE} = -\sum_{i=1}^{N}\sum_{c=1}^{C}\left[ y_{i,c}\log \hat{y}_{i,c} + (1 - y_{i,c})\log(1 - \hat{y}_{i,c}) \right]$$
where $N$ denotes the number of predictions, $C$ the number of classes, $y_{i,c}$ the ground truth label (0 or 1), and $\hat{y}_{i,c}$ the predicted probability for class $c$.
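For illustration, the classification term can be computed with PyTorch's BCEWithLogitsLoss, which applies the sigmoid and binary cross-entropy jointly for numerical stability; the two-class (fire, smoke) setup and the dummy tensors below are assumptions, not the training code used in the paper.

```python
import torch
import torch.nn as nn

bce = nn.BCEWithLogitsLoss()          # sigmoid + BCE in one numerically stable op

logits = torch.randn(8, 2)            # raw class scores for 8 predicted boxes, 2 classes (fire, smoke)
targets = torch.zeros(8, 2)
targets[:4, 0] = 1.0                  # first four boxes labeled "fire"
targets[4:, 1] = 1.0                  # remaining boxes labeled "smoke"

print(bce(logits, targets).item())    # mean binary cross-entropy over boxes and classes
```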
For the task of bounding box regression, YOLOv5s utilizes the CIoU (complete intersection over union) loss, which is an enhancement over the traditional IoU loss. CIoU not only considers the overlap between the predicted and actual bounding box but also incorporates the distance between the box centers and an aspect ratio term to handle cases where the boxes do not overlap:
$$\mathrm{CIoU\ Loss} = 1 - IoU + \frac{\rho^{2}(b_{pred},\, b_{true})}{c^{2}} + \alpha V$$
where $IoU$ is the intersection over union of the predicted and ground truth boxes, $\rho^{2}(b_{pred}, b_{true})$ is the squared Euclidean distance between the centers of the predicted and ground truth boxes, $c$ represents the diagonal length of the smallest enclosing box covering both the predicted and ground truth boxes, $V$ measures the consistency of the aspect ratios, and $\alpha$ is a positive trade-off parameter weighting $V$. Finally, the objectness loss is computed with binary cross-entropy over all predicted boxes:
$$L_{obj} = -\frac{1}{M}\sum_{j=1}^{M}\left[ t_{j}\log p_{j} + (1 - t_{j})\log(1 - p_{j}) \right]$$
where $M$ is the total number of bounding boxes and $t_j$ is the target label (1 if an object is present; 0 otherwise). Moreover, $p_j$ denotes the predicted probability of an object being present in the box.
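The sketch below implements the CIoU term as defined above for axis-aligned boxes in (x1, y1, x2, y2) format. It is a simplified, self-contained illustration; the actual YOLOv5 implementation operates on anchor-relative (x, y, w, h) predictions.

```python
import math
import torch

def ciou_loss(pred, target, eps=1e-7):
    """pred, target: (N, 4) boxes as (x1, y1, x2, y2)."""
    # Intersection over union.
    x1 = torch.max(pred[:, 0], target[:, 0]); y1 = torch.max(pred[:, 1], target[:, 1])
    x2 = torch.min(pred[:, 2], target[:, 2]); y2 = torch.min(pred[:, 3], target[:, 3])
    inter = (x2 - x1).clamp(0) * (y2 - y1).clamp(0)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + eps)
    # Squared center distance, normalized by the squared diagonal of the enclosing box.
    rho2 = ((pred[:, 0] + pred[:, 2] - target[:, 0] - target[:, 2]) ** 2 +
            (pred[:, 1] + pred[:, 3] - target[:, 1] - target[:, 3]) ** 2) / 4
    cw = torch.max(pred[:, 2], target[:, 2]) - torch.min(pred[:, 0], target[:, 0])
    ch = torch.max(pred[:, 3], target[:, 3]) - torch.min(pred[:, 1], target[:, 1])
    c2 = cw ** 2 + ch ** 2 + eps
    # Aspect-ratio consistency term V and its weight alpha.
    wp = pred[:, 2] - pred[:, 0]; hp = (pred[:, 3] - pred[:, 1]).clamp(min=eps)
    wt = target[:, 2] - target[:, 0]; ht = (target[:, 3] - target[:, 1]).clamp(min=eps)
    v = (4 / math.pi ** 2) * (torch.atan(wt / ht) - torch.atan(wp / hp)) ** 2
    alpha = v / (1 - iou + v + eps)
    return (1 - iou + rho2 / c2 + alpha * v).mean()

pred = torch.tensor([[10., 10., 50., 50.]])
true = torch.tensor([[12., 14., 48., 56.]])
print(ciou_loss(pred, true))
```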

4. Experiment and Results

4.1. Dataset

The Fire and Smoke Detection Dataset that we used to train and test our model represents a comprehensive repository of images and annotations meticulously crafted for the training of object detection models to accurately identify and classify fire and smoke occurrences within various real-world contexts. It is purposefully designed to enhance computer vision applications in early fire detection, safety monitoring, and disaster prevention. The dataset includes a diverse collection of images sourced from multiple environments, both indoor and outdoor, under varying lighting conditions and from different perspectives. This variety ensures a robust training environment for the development of sophisticated object detection algorithms. Each image in the dataset is rigorously annotated with bounding boxes that accurately delineate the regions containing fire and smoke, providing high-quality data crucial for precise model training. These applications are pivotal for the development of fire and smoke detection systems in buildings and public spaces, early warning systems for forest fires, safety measures in industrial settings, disaster response and monitoring, and the detection of wildfires and environmental monitoring.

4.2. Data Preprocessing

In the preprocessing phase for the Fire and Smoke Detection Dataset, a series of systematic steps are implemented to ensure the dataset is ideally prepared for both training and evaluating the performance of object detection models. This dataset, which is comprehensive in its coverage of diverse fire and smoke scenarios, is first subjected to normalization procedures to maintain consistency across the collection. Each image is resized to uniform dimensions to ensure that all input data fed into the model retains consistent resolution and format, which is essential for systematic analysis and feature extraction. Following normalization, the dataset undergoes extensive data augmentation to enhance the robustness of the model against overfitting. Techniques such as random rotations, flips, translations, and adjustments in brightness and contrast are applied. These augmentations are designed to artificially expand the dataset by introducing variations that mimic real-world conditions, thereby aiding the model in generalizing better to new, unseen data, as shown in Figure 3. Furthermore, to prepare the dataset for effective model training and validation, the images are split into training and validation sets. This segmentation of data is critical for tuning the model parameters without overfitting and for evaluating the model performance under controlled conditions. The allocation of images to each set is performed in a manner that ensures a representative mix of various types of fire and smoke scenarios, as well as a diversity of image sources and conditions, thus supporting a comprehensive assessment of the model’s detection capabilities.
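An image-level augmentation pipeline of the kind described above could be expressed with torchvision transforms as in the sketch below; the specific parameter values are assumptions, and in a detection setting the corresponding bounding boxes must be transformed alongside the images (YOLOv5 handles this inside its own data loader).

```python
import torchvision.transforms as T

train_transforms = T.Compose([
    T.Resize((640, 640)),                             # uniform input resolution
    T.RandomHorizontalFlip(p=0.5),                    # flips
    T.RandomRotation(degrees=10),                     # small random rotations
    T.RandomAffine(degrees=0, translate=(0.1, 0.1)),  # translations
    T.ColorJitter(brightness=0.3, contrast=0.3),      # brightness/contrast adjustments
    T.ToTensor(),
])
```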

4.3. Metrics

Precision quantifies the accuracy of positive identifications made by the model, representing the proportion of true positives (TP) to the combined total of true positives and false positives (FP). This metric indicates the extent to which the items classified as positive (for instance, detected objects) are indeed positive (such as correctly detected objects):
$$\mathrm{precision} = \frac{TP}{TP + FP}$$
Recall, also referred to as sensitivity or true positive rate, evaluates the model’s capacity to identify all positive instances. It is calculated as the ratio of true positives (TP) to the sum of true positives and false negatives (FN). Recall measures the proportion of actual positive cases that the model successfully identifies:
$$\mathrm{recall} = \frac{TP}{TP + FN}$$
Mean average precision (mAP) is a comprehensive metric that reflects the balance between precision and recall across various threshold levels. It is determined by averaging the average precision (AP) for all classes within the dataset. AP, in turn, is calculated as the area under the precision-recall curve for each individual class, providing a measure of the model performance in identifying objects across different categories:
$$mAP = \frac{1}{n}\sum_{k=1}^{n} AP_{k}$$
IoU is a metric used to assess the degree of overlap between the predicted bounding box and the ground truth bounding box. It is defined as the ratio of the area of intersection between the predicted and ground truth bounding boxes to the area of their union. This metric provides a quantitative measure of how accurately the model localizes objects within an image:
$$IoU = \frac{\mathrm{Area\ of\ Overlap}}{\mathrm{Area\ of\ Union}}$$
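The following helper functions restate the metric definitions above for a single box pair and for aggregate counts; they are illustrative utilities, not the evaluation code used to produce the reported results.

```python
def iou(box_a, box_b):
    """Boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def precision(tp, fp):
    return tp / (tp + fp) if tp + fp else 0.0

def recall(tp, fn):
    return tp / (tp + fn) if tp + fn else 0.0

# A prediction is typically counted as a true positive when its IoU with a
# ground-truth box meets the threshold (0.5 for mAP@0.5).
print(iou((10, 10, 50, 50), (20, 20, 60, 60)))  # ~0.39
```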

4.4. The Experiment Results

Figure 4 showcases a series of images that illustrate the performance of the proposed model in detecting fire and smoke. Each image is overlaid with bounding boxes, which are annotated with labels and confidence scores indicating the predictions of the model. The images depict various scenes of fire and smoke, ranging from large-scale structural fires to smaller, localized incidents. The bounding boxes are color-coded, typically blue, to delineate identified regions within the images where the model has detected the presence of fire or smoke. Each bounding box is accompanied by a label, such as “fire” or “smoke”, and a confidence score that quantifies the certainty of the proposed model in its prediction. For example, a box might be labeled “fire 0.8”, indicating an 80% confidence in the fire detection at that specific location. The array of images demonstrates the capability of the model to recognize fire and smoke under different conditions, such as at night, during the day, across diverse environments, and under varying lighting and distance settings.
This visual representation highlights the robustness and adaptability of our model to real-world scenarios where fire and smoke can present in multiple forms and magnitudes. It is visible, in some images, that the model successfully identifies multiple instances of fire or smoke in a single scene, which underscores its effectiveness in complex situations where multiple hazards are present.
Figure 5 illustrates the performance metrics for the training and validation phases of the proposed model, capturing the evolution of losses and accuracy metrics over 100 epochs. The first graph, Box Loss, displays a decreasing trend in the box loss during training, indicating improvements by the proposed model in the accuracy of the bounding box predictions as the model learns. This loss quantifies the error in localizing objects within the images. Following this, the Cls Loss graph shows the classification loss during training, which also decreases over time, indicating that the proposed model is increasingly better at correctly identifying the classes of objects within the bounding boxes.
Dfl Loss portrays the loss related to the confidence of the proposed model in its predictions. In the final two graphs of the first row, metrics/mAP50(B) and metrics/mAP50-95(B), we observe the mean average precision at different IoU thresholds, measured at 0.5 and between 0.5 to 0.95, respectively. Both metrics exhibit an upward trend, signifying that the model is not only identifying more objects correctly but is also improving how well it aligns with the ground-truth data across varying strictness of bounding box overlaps.
Turning to the second row of Figure 5, Box Loss illustrates the box loss during the validation phase, similar to the training graph, and validation results show a downward trend, affirming that the model maintains its performance on unseen data, which is crucial for practical applications. Next, Cls Loss captures the classification loss during validation, decreasing over epochs, which mirrors the training pattern and confirms that the ability of the proposed model to classify objects extends beyond the training set. Dfl Loss reflects the prediction certainty in the validation phase, with decreasing values indicating increasing confidence in handling new, unseen images. The final graphs in the sequence, metrics/mAP50(B) and metrics/mAP50-95(B), track the mean average precision, which is similar to the training metrics but based on the validation dataset.
Table 1 presents a comparison of four models, YOLOv5n (nano), YOLOv5s (small), YOLOv5m (medium), and the proposed modified model, evaluating their performance across several key metrics: Precision, Recall, mAP@0.5, Params, Flops(G), and the number of Epochs. Precision, which assesses the accuracy of positive identifications, is highest for the modified model, which achieves a value of 0.583, slightly outperforming YOLOv5m, which has a precision of 0.576. Recall, which reflects the model’s ability to detect all relevant instances, is also highest in the modified model at 0.657, indicating superior detection capabilities over the other models. The mAP@0.5, a summary metric capturing the balance between precision and recall across different threshold levels, is also highest for the modified model at 0.664, indicating the best overall performance in object detection. The number of parameters (Params) for the modified model remains comparable to that of YOLOv5s, maintaining computational efficiency, while Flops (floating point operations) show minimal variation across the models, with the modified model reporting 10.0 GFLOPs, slightly lower than YOLOv5m. Overall, Table 1 indicates that the modified model strikes an effective balance between precision, recall, and computational efficiency, making it a superior choice for fire and smoke detection tasks compared to the baseline YOLOv5 models.
Table 2 offers a detailed analysis of the loss values during the training and validation phases for the same models: YOLOv5n, YOLOv5s, YOLOv5m, and the proposed modified model. Train Box Loss, which measures the error in bounding box prediction during training, is lowest in the modified model at 0.0602, indicating more accurate localization of objects compared to the other models. Train Object Loss, which evaluates the model’s ability to predict the presence of an object, is also lowest for the modified model at 0.0076, showing enhanced detection capabilities. In terms of classifying objects correctly, the Train Class Loss is the smallest in the modified model, with a value of 0.9707, reflecting improved classification accuracy during training. In the validation phase, the Val Box Loss for the modified model is 0.0717, which is lower than that of YOLOv5m and the other models, indicating improved generalization in terms of object localization on unseen data. Val Object Loss, the error associated with predicting object presence during validation, is lowest in the modified model at 0.0051, further reinforcing the model’s strong detection ability. Finally, Val Class Loss, which measures the classification error during validation, is also the smallest for the modified model at 0.9008, suggesting enhanced classification performance on the validation set. Overall, Table 2 demonstrates the improved performance of the modified model across all loss functions, both in training and validation, highlighting its effectiveness in detecting fire and smoke with higher precision and lower error rates compared to the baseline models.
The superior performance of our modified YOLOv5s model, particularly its higher precision and recall, can be attributed to the integration of the ViT as the backbone. The ViT attention mechanism allows the model to capture both local and global dependencies, which significantly enhances its ability to detect fire and smoke, especially in complex environments. This is particularly relevant in scenarios where fire may be occluded, small, or spread across large areas, where traditional CNN-based models, such as the original YOLOv5s, tend to struggle. The ViT’s ability to focus on relevant image regions also helps reduce false negatives, resulting in improved recall. However, environmental factors such as lighting and weather conditions impacted precision. While the model performed well in well-lit environments, achieving higher precision due to clearer visual cues, it struggled in low-light and foggy conditions. In such cases, the model sometimes misclassified fog as smoke, leading to an increase in false positives and lower precision. The similarity between smoke and environmental disturbances such as fog challenges even the attention-based mechanisms of the ViT, which highlights an area for further refinement. The model’s ability to capture long-range dependencies through the ViT backbone accounts for its superior recall in complex scenes, while precision remains sensitive to specific environmental conditions. The observed trends indicate that while the ViT-enhanced YOLOv5s model excels in identifying fire and smoke under typical conditions, further work is needed to optimize its performance in more challenging environments such as low light and fog.

4.5. Comparison with State-of-the-Art Models

Our proposed fire detection model integrates significant improvements into the YOLOv5s framework, showing enhanced performance compared to various state-of-the-art models. The TRA-YOLO model [10] combines transformers and CNNs for factory fire detection, achieving a detection speed of 50 FPS and improving accuracy by 4.1% over YOLOv5. However, our model, by integrating additional attention mechanisms, surpasses TRA-YOLO in terms of precision and recall, particularly for detecting small fires in complex environments. SIMCB-Yolo [28] enhances the detection of small targets, especially smoke, using a Swin transformer, and achieves a mAP50 of 85.6%, improving over the standard YOLOv5 (Table 3).
Our model integrates BiFPN and CBAM, leading to even higher precision and significantly reducing false positives, making it more robust in outdoor scenarios. Dou et al. proposed an improved YOLOv5s model that incorporates CBAM and replaces PANet with BiFPN, improving detection accuracy and efficiency. In comparison, our model integrates a contextual transformer (CoT) structure and Focal-EIoU loss function, improving the model’s convergence speed and achieving higher recall and precision rates. Further, [20] presented a lightweight fire detection model based on YOLOv5, focusing on reducing memory usage and increasing detection speed. While their model reduces computational costs, our approach achieves a balance between lightweight architecture and precision through MobileNetV3 and knowledge distillation, making it suitable for deployment in resource-constrained environments. The domain-free fire detection model in [26] uses spatiotemporal attention to handle diverse fire conditions, such as varying day/night settings. While their model performs well across different environments, our model achieves better recall for small fires and reduces false positives, thanks to the advanced attention mechanisms. Finally, [27] introduced the YOLOv5s-ACE model for forest fire detection, which improves detection accuracy by using ASPP and CBAM. Our model outperforms YOLOv5s-ACE in both speed and accuracy by incorporating transformer-based enhancements and more efficient loss functions. Our model demonstrates superior performance across all key metrics, including precision, recall, and detection speed, making it a highly effective solution for real-time fire detection in both indoor and outdoor environments.
The modified YOLOv5s with ViT achieves superior accuracy in fire and smoke detection, particularly in complex environments. However, integrating ViT, which improves feature extraction, increases the overall model complexity, resulting in a slight reduction in inference speed compared to lighter CNN-based models. For instance, the inference time for our model is slightly longer than standard YOLOv5s due to the ViT attention mechanism, which processes global image contexts. While this trade-off is acceptable for systems with moderate computational resources, it may not be optimal for extremely resource-constrained environments where real-time performance is critical. In environments with limited processing power, such as embedded systems or drones, computational efficiency is a key factor. While our model’s use of ViT provides a boost in accuracy, the attention mechanism introduces additional computational overhead. To address this, we explored reducing the model depth and limiting the number of attention heads, which led to improvements in speed with minimal reduction in accuracy. This makes the model more suitable for deployment in real-time applications where faster inference is required, such as UAV-based fire detection. For applications where both accuracy and real-time performance are equally important, we suggest using a reduced version of our model with fewer attention heads or layers. Additionally, deployment techniques such as model quantization or pruning could be explored to further enhance computational efficiency without significantly compromising detection accuracy.
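As one example of the deployment optimizations mentioned above, post-training dynamic quantization in PyTorch converts the linear layers that dominate ViT computation to int8 weights; the stand-in module below is a placeholder for the trained detector, and quantized accuracy and speed were not evaluated in this work.

```python
import torch
import torch.nn as nn

# Stand-in for the trained ViT-YOLOv5s detector (placeholder module).
model = nn.Sequential(nn.Linear(768, 3072), nn.GELU(), nn.Linear(3072, 768))

quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)
x = torch.randn(1, 768)
print(quantized(x).shape)   # torch.Size([1, 768]), now using int8 weights for the Linear layers
```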

4.6. Dataset and Environmental Impact Analysis

In addition to the general performance metrics, assessing the impact of varying environmental conditions, such as lighting and weather, on the model’s ability to detect fire and smoke effectively is important. The Fire and Smoke Detection Dataset includes diverse real-world scenarios, but a more detailed analysis of how environmental factors affect performance is necessary to ensure the robustness of the model in practical applications. We segmented the dataset into different lighting categories: daylight, low-light, and nighttime. The performance of the model varied significantly across these conditions. In bright sunlight, the model achieved high precision and recall due to clear visibility, while in low-light and nighttime scenarios, precision decreased by approximately 12%, as illustrated in Table 4. This suggests that the model struggles to detect fire in darker environments, where the lack of contrast hinders the accuracy of detection.
To further assess the model’s robustness, we analyzed its performance under varying weather conditions such as clear skies, fog, and rain. During foggy conditions, the model produced an increased number of false positives, as it often misidentified fog as smoke. In fact, false positives increased by 15% in fog-heavy environments, as shown in Table 4. This indicates a limitation in distinguishing between environmental factors that resemble smoke.
In scenes with occlusions or where fire and smoke are small or partially obstructed, the model performance also declined, as shown in Table 5. When fires were occluded or distributed across a large area, the mean average precision (mAP) dropped by 18%. This is critical in urban or forest environments where fires may not be fully visible due to obstacles. Addressing this challenge requires further refinement, possibly through more advanced attention mechanisms, to detect partially visible fires.

5. Conclusions

In this paper, we proposed a novel approach for fire and smoke detection by integrating the ViT as the backbone of the YOLOv5s object detection model. Our modified model effectively combines the real-time detection capabilities of YOLOv5s with the attention-based feature extraction of ViT, allowing it to capture both local and global dependencies within images. This hybrid architecture significantly improves detection accuracy in complex environments, where fires may be occluded, small, or spread across large regions. The model’s ability to dynamically focus on relevant parts of the image enhances its performance in detecting fire and smoke under challenging conditions, such as varying lighting, occlusions, and overlapping objects. Our experimental results, using the Fire and Smoke Detection Dataset, demonstrate that the modified YOLOv5s with ViT outperforms baseline YOLOv5 models in terms of precision, recall, and mean average precision (mAP@0.5), achieving a mAP@0.5 of 0.664 and a recall of 0.657. Compared to state-of-the-art (SOTA) models, such as Faster R-CNN and EfficientDet, our model achieves a superior balance between computational efficiency and accuracy, making it particularly well-suited for real-time applications where rapid detection is essential. The proposed model strikes an optimal balance between accuracy and computational complexity, enabling its deployment in resource-constrained environments, such as drones and embedded systems, for early fire detection and disaster prevention. The incorporation of ViT further enhances its robustness in detecting fire and smoke across diverse environments, contributing to more effective fire monitoring systems.
Future work could explore additional optimizations to the model, including more efficient transformer architectures and further scalability, as well as adapting the model for multi-sensor integration to enhance its detection capabilities in various real-world scenarios. Additionally, expanding the dataset to cover more diverse environments and fire types would help generalize the model’s applicability even further. The integration of ViT into the YOLOv5s architecture demonstrates a promising direction for advancing fire and smoke detection technologies, offering a robust, real-time solution that can significantly improve early fire detection systems for both urban and natural environments.

Author Contributions

Methodology, F.S., S.M., M.K. and Y.I.C.; Software, F.S. and S.M.; Validation, S.M., M.K. and Y.I.C.; Formal analysis, F.S., S.M., M.K. and Y.I.C.; Resources, F.S., S.M., M.K. and Y.I.C.; Data curation, F.S., S.M. and Y.I.C.; Writing—original draft, F.S., S.M., M.K. and Y.I.C.; Writing—review and editing, F.S., S.M., M.K. and Y.I.C.; Supervision, Y.I.C.; Project administration, S.M. and Y.I.C. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Korean Agency for Technology and Standards under the Ministry of Trade, Industry and Energy in 2024, project numbers 1415180835 (Development of International Standard Technologies based on AI Learning and Inference Technologies) and 1415181629 (Development of International Standard Technologies based on AI Model Lightweighting Technologies), and by the Gachon University research fund of 2023 (GCU-202300770001).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The dataset used in this study is openly available online.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Hong, Z.; Hamdan, E.; Zhao, Y.; Ye, T.; Pan, H.; Cetin, A.E. Wildfire detection via transfer learning: A survey. Signal Image Video Process. 2024, 18, 207–214. [Google Scholar] [CrossRef]
  2. Akyol, K. A comprehensive comparison study of traditional classifiers and deep neural networks for forest fire detection. Clust. Comput. 2024, 27, 1201–1215. [Google Scholar] [CrossRef]
  3. Jin, L.; Yu, Y.; Zhou, J.; Bai, D.; Lin, H.; Zhou, H. SWVR: A lightweight deep learning algorithm for forest fire detection and recognition. Forests 2024, 15, 204.
  4. Shakhnoza, M.; Sabina, U.; Sevara, M.; Cho, Y.I. Novel video surveillance-based fire and smoke classification using attentional feature map in capsule networks. Sensors 2021, 22, 98.
  5. Paidipati, K.K.; Kurangi, C.; Reddy, A.S.K.; Kadiravan, G.; Shah, N.H. Wireless sensor network assisted automated forest fire detection using deep learning and computer vision model. Multimed. Tools Appl. 2024, 83, 26733–26750.
  6. Akhmedov, F.; Nasimov, R.; Abdusalomov, A. Dehazing Algorithm Integration with YOLO-v10 for Ship Fire Detection. Fire 2024, 7, 332.
  7. Cao, L.; Shen, Z.; Xu, S. Efficient forest fire detection based on an improved YOLO model. Vis. Intell. 2024, 2, 20.
  8. Cheng, G.; Chen, X.; Wang, C.; Li, X.; Xian, B.; Yu, H. Visual fire detection using deep learning: A survey. Neurocomputing 2024, 127975.
  9. Safarov, F.; Akhmedov, F.; Abdusalomov, A.B.; Nasimov, R.; Cho, Y.I. Real-time deep learning-based drowsiness detection: Leveraging computer-vision and eye-blink analyses for enhanced road safety. Sensors 2023, 23, 6459.
  10. Xiang, S.; Yin, S.; Yu, G.; Xu, X.; Yu, L. Factory Fire Detection using TRA-YOLO Network. In Proceedings of the 2024 36th Chinese Control and Decision Conference (CCDC), Xi’an, China, 25–27 May 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 3777–3782.
  11. Yandouzi, M.; Berrahal, M.; Grari, M.; Boukabous, M.; Moussaoui, O.; Azizi, M.; Ghoumid, K.; Elmiad, A.K. Semantic segmentation and thermal imaging for forest fires detection and monitoring by drones. Bull. Electr. Eng. Inform. 2024, 13, 2784–2796.
  12. Titu, M.F.S.; Pavel, M.A.; Michael, G.K.O.; Babar, H.; Aman, U.; Khan, R. Real-Time Fire Detection: Integrating Lightweight Deep Learning Models on Drones with Edge Computing. Drones 2024, 8, 483.
  13. Bu, F.; Gharajeh, M.S. Intelligent and vision-based fire detection systems: A survey. Image Vis. Comput. 2019, 91, 103803.
  14. Thanga Manickam, M.; Yogesh, M.; Sridhar, P.; Thangavel, S.K.; Parameswaran, L. Video-based fire detection by transforming to optimal color space. In Proceedings of the International Conference on Computational Vision and Bio Inspired Computing, Coimbatore, India, 25–26 September 2019; Springer International Publishing: Cham, Switzerland, 2019; pp. 1256–1264.
  15. Khondaker, A.; Khandaker, A.; Uddin, J. Computer vision-based early fire detection using enhanced chromatic segmentation and optical flow analysis technique. Int. Arab J. Inf. Technol. 2020, 17, 947–953.
  16. Rahman, M.A.; Hasan, S.T.; Kader, M.A. Computer vision based industrial and forest fire detection using support vector machine (SVM). In Proceedings of the 2022 International Conference on Innovations in Science, Engineering and Technology (ICISET), Chittagong, Bangladesh, 25–28 February 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 233–238.
  17. Sharma, J.; Granmo, O.C.; Goodwin, M.; Fidje, J.T. Deep convolutional neural networks for fire detection in images. In Engineering Applications of Neural Networks: 18th International Conference, EANN 2017, Athens, Greece, 25–27 August 2017, Proceedings; Springer International Publishing: Cham, Switzerland, 2017; pp. 183–193.
  18. Kim, Y.J.; Kim, E.G. Fire detection system using faster R-CNN. In Proceedings of the International Conference on Future Information & Communication Engineering, Kunming, China, 8–10 December 2017; Volume 9, pp. 261–264.
  19. Nguyen, A.Q.; Nguyen, H.T.; Tran, V.C.; Pham, H.X.; Pestana, J. A visual real-time fire detection using single shot multibox detector for UAV-based fire surveillance. In Proceedings of the 2020 IEEE Eighth International Conference on Communications and Electronics (ICCE), Phu Quoc Island, Vietnam, 13–15 January 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 338–343.
  20. Zhou, M.; Wu, L.; Liu, S.; Li, J. UAV forest fire detection based on lightweight YOLOv5 model. Multimed. Tools Appl. 2024, 83, 61777–61788.
  21. Zhang, D.; Chen, Y. Lightweight Fire Detection Algorithm Based on Improved YOLOv5. Int. J. Adv. Comput. Sci. Appl. 2024, 15, 809.
  22. Xu, H.; Li, B.; Zhong, F. Light-YOLOv5: A lightweight algorithm for improved YOLOv5 in complex fire scenarios. Appl. Sci. 2022, 12, 12312.
  23. Shahid, M.; Hua, K.L. Fire detection using transformer network. In Proceedings of the 2021 International Conference on Multimedia Retrieval, Taipei, Taiwan, 21 August 2021; pp. 627–630.
  24. Lv, C.; Zhou, H.; Chen, Y.; Fan, D.; Di, F. A lightweight fire detection algorithm for small targets based on YOLOv5s. Sci. Rep. 2024, 14, 14104.
  25. Yuan, H.; Lu, Z.; Zhang, R.; Li, J.; Wang, S.; Fan, J. An effective graph embedded YOLOv5 model for forest fire detection. Comput. Intell. 2024, 40, e12640.
  26. Kim, S.; Jang, I.S.; Ko, B.C. Domain-free fire detection using the spatial-temporal attention transform of the YOLO backbone. Pattern Anal. Appl. 2024, 27, 45.
  27. Wang, J.; Wang, C.; Ding, W.; Li, C. YOLOv5s-ACE: Forest Fire Object Detection Algorithm Based on Improved YOLOv5s. In Fire Technology; Springer: Berlin/Heidelberg, Germany, 2024; pp. 1–21.
  28. Yang, W.; Yang, Z.; Wu, M.; Zhang, G.; Zhu, Y.; Sun, Y. SIMCB-Yolo: An Efficient Multi-Scale Network for Detecting Forest Fire Smoke. Forests 2024, 15, 1137.
Figure 1. The basic architecture of the vision transformer (ViT) integrated into the YOLOv5s framework. The image is divided into patches, each of which is treated as an input token for the transformer. The transformer model then processes these patches, capturing both local and global dependencies through self-attention mechanisms. This architecture enhances the model’s ability to detect fire and smoke, particularly in complex environments where objects may be occluded or distributed across large areas.
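For readers who want to relate Figure 1 to code, the following minimal PyTorch sketch illustrates the patch-embedding and self-attention flow described in the caption. The layer sizes (16×16 patches, 384-dimensional embeddings, six encoder layers) are illustrative assumptions, not the exact configuration used in the paper.

```python
# Minimal ViT-style backbone sketch (hyperparameters are illustrative assumptions).
import torch
import torch.nn as nn

class TinyViTBackbone(nn.Module):
    def __init__(self, img_size=640, patch_size=16, embed_dim=384, depth=6, heads=6):
        super().__init__()
        # Split the image into non-overlapping patches and project each to an embedding (token).
        self.patch_embed = nn.Conv2d(3, embed_dim, kernel_size=patch_size, stride=patch_size)
        num_patches = (img_size // patch_size) ** 2
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, embed_dim))  # learnable positions
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=heads, dim_feedforward=4 * embed_dim,
            batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=depth)

    def forward(self, x):
        # x: (B, 3, H, W) -> patch tokens (B, N, C)
        tokens = self.patch_embed(x).flatten(2).transpose(1, 2)
        tokens = tokens + self.pos_embed
        tokens = self.encoder(tokens)  # global self-attention over all patches
        # Reshape the token sequence back into a 2D grid so a detection head can consume it.
        B, N, C = tokens.shape
        side = int(N ** 0.5)
        return tokens.transpose(1, 2).reshape(B, C, side, side)

feats = TinyViTBackbone()(torch.randn(1, 3, 640, 640))
print(feats.shape)  # torch.Size([1, 384, 40, 40])
```

Reshaping the attended token sequence back into a spatial grid is what allows the transformer output to be passed on to a convolutional detection neck and head.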
Figure 2. Modified YOLOv5s model with ViT as the backbone. In this figure, the attention-based feature extraction of ViT is shown as the key component responsible for improved detection accuracy. The figure highlights the process by which the ViT replaces the CSPDarknet53 backbone, allowing the model to capture long-range dependencies and spatial relationships more effectively. This leads to more accurate object detection, especially for fire and smoke, in challenging environments.
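To make the backbone replacement in Figure 2 concrete, the sketch below shows one possible way a single-stride ViT feature map could be expanded into the three detection scales a YOLO-style neck expects. The adapter, its layer names, and the chosen strides are hypothetical illustrations, not the authors' implementation or the ultralytics YOLOv5 code.

```python
# Hypothetical adapter from a ViT feature map to YOLO-style multi-scale features.
import torch
import torch.nn as nn

class MultiScaleAdapter(nn.Module):
    """Turns a single-stride ViT feature map into three detection scales (P3, P4, P5)."""
    def __init__(self, in_ch=384, out_ch=(128, 256, 512)):
        super().__init__()
        self.p3 = nn.Sequential(nn.Upsample(scale_factor=2, mode="nearest"),
                                nn.Conv2d(in_ch, out_ch[0], 1))        # stride 8
        self.p4 = nn.Conv2d(in_ch, out_ch[1], 1)                       # stride 16 (native ViT grid)
        self.p5 = nn.Conv2d(in_ch, out_ch[2], 3, stride=2, padding=1)  # stride 32

    def forward(self, f):
        return self.p3(f), self.p4(f), self.p5(f)

# Example with a dummy 40x40 ViT feature map (640 / 16 = 40):
p3, p4, p5 = MultiScaleAdapter()(torch.randn(1, 384, 40, 40))
print(p3.shape, p4.shape, p5.shape)  # (1,128,80,80) (1,256,40,40) (1,512,20,20)
```

In the full model, such multi-scale maps would then feed the YOLOv5 neck and detection heads in place of the CSPDarknet53 feature pyramid.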
Figure 3. The data augmentation process, including random flip, random rotation, and ColorJitter.
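A typical way to express the augmentations in Figure 3 is with torchvision transforms. The rotation range and jitter strengths below are illustrative assumptions, and for detection training the geometric transforms would also have to be applied to the bounding-box labels.

```python
# Augmentation pipeline matching Figure 3 (random flip, random rotation, ColorJitter);
# parameter values are illustrative assumptions, not those reported in the paper.
from torchvision import transforms

train_transforms = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomRotation(degrees=15),
    transforms.ColorJitter(brightness=0.3, contrast=0.3, saturation=0.3, hue=0.1),
    transforms.ToTensor(),
])
```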
Figure 4. Training results on the fire and smoke dataset.
Figure 5. Visualization of the results.
Table 1. Comparison of four YOLOv5 models.
Models | Precision | Recall | mAP@0.5 | Params | FLOPs (G) | Epochs
YOLOv5n (nano) | 0.357 | 0.487 | 0.419 | 11 | 4.1 | 100
YOLOv5s (small) | 0.484 | 0.641 | 0.617 | 12 | 8.1 | 100
YOLOv5m (medium) | 0.576 | 0.628 | 0.635 | 14 | 10.2 | 100
Ours | 0.583 | 0.657 | 0.664 | 12 | 10 | 100
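For reference, the metrics in Table 1 follow the standard object-detection definitions, where a prediction counts as a true positive for mAP@0.5 if its IoU with a ground-truth box is at least 0.5:

```latex
\text{Precision} = \frac{TP}{TP + FP}, \qquad
\text{Recall} = \frac{TP}{TP + FN}, \qquad
\text{AP} = \int_0^1 p(r)\,dr, \qquad
\text{mAP@0.5} = \frac{1}{C}\sum_{c=1}^{C}\text{AP}_c
```

Here p(r) is the precision–recall curve for a class and C is the number of classes (fire and smoke in this dataset).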
Table 2. Training and validation loss values for the four models in Table 1, including box loss, object loss, and class loss.
Models | Train Box Loss | Train Object Loss | Train Class Loss | Val Box Loss | Val Object Loss | Val Class Loss
YOLOv5n (nano) | 0.0712 | 0.0092 | 1.2021 | 0.0802 | 0.0075 | 1.1245
YOLOv5s (small) | 0.0642 | 0.0102 | 1.0102 | 0.0719 | 0.0078 | 0.9433
YOLOv5m (medium) | 0.0622 | 0.0080 | 0.9832 | 0.0785 | 0.0064 | 0.9013
Ours | 0.0602 | 0.0076 | 0.9707 | 0.0717 | 0.0051 | 0.9008
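The three loss columns in Table 2 correspond to the terms of the YOLOv5 training objective, which can be written roughly as follows (the weighting coefficients are the usual YOLOv5 defaults and are an assumption here, not values reported by the authors):

```latex
\mathcal{L} = \lambda_{box}\,\mathcal{L}_{box} + \lambda_{obj}\,\mathcal{L}_{obj} + \lambda_{cls}\,\mathcal{L}_{cls}
```

where the box term is an IoU-based regression loss and the objectness and class terms are binary cross-entropy losses; lower values in the table therefore indicate better localization, objectness calibration, and classification, respectively.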
Table 3. Comparison between our model and state-of-the-art models.
Model | mAP (%) | FPS (Frames Per Second) | Precision (%) | Recall (%)
TRA-YOLO [10] | 78.5 | 50 | 78.7 | 80.9
SIMCB-Yolo [28] | 85.6 | 65 | 80.5 | 82.3
Improved YOLOv5s [9] | 82.1 | 55 | 82.1 | 83.2
Lightweight Fire Detection [20] | 98.3 | 85 | 94.8 | 94.3
Domain-Free Fire Detection [26] | 94.5 | 75 | 93.5 | 92.8
YOLOv5s-ACE [27] | 84.1 | 70 | 86.3 | 84.5
Proposed Model (ours) | 96.0 | 85 | 96.8 | 97.0
Table 4. Lighting-based performance metrics.
Lighting Condition | Precision | Recall | mAP@0.5
Daylight | 0.72 | 0.68 | 0.70
Low-Light | 0.63 | 0.57 | 0.60
Nighttime | 0.55 | 0.50 | 0.52
Table 5. Weather-based performance metrics.
Weather Condition | Precision | Recall | mAP@0.5 | False Positives (%)
Clear Skies | 0.68 | 0.65 | 0.67 | 5
Fog | 0.52 | 0.49 | 0.51 | 20
Rain | 0.60 | 0.58 | 0.59 | 10
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

