Article

Hybrid Backbone-Based Deep Learning Model for Early Detection of Forest Fire Smoke

by
Gökalp Çınarer
Department of Computer Engineering, Faculty of Engineering-Architecture, Yozgat Bozok University, 66100 Yozgat, Turkey
Appl. Sci. 2025, 15(13), 7178; https://doi.org/10.3390/app15137178
Submission received: 29 May 2025 / Revised: 19 June 2025 / Accepted: 23 June 2025 / Published: 26 June 2025
(This article belongs to the Special Issue Innovative Applications of Artificial Intelligence in Engineering)

Abstract

Accurate forest fire detection is critical for the timely intervention and mitigation of environmental disasters. Smoke data make it possible to intervene in forest fires before major damage occurs. This study proposes a novel deep learning-based approach that significantly enhances the accuracy of fire detection by incorporating advanced feature extraction techniques. Through rigorous experiments and comprehensive evaluations, our method outperforms existing approaches, demonstrating its effectiveness in detecting fires at an early stage. The proposed approach leverages convolutional neural networks to automatically identify fire signatures from remote sensing images, offering a reliable and efficient solution for forest fire monitoring. A total of 30 object detection models, including the proposed model, were trained and evaluated on the extended Wildfire Smoke dataset, and the results were compared. Extensive experiments showed that the proposed model gave the best result among all models, with a test mAP value of 96.9%. Our findings not only contribute to the advancement of fire detection technologies, but also underscore the importance of deep learning in addressing real-world environmental challenges.

1. Introduction

Forests are vital structures for the world, covering 33% of the land on Earth and contributing directly or indirectly to ecological, economic, and social life as the main source of rich biological diversity [1]. The destruction of forests disrupts the basic structure of the ecosystem, increases the level of carbon dioxide in the atmosphere, triggers global warming, and reduces water resources. To prevent these outcomes, measures to reduce forest fires should be taken first.
Forest fires are disasters that occur in open areas and tend to spread under the influence of wind; they are difficult to prevent and cause great damage because easily ignited materials such as wood, leaves, and branches in the forest intensify the fire. Contrary to popular belief, the damage is not limited to the forests themselves. Forest fires also cause respiratory and cardiovascular diseases in people exposed to smoke [2], and it is a scientifically established fact that they increase disease and death rates in pets and livestock [3].
Forest fires also harm the economy. In the USA, employment in affected regions increased by 1% during periods of active fire-suppression efforts, while a 20% increase in house prices was observed after forest fires [4].
Considering the effects of forest fires on the environment, the picture is alarming. Between 2013 and 2022, an average of 61,410 forest fires occurred annually, affecting an average of 7.2 million decares per year [5]. Delayed detection of forest fires may cause irreversible damage. A small fire primarily produces a large amount of smoke, and this smoke signals an impending forest fire. To prevent disaster damage, the smoke that marks the beginning of a fire must be detected quickly and accurately. The difficulties lie in the complexity of the image background, the cloud-like appearance of smoke that makes features hard to distinguish, the inability to fully detect smoke in blurry images, the visibility of objects behind thin smoke, and the variability of light intensity and angle of incidence in the open air.
For many years, forest fire detection relied only on manpower-dependent methods such as observation towers, ground teams, emergency response vehicles, and telephone notifications. The limitations of these methods restricted the precautions that could be taken against fire risks. In recent years, forest fires have been monitored in real time using international satellite systems, and real-time fire mapping has been performed. Remote sensing techniques and satellite-based imaging systems have been used in fire detection since the 1990s; these developments were followed by the integration of artificial intelligence and deep learning-based approaches after 2010. Recently, architectures such as U-Net, Yolo, and LSTM have further accelerated the process. Especially in countries such as the USA, Australia, Canada, and China, forest fires are monitored in real time by systems developed by NASA, CSIRO, and various academic institutions, and alerts are transmitted to the responsible units via automatic warning systems. These deep learning-based systems analyze fire smoke, temperature changes, and light signals to provide early warnings. Today, real-time image processing algorithms such as CNNs and Yolo, and time series-oriented models such as LSTM, are used to detect smoke and temperature anomalies in the initial stages of a fire.
In this context, DL systems are among the most important and innovative approaches for smoke detection in forest fires. However, existing models are often trained on limited datasets and can be confused by images that resemble smoke, such as clouds, fog, and dust. Moreover, although the standard Yolo models used in many studies are satisfactory in terms of accuracy and speed, they are not sensitive enough to detect early-stage fires reliably. The existing literature generally uses original backbone structures, which limits the flexibility and customizability of the models. This study provides significant contributions to the early detection of forest fires.
A new deep learning-based approach has been developed that goes beyond existing methods and enables the determination of fire signatures from smoke data with high accuracy. In particular, the proposed model’s ability to detect early-stage fires with high precision using advanced feature extraction techniques provides a time-critical advantage in controlling forest fires. In this context, a smoke detection system, which forms the basis of the DL-based forest monitoring system, was developed to prevent damage caused by forest fires by early detection.
The main contributions of this study are as follows:
  • An improved forest fire smoke image dataset consisting of smoke and smoke-like images was created to evaluate the performance of the deep learning model.
  • Smoke and normal image data were processed and classified.
  • Comparative results of a total of 30 object detection models, including 20 models with modified backbone structures and 10 original Yolo models, were shared.
  • An optimized model for the early detection of forest fire was proposed.
  • With the proposed model, pre-fire smoke images were detected more accurately and quickly, and the accuracy of smoke detection increased.
  • Compared to the original Yolo model, the number of parameters of the proposed model was reduced by 19.7%, and the number of GFLOPs was reduced by 26.8%.
  • The processing power was reduced by decreasing the parameter and GFLOP values, and accordingly, the processing time was reduced by 22.7%, and the model size by 19.7%.
  • Model backbone development techniques were used for the network architecture, detection accuracy, and processing speed.
  • The proposed model achieved higher accuracy than the other Yolo variants it was compared with.
The remainder of this paper is organized as follows: In Section 2, studies in the literature on fire detection with DL are mentioned. In Section 3, information is provided about DL, object detection, the dataset used, Yolo algorithms, the proposed model, and how the created and original models are run. In addition, a comparison between the performance of the existing models and the proposed model is presented in Section 4. Section 5 includes a discussion of studies conducted using the same dataset. Finally, the contributions of this study and future research directions are discussed.

2. Related Works

A variety of image-based methods have been proposed for forest fire smoke detection, and studies on DL-based pre-fire smoke detection have made significant progress in recent years. For this reason, current studies and methods specifically targeting fire detection are reviewed here. In a related study, Shamsoshoara et al. [6] developed a CNN-based model with data obtained from thermal cameras and drones; however, this approach is limited for real-time field deployment due to its high hardware dependency. Taken together, these studies suggest that optimizing the backbone structures of newly developed models and reducing processing times address an important bottleneck in the pipeline.
Studies on forest fire detection with DL were examined under three headings: the citation, the models used, and the algorithms applied. The literature review is detailed in Table 1.
In summary, the reviewed studies have presented various deep learning approaches for fire detection using different datasets, model architectures, and evaluation metrics. However, these studies have significant limitations, such as the use of small-scale datasets [7], dependency on a single data source [8], annotation inconsistencies [9], high computational costs and inconsistent model performances [10], and limited compatibility with real-time applications [11]. Despite promising results, there is a significant research gap in fire detection involving hybrid models that work with scalable and balanced datasets, providing robust, efficient, and early-stage fire detection under various environmental conditions. Unlike previous studies, the proposed work aims to increase the generalization capability by combining different visual conditions and datasets. This approach directly addresses the research gaps identified in the literature and contributes to the development of early warning systems for forest fires.

3. Material and Method

3.1. Object Detection and Deep Learning

In the digitalizing world, technology is used in almost every daily and routine transaction. As technological developments accelerate, digital solutions are being produced for ever more specific tasks. With the growth of open-source websites and internet usage, there has been a significant increase in data and data sources. Image processing has enabled a range of applications built on the information extracted from image data [12]. The exploration of image data has continued alongside the development of computer technology. Research on machine intelligence in the 1950s laid the foundations for smart machines; this technology has since been integrated with modern computing and is now defined as AI [13]. Inspired by the structure of the human brain, researchers created ANN, ML, and DL structures to give machines the ability to learn [14]. Taking advantage of the increasing learning power of machines and the abundance of image data, researchers have continued to study object detection. The ability of computers to see objects and distinguish what they see is generally examined under the term CV. ML, DL, and CV are sub-branches of AI.

3.2. Dataset

A new dataset was prepared because the open-source datasets available for forest fires are single-class, require very high processing power owing to their excessive number of images, and have very long processing times. At the beginning of a fire, dense smoke rises, and after this rising smoke, flames begin to grow. Because our aim was early fire detection, the dataset was built from aerial smoke images and images of smoke-like objects in nature. First, 737 images were taken from the open-source Wildfire Smoke Dataset created by the AI for Mankind technology community [15]. Then, 124 smoke-like images were downloaded from various open-source websites (istockphoto, pexels, shutterstock). By applying data augmentation to 72 of the downloaded images, a total of 268 images were obtained and combined with the Wildfire Smoke Dataset images. The resulting extended wildfire dataset [16] contains two classes, smoke and normal, and was divided into three subsets: train, test, and validation.
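To make the data preparation step concrete, the sketch below shows one way the merge-and-split described above could be reproduced in Python. The folder layout, file extensions, and random seed are illustrative assumptions of this sketch, not the exact organization of the released dataset [16].

```python
import random
import shutil
from pathlib import Path

# Assumed layout: merged images in one folder, with a YOLO-format .txt
# label file per image in a sibling "labels" folder.
SOURCE_DIR = Path("extended_wildfire_smoke")
OUTPUT_DIR = Path("dataset")

images = sorted((SOURCE_DIR / "images").glob("*.jpg"))
random.seed(0)  # fixed seed so the split is reproducible
random.shuffle(images)

n_eval = len(images) // 10            # 10% each for val and test (100 of 1005)
splits = {
    "test": images[:n_eval],
    "val": images[n_eval:2 * n_eval],
    "train": images[2 * n_eval:],     # remaining 80% (805 of 1005)
}

for split, files in splits.items():
    for img in files:
        label = SOURCE_DIR / "labels" / img.with_suffix(".txt").name
        for kind, src in (("images", img), ("labels", label)):
            dst = OUTPUT_DIR / split / kind
            dst.mkdir(parents=True, exist_ok=True)
            if src.exists():
                shutil.copy(src, dst / src.name)
```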

3.3. Algorithms

DL-based detection algorithms fall into two categories: two-stage and one-stage detectors. Two-stage detectors first use a network to propose the regions where objects are likely to be located, and then use a second structure to precisely detect objects in those regions [17]. The long running time of two-stage detectors motivated the creation of the one-stage Yolo model as an alternative [18]. Yolo performs localization and classification in a single pass over one network with a smaller model size, which makes it not only faster but also capable of real-time detection [19]. Yolov1, introduced in 2016, significantly improved both detection speed and accuracy compared to earlier approaches [20]. NMS is applied to the predicted confidence scores: among the bounding boxes drawn for the same object, only the box with the highest confidence score is kept [21]. Yolov1 consists of 24 convolutional layers, 2 fully connected layers that detect objects, and 1 × 1 convolutional layers that increase speed by reducing the number of parameters. Although Yolov1 is a fast and capable object detection model, it generalizes poorly to object sizes not present in the training data, detects at most two objects of the same class per grid cell, and produces more localization errors than comparable models [22].
Yolov2 [23] is an updated object detection model developed to overcome the limitations of Yolov1. Its remaining weaknesses are as follows: although it accelerates inference by reducing the number of calculations, the improvement in mAP is modest; the simplicity of its CNN network and the absence of a structure such as an RPN keep its accuracy below that of other models; and it still fails to detect small objects reliably.
A long time after the launch of Yolov3 [43], the fourth version, Yolov4 [25], was introduced in 2020. The model consists of three parts: the backbone, which extracts features; the neck, which refines the extracted features to increase their usability; and the head, which is responsible for prediction. Yolov4 uses CSPDarknet53 as the backbone; its CSP structures reduce the computational effort while maintaining detection accuracy. Mish [26], which has a self-regularizing ability, was chosen as the activation function. For feature aggregation, the bottom-up PANet is combined with the top-down FPN; however, unlike the original PANet, this version concatenates features instead of adding them [27]. Despite these innovations, Yolov4 remains insufficient for detecting small objects owing to its prediction structure and the use of large anchor boxes.
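To illustrate the NMS step described above, the following is a minimal NumPy sketch of greedy non-maximum suppression over axis-aligned [x1, y1, x2, y2] boxes. The 0.45 IoU threshold is an illustrative default, not a value reported in this study.

```python
import numpy as np

def nms(boxes: np.ndarray, scores: np.ndarray, iou_threshold: float = 0.45):
    """Greedy NMS: keep the highest-confidence box, drop overlapping rivals."""
    x1, y1, x2, y2 = boxes.T
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]          # indices, highest confidence first
    keep = []
    while order.size > 0:
        i = order[0]                        # best remaining box is kept
        keep.append(i)
        # Intersection of box i with every remaining candidate
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        # Discard candidates whose overlap with the kept box is too strong
        order = order[1:][iou <= iou_threshold]
    return keep
```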
Yolov5 [28] abandoned Darknet and adopted PyTorch as its framework. A modified CSPDarknet53 backbone was used to increase speed and accuracy while reducing the number of parameters and calculations. The focus layer, first introduced in Yolov5, was placed at the start of the backbone to reduce CUDA memory usage and speed up forward and backward propagation [29]. After the focus layer, convolutional layers serve as feature extractors, with BN and the SiLU activation function added to each convolutional layer. The pooling layer in the backbone, SPP, removes the input-size limitation by combining features of different scales within a single feature map. In the neck, PANet strengthens low-level features and improves the accuracy of localization. Although the head structure is the same as in the previous version, it outputs three feature maps at different scales to improve the detection of both small and large objects. The details of the Yolov5 architecture are given in Figure 1. Another notable innovation is the AutoAnchor structure, which runs k-means on the real object boundaries in the training data to generate anchor boxes of appropriate sizes [27].
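As an illustration of the focus layer idea described above, here is a minimal PyTorch sketch. The 3 × 3 kernel and the channel counts in the usage line are assumptions chosen for the example, not necessarily the exact Yolov5 configuration.

```python
import torch
import torch.nn as nn

class Focus(nn.Module):
    """Space-to-depth slicing followed by a convolution (Yolov5-style focus).

    An input of shape (B, C, H, W) is rearranged into (B, 4C, H/2, W/2) by
    taking every second pixel in four phase-shifted patterns, so spatial
    resolution is halved without discarding any pixel information.
    """
    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(4 * in_channels, out_channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_channels),
            nn.SiLU(),  # Conv + BN + SiLU, as described for Yolov5 above
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        patches = [x[..., ::2, ::2], x[..., 1::2, ::2],
                   x[..., ::2, 1::2], x[..., 1::2, 1::2]]
        return self.conv(torch.cat(patches, dim=1))

# e.g. a 640x640 RGB image becomes a 64-channel 320x320 feature map
out = Focus(3, 64)(torch.randn(1, 3, 640, 640))  # -> (1, 64, 320, 320)
```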
The makers of the Yolov5 algorithm presented five models with different width and depth dimensions for the convolutional structures. Nano (n) and small (s) models are lightweight models designed to operate with low system requirements. Large (l) and extra-large (x) models are heavy models designed to achieve high performance and are used in more powerful devices, although they cause slow speeds. The medium (m) model is designed to work in a more balanced way in terms of performance and speed compared to other models.
Unlike the models before it, Yolov6 [30] does not use anchors. Just a month after the release of Yolov6, Yolov7 [31] was launched in July 2022. Yolov7 introduced the E-ELAN backbone, which combines the features of different groups without disturbing the original gradient path by controlling the shortest and longest gradient paths, taking into account the factors affecting accuracy and speed [32]. In addition, compound model scaling was introduced: the width and depth of the network are changed consistently to meet various application requirements while preserving the optimal network structure, and models of various sizes were produced this way. Inspired by the RepConv structure, a RepConvN convolutional layer without an identity connection was used [33]. Yolov7 also takes a different approach to the head responsible for object detection: an auxiliary head in the middle layers improves learning by updating the weights during training, while a label assigner compares predictions with ground truth and passes the soft labels it creates to the lead head, which produces the final output [34]. Yolov6 is a version of Yolo oriented toward integrated use in IoT and embedded systems [35]; as a result, its community support may be more limited than that of versions such as Yolov5 or Yolov8, and there may be uncertainty about model updates and long-term maintenance.
In January 2023, the latest version of Yolo at the time, Yolov8 [36], was released by Ultralytics (London, UK) in five model scales. The nano (n) and small (s) models, obtained by changing the depth, width, and ratio values, run fast with reduced parameter counts and computation, but have lower accuracy than the other models; they are lightweight and designed for systems with modest hardware. The large (l) and extra-large (x) models achieve high accuracy but are slower because of their high parameter counts. The medium (m) model balances performance and speed between the lightweight and large-scale models. By offering models at different scales for both small and large object detection, Yolov8 broadened the application area of the family.
In object detection algorithms that use anchors, misaligned anchors cause errors; for this reason, Yolov8 is anchor-free and instead uses task alignment learning [37]. C2f and CBS modules are used in the backbone, which increases performance and makes feature extraction more efficient. In addition, SPPF at the end of the backbone extracts features through three successive pooling operations, expanding the receptive field of the network and enabling more accurate object detection in complex images [38]. To detect objects at different scales, the FPN, which creates feature maps, and the PAN structure, which skips across different levels of the network and combines features, are used together [39]. Thus, objects of different shapes can be perceived accurately. In Yolov8, every convolution includes BN and SiLU activation.
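The sketch below illustrates the SPPF idea in PyTorch: three chained 5 × 5 max-pool operations emulate the parallel 5/9/13 pools of the original SPP block at lower cost. This is a simplified sketch; the Ultralytics implementation additionally reduces channels with a 1 × 1 convolution before pooling, which is omitted here for brevity.

```python
import torch
import torch.nn as nn

class SPPF(nn.Module):
    """Spatial Pyramid Pooling - Fast, placed at the end of the backbone."""
    def __init__(self, channels: int, pool_kernel: int = 5):
        super().__init__()
        # stride 1 + padding keeps the spatial size, so outputs can be stacked
        self.pool = nn.MaxPool2d(pool_kernel, stride=1, padding=pool_kernel // 2)
        self.conv = nn.Conv2d(channels * 4, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        p1 = self.pool(x)        # effective 5x5 receptive field
        p2 = self.pool(p1)       # effective 9x9
        p3 = self.pool(p2)       # effective 13x13
        # Concatenating input + pooled maps widens the receptive field
        return self.conv(torch.cat([x, p1, p2, p3], dim=1))
```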
In this context, the Yolov8 and Yolov5 algorithms were selected as pilot models, and the backbone changes were applied separately to both so that their performances could be compared. The nano and small versions of Yolov5 are quite successful in terms of speed, and Yolov5 in general is fast, well optimized, and easy to use. The small Yolov8 models (for example, Yolov8n or Yolov8s) are likewise speed-oriented; as one of the newest versions, Yolov8 is well optimized and very fast on modern hardware. Given these considerations, comparing these two algorithms is important for evaluating the performance of the proposed model.

3.4. Proposed Model

Although each new Yolo release brings new features, object detection performance has not yet reached the desired level. In this context, the model was developed using the hyperparameters of the most stable Yolov5 model. Although Yolov5 is a popular object detection model, some architectural changes are required to increase its detection accuracy and improve its performance. When detecting an object, the target-specific features extracted from the image are the most important factor in localizing the object region. Therefore, the development effort focused on the backbone, the part of the model that performs feature extraction and creates feature maps at different resolutions. A new network was developed by replacing Yolov5's backbone with an integrated ResNet50 [40] model. The architecture of the model is illustrated in Figure 2.
In the original ResNet50 structure, bottleneck blocks that perform subsampling apply a stride of 2 in a 1 × 1 convolution. In the proposed backbone, the stride of 2 was moved to the 3 × 3 convolution to increase the accuracy rate. In the first layer of the model, a 7 × 7 convolution with 64 filters is applied, followed by a 3 × 3 max pooling layer that reduces the spatial dimension; this initial stage captures basic patterns over a wide area. To use ResNet50 as a feature extractor in the backbone, the average pooling layer and the softmax classifier layer were removed, and the original Yolov5 backbone was replaced with this ResNet50-based structure.
Residual blocks are the basic building blocks of the model. The residual connections in each block allow the network to grow deeper while remaining resistant to the vanishing gradient problem: after processing its input, each block adds the input back through a direct shortcut link. The residual block thus learns the difference F(x) between the input x and the target transformation. In the original classification form of the network, a global average pooling layer summarizes the feature maps, a fully connected layer produces the class outputs, and the softmax activation function performs the classification; as noted above, these layers are removed for backbone use. The architecture of the hybrid model is shown in Figure 3.
The proposed model integrates this ResNet-based structure into the Yolo architecture, offering more efficient feature fusion and information transfer than the classical Yolo design. In particular, the CSP structures in the neck were developed to increase accuracy while reducing the computational load. The architecture has three main sections, the backbone, the neck, and the output, and is optimized for high-accuracy, low-latency object detection. In the backbone, the input image first passes through the 7 × 7, 64-filter convolution and 3 × 3 max pooling described above, followed by bottleneck blocks repeated 3, 4, 6, and 3 times, respectively. Each block consists of 1 × 1, 3 × 3, and again 1 × 1 convolutions, with residual connections minimizing information loss and gradient fading; this allows the efficient extraction of features from low to high levels. The multi-scale feature maps from the backbone are enriched and made more meaningful in the neck, where BottleneckCSP (Cross-Stage Partial) structures achieve high accuracy with a lower parameter count and processing load than traditional convolutions. In the final stage, the multi-scale feature maps from the neck are processed for classification and location estimation by Conv2D layers. Compared with the original Yolo architecture, the model has 19.7% fewer parameters and 26.8% lower FLOPs; these improvements reduced the computation time by 22.7% and significantly reduced the model size.
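The following PyTorch sketch illustrates the modified bottleneck described above, with the downsampling stride placed on the 3 × 3 convolution and a 1 × 1 projection on the shortcut. The channel sizes in the usage line and the use of SiLU inside the block are assumptions made for illustration, not the exact layer configuration of the proposed backbone.

```python
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    """ResNet50 bottleneck with the stride moved to the 3x3 convolution.

    In the stock block the downsampling stride sits on the first 1x1 conv,
    which discards 3 of every 4 positions before any spatial filtering;
    striding the 3x3 conv instead keeps that information, as described above.
    """
    def __init__(self, in_ch: int, mid_ch: int, out_ch: int, stride: int = 1):
        super().__init__()
        self.residual = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, 1, bias=False),
            nn.BatchNorm2d(mid_ch), nn.SiLU(),
            nn.Conv2d(mid_ch, mid_ch, 3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(mid_ch), nn.SiLU(),
            nn.Conv2d(mid_ch, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch),
        )
        # 1x1 projection so the shortcut matches the residual branch's shape
        self.shortcut = (
            nn.Identity() if stride == 1 and in_ch == out_ch
            else nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_ch),
            )
        )
        self.act = nn.SiLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.act(self.residual(x) + self.shortcut(x))  # H(x) = F(x) + x

# e.g. a downsampling stage: (1, 256, 80, 80) -> (1, 512, 40, 40)
y = Bottleneck(256, 128, 512, stride=2)(torch.randn(1, 256, 80, 80))
```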
Stacking many layers in a DL structure, as is done for computer vision and similarly difficult problems, expands model capacity. However, because small gradient values are multiplied during backpropagation, very deep structures produce extremely small gradients, so the updates to the weight values shrink to almost zero. The skip connection structure shown in Figure 4 is ResNet's solution to this vanishing gradient problem: it bypasses layers that would otherwise degrade the performance of a plain network. This structure forms the basis of ResNet and is the residual block that gives the model its name.
The skip connection enables faster propagation of inputs between the layers. Under normal circumstances, the output is obtained by multiplying the input x by the layer weights, adding a bias term, and passing the result through an activation function F (see Equation (1)).
H(x) = F(wx + b)
With the skip connection, however, the output H(x) is calculated as the sum of the input and the input passed through the activation function (see Equation (2)).
H(x) = F(x) + x
In addition, increasing the number of layers in a neural network expands the model's hypothesis space: there are so many candidate functions that reaching the target function becomes very difficult, and convergence becomes almost impossible. In practice, attempts to approach the target function can instead move away from it. For this reason, the function spaces are nested within the model architecture, so that better results can be obtained without reducing model quality; with this method, an increase in the number of layers does not degrade performance. In short, nesting the function spaces solves the convergence problem. Moreover, increasing the number of functions increases the number of non-linear mappings, which improves performance and yields better results for complex structures. In the bottleneck blocks, 1 × 1 convolutions placed around the 3 × 3 convolution reduce the number of parameters, and a 1 × 1 convolution on the shortcut connection matches the input and output sizes. By decreasing the parameters, the training time was also shortened.

3.5. Running the Models

The model developed in this study was compared not only with the classical architectural structures of Yolo models, but also with the performances of different hybrid models. In this context, five models of Yolov5 (n-s-m-l-x), five models of Yolov8 (n-s-m-l-x), and twenty models whose backbone structures were modified with EfficientNet_b1, MobileNetV3s, ResNet34, and ResNet50 were used in this study. In this process, 30 object detection models were tested using Google Colab, and the results were examined.
In total, 737 images from the Wildfire Smoke Dataset and 268 images containing smoke-like objects taken from open-source websites were processed on the Colab platform using the Python programming language and libraries such as Keras, PyTorch, and TensorFlow. All experiments were conducted with Python 3.13 and the PyTorch deep learning framework on a workstation equipped with an NVIDIA (Santa Clara, CA, USA) RTX 3080 GPU, an Intel Core i7-12700K CPU, and 64 GB of RAM. Of the 1005 images in our extended two-class dataset, 805 (80%) were used for training, 100 (10%) for validation, and the remaining 100 (10%) for testing.
The most appropriate parameter values of the DL models for forest fire detection were determined and consistently applied across all algorithms to ensure a fair comparison. These values included a learning rate of 0.01, a momentum of 0.937, the SiLU activation function, the SGD optimization algorithm, and 100 training epochs across all backbone architectures and model variants.
The epoch count, i.e., the number of times the labeled images in the dataset are shown to the network, was set to 100 for all models. The optimization algorithm, which improves performance in complex learning processes, was SGD for all models, with a momentum value of 0.937 to speed up its operation. During backpropagation, the weights are updated by computing the gradient, multiplying it by the learning rate, and subtracting the result from the old weight values; the learning rate was set to 0.01 for all models. The activation function, which converts the results to non-linear values and thereby improves performance, was SiLU for all models. It was preferred over ReLU to mitigate the vanishing gradient problem and to respond more sensitively to negative inputs.
SiLU(α) = α · σ(α)
In this formulation, α is the input value and σ(α) is the sigmoid function. The sigmoid output lies between 0 and 1, and it is typically used in the final layer of two-class models. The sigmoid function is defined as follows:
σ(α) = 1 / (1 + e^(−α))
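As a compact summary of the configuration above, the following PyTorch sketch sets up the shared SGD optimizer and verifies the SiLU formula numerically. The placeholder module is an assumption of this sketch, standing in for any of the 30 detectors compared.

```python
import torch
import torch.nn as nn

# SGD with the stated learning rate and momentum, applied uniformly to
# every model variant; `model` is a stand-in placeholder module.
model = nn.Conv2d(3, 16, 3)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.937)
EPOCHS = 100  # each model saw the full labeled dataset 100 times

# SiLU(a) = a * sigmoid(a): smooth and non-zero for negative inputs,
# unlike ReLU, which clips them to zero.
x = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0])
assert torch.allclose(nn.SiLU()(x), x * torch.sigmoid(x))
```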

3.6. Evaluating the Results of Object Detection Algorithms

Several values are required to judge whether object detection models produce good or bad results and to evaluate their performance. These metrics are calculated from the accuracy of the models' predictions on the images in the dataset. For the smoke category: if the model detects smoke that is present in the image, the prediction is a TP; if it misses smoke that is present, it is an FN; if it detects smoke where none exists, it is an FP; and if it correctly reports that no smoke is present, it is a TN. TPs and TNs therefore indicate successful detections, while FNs and FPs indicate incorrect ones.
Precision measures the ratio of TP values to all positive predictions of the model, as shown in Equation (5).
Precision = TP / (TP + FP)
Recall is the value that gives the ratio of TP values to all positive situations in a real situation, as shown in Equation (6).
Recall = TP / (TP + FN)
F1 combines recall and precision into a single value; it is computed as their harmonic mean to avoid the misleading result an arithmetic mean would give when the two values are extreme (see Equation (7)).
F1 = 2 × (Precision × Recall) / (Precision + Recall)
The evaluation metric that summarizes the accuracy of the model in a single value is the mAP. In Equation (8), n represents the number of classes and AP_k the AP value of class k.
mAP = (1/n) × Σ_{k=1}^{n} AP_k
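As a worked illustration of Equations (5)-(8), the snippet below computes the metrics from raw counts. The confusion counts are invented for the example; the two AP values are taken from the per-class PR results reported for the proposed model in Section 4 (0.98 for smoke, 0.959 for normal), whose mean is consistent with the reported overall mAP.

```python
def detection_metrics(tp: int, fp: int, fn: int) -> dict:
    """Precision, recall, and F1 from raw detection counts (Eqs. 5-7)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return {"precision": precision, "recall": recall, "f1": f1}

def mean_average_precision(ap_per_class: list[float]) -> float:
    """mAP (Eq. 8): the mean of the per-class average precision values."""
    return sum(ap_per_class) / len(ap_per_class)

# Illustrative counts only, not the paper's actual confusion values:
print(detection_metrics(tp=95, fp=5, fn=4))
# Per-class APs from Figure 6 -> 0.9695, consistent with the ~97% mAP reported
print(mean_average_precision([0.98, 0.959]))
```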

4. Experimental Results

The training results of five Yolov8 models, five Yolov5 models, and twenty hybrid Yolov5 models with modified backbone structures, all trained on the improved smoke dataset, are listed in Table 2. Table 2 shows that the model with the highest precision value, 98.8%, is the proposed model based on Yolov5-x with the ResNet50 backbone. It is followed, with precision values of 98.2% and 98.0%, by the Yolov5-m model with the MobileNetV3s backbone and the Yolov5-n model with the ResNet50 backbone. The original Yolov5-l model had the highest recall value among the 30 models, at 98.0%, followed by the ResNet50-backbone Yolov5-m model (97.3%) and the Yolov5-x-based proposed model with the ResNet50 backbone (96.7%). The highest F1 value, 97.7%, which combines precision and recall in a single value, belongs to the proposed model with the modified backbone; it is followed by the original Yolov5 model (97.3%) and the ResNet50-backbone Yolov5-n model (97.1%). The model with the highest training mAP was the original Yolov5-m model (98.8%), followed by the ResNet50-backbone Yolov5-n model and the original Yolov5-l model, both with a mAP value of 98.6%.
The testing process used a test set of images that the models had never seen. The results recorded during testing are listed in Table 3. Among the models, the original Yolov5-n model yielded the highest precision, at 95.5%. It is followed by the proposed model with the ResNet50 backbone, and the Yolov5-x model with the MobileNetV3s backbone yielded 95.0%. The Yolov5-m model with the ResNet50 backbone achieved the highest recall, 96.7%, followed by the proposed ResNet50-backbone model with a recall of 96.4%. The best F1 value, 95.8%, again belongs to the proposed model with its modified backbone, followed by the original Yolov5-n model (95.3%) and the ResNet50-backbone Yolov5-m model (95.2%). Finally, the highest mAP value, 96.9%, which best summarizes overall model performance, was achieved by the proposed ResNet50-backbone model, followed by the ResNet50-backbone Yolov5-s model with a mAP of 96.5%.
Because the tests use image data that the algorithms have never seen, they best reflect the performance of the models in a real application environment. Across all hybrid models and original YOLO models, the proposed hybrid model provided the best performance.
A heatmap showing the performance metrics (precision, recall, F1, mAP) for the models listed on the x-axis is given in Figure 5. Color intensity represents performance, where dark red corresponds to higher values and blue to lower values. The performance of the proposed model is better than or equal to that of the other models across all metrics (precision, recall, F1, mAP), and it has the highest value especially in the mAP (97%) metric.
To evaluate the performance of the models and measure their usability in a real application environment, it is necessary to examine the training time and model size values. In Table 4, the training time and model size values of the models are presented. The fastest model here is the original Yolov5-n model with a 0.21 h training time. In terms of speed, the Yolov5-n model is followed by the Yolov5-n and Yolov8-n models, whose backbone structure has been replaced with MobileNetV3s, with training time values of 0.22 h. The smallest model in terms of size was the original Yolov5-n model with a size of 3.9 MB.
The Yolov5-n model is followed by the Yolov5-n model and the Yolov8-n model, with the backbone structure replaced by MobileNetV3s, with model sizes of 4.7 MB and 6.2 MB, respectively. In this case, the best models in terms of speed and model size are the original Yolov5-n model, the Yolov5-n model with the backbone structure modified by MobileNetV3s, and the Yolov8-n model, respectively.
The number of layers, the number of parameters (in millions), and the GFLOPs of the models are listed in Table 5. FLOPs count the floating-point operations a model performs, and one GFLOP equals one billion FLOPs. Model complexity was measured in GFLOPs, while the parameter count represents the size of the model: the smaller the parameter and GFLOP values, the less processing power the model needs and the smaller it is. The model with the smallest parameter and GFLOP values, i.e., requiring the least processing power, was the original Yolov5-n. However, because even a small error is critical when detecting a situation as important as a forest fire, the priority is a high accuracy rate; therefore, the proposed model, which has the highest mAP value, is recommended.
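For reference, parameter counts of the kind reported in Table 5 can be obtained directly in PyTorch, as the sketch below shows with a stock ResNet50 backbone rather than the full hybrid detector; FLOP estimates require a third-party profiler.

```python
import torch
import torchvision

def count_parameters(model: torch.nn.Module) -> float:
    """Trainable parameters in millions, the unit used in Table 5."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6

# Example with a stock ResNet50 backbone (not the full hybrid detector):
backbone = torchvision.models.resnet50(weights=None)
print(f"{count_parameters(backbone):.1f} M parameters")  # ~25.6 M

# FLOPs are not exposed by PyTorch directly; third-party profilers such as
# thop or fvcore estimate them from a forward pass on a dummy input.
```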
The PR curves of the test results of the proposed model, the original Yolov5-x model, and the other Yolov5-x variants created by changing the backbone structure are shown in Figure 6. The model with the highest mAP in the smoke class, 98%, is the proposed model (Figure 6e). The two best-performing models in the smoke class after it, with mAP values of 96.3% and 96%, are the original Yolov5-x model (Figure 6a) and the Yolov5-x model with the MobileNetV3s backbone (Figure 6c). In the normal class, the highest mAP belongs to the MobileNetV3s-backbone Yolov5-x model (96%), compared with 95.9% for the proposed model and 92.6% for the original model. The remaining two curves in Figure 6 belong to the Yolov5-x models whose backbones were replaced with EfficientNet_b1 (Figure 6b) and ResNet34 (Figure 6d). The curves in Figure 6 allow the performance of each model to be judged by the area under the curve, to which model performance is directly proportional. The model with the largest area under the curve is the proposed model, as reflected in its all-class mAP of 96.9%, the highest among the models.

Comparison of the Proposed Model and the Original Model

The proposed model, developed by changing the backbone structure, outperforms the original Yolov5-x model and has the highest mAP value among the 30 models compared. With the changes made, the GFLOPs of the proposed model were reduced by 26.8% and the number of parameters by 19.7% relative to the original model, yielding a faster and more accurate model that needs less processing power. While the original Yolov5-x model had 322 layers, the proposed model has 292. As a result of these changes, the training time of the proposed model was reduced by 22.7% and the model size by 19.7% compared with the original model, resulting in a faster model.
While the precision of the original Yolov5-x model on the test set was 89.5%, the improved model raised this value by 5.9% to 95.4%. The recall increased by 2.4% and the F1 value by 4.2% over the original model, reaching 96.4% and 95.8%, respectively. The mAP value, the most important metric for interpreting overall performance, increased by 2.4% to 96.9%, making the proposed model the best-performing of the 30 models.
Figure 7 is a violin graph showing the distribution of performance values for different metrics (precision, recall, F1, mAP). Each violin visualizes the density and distribution of values for a metric. For the precision metric, the values are mostly concentrated between 0.85 and 1.0. However, there are a few low precision values around 0.6 in the lower region (these low precision values were also noticeable in the previous heatmap). The recall values are slightly more narrowly distributed compared to those of precision, meaning there are no major differences in recall performance between models. The F1 score distribution is concentrated between 0.8 and 1.0. The distribution is wider than for precision and recall, indicating some differences in F1 score between the models. The mAP value is quite narrowly concentrated, mostly between 0.9 and 1.0. This shows that most models have consistent and high performance in terms of mAP.
The effects of the modifications on the model predictions are shown in Figure 8. The labeled images (Figure 8a) come from the dataset; their label information contains the actual boundaries to be estimated. As can be seen, the original Yolov5-x model made some incorrect predictions (Figure 8b), perceiving a single smoke plume as two separate smoke regions. After the improvements, the model became more stable (Figure 8c).

5. Discussion

Although interest in DL-based computer vision techniques for detecting fire and smoke in forests and protected natural areas has grown, the methods used have largely remained limited to classical deep learning models. Bankar et al. [41] studied smoke detection using the Wildfire Smoke Dataset, applying data augmentation to improve object detection performance. After augmentation, 1915 images were obtained and run on the Yolov5-x, Faster R-CNN, SSD, Yolov3, and Fast R-CNN models, yielding mAP values of 87%, 79%, 72%, 81%, and 78%, respectively; the authors reported that Yolov5-x gave the highest result among the five DL models, with a mAP of 87%. They also reported mAP values of the Yolov5-x model for different batch sizes, the highest (87%) being achieved with a batch size of 16. Al-Smadi et al. [42] also used the Wildfire Smoke Dataset for smoke detection, increasing the total data to 1723 through augmentation and reserving 80% of the images for training and 20% for testing. They compared eight models (Yolov5, Yolov7, SSD, Yolov4, Fast R-CNN, EfficientDet, Yolov3, and Faster R-CNN): three of them, Yolov5, Yolov7, and Yolov3, exceeded 90% mAP, while the other models remained below 80%. Yolov3 [43] matches the accuracy of SSD [24] while being three times faster. It adds an SPP block that combines multiple max-pooling outputs, and by combining feature maps at different scales through multi-scale prediction it addresses the difficulty of detecting small objects. Yolov3 connects the inputs of 1 × 1 layers to the outputs of 3 × 3 layers in an architecture of 106 convolutional layers, 53 of which are in the detection head, and includes Leaky ReLU activations and BN; its deep layers make it work slightly slower. In that study, the results of all evaluations indicated that the best-performing model was Yolov5-x, with a mAP value of 96.8%. Distribution-based methods, by contrast, train DNNs as feature extractors to detect abnormal images, treating images that deviate from the center of the distribution as abnormal inputs; however, they cannot localize the part of the image that causes the abnormality [44].
In the studies above, the training and test results were not reported separately; there is no information indicating whether the given mAP values are training or test results. In another study [45] using the Yolov3 and Yolov4 models, mAP values of 84.12% and 88.15% were achieved, respectively. In the present study, a very high mAP score was obtained, which demonstrates the success of the proposed model. In our work with the extended Wildfire Smoke Dataset, the results of the training and test processes were reported in detail: the training mAP of our improved Yolov5-x model was 97.3% and the test mAP was 96.8%. In addition, the earlier studies used the original single-class Wildfire Smoke Dataset. A single-class dataset limits the number of object types the model can detect to one, so the model has no possibility of assigning any other class name to the smoke object in the image, which may inflate its apparent accuracy. Considering this, a second class was created by adding images collected from open-source websites to the original Wildfire Smoke Dataset, making it possible to assess the model's performance more reliably. In this respect, the present work is more detailed than previous studies using the Wildfire Smoke Dataset and proposes a model with a higher mAP value. Furthermore, the proposed model works faster than the original Yolov5-x: it has fewer parameters and lower FLOPs, which significantly reduces processing time and model size while maintaining depth and accuracy, giving it a structure suitable for real-time operation. The architecture distinguishes smoke-like structures from real smoke with high accuracy.

6. Conclusions

In this study, a high-accuracy deep learning-based method for the automatic early detection of forest fires was investigated, and an improved model for this task was presented. First, an extended two-class dataset was built by adding additional images to the open-source Wildfire Smoke Dataset. A model was then proposed by replacing the backbone structure of the original Yolov5-x model with ResNet50. A total of 30 object detection models were comparatively evaluated on the dataset developed within the scope of this study, and the proposed model gave the best results, with 96.9% mAP in the testing phase. In this respect, the study brings a new data processing and modeling method for fire detection to the literature and strongly demonstrates the potential of deep learning in preventing environmental disasters. To verify the success of the proposed model against other models, it was compared with 19 Yolov5 models with modified backbone structures, five original Yolov5 models, and five Yolov8 models, the most current version of Yolo at the time; the accuracy rate of the proposed model was higher than that of all compared models. Developed from the Yolov5-x model, our model showed better performance with less processing power than the original Yolov5-x. Our study was also compared with other studies using the Wildfire Smoke Dataset and obtained better results. As a result, smoke detection for the early detection of forest fires was performed successfully. Despite these successes, some factors limit the applicability of the proposed model to different areas and images: its ability to distinguish smoke from similar phenomena may weaken in dark, hazy, rainy, or smoggy weather. In future studies, different image processing and smoke detection techniques will be integrated into the system to address this problem and improve the prediction of smoke presence.
Future work will continue the optimization of the proposed model and investigate the effect of data augmentation on accuracy. In addition, identifying the shape and size of smoke plumes could enable faster fire detection, and the detection process could be accelerated further by monitoring real-time drone imagery. The results will be transferred to a mobile system, and a prototype of a real application with both mechanical and temporal process control will be created.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

All data included in this study are available upon request by contacting the corresponding author or can be downloaded from https://doi.org/10.5281/zenodo.14218778 (accessed on 25 November 2024).

Conflicts of Interest

The author declares no known competing financial interests or personal relationships that could have influenced the work reported in this paper.

List of Abbreviations

GFLOPs: Giga Floating-Point Operations Per Second
FPN: Feature Pyramid Networks
PAFPN: Path Aggregation Network for Instance Segmentation
AP: Average Precision
CNN: Convolutional Neural Networks
RCNN: Region Convolutional Neural Network
YOLO: You Only Look Once
AI: Artificial Intelligence
ANN: Artificial Neural Networks
ML: Machine Learning
DL: Deep Learning
CV: Computer Vision
NMS: Non-Maximum Suppression
BN: Batch Normalization
mAP: Mean Average Precision
RPN: Region Proposal Network
SSD: Single Shot MultiBox Detector
SPP: Spatial Pyramid Pooling
SPPF: Spatial Pyramid Pooling Fast
IoU: Intersection Over Union
CSP: Cross-Stage Partial Connections
PANet: Path Aggregation Network
SiLU: Sigmoid Linear Unit
ReLU: Rectified Linear Unit
IoT: Internet of Things
CBS: Convolution, Batch Normalization, SiLU
ResNet: Residual Network
SGD: Stochastic Gradient Descent
TP: True Positives
TN: True Negatives
FP: False Positives
FN: False Negatives
PR: Precision Recall

References

  1. Dölarslan, M.; Gül, E. Sadece Bir Yangın mı? Ekolojik ve Sosyo-Ekonomik Açıdan Orman Yangınları [Is It Just a Fire? Forest Fires from an Ecological and Socio-Economic Perspective]. Türk Bilimsel Derlemeler Derg. 2017, 10, 32–35. [Google Scholar]
  2. Eisenman, D.P.; Galway, L.P. The Mental Health and Well-Being Effects of Wildfire Smoke: A Scoping Review. BMC Public Health 2022, 22, 2274. [Google Scholar] [CrossRef]
  3. Sanderfoot, O.V.; Bassing, S.B.; Brusa, J.L.; Emmet, R.L.; Gillman, S.J.; Swift, K.; Gardner, B. A Review of the Effects of Wildfire Smoke on the Health and Behavior of Wildlife. Environ. Res. Lett. 2022, 16, 123003. [Google Scholar] [CrossRef]
  4. Meier, S.; Elliott, R.J.R.; Strobl, E. The Regional Economic Impact of Wildfires: Evidence from Southern Europe. J. Environ. Econ. Manag. 2023, 118, 102787. [Google Scholar] [CrossRef]
  5. Hoover, K.; Hanson, L.A. Wildfire Statistics; CRS In Focus, IF10244; Library of Congress, Congressional Research Service: Washington, DC, USA, 2023. [Google Scholar]
  6. Shamsoshoara, A.; Afghah, F. Airborne Fire Detection and Modeling Using Unmanned Aerial Vehicles Imagery: Datasets and Approaches. In Handbook of Dynamic Data Driven Applications Systems; Springer: Berlin/Heidelberg, Germany, 2023; Volume 2, pp. 525–550. [Google Scholar]
  7. Ul Ain Tahir, H.; Waqar, A.; Khalid, S.; Usman, S.M. Wildfire Detection in Aerial Images Using Deep Learning. In Proceedings of the 2022 2nd International Conference on Digital Futures and Transformative Technologies (ICoDT2), Rawalpindi, Pakistan, 24–26 May 2022. [Google Scholar] [CrossRef]
  8. Zhang, L.; Wang, M.; Ding, Y.; Bu, X. MS-FRCNN: A Multi-Scale Faster RCNN Model for Small Target Forest Fire Detection. Forests 2023, 14, 616. [Google Scholar] [CrossRef]
  9. Lin, J.; Lin, H.; Wang, F. STPM_SAHI: A Small-Target Forest Fire Detection Model Based on Swin Transformer and Slicing Aided Hyper Inference. Forests 2022, 13, 1603. [Google Scholar] [CrossRef]
  10. Choutri, K.; Fadloun, S.; Lagha, M.; Bouzidi, F.; Charef, W. Forest Fire Detection Using IoT Enabled UAV and Computer Vision. In Proceedings of the 2022 International Conference on Artificial Intelligence of Things (ICAIoT), Istanbul, Turkey, 29–30 December 2022. [Google Scholar]
  11. Li, Y.; Rong, L.; Li, R.; Xu, Y. Fire Object Detection Algorithm Based on Improved YOLOv3-Tiny. In Proceedings of the 2022 7th International Conference on Cloud Computing and Big Data Analytics (ICCCBDA), Chengdu, China, 22–24 April 2022; pp. 264–269. [Google Scholar] [CrossRef]
  12. Huang, T.S.; Schreiber, W.F.; Tretiak, O.J. Image Processing. Proc. IEEE 1971, 59, 1586–1609. [Google Scholar] [CrossRef]
  13. Kaul, V.; Enslin, S.; Gross, S.A. History of Artificial Intelligence in Medicine. Gastrointest. Endosc. 2020, 92, 807–812. [Google Scholar] [CrossRef]
  14. Shinde, P.P.; Shah, S. A Review of Machine Learning and Deep Learning Applications. In Proceedings of the 2018 Fourth International Conference on Computing Communication Control and Automation (ICCUBEA), Pune, India, 16–18 August 2018. [Google Scholar] [CrossRef]
  15. Open Wildfire Smoke Datasets. Available online: https://github.com/aiformankind/wildfire-smoke-dataset (accessed on 11 September 2023).
  16. Wildfire Dataset Download Link. November 2024. Available online: https://zenodo.org/records/14218779 (accessed on 25 November 2024).
  17. Zhou, X.; Koltun, V.; Krähenbühl, P. Probabilistic Two-Stage Detection. arXiv 2021, arXiv:2103.07461. [Google Scholar]
  18. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar] [CrossRef]
  19. Jiang, P.; Ergu, D.; Liu, F.; Cai, Y.; Ma, B. A Review of YOLO Algorithm Developments. Procedia Comput. Sci. 2022, 199, 1066–1073. [Google Scholar] [CrossRef]
  20. Terven, J.; Córdova-Esparza, D.-M.; Romero-González, J.-A. A Comprehensive Review of YOLO Architectures in Computer Vision: From YOLOv1 to YOLOv8 and YOLO-NAS. Mach. Learn. Knowl. Extr. 2023, 5, 1680–1716. [Google Scholar] [CrossRef]
  21. Neubeck, A.; Van Gool, L. Efficient Non-Maximum Suppression. In Proceedings of the 18th International Conference on Pattern Recognition (ICPR’06), Hong Kong, China, 20–24 August 2006; Volume 3, pp. 850–855. [Google Scholar] [CrossRef]
  22. Girshick, R. Fast R-CNN. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar]
  23. Redmon, J.; Farhadi, A. YOLO9000: Better, Faster, Stronger. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 6517–6525. [Google Scholar] [CrossRef]
  24. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single Shot MultiBox Detector. Lect. Notes Comput. Sci. 2016, 9905, 21–37. [Google Scholar] [CrossRef]
  25. Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y.M. YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv 2020, arXiv:2004.10934. [Google Scholar]
  26. Misra, D. Mish: A Self Regularized Non-Monotonic Activation Function. arXiv 2019, arXiv:1908.08681. [Google Scholar]
  27. Hussain, M. YOLO-v1 to YOLO-v8, the Rise of YOLO and Its Complementary Nature Toward Digital Manufacturing and Industrial Defect Detection. Machines 2023, 11, 677. [Google Scholar] [CrossRef]
  28. Jocher, G.; Chaurasia, A.; Stoken, A.; Borovec, J.; NanoCode012; Kwon, Y.; Michael, K.; TaoXie; Fang, J.; Imyhxy; et al. ultralytics/yolov5: v7.0 - YOLOv5 SOTA Realtime Instance Segmentation; Zenodo: Geneva, Switzerland, 2022. [Google Scholar] [CrossRef]
  29. Nepal, U.; Eslamiat, H. Comparing YOLOv3, YOLOv4 and YOLOv5 for Autonomous Landing Spot Detection in Faulty UAVs. Sensors 2022, 22, 464. [Google Scholar] [CrossRef] [PubMed]
  30. Li, C.; Li, L.; Jiang, H.; Weng, K.; Geng, Y.; Li, L.; Ke, Z.; Li, Q.; Cheng, M.; Nie, W.; et al. YOLOv6: A Single-Stage Object Detection Framework for Industrial Applications. arXiv 2022, arXiv:2209.02976. [Google Scholar]
  31. Wang, C.-Y.; Bochkovskiy, A.; Liao, H.-Y.M. YOLOv7: Trainable Bag-of-Freebies Sets New State-of-the-Art for Real-Time Object Detectors. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 7464–7475. [Google Scholar] [CrossRef]
  32. Hussain, M.; Al-Aqrabi, H.; Munawar, M.; Hill, R.; Alsboui, T. Domain Feature Mapping with YOLOv7 for Automated Edge-Based Pallet Racking Inspections. Sensors 2022, 22, 6927. [Google Scholar] [CrossRef]
  33. Rangari, A.P.; Chouthmol, A.R.; Kadadas, C.; Pal, P.; Singh, S.K. Deep Learning Based Smart Traffic Light System Using Image Processing with YOLOv7. In Proceedings of the 2022 4th International Conference on Circuits, Control, Communication and Computing (I4C), Bangalore, India, 21–23 December 2022; pp. 129–132. [Google Scholar] [CrossRef]
  34. Gillani, I.S.; Munawar, M.R.; Talha, M.; Azhar, S.; Mashkoor, Y.; Uddin, M.S.; Zafar, U. Yolov5, Yolo-x, Yolo-r, Yolov7 Performance Comparison: A Survey. Comput. Sci. Inf. Technol. (CS & IT) 2022, 12, 17. [Google Scholar] [CrossRef]
  35. Bist, R.B.; Subedi, S.; Yang, X.; Chai, L. A Novel YOLOv6 Object Detector for Monitoring Piling Behavior of Cage-Free Laying Hens. AgriEngineering 2023, 5, 905–923. [Google Scholar] [CrossRef]
  36. Kawade, V.; Naikwade, V.; Bora, V.; Chhabria, S. A Comparative Analysis of Deep Learning Models and Conventional Approaches for Osteoporosis Detection in Hip X-Ray Images. In Proceedings of the 2023 World Conference on Communication & Computing (WCONF), Raipur, India, 14–16 July 2023; pp. 1–7. [Google Scholar]
  37. Xiao, B.; Nguyen, M.; Yan, W.Q. Fruit Ripeness Identification Using YOLOv8 Model. Multimed. Tools Appl. 2023, 83, 28039–28056. [Google Scholar] [CrossRef]
  38. Passa, R.S.; Nurmaini, S.; Rini, D.P. YOLOv8 Based on Data Augmentation for MRI Brain Tumor Detection. Sci. J. Inform. 2023, 10, 363–370. [Google Scholar] [CrossRef]
  39. Reis, D.; Kupec, J.; Hong, J.; Daoudi, A. Real-Time Flying Object Detection with YOLOv8. arXiv 2023, arXiv:2305.09972v1. [Google Scholar]
  40. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar] [CrossRef]
  41. Bankar, A.; Shinde, R.; Bhingarkar, S. Impact of Image Translation Using Generative Adversarial Networks for Smoke Detection. In Proceedings of the 2021 International Conference on Computational Performance Evaluation (ComPE), Shillong, India, 1–3 December 2021; pp. 246–255. [Google Scholar] [CrossRef]
  42. Al-Smadi, Y.; Alauthman, M.; Al-Qerem, A.; Aldweesh, A.; Quaddoura, R.; Aburub, F.; Mansour, K.; Alhmiedat, T. Early Wildfire Smoke Detection Using Different YOLO Models. Machines 2023, 11, 246. [Google Scholar] [CrossRef]
  43. Redmon, J.; Farhadi, A. YOLOv3: An Incremental Improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar]
  44. Zhang, M.; Tian, X. Transformer Architecture Based on Mutual Attention for Image-Anomaly Detection. Virtual Real. Intell. Hardw. 2023, 5, 57–67. [Google Scholar] [CrossRef]
  45. Zheng, H.; Duan, J.; Dong, Y.; Liu, Y. Real-time fire detection algorithms running on small embedded devices based on MobileNetV3 and YOLOv4. Fire Ecol. 2023, 19, 31. [Google Scholar] [CrossRef]
Figure 1. Structure of Yolov5.
Figure 2. Structure of the model.
Figure 3. Proposed model.
Figure 4. Skip connection structure.
Figure 5. Model heatmap metrics.
Figure 6. PR curves of proposed model.
Figure 7. Violin plot of metric distributions.
Figure 8. Performance comparison of models: (a) original image; (b) Yolov5-x; (c) proposed model.
Table 1. Related Work.
Work Details | Algorithms | Limitations
Ul Ain Tahir et al. [7] performed fire detection on images from the FireNet and FLAME datasets, labeled with two classes: fire and non-fire. In total, 457 images were used to train the Yolov5 algorithm, and the model was tested with 65 images, yielding an F1 score of 94.44%. | Yolov5 | Low generalization capacity; only one model
Zhang et al. [8] detected forest fires using the FLAME dataset of 2003 images, with 85% of the images used for training and 15% for testing. For object detection, the backbone of Faster R-CNN was replaced with ResNet50. Training on this data reached an mAP value of 85.2%. | Faster R-CNN | Low generalization capacity; only one hybrid model
Lin et al. [9] created a dataset from images collected by their university and from the internet, reserving 90% for training and 10% for testing. The backbone of Mask R-CNN was replaced with the Swin Transformer, and PAFPN was used instead of Mask R-CNN's FPN structure. With these changes, an AP value of 89.4% was reached. | Mask R-CNN | Single backbone and algorithm in the model
Choutri et al. [10] used 2906 manually tagged images from the FLAME dataset, classified as fire and non-fire. Yolov4 was trained for 30 epochs in 14 h and reached an AP of 86%; Yolov2 was trained for 60 epochs in 35 h and reached an AP of 85%; Faster R-CNN was trained for 100 epochs in 60 h and reached an AP of 66%; and SSD was trained for 90 epochs in 72 h and reached an AP of 42%, all on the training set. | Yolov4, Yolov2, Faster R-CNN, SSD | High training times across models; unbalanced AP values
Li et al. [11] created a dataset of 3395 images collected from open-source websites, reserving 80% for training and 20% for testing. The Yolov3-tiny model was modified for fire detection: a 52 × 52 output layer was added to the FPN network, an SE module was added behind the detection layers, and the first four max-pooling layers of the backbone were replaced with stride-2 3 × 3 convolution layers. After training on the dataset, the modified Yolov3-tiny reached an mAP of 80.8%, versus 62.1%, 76.5%, 74.9%, and 60.1% for the original SSD, Yolov3, Yolov3-tiny, and Yolov4 models, respectively. | SSD, Yolov3, Yolov3-tiny, Yolov4-tiny | Current algorithms with different models are not included
Table 2. Model training results.
Models | Backbone | Precision | Recall | F1 | mAP
Yolov5-n | Yolov5 | 0.957 | 0.946 | 0.951 | 0.973
Yolov5-n | EfficientNet_b1 | 0.920 | 0.914 | 0.916 | 0.953
Yolov5-n | MobileNetV3s | 0.926 | 0.931 | 0.928 | 0.959
Yolov5-n | ResNet34 | 0.929 | 0.942 | 0.935 | 0.958
Yolov5-n | ResNet50 | 0.980 | 0.964 | 0.971 | 0.986
Yolov5-s | Yolov5 | 0.950 | 0.940 | 0.944 | 0.956
Yolov5-s | EfficientNet_b1 | 0.907 | 0.894 | 0.900 | 0.946
Yolov5-s | MobileNetV3s | 0.965 | 0.931 | 0.947 | 0.961
Yolov5-s | ResNet34 | 0.877 | 0.836 | 0.856 | 0.908
Yolov5-s | ResNet50 | 0.963 | 0.960 | 0.961 | 0.976
Yolov5-m | Yolov5 | 0.962 | 0.958 | 0.959 | 0.988
Yolov5-m | EfficientNet_b1 | 0.814 | 0.791 | 0.802 | 0.905
Yolov5-m | MobileNetV3s | 0.982 | 0.905 | 0.941 | 0.969
Yolov5-m | ResNet34 | 0.911 | 0.894 | 0.902 | 0.932
Yolov5-m | ResNet50 | 0.941 | 0.973 | 0.956 | 0.971
Yolov5-l | Yolov5 | 0.967 | 0.980 | 0.973 | 0.986
Yolov5-l | EfficientNet_b1 | 0.786 | 0.834 | 0.809 | 0.878
Yolov5-l | MobileNetV3s | 0.956 | 0.932 | 0.947 | 0.974
Yolov5-l | ResNet34 | 0.915 | 0.820 | 0.864 | 0.941
Yolov5-l | ResNet50 | 0.977 | 0.950 | 0.963 | 0.972
Yolov5-x | Yolov5 | 0.971 | 0.932 | 0.951 | 0.982
Yolov5-x | EfficientNet_b1 | 0.944 | 0.918 | 0.930 | 0.957
Yolov5-x | MobileNetV3s | 0.964 | 0.942 | 0.952 | 0.963
Yolov5-x | ResNet34 | 0.923 | 0.931 | 0.926 | 0.963
Yolov8-n | Yolov8 | 0.908 | 0.918 | 0.912 | 0.950
Yolov8-s | Yolov8 | 0.940 | 0.940 | 0.940 | 0.959
Yolov8-m | Yolov8 | 0.914 | 0.953 | 0.933 | 0.965
Yolov8-l | Yolov8 | 0.882 | 0.870 | 0.875 | 0.917
Yolov8-x | Yolov8 | 0.888 | 0.831 | 0.858 | 0.928
Proposed Model | - | 0.988 | 0.967 | 0.977 | 0.973
Table 3. Model test results.
Models | Backbone | Precision | Recall | F1 | mAP
Yolov5-n | Yolov5 | 0.955 | 0.953 | 0.953 | 0.947
Yolov5-n | EfficientNet_b1 | 0.940 | 0.865 | 0.900 | 0.934
Yolov5-n | MobileNetV3s | 0.917 | 0.879 | 0.897 | 0.924
Yolov5-n | ResNet34 | 0.844 | 0.881 | 0.862 | 0.913
Yolov5-n | ResNet50 | 0.945 | 0.959 | 0.951 | 0.960
Yolov5-s | Yolov5 | 0.895 | 0.906 | 0.900 | 0.910
Yolov5-s | EfficientNet_b1 | 0.838 | 0.873 | 0.855 | 0.930
Yolov5-s | MobileNetV3s | 0.945 | 0.920 | 0.932 | 0.942
Yolov5-s | ResNet34 | 0.784 | 0.840 | 0.811 | 0.878
Yolov5-s | ResNet50 | 0.919 | 0.959 | 0.938 | 0.965
Yolov5-m | Yolov5 | 0.926 | 0.953 | 0.939 | 0.960
Yolov5-m | EfficientNet_b1 | 0.699 | 0.785 | 0.739 | 0.820
Yolov5-m | MobileNetV3s | 0.913 | 0.939 | 0.925 | 0.939
Yolov5-m | ResNet34 | 0.817 | 0.897 | 0.855 | 0.907
Yolov5-m | ResNet50 | 0.939 | 0.967 | 0.952 | 0.956
Yolov5-l | Yolov5 | 0.939 | 0.940 | 0.939 | 0.945
Yolov5-l | EfficientNet_b1 | 0.611 | 0.852 | 0.711 | 0.797
Yolov5-l | MobileNetV3s | 0.940 | 0.959 | 0.949 | 0.956
Yolov5-l | ResNet34 | 0.791 | 0.884 | 0.834 | 0.905
Yolov5-l | ResNet50 | 0.937 | 0.928 | 0.932 | 0.960
Yolov5-x | Yolov5 | 0.895 | 0.940 | 0.916 | 0.945
Yolov5-x | EfficientNet_b1 | 0.845 | 0.929 | 0.884 | 0.936
Yolov5-x | MobileNetV3s | 0.950 | 0.944 | 0.946 | 0.960
Yolov5-x | ResNet34 | 0.864 | 0.919 | 0.890 | 0.922
Yolov8-n | Yolov8 | 0.884 | 0.918 | 0.900 | 0.942
Yolov8-s | Yolov8 | 0.856 | 0.927 | 0.890 | 0.926
Yolov8-m | Yolov8 | 0.825 | 0.908 | 0.864 | 0.897
Yolov8-l | Yolov8 | 0.821 | 0.850 | 0.835 | 0.865
Yolov8-x | Yolov8 | 0.844 | 0.824 | 0.833 | 0.856
Proposed Model | - | 0.954 | 0.964 | 0.958 | 0.969
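
As a quick consistency check, the F1 columns in Tables 2 and 3 are the harmonic mean of the reported precision and recall. A minimal Python sketch using the proposed model's rows above (any last-digit deviation is a rounding effect, since precision and recall are themselves rounded to three decimals):

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Proposed model: training row (Table 2) and test row (Table 3).
print(f"train F1 = {f1_score(0.988, 0.967):.3f}")  # 0.977, matches Table 2
print(f"test  F1 = {f1_score(0.954, 0.964):.3f}")  # 0.959 vs. 0.958 reported (rounding)
```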
Table 4. Model training time and model size.
Models | Backbone | Training Time (h) | Model Size (MB)
Yolov5-n | Yolov5 | 0.21 | 3.9
Yolov5-n | EfficientNet_b1 | 0.69 | 13.8
Yolov5-n | MobileNetV3s | 0.22 | 4.7
Yolov5-n | ResNet34 | 0.51 | 45.3
Yolov5-n | ResNet50 | 0.79 | 55.6
Yolov5-s | Yolov5 | 0.30 | 14.4
Yolov5-s | EfficientNet_b1 | 0.73 | 19.5
Yolov5-s | MobileNetV3s | 0.25 | 9.7
Yolov5-s | ResNet34 | 0.53 | 50.3
Yolov5-s | ResNet50 | 0.85 | 62.5
Yolov5-m | Yolov5 | 0.58 | 42.2
Yolov5-m | EfficientNet_b1 | 0.88 | 32.8
Yolov5-m | MobileNetV3s | 0.31 | 22.3
Yolov5-m | ResNet34 | 0.62 | 62.9
Yolov5-m | ResNet50 | 0.95 | 76.8
Yolov5-l | Yolov5 | 0.97 | 92.9
Yolov5-l | EfficientNet_b1 | 0.95 | 56.2
Yolov5-l | MobileNetV3s | 0.42 | 44.9
Yolov5-l | ResNet34 | 0.73 | 85.5
Yolov5-l | ResNet50 | 1.06 | 101.3
Yolov5-x | Yolov5 | 1.58 | 173.1
Yolov5-x | EfficientNet_b1 | 1.12 | 92.7
Yolov5-x | MobileNetV3s | 0.63 | 80.7
Yolov5-x | ResNet34 | 0.90 | 121.3
Yolov8-n | Yolov8 | 0.22 | 6.2
Yolov8-s | Yolov8 | 0.36 | 22.5
Yolov8-m | Yolov8 | 0.72 | 52.0
Yolov8-l | Yolov8 | 1.08 | 87.6
Yolov8-x | Yolov8 | 1.75 | 136.7
Proposed Model | - | 1.22 | 138.9
Table 5. Layers, parameters, and GFLOPs for the models.
Models | Backbone | Layers | Parameters (M) | GFLOPs
Yolov5-n | Yolov5 | 157 | 1.7 | 4.1
Yolov5-n | EfficientNet_b1 | 513 | 6.5 | 11.7
Yolov5-n | MobileNetV3s | 287 | 2.1 | 3.1
Yolov5-n | ResNet34 | 197 | 22.4 | 61.9
Yolov5-n | ResNet50 | 232 | 27.5 | 72.1
Yolov5-s | Yolov5 | 157 | 7.0 | 15.8
Yolov5-s | EfficientNet_b1 | 513 | 9.4 | 16.2
Yolov5-s | MobileNetV3s | 287 | 4.6 | 7.2
Yolov5-s | ResNet34 | 197 | 24.9 | 66.2
Yolov5-s | ResNet50 | 232 | 30.9 | 77.7
Yolov5-m | Yolov5 | 212 | 20.8 | 47.9
Yolov5-m | EfficientNet_b1 | 533 | 16.0 | 27.9
Yolov5-m | MobileNetV3s | 307 | 10.8 | 18.7
Yolov5-m | ResNet34 | 217 | 31.2 | 77.8
Yolov5-m | ResNet50 | 252 | 38.1 | 90.5
Yolov5-l | Yolov5 | 267 | 46.1 | 107.7
Yolov5-l | EfficientNet_b1 | 553 | 27.7 | 49.6
Yolov5-l | MobileNetV3s | 327 | 22.1 | 40.0
Yolov5-l | ResNet34 | 237 | 42.5 | 99.3
Yolov5-l | ResNet50 | 272 | 50.3 | 113.3
Yolov5-x | Yolov5 | 322 | 86.1 | 203.8
Yolov5-x | EfficientNet_b1 | 573 | 45.9 | 84.4
Yolov5-x | MobileNetV3s | 347 | 40.0 | 74.5
Yolov5-x | ResNet34 | 257 | 60.3 | 133.8
Yolov8-n | Yolov8 | 168 | 3.0 | 8.7
Yolov8-s | Yolov8 | 168 | 11.1 | 28.6
Yolov8-m | Yolov8 | 218 | 25.8 | 78.9
Yolov8-l | Yolov8 | 268 | 43.6 | 165.2
Yolov8-x | Yolov8 | 268 | 68.1 | 257.8
Proposed Model | - | 292 | 69.1 | 149.1
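
The layer, parameter, and GFLOP counts above come from per-model build summaries. As a minimal sketch of how such numbers can be reproduced for a backbone in PyTorch, assuming the thop package as the FLOP counter and a 640 × 640 input (both assumptions; thop reports multiply-accumulates, so FLOPs are roughly twice the returned value, and counts scale with input size):

```python
import torch
from torchvision.models import resnet50
from thop import profile  # pip install thop

# Profile a ResNet50 backbone alone at a detector-style input resolution.
model = resnet50()
dummy = torch.randn(1, 3, 640, 640)
macs, params = profile(model, inputs=(dummy,))

print(f"params: {params / 1e6:.1f} M")  # ~25.6 M for ResNet50 alone
print(f"GFLOPs: {2 * macs / 1e9:.1f}")  # FLOPs ~ 2 x MACs, backbone only
```

The roughly 25.6 M parameters of ResNet50 alone are consistent with the 27.5 M reported above for the Yolov5-n + ResNet50 hybrid, with the lightweight detection layers contributing the remainder.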
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
