Article

An Efficient Forest Smoke Detection Approach Using Convolutional Neural Networks and Attention Mechanisms

by Quy-Quyen Hoang, Quy-Lam Hoang and Hoon Oh *
Department of Electrical, Electronic and Computer Engineering, University of Ulsan, Ulsan 44610, Republic of Korea
* Author to whom correspondence should be addressed.
J. Imaging 2025, 11(2), 67; https://doi.org/10.3390/jimaging11020067
Submission received: 9 January 2025 / Revised: 7 February 2025 / Accepted: 9 February 2025 / Published: 19 February 2025

Abstract:
This study explores a method of effectively detecting smoke plumes as the early sign of a forest fire. Convolutional neural networks (CNNs) have been widely used for forest fire detection; however, they have not been customized or optimized for the characteristics of smoke. This paper proposes a CNN-based forest smoke detection model featuring a novel backbone architecture that increases detection accuracy and reduces computational load. Since the proposed backbone detects the plume of smoke through different views using kernels of varying sizes, it can better detect smoke plumes of different sizes. By decomposing the traditional square-kernel convolution into depth-wise convolutions of coordinate kernels, it can not only better extract the features of smoke plumes spreading along the vertical dimension but also reduce the computational load. An attention mechanism was applied to allow the model to focus on important information while suppressing less relevant information. The experimental results show that our model outperforms other popular models, achieving a detection accuracy of up to 52.9 average precision (AP) while significantly reducing the number of parameters and giga floating-point operations (GFLOPs).

1. Introduction

Forest fires often cause enormous damage to human life and the environment [1]. The main reason for this great damage is that forest fires spread quickly before they are detected, making them difficult to extinguish. This paper considers the evolution of existing vision-based models to effectively detect forest smoke, the early sign of a forest fire. The proposed model is designed based on convolutional neural networks (CNNs) and an attention mechanism, focusing on increasing the accuracy and reducing the computational complexity of smoke plume detection.
According to the survey papers by Chowdary and Gupta [2], Alkhatib [3], and Barmpoutis et al. [4], many forest fire detection methods have been proposed. Early methods relied on fire lookout towers and tools like the Osborne Firefinder [5]; however, they were not effective because they required continuous human intervention and were prone to human error. Some methods used sensors that can detect the signs of a fire outbreak, such as increased temperature, smoke, flames, or a lack of oxygen; they faced the challenge of reliably collecting data from sensors deployed across vast forested areas [6]. They also suffered from delayed fire detection because the fire alarm did not sound until the fire detection parameter values reached a preset threshold.
Recently, the direction of research has been shifting toward a vision-based approach that relies on artificial intelligence [7]. Existing vision-based approaches can be broadly divided into the following two categories: the image processing approach and the CNN-based approach. The former relies on image processing techniques to explore fire and smoke characteristics such as color, shape, and motion. Chen et al. [8], Vipin [9], and Yuan et al. [10] used RGB, YCbCr, and Lab color models, respectively, to extract fire and smoke pixels. Zhang et al. [11] used wavelet and fast Fourier transform methods to analyze the contours of the fire area in videos. Foggia et al. [12] combined the properties of color, shape, and motion using a multi-expert framework to increase detection accuracy. One recent approach utilized background subtraction and color segmentation to detect regions containing motion [13]. Since these approaches do not use high computational power, they may be suitable for devices with limited computational power, such as drones or surveillance cameras. However, to achieve a reasonable level of accuracy, they require careful image pre-processing steps and may need the use of different feature extraction algorithms for forest fire images in different situations.
In contrast, CNN-based approaches use deep learning techniques to automatically extract features from different images. Wang et al. [14] proposed a lightweight forest fire detection model by replacing the backbone network of YOLOv4 [15] with MobileNetv3 [16]. YOLOv4 is a popular object detection model known for its accuracy and speed, while MobileNetv3 is a lightweight convolutional neural network (CNN) designed to reduce computational load, making it suitable for resource-constrained devices. This method significantly reduces the computational load but comes with a trade-off in detection accuracy. Jiao et al. [17] used YOLOv3 [18] to detect forest fires using an unmanned aerial vehicle (UAV) that could capture high-resolution videos and images; however, it did not work well for small smoke plumes or fires. Another approach, developed by Zhang et al. [19], tried to detect forest smoke using Faster R-CNN [20]; although it improved the accuracy to some extent, the dataset used in the experiment lacked diversity in its forest fire images. Vani [21] employed Inceptionv3 [22], trained on satellite images, for forest fire detection. The problem with this satellite-based approach was that it could only capture large-scale fire images after the fire had spread over a large area. Furthermore, since Inceptionv3 only returned a fire or non-fire decision without boxing the fires, it required an extra step to determine the regions of the fires, which would take time and effort. One recent approach, introduced by Meena et al. [23], used R-CNN [24] for forest fire detection, but the high computational complexity of this model hindered its portability to monitoring devices. In summary, the existing approaches have made limited improvements in detection accuracy because they use popular models such as the YOLO series and Faster R-CNN as they are, and they often require a high computational load.
Based on the discussion so far, this paper introduces a forest smoke detection model featuring a new backbone architecture that is customized to increase the accuracy of smoke detection and reduce computational load. The proposed backbone is designed to effectively extract smoke features. By extracting object features through different views using kernels of varying sizes, it can better detect smoke plumes of different sizes. Furthermore, by using the depth-wise convolution of coordinate kernels, it can not only better extract the features of smoke plumes spreading along the vertical dimension but also reduce the computational load. Finally, by using an attention mechanism, it can focus on the important features of an image. As a result, the proposed model could achieve up to 52.9 average precision (AP), which far exceeds the accuracy of other models such as RetinaNet [25], YOLO [26,27,28], Faster-RCNN [20], and SSD [29], while significantly reducing the number of parameters and GFLOPs.
The rest of the paper is organized as follows: Section 2 presents the background; Section 3 describes the model architecture in detail; and Section 4 analyzes the experimental results and is followed by the conclusion in Section 5.

2. Background

2.1. Overview of Forest Fire Detection Model

The forest fire detection model consists of the following three modules: Backbone, Neck, and Head, as shown in Figure 1. The Backbone module has four stages labeled S1, S2, S3, and S4, each of which generates one feature map from the feature map of the stage below it, while S1 generates a feature map from the input image. Early stages tend to capture low-level information such as edges and corners, while later stages tend to capture higher-level or more specific information.
The Neck and Head modules were defined by Lin et al. [25]. Neck has five levels labeled P1, P2, P3, P4, and P5, each of which holds one feature map. The level feature map of P3 is built by applying convolutions to the stage feature map of S4; the level feature map of P2 is created by up-sampling that of P3 and adding it to the stage feature map of S3; and the level feature map of P1 is created similarly. Two more level feature maps, those of P4 and P5, are constructed by down-sampling those of P3 and P4, respectively, to obtain more abundant features. In this way, using a multi-level pyramidal network [30], Neck can not only balance the information across multiple stages but also help the model detect objects of different scales. Head consists of the following two primary components: object classification and bounding box regression. The object classification component predicts the class to which an object belongs, assigning a probability score to each class. The bounding box regression component, on the other hand, estimates the coordinates of the bounding box that encloses the detected object. These two components work together to increase the accuracy of object identification within an image.

2.2. Motivation and Our Approach

Backbone plays an important role in determining the accuracy of object detection, as it creates a feature map of the object. However, many convolutional layers may be involved, resulting in significant computational load.
Recent forest fire detection models have adopted well-known backbones designed for the ImageNet dataset [31]. Unfortunately, ImageNet does not contain smoke or fire classes, which means that those backbones were not optimized for forest fire or smoke detection. In addition, ImageNet is a large dataset with over one million images and one thousand classes, so researchers have kept improving backbones with more layers and/or larger kernels to extract more information from it, which requires more computational load.
This paper presents a new forest fire detection model to optimize smoke detection in terms of accuracy and computational load. The design of our model is fundamentally based on two principles. First, comparing the two convolution processes shown in Figure 2a,b, using a larger size kernel allows for the faster generation of feature maps but generates more parameters. Therefore, it may be advantageous to use multiple small-sized kernels to extract one feature element from the same receptive field. Second, to more effectively extract features of smoke plumes spreading along the vertical dimension, as shown in Figure 3, it may be desirable to decompose the conventional convolution with square kernels into the depth-wise convolution of coordinate kernels. This decomposition also contributes to reducing the number of parameters.
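As a concrete illustration of the second principle, the following PyTorch sketch (with illustrative channel and kernel sizes, not values taken from the paper) compares the parameter count of a standard n × n depth-wise convolution with the 1 × n followed by n × 1 decomposition:

```python
import torch
import torch.nn as nn

C, n = 64, 5  # example channel count and kernel size (illustrative values)

# Standard depth-wise convolution with a square n x n kernel: C * n^2 weights.
square_dw = nn.Conv2d(C, C, kernel_size=n, padding=n // 2, groups=C, bias=False)

# Decomposed "coordinate" depth-wise convolutions: 1 x n followed by n x 1,
# covering the same receptive field with only C * 2n weights.
coord_dw = nn.Sequential(
    nn.Conv2d(C, C, kernel_size=(1, n), padding=(0, n // 2), groups=C, bias=False),
    nn.Conv2d(C, C, kernel_size=(n, 1), padding=(n // 2, 0), groups=C, bias=False),
)

x = torch.randn(1, C, 56, 56)
assert square_dw(x).shape == coord_dw(x).shape            # same output size
print(sum(p.numel() for p in square_dw.parameters()))     # 64 * 25 = 1600
print(sum(p.numel() for p in coord_dw.parameters()))      # 64 * (5 + 5) = 640
```

For n = 5 and 64 channels, the decomposition cuts the kernel weights from 1600 to 640 while preserving the receptive field.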
Additionally, our model extracts the features of objects through different views using kernels of different sizes to better detect smoke plumes of different scales. Our model also uses an attention mechanism that focuses on the features of specific objects (smoke) in the image while suppressing irrelevant features.

3. Proposed Model

3.1. Backbone

The proposed backbone is structured as shown in Figure 4a. The proposed model comprehensively extracts the features of the input data by traversing a 4-stage hierarchy, where each stage consists of one or more residual blocks, with one attention block added to the residual block output of stages 3 and 4. The two design principles of the proposed backbone are to effectively extract forest smoke features to increase smoke detection accuracy and to reduce computational load. The following explains how these design principles are reflected in the structure of the proposed backbone.

3.1.1. Stem Block

The stem block illustrated in Figure 4b is utilized to quickly reduce the spatial dimensions of the input image without losing feature information. It uses three 3 × 3 kernels with strides of 2, 1, and 1 to reduce the number of parameters, whereas existing stem blocks typically rely on a single large kernel; three small kernels can extract the same level of information. As in other models, Batch Normalization (BN) and the Rectified Linear Unit (ReLU) are applied to the output of each convolutional layer to increase the learning speed. Note that our stem block does not use the Sigmoid Linear Unit (SiLU) or the Gaussian Error Linear Unit (GELU) functions, which consume more computational resources than simpler alternatives such as ReLU. At the end of the stem block, one 3 × 3 max pooling is applied to further reduce the size of the feature map. In practice, the input image passes through these layers with stride 2 applied twice, so each spatial dimension of the feature map is reduced by a factor of four.
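A minimal PyTorch sketch of such a stem block is given below; the channel widths (32, 32, 64) and the stride of 2 for the max pooling are illustrative assumptions, since the paper does not list them here:

```python
import torch.nn as nn

def conv_bn_relu(c_in, c_out, stride):
    """3 x 3 convolution followed by Batch Normalization and ReLU."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=3, stride=stride, padding=1, bias=False),
        nn.BatchNorm2d(c_out),
        nn.ReLU(inplace=True),
    )

class StemBlock(nn.Module):
    """Three 3 x 3 convolutions with strides 2, 1, 1, followed by 3 x 3 max pooling.
    Channel widths (32, 32, 64) are illustrative assumptions."""
    def __init__(self, in_ch=3):
        super().__init__()
        self.convs = nn.Sequential(
            conv_bn_relu(in_ch, 32, stride=2),
            conv_bn_relu(32, 32, stride=1),
            conv_bn_relu(32, 64, stride=1),
        )
        self.pool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)  # halves H and W again

    def forward(self, x):
        return self.pool(self.convs(x))  # overall 4x spatial reduction
```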

3.1.2. Transition Block

The transition block illustrated in Figure 4c is used to shrink the feature map between two adjacent stages. A Conv 1 × 1 is utilized to double the number of channels, followed by 3 × 3 max pooling to halve the spatial dimensions. This shrinks the feature map without losing information while limiting the number of required parameters.
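A possible PyTorch rendering of the transition block, assuming a stride of 2 for the max pooling, is:

```python
import torch.nn as nn

class TransitionBlock(nn.Module):
    """1 x 1 convolution doubling the channels, then 3 x 3 max pooling halving H and W
    (the pooling stride of 2 is an assumption)."""
    def __init__(self, in_ch):
        super().__init__()
        self.expand = nn.Conv2d(in_ch, 2 * in_ch, kernel_size=1, bias=False)
        self.pool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)

    def forward(self, x):
        return self.pool(self.expand(x))
```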

3.1.3. Residual Block

The residual block illustrated in Figure 4d is designed to better extract the features of forest smoke. The feature map from the previous layer is split into four smaller feature maps along the channel dimension, and each is processed by a different convolutional branch.
The top two branches use sequential 1 × n and n × 1 depth-wise convolutions (DWconvs) instead of an n × n kernel to reduce the number of parameters (n is 3 or 5 in the figure), while the third branch uses a 1 × 1 depth-wise convolution. This factorization reduces the number of parameters per channel from n² to 2n while maintaining the same receptive field. The n × 1 convolutions also help the model better capture vertically distributed features such as smoke. The third branch enhances feature extraction from small smoke plumes by using a DWconv 1 × 1 with a small kernel. The last branch sequentially applies one 3 × 3 max pooling and one DWconv 1 × 1. By taking the maximum value within each pooling region, max pooling retains the most important features while discarding less important or noisy ones. The DWconv 1 × 1 applied to the output of the max pooling layer helps perform channel mixing, which can improve the accuracy of the model.
The outputs of the four branches are concatenated along the channel dimension to produce a fine-grained feature map, which is fed serially into two point-wise convolutions (PWconv 1 × 1s) to mix information along the channel dimension. The first PWconv expands the channel dimension by a scaling factor of four, and the ReLU activation function between the two PWconvs reinforces nonlinearity in this larger space. The original feature map delivered via the residual branch is added to the resulting feature map to mitigate the vanishing gradient problem [32].
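The following PyTorch sketch summarizes the residual block as described above; an equal four-way channel split and a four-fold expansion between the two point-wise convolutions are assumptions based on this description:

```python
import torch
import torch.nn as nn

def coord_dw(ch, n):
    """Depth-wise 1 x n followed by n x 1 convolution (same receptive field as n x n)."""
    return nn.Sequential(
        nn.Conv2d(ch, ch, (1, n), padding=(0, n // 2), groups=ch, bias=False),
        nn.Conv2d(ch, ch, (n, 1), padding=(n // 2, 0), groups=ch, bias=False),
    )

class ResidualBlock(nn.Module):
    """Sketch of the four-branch residual block (Figure 4d). The equal channel split and
    the 4x expansion ratio are assumptions, not values stated explicitly in the paper."""
    def __init__(self, ch):
        super().__init__()
        assert ch % 4 == 0
        g = ch // 4
        self.b1 = coord_dw(g, 3)                              # 1x3 then 3x1 depth-wise
        self.b2 = coord_dw(g, 5)                              # 1x5 then 5x1 depth-wise
        self.b3 = nn.Conv2d(g, g, 1, groups=g, bias=False)    # 1x1 depth-wise
        self.b4 = nn.Sequential(                              # 3x3 max pooling + 1x1 depth-wise
            nn.MaxPool2d(3, stride=1, padding=1),
            nn.Conv2d(g, g, 1, groups=g, bias=False),
        )
        self.mix = nn.Sequential(                             # two point-wise convs mix channels
            nn.Conv2d(ch, 4 * ch, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(4 * ch, ch, 1, bias=False),
        )

    def forward(self, x):
        x1, x2, x3, x4 = torch.chunk(x, 4, dim=1)             # split along the channel dimension
        y = torch.cat([self.b1(x1), self.b2(x2), self.b3(x3), self.b4(x4)], dim=1)
        return x + self.mix(y)                                # residual connection
```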

3.1.4. Attention Block

An attention block is added only to the output of the last residual block in stages 3 and 4, as shown in Figure 4a, for computational efficiency, since the feature maps in stages 1 and 2 are large. The attention mechanism helps the model focus on important features of the image while suppressing irrelevant ones. Our backbone employs the Convolutional Block Attention Module (CBAM) [33], which consists of the following two components: the Channel Attention Module (CAM) shown in Figure 5b and the Spatial Attention Module (SAM) shown in Figure 5c. CAM allows the model to focus on the most relevant channels of a feature map, while SAM allows it to capture spatial dependencies.
Let $F$ and $\mathbb{R}^{C \times H \times W}$ represent the input feature map and the set of possible feature maps of the target object, respectively, such that $F \in \mathbb{R}^{C \times H \times W}$. The input feature map $F$ is processed by CAM to produce the channel attention weight $M_c(F)$, as detailed in Figure 5b. The refined feature map $F'$ is then obtained by performing element-wise multiplication between $M_c(F)$ and $F$ to redistribute the information in $F$ along the channel dimension, as follows:
$$F' = M_c(F) \otimes F.$$
Referring to Figure 5b, CAM uses average-pooling and max-pooling along the spatial dimensions to aggregate the spatial information, generating the average-pooled features $F^{c}_{avg}$ and the max-pooled features $F^{c}_{max}$, respectively. These two features are then passed to the Multilayer Perceptron (MLP) to generate two channel attention maps, $MLP(F^{c}_{avg})$ and $MLP(F^{c}_{max})$, which are merged using element-wise addition. Finally, the sigmoid function, denoted by $\sigma$, is applied to produce the channel attention weight $M_c(F)$ as follows:
$$M_c(F) = \sigma\left(MLP(F^{c}_{avg}) + MLP(F^{c}_{max})\right).$$
Referring to Figure 5c, the refined feature map $F'$ is then fed into the SAM module to generate the spatial attention weight $M_s(F')$. $M_s(F')$ is multiplied with $F'$ to refine the feature map along the spatial dimension, thereby producing the feature map $F''$ as follows:
$$F'' = M_s(F') \otimes F'.$$
SAM also uses both max-pooling and average-pooling, but along the channel dimension, generating two features $F^{s}_{avg}$ and $F^{s}_{max}$ that represent the aggregated channel information. These are then concatenated and mixed using a $7 \times 7$ convolution, $f^{7 \times 7}$, to produce a spatial attention map. Finally, the sigmoid function $\sigma$ is applied to produce the spatial attention weight $M_s(F')$ as follows:
$$M_s(F') = \sigma\left(f^{7 \times 7}\left([F^{s}_{avg}; F^{s}_{max}]\right)\right).$$
Note that our model takes into account both channel attention and spatial attention since feature maps have spatial and channel dimensions.
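The CBAM block used here can be sketched in PyTorch as follows; the MLP reduction ratio of 16 is the default from the CBAM paper [33] and is an assumption in this sketch:

```python
import torch
import torch.nn as nn

class CBAM(nn.Module):
    """Channel attention (CAM) followed by spatial attention (SAM), as in the
    expressions above. The reduction ratio of 16 is an assumed default."""
    def __init__(self, ch, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(                      # MLP applied to both pooled vectors
            nn.Conv2d(ch, ch // reduction, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(ch // reduction, ch, 1, bias=False),
        )
        self.spatial = nn.Conv2d(2, 1, kernel_size=7, padding=3, bias=False)

    def forward(self, F):
        # Channel attention: average- and max-pool over the spatial dimensions.
        Mc = torch.sigmoid(self.mlp(F.mean((2, 3), keepdim=True)) +
                           self.mlp(F.amax((2, 3), keepdim=True)))
        F1 = Mc * F                                    # F' = Mc(F) (x) F
        # Spatial attention: average- and max-pool over the channel dimension.
        avg = F1.mean(dim=1, keepdim=True)
        mx = F1.amax(dim=1, keepdim=True)
        Ms = torch.sigmoid(self.spatial(torch.cat([avg, mx], dim=1)))
        return Ms * F1                                 # F'' = Ms(F') (x) F'
```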

3.2. Neck and Head

The Neck module in Figure 6, consisting of five levels P1, ..., P5, uses a slightly modified version of the Feature Pyramid Network model [30] to better detect objects of different scales as well as to balance the information across multiple stages. The modifications are as follows. Initially, Conv 1 × 1 is applied to the feature maps from S2 to S4 in Backbone, producing new feature maps with 256 channels. The feature maps in P1, P2, and P3 are produced by applying Conv 3 × 3 to a new feature map or to the sum of new feature maps, where 2× denotes up-sampling by a factor of two. Note that stage S1 is not used. The module has two more levels, P4 and P5, whose feature maps are obtained by down-sampling the feature map at P3 by 1/2 and 1/4, respectively. This helps the model better detect larger objects.
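A simplified PyTorch sketch of this Neck is given below; the stage channel widths (128, 256, 512) and the use of stride-2 subsampling to produce P4 and P5 are assumptions:

```python
import torch.nn as nn
import torch.nn.functional as Fn

class Neck(nn.Module):
    """FPN-style neck: 1 x 1 lateral convolutions map S2-S4 to 256 channels, P1-P3 come
    from 3 x 3 convolutions on the (up-sampled and summed) lateral maps, and P4/P5 are
    down-sampled from P3."""
    def __init__(self, in_chs=(128, 256, 512), out_ch=256):   # S2-S4 widths are assumed
        super().__init__()
        self.lateral = nn.ModuleList([nn.Conv2d(c, out_ch, 1) for c in in_chs])
        self.smooth = nn.ModuleList([nn.Conv2d(out_ch, out_ch, 3, padding=1) for _ in in_chs])

    def forward(self, s2, s3, s4):
        l2, l3, l4 = [lat(x) for lat, x in zip(self.lateral, (s2, s3, s4))]
        t3 = l4
        t2 = l3 + Fn.interpolate(t3, scale_factor=2, mode="nearest")   # 2x up-sampling
        t1 = l2 + Fn.interpolate(t2, scale_factor=2, mode="nearest")
        p1, p2, p3 = [sm(t) for sm, t in zip(self.smooth, (t1, t2, t3))]
        p4 = Fn.max_pool2d(p3, kernel_size=1, stride=2)                # 1/2 of P3
        p5 = Fn.max_pool2d(p3, kernel_size=1, stride=4)                # 1/4 of P3
        return p1, p2, p3, p4, p5
```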
Referring to the Head module in Figure 7, object classification is represented by five Conv 3 × 3's and one class feature map denoted by A × W × H, and bounding box regression by five Conv 3 × 3's and one box feature map denoted by 4A × W × H, where 4 indicates the four relative offset values between an anchor and its ground-truth box. The anchors, inherited from RetinaNet [25], have various scales and aspect ratios to enable the model to effectively detect objects of different sizes and shapes. At each feature map location, a set of nine anchors is generated, consisting of three different scales and three aspect ratios (1:1, 2:1, 1:2). These nine anchors cover a scale range of 32 to 813 pixels with respect to the input image of the network. The anchors are applied across the different levels of the Feature Pyramid Network (FPN), allowing object detection at multiple resolutions and enabling the detection of both small and large objects.
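The Head can be sketched as follows, assuming four 3 × 3 convolutions plus one output convolution per branch (five in total), A = 9 anchors, and a single smoke class:

```python
import torch.nn as nn

class Head(nn.Module):
    """Classification branch producing an A x W x H map and a box-regression branch
    producing a 4A x W x H map, shared across all pyramid levels."""
    def __init__(self, ch=256, num_anchors=9, num_classes=1):
        super().__init__()
        def branch(out_ch):
            layers = []
            for _ in range(4):
                layers += [nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True)]
            layers.append(nn.Conv2d(ch, out_ch, 3, padding=1))   # fifth, output convolution
            return nn.Sequential(*layers)
        self.cls_branch = branch(num_anchors * num_classes)      # A x W x H
        self.box_branch = branch(num_anchors * 4)                # 4A x W x H

    def forward(self, pyramid_levels):
        # Apply the same two branches to every level P1-P5.
        return [(self.cls_branch(p), self.box_branch(p)) for p in pyramid_levels]
```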

3.3. Loss Function

Because forest smoke often occupies only a small region compared to the background forest area, the foreground and background classes are extremely imbalanced during training. Therefore, our model uses the focal loss (FL) function [25] to mitigate this imbalance.
The focal loss function FL($p_t$), for classification score $p_t$, is expressed as follows:
$$FL(p_t) = -(1 - p_t)^{\gamma} \log(p_t),$$
where $(1 - p_t)^{\gamma}$ is the modulating factor with tunable focusing parameter $\gamma = 2$, and
$$p_t = \begin{cases} p & \text{if } y = 1,\\ 1 - p & \text{otherwise,} \end{cases}$$
where $y \in \{\pm 1\}$ specifies the ground-truth class and $p \in [0, 1]$ is the model's estimated probability for the class with label $y = 1$. As suggested in [20], we measure the difference between the predicted offsets and the ground-truth boxes using the bounding box regression loss function denoted by $L_1$. The total loss, $L_{total}$, is then expressed as a linear combination of $FL(p_t)$ and $L_1$ as follows:
$$L_{total} = \alpha \, FL(p_t) + \beta \, L_1,$$
where α and β are balancing terms. To determine the optimal values of the hyperparameters α, γ, and β, experiments were conducted using the ranges of values recommended in [25]. According to the experimental results in Table 1, the combination of α = 0.25, γ = 2, and β = 1 produced the best accuracy.
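A compact PyTorch sketch of this loss is given below (a simplified per-anchor formulation; the clamping constant is added only for numerical stability):

```python
import torch
import torch.nn.functional as F

def focal_loss(p, y, gamma=2.0):
    """Focal loss FL(p_t) = -(1 - p_t)^gamma * log(p_t).
    p: predicted smoke probability per anchor; y: ground-truth label (1 marks smoke)."""
    p_t = torch.where(y == 1, p, 1 - p)            # select p or 1 - p per anchor
    return -((1 - p_t) ** gamma) * torch.log(p_t.clamp(min=1e-8))

def total_loss(p, y, box_pred, box_target, alpha=0.25, gamma=2.0, beta=1.0):
    """Total loss alpha * FL(p_t) + beta * L1, using the best Table 1 setting
    alpha = 0.25, gamma = 2, beta = 1."""
    cls_loss = focal_loss(p, y, gamma).mean()
    box_loss = F.l1_loss(box_pred, box_target)     # bounding-box regression loss (L1)
    return alpha * cls_loss + beta * box_loss
```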

4. Experiments

4.1. Dataset

Large-scale benchmark datasets exist for general object detection, but no public forest fire/smoke dataset was found. Our primary focus is the detection of early-stage forest fires, where the only visible sign is smoke rather than flames; this distinguishes our work from other public fire detection benchmarks, which typically focus on detecting flames. The HPWREN dataset [34], which contains real-world imagery of early-stage forest fires collected since 2000, was considered suitable for this goal. Therefore, our initial dataset was created by extracting 2190 images from the HPWREN dataset, and it was then expanded to 4350 forest fire/smoke images by adding images from the Internet and from a forest fire surveillance system operated by a local company. Note that, since our model aims for real-time early detection of forest fires, satellite images are not currently included in the dataset due to the difficulty of real-time acquisition and their insufficient image quality.
Each image in the dataset was labelled and boxed using the Roboflow tool [35], and the images were then divided into a training set of 3915 images (90%) and a validation set of 435 images (10%). The dataset accounts for various forest smoke scenarios by including forest smoke images varying in fire intensity, time of day, smoke shape, etc.

4.2. Experimental Setup

The model was implemented in the PyTorch framework and then trained and evaluated on a computer with a GeForce RTX 3060 GPU. The training process took 60 epochs with a batch size of 6. The learning rate was initialized to 2.5 × 10⁻³ and then decreased by factors of 10 and 100 after 40 and 55 epochs, respectively.
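This schedule can be expressed with a standard PyTorch MultiStepLR scheduler; the optimizer type (SGD with momentum) and its momentum and weight-decay values are assumptions, since only the learning-rate schedule, batch size, and epoch count are stated above:

```python
import torch
import torch.nn as nn

# Placeholder module standing in for the detection model.
model = nn.Conv2d(3, 8, 3)
# SGD with momentum is an assumption; only the learning-rate schedule is given in the paper.
optimizer = torch.optim.SGD(model.parameters(), lr=2.5e-3, momentum=0.9, weight_decay=1e-4)
# Divide the learning rate by 10 at epoch 40 and again at epoch 55,
# i.e., 10 and 100 times lower than the initial 2.5e-3.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[40, 55], gamma=0.1)

for epoch in range(60):
    # ... run one training epoch with batch size 6 ...
    scheduler.step()
```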
Our model was compared with several existing models, including RetinaNet [25], YOLOv8 [26], YOLOv9 [27], YOLOv10 [28], Faster-RCNN [20], and SSD [29]. To ensure a fair evaluation, all models were implemented on the same dataset with consistent augmentation techniques, including flipping and rotation, to mitigate the risk of overfitting. Additionally, optimization methods, such as learning rate scheduling, optimizer selection, momentum, and weight decay, were applied across all models. Our backbone was also compared with other backbones like VGG16 [36], Convnext [37], EfficientNet [38], InceptionV1 [39], and InceptionV4 [40].

4.3. Evaluation Metrics

The well-known average precision (AP) metric from MS-COCO [41] is used to evaluate the performance of the model. AP indicates the area under a precision-recall curve, P(r), that plots the value of precision against recall for different confidence threshold values [42]. Thus, AP is expressed as follows:
$$AP = \int_{0}^{1} P(r)\, dr.$$
Precision indicates the ratio of correct predictions to all positive predictions, while Recall indicates the ratio of correct predictions to all labeled smoke instances. These two metrics are expressed as follows:
$$Precision = \frac{TP}{TP + FP}$$
and
$$Recall = \frac{TP}{TP + FN},$$
where True Positive (TP) indicates that the model predicted the presence of smoke (Positive) and was correct (True), False Positive (FP) indicates that the model predicted the presence of smoke (Positive) but was incorrect (False), True Negative (TN) indicates that the model predicted the absence of smoke (Negative) and was correct (True), and False Negative (FN) indicates that the model predicted the absence of smoke (Negative) but was incorrect (False).
In addition to AP, some other metrics are used to evaluate the performance of the model. AP50 and AP75 indicate the AP values at 50% and 75% IoU (Intersection over Union) thresholds, respectively, and APS, APM, and APL are AP values for small, medium, and large objects, respectively. GFLOPs (giga floating-point operations) and #Params (the number of parameters) are used to evaluate the computational complexity of the model, and FPS (frames per second) is used to evaluate detection speed.
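The following sketch shows how Precision, Recall, and a simplified AP (area under the precision-recall curve) can be computed; COCO-style AP additionally interpolates the curve and averages over IoU thresholds, which is omitted here:

```python
import numpy as np

def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP); Recall = TP / (TP + FN)."""
    return tp / (tp + fp), tp / (tp + fn)

def average_precision(precisions, recalls):
    """Area under the precision-recall curve P(r), approximated with the trapezoidal rule
    from (precision, recall) points sampled at different confidence thresholds."""
    order = np.argsort(recalls)
    r = np.concatenate(([0.0], np.asarray(recalls, dtype=float)[order]))
    p = np.concatenate(([1.0], np.asarray(precisions, dtype=float)[order]))
    return float(np.sum((r[1:] - r[:-1]) * (p[1:] + p[:-1]) / 2.0))

# Example with hypothetical counts: 27 true positives, 3 false positives, 13 false negatives.
print(precision_recall(tp=27, fp=3, fn=13))   # (0.9, 0.675)
```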

4.4. Experimental Results

According to Table 2, our model achieved the best values in AP and its sub-metrics while keeping #Params and GFLOPs low overall. RetinaNet achieved fairly competitive accuracy in APL, but it requires a much higher computational cost. Since RetinaNet uses a neck and head similar to ours, we can infer that the increase in computational cost comes from the backbone; specifically, RetinaNet has almost twice as many #Params and 6% higher GFLOPs than our model. The results also show that our model improves AP by approximately 9.6% while using fewer parameters and lower GFLOPs compared to Faster R-CNN, a two-stage object detection model that typically achieves high accuracy.
Meanwhile, our model requires slightly more computational load than the YOLO models, but it achieves much higher accuracy, especially when detecting small objects. Looking at the APS (average precision for small objects) metric, our model improves accuracy by 43.17% to 52.2% compared to the YOLO versions. Considering that the initial smoke plume is very small compared to the large monitored forest area, the proposed model is clearly beneficial for early forest fire detection.
Table 3 compares the performance of the proposed backbone with other popular backbones using the same Neck, Head, and other settings. Overall, the proposed backbone achieves the best AP values while keeping fairly favorable #Params and GFLOPs. VGG16 also achieved good AP values but with significantly higher #Params and GFLOPs. On the other hand, EfficientNet and InceptionV1 generated far fewer parameters but with significantly lower AP values of 44.0 and 41.2, respectively. It can be concluded that the proposed backbone achieves both efficiency and effectiveness for smoke detection, showing clear gains in AP with a reduced computational load compared to the other backbones.
Figure 8 shows the qualitative test results for 15 forest fire images numbered 1 to 15, and the class name and confidence value are given at the top of each bounding box. The proposed model was able to detect different shapes of smoke not only in the images with clear smoke such as images 1–4 but also in the monochrome or grayscale images captured by infrared cameras at night, as shown in images 6 and 7, or in the images with small or blurred smoke, like images 8, 9, and 11–14, which are difficult for humans to discern.
To show how well the attention mechanism works, heat maps generated with the Grad-CAM technique [43] for different models are compared in Figure 9. Note that YOLOv8s was selected as the best-performing YOLO variant. The first column shows four different smoke images, and each of the next five columns shows the heat map of the corresponding image when each model is applied. In the heat maps, hot colors such as red and yellow indicate high attention, while cool colors such as blue and green indicate low attention. The attention area of the heat map generated by our proposed model clearly depicts the shape of the original smoke plume much better than the other heat maps, and our model better suppresses the less relevant regions. For example, in the last smoke image, the attention region in our model's heat map has a shape very similar to that of the smoke.
In summary, the proposed model strikes a good balance by increasing the accuracy of forest smoke detection while keeping the number of parameters to a minimum. However, due to its relatively high GFLOPs, more training time is required.

4.5. Ablation Study

Finally, we performed an ablation study to investigate the effectiveness of using techniques such as splitting (Splitting), the depth-wise convolution of coordinate kernels (DW-Coord), and an attention mechanism (CBAM) on the basic backbone (Basic) of our proposed model. According to the experimental results in Table 4, Splitting significantly reduces the number of parameters (#Params) by dividing the feature map channels into smaller segments, while DW-Coord and CBAM improve AP by helping the model focus on extracting the important features of smoke.
Consequently, the proposed model with all the techniques applied was able to increase AP by 6% while reducing #Params and GFLOPs by 34% and 14%, respectively, compared to the base model.

5. Conclusions

This paper presented a CNN-based forest smoke detection model featuring a new backbone architecture designed to increase detection accuracy and reduce computational load. The backbone includes three key methods. It extracts object features through different views using kernels of varying sizes to better detect smoke plumes of different sizes. It uses the depth-wise convolution of coordinate kernels to better extract the features of smoke plumes spreading along the vertical dimension. It also employs an attention mechanism so that the model focuses on important features. The model was trained using 90% of a dataset containing 4350 forest fire or smoke images and validated using the remaining 10%. According to the experimental results, our model not only improves accuracy but also reduces computational load in early forest fire detection compared to existing models. Further research will focus on reducing FLOPs to improve training and inference speed and on optimizing the Neck and Head modules to enhance performance. The proposed model will also be examined to determine whether it can be applied to embedded systems with low computing power, such as surveillance cameras and drones.

Author Contributions

Conceptualization, H.O. and Q.-Q.H.; methodology, Q.-Q.H. and H.O.; software, Q.-Q.H.; validation, H.O. and Q.-L.H.; data curation, Q.-L.H. and H.O.; writing—original draft preparation, Q.-Q.H.; writing—review and editing, H.O.; visualization, Q.-Q.H.; supervision, H.O.; funding acquisition, H.O. All authors have read and agreed to the published version of the manuscript.

Funding

This result was supported by the “Regional Innovation Strategy (RIS)” through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (MOE) (2021RIS-003).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data used for our experiments can be found at https://drive.google.com/drive/folders/1l9qI_EzU4A8heXvlpyGJEdxGVP3pRhcL (accessed on 8 February 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Facts + Statistics: Wildfires. Available online: https://www.iii.org/fact-statistic/facts-statistics-wildfires (accessed on 1 April 2023).
  2. Chowdary, V.; Gupta, M.K. Automatic forest fire detection and monitoring techniques: A survey. In Intelligent Communication, Control and Devices: Proceedings of ICICCD 2017; Springer Nature: Singapore, 2018; pp. 1111–1117. [Google Scholar]
  3. Alkhatib, A.A.A. A review on forest fire detection techniques. Int. J. Distrib. Sens. Netw. 2014, 10, 597368. [Google Scholar] [CrossRef]
  4. Barmpoutis, P.; Papaioannou, P.; Dimitropoulos, K.; Grammalidis, N. A review on early forest fire detection systems using optical remote sensing. Sensors 2020, 20, 6442. [Google Scholar] [CrossRef] [PubMed]
  5. History of the Osborne Firefinder. Available online: https://www.fs.usda.gov/t-d/pubs/pdf/hi_res/03511311hi.pdf (accessed on 1 April 2023).
  6. Bouabdellah, K.; Noureddine, H.; Larbi, S. Using wireless sensor networks for reliable forest fires detection. Procedia Comput. Sci. 2013, 19, 794–801. [Google Scholar] [CrossRef]
  7. Gaur, A.; Singh, A.; Kumar, A.; Kumar, A.; Kapoor, K. Video flame and smoke based fire detection algorithms: A literature review. Fire Technol. 2020, 56, 1943–1980. [Google Scholar] [CrossRef]
  8. Chen, T.H.; Wu, P.H.; Chiou, Y.C. An early fire-detection method based on image processing. In Proceedings of the 2004 International Conference on Image Processing, ICIP’04, Singapore, 24–27 October 2004; IEEE: Piscataway, NJ, USA, 2004; Volume 3, pp. 1707–1710. [Google Scholar]
  9. Vipin, V. Image processing based forest fire detection. Int. J. Emerg. Technol. Adv. Eng. 2012, 2, 87–95. [Google Scholar]
  10. Yuan, C.; Liu, Z.; Zhang, Y. UAV-based forest fire detection and tracking using image processing techniques. In Proceedings of the 2015 International Conference on Unmanned Aircraft Systems (ICUAS), Denver, CO, USA, 9–12 June 2015; IEEE: Piscataway, NJ, USA, 2015; pp. 639–643. [Google Scholar]
  11. Zhang, Z.; Zhao, J.; Zhang, D.; Qu, C.; Ke, Y.; Cai, B. Contour based forest fire detection using FFT and wavelet. In Proceedings of the 2008 International Conference on Computer Science and Software Engineering, Wuhan, China, 12–14 December 2008; IEEE: Piscataway, NJ, USA, 2008; Volume 1, pp. 760–763. [Google Scholar]
  12. Foggia, P.; Saggese, A.; Vento, M. Real-time fire detection for video-surveillance applications using a combination of experts based on color, shape, and motion. IEEE Trans. Circuits Syst. Video Technol. 2015, 25, 1545–1556. [Google Scholar] [CrossRef]
  13. Mahmoud, M.A.; Ren, H. Forest fire detection using a rule-based image processing algorithm and temporal variation. Math. Probl. Eng. 2018, 2018, 7612487. [Google Scholar] [CrossRef]
  14. Wang, S.; Chen, T.; Lv, X.; Zhao, J.; Zou, X.; Zhao, X.; Xiao, M.; Wei, H. Forest fire detection based on lightweight Yolo. In Proceedings of the 2021 33rd Chinese Control and Decision Conference (CCDC), Kunming, China, 22–24 May 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 1560–1565. [Google Scholar]
  15. Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y.M. Yolov4: Optimal speed and accuracy of object detection. arXiv 2020, arXiv:2004.10934. [Google Scholar]
  16. Howard, A.; Sandler, M.; Chu, G.; Chen, L.C.; Chen, B.; Tan, M.; Wang, W.; Zhu, Y.; Pang, R.; Vasudevan, V.; et al. Searching for mobilenetv3. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27–28 October 2019; pp. 1314–1324. [Google Scholar]
  17. Jiao, Z.; Zhang, Y.; Xin, J.; Mu, L.; Yi, Y.; Liu, H.; Liu, D. A deep learning based forest fire detection approach using UAV and YOLOv3. In Proceedings of the 2019 1st International Conference on Industrial Artificial Intelligence (IAI), Shenyang, China, 23–27 July 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 1–5. [Google Scholar]
  18. Redmon, J.; Farhadi, A. Yolov3: An incremental improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar]
  19. Zhang, Q.X.; Lin, G.H.; Zhang, Y.M.; Xu, G.; Wang, J.T. Wildland forest fire smoke detection based on faster R-CNN using synthetic smoke images. Procedia Eng. 2018, 211, 441–446. [Google Scholar] [CrossRef]
  20. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. arXiv 2015, arXiv:1506.01497. [Google Scholar] [CrossRef]
  21. Vani, K. Deep learning based forest fire classification and detection in satellite images. In Proceedings of the 2019 11th International Conference on Advanced Computing (ICoAC), Chennai, India, 18–20 December 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 61–65. [Google Scholar]
  22. Szegedy, C.; Ioffe, S.; Shlens, J.; Wojna, Z. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 2818–2826. [Google Scholar]
  23. Meena, U.; Munjal, G.; Sachdeva, S.; Garg, P.; Dagar, D.; Gangal, A. RCNN Architecture for Forest Fire Detection. In Proceedings of the 2023 13th International Conference on Cloud Computing, Data Science & Engineering (Confluence), Noida, India, 19–20 January 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 699–704. [Google Scholar]
  24. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587. [Google Scholar]
  25. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar]
  26. Glenn, J. Yolov8. Available online: https://github.com/ultralytics/ultralytics/tree/main (accessed on 1 May 2024).
  27. Wang, C.Y.; Yeh, I.H.; Liao, H.Y.M. Yolov9: Learning what you want to learn using programmable gradient information. arXiv 2024, arXiv:2402.13616. [Google Scholar]
  28. Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J.; Ding, G. YOLOv10: Real-Time End-to-End Object Detection. arXiv 2024, arXiv:2405.14458. [Google Scholar]
  29. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. Ssd: Single shot multibox detector. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part I 14. Springer International Publishing: Cham, Switzerland, 2016; pp. 21–37. [Google Scholar]
  30. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125. [Google Scholar]
  31. Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; et al. Imagenet large scale visual recognition challenge. Int. J. Comput. Vis. 2015, 115, 211–252. [Google Scholar] [CrossRef]
  32. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  33. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  34. High Performance Wireless Research and Education Network. Available online: http://hpwren.ucsd.edu/index.html (accessed on 1 April 2023).
  35. Roboflow. Available online: https://roboflow.com/ (accessed on 1 March 2023).
  36. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
  37. Liu, Z.; Mao, H.; Wu, C.Y.; Feichtenhofer, C.; Darrell, T.; Xie, S. A convnet for the 2020s. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 11976–11986. [Google Scholar]
  38. Tan, M.; Le, Q. Efficientnet: Rethinking model scaling for convolutional neural networks. In Proceedings of the International Conference on Machine Learning, PMLR, Long Beach, CA, USA, 9–15 June 2019; pp. 6105–6114. [Google Scholar]
  39. Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1–9. [Google Scholar]
  40. Szegedy, C.; Ioffe, S.; Vanhoucke, V.; Alemi, A. Inception-v4, inception-resnet and the impact of residual connections on learning. In Proceedings of the AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017; Volume 31. [Google Scholar]
  41. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft coco: Common objects in context. In Proceedings of the Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, 6–12 September 2014; Proceedings, Part V 13. Springer International Publishing: Cham, Switzerland, 2014; pp. 740–755. [Google Scholar]
  42. Padilla, R.; Netto, S.L.; Silva, E.A.D. A survey on performance metrics for object-detection algorithms. In Proceedings of the 2020 International Conference on Systems, Signals and Image Processing (IWSSIP), Niteroi, Brazil, 1–3 July 2020; IEEE: Piscataway, NJ, USA, 2022; pp. 237–242. [Google Scholar]
  43. Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-cam: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 618–626. [Google Scholar]
Figure 1. The architecture of the forest fire detection model.
Figure 2. Reduction in the number of parameters by using different sized kernels.
Figure 3. The smoke features tend to be vertically distributed through the layers.
Figure 4. The proposed Backbone structure for forest fire detection.
Figure 5. CBAM architecture.
Figure 6. The Neck architecture.
Figure 7. The Head architecture.
Figure 8. Qualitative test results for 15 forest fire images numbered 1 to 15, with the class name and confidence value given at the top of each bounding box.
Figure 9. Heat maps of the images to which different models are applied.
Table 1. The result of experiments using various α, γ, and β values.
γ | α | β | AP | AP50 | AP75 | APS | APM | APL
0 | 0.75 | 1 | 48.3 | 79.7 | 46.2 | 22.6 | 44.6 | 80.4
0.1 | 0.75 | 1 | 49.4 | 82.8 | 50.2 | 25.3 | 45.4 | 80.5
0.2 | 0.75 | 1 | 50.6 | 84.2 | 47.0 | 25.0 | 47.9 | 84.2
0.5 | 0.5 | 1 | 50.2 | 82.9 | 48.0 | 25.8 | 45.3 | 83.8
1 | 0.25 | 1 | 51.6 | 85.6 | 49.8 | 26.7 | 48.8 | 83.8
2 | 0.25 | 1 | 52.9 | 85.7 | 53.3 | 27.8 | 50.2 | 85.8
5 | 0.25 | 1 | 52.3 | 82.8 | 50.4 | 26.6 | 50.3 | 85.0
Table 2. Performance comparison of our model and other models.
Model | AP | AP50 | AP75 | APS | APM | APL | #Params (Millions) | GFLOPs | FPS
Our model | 52.9 | 85.7 | 53.3 | 27.8 | 50.2 | 85.8 | 18.6 | 120.6 | 21.5
RetinaNet [25] | 50.8 | 82.1 | 49.3 | 24.3 | 46.6 | 85.6 | 36.1 | 127.8 | 20.4
YOLOv8s [26] | 49.7 | 77.0 | 51.1 | 14.4 | 45.9 | 77.0 | 11.2 | 28.6 | 102.0
YOLOv8m [26] | 48.8 | 76.5 | 49.2 | 13.4 | 43.9 | 76.4 | 25.9 | 78.9 | 39.8
YOLOv9s [27] | 49.1 | 76.1 | 50.6 | 14.0 | 43.1 | 76.6 | 7.1 | 26.4 | 79.5
YOLOv9m [27] | 49.5 | 77.2 | 51.2 | 14.3 | 44.2 | 77.0 | 20.1 | 76.3 | 37.9
YOLOv10s [28] | 48.2 | 75.4 | 48.5 | 15.8 | 40.7 | 75.9 | 7.2 | 21.6 | 88.5
YOLOv10m [28] | 47.6 | 74.3 | 47.4 | 13.3 | 38.4 | 76.5 | 15.4 | 59.1 | 40.3
Faster-RCNN [20] | 48.3 | 79.5 | 46.7 | 27.5 | 45.3 | 78.3 | 41.1 | 134.4 | 17.7
SSD [29] | 43.0 | 77.8 | 42.4 | 21.6 | 47.1 | 70.8 | 24.4 | 214.2 | 17.4
Table 3. Performance comparison of proposed backbone and other backbones.
Backbone | AP | AP50 | AP75 | APS | APM | APL | #Params (Millions) | GFLOPs | FPS
Our model | 52.9 | 85.7 | 53.3 | 27.8 | 50.2 | 85.8 | 18.61 | 120.63 | 21.5
VGG16 [36] | 49.7 | 83.7 | 48.8 | 25.6 | 45.5 | 82.6 | 142.93 | 331.82 | 12.2
Convnext [37] | 48.0 | 81.0 | 46.3 | 19.7 | 45.6 | 78.9 | 19.61 | 90.11 | 22.0
EfficientNet [38] | 44.0 | 70.9 | 42.0 | 17.2 | 40.7 | 73.1 | 14.58 | 25.75 | 26.1
Inceptionv1 [39] | 41.2 | 69.4 | 40.4 | 9.6 | 33.8 | 82.1 | 16.13 | 52.25 | 23.8
Inceptionv4 [40] | 41.0 | 66.4 | 40.3 | 7.5 | 39.2 | 82.0 | 52.92 | 120.43 | 21.0
Table 4. Ablation study on backbone modules with different techniques.
Basic | Splitting | DW-Coord | CBAM | AP | #Params (Million) | GFLOPs
√ |  |  |  | 49.9 | 28.21 | 140.49
√ | √ |  |  | 50.7 | 20.93 | 125.55
√ | √ | √ |  | 52.6 | 18.52 | 120.62
√ | √ | √ | √ | 52.9 | 18.61 | 120.63
√ indicates the inclusion of a specific technique in the backbone architecture.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
