Article

An Efficient Forest Smoke Detection Approach Using Convolutional Neural Networks and Attention Mechanisms

by Quy-Quyen Hoang, Quy-Lam Hoang and Hoon Oh *
Department of Electrical, Electronic and Computer Engineering, University of Ulsan, Ulsan 44610, Republic of Korea
* Author to whom correspondence should be addressed.
J. Imaging 2025, 11(2), 67; https://doi.org/10.3390/jimaging11020067
Submission received: 9 January 2025 / Revised: 7 February 2025 / Accepted: 9 February 2025 / Published: 19 February 2025

Abstract:
This study explores a method of effectively detecting smoke plumes as the early sign of a forest fire. Convolutional neural networks (CNNs) have been widely used for forest fire detection; however, they have not been customized or optimized for the characteristics of smoke. This paper proposes a CNN-based forest smoke detection model featuring a novel backbone architecture that increases detection accuracy and reduces computational load. Since the proposed backbone detects the plume of smoke through different views using kernels of varying sizes, it can better detect smoke plumes of different sizes. By decomposing the traditional square-kernel convolution into depth-wise convolutions of coordinate kernels, it can not only better extract the features of smoke plumes spreading along the vertical dimension but also reduce the computational load. An attention mechanism was applied to allow the model to focus on important information while suppressing less relevant information. The experimental results show that our model outperforms other popular models, achieving a detection accuracy of up to 52.9 average precision (AP) while significantly reducing the number of parameters and giga floating-point operations (GFLOPs).

1. Introduction

Forest fires often cause enormous damage to human life and the environment [1]. The main reason for this great damage is that forest fires spread quickly before they are detected, making them difficult to extinguish. This paper considers the evolution of existing vision-based models to effectively detect forest smoke, the early sign of a forest fire. The proposed model is designed based on convolutional neural networks (CNNs) and an attention mechanism, focusing on increasing the accuracy and reducing the computational complexity of smoke plume detection.
According to the survey papers by Chowdary and Gupta [2], Alkhatib [3], and Barmpoutis et al. [4], many forest fire detection methods have been proposed. Early methods relied on fire lookout towers and tools like the Osborne Firefinder [5]; however, they were not effective because they required continuous human intervention and were prone to human error. Some methods used sensors that can detect the signs of a fire outbreak, such as increased temperature, smoke, flames, or a lack of oxygen; they faced the challenge of reliably collecting data from sensors deployed across vast forested areas [6]. They also suffered from delayed fire detection because the fire alarm did not sound until the fire detection parameter values reached a preset threshold.
Recently, the direction of research has been shifting toward a vision-based approach that relies on artificial intelligence [7]. Existing vision-based approaches can be broadly divided into the following two categories: the image processing approach and the CNN-based approach. The former relies on image processing techniques to explore fire and smoke characteristics such as color, shape, and motion. Chen et al. [8], Vipin [9], and Yuan et al. [10] used RGB, YCbCr, and Lab color models, respectively, to extract fire and smoke pixels. Zhang et al. [11] used wavelet and fast Fourier transform methods to analyze the contours of the fire area in videos. Foggia et al. [12] combined the properties of color, shape, and motion using a multi-expert framework to increase detection accuracy. One recent approach utilized background subtraction and color segmentation to detect regions containing motion [13]. Since these approaches do not use high computational power, they may be suitable for devices with limited computational power, such as drones or surveillance cameras. However, to achieve a reasonable level of accuracy, they require careful image pre-processing steps and may need the use of different feature extraction algorithms for forest fire images in different situations.
In contrast, CNN-based approaches use deep learning techniques to automatically extract features from different images. Wang et al. [14] proposed a lightweight forest fire detection model by replacing the backbone network of YOLOv4 [15] with MobileNetv3 [16]. YOLOv4 is a popular object detection model known for its accuracy and speed, while MobileNetv3 is a lightweight convolutional neural network (CNN) designed to reduce computational load, making it suitable for resource-constrained devices. This method significantly reduces the computational load but comes with a trade-off in detection accuracy. Jiao et al. [17] used YOLOv3 [18] to detect forest fires using an unmanned aerial vehicle (UAV) that could capture high-resolution videos and images; however, it did not work well for small smoke plumes or fires. Another approach, developed by Zhang et al. [19], tried to detect forest smoke using Faster R-CNN [20]; although it improved the accuracy to some extent, the dataset used in the experiment lacked diversity in its forest fire images. Vani [21] employed Inceptionv3 [22], trained on satellite images, for forest fire detection. The problem with this satellite-based approach was that it could only capture large-scale fire images after the fire had spread over a large area. Furthermore, since Inceptionv3 only returned a fire or non-fire decision without boxing the fires, it required an extra step to determine the regions of the fires, which would take time and effort. One recent approach, introduced by Meena et al. [23], used R-CNN [24] for forest fire detection, but the high computational complexity of this model hindered its portability to monitoring devices. In summary, the existing approaches have made limited improvements in detection accuracy because they use popular models such as the YOLO series and Faster R-CNN as they are, and they often require a high computational load.
Based on the discussion so far, this paper introduces a forest smoke detection model featuring a new backbone architecture that is customized to increase the accuracy of smoke detection and reduce computational load. The proposed backbone is designed to effectively extract smoke features. By extracting object features through different views using kernels of varying sizes, it can better detect smoke plumes of different sizes. Furthermore, by using the depth-wise convolution of coordinate kernels, it can not only better extract the features of smoke plumes spreading along the vertical dimension but also reduce the computational load. Finally, by using an attention mechanism, it can focus on the important features of an image. As a result, the proposed model could achieve up to 52.9 average precision (AP), which far exceeds the accuracy of other models such as RetinaNet [25], YOLO [26,27,28], Faster-RCNN [20], and SSD [29], while significantly reducing the number of parameters and GFLOPs.
The rest of the paper is organized as follows: Section 2 presents the background; Section 3 describes the model architecture in detail; and Section 4 analyzes the experimental results and is followed by the conclusion in Section 5.

2. Background

2.1. Overview of Forest Fire Detection Model

The forest fire detection model consists of the following three modules: Backbone, Neck, and Head, as shown in Figure 1. The Backbone module has four stages labeled S1, S2, S3, and S4, each of which generates one feature map from the feature map of the stage below it, while S1 generates a feature map from the input image. Early stages tend to capture low-level information such as edges and corners, while later stages tend to capture higher-level or more specific information.
The Neck and Head modules were defined by Lin et al. [25]. Neck has five levels labeled P1, P2, P3, P4, and P5, each of which holds one feature map. The level feature map of P3 is built by applying convolutions to the stage feature map of S4; the level feature map of P2 is created by up-sampling that of P3 and adding it to the stage feature map of S3; and the level feature map of P1 is created similarly. Two more level feature maps, those of P4 and P5, are constructed by down-sampling those of P3 and P4, respectively, to obtain more abundant features. In this way, using a multi-level pyramidal network [30], Neck can not only balance the information across multiple stages but also help the model detect objects of different scales. Head consists of the following two primary components: object classification and bounding box regression. The object classification component predicts the class to which an object belongs, assigning a probability score to each class. The bounding box regression component, on the other hand, estimates the coordinates of the bounding box that encloses the detected object. These two components work together to increase the accuracy of object identification within an image.

2.2. Motivation and Our Approach

Backbone plays an important role in determining the accuracy of object detection, as it creates a feature map of the object. However, many convolutional layers may be involved, resulting in significant computational load.
Recent forest fire detection models have adopted well-known backbones designed for the ImageNet dataset [31]. Unfortunately, ImageNet does not contain smoke or fire classes, which means that those backbones were not optimized for forest fire or smoke detection. In addition, ImageNet is a large dataset with over one million images and one thousand classes, so researchers have kept improving backbones with more layers and/or larger kernels to extract more information from it, which requires more computational load.
This paper presents a new forest fire detection model to optimize smoke detection in terms of accuracy and computational load. The design of our model is fundamentally based on two principles. First, comparing the two convolution processes shown in Figure 2a,b, using a larger size kernel allows for the faster generation of feature maps but generates more parameters. Therefore, it may be advantageous to use multiple small-sized kernels to extract one feature element from the same receptive field. Second, to more effectively extract features of smoke plumes spreading along the vertical dimension, as shown in Figure 3, it may be desirable to decompose the conventional convolution with square kernels into the depth-wise convolution of coordinate kernels. This decomposition also contributes to reducing the number of parameters.
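As a concrete illustration of the second principle, the following PyTorch sketch (with illustrative channel and kernel sizes, not values taken from the paper) compares the parameter count of a standard n × n depth-wise convolution with the 1 × n followed by n × 1 decomposition:

```python
import torch
import torch.nn as nn

C, n = 64, 5  # example channel count and kernel size (illustrative values)

# Standard depth-wise convolution with a square n x n kernel: C * n^2 weights.
square_dw = nn.Conv2d(C, C, kernel_size=n, padding=n // 2, groups=C, bias=False)

# Decomposed "coordinate" depth-wise convolutions: 1 x n followed by n x 1,
# covering the same receptive field with only C * 2n weights.
coord_dw = nn.Sequential(
    nn.Conv2d(C, C, kernel_size=(1, n), padding=(0, n // 2), groups=C, bias=False),
    nn.Conv2d(C, C, kernel_size=(n, 1), padding=(n // 2, 0), groups=C, bias=False),
)

x = torch.randn(1, C, 56, 56)
assert square_dw(x).shape == coord_dw(x).shape            # same output size
print(sum(p.numel() for p in square_dw.parameters()))     # 64 * 25 = 1600
print(sum(p.numel() for p in coord_dw.parameters()))      # 64 * (5 + 5) = 640
```

For n = 5 and 64 channels, the decomposition cuts the kernel weights from 1600 to 640 while preserving the receptive field.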
Additionally, our model extracts the features of objects through different views using kernels of different sizes to better detect smoke plumes of different scales. Our model also uses an attention mechanism that focuses on the features of specific objects (smoke) in the image while suppressing irrelevant features.

3. Proposed Model

3.1. Backbone

The proposed backbone is structured as shown in Figure 4a. The proposed model comprehensively extracts the features of the input data by traversing a 4-stage hierarchy, where each stage consists of one or more residual blocks, with one attention block added to the residual block output of stages 3 and 4. The two design principles of the proposed backbone are to effectively extract forest smoke features to increase smoke detection accuracy and to reduce computational load. The following explains how these design principles are reflected in the structure of the proposed backbone.

3.1.1. Stem Block

The stem block illustrated in Figure 4b is utilized to quickly reduce the spatial dimensions of the input image without losing feature information. It uses three 3 × 3 kernels with strides of 2, 1, and 1 to reduce the number of parameters, whereas existing stem blocks typically rely on a single large kernel; three small kernels can extract the same level of information. As in other models, Batch Normalization (BN) and the Rectified Linear Unit (ReLU) are applied to the output of each convolutional layer to increase the learning speed. Note that our stem block does not use the Sigmoid Linear Unit (SiLU) or the Gaussian Error Linear Unit (GELU) functions, which consume more computational resources than simpler alternatives such as ReLU. At the end of the stem block, one 3 × 3 max pooling is applied to further reduce the size of the feature map. In practice, the input image passes through these layers with stride 2 applied twice, so each spatial dimension of the feature map is reduced by a factor of four.
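A minimal PyTorch sketch of such a stem block is given below; the channel widths (32, 32, 64) and the stride of 2 for the max pooling are illustrative assumptions, since the paper does not list them here:

```python
import torch.nn as nn

def conv_bn_relu(c_in, c_out, stride):
    """3 x 3 convolution followed by Batch Normalization and ReLU."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=3, stride=stride, padding=1, bias=False),
        nn.BatchNorm2d(c_out),
        nn.ReLU(inplace=True),
    )

class StemBlock(nn.Module):
    """Three 3 x 3 convolutions with strides 2, 1, 1, followed by 3 x 3 max pooling.
    Channel widths (32, 32, 64) are illustrative assumptions."""
    def __init__(self, in_ch=3):
        super().__init__()
        self.convs = nn.Sequential(
            conv_bn_relu(in_ch, 32, stride=2),
            conv_bn_relu(32, 32, stride=1),
            conv_bn_relu(32, 64, stride=1),
        )
        self.pool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)  # halves H and W again

    def forward(self, x):
        return self.pool(self.convs(x))  # overall 4x spatial reduction
```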

3.1.2. Transition Block

The transition block illustrated in Figure 4c is used to shrink the feature map between two adjacent stages. A Conv 1 × 1 is utilized to double the number of channels, followed by 3 × 3 max pooling to halve the spatial dimensions. This shrinks the feature map without losing information while limiting the number of required parameters.
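A possible PyTorch rendering of the transition block, assuming a stride of 2 for the max pooling, is:

```python
import torch.nn as nn

class TransitionBlock(nn.Module):
    """1 x 1 convolution doubling the channels, then 3 x 3 max pooling halving H and W
    (the pooling stride of 2 is an assumption)."""
    def __init__(self, in_ch):
        super().__init__()
        self.expand = nn.Conv2d(in_ch, 2 * in_ch, kernel_size=1, bias=False)
        self.pool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)

    def forward(self, x):
        return self.pool(self.expand(x))
```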

3.1.3. Residual Block

The residual block illustrated in Figure 4d is designed to better extract the features of forest smoke. The feature map from the previous layer is split into four smaller feature maps along the channel dimension, and each is processed by a different convolutional branch.
The top two branches use sequential 1 × n and n × 1 depth-wise convolutions (DWconvs) instead of an n × n kernel to reduce the number of parameters (n is 3 or 5 in the figure), while the third branch uses a 1 × 1 depth-wise convolution. This factorization reduces the number of parameters per channel from n² to 2n while maintaining the same receptive field. The n × 1 convolutions also help the model better capture vertically distributed features such as smoke. The third branch enhances feature extraction from small smoke plumes by using a DWconv 1 × 1 with a small kernel. The last branch sequentially applies one 3 × 3 max pooling and one DWconv 1 × 1. By taking the maximum value within each pooling region, max pooling retains the most important features while discarding less important or noisy ones. The DWconv 1 × 1 applied to the output of the max pooling layer helps perform channel mixing, which can improve the accuracy of the model.
The outputs of the four branches are concatenated along the channel dimension to produce a fine-grained feature map, which is fed serially into two point-wise convolutions (PWconv 1 × 1s) to mix information along the channel dimension. The first PWconv expands the channel dimension by a scaling factor of four, and the ReLU activation function between the two PWconvs reinforces nonlinearity in this larger space. The original feature map delivered via the residual branch is added to the resulting feature map to mitigate the vanishing gradient problem [32].
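The following PyTorch sketch summarizes the residual block as described above; an equal four-way channel split and a four-fold expansion between the two point-wise convolutions are assumptions based on this description:

```python
import torch
import torch.nn as nn

def coord_dw(ch, n):
    """Depth-wise 1 x n followed by n x 1 convolution (same receptive field as n x n)."""
    return nn.Sequential(
        nn.Conv2d(ch, ch, (1, n), padding=(0, n // 2), groups=ch, bias=False),
        nn.Conv2d(ch, ch, (n, 1), padding=(n // 2, 0), groups=ch, bias=False),
    )

class ResidualBlock(nn.Module):
    """Sketch of the four-branch residual block (Figure 4d). The equal channel split and
    the 4x expansion ratio are assumptions, not values stated explicitly in the paper."""
    def __init__(self, ch):
        super().__init__()
        assert ch % 4 == 0
        g = ch // 4
        self.b1 = coord_dw(g, 3)                              # 1x3 then 3x1 depth-wise
        self.b2 = coord_dw(g, 5)                              # 1x5 then 5x1 depth-wise
        self.b3 = nn.Conv2d(g, g, 1, groups=g, bias=False)    # 1x1 depth-wise
        self.b4 = nn.Sequential(                              # 3x3 max pooling + 1x1 depth-wise
            nn.MaxPool2d(3, stride=1, padding=1),
            nn.Conv2d(g, g, 1, groups=g, bias=False),
        )
        self.mix = nn.Sequential(                             # two point-wise convs mix channels
            nn.Conv2d(ch, 4 * ch, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(4 * ch, ch, 1, bias=False),
        )

    def forward(self, x):
        x1, x2, x3, x4 = torch.chunk(x, 4, dim=1)             # split along the channel dimension
        y = torch.cat([self.b1(x1), self.b2(x2), self.b3(x3), self.b4(x4)], dim=1)
        return x + self.mix(y)                                # residual connection
```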

3.1.4. Attention Block

An attention block is added only to the output of the last residual block in stages 3 and 4, as shown in Figure 4a, for computational efficiency, since the feature maps in stages 1 and 2 are large. The attention mechanism helps the model focus on important features of the image while suppressing irrelevant ones. Our backbone employs the Convolutional Block Attention Module (CBAM) [33], which consists of the following two components: the Channel Attention Module (CAM) shown in Figure 5b and the Spatial Attention Module (SAM) shown in Figure 5c. CAM allows the model to focus on the most relevant channels of a feature map, while SAM allows it to capture spatial dependencies.
Let $F$ and $\mathbb{R}^{C \times H \times W}$ represent the input feature map and the set of possible feature maps of the target object, respectively, such that $F \in \mathbb{R}^{C \times H \times W}$. The input feature map $F$ is processed by CAM to produce the channel attention weight $M_c(F)$, as detailed in Figure 5b. The refined feature map $F'$ is then obtained by performing element-wise multiplication between $M_c(F)$ and $F$ to redistribute the information in $F$ along the channel dimension, as follows:
$$F' = M_c(F) \otimes F.$$
Referring to Figure 5b, CAM uses average-pooling and max-pooling along the spatial dimensions to aggregate the spatial information, generating the average-pooled features $F^{c}_{avg}$ and the max-pooled features $F^{c}_{max}$, respectively. These two features are then passed to the Multilayer Perceptron (MLP) to generate two channel attention maps, $MLP(F^{c}_{avg})$ and $MLP(F^{c}_{max})$, which are merged using element-wise addition. Finally, the sigmoid function, denoted by $\sigma$, is applied to produce the channel attention weight $M_c(F)$ as follows:
$$M_c(F) = \sigma\left(MLP(F^{c}_{avg}) + MLP(F^{c}_{max})\right).$$
Referring to Figure 5c, the refined feature map $F'$ is then fed into the SAM module to generate the spatial attention weight $M_s(F')$. $M_s(F')$ is multiplied with $F'$ to refine the feature map along the spatial dimension, thereby producing the feature map $F''$ as follows:
$$F'' = M_s(F') \otimes F'.$$
SAM also uses both max-pooling and average-pooling, but along the channel dimension, generating two features $F^{s}_{avg}$ and $F^{s}_{max}$ that represent the aggregated channel information. These are then concatenated and mixed using a $7 \times 7$ convolution, $f^{7 \times 7}$, to produce a spatial attention map. Finally, the sigmoid function $\sigma$ is applied to produce the spatial attention weight $M_s(F')$ as follows:
$$M_s(F') = \sigma\left(f^{7 \times 7}\left([F^{s}_{avg}; F^{s}_{max}]\right)\right).$$
Note that our model takes into account both channel attention and spatial attention since feature maps have spatial and channel dimensions.
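The CBAM block used here can be sketched in PyTorch as follows; the MLP reduction ratio of 16 is the default from the CBAM paper [33] and is an assumption in this sketch:

```python
import torch
import torch.nn as nn

class CBAM(nn.Module):
    """Channel attention (CAM) followed by spatial attention (SAM), as in the
    expressions above. The reduction ratio of 16 is an assumed default."""
    def __init__(self, ch, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(                      # MLP applied to both pooled vectors
            nn.Conv2d(ch, ch // reduction, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(ch // reduction, ch, 1, bias=False),
        )
        self.spatial = nn.Conv2d(2, 1, kernel_size=7, padding=3, bias=False)

    def forward(self, F):
        # Channel attention: average- and max-pool over the spatial dimensions.
        Mc = torch.sigmoid(self.mlp(F.mean((2, 3), keepdim=True)) +
                           self.mlp(F.amax((2, 3), keepdim=True)))
        F1 = Mc * F                                    # F' = Mc(F) (x) F
        # Spatial attention: average- and max-pool over the channel dimension.
        avg = F1.mean(dim=1, keepdim=True)
        mx = F1.amax(dim=1, keepdim=True)
        Ms = torch.sigmoid(self.spatial(torch.cat([avg, mx], dim=1)))
        return Ms * F1                                 # F'' = Ms(F') (x) F'
```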

3.2. Neck and Head

The Neck module in Figure 6, consisting of five levels P1, ..., P5, uses a slightly modified version of the Feature Pyramid Network model [30] to better detect objects of different scales as well as to balance the information across multiple stages. The modifications are as follows. Initially, Conv 1 × 1 is applied to the feature maps from S2 to S4 in Backbone, producing new feature maps with 256 channels. The feature maps in P1, P2, and P3 are produced by applying Conv 3 × 3 to a new feature map or to the sum of new feature maps, where 2× denotes up-sampling by a factor of two. Note that stage S1 is not used. The module has two more levels, P4 and P5, whose feature maps are obtained by down-sampling the feature map at P3 by 1/2 and 1/4, respectively. This helps the model better detect larger objects.
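A simplified PyTorch sketch of this Neck is given below; the stage channel widths (128, 256, 512) and the use of stride-2 subsampling to produce P4 and P5 are assumptions:

```python
import torch.nn as nn
import torch.nn.functional as Fn

class Neck(nn.Module):
    """FPN-style neck: 1 x 1 lateral convolutions map S2-S4 to 256 channels, P1-P3 come
    from 3 x 3 convolutions on the (up-sampled and summed) lateral maps, and P4/P5 are
    down-sampled from P3."""
    def __init__(self, in_chs=(128, 256, 512), out_ch=256):   # S2-S4 widths are assumed
        super().__init__()
        self.lateral = nn.ModuleList([nn.Conv2d(c, out_ch, 1) for c in in_chs])
        self.smooth = nn.ModuleList([nn.Conv2d(out_ch, out_ch, 3, padding=1) for _ in in_chs])

    def forward(self, s2, s3, s4):
        l2, l3, l4 = [lat(x) for lat, x in zip(self.lateral, (s2, s3, s4))]
        t3 = l4
        t2 = l3 + Fn.interpolate(t3, scale_factor=2, mode="nearest")   # 2x up-sampling
        t1 = l2 + Fn.interpolate(t2, scale_factor=2, mode="nearest")
        p1, p2, p3 = [sm(t) for sm, t in zip(self.smooth, (t1, t2, t3))]
        p4 = Fn.max_pool2d(p3, kernel_size=1, stride=2)                # 1/2 of P3
        p5 = Fn.max_pool2d(p3, kernel_size=1, stride=4)                # 1/4 of P3
        return p1, p2, p3, p4, p5
```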
Referring to the Head module in Figure 7, object classification is represented by five Conv 3 × 3's and one class feature map denoted by A × W × H, and bounding box regression by five Conv 3 × 3's and one box feature map denoted by 4A × W × H, where 4 indicates the four relative offset values between an anchor and its ground-truth box. The anchors, inherited from RetinaNet [25], have various scales and aspect ratios to enable the model to effectively detect objects of different sizes and shapes. At each feature map location, a set of nine anchors is generated, consisting of three different scales and three aspect ratios (1:1, 2:1, 1:2). These nine anchors cover a scale range of 32 to 813 pixels with respect to the input image of the network. The anchors are applied across the different levels of the Feature Pyramid Network (FPN), allowing object detection at multiple resolutions and enabling the detection of both small and large objects.
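The Head can be sketched as follows, assuming four 3 × 3 convolutions plus one output convolution per branch (five in total), A = 9 anchors, and a single smoke class:

```python
import torch.nn as nn

class Head(nn.Module):
    """Classification branch producing an A x W x H map and a box-regression branch
    producing a 4A x W x H map, shared across all pyramid levels."""
    def __init__(self, ch=256, num_anchors=9, num_classes=1):
        super().__init__()
        def branch(out_ch):
            layers = []
            for _ in range(4):
                layers += [nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True)]
            layers.append(nn.Conv2d(ch, out_ch, 3, padding=1))   # fifth, output convolution
            return nn.Sequential(*layers)
        self.cls_branch = branch(num_anchors * num_classes)      # A x W x H
        self.box_branch = branch(num_anchors * 4)                # 4A x W x H

    def forward(self, pyramid_levels):
        # Apply the same two branches to every level P1-P5.
        return [(self.cls_branch(p), self.box_branch(p)) for p in pyramid_levels]
```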

3.3. Loss Function

Because forest smoke often occupies only a small region compared to the background forest area, the foreground and background classes are extremely imbalanced during training. Therefore, our model uses the focal loss (FL) function [25] to mitigate this imbalance.
The focal loss function FL($p_t$), for classification score $p_t$, is expressed as follows:
$$FL(p_t) = -(1 - p_t)^{\gamma} \log(p_t),$$
where $(1 - p_t)^{\gamma}$ is the modulating factor with tunable focusing parameter $\gamma = 2$, and
$$p_t = \begin{cases} p & \text{if } y = 1,\\ 1 - p & \text{otherwise,} \end{cases}$$
where $y \in \{\pm 1\}$ specifies the ground-truth class and $p \in [0, 1]$ is the model's estimated probability for the class with label $y = 1$. As suggested in [20], we measure the difference between the predicted offsets and the ground-truth boxes using the bounding box regression loss function denoted by $L_1$. The total loss, $L_{total}$, is then expressed as a linear combination of $FL(p_t)$ and $L_1$ as follows:
$$L_{total} = \alpha \, FL(p_t) + \beta \, L_1,$$
where α and β are balancing terms. To determine the optimal values of the hyperparameters α, γ, and β, experiments were conducted using the ranges of values recommended in [25]. According to the experimental results in Table 1, the combination of α = 0.25, γ = 2, and β = 1 produced the best accuracy.
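A compact PyTorch sketch of this loss is given below (a simplified per-anchor formulation; the clamping constant is added only for numerical stability):

```python
import torch
import torch.nn.functional as F

def focal_loss(p, y, gamma=2.0):
    """Focal loss FL(p_t) = -(1 - p_t)^gamma * log(p_t).
    p: predicted smoke probability per anchor; y: ground-truth label (1 marks smoke)."""
    p_t = torch.where(y == 1, p, 1 - p)            # select p or 1 - p per anchor
    return -((1 - p_t) ** gamma) * torch.log(p_t.clamp(min=1e-8))

def total_loss(p, y, box_pred, box_target, alpha=0.25, gamma=2.0, beta=1.0):
    """Total loss alpha * FL(p_t) + beta * L1, using the best Table 1 setting
    alpha = 0.25, gamma = 2, beta = 1."""
    cls_loss = focal_loss(p, y, gamma).mean()
    box_loss = F.l1_loss(box_pred, box_target)     # bounding-box regression loss (L1)
    return alpha * cls_loss + beta * box_loss
```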

4. Experiments

4.1. Dataset

Large-scale benchmark datasets exist for general object detection, but no public forest fire/smoke dataset was found. Our primary focus is the detection of early-stage forest fires, where the only visible sign is smoke rather than flames; this distinguishes our work from other public fire detection benchmarks, which typically focus on detecting flames. The HPWREN dataset [34], which contains real-world imagery of early-stage forest fires collected since 2000, was considered suitable for this goal. Therefore, our initial dataset was created by extracting 2190 images from the HPWREN dataset, and it was then expanded to 4350 forest fire/smoke images by adding images from the Internet and from a forest fire surveillance system operated by a local company. Note that, since our model aims for real-time early detection of forest fires, satellite images are not currently included in the dataset due to the difficulty of real-time acquisition and their insufficient image quality.
Each image in the dataset was labelled and boxed using the Roboflow tool [35], and the images were then divided into a training set of 3915 images (90%) and a validation set of 435 images (10%). The dataset accounts for various forest smoke scenarios by including forest smoke images varying in fire intensity, time of day, smoke shape, etc.

4.2. Experimental Setup

The model was implemented in the PyTorch framework and then trained and evaluated on a computer with a GeForce RTX 3060 GPU. The training process took 60 epochs with a batch size of 6. The learning rate was initialized to 2.5 × 10⁻³ and then decreased by factors of 10 and 100 after 40 and 55 epochs, respectively.
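This schedule can be expressed with a standard PyTorch MultiStepLR scheduler; the optimizer type (SGD with momentum) and its momentum and weight-decay values are assumptions, since only the learning-rate schedule, batch size, and epoch count are stated above:

```python
import torch
import torch.nn as nn

# Placeholder module standing in for the detection model.
model = nn.Conv2d(3, 8, 3)
# SGD with momentum is an assumption; only the learning-rate schedule is given in the paper.
optimizer = torch.optim.SGD(model.parameters(), lr=2.5e-3, momentum=0.9, weight_decay=1e-4)
# Divide the learning rate by 10 at epoch 40 and again at epoch 55,
# i.e., 10 and 100 times lower than the initial 2.5e-3.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[40, 55], gamma=0.1)

for epoch in range(60):
    # ... run one training epoch with batch size 6 ...
    scheduler.step()
```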
Our model was compared with several existing models, including RetinaNet [25], YOLOv8 [26], YOLOv9 [27], YOLOv10 [28], Faster-RCNN [20], and SSD [29]. To ensure a fair evaluation, all models were implemented on the same dataset with consistent augmentation techniques, including flipping and rotation, to mitigate the risk of overfitting. Additionally, optimization methods, such as learning rate scheduling, optimizer selection, momentum, and weight decay, were applied across all models. Our backbone was also compared with other backbones like VGG16 [36], Convnext [37], EfficientNet [38], InceptionV1 [39], and InceptionV4 [40].

4.3. Evaluation Metrics

The well-known average precision (AP) metric from MS-COCO [41] is used to evaluate the performance of the model. AP indicates the area under a precision-recall curve, P(r), that plots the value of precision against recall for different confidence threshold values [42]. Thus, AP is expressed as follows:
$$AP = \int_{0}^{1} P(r)\, dr.$$
Precision indicates the ratio of correct predictions to all positive predictions, while Recall indicates the ratio of correct predictions to all labeled smoke instances. These two metrics are expressed as follows:
$$Precision = \frac{TP}{TP + FP}$$
and
$$Recall = \frac{TP}{TP + FN},$$
where True Positive (TP) indicates that the model predicted the presence of smoke (Positive) and was correct (True), False Positive (FP) indicates that the model predicted the presence of smoke (Positive) but was incorrect (False), True Negative (TN) indicates that the model predicted the absence of smoke (Negative) and was correct (True), and False Negative (FN) indicates that the model predicted the absence of smoke (Negative) but was incorrect (False).
In addition to AP, some other metrics are used to evaluate the performance of the model. AP50 and AP75 indicate the AP values at 50% and 75% IoU (Intersection over Union) thresholds, respectively, and APS, APM, and APL are AP values for small, medium, and large objects, respectively. GFLOPs (giga floating-point operations) and #Params (the number of parameters) are used to evaluate the computational complexity of the model, and FPS (frames per second) is used to evaluate detection speed.
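The following sketch shows how Precision, Recall, and a simplified AP (area under the precision-recall curve) can be computed; COCO-style AP additionally interpolates the curve and averages over IoU thresholds, which is omitted here:

```python
import numpy as np

def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP); Recall = TP / (TP + FN)."""
    return tp / (tp + fp), tp / (tp + fn)

def average_precision(precisions, recalls):
    """Area under the precision-recall curve P(r), approximated with the trapezoidal rule
    from (precision, recall) points sampled at different confidence thresholds."""
    order = np.argsort(recalls)
    r = np.concatenate(([0.0], np.asarray(recalls, dtype=float)[order]))
    p = np.concatenate(([1.0], np.asarray(precisions, dtype=float)[order]))
    return float(np.sum((r[1:] - r[:-1]) * (p[1:] + p[:-1]) / 2.0))

# Example with hypothetical counts: 27 true positives, 3 false positives, 13 false negatives.
print(precision_recall(tp=27, fp=3, fn=13))   # (0.9, 0.675)
```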

4.4. Experimental Results

According to Table 2, our model achieved the best values in AP and its sub-metrics while keeping #Params and GFLOPs low overall. RetinaNet achieved fairly competitive accuracy in APL, but it requires a much higher computational cost. Since RetinaNet uses a neck and head similar to ours, we can infer that the increase in computational cost comes from the backbone; specifically, RetinaNet has almost twice as many #Params and 6% higher GFLOPs than our model. The results also show that our model improves AP by approximately 9.6% while using fewer parameters and lower GFLOPs compared to Faster R-CNN, a two-stage object detection model that typically achieves high accuracy.
Meanwhile, our model requires slightly more computational load than the YOLO models, but it achieves much higher accuracy, especially when detecting small objects. Looking at the APS (average precision for small objects) metric, our model improves accuracy by 43.17% to 52.2% compared to the YOLO versions. Considering that the initial smoke plume is very small compared to the large monitored forest area, the proposed model is clearly beneficial for early forest fire detection.
Table 3 compares the performance of the proposed backbone with other popular backbones using the same Neck, Head, and other settings. Overall, the proposed backbone achieves the best AP values while keeping fairly favorable #Params and GFLOPs. VGG16 also achieved good AP values but with significantly higher #Params and GFLOPs. On the other hand, EfficientNet and InceptionV1 generated far fewer parameters but with significantly lower AP values of 44.0 and 41.2, respectively. It can be concluded that the proposed backbone achieves both efficiency and effectiveness for smoke detection, showing clear gains in AP with a reduced computational load compared to the other backbones.
Figure 8 shows the qualitative test results for 15 forest fire images numbered 1 to 15, and the class name and confidence value are given at the top of each bounding box. The proposed model was able to detect different shapes of smoke not only in the images with clear smoke such as images 1–4 but also in the monochrome or grayscale images captured by infrared cameras at night, as shown in images 6 and 7, or in the images with small or blurred smoke, like images 8, 9, and 11–14, which are difficult for humans to discern.
To show how well the attention mechanism works, heat maps generated with the Grad-CAM technique [43] for different models are compared in Figure 9. Note that YOLOv8s was selected as the best-performing YOLO variant. The first column shows four different smoke images, and each of the next five columns shows the heat map of the corresponding image when each model is applied. In the heat maps, hot colors such as red and yellow indicate high attention, while cool colors such as blue and green indicate low attention. The attention area of the heat map generated by our proposed model clearly depicts the shape of the original smoke plume much better than the other heat maps, and our model better suppresses the less relevant regions. For example, in the last smoke image, the attention region in our model's heat map has a shape very similar to that of the smoke.
In summary, the proposed model strikes a good balance by increasing the accuracy of forest smoke detection while keeping the number of parameters to a minimum. However, due to its relatively high GFLOPs, more training time is required.

4.5. Ablation Study

Finally, we performed an ablation study to investigate the effectiveness of using techniques such as splitting (Splitting), the depth-wise convolution of coordinate kernels (DW-Coord), and an attention mechanism (CBAM) on the basic backbone (Basic) of our proposed model. According to the experimental results in Table 4, Splitting significantly reduces the number of parameters (#Params) by dividing the feature map channels into smaller segments, while DW-Coord and CBAM improve AP by helping the model focus on extracting the important features of smoke.
Consequently, the proposed model with all the techniques applied was able to increase AP by 6% while reducing #Params and GFLOPs by 34% and 14%, respectively, compared to the base model.

5. Conclusions

This paper presented a CNN-based forest smoke detection model featuring a new backbone architecture designed to increase detection accuracy and reduce computational load. The backbone includes three key methods. It extracts object features through different views using kernels of varying sizes to better detect smoke plumes of different sizes. It uses the depth-wise convolution of coordinate kernels to better extract the features of smoke plumes spreading along the vertical dimension. It also employs an attention mechanism so that the model focuses on important features. The model was trained using 90% of a dataset containing 4350 forest fire or smoke images and validated using the remaining 10%. According to the experimental results, our model not only improves accuracy but also reduces computational load in early forest fire detection compared to existing models. Further research will focus on reducing FLOPs to improve training and inference speed and on optimizing the Neck and Head modules to enhance performance. The proposed model will also be examined to determine whether it can be applied to embedded systems with low computing power, such as surveillance cameras and drones.

Author Contributions

Conceptualization, H.O. and Q.-Q.H.; methodology, Q.-Q.H. and H.O.; software, Q.-Q.H.; validation, H.O. and Q.-L.H.; data curation, Q.-L.H. and H.O.; writing—original draft preparation, Q.-Q.H.; writing—review and editing, H.O.; visualization, Q.-Q.H.; supervision, H.O.; funding acquisition, H.O. All authors have read and agreed to the published version of the manuscript.

Funding

This result was supported by the “Regional Innovation Strategy (RIS)” through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (MOE) (2021RIS-003).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data used for our experiments can be found at https://drive.google.com/drive/folders/1l9qI_EzU4A8heXvlpyGJEdxGVP3pRhcL (accessed on 8 February 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Facts + Statistics: Wildfires. Available online: https://www.iii.org/fact-statistic/facts-statistics-wildfires (accessed on 1 April 2023).
  2. Chowdary, V.; Gupta, M.K. Automatic forest fire detection and monitoring techniques: A survey. In Intelligent Communication, Control and Devices: Proceedings of ICICCD 2017; Springer Nature: Singapore, 2018; pp. 1111–1117. [Google Scholar]
  3. Alkhatib, A.A.A. A review on forest fire detection techniques. Int. J. Distrib. Sens. Netw. 2014, 10, 597368. [Google Scholar] [CrossRef]
  4. Barmpoutis, P.; Papaioannou, P.; Dimitropoulos, K.; Grammalidis, N. A review on early forest fire detection systems using optical remote sensing. Sensors 2020, 20, 6442. [Google Scholar] [CrossRef] [PubMed]
  5. History of the Osborne Firefinder. Available online: https://www.fs.usda.gov/t-d/pubs/pdf/hi_res/03511311hi.pdf (accessed on 1 April 2023).
  6. Bouabdellah, K.; Noureddine, H.; Larbi, S. Using wireless sensor networks for reliable forest fires detection. Procedia Comput. Sci. 2013, 19, 794–801. [Google Scholar] [CrossRef]
  7. Gaur, A.; Singh, A.; Kumar, A.; Kumar, A.; Kapoor, K. Video flame and smoke based fire detection algorithms: A literature review. Fire Technol. 2020, 56, 1943–1980. [Google Scholar] [CrossRef]
  8. Chen, T.H.; Wu, P.H.; Chiou, Y.C. An early fire-detection method based on image processing. In Proceedings of the 2004 International Conference on Image Processing, ICIP’04, Singapore, 24–27 October 2004; IEEE: Piscataway, NJ, USA, 2004; Volume 3, pp. 1707–1710. [Google Scholar]
  9. Vipin, V. Image processing based forest fire detection. Int. J. Emerg. Technol. Adv. Eng. 2012, 2, 87–95. [Google Scholar]
  10. Yuan, C.; Liu, Z.; Zhang, Y. UAV-based forest fire detection and tracking using image processing techniques. In Proceedings of the 2015 International Conference on Unmanned Aircraft Systems (ICUAS), Denver, CO, USA, 9–12 June 2015; IEEE: Piscataway, NJ, USA, 2015; pp. 639–643. [Google Scholar]
  11. Zhang, Z.; Zhao, J.; Zhang, D.; Qu, C.; Ke, Y.; Cai, B. Contour based forest fire detection using FFT and wavelet. In Proceedings of the 2008 International Conference on Computer Science and Software Engineering, Wuhan, China, 12–14 December 2008; IEEE: Piscataway, NJ, USA, 2008; Volume 1, pp. 760–763. [Google Scholar]
  12. Foggia, P.; Saggese, A.; Vento, M. Real-time fire detection for video-surveillance applications using a combination of experts based on color, shape, and motion. IEEE Trans. Circuits Syst. Video Technol. 2015, 25, 1545–1556. [Google Scholar] [CrossRef]
  13. Mahmoud, M.A.; Ren, H. Forest fire detection using a rule-based image processing algorithm and temporal variation. Math. Probl. Eng. 2018, 2018, 7612487. [Google Scholar] [CrossRef]
  14. Wang, S.; Chen, T.; Lv, X.; Zhao, J.; Zou, X.; Zhao, X.; Xiao, M.; Wei, H. Forest fire detection based on lightweight Yolo. In Proceedings of the 2021 33rd Chinese Control and Decision Conference (CCDC), Kunming, China, 22–24 May 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 1560–1565. [Google Scholar]
  15. Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y.M. Yolov4: Optimal speed and accuracy of object detection. arXiv 2020, arXiv:2004.10934. [Google Scholar]
  16. Howard, A.; Sandler, M.; Chu, G.; Chen, L.C.; Chen, B.; Tan, M.; Wang, W.; Zhu, Y.; Pang, R.; Vasudevan, V.; et al. Searching for mobilenetv3. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27–28 October 2019; pp. 1314–1324. [Google Scholar]
  17. Jiao, Z.; Zhang, Y.; Xin, J.; Mu, L.; Yi, Y.; Liu, H.; Liu, D. A deep learning based forest fire detection approach using UAV and YOLOv3. In Proceedings of the 2019 1st International Conference on Industrial Artificial Intelligence (IAI), Shenyang, China, 23–27 July 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 1–5. [Google Scholar]
  18. Redmon, J.; Farhadi, A. Yolov3: An incremental improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar]
  19. Zhang, Q.X.; Lin, G.H.; Zhang, Y.M.; Xu, G.; Wang, J.T. Wildland forest fire smoke detection based on faster R-CNN using synthetic smoke images. Procedia Eng. 2018, 211, 441–446. [Google Scholar] [CrossRef]
  20. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. arXiv 2015, arXiv:1506.01497. [Google Scholar] [CrossRef]
  21. Vani, K. Deep learning based forest fire classification and detection in satellite images. In Proceedings of the 2019 11th International Conference on Advanced Computing (ICoAC), Chennai, India, 18–20 December 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 61–65. [Google Scholar]
  22. Szegedy, C.; Ioffe, S.; Shlens, J.; Wojna, Z. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 2818–2826. [Google Scholar]
  23. Meena, U.; Munjal, G.; Sachdeva, S.; Garg, P.; Dagar, D.; Gangal, A. RCNN Architecture for Forest Fire Detection. In Proceedings of the 2023 13th International Conference on Cloud Computing, Data Science & Engineering (Confluence), Noida, India, 19–20 January 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 699–704. [Google Scholar]
  24. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587. [Google Scholar]
  25. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar]
  26. Glenn, J. Yolov8. Available online: https://github.com/ultralytics/ultralytics/tree/main (accessed on 1 May 2024).
  27. Wang, C.Y.; Yeh, I.H.; Liao, H.Y.M. Yolov9: Learning what you want to learn using programmable gradient information. arXiv 2024, arXiv:2402.13616. [Google Scholar]
  28. Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J.; Ding, G. YOLOv10: Real-Time End-to-End Object Detection. arXiv 2024, arXiv:2405.14458. [Google Scholar]
  29. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. Ssd: Single shot multibox detector. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part I 14. Springer International Publishing: Cham, Switzerland, 2016; pp. 21–37. [Google Scholar]
  30. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125. [Google Scholar]
  31. Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; et al. Imagenet large scale visual recognition challenge. Int. J. Comput. Vis. 2015, 115, 211–252. [Google Scholar] [CrossRef]
  32. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  33. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  34. High Performance Wireless Research and Education Network. Available online: http://hpwren.ucsd.edu/index.html (accessed on 1 April 2023).
  35. Roboflow. Available online: https://roboflow.com/ (accessed on 1 March 2023).
  36. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
  37. Liu, Z.; Mao, H.; Wu, C.Y.; Feichtenhofer, C.; Darrell, T.; Xie, S. A convnet for the 2020s. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 11976–11986. [Google Scholar]
  38. Tan, M.; Le, Q. Efficientnet: Rethinking model scaling for convolutional neural networks. In Proceedings of the International Conference on Machine Learning, PMLR, Long Beach, CA, USA, 9–15 June 2019; pp. 6105–6114. [Google Scholar]
  39. Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1–9. [Google Scholar]
  40. Szegedy, C.; Ioffe, S.; Vanhoucke, V.; Alemi, A. Inception-v4, inception-resnet and the impact of residual connections on learning. In Proceedings of the AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017; Volume 31. [Google Scholar]
  41. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft coco: Common objects in context. In Proceedings of the Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, 6–12 September 2014; Proceedings, Part V 13. Springer International Publishing: Cham, Switzerland, 2014; pp. 740–755. [Google Scholar]
  42. Padilla, R.; Netto, S.L.; Silva, E.A.D. A survey on performance metrics for object-detection algorithms. In Proceedings of the 2020 International Conference on Systems, Signals and Image Processing (IWSSIP), Niteroi, Brazil, 1–3 July 2020; IEEE: Piscataway, NJ, USA, 2022; pp. 237–242. [Google Scholar]
  43. Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-cam: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 618–626. [Google Scholar]
Figure 1. The architecture of the forest fire detection model.
Figure 2. Reduction in the number of parameters by using different sized kernels.
Figure 3. The smoke features tend to be vertically distributed through the layers.
Figure 4. The proposed Backbone structure for forest fire detection.
Figure 5. CBAM architecture.
Figure 6. The Neck architecture.
Figure 7. The Head architecture.
Figure 8. Qualitative test results for 15 forest fire images numbered 1 to 15, with the class name and confidence value given at the top of each bounding box.
Figure 9. Heat maps of the images to which different models are applied.
Table 1. The result of experiments using various α, γ, and β values.
γ | α | β | AP | AP50 | AP75 | APS | APM | APL
0 | 0.75 | 1 | 48.3 | 79.7 | 46.2 | 22.6 | 44.6 | 80.4
0.1 | 0.75 | 1 | 49.4 | 82.8 | 50.2 | 25.3 | 45.4 | 80.5
0.2 | 0.75 | 1 | 50.6 | 84.2 | 47.0 | 25.0 | 47.9 | 84.2
0.5 | 0.5 | 1 | 50.2 | 82.9 | 48.0 | 25.8 | 45.3 | 83.8
1 | 0.25 | 1 | 51.6 | 85.6 | 49.8 | 26.7 | 48.8 | 83.8
2 | 0.25 | 1 | 52.9 | 85.7 | 53.3 | 27.8 | 50.2 | 85.8
5 | 0.25 | 1 | 52.3 | 82.8 | 50.4 | 26.6 | 50.3 | 85.0
Table 2. Performance comparison of our model and other models.
Model | AP | AP50 | AP75 | APS | APM | APL | #Params (Millions) | GFLOPs | FPS
Our model | 52.9 | 85.7 | 53.3 | 27.8 | 50.2 | 85.8 | 18.6 | 120.6 | 21.5
RetinaNet [25] | 50.8 | 82.1 | 49.3 | 24.3 | 46.6 | 85.6 | 36.1 | 127.8 | 20.4
YOLOv8s [26] | 49.7 | 77.0 | 51.1 | 14.4 | 45.9 | 77.0 | 11.2 | 28.6 | 102.0
YOLOv8m [26] | 48.8 | 76.5 | 49.2 | 13.4 | 43.9 | 76.4 | 25.9 | 78.9 | 39.8
YOLOv9s [27] | 49.1 | 76.1 | 50.6 | 14.0 | 43.1 | 76.6 | 7.1 | 26.4 | 79.5
YOLOv9m [27] | 49.5 | 77.2 | 51.2 | 14.3 | 44.2 | 77.0 | 20.1 | 76.3 | 37.9
YOLOv10s [28] | 48.2 | 75.4 | 48.5 | 15.8 | 40.7 | 75.9 | 7.2 | 21.6 | 88.5
YOLOv10m [28] | 47.6 | 74.3 | 47.4 | 13.3 | 38.4 | 76.5 | 15.4 | 59.1 | 40.3
Faster-RCNN [20] | 48.3 | 79.5 | 46.7 | 27.5 | 45.3 | 78.3 | 41.1 | 134.4 | 17.7
SSD [29] | 43.0 | 77.8 | 42.4 | 21.6 | 47.1 | 70.8 | 24.4 | 214.2 | 17.4
Table 3. Performance comparison of proposed backbone and other backbones.
Backbone | AP | AP50 | AP75 | APS | APM | APL | #Params (Millions) | GFLOPs | FPS
Our model | 52.9 | 85.7 | 53.3 | 27.8 | 50.2 | 85.8 | 18.61 | 120.63 | 21.5
VGG16 [36] | 49.7 | 83.7 | 48.8 | 25.6 | 45.5 | 82.6 | 142.93 | 331.82 | 12.2
Convnext [37] | 48.0 | 81.0 | 46.3 | 19.7 | 45.6 | 78.9 | 19.61 | 90.11 | 22.0
EfficientNet [38] | 44.0 | 70.9 | 42.0 | 17.2 | 40.7 | 73.1 | 14.58 | 25.75 | 26.1
Inceptionv1 [39] | 41.2 | 69.4 | 40.4 | 9.6 | 33.8 | 82.1 | 16.13 | 52.25 | 23.8
Inceptionv4 [40] | 41.0 | 66.4 | 40.3 | 7.5 | 39.2 | 82.0 | 52.92 | 120.43 | 21.0
Table 4. Ablation study on backbone modules with different techniques.
Basic | Splitting | DW-Coord | CBAM | AP | #Params (Million) | GFLOPs
√ |  |  |  | 49.9 | 28.21 | 140.49
√ | √ |  |  | 50.7 | 20.93 | 125.55
√ | √ | √ |  | 52.6 | 18.52 | 120.62
√ | √ | √ | √ | 52.9 | 18.61 | 120.63
√ indicates the inclusion of a specific technique in the backbone architecture.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
