FD-Net: A Single-Stage Fire Detection Framework for Remote Sensing in Complex Environments

Yuan, Jianye; Wang, Haofei; Li, Minghao; Wang, Xiaohan; Song, Weiwei; Li, Song; Gong, Wei

doi:10.3390/rs16183382


by Jianye Yuan ¹, Haofei Wang ²,*, Minghao Li ³, Xiaohan Wang ³, Weiwei Song ², Song Li ¹ and Wei Gong ¹

¹ Electronic Information School, Wuhan University, Wuhan 430072, China

² Peng Cheng Laboratory, Department of Mathematics and Theories, Shenzhen 518000, China

³ School of Remote Sensing and Information Engineering, Wuhan University, Wuhan 430079, China

* Author to whom correspondence should be addressed.

Remote Sens. 2024, 16(18), 3382; https://doi.org/10.3390/rs16183382

Submission received: 5 July 2024 / Revised: 6 September 2024 / Accepted: 9 September 2024 / Published: 11 September 2024

(This article belongs to the Special Issue Time-Series Mapping and Analysis of Land Surface Parameters and Changes Using Remote Sensing Data)


Abstract

Fire detection is crucial because fire-related incidents take an exorbitant annual toll on both human lives and the economy. To enhance forest fire detection in complex environments, we propose a new algorithm called FD-Net. Firstly, to improve detection performance, we introduce a Fire Attention (FA) mechanism that utilizes the position information from feature maps. Secondly, to prevent geometric distortion during image cropping, we propose a Three-Scale Pooling (TSP) module. Lastly, we fine-tune the YOLOv5 network and incorporate a new Fire Fusion (FF) module to enhance the network’s precision in identifying fire targets. Through qualitative and quantitative comparisons, we found that FD-Net outperforms current state-of-the-art algorithms on both fire and fire-and-smoke datasets. This further demonstrates FD-Net’s effectiveness for application in fire detection.

Keywords: fire detection; FA mechanism; TSP module; FF module

1. Introduction

In recent years, deep learning has made remarkable strides across various domains, encompassing image classification [1,2], object detection [3], and object segmentation [4]. Within the realm of fire detection, deep learning has proven instrumental, particularly in tackling challenging environments characterized by small targets and occlusions. The fire detection process follows a similar approach to target detection methods that ascertain the location of objects in an image before classification. The initial step involves pinpointing the target instance’s location within the image, which facilitates the accurate subsequent localization of the fire.

Deep Neural Networks (DNNs), also known as Deep Feedforward Neural Networks (DFNNs), serve as the fundamental building blocks of deep learning [5]. Initially, DNNs operated as linear models, failing to capture complex nonlinear structures or extract contextual features. Consequently, they faded from prominence in both academic and industrial circles. However, continuous optimization introduced multiple hidden layers to enable the modeling of nonlinear relationships and the incorporation of Recurrent Neural Networks (RNNs) [6] as activation functions. As a result, DNNs regained favor among researchers and practitioners. Despite this resurgence, issues such as gradient vanishing and explosion arose due to the intricate internal structure of DNNs. To address these challenges, Long Short-Term Memory (LSTM) [7] emerged as an effective solution. However, it impeded the speed of target detection. Remote sensing of forest fires has become a research hotspot and poses unique challenges due to the vast areas covered and the complexity of imagery. In light of this, we propose a novel approach: a multi-layer, fast-reading deep learning technique for target detection.

Identifying distinct fire instances within a single image presents significant challenges. This stems from multiple factors, including the small size of fire targets in remote sensing images, the translucent nature of certain fires, complex backgrounds, and the intense reflections that can lead to false detections. Moreover, the presence of mangroves and red leaves with fire-like properties further complicates the fire detection process. To address these intricate circumstances, we introduce FD-Net, a novel fire detection algorithm tailored for such scenarios. To facilitate a clear understanding of the complex environmental dataset in this paper, Figure 1 was produced. It demonstrates that this dataset encompasses (a) small objects, (b) negative examples, and (c) instances of fire obstructed by complex environments. It is worth noting that the difficulty encountered in target detection remains a pressing issue in the field.

In this study, our focus lies on analyzing data samples in intricate contexts, encompassing negative samples, small object fires, and occluded fires. Subsequently, we conducted a comprehensive array of experimental tests, revealing the superior suitability of our method for fire target detection across diverse scenarios. The key contributions of our research are outlined as follows:

I. We collected two large-scale fire detection datasets.
II. We proposed a new attention mechanism called the Fire Attention (FA) mechanism.
III. We developed the Three-Scale Pooling (TSP) module to correct geometric distortion in fire images.
IV. We fine-tuned the “Neck” of the YOLOv5 network and proposed a new Fire Fusion (FF) module to enhance the precision of fire image detection.

2. Related Works

With the rapid development of deep learning, numerous methods for object detection have emerged, and fire detection, in particular, has attracted increasing attention and research efforts. Deep-learning-based fire detection methods can be broadly categorized into three types: one-stage methods, two-stage methods, and other methods. In the following, we introduce these three types separately.

2.1. One-Stage Methods

One-stage methods generally perform dense object detection and classification on input images without explicitly generating candidate bounding boxes. Common one-stage object detection algorithms include the YOLO series (such as YOLOv4 [8] and YOLOv5 [9]), Single Shot MultiBox Detector (SSD) [10], and RetinaNet [11]. Due to their design, these methods are often less complex and faster, making them suitable for deployment on mobile devices and embedded systems. However, one-stage methods often have lower detection accuracy and higher rates of missed detections and false alarms. In a study [12], an improved YOLOv5 fire detection framework was proposed for smart city construction. Although such methods partially address the challenges in fire detection, their detection performance remains unsatisfactory and has to be validated in specific environments, which is far from the desired results. Another study [13] combined cat swarm optimization with an LSTM model and proposed an IoT-based forest fire detection system, achieving an accuracy of 98.6%. Another study [14] introduced a new one-stage fire detection algorithm, FireNet-v2, which improved lightweight fire detection models and achieved an overall accuracy of 98.43% on the Foggia dataset [15]. While both methods partially improve the detection accuracy, they were tested on specific datasets and may not represent real-world detection scenarios. In light of this, our proposed FD-Net algorithm achieves optimal results in detecting fires in complex environments.

2.2. Two-Stage Methods

Two-stage methods refer to fire detection in two stages. These methods involve using the input fire images to generate region proposals, and then the network classifies and regresses the positions of these proposals. As a result of this two-step process, these methods often exhibit higher accuracy and are suitable for detecting small objects and complex scenes. Common two-stage methods include Faster R-CNN [16], Mask R-CNN [17], and Cascade R-CNN [18]. For example, a study [19] proposed the two-stage model EdgeFireSmoke++ by integrating human–computer IoT technology and applied it to forest fire detection. Another study [20] compared and analyzed five visual-based fire detection systems and proposed a fire detection system based on the LUV color space and hybrid transformations, incorporating motion and frame region functionalities. These methods partially improve the efficiency in fire detection and demonstrate good performance in detecting fires in complex environments. However, they still suffer from slow detection speed. The rapid spread of fires often occurs in an instant, making detection speed crucial in fire detection.

2.3. Other Methods

In addition to the aforementioned one-stage and two-stage methods, there are other approaches that utilize machine learning or special techniques for fire detection. For example, CornerNet [21] performs object detection by predicting the coordinates of object corners. CenterNet [22] detects objects by predicting the coordinates of object centers and the size of bounding boxes. There are also object detection methods based on the EfficientNet [23] backbone network, which aims to balance accuracy and computational efficiency through composite scaling coefficients. A study [24] proposes a multi-scale fusion fire detection method using Gaussian Mixture Models. Another study [25] applied Particle Swarm Optimization (PSO) and K-Medoids clustering for fire and flame detection, achieving better qualitative and quantitative detection results. Another approach [26] utilized a Support Vector Machine (SVM) for fire detection and achieved good detection performance by collecting internet resources and recording fire videos to create a dataset. In another study [27], fire fractal features in the spatial domain were obtained through performance analysis of fire scenes. Then, rough set theory and trend fusion intelligent methods were used to obtain signal fractal features. By processing the signals detected by a group of fire detectors and recognizing the logical overlap relationship between two characters, a genuine fire alarm was finally generated. While these methods help to generate a fire alarm, they cannot effectively detect fires in complex environments, and they ignore the detection of small flame targets.

3. Proposed Methods

To facilitate a comprehensive understanding of each model presented in this paper, the following sections will individually introduce the FA mechanism, TSP module, and FF module. The FA mechanism primarily extracts features from “flame” information from various perspectives. The TSP module transforms the extracted features, thereby enhancing the network’s attention to “flame” information. The FF module is utilized to further process and leverage the features extracted from the TSP module for the purpose of “flame” task learning. These three modules play a pivotal role in bridging the different stages of the research.

3.1. FA Mechanism

Deep Neural Networks offer a solution to information overload by allocating resources to more critical but less resource-demanding tasks. Attention mechanism modules, such as Squeeze-and-Excitation (SE) [28], Coordinate Attention (CA) [29], Efficient Channel Attention (ECA) [30], and the Convolutional Block Attention Module (CBAM) [31], are commonly used to improve the efficiency of task processing. Considering the inherent characteristics of fire images, such as the presence of small fire targets, we present a novel attention mechanism named Fire Attention (FA). This mechanism is tailored for fire scenes and leverages location information in the images. Its primary objective is to enhance the feature representation capability of the network based on the location information of fire incidents. The mechanism can transform any feature tensor

X = [x_{1}, x_{2}, x_{3}, \dots, x_{C}] \in \mathbb{R}^{H \times W \times C}

within the network into a transformed feature tensor

Y = [y_{1}, y_{2}, y_{3}, \dots, y_{C}] \in \mathbb{R}^{H \times W \times C}

of the same dimension. In the following, we describe the architectural details of the FA mechanism. Figure 2 displays the model structure of the FA module.

To capture information along both the height and width directions of the feature map, global average pooling (Avg pool) is first applied along each direction. The output of the $c$-th channel at width $w$ is $z_{c}^{w}(w) = \frac{1}{H} \sum_{0 \leq i < H} x_{c}(i, w)$, and its output at height $h$ is $z_{c}^{h}(h) = \frac{1}{W} \sum_{0 \leq j < W} x_{c}(h, j)$, where $c$ is the channel index, $x_{c}$ denotes the input value, $H$ and $W$ are the height and width of the feature map, and $i$ and $j$ index its rows and columns. The two pooled descriptors $z^{h}$ and $z^{w}$ are then concatenated. The image information $f = \partial(F([z^{h}, z^{w}]))$ is obtained by feature transformation using a $1 \times 1$ convolution kernel, BatchNorm (BN), and a non-linearity (NL), where $z^{h}$ is the height information of the feature map, $z^{w}$ is its width information, $F$ is the output of the preceding layer’s convolution, and $\partial$ is the convolution operation. The feature map $f$ is then split into height information $f^{h} \in \mathbb{R}^{C/r \times H}$ and width information $f^{w} \in \mathbb{R}^{C/r \times W}$, where $r$ is the reduction factor. After that, each part is transformed using a $1 \times 1$ convolution and the Softmax function [32] to obtain the height-dimension feature information $G^{h} = \partial(F_{h}(f^{h}))$, where $F_{h}$ denotes the height output of the preceding convolutional layer, and the width-dimension feature information $G^{w} = \partial(F_{w}(f^{w}))$, where $F_{w}$ denotes the width output of the preceding convolutional layer. Finally, the input is combined with these weight matrices to obtain the output of the FA module, $Q_{c}(i, j) = x_{c}(i, j) \times G_{c}^{h}(i) \times G_{c}^{w}(j)$, where $x_{c}(i, j)$ is the value of the $c$-th channel of the feature map at position $(i, j)$, $G_{c}^{h}(i)$ is the height weight of the $c$-th channel at row $i$, and $G_{c}^{w}(j)$ is the width weight of the $c$-th channel at column $j$.

To enhance the network’s ability to accurately locate target information and improve its recognition capacity, the weight values of the FA module incorporate both the channel information and the position information of the feature map. The position information comprises the height and width information of the feature map, allowing the FA module to attend to the relevant channels and spatial locations. By considering both channel and position information, the FA module refines the network’s feature representation and facilitates more precise target localization.
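The computation described above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the 1 × 1 convolutions are written as matrix products with hypothetical weight matrices `w_reduce`, `w_h`, and `w_w`; BatchNorm is omitted; ReLU stands in for the unspecified non-linearity; and Softmax is assumed to act along the spatial axis.

```python
import numpy as np

def softmax(a, axis):
    """Numerically stable softmax along one axis."""
    e = np.exp(a - a.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def fire_attention_sketch(x, w_reduce, w_h, w_w):
    """Position-aware reweighting of a feature map.

    x:        (C, H, W) input feature map
    w_reduce: (C // r, C) weights of the shared 1x1 convolution (hypothetical)
    w_h, w_w: (C, C // r) weights of the two 1x1 convolutions (hypothetical)
    """
    C, H, W = x.shape
    z_h = x.mean(axis=2)                      # (C, H): average over the width
    z_w = x.mean(axis=1)                      # (C, W): average over the height
    z = np.concatenate([z_h, z_w], axis=1)    # (C, H + W)
    f = np.maximum(w_reduce @ z, 0.0)         # 1x1 conv + ReLU (assumed NL), BN omitted
    f_h, f_w = f[:, :H], f[:, H:]             # (C/r, H) and (C/r, W)
    g_h = softmax(w_h @ f_h, axis=1)          # (C, H) height weights
    g_w = softmax(w_w @ f_w, axis=1)          # (C, W) width weights
    # Q_c(i, j) = x_c(i, j) * G_c^h(i) * G_c^w(j)
    return x * g_h[:, :, None] * g_w[:, None, :]
```

The output has the same shape as the input, which matches the statement that X and Y share the same dimension.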

3.2. TSP Module

CNNs require fixed-size input before connecting to fully connected layers, which usually causes geometric distortions when images are cropped or warped to fit the fixed size. To address this issue, we propose a method called TSP to mitigate geometric distortions in fire images. Its principle is to divide the input fire feature map with dimensions $W \times H$ into grids at three different scales ($1 \times 1$, $2 \times 2$, and $4 \times 4$). Each scale feature map is then subjected to max pooling [33], average pooling [34], and random pooling [35] operations to obtain three fire images of different scales. A total of 21 fire feature maps are extracted (1 + 4 + 16 = 21 grid cells), and a Concat operation is performed on the three fire images of different scales. Finally, max pooling is applied to the content of each grid. The network architecture is illustrated in Figure 3.
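The grid division can be illustrated with a short NumPy sketch in the style of spatial pyramid pooling. It is a simplified reading of the description, not the authors' code: only the final max pooling per grid cell is shown, the average and random pooling branches are left out, and the function name `three_scale_pool` is hypothetical.

```python
import numpy as np

def three_scale_pool(x, grids=(1, 2, 4)):
    """Pool a (C, H, W) feature map into a fixed-length (C, 21) descriptor.

    Each grid size g splits the map into g x g cells; one max value is kept
    per cell and channel, giving 1 + 4 + 16 = 21 values per channel.
    Assumes H and W are at least max(grids) so that no cell is empty.
    """
    C, H, W = x.shape
    feats = []
    for g in grids:
        rows = np.linspace(0, H, g + 1).astype(int)   # cell boundaries along height
        cols = np.linspace(0, W, g + 1).astype(int)   # cell boundaries along width
        for i in range(g):
            for j in range(g):
                cell = x[:, rows[i]:rows[i + 1], cols[j]:cols[j + 1]]
                feats.append(cell.max(axis=(1, 2)))   # max pooling inside the cell
    return np.stack(feats, axis=1)
```

Because the number of cells is fixed, inputs of different spatial sizes map to descriptors of the same length, which is why no cropping or warping is needed.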

3.3. FF Module

The fusion of image features primarily takes place in the “Neck” part of the object detection framework, which also plays a role in enhancing contextual information, feature dimension reduction, and multi-scale feature extraction. The “Neck” module is typically positioned after the backbone to better utilize the features extracted by the backbone. Compared to the backbone module, the “Neck” module focuses on the fusion of more complex image information. In our approach, we fine-tuned YOLOv5 and employed a combination of CBS, upsampling, CSP2_1, and Conv layers. We also used Concatenate to connect the upstream output data. For upsampling, the nearest-neighbor method was utilized [36]. Instead of the state-of-the-art YOLOv5 “Neck” module, we proposed a new module called FF, which is illustrated in Figure 4.
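The upsample-then-concatenate step used in the neck can be sketched as follows. This is only an illustration of nearest-neighbor upsampling followed by channel-wise concatenation; the function names are hypothetical, and the CBS and CSP2_1 blocks of the FF module are not modeled.

```python
import numpy as np

def upsample_nearest(x, scale=2):
    """Nearest-neighbor upsampling of a (C, H, W) map by an integer factor."""
    return x.repeat(scale, axis=1).repeat(scale, axis=2)

def fuse(deep, shallow):
    """Upsample the deeper (coarser) map and concatenate it with the shallower one."""
    up = upsample_nearest(deep)
    return np.concatenate([up, shallow], axis=0)   # stack along the channel axis
```

For example, fusing a (1, 2, 2) deep map with a (3, 4, 4) shallow map gives a (4, 4, 4) result.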

3.4. Prediction Head

The main task of the detection module of the presented algorithm is to classify and locate flames of varying sizes within the “Detect” framework. It comprises essential components such as convolutional layers for feature extraction, pooling layers for dimensionality reduction, and fully connected layers for high-level feature integration. The anchor mechanism defines the candidate flame dimensions used to establish bounding boxes in the prediction stage. The classification component uses the Softmax function to categorize the detected bounding boxes. Furthermore, the regression component refines the bounding boxes to precisely determine the spatial extent and dimensions of the flames. The predicted positions and categories of the objects are the key outputs of the Prediction Head. Our fire detection algorithm (FD-Net) produces three sets of output values, each of which yields a set of bounding boxes for identifying fire objects of various sizes. Figure 5 displays the precise head output prediction values.
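As a rough guide to what three sets of output values look like, the sketch below computes the output tensor shapes of a YOLOv5-style head. The strides (8, 16, 32), three anchors per scale, and the 5 + classes layout per anchor are assumptions carried over from the standard YOLOv5 design; the actual values used by FD-Net are those shown in Figure 5.

```python
def head_output_shapes(img_size=640, num_classes=1,
                       anchors_per_scale=3, strides=(8, 16, 32)):
    """Return (grid_h, grid_w, channels) for each detection scale."""
    # Each anchor predicts x, y, w, h, objectness, plus one score per class.
    per_anchor = 5 + num_classes
    return [(img_size // s, img_size // s, anchors_per_scale * per_anchor)
            for s in strides]
```

Under these assumptions, a 640 × 640 input with a single fire class yields grids of 80 × 80, 40 × 40, and 20 × 20 cells with 18 channels each; the finest grid serves the smallest targets.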

To better illustrate the architecture of the fire detection framework FD-Net, we present FD-Net in Figure 6.

4. Experimental Results

4.1. Experimental Settings

Input image sizes differ across detectors: 1000 × 600 pixels for Faster R-CNN [16]; 300 × 300 or 512 × 512 pixels for SSD; 416 × 416 or 544 × 544 pixels for YOLO; and 512 × 512 pixels for CenterNet [22]. Taking the image properties into account, we chose the default input image size of 640 × 640 pixels.

In reference to the definition of aerial photography of forest fires, the datasets created in this paper for complex environments are mainly characterized by: (1) darker flames: due to the burning to the deep layers of trees, the color of the flames deepens, with parts showing red or orange; (2) darker smoke: the smoke spreads and becomes denser, with colors appearing black or dark gray and some parts showing lower smoke clouds; (3) an overall lower green ratio: the extensive burning of vegetation significantly reduces the green ratio in the images; (4) more image occlusions: there are more occlusions due to a large amount of smoke, suspended particles, dust, and fallen trees or residual fire; (5) visual interference images: the fire scene environments such as ground fires and crown fires are similar, which can easily lead to false detection; (6) environmental change images: environmental conditions such as lighting and weather change rapidly; (7) longer fire lines overall: the fire spreads continuously, with an increase in flame height and a noticeable increase in the length of the fire line.

Our dataset comprises a total of 10,284 images obtained from remote sensing, encompassing small objects, fires, and diverse backgrounds with varying scales of complexity. To provide a visual representation of the dataset, we present the bounding boxes in Figure 7. Figure 7a illustrates that the majority of fire objects are concentrated in the central region [0.5, 0.5], while a smaller number are concentrated in the region [0.6, 0.7]. Furthermore, Figure 7b demonstrates that there is a significant number of small objects in the dataset, catering to the need for complex experimental data involving small objects and occluded objects. This further validates the suitability of our dataset for conducting comprehensive and diverse experiments.

The experiments were run on an Ubuntu 20.04 LTS machine equipped with an NVIDIA GeForce RTX 3080 Ti GPU (12 GB), Python 3.8.11, and PyTorch 1.7.1+cu110. Adaptive Moment Estimation (Adam) [37] was used for optimization, with an initial learning rate of 0.01, a final OneCycle LR learning rate of 0.1, a weight decay of 0.0005, a momentum of 0.937, 300 epochs, and a batch size of 64.
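One way to read these learning-rate settings is sketched below. It assumes the convention of the YOLOv5 hyperparameter files, where the value 0.1 is a final factor applied to the initial rate through a cosine curve; the text does not state this explicitly, so the function and its parameter names (`lr0`, `lrf`) are illustrative only.

```python
import math

def one_cycle_lr(epoch, epochs=300, lr0=0.01, lrf=0.1):
    """Cosine decay from lr0 at epoch 0 to lr0 * lrf at the last epoch."""
    factor = ((1 - math.cos(epoch * math.pi / epochs)) / 2) * (lrf - 1) + 1
    return lr0 * factor
```

Under this reading, training starts at a learning rate of 0.01 and ends near 0.001 after 300 epochs.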

The primary objective of data augmentation is to expand the dataset, thereby enhancing the model’s ability to capture the contextual characteristics of images and improving its feature representation. Various techniques are employed for data augmentation, including optical variation augmentation [38] (e.g., adjusting image hue, saturation, and value), geometric variation augmentation [39] (e.g., random scaling, cropping, translation, shearing, and rotation), and other methods. Popular data augmentation techniques include mixup [40], cutmix [41], and mosaic [42]. Mosaic augmentation, as used by YOLOv4 [8], has gained significant popularity due to its effectiveness and efficiency. In our training process, we applied mosaic augmentation to the training set.

4.2. K-Means

In order to optimize the performance of the FD-Net algorithm, we applied the K-means algorithm [43] to cluster the two experimental datasets. It is widely recognized that increasing the value of K can enhance the algorithm’s accuracy. However, this also leads to an increase in the number of anchor boxes, potentially introducing redundancy and slowing down the calculation. Figure 8 illustrates that the curve slope remains relatively flat when the anchor value is set to 9. Therefore, we have determined that setting the number of anchors to 9, with precise settings of (25, 37), (55, 75), (79, 142), (134, 84), (138, 198), (190, 323), (277, 185), (367, 367), and (607, 510), strikes an optimal balance between accuracy and computational efficiency for our algorithm.
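Anchor clustering of this kind can be sketched in plain Python as follows. The text only states that K-means [43] was applied; using 1 − IoU between box shapes as the distance, as is common for YOLO anchors, and the deterministic initialization are assumptions made for illustration.

```python
def iou_wh(a, b):
    """IoU of two boxes given as (width, height), aligned at one corner."""
    inter = min(a[0], b[0]) * min(a[1], b[1])
    return inter / (a[0] * a[1] + b[0] * b[1] - inter)

def kmeans_anchors(boxes, k, iters=50):
    """Cluster (width, height) pairs into k anchors using IoU similarity."""
    # Deterministic start: k boxes evenly spaced after sorting by area.
    ordered = sorted(boxes, key=lambda b: b[0] * b[1])
    step = len(ordered) / k
    centers = [ordered[int(i * step + step / 2)] for i in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for b in boxes:
            best = max(range(k), key=lambda c: iou_wh(b, centers[c]))
            groups[best].append(b)
        new_centers = [
            (sum(b[0] for b in g) / len(g), sum(b[1] for b in g) / len(g)) if g else c
            for c, g in zip(centers, groups)
        ]
        if new_centers == centers:   # assignments are stable, stop early
            break
        centers = new_centers
    return sorted(centers, key=lambda c: c[0] * c[1])

def mean_best_iou(boxes, anchors):
    """Average IoU between each box and its best-matching anchor."""
    return sum(max(iou_wh(b, a) for a in anchors) for b in boxes) / len(boxes)
```

Plotting `mean_best_iou` against k gives the kind of curve from which a value such as k = 9 is chosen once the gain from adding anchors flattens.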

4.3. Evaluation Metrics

In our evaluation, we considered several metrics, including mean average precision (mAP), giga floating-point operations (GFLOPs), weight, and frames per second (FPS). To calculate the GFLOPs, we employed the FLOPs formula in Equation (1). Bias terms were not taken into account in these calculations.

\mathrm{GFLOPs} = F_{\mathrm{height}} \times F_{\mathrm{width}} \times C_{\mathrm{out}} \times C_{\mathrm{in}} \times K_{\mathrm{width}} \times K_{\mathrm{height}} \times 10^{-9}

(1)

In the formula, $F_{\mathrm{height}}$ and $F_{\mathrm{width}}$ represent the height and width of the output feature map, respectively; $C_{\mathrm{out}}$ and $C_{\mathrm{in}}$ represent the number of channels of the output and input feature maps, respectively; $K_{\mathrm{width}}$ and $K_{\mathrm{height}}$ represent the width and height of the convolution kernel, respectively; and the factor $10^{-9}$ converts FLOPs to GFLOPs. Equation (2) demonstrates how to compute the number of parameters (Params).

\mathrm{Params} = C_{\mathrm{out}} \times K_{\mathrm{width}} \times K_{\mathrm{height}} \times C_{\mathrm{in}}

(2)
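Equations (1) and (2) can be checked on a single convolutional layer with a short sketch. The layer sizes in the usage note are hypothetical, and bias terms are ignored as stated above.

```python
def conv_gflops(f_height, f_width, c_out, c_in, k_width, k_height):
    """GFLOPs of one convolutional layer following Equation (1), bias ignored."""
    return f_height * f_width * c_out * c_in * k_width * k_height * 1e-9

def conv_params(c_out, c_in, k_width, k_height):
    """Parameter count of one convolutional layer following Equation (2)."""
    return c_out * k_width * k_height * c_in
```

For a hypothetical 3 × 3 convolution with 256 input and 256 output channels on an 80 × 80 output map, this gives 589,824 parameters and about 3.77 GFLOPs.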

Equation (3) illustrates the calculation of precision.

P = \frac{TP}{TP + FP}

(3)

In the equation, $TP$ denotes the number of true positives, i.e., predictions that are positive and whose actual result is also positive. Similarly, $FP$ denotes the number of false positives, i.e., predictions that are positive while the actual result is negative. Plotting recall on the horizontal axis and precision on the vertical axis yields the precision-recall curve. Evaluating recall and precision at various points on this curve leads to the average precision (AP), as outlined in Equation (4).

AP = \sum_{i = 1}^{n - 1} (R_{i + 1} - R_{i}) P_{\mathrm{inter}}(R_{i + 1})

(4)

In the formula, $R_{1}, R_{2}, \dots, R_{n}$ are the recall values in ascending order, and $P_{\mathrm{inter}}(R_{i + 1})$ is the interpolated precision at recall $R_{i + 1}$. The mean average precision (mAP), obtained by averaging AP across all categories, is given in Equation (5).

mAP = \frac{\sum_{i = 1}^{k} AP_{i}}{k}

(5)

In the above formula, the categories considered in this study are numbered 1 and 2, representing fire and smoke, respectively. The variable $k$ represents the total number of categories involved in the analysis.
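Equations (3) to (5) can be sketched in plain Python as follows. Two points are assumptions for illustration rather than statements from the text: the interpolated precision at a recall level is taken as the highest precision at that recall or any higher one, and the recall sequence is started from 0 so that the first segment is counted.

```python
def precision(tp, fp):
    """Equation (3): share of positive predictions that are correct."""
    return tp / (tp + fp) if (tp + fp) else 0.0

def average_precision(recalls, precisions):
    """Equation (4): area under the interpolated precision-recall curve."""
    points = sorted(zip(recalls, precisions))
    rs = [0.0] + [r for r, _ in points]       # assumed starting recall of 0
    ps = [p for _, p in points]
    ap = 0.0
    for i in range(1, len(rs)):
        p_inter = max(ps[i - 1:])             # best precision at recall >= rs[i]
        ap += (rs[i] - rs[i - 1]) * p_inter
    return ap

def mean_average_precision(aps):
    """Equation (5): mean of the per-category AP values."""
    return sum(aps) / len(aps)
```

With two operating points, recall 0.5 at precision 1.0 and recall 1.0 at precision 0.5, the AP is 0.5 × 1.0 + 0.5 × 0.5 = 0.75.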

4.4. Quantitative Comparison

Table 1 compares the performance of different object detection models, namely Fast R-CNN, YOLOv3 [44], YOLOv4, Scaled-YOLOv4 [45], YOLOv5, YOLOv7 [46], YOLOX [47], YOLOv9 [48], and YOLOv10 [49]. Table 1 shows that FD-Net obtains the best overall results and the highest mAP of all models except YOLOv10, which in turn performs worse than FD-Net in terms of GFLOPs, Params, Weight, and FPS. Furthermore, our model has significantly lower GFLOPs, Params, and Weight than YOLOX (3 times lower) and Fast R-CNN (84 times lower), and values similar to those of YOLOv5. Additionally, our mAP surpasses that of YOLOv5 by approximately 4%. Moreover, FD-Net shows clear advantages over Fast R-CNN, YOLOv3, YOLOv4, and Scaled-YOLOv4, highlighting the effectiveness of our fire detection approach.

In our experiments, each image was tested independently for fire detection in small, medium, and large objects. Table 2 presents the results, demonstrating that the FD-Net algorithm outperforms the YOLOv3, YOLOv4, Scaled-YOLOv4, YOLOv5, YOLOv7, YOLOv9, and YOLOv10 algorithms in detecting tiny objects and performs comparably to YOLOX. However, it does not perform as well as the Fast R-CNN algorithm in detecting other object sizes, as indicated in Table 1. It is worth noting that YOLOv10 outperforms FD-Net on APs but slightly underperforms FD-Net on APm and APl, indicating that the FD-Net approach has a greater overall impact and wider applicability than the current state-of-the-art algorithms.
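The small, medium, and large groupings behind APs, APm, and APl can be illustrated with a short sketch. It assumes the COCO evaluation convention (areas below 32² pixels are small, areas above 96² pixels are large); the text does not state which thresholds were used.

```python
def size_bucket(width, height):
    """Assign a box to a size group by pixel area, assuming COCO thresholds."""
    area = width * height
    if area < 32 ** 2:
        return "small"
    if area <= 96 ** 2:
        return "medium"
    return "large"
```

Under this convention, a 25 × 37 box counts as small, a 55 × 75 box as medium, and a 134 × 84 box as large.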

4.5. Performance Comparisons

To validate the accuracy of our dataset, we developed visualization scripts to visualize the bounding boxes. In Figure 9, we show the visual representation of the bounding boxes, confirming the accuracy of our dataset. This visualization reveals that the majority of the targets in our fire dataset are small in size. Therefore, we can confidently conclude that most events in our fire dataset qualify as small targets.

Figure 10 visualizes the training and validation behavior during the experiment. It illustrates the decreasing loss on both the training and validation sets, which converges towards zero as the number of iterations increases. Furthermore, the category loss remains consistently at zero, since our target is solely fire and no other categories are considered. The mAP and recall stabilize as training approaches the maximum number of iterations. This visualization indicates that our hyperparameter selection is reasonable and that the model exhibits reduced redundancy, thereby supporting the effectiveness of our approach.

4.6. Qualitative Comparison

In order to showcase the detection performance of different algorithms, we created Figure 11, from which three random fire images from the test set were selected for detection. From Figure 11, it can be observed that YOLOv4, YOLOv7, and YOLOX exhibit missed detections, while Fast R-CNN and YOLOv7 show false detections. Meanwhile, YOLOv3, Scaled-YOLOv4, YOLOv5, YOLOv9, and YOLOv10 demonstrate slightly better detection performance but still lag behind FD-Net. Although the number of images presented is limited, this further proves the practical value of the algorithm.

5. Extended Experiment

To enhance the generalization and efficiency of our fire detection algorithm [50], and to account for the common occurrence of smoke alongside fire, we augmented our data collection by including smoke. This addition allows for a more comprehensive evaluation of the performance. In this study, we collected a dataset comprising 11,156 records obtained through remote sensing of small target fires, including instances with large-scale and complex backgrounds. To ensure a balanced distribution of data, we split the dataset into training, validation, and test sets in a ratio of 7:2:1, respectively. The hyperparameter values used in this dataset are the same as those of the fire dataset mentioned above. The subsequent experimental evaluations will provide insights into the effectiveness of our algorithm in detecting both fire and smoke.
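The 7:2:1 split can be turned into image counts with a small helper. The rounding rule below (floor for training and validation, remainder to the test set) is an assumption; the exact per-set counts are not given in the text.

```python
def split_counts(n, ratios=(7, 2, 1)):
    """Split n samples into train/val/test counts by integer ratios."""
    total = sum(ratios)
    train = n * ratios[0] // total
    val = n * ratios[1] // total
    test = n - train - val          # remainder goes to the test set
    return train, val, test
```

For 11,156 records, this rounding gives 7809 training, 2231 validation, and 1116 test samples.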

5.1. Quantitative Comparison

In our experiments, we compared our network with Fast R-CNN, YOLOv3, YOLOv4, YOLOv5, YOLOX, YOLOv9, and YOLOv10. The experimental conditions, such as GFLOPs, Weight, Params, and FPS, remained unchanged except for the change of dataset. It is worth noting that, because smoke covers a large area, we disregarded smoke regions smaller than 32² pixels in the dataset. Therefore, the comparison of APs was not conducted in this particular experiment.

Through experiments on the fire smoke dataset, we generated Table 3 to compare the performance of different algorithms. The results show that FD-Net achieves the highest mAP of all algorithms except YOLOv9, although the overall performance of FD-Net is better than that of YOLOv9, while Fast R-CNN exhibits the lowest mAP. Notably, Fast R-CNN and Scaled-YOLOv4 were found to be less suitable for fire detection due to the significant discrepancy in their recognition of fire and smoke. This disparity may be attributed to overfitting in fire detection, which can lead to instability. Among all the algorithms evaluated, FD-Net demonstrates a superior overall impact compared to YOLOv3, YOLOv4, YOLOv5, and YOLOv7. However, on the fire smoke dataset, FD-Net’s performance falls slightly below that of YOLOX. It is important to note that factors such as GFLOPs, Params, Weight, and FPS are not taken into account in Table 3. Meanwhile, we find that the overall detection performance of YOLOv9 and YOLOv10 exceeds that of the other algorithms but is inferior to that of FD-Net. After careful consideration of the strengths and weaknesses of each algorithm, we conclude that the overall impact of FD-Net on both the fire dataset and the fire smoke dataset outweighs that of other state-of-the-art algorithms.

5.2. Performance Comparisons

Figure 12 presents the impact of our test frame, highlighting the noticeable increase in the number of target instances when smoke is included in our dataset. This observation suggests that smoke contributes to a portion of the targets in our sample. Furthermore, it demonstrates that by incorporating smoke into the detection process, we indirectly enhance the detection of fire. This finding reinforces the effectiveness of considering smoke as an important factor in fire detection algorithms, as it can improve the overall detection performance.

5.3. Qualitative Comparison

To better demonstrate the detection performance of FD-Net on fire and smoke images, we randomly selected three test samples from the test set and created Figure 13. From Figure 13, it can be observed that Scaled-YOLOv4 exhibits missed detections in the first and second columns of images. YOLOv3, YOLOv4, YOLOv5, YOLOv7, and YOLOX all show missed detections in the third column of images. Upon analysis, Fast R-CNN performs well in terms of detection, but its overall detection performance is slightly lower than that of FD-Net. YOLOv9 and YOLOv10, while performing better to some extent, only detected one smoke target in the second column of images, and the overall performance was not as good as that of FD-Net. It is evident that the randomly selected three detection samples cannot represent the overall detection performance, but they provide further evidence of the practical value of FD-Net in fire detection. Therefore, FD-Net is suitable for fire detection tasks.

6. Conclusions

Early fire detection is crucial for preventing devastating loss of life and substantial financial damage. To address this need, we introduced a novel fire detection framework named FD-Net. We proposed the Fire Attention (FA) mechanism to enhance the network’s ability to capture positional information in fire images. To address geometric distortion in fire images, we introduced the Three-Scale Pooling (TSP) module. Furthermore, we proposed the Fire Fusion (FF) module to improve the output quality of the network. To evaluate its efficacy, we conducted comprehensive experiments on two publicly available datasets: the fire dataset and the fire-and-smoke dataset. We compared FD-Net with both classic and state-of-the-art object detection algorithms using eight evaluation metrics. The experimental findings demonstrate the superior overall performance of FD-Net, establishing its potential as a replacement for existing state-of-the-art algorithms in fire detection tasks.

Author Contributions

J.Y. designed the experiments and the code for the paper and organized the content of the manuscript. H.W. (corresponding author) provided financial support and polished the manuscript. M.L. and X.W. revised the structure of the manuscript. W.S., S.L., and W.G. provided theoretical support and designed the conceptual framework of the paper. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China (Grant No. 62202248).

Data Availability Statement

The dataset for this experiment came from Wuhan University. The data have not been disclosed for the time being, and the dataset has been uploaded to the following GitHub repository: https://github.com/jyyuan666/fire_data_large.

Conflicts of Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

1. Yang, H.; Wang, J.; Wang, J. Efficient Detection of Forest Fire Smoke in UAV Aerial Imagery Based on an Improved YOLOv5 Model and Transfer Learning. Remote Sens. 2023, 15, 5527.
2. Yuan, J.; Ma, X.; Han, G. Research on Lightweight Disaster Classification Based on High-Resolution Remote Sensing Images. Remote Sens. 2022, 14, 2577.
3. Chen, S.; Cao, Y.; Feng, X.; Lu, X. Global2Salient: Self-adaptive feature aggregation for remote sensing smoke detection. Neurocomputing 2021, 466, 202–220.
4. Ma, J.; Chen, J.; Ng, M.; Anne, L. Loss odyssey in medical image segmentation. Med. Image Anal. 2021, 71, 1361–8415.
5. Cheng, S.; Wu, Y.; Li, Y.; Yao, F. TWD-SFNN: Three-way decisions with a single hidden layer feedforward neural network. Inf. Sci. 2021, 579, 15–32.
6. Bodapati, S.; Bandarupally, H. Comparison and Analysis of RNN-LSTMs and CNNs for Social Reviews Classification. Adv. Intell. Syst. Comput. 2021, 1319, 49–59.
7. Abdel-Magied, M.F.; Loparo, K.A.; Lin, W. Fault detection and diagnosis for rotating machinery: A model-based approach. In Proceedings of the 1998 American Control Conference, Philadelphia, PA, USA, 26 June 1998; pp. 3291–3296.
8. Bochkovskiy, A.; Wang, C.; Liao, H.M. YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv 2020, arXiv:2004.10934.
9. Zhu, X.; Lyu, S.; Wang, X.; Zhao, Q. TPH-YOLOv5: Improved YOLOv5 Based on Transformer Prediction Head for Object Detection on Drone-captured Scenarios. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), Montreal, QC, Canada, 11–17 October 2021; pp. 2778–2788.
10. Liu, W.; Fu, C.-Y. SSD: Single Shot MultiBox Detector. In Proceedings of the Computer Vision—ECCV, Amsterdam, The Netherlands, 11–14 October 2016; p. 9905.
11. Lin, T.Y.; Goyal, P.; Girshick, R. Focal Loss for Dense Object Detection. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 318–327.
12. Yar, H.; Khan, Z.A.; Ullah, F.U.M.; Ullah, W.; Baik, S.W. A modified YOLOv5 architecture for efficient fire detection in smart cities. Expert Syst. Appl. 2023, 231, 120465.
13. Mahaveerakannan, R.; Anitha, C.; Thomas, A.K.; Rajan, S.; Muthukumar, T.; Rajulu, G.G. An IoT based forest fire detection system using integration of cat swarm with LSTM model. Comput. Commun. 2023, 211, 37–45.
14. Shees, A.; Ansari, M.S.; Varshney, A.; Asghar, M.N.; Kanwaly, N. FireNet-v2: Improved Lightweight Fire Detection Model for Real-Time IoT Applications. Procedia Comput. Sci. 2023, 218, 2233–2242.
15. Jadon, A.; Omama, M.; Varshney, A. FireNet: A specialized lightweight fire & smoke detection model for real-time IoT applications. arXiv 2019, arXiv:1905.11922.
16. Jiang, H.; Learned-Miller, E. Face detection with the faster R-CNN. In Proceedings of the 2017 12th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2017), Washington, DC, USA, 30 May–3 June 2017; pp. 650–657.
17. Bharati, P.; Pramanik, A. Deep learning techniques—R-CNN to mask R-CNN: A survey. In Computational Intelligence in Pattern Recognition; Springer: Singapore, 2020; pp. 657–668.
18. Cai, Z.; Vasconcelos, N. Cascade R-CNN: Delving into High Quality Object Detection. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 6154–6162.
19. Almeida, J.S.; Jagatheesaperumal, S.K.; Nogueira, F.G.; de Albuquerque, V.H. EdgeFireSmoke++: A novel lightweight algorithm for real-time forest fire detection and visualization using internet of things-human machine interface. Expert Syst. Appl. 2023, 221, 119747.
20. Pritam, D.; Dewan, J.H. Detection of fire using image processing techniques with LUV color space. In Proceedings of the 2017 2nd International Conference for Convergence in Technology (I2CT), Mumbai, India, 7–9 April 2017; pp. 1158–1162.
21. Law, H.; Deng, J. CornerNet: Detecting Objects as Paired Keypoints. Int. J. Comput. Vis. 2020, 128, 642–656.
22. Zhou, X.; Wang, D.; Krähenbühl, P. Objects as points. arXiv 2019, arXiv:1904.07850.
23. Tan, M.; Pang, R.; Le, Q.V. EfficientDet: Scalable and Efficient Object Detection. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 10778–10787.
24. Chen, J.; He, Y.; Wang, J. Multi-feature fusion based fast video flame detection. Build. Environ. 2010, 45, 1113–1122.
25. Khatami, A.; Mirghasemi, S.; Khosravi, A.; Lim, C.P.; Nahavandi, S. A new PSO-based approach to fire flame detection using K-Medoids clustering. Expert Syst. Appl. 2017, 68, 69–80.
26. Chen, K.; Cheng, Y.; Bai, H.; Mou, C.; Zhang, Y. Research on Image Fire Detection Based on Support Vector Machine. In Proceedings of the 2019 9th International Conference on Fire Science and Fire Protection Engineering (ICFSFPE), Chengdu, China, 18–20 October 2019; pp. 1–7.
27. Xia, D.; Wang, S. Research on Detection Method of Uncertainty Fire Signal Based on Fire Scenario. In Proceedings of the 2006 6th World Congress on Intelligent Control and Automation, Dalian, China, 21–23 June 2006; pp. 4185–4189.
28. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7132–7141.
29. Hou, Q.; Zhou, D.; Feng, J. Coordinate Attention for Efficient Mobile Network Design. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 13708–13717.
30. Wang, J.; Yu, J.; He, Z. DECA: A novel multi-scale efficient channel attention module for object detection in real-life fire images. Appl. Intell. 2022, 52, 1362–1375.
31. Woo, S.; Park, J.; Lee, J.Y. CBAM: Convolutional Block Attention Module. In Proceedings of the Computer Vision—ECCV 2018, Munich, Germany, 8–14 September 2018; p. 11211.
32. Oksum, E.; Le, D.V.; Vu, M.D.; Nguyen, T.H.; Pham, L.T. A novel approach based on the fast sigmoid function for interpretation of potential field data. Bull. Geophys. Oceanogr. 2021, 62, 543–556.
33. You, H.; Yu, L.; Tian, S. MC-Net: Multiple max-pooling integration module and cross multi-scale deconvolution network. Knowl.-Based Syst. 2021, 231, 107456.
34. Wang, S.-H.; Khan, M.A.; Govindaraj, V.; Fernandes, S.L.; Zhu, Z.; Zhang, Y.-D. Deep Rank-Based Average Pooling Network for COVID-19 Recognition. Comput. Mater. Contin. 2022, 70, 2797–2813.
35. Zhang, Y.D.; Satapathy, S.C.; Liu, S. A five-layer deep convolutional neural network with stochastic pooling for chest CT-based COVID-19 diagnosis. Mach. Vis. Appl. 2021, 32, 14.
36. Yang, Z. Activation Function: Cell Recognition Based on yolov5s/m. J. Comput. Commun. 2021, 9, 1–16.
37. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980.
38. Cantrell, K.; Erenas, M.M.; de Orbe-Payá, I. Use of the hue parameter of the hue, saturation, value color space as a quantitative analytical parameter for bitonal optical sensors. Anal. Chem. 2010, 82, 531–542.
39. Dong, P.; Galatsanos, N.P. Affine transformation resistant watermarking based on image normalization. Int. Conf. Image Process. 2002, 3, 489–492.
40. Verma, V.; Lamb, A.; Beckham, C. Manifold Mixup: Better Representations by Interpolating Hidden States. Int. Conf. Mach. Learn. 2018, 97, 6438–6447.
41. Yun, S.; Han, D.; Oh, S.J.; Chun, S.; Choe, J.; Yoo, Y. CutMix: Regularization Strategy to Train Strong Classifiers with Localizable Features. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6022–6031.
42. Zhong, Z.; Zheng, L.; Kang, G. Random erasing data augmentation. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 13001–13008.
43. Jacksi, K.; Ibrahim, R.K.; Zeebaree, S.R.M.; Zebari, R.R.; Sadeeq, M.A.M. Clustering Documents Based on Semantic Similarity Using HAC and K-Mean Algorithms. In Proceedings of the 2020 International Conference on Advanced Science and Engineering (ICOASE), Duhok, Iraq, 23–24 December 2020; pp. 205–210.
44. Redmon, J.; Farhadi, A. YOLOv3: An incremental improvement. arXiv 2018, arXiv:1804.02767.
45. Wang, C.-Y.; Bochkovskiy, A.; Liao, H.-Y.M. Scaled-YOLOv4: Scaling Cross Stage Partial Network. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 13024–13033.
46. Wang, C.Y.; Bochkovskiy, A.; Liao, H. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. arXiv 2022, arXiv:2207.02696.
47. Ge, Z.; Liu, S.; Wang, F. YOLOX: Exceeding YOLO series in 2021. arXiv 2021, arXiv:2107.08430.
48. Wang, C.Y.; Yeh, I.H.; Liao, H.Y. YOLOv9: Learning what you want to learn using programmable gradient information. arXiv 2024, arXiv:2402.13616.
49. Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J.; Ding, G. YOLOv10: Real-time end-to-end object detection. arXiv 2024, arXiv:2405.14458.
50. Cao, Y.; Tang, Q.; Wu, X. EFFNet: Enhanced Feature Foreground Network for Video Smoke Source Prediction and Detection. IEEE Trans. Circuits Syst. Video Technol. 2022, 32, 1820–1833.

Figure 1. Part of the challenging fire dataset. (a) Small object fire dataset. (b) Negative example fire dataset. (c) Occluded fire dataset.

Figure 2. FA mechanism network architecture.

Figure 3. TSP module structure.

Figure 4. FF module network structure diagram.

Figure 5. Prediction Head model structure diagram.

Figure 6. The FD-Net fire detection architecture. The output size of each convolution layer is given from front to back. For instance, “CBS: 320×320×64” indicates that the output height and width of “CBS” are 320 and the number of output channels is 64. Different convolution blocks are distinguished by color. Each batch fed into the network contains 64 images, and the input image dimensions are 640×640. The modules are connected in a manner similar to YOLOv5.

Figure 7. Target box visualization. (a) Coordinate map of the target box’s center point; (b) the width and height of the target box.

Figure 8. Graph of anchors.

Figure 9. Size of the bounding box.

Figure 10. Analysis of the fire dataset’s training effects.

Figure 11. Qualitative comparison results.

Figure 12. Visualization of the bounding boxes.

Figure 13. Qualitative comparison results.

Table 1. A comparison of various quantitative statistics, including mean average precision (mAP) (where higher values indicate better performance), giga-floating-point operations (GFLOPs) (where smaller values indicate better computational efficiency), model parameters (Params) (where a smaller value indicates a more compact model), weight, and frames per second (FPS) (where a higher value indicates better real-time processing capability).

Model	mAP (%)	GFLOPs	Params	Weight (M)	FPS
Fast R-CNN [16]	60.90	370	137098724	113.5	20
YOLOv3 [44]	66.00	155.276	61529119	246.5	111
YOLOv4 [8]	65.20	141.917	63943071	256.3	83
Scaled-YOLOv4 [45]	64.21	166	70274000	141.2	90
YOLOv5 [9]	67.30	4.2	1766623	3.9	250
YOLOv7 [46]	62.72	105.1	37201950	74.8	125
YOLOX [47]	68.70	156	54208895	36	50
YOLOv9 [48]	68.63	28.9	8205369	66.9	32
YOLOv10 [49]	72.39	103.3	23156890	169.7	62
FD-Net (ours)	69.23	4.4	1980895	4.4	220
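One way to read the accuracy and cost columns of Table 1 together is to ask which models are not dominated by any other model, that is, no other model has both higher-or-equal mAP and lower-or-equal GFLOPs. The Python sketch below applies this check to the values transcribed from Table 1. Using only mAP and GFLOPs as the two criteria is an illustrative assumption, and the names `TABLE1`, `dominates`, and `pareto_front` are hypothetical; this is not the evaluation procedure used in the paper.

```python
# Values transcribed from Table 1: (mAP in percent, GFLOPs, FPS).
TABLE1 = {
    "Fast R-CNN":    (60.90, 370.0,   20),
    "YOLOv3":        (66.00, 155.276, 111),
    "YOLOv4":        (65.20, 141.917, 83),
    "Scaled-YOLOv4": (64.21, 166.0,   90),
    "YOLOv5":        (67.30, 4.2,     250),
    "YOLOv7":        (62.72, 105.1,   125),
    "YOLOX":         (68.70, 156.0,   50),
    "YOLOv9":        (68.63, 28.9,    32),
    "YOLOv10":       (72.39, 103.3,   62),
    "FD-Net":        (69.23, 4.4,     220),
}


def dominates(a, b):
    """True if a is no worse than b on mAP (higher is better) and GFLOPs
    (lower is better), and strictly better on at least one of the two."""
    map_a, gflops_a, _ = a
    map_b, gflops_b, _ = b
    no_worse = map_a >= map_b and gflops_a <= gflops_b
    strictly_better = map_a > map_b or gflops_a < gflops_b
    return no_worse and strictly_better


def pareto_front(table):
    """Return the models that no other model dominates."""
    return [
        name for name, stats in table.items()
        if not any(dominates(other, stats)
                   for other_name, other in table.items()
                   if other_name != name)
    ]


front = pareto_front(TABLE1)
print(front)
```

Under this assumed two-criterion view, the non-dominated models are YOLOv5, FD-Net, and YOLOv10: YOLOv10 reaches the highest mAP at a much higher computational cost, while FD-Net gains about two mAP points over YOLOv5 for 0.2 additional GFLOPs.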

Table 2. Comparison of quantitative statistics by object size: small objects (Aps): area ≤ 32²; medium objects (Apm): 32² < area ≤ 96²; and large objects (Apl): area > 96².

Model	Aps	Apm	Apl
Fast R-CNN [16]	0.582	0.6204	0.584
YOLOv3 [44]	0.422	0.624	0.537
YOLOv4 [8]	0.453	0.611	0.530
Scaled-YOLOv4 [45]	0.45	0.564	0.471
YOLOv5 [9]	0.454	0.627	0.522
YOLOv7 [46]	0.348	0.602	0.487
YOLOX [47]	0.510	0.632	0.594
YOLOv9 [48]	0.476	0.622	0.598
YOLOv10 [49]	0.536	0.640	0.530
FD-Net (ours)	0.492	0.651	0.534
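The size categories used in Table 2 can be sketched as a small helper, assuming the caption's thresholds are the COCO-style pixel areas 32² = 1024 and 96² = 9216 and that area is taken as bounding-box width times height. The function name `size_bucket` and the constant names are hypothetical and used only for illustration.

```python
# Assumed COCO-style area thresholds, in square pixels.
SMALL_MAX_AREA = 32 ** 2    # 1024
MEDIUM_MAX_AREA = 96 ** 2   # 9216


def size_bucket(width, height):
    """Classify a bounding box as 'small', 'medium', or 'large' by its area."""
    area = width * height
    if area <= SMALL_MAX_AREA:
        return "small"
    if area <= MEDIUM_MAX_AREA:
        return "medium"
    return "large"


# A few example boxes (width, height) in pixels.
for box in [(20, 30), (32, 32), (64, 64), (96, 96), (200, 150)]:
    print(box, size_bucket(*box))
```

Boxes exactly on a threshold fall into the smaller category, matching the "≤" in the caption.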

Table 3. A comparison of quantitative statistics on fire and smoke datasets. The values in the table are all percentages: Apm(F) denotes the Apm value of fire; Apl(F) denotes the Apl value of fire; Apm(S) denotes the Apm value of smoke; and Apl(S) denotes the Apl value of smoke.

Model	mAP	Fire	Apm(F)	Apl(F)	Smoke	Apm(S)	Apl(S)
Fast R-CNN [16]	56.27	42.00	54.10	57.00	70.00	12.50	86.00
YOLOv3 [44]	60.20	45.60	36.60	39.40	74.80	19.20	79.90
YOLOv4 [8]	63.90	51.50	41.00	40.40	76.20	21.30	81.40
Scaled-YOLOv4 [45]	62.50	48.90	27.80	33.80	76.10	16.80	71.10
YOLOv5 [9]	60.30	46.40	39.90	40.80	74.20	15.90	79.40
YOLOv7 [46]	59.70	46.90	38.30	41.50	72.50	11.90	79.40
YOLOX [47]	66.42	55.00	72.10	59.00	78.00	24.30	94.00
YOLOv9 [48]	64.65	49.28	58.66	46.20	80.02	16.30	74.85
YOLOv10 [49]	65.02	53.12	45.69	46.91	76.91	14.69	71.31
FD-Net (ours)	65.10	52.60	44.00	47.70	77.60	16.90	80.60

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
