Article

A Comprehensive Feature Extraction Network for Deep-Learning-Based Wildfire Detection in Remote Sensing Imagery

1 College of Information Technology, Shanghai Ocean University, Shanghai 201306, China
2 Guangzhou Meteorological Satellite Ground Station, Guangzhou 510650, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(7), 3699; https://doi.org/10.3390/app15073699
Submission received: 10 February 2025 / Revised: 13 March 2025 / Accepted: 25 March 2025 / Published: 27 March 2025
(This article belongs to the Special Issue Latest Advances in Radar Remote Sensing Technologies)

Abstract

As global climate change escalates, wildfires have emerged as a critical form of natural disaster, presenting substantial risks to ecosystems, public safety, and economic development. While satellite remote sensing has been extensively utilized for wildfire monitoring, current methodologies face limitations in addressing complex backgrounds and environmental variations. These techniques usually depend on fixed thresholds or the extraction of local features, which can lead to false positives and missed detections. Consequently, existing methods inadequately capture the comprehensive characteristics of fire points. To mitigate these challenges, this study proposes a deep-learning-based fire point detection method that integrates the Swin Transformer and BiLSTM to extract the multi-dimensional features associated with fire points. This research represents the first application of the Swin Transformer to fire point detection, leveraging its self-attention mechanism to discern global dependencies and fire point information within complex environments. By fusing features at multiple levels, the proposed method significantly improves the accuracy and robustness of fire point detection. Experimental findings demonstrate that this method surpasses traditional models such as DenseNet, SimpleCNN, and the Multi-Layer Perceptron (MLP) across multiple performance metrics, including accuracy, recall, and F1 score.

1. Introduction

As global climate change escalates, wildfires, recognized as a major natural disaster, present substantial challenges to ecological systems, public safety, and economic progress [1,2,3]. These events are marked by significant spatiotemporal variability and abrupt onset, complicating their prediction and monitoring. This complexity is especially pronounced in extensive regions such as forests and grasslands, where the effective and precise monitoring and analysis of wildfires are essential components of early warning systems and emergency management strategies for natural disasters [4,5]. The application of satellite remote sensing technology, which facilitates the rapid and extensive collection of wildfire data, has been widely adopted in this context [6,7,8].
Satellite remote sensing has emerged as a highly effective and extensive method for ground observation, particularly in the context of wildfire monitoring. Through the analysis of satellite imagery, researchers can acquire real-time information regarding wildfires across vast geographical areas [9,10,11]. Remote sensing products from satellites, including the Moderate-Resolution Imaging Spectroradiometer (MODIS) and the Visible Infrared Imaging Radiometer Suite (VIIRS), offer essential data for the monitoring of wildfires [12,13]. Both MODIS and VIIRS are capable of detecting the thermal radiation signals emitted during wildfire incidents by utilizing thermal infrared bands and implementing threshold algorithms to identify fire locations [12,14].
In recent years, the proliferation of artificial intelligence, particularly through the implementation of machine learning and deep learning techniques, has prompted a growing body of research focused on enhancing the accuracy and real-time capabilities of wildfire monitoring systems [15,16,17,18,19]. The evolution of machine learning technologies has led to an increasing number of investigations aimed at integrating these methodologies into wildfire detection efforts. Among the machine learning algorithms first applied to remote sensing image classification are Support Vector Machines (SVM) and Random Forest (RF) [17,20,21]. In the context of wildfire detection, these algorithms are employed to construct classifiers that discern fire-related features within remote sensing imagery. SVM operates by identifying the optimal hyperplane for classifying image data, demonstrating robust generalization capabilities, particularly in the analysis of nonlinear datasets [22]. Random Forest, in turn, employs a voting mechanism derived from multiple decision trees, which allows it to effectively navigate the intricate relationships among various features [23]. Nonetheless, traditional machine learning approaches encounter challenges when addressing complex spatiotemporal data. For instance, in the analysis of multi-dimensional datasets, conventional algorithms may inadequately capture the comprehensive information from fire points. As advancements in deep learning technologies continue to accelerate, an increasing number of studies are transitioning towards the adoption of deep learning methodologies.
The effective utilization of Convolutional Neural Networks (CNNs) in image processing has established them as a prominent deep learning technique for wildfire monitoring [24]. CNNs have the ability to autonomously learn features from images and to derive increasingly abstract representations through multiple convolutional and pooling layers, which is particularly advantageous for fire detection in remote sensing imagery [15,25,26]. While CNNs primarily concentrate on spatial information, Recurrent Neural Networks (RNNs) excel at handling time-series data, thereby capturing temporal dependencies that are crucial for monitoring wildfires [27]. The manifestation of wildfires is frequently associated with a sequence of continuous changes, such as increasing temperatures and the dispersal of smoke. RNNs are capable of identifying these temporal patterns by retaining information from preceding time points. Fire detection methodologies that employ RNNs typically process satellite images in a chronological sequence, enabling the model to forecast the likelihood of fire occurrence at subsequent time intervals based on data from prior moments [28,29].
In addition to remote sensing imagery, the monitoring of wildfires can benefit from the incorporation of diverse multisource data, including meteorological and geographic information. Research in this area often employs various data types as inputs for machine learning models, thereby enhancing both detection accuracy and robustness. For instance, combining remote sensing data with meteorological factors like temperature, humidity, and wind speed has been shown to significantly improve fire prediction accuracy, particularly under complex climatic conditions [25,30]. These approaches typically utilize multimodal learning frameworks grounded in deep learning, which amalgamate information from multiple sources, consequently augmenting the overall efficacy of wildfire monitoring systems.
Although there have been notable advancements in detecting fire spots using deep learning techniques, identifying fire spot features in remote sensing images continues to be a challenge. The main issue is that individual fire spots often have unclear characteristics in the images, resulting in a low accuracy for models trained only on these features. However, during fire incidents, the center of the fire spot shows a clear contrast with its surroundings. Thus, it is essential to compare the characteristics of fire spots with environmental features to enhance detection accuracy.
The Swin Transformer offers several advantages over traditional CNNs when it comes to detecting fire points. While CNNs predominantly depend on fixed local receptive fields, which may inadequately capture cross-scale information when analyzing high-resolution images or intricate scenes, the Swin Transformer utilizes a sliding window approach coupled with a hierarchical architecture. This design facilitates the effective extraction of image features across various scales and the capture of long-range dependencies via its self-attention mechanism. Consequently, the Swin Transformer demonstrates enhanced capabilities in identifying fire points within complex backgrounds, particularly in environments such as forests and grasslands. Its hierarchical structure also allows for the precise detection of subtle variations in fire points, thereby improving both detection accuracy and robustness. In addition, fire spots follow a distinctive progression over time, from their initial ignition to their subsequent growth. BiLSTM excels at handling time-series data and, in contrast to the unidirectional LSTM, analyzes sequences in both the forward and backward directions simultaneously. This allows BiLSTM to capture a greater range of characteristics and significantly reduce information loss, especially in tasks that require modeling contextual dependencies.
Taking the aforementioned factors into account, we suggest a novel fire point detection network that integrates the Swin Transformer architecture with the BiLSTM network, with an emphasis on detailed features associated with wildfires. The main goals of this research are as follows: (1) to assess how well the combination of the Swin Transformer and BiLSTM performs in detecting fire points; (2) to compare its effectiveness against current baseline methods; and (3) to perform ablation studies to determine the impact of each component in the network.

2. Research Area and Datasets

2.1. Research Area

The geographical focus of this study encompasses two provinces in southern China, Fujian and Jiangxi, as shown in Figure 1. These regions are distinguished by their extensive mountainous and hilly landscapes, which exhibit significant forest coverage, particularly in the elevated areas where forest resources are plentiful [31]. However, the occurrence of wildfires in these forests presents considerable challenges for containment due to the complex topography and the inherent difficulties associated with firefighting efforts [32]. This often results in rapid fire propagation and the potential for extensive damage. Furthermore, the region is situated within a subtropical climate zone, characterized by elevated summer temperatures and irregular precipitation patterns, which heighten the risk of wildfires during the dry season [33]. Consequently, the wildfire-prone period in these provinces predominantly occurs during the summer and autumn months, when high temperatures and strong winds create optimal conditions for the ignition and spread of forest fires.

2.2. Dataset

Himawari-8 is a third-generation geostationary meteorological satellite launched by the Japan Meteorological Agency in 2014. It primarily serves the regions of East Asia and Australia. The satellite is outfitted with the Advanced Himawari Imager (AHI), which captures data across sixteen spectral bands, comprising six reflectance bands and ten brightness temperature bands. The spatial resolution of Himawari-8 varies between 0.5 and 2 km, and it offers a temporal resolution of 10 min, rendering it particularly effective for the real-time monitoring of forest fires [34]. Although geostationary satellites typically exhibit lower spatial resolution compared to polar-orbiting satellites, their extensive coverage, synchronized data collection capabilities, and fixed observational position mitigate this drawback.
In the present study, six spectral bands pertinent to forest fire analysis were selected from the Himawari-8 satellite data. These included three reflectance bands (bands 3, 4, and 6) and three brightness temperature bands (bands 7, 14, and 15) [35]. Specifically, bands 3 and 4 were indicative of vegetation cover, bands 7 and 14 were associated with fire detection, and bands 6 and 15 served to filter out cloud interference. Collectively, these bands furnished critical information essential for fire detection, such as thermal radiation and reflectance in areas affected by fire, thereby enhancing the accuracy and real-time efficacy of fire point detection.
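As a small illustration of this band selection, the sketch below stacks the six chosen AHI bands into a single input array. The dictionary of per-band arrays and the fixed stacking order are assumptions about how the preprocessed data might be organized, not part of the Himawari-8 product interface.

```python
import numpy as np

# Reflectance bands 3, 4, 6 and brightness temperature bands 7, 14, 15 (Section 2.2).
SELECTED_BANDS = [3, 4, 6, 7, 14, 15]

def stack_fire_bands(bands: dict[int, np.ndarray]) -> np.ndarray:
    """Stack the selected AHI bands into a (6, H, W) array in a fixed order."""
    return np.stack([bands[b] for b in SELECTED_BANDS], axis=0)

# Example with dummy 100 x 100 arrays standing in for real AHI grids.
dummy = {b: np.random.rand(100, 100).astype(np.float32) for b in SELECTED_BANDS}
print(stack_fire_bands(dummy).shape)  # (6, 100, 100)
```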

3. Method

The flowchart illustrating the proposed method is presented in Figure 2. The multispectral image data utilized in this research comprise AHI data obtained from the Himawari-8 satellite, while the ground truth data for fire points are acquired from the meteorological satellite ground station located in Guangzhou, Guangdong, China. First, the collected data are preprocessed to obtain the six spectral bands necessary for creating the fire spot dataset. This dataset is subsequently split into training and validation sets. The training data are utilized to train the proposed model as well as the comparison methods. Once the model is trained and saved, the validation set is employed to assess the performance of each model, leading to a comparative analysis using quantitative evaluation metrics.

3.1. Overall Pipeline

The deep learning architecture presented in this study comprises two distinct branches: the Swin Transformer (Shifted Window Transformer) and the BiLSTM (Bidirectional Long Short-Term Memory network). In the overall architecture (Figure 3), the Swin Transformer branch occupies the top and the BiLSTM branch the bottom. The input to the Swin Transformer is a remote sensing image of size $H \times W \times 6$, where $H$ and $W$ denote the height and width of the image. Initially, the Patch Partition step segments the input image into non-overlapping patches of uniform size. After Linear Embedding, these patch tokens are processed through several Swin Transformer blocks that employ window-based self-attention, passing through four stages in total before generating the final feature layer.
The input to the BiLSTM consists of the six spectral bands of the central fire spot, with a size of $D \times 6$, where $D$ represents the selected temporal length. After obtaining the BiLSTM feature layer, the two feature representations are concatenated and passed through a fully connected layer to produce a binary classification output, distinguishing fire spots from non-fire spots. The detailed architecture of our proposed model is illustrated in Figure 3.
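To make the two-branch design concrete, the following PyTorch sketch fuses an image branch and a BiLSTM branch over the central-pixel spectral sequence with a fully connected head. It is a structural sketch only: the `image_backbone` argument stands in for the Swin Transformer branch, and the hidden sizes, feature dimensions, and the dummy backbone used in the shape check are assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class DualBranchFireDetector(nn.Module):
    """Structural sketch of the two-branch design: an image branch (stand-in for the
    Swin Transformer) and a BiLSTM branch over the central pixel's spectral sequence,
    fused by a fully connected classification head."""

    def __init__(self, image_backbone: nn.Module, image_feat_dim: int,
                 lstm_hidden: int = 64, num_classes: int = 2):
        super().__init__()
        self.image_backbone = image_backbone            # e.g. a Swin Transformer encoder
        self.bilstm = nn.LSTM(input_size=6, hidden_size=lstm_hidden,
                              batch_first=True, bidirectional=True)
        self.head = nn.Linear(image_feat_dim + 2 * lstm_hidden, num_classes)

    def forward(self, image, spectral_seq):
        # image: (B, 6, H, W); spectral_seq: (B, D, 6)
        img_feat = self.image_backbone(image)           # (B, image_feat_dim)
        _, (h_n, _) = self.bilstm(spectral_seq)         # h_n: (2, B, lstm_hidden)
        seq_feat = torch.cat([h_n[0], h_n[1]], dim=1)   # forward + backward final states
        return self.head(torch.cat([img_feat, seq_feat], dim=1))

# Quick shape check with a dummy backbone standing in for the Swin Transformer branch.
dummy_backbone = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(6, 32))
model = DualBranchFireDetector(dummy_backbone, image_feat_dim=32)
logits = model(torch.randn(2, 6, 64, 64), torch.randn(2, 8, 6))   # -> shape (2, 2)
```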

3.1.1. The Swin Transformer

The Swin Transformer [36] is a vision Transformer architecture that has demonstrated considerable success within the realm of computer vision in recent years.
The fundamental concept underlying the Swin Transformer involves transitioning the computation of the self-attention mechanism from a global to a local mode. In conventional Transformer models, every pair of elements within the input sequence is subjected to computation, which proves to be inefficient for the processing of high-resolution images. To mitigate this challenge, the Swin Transformer incorporates local and shifted windows, thereby significantly reducing computational complexity while maintaining the capacity to model global information effectively.
In the Swin Transformer architecture, the initial step involves partitioning the input image into numerous smaller windows, which are subsequently transformed into fixed-dimensional vectors via Linear Embedding. The computation of self-attention is then performed within each individual window, leading to a notable reduction in computational complexity. To mitigate the challenges associated with long-range dependencies that arise from the localized nature of the windows, the Swin Transformer incorporates a window-shifting mechanism. This mechanism facilitates effective interactions among neighboring positions, thereby enabling the model to capture a more extensive range of global information. In conventional self-attention frameworks, the representation of each position is contingent upon all other positions within the input, resulting in substantial computational demands when handling large images. Conversely, the Swin Transformer addresses this issue by segmenting the input image into smaller local windows, applying self-attention mechanisms exclusively within these defined windows.
Specifically, given an input feature matrix $X \in \mathbb{R}^{H \times W \times C}$, where $H$ and $W$ represent the height and width of the image and $C$ represents the number of channels, the Swin Transformer divides it into multiple $M \times M$ windows and performs self-attention within each window. The self-attention mechanism applied within a window can be expressed by the following formula:
$\mathrm{Attention}(Q_w, K_w, V_w) = \mathrm{softmax}\!\left(\dfrac{Q_w K_w^{\top}}{\sqrt{d_k}}\right) V_w$
Here, $Q_w$, $K_w$, and $V_w$ represent the Query, Key, and Value matrices, respectively, while $d_k$ denotes the dimension of the Query/Key vectors.
In this framework, the Query, Key, and Value matrices are computed within each individual window. Restricting the attention mechanism to these smaller local windows considerably enhances the computational efficiency of the Swin Transformer.
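As an illustration of this window-based computation, the following sketch partitions a feature map into non-overlapping windows and applies scaled dot-product attention independently inside each window. It is a simplified, single-head example with randomly initialized projection matrices, not the full Swin Transformer block (which adds multi-head attention, relative position bias, and the shifted-window mechanism described next).

```python
import torch
import torch.nn.functional as F

def window_partition(x, M):
    """Split a (B, H, W, C) feature map into non-overlapping M x M windows."""
    B, H, W, C = x.shape
    x = x.view(B, H // M, M, W // M, M, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, M * M, C)  # (B * num_windows, M*M, C)

def window_attention(x_windows, Wq, Wk, Wv):
    """Scaled dot-product attention computed independently inside each window."""
    Q, K, V = x_windows @ Wq, x_windows @ Wk, x_windows @ Wv
    d_k = Q.shape[-1]
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5
    return F.softmax(scores, dim=-1) @ V

# Example: 7 x 7 windows over a 56 x 56 feature map with C = 96 channels.
x = torch.randn(1, 56, 56, 96)
Wq, Wk, Wv = (torch.randn(96, 96) * 0.02 for _ in range(3))
out = window_attention(window_partition(x, M=7), Wq, Wk, Wv)  # (64, 49, 96)
```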
To address the challenges associated with long-range dependencies in conventional self-attention mechanisms, the Swin Transformer incorporates a shifted window approach. This method involves displacing the windows from their initial positions at each stage, thereby facilitating interactions among neighboring positions and effectively capturing long-range dependencies.
Furthermore, the Swin Transformer employs a hierarchical architecture that progressively enhances feature dimensions from lower resolution inputs. This design not only preserves computational efficiency but also enables the processing of more detailed image information. The output features generated at each layer are integrated with features from other layers through self-attention mechanisms within localized windows, allowing the model to incrementally acquire global semantic insights regarding the image. This hierarchical framework not only mitigates computational expenses but also adeptly manages large-scale image datasets.
By implementing local window self-attention mechanisms, window shifting, and a hierarchical structure, the Swin Transformer successfully resolves the computational and efficiency challenges inherent in traditional Transformers, representing a significant advancement in contemporary visual Transformer architectures. Due to its efficiency and robust performance, the Swin Transformer has emerged as one of the most prevalent models in the field of computer vision, offering a promising avenue for the further evolution of visual Transformers.

3.1.2. The Bidirectional Long Short-Term Memory Network

The BiLSTM [37] serves as an extension of the LSTM and is extensively utilized for the processing of sequence data and time-series analysis. LSTM, a specialized variant of Recurrent Neural Networks (RNNs), is primarily designed for the processing and forecasting of time-ordered data. A significant advantage of LSTM lies in its capacity to effectively address the vanishing and exploding gradient issues that traditional RNNs encounter when dealing with lengthy sequences, thereby enabling the model to handle long-term dependencies.
BiLSTM improves this capability by employing a bidirectional structure, which enables the model to process temporal data in both the forward and backward directions, thus enhancing its ability to capture a broader range of contextual information. The BiLSTM architecture consists of two parts: the forward LSTM, which handles the sequence from start to finish; and the backward LSTM, which processes the sequence in reverse. This dual flow of information allows BiLSTM to simultaneously learn the dependencies in both directions, overcoming the limitations of unidirectional LSTM models. The structure of BiLSTM can be succinctly represented as follows:
$\overrightarrow{h}_t = \mathrm{LSTM}(x_t, \overrightarrow{h}_{t-1})$
$\overleftarrow{h}_t = \mathrm{LSTM}(x_t, \overleftarrow{h}_{t+1})$
In this context, $\overrightarrow{h}_t$ is the hidden state of the forward LSTM, $\overleftarrow{h}_t$ is the hidden state of the backward LSTM, and $x_t$ is the input data at the current time step.
In a BiLSTM network, the outputs generated by both the forward and backward LSTM components are combined to create a bidirectional output vector:
$h_t = [\overrightarrow{h}_t, \overleftarrow{h}_t]$
Consequently, the BiLSTM model has the capacity to utilize information from both the preceding and subsequent time steps, rendering it especially effective for applications that necessitate a comprehensive understanding of the global context.
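A minimal PyTorch illustration of this bidirectional output follows, assuming a hypothetical sequence length of D = 8 and a hidden size of 32 chosen only for the example. With `bidirectional=True`, `nn.LSTM` returns outputs whose last dimension concatenates the forward and backward hidden states, matching $h_t = [\overrightarrow{h}_t, \overleftarrow{h}_t]$ above.

```python
import torch
import torch.nn as nn

bilstm = nn.LSTM(input_size=6, hidden_size=32, batch_first=True, bidirectional=True)
x = torch.randn(4, 8, 6)                                # (batch, D, spectral bands)
outputs, (h_n, c_n) = bilstm(x)
print(outputs.shape)                                    # torch.Size([4, 8, 64])
print(torch.allclose(outputs[:, -1, :32], h_n[0]))      # final forward state -> True
print(torch.allclose(outputs[:, 0, 32:], h_n[1]))       # final backward state (at t = 0) -> True
```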

3.2. Experimental Environment for Network Configuration

Our proposed model was developed using the PyTorch framework (version 2.4.1). All experiments described in this study were performed on a system equipped with an NVIDIA GeForce RTX 4090 graphics card (NVIDIA Corporation, Santa Clara, CA, USA). Training used the Adam optimizer with a learning rate of $10^{-6}$, was conducted over 300 epochs, and employed the cross-entropy loss function as the training criterion.
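A hedged sketch of this training setup is shown below. The optimizer, learning rate, epoch count, and loss function follow the configuration stated above; the stand-in model and the randomly generated dataset are placeholders so the loop runs end to end, not the actual dual-branch network or the Himawari-8 fire-spot data.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Dummy stand-ins: a linear classifier over 6-band 9 x 9 patches and random labels.
model = nn.Sequential(nn.Flatten(), nn.Linear(6 * 9 * 9, 2))
train_loader = DataLoader(
    TensorDataset(torch.randn(64, 6, 9, 9), torch.randint(0, 2, (64,))),
    batch_size=16, shuffle=True)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-6)  # Adam, learning rate 1e-6
criterion = nn.CrossEntropyLoss()                          # cross-entropy training loss

for epoch in range(300):                                   # 300 training epochs
    for images, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
```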

3.3. Evaluation Indicators for the Network

To evaluate the performance of our proposed model, we used accuracy, precision, recall, F1 score (F1), missed detection rate (MD), and false detection rate (ED) as evaluation metrics:
$\mathrm{Accuracy} = \dfrac{TP + TN}{TP + FN + FP + TN}$
$\mathrm{Precision} = \dfrac{TP}{TP + FP}$
$\mathrm{Recall} = \dfrac{TP}{TP + FN}$
$F1 = \dfrac{2 \times P \times R}{P + R}$
$MD = 1 - R$
$ED = 1 - P$
Here, $P$ and $R$ denote precision and recall, and $TP$, $TN$, $FP$, and $FN$ are the numbers of true positives, true negatives, false positives, and false negatives, respectively.
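For reference, a small Python helper implementing these definitions from confusion-matrix counts is given below; the guards against division by zero are an addition for degenerate cases (e.g., when no fire points are predicted), not part of the definitions above.

```python
def detection_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Accuracy, precision, recall, F1, missed detection (MD), and false detection (ED) rates."""
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    precision = tp / (tp + fp) if (tp + fp) else float("nan")
    recall = tp / (tp + fn) if (tp + fn) else float("nan")
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else float("nan"))
    return {"A": accuracy, "P": precision, "R": recall,
            "F1": f1, "MD": 1 - recall, "ED": 1 - precision}
```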

4. Experimental Results and Analysis

Our proposed model was evaluated against widely utilized classification networks, including DenseNet, a SimpleCNN, and an MLP (Multi-Layer Perceptron). DenseNet, or Densely Connected Convolutional Networks, represents a novel deep learning architecture that enhances network efficiency through the establishment of dense interconnections among layers. In this architecture, each layer is explicitly connected to the outputs of all preceding layers, meaning that each layer’s input consists of feature maps from all prior layers. This configuration significantly facilitates gradient flow, thereby mitigating the vanishing gradient issue commonly encountered in deep networks. The architecture of DenseNet promotes efficient feature reuse, which not only reduces the total number of parameters but also improves computational efficiency. Furthermore, DenseNet enhances the transmission and utilization of local features, which can markedly elevate the network’s performance in tasks such as image classification and object detection. The DenseNet utilized in this research is set up with a growth rate of 32, block counts of [6,12,15,22], and is designed for two output categories.
The architecture of the SimpleCNN is derived from the Convolutional Neural Network (CNN) framework introduced by Yoojin Kang et al. in their research [24]. This network exemplifies a conventional CNN structure, comprising multiple convolutional layers, pooling layers, fully connected layers, and dropout layers. The input image is characterized by a dimension of 9 × 9 pixels with N channels. The network is systematically organized into several segments: the initial segment features a 3 × 3 convolutional layer equipped with zero padding and ReLU activation, succeeded by a 2 × 2 max-pooling layer. The subsequent segment mirrors this configuration, incorporating another 3 × 3 convolutional layer with zero padding and ReLU activation, and a 2 × 2 max-pooling layer. The third segment continues this pattern, including a 3 × 3 convolutional layer with zero padding and ReLU activation, and a 2 × 2 max-pooling layer, followed by a dropout layer with a dropout rate of 0.25. The concluding segment consists of a fully connected layer utilizing ReLU activation, followed by an additional dropout layer with a dropout rate of 0.5. The final output is produced through a Sigmoid activation function, facilitating binary classification. The framework utilized in this research aligns with the SimpleCNN architecture mentioned earlier.
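A sketch of this baseline is given below. The layer sequence follows the description above, while the channel widths (32 filters per convolution) and the fully connected width (64 units), which the text does not specify, are assumptions.

```python
import torch.nn as nn

class SimpleCNN(nn.Module):
    """Sketch of the baseline CNN described above (after Kang et al. [24]);
    channel and FC widths are assumptions."""

    def __init__(self, in_channels: int, hidden_channels: int = 32, fc_units: int = 64):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, hidden_channels, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(hidden_channels, hidden_channels, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(hidden_channels, hidden_channels, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Dropout(0.25),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.LazyLinear(fc_units), nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(fc_units, 1), nn.Sigmoid(),   # binary classification output
        )

    def forward(self, x):               # x: (B, N, 9, 9)
        return self.classifier(self.features(x))
```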
The Multi-Layer Perceptron (MLP) represents a fundamental architecture in feedforward neural networks, characterized by the presence of multiple fully connected layers of neurons. The MLP comprises an input layer, two hidden layers, and an output layer. Neurons within each layer are interconnected with neurons from the preceding layer, establishing a fully connected network structure. Each connection is associated with a weight parameter, and neurons employ an activation function (such as ReLU, Sigmoid, or Tanh) to execute nonlinear transformations of the input signals. The training of MLPs is conducted through the backpropagation algorithm, which aims to optimize the weights and minimize the loss function, thereby enhancing prediction accuracy. MLPs are adept at managing complex nonlinear relationships, which accounts for their extensive application in various domains, including image recognition, speech processing, and natural language processing. The MLP utilized in this research features 128 neurons in the first hidden layer and 64 neurons in the second hidden layer. Given that the task involves binary classification, the output layer consists of one neuron.
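A corresponding sketch of the MLP baseline follows; the input dimension (the length of the central-pixel spectral feature vector) is left as a parameter, and ReLU is assumed for the hidden activations.

```python
import torch.nn as nn

def build_mlp(in_features: int) -> nn.Sequential:
    """Two hidden layers (128 and 64 neurons) and a single sigmoid output neuron."""
    return nn.Sequential(
        nn.Linear(in_features, 128), nn.ReLU(),
        nn.Linear(128, 64), nn.ReLU(),
        nn.Linear(64, 1), nn.Sigmoid(),
    )
```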
The experimental findings, presented in Table 1 and Table 2, indicate that conventional Convolutional Neural Networks, specifically DenseNet and SimpleCNN, are ineffective at adequately capturing the background information associated with fire points, resulting in a recall rate of zero and an inability to accurately identify these points. This suggests that Convolutional Neural Networks predominantly focus on the extraction of local features; however, for tasks involving fire point detection, the incorporation of background information and global dependencies is equally essential. In contrast, the Swin Transformer, which is predicated on a self-attention mechanism, is capable of modeling long-range global information, thereby leveraging the surrounding background data of fire points and significantly enhancing detection performance.
Furthermore, our proposed model, which amalgamates the extraction of comprehensive fire point features through the integration of BiLSTM and the Swin Transformer, achieves superior performance in terms of accuracy, recall, and F1 score.
Conversely, the MLP model primarily depends on the central spectral band information of fire points for detection. While it does not incorporate background information, the central spectral band typically encompasses the most critical features, allowing the MLP to maintain satisfactory performance in accuracy and recall. This observation underscores the notion that, within the context of fire point detection, the core features of the fire points are of greater significance than the background information, which serves merely as supplementary data. Nonetheless, the lack of background information constrains the overall efficacy of the MLP, hindering its competitiveness against our proposed model. Additionally, the MLP exhibits inferior robustness across diverse regions compared to our proposed model; for instance, it performs well in Fujian but demonstrates relatively poor performance in Jiangxi.
The outcomes of various methods are depicted in Figure 4 and Figure 5. Figure 4 demonstrates that the proposed model has fewer false positives compared to MLP, while effectively identifying existing fire spots. In contrast, DenseNet and SimpleCNN do not detect any fire spots, which aligns with the information in Table 2. Figure 5 reveals that there is substantial cloud cover in the ground truth. Despite these challenging conditions, the proposed model maintains a low false positive rate, while MLP is significantly impacted by cloud interference, resulting in numerous false detections. Likewise, DenseNet and SimpleCNN show no ability to detect fire spots.
In general, our suggested approach surpasses the other three methods regarding both the numerical data and visual outcomes. Consequently, our proposed model emerges as the most advantageous option, as it comprehensively utilizes both the core features and background information of fire points.

5. Discussion

To assess the effectiveness of the two components of the model, ablation experiments are carried out, as detailed in Table 3. The quantitative evaluation metrics presented in Table 3 demonstrate that the proposed method consistently surpasses all other models across various experimental conditions, achieving outstanding results in accuracy, recall, precision, and F1 score. This suggests that the incorporation of comprehensive fire point features greatly enhances the model's performance. While the precision of the BiLSTM model experiences a slight decline, it maintains a high recall rate. However, this drop in precision corresponds to an increase in the false detection rate, indicating that an exclusive focus on temporal features may result in a higher occurrence of false positive predictions. In comparison, the Swin Transformer exhibits a weaker performance: although it achieves high accuracy, its F1 score suffers due to its inadequate precision and recall, leading to larger prediction errors for fire points and an increase in false positives.
From the perspective of fire point detection across different models, the proposed method excels in identifying fire points. The integration of comprehensive fire point features significantly boosts the model’s capability to detect fire points while minimizing both false negatives and false positives. The BiLSTM model effectively captures fire point information, and despite a slight decrease in accuracy, it still maintains a high recall rate, indicating its continued effectiveness in fire point detection. Conversely, the Swin Transformer struggles with fire point detection, as it tends to produce larger prediction errors, resulting in more false negatives, highlighting the drawbacks of relying solely on global fire point features.
In conclusion, the proposed method significantly outperforms networks that rely solely on BiLSTM or the Swin Transformer in terms of accuracy and fire point detection capability, underscoring the importance of comprehensive feature fusion for effective fire point detection.

6. Conclusions and Future Work

This research presents an innovative deep-learning-based approach for the detection of fire points, aiming to improve both the accuracy and robustness of detection through a thorough examination of the extensive features associated with fire points. Significantly, this method is the first to implement the Swin Transformer architecture in fire point detection tasks, utilizing its self-attention mechanism to effectively model long-range global dependencies. This feature allows for the precise capturing of global information pertinent to fire points and their surrounding environments. By integrating the Swin Transformer with a Bidirectional Long Short-Term Memory (BiLSTM) network, the proposed approach facilitates comprehensive feature extraction across dual dimensions, ultimately amalgamating these features to enhance the performance of fire point detection.
Comparative analyses with conventional models such as DenseNet, SimpleCNN, and the Multi-Layer Perceptron (MLP) demonstrate that the proposed method outperforms these models across various metrics, including accuracy, recall, and F1 score. Notably, the proposed method significantly surpasses CNN models, which predominantly concentrate on local features, and MLP models, which primarily rely on the core band information of fire points, in terms of both the accuracy and the comprehensiveness of fire point detection. Ablation studies further validate the effectiveness of each component, indicating that the integration of multiple features markedly enhances detection performance. Although BiLSTM shows a slight reduction in precision, it excels in recall, highlighting its essential role in fire point detection.
Moreover, the experimental results suggest that the core band information of fire points is more critical than the global information in the context of fire point detection tasks. Nonetheless, background information still contributes to the model’s overall performance, particularly in complex background scenarios.
While the proposed method has demonstrated strong performance in experiments, there is still the potential to enhance its accuracy, particularly in extreme weather and challenging terrain. Additionally, the current research is confined to certain geographic areas, leading to notable variations in the model’s performance across different locations, which may affect its transferability. To tackle this challenge, future studies should aim to improve the model’s ability to generalize, ensuring it can achieve high accuracy across a broader range of geographical and climatic conditions. Enhancing transferability could involve increasing the model’s adaptability through diverse training datasets and cross-regional validation, particularly by conducting extensive evaluations under various climatic, terrain, and other conditions. This approach will contribute to the development of more robust models that consistently perform well in different environments, thereby enhancing their reliability and effectiveness in real-world applications.
In future research, alongside improving the model’s generalization and adaptability, data-level optimization should also be addressed. Although the Himawari-8 satellite data utilized in this study has a high temporal resolution, its spatial resolution is relatively low, which may limit the accuracy of fire point detection. Therefore, future investigations should explore the integration of Himawari-8 data with other high-spatial-resolution remote sensing data to create a dataset that combines both high temporal and spatial resolution, further advancing fire detection technology.

Author Contributions

Conceptualization, H.P. and D.L.; methodology, H.P. and D.L.; software, D.L.; validation, H.P.; formal analysis, H.P.; investigation, D.L.; resources, Y.Z.; data curation, Y.Z.; writing—original draft preparation, H.P. and D.L.; writing—review and editing, D.L.; visualization, H.P.; supervision, H.P.; project administration, H.P.; funding acquisition, H.P. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Correa, D.B.; Alcantara, E.; Libonati, R.; Massi, K.G.; Park, E. Increased burned area in the Pantanal over the past two decades. Sci. Total Environ. 2022, 835, 155386. [Google Scholar] [CrossRef] [PubMed]
  2. Halofsky, J.E.; Peterson, D.L.; Harvey, B.J. Changing wildfire, changing forests: The effects of climate change on fire regimes and vegetation in the Pacific Northwest, USA. Fire Ecol. 2020, 16, 4. [Google Scholar] [CrossRef]
  3. Shamsaei, K.; Juliano, T.W.; Roberts, M.; Ebrahimian, H.; Kosovic, B.; Lareau, N.P.; Taciroglu, E. Coupled fire-atmosphere simulation of the 2018 Camp Fire using WRF-Fire. Int. J. Wildland Fire 2023, 32, 195–221. [Google Scholar] [CrossRef]
  4. Aguilera, R.; Corringham, T.; Gershunov, A.; Benmarhnia, T. Wildfire smoke impacts respiratory health more than fine particles from other sources: Observational evidence from Southern California. Nat. Commun. 2021, 12, 1493. [Google Scholar] [CrossRef]
  5. Dutta, R.; Aryal, J.; Das, A.; Kirkpatrick, J.B. Deep cognitive imaging systems enable estimation of continental-scale fire incidence from climate data. Sci. Rep. 2013, 3, 3188. [Google Scholar] [CrossRef]
  6. Lian, C.Q.; Xiao, C.W.; Feng, Z.M.; Ma, Q. Accelerating decline of wildfires in China in the 21st century. Front. For. Glob. Chang. 2024, 6, 1252587. [Google Scholar] [CrossRef]
  7. Maffei, C.; Lindenbergh, R.; Menenti, M. Combining multi-spectral and thermal remote sensing to predict forest fire characteristics. ISPRS J. Photogramm. Remote Sens. 2021, 181, 400–412. [Google Scholar] [CrossRef]
  8. Kato, S.; Miyamoto, H.; Amici, S.; Oda, A.; Matsushita, H.; Nakamura, R. Automated classification of heat sources detected using SWIR remote sensing. Int. J. Appl. Earth Obs. Geoinf. 2021, 103, 102491. [Google Scholar] [CrossRef]
  9. Llorens, R.; Sobrino, J.A.; Fernández, C.; Fernández-Alonso, J.M.; Vega, J.A. A methodology to estimate forest fires burned areas and burn severity degrees using Sentinel-2 data. Application to the October 2017 fires in the Iberian Peninsula. Int. J. Appl. Earth Obs. Geoinf. 2021, 95, 102243. [Google Scholar] [CrossRef]
  10. Barmpoutis, P.; Papaioannou, P.; Dimitropoulos, K.; Grammalidis, N. A Review on Early Forest Fire Detection Systems Using Optical Remote Sensing. Sensors 2020, 20, 6442. [Google Scholar] [CrossRef]
  11. Hally, B.; Wallace, L.; Reinke, K.; Jones, S.; Engel, C.; Skidmore, A. Estimating fire background temperature at a geostationary scale—An evaluation of contextual methods for AHI-8. Remote Sens. 2018, 10, 1368. [Google Scholar] [CrossRef]
  12. Xu, W.; Wooster, M.J.; He, J.; Zhang, T. First study of Sentinel-3 SLSTR active fire detection and FRP retrieval: Night-time algorithm enhancements and global intercomparison to MODIS and VIIRS AF products. Remote Sens. Environ. 2020, 248, 111947. [Google Scholar] [CrossRef]
  13. Fusco, E.J.; Finn, J.T.; Abatzoglou, J.T.; Balch, J.K.; Dadashi, S.; Bradley, B.A. Detection rates and biases of fire observations from MODIS and agency reports in the conterminous United States. Remote Sens. Environ. 2019, 220, 30–40. [Google Scholar] [CrossRef]
  14. Waigl, C.F.; Stuefer, M.; Prakash, A.; Ichoku, C. Detecting high and low-intensity fires in Alaska using VIIRS I-band data: An improved operational approach for high latitudes. Remote Sens. Environ. 2017, 199, 389–400. [Google Scholar] [CrossRef]
  15. Ding, C.C.; Zhang, X.Y.; Chen, J.Y.; Ma, S.C.; Lu, Y.F.; Han, W.C. Wildfire detection through deep learning based on Himawari-8 satellites platform. Int. J. Remote Sens. 2022, 43, 5040–5058. [Google Scholar] [CrossRef]
  16. Yang, S.X.; Huang, Q.Y.; Yu, M.Z. Advancements in remote sensing for active fire detection: A review of datasets and methods. Sci. Total Environ. 2024, 943, 173273. [Google Scholar] [CrossRef]
  17. Huh, Y.; Lee, J. Enhanced contextual forest fire detection with prediction interval analysis of surface temperature using vegetation amount. Int. J. Remote Sens. 2017, 38, 3375–3393. [Google Scholar] [CrossRef]
  18. De Almeida Pereira, G.H.; Fusioka, A.M.; Nassu, B.T.; Minetto, R. Active fire detection in Landsat-8 imagery: A large-scale dataset and a deep-learning study. ISPRS J. Photogramm. Remote Sens. 2021, 178, 171–186. [Google Scholar] [CrossRef]
  19. Rashkovetsky, D.; Mauracher, F.; Langer, M.; Schmitt, M. Wildfire detection from multisensor satellite imagery using deep semantic segmentation. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2021, 14, 7001–7016. [Google Scholar] [CrossRef]
  20. Zhou, W.; Tang, B.H.; He, Z.W.; Huang, L.; Chen, J.Y. Identification of forest fire points under clear sky conditions with Himawari-8 satellite data. Int. J. Remote Sens. 2024, 45, 214–234. [Google Scholar] [CrossRef]
  21. Zhang, Y.; He, B.; Kong, P.; Xu, H.; Zhang, Q.; Quan, X.; Lai, G. Near Real-Time Wildfire Detection in Southwestern China Using Himawari-8 Data. In Proceedings of the 2021 IEEE International Geoscience and Remote Sensing Symposium IGARSS, Brussels, Belgium, 11–16 July 2021; pp. 8416–8419. [Google Scholar] [CrossRef]
  22. Abid, F. A survey of machine learning algorithms based forest fires prediction and detection systems. Fire Technol. 2021, 57, 559–590. [Google Scholar]
  23. Mohajane, M.; Costache, R.; Karimi, F.; Pham, Q.B.; Essahlaoui, A.; Nguyen, H.; Oudija, F. Application of remote sensing and machine learning algorithms for forest fire mapping in a Mediterranean area. Ecol. Indic. 2021, 129, 107869. [Google Scholar]
  24. Kang, Y.; Jang, E.; Im, J.; Kwon, C. A deep learning model using geostationary satellite data for forest fire detection with reduced detection latency. Gisci. Remote Sens. 2022, 59, 2019–2035. [Google Scholar] [CrossRef]
  25. Marjani, M.; Ahmadi, S.A.; Mahdianpari, M. FirePred: A hybrid multi-temporal convolutional neural network model for wildfire spread prediction. Ecol. Inform. 2023, 78, 102282. [Google Scholar] [CrossRef]
  26. Kang, Y.; Sung, T.; Im, J. Toward an adaptable deep-learning model for satellite-based wildfire monitoring with consideration of environmental conditions. Remote Sens. Environ. 2023, 298, 113814. [Google Scholar] [CrossRef]
  27. Zhang, Q.; Zhu, J.; Huang, Y.; Yuan, Q.Q.; Zhang, L.P. Beyond being wise after the event: Combining spatial, temporal and spectral information for Himawari-8 early-stage wildfire detection. Int. J. Appl. Earth Obs. Geoinf. 2023, 124, 103506. [Google Scholar] [CrossRef]
  28. Ghali, R.; Akhloufi, M.A.; Jmal, M.; Mseddi, W.S.; Attia, R. Wildfire Segmentation Using Deep Vision Transformers. Remote Sens. 2021, 13, 327. [Google Scholar] [CrossRef]
  29. Gonçalves, D.N.; Marcato, J.; Carrilho, A.C.; Acosta, P.R.; Ramos, A.P.M.; Gomes, F.D.G.; Libonati, R. Transformers for mapping burned areas in Brazilian Pantanal and Amazon with PlanetScope imagery. Int. J. Appl. Earth Obs. Geoinf. 2023, 116, 103151. [Google Scholar] [CrossRef]
  30. Ying, L.; Shen, Z.; Yang, M.; Piao, S. Wildfire detection probability of MODIS fire products under the constraint of environmental factors: A study based on confirmed ground wildfire records. Remote Sens. 2019, 11, 3031. [Google Scholar] [CrossRef]
  31. Guo, M.; Yao, Q.C.; Suo, H.Q.; Xu, X.X.; Li, J.; He, H.S.; Yin, S.; Li, J.N. The importance degree of weather elements in driving wildfire occurrence in mainland China. Ecol. Indic. 2023, 148, 110152. [Google Scholar] [CrossRef]
  32. Wang, W.Q.; Zhao, F.J.; Wang, Y.X.; Huang, X.Y.; Ye, J.X. Seasonal differences in the spatial patterns of wildfire drivers and susceptibility in the southwest mountains of China. Sci. Total Environ. 2023, 869, 161782. [Google Scholar] [CrossRef] [PubMed]
  33. Xing, H.; Fang, K.Y.; Yao, Q.C.; Zhou, F.F.; Ou, T.H.; Liu, J.E.; Zhou, S.F.; Jiang, S.X.; Chen, Y.; Bai, M.W.; et al. Impacts of changes in climate extremes on wildfire occurrences in China. Ecol. Indic. 2023, 157, 111288. [Google Scholar] [CrossRef]
  34. Bessho, K.; Date, K.; Hayashi, M.; Ikeda, A.; Imai, T.; Inoue, H.; Kumagai, Y.; Miyakawa, T.; Murata, H.; Ohno, T.; et al. An Introduction to Himawari-8/9—Japan’s New-Generation Geostationary Meteorological Satellites. J. Meteorol. Soc. Japan. Ser. II 2016, 94, 151–183. [Google Scholar] [CrossRef]
  35. Xu, W.; Wooster, M.J.; Kaneko, T.; He, J.; Zhang, T.; Fisher, D. Major advances in geostationary fire radiative power (FRP) retrieval over Asia and Australia stemming from use of Himawari-8 AHI. Remote Sens. Environ. 2017, 193, 138–149. [Google Scholar] [CrossRef]
  36. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 10012–10022. [Google Scholar]
  37. Siami-Namini, S.; Tavakoli, N.; Namin, A.S. The performance of LSTM and BiLSTM in forecasting time series. In Proceedings of the 2019 IEEE International Conference on Big Data (Big Data), Los Angeles, CA, USA, 9–12 December 2019; pp. 3285–3292. [Google Scholar]
Figure 1. Research area.
Figure 2. Flowchart of our proposed method.
Figure 3. Overall architecture of the proposed model.
Figure 4. The Fujian fire map and ground truth generated by each method, where a red “×” is the fire spot.
Figure 5. The Jiangxi fire map and ground truth generated by each method, where a red “×” is the fire spot.
Table 1. Comparison of results.

Model               | A     | P     | R     | F1    | ED    | MD
Our Proposed Method | 0.999 | 0.652 | 0.747 | 0.696 | 0.348 | 0.253
DenseNet            | 0.999 | --    | 0.000 | --    | --    | 1.000
SimpleCNN           | 0.999 | --    | 0.000 | --    | --    | 1.000
MLP                 | 0.999 | 0.471 | 0.567 | 0.510 | 0.529 | 0.433
Table 2. Comparison of subdivision by province.

Province | Model               | A     | P     | R     | F1    | ED    | MD
Fujian   | Our Proposed Method | 0.999 | 0.700 | 0.764 | 0.731 | 0.300 | 0.236
Fujian   | DenseNet            | 0.999 | --    | 0.000 | --    | --    | 1.000
Fujian   | SimpleCNN           | 0.999 | --    | 0.000 | --    | --    | 1.000
Fujian   | MLP                 | 0.999 | 0.617 | 0.640 | 0.628 | 0.383 | 0.360
Jiangxi  | Our Proposed Method | 0.999 | 0.604 | 0.730 | 0.661 | 0.396 | 0.270
Jiangxi  | DenseNet            | 0.999 | --    | 0.000 | --    | --    | 1.000
Jiangxi  | SimpleCNN           | 0.999 | --    | 0.000 | --    | --    | 1.000
Jiangxi  | MLP                 | 0.999 | 0.324 | 0.495 | 0.392 | 0.676 | 0.505
Table 3. Ablation experiment.

Model               | A     | P     | R     | F1    | ED    | MD
Our Proposed Method | 0.999 | 0.652 | 0.747 | 0.696 | 0.348 | 0.253
BiLSTM              | 0.999 | 0.612 | 0.715 | 0.659 | 0.388 | 0.285
Swin Transformer    | 0.999 | 0.561 | 0.638 | 0.597 | 0.439 | 0.362
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
