1. Introduction
In recent years, propelled by advancements in sensor technology, the acquisition of remote sensing (RS) images has become increasingly convenient. Tasks such as ground crop detection [1,2], change detection [3,4,5], and urban scene segmentation [6,7,8] have attracted increasing attention. Within this field, remote sensing-based fire detection and segmentation are particularly important for reducing damage and analyzing carbon emissions [9].
Thanks to the availability of publicly accessible satellite data, hyperspectral images (HSIs) acquired by satellites have been widely employed for fire segmentation [10,11,12]. Some works [13,14] use post-disaster satellite images as data and focus on detecting burned areas after wildfires. However, satellite images are ill suited to direct real-time wildfire detection: the inherent characteristics of satellites, such as low revisit rates and limited spatial resolution, significantly impede their utility in emergency situations such as fire disasters, as shown in Figure 1. Consequently, relying solely on satellite remote sensing images makes it difficult to meet the real-time demands of such unforeseen events.
Unmanned aerial vehicles (UAVs) offer a flexible and cost-effective solution for capturing high-resolution remote sensing images at low altitudes, addressing the challenges associated with satellite imagery. Low-altitude UAV-based wildfire detection algorithms have been explored previously [15,16,17]; these early methods leverage enhanced CNN architectures to extract information from low-altitude remote sensing images [18,19]. However, due to the inherent characteristics of CNNs, i.e., locality and translation invariance, they struggle to effectively capture global information within the images.
The Vision Transformer (ViT) can achieve higher accuracy than CNNs in tasks such as image segmentation owing to its effective attention mechanism [20,21,22,23,24], but at the cost of a higher computational load. Although several approaches have been proposed to address this issue, e.g., the shifted window attention mechanism introduced in [25], existing methods [26,27] remain limited for real-time, precise fire segmentation of low-altitude remote sensing images. This is particularly true for UAV fire images, which usually possess higher resolutions (typically 3840 × 2140) and richer semantic information than common images. To achieve accurate segmentation of RS images, the connection between small fire-point regions and the overall image is indispensable, as illustrated by the annotated local and global information in Figure 2.
This paper presents a novel network named FireFormer for the real-time wildfire segmentation of UAV RS images. FireFormer adopts a hybrid structure comprising a ResNet18-based encoder [28] and a Transformer-based decoder. In the decoder, we introduce the Forest Fire Transformer Block (FFTB) as a key component that effectively integrates global and local contexts through a lightweight dual-branch structure. Moreover, to overcome the blurred-edge features caused by the series of downsampling and upsampling operations in the network, we propose the Feature Refinement Net (FRN).
The main contributions of this research are outlined as follows:
- (1) We propose the FFTB, which incorporates two parallel branches. These branches focus on capturing both the global and local information of the RS image.
- (2) To acquire global information of an image with a lower computational cost, we have devised a Cross-Scaled Dot-Product attention mechanism. This mechanism captures global information at different scales and performs feature fusion.
- (3) We propose the FRN to refine the output feature maps with finer details. This structure addresses the challenge of losing small objects during the segmentation process by leveraging parallel spatial and channel operations.
3. Methods
In this section, we begin by presenting an overview of the FireFormer model. We then introduce its two pivotal modules, namely the Forest Fire Transformer Block (FFTB) and the Feature Refinement Net (FRN).
3.1. Network Structure
The overall architecture of the FireFormer model is depicted in Figure 4. FireFormer adopts a hybrid structure comprising a CNN-based encoder and a Transformer-based decoder. By employing weighted feature map fusion (WF), FireFormer effectively obtains the discriminative features that are crucial for identifying forest fires in RS images. This fusion is expressed by Equation (1), in which the output feature map of the ResNet encoder and the output feature map of the decoder are combined using a learnable weight. Furthermore, the FRN is introduced to further enhance the segmentation accuracy of FireFormer.
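Equation (1) itself is not reproduced in this excerpt. As a rough illustration only, the sketch below implements the WF step as a convex combination of the encoder and decoder feature maps with a single learnable scalar; the module name and the exact weighting scheme are our assumptions rather than the paper's formulation.

```python
import torch
import torch.nn as nn

class WeightedFusion(nn.Module):
    """Hypothetical sketch of the WF step: blend encoder and decoder
    feature maps with a single learnable scalar weight (our assumption)."""
    def __init__(self):
        super().__init__()
        self.alpha = nn.Parameter(torch.tensor(0.5))  # learnable weight

    def forward(self, f_enc, f_dec):
        # Both inputs are assumed to share the same shape (B, C, H, W).
        a = torch.sigmoid(self.alpha)          # keep the weight in (0, 1)
        return a * f_enc + (1.0 - a) * f_dec   # weighted feature map fusion
```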
In the CNN-based encoder of FireFormer, we have chosen to use a pre-trained ResNet18 as the encoder. ResNet18 is composed of four stages of residual blocks (Res Blocks), where each block downsamples the feature maps by a factor of two and doubles the number of channels. Passing an RS image through the encoder therefore produces one feature map per stage, each at half the resolution and twice the channel count of the previous one.
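For concreteness, the following sketch extracts the four residual stages from a torchvision ResNet18 (used here as a stand-in for the paper's pre-trained encoder) and prints the stage output shapes for a 1024 × 1024 crop; the variable names are ours.

```python
import torch
from torchvision.models import resnet18

# torchvision's resnet18 is used as a stand-in for the paper's encoder;
# pass weights="IMAGENET1K_V1" to load the pre-trained backbone.
backbone = resnet18(weights=None)
stem = torch.nn.Sequential(backbone.conv1, backbone.bn1, backbone.relu, backbone.maxpool)
stages = [backbone.layer1, backbone.layer2, backbone.layer3, backbone.layer4]

x = torch.randn(1, 3, 1024, 1024)   # an RS image crop, as used in training
x = stem(x)                          # the stem reduces resolution by a factor of 4
for i, stage in enumerate(stages, start=1):
    x = stage(x)
    print(f"stage {i}: {tuple(x.shape)}")

# printed shapes for a 1024 x 1024 input:
# stage 1: (1, 64, 256, 256)
# stage 2: (1, 128, 128, 128)
# stage 3: (1, 256, 64, 64)
# stage 4: (1, 512, 32, 32)
```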
In the Transformer-based decoder of FireFormer, we have designed a parallel dual-branch (i.e., a global branch and a local branch) FFTB to enhance the overall representation of the RS image features extracted by the CNN-based encoder, where each decoder block upsamples the feature maps by a factor of two and halves the number of channels. When the given feature maps pass through the global branch of the FFTB, Cross-Scaled Dot-Product attention (CDA) is employed to aggregate information from both the local details and the global context. Specifically, CDA divides the feature maps into patches, each of size 2 × 2. These patches are then flattened and projected into Key (K), Query (Q) and Value (V) representations through linear operations. The aggregation of global information in the feature maps is achieved through Equation (2).
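Equation (2) corresponds to the standard scaled dot-product attention, softmax(QKᵀ/√d)V, applied to the flattened 2 × 2 window tokens. The sketch below illustrates only this core step (window partitioning, linear Q/K/V projections, and softmax-weighted aggregation); the cross-window interaction of Section 3.3 is omitted, and all module and variable names are our own.

```python
import torch
import torch.nn as nn

class WindowDotProductAttention(nn.Module):
    """Sketch of the attention core assumed by Equation (2):
    softmax(Q K^T / sqrt(d)) V, computed inside non-overlapping 2x2 windows."""
    def __init__(self, dim, window=2):
        super().__init__()
        self.window = window
        self.to_qkv = nn.Linear(dim, dim * 3, bias=False)  # Q, K, V projections
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                      # x: (B, C, H, W)
        B, C, H, W = x.shape
        w = self.window
        # partition into non-overlapping w x w windows and flatten each window
        x = x.view(B, C, H // w, w, W // w, w).permute(0, 2, 4, 3, 5, 1)
        x = x.reshape(B * (H // w) * (W // w), w * w, C)   # (windows, tokens, C)
        q, k, v = self.to_qkv(x).chunk(3, dim=-1)
        attn = torch.softmax(q @ k.transpose(-2, -1) / C ** 0.5, dim=-1)
        out = self.proj(attn @ v)
        # fold the windows back into a (B, C, H, W) feature map
        out = out.view(B, H // w, W // w, w, w, C).permute(0, 5, 1, 3, 2, 4)
        return out.reshape(B, C, H, W)
```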
In the local branch of the FFTB, FireFormer incorporates a simple yet effective Local Feature Enhancement (LFE) block. This block is specifically designed to enhance FireFormer’s capability to accurately identify fire points in vast forests and to improve its ability to discern fire edges. Lastly, the FFTB performs a summation operation on the output feature maps of the global and local branches, ensuring that the output feature maps maintain their original resolution; one such output is produced at each of the four decoding stages.
At each of the four encoding and decoding stages, the encoded feature maps are combined with the corresponding decoded feature maps through a weighted summation in the skip connection. This process ensures a balance between local and global information in the feature maps. After four such stages, the final fused feature maps are obtained.
Segmenting small fire points in high-resolution RS images is challenging due to factors such as tree occlusion and variations in fire intensity. To address this, we introduce the FRN at the end of FireFormer. The FRN refines the feature maps spatially and channel-wise, and the refined feature maps are then upsampled by a factor of four to restore the original resolution, with one output channel per class. These refined feature maps play a crucial role in achieving accurate segmentation results.
3.2. Forest Fire Transformer Block
The semantic segmentation of RS images requires a network with robust information extraction capabilities, which should prioritize local information while also accounting for global information. To address these challenges, we developed the FFTB, whose structure is depicted in Figure 5.
The FFTB global branch is constructed with Transformer Blocks that incorporate the CDA mechanism. This branch is specifically designed to extract global information from the feature maps while preserving their resolution and channel size: given an input feature map, the global branch produces an output feature map of the same shape.
The FFTB local branch consists of two parallel depthwise separable convolutions (DWConv). To ensure the preservation of features across channels, the number of channels in the feature map is kept unchanged. The two parallel depthwise separable convolutions are applied to the input feature map, and their outputs are fused along the channel dimension to obtain the final output feature map L of the local branch. Lastly, L is merged with the corresponding output G of the global branch, and an activation function is applied to eliminate information redundancy within the fused feature map. A rough sketch of this process is given below.
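A minimal sketch of the local branch described above, assuming 3 × 3 and 5 × 5 depthwise kernels, a 1 × 1 fusion convolution, and a ReLU as the activation (the activation named in the paper is not recovered in this excerpt); none of these choices is confirmed by the source.

```python
import torch
import torch.nn as nn

def dwsep_conv(ch, k):
    """Depthwise separable convolution: depthwise k x k followed by pointwise 1 x 1."""
    return nn.Sequential(
        nn.Conv2d(ch, ch, k, padding=k // 2, groups=ch, bias=False),
        nn.Conv2d(ch, ch, 1, bias=False),
        nn.BatchNorm2d(ch),
    )

class LocalBranch(nn.Module):
    """Sketch of the FFTB local branch: two parallel depthwise separable
    convolutions, channel-wise fusion, and merging with the global branch.
    Kernel sizes and the final activation are our assumptions."""
    def __init__(self, ch):
        super().__init__()
        self.dw_a = dwsep_conv(ch, 3)
        self.dw_b = dwsep_conv(ch, 5)
        self.fuse = nn.Conv2d(2 * ch, ch, 1, bias=False)  # channel-wise fusion back to C
        self.act = nn.ReLU(inplace=True)                  # placeholder activation

    def forward(self, x, g):
        # x: input feature map, g: output of the global branch (same shape)
        l = self.fuse(torch.cat([self.dw_a(x), self.dw_b(x)], dim=1))
        return self.act(l + g)                            # merge local and global context
```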
3.3. Cross-Scaled Dot-Product Attention Mechanism
Our observations show that the feature maps derived from RS images during feature extraction contain a wealth of information, including intricate texture details that exhibit strong local characteristics as well as semantic information that exhibits robust global characteristics.
In order to fully exploit the detailed texture information in remote sensing images, we have opted for a straightforward non-overlapping window partitioning approach (specifically, we have utilized 2 × 2 windows in our experiments).
The advanced semantic information in remote sensing images is crucial for semantic segmentation. Therefore, we propose the Cross-Scaled Dot-Product Attention mechanism to globally model the feature maps of remote sensing images, as illustrated in Figure 6. To enable the exchange of information between windows, we employ convolution in both the horizontal and vertical directions. The cross-shaped window interaction module merges features from the vertical and horizontal directions, thereby achieving global information interaction. Specifically, in the horizontal direction, the information interaction between any point in window 1 and any point in window 2 can be effectively modeled using Equation (5), in which the window size appears as a parameter. Simultaneously, the relationships between points within an individual window, as exemplified by the red path within window 1, have already been established during the window self-attention process, as described by Equation (6). Consequently, in the horizontal direction, the interaction of information between windows is facilitated by the convolution operation.
Similarly, in the vertical direction, inter-window information exchange is established through vertical convolution, and the relationship between window 1 and window 3 can be described using Equation (5). Information exchange across greater distances is achieved by connecting multiple intermediate windows; for instance, points in window 4 can interact with points in window 2, and points in window 2 can in turn interact with points in window 1, thereby enabling information exchange between windows in different directions. Thus, the cross-shaped window interaction method effectively models global relationships along the window directions and captures the global information present in the feature map, as sketched below.
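As an illustration of the cross-shaped interaction, the sketch below applies one horizontal and one vertical depthwise convolution to the window-attended feature map and sums the two directional passes; the kernel size and the residual summation are our assumptions.

```python
import torch
import torch.nn as nn

class CrossShapedInteraction(nn.Module):
    """Sketch: horizontal (1 x k) and vertical (k x 1) depthwise convolutions
    that propagate information between neighbouring windows in both directions."""
    def __init__(self, ch, k=5):
        super().__init__()
        self.horizontal = nn.Conv2d(ch, ch, (1, k), padding=(0, k // 2), groups=ch, bias=False)
        self.vertical = nn.Conv2d(ch, ch, (k, 1), padding=(k // 2, 0), groups=ch, bias=False)

    def forward(self, x):                 # x: (B, C, H, W) window-attended features
        # summing the two directional passes approximates the cross-shaped path;
        # stacking this block lets distant windows communicate indirectly
        return x + self.horizontal(x) + self.vertical(x)
```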
3.4. Feature Refinement Net
In the context of semantic segmentation tasks involving RS images, feature map fusion plays a pivotal role within the U-Net-like framework, yet skip connections alone have a limited impact on the segmentation accuracy of the final fused feature map. Hence, we propose the Feature Refinement Net, which refines the final fused feature map from two distinct perspectives, the Spatial Path and the Channel Path, as depicted in Figure 7.
We commence by upsampling the fused feature map by a factor of 2. In the Channel Path, we employ adaptive average pooling to reduce the spatial dimensions of this feature map to 1 × 1. To ensure a precise recovery of the channel information in the FRN, we allocate the channels of the pooled descriptor across three branches in a ratio of 2:1:1. Within each channel branch, we then apply a reduce-and-expand operation comprising four 1 × 1 convolutional layers, which reduces the channel count to one half, then to one quarter, before restoring it to the original quantity. The outcomes of the three channel branches are concatenated along the channel dimension, yielding the final channel-wise recovered descriptor. Finally, spatial recovery is achieved through matrix multiplication between this descriptor and the upsampled feature map, as sketched below.
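A sketch of the Channel Path under our reading of the text: global average pooling, a 2:1:1 channel split, a four-layer 1 × 1 reduce-and-expand per branch, concatenation, and a broadcasted reweighting of the upsampled feature map in place of the paper's matrix product. All module names and the sigmoid gating are assumptions.

```python
import torch
import torch.nn as nn

def reduce_expand(ch):
    """Four 1x1 convolutions: C -> C/2 -> C/4 -> C/2 -> C (our reading of the text)."""
    return nn.Sequential(
        nn.Conv2d(ch, ch // 2, 1), nn.ReLU(inplace=True),
        nn.Conv2d(ch // 2, ch // 4, 1), nn.ReLU(inplace=True),
        nn.Conv2d(ch // 4, ch // 2, 1), nn.ReLU(inplace=True),
        nn.Conv2d(ch // 2, ch, 1),
    )

class ChannelPath(nn.Module):
    """Sketch of the FRN Channel Path: global pooling, a 2:1:1 channel split,
    per-branch reduce-and-expand, and channel-wise reweighting of the features."""
    def __init__(self, ch):
        super().__init__()
        self.splits = [ch // 2, ch // 4, ch // 4]          # 2:1:1 channel ratio
        self.pool = nn.AdaptiveAvgPool2d(1)                # squeeze spatial size to 1 x 1
        self.branches = nn.ModuleList([reduce_expand(c) for c in self.splits])

    def forward(self, x):                                  # x: upsampled fused map (B, C, H, W)
        w = self.pool(x)                                   # (B, C, 1, 1)
        parts = torch.split(w, self.splits, dim=1)
        w = torch.cat([b(p) for b, p in zip(self.branches, parts)], dim=1)
        # broadcasted multiplication is our stand-in for the paper's matrix product
        return x * torch.sigmoid(w)
```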
To refine spatial features at different levels in the Spatial Path, we divide the feature map into two parts along the channel dimension. In the first branch, we apply two consecutive depthwise separable convolution layers to one part, using convolution kernels of size 7 × 7 and 5 × 5; the larger kernels provide a wider receptive field, thereby enhancing the global semantic information of the feature map. In the second branch, we apply a depthwise separable convolution with a 3 × 3 kernel to the other part to enrich the spatial texture details of the feature map. Finally, the two branch outputs are merged along the channel dimension and summed with the input through a residual connection to obtain the refined feature map, which prevents any degradation in network performance. A sketch of this path follows.
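A sketch of the Spatial Path as described above; the batch normalization, ReLU, and the placement of the residual connection are our assumptions.

```python
import torch
import torch.nn as nn

def dwconv(ch, k):
    """Depthwise separable convolution with a k x k depthwise kernel."""
    return nn.Sequential(
        nn.Conv2d(ch, ch, k, padding=k // 2, groups=ch, bias=False),
        nn.Conv2d(ch, ch, 1, bias=False),
        nn.BatchNorm2d(ch),
        nn.ReLU(inplace=True),
    )

class SpatialPath(nn.Module):
    """Sketch of the FRN Spatial Path: a large-kernel branch (7x7 then 5x5) for one
    half of the channels, a 3x3 branch for the other half, channel-wise
    concatenation, and a residual connection."""
    def __init__(self, ch):
        super().__init__()
        half = ch // 2
        self.large = nn.Sequential(dwconv(half, 7), dwconv(half, 5))  # wide receptive field
        self.small = dwconv(half, 3)                                  # fine texture details

    def forward(self, x):                               # x: (B, C, H, W)
        x1, x2 = torch.chunk(x, 2, dim=1)               # split channels into two halves
        out = torch.cat([self.large(x1), self.small(x2)], dim=1)
        return out + x                                  # residual to avoid degradation
```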
3.5. Datasets
3.5.1. FLAME Dataset
The FLAME dataset was gathered in northern Arizona, USA, utilizing drones for controlled pile burning. This dataset comprises multiple collections, encompassing aerial videos captured by drone cameras and thermal images recorded by infrared cameras. Within this dataset, there are 2003 photos annotated with fire areas, each with a size of 3840 × 2140. These annotated images serve as valuable resources for conducting fire semantic segmentation research. To facilitate training, we allocated 60% of the images for training purposes, 20% for validation and the remaining 20% for testing. Prior to training, each image is cropped to a size of 1024 × 1024, ensuring consistency across the dataset.
3.5.2. FLAME2 Dataset
To assess the generalization capability of FireFormer and to rule out any influence of near-duplicate training images on the segmentation results, we conducted segmentation tests on the FLAME2 dataset [53]. Additionally, based on the obtained test results, we performed an analysis of carbon emissions during forest fire burning processes. The FLAME2 dataset primarily comprises original and manipulated aerial images captured during a controlled fire in November 2021 in an open canopy pine forest in northern Arizona. It includes 7 sets of original RGB and infrared (IR) videos that have not been annotated. For the experiment, we primarily selected a portion of the third fire’s RGB video for segmentation testing. We extracted frames from the RGB video and subsequently cropped the extracted frame images to a size of 1024 × 1024.
3.6. Implementation Details
3.6.1. Training and Testing Settings
In the experiment, all models were implemented using the PyTorch framework on an NVIDIA RTX 3060 GPU. To achieve fast convergence, we used the AdamW optimizer to train all models, with a base learning rate of 6 × 10⁻⁴ adjusted by a cosine annealing strategy. The maximum number of training epochs was set to 40, and the batch size was set to 4. During training, images from the FLAME dataset were randomly cropped to a size of 1024 × 1024.
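The training configuration above can be reproduced roughly as follows; `model` is a placeholder module, and the absence of warm-up in the cosine schedule is our assumption.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

# `model` is a placeholder for the FireFormer network.
model = torch.nn.Conv2d(3, 2, 1)   # stand-in module so the snippet runs

optimizer = AdamW(model.parameters(), lr=6e-4)               # base learning rate from the paper
num_epochs = 40
scheduler = CosineAnnealingLR(optimizer, T_max=num_epochs)   # cosine annealing over the epochs

for epoch in range(num_epochs):
    # ... one training epoch over 1024 x 1024 random crops, batch size 4 ...
    scheduler.step()
```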
3.6.2. Evaluation Metrics
In our experiment, we employed two main categories of evaluation metrics. The first category focuses on assessing the accuracy of the network and includes the following metrics: overall accuracy (OA), average F1 score (F1) and mean intersection over union (mIoU).
The second category evaluates the efficiency of the network and includes the following metrics: floating-point operations (FLOPs), frames per second (FPS), memory usage in megabytes (MB) and the number of model parameters in millions (M). By considering both accuracy and efficiency metrics, we can comprehensively evaluate the performance of the network in terms of accuracy, computational complexity, speed and memory usage.
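As a reference, the accuracy metrics can be computed from a confusion matrix as in the sketch below; the helper name and the example numbers are illustrative only.

```python
import numpy as np

def segmentation_metrics(conf):
    """Compute OA, mean F1 and mIoU from a (num_classes x num_classes) confusion
    matrix whose rows are ground truth and whose columns are predictions."""
    tp = np.diag(conf).astype(float)
    fp = conf.sum(axis=0) - tp
    fn = conf.sum(axis=1) - tp

    oa = tp.sum() / conf.sum()                     # overall accuracy
    iou = tp / (tp + fp + fn + 1e-10)              # per-class intersection over union
    f1 = 2 * tp / (2 * tp + fp + fn + 1e-10)       # per-class F1 score
    return oa, f1.mean(), iou.mean()

# example: a binary fire / background confusion matrix (illustrative numbers)
conf = np.array([[980_000, 5_000],
                 [3_000, 12_000]])
print(segmentation_metrics(conf))
```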
3.6.3. Loss Functions
In order to expedite the convergence of the network during training, we employ a unified loss function that combines the Dice loss and the cross-entropy loss to supervise the model.
The Dice loss function is robust to class imbalance and highly sensitive to the pixel overlap of small targets, so it performs better when segmenting small targets. The cross-entropy loss function penalizes misclassifications more heavily during training, encouraging the model to learn clearer classification boundaries. To improve the accuracy of the segmentation results, we used a combination of these two loss functions.
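The exact formula and weighting of the joint loss are not reproduced in this excerpt; the sketch below assumes an unweighted sum of the standard cross-entropy loss and a soft Dice loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointLoss(nn.Module):
    """Sketch of the joint loss: cross-entropy plus a soft Dice term
    (equal weighting of the two terms is our assumption)."""
    def __init__(self, smooth=1.0):
        super().__init__()
        self.ce = nn.CrossEntropyLoss()
        self.smooth = smooth

    def forward(self, logits, target):
        # logits: (B, num_classes, H, W); target: (B, H, W) long tensor of class indices
        ce = self.ce(logits, target)
        probs = torch.softmax(logits, dim=1)
        one_hot = F.one_hot(target, logits.shape[1]).permute(0, 3, 1, 2).float()
        inter = (probs * one_hot).sum(dim=(0, 2, 3))
        union = probs.sum(dim=(0, 2, 3)) + one_hot.sum(dim=(0, 2, 3))
        dice = 1.0 - ((2 * inter + self.smooth) / (union + self.smooth)).mean()
        return ce + dice
```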
3.7. Hardware
The FLAME dataset used in this study was collected using various drones and cameras.
Table 1 describes the technical specifications of the drones and cameras utilized, along with the specific information obtained from the FLAME dataset.
5. Discussion
In this work, we used UAV-acquired low-altitude remote sensing images of early forest fires as data to develop a novel CNN–Transformer hybrid forest fire identification algorithm.
Firstly, the impetus for this work originated from our observation that, within the field of remote sensing, there is a limited but significant body of research on monitoring fires using UAV remote sensing imagery [59,60]. These efforts have advanced fire detection methodologies within the domain of remote sensing. LUFFD-YOLO [61] introduced a wildfire detection algorithm based on attention mechanisms and the fusion of multi-scale features, enabling the automated identification of fire regions in remote sensing images. The authors of [56], building upon the Deeplabv3+ framework, achieved the detection of fires in UAV imagery. However, we have noted that the aforementioned methods are predominantly founded on object recognition models, which are only capable of outlining the fire areas within the remote sensing images; this can lead to the inclusion of pixels that do not actually belong to the fire region. Therefore, the FireFormer we propose is built on a semantic segmentation model, facilitating pixel-level detection.
Secondly, the reason for proposing FireFormer is that most current fire detection algorithms still rely on satellite imagery as data [62,63]. Rida Kanwal and colleagues [64] employed a CNN to achieve fire segmentation based on satellite remote sensing images. Mukul Badhan and colleagues [65] used VIIRS satellite remote sensing images to implement more accurate large-scale fire identification. We believe that satellite images have a relatively low spatial resolution and long revisit cycles, making it difficult to clearly and promptly reflect fire conditions on the ground; relying solely on satellite images is even more challenging for the early small flames of forest fires. Unmanned aerial vehicles, as low-altitude remote sensing acquisition platforms, can obtain real-time fire conditions far more quickly.
To address the shortcomings we identified in the existing work during our research process, we proposed FireFormer. Considering the high resolution of remote sensing images, we introduced a lightweight Transformer framework, the Forest Fire Transformer Block. The experimental results have demonstrated that this module enhances the network’s ability for real-time fire monitoring. We believe that FireFormer can be deployed on unmanned aerial vehicle systems, enabling ground detection personnel to respond to forest fires more conveniently and accurately.
However, our work still has two areas of insufficiency. First, the algorithm we propose is supervised and therefore requires a significant amount of human labor to annotate remote sensing images before model training can be completed. Second, we believe that there is still room for improvement in the feature extraction module of FireFormer; more advanced feature extraction methods could achieve higher recognition accuracy. In the future, we plan to use self-supervised learning methods for model training to alleviate the need for large amounts of annotated data in fire recognition algorithms. In addition, we will continue to develop new feature extraction modules to achieve more accurate fire detection.
6. Conclusions
This work aims to extract valuable information from early wildfire images obtained from low-altitude unmanned aerial vehicles (UAVs), in order to achieve higher flame segmentation accuracy at a smaller computational cost. Based on our findings, traditional CNN segmentation methods and Transformer-based segmentation methods struggle to strike a good balance between real-time performance and accuracy when faced with wildfire images captured by UAVs. To address this problem, we designed a novel real-time semantic segmentation network with a hybrid structure that incorporates our Forest Fire Transformer Block (FFTB) and Feature Refinement Net (FRN). Specifically, in the global branch of the FFTB, we designed the Cross-Scaled Dot-Product attention (CDA) mechanism, which allows the network to perform global modeling of feature maps at a lower computational cost, effectively reducing the computational load and improving real-time performance. Additionally, in segmentation tasks, feature maps undergo multiple downsampling and upsampling operations, resulting in the loss of edge details; we therefore designed the FRN, whose effectiveness was validated through experiments and result visualization. Finally, we proposed an approximate calculation method for estimating carbon emissions from early forest fires from the perspective of low-altitude UAVs, based on the segmentation results of FireFormer. In future work, we will continue to research better image feature encoders to replace the concise encoder in FireFormer, in order to achieve higher segmentation accuracy.