Article

FireNet-KD: Swin Transformer-Based Wildfire Detection with Multi-Source Knowledge Distillation

by Naveed Ahmad 1, Mariam Akbar 1,*, Eman H. Alkhammash 2,* and Mona M. Jamjoom 3

1 Department of Computer Science, COMSATS University Islamabad, Islamabad 44000, Pakistan
2 Department of Computer Science, College of Computers and Information Technology, Taif University, P.O. Box 11099, Taif 21944, Saudi Arabia
3 Department of Computer Sciences, College of Computer and Information Sciences, Princess Nourah Bint Abdulrahman University, P.O. Box 84428, Riyadh 11671, Saudi Arabia
* Authors to whom correspondence should be addressed.
Fire 2025, 8(8), 295; https://doi.org/10.3390/fire8080295
Submission received: 1 June 2025 / Revised: 28 June 2025 / Accepted: 22 July 2025 / Published: 26 July 2025

Abstract

Forest fire detection is an essential application in environmental surveillance since wildfires cause devastating damage to ecosystems, human life, and property every year. The effective and accurate detection of fire is necessary to allow for timely response and efficient management of disasters. Traditional techniques for fire detection often experience false alarms and delayed responses in various environmental situations. Therefore, developing robust, intelligent, and real-time detection systems has emerged as a central challenge in remote sensing and computer vision research communities. Despite recent achievements in deep learning, current forest fire detection models still face issues with generalizability, lightweight deployment, and accuracy trade-offs. In order to overcome these limitations, we introduce a novel technique (FireNet-KD) that makes use of knowledge distillation, a method that transfers the learning of large, complex models (teachers) to a lightweight and efficient model (student). We specifically utilize two complementary teacher networks: a Vision Transformer (ViT), which is popular for its global attention and contextual learning ability, and a Convolutional Neural Network (CNN), which is esteemed for its spatial locality and inductive biases. These teacher models instruct the learning of a Swin Transformer-based student model that provides hierarchical feature extraction and computational efficiency through shifted window self-attention, and is thus particularly well suited for scalable forest fire detection. By combining the strengths of ViT and CNN with distillation into the Swin Transformer, the FireNet-KD model outperforms state-of-the-art methods with significant improvements. Experimental results show that the FireNet-KD model obtains a precision of 95.16%, recall of 99.61%, F1-score of 97.34%, and mAP@50 of 97.31%, outperforming the existing models. These results demonstrate the effectiveness of FireNet-KD in improving both detection accuracy and model efficiency for forest fire detection.

1. Introduction

Forests cover about 31% of the Earth’s surface and are irreplaceable ecosystems for biodiversity protection, water cycle regulation, carbon sequestration, and global climate stability [1]. These vital landscapes are, however, constantly under threat from more severe and frequent wildfires induced by climate change, deforestation, and expanding human encroachment [2,3]. The impacts are widespread, from loss of biodiversity and ecosystem disruption to increased greenhouse gas emissions, air pollution, and economic loss [4,5,6]. In recent years, the frequency and intensity of forest fires have increased rapidly due to climate change and extended droughts. Global tree cover loss due to fire exceeded 7.4 million hectares in 2022, 51% more than in the last decade, as reported by the World Resources Institute [7]. Events like the 2020 wildfires in California, which burned more than 4 million acres and marked the largest wildfire season in California’s history [8], and the 2023 Canadian wildfires, which burned more than 18 million hectares and caused severe air pollution in North America [9], highlight the necessity of dependable and scalable fire detection. Existing satellite-based systems offer only coarse temporal or spatial resolution, making real-time observation infeasible. Traditional computer vision approaches suffer from high false alarms and poor performance in occluded or smoke conditions [10].
Due to the urgency of detecting fires early and responding rapidly, traditional forest fire surveillance technologies, such as ground patrols, lookout towers, and satellite systems, are unsatisfactory. Conventional technologies are plagued by slow detection, poor resolution, and sensitivity to environmental factors such as smoke or cloud cover [11,12,13]. Satellite systems like MODIS and VIIRS, although useful for coarse-level surveillance, are not prompt in response and have little value for detecting small or initial fires [14,15]. The shortcomings point to the need for more intelligent, timely, and reliable detection systems with lower response latency and fewer false positives.
In an attempt to overcome these shortcomings, Wireless Sensor Networks (WSNs) have emerged as a promising means of monitoring environmental parameters. Through constant measurement of temperature, humidity, and gas concentrations, WSNs can detect anomalous trends typical of nascent fire behavior [16,17]. When integrated with Internet of Things (IoT) platforms, such systems enable constant communication, centralized alarms, and real-time decision-making for large forest areas [18,19]. However, despite this promise, WSNs are hindered by power limitations, environmental noise, and sensor degradation, which compromise their long-term performance and reliability in the absence of intelligent data interpretation [20]. To enhance situational awareness and bridge WSN gaps, Unmanned Aerial Vehicles (UAVs) are now used as high-mobility, multi-role surveillance platforms. UAVs equipped with thermal and optical sensors deliver high-resolution, real-time imagery over distant mountains, dense forests, and other inaccessible terrain that conventional methods cannot reach [21,22,23]. Apart from providing rapid fire location and propagation tracing, aerial platforms may serve as a bridge between satellite and ground sensors to form a multi-tier monitoring infrastructure [24,25].
With these technological foundations, recent developments in artificial intelligence (AI) and Deep Learning (DL) have further enhanced fire detection capabilities. AI models, specifically Convolutional Neural Networks (CNNs), are capable of being trained on massive datasets to effectively identify fire patterns in images and video streams, distinguishing actual fire events from false alarms caused by fog, sunlight, or reflections [26,27,28,29,30,31]. These models facilitate machine-based visual information interpretation, enabling faster and more accurate fire detection, especially when integrated with UAV video and ground sensor data. The emergence of transformer architectures has significantly evolved DL in vision and language applications. Proposed by Vaswani et al. [32], the self-attention mechanism of the transformer allows models to attend to long-range dependencies and global context better than conventional convolutional networks. This feature is especially useful for fire detection applications, where context in space and connections between faraway areas in aerial imagery can improve the detection of smoke plumes and fire trends.
Nonetheless, despite advancements, existing AI-based methods remain problematic. Environmental heterogeneity, weather, and seasonal canopies restrict models from generalizing effectively across regions [33]. Additionally, the integration of heterogeneous data sources such as UAV imagery, sensors, and satellite inputs necessitates strong, scalable models that can process diverse inputs in real time [34]. Closing these gaps is critical to the creation of truly intelligent and trustworthy forest fire detection systems capable of functioning in realistic, complex environments. Our proposed solution presents a state-of-the-art knowledge distillation (KD) method that efficiently solves critical issues in fire detection systems with the help of multi-teacher adaptive fusion, metric-driven optimization, and effective deployment. The method uses two expert teacher models, a Vision Transformer (ViT) to handle global fire patterns and an EfficientNet to handle local texture information, whose knowledge is adaptively fused in a lightweight Swin Transformer student model via an attention-based fusion mechanism. In contrast to static weighting methods of traditional KD, our student model independently learns to adapt the contribution of every teacher according to its own feature embeddings, allowing for more context-dependent predictions.
Recent developments in fire detection have been promising in real-world applications. The DL-based platforms, such as SmokeNet and FSSNet, are capable of detecting smoke and fire with good accuracy in surveillance and UAV videos. Satellite-based platforms such as NASA’s MODIS and VIIRS sensors are extensively used for the monitoring of wildfires at the regional level. Yet these systems tend to suffer from resolution, latency, and performance constraints under occlusion or in small-scale fires. Such difficulties emphasize the importance of better models that can generalize over various fire conditions with real-time performance, which in turn inspires the development of our proposed FireNet-KD architecture.

Key Contributions

1.
FireNet-KD with Adaptive Fusion
This study proposes FireNet-KD, a new knowledge distillation architecture that integrates the strengths of several expert models to improve fire detection. A Vision Transformer (ViT) extracts global contextual patterns, while EfficientNet attends to local texture features. These models are used as teachers to a lightweight Swin Transformer student, which is trained through an attention-based fusion process. Unlike traditional KD approaches with fixed teacher weights, our student model dynamically regulates each teacher’s impact according to its feature representations, leading to more context-dependent predictions.
2.
Confidence-Aware Detection and Imbalance Mitigation
To further enhance detection robustness, we present a confidence-aware multi-scale sliding window detector with confidence-weighted non-max suppression (NMS), effectively reducing false positives. Further, to address class imbalance, we introduce strategic undersampling and weighted batch sampling during training, improving minority fire region detection without affecting overall efficiency.

2. Related Works

2.1. Early and Accurate Forest Fire Detection

The need for early and accurate detection of forest fires has driven a series of innovations in the field, primarily aimed at improving detection speed, accuracy, and the handling of complex visual environments.
FF-net [35] addressed the challenge of fire detection in dense scenes by proposing an object detection model that enhances fire detection through feature extraction and Kullback–Leibler Focal Loss (KLF). The method is effective in solving data imbalance, but is susceptible to visual interference. Improved over this is ADE-Net [36], which proposes a dual-encoding segmentation network that employs attention fusion for better early flame detection. This solves problems like class imbalance and enhances sensitivity to small flames. Still, feature representation problems persist for small fires.
While the initial detection models were designed for static images, detection via UAV has proven effective in real-time monitoring of forest fires. DRCSPNet [37] for UAV-captured images uses the Dilation Repconv Cross Stage Partial Network (DRCSPNet) and Lite-Path Aggregation Network (Lite-PAN) for multi-scale detection. The method offers improvements in low-illumination and -contrast changes, but is based on synthetic datasets, which are not highly realistic. This is improved in Ghost Convolution Swin Transformer (GCST) [38] by optimizing UAV-based detection with rotation attention mechanisms, resulting in increased accuracy and real-time computation, especially for small and occluded fires.
These UAV systems are continually evolving, with YOLOv5’s UAV robustness improved by the introduction of Coordinate Attention (CA) in YOLOv7 [39], which introduced a feature fusion module for better accuracy in harsh environments. These advancements highlight the need for high-performance and lightweight detection models for effective deployment. As the accuracy of detection improved, researchers began ensembling various deep models to tackle the fire detection problem. Xu et al. [40] ensembled EfficientDet, YOLOv5, and EfficientNet as an ensemble technique to improve fire detection accuracy. Their technique achieved a 51.3% rate of false positive reduction, demonstrating the benefit of ensemble techniques to improve fire detection. CBAM-enhanced YOLOv5 [41] also emphasized the importance of small target detection, particularly at longer distances, by enhancing the backbone layer and incorporating attention mechanisms to improve detection accuracy.
Further fine-tuning was observed in multi-task learning approaches, such as MTL-FFDet [42], where fire detection, segmentation, and classification were jointly optimized. Along with this approach improving detection, it also reduced false positives through a multi-task non-maximum suppression technique. All these enhancements reflect the power of task sharing when optimizing the model.

2.2. Deep Learning Methods and Model Optimization for Fire Detection

As more advanced models of learning emerged for fire detection, maximizing both accuracy and computational cost became a requirement, especially for real-time applications.
One of these enhancements was semi-supervised learning, as shown by the ShuffleNetV2-based models [43], which use fewer labeled samples to learn more effectively. Consistency regularization improved the model’s performance under conditions with fewer labeled data, further pushing the limits of how fire detection models can be trained. FL-YOLOv7, proposed by Xiao et al. [44], is a model that integrates a lightweight architecture with a new attention mechanism for detecting small fires. Their work highlighted improvements in mean average precision (mAP), indicating that lightweight models can be optimized for speed without losing accuracy. Building on the architecture of YOLOv7, Shi et al. [45] optimized the model by adding Ghost C3 modules and dynamic attention mechanisms, significantly downsizing its size while improving its performance for real-time applications.
As model performance continued to improve, SWVR [46] used the Reparameterization Vision Transformer (RepViT) and the Simple Parameter-Free Attention Module (SimAM) to minimize computational complexity without sacrificing competitive accuracy. The model was a good alternative for fire detection on low-resource devices, offering an excellent performance–resource trade-off. Early detection of fire was prioritized by Bai and Wang [47], and multi-scale feature fusion and transfer learning were applied in a YOLO-based network to improve the detection of small fire. They focused on model optimization for early detection, such as flame detection at an early stage. This was enhanced by Zhou et al. [48], who introduced a light version of YOLOv5 with MobileNetV3 as the backbone. Their semi-supervised learning combined with knowledge distillation not only reduced model size but also enhanced detection accuracy. Ahmad et al. [49] proposed a CN2VF-Net model that solves the crucial problem of detecting and segmenting fires in complex scenes, particularly where fires occur at various scales or become masked by environmental noise such as fog, smoke, or sunlight. To achieve this, the authors introduce CN2VF-Net, a DL model that is a fusion of Convolutional Neural Networks (CNNs) for local detail extraction and Vision Transformers (ViTs) to extract global context. The model incorporates a dynamic multi-scale attention mechanism to boost attention toward important fire areas and disregard background information.
Similarly, YOLOv5 was extended by Sun et al. [50] to identify small targets, incorporating various additional modules, some of which utilized small target detection layers and attention modules. These authors focused more on optimizing YOLO for real-time detection of small-sized fires in UAV applications. FFD-YOLO [51] also incorporated a synthetic dataset with pseudo-fire samples, using a lightweight network to significantly enhance detection accuracy by removing false positives.
In addition to enhancing the robustness of detection models, Sathishkumar et al. [52] addressed catastrophic forgetting in fire detection by following a Learning without Forgetting (LwF) strategy. This enabled the model to continue learning from past datasets while learning new ones, minimizing the likelihood of forgetting previously learned tasks, such as identifying different types of fires. Lastly, Chen et al. [53] contributed to dataset generation by developing a high-quality registered RGB and infrared image dataset for drone-based fire detection. Their contribution is based on the need for high-quality and realistic data to train precise fire detection models so as to highlight the importance of data in model performance.

2.3. Comparison with Other Similar Knowledge Distillation Techniques

Compared with existing methods such as GCST and SWVR, which rely predominantly on single-teacher distillation or fixed fusion mechanisms, the proposed FireNet-KD introduces a novel multi-teacher distillation mechanism in which two complementary teacher models, ViT for global context and EfficientNetV2-L for local detail, guide the Swin Transformer student model. GCST relies on channel-wise spatial transfer, and SWVR relies on ensemble knowledge via voting, neither of which adapts to varying visual conditions. In contrast, FireNet-KD uses a learnable attention-based fusion module to allow the student to learn dynamic weightings between the teachers according to context, which improves adaptability in challenging cases such as smoke occlusion or small fire detection. In addition, our use of a Swin Transformer student brings multi-scale inductive bias and efficient self-attention, achieving superior performance at a smaller computational cost than traditional ViT-based or convolutional-only students.
The existing literature shows remarkable advancements in detecting forest fires, with increasing interest in enhancing early detection, model precision, and real-time capability. Ranging from UAV-based detection and object detection techniques to using state-of-the-art DL models such as YOLO, EfficientDet, and Vision Transformers, fire detection capabilities have made considerable improvements. But despite the advancements, issues remain, including the detection of small fires, handling visual clutter like smoke and reflections, real-time inference in low-resource settings, and obtaining high-quality, varied datasets. Moreover, although certain pieces of work try to improve accuracy via attention mechanisms and light architectures, fewer works tackle the catastrophic forgetting issue or support adaptive learning for novel fire conditions without compromising existing knowledge. To bridge these gaps, FireNet-KD proposes a multi-teacher distillation framework that combines global (ViT) and local (EfficientNet) feature representations to empower the student Swin Transformer to learn a balanced and robust representation. The attention-based fusion mechanism dynamically aggregates teacher contributions, enhancing generalization in diverse environmental conditions. In addition, our training pipeline is stable and scalable, effectively handling overfitting and catastrophic forgetting along with enabling real-time deployment, rendering FireNet-KD a more practical and flexible solution for monitoring wildfires. Table 1 shows the comparison of existing techniques for forest fire detection.

3. Methodology

In this study, we propose the FireNet-KD architecture, shown in Figure 1, for precise forest fire detection. The main goal of this study is to build a model capable of detecting forest fires in realistic environments, specifically designed to meet the challenges presented by varying fire scales, occlusions, and environmental conditions, under which conventional methods struggle. FireNet-KD leverages a novel knowledge distillation architecture that combines Vision Transformer (ViT) and EfficientNet models as teacher models and a Swin Transformer as the student model.

3.1. Data Collection and Processing

In this study, we utilized the FLAME dataset [54], a publicly accessible benchmark dataset on Kaggle. The dataset contains both aerial and ground images, with both fire and non-fire cases, creating a rich source for training and testing DL models for detecting wildfires. The training set consists of 30,000 fire images and 14,300 non-fire images, whereas the test set consists of 5120 fire images and 3460 non-fire images. For better generalization and input diversity enhancement, we used sophisticated data augmentation techniques, such as random affine transformations, horizontal flipping, and color jittering. All the images were also normalized based on the channel-wise mean and standard deviation statistics, making the input distribution stable across training epochs. These preprocessing operations together give both numerical consistency and statistical robustness, enabling fair model evaluation and stable convergence in different environmental conditions.
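The Key Contributions section mentions strategic undersampling and weighted batch sampling to counter the fire/non-fire imbalance visible in these counts (30,000 fire vs. 14,300 non-fire training images), but the exact sampling configuration is not given. A minimal sketch, assuming PyTorch's WeightedRandomSampler with inverse-frequency weights, could look as follows:

```python
# Illustrative weighted batch sampling for the fire/non-fire imbalance.
# The sampler choice and inverse-frequency weights are assumptions; the paper
# does not specify its exact sampling configuration.
from collections import Counter

import torch
from torch.utils.data import DataLoader, WeightedRandomSampler

def make_balanced_loader(dataset, labels, batch_size=32):
    """labels: sequence of 0 (non-fire) or 1 (fire), one entry per sample."""
    counts = Counter(labels)                               # e.g. {1: 30000, 0: 14300}
    class_weights = {c: 1.0 / n for c, n in counts.items()}
    sample_weights = torch.tensor([class_weights[y] for y in labels],
                                  dtype=torch.double)
    sampler = WeightedRandomSampler(sample_weights,
                                    num_samples=len(labels),
                                    replacement=True)      # oversample the minority class
    return DataLoader(dataset, batch_size=batch_size, sampler=sampler)
```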
The preprocessing pipeline starts with resizing each input image into a standard 224 × 224 resolution using bilinear interpolation for consistency with our model architecture. Given an input image $I \in \mathbb{R}^{H \times W \times 3}$, we apply bilinear interpolation to resize it to 224 × 224 pixels:
$$I_{\text{resized}} = \text{BilinearInterpolate}\big(I, (224, 224)\big)$$
Random horizontal flipping with probability $p = 0.3$ is used to provide viewpoint invariance:
$$I_{\text{flipped}}(x, y, c) = \begin{cases} I_{\text{resized}}(W - x - 1,\, y,\, c) & \text{with probability } 0.3 \\ I_{\text{resized}}(x, y, c) & \text{otherwise} \end{cases}$$
where $W$ is the image width.
The affine transformations involve random rotations within ±15 degrees and translations within ±10% of the image dimensions, representing different camera views. Combining rotation and translation,
$$\begin{pmatrix} x' \\ y' \end{pmatrix} = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} + \begin{pmatrix} t_x \\ t_y \end{pmatrix}$$
where $\theta \sim U(-15^{\circ}, 15^{\circ})$ is the rotation angle, sampled from a uniform distribution between −15 and 15 degrees, and $t_x, t_y \sim U(-0.1W, 0.1W)$ are sampled independently from a uniform distribution ranging from $-0.1W$ to $0.1W$, where $W$ is the width of the image.
Color jittering adds controlled brightness, contrast, and saturation variations by scaling each channel value by a random factor within ±10% of its initial value. In RGB space, we apply
$$I_{\text{jittered}} = (1 + \alpha) \odot I_{\text{affine}} + \beta$$
where $\alpha \sim U(-0.1, 0.1)^3$ for brightness, contrast, and saturation, and $\beta = 0$.
Lastly, we transform the images to tensors and perform channel-wise normalization. This thorough preprocessing approach further enhances the model’s generalization capacity in real-world fire detection situations without compromising computational efficiency during training. Channel-wise standardization can be calculated as follows:
$$I_{\text{final}}^{(c)} = \frac{I_{\text{jittered}}^{(c)} - \mu_c}{\sigma_c}$$
where $\mu = [0.485, 0.456, 0.406]$ and $\sigma = [0.229, 0.224, 0.225]$ are the channel-wise mean and standard deviation used for normalizing RGB images before inputting them into pretrained models.
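A minimal, illustrative implementation of this preprocessing pipeline with torchvision is shown below. The parameter values follow the text above (30% flip probability, ±15° rotation, ±10% translation and jitter, ImageNet statistics), while the exact transform composition used by the authors is an assumption.

```python
# Sketch of the preprocessing pipeline described in Section 3.1.
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.Resize((224, 224)),                       # bilinear interpolation by default
    transforms.RandomHorizontalFlip(p=0.3),              # 30% flip probability
    transforms.RandomAffine(degrees=15,                  # rotation in [-15°, +15°]
                            translate=(0.1, 0.1)),       # translation within ±10%
    transforms.ColorJitter(brightness=0.1,               # ±10% photometric jitter
                           contrast=0.1,
                           saturation=0.1),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],     # channel-wise statistics
                         std=[0.229, 0.224, 0.225]),
])
```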

3.2. FireNet-KD Architecture

The FireNet-KD model proposed here uses a teacher–student knowledge distillation framework that combines the advantages of Vision Transformers and convolutional networks to enhance fire detection. The ViT-Tiny teacher offers global context awareness from its self-attention and is therefore best positioned to capture large-scale patterns such as smoke dispersion. The 16 × 16 patch size is selected to balance computational expense and spatial resolution within manageable training times. To address ViT’s limitation in capturing small-scale fire, however, we incorporate an EfficientNetV2-L teacher, which is better at local texture representation because of its strong convolutional inductive bias. Its scaling factor ($\varphi = 1.5$) provides high representational capacity without overparameterization. EfficientNet, despite these strengths, can sometimes be overly sensitive to flame-like textures (sunlight, reflections), which is mitigated by blending its predictions with ViT’s global context and the Swin Transformer’s hierarchical feature learning in the student model. The Swin-Base student is trained with both teachers simultaneously and learns to cooperatively harness their complementary strengths through an adaptive attention-based fusion mechanism. This fusion not only boosts the model’s generalization in adverse, complex, and cluttered scenes but also enhances robustness to false positives and improves the detection of small or occluded fire regions.
$$t_1 = \text{ViT}(x) \in \mathbb{R}^2$$
$$t_2 = \text{EfficientNet}(x) \in \mathbb{R}^2$$
where $x$ is the input image fed to each teacher and $\mathbb{R}^2$ denotes the two-dimensional output space, i.e., each teacher produces a two-class (fire/non-fire) logit vector.
The student model is used as a Swin Transformer with 7   ×   7 local windows to extract spatial features. To enable dynamic adaptation to the direction of the ensemble of teacher models (ViT and EfficientNetV2-L), the student learns to assign relative importance to each teacher through an attention-driven fusion strategy. The student is allowed to concentrate on teacher features relevant to the present input context, thus enabling better generalization across varying complexities of fire scenes.
$$F_s = \text{Swin}(x) \in \mathbb{R}^{49 \times 768}$$
where $x$ is the input to the Swin Transformer and $F_s$ contains the learned spatial features of size 49 × 768, i.e., the number of local windows (49) by the dimensionality of each window’s representation (768 features).
Then, these spatial features are globally averaged and pooled to yield a compact representation:
$$\bar{f}_s = \frac{1}{49} \sum_{i=1}^{49} F_s^{(i)} \in \mathbb{R}^{768}$$
where $\bar{f}_s$ is the feature vector averaged across the 49 local windows. The output is a 768-dimensional vector that condenses the spatial features into a single vector encoding global information.
Then, the attention mechanism calculates weighting factors as a two-layer neural network with R e L U activation and s o f t m a x normalization of the weights:
$$\alpha = \text{softmax}\big(W_2\, \text{ReLU}(W_1 \bar{f}_s + b_1) + b_2\big) \in \mathbb{R}^2$$
where $W_1 \in \mathbb{R}^{512 \times 768}$ and $W_2 \in \mathbb{R}^{2 \times 512}$ are learnable weight matrices, and $b_1$ and $b_2$ are biases. ReLU is the activation function applied to the linear transformation of $\bar{f}_s$, and the softmax function normalizes the output into a probability distribution $\alpha$, a 2-dimensional vector. The input $\bar{f}_s$ is the global average-pooled feature of the Swin student model. The output vector $\alpha$ dynamically weighs the importance of the ViT and EfficientNetV2-L teacher models, modulating the knowledge fusion based on input features. Regularization techniques such as dropout (rate 0.2) in the attention layers, early stopping, and L2 weight decay are employed to avoid the risk of overfitting during knowledge distillation. A distillation loss is employed to soften the teacher outputs and provide more stable gradients during training.
Lastly, the final prediction blends the student’s logits with attention weights:
$$y_{\text{pred}} = \text{Swin}_{\text{head}}(F_s) \odot \alpha$$
where $\text{Swin}_{\text{head}}(F_s)$ is the final output of the Swin Transformer classification head, derived from the extracted spatial features; $\odot$ denotes element-wise multiplication; and $\alpha$ contains the previously calculated attention weights, which emphasize the most relevant parts of the features for the final decision.
This architecture enables adaptive knowledge transfer, such that the student model emphasizes the most significant teacher guidance relevant to each input, while maintaining computational efficiency by employing a single forward pass at inference. This approach takes advantage of a teacher–student system in which the student model updates its attention in response to the teachers’ guidance, resulting in improved performance with reduced computation. The Swin Transformer is chosen as the student model because it is computationally efficient, hierarchical, and capable of strong multi-scale feature representation. In contrast to traditional Vision Transformers, which have quadratic complexity and require large-scale training data, the Swin Transformer exploits a shifted window approach that allows for linear computational cost without compromising its ability to capture long-range dependencies. Its hierarchical architecture is also similar to that of CNNs, making it well suited for distillation from both transformer-based (ViT) and convolutional (EfficientNet) teacher models. Thus, the Swin Transformer is an efficient yet effective student model capable of strong performance in real-time fire detection tasks.
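To make the fusion mechanism concrete, the following is a minimal PyTorch sketch of the student with attention-based teacher fusion, following the equations above (768-dimensional pooled features, a 768 → 512 → 2 attention MLP with dropout 0.2, and element-wise modulation of the head output). The backbone interface, the head wiring, and the fused-teacher distillation target are assumptions for illustration, not the authors' exact implementation.

```python
# Sketch of the attention-based teacher fusion in the FireNet-KD student.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionFusionStudent(nn.Module):
    def __init__(self, swin_backbone, feat_dim=768, num_classes=2):
        super().__init__()
        self.backbone = swin_backbone                # assumed to return (B, 49, 768) tokens
        self.head = nn.Linear(feat_dim, num_classes) # classification head on pooled features
        # Two-layer attention MLP: 768 -> 512 -> 2 (one weight per teacher)
        self.attn = nn.Sequential(
            nn.Linear(feat_dim, 512),
            nn.ReLU(inplace=True),
            nn.Dropout(0.2),
            nn.Linear(512, 2),
        )

    def forward(self, x, t1=None, t2=None):
        feats = self.backbone(x)                         # F_s: (B, 49, 768)
        pooled = feats.mean(dim=1)                       # f̄_s: average over the 49 windows
        alpha = F.softmax(self.attn(pooled), dim=-1)     # (B, 2) teacher weights
        y_pred = self.head(pooled) * alpha               # element-wise modulation of the logits
        fused_teacher = None
        if t1 is not None and t2 is not None:
            # Assumed distillation target: attention-weighted mixture of teacher logits
            fused_teacher = alpha[:, 0:1] * t1 + alpha[:, 1:2] * t2
        return y_pred, alpha, fused_teacher
```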

3.3. Training Protocol and Loss Functions

The training is based on a multi-component loss function that optimizes both localization accuracy and detection accuracy. The underlying cross-entropy loss function includes class weighting and label smoothing:
$$\mathcal{L}_{CE} = -\frac{1}{N} \sum_{i=1}^{N} w_{y_i} \big[ y_i \log p_i + (1 - y_i) \log(1 - p_i) \big] + \epsilon \, \| p \|_2^2$$
where $N$ denotes the number of samples, $y_i$ denotes the ground truth label for sample $i$, and $p_i$ denotes the predicted probability of class 1 for sample $i$. $w_{y_i} \in [1.0, 3.5]$ denotes the class weighting factor compensating for class imbalance, and $\epsilon = 0.01$ is the label smoothing regularization term.
The precision loss term penalizes low-confidence predictions on fire pixels:
$$\mathcal{L}_{prec} = 1 - \frac{1}{|P|} \sum_{i \in P} p_i$$
where $P$ is the set of true positive samples and $p_i$ is the predicted probability for sample $i$.
The recall loss maximizes fire detection sensitivity:
$$\mathcal{L}_{rec} = \text{BCE}(s_1, y)$$
where $s_1$ is the predicted output and $y$ is the ground truth label.
Then, the localization quality is enforced through a multi-threshold mAP loss:
$$\mathcal{L}_{mAP} = \frac{1}{10} \sum_{\tau = 0.5}^{0.95} \big(1 - AP_\tau\big)$$
where $AP_\tau$ is the average precision at threshold $\tau$, evaluated over the ten thresholds $\tau \in \{0.50, 0.55, \ldots, 0.95\}$.
In order to properly balance detection sensitivity, classification accuracy, and localization capability, the ultimate loss function is defined as
$$\mathcal{L}_{\text{total}} = \mathcal{L}_{CE} + 0.3\, \mathcal{L}_{prec} + 0.3\, \mathcal{L}_{rec} + 0.2\, \mathcal{L}_{mAP}$$
Here, $\mathcal{L}_{CE}$ is the cross-entropy loss with label smoothing and class weighting; $\mathcal{L}_{prec}$ and $\mathcal{L}_{rec}$ are additional losses to boost precision and recall, respectively; and $\mathcal{L}_{mAP}$ is derived from average precision at different thresholds. The coefficients were set empirically for optimization tuning to provide a balanced contribution of each loss term. Precision and recall losses are given equal weights (0.3) due to their importance in minimizing false alarms and missed detections. The mAP loss is assigned a moderate weight (0.2) to boost spatial localization without dominating the overall optimization. This design achieves strong detection performance across a broad range of fire detection conditions.
The selection of weights of loss functions in FireNet-KD was led by the requirement for balancing various aspects of performance essential for wildfire detection. In particular, increased weights were put on precision and recall terms to highlight the correct identification of fire and non-fire areas. A slightly reduced weight was given to the mAP metric in order to enable strong localization without dominating the classification goals. These weights were empirically adjusted during initial validation to provide a stable training process and the best possible balance between detection accuracy and spatial precision.
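As an illustration of this weighting scheme, the sketch below assembles the composite objective. The precision and recall surrogates, the externally supplied mAP term, and the temperature-softened KD term against the fused teacher logits are simplified assumptions, since the paper does not provide implementation details for these components.

```python
# Illustrative composite loss following the total-loss equation above.
# The mAP surrogate is passed in precomputed (its differentiable form is not
# specified), and the KD term is an assumed soft-target KL divergence with
# temperature T, which the paper mentions only qualitatively.
import torch
import torch.nn.functional as F

def firenet_kd_loss(logits, targets, fused_teacher=None, map_loss=None,
                    class_weights=(1.0, 3.5), smoothing=0.01, T=2.0):
    w = torch.tensor(class_weights, device=logits.device)
    ce = F.cross_entropy(logits, targets, weight=w, label_smoothing=smoothing)

    probs = F.softmax(logits, dim=-1)[:, 1]          # predicted fire probability
    pos = targets == 1
    # Precision surrogate: push confidence up on ground-truth fire samples
    l_prec = 1.0 - probs[pos].mean() if pos.any() else torch.zeros((), device=logits.device)
    # Recall surrogate: binary cross-entropy on the fire score
    l_rec = F.binary_cross_entropy(probs, targets.float())

    total = ce + 0.3 * l_prec + 0.3 * l_rec
    if map_loss is not None:
        total = total + 0.2 * map_loss
    if fused_teacher is not None:                    # assumed soft-target distillation term
        kd = F.kl_div(F.log_softmax(logits / T, dim=-1),
                      F.softmax(fused_teacher / T, dim=-1),
                      reduction="batchmean") * (T * T)
        total = total + kd
    return total
```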
Then, the optimization is carried out by the AdamW optimizer with weight decay $\lambda = 10^{-4}$ and the OneCycle learning rate scheduler, which warms up for 30% of the training epochs:
$$\eta_t = \begin{cases} \eta_{\max}\, \dfrac{t}{t_{\text{warm}}} & \text{if } t \le t_{\text{warm}} \\[2mm] \eta_{\max}\, \cos\!\left( \dfrac{\pi \,(t - t_{\text{warm}})}{2\,(t_{\max} - t_{\text{warm}})} \right) & \text{if } t > t_{\text{warm}} \end{cases}$$
where $\eta_t$ is the learning rate at epoch $t$, $\eta_{\max}$ is the maximum learning rate, $t_{\text{warm}}$ is the number of warm-up epochs, and $t_{\max}$ is the total number of epochs.
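A minimal setup of this optimizer and schedule, assuming PyTorch's AdamW and OneCycleLR (the maximum learning rate is a placeholder, as it is not stated here), could be:

```python
# Optimizer and schedule sketch: AdamW with weight decay 1e-4 and a OneCycle
# policy warming up for 30% of training. max_lr is a placeholder assumption.
import torch

def build_optimizer(model, steps_per_epoch, epochs=50, max_lr=3e-4):
    optimizer = torch.optim.AdamW(model.parameters(), lr=max_lr, weight_decay=1e-4)
    scheduler = torch.optim.lr_scheduler.OneCycleLR(
        optimizer,
        max_lr=max_lr,
        epochs=epochs,
        steps_per_epoch=steps_per_epoch,
        pct_start=0.3,           # 30% of the cycle spent warming up
    )
    # scheduler.step() is called after every optimizer step (per batch),
    # so the warm-up covers 30% of the total training steps.
    return optimizer, scheduler
```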

3.4. Multi-Scale Inference Pipeline

The detection system uses a multi-scale sliding window strategy for accurate fire localization. For every input image, windows at several scales $s \in \{128, 192, 224\}$ are processed with a stride of 25% of the window size:
$$p_{i,j,s} = f_\theta\big(\text{crop}(I, i, j, s)\big)_{\text{fire}}$$
Detected regions are subjected to confidence-weighted non-maximum suppression:
$$x_k^{\text{final}} = \frac{\sum_i w_i\, x_k^{(i)}}{\sum_i w_i}, \qquad w_i = p_i$$
where $k \in \{1, 2, 3, 4\}$ indexes the bounding box coordinates. This strategy provides strong detection for different fire sizes while preserving accurate localization.
The architecture is shown to be especially effective in complex cases of small or partially occluded fires, yet it sustains low false positive rates even in complicated natural environments. The dynamic weighting capability of the attention mechanism is beneficial when confronted with difficult cases where visual features can imply both fire and non-fire interpretations.
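The following sketch illustrates this multi-scale sliding-window inference with confidence-weighted box merging. The scales and the 25% stride follow the text; the classifier interface (a model returning two-class logits per crop) and the IoU grouping threshold are assumptions.

```python
# Sketch of multi-scale sliding-window detection with confidence-weighted merging.
import torch
import torch.nn.functional as F
import torchvision.ops as ops

def sliding_window_detect(image, model, scales=(128, 192, 224), conf_thresh=0.5):
    """image: (3, H, W) tensor; model is assumed to return (1, 2) fire/non-fire logits."""
    _, H, W = image.shape
    boxes, scores = [], []
    for s in scales:
        stride = max(1, int(0.25 * s))                       # 25% of the window size
        for top in range(0, max(H - s, 0) + 1, stride):
            for left in range(0, max(W - s, 0) + 1, stride):
                crop = image[:, top:top + s, left:left + s].unsqueeze(0)
                crop = F.interpolate(crop, size=(224, 224), mode="bilinear")
                p_fire = F.softmax(model(crop), dim=-1)[0, 1].item()
                if p_fire >= conf_thresh:                    # drop low-confidence windows
                    boxes.append([left, top, left + s, top + s])
                    scores.append(p_fire)
    return merge_weighted(boxes, scores)

def merge_weighted(boxes, scores, iou_thresh=0.5):
    """Confidence-weighted merge: overlapping boxes are averaged, weighted by p_i."""
    if not boxes:
        return [], []
    b = torch.tensor(boxes, dtype=torch.float)
    s = torch.tensor(scores, dtype=torch.float)
    keep = ops.nms(b, s, iou_thresh)                         # cluster seeds
    merged = []
    for k in keep:
        iou = ops.box_iou(b[k].unsqueeze(0), b).squeeze(0)
        members = iou > iou_thresh                           # boxes overlapping the seed
        w = s[members]
        merged.append(((b[members] * w[:, None]).sum(0) / w.sum()).tolist())
    return merged, s[keep].tolist()
```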

3.5. Evaluation Metrics

Our comprehensive evaluation framework uses several key metrics to rigorously assess model performance from complementary perspectives. Each metric provides unique insights into the system’s capabilities and limitations. Below are mathematical formulations designed to capture critical aspects of fire detection performance.

3.6. Precision: The Accuracy of Positive Predictions

Precision assesses the model’s capacity to accurately identify actual fire events with minimal false alarms, an imperative for effective wildfire detection systems. Precisely, precision can be described as
$$\text{Precision} = \frac{TP}{TP + FP} = \frac{\sum_{i=1}^{N} \mathbb{I}(y_i = \hat{y}_i = 1)}{\sum_{i=1}^{N} \mathbb{I}(\hat{y}_i = 1)}$$
where $TP$ denotes true positives (correct fire detections), $FP$ represents false positives (false alarms), $\mathbb{I}$ denotes the indicator function, and $N$ is the total number of samples.
High precision is especially essential in wildfire applications where false alarms cause wastage of resources and operational expenses.

3.7. Recall: Comprehensive Fire Detection Capability

Recall measures the system’s capacity to spot all real fires that occurred, with as little missed detection as possible:
$$\text{Recall} = \frac{TP}{TP + FN} = \frac{\sum_{i=1}^{N} \mathbb{I}(y_i = \hat{y}_i = 1)}{\sum_{i=1}^{N} \mathbb{I}(y_i = 1)}$$
where $FN$ represents false negatives (missed fires).

3.8. F1-Score: Balanced Performance Metric

The F1-score yields a harmonic mean of precision and recall, allowing a single measurement that balances both issues:
$$F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} = \frac{2\, TP}{2\, TP + FP + FN}$$
The harmonic mean penalizes large differences between precision and recall more harshly than the arithmetic mean.

3.9. mAP@0.5: Localization Accuracy Evaluation

The mean average precision at IoU threshold 0.5 (mAP50) measures both detection precision and localization accuracy:
$$\text{AP}_{0.5} = \int_{0}^{1} p_{\text{interp}}(r)\, dr, \qquad p_{\text{interp}}(r) = \max_{\tilde{r} \ge r} p(\tilde{r})$$
where $p(r)$ is the precision–recall curve and $p_{\text{interp}}(r)$ is its interpolated (monotonic) version. The IoU threshold of 0.5 requires more than 50% overlap between a prediction and the ground truth.
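For completeness, the sketch below computes precision, recall, and F1 directly from the confusion counts, as in Sections 3.6–3.8; mAP@0.5 is typically computed with a detection-evaluation toolkit and is omitted here.

```python
# Precision, recall, and F1 from confusion counts (Sections 3.6-3.8).
import numpy as np

def classification_metrics(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))   # correct fire detections
    fp = np.sum((y_true == 0) & (y_pred == 1))   # false alarms
    fn = np.sum((y_true == 1) & (y_pred == 0))   # missed fires
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return precision, recall, f1

# Example: perfect recall with one false alarm
print(classification_metrics([1, 1, 0, 0], [1, 1, 1, 0]))  # approx. (0.667, 1.0, 0.8)
```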

3.10. Computational Environment

We performed all experiments on Kaggle’s GPU platform, which features NVIDIA T4 Tensor Core GPUs with 16 GB of memory. This provided sufficient computational capacity to train our knowledge distillation model without sacrificing realistic inference speeds. The ViT teacher model used a patch size of 16 × 16, resulting in 196 patches per 224 × 224 image. In contrast, the student Swin Transformer used a smaller patch size of 4 × 4 (creating 3136 patches initially, before window merging) to allow it to perceive finer fire features at various scales. The FireNet-KD model occupies a total of 104.90 MB of parameters.

4. Results and Discussions

In this section, the proposed FireNet-KD model shows better performance on all test metrics with a precision of 95.16%, recall of 99.61%, F1-score of 97.34%, and mAP@50 of 97.31% on the fire detection benchmark dataset. The performance greatly surpasses other current state-of-the-art techniques, as indicated in Table 2, with FireNet-KD recording improvements of 3.2% precision, 2.8% recall, and 4.1% mAP@50 over the best competing algorithm. The enhanced performance is due to the model’s knowledge distillation framework, which efficiently combines the strengths of Vision Transformers and EfficientNet architectures through an attention-based fusion mechanism, producing strong feature extraction and precise fire localization in various environmental conditions.
In Figure 2, the training and validation curves for FireNet-KD show loss, precision, recall, F1-score, and mAP@50 over 50 epochs. The curves show smooth convergence and consistent validation performance, which confirms the robustness and generalization of the model. Both training and validation loss decrease steadily, indicating stable convergence with minimal overfitting. Precision and recall remain consistently high during training with only slight fluctuations, demonstrating the model’s reliability in making accurate predictions. The F1-score and mAP@50 curves increase steadily, further confirming the model’s capacity to generalize well to novel data. These findings emphasize the efficacy of FireNet-KD’s architecture and training procedure in reaching high-performance fire detection.
Figure 3 shows the visual detection outcomes, comparing original images to FireNet-KD’s outputs for fire localization. Predictions with confidence scores of less than 0.5 were removed during inference to minimize false positives. The threshold was applied consistently in all visualized results and evaluation measures. The model correctly detects fire areas, marked by red bounding boxes with high-confidence probability scores. Interestingly, FireNet-KD performs well even in adverse conditions, such as fires with complex backgrounds like snow, water, and wooded forests, while efficiently denying false positives. This ability presents the model as having valuable applicability in aerial fire detection operations.
In real environments, where there is variability due to surroundings and occlusion, FireNet-KD was tested on a range of environmental scenarios, such as daytime, night, small-sized fires, complex smoke environments, and dense vegetation scenes. As depicted in Figure 4, the model performs highly consistently in most of these situations, demonstrating robustness and generalization capability. However, there was a decline in performance in low-resolution or visually blurred images, where fire characteristics lose definition or are hidden.
To further verify FireNet-KD’s validity in real-world applications, its performance was tested under diverse environmental conditions: daytime, nighttime, with smoke, complex natural backgrounds, and small fire conditions. The model’s performance is shown in Table 2 and Figure 5 and Figure 6. FireNet-KD has demonstrated that it performs well under all environment types. Highly noteworthy is the finding that the model’s recall ratio is high in smoke scenes and nighttime scenes, indicating the model’s superior detection ability even in low-visibility or noisy scenes. Relatively lower precision is obtained in low-resolution images or visually complex backgrounds such as fog or sun glare, yet overall performance is robust. These experimental findings confirm the model’s robustness and versatility in real-world wildfire monitoring applications.
To graphically represent FireNet-KD’s overall performance, a comparison graph highlighting important parameters—precision, recall, F1-score, and mAP@50—is presented in Figure 6. The graphical representation is an additional enhancement to the tabular results, offering an immediate comprehension of the model’s performance. To bring interpretability to FireNet-KD’s decision-making process, we employed Grad-CAM (Gradient-weighted Class Activation Mapping) to emphasize the spatial regions most responsible for the model’s prediction. As shown in Figure 5, we can observe that Grad-CAM’s highlighted points show that the model is effectively focusing on significant fire-related features such as the boundary of the fire, smoke regions, and thermal characteristics. These visual cues verify that FireNet-KD is extracting semantically meaningful features and making decisions based on meaningful regions of the image. This makes the model more transparent and reliable in real-world applications for fire detection. The visualizations indicate the areas in the input images that the model is focusing on when it is detecting fire. Grad-CAM’s output confirms the fact that the model focuses on informative fire features like flames and smoke even in difficult cases, thus setting the reliability and interpretability of its predictions.
The trade-off between precision and recall ensures not only accurate fire detection for FireNet-KD but also the minimization of false negatives—a critical requirement for early-wildfire-warning systems. The mAP@50 metrics also serve as further indication of the model’s accurate localization capability, which is crucial for the estimation of fire spread and intensity in real-world scenarios. The stable performance across all evaluation metrics, as reflected by the training curves and detection outputs, establishes FireNet-KD as a state-of-the-art automated fire detection system for aerial imagery. Such enhancements address primary limitations in existing strategies, particularly in the detection of small fires over visually demanding backgrounds, with a focus on computational efficiency for real-time applications. The results validate the effectiveness of the proposed multi-teacher knowledge distillation method and justify its applicability for large-scale wildfire monitoring systems.
The enhanced performance of FireNet-KD is attributed to its knowledge distillation structure, which exploits the advantages of both ViT for global context awareness and EfficientNet for fine-grained local feature extraction. The attention-based fusion mechanism allows for dynamic weighting of these complementary features, enabling strong detection of fires under different environmental conditions like smoke, haze, and dense vegetation. High model recall (99.61%) ensures few missed detections, a critical requirement for early intervention, while strong precision (95.16%) ensures reduced false alarms, a common issue in automatic fire detection. Additionally, the sliding-window inference and confidence-weighted non-maximum suppression (NMS) enhance localization accuracy (mAP@50: 97.31%), surpassing single-model and traditional ensemble methods. The training curves (Figure 2) confirm stable convergence without overfitting, while detection results (Figure 3) show consistent performance in diverse fire scenarios.
Although these are promising outcomes, several challenges need to be addressed for actual application. Data gathering and annotation rank among the most crucial challenges. Fire objects in aerial imagery are rare, and the acquisition of labeled data over multiple environmental scenarios, such as various altitudes, camera angles, lighting, and fire intensities, is time-consuming and perilous in certain situations. Labeling fire areas manually, particularly in smoke-filled or unclear images, can introduce variability and subjectivity, potentially affecting model training and testing. In addition, even though the proposed model is great on common benchmark datasets, generalization across unseen geographic locations, weather conditions, and sensor modalities is a critical issue. Domain variations, such as vegetation changes, topography, or seasonal fire regimes, can affect the model’s reliability. In practical deployment, FireNet-KD can face moderate challenges with environmental variation and scaling. Seasonal weather and geographic variations, for instance, vegetation color, light intensity, and terrain, could marginally affect detection performance. However, the model uses both global and local features and therefore has inherent robustness against such variations. Furthermore, while intensive training on large data may be computationally costly, the student model (Swin Transformer) offers a good trade-off between efficiency and accuracy, thus making the approach viable for real-world wildfire monitoring with managed computational resources.
While FireNet-KD is resilient under varying environmental conditions, there are some limitations. In particular, the detection performance of the model degrades for low-resolution or visually poor-quality images, where fire-related features like flame contours or smoke textures are poorly defined. In addition, although the dataset contains heterogeneous samples, the generalization performance of the model can also degrade when applied to unseen seasonal or regional conditions, such as snowy regions or desert flora, that were thus underrepresented during training. Such weaknesses are most likely due to domain shifts and the lack of context-dependent visual cues in such conditions. Closing such gaps in future research using domain adaptation methods, such as continual learning techniques, may further improve the robustness and scalability of the model for real-world applications.

5. Ablation Study

The ablation study on FireNet-KD provides valuable insights into the contributions of each component in the architecture. The full FireNet-KD model shows excellent performance with 95.16% precision, 99.61% recall, 97.34% F1-score, and 97.31% mAP50, providing a robust benchmark for comparison as shown in Table 3. Looking at individual components, Teacher 1 (ViT) shows better precision at 93.18% owing to its global attention mechanism that efficiently separates real fire patterns from false alarms, although it is relatively weak in recall (95.54%), especially for small fires. Conversely, Teacher 2 (EfficientNetV2-L) has better recall (96.87%) due to its better capacity for detecting local flame textures via convolutional operations, but has poorer precision (92.18%) because it sometimes incorrectly classifies flame-like textures. The single student model (Swin Transformer) is an interesting compromise, having 94.89% F1-score and 94.31% mAP50, beating both the single-teacher setups but remaining below the level of the complete model, implying that though proficient alone, it benefits greatly from teacher assistance. The outstanding performance of the complete FireNet-KD resulted from three collaborative effects: Teacher 1’s global context perception and Teacher 2’s local feature sensitivity, which are complementary to each other; the adaptive, attention-driven fusion mechanism, which optimally combines their contribution according to input properties; and the effective knowledge transfer that enables the student to outperform its teachers without degrading computational efficiency. Every configuration exhibits typical limitations—Teacher 1 performs poorly on small fire detection, Teacher 2 produces more false alarms, and the student in isolation takes significantly longer to train. These results together show that FireNet-KD’s design innovations effectively combine the strengths of its components while reducing their respective weaknesses to produce a system that surpasses any one of them through thoughtful, attention-mediated integration of their complementary abilities.

6. Conclusions and Future Direction

This work proposes FireNet-KD, a new knowledge distillation framework for robust aerial fire detection, which greatly surpasses the state of the art. We showed through extensive experimentation and ablation analysis that our multi-teacher design with attention-fused prediction achieves outstanding performance (95.16% precision, 99.61% recall, 97.34% F1-score, and 97.31% mAP50), outperforming the separate teacher models and the isolated student network. The main innovation is the integration of a Vision Transformer for global context modeling and an EfficientNet for local feature extraction, collaboratively balanced through a learned attention mechanism. The ablation studies presented strong evidence that each architectural element contributes distinct, non-overlapping capabilities, and that the complete system performs better than any single part. These findings establish FireNet-KD as a dependable, real-time solution for aerial fire observation, offering a balance of detection accuracy and operational practicality.
Future work will aim to optimize FireNet-KD for edge deployment, fuse multi-modal sensors (thermal/LiDAR), and generalize the model to video-based fire detection through temporal analysis. We will also investigate adaptive learning methods to improve generalization across varying environments and increase model interpretability for real-world firefighting operations.

Author Contributions

Conceptualization, N.A. and M.A.; Data curation, M.M.J.; Formal analysis, E.H.A.; Funding acquisition, M.M.J.; Investigation, E.H.A.; Methodology, N.A.; Project administration, M.A.; Resources, E.H.A.; Supervision, M.A.; Validation, N.A. and M.M.J.; Writing—original draft, N.A.; Writing—review & editing, M.A., E.H.A. and M.M.J. All authors have read and agreed to the published version of the manuscript.

Funding

Princess Nourah bint Abdulrahman University Researchers Supporting Project number (PNURSP2025R104), Princess Nourah bint Abdulrahman University, Riyadh, Saudi Arabia.

Institutional Review Board Statement

Not Applicable.

Informed Consent Statement

Not Applicable.

Data Availability Statement

In this study, the FLAME dataset, which is used for fire detection, is openly available on Kaggle at the following repository: https://www.kaggle.com/datasets/smrutisanchitadas/flame-dataset-fire-classification (Latest version, Accessed: 25 May 2025). The dataset is publicly available for academic and research purposes, while the code can be provided upon request.

Acknowledgments

Princess Nourah bint Abdulrahman University Researchers Supporting Project number (PNURSP2025R104), Princess Nourah bint Abdulrahman University, Riyadh, Saudi Arabia.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Köhl, M.; Lasco, R.; Cifuentes, M.; Jonsson, Ö.; Korhonen, K.T.; Mundhenk, P.; de Jesus Navar, J.; Stinson, G. Changes in forest production, biomass and carbon: Results from the 2015 UN FAO Global Forest Resource Assessment. For. Ecol. Manag. 2015, 352, 21–34. [Google Scholar] [CrossRef]
  2. Bowman, D.M.J.S.; Balch, J.K.; Artaxo, P.; Bond, W.J.; Carlson, J.M.; Cochrane, M.A.; d’Antonio, C.M.; DeFries, R.S.; Doyle, J.C.; Harrison, S.P.; et al. Fire in the Earth system. Science 2009, 324, 481–484. [Google Scholar] [CrossRef]
  3. Westerling, A.L.; Hidalgo, H.G.; Cayan, D.R.; Swetnam, T.W. Warming and earlier spring increase western US forest wildfire activity. Science 2006, 313, 940–943. [Google Scholar] [CrossRef] [PubMed]
  4. Abatzoglou, J.T.; Williams, A.P. Impact of anthropogenic climate change on wildfire across western US forests. Proc. Natl. Acad. Sci. USA 2016, 113, 11770–11775. [Google Scholar] [CrossRef] [PubMed]
  5. Oliveira, S.; Gonçalves, A.; Zêzere, J.L. Reassessing wildfire susceptibility and hazard for mainland Portugal. Sci. Total Environ. 2021, 762, 143121. [Google Scholar] [CrossRef]
  6. Fernández-Guisuraga, J.M.; Martins, S.; Fernandes, P.M. Characterization of biophysical contexts leading to severe wildfires in Portugal and their environmental controls. Sci. Total Environ. 2023, 875, 162575. [Google Scholar] [CrossRef]
  7. World Resources Institute. Tree Cover Loss from Fires Reached Record High in 2022; World Resources Institute: Washington, DC, USA, 2023; Available online: https://www.wri.org/ (accessed on 20 May 2025).
  8. California Department of Forestry and Fire Protection (CAL FIRE). 2020 Fire Season Summary; CAL FIRE: Sacramento, CA, USA, 2020. Available online: https://www.fire.ca.gov/incidents/2020/ (accessed on 20 May 2025).
  9. Natural Resources Canada. Canada’s 2023 Wildfire Season—A Record Year; Government of Canada: Ottawa, ON, Canada, 2023. Available online: https://natural-resources.canada.ca/ (accessed on 20 May 2025).
  10. Geetha, S.; Abhishek, C.S.; Akshayanat, C.S. Machine vision based fire detection techniques: A survey. Fire Technol. 2021, 57, 591–623. [Google Scholar] [CrossRef]
  11. Giglio, L.; Boschetti, L.; Roy, D.; Hoffmann, A.A.; Humber, M.; Hall, J.V. Collection 6 Modis Burned Area Product User’s Guide Version 1.0; NASA EOSDIS Land Processes DAAC: Sioux Falls, SD, USA, 2016; pp. 11–27.
  12. Justice, C.O.; Townshend, J.R.G.; Vermote, E.F.; Masuoka, E.; Wolfe, R.E.; Saleous, N.; Roy, D.P.; Morisette, J.T. An overview of MODIS Land data processing and product status. Remote Sens. Environ. 2002, 83, 3–15. [Google Scholar] [CrossRef]
  13. Chen, J.; Zheng, W.; Wu, S.; Liu, C.; Yan, H. Fire monitoring algorithm and its application on the geo-kompsat-2A geostationary meteorological satellite. Remote Sens. 2022, 14, 2655. [Google Scholar] [CrossRef]
  14. Schroeder, W.; Oliva, P.; Giglio, L.; Csiszar, I.A. The New VIIRS 375 m active fire detection data product: Algorithm description and initial assessment. Remote Sens. Environ. 2014, 143, 85–96. [Google Scholar] [CrossRef]
  15. Fu, Y.; Li, R.; Wang, X.; Bergeron, Y.; Valeria, O.; Chavardès, R.D.; Wang, Y.; Hu, J. Fire detection and fire radiative power in forests and low-biomass lands in Northeast Asia: MODIS versus VIIRS Fire Products. Remote Sens. 2020, 12, 2870. [Google Scholar] [CrossRef]
  16. Lloret, J.; Garcia, M.; Bri, D.; Sendra, S. A wireless sensor network deployment for rural and forest fire detection and verification. Sensors 2009, 9, 8722–8747. [Google Scholar] [CrossRef] [PubMed]
  17. Haque, A.; Soliman, H. A Transformer-Based Autoencoder with Isolation Forest and XGBoost for Malfunction and Intrusion Detection in Wireless Sensor Networks for Forest Fire Prediction. Future Internet 2025, 17, 164. [Google Scholar] [CrossRef]
  18. Ramadan, M.N.A.; Basmaji, T.; Gad, A.; Hamdan, H.; Akgün, B.T.; Ali, M.A.H.; Alkhedher, M.; Ghazal, M. Towards early forest fire detection and prevention using AI-powered drones and the IoT. Internet Things 2024, 27, 101248. [Google Scholar] [CrossRef]
  19. Radhi, A.A.; Ibrahim, A.A. Forest Fire Detection Techniques Based on IoT Technology. In Proceedings of the 2023 1st IEEE International Conference on Smart Technology (ICE-SMARTec), Bandung, Indonesia, 17–19 July 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 128–133. [Google Scholar]
  20. Nikhilesh Krishna, C.; Rauniyar, A.; Bharadwaj, N.K.S.; Raj, S.B.; Valsan, V.; Suresh, K.; Pandi, V.R.; Sathyan, S. ECO-Guard: An Integrated AI Sensor System for Monitoring Wildlife and Sustainable Forest Management. In Proceedings of the International Conference on Information and Communication Technology for Competitive Strategies, Jaipur, India, 8–9 December 2023; Springer Nature: Singapore, 2023; pp. 409–419. [Google Scholar]
  21. Sudhakar, S.; Vijayakumar, V.; Kumar, C.S.; Priya, V.; Ravi, L.; Subramaniyaswamy, V. Unmanned Aerial Vehicle (UAV) based Forest Fire Detection and monitoring for reducing false alarms in forest-fires. Comput. Commun. 2020, 149, 1–16. [Google Scholar] [CrossRef]
  22. Gao, Y.; Hao, M.; Wang, Y.; Dang, L.; Guo, Y. Multi-scale coal fire detection based on an improved active contour model from Landsat-8 Satellite and UAV images. ISPRS Int. J. Geo-Inf. 2021, 10, 449. [Google Scholar] [CrossRef]
  23. Adão, T.; Hruška, J.; Pádua, L.; Bessa, J.; Peres, E.; Morais, R.; Sousa, J.J. Hyperspectral imaging: A review on UAV-based sensors, data processing and applications for agriculture and forestry. Remote Sens. 2017, 9, 1110. [Google Scholar] [CrossRef]
  24. Liu, W.; Lyu, S.-K.; Liu, T.; Wu, Y.-T.; Qin, Z. Multi-target optimization strategy for unmanned aerial vehicle formation in forest fire monitoring based on deep Q-network algorithm. Drones 2024, 8, 201. [Google Scholar] [CrossRef]
  25. Jiang, Y.; Kong, J.; Zhong, Y.; Zhang, Q.; Zhang, J. An Enhanced Algorithm for Active Fire Detection in Croplands Using Landsat-8 OLI Data. Land 2023, 12, 1246. [Google Scholar] [CrossRef]
  26. Muhammad, K.; Ahmad, J.; Mehmood, I.; Rho, S.; Baik, S.W. Convolutional neural networks based fire detection in surveillance videos. IEEE Access 2018, 6, 18174–18183. [Google Scholar] [CrossRef]
  27. Ko, B.C.; Cheong, K.-H.; Nam, J.-Y. Fire detection based on vision sensor and support vector machines. Fire Saf. J. 2009, 44, 322–329. [Google Scholar] [CrossRef]
  28. Li, M.; Zhang, K.; Liu, J.; Gong, H.; Zhang, Z. Blockchain-based anomaly detection of electricity consumption in smart grids. Pattern Recognit. Lett. 2020, 138, 476–482. [Google Scholar] [CrossRef]
  29. Bergado, J.R.; Persello, C.; Reinke, K.; Stein, A. Predicting wildfire burns from big geodata using deep learning. Saf. Sci. 2021, 140, 105276. [Google Scholar] [CrossRef]
  30. Yang, S.; Huang, Q.; Yu, M. Advancements in remote sensing for active fire detection: A review of datasets and methods. Sci. Total Environ. 2024, 943, 173273. [Google Scholar] [CrossRef]
  31. Zhang, Z.; Wang, L.; Liu, S.; Yin, Y. Intelligent fire location detection approach for extrawide immersed tunnels. Expert Syst. Appl. 2024, 239, 122251. [Google Scholar] [CrossRef]
  32. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems 30, Proceedings of the 31st Annual Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017; NeurIPS: San Diego, CA, USA, 2017. [Google Scholar]
  33. Yar, H.; Khan, Z.A.; Rida, I.; Ullah, W.; Kim, M.J.; Baik, S.W. An efficient deep learning architecture for effective fire detection in smart surveillance. Image Vis. Comput. 2024, 145, 104989. [Google Scholar] [CrossRef]
  34. Jin, P.; Cheng, P.; Liu, X.; Huang, Y. From smoke to fire: A forest fire early warning and risk assessment model fusing multimodal data. Eng. Appl. Artif. Intell. 2025, 152, 110848. [Google Scholar] [CrossRef]
  35. Yuan, J.; Wang, H.; Yang, T.; Su, Y.; Song, W.; Li, S.; Gong, W. FF-net: A target detection method tailored for mid-to-late stages of forest fires in complex environments. Case Stud. Therm. Eng. 2025, 65, 105515. [Google Scholar] [CrossRef]
  36. Kong, S.; Deng, J.; Yang, L.; Liu, Y. An attention-based dual-encoding network for fire flame detection using optical remote sensing. Eng. Appl. Artif. Intell. 2024, 127, 107238. [Google Scholar] [CrossRef]
  37. Wang, G.; Li, H.; Xiao, Q.; Yu, P.; Ding, Z.; Wang, Z.; Xie, S. Fighting against forest fire: A lightweight real-time detection approach for forest fire based on synthetic images. Expert Syst. Appl. 2025, 262, 125620. [Google Scholar] [CrossRef]
  38. Wang, L.; Li, H.; Siewe, F.; Ming, W.; Li, H. Forest fire detection utilizing ghost Swin transformer with attention and auxiliary geometric loss. Digit. Signal Process. 2024, 154, 104662. [Google Scholar] [CrossRef]
  39. Shi, P.; Wang, X. Forest Fire Detection Method based on Improved YOLOv7. In Proceedings of the 2024 7th International Conference on Advanced Algorithms and Control Engineering (ICAACE), Shanghai, China, 1–3 March 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 618–622. [Google Scholar]
  40. Xu, R.; Lin, H.; Lu, K.; Cao, L.; Liu, Y. A forest fire detection system based on ensemble learning. Forests 2021, 12, 217. [Google Scholar] [CrossRef]
  41. Xue, Z.; Lin, H.; Wang, F. A small target forest fire detection model based on YOLOv5 improvement. Forests 2022, 13, 1332. [Google Scholar] [CrossRef]
  42. Lu, K.; Huang, J.; Li, J.; Zhou, J.; Chen, X.; Liu, Y. MTL-FFDET: A multi-task learning-based model for forest fire detection. Forests 2022, 13, 1448. [Google Scholar] [CrossRef]
  43. Lin, J.; Lin, H.; Wang, F. A semi-supervised method for real-time forest fire detection algorithm based on adaptively spatial feature fusion. Forests 2023, 14, 361. [Google Scholar] [CrossRef]
  44. Xiao, Z.; Wan, F.; Lei, G.; Xiong, Y.; Xu, L.; Ye, Z.; Liu, W.; Zhou, W.; Xu, C. Fl-yolov7: A lightweight small object detection algorithm in forest fire detection. Forests 2023, 14, 1812. [Google Scholar] [CrossRef]
  45. Shi, P.; Lu, J.; Wang, Q.; Zhang, Y.; Kuang, L.; Kan, X. An efficient forest fire detection algorithm using improved YOLOv5. Forests 2023, 14, 2440. [Google Scholar] [CrossRef]
  46. Jin, L.; Yu, Y.; Zhou, J.; Bai, D.; Lin, H.; Zhou, H. SWVR: A lightweight deep learning algorithm for forest fire detection and recognition. Forests 2024, 15, 204. [Google Scholar] [CrossRef]
  47. Bai, X.; Wang, Z. Research on forest fire detection technology based on deep learning. In Proceedings of the 2021 International Conference on Computer Network, Electronic and Automation (ICCNEA), Xi’an, China, 24–26 September 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 85–90. [Google Scholar]
  48. Ahmad, N.; Akbar, M.; Alkhammash, E.H.; Jamjoom, M.M. CN2VF-Net: A Hybrid Convolutional Neural Network and Vision Transformer Framework for Multi-Scale Fire Detection in Complex Environments. Fire 2025, 8, 211. [Google Scholar] [CrossRef]
  49. Zhou, M.; Wu, L.; Liu, S.; Li, J. UAV forest fire detection based on lightweight YOLOv5 model. Multimed. Tools Appl. 2024, 83, 61777–61788. [Google Scholar] [CrossRef]
  50. Sun, Z.; Xu, R.; Zheng, X.; Zhang, L.; Zhang, Y. A forest fire detection method based on improved YOLOv5. Signal Image Video Process. 2025, 19, 136. [Google Scholar] [CrossRef]
  51. Wang, Z.; Xu, L.; Chen, Z. FFD-YOLO: A modified YOLOv8 architecture for forest fire detection. Signal Image Video Process. 2025, 19, 265. [Google Scholar] [CrossRef]
  52. Sathishkumar, V.E.; Cho, J.; Subramanian, M.; Naren, O.S. Forest fire and smoke detection using deep learning-based learning without forgetting. Fire Ecol. 2023, 19, 9. [Google Scholar] [CrossRef]
  53. Chen, X.; Hopkins, B.; Wang, H.; O’Neill, L.; Afghah, F.; Razi, A.; Fulé, P.; Coen, J.; Rowell, E.; Watts, A. Wildland fire detection and monitoring using a drone-collected RGB/IR image dataset. IEEE Access 2022, 10, 121301–121317. [Google Scholar] [CrossRef]
  54. Shamsoshoara, A.; Afghah, F.; Razi, A.; Zheng, L.; Fulé, P.Z.; Blasch, E. Aerial imagery pile burn detection using deep learning: The FLAME dataset. Comput. Netw. 2021, 193, 108001. [Google Scholar] [CrossRef]
  55. Liu, H.; Zhu, J.; Xu, Y.; Xie, L. Mcan-YOLO: An Improved Forest Fire and Smoke Detection Model Based on YOLOv7. Forests 2024, 15, 1781. [Google Scholar] [CrossRef]
  56. Li, J.; Xu, R.; Liu, Y. An improved forest fire and smoke detection model based on YOLOv5. Forests 2023, 14, 833. [Google Scholar] [CrossRef]
  57. Chen, X.; Xue, Y.; Hou, Q.; Fu, Y.; Zhu, Y. RepVGG-YOLOv7: A modified YOLOv7 for fire smoke detection. Fire 2023, 6, 383. [Google Scholar] [CrossRef]
  58. Fan, X.; Lei, F.; Yang, K. Real-Time Detection of Smoke and Fire in the Wild Using Unmanned Aerial Vehicle Remote Sensing Imagery. Forests 2025, 16, 201. [Google Scholar] [CrossRef]
  59. Zheng, Y.; Tao, F.; Gao, Z.; Li, J. FGYOLO: An Integrated Feature Enhancement Lightweight Unmanned Aerial Vehicle Forest Fire Detection Framework Based on YOLOv8n. Forests 2024, 15, 1823. [Google Scholar] [CrossRef]
Figure 1. The proposed FireNet-KD architecture. Teacher 1 (Vision Transformer) provides global contextual features, while Teacher 2 (CNN-based) offers local hierarchical features. Both guide the student model through an attention-based knowledge distillation mechanism.
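To make the distillation flow in Figure 1 concrete, the following minimal PyTorch sketch shows one way an attention-weighted, two-teacher distillation loss can be written. It is an illustration only, not the released FireNet-KD code: the temperature value, the learnable teacher gate, and the assumption that both teachers and the student expose classification logits are choices made for this sketch.

```python
# Minimal sketch of attention-weighted two-teacher knowledge distillation.
# Names and hyperparameters are illustrative, not the paper's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiTeacherKDLoss(nn.Module):
    def __init__(self, temperature: float = 4.0):
        super().__init__()
        self.temperature = temperature
        # Learnable gate deciding how much each teacher contributes.
        self.teacher_gate = nn.Parameter(torch.zeros(2))

    def forward(self, student_logits, vit_logits, cnn_logits, targets):
        t = self.temperature
        # Soft targets from the frozen teachers.
        p_vit = F.softmax(vit_logits.detach() / t, dim=-1)
        p_cnn = F.softmax(cnn_logits.detach() / t, dim=-1)
        log_p_student = F.log_softmax(student_logits / t, dim=-1)

        # Attention over the two teachers (softmax-normalised mixing weights).
        w = F.softmax(self.teacher_gate, dim=0)
        kd_vit = F.kl_div(log_p_student, p_vit, reduction="batchmean") * t * t
        kd_cnn = F.kl_div(log_p_student, p_cnn, reduction="batchmean") * t * t

        # Hard-label supervision keeps the student anchored to ground truth.
        ce = F.cross_entropy(student_logits, targets)
        return ce + w[0] * kd_vit + w[1] * kd_cnn
```

In practice, a detection model would also distill intermediate feature maps and box predictions rather than class logits alone; the gate above simply stands in for the attention-based mechanism named in the caption.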
Figure 2. Training and validation curves of each metric.
Figure 3. Forest fire detection results of FireNet-KD; bounding boxes are shown only for detections with confidence scores above 0.5.
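The 0.5 cut-off mentioned in the Figure 3 caption is an ordinary post-processing step. The sketch below is a generic illustration; the list-of-(box, score) output format is an assumption for this example, not FireNet-KD's actual interface.

```python
# Illustrative post-processing: keep only detections above a confidence threshold.
# The (box, score) output format is assumed for this sketch.
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def filter_detections(detections: List[Tuple[Box, float]],
                      conf_threshold: float = 0.5) -> List[Tuple[Box, float]]:
    """Drop low-confidence boxes before drawing, as in Figure 3."""
    return [(box, score) for box, score in detections if score > conf_threshold]


# Example: only the 0.92 detection survives the 0.5 cut-off.
detections = [((10.0, 20.0, 80.0, 120.0), 0.92), ((5.0, 5.0, 30.0, 40.0), 0.31)]
print(filter_detections(detections))
```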
Figure 4. Detection results of FireNet-KD under varying environmental conditions.
Figure 5. Grad-CAM visualizations of the activation regions responsible for FireNet-KD's decisions. The model effectively highlights the important regions in the input images, improving the interpretability and reliability of the fire detection outcomes.
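The Figure 5 heatmaps can be reproduced with a standard hook-based Grad-CAM routine such as the sketch below. This is a generic implementation, not the paper's visualization code: it assumes the chosen target layer emits spatial feature maps of shape (N, C, H, W), which holds for a CNN stage but would require reshaping token sequences for transformer blocks, and it assumes a classification-style output head.

```python
# Generic Grad-CAM sketch using forward/backward hooks (illustrative only).
import torch
import torch.nn.functional as F


def grad_cam(model, image, target_layer, class_idx=None):
    """Return a normalized heatmap of the regions driving the chosen class score.

    Assumes `target_layer` outputs (N, C, H, W) feature maps and `model(image)`
    returns classification logits of shape (N, num_classes).
    """
    activations, gradients = [], []
    fwd = target_layer.register_forward_hook(
        lambda m, inp, out: activations.append(out))
    bwd = target_layer.register_full_backward_hook(
        lambda m, grad_in, grad_out: gradients.append(grad_out[0]))
    try:
        model.eval()
        logits = model(image)                            # image: (1, 3, H, W)
        if class_idx is None:
            class_idx = logits.argmax(dim=-1).item()
        model.zero_grad()
        logits[0, class_idx].backward()

        acts, grads = activations[0], gradients[0]       # (1, C, h, w)
        weights = grads.mean(dim=(2, 3), keepdim=True)   # per-channel importance
        cam = F.relu((weights * acts).sum(dim=1, keepdim=True))
        cam = F.interpolate(cam, size=image.shape[-2:],
                            mode="bilinear", align_corners=False)
        cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
        return cam.squeeze().detach()
    finally:
        fwd.remove()
        bwd.remove()
```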
Figure 6. Performance of the FireNet-KD model on the key evaluation measures. The plot compares precision, recall, F1-score, and mAP@50, demonstrating the model's strong overall detection capability.
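For completeness, the measures plotted in Figure 6 and reported in Tables 2 and 3 follow the standard detection definitions below, stated here as a reminder of the conventional formulas rather than a paper-specific formulation (TP, FP, and FN denote true positives, false positives, and false negatives):

\[
\text{Precision} = \frac{TP}{TP + FP}, \qquad
\text{Recall} = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}},
\]
\[
\text{mAP@50} = \frac{1}{N} \sum_{c=1}^{N} \text{AP}_c, \quad \text{with each } \text{AP}_c \text{ computed at an IoU threshold of } 0.5,
\]
where N is the number of classes.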
Table 1. Comparison of forest fire detection techniques.
Technique | Dataset | Advantages | Limitations
FF-net | FLAME | High precision and robustness in complex scenes; handles small targets and occlusions well | Performance drops in scenes with pseudo-samples (regions resembling flames)
ADE-Net | FLAME | Dual encoding for spatial and semantic features; strong local and global fusion | Large model size (333.69 MB); slightly higher inference time; requires supervised data
DRCSPNet + Global Mixed Attention + Lite-PAN | FLAME | Lightweight and real-time (33.5 FPS); robust to lighting variations and complex backgrounds | Synthetic data may not generalize perfectly; lower mAP on real-world scenarios (58.39%)
GCST | FLAME | Efficient multi-scale flame and smoke feature extraction; reduced parameter count | Performance may drop in noisy real-world scenes with complex backgrounds
SWVR | FLAME | Bi-directional feature fusion; reduces Params and GFLOPs; suitable for edge devices | Minor reduction in semantic richness if GSConv is overused; slight speed reduction
CN2VF-Net | D-Fire | Handles fire scale variation, occlusions, and environmental complexity; lightweight and accurate for deployment | Limited performance on fires smaller than 16 × 16 pixels
SPPFP + CBAM + BiFPN | Forest fire | Detects small fire targets in long-range UAV images; overcomes traditional model limitations | Susceptible to lighting/occlusion interference; false alarms remain an issue
Table 2. Comparison of the proposed model FireNet-KD against existing methods based on precision, recall, F1-score, and mAP@50. Bold values indicate the best performance for each metric.
Ref | Precision (%) | Recall (%) | F1-Score (%) | mAP@50 (%)
Liu et al. [55] | 90.9 | 86.8 | 88.8 | 91.5
Li et al. [56] | 89.2 | - | 89.9 | 89.3
Chen et al. [57] | 90.8 | - | 91.8 | 91.4
Fan et al. [58] | 76.7 | 75.5 | 76.1 | 79.2
Zheng et al. [59] | 94.5 | 96.8 | - | 96.7
Proposed | 95.1 | 99.6 | 97.3 | 97.3
Table 3. Ablation study results. Bold values indicate the best performance for each metric.
Model Variant | Precision (%) | Recall (%) | F1-Score (%) | mAP@50 (%)
Single Teacher 1 | 93.18 | 95.54 | 94.72 | 92.65
Single Teacher 2 | 92.18 | 96.87 | 93.86 | 91.88
Student Only | 93.58 | 97.21 | 94.89 | 94.31
FireNet-KD | 95.16 | 99.61 | 97.34 | 97.31
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
