Article

Advancing Real-Time Aerial Wildfire Detection Through Plume Recognition and Knowledge Distillation

by Pirunthan Keerthinathan 1,2,*, Juan Sandino 1,2,3, Sutharsan Mahendren 2, Anuraj Uthayasooriyan 1,2, Julian Galvez 1,3,4, Grant Hamilton 5 and Felipe Gonzalez 1,2,3
1 QUT Centre for Robotics (QCR), Faculty of Engineering, Queensland University of Technology (QUT), 2 George Street, Brisbane City, QLD 4000, Australia
2 School of Electrical Engineering and Robotics, Faculty of Engineering, Queensland University of Technology (QUT), 2 George Street, Brisbane City, QLD 4000, Australia
3 Securing Antarctica’s Environmental Future (SAEF), Queensland University of Technology (QUT), 2 George Street, Brisbane City, QLD 4000, Australia
4 QUT Research Engineering Facility, Office of Research Infrastructure, J2, Sports Lane, Kelvin Grove, QLD 4059, Australia
5 School of Biology and Environmental Science, Faculty of Science, Queensland University of Technology (QUT), 2 George Street, Brisbane City, QLD 4000, Australia
* Author to whom correspondence should be addressed.
Drones 2025, 9(12), 827; https://doi.org/10.3390/drones9120827
Submission received: 30 October 2025 / Revised: 25 November 2025 / Accepted: 25 November 2025 / Published: 28 November 2025

Highlights

What are the main findings?
  • Plume-based YOLO detection achieved 76.1–85.7% F1 and 80.1–87.5% overall accuracy, compared with ~58% overall accuracy for smoke-based detection, thereby reducing false positives.
  • Knowledge Distillation (KD) improved the detection performance from mAP@0.5 = 58.9–61.2% (without KD) to 69.7–72.1% (with KD).
What are the implications of the main findings?
  • Introduction of plume as a new visual indicator to better distinguish wildfire or bushfire smoke from similar atmospheric phenomena.
  • Demonstrated that few-shot learning can reduce manual annotation effort, enhancing scalability across diverse environments.

Abstract

Uncrewed aerial systems (UAS)-based remote sensing and artificial intelligence (AI) analysis enable real-time wildfire or bushfire detection, facilitating early response to minimize damage and protect lives and property. However, their effectiveness is limited by three issues: distinguishing smoke from fog, the high cost of manual annotation, and the computational demands of large models. This study addresses these three key challenges by introducing plume as a new indicator to better distinguish smoke from similar visual elements, and by employing a hybrid annotation method using knowledge distillation (KD) to reduce expert labour and accelerate labelling. Additionally, it leverages lightweight YOLO Nano models trained with pseudo-labels generated from a fine-tuned teacher network to lower computational demands while maintaining high detection accuracy for real-time wildfire monitoring. Controlled pile burns in Canungra, QLD, Australia, were conducted to collect UAS-captured images over deciduous vegetation; these were augmented with the Flame2 dataset, which contains wildfire images of coniferous vegetation. A Grounding DINO model, fine-tuned using few-shot learning, served as the teacher network to generate pseudo-labels for a significant portion of the Flame2 dataset. These pseudo-labels were then used to train student networks consisting of YOLO Nano architectures, specifically versions 5, 8, and 11 (YOLOv5n, YOLOv8n, YOLOv11n). The experimental results show that YOLOv8n and YOLOv5n achieved an mAP@0.5 of 0.721. Plume detection outperforms the smoke indicator (F1: 76.1–85.7% vs. ~70%) in fog and wildfire scenarios. These findings underscore the value of incorporating plume as a distinct class and utilizing KD, both of which enhance detection accuracy and scalability, ultimately supporting more reliable and timely wildfire monitoring and response.

1. Introduction

Wildfires, commonly referred to as bushfires in Australia, are unplanned fires that spread through wildland areas, causing significant environmental damage, destroying property, endangering lives, and contributing to air pollution. The necessity for early wildfire detection is both urgent and complex, influenced by the potential for exponential fire spread under critical conditions [1]. Early warning systems reduce risks to residents and responders by providing time for evacuation and strategic planning, and they save millions in firefighting and recovery costs by enabling early fire containment [2]. Remote sensing (RS) platforms, including satellites, manned aircraft, and uncrewed aerial systems (UASs), play a vital role in providing on-demand information for areas that are hard to reach or pose risks to humans, such as regions prone to wildfires [3]. UASs demonstrate distinct advantages in wildfire management, outperforming other RS approaches by delivering high-resolution spectral and structural data with the requisite temporal and spatial resolution for effective monitoring [4].
Artificial intelligence (AI) in UAS-based RS involves analyzing data collected by UASs to perform tasks such as object detection and tracking, and semantic segmentation. This technology enhances environmental monitoring across various domains, including agriculture [5], biosecurity [6], search and rescue [7], and disaster management [8]. Machine learning (ML) is a core component of AI, providing the ability to learn from data. Deep learning (DL) is a sophisticated branch of ML, excelling in handling complex datasets through deep neural networks. Recent studies show how advanced DL models have greatly enhanced the precision of wildfire detection systems, outperforming conventional ML techniques [4]. DL features a range of architectures such as convolutional neural networks (CNNs), recurrent neural networks (RNNs), transformers, and generative adversarial networks (GANs) [9]. In detection tasks, CNNs are particularly effective at capturing hierarchical local features, while vision transformers (ViTs) have gained popularity for their strong performance on large datasets, given sufficient computational resources [10,11]. For the development and evaluation of these advanced models, a variety of datasets have been utilized, which are essential resources for improving DL approaches to wildfire detection.
Recent CNN-based wildfire detection studies have utilized various datasets, demonstrating notable performance when trained and tested on high-performance computing systems [4]. Table 1 summarizes recent studies on real-time wildfire detection using UAS imagery. Object detection is more commonly applied than semantic segmentation due to its faster deployment capabilities, while accuracy and speed vary depending on the dataset, number of target classes, and the hardware used for deployment. Jiao et al. [12] developed a large variant of the You Only Look Once version 3 (YOLOv3) model, which enables real-time fire detection by capturing images via UASs and transmitting them to ground stations for processing. The use of large-variant AI models enhances prediction accuracy and performance, critical for timely wildfire response. However, transmitting and processing images in wildland environments is challenging due to latency and limited connectivity caused by sparse cell tower coverage or limited bandwidth [13]. Meanwhile, running DL models on resource-constrained computers onboard UASs presents significant challenges due to their limited computational power and memory, which restricts the complexity of deployable models. Lightweight models, however, can improve UAS autonomy and operational effectiveness by optimizing the trade-off between detection performance and speed in wildfire scenarios. Techniques such as knowledge distillation (KD) [14] transfer learned knowledge from a large, high-performing model to a smaller model. This approximates the original performance while significantly reducing computational complexity, which makes it suitable for deployment in resource-constrained environments [15].
The scarcity of labelled UAS-acquired wildfire datasets poses a significant challenge for training robust models. Most studies combined UAS-acquired datasets with publicly available wildfire datasets not originally captured by UASs [15,16,17,18]. This may be due to the limited availability of labelled datasets [19]. This is a common challenge in RS and AI, largely driven by the time-consuming nature of annotation and the need for domain expertise [20]. Using wildfire images from non-UAS sources captured at ground level using handheld devices helps overcome the lack of labelled UAS-acquired data and enhances model generalization through diverse scenarios. However, differences in viewpoint, resolution, and labelling standards can introduce a domain gap, potentially reducing the accuracy of UAS-based wildfire detection models [21].
Wildfire detection systems primarily target flame, smoke, or a combination of both classes. Smoke detection often acts as an early indicator of wildfire activity, as it rises above the canopy and can be observed from a significant distance before flames become visible. This makes smoke a critical early warning signal in wildfire detection efforts. Research conducted by Barmpoutis et al. [22] and Wang et al. [15], for instance, reported that smoke detection demonstrated suboptimal performance, primarily due to the prevalence of false-positive results. Environmental conditions, such as fog and low-lying clouds, which can visually mimic smoke in UAS imagery, frequently resulted in misclassifications, thereby diminishing the accuracy of automated smoke detection systems [23,24]. Consequently, while smoke remains an important early indicator, the differentiation of smoke from analogous atmospheric phenomena continues to pose a substantial challenge.
In response to challenges in UAS-based wildfire detection, particularly the domain gap introduced by non-UAS datasets and the difficulty in distinguishing smoke from similar atmospheric phenomena, we introduced plume detection as an additional key indicator, alongside smoke and flame, to increase confidence and reduce false alarms. A plume is typically characterized by a vertically rising column of smoke that extends above the forest canopy, making it visually distinct during the initial stages of a fire. It appears as a structured, coherent, and elongated column, generally upright but often tilted under wind influence. The plume originates from a narrow base at the ignition point and widens with height, forming a recognizable geometry. Its texture commonly shows smooth, layered, or rolling billows with a sharper boundary compared with diffuse smoke or fog, enabling clearer discrimination in aerial imagery. The boundary and structure of the plume generally diffuse after a certain distance from the source, which is useful for defining the plume height and annotating it for machine learning. Studies have explored smoke plume height detection using satellite-based systems [25]. However, the use of UAS can further enhance this capability by capturing high-resolution and near-real-time imagery at lower altitudes. These capabilities facilitate the detection of finer plume structures, particularly in small-scale fires in the early wildfire stage, where satellite imagery may be limited by spatial resolution constraints, elevated data acquisition costs, and delays in retrieving the data for real-time decision making. We conducted a controlled fire experiment in Canungra, QLD, Australia, to capture a domain-specific imagery dataset [26] from deciduous vegetation-dominated wildland using a UAS. The images were manually annotated for three classes: smoke, plume, and flame.
Instead of benchmarking smoke detection against other architectures, this study focuses on evaluating how the inclusion of plume as a distinct visual indicator, alongside smoke and flame, improves the overall reliability of wildfire detection, particularly under foggy conditions. We implemented a teacher-student framework using KD [14], where a fine-tuned teacher model generated pseudo-labels and transferred knowledge to lightweight student models, which were subsequently optimized for real-time applications. Grounding DINO [27], a powerful open-set object detection model that combines language-guided localization with the DINO [28] (DETR [29] with improved denoising anchor boxes) architecture, was used as the teacher model. YOLO Nano variants including YOLO version 5 (YOLOv5n), version 8 (YOLOv8n), and version 11 (YOLOv11n) were chosen as student models due to their speed and lightweight design, making them suitable for deployment on resource-constrained hardware devices. The acquired dataset [26] was augmented with the Flame2 dataset [30] to enhance model generalization. The dataset comprises RGB images collected by UAS during a prescribed fire in an open canopy setting in Northern Arizona in 2021. A portion of the Flame2 dataset was manually annotated and used to fine-tune the teacher model. The fine-tuned teacher model was then used to generate pseudo-labels for the entire dataset across the three classes. The combined dataset, which included both manual and pseudo-labels from wildlands with varying dominant vegetation types, was used to train the student models. This approach ensures consistency and accuracy in labelling while minimizing the domain gap between the datasets. By investigating the plume alongside traditional indicators such as smoke and flame, and leveraging UAS-specific datasets, this study aims to improve the accuracy, reliability, and scalability of automated wildfire detection systems, particularly in identifying early warning signals under challenging environmental conditions.

2. Materials and Methods

2.1. Experimental Design

The experimental design, as depicted in Figure 1, consists of five essential components: data acquisition, labelling, few-shot learning-based pseudo-labelling, lightweight model development, and fog image validation. In the data acquisition component, a UAS is deployed to collect aerial imagery over a controlled fire. A subset of the Flame2 dataset [30], together with the collected imagery [26], is manually labelled to establish a reliable ground truth dataset. To improve the efficiency of the annotation process, a few-shot learning technique is implemented, in which a pre-trained Grounding DINO model is fine-tuned on the manually labelled data and then used to pseudo-label the remaining Flame2 data. Subsequently, lightweight YOLO Nano models are developed using a teacher-student training framework, specifically designed for real-time performance. The reliability of these models is then validated on a fog image dataset, assessing their capability to distinguish wildfire indicators (smoke, plume, flame) from visually similar phenomena, with performance evaluated accordingly.

2.2. Data Acquisition

2.2.1. Study Site

Flight tests were conducted at TAC Resources, Canungra QLD, Australia (28°1′33.6″ S, 153°9′59.94″ E), selected in consultation with, and with the approval of, the site’s landowner. Canungra is located in the Scenic Rim Region of Southeast QLD, in the Gold Coast hinterland. It is about 32 km west of the Gold Coast and 75 km south of Brisbane, nestled below the Beechmont plateau and stretching towards Tamborine Mountain. The site is surrounded by dense forests, predominantly deciduous vegetation, and steep slopes. Figure 2 shows the site location as captured by Nearmap [31].

2.2.2. Controlled Pile Burns

A controlled pile burn was conducted using six partially open metal barrels. The fuel consisted of on-site bark and dry eucalyptus leaves, and the barrels were placed 5–10 m apart based on accessible areas of the sloped terrain, with each pile ignited sequentially to observe both merging and isolated plumes and to ensure clear visual separation for analysis. Firebreaks were established around the barrels through the systematic removal of dry vegetation and other flammable materials to mitigate the risk of unintentional fire spread. Figure 3 shows the controlled pile burn barrels. Weather conditions, encompassing wind speed, direction, temperature, and humidity, were meticulously evaluated to ascertain safe burning conditions. Table 2 presents the meteorological observations recorded at 9:00 AM on the day of data acquisition, obtained from the nearest Bureau of Meteorology station equipped with weather parameter sensors. We observed no significant variation in wind speed throughout the day; therefore, we assumed that the wind speed measured at 9:00 AM was representative of the entire data collection period. Personnel, trained and equipped with personal protective equipment (PPE) such as fire-resistant clothing, gloves, eye protection, and respiratory masks, conducted the ignition. Team members maintained effective communication throughout the operation, with designated personnel positioned strategically around the site to monitor fire behaviour and ensure containment. Fire suppression equipment, such as water tanks, fire extinguishers, fire-retardant foam, and hand tools, was readily accessible. Safe escape routes were identified and communicated to all personnel involved. Following the burn, embers and hotspots were extinguished with water until the site was entirely cool to the touch, and a post-burn monitoring period was implemented to observe for potential flare-ups.

2.2.3. UAS Operations

The uncrewed aerial system (UAS) used to collect the aerial Canungra control fire dataset [26] during the controlled pile burn experiment was a custom-built X500 Kit (Holybro, Zhejiang, China) equipped with a Pixhawk 4 flight controller. The OAK-D camera (formally an OAK-D-S2, Luxonis, Westminster, CO, USA) was configured to record HD-resolution RGB images at a rate of one frame per second (FPS). The camera possesses an RGB sensor with a focal length of 4.81 mm and horizontal, vertical and diagonal fields of view of 69°, 55° and 82°, respectively. We used a securely tightened camera mount to ensure that the camera maintained a fixed viewing angle throughout the flight. A Raspberry Pi 4 Model B with 2 GB of RAM served as the host device. The flights took place on 16 November 2024, between 12:00 PM and 3:00 PM local time (UTC+10). The UAS, equipped with an OAK-D camera and Raspberry Pi payload and maintaining a maximum take-off weight (MTOW) below 2 kg, was manually piloted to capture real-time surveillance footage (Figure 4).
The flight path plan is illustrated in Figure 5. The circular marker indicates the UAS’s take-off and landing zone, while the dotted line represents the executed trajectory. The path is contained within a circular operating area, with the UAS flying radially toward and away from the controlled pile-burning site, indicated by the fire symbol in Figure 5. The flight operations were monitored using QGroundControl v4.4.2, which provided real-time telemetry data to ensure safe and stable flight performance. The UAS maintained a height above ground level (AGL) between 50 m and 70 m, with the camera consistently pointed at the pile burn, while keeping a safe distance from tree canopies and active fire zones. All operations were conducted within visual line of sight to maintain operational safety standards. A total of 364 aerial images were captured, of which 210 were selected for object detection AI model development.

2.2.4. Open-Access Datasets

This study utilized the Flame2 dataset to enhance the training data for the development of lightweight ML models. The augmentation of the training data with diverse, high-quality footage aimed to optimize the lightweight models for real-time deployment scenarios, where computational resources are frequently limited. The Flame2 dataset [30], specifically curated for wildfire monitoring, consists of high-resolution RGB videos captured during a prescribed burn in an open canopy pine forest in Northern Arizona in November 2021. This dataset’s high-quality RGB footage, comprising seven raw videos, facilitates feature extraction and model training. The details of the Flame2 RGB videos are presented in Table 3. By integrating datasets from two distinct forest types—coniferous forests (Flame2 dataset [30]) and deciduous forests (Canungra control fire dataset [26])—a more diverse training set was constructed.
Furthermore, 74 fog images depicting forest environments from an aerial view were sourced from Kaggle [32] and the Freepik website (https://www.freepik.com) to assess the capability of the final model to differentiate smoke and plumes from false detections induced by fog. This evaluation aimed to examine the efficacy of integrating plume indicators in wildfire detection and to ascertain the practical benefits of this approach, with the objective of minimizing false alarms and offering an additional layer of indication.

2.3. Few-Shot Training with Grounding DINO

Manual image annotation with bounding boxes in object detection tasks requires considerable expertise and labour. To address this, we adopted Grounding DINO [27], a state-of-the-art framework for visual grounding and object detection, into our workflow to automate the labelling process. Grounding DINO leverages a ViT architecture that combines powerful image-text alignment with strong object localization to detect objects based on free-form text prompts without needing class-specific training. It enables flexible and scalable object detection by grounding natural language queries directly in images. This study utilized few-shot learning by fine-tuning Grounding DINO to generate pseudo-labels, which were then used to develop lightweight models through a KD process, where Grounding DINO acts as the teacher and the lightweight models serve as students. Figure 6 provides an overview of the Grounding DINO model architecture and the KD process.
We specifically adopt a pseudo-labelling strategy, also known as offline KD, rather than feature-level or response-based distillation. This choice addresses the constraints of transferring knowledge from a foundation model to an edge detector. We avoid feature-level distillation due to the fundamental architectural differences between the teacher and student models. The teacher’s Vision Transformer backbone relies on global-level attention features, whereas the student’s CNN architecture prioritizes local spatial features. Direct feature alignment is therefore difficult without computationally intensive adapters. Similarly, response-based distillation is not straightforward because of the semantic mismatch between the teacher’s open-set text alignment scores and the student’s closed-set class probability logits. Consequently, we implemented an offline distillation process utilizing hard pseudo-labels. The teacher network generated predictions on the unlabelled data, which were rigorously filtered using confidence thresholds and Non-Maximum Suppression (NMS). This process converted the teacher’s raw outputs into deterministic bounding box annotations. These hard labels provide strong supervision for the student network, effectively filtering out the teacher’s intrinsic uncertainty and noise while decoupling the student’s training from the computational overhead of the teacher model.
In this study, we introduced “plume” as a new indicator for wildfire detection in addition to smoke and flame. Smoke was annotated as the entire region of reduced visibility, while plume annotations were placed within this area because plumes are treated as a subset of smoke. A plume was labelled when a vertically oriented smoke column was visible, beginning from a single origin point near the ground or canopy, often with a narrow cross-section resembling a cone-shaped structure, and characterized by higher opacity or brightness at the core. Figure 7 illustrates the plume annotation conditions. The bounding box enclosed the region from this point of origin upward until the plume began to diffuse and lose its well-defined edges. Flames were annotated using the smallest bounding box that enclosed the visible flame. We manually annotated the collected Canungra control fire dataset [26], focusing on deciduous vegetation land, and selected Videos 3 and 4 from the Flame2 dataset [30], which feature coniferous pine forests. These videos were sampled at one frame every 4 s, resulting in a total of 338 annotated images created using the LabelImg tool [33]. The annotated data was split in a 1:4 ratio for testing and fine-tuning, with the classes “smoke,” “plume,” and “flame” designated as target labels. Table 4 presents the number of images and instances for each class in both the training and testing datasets.
We employed the open-source training code from Open Grounding Dino, a third-party implementation of Grounding DINO [34], to fine-tune the official Grounding DINO model. Training was conducted on a high-performance computing (HPC) environment equipped with five Intel® Xeon® Platinum 8468 CPUs (Intel, Santa Clara, CA, USA), one NVIDIA H100 GPU, and 200 GB of RAM. The fine-tuned model was evaluated using the COCO evaluation metrics [35] on the test dataset. These metrics are discussed in detail in Section 2.6.
Hyperparameter tuning was performed using 10-step ranges for the key parameters: learning rate (1 × 10−5 to 1 × 10−1), weight decay (1 × 10−6 to 1 × 10−4), encoder depth (4–6 layers), and decoder depth (6–8 layers). Each configuration was evaluated on the validation set, and the final values (Table 5) were selected based on the highest obtained mAP@0.5 (maxDets = 100). Training incorporated early stopping with a patience of 20 epochs to prevent overfitting. We applied NMS with an Intersection over Union (IoU) threshold of 0.5 to refine the predictions by eliminating overlapping bounding boxes. Additionally, for detections of the same class, smaller bounding boxes fully contained within larger ones were removed, retaining only the largest box. The results were compared to ground truth annotations, and the IoU between the ground truth and predicted bounding boxes was calculated to evaluate the detection accuracy. Dataset augmentation for training the lightweight object detection model involved extracting frames from the remaining videos in the Flame2 dataset at two-second intervals. These frames were subsequently processed using the fine-tuned Grounding DINO model, employing the same text prompts—smoke, plume, and flame—used during the fine-tuning stage. The same NMS and overlapping bounding box removal techniques were applied to generate high-quality pseudo-labels. This augmented dataset was then combined with the manually annotated data to train a lightweight model optimized for real-time wildfire detection across diverse forest environments.

2.4. Lightweight Model Development

The development of a lightweight object detection model suitable for real-time deployment necessitated the evaluation of multiple AI architectures. Although recent models such as RT-DETR [36] and YOLO-NAS [37] adopt transformers and neural architecture search to improve detection performance, their smallest pre-trained variants (RT-DETR-S and YOLO-NAS-S) require approximately 20 M and 12 M parameters, respectively. Our study prioritizes the nano variants of the Ultralytics YOLO series (5n, 8n, 11n), which maintain an order-of-magnitude lower model complexity with approximately 4.6 M, 3.2 M, and 2.6 M parameters. This parameter efficiency is critical for minimizing memory footprint. These variants demonstrate progressive improvements in accuracy and features, starting from the solid foundation of YOLOv5 [38], known for its efficient architecture and robust performance, followed by the anchor-free enhancements introduced in YOLOv8 [38], and culminating in the improved precision and reduced complexity achieved by YOLOv11 [39] compared to earlier versions.
The dataset employed for training was compiled through a combination of 338 manually labelled images and 582 pseudo-labelled images generated by the fine-tuned Grounding DINO model. The final dataset was split using an 80:20 ratio for training and testing. The training set consisted of 680 images (513 smoke, 262 plume, 297 flame instances), while the test set included 204 images (137 smoke, 79 plume, 80 flame). Stratified sampling was used to ensure balanced representation of each class across the train–test split. Table 6 shows the number of images and instances for each class in the training and testing dataset. All annotations were formatted in YOLO format for YOLO-based models. Training and implementation of the YOLOv5n, YOLOv8n, and YOLOv11n models were carried out using the Ultralytics library. Each YOLO model was trained using consistent hyperparameters, including an image size of 640 and a batch size of 16, via the Ultralytics Python application programming interface (API) under Python 3.8.10. Standard evaluation metrics based on IoU, including precision, recall, and mean average precision, were used to evaluate the model performance [40]. These metrics are discussed in Section 2.6.
The training process was conducted in the same HPC environment used for fine-tuning Grounding DINO. The process involved optimizing the models’ hyperparameters and examining their learning curves to identify the best time to halt training, thereby minimizing overfitting. The fine-tuning process used a search space for the initial learning rate between 0.00001 and 0.1, with maximum rotation augmentation ranging from 0 to 45 degrees, and training was conducted over 50 epochs with 300 iterations. Training was performed with early stopping (patience = 20 epochs), and the final model was selected based on the highest mAP@0.5 achieved on the validation set. A total of 634 images were utilized, with an 80:20 split for training and testing purposes. To evaluate the effectiveness of KD, YOLO models were trained using only manual labels, without incorporating pseudo-labels. The performance of these YOLO models was then compared to that of models trained using KD on the same test dataset. This experimental setup provided a clear assessment of how well models trained with pseudo-labels performed, highlighting the potential of KD to enhance scalability and efficiency in model development.
Trained YOLO models were deployed on the OAK-D camera to leverage its on-device inference and decoding capabilities, and their speed of detection was compared with deployments on the Jetson Nano Orin board. The models were converted to the blob format using the DepthAI tool (https://tools.luxonis.com), optimizing them specifically for the OAK-D’s Myriad X Vision Processing Unit (VPU). The converted blob models were integrated into the OAK-D’s processing pipeline. Real-time object detection, including video stream capture, YOLO model inference, and decoding outputs such as bounding boxes and class labels, was performed directly on the camera, without relying on external computing resources. The same YOLO models were deployed on the Jetson Nano Orin board (host device), a powerful edge computing platform. Inference was executed using the Ultralytics library, leveraging the Jetson’s GPU for accelerated performance. Decoding was performed on the host, and FPS were recorded under conditions similar to those of the OAK-D setup.

2.5. Evaluation on Fog Images

Fog images were intentionally excluded from the training dataset and reserved as unseen negative samples for robust post-training evaluation of false positive performance. The performance of the trained YOLOv5n, YOLOv8n, and YOLOv11n models under foggy conditions was evaluated through inference on a curated set of fog images. A total of 74 images were carefully selected from publicly available sources on Kaggle and Freepik, depicting fog in forest or wildland environments from an aerial perspective to accurately represent real-world wildfire detection scenarios and minimize domain gaps. These fog images were combined with a subset of the test dataset comprising 50 images, all of which contained labels for all the wildfire indicator classes. Each trained model was employed to perform inference on the combined set of fog and wildfire images with a 0.5 confidence threshold, and their outputs were recorded. The evaluation employed a classification-based approach, wherein the detection of any indicator was interpreted as a wildfire occurrence.

2.6. Evaluation Metrics

In this study, the object detection models within the teacher–student KD framework were evaluated using different metric sets, due to variations in the underlying model development code bases. The prediction performance of the teacher model, Grounding DINO, was assessed using the COCO metrics [35]. Average precision (AP) and average recall (AR) were measured across varying IoU thresholds and detection limits to comprehensively assess the model’s performance. The evaluation metrics used for the test datasets are detailed in Table 7. The student models, which are YOLO nano variants, were evaluated using YOLO performance metrics [41], which are commonly employed in the YOLO development framework (see Table 8). In addition to object detection, a classification-based evaluation was conducted on foggy images. To assess the performance of this approach, standard classification evaluation metrics, including accuracy, precision, recall, and F1-score, were computed [42]. Precision reflects the model’s ability to minimize false positives, particularly in foggy images where no wildfire indicators should be present. Recall measured the proportion of actual wildfire images in which the model successfully detected at least one indicator class, reflecting its sensitivity to true wildfire events. The F1-score, defined as the harmonic mean of precision and recall, offered a balanced assessment of the model’s performance in distinguishing true wildfire activity from false detections induced by fog.

3. Results

3.1. Evaluation of Few-Shot Fine-Tuning for Grounding DINO

Figure 8 showcases a comparison between ground truth labels and the corresponding predictions, displayed side by side for visual inspection. These examples demonstrate the model’s ability to generalize to unseen images using limited annotated data. The results in Table 9 show the performance of the fine-tuned model on the test dataset using COCO evaluation metrics. The model achieved an mAP of 0.479 across IoU thresholds [0.50:0.95], with detection accuracy highest at looser thresholds—reaching an mAP@0.5 of 0.735 and an mAP of 0.440 at IoU 0.75. mAR increased with the number of allowable detections, peaking at 0.675 for maxDets = 100. These results suggest some sensitivity to stricter IoU thresholds and highlight room for improvement in detection consistency across varying object sizes.
Table 10 presents a class-wise breakdown of detection performance on the test dataset, including the number of labelled images, total annotated instances, and the average IoU for each class. The IoU values were computed after applying a confidence threshold of 0.5 and performing NMS with a threshold of 0.5. The model performed best on Smoke detection, achieving a high average IoU of 0.9473. In contrast, the performance on Plume and Flame was lower, with average IoUs of 0.6196 and 0.6129, respectively.

3.2. Lightweight Models Performance

Table 11 presents a detailed breakdown of the YOLO models’ performance, including precision (P), recall (R), mAP@0.5, and mAP@[0.50:0.95] across all classes. YOLOv8n achieved the highest overall mAP@0.5 (0.721) and strong class-wise precision and recall, particularly for smoke and flame detection, demonstrating balanced performance. In contrast, plume detection showed lower precision and recall across all models, likely due to its diffuse appearance and annotation ambiguity. Figure 9 presents the training and validation loss curves for the three YOLO model variants: YOLOv5n, YOLOv8n, and YOLOv11n. Figure 10, Figure 11 and Figure 12 present visual comparisons between predicted outputs and ground truth labels for YOLOv5n, YOLOv8n, and YOLOv11n, respectively. These side-by-side examples highlight how well each model detects smoke, plume, and flame in wildfire imagery. Figure 13 presents the Precision-Recall (PR) curves for the lightweight YOLO models—YOLOv5n, YOLOv8n, and YOLOv11n—highlighting their performance differences during evaluation. Table 12 presents the performance metrics for the YOLOv5n, YOLOv8n, and YOLOv11n models trained with and without KD. The models designated as YOLOv5n-WKD, YOLOv8n-WKD, and YOLOv11n-WKD refer to the models trained without pseudo-labels for KD. Table 13 presents a comparative analysis of the FPS achieved by the object detection models when deployed on the Jetson Nano and the OAK-D camera.

3.3. Model Evaluation Under Foggy Conditions

Table 14 summarizes these results as binary confusion matrices (wildfire vs. non-wildfire) for smoke and plume, illustrating the trade-off between missed detections and false alarms for each model. Table 15 further presents the corresponding precision, recall and accuracy for each class—including smoke, plume, and flame—across the YOLOv5n, YOLOv8n, and YOLOv11n models evaluated on the image set containing fog and wildfire images. Figure 14 illustrates the F1-score comparison between the smoke and plume indicators across the developed models. Figure 15 illustrates the performance of the YOLOv11n on fog and wildfire images, including cases where fog is misclassified as smoke.

4. Discussion

This study evaluated the effectiveness of incorporating plume as a distinct classification class alongside smoke and flame to improve real-time wildfire detection using UAS imagery under foggy conditions, and demonstrated that few-shot learning enables effective fine-tuning with a limited number of annotated examples [43]. The results, presented in Table 14, indicate that this multi-class classification approach significantly improves the accuracy of early-stage wildfire detection. The performance of three variants of YOLO Nano on a dataset comprising fog and wildfire images underscores the importance of distinguishing between smoke and plume. Notably, plume detection consistently outperformed smoke detection across all models. It achieved higher accuracy (ranging from 80.1% to 87.5%) and F1-scores (76.1% to 85.7%), whereas smoke detection showed lower accuracy (approximately 58.8%) and a modest F1-score of around 70%. This finding highlights the well-documented difficulty of differentiating smoke from fog in low-visibility environments, an issue frequently noted in previous optical sensing studies [15,22,44].
While all models demonstrated perfect recall (100%) for smoke detection, they exhibited low precision (55%), indicating that while they successfully detected all wildfire instances, they produced a significant number of false positives in fog images. This limitation is especially evident in Figure 15, which shows how fog and smoke can share low-opacity, horizontally spreading features, making them difficult to separate visually. The lower recall for plume detection (63.2–75%) suggests that some actual plumes were missed, indicating a less comprehensive capture of plume instances compared to smoke. Despite one misclassification, the inclusion of the plume class proved beneficial in reducing confusion with fog, as its distinct vertical geometry and higher columnar density made it more discriminative than diffuse smoke (Figure 15a), with the models achieving high F1-scores that indicate reliable plume identification and minimal false positives under foggy conditions.
From the labelled details presented in Table 4 and Table 6, it was observed that over half of the images annotated for smoke also included plume labels, highlighting the frequent presence of plumes in wildfire imagery, particularly during the early ignition phase. By differentiating plume from diffuse smoke and fog, the model was able to focus on unique visual features such as vertical ascent, which are less prone to misclassification. This strategy contributed to the notably high F1-score for plume detection, especially with YOLOv11n. The superior performance is primarily attributed to the C2PSA [39] (Cross Stage Partial with Spatial Attention) module, which enhances the model’s ability to focus on specific spatial regions of interest. It allows the model to prioritize the structural coherence like vertical ascent of plumes, effectively suppressing false positives caused by the diffuse and unstructured nature of fog. Flame detection showed better performance than plume among all the models, but was less useful in the early stages of wildfires, as flames were often hidden by canopy cover in the early stages [45]. By incorporating plume detection, the model uses a feature that appears early in a wildfire and is visually distinct from fog, thereby improving overall detection reliability.
Recent studies primarily employed YOLO variants and other compact architectures to detect smoke and flame, often using combined classes such as “smoke + flame,” while our study advances this approach by introducing the plume class as a separate category, enhancing the model’s ability to capture early and distinguishable wildfire indicators. The mAP@0.5 of 0.721 achieved by YOLOv8n and YOLOv5n in this study is competitive with prior research. Akhloufi et al. [16] reported a slightly higher mAP@0.5 of 79.84% using YOLOv3-tiny, which classified flame, smoke, and a combined “smoke + flame” class. In contrast, our model introduces a novel plume class, which may increase detection complexity due to its diverse visual characteristics. Wang et al. [15] achieved a lower mAP of 0.661 using YOLOv4-MobileNetV3, while their KD-based pruned model exhibited faster prediction speeds with an mAP of 0.631, indicating that our approach outperforms this lightweight architecture despite handling an additional class. Xiong et al. [18] reported AP@0.5 values of 0.86 for flame and 0.71 for smoke (estimated mAP: 0.785). When limited to these two classes, our YOLOv8n achieves an mAP@0.5 of 0.815, surpassing their performance. The integration of manual and pseudo-labels in our study significantly enhanced labelling efficiency.
Smoke achieved the highest IoU because its diffuse but relatively uniform appearance creates more consistent and clearly identifiable boundaries across images. This allows both annotators and the model to localize smoke regions more reliably. Compared to other classes, the lower detection performance for plumes during the model development is likely influenced by the challenges in consistently annotating their boundaries. As vertically rising columns of smoke, plumes often lack distinct edges, making it difficult to determine where bounding boxes should end. This subjectivity introduces annotation variability, resulting in inconsistencies in ground truth labels that reduce the IoU and overall accuracy. In addition to vertical ascent, plumes also exhibit higher columnar density, stronger structural coherence, and more stable upward motion compared to diffuse smoke or horizontally spreading fog. Flames, while smaller and more sensitive to bounding box errors [46], performed better likely due to their bright, well-defined shapes that enable more consistent annotation. These class-specific differences underscore the importance of improved annotation strategies, particularly for diffuse or irregular targets like plumes. On the deployment side, platform performance also plays a key role in real-time detection. The Jetson Nano Orin demonstrated significantly higher frame rates compared to the OAK-D. This highlights the advantages of GPU-enabled hardware for processing complex models at the edge and offers practical insights into the trade-offs between speed, power efficiency, and model complexity in wildfire monitoring applications.
This study did not analyze the full temporal progression of the burning process (flame–plume–smoke transitions) during the controlled pile burns, as the experimental setup was designed solely for data acquisition to support real-time detection rather than fire behaviour characterization. The dataset used in this study [26] was collected during controlled burns and is limited to images captured under daylight conditions. While this provides a consistent and manageable environment for model development, it limits the model’s generalizability. Smoke and plume characteristics such as colour and density can vary significantly based on factors such as fuel type [47], combustion temperature [48], and environmental conditions like wind direction [49]. These variations make detection more challenging in diverse contexts and may reduce the model’s performance in foggy conditions during nighttime or across different forest types. While RGB cameras are cost-effective, their performance is limited at night due to low illumination, reduced contrast, and increased noise. This study did not assess nighttime performance, which remains a key limitation. Future work should consider fusing RGB with thermal imagery, as thermal sensors can detect heat signatures independent of lighting, offering a promising solution for improving nighttime wildfire detection reliability [50]. In addition to this, factors such as the camera’s intrinsic resolution, the fixed gimbal orientation, the flight altitudes used during data acquisition, and the variability in wind conditions impose additional constraints on the dataset. These acquisition parameters may not fully capture the diversity of imaging geometries and environmental conditions encountered in real wildfire settings and therefore represent potential sources of bias that should be recognized as a limitation of this study.
The inclusion of the plume class, despite its relatively lower detection performance, marks a significant step forward in wildfire detection by providing crucial early warning cues, especially in foggy conditions. To further improve wildfire detection accuracy, integrating multi-modal data can be beneficial, with RGB cameras offering a more cost-effective option compared to thermal imagery, although the model’s performance at night remains untested. Moreover, the adoption of KD techniques can make the dataset creation process more scalable by automating and speeding up the annotation of large volumes of images. This not only reduces manual effort but also ensures consistency in labelling. By utilizing smoke, plume, and flame classes, this study also offers a solid framework for addressing fog-related challenges in wildfire detection and enhances the timeliness of response by evaluating deployment speeds across two different edge processing devices.

5. Conclusions

This study presents the following four key contributions towards real-time wildfire detection:
  • Introduced plume as a distinct indicator, alongside the established categories of smoke and flame, to improve wildfire detection under foggy conditions.
  • Demonstrated that few-shot learning reduces manual annotation effort, enhancing scalability across diverse environments by transferring knowledge from large to lightweight AI models.
  • Evaluated and compared the performance of the developed model on two different edge devices for real-time wildfire detection.
  • Presented an annotated database of UAS-collected wildfire imagery, comprising plume, smoke, and flame data across coniferous and deciduous tree types.
The findings indicate that plume detection demonstrates remarkable robustness, achieving a higher F1-score ranging from 76.1% to 85.7%, coupled with a high accuracy between 80.1% and 87.5%. This aligns with emerging trends in RS, where automated labelling not only accelerates the annotation process but also increases the volume of labelled instances, ultimately leading to improved detection performance. Deployment tests revealed that the Jetson Nano Orin exhibited superior real-time performance (21.7–23.3 FPS) compared to the OAK-D on-device decoding (8.8–8.9 FPS), illustrating the advantages of lighter GPU-enabled onboard computers for efficient wildfire monitoring. Future work could involve augmenting the dataset with a wider range of plume and flame examples from UAS imagery captured under varying fog, lighting, and seasonal conditions, which would help reduce appearance variability and improve model generalization. By integrating smoke, plume, and flame classifications with automated labelling processes, this study establishes a robust and scalable framework for addressing challenges associated with fog, facilitating more reliable and timely wildfire detection.

Author Contributions

Conceptualization, P.K.; Data curation, P.K., J.S. and A.U.; Formal analysis, P.K. and S.M.; Funding acquisition, G.H. and F.G.; Investigation, P.K. and S.M.; Methodology, P.K.; Project administration, G.H. and F.G.; Resources, P.K., J.S., A.U. and J.G.; Software, P.K.; Supervision, G.H. and F.G.; Validation, P.K. and A.U.; Visualization, P.K.; Writing—original draft, P.K.; Writing—review and editing, P.K., J.S., S.M., A.U., J.G., G.H. and F.G. All authors have read and agreed to the published version of the manuscript.

Funding

The work was funded by the Australian Research Council grant DP 220103233.

Data Availability Statement

The UAS dataset collected for this study at TAC Resources, Canungra, QLD, Australia, is accessible via QUT Research Data Finder at https://doi.org/10.25912/RDF_1764134706710 (accessed 24 November 2025). The Flame2 dataset utilized in this study is openly available through IEEE DataPort at https://ieee-dataport.org/open-access/flame-2-fire-detection-and-modeling-aerial-multi-spectral-image-dataset (accessed 24 November 2025) [30].

Acknowledgments

We gratefully acknowledge the Australian Research Council (ARC), Queensland University of Technology (QUT) and QUT Centre for Robotics (QCR) for their financial support and provision of laboratory facilities. We also thank the Research Engineering Facility (REF) team at QUT for their expertise and access to essential research infrastructure, as well as QUT’s High-Performance Computing (HPC) facility for computational support. We extend our sincere gratitude to Troy Cuff and Lydia Halim of TAC Resources and the Canungra Emergency Research and Educational Facility (CEREF, https://ceref.org.au) for their logistical support, assistance in site access, permit coordination for UAS operations, and facilitation of the controlled burn experiment. We further acknowledge the continuous expert guidance provided by Troy Cuff on fire and disaster mitigation strategies and community implications throughout this study.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Rothermel, R.C. A Mathematical Model for Predicting Fire Spread in Wildland Fuels; Intermountain Forest & Range Experiment Station, Forest Service, US Department of Agriculture: Ogden, UT, USA, 1972; Volume 115. [Google Scholar]
  2. Reimer, J.; Thompson, D.K.; Povak, N. Measuring Initial Attack Suppression Effectiveness through Burn Probability. Fire 2019, 2, 60. [Google Scholar] [CrossRef]
  3. Fanta-Jende, P.; Steininger, D.; Kern, A.; Widhalm, V.; Apud Baca, J.G.; Hofstätter, M.; Simon, J.; Bruckmüller, F.; Sulzbachner, C. Semantic Real-Time Mapping with UAVs. PFG—J. Photogramm. Remote Sens. Geoinf. Sci. 2023, 91, 157–170. [Google Scholar] [CrossRef]
  4. Keerthinathan, P.; Amarasingam, N.; Hamilton, G.; Gonzalez, F. Exploring Unmanned Aerial Systems Operations in Wildfire Management: Data Types, Processing Algorithms and Navigation. Int. J. Remote Sens. 2023, 44, 5628–5685. [Google Scholar] [CrossRef]
  5. Amarasingam, N.; Ashan Salgadoe, A.S.; Powell, K.; Gonzalez, L.F.; Natarajan, S. A Review of UAV Platforms, Sensors, and Applications for Monitoring of Sugarcane Crops. Remote Sens. Appl. Soc. Environ. 2022, 26, 100712. [Google Scholar] [CrossRef]
  6. Keerthinathan, P.; Amarasingam, N.; Kelly, J.E.; Mandel, N.; Dehaan, R.L.; Zheng, L.; Hamilton, G.; Gonzalez, F. African Lovegrass Segmentation with Artificial Intelligence Using UAS-Based Multispectral and Hyperspectral Imagery. Remote Sens. 2024, 16, 2363. [Google Scholar] [CrossRef]
  7. McGee, J.; Mathew, S.J.; Gonzalez, F. Unmanned Aerial Vehicle and Artificial Intelligence for Thermal Target Detection in Search and Rescue Applications. In Proceedings of the 2020 International Conference on Unmanned Aircraft Systems (ICUAS), Athens, Greece, 1–4 September 2020; pp. 883–891. [Google Scholar]
  8. Keerthinathan, P.; Winsen, M.; Krishnakumar, T.; Ariyanayagam, A.; Hamilton, G.; Gonzalez, F. Modelling LiDAR-Based Vegetation Geometry for Computational Fluid Dynamics Heat Transfer Models. Remote Sens. 2025, 17, 552. [Google Scholar] [CrossRef]
  9. Mienye, I.D.; Swart, T.G. A Comprehensive Review of Deep Learning: Architectures, Recent Advances, and Applications. Information 2024, 15, 755. [Google Scholar] [CrossRef]
  10. Landgraf, S.; Wursthorn, K.; Hillemann, M.; Ulrich, M. Dudes: Deep Uncertainty Distillation Using Ensembles for Semantic Segmentation. PFG—J. Photogramm. Remote Sens. Geoinf. Sci. 2024, 92, 101–114. [Google Scholar] [CrossRef]
  11. Voelsen, M.; Rottensteiner, F.; Heipke, C. Transformer Models for Land Cover Classification with Satellite Image Time Series. PFG—J. Photogramm. Remote Sens. Geoinf. Sci. 2024, 92, 547–568. [Google Scholar] [CrossRef]
  12. Jiao, Z.; Zhang, Y.; Mu, L.; Xin, J.; Jiao, S.; Liu, H.; Liu, D. A YOLOv3-Based Learning Strategy for Real-Time UAV-Based Forest Fire Detection. In Proceedings of the 2020 Chinese Control and Decision Conference (CCDC), Hefei, China, 22–24 August 2020; pp. 4963–4967. [Google Scholar]
  13. Jiao, Z.; Zhang, Y.; Xin, J.; Mu, L.; Yi, Y.; Liu, H.; Liu, D. A Deep Learning Based Forest Fire Detection Approach Using UAV and YOLOv3. In Proceedings of the 2019 1st International Conference on Industrial Artificial Intelligence (IAI), Shenyang, China, 23–27 July 2019; pp. 1–5. [Google Scholar]
  14. Moslemi, A.; Briskina, A.; Dang, Z.; Li, J. A Survey on Knowledge Distillation: Recent Advancements. Mach. Learn. Appl. 2024, 18, 100605. [Google Scholar] [CrossRef]
  15. Wang, S.; Zhao, J.; Ta, N.; Zhao, X.; Xiao, M.; Wei, H. A Real-Time Deep Learning Forest Fire Monitoring Algorithm Based on an Improved Pruned + KD Model. J. Real-Time Image Process. 2021, 18, 2319–2329. [Google Scholar] [CrossRef]
  16. Akhloufi, M.A.; Couturier, A.; Castro, N.A. Unmanned Aerial Vehicles for Wildland Fires: Sensing, Perception, Cooperation and Assistance. Drones 2021, 5, 15. [Google Scholar] [CrossRef]
  17. Lu, K.; Xu, R.; Li, J.; Lv, Y.; Lin, H.; Liu, Y. A Vision-Based Detection and Spatial Localization Scheme for Forest Fire Inspection from UAV. Forests 2022, 13, 383. [Google Scholar] [CrossRef]
  18. Xiong, C.; Yu, A.; Rong, L.; Huang, J.; Wang, B.; Liu, H. Fire Detection System Based on Unmanned Aerial Vehicle. In Proceedings of the 2021 IEEE International Conference on Emergency Science and Information Technology (ICESIT), Chongqing, China, 22–24 November 2021; pp. 302–306. [Google Scholar]
  19. Nyandwi, E.; Gerke, M.; Achanccaray, P. Local Evaluation of Large-scale Remote Sensing Machine Learning-Generated Building and Road Dataset: The Case of Rwanda. PFG—J. Photogramm. Remote Sens. Geoinf. Sci. 2024, 92, 705–722. [Google Scholar] [CrossRef]
  20. Li, N.; Cheng, L.; Wang, L.; Chen, H.; Zhang, Y.; Yao, Y.; Cheng, J.; Li, M. Automatic Labelling Framework for Optical Remote Sensing Object Detection Samples in a Wide Area Using Deep Learning. Expert Syst. Appl. 2024, 255, 124827. [Google Scholar] [CrossRef]
  21. Csurka, G. Domain Adaptation for Visual Applications: A Comprehensive Survey. arXiv 2017, arXiv:1702.05374. [Google Scholar] [CrossRef]
  22. Barmpoutis, P.; Stathaki, T.; Dimitropoulos, K.; Grammalidis, N. Early Fire Detection Based on Aerial 360-Degree Sensors, Deep Convolution Neural Networks and Exploitation of Fire Dynamic Textures. Remote Sens. 2020, 12, 3177. [Google Scholar] [CrossRef]
  23. Hossain, F.A.; Zhang, Y.; Yuan, C.; Su, C.-Y. Wildfire Flame and Smoke Detection Using Static Image Features and Artificial Neural Network. In Proceedings of the 2019 1st International Conference on Industrial Artificial Intelligence (IAI), Shenyang, China, 23–27 July 2019; pp. 1–6. [Google Scholar]
  24. Zhan, J.; Hu, Y.; Cai, W.; Zhou, G.; Li, L. PDAM–STPNNet: A Small Target Detection Approach for Wildland Fire Smoke through Remote Sensing Images. Symmetry 2021, 13, 2260. [Google Scholar] [CrossRef]
  25. Huang, J.; Loría-Salazar, S.M.; Deng, M.; Lee, J.; Holmes, H.A. Assessment of Smoke Plume Height Products Derived from Multisource Satellite Observations Using Lidar-Derived Height Metrics for Wildfires in the Western US. Atmos. Chem. Phys. 2024, 24, 3673–3698. [Google Scholar] [CrossRef]
26. Keerthinathan, P.; Gonzalez, F.; Sandino, J.; Uthayasooriyan, A. Canungra UAS-Data on Control Fire (Version 1.0); Dataset; Queensland University of Technology: Brisbane, Australia, 2024. [Google Scholar] [CrossRef]
27. Liu, S.; Zeng, Z.; Ren, T.; Li, F.; Zhang, H.; Yang, J.; Jiang, Q.; Li, C.; Yang, J.; Su, H. Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection. In Proceedings of the European Conference on Computer Vision, Milan, Italy, 29 September–4 October 2024; pp. 38–55. [Google Scholar]
28. Zhang, H.; Li, F.; Liu, S.; Zhang, L.; Su, H.; Zhu, J.; Ni, L.M.; Shum, H.-Y. DINO: DETR with Improved Denoising Anchor Boxes for End-to-End Object Detection. arXiv 2022, arXiv:2203.03605. [Google Scholar]
  29. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-End Object Detection with Transformers. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 213–229. [Google Scholar]
  30. Hopkins, B.; O’Neill, L.; Afghah, F.; Razi, A.; Rowell, E.; Watts, A.; Fule, P.; Coen, J. Flame 2: Fire Detection and Modeling: Aerial Multi-Spectral Image Dataset. IEEE DataPort 2023. [Google Scholar] [CrossRef]
  31. Nearmap. Nearmap Official Website. Available online: https://www.nearmap.com/au (accessed on 10 November 2025).
  32. Thaslim (Ed.) Fog Detection Dataset; Kaggle: San Francisco, CA, USA, 2024. [Google Scholar]
  33. Tzutalin. LabelImg. Available online: https://github.com/tzutalin/labelImg (accessed on 10 November 2025).
34. Long, Z.; Li, W. Open Grounding DINO: The Third-Party Implementation of the Paper Grounding DINO. Available online: https://github.com/longzw1997/Open-GroundingDino (accessed on 10 November 2025).
35. Wood, L.; Chollet, F. Efficient Graph-Friendly COCO Metric Computation for Train-Time Model Evaluation. arXiv 2022, arXiv:2207.12120. [Google Scholar]
36. Zhao, Y.; Lv, W.; Xu, S.; Wei, J.; Wang, G.; Dang, Q.; Liu, Y.; Chen, J. DETRs Beat YOLOs on Real-Time Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024; pp. 16965–16974. [Google Scholar]
37. Deci-AI. super-gradients: Version 3.0.8; Zenodo: Geneva, Switzerland, 2021. [CrossRef]
38. Hussain, M. YOLOv5, YOLOv8 and YOLOv10: The Go-to Detectors for Real-Time Vision. arXiv 2024, arXiv:2407.02988. [Google Scholar]
39. Khanam, R.; Hussain, M. YOLOv11: An Overview of the Key Architectural Enhancements. arXiv 2024, arXiv:2410.17725. [Google Scholar] [CrossRef]
  40. Padilla, R.; Netto, S.L.; da Silva, E.A.B. A Survey on Performance Metrics for Object-Detection Algorithms. In Proceedings of the 2020 International Conference on Systems, Signals and Image Processing (IWSSIP), Niteroi, Brazil, 1–3 July 2020; pp. 237–242. [Google Scholar]
41. Ultralytics. Model Validation with Ultralytics YOLO. Available online: https://docs.ultralytics.com/modes/val/ (accessed on 26 May 2025).
  42. Vujovic, Z. Classification Model Evaluation Metrics. Int. J. Adv. Comput. Sci. Appl. 2021, 12, 599–606. [Google Scholar] [CrossRef]
43. Yang, J.; Yu, W.; Lv, Y.; Sun, J.; Sun, B.; Liu, M. SAM2-ELNet: Label Enhancement and Automatic Annotation for Remote Sensing Segmentation. arXiv 2025, arXiv:2503.12404. [Google Scholar] [CrossRef]
  44. Alkhatib, A.A.A. A Review on Forest Fire Detection Techniques. Int. J. Distrib. Sens. Netw. 2014, 10, 597368. [Google Scholar] [CrossRef]
45. Hossain, F.M.A.; Zhang, Y.M.; Tonima, M.A. Forest Fire Flame and Smoke Detection from UAV-Captured Images Using Fire-Specific Color Features and Multi-Color Space Local Binary Pattern. J. Unmanned Veh. Syst. 2020, 8, 285–309. [Google Scholar] [CrossRef]
  46. Luo, J.; Liu, Z.; Wang, Y.; Tang, A.; Zuo, H.; Han, P. Efficient Small Object Detection You Only Look Once: A Small Object Detection Algorithm for Aerial Images. Sensors 2024, 24, 7067. [Google Scholar] [CrossRef]
47. Park, G.; Lee, Y. Wildfire Smoke Detection Enhanced by Image Augmentation with StyleGAN2-ADA for YOLOv8 and RT-DETR Models. Fire 2024, 7, 369. [Google Scholar] [CrossRef]
  48. Chunyu, Y.; Jun, F.; Jinjun, W.; Yongming, Z. Video Fire Smoke Detection Using Motion and Color Features. Fire Technol. 2010, 46, 651–663. [Google Scholar] [CrossRef]
  49. Galhardo, A.; Viegas, J.; Coelho, P.J. The Influence of Wind on Smoke Propagation to the Lower Layer in Naturally Ventilated Tunnels. Tunn. Undergr. Space Technol. 2022, 128, 104632. [Google Scholar] [CrossRef]
50. Ghali, R.; Akhloufi, M.A. Deep Learning Approach for Wildland Fire Recognition Using RGB and Thermal Infrared Aerial Image. Fire 2024, 7, 343. [Google Scholar] [CrossRef]
Figure 1. Overview of the study methodology, illustrating the key steps in data acquisition, labelling, few-shot training, lightweight model development, and evaluation on fog images.
Figure 2. Map of the study site located within TAC Resources, Canungra, QLD, Australia. (a) Aerial view of the site; (b,c) ground-level photographs of the site.
Figure 3. Barrels used for controlled pile burns to replicate wildfire conditions at the TAC Resources site in Canungra, QLD, Australia.
Figure 4. The UAS used for data acquisition, showing the key components of the UAS and payload system.
Figure 5. Flight path plan for manual navigation of the UAS during image acquisition of the controlled pile burn at the TAC Resources site in Canungra, QLD, Australia.
Figure 6. Schematic diagram of the knowledge distillation process: the fine-tuned Grounding DINO serves as the teacher network, while the Nano variants of the YOLO family act as the student networks [27].
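The distillation in Figure 6 operates at the label level: the fine-tuned teacher produces pseudo-labels on unlabelled frames, and the YOLO students are then trained on them as ordinary ground truth. The sketch below illustrates that hand-off under stated assumptions; `run_teacher` is a hypothetical wrapper around the fine-tuned Grounding DINO, and the output follows the standard YOLO label format (one `class x_center y_center width height` line per box, normalized to image size).

```python
from pathlib import Path

CLASS_IDS = {"smoke": 0, "plume": 1, "flame": 2}

def to_yolo_line(cls_name, box, img_w, img_h):
    """Convert an absolute (x1, y1, x2, y2) box to a normalized YOLO label line."""
    x1, y1, x2, y2 = box
    xc = (x1 + x2) / 2 / img_w
    yc = (y1 + y2) / 2 / img_h
    w = (x2 - x1) / img_w
    h = (y2 - y1) / img_h
    return f"{CLASS_IDS[cls_name]} {xc:.6f} {yc:.6f} {w:.6f} {h:.6f}"

def pseudo_label(image_paths, run_teacher, out_dir, conf_thresh=0.35):
    """Write one YOLO-format .txt pseudo-label file per unlabelled image.

    run_teacher(path) is assumed to return (boxes, class_names, scores,
    img_w, img_h), with boxes in absolute pixel coordinates.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for path in image_paths:
        boxes, names, scores, img_w, img_h = run_teacher(path)
        lines = [
            to_yolo_line(n, b, img_w, img_h)
            for b, n, s in zip(boxes, names, scores)
            if s >= conf_thresh  # keep only confident teacher detections
        ]
        (out_dir / (Path(path).stem + ".txt")).write_text("\n".join(lines))
```

The confidence threshold of 0.35 is illustrative only; in practice it trades pseudo-label noise against coverage.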
Figure 7. Illustration of the visual conditions used for annotating plume. The red-dotted bounding box represents the full smoke region, while green-dotted bounding boxes indicate vertically rising plumes, a subset of smoke, characterized by narrow, cone-shaped columns with brighter cores and a clear upward structure originating near the canopy or flame source.
Figure 8. Side-by-side comparison of (a) ground truth labels of the unseen images and (b) corresponding predictions from the Few-Shot Learning model (Grounding DINO).
Figure 9. Training and validation loss curves for the YOLO models: YOLOv5n (a.1–a.4), YOLOv8n (b.1–b.4), and YOLOv11n (c.1–c.4). Each subfigure shows: (1) training box loss, (2) training classification loss, (3) validation box loss, and (4) validation classification loss.
Figure 10. (a) Ground truth labels and (b) corresponding predictions generated by the YOLOv5n model for wildfire indicators, presented in matching order for visual comparison.
Figure 11. (a) Ground truth labels and (b) corresponding predictions generated by the YOLOv8n model for wildfire indicators, presented in matching order for visual comparison.
Figure 12. (a) Ground truth labels and (b) corresponding predictions generated by the YOLOv11n model for wildfire indicators, presented in matching order for visual comparison.
Figure 13. Precision-Recall (PR) curves for YOLO training: (a) YOLOv5n, (b) YOLOv8n, and (c) YOLOv11n, illustrating the performance of each lightweight model variant during evaluation.
Figure 14. F1-score for YOLOv5, YOLOv8, and YOLOv11 (Nano variants) on the image set containing wildfire and fog images.
Figure 15. YOLOv11n inference results. (a–f) Inference on fog-only images; (g–l) inference on wildfire images. The dark blue bounding boxes indicate smoke, while the light blue bounding boxes indicate plume.
Table 1. Recent studies of real-time UAS-based wildfire detection. IoU—intersection over union; mAP—mean average precision; mAP@0.5—mean average precision at an IoU threshold of 0.5; mAP@0.75—mean average precision at an IoU threshold of 0.75.

Reference | Image Resolution (Pixels) | Deployment Computer | Detection Target | Model | FPS | Performance
[12] | 416 × 416 | Intel i7-9700K, RTX 2080 8 GB | Flame, Smoke | YOLOv3 | 30 | Average IoU: 73.78%
[15] | 416 × 416 | NVIDIA Jetson Xavier NX | Flame, Smoke | Pruned YOLOv4 backbone with MobileNetV3 and Knowledge Distillation | 26.74 | mAP@0.5: 0.631
[16] | 640 × 480 | DJI MANIFOLD onboard computer | Flame, Smoke | YOLOv3-tiny | 6.2 | mAP@0.5: 79.84%; mAP@0.75: 65.91%; Average IoU: 70.23%
[17] | 1280 × 720 | Raspberry Pi 4B | Flame | NanoDet | 48 | mAP: 57.6
[18] | 416 × 416 | Ultra96-V2 board | Flame, Smoke | YOLOv3 (quantized to 8-bit) | >3 | mAP@0.5: 0.86 (Flame), 0.71 (Smoke)
Table 2. Meteorological observations on the day of data acquisition.

Weather Parameter | Measurement | Nearest Bureau Station
Temperature | 24.5 °C | Canungra (Defence)—1.7 km away
Relative humidity | 71% | Canungra (Defence)—1.7 km away
Wind speed | 4 km/h | Canungra (Defence)—1.7 km away
Wind direction | Southeast | Canungra (Defence)—1.7 km away
Rainfall | 0 mm | Canungra Finch Road—1 km away
Solar exposure | 20.1 MJ m−2 | Canungra Finch Road—1 km away
Table 3. Specifications of the RGB videos from the Flame2 dataset [30].

Dataset File ID | Resolution | Duration (s) | File Size (GB) | Frame Rate (FPS) | Camera | Context
1-Video | 3840 × 2160 | 291 | 3.5 | 30 | Mavic 2 Enterprise | During the fire; no visible flames; likely filmed at a distance from the active fire
2-Video | 3840 × 2160 | 183 | 2.2 | 30 | Mavic 2 Enterprise | As above
3-Video | 1920 × 1080 | 404 | 1.69 | 30 | Mavic 2 Enterprise | During the fire; includes visible flames or closer to the active fire
4-Video | 1920 × 1080 | 301 | 1.26 | 30 | Mavic 2 Enterprise | As above
5-Video | 1920 × 1080 | 267 | 1.12 | 30 | Mavic 2 Enterprise | As above
6-Video | 1920 × 1080 | 185 | 0.8 | 30 | Mavic 2 Enterprise | As above
7-Video | 1920 × 1080 | 239 | 1 | 30 | Mavic 2 Enterprise | As above
17-Preburn | 3840 × 2160 | 291 | 2.66 | 30 | Zenmuse X4S | Before the fire; captures pre-burn environmental conditions
18-Preburn | 3840 × 2160 | 183 | 1.56 | 30 | Zenmuse X4S | As above
Table 4. Number of images and instances for each class for Grounding DINO fine-tuning.

Classes | Images (Train) | Images (Test) | Instances (Train) | Instances (Test)
Smoke | 283 | 43 | 338 | 47
Plume | 143 | 30 | 255 | 50
Flame | 157 | 35 | 316 | 62
Table 5. Few-shot fine-tuning parameters for Grounding DINO.

Parameter | Value | Description
data_aug_scales | (480, 512, 544, 576, 608, 640, 672, 704, 736, 768, 800) | List of scales used for data augmentation
batch_size | 4 | Number of samples processed in each training iteration
backbone | swin_T_224_1k | Pretrained backbone model
enc_layers | 6 | Number of encoder layers
dec_layers | 8 | Number of decoder layers
text_encoder_type | bert | Type of text encoder
lr | 0.0005 | Initial learning rate
lr_backbone | 1.00 × 10−5 | Learning rate for the backbone network
weight_decay | 5.00 × 10−5 | Regularization parameter
epochs | 120 | Number of training epochs
label_list | (‘smoke’, ‘plume’, ‘flame’) | Target classes
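For readers reproducing the fine-tuning, the Table 5 settings map naturally onto the Python configuration files used by the Open-GroundingDino repository [34]. The following is a minimal sketch, not the authors' actual file: the key names mirror Table 5, while any settings not listed there are assumed to remain at repository defaults and should be checked against the repository itself.

```python
# finetune_plume_config.py — a minimal sketch of the Table 5 settings in the
# key = value style of Open-GroundingDino config files [34]. Only the keys
# given in Table 5 appear here; everything else is a repository default.
data_aug_scales = [480, 512, 544, 576, 608, 640, 672, 704, 736, 768, 800]
batch_size = 4
backbone = "swin_T_224_1k"   # pretrained Swin-Tiny backbone
enc_layers = 6
dec_layers = 8
text_encoder_type = "bert"   # BERT text encoder for the class prompts
lr = 5e-4                    # initial learning rate
lr_backbone = 1e-5           # smaller learning rate for the pretrained backbone
weight_decay = 5e-5
epochs = 120
label_list = ["smoke", "plume", "flame"]  # target indicator classes
```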
Table 6. Number of images and instances for each class for lightweight model development.

Classes | Images (Train) | Images (Test) | Instances (Train) | Instances (Test)
Smoke | 513 | 137 | 592 | 152
Plume | 262 | 79 | 436 | 133
Flame | 297 | 80 | 605 | 160
Table 7. COCO evaluation metrics of object detection computed over all classes for Grounding DINO (teacher network) performance assessment.

Metric | Formula | Description
IoU | $\mathrm{IoU} = \frac{\text{Overlap}}{\text{Union}}$ | Intersection over Union between predicted and ground-truth bounding boxes.
$TP_t$, $FP_t$, $FN_t$ | — | $TP_t$: number of correctly predicted indicators; $FP_t$: number of incorrect predictions; $FN_t$: number of missed detections (with IoU ≥ $t$).
Average Precision ($AP_i^t$) | $\frac{TP_t}{TP_t + FP_t}$ | Proportion of correct predictions among all predicted instances of class $i$ with IoU ≥ $t$.
Average Recall ($AR_i^t$) | $\frac{TP_t}{TP_t + FN_t}$ | Proportion of correct predictions among all ground truths of class $i$ with IoU ≥ $t$.
mAP@[0.50:0.95] (maxDets = 100) | $\frac{1}{10N}\sum_{t \in T}\sum_{i=1}^{N} AP_i^{t}$ | Mean average precision across IoU thresholds from 0.50 to 0.95 in steps of 0.05, allowing at most 100 detections per image. Without a confidence threshold, the top 100 detections by confidence score are used.
mAP@0.5 (maxDets = 100) | $\frac{1}{N}\sum_{i=1}^{N} AP_i^{0.5}$ | Mean average precision at IoU = 0.50 (lenient overlap), allowing at most 100 detections per image (the top 100 by confidence score).
mAP@0.75 (maxDets = 100) | $\frac{1}{N}\sum_{i=1}^{N} AP_i^{0.75}$ | Mean average precision at IoU = 0.75 (strict overlap), allowing at most 100 detections per image (the top 100 by confidence score).
mAR@[0.50:0.95] (maxDets = 1) | $\frac{1}{10N}\sum_{t \in T}\sum_{i=1}^{N} AR_i^{t,1}$ | Mean average recall when allowing at most 1 detection per image (the highest-confidence detection).
mAR@[0.50:0.95] (maxDets = 10) | $\frac{1}{10N}\sum_{t \in T}\sum_{i=1}^{N} AR_i^{t,10}$ | Mean average recall when allowing at most 10 detections per image (the top 10 by confidence score).
mAR@[0.50:0.95] (maxDets = 100) | $\frac{1}{10N}\sum_{t \in T}\sum_{i=1}^{N} AR_i^{t,100}$ | Mean average recall when allowing at most 100 detections per image (the top 100 by confidence score).
where $T$ = {0.50, 0.55, …, 0.95} and $N$ = number of classes.
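As a concrete reference for the Table 7 definitions, the sketch below computes IoU and single-class AP at a fixed IoU threshold using all-point interpolation; COCO tooling [35] then averages this over classes and thresholds to produce mAP@[0.50:0.95]. This is an illustrative re-derivation, not the evaluation code used in the study, and for brevity it matches predictions against one pooled set of ground truths rather than per image.

```python
import numpy as np

def iou(box_a, box_b):
    """IoU of two (x1, y1, x2, y2) boxes: overlap area / union area."""
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def average_precision(preds, gts, t=0.5):
    """AP for one class at IoU threshold t.

    preds: list of (score, box); gts: list of ground-truth boxes.
    Greedy matching in descending score order; each ground truth is
    matched at most once, as in the TP_t / FP_t / FN_t definitions.
    """
    preds = sorted(preds, key=lambda p: -p[0])
    matched = set()
    tp = np.zeros(len(preds)); fp = np.zeros(len(preds))
    for k, (_, box) in enumerate(preds):
        best, best_j = 0.0, -1
        for j, gt in enumerate(gts):
            if j not in matched and iou(box, gt) > best:
                best, best_j = iou(box, gt), j
        if best >= t:
            tp[k] = 1; matched.add(best_j)
        else:
            fp[k] = 1
    recall = np.cumsum(tp) / max(len(gts), 1)
    precision = np.cumsum(tp) / (np.cumsum(tp) + np.cumsum(fp))
    # all-point interpolation: area under the precision envelope
    envelope = np.maximum.accumulate(precision[::-1])[::-1]
    ap, prev_r = 0.0, 0.0
    for r, p in zip(recall, envelope):
        ap += (r - prev_r) * p
        prev_r = r
    return ap
```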
Table 8. YOLO evaluation metrics [41] of object detection for YOLO Nano variants (student network) performance assessment.

Metric | Formula | Description
Precision (P) | $\frac{TP}{TP + FP}$ | Proportion of true positives among all positive predictions for a class, using a 0.5 IoU threshold and the confidence threshold that yields the highest F1 score.
Recall (R) | $\frac{TP}{TP + FN}$ | Proportion of true positives among all actual positives for a class, using a 0.5 IoU threshold and the confidence threshold that yields the highest F1 score.
mAP@[0.50:0.95] | $\frac{1}{10N}\sum_{t \in T}\sum_{i=1}^{N} AP_i^{t}$ | Mean average precision averaged over IoU thresholds from 0.50 to 0.95, with the default maxDets of 300.
mAP@0.5 | $\frac{1}{N}\sum_{i=1}^{N} AP_i^{0.5}$ | Mean average precision at a 0.5 IoU threshold, with the default maxDets of 300.
where $T$ = {0.50, 0.55, …, 0.95}, $N$ = number of classes, IoU—Intersection over Union, TP—True Positive, FP—False Positive, FN—False Negative.
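In practice, these student-side metrics come directly from the Ultralytics validation mode [41]. A minimal sketch is shown below, assuming a trained checkpoint and a dataset YAML describing the Table 6 splits; both paths are hypothetical placeholders.

```python
from ultralytics import YOLO

# Hypothetical paths: a distilled student checkpoint and a dataset YAML.
model = YOLO("runs/detect/yolov8n_kd/weights/best.pt")
metrics = model.val(data="wildfire.yaml", imgsz=640)

print(metrics.box.map50)               # mAP@0.5
print(metrics.box.map)                 # mAP@[0.50:0.95]
print(metrics.box.mp, metrics.box.mr)  # mean precision and recall across classes
```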
Table 9. Performance of the fine-tuned Grounding DINO on the test dataset for all three indicator classes.

Metric | Value
mAP@[0.50:0.95] (maxDets = 100) | 0.479
mAP@0.5 (maxDets = 100) | 0.735
mAP@0.75 (maxDets = 100) | 0.440
mAR@[0.50:0.95] (maxDets = 1) | 0.403
mAR@[0.50:0.95] (maxDets = 10) | 0.600
mAR@[0.50:0.95] (maxDets = 100) | 0.675
Table 10. Class-wise summary of annotated images, instances, and average Intersection over Union (IoU).

Classes | Images | Instances | Average IoU
Smoke | 43 | 47 | 0.9473
Plume | 30 | 50 | 0.6196
Flame | 35 | 62 | 0.6129
Table 11. Class-wise detection performance metrics for YOLOv5, YOLOv8, and YOLOv11 (Nano variants). Each metric cell lists the YOLOv5n/YOLOv8n/YOLOv11n values.

Classes | All | Smoke | Plume | Flame
Images | 204 | 137 | 79 | 80
Instances | 445 | 152 | 133 | 160
Precision (P) | 0.705/0.732/0.644 | 0.822/0.852/0.801 | 0.607/0.619/0.515 | 0.686/0.724/0.618
Recall (R) | 0.722/0.643/0.692 | 0.888/0.873/0.862 | 0.617/0.466/0.564 | 0.662/0.591/0.65
mAP@0.5 | 0.721/0.721/0.697 | 0.911/0.93/0.906 | 0.56/0.532/0.524 | 0.692/0.7/0.66
mAP@[0.50:0.95] | 0.448/0.436/0.431 | 0.753/0.749/0.747 | 0.284/0.258/0.256 | 0.308/0.302/0.29
Table 12. Performance metrics of YOLOv5, YOLOv8, and YOLOv11 (Nano variants) models trained with and without Knowledge Distillation (KD), where the suffix ‘-WKD’ denotes the models trained without pseudo-labels.

Metrics | YOLOv5n | YOLOv5n-WKD | YOLOv8n | YOLOv8n-WKD | YOLOv11n | YOLOv11n-WKD
mAP@0.5 | 0.721 | 0.604 | 0.721 | 0.589 | 0.697 | 0.612
mAP@[0.50:0.95] | 0.448 | 0.367 | 0.436 | 0.371 | 0.431 | 0.369
Table 13. Inference speed (frames per second) of YOLO models on OAK-D and Jetson Orin Nano devices.

Model | OAK-D (FPS) | Jetson Orin Nano (FPS)
YOLOv5n | 8.9 | 21.7
YOLOv8n | 8.8 | 21.8
YOLOv11n | 8.9 | 23.3
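The throughput figures in Table 13 depend on device-specific runtimes (the OAK-D executes compiled model blobs through the DepthAI pipeline rather than PyTorch, and Jetson deployments typically use an optimized engine), but the measurement itself reduces to a timed inference loop. A generic sketch follows, assuming an Ultralytics model; the checkpoint path is a placeholder and the dummy frame stands in for camera frames.

```python
import time
import numpy as np
from ultralytics import YOLO

model = YOLO("yolov8n.pt")  # placeholder path to the deployed student model
frame = np.zeros((640, 640, 3), dtype=np.uint8)  # dummy frame for benchmarking

for _ in range(10):  # warm-up so lazy initialization is not timed
    model.predict(frame, verbose=False)

n = 100
start = time.perf_counter()
for _ in range(n):
    model.predict(frame, verbose=False)
fps = n / (time.perf_counter() - start)
print(f"{fps:.1f} FPS")
```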
Table 14. Confusion matrix summary for smoke- and plume-based wildfire detection (YOLO Nano variants).

YOLO Model | Class | Dataset | WD | NWD
5n | Smoke | Wildfire | 49 | 1
5n | Smoke | Fog | 55 | 19
5n | Plume | Wildfire | 31 | 19
5n | Plume | Fog | 1 | 73
8n | Smoke | Wildfire | 50 | 0
8n | Smoke | Fog | 56 | 18
8n | Plume | Wildfire | 25 | 25
8n | Plume | Fog | 2 | 72
11n | Smoke | Wildfire | 50 | 0
11n | Smoke | Fog | 56 | 18
11n | Plume | Wildfire | 33 | 17
11n | Plume | Fog | 1 | 73
WD = images with wildfire detection; NWD = images with no wildfire detection.
Table 15. Performance evaluation of YOLO models on wildfire and fog images.

YOLO Model | Class | P | R | A
5n | Smoke | 0.54 | 0.98 | 0.58
5n | Plume | 0.98 | 0.72 | 0.85
8n | Smoke | 0.54 | 1.00 | 0.58
8n | Plume | 0.95 | 0.63 | 0.80
11n | Smoke | 0.54 | 1.00 | 0.58
11n | Plume | 0.98 | 0.72 | 0.85
P—Precision; R—Recall; A—Accuracy.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
