
Early-Stage Wildfire Detection: A Weakly Supervised Transformer-Based Approach

1 Department of Computer Science and Engineering, University of Nevada, Reno, NV 89557, USA
2 Department of Information Systems, University of Nevada, Reno, NV 89557, USA
3 Department of Computer Science, University of Houston, Houston, TX 77204, USA
* Author to whom correspondence should be addressed.
Fire 2025, 8(11), 413; https://doi.org/10.3390/fire8110413
Submission received: 13 August 2025 / Revised: 28 September 2025 / Accepted: 21 October 2025 / Published: 25 October 2025

Abstract

Smoke detection is a practical approach for the early identification of wildfires and for mitigating hazards that affect ecosystems, infrastructure, property, and communities. Existing deep learning (DL) object detection methods (e.g., the Detection Transformer (DETR)) have demonstrated significant potential for early awareness of such events. However, their precision is limited by the low visual salience of smoke and the reliability of annotations, and collecting reliable real-world datasets with precise annotations is a labor-intensive and time-consuming process. To address this challenge, we propose a weakly supervised Transformer-based approach with a teacher–student architecture designed explicitly for smoke detection while reducing the need for extensive labeling effort. In the proposed approach, an expert model serves as the teacher, guiding the student model to learn from a variety of data annotations, including bounding boxes, point labels, and unlabeled images. This adaptability reduces the dependency on exhaustive manual annotation. The proposed approach integrates a Deformable-DETR backbone with a modified loss function to enhance the detection pipeline by improving spatial reasoning, supporting multi-scale feature learning, and facilitating a deeper understanding of the global context. The experimental results demonstrate performance comparable to, and in some cases exceeding, that of fully supervised models, including DETR and YOLOv8. Moreover, this study expands the existing datasets to offer a more comprehensive resource for the research community.

1. Introduction

Annually, wildfires devastate vast areas of natural resources, leading to significant financial losses for governments, severe environmental damage, and profound social and economic repercussions for affected communities. According to a report published by the National Interagency Fire Center (NIFC), 64,897 wildfires occurred in the United States, burning approximately 8.9 million acres and resulting in an estimated economic loss of USD 3 billion based on a five-year average [1]. These concerning figures underscore the pressing need for efficient wildfire monitoring methods and for research into practical tools and innovative strategies for analyzing wildfire dynamics and controlling fires in their early stages [2].
Early identification of smoke from wildfires is crucial for mitigating their severe ecological and economic consequences [3]. Computer vision has evolved into a powerful instrument for this purpose, providing real-time surveillance capabilities [4]. While various platforms such as satellites [5,6,7] and drones [8,9,10,11] are used, ground-based cameras have become one of the most reliable and cost-effective methods for surveillance [12,13]. By installing fixed cameras in high-risk areas, this approach enables continuous monitoring with high-resolution imagery. Their ability to withstand extreme weather conditions and deliver real-time observations makes them a stable and efficient solution for early warning systems [14,15].
In addressing the need for enhanced wildfire monitoring, computer vision and deep learning have emerged as powerful tools for modern surveillance initiatives. These technologies can analyze vast streams of visual data, enabling rapid and early detection of smoke plumes. However, this task remains inherently challenging due to the unique visual characteristics of smoke. Its amorphous shape, indistinct edges, and resemblance to benign phenomena such as clouds or fog make it a difficult target for automated systems, resulting in a heightened risk of false alarms. Overcoming these visual complexities has become the primary objective of specialized deep learning architectures. Among the range of proposed architectures, the initial focus was primarily on convolutional neural networks (CNNs), which continue to be refined and utilized. The prevailing paradigm in smoke detection has been fully supervised learning with CNN-based models, particularly the You Only Look Once (YOLO) family, which are considered key benchmarks due to their effective balance of speed and accuracy [16]. Recent research has concentrated on enhancing YOLO by incorporating attention modules, advanced feature fusion networks such as BiFPN, and innovative loss functions to better accommodate the irregular shapes and scales of smoke [17,18,19].
While CNNs have proven effective, they possess inherent limitations in modeling global context, which is essential for accurately distinguishing diffuse smoke plumes [20,21]. This limitation has led researchers to investigate more advanced and powerful architectures. One significant outcome has been the adaptation of Transformer models, initially designed for Natural Language Processing (NLP), to computer vision. Vision Transformers (ViTs) demonstrated that pure Transformer designs can attain superior results in classification tasks [22], prompting their use in object detection. The Detection Transformer (DETR) [23] marked a substantial shift by reframing object detection as a set prediction problem and has been applied to smoke detection in studies such as the Nevada Smoke Detection (Nemo) work [24]. In DETR, a convolutional neural network (CNN) backbone captures feature representations, which are then passed through a Transformer-based encoder–decoder architecture to generate predictions for object classes and bounding boxes. To improve efficiency, Deformable-DETR incorporates a deformable attention mechanism that concentrates on a limited number of crucial sampling points, which greatly lowers computational cost and speeds up convergence [25,26]. This shift has also led to hybrid models that substitute YOLO’s backbone with ViTs or Swin Transformers, thereby improving feature extraction [27,28,29]. However, despite these architectural advances, such models remain heavily dependent on large-scale datasets for supervised training.

1.1. Annotation Bottleneck

A key bottleneck of supervised learning approaches is their dependence on large, meticulously annotated datasets and their tendency toward slow training convergence [30,31]. The considerable cost and effort involved in manual annotations, specifically drawing precise bounding boxes, pose a significant barrier to the development of robust models, particularly given the subjective nature of labeling amorphous smoke plumes. The challenge is exacerbated by the lack of precisely labeled data, in stark contrast to the abundant availability of extensive collections of unlabeled recordings. This imbalance has motivated the community to explore approaches that can effectively leverage unlabeled data, and weakly supervised learning (WSL) has consequently emerged as a crucial paradigm not only for smoke detection but also for the broader computer vision community.
Weakly supervised object detection (WSOD) approaches are designed to minimize the need for detailed annotations by relying on coarse labeling instead. The seminal work on Multiple Instance Learning (MIL) [32] laid the foundation for learning from bags of instances rather than individually labeled examples. This concept has been extensively applied in computer vision, where image-level labels are used to train object detectors without bounding-box annotations [33,34]. Class Activation Maps (CAMs) [35] have become a cornerstone technique in WSL, highlighting areas the network deems essential for its classifications. Building on this, Grad-CAM [36] and its variants provide more precise localization by leveraging gradient information. These techniques have been successfully applied to smoke detection, where Xie et al. [37] and Zhao et al. [38] introduced multi-stage pipelines using CAMs to create coarse pseudo-labels that are refined for training detectors. Domain-specific adaptations, such as the collaborative framework by Pan et al. [39], further improve ground-based smoke detection.
Semi-supervised learning (SSL) leverages a limited amount of labeled samples together with a larger pool of unlabeled data to improve model performance. Consistency regularization ensures that models produce stable predictions under perturbations. Temporal Ensembling [40] and Mean Teacher [41] established the teacher–student paradigm, where a teacher (an exponential moving average of the student) provides stable targets. MixMatch [42] unified consistency regularization with entropy minimization, while FixMatch [43] simplified the approach. These ideas were adapted in the Smoke-Aware Consistency (SAC) framework [44], which penalizes inconsistencies in predictions across augmented images. Pseudo-labeling, also known as self-training, uses model predictions on unlabeled data as training targets. Noisy Student [45] demonstrated impressive results by iteratively training larger student models on pseudo-labeled data. STAC [46] combined pseudo-labeling with consistency regularization for object detection, while Soft Teacher [47] introduced soft pseudo-labels and box jittering for greater robustness. These methods show that combining limited annotations with unlabeled data can yield competitive results [44,48,49,50]. Nevertheless, despite the promising progress in computer vision and the demonstrated effectiveness of SSL, such techniques have not yet been widely employed in the smoke detection domain, where the current approaches still predominantly rely on fully supervised learning.
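To make the teacher–student mechanic concrete, the sketch below shows a Mean Teacher-style exponential moving average (EMA) update in PyTorch, in which the teacher's weights track a moving average of the student's. This is a generic sketch of the paradigm discussed above rather than the implementation of any specific cited work, and the decay value is illustrative.

```python
import torch

@torch.no_grad()
def ema_update(teacher: torch.nn.Module, student: torch.nn.Module, decay: float = 0.999):
    """Mean Teacher-style update: each teacher parameter becomes an
    exponential moving average of the corresponding student parameter."""
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        t_param.mul_(decay).add_(s_param, alpha=1.0 - decay)
```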
Recent research has sought to integrate various supervision paradigms. For instance, Omni-DETR [51] combines fully annotated, weakly annotated, and unlabeled data using a co-learning strategy. Unbiased Teacher [52] addresses pseudo-labeling bias through focal loss and mutual learning, whereas Dense Teacher [53] adapts SSL for dense prediction tasks. Nonetheless, these approaches frequently face challenges such as instability and error propagation when relying on co-learning or teacher–student evolution.

1.2. Contributions

Expanding on prior progress, this work proposes Omni-Nemo, a new teacher–student framework aimed at addressing the persistent annotation challenge in wildfire smoke detection. Unlike Omni-DETR, which relies on co-learning among peer student models that may lead to instability and error propagation, Omni-Nemo employs expert-guided transfer learning. In our framework, we employ a fixed, validated expert teacher—the Nemo smoke detection model, known for its strong performance in this domain—to consistently generate reliable, high-quality pseudo-labels during training. In addition to addressing the annotation problem by accommodating different forms of supervision, such as bounding boxes, points, and unlabeled data, this approach also mitigates the challenges associated with concurrent learning, resulting in more stable and efficient guidance during student model training.
The key contribution of Omni-Nemo is its expert-driven design, which enables the unification of fully labeled, weakly labeled, and unlabeled data within a single coherent training pipeline. By decoupling the teacher from the student and employing a fixed domain-specialized model for supervision, our framework guarantees reliable knowledge transfer. This capability is further enhanced by targeted technical contributions, including a three-component loss function and a cosine learning rate scheduler, which together promote efficient convergence and robust generalization.
This study makes three primary contributions toward advancing real-world wildfire smoke detection:
  • A Flexible Weakly Supervised Framework: Omni-Nemo is designed to utilize a variety of annotation types during the model training process, including noisy bounding boxes, point labels, and unlabeled images. This flexibility effectively addresses the high costs associated with annotation and the limited availability of accurately labeled data, which significantly impacts model performance and training quality. As a result, the framework is well-suited for large-scale deployment.
  • A Stable Training Paradigm using a Static Expert Teacher: In contrast to co-learning models that utilize evolving supervision, where the teacher model is trained alongside the student, Omni-Nemo employs a fixed pre-trained expert teacher to guide the student. This method guarantees consistent supervisory signals, reduces drift and error accumulation, and facilitates efficient knowledge transfer without the need to retrain a model to act as the teacher. It also enables the integration of other trained and valuable models to enhance the training process using smaller and new datasets.
  • An Optimized Loss Function and Learning Schedule for Accelerated Distillation: The framework features a three-component loss function that integrates supervised learning with consistency regularization and entropy minimization. These components encourage confident predictions and alignment with the teacher’s outputs, while the cosine learning rate scheduler facilitates smooth convergence.
The structure of this paper is outlined as follows. Section 2 introduces the proposed methodology, describing the teacher–student framework, the modified loss function, and the overall training pipeline. Section 3 provides the implementation details, including the dataset description, hardware configuration, and hyperparameter optimization. Section 4 presents the experimental outcomes, featuring comparisons with established benchmarks, an analysis of early-detection capabilities, and qualitative case studies. Section 5 delivers an in-depth discussion of the findings and then examines the study’s limitations and future research opportunities. The paper concludes in Section 6, which summarizes the contributions and highlights the key insights.

2. Materials and Methods

We introduce Omni-Nemo, a weakly supervised framework for smoke detection that leverages a variety of data annotations during the training process rather than depending solely on precise bounding boxes. In this approach, we have modified the teacher–student architecture of Omni-DETR [51] by replacing the standard co-learning teacher with a fixed pre-trained expert model from the Nemo study [24]. This design aims to reduce the significant costs and labor associated with obtaining accurate annotations, enabling the model to extract robust features from datasets that may contain point labels, noisy bounding boxes, or even lack annotations altogether while minimizing the reliance on supervisory input.
Unlike the co-learning approach, where the evolving predictions of the teacher model can introduce noise, our fixed expert provides a stable supervisory signal. This design thereby mitigates the significant risks of confirmation bias and error propagation that typically degrade performance in SSL paradigms. By ensuring the student model is guided by reliable pseudo-labels, the framework facilitates more effective knowledge distillation.
To further refine this framework, we introduced two technical adjustments to optimize knowledge transfer. First, we modified the loss function to accelerate the rate of the student model’s alignment regarding detection behavior with the expert model on the unlabeled dataset, thereby facilitating more efficient learning. Second, to enhance training stability and ensure faster convergence, we implemented a cosine annealing learning rate scheduler. This strategy systematically adjusts the learning rate, directing the optimization process toward a more effective minimum of the loss function.
Figure 1 illustrates the overall training pipeline, which begins with a supervised burn-in stage. In this phase, the student network is initialized through pre-training on a fully annotated dataset, ensuring a stable foundation. The process then advances to a semi-supervised stage, where the fixed teacher model generates reliable pseudo-labels for unlabeled data. Guided by a modified loss function (explained in Section 2.2), the student is iteratively adjusted to align with the teacher’s outputs, thereby accelerating the transfer of knowledge. By capitalizing on large volumes of unlabeled samples, the framework enables the student to acquire strong generalization capabilities. The following sections describe each component of this approach in greater detail.

2.1. Teacher–Student Architecture

Our Omni-Nemo framework adopts a teacher–student architecture inspired by the co-learning paradigm introduced in Omni-DETR, where both teacher and student models are trained together to strengthen supervision. Although co-learning provides some flexibility, it often suffers from instability, drift, and error propagation, especially when pseudo-labels are produced by models that are still being trained. To address these issues, we redesigned the approach, replacing the simultaneous training of teacher and student with a fixed expert teacher. The term “expert teacher” denotes a high-capacity model that has been comprehensively pre-trained on labeled data and is kept frozen throughout training, operating solely in inference mode. Its role is to produce dependable pseudo-labels for unlabeled samples, thereby offering consistent and effective supervision to the student network.
In this study, the Nemo model, which has been validated on a real-world wildfire dataset, is employed as a static expert teacher to generate reliable pseudo-labels for unlabeled and weakly labeled images. These pseudo-labels provide supervisory signals for the student network, enabling it to learn from the teacher’s outputs without requiring joint training. This decoupled design enables the student to acquire domain-specific knowledge while avoiding the instability and noise that often arise in co-evolving systems. Therefore, the framework not only leverages unlabeled data more effectively but also establishes a solid foundation for advancing wildfire smoke detection with trained models.

2.1.1. Teacher Model

We selected the Nemo benchmark model, a DETR-based detector, as the static expert teacher within our framework. This decision was guided by Nemo’s established reliability and prior validation in wildfire smoke detection rather than any architectural advantage. Explicitly designed for early-stage smoke detection, Nemo offers a trusted source of pseudo-labels, enabling supervision in weakly labeled or unlabeled settings where accurate annotations are both scarce and costly. By anchoring supervision to this validated expert model, our framework eliminates the variability and computational cost associated with labeling the data while maintaining consistent guidance across the learning process. This design enables the student network to concentrate on architectural and optimization enhancements, which improve training stability, reduce error accumulation, and facilitate scalable deployment under real-world conditions. Moreover, it establishes a foundation for integrating future expert models, positioning our approach as a practical path toward advancing label-efficient wildfire detection.
Figure 2 illustrates the DETR framework, which is built around three key modules: a convolutional backbone, a Transformer encoder–decoder, and a feed-forward network (FFN). The process begins by passing the input image through the CNN backbone to extract a low-resolution feature representation. This representation is flattened, enriched with positional encodings to retain spatial structure, and then delivered to the Transformer encoder. By applying multiple layers of Multi-Head Self-Attention, the encoder captures global dependencies across the image and models the interactions among different regions effectively.
Subsequently, the decoder takes a fixed number of learned embeddings, referred to as object queries, and processes them in conjunction with the output of the encoder. Each layer of the decoder employs self-attention among the queries to avoid duplicate predictions while also utilizing cross-attention with the image features to localize objects. As the queries advance through the decoder, they are refined to yield the final predictions, each of which consists of a class label and a bounding box. This end-to-end design enables each object query to specialize in detecting specific object characteristics within the overarching context of the image.
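To make this pipeline concrete, the following minimal PyTorch sketch mirrors the flow described above: backbone features, flattening, an encoder–decoder with learned object queries, and prediction heads. Dimensions follow the configuration in Section 3.3; positional encodings are omitted for brevity, and this is an illustrative sketch rather than the Nemo implementation.

```python
import torch
import torch.nn as nn
import torchvision

class MiniDETR(nn.Module):
    """Minimal DETR-style detector sketch (not the Nemo codebase)."""
    def __init__(self, num_queries=10, d_model=256):
        super().__init__()
        resnet = torchvision.models.resnet50(weights="IMAGENET1K_V1")
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])   # C5 feature map
        self.input_proj = nn.Conv2d(2048, d_model, kernel_size=1)      # channel reduction
        self.transformer = nn.Transformer(d_model, nhead=8,
                                          num_encoder_layers=6, num_decoder_layers=6,
                                          dim_feedforward=2048)
        self.query_embed = nn.Embedding(num_queries, d_model)          # learned object queries
        self.class_head = nn.Linear(d_model, 2)                        # "Smoke" vs. "No-Smoke"
        self.bbox_head = nn.Linear(d_model, 4)                         # (cx, cy, w, h)

    def forward(self, images):                                         # images: (B, 3, H, W)
        feat = self.input_proj(self.backbone(images))                  # (B, d_model, h, w)
        src = feat.flatten(2).permute(2, 0, 1)                         # (h*w, B, d_model)
        # Positional encodings are added to `src` in full DETR; omitted here for brevity.
        B = images.shape[0]
        tgt = self.query_embed.weight.unsqueeze(1).repeat(1, B, 1)     # (Nq, B, d_model)
        hs = self.transformer(src, tgt)                                # decoded query embeddings
        return self.class_head(hs), self.bbox_head(hs).sigmoid()       # class logits, boxes
```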

2.1.2. Student Model

We adopt Deformable-DETR as the architectural backbone of our student model, maintaining continuity with the Omni-DETR framework while addressing the unique challenges of smoke detection. This choice is motivated by Deformable-DETR’s proven effectiveness in capturing the low-contrast and irregular visual patterns characteristic of early-stage wildfire smoke. As illustrated in Figure 3, the core contribution lies in the deformable attention mechanism, which substitutes uniform attention across the entire feature map with a sparse collection of adaptively learned sampling locations anchored to each reference point. By concentrating on the most relevant regions, the model flexibly captures features of diverse shapes and scales, making it particularly well-suited for amorphous and non-rigid smoke boundaries. Within the decoder, this mechanism enables object queries to adaptively attend to multi-scale features, thereby improving both localization precision and classification accuracy.
The model also incorporates multi-scale feature extraction from the CNN backbone in a design similar to a Feature Pyramid Network. This enhances detection across spatial resolutions, allowing the system to identify both small distant smoke columns and large well-developed plumes. These architectural strengths not only elevate detection accuracy but also improve computational efficiency and accelerate convergence. Retaining Deformable-DETR as the student model ensures compatibility with weakly supervised training while focusing innovation on the learning strategy itself. This design facilitates effective knowledge transfer from the expert teacher and advances our broader goal of scalable label-efficient smoke detection across real-world camera networks. Moreover, it provides a solid foundation for future extensions, including domain adaptation and integration with alternative expert models.
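The simplified sketch below illustrates the core idea of deformable attention in a single-scale, single-head form: each query predicts a handful of sampling offsets around its normalized reference point and aggregates the bilinearly sampled features with learned weights. The multi-scale, multi-head machinery of the actual Deformable-DETR is omitted, so this should be read as a conceptual sketch rather than the reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleDeformableAttention(nn.Module):
    """Single-scale, single-head sketch of deformable attention: each query
    attends to a few learned sampling points instead of the whole feature map."""
    def __init__(self, d_model=256, n_points=4):
        super().__init__()
        self.n_points = n_points
        self.offset_proj = nn.Linear(d_model, n_points * 2)   # (dx, dy) per sampling point
        self.weight_proj = nn.Linear(d_model, n_points)       # one attention weight per point
        self.value_proj = nn.Conv2d(d_model, d_model, kernel_size=1)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, queries, ref_points, feat):
        # queries: (B, Nq, C); ref_points: (B, Nq, 2) as (x, y) in [0, 1]; feat: (B, C, H, W)
        B, Nq, _ = queries.shape
        value = self.value_proj(feat)                                   # (B, C, H, W)
        wh = torch.tensor([feat.shape[-1], feat.shape[-2]],
                          dtype=queries.dtype, device=queries.device)
        offsets = self.offset_proj(queries).view(B, Nq, self.n_points, 2)
        weights = self.weight_proj(queries).softmax(-1)                 # (B, Nq, P)

        loc = ref_points.unsqueeze(2) + offsets / wh                    # normalized locations
        grid = 2.0 * loc - 1.0                                          # to [-1, 1] for grid_sample
        sampled = F.grid_sample(value, grid, align_corners=False)       # (B, C, Nq, P)

        out = (sampled * weights.unsqueeze(1)).sum(-1)                  # weighted sum over points
        return self.out_proj(out.transpose(1, 2))                       # (B, Nq, C)
```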

2.2. Loss Function Design

The transition from a co-learning model to a fixed expert teacher requires a redefinition of the training objective. Our methodology utilizes a unified loss function specifically designed for robust and efficient knowledge transfer from our expert teacher. In this design, we tackle the shortcomings by first eliminating the ineffective knowledge transfer associated with hard pseudo-labels, which treat confident teacher predictions and uncertain guesses as equally valid. Instead, we employ a consistency loss that enables the student to learn the correct class, the relationships between classes, and the uncertainties in the teacher model’s predictions.
Next, we present an acceleration mechanism that was not present in the original framework. The entropy minimization loss functions as a regularizer, penalizing uncertain outputs on unlabeled data. This approach encourages the student model to create well-defined decision boundaries, thereby accelerating convergence toward expert-level performance and enhancing overall generalization.
The overall training objective of the student model, denoted $\mathcal{L}_{\text{total}}$, is defined as a weighted sum of three components: the supervised loss ($\mathcal{L}_{\text{sup}}$) applied to labeled samples, the consistency loss ($\mathcal{L}_{\text{cons}}$) applied to unlabeled samples, and the entropy minimization loss ($\mathcal{L}_{\text{ent}}$), which drives the model toward producing confident predictions. The formulation of the total loss is expressed as

$$\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{sup}} + \lambda_{\text{cons}}\,\mathcal{L}_{\text{cons}} + \lambda_{\text{ent}}\,\mathcal{L}_{\text{ent}}.$$

Here, $\lambda_{\text{cons}}$ and $\lambda_{\text{ent}}$ are scalar hyperparameters that balance the influence of the semi-supervised terms. The methodology for selecting these values and an analysis of their impact are discussed in Section 3.6. The following sections detail each component of this composite objective.

2.2.1. Supervised Loss ($\mathcal{L}_{\text{sup}}$)

In line with the standard DETR framework, the Hungarian algorithm [51] is employed to determine the optimal bipartite matching between the $N$ model predictions and the $M$ ground-truth objects. Once the matching is established, the supervised loss is defined as the sum of a classification component and a regression component. The classification loss, $\mathcal{L}_{\text{cls}}^{\text{sup}}$, is formulated as a negative log-likelihood objective that penalizes misclassified object categories as well as incorrect No-Smoke assignments for unmatched predictions:

$$\mathcal{L}_{\text{cls}}^{\text{sup}} = -\sum_{j=1}^{M} \log \hat{p}_{\sigma(j),\, c_j} \;-\; \sum_{i \in U} \log \hat{p}_{i,\, \varnothing}.$$

Here, $\hat{p}_{i,c}$ denotes the probability assigned to class $c$ by the $i$-th query, $\sigma(j)$ indicates the index of the prediction paired with the $j$-th ground-truth object of class $c_j$, $U$ refers to the set of unmatched predictions, and $\varnothing$ corresponds to the background (No-Smoke) category. The regression loss, $\mathcal{L}_{\text{reg}}^{\text{sup}}$, combines both $L_1$ and GIoU losses to ensure accurate bounding-box localization:

$$\mathcal{L}_{\text{reg}}^{\text{sup}} = \sum_{j=1}^{M} \left\| \hat{b}_{\sigma(j)} - b_j \right\|_1 + \mathcal{L}_{\text{GIoU}}\big(\hat{b}_{\sigma(j)},\, b_j\big),$$

where $b_j$ is the ground-truth bounding box and $\hat{b}_{\sigma(j)}$ is the bounding box of the corresponding matched prediction. The total supervised loss $\mathcal{L}_{\text{sup}}$ is the sum of these two components.
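As a point of reference, bipartite matching of this kind is commonly computed with `scipy.optimize.linear_sum_assignment`; the sketch below shows this step in isolation, with the construction of the matching cost (from class probabilities and box losses) left out for brevity.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def hungarian_match(cost_matrix: np.ndarray):
    """Optimal one-to-one assignment between predictions (rows) and
    ground-truth objects (columns), minimizing the total matching cost."""
    pred_idx, gt_idx = linear_sum_assignment(cost_matrix)
    return pred_idx, gt_idx

# Example: 3 predictions vs. 2 ground-truth objects.
cost = np.array([[0.9, 0.1],
                 [0.4, 0.8],
                 [0.2, 0.7]])
print(hungarian_match(cost))  # pairs each ground truth with its cheapest prediction
```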

2.2.2. Unsupervised Consistency Loss ($\mathcal{L}_{\text{cons}}$)

For samples without labels, consistency is enforced between the teacher and student models to promote effective knowledge transfer. The consistency loss, $\mathcal{L}_{\text{cons}}$, encourages the student’s predictions to match the teacher’s pseudo-labels by applying the Mean Squared Error (MSE) to the predicted bounding-box coordinates:

$$\mathcal{L}_{\text{cons}} = \frac{1}{N} \sum_{i=1}^{N} \left\| \hat{b}_i^{\,s} - \hat{b}_i^{\,t} \right\|_2^2.$$

Here, $\hat{b}_i^{\,s}$ and $\hat{b}_i^{\,t}$ are the bounding boxes from the student and teacher models, respectively. The loss is calculated over $N$ matched pairs, each containing a student prediction and the corresponding teacher pseudo-label for the same object [52].

2.2.3. Entropy Minimization ($\mathcal{L}_{\text{ent}}$)

Finally, we employ entropy minimization as a regularization technique on the student model’s predictions for unlabeled data. This method is guided by the principle of low-density separation, encouraging the model to make high-confidence (low-entropy) predictions [54]. The loss is the average Shannon entropy over the $N$ class distributions predicted by the student:

$$\mathcal{L}_{\text{ent}} = \frac{1}{N} \sum_{i=1}^{N} H\big(\hat{p}_i^{\,s}\big) = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C} \hat{p}_{i,c}^{\,s} \log \hat{p}_{i,c}^{\,s}.$$

In this formulation, $\hat{p}_{i,c}^{\,s}$ represents the probability that the student’s $i$-th prediction assigns to class $c$, while $C$ denotes the total number of categories. Minimizing this loss discourages uncertain outputs and drives the student network to make more confident classifications for objects present in unlabeled data.
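A compact PyTorch sketch of the two semi-supervised terms and their combination is shown below. The weighting values correspond to the configuration selected in Section 3.6, and the supervised term is passed in precomputed, since it follows the standard DETR matching losses above.

```python
import torch
import torch.nn.functional as F

def consistency_loss(student_boxes: torch.Tensor, teacher_boxes: torch.Tensor) -> torch.Tensor:
    """MSE between matched student and teacher bounding boxes (L_cons)."""
    return F.mse_loss(student_boxes, teacher_boxes)

def entropy_loss(student_logits: torch.Tensor) -> torch.Tensor:
    """Average Shannon entropy of the student's class distributions (L_ent)."""
    p = student_logits.softmax(dim=-1)
    return -(p * p.clamp_min(1e-8).log()).sum(dim=-1).mean()

def total_loss(l_sup, student_boxes, teacher_boxes, student_logits,
               lam_cons=100.0, lam_ent=1000.0):
    """L_total = L_sup + lam_cons * L_cons + lam_ent * L_ent (Section 3.6 weights)."""
    return (l_sup
            + lam_cons * consistency_loss(student_boxes, teacher_boxes)
            + lam_ent * entropy_loss(student_logits))
```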

2.3. Two-Stage Training Strategy

As depicted in Figure 4, our approach builds upon the two-stage training scheme introduced in the Omni-DETR framework [51]. While the initial supervised “burn-in” phase is retained from the original design, we introduce key modifications to the subsequent semi-supervised phase, specifically in the teacher model configuration and loss computation.
In the first phase, the student network is adapted to the smoke detection domain using only the labeled dataset. During this supervised stage, model parameters are optimized using a standard supervised loss, providing a strong initialization for the subsequent semi-supervised process. Once the burn-in stage is complete, the framework transitions to the semi-supervised phase. This phase employs a teacher–student setup to incorporate a varied dataset, which may be entirely unlabeled or contain weak annotations, such as point labels or noisy bounding boxes, depending on the experiment.
A key distinction from the original Omni-DETR approach is the use of a fixed pre-trained expert teacher (the Nemo model), whose weights remain unchanged throughout training. Another modification involves the loss calculation during this phase, as discussed further in Section 2.2.
The training procedure for each unlabeled sample proceeds as follows:
  • The frozen teacher network produces predictions that serve as pseudo-labels.
  • The student network processes the same input and generates its own predictions.
  • Consistency loss ($\mathcal{L}_{\text{cons}}$) and entropy loss ($\mathcal{L}_{\text{ent}}$) are computed by comparing the two outputs.
  • Gradients are applied only to update the student network parameters.
This design allows the student model to benefit from stable guidance provided by a validated expert, enabling effective use of unlabeled data. It supports improved detection accuracy and robustness while reducing reliance on manual annotation.
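The per-sample procedure above can be summarized in a short PyTorch-style sketch. The `match_predictions` helper is hypothetical (it stands in for matching student outputs to teacher pseudo-labels), `entropy_loss` is the function sketched in Section 2.2, and the gradient-clipping norm follows Section 3.2.

```python
import torch
import torch.nn.functional as F

def semi_supervised_step(teacher, student, optimizer, images,
                         lam_cons=100.0, lam_ent=1000.0):
    """One student update on an unlabeled batch with a frozen expert teacher."""
    teacher.eval()
    with torch.no_grad():                  # teacher runs strictly in inference mode
        pseudo = teacher(images)           # pseudo-labels: boxes and class scores

    logits, boxes = student(images)        # student predictions on the same input
    s_boxes, t_boxes = match_predictions(boxes, pseudo)   # hypothetical matcher

    loss = (lam_cons * F.mse_loss(s_boxes, t_boxes)       # consistency with teacher
            + lam_ent * entropy_loss(logits))             # confidence regularizer

    optimizer.zero_grad()
    loss.backward()                        # gradients update the student only
    torch.nn.utils.clip_grad_norm_(student.parameters(), max_norm=0.1)
    optimizer.step()
    return loss.item()
```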

3. Implementation Details

3.1. Dataset

This study utilizes datasets obtained from two publicly available wildfire camera networks, namely AlertWildfire [55] and HPWREN (High Performance Wireless Research and Education Network) [56]. These networks capture real-world wildfire events through fixed surveillance cameras deployed across fire-prone regions in the United States. The footage features actual fire incidents under uncontrolled environmental conditions, including variable lighting, terrain complexity, and atmospheric interference, making the data highly representative of operational smoke detection challenges.
Our training pipeline for the student model consists of two clearly separated stages designed to prevent data leakage and support robust generalization:
  • Stage 1 (Burn-In Training): In the first phase, the student model is trained on the Nemo training set, which consists of 2400 fully labeled wildfire images collected before 2020 from the AlertWildfire network. This dataset, identical to the one used for training the original Nemo teacher model [24], provides full supervision and establishes the foundation for initializing the student.
  • Stage 2 (Semi-Supervised Training): The student is further trained on a distinct dataset of approximately 1300 images collected from HPWREN video archives [57], spanning wildfire events between 2020 and 2023. These images were not included in the teacher model’s training and are temporally and contextually disjoint from the original Nemo splits. This separation ensures that the teacher does not generate pseudo-labels for data it has previously seen, thereby eliminating the risk of circular supervision or domain overfitting. The dataset includes full ground-truth annotations created following the manual labeling methodology described in the Nemo study [24], in which experts review terrestrial camera footage to identify and annotate early-stage smoke signals under operational conditions. This process accounts for visual ambiguity, environmental variability, and the diffuse appearance of smoke plumes. To enable controlled experimentation across different levels of supervision, we simulate weak annotation regimes by programmatically generating alternative formats, including noisy bounding boxes, point-level labels, and unlabeled subsets for various training scenarios. This procedure directly follows the dataset variation protocol proposed by Omni-DETR [51], which provides a systematic framework for evaluating model robustness under imperfect labeling.
For each semi-supervised experiment, 20% of the images retain their full annotations, while the remaining 80% are processed under one of the following schemes:
  • No Annotation: 80% of the dataset lacks any form of annotation.
  • Noisy BBox: 80% of the samples are annotated with algorithmically generated bounding boxes that may contain noise.
  • Point: 80% of the data is weakly annotated using a single point indicating the presence of smoke.
  • Point and Noisy BBox: 40% of the data is annotated with point-level labels and 40% with noisy bounding boxes.
Table 1 provides a comprehensive overview of all datasets utilized in the training and evaluation pipeline.
Although these annotations are simulated, it is important to note that the underlying images are not synthetic. All samples were manually curated and labeled by human experts based on real wildfire footage. The annotation process involved extensive review of multi-year HPWREN archives to identify early-stage smoke signals—often faint, ambiguous, and challenging to detect. Given the inherent subjectivity and complexity of labeling such phenomena, even expert annotations may vary in precision. In this context, real-world manual labels can themselves be considered weak, and simulation of noisy annotations offers a principled way to model uncertainty and evaluate robustness.
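As an illustration of how such weak regimes can be simulated from full boxes, the sketch below jitters ground-truth boxes to produce noisy annotations and collapses boxes to center points. The jitter magnitude is an assumption for illustration; the exact perturbation protocol follows Omni-DETR [51].

```python
import random

def to_noisy_bbox(box, jitter=0.1):
    """Perturb a ground-truth box (x, y, w, h) to simulate a noisy annotation;
    the 10% jitter level is illustrative, not the Omni-DETR setting."""
    x, y, w, h = box
    return (x + random.uniform(-jitter, jitter) * w,
            y + random.uniform(-jitter, jitter) * h,
            w * (1.0 + random.uniform(-jitter, jitter)),
            h * (1.0 + random.uniform(-jitter, jitter)))

def to_point_label(box):
    """Reduce a box to a single point marking the smoke location (box center)."""
    x, y, w, h = box
    return (x + w / 2.0, y + h / 2.0)
```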
To ensure rigorous evaluation and avoid data leakage, all performance metrics are reported on the Nemo-test dataset, which is completely disjoint from both the burn-in and semi-supervised training datasets. Additionally, the teacher model is used strictly in inference mode during semi-supervised training and is never involved in the evaluation process. This guarantees that reported results reflect genuine generalization to unseen data.
While all datasets originate from terrestrial camera networks, the semi-supervised set captures different fire events than those in the original Nemo splits. This design mirrors real-world deployment, where fixed infrastructure continuously records new incidents over time. The student model must generalize across evolving conditions rather than simply replicating patterns from familiar scenes.
To further support transparency and reproducibility, Figure 5 showcases representative samples from the dataset. These examples illustrate the diversity of smoke appearances, environmental contexts, and annotation styles included in our study. Additionally, the HPWREN Fire Ignition Images Library is publicly accessible [56], and the Nemo training and test datasets are available through the authors’ GitHub repository [58].

3.2. Experimental Settings

Our training procedure builds upon the methodology established in the Omni-DETR framework, with several adaptations for the wildfire smoke detection task. Model weights are initialized from standard COCO pre-trained detectors, which themselves originate from ImageNet pre-training, ensuring strong feature representations at the outset. Training is performed using the AdamW optimizer with a weight decay of $1 \times 10^{-4}$, which balances regularization with learning stability. A batch size of 1 is adopted due to the high resolution of the wildfire imagery and the associated GPU memory constraints.
The training schedule is divided across different supervision settings: models are trained for 800 epochs when using weakly labeled data and for 1600 epochs when operating on unlabeled data with pseudo-labels. To further stabilize optimization and prevent exploding gradients, we apply gradient clipping with a maximum gradient norm of 0.1. This careful combination of initialization, optimization, and regularization choices ensures consistent convergence and reliable performance across both weakly supervised and semi-supervised stages.

3.3. Model Configuration

Our detector’s architecture is based on a Transformer with six encoder and six decoder layers. The model utilizes a feed-forward network dimension of 2048, a hidden dimension of 256, and eight attention heads. A ResNet-50 network, pre-trained on ImageNet, serves as the feature extraction backbone. For the specific task of smoke detection, we customize this architecture in two key ways. First, we tailor the final detection head to predict two classes: “Smoke” and “No-Smoke”. Second, following the Nemo benchmark, we initialize the Transformer decoder with 10 specialized object queries, which are specifically designed to detect multiple distinct smoke plumes within an image.
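These settings can be summarized in a small configuration dictionary; the key names below are illustrative rather than the official Omni-DETR command-line flags, while the values follow Section 3.3.

```python
# Illustrative model configuration (key names are assumptions).
model_config = {
    "enc_layers": 6,            # Transformer encoder layers
    "dec_layers": 6,            # Transformer decoder layers
    "dim_feedforward": 2048,    # feed-forward network dimension
    "hidden_dim": 256,          # Transformer hidden dimension
    "nheads": 8,                # attention heads
    "backbone": "resnet50",     # ImageNet pre-trained feature extractor
    "num_queries": 10,          # object queries, following the Nemo benchmark
    "classes": ["Smoke", "No-Smoke"],
}
```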

3.4. Hardware

All experiments were executed using the PyTorch 1.8.1 deep learning framework, with the official Omni-DETR implementation serving as the basis of our codebase. Model training and evaluation were conducted on a workstation running CUDA 12.4 and equipped with an NVIDIA GeForce RTX 4090 GPU with 24 GB of VRAM, providing adequate computational resources for extensive experimentation.

3.5. Learning Rate

To regulate the learning rate during training, we adopted a cosine annealing scheduling strategy [59]. This dynamic approach enables relatively large parameter updates in the early stages of training while gradually reducing the step size to finer adjustments as the model nears convergence. The learning rate at epoch $t$, denoted $\eta_t$, is determined according to the following formulation:

$$\eta_t = \eta_{\min} + \frac{1}{2}\,(\eta_{\max} - \eta_{\min})\left(1 + \cos\!\left(\frac{T_{\text{cur}}}{T_{\max}}\,\pi\right)\right),$$

where $\eta_{\max}$ and $\eta_{\min}$ are the initial and minimum learning rates, $T_{\text{cur}}$ is the number of epochs elapsed in the current cycle, and $T_{\max}$ is the decay period.
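In PyTorch this schedule is available as a built-in scheduler; a minimal usage sketch is shown below. The initial and minimum learning rates are illustrative, `T_max = 200` follows the value selected in Section 3.6, and `train_one_epoch` is a hypothetical per-epoch training routine.

```python
import torch

# `student` is the Deformable-DETR student model, assumed defined elsewhere.
optimizer = torch.optim.AdamW(student.parameters(), lr=2e-4, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=200, eta_min=1e-6)

for epoch in range(800):                  # 800 epochs for weakly labeled data (Section 3.2)
    train_one_epoch(student, optimizer)   # hypothetical training routine
    scheduler.step()                      # eta_t follows the cosine formula above
```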

3.6. Hyperparameter Tuning

To establish an optimal training configuration for our framework, we conducted a rigorous sensitivity analysis of the most influential hyperparameters. The primary objectives were to accelerate model convergence, enhance training stability, and achieve a lower final loss value. Our optimization efforts centered on the loss-weighting coefficients, $\lambda_{\text{cons}}$ and $\lambda_{\text{ent}}$ (from Equation (1)), and the primary parameter of the cosine annealing scheduler, the decay period $T_{\max}$ (from Equation (6)). We employed a grid search to explore various combinations of these parameters.
Before presenting the tuning results, it is important to justify the inclusion of both semi-supervised loss terms. To avoid redundant ablation experiments, we refer to the findings of related work in SSL, which have demonstrated that consistency regularization and entropy minimization play essential yet complementary roles in this context. Studies have shown that removing either component leads to a noticeable performance degradation, highlighting that neither is sufficient on its own. Based on these insights, our tuning focuses on balancing the contributions of both losses rather than ablating them [60].
The results of our hyperparameter analysis are presented in Figure 6, which compares the training loss trajectories for several configurations over the first 100 epochs. As shown, the baseline configuration (no modification, blue line) exhibits the slowest convergence and a consistently higher training loss. In contrast, all modified configurations display markedly improved convergence behavior, confirming the effectiveness of tuning both the loss weights and the learning rate schedule.
A detailed analysis of the configurations reveals distinct trade-offs. Applying optimized loss weights ($\lambda_{\text{cons}} = 100$ and $\lambda_{\text{ent}} = 1000$) without adjusting the learning rate schedule (orange line) yields a significant initial improvement; however, the convergence flattens prematurely at a suboptimal level. Configurations using a shorter decay period ($T_{\max} = 100$, purple and brown lines) converge very quickly as the learning rate decays faster, but they also settle at a relatively higher final loss.
The most successful configurations were those with a longer cosine decay schedule ($T_{\max} = 200$, represented by the green and red lines), which enabled a more gradual yet steady decline in training loss. While the red configuration ($\lambda_{\text{cons}} = 100$ and $\lambda_{\text{ent}} = 1000$) was slightly slower in the initial epochs, it ultimately achieved the lowest and most stable training loss by the end of the period. This suggests that it strikes the optimal balance between early learning and long-term fine-tuning. Therefore, we selected this final set of hyperparameters for all subsequent experiments presented in this paper.
To complement the loss-based analysis, Figure 7 presents the corresponding mean average precision (mAP) values for epochs 20 to 100 under the same hyperparameter configurations. This figure aligns with the temporal range and tuning conditions depicted in the loss trajectory plot, enabling a direct comparison between convergence behavior and detection performance. Together, these visualizations offer a more comprehensive understanding of how optimization choices impact both training dynamics and final model accuracy.

4. Results

This section provides a thorough evaluation of the proposed Omni-Nemo framework. The main objective is to demonstrate its effectiveness in the context of weakly supervised wildfire smoke detection and to measure its performance relative to established benchmark methods. We first describe the experimental setup along with the evaluation protocol adopted for the study. We then present the quantitative findings, offering a comparative analysis that highlights the strengths of our semi-supervised strategy, which is built on a fixed teacher model.

4.1. Wildfire Detection Performance Evaluation

To rigorously assess the effectiveness of our framework, we conducted experiments designed to evaluate model performance under varying levels of supervision. The overarching goal was to determine whether models leveraging weakly labeled and unlabeled data can outperform conventional fully supervised baselines, particularly when annotation resources are limited. Our evaluation compared Omni-Nemo against several representative approaches across two categories: supervised benchmarks, including Faster R-CNN (FRCNN), YOLOv11 [61], YOLOv8 [62], and Nemo-DETR [24], which we consider to be our primary supervised baseline, and weakly supervised benchmarks, including WS-DETR [63] and Omni-DETR [51], serving as our primary weakly supervised baseline for comparison. In addition, to ensure a direct comparison, we used the same training and test datasets as those employed in the original Nemo study. This alignment allows our results to be evaluated alongside the supervised methods reported in that work.
All the supervised models were trained on the 2400-image labeled dataset from the original Nemo study, while the weakly supervised approaches were trained on both the labeled dataset and an extended 1300-image semi-supervised dataset, simulating various weak supervision regimes, including noisy bounding boxes, point labels, and unlabeled samples. Evaluation was conducted using COCO-style metrics on a separate test set obtained from the Nemo study, which include average precision (AP) and average recall (AR) across different object sizes. For the wildfire smoke detection task, particular emphasis was placed on mean average precision (mAP), average precision for small objects ($AP_S$), and average precision for medium objects ($AP_M$). The mAP metric captures overall detection performance across multiple IoU thresholds, whereas $AP_S$ and $AP_M$ specifically assess the model’s capability to detect small- and medium-scale smoke plumes, which are especially critical for timely intervention during the early stages of wildfires.
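For reproducibility, COCO-style metrics of this kind can be computed with the standard `pycocotools` evaluation loop; the file names in the sketch below are illustrative placeholders, and predictions are assumed to be exported in the COCO results format.

```python
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

coco_gt = COCO("nemo_test_annotations.json")          # ground-truth annotations (placeholder path)
coco_dt = coco_gt.loadRes("omni_nemo_predictions.json")  # detector outputs (placeholder path)

evaluator = COCOeval(coco_gt, coco_dt, iouType="bbox")
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()   # prints mAP, AP_S, AP_M, AP_L, and AR as reported in Table 2
```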
Table 2 summarizes the performance of all the evaluated methods under varying supervision regimes. Across all the settings, the proposed Omni-Nemo framework consistently outperformed the conventional supervised and weakly supervised baselines. Notably, in the most challenging “No Annotation” setting, Omni-Nemo attained 64.1% mAP, surpassing fully supervised detectors such as Nemo-DETR (40.6%), FRCNN (29.3%), and YOLOv11 (26.8%). This substantial gain under minimal supervision underscores the effectiveness of the proposed teacher–student paradigm. Among the supervised models, Nemo-DETR remained the strongest baseline, achieving 40.6% mAP and the highest recall (88.6%), outperforming YOLOv8 (39.6%), FRCNN (77.2% AR), and YOLOv11 (35.2% AR). The weakly supervised WS-DETR baseline performed considerably worse at 12.3% mAP, highlighting the limitations of conventional weak supervision. Omni-DETR improved to 48.0% mAP, yet all the Omni-Nemo variants achieved even higher performance, with the “Point” supervision and “Noisy BBox” configurations reaching 66.5% and 66.4% mAP, respectively, establishing Omni-Nemo as the most effective approach across all the supervision levels.
The average precision analysis revealed the superior generalization capability of Omni-Nemo across various IoU thresholds. At the commonly reported 50% IoU threshold, Omni-Nemo demonstrated strong precision, with the variants achieving 80.2–82.6% compared to Nemo-DETR’s 77.2%, FRCNN’s 68.4%, and YOLOv11’s 43.7%. The “Point” variant achieved 82.6% $AP_{50}$, nearly matching Omni-DETR’s 82.9%, indicating high localization accuracy for detected smoke plumes. The “Point” variant also achieved 58.3% $AP_S$, representing a 7.2% improvement over the supervised benchmark Nemo-DETR (54.4%) and a 45% improvement over the weakly supervised benchmark Omni-DETR (40.1%), while also significantly outperforming conventional detectors such as FRCNN (27.2%) and YOLOv11 (8.2%). For medium-sized smoke plumes, the “Noisy BBox” variant of Omni-Nemo reached 73.9% $AP_M$ compared to Nemo-DETR’s 69.4%, Omni-DETR’s 47.2%, and FRCNN’s 64.4%. All the Omni-Nemo variants surpassed the supervised benchmark by 3.4–6.5% while showing dramatic improvements of 53–57% over the weakly supervised benchmark. However, for large smoke plumes, Nemo-DETR maintained the highest precision (80.7% $AP_L$), outperforming FRCNN (72.1%), YOLOv11 (25.1%), and all the Omni-Nemo variants (67.2–69.4%). This trade-off reflects a deliberate design choice in our framework: prioritizing the early-stage detection of subtle smoke signals over exhaustive modeling of large, visually dominant plumes, which are typically easier to detect through alternative means and require immediate rather than early intervention.
In the average recall analysis, Nemo-DETR achieved the highest overall recall (88.6%) due to producing a larger number of candidate detections, particularly for large objects ($AR_L = 91.0\%$) and medium objects ($AR_M = 83.7\%$). In contrast, traditional supervised detectors such as FRCNN (77.2%) and YOLOv11 (35.2%) exhibited lower recall due to either conservative prediction thresholds or underfitting to sparse smoke signatures. The Omni-Nemo variants, guided by teacher regularization, adopted a more precision-focused strategy that reduces false positives, resulting in slightly lower overall recall (68.4–73.9%). This strategic approach offers an advantage for operational deployment, as false alarms can lead to unnecessary resource allocation and erode system credibility. Despite the overall lower recall, Omni-Nemo demonstrated superior performance for small objects, with the “Point” variant achieving 69.1% $AR_S$ compared to Nemo-DETR’s 66.7% and Omni-DETR’s 42.3%, notably outperforming both YOLOv11 (7.8%) and FRCNN (44.4%). For medium and large objects, while Nemo-DETR maintained the highest recall, the Omni-Nemo variants showed competitive performance for medium objects (75.4–77.8% $AR_M$) while maintaining the precision advantages discussed earlier.
The experimental results establish several critical findings that validate our approach. First, Omni-Nemo demonstrated superior label efficiency, achieving better performance than fully supervised methods while using significantly less labeled data, with the “No Annotation” variant outperforming Nemo-DETR by 23% and outperforming conventional detectors such as FRCNN and YOLOv11 by more than 35%. Second, the framework demonstrated enhanced early-detection capabilities through superior performance on small- and medium-sized smoke plumes, indicating improved capability for early wildfire detection scenarios. Third, the precision-oriented design reduced false positives while maintaining competitive recall for critical small objects. Finally, the consistent improvement over Omni-DETR (16–18 percentage point mAP improvement) directly validates our architectural enhancements, including the fixed expert teacher, optimized loss function, and refined training schedule. These results establish Omni-Nemo as a robust and practical solution for weakly supervised wildfire smoke detection, offering significant advantages in label efficiency while maintaining critical operational performance for early-detection scenarios.

4.2. Early Incipient Time-Series

Conventional metrics such as mAP, A P S , and A P M quantify overall detection capability but do not directly reflect the responsiveness required for operational wildfire monitoring. To evaluate time-critical performance, we measured raw detection latency, defined as the elapsed time in minutes between the first visible appearance of smoke in the video and the first correct detection produced by each model. This analysis was conducted using 14 wildfire events obtained from the HPWREN archive. We limited the experiment to models that had already demonstrated their ability to detect small low-contrast smoke plumes in the COCO evaluation (Table 2). Early-stage smoke is naturally faint and small, so any detector that failed this first sensitivity test was excluded from the timing runs.
The supervised baselines include Nemo-DETR, which has been widely adopted in smoke detection research, and FRCNN, which represents a commonly used convolutional detection framework. In accordance with the updated selection strategy, YOLOv8 was added as an efficient detector with competitive accuracy, and Omni-DETR was included to provide a recent Transformer-driven alternative. The proposed Omni-Nemo framework was evaluated using three supervision settings: “No Annotation,” “Point,” and “Noisy BBox,” which progressively reduce the level of manual labeling required.
The results in Table 3 show that all the Omni-Nemo variants achieved low detection latency, ranging between 3.0 and 3.6 min on average. The “Point” configuration achieved the fastest mean latency at 3.0 min, followed by “No Annotation” at 3.2 min. Both surpassed the FRCNN baseline, which required 5.0 min on average. Nemo-DETR reached 3.3 min, while YOLOv8 and Omni-DETR achieved 4.8 and 3.5 min, respectively.
These outcomes indicate that, even under minimal supervision, Omni-Nemo matches or exceeds the responsiveness of fully supervised architectures. Furthermore, the inclusion of both convolutional and Transformer-driven baselines confirms that weakly supervised models can achieve reliable early detection in real-world monitoring scenarios.

4.3. Detection Visualization

Beyond quantitative evaluation, qualitative inspection of prediction outcomes offers important insights into model behavior and robustness. Figure 8 illustrates a representative case study drawn from the test set, showcasing a direct comparison between the predictions of the expert Nemo teacher and those of the Omni-Nemo student in a challenging real-world wildfire scenario.
Figure 8a displays the ground truth (blue boxes) alongside the predictions made by the Nemo teacher (red box). In contrast, Figure 8b illustrates the corresponding predictions from our Omni-Nemo model. Notably, Omni-Nemo successfully localized both the initial denser smoke and the more diffuse larger plume that rose above it. These examples highlight the capability of Omni-Nemo in detecting varying densities and scales within a complex scene.
In addition to the visual comparisons, we conducted a focused error analysis on a challenging subset of 60 test images. This subset was deliberately chosen to represent highly cluttered conditions involving clouds, haze, and steam, which are known to create ambiguity and increase the likelihood of false or uncertain detections. On this subset, Omni-Nemo achieved 21 true positives, 26 false negatives, and 13 false positives. A representative example from this subset is shown in Figure 9, which illustrates both the strengths and limitations of the framework.
Taken together, the quantitative results and qualitative observations indicate that the student model, through effective use of both labeled and unlabeled data, has acquired a richer representation of smoke characteristics. This enhanced understanding allows it to generalize beyond the capabilities of its teacher while also revealing specific areas for improvement when faced with visually complex and cluttered conditions.

5. Discussion

The findings of this study support the effectiveness of the proposed Omni-Nemo framework as an expert-guided semi-supervised approach for wildfire smoke detection. The results indicate that the methodology achieves acceptable detection accuracy while reducing annotation requirements compared to conventional fully supervised approaches. Across varied supervision scenarios, Omni-Nemo shows higher overall mAP scores than the baseline models, suggesting its ability to extract meaningful feature representations from partially labeled datasets. These outcomes align with the central hypothesis that combining stable teacher guidance with a modified loss function facilitates efficient knowledge transfer and promotes generalization across diverse data conditions.
(1) Scale-Dependent Performance Characteristics: The analysis reveals scale-dependent differences in detection performance. Omni-Nemo demonstrates consistent precision in identifying small and medium smoke plumes—categories that are operationally important for early wildfire intervention and often difficult to annotate reliably. In these cases, it achieves higher precision metrics than fully supervised baselines, indicating its capacity to detect subtle early-stage smoke signals. In contrast, the fully supervised Nemo-DETR yields stronger results for large plumes, with higher AP and AR scores. This pattern reflects the known limitations of weakly supervised methods: large diffuse objects often require dense and precise annotations to accurately represent their spatial extent, which pseudo-labels and coarse supervision may not fully capture. Similar challenges related to scale imbalance and annotation sparsity have been noted in prior research [64]. While detection of large plumes remains a limitation, this is considered to be an acceptable trade-off given the emphasis on early detection, where small and medium plumes are most relevant for wildfire prevention and rapid response.
(2) Component-Wise Analysis: The evaluation provides evidence for the individual and combined contributions of each design component within Omni-Nemo. Comparisons with the baseline Nemo model highlight the impact of incorporating both weakly annotated and unlabeled datasets. The flexible weak-label framework improves performance across all the supervision scenarios tested (No Annotation, Noisy BBox, Point, and Point with Noisy BBox), supporting the benefit of training under heterogeneous supervision conditions. Comparisons with Omni-DETR further emphasize the role of the teacher model design. Replacing traditional co-learning approaches with a fixed expert teacher architecture contributes to training stability and consistency. This design avoids mutual degradation between teacher and student models and ensures reliable guidance throughout training. Additionally, hyperparameter ablation studies isolate the effects of the optimization strategy, confirming that the three-component loss function and cosine learning rate schedule contribute to improved student model performance.
Taken together, the experimental findings suggest that each major component—flexible weak supervision, expert guidance, and tailored training strategies—contributes to practical wildfire smoke detection. The combined effect enables Omni-Nemo to achieve performance comparable to fully supervised methods while addressing the ongoing challenge of annotation scarcity that limits broader application of deep learning in this domain.

Limitations and Future Work

(1) Expert-Guided Transfer Learning Framework: This work introduces a framework that addresses the challenge of limited accurately labeled datasets by replacing traditional co-learning approaches with an expert-guided transfer paradigm. Unlike Omni-DETR, which simultaneously trains teacher and student models on unlabeled data, our framework employs a fixed pre-validated expert model to supervise student training. This architectural choice helps to mitigate the instability and noise often associated with training teacher models on limited datasets, thereby supporting consistent generation of high-quality pseudo-labels.
Comparative analyses among Omni-Nemo, Omni-DETR, and Nemo-DETR support the effectiveness of this architectural modification. It should be noted that incompatibilities in weight-update mechanisms and training schedules between DETR-based teachers and Deformable-DETR students preclude direct implementation of co-learning in our experimental setting. Future research may explore a broader range of expert teacher architectures to further enhance supervision quality and model robustness; a schematic of the fixed-teacher training loop is sketched below.
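The schematic below sketches this fixed-teacher paradigm: the expert's weights stay frozen, it generates pseudo-labels for each batch of unlabeled images, and gradients update only the student. Function names and the per-detection output format are illustrative assumptions, not our exact implementation.

```python
import torch

@torch.no_grad()  # the teacher is never updated; no co-learning, no EMA
def make_pseudo_labels(teacher, images, conf_thresh=0.7):
    # Keep only confident teacher detections as training targets
    # (a hypothetical per-detection dict format is assumed here).
    detections = teacher(images)
    return [d for d in detections if d["score"] > conf_thresh]

def semi_supervised_step(student, teacher, images, optimizer, criterion):
    teacher.eval()  # fixed, pre-validated expert
    targets = make_pseudo_labels(teacher, images)
    optimizer.zero_grad()
    loss = criterion(student(images), targets)  # detection loss vs. pseudo-labels
    loss.backward()  # gradients flow only through the student
    optimizer.step()
    return loss.item()
```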
(2) Performance Limitations on Large-Scale Detection: Despite the framework’s contributions, the approach shows relatively lower performance in detecting large-scale smoke plumes compared to the fully supervised Nemo-DETR baseline. This trade-off is consistent with established findings in the semi-supervised detection literature, where accurate modeling of large diffuse objects under sparse supervision remains a known challenge. Potential mitigation strategies include implementing scale-aware attention mechanisms, developing multi-scale loss functions, and employing adaptive pseudo-label refinement techniques to improve large-object modeling while maintaining sensitivity to early-stage smoke plumes.
(3) Dataset Scale and Generalizability Constraints: While the experimental dataset is modest in size, it is specifically tailored for early-stage wildfire smoke detection. All the samples were sourced from real-world camera networks that provide continuous monitoring of fire-prone regions under naturalistic, uncontrolled conditions.
Through systematic manual curation and annotation of multi-year fire event sequences, we constructed a dataset that captures smoke patterns typically absent from synthetic or standard benchmark collections. However, broader data from diverse contexts remains essential. Future work may involve incorporating larger, more heterogeneous datasets, including cross-regional and cross-domain benchmarks, to support generalization and assess scalability across varied environmental and geographic conditions.
(4) Broader Applications and Future Directions: The proposed methodology may extend beyond wildfire detection. Expert-guided distillation represents a transferable approach that is applicable to domains where established expert models exist but labeled training data is limited. The integration of uncertainty quantification techniques could further improve the reliability and robustness of pseudo-labels in the presence of teacher model noise; a brief illustrative sketch follows this item.
Potential application areas include medical imaging, where expert models can be adapted across different scanner technologies without requiring extensive annotation, and industrial inspection systems, where new hardware platforms necessitate scalable model adaptation without incurring substantial manual labeling costs. These examples illustrate the broader relevance of expert-guided transfer learning in addressing data scarcity across diverse domains.
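As one simple, hypothetical instantiation of such uncertainty quantification (our own sketch, not a component of the present framework), pseudo-labels could be gated on both teacher confidence and predictive entropy:

```python
import torch

def filter_pseudo_labels(class_logits, boxes, conf_thresh=0.7, ent_thresh=0.5):
    """Keep teacher detections that are confident AND low-entropy (illustrative).

    class_logits: (N, C) per-detection class logits from the teacher.
    boxes:        (N, 4) matching bounding boxes.
    """
    probs = class_logits.softmax(dim=-1)
    conf, labels = probs.max(dim=-1)
    # Predictive entropy as a cheap per-detection uncertainty proxy.
    entropy = -(probs * probs.clamp_min(1e-8).log()).sum(dim=-1)
    keep = (conf > conf_thresh) & (entropy < ent_thresh)
    return boxes[keep], labels[keep], conf[keep]
```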

6. Conclusions

In this study, we present Omni-Nemo, a semi-supervised teacher–student framework designed to advance wildfire smoke detection while reducing the reliance on extensive manual annotations. At the core of this approach is a fixed expert model that serves as the teacher, generating reliable pseudo-labels to guide the student during training. Through this design, the student learns effectively from both unlabeled and weakly labeled data, substantially reducing the overall annotation effort. To further enhance performance, we introduced a modified loss function tailored to the semi-supervised setting, incorporating consistency and entropy-based terms to stabilize training and improve generalization. Additionally, we employed a cosine learning rate schedule that accelerated convergence in the early training phases and promoted more stable optimization. Together, these refinements contributed to stronger alignment between student and teacher predictions and improved detection outcomes. The results demonstrated that Omni-Nemo outperforms the fully supervised Nemo-DETR baseline in mAP, APS, and APM, while the fully supervised model retained advantages in recall and large-object detection. Compared to benchmark models—including fully supervised Nemo-DETR, WS-DETR, Omni-DETR, and YOLO-v8—Omni-Nemo achieved strong performance. Specifically, Omni-Nemo attained the highest mAP (66.5), outperforming Nemo-DETR (40.6), YOLO-v8 (39.6), and WS-DETR (12.3). It also led in APS (58.3) and APM (73.9), demonstrating its ability to detect small- and medium-scale smoke plumes—critical for early wildfire intervention. These findings underscore the effectiveness of combining expert guidance with principled training strategies to enable reliable wildfire detection under limited supervision, offering a scalable and annotation-efficient solution for real-world deployment.

Author Contributions

Conceptualization, T.S., F.Y. and L.Y.; methodology, T.S.; software, T.S. and L.Y.; validation, T.S., L.Y. and A.Y.; formal analysis, T.S.; investigation, T.S. and L.Y.; resources, L.Y.; data curation, L.Y.; writing—original draft preparation, T.S.; writing—review and editing, T.S., L.Y. and A.Y.; visualization, T.S.; supervision, L.Y.; project administration, L.Y.; funding acquisition, L.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the National Science Foundation under Grant OIA-2148788.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The images and videos employed in this study are openly available through the AlertWildfire network [55] and the High-Performance Wireless Research and Education Network (HPWREN) [56]; the compiled dataset is accessible at [57].

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. National Interagency Fire Center. Wildland Fire Summary and Statistics Annual Report 2024. 2025. Available online: https://www.nifc.gov/sites/default/files/NICC/2-Predictive%20Services/Intelligence/Annual%20Reports/2024/annual_report_2024.pdf (accessed on 15 May 2025).
  2. Li, Y.; Zhang, T.; Ding, Y.; Wadhwani, R.; Huang, X. Review and perspectives of digital twin systems for wildland fire management. J. For. Res. 2025, 36, 1–24. [Google Scholar] [CrossRef]
  3. Negash, N.M.; Sun, L.; Fan, C.; Shi, D.; Wang, F. Review of Wildfire Detection, Fighting, and Technologies: Future Prospects and Insights. In Proceedings of the AIAA Aviation Forum and ASCEND 2025, Las Vegas, NV, USA, 21–25 July 2025; p. 3469. [Google Scholar]
  4. Özel, B.; Alam, M.S.; Khan, M.U. Review of modern forest fire detection techniques: Innovations in image processing and deep learning. Information 2024, 15, 538. [Google Scholar] [CrossRef]
  5. Balsak, A.; San, B.T. Evaluation of the effect of spatial and temporal resolutions for digital change detection: Case of forest fire. Nat. Hazards 2023, 119, 1799–1818. [Google Scholar] [CrossRef]
  6. Singh, H.; Ang, L.M.; Srivastava, S.K. Active wildfire detection via satellite imagery and machine learning: An empirical investigation of Australian wildfires. Nat. Hazards 2025, 121, 9777–9800. [Google Scholar] [CrossRef]
  7. Zhao, Y.; Ban, Y. GOES-R time series for early detection of wildfires with deep GRU-network. Remote Sens. 2022, 14, 4347. [Google Scholar] [CrossRef]
  8. Boroujeni, S.P.H.; Razi, A.; Khoshdel, S.; Afghah, F.; Coen, J.L.; O’Neill, L.; Fule, P.; Watts, A.; Kokolakis, N.M.T.; Vamvoudakis, K.G. A comprehensive survey of research towards AI-enabled unmanned aerial systems in pre-, active-, and post-wildfire management. Inf. Fusion 2024, 108, 102369. [Google Scholar] [CrossRef]
  9. Abdusalomov, A.; Umirzakova, S.; Bakhtiyor Shukhratovich, M.; Mukhiddinov, M.; Kakhorov, A.; Buriboev, A.; Jeon, H.S. Drone-Based Wildfire Detection with Multi-Sensor Integration. Remote Sens. 2024, 16, 4651. [Google Scholar] [CrossRef]
  10. Saltiel, T.M.; Larson, K.B.; Rahman, A.; Coleman, A. Airborne LiDAR to Improve Canopy Fuels Mapping for Wildfire Modeling; Technical Report; Pacific Northwest National Laboratory (PNNL): Richland, WA, USA, 2024. [Google Scholar]
  11. Allison, R.S.; Johnston, J.M.; Craig, G.; Jennings, S. Airborne optical and thermal remote sensing for wildfire detection and monitoring. Sensors 2016, 16, 1310. [Google Scholar] [CrossRef] [PubMed]
  12. Shah, S. Preliminary Wildfire Detection Using State-of-the-art PTZ (Pan, Tilt, Zoom) Camera Technology and Convolutional Neural Networks. arXiv 2021, arXiv:2109.05083. [Google Scholar]
  13. Tzoumas, G.; Pitonakova, L.; Salinas, L.; Scales, C.; Richardson, T.; Hauert, S. Wildfire detection in large-scale environments using force-based control for swarms of UAVs. Swarm Intell. 2023, 17, 89–115. [Google Scholar] [CrossRef]
  14. Govil, K.; Welch, M.L.; Ball, J.T.; Pennypacker, C.R. Preliminary results from a wildfire detection system using deep learning on remote camera images. Remote Sens. 2020, 12, 166. [Google Scholar]
  15. Honary, R.; Shelton, J.; Kavehpour, P. A Review of Technologies for the Early Detection of Wildfires. ASME Open J. Eng. 2025, 4, 040803. [Google Scholar] [CrossRef]
  16. Pimpalkar, S. Wild-Fire and Smoke Detection for Environmental Disaster Management using Deep Learning: Comparative Analysis using YOLO variants. In Proceedings of the 2025 International Conference on Emerging Smart Computing and Informatics (ESCI), Pune, India, 5–7 March 2025; pp. 1–6. [Google Scholar]
  17. Wang, Z.; Wu, L.; Li, T.; Shi, P. A smoke detection model based on improved YOLOv5. Mathematics 2022, 10, 1190. [Google Scholar] [CrossRef]
  18. Li, J.; Xu, R.; Liu, Y. An improved forest fire and smoke detection model based on yolov5. Forests 2023, 14, 833. [Google Scholar] [CrossRef]
  19. Saydirasulovich, S.N.; Mukhiddinov, M.; Djuraev, O.; Abdusalomov, A.; Cho, Y.I. An improved wildfire smoke detection based on YOLOv8 and UAV images. Sensors 2023, 23, 8374. [Google Scholar] [CrossRef] [PubMed]
  20. Ide, R.; Yang, L. Adversarial Robustness for Deep Learning-Based Wildfire Prediction Models. Fire 2025, 8, 50. [Google Scholar] [CrossRef]
  21. Sun, B.; Cheng, X. Smoke Detection Transformer: An Improved Real-Time Detection Transformer Smoke Detection Model for Early Fire Warning. Fire 2024, 7, 488. [Google Scholar] [CrossRef]
  22. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  23. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 213–229. [Google Scholar]
  24. Yazdi, A.; Qin, H.; Jordan, C.B.; Yang, L.; Yan, F. Nemo: An open-source transformer-supercharged benchmark for fine-grained wildfire smoke detection. Remote Sens. 2022, 14, 3979. [Google Scholar]
  25. Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable detr: Deformable transformers for end-to-end object detection. arXiv 2020, arXiv:2010.04159. [Google Scholar]
  26. Chen, Y.; Liu, B.; Yuan, L. PR-Deformable DETR: DETR for remote sensing object detection. IEEE Geosci. Remote. Sens. Lett. 2024, 21, 1–5. [Google Scholar] [CrossRef]
  27. Krawi, O.; Rada, L. Towards Automatic Streetside Building Identification with an Integrated YOLO Model for Building Detection and a Vision Transformer for Identification. IEEE Access 2025, 13, 52901–52911. [Google Scholar] [CrossRef]
  28. Choi, S.; Song, Y.; Jung, H. Study on Improving Detection Performance of Wildfire and Non-Fire Events Early Using Swin Transformer. IEEE Access 2025, 13, 46824–46837. [Google Scholar] [CrossRef]
  29. Li, R.; Hu, Y.; Li, L.; Guan, R.; Yang, R.; Zhan, J.; Cai, W.; Wang, Y.; Xu, H.; Li, L. SMWE-GFPNNet: A high-precision and robust method for forest fire smoke detection. Knowl.-Based Syst. 2024, 289, 111528. [Google Scholar]
  30. Chaturvedi, S.; Shubham Arun, C.; Singh Thakur, P.; Khanna, P.; Ojha, A. Ultra-lightweight convolution-transformer network for early fire smoke detection. Fire Ecol. 2024, 20, 83. [Google Scholar] [CrossRef]
  31. Tang, T.; Jayaputera, G.T.; Sinnott, R.O. A Performance Comparison of Convolutional Neural Networks and Transformer-Based Models for Classification of the Spread of Bushfires. In Proceedings of the 2024 IEEE 20th International Conference on e-Science (e-Science), Osaka, Japan, 16–20 September 2024; pp. 1–9. [Google Scholar]
  32. Dietterich, T.G.; Lathrop, R.H.; Lozano-Pérez, T. Solving the multiple instance problem with axis-parallel rectangles. Artif. Intell. 1997, 89, 31–71. [Google Scholar] [CrossRef]
  33. Bilen, H.; Vedaldi, A. Weakly supervised deep detection networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 2846–2854. [Google Scholar]
  34. Tang, P.; Wang, X.; Bai, X.; Liu, W. Multiple instance detection network with online instance classifier refinement. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2843–2851. [Google Scholar]
  35. Zhou, B.; Khosla, A.; Lapedriza, A.; Oliva, A.; Torralba, A. Learning deep features for discriminative localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 2921–2929. [Google Scholar]
  36. Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-cam: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 618–626. [Google Scholar]
  37. Xie, J.; Yu, F.; Wang, H.; Zheng, H. Class activation map-based data augmentation for satellite smoke scene detection. IEEE Geosci. Remote Sens. Lett. 2022, 19, 1–5. [Google Scholar]
  38. Zhao, L.; Liu, J.; Peters, S.; Li, J.; Mueller, N.; Oliver, S. Learning class-specific spectral patterns to improve deep learning-based scene-level fire smoke detection from multi-spectral satellite imagery. Remote Sens. Appl. Soc. Environ. 2024, 34, 101152. [Google Scholar]
  39. Pan, J.; Ou, X.; Xu, L. A collaborative region detection and grading framework for forest fire smoke using weakly supervised fine segmentation and lightweight faster-RCNN. Forests 2021, 12, 768. [Google Scholar] [CrossRef]
  40. Liu, L.; Yu, D.; Zhang, X.; Xu, H.; Li, J.; Zhou, L.; Wang, B. A Semi-Supervised Attention-Temporal Ensembling Method for Ground Penetrating Radar Target Recognition. Sensors 2025, 25, 3138. [Google Scholar]
  41. Tarvainen, A.; Valpola, H. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. In Advances in Neural Information Processing Systems (NeurIPS 30); Curran Associates, Inc.: Red Hook, NY, USA, 2017; pp. 1195–1204. [Google Scholar]
  42. Berthelot, D.; Carlini, N.; Goodfellow, I.; Papernot, N.; Oliver, A.; Raffel, C.A. Mixmatch: A holistic approach to semi-supervised learning. In Advances in Neural Information Processing Systems (NeurIPS 32); Wallach, H., Larochelle, H., Beygelzimer, A., d’Alché-Buc, F., Fox, E., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2019; pp. 5050–5060. [Google Scholar]
  43. Sohn, K.; Berthelot, D.; Carlini, N.; Zhang, Z.; Zhang, H.; Raffel, C.A.; Cubuk, E.D.; Kurakin, A.; Li, C.L. Fixmatch: Simplifying semi-supervised learning with consistency and confidence. Adv. Neural Inf. Process. Syst. 2020, 33, 596–608. [Google Scholar]
  44. Wang, C.; Grau, A.; Guerra, E.; Shen, Z.; Hu, J.; Fan, H. Semi-supervised wildfire smoke detection based on smoke-aware consistency. Front. Plant Sci. 2022, 13, 980425. [Google Scholar] [CrossRef]
  45. Xie, Q.; Luong, M.T.; Hovy, E.; Le, Q.V. Self-training with noisy student improves imagenet classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 10687–10698. [Google Scholar]
  46. Sohn, K.; Zhang, Z.; Li, C.L.; Zhang, H.; Lee, C.Y.; Pfister, T. A simple semi-supervised learning framework for object detection. arXiv 2020, arXiv:2005.04757. [Google Scholar] [CrossRef]
  47. Xu, M.; Zhang, Z.; Hu, H.; Wang, J.; Wang, L.; Wei, F.; Bai, X.; Liu, Z. End-to-end semi-supervised object detection with soft teacher. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 3060–3069. [Google Scholar]
  48. Li, H.; Qi, J.; Li, Y.; Zhang, W. A dual-branch selection method with pseudo-label based self-training for semi-supervised smoke image segmentation. Digit. Signal Process. 2024, 145, 104320. [Google Scholar] [CrossRef]
  49. Amini, M.R.; Feofanov, V.; Pauletto, L.; Hadjadj, L.; Devijver, E.; Maximov, Y. Self-training: A survey. Neurocomputing 2025, 616, 128904. [Google Scholar] [CrossRef]
  50. Kage, P.; Rothenberger, J.C.; Andreadis, P.; Diochnos, D.I. A review of pseudo-labeling for computer vision. arXiv 2024, arXiv:2408.07221. [Google Scholar] [CrossRef]
  51. Wang, P.; Cai, Z.; Yang, H.; Swaminathan, G.; Vasconcelos, N.; Schiele, B.; Soatto, S. Omni-detr: Omni-supervised object detection with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 9367–9376. [Google Scholar]
  52. Liu, Y.C.; Ma, C.Y.; He, Z.; Kuo, C.W.; Chen, K.; Zhang, P.; Wu, B.; Kira, Z.; Vajda, P. Unbiased teacher for semi-supervised object detection. arXiv 2021, arXiv:2102.09480. [Google Scholar] [CrossRef]
  53. Zhou, H.; Ge, Z.; Liu, S.; Mao, W.; Li, Z.; Yu, H.; Sun, J. Dense teacher: Dense pseudo-labels for semi-supervised object detection. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 35–50. [Google Scholar]
  54. Kurakin, A.; Raffel, C.; Berthelot, D.; Cubuk, E.D.; Zhang, H.; Sohn, K.; Carlini, N. ReMixMatch: Semi-Supervised Learning with Distribution Alignment and Augmentation Anchoring. arXiv 2020, arXiv:1911.09785. [Google Scholar] [CrossRef]
  55. ALERTWildfire Camera Network. Available online: https://www.alertwildfire.org/ (accessed on 6 September 2025).
  56. HPWREN. The HPWREN Fire Ignition Images Library for Neural Network Training. Available online: https://www.hpwren.ucsd.edu/FIgLib/ (accessed on 4 August 2025).
  57. Omni-Nemo Dataset. Available online: https://nevada.box.com/s/9mvjpononhmq1g8hnc8tj3hlqiasa7pg (accessed on 6 September 2025).
  58. SayBender/Nemo. Available online: https://github.com/SayBender/Nemo (accessed on 6 September 2025).
  59. Loshchilov, I.; Hutter, F. Sgdr: Stochastic gradient descent with warm restarts. arXiv 2016, arXiv:1608.03983. [Google Scholar]
  60. Zhang, S.; Gao, H.; Zhang, T.; Zhou, Y.; Wu, Z. Alleviating robust overfitting of adversarial training with consistency regularization. arXiv 2022, arXiv:2205.11744. [Google Scholar] [CrossRef]
  61. Ramos, L.T.; Casas, E.; Romero, C.; Rivas-Echeverría, F.; Bendek, E. A study of yolo architectures for wildfire and smoke detection in ground and aerial imagery. Results Eng. 2025, 26, 104869. [Google Scholar] [CrossRef]
  62. Jocher, G.; Chaurasia, A.; Qiu, J. Ultralytics YOLOv8. 2023. Available online: https://github.com/ultralytics/ultralytics (accessed on 4 August 2025).
  63. LaBonte, T.; Song, Y.; Wang, X.; Vineet, V.; Joshi, N. Scaling novel object detection with weakly supervised detection transformers. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 2–7 January 2023; pp. 85–96. [Google Scholar]
  64. Liu, C.; Zhang, W.; Lin, X.; Zhang, W.; Tan, X.; Han, J.; Li, X.; Ding, E.; Wang, J. Ambiguity-resistant semi-supervised learning for dense object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 15579–15588. [Google Scholar]
Figure 1. Schematic of the Omni-Nemo training pipeline. The framework begins with a supervised burn-in phase using labeled data, followed by a semi-supervised stage where a fixed expert teacher generates pseudo-labels for unlabeled samples.
Figure 2. Structural overview of the DETR-based Nemo framework used as the expert teacher. The input image is processed by a CNN backbone to generate feature maps, which are then passed through a Transformer encoder. A Transformer decoder applies object queries and outputs class labels and bounding-box predictions via feed-forward networks.
Figure 3. Architecture of the Deformable-DETR student model used in Omni-Nemo. The input image is processed by a CNN backbone to generate multi-scale feature maps. A Transformer module with deformable attention (sampling points shown as red circles) refines these features, allowing the decoder to align object queries (colored patches) with relevant regions. Prediction heads produce the final class labels and bounding-box outputs.
Figure 4. Overview of the proposed two-stage training framework. In the burn-in stage, the student model, based on Deformable-DETR, is trained on labeled data to establish a baseline. In the semi-supervised stage, a fixed pre-trained teacher model (Nemo), built on standard DETR, generates pseudo-labels for unlabeled images. The student is updated by minimizing the loss between its predictions and the teacher’s outputs, enabling knowledge transfer using unlabeled data.
Figure 5. Sample images illustrating variations in smoke visibility and environmental conditions.
Figure 6. Impact of hyperparameter tuning (T_max, λ_cons, and λ_ent) on training loss over the first 100 epochs.
Figure 7. Impact of hyperparameter tuning (T_max, λ_cons, and λ_ent) on mean average precision (mAP) over the first 100 epochs.
Figure 8. Comparison of model predictions with ground truth bounding boxes. (a) Nemo-DETR predictions versus ground truth: the blue frames represent the ground-truth bounding boxes, while the red frames represent the predicted bounding boxes. (b) Omni-Nemo predictions: the red frames represent the predicted bounding boxes.
Figure 9. Example predictions from the challenging subset of the test dataset, which includes highly cluttered conditions with clouds, haze, and steam. The red frames indicate the predicted bounding boxes produced by the Omni-Nemo model.
Table 1. Dataset composition and usage in the training pipeline.

| Category | Training Process | Number of Images | Origin |
|---|---|---|---|
| Labeled dataset | Burn-in | 2400 | Nemo (Train) [24] |
| Unlabeled dataset | Semi-supervised | 1300 | Generated in this study [57] |
| Validation dataset | Evaluation | 260 | Nemo (Test) [24] |
Table 2. COCO evaluation metrics for Nemo-DETR [24], FRCNN [24], YOLOv11 [61], YOLOv8 [62], WS-DETR [63], Omni-DETR [51], and Omni-Nemo variants. The bold values indicate the best performance for each evaluation metric.

| Metric | Nemo-DETR | FRCNN | YOLO-v11 | YOLO-v8 | WS-DETR | Omni-DETR | Omni-Nemo (No Annotation) | Omni-Nemo (Noisy BBox) | Omni-Nemo (Point) | Omni-Nemo (Point + Noisy BBox) |
|---|---|---|---|---|---|---|---|---|---|---|
| mAP | 40.6 | 29.3 | 26.8 | 39.6 | 12.3 | 48.0 | 64.1 | 66.4 | **66.5** | 65.9 |
| AP50 | 77.2 | 68.4 | 43.7 | 73.5 | 20.1 | **82.9** | 80.2 | 81.0 | 82.6 | 81.8 |
| APS | 54.4 | 27.2 | 8.2 | 35.6 | 0.0 | 40.1 | 55.5 | 51.6 | **58.3** | 57.2 |
| APM | 69.4 | 64.4 | 17.3 | 31.1 | 3.7 | 47.2 | 71.8 | **73.9** | 72.1 | 72.5 |
| APL | **80.7** | 72.1 | 25.1 | 42.0 | 6.9 | 60.8 | 68.5 | 69.4 | 67.2 | 67.9 |
| AR | **88.6** | 77.2 | 35.2 | 57.7 | 7.1 | 59.1 | 68.4 | 69.7 | 73.9 | 70.6 |
| ARS | 66.7 | 44.4 | 7.8 | 28.9 | 0.0 | 42.3 | 67.6 | 66.8 | **69.1** | 68.8 |
| ARM | **83.7** | 75.5 | 30.1 | 50.2 | 5.6 | 64.7 | 75.4 | 77.8 | 76.1 | 77.3 |
| ARL | **91.0** | 79.3 | 32.3 | 61.5 | 9.8 | 70.4 | 73.4 | 76.7 | 75.0 | 76.5 |
Table 3. Raw detection time in minutes for each fire event. The table compares FRCNN [24], Nemo-DETR [24], YOLOv8 [62], Omni-DETR [51], and three Omni-Nemo variants (No Annotation, Point, and Noisy BBox).

| Fire Event (HPWREN Archive [1]) | FRCNN (min) | Nemo-DETR (min) | YOLO-v8 (min) | Omni-DETR (min) | Omni-Nemo No Annotation (min) | Omni-Nemo Point (min) | Omni-Nemo Noisy BBox (min) |
|---|---|---|---|---|---|---|---|
| bravo-e-mobo-c__2019-08-13 | 5 | 2 | 3 | 2 | 2 | 2 | 2 |
| bh-w-mobo-c__2019-06-10 | 9 | 4 | 5 | 4 | 3 | 3 | 3 |
| bl-n-mobo-c__2019-08-29 | 4 | 3 | 5 | 4 | 3 | 3 | 4 |
| bl-s-mobo-c__2019-07-16 | 8 | 3 | 5 | 4 | 3 | 3 | 3 |
| lp-n-mobo-c__2019-07-17 | 2 | 1 | 3 | 2 | 2 | 1 | 2 |
| ml-w-mobo-c__2019-09-24 | 4 | 3 | 4 | 3 | 3 | 2 | 3 |
| ml-w-mobo-c__2019-10-06 | 7 | 4 | 5 | 3 | 3 | 3 | 3 |
| om-e-mobo-c__2019-07-12 | 2 | 5 | 7 | 6 | 6 | 6 | 7 |
| pi-s-mobo-c__2019-08-14 | 5 | 3 | 5 | 3 | 4 | 3 | 4 |
| rm-w-mobo-c__2019-10-03 | 2 | 2 | 3 | 2 | 2 | 2 | 2 |
| rm-w-mobo-c__2019-10-03 | 2 | 1 | 2 | 1 | 1 | 1 | 2 |
| smer-tcs8-mobo-c__2019-08-29 | 3 | 2 | 3 | 2 | 2 | 2 | 3 |
| wc-e-mobo-c__2019-09-25 | 7 | 5 | 7 | 5 | 4 | 4 | 5 |
| wc-s-mobo-c__2019-09-24 | 10 | 8 | 10 | 8 | 7 | 7 | 7 |
| Mean ± sd (Events 1–14) | 5.0 ± 2.8 | 3.3 ± 1.9 | 4.8 ± 2.0 | 3.5 ± 1.8 | 3.2 ± 1.6 | 3.0 ± 1.7 | 3.6 ± 1.7 |