1. Introduction
Forests are critical ecosystems that provide essential services, including biomass production, water conservation, carbon sequestration, and biodiversity preservation [
1,
2]. However, they are increasingly affected by wildfires [
3]. According to recent statistics, carbon emissions from forest wildfires reached about 2.2 Pg throughout the 2024–2025 worldwide wildfire period [
4]. Additionally, large-scale wildfire events have released substantial amounts of toxic particulate matter and chemical pollutants, causing long-term environmental impacts on the atmosphere, water, and soil systems [
5]. Wildfires also impose considerable direct and indirect economic losses [
6,
7]. Therefore, developing accurate and efficient early wildfire detection methods is important for mitigating fire spread, protecting ecological systems, and reducing socio-economic impacts.
Conventional forest fire monitoring approaches are mainly based on ground sensors and Internet of Things (IoT) techniques for early warning. Through the deployment of wireless sensor networks [
8] and air quality sensors [
9], these systems can support real-time environmental data acquisition and fire warning [
10]. Some studies have also incorporated infrared sensing to build multimodal ground monitoring systems [
11]. For large-area observation, satellite remote sensing is frequently adopted because of its wide coverage. Multi-source satellite images can be used to continuously track wildfire dynamics and assist large-scale fire assessment [
12,
13]. However, in complex terrain, ground-based systems often require high deployment and maintenance costs and may still leave monitoring blind areas. Meanwhile, satellite-based approaches are constrained by revisit intervals and spatial resolution. Consequently, both approaches still have difficulty achieving real-time detection and identifying small fires at an early stage.
To address these shortcomings, an increasing number of recent studies have investigated wildfire detection by integrating UAV remote sensing with computer vision techniques [
14]. These approaches offer flexible deployment and high-resolution observations and therefore present clear advantages over traditional large-scale monitoring methods [
15]. They also enable automated early detection at relatively low cost. Earlier studies in this field mainly depended on conventional machine learning methods, in which classifiers were usually built from handcrafted indicators such as flame spectral features, smoke texture, or environmental thresholds. Several vision-based strategies were explored. Ko et al. [
16] combined visual sensors with an SVM classifier, while Chitade et al. [
17] focused on fire segmentation using color-based k-means clustering. Habiboğlu et al. [
18] investigated video fire detection with covariance matrices and reported faster performance than earlier methods. A broader comparison of rule-based and machine learning-based approaches was later provided by Toulouse et al. [
19]. Dampage et al. [
20] integrated environmental sensor data into a regression-based detection framework, whereas Yang et al. [
21] proposed a Preferred Vector Machine (PVM) model to enhance detection accuracy and reduce false alarms. However, these methods depend strongly on handcrafted features, which are often less robust in complex forest environments and cannot learn deep spatial representations. As a result, their ability to detect small fires at an early stage remains limited, which can lead to missed detections and false alarms [
19,
21].
Recent progress in deep learning has made it possible to automatically extract features through convolutional neural networks (CNNs), thereby reducing dependence on manual feature engineering. Two-stage object detection methods, including Fast R-CNN [
22], Mask R-CNN [
23], and Faster R-CNN [
24], depend on region proposal mechanisms and therefore usually involve high computational cost and inference latency. Such characteristics make these models difficult to deploy on UAV platforms with limited computing resources. By comparison, one-stage object detection methods like YOLO [
25] and SSD [
26] are generally faster at inference because they predict object locations and categories in a single pass. Their speed advantage is particularly useful for UAV-based wildfire monitoring, where real-time response is important. Deep learning approaches have recently been widely applied to wildfire detection. Jiao et al. [
27] designed a YOLOv3-tiny algorithm for detecting fire on limited datasets. Hung et al. [
28] optimized detection performance by applying data augmentation and using a Deep Normalized Convolutional Neural Network (DNCNN). Models based on the Transformer architecture have also been introduced to capture global contextual information. For example, Ghali et al. [
29] introduced a Transformer-based detection and segmentation architecture, while Qiao et al. [
30] introduced FireFormer to reduce false alarms. Liu et al. [
31] proposed TFNet to improve multi-scale feature fusion. However, Transformer-based models often involve high computational complexity. To address this issue, recent studies have focused on lightweight model design. Liu et al. [
32] proposed MCAN-YOLO based on YOLOv7, while Han et al. [
33] introduced LUFFD-YOLO based on YOLOv8n. Jin et al. [
34] developed SWVR by incorporating GSConv, and Zhu et al. [
35] proposed YOLO-MP to improve efficiency for edge deployment. Recent work has also focused on improving model performance in challenging environments. For example, Zhou et al. [
36] studied robustness to occlusion, while Guo et al. [
37] dealt with scale variation through state space modeling. Han et al. [
38], in turn, introduced a multitask framework to better handle extreme conditions.
Despite these developments, current models still show two major limitations in practical UAV-based wildfire detection. First, accurately perceiving small and sparse targets remains difficult when computational resources are limited. From high-altitude UAV viewpoints, fire spots and smoke usually appear at very small scales. After model compression, deep features may be weakened during downsampling, which can cause critical target features to be lost. Consequently, accurate detection of early-stage fires becomes difficult. Second, existing models remain limited in robustness within complex forest environments and are not sufficiently adaptable to non-rigid targets. Background elements with fire-like visual characteristics can introduce false detections during feature extraction. Moreover, conventional attention mechanisms and bounding box regression strategies do not handle the dynamic and irregular morphology of wildfire spread well. This may result in unstable feature representations and lower localization accuracy.
To tackle these limitations, this study develops an improved YOLO26-based wildfire detection model, termed ASCA-YOLO. The major contributions of this paper are summarized as follows:
- (1) A Forest Wildfire Adaptive Multi-Scale Convolution (FWAMSConv) module is proposed to strengthen the extraction of multi-scale features for small and sparse targets while preserving computational efficiency.
- (2) A Forest Wildfire Sparse Contextual Saliency Attention (FWSCSAttention) mechanism is designed to characterize contextual feature distributions and suppress background interference.
- (3) A Forest Wildfire Adaptive Sparse-Aware IoU (FWASIoU) loss is developed to enhance regression of bounding box coordinates for non-rigid wildfire targets.
- (4) The proposed ASCA-YOLO model achieves a favorable balance between detection quality and computing cost, which makes it suitable for real-time UAV-based wildfire monitoring.
3. Materials
3.1. Dataset
To improve the performance and generalization ability of network training, this paper uses the forest-fire_dataset provided by Roboflow together with the large-scale public UAV remote-sensing forest fire dataset M4SFWD [
40]. M4SFWD mainly provides UAV/remote-sensing wildfire images with variations in forest terrain, illumination, weather, and fire-object distribution, while the Roboflow dataset provides additional heterogeneous fire and smoke images from public or real-world visual sources.
During dataset integration, all images were first converted to RGB format, and the original annotations were converted into YOLO format. Data cleaning was then performed to remove corrupted images, unreadable images, images with extremely low resolution, severely blurred images, images with unrecognizable targets, and images irrelevant to forest wildfire detection. Exact duplicates were detected using file hashing. Near-duplicate images were screened using perceptual hashing, and candidate image pairs with a Hamming distance no greater than 6 were manually reviewed. For identical or nearly identical images, only the image with better visual quality was retained. After cleaning and filtering, 7414 images were retained in the wildfire dataset, including 4312 images from M4SFWD and 3102 images from Roboflow. Representative examples are shown in
Figure 5.
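The duplicate screening described above can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: the source does not name the hashing algorithms, so SHA-256 is assumed for exact duplicates, and the perceptual hashes are assumed to be 64-bit integers produced by some external routine. The function names and example hash values are hypothetical.

```python
import hashlib
from itertools import combinations


def file_digest(data: bytes) -> str:
    """SHA-256 digest of raw file bytes; equal digests indicate exact duplicates."""
    return hashlib.sha256(data).hexdigest()


def hamming_distance(hash_a: int, hash_b: int) -> int:
    """Number of differing bits between two perceptual hashes stored as integers."""
    return bin(hash_a ^ hash_b).count("1")


def near_duplicate_candidates(phashes, max_distance=6):
    """Return image-name pairs whose hash distance is <= max_distance.

    These pairs are only candidates; the text states they were reviewed manually.
    """
    return [
        (name_a, name_b)
        for (name_a, ha), (name_b, hb) in combinations(sorted(phashes.items()), 2)
        if hamming_distance(ha, hb) <= max_distance
    ]


# Hypothetical 64-bit hashes (assumed to come from a perceptual-hash routine).
example_hashes = {
    "a.jpg": 0xFFFF0000FFFF0000,
    "b.jpg": 0xFFFF0000FFFF0003,  # differs from a.jpg in 2 bits
    "c.jpg": 0x0000FFFF0000FFFF,  # differs from a.jpg in all 64 bits
}
pairs = near_duplicate_candidates(example_hashes)
```

Here only `a.jpg` and `b.jpg` fall within the distance threshold of 6, so they would be the single pair passed on to manual review.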
Considering that UAV wildfire images may contain similar frames from the same scene, video sequence, or fire event, this study adopted a scene-level/event-level split rather than a simple image-level random split. Specifically, each image was assigned a scene/event group ID according to its data source, original folder, video sequence, scene background, fire event, and near-duplicate screening results. Images from the same video sequence, UAV flight scene, fire event, or near-duplicate group were assigned to only one subset. Finally, the 7414 images were organized into 486 scene/event groups and divided into training, validation, and test sets at an approximate ratio of 7:2:1, containing 5189, 1482, and 743 images, respectively. This splitting strategy was used to reduce the risk of temporal leakage and scene leakage during model evaluation. The scene-level split of the curated wildfire dataset is listed in
Table 1.
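A group-level split of this kind can be illustrated as follows. The sketch assumes each image already carries a scene/event group ID and uses a simple greedy assignment toward the 7:2:1 target; the source does not describe the actual assignment procedure, so the greedy rule and the function name `group_split` are assumptions.

```python
import random
from collections import defaultdict


def group_split(group_ids, ratios=(0.7, 0.2, 0.1), seed=0):
    """Assign whole groups to train/valid/test so that no group spans two subsets.

    group_ids[i] is the scene/event group of image i.
    Returns a dict mapping subset name -> list of image indices.
    """
    groups = defaultdict(list)
    for index, gid in enumerate(group_ids):
        groups[gid].append(index)

    # Shuffle reproducibly, then place larger groups first (stable sort keeps
    # the shuffled order among groups of equal size).
    ordered = sorted(groups)
    random.Random(seed).shuffle(ordered)
    ordered.sort(key=lambda gid: len(groups[gid]), reverse=True)

    names = ("train", "valid", "test")
    targets = [ratio * len(group_ids) for ratio in ratios]
    subsets = {name: [] for name in names}
    for gid in ordered:
        # Choose the subset that is furthest below its target image count.
        deficits = [targets[k] - len(subsets[names[k]]) for k in range(3)]
        chosen = names[deficits.index(max(deficits))]
        subsets[chosen].extend(groups[gid])
    return subsets


# Toy example: 100 images in 20 groups of 5 images each.
toy_groups = [i // 5 for i in range(100)]
split = group_split(toy_groups)
```

Because assignment is done per group, the realized image ratio is only approximately 7:2:1, which matches the "approximate ratio" wording in the text.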
All retained images were rechecked and re-annotated using LabelImg 1.8.6 into two categories: “fire” and “smoke”. Visible flame regions were labeled as fire, and visible smoke regions were labeled as smoke. Fire and smoke appearing in the same image were annotated separately. During preprocessing, all images were adjusted to a resolution of 640 × 640 pixels. Depending on the specific UAV flight altitude and camera parameters, this resolution corresponds to an approximate ground coverage area ranging from 50 × 50 m² to 200 × 200 m².
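For readers unfamiliar with the YOLO annotation format mentioned above, the following sketch converts a pixel-coordinate box into a normalized YOLO label line. The class ordering (`fire` = 0, `smoke` = 1) and the helper names are assumptions for illustration; the source does not state the class indices.

```python
CLASS_IDS = {"fire": 0, "smoke": 1}  # assumed ordering, not given in the source


def to_yolo(box_xyxy, image_width, image_height):
    """Convert a pixel box (x_min, y_min, x_max, y_max) to YOLO (cx, cy, w, h) in [0, 1]."""
    x_min, y_min, x_max, y_max = box_xyxy
    cx = (x_min + x_max) / 2 / image_width
    cy = (y_min + y_max) / 2 / image_height
    w = (x_max - x_min) / image_width
    h = (y_max - y_min) / image_height
    return cx, cy, w, h


def yolo_line(label, box_xyxy, image_width, image_height):
    """Format one annotation as '<class_id> <cx> <cy> <w> <h>'."""
    values = to_yolo(box_xyxy, image_width, image_height)
    return " ".join([str(CLASS_IDS[label])] + [f"{v:.6f}" for v in values])


# Hypothetical box in a 1920 x 1080 frame.
line = yolo_line("fire", (480, 270, 960, 810), 1920, 1080)
```

Since the coordinates are normalized by image size, they stay valid when an image is stretched to 640 × 640. At that size, the stated ground coverage of 50 m to 200 m per side corresponds to roughly 0.08 m to 0.31 m per pixel.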
3.2. Model Running Environment and Parameter Settings
To improve reproducibility, all training settings are specified as follows. All models were trained and evaluated on Windows 11 with an Intel Core i5-14600KF CPU, 32 GB RAM, and an NVIDIA GeForce RTX 5060 Ti GPU with 16 GB VRAM. The software environment included Python 3.10.19, PyTorch 2.10.0, and CUDA 12.8. All images were resized to 640 × 640 pixels. The batch size was 4, and the number of training epochs was 100. Pretrained weights were used for initialization.
MuSGD was adopted as the optimizer, with , , , and . The warm-up stage lasted for 3 epochs, with warm-up momentum of 0.8 and warm-up bias learning rate of 0.1. Cosine learning rate scheduling was not used. The augmentation strategy included HSV augmentation with , , and , translation of 0.1, scaling of 0.5, horizontal flipping with a probability of 0.5, and Mosaic augmentation with a probability of 1.0. Mosaic augmentation was disabled during the last 10 epochs, while Mixup, CutMix, copy-paste, vertical flipping, rotation, shear, and perspective transformation were not used.
The random seed was set to 0, and deterministic training was enabled. Automatic mixed precision was disabled. During validation, the IoU threshold was 0.7, and the maximum number of detections per image was 300. Since YOLO26 adopts an NMS-free end-to-end detection paradigm, no additional conventional NMS was used. Input images were normalized by scaling pixel values from [0, 255] to [0, 1]. The same data split and training settings were used for all comparative and ablation experiments.
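The settings stated in this subsection can be collected into a single configuration for reference. The sketch below is a plain dictionary, not a call into any library: the key names follow Ultralytics-style argument names as an assumption, since the source does not name the training framework. Values that are missing from the text (the optimizer hyperparameters and the HSV gains) are deliberately left out rather than guessed.

```python
# Settings transcribed from the text; key names are illustrative assumptions.
TRAIN_SETTINGS = {
    "imgsz": 640,
    "batch": 4,
    "epochs": 100,
    "pretrained": True,
    "optimizer": "MuSGD",
    "warmup_epochs": 3,
    "warmup_momentum": 0.8,
    "warmup_bias_lr": 0.1,
    "cos_lr": False,
    # Augmentations that were used.
    "translate": 0.1,
    "scale": 0.5,
    "fliplr": 0.5,
    "mosaic": 1.0,
    "close_mosaic": 10,  # Mosaic disabled during the last 10 epochs
    # Augmentations that were not used.
    "mixup": 0.0,
    "cutmix": 0.0,
    "copy_paste": 0.0,
    "flipud": 0.0,
    "degrees": 0.0,
    "shear": 0.0,
    "perspective": 0.0,
    # Reproducibility.
    "seed": 0,
    "deterministic": True,
    "amp": False,
}

VAL_SETTINGS = {"iou": 0.7, "max_det": 300}


def normalize_pixels(values):
    """Scale 8-bit pixel values from [0, 255] to [0, 1], as described in the text."""
    return [v / 255.0 for v in values]


mosaic_epochs = TRAIN_SETTINGS["epochs"] - TRAIN_SETTINGS["close_mosaic"]
```

With these values, Mosaic augmentation is active for the first 90 of the 100 epochs.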
3.3. Evaluation Metrics
We assessed the proposed model in terms of both detection performance and computational cost, using nine metrics: precision, recall, mAP50, mAP50-95, parameter count, floating point operations (FLOPs), frames per second (FPS), missed detection rate (MDR), and false alarm rate (FAR). The specific descriptions of each metric are as follows.
Precision (P) measures the proportion of samples predicted as positive by the model that are true positives, reflecting the model's ability to avoid false detections. Its calculation formula is:
$$P = \frac{TP}{TP + FP}$$
where $TP$ denotes the number of true targets correctly detected by the model, and $FP$ denotes the number of background regions incorrectly identified as targets.
Recall (R) represents the proportion of all true targets that are accurately detected by the model, indicating the model’s capability to reduce missed detections. It is calculated as:
$$R = \frac{TP}{TP + FN}$$
where $TP$ denotes the number of true targets correctly detected by the model, and $FN$ denotes the number of true targets missed by the model.
mAP50 is the mean Average Precision calculated over all target classes at an IoU threshold of 0.5, and it reflects the overall quality of the model in target recognition and localization. It is expressed as:
$$\mathrm{mAP50} = \frac{1}{N} \sum_{i=1}^{N} \int_{0}^{1} P_i(R)\, dR$$
where $N$ is the total number of target categories in the dataset (in this study, $N = 2$), and $P_i(R)$ denotes the smoothed precision-recall curve function for the $i$-th category.
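The area under a smoothed precision-recall curve can be computed numerically as sketched below. The source does not specify the interpolation scheme, so this example assumes the common all-point interpolation, in which precision at each recall level is replaced by the maximum precision at any higher recall. The function names and the toy curve are hypothetical.

```python
def average_precision(recalls, precisions):
    """Area under the precision-recall curve after monotone smoothing.

    recalls must be sorted in increasing order, with precisions aligned to them.
    """
    r = [0.0] + list(recalls) + [1.0]
    p = [0.0] + list(precisions) + [0.0]
    # Smooth: each precision becomes the maximum precision at any higher recall.
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    # Sum the rectangle areas between consecutive recall values.
    return sum((r[i + 1] - r[i]) * p[i + 1] for i in range(len(r) - 1))


def mean_average_precision(per_class_ap):
    """Mean of per-class AP values (two classes in this study: fire and smoke)."""
    return sum(per_class_ap) / len(per_class_ap)


# Toy curve: precision 1.0 up to recall 0.5, then 0.5 up to recall 1.0.
ap_example = average_precision([0.5, 1.0], [1.0, 0.5])
map_example = mean_average_precision([ap_example, 0.25])
```

For the toy curve, the area is 0.5 × 1.0 + 0.5 × 0.5 = 0.75, and averaging it with a second class AP of 0.25 gives 0.5.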
mAP50-95 is the arithmetic mean of the mAP values at 10 IoU thresholds ranging from 0.5 to 0.95 in increments of 0.05, which serves as a rigorous assessment of the model’s high-precision bounding box regression capability. Its calculation formula is:
$$\mathrm{mAP50\text{-}95} = \frac{1}{10} \sum_{t \in \{0.50,\, 0.55,\, \ldots,\, 0.95\}} \mathrm{mAP}_t$$
where $\mathrm{mAP}_t$ denotes the mean Average Precision at the IoU threshold $t$.
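To make the role of the IoU threshold concrete, the sketch below computes the IoU of two axis-aligned boxes and lists which of the ten thresholds a prediction would satisfy. It is an illustrative helper under the usual definition of IoU (intersection area over union area); the function names are hypothetical, and the boxes are made-up examples.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


# The ten thresholds 0.50, 0.55, ..., 0.95 used by mAP50-95.
IOU_THRESHOLDS = [round(0.5 + 0.05 * k, 2) for k in range(10)]


def thresholds_passed(iou_value):
    """Thresholds at which a prediction with this IoU counts as a match."""
    return [t for t in IOU_THRESHOLDS if iou_value >= t]


def map50_95(map_per_threshold):
    """Average the per-threshold mAP values over the ten thresholds."""
    return sum(map_per_threshold[t] for t in IOU_THRESHOLDS) / len(IOU_THRESHOLDS)


# A prediction covering 60% of a 10 x 10 ground-truth box, fully inside it.
overlap = iou((0, 0, 10, 10), (0, 0, 10, 6))
passed = thresholds_passed(0.62)
```

A box with IoU around 0.6 counts as correct only at the lowest thresholds, which is why mAP50-95 penalizes loose localization far more than mAP50 does.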
Parameter count (Par) measures the total number of learnable parameters in the network architecture, reflecting the storage requirement of the model on edge devices and its space complexity.
Floating point operations (FLOPs) measure the number of floating point operations required by the model to process a single input image, reported in units of 10⁹ (G), and reflect the computational complexity and potential inference speed of the model.
Frames per second (FPS) is used to measure the number of images that the model can process per second during inference, directly reflecting the real-time detection capability of the model under a given hardware environment. A higher FPS indicates faster inference speed and better potential for real-time UAV-based wildfire monitoring.
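Missed detection rate (MDR) is used to measure the proportion of true targets that are not detected by the model, reflecting the missed detection risk of the model in wildfire monitoring. Its calculation formula is:
$$\mathrm{MDR} = \frac{FN}{TP + FN} = 1 - R$$
where $TP$ and $FN$ are the numbers of true targets that are correctly detected and missed by the model, respectively.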
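False alarm rate (FAR) is used to measure the proportion of false positive detections among all predicted positive detections, reflecting the false alarm risk of the model under complex background conditions. Its calculation formula is:
$$\mathrm{FAR} = \frac{FP}{TP + FP} = 1 - P$$
where $TP$ and $FP$ are the numbers of correct detections and of background regions incorrectly identified as targets, respectively.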
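The four count-based metrics are closely linked, as the short sketch below shows. It assumes the standard definitions in terms of true positives (TP), false positives (FP), and false negatives (FN); the counts in the example are hypothetical and are not taken from the experiments.

```python
def detection_rates(tp, fp, fn):
    """Precision, recall, missed detection rate, and false alarm rate from counts."""
    return {
        "P": tp / (tp + fp),
        "R": tp / (tp + fn),
        "MDR": fn / (tp + fn),  # equals 1 - R
        "FAR": fp / (tp + fp),  # equals 1 - P
    }


# Hypothetical counts: 80 correct detections, 20 false alarms, 20 missed targets.
rates = detection_rates(tp=80, fp=20, fn=20)
```

Because MDR and FAR are the complements of recall and precision, the reported values can be cross-checked directly: for example, the YOLO26 recall of 0.765 corresponds to an MDR of 0.235, and its precision of 0.827 corresponds to a FAR of 0.173.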
4. Results and Discussion
4.1. Comparative Experiments of Different Modules
4.1.1. Comparative Experiment of Convolutional Modules
To evaluate the effectiveness of Forest Wildfire Adaptive Multi-Scale Convolution (FWAMSConv) in extracting small-scale features under limited computational cost, comparative experiments are conducted against the baseline YOLO26 and several representative convolutional modules, including DWConv, GhostConv, PConv, and FSConv. The quantitative results are presented in
Table 2.
Due to structural limitations, conventional convolutional modules often face challenges in balancing computational efficiency and fine-grained feature representation. DWConv reduces computational cost through channel-wise decomposition, lowering FLOPs to 3.6 G. However, its limited receptive field restricts contextual modeling, which may lead to the loss of small-scale fire features during deep feature extraction. As a result, the recall decreases to 0.748. GhostConv reduces parameter redundancy by generating feature maps through linear transformations, decreasing the parameter count to 2.050 M. However, this mechanism may weaken fine-grained edge representations, particularly for small flame regions, which can limit detection performance. Consequently, mAP50 remains at 0.838. PConv reduces memory access by applying convolution to only a subset of the input channels. However, this partial channel processing strategy may limit cross-channel feature interaction. In this study, it also increases rather than reduces the computational cost, with FLOPs reaching 6.0 G. FSConv introduces frequency-spatial feature fusion to improve representation capability. However, the increased computational complexity leads to FLOPs of 6.3 G. In addition, its limited ability to suppress background interference may result in performance degradation, with recall decreasing to 0.735 and mAP50 to 0.819.
In contrast, FWAMSConv improves feature representation through channel compression and parallel multi-scale depthwise separable branches. The model achieves a parameter count of 1.870 M, FLOPs of 4.2 G and FPS of 119.7, indicating improved efficiency. At the same time, P, R, mAP50, and mAP50-95 reach 0.843, 0.784, 0.861, and 0.538, respectively. The improved recall indicates that the model becomes more effective in detecting small fire spots and slender smoke, including small-scale instances with bounding boxes smaller than 32 × 32 pixels. Correspondingly, the MDR decreases from 0.235 for YOLO26 to 0.216 for YOLO26 + FWAMSConv, further indicating that FWAMSConv helps reduce missed detections of weak wildfire targets. Taken together, these results demonstrate that FWAMSConv achieves a better balance between computational cost and detection performance in complex wildfire scenarios.
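The efficiency argument for depthwise separable branches can be illustrated with a simple parameter count. The sketch below is not the FWAMSConv implementation; it only compares a standard k × k convolution with a depthwise convolution followed by a 1 × 1 pointwise convolution, ignoring biases. The layer sizes are hypothetical.

```python
def standard_conv_params(c_in, c_out, k):
    """Weights in a standard k x k convolution (bias ignored)."""
    return c_in * c_out * k * k


def depthwise_separable_params(c_in, c_out, k):
    """Weights in a k x k depthwise convolution plus a 1 x 1 pointwise convolution."""
    depthwise = c_in * k * k
    pointwise = c_in * c_out
    return depthwise + pointwise


# Hypothetical layer: 64 input channels, 64 output channels, 3 x 3 kernel.
standard = standard_conv_params(64, 64, 3)
separable = depthwise_separable_params(64, 64, 3)
reduction = standard / separable
```

For this layer the standard convolution has 36,864 weights and the separable version 4,672, a reduction of about 7.9 times, which is why several parallel separable branches can remain cheaper than a single standard convolution.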
To further examine the feature extraction behavior of different convolutional modules, Grad-CAM is employed to visualize deep feature responses. An image containing multiple small fire spots and slender smoke structures is selected to compare the attention distributions of the different models, and the corresponding results are presented in
Figure 6.
The visualization results indicate that YOLO26 and the models using conventional convolutional modules exhibit relatively scattered activation patterns. Some small fire regions, especially those located near the image boundaries, are not highlighted consistently. By contrast, the model integrated with FWAMSConv produces more concentrated activation responses around small flame regions and thin smoke structures. This finding implies that FWAMSConv enhances feature localization for small-scale targets. Overall, the visualization results are in agreement with the quantitative results, showing that the proposed module improves feature representation for small and sparse wildfire targets while preserving computational efficiency.
4.1.2. Comparative Experiment of Attention Mechanisms
To evaluate the effectiveness of FWSCSAttention in complex environments, comparative experiments are conducted with the baseline YOLO26 and several representative attention mechanisms, including CBAM, LCA, DynamicSpatialAttention, and HPAttention. The results are presented in
Table 3.
As shown in
Table 3, conventional attention mechanisms do not consistently improve detection performance in complex forest environments. CBAM uses max pooling, alongside average pooling, to infer spatial attention. However, strong local responses caused by background interference may be amplified by the max pooling operation, leading to incorrect activations. This results in decreased performance, with P and R dropping to 0.818 and 0.759, respectively. LCA and DynamicSpatialAttention focus on modeling local spatial relationships or dynamic convolutional features. However, these methods may exhibit limited generalization when dealing with diverse wildfire scenarios. In particular, DynamicSpatialAttention shows a decrease in recall to 0.742, indicating reduced sensitivity to true fire targets. HPAttention introduces hierarchical pooling for feature fusion. However, it does not sufficiently separate fire-related features from background interference, which may introduce redundant feature responses. As a result, mAP50 decreases to 0.839.
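The max pooling argument made for CBAM can be illustrated with a toy feature map. The sketch below is not CBAM itself; it only contrasts channel-wise max pooling with channel-wise average pooling at three spatial positions, using made-up activation values, to show how a single strongly responding channel at a background position dominates the max-pooled map.

```python
def channel_pool(feature_map):
    """Pool over channels; feature_map[c][i] is channel c at spatial position i."""
    positions = range(len(feature_map[0]))
    max_pool = [max(channel[i] for channel in feature_map) for i in positions]
    avg_pool = [
        sum(channel[i] for channel in feature_map) / len(feature_map)
        for i in positions
    ]
    return max_pool, avg_pool


# Hypothetical activations: 4 channels x 3 positions.
# Position 0: true fire (all channels respond moderately).
# Position 1: plain background (all channels weak).
# Position 2: fire-like clutter (one channel spikes, the others stay weak).
features = [
    [0.6, 0.1, 0.9],
    [0.6, 0.1, 0.1],
    [0.6, 0.1, 0.1],
    [0.6, 0.1, 0.1],
]
max_pool, avg_pool = channel_pool(features)
```

In the max-pooled map the clutter position scores higher than the true fire position, whereas the averaged map ranks them the other way around. This is the kind of amplified local response the paragraph above attributes to max pooling.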
In contrast, FWSCSAttention enhances feature discrimination through contextual statistical modeling. The model reaches 0.841 in precision, 0.786 in recall, 0.861 in mAP50, and 0.540 in mAP50-95. The gain in precision indicates fewer false positives when the background is complex. Correspondingly, the FAR decreases from 0.173 for YOLO26 to 0.159 for YOLO26 + FWSCSAttention, further confirming that FWSCSAttention can reduce false alarms caused by fire-like background interference. This suggests that the attention mechanism helps the model remain more stable in challenging scenes. To further examine the behavior of different attention mechanisms, the corresponding visualization results are presented in
Figure 7.
The visualization results indicate that YOLO26 and the models using conventional attention mechanisms generate noticeable activations in non-fire regions, revealing their sensitivity to background interference. In contrast, the model equipped with FWSCSAttention exhibits more concentrated responses in actual fire and smoke regions while suppressing irrelevant background activations. This suggests stronger feature discrimination in complex scenes.
4.1.3. Comparative Experiment of Loss Functions
To evaluate the performance of FWASIoU in dealing with non-rigid targets, comparative experiments are performed with the baseline YOLO26 and several representative loss functions, including SDIoU, SIoU, WIoU, and MPDIoU. The corresponding results are reported in
Table 4.
Since the loss function mainly affects the training process, the parameter count and FLOPs remain unchanged across all models. Conventional IoU-based loss functions often struggle to handle the irregular and dynamic shapes of wildfire targets. As a result, performance degradation is observed in several cases. SDIoU and MPDIoU improve bounding box alignment by introducing distance-based constraints. However, wildfire targets often lack stable geometric reference points due to their irregular shapes. As a result, these methods show limited adaptability, leading to mAP50 values of 0.838 and 0.834, respectively. SIoU introduces an angle-based constraint to improve convergence. However, this rigid geometric constraint may not be well suited to highly dynamic and amorphous wildfire targets. Consequently, recall decreases to 0.748, indicating reduced localization stability. WIoU adopts a dynamic focusing mechanism to balance gradients between easy and hard samples. However, without explicit geometric adaptation for target deformation, the regression process may become unstable, resulting in a decrease in precision to 0.822.
In contrast, FWASIoU introduces multiple constraints, including center stability, bounding consistency, and morphological adaptation. The model achieves P, R, mAP50, and mAP50-95 of 0.854, 0.779, 0.866, and 0.542, respectively. The improvement in mAP50-95 suggests more stable localization for irregular wildfire targets.
To further analyze the effect of different loss functions, Grad-CAM is used to visualize feature responses for non-rigid wildfire targets. An image containing diffused smoke and irregular fire structures is selected, and the results are presented in
Figure 8.
The visualization results indicate that YOLO26 and the models using conventional loss functions exhibit dispersed activation patterns, with limited alignment to irregular target boundaries. In contrast, the model using Forest Wildfire Adaptive Sparse-Aware IoU shows more concentrated responses along the contours of fire and smoke. The result points to better localization performance for non-rigid targets.
4.2. Ablation Experiment
We conducted ablation experiments on the wildfire dataset to assess the role of each module. The results are listed in
Table 5.
Using YOLO26 alone (Model 1), the model reaches 0.827 precision, 0.765 recall, 0.846 mAP50, and 0.512 mAP50-95, leaving clear room for further improvement in both accuracy and stability. After adding FWAMSConv (Model 2), recall and mAP50 rise to 0.784 and 0.861, while both FLOPs and parameter count drop, showing that this module improves efficiency and benefits small-target detection. With only FWSCSAttention (Model 3), mAP50-95 increases to 0.540, which reflects better resistance to background interference. Using FWASIoU alone (Model 4) also leads to gains in precision and mAP, pointing to more accurate localization for non-rigid targets.
The paired combinations also show clear interaction effects. FWAMSConv together with FWSCSAttention (Model 5) improves both precision and recall. FWAMSConv combined with FWASIoU (Model 6) strengthens small-target detection and localization, although its resistance to background clutter is still limited. FWSCSAttention plus FWASIoU (Model 7) further improves robustness and localization, but the computational burden remains comparatively high.
The best overall result appears in Model 8, where all three modules are used together. In this setting, ASCA-YOLO achieves 0.870 precision, 0.809 recall, 0.895 mAP50, 0.578 mAP50-95 and 122.5 FPS. Compared with YOLO26, all evaluation metrics improve, while FLOPs and parameter count are both reduced. In addition, ASCA-YOLO reduces the MDR to 0.191 and the FAR to 0.130, indicating fewer missed detections and false alarms under the joint effect of the three modules. This indicates that the three modules work well together and enhance detection performance without increasing computational cost.
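The absolute gains of the full model over the baseline follow directly from the values reported above, as the short calculation below shows. The numbers are the ones stated in the text for Model 1 (YOLO26) and Model 8 (ASCA-YOLO); the helper name `gains` is hypothetical.

```python
BASELINE = {"P": 0.827, "R": 0.765, "mAP50": 0.846, "mAP50-95": 0.512}
ASCA_YOLO = {"P": 0.870, "R": 0.809, "mAP50": 0.895, "mAP50-95": 0.578}


def gains(baseline, improved):
    """Absolute metric differences, rounded to three decimals."""
    return {key: round(improved[key] - baseline[key], 3) for key in baseline}


deltas = gains(BASELINE, ASCA_YOLO)

# MDR and FAR are the complements of recall and precision.
mdr = round(1 - ASCA_YOLO["R"], 3)
far = round(1 - ASCA_YOLO["P"], 3)
```

The differences are 0.043 in precision, 0.044 in recall, 0.049 in mAP50, and 0.066 in mAP50-95, and the complements 0.191 and 0.130 agree with the reported MDR and FAR.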
4.3. Comparative Experiment with Other Detection Models
We further compared the proposed model with several widely used object detection methods, and the results are shown in
Table 6.
Faster R-CNN, as a typical two-stage detector, relies on region proposals before final prediction. This design helps improve localization, but it also brings a heavy computational burden. In our experiments, Faster R-CNN reaches the highest recall at 0.859, yet its 181.4 G FLOPs and 41.3 M parameters make it less practical for UAV platforms with limited onboard resources. SSD detects objects from multi-scale feature maps, but its representation of small targets is still limited. This is reflected in its recall of 0.565, suggesting weaker sensitivity to small wildfire regions. RetinaNet achieves a high precision of 0.876 by using Focal Loss to alleviate class imbalance, but its computational cost remains high, with FLOPs reaching 256.9 G. RT-DETR benefits from global feature interaction through self-attention. However, its computational complexity increases with image resolution, which may result in high resource consumption when processing UAV-based remote sensing data.
Single-stage YOLO-based models provide advantages in computational efficiency due to their end-to-end architectures. Even so, conventional YOLO variants still depend on standard convolution and IoU-based loss design, which can limit their performance when small targets and complex backgrounds appear at the same time. This is also reflected in their relatively constrained precision and mAP50-95.
By comparison, ASCA-YOLO improves both accuracy and efficiency. It achieves 0.895 mAP50 and 0.578 mAP50-95, while reducing FLOPs and parameter count to 4.2 G and 1.870 M. In addition, ASCA-YOLO reaches an inference speed of 122.5 FPS, indicating that the model can maintain high detection accuracy while providing sufficient real-time processing capability. Among the compared models, it offers a more balanced trade-off between detection quality and computational cost. Although it is not the best in every single metric, its overall behavior is more stable and better suited to practical UAV wildfire monitoring.
To further examine model behavior in difficult cases, we also provide qualitative results on representative test images. Two typical scenarios are considered: small fire targets and dispersed smoke under dark background conditions. The results are shown in
Figure 9 and
Figure 10.
In Figure (a), which contains small and sparse fire targets, several models exhibit missed detections to varying degrees. In contrast, ASCA-YOLO is able to detect most of the visible fire and smoke regions. This suggests improved sensitivity to small-scale targets and fewer missed detections overall. In Figure (b), which contains complex background interference, several models produce either missed detections or false detections. In comparison, ASCA-YOLO shows more consistent detection results, with fewer false responses in non-fire regions and improved identification of smoke targets. Overall, the qualitative and quantitative results are consistent. The proposed model improves the recognition of small wildfire targets and reduces the impact of background interference, while maintaining computational efficiency. These results indicate that ASCA-YOLO is suitable for UAV-based wildfire detection in complex environments.
4.4. Deployment Test on a Raspberry Pi Platform
To further assess the practical deployment potential of ASCA-YOLO in resource-constrained UAV scenarios, a brief deployment test was conducted on a Raspberry Pi platform. Since inference performance on embedded hardware may differ considerably from that on a desktop GPU, this experiment was intended to provide a preliminary device-level evaluation of runtime efficiency and resource consumption.
In this experiment, the trained YOLO26 and ASCA-YOLO models were deployed on a Raspberry Pi 4 platform equipped with 8 GB RAM and running Raspberry Pi OS (64-bit). PyTorch was used as the inference framework, and the input resolution was fixed at 640 × 640, consistent with the settings used in the main experiments. During evaluation, single-image inference with a batch size of 1 was adopted. The reported indicators included average inference latency per image, frames per second (FPS), and peak memory usage. Each result was obtained by averaging over 200 test images.
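A measurement protocol of this kind can be sketched as a simple timing loop. The example below is framework-agnostic: `infer` stands for any single-image inference callable, and a dummy function is used in place of a real model. The warm-up step is an assumption and is not mentioned in the source, and peak memory measurement is omitted because it depends on the platform tooling.

```python
import time


def benchmark(infer, inputs, warmup=5):
    """Average per-item latency (ms) and throughput (FPS) at batch size 1."""
    # Warm-up runs are excluded from timing (assumed, not stated in the source).
    for item in inputs[:warmup]:
        infer(item)
    latencies_ms = []
    for item in inputs:
        start = time.perf_counter()
        infer(item)
        latencies_ms.append((time.perf_counter() - start) * 1000.0)
    mean_ms = sum(latencies_ms) / len(latencies_ms)
    return {"latency_ms": mean_ms, "fps": 1000.0 / mean_ms}


def fps_from_latency(latency_ms):
    """Throughput implied by an average per-image latency at batch size 1."""
    return 1000.0 / latency_ms


def dummy_infer(image):
    """Stand-in for a model forward pass; does a fixed amount of work."""
    return sum(range(10000))


# 200 placeholder inputs, matching the number of test images averaged in the text.
result = benchmark(dummy_infer, list(range(200)))
```

At batch size 1, FPS is the reciprocal of the mean latency, so an average latency of about 950 ms per image corresponds to roughly 1.05 FPS.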
The deployment results are presented in
Table 7. On the Raspberry Pi 4 platform, YOLO26 achieved an average inference latency of 952.4 ms per image and a throughput of 1.05 FPS, with a peak memory usage of 735 MB. In comparison, ASCA-YOLO achieved an average inference latency of 826.7 ms per image and a throughput of 1.21 FPS, while reducing peak memory usage to 668 MB. These results indicate that the proposed model imposes a lower deployment burden than the baseline model under embedded conditions. Although the inference speed on the Raspberry Pi 4 remained significantly lower than that observed on the desktop GPU, ASCA-YOLO was still able to perform stable forward inference within limited hardware resources.
The preliminary deployment results on Raspberry Pi 4 show that ASCA-YOLO can run stably on an embedded platform with limited hardware resources. These results further support its deployment feasibility for UAV-based wildfire monitoring and demonstrate its potential engineering application value on edge-oriented devices.
5. Conclusions
This study addresses two key challenges in UAV-based forest wildfire detection: the difficulty of capturing small-scale features under limited computational resources and the limited robustness to background interference and non-rigid target structures. To this end, an improved single-stage detection model, ASCA-YOLO, is proposed. After adding FWAMSConv, FWSCSAttention, and FWASIoU, the model enhances feature representation, suppresses background interference, and improves bounding box regression.
The results show that ASCA-YOLO provides a better trade-off between detection performance and computational cost. Compared with the baseline and other representative detectors, it improves the main evaluation results while keeping both parameter count and FLOPs lower. The higher FPS corresponds to lower per-image inference latency, which is beneficial for timely fire and smoke detection in UAV-based monitoring. The gains in recall and mAP50-95 also suggest better sensitivity to small wildfire targets and better adaptation to irregular fire and smoke patterns.
The proposed method performs well in the current experiments and has also shown preliminary deployment feasibility on the Raspberry Pi 4 platform. However, the present deployment validation remains limited to a brief embedded test. Future work will focus on further deployment and optimization on the NVIDIA Jetson Nano platform, with particular attention paid to real-time performance under practical constraints such as limited power supply, varying environmental conditions, and dynamic flight scenarios. These efforts are expected to support the development of reliable and efficient UAV-based wildfire monitoring systems.