1. Introduction
As core areas for ecological conservation and resource management, national parks require robust ecological monitoring and security management to maintain biodiversity and mitigate ecological risks [1]. Consequently, the efficient detection of three specific targets has become a critical management priority: the early identification of pine wilt disease—caused by the pine wood nematode Bursaphelenchus xylophilus—to contain its spread, the timely detection of forest fires to minimize damage, and the precise monitoring of under-construction farmhouses to prevent encroachment on ecological protection zones. While drone technology has enabled large-scale, high-frequency monitoring for these purposes, the high-resolution imagery it acquires presents unique challenges in complex forest environments [2]. Moreover, operational UAV patrols in national parks impose strict constraints on inference speed and computational cost: limited battery life and onboard computing resources require detection models to be both lightweight and real-time. Current mainstream detectors struggle to balance accuracy and efficiency under such deployment conditions, making task-specific optimization a necessity.
The distinct challenges of multi-scene detection in national parks are reflected in three main aspects. First, the background is highly complex: dense forest vegetation and dramatic light-shadow variations often cause targets to be obscured by foliage or to blend into their surroundings. Second, target characteristics vary considerably. Pine wilt disease-infected trees and partially constructed farmhouses represent regular, static targets whose shapes remain relatively stable but are easily affected by background interference. In contrast, forest fires are irregular, dynamic targets—fire hotspots change shape rapidly over time, requiring bounding-box regression to adjust dynamically to variations in form and orientation. Third, scale variation is substantial, ranging from small-scale objects such as individual diseased trees to large-scale regions like extensive wildfires, necessitating a model with robust multi-scale feature extraction capabilities.
In recent years, deep learning has driven rapid advancements in object detection. CNN-based models [3] (e.g., Faster R-CNN and the YOLO series) and transformer-based algorithms [4] have demonstrated excellent performance in general scenarios. Among these, YOLOv8 achieves a balance between detection speed and accuracy through its C2f module, path-aggregation feature pyramid network (PA-FPN), and decoupled detection head, making it a mainstream choice. However, its limitations become increasingly evident in the complex environments of national parks. First, the backbone network inadequately captures global features, making it difficult to detect small objects (e.g., early-stage diseased trees) within extensive forest backgrounds. Second, the feature fusion mechanism lacks dynamic focusing capability [5]: traditional concatenation operations cannot adjust weights based on target characteristics, causing important features—such as farmhouse edges or wildfire cores—to be diluted by redundant information. Finally, the CIoU loss function is poorly adapted to irregular and complex targets: in bounding-box regression for wildfires and other non-uniform objects, considering only overlap area, center distance, and aspect ratio fails to accommodate dynamically changing shapes, thereby limiting localization accuracy.
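The dilution effect of plain concatenation can be contrasted with content-guided weighting in a minimal sketch. This is a toy illustration only, not the CGA module described later in the paper: the mean-activation scoring and softmax gating used here are simplifying assumptions chosen to show why feature-derived weights keep a salient branch (e.g., a wildfire core) from being averaged away.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def concat_fusion(feat_a, feat_b):
    """Plain concatenation: both branches contribute equally, so a salient
    channel in one branch is diluted by redundant channels in the other."""
    return feat_a + feat_b  # channel-wise concatenation (list append)

def content_guided_fusion(feat_a, feat_b):
    """Toy content-guided fusion: derive a weight per branch from the mean
    activation of that branch (its 'content'), then blend accordingly."""
    score_a = sum(feat_a) / len(feat_a)
    score_b = sum(feat_b) / len(feat_b)
    w_a, w_b = softmax([score_a, score_b])
    return [w_a * a + w_b * b for a, b in zip(feat_a, feat_b)]

# A branch with strong responses (e.g., a wildfire core) vs. a flat one.
strong = [4.0, 5.0, 6.0]
weak = [0.1, 0.2, 0.1]
fused = content_guided_fusion(strong, weak)
# The fused map stays close to the strong branch instead of averaging it away.
```

The real CGA module operates on spatial attention maps rather than scalar branch scores, but the principle is the same: fusion weights are computed from the features themselves instead of being fixed by the architecture.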
In addition, both the YOLO family and transformer-based detectors such as DETR have demonstrated strong performance in general vision tasks. However, the high computational demands of transformers and their insensitivity to long-tail small objects limit their suitability as substitutes for lightweight one-stage models like YOLO in current UAV edge-deployment scenarios. Rey et al. evaluated YOLOv8n/YOLOv8s on the Jetson Orin NX and Raspberry Pi 5 (RPI5) for UAV deployment: INT8-quantized YOLOv8n achieved 65 FPS on the Orin NX, meeting onboard real-time requirements, whereas the RPI5 failed to meet latency constraints. These findings highlight the critical edge–cloud trade-offs for UAV deployment [6]. Hua and Chen reviewed deep learning-based small object detection in aerial images, systematically covering CNN and Transformer paradigms—including DETR and its variants (e.g., AO2-DETR, Hyneter)—for UAV-borne vision tasks. Their survey confirms that while Transformer-based detectors have been extensively explored in aerial scenarios, their deployment on resource-constrained edge devices remains challenging [7]. Together, these findings justify, on paradigm trade-off grounds, the selection of lightweight one-stage models such as YOLO for real-time UAV detection under operational constraints.
To address these challenges, researchers have continuously customized and improved YOLO and other network models for different application requirements, and these efforts have yielded positive results across diverse scenarios. Back et al. proposed a drone detection model integrating Mamba and attention mechanisms. By introducing the SSM core, attention modules, PAFPN, and depthwise separable convolutions into the network architecture, they enhanced multi-scale feature extraction for real-time detection of pine wilt disease-infected trees on edge devices [8]. Yuan et al. introduced YOLOv8-RD, a robust detection method for pine wilt imagery. They developed a ResFuzzy module combining residual learning and a fuzzy neural network to filter noise and refine background features, and integrated a detail enhancement module with a dynamic upsampling operator to restore fine feature details [9]. Xiao et al. proposed a fluorescence-based detection system for pine wood nematode disease, integrating deep learning with portable hardware. The system achieved a 39.98% accuracy improvement on large-size images, enabled detection of DNA concentrations as low as 1 fg/μL within 20 min, and demonstrates strong potential for field deployment to curtail disease spread [10].
Han Y. et al. employed GhostNetV2 to enhance conventional convolutions and proposed a lightweight UAV-based remote sensing model for forest fire detection, named LUFFD-YOLO. This model combines attention mechanisms with multi-layer feature fusion, thereby improving detection accuracy and efficiency [11]. Saydirasulovich S. N. et al. proposed an improved YOLOv8 model incorporating the Wise-IoUv3 loss function, Ghost Shuffle convolution, and BiFormer attention mechanism. These enhancements increased localization precision, reduced model parameters, and strengthened smoke feature extraction in complex backgrounds, while also improving recognition speed in forest fire smoke detection [12]. Bouguettaya A. et al. provided a comprehensive review of UAV-based early wildfire detection systems using deep learning techniques. Their survey highlighted the growing importance of autonomous fire monitoring in forest and wildland environments, emphasizing the role of computer vision algorithms in enabling timely detection and reducing potential forest resource loss [13]. Ali H. A. et al. proposed a three-tier edge-intelligent framework integrating UAVs and lightweight CNNs, attaining a 100% F1-score on the FireMan-UAV-RGBT dataset and 99.5% on UAV-FFDB, with an inference latency of only 157 ms on edge devices. This demonstrates the framework’s practical value and effectiveness for real-time forest fire monitoring and rapid emergency response [14].
Yi H. et al. proposed LAR-YOLOv8, which strengthens local feature extraction through a dual-branch attention mechanism and introduces a vision transformer module to optimize feature map representation. They also designed an attention-guided bidirectional feature pyramid network (AGBiFPN) using a dynamic sparse attention mechanism. Based on UAV imagery, this approach significantly improved detection accuracy while reducing the number of parameters [15]. To address the safety concerns of high-rise building glass curtain walls and the limitations of traditional manual inspections, Zhou K. et al. proposed an automated damage detection algorithm based on the YOLOv10s framework. By improving the backbone network, neck network, detection layer, and loss function, their method effectively resolved issues of inaccurate damage localization and the challenge of detecting small-scale damage [16].
However, most existing studies focus on single targets and fail to meet the collaborative detection requirements of national parks involving multi-scene, multi-object scenarios. Regular static targets and irregular dynamic targets exhibit fundamental conflicts in feature representation and detection logic, making it difficult for a single improvement strategy to achieve comprehensive adaptability. Moreover, the demand for lightweight models deployable on UAVs adds further constraints. To address these challenges, this study proposes two improved YOLOv8-based algorithms—YOLOv8-StarNet-CGA and SCS-YOLOv8—specifically optimized for multi-scene detection of pine wilt disease-infected trees, forest fires, and under-construction farmhouses in national parks. StarNet replaces the C2f module to enhance global feature extraction, the content-guided attention mechanism (CGA) dynamically adjusts feature weights to emphasize key regions, and SIoU replaces CIoU to improve bounding-box regression robustness for irregular targets [17]. These improvements effectively address complex background interference and object diversity while maintaining lightweight efficiency suitable for UAV deployment, offering both theoretical significance and practical value.
The main contributions of this study are as follows:
An enhanced backbone network is proposed, in which StarNet replaces the C2f module. This substantially improves global feature extraction and enhances the model’s adaptability to complex forest scenes.
A content-guided attention mechanism (CGA) is designed to replace the traditional feature concatenation module. By dynamically adjusting feature weights to enhance key region fusion, it improves the discriminative ability of object detection.
The SIoU loss function is introduced to optimize shape and orientation consistency. In forest fire scenarios, this approach overcomes the limitations of traditional CIoU loss in bounding box regression for objects with complex poses, thereby improving localization accuracy.
Extensive experiments were conducted on a UAV-based national park dataset to validate the superior performance and robustness of YOLOv8-StarNet-CGA and SCS-YOLOv8 in multi-scene object detection, demonstrating their potential for handling diverse forest scenarios while maintaining lightweight efficiency suitable for UAV deployment.
The remainder of this paper is organized as follows:
Section 2 provides a detailed description of the design and implementation of the improved algorithms.
Section 3 presents the composition of the dataset and the experimental setup.
Section 4 evaluates the performance of the proposed methods and compares them with other models.
Section 5 analyzes the advantages and limitations of the proposed models and discusses potential directions for future research.
Finally, Section 6 summarizes the overall work and findings of this study.
4. Experimental Results
4.1. Overall Model Performance
To comprehensively evaluate the performance of the proposed method, we conducted separate experiments on three dedicated datasets, each corresponding to one target scenario: pine wilt disease-infected trees, under-construction farmhouses, and forest fires. The classic original YOLO versions were trained on each dataset, and their results on the corresponding test sets were compared and analyzed. As shown in Table 4, Table 5 and Table 6, among the original YOLO models, YOLOv8 achieved higher precision and mAP50 scores than the other versions, demonstrating superior performance. Therefore, this study focuses on improvements based on the YOLOv8 architecture.
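For reference, the F1-scores reported alongside precision and recall throughout these tables follow the standard harmonic-mean definition. The sketch below uses illustrative values, not the paper's raw measurements, to show how F1 and a relative improvement over a baseline are computed:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

def relative_gain(new, old):
    """Relative improvement of a metric, as a fraction of the old value."""
    return (new - old) / old

# Illustrative values only (not taken from the paper's tables).
p_base, r_base = 0.84, 0.78
p_new, r_new = 0.91, 0.88
f1_base = f1_score(p_base, r_base)
f1_new = f1_score(p_new, r_new)
gain = relative_gain(f1_new, f1_base)  # fractional improvement in F1
```

Because F1 is a harmonic mean, it rewards balanced precision and recall; a model that trades one heavily for the other scores lower than these separate averages might suggest.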
Table 4 details the performance of different methods on the pine wilt disease-infected trees dataset. The results indicate that the improved YOLOv8-StarNet-CGA model achieves the most significant gains and is best suited to this scenario. Its precision (P), recall (R), F1-score, mAP50, and mAP50-95 increased by 8.6%, 13%, 11.2%, 11.7%, and 14.8%, respectively, compared with YOLOv8, highlighting the superiority of the improved algorithm for this specific detection task. It is worth noting that SCS-YOLOv8 also outperforms the other methods across all metrics: compared with the original YOLOv8, it achieves improvements of 8%, 11.3%, 10%, 11%, and 16.1% in P, R, F1-score, mAP50, and mAP50-95, respectively. However, replacing the loss function led to a slight performance drop relative to YOLOv8-StarNet-CGA (precision decreased from 0.915 to 0.909, and mAP50 from 0.955 to 0.948), indicating that the new loss function is not fully adapted to this scenario. Pine wilt disease-infected trees are static targets with stable morphology, and the detection task primarily involves distinguishing them from complex backgrounds rather than optimizing bounding-box shape or orientation. By emphasizing shape and orientation consistency, SIoU imposes an over-constraint that can distract the model from critical features, resulting in occasional misclassification of normal targets. A potential improvement is scene-adaptive weighting: for regular targets such as pine wilt disease-infected trees, reduce the weight of the angle and shape costs in SIoU and focus instead on position and overlap optimization.
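The scene-adaptive weighting suggested above can be prototyped by exposing the angle and shape terms of SIoU as tunable coefficients. The sketch below follows the published SIoU formulation in simplified form for axis-aligned boxes; the `angle_w` and `shape_w` knobs are our hypothetical addition, not part of the standard loss, and `angle_w=shape_w=1.0` recovers the usual behavior.

```python
import math

def siou_loss(pred, target, angle_w=1.0, shape_w=1.0, theta=4.0, eps=1e-9):
    """Simplified SIoU loss for axis-aligned boxes given as (x1, y1, x2, y2).

    angle_w / shape_w are hypothetical scene-adaptive coefficients: values
    below 1 relax the angle and shape penalties for regular, static targets.
    """
    px1, py1, px2, py2 = pred
    tx1, ty1, tx2, ty2 = target

    # Intersection over union.
    iw = max(0.0, min(px2, tx2) - max(px1, tx1))
    ih = max(0.0, min(py2, ty2) - max(py1, ty1))
    inter = iw * ih
    area_p = (px2 - px1) * (py2 - py1)
    area_t = (tx2 - tx1) * (ty2 - ty1)
    iou = inter / (area_p + area_t - inter + eps)

    # Smallest enclosing box.
    cw = max(px2, tx2) - min(px1, tx1)
    ch = max(py2, ty2) - min(py1, ty1)

    # Angle cost: deviation of the center offset from a 45-degree diagonal.
    dx = (tx1 + tx2) / 2 - (px1 + px2) / 2
    dy = (ty1 + ty2) / 2 - (py1 + py2) / 2
    sigma = math.hypot(dx, dy)
    sin_alpha = abs(dy) / (sigma + eps)
    angle = 1 - 2 * math.sin(math.asin(min(sin_alpha, 1.0)) - math.pi / 4) ** 2

    # Distance cost, modulated by the (weighted) angle cost.
    gamma = 2 - angle_w * angle
    rho_x = (dx / (cw + eps)) ** 2
    rho_y = (dy / (ch + eps)) ** 2
    dist = (1 - math.exp(-gamma * rho_x)) + (1 - math.exp(-gamma * rho_y))

    # Shape cost: relative mismatch in width and height.
    pw, ph = px2 - px1, py2 - py1
    tw, th = tx2 - tx1, ty2 - ty1
    omega_w = abs(pw - tw) / (max(pw, tw) + eps)
    omega_h = abs(ph - th) / (max(ph, th) + eps)
    shape = (1 - math.exp(-omega_w)) ** theta + (1 - math.exp(-omega_h)) ** theta

    return 1 - iou + (dist + shape_w * shape) / 2

# A perfectly matched box incurs (near-)zero loss; lowering shape_w reduces
# the penalty for a well-placed but shape-mismatched prediction.
perfect = siou_loss((0, 0, 10, 10), (0, 0, 10, 10))
strict = siou_loss((0, 0, 10, 6), (0, 0, 10, 10), shape_w=1.0)
relaxed = siou_loss((0, 0, 10, 6), (0, 0, 10, 10), shape_w=0.5)
```

In this sketch, `relaxed` is smaller than `strict` for the same prediction, which is the intended effect for regular targets whose shapes are stable and whose main challenge is background discrimination.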
Subsequently, training was conducted on the dataset of under-construction farmhouses, followed by comparative experiments.
Table 5 presents the results on this dataset, showing that the improved YOLOv8-StarNet-CGA model again achieved the best performance across all metrics. Its precision (P), recall (R), F1-score, mAP50, and mAP50-95 increased by 11%, 10.2%, 10.6%, 10.1%, and 22.8%, respectively, compared with YOLOv8. Although SCS-YOLOv8 ranked second, its metrics were also near-optimal, demonstrating strong detection capability: compared with YOLOv8, it improved precision, recall, F1-score, mAP50, and mAP50-95 by 10.7%, 9.6%, 10.2%, 9.3%, and 21.2%, respectively.
Furthermore, to provide an additional comparison with the original YOLOv8 and validate the robustness of SCS-YOLOv8, experiments were conducted on a forest fire dataset. As shown in Table 6, our method achieved superior performance: the proposed SCS-YOLOv8 improved P, R, F1-score, mAP50, and mAP50-95 by 7.2%, 13%, 10.1%, 6.3%, and 10.2%, respectively, compared with the original YOLOv8, demonstrating a clear performance advantage.
Finally, the improved models achieve not only high precision but also low computational complexity. Both YOLOv8-StarNet-CGA and SCS-YOLOv8 run at 13.8 GFLOPs, only 48.6% of the original YOLOv8, while still significantly improving accuracy metrics. This comparison demonstrates that replacing the C2f module with StarNet and optimizing feature fusion via CGA maintains enhanced detection performance while substantially reducing computational complexity, balancing efficiency and precision. These improvements make the models particularly suitable for practical applications such as drone inspections in national parks and deployment on edge devices.
Based on this experimental analysis, we can preliminarily conclude that YOLOv8-StarNet-CGA is better suited to detecting pine wilt disease-infected trees and under-construction farmhouses: StarNet's enhanced global feature extraction and the CGA content attention mechanism's focus on key regions significantly improve detection accuracy and robustness for both complex and regular targets. For forest fire scenarios, SCS-YOLOv8, which employs the SIoU loss function, achieves superior performance: SIoU optimizes shape and orientation consistency, better accommodating sparsely distributed fire points and the diverse bounding-box shapes of fires, resulting in notable improvements in precision and stability. This study emphasizes the practicality of multi-scenario detection in national parks: the original YOLOv8 has high computational demands, making it difficult to deploy on drones or other edge devices, whereas the improved models substantially reduce GFLOPs, maintaining high accuracy while meeting field inspection requirements for device endurance and real-time performance. This balance between performance and efficiency is a core contribution of the proposed algorithms and provides quantitative support for future onboard drone deployment. The two improvements optimize for different target characteristics and scene complexities, demonstrating the adaptability of specific modules to scenario-specific requirements; selecting the appropriate combination of enhancements for each target type is therefore essential.
Finally, to more intuitively demonstrate the superior performance of YOLOv8-StarNet-CGA and SCS-YOLOv8 in object detection, Figure 8, Figure 9 and Figure 10 compare the detection results on each dataset between the model best suited to that scenario and the original YOLOv8. The experiments show that YOLOv8 exhibits varying degrees of missed detections and false positives, particularly in scenarios with dense targets, complex backgrounds, or severe occlusion, which clearly limits its detection capability. In contrast, YOLOv8-StarNet-CGA and SCS-YOLOv8 effectively overcome these issues through their improved designs, significantly enhancing detection completeness and reliability.
4.2. Ablation Study of the Proposed Models
To further investigate the contributions of each improvement module in YOLOv8-StarNet-CGA and SCS-YOLOv8, ablation experiments were conducted on the three scenario-specific datasets. The StarNet, CGA, and SIoU modules were progressively introduced, with the results documented in Table 7. The analysis focuses on the improvements in precision and mAP50 achieved by each module.
In the pine wilt disease-infected tree scenario, introducing StarNet increased precision by 5.3% and mAP50 by 8.8%, indicating that StarNet significantly enhances detection accuracy through improved global feature extraction. Adding CGA alone resulted in a modest increase of 0.3% in precision and 0.2% in mAP50, suggesting that CGA has limited impact without a strong backbone network. Combining StarNet with SIoU led to improvements of 6.1% in precision and 9.3% in mAP50, with SIoU further optimizing bounding-box regression accuracy. When StarNet and CGA were combined (YOLOv8-StarNet-CGA), precision reached 0.915 (an 8.6% increase) and mAP50 reached 0.955 (an 11.7% increase), demonstrating the best performance and indicating that their synergy significantly enhances the extraction of, and focus on, key features. However, in the full SCS-YOLOv8 model, precision was 0.909 (an 8% increase) and mAP50 was 0.948 (an 11% increase), slightly lower than YOLOv8-StarNet-CGA. Although SIoU's shape-and-orientation optimization is less adapted to the stable morphology of pine wilt disease-infected trees, SCS-YOLOv8 still significantly outperforms the original model.
In the under-construction farmhouse scenario, introducing StarNet increased precision by 4.1% and mAP50 by 5.4%, demonstrating its efficiency in extracting features of regular targets. Using CGA alone improved precision and mAP50 by 2% and 0.7%, respectively, indicating that CGA requires a strong backbone network to be fully effective. Combining StarNet with SIoU led to gains of 4.4% in precision and 5.5% in mAP50, with SIoU beginning to contribute to bounding box optimization. When StarNet and CGA were combined, precision and mAP50 reached 0.983 (11% increase) and 0.985 (10.1% increase), achieving the best performance and highlighting their strong synergistic effect on regular target detection. The full SCS-YOLOv8 achieved precision and mAP50 of 0.896 and 0.949, slightly lower than YOLOv8-StarNet-CGA, yet still maintaining very strong performance.
In the forest fire scenario, introducing StarNet increased precision by 2.4% and mAP50 by 8%, confirming the capability of StarNet for global perception of sparse fire spot features. Adding CGA alone resulted in a 0.4% increase in precision and 4.3% in mAP50, representing a modest yet notable contribution. Combining StarNet with SIoU improved precision and mAP50 by 2.1% and 3.6%, respectively, highlighting the role of SIoU in optimizing fire spot shape consistency. When StarNet and CGA were combined, precision reached 0.896 (6.2% increase) and mAP50 reached 0.91 (6% increase), demonstrating high detection accuracy. The complete SCS-YOLOv8 further improved precision and mAP50 to 0.906 and 0.913, achieving the best results, indicating that SIoU significantly enhances detection robustness by optimizing bounding box regression for irregular targets in this scenario.
Finally, tracking the dynamic changes in GFLOPs clearly illustrates the specific impact of each improvement module on computational complexity. StarNet reduces the baseline computational load: when only StarNet is introduced, GFLOPs decrease from 28.4 in the original YOLOv8 to 24.4. This indicates that StarNet, through the high-dimensional feature representation capability of star operations, enhances global feature extraction while simultaneously reducing redundant computations. CGA further improves computational efficiency: with CGA added, GFLOPs drop from 24.4 to 13.8, a reduction of 43.4%. This occurs because CGA replaces the traditional concat operation, using dynamic weights to focus on key features and reduce the computation of irrelevant feature fusion, thereby improving feature utilization while lowering computational load. SIoU has no significant effect on computation: after introducing SIoU, GFLOPs remain at 13.8, essentially unchanged, confirming that replacing the loss function does not substantially affect core computations but improves robustness by optimizing bounding box regression without additional computational cost.
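The quoted percentages follow directly from the GFLOPs figures stated above (28.4 for the original YOLOv8, 24.4 with StarNet only, 13.8 after adding CGA). A one-line sketch reproduces them; the helper name `pct_reduction` is ours, not from the paper:

```python
def pct_reduction(before, after):
    """Reduction of a quantity as a percentage of its previous value."""
    return 100 * (before - after) / before

baseline, with_starnet, with_cga = 28.4, 24.4, 13.8
cga_drop = pct_reduction(with_starnet, with_cga)  # ≈ 43.4%, as reported
share_of_baseline = 100 * with_cga / baseline     # ≈ 48.6% of YOLOv8's cost
```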
4.3. Comparison with Other Object Detection Models
To validate the performance advantages of YOLOv8-StarNet-CGA and SCS-YOLOv8, comparative experiments were conducted against three classical object detection models—EfficientNet, SSD, and DETR—across three distinct scenarios: pine wilt disease-infected trees, under-construction farmhouses, and forest fires. The results are summarized in Table 8.
Additionally, experiments were conducted on the three aforementioned scenarios using three YOLOv8-based improved models: YOLO-Drone [39], TSD-YOLO [40], and YOLO-MS [41]. The comparative results are presented in Table 9.
The results demonstrate that both proposed models, YOLOv8-StarNet-CGA and SCS-YOLOv8, significantly outperform other baseline models and YOLOv8-based variants across key evaluation metrics, including Precision, Recall, and mAP50. Specifically, YOLOv8-StarNet-CGA achieves the best performance in detecting pine wilt disease-infected trees and under-construction farmhouses, whereas SCS-YOLOv8 exhibits superior robustness in forest fire scenarios owing to the SIoU optimization. These comparative results fully validate the effectiveness of the proposed modules in enhancing object detection performance in complex forest environments within national parks.
5. Discussion
The proposed YOLOv8-StarNet-CGA and SCS-YOLOv8 models significantly enhance object detection performance in national park scenarios. YOLOv8-StarNet-CGA integrates StarNet and CGA, while SCS-YOLOv8 further incorporates SIoU on top of YOLOv8-StarNet-CGA, yielding notable improvements in detection accuracy. StarNet strengthens global feature extraction, CGA optimizes feature fusion in critical regions, and SIoU enhances the robustness of bounding box regression in complex pose scenarios. Experimental results indicate that YOLOv8-StarNet-CGA is particularly effective for pine wilt disease-infected trees and under-construction farmhouses, whereas SCS-YOLOv8 excels in forest fire scenarios. Both models outperform the original YOLOv8 across all three scenarios, demonstrating strong adaptability to complex and irregular target conditions, while also balancing precision and efficiency, making them suitable for practical UAV deployment.
However, the adaptability of SIoU remains insufficient in certain scenarios and requires further improvement. One potential direction is scene-adaptive weighting: for regular targets such as pine wilt disease-infected trees and under-construction farmhouses, the angle and shape penalties in SIoU could be reduced, emphasizing positional and overlap optimization. Additionally, the forest fire dataset relies on publicly available internet sources, which presents clear limitations: it lacks diversity, mainly covers typical visible-fire scenes, and does not adequately represent complex terrains (e.g., canyons, steep slopes) or specific vegetation types (e.g., coniferous forests, shrublands). Samples under extreme weather conditions are also missing, with no coverage of heavy rain, dense fog, or low-contrast scenarios such as nighttime, limiting the model's adaptability to real-world complex environments. Future work should focus on constructing field-collected datasets, conducting controlled UAV experiments across varying elevations and vegetation types, supplementing extreme-weather and dynamic samples, and recording correlations between fire events and environmental parameters. Furthermore, for existing UAV-based fire monitoring models, a 13% improvement in recall under continuous aerial patrol implies that more fire-containing frames are correctly identified; however, the actual gain in early fire detection time, accounting for fire-spread dynamics, still requires experimental validation. These steps will enhance the model's generalization capability in real-world scenarios and better meet the practical requirements of national park fire monitoring.
6. Conclusions
This study proposes two improved YOLOv8-based multi-scene forest object detection algorithms, namely YOLOv8-StarNet-CGA and SCS-YOLOv8, optimized for detecting pine wilt disease-infected trees, forest fires, and under-construction farmhouses in national park ecological monitoring. First, StarNet was introduced to replace the C2f module in the backbone of YOLOv8, enhancing global feature extraction, and CGA dynamically adjusted feature weights to emphasize key regions. Next, the original CIoU loss function was replaced with SIoU to improve the robustness of bounding-box regression. SCS-YOLOv8 was evaluated on a UAV-acquired national park dataset covering the three target categories. Compared with the original YOLOv8, SCS-YOLOv8 improved mAP50 by 11% for pine wilt disease-infected trees, 9.3% for under-construction farmhouses, and 11.6% for forest fires. Meanwhile, YOLOv8-StarNet-CGA achieved mAP50 gains of 11.7%, 10.1%, and 9.7% in the respective scenarios, indicating that YOLOv8-StarNet-CGA is more suitable for pine wilt disease-infected trees and under-construction farmhouses, while SCS-YOLOv8 excels in forest fire scenarios. Furthermore, both models demonstrated superior Precision, Recall, mAP50, and mAP50-95 compared with other YOLO variants and mainstream detection models. The GFLOPs of both improved models decreased from 28.4 in the original YOLOv8 to 13.8, achieving computational efficiency alongside enhanced detection performance. Overall, the two improved models exhibit stronger detection capabilities, balance precision and efficiency, and perform exceptionally in complex backgrounds and diverse target scenarios. They address the multi-scene detection challenges of national parks and are well suited to UAV inspection deployments, providing efficient technical support for ecological monitoring.