Forests
  • Article
  • Open Access

19 February 2026

Multi-Scenario Recognition and Detection Model in National Parks Based on Improved YOLOv8

1 College of Mathematics and Computer Science, Zhejiang A&F University, Hangzhou 311300, China
2 Hangzhou Ganzhi Technology Co., Ltd., Hangzhou 311300, China
3 Baishanzu National Park Qianjiang Source Management Bureau, Lishui 323714, China
4 Center for Forest Resource Monitoring of Zhejiang Province, Hangzhou 310020, China
This article belongs to the Section Natural Hazards and Risk Management

Abstract

With the advancement of unmanned aerial vehicle (UAV) technology, its use in ecological monitoring and safety management of national parks has expanded significantly. However, object detection in complex scenes remains challenging due to environmental complexity, background interference, and occlusion. To address these issues, this paper proposes two improved YOLOv8-based models, YOLOv8-StarNet-CGA and SCS-YOLOv8, for detecting pine wilt disease-infected trees, under-construction farmhouses, and forest fires. In YOLOv8-StarNet-CGA, the StarNet module and Content-Guided Attention (CGA) are integrated into the backbone to enhance global feature extraction and focus on critical regions through dynamic weight adjustment. In SCS-YOLOv8, the original CIoU loss is also replaced with SIoU loss to optimize shape and orientation consistency, improving robustness. Experiments on UAV datasets covering diverse national park scenes demonstrate the effectiveness of the models. Results show that the improved models substantially outperform the original YOLOv8 in Precision, Recall, and mAP50. For pine wilt disease caused by the pine wood nematode Bursaphelenchus xylophilus, YOLOv8-StarNet-CGA achieves 8.6% higher Precision and 11.7% higher mAP50, facilitating early diagnosis and intervention of the disease. In under-construction farmhouse scenarios, Precision rises by 11% and mAP50 by 10.1%, lowering annual inspection labor by nearly 30% and improving oversight. For forest fires, SCS-YOLOv8 is more effective, with Precision improved by 7.2% and mAP50 by 6.3%. The improved detection model enables earlier identification of fire spots, thereby providing additional response time for emergency intervention, helping to mitigate fire spread and reduce the loss of forest resources. Both models also reduce GFLOPs and computational complexity, striking a balance between efficiency and accuracy, and showing strong potential for UAV deployment.

1. Introduction

As core areas for ecological conservation and resource management, national parks require robust ecological monitoring and security management to maintain biodiversity and mitigate ecological risks [1]. Consequently, the efficient detection of three specific targets has become a critical management priority: the early identification of pine wilt disease—caused by the pine wood nematode Bursaphelenchus xylophilus—to contain its spread, the timely detection of forest fires to minimize damage, and the precise monitoring of under-construction farmhouses to prevent encroachment on ecological protection zones. While drone technology has enabled large-scale, high-frequency monitoring for these purposes, the high-resolution imagery it acquires presents unique challenges in complex forest environments [2]. Moreover, operational UAV patrols in national parks impose strict constraints on inference speed and computational cost: limited battery life and onboard computing resources require detection models to be both lightweight and real-time. Current mainstream detectors struggle to balance accuracy and efficiency under such deployment conditions, making task-specific optimization a necessity.
The distinct challenges of multi-scene detection in national parks are reflected in three main aspects. First, the background is highly complex: dense forest vegetation and dramatic light-shadow variations often cause targets to be obscured by foliage or to blend into their surroundings. Second, target characteristics vary considerably. Pine wilt disease-infected trees and partially constructed farmhouses represent regular, static targets whose shapes remain relatively stable but are easily affected by background interference. In contrast, forest fires are irregular, dynamic targets—fire hotspots change shape rapidly over time, requiring bounding-box regression to adjust dynamically to variations in form and orientation. Third, scale variation is substantial, ranging from small-scale objects such as individual diseased trees to large-scale regions like extensive wildfires, necessitating a model with robust multi-scale feature extraction capabilities.
In recent years, deep learning has driven rapid advancements in object detection. CNN-based models [3] (e.g., Faster R-CNN and the YOLO series) and transformer-based algorithms [4] have demonstrated excellent performance in general scenarios. Among these, YOLOv8 achieves a balance between detection speed and accuracy through its C2f module, path-aggregation feature pyramid network (PA-FPN), and decoupled detection head, making it a mainstream choice. However, its limitations become increasingly evident in the complex environments of national parks. The backbone network inadequately captures global features, making it difficult to detect small objects (e.g., early-stage diseased trees) within extensive forest backgrounds. Moreover, the feature fusion mechanism lacks dynamic focusing capability [5]; traditional concatenation operations cannot adjust weights based on target characteristics, causing important features—such as farmhouse edges or wildfire cores—to be diluted by redundant information. Finally, the CIoU loss function is poorly adapted to irregular and complex targets. In bounding-box regression for wildfires and other non-uniform objects, considering only overlap area and aspect ratio fails to accommodate dynamically changing shapes, thereby limiting localization accuracy.
In addition, both the YOLO family and transformer-based detectors such as DETR have demonstrated strong performance in general vision tasks. However, the high computational demands of transformers and their weaker sensitivity to long-tail small objects limit their ability to substitute for lightweight one-stage models like YOLO in current UAV edge-deployment scenarios. Rey et al. evaluated YOLOv8n/YOLOv8s on Jetson Orin NX and RPI5 for UAV deployment. INT8-quantized YOLOv8n achieved 65 FPS on Orin NX, meeting onboard real-time requirements, while RPI5 failed latency constraints. These findings highlight the critical edge–cloud trade-offs for UAV deployment [6]. Hua and Chen reviewed deep learning-based small object detection in aerial images, systematically covering CNN and Transformer paradigms—including DETR and its variants (e.g., AO2-DETR, Hyneter) for UAV-borne vision tasks. Their survey confirms that while Transformer-based detectors have been extensively explored in aerial scenarios, their deployment on resource-constrained edge devices remains challenging [7]. These findings provide a paradigm trade-off justification for selecting lightweight one-stage models such as YOLO for real-time UAV detection under operational constraints.
To address these challenges, researchers have continuously customized and improved YOLO and other network models for different application requirements, and these efforts have yielded positive results across diverse scenarios. Back et al. proposed a drone detection model integrating Mamba and attention mechanisms. By introducing the SSM core, attention modules, PAFPN, and depthwise separable convolutions into the network architecture, they enhanced multi-scale feature extraction for real-time detection of pine wilt disease-infected trees on edge devices [8]. Yuan et al. introduced YOLOv8-RD, a robust detection method for pine wilt imagery. They developed a ResFuzzy module combining residual learning and a fuzzy neural network to filter noise and refine background features, and integrated a detail enhancement module with a dynamic upsampling operator to restore fine feature details [9]. Xiao et al. proposed a fluorescence-based detection system for pine wood nematode disease, integrating deep learning with portable hardware. The system achieved a 39.98% accuracy improvement on large-size images and enabled detection of DNA concentrations as low as 1 fg/μL within 20 min. The system demonstrates strong potential for field deployment to curtail disease spread [10].
Han Y. et al. employed GhostNetV2 to enhance conventional convolutions and proposed a lightweight UAV-based remote sensing model for forest fire detection, named LUFFD-YOLO. This model combines attention mechanisms with multi-layer feature fusion, thereby improving detection accuracy and efficiency [11]. Saydirasulovich S N proposed an improved YOLOv8 model by incorporating the Wise-IoUv3 loss function, Ghost Shuffle convolution, and BiFormer attention mechanism. These enhancements increased localization precision, reduced model parameters, and strengthened smoke feature extraction in complex backgrounds, while also improving recognition speed in forest fire smoke detection [12]. Bouguettaya A. et al. provided a comprehensive review of UAV-based early wildfire detection systems using deep learning techniques. Their survey highlighted the growing importance of autonomous fire monitoring in forest and wildland environments, emphasizing the role of computer vision algorithms in enabling timely detection and reducing potential forest resource loss [13]. Ali H. A. et al. proposed a three-tier edge-intelligent framework integrating UAVs and lightweight CNNs, attaining 100% F1-score on the FireMan-UAV-RGBT dataset and 99.5% on UAV-FFDB, with an inference latency of only 157 ms on edge devices. This demonstrates the framework’s practical value and effectiveness for real-time forest fire monitoring and rapid emergency response [14].
Yi H. et al. proposed LAR-YOLOv8, which strengthens local feature extraction through a dual-branch attention mechanism and introduces a vision transformer module to optimize feature map representation. They also designed an attention-guided bidirectional feature pyramid network (AGBiFPN) using a dynamic sparse attention mechanism. Based on UAV imagery, this approach significantly improved detection accuracy while reducing the number of parameters [15]. To address the safety concerns of high-rise building glass curtain walls and the limitations of traditional manual inspections, Zhou K. et al. proposed an automated damage detection algorithm based on the YOLOv10s framework. By improving the backbone network, neck network, detection layer, and loss function, their method effectively resolved issues of inaccurate damage localization and the challenge of detecting small-scale damage [16].
However, most existing studies focus on single targets and fail to meet the collaborative detection requirements of national parks involving multi-scene, multi-object scenarios. Regular static targets and irregular dynamic targets exhibit fundamental conflicts in feature representation and detection logic, making it difficult for a single improvement strategy to achieve comprehensive adaptability. Moreover, the demand for lightweight models deployable on UAVs adds further constraints. To address these challenges, this study proposes two improved YOLOv8-based algorithms—YOLOv8-StarNet-CGA and SCS-YOLOv8—specifically optimized for multi-scene detection of pine wilt disease-infected trees, forest fires, and under-construction farmhouses in national parks. StarNet replaces the C2f module to enhance global feature extraction, the content-guided attention mechanism (CGA) dynamically adjusts feature weights to emphasize key regions, and SIoU replaces CIoU to improve bounding-box regression robustness [17] for irregular targets. These improvements effectively address complex background interference and object diversity while maintaining lightweight efficiency suitable for UAV deployment, offering both theoretical significance and practical value.
The main contributions of this study are as follows:
  • An enhanced backbone network is proposed, in which StarNet replaces the C2f module. This substantially improves global feature extraction and enhances the model’s adaptability to complex forest scenes.
  • A content-guided attention mechanism (CGA) is designed to replace the traditional feature concatenation module. By dynamically adjusting feature weights to enhance key region fusion, it improves the discriminative ability of object detection.
  • The SIoU loss function is introduced to optimize shape and orientation consistency. In forest fire scenarios, this approach overcomes the limitations of traditional CIoU loss in bounding box regression for objects with complex poses, thereby improving localization accuracy.
  • Extensive experiments were conducted on a UAV-based national park dataset to validate the superior performance and robustness of YOLOv8-StarNet-CGA and SCS-YOLOv8 in multi-scene object detection, demonstrating their potential for handling diverse forest scenarios while maintaining lightweight efficiency suitable for UAV deployment.
The remainder of this paper is organized as follows: Section 2 provides a detailed description of the design and implementation of the improved algorithms. Section 3 presents the composition of the dataset and the experimental setup. Section 4 evaluates the performance of the proposed methods and compares them with other models. Section 5 analyzes the advantages and limitations of the proposed models and discusses potential directions for future research. Finally, Section 6 summarizes the overall work and findings of this study.

2. Materials and Methods

2.1. The YOLOv8 Base Model

As a significant iterative version within the YOLO series for object detection, YOLOv8 demonstrates outstanding performance. Its architecture comprises three core components: a backbone network, a feature enhancement network (neck), and a detection head [18]. These components work closely together to achieve efficient and accurate object detection, and the overall network structure is shown in Figure 1.
Figure 1. Network architecture of YOLOv8. P0–P21 denote the stage indices of the network (e.g., P0 represents the initial stage, P21 the stage at position 21, etc.).
The backbone network is responsible for extracting key features from the input image. This task is accomplished collaboratively by convolutional layers (Conv), the C2f module, and the spatial pyramid pooling–fusion module (SPPF) [19]. The convolutional layers perform the initial feature extraction, capturing fundamental image characteristics such as edges, textures, etc. As an innovative element of YOLOv8, the C2f module employs a distinctive structural design to further optimize the feature extraction process. It efficiently integrates multi-level feature information, thereby enhancing feature representation. The spatial pyramid pooling–fusion module (SPPF) performs pooling operations on features of different scales [20], merging multi-scale features to enable the model to better adapt to targets of varying sizes and effectively improve its ability to perceive diverse objects.
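The pooling-and-fusion idea behind SPPF can be sketched in a few lines of NumPy. This is an illustrative re-implementation, not the Ultralytics code: the 5×5 kernel and the chain of three stride-1 pools follow the common SPPF configuration, while the function names and the omission of the surrounding 1×1 convolutions are our simplifications.

```python
import numpy as np

def maxpool2d_same(x, k=5):
    """Max pooling with stride 1 and 'same' padding (pad = k // 2)."""
    p = k // 2
    xp = np.pad(x, ((p, p), (p, p)), mode="constant", constant_values=-np.inf)
    win = np.lib.stride_tricks.sliding_window_view(xp, (k, k))
    return win.max(axis=(-1, -2))

def sppf(feat, k=5):
    """SPPF sketch: three chained k x k max-pools over a (C, H, W) map,
    concatenating the input with every pooled map along the channels.

    Chaining three 5x5 pools covers the 5/9/13 receptive fields of the
    classic parallel SPP at lower cost.
    """
    outs = [feat]
    for _ in range(3):
        outs.append(np.stack([maxpool2d_same(c, k) for c in outs[-1]]))
    return np.concatenate(outs, axis=0)  # channel count grows 4x

x = np.random.rand(8, 16, 16)
y = sppf(x)
print(y.shape)  # (32, 16, 16)
```

A useful sanity check on the chained design: two consecutive 5×5 stride-1 pools are equivalent to a single 9×9 pool, which is why the sequential form reproduces the multi-scale receptive fields of parallel SPP.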
The feature enhancement network (neck) is constructed based on the concept of a path-aggregation feature pyramid network (PA-FPN) [21,22], with the primary goal of achieving effective fusion of features across different levels. Building on PA-FPN, YOLOv8 introduces several optimizations by removing certain convolutional layers during the upsampling stage of its feature pyramid network. This adjustment not only simplifies the network architecture and reduces computational cost but also prevents potential feature loss caused by excessive convolution operations. As a result, the model performs feature fusion more efficiently, preserves and utilizes key image features more effectively, and provides richer, more representative features for subsequent object detection.
In YOLOv8, the detection head is responsible for analyzing the extracted features [23] to determine the categories and locations of the targets. The decoupled head module introduced in YOLOv8 represents a key innovation in the detection head. In traditional object detection models, classification and localization tasks are often interdependent, which can lead to conflicts and reduce detection accuracy. The decoupled head module separates the classification and localization branches [24], allowing these two tasks to be performed independently. This separation effectively prevents interference between the tasks, enabling the model to focus more precisely on classification and localization. As a result, it significantly improves target recognition and localization accuracy, thereby enhancing the overall detection performance of the model.
Through the careful design and optimization of its backbone network, feature enhancement network, and detection head, YOLOv8 achieves significant improvements in detection speed and accuracy compared with previous versions, demonstrating excellent performance across various practical applications.

2.2. Algorithm Improvements

In the field of object detection, YOLOv8 demonstrates strong performance, with detection accuracy surpassing that of many mainstream algorithms. However, when applied to multi-scene object detection tasks in national park forests, it reveals certain limitations, including notable instances of missed detections and false positives. Given the distinctive characteristics of targets in multi-scene images—such as complex background interference and significant variations in object scale—these issues substantially affect the accuracy and reliability of detection results.
To effectively address these challenges and account for the characteristics of national park target scenarios, two innovative forest multi-scenario object detection algorithms based on the improved YOLOv8 were developed, named YOLOv8-StarNet-CGA and SCS-YOLOv8 (detailed structures are shown in Figure 2). These algorithms introduce improvements across multiple critical aspects, aiming to enhance detection accuracy for multi-scene forest targets in national parks while reducing missed and false detections.
Figure 2. Network architecture of SCS-YOLOv8. YOLOv8-StarNet-CGA shares the same architecture as SCS-YOLOv8, with the only difference being the replacement of CIoU with SIoU in SCS-YOLOv8.

2.2.1. The Underlying Logic of the Dual-Model Design

The detection targets within national parks exhibit significant heterogeneity:
  • Pine wilt disease-infected trees and under-construction farmhouses are regular-shaped targets with relatively consistent features. The main detection challenge lies in distinguishing these features against complex backgrounds.
  • Forest fires are irregular and dynamic targets, with fire patterns changing over time. The primary detection challenge involves the robustness of bounding box regression and adaptability to varying shapes and orientations.
Traditional single-model approaches struggle to simultaneously optimize detection performance for both types of targets; therefore, a dual-model strategy is adopted:
YOLOv8-StarNet-CGA: The original YOLOv8 backbone exhibits certain limitations. The C2f structure has a restricted receptive field, limiting its ability to fully exploit global information. It also generates redundant features, which can lead to overfitting. Furthermore, the traditional C2f module employs a Bottleneck structure [25], which may not provide sufficient accuracy in complex scenarios, often resulting in false positives or missed detections and thereby impairing the effective representation and utilization of features. To address these issues, StarNet was first used to replace the C2f module. StarNet, built upon star operations, adopts a four-stage hierarchical architecture [26], demonstrating excellent performance while maintaining fast execution and low latency across different hardware platforms. This modification not only reduces the repeated usage of the C2f module but also significantly enhances feature extraction efficiency, allowing the model to more precisely capture critical features from input data.
In addition, a novel Content-Guided Attention (CGA) mechanism was innovatively introduced. The CGA mechanism replaces the concatenation module in the neck network, enabling more precise focus on critical regions and enhancing feature fusion effectiveness. Furthermore, it facilitates deep interaction among features from different levels and channels, allowing the model to utilize information more efficiently. Employing a coarse-to-fine strategy, the CGA mechanism generates a dedicated Spatial Importance Map (SIM) for each channel of the input features. Moreover, CGA fully integrates channel attention weights with spatial attention weights, ensuring information exchange across both channel and spatial dimensions. This avoids the limitations of single-dimension processing and enhances the model’s ability to understand and detect various targets in complex scenes and tasks.
SCS-YOLOv8: Finally, building on YOLOv8-StarNet-CGA, YOLOv8 is further improved by incorporating the SIoU loss function. SIoU not only accounts for the geometric relationships of bounding boxes but also introduces the concepts of shape and orientation matching. In complex scenarios such as forest fires, where targets may undergo rotation or deformation, SIoU calculates shape similarity and directional consistency to more accurately capture the differences between predicted and ground-truth bounding boxes.

2.2.2. Backbone

The YOLOv8 backbone exhibits limitations in multi-scene recognition. Feature extraction suffers from homogenization, and the fusion of shallow and deep features is suboptimal. In terms of scene adaptability, it lacks a scene-awareness mechanism and demonstrates limited robustness to scene variations. From a computational perspective, it incurs high costs and substantial memory usage. To address these issues, the C2f modules in the backbone network were enhanced through the integration of StarNet. StarNet follows a traditional hierarchical network design with a four-stage layered architecture. Its star operation implements non-linear high-dimensional transformations, differing from conventional neural networks that increase network width (i.e., the number of channels). This operation is analogous to applying pairwise multiplicative kernel functions across channels, particularly polynomial kernels [27]. Within this architecture, convolutional layers perform downsampling while doubling the number of channels, and the star modules handle feature extraction. To enhance computational efficiency, layer normalization was replaced with batch normalization, which was positioned after the depthwise separable convolution. This arrangement allows the two operations to be fused during inference, thereby further improving efficiency. Inspired by the design of MobileNeXt [28], a depthwise separable convolution was incorporated at the tail end of each module. Regarding the network configuration, the channel expansion factor was consistently set to 4, and the network width was doubled at each subsequent stage. Following the design principle of MobileNetv2 [29], the GELU activation function in the star module was replaced with the ReLU6 activation function [30]. The overall architecture of StarNet is illustrated in Figure 3. By adjusting the number of modules and the channels of the input embeddings, we constructed StarNet models of varying scales.
Features are extracted by repeatedly stacking multiple star-shaped modules. Within a single layer of the neural network, the star operation is a commonly used procedure, typically expressed as $\left(A_1^{T} X + B_1\right) \ast \left(A_2^{T} X + B_2\right)$, where $\ast$ denotes element-wise multiplication.
Figure 3. Overview of the StarNet architecture. StarNet adopts a traditional hierarchical network structure, using convolutional layers at each stage to reduce the resolution while doubling the number of channels. DW-Conv: depthwise convolution; RefUp: refinement upsampling.
Within a single star operation, the input feature $X$ is first subjected to two separate linear transformations. The resulting features are then fused through element-wise multiplication. To simplify subsequent analysis and computation, the weight matrices and biases are merged, thereby reducing the star operation to $A_1^{T} X \ast \left(A_2^{T} X\right)$. By narrowing the scope to a single-output-channel transformation and a single-element input, we define three vectors $w_1$, $w_2$, and $x$, each belonging to $\mathbb{R}^{(d+1) \times 1}$, where $d$ denotes the number of input channels. In general, the star operation can be reformulated as follows:
$$w_1^{T} x \ast w_2^{T} x = \left(\sum_{i=1}^{d+1} w_1^{i} x_i\right) \ast \left(\sum_{j=1}^{d+1} w_2^{j} x_j\right) = \sum_{i=1}^{d+1} \sum_{j=1}^{d+1} w_1^{i} w_2^{j} x_i x_j$$
$$= \underbrace{\alpha_{(1,1)} x_1 x_1 + \cdots + \alpha_{(4,5)} x_4 x_5 + \cdots + \alpha_{(d+1,d+1)} x_{d+1} x_{d+1}}_{\frac{(d+2)(d+1)}{2}\ \text{items}}$$
The channels are indexed by i and j, and α is the coefficient for each term:
$$\alpha_{(i,j)} = \begin{cases} w_1^{i} w_2^{j} & \text{if } i = j \\ w_1^{i} w_2^{j} + w_1^{j} w_2^{i} & \text{if } i \neq j \end{cases}$$
It should be clarified that the star operation does not explicitly construct a $d^2$-dimensional feature vector. Instead, the second-order interaction terms are implicitly induced through element-wise multiplication after two linear projections. In other words, although the computational process is carried out within the original $d$-dimensional space, the resulting feature representation exhibits an expressive capacity comparable to that of a higher-dimensional polynomial feature mapping. This implicit expansion enhances representation ability without explicitly enumerating all pairwise feature combinations.
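The equivalence between the computed product and its implicit pairwise expansion can be checked numerically. The following NumPy sketch (the width d = 6 and the random weights are arbitrary, purely illustrative choices) evaluates both sides and counts the distinct terms:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 6                              # illustrative channel count
x = rng.normal(size=d + 1)         # input (the extra slot absorbs the bias)
w1 = rng.normal(size=d + 1)
w2 = rng.normal(size=d + 1)

# Star operation as actually computed in the network:
# two linear projections followed by element-wise multiplication.
star = (w1 @ x) * (w2 @ x)

# Implicit expansion: sum over all pairs (i, j) with the coefficients
# alpha_{(i,j)} defined above (symmetric terms merged when i != j).
expanded, n_terms = 0.0, 0
for i in range(d + 1):
    for j in range(i, d + 1):
        alpha = w1[i] * w2[j] if i == j else w1[i] * w2[j] + w1[j] * w2[i]
        expanded += alpha * x[i] * x[j]
        n_terms += 1

assert np.isclose(star, expanded)
assert n_terms == (d + 2) * (d + 1) // 2   # the (d+2)(d+1)/2 distinct items
```

The assertions confirm that the cheap d-dimensional product carries exactly the (d+2)(d+1)/2 second-order terms of the expansion.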
After reformulating the star operation, it can be expanded into an expression containing $\frac{(d+2)(d+1)}{2}$ distinct terms. Except for the term $\alpha_{(d+1,:)} x_{d+1} x$, each term exhibits a nonlinear relationship with $x$, indicating that each represents an independent implicit dimension. Consequently, although the computationally efficient star operation is performed in a $d$-dimensional space, feature representation occurs in an implicit feature space of $\frac{(d+2)(d+1)}{2} \approx \left(\frac{d}{\sqrt{2}}\right)^2$ dimensions (assuming $d \gg 2$). This process significantly increases the feature dimensionality within a single network layer without incurring additional computational cost [31].
Next, by stacking multiple network layers, the implicit dimensionality increases exponentially in a recursive manner, approaching infinity. Starting with an initial network layer of width $d$, a single star operation produces the expression $\sum_{i=1}^{d+1} \sum_{j=1}^{d+1} w_1^{i} w_2^{j} x_i x_j$. To illustrate the effect of multiple star operations more clearly, $P_n$ is used to denote the output of the $n$-th star operation. After passing through $n$ layers, the star operations implicitly yield a representation belonging to $\mathbb{R}^{\left(\frac{d}{\sqrt{2}}\right)^{2^n}}$. This demonstrates that even stacking only a few layers can, by virtue of exponential growth, substantially enhance the implicit dimensionality.
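A quick back-of-the-envelope computation illustrates this growth; the layer width d = 128 is an arbitrary example, and the figures come directly from the $(d/\sqrt{2})^{2^n}$ expression rather than from any measurement:

```python
import math

d = 128  # illustrative layer width
for n in range(1, 5):
    # implicit dimensionality after n stacked star operations
    implicit = (d / math.sqrt(2)) ** (2 ** n)
    print(f"after {n} star layers: ~{implicit:.2e} implicit dimensions")
```

With a single layer this already gives $d^2/2 = 8192$ implicit dimensions, and each additional layer squares the count, which is the "exponential in a recursive manner" behavior described above.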
Although Figure 3 shows two linear projections followed by element-wise multiplication, this process should not be regarded as an explicit feature expansion: no $d^2$-dimensional feature vector is ever materialized. The computation remains in the original $d$-dimensional space, while the element-wise multiplication introduces the second-order feature interactions implicitly, enhancing the expressive capacity of the representation.

2.2.3. Content-Guided Attention Mechanism

In YOLOv8, the traditional concatenation operation merely stacks features of different scales or levels along the channel dimension [32], without considering the correlation or significance of these features. In contrast, the CGA mechanism adaptively assigns attention weights based on the content of the input features during feature fusion, giving different features distinct weights, thereby allowing the fused features to better capture the intrinsic characteristics of the target. CGA adopts a coarse-to-fine processing workflow [33], as illustrated in Figure 4. First, a coarse version of the Spatial Importance Map (SIM), denoted as $W_{coa}$, is generated with dimensionality $\mathbb{R}^{C \times H \times W}$. Subsequently, each channel is refined under the guidance of the input features. Let $X \in \mathbb{R}^{C \times H \times W}$ represent the preceding input features. The objective of CGA is to generate a channel-specific SIM, denoted as $W$, with the same dimensionality as $X$, i.e., $W \in \mathbb{R}^{C \times H \times W}$. Then, the channel attention weights $W_c$ and the spatial attention weights $W_s$ are calculated separately [34,35]. The specific computation formulas are as follows:
$$W_c = C_{1 \times 1}\left(\max\left(0, C_{1 \times 1}\left(X_{GAP}^{c}\right)\right)\right), \qquad W_s = C_{7 \times 7}\left(\left[X_{GAP}^{s}, X_{GMP}^{s}\right]\right)$$
Figure 4. Schematic diagram of the Content-Guided Attention (CGA) mechanism.
Here, max(0, x) denotes the ReLU activation function, which introduces nonlinearity. $C_{k \times k}(\cdot)$ represents a convolutional layer with a kernel size of k×k for feature extraction. The symbol [·] indicates channel concatenation, which combines different features along the channel dimension. $X_{GAP}^{c}$ is obtained by applying global average pooling to the input feature $X$ across the spatial dimensions, enabling the extraction of global spatial information. $X_{GAP}^{s}$ is the result of global average pooling along the channel dimension, which helps capture the mean information across channels. $X_{GMP}^{s}$ is produced by global max pooling along the channel dimension, which emphasizes the most salient channel-wise features. To reduce the number of model parameters and lower computational complexity, we adopt a channel reduction strategy. The first 1×1 convolutional layer reduces the channel dimension from $C$ to $C/r$ (where $r$ is the reduction ratio), and the second 1×1 convolutional layer restores the channel dimension back to $C$. In the implementation, $r$ is set to $C/16$, thereby fixing the reduced channel dimension to 16. Subsequently, $W_c$ and $W_s$ are fused through a simple addition operation (following broadcasting rules) to generate the coarse SIM $W_{coa} \in \mathbb{R}^{C \times H \times W}$. Experiments show that multiplicative fusion achieves a similar effect. The calculation is formulated as:
$$W_{coa} = W_c + W_s$$
To obtain the final refined SIMs W, each channel of W c o a is adjusted according to the corresponding input features. Specifically, utilizing the content of the input features as guidance, a channel shuffle operation is applied to alternately rearrange W c o a and the input feature X, thereby generating the final channel-specific SIM W [36]. This process is defined by the following formula:
$$W = \sigma\left(GC_{7 \times 7}\left(CS\left(\left[X, W_{coa}\right]\right)\right)\right)$$
Here, $\sigma$ denotes the sigmoid activation function, which maps the output values to the range (0, 1). CS(·) represents the channel shuffle operation, which enhances information exchange among channels. $GC_{k \times k}(\cdot)$ refers to a grouped convolution layer with a kernel size of k×k, where the number of groups is set to $C$ in practical applications. CGA assigns a unique SIM to each channel, acting like a precise navigation system that guides the model to focus on the critical regions of each channel.
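A compact NumPy sketch of this coarse-to-fine pipeline is given below. To stay short it replaces the 7×7 convolutions with pointwise (1×1) mixing weights and uses random, untrained parameters, so it illustrates only the tensor shapes and the attention flow (channel branch, spatial branch, broadcast sum, shuffle, grouped mix, sigmoid), not the trained behavior:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
C, H, W = 32, 8, 8
Cr = 16                                 # reduced width: C / r with r = C / 16
X = rng.normal(size=(C, H, W))

# --- channel attention W_c: spatial GAP, 1x1 conv down, ReLU, 1x1 conv up ---
gap_c = X.mean(axis=(1, 2))                       # (C,)
W_down, W_up = rng.normal(size=(Cr, C)), rng.normal(size=(C, Cr))
Wc = (W_up @ np.maximum(0.0, W_down @ gap_c)).reshape(C, 1, 1)

# --- spatial attention W_s: channel-wise GAP / GMP, then a pointwise mix ---
gap_s = X.mean(axis=0, keepdims=True)             # (1, H, W)
gmp_s = X.max(axis=0, keepdims=True)              # (1, H, W)
w_sp = rng.normal(size=2)
Ws = w_sp[0] * gap_s + w_sp[1] * gmp_s            # (1, H, W)

# --- coarse SIM: broadcast addition of channel and spatial weights ---
W_coa = Wc + Ws                                   # (C, H, W)

# --- refine: channel-shuffle X with W_coa, grouped (per-channel) mix, sigmoid
inter = np.empty((2 * C, H, W))
inter[0::2], inter[1::2] = X, W_coa               # interleave channels
w_mix = rng.normal(size=(C, 2))                   # one 2-input mixer per group
SIM = sigmoid(np.einsum("cp,cphw->chw", w_mix, inter.reshape(C, 2, H, W)))
assert SIM.shape == (C, H, W)
```

The interleave-then-group-mix step mirrors the channel shuffle followed by a grouped convolution with C groups: each output channel sees exactly its own feature channel paired with its coarse SIM channel.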

2.2.4. Implementation of the SIoU Loss Function

The default bounding box loss function employed in YOLOv8 is the Complete Intersection over Union (CIoU) [37], which is calculated as follows:
$$CIoU = IoU - \frac{\rho^2\left(b, b^{gt}\right)}{c^2} - \alpha v$$
$$L_{CIoU} = 1 - CIoU = 1 - \frac{\left|B^{gt} \cap B^{prd}\right|}{\left|B^{gt} \cup B^{prd}\right|} + \frac{\rho^2\left(B^{gt}, B^{prd}\right)}{(w_c)^2 + (w_h)^2} + \alpha v$$
$$\alpha = \frac{v}{1 - IoU + v}$$
$$v = \frac{4}{\pi^2}\left(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h}\right)^2$$
In the above formulas, $\alpha$ is the weighting coefficient, and $v$ is a parameter that measures the consistency of the aspect ratio. The meanings of the parameters are as follows: $B^{gt}$ represents the area of the ground-truth box, and $B^{prd}$ represents the area of the predicted box; $\rho^2\left(B^{gt}, B^{prd}\right)$ measures the distance between the centers of the two boxes; $w_c$ and $w_h$ denote the width and height of the smallest enclosing box that covers both boxes; $w$ and $h$ are the width and height of the predicted box, while $w^{gt}$ and $h^{gt}$ are the width and height of the ground-truth box. CIoU primarily considers overlapping area, center-point distance, and aspect ratio, but its scale adaptability is limited: for targets with significant scale variations, its performance may degrade. Therefore, we introduce SIoU [38], a loss function that more accurately reflects the matching degree between predicted and ground-truth boxes. The SIoU loss function consists of four cost components: angle cost ($\Lambda$), distance cost ($\Delta$), shape cost ($\Omega$), and IoU cost (Figure 5, Figure 6 and Figure 7). The SIoU loss is calculated as follows:
$$L_{SIoU} = 1 - \mathrm{IoU} + \frac{\Lambda + \Delta + \Omega}{2}$$
Figure 5. Schematic diagram illustrating the contribution of angular cost to the loss function calculation. The angle α between the line connecting box centers and the horizontal axis is the key parameter. The cost Λ is minimized when α approaches 0, guiding the prediction box towards the nearest axis.
Figure 6. Schematic diagram of the distance calculation between a ground-truth bounding box and its predicted counterpart. The distance cost considers the normalized center offsets $\rho_x$ and $\rho_y$, weighted by $\gamma$. $c_w$ and $c_h$ denote the width and height of the smallest enclosing box covering both boxes.
Figure 7. Schematic diagram of the contribution relationships among IoU components. The figure shows the geometric relationship between predicted and ground-truth boxes, including intersection, union, center distance, and minimum enclosing box—key elements for computing the IoU cost in the SIoU loss function.
The four cost functions are defined as follows:
$$\Lambda = 1 - 2\sin^2\!\left(\arcsin x - \frac{\pi}{4}\right)$$

$$\Delta = \sum_{t=x,y}\left(1 - e^{-\gamma \rho_t}\right), \quad \rho_x = \left(\frac{b_{cx}^{gt} - b_{cx}}{c_w}\right)^2, \quad \rho_y = \left(\frac{b_{cy}^{gt} - b_{cy}}{c_h}\right)^2$$

$$\Omega = \sum_{t=w,h}\left(1 - e^{-\omega_t}\right)^{\theta}$$

$$\mathrm{IoU} = \frac{\left|B_{prd} \cap B_{gt}\right|}{\left|B_{prd} \cup B_{gt}\right|}$$
In the SIoU loss function, the symbols are defined as follows:
$x = c_h / \sigma$, where $c_h$ is the difference in the vertical coordinates of the center points between the ground-truth box and the predicted box, and σ is the Euclidean distance between the two center points. α is the angle (in radians) between the line connecting the centers of the predicted and ground-truth boxes and the horizontal axis. γ is the weighting factor in the distance cost; it varies with the angle and is defined as $\gamma = 2 - \Lambda$. $\rho_t$ represents the squared ratio of the center-point coordinate difference to the corresponding dimension of the smallest enclosing box, where t is the direction indicator (x or y). θ is the weight parameter for the shape cost, controlling the strength of shape matching, and typically ranges from 2 to 6. $\omega_t$ represents the shape-difference metric for width or height, defined as the relative difference in width or height between the predicted box and the ground-truth box.
Owing to its more comprehensive consideration of factors, SIoU enables the model to learn target features and positional information more effectively, thereby improving detection accuracy, particularly when handling complex poses, scale variations, and occluded targets. Moreover, SIoU demonstrates stronger adaptability to scale changes, balancing the loss across targets of different sizes in multi-scale object detection. Its faster convergence allows the model to maintain high performance while reducing training time and computational resource consumption.
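Under the definitions above, the full SIoU loss can be sketched in plain Python (an illustration assuming corner-format boxes and the form of the loss given above; θ defaults to 4, within the typical 2–6 range):

```python
import math

def siou_loss(box_p, box_g, theta=4.0):
    """SIoU loss sketch for boxes given as (x1, y1, x2, y2)."""
    px1, py1, px2, py2 = box_p
    gx1, gy1, gx2, gy2 = box_g
    pw, ph = px2 - px1, py2 - py1
    gw, gh = gx2 - gx1, gy2 - gy1
    pcx, pcy = (px1 + px2) / 2, (py1 + py2) / 2
    gcx, gcy = (gx1 + gx2) / 2, (gy1 + gy2) / 2
    # IoU cost
    iw = max(0.0, min(px2, gx2) - max(px1, gx1))
    ih = max(0.0, min(py2, gy2) - max(py1, gy1))
    inter = iw * ih
    iou = inter / (pw * ph + gw * gh - inter)
    # angle cost: x = c_h / sigma, Lambda = 1 - 2 sin^2(arcsin(x) - pi/4)
    sigma = math.hypot(gcx - pcx, gcy - pcy)
    x = abs(gcy - pcy) / sigma if sigma > 0 else 0.0
    ang = 1 - 2 * math.sin(math.asin(x) - math.pi / 4) ** 2
    # distance cost with gamma = 2 - Lambda, normalized by the enclosing box
    cw = max(px2, gx2) - min(px1, gx1)
    ch = max(py2, gy2) - min(py1, gy1)
    gamma = 2 - ang
    rho_x = ((gcx - pcx) / cw) ** 2
    rho_y = ((gcy - pcy) / ch) ** 2
    dist = (1 - math.exp(-gamma * rho_x)) + (1 - math.exp(-gamma * rho_y))
    # shape cost: relative width/height differences, weighted by theta
    omega_w = abs(pw - gw) / max(pw, gw)
    omega_h = abs(ph - gh) / max(ph, gh)
    shape = (1 - math.exp(-omega_w)) ** theta + (1 - math.exp(-omega_h)) ** theta
    return 1 - iou + (ang + dist + shape) / 2

# identical boxes: IoU = 1 and all cost terms vanish, so the loss is ~0
print(siou_loss((0, 0, 2, 2), (0, 0, 2, 2)))
```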

2.3. Dataset

To support the recognition and detection of forest scenes, this study constructed a multi-type image dataset in Qianjiang Source-Baishanzu National Park using on-site photography with a DJI Phantom 4 Pro drone (manufacturer: DJI Innovations, Shenzhen, China). A total of 1814 images were collected, covering three categories: pine wilt disease-infected trees, under-construction farmhouses, and forest fires, thus providing diverse and comprehensive data for the study. The drone maintained a flight altitude of 70–80 m, and the image resolution was set to 1920 × 1080 pixels. In total, 661 images of pine wilt disease-infected trees and 487 images of under-construction farmhouses were captured, reflecting both the extent of pine wilt disease damage across different regions and the spatial distribution and progress of farmhouse construction. Considering the risks and limitations of capturing forest fire images in the field, 666 representative and authentic images were gathered from publicly available online sources, including news reports and professional photography websites. These images cover forest fires of varying scales and stages, offering valuable references for early fire-warning research in national parks. Before model training, the dataset was randomly divided into training, validation, and test sets in a ratio of 7:2:1.
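The 7:2:1 random split described above can be reproduced with a short helper (the file names and the seed below are illustrative, not from the paper):

```python
import random

def split_dataset(paths, ratios=(0.7, 0.2, 0.1), seed=42):
    """Random train/val/test split; the seed fixes the shuffle so the
    split is reproducible across runs."""
    paths = list(paths)
    random.Random(seed).shuffle(paths)
    n = len(paths)
    n_train = int(n * ratios[0])
    n_val = int(n * ratios[1])
    return (paths[:n_train],
            paths[n_train:n_train + n_val],
            paths[n_train + n_val:])

# placeholder names for the 1814-image dataset
images = [f"img_{i:04d}.jpg" for i in range(1814)]
train, val, test = split_dataset(images)
print(len(train), len(val), len(test))  # 1269 362 183
```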

3. Experiments

3.1. Evaluation Metrics

In this study, six key metrics were employed to comprehensively evaluate the models' performance: precision (P), recall (R), F1-score, mAP50, mAP50-95, and GFLOPs. The first three are computed from true positives (TP), false positives (FP), false negatives (FN), and true negatives (TN). The specific formulas are as follows:
$$R = \frac{TP}{TP + FN}, \quad P = \frac{TP}{TP + FP}$$
$$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
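The precision, recall, and F1 formulas translate directly into code (the detection counts below are illustrative):

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Precision, recall, and F1 from detection counts, per the formulas above."""
    p = tp / (tp + fp)          # correct detections among all detections
    r = tp / (tp + fn)          # correct detections among all ground truths
    f1 = 2 * p * r / (p + r)    # harmonic mean of precision and recall
    return p, r, f1

# e.g. 90 correct detections, 10 false alarms, 30 missed targets
p, r, f1 = precision_recall_f1(tp=90, fp=10, fn=30)
print(round(p, 3), round(r, 3), round(f1, 3))  # 0.9 0.75 0.818
```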
AP, i.e., Average Precision, is a metric used in object detection to measure a model’s detection performance for a single category, whereas mean Average Precision (mAP) represents the average of the average precisions across all categories. The calculation formula is as follows:
$$AP = \int_0^1 P(r)\,dr, \quad mAP = \frac{1}{N_c}\sum_{i=1}^{N_c} AP_i$$
In the formula, $P(r)$ denotes the precision at a recall of r, and $N_c$ represents the number of categories.
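In practice, AP is computed from a discretized precision-recall curve; a minimal sketch using the rectangular rule (the PR points below are toy values, not from the paper):

```python
def average_precision(recalls, precisions):
    """AP as the area under the precision-recall curve, accumulated over
    recall increments; (recall, precision) points sorted by increasing recall."""
    ap, prev_r = 0.0, 0.0
    for r, p in zip(recalls, precisions):
        ap += (r - prev_r) * p
        prev_r = r
    return ap

def mean_ap(ap_per_class):
    """mAP: mean of the per-class AP values."""
    return sum(ap_per_class) / len(ap_per_class)

# toy PR curve: precision 1.0 up to recall 0.5, then 0.5 up to recall 1.0
ap = average_precision([0.5, 1.0], [1.0, 0.5])
print(ap)  # 0.75
```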
GFLOPs, representing one billion floating-point operations (1 GFLOP = $10^9$ FLOPs), is a key metric for assessing the computational complexity of deep learning models. The formula is as follows:
$$\mathrm{GFLOPs} = \frac{1}{10^9}\sum_{l=1}^{L}\left(2 \times C_l^{in} \times C_l^{out} \times K_l^2 \times H_l \times W_l\right)$$
Here, L denotes the total number of network layers, $C_l^{in}$/$C_l^{out}$ represent the input/output channel counts at the l-th layer, $K_l$ is the convolution kernel size, and $H_l \times W_l$ is the feature-map size.
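The GFLOPs formula can be evaluated layer by layer; the two-layer network below uses hypothetical channel counts, not those of YOLOv8:

```python
def conv_gflops(layers):
    """GFLOPs of a stack of conv layers via the formula above:
    sum over layers of 2 * C_in * C_out * K^2 * H * W, divided by 1e9.
    Each layer is a tuple (C_in, C_out, K, H, W)."""
    flops = sum(2 * cin * cout * k * k * h * w
                for cin, cout, k, h, w in layers)
    return flops / 1e9

# toy two-layer backbone on a 640x640 input (illustrative channel counts)
layers = [(3, 16, 3, 320, 320), (16, 32, 3, 160, 160)]
print(round(conv_gflops(layers), 3))  # 0.324
```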

3.2. Experimental Configuration

3.2.1. Training Setup

The hardware environment consisted of an Intel Core Ultra 5-125H processor paired with an Intel Arc integrated GPU. The software environment was based on the Ubuntu 18.04 operating system and the PyTorch 1.8.1 deep learning framework. For hyperparameters, the initial learning rate was set to 0.01, the input image size to 640 × 640, the batch size to 32, and the number of training epochs to 300. An early-stopping mechanism with a patience of 200 epochs was implemented: if model performance did not improve over 200 consecutive epochs, training terminated automatically. All models were initialized with YOLOv8 COCO-pretrained weights. All other options remained at the defaults of the original YOLOv8 model. The complete training configuration is summarized in Table 1.
Table 1. Unified training hyperparameters.
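For reference, the training setup above maps roughly onto the Ultralytics API as follows. This is a configuration sketch only: the dataset YAML path and the `yolov8n.pt` weight file are placeholders (the paper does not state the model size), and running it requires the `ultralytics` package.

```python
# Configuration sketch reproducing the hyperparameters above; paths are placeholders.
from ultralytics import YOLO

model = YOLO("yolov8n.pt")          # COCO-pretrained weights (size is an assumption)
model.train(
    data="national_park.yaml",      # placeholder dataset config
    epochs=300,
    imgsz=640,
    batch=32,
    lr0=0.01,                       # initial learning rate
    patience=200,                   # early stopping after 200 stagnant epochs
)
```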

3.2.2. Data Augmentation

The settings are identical across all three scenarios and are listed in Table 2. No augmentation was applied during validation and testing.
Table 2. Online data augmentation.

3.2.3. Inference Settings

All models share the same inference hyperparameters (Table 3). These values were uniformly applied to all scenarios and all compared models without any task-specific tuning.
Table 3. Inference hyperparameters.

4. Experimental Results

4.1. Overall Model Performance

To comprehensively evaluate the proposed method, we conducted separate experiments on three dedicated datasets, each corresponding to one target scenario: pine wilt disease-infected trees, under-construction farmhouses, and forest fires. Classic YOLO versions were trained on each dataset, and their results on the corresponding test sets were compared and analyzed. As shown in Table 4, Table 5 and Table 6, among the original YOLO models, YOLOv8 achieved higher precision and mAP50 scores than the other versions, demonstrating superior performance. Therefore, this study focuses on improvements to the YOLOv8 architecture.
Table 4. Pine wilt disease-infected tree scenario.
Table 5. Under-construction farmhouse scenario.
Table 6. Forest fire scenario.
Table 4 details the performance of different methods on the pine wilt disease-infected tree dataset. The results indicate that the improved YOLOv8-StarNet-CGA model achieves the most significant gains and is better suited to this scenario. Its precision (P), recall (R), F1-score, mAP50, and mAP50-95 increased by 8.6%, 13%, 11.2%, 11.7%, and 14.8%, respectively, compared to YOLOv8, highlighting the superiority of the improved algorithm for this detection task. Notably, SCS-YOLOv8 also outperforms the other methods across all metrics: compared with the original YOLOv8, it achieves improvements of 8%, 11.3%, 10%, 11%, and 16.1% in P, R, F1-score, mAP50, and mAP50-95, respectively. However, replacing the loss function led to a slight drop relative to YOLOv8-StarNet-CGA (precision decreased from 0.915 to 0.909, and mAP50 from 0.955 to 0.948), indicating that SIoU is not fully adapted to this scenario. Pine wilt disease-infected trees are static targets with stable morphology, and the detection task primarily involves distinguishing them from complex backgrounds rather than optimizing bounding-box shape or orientation. By emphasizing shape and orientation consistency, SIoU imposes an over-constraint that can distract the model from critical features, occasionally causing misclassification of normal targets. A potential improvement is scene-adaptive weighting: for regular targets such as pine wilt disease-infected trees, the angle and shape costs in SIoU could be down-weighted in favor of position and overlap optimization.
Subsequently, training was conducted on the under-construction farmhouse dataset, followed by comparative experiments. Table 5 presents the results, showing that the improved YOLOv8-StarNet-CGA model again achieved the best performance across all metrics. Its precision (P), recall (R), F1-score, mAP50, and mAP50-95 increased by 11%, 10.2%, 10.6%, 10.1%, and 22.8%, respectively, compared to YOLOv8. Although SCS-YOLOv8 ranked second, its metrics were also near-optimal, demonstrating strong detection capability: compared with YOLOv8, it improved precision, recall, F1-score, mAP50, and mAP50-95 by 10.7%, 9.6%, 10.2%, 9.3%, and 21.2%, respectively.
Furthermore, to provide additional comparison with the original YOLOv8 and validate the robustness of SCS-YOLOv8, experiments were conducted on a forest fire dataset. As shown in Table 6, our method achieved superior performance. The proposed SCS-YOLOv8 improved P, R, F1-score, mAP50, and mAP50-95 by 7.2%, 13%, 10.1%, 6.3% and 10.2%, respectively, compared to the original YOLOv8, demonstrating its clear performance advantage.
Finally, the improved models achieve not only “high precision” but also “low computational complexity”. Both YOLOv8-StarNet-CGA and SCS-YOLOv8 exhibit GFLOPs of 13.8, only 48.6% of the original YOLOv8, while still significantly improving accuracy metrics. This comparison demonstrates that, through replacing the C2f module with StarNet and optimizing feature fusion via CGA, the models maintain enhanced detection performance while substantially reducing computational complexity, balancing efficiency and precision. These improvements make the models particularly suitable for practical applications such as drone inspections in national parks and deployment on edge devices.
Based on this experimental analysis, we can preliminarily conclude that YOLOv8-StarNet-CGA is better suited to detecting pine wilt disease-infected trees and under-construction farmhouses: enhanced global feature extraction via StarNet and CGA's content-guided focus on key regions significantly improve detection accuracy and robustness for both complex and regular targets. For forest fire scenarios, SCS-YOLOv8, which employs the SIoU loss function, achieves superior performance: SIoU optimizes shape and orientation consistency, better accommodating sparsely distributed fire points and the diverse bounding-box shapes of fires, yielding notable improvements in precision and stability. This study also emphasizes the practicality of multi-scenario detection in national parks. The original YOLOv8 has high computational demands, making it difficult to deploy on drones or other edge devices; in contrast, the improved models substantially reduce GFLOPs while maintaining high accuracy, meeting field-inspection requirements for device endurance and real-time performance. This balance between performance and efficiency is a core contribution of the proposed algorithms and provides quantitative support for future onboard drone deployment. The two improvements optimize for different target characteristics and scene complexities, demonstrating the adaptability of specific modules to scenario-specific requirements; selecting the appropriate combination of enhancements for each target type is therefore essential.
To more intuitively demonstrate the superior performance of YOLOv8-StarNet-CGA and SCS-YOLOv8 in object detection, Figure 8, Figure 9 and Figure 10 compare detection results on each dataset between the most suitable model for the scenario and the original YOLOv8. The experiments show that YOLOv8 exhibits varying degrees of missed detections and false positives, particularly in scenarios with dense targets, complex backgrounds, or severe occlusion, which clearly limits its detection capability. In contrast, YOLOv8-StarNet-CGA and SCS-YOLOv8 effectively overcome these issues through their improved designs, significantly enhancing detection completeness and reliability.
Figure 8. Pine wilt disease-infected tree scenario. (a) Original YOLOv8 model; (b) Improved YOLOv8-StarNet-CGA model. In the first two sets, it is evident that the YOLOv8 model exhibits missed detections, whereas the improved YOLOv8-StarNet-CGA model achieves more precise detection. In the latter two sets, although neither model shows missed detections, YOLOv8-StarNet-CGA demonstrates higher confidence scores, yielding more accurate results.
Figure 9. Under-construction farmhouse scenario. (a) Original YOLOv8 model; (b) Improved YOLOv8-StarNet-CGA model. Although neither model exhibits missed or false detections, the YOLOv8-StarNet-CGA model shows higher confidence scores and produces more accurate results.
Figure 10. Forest fire scenario. (a) Original YOLOv8 model; (b) Improved SCS-YOLOv8 model. The figure clearly shows that the YOLOv8 model exhibits missed detections, whereas the improved SCS-YOLOv8 model achieves more precise detection with higher confidence scores, producing more accurate results.

4.2. Ablation Study of the Proposed Models

To further investigate the contributions of each improvement module in YOLOv8-StarNet-CGA and SCS-YOLOv8, ablation experiments were conducted on three scenario-specific datasets. The StarNet, CGA, and SIoU modules were progressively introduced, with the results documented in Table 7. The analysis focuses on the improvements in precision and mAP50 achieved by each module.
Table 7. Ablation experiment results.
In the pine wilt disease-infected tree scenario, introducing StarNet increased precision by 5.3% and mAP50 by 8.8%, indicating that StarNet significantly enhances detection accuracy through improved global feature extraction. Adding CGA alone resulted in a modest increase of 0.3% in precision and 0.2% in mAP50, suggesting that CGA has limited impact without a strong backbone network. Combining StarNet with SIoU led to improvements of 6.1% in precision and 9.3% in mAP50, with SIoU further optimizing bounding box regression accuracy. When StarNet and CGA were combined (YOLOv8-StarNet-CGA), precision reached 0.915 (an 8.6% increase) and mAP50 reached 0.955 (an 11.7% increase), demonstrating the best performance and indicating that their synergy significantly enhances the extraction and focus on key features. However, in the full SCS-YOLOv8 model, precision was 0.909 (8% increase) and mAP50 was 0.948 (11% increase), slightly lower than YOLOv8-StarNet-CGA. Although the optimization is less adapted to the morphology of pine wilt disease-infected trees in this scenario, it still significantly outperforms the original model.
In the under-construction farmhouse scenario, introducing StarNet increased precision by 4.1% and mAP50 by 5.4%, demonstrating its efficiency in extracting features of regular targets. Using CGA alone improved precision and mAP50 by 2% and 0.7%, respectively, indicating that CGA requires a strong backbone network to be fully effective. Combining StarNet with SIoU led to gains of 4.4% in precision and 5.5% in mAP50, with SIoU beginning to contribute to bounding box optimization. When StarNet and CGA were combined, precision and mAP50 reached 0.983 (11% increase) and 0.985 (10.1% increase), achieving the best performance and highlighting their strong synergistic effect on regular target detection. The full SCS-YOLOv8 achieved precision and mAP50 of 0.896 and 0.949, slightly lower than YOLOv8-StarNet-CGA, yet still maintaining very strong performance.
In the forest fire scenario, introducing StarNet increased precision by 2.4% and mAP50 by 8%, confirming the capability of StarNet for global perception of sparse fire spot features. Adding CGA alone resulted in a 0.4% increase in precision and 4.3% in mAP50, representing a modest yet notable contribution. Combining StarNet with SIoU improved precision and mAP50 by 2.1% and 3.6%, respectively, highlighting the role of SIoU in optimizing fire spot shape consistency. When StarNet and CGA were combined, precision reached 0.896 (6.2% increase) and mAP50 reached 0.91 (6% increase), demonstrating high detection accuracy. The complete SCS-YOLOv8 further improved precision and mAP50 to 0.906 and 0.913, achieving the best results, indicating that SIoU significantly enhances detection robustness by optimizing bounding box regression for irregular targets in this scenario.
Finally, tracking the dynamic changes in GFLOPs clearly illustrates the specific impact of each improvement module on computational complexity. StarNet reduces the baseline computational load: when only StarNet is introduced, GFLOPs decrease from 28.4 in the original YOLOv8 to 24.4. This indicates that StarNet, through the high-dimensional feature representation capability of star operations, enhances global feature extraction while simultaneously reducing redundant computations. CGA further improves computational efficiency: with CGA added, GFLOPs drop from 24.4 to 13.8, a reduction of 43.4%. This occurs because CGA replaces the traditional concat operation, using dynamic weights to focus on key features and reduce the computation of irrelevant feature fusion, thereby improving feature utilization while lowering computational load. SIoU has no significant effect on computation: after introducing SIoU, GFLOPs remain at 13.8, essentially unchanged, confirming that replacing the loss function does not substantially affect core computations but improves robustness by optimizing bounding box regression without additional computational cost.

4.3. Comparison with Other Object Detection Models

To validate the performance advantages of YOLOv8-StarNet-CGA and SCS-YOLOv8, comparative experiments were conducted against three classical object detection models—EfficientNet, SSD, and DETR—across three distinct scenarios: pine wilt disease-infected trees, under-construction farmhouses, and forest fires. The results are summarized in Table 8:
Table 8. Comparison with other classical object detection models.
Additionally, experiments were conducted on the three aforementioned scenarios using three YOLOv8-based improved models: YOLO-Drone [39], TSD-YOLO [40], and YOLO-MS [41]. The comparative results are presented in Table 9:
Table 9. Comparison with other YOLOv8-based improved models.
The results demonstrate that both proposed models, YOLOv8-StarNet-CGA and SCS-YOLOv8, significantly outperform other baseline models and YOLOv8-based variants across key evaluation metrics, including Precision, Recall, and mAP50. Specifically, YOLOv8-StarNet-CGA achieves the best performance in detecting pine wilt disease-infected trees and under-construction farmhouses, whereas SCS-YOLOv8 exhibits superior robustness in forest fire scenarios owing to the SIoU optimization. These comparative results fully validate the effectiveness of the proposed modules in enhancing object detection performance in complex forest environments within national parks.

5. Discussion

The proposed YOLOv8-StarNet-CGA and SCS-YOLOv8 models significantly enhance object detection performance in national park scenarios. YOLOv8-StarNet-CGA integrates StarNet and CGA, while SCS-YOLOv8 further incorporates SIoU on top of YOLOv8-StarNet-CGA, yielding notable improvements in detection accuracy. StarNet strengthens global feature extraction, CGA optimizes feature fusion in critical regions, and SIoU enhances the robustness of bounding box regression in complex pose scenarios. Experimental results indicate that YOLOv8-StarNet-CGA is particularly effective for pine wilt disease-infected trees and under-construction farmhouses, whereas SCS-YOLOv8 excels in forest fire scenarios. Both models outperform the original YOLOv8 across all three scenarios, demonstrating strong adaptability to complex and irregular target conditions, while also balancing precision and efficiency, making them suitable for practical UAV deployment.
However, the adaptability of SIoU remains insufficient in certain scenarios and requires further improvement. One potential direction is scene-adaptive weighting: for regular targets such as pine wilt disease-infected trees and under-construction farmhouses, the angle and shape penalties in SIoU could be reduced, emphasizing positional and overlap optimization. Additionally, the forest fire dataset relies on publicly available internet sources, which presents clear limitations: it lacks diversity, mainly covers typical visible fire scenes, and does not adequately represent complex terrain (e.g., canyons, steep slopes) or specific vegetation types (e.g., coniferous forests, shrublands). Samples under extreme weather are also missing, with no coverage of heavy rain, dense fog, or low-contrast conditions such as nighttime, limiting the model's adaptability to real-world complex environments. Future work should focus on constructing field-collected datasets, conducting controlled UAV experiments across varying elevations and vegetation types, supplementing extreme-weather and dynamic samples, and recording correlations between fire events and environmental parameters. Furthermore, for existing UAV-based fire-monitoring models, a 13% improvement in recall under continuous aerial patrol implies that more fire-containing frames are correctly identified; however, the actual early-detection time gained, once fire-spread dynamics are considered, still requires experimental validation. These steps will enhance the model's generalization in real-world scenarios and better meet the practical requirements of national park fire monitoring.

6. Conclusions

This study proposes two improved YOLOv8-based multi-scene forest object detection algorithms, namely YOLOv8-StarNet-CGA and SCS-YOLOv8, optimized for detecting pine wilt disease-infected trees, forest fires, and under-construction farmhouses in national park ecological monitoring. First, StarNet was introduced to replace the C2f module in the backbone of YOLOv8, enhancing global feature extraction, and CGA dynamically adjusted feature weights to emphasize key regions. Next, the original CIoU loss function was replaced with SIoU to improve the robustness of bounding box regression. SCS-YOLOv8 was evaluated on a UAV-acquired national park dataset covering the three target categories. Compared with the original YOLOv8, SCS-YOLOv8 improved mAP50 by 11% for pine wilt disease-infected trees, 9.3% for under-construction farmhouses, and 11.6% for forest fires. Meanwhile, YOLOv8-StarNet-CGA achieved mAP50 gains of 11.7%, 10.1%, and 9.7% in the respective scenarios, indicating that YOLOv8-StarNet-CGA is more suitable for pine wilt disease-infected trees and under-construction farmhouses, while SCS-YOLOv8 excels in forest fire scenarios. Furthermore, both models demonstrated superior performance in Precision, Recall, mAP50, and mAP50-95 compared with other YOLO variants and mainstream detection models. The GFLOPs of both improved models decreased from 28.4 in the original YOLOv8 to 13.8, achieving both computational efficiency and enhanced detection performance. Overall, the two YOLOv8-based improved models exhibit stronger detection capabilities, balance precision and efficiency, perform exceptionally in complex backgrounds and diverse target scenarios, address multi-scene detection challenges in national parks, and are highly suitable for UAV inspection deployments, providing efficient technical support for ecological monitoring.

Author Contributions

Conceptualization, X.L., Z.Q. and G.J.; methodology, Z.Q. and S.C.; software, Z.Q.; validation, Z.Q.; formal analysis, X.L., Z.Q., X.Z., L.S. and D.W.; investigation, Z.Q.; resources, X.L., Z.Q., F.W. and G.J.; data curation, Z.Q. and F.W.; writing—original draft, Z.Q.; writing—review and editing, X.L. and Z.Q.; visualization, Z.Q.; supervision, X.L., Z.Q., H.L., X.Z., L.S., F.W., D.W., S.C. and G.J.; project administration, Z.Q. and H.L.; funding acquisition, X.L. and G.J. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Data available on request due to restrictions. The raw image data supporting the findings of this study are available from the corresponding author upon reasonable request. The data are not publicly available due to concerns regarding the protection of ecological locations and the potential risk of disturbance to the study sites.

Acknowledgments

We are grateful to the Qianjiang Source-Baishanzu National Park Qianjiang Source Management Bureau for supplying the valuable model data. We would also like to thank the English language editors, who helped check for grammatical errors.

Conflicts of Interest

The authors declare no conflicts of interest. Hanbao Lou is employed by Hangzhou Ganzhi Technology Co., Ltd., his employer’s company was not involved in this study, and there is no relevance between this research and their company.

References

  1. Zhu, J.; Song, L. A review of ecological mechanisms for management practices of protective forests. J. For. Res. 2021, 32, 435–448. [Google Scholar] [CrossRef]
  2. Yun, T.; Li, J.; Ma, L.; Zhou, J.; Wang, R.; Eichhorn, M.P.; Zhang, H. Status, advancements and prospects of deep learning methods applied in forest studies. Int. J. Appl. Earth Obs. Geoinf. 2024, 131, 103938. [Google Scholar] [CrossRef]
  3. Li, Z.; Liu, F.; Yang, W.; Peng, S.; Zhou, J. A survey of convolutional neural networks: Analysis, applications, and prospects. IEEE Trans. Neural Netw. Learn. Syst. 2021, 33, 6999–7019. [Google Scholar] [CrossRef]
  4. Han, K.; Wang, Y.; Chen, H.; Chen, X.; Guo, J.; Liu, Z.; Tang, Y.; Xiao, A.; Xu, C.; Xu, Y.; et al. A survey on vision transformer. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 87–110. [Google Scholar] [CrossRef] [PubMed]
  5. Subedi, A. Improving Generalization Performance of YOLOv8 for Camera Trap Object Detection. Master's Thesis, University of Cincinnati, Cincinnati, OH, USA, 2024. [Google Scholar]
  6. Rey, L.; Bernardos, A.M.; Dobrzycki, A.D.; Carramiñana, D.; Bergesio, L.; Besada, J.A.; Casar, J.R. A Performance Analysis of You Only Look Once Models for Deployment on Constrained Computational Edge Devices in Drone Applications. Electronics 2025, 14, 638. [Google Scholar] [CrossRef]
  7. Hua, W.; Chen, Q. A survey of small object detection based on deep learning in aerial images. Artif. Intell. Rev. 2025, 58, 1–67. [Google Scholar] [CrossRef]
  8. Back, M.A.; Bonifácio, L.; Inácio, M.L.; Mota, M.; Boa, E. Pine wilt disease: A global threat to forestry. Plant Pathol. 2024, 73, 1026–1041. [Google Scholar] [CrossRef]
  9. Yuan, J.; Wang, L.; Wang, T.; Bashir, A.K.; Al Dabel, M.M.; Wang, J.; Feng, H.; Fang, K.; Wang, W. YOLOv8-RD: High-robust pine wilt disease detection method based on residual fuzzy YOLOv8. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 1734–1747. [Google Scholar] [CrossRef]
  10. Xiao, J.; Wu, J.; Liu, D.; Li, X.; Liu, J.; Su, X.; Wang, Y. Improved Pine Wood Nematode Disease Diagnosis System Based on Deep Learning. Plant Dis. 2025, 109, 862–874. [Google Scholar] [CrossRef]
  11. Han, Y.; Duan, B.; Guan, R.; Yang, G.; Zhen, Z. LUFFD-YOLO: A lightweight model for UAV remote sensing forest fire detection based on attention mechanism and multi-level feature fusion. Remote Sens. 2024, 16, 2177. [Google Scholar] [CrossRef]
  12. Saydirasulovich, S.N.; Mukhiddinov, M.; Djuraev, O.; Abdusalomov, A.; Cho, Y.I. An improved wildfire smoke detection based on YOLOv8 and UAV images. Sensors 2023, 23, 8374. [Google Scholar] [CrossRef] [PubMed]
  13. Bouguettaya, A.; Zarzour, H.; Taberkit, A.M.; Kechida, A. A review on early wildfire detection from unmanned aerial vehicles using deep learning-based computer vision algorithms. Signal Process. 2022, 190, 108309. [Google Scholar] [CrossRef]
  14. Ali, H.A.; Ever, E.; Kizilkaya, B.; Khan, M.T.R.; Rehman, M.U.; Ansari, S.; Imran, M.A.; Yazici, A. An edge-intelligent three-tier framework for real-time forest fire detection, integrating WSNs, WMSNs, and UAVs. Internet Things 2026, 36, 101861. [Google Scholar] [CrossRef]
  15. Yi, H.; Liu, B.; Zhao, B.; Liu, E. Small object detection algorithm based on improved YOLOv8 for remote sensing. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2023, 17, 1734–1747. [Google Scholar] [CrossRef]
  16. Zhou, K.; Shi, J.L.; Fu, J.Y.; Zhang, S.X.; Liao, T.; Yang, C.Q.; Wu, J.R.; He, Y.C. An improved YOLOv10 algorithm for automated damage detection of glass curtain-walls in high-rise buildings. J. Build. Eng. 2025, 101, 111812. [Google Scholar] [CrossRef]
  17. Varghese, R.; Sambath, M. YOLOv8: A novel object detection algorithm with enhanced performance and robustness. In Proceedings of the 2024 International Conference on Advances in Data Engineering and Intelligent Computing Systems (ADICS), Tamil Nadu, India, 17–18 April 2024; pp. 1–6. [Google Scholar]
  18. Gan, B.; Pu, G.; Xing, W.; Wang, L.; Liang, S. Enhanced YOLOv8 with lightweight and efficient detection head for detecting rice leaf diseases. Sci. Rep. 2025, 15, 22179. [Google Scholar] [CrossRef]
  19. Ranjbarzadeh, R.; Crane, M.; Bendechache, M. The impact of backbone selection in YOLOv8 models on brain tumor localization. Iran J. Comput. Sci. 2025, 8, 939–961. [Google Scholar] [CrossRef]
  20. Qi, Z.; Hua, W.; Zhang, Z.; Deng, X.; Yuan, T.; Zhang, W. A novel method for tomato stem diameter measurement based on improved YOLOv8-seg and RGB-D data. Comput. Electron. Agric. 2024, 226, 109387. [Google Scholar] [CrossRef]
  21. Wang, K.; Liew, J.H.; Zou, Y.; Zhou, D.; Feng, J. Panet: Few-shot image semantic segmentation with prototype alignment. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Korea, 27 October–2 November 2019; pp. 9197–9206. [Google Scholar]
  22. Zhu, L.; Lee, F.; Cai, J.; Yu, H.; Chen, Q. An improved feature pyramid network for object detection. Neurocomputing 2022, 483, 127–139. [Google Scholar] [CrossRef]
  23. Sapkota, R.; Flores-Calero, M.; Qureshi, R.; Badgujar, C.; Nepal, U.; Poulose, A.; Zeno, P.; Vaddevolu, U.B.P.; Khan, S.; Shoman, M.; et al. YOLO advances to its genesis: A decadal and comprehensive review of the you only look once (YOLO) series. Artif. Intell. Rev. 2025, 58, 274. [Google Scholar] [CrossRef]
  24. Zhao, Z.; He, C.; Zhao, G.; Zhou, J.; Hao, K. RA-YOLOX: Re-parameterization align decoupled head and novel label assignment scheme based on YOLOX. Pattern Recognit. 2023, 140, 109579. [Google Scholar] [CrossRef]
  25. Koh, P.W.; Nguyen, T.; Tang, Y.S.; Mussmann, S.; Pierson, E.; Kim, B.; Liang, P. Concept bottleneck models. In Proceedings of the International Conference on Machine Learning, Virtual, 12–18 July 2020; pp. 5338–5348. [Google Scholar]
  26. Ma, X.; Dai, X.; Bai, Y.; Wang, Y.; Fu, Y. Rewrite the stars. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024; pp. 5694–5703. [Google Scholar]
  27. Moghaddam, V.H.; Hamidzadeh, J. New Hermite orthogonal polynomial kernel and combined kernels in Support Vector Machine classifier. Pattern Recognit. 2016, 60, 921–935. [Google Scholar] [CrossRef]
  28. Zhou, D.; Hou, Q.; Chen, Y.; Feng, J.; Yan, S. Rethinking Bottleneck Structure for Efficient Mobile Network Design. In European Conference on Computer Vision; Springer International Publishing: Cham, Switzerland, 2020; pp. 680–697. [Google Scholar]
  29. Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.C. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 4510–4520. [Google Scholar]
  30. Mansuri, S.; Gupta, S.K.; Singh, D.P.; Choudhary, J. Modified DMGC algorithm using ReLU-6 with improved learning rate for complex cluster associations. In Proceedings of the 2022 6th International Conference on Intelligent Computing and Control Systems (ICICCS), Madurai, India, 25–27 May 2022; pp. 634–640. [Google Scholar]
  31. Hofmann, T.; Schölkopf, B.; Smola, A.J. Kernel methods in machine learning. Ann. Statist. 2008, 36, 1171–1220. [Google Scholar] [CrossRef]
  32. Hussain, M. YOLO-v1 to YOLO-v8, the rise of YOLO and its complementary nature toward digital manufacturing and industrial defect detection. Machines 2023, 11, 677. [Google Scholar] [CrossRef]
  33. Chen, Z.; He, Z.; Lu, Z.M. DEA-Net: Single image dehazing based on detail-enhanced convolution and content-guided attention. IEEE Trans. Image Process. 2024, 33, 1002–1015. [Google Scholar] [CrossRef]
  34. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  35. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7132–7141. [Google Scholar]
  36. Zhang, X.; Zhou, X.; Lin, M.; Sun, J. Shufflenet: An extremely efficient convolutional neural network for mobile devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 6848–6856. [Google Scholar]
  37. Zheng, Z.; Wang, P.; Liu, W.; Li, J.; Ye, R.; Ren, D. Distance-IoU loss: Faster and better learning for bounding box regression. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 12993–13000. [Google Scholar]
  38. Gevorgyan, Z. SIoU loss: More powerful learning for bounding box regression. arXiv 2022, arXiv:2205.12740. [Google Scholar] [CrossRef]
  39. Zhai, X.; Huang, Z.; Li, T.; Liu, H.; Wang, S. YOLO-Drone: An optimized YOLOv8 network for tiny UAV object detection. Electronics 2023, 12, 3664. [Google Scholar] [CrossRef]
  40. Du, S.; Pan, W.; Li, N.; Dai, S.; Xu, B.; Liu, H.; Xu, C.; Li, X. TSD-YOLO: Small traffic sign detection based on improved YOLOv8. IET Image Process. 2024, 18, 2884–2898. [Google Scholar] [CrossRef]
  41. Chen, Y.; Yuan, X.; Wang, J.; Wu, R.; Li, X.; Hou, Q.; Cheng, M.M. YOLO-MS: Rethinking multi-scale representation learning for real-time object detection. IEEE Trans. Pattern Anal. Mach. Intell. 2025, 47, 4240–4252. [Google Scholar] [CrossRef] [PubMed]