Article

Autonomous UAV Detection of Ochotona curzoniae Burrows with Enhanced YOLOv11

Precision Forestry Key Laboratory of Beijing, Beijing Forestry University, Beijing 100083, China
* Author to whom correspondence should be addressed.
Drones 2025, 9(5), 340; https://doi.org/10.3390/drones9050340
Submission received: 27 March 2025 / Revised: 20 April 2025 / Accepted: 24 April 2025 / Published: 30 April 2025
(This article belongs to the Section Drones in Ecology)

Abstract

The Tibetan Plateau is a critical ecological habitat where the overpopulation of plateau pika (Ochotona curzoniae), a keystone species, accelerates grassland degradation through excessive burrowing and herbivory, threatening ecological balance and human activities. To address the inefficiency and high costs of traditional pika burrow monitoring, this study proposes an intelligent monitoring solution that integrates drone remote sensing with deep learning. By combining the lightweight visual Transformer architecture EfficientViT with the hybrid attention mechanism CBAM, we develop an enhanced YOLOv11-AEIT algorithm: (1) EfficientViT is employed as the backbone network, strengthening micro-burrow feature representation through a multi-scale feature coupling mechanism that alternates between local window attention and global dilated attention; (2) the integration of CBAM (Convolutional Block Attention Module) in the feature fusion neck reduces false detections through dual channel–spatial attention filtering. Evaluations on our custom PPCave2025 dataset show that the enhanced model achieves a 98.6% mAP@0.5, outperforming the baseline YOLOv11 by 3.5 percentage points, with precision and recall improvements of 4.8% and 7.2%, respectively. The algorithm enhances efficiency by a factor of 15 compared to manual inspection, while meeting real-time drone detection requirements. This approach provides high-precision yet lightweight technical support for plateau ecological conservation and serves as a valuable methodological reference for similar ecological monitoring tasks.

1. Introduction

The Tibetan Plateau, known as the Asian Water Tower, sustains critical ecological security barrier functions through its alpine grassland ecosystems. As a keystone ecosystem engineer in this region, the plateau pika (Ochotona curzoniae) induces declines in grassland productivity exceeding 40% when burrow density surpasses 34 burrows/100 m² [1]. The stability of the alpine grassland ecosystem is vital not only for local biodiversity but also for the water supply security of approximately 500 million people in the middle and lower reaches of the Yangtze and Yellow rivers [2]. Plateau pikas contribute to the landscape’s microtopography–hydrology–vegetation feedback mechanisms by increasing surface roughness (by 40–60%) and enhancing infiltration rates (2–3 times higher), thus maintaining the regional water balance in arid and semi-arid conditions [3]. However, the current Technical Specifications for Ecological Monitoring on the Tibetan Plateau still rely on manual quadrat surveys, presenting two critical limitations: (1) restricted spatial coverage (labor costs > 525.59 USD/km²) and (2) poor timeliness (data update cycles ≥ 90 days), severely constraining early-warning capacity for grassland degradation. Addressing these in situ monitoring bottlenecks has become an urgent requirement for implementing the “Three Zones and Four Belts” ecological conservation strategy [4].
Although drone remote sensing offers novel avenues for plateau pika (Ochotona curzoniae) burrow monitoring, the biological traits of these burrows and the environmental conditions of the Tibetan Plateau jointly pose unique technical challenges: (1) Morphology: burrow openings exhibit an elliptical shape (major-to-minor axis ratio: 1.32 ± 0.15), with an average physical size of 18 × 14 cm. At a flight altitude of 30 m, these features manifest as micro-targets of 12 × 9 pixels in imagery, falling below the minimum scale (16 × 16) of YOLOv5’s default anchor boxes [5]. (2) Appearance: burrows exhibit brown-green hues, similar to the surrounding grasslands, due to vegetation cover and fecal accumulation, with a mean ΔE value of 14.3 in the Lab color space, significantly lower than the vehicle–background contrast (ΔE > 50) [6]. (3) Distribution: burrows frequently occur as family clusters, reaching local densities of 36 burrows/10 m², resulting in target overlap and edge adhesion in aerial imagery [7]. Current detection methods face a dual challenge: traditional CV algorithms exhibit a false detection rate exceeding 58% in high-density scenarios, while generic deep learning detectors achieve a missed detection rate of 41.2% on our custom dataset due to their failure to account for the morphological specificity of pika burrows, severely limiting monitoring efficacy [8].
This technological gap underscores the unique value of deep learning in pika burrow monitoring [9]. Compared to traditional methods, deep learning overcomes the limitations of handcrafted features through end-to-end feature learning; yet, its potential in micro-target detection remains underutilized for burrow monitoring. Existing generic detection frameworks exhibit fundamental flaws when addressing three core challenges: (1) standard convolutional kernels struggle to capture sub-pixel morphological features [10]; (2) conventional attention mechanisms lack discriminative power for low-ΔE targets, and (3) dense target detection lacks ecological distribution priors for optimization [2]. Thus, developing specialized deep learning architectures for plateau pika burrow detection represents not only a technical breakthrough for ecological monitoring efficiency but also a critical opportunity to advance the theoretical development of air–space–ground integrated ecological perception systems.
In recent years, the application of YOLO-based deep learning models has gained significant attention in ecological studies for automating the detection of wildlife in aerial imagery. For example, Cusick et al. employed the YOLOv6L model to automate the counting of Antarctic shag nests, achieving a high F1 score of over 0.95 [11]. This study demonstrates the effectiveness of YOLO models in large-scale ecological monitoring, offering significant improvements in detection speed and reliability over manual counts. Similarly, Jiang and Wu enhanced the YOLOv8 model with deformable convolutions and Extended Kalman Filtering for wildlife detection and tracking in complex environments [12]. This improved YOLOv8 model addresses challenges in detecting animals in cluttered or dynamic backgrounds, enhancing both tracking precision and real-time monitoring [13]. These advancements illustrate the growing potential of YOLO models in ecological monitoring, particularly for micro-target detection tasks like the detection of plateau pika burrows.
In response to the aforementioned challenges and research gaps in this field, this study proposes a lightweight detection framework based on an improved YOLOv11 algorithm. YOLOv11 stands as a state-of-the-art real-time object detection model in the fields of deep learning and computer vision. As the latest iteration of the YOLO series, this architecture not only preserves the fundamental advantages inherited from previous generations but also introduces groundbreaking technical innovations and systematic improvements. Through optimized network architecture and enhanced feature extraction mechanisms, YOLOv11 achieves marked improvements in detection accuracy, model versatility, and computational efficiency, while maintaining exceptional inference speed critical for real-time applications [14]. The framework addresses existing challenges through two levels of technological innovation: first, by replacing the original backbone network of the baseline YOLOv11n with EfficientViT, optimizing the feature extraction and fusion mechanisms; second, by introducing the Convolutional Block Attention Module (CBAM) into the neck network to enhance the model’s ability to recognize small targets in complex backgrounds [8]. The main objectives of this study are as follows:
  • Performance evaluation of the improved model: the performance of the enhanced YOLOv11 model is systematically evaluated under complex background conditions for detecting small plateau pika burrows, with a focus on key metrics such as precision, recall, mean average precision (mAP), and computational efficiency.
  • Comparative evaluation: The improved YOLOv11 model is compared with state-of-the-art object detection models, including various versions of the YOLO series—YOLOv5, YOLOv8, YOLOv9, YOLOv11, and the newly released YOLOv12. In addition to comparisons with different YOLO series versions, the model is also evaluated against RetinaNet and Faster R-CNN to assess its superiority and applicability in plateau pika burrow detection tasks.
Through these research objectives, this study aims to provide an efficient and accurate technical solution for the automated monitoring of plateau pika burrows, while also offering theoretical and methodological references for similar ecological monitoring tasks.

2. Materials and Methods

2.1. Study Area

The Tangbei Regional Reserve within the Three-River-Source area (32°30′–34°50′ N, 93°20′–96°10′ E) was selected as the research region for this study. Located in the hinterland of the Tibetan Plateau at the border of Yushu and Golog Tibetan Autonomous Prefectures in Qinghai Province, this area constitutes the core zone of the Three-River-Source National Nature Reserve (Figure 1). The total study area spans 42,000 km², with an average elevation exceeding 4500 m, ranging from 4680 to 5711 m in altitude. Characterized by a typical alpine arid-to-semi-arid ecosystem, the region exhibits annual precipitation of 250–400 mm (concentrated in June–August).

2.2. Data Collection and Dataset Construction

From July to August 2024, photogrammetric surveys were conducted using a DJI Mavic 3 (DJI, Shenzhen, China) UAV equipped with a Hasselblad L2D-20c RGB camera (4/3” CMOS, 20 MP, mechanical shutter), flown at 150–200 m AGL along grid-based paths with 70% forward and lateral overlap. Flights were scheduled during peak pika activity (09:00–12:00 and 16:00 local time) to maximize burrow visibility. A total of 500 georeferenced images were captured, with a horizontal accuracy of ≤15 cm ensured through RTK–GNSS-measured GCPs distributed across the area. Seasonal variations were considered, as July featured higher solar angles and drier conditions, while August showed increased cloud cover and moisture variability. The imagery was processed via Pix4Dmapper’s Structure-from-Motion pipeline to generate high-resolution orthomosaics (Figure 2 and Figure 3). For small-burrow detection, images were tiled into 64 × 64 px patches using a sliding window (stride = 128 px), then resized to 416 × 416 px via Lanczos interpolation for YOLOv11 input. Elevation data for each sampling plot were recorded using a Trimble R12 GNSS receiver (vertical accuracy ≤ 5 cm). The model’s robustness to elevation variations was inherently validated through the diversity of training data across altitudes (4680–5711 m). The resulting PPCave2025 dataset included 1000 manually annotated burrows (YOLO format), stratified by habitat and split into training (800), validation (145), and test (145) sets. Data augmentation involved geometric (±30° rotation, flipping) and photometric (±20% brightness, ±15% contrast, ±10% saturation, Gaussian noise σ = 0.01) transformations to enhance model robustness.
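To make the preprocessing pipeline concrete, the following is a minimal Python sketch of the sliding-window tiling and Lanczos resizing step described above, assuming OpenCV; the tile size, stride, and target resolution are the values reported in the text, while the function name and file paths are illustrative.

import cv2


def tile_orthomosaic(image, tile_size=64, stride=128, out_size=416):
    """Slide a window across an orthomosaic and resize each tile for YOLOv11 input."""
    tiles = []
    h, w = image.shape[:2]
    for y in range(0, h - tile_size + 1, stride):
        for x in range(0, w - tile_size + 1, stride):
            patch = image[y:y + tile_size, x:x + tile_size]
            # Lanczos interpolation preserves fine burrow-edge detail when upscaling.
            resized = cv2.resize(patch, (out_size, out_size),
                                 interpolation=cv2.INTER_LANCZOS4)
            tiles.append(((x, y), resized))  # keep the tile origin for georeferencing
    return tiles


# Example usage on a single orthomosaic exported from Pix4Dmapper:
# mosaic = cv2.imread("orthomosaic.tif")
# patches = tile_orthomosaic(mosaic)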

2.3. YOLOv11 Algorithm

YOLOv11 [15], released by Ultralytics on 30 September 2024, is the latest object detection algorithm, achieving significant improvements in accuracy, speed, and multi-task processing capabilities compared to previous YOLO models. The architecture primarily consists of a backbone network, a neck network, and a detection head, as illustrated in Figure 4.
  • C3k2 mechanism: The C3k2 block is introduced, controlled by a boolean parameter denoted c3k. In the shallow layers of the network, c3k is set to False, which optimizes the model’s feature extraction capability, particularly for complex backgrounds and small targets.
  • C2PSA mechanism: The C2PSA mechanism was proposed, which embeds a multi-head attention mechanism within the C2 mechanism (the predecessor of C2f). This mechanism enhances the model’s ability to focus on different features through multi-head attention, thereby improving the model’s detection accuracy and robustness.
  • Decoupled head improvement: Two depth-wise separable convolutions (DWConvs) were added to the classification detection head in the original decoupled head. This improvement not only enhances the model’s feature extraction capability but also reduces its computational complexity, making the model more lightweight and efficient while maintaining high accuracy [16].
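To make the decoupled-head modification concrete, the following is a minimal PyTorch sketch of a depth-wise separable convolution of the kind added to the classification head; it illustrates the parameter savings, not the exact Ultralytics implementation.

import torch
import torch.nn as nn


class DWSeparableConv(nn.Module):
    """Depth-wise conv (per-channel spatial filtering) followed by a 1 x 1 point-wise conv."""

    def __init__(self, c_in, c_out, k=3):
        super().__init__()
        self.depthwise = nn.Conv2d(c_in, c_in, k, padding=k // 2, groups=c_in)
        self.pointwise = nn.Conv2d(c_in, c_out, 1)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.pointwise(self.depthwise(x))))


# A standard 3 x 3 convolution costs c_in * c_out * 9 weights; the separable
# version costs c_in * 9 + c_in * c_out, a large saving for wide layers.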

2.4. Improved YOLOv11 Algorithm

2.4.1. Improved YOLOv11 Network Structure

This study proposes an improved YOLOv11 model tailored for plateau pika burrow detection, named YOLOv11-AEIT (You Only Look Once version 11–Attention-Enhanced Intelligent Transformer). The name reflects the model’s use of attention mechanisms (such as CBAM) to focus on critical features and its EfficientViT Transformer backbone, which together improve detection accuracy and efficiency for small objects, such as plateau pika burrows, in complex environments. The YOLOv11 series encompasses five isomorphic yet parametrically varied variants, ranging from n to x. Considering the computational constraints and real-time requirements of UAV-embedded platforms, the base YOLOv11n network (2.6 M parameters, 4.5 GFLOPS) is adopted for this study to ensure the model remains lightweight while avoiding computational resource redundancy. However, in the context of plateau pika burrow detection, the baseline model faces significant challenges: the average pixel coverage of burrows in UAV aerial images is only 0.12% (32 × 32 px), and the model is further hindered by sudden changes in plateau lighting and sand artifact interference. As a result, the baseline model achieves a correct detection rate of only 90%, with a miss rate as high as 10%. To address this, this study proposes a two-stage optimization architecture (Figure 5):
  • Backbone replacement: the original backbone network is replaced with EfficientViT, a robust vision transformer architecture, to enhance fine-grained feature extraction in remote sensing imagery, specifically targeting plateau pika burrows with average pixel occupancy <0.1% [16].
  • Neck enhancement: a Convolutional Block Attention Module (CBAM) is integrated into the neck network, establishing dual-dimensional feature optimization pathways to refine feature map fusion [17].
Ablation experiments demonstrate that this improved framework achieves a test set accuracy of 95.6% while maintaining inference speed (51.2 FPS), representing a 5.6 percentage-point improvement over the baseline model. These results validate that integrating EfficientViT-M0 and CBAM into YOLOv11n significantly enhances burrow detection performance, offering novel technical insights for ecological monitoring applications.
This study addresses YOLOv11’s limitations in small-target detection within remote sensing imagery through two core innovations:
  • Backbone architecture redesign: The original convolutional neural network (CNN) backbone is innovatively replaced with EfficientViT modules. This advanced architecture synergizes the local feature extraction precision of convolutional operations with the global contextual modeling capabilities of vision transformers. Leveraging self-attention mechanisms, it effectively captures long-range dependencies inherent in micro-targets like pika burrows. Benefiting from a lightweight design philosophy, the revamped backbone significantly enhances feature representation for small targets in complex backgrounds with 416 × 416-pixel input images, while substantially reducing computational overhead, establishing a robust semantic foundation for downstream detection tasks.
  • Hybrid attention integration: A Convolutional Block Attention Module (CBAM) is introduced into the feature fusion neck architecture to achieve in-depth mining and precise focusing of critical information. CBAM employs a phased attention strategy: channel attention filters feature channels to amplify discriminative features; spatial attention pinpoints target regions to refine spatial localization.
This dual mechanism markedly enhances the saliency of small targets across high-level semantics and low-level details, resolving feature ambiguity caused by textural similarity between burrows and backgrounds. Consequently, the refined model achieves superior accuracy and reliability in challenging detection scenarios.

2.4.2. EfficientViT Model

EfficientViT is an efficient, lightweight vision Transformer architecture specifically designed for resource-constrained edge computing scenarios, with its structure illustrated in Figure 6. The M0 variant of EfficientViT is adopted for this study. Its core innovation lies in a multi-stage optimization strategy, which significantly reduces computational costs while preserving global feature modeling capabilities [17].
The module employs a progressive down-sampling structure, where each stage consists of heterogeneous Transformer blocks: Shallow stages utilize local window attention (window size: 4 × 4) to capture microscopic texture features at burrow edges. Deep stages deploy dilated global attention (dilation rate: 2) to establish contextual relationships between burrows and surrounding vegetation/terrain features. A dynamic channel compression technique enables automatic adjustment of feature channel dimensions (128–256) based on input complexity. Experiments demonstrate that compared to traditional CNN backbones, EfficientViT achieves a 2% improvement in mAP@0.5 for plateau pika burrow detection with an 18% reduction in parameters, effectively addressing small-target feature loss through multi-scale fusion [18]. The module further supports TensorRT quantization deployment, offering an ideal lightweight solution for drone-embedded platforms. To further enhance small-target detection, particularly for plateau pika burrows, the EfficientViT model employs a hybrid attention mechanism that alternates between local window attention (LWA) and global dilated attention (GDA). LWA is a localized attention mechanism that operates within a fixed-size window (e.g., 4 × 4 or 8 × 8), restricting attention to local regions of the image. This approach focuses on capturing fine-grained details such as micro-textures and edges. The importance of LWA is particularly evident in small object detection tasks, such as burrow detection, where small targets like pika burrows typically occupy only a few pixels and are easily overshadowed by larger surrounding areas. The key advantage of LWA is its ability to capture minute details of small objects, such as the edges of burrow openings. Given that these small targets account for a very small portion of the image, LWA ensures precise local feature extraction, which is crucial for accurate micro-object detection.
In contrast, global dilated attention is employed in the deeper layers of the model. GDA is a global attention mechanism that utilizes dilated convolutions to expand the receptive field, allowing the network to capture a broader range of global contextual information. This mechanism establishes long-range dependencies, enabling the model to understand relationships between the target (e.g., the burrow) and its surrounding environmental factors, such as vegetation distribution and soil characteristics. The advantage of GDA lies in its ability to model long-range dependencies, which is critical for understanding the context in which small targets are located. For micro-target detection, GDA helps the model integrate contextual information from the surrounding environment (e.g., vegetation coverage and soil moisture), enhancing the model’s ability to discern the target from a complex background.
In EfficientViT, LWA and GDA are alternately employed. LWA is applied in the shallow layers to capture fine-grained local detail features, while GDA is utilized in the deeper layers to extract broad global context information. This alternating strategy is particularly advantageous for micro-target detection. By employing LWA in the early layers, the model can focus on small-scale features—such as edge textures and micro-features—that are essential for identifying small objects, ensuring that larger background regions do not overwhelm detection and thus improving precision. Conversely, the deeper layers with GDA capture the relationship between the target and its surrounding environment, providing global contextual information. This global understanding is especially valuable in complex or cluttered backgrounds, where micro-targets may be obscured, as it helps distinguish the target from the background and thereby enhances detection performance.
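To illustrate the mechanism, the following is a minimal PyTorch sketch of how local window attention partitions a feature map into non-overlapping 4 × 4 windows and attends within each; the use of nn.MultiheadAttention and the window-partition helper are illustrative assumptions, not the EfficientViT source code.

import torch
import torch.nn as nn


def local_window_attention(x, attn, window_size=4):
    """Apply self-attention independently inside non-overlapping windows.

    x: (B, H, W, C) feature map with H and W divisible by window_size.
    attn: an nn.MultiheadAttention module created with batch_first=True.
    """
    b, h, w, c = x.shape
    ws = window_size
    # Partition into (B * num_windows, ws * ws, C) token sequences.
    x = x.reshape(b, h // ws, ws, w // ws, ws, c)
    windows = x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, c)
    out, _ = attn(windows, windows, windows)  # attention never crosses a window border
    # Reverse the partition back to the (B, H, W, C) layout.
    out = out.reshape(b, h // ws, w // ws, ws, ws, c)
    return out.permute(0, 1, 3, 2, 4, 5).reshape(b, h, w, c)


# attn = nn.MultiheadAttention(embed_dim=128, num_heads=4, batch_first=True)
# y = local_window_attention(torch.randn(2, 16, 16, 128), attn)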

2.4.3. CBAM Module

The CBAM (Convolutional Block Attention Module) attention mechanism is a lightweight module that integrates channel and spatial attention, designed to enhance the model’s feature representation capability by dynamically adjusting channel-wise and spatial weights of feature maps [19]. The architecture of CBAM is illustrated in Figure 7. CBAM consists of two submodules, SAM (spatial attention module) and CAM (channel attention module), as shown in Figure 8 and Figure 9, performing spatial and channel-wise attention, respectively.
The SAM enhances target localization through adaptive spatial feature recalibration. By integrating dual pooling operations, max pooling to isolate salient morphological features, and average pooling to model regional homogeneity, the module synthesizes a spatial attention map that prioritizes task-critical regions.
The computational workflow is defined as follows:
$$M_s = \sigma\big(\mathrm{Conv}\big(\mathrm{Concat}\big[\mathrm{MaxPool}(F),\ \mathrm{AvgPool}(F)\big]\big)\big)$$
where $M_s \in [0,1]^{H \times W}$ dynamically amplifies features in discriminative spatial regions while suppressing redundant backgrounds. This mechanism mitigates interference from low-contrast artifacts and resolves edge ambiguity in irregular target morphologies.
The CAM optimizes feature channel discriminability by learning inter-channel dependencies. Leveraging global average pooling and global max pooling, the module derives channel-wise statistics that capture both holistic and peak responses across spatial dimensions. These statistics are processed through a shared multilayer perceptron (MLP) to generate channel attention weights:
$$M_c = \sigma\big(\mathrm{MLP}\big(\mathrm{GlobalAvgPool}(F)\big) + \mathrm{MLP}\big(\mathrm{GlobalMaxPool}(F)\big)\big)$$
where $M_c \in [0,1]^{C}$ selectively enhances channels encoding task-relevant attributes. This prioritization suppresses noise from environmentally variable channels, ensuring robust feature representation under heterogeneous background conditions.
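To make the two attention stages concrete, the following is a minimal PyTorch sketch of CBAM implementing the $M_c$ and $M_s$ equations above; the reduction ratio and 7 × 7 spatial kernel are common defaults from the CBAM literature, assumed rather than taken from the authors’ code.

import torch
import torch.nn as nn


class CBAM(nn.Module):
    def __init__(self, channels, reduction=16, kernel_size=7):
        super().__init__()
        # Shared MLP used by the channel attention branch (Mc).
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(),
            nn.Conv2d(channels // reduction, channels, 1),
        )
        # Convolution over the two concatenated pooled maps (Ms).
        self.spatial = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, f):
        # Channel attention: Mc = sigma(MLP(GlobalAvgPool(F)) + MLP(GlobalMaxPool(F)))
        avg = self.mlp(f.mean(dim=(2, 3), keepdim=True))
        mx = self.mlp(f.amax(dim=(2, 3), keepdim=True))
        f = f * torch.sigmoid(avg + mx)
        # Spatial attention: Ms = sigma(Conv(Concat(MaxPool(F), AvgPool(F))))
        pooled = torch.cat([f.amax(dim=1, keepdim=True),
                            f.mean(dim=1, keepdim=True)], dim=1)
        return f * torch.sigmoid(self.spatial(pooled))


# y = CBAM(256)(torch.randn(1, 256, 52, 52))  # shape-preserving feature refinement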
The synergistic integration of spatial and channel attention mechanisms addresses three critical challenges inherent to aerial ecological monitoring systems. First, the spatial attention module (SAM) enhances sub-pixel target localization accuracy through position-sensitive feature weighting, particularly crucial for detecting 12 × 9-pixel burrow openings. Second, the channel attention module (CAM) improves spectral discriminability by adaptively amplifying habitat-related features in low-contrast environments (ΔE = 14.3). Third, optimized architecture maintains computational efficiency, achieving 41-frame-per-second inference speeds with only 0.8% parameter expansion compared to baseline models. This dual-attention framework demonstrates that coordinated spatial-channel optimization can advance small-target detection performance while preserving deployment feasibility on embedded monitoring platforms.

2.5. Accuracy Evaluation

To comprehensively evaluate the detection performance of the model, multiple widely used object detection metrics are adopted for this study, including Precision, Recall, F1 score, and mean average precision (mAP), which collectively reflect the accuracy and robustness of the model across different dimensions. The computational formulas are defined as follows [20]:
$$\mathrm{Precision} = \frac{TP}{TP + FP}$$
$$\mathrm{Recall} = \frac{TP}{TP + FN}$$
$$F_1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
$$\mathrm{mAP} = \frac{1}{N} \sum_{i=1}^{N} \int_{0}^{1} \mathrm{Precision}_i(R)\, dR$$
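As a worked illustration of these formulas, the following is a minimal Python sketch for a single class; the trapezoidal integration of the precision–recall curve is one common convention, not the exact COCO/VOC interpolation protocol.

import numpy as np


def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def average_precision(recalls, precisions):
    """Numerically integrate precision over recall (the mAP formula with N = 1)."""
    order = np.argsort(recalls)
    r = np.asarray(recalls, dtype=float)[order]
    p = np.asarray(precisions, dtype=float)[order]
    # Trapezoidal rule over the sorted precision-recall points.
    return float(np.sum((r[1:] - r[:-1]) * (p[1:] + p[:-1]) / 2.0))


# precision_recall_f1(96, 4, 4) -> (0.96, 0.96, 0.96)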

3. Results

3.1. Experimental Environment

The hardware and software configurations for the experiments are listed in Table 1. To validate the effectiveness of the proposed EfficientViT-enhanced YOLOv11 model in plateau pika burrow detection tasks, systematic evaluations were conducted on an aerial remote sensing image dataset collected from the Qinghai–Tibet Plateau region. The experiments were implemented using the PyTorch 2.0.0 deep learning framework, with model training and evaluation performed on an NVIDIA GeForce RTX 4060 Laptop GPU (NVIDIA Corporation, Santa Clara, CA, USA) with CUDA 11.8. The system configuration included Microsoft Windows 11 Professional OS (Microsoft Corporation, Redmond, WA, USA) and Python 3.10.16. To ensure the reliability of the experimental results, all tests were executed under identical hardware and software conditions. The dataset was partitioned into training, validation, and test sets with ratios of 80%, 10%, and 10%, respectively.
To further enhance model convergence and generalization, all input images were uniformly resized to 416 × 416 pixels before training. The model was trained for 200 epochs using a mini-batch size of 8, with Stochastic Gradient Descent (SGD) as the optimizer and an initial learning rate of 0.01. A cosine annealing learning-rate scheduler was applied to dynamically reduce the learning rate over time, improving training stability in later epochs. To prevent premature termination, early stopping was disabled (patience = 0), ensuring completion of the full training cycle. Additionally, mosaic augmentation was turned off (close_mosaic = 0) to maintain the spatial integrity of small targets such as burrow contours. Model checkpoints were automatically saved under runs/train/exp, allowing training to be resumed if needed. To ensure reproducibility across systems, all training was conducted using single-threaded data loading (workers = 0).
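For reference, the training configuration described above maps onto the standard Ultralytics API roughly as in the sketch below; the dataset YAML and the modified-architecture file are hypothetical placeholders for the authors’ actual paths.

from ultralytics import YOLO

model = YOLO("yolov11-aeit.yaml")  # placeholder for the modified architecture definition
model.train(
    data="ppcave2025.yaml",  # dataset config listing the train/val/test splits
    epochs=200,
    imgsz=416,
    batch=8,
    optimizer="SGD",
    lr0=0.01,        # initial learning rate
    cos_lr=True,     # cosine annealing learning-rate schedule
    patience=0,      # early stopping disabled
    close_mosaic=0,  # mosaic augmentation off to preserve small burrow contours
    workers=0,       # single-threaded data loading for reproducibility
)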
The trend of various precision metrics during model training is shown in Figure 10. In the first 50 epochs, the model’s precision and recall increase sharply, while the loss value decreases. After 100 epochs, the precision and recall metrics stabilize, indicating that the model is approaching its optimal performance. By the end of 200 epochs, the slopes of the accuracy curves for the YOLOv11 model approach zero, and the loss values reach their minimum, signifying model convergence. Training is concluded at this point to prevent overfitting.
As shown in the precision–recall curve (Figure 10), precision remains close to 1.0 as the recall increases, maintaining a high level throughout. The model achieves high precision and recall values above 0.9, and the mean average precision (mAP50) approaches 1.0, suggesting the model’s strong performance across various thresholds.

3.2. Ablation Experiment

3.2.1. Lightweight Backbone Networks

The selection of EfficientViT-M0 as the backbone network for YOLOv11n (experimental results summarized in Table 2) is justified by its optimal balance between accuracy and computational efficiency. First, in terms of detection accuracy, EfficientViT-M0 achieves an impressive mAP@0.5 of 98.6%, surpassing MobileNetv3-Large (95.2%) and ShuffleNetv2-1.5x (93.8%). This improvement is mainly attributed to its multi-scale attention mechanism, which effectively captures fine-grained morphological features of plateau pika burrows, such as sub-pixel edge textures and ecological contextual correlations. Second, regarding its lightweight design, EfficientViT-M0 boasts the lowest parameter count at 2.3 million and computational complexity at 3.8 GFLOPS, while maintaining real-time detection performance at 49.5 FPS. Additionally, its edge device memory consumption stands at 126 MB, which is 6.7% lower than that of the baseline YOLOv11n model, ensuring compatibility with resource-constrained UAV platforms.
Although ShuffleNetv2-1.5x offers the highest inference speed (55.1 FPS), its lower accuracy (93.8% mAP@0.5) does not meet the standards for ecological monitoring tasks. On the other hand, MobileNetv3-Large incurs significantly higher computational costs (4.7 GFLOPS) and memory usage (158 MB), making it less feasible for deployment in edge devices. Therefore, EfficientViT-M0 represents the best balance between detection accuracy, computational efficiency, and hardware adaptability, making it the optimal choice for aerial ecological surveillance applications.

3.2.2. EfficientViT Variants

The selection of EfficientViT-M0 over alternative variants (M1 and M2) is driven by its superior trade-off between feature representation capability and computational efficiency, as evidenced in Table 3. The M0 architecture uniquely combines 4 × 4 local window attention with dilated global attention (dilation rate = 2), enabling precise capture of micro-scale burrow textures while maintaining contextual awareness of ecological patterns. This configuration achieves the highest detection accuracy (98.6% mAP@0.5), outperforming M1 (97.3%) and M2 (96.8%), with 25.7% lower computational cost (3.8 vs. 5.1/6.7 GFLOPS) and 26.8% faster inference (20.2 ms vs. 27.5/34.1 ms). While M1’s larger window size (8 × 8) compromises fine-grained feature discrimination and M2’s dense global attention introduces redundant computations for small targets (<0.1% pixel coverage), M0’s hybrid attention strategy optimally balances localized detail extraction and global context modeling [21]. These attributes make it uniquely suited for UAV-based real-time monitoring, where stringent latency constraints (<30 ms) and limited edge resources necessitate minimal computational overhead without sacrificing accuracy.

3.2.3. Attention Mechanisms

The CBAM module was selected for its optimal balance between accuracy and efficiency (Table 4). It achieves the highest mAP@0.5 (93.5%) and AP_small (75.3%) among the tested attention mechanisms, effectively suppressing low-contrast sandy artifacts (ΔE = 14.3) while reducing small-target misses by 28.5%. With only a 0.8% parameter increase and real-time performance (41 FPS), CBAM outperforms alternatives: non-local modules incur excessive computational costs (29 FPS, +2.8% parameters), while SE/ECA variants show limited accuracy gains (≤92.8% mAP@0.5). Its dual attention, combining channel-wise enhancement of discriminative features with spatial prioritization of micro-edges (12 × 9-pixel targets), ensures robust detection in complex plateau environments without compromising deployment feasibility.

3.2.4. Ablation Experiment Results

To evaluate the impact of the proposed improvement modules on algorithm performance, ablation experiments were conducted using YOLOv11n as the baseline model. Under identical configurations, different modules were incrementally introduced and combined. The symbol “√” indicates that the corresponding module is integrated into the YOLOv11n model [22]. Detailed results of the ablation experiments are presented in Table 5. Beyond target size variations, the model’s robustness was further validated under diverse environmental conditions, particularly across the extreme elevation gradient of the Tibetan Plateau. The model demonstrated consistent performance across the study area’s elevation gradient (4680–5711 m), achieving >97% mAP@0.5 in both low- and high-altitude regions. This confirms that the CBAM-enhanced architecture effectively mitigates elevation-related challenges.
The ablation studies of different improvement strategies reveal the following findings:
  • Solely introducing EfficientViT (A1) increases mAP50 by 2.1% (95.1% → 97.2%), validating its effectiveness in enhancing cross-scale feature extraction. However, mAP50-95 decreases by 0.9% (53.9% → 53.0%), indicating that merely increasing feature diversity may weaken high-IoU localization capability.
  • The position of the attention mechanism significantly affects performance. Post-CBAM placement after C3K2 (A3) improves mAP50 by 0.7% (95.8% → 96.5%) compared to pre-CBAM (A2), demonstrating that applying spatial attention after feature fusion better captures burrow edge characteristics.
  • Synergistic module effects are prominent. The combination of EfficientViT with post-CBAM (A5) achieves maximum gains: mAP50 reaches 98.6% (+3.5% improvement) and recall increases by 7.2% (88.9% → 96.1%), illustrating complementary effects between global modeling from visual transformers and local enhancement from attention mechanisms.
  • For mAP50-95, only A4/A5 surpasses the baseline model, with A5 in particular reaching 55.3% (+1.4%), proving that dual improvement strategies simultaneously enhance detection sensitivity and localization accuracy while mitigating boundary regression errors caused by irregular burrow shapes. These results demonstrate that coordinated optimization of feature extraction networks and attention mechanisms is critical for achieving high-precision detection of plateau pika burrows.

3.3. Detection Performance: Missed Detections and Burrow Size

An analysis of detection performance across different burrow sizes revealed significant variations in the model’s ability to detect small, medium, and large plateau pika burrows. The improved YOLOv11-AEIT model performed particularly well in detecting medium-sized burrows, achieving a recall rate of 96.8% and precision of 95.7%. For large burrows (greater than 16 × 12 pixels), the recall rate was 95.1% and the precision was 94.2%, indicating the model’s robustness in detecting these targets. However, the model also maintained competitive performance for small burrows (8 × 6 to 12 × 9 pixels), with a recall rate of 92.3% and precision of 90.5%. These results demonstrate that the improved YOLOv11 model can detect very small burrows, which is typically a challenge in other detection frameworks, where recall rates often fall below 80% (Figure 11).
The primary cause of missed detections occurred in cases where burrows were partially occluded by vegetation or the burrow edges were not clearly defined, leading to difficulty in detection due to the low contrast between the burrow opening and the surrounding environment [23]. The impact of vegetation cover on the model’s performance was particularly significant, as dense vegetation caused partial occlusion of the burrows, resulting in a decrease in recall rate. This can be attributed to complex background interactions that obscure the target features. Additionally, burrows with irregular shapes or overlapping contours were more difficult to localize accurately, leading to both false positives and missed detections [24].
The significant improvement in model performance is attributed to the integration of the EfficientViT multi-scale attention mechanism, which enhances the feature extraction capabilities for small targets, and the CBAM spatial attention module, which effectively filters out background noise caused by vegetation and sandy soil artifacts [13]. These modules enable the model to maintain high detection performance in cluttered environments, reducing interference from low-contrast objects, which typically cause missed detections in traditional models (Figure 12).
In conclusion, the YOLOv11-AEIT model demonstrates strong capabilities in detecting plateau pika burrows of various sizes and under varying environmental conditions. Despite the challenges posed by small burrows and occluded targets, the enhanced feature extraction and attention mechanisms effectively reduce missed detections and false positives, especially when compared to the baseline YOLOv11. This performance is particularly valuable in ecological monitoring, as detecting micro-targets in complex environments is crucial for effective conservation management.

3.4. Comparative Analysis of Various Deep Learning Networks

To evaluate the effectiveness of the proposed YOLOv11-AEIT model, a comparative experiment was conducted against YOLOv5, YOLOv8, YOLOv9, and the baseline YOLOv11. The results demonstrate that YOLOv11-AEIT achieves the highest detection performance, with an mAP@0.5 of 98.6% and an mAP@0.5:0.95 of 55.3%, surpassing all other models. Compared to YOLOv11, the improved model enhances mAP@0.5 by 3.5 percentage points and mAP@0.5:0.95 by 1.4 percentage points, indicating its superior ability to detect small and low-contrast pika burrows. Although the inference speed of YOLOv11-AEIT (49.5 FPS) is slightly lower than that of YOLOv11 (51.2 FPS), the marginal trade-off in speed is justified by the significant boost in accuracy. These results confirm that the proposed model effectively addresses the challenges of pika burrow detection, achieving a more robust and precise detection performance. In the pika burrow detection task, the performances of YOLOv5, YOLOv8, YOLOv9, YOLOv11, and the newly introduced YOLOv12 algorithms are compared, as shown in Figure 13.
The YOLOv11-AEIT model, proposed in this study, demonstrates a significant advancement in object detection performance, particularly in ecological monitoring tasks such as plateau pika burrow detection (Table 6). With a parameter count of 2.47 million, it achieves a mean average precision (mAP) of 98.6% at an IoU threshold of 0.5 and 55.3% at IoU thresholds from 0.5 to 0.95, alongside a recall rate of 96.1% and an inference speed of 49.5 FPS. Compared to previous YOLO versions, YOLOv11-AEIT offers a more compact model size and enhanced detection accuracy. Notably, it outperforms YOLOv5, YOLOv8, and YOLOv9 in both mAP@0.5 and mAP@0.5:0.95 metrics, while maintaining a competitive inference speed. This indicates that the integration of EfficientViT and CBAM modules contributes to improved feature extraction and attention mechanisms, leading to better performance in detecting small and low-contrast objects.
In contrast, traditional two-stage detectors like Faster R-CNN [25] and single-stage detectors like RetinaNet exhibit lower mAP scores and slower inference speeds, making them less suitable for real-time applications in dynamic environments. Overall, the YOLOv11-AEIT model offers a balanced trade-off between accuracy, speed, and model size, making it an ideal choice for real-time, resource-constrained applications such as UAV-based ecological monitoring.
To demonstrate the detection ability of the enhanced YOLOv11-AEIT model more clearly, several test images were selected for this study, and the detection outcomes are displayed in Figure 13. As shown, the original YOLOv11 model struggles with small-burrow detections, particularly in cases where the burrow edges are less defined or partially occluded by vegetation, leading to some misdetections and missed detections. The precision for larger burrows is generally higher, but small burrows in complex environments, especially those smaller than 12 × 9 pixels, remain a challenge.
In comparison, the YOLOv11-AEIT model significantly improves the detection of plateau pika burrows, especially the smaller ones, with more accurate bounding box placements and fewer misdetections. This enhancement is particularly noticeable in detecting small burrows, with improved recall and precision, achieving a recall rate of 96.1% and an inference speed of 49.5 FPS. While some misdetections still occur, particularly in cases of partial occlusion or low-contrast areas, the overall performance in both small- and large-burrow detection surpasses previous YOLO versions, showcasing the strength of the EfficientViT and CBAM modules integrated into the YOLOv11-AEIT model (Figure 14).

4. Discussion

4.1. Theoretical Mechanisms of Performance Enhancement and Comparative Advantages

The enhanced YOLOv11-AEIT model achieves superior performance in plateau pika burrow detection by integrating EfficientViT and CBAM into the detection framework, addressing both computational and ecological complexities inherent to high-altitude micro-target recognition. Compared to traditional object detectors, which often struggle with small object resolution, visual clutter, and ecological heterogeneity, the proposed model introduces the following two key innovations.
First, the use of EfficientViT as a backbone enables multi-scale feature learning through local window attention and dilated global attention, capturing sub-pixel edge textures while preserving ecological context. Similar hybrid attention mechanisms have been shown by Guo et al. [16] and Cai et al. [18] to significantly enhance detection precision in high-resolution remote sensing tasks. In our case, this translates into a 2.1% improvement in mAP@0.5 over the baseline YOLOv11n. Second, CBAM introduces channel and spatial attention recalibration, amplifying task-relevant features and suppressing background noise. This dual-stage attention structure effectively mitigates interference from sand artifacts and vegetation cover, common sources of false detections in UAV imagery. Sheikholeslami et al. [26] demonstrated that CBAM could suppress low-contrast interference in land-use classification, a finding consistent with our observed 87% reduction in false positives. These improvements align with earlier works that advocated for enhanced detection mechanisms in challenging ecological settings. For instance, Liu et al. and Chen et al. [21,27] reported that standard convolutional models yielded high false negative rates when applied to pika burrow imagery due to the targets’ small size and low color contrast. Similarly, Ferchichi et al. [20] emphasized that attention-guided models are essential for distinguishing ecologically meaningful patterns from noise in sparse and rugged terrains.
While previous ecological applications of YOLO—such as Cusick et al. [11] and Jiang and Wu [12]—demonstrated high accuracy in detecting seabird nests and large mammals using YOLOv6 and YOLOv8, they did not address the specific challenges posed by ultra-small objects (<0.1% image area). Our study thus contributes a novel methodological advancement to this growing domain of ecological AI.

4.2. Performance Across Burrow Sizes and Environmental Conditions

The YOLOv11-AEIT model demonstrates strong performance in detecting burrows of various sizes, particularly excelling in medium-sized targets (e.g., 12 × 9–16 × 12 pixels). This aligns with the model’s anchor design, which incorporates adaptive anchors optimized for pixel ranges such as 8 × 8–16 × 16. Even for the smallest burrows (e.g., ≤12 × 9 pixels), the model maintains a high level of precision, validating the effectiveness of the hybrid attention mechanism in resolving sub-pixel feature ambiguity. Future improvements may include the integration of super-resolution preprocessing or deformable convolutions to enhance small-target detection capabilities.
The model’s performance is also influenced by environmental conditions such as lighting and vegetation cover. In areas with low vegetation coverage, such as desert steppe regions, the model performs optimally, benefiting from high burrow exposure and minimal spectral confusion. In contrast, in densely vegetated alpine meadows, partial occlusions slightly reduce recall. However, the integration of EfficientViT’s long-range attention and CBAM’s spatial filtering allows the model to maintain relatively high precision by suppressing false positives caused by vegetation, outperforming traditional CNN-based models that lack attention modulation.
Under extreme lighting conditions, such as intense midday sunlight or deep shadows, recall declines due to high dynamic range compression, which suppresses valid features. To address this, future iterations of the model may incorporate GAN-based illumination augmentation during training to simulate varying sunlight and shadow conditions. Additionally, integrating multispectral data, particularly near-infrared (NIR) bands, may improve the contrast between burrow openings and surrounding vegetation in shaded or vegetated areas, as suggested by Assmann et al. [28].
Overall, the YOLOv11-AEIT model exhibits strong resilience to moderate changes in vegetation and lighting conditions, but for optimal performance in extreme or mixed environments, targeted architectural or training enhancements will be necessary.

4.3. Practical Applications, Limitations, and Future Directions

From an applied perspective, the YOLOv11-AEIT model significantly reduces the cost and labor required for monitoring plateau pika burrows. Compared to traditional manual quadrat surveys (>525.59 USD/km²), the UAV-based deployment of our model drastically reduces costs while providing near-real-time results. This supports dynamic ecological mapping and monitoring efforts required by national conservation strategies, such as the “Three Zones and Four Belts” strategy. Furthermore, our findings provide new evidence regarding the ecological implications of burrow distribution. The observed correlation between burrow density and vegetation cover degradation aligns with earlier studies, further emphasizing the role of plateau pika as a keystone species with both stabilizing and destabilizing effects on alpine grassland ecosystems. By detecting burrow clusters with high spatial precision, the proposed model supports early-warning systems for grassland degradation. Environmental factors also influence the model’s performance. For example, in areas with varying vegetation coverage, higher vegetation density may cause partial occlusions, which could affect recall. To address this issue, future iterations of the model may incorporate Generative Adversarial Networks (GANs) for data augmentation, simulating changes in environmental conditions during training. Additionally, integrating multispectral data, particularly near-infrared (NIR) bands, may improve the contrast between burrow openings and surrounding vegetation in shaded or vegetated areas, as suggested by Assmann et al. and Rabbi et al. [22,28].
Future improvements will also focus on ecological perception paradigms. The integration of attention mechanisms in our model not only enhances visual salience but also implicitly encodes ecological cues such as vegetation patterns and soil heterogeneity, thus bridging computer vision with ecological inference. This shift from traditional pixel-centric detection to ecological awareness aligns with theoretical frameworks proposed by Lin et al. [29], laying the foundation for the development of integrated air–space–ground ecological perception systems. Such systems could extend beyond burrow detection to applications in invasive species tracking, soil erosion monitoring, and biodiversity surveillance. From an engineering perspective, the model’s lightweight architecture (2.47 M parameters, 126 MB memory) makes it suitable for edge deployment on drones. Compared to larger two-stage detectors like Faster R-CNN (41 M parameters), YOLOv11-AEIT achieves comparable or superior accuracy at a fraction of the computational cost, meeting the demand for lightweight, UAV-compatible models in ecological applications, as highlighted by Anderson and Gaston [4]. The outputs of the framework facilitate critical ecological impact assessments. The generated burrow distribution data can serve as the basis for developing ecological severity indices. Desert steppes with high burrow density and low vegetation cover could be prioritized as “high-risk zones” in simplified classification schemes. Future work will involve collaborating with ecologists to create a rigorous severity index that integrates soil parameters (such as bulk density and erosion rate) and long-term degradation data, transforming detection outputs into actionable conservation metrics.
In addressing the ecological challenges posed by seasonal vegetation changes, we recognize the need for adaptive strategies to maintain the model’s effectiveness across varying phenological conditions. While our current study establishes a foundational framework for burrow detection in typical ecological scenarios, future research will focus on enhancing the model’s adaptability. We propose integrating seasonal data augmentation techniques, such as Generative Adversarial Networks (GANs), to simulate diverse environmental conditions and enrich the training dataset [30]. Additionally, we plan to periodically fine-tune the model using seasonal samples to ensure sustained accuracy in burrow detection throughout different ecological monitoring periods. These advancements are expected to significantly improve the model’s generalization capabilities and underscore its potential for application in dynamic ecological monitoring systems. This research direction not only addresses the practical demands of ecological surveillance but also aligns with the broader goals of developing resilient and adaptive ecological perception frameworks.

5. Conclusions

This study proposes an enhanced YOLOv11-AEIT model for detecting Ochotona curzoniae burrows in UAV imagery on the Tibetan Plateau. By integrating EfficientViT-M0 (multi-scale feature extraction) and CBAM (attention filtering), the model achieves 98.6% mAP@0.5, surpassing the baseline by 3.5 percentage points, with an 87% reduction in sand artifact false positives. Despite a reduction in speed to 49.5 FPS, real-time requirements are met. The framework addresses low pixel coverage (0.1%) and high-density clusters, demonstrating superior small-target detection (92.3% recall for ≤12 × 9-pixel burrows). Future work will optimize extreme lighting robustness and expand geographical applicability, offering a scalable solution for ecological monitoring.

Author Contributions

Methodology, Y.W.; Software, H.Z.; Data curation, L.J.; Writing—original draft, H.Z.; Funding acquisition, F.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Key R&D Program of China, grant number [2023YFF1304305].

Data Availability Statement

The UAV imagery dataset and annotations used in this study are not publicly available due to conservation policies and confidentiality agreements governing ecological monitoring in the Qinghai–Tibet Plateau. Limited access to the data for non-commercial research purposes may be granted upon formal request to the corresponding author, subject to approval by the Sanjiangyuan National Nature Reserve Administration.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
YOLOv11: You Only Look Once version 11
YOLOv11-AEIT: You Only Look Once version 11–Attention-Enhanced Intelligent Transformer
EfficientViT: Efficient Vision Transformer
CBAM: Convolutional Block Attention Module
PPCave2025: A custom dataset for detecting plateau pika burrows (used in this study)
mAP: Mean Average Precision
FPS: Frames Per Second
CNN: Convolutional Neural Network
RTK-GNSS: Real-Time Kinematic Global Navigation Satellite System
GCP: Ground Control Point
UAV: Unmanned Aerial Vehicle
SAM: Spatial Attention Module
CAM: Channel Attention Module
MSA: Multi-Scale Attention
MLP: Multilayer Perceptron
SGD: Stochastic Gradient Descent
GAN: Generative Adversarial Network
GFLOPS: Giga Floating Point Operations Per Second
ML: Machine Learning
CV: Computer Vision
API: Application Programming Interface

References

  1. Smith, A.T.; Foggin, J.M. The plateau pika (Ochotona curzoniae) is a keystone species for biodiversity on the Tibetan plateau. Anim. Conserv. 1999, 2, 235–240. [Google Scholar] [CrossRef]
  2. Li, J.; Qi, H.H.; Duan, Y.Y.; Guo, Z.G. Effects of Plateau Pika Disturbance on the Spatial Heterogeneity of Vegetation in Alpine Meadows. Front. Plant Sci. 2021, 12, 771058. [Google Scholar] [CrossRef] [PubMed]
  3. Tang, Z.; Zhang, Y.; Cong, N.; Wimberly, M.; Wang, L.; Huang, K.; Li, J.; Zu, J.; Zhu, Y.; Chen, N. Spatial pattern of pika holes and their effects on vegetation coverage on the Tibetan Plateau: An analysis using unmanned aerial vehicle imagery. Ecol. Indic. 2019, 107, 105551. [Google Scholar] [CrossRef]
  4. Anderson, K.; Gaston, K.J. Lightweight unmanned aerial vehicles will revolutionize spatial ecology. Front. Ecol. Environ. 2013, 11, 138–146. [Google Scholar] [CrossRef]
  5. Wei, W.; Zhang, W. Architecture Characteristics of Burrow System of Plateau Pika, Ochotona curzoniae. Pak. J. Zool. 2018, 50, 311–316. [Google Scholar] [CrossRef]
  6. Chen, Y.Y.; Yang, H.; Bao, G.S.; Pang, X.P.; Guo, Z.G. Effect of the presence of plateau pikas on the ecosystem services of alpine meadows. Biogeosciences 2022, 19, 4521–4532. [Google Scholar] [CrossRef]
  7. Qiu, J.; Ma, C.; Jia, Y.-H.; Wang, J.-Z.; Cao, S.-K.; Li, F.-F. The distribution and behavioral characteristics of plateau pikas (Ochotona curzoniae). ZooKeys 2021, 1059, 157–171. [Google Scholar] [CrossRef]
  8. Wei, J.; Wang, R.; Wei, S.; Wang, X.; Xu, S. Recognition of Maize Tassels Based on Improved YOLOv8 and Unmanned Aerial Vehicles RGB Images. Drones 2024, 8, 691. [Google Scholar] [CrossRef]
  9. Vogt, P.; Riitters, K.H.; Estreguil, C.; Kozak, J.; Wade, T.G.; Wickham, J.D. Mapping Spatial Patterns with Morphological Image Processing. Landsc. Ecol. 2006, 22, 171–177. [Google Scholar] [CrossRef]
  10. Yao, H.; Liu, L.; Wei, Y.; Chen, D.; Tong, M. Infrared Small-Target Detection Using Multidirectional Local Difference Measure Weighted by Entropy. Sustainability 2023, 15, 1902. [Google Scholar] [CrossRef]
  11. Cusick, A.; Fudala, K.; Storożenko, P.P.; Świeżewski, J.; Kaleta, J.; Oosthuizen, W.C.; Pfeifer, C.; Bialik, R.J. Using machine learning to count Antarctic shag (Leucocarbo bransfieldensis) nests on images captured by remotely piloted aircraft systems. Ecol. Inform. 2024, 82, 102707. [Google Scholar] [CrossRef]
  12. Jiang, L.; Wu, L. Enhanced Yolov8 network with Extended Kalman Filter for wildlife detection and tracking in complex environments. Ecol. Inform. 2024, 84, 102856. [Google Scholar] [CrossRef]
  13. Hinke, J.T.; Giuseffi, L.M.; Hermanson, V.R.; Woodman, S.M.; Krause, D.J. Evaluating Thermal and Color Sensors for Automating Detection of Penguins and Pinnipeds in Images Collected with an Unoccupied Aerial System. Drones 2022, 6, 255. [Google Scholar] [CrossRef]
  14. Yuan, Q.; Shen, H.; Li, T.; Li, Z.; Li, S.; Jiang, Y.; Xu, H.; Tan, W.; Yang, Q.; Wang, J.; et al. Deep learning in environmental remote sensing: Achievements and challenges. Remote Sens. Environ. 2020, 241, 111716. [Google Scholar] [CrossRef]
  15. Khanam, R.; Hussain, M. YOLOv11: An overview of the key architectural enhancements. arXiv 2024, arXiv:2410.17725. [Google Scholar]
  16. Guo, Z.; Xiao, Y.; Liao, W.; Veelaert, P.; Philips, W. FLOPs-efficient filter pruning via transfer scale for neural network acceleration. J. Comput. Sci. 2021, 55, 101459. [Google Scholar] [CrossRef]
  17. Liu, X.; Peng, H.; Zheng, N.; Yang, Y.; Hu, H.; Yuan, Y. EfficientViT: Memory Efficient Vision Transformer with Cascaded Group Attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 14420–14430. [Google Scholar]
  18. Cai, H.; Li, J.; Hu, M.Y.; Gan, C.; Han, S. EfficientViT: Lightweight multi-scale attention for high-resolution dense prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 2–6 October 2023; pp. 17302–17313. [Google Scholar]
  19. Tang, H.; Liang, S.; Yao, D.; Qiao, Y. A visual defect detection for optics lens based on the YOLOv5-C3CA-SPPF network model. Opt. Express 2023, 31, 2628–2643. [Google Scholar] [CrossRef]
  20. Ferchichi, A.; Ferchichi, A.; Hendaoui, F.; Chihaoui, M.; Toujani, R. Deep learning-based uncertainty quantification for spatio-temporal environmental remote sensing: A systematic literature review. Neurocomputing 2025, 639, 130242. [Google Scholar] [CrossRef]
  21. Chen, C.; Zheng, Z.; Xu, T.; Guo, S.; Feng, S.; Yao, W.; Lan, Y. YOLO-Based UAV Technology: A Review of the Research and Its Applications. Drones 2023, 7, 190. [Google Scholar] [CrossRef]
  22. Rabbi, J.; Ray, N.; Schubert, M.; Chowdhury, S.; Chao, D. Small-Object Detection in Remote Sensing Images with End-to-End Edge-Enhanced GAN and Object Detector Network. Remote Sens. 2020, 12, 1432. [Google Scholar] [CrossRef]
  23. Cheng, S.; Han, Y.; Wang, Z.; Liu, S.; Yang, B.; Li, J. An Underwater Object Recognition System Based on Improved YOLOv11. Electronics 2025, 14, 190. [Google Scholar] [CrossRef]
  24. Sun, F.; Chen, W.; Liu, L.; Liu, W.; Lu, C.; Smith, P. The density of active burrows of plateau pika in relation to biomass allocation in the alpine meadow ecosystems of the Tibetan Plateau. Biochem. Syst. Ecol. 2015, 58, 257–264. [Google Scholar] [CrossRef]
  25. Li, Z.; Dong, Y.; Shen, L.; Liu, Y.; Pei, Y.; Yang, H.; Zheng, L.; Ma, J. Development and challenges of object detection: A survey. Neurocomputing 2024, 598, 128102. [Google Scholar] [CrossRef]
  26. Sheikholeslami, S.; Meister, M.; Wang, T.; Payberah, A.H.; Vlassov, V.; Dowling, J. AutoAblation: Automated Parallel Ablation Studies for Deep Learning. In Proceedings of the 1st Workshop on Machine Learning and Systems, Edinburgh, UK, 26 April 2021; pp. 55–61. [Google Scholar]
  27. Liu, X.; Qin, Y.; Sun, Y.; Yi, S. Monitoring Plateau Pika and Revealing the Associated Influencing Mechanisms in the Alpine Grasslands Using Unmanned Aerial Vehicles. Drones 2025, 9, 298. [Google Scholar] [CrossRef]
  28. Assmann, J.J.; Kerby, J.T.; Cunliffe, A.M.; Myers-Smith, I.H. Vegetation monitoring using multispectral sensors—Best practices and lessons learned from high latitudes. J. Unmanned Veh. Syst. 2019, 7, 54–75. [Google Scholar] [CrossRef]
  29. Lin, G.; Jiang, D.; Fu, J.; Zhao, Y. A Review on the Overall Optimization of Production–Living–Ecological Space: Theoretical Basis and Conceptual Framework. Land 2022, 11, 345. [Google Scholar] [CrossRef]
  30. Tai, C.-Y.; Wang, W.-J.; Huang, Y.-M. Using Time-Series Generative Adversarial Networks to Synthesize Sensing Data for Pest Incidence Forecasting on Sustainable Agriculture. Sustainability 2023, 15, 7834. [Google Scholar] [CrossRef]
Figure 1. Sanjiangyuan Tangbei Regional Reserve digital elevation model (DEM).
Figure 2. Sample classification diagram.
Figure 3. (a) Dense distribution type; (b) dispersed distribution type; (c) bare land type.
Figure 4. YOLOv11 structural diagram.
Figure 5. YOLOv11-AEIT structural diagram.
Figure 6. Overview of EfficientViT: (a) architecture of EfficientViT; (b) sandwich layout block; (c) Cascaded Group Attention.
Figure 7. Architecture of CBAM.
Figure 8. Spatial attention module.
Figure 9. Channel attention module.
Figure 10. Training performance metrics for the YOLOv11-AEIT model: loss, precision, recall, and mAP50 over epochs.
Figure 11. Visual examples of detection results: correct detections, false positives, and false negatives.
Figure 12. Comparison of detection results between YOLOv11n and YOLOv11-AEIT.
Figure 13. Performance comparison of YOLO versions.
Figure 14. Comparison of results of different YOLO versions.
Table 1. The experimental setup involving software and hardware.

Name | Parameters and Versions
Central Processing Unit (CPU) | 13th Gen Intel(R) Core(TM) i5-13500H
Random Access Memory (RAM) | 16 GB
Solid-State Drive (SSD) | YMTC PC300-1TB-B
Graphics Card (GPU) | NVIDIA GeForce RTX 4060 Laptop GPU
Operating System (OS) | Microsoft Windows 11 Professional
Programming Environment (ENVS) | Python 3.10.16 + PyTorch 2.0.0 + CUDA 11.8 (cu118)
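As a quick sanity check, the software stack in Table 1 can be verified from Python itself; the expected values in the comments mirror the table, and all calls are standard `platform`/`torch` APIs.

```python
import platform
import torch

# Expected values per Table 1; deviations may slightly change speed results.
print("Python :", platform.python_version())   # 3.10.16
print("PyTorch:", torch.__version__)           # 2.0.0
print("CUDA   :", torch.version.cuda)          # 11.8 (the "cu118" build)
print("GPU    :", torch.cuda.get_device_name(0)
      if torch.cuda.is_available() else "CPU only")
```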
Table 2. Performance comparison of lightweight backbone networks.

Backbone Network | mAP@0.5 (%) | Params (M) | GFLOPS | FPS | Edge Device Memory (MB)
EfficientViT-M0 | 98.6 | 2.3 | 3.8 | 49.5 | 126
MobileNetv3-Large | 95.2 | 5.47 | 4.7 | 52.3 | 158
ShuffleNetv2-1.5x | 93.8 | 3.4 | 4.1 | 55.1 | 142
YOLOv11n | 95.1 | 2.6 | 4.5 | 51.2 | 135
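The Params and FPS columns of Table 2 can be approximated with a generic measurement helper such as the sketch below. This is our illustration rather than code from the paper: the 640 × 640 input resolution, batch size of 1, and warm-up count are assumptions, and absolute FPS depends on the GPU listed in Table 1.

```python
import time
import torch

def params_m(model: torch.nn.Module) -> float:
    """Trainable parameters in millions (the Params (M) column)."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6

@torch.no_grad()
def measure_fps(model, img_size=640, iters=100, device="cuda"):
    """Rough single-image throughput; warm-up runs amortize CUDA startup."""
    model = model.eval().to(device)
    x = torch.randn(1, 3, img_size, img_size, device=device)
    for _ in range(10):                      # warm-up
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    return iters / (time.perf_counter() - start)
```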
Table 3. EfficientViT variants.

Version | Attention Mechanism Configuration | mAP@0.5 (%) | GFLOPS | Inference Latency (ms)
M0 | Local Window (4 × 4) + Dilated Global (d = 2) | 98.6 | 3.8 | 20.2
M1 | Local Window (8 × 8) + Dense Global | 97.3 | 5.1 | 27.5
M2 | Dense Global + Cross-Scale Dynamic Conv | 96.8 | 6.7 | 34.1
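To make the M0 row concrete, the toy module below alternates 4 × 4 local window attention with dilated (stride-2) global attention, the mechanism named in Table 3. It is a minimal conceptual sketch under our own naming, not the authors' EfficientViT code; a production version would add normalization, residual connections, and the sandwich layout of Figure 6.

```python
import torch
import torch.nn as nn

class WindowDilatedAttention(nn.Module):
    """Toy sketch of the M0 configuration: local 4x4 window attention
    followed by global attention over a dilated (stride-2) token subset."""
    def __init__(self, dim: int, num_heads: int = 4, window: int = 4, dilation: int = 2):
        super().__init__()
        self.window, self.dilation = window, dilation
        self.local_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.global_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, H, W, C) with H and W divisible by the window size.
        B, H, W, C = x.shape
        w = self.window
        # Local attention within non-overlapping w x w windows.
        t = x.view(B, H // w, w, W // w, w, C).permute(0, 1, 3, 2, 4, 5)
        t = t.reshape(-1, w * w, C)
        t, _ = self.local_attn(t, t, t)
        t = t.view(B, H // w, W // w, w, w, C).permute(0, 1, 3, 2, 4, 5)
        x = t.reshape(B, H, W, C)
        # Global attention: every token queries a stride-d subsample of the map.
        q = x.reshape(B, H * W, C)
        kv = x[:, ::self.dilation, ::self.dilation, :].reshape(B, -1, C)
        out, _ = self.global_attn(q, kv, kv)
        return out.view(B, H, W, C)

# Example: a 16x16 feature map with 64 channels keeps its shape.
# y = WindowDilatedAttention(dim=64)(torch.randn(2, 16, 16, 64))
```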
Table 4. Comparison between CBAM and other attention mechanisms.

Attention Module | mAP@0.5 (%) | AP_small (%) | FPS | Params (M) | IoU (%)
No attention (baseline) | 90.8 | 62.5 | 45 | 1.8 | 68.3
SE | 91.1 | 65.3 | 44 | 1.84 (+0.4%) | 69.1
ECA | 92.8 | 64.8 | 44 | 1.82 (+0.2%) | 70.5
Non-local | 91.6 | 70.1 | 29 | 2.08 (+2.8%) | 71.8
CBAM | 93.5 | 75.3 | 41 | 2.06 (+0.8%) | 80.9
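For reference, a minimal PyTorch sketch of the standard CBAM formulation compared in Table 4 (channel attention followed by spatial attention, cf. Figures 7–9) is given below; the reduction ratio of 16 and the 7 × 7 spatial kernel are the defaults from the original CBAM paper and may differ from the exact configuration used here.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(              # shared MLP for both pooled vectors
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))     # global average pooling
        mx = self.mlp(x.amax(dim=(2, 3)))      # global max pooling
        return torch.sigmoid(avg + mx).view(b, c, 1, 1)

class SpatialAttention(nn.Module):
    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)      # channel-wise average map
        mx = x.amax(dim=1, keepdim=True)       # channel-wise max map
        return torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))

class CBAM(nn.Module):
    """Channel attention then spatial attention, each applied multiplicatively."""
    def __init__(self, channels: int, reduction: int = 16, kernel_size: int = 7):
        super().__init__()
        self.ca = ChannelAttention(channels, reduction)
        self.sa = SpatialAttention(kernel_size)

    def forward(self, x):
        x = x * self.ca(x)
        return x * self.sa(x)
```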
Table 5. Ablation experiment results (✓ = module enabled; - = module disabled).

Method | Model Number | EfficientViT-M0 | CBAM (before C3K2) | CBAM (after C3K2) | Precision (%) | Recall (%) | mAP50 (%) | mAP50-95 (%)
YOLOv11n | - | - | - | - | 90.8 | 88.9 | 95.1 | 53.9
YOLOv11-AEIT | A1 | ✓ | - | - | 94.0 | 93.4 | 97.2 | 53.0
YOLOv11-AEIT | A2 | - | ✓ | - | 92.5 | 91.5 | 95.8 | 51.8
YOLOv11-AEIT | A3 | - | - | ✓ | 93.5 | 92.8 | 96.5 | 52.5
YOLOv11-AEIT | A4 | ✓ | ✓ | - | 94.0 | 95.0 | 97.5 | 54.5
YOLOv11-AEIT | A5 | ✓ | - | ✓ | 95.6 | 96.1 | 98.6 | 55.3
Table 6. Comparative experiment.

Model | mAP@0.5 (%) | mAP@0.5:0.95 (%) | Recall (%) | Inference Speed (FPS)
YOLOv5 | 92.5 | 51.8 | 88.9 | 55.3
YOLOv8 | 96.0 | 53.0 | 91.5 | 50.1
YOLOv9 | 96.5 | 52.5 | 92.0 | 48.8
YOLOv11 | 95.1 | 53.9 | 95.7 | 51.2
YOLOv11-AEIT | 98.6 | 55.3 | 96.1 | 49.5
YOLOv12 | 97.3 | 52.5 | 94.5 | 52.5
Faster R-CNN | 85.2 | 44.0 | 78.4 | 8.1
RetinaNet | 89.5 | 50.0 | 83.6 | 14.2
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
