1. Introduction
As a critical transit hub for the habitat and migration of numerous bird species, Dongting Lake supports a rich wetland ecosystem that offers a unique and favorable environment for avian life. This makes it an ideal natural site for investigating bird species diversity. Changes in avian diversity serve as a key ecological indicator, reflecting shifts in the environmental conditions of the Dongting Lake wetland, and thus hold significant value for both scientific research and ecological monitoring. Within this context, automated bird detection functions as an “intelligent sentinel” in biodiversity surveys [1,2]. It enables the efficient and accurate identification and localization of bird targets, substantially enhancing both the efficiency of field surveys and the reliability of collected ecological data [3].
Traditional bird detection methods rely primarily on manual inspection of images or video frames by domain experts for species identification and data recording. However, this approach is labor-intensive, costly, and inefficient when applied to large-scale datasets. Moreover, detection results often lack consistency due to inter-observer variability in expertise and recognition ability. These limitations are further aggravated in complex environments, where visual clutter significantly degrades detection accuracy.
In contrast to manual detection, computer vision-based bird detection leverages efficient algorithms to rapidly process large-scale datasets, thereby significantly enhancing data processing efficiency and enabling precise extraction of bird-related features. Current mainstream detection approaches can be broadly categorized into three groups: one-stage detection methods, two-stage detection methods, and Transformer-based end-to-end detection methods. Two-stage detection methods, exemplified by the widely used Faster R-CNN [4,5,6], first employ a Region Proposal Network (RPN) to generate candidate object regions, which are subsequently classified and refined using bounding box regression. While such methods typically achieve high detection accuracy, their multi-stage processing pipelines lead to relatively slow inference speeds, making them less suitable for real-time applications. Transformer-based detection models, such as the representative DETR [7,8], exhibit strong global feature modeling capabilities and effectively capture long-range dependencies within images. However, these models are often characterized by large parameter counts, slow convergence during training, and a high demand for computational resources. These limitations pose challenges for deployment in resource-constrained environments or scenarios requiring real-time performance.
In contrast, one-stage detection methods such as YOLO (You Only Look Once) [9,10,11,12,13] and RetinaNet [14] achieve significantly higher detection speeds by eliminating the region proposal stage and directly predicting object categories and locations on the feature maps. Among these, the YOLO family of models demonstrates a favorable balance between real-time processing capability, detection accuracy, and generalization performance, making it particularly well-suited for large-scale, long-duration monitoring tasks. Given these advantages, this study adopts YOLOv11 [15] as the baseline model to investigate bird detection in the Dongting Lake region.
Although YOLOv11, as the latest iteration of the YOLO series, has made significant progress in architecture and performance optimization, it still faces multiple challenges in the Dongting Lake bird detection task: small-scale bird targets remain difficult to recognize, the uneven distribution of target scales restricts the model’s generalization ability, and visually similar backgrounds introduce noise interference. To address the challenge of large-scale variations in bird images, researchers have conducted extensive investigations from the perspectives of network architecture optimization and learning strategy innovation. In object detection tasks, a common approach is to employ feature pyramid structures [16] to integrate multi-resolution feature maps, thereby facilitating multi-scale representation. Additionally, some methods incorporate multi-scale modules within single-branch networks. For instance, the MSCB module in MSCAM [17] draws inspiration from the Inception architecture, utilizing parallel convolutions with kernel sizes of 1 × 1, 3 × 3, and 5 × 5 to capture features at different scales and enhance the network’s representational capacity. The Atrous Spatial Pyramid Pooling (ASPP) [18] module improves contextual awareness through a dilated convolution pyramid. However, these manually crafted convolutional configurations often adapt poorly to diverse scale distributions in complex scenes. In recent years, multi-task learning in object detection has mainly concentrated on the joint optimization of classification and localization sub-tasks. Some methods have also introduced auxiliary tasks (such as semantic segmentation or keypoint detection) to enhance the model’s understanding of object structures and contextual information. Nevertheless, the effectiveness of auxiliary tasks largely depends on their relevance to the main task; poorly correlated auxiliary tasks may lead to gradient interference or overfitting, ultimately undermining the model’s generalization performance.
In bird image detection, background interference is one of the key challenges affecting model performance. In natural scenes, bird targets often blend into background elements such as leaves, branches, sky, and water, leading to blurred target boundaries and feature confusion, which in turn triggers false or missed detections [19]. To address this issue, researchers have pursued three avenues: feature enhancement, context modeling, and attention focusing. For feature enhancement, early approaches aimed to amplify the texture disparity between target and background through background-suppressing convolution kernels or edge-aware loss functions; however, these methods exhibit limited efficacy when the background and the target share similar color distributions. For context modeling, researchers often use GNN [20] or Transformer architectures to construct semantic relationship graphs between the target and the environment or other objects, such as human-object interaction graphs or multi-target tracking graphs, to help distinguish targets from background in interaction detection and person association tasks. Although this approach improves context understanding, constructing the additional relationship graphs relies on a large number of annotations, which becomes an important bottleneck constraining generalization ability. Meanwhile, attention mechanisms have become a core tool for suppressing background interference: MCA [21] enables the model to focus on salient target features by dynamically suppressing background-related channels and spatial regions, and the CBAM [22] module further combines spatial and channel attention to provide a more comprehensive and effective feature extraction capability.
The YOLO series of detection methods is also widely used in small-target detection and ecological monitoring. For example, Li et al. [23] proposed SOD-YOLO, an improvement on YOLOv8 that combines RFCBAM and BSSI-FPN modules to optimize small-scale object detection and multi-scale feature fusion. Wang et al. [24] proposed WB-YOLO, based on an improved YOLOv7 with an integrated vision Transformer encoder module, for efficient detection of wild bats through effective multi-scale feature fusion. Zhang et al. [25] proposed YOLO-Feature and Clustering Enhanced (YOLO-FCE), an improved model based on the YOLOv9 architecture, to evaluate and enhance the model’s feature extraction capabilities. Ji et al. [26] proposed HydroSpot-YOLO, which integrates an Attentional Scale Sequence Fusion (ASF) mechanism and a P2 detection layer to improve the detection of small and densely clustered targets under challenging conditions such as water reflections, cluttered backgrounds, and variable illumination. Although these YOLO-based methods have advanced ecological monitoring, some limitations remain. Some introduce complex modules that increase computational cost and limit deployment in resource-constrained field environments, or do not fully consider the unique challenges of ecological data, such as extreme scale changes, high background clutter, species similarity, and limited annotated datasets. In addition, some models sacrifice inference efficiency for accuracy or lack robustness across diverse ecological scenarios.
To address the aforementioned challenges, this study proposes targeted enhancements based on the YOLOv11n framework, aiming to improve target representation and scene adaptability within the complex environment of Dongting Lake. The main contributions of this work are summarized as follows:
- The introduction of an Efficient Multi-scale Attention (EMA) mechanism [27] enhances the feature representation of targets across different scales by employing parallel subnetworks that simultaneously capture both channel-wise and spatial information. This mechanism significantly improves the model’s capability for multi-scale detection in complex background scenarios.
- The improved RepNCSPELAN4-ECO module introduces depthwise separable convolution (DWConv) and an adaptive channel compression mechanism, making feature extraction and multi-scale feature fusion more comprehensive and efficient and improving the detection of birds at different scales.
- Ordinary convolutions in the neck network are upgraded to lightweight GSConv [28] convolutions, which effectively aggregate global context while significantly reducing computational redundancy through the combination of grouped convolution and spatial convolution, improving both detection accuracy and speed.
- We constructed DTH-Birds, a Dongting Lake bird detection dataset, by capturing high-quality images of common birds in Dongting Lake through multiple approaches and performing data labeling and augmentation, totaling 14,107 images. The robustness of the Birds-YOLO model across diverse datasets and scenarios is further demonstrated on the public bird dataset CUB200-2011 [29].
  2. Materials and Methods
  2.1. Data Acquisition Processing
In this study, the improved Birds-YOLO model is evaluated on the publicly available CUB-200-2011 dataset and a self-constructed dataset, DTH-Birds, as illustrated in Figure 1. The CUB-200-2011 dataset is a widely used benchmark for bird image classification, comprising 11,788 images across 200 bird subcategories. Each image is annotated with a category label and bounding box coordinates indicating the bird’s location. The dataset is typically split into training, validation, and test sets in a 70%:15%:15% ratio, containing 8242, 1773, and 1773 images, respectively.
The DTH-Birds dataset was collected from early 2024 to March 2025. Data sources include the bird-watching monitoring platform provided by the Hunan Provincial Department of Natural Resources and field photographs taken by researchers at Dongting Lake. The dataset was assembled through two main collection phases: the first involved video recording, screenshot extraction, cropping, and organization of images from the monitoring platform between January and February 2024; the second consisted of on-site field photography conducted in the Dongting Lake area in March 2025. The collected images encompass a wide range of weather conditions, varying degrees of occlusion, differences in bird density, and scale variations. This diversity enhances the dataset’s representativeness and contributes to improving the model’s generalization capability.
The DTH-Birds dataset exhibits significant differences in diversity and background complexity compared to CUB-200-2011, as illustrated in Figure 2. Although CUB-200-2011 covers 200 bird subclasses and provides accurate bounding box annotations, its images are mostly collected from the Internet and manually filtered, with relatively clean backgrounds, concentrated shooting angles, stable lighting conditions, and relatively limited scene variation. The DTH-Birds dataset is collected directly from real ecological monitoring and field shooting environments in the Dongting Lake area of Hunan Province. It includes both long-distance monitoring images extracted from bird-watching platform videos and high-resolution images taken up close in the field, covering weather conditions such as sunny and cloudy days as well as complex scenarios such as water reflection, vegetation occlusion, dense bird populations, and sparse individuals. This collection strategy makes DTH-Birds closer to real monitoring scenarios in terms of background interference, target scale variation, viewpoint diversity, and environmental dynamics, placing higher demands on the robustness and generalization ability of the model.
To address the challenges of limited availability and imbalanced sample numbers across certain bird categories, four data augmentation techniques were applied to selected classes (illustrated in Figure 3): adding salt-and-pepper noise, darkening, lightening, and rotating, aimed at increasing image diversity and improving category distribution balance. The final dataset comprises 14,107 images spanning 47 bird subclasses. It is partitioned into training, validation, and test sets at an 80%:10%:10% ratio, containing 11,287, 1410, and 1410 images, respectively. The specific dataset partitioning and data augmentation parameters are shown in Table 1.
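Since the exact augmentation parameters are reported in Table 1 rather than in the text, the following Python sketch only illustrates the four operations with assumed placeholder values (noise density, brightness factors, rotation angle); rotated images additionally require transforming the bounding-box labels.

```python
import cv2
import numpy as np

def salt_pepper(img: np.ndarray, amount: float = 0.01) -> np.ndarray:
    """Corrupt a random fraction of pixels with salt (255) and pepper (0) noise."""
    out = img.copy()
    n = int(amount * img.shape[0] * img.shape[1])
    ys = np.random.randint(0, img.shape[0], n)
    xs = np.random.randint(0, img.shape[1], n)
    out[ys[: n // 2], xs[: n // 2]] = 255  # salt
    out[ys[n // 2:], xs[n // 2:]] = 0      # pepper
    return out

def adjust_brightness(img: np.ndarray, factor: float) -> np.ndarray:
    """factor < 1 darkens the image, factor > 1 lightens it."""
    return cv2.convertScaleAbs(img, alpha=factor, beta=0)

def rotate(img: np.ndarray, angle: float) -> np.ndarray:
    """Rotate about the image center; box labels must be rotated accordingly."""
    h, w = img.shape[:2]
    m = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    return cv2.warpAffine(img, m, (w, h))

# Example: the four augmented variants of one image (hypothetical path).
img = cv2.imread("dth_birds/sample.jpg")
variants = [salt_pepper(img), adjust_brightness(img, 0.6),
            adjust_brightness(img, 1.4), rotate(img, 15)]
```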
The data annotation process was performed using the LabelImg tool, followed by expert-guided screening, classification, and refinement. The annotation adhered to the following core principles: bounding boxes must accurately and fully enclose the entire body of the target bird; in images containing multiple bird species, each category is labeled separately; in densely populated scenes, individual birds are annotated and distinguished one by one; partially occluded birds are annotated based on their visible regions.
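For reference, LabelImg can export annotations in YOLO format, where each line of a per-image label file stores the class index followed by the normalized box center coordinates and box size. A hypothetical label file for a frame containing two birds of the same class might look like:

```
12 0.412 0.538 0.120 0.095
12 0.701 0.262 0.084 0.066
```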
  2.2. YOLOv11’s Network Architecture
YOLOv11 is a real-time object detection algorithm officially released by Ultralytics on 30 September 2024. Its core design goal is to balance detection accuracy and inference speed while systematically optimizing object detection performance in complex scenes. Compared to previous versions in the YOLO series, the YOLOv11 architecture (illustrated in Figure 4) primarily consists of three components: the backbone, neck, and head.
Regarding the backbone design, YOLOv11 employs CSPDarknet53 as its feature extraction backbone and generates multi-scale feature maps through five downsampling stages to enhance the model’s ability to perceive targets at varying scales. Key modules include the following. The C3k2 module replaces the traditional C2f [30] structure, optimizing residual connections and inter-channel feature interactions, thereby significantly improving the model’s multi-scale feature representation capability. The CBS module, consisting of convolution, batch normalization, and the SiLU [31] activation function, enables fast nonlinear transformations and feature map normalization, which helps stabilize training and enhance representational power. The Spatial Pyramid Pooling Fast (SPPF) module applies multi-scale pooling operations to map feature maps to fixed dimensions, effectively strengthening the model’s global semantic information extraction. The C2PSA module integrates pyramid slice attention by combining multi-level feature slicing with channel-spatial attention fusion, improving the model’s discriminative ability in complex backgrounds and small-object scenarios [32].
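As a concrete reference for the simplest building block named above, a minimal PyTorch sketch of a CBS (Convolution-BatchNorm-SiLU) layer might look as follows; the class name and defaults are illustrative rather than Ultralytics’ exact implementation.

```python
import torch.nn as nn

class CBS(nn.Module):
    """Convolution + Batch Normalization + SiLU activation."""
    def __init__(self, c_in: int, c_out: int, k: int = 3, s: int = 1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, s, padding=k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))
```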
In terms of neck design, YOLOv11 adopts a PAN-FPN structure, which enhances the bidirectional flow of information through both bottom-up and top-down path aggregation. This design strengthens the integration of shallow spatial features and deep semantic features, effectively compensating for the limitations of the traditional FPN [33] in localization accuracy. For the detection head, YOLOv11 employs a decoupled architecture that separately handles classification and bounding box regression, thereby improving task-specific optimization. The classification branch utilizes Binary Cross-Entropy (BCE) loss and incorporates two layers of DWConv, which significantly reduces model complexity and computational cost while preserving strong classification performance. The regression branch combines Distribution Focal Loss (DFL) and Complete Intersection over Union (CIoU [34]) loss to jointly optimize bounding box localization accuracy and regression stability [16].
Overall, YOLOv11 enhances real-time object detection performance through lightweight optimization of the backbone network, improved multi-scale feature fusion mechanisms, an efficient decoupled detection head design, and targeted enhancements for small object detection. Based on these advantages, this study selects YOLOv11n as the baseline model for further improvement and experimentation [35].
  2.3. Birds-YOLO’s Network Architecture
Although YOLOv11n demonstrates an excellent balance between real-time performance and detection accuracy in general object detection tasks, it still exhibits notable limitations in fine-grained feature representation, scale adaptability, and multi-level feature fusion efficiency when applied to multi-scale bird detection in the complex ecological environment of the Dongting Lake Basin. Specifically, the small object sizes, dense spatial distribution, and complex background interference of bird targets in this region pose significant challenges for YOLOv11n in terms of feature expressiveness and detection robustness. To address these issues, this paper proposes an enhanced architecture, Birds-YOLO, optimized specifically for bird detection. The proposed model improves feature extraction, multi-scale adaptability, and generalization performance through targeted structural modifications.
The network architecture of Birds-YOLO, illustrated in Figure 5, retains the hierarchical structure of the classic YOLO series, comprising three key components: backbone, neck, and head. To address the core challenges associated with bird detection in the Dongting Lake region, the following architectural enhancements are introduced:
To tackle the issues of large-scale variation and complex background interference, EMA is incorporated to dynamically enhance feature representations across different scales. A multi-branch parallel processing strategy is employed to simultaneously capture channel-wise and spatial positional weights. Through feature fusion, a refined multi-scale attention map is generated, thereby improving the model’s ability to detect targets of varying scales in cluttered backgrounds.
To further optimize the efficiency of feature extraction and multi-scale fusion, the RepNCSPELAN4-ECO module improves on RepNCSPELAN4 by introducing depthwise separable convolution and an adaptive channel compression mechanism. These changes enhance the module’s ability to capture features of birds of different sizes. The dynamic channel compression mechanism, combined with residual connections and cross-scale feature interaction, significantly improves the adequacy of feature representation and the efficiency of multi-scale feature integration.
To reduce the computational redundancy commonly found in traditional convolution operations within the neck network, standard convolutions are replaced with lightweight GSConv modules. GSConv integrates grouped convolution and spatial convolution in a collaborative design that reduces computational complexity while preserving global context aggregation. Specifically, it partitions the input feature map into multiple groups, applies spatial convolution independently within each group, and employs a channel shuffle mechanism to promote inter-group information exchange. This design improves inference speed without compromising detection accuracy.
Through the aforementioned enhancements, Birds-YOLO significantly improves the feature extraction capability and detection robustness for multi-scale bird targets in complex wetland environments while maintaining competitive overall detection performance.
  2.4. EMA
In the Dongting Lake wetland ecological monitoring task, the object detection algorithm faces two core challenges. First, because target distance, flight attitude, and habitat vary during shooting, bird sizes in the images span a wide range, making it difficult for traditional models to simultaneously capture the global semantic representation of large targets and the fine-grained texture of small targets during feature extraction. Second, the basin background is complex and dynamic: natural elements such as wetland vegetation, water reflections, and reed thickets closely resemble bird targets, and environmental factors such as lighting changes and seasonal differences in vegetation coverage further aggravate the difficulty of distinguishing targets from background, making the model prone to false or missed detections in complex scenes. To address these challenges, this study introduces the Efficient Multi-scale Attention (EMA) mechanism into the YOLOv11n model to dynamically enhance the feature representation of targets at different scales and the model’s ability to suppress complex backgrounds, thereby improving detection robustness in the wetland environment. The EMA structure is shown in Figure 6.
The EMA module operates through an excitation mechanism and a modulation mechanism. The excitation mechanism computes the inner product between the input feature and a learnable parameter to generate a similarity matrix, where each element measures the semantic similarity between a feature position and the parameter. Higher similarity indicates greater feature importance under the current context. Subsequently, the modulation mechanism dynamically reweights each feature position based on this matrix, enhancing informative features while suppressing redundant ones.
The EMA module employs a multi-scale attention mechanism to capture channel and spatial information simultaneously through parallel sub-networks. It mainly consists of two branches, a 1 × 1 branch and a 3 × 3 branch, which effectively combine channel and spatial information without adding excessive parameters or computational cost. An input feature map $X \in \mathbb{R}^{C \times H \times W}$ is first divided into $G$ sub-features along the channel dimension, i.e., $X = [X_0, X_1, \ldots, X_{G-1}]$ with $X_i \in \mathbb{R}^{C/G \times H \times W}$. The divided sub-feature maps are fused with information from the other branch on branch 1; branch 2 uses two-dimensional average pooling to globally average the feature maps over both the height and width directions, as shown in

$$z_c = \frac{1}{H \times W} \sum_{j=1}^{H} \sum_{i=1}^{W} x_c(i, j),$$

where $H$ and $W$ denote the height and width of the feature map and $x_c$ denotes the feature tensor of channel $c$. Each sub-feature group $X_i$ is then transformed by the two branches,

$$F_{1 \times 1} = \sigma\left(W_{1 \times 1} \cdot X_i\right), \qquad F_{3 \times 3} = \sigma\left(W_{3 \times 3} \cdot X_i\right),$$

where $W_{1 \times 1}$ is the weight matrix of the 1 × 1 convolution, $W_{3 \times 3}$ is the weight matrix of the 3 × 3 convolution, and $\sigma$ is the activation function. The outputs of the 1 × 1 branch and the 3 × 3 branch are fused by cross-dimensional interaction, and the fused outputs are aggregated over all sub-feature groups,

$$Y_i = X_i \otimes \operatorname{Sigmoid}\left(F_{1 \times 1} \oplus F_{3 \times 3}\right), \qquad Y = \operatorname{Concat}\left(Y_0, Y_1, \ldots, Y_{G-1}\right),$$

where $\oplus$ denotes the cross-dimensional fusion and $\otimes$ denotes element-wise reweighting.
The EMA module enhances feature representation by integrating channel and spatial information through feature grouping and a multi-scale parallel sub-network. It adaptively weights feature maps according to their importance, improving multi-scale bird detection. By smoothing the attention distribution, EMA reduces sensitivity to noise and outliers, thereby increasing robustness. Moreover, its grouped and parallel design achieves strong feature expressiveness with low computational overhead.
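For concreteness, the sketch below is patterned on the public reference implementation accompanying the EMA paper [27]; the grouping factor, layer names, and tensor bookkeeping are illustrative rather than the exact code used in Birds-YOLO.

```python
import torch
import torch.nn as nn

class EMA(nn.Module):
    """Efficient Multi-scale Attention: grouped 1x1 and 3x3 branches with
    cross-dimensional interaction (a sketch after the reference code)."""
    def __init__(self, channels: int, groups: int = 8):
        super().__init__()
        assert channels % groups == 0
        self.groups = groups
        self.softmax = nn.Softmax(dim=-1)
        self.agp = nn.AdaptiveAvgPool2d((1, 1))        # 2D global average pooling
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))  # pool along width
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))  # pool along height
        self.gn = nn.GroupNorm(channels // groups, channels // groups)
        self.conv1x1 = nn.Conv2d(channels // groups, channels // groups, 1)
        self.conv3x3 = nn.Conv2d(channels // groups, channels // groups, 3, padding=1)

    def forward(self, x):
        b, c, h, w = x.size()
        g = x.reshape(b * self.groups, -1, h, w)         # split into G sub-features
        x_h = self.pool_h(g)                             # (BG, C/G, H, 1)
        x_w = self.pool_w(g).permute(0, 1, 3, 2)         # (BG, C/G, W, 1)
        hw = self.conv1x1(torch.cat([x_h, x_w], dim=2))  # 1x1 branch
        x_h, x_w = torch.split(hw, [h, w], dim=2)
        x1 = self.gn(g * x_h.sigmoid() * x_w.permute(0, 1, 3, 2).sigmoid())
        x2 = self.conv3x3(g)                             # 3x3 branch
        # Cross-dimensional interaction between the two branch outputs.
        y1 = self.softmax(self.agp(x1).reshape(b * self.groups, -1, 1).permute(0, 2, 1))
        y2 = self.softmax(self.agp(x2).reshape(b * self.groups, -1, 1).permute(0, 2, 1))
        w1 = torch.matmul(y1, x2.reshape(b * self.groups, c // self.groups, -1))
        w2 = torch.matmul(y2, x1.reshape(b * self.groups, c // self.groups, -1))
        weights = (w1 + w2).reshape(b * self.groups, 1, h, w)
        return (g * weights.sigmoid()).reshape(b, c, h, w)
```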
  2.5. RepNCSPELAN4-ECO
On the basis of the original RepNCSPELAN4 module [36], RepNCSPELAN4-ECO introduces multiple key structural optimizations aimed at achieving a better balance between high-precision feature extraction and efficient computation, making it particularly suitable for fine-grained object detection tasks in long-distance bird detection. The main improvements are reflected in the following aspects:
Using DWConv instead of standard convolution. RepNCSPELAN4 uses standard 3 × 3 convolutions for spatial feature extraction in its branch paths, which offer strong expressive power but incur high computational overhead. To address this, RepNCSPELAN4-ECO replaces both standard convolution operations in the branch, located on the cv2 and cv3 paths, with DWConv. The computational complexity of a standard convolution is $O(H \times W \times C_{in} \times C_{out} \times K^2)$, where $K = 3$ is the convolution kernel size; DWConv decomposes the convolution into a depthwise step and a pointwise (1 × 1) step, reducing the complexity to $O(H \times W \times C \times K^2 + H \times W \times C^2)$, where $C = C_{in} = C_{out}$. Under the typical setting of $C = 256$, the computational cost of a single 3 × 3 convolution is reduced by a factor of approximately $\frac{C K^2}{C + K^2} \approx 8.7$, significantly lowering the FLOPs and memory bandwidth requirements of the RepNCSPELAN4 module. The structural details of the RepNCSPELAN4-ECO module are illustrated in Figure 7.
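A minimal sketch of the depthwise separable replacement follows, with an illustrative multiply-accumulate comparison at C = 256 (the class name and defaults are assumptions, not the exact Birds-YOLO code):

```python
import torch.nn as nn

class DWSeparableConv(nn.Module):
    """Depthwise 3x3 convolution followed by a 1x1 pointwise convolution."""
    def __init__(self, c: int, k: int = 3):
        super().__init__()
        self.dw = nn.Conv2d(c, c, k, padding=k // 2, groups=c, bias=False)
        self.pw = nn.Conv2d(c, c, 1, bias=False)
        self.bn = nn.BatchNorm2d(c)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.pw(self.dw(x))))

# Multiply-accumulate counts on a 40x40 map with C = 256, K = 3:
#   standard 3x3: H*W*C*C*K*K         = 40*40*256*256*9 ≈ 9.44e8
#   DWConv:       H*W*C*K*K + H*W*C*C = 3.69e6 + 1.05e8 ≈ 1.09e8  (~8.7x fewer)
```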
The original RepNCSPELAN4 module requires manual specification of the intermediate channel dimensions c3 and c4, complicating parameter configuration and limiting scalability and automation. To address this, RepNCSPELAN4-ECO introduces a proportional control mechanism that automatically computes the intermediate channel count as c4 = int(c2 × ratio), where c2 is the output channel number and ratio is usually set to 0.5. This design simplifies network configuration, enables flexible model width scaling via the ratio, and enhances the module’s generality and scalability.
Despite its lightweight design, RepNCSPELAN4-ECO preserves the multi-path information aggregation mechanism of RepNCSPELAN4: the input is passed through a 1 × 1 convolution and split into two parts via chunk(2, 1). One part is retained through a direct connection (y[0]), while the other undergoes deep nonlinear transformation through two cascaded RepNCSP + DWConv stages. Finally, y[0], the output of the first stage, and the output of the second stage are concatenated along the channel dimension and fused by a 1 × 1 convolution, as sketched below. This design inherits the gradient flow optimization of ELAN and the feature reuse principle of CSP, ensuring effective integration of shallow details and deep semantics while minimizing redundancy and preserving feature expressiveness.
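The sketch below follows the forward path just described, reusing DWSeparableConv from the previous snippet; RepNCSP is assumed to come from a GELAN/YOLOv9-style codebase, so a trivial stand-in is defined here only to keep the example self-contained. Channel bookkeeping is illustrative.

```python
import torch
import torch.nn as nn

class RepNCSP(nn.Module):
    """Stand-in for the GELAN RepNCSP block (assumption, not the real block)."""
    def __init__(self, c1: int, c2: int, n: int = 1):
        super().__init__()
        self.proj = nn.Conv2d(c1, c2, 1)
        self.body = nn.Sequential(*[DWSeparableConv(c2) for _ in range(n)])

    def forward(self, x):
        return self.body(self.proj(x))

class RepNCSPELAN4ECO(nn.Module):
    """Sketch of RepNCSPELAN4-ECO: ratio-controlled hidden width, an identity
    branch, and two cascaded RepNCSP + DWConv stages fused by a 1x1 conv."""
    def __init__(self, c1: int, c2: int, ratio: float = 0.5, n: int = 1):
        super().__init__()
        c4 = int(c2 * ratio)                       # adaptive channel compression
        self.cv1 = nn.Conv2d(c1, c4, 1)
        self.stage1 = nn.Sequential(RepNCSP(c4 // 2, c4 // 2, n),
                                    DWSeparableConv(c4 // 2))
        self.stage2 = nn.Sequential(RepNCSP(c4 // 2, c4 // 2, n),
                                    DWSeparableConv(c4 // 2))
        self.fuse = nn.Conv2d((c4 // 2) * 3, c2, 1)  # identity + two stages

    def forward(self, x):
        y = list(self.cv1(x).chunk(2, dim=1))      # split along channels
        s1 = self.stage1(y[1])                     # first deep stage
        s2 = self.stage2(s1)                       # cascaded second stage
        return self.fuse(torch.cat([y[0], s1, s2], dim=1))
```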
Overall, RepNCSPELAN4-ECO effectively reduces redundant computational overhead by introducing DWConv and an adaptive channel compression mechanism. The improved architecture achieves a better balance between accuracy and speed without significantly sacrificing feature expressiveness, making it suitable for object detection tasks in resource-constrained scenarios.
  2.6. GSConv
GSConv effectively optimizes traditional convolution operations by combining grouped convolution with pointwise spatial convolution. This design not only significantly reduces the model’s parameter count and computational overhead but also improves inference speed. Consequently, the model achieves efficient real-time detection in resource-constrained environments. Moreover, GSConv reduces the computational burden while largely preserving the original detection accuracy, demonstrating a favorable balance between efficiency and performance. Therefore, it is particularly well-suited for bird detection tasks that demand both computational efficiency and high accuracy.
The GSConv module adopts a hybrid convolutional architecture that enables lightweight feature extraction by combining dense channel-wise connections with sparse spatial sampling. It maintains global channel interaction while employing sparse convolutions to aggregate contextual information and reduce computational redundancy. The module features a dual-branch structure: one branch uses standard convolution for downsampling and coarse-grained semantic capture, while the other applies depthwise convolution (DWConv) to extract fine-grained spatial details. The outputs are concatenated along the channel dimension and processed by a channel shuffle operation to promote cross-channel information exchange. This complementary fusion of global semantics and local textures enhances feature representation, as illustrated in Figure 8.
The feature extraction module employs a three-stage processing pipeline. First, a standard convolutional layer compresses the input feature map with C1 channels by reducing the channel dimension to C2/2, which not only extracts primary feature representations but also decreases the computational load of subsequent operations. Next, DWConv is applied independently across channels, leveraging a sparse sampling strategy to enhance feature diversity and improve the module’s capacity for fine-detail representation. Finally, the outputs of the standard convolution branch and the depthwise convolution branch are concatenated along the channel dimension, reconstructing the full C2-channel feature map. A subsequent channel shuffle operation rearranges the features to optimize their distribution, facilitating cross-channel interaction and fusion and producing a more discriminative and robust feature representation. Its computational complexity is

$$\mathrm{Time}_{GSConv} = O\left(W \times H \times K_1 \times K_2 \times \frac{C_2}{2} \times (C_1 + 1)\right),$$

where $K_1 \times K_2$ is the convolution kernel size, $W$ and $H$ are the width and height of the output feature map, and $C_1$ and $C_2$ are the numbers of input and output channels.
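A minimal PyTorch sketch of this three-stage pipeline, patterned on the public Slim-Neck GSConv implementation (the 5 × 5 depthwise kernel and layer defaults are assumptions):

```python
import torch
import torch.nn as nn

class GSConv(nn.Module):
    """Standard conv to C2/2, depthwise branch, concat, then channel shuffle."""
    def __init__(self, c1: int, c2: int, k: int = 1, s: int = 1):
        super().__init__()
        c_ = c2 // 2
        self.dense = nn.Sequential(                       # standard-conv branch
            nn.Conv2d(c1, c_, k, s, k // 2, bias=False),
            nn.BatchNorm2d(c_), nn.SiLU())
        self.sparse = nn.Sequential(                      # depthwise branch
            nn.Conv2d(c_, c_, 5, 1, 2, groups=c_, bias=False),
            nn.BatchNorm2d(c_), nn.SiLU())

    def forward(self, x):
        x1 = self.dense(x)
        x2 = torch.cat([x1, self.sparse(x1)], dim=1)      # rebuild C2 channels
        b, c, h, w = x2.size()
        # Channel shuffle: interleave the two halves for cross-group exchange.
        return x2.view(b, 2, c // 2, h, w).transpose(1, 2).reshape(b, c, h, w)
```

Consistent with the design described above, such a module replaces convolutions only in the neck, where channel counts are widest and the computational savings are therefore largest.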
In practical bird detection scenarios at Dongting Lake, computing resources are often constrained, particularly on embedded systems or mobile devices with limited processing power and storage capacity. To address these limitations, the GSConv module is introduced, optimizing the convolutional structure to better utilize scarce computational resources and enable faster real-time inference. This not only boosts overall detection performance but also improves the model’s applicability and practical value in resource-constrained environments.
  4. Discussion
The Birds-YOLO model proposed in this paper integrates the EMA attention mechanism, an improved RepNCSPELAN4-ECO module, and lightweight GSConv convolution to address challenges in bird detection such as multi-scale targets, complex backgrounds, and efficiency. Experimental results show significant improvements in mAP, demonstrating the effectiveness of the approach on a self-built dataset from the Dongting Lake area. However, the added components increase computational complexity and model size, potentially limiting deployment in resource-constrained or real-time scenarios. Additionally, the dataset, while diverse, may not fully represent all bird species or environmental variations in the region, which could affect generalization. It should also be noted that, due to high training costs, experiments were conducted with a fixed random seed for reproducibility, limiting statistical analysis through multiple runs. Future work will focus on expanding datasets, exploring model compression techniques to reduce inference cost, and validating the framework’s generalizability to other ecological monitoring tasks such as detecting amphibians, insects, or small mammals under real-world conditions.
  5. Conclusions
This study aims to propose a high-precision and robust bird detection model for the Dongting Lake area. We collected real-world data covering 47 species of birds in Dongting Lake through various methods and developed an improved bird detection model named Birds-YOLO based on this dataset.
First, an Efficient Multi-scale Attention (EMA) mechanism is introduced to address the insufficient feature extraction capacity of traditional backbone networks. This mechanism enhances semantic interaction across multi-scale feature maps via a dynamic weighting strategy. The EMA module adopts a parallel multi-branch architecture that integrates both channel and spatial attention, enabling efficient cross-scale feature fusion with low computational overhead. It effectively mitigates the performance bottleneck of traditional backbones in detecting small targets against complex backgrounds.
Second, to further improve feature extraction efficiency and multi-scale representation ability, we replaced the backbone network with a new architecture incorporating the RepNCSPELAN4-ECO module. This module not only enhances feature expression but also significantly improves multi-scale feature fusion efficiency, particularly in complex environments.
Finally, to reduce the computational redundancy in the neck structure, the GSConv module is introduced to replace and optimize traditional convolutions. GSConv combines grouped convolution with channel shuffling to significantly reduce computational complexity while maintaining accurate feature transmission. This improves the overall detection speed and resource utilization efficiency.
Experimental results show that Birds-YOLO outperforms the baseline model across multiple metrics. On the CUB200-2011 and DTH-Birds datasets, Birds-YOLO achieves mAP@0.5 scores of 83.5% and 91.8%, respectively, representing improvements of 3.5 and 2.6 percentage points over the original YOLOv11n. The proposed method demonstrates strong detection performance across diverse bird species, scales, and complex natural backgrounds, surpassing existing mainstream object detectors in overall effectiveness.