1. Introduction
Cotton is one of the most important cash crops worldwide, playing a crucial role in maintaining agricultural stability and meeting global fiber demand [1,2,3]. Owing to its unique breathability and softness, cotton generates substantial economic benefits for farmers, textile workers, and other industry professionals. However, the proliferation of weeds in cotton fields severely hampers the development of the cotton industry. These weeds engage in continuous and intense interspecific competition throughout the entire cotton growth cycle [4], depriving cotton plants of critical resources such as water, fertilizer, light, and heat. This resource stress results in stunted growth, reduced yields, and deteriorated fiber quality. Moreover, weeds serve as intermediate hosts for numerous cotton pests and diseases, disrupting the ecological equilibrium of cotton fields [5,6]. Statistics indicate that without effective weed management, cotton yield losses can reach 30% to 90% [7]. Current control measures typically rely on extensive application of chemical herbicides or manual weeding [8]. However, excessive herbicide use triggers ecological issues, including enhanced weed resistance and soil residue contamination. Furthermore, traditional weed diagnosis relies on manual observation, which is inefficient, costly, and incapable of providing the real-time accuracy required to safeguard cotton health [9,10].
To address the challenges posed by weeds, cotton weed detection technologies utilizing computer vision, pattern recognition, and machine learning have emerged, aiming to achieve automated and precise identification of weeds in the field [11,12,13]. In recent years, rapid advancements in computing power and artificial intelligence, coupled with the development of deep learning, have significantly transformed traditional image processing techniques, substantially enhancing computer vision capabilities in classification and object recognition. For instance, feature extraction using classic backbone networks such as Residual Network (ResNet) [14] and GoogLeNet [15] has markedly improved recognition accuracy. Specifically, researchers such as Peteinatos et al. achieved over 90% accuracy in identifying weed species by training deep models like VGG16 and ResNet-50 on large-scale datasets [16]. Notably, the You Only Look Once (YOLO) [17] series of object detection methods has demonstrated immense potential for real-time target detection [18,19].
In the current agricultural context, single-stage object detection algorithms, typified by the YOLO series, are widely used because they combine fast inference with high detection accuracy, achieving a favorable balance between the two. To address the computational constraints of agricultural edge devices, researchers both domestically and internationally have explored extensive innovations in module optimization, attention mechanism refinement, and model lightweighting. Early studies spanning YOLOv4 to YOLOv7 focused on architectural enhancements to refine detection frameworks; for instance, techniques such as CBAM, E-ELAN, and advanced data augmentation strategies were employed to strengthen feature perception for crop detection tasks. Shoaib et al. [20] substantially improved the accuracy of sugar beet weed identification by integrating pixel-level image synthesis, Transformer modules, and the SAFF adaptive spatial feature fusion module. The release of YOLOv8 further accelerated research into crop detection within the academic community, and researchers have implemented various refinements for more precise detection in numerous precision agriculture scenarios. For instance, Wang et al. [21] enhanced multi-scale feature fusion by incorporating Vision Transformers (ViTs) and a Weighted Bidirectional Feature Pyramid Network (BiFPN), thereby effectively improving weed recognition performance in wheat fields.
Although the accuracy of these models has been greatly improved, the varying shapes and sizes of weeds require precise identification and differentiation of weed types. This diversity necessitates that models possess the capability to accurately recognize distinct edge features [22]. Consequently, recent research on YOLOv10 and YOLOv11 has focused on optimizing models for practical deployment, exploring the balance between high efficiency and high precision. Specifically, Wang et al. [23] proposed an end-to-end real-time monitoring optimization scheme, introducing an NMS-free training strategy to reduce inference time. Li et al. [24] developed the D3-YOLOv10 model, achieving lightweight tomato object detection while enhancing performance in identifying obscured leaves. Regarding YOLOv11, scholars have introduced numerous research improvements: Tang et al. [25] developed the lightweight YOLOv11-AIU model for tomato early blight grading; Fang et al. [26] enhanced rice disease detection capabilities by utilizing CARAFE upsampling to focus on detailed features; Zhang et al. [27] developed YOLO11-Pear, which can be used for pear detection in complex orchards; and Kutyrev et al. [28] optimized YOLO11x for deployment on UAVs for apple counting.
In non-agricultural scenarios, lightweight models from the YOLO series are also widely adopted, and such cross-domain innovations provide valuable insights for developing agricultural models. For instance, Peng et al. [29] proposed TD-YOLOA, which is specifically designed for tire defect detection. This method effectively extracts prominent structural defect features, but it is less suitable for agricultural targets such as cotton field weeds, because weeds exhibit extremely irregular morphological variations, severe occlusion, and blurred boundaries in complex field environments. Jin et al. [30] developed an enhanced YOLOv11m model for defect detection in high-speed rail overhead contact systems; yet this model is optimized for infrastructure inspection rather than detecting small agricultural targets, such as weeds or lesions. He et al. [31] proposed a multi-scale, multi-class object detection model for complex high-resolution remote sensing imagery; as remote sensing technology is highly adaptable to agricultural contexts, it offers innovative perspectives for UAV monitoring. Ahmed and El-Sheimy [32] fused a YOLOv11 model to enhance the stability of continuous tracking in drone videos. Nevertheless, their approach is suited for real-time detection in general visual tasks and is not specifically designed for monitoring agricultural dynamics, such as crop growth and weed infestation. Chen et al. [33] introduced a dual-path instance segmentation network for rice detection to achieve lightweight processing and edge deployment. However, their method is primarily effective for structured crop environments and may struggle with targets characterized by irregular shapes, blurred edges, or significant overlapping.
Although significant progress has been made in plant and weed detection using YOLO-based methods, most of these approaches still face challenges such as difficult feature extraction, weak robustness to interference, high computational complexity, difficult edge deployment, and poor generalizability. Moreover, the limited variety of weeds that can be detected indicates room for improvement in model accuracy and applicability. To address these issues, and unlike previous approaches that often optimize isolated components while neglecting the potential for holistic network innovation, this study proposes a Quadruple Synergistic Lightweight Perception Mechanism (QSLPM). This mechanism integrates Slimneck lightweight neck reconstruction, ADown efficient spatial downsampling, SEAM semantic attention guidance, and SIoU angle-aware geometric regression. QSLPM emphasizes synergistic interactions between modules beyond isolated enhancements, achieving significant compression of redundant features and computational load while simultaneously boosting feature sensitivity and regression accuracy. This synergy enables the model to perform fine-grained detection of densely clustered, occluded, or morphologically similar weeds in real-world cotton field environments. The AVGS-YOLO model was developed based on QSLPM and strikes an optimal balance between detection accuracy and computational efficiency, demonstrating strong real-time deployment capability in complex cotton field settings.
Building upon this foundation, this study has developed a lightweight, practical AVGS-YOLO model that achieves highly efficient weed recognition with minimal computational resource consumption. It is primarily suited for real-time detection scenarios of cotton weeds in agricultural edge computing environments. Currently, this model is capable of analyzing cotton weed image data in the field. In the future, as smart agriculture continues to advance, this model may be extended to mobile intelligent terminal platforms, providing agricultural practitioners with a more convenient weed identification and detection tool, thereby further promoting the precision management of weeds in cotton fields. Based on the YOLOv11n model, this study proposes the following key technologies and methods:
(a) This study introduces a targeted architectural optimization named Quadruple Synergistic Lightweight Perception Mechanism (QSLPM). This model employs Slimneck neck structure reorganization to eliminate feature redundancy, utilizes ADown for efficient downsampling to suppress background noise, integrates a new detection head, Detect_SEAM, with embedded SEAM attention for the precise capture of subtle features, and introduces the SIoU loss function to enable angle-aware precise regression. This mechanism provides an effective technical perspective for constructing “high-precision, low-computational-power” detectors in practical agricultural scenarios through the deep synergy between feature extraction and spatial localization.
(b) A high-quality cotton field weed dataset has been constructed. Unlike existing datasets based on simple binary classification or laboratory settings, this dataset explicitly classifies 12 weed species, features more realistic complex backgrounds, and demonstrates robust performance in dense weed growth. It provides a foundation for evaluating the robustness and fine-grained classification capabilities of deep learning-based detection models.
(c) Comprehensive ablation experiments and comparative experiments were conducted to verify the accuracy and generalization ability of the proposed AVGS-YOLO model. The results demonstrate an optimal balance between inference efficiency and detection accuracy, significantly outperforming existing lightweight mainstream detection models.
(d) Gradient-Weighted Class Activation Map++ (Grad-CAM++) heatmaps were employed to visualize key feature regions, intuitively demonstrating the model’s ability to focus on challenging samples within complex backgrounds [34].
In summary, this study introduces the Quadruple Synergistic Lightweight Perception Mechanism (QSLPM). Unlike recent agricultural detection studies that primarily focus on replacing backbones or simply stacking attention mechanisms, our approach represents a systematic integration strategy specifically designed to address high-density and noisy agricultural environments. This mechanism synergistically integrates lightweight neck feature reorganization, spatial downsampling, semantic attention enhancement, and angular geometric regression, offering a robust approach for constructing high-performance, low-complexity detection models. Detailed descriptions of these techniques and methods are presented in Section 2 of this paper. Experimental data and evaluation metrics are discussed in Section 3. Section 4 presents the experimental results and discussion, while the conclusions are presented in Section 5. These sections aim to provide readers with a comprehensive and clear understanding of the research findings.
4. Results and Discussion
4.1. Comparative Analysis of Improved Module Performance
To rigorously validate the effectiveness of the improved algorithms relative to the original algorithm, this study designed three distinct network configurations and conducted comparative ablation experiments. Using the original YOLOv11n model as the baseline, the Slimneck module, SEAM, and ADown module were sequentially integrated to form improved networks for experimentation. All experiments employed the same cotton weed dataset, batch size, and training cycle. The results of the ablation experiments are summarized in Table 3.
The experimental data clearly demonstrate the effectiveness of each improved network. Incorporating the Slimneck network (Model 2) increased precision to 96.0% and slightly improved mAP to 96.9%. This indicates that the Slimneck architecture effectively enhances the quality of feature extraction, maintaining and improving precision while slightly reducing the parameter count. Concurrently, integrating the detection head fused with the SEAM significantly contributes to lightweighting, reducing GFLOPs by 20.6% while effectively boosting recall and mAP. This demonstrates that the SEAM attention mechanism effectively enhances the model’s ability to capture latent features, reducing missed detections.
Building upon Model 2, the SEAM was introduced to form Model 5, achieving the highest precision of 96.6% (+1.2%). Subsequently, the ADown downsampling module was incorporated to create the final model. Further analysis reveals that Model 6 shares an mAP50 of 97.7% with Model 3, indicating potential redundancy within the models. A detailed comparison between Models 3 and 6 reveals that Model 6’s precision improved from 94.9% to 96.5%, but its recall decreased from 93.5% to 92.9%. This indicates that while ADown downsampling suppresses background noise, it lacks a dedicated feature retention mechanism, leading to the loss of fine-grained details of tiny weeds. Consequently, this redundancy precisely highlights the importance of the Slimneck module. The final Model 8 demonstrates that combining Slimneck with SEAM and ADown significantly reduces computational complexity while maintaining high precision and recall. This approach effectively eliminates redundant information while preserving key features. The overall network improvement further reduces the computational load while maintaining a high mAP.
Ultimately, the overall model performance improved from Model 1 to Model 8. Exceptional results were achieved across all evaluation metrics: precision reached 95.9%, recall reached 94.2%, and mAP50 reached 98.2%, with specific improvements of +0.5%, +2.0%, and +1.8%, respectively. This demonstrates the enhanced AVGS-YOLO model’s outstanding performance in reducing false negatives. This ablation experiment demonstrates that the different modules do not merely provide incremental improvements but achieve synergistic gains (1 + 1 > 2). By eliminating texture redundancy and suppressing background noise, the Slimneck and ADown modules clear obstacles for the Detect_SEAM module, enabling the detector to focus more effectively on the irregular contours of weeds without interference. Beyond enhancing accuracy, the model also achieves significant lightweighting: the parameter count decreased by approximately 17.4%, the computational load (GFLOPs) was reduced by about 27%, and the model size shrank from 5.5 MB to 4.7 MB. This makes the model lighter and faster, with the balance of accuracy and efficiency confirming the synergistic interaction of the improved model. Overall, the AVGS-YOLO enhancement strategy effectively resolves the traditional trade-off between high accuracy and low computational demands, substantially boosting the model’s deployment potential on agricultural edge devices and its real-time detection capabilities.
A comparison of the precision–recall curves for the original YOLOv11n and improved AVGS-YOLO models (Figure 13) reveals that the enhanced AVGS-YOLO network achieves significantly improved recognition performance for 12 cotton weed species. It successfully addresses the “weak categories” of the original model, such as the challenging Goosegrass and Carpetweed, with AP values increasing by 1.7% and 1.6%, respectively. This greatly alleviates the detection imbalance between categories.
The shape of the curves shows that the improved model’s PR curve remains stable in the high-recall range (recall > 0.9), indicating fewer false negatives alongside fewer false positives and greater robustness. Furthermore, accuracy improved by 2.0% for weeds with high feature similarity, such as Palmer Amaranth, indicating an enhanced capability to extract fine-grained features. The AVGS-YOLO model not only raises peak detection accuracy but also improves generalization on complex, difficult-to-classify samples, making it better suited for practical cotton field operations.
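The per-class AP values underlying these PR curves follow the standard area-under-curve computation. The sketch below (illustrative only, not the paper’s evaluation code) shows how AP can be computed from confidence-ranked detections with all-point interpolation, as used for mAP50; `tp` marks whether each ranked detection matched a ground-truth box at IoU ≥ 0.5.

```python
import numpy as np

def average_precision(tp, n_gt):
    """Compute AP from detections sorted by descending confidence.

    tp   : binary sequence, 1 if the i-th ranked detection matches a
           ground-truth box (IoU >= 0.5 for mAP50), else 0.
    n_gt : total number of ground-truth boxes for this class.
    """
    tp = np.asarray(tp, dtype=float)
    fp = 1.0 - tp
    cum_tp = np.cumsum(tp)
    cum_fp = np.cumsum(fp)
    recall = cum_tp / n_gt
    precision = cum_tp / (cum_tp + cum_fp)

    # Append sentinels, then make precision monotonically decreasing
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([1.0], precision, [0.0]))
    p = np.maximum.accumulate(p[::-1])[::-1]

    # Area under the interpolated PR curve
    idx = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

# A perfect ranking over 3 ground-truth boxes yields AP = 1.0
print(average_precision([1, 1, 1], 3))
```

mAP50 is then the mean of these per-class AP values over the 12 weed categories.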
The comparison of the confusion matrices before and after improvement (Figure 14) intuitively demonstrates the enhanced model’s breakthrough in reducing inter-class confusion. The most significant improvement lies in identifying difficult-to-classify samples, compensating for the original model’s deficiency in feature extraction for this category. Simultaneously, the recognition rate for Cutleaf reached 1.00, while other categories also showed steady improvement. This demonstrates that the enhanced AVGS-YOLO network possesses stronger feature discrimination capabilities, effectively distinguishing visually similar weeds. Observing the off-diagonal regions of the matrix reveals a marked reduction in misclassification noise. This significantly enhances the fine-grained classification accuracy for specific weed species and substantially lowers the risk of misclassification during field operations.
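The diagonal values discussed above (e.g., 1.00 for Cutleaf) are per-class recognition rates obtained by normalizing each row of the confusion matrix, while the off-diagonal mass quantifies the “misclassification noise.” A minimal sketch with a hypothetical 3-class matrix (not the paper’s actual counts):

```python
import numpy as np

# Hypothetical confusion matrix: rows = true class, cols = predicted class
cm = np.array([
    [50,  0,  0],   # class 0: all 50 samples correctly predicted
    [ 3, 45,  2],   # class 1: 45 of 50 correct
    [ 1,  4, 45],   # class 2: 45 of 50 correct
])

# Per-class recognition rate (recall): diagonal / row sum
recall = np.diag(cm) / cm.sum(axis=1)

# Off-diagonal mass summarizes inter-class confusion
confusion_mass = (cm.sum() - np.trace(cm)) / cm.sum()

print(recall)          # [1.  0.9 0.9]
print(confusion_mass)
```

A reduction in `confusion_mass` between the baseline and improved matrices corresponds directly to the cleaner off-diagonal regions visible in Figure 14.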
4.2. Comparative Experiments of Different Classic Models
To systematically assess performance disparities across models, diagnose their underlying bottlenecks, and validate the correctness of our proposed optimization direction, this study compares AVGS-YOLO with mainstream object detection networks, including RT-DETR [53], YOLOv5n [54], YOLOv8n [55], YOLOv10n, and YOLOv11n [56]. The experimental results of this comparison, which validate the model’s superior performance in cotton weed detection, are presented in Table 4.
Table 4 presents a comparative analysis of AVGS-YOLO in terms of object detection performance. Compared to YOLOv11n, AVGS-YOLO reduces the number of parameters by 17.4% and decreases GFLOPs by 20%. Due to its lightweight architecture, the model size is decreased by 14.5%. In the cotton weed detection task, it demonstrates higher detection accuracy compared to models such as YOLOv8n and RT-DETR. Its precision of 95.9% and recall of 94.2% demonstrate the model’s effectiveness in minimizing false positives and false negatives, striking an optimal balance between detection performance and computational efficiency.
To further elucidate the performance trends of AVGS-YOLO, the training curves are visualized in Figure 15. For all models, both precision and mAP50 increase rapidly within the first 100 epochs and then plateau. Comparing the training trajectories of the improved network with those of other classic networks shows that, in terms of the core evaluation metrics of precision and mAP50 for object detection tasks, the AVGS-YOLO model outperforms mainstream algorithms such as RT-DETR, YOLOv5n, YOLOv8n, YOLOv10n, and YOLOv11n.
To qualitatively evaluate the detection performance of the proposed AVGS-YOLO model, Figure 16 presents the detection results under four scenarios: a single weed against a simple background, a single weed against a complex background, multiple weeds in a simple background, and multiple weeds in a complex background. The complex background specifically includes severe noise interference, such as soil cracks resembling linear geometry and withered weeds resembling intricate textures.
The detection results from RT-DETR indicate that while the model performs well on certain weeds, its recognition accuracy fluctuates and some bounding boxes are imprecise. Compared to RT-DETR, YOLOv5n, YOLOv8n, and YOLOv10n achieve higher detection accuracy for prominent weeds. However, as the figure shows, failure cases remain when these models handle weeds of various morphologies (columns C and D): omissions and misclassifications still occur, and these failure cases are also clearly visible in the Grad-CAM++ heatmaps. YOLOv11n achieves precise localization for various cotton weed types, enabling a close match between prediction boxes and weed areas; however, its prediction accuracy remains unstable, and false negatives persist. The AVGS-YOLO model demonstrates superior instance separation capabilities, significantly reducing false negatives in dense target environments while providing precise bounding box predictions, which is particularly evident in challenging scenarios involving multiple weeds. The visualization results in Figure 17 validate the quantitative metrics, further revealing the AVGS-YOLO model’s exceptional discriminative ability when handling complex challenges such as high occlusion rates and strong inter-class similarity. Specifically, the AVGS-YOLO model demonstrated excellent robustness, effectively mitigating interference from soil cracks and overcoming the risk of false detections caused by withered weeds.
These results collectively demonstrate that the enhanced AVGS-YOLO model can more effectively extract key features of cotton weeds while reducing interference from background elements such as soil cracks and withered weeds. It significantly reduces both false positive and false negative rates while maintaining low computational complexity and a relatively compact size. This makes the improved AVGS-YOLO model well-suited for efficient deployment on resource-constrained edge devices.
4.3. Comparative Experiments with the Latest Models
To further verify the performance of the AVGS-YOLO model, this paper compares it with the latest iterations of the YOLO algorithm (YOLOv12n, YOLOv13n, and YOLO26n) as well as the Transformer-based models RF-DETR, RT-DETRv4-S, and DEIMv2-N, in order to quantify the performance of the AVGS-YOLO model in cotton weed detection tasks. The specific comparison results are shown in the table below.
As shown in Table 5, although models with the Transformer architecture perform well on general datasets, their performance on the cotton weed dataset presents a notable trade-off between accuracy and computational cost. Specifically, the RT-DETRv4-S model achieves a strong mAP50 of 95.7%, but its GFLOPs reach as high as 24.9, suggesting potential computational redundancy in this specific application. Conversely, the DEIMv2-N model has lower GFLOPs of 10.8, but its mAP50 also decreases to 89.8%, indicating that while achieving model lightweighting, it did not fully maintain model accuracy. Compared to Transformer-based models, the YOLO series’ nano models demonstrate a more balanced performance in this task, maintaining a stable mAP while keeping model parameters and GFLOPs low. Comparing the metrics of YOLOv12n, YOLOv13n, and the newly proposed YOLO26n shows that the AVGS-YOLO model constitutes an effective improvement.
4.4. Performance Analysis of Different Loss Functions
The improved AVGS-YOLO model in this study employs the SIoU loss function. To comprehensively evaluate the performance of the SIoU-enhanced YOLOv11n model, we conducted comparative experiments using six loss functions: YOLOv11n + SIoU, YOLOv11n + GIoU, YOLOv11n + EIoU, YOLOv11n + DIoU, YOLOv11n + SlideLoss, and YOLOv11n + FocalLoss. The objective is to analyze the performance differences among these loss functions for cotton weed detection tasks. As shown in Table 6, SIoU demonstrates significant improvements over commonly used loss functions such as GIoU [57], EIoU [58], DIoU [59], SlideLoss [60], and FocalLoss [61]. It achieves optimal precision and recall while maintaining a high mAP. This demonstrates that the SIoU loss function comprehensively optimizes the YOLOv11n model, achieving the best overall balance in target localization performance.
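For reference, the structure of the SIoU loss can be sketched as follows. This is an illustrative reimplementation of the published SIoU formulation (angle, distance, and shape costs added on top of the IoU term), not the exact code used in training; boxes are assumed to be in (x1, y1, x2, y2) format and `theta` is the shape-cost exponent.

```python
import math

def siou_loss(box1, box2, theta=4.0, eps=1e-9):
    """Illustrative SIoU loss for two boxes in (x1, y1, x2, y2) format."""
    (b1x1, b1y1, b1x2, b1y2), (b2x1, b2y1, b2x2, b2y2) = box1, box2
    w1, h1 = b1x2 - b1x1, b1y2 - b1y1
    w2, h2 = b2x2 - b2x1, b2y2 - b2y1

    # IoU term
    iw = max(0.0, min(b1x2, b2x2) - max(b1x1, b2x1))
    ih = max(0.0, min(b1y2, b2y2) - max(b1y1, b2y1))
    inter = iw * ih
    iou = inter / (w1 * h1 + w2 * h2 - inter + eps)

    # Smallest enclosing box
    cw = max(b1x2, b2x2) - min(b1x1, b2x1)
    ch = max(b1y2, b2y2) - min(b1y1, b2y1)

    # Angle cost: peaks when the center offset is at 45 degrees
    s_cw = (b2x1 + b2x2 - b1x1 - b1x2) / 2
    s_ch = (b2y1 + b2y2 - b1y1 - b1y2) / 2
    sigma = math.hypot(s_cw, s_ch) + eps
    sin_alpha = abs(s_ch) / sigma
    angle = 1 - 2 * math.sin(math.asin(sin_alpha) - math.pi / 4) ** 2

    # Distance cost, modulated by the angle cost
    gamma = 2 - angle
    rho_x = (s_cw / (cw + eps)) ** 2
    rho_y = (s_ch / (ch + eps)) ** 2
    dist = (1 - math.exp(-gamma * rho_x)) + (1 - math.exp(-gamma * rho_y))

    # Shape cost: penalizes aspect mismatch between the two boxes
    om_w = abs(w1 - w2) / max(w1, w2)
    om_h = abs(h1 - h2) / max(h1, h2)
    shape = (1 - math.exp(-om_w)) ** theta + (1 - math.exp(-om_h)) ** theta

    return 1 - iou + (dist + shape) / 2
```

The angle term shrinks the distance penalty when the predicted center is nearly axis-aligned with the target, which is the "angle-aware" property exploited by QSLPM for precise regression.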
4.5. Performance Analysis of Different Detection Heads
This study adopts YOLOv11n as the baseline model and integrates six commonly used detection heads, namely Detect_DyHead [62], Detect_Efficient [63], Detect_MultiSEAM [64], Detect_CBAM [65], Detect_ECA [66], and Detect_SEAM, to construct six variants: YOLO + Detect_DyHead, YOLO + Detect_Efficient, YOLO + Detect_MultiSEAM, YOLO + Detect_CBAM, YOLO + Detect_ECA, and YOLO + Detect_SEAM. All experiments were conducted using the same cotton weed dataset, batch size, and training cycles. The results are presented in Table 7. The experimental results demonstrate that YOLO + Detect_SEAM comprehensively outperforms the other five models in terms of precision, recall, mAP50, and mAP50-95. Specifically, YOLO + Detect_SEAM achieved 94.9% precision, 93.5% recall, 97.7% mAP50, and 91.3% mAP50-95, while maintaining a compact model size of 4.6 MB, a low parameter count of 2.14 million, and a low computational complexity of 5.0 GFLOPs.
Compared to classical lightweight attention mechanisms, Detect_CBAM and Detect_ECA achieved mAP50 values of 93.4% and 92.8%, respectively, indicating a notable performance gap relative to the SEAM selected in this study. A deeper analysis reveals that this disparity stems from the specific visual characteristics of the agricultural scenario. CBAM employs a serial channel-spatial attention mechanism incorporating global pooling, which causes the loss of spatial details of small weeds. Meanwhile, the high color similarity between cotton and weeds makes it difficult for the ECA attention mechanism, which relies on channel attention, to extract discriminative features. In contrast, the SEAM employed in this study excels at preserving spatial coherence and local structural integrity, enabling Detect_SEAM to more effectively capture the characteristics of different weeds. In comparison with other advanced detection heads, Detect_DyHead achieved precision, recall, mAP50, and mAP50-95 of 94.7%, 92.3%, 97.2%, and 88.6%; Detect_Efficient achieved 93.8%, 92.4%, 97.0%, and 86.6%; and Detect_MultiSEAM achieved 94.3%, 92.1%, 96.5%, and 88.5%, respectively. These results clearly demonstrate that embedding SEAM attention into the detection head successfully achieves a significant reduction in model complexity while substantially improving detection performance, overcoming the common trade-off between accuracy and computational cost typically found in other mechanisms.
4.6. Comparative Analysis with Existing Methods
To further validate the effectiveness of the improved AVGS-YOLO model, this study selected recently published cotton weed detection models and compared their results, as shown in Table 8. The table compares key parameters, including precision, recall, mAP, model size, number of parameters, and GFLOPs.
It can be observed that Das et al. [67], utilizing drone-acquired cotton field weed data, achieved an mAP50 of 88%, precision of 87%, and recall of 78% on YOLOv7. These detection performances are all lower than those of the AVGS-YOLO model developed in this study. Wang et al. (2025) [68] integrated DS_HGNetV2, BiFPN, and LiteDetect modules to propose the YOLO-Weed Nano model, achieving significant lightweighting improvements. However, its mAP50 still has considerable room for enhancement. In this regard, our study strikes a relative balance between accuracy and efficiency. In another study, Zheng et al. [69] proposed an enhanced YOLO-WL model, reducing its size to 4.6 MB while achieving 92.3% mAP50. Overall, it demonstrates notable lightweighting achievements, though detection accuracy could be further improved. Karim et al. [70] introduced an automated cotton weed targeting system using an improved lightweight YOLOv8 on edge platforms. While this approach demonstrates the potential of automated detection, there remains room for enhancement in terms of model lightweighting and mAP50.
Compared to these models, the proposed AVGS-YOLO achieves a high mAP50 of 98.2% while maintaining a compact 4.7 MB model size, 2.16 million parameters, and 4.6 GFLOPs. These notable results highlight the effectiveness of the proposed Quadruple Synergistic Lightweight Perception Mechanism (QSLPM), enabling improvements in accuracy, efficiency, and lightweight performance. This comprehensive performance enhancement makes the AVGS-YOLO model more suitable for deployment on resource-constrained edge devices.
4.7. Heatmap Analysis Detection
Heatmap analysis plays a crucial role in visualizing object detection models, particularly in complex agricultural vision tasks such as cotton weed identification. Grad-CAM++ is a gradient-based visualization method that generates class-discriminative spatial attention heatmaps by computing gradient-weighted linear combinations of a network’s convolutional feature maps. It provides detailed visualization for cotton weed identification images.
Grad-CAM++ is an enhanced version of Grad-CAM that can generate more detailed and focused heatmaps. Compared to Grad-CAM, Grad-CAM++ incorporates higher-order derivative information into the weight calculation process, allowing it to discriminate at a finer granularity and thereby highlight the image regions that have a critical impact on detection results. This enhances the interpretability and trustworthiness of the cotton weed detection system. Unlike standard classification, Grad-CAM++ visualization in this study is generated by backpropagating the gradients of specific target scores within the detection head. It not only shows classification information but also clearly reflects the spatial features that contribute the most to the model’s detection confidence. Grad-CAM++ visually reveals the varying contributions of different image regions to the model’s prediction: warmer colors on the heatmap indicate higher contributions, while cooler colors denote lower contributions. This study analyzed the detection heatmaps of the YOLOv11n and AVGS-YOLO models using Grad-CAM++, with the results presented in Figure 17.
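Under the common approximation that the higher-order derivatives of an exponentially activated score reduce to powers of the first-order gradient, the Grad-CAM++ weighting can be sketched with plain arrays as follows. This is illustrative only; the heatmaps in this study were generated from the detection head’s target scores of the actual trained models.

```python
import numpy as np

def grad_cam_pp(activations, grads, eps=1e-8):
    """Grad-CAM++ heatmap from a conv layer's activations and gradients.

    activations, grads : arrays of shape (K, H, W) for K feature maps.
    Uses the standard approximation d^nY/dA^n ~ grads**n for an
    exponentially activated score.
    """
    g2, g3 = grads ** 2, grads ** 3
    # Per-location alpha coefficients of Grad-CAM++
    denom = 2 * g2 + np.sum(activations * g3, axis=(1, 2), keepdims=True)
    alpha = g2 / (denom + eps)
    # Channel weights: alpha-weighted sum of positive gradients
    weights = np.sum(alpha * np.maximum(grads, 0), axis=(1, 2))
    # Weighted combination of feature maps, then ReLU and normalize to [0, 1]
    cam = np.maximum(np.tensordot(weights, activations, axes=1), 0)
    return cam / (cam.max() + eps)

# Toy example: 4 feature maps of size 8x8 with random values
rng = np.random.default_rng(0)
acts = np.abs(rng.normal(size=(4, 8, 8)))
grads = rng.normal(size=(4, 8, 8))
heatmap = grad_cam_pp(acts, grads)
print(heatmap.shape)  # (8, 8)
```

The normalized `heatmap` is what gets color-mapped and overlaid on the input image, with values near 1 rendered in warm colors.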
The comparison in Figure 17 displays the heatmaps generated by the YOLOv11n and AVGS-YOLO models for the cotton weed detection task, revealing differences in their recognition capabilities. From left to right, each column of images represents four scenarios: a single weed against a simple background, a single weed in a complex background, multiple weeds in a simple background, and multiple weeds against a complex background. The heatmap comparison reveals that YOLOv11n’s heatmaps are more dispersed, with some coverage extending beyond the target area into background regions, particularly along weed edges. This indicates that YOLOv11n’s feature extraction is more sensitive to background noise. In multi-target scenarios, the heatmap of YOLOv11n fails to clearly focus attention on each individual weed, instead also attending to background areas beyond the target and displaying broader, sometimes blurry activation regions. This intuitively reflects the attention drift in failure cases. Compared with YOLOv11n, the heatmap of AVGS-YOLO is more concentrated and compact, with high-activation areas accurately covering the main body of the weeds and with clearer boundaries. This compactness is not merely a visual property; it corresponds closely to the statistical improvements in the quantitative metrics reported earlier. This demonstrates that the AVGS-YOLO model effectively suppresses background noise, reduces false negatives and false positives, and better focuses on the critical detection regions.
4.8. Generalization Performance on Standard Benchmarks
MS COCO (Microsoft Common Objects in Context) is recognized as one of the most influential large-scale benchmark datasets in computer vision. It contains 80 object categories from everyday scenes, such as people, vehicles, and furniture. The 2017 version includes 118,287 training images (train), 5000 validation images (val), and 40,670 test images (test). Owing to its high scene complexity and object richness, MS COCO is generally regarded as a standard benchmark for evaluating a model’s generalization, robustness, and detection capability. To assess the generalization ability of the AVGS-YOLO model, we conducted benchmark testing on the MS COCO 2017 dataset and compared the results with those obtained by YOLOv11n on the same dataset. Considering computational resource constraints, and to ensure fairness and direct comparability, AVGS-YOLO and the baseline YOLOv11n were evaluated under exactly the same experimental conditions: both were trained for 100 epochs with identical hyperparameters and data augmentation strategies. The specific comparison results are shown in
Table 9.
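The identical-settings comparison protocol above can be sketched with the Ultralytics API. This is an illustrative sketch, not the authors’ exact script: the weight filenames are hypothetical placeholders, and it assumes the standard `coco.yaml` dataset configuration shipped with Ultralytics. The key point is that both checkpoints are validated with the same call, so the mAP figures are directly comparable.

```python
def evaluate_on_coco(weights: str, imgsz: int = 640):
    """Validate a trained checkpoint on the MS COCO 2017 val split and
    return (mAP50, mAP50-95). Import is deferred so the sketch only
    requires the ultralytics package when actually called."""
    from ultralytics import YOLO

    model = YOLO(weights)
    metrics = model.val(data="coco.yaml", imgsz=imgsz)
    return metrics.box.map50, metrics.box.map

# Identical call for both checkpoints keeps the comparison fair
# (filenames are placeholders):
#   evaluate_on_coco("yolov11n.pt")
#   evaluate_on_coco("avgs_yolo.pt")
```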
As shown in
Table 9, AVGS-YOLO achieved an mAP50 of 50.1% and an mAP50-95 of 35.4%, both slightly higher than the corresponding metrics of the YOLOv11n baseline. This suggests that AVGS-YOLO possesses stable generalization ability across diverse scenarios. While maintaining competitive accuracy, the model also demonstrates a more efficient lightweight design, with the number of parameters reduced by about 17.4% and the computational load (GFLOPs) reduced by about 27.0% compared to YOLOv11n. The results on the MS COCO dataset indicate that AVGS-YOLO maintains detection performance on par with mainstream baseline models in general scenarios.
5. Conclusions
This study addresses critical issues in cotton weed detection, including the morphological diversity of weeds, complex field backgrounds, and the large parameter counts and model sizes of conventional detectors, and proposes a lightweight cotton weed detection model, AVGS-YOLO. Systematic ablation and comparative experiments demonstrate that the model achieves an effective balance between detection accuracy and computational efficiency, driven by targeted structural optimization rather than random fluctuation.
The core of this model is the Quaternary Synergistic Lightweight Perception Mechanism (QSLPM), which reorganizes the Slimneck architecture to eliminate the feature redundancy caused by the high similarity of weed textures, uses ADown for efficient downsampling that suppresses background noise such as soil cracks, and introduces a new detection head, Detect_SEAM, with an embedded SEAM attention mechanism for precise capture of irregular weed contours. In addition, the SIoU loss function is adopted to achieve accurate angle-aware regression under weed overlap. Experimental results show that, compared with the baseline YOLOv11n, AVGS-YOLO improves precision by 0.5%, recall by 2%, mAP50 by 1.8%, and mAP50-95 by 5.8%, reaching 95.9%, 94.2%, 98.2%, and 93.3%, respectively. The model is also significantly lighter, with parameters reduced by 17.4%, computational cost (GFLOPs) reduced by 27%, and model size reduced from 5.5 MB to 4.7 MB. The simultaneous improvement in performance and reduction in model complexity realizes the true synergistic gain proposed in this study (1 + 1 > 2).
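The relative reductions quoted above follow from a simple ratio; as a small sanity check under the reported figures, the model-size cut from 5.5 MB to 4.7 MB works out to roughly 14.5% (the 17.4% and 27% figures apply to parameter count and GFLOPs, respectively).

```python
def pct_reduction(baseline: float, improved: float) -> float:
    """Relative reduction of a cost metric, in percent."""
    return (baseline - improved) / baseline * 100.0

# Model size reported above: 5.5 MB -> 4.7 MB
size_cut = pct_reduction(5.5, 4.7)  # roughly 14.5%
```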
Although this study has achieved good results, it has certain limitations. The current dataset was collected from a single geographical location, which limits the geographical generalization of the proposed model to different soil types and regional weed variants. Furthermore, the model has so far been evaluated only on high-performance workstations and, due to practical constraints, has not yet been tested on edge devices. Our future work will therefore focus on addressing these limitations: (1) expanding the dataset by adding data collection locations to further validate the model’s adaptability in real cotton field environments; (2) deploying the model on physical edge devices, such as Jetson Nano, Xavier, and Raspberry Pi, to verify its actual performance in real scenarios; and (3) benchmarking against newly released architectures such as YOLO26n and DEIMv2-N to continue pushing the performance limits of the model.