1. Introduction
Cotton is one of the most important cash crops worldwide, playing a crucial role in maintaining agricultural stability and meeting global fiber demand [1,2,3]. Owing to its unique breathability and softness, cotton generates substantial economic benefits for farmers, textile workers, and other industry professionals. However, the proliferation of weeds in cotton fields severely hampers the development of the cotton industry. These weeds engage in continuous and intense interspecific competition throughout the entire cotton growth cycle [4], depriving cotton plants of critical resources such as water, fertilizer, light, and heat. This resource stress results in stunted growth, reduced yields, and deteriorated fiber quality. Moreover, weeds serve as intermediate hosts for numerous cotton pests and diseases, disrupting the ecological equilibrium of cotton fields [5,6]. Statistics indicate that without effective weed management, cotton yield losses can reach 30% to 90% [7]. Current control measures typically rely on extensive application of chemical herbicides or manual weeding [8]. However, excessive herbicide use triggers ecological issues, including enhanced weed resistance and soil residue contamination. Furthermore, traditional weed diagnosis relies on manual observation, which is inefficient, costly, and incapable of providing the real-time accuracy required to safeguard cotton health [9,10].
To address the challenges posed by weeds, cotton weed detection technologies utilizing computer vision, pattern recognition, and machine learning have emerged, aiming to achieve automated and precise identification of weeds in the field [11,12,13]. In recent years, rapid advancements in computing power and artificial intelligence, coupled with the development of deep learning, have significantly transformed traditional image processing techniques, substantially enhancing computer vision capabilities in classification and object recognition. For instance, feature extraction using classic backbone networks such as Residual Network (ResNet) [14] and GoogLeNet [15] has markedly improved recognition accuracy. Specifically, researchers such as Peteinatos et al. achieved over 90% accuracy in identifying weed species by training deep models like VGG16 and ResNet-50 on large-scale datasets [16]. Notably, the You Only Look Once (YOLO) [17] series of object detection methods has demonstrated immense potential for real-time target detection [18,19].
In the current agricultural context, single-stage object detection algorithms, typified by the YOLO series, are widely used because they combine fast inference with high detection accuracy, achieving a favorable balance between the two. To address the computational constraints of agricultural edge devices, researchers both domestically and internationally have explored extensive innovations in module optimization, attention mechanism refinement, and model lightweighting. Early studies spanning YOLOv4 to YOLOv7 focused on architectural enhancements to refine detection frameworks; for instance, techniques such as CBAM, E-ELAN, and advanced data augmentation strategies were employed to strengthen feature perception for crop detection tasks. Shoaib et al. [20] substantially improved the accuracy of sugar beet weed identification by integrating pixel-level image synthesis, Transformer modules, and the SAFF adaptive spatial feature fusion module. The release of YOLOv8 further accelerated research into crop detection within the academic community, and researchers have implemented various refinements for more precise detection in numerous precision agriculture scenarios. For instance, Wang et al. [21] enhanced multi-scale feature fusion by incorporating Vision Transformers (ViTs) and a Weighted Bidirectional Feature Pyramid Network (BiFPN), thereby effectively improving weed recognition performance in wheat fields.
Although the accuracy of these models has been greatly improved, the varying shapes and sizes of weeds require precise identification and differentiation of weed types. This diversity necessitates that models possess the capability to accurately recognize distinct edge features [22]. Consequently, recent research on YOLOv10 and YOLOv11 has focused on optimizing models for practical deployment, exploring the balance between high efficiency and high precision. Specifically, Wang et al. [23] proposed an end-to-end real-time monitoring optimization scheme, introducing an NMS-free training strategy to reduce inference time. Li et al. [24] developed the D3-YOLOv10 model, achieving lightweight tomato object detection while enhancing performance in identifying obscured leaves. Regarding YOLOv11, scholars have introduced numerous research improvements: Tang et al. [25] developed the lightweight YOLOv11-AIU model for tomato early blight grading; Fang et al. [26] enhanced rice disease detection capabilities by utilizing CARAFE upsampling to focus on detailed features; Zhang et al. [27] developed YOLO11-Pear, which can be used for pear detection in complex orchards; and Kutyrev et al. [28] optimized YOLO11x for deployment on UAVs for apple counting.
In non-agricultural scenarios, lightweight models from the YOLO series are also widely adopted, and such cross-domain innovations provide valuable insights for developing agricultural models. For instance, Peng et al. [29] proposed TD-YOLOA, which is specifically designed for tire defect detection. This method effectively extracts prominent structural defect features, but it is less suitable for agricultural targets such as cotton field weeds, because weeds exhibit extremely irregular morphological variations, severe occlusion, and blurred boundaries in complex field environments. Jin et al. [30] developed an enhanced YOLOv11m model for defect detection in high-speed rail overhead contact systems; yet this model is optimized for infrastructure inspection rather than detecting small agricultural targets, such as weeds or lesions. He et al. [31] proposed a multi-scale, multi-class object detection model for complex high-resolution remote sensing imagery; as remote sensing technology is highly adaptable to agricultural contexts, it offers innovative perspectives for UAV monitoring. Ahmed and El-Sheimy [32] fused a YOLOv11 model to enhance the stability of continuous tracking in drone videos. Nevertheless, their approach is suited for real-time detection in general visual tasks and is not specifically designed for monitoring agricultural dynamics, such as crop growth and weed infestation. Chen et al. [33] introduced a dual-path instance segmentation network for rice detection to achieve lightweight processing and edge deployment. However, their method is primarily effective for structured crop environments and may struggle with targets characterized by irregular shapes, blurred edges, or significant overlapping.
Although significant progress has been made in plant and weed detection using YOLO-based methods, most of these approaches still face challenges such as difficult feature extraction, weak robustness to interference, high computational complexity, difficult edge deployment, and poor generalizability. Moreover, the limited variety of weeds that can be detected indicates room for improvement in model accuracy and applicability. To address these issues, and unlike previous approaches that often optimize isolated components while neglecting the potential for holistic network innovation, this study proposes a Quadruple Synergistic Lightweight Perception Mechanism (QSLPM). This mechanism integrates Slimneck lightweight neck reconstruction, ADown efficient spatial downsampling, SEAM semantic attention guidance, and SIoU angle-aware geometric regression. QSLPM emphasizes synergistic interactions between modules beyond isolated enhancements, achieving significant compression of redundant features and computational load while simultaneously boosting feature sensitivity and regression accuracy. This synergy enables the model to perform fine-grained detection of densely clustered, occluded, or morphologically similar weeds in real-world cotton field environments. The AVGS-YOLO model was developed based on QSLPM and strikes an optimal balance between detection accuracy and computational efficiency, demonstrating strong real-time deployment capability in complex cotton field settings.
Building upon this foundation, this study has developed a lightweight, practical AVGS-YOLO model that achieves highly efficient weed recognition with minimal computational resource consumption. It is primarily suited for real-time detection scenarios of cotton weeds in agricultural edge computing environments. Currently, this model is capable of analyzing cotton weed image data in the field. In the future, as smart agriculture continues to advance, this model may be extended to mobile intelligent terminal platforms, providing agricultural practitioners with a more convenient weed identification and detection tool, thereby further promoting the precision management of weeds in cotton fields. Based on the YOLOv11n model, this study proposes the following key technologies and methods:
(a) This study introduces a targeted architectural optimization named Quadruple Synergistic Lightweight Perception Mechanism (QSLPM). This model employs Slimneck neck structure reorganization to eliminate feature redundancy, utilizes ADown for efficient downsampling to suppress background noise, integrates a new detection head, Detect_SEAM, with embedded SEAM attention for the precise capture of subtle features, and introduces the SIoU loss function to enable angle-aware precise regression. This mechanism provides an effective technical perspective for constructing “high-precision, low-computational-power” detectors in practical agricultural scenarios through the deep synergy between feature extraction and spatial localization.
(b) A high-quality cotton field weed dataset has been constructed. Unlike existing datasets based on simple binary classification or laboratory settings, this dataset explicitly classifies 12 weed species, features more realistic complex backgrounds, and demonstrates robust performance in dense weed growth. It provides a foundation for evaluating the robustness and fine-grained classification capabilities of deep learning-based detection models.
(c) Comprehensive ablation experiments and comparative experiments were conducted to verify the accuracy and generalization ability of the proposed AVGS-YOLO model. The results demonstrate an optimal balance between inference efficiency and detection accuracy, significantly outperforming existing lightweight mainstream detection models.
(d) Gradient-Weighted Class Activation Map++ (Grad-CAM++) heatmaps were employed to visualize key feature regions, intuitively demonstrating the model’s ability to focus on challenging samples within complex backgrounds [34].
In summary, this study introduces the Quadruple Synergistic Lightweight Perception Mechanism (QSLPM). Unlike recent agricultural detection studies that primarily focus on replacing backbones or simply stacking attention mechanisms, our approach represents a systematic integration strategy specifically designed to address high-density and noisy agricultural environments. This mechanism synergistically integrates lightweight neck feature reorganization, spatial downsampling, semantic attention enhancement, and angular geometric regression, offering a robust approach for constructing high-performance, low-complexity detection models. Detailed descriptions of these techniques and methods are presented in Section 2 of this paper. Experimental data and evaluation metrics are discussed in Section 3. Section 4 presents the experimental results and discussion, while the conclusions are presented in Section 5. These sections aim to provide readers with a comprehensive and clear understanding of the research findings.
4. Results and Discussion
4.1. Comparative Analysis of Improved Module Performance
To rigorously validate the effectiveness of the improved algorithms relative to the original algorithm, this study designed three distinct network configurations and conducted comparative ablation experiments. Using the original YOLOv11n model as the baseline, the Slimneck module, SEAM, and ADown module were sequentially integrated to form improved networks for experimentation. All experiments employed the same cotton weed dataset, batch size, and training cycle. The results of the ablation experiments are summarized in Table 3.
The experimental data clearly demonstrate the effectiveness of each improved network. Incorporating the Slimneck network (Model 2) increased precision to 96.0% and slightly improved mAP to 96.9%. This indicates that the Slimneck architecture effectively enhances the quality of feature extraction, maintaining and improving precision while slightly reducing the parameter count. Concurrently, integrating the detection head fused with the SEAM significantly contributes to lightweighting, reducing GFLOPs by 20.6% while effectively boosting recall and mAP. This demonstrates that the SEAM attention mechanism effectively enhances the model’s ability to capture latent features, reducing missed detections.
Building upon Model 2, the SEAM was introduced to form Model 5, achieving the highest precision of 96.6% (+1.2%). Subsequently, the ADown downsampling module was incorporated to create the final model. Further analysis reveals that Model 6 shares an mAP50 of 97.7% with Model 3, indicating potential redundancy within the models. A detailed comparison between Models 3 and 6 reveals that Model 6’s precision improved from 94.9% to 96.5%, but its recall decreased from 93.5% to 92.9%. This indicates that while ADown downsampling suppresses background noise, it lacks a dedicated feature retention mechanism, leading to the loss of fine-grained details of tiny weeds. Consequently, this redundancy precisely highlights the importance of the Slimneck module. The final Model 8 demonstrates that combining Slimneck with SEAM and ADown significantly reduces computational complexity while maintaining high precision and recall. This approach effectively eliminates redundant information while preserving key features. The overall network improvement further reduces the computational load while maintaining a high mAP.
Ultimately, the overall model performance improved from Model 1 to Model 8. Exceptional results were achieved across all evaluation metrics: precision reached 95.9%, recall reached 94.2%, and mAP50 reached 98.2%, with specific improvements of +0.5%, +2.0%, and +1.8%, respectively. This demonstrates the enhanced AVGS-YOLO model’s outstanding performance in reducing false negatives. This ablation experiment demonstrates that the different modules do not merely provide incremental improvements but achieve synergistic gains (1 + 1 > 2). By eliminating texture redundancy and suppressing background noise, the Slimneck and ADown modules clear obstacles for the Detect_SEAM module, enabling the detector to focus more effectively on the irregular contours of weeds without interference. Beyond enhancing accuracy, the model also achieves significant lightweighting: the parameter count decreased by approximately 17.4%, the computational load (GFLOPs) was reduced by about 27%, and the model size shrank from 5.5 MB to 4.7 MB. This makes the model lighter and faster, with the balance of accuracy and efficiency confirming the synergistic interaction of the improved model. Overall, the AVGS-YOLO enhancement strategy effectively resolves the traditional trade-off between high accuracy and low computational demands, substantially boosting the model’s deployment potential on agricultural edge devices and its real-time detection capabilities.
A comparison of the precision–recall curves for the original YOLOv11n and improved AVGS-YOLO models (Figure 13) reveals that the enhanced AVGS-YOLO network achieves significantly improved recognition performance for 12 cotton weed species. It successfully addresses the “weak categories” of the original model, such as the challenging Goosegrass and Carpetweed, with AP values increasing by 1.7% and 1.6%, respectively. This greatly alleviates the detection imbalance between categories.
The shape of the curves shows that the improved model’s PR curve remains stable in the high-recall range (recall > 0.9), indicating fewer false negatives alongside fewer false positives and greater robustness. Furthermore, accuracy improved by 2.0% for weeds with high feature similarity, such as Palmer Amaranth, indicating an enhanced capability to extract fine-grained features. The AVGS-YOLO model not only raises peak detection accuracy but also improves generalization on complex, difficult-to-classify samples, making it better suited for practical cotton field operations.
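The per-class AP values underlying these PR curves follow the standard area-under-curve computation. The sketch below (illustrative only, not the paper’s evaluation code) shows how AP can be computed from confidence-ranked detections with all-point interpolation, as used for mAP50; `tp` marks whether each ranked detection matched a ground-truth box at IoU ≥ 0.5.

```python
import numpy as np

def average_precision(tp, n_gt):
    """Compute AP from detections sorted by descending confidence.

    tp   : binary sequence, 1 if the i-th ranked detection matches a
           ground-truth box (IoU >= 0.5 for mAP50), else 0.
    n_gt : total number of ground-truth boxes for this class.
    """
    tp = np.asarray(tp, dtype=float)
    fp = 1.0 - tp
    cum_tp = np.cumsum(tp)
    cum_fp = np.cumsum(fp)
    recall = cum_tp / n_gt
    precision = cum_tp / (cum_tp + cum_fp)

    # Append sentinels, then make precision monotonically decreasing
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([1.0], precision, [0.0]))
    p = np.maximum.accumulate(p[::-1])[::-1]

    # Area under the interpolated PR curve
    idx = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

# A perfect ranking over 3 ground-truth boxes yields AP = 1.0
print(average_precision([1, 1, 1], 3))
```

mAP50 is then the mean of these per-class AP values over the 12 weed categories.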
The comparison of the confusion matrices before and after improvement (Figure 14) intuitively demonstrates the enhanced model’s breakthrough in reducing inter-class confusion. The most significant improvement lies in identifying difficult-to-classify samples, compensating for the original model’s deficiency in feature extraction for this category. Simultaneously, the recognition rate for Cutleaf reached 1.00, while other categories also showed steady improvement. This demonstrates that the enhanced AVGS-YOLO network possesses stronger feature discrimination capabilities, effectively distinguishing visually similar weeds. Observing the off-diagonal regions of the matrix reveals a marked reduction in misclassification noise. This significantly enhances the fine-grained classification accuracy for specific weed species and substantially lowers the risk of misclassification during field operations.
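The diagonal values discussed above (e.g., 1.00 for Cutleaf) are per-class recognition rates obtained by normalizing each row of the confusion matrix, while the off-diagonal mass quantifies the “misclassification noise.” A minimal sketch with a hypothetical 3-class matrix (not the paper’s actual counts):

```python
import numpy as np

# Hypothetical confusion matrix: rows = true class, cols = predicted class
cm = np.array([
    [50,  0,  0],   # class 0: all 50 samples correctly predicted
    [ 3, 45,  2],   # class 1: 45 of 50 correct
    [ 1,  4, 45],   # class 2: 45 of 50 correct
])

# Per-class recognition rate (recall): diagonal / row sum
recall = np.diag(cm) / cm.sum(axis=1)

# Off-diagonal mass summarizes inter-class confusion
confusion_mass = (cm.sum() - np.trace(cm)) / cm.sum()

print(recall)          # [1.  0.9 0.9]
print(confusion_mass)
```

A reduction in `confusion_mass` between the baseline and improved matrices corresponds directly to the cleaner off-diagonal regions visible in Figure 14.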
4.2. Comparative Experiments of Different Classic Models
To systematically assess performance disparities across models, diagnose their underlying bottlenecks, and validate the correctness of our proposed optimization direction, this study compares AVGS-YOLO with mainstream object detection networks, including RT-DETR [53], YOLOv5n [54], YOLOv8n [55], YOLOv10n, and YOLOv11n [56]. The experimental results of this comparison, which validate the model’s superior performance in cotton weed detection, are presented in Table 4.
Table 4 presents a comparative analysis of AVGS-YOLO in terms of object detection performance. Compared to YOLOv11n, AVGS-YOLO reduces the number of parameters by 17.4% and decreases GFLOPs by 20%. Due to its lightweight architecture, the model size is decreased by 14.5%. In the cotton weed detection task, it demonstrates higher detection accuracy compared to models such as YOLOv8n and RT-DETR. Its precision of 95.9% and recall of 94.2% demonstrate the model’s effectiveness in minimizing false positives and false negatives, striking an optimal balance between detection performance and computational efficiency.
To further elucidate the performance trends of AVGS-YOLO, the training curves are visualized in Figure 15. For all models, both precision and mAP50 increase rapidly within the first 100 epochs and then plateau. Comparing the training trajectories of the improved network with those of other classic networks shows that, in terms of the core evaluation metrics of precision and mAP50 for object detection tasks, the AVGS-YOLO model outperforms mainstream algorithms such as RT-DETR, YOLOv5n, YOLOv8n, YOLOv10n, and YOLOv11n.
To qualitatively evaluate the detection performance of the proposed AVGS-YOLO model, Figure 16 presents the detection results under four scenarios: a single weed against a simple background, a single weed against a complex background, multiple weeds in a simple background, and multiple weeds in a complex background. The complex background specifically includes severe noise interference, such as soil cracks resembling linear geometry and withered weeds resembling intricate textures.
The detection results from RT-DETR indicate that while the model performs well on certain weeds, its recognition accuracy fluctuates and some bounding boxes are imprecise. Compared to RT-DETR, YOLOv5n, YOLOv8n, and YOLOv10n achieve higher detection accuracy for prominent weeds. However, as the figure shows, failure cases remain when these models handle weeds of various morphologies (columns C and D): omissions and misclassifications still occur, and these failure cases are also clearly visible in the Grad-CAM++ heatmaps. YOLOv11n achieves precise localization for various cotton weed types, enabling a close match between prediction boxes and weed areas; however, its prediction accuracy remains unstable, and false negatives persist. The AVGS-YOLO model demonstrates superior instance separation capabilities, significantly reducing false negatives in dense target environments while providing precise bounding box predictions, which is particularly evident in challenging scenarios involving multiple weeds. The visualization results in Figure 17 validate the quantitative metrics, further revealing the AVGS-YOLO model’s exceptional discriminative ability when handling complex challenges such as high occlusion rates and strong inter-class similarity. Specifically, the AVGS-YOLO model demonstrated excellent robustness, effectively mitigating interference from soil cracks and overcoming the risk of false detections caused by withered weeds.
These results collectively demonstrate that the enhanced AVGS-YOLO model can more effectively extract key features of cotton weeds while reducing interference from background elements such as soil cracks and withered weeds. It significantly reduces both false positive and false negative rates while maintaining low computational complexity and a relatively compact size. This makes the improved AVGS-YOLO model well-suited for efficient deployment on resource-constrained edge devices.
4.3. Comparative Experiments with the Latest Models
To further verify the performance of the AVGS-YOLO model, this paper compares it with the latest iterations of the YOLO algorithm (YOLOv12n, YOLOv13n, and YOLO26n) as well as the Transformer-based models RF-DETR, RT-DETRv4-S, and DEIMv2-N, in order to quantify the performance of the AVGS-YOLO model in cotton weed detection tasks. The specific comparison results are shown in the table below.
As shown in Table 5, although models with the Transformer architecture perform well on general datasets, their performance on the cotton weed dataset presents a notable trade-off between accuracy and computational cost. Specifically, the RT-DETRv4-S model achieves a strong mAP50 of 95.7%, but its GFLOPs reach as high as 24.9, suggesting potential computational redundancy in this specific application. Conversely, the DEIMv2-N model has lower GFLOPs of 10.8, but its mAP50 also decreases to 89.8%, indicating that while achieving model lightweighting, it did not fully maintain model accuracy. Compared to Transformer-based models, the YOLO series’ nano models demonstrate a more balanced performance in this task, maintaining a stable mAP while keeping model parameters and GFLOPs low. Comparing the metrics of YOLOv12n, YOLOv13n, and the newly proposed YOLO26n shows that the AVGS-YOLO model constitutes an effective improvement.
4.4. Performance Analysis of Different Loss Functions
The improved AVGS-YOLO model in this study employs the SIoU loss function. To comprehensively evaluate the performance of the SIoU-enhanced YOLOv11n model, we conducted comparative experiments using six loss functions: YOLOv11n + SIoU, YOLOv11n + GIoU, YOLOv11n + EIoU, YOLOv11n + DIoU, YOLOv11n + SlideLoss, and YOLOv11n + FocalLoss. The objective is to analyze the performance differences among these loss functions for cotton weed detection tasks. As shown in Table 6, SIoU demonstrates significant improvements over commonly used loss functions such as GIoU [57], EIoU [58], DIoU [59], SlideLoss [60], and FocalLoss [61]. It achieves optimal precision and recall while maintaining a high mAP. This demonstrates that the SIoU loss function comprehensively optimizes the YOLOv11n model, achieving the best overall balance in target localization performance.
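For reference, the structure of the SIoU loss can be sketched as follows. This is an illustrative reimplementation of the published SIoU formulation (angle, distance, and shape costs added on top of the IoU term), not the exact code used in training; boxes are assumed to be in (x1, y1, x2, y2) format and `theta` is the shape-cost exponent.

```python
import math

def siou_loss(box1, box2, theta=4.0, eps=1e-9):
    """Illustrative SIoU loss for two boxes in (x1, y1, x2, y2) format."""
    (b1x1, b1y1, b1x2, b1y2), (b2x1, b2y1, b2x2, b2y2) = box1, box2
    w1, h1 = b1x2 - b1x1, b1y2 - b1y1
    w2, h2 = b2x2 - b2x1, b2y2 - b2y1

    # IoU term
    iw = max(0.0, min(b1x2, b2x2) - max(b1x1, b2x1))
    ih = max(0.0, min(b1y2, b2y2) - max(b1y1, b2y1))
    inter = iw * ih
    iou = inter / (w1 * h1 + w2 * h2 - inter + eps)

    # Smallest enclosing box
    cw = max(b1x2, b2x2) - min(b1x1, b2x1)
    ch = max(b1y2, b2y2) - min(b1y1, b2y1)

    # Angle cost: peaks when the center offset is at 45 degrees
    s_cw = (b2x1 + b2x2 - b1x1 - b1x2) / 2
    s_ch = (b2y1 + b2y2 - b1y1 - b1y2) / 2
    sigma = math.hypot(s_cw, s_ch) + eps
    sin_alpha = abs(s_ch) / sigma
    angle = 1 - 2 * math.sin(math.asin(sin_alpha) - math.pi / 4) ** 2

    # Distance cost, modulated by the angle cost
    gamma = 2 - angle
    rho_x = (s_cw / (cw + eps)) ** 2
    rho_y = (s_ch / (ch + eps)) ** 2
    dist = (1 - math.exp(-gamma * rho_x)) + (1 - math.exp(-gamma * rho_y))

    # Shape cost: penalizes aspect mismatch between the two boxes
    om_w = abs(w1 - w2) / max(w1, w2)
    om_h = abs(h1 - h2) / max(h1, h2)
    shape = (1 - math.exp(-om_w)) ** theta + (1 - math.exp(-om_h)) ** theta

    return 1 - iou + (dist + shape) / 2
```

The angle term shrinks the distance penalty when the predicted center is nearly axis-aligned with the target, which is the "angle-aware" property exploited by QSLPM for precise regression.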
4.5. Performance Analysis of Different Detection Heads
This study adopts YOLOv11n as the baseline model and integrates six commonly used detection heads, namely Detect_DyHead [62], Detect_Efficient [63], Detect_MultiSEAM [64], Detect_CBAM [65], Detect_ECA [66], and Detect_SEAM, to construct six variants: YOLO + Detect_DyHead, YOLO + Detect_Efficient, YOLO + Detect_MultiSEAM, YOLO + Detect_CBAM, YOLO + Detect_ECA, and YOLO + Detect_SEAM. All experiments were conducted using the same cotton weed dataset, batch size, and training cycles. The results are presented in Table 7. The experimental results demonstrate that YOLO + Detect_SEAM comprehensively outperforms the other five models in terms of precision, recall, mAP50, and mAP50-95. Specifically, YOLO + Detect_SEAM achieved 94.9% precision, 93.5% recall, 97.7% mAP50, and 91.3% mAP50-95, while maintaining a compact model size of 4.6 MB, a low parameter count of 2.14 million, and a low computational complexity of 5.0 GFLOPs.
Compared to classical lightweight attention mechanisms, Detect_CBAM and Detect_ECA achieved mAP50 values of 93.4% and 92.8%, respectively, indicating a notable performance gap relative to the SEAM selected in this study. A deeper analysis reveals that this disparity stems from the specific visual characteristics of the agricultural scenario. CBAM employs a serial channel-spatial attention mechanism incorporating global pooling, which causes the loss of spatial details of small weeds. Meanwhile, the high color similarity between cotton and weeds makes it difficult for the ECA attention mechanism, which relies on channel attention, to extract discriminative features. In contrast, the SEAM employed in this study excels at preserving spatial coherence and local structural integrity, enabling Detect_SEAM to more effectively capture the characteristics of different weeds. In comparison with other advanced detection heads, Detect_DyHead achieved precision, recall, mAP50, and mAP50-95 of 94.7%, 92.3%, 97.2%, and 88.6%; Detect_Efficient achieved 93.8%, 92.4%, 97.0%, and 86.6%; and Detect_MultiSEAM achieved 94.3%, 92.1%, 96.5%, and 88.5%, respectively. These results clearly demonstrate that embedding SEAM attention into the detection head successfully achieves a significant reduction in model complexity while substantially improving detection performance, overcoming the common trade-off between accuracy and computational cost typically found in other mechanisms.
4.6. Comparative Analysis with Existing Methods
To further validate the effectiveness of the improved AVGS-YOLO model, this study selected recently published cotton weed detection models and compared their results, as shown in Table 8. The table compares key parameters, including precision, recall, mAP, model size, number of parameters, and GFLOPs.
It can be observed that Das et al. [67], utilizing drone-acquired cotton field weed data, achieved an mAP50 of 88%, precision of 87%, and recall of 78% on YOLOv7. These detection performances are all lower than those of the AVGS-YOLO model developed in this study. Wang et al. (2025) [68] integrated DS_HGNetV2, BiFPN, and LiteDetect modules to propose the YOLO-Weed Nano model, achieving significant lightweighting improvements. However, its mAP50 still has considerable room for enhancement. In this regard, our study strikes a relative balance between accuracy and efficiency. In another study, Zheng et al. [69] proposed an enhanced YOLO-WL model, reducing its size to 4.6 MB while achieving 92.3% mAP50. Overall, it demonstrates notable lightweighting achievements, though detection accuracy could be further improved. Karim et al. [70] introduced an automated cotton weed targeting system using an improved lightweight YOLOv8 on edge platforms. While this approach demonstrates the potential of automated detection, there remains room for enhancement in terms of model lightweighting and mAP50.
Compared to these models, the proposed AVGS-YOLO achieves a high mAP50 of 98.2% while maintaining a compact 4.7 MB model size, 2.16 million parameters, and 4.6 GFLOPs. These notable results highlight the effectiveness of the proposed Quadruple Synergistic Lightweight Perception Mechanism (QSLPM), enabling improvements in accuracy, efficiency, and lightweight performance. This comprehensive performance enhancement makes the AVGS-YOLO model more suitable for deployment on resource-constrained edge devices.
4.7. Heatmap Analysis Detection
Heatmap analysis plays a crucial role in visualizing object detection models, particularly in complex agricultural vision tasks such as cotton weed identification. Grad-CAM++ is a gradient-based visualization method that generates class-discriminative spatial attention heatmaps by computing gradient-weighted linear combinations of a network’s convolutional feature maps. It provides detailed visualization for cotton weed identification images.
Grad-CAM++ is an enhanced version of Grad-CAM that can generate more detailed and focused heatmaps. Compared to Grad-CAM, Grad-CAM++ incorporates higher-order derivative information into the weight calculation process, allowing it to discriminate at a finer granularity and thereby highlight the image regions that have a critical impact on detection results. This enhances the interpretability and trustworthiness of the cotton weed detection system. Unlike standard classification, Grad-CAM++ visualization in this study is generated by backpropagating the gradients of specific target scores within the detection head. It not only shows classification information but also clearly reflects the spatial features that contribute the most to the model’s detection confidence. Grad-CAM++ visually reveals the varying contributions of different image regions to the model’s prediction: warmer colors on the heatmap indicate higher contributions, while cooler colors denote lower contributions. This study analyzed the detection heatmaps of the YOLOv11n and AVGS-YOLO models using Grad-CAM++, with the results presented in Figure 17.
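Under the common approximation that the higher-order derivatives of an exponentially activated score reduce to powers of the first-order gradient, the Grad-CAM++ weighting can be sketched with plain arrays as follows. This is illustrative only; the heatmaps in this study were generated from the detection head’s target scores of the actual trained models.

```python
import numpy as np

def grad_cam_pp(activations, grads, eps=1e-8):
    """Grad-CAM++ heatmap from a conv layer's activations and gradients.

    activations, grads : arrays of shape (K, H, W) for K feature maps.
    Uses the standard approximation d^nY/dA^n ~ grads**n for an
    exponentially activated score.
    """
    g2, g3 = grads ** 2, grads ** 3
    # Per-location alpha coefficients of Grad-CAM++
    denom = 2 * g2 + np.sum(activations * g3, axis=(1, 2), keepdims=True)
    alpha = g2 / (denom + eps)
    # Channel weights: alpha-weighted sum of positive gradients
    weights = np.sum(alpha * np.maximum(grads, 0), axis=(1, 2))
    # Weighted combination of feature maps, then ReLU and normalize to [0, 1]
    cam = np.maximum(np.tensordot(weights, activations, axes=1), 0)
    return cam / (cam.max() + eps)

# Toy example: 4 feature maps of size 8x8 with random values
rng = np.random.default_rng(0)
acts = np.abs(rng.normal(size=(4, 8, 8)))
grads = rng.normal(size=(4, 8, 8))
heatmap = grad_cam_pp(acts, grads)
print(heatmap.shape)  # (8, 8)
```

The normalized `heatmap` is what gets color-mapped and overlaid on the input image, with values near 1 rendered in warm colors.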
The comparison in Figure 17 displays the heatmaps generated by the YOLOv11n and AVGS-YOLO models for the cotton weed detection task, revealing differences in their recognition capabilities. From left to right, each column of images represents four scenarios: a single weed against a simple background, a single weed in a complex background, multiple weeds in a simple background, and multiple weeds against a complex background. The heatmap comparison reveals that YOLOv11n’s heatmaps are more dispersed, with some coverage extending beyond the target area into background regions, particularly along weed edges. This indicates that YOLOv11n’s feature extraction is more sensitive to background noise. In multi-target scenarios, the heatmap of YOLOv11n fails to clearly focus attention on each individual weed, instead also attending to background areas beyond the target and displaying broader, sometimes blurry activation regions. This intuitively reflects the attention drift in failure cases. Compared with YOLOv11n, the heatmap of AVGS-YOLO is more concentrated and compact, with high-activation areas accurately covering the main body of the weeds and with clearer boundaries. This compactness is not merely a visual property; it corresponds closely to the statistical improvements in the quantitative metrics reported earlier. This demonstrates that the AVGS-YOLO model effectively suppresses background noise, reduces false negatives and false positives, and better focuses on the critical detection regions.
4.8. Generalization Performance on Standard Benchmarks
MS COCO (Microsoft Common Objects in Context) is recognized as one of the most influential large-scale benchmark datasets in computer vision. It contains 80 object categories from everyday scenes, such as people, vehicles, and furniture. The 2017 version includes 118,287 training images (train), 5000 validation images (val), and 40,670 test images (test). Owing to its high scene complexity and object richness, MS COCO is generally regarded as a standard benchmark for evaluating a model’s generalization, robustness, and detection capability. To assess the generalization ability of the AVGS-YOLO model, we conducted benchmark testing on the MS COCO 2017 dataset and compared the results with those obtained by YOLOv11n on the same dataset. Considering computational resource constraints, and to ensure fairness and direct comparability, AVGS-YOLO and the baseline YOLOv11n were evaluated under exactly the same experimental conditions: both were trained for 100 epochs with identical hyperparameters and data augmentation strategies. The specific comparison results are shown in
Table 9.
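The identical-settings comparison protocol above can be sketched with the Ultralytics API. This is an illustrative sketch, not the authors’ exact script: the weight filenames are hypothetical placeholders, and it assumes the standard `coco.yaml` dataset configuration shipped with Ultralytics. The key point is that both checkpoints are validated with the same call, so the mAP figures are directly comparable.

```python
def evaluate_on_coco(weights: str, imgsz: int = 640):
    """Validate a trained checkpoint on the MS COCO 2017 val split and
    return (mAP50, mAP50-95). Import is deferred so the sketch only
    requires the ultralytics package when actually called."""
    from ultralytics import YOLO

    model = YOLO(weights)
    metrics = model.val(data="coco.yaml", imgsz=imgsz)
    return metrics.box.map50, metrics.box.map

# Identical call for both checkpoints keeps the comparison fair
# (filenames are placeholders):
#   evaluate_on_coco("yolov11n.pt")
#   evaluate_on_coco("avgs_yolo.pt")
```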
As shown in
Table 9, AVGS-YOLO achieved an mAP50 of 50.1% and an mAP50-95 of 35.4%, both slightly higher than the corresponding metrics of the YOLOv11n baseline. This suggests that AVGS-YOLO possesses stable generalization ability across diverse scenarios. While maintaining competitive accuracy, the model also demonstrates a more efficient lightweight design, with the number of parameters reduced by about 17.4% and the computational load (GFLOPs) reduced by about 27.0% compared to YOLOv11n. The results on the MS COCO dataset indicate that AVGS-YOLO maintains detection performance on par with mainstream baseline models in general scenarios.
5. Conclusions
This study addresses critical issues in cotton weed detection, including the morphological diversity of weeds, complex field backgrounds, and the large parameter counts and model sizes of conventional detectors, and proposes a lightweight cotton weed detection model, AVGS-YOLO. Systematic ablation and comparative experiments demonstrate that the model achieves an effective balance between detection accuracy and computational efficiency, driven by targeted structural optimization rather than random fluctuation.
The core of this model is the Quaternary Synergistic Lightweight Perception Mechanism (QSLPM), which reorganizes the Slimneck architecture to eliminate the feature redundancy caused by the high similarity of weed textures, uses ADown for efficient downsampling that suppresses background noise such as soil cracks, and introduces a new detection head, Detect_SEAM, with an embedded SEAM attention mechanism for precise capture of irregular weed contours. In addition, the SIoU loss function is adopted to achieve accurate angle-aware regression under weed overlap. Experimental results show that, compared with the baseline YOLOv11n, AVGS-YOLO improves precision by 0.5%, recall by 2%, mAP50 by 1.8%, and mAP50-95 by 5.8%, reaching 95.9%, 94.2%, 98.2%, and 93.3%, respectively. The model is also significantly lighter, with parameters reduced by 17.4%, computational cost (GFLOPs) reduced by 27%, and model size reduced from 5.5 MB to 4.7 MB. The simultaneous improvement in performance and reduction in model complexity realizes the true synergistic gain proposed in this study (1 + 1 > 2).
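The relative reductions quoted above follow from a simple ratio; as a small sanity check under the reported figures, the model-size cut from 5.5 MB to 4.7 MB works out to roughly 14.5% (the 17.4% and 27% figures apply to parameter count and GFLOPs, respectively).

```python
def pct_reduction(baseline: float, improved: float) -> float:
    """Relative reduction of a cost metric, in percent."""
    return (baseline - improved) / baseline * 100.0

# Model size reported above: 5.5 MB -> 4.7 MB
size_cut = pct_reduction(5.5, 4.7)  # roughly 14.5%
```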
Although this study has achieved good results, it has certain limitations. The current dataset was collected from a single geographical location, which limits the geographical generalization of the proposed model to different soil types and regional weed variants. Furthermore, the model has so far been evaluated only on high-performance workstations and, due to practical constraints, has not yet been tested on edge devices. Our future work will therefore focus on addressing these limitations: (1) expanding the dataset by adding data collection locations to further validate the model’s adaptability in real cotton field environments; (2) deploying the model on physical edge devices, such as Jetson Nano, Xavier, and Raspberry Pi, to verify its actual performance in real scenarios; and (3) benchmarking against newly released architectures such as YOLO26n and DEIMv2-N to continue pushing the performance limits of the model.