Article

SLD-YOLO11: A Topology-Reconstructed Lightweight Detector for Fine-Grained Maize–Weed Discrimination in Complex Field Environments

1 College of Computer and Information Engineering, Inner Mongolia Agricultural University, Hohhot 010018, China
2 Inner Mongolia Autonomous Region Key Laboratory of Big Data Research and Application for Agriculture and Animal Husbandry, Zhaowuda Road No. 306, Hohhot 010018, China
3 Inner Mongolia Autonomous Region Big Data Center, Chilechuan Street No. 1, Hohhot 010091, China
* Author to whom correspondence should be addressed.
Agronomy 2026, 16(3), 328; https://doi.org/10.3390/agronomy16030328
Submission received: 14 December 2025 / Revised: 19 January 2026 / Accepted: 22 January 2026 / Published: 28 January 2026
(This article belongs to the Section Farming Sustainability)

Abstract

Precise identification of weeds at the maize seedling stage is pivotal for implementing Site-Specific Weed Management and minimizing the environmental pollution caused by herbicides. However, the performance of existing lightweight detectors is severely bottlenecked by unstructured field environments, characterized by the “green-on-green” spectral similarity between crops and weeds, diminutive seedling targets, and complex mutual occlusion of leaves. To address these challenges, this study proposes SLD-YOLO11, a topology-reconstructed lightweight detection model tailored for complex field environments. First, to mitigate the feature loss of tiny targets, a lossless downsampling topology based on Space-to-Depth Convolution (SPD-Conv) is constructed, transforming spatial information into depth channels to preserve fine-grained features. Second, a Decomposed Large Kernel Attention (D-LKA) mechanism is designed to mimic the wide receptive field of human vision; by modeling long-range spatial dependencies with decomposed large-kernel attention, it enhances discrimination under severe occlusion by leveraging global structural context. Third, the DySample operator is introduced to replace static interpolation, enabling content-aware reconstruction of the feature flow. Experimental results demonstrate that SLD-YOLO11 achieves an mAP@0.5 of 97.4% on a self-collected maize field dataset, significantly outperforming YOLOv8n, YOLOv10n, YOLOv11n, and mainstream lightweight variants. Notably, the model achieves zero inter-class misclassification between maize and weeds, establishing a high safety standard for weeding operations. To further bridge the gap between visual perception and precision operations, a Visual Weed-Crop Competition Index (VWCI) is proposed. By integrating detection bounding boxes with species-specific morphological correction coefficients, the VWCI quantifies field weed pressure at low cost and high throughput. Regression analysis reveals high consistency (R² = 0.70) between the automated VWCI and manual ground-truth coverage. This study not only provides a robust detector but also offers a reliable decision-making basis for real-time variable-rate spraying by intelligent weeding robots.

1. Introduction

As a global strategic crop, maize (Zea mays L.) is essential for meeting the demands for human food and animal feed. However, weed infestation remains a primary biotic factor limiting its productivity [1,2,3]. While traditional views focus on resource competition, many researchers have addressed the broader impact of weeds on crop yield [4,5,6]. Horvath et al. proposed a paradigm shift, suggesting that the perception of weeds by crops can induce physiological changes in development even before water and nutrient resources become limited, thereby leading to significant yield reduction [7]. Idziak et al. reported that weed-induced maize yield losses can range from 30% to 93%, depending on climatic conditions and agronomic practices [8]. To mitigate these losses, traditional weed management primarily relies on the broadcast spraying of herbicides. However, despite its effectiveness, this “blanket spraying” approach has led to excessive chemical residues, soil degradation, and the accelerated evolution of herbicide-resistant weed biotypes. Casimero et al. found that such intensive and continuous reliance on chemicals has resulted in the development of resistance in 17 weed species across 37 unique cases, causing severe environmental pollution. Furthermore, the ecological costs extend far beyond chemical residues [9]. Lu et al. revealed a critical negative feedback loop: emerging weed resistance forces farmers to increase tillage intensity, which in turn offsets greenhouse gas (GHG) mitigation benefits and increases emissions [10]. To address these economic and ecological burdens, Site-Specific Weed Management (SSWM) has emerged as a sustainable solution. Gerhards et al. demonstrated that by adapting control measures to the spatial heterogeneity of weed distribution, SSWM can save at least 50% of herbicide usage without compromising control efficacy [11]. Nevertheless, the successful implementation of such intelligent equipment depends on the accuracy of perception. As noted by Qu and Su, whether for precision spraying or laser weeding, the effectiveness of targeted operations fundamentally relies on the robustness of deep learning algorithms to distinguish weeds from crops in complex environments [12].
In recent years, deep learning-based computer vision technology has revolutionized the field of agricultural robotics. Compared to traditional machine learning methods that rely on handcrafted features (such as color histograms and texture descriptors) [5,6,13], neural networks demonstrate superior capabilities in feature extraction [14,15,16,17,18]. A study by Cordova-Cruzatty et al. confirmed that applying specific CNN architectures (such as cNET) can distinguish between maize plants and weeds via real-time segmentation during the early growth stage, outperforming traditional network architectures in both accuracy and processing time [19]. For instance, Peteinatos et al. achieved a weed species identification accuracy of over 90% by training deep models such as VGG16 and ResNet-50 on large-scale datasets, further validating the effectiveness of deep feature extraction in processing complex agricultural images [20].
To meet the stringent requirements for inference speed in field operations, single-stage detectors represented by the YOLO (You Only Look Once) series have become the standard. García-Navarrete et al. successfully applied the YOLOv5 model to distinguish maize from four common weeds under greenhouse conditions, achieving a prediction accuracy of 97% for the maize category [21]. This advantage of balancing speed and accuracy has led to its widespread application in intelligent equipment; for example, Wang et al. developed a precision spraying robot based on an improved YOLOv5s, achieving a weed recognition accuracy of 83% while ensuring real-time detection speed (30 ms/f) [22]. With the continuous evolution of algorithms, recent studies have also improved YOLO series models specifically for weed recognition in complex field environments [23,24,25,26]. For example, Zheng developed the YOLOv8-DMAS model for weed detection in cotton fields. To address challenges such as diverse weed species, dense distribution, small targets, and severe occlusion, they introduced dilated residual modules and multi-scale feature modules into the backbone network, added a detection layer for small targets, and incorporated ASFF feature fusion and Soft-NMS to suppress missed and false detections [27]. Experiments demonstrated that compared to YOLOv8s, this model achieved improvements in both precision and recall and significantly outperformed YOLOv5s, YOLOv7, and SSD. This validates the effectiveness of structural improvements in enhancing detection accuracy and robustness in complex farmland environments.
Addressing even more challenging scenarios, Li et al. developed an improved algorithm based on YOLOv10n specifically to resolve issues related to tiny object detection and occluded weeds within “green-on-green” backgrounds in paddy fields [28]. Faced with numerous model choices, Saltık et al. recently conducted a comprehensive benchmark test of YOLOv9, YOLOv10, and RT-DETR, emphasizing the necessity of balancing inference time and detection accuracy according to hardware computational capabilities in practical intelligent spraying systems [29]. Furthermore, to meet lightweight requirements, numerous studies have attempted to combine lightweight backbone networks such as MobileNet and ShuffleNet with the YOLO series. While these approaches significantly reduce parameter counts and computational loads to achieve rapid weed detection, they generally suffer from insufficient sensitivity to small targets and complex backgrounds. This further highlights the need to balance accuracy and robustness under lightweight conditions [30,31,32].
Although the YOLO series performs excellently in general scenarios, deploying lightweight detectors in unstructured maize fields still faces many challenges [33,34,35]. The first is the “Green-on-Green” problem [36]; Hasan et al. noted in their morphological weed recognition study that due to the highly similar spectral characteristics (color) and growth textures of weeds and crops, relying solely on color information makes the two visually indistinguishable, which constitutes a primary obstacle for automatic detection. The second is the extreme multi-scale variation during the seedling stage [37]; Liu et al. emphasized in a review of small object detection that for tiny objects occupying very few pixels, standard strided convolutions in deep convolutional neural networks inevitably lead to the loss of fine-grained feature information during down-sampling [38]. Yang et al. further confirmed that this information loss causes deep networks to overlook small weed targets, thereby making them “invisible” from UAV or ground perspectives [39]. The third is the severe occlusion problem; with the rapid growth of maize leaves, the morphological integrity of weeds is often compromised. Jia found in a study on occluded object recognition that limited recognition ranges and receptive fields can lead to feature extraction failure when targets are partially occluded, resulting in a significant drop in recognition accuracy [40]. Existing lightweight detectors often lack the ability to capture global context, which is particularly critical when distinguishing between severely occluded crops and randomly distributed weeds [41].
Furthermore, a high-precision connection between visual detection and agricultural decision-making has not yet been fully realized. Most existing studies on lightweight detection focus solely on bounding box regression for target localization [42,43,44]. For instance, Peng et al. developed the AGRI-YOLO model for maize weeds [45]. Although it successfully reduced parameter counts and floating-point operations (FLOPs) by nearly 50% while maintaining high accuracy, its output remained limited to target localization and classification. Similarly, Li et al. optimized a MobileNet-YOLO model for real-time pitaya detection [46]. Despite meeting the speed requirements for harvesting robots, like most pure detection methods, it did not address the quantification of target biomass. However, to implement precise Variable Rate Application (VRA), robotic systems require quantified weed pressure (such as biomass or coverage), rather than mere bounding box coordinates. Although semantic segmentation networks can provide pixel-level coverage information, they often incur high computational costs. Yu et al. proposed a Swin-DeepLab model for soybean fields. Despite achieving a mean Intersection over Union (mIoU) of 91.53%, its reliance on a complex Swin Transformer backbone and attention mechanisms to handle dense distributions implies a substantial computational load [47]. Additionally, Zou et al. used an improved U-Net to evaluate field weed density. While achieving an extremely high correlation with manual observations, this pixel-level segmentation process inherently consumes more computational resources than bounding box detection, making it difficult to deploy on resource-constrained edge devices. Therefore, how to efficiently estimate weed pressure based on lightweight detection results, without introducing the heavy burden of segmentation, remains a critical and unresolved issue in precision agriculture [48].
To address these challenges, we propose SLD-YOLO11, a topology-reconstructed lightweight detector for fine-grained maize–weed discrimination in unstructured “green-on-green” field environments. The main novel contributions are summarized as follows:
(1)
Information-preserving lightweight perception for tiny and occluded seedlings. We redesign the downsampling topology using Space-to-Depth Convolution (SPD-Conv) to reduce feature truncation during encoding, improving sensitivity to minute weed seedlings and preserving discriminative cues under severe leaf occlusion.
(2)
Long-range contextual modeling. We introduce an efficient Decomposed Large-Kernel Attention (D-LKA) mechanism to enhance global contextual perception, enabling more reliable separation of crops and weeds when local appearance is ambiguous, without imposing heavy computational costs.
(3)
Perception-to-decision integration via an interpretable weed-pressure index. Beyond bounding-box detection, we propose the Visual Weed–Crop Competition Index (VWCI) to convert detection outputs into a low-cost, quantitative indicator of weed pressure, supporting variable-rate spraying decisions without relying on computationally expensive pixel-level segmentation.
(4)
Comprehensive validation from accuracy to agronomic applicability. We evaluate the proposed framework through extensive comparisons and ablations, demonstrating its practical potential for decision-making.
The remainder of this paper is organized as follows: Section 2 details the materials and methods, including the data acquisition process, dataset construction, and theoretical framework of the proposed SLD-YOLO11 model with its key components (SPD-Conv, D-LKA, DySample). Section 3 presents the experimental results and discussion, covering the training environment, quantitative performance comparisons with state-of-the-art models, ablation studies to verify module effectiveness, and a generalization analysis on a sesame dataset. Finally, Section 4 concludes the study, summarizing the main contributions and outlining directions for future research.

2. Materials and Methods

2.1. Dataset Preparation and Characteristics

To construct a robust object detection model for complex field environments, the maize–weed dataset version used in this study has been curated and archived on Zenodo to ensure reproducibility [49]. According to the original dataset study [50], images were collected under natural field conditions from 5 June to 15 June 2022, during the critical weeding period when maize seedlings were at the 3–5 leaf stage. To reflect scale variation encountered in field operations, the acquisition height was randomly varied within 45–55 cm, providing a near top-down view of the canopy. The dataset covers diverse illumination conditions (e.g., different times of day and cloudy/sunny scenarios) and includes challenging cases with leaf occlusion and “green-on-green” appearance. In addition to dense field scenes, the dataset also contains images of individual maize plants and weeds to enrich morphological diversity. Representative examples are shown in Figure 1.

2.2. Data Collection and Processing

To meet the demand for high-quality training data in deep learning models and ensure the robustness of detection algorithms in complex field environments, this study subjected the raw collected images to rigorous preprocessing, annotation, and data augmentation operations. The specific workflow is as follows:
(1)
Ground Truth Annotation: This study employed the open-source image annotation tool LabelImg to manually annotate the collected raw images. Following the Pascal VOC dataset format standard, objects in the images were classified into two categories: “Maize” and “Weed”. During annotation, bounding boxes were tightly fitted around the visible portions of target plants to minimize background noise interference. Upon completion, the system generated corresponding XML files for each image. These files detailed the image filename, dimensional parameters (height, width, channel count), and provided precise ground truth data for subsequent supervised learning by recording the category labels and positional coordinates for each target object.
(2)
Dataset Partitioning: To objectively evaluate the model’s generalization performance on unseen data, this study employed stratified random sampling to partition the dataset into training and test sets. Specifically, 70% of the data was used for model training and parameter fine-tuning, while the remaining 30% formed an independent test set solely for final model performance assessment. This partitioning method ensures consistency in the distribution between test and training data.
(3)
Data Augmentation Strategy: Due to the heavy reliance of deep learning models on large training datasets, and considering the high variability in morphology and growth stages of field weed samples, relying solely on raw data can easily lead to model overfitting. Therefore, this study employed data augmentation techniques to expand the training set [51]. The dataset was augmented through methods including flipping, adding Gaussian noise, and contrast enhancement, as illustrated in Figure 2.
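For illustration, the sketch below shows how the three augmentation operations can be realized with OpenCV and NumPy. The function name and the noise and contrast parameters are illustrative assumptions, not the exact values used in our pipeline.

```python
import cv2
import numpy as np

def augment(image: np.ndarray) -> list[np.ndarray]:
    """Apply the three augmentations described above to one BGR image."""
    augmented = []
    # Horizontal flip (flipCode=1 flips around the vertical axis).
    augmented.append(cv2.flip(image, 1))
    # Additive Gaussian noise (sigma=15 is illustrative).
    noise = np.random.normal(0, 15, image.shape).astype(np.float32)
    noisy = np.clip(image.astype(np.float32) + noise, 0, 255)
    augmented.append(noisy.astype(np.uint8))
    # Contrast enhancement via linear scaling (alpha > 1 raises contrast).
    augmented.append(cv2.convertScaleAbs(image, alpha=1.3, beta=0))
    return augmented
```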

2.3. Methodology

2.3.1. SLD-YOLO11 Network Architecture Overview

The YOLO11n detector can essentially be viewed as a hierarchical feature mapping function $\Phi(I)$ that maps an input image $I \in \mathbb{R}^{H \times W \times 3}$ to detection results. To address the issues of small-target loss and lack of structural awareness in maize fields, we restructured the original feature extraction pipeline into SLD-YOLO11.
To address the loss of subtle features caused by spectral similarity and mutual occlusion between maize seedlings and weeds in complex field environments, this paper proposes a lightweight detection model named SLD-YOLO11 based on topological reconstruction and large-kernel attention, as shown in Figure 3. Taking YOLO11n as the baseline, we denote the feature map at stage $k$ as $F_k$. The hierarchical transformation of the original YOLO11 can be expressed in the recursive form $F_{k+1} = \mathcal{H}_{\mathrm{C3k2}}(\mathcal{D}_{\mathrm{stride}}(F_k))$, where $\mathcal{D}_{\mathrm{stride}}$ represents lossy strided downsampling. The proposed architecture reconstructs this mapping into three coupled transformation processes:
(1)
Lossless Spatial Transformation: During the backbone feature extraction stages ($k$ = 1, 2, 3), the strided operator $\mathcal{D}_{\mathrm{stride}}$ is replaced with the SPD-Conv operator $\Psi_{\mathrm{SPD}}$ following the SPD-Conv strategy [50,51], as shown in Equation (1):
$F_{k+1} = \mathcal{H}_{\mathrm{C3k2}}\left( \Psi_{\mathrm{SPD}}(F_k) \right)$
(2)
Global Structural Enhancement: At the deep feature stage, the large-kernel attention operator $\Omega_{\mathrm{LKA}}$ is introduced to reweight features, as shown in Equation (2):
$F_k' = F_k \odot \Omega_{\mathrm{LKA}}(F_k)$
(3)
Content-Aware Reconstruction: During the upsampling phase in the neck, the dynamic operator $\Upsilon_{\mathrm{Dy}}$ replaces static interpolation, as shown in Equation (3):
$F_{up} = \Upsilon_{\mathrm{Dy}}(F_k, F_{guide})$
Here, $F_{guide}$ serves as the guiding feature used to compute the offsets of the sampling points. By cascading Equations (1)–(3), SLD-YOLO11 forms an end-to-end, information-conserving closed-loop system.
Figure 3. Schematic diagram of the SLD-YOLO11 architecture. Arrows indicate the feature-flow direction, and different colors denote the backbone, neck, and head components, with the proposed SPD-Conv, D-LKA attention, and DySample modules highlighted.

2.3.2. Lossless Downsampling Topology Reconstruction

To address the fine-grained feature truncation caused by strided convolutions during the $F_2 \to F_3$ transition in YOLO11n, this study defines the SPD-Conv transformation [52,53]. For early-stage weed seedlings measuring only a few pixels, such sparse sampling and destructive compression cause leaf features to vanish entirely in deep networks. To address this, this paper reconstructs the downsampling topology with SPD-Conv, as shown in Equation (4). Let $X \in \mathbb{R}^{S \times S \times C_{in}}$ denote the input features of layer $k$, where $S$ is the spatial resolution and $C_{in}$ is the number of input channels. The $\Psi_{\mathrm{SPD}}$ module does not directly discard pixels but instead rearranges spatial information into the channel dimension.
$X'_{i,j} = \mathrm{Concat}\left[ X_{2i,2j},\ X_{2i+1,2j},\ X_{2i,2j+1},\ X_{2i+1,2j+1} \right]$
where the transformation maps $X$ to $X' \in \mathbb{R}^{\frac{S}{2} \times \frac{S}{2} \times 4C_{in}}$, and $i$ and $j$ are spatial coordinate indices with $0 \le i, j < \frac{S}{2}$. $X_{2i,2j}$ denotes the pixel at an even row and even column of the original feature map; similarly, $X_{2i+1,2j}$ and the remaining terms denote the three adjacent pixels. $\mathrm{Concat}$ represents concatenation along the channel dimension. Equation (4) shows that spatially adjacent pixel blocks are rearranged losslessly along the depth dimension, halving the feature-map resolution while preserving the total information content. Subsequently, a non-strided convolution layer is applied for channel fusion:
$Y_{SPD} = \sigma\left( B\left( W_{conv} * X' \right) \right)$
where $Y_{SPD} \in \mathbb{R}^{\frac{S}{2} \times \frac{S}{2} \times C_{out}}$ represents the final output of the module, $*$ denotes the non-strided convolution, $W_{conv} \in \mathbb{R}^{k \times k \times 4C_{in} \times C_{out}}$ denotes the learnable convolutional kernel weights, $B$ signifies batch normalization, and $\sigma$ is the SiLU nonlinear activation. Through this approach, SLD-YOLO11 achieves full information-flow conservation, ensuring that the original discriminative features are preserved while the feature-map resolution is reduced. This significantly enhances the model’s sensitivity to minute weed targets. In maize seedling fields, the cues that separate weeds from young maize are often weak and local: thin leaf edges, small gaps between leaves, and partially visible seedlings under overlap. These cues occupy only a handful of pixels and are exactly the first to disappear when the backbone repeatedly downsamples. SPD-Conv is helpful here not because it creates new information, but because it avoids discarding it: the space-to-depth rearrangement keeps the local neighborhood evidence together in the channel dimension, so the subsequent convolution can still access the original local pattern even after resolution is reduced. In practice, this makes the downstream features less sensitive to whether a seedling happens to fall on a sampled pixel location, which is a common failure mode of strided sampling for tiny targets.
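For clarity, the following PyTorch sketch implements Equations (4) and (5). The class name and the fusion kernel size k are illustrative; the official SPD-Conv implementation may differ in detail.

```python
import torch
import torch.nn as nn

class SPDConv(nn.Module):
    """Space-to-depth rearrangement followed by a non-strided convolution (Eqs. (4)-(5))."""
    def __init__(self, c_in: int, c_out: int, k: int = 3):
        super().__init__()
        # After space-to-depth, the channel count becomes 4 * c_in.
        self.conv = nn.Conv2d(4 * c_in, c_out, kernel_size=k,
                              stride=1, padding=k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Eq. (4): gather the four spatial sub-grids and stack them along channels,
        # halving H and W without discarding any pixel.
        x = torch.cat([x[..., ::2, ::2], x[..., 1::2, ::2],
                       x[..., ::2, 1::2], x[..., 1::2, 1::2]], dim=1)
        # Eq. (5): non-strided convolution + batch norm + SiLU fuse the channels.
        return self.act(self.bn(self.conv(x)))
```

Because every input pixel survives the rearrangement, the fusion convolution can still see the full local pattern at half resolution, which is the property the text above relies on.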

2.3.3. Decomposed Large Kernel Attention, D-LKA

The key to distinguishing field weeds from crops lies not only in local texture but also in contextual structure: in many seedling-stage field scenes, maize plants exhibit more regular spatial layouts than weeds, which are often more irregularly distributed. Due to the limited effective receptive field of traditional 3 × 3 convolutions, capturing such large-scale structural context proves challenging. To enable lightweight networks to model long-range dependencies, this paper introduces a decomposed attention module. For input features $F \in \mathbb{R}^{H \times W \times C}$, to avoid the substantial computational overhead of a single large convolutional kernel, the large-kernel convolution is decomposed into the following three logical steps:
Step 1: Local Context Capture. A 5 × 5 depthwise convolution captures leaf-level local texture details, as shown in Equation (6) [54].
$F_{local} = \mathrm{DWConv}_{5 \times 5}(F)$
where $\mathrm{DWConv}_{5 \times 5}$ denotes a depthwise convolution with a 5 × 5 kernel, and $F_{local} \in \mathbb{R}^{H \times W \times C}$ represents intermediate features capturing local details such as leaf textures.
Step 2: Long-range Dependency Modeling. Global structures at the field scale are captured using a 7 × 7 depthwise dilated convolution (DW-D-Conv) with dilation rate $d$, as shown in Equation (7).
$F_{global} = \text{DW-D-Conv}_{7 \times 7,\, d=3}(F_{local})$
where $\text{DW-D-Conv}(\cdot)$ denotes the depthwise dilated convolution and $d = 3$ is the dilation rate, which expands the receptive field of the kernel to cover a much wider spatial range for modeling long-range dependencies. $F_{global}$ refers to features containing global structural information.
Step 3: Attention Generation and Feature Reweighting. Channel information is fused using a 1 × 1 convolution, and the final output feature $Y_{LKA}$ is obtained through the Hadamard product of the attention map and the original features together with a residual connection, as shown in Equations (8) and (9).
$A = \mathrm{Conv}_{1 \times 1}(F_{global})$
$Y_{LKA} = F \odot A + F$
where $A$ represents the generated attention map, in which each value indicates the importance of the corresponding position on the feature map; $\odot$ denotes the Hadamard (element-wise) product; and the additive term $F$ is the residual connection, which prevents gradient vanishing while preserving the original features. This mechanism enables the network to adaptively emphasize informative regions associated with weed–crop discrimination while attenuating less relevant background patterns, achieving an effective balance between computational efficiency and global structural perception. The motivation for D-LKA in maize fields is that local appearance alone is often insufficient: both maize and weeds are green, and occlusion frequently removes the most distinctive parts of weeds. When only a small fragment of a leaf is visible, the model benefits from looking beyond the immediate neighborhood and using broader scene context. The decomposed large-kernel design provides this long-range view without the heavy cost of a full large convolution. Importantly, D-LKA does not rely on a fixed “row template”; rather, it learns to aggregate informative context over a wider spatial extent when local evidence is ambiguous. Together, SPD-Conv preserves fine-grained cues that would otherwise be truncated early, and D-LKA makes these preserved cues usable under occlusion by integrating them with long-range context, which explains why their combination is particularly effective in maize seedling scenes.
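The three steps above can be sketched in PyTorch as follows. Module and variable names are illustrative, and a production implementation may add normalization or channel projections around this block.

```python
import torch
import torch.nn as nn

class DLKA(nn.Module):
    """Decomposed large-kernel attention (Eqs. (6)-(9))."""
    def __init__(self, channels: int):
        super().__init__()
        # Step 1 (Eq. (6)): 5x5 depthwise conv for local context.
        self.dw = nn.Conv2d(channels, channels, 5, padding=2, groups=channels)
        # Step 2 (Eq. (7)): 7x7 depthwise dilated conv, d=3
        # (effective span 19, so padding=9 keeps the resolution).
        self.dwd = nn.Conv2d(channels, channels, 7, padding=9,
                             dilation=3, groups=channels)
        # Step 3 (Eq. (8)): 1x1 conv generates the attention map.
        self.pw = nn.Conv2d(channels, channels, 1)

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        attn = self.pw(self.dwd(self.dw(f)))
        # Eq. (9): Hadamard reweighting plus the residual connection.
        return f * attn + f
```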

2.3.4. Dynamic Flow-Field Upsampling

During the upsampling stage of the feature pyramid network, traditional nearest-neighbor interpolation replicates pixels based solely on geometric position, disregarding semantic feature content. When weeds and maize leaves occlude each other, this static interpolation often leads to jagged edges and positional drift. This paper therefore replaces traditional interpolation with the DySample operator, which achieves content-aware upsampling through a point-sampling mechanism without incurring high computational costs [55].
Given a feature map $F_k \in \mathbb{R}^{H \times W \times C}$, DySample first predicts a sampling offset $O$ at each position through a linear layer [56,57]. It then resamples the original features at the offset-corrected positions using a grid-sampling function, as shown in Equation (10) [55]:
$F_{up} = \mathrm{GridSample}(F_k,\ G + O)$
where $G$ represents the original sampling grid, $G + O$ indicates the sampling positions after offset correction, and $F_{up}$ denotes the upsampled high-resolution feature map. $\mathrm{GridSample}(\cdot)$ signifies the differentiable sampling operation based on bilinear interpolation. This design enables the sampling points to deform automatically according to leaf morphology, thereby reconstructing the edges of occluded targets more clearly when restoring high-resolution feature maps.
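A simplified PyTorch sketch of the point-sampling idea in Equation (10) is given below. The offset scaling and initialization are illustrative simplifications; the official DySample operator differs in grouping and initialization details.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DySampleLite(nn.Module):
    """Simplified content-aware 2x upsampling via learned point offsets (Eq. (10))."""
    def __init__(self, channels: int, scale: int = 2):
        super().__init__()
        self.scale = scale
        # 1x1 (linear) layer predicting a (dx, dy) offset per upsampled position.
        self.offset = nn.Conv2d(channels, 2 * scale * scale, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, _, h, w = x.shape
        s = self.scale
        # Rearrange predicted offsets onto the high-resolution grid;
        # the 0.25 factor (illustrative) keeps initial offsets small.
        o = F.pixel_shuffle(self.offset(x) * 0.25, s)      # (n, 2, h*s, w*s)
        # Base sampling grid G in normalized [-1, 1] coordinates.
        ys = torch.linspace(-1, 1, h * s, device=x.device)
        xs = torch.linspace(-1, 1, w * s, device=x.device)
        gy, gx = torch.meshgrid(ys, xs, indexing="ij")
        g = torch.stack((gx, gy), dim=-1).unsqueeze(0).expand(n, -1, -1, -1)
        # Eq. (10): resample features at G + O with differentiable bilinear sampling.
        grid = g + o.permute(0, 2, 3, 1)
        return F.grid_sample(x, grid, mode="bilinear", align_corners=True)
```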

2.4. Visual Weed-Crop Competition Index (VWCI) Modeling

Quantifying interspecific competition between crops and weeds is crucial for precision agricultural decision-making. Traditional methods typically rely on semantic segmentation networks to compute pixel-level vegetation coverage [58]. However, the high computational cost of segmentation models limits their deployment on resource-constrained edge devices such as agricultural robots. In contrast, lightweight object detectors like YOLO, while offering real-time inference speeds, inherently overestimate irregular plant shapes through their rectangular bounding box outputs. This leads to inaccurate representations of actual biomass coverage [59,60].
To bridge the technical gap between real-time detection and accurate coverage estimation, this study proposes a morphological correction method integrated into VWCI modeling. We introduce a species-specific morphological fill factor $\alpha$ to describe the proportion of a bounding box actually occupied by plant material; this coefficient serves as a proxy for plant canopy porosity. The effective biological coverage area $S_{eff}$ is defined in Equation (11).
$S_{eff} = \alpha_{species} \cdot \left( w_{box} \times h_{box} \right)$
where $\alpha_{species}$ compensates for blank areas within the rectangular bounding box. Based on statistical analysis of leaf morphology in this study’s dataset (maize leaves are sparse and distinct, while weeds grow in denser clusters), the morphological fill factor was empirically calibrated by analyzing the ratio of pixel-level mask area to bounding-box area across 100 randomly sampled training-set instances, as shown in Figure 4. To verify the statistical sufficiency of this sample size, we performed a cumulative stability analysis with bootstrap resampling. The results indicated that the estimated mean $\alpha$ converged and the 95% confidence interval stabilized when the sample size exceeded 80, confirming that N = 100 provides a robust calibration set with small estimation uncertainty. The statistical results indicate $\alpha_{maize} = 0.70$ and $\alpha_{weed} = 0.66$.
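The bootstrap procedure can be summarized in a few lines of NumPy. The helper below is an illustrative sketch that assumes the 100 per-instance mask-to-box area ratios are available as an array.

```python
import numpy as np

def bootstrap_alpha(ratios: np.ndarray, n_boot: int = 10000, seed: int = 0):
    """Bootstrap the mean fill factor and its 95% CI from per-instance
    mask-area / box-area ratios (the N = 100 calibration samples)."""
    rng = np.random.default_rng(seed)
    means = np.array([
        rng.choice(ratios, size=ratios.size, replace=True).mean()
        for _ in range(n_boot)
    ])
    return ratios.mean(), np.percentile(means, [2.5, 97.5])
```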
Accordingly, we define VWCI as a normalized ratio of biologically effective area rather than raw pixel counts, as shown in Equation (12):
$VWCI = \dfrac{S_{eff}^{weed}}{S_{eff}^{crop} + S_{eff}^{weed}}$
Because VWCI is defined as a normalized ratio of effective canopy areas, it is less sensitive to the absolute scaling of the fill factor α than an unnormalized area estimate. Moderate inaccuracies in α have a bounded influence on VWCI, and part of the bias can cancel when the relative error is similar across species. We note that α is a morphology-dependent correction and may vary with crop type, growth stage, and imaging configuration; when transferring to a new scenario, α can be efficiently re-estimated using a small sample set following Figure 4, without pixel-level annotations for the full dataset. With VWCI ranging from 0 to 1, we further translate it into actionable agronomic decisions by establishing a three-tier warning mechanism: Safe period (VWCI < 0.02): No intervention required; Warning period (0.02 ≤ VWCI < 0.15): Localized spraying recommended; High-risk period (VWCI ≥ 0.15): Full-coverage spraying recommended. This mechanism enables SLD-YOLO11 to function not only as a detector but also as a real-time biological stress sensor.
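To make the perception-to-decision pipeline concrete, the following sketch computes VWCI from detection outputs (Equations (11) and (12)) and maps it to the three-tier warning mechanism. The function and class-label names are illustrative.

```python
ALPHA = {"maize": 0.70, "weed": 0.66}  # calibrated fill factors (Section 2.4)

def vwci(boxes: list[tuple[str, float, float]]) -> float:
    """Compute VWCI (Eq. (12)) from detections given as (class, box_w, box_h)."""
    s = {"maize": 0.0, "weed": 0.0}
    for cls, w, h in boxes:
        s[cls] += ALPHA[cls] * w * h          # Eq. (11): effective area per box
    total = s["maize"] + s["weed"]
    return s["weed"] / total if total > 0 else 0.0

def spray_decision(v: float) -> str:
    """Map VWCI to the three-tier warning mechanism."""
    if v < 0.02:
        return "safe: no intervention"
    if v < 0.15:
        return "warning: localized spraying"
    return "high-risk: full-coverage spraying"
```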

2.5. Training Model

All model training and inference experiments in this study were conducted on a high-performance computing platform. To ensure experimental reproducibility, the primary hardware and software configuration parameters are as follows: For hardware, a workstation equipped with an Intel Core i9-12900K processor and an NVIDIA GeForce RTX 3090 graphics processing unit (GPU) was used, featuring 24 GB of graphics memory. The software environment was based on the Windows 10 operating system, utilizing the PyTorch 1.12.0 deep learning framework alongside CUDA 11.3 and cuDNN 8.2 acceleration libraries to enhance computational efficiency.
In specific agricultural application scenarios, building and training a large-scale deep learning model from scratch is typically extremely time-consuming and prone to getting stuck in local optima due to relatively insufficient data. Therefore, this study employs a transfer learning strategy, loading pre-trained weights from the large-scale general-purpose COCO dataset as initial parameters for the YOLOv11 model. Through transfer learning, the model leverages foundational features learned from the general-purpose dataset, accelerating convergence and enhancing feature extraction capabilities on the cornfield weed dataset. During training, input image dimensions were adjusted to 640 × 640 pixels. The Stochastic Gradient Descent (SGD) optimizer was employed with an initial learning rate of 0.01, momentum factor of 0.9, and weight decay coefficient of 0.0005. The total number of training epochs was set to 100, with a batch size of 16. To prevent overfitting, the Mosaic data augmentation strategy was employed during training. An early stopping mechanism was activated when the training loss failed to decrease for 10 consecutive epochs.
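Assuming the standard Ultralytics training interface for YOLO11, the configuration of Section 2.5 can be expressed roughly as follows. The dataset YAML and weights file names are placeholders, and this sketch omits the architectural modifications of SLD-YOLO11.

```python
from ultralytics import YOLO

# Load COCO-pretrained YOLO11n weights for transfer learning.
model = YOLO("yolo11n.pt")

# Hyperparameters mirror Section 2.5; "maize_weed.yaml" is a placeholder path.
model.train(
    data="maize_weed.yaml",   # dataset config (train/test paths, class names)
    imgsz=640,                # input resolution 640 x 640
    epochs=100,
    batch=16,
    optimizer="SGD",
    lr0=0.01,                 # initial learning rate
    momentum=0.9,
    weight_decay=0.0005,
    mosaic=1.0,               # Mosaic augmentation enabled
    patience=10,              # early stopping after 10 non-improving epochs
)
```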

2.6. Evaluation Indicators and Statistical Analysis

To comprehensively evaluate the performance of the improved YOLOv11 model in cornfield weed detection tasks, this study adopts standard metrics from the object detection field: Intersection over Union (IoU), Precision (P), Recall (R), F1-score, and Mean Average Precision (mAP) [61]. Before these metrics are calculated, Intersection over Union (IoU) is introduced to measure the overlap between the predicted bounding box $B_p$ and the ground-truth bounding box $B_{gt}$; its mathematical definition is shown in Equation (13). With the IoU threshold set at 0.5, detection results are classified as True Positives (TP), i.e., correctly detected objects ($IoU \ge 0.5$); False Positives (FP), i.e., erroneously detected objects (false alarms, $IoU < 0.5$); and False Negatives (FN), i.e., undetected objects (misses).
$IoU = \dfrac{\mathrm{area}(B_p \cap B_{gt})}{\mathrm{area}(B_p \cup B_{gt})}$
Based on the above definitions, precision reflects the proportion of actual positives among the model’s positive predictions, while recall reflects the proportion of all positive samples that are correctly predicted. The F1-score is the harmonic mean of precision and recall, serving as a balanced measure of overall performance. Its value lies in [0, 1], with higher scores indicating a better balance between missed detections and false positives. The calculation formulas are shown in Equations (14)–(16):
$Precision = \dfrac{TP}{TP + FP}$
$Recall = \dfrac{TP}{TP + FN}$
$F1 = \dfrac{2 \times Precision \times Recall}{Precision + Recall}$
To further evaluate the robustness of the model across different confidence levels, this study introduces Average Precision (AP), defined as the area under the Precision–Recall (P–R) curve. Mean Average Precision (mAP) is the arithmetic mean of the AP values across all categories (maize and weed), serving as the most fundamental metric for assessing the performance of multi-class object detection models. The calculation formulas are shown in Equations (17) and (18).
$AP = \int_{0}^{1} P(R)\, dR$
$mAP = \dfrac{1}{N} \sum_{i=1}^{N} AP_i$
where $N = 2$ denotes the number of categories (maize and weed), and $AP_i$ represents the average precision of the $i$-th category.
Additionally, to evaluate the consistency between the model-predicted VWCI and the VWCI calculated from manually annotated ground truth, this study employs the Coefficient of Determination ($R^2$) and the Root Mean Square Error (RMSE) as evaluation metrics [61], as shown in Equations (19) and (20).
$R^2 = 1 - \dfrac{\sum_{i=1}^{n} \left( \hat{y}_i - y_i \right)^2}{\sum_{i=1}^{n} \left( y_i - \bar{y} \right)^2}$
$RMSE = \sqrt{\dfrac{\sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}{n}}$
where $n$ is the number of samples, $y_i$ is the $i$-th measured value, $\hat{y}_i$ is the $i$-th predicted value, and $\bar{y}$ is the mean of the measured values. In general, the higher the $R^2$ and the lower the RMSE, the stronger the model’s ability to quantify biological stress.
Beyond the standard metric definitions, we employed specific statistical procedures to assess the reliability of the results. To verify the statistical reliability of the morphological fill factor derived in Section 2.4, a bootstrap resampling method was employed; as detailed there, the cumulative stability analysis confirmed that the estimated parameter converged to a stable mean with a narrow 95% confidence interval. To validate the agronomic reliability of the VWCI, we quantified the agreement between model predictions and manual ground-truth coverage using regression analysis, reporting $R^2$ and RMSE (Equations (19) and (20)) to evaluate the strength of the relationship and verify the consistency of the proposed index.
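For reference, Equations (13), (19), and (20) translate directly into NumPy; the helper names below are illustrative.

```python
import numpy as np

def iou(bp: np.ndarray, bg: np.ndarray) -> float:
    """Eq. (13): IoU of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(bp[0], bg[0]), max(bp[1], bg[1])
    x2, y2 = min(bp[2], bg[2]), min(bp[3], bg[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = ((bp[2] - bp[0]) * (bp[3] - bp[1])
             + (bg[2] - bg[0]) * (bg[3] - bg[1]) - inter)
    return inter / union if union > 0 else 0.0

def r2_rmse(y: np.ndarray, y_hat: np.ndarray) -> tuple[float, float]:
    """Eqs. (19)-(20): agreement between predicted and ground-truth VWCI."""
    ss_res = np.sum((y_hat - y) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return float(1 - ss_res / ss_tot), float(np.sqrt(ss_res / len(y)))
```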

3. Results and Discussion

3.1. Model Training

The training process of the SLD-YOLO11 model demonstrates stable convergence and efficient feature learning, as shown in Figure 5. The bounding-box regression loss (box_loss) and classification loss (cls_loss) decrease rapidly within the first 50 epochs, indicating that the model quickly learned the fundamental morphological features of maize and weeds. The curves then stabilize without significant fluctuations, showing that the model avoids overfitting. The mAP@0.5 rises sharply to a peak of 97.4%, demonstrating the model’s strong ability to distinguish the two sample categories. The consistency of metrics across the training and validation sets validates the effectiveness of the proposed data augmentation strategy.

3.2. Comparative Performance Evaluation

To systematically evaluate the performance of SLD-YOLO11, this study conducted comparative experiments against two categories of benchmark models. These include general SOTA models—YOLOv8n, YOLOv10n, and standard YOLOv11n representing current mainstream detection capabilities [26]. Additionally, lightweight backbone variants YOLO11-MobileNetV3 and YOLO11-ShuffleNetV2 were included. These models replace the backbone network with lightweight alternatives to test the extreme trade-off between computational cost and accuracy. All models were trained and tested under identical hyperparameters. Quantitative results are summarized in Table 1. As shown in Table 1, detection accuracy of the lightweight variants (ShuffleNetV2 and MobileNetV3) exhibits a noticeable decline, with mAP@0.5:0.95 dropping to 47.9% and 48.5%, respectively. This indicates that overly simplified backbone networks struggle to extract sufficient feature details for minute weed targets in complex cornfield environments. The baseline model YOLOv11n achieved respectable performance, with mAP@0.5:0.95 reaching 57.5%. However, the proposed SLD-YOLO11 significantly outperformed all baseline models, achieving the highest precision (93.5%) and recall (94.8%). On the mAP@0.5:0.95 metric, it improved to 65.5% compared to YOLOv11n’s 57.5%. This demonstrates that SLD-YOLO11 exhibits superior capabilities in handling multi-scale variations and occlusion issues within the dataset.
Efficiency–Accuracy Trade-off Comparison. As shown in Figure 6, although lightweight variants like YOLO11-MobileNetV3 and YOLO11-ShuffleNetV2 significantly reduce computational burden, they compromise detection accuracy, increasing the risk of false crop detections. In contrast, the proposed SLD-YOLO11 occupies the upper-center region. While its enhancements slightly increase FLOPs from 6.1 G (YOLOv11n) to 6.3 G, this minimal computational cost yields substantial detection performance gains, achieving a mAP@0.5 of 97.4%. This demonstrates that SLD-YOLO11 strikes a superior balance between efficiency and accuracy, making it highly suitable for deployment on edge devices requiring both real-time performance and high precision.
Furthermore, to validate that the performance improvement of SLD-YOLO11 is statistically significant rather than a result of random variation, we conducted a non-parametric Wilcoxon signed-rank test on the image-level F1-scores between the proposed model and the baseline models (YOLOv8n, YOLOv10n, YOLOv11n) [61]. The analysis focused on the consistency of detection performance across individual test samples. Results indicate that the improvements achieved by SLD-YOLO11 are statistically significant at the 95% confidence level (p < 0.05).
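The significance test can be reproduced with SciPy as sketched below; the file names for the paired image-level F1-scores are placeholders.

```python
import numpy as np
from scipy.stats import wilcoxon

# Paired image-level F1-scores computed on the same test images (placeholders).
f1_sld = np.load("f1_sld_yolo11.npy")
f1_base = np.load("f1_yolo11n.npy")

# Two-sided Wilcoxon signed-rank test on the paired differences.
stat, p_value = wilcoxon(f1_sld, f1_base)
print(f"W = {stat:.1f}, p = {p_value:.4f}")  # p < 0.05 -> significant at the 95% level
```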

3.3. Ablation Study

To validate the independent contribution of each improved component, this study employed an ablation experiment using the “control variable method.” The baseline model was the original YOLOv11n. Detailed results are presented in Table 2. Model S-YOLO11n demonstrates the impact of SPD-Conv: replacing stride convolutions with SPD-Conv increased the mAP@0.5:0.95 by 2.3% without increasing the number of parameters. This indicates that SPD-Conv effectively prevents the loss of fine-grained feature information during subsampling, which is crucial for identifying minute weed targets. Model L-YOLO11n demonstrates the impact of D-LKA: the integration of the D-LKA module delivers significant performance gains. Although it introduces marginal parameter increases, the large kernel design successfully captures long-range contextual information, enabling the model to better distinguish weeds from complex backgrounds. The impact of DySample is demonstrated by model D-YOLO11n: Compared to the baseline, D-YOLO11n achieves a 1.4% improvement in mAP@0.5:0.95 with nearly zero additional computational cost, proving its efficiency in restoring target details. By integrating all three modules, the proposed SLD-YOLO11 achieves optimal performance (mAP@0.5:0.95: 65.5%). The 8.0% overall improvement over the baseline model indicates synergistic effects among these modules, collectively enhancing localization accuracy and classification robustness.

3.4. Robustness and Generalization Analysis

3.4.1. Qualitative Analysis in Complex Field Scenarios

To assess the robustness of the proposed method in complex field environments, we visually compared the detection results of SLD-YOLO11 with five baseline models: the general detectors YOLOv8n, YOLOv10n, and YOLO11n, and the lightweight variants YOLO11-MobileNetV3 and YOLO11-ShuffleNetV2, as shown in Figure 7. When weed seedlings were partially obscured by maize leaves or extremely small, the baseline models (YOLOv8n and YOLOv10n) exhibited varying degrees of missed detections (false negatives). The lightweight variants YOLO11-MobileNetV3 and YOLO11-ShuffleNetV2 struggled with feature retention, failing to detect most small-scale weed targets in densely planted rows. This confirms that simply replacing the backbone with lightweight networks leads to severe loss of fine-grained spectral and spatial information. In contrast, the proposed SLD-YOLO11 demonstrates superior detection density and accuracy. Benefiting from the lossless downsampling of the SPD-Conv module, our model successfully recovers minute weed targets that disappear in other models’ feature maps. Furthermore, the D-LKA attention mechanism enables the model to effectively distinguish overlapping leaves. As the visualization results demonstrate, SLD-YOLO11 captures the highest number of valid targets with precise bounding boxes, validating its effectiveness for maize weed detection in unstructured environments.
To further investigate the classification reliability and error distribution of the SLD-YOLO11 model, we generated a confusion matrix on the test set, as shown in Figure 8a. The matrix details the prediction breakdown for the two target classes, maize and weed, as well as the background (non-target areas). The most notable observation is the model’s excellent capability in discriminating crops from weeds: as indicated in the central region of the matrix, the inter-class misclassification rate between maize and weeds is zero. This result is crucial for automated weeding operations, as it fundamentally safeguards crop safety by ensuring that young maize seedlings are not mistakenly uprooted and weeds are not overlooked during the process.
The primary source of error lies in misclassifying background as targets (95 false positives for maize and 124 for weeds). This phenomenon is mainly attributed to the conservative confidence threshold (Conf = 0.15) adopted in this study. In complex field environments, to ensure no weed infestations are missed, we prioritized minimizing false negatives over precision. Consequently, some background regions with plant-like textures (e.g., crop residues or soil clods; see Figure 8b) were captured by the model. In practical deployment, however, such low-confidence false positives can be effectively filtered out through temporal tracking or simple logical post-processing. Furthermore, the model exhibits an extremely low false negative rate: among nearly 1000 test targets, only 33 maize plants and 21 weeds remained undetected (false negatives; see Figure 8c). This high recall demonstrates that SLD-YOLO11 robustly captures targets even under challenging conditions such as occlusion.

3.4.2. Generalization Capability Analysis

Furthermore, to evaluate the scalability of the proposed SLD-YOLO11 architecture beyond maize fields, we conducted a cross-crop generalization test on a sesame dataset. Unlike maize (a narrow-leaved monocot), sesame is a broad-leaved dicot with distinct branching structures, which introduces markedly different plant morphology and field appearance for object detection. The sesame dataset contains 1300 images. We directly applied the model trained on the maize dataset to the sesame dataset using the same inference configuration, without introducing any handcrafted geometric assumptions or task-specific post-processing. Notably, row-aligned cues can be weak or visually ambiguous in many sesame scenes. As shown in Figure 9, SLD-YOLO11 still provides stable qualitative detection results under occlusion and complex backgrounds, suggesting that the proposed design can leverage generic long-range contextual information and retain effectiveness when field layouts differ from maize. This demonstrates the potential of SLD-YOLO11 as a versatile solution for multi-crop weed management.

3.5. Agronomic Application Evaluation

3.5.1. Reliability Verification of VWCI

To validate the practical applicability of the proposed VWCI as a reliable indicator for precision agriculture, we conducted quantitative regression analysis. The objective was to determine whether an automated VWCI derived solely from model detection results could accurately reflect actual weed pressure in the field without human intervention. We compared the predicted VWCI values against the true weed coverage calculated from manually annotated test sets. Here, the ground-truth weed coverage is computed from manual pixel-level masks, and it is used to evaluate whether VWCI can approximate coverage trends without running a segmentation model. As shown in Figure 10, the regression analysis revealed a strong positive correlation between the automated VWCI and the true values. The model achieved an R² of 0.7039, indicating that the index explains over 70% of the variation in actual weed distribution. Furthermore, the RMSE was as low as 0.0296, demonstrating the algorithm’s stability and low bias. The fitted regression line (y = 0.80x + 0.04) closely follows the ideal diagonal (y = x), though the slope of 0.80 indicates a slight yet consistent underestimation in high-density areas, likely due to occlusion by overlapping weeds. Unlike traditional binary classification, VWCI provides a continuous quantitative measure of weed severity. This strong correlation demonstrates that our computer vision-based approach can effectively replace labor-intensive manual sampling. Consequently, VWCI can serve as a critical decision variable for variable-rate application systems, enabling farmers to adjust herbicide doses based on precise weed pressure zones.

3.5.2. Variable Rate Spraying Decision Framework

Based on the spatial distribution results of the VWCI, this study further developed a Variable-Rate Spraying (VRS) strategy to validate the agronomic applicability and decision-making potential of the proposed method. First, three application rates were defined according to VWCI values, corresponding to different weed stress levels: (1) No-Spray Zone (VWCI < 0.02), indicating minimal weed pressure in the field where chemical control measures are unnecessary; (2) Low-rate application zones (0.02 ≤ VWCI < 0.15), corresponding to scattered weed distribution or localized mild infestation, where spot or small-area precision spraying can be employed; (3) High-rate application zones (VWCI ≥ 0.15), designated for areas with significant weed density, requiring conventional-rate or full-coverage spraying to ensure effective weed control. The α values and VWCI thresholds reported here are calibrated for the maize seedling scenario in this study; when transferring to different crops, growth stages, or imaging settings, α can be efficiently re-estimated with a small sample set following Figure 4, without full pixel-level annotation.
The generated VRS prescription map (Figure 11) exhibits high consistency with the spatial distribution of VWCI. High-dose application zones typically exhibit block-like distributions, precisely corresponding to high-value patches in the VWCI heatmap. Low-dose areas predominantly occur at the edges or transition zones of high-density weed patches, while regions without detected weeds remain untreated. This hierarchical application pattern aligns with precision agriculture’s “apply only as needed” principle, effectively reducing unnecessary chemical inputs, preventing pesticide waste, and minimizing potential environmental impacts.

3.6. Discussion

3.6.1. Mechanism of Performance Improvement

SLD-YOLO11 can achieve robust maize–weed discrimination in unstructured “green-on-green” field scenarios, provided that three key conditions are simultaneously satisfied: (i) preserving fine-grained cues of tiny seedlings as much as possible in the early encoding stages; (ii) enhancing long-range contextual reasoning under severe occlusion; and (iii) reconstructing high-resolution features in a content-adaptive manner during neck feature fusion to reduce localization drift. The ablation results validate this approach: each module yields a measurable gain in mAP@0.5:0.95, and their combination exhibits a clear synergistic effect.
(1)
SPD-Conv: It mitigates the disappearance of features for tiny seedlings via “information-preserving downsampling”. One of the longstanding core bottlenecks in small-object detection is that repeated stride-based downsampling causes instances with an extremely low target pixel proportion to be irreversibly weakened or even lost in deep features. This issue has been systematically discussed in reviews on small-object detection, and it also constitutes a major motivation for multi-scale architectures, i.e., alleviating the loss of small-object information through semantic representations at different resolutions [62]. In our scenario, early-stage weed seedlings often occupy only a few pixels, and the effect of information truncation introduced by sampling becomes more pronounced due to leaf occlusion and background texture interference. To address this problem, SPD-Conv rearranges local spatial neighborhoods into the channel dimension, thereby reducing spatial resolution while avoiding, as much as possible, the loss induced by stride sampling. This is consistent with the commonly used Focus strategy; for example, YOLOv5 performs downsampling via slice-and-concatenate rather than directly dropping samples, and it is also consistent with the idea that pixel rearrangement enables lossless information transfer between spatial and channel domains [63,64]. Therefore, the improvement brought by SPD-Conv in Table 2 stems from its ability to preserve and strengthen microscopic cues that distinguish weeds—such as thin-leaf edges, local gaps, and the shapes of partially missing leaves after occlusion—thereby improving the detectability of tiny weeds without significantly increasing the backbone complexity.
(2)
D-LKA: It enhances long-range contextual reasoning to handle “green-on-green” similarity and occlusion ambiguity. When crops and weeds are highly similar in local appearance or only partially visible, a local receptive field is often insufficient to support stable discrimination. The model then needs to leverage broader contextual information to determine whether a fragment belongs to a maize plant with row–column structure or to a weed cluster with a more irregular spatial distribution. This demand is consistent with conclusions in related literature that modeling long-range dependencies can improve the robustness of visual recognition under complex visibility conditions [62]. Large-kernel attention has been proposed to capture long-range correlations with controllable computational overhead and has been used to build effective methods for detection tasks [54]. Meanwhile, studies on large-kernel convolutional network design have pointed out that enlarging convolution kernels can substantially expand the effective receptive field, making representations more biased toward shape and structural cues, which is particularly beneficial when texture cues are unstable [65]. In this work, D-LKA achieves a “large-kernel” effect while keeping the cost under control, more effectively resolving ambiguities caused by insufficient local evidence, thereby improving recall under occlusion and boosting the overall mAP.
(3)
DySample: It reduces boundary jaggedness and spatial drift during feature fusion via content-adaptive upsampling. Beyond the encoding stage, accurate detection under occlusion and for small-scale instances also depends on whether fine structures can be reliably reconstructed during multi-scale fusion in the neck. Conventional nearest-neighbor or bilinear interpolation is a geometrically fixed upsampling scheme; when cases such as leaf overlap or highly fragmented boundaries occur, it can easily introduce jagged edges, blurred details, or feature misalignment, thereby undermining localization stability. Recent studies on content-aware upsampling have shown that adaptively modeling sampling locations or reassembly kernels conditioned on content can improve dense prediction and structural restoration quality at relatively low cost [66]. DySample formulates upsampling as efficient point sampling and learns offsets, avoiding the heavier computation of dynamic convolution while still enabling content-guided reconstruction. In this work, the consistent gains brought by DySample indicate that it achieves better feature alignment and contour recovery during pyramid fusion, which is particularly critical for slender, fragmented, or partially occluded weed instances, thereby improving the localization accuracy and robustness of the detection head.
Overall, the complete SLD-YOLO11 model achieves a clear improvement over YOLO11 in mAP@0.5:0.95 (Table 2). The magnitude of this gain typically exceeds the marginal benefit of a single lightweight modification, indicating functional coupling among the modules: SPD-Conv preserves microscopic cues at the front end that would otherwise be lost during downsampling; D-LKA introduces long-range context so that these cues can be interpreted and exploited more reliably; and DySample reconstructs high-resolution representations in a content-adaptive manner in the neck, reducing fusion misalignment and enabling the detection head to localize tiny and occluded seedlings more stably. In field images, small scale, occlusion, and green-on-green similarity often co-occur. This coupled framework provides a more coherent and testable mechanistic explanation for why SLD-YOLO11 can significantly improve accuracy while keeping computational overhead close to the baseline.

3.6.2. Limitations and Error Analysis

Although VWCI shows a strong correlation with field-observed weed coverage, it tends to underestimate weed pressure in high-density patches. This bias is mainly due to the limited observability of RGB imagery. As weed density increases, leaf overlap intensifies, causing the visible projected area captured by the camera to grow sublinearly with the true weed burden. This phenomenon is analogous to the reduced responsiveness of indices based on 2D reflectance or visible cover when canopy structure becomes closed and shadowed [67,68]. Some RGB-based field phenotyping studies also point out that canopy cover derived from top-down imagery is inherently constrained by occlusion and viewing geometry, especially in dense crop scenes [69]. Practically, VWCI is highly reliable for delineating spatial distribution patterns; however, if higher fidelity is required in high-density areas, incorporating 3D LiDAR or multi-view reconstruction as inputs is an effective way to better represent canopy structure and support precision weed management [70]. Regarding classification errors, the confusion matrix shows a limited number of background false positives, mainly associated with illumination variations. Such sensitivity to field conditions has also been repeatedly identified as a key challenge in weed detection systems, motivating robustness-oriented data design and cross-domain adaptation strategies in agricultural image analysis [71].

3.6.3. Future Work

To address the aforementioned limitations, future work will focus on two directions: (1) Depth integration: fuse RGB-D camera data to exploit spatial volume information, correcting the biomass underestimation caused by leaf overlap and further improving VWCI accuracy; (2) Edge deployment and closed-loop control: deploy the quantized SLD-YOLO11 model on embedded computing platforms such as the Jetson Orin, integrate it with the actuators of weeding robots, and conduct real-time variable-rate weeding field trials to validate the system's response speed and operational stability.
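As one plausible route for direction (2), the snippet below sketches exporting a trained checkpoint to a TensorRT engine with INT8 post-training quantization via the public Ultralytics export API. The file names are placeholders (the released weights are not named here), and the INT8 calibration settings would need tuning on field data.

```python
from ultralytics import YOLO

# Hypothetical checkpoint/dataset names. INT8 export requires a calibration
# dataset, passed through the `data` argument of the Ultralytics export API.
model = YOLO("sld_yolo11.pt")
model.export(format="engine", int8=True, data="maize_weed.yaml", imgsz=640)
# The resulting .engine file can then be loaded on a Jetson Orin for inference.
```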

4. Conclusions

This study addressed the critical challenges of “green-on-green” recognition and severe occlusion in unstructured maize fields by proposing the SLD-YOLO11 detection model. By reconstructing the topological structure of the lightweight detector, we integrated SPD-Conv for lossless downsampling, a Decomposed Large Kernel Attention (D-LKA) mechanism for global structural perception, and the content-aware DySample operator for feature reconstruction. Experimental results demonstrate that SLD-YOLO11 achieves an mAP@0.5 of 97.4%, significantly outperforming state-of-the-art baselines (YOLOv8n, YOLOv10n) and lightweight variants in both accuracy and robustness. Crucially, the model achieved zero inter-class misclassification between maize and weeds, establishing a high safety standard for autonomous weeding operations and preventing accidental crop damage. Beyond the detection algorithm itself, the Visual Weed-Crop Competition Index (VWCI) proposed in this work provides a novel, low-cost method for quantifying weed pressure. The high correlation (R2 = 0.70) between the automated VWCI and ground-truth coverage confirms its reliability as a surrogate for manual biological surveys, enabling the transition from binary “weed detection” to continuous “weed pressure monitoring” and providing precise data support for variable-rate spraying prescriptions. Future work will focus on overcoming the limitations of 2D vision in biomass estimation. We plan to: (1) integrate RGB-D depth sensing to resolve the underestimation caused by leaf overlap in high-density areas, and (2) quantize the model and deploy it on edge computing devices (such as the NVIDIA Jetson Orin) mounted on field robots, thereby achieving a closed-loop system for real-time, site-specific precision weeding.

Author Contributions

Conceptualization, M.L.; methodology, M.L.; formal analysis, M.L.; writing—original draft preparation, M.L.; writing—review and editing, J.G.; project administration, J.G.; funding acquisition, J.G. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Inner Mongolia Science and Technology Major Special Projects [2019ZD016, 2021ZD0005].

Data Availability Statement

The data presented in this study are available on reasonable request from the corresponding author.

Acknowledgments

We thank the anonymous reviewers for their constructive feedback.

Conflicts of Interest

The authors declare no conflicts of interest.

Nomenclature

| Symbol/Abbreviation | Definition |
|---|---|
| SPD-Conv | Space-to-Depth Convolution |
| D-LKA | Decomposed Large Kernel Attention |
| DySample | Dynamic sampling-based upsampling operator |
| VWCI | Visual Weed-Crop Competition Index |
| SSWM | Site-Specific Weed Management |
| mAP@0.5 | Mean Average Precision at an IoU threshold of 0.5 |
| mAP@0.5:0.95 | Mean Average Precision averaged over IoU thresholds from 0.5 to 0.95 |
| P | Precision |
| R | Recall |
| TP | True Positives (correctly detected targets) |
| FP | False Positives (predicted positive but actually negative) |
| FN | False Negatives (missed positives) |
| FLOPs | Floating Point Operations (measure of computational cost) |
| FPS | Frames Per Second (measure of inference speed) |
| α | Morphological fill factor used in VWCI |
| S_eff | Effective biological coverage area after morphological correction |

References

1. Fathi, A.; Zeidali, E. Conservation Tillage and Nitrogen Fertilizer: A Review of Corn Growth and Yield and Weed Management. Cent. Asian J. Plant Sci. Innov. 2021, 1, 121–142.
2. Gao, W.-T.; Su, W.-H. Weed Management Methods for Herbaceous Field Crops: A Review. Agronomy 2024, 14, 486.
3. Cui, J.; Tan, F.; Bai, N.; Fu, Y. Improving U-Net Network for Semantic Segmentation of Corns and Weeds during Corn Seedling Stage in Field. Front. Plant Sci. 2024, 15, 1344958.
4. Venkataraju, A.; Arumugam, D.; Stepan, C.; Kiran, R.; Peters, T. A Review of Machine Learning Techniques for Identifying Weeds in Corn. Smart Agric. Technol. 2023, 3, 100102.
5. Tao, T.; Wei, X. A Hybrid CNN–SVM Classifier for Weed Recognition in Winter Rape Field. Plant Methods 2022, 18, 29.
6. De Villiers, C.; Munghemezulu, C.; Chirima, G.J.; Tesfamichael, S.G.; Mashaba-Munghemezulu, Z. Crop Classification and Weed Detection in Rainfed Maize Crops Using UAV and PlanetScope Imagery. Abstr. ICA 2023, 6, 50.
7. Horvath, D.P.; Clay, S.A.; Swanton, C.J.; Anderson, J.V.; Chao, W.S. Weed-Induced Crop Yield Loss: A New Paradigm and New Challenges. Trends Plant Sci. 2023, 28, 567–582.
8. Idziak, R.; Waligóra, H.; Szuba, V. The Influence of Agronomical and Chemical Weed Control on Weeds of Corn. J. Plant Prot. Res. 2023, 62, 215–222.
9. Casimero, M.; Abit, M.J.; Ramirez, A.H.; Dimaano, N.G.; Mendoza, J. Herbicide Use History and Weed Management in Southeast Asia. Adv. Weed Sci. 2022, 40, e020220054.
10. Lu, C.; Yu, Z.; Hennessy, D.A.; Feng, H.; Tian, H.; Hui, D. Emerging Weed Resistance Increases Tillage Intensity and Greenhouse Gas Emissions in the US Corn–Soybean Cropping System. Nat. Food 2022, 3, 266–274.
11. Gerhards, R.; Andújar Sanchez, D.; Hamouz, P.; Peteinatos, G.G.; Christensen, S.; Fernandez-Quintanilla, C. Advances in Site-Specific Weed Management in Agriculture—A Review. Weed Res. 2022, 62, 123–133.
12. Qu, H.-R.; Su, W.-H. Deep Learning-Based Weed–Crop Recognition for Smart Agricultural Equipment: A Review. Agronomy 2024, 14, 363.
13. Mkhize, Y.; Madonsela, S.; Cho, M.; Nondlazi, B.; Main, R.; Ramoelo, A. Mapping Weed Infestation in Maize Fields Using Sentinel-2 Data. Phys. Chem. Earth Parts A/B/C 2024, 134, 103571.
14. Ahmad, A.; Saraswat, D.; Aggarwal, V.; Etienne, A.; Hancock, B. Performance of Deep Learning Models for Classifying and Detecting Common Weeds in Corn and Soybean Production Systems. Comput. Electron. Agric. 2021, 184, 106081.
15. Dastres, E.; Esmaeili, H.; Edalat, M. Species Distribution Modeling of Malva neglecta Wallr. Weed Using Ten Different Machine Learning Algorithms: An Approach to Site-Specific Weed Management (SSWM). Eur. J. Agron. 2025, 167, 127579.
16. Hasan, A.S.M.M.; Diepeveen, D.; Laga, H.; Jones, M.G.K.; Sohel, F. Object-Level Benchmark for Deep Learning-Based Detection and Classification of Weed Species. Crop Prot. 2024, 177, 106561.
17. Wessner, R.N.; Frozza, R.; Duarte Da Silva Bagatini, D.; Molz, R.F. Recognition of Weeds in Corn Crops: System with Convolutional Neural Networks. J. Agric. Food Res. 2023, 14, 100669.
18. Picon, A.; San-Emeterio, M.G.; Bereciartua-Perez, A.; Klukas, C.; Eggers, T.; Navarra-Mestre, R. Deep Learning-Based Segmentation of Multiple Species of Weeds and Corn Crop Using Synthetic and Real Image Datasets. Comput. Electron. Agric. 2022, 194, 106719.
19. Andrea, C.-C.; Mauricio Daniel, B.B.; Jose Misael, J.B. Precise Weed and Maize Classification through Convolutional Neuronal Networks. In Proceedings of the 2017 IEEE Second Ecuador Technical Chapters Meeting (ETCM), Salinas, Ecuador, 16–20 October 2017; pp. 1–6.
20. Peteinatos, G.G.; Reichel, P.; Karouta, J.; Andújar, D.; Gerhards, R. Weed Identification in Maize, Sunflower, and Potatoes with the Aid of Convolutional Neural Networks. Remote Sens. 2020, 12, 4185.
21. García-Navarrete, O.L.; Santamaria, O.; Martín-Ramos, P.; Valenzuela-Mahecha, M.Á.; Navas-Gracia, L.M. Development of a Detection System for Types of Weeds in Maize (Zea mays L.) under Greenhouse Conditions Using the YOLOv5 v7.0 Model. Agriculture 2024, 14, 286.
22. Wang, B.; Yan, Y.; Lan, Y.; Wang, M.; Bian, Z. Accurate Detection and Precision Spraying of Corn and Weeds Using the Improved YOLOv5 Model. IEEE Access 2023, 11, 29868–29882.
23. Jia, Z.; Zhang, M.; Yuan, C.; Liu, Q.; Liu, H.; Qiu, X.; Zhao, W.; Shi, J. ADL-YOLOv8: A Field Crop Weed Detection Model Based on Improved YOLOv8. Agronomy 2024, 14, 2355.
24. Liu, H.; Hou, Y.; Zhang, J.; Zheng, P.; Hou, S. Research on Weed Reverse Detection Methods Based on Improved You Only Look Once (YOLO) v8: Preliminary Results. Agronomy 2024, 14, 1667.
25. Kharismawati, D.E.; Kazic, T. Maize Seedling Detection Dataset (MSDD): A Curated High-Resolution RGB Dataset for Seedling Maize Detection and Benchmarking with YOLOv9, YOLO11, YOLOv12 and Faster-RCNN. arXiv 2025, arXiv:2509.15181.
26. Sharma, A.; Kumar, V.; Longchamps, L. Comparative Performance of YOLOv8, YOLOv9, YOLOv10, YOLOv11 and Faster R-CNN Models for Detection of Multiple Weed Species. Smart Agric. Technol. 2024, 9, 100648.
27. Zheng, L.; Yi, J.; He, P.; Tie, J.; Zhang, Y.; Wu, W.; Long, L. Improvement of the YOLOv8 Model in the Optimization of the Weed Recognition Algorithm in Cotton Field. Plants 2024, 13, 1843.
28. Li, L.; Sun, R.; Xu, Y. Design and Optimization of a New Corn–Weed Detection Model with YOLOv8–GAS Based on Artificial Intelligence. J. Real Time Image Proc. 2025, 22, 167.
29. Saltık, A.O.; Allmendinger, A.; Stein, A. Comparative Analysis of YOLOv9, YOLOv10 and RT-DETR for Real-Time Weed Detection. In Computer Vision—ECCV 2024 Workshops; Del Bue, A., Canton, C., Pont-Tuset, J., Tommasi, T., Eds.; Lecture Notes in Computer Science; Springer Nature: Cham, Switzerland, 2025; Volume 15625, pp. 177–193. ISBN 978-3-031-91834-6.
30. Alkhammash, E.H. Multi-Classification Using YOLOv11 and Hybrid YOLO11n-MobileNet Models: A Fire Classes Case Study. Fire 2025, 8, 17.
31. Liao, Y.; Li, L.; Xiao, H.; Xu, F.; Shan, B.; Yin, H. YOLO-MECD: Citrus Detection Algorithm Based on YOLOv11. Agronomy 2025, 15, 687.
32. Ma, X.; Hao, Z.; Liu, S.; Li, J. Walnut Surface Defect Classification and Detection Model Based on Enhanced YOLO11n. Agriculture 2025, 15, 1707.
33. Upadhyay, A.; Sunil, G.C.; Das, S.; Mettler, J.; Howatt, K.; Sun, X. Multiclass Weed and Crop Detection Using Optimized YOLO Models on Edge Devices. J. Agric. Food Res. 2025, 22, 102144.
34. Wang, J.; Li, W. YOLO-Weed Nano: A Lightweight Weed Detection Algorithm Based on Improved YOLOv8n for Cotton Field Applications. Sci. Rep. 2025, 14, 84748.
35. Li, Y.; Guo, Z.; Sun, Y.; Chen, X.; Cao, Y. Weed Detection Algorithms in Rice Fields Based on Improved YOLOv10n. Agriculture 2024, 14, 2066.
36. Allmendinger, A.; Spaeth, M.; Saile, M.; Peteinatos, G.G.; Gerhards, R. Precision Chemical Weed Management Strategies: A Review and a Design of a New CNN-Based Modular Spot Sprayer. Agronomy 2022, 12, 1620.
37. Hasan, A.S.M.M.; Diepeveen, D.; Laga, H.; Jones, M.G.K.; Muzahid, A.A.M.; Sohel, F. Morphology-Based Weed Type Recognition Using Siamese Network. Eur. J. Agron. 2025, 163, 127439.
38. Liu, Y.; Sun, P.; Wergeles, N.; Shang, Y. A Survey and Performance Evaluation of Deep Learning Methods for Small Object Detection. Expert Syst. Appl. 2021, 172, 114602.
39. Yang, S.; Lin, J.; Cernava, T.; Chen, X.; Zhang, X. WeedDETR: An Efficient and Accurate Detection Method for Detecting Small-Target Weeds in UAV Images. Weed Sci. 2025, 73, e84.
40. Jia, F. Occlusion Target Recognition Algorithm Based on Improved YOLOv4. J. Comput. Methods Sci. Eng. 2024, 24, 3799–3811.
41. Shang, Q.; Zhang, J.; Yan, G.; Hong, L.; Zhang, R.; Li, W.; Xia, H. Target Tracking Algorithm Based on Occlusion Prediction. Displays 2023, 79, 102481.
42. Dheeraj, A.; Chand, S. Deep Learning Based Weed Classification in Corn Using Improved Attention Mechanism Empowered by Explainable AI Techniques. Crop Prot. 2025, 190, 107058.
43. Veeragandham, S.; Santhi, H. Effectiveness of Convolutional Layers in Pre-Trained Models for Classifying Common Weeds in Groundnut and Corn Crops. Comput. Electr. Eng. 2022, 103, 108315.
44. Mesías-Ruiz, G.A.; Borra-Serrano, I.; Peña, J.M.; De Castro, A.I.; Fernández-Quintanilla, C.; Dorado, J. Weed Species Classification with UAV Imagery and Standard CNN Models: Assessing the Frontiers of Training and Inference Phases. Crop Prot. 2024, 182, 106721.
45. Peng, G.; Wang, K.; Ma, J.; Cui, B.; Wang, D. AGRI-YOLO: A Lightweight Model for Corn Weed Detection with Enhanced YOLO V11n. Agriculture 2025, 15, 1971.
46. Li, X.; Qin, Y.; Wang, F.; Guo, F.; Yeow, J.T.W. Pitaya Detection in Orchards Using the MobileNet-YOLO Model. In Proceedings of the 2020 39th Chinese Control Conference (CCC), Shenyang, China, 27–29 July 2020; pp. 6274–6278.
47. Yu, H.; Che, M.; Yu, H.; Zhang, J. Development of Weed Detection Method in Soybean Fields Utilizing Improved DeepLabv3+ Platform. Agronomy 2022, 12, 2889.
48. Zou, K.; Chen, X.; Zhang, F.; Zhou, H.; Zhang, C. A Field Weed Density Evaluation Method Based on UAV Imaging and Modified U-Net. Remote Sens. 2021, 13, 310.
49. Gegen. YOLO Dataset for Corn Weed Recognition. 2026. Available online: https://zenodo.org/records/18285884 (accessed on 20 January 2026).
50. Xiao, L.; Wang, X. Interseedling Weed Detection Model of Maize Based on Improved YOLO Algorithm. J. Agric. Mech. Res. 2025, 47, 10–16.
51. Zhou, H.; Su, Y.; Chen, J.; Li, J.; Ma, L.; Liu, X.; Lu, S.; Wu, Q. Maize Leaf Disease Recognition Based on Improved Convolutional Neural Network ShuffleNetV2. Plants 2024, 13, 1621.
52. Zhou, Q.; Chai, B.; Tang, C.; Guo, Y.; Wang, K.; Nie, X.; Ye, Y. Enhanced YOLOv8 with DWR-DRB and SPD-Conv for Mechanical Wear Fault Diagnosis in Aero-Engines. Sensors 2025, 25, 5294.
53. Gu, Z.; Zhu, K.; You, S. YOLO-SSFS: A Method Combining SPD-Conv/STDL/IM-FPN/SIoU for Outdoor Small Target Vehicle Detection. Electronics 2023, 12, 3744.
54. Guo, M.-H.; Lu, C.-Z.; Liu, Z.-N.; Cheng, M.-M.; Hu, S.-M. Visual Attention Network. Comput. Vis. Media 2023, 9, 733–752.
55. Liu, W.; Lu, H.; Fu, H.; Cao, Z. Learning to Upsample by Learning to Sample. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; pp. 6027–6037.
56. Dai, X.; Chen, Y.; Yang, J.; Zhang, P.; Yuan, L.; Zhang, L. Dynamic DETR: End-to-End Object Detection with Dynamic Attention. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 11–17 October 2021; pp. 2968–2977.
57. Mi, B.; Cheng, W. Intelligent Aerodynamic Modelling Method for Steady/Unsteady Flow Fields of Airfoils Driven by Flow Field Images Based on Modified U-Net Neural Network. Eng. Appl. Comput. Fluid Mech. 2025, 19, 2440075.
58. Shafiee Sarvestani, G.; Edalat, M.; Shirzadifar, A.; Pourghasemi, H.R. Early Season Dominant Weed Mapping in Maize Field Using Unmanned Aerial Vehicle (UAV) Imagery: Towards Developing Prescription Map. Smart Agric. Technol. 2025, 11, 100956.
59. Chen, P.; Xu, W.; Zhan, Y.; Yang, W.; Wang, J.; Lan, Y. Evaluation of Cotton Defoliation Rate and Establishment of Spray Prescription Map Using Remote Sensing Imagery. Remote Sens. 2022, 14, 4206.
60. Rovira-Más, F.; Saiz-Rubio, V.; Cuenca, A.; Ortiz, C.; Teruel, M.P.; Ortí, E. Open-Format Prescription Maps for Variable Rate Spraying in Orchard Farming. J. ASABE 2024, 67, 243–257.
61. Rainio, O.; Teuho, J.; Klén, R. Evaluation Metrics and Statistical Tests for Machine Learning. Sci. Rep. 2024, 14, 6086.
62. Lin, T.-Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature Pyramid Networks for Object Detection. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 22–25 July 2017; pp. 936–944.
63. Sun, B.; Zhang, Y.; Jiang, S.; Fu, Y. Hybrid Pixel-Unshuffled Network for Lightweight Image Super-Resolution. Proc. AAAI Conf. Artif. Intell. 2023, 37, 2375–2383.
64. Shi, W.; Caballero, J.; Huszár, F.; Totz, J.; Aitken, A.P.; Bishop, R.; Rueckert, D.; Wang, Z. Real-Time Single Image and Video Super-Resolution Using an Efficient Sub-Pixel Convolutional Neural Network. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 1874–1883.
65. Ding, X.; Zhang, X.; Han, J.; Ding, G. Scaling Up Your Kernels to 31 × 31: Revisiting Large Kernel Design in CNNs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 11963–11975.
66. Wang, J.; Chen, K.; Xu, R.; Liu, Z.; Loy, C.C.; Lin, D. CARAFE: Content-Aware ReAssembly of FEatures. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 3007–3016.
67. Gao, S.; Zhong, R.; Yan, K.; Ma, X.; Chen, X.; Pu, J.; Gao, S.; Qi, J.; Yin, G.; Myneni, R.B. Evaluating the Saturation Effect of Vegetation Indices in Forests Using 3D Radiative Transfer Simulations and Satellite Observations. Remote Sens. Environ. 2023, 295, 113665.
68. Yan, K.; Gao, S.; Yan, G.; Ma, X.; Chen, X.; Zhu, P.; Li, J.; Gao, S.; Gastellu-Etchegorry, J.-P.; Myneni, R.B.; et al. A Global Systematic Review of the Remote Sensing Vegetation Indices. Int. J. Appl. Earth Obs. Geoinf. 2025, 139, 104560.
69. Sweet, D.D.; Tirado, S.B.; Springer, N.M.; Hirsch, C.N.; Hirsch, C.D. Opportunities and Challenges in Phenotyping Row Crops Using Drone-based RGB Imaging. Plant Phenome J. 2022, 5, e20044.
70. Huete, A.; Didan, K.; Miura, T.; Rodriguez, E.P.; Gao, X.; Ferreira, L.G. Overview of the Radiometric and Biophysical Performance of the MODIS Vegetation Indices. Remote Sens. Environ. 2002, 83, 195–213.
71. Wu, Z.; Chen, Y.; Zhao, B.; Kang, X.; Ding, Y. Review of Weed Detection Methods Based on Computer Vision. Sensors 2021, 21, 3647.
Figure 1. Overview of the field environment. (a–c) Representative field images collected by the system, illustrating the challenge where maize seedlings and weeds share similar morphological and spectral characteristics under complex field conditions.
Figure 2. (a) Original image captured from the field. (b) Image after horizontal flipping. (c) Image with added Gaussian noise. (d) Image after contrast enhancement (CLAHE).
Figure 4. Statistical distribution of the Fill Factor for maize and weeds. The box plots visualize the quartiles, while the overlaid scatter points represent individual sample distributions. The empirical mean values are used as correction coefficients in the VWCI model. Note that maize exhibits a higher fill factor due to its broader leaf morphology compared to the clustered growth of weeds.
Figure 5. Training dynamics and performance metrics of the SLD-YOLO11 model. The lighter curves show the raw epoch-wise values, whereas the darker thick curves indicate the corresponding smoothed trends. The plots illustrate the convergence trends of training and validation losses (box, classification, and distribution focal loss) alongside key performance indicators (Precision and mAP) over epochs. The rapid decline in loss curves and the steady ascent of mAP demonstrate the model’s robust convergence and efficient feature learning capability without signs of overfitting.
Figure 6. Efficiency–Accuracy Trade-off Comparison. The proposed method (red star) achieves a superior balance between inference speed (FPS) and detection accuracy (mAP) compared to other models.
Figure 7. Qualitative comparison of weed detection results produced by different models on the same field patch. (a–f) correspond to YOLO11-MobileNetV3, YOLOv10n, YOLO11-ShuffleNetV2, YOLOv8n, YOLOv11n, and the proposed SLD-YOLO11, respectively. Bounding boxes are annotated with the predicted class and confidence score.
Figure 8. Quantitative and qualitative error analysis of the SLD-YOLO11 model on the test set. (a) Confusion Matrix: Detailed visualization of classification performance. Inter-class misclassification between maize and weeds is zero. The primary sources of error are “background” misclassified as targets or missed targets. (b) False Positive Example: Visualization of a typical misdetection. Due to the conservative confidence threshold (Conf = 0.15), background textures such as soil clods or crop residues are occasionally misidentified as targets. (c) False Negative Example: Visualization of a typical missed detection. Insufficient feature extraction from weeds resulted in missed detections or low prediction confidence.
Figure 9. Generalization testing on sesame fields.
Figure 10. Linear regression analysis between the proposed automated VWCI and manual ground-truth weed coverage. Blue dots represent individual samples, the red line represents the linear fit, and the shaded area indicates the 95% confidence interval. The high correlation (R2 = 0.7039) confirms the reliability of VWCI for automated weed monitoring. Ground-truth coverage is derived from manual pixel-level masks.
Figure 11. Field-scale weed assessment and variable-rate application workflow. (a) Weed density heatmap aggregated on a fixed grid based on detection results. The VWCI for each grid cell is calculated using the effective weed area and shape compensation factor. (b) VRS prescription map generated based on VWCI thresholds: green indicates no-spray zones, yellow indicates low-dose spray zones, and red indicates high-dose spray zones.
Table 1. Comparison Analysis of Different Detection Models on the Corn/Weed Test Set.

| Model | Precision | Recall | mAP@0.5 | mAP@0.5:0.95 | F1-Score | FLOPs (G) |
|---|---|---|---|---|---|---|
| YOLO11-MobileNetV3 | 0.857 | 0.899 | 0.923 | 0.479 | 0.878 | 3.4 |
| YOLO11-ShuffleNetV2 | 0.911 | 0.900 | 0.948 | 0.485 | 0.905 | 2.8 |
| YOLOv8n | 0.919 | 0.906 | 0.956 | 0.523 | 0.912 | 8.7 |
| YOLOv10n | 0.887 | 0.875 | 0.925 | 0.560 | 0.881 | 7.7 |
| YOLOv11n | 0.924 | 0.938 | 0.970 | 0.575 | 0.931 | 6.1 |
| SLD-YOLO11 | 0.935 | 0.948 | 0.974 | 0.655 | 0.941 | 6.3 |
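As a quick consistency check, the F1-Score column in Table 1 is the harmonic mean of the reported Precision and Recall; a minimal sketch:

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (F1)."""
    return 2 * precision * recall / (precision + recall)

# Sanity check against Table 1 (SLD-YOLO11 row): P = 0.935, R = 0.948.
assert round(f1_score(0.935, 0.948), 3) == 0.941
```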
Table 2. Ablation Experiments for the Lightweight Weed Recognition Model. (“√” indicates that the corresponding module is included in the model, while “—” indicates that the module is not included.)

| Model | SPD-Conv | D-LKA | DySample | Precision (%) | Recall (%) | mAP@0.5 (%) | mAP@0.5:0.95 (%) | Params (M) | FLOPs (G) |
|---|---|---|---|---|---|---|---|---|---|
| YOLO11n | — | — | — | 92.4 | 93.8 | 97.0 | 57.5 | 2.6 | 6.1 |
| S-YOLO11n | √ | — | — | 92.9 | 94.3 | 97.2 | 59.8 | 2.6 | 6.0 |
| L-YOLO11n | — | √ | — | 93.2 | 94.1 | 97.3 | 61.5 | 2.8 | 6.2 |
| D-YOLO11n | — | — | √ | 92.6 | 93.9 | 97.1 | 58.9 | 2.6 | 6.1 |
| SLD-YOLO11 | √ | √ | √ | 93.5 | 94.8 | 97.4 | 65.5 | 2.8 | 6.3 |