Article

GTDR-YOLOv12: Optimizing YOLO for Efficient and Accurate Weed Detection in Agriculture

School of Electrical and Information Engineering, Jiangsu University, Zhenjiang 212013, China
* Author to whom correspondence should be addressed.
Agronomy 2025, 15(8), 1824; https://doi.org/10.3390/agronomy15081824
Submission received: 23 June 2025 / Revised: 25 July 2025 / Accepted: 25 July 2025 / Published: 28 July 2025
(This article belongs to the Section Weed Science and Weed Management)

Abstract

Weed infestation contributes significantly to global agricultural yield loss and increases the reliance on herbicides, raising both economic and environmental concerns. Effective weed detection in agriculture requires high accuracy and architectural efficiency. This is particularly important under challenging field conditions, including densely clustered targets, small weed instances, and low visual contrast between vegetation and soil. In this study, we propose GTDR-YOLOv12, an improved object detection framework based on YOLOv12, tailored for real-time weed identification in complex agricultural environments. The model is evaluated on the publicly available Weeds Detection dataset, which contains a wide range of weed species and challenging visual scenarios. To achieve better accuracy and efficiency, GTDR-YOLOv12 introduces several targeted structural enhancements. The backbone incorporates GDR-Conv, which integrates Ghost convolution and Dynamic ReLU (DyReLU) to improve early-stage feature representation while reducing redundancy. The GTDR-C3 module combines GDR-Conv with Task-Dependent Attention Mechanisms (TDAMs), allowing the network to adaptively refine spatial features critical for accurate weed identification and localization. In addition, the Lookahead optimizer is employed during training to improve convergence efficiency and reduce computational overhead, thereby contributing to the model’s lightweight design. GTDR-YOLOv12 outperforms several representative detectors, including YOLOv7, YOLOv9, YOLOv10, YOLOv11, YOLOv12, ATSS, RTMDet and Double-Head. Compared with YOLOv12, GTDR-YOLOv12 achieves notable improvements across multiple evaluation metrics. Precision increases from 85.0% to 88.0%, recall from 79.7% to 83.9%, and F1-score from 82.3% to 85.9%. In terms of detection accuracy, mAP:0.5 improves from 87.0% to 90.0%, while mAP:0.5:0.95 rises from 58.0% to 63.8%. Furthermore, the model reduces computational complexity. GFLOPs drop from 5.8 to 4.8, and the number of parameters is reduced from 2.51 M to 2.23 M. These reductions reflect a more efficient network design that not only lowers model complexity but also enhances detection performance. With a throughput of 58 FPS on the NVIDIA Jetson AGX Xavier, GTDR-YOLOv12 proves both resource-efficient and deployable for practical, real-time weeding tasks in agricultural settings.

1. Introduction

Weed infestation remains a major challenge in agriculture, reducing crop yields by competing for essential resources such as sunlight, water, and nutrients. Insufficient control not only affects productivity but also increases dependence on chemical herbicides, leading to higher costs and environmental concerns [1]. As agriculture transitions toward automation and precision farming, the demand for efficient and real-time weed detection systems has grown significantly. Traditional methods like manual removal and broad-spectrum chemical spraying are increasingly viewed as unsustainable for large-scale operations [2]. In this context, deep learning has emerged as a promising solution for crop–weed recognition, offering improved accuracy and supporting more targeted interventions [3].
Deep learning-based object detection models have become integral to agricultural vision tasks, offering significant gains in accuracy and robustness over traditional image processing methods. Among them, the YOLO (You Only Look Once) architecture is particularly valued for its real-time performance and adaptability to unstructured environments [4]. As precision agriculture increasingly depends on automated perception systems, YOLO variants have demonstrated strong applicability across diverse scenarios. For fruit and vegetable detection, ShufflenetV2-YOLOX was applied to apple localization with a favorable trade-off between speed and model size [5], while YOLOv4-Tiny combined with a ZED stereo camera enabled flower detection in greenhouses, addressing depth ambiguity and occlusion [6]. LBDC-YOLO showed robust broccoli detection under cluttered lighting conditions [7]. Additional strategies include multi-feature fusion for aquatic vegetable harvesting [8], RGB-D-based instance segmentation for lettuce height estimation [9], and efficient line extraction for seedling navigation in ridge-planted crops [10]. YOLO-based frameworks have also expanded to fine-grained agricultural tasks. Seedling-YOLO, built on YOLOv7-tiny, enhanced defect detection in broccoli planting [11], while an improved YOLOv5s enabled wheat spike recognition and feed quantity prediction in dense fields [12]. In intelligent spraying, an enhanced YOLOv8 model facilitated tomato spraying under varying lighting conditions with high accuracy [13]. Similarly, a YOLOv7-based model achieved targeted pesticide application on unhealthy grape leaves, improving coverage by 65.96% [14], and a YOLOv8-based instance segmentation framework was used for orchard tree canopy delineation, reducing off-target deposition by 40% [15]. As detection models continue to evolve, increasing focus has been placed on the more visually ambiguous and operationally critical task of weed identification. A crop row-guided YOLOv4 improved weed localization in maize by leveraging planting regularity [16], while STBNA-YOLOv5 used attention mechanisms to detect small, morphologically diverse weeds in rapeseed [17]. In dense vegetation, HAD-YOLO refined feature and anchor designs to improve detection accuracy [18]. Building on this, Liu et al. [19] integrated detection outcomes into a variable-rate spraying system for strawberries, demonstrating the potential for coupling perception with autonomous intervention. While YOLO remains dominant, non-YOLO models have also shown strong performance in weed detection. WeedMap used deep networks for large-scale semantic mapping from aerial multispectral images [20], while RT-MWDT, a lightweight transformer, enabled real-time, accurate detection in complex cornfields [21]. Collectively, these studies highlight the versatility and extensibility of YOLO-based frameworks and, more broadly, deep learning approaches in addressing the demands of real-time, high-precision agricultural perception under challenging field conditions.
Despite the widespread application of deep learning in weed detection, there remains an ongoing need to improve model robustness under complex field conditions. Early approaches aimed to reduce the effects of fluctuating illumination using traditional vision techniques, which relied on handcrafted features to extract discriminative cues from field imagery. In cotton fields, color-based methods have been used to detect weeds across multiple growth stages [22], while multi-feature fusion helped capture shape and texture variations across diverse weed populations [23]. Machine vision-based classification systems have also been developed to automatically recognize weeds in structured crop settings, laying a foundation for large-scale visual deployment [24]. These strategies remain valuable for systems operating under dynamic conditions, but recent research has shifted focus toward improving robustness to natural illumination changes, particularly at dawn and dusk. TS-YOLO demonstrated effective all-day detection of tea canopy shoots, offering a lightweight and lighting-adaptive solution [25]. Ultra-lightweight segmentation models have also been employed for wheat lodging detection across diverse visual scenes [26]. Beyond lighting, structural variability and field complexity have led to the development of adaptive deep learning models. ADeepWeeD uses incremental learning to identify emerging weed species, improving recognition across stages and environments [27]. PD-YOLO enhances small-object detection through cross-scale feature fusion, enabling weed recognition in dense canopies [28]. Lightweight CNNs support efficient weed segmentation and classification on embedded platforms [29,30]. At the robotic level, a Mask R-CNN-based system with manipulator control enables accurate weed removal in garlic and ginger fields [31].
Although recent advances have improved the performance and flexibility of agricultural vision systems, maintaining such performance under real-time, resource-constrained conditions remains a key challenge. This is particularly evident in real-world scenarios, where limited computational power and energy availability make any trade-off between speed and precision critical to operational success. Many high-performing models rely on deep architectures with high GFLOPs and large parameter counts, leading to greater memory usage and slower inference. These limitations hinder their deployment on mobile platforms such as unmanned ground vehicles and drones. To address these constraints, this study aims to develop a lightweight and robust weed detection framework tailored for real-time, resource-limited precision agriculture. Based on this objective, we propose GTDR-YOLOv12, an enhanced YOLOv12-based architecture optimized for efficient weed detection under challenging field conditions. The proposed model incorporates a series of lightweight yet discriminative modules to handle issues such as variable lighting and dense vegetation. Specifically, the initial backbone layers are replaced with GDR-Conv, which combines Ghost convolution with Dynamic ReLU (DyReLU) to improve early-stage feature expressiveness without increasing computation. Furthermore, we introduce GTDR-C3 (Ghost and Task-Dependent Attention with Dynamic ReLU–C3), which extends this design by integrating the Task-Dependent Attention Mechanism (TDAM) and DyReLU into the Ghost framework. GTDR-C3 replaces the original C3 module and significantly enhances spatial feature extraction, particularly for small weed instances, dense targets, and low-contrast backgrounds. These redesigned GTDR-C3 blocks systematically substitute the original C3k2 and A2C2f modules in both the backbone and neck, improving multi-scale feature learning while reducing complexity. In addition, we replace AdamW with the Lookahead optimizer to improve training stability and generalization. By tracking fast weight updates through a slow-moving average, Lookahead reduces overfitting, particularly on datasets with inter-class ambiguity and limited labels. By integrating lightweight convolution, adaptive attention, and stable optimization, GTDR-YOLOv12 achieves a strong trade-off between accuracy and efficiency, supporting real-time deployment. Unlike prior studies that use lightweight or attention modules in isolation, our work adopts a coordinated design where GDR-Conv is used both independently and within GTDR-C3, enabling synergistic feature learning across the backbone and neck. This synergy is especially valuable for agricultural robots, where high-accuracy weed detection must be achieved under tight hardware constraints.

2. Data Resources and Optimization Strategies for Weed Detection Training

2.1. Dataset

Rich variability in visual and spatial characteristics is a prerequisite for robust weed detection model evaluation. Figure 1 presents several representative image samples from the Weeds Detection dataset [32], which comprises over 2000 field images annotated in YOLO format for crop and weed classes, demonstrating diverse field conditions essential for evaluating model generalization. The dataset includes samples with dense weed infestations (A and E), isolated species (C), partial occlusions and overlapping foliage (G and H), mixed ground textures (D), bare soil with sparse vegetation (F), and clearly defined crop structures (I). These variations reflect realistic agricultural conditions, making the dataset a suitable benchmark for assessing detection performance in scenarios involving densely distributed weeds, small target instances, and low visual contrast between vegetation and soil background. In addition to spatial diversity, the dataset also includes plants at different growth stages, which vary in shape, size, and spatial distribution. This allows evaluation of the model’s ability to handle morphological changes over time, an important aspect for field deployment. Additionally, the dataset covers a wide range of field scenarios, including row-crop planting patterns, uneven soil distribution and heterogeneous weed–crop mixtures. It includes imagery from diverse crop environments such as maize and soybean fields, where spatial arrangements and weed interference patterns vary significantly, further enhancing its applicability to practical weed detection tasks. These variations introduce occlusion, low vegetation–soil contrast, dense small weed instances, and other visual complexities that challenge detection accuracy. GTDR-YOLOv12 addresses these through attention-guided spatial feature refinement and lightweight convolutional modules designed for robustness under such conditions. While the dataset is externally collected, it covers a broad range of visual and spatial variations commonly found in real-world agricultural environments, including dense weed interference, mixed planting structures, occlusion, and lighting variability. These characteristics make it suitable for evaluating model performance under realistic field conditions.

2.2. Composition of Weed Species in Weeds Detection Dataset

The dataset includes several common and agriculturally significant weed species. As illustrated in Figure 2, A represents Argemone mexicana, a spiny broadleaf weed with toxic alkaloids. B shows Boerhavia diffusa, a ground-spreading weed with strong regenerative ability. C displays wild Clover, which can rapidly colonize crop fields. D is Bermuda Grass, a dense-growing grassy weed that spreads via stolons and rhizomes. These examples illustrate some of the most commonly annotated weed species in the dataset and constitute only a subset of the full range of weed categories included.

2.3. Dataset Training and Preparation Strategy

Table 1 presents the configuration of the experimental platform used throughout this study. Equipped with an Intel(R) Xeon(R) Gold 6226R CPU at 2.90 GHz and an Nvidia RTX 3090 GPU with 24 GB of VRAM, the system provided sufficient computational power to support high-resolution image processing and model training. All implementations were conducted using PyTorch 2.0.1 and Python 3.8.20, ensuring compatibility with modern deep learning pipelines and facilitating stable training across large-scale datasets. The hardware environment enabled high-speed training of the proposed model, while the software stack provided essential support for implementing and debugging complex architectural components within the YOLO framework.
Table 2 summarizes the dataset configuration following preprocessing and augmentation procedures. To enhance data diversity and improve model generalization, several augmentation strategies were employed, including simulated weather conditions (fog or rain applied to 10% of the images), brightness shifts (±25%), and gamma corrections within the range of 0.7 to 1.5. Simulated weather conditions were introduced to mimic real-world environmental noise, enhancing robustness to visibility degradation. Brightness shifts simulate variations in lighting across different times and field conditions. Gamma correction improves contrast adaptation, particularly useful for low-visibility weed instances. Auto-orientation was applied to ensure consistent image alignment. The resulting dataset consisted of 3982 images, partitioned into 3735 for training, 494 for validation, and 247 for testing. Each image was annotated with two object categories (crop and weed), and training was performed over 100 epochs. Training was conducted using a batch size of 16 and input image resolution of 640 × 640. A constant learning rate schedule was adopted, with both the initial and final learning rates set to 0.01, and cosine decay disabled. To prevent overfitting and enhance training stability, early stopping was enabled with a patience of 20 epochs, and weight decay regularization was applied with a coefficient of 0.0005. This stabilizes convergence and promotes better generalization across varying field conditions. The comprehensive dataset configuration, together with the established hardware and software environment, provides a robust and consistent foundation for implementing and evaluating the proposed model enhancements presented.
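For reference, the training configuration described above can be expressed as a single Ultralytics-style training call. The snippet below is a minimal sketch, assuming an Ultralytics-compatible YOLOv12 implementation; the model and dataset YAML file names are illustrative placeholders, not files released with this study.

```python
from ultralytics import YOLO  # assumes an Ultralytics-compatible YOLOv12 build

# Minimal sketch of the training setup in Table 2 / Section 2.3.
# "yolov12n.yaml" and "weeds-detection.yaml" are placeholder names.
model = YOLO("yolov12n.yaml")
model.train(
    data="weeds-detection.yaml",  # two classes: crop, weed
    epochs=100,
    batch=16,
    imgsz=640,
    lr0=0.01,                     # initial learning rate
    lrf=0.01,                     # final learning rate (constant schedule)
    cos_lr=False,                 # cosine decay disabled
    patience=20,                  # early stopping patience
    weight_decay=0.0005,
)
```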

3. Methodology

3.1. Baseline Architecture: YOLOv12

YOLOv12 is a real-time object detection framework whose overall architecture is depicted in Figure 3. It adopts an encoder–decoder structure augmented with the A2C2f module, which integrates attention mechanisms and compound convolutions to enhance feature representation across scales. Together with C3k2 blocks and multiscale path aggregation, this design improves spatial interaction and contextual reasoning, contributing to efficient inference and improved general object detection performance. YOLOv12 possesses notable architectural strengths. However, it struggles in complex agricultural environments, especially when faced with densely distributed weeds, small target instances, and low visual contrast between vegetation and soil. These limitations manifest as missed detections, inaccurate boundary localization and diminished recall at high precision thresholds, indicating a suboptimal balance between localization and classification performance. These issues are primarily attributed to the limited receptive field of C3k2 blocks, the lack of task-specific attention in A2C2f, and static feature aggregation strategies. As a result, YOLOv12 struggles to enhance weak weed signals in visually ambiguous backgrounds, leading to missed detections and coarse boundary localization. Furthermore, standard convolutions in the backbone contribute to redundant parameters, limiting the model’s ability to scale down for real-time deployment on edge devices. These observed limitations highlight the need for a more specialized detection architecture, for which an improved version named GTDR-YOLOv12 is introduced, as illustrated in Figure 4, incorporating targeted modifications to address these deficiencies.

3.2. Improved YOLOv12: GTDR-YOLOv12

GTDR-YOLOv12 incorporates several lightweight and adaptive modules to enhance detection performance while maintaining computational efficiency. Key components include the GDR-Conv module, which integrates Ghost convolution and DyReLU for efficient low-level feature extraction, and the GTDR-C3 module that combines Ghost convolution, DyReLU and TDAM to strengthen semantic discrimination. These modules are strategically embedded within both the backbone and neck, where GDR-Conv layers enhance early-stage spatial representation and GTDR-C3 modules consistently replace original C3 blocks to strengthen multi-scale feature fusion. This design improves efficiency while preserving detection accuracy. As illustrated in Figure 4, the original YOLOv12 architecture is revised in both the backbone and neck. In the backbone, the two initial convolutional layers are replaced by GDR-Conv modules to reduce redundancy in early-stage processing, where high-resolution features dominate. Ghost Convolution reduces redundant computations by generating intrinsic feature maps with fewer parameters, improving efficiency without sacrificing detail. DyReLU introduces dynamic activation, enabling adaptive feature modulation based on local context. Together, they enhance the network’s ability to extract fine-grained textures essential for detecting small and visually subtle weed instances. Furthermore, all subsequent feature extraction blocks, originally composed of C3k2 and A2C2f modules, are replaced by GTDR-C3 modules. This comprehensive substitution enhances the model’s ability to capture spatially complex patterns and recalibrate feature importance using TDAM. Specifically, TDAM helps isolate and emphasize discriminative spatial cues under conditions of visual clutter. At the same time, it maintains a lightweight design suitable for real-time deployment. Such efficiency is particularly beneficial under challenging agricultural conditions, where occlusion, background clutter, and varying object scales demand both discriminative power and computational restraint. In the neck, conventional C3k2 and A2C2f blocks are likewise replaced with GTDR-C3, further reinforcing the semantic hierarchy and maintaining lightweight computation. Additionally, GTDR-YOLOv12 replaces the AdamW optimizer with Lookahead, which improves training stability and convergence by synchronizing fast and slow weight updates. By mitigating abrupt gradient oscillations, Lookahead enables more stable feature updates, which is particularly beneficial in datasets with high inter-class ambiguity and sparse annotations, where noisy gradients can easily mislead optimization. These targeted architectural and optimization refinements collectively contribute to a more compact but more capable network, achieving enhanced performance while reducing parameters and GFLOPs.

3.2.1. GDR-Conv

The structure of GDR-Conv, as illustrated in Figure 5, is designed to provide a lightweight and expressive alternative to standard convolution by integrating Ghost convolution with DyReLU activation. The module consists of two sequential branches: the first applies a 1 × 1 convolution followed by BatchNorm and DyReLU to extract primary features with minimal computation. The second branch applies a depthwise 5 × 5 convolution to further expand spatial representation, also followed by normalization and DyReLU. Both outputs are concatenated along the channel dimension to form the final feature map. The use of DyReLU introduces input-dependent nonlinearities by learning dynamic activation parameters (a_1, b_1, a_2, b_2) through global pooling and pointwise convolution, enabling the module to better adapt to local semantic variations. This mechanism follows the original DyReLU design proposed by Chen et al. [33], which enhances representation capacity with negligible computational overhead. This design enhances feature diversity and discrimination while significantly reducing computational cost, making GDR-Conv especially suitable for resource-constrained agricultural detection tasks.
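To make the module structure concrete, the following PyTorch sketch implements GDR-Conv as described above: a primary 1 × 1 convolution branch and a depthwise 5 × 5 cheap branch, each followed by BatchNorm and DyReLU, with the two outputs concatenated. The DyReLU coefficient ranges, the reduction ratio, and the equal channel split between branches are assumptions for illustration and may differ from the authors' implementation.

```python
import torch
import torch.nn as nn

class DyReLU(nn.Module):
    """Simplified channel-wise DyReLU: y = max(a1*x + b1, a2*x + b2), where the
    coefficients (a1, b1, a2, b2) are predicted per channel from global context.
    Coefficient ranges below are illustrative, initialized near a standard ReLU."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        hidden = max(channels // reduction, 1)
        self.fc = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),               # global pooling
            nn.Conv2d(channels, hidden, 1),        # pointwise squeeze
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, 4 * channels, 1),    # four coefficients per channel
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        theta = self.fc(x).view(b, 4, c, 1, 1)
        a1 = 1.0 + 0.5 * torch.tanh(theta[:, 0])   # slope of the first piece (near 1)
        a2 = 0.5 * torch.tanh(theta[:, 1])         # slope of the second piece (near 0)
        b1 = 0.5 * torch.tanh(theta[:, 2])
        b2 = 0.5 * torch.tanh(theta[:, 3])
        return torch.maximum(a1 * x + b1, a2 * x + b2)

class GDRConv(nn.Module):
    """GDR-Conv sketch: Ghost-style primary (1x1) and cheap (depthwise 5x5) branches,
    each followed by BatchNorm and DyReLU, concatenated along the channel axis.
    Assumes an even number of output channels (half from each branch)."""
    def __init__(self, c_in, c_out, stride=1):
        super().__init__()
        c_half = c_out // 2
        self.primary = nn.Sequential(
            nn.Conv2d(c_in, c_half, 1, stride, 0, bias=False),
            nn.BatchNorm2d(c_half),
            DyReLU(c_half),
        )
        self.cheap = nn.Sequential(
            nn.Conv2d(c_half, c_half, 5, 1, 2, groups=c_half, bias=False),  # depthwise 5x5
            nn.BatchNorm2d(c_half),
            DyReLU(c_half),
        )

    def forward(self, x):
        y = self.primary(x)
        return torch.cat([y, self.cheap(y)], dim=1)
```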

3.2.2. GTDR-C3

As illustrated in Figure 6, the enhanced GTDR-C3 module is a weed detection task-aware residual structure designed to enhance both efficiency and semantic focus in agricultural object detection tasks. The input feature map is divided into two parallel paths. The main branch sequentially passes through two GTDRBlocks, each consisting of a GDR-Conv module followed by a TDAM. While the GDR-Conv handles lightweight feature transformation, the TDAM adaptively recalibrates feature responses by leveraging global context. Specifically, TDAM begins with global average pooling to encode channel-wise statistics, followed by a flattening operation and two fully connected layers with a ReLU activation in between to generate attention weights. These weights are reshaped and applied via channel-wise multiplication to modulate the original feature map, thereby enhancing task-relevant activations. A GhostBottleneck block is then repeated n times to further deepen the semantic representation. The outputs of the two GTDRBlocks are concatenated along the channel dimension and refined by a third GTDRBlock. Meanwhile, the identity branch preserves the input. If the input and output channels match and shortcut is enabled, the identity and processed branches are fused via element-wise residual addition; otherwise, only the transformed main path is propagated. This structure allows the GTDR-C3 module to selectively amplify informative features while maintaining low complexity, making it effective for dense, occluded and small-object detection in agricultural field environments.
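The attention path of the GTDRBlock can be summarized as follows. The sketch below implements the TDAM channel-attention sequence described above (global average pooling, flattening, two fully connected layers with a ReLU in between, channel-wise multiplication) and pairs it with the GDR-Conv sketch from Section 3.2.1. The sigmoid gate and reduction ratio are illustrative assumptions not specified in the text, and the repeated GhostBottleneck, concatenation, and residual fusion of the full GTDR-C3 module are omitted for brevity.

```python
import torch.nn as nn

class TDAM(nn.Module):
    """Task-dependent channel attention: GAP -> flatten -> FC -> ReLU -> FC,
    reshaped and applied by channel-wise multiplication.
    The sigmoid gate and reduction ratio are assumptions of this sketch."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        hidden = max(channels // reduction, 1)
        self.fc = nn.Sequential(
            nn.Linear(channels, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.fc(x.mean(dim=(2, 3)))   # global average pooling + flatten
        return x * w.view(b, c, 1, 1)     # recalibrate the original feature map

class GTDRBlock(nn.Module):
    """GDR-Conv followed by TDAM, the basic unit inside GTDR-C3
    (GDRConv refers to the sketch given in Section 3.2.1)."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.conv = GDRConv(c_in, c_out)
        self.attn = TDAM(c_out)

    def forward(self, x):
        return self.attn(self.conv(x))
```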

3.2.3. Lookahead Optimizer

The Lookahead optimization strategy, depicted in Figure 7, introduces an outer-loop mechanism that stabilizes training by synchronizing fast-updating weights with a set of slow-moving reference parameters. Each training iteration begins with a standard update from the inner optimizer, after which the algorithm checks whether the current step count la_step has reached the predefined interval la_steps. If not, the process bypasses the outer update and proceeds normally. When the condition is satisfied, the current weights are interpolated with the cached parameters. Optionally, momentum buffers are updated depending on the selected mode (reset, pullback, or none). In reset mode, the momentum is cleared after synchronization to encourage fresh gradient directions; pullback mode aligns the momentum with the updated weights to maintain directional consistency, while none preserves the original momentum, allowing uninterrupted accumulation of historical gradients. The cached_p parameters are then refreshed with the new values, the synchronization counter la_step is reset, and the loss is returned. This process enhances convergence stability and generalization by mitigating the effects of sharp parameter oscillations during training.
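The outer-loop logic above can be captured in a compact wrapper around any inner optimizer. The following is a minimal sketch: it implements only the basic interpolation step (the momentum reset/pullback modes are omitted), and the synchronization interval k and interpolation factor alpha are common defaults rather than the values used in this study.

```python
import torch

class Lookahead:
    """Minimal Lookahead wrapper: after every k inner-optimizer steps,
    slow <- slow + alpha * (fast - slow), then fast <- slow.
    Momentum handling (reset / pullback / none) is omitted in this sketch."""
    def __init__(self, base_optimizer, k=5, alpha=0.5):
        self.base = base_optimizer
        self.k, self.alpha = k, alpha
        self.la_step = 0  # steps since the last synchronization
        self.cached_p = [
            [p.detach().clone() for p in group["params"]]
            for group in base_optimizer.param_groups
        ]

    def zero_grad(self, set_to_none=True):
        self.base.zero_grad(set_to_none=set_to_none)

    def step(self, closure=None):
        loss = self.base.step(closure)  # fast (inner) update
        self.la_step += 1
        if self.la_step >= self.k:      # outer (slow) update
            for group, slow_params in zip(self.base.param_groups, self.cached_p):
                for p, slow in zip(group["params"], slow_params):
                    slow.add_(p.detach() - slow, alpha=self.alpha)
                    p.data.copy_(slow)
            self.la_step = 0
        return loss

# Usage (illustrative): wrap the inner optimizer used for the fast updates.
# optimizer = Lookahead(torch.optim.AdamW(model.parameters(), lr=0.01))
```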

3.3. Performance Metrics

Model performance is evaluated using a combination of classification metrics and computational indicators. As shown in Equations (1) and (2), precision and recall, respectively, measure the proportion of correctly predicted positive samples and the model’s sensitivity to actual positives. These two are combined into a single measure, the F1-score, in Equation (3), which balances precision and recall. To further quantify performance across varying IoU thresholds, the mean Average Precision (mAP) is adopted, as formulated in Equation (4). On the computational side, the number of floating-point operations (FLOPs) required by convolutional layers is computed using Equation (5), and normalized to GFLOPs in Equation (6) to reflect inference efficiency. The following provides detailed definitions and formulations of these evaluation metrics.
Precision quantifies the proportion of correctly identified positive instances among all instances predicted as positive. It is computed by dividing the number of true positives (TP) by the sum of true positives and false positives (FP), as shown in Equation (1). This metric reflects the model’s ability to avoid false alarms in classification tasks.
\text{Precision} = \frac{TP}{TP + FP} \qquad (1)
Recall measures the model’s ability to correctly identify all relevant positive instances. It is calculated by dividing the TP by the sum of true positives and false negatives (FN), as expressed in Equation (2). A higher recall indicates greater sensitivity to actual target objects, minimizing missed detections.
\text{Recall} = \frac{TP}{TP + FN} \qquad (2)
F1-score provides a harmonic mean of precision and recall, offering a single metric that reflects the balance between FP and FN. It is particularly useful when evaluating the trade-off between detection accuracy and coverage, ensuring that the model performs consistently in identifying relevant targets without over-predicting. As indicated in Equation (3), a higher F1-score implies a better balance between correct detections and missed instances.
F1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \qquad (3)
mAP serves as a comprehensive evaluation metric by integrating both precision and recall across a range of IoU thresholds. It is computed by averaging the area under the precision–recall curve for each class and then taking the mean across all classes. As shown in Equation (4), this metric reflects the model’s ability to accurately localize and classify objects across various confidence levels and spatial overlaps.
\text{mAP} = \frac{1}{T} \sum_{t=1}^{T} \text{AP}_t \qquad (4)
FLOPs quantify the total number of arithmetic operations a model performs during a single forward pass. This value is determined by summing the multiply-accumulate operations across all convolutional and fully connected layers, as shown in Equation (5). To facilitate comparison and reflect hardware efficiency, FLOPs are typically normalized into GFLOPs, which represent billions of operations, as defined in Equation (6). This metric is essential for assessing the computational complexity and deployment feasibility of a model, particularly on edge or resource-constrained devices.
\text{FLOPs} = 2 \cdot C_{\text{in}} \cdot C_{\text{out}} \cdot K^2 \cdot H_{\text{out}} \cdot W_{\text{out}} \qquad (5)
\text{GFLOPs} = \frac{\text{FLOPs}}{10^9} \qquad (6)
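As a worked example of Equations (5) and (6), the helper below computes the FLOPs of a single convolutional layer and normalizes the result to GFLOPs; the layer dimensions in the example are arbitrary and chosen only for illustration.

```python
def conv_flops(c_in: int, c_out: int, k: int, h_out: int, w_out: int) -> int:
    """FLOPs of one convolutional layer per Equation (5):
    2 * C_in * C_out * K^2 * H_out * W_out (multiplies and adds counted separately)."""
    return 2 * c_in * c_out * k * k * h_out * w_out

# Example: a 3x3 convolution mapping 64 -> 128 channels on a 160x160 feature map.
flops = conv_flops(64, 128, 3, 160, 160)
print(f"{flops / 1e9:.2f} GFLOPs")  # Equation (6): normalize to billions of operations
```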

4. Experimental Validation

4.1. Convergence Analysis of GTDR-YOLOv12

As shown in Figure 8, GTDR-YOLOv12 exhibits stable and consistent convergence behavior throughout the 100 training epochs. All loss components, including box loss, classification loss, and distribution focal loss, show a smooth and monotonic decrease on both the training and validation sets, indicating effective learning without signs of overfitting. Correspondingly, the evaluation metrics including precision, recall, mAP at 0.5, and mAP at 0.5 to 0.95 demonstrate a steady upward trend, with performance saturating at high values by the end of training. This convergence behavior, as illustrated in the plots, underscores the effectiveness of the proposed GTDR-C3 and GDR-Conv integration, as well as the contribution of the Lookahead optimizer in promoting training stability, faster convergence, and improved generalization under complex agricultural conditions.

4.2. Impact of Lookahead Optimizer on Detection Performance

Among the commonly used optimization strategies in YOLO-based object detection frameworks, Stochastic Gradient Descent (SGD) and AdamW are widely adopted due to their simplicity and empirical effectiveness. However, both have inherent limitations—SGD often suffers from slow convergence and sensitivity to learning rate schedules, while AdamW, though faster and adaptive, can lead to sharp minima and occasional overfitting. To investigate potential improvements, we introduce the Lookahead optimizer into the YOLOv12 training pipeline and compare its performance against these two baseline optimizers on the Weeds Detection dataset. As summarized in Table 3, the Lookahead-based model achieves superior performance across most metrics, attaining the highest precision (86.9%), recall (80.2%), and F1-score (83.4%), while also delivering the lowest inference latency (2.6 ms). Although AdamW slightly outperforms in terms of mAP:0.5 (87.0% vs. 86.7%), Lookahead offers a more balanced trade-off between detection accuracy and computational efficiency. In contrast, SGD performs considerably worse, with only 72.7% mAP:0.5 and 67.2% F1-Score, reaffirming its limited robustness in complex agricultural environments. The performance advantage of Lookahead lies in its unique two-stage update mechanism, which decouples fast inner updates from slow stabilizing updates. Compared to SGD, which often suffers from unstable gradients and slow convergence, Lookahead enables more consistent training and faster convergence. Unlike AdamW, which can converge quickly but is prone to sharp minima and overfitting, Lookahead encourages flatter, more generalizable solutions in the loss landscape. This distinction is particularly beneficial in visually complex tasks such as crop–weed detection, where noise and class ambiguity are common. Empirically, Lookahead delivers better precision, recall, and inference speed than both SGD and AdamW when integrated into YOLOv12, making it a compelling choice for real-time applications in precision agriculture.

4.3. Ablation Analysis of GTDR-YOLOv12

The ablation study in Table 4 systematically examines the contributions of each proposed component in GTDR-YOLOv12 to detection performance and computational efficiency. Replacing the default AdamW optimizer with Lookahead alone leads to improved precision (from 81.5 to 86.9) and F1-score (from 82.3 to 83.4), accompanied by reduced inference time and parameter count. This improvement can be attributed to Lookahead’s ability to stabilize optimization trajectories by synchronizing fast and slow weights, which helps the model escape sharp local minima and generalize better on validation data. The integration of the GDR-Conv module leads to notable improvements in recall (from 79.7 to 81.9) and mAP:0.5 (from 87.0 to 88.1). This enhancement arises from the synergy between Ghost convolution, which minimizes redundant computations, and DyReLU, which dynamically adapts activation patterns based on input context. By improving early-stage feature extraction while preserving model efficiency, GDR-Conv facilitates the capture of finer-grained visual cues, which is particularly beneficial for detecting small and visually subtle weed instances in complex agricultural scenes. Replacing conventional C3k2 and A2C2f blocks with GTDR-C3 yields further improvements in recall (up to 85.0) and mAP:0.5:0.95 (from 58.0 to 64.0), showing the model’s enhanced capacity to handle scale variation and background clutter. This is primarily due to the embedded TDAM mechanism, which recalibrates spatial and channel-wise attention based on global semantic context. Unlike standard C3 blocks, GTDR-C3 not only extracts features but also selectively amplifies those most relevant to the detection task. When Lookahead, GDR-Conv, and GTDR-C3 are combined within the framework, the final model achieves the best overall balance: the highest precision (87.3), the highest F1-score (85.8), and the highest mAP values (90.9 at mAP:0.5, 65.5 at mAP:0.5:0.95). Notably, this configuration also results in the lowest GFLOPs (4.8) and the smallest parameter count (2.23 M) among all variants, highlighting the model’s superior trade-off between accuracy and efficiency. In addition to the standard detection metrics, we further report the average Intersection-over-Union (Avg. IoU) for true positive detections at an IoU threshold of 0.5. Compared to YOLOv12 (81.2%), the final model achieves 83.4%, reflecting improved localization accuracy. This metric provides a direct quantification of how well the model aligns bounding boxes with ground-truth objects. These results validate the effectiveness of the proposed improvements, demonstrating that each module not only contributes independently but also provides synergistic advantages.

4.4. Experimental Evaluation of YOLOv12 and GTDR-YOLOv12

The performance differences highlighted in the red, green and blue regions of Figure 9 clearly demonstrate the advantages introduced by GTDR-YOLOv12 over the baseline YOLOv12, particularly as a result of the proposed architectural enhancements. In the green box region (confidence threshold under 0.1), GTDR-YOLOv12 achieves consistently higher F1 scores compared to YOLOv12, particularly for weed detection. This suggests that the proposed model is more effective at retaining valid predictions with lower confidence values. Such performance is especially important under low-confidence thresholds, where small, low-contrast and ambiguous targets are more likely to be overlooked. In the red box, which spans the mid-confidence range (approximately 0.2–0.7), GTDR-YOLOv12 achieves consistently higher F1 scores for both “crop” and “weed” classes. In the blue box, representing the high-confidence region (confidence > 0.7), GTDR-YOLOv12 exhibits a slower decay in F1 score compared to YOLOv12. This reflects the model’s enhanced robustness and reduced overconfidence in misclassifications. The observed performance gains of GTDR-YOLOv12, especially under low- and mid-confidence thresholds, are closely linked to its architectural enhancements. The introduction of GDR-Conv and GTDR-C3 modules in the early backbone improves the model’s ability to capture fine-grained and texture-sensitive features, which is crucial for identifying visually subtle weed targets. Moreover, the repeated use of GTDR-C3 across both the backbone and neck facilitates deep feature reuse and reinforces semantic consistency across scales. This hierarchical enhancement strengthens spatial detail retention and improves the confidence calibration of small and ambiguous predictions, thereby contributing to the model’s stable F1 scores across confidence levels. Overall, GTDR-YOLOv12 demonstrates more stable F1 performance than YOLOv12 in the low-confidence (green box), mid-confidence (red box), and high-confidence (blue box) regions. This improvement reflects greater robustness and reliability in practical deployment, enabling more consistent detection performance under diverse field conditions.
The Precision–Recall (PR) curves in Figure 10 provide further insight into the detection performance differences between YOLOv12 (left) and GTDR-YOLOv12 (right). In the high-recall region (recall > 0.6), marked by the red box, GTDR-YOLOv12 consistently sustains higher precision compared to the baseline. This indicates that, under demanding conditions where the model is required to detect nearly all target instances, the improved framework is more effective at minimizing false positives without sacrificing recall. These advantages stem from the integration of task-aware attention via the TDAM mechanism and the use of DyReLU in the GDR-Conv and GTDR-C3 modules, which collectively enhance the network’s ability to extract discriminative features from small, densely distributed and low-contrast weed targets. By refining both spatial focus and activation adaptivity, these modules reduce misclassification errors in visually ambiguous regions. Moreover, the smoother and more stable slope of the GTDR-YOLOv12 curves implies improved confidence calibration and less overfitting to specific recall levels, which can be attributed to the Lookahead optimizer’s contribution to more stable training dynamics. The result is a higher mAP:0.5 (0.900 vs. 0.870) and better per-class performance (crop: 0.882 vs. 0.831, weed: 0.918 vs. 0.909), demonstrating that the proposed enhancements lead not only to better peak accuracy but also to more robust detection behavior across operational thresholds in real-world agricultural scenarios.
The visual comparison in Figure 11 clearly highlights the qualitative improvements of GTDR-YOLOv12 over the baseline YOLOv12 in terms of detection accuracy, localization precision and robustness under challenging field conditions. Each row in the figure represents a distinct detection source; the first row shows the ground truth annotations, the second displays predictions by YOLOv12, and the third shows predictions by the proposed GTDR-YOLOv12. In the first column, GTDR-YOLOv12 exhibits superior boundary recognition by accurately localizing the “crop” instances with well-aligned bounding boxes, even under partially occluded conditions. In contrast, YOLOv12 fails to detect the crop instance, revealing its difficulty in recognizing objects that are partially truncated by image boundaries. In the second column, GTDR-YOLOv12 successfully distinguishes and localizes two closely spaced small weed instances, whereas YOLOv12 fails to separate them, resulting in merged or missed detections. It also accurately identifies fine-leaved weeds that are typically overlooked by the baseline model. To evaluate the model’s multi-class performance and generalization capability, we further validated it on the CropAndWeed dataset [34]. In the third column, under exposure distortion, YOLOv12 misclassifies a weed with diminished contrast, while GTDR-YOLOv12 correctly classifies and localizes it, showcasing better robustness to illumination degradation. In the fourth column, GTDR-YOLOv12 detects several small-sized weeds that YOLOv12 overlooks, reflecting improved sensitivity to fine-grained objects. This contrast highlights differences in spatial resolution and discriminative capacity, particularly in high-density vegetation zones. Collectively, these visual results align with the improvements observed in F1 curves (Figure 9) and PR curves (Figure 10), validating the overall effectiveness of GTDR-YOLOv12.
Beyond the performance gains over YOLOv12, Figure 11 also illustrates meaningful differences between GTDR-YOLOv12 predictions and ground truth annotations, offering insight into model–annotation discrepancies. In the first column, GTDR-YOLOv12 misses the small crop instance in the top-right corner, likely due to partial occlusion and low contrast with the background. Additionally, the large central bounding box may have suppressed the smaller detection during non-maximum suppression. In the second column, GTDR-YOLOv12 correctly identifies the presence of two adjacent weed instances, but still outputs a merged bounding box instead of separating them, as performed in the ground truth. This suggests that while recognition sensitivity is improved, the model’s spatial resolution in dense regions could be further enhanced. In the fourth column, GTDR-YOLOv12 accurately detects the dominant weeds (weed8 and weed0), closely aligning with the ground truth. However, the ground truth includes questionable labels such as dried leaves or debris marked as weed8. GTDR-YOLOv12 avoids predicting these ambiguous instances, which may be counted as errors in evaluation but arguably demonstrate better judgment in visually cluttered conditions. Overall, these comparisons underscore the value of qualitative error analysis for interpreting model behavior and highlight GTDR-YOLOv12’s potential to approach or even exceed human annotation quality in certain complex cases.
As shown in Figure 12, the first and third columns represent normal lighting, where both YOLOv12 and GTDR-YOLOv12 detect most objects with reasonable accuracy. The second column shows images with brightness reduced by approximately 35%, while the fourth column represents a condition with brightness increased by around 60%. Under reduced lighting, YOLOv12 shows a noticeable decline, missing several weeds and failing to detect crops with low contrast. This highlights its reliance on brightness-sensitive features. Under enhanced lighting, YOLOv12 exhibits overexposure effects, such as faded boundaries and occasional missed detections, especially for light-colored targets. In contrast, GTDR-YOLOv12 maintains stable detection across lighting conditions, with minimal difference between normal and dim environments, and shows better tolerance to overexposure, maintaining accurate localization and class assignment. This robustness primarily benefits from the GDR-Conv module, which integrates Ghost Convolution and DyReLU to extract rich and compact features while dynamically adapting to varying brightness. In addition, the GTDR-C3 module further strengthens this effect by incorporating a TDAM module, enabling better fusion of global context and local detail. The synergy between these two modules allows the model to maintain reliable perception even when object textures and edges are obscured by low illumination. Nevertheless, limitations persist. Some small crops and fine-leaved weeds are still missed, and low-contrast objects under dim light show minor localization errors.
Figure 13 highlights the impact of occlusion and truncation on detection performance. In the first and second columns, heavy crop-to-crop occlusion causes YOLOv12 to mistakenly merge overlapping crops into a single detection. In contrast, GTDR-YOLOv12 successfully separates and localizes individual instances, even when partially obscured by neighboring foliage. The third column shows weeds that are simultaneously occluded by crops and truncated by image boundaries, leading to missed detections by YOLOv12. GTDR-YOLOv12, however, demonstrates improved robustness by accurately detecting most of these instances. In the fourth column, GTDR-YOLOv12 successfully detects a low-contrast weed that YOLOv12 misses, likely due to its partial visibility and slight occlusion by surrounding vegetation. These results underscore the effectiveness of the proposed architecture; GDR-Conv enhances spatial sensitivity through Ghost Convolution and DyReLU, while GTDR-C3 integrates global context and local features via TDAM attention. Together, they allow the model to maintain object continuity and boundary precision under challenging visual conditions. Nevertheless, several small-leaved weeds near the image boundary are missed in the third column, reflecting the model’s remaining challenge in detecting subtle, partially occluded targets.

4.5. Comparative Analysis of YOLO Variants

Table 5 presents a comprehensive comparison of GTDR-YOLOv12 against a range of advanced object detectors, including YOLO-based models and anchor-free methods, evaluated across key performance metrics and computational efficiency indicators. Compared with other YOLO-based detectors, GTDR-YOLOv12 achieves the highest performance across all key evaluation metrics, including Precision (88.0%), Recall (83.9%), mAP:0.5 (90.0%), mAP:0.5:0.95 (63.8%) and F1-score (85.9%). Notably, it accomplishes these results with the lowest computational cost among all compared models, requiring 2.2 million parameters and 4.8 GFLOPs. This highlights its superior efficiency and lightweight design. These improved results demonstrate the effectiveness of the proposed architectural enhancements in improving detection accuracy without increasing complexity. When compared to anchor-free detectors such as ATSS and Double-Head, although these models achieve relatively high mAP values, their computational demands are substantially higher. Specifically, ATSS and Double-Head have 23 times and 21 times more parameters than GTDR-YOLOv12, and their GFLOPs are approximately 58 times and 85 times greater; such complexity makes them unsuitable for field-based agricultural robots. While RTMDet shows strong efficiency with low parameters and GFLOPs, its precision remains notably lower than that of GTDR-YOLOv12. In contrast, GTDR-YOLOv12 achieves a better trade-off between accuracy and efficiency, making it more practical for real-world agricultural applications.

4.6. Evaluating the Benefits of Transfer Learning and Domain Adaptation

Table 6 presents a quantitative comparison of different YOLOv12 variants to evaluate the effectiveness of pretraining strategies. Compared with the baseline YOLOv12 (COCO Pretrain), the proposed YOLOv12-GTDR achieves significantly lower inference time (from 5.8 ms to 3.0 ms) and reduced computational cost (from 5.8 to 4.8 GFLOPs), while maintaining competitive detection performance. When pretrained with task-specific GTDR weights, YOLOv12-GTDR (Pretrain) achieves the best overall accuracy, with the highest mAP:0.5 (0.915) and mAP:0.5:0.95 (0.674), as well as an improved F1-score (0.870). Moreover, this variant demonstrates a better balance between precision (P = 0.871) and recall (R = 0.870) compared to the COCO-pretrained baseline (P = 0.852, R = 0.887), reducing the precision–recall trade-off and indicating improved stability in detection outcomes. These findings confirm that transfer learning using domain-specific pretrained weights enhances both the convergence behavior and generalization capacity of the model in agricultural detection scenarios.

4.7. Embedded Platform Testing

The ablation study was conducted on the NVIDIA Jetson AGX Xavier, a widely adopted embedded AI computing platform featuring an 8-core ARM v8.2 64-bit CPU, a 512-core Volta GPU with Tensor Cores, and 32 GB of LPDDR4x memory. It supports up to 32 TOPS of AI performance under a 30W power budget, making it highly suitable for real-time, low-power computer vision tasks in field robotics and agricultural applications.
Table 7 presents the GFLOPs and the FPS of different ablation variants of the proposed GTDR-YOLOv12 model tested on Jetson AGX Xavier. As expected, the original YOLOv12 and its Lookahead-enhanced variant exhibit the highest computational complexity at 5.8 GFLOPs, resulting in a lower runtime speed of 48 FPS. The addition of GDR-Conv slightly reduces the GFLOPs to 5.7, yielding a modest gain to 49 FPS. A more noticeable efficiency gain is observed in the YOLOv12 + GTDR-C3 variant, with GFLOPs reduced to 5.2 and FPS increasing to 54. Our final model, GTDR-YOLOv12, achieves the lowest GFLOPs (4.8) and the highest FPS (58), demonstrating the success of our design philosophy in balancing accuracy with deployment efficiency. These results highlight the lightweight nature of the proposed modules and their suitability for deployment in embedded systems with constrained computational resources.
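For deployment on a Jetson-class device, a trained model is typically exported to an optimized runtime before benchmarking. The snippet below is a hedged sketch of one possible export path using an Ultralytics-style API with TensorRT and FP16 precision; the weight and image file names are placeholders, and the FPS figures in Table 7 were not necessarily obtained with this exact procedure.

```python
from ultralytics import YOLO  # assumes an Ultralytics-compatible YOLOv12 build

# Export trained weights (placeholder file name) to a TensorRT engine with FP16,
# then run inference with the exported engine on the Jetson AGX Xavier.
model = YOLO("gtdr_yolov12_best.pt")
model.export(format="engine", half=True, imgsz=640)  # produces a .engine file

engine = YOLO("gtdr_yolov12_best.engine")
results = engine.predict("field_image.jpg", imgsz=640)  # single-image inference
```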

5. Discussion and Limitations

The experimental results and comparative evaluations presented in this study comprehensively validate the effectiveness of the proposed GTDR-YOLOv12 framework in addressing key challenges in agricultural weed detection, including the need for precise object localization in cluttered field environments, robust recognition under dense target distribution and low vegetation–soil contrast, and achieving high detection accuracy with extremely low GFLOPs and parameter count. By integrating GDR-Conv into the early stages of the backbone, the model significantly reduces computational overhead while preserving critical low-level spatial information. Furthermore, replacing all conventional feature extraction blocks with GTDR-C3, which integrates TDAM for adaptive attention, Ghost convolution for efficient representation, and DyReLU for dynamic activation, significantly enhances the model’s capacity to focus on relevant features in such challenging agricultural conditions, thereby ensuring robust and consistent detection across diverse and challenging field conditions. Unlike prior YOLO-based variants that independently incorporate Ghost convolution, attention mechanisms, or dynamic activations, our integration strategy emphasizes functional synergy through a hierarchical and task-specific design. GDR-Conv is constructed by embedding DyReLU into both paths of Ghost convolution. Moreover, GTDR-C3 integrates GDR-Conv with a TDAM within a residual structure. This design allows the model to dynamically adapt to spatial and morphological changes in plant structures, enabling consistent detection performance across different growth stages where weed shape and size may vary. This progressive fusion approach differs from conventional modular stacking, allowing the model to adaptively refine spatial features relevant to weed detection while maintaining computational efficiency. As reflected in the F1 and PR curves (Figure 9 and Figure 10), GTDR-YOLOv12 consistently outperforms the baseline YOLOv12 across varying confidence thresholds, demonstrating both higher accuracy and greater prediction stability. Visual comparisons (Figure 11) further corroborate these gains, showing improved localization, reduced FP, and superior handling of overlapping and small weed instances. Furthermore, GTDR-YOLOv12 achieves this with 2.2M parameters (reduced by 0.3M) and 4.8 GFLOPs (reduced by 1.0), marking the lowest computational load among all evaluated models suitable for deployment on resource-constrained platforms. To evaluate its practical deployability, GTDR-YOLOv12 was tested on the NVIDIA Jetson AGX Xavier, achieving a real-time speed of 58 FPS. This confirms the model’s suitability for integration into autonomous weeding robots, enabling precise and timely detection in dynamic field environments. This represents a substantial reduction in computational complexity compared to high-performing detectors such as ATSS and Double-Head, whose parameter counts and GFLOPs are at least 20 times higher. This efficiency contributes to faster convergence, lower runtime latency, and improved generalization due to structural simplicity. Moreover, the introduction of the Lookahead optimizer contributes to improved convergence stability and generalization, particularly evident in the F1 improvements during ablation studies. Overall, GTDR-YOLOv12 demonstrates a compelling balance between accuracy and efficiency. 
These characteristics make it not only a technically effective model, but also a realistic and deployable solution for intelligent, real-time weed management in modern precision agriculture.
Table 8 summarizes the performance and deployment characteristics of several lightweight YOLO-based weed detection models. The reported Precision ranges from 86.7 to 95.8, while Recall varies between 75.5 and 93.9, reflecting the diversity of datasets and detection scenarios. The mAP@0.5 scores are generally high, ranging from 85.1 to 95.4, indicating that all models maintain strong detection capabilities under their respective experimental settings. A key distinction lies in the GFLOPs and FPS, which reflect model complexity and real-time performance. Among the models, GTDR-YOLOv12 demonstrates the lowest GFLOPs (4.8) and maintains a competitive FPS of 58, showing strong efficiency. In contrast, YOLO-SW has the highest GFLOPs (19.9) but only achieves 59 FPS, suggesting higher computation cost. Star-YOLO, with 5.0 GFLOPs, achieves the highest FPS (118), indicating an excellent balance between computation and speed. Regarding embedded deployment, most models are tested or deployed on NVIDIA Jetson platforms. GTDR-YOLOv12 is verified on Jetson AGX Xavier, YOLOv8n-SSDW on Jetson Nano B01, and YOLO-SW on Jetson AGX Orin, confirming their suitability for edge computing applications. However, two models (YOLOv8s-Improve and Star-YOLO) do not report specific deployment platforms, leaving their practical embedded performance unverified.
Despite the observed performance improvements GTDR-YOLOv12 achieves in both detection accuracy and computational efficiency, several limitations remain. First, although the Weeds Detection dataset provides a useful benchmark, it lacks sufficient coverage of the environmental variability found across different crop types, weed morphologies and geographic regions. This raises concerns about the applicability of detection outputs in real-world robotic weeding systems, where execution accuracy can be affected by actuation delay, positioning errors, and unstructured terrain conditions. Second, while the proposed architectural modules significantly reduce GFLOPs and parameter count, the optimization has not yet been adapted to specific hardware platforms. Techniques such as model pruning or quantization for embedded accelerators remain to be explored. Finally, the current implementation supports only binary classification between crops and weeds. Extending the framework to multi-class detection for identifying individual weed species would require substantial annotation and additional training resources.

6. Conclusions

This study presents GTDR-YOLOv12, a lightweight and accurate object detection framework tailored for precision weed detection in complex agricultural environments. By integrating GDR-Conv and GTDR-C3 modules throughout the backbone and neck, the proposed model enhances early-stage feature extraction and task-dependent attention, enabling robust discrimination between crop and weed instances even under conditions of dense target distribution, small weed size and low visual contrast with the soil background. Experimental evaluations indicate that, among models tailored for lightweight agricultural applications, GTDR-YOLOv12 demonstrates superior comprehensive performance across key evaluation metrics, attaining 88.0% precision, 83.9% recall, 85.9% F1-score, 90.0% mAP:0.5 and 63.8% mAP:0.5:0.95. These results represent absolute improvements of 3.0%, 4.2%, 3.6%, 3.0% and 5.8%, respectively, over the original YOLOv12 baseline. Architectural refinements reduce the parameter count from 2.5 M to 2.2 M and GFLOPs from 5.8 to 4.8. These results highlight the model’s high detection accuracy and its effectiveness in balancing accuracy and computational efficiency, underscoring its strong potential for real-time weed monitoring in autonomous agricultural robots. When deployed on an NVIDIA Jetson AGX Xavier, GTDR-YOLOv12 achieves 58 FPS, validating its real-time capability and suitability for edge deployment in field conditions. Such capabilities may also benefit related tasks in precision agriculture and ecological monitoring, where lightweight yet robust detection models are required. In addition, the findings align with ongoing efforts to promote intelligent agricultural machinery and reduce pesticide usage through site-specific weed management, offering technical insights relevant to sustainable farming policy frameworks. Future research may explore the generalizability of GTDR-YOLOv12 to other crops and its integration into real-world weeding robots to assess practical weeding efficiency. Furthermore, investigating the practical benefits of such a high-efficiency framework in real-world farming scenarios may help quantify its economic and environmental impact. To promote transparency and reproducibility, the full implementation and pretrained models have been released at GTDR-YOLOv12 Github Repository (https://github.com/yangzhaofeng496/GTDR-YOLOv12, accessed on 24 July 2025).

Author Contributions

Z.Y.: writing—review and editing, writing—original draft, visualization, validation, software, project administration, methodology, investigation, formal analysis, data curation, conceptualization. Z.K.: writing—review and editing, validation, formal analysis, conceptualization, software, methodology. Y.S.: supervision, resources, project administration, formal analysis. H.L.: supervision, funding acquisition. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China, grant number 32171908, and the Jiangsu Agricultural Science and Technology Innovation Fund, grant number CX(24)3025. The APC was funded by the Jiangsu Agricultural Science and Technology Innovation Fund.

Data Availability Statement

Data will be made available on request.

Conflicts of Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

  1. Murad, N.Y.; Mahmood, T.; Forkan, A.R.M.; Morshed, A.; Jayaraman, P.P.; Siddiqui, M.S. Weed detection using deep learning: A systematic literature review. Sensors 2023, 23, 3670. [Google Scholar] [CrossRef]
  2. Nath, C.P.; Singh, R.G.; Choudhary, V.K.; Datta, D.; Nandan, R.; Singh, S.S. Challenges and alternatives of herbicide-based weed management. Agronomy 2024, 14, 126. [Google Scholar] [CrossRef]
  3. Akhtari, H.; Navid, H.; Karimi, H.; Dammer, K.H. Deep learning-based object detection model for location and recognition of weeds in cereal fields using colour imagery. Crop Pasture Sci. 2025, 76, 4. [Google Scholar] [CrossRef]
  4. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; IEEE: Piscataway, NJ, USA, 2016; pp. 779–788. [Google Scholar]
  5. Ji, W.; Pan, Y.; Xu, B.; Wang, J. A real-time apple targets detection method for picking robot based on ShufflenetV2-YOLOX. Agriculture 2022, 12, 856. [Google Scholar] [CrossRef]
  6. Wang, J.; Gao, Z.; Zhang, Y.; Zhou, J.; Wu, J.; Li, P. Real-time detection and location of potted flowers based on a ZED camera and a YOLO V4-tiny deep learning algorithm. Horticulturae 2021, 8, 21. [Google Scholar] [CrossRef]
  7. Zuo, Z.; Gao, S.; Peng, H.; Xue, Y.; Han, L.; Ma, G.; Mao, H. Lightweight Detection of Broccoli Heads in Complex Field Environments Based on LBDC-YOLO. Agronomy 2024, 14, 2359. [Google Scholar] [CrossRef]
  8. Guan, X.; Shi, L.; Yang, W.; Ge, H.; Wei, X.; Ding, Y. Multi-Feature Fusion Recognition and Localization Method for Unmanned Harvesting of Aquatic Vegetables. Agriculture 2024, 14, 971. [Google Scholar] [CrossRef]
  9. Zhao, Y.; Zhang, X.; Sun, J.; Yu, T.; Cai, Z.; Zhang, Z.; Mao, H. Low-cost lettuce height measurement based on depth vision and lightweight instance segmentation model. Agriculture 2024, 14, 1596. [Google Scholar] [CrossRef]
  10. Zhang, T.; Zhou, J.; Liu, W.; Yue, R.; Shi, J.; Zhou, C.; Hu, J. SN-CNN: A Lightweight and Accurate Line Extraction Algorithm for Seedling Navigation in Ridge-Planted Vegetables. Agriculture 2024, 14, 1596. [Google Scholar] [CrossRef]
  11. Zhang, T.; Zhou, J.; Liu, W.; Yue, R.; Yao, M.; Shi, J.; Hu, J. Seedling-YOLO: High-Efficiency Target Detection Algorithm for Field Broccoli Seedling Transplanting Quality Based on YOLOv7-Tiny. Agronomy 2024, 14, 931. [Google Scholar] [CrossRef]
  12. Zhang, Q.; Chen, Q.; Xu, W.; Xu, L.; Lu, E. Prediction of Feed Quantity for Wheat Combine Harvester Based on Improved YOLOv5s and Weight of Single Wheat Plant without Stubble. Agriculture 2024, 14, 1251. [Google Scholar] [CrossRef]
  13. Shen, Y.; Yang, Z.; Khan, Z.; Liu, H.; Chen, W.; Duan, S. Optimization of Improved YOLOv8 for Precision Tomato Leaf Disease Detection in Sustainable Agriculture. Sensors 2025, 25, 1398. [Google Scholar] [CrossRef]
  14. Khan, Z.; Liu, H.; Shen, Y.; Yang, Z.; Zhang, L.; Yang, F. Optimizing Precision Agriculture: A Real-Time Detection Approach for Grape Vineyard Unhealthy Leaves Using Deep Learning Improved YOLOv7. Comput. Electron. Agric. 2025, 231, 109969. [Google Scholar] [CrossRef]
  15. Khan, Z.; Liu, H.; Shen, Y.; Zeng, X. Deep Learning Improved YOLOv8 Algorithm: Real-Time Precise Instance Segmentation of Crown Region Orchard Canopies in Natural Environment. Comput. Electron. Agric. 2024, 224, 109168. [Google Scholar] [CrossRef]
  16. Pei, H.; Sun, Y.; Huang, H.; Zhang, W.; Sheng, J.; Zhang, Z. Weed detection in maize fields by UAV images based on crop row preprocessing and improved YOLOv4. Agriculture 2022, 12, 975. [Google Scholar] [CrossRef]
  17. Tao, T.; Wei, X. STBNA-YOLOv5: An Improved YOLOv5 Network for Weed Detection in Rapeseed Field. Agriculture 2024, 15, 22. [Google Scholar] [CrossRef]
  18. Deng, L.; Miao, Z.; Zhao, X.; Yang, S.; Gao, Y.; Zhai, C.; Zhao, C. HAD-YOLO: An Accurate and Effective Weed Detection Model Based on Improved YOLOV5 Network. Agronomy 2025, 15, 57. [Google Scholar] [CrossRef]
  19. Liu, J.; Abbas, I.; Noor, R.S. Development of deep learning-based variable rate agrochemical spraying system for targeted weeds control in strawberry crop. Agronomy 2021, 11, 1480. [Google Scholar] [CrossRef]
  20. Sa, I.; Popović, M.; Khanna, R.; Chen, Z.; Lottes, P.; Liebisch, F.; Nieto, J.; Stachniss, C.; Walter, A.; Siegwart, R. WeedMap: A Large-Scale Semantic Weed Mapping Framework Using Aerial Multispectral Imaging and Deep Neural Network for Precision Farming. Remote Sens. 2018, 10, 1423. [Google Scholar] [CrossRef]
  21. Zhang, Z.; Zhao, P.; Zheng, Z.; Luo, W.; Cheng, B.; Wang, S.; Zheng, Z. Rt-Mwdt: A Lightweight Real-Time Transformer with Edge-Driven Multiscale Fusion for Precisely Detecting Weeds in Complex Cornfield Environments. Available online: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5260823 (accessed on 19 July 2025).
  22. Chen, S.; Memon, M.S.; Shen, B.; Guo, J.; Du, Z.; Tang, Z.; Memon, H. Identification of weeds in cotton fields at various growth stages using color feature techniques. Ital. J. Agron. 2024, 19, 100021. [Google Scholar] [CrossRef]
  23. Wang, Y.; Zhang, X.; Ma, G.; Du, X.; Shaheen, N.; Mao, H. Recognition of weeds at asparagus fields using multi-feature fusion and backpropagation neural network. Int. J. Agric. Biol. Eng. 2021, 14, 190–198. [Google Scholar] [CrossRef]
  24. Memon, M.S.; Chen, S.; Shen, B.; Liang, R.; Tang, Z.; Wang, S.; Memon, N. Automatic visual recognition, detection and classification of weeds in cotton fields based on machine vision. Crop Prot. 2025, 187, 106966. [Google Scholar] [CrossRef]
  25. Zhang, Z.; Lu, Y.; Zhao, Y.; Pan, Q.; Jin, K.; Xu, G.; Hu, Y. Ts-yolo: An All-Day and Lightweight Tea Canopy Shoots Detection Model. Agronomy 2023, 13, 1411. [Google Scholar] [CrossRef]
  26. Feng, G.; Wang, C.; Wang, A.; Gao, Y.; Zhou, Y.; Huang, S.; Luo, B. Segmentation of Wheat Lodging Areas from UAV Imagery Using an Ultra-Lightweight Network. Agriculture 2024, 14, 244. [Google Scholar] [CrossRef]
  27. Rahman, M.G.; Rahman, M.A.; Parvez, M.Z.; Patwary, M.A.K.; Ahamed, T.; Fleming-Muñoz, D.A.; Moni, M.A. ADeepWeeD: An Adaptive Deep Learning Framework for Weed Species Classification. Artif. Intell. Agric. 2025, 15, 590–609. [Google Scholar] [CrossRef]
  28. Li, S.; Chen, Z.; Xie, J.; Zhang, H.; Guo, J. PD-YOLO: A Novel Weed Detection Method Based on Multi-Scale Feature Fusion. Front. Plant Sci. 2025, 16, 1506524. [Google Scholar] [CrossRef]
  29. Farooq, U.; Rehman, A.; Khanam, T.; Amtullah, A.; Bou-Rabee, M.A.; Tariq, M. Lightweight Deep Learning Model for Weed Detection for IoT Devices. In Proceedings of the 2022 2nd International Conference on Emerging Frontiers in Electrical and Electronic Technologies (ICEFEET), Patna, India, 24–25 June 2022; IEEE: New York, NY, USA, 2022; pp. 1–5. [Google Scholar]
  30. Islam, M.D.; Liu, W.; Izere, P.; Singh, P.; Yu, C.; Riggan, B.; Shi, Y. Towards Real-Time Weed Detection and Segmentation with Lightweight CNN Models on Edge Devices. Comput. Electron. Agric. 2025, 237, 110600. [Google Scholar] [CrossRef]
  31. Nakabayashi, T.; Yamagishi, K.; Suzuki, T. Automated Weeding Systems for Weed Detection and Removal in Garlic/Ginger Fields. Int. J. Adv. Comput. Sci. Appl. 2024, 15, 4. [Google Scholar] [CrossRef]
  32. swish9. Weeds Detection [Data set]. Kaggle. 2022. Available online: https://www.kaggle.com/datasets/swish9/weeds-detection (accessed on 24 July 2025).
  33. Chen, Y.; Dai, X.; Liu, M.; Chen, D.; Yuan, L.; Liu, Z. Dynamic ReLU. In Computer Vision—ECCV 2020; Springer: Cham, Switzerland, 2020; pp. 351–367. [Google Scholar]
  34. Steininger, D.; Trondl, A.; Croonen, G.; Simon, J.; Widhalm, V. The CropAndWeed Dataset: A Multi-Modal Learning Approach for Efficient Crop and Weed Manipulation. In Proceedings of the 2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 2–7 January 2023; pp. 3729–3738. [Google Scholar]
  35. Shuai, Y.; Shi, J.; Li, Y.; Zhou, S.; Zhang, L.; Mu, J. YOLO-SW: A Real-Time Weed Detection Model for Soybean Fields Using Swin Transformer and RT-DETR. Agronomy 2025, 15, 1712. [Google Scholar] [CrossRef]
  36. Sun, Y.; Guo, H.; Chen, X.; Li, M.; Fang, B.; Cao, Y. YOLOv8n-SSDW: A Lightweight and Accurate Model for Barnyard Grass Detection in Fields. Agriculture 2025, 15, 1510. [Google Scholar] [CrossRef]
  37. Huang, J.; Xia, X.; Diao, Z.; Li, X.; Zhao, S.; Zhang, J.; Li, G. A Lightweight Model for Weed Detection Based on the Improved YOLOv8s Network in Maize Fields. Agronomy 2024, 14, 3062. [Google Scholar] [CrossRef]
  38. Lu, Z.; Chengao, Z.; Lu, L.; Yan, Y.; Jun, W.; Wei, X.; Jun, T. Star-YOLO: A Lightweight and Efficient Model for Weed Detection in Cotton Fields Using Advanced YOLOv8 Improvements. Comput. Electron. Agric. 2025, 235, 110306. [Google Scholar] [CrossRef]
Figure 1. Sample images from the Weeds Detection dataset [32]. (A) Dense weed infestations. (B) Isolated weed species. (C) Partial occlusions and overlapping foliage. (D) Mixed ground textures. (E) Bare soil with sparse vegetation. (F) Clearly defined crop structures. (G) Occlusion and overlapping foliage. (H) Partial occlusion and dense weed coverage. (I) Crop fields with clear plant rows.
Figure 2. Sample weed species from the Weeds Detection dataset [32]. (A) Argemone mexicana. (B) Boerhavia diffusa. (C) Wild Clover. (D) Bermuda Grass.
Figure 3. Structural diagram of the YOLOv12 object detection framework.
Figure 4. Architecture overview of GTDR-YOLOv12.
Figure 5. Workflow of the proposed GDR-Conv module incorporating Ghost convolution and DyReLU activation.
Figure 6. Architecture of the GTDR-C3 module with GTDRBlock and TDAM.
Figure 7. Training flow of the Lookahead optimizer.
Figure 8. Training and validation performance metrics and loss curves for GTDR-YOLOv12 over 100 training epochs. The X-axis shows the training epoch, and the Y-axis shows the corresponding values of the loss components and performance metrics (precision, recall, and mAP).
Figure 9. Comparison of F1 score–confidence curves between YOLOv12 (left) and GTDR-YOLOv12 (right) across different classes.
Figure 10. Precision–recall curves of YOLOv12 (left) and GTDR-YOLOv12 (right) for crop and weed classes.
Figure 11. Qualitative comparison of detection results: ground truth annotations (top row), YOLOv12 predictions (middle row), and GTDR-YOLOv12 predictions (bottom row).
Figure 12. Robustness comparison under varying lighting: ground truth (top row), YOLOv12 (middle row), and GTDR-YOLOv12 (bottom row).
Figure 13. Evaluation of detection robustness in occluded and high-density conditions: ground truth (top row), YOLOv12 (middle row), and GTDR-YOLOv12 (bottom row).
Table 1. Experimental platform configuration.
Category | Configuration
CPU | Intel(R) Xeon(R) Gold 6226R CPU @ 2.90 GHz
GPU | NVIDIA 3090
Framework | PyTorch 2.0.1
Programming language | Python 3.8.20
Table 2. Dataset attributes and training configuration.
Attribute | Details
Total images | 3982
Training set | 3735
Validation set | 494
Test set | 247
Epochs | 100
Batch size | 16
Image size | 640 × 640
Learning rate schedule | Initial: 0.01, Final: 0.01 (cosine decay disabled)
Early stopping | Enabled (patience = 20 epochs)
Regularization | Weight decay: 0.0005
Augmentation | Weather (fog/rain, 10%), brightness shift (±25%) or gamma (0.7–1.5)
Preprocessing | Auto-orientation for alignment
Outputs per image | 2 object classes
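For readers reproducing Table 2, the configuration maps naturally onto an Ultralytics-style training call. The sketch below is an illustration under stated assumptions: the model and dataset YAML file names are hypothetical, and the exact training interface of the released code may differ.

```python
from ultralytics import YOLO  # assumes an Ultralytics-style training interface

# Hypothetical config file names; hyperparameters mirror Table 2.
model = YOLO("gtdr-yolov12.yaml")
model.train(
    data="weeds-detection.yaml",   # 2 output classes: crop and weed
    epochs=100,
    batch=16,
    imgsz=640,
    lr0=0.01, lrf=1.0,             # constant learning rate of 0.01
    cos_lr=False,                  # cosine decay disabled
    weight_decay=0.0005,
    patience=20,                   # early stopping after 20 stagnant epochs
)
```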
Table 3. Comparison of different optimizers for YOLOv12 on the Weeds Detection dataset.
Model | Precision (%) | Recall (%) | mAP:0.5 (%) | mAP:0.5:0.95 (%) | Inf (ms) | GFLOPs | Parameters (M) | F1-Score (%)
v12 + AdamW | 85.0 | 79.7 | 87.0 | 58.0 | 3.6 | 5.8 | 2.5 | 82.3
v12 + SGD | 64.0 | 71.0 | 72.7 | 39.7 | 4.1 | 5.8 | 2.5 | 67.2
v12 + Lookahead | 86.9 | 80.2 | 86.7 | 56.9 | 2.6 | 5.6 | 2.3 | 83.4
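The Lookahead entries in Table 3 refer to the optimizer that keeps a copy of slow weights and, every k inner steps, pulls them toward the fast weights produced by an inner optimizer such as AdamW before resetting the fast weights. The following minimal PyTorch wrapper illustrates that update rule under assumed defaults (k = 5, alpha = 0.5); it is a sketch of the general algorithm, not the project's implementation.

```python
import torch

class Lookahead:
    """Minimal Lookahead wrapper: every k inner steps, move the slow weights
    toward the fast weights by factor alpha, then reset the fast weights."""

    def __init__(self, inner_optimizer, k=5, alpha=0.5):
        self.inner, self.k, self.alpha = inner_optimizer, k, alpha
        self.step_count = 0
        self.slow = [p.detach().clone()
                     for group in inner_optimizer.param_groups
                     for p in group["params"]]

    def zero_grad(self):
        self.inner.zero_grad()

    @torch.no_grad()
    def step(self):
        self.inner.step()                      # fast update (e.g. AdamW)
        self.step_count += 1
        if self.step_count % self.k == 0:      # slow update every k steps
            fast = [p for g in self.inner.param_groups for p in g["params"]]
            for slow_p, fast_p in zip(self.slow, fast):
                slow_p += self.alpha * (fast_p - slow_p)
                fast_p.copy_(slow_p)

# Hypothetical usage:
#   opt = Lookahead(torch.optim.AdamW(model.parameters(), lr=0.01))
```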
Table 4. Ablation study of GTDR-YOLOv12: effects of model components on detection accuracy and efficiency (mean ± std across three runs).
Model | Precision (%) | Recall (%) | mAP:0.50 (%) | mAP:0.50:0.95 (%) | Inference (ms) | GFLOPs | Params (M) | F1-Score (%) | Avg. IoU (%)
v12 | 81.5 ± 1.6 | 79.7 ± 0.2 | 87.0 ± 1.2 | 58.0 ± 1.0 | 3.07 ± 0.46 | 5.8 | 2.51 | 82.3 ± 1.1 | 81.2
v12 + Lookahead | 86.9 ± 0.0 | 80.2 ± 0.0 | 86.7 ± 0.0 | 56.9 ± 0.0 | 3.73 ± 0.57 | 5.8 | 2.51 | 83.4 ± 0.0 | 81.2
v12 + GDR-Conv | 84.3 ± 0.6 | 81.9 ± 1.0 | 88.1 ± 0.9 | 60.4 ± 0.8 | 3.80 ± 0.42 | 5.7 | 2.36 | 83.3 ± 0.6 | 82.6
v12 + GTDR-C3 | 84.5 ± 2.7 | 85.0 ± 1.6 | 89.8 ± 1.0 | 64.0 ± 1.3 | 4.37 ± 0.71 | 5.2 | 2.28 | 84.4 ± 0.0 | 83.3
v12 + Lookahead + GDR-Conv + GTDR-C3 (Ours) | 87.3 ± 0.82 | 83.4 ± 1.52 | 90.9 ± 0.70 | 65.5 ± 0.87 | 4.23 ± 1.60 | 4.8 | 2.23 | 85.8 ± 0.4 | 83.4
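The mean ± std entries in Table 4 aggregate three independent training runs per variant. A minimal sketch of that aggregation follows; the per-run values are illustrative numbers chosen only to be consistent with the reported 90.9 ± 0.70 mAP:0.5, not the actual experiment logs.

```python
import statistics

# Illustrative per-run mAP:0.5 values for the full GTDR-YOLOv12 variant (not real logs).
runs = [90.2, 91.6, 90.9]
mean = statistics.mean(runs)
std = statistics.stdev(runs)      # sample standard deviation (n - 1 in the denominator)
print(f"{mean:.1f} ± {std:.2f}")  # 90.9 ± 0.70
```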
Table 5. Comparative performance of GTDR-YOLOv12 against advanced object detectors on the Weeds Detection dataset ("–" indicates a value not reported).
Model | Precision (%) | Recall (%) | mAP:0.5 (%) | mAP:0.5:0.95 (%) | Inf (ms) | GFLOPs | Parameters (M) | F1-Score (%)
YOLOv7 | 65.8 | 75.4 | 71.5 | 38.3 | 5.5 | 103.2 | 36.5 | 70.2
YOLOv9 | 86.1 | 83.5 | 89.1 | 62.5 | 6.1 | 26.7 | 7.2 | 84.8
YOLOv10 | 86.8 | 74.2 | 85.2 | 57.6 | 4.3 | 8.2 | 2.7 | 79.9
YOLOv11 | 87.0 | 80.8 | 87.7 | 59.3 | 2.5 | 6.3 | 2.6 | 83.8
YOLOv12 | 85.0 | 79.7 | 87.0 | 58.0 | 3.6 | 5.8 | 2.5 | 82.3
ATSS | 87.8 | 90.0 | 95.0 | 75.5 | – | 279 | 51.1 | 89.0
Double-Head | 84.7 | 93.1 | 93.1 | 93.1 | – | 408 | 46.9 | 88.7
RTMDet | 69.6 | 86.6 | 83.5 | 53.0 | – | 8.0 | 4.9 | 77.1
GTDR-YOLOv12 (Ours) | 88.0 | 83.9 | 90.0 | 63.8 | 5.7 | 4.8 | 2.2 | 85.9
Table 6. Performance comparison of YOLOv12 variants with and without GTDR enhancements.
Model | Precision (%) | Recall (%) | mAP:0.5 (%) | mAP:0.5:0.95 (%) | Inference (ms) | GFLOPs | Parameters (M) | F1-Score (%)
YOLOv12 (COCO Pretrain) | 85.2 | 88.7 | 90.8 | 64.7 | 5.8 | 5.8 | 2.51 | 86.9
YOLOv12-GTDR | 88.0 | 83.9 | 90.0 | 63.8 | 3.0 | 4.8 | 2.23 | 85.9
YOLOv12-GTDR (YOLOv12-GTDR Pretrain) | 87.1 | 87.0 | 91.5 | 67.4 | 4.2 | 4.8 | 2.23 | 87.0
Table 7. Comparison of GFLOPs and estimated FPS for different ablation variants of GTDR-YOLOv12.
Model | GFLOPs | FPS
YOLOv12 | 5.8 | 48
YOLOv12 + Lookahead | 5.8 | 48
YOLOv12 + GDR-Conv | 5.7 | 49
YOLOv12 + GTDR-C3 | 5.2 | 54
GTDR-YOLOv12 (Ours) | 4.8 | 58
Table 8. Comparison of lightweight YOLO-based weed detection models on various datasets and platforms ("–" indicates a value not reported).
Model | Dataset | Precision (%) | Recall (%) | mAP:0.5 (%) | FPS | GFLOPs | Embedded Platform | Reference
YOLO-SW | Weed25 | – | – | 92.3 | 59 | 19.9 | NVIDIA Jetson AGX Orin | [35]
YOLOv8n-SSDW | Barnyard Grass | 86.7 | 75.5 | 85.1 | – | 7.4 | NVIDIA Jetson Nano B01 | [36]
YOLOv8s-Improve | Maize Field Weed | 95.8 | 93.2 | 94.5 | – | 12.7 | – | [37]
Star-YOLO | CottonWeedDet12 | 95.3 | 93.9 | 95.4 | 118 | 5.0 | – | [38]
GTDR-YOLOv12 (Ours) | Weeds Detection | 88.0 | 83.9 | 90.0 | 58 | 4.8 | NVIDIA Jetson AGX Xavier | –
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
