1. Introduction
With the continuous growth of the global population, food security is becoming an increasingly pressing issue. By 2050, the global population is expected to exceed 9 billion, which would require a 70% to 100% increase in food production to meet the rising demand [1]. As one of the most important staple crops worldwide, wheat plays a crucial role in the global food system. According to a report by the Food and Agriculture Organization (FAO), wheat is the most widely cultivated cereal globally and provides a primary food source for billions of people each year [2]. However, the increase in wheat yields is confronted by several challenges, among which weed infestation is a significant factor.
Weeds compete with wheat for resources such as water, nutrients, and sunlight, significantly affecting the growth and yield of wheat. For instance, the study by Anwar et al. [3] (2021) demonstrates that the presence of weeds significantly reduces yield, especially in the context of climate change. Colbach et al. [4] (2014) further point out that weed competition severely reduces wheat yield, with the intensity of the competition directly correlated to the weed biomass. Additionally, Jalli et al. [5] (2021) emphasize that weeds compete for resources, potentially reducing crop yield, and that diverse crop rotations can help mitigate the severity of plant diseases and pest occurrences.
Moreover, Javaid et al. [6] (2022) observed that the uncontrolled growth of weeds significantly reduces wheat productivity by affecting the photosynthetic activity and overall physiological performance of the crop. Similarly, Usman et al. [7] (2010) conducted research in northwestern Pakistan and demonstrated that insufficient weed control drastically lowers wheat yields. Their study found that the use of reduced and zero tillage systems, in combination with appropriate herbicides such as Affinity, significantly improved weed control efficiency (up to 94.1%) and resulted in higher wheat productivity compared to conventional tillage. Therefore, the implementation of effective weed management strategies is crucial for maintaining wheat health and productivity.
Traditional weed management methods in agriculture, particularly in wheat fields, such as manual weeding and herbicide use, face significant limitations. Manual weeding, while effective in precisely removing weeds, is labor-intensive and time-consuming. It requires significant manpower, which is often expensive and scarce, especially in large-scale agricultural operations. Furthermore, manual weeding is not always feasible due to its dependency on specific timing during the growing season. Shamshiri et al. [8] (2024) pointed out that although manual weeding is considered efficient in selective weed removal, its high labor costs and limited scalability make it impractical for modern agricultural systems.
On the other hand, herbicide use, while effective for large-scale weed control, has led to increasing issues of herbicide-resistant weed species and environmental contamination. Recent research has shown that the overuse of herbicides accelerates the development of resistant weed populations, complicating long-term management. For example, Reed et al. [9] (2024) noted that herbicide resistance in barnyardgrass continues to increase, leading farmers to apply higher doses or a mix of chemicals, which not only raises production costs but also exacerbates environmental concerns related to soil and water contamination.
Given the challenges associated with traditional weed management, effective weed identification is crucial in wheat fields. Meesaragandla et al. [10] (2023) emphasize that effective weed identification is critical in modern agriculture, particularly in wheat fields, to address the inefficiencies and challenges posed by traditional weed management methods. Their study highlights how advanced detection techniques improve the precision and effectiveness of weed control strategies.
In recent years, with the rapid development of computer vision and deep learning technologies, automated image recognition has become increasingly prevalent in agriculture. Significant progress in object detection enables weed recognition through machine vision, enhancing agricultural precision while reducing herbicide use and promoting sustainable farming.
For example, Punithavathi et al. [11] (2023) proposed a model combining computer vision and deep learning that significantly improves weed detection in precision agriculture. Additionally, Razfar et al. [12] (2022) developed a lightweight deep learning model successfully applied to weed detection in soybean fields, achieving higher efficiency with reduced computational costs.
At the same time, Rakhmatulin et al. [13] (2021) provided a comprehensive review of various deep learning models applied to real-time weed detection in agricultural settings, discussing current challenges and future directions. These studies demonstrate that deep learning-based weed recognition technologies not only enhance agricultural productivity but also drive the advancement of smart farming.
The YOLO (You Only Look Once) model [14] series represents cutting-edge technology in the field of object detection. Known for its high real-time efficiency and superior detection accuracy, YOLO models are widely applied across various object detection tasks. The core advantage of YOLO lies in its ability to complete both object localization and classification in a single network pass, significantly increasing detection speed compared to traditional algorithms.
Due to its end-to-end architecture, high detection speed, and accuracy, YOLO has become a major focus in precision agriculture, especially for real-time applications such as weed detection using drone monitoring systems and smart farming equipment. As noted by Gallo et al. [15] (2023), the YOLOv7 model excels in weed detection in precision agriculture, offering fast and precise detection through UAV images.
Compared to other deep learning models, the YOLO model’s lightweight structure allows it to operate efficiently on resource-constrained devices. The proliferation of the Internet of Things (IoT) in agriculture has further expanded YOLO’s potential for application in low-power embedded systems. Zhang et al. [16] (2024) successfully deployed the GVC-YOLO model on the Nvidia Jetson Xavier NX edge computing device, utilizing real-time video captured by field cameras to detect and process aphid damage on cotton leaves. This application not only reduces the need for manual intervention and labor costs but also promotes the development of intelligent agriculture.
The standard YOLO model faces significant challenges in complex agricultural environments, particularly in detecting weeds and wheat under conditions of occlusion and overlap. Wheat fields often feature complex backgrounds, variable lighting (e.g., strong sunlight, shadows), and changing weather conditions (e.g., sunny, cloudy, foggy), all of which degrade detection performance. Additionally, morphological similarities between weed species and wheat, as well as among weed species themselves, further complicate classification, increasing false positives and missed detections. These limitations hinder the model’s reliability in real-world agricultural applications, highlighting the need for optimization to enhance adaptability and accuracy.
To address these issues, this study proposes PMDNet, an improved model built upon the YOLOv8 [17] framework, introducing targeted enhancements specifically tailored to the unique scenarios and challenges of wheat fields. These improvements are designed to significantly enhance both detection accuracy and efficiency in weed identification.
2. Related Work
2.1. Traditional Methods for Weed Identification in Wheat Fields
In traditional agricultural practices, weed identification predominantly depends on the farmers’ experience and visual observation skills. This manual approach requires substantial field expertise, as farmers distinguish weeds from crops based on visible characteristics such as leaf shape, color, and stem structure. Although effective when distinguishing clearly different species, this method is labor-intensive and varies in accuracy, contingent upon the individual farmer’s expertise.
In certain cases, farmers employ simple tools like magnifying glasses to enhance their observation of plant characteristics. However, these traditional techniques remain manual and susceptible to human error. As agricultural practices expand in scale, the limitations in efficiency and precision underscore the necessity for more advanced and systematic methods for weed identification.
2.2. Application of Deep Learning in Weed Identification
Recent advancements in deep learning have significantly impacted the agricultural sector, particularly in weed identification. Traditional weed management methods, relying on manual identification or excessive herbicide use, are time-consuming and environmentally harmful. Deep learning and computer vision technologies now enable automated weed identification systems, leveraging models like Convolutional Neural Networks and YOLO to accurately classify and identify weeds and crops, offering critical support for precision agriculture.
For instance, Upadhyay et al. [18] (2024) developed a machine vision and deep learning-based smart sprayer system aimed at site-specific weed management in row crops. The study utilized edge computing technology to enable real-time identification of multiple weed species and precise spraying, thereby reducing unnecessary chemical usage and enhancing the sustainability of crop production.
In another study, Ali et al. [19] (2024) introduced a comprehensive dataset for weed detection in rice fields in Bangladesh, using deep learning for identification. The dataset, consisting of 3632 high-resolution images of 11 common weed species, provides a foundational resource for developing and evaluating advanced machine learning algorithms, enhancing the practical application of weed detection technology in agriculture.
Additionally, Coleman et al. [20] (2024) presented a multi-growth stage plant recognition system for Palmer amaranth (Amaranthus palmeri) in cotton (Gossypium hirsutum), utilizing various YOLO architectures. Among these, YOLOv8-X achieved the highest detection accuracy (mean average precision, mAP@[0.5:0.95]) of 47.34% across eight growth stages, and YOLOv7-Original attained 67.05% for single-class detection, demonstrating the significant potential of YOLO models for phenotyping and weed management applications.
The application of deep learning in weed identification enables precision agriculture by reducing herbicide usage, improving weed recognition accuracy, and ultimately minimizing the environmental impact of agricultural activities.
2.3. Overview of YOLO Model
The development of the YOLO (You Only Look Once) model began in 2016 when Redmon et al. [14] (2016) introduced the initial version. This pioneering model reframed object detection as a single-stage regression task, achieving remarkable speed and positioning YOLO as a crucial model for real-time applications. YOLOv2 [21], building on this foundation, incorporated strategies like Batch Normalization, Anchor Boxes, and multi-scale training, leading to notable improvements in accuracy and generalizability. By 2018, YOLOv3 [22] further refined the network structure, employing a Feature Pyramid Network (FPN) to facilitate multi-scale feature detection, thereby significantly enhancing the detection performance for small objects.
In 2020, YOLOv4, led by Bochkovskiy et al. [23] (2020), integrated several advancements, such as CSPNet, Mish activation, and CIoU loss, achieving a balanced trade-off between detection precision and speed and solidifying its position as a standard in the field. Ultralytics then released YOLOv5 [24], focusing on a lightweight design that enabled efficient inference even on mobile and resource-constrained devices. As YOLO continued to gain prominence in diverse applications, further iterations were introduced, including YOLOv6 [25] and YOLOv7 [26], which employed depthwise separable convolutions and modules inspired by EfficientNet [27] to optimize both model efficiency and inference speed. YOLOv8 builds on these foundations, delivering enhanced performance and robustness in multi-object detection tasks.
2.4. Current Research on Weed Identification Based on Improved YOLO Models
In recent years, YOLO models have been widely applied to agricultural weed identification, with researchers improving accuracy and real-time performance through various optimizations. This study provides an overview of current advancements in YOLO-based weed detection and analyzes the limitations, identifying areas for further improvement.
(1) Enhancements in Precision and Detection Capability
Recent advancements in object detection algorithms have significantly improved precision in agricultural weed detection. In a study by Sportelli et al. [28] (2023), various YOLO object detectors, including YOLOv5, YOLOv6, YOLOv7, and YOLOv8, were evaluated for detecting weeds in turfgrass under varying conditions. YOLOv8l demonstrated the highest performance, achieving a precision of 94.76%, mAP@0.5 of 97.95%, and mAP@0.5:0.95 of 81.23%. However, challenges remain, as the results on additional datasets revealed limitations in robustness when dealing with diverse and complex backgrounds. The study highlighted the potential for further enhancements, such as integrating advanced annotation techniques and incorporating diverse vegetative indices to better address challenges like overlapping weeds and inconsistent lighting conditions.
(2) Lightweight Design and Real-Time Optimization
To improve detection speed and accuracy, several studies have focused on lightweight versions of the YOLO model. Zhu et al. [29] (2024) proposed a YOLOx model enhanced with a lightweight attention module and a deconvolution layer, significantly reducing computational load and improving real-time detection and small feature extraction. However, the model’s precision still faces challenges in complex agricultural environments where weed occlusion is frequent, indicating a need for further optimization in robustness and feature extraction efficiency.
(3) Multi-Class Identification and Classification
In cotton production, Dang et al. [30] (2022) developed the YOLOWeeds model and created the CottonWeedDet12 dataset for multi-class weed detection. Their approach successfully classified various types of weeds in cotton fields, achieving high accuracy with data augmentation for improved model adaptability. However, the study primarily focused on cotton-specific environments, and the model’s transferability to other crops or environments with greater weed diversity was not explored in detail, suggesting a potential area for further research.
(4) Incorporating Attention Mechanisms
Some studies have enhanced YOLO models with attention mechanisms to improve focus and precision in weed detection tasks. For example, Chen et al. [31] (2022) introduced the YOLO-sesame model, which integrates an attention mechanism, local importance pooling in the SPP layer, and an adaptive spatial feature fusion structure to address variations in weed size and shape. The model achieved superior detection performance in sesame fields, with a mAP of 96.16% and F1 scores of 0.91 and 0.92 for sesame crops and weeds, respectively. Although the YOLO-sesame model showed promising results, further research is needed to test its adaptability in diverse agricultural environments.
While effective in specific contexts, improved YOLO models face limitations in robustness and accuracy under diverse conditions, such as varying lighting and complex backgrounds. Additionally, their performance is often crop-specific, lacking the generalizability required for broader agricultural applications.
3. Dataset Construction and Preprocessing
3.1. Dataset Construction
The dataset was collected in Anding District, Dingxi City, Gansu Province, China, with geographical coordinates at 104°39′3.05″ E longitude and 35°34′45.4″ N latitude. It covers the entire growth cycle of spring wheat, with the collection period spanning from 8 April 2024 to 20 July 2024. The dataset includes key phenological stages such as seedling emergence, tillering, stem elongation, heading, grain filling, and maturation.
The data were gathered using a smartphone with a 40-megapixel primary camera. All photos were captured at a resolution of 2736 × 2736 pixels. Images were taken at a distance of approximately 30 cm to 100 cm from the ground, ensuring that the weeds were clearly visible while maintaining the natural context of the wheat field. Considering the training effectiveness of the YOLOv8 model, each image contains between 1 and 5 bounding boxes of target weeds, reflecting the natural distribution and density of the weeds in the field.
To ensure environmental diversity, the dataset includes images taken under various weather conditions, such as sunny, cloudy, and rainy days. Multiple shooting angles were employed, including vertical shots, frontal views, and 45-degree angles, enhancing the comprehensiveness of the dataset. A total of 4274 images were collected, covering the eight most common weed species in the collection area.
The targeted weed species, as shown in Figure 1, include Artemisia capillaris, Agropyron cristatum, Chenopodium album, Bassia scoparia, Cirsium arvense, Kali collinum, Raphanus raphanistrum, and Thermopsis lanceolata.
3.2. Data Annotation
Data annotation was performed using the LabelMe software. Trained personnel labeled each image in detail, consulting local wheat field farmers to ensure accurate identification of weed species, and all annotations were reviewed and verified by weed specialists to guarantee the precision and consistency of the labeling, with each weed species labeled using its Latin scientific name. Each image contained 1 to 5 annotation boxes, resulting in a total of 10,127 initial annotation boxes. After the initial labeling, weed experts carefully reviewed the annotations to resolve cases involving overlapping or occluded weeds and to ensure labeling accuracy.
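Since LabelMe stores annotations as per-image JSON files while YOLOv8 expects one normalized text line per bounding box, a conversion step is required. The following Python sketch illustrates one possible conversion, assuming each weed instance was annotated as a rectangle and labeled with its Latin name; the directory names and class ordering are illustrative assumptions, not the exact pipeline used in this study.

```python
import json
from pathlib import Path

# Hypothetical class list; its order defines the YOLO class indices.
CLASSES = ["Artemisia capillaris", "Agropyron cristatum", "Chenopodium album",
           "Bassia scoparia", "Cirsium arvense", "Kali collinum",
           "Raphanus raphanistrum", "Thermopsis lanceolata"]

def labelme_to_yolo(json_path: Path, out_dir: Path) -> None:
    """Convert one LabelMe JSON file into a YOLO-format .txt label file."""
    data = json.loads(json_path.read_text(encoding="utf-8"))
    w, h = data["imageWidth"], data["imageHeight"]
    lines = []
    for shape in data["shapes"]:
        if shape.get("shape_type") != "rectangle":
            continue  # only rectangular boxes are expected in this dataset
        (x1, y1), (x2, y2) = shape["points"]
        xc = (x1 + x2) / 2 / w        # normalized box center x
        yc = (y1 + y2) / 2 / h        # normalized box center y
        bw = abs(x2 - x1) / w         # normalized box width
        bh = abs(y2 - y1) / h         # normalized box height
        cls = CLASSES.index(shape["label"])
        lines.append(f"{cls} {xc:.6f} {yc:.6f} {bw:.6f} {bh:.6f}")
    (out_dir / f"{json_path.stem}.txt").write_text("\n".join(lines))

if __name__ == "__main__":
    out = Path("labels"); out.mkdir(exist_ok=True)
    for jp in Path("annotations").glob("*.json"):  # placeholder folder
        labelme_to_yolo(jp, out)
```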
3.3. Data Augmentation and Data Splitting
To address the initial imbalance in the collection of images across different categories, several data augmentation techniques were implemented to supplement and balance the dataset, improving its robustness, as discussed in Xu et al. [32] (2023). This approach is illustrated in Figure 2, demonstrating how these techniques result in a more representative and varied dataset. These techniques included adjusting brightness levels to simulate various lighting conditions in real-world environments, ensuring the model can effectively recognize weed species under different illumination, as seen in subfigures (d) Increased Brightness and (e) Decreased Brightness. Gaussian blur was applied to mimic varying focus conditions, as depicted in subfigure (b) Gaussian Blur, helping the model identify weeds even when images are not perfectly in focus. Additionally, salt-and-pepper noise was introduced to replicate disturbances in real-world imaging scenarios, enhancing the model’s resilience to noise, as illustrated in subfigure (c) Salt-and-Pepper Noise.
Furthermore, horizontal and vertical flips were performed to capture the weed species from different orientations, represented in subfigures (f) Horizontal Flip and (g) Vertical Flip. Finally, several rotational transformations were applied, including 45 degrees, 90 degrees, and 135 degrees, to ensure the model can recognize the weeds from multiple perspectives, as indicated in subfigures (h), (i), and (j), respectively. This comprehensive approach to data augmentation helps create a more balanced and effective training dataset, ultimately enhancing the model’s performance in identifying weed species.
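The augmentation pipeline described above can be reproduced with standard image-processing operations. The sketch below, based on OpenCV and NumPy, shows one possible implementation of the brightness adjustments, Gaussian blur, salt-and-pepper noise, flips, and rotations; the parameter values (brightness factors, kernel size, noise ratio) and the file path are illustrative assumptions rather than the exact settings used in this study, and the corresponding bounding-box transforms for flips and rotations are omitted for brevity.

```python
import cv2
import numpy as np

def adjust_brightness(img: np.ndarray, factor: float) -> np.ndarray:
    """factor > 1 brightens the image, factor < 1 darkens it."""
    return np.clip(img.astype(np.float32) * factor, 0, 255).astype(np.uint8)

def gaussian_blur(img: np.ndarray, ksize: int = 5) -> np.ndarray:
    return cv2.GaussianBlur(img, (ksize, ksize), 0)

def salt_and_pepper(img: np.ndarray, ratio: float = 0.02) -> np.ndarray:
    noisy = img.copy()
    mask = np.random.rand(*img.shape[:2])
    noisy[mask < ratio / 2] = 0          # pepper pixels
    noisy[mask > 1 - ratio / 2] = 255    # salt pixels
    return noisy

def rotate(img: np.ndarray, angle: float) -> np.ndarray:
    h, w = img.shape[:2]
    m = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    return cv2.warpAffine(img, m, (w, h))

img = cv2.imread("wheat_field_sample.jpg")   # placeholder image path
augmented = [
    adjust_brightness(img, 1.3),             # (d) increased brightness
    adjust_brightness(img, 0.7),             # (e) decreased brightness
    gaussian_blur(img),                      # (b) Gaussian blur
    salt_and_pepper(img),                    # (c) salt-and-pepper noise
    cv2.flip(img, 1),                        # (f) horizontal flip
    cv2.flip(img, 0),                        # (g) vertical flip
    rotate(img, 45), rotate(img, 90), rotate(img, 135),  # (h)-(j) rotations
]
```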
In terms of data splitting, the original dataset comprised 4274 images; after data augmentation, the total increased to 5967. Of these, 600 images were randomly selected for the validation set and 1167 for the test set, with the remaining 4200 images forming the training set. The three subsets together contain 12,866 labeled bounding boxes, and the proportions of the training, validation, and test sets are approximately 7:1:2.
Figure 3 provides a comprehensive view of the training set’s label distribution and bounding box characteristics after data augmentation:
(1) Top Left (Category Instances): The bar chart shows the instance count for each weed species category in the training set. Post data augmentation, the sample distribution across categories has been balanced, ensuring that each class is represented by a relatively equal number of instances.
(2) Top Right (Bounding Box Shapes): This plot demonstrates the variety of bounding box shapes by overlaying multiple boxes. The boxes exhibit consistent proportions, supporting a balanced representation of object shapes across categories.
(3) Bottom Left (Bounding Box Position—x vs. y): The heatmap displays the distribution of bounding box center coordinates within the image space. A high density at the center of the plot indicates that many bounding boxes are centrally located in the images, though there is some spread across other regions as well, allowing the model to learn positional variations.
(4) Bottom Right (Bounding Box Dimensions—width vs. height): This scatter plot depicts the relationship between the width and height of bounding boxes. The diagonal trend indicates that bounding boxes maintain consistent aspect ratios, promoting uniform object representation across categories and aiding the model in recognizing similar objects with varying dimensions.
5. Experiment Design and Result Analysis
5.1. Experimental Setup
In this study, the experimental setup employed a high-performance computing system to evaluate the model’s performance in wheat field weed detection. The hardware configuration included an NVIDIA RTX 4090 GPU with 24 GB of VRAM, paired with an Intel Xeon Platinum 8352V CPU featuring 16 virtual cores at 2.10 GHz and 90 GB of RAM. This setup ensured efficient processing and training of the model.
The software environment was based on Python 3.8 and PyTorch version 2.0.0, running on Ubuntu 20.04 with CUDA version 11.8 for optimized GPU performance. The training was configured to run for a maximum of 300 epochs using Stochastic Gradient Descent (SGD) as the optimization algorithm, with an initial learning rate of 0.001, determined to facilitate effective model convergence through preliminary experiments. The batch size was set to 32, a decision influenced by the available GPU memory, allowing for balanced memory usage and training efficiency.
5.2. Performance Evaluation Metrics
Referring to the study by Padilla et al. [40] (2020), this research uses performance evaluation metrics such as Precision (P), Recall (R), and mean Average Precision (mAP) at Intersection over Union (IoU) thresholds of 0.50 (mAP@0.50) and 0.50-0.95 (mAP@0.50:0.95) to assess the effectiveness of the weed detection model in wheat fields.
(1) Precision (P): Precision quantifies the accuracy of the positive predictions made by the model. It is defined as the ratio of true positive predictions to the total predicted positives, as shown in Equation (1):
P = TP / (TP + FP)  (1)
where TP denotes the number of true positives and FP denotes the number of false positives. A high precision value indicates a low rate of false positives, which is essential in applications where false alarms can result in significant consequences.
(2) Recall (R): Recall, or sensitivity, measures the model’s capability to identify all relevant instances within the dataset. It is calculated as shown in Equation (2):
R = TP / (TP + FN)  (2)
where FN represents the number of false negatives. A high recall value indicates that the model successfully detects a significant proportion of true instances, which is particularly critical in weed detection to minimize missed detections.
(3) Mean Average Precision (mAP): mAP is a critical evaluation metric in object detection, summarizing model performance by balancing precision and recall across various IoU thresholds. It quantifies how effectively a model identifies and localizes objects within an image, providing a comprehensive measure of detection quality.
The metric includes two common thresholds: mAP@0.50 and mAP@0.50:0.95. mAP@0.50 represents the mean precision at a fixed IoU threshold of 0.50, which is relatively lenient and allows for moderate localization errors. This metric is often used as a baseline to gauge a model’s ability to detect objects, even if the bounding boxes are not perfectly aligned. In contrast, mAP@0.50:0.95 evaluates precision across a range of stricter IoU thresholds, from 0.50 to 0.95, in increments of 0.05. This more stringent metric assesses both detection and localization accuracy, offering a robust measure of the model’s consistency and precision in handling varying levels of overlap.
mAP@0.50 highlights detection capability, while mAP@0.50:0.95 provides a deeper evaluation of localization performance. A higher mAP value across these thresholds indicates superior detection and localization effectiveness.
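To make the IoU-threshold sweep concrete, the sketch below computes the IoU between two boxes and averages per-threshold average precision values over the 0.50:0.95 range in steps of 0.05; the ap_at_threshold argument is a stand-in for a full AP computation over precision-recall curves, not an implementation of it.

```python
import numpy as np

def iou(box_a, box_b) -> float:
    """Boxes are (x1, y1, x2, y2); returns intersection over union."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def map_50_95(ap_at_threshold) -> float:
    """Average AP over the IoU thresholds 0.50, 0.55, ..., 0.95."""
    thresholds = np.arange(0.50, 1.00, 0.05)
    return float(np.mean([ap_at_threshold(t) for t in thresholds]))
```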
(4) F1 Score: The F1 score serves as a combined metric that balances precision and recall. As shown in Equation (3), it is defined as the harmonic mean of precision and recall:
F1 = 2 × P × R / (P + R)  (3)
This score is particularly beneficial in scenarios with uneven class distributions, emphasizing the model’s performance in accurately identifying both positive and negative instances.
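For completeness, Equations (1)-(3) reduce to a few lines of code once detections have been matched to ground-truth boxes (for example with the IoU function sketched above); the counts in the usage example are illustrative only.

```python
from typing import Tuple

def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Equations (1)-(3) computed from matched detection counts."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# Illustrative example: 85 correct detections, 5 false alarms, 15 misses.
print(precision_recall_f1(85, 5, 15))  # -> (0.944..., 0.85, 0.894...)
```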
5.3. Model Performance Comparison
In this study, to enhance the performance of YOLOv8 in weed detection in wheat fields, we experimented with replacing the original YOLOv8 backbone network with various alternatives, including EfficientViT [41], UniRepLKNet [42], ConvNeXt V2 [43], VanillaNet [44], FasterNet [45], and PKINet. We evaluated each network based on key metrics such as Precision (P), Recall (R), mAP@0.5, mAP@0.50:0.95, and F1 score. The results show that PKINet achieved superior performance across these metrics, standing out as the most effective choice overall. Consequently, we selected PKINet to replace the original YOLOv8 backbone network.
As shown in Table 1, PKINet achieved the highest mAP@0.5 (84.8%) and mAP@0.50:0.95 (67.7%) among all tested backbones, indicating both high detection accuracy and improved localization precision. Additionally, PKINet’s Recall (76.3%) and F1 score (83.7%) are on par with or exceed those of the baseline YOLOv8, further highlighting its balanced performance in both identifying and correctly localizing weed instances. While FasterNet showed slightly higher Precision (93.3%), PKINet’s performance across all other key metrics makes it the most robust and consistent choice for enhancing YOLOv8 in this application.
Our results demonstrate that integrating PKINet into YOLOv8 yields the greatest improvement in weed detection performance, establishing it as the best choice among the tested backbones.
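The backbone comparison was carried out by training otherwise identical YOLOv8 variants and evaluating them on the same test split. A minimal sketch of such a comparison loop with the Ultralytics framework is shown below; the per-backbone model YAML file names are hypothetical, and registering a custom backbone such as PKINet inside the framework requires additional module definitions that are not shown here.

```python
from ultralytics import YOLO

# Hypothetical model configuration files, one per candidate backbone.
BACKBONE_CONFIGS = {
    "EfficientViT": "yolov8n-efficientvit.yaml",
    "UniRepLKNet": "yolov8n-unireplknet.yaml",
    "ConvNeXtV2": "yolov8n-convnextv2.yaml",
    "VanillaNet": "yolov8n-vanillanet.yaml",
    "FasterNet": "yolov8n-fasternet.yaml",
    "PKINet": "yolov8n-pkinet.yaml",
}

for name, cfg in BACKBONE_CONFIGS.items():
    model = YOLO(cfg)
    model.train(data="wheat_weeds.yaml", epochs=300, batch=32,
                optimizer="SGD", lr0=0.001)
    metrics = model.val(split="test")   # evaluate on the held-out test set
    print(f"{name}: precision={metrics.box.mp:.3f} recall={metrics.box.mr:.3f} "
          f"mAP50={metrics.box.map50:.3f} mAP50-95={metrics.box.map:.3f}")
```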
Table 2 presents the training results of various object detection models for wheat field weed detection, showcasing their overall performance and effectiveness across multiple evaluation metrics. Among these models, PMDNet, the model proposed in this study, achieves the best performance across multiple metrics. Notably, compared to the baseline YOLOv8n model, PMDNet improves mAP@0.5 by 2.2% and mAP@0.50:0.95 by 5.9%, demonstrating significant advancements in detection accuracy.
The baseline model YOLOv8n achieves a Precision of 92.0%, Recall of 76.4%, mAP@0.5 of 83.6%, mAP@0.50:0.95 of 65.7%, and F1-score of 83.5%. PMDNet surpasses the baseline with a Precision of 94.5%, mAP@0.5 of 85.8%, and mAP@0.50:0.95 of 69.6%, indicating the effectiveness of the proposed improvements.
Compared with other models, such as YOLOv5n [24], YOLOv10n [46], TOOD [47], Faster-RCNN [48], RetinaNet [49], ATSS [50], EfficientDet [51], and RT-DETR-L [52], PMDNet consistently demonstrates superior performance. Specifically, YOLOv10n achieves a mAP@0.50:0.95 of 64.7%, which is lower than PMDNet’s 69.6%. TOOD and Faster-RCNN perform relatively poorly, with TOOD achieving a mAP@0.50:0.95 of 55.3% and an F1-score of 73.6%, while Faster-RCNN achieves a mAP@0.50:0.95 of 52.9% and an F1-score of 72.3%, significantly below PMDNet’s performance.
Other models, such as RetinaNet and ATSS, perform adequately in certain metrics but still fall short of PMDNet overall. RetinaNet achieves a mAP@0.5 of 84.4%, slightly lower than PMDNet’s 85.8%, and a mAP@0.50:0.95 of only 56.5%, which is considerably lower than PMDNet’s 69.6%. Similarly, ATSS achieves a mAP@0.50:0.95 of 58.2% and an F1-score of 72.3%, both of which are inferior to PMDNet’s performance.
PMDNet outperformed several classical single-stage and two-stage object detection models, achieving the highest precision (94.5%), which is 14.1% higher than the lowest precision observed in Faster-RCNN (80.4%). For mAP@0.5, PMDNet achieved 85.8%, which is 5.4% higher than the lowest score of 80.4% from RT-DETR-L. Under the stricter mAP@0.50:0.95 metrics, PMDNet reached 69.6%, significantly surpassing Faster-RCNN (52.9%, an improvement of 16.7%) and RetinaNet (56.5%, an improvement of 13.1%). These results highlight PMDNet’s robustness and its consistent superiority over both single-stage and two-stage object detection models.
5.4. Ablation Study
To understand the contribution of each component in enhancing YOLOv8’s performance for weed detection, we conducted a series of ablation experiments by selectively incorporating PKINet, MSFPN, and DyHead. Each experiment systematically evaluates the effects of including or excluding these components on key performance metrics: Precision (P), Recall (R), mAP@0.5, mAP@0.50:0.95, and F1 score. The results are summarized in Table 3.
Experiment 1 serves as the baseline model, where none of the enhancements (PKINet, MSFPN, DyHead) are applied. This baseline achieves a precision of 92.0%, recall of 76.4%, mAP@0.5 of 83.6%, mAP@0.50:0.95 of 65.7%, and an F1 score of 83.5%.
Introducing PKINet alone (Experiment 2) yields immediate performance gains across all metrics, improving mAP@0.5 to 84.8% and mAP@0.50:0.95 to 67.7%, with a slight increase in the F1 score to 83.7%. These results suggest that PKINet contributes significantly to both detection accuracy and localization precision.
In Experiment 3, only MSFPN is introduced, enhancing the model’s multi-scale detection capability. MSFPN increases precision to 93.7% and mAP@0.50:0.95 to 68.0%, with an F1 score of 83.7%, demonstrating that MSFPN is particularly beneficial for precision improvement, as it allows the model to capture features at various scales.
When only DyHead is applied in Experiment 4, there is a noticeable improvement in recall to 76.7% and mAP@0.50:0.95 to 67.6%, which highlights DyHead’s effectiveness in refining object detection heads. However, the overall F1 score of 83.4% shows that the addition of DyHead alone does not outperform the combination of PKINet or MSFPN.
Experiment 5 combines PKINet and MSFPN, achieving further enhancements with a precision of 91.9%, recall of 77.0%, mAP@0.5 of 85.0%, and mAP@0.50:0.95 of 69.2%. The F1 score also increases to 83.8%, indicating the synergistic effect of PKINet and MSFPN in capturing multi-scale features while maintaining accuracy.
Experiment 6 combines MSFPN and DyHead without PKINet. This configuration achieves a mAP@0.50:0.95 of 69.9%, the highest among setups without PKINet, showing that MSFPN and DyHead together significantly enhance localization precision. However, the F1 score decreases slightly to 83.3%.
Finally, Experiment 7 incorporates all three components (PKINet, MSFPN, DyHead), achieving the highest scores across most metrics, with a precision of 94.5%, mAP@0.5 of 85.8%, mAP@0.50:0.95 of 69.6%, and an F1 score of 84.2%. This configuration validates that the combination of PKINet, MSFPN, and DyHead delivers the best performance, maximizing the model’s capability to detect and localize weeds accurately in wheat fields.
These ablation results confirm that each component contributes uniquely to model performance, and their combination results in the most balanced improvement across detection and localization metrics.
5.5. Model Visualization Results
5.5.1. Training Process Visualization
Figure 8 shows the comparison of training performance between the PMDNet model and the YOLOv8n model across four key metrics: Precision, Recall, mAP@0.5, and mAP@0.50:0.95. Each subplot represents one metric, with the horizontal axis indicating the number of training epochs and the vertical axis showing the corresponding metric values. Overall, PMDNet consistently outperforms YOLOv8n across all four metrics, highlighting the advantages of the proposed model in detection accuracy and stability during training.
Precision shows a consistent advantage for PMDNet over YOLOv8n across most epochs. PMDNet demonstrates a more stable improvement in detection accuracy, particularly in the earlier epochs, where it achieves higher precision values. Recall values for both models follow a similar trend, with PMDNet achieving slightly higher recall in the later epochs, indicating a better balance between detecting true positives and minimizing false negatives. For mAP@0.5, PMDNet consistently outperforms YOLOv8n throughout the training process, maintaining higher values even as training progresses. Under the stricter evaluation metric, mAP@0.50:0.95, PMDNet achieves significantly better results than YOLOv8n, underscoring its robustness and improved capability to accurately detect objects under challenging IoU thresholds.
By the end of the 300 training epochs, all metrics had stabilized, indicating that both models had converged. This suggests that the training parameters, such as the learning rate and batch size, were well chosen and effectively configured for both models.
It is worth noting, as shown in Figure 8, that although PMDNet achieves a slightly higher recall than the YOLOv8n model during training on the training dataset, its recall on the test dataset (76.0%) is slightly lower than that of YOLOv8n (76.4%). This discrepancy can be attributed to the distribution characteristics of the test dataset. PMDNet focuses more on improving overall detection performance and robustness, consistently outperforming YOLOv8n across other metrics. This superiority is evident in both the training process on the training dataset and the validation results on the test dataset, demonstrating the consistency of PMDNet’s performance.
5.5.2. Visualization of Prediction Results
Figure 9 illustrates the comparison of prediction results between the YOLOv8n model and the PMDNet model. To visualize the models’ predictions, five images were randomly selected from the test set and evaluated with both YOLOv8n and PMDNet. The three columns in Figure 9 represent, from left to right, the ground truth annotations, the predictions by YOLOv8n, and the predictions by PMDNet.
Both models showed relatively good prediction performance, effectively detecting and classifying objects in most cases. However, PMDNet demonstrated superior performance overall. In particular, the third row of images highlights a challenging scenario with highly similar weeds and backgrounds (complex backgrounds and nearly identical colors). In this case, PMDNet demonstrated better prediction performance, successfully detecting weeds in complex backgrounds. Although not all weeds were detected, PMDNet performed significantly better in handling complex scenarios compared to YOLOv8n, which failed to predict any bounding boxes in this case.
5.6. Testing on Wheat Field Video Sequence
In the final step of this study, a 30 s video was recorded in the wheat field used for dataset collection, using the same equipment as for dataset creation (a smartphone with a 40-megapixel primary camera). The captured video was processed using the PMDNet model for tracking and detection, and the detection results are shown in Figure 10. A frame was selected every 2 s from the video, for a total of 15 images displayed. The experimental results demonstrate that the PMDNet model performs well in weed detection on the video, effectively locating and identifying weeds, especially small target weeds and those in complex backgrounds. The detection boxes accurately marked the positions of weeds, with minimal false positives and false negatives observed in the video. The FPS of the PMDNet model during video detection was 87.7, which meets the real-time monitoring requirements. This experiment validates the application potential of the PMDNet model for weed detection in real-world wheat fields, confirming its effectiveness in agricultural weed detection.
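The video test can be reproduced with the framework’s streaming inference mode. The sketch below runs frame-by-frame detection on a 30 s clip, measures the average processing rate, and saves one annotated frame every two seconds for a Figure-10-style display; the weight file, video path, and confidence threshold are assumptions rather than the exact values used in this study.

```python
import time
import cv2
from ultralytics import YOLO

model = YOLO("pmdnet_best.pt")                 # hypothetical trained weights
cap = cv2.VideoCapture("wheat_field_30s.mp4")  # placeholder video path
fps_in = cap.get(cv2.CAP_PROP_FPS)             # source frame rate
cap.release()

frames, start = 0, time.time()
for i, result in enumerate(model.predict(source="wheat_field_30s.mp4",
                                         conf=0.25, stream=True)):
    frames += 1
    # Keep one annotated frame every 2 s of video for visual inspection.
    if fps_in and i % int(2 * fps_in) == 0:
        cv2.imwrite(f"frame_{i:05d}.jpg", result.plot())

elapsed = time.time() - start
print(f"Processed {frames} frames at {frames / elapsed:.1f} FPS")
```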
6. Discussion
In this study, we propose PMDNet, an improved model built upon the YOLOv8 architecture, specifically designed to enhance its detection performance for wheat field weed detection tasks. By systematically replacing the backbone, feature fusion layer, and detection head of YOLOv8 with PKINet, MSFPN, and DyHead, respectively, we have demonstrated significant improvements in detection accuracy. The results provide valuable insights into the benefits of advanced model architecture customization for agricultural object detection tasks.
6.1. Performance Analysis
The experimental results reveal that PMDNet outperforms the baseline YOLOv8n model in terms of both mAP@0.5 and mAP@0.50:0.95. Specifically, the mAP@0.5 increased from 83.6% to 85.8%, indicating a 2.2% improvement. Furthermore, the mAP@0.50:0.95 rose from 65.7% to 69.6%, achieving a substantial 5.9% enhancement. These advancements suggest that PMDNet effectively captures multi-scale features and adapts to complex scenes in wheat fields with improved generalization capabilities. The improvements in the comprehensive metric (mAP@0.50:0.95) highlight the robustness of PMDNet across varying IoU thresholds.
6.2. Impact of Individual Components
Each of the architectural modifications contributed uniquely to the performance gains. PKINet, as the backbone, provides stronger feature extraction capabilities, leveraging its hierarchical and multi-scale design to enhance the representation of fine-grained features that distinguish weeds from wheat. MSFPN optimizes the flow of multi-scale features through an advanced pyramid structure, ensuring effective information propagation and feature alignment. DyHead enhances the detection head by dynamically attending to contextual information, enabling better localization and classification of small or occluded objects.
6.3. Significance for Agricultural Applications
The proposed PMDNet demonstrates that integrating advanced deep learning modules can significantly improve the precision of weed detection in agricultural settings. Accurate detection of weeds in wheat fields is critical for ensuring optimal crop yield and minimizing herbicide usage. Beyond the quantitative improvements achieved in controlled experiments, PMDNet was also successfully applied to real-world wheat field video detection tasks, showcasing its capability to perform with both high precision and real-time efficiency. This practical validation underscores PMDNet’s robustness and suitability for deployment in dynamic, real-world scenarios. By improving detection accuracy and robustness and maintaining real-time processing, PMDNet provides a viable solution for precision agriculture applications. These advancements could lead to cost savings, reduced environmental impact, and higher efficiency in weed management practices, making it a valuable tool in sustainable farming.
6.4. Limitations and Future Work
While PMDNet achieves superior accuracy, several challenges remain that require further exploration. The model’s computational complexity, resulting from the integration of PKINet, MSFPN, and DyHead, limits its deployment on resource-constrained devices such as those commonly used in agricultural settings. To address this, future research will focus on developing lightweight versions of these components. For instance, pruning techniques, quantization, and knowledge distillation will be explored to balance performance and efficiency. This could enable real-time deployment of PMDNet on low-power edge devices, making it more practical for field applications.
Additionally, the dataset used in this study includes eight categories of wheat field weeds, which, while sufficient for the current scope, do not capture the full diversity of weed species encountered in real-world agricultural environments. Future work will prioritize expanding the dataset to include a broader spectrum of weed species, particularly those common in other climatic zones and agricultural systems.
The geographical and biological limitations of the dataset also warrant attention. The dataset was collected exclusively in a single region, which restricts the model’s generalizability to other agricultural conditions involving varying soil types, climates, and weed-crop interactions. In addition to geographic expansion, future efforts will focus on creating synthetic datasets or leveraging domain adaptation techniques to simulate diverse environmental conditions. This will reduce the dependence on extensive data collection and improve the model’s adaptability to new regions.
Furthermore, weeds exhibit distinct growth cycles, with visual appearances varying significantly across stages such as seedling, vegetative, and flowering. These intra-species variations can lead to misclassifications or missed detections by PMDNet, especially in later growth stages where weeds may closely resemble wheat. To mitigate this, future research will focus on creating stage-specific datasets and training the model using temporal feature extraction techniques to account for these growth-related changes. Techniques such as temporal attention mechanisms or recurrent neural network-based modules will be explored to enhance robustness to temporal variations in weed appearances.
Another limitation lies in PMDNet’s performance with particularly small or slender weed targets, such as Agropyron cristatum, which results in lower recall rates. To address this, targeted optimizations will be explored, including augmenting the training dataset with synthetic samples of such targets and employing advanced detection strategies, such as anchor-free methods or multi-scale feature refinement, to improve the detection of small and slender weeds.
Finally, future work will investigate alternative training strategies to improve generalization capabilities. Semi-supervised learning approaches, which leverage unlabeled data, and domain adaptation techniques, which address domain shifts between training and deployment environments, will be prioritized. These strategies, combined with efforts to optimize PMDNet for low-power devices and enhance its adaptability to diverse agricultural conditions, will ensure that PMDNet evolves into a more robust and practical tool for precision agriculture.
7. Conclusions
In conclusion, this study developed a wheat field weed detection dataset consisting of 5967 images, representing eight categories of weeds with balanced distribution across classes, supported by data augmentation techniques. Experimental results confirm that this dataset enables effective model training and serves as a reliable benchmark for evaluating object detection models.
Building upon this dataset, we proposed PMDNet, an improved object detection model specifically optimized for wheat field weed detection tasks. PMDNet integrates several architectural innovations: the backbone network was replaced with PKINet, the self-designed MSFPN was employed for multi-scale feature fusion, and DyHead was utilized as the detection head. These modifications significantly enhanced detection performance. Compared to the baseline YOLOv8n model, PMDNet achieved a 2.2% improvement in mAP@0.5, increasing from 83.6% to 85.8%, and a 5.9% improvement in mAP@0.50:0.95, increasing from 65.7% to 69.6%. Moreover, PMDNet demonstrated superior performance over classical detection models such as Faster-RCNN and RetinaNet, achieving a 16.7% higher mAP@0.50:0.95 compared to Faster-RCNN (52.9%) and a 13.1% improvement over RetinaNet (56.5%).
The training process analysis further demonstrated PMDNet’s stability and robustness. Throughout 300 epochs, PMDNet consistently outperformed YOLOv8n in precision, recall, and mAP metrics during training. Visualization of prediction results highlighted PMDNet’s advantage in handling challenging scenarios, such as complex backgrounds and small targets, where it consistently performed better than YOLOv8n.
Extensive ablation studies were conducted to validate the effectiveness of each proposed module, confirming their individual contributions to the model’s overall performance. In real-world video detection tests, PMDNet achieved an FPS of 87.7, meeting real-time monitoring requirements while maintaining high detection precision. These experiments demonstrated the model’s practical applicability, effectively detecting weeds in challenging environments such as dense vegetation and complex lighting conditions.
This study not only highlights the potential of PMDNet for addressing complex agricultural detection tasks but also provides a foundation for future research in precision agriculture. The advancements demonstrated in this work pave the way for more sustainable and effective weed management solutions, fostering the integration of AI technologies into modern farming practices.