PGi-YOLO: An Enhanced Detection Model for Maize Root–Stem Junction in Complex Field Environments

Ding, Qiming; Cao, Shuaishan; Yu, Changchang; Cai, Bingbing; Yuan, Yechao; Li, He

doi:10.3390/agriculture16111152

Open Access Article


by Qiming Ding, Shuaishan Cao, Changchang Yu *, Bingbing Cai, Yechao Yuan and He Li

College of Mechanical and Electrical Engineering, Henan Agricultural University, Zhengzhou 450002, China

* Author to whom correspondence should be addressed.

Agriculture 2026, 16(11), 1152; https://doi.org/10.3390/agriculture16111152 (registering DOI)

Submission received: 22 April 2026 / Revised: 20 May 2026 / Accepted: 21 May 2026 / Published: 24 May 2026

(This article belongs to the Section Artificial Intelligence and Digital Agriculture)


Abstract

Precise detection of the maize root–stem junction is crucial for hole fertilization in maize cultivation. However, under field conditions, root–stem junction detection is severely affected by soil clods, crop residues, and weeds. It is further complicated by variations in plant morphology, the small scale of targets, and their sparse spatial distribution. To address these issues, this study proposes PGi-YOLO, an improved model based on YOLOv11n-OBB. A P2 high-resolution detection layer is introduced to improve multi-scale feature representation and enhance small-target localization. A C2PSA-iRMB module, which integrates an inverted residual mobile block (iRMB), replaces the original C2PSA attention module. This strengthens global contextual information fusion while preserving a lightweight design. In addition, the Group Shuffle Convolution (GSConv) module replaces part of the standard convolution operations, reducing computational redundancy and improving inference efficiency. Experimental results show that PGi-YOLO achieves a precision of 92.0%, a recall of 93.4%, and an mAP@0.5 of 96.9%. It has 2.61 M parameters, a model size of 6.0 MB, and an inference time of 5.1 ms. Overall, PGi-YOLO achieves a favorable balance between accuracy and efficiency. It demonstrates strong robustness for maize root–stem junction detection in complex field environments and provides reliable support for precision agriculture applications.

Keywords:

maize root–stem junction; PGi-YOLO; object detection; oriented bounding box; hole fertilization

1. Introduction

As one of the most important grain crops worldwide, maize plays a critical strategic role in guaranteeing national food security, driving the high-quality development of the livestock industry, and enhancing agricultural productivity and efficiency [1,2]. Optimized fertilization management is essential for ensuring maize yield and sustainable cultivation. Hole fertilization is a precision fertilization technique in which fertilizers are applied in discrete holes below and adjacent to the maize root zone according to nutrient requirements [3,4]. Previous studies have demonstrated that this method offers multiple advantages, including promoting root development, improving fertilizer use efficiency, and reducing environmental pollution [5,6,7]. Accordingly, the development of maize hole fertilization technologies and related equipment has attracted increasing research attention.

In maize hole fertilization operations, accurate localization of maize plants is a key prerequisite for precise fertilizer application. At present, plant localization in hole fertilization primarily relies on mechanical contact-based methods, which not only tend to cause mechanical damage to seedlings but are also susceptible to factors such as plant spacing, operational speed, and field surface conditions, resulting in limited localization accuracy and difficulty in meeting the requirements of large-scale and precision maize cultivation [8,9,10]. With the advancement of agricultural intelligence, non-contact maize plant localization methods have gradually been introduced, among which machine vision-based approaches are the most widely adopted.

In vision-based maize plant localization, two primary strategies are commonly adopted: plant-center-based localization and root–stem junction (RSJ)-based localization [11]. Plant-center-based localization is primarily applied at the seedling stage, where top-view images of maize plants are acquired and plant positions are determined based on center-related features. In contrast, RSJ-based localization is typically applied during the jointing stage, where root–stem junction features are extracted from images acquired beneath the maize canopy. In this study, maize hole fertilization is primarily performed during the jointing stage; therefore, RSJ-based localization is adopted for plant position detection.

To achieve object detection in complex field environments, two main approaches are commonly adopted: traditional image processing methods and deep learning-based object detection methods. Traditional image processing methods typically perform maize detection by analyzing image features such as plant morphology, color, and size, or by integrating classical machine learning techniques [12,13]. Montalvo et al. [14] proposed a maize row detection method for fields with high weed pressure by combining image segmentation, dual-threshold analysis, and least-squares linear regression, enabling effective separation of maize and weeds. Ma et al. [15] developed a vision-based maize seedling recognition and localization method integrating HSV color space processing, background removal, and contour skeleton extraction, achieving a detection accuracy of 98.3%. Wei et al. [16] proposed a real-time localization method for individual maize plants based on excess green index enhancement, Otsu segmentation, and level-set methods, with a detection accuracy of 96%. Zan et al. [17] presented an autonomous maize tassel detection algorithm combining Random Forest and VGG16, where candidate regions were first extracted via random forest segmentation and morphological operations, followed by false detection removal using VGG16, enabling accurate detection across different growth stages. Kumar et al. [18] addressed the high annotation cost and time-consuming labeling in CNN training by proposing an automatic maize tassel detection method based on adaptive-threshold K-means clustering, together with a semi-automatic annotation tool that significantly reduces labeling time. Ren et al. [19] proposed an early identification method for hybrid maize seed production fields based on multi-source time-series remote sensing data and machine learning, where training samples are automatically generated via historical sample clustering to meet near-real-time monitoring requirements.

Traditional image processing methods are effective only in relatively simple scenarios. In complex field environments, their accuracy and robustness are easily affected by factors such as illumination variation, shadows, and background interference. In contrast, deep learning-based object detection approaches use multi-layer neural networks to automatically learn feature representations from large-scale datasets. This enables effective adaptation to diverse environmental variations and has made them the mainstream approach for maize detection in complex agricultural scenarios [20,21]. Yang et al. [22] used a Difference of Gaussian (DoG) pyramid algorithm to extract maize root–stem junction targets and construct a training dataset. They then trained a Faster R-CNN network for single- and multiple-maize recognition and localization, reaching a mean precision of 62%. Meng et al. [23] proposed an improved SSD model that integrates depthwise separable convolution, SENet, and feature-layer fusion to enhance the accuracy, speed, and robustness of maize seedling and weed detection, achieving a maize seedling recognition accuracy of 90.58%. Li [24] improved the YOLOv5 model by adopting MobileNetV2 as the backbone network and introducing the ECANet attention mechanism, achieving a maize root–stem junction detection accuracy of 96.12%. Diao et al. [25] proposed an enhanced YOLOv8s algorithm incorporating the ASPPF module for maize plant center detection and row-line fitting for navigation path extraction, achieving an mAP of 90.2% and a row-line fitting accuracy of 94.35%. Xu et al. [26] proposed a lightweight RS-LineNet model that extracts feature points from detection boxes for maize navigation line fitting. This addresses the issue that canopy feature points are easily affected by environmental interference, and the model achieved a detection accuracy of 91.1%. Gao et al. [27] developed a lightweight YOLOv4-based model that incorporates GhostNet, an attention mechanism, depthwise separable convolution, and multi-scale feature fusion for maize seedling detection. The model achieves an accuracy of 96.25%; however, its size reaches 71.69 MB, which increases hardware deployment costs and restricts its practical applicability. Liu et al. [28] proposed the DMSF-YOLO model for maize tassel detection, which improves feature extraction, multi-scale fusion, the dynamic detection head, and the loss function. The model effectively addresses challenges such as occlusion, illumination variation, and scale differences, and achieved a detection accuracy of 96.3%.

In studies on maize plant localization using deep learning methods, researchers have found that traditional horizontal bounding box annotation often includes excessive background information, such as soil surfaces, weeds, and crop residues, which introduces significant interference during model training and leads to suboptimal performance in maize detection. To address this issue, rotated object annotation methods have been increasingly adopted. By introducing rotation angle prediction, rotated bounding boxes can better align with the actual orientation and geometric structure of targets, thereby improving detection performance [29]. Xu et al. [30] proposed a maize plant-center localization method based on rotated bounding boxes. By adopting a high-precision annotation strategy and constructing an LC-AFPN network, they developed the LCA-YOLOv7-OBB model. Combined with color filtering and image moment calculation, the method achieved a detection accuracy of 85.19%. Xia et al. [31] proposed the MSFE-YOLOv11-OBB model for wood defect detection, which improves localization accuracy and scale adaptability by enhancing the FPN, CSP-PTB, and LSRFAConv modules, achieving a mAP@50 of 76.2% on industrial datasets. Ye et al. [32] developed a pose estimation approach for citrus fruit stalks using YOLO-OBB, where depth information is used to obtain 3D point clouds, and the OBB algorithm determines the stem orientation and picking point, achieving an average harvesting success rate of 82%. Ilo et al. [33] combined YOLOv8 instance segmentation with oriented bounding box detection to enable real-time morphological analysis in rice processing, reaching a detection precision of 94% and a size measurement error of 0.10 mm. 
Overall, in scenarios involving complex backgrounds, densely distributed targets, and severe occlusion—particularly in small-object detection tasks—rotated object detection algorithms can more accurately localize targets and effectively reduce interference from redundant background information, thereby improving overall localization accuracy [34,35,36].

In summary, previous maize root–stem junction detection methods have several critical limitations. Horizontal bounding boxes cannot accurately fit the morphology of maize root–stem junctions and tend to include background interference, which leads to false detections. Moreover, existing models are still limited in robustness, detection accuracy, and inference speed. To address these challenges, this study proposes a maize root–stem junction detection model based on YOLOv11n-OBB, termed PGi-YOLO. The main contributions of the proposed method compared with existing models are summarized as follows:

(1): Oriented bounding box annotation is adopted to tightly fit the elongated and inclined morphology of maize root–stem junction targets, effectively reducing redundant background interference and improving the precision of feature learning.
(2): A P2 high-resolution detection layer is introduced into the neck network to enhance feature extraction and improve the precise localization of small targets.
(3): A lightweight Group Shuffle Convolution (GSConv) module is employed to replace a portion of standard convolution operations, thereby improving computational efficiency and reducing inference latency while preserving strong feature representation capability.
(4): An inverted residual mobile block (iRMB) is integrated into the C2PSA module at the end of the backbone network to enhance global contextual information fusion and representation while preserving the model’s lightweight nature.

The proposed model integrates the above modules in a targeted manner and adopts a dedicated oriented bounding box annotation scheme for maize root–stem junctions. It achieves precise localization of maize root–stem junctions and provides a feasible technical reference for plant positioning in hole fertilization operations under complex field conditions.

2. Materials and Methods

2.1. Image Acquisition

The maize root–stem junction images used in this study were collected from experimental farmlands in Yuzhou City (34.162° N, 113.463° E) and Xinzheng City (34.396° N, 113.739° E), Henan Province, China. These areas have a warm temperate continental monsoon climate and predominantly cinnamon soil. Data were acquired in July 2024 and July 2025, corresponding to the jointing stage of summer maize. Images were captured with a Xiaomi 13 smartphone (Xiaomi Corporation, Beijing, China) at a shooting height of approximately 30 cm and a downward angle of 45°. These settings were chosen to match the installation position and field of view of the industrial camera on the pre-designed maize hole fertilization equipment. The original images had a resolution of 4096 × 3072 pixels and were saved in JPG format. To ensure dataset diversity and improve model generalization, images were collected under varying field conditions, as illustrated in Figure 1. These included leaf occlusion, weed occlusion, low light, strong illumination, and both simple and complex backgrounds. Together, these conditions represent typical complex field scenarios in maize cultivation. The collected images were screened to remove blurred samples and images without targets, yielding a final dataset of 2400 valid images. The dataset was then divided into training, validation, and test sets at a ratio of 8:1:1 (1920, 240, and 240 images, respectively) for model training and performance evaluation. All images were resized to 640 × 640 pixels before training to ensure unified input dimensions.
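The 8:1:1 split can be made reproducible with a seeded shuffle. The sketch below assumes the split is done per image, so all junctions from one photo stay in the same subset. It also reuses the seed of 42 reported for the training experiments; the text does not state which seed, if any, was used for splitting. The file names are hypothetical placeholders.

```python
import random

def split_dataset(image_names, ratios=(8, 1, 1), seed=42):
    """Shuffle image names reproducibly and split them by the given ratios.

    Splitting is per image, so every junction annotated in one image
    ends up in the same subset (no leakage between train/val/test).
    """
    names = sorted(image_names)          # deterministic starting order
    rng = random.Random(seed)            # local RNG; leaves global state untouched
    rng.shuffle(names)
    total = sum(ratios)
    n_train = len(names) * ratios[0] // total
    n_val = len(names) * ratios[1] // total
    train = names[:n_train]
    val = names[n_train:n_train + n_val]
    test = names[n_train + n_val:]       # remainder goes to the test set
    return train, val, test

# Hypothetical file names standing in for the 2400 screened images.
images = [f"img_{i:04d}.jpg" for i in range(2400)]
train, val, test = split_dataset(images)
print(len(train), len(val), len(test))  # 1920 240 240
```

Sorting before shuffling makes the result independent of the order in which the file system lists the images.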

2.2. Data Processing

Annotation accuracy plays a critical role in training efficiency and detection performance. Maize root–stem junctions are elongated and frequently inclined or overlapping. As a result, conventional horizontal bounding box annotation tends to include irrelevant background regions such as soil, weeds, and crop residues. This introduces substantial noise during feature learning and degrades detection accuracy. In contrast, rotated bounding box annotation adds an orientation angle to the horizontal bounding box, enabling a tighter fit to the natural morphology of maize root–stem junctions. The rotated bounding box is parameterized as (x, y, w, h, θ), where (x, y) denotes the center coordinates of the target, w and h represent the width and height of the box, respectively, and θ is the rotation angle. The angle is defined using the long-side representation: the horizontal rightward direction is 0°, counterclockwise rotation is positive, and the angle range is constrained to [−90°, 90°). To alleviate loss oscillations caused by angular periodicity, periodic constraints and boundary smoothing are applied during training. When the predicted angle exceeds the defined range, width–height swapping and angle correction are performed to ensure stable regression and maintain angular continuity. The angle is predicted by direct regression: the network output layer regresses θ as a continuous value, jointly optimized with the position and scale losses. This enables the model to capture the elongated and inclined structure of maize root–stem junctions. It also reduces background interference and provides more precise localization information for model training, as illustrated in Figure 2. In this study, maize root–stem junctions were manually annotated using X-AnyLabeling-GPU (v2.3.6) and exported in JSON format for subsequent model training.
All labeling was completed by a single professional researcher to unify annotation criteria and reduce subjective differences. A strict verification process was implemented to ensure label accuracy and reliability. After initial labeling, all annotations were re-inspected, bounding box boundaries were corrected, and mislabeled or missing samples were removed, ensuring high-quality and consistent annotations. The final dataset includes 2400 field images, with 6–8 maize root–stem junction targets per image and a total of 20,383 annotated instances.
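The long-side angle convention described above can be sketched in a few lines. The sketch assumes angles in degrees and a frame in which counterclockwise is positive. `wrap_angle` maps any angle into [−90°, 90°), and `to_long_side` performs the width–height swap with a 90° correction. `angle_error` shows why a periodicity-aware difference matters: 89° and −89° describe nearly the same box orientation. These helpers are illustrative only, not the exact correction or loss used in the study.

```python
def wrap_angle(theta_deg):
    """Map any angle (degrees) to the half-open interval [-90, 90), period 180."""
    return (theta_deg + 90.0) % 180.0 - 90.0

def to_long_side(x, y, w, h, theta_deg):
    """Normalize (x, y, w, h, theta) so that w is the long side and theta lies in [-90, 90)."""
    if w < h:
        # Swapping width and height moves the reference side by 90 degrees.
        w, h = h, w
        theta_deg += 90.0
    return x, y, w, h, wrap_angle(theta_deg)

def angle_error(pred_deg, gt_deg):
    """Signed angular difference that respects the 180-degree box periodicity."""
    return wrap_angle(pred_deg - gt_deg)

print(to_long_side(320, 240, 20, 60, 10))  # (320, 240, 60, 20, -80.0)
print(angle_error(89.0, -89.0))            # -2.0 (a naive difference would give 178)
```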

To improve model generalization, mitigate overfitting, and enhance the detection accuracy of maize root–stem junction under diverse imaging conditions, a series of data augmentation strategies was applied to the training dataset, including random rotation, horizontal and vertical flipping, random cropping, distortion, and brightness adjustment, as illustrated in Figure 3. These augmentations simulate variations in natural growth orientation, illumination conditions, soil occlusion, and imaging perspectives encountered in field environments.
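Geometric augmentations must update the oriented labels as well as the pixels. The following is a minimal sketch of how flips and a rotation about the image center transform an (x, y, w, h, θ) label. It assumes the coordinates and the counterclockwise-positive angle are expressed in the same Cartesian frame. It ignores boxes that leave the image after rotation or cropping. The function names are hypothetical and are not taken from any augmentation library.

```python
import math

def wrap_angle(theta_deg):
    """Map an angle to [-90, 90) degrees (oriented boxes have period 180)."""
    return (theta_deg + 90.0) % 180.0 - 90.0

def hflip_obb(box, img_w):
    """Mirror across the vertical image axis: x -> W - x, theta -> -theta."""
    x, y, w, h, t = box
    return (img_w - x, y, w, h, wrap_angle(-t))

def vflip_obb(box, img_h):
    """Mirror across the horizontal image axis: y -> H - y, theta -> -theta."""
    x, y, w, h, t = box
    return (x, img_h - y, w, h, wrap_angle(-t))

def rotate_obb(box, img_w, img_h, phi_deg):
    """Rotate the box center about the image center by phi; the box angle shifts by phi."""
    x, y, w, h, t = box
    cx, cy = img_w / 2.0, img_h / 2.0
    p = math.radians(phi_deg)
    dx, dy = x - cx, y - cy
    xr = cx + dx * math.cos(p) - dy * math.sin(p)
    yr = cy + dx * math.sin(p) + dy * math.cos(p)
    return (xr, yr, w, h, wrap_angle(t + phi_deg))

box = (10.0, 20.0, 5.0, 2.0, 30.0)  # hypothetical label
print(hflip_obb(box, 100))          # (90.0, 20.0, 5.0, 2.0, -30.0)
print(vflip_obb(box, 50))           # (10.0, 30.0, 5.0, 2.0, -30.0)
```

Brightness adjustment and distortion of pixel intensities leave the labels unchanged, so only the geometric transforms need this treatment.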

2.3. Improved YOLOv11-Based Model for Maize Root–Stem Junction Detection

2.3.1. YOLOv11 Network Architecture

YOLOv11, released in 2024, is a single-stage detector in the YOLO family. It performs object localization and classification in a single forward pass, enabling efficient real-time detection. The network has been widely adopted for object recognition and tracking, segmentation, and image classification [37]. The YOLOv11 architecture consists of a backbone, a neck, and a detection head. YOLOv11 introduces more efficient feature extraction modules, optimizes multi-scale feature fusion, and adopts a decoupled detection head with improved loss functions. Together, these changes further enhance detection accuracy, inference speed, and robustness [38]. YOLOv11n-OBB is a lightweight oriented object detection variant of the YOLOv11 framework. It represents objects using rotated bounding boxes parameterized by center coordinates, width, height, and rotation angle. A rotated IoU-based positive–negative sample assignment strategy is introduced. In this strategy, the rotated IoU between the ground-truth boxes of maize root–stem junctions and the anchor boxes serves as the matching criterion, so that sample assignment accounts for target orientation. Meanwhile, rotated multi-scale training (Rotated MSS) is adopted, in which the input image size is randomly selected from [512, 640, 736, 800]. This strategy improves the model's generalization across different growth stages and scales of maize root–stem junctions, thereby achieving more accurate localization. The rotated-box representation is particularly suitable for inclined, rotated, or densely distributed objects, as it reduces background redundancy and improves detection accuracy. In real field environments, maize root–stem junctions typically exhibit pronounced inclination and rotation. Therefore, YOLOv11n-OBB is selected as the baseline model for maize root–stem junction detection.
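The rotated IoU used as the matching criterion can be computed by intersecting two convex quadrilaterals. The following is a minimal pure-Python sketch under stated assumptions. Boxes are given as (cx, cy, w, h, θ in degrees) in a y-up frame. The overlap region is found with Sutherland–Hodgman polygon clipping, and its area with the shoelace formula. Production detectors typically use optimized or differentiable kernels instead. This sketch only illustrates the geometry.

```python
import math

def obb_corners(cx, cy, w, h, theta_deg):
    """Corners of a rotated box, listed counterclockwise (y-up frame)."""
    t = math.radians(theta_deg)
    c, s = math.cos(t), math.sin(t)
    offsets = [(-w / 2, -h / 2), (w / 2, -h / 2), (w / 2, h / 2), (-w / 2, h / 2)]
    return [(cx + dx * c - dy * s, cy + dx * s + dy * c) for dx, dy in offsets]

def polygon_area(pts):
    """Shoelace formula; returns 0 for degenerate polygons."""
    n = len(pts)
    if n < 3:
        return 0.0
    acc = sum(pts[i][0] * pts[(i + 1) % n][1] - pts[(i + 1) % n][0] * pts[i][1]
              for i in range(n))
    return abs(acc) / 2.0

def _inside(p, a, b):
    # True if p lies on or to the left of the directed edge a->b.
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0]) >= 0

def _intersect(p, q, a, b):
    # Intersection of segment p-q with the infinite line through a and b.
    (x1, y1), (x2, y2), (x3, y3), (x4, y4) = p, q, a, b
    den = (x1 - x2) * (y3 - y4) - (y1 - y2) * (x3 - x4)
    c1 = x1 * y2 - y1 * x2
    c2 = x3 * y4 - y3 * x4
    return ((c1 * (x3 - x4) - (x1 - x2) * c2) / den,
            (c1 * (y3 - y4) - (y1 - y2) * c2) / den)

def clip_convex(subject, clipper):
    """Sutherland-Hodgman clipping of `subject` by the convex CCW polygon `clipper`."""
    output = list(subject)
    for i in range(len(clipper)):
        a, b = clipper[i], clipper[(i + 1) % len(clipper)]
        inp, output = output, []
        if not inp:
            break
        s = inp[-1]
        for e in inp:
            if _inside(e, a, b):
                if not _inside(s, a, b):
                    output.append(_intersect(s, e, a, b))
                output.append(e)
            elif _inside(s, a, b):
                output.append(_intersect(s, e, a, b))
            s = e
    return output

def rotated_iou(box1, box2):
    """IoU of two (cx, cy, w, h, theta_deg) boxes."""
    inter = polygon_area(clip_convex(obb_corners(*box1), obb_corners(*box2)))
    union = box1[2] * box1[3] + box2[2] * box2[3] - inter
    return inter / union if union > 0 else 0.0

# Two 4x2 boxes crossing at 90 degrees overlap in a 2x2 square: 4 / (8 + 8 - 4).
print(round(rotated_iou((0, 0, 4, 2, 0), (0, 0, 4, 2, 90)), 4))  # 0.3333
```

The crossing-box example shows why rotated IoU suits elongated junctions. The two boxes share a center and have identical sizes. Because their orientations differ, their rotated IoU is only one third, whereas their axis-aligned envelopes would overlap far more.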

2.3.2. Improved PGi-YOLO Model

This study adopts YOLOv11n-OBB as the baseline model for maize root–stem junction detection. To meet the requirements of global feature extraction, small object detection, and lightweight deployment in maize root–stem junction detection, several improvements (P2 high-resolution detection layer, GSConv module and C2PSA-iRMB attention mechanism) were introduced. Based on these improvements, the architecture of the proposed PGi-YOLO network is illustrated in Figure 4.

P2 High-Resolution Detection Layer

Maize root–stem junction detection faces small target sizes, complex texture details, and strong interference from soil backgrounds. To address these challenges, the feature pyramid structure of YOLOv11n-OBB was improved. The original P3, P4, and P5 detection layers were retained, and a P2 high-resolution detection layer derived from the early stage of the backbone network was added, forming a four-level feature pyramid spanning P2–P5 [39]. The P2 layer directly uses high-resolution feature maps from the shallow layers of the backbone. Its feature map has twice the width and height of the P3 map, that is, four times as many spatial positions. This enables more precise capture of fine-grained features, such as the edge contours and surface textures of maize root–stem junctions, compensating for the insufficient shallow feature representation in the original model. Meanwhile, retaining the P3–P5 layers preserves strong perception of medium- and large-scale targets. This ensures coverage of targets of varying sizes while enhancing sensitivity to small objects.
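The effect of adding P2 can be quantified with simple grid arithmetic. The sketch below assumes the usual convention that level Pk has stride 2^k, with one prediction per grid location. At a 640 × 640 input, these assumptions give the dense prediction counts with and without the P2 layer. The sketch also shows how many grid cells a hypothetical 24-pixel-wide junction spans at each level.

```python
def pyramid_grids(img_size=640, levels=(2, 3, 4, 5)):
    """Grid side length per pyramid level, assuming level Pk has stride 2**k."""
    return {f"P{k}": img_size // (2 ** k) for k in levels}

def total_locations(grids):
    """Number of grid locations summed over levels (one prediction per location)."""
    return sum(side * side for side in grids.values())

with_p2 = pyramid_grids()
without_p2 = pyramid_grids(levels=(3, 4, 5))
print(with_p2)                      # {'P2': 160, 'P3': 80, 'P4': 40, 'P5': 20}
print(total_locations(without_p2))  # 8400
print(total_locations(with_p2))     # 34000

# Cells spanned by a hypothetical 24-pixel-wide target at each level.
for name, side in with_p2.items():
    stride = 640 // side
    print(name, 24 / stride)        # P2 6.0, P3 3.0, P4 1.5, P5 0.75
```

Under these assumptions, such a target covers six P2 cells instead of three P3 cells. This matches the motivation for P2, which gives small objects more positions at which they can be sampled. The roughly fourfold increase in prediction locations also explains why the P2 head adds computation even though it adds few parameters.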

In terms of feature fusion, the introduced P2 high-resolution detection layer is deeply integrated into the original bidirectional feature pyramid structure [40]. In the bottom-up pathway, high-resolution features from the P2 high-resolution detection layer are progressively propagated to the P3 and P4 layers through cross-layer connections, enriching local detail representation of intermediate features and improving the balance between semantic information and spatial precision. In the top-down pathway, high-level semantic features from the P3–P5 layers are upsampled and fused into the P2 high-resolution detection layer, enhancing the ability of shallow features to distinguish targets from backgrounds, particularly in complex scenarios with occlusion or background similarity. This multi-scale fusion strategy, which introduces shallow layers without removing higher-level layers, improves small-object recall while preserving global contextual awareness provided by high-level semantic features. As a result, the proposed model achieves more accurate and robust detection of maize root–stem junction with only a marginal increase in model parameters.

GSConv Module

To meet the lightweight and efficiency requirements of real-time operation on field mobile devices, GSConv was adopted to replace part of the standard convolution layers [41]. As illustrated in Figure 5, the GSConv module employs a dual-branch parallel structure that combines depthwise separable convolution (DSC) and standard convolution (SC). This structure improves feature extraction efficiency while reducing both model parameters and computational complexity. A channel shuffle operation then mixes information between the two branches, mitigating the loss of cross-channel feature fusion that depthwise separable convolution alone would cause. This design substantially improves inference speed while maintaining strong feature representation for maize root–stem junctions. It thus balances computational efficiency and detection accuracy, facilitating stable, real-time deployment on resource-constrained field devices.
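The channel shuffle step can be illustrated by the channel permutation it produces. The sketch uses the ShuffleNet-style shuffle: reshape the channels into (groups, C/groups), transpose, and flatten. The sketch assumes that the two GSConv branches are concatenated along the channel axis before the shuffle. The published GSConv code performs its reshape differently in detail, so treat this as an illustration of the interleaving idea, not a reproduction of that implementation.

```python
def channel_shuffle_order(num_channels, groups=2):
    """Source-channel index for each output position after a ShuffleNet-style
    shuffle: view channels as (groups, C/groups), transpose, then flatten."""
    if num_channels % groups:
        raise ValueError("num_channels must be divisible by groups")
    per_group = num_channels // groups
    return [g * per_group + j for j in range(per_group) for g in range(groups)]

# After concatenation: channels 0-3 from the standard-conv branch,
# channels 4-7 from the depthwise branch.
print(channel_shuffle_order(8))  # [0, 4, 1, 5, 2, 6, 3, 7]
```

The interleaved order places a standard-convolution channel next to each depthwise channel. Any subsequent grouped or local operation therefore sees features from both branches.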

The ratio r of the computational complexity of GSConv to that of standard convolution is given by Equation (1) [42]:

r = \frac{W H K_{1} K_{2} (D_{1} + 1) D_{2}}{2 W H K_{1} K_{2} D_{1} D_{2}}

(1)

where H and W denote the height and width of the feature map, respectively; K₂ and K₁ represent the height and width of the convolution kernel, respectively; and D₁ and D₂ denote the numbers of input and output channels, respectively.

Equation (1) simplifies to r = (D₁ + 1)/(2D₁), which approaches 1/2 as the number of input channels D₁ grows. GSConv therefore requires roughly half the computation of a standard convolution with the same kernel size and channel configuration. Accordingly, in this study, GSConv replaces part of the standard convolutions in the neck of the YOLOv11n-OBB model. This strategy enhances the extraction of fine-grained features of maize root–stem junctions while reducing parameter size and computational overhead. It thereby strikes a favorable trade-off between detection precision and running speed.
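Equation (1) can be evaluated directly for concrete layer shapes. The sketch computes the ratio exactly with `fractions.Fraction`. It confirms that W, H, K₁, K₂, and D₂ cancel, leaving (D₁ + 1)/(2D₁). The layer sizes (an 80 × 80 map, a 3 × 3 kernel, and 128 output channels) are arbitrary illustrative values, not layers taken from PGi-YOLO.

```python
from fractions import Fraction

def gsconv_ratio(W, H, K1, K2, D1, D2):
    """Ratio r from Equation (1): GSConv cost divided by standard-conv cost."""
    numerator = W * H * K1 * K2 * (D1 + 1) * D2
    denominator = 2 * W * H * K1 * K2 * D1 * D2
    return Fraction(numerator, denominator)

for d1 in (16, 64, 256):
    r = gsconv_ratio(80, 80, 3, 3, d1, 128)
    print(d1, r, float(r))
# 16 17/32 0.53125
# 64 65/128 0.5078125
# 256 257/512 0.501953125
```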

C2PSA-iRMB Attention Mechanism

In complex field environments, maize root–stem junctions are easily confused with soil, weeds, and crop residues, which makes them difficult to detect accurately. To address this, the C2PSA module was replaced with a C2PSA-iRMB module [43], as illustrated in Figure 6. This module models long-range dependencies through expanded-window multi-head self-attention (EW-MHSA). This enhances global contextual representation and enables more effective discrimination between maize root–stem junctions and background interference. Meanwhile, depthwise convolution (DWConv) preserves strong local feature extraction, facilitating the capture of fine-grained details such as texture and edge information [44]. This combination of global perception and local detail extraction improves the feature representation of maize root–stem junctions [45]. Furthermore, because its architecture combines window-based attention with depthwise convolution, the module improves detection performance at low computational cost. As a result, the model retains its lightweight characteristics and satisfies the requirements for real-time detection under field conditions.

Taking the input feature map X ∈ R^{C×H×W} as an example, the channel expansion layer expands the number of channels to λ times the original, as expressed in Equation (2) [46]:

X_{e} = \mathrm{MLP}_{e}(X) \in R^{\lambda C \times H \times W}

(2)

where λ is the expansion factor and MLP_e denotes the channel expansion implemented by a fully connected layer. The expanded feature X_e is then processed by the EW-MHSA module using a window partitioning strategy. Specifically, the query (Q) and key (K) are derived from the original input features, while the expanded features serve as the value (V). The detailed computation is as follows:

Q = K = \mathrm{Linear}(X) \in R^{C \times H \times W}

(3)

V = X_{e} \in R^{\lambda C \times H \times W}

(4)

The EW-MHSA module constructs the attention matrix (Attn) based on the original unexpanded input features, while dynamically weighting the expanded value representations within local windows, thereby facilitating global contextual modeling. The computation is formulated as Equation (5):

\mathrm{Attn}(X_{e}) = \mathrm{Softmax}\left(\frac{Q K^{T}}{\sqrt{d}}\right) V \in R^{\lambda C \times H \times W}

(5)

where d is the channel dimension of the query (Q) and key (K). The output of the EW-MHSA module is subsequently fed into a DWConv layer to further capture local spatial features. The operation is formulated as Equation (6):

X_{f} = \mathrm{DWConv}(\mathrm{Attn}(X_{e})) \in R^{\lambda C \times H \times W}

(6)

The features are then restored to the original channel dimension via a channel compression layer (MLP_s) and fused with the input features through a residual connection, as formulated in Equation (7):

Y = X + \mathrm{MLP}_{s}(X_{f}) \in R^{C \times H \times W}

(7)
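Equations (2)–(5) amount to scaled dot-product attention in which the attention weights come from the unexpanded features while the values are the expanded ones. The following is a minimal single-head, single-window sketch in plain Python under explicit assumptions. The window's H × W positions are flattened into N tokens. The Linear projection in Equation (3) is taken as the identity. MLP_e is a hypothetical fixed weight matrix. Window partitioning, multiple heads, DWConv, and MLP_s are omitted.

```python
import math

def matmul(a, b):
    """Plain-Python product of an (n x k) and a (k x m) list-of-lists matrix."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def softmax_rows(m):
    """Row-wise softmax with max subtraction for numerical stability."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def ew_attention(x, w_expand):
    """Single-head attention inside one window.

    x:        N tokens x C channels (unexpanded input, used as Q and K)
    w_expand: C x (lambda*C) weights standing in for MLP_e (hypothetical values)
    """
    x_e = matmul(x, w_expand)                  # Eq. (2): channel expansion
    q = k = x                                  # Eq. (3), Linear taken as identity here
    d = len(x[0])
    k_t = [list(col) for col in zip(*k)]
    scores = [[v / math.sqrt(d) for v in row] for row in matmul(q, k_t)]
    attn = softmax_rows(scores)                # N x N attention weights
    return attn, matmul(attn, x_e)             # Eqs. (4)-(5): weights applied to V = X_e

# Hypothetical 4-token window with C = 2 channels and lambda = 2.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
w_expand = [[1.0, 0.0, 0.5, 0.0], [0.0, 1.0, 0.0, 0.5]]
attn, out = ew_attention(x, w_expand)
print(len(out), len(out[0]))                   # 4 4 (N tokens x lambda*C channels)
```

The output keeps the expanded width λC, as Equation (5) requires. Only the N × N weight computation depends on C, which is why computing Q and K from unexpanded features saves computation.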

In this study, the PSABlock in the original C2PSA module of YOLOv11n was replaced with the iRMB module, resulting in a novel C2PSA-iRMB hybrid structure, as illustrated in Figure 6. The embedded iRMB enhances the feature representation capability of the backbone network through a self-attention mechanism. Meanwhile, the introduction of depthwise separable convolution enables more effective spatial feature extraction while improving computational efficiency, thereby reducing the model’s floating-point operations.

2.4. Experimental Environment and Evaluation Metrics

All experiments in this study were conducted on a ROG Strix G16 laptop (ASUSTek Computer Inc., Shanghai, China). It was powered by an Intel® Core™ i9-14900HX (64-bit) processor with 16 GB of RAM and an NVIDIA GeForce RTX 4060 GPU (8 GB VRAM). The software environment comprised the Windows 11 operating system, Python 3.8.15, the PyTorch 2.1.0 deep learning framework, and the PyCharm 2020 development environment. To ensure reproducibility, a fixed random seed of 42 was used throughout the experiments. The stochastic gradient descent (SGD) optimizer was employed with a momentum of 0.95 and a weight decay of 0.001. The initial learning rate was set to 0.01 and adjusted dynamically using a cosine annealing strategy. All models were initialized with weights pre-trained on the COCO dataset. The batch size was set to 4, a value constrained by the 8 GB GPU memory capacity and chosen to match the scale of the 2400-image dataset for stable gradient estimation. Training was conducted for 150 epochs. With COCO pre-trained initialization and cosine annealing, the subsequent experimental results confirmed that this number of epochs was sufficient for the model to converge. Training convergence was monitored using both the training loss and the validation mean Average Precision (mAP), and the checkpoint with the best validation performance was selected. Neither early stopping nor cross-validation was used. All models (baselines, competing models, and ablation variants) were trained and evaluated under identical conditions to ensure a reliable and unbiased comparison. These conditions covered the same number of epochs, batch size, augmentation pipeline, training strategies, hyperparameters, hardware and software platform, and dataset split. Sufficient hyperparameter tuning was performed for each model to eliminate unfair advantages.
All experiments used a single fixed train/validation/test split and a single independent training run per configuration. Repeated multi-model training is computationally expensive, and re-collecting and re-annotating field data requires substantial manual effort. Repeated runs, variance estimates, and confidence intervals were therefore not computed in this study.
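The learning-rate schedule and optimizer settings above can be written out explicitly. The section specifies only the initial learning rate of 0.01, so the final learning rate `lr_min` below is a hypothetical value. The sketch also anneals once per epoch, whereas the actual implementation may anneal per iteration. The update in `sgd_step` follows the rule documented for PyTorch's `torch.optim.SGD` with default dampening and `nesterov=False`: the weight decay is added to the gradient before the momentum buffer is updated. It is shown for a single scalar weight.

```python
import math

def cosine_lr(epoch, total_epochs=150, lr0=0.01, lr_min=1e-4):
    """Cosine annealing from lr0 at epoch 0 to lr_min (hypothetical) at total_epochs."""
    return lr_min + 0.5 * (lr0 - lr_min) * (1 + math.cos(math.pi * epoch / total_epochs))

def sgd_step(w, grad, buf, lr, momentum=0.95, weight_decay=0.001):
    """One SGD update with coupled L2 weight decay and classical momentum."""
    d = grad + weight_decay * w   # add the weight-decay term to the gradient
    buf = momentum * buf + d      # update the momentum buffer
    return w - lr * buf, buf      # step against the buffered direction

for epoch in (0, 75, 150):
    print(epoch, round(cosine_lr(epoch), 6))
# 0 0.01
# 75 0.00505
# 150 0.0001

w, buf = sgd_step(1.0, 0.5, 0.0, lr=cosine_lr(0))
```

Cosine annealing keeps the learning rate near 0.01 early in training and decays it smoothly toward `lr_min`. This is consistent with the smooth late-stage convergence curves discussed in Section 3.1.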

The evaluation metrics employed in this work include Precision (P), Recall (R), F1-score, and mean Average Precision (mAP@0.5 and mAP@0.5–0.95). The latter averages mAP over IoU thresholds from 0.5 to 0.95 in steps of 0.05. In object detection, these metrics are derived from the Intersection over Union (IoU) between predicted bounding boxes and their corresponding ground-truth labels. A predicted box is counted as a true positive (TP) if its IoU with a matched ground-truth box exceeds a predefined threshold (0.5 for mAP@0.5). Otherwise, it is counted as a false positive (FP). Ground-truth boxes not matched by any prediction are counted as false negatives (FN). True negatives (TN) would correspond to correctly rejected background regions. They are not used in these metrics because the number of possible background regions is effectively unbounded. In addition, the number of model parameters, inference time, and model size were used as secondary indicators of the model's lightweight characteristics and computational efficiency. Precision, Recall, AP, mAP, and F1-score are defined in Equations (8)–(12), respectively.

$$P = \frac{TP}{TP + FP} \times 100\% \tag{8}$$

$$R = \frac{TP}{TP + FN} \times 100\% \tag{9}$$

$$AP = \int_{0}^{1} P(R)\,\mathrm{d}R \tag{10}$$

$$mAP = \frac{1}{n} \sum_{i=1}^{n} AP_{i} \tag{11}$$

$$F_{1} = \frac{2PR}{P + R} \times 100\% \tag{12}$$

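where n is the total number of categories, i is the category index, and AP_i is the average precision of the i-th category. mAP@0.5 is computed at an IoU threshold of 0.5, whereas mAP@0.5–0.95 averages mAP over IoU thresholds from 0.5 to 0.95 in steps of 0.05.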
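Equations (8)–(12) can be turned into a short reference implementation. The sketch below is a minimal, dependency-free illustration, not the evaluation code used in this study: it assumes each prediction has already been labeled TP or FP by IoU matching at a fixed threshold, and it evaluates Eq. (10) with all-point interpolation of the precision envelope. Detection toolkits differ in this step (some sample the curve at 101 recall points), so values computed this way may differ slightly from those reported.

```python
from typing import List, Sequence, Tuple


def precision_recall(tp: int, fp: int, fn: int) -> Tuple[float, float]:
    """Eqs. (8) and (9), returned as fractions (multiply by 100 for %)."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return p, r


def f1_score(p: float, r: float) -> float:
    """Eq. (12): harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if p + r else 0.0


def average_precision(dets: Sequence[Tuple[float, bool]], num_gt: int) -> float:
    """Approximates Eq. (10) by all-point interpolation.

    dets: (confidence, is_true_positive) for every prediction of one class,
    with TP/FP already decided by IoU matching at a fixed threshold.
    """
    if num_gt == 0:
        return 0.0
    ranked = sorted(dets, key=lambda d: d[0], reverse=True)
    recalls: List[float] = [0.0]
    precisions: List[float] = [1.0]
    tp = fp = 0
    for _, is_tp in ranked:
        tp += int(is_tp)
        fp += int(not is_tp)
        recalls.append(tp / num_gt)
        precisions.append(tp / (tp + fp))
    # Precision envelope: precision at recall r becomes the best precision at any recall >= r.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    # Area under the stepwise envelope, summed over recall increments.
    return sum((recalls[i] - recalls[i - 1]) * precisions[i] for i in range(1, len(recalls)))


def mean_average_precision(aps: Sequence[float]) -> float:
    """Eq. (11): mean of the per-class APs over n classes."""
    return sum(aps) / len(aps) if aps else 0.0


if __name__ == "__main__":
    p, r = 0.920, 0.934  # reported Precision and Recall of PGi-YOLO
    print(f"F1 = {100 * f1_score(p, r):.1f}%")
```

With a single annotated class (n = 1), mAP reduces to the AP of that class; mAP@0.5–0.95 would repeat the TP/FP assignment at each IoU threshold from 0.50 to 0.95 and average the results. As an arithmetic illustration only, Eq. (12) applied to the reported P = 92.0% and R = 93.4% gives F1 ≈ 92.7%.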

3. Results and Discussion

3.1. Training Convergence Analysis of YOLOv11n, YOLOv11n-OBB, and PGi-YOLO

To evaluate the convergence behavior of PGi-YOLO, YOLOv11n, and YOLOv11n-OBB, the evolution of Precision, mAP@0.5, Recall, and mAP@0.5–0.95 over 150 training epochs was recorded, as illustrated in Figure 7. In terms of Precision (Figure 7a), PGi-YOLO converges faster than YOLOv11n-OBB and fluctuates less during training, stabilizing at approximately 92.0% after around 12 epochs. This improvement is likely attributable to the C2PSA-iRMB attention module, which enhances global feature aggregation and improves the discrimination of maize root–stem junctions. For Recall, PGi-YOLO outperforms YOLOv11n-OBB, stabilizing at approximately 93.4%, which can plausibly be attributed to the P2 high-resolution detection layer improving the detection of small-scale and obliquely oriented targets and thereby reducing missed detections. For mAP@0.5, PGi-YOLO consistently outperforms YOLOv11n-OBB and converges smoothly to 96.9%, indicating higher overall detection accuracy. Similarly, for mAP@0.5–0.95, PGi-YOLO maintains a clear advantage, indicating better localization accuracy under more stringent IoU thresholds. Compared with YOLOv11n, both PGi-YOLO and YOLOv11n-OBB achieve higher Precision and Recall throughout training, with faster convergence and smaller oscillations, and their mAP@0.5 and mAP@0.5–0.95 curves remain consistently above those of YOLOv11n. These results suggest that oriented object detection contributes to better convergence behavior.
Compared with conventional horizontal bounding boxes, rotated bounding box annotation appears to conform better to the true geometry of maize root–stem junctions, which vary in orientation, are elongated, and are frequently occluded by weeds and crop residues; it therefore reduces background interference and improves detection performance.
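The geometric argument can be made concrete with a small sketch. For a w × h box rotated by θ, the tightest horizontal box has area (w|cos θ| + h|sin θ|)(w|sin θ| + h|cos θ|), so a slender 100 × 20 box at 45° fills only about 28% of its horizontal enclosure; the rest is background. The sketch also computes rotated-box IoU by convex polygon clipping (Sutherland–Hodgman), one standard way to compare oriented boxes. The (cx, cy, w, h, θ) parameterization, the angle convention, and all function names are illustrative assumptions; YOLO-OBB toolkits may use a different IoU formulation internally.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float, float]  # (cx, cy, w, h, theta in radians), assumed format


def obb_corners(box: Box) -> List[Point]:
    """Counter-clockwise corners of an oriented box."""
    cx, cy, w, h, theta = box
    c, s = math.cos(theta), math.sin(theta)
    local = [(-w / 2, -h / 2), (w / 2, -h / 2), (w / 2, h / 2), (-w / 2, h / 2)]
    return [(cx + x * c - y * s, cy + x * s + y * c) for x, y in local]


def polygon_area(poly: List[Point]) -> float:
    """Shoelace formula; returns the absolute area."""
    n = len(poly)
    twice = sum(poly[i][0] * poly[(i + 1) % n][1] - poly[(i + 1) % n][0] * poly[i][1]
                for i in range(n))
    return abs(twice) / 2.0


def _cross(p: Point, a: Point, b: Point) -> float:
    """Positive if p lies to the left of the directed edge a -> b."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])


def clip_convex(subject: List[Point], clipper: List[Point], eps: float = 1e-9) -> List[Point]:
    """Sutherland-Hodgman: the part of `subject` inside the convex CCW polygon `clipper`."""
    output = list(subject)
    for i in range(len(clipper)):
        a, b = clipper[i], clipper[(i + 1) % len(clipper)]
        current, output = output, []
        for j in range(len(current)):
            p, q = current[j - 1], current[j]
            cp, cq = _cross(p, a, b), _cross(q, a, b)
            if cq >= -eps:                  # q is inside (or on) this edge
                if cp < -eps:               # entering: add the crossing point first
                    t = cp / (cp - cq)
                    output.append((p[0] + t * (q[0] - p[0]), p[1] + t * (q[1] - p[1])))
                output.append(q)
            elif cp >= -eps:                # leaving: add only the crossing point
                t = cp / (cp - cq)
                output.append((p[0] + t * (q[0] - p[0]), p[1] + t * (q[1] - p[1])))
        if not output:
            break
    return output


def rotated_iou(a: Box, b: Box) -> float:
    """IoU of two oriented boxes via polygon intersection."""
    pa, pb = obb_corners(a), obb_corners(b)
    inter_poly = clip_convex(pa, pb)
    inter = polygon_area(inter_poly) if len(inter_poly) >= 3 else 0.0
    union = polygon_area(pa) + polygon_area(pb) - inter
    return inter / union if union > 0 else 0.0


def hbb_fill_ratio(w: float, h: float, theta: float) -> float:
    """Share of the tightest horizontal box covered by a w x h box rotated by theta."""
    hbb_w = abs(w * math.cos(theta)) + abs(h * math.sin(theta))
    hbb_h = abs(w * math.sin(theta)) + abs(h * math.cos(theta))
    return (w * h) / (hbb_w * hbb_h)


if __name__ == "__main__":
    print(round(hbb_fill_ratio(100, 20, math.pi / 4), 3))
    print(round(rotated_iou((0, 0, 4, 4, 0), (2, 0, 4, 4, 0)), 3))
```

The fill ratio depends only on the aspect ratio and angle, which is why horizontal boxes degrade most for slender, strongly tilted targets such as stem bases.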

3.2. Ablation Study of Network Modules

To quantify the contribution of each modification in PGi-YOLO (the P2 layer, GSConv, and C2PSA-iRMB), ablation studies were performed on the test set with YOLOv11n-OBB as the baseline. Eight ablation groups were constructed: the baseline itself, three single-module variants, three two-module variants, and the full model with all three modules. These groups were designed to assess both the individual effect of each module and their combined impact on maize root–stem junction detection. The results are presented in Table 1.
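A full factorial ablation over three on/off switches yields 2³ = 8 configurations. The sketch below simply enumerates them; the labels are illustrative and may not match the row order or naming of Table 1.

```python
from itertools import product

MODULES = ("P2", "GSConv", "C2PSA-iRMB")


def ablation_groups():
    """All on/off combinations of the three modules on top of the YOLOv11n-OBB baseline."""
    groups = []
    for flags in product((False, True), repeat=len(MODULES)):
        enabled = [m for m, on in zip(MODULES, flags) if on]
        groups.append("YOLOv11n-OBB" + "".join(f" + {m}" for m in enabled))
    return groups


for name in ablation_groups():
    print(name)
```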

As shown in Table 1 (all metric changes below are absolute differences in percentage points relative to the YOLOv11n-OBB baseline), introducing the P2 high-resolution detection layer improves Recall, Precision, mAP@0.5, and mAP@0.5–0.95 by 1.1%, 0.9%, 1.5%, and 2.0%, respectively. The number of parameters and the inference time increase by 0.04 M and 1.5 ms, respectively, indicating that the P2 layer improves detection performance at the cost of additional parameters and computation. When the C2PSA-iRMB attention mechanism is applied on its own, Recall, Precision, mAP@0.5, and mAP@0.5–0.95 increase by 0.3%, 1.0%, 1.5%, and 1.2%, respectively, while the number of parameters and the inference time increase by 0.02 M and 0.8 ms, so this module also adds parameters and computational cost. These results indicate that the P2 layer preserves more low-level, high-resolution features, enabling finer capture of stem base contours. Meanwhile, the C2PSA-iRMB module, which combines depthwise separable convolution with self-attention, improves the model's ability to distinguish stem features from complex background interference such as soil and crop residues. When the GSConv module is introduced on its own, Recall, mAP@0.5, and mAP@0.5–0.95 increase by 1.7%, 1.4%, and 1.1%, respectively, while the number of parameters, model size, and inference time decrease by 0.09 M, 0.2 MB, and 11 ms, respectively. Owing to its lightweight convolution design, GSConv substantially reduces computational redundancy and achieves the fastest inference (3.0 ms) among the single-module variants. Overall, this variant retains baseline-level accuracy, with gains in Recall and mAP, while reducing both the parameter count and the model size.
These results indicate that GSConv preserves the extraction of key root–stem texture and morphological features while reducing computational redundancy, offering a lightweight option for field deployment of maize root–stem junction detection. When the P2 layer and the C2PSA-iRMB attention mechanism are combined, Precision, Recall, mAP@0.5, and mAP@0.5–0.95 increase by 1.9%, 1.2%, 1.9%, and 2.4%, respectively, whereas the number of parameters, model size, and inference time increase by 0.06 M, 0.4 MB, and 2.2 ms, respectively. In this combination, the P2 layer reduces missed detections of small stem bases, while C2PSA-iRMB suppresses background interference, improving detection robustness in complex field environments. Integrating all three modules (P2 + GSConv + C2PSA-iRMB) yields the best balance between accuracy and efficiency, with a Precision of 92.0%, a Recall of 93.4%, an mAP@0.5 of 96.9%, an mAP@0.5–0.95 of 65.6%, and an inference time of 5.1 ms. In this configuration, the P2 layer and C2PSA-iRMB jointly enhance fine-grained feature representation, while GSConv keeps inference efficient, meeting the combined requirements of accuracy, lightweight design, and robustness for maize root–stem junction detection.
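The parameter savings attributed to GSConv can be illustrated with a simple weight count. The sketch assumes the GSConv layout of the Slim-neck design by Li et al. as it is commonly implemented: a k × k dense convolution producing c2/2 channels, followed by a 5 × 5 depthwise convolution on those channels, with the two halves concatenated and channel-shuffled (operations that add no weights). Bias and BatchNorm terms are ignored, and the channel sizes in the example are arbitrary rather than taken from PGi-YOLO.

```python
def conv_params(c_in: int, c_out: int, k: int, groups: int = 1) -> int:
    """Weight count of a k x k convolution (bias and BatchNorm ignored)."""
    return k * k * (c_in // groups) * c_out


def standard_conv_params(c1: int, c2: int, k: int) -> int:
    """A plain dense convolution from c1 to c2 channels."""
    return conv_params(c1, c2, k)


def gsconv_params(c1: int, c2: int, k: int, dw_k: int = 5) -> int:
    """ASSUMED GSConv layout: dense k x k conv to c2/2 channels, then a dw_k x dw_k
    depthwise conv on those c2/2 channels; concatenation and shuffle add no weights."""
    c_half = c2 // 2
    dense = conv_params(c1, c_half, k)
    depthwise = conv_params(c_half, c_half, dw_k, groups=c_half)
    return dense + depthwise


if __name__ == "__main__":
    std, gs = standard_conv_params(128, 128, 3), gsconv_params(128, 128, 3)
    print(std, gs, round(gs / std, 3))
```

For a 3 × 3 layer with 128 input and 128 output channels, this gives 147,456 weights for a standard convolution versus 75,328 for GSConv, roughly half. The saving for the whole network is far smaller (0.09 M parameters for the GSConv-only variant) because only part of the standard convolutions is replaced.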

Overall, compared with the baseline, the proposed method increases the model size by only 0.3 MB while improving Recall, Precision, mAP@0.5, and mAP@0.5–0.95 by 2.9%, 1.5%, 2.2%, and 2.5%, respectively. Meanwhile, the number of parameters and the inference time decrease by 0.04 M and 8.9 ms, respectively. Although the mAP gain is moderate, the model maintains high detection accuracy while reducing parameter count and inference latency, achieving an effective trade-off between detection performance and lightweight deployment. This makes it well suited to maize root–stem junction detection tasks that demand real-time performance and accurate small-object recognition.
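The reported changes can be checked for internal consistency by back-calculating the baseline from PGi-YOLO's figures. The values below are derived arithmetically from the text rather than read from Table 1; they agree with the YOLOv11n-OBB figures quoted in Section 3.3 (Recall 90.5%, inference time 14 ms) and with the stated 3.0 ms of the GSConv-only variant. Frame rates assume that the reported inference time covers a full per-image forward pass.

```python
# Figures quoted in Section 3.2 for PGi-YOLO and its changes relative to YOLOv11n-OBB.
PGI = {"P": 92.0, "R": 93.4, "mAP50": 96.9, "mAP50_95": 65.6,
       "params_M": 2.61, "size_MB": 6.0, "latency_ms": 5.1}
DELTA = {"P": 1.5, "R": 2.9, "mAP50": 2.2, "mAP50_95": 2.5,
         "params_M": -0.04, "size_MB": 0.3, "latency_ms": -8.9}

# Stated latency changes of the single-module variants relative to the baseline (ms).
SINGLE_MODULE_LATENCY_DELTA = {"P2": 1.5, "C2PSA-iRMB": 0.8, "GSConv": -11.0}


def implied_baseline(final: dict, delta: dict) -> dict:
    """Back-calculate baseline = final - delta, rounded to two decimals."""
    return {k: round(final[k] - delta[k], 2) for k in final}


baseline = implied_baseline(PGI, DELTA)
variant_latency = {m: round(baseline["latency_ms"] + d, 1)
                   for m, d in SINGLE_MODULE_LATENCY_DELTA.items()}
fps = {"baseline": 1000 / baseline["latency_ms"], "PGi-YOLO": 1000 / PGI["latency_ms"]}

print(baseline)
print(variant_latency)
print({k: round(v, 1) for k, v in fps.items()})
```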

3.3. Comparative Experiments of Different Models

To verify the detection capability of the proposed model for maize root–stem junctions, comparative experiments were conducted under identical conditions with Faster R-CNN, RT-DETR, YOLOv8n-OBB, YOLOv9c-OBB, YOLOv10n-OBB, YOLOv11n-OBB, YOLOv11n, and PGi-YOLO. The results are presented in Table 2.

As shown in Table 2, PGi-YOLO outperforms Faster R-CNN and RT-DETR, with a higher Precision (92.0%) and Recall (93.4%) and an mAP@0.5 and mAP@0.5–0.95 of 96.9% and 65.6%, respectively, indicating stronger robustness under stricter IoU thresholds. The two-stage and transformer-based detectors carry substantial structural overhead: Faster R-CNN adds a region proposal network (RPN) in its two-stage paradigm, while RT-DETR relies on large transformer encoding and query embedding modules. These architectural characteristics lead to large model sizes (Faster R-CNN: 169.0 MB; RT-DETR: 66.2 MB), far exceeding that of PGi-YOLO, which has only 2.61 M parameters and a model size of 6.0 MB. PGi-YOLO also maintains a low inference time of 5.1 ms per image, much faster than the two-stage Faster R-CNN (36.1 ms) and comparable to or faster than the lightweight YOLO-series models. Overall, these results highlight its balance of detection accuracy, computational efficiency, and real-time performance for deployment in resource-constrained environments. Within the YOLO series, the existing oriented object detection models (YOLOv8n-OBB, YOLOv9c-OBB, YOLOv10n-OBB, and YOLOv11n-OBB) struggle to balance robustness, accuracy, and efficiency in maize root–stem junction detection. Although YOLOv8n-OBB is relatively efficient, with an inference time of 13.9 ms, its backbone lacks effective global modeling capability, resulting in weaker discrimination of critical features such as root–stem contours and branching nodes.
Consequently, its generalization under complex backgrounds is limited, with an mAP@0.5–0.95 of only 63.7%. YOLOv9c-OBB, benefiting from its GELAN-based architecture and greater network depth, raises Recall to 92.6%, mAP@0.5 to 96.4%, and mAP@0.5–0.95 to 65.8%, slightly above the 65.6% of PGi-YOLO. However, this comes at the cost of much higher model complexity (21.99 M parameters, a 45 MB model size, and an inference time of 19.7 ms), which compromises its lightweight and real-time advantages. YOLOv11n-OBB introduces the C2PSA attention module but does not fully address the limitation in long-range dependency modeling, resulting in a relatively low Recall of 90.5% and an inference time of 14 ms. The horizontal detector YOLOv11n has an inherent limitation: its axis-aligned boxes cannot tightly fit the contours of obliquely oriented root–stem junctions. As a result, considerable background is included during feature learning, producing a notable gap relative to all rotated detectors, with a Precision of 87.8%, Recall of 87.3%, mAP@0.5 of 92.2%, mAP@0.5–0.95 of 48.1%, and an inference time of 8.3 ms. PGi-YOLO improves mAP@0.5 by 1.2%, 0.5%, 0.8%, 2.2%, and 4.7% over YOLOv8n-OBB, YOLOv9c-OBB, YOLOv10n-OBB, YOLOv11n-OBB, and YOLOv11n, respectively. Although some of these gains are small, they may still matter in practice, because tiny and occluded root–stem junctions are highly sensitive to localization deviations; even minor accuracy improvements can reduce cumulative errors that would otherwise cause fertilization point offsets and missed applications, thereby improving the stability of the field operation system. The 4.7% absolute gain in mAP@0.5 over the horizontal YOLOv11n further supports the view that oriented bounding boxes are better suited than horizontal boxes to slender root–stem junction targets.
Notably, the model keeps a low parameter count of 2.61 M, a compact size of 6.0 MB, and a competitive inference time of 5.1 ms. This performance gain can be attributed to the combination of modules in PGi-YOLO, which integrates lightweight attention enhancement, multi-scale high-resolution feature extraction, and efficient convolution. This design helps address key challenges of existing methods, including the trade-off between model simplification and detection precision, limited small-object detection capability, insufficient generalization under complex backgrounds, and poor adaptability to real-time field applications. Overall, while preserving its lightweight and low-latency characteristics, PGi-YOLO achieves superior detection performance, making it well suited for practical maize root–stem junction detection.
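To relate the relative claims above to absolute numbers, the following arithmetic sketch derives the implied mAP@0.5 of each YOLO competitor from the stated gains, together with model-size and latency ratios relative to PGi-YOLO. All inputs are figures quoted in this section; the implied mAP@0.5 values for YOLOv8n-OBB, YOLOv10n-OBB, and YOLOv11n-OBB are back-calculated rather than read from Table 2.

```python
PGI = {"mAP50": 96.9, "size_MB": 6.0, "latency_ms": 5.1, "params_M": 2.61}

# Stated mAP@0.5 gains of PGi-YOLO over each YOLO competitor (percentage points).
MAP50_GAIN = {"YOLOv8n-OBB": 1.2, "YOLOv9c-OBB": 0.5, "YOLOv10n-OBB": 0.8,
              "YOLOv11n-OBB": 2.2, "YOLOv11n": 4.7}

# Other figures quoted in Section 3.3.
SIZE_MB = {"Faster R-CNN": 169.0, "RT-DETR": 66.2, "YOLOv9c-OBB": 45.0}
LATENCY_MS = {"Faster R-CNN": 36.1, "YOLOv8n-OBB": 13.9, "YOLOv9c-OBB": 19.7,
              "YOLOv11n-OBB": 14.0, "YOLOv11n": 8.3}

implied_map50 = {m: round(PGI["mAP50"] - g, 1) for m, g in MAP50_GAIN.items()}
size_ratio = {m: round(s / PGI["size_MB"], 1) for m, s in SIZE_MB.items()}    # times larger
speedup = {m: round(t / PGI["latency_ms"], 1) for m, t in LATENCY_MS.items()}  # times slower

print(implied_map50)
print(size_ratio)
print(speedup)
```

The derived values for YOLOv9c-OBB (96.4%) and YOLOv11n (92.2%) match the figures stated explicitly in the text, which supports the consistency of the reported gains.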

3.4. Visualization Comparison of Model Detection Effects

To enable a more intuitive comparison of detection performance, four representative field scenarios, including dense plant distribution, weed occlusion, and straw coverage, were selected for qualitative analysis in this study. The ground-truth annotations, together with the detection results of YOLOv11n, YOLOv11n-OBB, and the proposed PGi-YOLO model, were visualized and comparatively analyzed, as shown in Figure 8. Specifically, blue boxes correspond to correctly detected targets, green boxes denote the ground-truth labels, and red boxes indicate missed detections or false predictions.

In the simple background scenario (Figure 8a), eight targets are present. PGi-YOLO detects all of them, a detection success rate of 100%. In contrast, YOLOv11n misses one target, while YOLOv11n-OBB misses three and produces one false detection, giving detection success rates of 87.5% and 62.5%, respectively. In the weed occlusion scenario (Figure 8b), seven targets are present. PGi-YOLO detects all root–stem junctions and additionally identifies one unlabeled target, which may be attributable to annotation uncertainty. YOLOv11n detects all targets but with relatively low confidence scores, whereas YOLOv11n-OBB misses one occluded target, a detection success rate of 85.7%. In the straw occlusion scenario (Figure 8c), five targets are present. YOLOv11n misses two targets (detection success rate of 60%), while YOLOv11n-OBB misses three (40%) and also incorrectly splits one large ground-truth box into two smaller boxes. In contrast, PGi-YOLO detects all five annotated targets, demonstrating a clear advantage over the other models.
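The success rates quoted above appear to be computed as (targets − missed detections) / targets, which ignores false detections; this reading is inferred from the reported numbers (for example, 5/8 = 62.5% for YOLOv11n-OBB in Figure 8a despite its additional false detection) rather than stated in the text. The sketch reproduces the rates under that assumption.

```python
def detection_success_rate(num_targets: int, num_missed: int) -> float:
    """Share of annotated targets that were detected, in percent (false detections ignored)."""
    if num_targets <= 0:
        raise ValueError("num_targets must be positive")
    return 100.0 * (num_targets - num_missed) / num_targets


# (scenario, annotated targets, missed detections per model), as described for Figure 8
SCENES = [
    ("simple background", 8, {"YOLOv11n": 1, "YOLOv11n-OBB": 3, "PGi-YOLO": 0}),
    ("weed occlusion", 7, {"YOLOv11n": 0, "YOLOv11n-OBB": 1, "PGi-YOLO": 0}),
    ("straw occlusion", 5, {"YOLOv11n": 2, "YOLOv11n-OBB": 3, "PGi-YOLO": 0}),
]

for scene, n, missed in SCENES:
    rates = {m: round(detection_success_rate(n, k), 1) for m, k in missed.items()}
    print(scene, rates)
```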

To examine how the improved model perceives maize root–stem junctions, Grad-CAM was used to compare the attention regions and their distribution before and after the improvement. The visualization results are presented in Figure 9. In the heatmaps, brighter regions indicate stronger gradient-weighted activation, i.e., regions that contribute more to the model's prediction for the target. Red bounding boxes indicate false detections, whereas yellow bounding boxes indicate missed detections.
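Grad-CAM weights each feature map of a chosen layer by the spatially averaged gradient of the target score with respect to that map, sums the weighted maps, and applies a ReLU so that only positive evidence remains. The NumPy sketch below shows only this core computation. Extracting activations and gradients from a YOLO-OBB model (which layer to hook and which detection score to backpropagate) is framework-specific and not described in the paper, so both are treated here as given inputs.

```python
import numpy as np


def grad_cam(activations: np.ndarray, gradients: np.ndarray) -> np.ndarray:
    """Grad-CAM heatmap for one target score.

    activations: feature maps A of the chosen layer, shape (K, H, W)
    gradients:   d(score)/dA for the same layer, shape (K, H, W)
    Returns an (H, W) map scaled to [0, 1].
    """
    if activations.shape != gradients.shape:
        raise ValueError("activations and gradients must have the same shape")
    weights = gradients.mean(axis=(1, 2))                  # alpha_k: global-average-pooled gradients
    cam = np.tensordot(weights, activations, axes=(0, 0))  # sum_k alpha_k * A_k -> (H, W)
    cam = np.maximum(cam, 0.0)                             # ReLU keeps positive evidence only
    peak = cam.max()
    return cam / peak if peak > 0 else cam
```

For display, the (H, W) map is typically upsampled to the input image size and overlaid as a color heatmap.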

As shown in Figure 9, YOLOv11n exhibits relatively weak overall attention and tends to focus on regions with simple background structures and prominent target features, while paying limited attention to weed-occluded areas and distant small targets (yellow boxes). In addition, a non-target region receives a strong response (red box), indicating a false detection and insufficient suppression of complex background interference. Compared with YOLOv11n, YOLOv11n-OBB alleviates missed detections in complex scenarios and concentrates its attention more on target regions. PGi-YOLO shows stronger activation and more precise feature aggregation on maize root–stem junctions and localizes them more accurately, identifying targets partially occluded by weeds as well as small distant objects against complicated field backgrounds. The Grad-CAM visualization suggests that the multi-scale enhancement and attention design enable PGi-YOLO to suppress irrelevant background, focus on target contour features, and reduce the risk of missed and false detections. Overall, PGi-YOLO shows more favorable empirical performance than the baseline models in feature representation and detection robustness for maize root–stem junctions under complex field conditions.

The above visual comparisons show that target morphology, field background complexity, and model architecture jointly determine the performance of maize root–stem junction detection. Under complex field conditions, root–stem junction targets are typically small, irregular in shape, and frequently occluded by weeds and crop residues, which makes accurate detection difficult. Consequently, general-purpose baseline models (e.g., YOLOv11n and YOLOv11n-OBB) struggle to achieve both high accuracy and stable performance. In recent years, many lightweight agricultural object detection models have adopted a general-purpose design philosophy that prioritizes cross-dataset and cross-task generalization, as illustrated by several representative studies. For example, Shen et al. [47] proposed Light-Y, a lightweight unified detection model built on YOLOv5s with a general-purpose lightweight backbone and a dynamic detection head but no task-specific customization; its primary objective is stable cross-scenario detection of multiple crop organs, such as rice and wheat panicles. Lu et al. [48] developed MAR-YOLOv9, a lightweight cross-dataset detection model that uses a lightweight feature fusion architecture without task-specific branches and focuses mainly on generalization across four datasets. However, such generality-oriented designs involve a trade-off: greater universality can come at the expense of adaptability to task-specific challenges in agricultural scenarios, such as the small scale, strong background interference, and arbitrary orientations of maize root–stem junctions. In this study, driven by the requirements of precision hole fertilization in maize, a task-specific optimization strategy is adopted for root–stem junction detection.
The oriented bounding box provides tighter geometric alignment with slender, irregular root–stem structures; the P2 high-resolution detection layer enhances fine-grained perception of small targets; and the combination of C2PSA-iRMB and GSConv provides efficient feature discrimination while keeping the model lightweight. This integrated design helps overcome limitations of general-purpose detection frameworks, such as insufficient adaptability to specific field targets. Because the YOLOv11n-OBB baseline is already relatively strong, the gain in mAP@0.5 was limited, although it still reached 2.2%. At the same time, the proposed model reduced both inference time and parameter count, achieving a good balance among detection accuracy, architectural compactness, and inference efficiency. This balance is of practical significance for precision hole fertilization in maize production, as it helps reduce positioning deviations, missed detections, and potential mechanical damage to plants in complex field environments. Overall, PGi-YOLO offers a favorable combination of accuracy, lightweight design, and inference speed under complex field conditions, making it suitable for maize plant localization.

4. Conclusions

To address the challenges of maize root–stem junction detection, namely small target size, complex field background interference, and the difficulty lightweight models face in balancing accuracy and computational efficiency, this study proposes PGi-YOLO, an improved detection model based on YOLOv11n-OBB. A P2 high-resolution detection layer is introduced to enhance the perception and localization of small-scale targets. The C2PSA-iRMB module integrates an inverted residual mobile block into the original C2PSA structure, strengthening global contextual information fusion while preserving the lightweight character of the model. In addition, part of the standard convolutions is replaced with the lightweight GSConv module to reduce computational redundancy and improve inference efficiency. Compared with the baseline YOLOv11n-OBB, PGi-YOLO improves Precision by 1.5%, Recall by 2.9%, mAP@0.5 by 2.2%, and mAP@0.5–0.95 by 2.5%, while reducing the inference time by 8.9 ms and the parameter count by 0.04 M, at the cost of only a 0.3 MB increase in model size. These improvements reduce missed and false detections under occlusion and enhance robustness under complex field conditions, and they are expected to facilitate real-time deployment on low-cost devices and agricultural machinery. Accurate and stable root–stem junction detection can also help avoid fertilization offsets and missed fertilization caused by detection errors, improving the positioning accuracy and operational stability of fertilization equipment in complex farmland scenarios and providing technical support for precise plant localization in maize hole fertilization and related precision agriculture applications.

Limited by experimental conditions, the dataset was collected from only two planting sites in Henan Province during a single growing season, using fixed camera equipment and shooting heights, resulting in limited geographic, seasonal, and hardware diversity. The model was trained and tested solely on in-distribution data and has not been validated in unseen field environments or new geographic regions, which may lead to potential domain shift in practical applications. Meanwhile, manual OBB annotation is labor-intensive and time-consuming, and the model may also face generalization challenges when applied to new regions and crop species. To address these limitations, future work will expand the dataset with cross-regional, multi-season, and multi-variety samples, incorporate diverse imaging devices and variable shooting heights, and extend single-maize detection to multi-maize root–stem junction identification to improve generalization and mitigate domain shift. In addition, the lightweight PGi-YOLO will be deployed in maize precise hole fertilization systems for field validation. Furthermore, multi-sensor fusion experiments and semi-supervised learning strategies will be explored to reduce annotation cost and enhance robustness in complex farmland environments.

Author Contributions

Conceptualization, C.Y. and Q.D.; methodology, C.Y. and Q.D.; formal analysis, Q.D. and S.C.; data curation, Q.D., S.C. and B.C.; writing (original draft preparation), C.Y. and Q.D.; writing (review and editing), Y.Y. and H.L.; funding acquisition, C.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China (Grant No. 32501802) and the Postdoctoral Research Funding Program of Henan Province (Grant No. 24XM0495).

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author at yuchang@henau.edu.cn.

Acknowledgments

The authors would like to thank their schools and colleagues as well as those who funded the project. All support and assistance are sincerely appreciated.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Erenstein, O.; Jaleta, M.; Sonder, K.; Mottaleb, K.; Prasanna, B.M. Global maize production, consumption and trade: Trends and R&D implications. Food Secur. 2022, 14, 1295–1319. [Google Scholar] [CrossRef]
Sime, M.; Ballo, S.; Abro, Z.; Gugissa, D.A.; Mendesil, E.; Tefera, T. Farmers’ Perceptions of Maize Production Constraints and the Effects of Push–Pull Technology on Soil Fertility, Pest Infestation, and Maize Yield in Southwest Ethiopia. Agriculture 2024, 14, 381. [Google Scholar] [CrossRef]
Li, K.; Yuan, W.; Zhang, W.; Zhang, L.; Qi, B. Research status and development trend of corn fertilizing technology and fertilizing machine. J. Agric. Mech. Res. 2017, 39, 264–268. [Google Scholar] [CrossRef]
Liu, Z.; He, J.; Wang, Q.; Zheng, K. Research status and prospects of China’s maize hole-fertilizing device. Jiangsu Agric. Sci. 2019, 47, 5–8. [Google Scholar] [CrossRef]
Jiang, C.; Wang, H.; Lu, D.; Zhou, J.; Wang, S.; Zu, C. Single fertilization of urea in root zone improving crop yield. Trans. Chin. Soc. Agric. Eng. 2018, 34, 146–153. [Google Scholar] [CrossRef]
Li, G.; Su, X.; Zhou, X.; Zhang, X.; Wang, Q.; Wang, C. Design and experiment of pheumatic assisted hole fertilization targeting seed position device. Trans. Chin. Soc. Agric. Mach. 2024, 55, 191–200. [Google Scholar] [CrossRef]
Drazic, M.; Gligorevic, K.; Pajic, M.; Zlatanovic, I.; Spalevic, V.; Sestras, P.; Skataric, G.; Dudic, B. The Influence of the Application Technique and Amount of Liquid Starter Fertilizer on Corn Yield. Agriculture 2020, 10, 347. [Google Scholar] [CrossRef]
Hu, H.; Li, H.; Wang, Q.; He, J.; Zhang, Y.; Chen, W.; Wang, X. Design and experiment of targetedhole-pricking and deep-application fertilizer applicator between corn row. Trans. Chin. Soc. Agric. Eng. 2016, 32, 26–35. [Google Scholar] [CrossRef]
Li, M.; Wen, X.; Zhou, F. Working parameters optimization and experiment of precision hole fertilization control mechanism for intertilled crop. Trans. Chin. Soc. Agric. Mach. 2016, 32, 26–35. [Google Scholar] [CrossRef]
Yang, J.; Li, C.; Gu, D.; Zhang, Z.; Guan, Y.; Wu, F. Design and experiment of fixed hole fertilization mechanism for summer maize. J. Agric. Mech. Res. 2022, 44, 86–90. [Google Scholar] [CrossRef]
Zong, Z.; Zhao, S.; Liu, G. Coronal identification and centroid location of maize seeding stage. Trans. Chin. Soc. Agric. Mach. 2019, 50, 27–33. [Google Scholar] [CrossRef]
Rodene, E.; Fernando, G.D.; Piyush, V.; Ge, Y.; Schnable, J.C.; Ghosh, S.; Yang, J. Image filtering to improve maize tassel detection accuracy using machine learning algorithms. Sensors 2024, 24, 2172. [Google Scholar] [CrossRef] [PubMed]
Veramendi, W.N.C.; Cruvinel, P.E. Method for maize plants counting and crop evaluation based on multispectral images analysis. Comput. Electron. Agric. 2024, 216, 108470. [Google Scholar] [CrossRef]
Montalvo, M.; Pajares, G.; Guerrero, J.M.; Romeo, J.; Guijarro, M.; Ribeiro, A.; Ruz, J.J.; Cruz, J.M. Automatic detection of crop rows in maize fields with high weeds pressure. Expert Syst. Appl. 2012, 39, 11889–11897. [Google Scholar] [CrossRef]
Ma, Z.; Zhu, Y.; Zhang, X.; Li, A. Research on the method of crop recognition and location in maize seedings stage based on vision. J. Chin. Agric. Mech. 2020, 41, 131–137. [Google Scholar] [CrossRef]
Wei, S.; Zhang, Y.; Mei, S. Fast recognition method of maize core based on top view image. Trans. Chin. Soc. Agric. Mach. 2017, 48, 136–141. [Google Scholar] [CrossRef]
Zan, X.; Zhang, X.; Xing, Z.; Liu, W.; Zhang, X.; Su, W.; Liu, Z.; Zhao, Y.; Li, S. Automatic detection of maize tassels from UAV images by combining random forest classifier and VGG16. Remote Sens. 2020, 12, 3049. [Google Scholar] [CrossRef]
Kumar, A.; Desai, S.V.; Balasubramanian, V.N.; Rajalakshmi, P.; Guo, W.; Naik, B.B.; Balram, M.; Desai, U.B. Efficient maize tassel-detection method using UAV based remote sensing. Remote Sens. Appl. Soc. Environ. 2021, 23, 100549. [Google Scholar] [CrossRef]
Ren, T.; Liu, Z.; Zhang, L.; Liu, D.; Xi, X.; Kang, Y.; Zhao, Y.; Zhang, C.; Li, S.; Zhang, X. Early Identification of Seed Maize and Common Maize Production Fields Using Sentinel-2 Images. Remote Sens. 2020, 12, 2140. [Google Scholar] [CrossRef]
Peng, G.; Wang, K.; Ma, J.; Cui, B.; Wang, D. AGRI-YOLO: A Lightweight Model for Corn Weed Detection with Enhanced YOLO v11n. Agriculture 2025, 15, 1971. [Google Scholar] [CrossRef]
Li, H.; Hou, Y.; Li, Z.; Liu, Q.; Zhang, H.; Chen, L.; Xu, Q.; Zhao, Z. YOLO-SEW: A Lightweight Cotton Apical Bud Detection Algorithm for Complex Cotton Field Environments. Agriculture 2026, 16, 350. [Google Scholar] [CrossRef]
Yang, Y.; Zhang, Y.; Miao, W.; Zhang, T.; Chen, L.; Huang, L. Accurate identification and location of corn rhizome based on Faster R-CNN. Trans. Chin. Soc. Agric. Mach. 2018, 49, 46–53. [Google Scholar] [CrossRef]
Meng, Q.; Zhang, M.; Yang, X.; Liu, Y.; Zhang, Z. Recognition of maize seeding and weed based on light weight convolution and feature fusion. Trans. Chin. Soc. Agric. Mach. 2020, 51, 238–245+303. [Google Scholar] [CrossRef]
Li, Z. Research on Automatic Navigation and Drive Control Methods for Tracked Maize Plant Protection Machinery. Ph.D. Thesis, Anhui Agricultural University, Hefei, China, 2023. [Google Scholar] [CrossRef]
Diao, Z.; Guo, P.; Zhang, B.; Zhang, D.; Yan, J.; He, Z.; Zhao, S.; Zhao, C.; Zhang, J. Navigation line extraction algorithm for corn spraying robot based on improved YOLOv8s network. Comput. Electron. Agric. 2023, 212, 108049. [Google Scholar] [CrossRef]
Xu, Y.; Lu, Z.; Li, J.; Zhai, Y.; Liu, C.; Zhang, X.; Zhou, Y. Farmland Navigation Line Extraction Method Based on RS-LineNet Network and Root Subordination Relationship Optimization. Agronomy 2025, 15, 2069. [Google Scholar] [CrossRef]
Gao, J.; Tan, F.; Cui, J.; Ma, B. A Method for Obtaining the Number of Maize Seedlings Based on the Improved YOLOv4 Lightweight Neural Network. Agriculture 2022, 12, 1679. [Google Scholar] [CrossRef]
Liu, D.; Fang, J.; Zhao, Y. DMSF-YOLO: A Dynamic Multi-Scale Fusion Method for Maize Tassel Detection in UAV Low-Altitude Remote Sensing Images. Agriculture 2025, 15, 1259. [Google Scholar] [CrossRef]
Wang, X.; Wu, Y.; Zhang, X.; Hong, Z.; Li, G. Survey of rotating object detection research in computer vision. Comput. Sci. 2023, 50, 79–92. [Google Scholar] [CrossRef]
Xu, Y.; Guo, L.; Huang, D.; Zhou, Y.; Li, C. Corn plant core localization method based on high-fitting rotated bounding boxes for complex environments. Trans. Chin. Soc. Agric. Mach. 2025, 56, 129–138. [Google Scholar] [CrossRef]
Xia, F.; Yi, H.; Chen, X.; Wang, W.; Wu, H.; Kong, D. Oriented Object Detection in Wood Defect with Improved YOLOv11. Forests 2026, 17, 194. [Google Scholar] [CrossRef]
Ye, L.; Ma, J.; Lv, Y.; Guo, Z.; Lai, Z.; Ou, C.; Li, J.; Wu, F. The YOLO-OBB-Based Approach for Citrus Fruit Stem Pose Estimation and Robot Picking. Agriculture 2025, 15, 2330. [Google Scholar] [CrossRef]
Ilo, B.; Rippon, D.; Singh, Y.; Shenfield, A.; Zhang, H. Real-Time Rice Milling Morphology Detection Using Hybrid Framework of YOLOv8 Instance Segmentation and Oriented Bounding Boxes. Electronics 2025, 14, 3691. [Google Scholar] [CrossRef]
Geng, Y.; Lin, Y.; Fu, Y.; Yang, S. Object detection algorithm for pigs based on dual dilated layer and rotary box location. Trans. Chin. Soc. Agric. Mach. 2023, 54, 323–330. [Google Scholar] [CrossRef]
Zhang, W.; Xia, X.; Zhou, G.; Du, J.; Chen, T.; Zhang, Z.; Ma, X. Research on the identification and detection of field pests in the complex background based on the rotation detection algorithm. Front. Plant Sci. 2022, 13, 1011499. [Google Scholar] [CrossRef]
Song, H.; Jiao, Y.; Hua, Z.; Li, R.; Xu, X. Endosperm crack detection method for seed dipping maize based on YOLO v5-OBB and CT technology. Trans. Chin. Soc. Agric. Mach. 2023, 54, 394–401+439. [Google Scholar] [CrossRef]
Khanam, R.; Hussain, M. Yolov11: An overview of the key architectural enhancements. arXiv 2024, arXiv:2410.17725. [Google Scholar] [CrossRef]
He, H.; Zhang, J.; Cai, Y.; Chen, H.; Hu, X.; Gan, Z.; Wang, Y.; Wang, C.; Wu, Y.; Xie, L. Mobilemamba: Lightweight multi-receptive visual mamba network. In Proceedings of the 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 10–17 June 2025; IEEE: New York, NY, USA, 2025; pp. 4497–4507. [Google Scholar] [CrossRef]
Hao, Z.; Zhu, Y.; Lu, C. Modified YOLOv8 for automated detection and identification towards highway pavement distress. Constr. Build. Mater. 2025, 495, 143428.
Tang, P.; Zhang, Y. LiteFlex-YOLO: A lightweight small target detection network for maritime unmanned aerial vehicles. Pervasive Mob. Comput. 2025, 111, 102064.
Li, H.; Li, J.; Wei, H.; Liu, Z.; Zhan, Z.; Ren, Q. Slim-neck by GSConv: A lightweight-design for real-time detector architectures. J. Real-Time Image Process. 2024, 21, 62.
Liao, J.; Liu, K.; Yang, Y.; Yan, C.; Zhang, A.; Zhu, D. Rice disease recognition in natural environment based on RDN-YOLO. Trans. Chin. Soc. Agric. Mach. 2024, 55, 233–242.
Zhang, J.; Li, X.; Li, J.; Liu, L.; Xue, Z.; Zhang, B.; Jiang, Z.; Huang, T.; Wang, Y.; Wang, C. Rethinking mobile block for efficient attention-based models. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; IEEE: New York, NY, USA, 2023; pp. 1389–1400.
Han, Q.; Fan, Z.; Dai, Q.; Sun, L.; Cheng, M.M.; Liu, J.; Wang, J. On the connection between local attention and dynamic depth-wise convolution. arXiv 2022, arXiv:2106.04263.
Li, C.; Zhao, C.; Zhang, C.; Huang, W.; Li, J.; He, X.; Wang, Q. Multi-surface defect method for Dangshan pears based on AIC-YOLOv11n model. Trans. Chin. Soc. Agric. Eng. 2025, 41, 320–328.
Zhao, Y. Research and Application of Grape Disease and Pest Identification Based on an Improved YOLOv7 Algorithm. Master’s Thesis, Chongqing Three Gorges University, Chongqing, China, 2025.
Shen, X.; Li, S.; Qiu, F.; Yao, L. A lightweight real-time unified detection model for rice and wheat ears in complex agricultural environments. Smart Agric. Technol. 2025, 11, 101055.
Lu, D.; Wang, Y. MAR-YOLOv9: A multi-dataset object detection method for agricultural fields based on YOLOv9. PLoS ONE 2024, 19, e0307643.

Figure 1. Representative images under different field conditions: (a) leaf occlusion, (b) weed occlusion, (c) low-light conditions, (d) high-illumination conditions, (e) simple background, and (f) complex background.

Figure 2. Illustration of the image annotation process for maize root–stem junction detection using the oriented bounding box method.
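Oriented bounding boxes are usually stored as four corner points rather than as centre, size, and angle. The sketch below converts a centre-size-angle box to corners and formats one label line, assuming the annotations are exported in the Ultralytics OBB text format (`class x1 y1 x2 y2 x3 y3 x4 y4`, with coordinates normalised by image width and height). The source does not state its annotation tool or export format, and the helper names `obb_to_corners` and `to_label_line` are illustrative.

```python
import math

def obb_to_corners(cx, cy, w, h, theta):
    """Return the four corners of an oriented box, listed around its perimeter.

    cx, cy: box centre in pixels; w, h: side lengths in pixels;
    theta: rotation in radians, applied with the standard 2-D rotation matrix.
    Because image y points down, a positive angle appears clockwise on screen.
    """
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    # Offsets of the unrotated box corners from the centre.
    offsets = [(-w / 2, -h / 2), (w / 2, -h / 2), (w / 2, h / 2), (-w / 2, h / 2)]
    corners = []
    for dx, dy in offsets:
        # Rotate each offset, then translate it back to the box centre.
        x = cx + dx * cos_t - dy * sin_t
        y = cy + dx * sin_t + dy * cos_t
        corners.append((x, y))
    return corners

def to_label_line(class_id, corners, img_w, img_h):
    """Format 'class x1 y1 ... x4 y4' with coordinates normalised to [0, 1]."""
    coords = []
    for x, y in corners:
        coords += [x / img_w, y / img_h]
    return " ".join([str(class_id)] + [f"{c:.6f}" for c in coords])
```

For a 40 × 20 px box centred at (320, 240) in a 640 × 480 image with zero rotation, `to_label_line(0, obb_to_corners(320, 240, 40, 20, 0.0), 640, 480)` yields `0 0.468750 0.479167 0.531250 0.479167 0.531250 0.520833 0.468750 0.520833`.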

Figure 3. Data augmentation processes: (a) horizontal flip, (b) random cropping, (c) random rotation, (d) random distortion, (e) brightness adjustment, and (f) random defocus.
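Geometric augmentations (flip, crop, rotation) must transform the oriented-box labels together with the pixels, whereas photometric ones (brightness adjustment, defocus) leave the labels unchanged. A minimal sketch for corner labels in pixel coordinates, assuming flips mirror about the vertical centre line and rotations are about the image centre; the source does not specify its augmentation library or parameters, and the function names are illustrative.

```python
import math

def hflip_corners(corners, img_w):
    """Horizontal flip: mirror each corner about the vertical centre line."""
    return [(img_w - x, y) for x, y in corners]

def crop_corners(corners, x0, y0):
    """Random crop: shift corners into the frame of a crop whose top-left is (x0, y0).

    Boxes that end up partly or wholly outside the crop still need clipping
    or removal, which is not handled here.
    """
    return [(x - x0, y - y0) for x, y in corners]

def rotate_corners(corners, angle, img_w, img_h):
    """Rotation: rotate corners by `angle` radians about the image centre.

    The image must be rotated with the same matrix so pixels and labels stay aligned.
    """
    cx, cy = img_w / 2, img_h / 2
    c, s = math.cos(angle), math.sin(angle)
    rotated = []
    for x, y in corners:
        dx, dy = x - cx, y - cy
        rotated.append((cx + dx * c - dy * s, cy + dx * s + dy * c))
    return rotated
```

Working on corner points rather than on an angle parameter avoids angle-convention bugs: any rigid transform of the image can be applied to the four points directly.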

Figure 4. Overall architecture of the PGi-YOLO network.
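The added P2 head predicts on a stride-4 feature map, which is what makes it useful for small targets and also what raises its cost. The arithmetic below counts prediction locations per detection level, assuming a square 640 × 640 input (the YOLOv11 default; the input size used in this study is not stated in this section) and one prediction per grid cell, as in YOLOv11's anchor-free head.

```python
def grid_cells(img_size, strides):
    """Prediction locations per detection level for a square input (one per cell)."""
    return {stride: (img_size // stride) ** 2 for stride in strides}

baseline = grid_cells(640, [8, 16, 32])      # P3, P4, P5 heads
with_p2 = grid_cells(640, [4, 8, 16, 32])    # P2 head added
print(baseline)                # {8: 6400, 16: 1600, 32: 400}
print(sum(baseline.values()))  # 8400
print(sum(with_p2.values()))   # 34000
```

Each P2 cell covers a 4 × 4 px patch of the input instead of 8 × 8 px at P3, and the P2 level alone contributes 25,600 locations, raising the total to roughly four times the 8,400 locations of the three-head baseline.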

Figure 5. Structure of the GSConv module.
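GSConv saves computation by producing only half of the output channels with a dense convolution and generating the other half with a cheap depthwise convolution, then concatenating and shuffling the channels. The parameter-count sketch below assumes the layout of the publicly released Slim-neck code (a k × k dense convolution to c2/2 channels followed by a 5 × 5 depthwise convolution), ignores batch-normalisation and bias terms, and uses illustrative channel counts rather than the PGi-YOLO configuration.

```python
def standard_conv_params(c1, c2, k):
    """Weights of a dense k x k convolution from c1 to c2 channels (no bias)."""
    return k * k * c1 * c2

def gsconv_params(c1, c2, k, dw_k=5):
    """Weights of a GSConv block: dense conv to c2 // 2 channels plus a depthwise conv.

    The channel shuffle after concatenation only reorders channels
    and adds no parameters.
    """
    half = c2 // 2
    dense = k * k * c1 * half        # k x k conv: c1 -> c2/2
    depthwise = dw_k * dw_k * half   # one dw_k x dw_k filter per channel
    return dense + depthwise

std = standard_conv_params(256, 256, 3)
gs = gsconv_params(256, 256, 3)
print(std, gs, round(gs / std, 3))  # 589824 298112 0.505
```

The ratio approaches one half as the input width c1 grows, since the depthwise term does not depend on c1.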

Figure 6. Architecture of the C2PSA-iRMB module. Note: PSABlock denotes the attention module; iRMB represents the inverted residual mobile block; and EW-MHSA refers to the expanded window multi-head self-attention.
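Restricting self-attention to local windows is what keeps window-based attention such as EW-MHSA affordable. The sketch below counts the query-key pairs scored per head by full self-attention and by non-overlapping window attention with zero-padding. The 20 × 20 map (the stride-32 level of a 640 × 640 input, where YOLOv11 places C2PSA) and the window size of 7 are assumptions for illustration, not values reported in this section.

```python
import math

def global_attention_pairs(h, w):
    """Query-key pairs scored per head by full self-attention on an h x w map."""
    n = h * w
    return n * n

def window_attention_pairs(h, w, win):
    """Query-key pairs per head when attention is limited to win x win windows.

    Assumes the map is zero-padded up to a multiple of `win`, a common
    choice in window-attention implementations.
    """
    n_windows = math.ceil(h / win) * math.ceil(w / win)
    tokens = win * win
    return n_windows * tokens * tokens

print(global_attention_pairs(20, 20), window_attention_pairs(20, 20, 7))  # 160000 21609
```

Full attention grows with the square of the token count, while window attention grows linearly with the number of windows, which is why it suits a lightweight block.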

Figure 7. Training convergence of YOLOv11n, YOLOv11n-OBB, and PGi-YOLO over 150 epochs: (a) Precision, (b) Recall, (c) mAP@0.5, and (d) mAP@0.5–0.95.
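The mAP curves rest on per-threshold average precision (AP). A minimal sketch, assuming detections have already been matched to ground truth at a given IoU threshold and sorted by descending confidence. It uses all-point interpolation, whereas evaluation toolkits (for example, COCO-style 101-point sampling) may interpolate differently and give slightly different numbers. With a single target class, mAP equals the AP of that class.

```python
def average_precision(tp_flags, n_gt):
    """All-point interpolated AP from detections sorted by descending confidence.

    tp_flags[i] is 1 if the i-th ranked detection matches a not-yet-matched
    ground truth at the chosen IoU threshold, else 0. n_gt is the number of
    ground-truth boxes.
    """
    if n_gt == 0:
        return 0.0
    precisions, recalls = [], []
    tp = fp = 0
    for flag in tp_flags:
        tp += flag
        fp += 1 - flag
        precisions.append(tp / (tp + fp))
        recalls.append(tp / n_gt)
    # Precision envelope: make precision non-increasing as recall grows.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    # Sum rectangle areas wherever recall increases.
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap

# The ten IoU thresholds behind mAP@0.5-0.95: 0.50, 0.55, ..., 0.95.
IOU_THRESHOLDS = [round(0.50 + 0.05 * i, 2) for i in range(10)]

def map_50_95(ap_per_threshold):
    """mAP@0.5-0.95 for one class: mean AP over the ten IoU thresholds."""
    return sum(ap_per_threshold) / len(ap_per_threshold)

print(round(average_precision([1, 0, 1], 2), 4))  # 0.8333
```

Because each higher threshold demands tighter box alignment, mAP@0.5–0.95 rewards localisation quality that mAP@0.5 alone does not capture.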

Figure 8. Comparison of detection results of different models in three representative field scenarios: (a) simple background, (b) weed occlusion, and (c) straw occlusion.
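Judging whether a predicted oriented box matches a ground-truth box requires an IoU between rotated rectangles. The sketch below computes it exactly for convex quadrilaterals by Sutherland–Hodgman polygon clipping and the shoelace area formula. This section does not state which rotated-IoU definition the evaluation used; some toolkits use approximations (such as Gaussian-based probabilistic IoU), so values can differ, and the function names are illustrative.

```python
def _signed_area(poly):
    """Shoelace formula; positive for counter-clockwise vertex order."""
    s = 0.0
    for i in range(len(poly)):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % len(poly)]
        s += x1 * y2 - x2 * y1
    return s / 2.0

def _ccw(poly):
    """Return the polygon with counter-clockwise vertex order."""
    return poly if _signed_area(poly) >= 0 else poly[::-1]

def _clip(subject, a, b):
    """Keep the part of `subject` lying left of the directed edge a -> b."""
    def inside(p):
        return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0]) >= 0

    def intersect(p, q):
        # Intersection of segment p-q with the infinite line through a and b.
        (x1, y1), (x2, y2), (x3, y3), (x4, y4) = p, q, a, b
        den = (x1 - x2) * (y3 - y4) - (y1 - y2) * (x3 - x4)
        t = ((x1 - x3) * (y3 - y4) - (y1 - y3) * (x3 - x4)) / den
        return (x1 + t * (x2 - x1), y1 + t * (y2 - y1))

    out = []
    for i in range(len(subject)):
        cur, prev = subject[i], subject[i - 1]
        if inside(cur):
            if not inside(prev):
                out.append(intersect(prev, cur))
            out.append(cur)
        elif inside(prev):
            out.append(intersect(prev, cur))
    return out

def rotated_iou(box_a, box_b):
    """IoU of two convex quadrilaterals given as lists of (x, y) corners."""
    a, b = _ccw(list(box_a)), _ccw(list(box_b))
    inter = a
    # Clip box A successively by every edge of box B.
    for i in range(len(b)):
        if not inter:
            break
        inter = _clip(inter, b[i], b[(i + 1) % len(b)])
    inter_area = abs(_signed_area(inter)) if len(inter) >= 3 else 0.0
    union = abs(_signed_area(a)) + abs(_signed_area(b)) - inter_area
    return inter_area / union if union > 0 else 0.0
```

For two 2 × 2 squares offset by one unit horizontally, the overlap is 2 and the union is 6, so `rotated_iou` returns 1/3.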

Figure 9. Comparison of Grad-CAM feature-response visualizations for YOLOv11n, YOLOv11n-OBB, and PGi-YOLO.
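Grad-CAM weights each feature map of a chosen layer by the spatial average of the gradient of a target score with respect to that map, sums the weighted maps, and keeps the positive part. The pure-Python sketch below shows only that arithmetic on tiny nested lists. Obtaining activations and gradients from the detector (via framework hooks), choosing the layer and the target score for a detection model, and upsampling the map to image size are left outside the sketch, since this section does not describe them.

```python
def grad_cam(activations, gradients):
    """Grad-CAM heat map for one image and one target score.

    activations, gradients: lists of K feature maps (each an H x W list of lists),
    where gradients[k][i][j] is d(score)/d(activations[k][i][j]).
    """
    h, w = len(activations[0]), len(activations[0][0])
    # Channel weights: global average of each gradient map.
    weights = [sum(sum(row) for row in g) / (h * w) for g in gradients]
    cam = [[0.0] * w for _ in range(h)]
    for k, act in enumerate(activations):
        for i in range(h):
            for j in range(w):
                cam[i][j] += weights[k] * act[i][j]
    # ReLU keeps only regions that increase the target score.
    cam = [[max(0.0, v) for v in row] for row in cam]
    # Normalise to [0, 1] for display; an all-zero map is returned unchanged.
    peak = max(max(row) for row in cam)
    if peak > 0:
        cam = [[v / peak for v in row] for row in cam]
    return cam
```

With two 2 × 2 maps whose gradients average to +1 and -1, only the region activated by the positively weighted map survives the ReLU, which is the behaviour the heat maps in Figure 9 rely on.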

Table 1. Results of the ablation experiments.

Group	P2	GSConv	C2PSA-iRMB	Precision (%)	Recall (%)	F1-Score (%)	mAP@0.5 (%)	mAP@0.5–0.95 (%)	Parameters (M)	Model Size (MB)	Inference Time (ms)
1	✗	✗	✗	90.5	90.5	91	94.7	63.1	2.65	5.7	14.0
2	✓	✗	✗	91.4	91.6	91	96.2	65.1	2.69	6.1	15.5
3	✗	✓	✗	89.9	92.2	91	96.1	64.2	2.56	5.5	3.0
4	✗	✗	✓	91.5	90.8	91	96.2	64.3	2.67	5.7	14.8
5	✓	✓	✗	92.4	91.2	92	96.3	65.4	2.66	5.9	3.8
6	✓	✗	✓	92.4	91.7	92	96.6	65.5	2.71	6.1	16.2
7	✗	✓	✓	91.6	91.0	91	96.0	64.2	2.58	5.5	3.3
8	✓	✓	✓	92.0	93.4	92	96.9	65.6	2.61	6.0	5.1

Note: “✓” indicates that the module is included, whereas “✗” indicates that the module is not included.
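The F1-score column is the harmonic mean of precision and recall, F1 = 2PR/(P + R). A minimal check using two rows of Table 1 (the dictionary below simply copies their precision and recall values):

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (precision, recall) copied from Table 1.
rows = {1: (90.5, 90.5), 8: (92.0, 93.4)}
for group, (p, r) in rows.items():
    print(group, round(f1_score(p, r), 1))  # 1 90.5, then 8 92.7
```

Recomputing from the rounded values gives 90.5% for Group 1 and about 92.7% for Group 8, whereas Table 1 lists 91% and 92%; the source does not state how the integer F1 values were derived or rounded.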

Table 2. Comparative results of different models for maize root–stem junction detection.

Model	Precision (%)	Recall (%)	F1-Score (%)	mAP@0.5 (%)	mAP@0.5–0.95 (%)	Parameters (M)	Model Size (MB)	Inference Time (ms)
Faster R-CNN	88.1	90.3	89	93.1	45.5	41.34	169.0	36.1
RT-DETR	87.5	88.3	88	90.4	45.7	31.98	66.2	3.4
YOLOv8n-OBB	90.2	91.0	91	95.7	63.7	2.75	5.8	13.9
YOLOv9c-OBB	90.2	92.6	91	96.4	65.8	21.99	45.0	19.7
YOLOv10n-OBB	90.4	92.3	91	96.1	64.6	2.33	5.1	2.2
YOLOv11n-OBB	90.5	90.5	91	94.7	63.1	2.65	5.7	14.0
YOLOv11n	87.8	87.3	88	92.2	48.1	2.58	5.5	8.3
PGi-YOLO	92.0	93.4	92	96.9	65.6	2.61	6.0	5.1
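The headline comparison against the YOLOv11n-OBB baseline can be reproduced from Table 2 with simple arithmetic. The sketch below computes the mAP@0.5 gain in percentage points, the relative change in parameters, and the frame rate implied by each latency. The FPS figure is an assumption-laden conversion: it ignores batch size and any pre- or post-processing not included in the reported inference time, and the measurement hardware is not given in this section.

```python
def relative_change(new, old):
    """Fractional change from `old` to `new` (negative means a reduction)."""
    return (new - old) / old

def fps(latency_ms):
    """Frames per second implied by a per-image latency in milliseconds."""
    return 1000.0 / latency_ms

# Values copied from Table 2.
baseline = {"map50": 94.7, "params_m": 2.65, "latency_ms": 14.0}  # YOLOv11n-OBB
pgi_yolo = {"map50": 96.9, "params_m": 2.61, "latency_ms": 5.1}   # PGi-YOLO

print(round(pgi_yolo["map50"] - baseline["map50"], 1))  # 2.2 (percentage points)
print(round(100 * relative_change(pgi_yolo["params_m"], baseline["params_m"]), 1))  # -1.5 (%)
print(round(fps(baseline["latency_ms"]), 1), round(fps(pgi_yolo["latency_ms"]), 1))  # 71.4 196.1
```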

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

MDPI and ACS Style

Ding, Q.; Cao, S.; Yu, C.; Cai, B.; Yuan, Y.; Li, H. PGi-YOLO: An Enhanced Detection Model for Maize Root–Stem Junction in Complex Field Environments. Agriculture 2026, 16, 1152. https://doi.org/10.3390/agriculture16111152
