Article

Research and Implementation of Peach Fruit Detection and Growth Posture Recognition Algorithms

School of Electrical and Information Engineering, Jiangsu University, Zhenjiang 212013, China
* Author to whom correspondence should be addressed.
Agriculture 2026, 16(2), 193; https://doi.org/10.3390/agriculture16020193
Submission received: 29 November 2025 / Revised: 28 December 2025 / Accepted: 9 January 2026 / Published: 12 January 2026

Abstract

Robotic peach harvesting represents a pivotal strategy for reducing labor costs and improving production efficiency. The fundamental prerequisite for a harvesting robot to successfully complete picking tasks is the accurate recognition of fruit growth posture subsequent to target identification. This study proposes a novel methodology for peach growth posture recognition by integrating an enhanced YOLOv8 algorithm with the RTMpose keypoint detection framework. Specifically, the conventional Neck network in YOLOv8 was replaced by an Atrous Feature Pyramid Network (AFPN) to bolster multi-scale feature representation. Additionally, the Soft Non-Maximum Suppression (Soft-NMS) algorithm was implemented to suppress redundant detections. The RTMpose model was further employed to locate critical morphological landmarks, including the stem and apex, to facilitate precise growth posture recognition. Experimental results indicated that the refined YOLOv8 model attained precision, recall, and mean average precision (mAP) of 98.62%, 96.3%, and 98.01%, respectively, surpassing the baseline model by 8.5%, 6.2%, and 3.0%. The overall accuracy for growth posture recognition achieved 89.60%. This integrated approach enables robust peach detection and reliable posture recognition, thereby providing actionable guidance for the end-effector of an autonomous harvesting robot.

1. Introduction

Currently, peach harvesting in China remains predominantly manual, a process that is labor-intensive, inefficient, and costly. The adoption of robotic harvesters is therefore an inevitable trend in modern agriculture [1,2]. In the broader context of fruit-harvesting robotics, the work of Ji et al. on apple recognition, sorting, and end-effector design offers valuable insights for the system integration of peach-picking robots [3,4]. The vision system is a critical component of such robots [5]. However, the unstructured orchard environments and the highly variable appearance of peaches under different lighting conditions pose significant challenges, severely compromising fruit detection accuracy [6]. Consequently, the precise detection and growth posture recognition of peaches in natural environments constitute a central research focus for harvesting robot vision systems [7].
In recent years, significant research efforts have been devoted to peach detection. Conventional methods, which rely on manual feature design and data reconstruction, are often computationally complex and yield limited performance. They also typically require expensive specialized equipment, demand controlled environments, and demonstrate poor interference robustness [8]. The adoption of deep convolutional neural networks (CNNs) has therefore become prevalent to overcome these constraints.
Liu et al. [9] developed a YOLOv7-based model for peach detection in natural environments. While their enhanced convolutional architecture improved detection precision in complex scenarios, the method’s practical application is hindered by high computational demands and insufficient robustness to occluded targets. Shi et al. [10] proposed a lightweight YOLOv8s-based algorithm incorporating MobileNetV3 as the backbone to reduce complexity. By integrating the p2BiFPN structure for improved feature fusion, along with Spatial Channel Reconstruction Convolution (ScConv) and coordinate attention mechanisms, the model achieved enhanced precision in detecting small targets. However, its applicability remains confined to immature fruit detection, with inadequate performance in identifying mature peaches. To address challenges including small fruit size, color similarity, and frequent occlusions, Liu et al. [11] introduced the MAE-YOLOv8 model, which employs an Exponential Moving Average (EMA) module for refined feature discrimination, an AFPN for small target detection, and a MPDIoU loss to alleviate occlusion-related localization errors. Nevertheless, the model still produces false positives and false negatives in densely clustered or heavily occluded scenarios.
While existing methods achieve basic peach detection, they cannot accurately determine fruit spatial orientation, thereby limiting their utility for robotic grasping. Several studies have explored fruit posture recognition, though with notable limitations. Chen et al. [12] proposed a fusion recognition method based on an improved YOLOv7 for apple growth state classification. Their approach incorporates a feature scaling layer and the Convolutional Block Attention Module (CBAM) into the backbone network, combined with a U-Net segmentation network and minimum enclosing shape features to estimate the orientation of unoccluded apples. However, the method’s complexity makes implementation challenging, and its accuracy remains suboptimal. Kok and Chen [13] introduced an apple orientation recognition technique leveraging 2D-3D information projection, which integrates keypoint detection with a circle fitting algorithm based on occlusion contours. This enables robust orientation recognition under significant occlusion. Li et al. [14] developed an apple growth direction detection algorithm using OpenPose, adapting human posture recognition techniques to achieve 81.54% accuracy. Despite these advances, such methods are generally limited to fruits in a single growth state and lack the generalization capacity for the diverse orientations encountered in natural environments.
To address the aforementioned challenges in peach recognition within natural environments, this study proposes an integrated framework that combines target detection with fruit posture recognition. When selecting a baseline detection model, we comprehensively evaluated model performance, computational efficiency, community support, and practical effectiveness in agricultural vision tasks. Although subsequent versions like YOLOv11 and YOLOv13 demonstrate strong performance in general object detection tasks, YOLOv8 has been extensively validated in agricultural scenarios due to its balanced accuracy-speed characteristics, mature architectural design, and rich pre-trained models and optimization tools. Specifically, YOLOv8 demonstrates solid baseline performance in small object detection and occlusion handling, while its modular design facilitates subsequent enhancements. Furthermore, YOLOv8 offers superior deployment friendliness on embedded devices compared to some newer versions with higher computational complexity. Therefore, this study builds upon the YOLOv8 model for further refinement, where the original Neck network is replaced with an AFPN, and the Soft-NMS algorithm is introduced. These enhancements boost peach detection accuracy and mitigate inefficiencies caused by leaf occlusion and short fruit stems. On this robust detection foundation, the RTMpose network is further integrated for keypoint detection. This enables accurate identification of the peach’s growth posture, thereby guiding the end-effector to plan a human-like grasping strategy for harvesting.

2. Materials and Methods

2.1. Dataset Construction

This study conducted image acquisition at the Longshan Shunxin Garden Eco-Farm in Dantu District, Zhenjiang City, Jiangsu Province. The images were captured using a Realme (Realme Chongqing Mobile Telecommunications Corp., Ltd., Chongqing, China) smartphone’s rear camera at a resolution of 3072 × 3072 pixels, which satisfied the image quality requirements for our experiments. To construct a dataset reflecting the complex variations observed in real field conditions, the collection process rigorously considered the comprehensiveness of environmental variables. Data collection was conducted in time-segmented intervals between 08:30 and 18:00 to incorporate light and shadow changes caused by varying solar elevation angles. Additionally, in terms of target states, the dataset encompasses peach samples of different sizes, orientations, and partial foliage coverage.
The dataset construction pipeline is illustrated in Figure 1. To enhance the model’s adaptability to varying environmental conditions and effectively mitigate overfitting to a specific scene, this study applied data augmentation to the initially collected 403 peach images. Through random rotation, mirroring, and brightness adjustment, the dataset was expanded to a total of 1612 images. All images were then meticulously annotated using LabelMe 5.5.0 software. Each peach instance was annotated with a bounding box labeled “peach”. Additionally, two keypoints—the calyx (tip) and the stem—were annotated and assigned the labels ‘0’ and ‘1’, respectively [15]. Finally, to ensure comprehensive model training and robust evaluation, the augmented dataset was partitioned into training, testing, and validation sets in an 8:1:1 ratio.
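For illustration, the augmentation step can be sketched as follows, assuming OpenCV and NumPy; the rotation-angle and brightness-gain ranges are illustrative assumptions rather than the exact values used in this study.

```python
import cv2
import numpy as np

def augment(image: np.ndarray, rng: np.random.Generator) -> list:
    """Produce the three augmented variants used to expand the dataset:
    random rotation, mirroring, and brightness adjustment."""
    h, w = image.shape[:2]

    # Random rotation about the image center (angle range is an assumption).
    angle = rng.uniform(-30, 30)
    M = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    rotated = cv2.warpAffine(image, M, (w, h))

    # Horizontal mirroring.
    mirrored = cv2.flip(image, 1)

    # Brightness adjustment (gain range is an assumption).
    gain = rng.uniform(0.6, 1.4)
    brightened = np.clip(image.astype(np.float32) * gain, 0, 255).astype(np.uint8)

    return [rotated, mirrored, brightened]
```

Note that the bounding-box and keypoint annotations must be transformed with the same rotation matrix and flip so that the labels remain aligned with the augmented images.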

2.2. Design of a Peach Object Detection Model Based on an Improved YOLOv8 Approach

Compared to YOLOv5 and YOLOv7, YOLOv8 introduces a series of refinements in model architecture, training strategies, and loss functions. These improvements significantly enhance its feature extraction capability, leading to concurrent gains in both training efficiency and final model performance. However, in the complex environment of peach orchards, leaves and branches often occlude the target peaches [16]. When multi-layer occlusion reduces the visible portion of a peach below 30%, the standard YOLOv8 model becomes prone to misidentification. This limitation stems from the inefficient multi-scale feature fusion in its Feature Pyramid Network (FPN) [17], where the interaction between deep semantic information and shallow spatial details remains inadequate. Furthermore, in scenes with densely clustered fruits, the detection network frequently mistakes multiple highly overlapping peaches for a single instance, resulting in missed detections [18].
To address these limitations, this study introduces an enhanced model based on YOLOv8n, named YOLOv8-Peach, whose architecture is depicted in Figure 2. Firstly, the original Neck network is replaced with the AFPN. AFPN functions as the pivotal module bridging the Backbone and Prediction Head. It is designed to fuse and refine the multi-scale features extracted by the Backbone, thereby producing feature representations that better support localization and classification in the Prediction Head. This structure offers stronger multi-scale feature extraction and finer detail processing, which helps improve the recognition accuracy for peaches of different sizes and shapes. Secondly, Soft-NMS is incorporated to enhance detection performance [19]. By adaptively retaining more high-overlap prediction boxes, it effectively alleviates the missed detection of overlapping peaches, thereby further boosting overall detection performance.

2.2.1. YOLOv8 Neck Network Enhancements

The neck network of YOLOv8 utilizes an enhanced architecture based on PAN-FPN (Path Aggregation Network with Feature Pyramid Network) [20]. The AFPN improves upon the traditional FPN by incorporating atrous convolutions (also known as dilated convolutions), which enhance the model’s capacity for modeling multi-scale contexts and long-range dependencies. By leveraging atrous convolutions, AFPN maintains computational efficiency while achieving stronger adaptability in handling objects of varying sizes, thereby effectively improving the model’s performance across different target scales. Peach detection necessitates the precise capture of detailed characteristics for each target, especially in terms of shape, color variations, and textural details [6].
The AFPN architecture is shown in Figure 3. First, multi-scale features from the backbone are uniformly aligned to 256 channels via 1 × 1 convolutions, establishing a common feature space. Upsampling is performed using bilinear interpolation to enable cross-level feature propagation. This includes 2× upsampling from C5 to C4, 4× from C5 to C3, and 2× from C4 to C3 [21]. For downsampling, convolutional layers with corresponding strides are applied: a 2 × 2 convolution with stride 2 for 2× downsampling from C3 to C4, a 4 × 4 convolution with stride 4 for 4× downsampling from C3 to C5, and a 2 × 2 convolution with stride 2 for 2× downsampling from C4 to C5. As YOLOv8 uses only three feature levels, the 8× downsampling stage and its associated operations are omitted in this implementation.
After alignment, features at each level are aggregated through three asymptotic fusion nodes: P3, P4, and P5. The fused features are further refined by four sequential residual units. Each unit follows the basic ResNet design, consisting of two 3 × 3 convolutional layers with batch normalization and ReLU activation, and includes a skip connection to facilitate gradient flow. Finally, the network outputs three normalized feature maps that match the YOLOv8 detector head in both spatial resolution and channel dimension, allowing seamless integration into multi-scale object detection.
This design achieves efficient cross-scale semantic fusion through direct connections between non-adjacent layers and a progressive aggregation mechanism, significantly strengthening feature representation while preserving real-time inference capability.
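As a concrete illustration of the alignment operations described above, the following PyTorch sketch implements the 1 × 1 channel projection, bilinear upsampling, and strided-convolution downsampling for a single feature level; the `ScaleAlign` module name is ours, and this is a simplified sketch rather than the full AFPN implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScaleAlign(nn.Module):
    """Aligns a backbone feature map to the common 256-channel space and to a
    target scale: 1 x 1 projection, bilinear upsampling (2x or 4x), and
    strided 2 x 2 / 4 x 4 convolutions, as described in Section 2.2.1."""

    def __init__(self, in_channels: int):
        super().__init__()
        self.proj = nn.Conv2d(in_channels, 256, kernel_size=1)      # channel alignment
        self.down2 = nn.Conv2d(256, 256, kernel_size=2, stride=2)   # 2x downsampling
        self.down4 = nn.Conv2d(256, 256, kernel_size=4, stride=4)   # 4x downsampling

    def forward(self, x: torch.Tensor, scale: float) -> torch.Tensor:
        x = self.proj(x)
        if scale == 2.0:    # e.g. C5 -> C4 or C4 -> C3
            return F.interpolate(x, scale_factor=2, mode="bilinear", align_corners=False)
        if scale == 4.0:    # e.g. C5 -> C3
            return F.interpolate(x, scale_factor=4, mode="bilinear", align_corners=False)
        if scale == 0.5:    # e.g. C3 -> C4 or C4 -> C5
            return self.down2(x)
        if scale == 0.25:   # e.g. C3 -> C5
            return self.down4(x)
        return x            # same level: projection only
```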

2.2.2. Integration of Soft Non-Maximum Suppression

In traditional Non-Maximum Suppression (NMS), detection boxes with a high degree of overlap are eliminated, retaining only the one with the highest confidence score. In natural orchard environments, when two peaches are highly occluded, their resulting bounding boxes can exhibit significant overlap [20]. In such cases, the standard NMS algorithm may incorrectly suppress the box of the occluded peach in favor of the one with a marginally higher score, leading to missed detections.
To mitigate this issue in peach detection, this study adopts the Soft-NMS algorithm to replace the original NMS. The fundamental difference between Soft-NMS and NMS lies in their treatment of detection boxes that have an Intersection over Union (IoU) with the highest-scoring box exceeding a set threshold. Instead of outright removal, Soft-NMS applies a continuous decay function to penalize the confidence scores of these overlapping boxes. This strategy prevents highly overlapping, yet valid, fruit detections from being completely discarded, thereby significantly improving the model’s recall in scenarios with severe fruit occlusion. The Soft-NMS scoring function is defined as follows:
$$s_i = s_i \, e^{-\frac{\mathrm{IoU}(M,\, b_i)^2}{\sigma}}, \quad \forall b_i \notin D$$

where $s_i$ denotes the confidence score of the $i$-th candidate bounding box, $M$ is the candidate box with the highest confidence score at the current step, $\mathrm{IoU}(M, b_i)$ represents the Intersection over Union between $M$ and the candidate box $b_i$ currently being processed, $D$ is the set used to store the final selected detection boxes, and $\sigma$ is the parameter controlling the decay intensity. To determine the optimal parameter, this study performed a grid search on the Gaussian decay parameter $\sigma$ of Soft-NMS within the validation set, with a search range from 0.3 to 0.7. Results indicate that when $\sigma = 0.5$, the algorithm achieves the optimal balance between recall and precision.
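The decay rule above can be implemented in a few lines. The following NumPy sketch follows the standard Gaussian Soft-NMS formulation [20]; the score-pruning threshold is an illustrative assumption.

```python
import numpy as np

def pairwise_iou(box: np.ndarray, boxes: np.ndarray) -> np.ndarray:
    """IoU between one box and an array of boxes, all as [x1, y1, x2, y2]."""
    x1 = np.maximum(box[0], boxes[:, 0]); y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2]); y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def gaussian_soft_nms(boxes: np.ndarray, scores: np.ndarray,
                      sigma: float = 0.5, score_thresh: float = 0.001) -> list:
    """Gaussian Soft-NMS: instead of deleting boxes that overlap the current
    best box M, decay their scores by exp(-IoU(M, b_i)^2 / sigma)."""
    scores = scores.astype(np.float64).copy()
    keep, idxs = [], np.arange(len(scores))
    while len(idxs) > 0:
        m = idxs[np.argmax(scores[idxs])]             # highest-scoring remaining box M
        keep.append(int(m))
        idxs = idxs[idxs != m]
        if len(idxs) == 0:
            break
        iou = pairwise_iou(boxes[m], boxes[idxs])
        scores[idxs] *= np.exp(-(iou ** 2) / sigma)   # continuous Gaussian decay
        idxs = idxs[scores[idxs] > score_thresh]      # drop negligible candidates
    return keep
```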

2.3. Network Architecture for Peach Posture Recognition

To improve the success rate of robotic peach harvesting, it is essential to guide the end-effector in planning a human-like grasping strategy, which requires accurate recognition of the fruit’s growth posture during the picking operation. This study adopts the RTMpose algorithm as the core method for keypoint detection [22]. Compared to traditional regression-based approaches, RTMpose demonstrates superior performance in complex orchard environments. Its hierarchical representation and high-performance decoder enhance robustness in challenging scenarios by effectively leveraging both local features and global contextual information. This capability ensures reliable localization of keypoints—such as the calyx (tip) and the stem—even when the fruit is partially occluded by foliage or has an irregular shape.
The entire workflow adopts a cascaded architecture, with its core components sequentially comprising peach object detection based on an improved YOLOv8 model, instance segmentation using K-means clustering, Canny edge contour extraction, circle fitting via Hough transform, RTMpose keypoint detection, and final growth posture calculation. The specific workflow is illustrated in Figure 4: First, the input image undergoes inference using the improved YOLOv8-Peach model, outputting bounding box coordinates for peaches. Next, each detected box is cropped into a region of interest (ROI), where K-means is applied for foreground-background segmentation. Subsequently, based on the segmented binary mask, the Canny edge detection operator extracts peach contour features. After obtaining the contour point set, circular fitting is performed using the Hough transform to estimate the approximate geometric center and radius of the peach. Finally, within the same ROI, the RTMpose algorithm is used to regress the pixel coordinates of the peach tip and stem. These local coordinates are then mapped back to the original image coordinate system, and the growth direction of the peach is ultimately calculated based on the relative position vector between the tip and stem.
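The cascade can be summarized in code as follows; here `detector`, `pose_model`, and the helper functions (`kmeans_segment`, `canny_contour`, `hough_circle_fit`) are hypothetical placeholders for the components detailed in the following subsections.

```python
def recognize_posture(image, detector, pose_model,
                      kmeans_segment, canny_contour, hough_circle_fit):
    """Cascaded workflow of Figure 4. All callables are injected placeholders:
    `detector` yields (x1, y1, x2, y2) boxes, `pose_model` returns the (tip,
    stem) keypoints of a cropped ROI in local pixel coordinates."""
    results = []
    for (x1, y1, x2, y2) in detector(image):              # 1. YOLOv8-Peach boxes
        roi = image[y1:y2, x1:x2]
        mask = kmeans_segment(roi)                        # 2. K-means foreground mask
        contour = canny_contour(mask)                     # 3. Canny contour extraction
        center, radius = hough_circle_fit(contour)        # 4. Hough circle fitting
        tip, stem = pose_model(roi)                       # 5. RTMpose keypoints
        tip_g = (tip[0] + x1, tip[1] + y1)                # 6. map back to full image
        stem_g = (stem[0] + x1, stem[1] + y1)
        direction = (tip_g[0] - stem_g[0], tip_g[1] - stem_g[1])  # stem -> tip vector
        results.append({"box": (x1, y1, x2, y2), "center": center,
                        "radius": radius, "direction": direction})
    return results
```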

2.3.1. Peach Target Key Feature Construction Method

Research has shown that the relative positions of keypoints on a peach provide a reliable basis for classifying its growth posture. Using the keypoint coordinates obtained from the model in Section 2.3, this study calculates the relevant angles between vectors as follows:
$$\theta = \arccos \frac{(x_2 - x_1)(x_3 - x_2) + (y_2 - y_1)(y_3 - y_2)}{\sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}\, \sqrt{(x_3 - x_2)^2 + (y_3 - y_2)^2}}$$

As illustrated in Figure 5, $(x_1, y_1)$, $(x_2, y_2)$, and $(x_3, y_3)$ represent the coordinates of the three keypoints: the peach centroid, the calyx (tip), and the stem, respectively, with $\theta$ denoting the angle formed by these points. The vector from the centroid $(x_1, y_1)$ to the calyx $(x_2, y_2)$ most accurately represents the peach’s true growth orientation. However, accurately locating the centroid is challenging due to its indistinct textural features, which often introduces substantial localization errors. To improve the robustness and reliability of the recognition, this study instead utilizes the vector from the stem $(x_3, y_3)$ to the calyx $(x_2, y_2)$ as the indicator for estimating the peach’s growth orientation.
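A direct implementation of the angle computation and of the stem-to-calyx orientation estimate might look as follows; the clipping guards against floating-point values falling marginally outside [−1, 1].

```python
import numpy as np

def vertex_angle(p1, p2, p3) -> float:
    """Angle theta between the vectors (p2 - p1) and (p3 - p2), as in the
    formula above; p1 = centroid, p2 = calyx (tip), p3 = stem."""
    v1 = np.asarray(p2, float) - np.asarray(p1, float)
    v2 = np.asarray(p3, float) - np.asarray(p2, float)
    cos_t = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    return float(np.degrees(np.arccos(np.clip(cos_t, -1.0, 1.0))))

def growth_direction(stem, calyx) -> float:
    """Orientation (degrees, image coordinates) of the stem -> calyx vector
    actually used to estimate the growth posture."""
    dx, dy = calyx[0] - stem[0], calyx[1] - stem[1]
    return float(np.degrees(np.arctan2(dy, dx)))
```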

2.3.2. Peach Target Preprocessing

The choice of color space is critical for effective target segmentation. To identify the feature representation most robust to complex orchard conditions, peach images are converted from RGB to HSI and Lab spaces, and their individual components are analyzed. In natural environments, targets and backgrounds exhibit distinct distributions across different color spaces and components, each demonstrating varying sensitivity to illumination and sky interference. Through comparative evaluation, the color component that maximizes the separation between peach targets and the background is selected for subsequent segmentation.
As illustrated in Figure 6, the image can be segmented into four primary regions: the peach target, light green foliage, the sky background, and grayish branches. Among all color components, only the a component in the Lab space remains largely invariant to changes in illumination and sky presence, thereby providing clear separation between the target and background. Visually, the background appears uniformly gray, while the peach stands out as bright white, resulting in high contrast that facilitates reliable segmentation. To ensure robustness under challenging orchard conditions such as varying lighting and complex backgrounds, this study performs peach segmentation in the a component space.
After converting the image to the Lab color space, color features can be clustered and segmented using the K-means clustering algorithm. Although recent YOLO variants now support instance segmentation, this study opts for the K-means method based on the following comprehensive considerations: In scenarios with strong color separability, K-means achieves highly efficient segmentation within cropped ROIs at minimal computational cost. It requires no additional annotated data or model training, making it more suitable for embedded real-time systems. The procedure for clustering the image into k classes is as follows:
(1) Initialize the cluster centers for the $k$ classes: $Z_1^{(1)}, Z_2^{(1)}, \ldots, Z_k^{(1)}$.
(2) Assignment step (iteration $n$): assign each sample point $Z$ in the set $\{Z\}$ to the cluster $S_j^{(n)}$ whose center $Z_j^{(n)}$ is closest. Formally, $Z \in S_j^{(n)}$ if $\lVert Z - Z_j^{(n)} \rVert < \lVert Z - Z_i^{(n)} \rVert$ for all $i \neq j$.
(3) Update step: compute the new cluster center $Z_j^{(n+1)}$ as the centroid of the points in $S_j^{(n)}$, i.e.,
$$Z_j^{(n+1)} = \arg\min_{Z^\ast} \sum_{Z \in S_j^{(n)}} \lVert Z - Z^\ast \rVert^2 = \frac{1}{\lvert S_j^{(n)} \rvert} \sum_{Z \in S_j^{(n)}} Z$$
(4) Convergence check: if $Z_j^{(n+1)} = Z_j^{(n)}$ for all $j = 1, \ldots, k$, stop; otherwise, set $n = n + 1$ and return to Step (2).
The segmentation result after applying the K-means algorithm is shown in Figure 7. The number of clusters $k$ is a pre-defined parameter. For the task of segmenting peach targets from the background in natural environments, the objective is to separate two distinct classes; therefore, $k$ was set to 2 in this study.
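As a concrete sketch, the a-channel clustering with k = 2 can be written with OpenCV’s built-in K-means; treating the cluster with the larger mean a value (the redder one) as the fruit is an assumption that holds for ripe peaches against green foliage.

```python
import cv2
import numpy as np

def segment_peach(roi_bgr: np.ndarray) -> np.ndarray:
    """Cluster the Lab a-channel into k = 2 classes and return a binary mask
    of the peach (the cluster with the larger mean a value)."""
    lab = cv2.cvtColor(roi_bgr, cv2.COLOR_BGR2LAB)
    a = lab[:, :, 1].reshape(-1, 1).astype(np.float32)

    criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 20, 1.0)
    _, labels, centers = cv2.kmeans(a, 2, None, criteria, 5, cv2.KMEANS_PP_CENTERS)

    peach_cluster = int(np.argmax(centers))            # redder cluster = fruit
    mask = (labels.reshape(lab.shape[:2]) == peach_cluster).astype(np.uint8) * 255
    return mask
```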
Given the generally circular shape of peach fruits, the Canny edge detection algorithm is applied to the segmented mask to extract the target’s contour. For these approximately circular objects, the extracted arc segments from the true contour can be utilized to estimate the centroid and radius.
As shown in Figure 7, let $O(a, b)$ be the center of the peach target, and let $p_1(m_1, n_1)$, $p_2(m_2, n_2)$, and $p_3(m_3, n_3)$ be three points on its true contour. Substituting the coordinates of these three points into the standard circle equation yields the following system:

$$\begin{cases} (m_1 - a)^2 + (n_1 - b)^2 = (m_2 - a)^2 + (n_2 - b)^2 \\ (m_2 - a)^2 + (n_2 - b)^2 = (m_3 - a)^2 + (n_3 - b)^2 \end{cases}$$

Solving this system yields the coordinates $O(a, b)$ of the circle’s center. The radius $r$ corresponding to the contour segment can then be calculated as follows:

$$r = \sqrt{(m_1 - a)^2 + (n_1 - b)^2}$$
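Expanding both equations cancels the quadratic terms in a and b, leaving a 2 × 2 linear system. A minimal solver is sketched below; it assumes the three contour points are not collinear.

```python
import numpy as np

def circle_from_three_points(p1, p2, p3):
    """Solve the pairwise-equidistance system above for the center O(a, b),
    then recover the radius r. Points are (x, y) contour samples."""
    (m1, n1), (m2, n2), (m3, n3) = p1, p2, p3
    A = np.array([[2 * (m2 - m1), 2 * (n2 - n1)],
                  [2 * (m3 - m2), 2 * (n3 - n2)]], dtype=float)
    c = np.array([m2**2 + n2**2 - m1**2 - n1**2,
                  m3**2 + n3**2 - m2**2 - n2**2], dtype=float)
    a, b = np.linalg.solve(A, c)          # raises if the points are collinear
    r = np.hypot(m1 - a, n1 - b)
    return (a, b), r
```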

2.3.3. Keypoint Detection Using RTMpose

The backbone network of RTMpose consists of four distinct stages [23]. As the network progresses, it undergoes a branching expansion: it starts with a single branch in Stage 1, expands to two parallel branches in Stage 2, and further expands to three parallel branches in Stage 3, so the number of parallel feature streams in each stage matches its stage index. With the exception of the transition between Stages 1 and 2, every branching expansion incorporates a multi-scale feature fusion mechanism. The overall architecture is depicted in Figure 8, and all convolutional modules in the network follow a “Conv + BN + ReLU” configuration.
The original input image is first processed by the Stem module, which is composed of two 3 × 3 convolutional layers, each with a stride of 2, resulting in 4× downsampling. The resulting feature map then enters Layer 1, which adjusts the channel dimension without altering the spatial resolution. This output is fed into the Transition 1 structure, which contains two parallel 3 × 3 convolutions: the first, with a stride of 1, adjusts the channel count; the second, with a stride of 2, performs 2× downsampling while also modifying the channels. The feature maps then proceed to Stage 2, which is composed of four sequentially connected BasicBlocks followed by an Exchange Block. The Exchange Block facilitates the fusion of features from different branches, utilizing ReLU activation functions.
Subsequently, the feature maps pass through the Transition 2 structure. Here, the branch with 8× downsampling is fed into a 3 × 3 convolutional layer (stride = 2, padding = 1) to produce a new branch with 16× downsampling. The architecture then enters Stage 3, which is repeated four times. Each repetition sequentially processes the features through four BasicBlocks and performs multi-scale fusion analogous to Stage 2. The feature maps continue into the Transition 3 structure, where the 16× downsampling branch is similarly processed by an identical convolutional layer to create a 32× downsampling branch. Finally, the features enter Stage 4, which is also stacked four times and processes features through its BasicBlocks for fusion.
The network can select outputs from different branches depending on the task requirements. For this study, the final output is taken from the highest-resolution branch. The number of kernels in the final convolutional layer is set to match the number of keypoints, which is two in this work [24].
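For illustration, a simple decoder for such a two-channel output is sketched below; it assumes one response map per keypoint and uses a plain argmax, whereas RTMpose’s actual coordinate decoding scheme may differ.

```python
import torch

def decode_keypoints(heatmaps: torch.Tensor, roi_size: tuple) -> torch.Tensor:
    """Decode (x, y) pixel coordinates from a (K, H, W) response map, one
    channel per keypoint (K = 2 here: calyx and stem), via per-channel argmax."""
    k, h, w = heatmaps.shape
    idx = heatmaps.view(k, -1).argmax(dim=1)           # peak index per keypoint
    ys = torch.div(idx, w, rounding_mode="floor").float()
    xs = (idx % w).float()
    sx, sy = roi_size[0] / w, roi_size[1] / h          # feature map -> ROI pixels
    return torch.stack([xs * sx, ys * sy], dim=1)      # (K, 2) as (x, y)
```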

2.4. Experimental Evaluation Metrics and Platform Configuration

2.4.1. Experimental Evaluation Criteria

The evaluation metrics for peach object detection include recall (R), precision (P), mean average precision (mAP), model size (MB), frames per second (FPS), and computational complexity (GFLOPs).
$$\mathrm{Recall} = \frac{TP}{TP + FN} \times 100\%$$

$$\mathrm{Precision} = \frac{TP}{TP + FP} \times 100\%$$

$$AP = \int_0^1 P(R)\, \mathrm{d}R$$

$$mAP = \frac{1}{n} \sum_{i=1}^{n} AP_i$$

$$FPS = \frac{1000\ \mathrm{ms}}{t_{\text{pre-process}} + t_{\text{inference}} + t_{\text{NMS}}}$$
True Positive (TP) refers to the number of positive instances correctly detected. False Positive (FP) denotes the number of negative instances erroneously classified as positive. False Negative (FN) represents the number of positive instances that are incorrectly missed. Average Precision (AP) quantifies the area under the Precision-Recall curve, while mAP is computed as the mean of AP values across all categories.
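A small numerical sketch of these metrics is given below; the trapezoidal integration over recall is a simplified stand-in for the interpolated AP computation used by common detection toolkits.

```python
import numpy as np

def precision_recall(tp: int, fp: int, fn: int):
    """Precision and recall (%) from detection counts, per the formulas above."""
    return tp / (tp + fp) * 100.0, tp / (tp + fn) * 100.0

def average_precision(precisions, recalls) -> float:
    """AP as the area under the P(R) curve, integrated over recall."""
    order = np.argsort(recalls)
    return float(np.trapz(np.asarray(precisions)[order],
                          np.asarray(recalls)[order]))

def fps(pre_ms: float, infer_ms: float, nms_ms: float) -> float:
    """Frames per second from per-image stage timings in milliseconds."""
    return 1000.0 / (pre_ms + infer_ms + nms_ms)
```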

2.4.2. Experimental Training Platform and Strategy

All experiments were conducted on a workstation equipped with an NVIDIA RTX 3090 GPU (NVIDIA Corporation, Santa Clara, CA, USA). The models were developed using the PyTorch 2.1.0 deep learning framework and accelerated with CUDA 12.1. The code was implemented in Python 3.9 on the PyCharm 2022.2.1 platform, leveraging the OpenCV 4.8.0 library for image processing. Through empirical tuning based on hardware constraints, the training hyperparameters were set to a batch size of 16 for 150 epochs.

2.4.3. Experimental Verification Platform

To validate the effectiveness and measurement accuracy of the proposed method, a laboratory experiment was designed to simulate real-world orchard conditions [25]. As shown in Figure 9, the setup involved an artificial peach tree with standardized peach models affixed to its branches to closely mimic a natural growth environment.
The vision system for detecting peach growth posture was deployed on an NVIDIA Jetson AGX Orin edge computing device (NVIDIA Corp., Santa Clara, CA, USA) [26]. Powered by an NVIDIA Ampere architecture GPU with 2048 CUDA cores and 64 Tensor Cores, this device provides the substantial computational power and efficient AI inference required for complex visual tasks. The software environment was built on the PyTorch 2.1.0 deep learning framework with GPU support, along with Torchvision 0.16.1 [8,27].

3. Analysis of Experimental Results

3.1. Ablation Experiment

To evaluate the effectiveness of each component in the improved model, this study conducted ablation tests. Starting from the original YOLOv8 network, the AFPN module and Soft-NMS algorithm were successively added, with training and testing performed on the same dataset. A comparative summary of the ablation results is presented in Table 1.
The ablation study presented in Table 1 shows that the baseline YOLOv8 model attains a mean Average Precision (mAP) of 95.0%. Incorporating the Asymptotic Feature Pyramid Network (AFPN) module elevates the mAP to 97.23%, corresponding to an absolute gain of 2.23 percentage points. This improvement indicates that AFPN effectively enhances the model’s detection performance for small and partially occluded objects by reinforcing multi-scale feature integration. When Soft-NMS is applied solely as a post-processing step without introducing additional model parameters or increasing theoretical computational complexity, it further raises the mAP to 96.30%, corresponding to an increase of 1.30 percentage points. This outcome indicates that Soft-NMS mitigates false negatives by adaptively adjusting the confidence scores of overlapping detections.
After integrating AFPN with Soft-NMS, the final mAP reached 98.01%, representing an overall improvement of 3.01 percentage points over the baseline model. Although the incremental addition of modules increased the model’s parameter count, the sustained improvement in accuracy metrics demonstrates that the AFPN architecture and Soft-NMS strategy synergistically refine feature discrimination and localization capabilities in complex natural environments, thereby achieving balanced enhancements in both precision and robustness.

3.2. Comparative Experiments of Different Network Models

To evaluate the efficacy of the proposed improvements for peach detection, this study conducted a comparative analysis between our YOLOv8-Peach model and three established benchmarks: YOLOv5, YOLOv7, and the original YOLOv8. The qualitative results are presented in Figure 10.
Figure 10 reveals that in natural orchard settings, YOLOv7 is prone to false positives, mistakenly identifying background elements as peaches. This issue is exacerbated in scenes with dense clusters or partial occlusion, indicating substantial room for improvement in its recall performance. By contrast, YOLOv5 and YOLOv8 exhibit strong detection performance. Of the two, YOLOv8 adopts a contemporary anchor-free design, offering greater flexibility in handling variations in target scale and shape. Furthermore, its C2f backbone architecture enhances gradient propagation and facilitates more efficient feature reuse. These architectural advantages provide a more robust and adaptable foundation for task-specific model refinement, which is why YOLOv8 was selected as the baseline network for this work. As illustrated in Figure 10, the proposed YOLOv8-Peach model successfully identifies peach targets with high precision under the same challenging conditions, effectively eliminating both false positives and false negatives. It achieves high detection confidence scores, reliably pinpointing even severely occluded fruits. Furthermore, the model maintains consistent accuracy across varying illumination conditions, demonstrating its strong potential for practical applications in peach growth state assessment.
The comparative results of different network models are presented in Table 2. The analysis indicates that, in terms of model complexity, YOLOv8 exhibits a substantially lower storage requirement than YOLOv7 while achieving a higher mAP, demonstrating its superior storage efficiency. Relative to YOLOv5, which possesses the most compact storage footprint, YOLOv8 attains enhanced recall with only a negligible increase in model size, thereby confirming its improved capability in reducing missed detections for occluded objects. Regarding real-time performance, although the inference speed of YOLOv8 is marginally lower than that of YOLOv7, its overall accuracy-speed balance is more favorable than that of the other models. This characteristic renders it more suitable for orchard harvesting robots, which require a simultaneous focus on both detection efficiency and accuracy, and establishes it as the baseline for improvement in this study.
The YOLOv8-Peach model, developed from this baseline, achieves comprehensive performance advancements. It elevates the mAP by 3.01 percentage points over the baseline, reaching 98.01%, with corresponding precision and recall rates of 98.62% and 96.3%. This signifies a considerable reduction in both false positive and false negative rates. The architectural enhancements, while leading to an increased model size and computational load and a consequent slight decrease in inference speed, endow the model with significantly strengthened robustness in complex orchard scenarios characterized by occlusions and lighting variations. These findings collectively validate the efficacy of the synergistic optimization achieved through the integration of the AFPN architecture and the Soft-NMS strategy, highlighting their practical value for deployment in resource-constrained operational environments.
To provide a more intuitive comparison of the model’s performance, Figure 11 plots the detection accuracy of different algorithms. It is evident that the proposed YOLOv8-Peach model holds a distinct advantage in the peach detection task, outperforming other mainstream models in both overall accuracy and reduction in erroneous detections. This superior performance confirms that the enhancements—integrating the AFPN module and Soft-NMS—enable the model to be better tailored for this specific agricultural context, leading to higher precision than conventional YOLO architectures.

3.3. Peach Keypoint Detection Experiments

Unlike the IoU metric used in object detection, Object Keypoint Similarity (OKS) serves as the standard evaluation metric for keypoint detection, quantifying the similarity between predicted keypoints and their ground-truth annotations. Like IoU, OKS values range from 0 to 1, with values closer to 1 representing a higher degree of similarity. The OKS for a single target is calculated as follows [24]:

$$\mathrm{OKS} = \frac{\sum_i \exp\!\left(-\dfrac{d_i^2}{2 s^2 k_i^2}\right) \delta(v_i > 0)}{\sum_i \delta(v_i > 0)}$$

The visibility flag $v_i$ is defined as follows: 0 indicates an invisible keypoint, 1 represents an occluded keypoint, and 2 denotes a fully visible keypoint. The indicator function $\delta(v_i > 0)$ equals 1 when the condition is satisfied and 0 otherwise. Here, $s$ denotes the object scale (the square root of the object area in pixels), $d_i$ represents the Euclidean distance between the $i$-th detected keypoint and its corresponding ground-truth keypoint, and $k_i$ is the normalization factor for the $i$-th keypoint.
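A direct NumPy implementation of the OKS formula might look as follows; how the per-keypoint factors k_i are calibrated for the calyx and stem classes is left open here.

```python
import numpy as np

def oks(pred, gt, vis, s, k) -> float:
    """Object Keypoint Similarity for one fruit. pred, gt: (N, 2) keypoint
    arrays; vis: (N,) visibility flags; s: object scale; k: (N,) factors."""
    d2 = np.sum((np.asarray(pred, float) - np.asarray(gt, float)) ** 2, axis=1)
    labeled = np.asarray(vis) > 0                 # only annotated keypoints count
    if not labeled.any():
        return 0.0
    sim = np.exp(-d2 / (2.0 * s**2 * np.asarray(k, float) ** 2))
    return float(sim[labeled].mean())
```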
When OKS exceeds a predefined threshold $T$, the detection is considered a true positive. AP is subsequently derived from the precision-recall curve, while mAP represents the mean of AP values calculated across multiple OKS thresholds. mAP@0.5 specifically denotes the AP at a fixed OKS threshold of 0.5. The mean Average Recall (mAR) quantifies the proportion of ground-truth keypoints successfully detected by the model, reflecting its recall performance. The corresponding numerical results are summarized in Table 3.
The results of the peach target growth posture detection are shown in Figure 12.

3.4. Peach Target Growth Posture Recognition Verification Experiment

To acquire the ground truth for peach growth orientation, a manual marking method was employed, as illustrated in Figure 13 [28]. A thin bamboo skewer was inserted through the calyx (tip) and the centroid of the peach, ensuring the skewer’s axis aligned with the vector from the centroid to the tip. In this setup, the direction of the skewer is defined as the actual growth direction of the peach, providing a clear and intuitive physical representation of the posture [29]. During the image acquisition phase, peaches marked with bamboo skewers were photographed using the same imaging equipment. Subsequently, the direction indicated by the skewer (the reference direction from the base to the tip of the peach) was quantitatively compared with the growth posture estimated by the algorithm to evaluate the accuracy of this research method [30]. Based on this approach, the posture recognition system developed in this study was evaluated on a validation set comprising 30 samples. The statistical results for angular error are presented in Table 4.
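The angular error reported in Table 4 reduces to the angle between two 2D vectors, which can be computed as follows; the reference direction in the usage line is a made-up placeholder rather than an actual skewer annotation.

```python
import numpy as np

def angular_error(est_dir, ref_dir) -> float:
    """Unsigned angle (degrees) between the estimated stem-to-calyx vector
    and the skewer reference direction."""
    u, v = np.asarray(est_dir, float), np.asarray(ref_dir, float)
    cos_t = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return float(np.degrees(np.arccos(np.clip(cos_t, -1.0, 1.0))))

# Estimated vector for Verification 1 in Table 5: stem (1078, 1320) -> calyx (772, 855).
# The reference direction below is a placeholder, not the annotated one.
print(angular_error((772 - 1078, 855 - 1320), (-0.55, -0.84)))
```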
To conduct an in-depth analysis of error sources, three representative samples are presented, as shown in Figure 13.
The angular error in detecting the target peach growth posture is shown in Table 5. The minimum error of 2.6° indicates that under conditions of clear features and minimal occlusion, the proposed method achieves high-precision posture estimation. Larger errors primarily occur in samples where the peach stem or tip is partially obscured by branches and leaves, or where reflective interference is present in the image. Although the peach core region was excluded as a directional reference due to its lack of distinct features, the selected stem-to-tip feature combination consistently supported stable posture estimation in most scenarios. The overall error remained within acceptable limits, validating the proposed method’s effectiveness and robustness under conditions that simulate real-world orchards.

4. Conclusions

To address the limitations of peach-harvesting robots, this study proposes an integrated method that combines target detection with growth posture recognition. YOLOv8 was enhanced by incorporating AFPN and Soft-NMS, achieving significant gains in detection accuracy, particularly for complex backgrounds and multi-scale objects. The resulting YOLOv8-Peach model attained a precision of 98.62%, recall of 96.3%, and mAP of 98.01%, representing improvements of 8.5%, 6.2%, and 3.0% over the original model. Furthermore, by integrating RTMpose for keypoint detection, the system accurately determines peach orientation, achieving a posture recognition accuracy of 89.60%. This integrated approach not only improves detection robustness but also enables human-like grasping strategies for the end-effector, providing critical data for robotic arm control and thereby significantly boosting the intelligence and efficiency of automated peach harvesting.
The proposed method shows strong performance under the tested conditions, although several limitations persist. First, the experimental evaluation has been primarily confined to comparisons within the YOLO series. Future work should extend validation to newer YOLO variants and other state-of-the-art detection architectures. Second, the current dataset mainly represents conventional orchard settings; the model’s robustness under diverse environmental conditions requires further assessment with more varied, multi-scenario data. Future efforts will focus on improving environmental generalization through synthetic data augmentation and cross-domain adaptive learning and will explore the integration of binocular vision or depth sensors to acquire 3D point clouds of peaches, combining 2D appearance features with 3D geometry for more robust and accurate spatial pose estimation. This will contribute to advancing the practical deployment of the system in agricultural applications.

Author Contributions

Conceptualization, W.J. and L.X.; methodology, B.X. and L.X.; software, L.X. and J.A.; validation, D.W.; formal analysis, L.X.; investigation, W.J.; resources, W.J.; data curation, L.X.; writing—original draft preparation, L.X.; writing—review and editing, W.J. and L.X.; visualization, B.X.; supervision, W.J.; project administration, W.J.; funding acquisition, W.J. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (No. 61973141), and a project funded by the Priority Academic Program Development of Jiangsu Higher Education Institutions (No. PAPD).

Institutional Review Board Statement

Not applicable.

Data Availability Statement

Data are contained within the article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Zhou, H.; Wang, X.; Au, W.; Kang, H.; Chen, C. Intelligent robots for fruit harvesting: Recent developments and future challenges. Precis. Agric. 2022, 23, 1856–1907. [Google Scholar] [CrossRef]
  2. Zhang, K.; Lammers, K.; Chu, P.; Li, Z.; Lu, R. Algorithm design and integration for a robotic apple harvesting system. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, Kyoto, Japan, 23–27 October 2022. [Google Scholar] [CrossRef]
  3. Ji, W.; Zhang, T.; Xu, B.; He, G. Apple recognition and picking sequence planning for harvesting robot in a complex environment. J. Agric. Eng. 2024, 55, 1549. [Google Scholar] [CrossRef]
  4. Bac, C.W.; van Henten, E.J.; Hemming, J.; Edan, Y. Harvesting robots for high-value crops: State-of-the-art review and challenges Ahead. J. Field Robot. 2014, 31, 888–911. [Google Scholar] [CrossRef]
  5. Ao, J.; Ji, W.; Yu, X.; Ruan, C.; Xu, B. End-effectors for fruit and vegetable harvesting robots: A review of key technologies, challenges, and future prospects. Agronomy 2025, 15, 2650. [Google Scholar] [CrossRef]
  6. Vasconez, J.P.; Kantor, G.A.; Auat Cheein, F.A. Human–robot interaction in agriculture: A survey and current challenges. Biosyst. Eng. 2019, 179, 35–48. [Google Scholar] [CrossRef]
  7. Ji, W.; Pan, Y.; Xu, B.; Wang, J. A real-time apple targets detection method for picking robot based on ShufflenetV2-YOLOX. Agriculture 2022, 12, 856. [Google Scholar] [CrossRef]
  8. Ji, W.; Zhai, K.; Xu, B.; Wu, J. Green apple detection method based on multidimensional feature extraction network model and transformer module. J. Food Prot. 2025, 88, 100397. [Google Scholar] [CrossRef]
  9. Liu, P.; Yin, H. YOLOv7-Peach: An algorithm for immature small yellow peaches detection in complex natural environments. Sensors 2023, 23, 5096. [Google Scholar] [CrossRef]
  10. Shi, Y.; Qing, S.; Zhao, L.; Wang, F.; Yuwen, X.; Qu, M. YOLO-Peach: A high-performance lightweight YOLOv8s-based model for accurate recognition and enumeration of peach seedling fruits. Agronomy 2024, 14, 1628. [Google Scholar] [CrossRef]
  11. Liu, Q.; Lv, J.; Zhang, C. MAE-YOLOv8-based small object detection of green crisp plum in real complex orchard environments. Comput. Electron. Agric. 2024, 226, 109458. [Google Scholar] [CrossRef]
  12. Chen, Q.; Yin, C.; Guo, Z.; Wu, X.; Wang, J.; Zhou, H. Apple growth status and posture recognition using improved YOLOv7. J. Agric. Eng. 2024, 40, 258–266. [Google Scholar] [CrossRef]
  13. Kok, E.; Chen, C. Occluded apples orientation estimator based on deep learning model for robotic harvesting. Comput. Electron. Agric. 2024, 219, 108781. [Google Scholar] [CrossRef]
  14. Li, H.; Shi, Y.; Liu, H.; Wang, W.; Liu, W.; Yang, P. Apple growth direction detection based on improved OpenPose. J. Chin. Agric. 2022, 34, 34–48. [Google Scholar]
  15. Xiong, Y.; Peng, C.; Grimstad, L.; From, P.J.; Isler, V. Development and field evaluation of a strawberry harvesting robot with a cable-driven gripper. Comput. Electron. Agric. 2019, 157, 392–402. [Google Scholar] [CrossRef]
  16. Xu, Z.; Liu, J.; Wang, J.; Cai, L.; Jin, Y.; Zhao, S.; Xie, B. Realtime picking point decision algorithm of trellis grape for high-speed robotic cut-and-catch harvesting. Agronomy 2023, 13, 1618. [Google Scholar] [CrossRef]
  17. Ji, W.; He, G.; Xu, B.; Zhang, H.; Yu, X. A new picking pattern of a flexible three-fingered end-effector for apple harvesting robot. Agriculture 2024, 14, 102. [Google Scholar] [CrossRef]
  18. Wang, C.-Y.; Yeh, I.-H.; Liao, H.-Y.M. YOLOv9: Learning what you want to learn using programmable gradient information. arXiv 2024, arXiv:2402.13616. [Google Scholar] [CrossRef]
  19. Yu, Y.; Zhang, K.; Yang, L.; Zhang, D. Fruit detection for strawberry harvesting robot in non-structural environment based on Mask-RCNN. Comput. Electron. Agric. 2019, 163, 104846. [Google Scholar] [CrossRef]
  20. Bodla, N.; Singh, B.; Chellappa, R.; Davis, L.S. Soft-NMS: Improving object detection with one line of code. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; p. 1. [Google Scholar] [CrossRef]
  21. Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path aggregation network for instance segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018. [Google Scholar] [CrossRef]
  22. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 30 June 2016. [Google Scholar] [CrossRef]
  23. Lu, P.; Jiang, T.; Li, Y.; Li, X.; Chen, K.; Yang, W. RTMO: Towards high-performance one-stage real-time multi-person pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; p. 1. [Google Scholar] [CrossRef]
  24. Kang, H.; Zhou, H.; Wang, X.; Chen, C. Real-time fruit recognition and grasping recognition for robotic apple harvesting. Sensors 2020, 20, 5670. [Google Scholar] [CrossRef]
  25. Huang, J.; Zhu, Z.; Guo, F.; Huang, G. The devil is in the details: Delving into unbiased data processing for human pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020. [Google Scholar] [CrossRef]
  26. Ji, W.; Fang, H.; Xu, B.; Wu, K.; Xie, L. Surface defect and contamination detection in photovoltaic panels based on few-shot data augmentation. Sol. Energy 2026, 303, 114166. [Google Scholar] [CrossRef]
  27. Zhang, K.; Lammers, K.; Chu, P.; Li, Z.; Lu, R. System design and control of an apple harvesting robot. Mechatronics 2021, 79, 102644. [Google Scholar] [CrossRef]
  28. Xiong, Y.; Ge, Y.; From, P.J.; Grimstad, L. An autonomous strawberry-harvesting robot: Design, development, integration and field evaluation. J. Field Robot. 2020, 37, 202–224. [Google Scholar] [CrossRef]
  29. Kumra, S.; Joshi, S.; Sahin, F. Antipodal robotic grasping using generative residual convolutional neural network. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, Las Vegas, NV, USA, 24 October 2020–24 January 2021. [Google Scholar] [CrossRef]
  30. Fu, L.; Feng, Y.; Majeed, Y.; Zhang, X.; Zhang, J.; Karkee, M.; Zhang, Q. Kiwifruit detection in field images using Faster R-CNN with ZFNet. Precis. Agric. 2020, 21, 856–868. [Google Scholar] [CrossRef]
Figure 1. Dataset construction process.
Figure 2. YOLOv8-Peach network architecture.
Figure 3. AFPN architecture.
Figure 4. Pipeline for peach posture recognition.
Figure 5. Feature maps of peach keypoints.
Figure 6. Component diagrams of the color space.
Figure 7. Center and radius of the peach target.
Figure 8. RTMpose backbone architecture flowchart.
Figure 9. Laboratory validation environment.
Figure 10. Comparative analysis of detection performance.
Figure 11. Comparative analysis of model accuracy.
Figure 12. Posture recognition.
Figure 13. Laboratory measurement of peach growth posture.
Table 1. Ablation test results.

| Method | mAP | Precision | Recall | Model Size (MB) | FPS | GFLOPs |
|---|---|---|---|---|---|---|
| YOLOv8 | 95.0% | 90.11% | 90.11% | 6.23 | 90 | 8.7 |
| YOLOv8 + AFPN | 97.23% | 97.11% | 93.36% | 7.26 | 85 | 10.5 |
| YOLOv8 + Soft-NMS | 96.30% | 91.34% | 91.81% | 6.23 | 78 | 8.7 |
| YOLOv8 + AFPN + Soft-NMS | 98.01% | 98.62% | 96.3% | 7.26 | 82 | 10.5 |
Table 2. Performance comparison of network architectures.

| Model | mAP | Precision | Recall | Model Size (MB) | FPS | GFLOPs |
|---|---|---|---|---|---|---|
| YOLOv5 | 95.6% | 93.84% | 91.03% | 3.8 | 93 | 4.5 |
| YOLOv7 | 93.17% | 89.95% | 86.9% | 12.45 | 99 | 13.2 |
| YOLOv8 | 95.0% | 90.11% | 90.11% | 6.23 | 90 | 8.7 |
| YOLOv8-Peach | 98.01% | 98.62% | 96.3% | 7.26 | 82 | 10.5 |
Table 3. Posture recognition metrics.

| mAP | mAP@0.5 | mAR |
|---|---|---|
| 0.896 | 0.921 | 0.927 |
Table 4. Statistical analysis of angular error in peach target growth posture detection.

| Sample Size | Mean Error | Minimum Error | Maximum Error |
|---|---|---|---|
| 30 | 7.23° | 2.6° | 12.1° |
Table 5. Angular error in peach growth posture recognition.

| Experiment | Verification 1 | Verification 2 | Verification 3 |
|---|---|---|---|
| $(x_1, y_1)$ | (937, 1092) | (770, 1093) | (1118, 1070) |
| $(x_2, y_2)$ | (772, 855) | (935, 950) | (1069, 932) |
| $(x_3, y_3)$ | (1078, 1320) | (658, 1290) | (1117, 1194) |
| $\theta$ | 2.6° | 10° | 9.1° |