1. Introduction
Grapes (Vitis vinifera L.) represent one of the most widely cultivated fruit crops worldwide, owing to their rich nutritional composition and substantial economic value [1,2,3]. They serve as a significant source of vitamins, polyphenols, and other bioactive compounds [1,4], which have been extensively associated with antioxidant properties and cardioprotective benefits [5]. In parallel, grape cultivation plays a pivotal role in the global horticultural economy, supporting both fresh fruit consumption and the production of various processed products such as wine, juice, and dried fruit [6,7,8]. This dual significance has positioned grapes as a central focus in research related to quality assessment, yield enhancement, and the implementation of precision agriculture technologies [9].
Conventional fruit harvesting remains a labor-intensive and relatively inefficient process [10]. The application of computer vision and deep learning technologies in fruit recognition has provided essential technical support for automation and intelligent decision-making in precision agriculture [11,12]. However, grape harvesting poses unique challenges due to the fruit's distinctive morphological and structural features, characterized by compact clusters, fragile berries, and densely intertwined vines [13,14]. These inherent complexities significantly hinder the development and deployment of efficient automated harvesting systems. In contrast to fruits with firm exocarps that can tolerate direct gripping or pulling, grapes possess a soft texture and irregular geometry, thereby requiring careful handling to minimize mechanical damage and prevent the unintended detachment of adjacent berries [9,15,16].
In recent years, substantial progress has been made in automated grape harvesting, with research efforts focusing on several key areas: picking point location [16,17,18,19,20], fruit recognition under occlusion [16,17], fruit classification [21,22], and quality assessment [13,23,24], as well as robotic path planning and end-effector design [25,26]. The overarching objective of these studies is to achieve accurate and efficient grape harvesting with minimal human intervention. Notably, these research components are often interdependent and can be effectively addressed within a unified processing framework. Among these, the localization of picking points on the fruit stem plays a significant role, serving not only as the core target of fruit recognition and occlusion detection but also as a fundamental basis for downstream tasks such as three-dimensional (3D) positioning and robotic path planning [16].
To address the challenge of picking point localization in grape harvesting, numerous researchers have proposed effective approaches that integrate visual perception with deep learning techniques. Luo et al. [16] employed a combination of k-means clustering, morphological processing, and geometric constraints to detect the cutting points on stems within overlapping grape clusters. This method reached an average localization accuracy of 88.33% and a success rate of 81.66% in dual-cluster scenarios. Zhao et al. [18] developed a lightweight end-to-end model, YOLO-GP, which is capable of simultaneously detecting grape clusters and the corresponding picking points. This model adopts the strategy of associating the picking point with the boundary of the detected grape cluster, achieving a mean average precision (mAP) of 93.27% and a localization error of less than 40 pixels. Xu et al. [17] proposed a real-time picking point decision algorithm based on YOLOv4-SE, incorporating multi-channel RGB and depth inputs. By constructing 3D regions of interest (ROIs) around grape clusters, this method achieved a 93.5% success rate in localization with an average response time of 84.2 ms. Zhu et al. [19] introduced an improved YOLOv5 model for recognizing both grapes and fruit stems under natural conditions, followed by a geometric approach for rapid picking point localization. This method demonstrated a notable improvement in detection accuracy compared to the baseline model, with only a slight increase in processing time. Liu et al. [20] proposed a YOLOv8-MH model for picking point localization, which integrates a multi-head attention mechanism to enhance precision and replaces the backbone with MobileNetV3 to reduce the parameters. When combined with a depth-map-based region growing algorithm, the model achieved an average precision (AP) of 93.1% and a frame rate of 80 frames per second (FPS).
Traditional image processing techniques have demonstrated relatively high accuracy in segmentation tasks involving small-scale targets with low-complexity backgrounds. However, their performance deteriorates markedly in complex orchard environments, revealing notable limitations in efficiency, accuracy, and robustness. Consequently, numerous research efforts have shifted toward deep learning technologies. Within this trend, object detection algorithms have played a critical role in directing model attention toward relevant regions, while segmentation methods have been extensively employed to suppress background noise and preserve essential morphological features [12]. Most existing algorithms can be broadly categorized as single-stage processes, in which picking point localization is inferred through morphological or geometric reasoning, based on the outcome of a single segmentation or recognition operation. However, fruit stems typically occupy only a minute portion of the entire image, which makes them difficult to detect or segment with high accuracy, often leading to false positives and missed detections. Moreover, the ambiguous or inconsistent morphological features of the fruit stem further complicate the precise localization of the picking point. These limitations collectively underscore the inherent challenges faced by the current methodologies and emphasize the necessity for more robust, fine-grained frameworks that are capable of meeting the demands posed by complex orchard environments.
To address the aforementioned limitations, this study proposes a vision-based information processing framework for grape picking, incorporating two-stage segmentation with morphological perception. This process consists of a coarse segmentation stage and a fine segmentation stage. In the first stage, a deep learning-based segmentation model is employed to extract the ROI containing the target fruit stem. Given that the ROI accounts for only a small proportion of the input image, this stage is defined as a coarse segmentation operation in terms of detail-level precision. It plays a critical role in narrowing the search space and reducing interference from complex background elements, thereby facilitating more accurate processing in the subsequent stage. In the second stage, owing to their inherent advantages, image processing techniques are applied to further refine the ROI, which is characterized by a small scale and low complexity. This step is designed to capture more precise morphological features of the fruit stem while eliminating residual background noise that remains after the coarse segmentation. By integrating the complementary advantages of data-driven learning and rule-based refinement, the proposed two-stage segmentation strategy establishes a robust foundation for accurate morphological perception of the fruit stem region, which is essential for reliable picking point localization. Morphological perception refers to the systematic identification and interpretation of the object's morphological features to support decision-making and action planning. In this study, the centroid of the finely segmented fruit stem region is calculated, and its morphological skeleton is constructed. Based on deterministic geometric rules, the picking point and corresponding cutting axis are subsequently derived to guide the end-effector of the robotic manipulator in executing precise harvesting operations.
Compared with traditional fruit-stem localization strategies, the proposed method enables a more comprehensive integration and utilization of visual information, offering superior accuracy, robustness, and real-time performance. These advantages make it highly promising for practical deployment in intelligent agricultural systems, contributing to the advancement of automated, non-destructive, and efficient fruit harvesting technologies.
2. Materials and Methods
2.1. Data Source
The grape image dataset used in this study consists of two components, referred to hereafter as dataset A and dataset B. Dataset A was collected from the Jiangxinzhou grape plantation in Zhenjiang City, Jiangsu Province, China. It contains 960 high-resolution images (3648 × 2732 pixels) captured with mobile devices, covering two representative grape cultivars: Sun Muscat and Kyoho, characterized by green and purple fruit coloration, respectively. Dataset B was sourced from a publicly available Chinese Scientific Data repository [27]. The images were captured using an Azure Kinect depth camera in the Daxu Ecological Zone of Hefei, Anhui Province, at a uniform resolution of 1280 × 720. Muscat Blanc, Shine Muscat, and Zuijinxiang were selected as representative green grape cultivars, while Kyoho, Yongyou, and Summer Black were chosen as representative purple grape cultivars. A total of 500 grape images were randomly sampled from this dataset to complement the self-constructed dataset. In total, the final dataset consisted of 1460 grape images, encompassing diverse lighting conditions, sample distributions, and cluster morphologies. Such diversity is critical for mitigating potential dataset bias and enhancing the robustness and generalizability of the proposed model.
2.2. Dataset Construction
From the original dataset, fine-grained grape instances were extracted to construct the segmentation dataset utilized in this study, yielding a total of 800 annotated images. Manual annotation was conducted using the EISeg software (version 1.1.0).
In this study, the branch connecting the fruit cluster to the main stem is divided into two sections: the proximal end, which is closer to the fruit cluster, and the distal end, which is closer to the main stem. Along this branch, the points at which branching occurs are identified. Proceeding from the proximal end toward the distal end, the fruit stem is defined as the segment of the branch between the first and second such branching points. This definition provides a precise reference for subsequent segmentation and morphological analysis, effectively reducing the errors introduced by subjective judgment.
For the instance segmentation task, two target classes are considered: fruit clusters and fruit stems. The identification of fruit clusters not only provides essential semantic logic for the localization of fruit stems (e.g., potential morphological and botanical growth patterns) but also plays a crucial role in determining the final picking point. As illustrated in Figure 1, two visually similar fruit clusters can result in markedly different picking strategies and picking point locations, due to variations in the segmentation results. In Figure 1, the sample on the left is segmented as a single entity, whereas the sample on the right depicts a morphologically similar cluster structure that is segmented into two distinct fruit clusters. According to the definition of the fruit stem, the single picking point in the left sample is located closer to the distal end. In contrast, the right sample involves two separate picking points, and the corresponding picking actions must be performed sequentially.
Based on the aforementioned criteria, two target classes were manually annotated: "fruit", representing the grape clusters, and "stem", corresponding to the fruit stems. The annotation data were initially saved in JSON format and subsequently converted into TXT files, which were compatible with the YOLO training framework, through a dedicated data transformation process. Examples of the annotated grape images are shown in Figure 2.
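Since the JSON-to-TXT conversion is only summarized above, a minimal sketch is given below. It assumes a LabelMe-style polygon schema (fields imageWidth, imageHeight, shapes, points), which EISeg can export; the field names and class mapping are assumptions rather than the exact converter used in this study. Each output line follows the YOLO segmentation label format: a class index followed by polygon vertices normalized to [0, 1].

```python
import json
from pathlib import Path

# Class mapping per the annotation scheme described above.
CLASS_IDS = {"fruit": 0, "stem": 1}

def json_to_yolo_seg(json_path: str, out_dir: str) -> None:
    """Convert one polygon-annotation JSON file (LabelMe-style schema assumed)
    into a YOLO segmentation label file: `class x1 y1 x2 y2 ...` (normalized)."""
    ann = json.loads(Path(json_path).read_text())
    w, h = ann["imageWidth"], ann["imageHeight"]      # assumed field names
    lines = []
    for shape in ann["shapes"]:                       # one entry per instance
        cls = CLASS_IDS[shape["label"]]
        coords = []
        for x, y in shape["points"]:                  # polygon vertices in pixels
            coords += [f"{x / w:.6f}", f"{y / h:.6f}"]
        lines.append(" ".join([str(cls), *coords]))
    out = Path(out_dir) / (Path(json_path).stem + ".txt")
    out.write_text("\n".join(lines))
```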
The grape segmentation dataset was randomly divided into a training set (560 images), a validation set (160 images), and a test set (80 images) in a 7:2:1 ratio. To improve model robustness and reduce the risk of overfitting during training, data augmentation was applied to the annotated training images. The augmentation strategies included horizontal flipping, random brightness adjustment, and the addition of Gaussian noise, as illustrated in the sketch below. For each original training image, three augmented variants were generated, resulting in an expanded training set comprising 2240 images. This augmentation process contributed to improving the model's generalizability in complex vineyard environments.
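A minimal sketch of the three augmentation operations follows. The brightness range and noise level are illustrative assumptions, as the exact settings are not specified above; note also that horizontal flipping requires the polygon labels to be mirrored accordingly (x → 1 − x in normalized coordinates).

```python
import cv2
import numpy as np

def augment(image: np.ndarray, rng: np.random.Generator) -> list[np.ndarray]:
    """Produce the three augmented variants used to expand the training set:
    horizontal flip, random brightness adjustment, and additive Gaussian noise."""
    flipped = cv2.flip(image, 1)                       # flipCode=1: horizontal flip

    factor = rng.uniform(0.7, 1.3)                     # brightness range (assumed)
    bright = np.clip(image.astype(np.float32) * factor, 0, 255).astype(np.uint8)

    noise = rng.normal(0.0, 10.0, image.shape)         # noise std dev (assumed)
    noisy = np.clip(image.astype(np.float32) + noise, 0, 255).astype(np.uint8)

    return [flipped, bright, noisy]
```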
2.3. Fruit Stem Segmentation
2.3.1. Coarse Segmentation
YOLOv8-seg [28] is an extended variant of the YOLOv8 object detection framework, incorporating instance segmentation capabilities. Drawing inspiration from the design of YOLACT [29], it decouples the processes of object detection and mask generation while retaining the strong detection performance of the YOLOv8 family. Considering the resource constraints that are typically encountered on edge devices in real-world agricultural scenarios, the YOLOv8s-seg variant, which provides a favorable trade-off between segmentation accuracy and computational efficiency, was selected as the baseline model for the fruit stem segmentation task in this study. To address the complex conditions that are present in an actual vineyard environment, two key architectural improvements were introduced. The proposed optimization strategies can be detailed as follows.
Firstly, a novel dynamic deformation feature aggregation module (DDFAM) is proposed to enhance the model’s feature extraction capability for relevant targets. By dynamically adapting the receptive field of the convolutional kernels, DDFAM improves the model’s ability to recognize fruit clusters and stems under occlusion and overlap conditions. Secondly, to accommodate the distinct functional requirements of regression, classification, and segmentation tasks, an efficient asymmetric decoupled head (EADHead) is designed. This structure enriches edge information in the regression and segmentation branches, while rendering the classification branch more lightweight and compact. As a result, the overall model achieves a significant reduction in parameters while enhancing segmentation performance.
DDFAM
The construction concept of DDFAM is derived from the deformable convolutional network (DCN) [30]. This architecture significantly enhances the model's adaptability to occluded and irregularly shaped targets by dynamically adjusting the sampling locations of convolutional kernels. At its core, DCN introduces learnable spatial offset parameters for each sampling point, enabling the network to adaptively optimize sampling regions based on the contextual information of input features. This mechanism effectively overcomes the limitations of traditional convolution operations constrained by fixed rectangular grids, allowing the receptive fields to flexibly adapt to deformation-sensitive regions across different spatial scales, hierarchical levels, and even individual pixels within the feature map. Consequently, the module significantly improves the network's precision in capturing the local structural features of complex targets. Traditional convolution employs a regular grid R to sample from the input feature map x. The output feature at a specific sampling location p_0 is given by:

y(p_0) = \sum_{p_n \in R} w(p_n) \cdot x(p_0 + p_n) \qquad (1)

In Equation (1), R = {(−1,−1),(−1,0),…,(0,1),(1,1)} denotes a 3 × 3 convolution kernel with a step size of 1, and w(p_n) represents the weight of the convolution kernel at the corresponding position p_n. Deformable convolution introduces an additional offset Δp_n to each sampling point, allowing the position and shape of the convolutional kernel to be dynamically adapted in response to variations in the target object. Consequently, the output feature formulation is modified as:

y(p_0) = \sum_{p_n \in R} w(p_n) \cdot x(p_0 + p_n + \Delta p_n) \qquad (2)
Although deformable convolution improves spatial adaptability compared to conventional convolution, the learned sampling offsets may extend beyond the actual ROI, which can introduce irrelevant background information during feature extraction and degrade representation quality. To address this issue, learnable modulation scalars Δm_n are introduced into the deformable convolution framework [31]. These weight coefficients not only enable the dynamic adjustment of the sampling offsets but also modulate the amplitude of input features at different spatial locations. Consequently, the output feature formulation is further extended as:

y(p_0) = \sum_{p_n \in R} w(p_n) \cdot x(p_0 + p_n + \Delta p_n) \cdot \Delta m_n \qquad (3)
The C2f module serves as a fundamental component of the YOLOv8-seg architecture. It operates by splitting the input feature map into two parallel branches: one branch retains the original features, and the other applies a deep transformation via stacked bottleneck structures. The outputs of both branches are subsequently fused by channel-wise concatenation. This design effectively preserves low-level detail features while simultaneously extracting high-level semantic representations through cascaded convolutional layers, thereby enhancing the model's multi-scale responsiveness to targets. However, the fixed geometric structure of standard convolution restricts adaptability to irregular targets, leading to insufficient spatial generalization in complex scenes. To mitigate this limitation, the deformable ConvNets v2 (DCNv2) [31] is modularized and integrated into the bottleneck of the original C2f module to form DDFAM. The detailed structure of DDFAM is illustrated in Figure 3.
The number of stacked DC_Bottleneck structures remains consistent with the original model, and the cascaded DCNv2 modules preserve the skip connection strategy of the baseline. By dynamically adjusting the sampling positions of convolutional kernels and adapting receptive field distributions, DDFAM introduces spatially adaptive perception into the feature aggregation process. This enhancement enables the more effective modeling of objects with pose variations and complex structures, substantially improving representation quality.
EADHead
The head network of YOLOv8-seg adopts a parallel-branch architecture, in which feature maps are independently processed through three separate convolutional pathways to generate the final outputs for bounding box regression and classification confidence prediction. While this decoupled design helps mitigate feature interference among tasks and enhances detection accuracy, the repeated convolutional operations across branches lead to a significant increase in parameters. To address this issue, an optimized head network, EADHead, is proposed to effectively reduce model parameter redundancy while maintaining performance. EADHead consists of multiple improved Detect modules, and its detailed structure is presented in Figure 4.
To reduce the parameters in the bounding box regression and segmentation branches, stacked depthwise separable convolutions [32] are employed. In addition, a channel shuffle module [33] is interleaved within the network to promote cross-channel information interaction, thereby enhancing the model's feature representation capability. Compared to standard convolution, a depthwise separable convolution employs a hierarchical processing mechanism, consisting of channel-wise convolutions followed by point-wise convolutions. This approach significantly reduces computational complexity while maintaining comparable feature extraction performance, making it particularly suitable for lightweight models in resource-constrained scenarios.
The lightweight effectiveness of this design can be quantitatively validated by examining the ratio of the computational cost of a depthwise separable convolution to that of a standard convolution:

\frac{D_k \cdot D_k \cdot M \cdot D_f \cdot D_f + M \cdot N \cdot D_f \cdot D_f}{D_k \cdot D_k \cdot M \cdot N \cdot D_f \cdot D_f} = \frac{1}{N} + \frac{1}{D_k^2} \qquad (4)

Here, M denotes the number of channels in the input feature map, D_f × D_f represents the spatial resolution of the output feature map, N is the number of output channels, and D_k × D_k refers to the kernel size. When a 3 × 3 convolution kernel is used, the computational complexity of a depthwise separable convolution is approximately 1/9 of that of a standard convolution for a sufficiently large N.
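A depthwise separable convolution and the cost ratio of Equation (4) can be sketched as follows; the channel counts are illustrative.

```python
import torch.nn as nn

def dws_conv(c_in: int, c_out: int, k: int = 3) -> nn.Sequential:
    """Depthwise separable convolution: a per-channel (depthwise) k×k conv
    followed by a 1×1 point-wise conv that mixes channels."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_in, k, padding=k // 2, groups=c_in, bias=False),  # depthwise
        nn.Conv2d(c_in, c_out, 1, bias=False),                              # pointwise
    )

# Cost ratio from Equation (4): 1/N + 1/Dk^2, with illustrative sizes.
M, N, Dk = 64, 128, 3
ratio = 1 / N + 1 / Dk ** 2
print(f"depthwise-separable / standard cost ≈ {ratio:.3f}")  # ≈ 0.119, i.e. ~1/9
```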
Improved YOLOv8s-Seg
The proposed optimization strategies were applied to the YOLOv8s-seg model, resulting in an enhanced segmentation model with superior performance. The overall architecture of the improved YOLOv8s-seg is illustrated in Figure 5. In this study, the model is utilized to perform the coarse segmentation of grape stems, providing preliminary ROIs that serve as inputs for subsequent fine segmentation processing.
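As a usage sketch, running the coarse-segmentation stage through the Ultralytics API might look like the following. The weight and image filenames are hypothetical, and the improved model is assumed to expose the same inference interface as the stock YOLOv8s-seg.

```python
from ultralytics import YOLO

# "improved_yolov8s_seg.pt" and "vineyard_image.jpg" are hypothetical filenames.
model = YOLO("improved_yolov8s_seg.pt")
results = model.predict("vineyard_image.jpg", conf=0.25)

for r in results:
    if r.masks is None:
        continue
    for cls_id, polygon in zip(r.boxes.cls.tolist(), r.masks.xy):
        # cls_id: 0 = "fruit", 1 = "stem" (per the dataset's class order);
        # `polygon` is an (n, 2) array of mask-contour pixel coordinates that
        # delimits the coarse ROI handed to the fine-segmentation stage.
        print(int(cls_id), polygon.shape)
```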
2.3.2. Fine Segmentation
Although the improved segmentation model achieves more accurate segmentation of grape stems, the extracted ROIs remain insufficiently precise due to challenges such as target-scale mismatch and the increased sparsity of feature maps caused by repeated down-sampling operations. As a result, the segmented regions often contain not only the target fruit stem but also background noise. To obtain more precise morphological features of the grape stem, the second-stage segmentation employs image processing techniques that are well-suited to small-scale, low-complexity targets. Specifically, an improved OTSU thresholding algorithm [34], which operates on a single-channel image extracted from the hue component of the HSV color space, is applied to further refine the results produced by the improved YOLOv8s-seg model. The procedure can be described as follows.
Firstly, a binary mask is generated in which the foreground corresponds to the ROI obtained from the coarse segmentation of the improved YOLOv8s-seg model. Subsequently, a bitwise AND operation is performed between the grayscale image and the binary mask to retain only the pixel intensities I within the masked region. Let K denote the total number of pixels in this region. For each candidate grayscale threshold T, the pixels are divided into two classes: background (I < T) and foreground, i.e., fruit stem (I ≥ T). The numbers of pixels in the background and foreground are denoted as N_0 and N_1, respectively, with the corresponding mean intensities represented by μ_0 and μ_1. The class probabilities are computed as ω_0 = N_0/K and ω_1 = N_1/K, and the between-class variance σ² is defined as:

\sigma^2 = \omega_0 \, \omega_1 \, (\mu_0 - \mu_1)^2 \qquad (5)
By traversing all grayscale levels at a specified interval, the optimal threshold T* is determined as the value that maximizes σ², thereby further achieving optimal separation between the fruit stem and residual background components based on the coarse segmentation results.
In real-world orchard environments, lighting conditions often tend to be highly variable and uncontrollable. Given that the basic OTSU thresholding algorithm is sensitive to illumination variations, which may significantly compromise segmentation accuracy, the HSV color space is adopted. In contrast to the traditional RGB color space, HSV effectively decouples chromatic information from luminance, with the hue and saturation components exhibiting greater stability under fluctuating lighting conditions. Among these, the hue channel is selected for thresholding due to its rich and discriminative color information, which is particularly beneficial for fine segmentation. The hue threshold is denoted as T_H, which enables the definition of an optimized background (H < T_H) and foreground (H ≥ T_H). This improvement leverages the illumination stability of the hue component, thereby enhancing the robustness of the OTSU algorithm under varying lighting conditions.
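Because OpenCV's built-in OTSU implementation does not accept a pixel mask, the masked hue-channel variant described above can be sketched directly, sweeping candidate thresholds and maximizing the between-class variance of Equation (5) over ROI pixels only. The foreground polarity (H ≥ T_H) follows the convention stated above.

```python
import cv2
import numpy as np

def masked_hue_otsu(bgr: np.ndarray, roi_mask: np.ndarray) -> np.ndarray:
    """Refine a coarse ROI by OTSU thresholding on the hue channel: sweep
    candidate thresholds T and keep the one maximizing the between-class
    variance of Equation (5), computed over ROI pixels only."""
    hue = cv2.cvtColor(bgr, cv2.COLOR_BGR2HSV)[:, :, 0]
    vals = hue[roi_mask > 0].astype(np.float64)        # intensities I, K = vals.size
    if vals.size == 0:
        return np.zeros_like(roi_mask)
    best_t, best_var = 0, -1.0
    for t in range(1, 180):                            # OpenCV hue range is 0-179
        bg, fg = vals[vals < t], vals[vals >= t]       # background / fruit stem
        if bg.size == 0 or fg.size == 0:
            continue
        w0, w1 = bg.size / vals.size, fg.size / vals.size
        var = w0 * w1 * (bg.mean() - fg.mean()) ** 2   # sigma^2 = w0*w1*(mu0-mu1)^2
        if var > best_var:
            best_var, best_t = var, t
    return ((hue >= best_t) & (roi_mask > 0)).astype(np.uint8) * 255
```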
2.4. Morphological Perception
In automated harvesting systems, grape picking is primarily performed by locating the picking point and severing the fruit stem [17]. This study performs morphological perception on the preprocessed stem regions and, based on deterministic geometric rules, determines the picking point locations and infers the corresponding cutting axes.
Following the two-stage segmentation process, a binary image is generated with the stem region defined as the foreground. The geometric centroid (x_c, y_c) of the foreground region is then calculated, as described in Equation (6):

x_c = \frac{1}{N} \sum_{i=1}^{N} x_i, \qquad y_c = \frac{1}{N} \sum_{i=1}^{N} y_i \qquad (6)

Here, the target region contains N pixels, each corresponding to a coordinate (x_i, y_i). The geometric centroid of the grape stem is representative, lying approximately equidistant from both the proximal and distal ends, exhibiting favorable mechanical balance. The centroid is chosen as the basis for picking point inference, based on a comprehensive consideration of computational stability, structural rationality, and harvesting safety, effectively reducing the risk of fruit damage and the entanglement of the robotic arms in the vines. Additionally, the morphological skeleton of the foreground region is extracted using the medial axis transform (MAT), which captures the intrinsic topological and geometrical structure of the stem [35]. In image processing, the discrete form of the MAT is applied to binary images, with its definition given in Equation (7):

r(p) = \min_{q \in \partial\Omega} \lVert p - q \rVert, \qquad S = \{\, p \in F \mid r(p) \ge r(p') \ \ \forall p' \in \mathcal{N}(p) \,\} \qquad (7)

where F denotes the foreground region of the binary image, ∂Ω represents the boundary set, determined by foreground pixels adjacent to background pixels, ||·|| denotes the Euclidean norm, and 𝒩(p) is the local pixel neighborhood of p. The Euclidean distance transform (EDT) is applied to the foreground region, assigning each pixel its minimum Euclidean distance to ∂Ω [36]. Local maxima within the resulting distance map are subsequently identified as skeleton points p, forming the set S. The corresponding value r(p) represents the radius of the largest inscribed circle centered at skeleton point p.
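A minimal sketch of the centroid computation (Equation (6)) and medial axis extraction (Equation (7)) using scikit-image, which returns both the skeleton points and the EDT values r(p):

```python
import numpy as np
from skimage.morphology import medial_axis

def centroid(stem_mask: np.ndarray) -> tuple[float, float]:
    """Geometric centroid (x_c, y_c) of the stem foreground, per Equation (6)."""
    ys, xs = np.nonzero(stem_mask > 0)
    return float(xs.mean()), float(ys.mean())

def stem_skeleton(stem_mask: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Medial axis transform of the stem region: skeleton points p in S and the
    EDT values r(p), the radii of the largest inscribed circles (Equation (7))."""
    skel, dist = medial_axis(stem_mask > 0, return_distance=True)
    ys, xs = np.nonzero(skel)
    return np.column_stack([xs, ys]), dist[ys, xs]
```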
To mitigate the issue of spurious branches that may arise in the morphological skeleton due to excessive sampling density or irregular stem boundaries, an adaptive sampling interval is introduced. This parameter is dynamically adjusted, based on the estimated diameter of individual fruit stems, enabling effective control of the density of uniformly spaced skeleton points. The sampled points, denoted as ps, are connected in order to reconstruct a simplified and stable morphological skeleton path for the fruit stem. This skeleton provides structural stability and topological consistency, which serves as a reliable geometric basis for subsequent picking point localization.
Based on the extracted centroid and the morphological skeleton, the sampled skeleton point p_s with the minimum Euclidean distance to the centroid is selected and designated as the picking point T. This strategy ensures that the picking point is strictly located within the fruit stem region, while maintaining spatial proximity to the centroid and preserving its geometric significance.
The cutting axis of the stem is derived from the local geometric structure of the morphological skeleton and serves as a structural reference for guiding the cutting direction of the robotic end-effector. During the cutting operation, the end-effector adjusts its orientation to align with this axis, enabling the execution of a minimal-area cross-sectional cut on the fruit stem. To determine the cutting axis, two sampled skeleton points, p1 and p2, which are immediately adjacent to T along the skeleton path, are selected. The line segment connecting p1 and p2 approximates the local segment of the morphological skeleton near T. The normal vector of this segment is then calculated, and a straight line passing through T along this normal direction is defined as the final cutting axis of the stem.
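Picking point selection and cutting-axis inference can then be sketched as follows, assuming the skeleton points have already been ordered along the skeleton path and that the sampling interval step has been derived from the estimated stem diameter:

```python
import numpy as np

def picking_point_and_axis(skel_pts: np.ndarray, cen: tuple[float, float],
                           step: int) -> tuple[np.ndarray, np.ndarray]:
    """Select the picking point T as the sampled skeleton point nearest the
    centroid, then take the normal of the local segment (p1, p2) as the
    cutting-axis direction."""
    sampled = skel_pts[::step]                          # uniformly spaced points p_s
    dists = np.linalg.norm(sampled - np.asarray(cen), axis=1)
    i = int(dists.argmin())
    T = sampled[i]
    p1 = sampled[max(i - 1, 0)]                         # neighbors of T on the path
    p2 = sampled[min(i + 1, len(sampled) - 1)]
    tangent = (p2 - p1).astype(np.float64)
    tangent /= np.linalg.norm(tangent) + 1e-9           # local skeleton direction
    normal = np.array([-tangent[1], tangent[0]])        # 90-degree rotation
    return T, normal                          # cutting axis: line through T along normal
```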
Figure 6 illustrates three representative morphological patterns of grape stems. For each pattern, the proposed method is employed to infer both the picking point and the cutting axis.
As shown in Figure 6, the proposed strategy successfully localizes the picking point within the stem region in all three morphological scenarios. Additionally, the inferred cutting axis remains approximately perpendicular to the local morphological skeleton of the stem. These results demonstrate the feasibility of the proposed approach, highlighting its robustness in handling stems with complex geometries, as commonly encountered in real vineyard environments. This strategy serves as a foundation for generating spatial coordinates and guiding the motion planning of the end-effector. Accurate localization of the picking point and cutting axis contributes to efficient and precise harvesting operations, significantly enhancing fruit quality and ensuring the stable performance of the automated harvesting system.
2.5. Evaluation Metrics
To comprehensively evaluate the segmentation performance of the proposed model, multiple quantitative metrics were employed. For segmentation accuracy evaluation, AP@0.5 was used to evaluate segmentation performance for individual categories, while mAP@0.5, calculated as the mean of AP@0.5 across all classes, was adopted to evaluate overall segmentation performance. Here, @0.5 denotes an intersection over union (IoU) threshold that is greater than or equal to 0.5. Considering the relatively balanced distribution of samples in this instance segmentation task, macro-averaging (macro) was employed to calculate the multi-class evaluation metrics. To provide a more objective and comprehensive reflection of the overall precision, the specific definition of macro-precision is given in Equation (8), where N represents the number of classes and Precision_i corresponds to the precision of class i:

\text{Precision}_{\text{macro}} = \frac{1}{N} \sum_{i=1}^{N} \text{Precision}_i \qquad (8)
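As a worked example of Equation (8), with illustrative per-class precision values:

```python
import numpy as np

# Equation (8) with illustrative precisions for the "fruit" and "stem" classes.
per_class_precision = np.array([0.91, 0.82])
macro_precision = per_class_precision.mean()    # unweighted mean over classes
print(f"macro-precision = {macro_precision:.4f}")
```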
In addition, the model’s efficiency and suitability for real-time or embedded deployment were evaluated using three key indicators: parameters, which reflect the total count of trainable weights; floating point operations (FLOPs), which quantify the computational complexity; and detection time, defined as the average inference time per image during the testing. These metrics collectively provide a comprehensive evaluation of the model’s lightweight properties and inference speed.
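A sketch of how the parameter count and per-image detection time might be measured for any of the compared models; FLOPs require an external profiler (e.g., thop or ptflops), and the input resolution is an assumption.

```python
import time
import torch

def profile(model: torch.nn.Module, img_size: int = 640, runs: int = 50) -> None:
    """Report trainable-parameter count and mean per-image inference time."""
    params = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print(f"parameters: {params / 1e6:.2f} M")

    model.eval()
    x = torch.randn(1, 3, img_size, img_size)
    with torch.no_grad():
        for _ in range(5):                      # warm-up iterations
            model(x)
        t0 = time.perf_counter()
        for _ in range(runs):
            model(x)
    print(f"detection time: {(time.perf_counter() - t0) / runs * 1e3:.2f} ms/image")
```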
5. Discussion
Deep learning and image processing have emerged as foundational technologies that can be extensively applied in modern agricultural practices [11,41,42,43,44,45,46]. Although significant progress has been made in grape-picking research, precise localization and the morphological perception of fruit stems remain major challenges. Some studies have primarily concentrated on fruit clusters while neglecting the fruit stem itself, relying solely on the spatial position of the cluster to perform indirect and ambiguous inferences, based on generalized morphological or geometric principles [16,17,18,20]. However, given the inherent complexity of vineyard environments and the morphological variability of grape growth, such inference strategies are inherently unstable and limited in their accuracy.
Fruit stem recognition techniques have enabled the targeted identification of relevant structures. However, this approach remains inadequate for effectively extracting and utilizing their morphological features. As a result, methods based on bounding box annotations [19,47] and data-driven pattern recognition [48] often lead to ambiguous inferences, lacking stable and interpretable mapping between stem morphology and picking point localization. To address this limitation, some studies have incorporated fruit stem segmentation techniques to extract morphological information [49]. Nevertheless, the effectiveness of these techniques remains constrained due to the inherent properties of the segmentation target. Specifically, single-stage coarse segmentation methods are insufficient to capture the fine-grained morphological features of stems with the accuracy necessary for reliable downstream tasks.
To address these challenges, this study proposes a vision-based information-processing framework for vineyard grape picking that integrates two-stage segmentation with morphological perception. First, the morphological scope of the fruit stem is explicitly defined to reduce the annotation bias arising from subjective labeling. Next, a two-stage segmentation strategy is developed to facilitate more accurate extraction of stem morphological features. Subsequently, morphological perception is performed on the refined stem region, with deterministic geometric rules being used to establish the spatial relationships between morphological features, picking points, and cutting axes. In contrast to conventional approaches, the proposed method directly infers the picking point from the intrinsic morphological attributes of the fruit stem, thereby minimizing the errors from indirect inference and external factors. Furthermore, the framework ensures that the identified picking point lies strictly within the stem region, effectively preventing a spatial mismatch between the predicted point and the actual stem structure. Visualization analysis confirms that the proposed information-processing framework exhibits strong feasibility, adaptability, and accuracy under real-world vineyard environments, including complex lighting conditions, irregular stem morphologies, and diverse fruit cluster spatial distributions.
Although the proposed model demonstrates competitive performance, there remains significant scope for further optimization. Future work will focus on achieving a better balance between accuracy and model lightweighting to facilitate real-time deployment. In addition, particular attention will be directed toward improving the model's generalizability and robustness under diverse vineyard conditions, such as variable illumination, occlusion, and cultivar differences. Moreover, the proposed framework incorporates multi-module processing, including data transmission, caching, parallel computation, and mapping, which inevitably increases the system complexity. Subsequent research will, therefore, evaluate computational efficiency and resource allocation on resource-constrained edge devices, as well as explore hardware-oriented deployment strategies. Collectively, these efforts are expected to enable more effective validation and adaptation of the proposed vision system in real-world vineyard environments. Finally, while the proposed method enables the effective extraction and utilization of two-dimensional (2D) spatial image information, it lacks integration with 3D cues such as depth, spatial coordinates, or point cloud data. This limitation restricts the system's ability to comprehensively perceive the spatial configuration of fruit stems and their surrounding structures. Future research will focus on extending the current 2D morphological perception framework by integrating 3D visual technologies. Notably, the framework developed in this study generates a set of precise and highly interpretable secondary morphological outputs (e.g., morphological skeletons, picking points, and cutting axes), which provide a promising foundation for a novel 3D modeling approach. Instead of explicitly reconstructing the full 3D geometry of the fruit cluster and stem, the proposed direction involves generating 3D stem skeletons and cutting planes by fusing multi-view secondary morphological information. This strategy significantly reduces the complexity of both modeling and inference, thereby enhancing its suitability for real-time applications and deployment in resource-constrained environments.
In conclusion, this study introduces an effective visual information processing framework to support planning and decision-making in vineyard grape harvesting. Furthermore, the proposed approach exhibits significant potential for broader application and seamless integration with 3D perception systems. These advancements collectively contribute to the intelligent development of fruit-harvesting techniques.
6. Conclusions
This study presents a visual information processing framework for vineyard grape picking, integrating a two-stage fruit stem segmentation strategy, with morphological perception applied to the finely segmented regions.
The proposed two-stage segmentation strategy consists of deep-learning-based coarse segmentation, followed by rule-based fine segmentation. The coarse segmentation employs an improved YOLOv8s-seg model with two key architectural enhancements. First, a novel DDFAM is introduced to improve the model’s capability in capturing morphological features, particularly for small-scale and irregularly shaped targets. Second, an EADHead is designed to enrich edge information in both the regression and segmentation branches, while maintaining a lightweight and compact model structure. Compared to the baseline model, the improved model achieves a 4.27% increase in mAP@0.5, including an 8.32% improvement in AP@0.5 specifically for fruit stem segmentation, while reducing the parameters to 86% of the original model. Compared with mainstream segmentation models, the improved model achieves the optimal overall performance, with the highest segmentation accuracy (mAP@0.5 of 86.75%), a lightweight architecture (10.34 M parameters), and real-time inference speed (10.02 ms per image). Additionally, it effectively reduces false negatives and demonstrates strong robustness against interference from complex vineyard backgrounds.
The fine segmentation stage further refines the morphological delineation of fruit stems. In this study, an improved OTSU thresholding algorithm is proposed, operating on the hue channel of the HSV color space. By decoupling the chromatic information from luminance, this approach significantly improves the robustness of thresholding under complex lighting conditions. Visual analysis confirms that the proposed method yields more accurate and sharper segmentation outcomes, effectively reducing the mis-segmentation caused by light sensitivity.
Morphological perception is subsequently performed on the preprocessed stem regions, including the calculation of centroid coordinates and the extraction of morphological skeletons using the MAT. The spatial relationships among the morphological features, picking points, and cutting axes are established using deterministic geometric rules. The proposed method ensures that the final picking point is strictly located within the stem region and that the cutting axis is aligned orthogonally to the local extension direction of the stem at the picking point. Visualization results confirm its feasibility and adaptability in real-world vineyard conditions.
The proposed visual information-processing framework enables the digitization of fruit-picking behavior and effectively supports the planning and decision-making processes of autonomous harvesting systems. By providing a robust technical foundation for vision-driven grape picking, this framework facilitates the advancement of intelligent, non-destructive, and high-efficiency fruit harvesting technologies.