1. Introduction
Grapes (Vitis vinifera L.) represent one of the most widely cultivated fruit crops worldwide, owing to their rich nutritional composition and substantial economic value [1,2,3]. They serve as a significant source of vitamins, polyphenols, and other bioactive compounds [1,4], which have been extensively associated with antioxidant properties and cardioprotective benefits [5]. In parallel, grape cultivation plays a pivotal role in the global horticultural economy, supporting both fresh fruit consumption and the production of various processed products such as wine, juice, and dried fruit [6,7,8]. This dual significance has positioned grapes as a central focus in research related to quality assessment, yield enhancement, and the implementation of precision agriculture technologies [9].
Conventional fruit harvesting remains a labor-intensive and relatively inefficient process [10]. The application of computer vision and deep learning technologies in fruit recognition has provided essential technical support for automation and intelligent decision-making in precision agriculture [11,12]. However, grape harvesting poses unique challenges due to the fruit's distinctive morphological and structural features, characterized by compact clusters, fragile berries, and densely intertwined vines [13,14]. These inherent complexities significantly hinder the development and deployment of efficient automated harvesting systems. In contrast to fruits with firm exocarps that can tolerate direct gripping or pulling, grapes possess a soft texture and irregular geometry, thereby requiring careful handling to minimize mechanical damage and prevent the unintended detachment of adjacent berries [9,15,16].
In recent years, substantial progress has been made in automated grape harvesting, with research efforts focusing on several key areas: picking point location [16,17,18,19,20], fruit recognition under occlusion [16,17], fruit classification [21,22], and quality assessment [13,23,24], as well as robotic path planning and end-effector design [25,26]. The overarching objective of these studies is to achieve accurate and efficient grape harvesting with minimal human intervention. Notably, these research components are often interdependent and can be effectively addressed within a unified processing framework. Among these, the localization of picking points on the fruit stem plays a significant role, serving not only as the core target of fruit recognition and occlusion detection but also as a fundamental basis for downstream tasks such as three-dimensional (3D) positioning and robotic path planning [16].
To address the challenge of picking point localization in grape harvesting, numerous researchers have proposed effective approaches that integrate visual perception with deep learning techniques. Luo et al. [16] employed a combination of k-means clustering, morphological processing, and geometric constraints to detect the cutting points on stems within overlapping grape clusters. This method reached an average localization accuracy of 88.33% and a success rate of 81.66% in dual-cluster scenarios. Zhao et al. [18] developed a lightweight end-to-end model, YOLO-GP, which is capable of simultaneously detecting grape clusters and the corresponding picking points. This model adopts the strategy of associating the picking point with the boundary of the detected grape cluster, achieving a mean average precision (mAP) of 93.27% and a localization error of less than 40 pixels. Xu et al. [17] proposed a real-time picking point decision algorithm based on YOLOv4-SE, incorporating multi-channel RGB and depth inputs. By constructing 3D regions of interest (ROIs) around grape clusters, this method achieved a 93.5% success rate in localization with an average response time of 84.2 ms. Zhu et al. [19] introduced an improved YOLOv5 model for recognizing both grapes and fruit stems under natural conditions, followed by a geometric approach for rapid picking point localization. This method demonstrated a notable improvement in detection accuracy compared to the baseline model, with only a slight increase in processing time. Liu et al. [20] proposed a YOLOv8-MH model for picking point localization, which integrates a multi-head attention mechanism to enhance precision and replaces the backbone with MobileNetV3 to reduce the parameters. When combined with a depth-map-based region growing algorithm, the model achieved an average precision (AP) of 93.1% and a frame rate of 80 frames per second (FPS).
Traditional image processing techniques have demonstrated relatively high accuracy in segmentation tasks involving small-scale targets with low-complexity backgrounds. However, their performance deteriorates markedly in complex orchard environments, revealing notable limitations in efficiency, accuracy, and robustness. Consequently, numerous research efforts have shifted toward deep learning technologies. Within this trend, object detection algorithms have played a critical role in directing model attention toward relevant regions, while segmentation methods have been extensively employed to suppress background noise and preserve essential morphological features [12]. Most existing algorithms can be broadly categorized as single-stage processes, in which picking point localization is inferred through morphological or geometric reasoning, based on the outcome of a single segmentation or recognition operation. However, fruit stems typically occupy only a minute portion of the entire image, which makes them difficult to detect or segment with high accuracy, often leading to false positives and missed detections. Moreover, the ambiguous or inconsistent morphological features of the fruit stem further complicate the precise localization of the picking point. These limitations collectively underscore the inherent challenges faced by the current methodologies and emphasize the necessity for more robust, fine-grained frameworks that are capable of meeting the demands posed by complex orchard environments.
To address the aforementioned limitations, this study proposes a vision-based information processing framework for grape picking, incorporating two-stage segmentation with morphological perception. This process consists of a coarse segmentation stage and a fine segmentation stage. In the first stage, a deep learning-based segmentation model is employed to extract the ROI containing the target fruit stem. Given that the ROI accounts for only a small proportion of the input image, this stage is defined as a coarse segmentation operation in terms of detail-level precision. It plays a critical role in narrowing the search space and reducing interference from complex background elements, thereby facilitating more accurate processing in the subsequent stage. In the second stage, owing to their inherent advantages, image processing techniques are applied to further refine the ROI, which is characterized by a small scale and low complexity. This step is designed to capture more precise morphological features of the fruit stem while eliminating residual background noise that remains after the coarse segmentation. By integrating the complementary advantages of data-driven learning and rule-based refinement, the proposed two-stage segmentation strategy establishes a robust foundation for accurate morphological perception of the fruit stem region, which is essential for reliable picking point localization. Morphological perception refers to the systematic identification and interpretation of the object's morphological features to support decision-making and action planning. In this study, the centroid of the finely segmented fruit stem region is calculated, and its morphological skeleton is constructed. Based on deterministic geometric rules, the picking point and corresponding cutting axis are subsequently derived to guide the end-effector of the robotic manipulator in executing precise harvesting operations.
Compared with traditional fruit-stem localization strategies, the proposed method enables a more comprehensive integration and utilization of visual information, offering superior accuracy, robustness, and real-time performance. These advantages make it highly promising for practical deployment in intelligent agricultural systems, contributing to the advancement of automated, non-destructive, and efficient fruit harvesting technologies.
2. Materials and Methods
2.1. Data Source
The grape image dataset used in this study consists of two components, referred to hereafter as dataset A and dataset B. Dataset A was collected from the Jiangxinzhou grape plantation in Zhenjiang City, Jiangsu Province, China. It contains 960 high-resolution images (3648 × 2732 pixels) captured with mobile devices, covering two representative grape cultivars: Sun Muscat and Kyoho, characterized by green and purple fruit coloration, respectively. Dataset B was sourced from a publicly available Chinese Scientific Data repository [27]. The images were captured using an Azure Kinect depth camera in the Daxu Ecological Zone of Hefei, Anhui Province, at a uniform resolution of 1280 × 720. Muscat Blanc, Shine Muscat, and Zuijinxiang were selected as representative green grape cultivars, while Kyoho, Yongyou, and Summer Black were chosen as representative purple grape cultivars. A total of 500 grape images were randomly sampled from this dataset to complement the self-constructed dataset. In total, the final dataset consisted of 1460 grape images, encompassing diverse lighting conditions, sample distributions, and cluster morphologies. Such diversity is critical for mitigating potential dataset bias and enhancing the robustness and generalizability of the proposed model.
2.2. Dataset Construction
From the original dataset, fine-grained grape instances were extracted to construct the segmentation dataset utilized in this study, yielding a total of 800 annotated images. Manual annotation was conducted using the EISeg software (version 1.1.0).
In this study, the branch connecting the fruit cluster to the main stem is divided into two sections: the proximal end, which is closer to the fruit cluster, and the distal end, which is closer to the main stem. Along this branch, the points at which branching occurs are identified. Proceeding from the proximal end toward the distal end, the fruit stem is defined as the segment of the branch between the first and second such branching points. This definition provides a precise reference for subsequent segmentation and morphological analysis, effectively reducing the errors introduced by subjective judgment.
For the instance segmentation task, two target classes are considered: fruit clusters and fruit stems. The identification of fruit clusters not only provides essential semantic logic for the localization of fruit stems (e.g., potential morphological and botanical growth patterns) but also plays a crucial role in determining the final picking point. As illustrated in Figure 1, two visually similar fruit clusters can result in markedly different picking strategies and picking point locations, due to variations in the segmentation results. In Figure 1, the sample on the left is segmented as a single entity, whereas the sample on the right depicts a morphologically similar cluster structure that is segmented into two distinct fruit clusters. According to the definition of the fruit stem, the single picking point in the left sample is located closer to the distal end. In contrast, the right sample involves two separate picking points, and the corresponding picking actions must be performed sequentially.
Based on the aforementioned criteria, two target classes were manually annotated: "fruit", representing the grape clusters, and "stem", corresponding to the fruit stems. The annotation data were initially saved in JSON format and subsequently converted into TXT files, which were compatible with the YOLO training framework, through a dedicated data transformation process. Examples of the annotated grape images are shown in Figure 2.
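Since the JSON-to-TXT conversion is only summarized above, a minimal sketch is given below. It assumes a LabelMe-style polygon schema (fields imageWidth, imageHeight, shapes, points), which EISeg can export; the field names and class mapping are assumptions rather than the exact converter used in this study. Each output line follows the YOLO segmentation label format: a class index followed by polygon vertices normalized to [0, 1].

```python
import json
from pathlib import Path

# Class mapping per the annotation scheme described above.
CLASS_IDS = {"fruit": 0, "stem": 1}

def json_to_yolo_seg(json_path: str, out_dir: str) -> None:
    """Convert one polygon-annotation JSON file (LabelMe-style schema assumed)
    into a YOLO segmentation label file: `class x1 y1 x2 y2 ...` (normalized)."""
    ann = json.loads(Path(json_path).read_text())
    w, h = ann["imageWidth"], ann["imageHeight"]      # assumed field names
    lines = []
    for shape in ann["shapes"]:                       # one entry per instance
        cls = CLASS_IDS[shape["label"]]
        coords = []
        for x, y in shape["points"]:                  # polygon vertices in pixels
            coords += [f"{x / w:.6f}", f"{y / h:.6f}"]
        lines.append(" ".join([str(cls), *coords]))
    out = Path(out_dir) / (Path(json_path).stem + ".txt")
    out.write_text("\n".join(lines))
```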
The grape segmentation dataset was randomly divided into a training set (560 images), a validation set (160 images), and a test set (80 images) in a 7:2:1 ratio. To improve model robustness and reduce the risk of overfitting during training, data augmentation was applied to the annotated training images. The augmentation strategies included horizontal flipping, random brightness adjustment, and the addition of Gaussian noise, as illustrated in the sketch below. For each original training image, three augmented variants were generated, resulting in an expanded training set comprising 2240 images. This augmentation process contributed to improving the model's generalizability in complex vineyard environments.
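A minimal sketch of the three augmentation operations follows. The brightness range and noise level are illustrative assumptions, as the exact settings are not specified above; note also that horizontal flipping requires the polygon labels to be mirrored accordingly (x → 1 − x in normalized coordinates).

```python
import cv2
import numpy as np

def augment(image: np.ndarray, rng: np.random.Generator) -> list[np.ndarray]:
    """Produce the three augmented variants used to expand the training set:
    horizontal flip, random brightness adjustment, and additive Gaussian noise."""
    flipped = cv2.flip(image, 1)                       # flipCode=1: horizontal flip

    factor = rng.uniform(0.7, 1.3)                     # brightness range (assumed)
    bright = np.clip(image.astype(np.float32) * factor, 0, 255).astype(np.uint8)

    noise = rng.normal(0.0, 10.0, image.shape)         # noise std dev (assumed)
    noisy = np.clip(image.astype(np.float32) + noise, 0, 255).astype(np.uint8)

    return [flipped, bright, noisy]
```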
2.3. Fruit Stem Segmentation
2.3.1. Coarse Segmentation
YOLOv8-seg [28] is an extended variant of the YOLOv8 object detection framework, incorporating instance segmentation capabilities. Drawing inspiration from the design of YOLACT [29], it decouples the processes of object detection and mask generation while retaining the strong detection performance of the YOLOv8 family. Considering the resource constraints that are typically encountered on edge devices in real-world agricultural scenarios, the YOLOv8s-seg variant, which provides a favorable trade-off between segmentation accuracy and computational efficiency, was selected as the baseline model for the fruit stem segmentation task in this study. To address the complex conditions that are present in an actual vineyard environment, two key architectural improvements were introduced. The proposed optimization strategies can be detailed as follows.
Firstly, a novel dynamic deformation feature aggregation module (DDFAM) is proposed to enhance the model’s feature extraction capability for relevant targets. By dynamically adapting the receptive field of the convolutional kernels, DDFAM improves the model’s ability to recognize fruit clusters and stems under occlusion and overlap conditions. Secondly, to accommodate the distinct functional requirements of regression, classification, and segmentation tasks, an efficient asymmetric decoupled head (EADHead) is designed. This structure enriches edge information in the regression and segmentation branches, while rendering the classification branch more lightweight and compact. As a result, the overall model achieves a significant reduction in parameters while enhancing segmentation performance.
DDFAM
The construction concept of DDFAM is derived from the deformable convolutional network (DCN) [30]. This architecture significantly enhances the model's adaptability to occluded and irregularly shaped targets by dynamically adjusting the sampling locations of convolutional kernels. At its core, DCN introduces learnable spatial offset parameters for each sampling point, enabling the network to adaptively optimize sampling regions based on the contextual information of input features. This mechanism effectively overcomes the limitations of traditional convolution operations constrained by fixed rectangular grids, allowing the receptive fields to flexibly adapt to deformation-sensitive regions across different spatial scales, hierarchical levels, and even individual pixels within the feature map. Consequently, the module significantly improves the network's precision in capturing the local structural features of complex targets. Traditional convolution employs a regular grid R to sample from the input feature map x. The output feature at a specific sampling location p_0 is given by:

y(p_0) = \sum_{p_n \in R} w(p_n) \cdot x(p_0 + p_n) \qquad (1)

In Equation (1), R = {(−1,−1),(−1,0),…,(0,1),(1,1)} denotes a 3 × 3 convolution kernel with a step size of 1, and w(p_n) represents the weight of the convolution kernel at the corresponding position p_n. Deformable convolution introduces an additional offset Δp_n to each sampling point, allowing the position and shape of the convolutional kernel to be dynamically adapted in response to variations in the target object. Consequently, the output feature formulation is modified as:

y(p_0) = \sum_{p_n \in R} w(p_n) \cdot x(p_0 + p_n + \Delta p_n) \qquad (2)
Although deformable convolution improves spatial adaptability compared to conventional convolution, the learned sampling offsets may extend beyond the actual ROI, which can introduce irrelevant background information during feature extraction and degrade representation quality. To address this issue, learnable modulation scalars Δm_n are introduced into the deformable convolution framework [31]. These weight coefficients not only enable the dynamic adjustment of the sampling offsets but also modulate the amplitude of input features at different spatial locations. Consequently, the output feature formulation is further extended as:

y(p_0) = \sum_{p_n \in R} w(p_n) \cdot x(p_0 + p_n + \Delta p_n) \cdot \Delta m_n \qquad (3)
The C2f module serves as a fundamental component of the YOLOv8-seg architecture. It operates by splitting the input feature map into two parallel branches: one branch retains the original features, and the other applies a deep transformation via stacked bottleneck structures. The outputs of both branches are subsequently fused by channel-wise concatenation. This design effectively preserves low-level detail features while simultaneously extracting high-level semantic representations through cascaded convolutional layers, thereby enhancing the model's multi-scale responsiveness to targets. However, the fixed geometric structure of standard convolution restricts adaptability to irregular targets, leading to insufficient spatial generalization in complex scenes. To mitigate this limitation, the deformable ConvNets v2 (DCNv2) [31] is modularized and integrated into the bottleneck of the original C2f module to form DDFAM. The detailed structure of DDFAM is illustrated in Figure 3.
The number of stacked DC_Bottleneck structures remains consistent with the original model, and the cascaded DCNv2 modules preserve the skip connection strategy of the baseline. By dynamically adjusting the sampling positions of convolutional kernels and adapting receptive field distributions, DDFAM introduces spatially adaptive perception into the feature aggregation process. This enhancement enables the more effective modeling of objects with pose variations and complex structures, substantially improving representation quality.
EADHead
The head network of YOLOv8-seg adopts a parallel-branch architecture, in which feature maps are independently processed through three separate convolutional pathways to generate the final outputs for bounding box regression and classification confidence prediction. While this decoupled design helps mitigate feature interference among tasks and enhances detection accuracy, the repeated convolutional operations across branches lead to a significant increase in parameters. To address this issue, an optimized head network, EADHead, is proposed to effectively reduce model parameter redundancy while maintaining performance. EADHead consists of multiple improved Detect modules, and its detailed structure is presented in Figure 4.
To reduce the parameters in the bounding box regression and segmentation branches, stacked depthwise separable convolutions [32] are employed. In addition, a channel shuffle module [33] is interleaved within the network to promote cross-channel information interaction, thereby enhancing the model's feature representation capability. Compared to standard convolution, a depthwise separable convolution employs a hierarchical processing mechanism, consisting of channel-wise convolutions followed by point-wise convolutions. This approach significantly reduces computational complexity while maintaining comparable feature extraction performance, making it particularly suitable for lightweight models in resource-constrained scenarios.
The lightweight effectiveness of this design can be quantitatively validated by examining the ratio of the computational cost of a depthwise separable convolution to that of a standard convolution:

\frac{D_k \cdot D_k \cdot M \cdot D_f \cdot D_f + M \cdot N \cdot D_f \cdot D_f}{D_k \cdot D_k \cdot M \cdot N \cdot D_f \cdot D_f} = \frac{1}{N} + \frac{1}{D_k^2} \qquad (4)

Here, M denotes the number of channels in the input feature map, D_f × D_f represents the spatial resolution of the output feature map, N is the number of output channels, and D_k × D_k refers to the kernel size. When a 3 × 3 convolution kernel is used, the computational complexity of a depthwise separable convolution is approximately 1/9 of that of a standard convolution for a sufficiently large N.
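A depthwise separable convolution and the cost ratio of Equation (4) can be sketched as follows; the channel counts are illustrative.

```python
import torch.nn as nn

def dws_conv(c_in: int, c_out: int, k: int = 3) -> nn.Sequential:
    """Depthwise separable convolution: a per-channel (depthwise) k×k conv
    followed by a 1×1 point-wise conv that mixes channels."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_in, k, padding=k // 2, groups=c_in, bias=False),  # depthwise
        nn.Conv2d(c_in, c_out, 1, bias=False),                              # pointwise
    )

# Cost ratio from Equation (4): 1/N + 1/Dk^2, with illustrative sizes.
M, N, Dk = 64, 128, 3
ratio = 1 / N + 1 / Dk ** 2
print(f"depthwise-separable / standard cost ≈ {ratio:.3f}")  # ≈ 0.119, i.e. ~1/9
```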
Improved YOLOv8s-Seg
The proposed optimization strategies were applied to the YOLOv8s-seg model, resulting in an enhanced segmentation model with superior performance. The overall architecture of the improved YOLOv8s-seg is illustrated in Figure 5. In this study, the model is utilized to perform the coarse segmentation of grape stems, providing preliminary ROIs that serve as inputs for subsequent fine segmentation processing.
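As a usage sketch, running the coarse-segmentation stage through the Ultralytics API might look like the following. The weight and image filenames are hypothetical, and the improved model is assumed to expose the same inference interface as the stock YOLOv8s-seg.

```python
from ultralytics import YOLO

# "improved_yolov8s_seg.pt" and "vineyard_image.jpg" are hypothetical filenames.
model = YOLO("improved_yolov8s_seg.pt")
results = model.predict("vineyard_image.jpg", conf=0.25)

for r in results:
    if r.masks is None:
        continue
    for cls_id, polygon in zip(r.boxes.cls.tolist(), r.masks.xy):
        # cls_id: 0 = "fruit", 1 = "stem" (per the dataset's class order);
        # `polygon` is an (n, 2) array of mask-contour pixel coordinates that
        # delimits the coarse ROI handed to the fine-segmentation stage.
        print(int(cls_id), polygon.shape)
```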
2.3.2. Fine Segmentation
Although the improved segmentation model achieves more accurate segmentation of grape stems, the extracted ROIs remain insufficiently precise due to challenges such as target-scale mismatch and the increased sparsity of feature maps caused by repeated down-sampling operations. As a result, the segmented regions often contain not only the target fruit stem but also background noise. To obtain more precise morphological features of the grape stem, the second-stage segmentation employs image processing techniques that are well-suited to small-scale, low-complexity targets. Specifically, an improved OTSU thresholding algorithm [34], which operates on a single-channel image extracted from the hue component of the HSV color space, is applied to further refine the results produced by the improved YOLOv8s-seg model. The procedure can be described as follows.
Firstly, a binary mask is generated in which the foreground corresponds to the ROI obtained from the coarse segmentation of the improved YOLOv8s-seg model. Subsequently, a bitwise AND operation is performed between the grayscale image and the binary mask to retain only the pixel intensities I within the masked region. Let K denote the total number of pixels in this region. For each candidate grayscale threshold T, the pixels are divided into two classes: background (I < T) and foreground, i.e., fruit stem (I ≥ T). The numbers of pixels in the background and foreground are denoted as N_0 and N_1, respectively, with the corresponding mean intensities represented by μ_0 and μ_1. The class probabilities are computed as ω_0 = N_0/K and ω_1 = N_1/K, and the between-class variance σ² is defined as:

\sigma^2 = \omega_0 \, \omega_1 \, (\mu_0 - \mu_1)^2 \qquad (5)
By traversing all grayscale levels at a specified interval, the optimal threshold T* is determined as the value that maximizes σ², thereby further achieving optimal separation between the fruit stem and residual background components based on the coarse segmentation results.
In real-world orchard environments, lighting conditions often tend to be highly variable and uncontrollable. Given that the basic OTSU thresholding algorithm is sensitive to illumination variations, which may significantly compromise segmentation accuracy, the HSV color space is adopted. In contrast to the traditional RGB color space, HSV effectively decouples chromatic information from luminance, with the hue and saturation components exhibiting greater stability under fluctuating lighting conditions. Among these, the hue channel is selected for thresholding due to its rich and discriminative color information, which is particularly beneficial for fine segmentation. The hue threshold is denoted as T_H, which enables the definition of an optimized background (H < T_H) and foreground (H ≥ T_H). This improvement leverages the illumination stability of the hue component, thereby enhancing the robustness of the OTSU algorithm under varying lighting conditions.
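Because OpenCV's built-in OTSU implementation does not accept a pixel mask, the masked hue-channel variant described above can be sketched directly, sweeping candidate thresholds and maximizing the between-class variance of Equation (5) over ROI pixels only. The foreground polarity (H ≥ T_H) follows the convention stated above.

```python
import cv2
import numpy as np

def masked_hue_otsu(bgr: np.ndarray, roi_mask: np.ndarray) -> np.ndarray:
    """Refine a coarse ROI by OTSU thresholding on the hue channel: sweep
    candidate thresholds T and keep the one maximizing the between-class
    variance of Equation (5), computed over ROI pixels only."""
    hue = cv2.cvtColor(bgr, cv2.COLOR_BGR2HSV)[:, :, 0]
    vals = hue[roi_mask > 0].astype(np.float64)        # intensities I, K = vals.size
    if vals.size == 0:
        return np.zeros_like(roi_mask)
    best_t, best_var = 0, -1.0
    for t in range(1, 180):                            # OpenCV hue range is 0-179
        bg, fg = vals[vals < t], vals[vals >= t]       # background / fruit stem
        if bg.size == 0 or fg.size == 0:
            continue
        w0, w1 = bg.size / vals.size, fg.size / vals.size
        var = w0 * w1 * (bg.mean() - fg.mean()) ** 2   # sigma^2 = w0*w1*(mu0-mu1)^2
        if var > best_var:
            best_var, best_t = var, t
    return ((hue >= best_t) & (roi_mask > 0)).astype(np.uint8) * 255
```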
2.4. Morphological Perception
In automated harvesting systems, grape picking is primarily performed by locating the picking point and severing the fruit stem [17]. This study performs morphological perception on the preprocessed stem regions and, based on deterministic geometric rules, determines the picking point locations and infers the corresponding cutting axes.
Following the two-stage segmentation process, a binary image is generated with the stem region defined as the foreground. The geometric centroid (x_c, y_c) of the foreground region is then calculated, as described in Equation (6):

x_c = \frac{1}{N} \sum_{i=1}^{N} x_i, \qquad y_c = \frac{1}{N} \sum_{i=1}^{N} y_i \qquad (6)

Here, the target region contains N pixels, each corresponding to a coordinate (x_i, y_i). The geometric centroid of the grape stem is representative, lying approximately equidistant from both the proximal and distal ends, exhibiting favorable mechanical balance. The centroid is chosen as the basis for picking point inference, based on a comprehensive consideration of computational stability, structural rationality, and harvesting safety, effectively reducing the risk of fruit damage and the entanglement of the robotic arms in the vines. Additionally, the morphological skeleton of the foreground region is extracted using the medial axis transform (MAT), which captures the intrinsic topological and geometrical structure of the stem [35]. In image processing, the discrete form of the MAT is applied to binary images, with its definition given in Equation (7):

r(p) = \min_{q \in \partial\Omega} \lVert p - q \rVert, \qquad S = \{\, p \in F \mid r(p) \ge r(p') \ \ \forall p' \in \mathcal{N}(p) \,\} \qquad (7)

where F denotes the foreground region of the binary image, ∂Ω represents the boundary set, determined by foreground pixels adjacent to background pixels, ||·|| denotes the Euclidean norm, and 𝒩(p) is the local pixel neighborhood of p. The Euclidean distance transform (EDT) is applied to the foreground region, assigning each pixel its minimum Euclidean distance to ∂Ω [36]. Local maxima within the resulting distance map are subsequently identified as skeleton points p, forming the set S. The corresponding value r(p) represents the radius of the largest inscribed circle centered at skeleton point p.
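A minimal sketch of the centroid computation (Equation (6)) and medial axis extraction (Equation (7)) using scikit-image, which returns both the skeleton points and the EDT values r(p):

```python
import numpy as np
from skimage.morphology import medial_axis

def centroid(stem_mask: np.ndarray) -> tuple[float, float]:
    """Geometric centroid (x_c, y_c) of the stem foreground, per Equation (6)."""
    ys, xs = np.nonzero(stem_mask > 0)
    return float(xs.mean()), float(ys.mean())

def stem_skeleton(stem_mask: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Medial axis transform of the stem region: skeleton points p in S and the
    EDT values r(p), the radii of the largest inscribed circles (Equation (7))."""
    skel, dist = medial_axis(stem_mask > 0, return_distance=True)
    ys, xs = np.nonzero(skel)
    return np.column_stack([xs, ys]), dist[ys, xs]
```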
To mitigate the issue of spurious branches that may arise in the morphological skeleton due to excessive sampling density or irregular stem boundaries, an adaptive sampling interval is introduced. This parameter is dynamically adjusted, based on the estimated diameter of individual fruit stems, enabling effective control of the density of uniformly spaced skeleton points. The sampled points, denoted as ps, are connected in order to reconstruct a simplified and stable morphological skeleton path for the fruit stem. This skeleton provides structural stability and topological consistency, which serves as a reliable geometric basis for subsequent picking point localization.
Based on the extracted centroid and the morphological skeleton, the sampled skeleton point p_s with the minimum Euclidean distance to the centroid is selected and designated as the picking point T. This strategy ensures that the picking point is strictly located within the fruit stem region, while maintaining spatial proximity to the centroid and preserving its geometric significance.
The cutting axis of the stem is derived from the local geometric structure of the morphological skeleton and serves as a structural reference for guiding the cutting direction of the robotic end-effector. During the cutting operation, the end-effector adjusts its orientation to align with this axis, enabling the execution of a minimal-area cross-sectional cut on the fruit stem. To determine the cutting axis, two sampled skeleton points, p1 and p2, which are immediately adjacent to T along the skeleton path, are selected. The line segment connecting p1 and p2 approximates the local segment of the morphological skeleton near T. The normal vector of this segment is then calculated, and a straight line passing through T along this normal direction is defined as the final cutting axis of the stem.
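Picking point selection and cutting-axis inference can then be sketched as follows, assuming the skeleton points have already been ordered along the skeleton path and that the sampling interval step has been derived from the estimated stem diameter:

```python
import numpy as np

def picking_point_and_axis(skel_pts: np.ndarray, cen: tuple[float, float],
                           step: int) -> tuple[np.ndarray, np.ndarray]:
    """Select the picking point T as the sampled skeleton point nearest the
    centroid, then take the normal of the local segment (p1, p2) as the
    cutting-axis direction."""
    sampled = skel_pts[::step]                          # uniformly spaced points p_s
    dists = np.linalg.norm(sampled - np.asarray(cen), axis=1)
    i = int(dists.argmin())
    T = sampled[i]
    p1 = sampled[max(i - 1, 0)]                         # neighbors of T on the path
    p2 = sampled[min(i + 1, len(sampled) - 1)]
    tangent = (p2 - p1).astype(np.float64)
    tangent /= np.linalg.norm(tangent) + 1e-9           # local skeleton direction
    normal = np.array([-tangent[1], tangent[0]])        # 90-degree rotation
    return T, normal                          # cutting axis: line through T along normal
```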
Figure 6 illustrates three representative morphological patterns of grape stems. For each pattern, the proposed method is employed to infer both the picking point and the cutting axis.
As shown in Figure 6, the proposed strategy successfully localizes the picking point within the stem region in all three morphological scenarios. Additionally, the inferred cutting axis remains approximately perpendicular to the local morphological skeleton of the stem. These results demonstrate the feasibility of the proposed approach, highlighting its robustness in handling stems with complex geometries, as commonly encountered in real vineyard environments. This strategy serves as a foundation for generating spatial coordinates and guiding the motion planning of the end-effector. Accurate localization of the picking point and cutting axis contributes to efficient and precise harvesting operations, significantly enhancing fruit quality and ensuring the stable performance of the automated harvesting system.
2.5. Evaluation Metrics
To comprehensively evaluate the segmentation performance of the proposed model, multiple quantitative metrics were employed. For segmentation accuracy evaluation, AP@0.5 was used to evaluate segmentation performance for individual categories, while mAP@0.5, calculated as the mean of AP@0.5 across all classes, was adopted to evaluate overall segmentation performance. Here, @0.5 denotes an intersection over union (IoU) threshold that is greater than or equal to 0.5. Considering the relatively balanced distribution of samples in this instance segmentation task, macro-averaging (macro) was employed to calculate the multi-class evaluation metrics. To provide a more objective and comprehensive reflection of the overall precision, the specific definition of macro-precision is given in Equation (8), where N represents the number of classes and Precision_i corresponds to the precision of class i:

\text{Precision}_{\text{macro}} = \frac{1}{N} \sum_{i=1}^{N} \text{Precision}_i \qquad (8)
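As a worked example of Equation (8), with illustrative per-class precision values:

```python
import numpy as np

# Equation (8) with illustrative precisions for the "fruit" and "stem" classes.
per_class_precision = np.array([0.91, 0.82])
macro_precision = per_class_precision.mean()    # unweighted mean over classes
print(f"macro-precision = {macro_precision:.4f}")
```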
In addition, the model’s efficiency and suitability for real-time or embedded deployment were evaluated using three key indicators: parameters, which reflect the total count of trainable weights; floating point operations (FLOPs), which quantify the computational complexity; and detection time, defined as the average inference time per image during the testing. These metrics collectively provide a comprehensive evaluation of the model’s lightweight properties and inference speed.
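A sketch of how the parameter count and per-image detection time might be measured for any of the compared models; FLOPs require an external profiler (e.g., thop or ptflops), and the input resolution is an assumption.

```python
import time
import torch

def profile(model: torch.nn.Module, img_size: int = 640, runs: int = 50) -> None:
    """Report trainable-parameter count and mean per-image inference time."""
    params = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print(f"parameters: {params / 1e6:.2f} M")

    model.eval()
    x = torch.randn(1, 3, img_size, img_size)
    with torch.no_grad():
        for _ in range(5):                      # warm-up iterations
            model(x)
        t0 = time.perf_counter()
        for _ in range(runs):
            model(x)
    print(f"detection time: {(time.perf_counter() - t0) / runs * 1e3:.2f} ms/image")
```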
5. Discussion
Deep learning and image processing have emerged as foundational technologies that can be extensively applied in modern agricultural practices [11,41,42,43,44,45,46]. Although significant progress has been made in grape-picking research, precise localization and the morphological perception of fruit stems remain major challenges. Some studies have primarily concentrated on fruit clusters while neglecting the fruit stem itself, relying solely on the spatial position of the cluster to perform indirect and ambiguous inferences, based on generalized morphological or geometric principles [16,17,18,20]. However, given the inherent complexity of vineyard environments and the morphological variability of grape growth, such inference strategies are inherently unstable and limited in their accuracy.
Fruit stem recognition techniques have enabled the targeted identification of relevant structures. However, this approach remains inadequate for effectively extracting and utilizing their morphological features. As a result, methods based on bounding box annotations [19,47] and data-driven pattern recognition [48] often lead to ambiguous inferences, lacking stable and interpretable mapping between stem morphology and picking point localization. To address this limitation, some studies have incorporated fruit stem segmentation techniques to extract morphological information [49]. Nevertheless, the effectiveness of these techniques remains constrained due to the inherent properties of the segmentation target. Specifically, single-stage coarse segmentation methods are insufficient to capture the fine-grained morphological features of stems with the accuracy necessary for reliable downstream tasks.
To address these challenges, this study proposes a vision-based information-processing framework for vineyard grape picking that integrates two-stage segmentation with morphological perception. First, the morphological scope of the fruit stem is explicitly defined to reduce the annotation bias arising from subjective labeling. Next, a two-stage segmentation strategy is developed to facilitate more accurate extraction of stem morphological features. Subsequently, morphological perception is performed on the refined stem region, with deterministic geometric rules being used to establish the spatial relationships between morphological features, picking points, and cutting axes. In contrast to conventional approaches, the proposed method directly infers the picking point from the intrinsic morphological attributes of the fruit stem, thereby minimizing the errors from indirect inference and external factors. Furthermore, the framework ensures that the identified picking point lies strictly within the stem region, effectively preventing a spatial mismatch between the predicted point and the actual stem structure. Visualization analysis confirms that the proposed information-processing framework exhibits strong feasibility, adaptability, and accuracy under real-world vineyard environments, including complex lighting conditions, irregular stem morphologies, and diverse fruit cluster spatial distributions.
Although the proposed model demonstrates competitive performance, there remains significant scope for further optimization. Future work will focus on achieving a better balance between accuracy and model lightweighting to facilitate real-time deployment. In addition, particular attention will be directed toward improving the model's generalizability and robustness under diverse vineyard conditions, such as variable illumination, occlusion, and cultivar differences. Moreover, the proposed framework incorporates multi-module processing, including data transmission, caching, parallel computation, and mapping, which inevitably increases the system complexity. Subsequent research will, therefore, evaluate computational efficiency and resource allocation on resource-constrained edge devices, as well as explore hardware-oriented deployment strategies. Collectively, these efforts are expected to enable more effective validation and adaptation of the proposed vision system in real-world vineyard environments. Finally, while the proposed method enables the effective extraction and utilization of two-dimensional (2D) spatial image information, it lacks integration with 3D cues such as depth, spatial coordinates, or point cloud data. This limitation restricts the system's ability to comprehensively perceive the spatial configuration of fruit stems and their surrounding structures. Future research will focus on extending the current 2D morphological perception framework by integrating 3D visual technologies. Notably, the framework developed in this study generates a set of precise and highly interpretable secondary morphological outputs (e.g., morphological skeletons, picking points, and cutting axes), which provide a promising foundation for a novel 3D modeling approach. Instead of explicitly reconstructing the full 3D geometry of the fruit cluster and stem, the proposed direction involves generating 3D stem skeletons and cutting planes by fusing multi-view secondary morphological information. This strategy significantly reduces the complexity of both modeling and inference, thereby enhancing its suitability for real-time applications and deployment in resource-constrained environments.
In conclusion, this study introduces an effective visual information processing framework to support planning and decision-making in vineyard grape harvesting. Furthermore, the proposed approach exhibits significant potential for broader application and seamless integration with 3D perception systems. These advancements collectively contribute to the intelligent development of fruit-harvesting techniques.
6. Conclusions
This study presents a visual information processing framework for vineyard grape picking, integrating a two-stage fruit stem segmentation strategy, with morphological perception applied to the finely segmented regions.
The proposed two-stage segmentation strategy consists of deep-learning-based coarse segmentation, followed by rule-based fine segmentation. The coarse segmentation employs an improved YOLOv8s-seg model with two key architectural enhancements. First, a novel DDFAM is introduced to improve the model’s capability in capturing morphological features, particularly for small-scale and irregularly shaped targets. Second, an EADHead is designed to enrich edge information in both the regression and segmentation branches, while maintaining a lightweight and compact model structure. Compared to the baseline model, the improved model achieves a 4.27% increase in mAP@0.5, including an 8.32% improvement in AP@0.5 specifically for fruit stem segmentation, while reducing the parameters to 86% of the original model. Compared with mainstream segmentation models, the improved model achieves the optimal overall performance, with the highest segmentation accuracy (mAP@0.5 of 86.75%), a lightweight architecture (10.34 M parameters), and real-time inference speed (10.02 ms per image). Additionally, it effectively reduces false negatives and demonstrates strong robustness against interference from complex vineyard backgrounds.
The fine segmentation stage further refines the morphological delineation of fruit stems. In this study, an improved OTSU thresholding algorithm is proposed, operating on the hue channel of the HSV color space. By decoupling the chromatic information from luminance, this approach significantly improves the robustness of thresholding under complex lighting conditions. Visual analysis confirms that the proposed method yields more accurate and sharper segmentation outcomes, effectively reducing the mis-segmentation caused by light sensitivity.
Morphological perception is subsequently performed on the preprocessed stem regions, including the calculation of centroid coordinates and the extraction of morphological skeletons using the MAT. The spatial relationships among the morphological features, picking points, and cutting axes are established using deterministic geometric rules. The proposed method ensures that the final picking point is strictly located within the stem region and that the cutting axis is aligned orthogonally to the local extension direction of the stem at the picking point. Visualization results confirm its feasibility and adaptability in real-world vineyard conditions.
The proposed visual information-processing framework enables the digitization of fruit-picking behavior and effectively supports the planning and decision-making processes of autonomous harvesting systems. By providing a robust technical foundation for vision-driven grape picking, this framework facilitates the advancement of intelligent, non-destructive, and high-efficiency fruit harvesting technologies.