Article

A Picking Point Localization Method for Table Grapes Based on PGSS-YOLOv11s and Morphological Strategies

1 School of Communication and Information Engineering, Xi’an University of Posts and Telecommunications, 618 Chang’an West St., Xi’an 710121, China
2 Shaanxi Key Laboratory of Information Communication Network and Security, Xi’an University of Posts and Telecommunications, 618 Chang’an West St., Xi’an 710121, China
* Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Agriculture 2025, 15(15), 1622; https://doi.org/10.3390/agriculture15151622
Submission received: 18 June 2025 / Revised: 22 July 2025 / Accepted: 23 July 2025 / Published: 26 July 2025
(This article belongs to the Section Artificial Intelligence and Digital Agriculture)

Abstract

During the automated picking of table grapes, the automatic recognition and segmentation of grape pedicels, along with the positioning of picking points, are vital components for all the following operations of the harvesting robot. In the actual scene of a grape plantation, however, it is extremely difficult to accurately and efficiently identify and segment grape pedicels and then reliably locate the picking points. This is attributable to the low distinguishability between grape pedicels and the surrounding environment such as branches, as well as the impacts of other conditions like weather, lighting, and occlusion, coupled with the requirements for model deployment on edge devices with limited computing resources. To address these issues, this study proposes a novel picking point localization method for table grapes based on an instance segmentation network called Progressive Global-Local Structure-Sensitive Segmentation (PGSS-YOLOv11s) and a simple combination strategy of morphological operators. More specifically, the network PGSS-YOLOv11s is composed of the original backbone of YOLOv11s-seg, a spatial feature aggregation module (SFAM), an adaptive feature fusion module (AFFM), and a detail-enhanced convolutional shared detection head (DE-SCSH). The PGSS-YOLOv11s was trained with a new grape segmentation dataset called Grape-⊥, which includes 4455 grape pixel-level instances with the annotation of ⊥-shaped regions. After the PGSS-YOLOv11s segments the ⊥-shaped regions of grapes, morphological operations such as erosion, dilation, and skeletonization are combined to effectively extract grape pedicels and locate picking points. Finally, several experiments have been conducted to confirm the validity, effectiveness, and superiority of the proposed method. Compared with other state-of-the-art models, the main metrics $F_1$ score and mask mAP@0.5 of the PGSS-YOLOv11s reached 94.6% and 95.2% on the Grape-⊥ dataset, as well as 85.4% and 90.0% on the Winegrape dataset. Multi-scenario tests indicated that the success rate of positioning the picking points reached up to 89.44%. In orchards, real-time tests on the edge device demonstrated the practical performance of our method. Nevertheless, for grapes with short or occluded pedicels, the designed morphological algorithm sometimes failed to calculate picking points. In future work, we will enrich the grape dataset by collecting images under different lighting conditions, from various shooting angles, and including more grape varieties to improve the method’s generalization performance.

1. Introduction

In the process of automatic grape harvesting, the efficient and accurate positioning of grape picking points plays an important role in all subsequent operations of the harvesting robot. However, the uncertain factors in vineyards under natural circumstances cause great disruptions to the accurate identification of grape pedicels and the optimal positioning of picking points. These factors include variable lighting conditions, occlusions by leaves and branches, and morphological differences in pedicel structures. On the one hand, the lighting conditions in the orchard are remarkably uneven, and the fruits often occlude each other. On the other hand, there are a wide variety of grape varieties, and the pedicels of different varieties display significant differences in external features such as color, texture, and shape. In recent years, numerous algorithm developments have paved the way for the recognition and positioning of the grape picking points. These developments can be summarized into the following two strategies:
Strategy 1: Indirect picking point identification and localization method (referred to as the indirect method for short). The basic principle of the indirect method is to first determine grape cluster or pedicel regions using traditional digital image processing techniques or deep learning models and then calculate the 2D pixel coordinates of pedicel picking points through geometric shape analysis. Early indirect methods [1,2,3,4,5] primarily relied on feature extraction of color, texture, and shape from traditional digital image processing combined with geometric analysis algorithms for grape clusters or pedicels. In recent years, with the development of artificial intelligence, deep learning-based detection methods for grape clusters or pedicels have gradually become a research focus in intelligent grape harvesting. For example, Santos et al. [6] employed Mask R-CNN for grape cluster detection, leveraging instance segmentation to generate high-quality masks that precisely define regions of interest (ROI) for accurate picking point localization in complex orchard environments. Kalampokas et al. [7] innovatively transformed grape pedicel localization into a pixel-level distance regression task, enabling their proposed RegCNN model to achieve end-to-end segmentation and precise localization by learning the geometric distance from each pixel to the nearest pedicel. Ning et al. [8] adopted a cascade strategy of ‘coarse segmentation-fine optimization’, performing coarse segmentation of grape pedicels using an improved Mask R-CNN, achieving fine segmentation by combining color segmentation and region growing algorithms, and finally determining picking points through centroid calculation and geometric constraints. Similarly, Shen et al. [9] developed a multi-scale adaptive MSA-YOLO model for pixel-level segmentation of grape pedicels, significantly improving segmentation accuracy and real-time detection performance under complex backgrounds and variable lighting. Further, some researchers [10,11,12,13,14,15,16,17] have developed multiple similar studies, which leveraged improved YOLO networks for real-time grape cluster recognition and localization in complex orchard environments. Notably, Wang et al. [15] co-annotated grape clusters and pedicels within the same bounding box for model training, whereas Zhu et al. [13] and Zhao et al. [16] adopted a different approach: classifying grape clusters and pedicels as separate annotation categories. In the latest research, Zhu et al. [18] proposed a YOLOv8n-DWF network and a division rule of grape stems for the table grape picking robot to realize the detection of table grapes and the localization of picking points. The success rate of the picking point localization reached 88.24% within 5 pixels of error. Similarly, Li et al. [19] proposed an improved YOLO11-GS, which can simultaneously predict grape clusters with bounding boxes and grape pedicels with oriented bounding boxes. And the center point of the oriented bounding box can be directly extracted to locate the grape picking points. Analysis of these existing studies indicates that this strategy has emerged as a critical approach to integrate the powerful feature extraction capability of deep learning with the spatial reasoning ability of geometric analysis for enhancing the robustness of picking point localization.
Strategy 2: Direct picking point identification and localization method (referred to as the direct method for short). The direct method utilizes deep learning to identify grape clusters or pedicels from grape images while also directly generating the 2D pixel coordinates of picking points. For example, Zhao et al. [20] explored the lightweight end-to-end model YOLO-GP, which improves the anchor box generation strategy and designs a new loss function to simultaneously predict grape cluster detection and 2D picking point coordinates. Further, Xu et al. [21] developed the YOLOv4-SE model via multi-modal fusion of RGB and depth images, addressing occlusion challenges and enhancing picking point localization accuracy. This study also first validated the engineering practicability of the direct picking point identification and localization method in dynamic operation scenarios. Wu et al. [22] proposed the Ghost-HRNet model, adopting a top-down keypoint detection framework to realize the joint prediction of grape bunch stems and picking point coordinates. This approach enables efficient processing of synchronous localization tasks in scenes with multiple overlapping grape clusters. To solve the problem of simultaneous detection of grape clusters and picking points, Chen et al. [23] proposed an improved YOLOv8-GP, which is an end-to-end model integrating object detection and one-keypoint detection. This model may ensure that the localization error of grape pedicel picking points is controlled within 30 pixels. Similarly, Jiang et al. [24] also presented a trellis grape stem detection model using YOLOv8n-GP based on YOLOv8n-pose, which was trained on grape image datasets, including the bounding boxes of grape clusters and three keypoints of stems. With the middle keypoint of the stem as the picking point, its mAP$_{kp}$ reached 95.4%.
Despite their achievements, the aforementioned techniques are not trouble-free, and their limitations include the following:
(1)
The annotation formats of existing grape datasets do not fully consider the interconnected relationship between pedicel parts and grape cluster parts. For example, some only annotate grape clusters using bounding boxes [12]; some perform pixel-wise annotation on grape pedicels [8]; others treat pedicel parts or pedicel key points and grape cluster parts as two independent targets [25]. Overall, these existing annotation methods fail to recognize that pedicels and grape clusters form an organic connected entity. For CNN-based recognition, however, the two can mutually provide valuable feature information in terms of color, texture, and structural characteristics. Therefore, how to design a reasonable annotation format remains a problem worth exploring for identifying and locating table grape picking points.
(2)
Although existing deep learning models have significantly promoted intelligent grape detection, there is still substantial room for improvement in the accuracy of picking point positioning. Currently, geometric positioning algorithms in indirect methods require the establishment of dedicated geometric models, leading to complex calculation processes. While Li et al. [19] and Xu et al. [21] directly use the center coordinates of the pedicel detection box as the picking point coordinates, simplifying the calculation process, both this solution and direct methods share common drawbacks: low reliability of picking points and large errors. As elaborated earlier, there is a deviation error between the actual picking point and the predicted picking point.
Obviously, these problems hinder improvements in accuracy, efficiency, and reliability, and further research is needed. Inspired by the existing methods, this work aims to put forward a novel technical solution for the abovementioned challenges through innovative annotation strategies, a multi-module fusion architecture, and morphological feature processing. Specifically, Figure 1 depicts the overall workflow of this study, with the contributions described as follows:
(1)
Establishing a new grape segmentation dataset called Grape-⊥ with a total of 1576 grape images and 4455 ⊥-shaped pixel-level instances, including 3375 Grps and 1080 GrpWBs. The ⊥-shaped grape annotation fully considers the botanical relationship between the grape cluster and pedicel while highlighting texture and shape differences.
(2)
An instance segmentation network called PGSS-YOLOv11s is proposed to segment the ⊥-shaped regions of grapes. The network PGSS-YOLOv11s is composed of an original backbone of the YOLOv11s-seg, SFAM, AFFM, and DE-SCSH. The SFAM improves the feature description capacity by means of global-to-local spatial feature aggregation. The AFFM adopts bidirectional linear superposition connection, adding more feature fusion paths to achieve adaptive fusion of multi-scale features. The DE-SCSH boosts the model’s refined segmentation ability and cuts down on the number of learnable weights in the network through the structure of detail enhancement convolution and shared weights.
(3)
A simple and efficacious combination strategy of morphological operators is designed to locate the picking points on the segmented grape ⊥-shaped regions. The presented strategy does not necessitate complex geometric shape computations, and there is no pixel distance error either.

2. Materials and Methods

2.1. Grape-⊥ Dataset

2.1.1. Image Acquisition

In this research, the original grape images were gathered from the Chengxi Grape Demonstration Base located at Bailu Yuan, Xi’an City, Shaanxi Province, China (geographical coordinates: 34.2147° N, 109.0965° E). Figure 2a depicts the growth state of grapes in the natural environment. Usually, these grapes are mainly cultivated on trellises, which are supported by cement columns in a T-shaped structure. The trellis stands at a height of about 1.8 m, and the grape plants are arranged in double rows with a spacing of approximately 2.0 m between plants and 4.0 m between rows. There are various grape varieties planted here. Some grapes are also wrapped in bags, as depicted in Figure 2c. Generally, the grape clusters are about 30 cm in length with a maximum diameter of around 18 cm, and they grow in a natural downward-hanging manner at the bottom of the trellis. It is important to note that to ensure the fruit-setting quality of table grapes, after fruit thinning, the grape pedicels often stay at a length of about 12 cm and have a diameter of approximately 5 mm.
On 21 July 2023, we carried out manual image collection using a Nikon D70 camera (Nikon Corporation, Tokyo, Japan). During the collection process, the sampling distance was kept between 50 and 70 cm. A total of 1576 images were taken, covering varieties such as Shine Muscat, White Seedless, Red Globe, and bagged grapes. These images, with a resolution of 1920 × 1440 pixels, were captured under three different natural light conditions: normal, relatively dark, and strong.

2.1.2. Image Annotation with the Characteristics ‘⊥’

To achieve precise recognition and localization of grape pedicels, we proposed a novel pixel-level instance annotation format that explicitly captures the relationship between grape clusters and pedicels. As shown in Figure 3a,b, the proposed annotation format resembles a perpendicular symbol ‘⊥’, where we annotate the entire grape pedicel and a small portion of the connected grape berries or paper bags. Specifically, for the unbagged grape category, the grape cluster portion covers approximately 1–3 layers of grapes. For the wrapped grape category, the paper bag portion covers only the bag’s closure part, where it connects to the pedicel. Typically, a single image holds multiple grape clusters, which may be from the current row and the adjacent ones, as depicted in Figure 3c. Evidently, factors such as occlusion and long viewing distance make it difficult for robots to pick the grapes in the adjacent rows. Thus, from the perspective of robotic harvestability, in this research, we only carried out classified labeling on the grapes that the robot manipulator could reach. One category consists of unbagged grapes, with the label ‘Grp’, and the other is wrapped in bags, labeled ‘GrpWB’.
During the actual labeling process, we utilized the open-source tool Labelme. Based on the proposed ⊥-shaped annotation format, we performed pixel-level annotations on the grape images through polygon operations. In order to better understand the characteristics, distribution, and potential problems of our grape dataset, named Grape-⊥ dataset, a graphical visualization is displayed from four aspects, as shown in Figure 4, including label distribution, the aspect ratios of bounding boxes, the distribution of the center points of the bounding box’s position, and the scales of the object boxes in the image. There are a total of 4455 grape instances, including 3375 Grps and 1080 GrpWBs, as shown in Figure 4a. It shows that our dataset has a relatively balanced distribution of instance categories. Figure 4b shows that the aspect ratios of most grape instances are less than 1. This is consistent with the actual situation, as most grapes grown in natural orchards are taller than wider. Meanwhile, most grape targets are regularly distributed in the middle position of an image along the y direction, as shown in Figure 4c. Next, an optimistic observation is that 95% of grape instances occupy more than 10% of the entire image height. Moreover, fewer grapes are affected by interference from image acquisition conditions, such as background and occlusion, which makes grape detection easier and promotes the detection performance of the network. During the training process, we divided the dataset into a training set and a validation set at a ratio of 8:2. In the test phase, we evaluated the final model performance on a separate test set consisting of 200 images, which were not involved in any training or validation stage.

2.2. Methods

2.2.1. Network Architecture

The latest YOLOv11s-seg was selected as our benchmark framework, since it has fewer parameters compared with the previous YOLO versions and can maintain a certain degree of accuracy in object segmentation tasks. Figure 5 shows the overall architecture of our network, which is named PGSS-YOLOv11s. First, PGSS-YOLOv11s uses the original backbone of YOLOv11s-seg for the task of feature extraction. Second, two specially designed modules are added into the neck of YOLOv11s-seg: an SFAM is designed to capture both global and local spatial features, and an AFFM is proposed to improve the capability of multiscale feature fusion. Finally, these fused features are fed to a novel segmentation head named DE-SCSH, which predicts class labels and bounding boxes and produces a pixel-wise binary segmentation mask for each grape instance. Details of these modules are introduced in the following subsections.

2.2.2. Spatial Feature Aggregation Module

Due to the complexity of the vineyard environment, misjudgements are prone to occur in the segmentation task between tender branches and grape pedicels, which share similar shape features. Meanwhile, the green leaves in the background also cause significant interference in the feature extraction of grape pedicels. In addition, the feature extraction capability of the backbone of YOLOv11s-seg is also limited. We analyzed the feature maps output from stages $F_2$, $F_3$, $F_4$, and $F_5$, as shown in Figure 6. These shallow feature maps (Figure 6b,c) contain rich edge and texture features, while the deeper feature maps (Figure 6d,e) display some important shape and structural features. However, upon closer observation, the expected grape ⊥-shaped features were not highlighted, and some unexpected background feature information was mixed up with local pedicel and grape fruit features.
Neuroscience experiments [26] indicate that the attention mechanism strengthens the information related to the optimization goal and suppresses irrelevant information. In order to obtain more discriminative features for enhancing the significance difference of the ⊥-shaped regions, we proposed to add a global-to-local spatial aggregation module (GLSA) [27] after stages $F_3$, $F_4$, and $F_5$ of the backbone, respectively. Figure 7 shows the whole structure of GLSA.
Firstly, the feature maps $\{F_i \mid i \in \{3, 4, 5\}\}$ with $C_i$ ($C_3 = 256$, $C_4 = 256$, $C_5 = 512$) channels from stages $F_3$, $F_4$, and $F_5$ of the backbone were split evenly into two feature map groups $F_i^1$ and $F_i^2$, respectively. Secondly, these feature map groups were separately fed into two attention units, one global spatial attention (GSA) module and one local spatial attention (LSA) module. Finally, the outputs of those two attention units were concatenated and fused by a 1 × 1 convolution layer $C_{1 \times 1}$. The process can be formulated as
$$F_i^1, F_i^2 = \mathrm{Split}(F_i) \tag{1}$$
$$X_i = C_{1 \times 1}\left(\mathrm{Concat}\left(\mathrm{GSA}(F_i^1), \mathrm{LSA}(F_i^2)\right)\right) \tag{2}$$
where $X_3 \in \mathbb{R}^{h/8 \times w/8 \times 128}$, $X_4 \in \mathbb{R}^{h/16 \times w/16 \times 128}$, and $X_5 \in \mathbb{R}^{h/32 \times w/32 \times 128}$ denote the output aggregation features. In the GSA module, we generate the global spatial attention map $\mathrm{GSA}(F_i^1)$ with $F_i^1$ as input, defined as follows:
$$\mathrm{Att}_G(F_i^1) = \mathrm{Softmax}\left(\mathrm{Transpose}\left(C_{1 \times 1}(F_i^1)\right)\right) \tag{3}$$
$$\mathrm{GSA}(F_i^1) = \mathrm{MLP}\left(\mathrm{Att}_G(F_i^1) \cdot F_i^1\right) + F_i^1 \tag{4}$$
where $\mathrm{Att}_G(\cdot)$ is the attention operation, which consists of a 1 × 1 convolution layer $C_{1 \times 1}$, a transpose layer, and a softmax layer. $\mathrm{MLP}(\cdot)$ consists of two fully connected layers with a ReLU nonlinearity and a normalization layer. The first layer of the MLP transforms its input to a higher-dimensional space with an expansion ratio of two, while the second layer restores the dimension to be the same as the input. In the LSA module, we compute the local spatial attention response $\mathrm{LSA}(F_i^2)$ with $F_i^2$ as input, defined as follows:
$$\mathrm{Att}_L(F_i^2) = \sigma\left(C_{1 \times 1}\left(F_c(F_i^2) + F_i^2\right)\right) \tag{5}$$
$$\mathrm{LSA}(F_i^2) = \mathrm{Att}_L(F_i^2) \cdot F_i^2 + F_i^2 \tag{6}$$
where $\mathrm{Att}_L(\cdot)$ is the local attention operation, which consists of an $F_c(\cdot)$ operation, an add layer, a 1 × 1 convolution layer $C_{1 \times 1}$, and a sigmoid activation layer $\sigma(\cdot)$. $F_c(\cdot)$ denotes cascading three 1 × 1 convolution layers and 3 × 3 depth-wise convolution layers.
The GLSA is embedded between the backbone and the neck, which helps the network zero in on the necessary spatial feature information. The GSA module emphasizes the long-range relationship of each pixel in the spatial space, such as the pixels at the grape pedicels and grape clusters, and can be used as a supplement to local spatial attention. The LSA module effectively extracts the local features of the region of interest in the spatial dimension of the given feature map, such as thin and long grape pedicels. In this way, the network can effectively concentrate on the features of the ⊥ shape and enhance the feature differences between the target to be segmented and the background.
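To illustrate Formulas (1)–(6) concretely, the following is a minimal PyTorch sketch of a GLSA-style aggregation module. The exact compositions of the MLP, $F_c(\cdot)$, and the attention products follow our reading of [27] rather than any released implementation, and all class names, channel sizes, and the single-map attention design are illustrative assumptions.

```python
import torch
import torch.nn as nn

class GSA(nn.Module):
    """Global spatial attention: softmax over spatial positions, then an MLP residual."""
    def __init__(self, channels, expansion=2):
        super().__init__()
        self.attn_conv = nn.Conv2d(channels, 1, kernel_size=1)        # C_1x1 producing one attention map
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels * expansion),
            nn.ReLU(inplace=True),
            nn.Linear(channels * expansion, channels),
            nn.LayerNorm(channels),
        )

    def forward(self, x):
        b, c, h, w = x.shape
        attn = torch.softmax(self.attn_conv(x).view(b, 1, h * w), dim=-1)   # attention over all positions
        feat = x.view(b, c, h * w)
        pooled = torch.bmm(feat, attn.transpose(1, 2)).squeeze(-1)          # attention-weighted global feature (b, c)
        return x + self.mlp(pooled).view(b, c, 1, 1)                        # residual connection, Formula (4)

class LSA(nn.Module):
    """Local spatial attention: sigmoid gate built from cascaded 1x1 / depth-wise convs."""
    def __init__(self, channels):
        super().__init__()
        self.fc = nn.Sequential(                                            # stand-in for Fc(.)
            nn.Conv2d(channels, channels, 1),
            nn.Conv2d(channels, channels, 3, padding=1, groups=channels),
            nn.Conv2d(channels, channels, 1),
        )
        self.gate = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, x):
        attn = torch.sigmoid(self.gate(self.fc(x) + x))                     # Formula (5)
        return attn * x + x                                                  # Formula (6)

class GLSA(nn.Module):
    """Split channels, apply GSA/LSA to each half, and fuse with a 1x1 conv."""
    def __init__(self, in_channels, out_channels=128):
        super().__init__()
        half = in_channels // 2
        self.gsa, self.lsa = GSA(half), LSA(half)
        self.fuse = nn.Conv2d(in_channels, out_channels, kernel_size=1)

    def forward(self, x):
        f1, f2 = torch.chunk(x, 2, dim=1)                                   # Split(F_i), Formula (1)
        return self.fuse(torch.cat([self.gsa(f1), self.lsa(f2)], dim=1))    # Formula (2)

# x5 = torch.randn(1, 512, 20, 20); print(GLSA(512)(x5).shape)  # -> (1, 128, 20, 20)
```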

2.2.3. Adaptive Feature Fusion Module

The neck of the YOLOv11s-seg, called PANet, adopts a top-down and bottom-up bidirectional feature propagation network [28]. A schematic diagram of the structure is shown in Figure 8a. Many studies [29,30,31] claim that the PANet can facilitate the fusion of high-resolution information with stronger edge and texture features. However, some redundant node layers with no feature fusion give rise to the cost of more parameters and computations. Moreover, the feature fusion operation with a concatenation treats each feature equally, so it is easy to introduce irrelevant or harmful features and affect object information representation.
To address these issues, we designed an adaptive feature fusion module (AFFM) based on BiFPN instead of PANet to propagate object information among different levels and regions. Unlike BiFPN, AFFM adjusts the original BiFPN to receive four input features and accommodate three detection and segmentation heads. The structure of AFFM is shown in Figure 8b. The input of AFFM consists of the low-level feature maps $X_2$ (160 × 160), processed by the backbone, and $X_3$ (80 × 80), processed by GLSA, as well as the high-level feature maps $X_4$ (40 × 40) and $X_5$ (20 × 20), both processed by GLSA.
In the top-down feature propagation path of AFFM, $X_5$ is upsampled to generate a feature map with the same scale as $X_4$, and the $F_n(\cdot)$ operation is used to fuse the upsampled $X_5$ and $X_4$ to obtain $X_4'$. The above operations are repeated on $X_4'$ to create a new feature map $X_3'$. We formulate this process as
$$X_4' = F_n\left(f_{up}(X_5), X_4\right) \tag{7}$$
$$X_3' = F_n\left(f_{up}(X_4'), X_3\right) \tag{8}$$
where $f_{up}$ is the upsampling layer. $F_n(\cdot)$ is the node operation, which consists of an AWFF followed by a C3k2 in sequence for fusing different feature maps, as shown in Figure 8c. Specifically, AWFF refers to linearly aggregating a set of feature maps $f_1, f_2, \ldots, f_k$ with identical dimensions and assigning them a group of learnable weights $W_1, W_2, \ldots, W_k$ that scale each feature map. These weights serve as coefficients during the linear computation, ensuring that the output of AWFF maintains the same spatial dimensions as each input feature map. The C3k2 module is a key feature extraction component in YOLOv11s-seg.
Similarly to the top-down path, in the bottom-up feature propagation path of AFFM, the main difference is that the feature map is downsampled using a CBS block with a stride of 2. $X_3''$ is obtained through the node operation on $X_3$, $X_3'$, and the downsampled $X_2$. This node operation can fuse more features without increasing costs or considerably decreasing accuracy. $X_3''$, $X_4''$, and $X_5''$, as the output results of AFFM, are sent to DE-SCSH for ⊥-shaped segmentation. The calculation process of the bottom-up path can be expressed as follows:
$$X_3'' = F_n\left(X_3, X_3', \mathrm{CBS}(X_2, \mathrm{stride}=2)\right) \tag{9}$$
$$X_4'' = F_n\left(X_4, X_4', \mathrm{CBS}(X_3'', \mathrm{stride}=2)\right) \tag{10}$$
$$X_5'' = F_n\left(X_5, \mathrm{CBS}(X_4'', \mathrm{stride}=2)\right) \tag{11}$$
where CBS means 3 × 3 convolution, including batch normalization and SiLU.
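As an illustration of the node operation $F_n(\cdot)$, the sketch below implements a BiFPN-style adaptive weighted feature fusion with learnable, normalized coefficients, assuming ReLU-clipped weights and standing in for the C3k2 block with a plain convolution block; it is an approximation under these assumptions, not the exact AFFM implementation.

```python
import torch
import torch.nn as nn

class AWFF(nn.Module):
    """Linearly combine k same-sized feature maps with learnable, normalized weights."""
    def __init__(self, num_inputs, eps=1e-4):
        super().__init__()
        self.weights = nn.Parameter(torch.ones(num_inputs))   # W_1..W_k, one scalar per input
        self.eps = eps

    def forward(self, feats):
        w = torch.relu(self.weights)                           # keep coefficients non-negative
        w = w / (w.sum() + self.eps)                           # normalize so the output scale stays stable
        return sum(wi * fi for wi, fi in zip(w, feats))        # same spatial size as each input

class FusionNode(nn.Module):
    """F_n(.): AWFF followed by a feature-extraction block (C3k2 in the paper)."""
    def __init__(self, channels, num_inputs):
        super().__init__()
        self.awff = AWFF(num_inputs)
        self.block = nn.Sequential(                            # placeholder for C3k2
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.SiLU(inplace=True),
        )

    def forward(self, feats):
        return self.block(self.awff(feats))

# Top-down step of Formula (7): fuse the upsampled X5 with X4.
# x4, x5 = torch.randn(1, 128, 40, 40), torch.randn(1, 128, 20, 20)
# x4_td = FusionNode(128, 2)([nn.functional.interpolate(x5, scale_factor=2), x4])
```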

2.2.4. Detail Enhancement—Shared Convolution Segmentation Head

The head structure functions as the ‘output layer’ of the YOLO networks, with its core task being to convert feature maps into specific object locations, categories, and confidence scores, directly determining the accuracy and performance of the model. Previous studies have explored various strategies to enhance the performance of the head structure for detection and segmentation tasks. For instance, the decoupled head [32] in YOLOX is a prominent example. Although the decoupled head can effectively resolve conflicts arising from the coupled head’s classification and regression tasks, this design substantially increases the network’s parameter count and fails to fully exploit the high-level features, resulting in diminished detection accuracy and decreased inference speed, particularly when dealing with complex backgrounds and varying target sizes. To tackle these issues, Zhang et al. drew inspiration from the Asymmetric Decouple Head (ADH) [33], incorporating multi-level channel compression to develop a lightweight asymmetric detection head (LADH-Head) [34] for YOLOv5 in the object detection of remote sensing images. The LADH-Head replaces the conventional 3 × 3 convolutions in the three branches of the ADH structure with 3 × 3 depth-wise separable convolutions (DWConvs). While preserving the spatial dimensions of feature maps, it reduces model parameters and enhances detection speed. Similarly, based on YOLOv8-seg model, Zhang et al. [35] proposed a novel lightweight shared convolution segmentation head (LSCSH), which introduces shared convolution and innovatively incorporates multi-scale perception layers. This LSCSH enhances segmentation performance representation, significantly reduces the model’s computational load, and effectively improves the real-time detection and precise segmentation of chip pads. Furthermore, to effectively address the issue of spatial misalignment between the two subtasks in traditional detection heads, Feng et al. [36] designed a Task-aligned Head (T-Head), which offers a better balance between learning task-interactive and task-specific features, as well as a greater flexibility to learn the alignment via a task-aligned predictor. The shared objective across these studies lies in enhancing the network’s adaptability to their specific fields via improvements to the network head architecture. For the segmentation of grape ⊥-shaped regions, however, there are still two unfavorable factors in the original segmentation head of YOLOv11s-seg, as shown in Figure 9a. One is that the original segmentation head module uses a traditional single-scale segmentation structure, which is ineffective at handling multi-scale targets. It segments from only one scale of the feature map, ignoring contributions from other scales and failing to address the issue of inconsistent target scales for each segmentation head. The other is that the original segmentation head struggles to achieve fine-grained segmentation for thin and long targets, such as grape pedicels. The green grape pedicels or grape clusters are very similar to the large green background, so we have observed that the original segmentation head of YOLOv11s-seg is prone to causing blurred or misclassified ⊥-shaped edge segmentation.
To address these two key issues, we proposed a novel segmentation head structure, DE-SCSH, in which we introduced a group normalization convolution (Conv_GN) [37] and a detail-enhanced convolution (DEConv) [38]. The former can greatly improve the localization and classification performance of the segmentation head for the ⊥-shaped regions of grapes. The DEConv can enhance the representation and generalization capacity and can be equivalently converted into a vanilla convolution without extra parameters or computational cost. The structure of DE-SCSH is illustrated in Figure 9b. The core idea of this design is to replace the two conventional convolutions used in the three segmentation heads with two shared 3 × 3 DEConv_GN and three 1 × 1 Conv_GN, as indicated by the green and red parts. Additionally, at the $X_3''$, $X_4''$, and $X_5''$ levels of the improved neck, we replaced the original instance segmentation route with one consisting of an unshared 1 × 1 Conv_GN, a 3 × 3 DEConv_GN, and a conventional convolution, respectively. The new structure of DE-SCSH allows for a lightweight segmentation head that improves fine-grained segmentation while endowing the segmentation head with stronger multi-scale perception capabilities.
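The weight-sharing idea can be sketched as follows in PyTorch, assuming the detail-enhanced convolution (DEConv, [38]) is approximated by a vanilla 3 × 3 convolution and that the channel counts, group number, and prediction branches are illustrative rather than the exact DE-SCSH configuration.

```python
import torch
import torch.nn as nn

def conv_gn(in_ch, out_ch, k=3):
    """Convolution + GroupNorm + SiLU (a Conv_GN-style unit)."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, k, padding=k // 2, bias=False),
        nn.GroupNorm(16, out_ch),
        nn.SiLU(inplace=True),
    )

class SharedSegHead(nn.Module):
    """One pair of 3x3 GN convolutions shared across all three levels, then 1x1 predictors."""
    def __init__(self, channels=128, num_classes=2, num_mask_coeffs=32, reg_max=16):
        super().__init__()
        self.shared_stem = nn.Sequential(conv_gn(channels, channels), conv_gn(channels, channels))
        self.cls_head = nn.Conv2d(channels, num_classes, 1)        # class scores
        self.box_head = nn.Conv2d(channels, 4 * reg_max, 1)        # box regression
        self.msk_head = nn.Conv2d(channels, num_mask_coeffs, 1)    # mask coefficients

    def forward(self, feats):
        outputs = []
        for f in feats:                      # feats: the three AFFM outputs
            h = self.shared_stem(f)          # the same weights are reused at every scale
            outputs.append((self.cls_head(h), self.box_head(h), self.msk_head(h)))
        return outputs

# feats = [torch.randn(1, 128, s, s) for s in (80, 40, 20)]
# outs = SharedSegHead()(feats)  # three (cls, box, mask-coefficient) tuples
```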

2.3. Localization of Picking Points

The localization of grape picking points refers to the process of calculating the pixel coordinates that are conducive to the robot’s cutting operation on the grape pedicel. Based on an extensive study of the literature and on experiments, this work proposes using the segmented images of grape ⊥-shaped regions to calculate the picking points. In contrast with the traditional segmented images of grape clusters and grape pedicels, the segmented images of grape ⊥-shaped regions have obvious morphological and geometric features.
Feature 1: The pedicel region in the ⊥-shaped structure usually presents a vertical strip distribution, which is thin and long. And the number of pixels in each row is extremely limited in the pedicel region.
Feature 2: The contour of the grape cluster region presents an irregular shape resembling a cloud, and there is no discernible pattern. The number of pixels is significantly larger than that of the grape pedicel region.
With the help of the above features, we proposed a simple and reliable method for positioning grape picking points. To facilitate the following elaboration, the binarized image segmented by the PGSS-YOLOv11s is defined as $f(u, v)$. The details are described below.
The first step is to extract the grape cluster part of the ⊥-shaped region, which is denoted as $f_{opened}$. According to the above Features 1 and 2, the pedicel region is first removed from the binarized image $f$ using an erosion operation, and then the grape cluster part $f_{opened}$ is restored using the dilation operator. The erosion kernel and dilation kernel are defined as $B_{erode}$ and $B_{dilate}$, respectively. Mathematically, this process can be described as
$$f_{opened} = (f \ominus B_{erode}) \oplus B_{dilate} \tag{12}$$
where ⊖ denotes the erosion operation, and ⊕ denotes the dilation operation.
The next step is to extract the grape pedicel part of the ⊥-shaped region, which is denoted as $f_{pedicel}$. A subtraction operation between the binarized image $f$ and the grape cluster part $f_{opened}$ is performed, described as $f_{dif} = f - f_{opened}$. Ideally, $f_{dif}$ should be the grape pedicel part $f_{pedicel}$. However, the grape cluster part $f_{opened}$ restored by Formula (12) often deviates from the grape cluster part in the original binarized image $f$, so $f_{dif}$ usually contains some isolated regions with a small proportion of pixels. To address this, connected component analysis is employed to statistically analyse all connected regions in $f_{dif}$ and their corresponding pixel areas, denoted as $f_{dif}(i)$ and $A(i)$, respectively. A threshold $A_0$ is set simultaneously. If $A(i) \le A_0$, then $f_{pedicel}(j)$ is assigned the value of $f_{dif}(i)$. Conversely, if $A(i) > A_0$, then $f_{pedicel}(j)$ is set to 0. So, the refined grape pedicel part $f_{pedicel}$ can be determined in terms of a simple selection rule as follows:
$$f_{pedicel}(j) = \begin{cases} 0, & A(i) > A_0 \\ f_{dif}(i), & A(i) \le A_0 \end{cases}, \quad f_{dif} = f - f_{opened}, \quad i = 1, 2, \ldots, N \tag{13}$$
where $f_{dif}(i)$ represents the i-th isolated region, $f_{pedicel}(j)$ represents the j-th pedicel region, and $N$ is the total number of isolated regions.
The final step is to calculate the pixel coordinates $(u, v)$ of the pedicel picking point. To ensure that the picking points can be accurately located within the grape pedicel region, a skeletonization operation $\mathrm{skelet}(\cdot)$ is first performed on each grape pedicel region $f_{pedicel}(j)$ to obtain the corresponding pedicel skeleton line $l_{pedicel}(j)$. Then, the median operation $\mathrm{median}(\cdot)$ is applied to extract the pixel coordinates corresponding to the median value on each skeleton line. These coordinates are the pixel coordinates of the picking points. This final step can be mathematically described as
$$(u_j, v_j) = \mathrm{median}\left(\mathrm{skelet}\left(f_{pedicel}(j)\right)\right) \tag{14}$$
In this study, the kernel size of the aforementioned dilation and erosion operators was set to 5 × 5, the threshold $A_0$ was 250, and the iteration count of the skeletonization was set to 1. It is very beneficial to visualize the steps of the proposed algorithm with an example of a grape image. In the beginning, the original grape image (Figure 10a) is segmented by our proposed model PGSS-YOLOv11s to obtain the binarized image $f$ of the ⊥-shaped region, which is displayed clearly in Figure 10b. The pedicel region is marked with a red rectangle, and the grape cluster region is marked with a green rectangle. Apparently, they both conform to the descriptions of Feature 1 and Feature 2. After the erosion and dilation operations, the pedicel regions are removed, and the white region is associated with the grape cluster region $f_{opened}$, as illustrated in Figure 10d. The subtraction between Figure 10b,d initially determines the pedicel region $f_{dif}$ in Figure 10e. Note that some small-sized and isolated white regions are strewn next to the grape pedicel, marked with red triangles. Figure 10f shows the refined pedicel region $f_{pedicel}$ obtained with Formula (13). With the help of Formula (14), the pixel coordinates of the picking points can be effectively and automatically located in the pedicel region, as shown in Figure 10h.
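For reference, the pipeline of Formulas (12)–(14) can be sketched with OpenCV and scikit-image as follows, using the 5 × 5 kernel and the area threshold $A_0 = 250$ given above; the way the median skeleton pixel is selected along the pedicel is our interpretation, and function and variable names are illustrative.

```python
import cv2
import numpy as np
from skimage.morphology import skeletonize

def locate_picking_points(mask, kernel_size=5, area_thresh=250):
    """mask: uint8 binary image f(u, v) segmented by the network (255 = foreground)."""
    kernel = np.ones((kernel_size, kernel_size), np.uint8)

    # Step 1 (Formula (12)): opening removes the thin pedicel, keeping the cluster part f_opened.
    eroded = cv2.erode(mask, kernel, iterations=1)
    f_opened = cv2.dilate(eroded, kernel, iterations=1)

    # Step 2 (Formula (13)): subtracting the cluster part leaves candidate pedicel pixels f_dif.
    f_dif = cv2.subtract(mask, f_opened)
    num, labels, stats, _ = cv2.connectedComponentsWithStats(f_dif, connectivity=8)

    points = []
    for i in range(1, num):                              # label 0 is the background
        if stats[i, cv2.CC_STAT_AREA] > area_thresh:     # keep only small regions, A(i) <= A_0
            continue
        pedicel = (labels == i)

        # Step 3 (Formula (14)): skeletonize the pedicel region and take the median skeleton pixel.
        skeleton = skeletonize(pedicel)
        vs, us = np.nonzero(skeleton)                    # row (v) and column (u) coordinates
        if len(vs) == 0:
            continue
        order = np.argsort(vs)                           # order skeleton pixels along the pedicel
        mid = order[len(order) // 2]
        points.append((int(us[mid]), int(vs[mid])))      # picking point (u_j, v_j)
    return points

# binary = cv2.imread("grape_mask.png", cv2.IMREAD_GRAYSCALE)   # hypothetical mask file
# _, binary = cv2.threshold(binary, 127, 255, cv2.THRESH_BINARY)
# print(locate_picking_points(binary))
```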

3. Results

In this section, the experimental settings, including evaluation metrics and the network training implementation details, are first discussed in Section 3.1. Then, the evaluation of grape ⊥-shaped segmentation results, including a comparison with state-of-the-art methods and an ablation study of different modules, is carried out to verify the effectiveness and superiority of the proposed network architecture in Section 3.2. Finally, the evaluation of picking point localization within complex scenarios is presented to provide evidence of the approach’s efficacy.

3.1. Experimental Settings

3.1.1. Evaluation Metrics

To quantitatively evaluate the performance of the modified instance segmentation network PGSS-YOLOv11s, we employed three precision metrics: mask mAP@0.5, mask mAP@0.75, and mask mAP@0.5:0.95. They are all based on segmentation benchmarks. Specifically, mask mAP@0.5 and mask mAP@0.75 denote the mean average precision for mask segmentation tasks at IoU (Intersection over Union) thresholds of 0.50 and 0.75, respectively, while mask mAP@0.5:0.95 calculates the mean average precision across multiple IoU thresholds (from 0.50 to 0.95 in 0.05 increments) for mask-based evaluations. Additionally, the mask mAP@0.5 is also used to evaluate the precision of each category of segmented instances. In ablation studies, Precision (P), Recall (R), and the $F_1$ score are used as additional evaluation metrics to validate the performance of individual modules. Notably, P and R are both based on the accuracy of the masks. The $F_1$ score integrates the model’s P and R via their harmonic mean, providing a comprehensive performance measure. Additionally, we evaluate model efficiency using five metrics: Params, Flops, FPS, inference time, and model size (MS), which collectively assess the model’s segmentation efficiency.
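For reference, the harmonic-mean form of the $F_1$ score mentioned above is the standard one:

$$F_1 = \frac{2 \times P \times R}{P + R}$$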

3.1.2. Experimental Platform and Details

All experiments were conducted on a computer equipped with Ubuntu 18.04, an Intel Xeon Gold 5115 CPU @ 2.40 GHz (Intel Corporation, Santa Clara, CA, USA), and an NVIDIA Tesla T4 GPU (NVIDIA Corporation, Santa Clara, CA, USA) with 32 GB of RAM. We implemented the proposed PGSS-YOLOv11s network with Python 3.8.13 and PyTorch 1.12.0. To speed up the convergence of the deep neural network, CUDA 11.7 and cuDNN 8.0 were configured. For the training process, the following training options were defined: 500 epochs, a batch size of 16, an initial learning rate of 0.01, a weight decay of 0.0005, and a momentum of 0.937. The learning rate was dynamically adjusted through PyTorch’s cosine annealing warm restarts scheduler, configured with an initial cycle length of 50 and a cycle doubling factor of 2. Meanwhile, stochastic gradient descent (SGD) was selected to update the parameters of the PGSS-YOLOv11s network. To further stabilize training, a warm-up phase of three epochs was employed, with a warm-up momentum of 0.8 and a warm-up bias learning rate of 0.1, complementing the cosine learning rate schedule. No early stopping mechanism was used during the training process.
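A sketch of this optimizer and scheduler configuration, assembled from the standard PyTorch SGD and CosineAnnealingWarmRestarts classes, is given below; the placeholder model and the commented-out training call are illustrative assumptions, not the actual training script.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 16, 3)  # placeholder module standing in for the PGSS-YOLOv11s network

optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.01,              # initial learning rate
    momentum=0.937,
    weight_decay=0.0005,
)
scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(
    optimizer,
    T_0=50,               # initial cycle length (epochs)
    T_mult=2,             # cycle doubling factor
)

for epoch in range(500):
    # train_one_epoch(model, optimizer)  # hypothetical call to one epoch of training
    scheduler.step()                     # advance the cosine warm-restart schedule
```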
In order to enrich the diversity of images in the dataset, prevent the model from overfitting, and make the features learned by the model more robust and generalizable, we introduced the Mosaic [39] and Mixup [40] data augmentation techniques. For the Mosaic data augmentation, first, four images are randomly selected from the dataset. Then, each image is randomly cropped in terms of scale or shape. Subsequently, the four cropped subimages are stitched together into a new image in a randomly arranged manner. The size of this new image remains the same as that of the original input image, and random adjustments of color, brightness, and contrast are made to the stitched image. As for the Mixup data augmentation, first, two images and the corresponding labels are randomly selected from the dataset. Then, a linear interpolation operation is performed on these two images in combination with the randomly generated mixing coefficient so as to generate a mixed image and the corresponding mixed label. These two data augmentation methods are applied in a cascaded manner during each epoch of the network training. Both of them can simulate the growth situations of grapes in a natural orchard environment, such as complex scenarios like the overlapping of fruits, the occlusion between fruits and leaves, and the color characteristics of grapes under different lighting conditions. It should be further noted that the detailed operational parameters involved in these data augmentation techniques are all randomly generated by the torch.random function or the random number generators provided in PyTorch’s data augmentation modules.
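A minimal sketch of the Mixup step is given below; the Beta-distribution parameter used to draw the mixing coefficient is an illustrative assumption, since the actual coefficients are produced by the framework’s random number generators.

```python
import numpy as np

def mixup(img1, img2, lam=None, alpha=32.0):
    """img1, img2: float arrays of identical shape; returns the mixed image and lambda."""
    if lam is None:
        lam = np.random.beta(alpha, alpha)       # randomly generated mixing coefficient
    mixed = lam * img1 + (1.0 - lam) * img2      # linear interpolation of the two images
    return mixed, lam

# a, b = np.random.rand(640, 640, 3), np.random.rand(640, 640, 3)
# mixed_img, lam = mixup(a, b)   # the corresponding labels are mixed with the same lambda
```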
In addition, a simple transfer learning strategy was adopted to improve the training efficiency and overall performance of all the networks involved in the following text. Specifically, we firstly conducted 100,000 iterations of pre-training on the publicly available MS COCO dataset [41] to obtain the pre-trained weights of the corresponding models. Next, during the training stage on the Grape-⊥ dataset and the Winegrape dataset, the pre-trained weight parameters were loaded to accelerate the convergence speed of the model for the other new target tasks. This was done to enhance the adaptability of the model to the instance segmentation of the ⊥-shaped regions of table grapes.

3.2. Segmentation of Grape ⊥-Shaped Regions

3.2.1. Comparison with Different Methods

In order to verify the superiority of PGSS-YOLOv11s in instance segmentation tasks, a comparative experiment was conducted on the Grape-⊥ dataset and the publicly available Winegrape dataset developed by Santos et al. [42]. Here, the Winegrape dataset contains 2020 instances of grape clusters, including five grape varieties: Chardonnay (CDY), Cabernet Franc (CFR), Cabernet Sauvignon (CSV), Sauvignon Blanc (SVB), and Syrah (SYH). In the experiment, nine state-of-the-art instance segmentation models, namely, YOLOv11s-seg [43], Mask-RCNN [44], YOLACT [45], SOLOv2 [46], YOLOv5s-seg [47], RTMDet [48], YOLOv8s-seg [49], Co-DETR [50], and FastInst [51], were selected as the comparison objects. In addition, to ensure the fairness and rigor of the comparison, these models, which were downloaded from public network platforms like https://github.com/, were trained and tested on a computer with the same configuration. And the input image size of each model was adaptively adjusted to the optimal parameters according to the structural characteristics of the respective model.
Table 1 and Table 2 present the results of segmentation accuracy provided by the comparative experiment on the Grape-⊥ dataset and Winegrape dataset, respectively. It is easy to conclude from Table 1 that the proposed PGSS-YOLOv11s performed outstandingly in multiple segmentation accuracy indicators. For example, in terms of the core evaluation indicators of mask mAP@0.5, mask mAP@0.75, and mask mAP@0.5:0.95, the model PGSS-YOLOv11s achieved 95.2%, 40.3%, and 48.8%, respectively. Except for Co-DETR, these results are significantly superior to those of other comparative models. Regarding the challenging mask mAP@0.5:0.95 metric, PGSS-YOLOv11s achieved a substantial 4.4% improvement over the competitive YOLOv11s-seg and lagged only 2.4% behind the top-performing Co-DETR. Meanwhile, the test results on the Winegrape dataset from Table 2 also corroborate the segmentation performance of PGSS-YOLOv11s, achieving 90.0%, 66.0%, and 63.2% in mask mAP@0.5, mask mAP@0.75, and mask mAP@0.5:0.95, respectively. Notably, in the more challenging mask mAP@0.5:0.95 metric, its performance surpassed that of Co-DETR by 2.0%.
Furthermore, when considering the segmentation speed and computational complexity listed in Table 3, it is evident that the segmentation speeds of Mask-RCNN and Co-DETR were significantly lower than those of other models. In particular, although Co-DETR leads in segmentation accuracy, it incurs a substantial 384 M parameters and 537 GFlops while achieving an FPS as low as 3.0 Img/s. Clearly, such results are unfavorable for model deployment on edge development boards (e.g., NVIDIA Jetson Nano 4B (NVIDIA Corporation, Santa Clara, CA, USA)), which feature limited physical resources and low economic costs. In contrast, PGSS-YOLOv11s achieves a balance between high accuracy and efficiency. Specifically, in terms of computational complexity, PGSS-YOLOv11s has only 8.0 M Params, slightly inferior to YOLOv5s-seg but merely 1/48 of Co-DETR’s Params. This is attributed to PGSS-YOLOv11s adopting bidirectional linear superposition connection to replace traditional concatenation operations and designing a shared convolution architecture in the segmentation head module to replace the original parallel convolution structure. In terms of segmentation speed, PGSS-YOLOv11s incurs only 39.4 GFlops, far less than that of Co-DETR, and achieves an FPS of 30.9 frames. These results verify the superior performance of PGSS-YOLOv11s, which has been modified through the feature fusion mechanism and the segmentation head structure. That is to say, it effectively enhances the ability to extract shallow semantic information, significantly improves the fine segmentation performance of the ⊥-shaped regions of grapes, and strengthens the pixel-level discrimination ability between the target and the background. Overall, the proposed PGSS-YOLOv11s fully meets the real-time and high-precision processing requirements in orchard scenarios.
Obviously, as shown in Figure 11a, under the condition of sufficient lighting, except for PGSS-YOLOv11s and FastInst, the other models produced a low rate of missed detections but frequent false detections. In particular, for the grape clusters at the left and right edges of the entire image, those with occluded pedicels were more likely to be mistakenly classified as ⊥-shaped regions. For the situation with sparsely distributed grape clusters (see Figure 11c), three models, SOLOv2, YOLOv11s-seg, and FastInst, failed to segment the ⊥-shaped region of the grapes in the lower right corner of the image, and the model RTMDet mistakenly segmented the weeds on the ground as the ⊥-shaped regions of the grapes. Although the remaining models successfully segmented the ⊥-shaped regions of the grapes, our proposed PGSS-YOLOv11s stands out the most in terms of the integrity and refinement of ⊥-shaped regions. Similar instance segmentation results for ⊥-shaped regions are also presented in other actual orchard scenarios. For instance, for the case of bagged grapes in Figure 11e, although Mask-RCNN, SOLOv2, YOLOv5s-seg, FastInst, Co-DETR, and PGSS-YOLOv11s correctly segmented the ⊥-shaped regions of the grapes, the performance of our proposed PGSS-YOLOv11s was still more outstanding for the segmentation details. These visualized experimental results once again demonstrate that through the synergistic effect of the feature fusion and detail enhancement modules, PGSS-YOLOv11s exhibits fine segmentation performance for the ⊥-shaped regions of grapes. At the same time, it also shows that PGSS-YOLOv11s has strong robustness and adaptability in complex orchard environments such as lighting variations, the differences in fruit density, and bagging.

3.2.2. Ablation Studies with Different Modules

To validate the importance of different modules, including SFAM, AFFM, and DE-SCSH, we implemented the following ablation studies on our network architecture using the two datasets mentioned above. Taking YOLOv11s-seg as the baseline model, a series of submodels were constructed through the combination of these different modules (as listed in Table 4). And all submodels were trained using a unified loss function.
Table 5 gives the results of ablation studies on the Grape-⊥ dataset and the Winegrape dataset. The quantitative results of the key evaluation indicators, $F_1$, mask mAP@0.5, and mask mAP@0.5:0.95, have clearly demonstrated that, within the framework of the baseline model YOLOv11s-seg, during the process of introducing the modules SFAM, AFFM, and DE-SCSH individually, in pairs, and finally all three simultaneously, there was a significant and continuous upward trend in the model’s accuracy. For example, on the Grape-⊥ dataset, compared with the model YOLOv11s-seg, the $F_1$ score of our model PGSS-YOLOv11s gradually increased from 90.8% to 94.6%; mask mAP@0.5 and mask mAP@0.5:0.95 also continuously increased from 92.9% and 44.4% to 95.2% and 48.8%, respectively. These fully verify that the abovementioned modules make a positive contribution to the segmentation accuracy of the model. Similarly, this conclusion is also perfectly verified by the results of the ablation studies on the Winegrape dataset.
At the same time, the abovementioned quantitative results not only conclusively verify the effectiveness of each module in improving the instance segmentation accuracy of the network, but they also fully demonstrate the important role played by each module in the regulation of the model’s computational complexity and the optimization of computational efficiency. For example, when the AFFM module was individually introduced into the baseline model YOLOv11s-seg, the Params and the MS of SM2 were significantly reduced by 2.49 M and 4.7 MB, respectively, which can be attributed to the lightweight design of AFFM, featuring efficient bidirectional cross-scale connections and weighted feature fusion. The module DE-SCSH also reduces the Params by adopting the method of weight sharing in detail-enhanced convolution. Compared with the SM1, however, the reduction in the Params of the SM3 was negligible. For the SFAM module, the depth-wise convolution layers and channel split-and-fusion operations bring about more computations. Therefore, SM1 performed worse than the baseline model YOLOv11s-seg in the Params, MS, Flops, and inference time. Fortunately, when these three modules were combined in pairs or all three were introduced into the YOLOv11s-seg simultaneously, there was a significant synergistic and complementary effect. As an example, the Params decreased from 10.7 M to 7.96 M, achieving a reduction of 25.61%; the MS also decreased from 19.6 MB to 17.4 MB, achieving a byte compression rate of 11.22%. Moreover, on the premise of ensuring high segmentation accuracy and low computational complexity, the Flops only increased by 4.1 G, and the inference time was maintained at 19.1 ms. These experimental results and analyses not only reveal the significance of the collaborative design of multiple modules in balancing the model’s accuracy, computational complexity, and inference efficiency, but they also provide the technical support for the reliable deployment of the model on edge devices with limited computing power resources.
Furthermore, in order to conduct an in-depth analysis of the feature decision-making mechanisms of these modules, the Grad-CAM technique [52] was adopted to graphically and intuitively explain the degree of attention that different modules have toward the ⊥-shaped regions of grapes, as shown in Figure 12. In natural grape orchard scenarios, the captured images often contain a large number of background interference features that are similar in color, shape, and texture to the ⊥-shaped regions. Therefore, before the SFAM module was introduced into the baseline model YOLOv11s-seg, Figure 12b indicates that the baseline model activated a great deal of invalid information like the background, and its attention to the ⊥-shaped target regions was clearly unfocused. After the SFAM module was introduced, through the global-to-local spatial feature aggregation processing of feature maps at different scales, the responses of noncritical regions such as the background were effectively suppressed, as shown in Figure 12c. However, there are also some green, spot-like background features scattered in the heatmap, which indicates that the SFAM module did not pay sufficient attention to the ⊥-shaped regions of grapes. After integrating the AFFM module further, through the adaptive fusion of high-resolution shallow detail features and low-resolution deep semantic features in both top-down and bottom-up bidirectional paths with multi-scale features, the model’s recognition ability was substantially improved for the ⊥-shaped regions of grapes. Compared with Figure 12c, a conspicuous enhancement in the color of the ⊥-shaped regions of the grapes is illustrated in Figure 12d, which still shows scatters of some spot-like interference features. Finally, after the introduction of the DE-SCSH module, the feature distinction was effectively highlighted between the key regions and the invalid background through the operation of the detail enhancement and the feature sharing. Thus, the ⊥-shaped regions of the grapes were located more precisely. As depicted in Figure 12e, the activation intensity of background areas such as ground weeds, branches, and unpickable grapes has been drastically reduced, and the characteristic energy of the pickable grapes achieved precise focusing on the ⊥-shaped target areas. In conclusion, these results visually explain the validity and effectiveness of each module.
In addition to the above ablation studies, we have also explored the impact on the overall performance of the model when the SFAM module is integrated into different feature output stages ($F_2$, $F_3$, $F_4$, and $F_5$) of the backbone. Taking SM6 without the SFAM module in Table 4 as the baseline model, Table 6 presents a series of models constructed by introducing the SFAM module at different feature output levels.
Table 7 records the ablation results on the Grape-⊥ dataset and the Winegrape dataset. Apparently, after the SFAM module was introduced at the feature output levels $F_2$, $F_3$, $F_4$, and $F_5$ or $F_2$, $F_3$, and $F_4$ in the backbone, compared with the SM6, the key performance indicators of the models SM6-F2-F5 and SM6-F2-F4 all decreased. At the same time, compared with our proposed PGSS-YOLOv11s, there was a relative increase in the Params and Flops results. For example, on the Grape-⊥ dataset, although the model SM6-F2-F5 achieved an improvement in R, P significantly decreased, resulting in a decrease of 0.9%, 1.3%, and 0.1% in the indicators $F_1$, mask mAP@0.5, and mask mAP@0.5:0.95, respectively. This result indicates that feature aggregation across all levels is prone to triggering semantic conflicts among features of different scales, leading to the mixing of feature information, which in turn weakens the accuracy of object segmentation. When the SFAM module acted on the shallow features from $F_2$ to $F_4$, respectively, although the model SM6-F2-F4 showed some improvement in P, R dropped sharply to 85.5%, causing the $F_1$ score, mask mAP@0.5, and mask mAP@0.5:0.95 indicators to reach the lowest values. This also confirms that due to the limited semantic representation ability of shallow features, it is difficult to support effective spatial feature aggregation. The above results fully verify that the performance of the SFAM module highly depends on the reasonable configuration of the feature levels: excessive stacking of shallow features will limit the aggregation of target features due to information overload. On the other hand, implementing the aggregation strategy at the deep feature levels can significantly improve the segmentation accuracy of the model for complex target structures by enhancing the semantic representation ability.
Finally, within the YOLO series, we selected three classical instance segmentation models: YOLOv5s-seg, YOLOv8s-seg, and YOLOv11s-seg. After embedding SFAM and AFFM into these models, we compared our proposed DE-SCSH with other typical head structures, LADH [34], LSCSH [35], and T-Head [36], in terms of segmentation accuracy (mask mAP@0.5, mask mAP@0.5:0.95), segmentation speed (FLOPs, FPS), and computational complexity (Params, MS). Herein, the models integrated with SFAM and AFFM are designated as YOLOv5s-seg-SM4, YOLOv8s-seg-SM4, and YOLOv11s-seg-SM4, respectively. It is emphasized that the only varying component among all compared models is the head structure. The comparative results are consolidated in Table 8.
From Table 8, the four metrics related to segmentation speed and computational complexity indicate that the variations in Params, MS, FLOPs, and FPS are relatively minor. The only notable exception is that when LADH was embedded into YOLOv8s-seg-SM4, the model’s Params, MS, and FLOPs decreased by 4.54, 7.6, and 8.4, respectively, while FPS increased by 8.6 frames, thus exhibiting an advantage over other models. Nevertheless, the efficiency metrics of all models fall within a practically applicable range, which is insufficient to fully demonstrate that LADH is completely superior to other segmentation heads. In terms of segmentation accuracy metrics, all four segmentation heads performed relatively well on the Grape-⊥ dataset. Except for a few minor precision decreases, most showed a trend of precision improvement. More prominently, after integrating DE-SCSH into YOLOv5s-seg-SM4, the values of mask mAP@0.5 and mask mAP@0.5:0.95 increased by 3.4% and 5.0%, respectively. What is more noteworthy, by contrast, is that on the Winegrape dataset, all segmentation heads except DE-SCSH exhibited varying degrees of precision degradation across the three different baseline models. Only DE-SCSH maintained an upward trend in precision across the three groups of baseline models, a phenomenon that far outperforms other segmentation heads and sufficiently demonstrates the generalization ability and robustness of DE-SCSH. Overall, from the perspective of practical applications, DE-SCSH can balance precision and efficiency, achieving an optimal trade-off between the two, which constitutes its advantage over other head structures.

3.3. Evaluation of Picking Point Detection

In order to verify the effectiveness and robustness of the proposed picking point positioning method, this study randomly selected 100 images from the Grape-⊥ dataset, including 70 images of Grps and 30 images of GrpWBs. These images cover complex orchard scenes with different lighting intensities, fruit densities, and bagging statuses, and they include both vertical and curved pedicel morphologies. The first row of Figure 13 shows typical original grape images captured in five orchard scenes: a sunny day (moderate lighting), a cloudy day (relatively dim lighting), sparse grape clusters at a far distance, dense grape clusters at a very close distance, and bagged grapes. The middle row of Figure 13 depicts the binarized images of the ⊥-shaped regions after instance segmentation by PGSS-YOLOv11s, and the third row shows the calculated picking points (represented by red solid dots). Evidently, the proposed picking point positioning algorithm can adapt to these complex orchard scenes.
Furthermore, we analyzed, image by image, the number of ⊥-shaped regions segmented by PGSS-YOLOv11s and the number of picking points calculated by the positioning algorithm for these 100 grape images. As shown in Figure 14, the horizontal axis represents the image index, while the vertical axis denotes the number of instances segmented by the designed model or the number of picking points successfully localized by the proposed algorithm. Image indices 1–70 correspond to the nonbagged grape type Grps, and indices 71–100 to the bagged grape type GrpWBs. The legend NumInstance (red solid line with diamond markers) gives the number of ⊥-shaped region instances segmented by PGSS-YOLOv11s in each image; the legend NumPickPoints (red dashed line with small circles) gives the number of picking points successfully localized in each image. For Grps, when the number of ⊥-shaped region instances in an image was less than or equal to five, NumPickPoints was almost equal to NumInstance. However, as the number of instances per image increased, failures in picking point positioning occurred. For GrpWBs, the number of instances per image was generally between two and four, and positioning failures also occurred.
Finally, in order to reasonably evaluate the positioning accuracy of the picking points, a new indicator called the picking point positioning rate l p is defined as follows:
$$ l_p = \frac{p'}{p} \times 100\% $$
where p represents the total number of ⊥-shaped regions segmented by the instance segmentation model, and p′ denotes the total number of picking points calculated by the picking point positioning algorithm. Table 9 summarizes the results in Figure 14. In the 100 grape images, there are 228 ⊥-shaped region instances of Grps and 94 ⊥-shaped region instances of GrpWBs. After processing by the picking point positioning algorithm, 206 picking points were successfully localized for the Grps and 82 for the GrpWBs, giving localization success rates of 90.35% and 87.23%, respectively. Overall, the total positioning success rate of the picking points reached 89.44%, which strongly supports the rationality and effectiveness of the proposed method in multi-scenario applications.
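As a quick sanity check of Table 9, the following minimal Python snippet reproduces the l_p arithmetic from the counts reported above (the counts are taken from Table 9; only the script itself is new):

```python
# Reproduce the picking point positioning rate l_p = p'/p from the Table 9 counts.
counts = {
    "Grps":   {"p": 228, "p_prime": 206},  # segmented ⊥-shaped regions vs. localized picking points
    "GrpWBs": {"p": 94,  "p_prime": 82},
}

for name, c in counts.items():
    print(f"{name}: l_p = {c['p_prime'] / c['p']:.2%}")       # 90.35%, 87.23%

total_p = sum(c["p"] for c in counts.values())                # 322
total_p_prime = sum(c["p_prime"] for c in counts.values())    # 288
print(f"Total: l_p = {total_p_prime / total_p:.2%}")          # 89.44%
```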

3.4. Deployment on Edge Device

In order to verify the effectiveness of the proposed method in practical applications, as depicted in Figure 15a, we developed an edge-intelligent vision acquisition and detection system, which primarily comprises the edge device NVIDIA Jetson Nano 4B (NVIDIA Corporation, Santa Clara, CA, USA), a 3D camera Intel Realsense D435i (Intel Corporation, Santa Clara, CA, USA), a 10.1-inch IPS touch screen (Yahboom Technology Co., Ltd., Shenzhen, China), and a 5 V/30,000 mAh mobile power supply. The PGSS-YOLOv11s model and the picking point localization algorithm were deployed on this edge device. The system can capture images and extract the 2D or 3D coordinates of pedicel picking points in real time. Figure 15b presents a PyQt5-based graphical user interface, in which users can select the camera type, set the acquisition mode, choose the deep learning model, and configure the detection mode.
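Although the acquisition code itself is not included here, converting a localized 2D picking point into a 3D camera-frame coordinate with the Realsense D435i typically follows the standard depth-alignment and deprojection path of the pyrealsense2 SDK. The sketch below is a minimal, hedged example of that step; the stream resolution, frame rate, and example pixel are placeholder assumptions rather than the system's actual configuration.

```python
# Hedged sketch: converting a 2D picking point (u, v) from the segmentation/localization
# step into a 3D camera-frame coordinate with an Intel RealSense D435i. The 640x480 @ 30 fps
# streams and the example pixel are placeholder assumptions, not the deployed settings.
import pyrealsense2 as rs

pipeline = rs.pipeline()
config = rs.config()
config.enable_stream(rs.stream.depth, 640, 480, rs.format.z16, 30)
config.enable_stream(rs.stream.color, 640, 480, rs.format.bgr8, 30)
pipeline.start(config)
align = rs.align(rs.stream.color)  # align depth to the color image used for segmentation

def pixel_to_3d(u, v):
    """Return the (X, Y, Z) camera coordinates in meters for pixel (u, v)."""
    frames = align.process(pipeline.wait_for_frames())
    depth_frame = frames.get_depth_frame()
    depth_m = depth_frame.get_distance(u, v)  # depth in meters at that pixel
    intrin = depth_frame.profile.as_video_stream_profile().intrinsics
    return rs.rs2_deproject_pixel_to_point(intrin, [u, v], depth_m)

# Example: deproject a hypothetical picking point at pixel (320, 180).
print(pixel_to_3d(320, 180))
```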
In the vineyard, we carried this system by hand, moved along a grape planting row at a speed of 0.4 m/s, and conducted real-time segmentation and picking point localization on a row of grapes. The localization results of eight frames are shown in Figure 16. As can be seen, our method maintained accurate real-time segmentation of grape ⊥-shaped regions and real-time localization of pedicel picking points. During the movement from Frame 5 to Frame 15, the confidence of the grapes marked by red dashed ellipses increased from 0.70 to 0.93, and the 2D coordinates of their picking points were successfully localized. Occasional missed localizations occurred due to shooting angles (marked by red dashed rectangles), but they were not persistent: grape bunches or pedicels missed at one moment could be detected at other moments during the movement. For instance, the pedicel fully localized in Frame 15 missed its picking point in the subsequent Frame 35 but was fully localized again in Frame 45. It is important to note that the grapes marked by red solid ellipses are located in adjacent rows and are not pickable by the robotic arm; although their pedicel picking points were not extracted in Frames 45 and 55, they had been fully localized in the earlier Frame 35. It would be hazardous if the robotic arm, positioned in the current row, attempted to pick grapes from an adjacent row.

4. Discussion

4.1. Key Differences in Principles from Existing Advanced Methods

The above experimental results have demonstrated that our method can accurately recognize grapes and precisely localize picking points with high reliability. Similar to the strategies of Ning et al. [8], Zhu et al. [13], and Zhang et al. [14], our approach adopts a two-stage method that integrates deep learning with traditional digital image processing. As described in Section 1, this indirect strategy first uses deep learning techniques to detect grape regions and then calculates the 2D pixel coordinates of grape pedicels through traditional digital image processing algorithms. This strategy has shown excellent potential in terms of adaptability to complex orchard environments and extensibility to multiple application scenarios. However, three critical differences characterize the approach introduced here.
Difference 1: The ⊥-shaped grape annotation method. To date, researchers have designed seven grape annotation formats. As shown in Figure 17a–g, these include the following: (1) bounding box annotations for grape clusters [14], (2) pixel-level annotations for grape clusters [6], (3) instance-level pixel-by-pixel annotations for grape pedicels [8], (4) dual bounding box annotations combining clusters and pedicels [25], (5) combined bounding box annotations for clusters with single-point pedicel markers [20], (6) bounding box annotations for clusters with three key-point pedicel markers [22,24], and (7) bounding box annotations distinguishing grape clusters and pedicels as two separate categories [16].
If entire clusters or bags were annotated, the salience of the thin, elongated pedicel region might be reduced, making it difficult for convolutional neural networks (CNNs) to learn effective pedicel features. In contrast, our ⊥-shaped grape annotation fully considers the botanical relationship between the grape cluster and pedicel while highlighting texture and shape differences. As shown in the experimental results of Section 3.2, this annotation method indeed helps CNNs distinguish pedicel features from vine branches, especially differentiating young green shoots from pedicels, and it also enhances the network’s ability to localize pedicel regions precisely.
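The raw annotation files are not reproduced here, but for illustration, a ⊥-shaped polygon annotation can be rasterized into a binary training mask with a few lines of OpenCV. The vertex coordinates and image size below are hypothetical placeholders; the polygon merely mimics a thin pedicel strip joined to a short strip over the top of the berries or bag.

```python
# Illustration only: turning one ⊥-shaped polygon annotation into a binary instance mask.
# The vertex list is a made-up placeholder; the Grape-⊥ files may store polygons differently.
import numpy as np
import cv2

img_h, img_w = 1440, 1920                      # resolution used for the test images in this paper
polygon = np.array([[900, 310], [955, 310],    # hypothetical ⊥-shaped outline:
                    [955, 430], [1010, 430],   # a thin vertical pedicel strip plus a short
                    [1010, 470], [845, 470],   # horizontal strip over the berry/bag top
                    [845, 430], [900, 430]], dtype=np.int32)

mask = np.zeros((img_h, img_w), dtype=np.uint8)
cv2.fillPoly(mask, [polygon], 255)             # pixel-level instance mask for training
```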
Difference 2: Construction of the advanced PGSS-YOLOv11s. Based on YOLOv11s-seg, we introduced the hierarchical dual-branch module GLSA to capture both global context and local details, the parameter-efficient and context-aware module AFFM to fuse multi-level features, and the lightweight multi-scale DE-SCSH to enhance fine-grained segmentation of grape ⊥-shaped regions. The experimental results in Section 3.2 have demonstrated that our model significantly outperforms other state-of-the-art instance segmentation models in terms of segmentation accuracy.
Difference 3: Simple and robust picking point localization algorithm. As shown in Figure 18a, existing geometric determination methods [1,2,4] require calculating pedicel candidate regions from the image centroid, top point, and grape width, followed by edge extraction, Hough line detection, and point-to-line distance computation for each line to screen pedicel segments. This process not only relies on multiple empirical parameters but also involves complex image processing and geometric calculations. Figure 18b presents a state-of-the-art pedicel picking point localization method, which conveniently obtains the 2D coordinates of the picking point by calculating the center point of the predicted pedicel oriented bounding box. However, this method shares a common drawback with the direct methods described in Refs. [20,21,23,24], namely, deviations in picking point localization with errors of up to 30 pixels [23], resulting in poor reliability. In comparison, our method eliminates the need for tedious ROI setting for cutting points and multi-step geometric derivations; instead, it localizes grape pedicels by simply combining basic image operators such as erosion, dilation, and skeletonization. Furthermore, our localization algorithm calculates picking points directly on the pixel regions of the pedicels, thus avoiding pixel localization errors and ensuring that the resulting coordinates lie on the grape pedicels. The experiments in Section 3.3 showed that our method enables reliable picking point calculation.
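To make the operator combination concrete, the following is a minimal sketch of this kind of morphological pipeline applied to one binary ⊥-shaped mask. The kernel sizes and the rule of taking a point near the middle of the skeleton are illustrative assumptions, not the exact parameters used in this study.

```python
# Hedged sketch of the erosion/dilation/skeletonization pipeline on one ⊥-shaped mask.
# Kernel sizes and the "mid-skeleton" picking rule are illustrative, not the paper's parameters.
import numpy as np
import cv2
from skimage.morphology import skeletonize

def locate_picking_point(mask: np.ndarray, kernel_size: int = 15):
    """mask: uint8 binary image (255 = ⊥-shaped region). Returns (u, v) or None."""
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (kernel_size, kernel_size))
    # Opening (erosion then dilation) removes the thin pedicel but keeps the wide
    # berry/bag part of the ⊥-shaped region.
    f_opened = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)
    # The difference between the original region and its opening isolates the pedicel band.
    f_dif = cv2.subtract(mask, f_opened)
    # Discard residual specks so that only the elongated pedicel band remains.
    f_pedicel = cv2.morphologyEx(f_dif, cv2.MORPH_OPEN,
                                 cv2.getStructuringElement(cv2.MORPH_RECT, (3, 3)))
    if cv2.countNonZero(f_pedicel) == 0:
        return None                        # pedicel too small or occluded: no picking point
    # Skeletonize the pedicel band into a one-pixel-wide centerline.
    skeleton = skeletonize(f_pedicel > 0)
    ys, xs = np.nonzero(skeleton)
    if len(xs) == 0:
        return None
    mid = np.argsort(ys)[len(ys) // 2]     # a point near the middle of the centerline
    return int(xs[mid]), int(ys[mid])      # coordinates guaranteed to lie on the pedicel
```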

4.2. Cases of Failed Picking Point Positioning

To explore the reasons for the failed picking point positioning in Figure 14, we analyzed the grape images one by one. Specifically, when a captured Grps image contained multiple (at least five) grape clusters, the clusters farther from the camera or located at the image edges occupied only a small proportion of the pixel area. The corresponding grape pedicel regions were even smaller and, in some cases, no longer exhibited the characteristics of a thin, elongated pedicel. Under such circumstances, although the model could still segment the ⊥-shaped region, the pedicel regions with too small a pixel area were treated as noise and removed during the picking point calculation. Similarly, for GrpWBs, although each image contained only a few (around three) grape cluster instances, the pixel area of the pedicel region not wrapped by the paper bag was small, and a similar result occurred during picking point calculation. Figure 19 shows representative grape images with failed picking point positioning; the red circles mark the grapes for which picking points were not calculated. Evidently, these grapes are farther from the camera and occupy a small proportion of the image pixels.
It is worth emphasizing that, based on an analysis of the on-site operation of the grape picking robot, grapes that are far from the camera, densely packed, or occluded by other objects are not priority picking targets; forcibly picking them would likely damage the grapes or the robot. In actual operation, grapes with a clear view and at a relatively close distance are usually processed first, because they are more convenient for the robot to pick. Fortunately, all of the positioning failures were missed detections rather than incorrect localizations. In other words, the proposed picking point positioning algorithm only calculates and locates ⊥-shaped regions with valid visual features and filters out grapes whose pedicel areas are too small owing to objective factors. Therefore, to a certain extent, the proposed positioning algorithm is well aligned with the actual operational requirements of the picking robot.
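For clarity, the noise-removal behavior described above amounts to a minimum-area filter on the connected pedicel components. A hedged sketch is given below; the area threshold is an arbitrary placeholder rather than a value used in this study.

```python
# Hedged sketch of the size filter discussed above: pedicel candidates whose pixel area
# falls below a threshold are treated as noise and skipped. The threshold is a placeholder.
import numpy as np
import cv2

MIN_PEDICEL_AREA = 120  # pixels; placeholder value

def filter_small_pedicels(f_pedicel: np.ndarray) -> np.ndarray:
    """Keep only connected components large enough to be treated as real pedicels."""
    n, labels, stats, _ = cv2.connectedComponentsWithStats(f_pedicel, connectivity=8)
    kept = np.zeros_like(f_pedicel)
    for i in range(1, n):  # label 0 is the background
        if stats[i, cv2.CC_STAT_AREA] >= MIN_PEDICEL_AREA:
            kept[labels == i] = 255
    return kept
```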

4.3. Collecting More Grape Image Data Under Natural Environments

The Grape-⊥ dataset was collected in a pseudo-enclosed, 'n'-shaped space, and the entire vineyard was covered by a large sunshade net, resulting in relatively uniform lighting in the collected images. Additionally, the shooting distance was generally maintained between 50 and 70 cm, with horizontal shots ensuring clear imaging of both grapes and pedicels. Furthermore, from the perspective of robot pickability, this work only annotated grapes that the robot could harvest. Although data augmentation was used during training to simulate occlusion scenarios, complex unannotated occlusion cases in reality were not considered. While these choices provided favorable practical conditions for deep learning detection and robot picking, they inadvertently limited the diversity of the grape dataset, leading to insufficient generalization ability of the proposed method. Drawing on the discussion in Ref. [18], we will enrich the grape dataset by collecting images under different lighting conditions, from various shooting angles, and including more grape varieties to improve the model's generalization performance.

4.4. Function Expansion of Table Grape Picking Methods

Our proposed ⊥-shaped annotation format, PGSS-YOLOv11s, and picking point localization strategy are not only suitable for picking point localization in table grapes but also extendable to automated harvesting of fruits and vegetables with slender pedicels, such as cucumbers, eggplants, peppers, pumpkins, durians, and jackfruits. For example, Figure 20 shows the ⊥-shaped annotation formats of eggplants, durians, and cucumbers. In the future, we will construct more datasets for fruits and vegetables with slender pedicels to conduct broader research using our designed PGSS-YOLOv11s and picking point localization strategy.

5. Conclusions

In this study, a new two-stage method is proposed to locate grape picking points. (1) In the first stage, we put forward a grape instance segmentation framework named PGSS-YOLOv11s to perform fine-grained segmentation of the ⊥-shaped regions of grapes. Concretely, PGSS-YOLOv11s uses the original backbone of YOLOv11s-seg for feature extraction. After the backbone, the SFAM is added to aggregate the global-to-local spatial features of grapes. As the neck of PGSS-YOLOv11s, the AFFM adaptively fuses high-resolution shallow detail features and low-resolution deep semantic features along top-down and bottom-up bidirectional paths with multi-scale features. Furthermore, the novel segmentation head DE-SCSH effectively highlights the feature distinction between the grape ⊥-shaped regions and invalid backgrounds. In addition, we developed a new grape segmentation dataset called Grape-⊥ with a total of 1576 grape images and 4455 ⊥-shaped instances, including 3375 Grps and 1080 GrpWBs. Completely different from existing grape annotation forms, the ⊥-shaped region encompasses the whole grape pedicel and a small part of the attached grape berries or paper bags. Finally, the experimental results verify that PGSS-YOLOv11s achieves better instance segmentation performance for the grape ⊥-shaped regions than other state-of-the-art models, with lower computational complexity and higher inference efficiency. Specifically, the F1 score and mask mAP@0.5 of PGSS-YOLOv11s reached 94.6% and 95.2% on the Grape-⊥ dataset and 85.4% and 90.0% on the Winegrape dataset. Compared with YOLOv11s-seg, the Params and MS of PGSS-YOLOv11s were reduced to 7.96 M and 17.4 MB, respectively. Although the mask mAP@0.5 and mask mAP@0.5:0.95 of PGSS-YOLOv11s are 3.2% and 2.4% lower, respectively, than those of the top-performing Co-DETR on the Grape-⊥ dataset, the Params of PGSS-YOLOv11s are merely 1/48 of those of Co-DETR, and its Flops amount to only 39.4 G, far lower than those of Co-DETR. Moreover, it takes only 19.1 ms to predict a grape image with a resolution of 1920 × 1440 pixels. (2) In the second stage, a simple and efficacious combination strategy of morphological operators was designed to locate the picking points on the segmented grape ⊥-shaped regions. In contrast to existing calculation principles for picking points, the presented strategy does not require complex geometric shape computations and introduces no pixel distance error. Multi-scenario tests indicated that the success rate of positioning the picking points reached up to 89.44%. Finally, we discussed three key differences between the proposed method and existing methods, analyzed the possible reasons for failed localization of picking points, emphasized the importance of enriching the dataset, and pointed out that our method has extensive scalability.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/agriculture15151622/s1.

Author Contributions

Conceptualization, Methodology, Investigation, Resources, Writing—Review and Editing, Supervision, Project administration, J.L.; Methodology, Validation, Investigation, Data Curation, Writing—Original Draft, Writing—Review and Editing, Visualization, Z.C.; Data Curation, Z.W., J.Z. and M.Z.; Funding acquisition, J.W. and J.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by National Natural Science Foundation of China [62205271 and 52375128], Natural Science Basic Research Program of Shaanxi Province [2025JC-YBMS-771], and Key Agricultural Funding of Xi’an Science and Technology Planning Projects of China [23NYGG0020].

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The Grape-⊥ dataset that supports the findings of this study is available from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Luo, L.; Tang, Y.; Zou, X.; Ye, M.; Feng, W.; Li, G. Vision-based extraction of spatial information in grape clusters for harvesting robots. Biosyst. Eng. 2016, 151, 90–104. [Google Scholar] [CrossRef]
  2. Luo, L.; Tang, Y.; Lu, Q.; Chen, X.; Zhang, P.; Zou, X. A vision methodology for harvesting robot to detect cutting points on peduncles of double overlapping grape clusters in a vineyard. Comput. Ind. 2018, 99, 130–139. [Google Scholar] [CrossRef]
  3. Xiong, J.; Liu, Z.; Lin, R.; Bu, R.; He, Z.; Yang, Z.; Liang, C. Green grape detection and picking-point calculation in a night-time natural environment using a charge-coupled device (CCD) vision sensor with artificial illumination. Sensors 2018, 18, 969. [Google Scholar] [CrossRef]
  4. Jin, Y.; Yu, C.; Yin, J.; Yang, S.X. Detection method for table grape ears and stems based on a far-close-range combined vision system and hand-eye-coordinated picking test. Comput. Electron. Agric. 2022, 202, 107364. [Google Scholar] [CrossRef]
  5. Zhu, Y.; Zhang, T.; Liu, L.; Liu, P.; Li, X. Fast location of table grapes picking point based on infrared tube. Inventions 2022, 7, 27. [Google Scholar] [CrossRef]
  6. Santos, T.T.; De Souza, L.L.; dos Santos, A.A.; Avila, S. Grape detection, segmentation, and tracking using deep neural networks and three-dimensional association. Comput. Electron. Agric. 2020, 170, 105247. [Google Scholar] [CrossRef]
  7. Kalampokas, T.; Vrochidou, E.; Papakostas, G.A.; Pachidis, T.; Kaburlasos, V.G. Grape stem detection using regression convolutional neural networks. Comput. Electron. Agric. 2021, 186, 106220. [Google Scholar] [CrossRef]
  8. Ning, Z.; Luo, L.; Liao, J.; Wen, H.; Wei, H.; Lu, Q. Recognition and the optimal picking point location of grape stems based on deep learning. Trans. Chin. Soc. Agric. Eng. (Trans. CSAE) 2021, 37, 222–229. [Google Scholar] [CrossRef]
  9. Shen, Q.; Zhang, X.; Shen, M.; Xu, D. Multi-scale adaptive YOLO for instance segmentation of grape pedicels. Comput. Electron. Agric. 2025, 229, 109712. [Google Scholar] [CrossRef]
  10. Li, H.; Li, C.; Li, G.; Chen, L. A real-time table grape detection method based on improved YOLOv4-tiny network in complex background. Biosyst. Eng. 2021, 212, 347–359. [Google Scholar] [CrossRef]
  11. Su, S.; Chen, R.; Fang, X.; Zhu, Y.; Zhang, T.; Xu, Z. A Novel Lightweight Grape Detection Method. Agriculture 2022, 12, 1364. [Google Scholar] [CrossRef]
  12. Zhang, C.; Ding, H.; Shi, Q.; Wang, Y. Grape cluster real-time detection in complex natural scenes based on YOLOv5s deep learning network. Agriculture 2022, 12, 1242. [Google Scholar] [CrossRef]
  13. Zhu, Y.; Li, S.; Du, W.; Du, Y.; Liu, P.; Li, X. Identification of table grapes in the natural environment based on an improved Yolov5 and localization of picking points. Precis. Agric. 2023, 24, 1333–1354. [Google Scholar] [CrossRef]
  14. Zhang, T.; Wu, F.; Wang, M.; Chen, Z.; Li, L.; Zou, X. Grape-bunch identification and location of picking points on occluded fruit axis based on YOLOv5-GAP. Horticulturae 2023, 9, 498. [Google Scholar] [CrossRef]
  15. Wang, W.; Shi, Y.; Liu, W.; Che, Z. An Unstructured Orchard Grape Detection Method Utilizing YOLOv5s. Agriculture 2024, 14, 262. [Google Scholar] [CrossRef]
  16. Zhao, J.; Yao, X.; Wang, Y.; Yi, Z.; Xie, Y.; Zhou, X. Lightweight-Improved YOLOv5s Model for Grape Fruit and Stem Recognition. Agriculture 2024, 14, 774. [Google Scholar] [CrossRef]
  17. Huang, X.; Peng, D.; Qi, H.; Zhou, L.; Zhang, C. Detection and Instance Segmentation of Grape Clusters in Orchard Environments Using an Improved Mask R-CNN Model. Agriculture 2024, 14, 918. [Google Scholar] [CrossRef]
  18. Zhu, Y.; Sui, S.; Du, W.; Li, X.; Liu, P. Picking point localization method of table grape picking robot based on you only look once version 8 nano. Eng. Appl. Artif. Intell. 2025, 146, 110266. [Google Scholar] [CrossRef]
  19. Li, P.; Chen, J.; Chen, Q.; Huang, L.; Jiang, Z.; Hua, W.; Li, Y. Detection and picking point localization of grape bunches and stems based on oriented bounding box. Comput. Electron. Agric. 2025, 233, 110168. [Google Scholar] [CrossRef]
  20. Zhao, R.; Zhu, Y.; Li, Y. An end-to-end lightweight model for grape and picking point simultaneous detection. Biosyst. Eng. 2022, 223, 174–188. [Google Scholar] [CrossRef]
  21. Xu, Z.; Liu, J.; Wang, J.; Cai, L.; Jin, Y.; Zhao, S.; Xie, B. Realtime picking point decision algorithm of trellis grape for high-speed robotic cut-and-catch harvesting. Agronomy 2023, 13, 1618. [Google Scholar] [CrossRef]
  22. Wu, Z.; Xia, F.; Zhou, S.; Xu, D. A method for identifying grape stems using keypoints. Comput. Electron. Agric. 2023, 209, 107825. [Google Scholar] [CrossRef]
  23. Chen, J.; Ma, A.; Huang, L.; Li, H.; Zhang, H.; Huang, Y.; Zhu, T. Efficient and lightweight grape and picking point synchronous detection model based on key point detection. Comput. Electron. Agric. 2024, 217, 108612. [Google Scholar] [CrossRef]
  24. Jiang, T.; Li, Y.; Feng, H.; Wu, J.; Sun, W.; Ruan, Y. Research on a Trellis Grape Stem Recognition Method Based on YOLOv8n-GP. Agriculture 2024, 14, 1449. [Google Scholar] [CrossRef]
  25. Zhou, X.; Wu, F.; Zou, X.; Meng, H.; Zhang, Y.; Luo, X. Method for locating picking points of grape clusters using multi-object recognition. Trans. Chin. Soc. Agric. Eng. (Trans. CSAE) 2023, 39, 166–177. [Google Scholar] [CrossRef]
  26. Bahdanau, D.; Cho, K.; Bengio, Y. Neural Machine Translation by Jointly Learning to Align and Translate. arXiv 2016. [Google Scholar] [CrossRef]
  27. Tang, F.; Xu, Z.; Huang, Q.; Wang, J.; Hou, X.; Su, J.; Liu, J. DuAT: Dual-aggregation transformer network for medical image segmentation. In Proceedings of the Chinese Conference on Pattern Recognition and Computer Vision (PRCV), Xiamen, China, 13–15 October 2023; pp. 343–356. [Google Scholar]
  28. Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path aggregation network for instance segmentation. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 8759–8768. [Google Scholar]
  29. Hu, J.F.; Sun, J.; Lin, Z.; Lai, J.H.; Zeng, W.; Zheng, W.S. Apanet: Auto-path aggregation for future instance segmentation prediction. IEEE Trans. Pattern Anal. 2021, 44, 3386–3403. [Google Scholar] [CrossRef]
  30. Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 7464–7475. [Google Scholar]
  31. Zhao, L.; Chen, J.; Shahzad, M.; Xia, M.; Lin, H. MFPANet: Multi-scale feature perception and aggregation network for high-resolution snow depth estimation. Remote Sens. 2024, 16, 2087. [Google Scholar] [CrossRef]
  32. Ge, Z.; Liu, S.; Wang, F.; Li, Z.; Sun, J. YOLOX: Exceeding YOLO Series in 2021. arXiv 2021. [Google Scholar] [CrossRef]
  33. Huang, L.; Li, W.; Tan, Y.; Shen, L.; Yu, J.; Fu, H. YOLOCS: Object Detection based on Dense Channel Compression for Feature Spatial Solidification. arXiv 2024, arXiv:2305.04170. [Google Scholar] [CrossRef]
  34. Zhang, J.; Chen, Z.; Yan, G.; Wang, Y.; Hu, B. Faster and Lightweight: An Improved YOLOv5 Object Detector for Remote Sensing Images. Remote Sens. 2023, 15, 4974. [Google Scholar] [CrossRef]
  35. Zhang, Z.; Zou, Y.; Tan, Y.; Zhou, C. YOLOv8-seg-CP: A lightweight instance segmentation algorithm for chip pad based on improved YOLOv8-seg model. Sci. Rep. 2024, 14, 27716. [Google Scholar] [CrossRef]
  36. Feng, C.; Zhong, Y.; Gao, Y.; Scott, M.R.; Huang, W. TOOD: Task-aligned One-stage Object Detection. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 3490–3499. [Google Scholar]
  37. Tian, Z.; Shen, C.; Chen, H.; He, T. FCOS: A simple and strong anchor-free object detector. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 44, 1922–1933. [Google Scholar] [CrossRef]
  38. Chen, Z.; He, Z.; Lu, Z.M. DEA-Net: Single image dehazing based on detail-enhanced convolution and content-guided attention. IEEE Trans. Image Process. 2024, 33, 1002–1015. [Google Scholar] [CrossRef]
  39. Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y.M. YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv 2020. [Google Scholar] [CrossRef]
  40. Zhang, H.; Cisse, M.; Dauphin, Y.N.; Lopez-Paz, D. mixup: Beyond Empirical Risk Minimization. arXiv 2018. [Google Scholar] [CrossRef]
  41. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft coco: Common objects in context. In Proceedings of the European Conference on Computer Vision (ECCV), Zurich, Switzerland, 6–12 September 2014; pp. 740–755. [Google Scholar]
  42. Santos, T.; de Souza, L.; dos Santos, A.; Sandra, A. Embrapa Wine Grape Instance Segmentation Dataset–Embrapa WGISD. Zenodo. 2019. Available online: https://zenodo.org/records/3361736 (accessed on 9 June 2025).
  43. Khanam, R.; Hussain, M. YOLOv11: An Overview of the Key Architectural Enhancements. arXiv 2024. [Google Scholar] [CrossRef]
  44. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 740–755. [Google Scholar]
  45. Bolya, D.; Zhou, C.; Xiao, F.; Lee, Y.J. YOLACT: Real-Time Instance Segmentation. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9156–9165. [Google Scholar]
  46. Wang, X.; Zhang, R.; Kong, T.; Li, L.; Shen, C. SOLOv2: Dynamic, Faster and Stronger. arXiv 2020, arXiv:2003.10152. [Google Scholar]
  47. Jocher, G.; Chaurasia, A.; Stoken, A.; Borovec, J. YOLOv5 SOTA Realtime Instance Segmentation (v7.0). Zenodo. 2022. Available online: https://zenodo.org/records/7347926 (accessed on 2 June 2025).
  48. Lyu, C.; Zhang, W.; Huang, H.; Zhou, Y.; Wang, Y.; Liu, Y.; Zhang, S.; Chen, K. RTMDet: An Empirical Study of Designing Real-Time Object Detectors. arXiv 2022. [Google Scholar] [CrossRef]
  49. Jocher, G.; Chaurasia, A.; Stoken, A.; Borovec, J. Ultralytics YOLOv8. Github. 2023. Available online: https://github.com/ultralytics/ultralytics (accessed on 3 May 2024).
  50. Zong, Z.; Song, G.; Liu, Y. DETRs with Collaborative Hybrid Assignments Training. arXiv 2023. [Google Scholar] [CrossRef]
  51. He, J.; Li, P.; Geng, Y.; Xie, X. FastInst: A Simple Query-Based Model for Real-Time Instance Segmentation. arXiv 2023, arXiv:2303.08594. [Google Scholar]
  52. Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 618–626. [Google Scholar]
Figure 1. Flow chart of this study.
Figure 2. Vineyard: (a) grape growing conditions; (b) grapes of different varieties; (c) grapes wrapped in bags.
Figure 3. ⊥-shaped annotation: (a) annotation of unbagged grapes; (b) annotation of bagged grapes; (c) multiple grape clusters from the current and adjacent rows.
Figure 4. Grape-⊥ dataset: (a) the number of labels for each category; (b) the aspect ratios of bounding boxes; (c) distribution of the bounding box center and the width-to-height ratio relative to the image (darker-colored tiles indicate a greater number of bounding boxes at this center or with this aspect ratio).
Figure 5. The architecture of PGSS-YOLOv11s.
Figure 6. Feature map visualization: (a) original image; (b) feature map of stage F2; (c) feature map of stage F3; (d) feature map of stage F4; (e) feature map of stage F5.
Figure 7. Structure of GLSA.
Figure 8. Schematic diagram of the neck: (a) PAFPN; (b) AFFM; (c) AWFF and F_n.
Figure 9. Structure of the segmentation head in YOLOv11s-seg and DE-SCSH: (a) the original segmentation head of YOLOv11s-seg; (b) DE-SCSH.
Figure 10. Positioning of picking points: (a) original grape image; (b) binary image of the ⊥-shaped region; (c) binary image after erosion; (d) f_opened; (e) f_dif; (f) f_pedicel; (g) l_pedicel; (h) pinpointed picking points.
Figure 11. Visualization of seven models in multiple scenarios: (a) sunny; (b) cloudy; (c) sparse; (d) dense; (e) bagged.
Figure 12. Grad-CAM visualizations from different modules on grape ⊥-shaped regions: (a) input image; (b) before adding SFAM; (c) after adding SFAM; (d) after adding AFFM; (e) after adding DE-SCSH; (f) output image.
Figure 13. Picking point localization results in different scenarios: (a) sunny day; (b) cloudy day; (c) sparse grape clusters; (d) dense grape clusters; (e) bagged grapes.
Figure 14. Distribution of ⊥-shaped region instances and positioning results.
Figure 15. Edge-intelligent vision acquisition and detection system: (a) hardware; (b) graphical user interface.
Figure 16. Localization results of eight frames in field tests. Red dashed ellipses indicate the changes in confidence and segmentation area of the same bunch of grapes when moving from Frame 5 to Frame 15; red dashed rectangles represent the segmentation and localization of the same bunch of grapes at different moving positions; red solid ellipses denote grapes in adjacent rows that are segmented but not localized. (See detailed real-time detection videos in the Supplementary Materials.)
Figure 17. Examples of representative labelling methods: (a) bounding box annotation for grape clusters [14]; (b) pixel-level annotation for grape clusters [6]; (c) pixel-level annotation for grape pedicels [8]; (d) segmented annotation for grape clusters, pedicels, and shoots [25]; (e) bounding box annotation for grape clusters and single-point annotation for pedicels [20]; (f) annotation consisting of an area box, three keypoints, and two lines [22,24]; (g) bounding box annotations for clusters and pedicels as two separate categories [16].
Figure 18. Schematic diagram of representative picking point localization methods: (a) traditional geometric determination methods [1]; (b) center point of the predicted pedicel oriented bounding box [19]; (c) end-to-end prediction from Ref. [24].
Figure 19. The cases of positioning failure; failed instances are marked with red circles.
Figure 20. The ⊥-shaped annotation formats of (a) eggplants; (b) durians; (c) cucumbers; each closed green polygonal area constitutes an annotated instance.
Table 1. Comparison of state-of-the-art instance segmentation models on the Grape-⊥ dataset.
Method | Grp | GrpWB | Mask mAP@0.5 | Mask mAP@0.75 | Mask mAP@0.5:0.95
Mask-RCNN | 77.4 | 95.8 | 86.6 | 32.2 | 39.3
YOLACT | 92.7 | 75.5 | 84.1 | 19.0 | 33.5
SOLOv2 | 88.0 | 94.4 | 91.2 | 35.6 | 42.7
YOLOv5s-seg | 84.7 | 95.3 | 90.0 | 31.2 | 38.4
RTMDet | 94.5 | 90.9 | 92.7 | 23.4 | 39.8
YOLOv8s-seg | 89.9 | 97.2 | 93.5 | 38.8 | 43.6
YOLOv11s-seg | 89.3 | 96.5 | 92.9 | 37.6 | 44.4
Co-DETR | 97.6 | 99.2 | 98.4 | 43.3 | 51.2
FastInst | 92.1 | 92.7 | 92.4 | 40.1 | 43.2
Ours | 90.9 | 99.4 | 95.2 | 40.3 | 48.8
Table 2. Comparison of state-of-the-art instance segmentation models on the Winegrape dataset.
Method | CDY | CFR | CSV | SVB | SYH | Mask mAP@0.5 | Mask mAP@0.75 | Mask mAP@0.5:0.95
Mask-RCNN | 83.9 | 76.8 | 89.9 | 85.4 | 95.7 | 86.3 | 63.6 | 59.0
YOLACT | 56.7 | 67.0 | 58.4 | 67.6 | 52.3 | 60.4 | 31.6 | 32.8
SOLOv2 | 75.3 | 83.7 | 82.8 | 93.1 | 92.7 | 85.5 | 63.5 | 57.1
YOLOv5s-seg | 60.9 | 80.1 | 80.2 | 75.4 | 79.2 | 75.2 | 48.6 | 42.7
RTMDet | 71.8 | 80.0 | 77.9 | 89.5 | 87.2 | 81.3 | 54.0 | 49.3
YOLOv8s-seg | 86.7 | 90.7 | 80.0 | 95.3 | 89.3 | 86.4 | 60.1 | 60.3
YOLOv11s-seg | 81.7 | 88.5 | 80.8 | 93.7 | 90.2 | 87.0 | 62.8 | 58.3
Co-DETR | 90.2 | 93.5 | 82.6 | 95.3 | 94.4 | 91.2 | 70.1 | 61.2
FastInst | 81.6 | 90.2 | 78.2 | 92.6 | 93.4 | 87.2 | 64.2 | 58.4
Ours | 88.9 | 90.9 | 80.7 | 96.0 | 93.4 | 90.0 | 66.0 | 63.2
Table 3. Comparison of state-of-the-art instance segmentation models in terms of segmentation efficiency.
Method | Params/M | MS/MB | Flops/G | FPS (Img/s)
Mask-RCNN | 44.0 | 169.6 | 247 | 6.5
YOLACT | 34.8 | 135.0 | 61.8 | 27.3
SOLOv2 | 46.3 | 178.0 | 96.2 | 18.4
YOLOv5s-seg | 7.4 | 14.4 | 25.7 | 40.3
RTMDet | 10.2 | 39.1 | 21.5 | 30.3
YOLOv8s-seg | 11.8 | 22.8 | 42.4 | 28.3
YOLOv11s-seg | 10.1 | 19.6 | 35.3 | 37.6
Co-DETR | 384 | 1.4 GB | 537 | 3.0
FastInst | 34.2 | 132.9 | 75.5 | 22.4
Ours | 8.0 | 17.4 | 39.4 | 30.9
Table 4. Submodels constructed by combining the different modules (✓ indicates the module is included; - indicates it is not).
Submodels | SFAM | AFFM | DE-SCSH
YOLOv11s-seg | - | - | -
SM1 | ✓ | - | -
SM2 | - | ✓ | -
SM3 | - | - | ✓
SM4 | ✓ | ✓ | -
SM5 | ✓ | - | ✓
SM6 | - | ✓ | ✓
PGSS-YOLOv11s | ✓ | ✓ | ✓
Table 5. Ablation results on the Grape-⊥ dataset and the Winegrape dataset.
Models | P/% (Grape-⊥) | R/% | F1 | Mask mAP@0.5 | Mask mAP@0.5:0.95 | P/% (Winegrape) | R/% | F1 | Mask mAP@0.5 | Mask mAP@0.5:0.95 | Params/M | MS/MB | Flops/G | Inference/ms
YOLOv11s-seg | 90.3 | 91.3 | 90.8 | 92.9 | 44.4 | 82.9 | 80.0 | 81.4 | 87.0 | 58.3 | 10.07 | 19.6 | 35.3 | 13.9
SM1 | 94.4 | 89.7 | 92.0 | 93.8 | 47.3 | 82.8 | 83.6 | 83.2 | 87.1 | 58.9 | 14.64 | 28.5 | 44.2 | 20.6
SM2 | 89.4 | 93.1 | 91.2 | 93.7 | 45.7 | 84.0 | 81.5 | 82.7 | 86.7 | 59.7 | 7.58 | 14.9 | 35.4 | 14.2
SM3 | 94.0 | 91.9 | 92.9 | 94.3 | 48.0 | 83.8 | 82.1 | 82.9 | 88.0 | 59.3 | 9.44 | 20.0 | 37.0 | 18.0
SM4 | 93.7 | 92.1 | 92.9 | 94.0 | 47.4 | 84.2 | 84.4 | 84.3 | 88.1 | 60.6 | 8.16 | 16.1 | 37.1 | 29.2
SM5 | 95.4 | 92.8 | 94.1 | 94.6 | 48.4 | 84.0 | 84.3 | 84.1 | 89.0 | 61.3 | 14.01 | 28.9 | 45.9 | 34.5
SM6 | 92.6 | 94.1 | 93.3 | 94.5 | 48.2 | 85.5 | 82.8 | 84.1 | 89.1 | 62.6 | 7.39 | 16.2 | 37.7 | 18.9
Ours | 96.0 | 93.2 | 94.6 | 95.2 | 48.8 | 85.8 | 85.0 | 85.4 | 90.0 | 63.2 | 7.96 | 17.4 | 39.4 | 19.1
Table 6. Models with the introduction of the SFAM module in different feature output stages.
Models | SM6 | SM6-F2-F5 | SM6-F2-F4 | Ours
Feature output stages of the backbone | - | F2, F3, F4, F5 | F2, F3, F4 | F3, F4, F5
Table 7. Ablation results for the SFAM module on the Grape-⊥ dataset and the Winegrape dataset.
Models | P/% (Grape-⊥) | R/% | F1 | Mask mAP@0.5 | Mask mAP@0.5:0.95 | P/% (Winegrape) | R/% | F1 | Mask mAP@0.5 | Mask mAP@0.5:0.95 | Params/M | Flops/G
SM6 | 92.6 | 94.1 | 93.3 | 94.5 | 48.2 | 85.5 | 82.8 | 84.1 | 89.1 | 62.6 | 7.39 | 37.7
SM6-F2-F5 | 91.5 | 93.4 | 92.4 | 93.2 | 48.1 | 82.9 | 80.5 | 81.7 | 88.2 | 60.6 | 8.16 | 40.9
SM6-F2-F4 | 96.8 | 85.5 | 90.8 | 90.0 | 45.5 | 81.8 | 79.6 | 80.7 | 86.2 | 58.1 | 7.95 | 40.8
Ours | 96.0 | 93.2 | 94.6 | 95.2 | 48.8 | 85.8 | 85.0 | 85.4 | 90.0 | 63.2 | 7.96 | 39.4
Table 8. Performance of the DE-SCSH-Head.
Methods | Head Structures | Mask mAP@0.5 (Grape-⊥) | Mask mAP@0.5:0.95 (Grape-⊥) | Mask mAP@0.5 (Winegrape) | Mask mAP@0.5:0.95 (Winegrape) | Params/M | MS/MB | Flops/G | FPS (Img/s)
YOLOv5s-seg-SM4 | Original | 90.0 | 31.2 | 75.2 | 42.7 | 7.44 | 14.4 | 25.7 | 40.3
YOLOv5s-seg-SM4 | +LADH | 91.3 (+1.3%) | 32.9 (+1.7%) | 76.2 (+1.0%) | 41.0 (−1.7%) | 6.32 (−1.12) | 12.2 (−2.2) | 22.3 (−3.4) | 48.3 (+8.0)
YOLOv5s-seg-SM4 | +T-Head | 91.1 (+1.1%) | 33.3 (+2.1%) | 73.2 (−2.0%) | 38.4 (−4.3%) | 6.02 (−1.42) | 12.0 (−2.4) | 27.8 (+2.1) | 37.4 (−2.9)
YOLOv5s-seg-SM4 | +LSCSH | 90.6 (+0.6%) | 35.7 (+4.5%) | 74.2 (−1.0%) | 40.2 (−2.5%) | 6.18 (−1.26) | 10.2 (−4.2) | 23.2 (−2.5) | 44.2 (+3.9)
YOLOv5s-seg-SM4 | DE-SCSH | 93.4 (+3.4%) | 36.2 (+5.0%) | 78.7 (+3.5%) | 48.1 (+5.4%) | 6.42 (−1.02) | 10.4 (−4.0) | 24.2 (−4.5) | 46.2 (+5.9)
YOLOv8s-seg-SM4 | Original | 93.5 | 43.6 | 86.4 | 60.3 | 11.82 | 22.8 | 42.4 | 28.3
YOLOv8s-seg-SM4 | +LADH | 92.7 (−0.8%) | 47.1 (+3.5%) | 85.5 (−0.9%) | 51.2 (−9.1%) | 7.28 (−4.54) | 15.2 (−7.6) | 34.0 (−8.4) | 36.9 (+8.6)
YOLOv8s-seg-SM4 | +T-Head | 94.5 (+1.0%) | 46.6 (+3.0%) | 79.3 (−7.1%) | 51.6 (−8.7%) | 9.56 (−2.26) | 19.5 (−3.3) | 44.2 (+1.8) | 28.0 (−0.3)
YOLOv8s-seg-SM4 | +LSCSH | 91.8 (−1.7%) | 48.1 (+4.5%) | 77.0 (−9.4%) | 50.5 (−9.8%) | 7.46 (−4.36) | 15.4 (−7.4) | 38.4 (−4.0) | 32.1 (+3.8)
YOLOv8s-seg-SM4 | DE-SCSH | 94.0 (+0.5%) | 46.5 (+2.9%) | 87.1 (+0.7%) | 61.3 (+1.0%) | 7.46 (−4.36) | 17.1 (−5.7) | 38.4 (−4.0) | 33.2 (+4.9)
YOLOv11s-seg-SM4 | Original | 92.9 | 44.4 | 87.0 | 58.3 | 10.07 | 19.6 | 35.3 | 37.6
YOLOv11s-seg-SM4 | +LADH | 95.5 (+2.6%) | 47.2 (+2.8%) | 85.6 (−1.4%) | 51.1 (−7.2%) | 7.77 (−2.30) | 16.2 (−3.4) | 34.9 (−0.4) | 38.0 (+0.4)
YOLOv11s-seg-SM4 | +T-Head | 94.1 (+1.2%) | 46.1 (+1.7%) | 83.5 (−3.5%) | 54.4 (−3.9%) | 7.54 (−2.53) | 16.8 (−2.8) | 38.4 (+3.1) | 32.2 (−5.4)
YOLOv11s-seg-SM4 | +LSCSH | 93.4 (+0.5%) | 46.9 (+2.5%) | 82.6 (−4.4%) | 52.2 (−6.1%) | 8.08 (−1.99) | 16.7 (−2.9) | 36.9 (+1.6) | 34.6 (−3.0)
YOLOv11s-seg-SM4 | DE-SCSH | 95.2 (+2.3%) | 48.8 (+4.4%) | 90.0 (+3.0%) | 63.2 (+4.9%) | 7.96 (−2.11) | 17.4 (−2.2) | 39.4 (+4.1) | 30.9 (−6.7)
Table 9. Localization success rates for Grps and GrpWBs.
Classes | p | p′ | l_p
Grps | 228 | 206 | 90.35%
GrpWBs | 94 | 82 | 87.23%
Total | 322 | 288 | 89.44%
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
