YOLOv8n-Seg-Based Grape Berry Instance Segmentation and Thinning Decision-Making for Vineyard Robots

Zheng, Hengyi; Ma, Yuhan; Zhang, Tengxu; Han, Shuo; Qian, Mengbo

doi:10.3390/horticulturae12060697


by Hengyi Zheng ¹,²,†, Yuhan Ma ³,†, Tengxu Zhang ¹,², Shuo Han ¹,² and Mengbo Qian ¹,²,*

¹ College of Optical, Mechanical and Electrical Engineering, Zhejiang A&F University, Hangzhou 311300, China
² Zhejiang Key Laboratory of Intelligent Sensing and Robotics for Agriculture, Hangzhou 310058, China
³ College of Environmental and Resource Sciences, Zhejiang A&F University, Hangzhou 311300, China
* Author to whom correspondence should be addressed.
† These authors contributed equally to this work.

Horticulturae 2026, 12(6), 697; https://doi.org/10.3390/horticulturae12060697 (registering DOI)

Submission received: 15 April 2026 / Revised: 31 May 2026 / Accepted: 2 June 2026 / Published: 5 June 2026

(This article belongs to the Section Viticulture)


Highlights

What are the main findings?

The optimized YOLOv8n-seg model achieved improved grape berry instance segmentation performance, with a box mAP50-95 of 0.8945, a mask mAP50-95 of 0.7910, an inference speed of 119.19 FPS, and 3.26 M parameters on an NVIDIA RTX 3060 Laptop GPU.
A two-stage knowledge distillation and pruning framework was developed to improve the mask representation ability of the lightweight YOLOv8n-seg model for dense, small, and partially occluded grape berries.

What are the implications of the main findings?

The proposed method provides a lightweight visual perception and thinning decision-support approach for grape berry thinning, helping convert berry-level instance segmentation results into preliminary thinning-target recommendations.
This study offers a field-oriented technical basis for future grape thinning robots by balancing segmentation accuracy, model lightweightness, and inference efficiency, although further validation on embedded platforms and real robotic thinning operations is still required.

Abstract

Berry thinning is a fundamental operation in modern vineyard management, and future robotic thinning systems have the potential to reduce labor intensity and improve operational consistency. However, automated berry thinning under field conditions is still constrained by insufficient berry-level segmentation accuracy, difficulty in recognizing occluded berries, and high missed-detection rates for small berries. These limitations mainly arise from dense berry arrangements, severe mutual occlusion, and the subtle visual features of small targets. To address these challenges, this study developed a lightweight grape berry instance segmentation and thinning decision-support method based on YOLOv8n-seg. A two-stage knowledge distillation strategy, using Mask R-CNN and YOLOv8l-seg as teacher models, was combined with 30% backbone pruning to improve the recognition of occluded and small berries while maintaining model efficiency. Subsequently, the DBSCAN clustering algorithm was used to analyze berry centroid coordinates and equivalent diameters extracted from instance segmentation masks, thereby generating preliminary thinning-target recommendations based on local berry density and berry size. The model was trained and evaluated on a self-constructed dataset containing 330 valid grape bunch images collected in 2025 from Yongming Vineyard, Lin’an District, Hangzhou, Zhejiang Province, China. The results showed that the optimized YOLOv8n-seg model achieved a box mAP50-95 of 0.8945 and a mask mAP50-95 of 0.7910, with an inference speed of 119.19 FPS and 3.26 M parameters on an NVIDIA RTX 3060 Laptop GPU. Compared with the original YOLOv8n-seg model, the optimized model improved mask mAP50-95 by 1.20 percentage points, increased inference speed by 71.79 FPS, and reduced the number of parameters by 2.38 M. These results indicate that the proposed method improves grape berry instance segmentation performance while achieving a favorable balance among segmentation accuracy, lightweight characteristics, and inference efficiency. The proposed framework provides an offline RGB-based visual perception and preliminary thinning decision-support method for future grape berry thinning robots. However, because the current dataset was collected from Shine Muscat grape bunches at the berry enlargement stage in a single vineyard using the same imaging setup, the results should be interpreted as preliminary evidence under the specific cultivar, growth stage, vineyard, and imaging conditions of this study. Further validation across different grape cultivars, growth stages, vineyards, production seasons, camera systems, embedded platforms, and real robotic thinning operations is still required.

Keywords: grape thinning; YOLOv8n-seg; knowledge distillation; instance segmentation; DBSCAN clustering; agricultural robot

1. Introduction

Grapes are a high-value economic crop cultivated extensively worldwide, and their production involves several refined horticultural practices aimed at improving fruit quality and marketability [1]. Among these practices, berry thinning is a critical operation for regulating bunch compactness, improving berry uniformity, and enhancing the commercial quality of table grapes. For compact cultivars such as Shine Muscat, thinning is usually conducted during the fruiting or berry enlargement stage according to agronomic principles such as removing small, weak, deformed, and densely distributed berries while retaining well-developed berries with uniform spatial distribution. A recent technical practice study on the refined management of Xing’an Shine Muscat grapes reported that maintaining approximately 40–60 berries per bunch after thinning was beneficial for improving bunch structure and fruit quality [2]. However, conventional manual thinning is labor-intensive, time-consuming, and highly dependent on worker experience. The subjective nature of manual operations may also lead to inconsistent thinning quality among workers and grape bunches. Inadequate thinning can result in excessive bunch compactness, uneven berry development, reduced fruit quality, and unstable commercial value [3,4]. Therefore, developing intelligent thinning technologies is of practical importance for improving the standardization and efficiency of vineyard management.

The development of intelligent grape thinning equipment has become an important demand in modern grape production. Zhou et al. noted that the practical application of fruit-harvesting robots in field environments is often limited by environmental complexity, operation cycle time, and execution success rate. They further emphasized that onboard visual perception systems should balance detection accuracy, real-time performance, and computational efficiency under practical robotic constraints [5]. As a key component of grape thinning robots, visual perception directly influences the accuracy of berry recognition and the reliability of subsequent thinning decisions. However, grape berry perception under vineyard conditions remains challenging. First, grape berries are densely distributed within bunches, and severe mutual occlusion makes it difficult to extract complete visual features from individual berries. Second, branches, leaves, vines, and berries may present similar colors and textures under natural illumination, resulting in background interference. Third, adjacent berries often exhibit closely adhered boundaries and low feature separability, which complicates accurate single-berry instance segmentation. Finally, robotic thinning requires efficient perception models with low computational cost, which places additional constraints on model lightweightness and inference speed.

Deep learning has been widely applied to fruit detection, segmentation, and counting, and instance segmentation provides an effective technical pathway for berry-level perception. Two-stage instance segmentation models, such as Mask R-CNN, have strong potential for object localization and mask representation because of their region proposal and mask prediction mechanisms [6]. However, their relatively complex inference pipelines and computational cost limit their suitability for real-time robotic perception, especially when deployed on resource-constrained platforms. In contrast, lightweight one-stage instance segmentation models, such as YOLOv8n-seg, provide higher inference efficiency and are easier to adapt to robotic vision tasks. Nevertheless, lightweight models still face limitations in complex vineyard environments, especially for small berries, occluded berries, and closely adhered berry boundaries under high Intersection over Union (IoU) evaluation conditions [7,8]. Knowledge distillation, first introduced by Hinton et al., provides a feasible strategy for transferring knowledge from a high-capacity teacher model to a lightweight student model, thereby improving the performance of compact models without substantially increasing inference cost [9]. Gou et al. further indicated that knowledge distillation has become an important technique for lightweight model compression and performance enhancement in resource-constrained visual recognition tasks [10].

Although existing YOLO-based methods have shown promising real-time performance in fruit detection and segmentation, most of them mainly focus on target recognition, bunch detection, or berry counting. Their ability to provide berry-level thinning decisions under dense occlusion remains limited. Two-stage models such as Mask R-CNN can generate high-quality masks, but their computational cost and inference latency restrict their deployment on vineyard robots. Moreover, previous studies have rarely considered the simultaneous requirements of small and occluded berry segmentation, lightweight model compression, and thinning-oriented decision-making. Therefore, there remains a need for an integrated framework that can balance segmentation accuracy, real-time inference, lightweight design, and practical thinning decision support in vineyard environments.

To address these challenges, this study developed a lightweight grape berry instance segmentation and thinning decision-support method based on YOLOv8n-seg. The proposed method integrates cross-architecture knowledge distillation from Mask R-CNN, same-architecture refinement distillation from YOLOv8l-seg, backbone pruning, and DBSCAN-based thinning decision analysis. The aim is to improve berry-level instance segmentation performance for dense and occluded grape bunches while generating preliminary thinning-target recommendations based on berry spatial density and size information. The main contributions of this study are summarized as follows:

(1) A lightweight grape berry instance segmentation framework based on YOLOv8n-seg was developed to address small-target recognition and mutual occlusion in dense grape bunches.

(2) A two-stage knowledge distillation and pruning strategy was proposed, in which Mask R-CNN and YOLOv8l-seg were used as teacher models to improve the mask representation ability of the lightweight student model while reducing model complexity.

(3) The distillation weights of bounding-box, mask, and feature supervision were systematically optimized to achieve a favorable balance among segmentation accuracy, inference speed, and model size.

(4) A DBSCAN-based thinning decision-making method was designed to convert berry-level segmentation results into preliminary and interpretable thinning-target recommendations based on local berry density and berry size, thereby linking visual perception with robotic thinning decision support.
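As a rough illustration of contribution (4), the following Python sketch shows how berry centroids and equivalent diameters could be derived from polygon masks and then passed to a density-based clustering step. It is a minimal sketch under stated assumptions: the function names, the `eps`, `min_samples`, and `small_ratio` values, and the rule "berry lies in a dense cluster and is smaller than a fraction of the mean diameter" are hypothetical and are not the parameters or the decision rule reported in this study. A compact pure-Python DBSCAN is used here only to keep the example self-contained.

```python
import math

def polygon_area_centroid(points):
    """Shoelace area and centroid of a simple polygon [(x, y), ...] with non-zero area."""
    signed = cx = cy = 0.0
    n = len(points)
    for i in range(n):
        x0, y0 = points[i]
        x1, y1 = points[(i + 1) % n]
        cross = x0 * y1 - x1 * y0
        signed += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    signed *= 0.5
    return abs(signed), (cx / (6.0 * signed), cy / (6.0 * signed))

def equivalent_diameter(area):
    """Diameter of the circle that has the same area as the mask."""
    return 2.0 * math.sqrt(area / math.pi)

def dbscan(points, eps, min_samples):
    """Minimal DBSCAN. Returns one label per point; -1 marks noise."""
    n = len(points)
    # Each neighbourhood includes the point itself.
    neighbors = [[j for j in range(n) if math.dist(points[i], points[j]) <= eps]
                 for i in range(n)]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_samples:
            labels[i] = -1          # provisional noise, may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point: joins the cluster, not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_samples:
                queue.extend(neighbors[j])  # core point: keep expanding
    return labels

def thinning_candidates(polygons, eps, min_samples, small_ratio=0.8):
    """Hypothetical rule: flag berries that are in a dense cluster AND small."""
    feats = [polygon_area_centroid(p) for p in polygons]
    centroids = [c for _, c in feats]
    diameters = [equivalent_diameter(a) for a, _ in feats]
    labels = dbscan(centroids, eps, min_samples)
    mean_d = sum(diameters) / len(diameters)
    return [i for i, (lab, d) in enumerate(zip(labels, diameters))
            if lab != -1 and d < small_ratio * mean_d]
```

In this sketch, `eps` is expressed in pixels, so it would need to be scaled to the imaging distance and image resolution before any practical use.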

2. Related Work

Early investigations into visual perception for grape-picking robots predominantly utilized traditional computer vision techniques based on color, shape, texture, and geometric constraints. Luo et al. extracted spatial information from grape bunches using binocular vision and successfully localized cutting points, thereby confirming the feasibility of conventional vision methods for grape bunch localization [11]. Jin et al. developed a far–near-view combined vision system for grape bunch and stem recognition using threshold segmentation, morphological processing, and the Hough transform. Although picking experiments validated the engineering practicality of this method, its overall detection pipeline remained dependent on handcrafted features and fixed procedures, which restricted its generalization capability under complex field conditions [12]. Kurtser et al. employed RGB-D cameras for in-field grape cluster size assessment and demonstrated that depth information could improve the accuracy of yield-related phenotyping measurements. However, their work mainly focused on bunch-level size estimation rather than instance segmentation of densely distributed individual berries, and its performance was still affected by viewing angle, background interference, and natural illumination variation [13]. Overall, traditional vision methods provide clear implementation procedures but are sensitive to environmental changes and are insufficient for robust berry-level perception in dense grape bunches.

With the advancement of deep learning, visual perception for grape robots has gradually shifted from handcrafted image-processing rules to data-driven detection, segmentation, and localization methods. Yin et al. employed Mask R-CNN to segment grape bunches and combined binocular point clouds with the Random Sample Consensus algorithm to estimate bunch pose, demonstrating the potential of instance segmentation for contour representation and 3D localization in grape-harvesting scenarios [6]. Santos et al. integrated grape bunch detection, instance segmentation, and 3D association tracking, showing that instance-level perception can support fruit counting and continuous visual tracking [14]. For picking-point localization, Wu et al. proposed a grape peduncle recognition method based on object detection and keypoint estimation, which improved the localization of peduncles but remained dependent on prior detection results [15]. Zhang et al. developed YOLOv5-GAP to improve grape bunch recognition and picking-point localization under rachis occlusion, but their method mainly focused on bunches and picking points rather than berry-level segmentation [16]. Zhao et al. proposed YOLO-GP for end-to-end synchronous detection of grape bunches and picking points, advancing grape-harvesting perception systems toward lightweight and integrated detection [17]. Nevertheless, these studies primarily addressed bunch-level detection, peduncle localization, picking-point estimation, or berry counting, while the berry-level instance segmentation and thinning-target selection required for grape thinning remain less explored.

To achieve a balance between accuracy and speed, recent research has increasingly emphasized lightweight models and real-time perception in agricultural robotic applications. Sapkota et al. compared YOLOv8 and Mask R-CNN for object segmentation in complex orchard environments and reported that YOLOv8 showed advantages in accuracy, recall, and inference speed, suggesting that single-stage segmentation frameworks are promising for real-time robotic perception [7]. However, their study focused on apple orchard scenes and did not address the specific challenges of grape berry occlusion, dense adhesion, and small-target segmentation. Recent developments in the YOLO series also indicate that real-time detection models are evolving toward improved information utilization and lower inference latency. YOLOv9 introduced programmable gradient information and the GELAN architecture to improve information preservation and parameter utilization during model training [18]. YOLOv10 further proposed a real-time end-to-end detection framework with an NMS-free design and an efficiency–accuracy driven model optimization strategy to reduce post-processing latency and computational redundancy [19]. Although these newer YOLO architectures provide useful references for real-time detection, their direct application to grape berry instance segmentation and thinning decision-making still requires task-specific adaptation. In this study, YOLOv8n-seg was selected as the baseline model because of its mature implementation, lightweight architecture, stable instance segmentation pipeline, and suitability for constructing a compact student model in knowledge distillation. The purpose of this study was not to propose a new YOLO backbone, but to improve the segmentation ability and deployment efficiency of a lightweight instance segmentation model through two-stage knowledge distillation and pruning.

For grape-specific segmentation tasks, recent studies have investigated lightweight instance segmentation and berry-level perception under complex vineyard conditions. Shen et al. proposed Multi-Scale Adaptive YOLO (MSA-YOLO) for grape pedicel instance segmentation, in which multi-scale feature fusion and shallow feature enhancement were introduced to improve segmentation performance under complex backgrounds and variable target scales [8]. However, this study focused on slender grape pedicels, whose morphology and segmentation difficulty differ substantially from densely clustered spherical berries. Du and Liu developed AS-SwinT for instance segmentation and berry counting of table grape bunches before thinning, demonstrating that transformer-based feature representation can improve the recognition of small and occluded berries in dense grape bunches [20]. Nevertheless, this method mainly focused on high-precision segmentation and counting, and its relatively complex structure may increase the computational burden for real-time robotic deployment. Woo et al. proposed a lightweight berry-number prediction method to assist table grape cultivation, indicating that grape thinning vision systems are evolving from simple berry recognition toward quantity estimation and management support [21]. However, berry-count prediction alone cannot provide precise boundaries or specific thinning targets for individual berries.

In addition to direct instance segmentation, several studies have explored grape detection and counting through structured spatial representation. Yang et al. proposed a probability map-based grape detection and counting framework, in which intermediate probability maps of grape clusters and berries were generated to support structured berry counting under field conditions [22]. This work demonstrates that spatial probability representation can improve grape counting tasks, but its objective remains detection and counting rather than direct thinning-target selection. More recently, Yang et al. proposed Mask-GK, an efficient method based on a mask Gaussian kernel for grape berry segmentation and counting in field conditions [23]. This method highlights the importance of berry-level segmentation for precision viticulture. However, its primary emphasis is still on segmentation and counting accuracy, while the conversion from perception results to thinning-target decisions is not sufficiently addressed. Therefore, although existing grape berry segmentation and counting studies provide important foundations, they do not fully solve the problem of determining which berries should be removed during thinning.

In addition to grape-related studies, research on other horticultural crops has provided methodological inspiration for lightweight segmentation and decision-oriented robotic perception. Liang et al. achieved instance segmentation and localization of tomato lateral-shoot pruning points using an improved YOLOv5 model, demonstrating that lightweight segmentation models can support subsequent robotic operation decisions [24]. Gao et al. improved the robustness of greenhouse tomato detection under complex backgrounds through anchor-box optimization and model refinement [25]. Solimani et al. further enhanced tomato plant phenotyping detection by optimizing the YOLOv8 architecture to address data complexity and target variability [26]. Liu et al. proposed Y-HRNet, a two-stage coarse detection–fine segmentation framework for multi-category cherry tomato instance segmentation, validating the effectiveness of fine segmentation in dense and complex backgrounds [27]. However, the inference efficiency of such two-stage or structurally enhanced models can still be constrained. Fatehi et al. enhanced YOLOv9t for real-time detection of bloomed Damask roses in field conditions through knowledge distillation, showing that teacher–student learning can improve lightweight agricultural detection models while maintaining real-time inference capability [28]. However, their study focused on object detection rather than instance segmentation and did not further optimize mask representation quality.

From the perspective of robotic application, visual perception results must ultimately support operation-oriented decisions rather than only provide detection or segmentation outputs. Zhou et al. noted that the commercialization of fruit-picking robots is still limited by operation success rate, cycle time, and environmental adaptability [5]. Tombe et al. also highlighted that agricultural vision systems often face challenges related to insufficient generalization and difficulties in real-time deployment under actual field conditions [29]. Recent grape robot studies have begun to combine RGB perception, depth information, and 3D localization. Shen et al. proposed a two-stage multimodal 3D point localization framework for automatic grape harvesting, which linked RGB-based segmentation, depth filtering, depth completion, and 3D operation-point localization to improve harvesting-point perception [30]. This study indicates that grape robot vision is evolving from two-dimensional recognition toward multimodal perception and three-dimensional operation-point localization. Nevertheless, its focus remained on harvesting-point localization rather than berry-level thinning-target selection. For grape thinning, the visual system must determine which individual berries should be removed according to berry size, spatial density, and local bunch structure, which is a different decision-making problem from harvesting-point localization.

In summary, existing grape vision research has progressed from traditional image processing to deep-learning-based detection, segmentation, counting, and 3D localization. However, several limitations remain [31]. First, traditional vision methods lack robustness under variable illumination, occlusion, and complex backgrounds. Second, high-capacity segmentation models can provide detailed masks but often require large computational resources, whereas lightweight models may suffer from reduced mask quality for small and occluded berries [32]. Third, recent grape berry segmentation and counting methods mainly focus on perception accuracy or counting performance, while thinning-oriented target selection is rarely incorporated. Finally, few studies have simultaneously considered lightweight berry-level instance segmentation, model compression through knowledge distillation and pruning, and decision support for berry thinning. Consequently, integrating two-stage knowledge distillation into a lightweight YOLOv8n-seg framework and combining the segmentation outputs with DBSCAN-based density analysis provides a practical approach for grape berry instance segmentation and preliminary thinning decision support.

3. Materials and Methods

3.1. Experimental Materials and Overview of the Research Scheme

Field image acquisition for this study was conducted in 2025 at Yongming Grape Plantation, Lin’an District, Hangzhou, Zhejiang Province, China. The experimental vineyard is located at approximately 30.2368° N latitude and 119.6602° E longitude. The vineyard adopted a standardized rain-sheltered cultivation system with a bird-net trellis structure. The planting layout was arranged in regular rows, with a row spacing of approximately 1.8 m and a vine spacing of approximately 0.6 m. The canopy structure, bunch distribution, and protected cultivation conditions were representative of local table grape production systems in Zhejiang Province.

At the time of image acquisition, the grapes were in the berry enlargement stage. The grape bunches exhibited natural variations in berry density, local occlusion, compactness, berry size, and illumination conditions, which were consistent with the visual challenges encountered in practical grape berry thinning operations. Shine Muscat grape bunches were selected as the experimental object because this cultivar is widely cultivated in protected grape production in China and is characterized by oval berries, yellow-green skin, and relatively high bunch compactness. The experimental vineyard environment is shown in Figure 1.

The image acquisition device used in this study was an Orbbec Gemini 2 RGB-D camera (Orbbec Inc., Shenzhen, China). The camera provides RGB and depth image acquisition functions. According to the manufacturer’s specifications, the RGB resolution can reach 1920 × 1080 pixels at 30 fps, and the depth resolution can reach 1280 × 800 pixels at 30 fps. The RGB field of view is approximately 86° horizontally and 55° vertically, while the depth field of view is approximately 91° horizontally and 66° vertically. The nominal depth sensing range is 0.15–10 m, with a depth precision of ≤2% at 2 m.

In this study, the camera was mounted on the visual acquisition module of a self-developed robot-oriented experimental platform. During image acquisition, the shooting distance was maintained within 10–30 cm, and the camera was oriented at approximately 45° toward the grape bunch to capture both the overall bunch morphology and detailed berry features. The camera was used after routine factory calibration, and no additional geometric calibration was performed because the present study focused on two-dimensional RGB-based instance segmentation rather than three-dimensional localization.
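For orientation, the scene area covered by the RGB sensor at the stated shooting distances can be estimated from the quoted field of view. The sketch below assumes an ideal pinhole camera viewing a plane perpendicular to the optical axis, which is a simplification of the approximately 45° orientation used in the field; the function name `footprint_m` is illustrative only.

```python
import math

def footprint_m(distance_m, fov_deg):
    """Extent of the imaged plane at a given distance for an ideal pinhole camera."""
    return 2.0 * distance_m * math.tan(math.radians(fov_deg) / 2.0)

# Manufacturer RGB field of view quoted in the text.
RGB_HFOV_DEG, RGB_VFOV_DEG = 86.0, 55.0

# Lower and upper bound of the 10-30 cm shooting distance.
for d in (0.10, 0.30):
    w = footprint_m(d, RGB_HFOV_DEG)
    h = footprint_m(d, RGB_VFOV_DEG)
    print(f"distance {d:.2f} m: about {w * 100:.1f} cm x {h * 100:.1f} cm")
```

Under these assumptions the footprint grows from roughly 19 cm × 10 cm at 10 cm to roughly 56 cm × 31 cm at 30 cm.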

Although the device is an RGB-D camera, only RGB images were used for model training, instance segmentation, and thinning decision analysis in this study. Depth information was not introduced into the current model because the objective of this work was to evaluate a lightweight two-dimensional berry-level visual perception and thinning decision-support method. In addition, dense berry occlusion and close-range imaging may affect the completeness and stability of depth information in grape bunch scenes. Therefore, depth information will be further incorporated in future work for three-dimensional berry localization, spatial thinning analysis, and robotic trajectory planning. The camera module used for image acquisition is shown in Figure 2.

To clarify the overall experimental procedure, the complete workflow of this study is summarized in Figure 3. The workflow includes four main stages: dataset construction, two-stage model optimization, segmentation evaluation, and thinning decision evaluation. This design connects field image acquisition and instance segmentation with subsequent thinning-target recommendation and expert-annotation-based evaluation.

3.2. Experimental Image Processing

3.2.1. Image Acquisition and Screening

To improve the representativeness of the dataset, grape bunches were sampled from different rows and positions within the experimental vineyard. The sampled bunches covered different growth conditions and compactness levels, including relatively loose, moderately compact, and compact bunches. The images used in this study were still images directly captured in the vineyard rather than video frames extracted from continuous recordings.

During image acquisition, the camera was mounted on the visual acquisition module of the robot-oriented experimental platform. The platform was repositioned between different grape bunches, while the camera was kept relatively stable during each image capture. The shooting distance was maintained within 10–30 cm, and the camera was oriented at approximately 45° toward the grape bunch. The camera viewpoint was varied across samples, including frontal, lateral, and oblique top views, to capture both overall bunch morphology and detailed berry features. For each grape bunch retained in the final dataset, only one image was used, ensuring that no grape bunch appeared repeatedly in the training, validation, or test sets.

Image acquisition was conducted under natural field illumination. Images were collected under different outdoor lighting conditions, including sunny and cloudy conditions, as well as local shadow and diffuse illumination caused by the vineyard canopy and rain-shelter structure. The camera exposure, focus, and white balance were controlled using the default automatic settings of the RGB camera. This acquisition protocol was designed to include variations in bunch morphology, berry density, mutual occlusion, canopy background, and illumination conditions.

A total of 435 grape bunch images were collected from 435 grape bunches. After image acquisition, quality screening was performed to remove invalid images with severe blur, overexposure, underexposure, incomplete bunch regions, or poor visibility of berry contours. After screening, 330 valid images corresponding to 330 different grape bunches were retained to construct the grape berry instance segmentation dataset used in this study. Each retained valid image corresponded to one grape bunch, and no grape bunch appeared repeatedly in the final dataset. Examples of original images are shown in Figure 4.

3.2.2. Image Annotation and Dataset Division

The 330 valid images were annotated using the LabelMe annotation tool for instance segmentation. Each visible grape berry was manually delineated using a polygon mask along its visible contour. All grape berries were annotated as a single class, namely “grape berry”, for model training and evaluation. The terms “small berries” and “occluded berries” in this study refer to challenging visual conditions in the dataset rather than independent semantic categories. During annotation, particular attention was paid to small berries, partially occluded berries, and berries with closely adhered boundaries to ensure that the annotated masks closely matched the visible berry contours.

The annotation was performed manually by a trained annotator using the LabelMe tool and was subsequently checked by another researcher familiar with grape berry image annotation. During the checking process, obvious contour errors, missing labels, and duplicate labels were corrected. However, formal inter-annotator agreement assessment was not conducted in the current study.
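The study does not describe how the LabelMe polygons were prepared for training. One common route, shown here as a minimal sketch, is to convert each LabelMe JSON file into the normalized polygon label format used by YOLO segmentation models (one line per instance: class index followed by x and y pairs scaled to 0–1). The function name `labelme_to_yolo_seg` and the class mapping are assumptions made for illustration.

```python
import json

def labelme_to_yolo_seg(labelme_json, class_ids):
    """Convert one LabelMe annotation (JSON text) into YOLO-seg label lines."""
    data = json.loads(labelme_json)
    width, height = data["imageWidth"], data["imageHeight"]
    lines = []
    for shape in data["shapes"]:
        # Keep only polygon shapes that have at least three vertices.
        if shape.get("shape_type", "polygon") != "polygon" or len(shape["points"]) < 3:
            continue
        cls = class_ids[shape["label"]]
        coords = []
        for x, y in shape["points"]:
            # Normalize to [0, 1] and clamp vertices drawn slightly outside the image.
            coords.append(f"{min(max(x / width, 0.0), 1.0):.6f}")
            coords.append(f"{min(max(y / height, 0.0), 1.0):.6f}")
        lines.append(" ".join([str(cls)] + coords))
    return lines

# Single-class mapping, matching the one "grape berry" class described above.
CLASS_IDS = {"grape berry": 0}
```

Each returned line would be written to a text file that shares its base name with the corresponding image.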

The dataset contained 16,461 annotated grape berry instances. Because the 330 valid images corresponded to 330 different grape bunches, each bunch appeared in only one image, so the same bunch could not be included in more than one of the training, validation, and test sets. After annotation, the dataset was divided into training, validation, and test sets at a ratio of 8:1:1, resulting in 264 training images, 33 validation images, and 33 test images. Given the relatively limited dataset size, this ratio was adopted to allocate sufficient images for model training while retaining independent validation and test sets for model selection and final performance evaluation. The training set contained 13,079 annotated berry instances, the validation set contained 1754 instances, and the test set contained 1628 instances, corresponding to averages of 49.54, 53.15, and 49.33 annotated berry instances per image, respectively. Data augmentation was conducted only after dataset division and was applied only to the training set, while the validation and test sets were kept unchanged. To further describe the dataset distribution, the number of annotated berry instances in each image was counted, and the mean, standard deviation, minimum, and maximum number of instances per image were calculated for each subset. The dataset division and annotation statistics are shown in Table 1. An example of the annotation interface is presented in Figure 5.

The instance-count distribution indicates that the training, validation, and test sets contained grape bunches with comparable berry-density levels, although natural variations in bunch compactness and berry visibility existed among individual images.

3.2.3. Data Augmentation

All images were resized to 640 × 640 pixels before model training. Data augmentation was performed after dataset division and was applied only to the training set, while the validation and test sets were not augmented. This strategy was adopted to avoid potential data leakage and to ensure that model selection and final performance evaluation were conducted on original field-collected images.

To enhance the robustness of the model under complex vineyard conditions, several data augmentation strategies were applied to the training images. These strategies included random brightness adjustment with a variation range of ±30%, random contrast adjustment with a variation range of ±20%, random horizontal flipping with a probability of 0.5, random cropping with a cropping ratio of 0.7–1.0, Gaussian blur with kernel sizes of 3 × 3 and 5 × 5, and random Gaussian noise injection with a variance range of 0.01–0.03. These operations were designed to simulate non-ideal field conditions, such as illumination variation, partial image blur, camera disturbance, local occlusion, and image noise. Examples of augmented training images are shown in Figure 6.

After data augmentation, the number of training images increased from 264 to 2112. The augmented images were used only for model training, while the original validation and test images were retained for hyperparameter tuning and final performance evaluation. This augmentation strategy helped alleviate overfitting caused by the limited dataset size and improved the adaptability of the model to variations in illumination, berry appearance, and image quality. The corresponding augmentation operations and their purposes are summarized in Table 2.
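The photometric and flipping operations listed above can be sketched with NumPy as follows. This is only an illustrative sketch under several assumptions: images are floating-point arrays scaled to [0, 1], annotations are stored as an instance-id map, and all operations are chained in a single pass. Random cropping and Gaussian blur are omitted for brevity, and the function name is hypothetical. Geometric operations such as flipping must be applied to the annotation as well as to the image.

```python
import numpy as np


def augment(image, mask, rng):
    """Apply brightness, contrast, flip, and noise augmentation.

    image: float array (H, W, 3) in [0, 1]; mask: integer instance-id map (H, W).
    Chaining all operations in one pass is an assumption of this sketch.
    """
    out = image.astype(np.float32).copy()
    out_mask = mask.copy()
    # Random brightness adjustment within +/-30%.
    out *= 1.0 + rng.uniform(-0.3, 0.3)
    # Random contrast adjustment within +/-20%, scaled about the image mean.
    mean = out.mean()
    out = (out - mean) * (1.0 + rng.uniform(-0.2, 0.2)) + mean
    # Horizontal flip with probability 0.5, applied to image and mask together.
    if rng.random() < 0.5:
        out = out[:, ::-1]
        out_mask = out_mask[:, ::-1]
    # Additive Gaussian noise with variance drawn from [0.01, 0.03].
    var = rng.uniform(0.01, 0.03)
    out = out + rng.normal(0.0, np.sqrt(var), size=out.shape)
    return np.clip(out, 0.0, 1.0), out_mask


rng = np.random.default_rng(0)
image = np.full((640, 640, 3), 0.5)
mask = np.zeros((640, 640), dtype=np.int32)
aug_image, aug_mask = augment(image, mask, rng)
```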

These augmentation operations were selected to improve the model’s tolerance to common field imaging variations while maintaining the basic morphological characteristics of grape bunches. All augmented images were generated only from the training set, and no augmented images were included in the validation or test sets.

3.3. Two-Stage Distillation Framework for Grape Berry Instance Segmentation

YOLOv8n-seg is a lightweight instance segmentation model within the YOLOv8 series. Its network architecture comprises three primary components: the backbone network, the neck feature fusion network, and the segmentation head. The backbone facilitates multi-scale feature extraction utilizing convolutional layers, C2f modules, and the SPPF module. The neck effectively fuses multi-scale features through upsampling and feature concatenation operations. The segmentation head integrates a prototype mask branch based on the multi-scale detection branches to produce the final instance segmentation results. This model not only maintains a high inference speed but also demonstrates commendable detection and segmentation performance, making it suitable as a lightweight baseline for robotic vision tasks that require efficient inference.

In natural vineyard environments, grape berries are typically densely clustered, locally occluded, closely adhered at their boundaries, and exposed to considerable variations in illumination. Consequently, the original lightweight model exhibits notable deficiencies in high-precision instance segmentation and in robustness under complex scenes. To address the trade-off between accuracy and efficiency in lightweight models for grape thinning applications, this study introduces a lightweight instance segmentation method that combines two-stage knowledge distillation with pruning optimization. The method uses YOLOv8n-seg as the student model. In the first stage, 30% channel pruning is first applied to the front convolutional layers of the backbone network to mitigate parameter redundancy and reduce computational complexity. The pruned student is then fine-tuned under the guidance of Mask R-CNN, which serves as a heterogeneous teacher model and not as the final deployed model. Its role is to provide region-proposal-based structural guidance and mask-level supervision, thereby enhancing the student’s target localization and mask representation capabilities through cross-architecture knowledge distillation. In the second stage, the pruned model from the first stage is adopted as the new student model, while YOLOv8l-seg is introduced as the teacher model to conduct same-architecture refinement distillation. By optimizing the distillation weights, the model’s detection and segmentation performance under high-IoU conditions is further enhanced. The final optimized model achieves an improved balance between accuracy and speed while maintaining the lightweight advantage of YOLOv8n-seg. The overall model workflow and structure are illustrated in Figure 7.

3.3.1. Knowledge Distillation and Pruning Guided by Mask R-CNN

Hinton et al. demonstrated that the essence of knowledge distillation lies in using the soft supervision generated by a teacher model to guide the learning process of a student model, thereby improving the performance of a lightweight model without substantially increasing its inference cost [9]. In the first stage of optimization, YOLOv8n-seg was used as the student model, while Mask R-CNN with a ResNet50-FPN backbone was adopted as the teacher model. This stage aimed to introduce cross-architecture supervision from a two-stage instance segmentation model and to reduce model redundancy through channel pruning.

Before distillation-based fine-tuning, channel pruning was applied to the front convolutional layers of the YOLOv8n-seg backbone. Specifically, the first two convolutional layers, namely model.0.conv and model.1.conv, were selected as pruning targets because they are located at the shallow feature extraction stage and have relatively simple sequential connections. To avoid structural instability caused by pruning complex feature aggregation modules, C2f modules and deeper feature fusion layers were not directly pruned in this stage.

The importance of each output channel was measured using the L1-norm of the corresponding convolutional kernel weights. For the j-th output channel, the channel importance score was calculated as:

S_{j} = \sum |W_{j}|    (1)

where W_{j} denotes the convolutional kernel weights corresponding to the j-th output channel. According to the importance scores, channels with lower scores were removed, and the remaining channels were retained. The pruning ratio was set to 30% based on a pruning-ratio sensitivity analysis. Four candidate pruning ratios, namely 10%, 20%, 30%, and 40%, were evaluated under the complete two-stage distillation framework, and the 30% pruning ratio provided the most favorable balance between segmentation accuracy and computational efficiency under the current experimental setting. To maintain computational compatibility and structural stability, the number of retained channels was adjusted to a multiple of 8, and at least 8 channels were preserved in each pruned layer. After pruning the output channels of the current convolutional layer, the corresponding Batch Normalization parameters were synchronously pruned, including the scale factor, bias, running mean, and running variance. In addition, the input channels of the subsequent convolutional layer were adjusted accordingly to ensure the consistency of feature propagation.
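The channel selection step in Equation (1) can be illustrated with the following NumPy sketch. It is not the authors' implementation: the function name is hypothetical, weights are assumed to be stored as (out_channels, in_channels, kH, kW), and rounding the retained channel count down to a multiple of 8 is an assumption, since the text states only that the count was adjusted to a multiple of 8.

```python
import numpy as np


def select_channels(weight, prune_ratio=0.3, multiple=8, min_keep=8):
    """Return sorted indices of output channels to keep, ranked by L1 norm.

    weight: array of shape (out_channels, in_channels, kH, kW).
    """
    out_channels = weight.shape[0]
    # Equation (1): L1 norm of the kernel weights of each output channel.
    scores = np.abs(weight).reshape(out_channels, -1).sum(axis=1)
    keep = int(round(out_channels * (1.0 - prune_ratio)))
    keep = (keep // multiple) * multiple  # round down (assumed direction)
    keep = max(min_keep, min(keep, out_channels))
    kept = np.argsort(-scores, kind="stable")[:keep]
    return np.sort(kept)


rng = np.random.default_rng(0)
w0 = rng.normal(size=(32, 3, 3, 3))    # current convolutional layer
bn_gamma = rng.normal(size=(32,))      # one of its BatchNorm parameters
w1 = rng.normal(size=(64, 32, 3, 3))   # subsequent convolutional layer

kept = select_channels(w0)
w0_pruned = w0[kept]                   # prune output channels
bn_gamma_pruned = bn_gamma[kept]       # prune BatchNorm parameters in sync
w1_pruned = w1[:, kept]                # prune matching input channels
print(w0_pruned.shape, w1_pruned.shape)  # (16, 3, 3, 3) (64, 16, 3, 3)
```

With 32 output channels and a 30% ratio, 22 channels would nominally be retained; under the round-down assumption this becomes 16.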

After channel pruning, the pruned YOLOv8n-seg model was further fine-tuned under the guidance of Mask R-CNN. Because Mask R-CNN and YOLOv8n-seg have different network structures and prediction paradigms, the teacher outputs were aligned before calculating the distillation losses. The bounding boxes predicted by Mask R-CNN were originally represented in the xyxy format, where xyxy denotes the coordinates of the upper-left and lower-right corners of a bounding box. These boxes were converted into the normalized cxcywh format used by YOLO-style models, where cx and cy denote the normalized center coordinates, and w and h denote the normalized width and height of the bounding box. Low-confidence teacher predictions were filtered before distillation to reduce the influence of unreliable pseudo-supervision. The remaining teacher boxes and masks were matched with student predictions according to their spatial overlap. For mask-level distillation, the teacher masks were resized to the student mask output resolution using bilinear interpolation. In addition, feature-level guidance was introduced by extracting intermediate feature responses from the student model and feature pyramid information from the teacher model.
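The box-format alignment described above can be sketched as follows. The function names and the confidence threshold of 0.5 are hypothetical; the threshold actually used to filter teacher predictions is not stated in the text.

```python
def xyxy_to_cxcywh_norm(box, img_w, img_h):
    """Convert a pixel-space (x1, y1, x2, y2) box to normalized (cx, cy, w, h)."""
    x1, y1, x2, y2 = box
    cx = (x1 + x2) / 2.0 / img_w
    cy = (y1 + y2) / 2.0 / img_h
    w = (x2 - x1) / img_w
    h = (y2 - y1) / img_h
    return cx, cy, w, h


def filter_teacher_boxes(boxes, scores, threshold=0.5):
    """Keep teacher boxes whose confidence reaches the (assumed) threshold."""
    return [b for b, s in zip(boxes, scores) if s >= threshold]


teacher_boxes = [(160, 320, 320, 480), (10, 10, 20, 20)]
teacher_scores = [0.92, 0.21]
kept = filter_teacher_boxes(teacher_boxes, teacher_scores)
print([xyxy_to_cxcywh_norm(b, 640, 640) for b in kept])
# [(0.375, 0.625, 0.25, 0.25)]
```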

The total loss in the first-stage distillation process consisted of the original YOLOv8n-seg task loss and the cross-architecture distillation loss:

L_{total}^{stage1} = L_{task} + L_{KD}^{stage1}    (2)

The first-stage distillation loss included classification, bounding-box, mask, and feature distillation terms:

L_{KD}^{stage1} = λ_{cls} L_{cls}^{KD} + λ_{box} L_{box}^{KD} + λ_{mask} L_{mask}^{KD} + λ_{feat} L_{feat}^{KD}    (3)

where L_{task} denotes the original YOLOv8n-seg segmentation loss, and L_{cls}^{KD}, L_{box}^{KD}, L_{mask}^{KD}, and L_{feat}^{KD} denote the classification, bounding-box, mask, and feature distillation losses, respectively. In this stage, the distillation weights were set as λ_{cls} = 0.1, λ_{box} = 0.4, λ_{mask} = 0.3, and λ_{feat} = 0.2.

For bounding-box distillation, the aligned teacher boxes and student box predictions were compared after format conversion. The box distillation loss combined an IoU-based term and a mean squared error term:

L_{box}^{KD} = 0.7 \cdot (1 - IoU) + 0.3 \cdot MSE(B_{s}, B_{t})    (4)

where B_{s} and B_{t} denote the student and teacher box representations, respectively. This design allowed the student model to learn both the spatial overlap information and the regression distribution of the teacher model.
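For a single matched student-teacher pair, Equation (4) can be evaluated as in the sketch below. The box representation is an assumption: both boxes are taken as normalized (x1, y1, x2, y2) tuples so that the IoU and MSE terms are on comparable scales. The function names are hypothetical.

```python
def iou_xyxy(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def box_kd_loss(student, teacher, w_iou=0.7, w_mse=0.3):
    """Equation (4): weighted sum of an IoU term and an MSE term."""
    mse = sum((s - t) ** 2 for s, t in zip(student, teacher)) / len(student)
    return w_iou * (1.0 - iou_xyxy(student, teacher)) + w_mse * mse


student_box = (0.0, 0.0, 0.5, 0.5)
teacher_box = (0.0, 0.0, 0.5, 1.0)
# IoU = 0.25 / 0.5 = 0.5 and MSE = 0.25 / 4 = 0.0625,
# so the loss is 0.7 * 0.5 + 0.3 * 0.0625 = 0.36875 (up to rounding).
print(box_kd_loss(student_box, teacher_box))
```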

For mask distillation, the teacher masks generated by Mask R-CNN were resized to the spatial resolution of the student mask prediction. The mask distillation loss was calculated using binary cross-entropy between the student mask logits and the aligned teacher mask probabilities:

L_{mask}^{KD} = BCE(M_{s}, M_{t})    (5)

where M_{s} denotes the student mask prediction and M_{t} denotes the aligned teacher mask. This term was introduced to improve the mask representation ability of the lightweight student model, especially for densely distributed and partially occluded berries.

For feature distillation, intermediate feature maps from the YOLOv8n-seg neck and feature pyramid information from Mask R-CNN were used for feature-level guidance. When the channel dimensions of the student and teacher features were inconsistent, a 1 × 1 convolutional alignment layer was used to adjust the student feature channels. The 1 × 1 alignment layer was used only during the training stage for feature distillation loss calculation and was not included in the final inference model. When their spatial resolutions differed, bilinear interpolation was used for spatial alignment. The feature distillation loss was calculated using the mean squared error between the aligned student and teacher feature maps:

L_{feat}^{KD} = MSE(F_{s}, F_{t})    (6)

where F_{s} and F_{t} represent the aligned feature maps of the student and teacher models, respectively.

The pruned and distilled YOLOv8n-seg model obtained after the first stage served as the intermediate student model for the subsequent second-stage same-architecture refinement distillation guided by YOLOv8l-seg. Through this stage, the model reduced redundant channels in the shallow backbone while retaining the basic feature extraction capability required for grape berry instance segmentation.

3.3.2. Refinement Distillation and Weight Optimization Guided by YOLOv8l-Seg

In the second stage of optimization, the intermediate pruned YOLOv8n-seg model derived from the first stage served as the student model, while YOLOv8l-seg was adopted as the teacher model. Compared with Mask R-CNN, YOLOv8l-seg has a higher degree of consistency with the student model in terms of network architecture and prediction paradigm. Therefore, the purpose of the second stage was to conduct same-architecture refinement distillation, enabling the pruned student model to further align with the teacher model in multi-scale prediction, segmentation representation, and feature response.

Gou et al. demonstrated that knowledge distillation facilitates knowledge transfer across multiple levels, including response, feature, and relation layers, thereby establishing a robust theoretical foundation for the concurrent integration of various types of distillation supervision in the optimization of lightweight models [10]. Based on this theoretical basis, this study introduced three types of distillation supervision in the second-stage refinement distillation process: bounding-box distillation, mask distillation, and feature distillation. The original YOLOv8n-seg segmentation loss was retained as the task loss, and the total training loss was defined as:

L_{total} = L_{task} + L_{KD}    (7)

L_{KD} = λ_{box} L_{box}^{KD} + λ_{mask} L_{mask}^{KD} + λ_{feat} L_{feat}^{KD}    (8)

where L_{task} denotes the original task loss of YOLOv8n-seg, L_{box}^{KD} denotes the bounding-box distillation loss, L_{mask}^{KD} denotes the mask distillation loss, and L_{feat}^{KD} denotes the feature distillation loss. The coefficients λ_{box}, λ_{mask}, and λ_{feat} represent the corresponding distillation weights.

For bounding-box distillation, the raw outputs of the detection heads from the student and teacher models were decomposed into the box regression branch and the classification branch. Since this study involved only one target class, namely grape berry, classification distillation was not enabled in this stage. The bounding-box distillation loss was calculated using the SmoothL1 loss between the box-regression outputs of the student and teacher models across paired prediction scales:

L_{box}^{KD} = \frac{1}{N} \sum_{i = 1}^{N} SmoothL1(B_{s,i}, B_{t,i})    (9)

where B_{s,i} and B_{t,i} denote the box-regression outputs of the student and teacher models at the i-th prediction scale, respectively, and N denotes the number of paired prediction scales.

For mask distillation, both the mask coefficient branch and the prototype mask branch of YOLOv8-seg were considered. The mask distillation loss was calculated as:

L_{mask}^{KD} = 0.5 \cdot SmoothL1(C_{s}, C_{t}) + 0.5 \cdot SmoothL1(P_{s}, P_{t})    (10)

where C_{s} and C_{t} represent the mask coefficients of the student and teacher models, respectively, and P_{s} and P_{t} represent the corresponding prototype mask outputs. This design allows the student model to learn not only the instance-level mask coefficients but also the shared prototype mask representation from the larger teacher model.

For feature distillation, an attention transfer strategy was adopted to avoid modifying the deployable student model structure. Feature maps from selected intermediate layers of the student and teacher models were converted into channel-independent spatial attention maps. Specifically, the squared feature responses were averaged along the channel dimension and then normalized to obtain spatial attention maps. The feature distillation loss was calculated using the mean squared error between the normalized attention maps of the student and teacher models:

A(F) = Normalize(Vec(Mean_{c}(F^{2})))    (11)

L_{feat}^{KD} = \frac{1}{K} \sum_{k = 1}^{K} MSE(A(F_{s,k}), A(F_{t,k}))    (12)

where F_{s,k} and F_{t,k} denote the feature maps of the student and teacher models at the k-th selected layer, respectively, and K denotes the number of selected feature layers. In this study, the feature responses from layers 15 and 18 were used for attention transfer. This feature-level supervision was used only during training and did not introduce additional modules into the final deployed model.
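Equations (11) and (12) can be sketched in NumPy as shown below. The use of L2 normalization is an assumption, since the text specifies only that the attention maps were normalized, and the function names are hypothetical. Because the channel dimension is averaged out, student and teacher features with different channel counts can be compared directly, provided their spatial sizes match.

```python
import numpy as np


def attention_map(feat, eps=1e-12):
    """Equation (11): channel-averaged squared response, flattened and normalized.

    feat: array of shape (C, H, W). L2 normalization is assumed here.
    """
    a = (feat ** 2).mean(axis=0).reshape(-1)
    return a / (np.linalg.norm(a) + eps)


def attention_transfer_loss(student_feats, teacher_feats):
    """Equation (12): mean over selected layers of the MSE between attention maps."""
    losses = [
        np.mean((attention_map(s) - attention_map(t)) ** 2)
        for s, t in zip(student_feats, teacher_feats)
    ]
    return float(np.mean(losses))


rng = np.random.default_rng(0)
# Two selected layers; channel counts differ but spatial sizes agree.
student = [rng.normal(size=(64, 80, 80)), rng.normal(size=(128, 40, 40))]
teacher = [rng.normal(size=(256, 80, 80)), rng.normal(size=(512, 40, 40))]
print(attention_transfer_loss(student, teacher))
```

No learnable alignment layer is needed in this formulation, which is consistent with the statement that the deployable student structure is left unmodified.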

To determine an appropriate balance among localization supervision, mask supervision, and feature-level semantic guidance, three distillation weight configurations were compared. The tested configurations were 0.15/0.50/0.10, 0.15/0.55/0.08, and 0.12/0.55/0.10 for bounding-box, mask, and feature distillation, respectively. Considering both segmentation accuracy and inference efficiency, the final configuration was set to λ_{box} = 0.12, λ_{mask} = 0.55, and λ_{feat} = 0.10.

The relatively high mask distillation weight was adopted because grape berry thinning requires accurate instance boundary representation, especially in regions with small berries, mutual occlusion, and closely adhered berry boundaries. Compared with coarse bounding-box localization, mask quality is more directly related to berry-level instance separation and subsequent thinning decision-making. Meanwhile, reducing the bounding-box distillation weight helped avoid over-constraining the localization distribution of the lightweight student model, and a moderate feature distillation weight provided additional semantic guidance without dominating the optimization process.

After second-stage refinement distillation, the final pruned and optimized YOLOv8n-seg model was obtained. This model maintained the lightweight structure of the student network while improving its mask representation ability and achieving a favorable balance between segmentation accuracy and inference efficiency.

3.3.3. Structure of the Final Optimized Model

The final optimized model remains within the YOLOv8n-seg architecture family and primarily comprises the backbone, neck, and segmentation head.

The backbone extracts hierarchical features from vineyard scene images. Using convolutional modules and C2f feature extraction modules, it progressively encodes low-level texture cues alongside high-level semantic information pertaining to grape berries. Compared with the original YOLOv8n-seg, the initial convolutional layers were pruned during the first stage, which reduces the computational cost of shallow feature extraction. The neck performs multi-scale feature fusion. Through repeated upsampling, concatenation, and feature transformation operations, it merges semantically strong deep features with higher-resolution shallow features, thereby improving the model’s capacity to represent densely distributed and locally occluded berries. The segmentation head generates multi-scale detection and segmentation results and comprises three prediction branches (P3, P4, and P5) along with a prototype mask branch for instance mask generation. Combined with the two-stage distillation, this architecture improves instance segmentation performance under complex natural conditions while preserving the real-time inference advantages characteristic of YOLOv8n-seg.

The final optimized model does not incorporate a completely new backbone or feature fusion structure. Rather, it attains an improved balance between model accuracy and lightweight design through a two-stage distillation and pruning optimization process. The model’s ultimate recognition performance is illustrated in Figure 8.

3.4. DBSCAN-Based Thinning Decision-Making Method

While the instance segmentation model effectively identifies individual grape berries, its output does not directly address the fundamental question in thinning operations: which berries should be removed. In practical vineyard production, the berries targeted for removal are primarily located in densely clustered regions within bunches, and it is essential to consider berry size to prevent the inadvertent removal of well-developed berries. Consequently, this study developed a thinning decision-making method based on DBSCAN, utilizing the output from the instance segmentation model. By integrating spatial distribution information with the morphological characteristics of the berries, the visual perception results are converted into actionable thinning operation decisions.

The overall procedure of the proposed thinning decision-making method is outlined as follows. First, an optimized instance segmentation model extracts the segmentation mask for each berry from the input image. Next, the centroid coordinates and equivalent diameter of each berry are calculated based on the mask. Subsequently, the DBSCAN algorithm performs density clustering according to the spatial distribution of berry centroids, thereby identifying locally dense berry clusters. Finally, in accordance with the agronomic principle of “prioritizing the removal of small berries,” the berries designated for removal within each dense cluster are determined.

3.4.1. Extraction of Berry Features Based on Instance Segmentation Results

The initial step in the thinning decision-making process involves extracting berry-level geometric features from the instance segmentation results. For each identified berry, the instance segmentation mask is binarized, and its outer contour is subsequently extracted. Utilizing the contour information, the centroid coordinates and equivalent diameter of each berry are then calculated.

Let the centroid coordinates of the i-th berry be:

c_{i} = (x_{i}, y_{i})    (13)

Here, x_{i} and y_{i} represent the horizontal and vertical coordinates of the berry centroid within the image. The centroid is derived from contour moment calculations and reliably indicates the spatial position of the berry in the image.

To characterize berry size, the equivalent diameter is determined from the contour area. If the contour area of the i-th berry is denoted as A_{i}, then its equivalent diameter d_{i} is defined as:

d_{i} = 2 \sqrt{\frac{A_{i}}{π}}    (14)

This definition transforms the area of an irregular contour into an equivalent circular diameter, thereby providing a compact and robust characterization of berry size. Consequently, each berry can be represented as a feature triplet:

F_{i} = \{c_{i}, d_{i}, m_{i}\}    (15)

where c_{i} is the centroid coordinate, d_{i} is the equivalent diameter, and m_{i} is the corresponding binary mask.
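The feature triplet in Equations (13) to (15) can be approximated with the NumPy sketch below. It deviates from the described procedure in one labeled respect: the centroid and area are computed from mask pixels instead of from contour moments, which gives similar but not identical values for small regions. The function name and dictionary keys are hypothetical.

```python
import math

import numpy as np


def berry_features(mask):
    """Return centroid, equivalent diameter, and binary mask for one berry.

    mask: 2-D array, nonzero inside the berry. Pixel-based area and centroid
    are used here as an assumed stand-in for contour moments.
    """
    ys, xs = np.nonzero(mask)
    area = float(xs.size)
    if area == 0:
        return None  # empty mask: no berry to describe
    centroid = (float(xs.mean()), float(ys.mean()))   # Equation (13)
    diameter = 2.0 * math.sqrt(area / math.pi)        # Equation (14)
    return {"centroid": centroid, "diameter": diameter, "mask": mask.astype(bool)}


mask = np.zeros((10, 10), dtype=np.uint8)
mask[2:6, 3:7] = 1  # a 4 x 4 pixel region, area 16
features = berry_features(mask)
print(features["centroid"], round(features["diameter"], 3))  # (4.5, 3.5) 4.514
```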

3.4.2. Density Clustering Based on DBSCAN

Following the extraction of berry features, the centroid coordinates of all berries are treated as a set of points in two-dimensional space. To identify locally dense regions that may require thinning, the DBSCAN algorithm was used for clustering analysis. As described by Ram et al. [33], DBSCAN does not require a predetermined number of clusters, can discover density clusters of arbitrary shape, and distinguishes noise points, which makes it suitable for identifying locally dense regions of grape berries.

DBSCAN offers two primary advantages for the current task. First, it does not necessitate prior specification of the number of clusters, allowing for adaptive identification of dense regions within a dataset. Second, it effectively differentiates dense clusters of berries from isolated, sparsely distributed ones, aligning closely with the agronomic characteristics associated with grape thinning.

In this study, the neighborhood radius parameter ε was adaptively determined based on the average equivalent diameter of the berries in the image. Let the average equivalent diameter of all N detected berries in the image be denoted as \bar{d}, which is calculated as:

\bar{d} = \frac{1}{N} \sum_{i = 1}^{N} d_{i}    (16)

Thus, the neighborhood radius is defined as:

ε = 1.2 \bar{d}    (17)

where the coefficient 1.2 accounts for the local spacing between berries that are in proximity but not fully overlapping in vineyard scenes, so that the centroids of adjacent berries fall within the same neighborhood.

The minimum number of neighborhood samples necessary to establish a dense cluster was defined as:

MinPts = 3    (18)

MinPts was set to 3 because three adjacent berries can form the smallest visually recognizable local dense region in the two-dimensional grape bunch image. This setting helps identify small dense berry groups while avoiding the excessive exclusion of potential thinning regions caused by a larger MinPts value.

Using the specified parameter settings, the DBSCAN algorithm categorizes each berry as either part of a valid dense cluster or as a noise point. Berries identified as noise points are deemed sparsely distributed and are consequently excluded from the thinning decision range. Only those berries that belong to dense clusters are considered as potential candidates for removal. This process converts the spatial distribution of berries in the image into several local dense clusters, thereby establishing a structured foundation for the selection of subsequent thinning targets.
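Since Scikit-learn is part of the software environment reported in this study, the clustering step can be sketched with its `DBSCAN` class as follows. The wrapper function and the toy coordinates are hypothetical. One assumption deserves attention: in Scikit-learn, `min_samples` counts the point itself, so `min_samples=3` means a core berry needs two other berries within ε; whether MinPts was interpreted this way in the study is not stated.

```python
import numpy as np
from sklearn.cluster import DBSCAN


def cluster_berries(centroids, diameters, eps_factor=1.2, min_pts=3):
    """Cluster berry centroids with an adaptive radius (Equations 16 to 18).

    Returns the label of each berry (-1 marks noise) and the radius used.
    """
    centroids = np.asarray(centroids, dtype=float)
    eps = eps_factor * float(np.mean(diameters))  # epsilon = 1.2 * mean diameter
    labels = DBSCAN(eps=eps, min_samples=min_pts).fit_predict(centroids)
    return labels, eps


# Four closely spaced berries and one isolated berry (toy pixel coordinates).
centroids = [(0, 0), (10, 0), (0, 10), (10, 10), (200, 200)]
diameters = [10, 10, 10, 10, 10]
labels, eps = cluster_berries(centroids, diameters)
print(eps, labels.tolist())  # 12.0 [0, 0, 0, 0, -1]
```

The isolated berry receives the label -1 and would therefore be excluded from the thinning decision range.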

3.4.3. Thinning Decision Rule with Priority Given to Small Berries

Following the identification of dense berry clusters, it is essential to determine which specific berries should be removed. The decision rule employed in this study adheres to the agronomic principle of “maintaining the overall bunch structure while preferentially removing small berries to alleviate local crowding.”

For each dense cluster identified by DBSCAN, the number of berries within the cluster is first counted. Let N_{k} represent the number of berries in the k-th cluster. The cluster is designated as a target cluster requiring thinning when the following condition is met:

N_{k} > T_{c}    (19)

In this study, the cluster density threshold T_{c} is set to 6; therefore, when the number of berries within a cluster exceeds six, the distribution is considered excessively dense and thinning is required.

For each target dense cluster, the number of berries to be removed is determined by the removal ratio r and calculated as:

N_{k}^{remove} = max(1, ⌊r N_{k}⌋)    (20)

In this study, the removal ratio r was set to 0.3. The lower bound of 1 in Equation (20) ensures that at least one berry is selected for removal whenever a cluster is designated as a target.

To align with agronomic practices in grape thinning, the berries within each cluster are sorted by equivalent diameter, from smallest to largest, and the first N_{k}^{remove} berries in this ranking are designated as thinning targets. This strategy embodies the principle of prioritizing the removal of smaller berries, thereby alleviating local crowding within the bunch while maximizing the retention of larger berries that exhibit superior growth status.
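The decision rule in Equations (19) and (20) can be sketched in plain Python as follows. The function name and the toy inputs are hypothetical; cluster labels are assumed to follow the common convention in which -1 marks noise points.

```python
import math
from collections import defaultdict


def select_thinning_targets(labels, diameters, t_c=6, r=0.3):
    """Return sorted indices of berries recommended for removal.

    labels: cluster label per berry (-1 for noise); diameters: equivalent
    diameter per berry. Clusters with more than t_c berries are thinned.
    """
    clusters = defaultdict(list)
    for idx, label in enumerate(labels):
        if label != -1:  # noise berries are never thinning candidates
            clusters[label].append(idx)

    targets = []
    for members in clusters.values():
        n_k = len(members)
        if n_k <= t_c:  # Equation (19) not satisfied
            continue
        n_remove = max(1, math.floor(r * n_k))  # Equation (20)
        ranked = sorted(members, key=lambda i: diameters[i])  # smallest first
        targets.extend(ranked[:n_remove])
    return sorted(targets)


# Cluster 0 has 10 berries, cluster 1 has 4 berries, and one berry is noise.
labels = [0] * 10 + [1] * 4 + [-1]
diameters = [20, 12, 18, 11, 19, 13, 17, 16, 15, 14, 9, 9, 9, 9, 8]
print(select_thinning_targets(labels, diameters))  # [1, 3, 5]
```

In this toy case only cluster 0 exceeds the threshold, so its three smallest berries are selected, while the smaller berries in cluster 1 and the noise berry are left untouched.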

The final output of the thinning decision encompasses the total number of berries identified in the image, the count of dense berry clusters, the number of berries designated for removal, and the centroid coordinates of all berries to be eliminated. These results serve dual purposes: they facilitate visualization and manual-assisted decision-making, and they can also be directly converted into spatial location information for the thinning-target recommendation mechanism.

Overall, the DBSCAN-based thinning decision-making method proposed in this study establishes an effective link between berry instance segmentation and the execution of thinning operations. It addresses the recognition challenge of identifying “where the berries are” and resolves the decision-making issue of determining “which berries should be removed.” This approach enhances the applicability of visual perception results to meet the requirements of intelligent thinning systems in actual vineyard environments.

3.5. Experimental Environment, Training Configuration, and Evaluation Metrics

Model training and offline inference evaluation were conducted on a computer equipped with an NVIDIA GeForce RTX 3060 Laptop GPU and an Intel Core i7-11800H CPU. The software environment included Windows 11, Python 3.12.4, PyTorch 2.0.0, the Ultralytics framework, and the Scikit-learn library. It should be noted that the inference speed reported in this study was obtained on this laptop GPU platform rather than on an embedded edge device. Therefore, the reported FPS reflects the computational efficiency of the proposed model under the current experimental platform, while its actual performance on embedded platforms, such as NVIDIA Jetson devices, still requires further validation.

All models were trained and evaluated using the same training, validation, and test sets described in Section 3.2.2. The validation set was used for model selection and hyperparameter tuning, while the test set was used for final performance evaluation. The main training parameters of the first-stage Mask R-CNN-guided distillation and pruning process and the second-stage YOLOv8l-seg-guided refinement distillation process are summarized in Table 3 and Table 4, respectively.

The instance segmentation performance was evaluated using box mAP50-95, mask mAP50-95, Precision, and Recall. Box mAP50-95 was used to evaluate bounding-box localization performance averaged over IoU thresholds from 0.50 to 0.95, while mask mAP50-95 was used to evaluate instance mask segmentation performance over the same IoU threshold range. Precision was used to evaluate the proportion of correctly predicted positive samples among all predicted positive samples, whereas Recall was used to evaluate the proportion of correctly detected positive samples among all ground-truth positive samples.

1. Precision quantifies the accuracy of positive sample predictions in model outputs. It is defined as the ratio of true positives (TP), which represents the number of correctly identified positive samples, to the sum of true positives and false positives (FP), the latter being the number of negative samples erroneously classified as positive.

Precision = \frac{TP}{TP + FP}    (21)

2. Recall quantifies the proportion of true positive samples that the model successfully identifies, with FN representing the number of positive samples that remain undetected.

Recall = \frac{TP}{TP + FN}    (22)

3. The F1-score is the harmonic mean of Precision and Recall, providing a comprehensive assessment of the balance between these two metrics.

F1 = 2 \times \frac{Precision \times Recall}{Precision + Recall}    (23)

4. Intersection over Union (IoU) quantifies the extent of overlap between the predicted region and the ground-truth annotated region. The Area of Overlap refers to the overlapping area between the two regions, while the Area of Union denotes the combined area of both regions.

IoU = \frac{Area of Overlap}{Area of Union}    (24)

In this study, mAP50 and mAP50-95 were calculated separately for bounding boxes and segmentation masks. Therefore, box mAP50-95 was used to evaluate detection-box localization performance, whereas mask mAP50-95 was used to evaluate instance segmentation quality.

5. mAP50 refers to the mean average precision across all categories when the intersection over union (IoU) threshold is set at 0.5. In this context, $N$ represents the total number of categories, while $AP_{i}$ denotes the average precision for the $i$-th category at the specified IoU threshold.

\mathrm{mAP50} = \frac{1}{N} \sum_{i = 1}^{N} AP_{i}, \quad \mathrm{IoU} \geq 0.5

(25)

6. mAP50-95 represents the mean average precision calculated across multiple intersection over union (IoU) thresholds, which range from 0.5 to 0.95 in increments of 0.05. In this context, $j$ signifies the $j$-th value within the IoU threshold sequence.

\mathrm{mAP50\text{-}95} = \frac{1}{10} \sum_{j = 0}^{9} \mathrm{mAP}(\mathrm{IoU} = 0.5 + 0.05 j)

(26)

7. The computational efficiency of each model was evaluated using inference speed, model parameters, and floating-point operations. Inference speed was reported as frames per second (FPS). Unless otherwise specified, the FPS reported in this study refers to pure forward inference speed under the current laptop GPU environment and does not include image preprocessing, NMS, DBSCAN clustering, or visualization time. Model complexity was described using the number of parameters and FLOPs. FPS was calculated as follows, where $t_{inference}$ is the inference time per image in milliseconds:

FPS = \frac{1000}{t_{inference}}

(27)
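As a minimal sketch of Equations (21)-(27), the following Python snippet computes the metrics from raw counts and binary masks. The function names, the nested-list mask representation, and the `map_at` callable are illustrative assumptions and are not taken from the authors' implementation; the inference time is assumed to be given in milliseconds per image, which is what the factor of 1000 in Equation (27) implies.

```python
def precision_recall_f1(tp, fp, fn):
    """Eqs. (21)-(23): precision, recall and their harmonic mean from counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def mask_iou(pred, gt):
    """Eq. (24): IoU of two equally sized binary masks (nested lists of 0/1)."""
    overlap = sum(p and g for pr, gr in zip(pred, gt) for p, g in zip(pr, gr))
    union = sum(p or g for pr, gr in zip(pred, gt) for p, g in zip(pr, gr))
    return overlap / union if union else 0.0


def map50_95(map_at):
    """Eq. (26): average mAP over the IoU thresholds 0.50, 0.55, ..., 0.95.

    `map_at` is a hypothetical callable that returns mAP at one IoU threshold.
    """
    thresholds = [0.5 + 0.05 * j for j in range(10)]
    return sum(map_at(t) for t in thresholds) / len(thresholds)


def fps_from_latency_ms(t_inference_ms):
    """Eq. (27): frames per second from per-image inference time in ms."""
    return 1000.0 / t_inference_ms
```

For example, two 2 × 3 masks that share two foreground pixels out of four in their union give `mask_iou([[1, 1, 0], [0, 1, 0]], [[1, 0, 0], [0, 1, 1]]) == 0.5`, which would count as a match at the mAP50 threshold but not at any stricter threshold.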

For the DBSCAN-based thinning decision module, Precision, Recall, F1-score, and mean absolute error (MAE) were used to compare the recommended thinning targets with expert consensus annotations. True positives (TP) denote berries selected for removal by both the DBSCAN-based method and the expert. False positives (FP) denote berries selected by the DBSCAN-based method but not by the expert, whereas false negatives (FN) denote berries selected by the expert but missed by the DBSCAN-based method. Precision, Recall, F1-score, and MAE were calculated as follows:

\mathrm{Precision} = \frac{TP}{TP + FP}

(28)

\mathrm{Recall} = \frac{TP}{TP + FN}

(29)

F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}

(30)

\mathrm{MAE} = \frac{1}{M} \sum_{j = 1}^{M} \left| N_{j}^{DBSCAN} - N_{j}^{Expert} \right|

(31)

where $M$ is the number of evaluated test images, $N_{j}^{DBSCAN}$ is the number of thinning targets recommended by the DBSCAN-based method in the $j$-th image, and $N_{j}^{Expert}$ is the number of thinning targets in the expert consensus annotation for the same image.
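The comparison in Equations (28)-(31) can be sketched as follows. This is an illustrative Python example, not the authors' code: the function name and the representation of each image as a set of berry IDs are assumptions, and TP, FP, and FN are pooled over all images before the ratios are taken, which is one plausible reading of the definitions above.

```python
def thinning_agreement(dbscan_targets, expert_targets):
    """Compare per-image sets of berry IDs recommended for removal.

    Both arguments are lists with one set of berry IDs per image.
    Returns pooled precision, recall, F1 and the count-level MAE (Eq. 31).
    """
    if not dbscan_targets or len(dbscan_targets) != len(expert_targets):
        raise ValueError("need the same non-zero number of images in both lists")
    tp = fp = fn = 0
    abs_count_errors = []
    for pred, ref in zip(dbscan_targets, expert_targets):
        tp += len(pred & ref)   # selected by both
        fp += len(pred - ref)   # selected by the DBSCAN-based method only
        fn += len(ref - pred)   # selected by the expert consensus only
        abs_count_errors.append(abs(len(pred) - len(ref)))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    mae = sum(abs_count_errors) / len(abs_count_errors)
    return precision, recall, f1, mae
```

Note that MAE compares only the number of recommended berries per image, so two images with identical counts but different selected berries contribute zero to MAE while still lowering Precision and Recall.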

4. Experimental Results and Analysis

4.1. Training Process and Model Convergence

All model training, validation, and offline inference evaluations were conducted under the experimental settings described in Section 3.5. The training configurations for the two-stage optimization process are summarized in Table 3 and Table 4. In the first stage, the pruned YOLOv8n-seg student model was optimized under Mask R-CNN-guided cross-architecture distillation. In the second stage, the first-stage model was further refined using YOLOv8l-seg-guided same-architecture distillation.

During training, the loss curves and evaluation metrics showed a stable convergence trend, indicating that the two-stage distillation and pruning framework could be effectively optimized under the current dataset and training settings. The first-stage optimization mainly aimed to reduce model redundancy while maintaining the basic feature extraction capability of the lightweight student model. The second-stage refinement further improved the segmentation representation ability of the student model through same-architecture teacher supervision. The comparative performance of the baseline model, first-stage model, and final optimized model is analyzed in the following sections.

4.2. Comparative Analysis Among Baseline, Reference, and Optimized Models

To assess the efficacy of the proposed two-stage knowledge distillation and pruning framework, the final optimized model was compared with the original lightweight model and with YOLOv8-seg models of various scales. The results indicated that, although the lightweight YOLOv8n-seg offers high inference efficiency, its detection and segmentation performance in complex grape bunch scenarios under high-IoU conditions leaves considerable room for improvement. Conversely, larger or two-stage models, such as YOLOv8l-seg, YOLOv8x-seg, and Mask R-CNN, can offer advantages in segmentation accuracy; however, their larger parameter size, higher computational cost, or more complex inference pipelines make them less suitable as final lightweight deployment-oriented models.

To tackle this issue, the current study enhances instance segmentation performance while maintaining the lightweight attributes of YOLOv8n-seg through a two-stage distillation and pruning optimization process. The model refined in the first stage attained a superior speed-accuracy balance compared to the original lightweight model. Building on this foundation, the second stage further improved the student model’s capacity for multi-scale feature representation and segmentation expression through same-architecture refinement distillation utilizing YOLOv8l-seg.

Experimental results demonstrated that, following second-stage distillation and weight optimization, the final model exhibited consistent improvements in both box mAP50-95 and mask mAP50-95. With the optimal weight configuration (bounding box 0.12, mask 0.55, feature 0.10), the model attained its highest overall performance, achieving a box mAP50-95 of 0.8945, a mask mAP50-95 of 0.7910, and a pure inference FPS of 119.19. These findings suggest that the proposed method effectively balances detection accuracy, segmentation accuracy, and inference efficiency without substantially increasing model complexity. The comparative results are presented in Table 5 and Table 6.

RT-DETR-L was included only as a detection-oriented reference model to provide an additional comparison of bounding-box accuracy and inference efficiency. Since RT-DETR-L does not generate instance masks, it was not included in the mask-level segmentation comparison, and its mask mAP50-95 value is therefore not reported.

Therefore, the main instance segmentation comparison focused on YOLOv8-seg series models, Mask R-CNN, and the optimized YOLOv8n-seg student models.

It should be noted that the final optimized model was selected according to the best trade-off between segmentation accuracy and deployment efficiency rather than accuracy alone. In this study, the comparison focused on the deployed student models under a unified inference setting. Therefore, the reported gains in FPS and parameter reduction should be interpreted together with the deployment configuration and pruning status of each model. The relationship between the number of model parameters and inference speed is shown in Figure 9.

Mask R-CNN was included as a two-stage instance segmentation reference model and as the heterogeneous teacher model used in the first-stage distillation. After retraining under the revised training configuration, Mask R-CNN achieved a box mAP50-95 of 0.9317 and a mask mAP50-95 of 0.8224 on the test set, indicating that it could provide effective region-proposal-based structural guidance and mask-level supervision for the lightweight YOLOv8n-seg student model. Although Mask R-CNN achieved the highest mask mAP50-95 among the compared models, its two-stage inference pipeline, larger computational cost, and lower inference speed make it less suitable as the final lightweight deployment-oriented model. Therefore, Mask R-CNN was used in this study as a heterogeneous teacher model rather than as the final deployed model.

In contrast to the approach of directly selecting a high-capacity two-stage model or simply increasing model scale, the proposed method focuses on optimizing a lightweight YOLOv8n-seg student model through staged distillation and pruning. The effectiveness of the first-stage optimization was evaluated by comparing the first-stage model with the original YOLOv8n-seg baseline, while the final optimized model was selected based on the trade-off among segmentation accuracy, inference speed, parameter size, and computational efficiency.

4.3. Stability Evaluation Under Different Random Seeds

To further evaluate the stability of the proposed optimization strategy under the current dataset split, repeated training experiments were conducted using three different random seeds. The original YOLOv8n-seg baseline model and the final optimized model were trained and evaluated using the same training, validation, and test sets, while only the random seed was changed. The random seeds were set to 0, 42, and 3407. The box mAP50-95, mask mAP50-95, Precision, and Recall values were recorded for each run, and the mean and standard deviation were calculated. The results are summarized in Table 7. The three-seed experiment was used as a stability assessment under repeated training conditions and was not intended to replace the best-performing model result reported in Table 5.

As shown in Table 7, the final optimized model achieved a mean box mAP50-95 of 0.8933 ± 0.0009 and a mean mask mAP50-95 of 0.7890 ± 0.0031, which were higher than the 0.8918 ± 0.0024 and 0.7773 ± 0.0027 obtained by the original YOLOv8n-seg baseline, respectively. The final optimized model also achieved a higher mean precision of 0.9502 ± 0.0011 compared with 0.9460 ± 0.0027 for the baseline model. These results indicate that the proposed two-stage distillation and pruning strategy improved box localization, mask-level segmentation, and prediction precision on average across different random initialization conditions, with lower run-to-run variation in box mAP50-95 and precision and comparable variation in mask mAP50-95.
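The "mean ± standard deviation" summary used for the three-seed runs can be reproduced with a few lines of Python. The sketch below is illustrative only: the per-seed values are hypothetical placeholders rather than results from this study, and it assumes the sample standard deviation (divisor n − 1), which the text does not specify.

```python
from statistics import mean, stdev


def summarize_runs(values):
    """Return (mean, sample standard deviation) of one metric across seeds."""
    return mean(values), stdev(values)


# Hypothetical per-seed mask mAP50-95 values for seeds 0, 42 and 3407.
# These numbers are placeholders and are NOT taken from the paper.
runs = [0.78, 0.79, 0.80]
m, s = summarize_runs(runs)
print(f"{m:.4f} ± {s:.4f}")
```

With only three runs, the choice between the sample and population standard deviation changes the reported spread by a factor of about 1.22, so it is worth stating which one is used when comparing against Table 7.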

However, the mean Recall of the final optimized model was 0.9236 ± 0.0027, which was slightly lower than the 0.9305 ± 0.0061 of the baseline model. This suggests that the proposed optimization strategy mainly improved segmentation accuracy and prediction precision, while a small trade-off in detection completeness was observed. Overall, the repeated training results indicate that the final optimized model maintained relatively stable performance across different random seeds, especially in terms of mask mAP50-95 and Precision. Nevertheless, because the current evaluation was conducted using only three random seeds under a fixed dataset split, the results should be interpreted as stability evidence under the current experimental setting rather than statistically significant conclusions. Future work will include k-fold cross-validation, broader multi-seed evaluation, and formal statistical significance testing on larger and more diverse datasets.

4.4. Ablation Analysis of Pruning and Knowledge Distillation Strategies

4.4.1. Component-Level Ablation of Pruning and Knowledge Distillation

To further analyze the contribution of each optimization component, ablation experiments were conducted by comparing the original YOLOv8n-seg baseline, the pruning-only model, the Mask R-CNN distillation-only model, the YOLOv8l-seg distillation-only model, the first-stage model, the Pruning + YOLOv8l-seg KD model, and the final optimized model. The pruning-only model was used to evaluate the effect of channel pruning alone. The Mask R-CNN distillation-only model was used to evaluate the contribution of cross-architecture knowledge distillation without pruning. The YOLOv8l-seg distillation-only model was used to evaluate whether direct same-architecture distillation from YOLOv8l-seg could improve the original lightweight student model. The first-stage model represented the combined effect of early backbone pruning and Mask R-CNN-guided distillation. The Pruning + YOLOv8l-seg KD model was used to determine whether direct same-architecture distillation after pruning could replace the proposed two-stage optimization strategy. The final optimized model further introduced YOLOv8l-seg-guided refinement distillation on the basis of the first-stage model. All model variants were trained and evaluated using the same dataset split and experimental settings described in Section 3.5. The results are shown in Table 8.

Compared with the original YOLOv8n-seg baseline, the pruning-only model reduced the number of parameters from 5.64 M to 3.26 M and increased FPS from 47.40 to 51.89, indicating that channel pruning improved model compactness and inference efficiency. Its mask mAP50-95 increased from 0.7790 to 0.7861, whereas box mAP50-95 and recall slightly decreased. This suggests that pruning alone can improve lightweight characteristics and maintain acceptable segmentation performance, but it may also lead to a slight loss in detection completeness.

The Mask R-CNN KD-only model achieved a box mAP50-95 of 0.9015 and a mask mAP50-95 of 0.7891, both higher than those of the baseline model. This result indicates that heterogeneous knowledge distillation from Mask R-CNN can provide useful region-proposal-based structural guidance and mask-level supervision for the YOLOv8n-seg student model. However, because the model structure was not pruned, its parameter size remained 5.64 M, and its inference speed decreased to 42.67 FPS.

The YOLOv8l-seg KD-only model achieved the highest box mAP50-95 of 0.9035 and mask mAP50-95 of 0.7915 among the ablation variants, indicating that same-architecture distillation from a larger YOLOv8l-seg teacher can effectively improve segmentation accuracy. However, this variant retained the original YOLOv8n-seg model size and did not provide the same level of lightweight compression as the pruning-related models.

The Pruning + YOLOv8l-seg KD model was further evaluated to determine whether direct same-architecture distillation after pruning could replace the proposed two-stage optimization strategy. This variant achieved a box mAP50-95 of 0.8874, a mask mAP50-95 of 0.7871, a precision of 0.9430, and a recall of 0.9248, with 3.26 M parameters. Compared with the pruning-only model, this variant slightly improved mask mAP50-95 and Recall, indicating that YOLOv8l-seg-guided distillation could partially compensate for the representation loss caused by pruning. However, its box mAP50-95, mask mAP50-95, and precision were still lower than those of the final optimized model. This result suggests that directly applying YOLOv8l-seg distillation after pruning was not sufficient to achieve the best balance, and that Mask R-CNN-guided first-stage distillation provided useful structural and mask-level guidance before the second-stage refinement.

Overall, the ablation results indicate that pruning, Mask R-CNN-guided distillation, and YOLOv8l-seg-guided refinement distillation contributed differently to the final model performance. Pruning reduced model parameters and improved lightweight characteristics, but it could also cause a slight decrease in box localization and recall. Mask R-CNN KD-only and YOLOv8l-seg KD-only improved segmentation accuracy without reducing model size, whereas pruning-related variants improved compactness. The direct Pruning + YOLOv8l-seg KD model showed that same-architecture distillation after pruning could improve mask performance to some extent, but it did not outperform the final two-stage model. Therefore, the proposed two-stage distillation and pruning strategy provided a more favorable trade-off among segmentation accuracy, model compactness, and inference efficiency than using pruning or single-teacher distillation alone.

4.4.2. Sensitivity Analysis of Pruning Ratio

To further examine the influence of pruning intensity, a pruning-ratio sensitivity analysis was conducted under the complete two-stage distillation framework. Four pruning ratios, namely 10%, 20%, 30%, and 40%, were evaluated using the same dataset split, training configuration, and evaluation protocol. As shown in Table 9, increasing the pruning ratio from 10% to 30% gradually improved both segmentation accuracy and inference efficiency. The 30% pruning ratio achieved the highest Box mAP50-95, Mask mAP50-95, precision, and recall, with values of 0.8945, 0.7910, 0.9507, and 0.9243, respectively. Compared with the 10% and 20% settings, the 30% pruning ratio may have removed more redundant low-importance channels and improved the compactness of feature representation, thereby allowing the subsequent two-stage distillation process to guide the lightweight model more effectively.

When the pruning ratio was further increased to 40%, the FPS increased to 142.75 and the FLOPs decreased to 5.67 G. However, the Box mAP50-95, Mask mAP50-95, Precision, and Recall decreased to 0.8896, 0.7861, 0.9478, and 0.9205, respectively. This indicates that excessive pruning may weaken feature representation and reduce segmentation performance, especially for dense and partially occluded grape berries. Therefore, the 30% pruning ratio was selected in this study because it provided the most favorable balance between segmentation accuracy and computational efficiency under the current experimental setting.

4.4.3. Effect of Distillation Weight Configuration

To further investigate the impact of various distillation weight configurations on the performance of the second-stage model, systematic ablation experiments were performed focusing on three supervision terms: bounding-box distillation, mask distillation, and feature distillation. The results are presented in Table 10 and Table 11.

The experimental results indicate that the baseline configuration (bounding box 0.20, mask 0.45, feature 0.15) demonstrated satisfactory initial performance regarding Box mAP50-95; however, there remained potential for improvement in Mask mAP50-95. Following the adjustment of the distillation weights to 0.15/0.50/0.10, Mask mAP50-95 rose to 0.7903. This improvement suggests that a moderate increase in the mask distillation weight, coupled with a reduction in the bounding-box distillation weight, enhances the model’s capacity to represent instance boundaries.

Upon adjusting the weights to 0.15/0.55/0.08, the model attained the highest Mask mAP50-95 of 0.7922, demonstrating that a mask-oriented distillation configuration is more effective for the grape berry instance segmentation task. However, this configuration resulted in a decrease in pure inference FPS to 103.57, indicating that while segmentation accuracy improved, overall inference efficiency suffered.

Considering both accuracy and efficiency, this study identified 0.12/0.55/0.10 as the optimal configuration for second-stage distillation. Under this configuration, the model achieved the highest Box mAP50-95 of 0.8945, a Mask mAP50-95 of 0.7910, and a pure inference FPS of 119.19, thereby yielding the best overall performance. These results suggest that, for grape berry instance segmentation, appropriately reducing the bounding-box distillation weight, increasing the mask distillation weight, and maintaining a moderate feature distillation weight can effectively balance detection accuracy, segmentation quality, and inference efficiency.
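To make the role of the three weights concrete, the sketch below combines the bounding-box, mask, and feature distillation terms as a simple weighted sum and reports how much of the total weight falls on the mask term. The weighted-sum form, the class and function names, and the omission of the ordinary task loss are all assumptions made for illustration; only the weight values are taken from the configurations discussed above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DistillWeights:
    box: float
    mask: float
    feature: float


def distillation_loss(l_box, l_mask, l_feat, w):
    """Weighted sum of the three distillation terms (assumed form)."""
    return w.box * l_box + w.mask * l_mask + w.feature * l_feat


def mask_share(w):
    """Fraction of the total distillation weight assigned to the mask term."""
    return w.mask / (w.box + w.mask + w.feature)


BASELINE = DistillWeights(box=0.20, mask=0.45, feature=0.15)   # 0.20/0.45/0.15
SELECTED = DistillWeights(box=0.12, mask=0.55, feature=0.10)   # 0.12/0.55/0.10

for name, w in (("baseline", BASELINE), ("selected", SELECTED)):
    print(name, round(mask_share(w), 3))
```

Under this reading, the mask term carries roughly 56% of the distillation weight in the baseline configuration and roughly 71% in the selected one, which is consistent with describing the selected configuration as mask-oriented.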

Overall, optimizing the distillation weight is a critical factor in enhancing the performance of the second-stage model. In contrast to incorporating additional modules or designing extra loss functions, a well-configured distillation weight can more effectively leverage the benefits of the proposed method in lightweight instance segmentation tasks.

4.5. Visualization Results and Thinning Decision Evaluation

To intuitively assess the application performance of the proposed method in practical grape thinning scenarios, we present visualization examples of the instance segmentation results alongside the thinning decision outcomes derived from DBSCAN.

To visually assess the performance of various models in grape berry instance segmentation, three representative images of grape bunches were selected for comparison. The analysis included the original YOLOv8n-seg, the first-stage distilled and pruned model, the second-stage distillation baseline model, and the final optimized model, with results illustrated in Figure 10. Overall, the original YOLOv8n-seg effectively performed basic segmentation for most visible berries. However, in areas characterized by dense berry arrangements, mutual occlusion, and strong interference from branches and leaves in the background, challenges such as inadequate instance separation and unstable boundaries persisted. Following the first-stage cross-architecture knowledge distillation and backbone pruning, the model exhibited more concentrated feature responses in the primary bunch region, leading to notable improvements in segmentation results in certain local areas. The introduction of second-stage same-architecture refinement distillation further enhanced the continuity of segmentation and improved the representation of local details in densely populated berry regions.

In contrast, the final optimized model demonstrated enhanced stability in instance separation performance across various test samples. Specifically, in scenarios featuring closely contacted adjacent berries, irregular berry arrangements, and complex natural backgrounds, the optimized model effectively maintained target contours and minimized segmentation confusion in localized areas. These visual outcomes align closely with the quantitative experimental findings presented earlier, indicating that the two-stage distillation and pruning strategy improved the practical performance of the lightweight instance segmentation model under the current vineyard imaging conditions.

Although the quantitative results demonstrate the overall performance of the optimized model, small berries, occluded berries, and closely adhered berries remain important visual challenges in grape berry instance segmentation. Therefore, representative visualization results were further selected to qualitatively analyze the performance and limitations of the final optimized model under small-target, occlusion, and dense-adhesion conditions.

4.5.1. Qualitative Analysis of Small-Target, Occlusion, and Dense-Adhesion Cases

Although all grape berries were annotated as a single semantic class in this study, small berries and occluded berries represent important visual challenges in grape berry instance segmentation. Therefore, representative visualization results were used to qualitatively analyze the performance of the final optimized model under small-target, occlusion, and dense-adhesion conditions.

As shown in Figure 11, the final optimized model produced relatively complete masks for most visible berries under leaf occlusion, branch occlusion, and dense berry adhesion. In particular, the model maintained clear mask coverage for partially occluded berries and small berries located near the bunch edges, indicating that the two-stage distillation strategy improved the mask representation ability of the lightweight student model.

However, several failure cases were still observed. When adjacent berries had highly similar colors and extremely weak boundary contrast, the predicted masks occasionally became incomplete or merged with neighboring berries. In addition, small berries that were severely occluded by leaves, branches, or adjacent berries were sometimes missed. These results indicate that the proposed model improves the segmentation of small and occluded berries to some extent, but fine-grained quantitative evaluation remains limited because small and occluded berries were not annotated as independent categories. Future work will establish attribute-level annotations for small berries, occluded berries, and severely adhered berries to quantitatively evaluate model robustness under different visual difficulty levels.

4.5.2. Visualization of DBSCAN-Based Thinning Decision Results

To assess the effectiveness of the proposed thinning decision-making method, we analyzed the spatial distribution of berries within grape bunches using the DBSCAN density clustering algorithm. This analysis was based on berry centroid coordinates and size information obtained from instance segmentation, leading to the generation of thinning decision results, as depicted in Figure 12. Specifically, Figure 12a illustrates the distribution of berry centroids in a two-dimensional space alongside the clustering-based thinning selection results; blue points denote retained berries, while red crosses indicate berries identified for removal. Figure 12b provides a visualization of the corresponding thinning decision on the grape bunch image, with red-highlighted regions marking the targets designated for removal.

The results indicate that the proposed method effectively identifies locally overcrowded regions within grape bunches against complex natural backgrounds and generates agronomically interpretable thinning decisions based on the principle of “prioritizing the removal of small berries.” In this instance, a total of 62 valid berries were identified, of which 16 were designated as targets for removal. This finding suggests that the method successfully integrates berry spatial density features with individual size differences, thereby offering robust decision support for subsequent intelligent thinning-target recommendation.

4.5.3. Sensitivity Analysis of DBSCAN Thinning-Decision Parameters

To further evaluate whether the DBSCAN-based thinning decision was overly sensitive to manually selected parameters, a parameter sensitivity analysis was conducted using the 330 valid grape bunch images. Four key parameters were examined: the neighborhood coefficient α in $\varepsilon = \alpha \bar{d}$, MinPt, the dense-cluster threshold $T_{c}$, and the removal ratio $r$. During the analysis, one parameter was varied at a time while the remaining parameters were kept at their default values. The default parameter combination was α = 1.2, MinPt = 3, $T_{c}$ = 6, and $r$ = 0.3. The number of dense clusters and the number of recommended thinning targets were used to evaluate the influence of parameter variation on the thinning decision.
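The way these four parameters interact can be illustrated with a small, self-contained Python sketch. It is not the authors' implementation, and several details are assumptions made only for illustration: $\bar{d}$ is taken as the mean nearest-neighbor distance between berry centroids, a cluster counts as dense when it contains at least $T_{c}$ berries, the number of berries removed from a dense cluster is the floor of $r$ times the cluster size, and the smallest berries (by mask area) are removed first. The functions `dbscan` and `recommend_thinning` are hypothetical names.

```python
import math


def dbscan(points, eps, min_pts):
    """Plain DBSCAN: returns a cluster id per point, or -1 for noise."""
    n = len(points)
    neighbors = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_pts:   # not a core point (for now: noise)
            labels[i] = -1
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:           # noise reached from a core point
                labels[j] = cluster       # becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_pts:
                queue.extend(neighbors[j])
    return labels


def recommend_thinning(centroids, areas, alpha=1.2, min_pts=3, t_c=6, r=0.3):
    """Return indices of berries recommended for removal (assumed rule)."""
    n = len(centroids)
    if n < 2:
        return []
    # ASSUMPTION: d_bar is the mean nearest-neighbor centroid distance.
    nearest = [
        min(math.dist(centroids[i], centroids[j]) for j in range(n) if j != i)
        for i in range(n)
    ]
    eps = alpha * sum(nearest) / n
    labels = dbscan(centroids, eps, min_pts)
    remove = []
    for c in set(labels) - {-1}:
        members = [i for i, lab in enumerate(labels) if lab == c]
        if len(members) < t_c:            # ASSUMPTION: dense means >= t_c berries
            continue
        k = int(r * len(members))         # ASSUMPTION: floor of r * cluster size
        members.sort(key=lambda i: areas[i])
        remove.extend(members[:k])        # smallest berries first
    return sorted(remove)
```

In this sketch, lowering `alpha` shrinks the neighborhood radius and tends to fragment clusters below `t_c`, while `r` scales the number of removals directly, which mirrors the qualitative behavior examined in the sensitivity analysis.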

As shown in Table 12, the DBSCAN-based thinning decision showed different levels of sensitivity to the four parameters. When α decreased from 1.2 to 1.0, the neighborhood radius became smaller, which split berry distributions into more local clusters. Although the number of dense clusters increased from 477 to 715, the total number of recommended thinning targets decreased by 41.47%, indicating that an excessively small neighborhood radius may lead to fragmented clustering and insufficient thinning recommendations. When α increased to 1.4, adjacent berries were more likely to be merged into larger clusters, and the number of thinning targets increased by 9.48%. Therefore, α = 1.2 provided a moderate neighborhood radius for identifying local dense berry regions.

The influence of MinPt was relatively limited when it varied from 2 to 3, as both settings produced the same number of dense clusters and thinning targets. Increasing MinPt to 4 made the density requirement more restrictive and reduced the total number of thinning targets by 16.17%. The dense-cluster threshold $T_{c}$ showed only a small influence on the thinning results within the tested range. Compared with the default value of $T_{c}$ = 6, setting $T_{c}$ to 5 and 7 changed the number of thinning targets by only +0.98% and −0.72%, respectively, indicating that the thinning decision was relatively stable around the selected $T_{c}$ value.

The removal ratio $r$ directly controlled the thinning intensity. When $r$ decreased from 0.3 to 0.2, the number of thinning targets decreased by 34.68%, whereas increasing $r$ to 0.4 increased the number of thinning targets by 34.98%. This result indicates that $r$ is the most direct parameter affecting the final number of berries selected for removal. Overall, the default parameter combination α = 1.2, MinPt = 3, $T_{c}$ = 6, and $r$ = 0.3 produced a moderate thinning intensity and avoided overly conservative or excessive thinning recommendations under the tested conditions.
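The one-parameter-at-a-time design described in this subsection can be sketched as below. The parameter values are those mentioned in the text; the dictionary keys and function names are illustrative assumptions, and the sketch only enumerates the configurations and computes relative changes rather than rerunning the thinning decision itself.

```python
DEFAULTS = {"alpha": 1.2, "min_pt": 3, "t_c": 6, "r": 0.3}
GRID = {
    "alpha": [1.0, 1.2, 1.4],
    "min_pt": [2, 3, 4],
    "t_c": [5, 6, 7],
    "r": [0.2, 0.3, 0.4],
}


def one_at_a_time(defaults, grid):
    """Yield (name, value, config) with exactly one parameter off its default."""
    for name, values in grid.items():
        for value in values:
            if value == defaults[name]:
                continue
            yield name, value, {**defaults, name: value}


def relative_change(new, base):
    """Percentage change of `new` relative to the default-setting value `base`."""
    return 100.0 * (new - base) / base


for name, value, config in one_at_a_time(DEFAULTS, GRID):
    print(name, value, config)

# Dense clusters rose from 477 (alpha = 1.2) to 715 (alpha = 1.0).
print(round(relative_change(715, 477), 1))  # 49.9
```

This yields eight non-default configurations, two per parameter, and shows that the increase in dense clusters at α = 1.0 corresponds to about 49.9% even though the number of thinning targets fell by 41.47%.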

4.5.4. Quantitative Evaluation of DBSCAN-Based Thinning Decision

To quantitatively evaluate the reliability of the DBSCAN-based thinning decision module, the 33 test images were independently annotated by three experts according to grape thinning principles. For each numbered grape berry image, each expert selected berries that should be preferentially removed based on local berry density, berry size, bunch compactness, and spatial distribution. A consensus expert annotation was then generated using a majority-voting strategy. A berry was regarded as an expert-selected thinning target if it was selected by at least two of the three experts. The DBSCAN-recommended thinning targets were compared with the consensus expert annotations using Precision, Recall, F1-score, and MAE.

The three experts selected 537, 536, and 537 thinning targets, respectively, indicating similar count-level judgment among experts. The majority-voting consensus annotation contained 533 thinning targets. The average pairwise F1-score and Jaccard index among the three experts were 0.834 ± 0.067 and 0.721 ± 0.093, respectively, suggesting a reasonable level of inter-expert agreement for thinning-target annotation.
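The majority-voting consensus and the pairwise agreement measures can be sketched as follows. This is an illustrative Python example under stated assumptions: each expert's annotation is represented as a set of berry IDs, the pairwise F1-score is computed as the Dice overlap of two selections, and the scores are averaged over expert pairs. How the reported means and standard deviations were aggregated (per image or per pair) is not specified in the text, so the averaging here is only one possibility.

```python
from itertools import combinations


def majority_consensus(selections, min_votes=2):
    """Berries chosen by at least `min_votes` experts (sets of berry IDs)."""
    votes = {}
    for chosen in selections:
        for berry in chosen:
            votes[berry] = votes.get(berry, 0) + 1
    return {berry for berry, count in votes.items() if count >= min_votes}


def pairwise_agreement(selections):
    """Mean pairwise F1 (Dice) and Jaccard index between expert selections."""
    f1_scores, jaccards = [], []
    for a, b in combinations(selections, 2):
        inter, union = len(a & b), len(a | b)
        f1_scores.append(2 * inter / (len(a) + len(b)) if a or b else 1.0)
        jaccards.append(inter / union if union else 1.0)
    return sum(f1_scores) / len(f1_scores), sum(jaccards) / len(jaccards)


# Toy example with hypothetical berry IDs for three experts.
experts = [{1, 2, 3, 4}, {2, 3, 4, 5}, {3, 4, 5, 6}]
print(sorted(majority_consensus(experts)))  # [2, 3, 4, 5]
print(pairwise_agreement(experts))
```

For a single pair of selections, the two measures are linked by F1 = 2J / (1 + J), so a Jaccard index near 0.72 corresponds to an F1-score near 0.84, in line with the magnitudes reported above.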

As shown in Table 13, the DBSCAN-based thinning decision recommended 544 berries for removal from 33 test images, while the expert consensus annotation identified 533 berries as thinning targets. Among them, 411 berries were consistently selected by both the DBSCAN-based method and the expert consensus annotation. The proposed method achieved a Precision of 0.756, a Recall of 0.771, and an F1-score of 0.763, with an MAE of 1.48 berries per image. These results indicate that the DBSCAN-based thinning decision showed reasonable consistency with the three-expert consensus annotation under the current test conditions and could provide preliminary thinning decision-support outputs.
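The reported Precision, Recall, and F1-score follow directly from the three counts given above, as the short check below shows. The variable names are illustrative; the counts are taken from the text.

```python
tp = 411        # berries selected by both DBSCAN and the expert consensus
n_dbscan = 544  # berries recommended by the DBSCAN-based method (TP + FP)
n_expert = 533  # berries in the expert consensus annotation (TP + FN)

fp = n_dbscan - tp   # 133 berries recommended only by DBSCAN
fn = n_expert - tp   # 122 berries selected only by the experts

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(round(precision, 3), round(recall, 3), round(f1, 3))  # 0.756 0.771 0.763
```

In other words, about one in four DBSCAN recommendations (133 of 544) was not confirmed by the expert consensus, and a similar share of expert-selected berries (122 of 533) was missed.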

The inter-expert agreement results for thinning-target annotation are summarized in Table 14.

However, the DBSCAN-based recommendations should still not be interpreted as fully validated agronomic thinning prescriptions. Although three experts were included in the revised evaluation, the test set remained limited to 33 images, and real field thinning trials were not conducted. In addition, thinning-target selection may vary among experts because of differences in cultivar characteristics, target yield, bunch compactness, fruit maturity, and production management objectives. Therefore, further validation using larger test sets, additional cultivars, different growth stages, multi-scenario field images, inter-seasonal field data, and real field thinning trials is still required.

4.6. Discussion

The experimental results of this study demonstrate that, in the context of grape berry instance segmentation, enhancing the performance of lightweight models does not necessarily require the development of complex new modules. Instead, meaningful improvements can be achieved through the optimization of training strategies. The first-stage cross-architecture distillation and backbone pruning allowed the model to attain strong baseline performance while maintaining a lightweight structure. Subsequently, the second-stage same-architecture refinement distillation further enhanced the student model’s segmentation capabilities under high Intersection over Union (IoU) conditions.

Although the final optimized model did not achieve the highest absolute mAP50-95 among all compared models, its advantage lies in the trade-off among segmentation accuracy, inference speed, parameter size, and computational efficiency. For example, after retraining under the revised training configuration, Mask R-CNN achieved strong mask-level accuracy, but its two-stage inference pipeline, lower inference speed, and higher computational cost make it less suitable for lightweight robotic perception. Similarly, larger YOLOv8-seg models can provide competitive segmentation accuracy, but their larger parameter sizes and higher FLOPs increase the difficulty of deployment on resource-constrained robotic platforms. In contrast, the final optimized YOLOv8n-seg model retained a compact structure while achieving improved mask mAP50-95 and substantially higher inference speed than the original baseline. Therefore, the proposed method should be interpreted as a deployment-oriented optimization strategy rather than an accuracy-only model selection strategy.

Nevertheless, the ablation and pruning-ratio analyses in this study still have several limitations. Although component-level ablation experiments and a pruning-ratio sensitivity analysis were added in this revision, they were conducted under a fixed dataset split and a fixed training configuration. Four pruning ratios, namely 10%, 20%, 30%, and 40%, were evaluated under the complete two-stage distillation framework. The results showed that the 30% pruning ratio achieved the highest box mAP50-95, mask mAP50-95, Precision, and Recall among the tested settings, while also maintaining a relatively high inference speed. However, this result should be interpreted as the most favorable setting among the tested pruning ratios under the current experimental conditions, rather than as a globally optimal pruning configuration. Future work will further investigate finer pruning-ratio intervals, different pruning strategies, broader datasets, and teacher-order ablation experiments to improve the generalizability of the compression and distillation configuration.
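The L1-norm pruning criterion listed in Table 3 can be illustrated with a minimal sketch. It assumes that filters are ranked per layer by the L1 norm of their weights and that the lowest-ranked fraction is removed; the flattened-list layout, the function names, and the toy weights are hypothetical and are not taken from the authors' implementation.

```python
import math
from typing import List, Sequence


def l1_norm(filter_weights: Sequence[float]) -> float:
    """L1 norm of one flattened convolution filter."""
    return sum(abs(w) for w in filter_weights)


def select_filters_to_keep(filters: List[Sequence[float]], prune_ratio: float) -> List[int]:
    """Return sorted indices of the filters that survive L1-norm pruning.

    filters: one flattened weight vector per output channel (assumed layout).
    prune_ratio: fraction of filters to remove, e.g. 0.30.
    """
    if not 0.0 <= prune_ratio < 1.0:
        raise ValueError("prune_ratio must be in [0, 1)")
    # Round down (with a small tolerance for float error) so some filters always remain.
    n_prune = math.floor(len(filters) * prune_ratio + 1e-9)
    # Rank filters from smallest to largest L1 norm; the smallest are pruned.
    ranked = sorted(range(len(filters)), key=lambda i: l1_norm(filters[i]))
    return sorted(ranked[n_prune:])


# Toy layer with 10 filters of 3 weights each (made-up values).
toy_filters = [[0.1 * (i + 1)] * 3 for i in range(10)]
kept = select_filters_to_keep(toy_filters, prune_ratio=0.30)
print(kept)
```

In a real network, the retained indices would also have to be applied to the input channels of the following layer, which this sketch omits.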

In addition, repeated training experiments with three random seeds were added in this revision to evaluate the stability of the baseline and final optimized models. The results showed that the final optimized model maintained relatively stable mask mAP50-95 and Precision across different random initialization conditions. However, the evaluation was still conducted under a fixed dataset split, and k-fold cross-validation and formal statistical significance testing were not performed. Therefore, the reported improvements should be interpreted as performance differences observed under the current experimental setting rather than statistically significant conclusions. Future work will include larger datasets, k-fold cross-validation, broader repeated training, and statistical significance analysis to further evaluate the stability and reliability of the proposed method.
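The seed-level summary in Table 7 can be recomputed from the per-seed scores. The sketch below assumes that the reported spread is the sample standard deviation (n − 1 denominator), an assumption that reproduces the baseline row of Table 7; small last-digit differences elsewhere may arise from rounding of the tabulated scores.

```python
from statistics import mean, stdev

# Mask mAP50-95 per random seed (0, 42, 3407), copied from Table 7.
mask_map = {
    "baseline": [0.7797, 0.7744, 0.7777],
    "final": [0.7855, 0.7906, 0.7909],
}


def summarize(values):
    """Return (mean, sample standard deviation) of repeated-run scores."""
    return mean(values), stdev(values)


summary = {name: summarize(vals) for name, vals in mask_map.items()}
for name, (m, s) in summary.items():
    print(f"{name}: {m:.4f} ± {s:.4f}")

# Gap between the two means; with n = 3 per model this is descriptive only.
mean_gap = summary["final"][0] - summary["baseline"][0]
print(f"mean gap: {mean_gap:.4f}")
```

With only three runs per model, such a summary describes variability but does not replace the significance testing named above as future work.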

Compared with simply introducing additional modules or loss functions, the optimization of distillation weights provided a more direct way to improve the balance between segmentation accuracy and inference efficiency in the current task.

In addition, RT-DETR-L was used only as a detection-oriented reference model in this study. Because it does not generate instance-level masks, it cannot directly replace instance segmentation models for berry-level thinning decision support. Therefore, its results were interpreted only as a reference for bounding-box detection accuracy and inference efficiency, rather than as evidence of mask-level segmentation performance.

From the perspective of task characteristics, grape berries are typically small, densely clustered, and closely adhered at their boundaries. Consequently, high-precision mask representation is more critical than coarse-grained bounding-box localization alone. This helps explain why a mask-oriented distillation weight configuration yielded better performance in this task.
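The shift toward a mask-oriented configuration can be made concrete with the distillation weights in Tables 3 and 4. The sketch below assumes that the distillation terms are combined as a simple weighted sum; the linear form, the class and function names, and the unit loss values are illustrative assumptions, and how this sum is combined with the task loss is not specified here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KDWeights:
    """Weights of the distillation terms (values taken from Tables 3 and 4)."""
    cls: float
    box: float
    mask: float
    feat: float


STAGE1 = KDWeights(cls=0.10, box=0.40, mask=0.30, feat=0.20)  # Mask R-CNN teacher
STAGE2 = KDWeights(cls=0.00, box=0.12, mask=0.55, feat=0.10)  # YOLOv8l-seg teacher


def kd_loss(terms: dict, w: KDWeights) -> float:
    """Weighted sum of per-component distillation losses (assumed linear form)."""
    return (w.cls * terms["cls"] + w.box * terms["box"]
            + w.mask * terms["mask"] + w.feat * terms["feat"])


def mask_share(w: KDWeights) -> float:
    """Fraction of the total distillation weight assigned to the mask term."""
    return w.mask / (w.cls + w.box + w.mask + w.feat)


# Hypothetical per-term loss values, used only to exercise the functions.
example_terms = {"cls": 1.0, "box": 1.0, "mask": 1.0, "feat": 1.0}
print(kd_loss(example_terms, STAGE1), kd_loss(example_terms, STAGE2))
print(round(mask_share(STAGE1), 3), round(mask_share(STAGE2), 3))
```

Under these weights, the mask term accounts for 30% of the total distillation weight in the first stage and about 71% in the second stage.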

The DBSCAN-based thinning decision-making strategy presented in this study addresses the limitations of conventional instance segmentation methods, which can identify the location of berries but fail to determine which berries should be removed. Yang et al. demonstrated that existing grape vision techniques perform detection and counting tasks effectively; however, generating thinning operation recommendations from perception results remains a significant gap in current research [23]. Woo et al. further noted that while thinning assistance systems can aid manual management by predicting berry counts, precise screening at the single-berry level is essential for facilitating automated thinning-target recommendation [21]. By incorporating berry centroid and diameter information into spatial clustering analysis, this study converts visual perception results into actionable thinning decision criteria, thereby enhancing the model’s alignment with the practical requirements of grape thinning robots.
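The conversion from segmentation output to thinning targets can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the pure-Python DBSCAN stands in for a library routine, and the values of `eps`, `min_samples`, and `remove_fraction`, as well as the rule of removing a fixed fraction of the smallest berries in each dense cluster, are hypothetical.

```python
import math
from typing import List, Tuple

Berry = Tuple[float, float, float]  # (centroid x, centroid y, equivalent diameter), in pixels


def equivalent_diameter(mask_area: float) -> float:
    """Diameter of the circle whose area equals the mask area."""
    return 2.0 * math.sqrt(mask_area / math.pi)


def dbscan(points: List[Tuple[float, float]], eps: float, min_samples: int) -> List[int]:
    """Minimal DBSCAN; returns one cluster label per point, with -1 for noise."""
    n = len(points)
    # Each neighborhood includes the point itself.
    neighbors = [[j for j in range(n) if math.dist(points[i], points[j]) <= eps]
                 for i in range(n)]
    labels = [-1] * n
    visited = [False] * n
    cluster = 0
    for i in range(n):
        if visited[i]:
            continue
        visited[i] = True
        if len(neighbors[i]) < min_samples:
            continue  # not a core point; stays noise unless claimed as a border point
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # core or border point joins the cluster
            if visited[j]:
                continue
            visited[j] = True
            if len(neighbors[j]) >= min_samples:
                queue.extend(neighbors[j])  # expand only through core points
        cluster += 1
    return labels


def recommend_thinning(berries: List[Berry], eps: float, min_samples: int,
                       remove_fraction: float) -> List[int]:
    """Indices of the smallest berries inside each dense cluster (assumed rule)."""
    labels = dbscan([(x, y) for x, y, _ in berries], eps, min_samples)
    targets: List[int] = []
    for c in sorted(set(labels) - {-1}):
        members = [i for i, lab in enumerate(labels) if lab == c]
        members.sort(key=lambda i: berries[i][2])  # smallest diameter first
        n_remove = math.floor(len(members) * remove_fraction + 1e-9)
        targets.extend(members[:n_remove])
    return sorted(targets)


# Toy bunch: five crowded berries and two isolated ones (made-up values).
toy_berries = [
    (10, 10, 20), (14, 10, 12), (10, 14, 18), (14, 14, 11), (12, 12, 19),
    (60, 60, 20), (100, 15, 18),
]
toy_labels = dbscan([(x, y) for x, y, _ in toy_berries], eps=6.0, min_samples=4)
toy_targets = recommend_thinning(toy_berries, eps=6.0, min_samples=4, remove_fraction=0.4)
print(toy_labels, toy_targets)
```

In this toy case, the two isolated berries are treated as noise and are never recommended for removal, while the two smallest berries of the crowded group are selected.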

From a practical perspective, the proposed framework provides a bridge between berry-level visual perception and thinning-target decision support. Instance segmentation alone can identify the location and contour of grape berries, but it cannot determine which berries should be removed according to local density and berry size. The DBSCAN-based decision module partially addresses this gap by transforming segmentation outputs into preliminary thinning-target recommendations. This provides a useful intermediate decision layer for future grape thinning robots. However, the current framework remains an offline visual perception and decision-support method, and its practical use in robotic thinning still depends on further integration with three-dimensional localization, end-effector trajectory planning, and closed-loop execution control.

However, the expert-annotation-based evaluation of the DBSCAN thinning decision module remains preliminary. Although three experts were included and a majority-voting consensus annotation was used in this revision, the evaluation was still based on only 33 test images, and real field thinning trials were not conducted. In addition, thinning decisions may vary among agronomists depending on cultivar characteristics, target yield, bunch compactness, fruit maturity, and production management objectives. Future work will introduce larger test sets, additional cultivars, different growth stages, inter-seasonal field data, and real thinning trials to further validate the agronomic reliability and practical applicability of the DBSCAN-based thinning decision module.
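The comparison against a majority-voting consensus can be sketched as follows. The sketch assumes micro-averaged Precision, Recall, and F1 over all test images and an MAE defined on the number of recommended berries per image; these definitions, the function names, and the toy annotations are assumptions for illustration and may differ from the evaluation protocol actually used.

```python
from typing import Dict, List, Set


def majority_vote(expert_sets: List[Set[int]]) -> Set[int]:
    """Berries marked for removal by more than half of the experts."""
    votes: Dict[int, int] = {}
    for marked in expert_sets:
        for berry_id in marked:
            votes[berry_id] = votes.get(berry_id, 0) + 1
    return {b for b, v in votes.items() if v > len(expert_sets) / 2}


def evaluate(predicted: List[Set[int]], consensus: List[Set[int]]) -> Dict[str, float]:
    """Micro-averaged Precision, Recall, F1, and per-image count MAE."""
    tp = sum(len(p & c) for p, c in zip(predicted, consensus))
    fp = sum(len(p - c) for p, c in zip(predicted, consensus))
    fn = sum(len(c - p) for p, c in zip(predicted, consensus))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    mae = sum(abs(len(p) - len(c)) for p, c in zip(predicted, consensus)) / len(predicted)
    return {"precision": precision, "recall": recall, "f1": f1, "mae": mae}


# Two toy images with three hypothetical expert annotations each (berry IDs).
experts_per_image = [
    [{1, 2, 3}, {2, 3, 4}, {3, 5}],
    [{7}, {7, 8}, {8, 9}],
]
consensus = [majority_vote(sets) for sets in experts_per_image]
predicted = [{2, 3, 6}, {7}]  # hypothetical algorithm output
scores = evaluate(predicted, consensus)
print(consensus, scores)
```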

The generalization ability of the proposed method is also limited by the current dataset. Although the dataset contained 16,461 annotated berry instances, these instances were derived from 330 valid grape bunch images collected from Shine Muscat grape bunches at the berry enlargement stage in a single vineyard using the same RGB-D camera system. Therefore, the dataset does not cover sufficient variations in grape cultivars, growth stages, production years, vineyard management conditions, canopy structures, illumination conditions, camera systems, or orchard environments. Although data augmentation was used to improve the robustness of model training, it cannot replace real external validation data. Consequently, the current results should be interpreted as preliminary evidence obtained under the specific cultivar, growth stage, vineyard, and imaging conditions of this study, rather than as conclusive evidence of general applicability across diverse grape production scenarios. Future work will expand the dataset by including different grape cultivars, different growth stages, multiple vineyards, different production seasons, and different imaging systems to further evaluate the generalization capability of the proposed method.

Moreover, although all visible berries were manually annotated at the instance level, attribute-level annotations for small berries, occluded berries, and severely adhered berries were not established. Therefore, the current study could not provide separate quantitative performance metrics for these visually challenging categories. In addition, formal inter-annotator agreement assessment was not conducted. Future work will introduce multi-annotator labeling, annotation consistency evaluation, and attribute-level labels to further improve dataset reliability and enable more detailed robustness analysis.

In addition, the inference speed reported in this study was obtained on an NVIDIA RTX 3060 Laptop GPU rather than on an embedded edge platform. Therefore, the current FPS results only reflect the computational efficiency of the model in an offline laptop-GPU environment and cannot be regarded as direct evidence of real-time performance on embedded platforms; this deployment potential remains to be confirmed through further validation. More importantly, real robotic thinning experiments were not conducted in the current study. Practical robotic indicators, such as berry localization error, end-effector positioning accuracy, thinning success rate, and operation cycle time, still need to be evaluated under closed-loop field conditions. Therefore, the proposed method should be regarded as an offline RGB-based visual perception and preliminary thinning decision-support module for future grape thinning robots, rather than as a fully validated robotic thinning system.

The proposed method primarily depends on two-dimensional RGB image information for berry instance segmentation and thinning-target recommendation, and three-dimensional structural or depth information was not incorporated into the current decision module. Therefore, the recommended thinning targets cannot be directly converted into executable three-dimensional robot coordinates without additional depth sensing, multi-view reconstruction, hand-eye calibration, and coordinate transformation. This limitation may lead to decision or execution errors in cases with severe berry occlusion, overlapping berries, and complex spatial hierarchies within grape bunches. Future work will integrate RGB-D data, multi-view imaging, and multimodal perception to improve three-dimensional bunch structure understanding, berry localization accuracy, and robotic thinning execution.
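The coordinate conversion that the current module lacks can be outlined as a hypothetical sketch. It assumes a pinhole camera model, a metric depth value at the berry centroid, and a known 4 × 4 camera-to-robot-base transform such as one obtained from hand-eye calibration; the intrinsics, the transform, and the function names are made-up illustration values and are not part of the present study.

```python
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]


def pixel_to_camera(u: float, v: float, depth: float,
                    fx: float, fy: float, cx: float, cy: float) -> Vec3:
    """Back-project pixel (u, v) with metric depth into the camera frame (pinhole model)."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


def camera_to_base(p_cam: Vec3, t_base_cam: Sequence[Sequence[float]]) -> Vec3:
    """Apply a 4x4 homogeneous transform to a 3D point."""
    ph = (*p_cam, 1.0)
    out = [sum(t_base_cam[r][c] * ph[c] for c in range(4)) for r in range(3)]
    return (out[0], out[1], out[2])


# Hypothetical intrinsics (pixels) and camera-to-base transform (meters).
FX, FY, CX, CY = 600.0, 600.0, 320.0, 240.0
T_BASE_CAM = [
    [1.0, 0.0, 0.0, 0.10],
    [0.0, 1.0, 0.0, 0.00],
    [0.0, 0.0, 1.0, 0.50],
    [0.0, 0.0, 0.0, 1.00],
]

# A thinning target at pixel (380, 240) observed at 0.40 m depth.
p_cam = pixel_to_camera(380.0, 240.0, depth=0.40, fx=FX, fy=FY, cx=CX, cy=CY)
p_base = camera_to_base(p_cam, T_BASE_CAM)
print(p_cam, p_base)
```

Depth noise and occlusion at the berry centroid would propagate directly into the resulting coordinates, which is one reason the limitation above matters for execution accuracy.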

5. Conclusions and Future Prospects

To address the challenges of grape berry instance segmentation under natural vineyard conditions, including dense berry distribution, mutual occlusion, complex backgrounds, and the need for lightweight perception, this study developed a YOLOv8n-seg-based grape berry instance segmentation and thinning decision-support method. A two-stage knowledge distillation and pruning optimization strategy was introduced to improve the segmentation performance of the lightweight student model, and a DBSCAN-based clustering method was further used to generate preliminary thinning-target recommendations based on berry centroid positions and equivalent diameter features.

The experimental results on the self-constructed grape berry dataset showed that the final optimized model achieved a box mAP50-95 of 0.8945 and a mask mAP50-95 of 0.7910. Compared with the original YOLOv8n-seg baseline, the mask mAP50-95 increased by 1.20 percentage points. The model achieved an inference speed of 119.19 FPS, with 3.26 M parameters and 5.95 GFLOPs on an NVIDIA RTX 3060 Laptop GPU. These results indicate that the proposed optimization strategy improved the segmentation performance and inference efficiency of the lightweight model under the current offline GPU-based experimental setting, providing a potential basis for future edge-oriented visual perception after embedded-platform validation.
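The headline comparisons follow directly from the baseline and final-model rows of Tables 5 and 6. The short sketch below recomputes them; the helper names are illustrative, and the per-image latency is simply the reciprocal of the reported FPS rather than a separately measured quantity.

```python
# Values copied from Tables 5 and 6.
baseline = {"mask_map": 0.7790, "box_map": 0.8921, "fps": 47.40,
            "params_m": 5.64, "gflops": 6.34}
final = {"mask_map": 0.7910, "box_map": 0.8945, "fps": 119.19,
         "params_m": 3.26, "gflops": 5.95}


def gain_pp(new: float, old: float) -> float:
    """Absolute gain in percentage points for metrics on a 0-1 scale."""
    return (new - old) * 100.0


def reduction_pct(new: float, old: float) -> float:
    """Relative reduction of `new` with respect to `old`, in percent."""
    return (old - new) / old * 100.0


mask_gain = gain_pp(final["mask_map"], baseline["mask_map"])
box_gain = gain_pp(final["box_map"], baseline["box_map"])
speedup = final["fps"] / baseline["fps"]
param_cut = reduction_pct(final["params_m"], baseline["params_m"])
flops_cut = reduction_pct(final["gflops"], baseline["gflops"])
latency_ms = 1000.0 / final["fps"]

print(f"mask gain: {mask_gain:.2f} pp, box gain: {box_gain:.2f} pp")
print(f"speed-up: {speedup:.2f}x, parameters: -{param_cut:.1f}%, FLOPs: -{flops_cut:.1f}%")
print(f"latency: {latency_ms:.2f} ms per image")
```

On these tabulated values, the final model runs roughly 2.5 times faster than the baseline with about 42% fewer parameters, while the mask mAP50-95 gain is 1.20 percentage points.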

The pruning-ratio sensitivity analysis further showed that the 30% pruning ratio achieved the best overall segmentation performance among the tested pruning settings and provided a favorable balance between segmentation accuracy and inference efficiency under the current experimental setting.

Based on the instance segmentation results, this study further proposed a DBSCAN-based thinning decision method. By combining berry centroid coordinates, local-density clustering, and equivalent diameter information, the method identified locally overcrowded berry regions and selected smaller berries within dense clusters as preliminary thinning targets. The parameter sensitivity analysis showed that the default DBSCAN parameter combination produced relatively stable thinning recommendations under the tested conditions. In addition, comparison with three-expert consensus annotations on 33 test images showed that the DBSCAN-based thinning decision achieved a Precision of 0.756, Recall of 0.771, F1-score of 0.763, and MAE of 1.48 berries per image. The average pairwise F1-score among the three experts was 0.834 ± 0.067, indicating a reasonable level of inter-expert agreement. These results suggest that the proposed method can provide preliminary visual decision support for grape berry thinning under the current test conditions. However, the DBSCAN-based recommendations should not be interpreted as fully validated agronomic thinning prescriptions, and further validation using larger test sets, additional cultivars, multi-scenario field images, and real field thinning trials is still required.

Several limitations should also be noted. First, the dataset used in this study contained 330 valid images and 16,461 annotated berry instances collected from Shine Muscat grape bunches at the berry enlargement stage in a single vineyard using the same imaging setup. Therefore, the reported results are limited to the specific cultivar, growth stage, vineyard environment, and imaging conditions of this study. External validation across different grape cultivars, growth stages, vineyards, production seasons, and camera systems was not conducted and remains necessary before broader application. Second, although repeated training experiments with three random seeds were added for the baseline and final optimized models, other model comparisons and ablation variants were mainly based on one main training run under a fixed dataset split. Therefore, the reported performance differences should not be interpreted as statistically significant conclusions. K-fold cross-validation, broader multi-seed evaluation, and formal statistical significance testing are still needed to further evaluate model stability and reliability. Third, inference speed was tested on a laptop GPU rather than an embedded edge device, and real robotic thinning experiments were not conducted. Therefore, the reported FPS should be interpreted as offline inference performance rather than embedded real-time deployment performance. Practical robotic indicators, such as berry localization error, end-effector positioning accuracy, thinning success rate, fruit damage rate, and operation cycle time, remain to be evaluated under closed-loop field conditions. Finally, the current thinning decision method mainly relies on two-dimensional RGB image information and does not yet incorporate three-dimensional structural or depth information. Therefore, the recommended thinning targets cannot be directly converted into executable robot coordinates without additional depth sensing, multi-view reconstruction, hand-eye calibration, and coordinate transformation, which may limit decision accuracy and robotic applicability under severe occlusion and complex bunch structures.

Future research may focus on the following aspects:

(1) Expansion of perceptual modalities: The current approach primarily conducts grape berry instance segmentation and thinning-target recommendation using single-view two-dimensional RGB images. Future studies will incorporate RGB-D data, multi-view imaging, depth completion, hand-eye calibration, and coordinate transformation to convert two-dimensional thinning targets into three-dimensional robot-operable coordinates and to improve the understanding of bunch structure, occlusion relationships, and complex illumination conditions.

(2) Enhancement of model generalization capability: The dataset will be expanded to include different grape cultivars, growth stages, vineyards, production seasons, canopy structures, and camera systems. Transfer learning, domain adaptation, and cross-domain validation strategies will also be explored to improve model robustness under different production conditions.

(3) Improvement of experimental reliability: Future work will conduct repeated training with broader random seeds, k-fold cross-validation, and formal statistical significance testing. Although a pruning-ratio sensitivity analysis using 10%, 20%, 30%, and 40% pruning settings was added in this study, the current conclusion was still obtained under a fixed dataset split and the present training configuration. Future work will further evaluate finer pruning-ratio intervals, different pruning strategies, and broader datasets to determine a more generalizable compression configuration.

(4) Optimization of the thinning decision model: On the basis of spatial density and berry size features, additional agronomic information, such as bunch morphology, target yield, vine vigor, fruit maturity, and expert thinning rules, will be incorporated to construct a more comprehensive thinning decision model.

(5) Embedded deployment and robotic thinning validation: Future studies will evaluate the proposed model on embedded platforms such as NVIDIA Jetson devices through model compression, quantization, and hardware-specific acceleration. In addition, closed-loop robotic thinning experiments will be conducted under real vineyard conditions to evaluate berry three-dimensional localization accuracy, hand-eye calibration accuracy, end-effector positioning accuracy, thinning success rate, fruit damage rate, and operation cycle time.

Author Contributions

Methodology, H.Z.; Validation, Y.M. and S.H.; Investigation, Y.M.; Data curation, H.Z.; Writing—original draft, H.Z., Y.M. and T.Z.; Writing—review & editing, H.Z. and M.Q.; Supervision, M.Q.; Project administration, M.Q.; Funding acquisition, M.Q. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Science and Technology Department of Zhejiang Province, grant number LD24E050001.

Data Availability Statement

The dataset generated and analyzed during the current study is not publicly available because it was self-collected from field experiments and is subject to project confidentiality restrictions. The full training, evaluation, and DBSCAN-based thinning decision analysis code is also not publicly released at this stage because it is part of an ongoing research project. However, non-sensitive materials, including representative annotated image examples, detailed training configurations, model configuration files, pruning and distillation parameter settings, DBSCAN parameter settings, and pseudocode of the thinning decision procedure, may be made available from the corresponding author upon reasonable request and with permission from the project management team.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Preface. In Integrated Processing Technologies for Food and Agricultural By-Products; Academic Press: London, UK, 2019; pp. ix–x.
2. Long, Z.; She, Y.; Zhou, Y.; Qin, S. Research on the Practice Model of Fine Management Technology for Xing’an Sunshine Rose Grape during the fruiting period. Friends Farmers Get. Rich. 2026, 135–137.
3. Zhuang, S.; Tozzini, L.; Green, A.; Acimovic, D.; Howell, G.S.; Castellarin, S.D.; Sabbatini, P. Impact of Cluster Thinning and Basal Leaf Removal on Fruit Quality of Cabernet Franc (Vitis vinifera L.) Grapevines Grown in Cool Climate Conditions. HortScience 2014, 49, 750–756.
4. Wang, Y.; He, Y.-N.; Chen, W.-K.; He, F.; Chen, W.; Cai, X.-D.; Duan, C.-Q.; Wang, J. Effects of Cluster Thinning on Vine Photosynthesis, Berry Ripeness and Flavonoid Composition of Cabernet Sauvignon. Food Chem. 2018, 248, 101–110.
5. Zhou, H.; Wang, X.; Au, W.; Kang, H.; Chen, C. Intelligent Robots for Fruit Harvesting: Recent Developments and Future Challenges. Precis. Agric. 2022, 23, 1856–1907.
6. Yin, W.; Wen, H.; Ning, Z.; Ye, J.; Dong, Z.; Luo, L. Fruit Detection and Pose Estimation for Grape Cluster–Harvesting Robot Using Binocular Imagery Based on Deep Neural Networks. Front. Robot. AI 2021, 8, 626989.
7. Sapkota, R.; Ahmed, D.; Karkee, M. Comparing Yolov8 and Mask Rcnn for Object Segmentation in Complex Orchard Environments. Artif. Intell. Agric. 2024, 13, 84–99.
8. Shen, Q.; Zhang, X.; Shen, M.; Xu, D. Multi-Scale Adaptive YOLO for Instance Segmentation of Grape Pedicels. Comput. Electron. Agric. 2025, 229, 109712.
9. Hinton, G.; Vinyals, O.; Dean, J. Distilling the Knowledge in a Neural Network. arXiv 2015, arXiv:1503.02531.
10. Gou, J.; Yu, B.; Maybank, S.J.; Tao, D. Knowledge Distillation: A Survey. Int. J. Comput. Vis. 2021, 129, 1789–1819.
11. Luo, L.; Tang, Y.; Zou, X.; Ye, M.; Feng, W.; Li, G. Vision-Based Extraction of Spatial Information in Grape Clusters for Harvesting Robots. Biosyst. Eng. 2016, 151, 90–104.
12. Jin, Y.; Yu, C.; Yin, J.; Yang, S.X. Detection Method for Table Grape Ears and Stems Based on a Far-Close-Range Combined Vision System and Hand-Eye-Coordinated Picking Test. Comput. Electron. Agric. 2022, 202, 107364.
13. Kurtser, P.; Ringdahl, O.; Rotstein, N.; Berenstein, R.; Edan, Y. In-Field Grape Cluster Size Assessment for Vine Yield Estimation Using a Mobile Robot and a Consumer Level RGB-D Camera. IEEE Robot. Autom. Lett. 2020, 5, 2031–2038.
14. Santos, T.T.; De Souza, L.L.; Dos Santos, A.A.; Avila, S. Grape Detection, Segmentation, and Tracking Using Deep Neural Networks and Three-Dimensional Association. Comput. Electron. Agric. 2020, 170, 105247.
15. Wu, Z.; Xia, F.; Zhou, S.; Xu, D. A Method for Identifying Grape Stems Using Keypoints. Comput. Electron. Agric. 2023, 209, 107825.
16. Zhang, T.; Wu, F.; Wang, M.; Chen, Z.; Li, L.; Zou, X. Grape-Bunch Identification and Location of Picking Points on Occluded Fruit Axis Based on YOLOv5-GAP. Horticulturae 2023, 9, 498.
17. Zhao, R.; Zhu, Y.; Li, Y. An End-to-End Lightweight Model for Grape and Picking Point Simultaneous Detection. Biosyst. Eng. 2022, 223, 174–188.
18. Wang, C.-Y.; Yeh, I.-H.; Mark Liao, H.-Y. YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information. In Proceedings of the Computer Vision—ECCV 2024; Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G., Eds.; Springer Nature: Cham, Switzerland, 2025; pp. 1–21.
19. Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J.; Ding, G. YOLOv10: Real-Time End-to-End Object Detection. arXiv 2024.
20. Du, W.; Liu, P. Instance Segmentation and Berry Counting of Table Grape before Thinning Based on AS-SwinT. Plant Phenomics 2023, 5, 0085.
21. San Woo, Y.; Buayai, P.; Nishizaki, H.; Makino, K.; Kamarudin, L.M.; Mao, X. End-to-End Lightweight Berry Number Prediction for Supporting Table Grape Cultivation. Comput. Electron. Agric. 2023, 213, 10820.
22. Yang, C.; Geng, T.; Peng, J.; Song, Z. Probability Map-Based Grape Detection and Counting. Comput. Electron. Agric. 2024, 224, 109175.
23. Yang, C.; Geng, T.; Peng, J.; Xu, C.; Song, Z. Mask-GK: An Efficient Method Based on Mask Gaussian Kernel for Segmentation and Counting of Grape Berries in Field. Comput. Electron. Agric. 2025, 234, 110286.
24. Liang, X.; Wei, Z.; Chen, K. A Method for Segmentation and Localization of Tomato Lateral Pruning Points in Complex Environments Based on Improved YOLOV5. Comput. Electron. Agric. 2025, 229, 109731.
25. Gao, G.; Wang, S.; Shuai, C.; Zhang, Z.; Zhang, S.; Feng, Y. Recognition and Detection of Greenhouse Tomatoes in Complex Environment. Trait. Signal 2022, 39, 291.
26. Solimani, F.; Cardellicchio, A.; Dimauro, G.; Petrozza, A.; Summerer, S.; Cellini, F.; Renò, V. Optimizing Tomato Plant Phenotyping Detection: Boosting YOLOv8 Architecture to Tackle Data Complexity. Comput. Electron. Agric. 2024, 218, 108728.
27. Liu, M.; Chen, W.; Cheng, J.; Wang, Y.; Zhao, C. Y-HRNet: Research on Multi-Category Cherry Tomato Instance Segmentation Model Based on Improved YOLOv7 and HRNet Fusion. Comput. Electron. Agric. 2024, 227, 109531.
28. Fatehi, F.; Bagherpour, H.; Parian, J.A. Enhancing the Performance of YOLOv9t Through a Knowledge Distillation Approach for Real-Time Detection of Bloomed Damask Roses in the Field. Smart Agric. Technol. 2025, 10, 100794.
29. Tombe, R. Computer Vision for Smart Farming and Sustainable Agriculture. In Proceedings of the 2020 IST-Africa Conference (IST-Africa), Virtual, 18–22 May 2020; pp. 1–8.
30. Shen, Q.; Xu, D.; Guo, T.; Mao, X.; Xia, F. Two–Stage Multimodal 3D Point Localization Framework for Automatic Grape Harvesting. Smart Agric. Technol. 2025, 12, 101062.
31. Woo, Y.S.; Li, Z.; Tamura, S.; Buayai, P.; Nishizaki, H.; Makino, K.; Kamarudin, L.M.; Mao, X. 3D Grape Bunch Model Reconstruction from 2D Images. Comput. Electron. Agric. 2023, 215, 108328.
32. Magalhães, S.A.; Moreira, A.P.; dos Santos, F.N.; Dias, J. Active Perception Fruit Harvesting Robots—A Systematic Review. J. Intell. Robot. Syst. 2022, 105, 14.
33. Ram, A.; Jalal, S.; Jalal, A.S.; Kumar, M. A Density Based Algorithm for Discovering Density Varied Clusters in Large Spatial Databases. Int. J. Comput. Appl. 2010, 3, 1–4.

Figure 1. Experimental vineyard environment.

Figure 2. Image acquisition camera module installed on the grape thinning robot platform.

Figure 3. Overall workflow of dataset construction, model optimization, and thinning decision evaluation.

Figure 4. Examples of original images in the dataset.

Figure 5. Example of the LabelMe annotation interface.

Figure 6. Examples of augmented training images.

Figure 7. Overall workflow and network structure of the model.

Figure 8. Example of the final recognition effect of the model.

Figure 9. Relationship between the number of parameters and inference speed of each model.

Figure 10. Visual comparison results of grape bunch instance segmentation using different models. Note: From left to right, the images depict the original YOLOv8n-seg, the first-stage distilled and pruned model, the second-stage distillation baseline model (v2 fixed), and the final optimized model. Each subfigure presents the segmentation result of the full image on the left, while the right side displays a locally enlarged view of the corresponding region.

Figure 11. Representative segmentation results and failure cases under small-target, occlusion, and dense-adhesion conditions. Note: (a) Segmentation result under leaf and branch occlusion; (b) segmentation result in densely adhered berry regions; (c) recognition result of small berries and multi-scale berry regions; (d) typical difficult case under severe adhesion and weak boundary contrast. The final optimized model produced relatively complete masks for most visible berries, but missed detections, incomplete contours, and merged masks may still occur under severe occlusion or extremely weak berry boundaries.

Figure 12. Example of DBSCAN clustering and thinning decision results. Note: (a) Results of DBSCAN clustering and thinning selection based on berry centroid coordinates, with blue points denoting retained berries and red crosses indicating berries selected for removal; (b) corresponding thinning decision results on the grape bunch image, where red-highlighted regions indicate berries selected for removal.

Table 1. Dataset division and annotation statistics.

Dataset Subset	Number of Images	Number of Annotated Berry Instances	Mean Berry Instances per Image	SD	Min	Max
Training set	264	13,079	49.54	10.05	27	93
Validation set	33	1754	53.15	8.81	39	73
Test set	33	1628	49.33	8.16	34	63
Total	330	16,461	49.88	9.77	27	93

Table 2. Data augmentation operations and their purposes.

Augmentation Operation	Implementation	Purpose
Overcast day simulation	The image brightness was reduced to simulate weak illumination conditions.	To improve model robustness under cloudy, shaded, or low-light vineyard environments.
Bright-light simulation	The image brightness was increased to simulate strong illumination conditions.	To improve tolerance to direct sunlight, local overexposure, and strong light variation.
Noise increase	Random noise was added to the image.	To simulate image sensor noise and field acquisition disturbance.
Flip and blur	The image was flipped and slightly blurred.	To simulate viewpoint variation, slight camera defocus, and motion blur during image acquisition.
Saturation increase	The color saturation of the image was enhanced.	To improve robustness to color variation caused by illumination changes and camera imaging differences.
Grayscale increase	The image color information was partially weakened by increasing the grayscale effect.	To improve model performance when color contrast between berries, leaves, and background is reduced.
Horizontal flip	The image was flipped horizontally.	To increase lateral-view diversity and improve robustness to different shooting directions.
Random rotation	The image was randomly rotated within a limited angle range.	To simulate small changes in camera posture and acquisition angle.

Table 3. Training parameter settings for first-stage pruning and Mask R-CNN-guided distillation.

Parameter Category	Parameter Setting
Student model	YOLOv8n-seg
Teacher model	Mask R-CNN with ResNet50-FPN
Pruning ratio	30%
Pruned layers	model.0.conv and model.1.conv
Pruning criterion	L1-norm of convolutional kernel weights
Image size	640 × 640
Training epochs	80
Batch size	8
Optimizer	AdamW
Initial learning rate	0.0005
Learning rate schedule	Cosine learning rate decay
Warm-up epochs	3
Random seed	0, 42, 3407
Distillation temperature	2.0
Classification distillation weight	0.10
Box distillation weight	0.40
Mask distillation weight	0.30
Feature distillation weight	0.20
IoU setting	IoU was not used as a training threshold; it was used in box distillation and evaluation
Workers	Ultralytics default

Table 4. Training parameter settings for second-stage YOLOv8l-seg-guided refinement distillation.

Parameter Category	Parameter Setting
Student model	First-stage pruned and distilled YOLOv8n-seg
Teacher model	YOLOv8l-seg
Image size	640 × 640
Training epochs	80
Batch size	8
Optimizer	AdamW
Initial learning rate	0.0005
Final learning rate factor	0.01
Learning rate schedule	Cosine learning rate decay
Warm-up epochs	3
Weight decay	0.0005
Close mosaic	Last 10 epochs
Patience	30
Random seed	0, 42, 3407
Distillation temperature	2.0
Classification distillation weight	0.00
Box distillation weight	0.12
Mask distillation weight	0.55
Feature distillation weight	0.10
Workers	0

Table 5. Comparison of accuracy indicators among YOLOv8-seg models, reference models, and optimized student models.

Model	Box mAP50-95	Mask mAP50-95	Precision
YOLOv8n-seg	0.8921	0.7790	0.9460
YOLOv8s-seg	0.9261	0.8138	0.9516
YOLOv8l-seg	0.9241	0.8190	0.9596
YOLOv8x-seg	0.9260	0.8107	0.9457
Mask R-CNN	0.9317	0.8224	0.9740
RT-DETR-L	0.9180	—	0.9399
First-stage Model	0.8877	0.7841	0.9529
Final Model	0.8945	0.7910	0.9507

Note: “—” indicates that the corresponding metric was not applicable. RT-DETR-L was evaluated only as a detection-oriented reference model and was not used for mask-level segmentation comparison.

Table 6. Comparison of efficiency indicators among YOLOv8-seg models, reference models, and optimized student models.

Model	Recall	FPS	Parameters/M	FLOPs/G
YOLOv8n-seg	0.9317	47.40	5.64	6.34
YOLOv8s-seg	0.9540	46.50	20.02	21.34
YOLOv8l-seg	0.9315	27.80	89.71	110.32
YOLOv8x-seg	0.9335	20.10	140.11	172.16
Mask R-CNN	0.9535	16.54	44.40	175.2
RT-DETR-L	0.9474	25.86	31.99	107.93
First-stage Model	0.9195	70.40	5.42	5.95
Final Model	0.9243	119.19	3.26	5.95

Table 7. Stability evaluation of the baseline and final optimized models under different random seeds.

Model	Random Seed	Box mAP50-95	Mask mAP50-95	Precision	Recall
YOLOv8n-seg baseline	0	0.8894	0.7797	0.9453	0.9372
YOLOv8n-seg baseline	42	0.8941	0.7744	0.9496	0.9292
YOLOv8n-seg baseline	3407	0.8920	0.7777	0.9431	0.9252
YOLOv8n-seg baseline	Mean ± SD	0.8918 ± 0.0024	0.7773 ± 0.0027	0.9460 ± 0.0027	0.9305 ± 0.0061
Final optimized model	0	0.8932	0.7855	0.9514	0.9255
Final optimized model	42	0.8925	0.7906	0.9491	0.9248
Final optimized model	3407	0.8942	0.7909	0.9501	0.9205
Final optimized model	Mean ± SD	0.8933 ± 0.0009	0.7890 ± 0.0031	0.9502 ± 0.0011	0.9236 ± 0.0027
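
The Mean ± SD rows can be recomputed from the per-seed values. The sketch below does this for the box mAP50-95 column and assumes the sample standard deviation (n − 1 denominator); that choice is inferred from the reported numbers rather than stated in the table. Because the per-seed values are already rounded to four decimals, recomputed values in other columns may not match the reported ones exactly.

```python
from statistics import mean, stdev

# Per-seed box mAP50-95 values copied from Table 7 (seeds 0, 42, 3407).
box_map = {
    "YOLOv8n-seg baseline": [0.8894, 0.8941, 0.8920],
    "Final optimized model": [0.8932, 0.8925, 0.8942],
}


def mean_sd(values):
    """Return (mean, sample standard deviation), both rounded to 4 decimals."""
    return round(mean(values), 4), round(stdev(values), 4)


if __name__ == "__main__":
    for model, values in box_map.items():
        m, sd = mean_sd(values)
        print(f"{model}: {m:.4f} ± {sd:.4f}")
```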

Table 8. Component-level ablation analysis of pruning and knowledge distillation strategies.

Model Variant	Pruning	Mask R-CNN KD	YOLOv8l-seg KD	Box mAP50-95	Mask mAP50-95	Precision	Recall	FPS	Parameters/M	FLOPs/G
YOLOv8n-seg baseline	—	—	—	0.8921	0.7790	0.9460	0.9317	47.40	5.64	6.34
Pruning-only model	✓	—	—	0.8871	0.7861	0.9467	0.9217	51.89	3.26	5.95
Mask R-CNN KD-only model	—	✓	—	0.9015	0.7891	0.9459	0.9235	42.67	5.64	6.34
YOLOv8l-seg KD-only model	—	—	✓	0.9035	0.7915	0.9542	0.9168	51.74	5.64	6.34
First-stage model	✓	✓	—	0.8877	0.7841	0.9529	0.9195	70.40	5.42	5.95
Pruning + YOLOv8l-seg KD model	✓	—	✓	0.8874	0.7871	0.9430	0.9248	51.28	3.26	5.95
Final optimized model	✓	✓	✓	0.8945	0.7910	0.9507	0.9243	119.19	3.26	5.95

Note: “✓” indicates that the corresponding optimization component was used, while “—” indicates that it was not used. FLOPs were reported according to the corresponding model architecture; KD-only variants retained the original YOLOv8n-seg student structure, while pruning-related variants used the pruned lightweight structure.

Table 9. Sensitivity analysis of different pruning ratios under the complete two-stage distillation framework.

Pruning Ratio	Box mAP50-95	Mask mAP50-95	Precision	Recall	FPS	Parameters/M	FLOPs/G
10%	0.8872	0.7835	0.9452	0.9186	78.35	3.26	6.58
20%	0.8918	0.7882	0.9485	0.9219	97.62	3.26	6.23
30%	0.8945	0.7910	0.9507	0.9243	119.19	3.26	5.95
40%	0.8896	0.7861	0.9478	0.9205	142.75	3.26	5.67

Table 10. Comparison of model accuracy under different distillation weight settings.

Distillation Weight Configuration (Box/Mask/Feature)	Box mAP50-95	Mask mAP50-95	Precision
0.20/0.45/0.15 (v2 fixed baseline)	0.8941	0.7876	0.9518
0.15/0.50/0.10	0.8928	0.7903	0.9482
0.15/0.55/0.08	0.8933	0.7922	0.9514
0.12/0.55/0.10	0.8945	0.7910	0.9507
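
One plausible reading of the Box/Mask/Feature weights is that they scale separate distillation terms that are added to the student's ordinary task loss. The sketch below illustrates that weighted-sum form only. The additive formulation, the class `KDWeights`, the function `total_loss`, and the placeholder loss values are all assumptions for illustration; the table itself reports only the weights.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KDWeights:
    """Distillation weights (hypothetical container)."""
    box: float
    mask: float
    feature: float
    cls: float = 0.0  # classification distillation weight


# Selected configuration from Table 10 (Box/Mask/Feature = 0.12/0.55/0.10).
SELECTED = KDWeights(box=0.12, mask=0.55, feature=0.10)


def total_loss(task_loss: float, kd_terms: dict, w: KDWeights) -> float:
    """Assumed form: task loss plus weighted distillation terms."""
    return (task_loss
            + w.cls * kd_terms["cls"]
            + w.box * kd_terms["box"]
            + w.mask * kd_terms["mask"]
            + w.feature * kd_terms["feature"])


if __name__ == "__main__":
    # Placeholder loss values, not measurements from the paper.
    kd = {"cls": 1.0, "box": 1.0, "mask": 1.0, "feature": 1.0}
    print(total_loss(1.0, kd, SELECTED))
```

Under this reading, the mask term dominates the distillation signal in every tested configuration, since its weight is the largest of the three.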

Table 11. Comparison of model efficiency under different distillation weight settings.

Distillation Weight Configuration (Box/Mask/Feature)	Recall	Approx FPS	Pure FPS
0.20/0.45/0.15 (v2 fixed baseline)	0.9204	40.73	117.27
0.15/0.50/0.10	0.9242	39.06	115.17
0.15/0.55/0.08	0.9253	37.84	103.57
0.12/0.55/0.10	0.9243	38.29	119.19

Table 12. Sensitivity analysis of DBSCAN thinning-decision parameters on 330 valid grape bunch images.

Parameter	Tested Value	Dense Clusters	Thinning Targets	Mean Targets per Image	Change vs. Default
α in ε = α $\bar{d}$	1.0	715	2921	8.85	−41.47%
α in ε = α $\bar{d}$	1.2	477	4991	15.12	0.00%
α in ε = α $\bar{d}$	1.4	387	5464	16.56	+9.48%
MinPts	2	477	4991	15.12	0.00%
MinPts	3	477	4991	15.12	0.00%
MinPts	4	618	4184	12.68	−16.17%
Tc	5	526	5040	15.27	+0.98%
Tc	6	477	4991	15.12	0.00%
Tc	7	459	4955	15.02	−0.72%
r	0.2	477	3260	9.88	−34.68%
r	0.3	477	4991	15.12	0.00%
r	0.4	477	6737	20.42	+34.98%

Note: The default parameter combination was α = 1.2, MinPts = 3, Tc = 6, and r = 0.3. A total of 19,624 grape berries were detected from the 330 valid images and used for DBSCAN-based thinning-decision analysis.
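
The two derived columns of Table 12 follow directly from the thinning-target counts: the mean is the count divided by the 330 images, and the change is measured relative to the 4991 targets obtained with the default parameters. A minimal sketch of that arithmetic, with illustrative variable and function names:

```python
IMAGES = 330             # valid grape bunch images
DEFAULT_TARGETS = 4991   # thinning targets under the default parameters

# Thinning-target counts copied from Table 12 for the non-default settings.
targets = {
    "alpha=1.0": 2921, "alpha=1.4": 5464,
    "MinPts=4": 4184,
    "Tc=5": 5040, "Tc=7": 4955,
    "r=0.2": 3260, "r=0.4": 6737,
}


def summarize(n_targets: int):
    """Return (mean targets per image, % change vs. default), 2 decimals."""
    mean_per_image = n_targets / IMAGES
    change_pct = 100.0 * (n_targets - DEFAULT_TARGETS) / DEFAULT_TARGETS
    return round(mean_per_image, 2), round(change_pct, 2)


if __name__ == "__main__":
    for setting, n in targets.items():
        mean_per_image, change = summarize(n)
        print(f"{setting:10s} {mean_per_image:6.2f} {change:+7.2f}%")
```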

Table 13. Quantitative evaluation of DBSCAN-based thinning decisions against three-expert consensus annotations.

Evaluation Set	Images	Detected Berries	DBSCAN Targets	Expert Targets	TP	Precision	Recall	F1-Score	MAE
Test images	33	2147	544	533	411	0.756	0.771	0.763	1.48

Note: Expert consensus targets were generated using majority voting among three experts. A berry was regarded as an expert-selected thinning target if it was selected by at least two of the three experts. TP denotes the number of berries consistently selected by both the DBSCAN-based method and the expert consensus annotation. MAE denotes the mean absolute count error between the number of DBSCAN-recommended thinning targets and expert consensus targets per image.
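
The precision, recall, and F1-score in Table 13 can be reproduced from the three counts in the same row, using the standard definitions with DBSCAN targets as predictions and expert consensus targets as ground truth. MAE is not recomputed in this sketch because it requires per-image counts, which the table does not list.

```python
# Counts copied from Table 13 (33 test images).
tp = 411               # berries selected by both DBSCAN and expert consensus
dbscan_targets = 544   # berries recommended by the DBSCAN-based method
expert_targets = 533   # berries selected by at least two of three experts

precision = tp / dbscan_targets
recall = tp / expert_targets
f1 = 2 * precision * recall / (precision + recall)

if __name__ == "__main__":
    print(f"Precision = {precision:.3f}")
    print(f"Recall    = {recall:.3f}")
    print(f"F1-score  = {f1:.3f}")
```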

Table 14. Inter-expert agreement for thinning-target annotation.

Agreement Metric	Value
Expert 1 selected targets	537
Expert 2 selected targets	536
Expert 3 selected targets	537
Pairwise F1-score	0.834 ± 0.067
Pairwise Jaccard index	0.721 ± 0.093
Targets selected by all three experts	410
Targets selected by exactly two experts	123
Targets selected by at least one expert	667
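
The counts in Table 14 are mutually consistent with the majority-voting rule, and this can be checked with a few lines of arithmetic. The sketch below also shows the at-least-two-of-three rule on toy berry IDs; the function `majority_vote` and the toy sets are hypothetical and are not the authors' annotation data.

```python
from collections import Counter

# Counts copied from Table 14.
selected_per_expert = [537, 536, 537]
all_three, exactly_two, at_least_one = 410, 123, 667

consensus = all_three + exactly_two      # berries chosen by >= 2 experts
exactly_one = at_least_one - consensus   # berries chosen by a single expert

# Every selection is counted once per expert who made it.
total_selections = 3 * all_three + 2 * exactly_two + exactly_one


def majority_vote(*expert_sets, min_votes=2):
    """Return the IDs selected by at least `min_votes` experts."""
    votes = Counter(berry for s in expert_sets for berry in s)
    return {berry for berry, n in votes.items() if n >= min_votes}


if __name__ == "__main__":
    print(consensus, exactly_one, total_selections, sum(selected_per_expert))
    # Toy example with made-up berry IDs.
    print(majority_vote({1, 2, 3}, {2, 3, 4}, {3, 5}))
```

The consensus count of 410 + 123 = 533 matches the number of expert targets reported in Table 13, and the per-expert totals sum to the same 1610 selections implied by the overlap counts.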

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
