4.2. Comparative Analysis Among Baseline, Reference, and Optimized Models
To assess the efficacy of the proposed two-stage knowledge distillation and pruning framework, the final optimized model was compared with the original lightweight model and various scales of YOLOv8-seg series models. The results indicated that while the lightweight YOLOv8n-seg demonstrates commendable inference efficiency, its detection and segmentation performance in complex grape bunch scenarios under high-IoU conditions still exhibits significant potential for enhancement. Conversely, larger or two-stage models, such as YOLOv8l-seg, YOLOv8x-seg, and Mask R-CNN, can offer advantages in segmentation accuracy; however, their larger parameter size, higher computational cost, or more complex inference pipelines make them less suitable as final lightweight deployment-oriented models.
To tackle this issue, the current study enhances instance segmentation performance while maintaining the lightweight attributes of YOLOv8n-seg through a two-stage distillation and pruning optimization process. The model refined in the first stage attained a superior speed-accuracy balance compared to the original lightweight model. Building on this foundation, the second stage further improved the student model’s capacity for multi-scale feature representation and segmentation expression through same-architecture refinement distillation utilizing YOLOv8l-seg.
Experimental results demonstrated that, following second-stage distillation and weight optimization, the final model exhibited consistent improvements in both box mAP50-95 and mask mAP50-95. With the optimal weight configuration (bounding box 0.12, mask 0.55, feature 0.10), the model attained its highest overall performance, achieving a box mAP50-95 of 0.8945, a mask mAP50-95 of 0.7910, and a pure inference FPS of 119.19. These findings suggest that the proposed method effectively balances detection accuracy, segmentation accuracy, and inference efficiency without substantially increasing model complexity. The comparative results are presented in Table 5 and Table 6.
RT-DETR-L was included only as a detection-oriented reference model to provide an additional comparison of bounding-box accuracy and inference efficiency. Since RT-DETR-L does not generate instance masks, it was excluded from the mask-level segmentation comparison, and its mask mAP50-95 is not reported. The main instance segmentation comparison accordingly focused on the YOLOv8-seg series models, Mask R-CNN, and the optimized YOLOv8n-seg student models.
It should be noted that the final optimized model was selected according to the best trade-off between segmentation accuracy and deployment efficiency rather than accuracy alone. In this study, the comparison focused on the deployed student models under a unified inference setting. Therefore, the reported gains in FPS and parameter reduction should be interpreted together with the deployment configuration and pruning status of each model. The relationship between the number of model parameters and inference speed is shown in Figure 9.
Mask R-CNN was included as a two-stage instance segmentation reference model and as the heterogeneous teacher model used in the first-stage distillation. After retraining under the revised training configuration, Mask R-CNN achieved a box mAP50-95 of 0.9317 and a mask mAP50-95 of 0.8224 on the test set, indicating that it could provide effective region-proposal-based structural guidance and mask-level supervision for the lightweight YOLOv8n-seg student model. Although Mask R-CNN achieved the highest mask mAP50-95 among the compared models, its two-stage inference pipeline, larger computational cost, and lower inference speed make it less suitable as the final lightweight deployment-oriented model. It was therefore used in this study as a heterogeneous teacher model rather than as the final deployed model.
In contrast to the approach of directly selecting a high-capacity two-stage model or simply increasing model scale, the proposed method focuses on optimizing a lightweight YOLOv8n-seg student model through staged distillation and pruning. The effectiveness of the first-stage optimization was evaluated by comparing the first-stage model with the original YOLOv8n-seg baseline, while the final optimized model was selected based on the trade-off among segmentation accuracy, inference speed, parameter size, and computational efficiency.
4.4. Ablation Analysis of Pruning and Knowledge Distillation Strategies
4.4.1. Component-Level Ablation of Pruning and Knowledge Distillation
To further analyze the contribution of each optimization component, ablation experiments were conducted by comparing the original YOLOv8n-seg baseline, the pruning-only model, the Mask R-CNN distillation-only model, the YOLOv8l-seg distillation-only model, the first-stage model, the Pruning + YOLOv8l-seg KD model, and the final optimized model. The pruning-only model was used to evaluate the effect of channel pruning alone. The Mask R-CNN distillation-only model was used to evaluate the contribution of cross-architecture knowledge distillation without pruning. The YOLOv8l-seg distillation-only model was used to evaluate whether direct same-architecture distillation from YOLOv8l-seg could improve the original lightweight student model. The first-stage model represented the combined effect of early backbone pruning and Mask R-CNN-guided distillation. The Pruning + YOLOv8l-seg KD model was used to determine whether direct same-architecture distillation after pruning could replace the proposed two-stage optimization strategy. The final optimized model further introduced YOLOv8l-seg-guided refinement distillation on the basis of the first-stage model. All model variants were trained and evaluated using the same dataset split and experimental settings described in Section 3.5. The results are shown in Table 8.
Compared with the original YOLOv8n-seg baseline, the pruning-only model reduced the number of parameters from 5.64 M to 3.26 M and increased FPS from 47.40 to 51.89, indicating that channel pruning improved model compactness and inference efficiency. Its mask mAP50-95 increased from 0.7790 to 0.7861, whereas box mAP50-95 and recall slightly decreased. This suggests that pruning alone can improve lightweight characteristics and maintain acceptable segmentation performance, but it may also lead to a slight loss in detection completeness.
The Mask R-CNN KD-only model achieved a box mAP50-95 of 0.9015 and a mask mAP50-95 of 0.7891, both higher than those of the baseline model. This result indicates that heterogeneous knowledge distillation from Mask R-CNN can provide useful region-proposal-based structural guidance and mask-level supervision for the YOLOv8n-seg student model. However, because the model structure was not pruned, its parameter size remained 5.64 M, and its inference speed decreased to 42.67 FPS.
The YOLOv8l-seg KD-only model achieved the highest box mAP50-95 (0.9035) and mask mAP50-95 (0.7915) among the ablation variants, indicating that same-architecture distillation from a larger YOLOv8l-seg teacher can effectively improve segmentation accuracy. However, this variant retained the original YOLOv8n-seg model size and did not provide the same level of lightweight compression as the pruning-related models.
The Pruning + YOLOv8l-seg KD model was further evaluated to determine whether direct same-architecture distillation after pruning could replace the proposed two-stage optimization strategy. This variant achieved a box mAP50-95 of 0.8874, a mask mAP50-95 of 0.7871, a precision of 0.9430, and a recall of 0.9248, with 3.26 M parameters. Compared with the pruning-only model, this variant slightly improved mask mAP50-95 and , indicating that YOLOv8l-seg-guided distillation could partially compensate for the representation loss caused by pruning. However, its box mAP50-95, mask mAP50-95, and precision were still lower than those of the final optimized model. This result suggests that directly applying YOLOv8l-seg distillation after pruning was not sufficient to achieve the best balance, and that Mask R-CNN-guided first-stage distillation provided useful structural and mask-level guidance before the second-stage refinement.
Overall, the ablation results indicate that pruning, Mask R-CNN-guided distillation, and YOLOv8l-seg-guided refinement distillation contributed differently to the final model performance. Pruning reduced model parameters and improved lightweight characteristics, but it could also cause a slight decrease in box localization and recall. Mask R-CNN KD-only and YOLOv8l-seg KD-only improved segmentation accuracy without reducing model size, whereas pruning-related variants improved compactness. The direct Pruning + YOLOv8l-seg KD model showed that same-architecture distillation after pruning could improve mask performance to some extent, but it did not outperform the final two-stage model. Therefore, the proposed two-stage distillation and pruning strategy provided a more favorable trade-off among segmentation accuracy, model compactness, and inference efficiency than using pruning or single-teacher distillation alone.
4.4.2. Sensitivity Analysis of Pruning Ratio
To further examine the influence of pruning intensity, a pruning-ratio sensitivity analysis was conducted under the complete two-stage distillation framework. Four pruning ratios, namely 10%, 20%, 30%, and 40%, were evaluated using the same dataset split, training configuration, and evaluation protocol. As shown in Table 9, increasing the pruning ratio from 10% to 30% gradually improved both segmentation accuracy and inference efficiency. The 30% pruning ratio achieved the highest Box mAP50-95, Mask mAP50-95, precision, and recall, with values of 0.8945, 0.7910, 0.9507, and 0.9243, respectively. Compared with the 10% and 20% settings, the 30% pruning ratio may have removed more redundant low-importance channels and improved the compactness of feature representation, thereby allowing the subsequent two-stage distillation process to guide the lightweight model more effectively.
When the pruning ratio was further increased to 40%, the FPS increased to 142.75 and the FLOPs decreased to 5.67 G. However, the Box mAP50-95, Mask mAP50-95, precision, and recall decreased to 0.8896, 0.7861, 0.9478, and 0.9205, respectively. This indicates that excessive pruning may weaken feature representation and reduce segmentation performance, especially for dense and partially occluded grape berries. Therefore, the 30% pruning ratio was selected in this study because it provided the most favorable balance between segmentation accuracy and computational efficiency under the current experimental setting.
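The channel-selection step behind a given pruning ratio can be sketched as follows. This is only an illustration under stated assumptions: the text does not say which importance criterion was used, so an L1-norm score per output channel is assumed here, and `channel_importance`, `select_channels_to_keep`, and the toy weights are hypothetical names and data.

```python
from typing import List


def channel_importance(filters: List[List[float]]) -> List[float]:
    """Score each output channel by the L1 norm of its flattened weights.

    ASSUMPTION: L1 norm is a common pruning criterion, not necessarily
    the one used in the study.
    """
    return [sum(abs(w) for w in channel) for channel in filters]


def select_channels_to_keep(filters: List[List[float]], prune_ratio: float) -> List[int]:
    """Return indices of channels that survive pruning, in original order."""
    if not 0.0 <= prune_ratio < 1.0:
        raise ValueError("prune_ratio must be in [0, 1)")
    scores = channel_importance(filters)
    # Number of channels to drop, rounded down.
    n_prune = int(len(filters) * prune_ratio)
    # Channels sorted from least to most important; the first n_prune are removed.
    order = sorted(range(len(filters)), key=lambda i: scores[i])
    pruned = set(order[:n_prune])
    return [i for i in range(len(filters)) if i not in pruned]


# Toy layer with 10 output channels of increasing weight magnitude.
toy_filters = [[float(i + 1)] * 4 for i in range(10)]
kept = select_channels_to_keep(toy_filters, prune_ratio=0.30)
print(kept)
```

With a 30% ratio on this toy layer, the three lowest-scoring channels are dropped and the remaining seven keep their original order, which is the property a layer-rebuilding step would rely on.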
4.4.3. Effect of Distillation Weight Configuration
To further investigate the impact of various distillation weight configurations on the performance of the second-stage model, systematic ablation experiments were performed focusing on three supervision terms: bounding-box distillation, mask distillation, and feature distillation. The results are presented in Table 10 and Table 11.
The experimental results indicate that the baseline configuration (0.20/0.45/0.15) demonstrated satisfactory initial performance regarding Box mAP50-95; however, there remained potential for improvement in Mask mAP50-95. Following the adjustment of the distillation weights to 0.15/0.50/0.10, Mask mAP50-95 rose to 0.7903. This improvement suggests that a moderate increase in the mask distillation weight, coupled with a reduction in the bounding-box distillation weight, enhances the model’s capacity to represent instance boundaries.
Upon adjusting the weights to 0.15/0.55/0.08, the model attained the highest Mask mAP50-95 of 0.7922, demonstrating that a mask-oriented distillation configuration is more effective for the grape berry instance segmentation task. However, this configuration reduced the pure inference FPS to 103.57, indicating that while segmentation accuracy improved, overall inference efficiency suffered.
Considering both accuracy and efficiency, this study identified 0.12/0.55/0.10 as the optimal configuration for second-stage distillation. Under this configuration, the model achieved the highest Box mAP50-95 of 0.8945, a Mask mAP50-95 of 0.7910, and a pure inference FPS of 119.19, thereby yielding the best overall performance. These results suggest that, for grape berry instance segmentation, appropriately reducing the bounding-box distillation weight, increasing the mask distillation weight, and maintaining a moderate feature distillation weight can effectively balance detection accuracy, segmentation quality, and inference efficiency.
Overall, optimizing the distillation weight is a critical factor in enhancing the performance of the second-stage model. In contrast to incorporating additional modules or designing extra loss functions, a well-configured distillation weight can more effectively leverage the benefits of the proposed method in lightweight instance segmentation tasks.
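How the three weights might enter the second-stage objective can be sketched as below. The exact loss formulation is not given in this section, so the additive form, the function name `total_loss`, and the example loss values are illustrative assumptions; only the weight triples (box/mask/feature) are taken from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DistillWeights:
    box: float      # weight of the bounding-box distillation term
    mask: float     # weight of the mask distillation term
    feature: float  # weight of the feature distillation term


# Weight triples reported in the text (box / mask / feature).
BASELINE = DistillWeights(0.20, 0.45, 0.15)
SELECTED = DistillWeights(0.12, 0.55, 0.10)


def total_loss(task_loss: float, box_kd: float, mask_kd: float,
               feat_kd: float, w: DistillWeights) -> float:
    """ASSUMPTION: the student objective is the task loss plus a weighted
    sum of the three distillation terms."""
    return task_loss + w.box * box_kd + w.mask * mask_kd + w.feature * feat_kd


# Hypothetical per-batch loss values, used only to show the weighting effect.
example = dict(task_loss=1.0, box_kd=0.5, mask_kd=0.8, feat_kd=0.3)
loss_baseline = total_loss(w=BASELINE, **example)
loss_selected = total_loss(w=SELECTED, **example)
print(round(loss_baseline, 4), round(loss_selected, 4))
```

Under this form, moving from the baseline to the selected triple shifts gradient emphasis from the box term toward the mask term, which matches the mask-oriented reasoning given above.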
4.5. Visualization Results and Thinning Decision Evaluation
To intuitively assess the application performance of the proposed method in practical grape thinning scenarios, we present visualization examples of the instance segmentation results alongside the thinning decision outcomes derived from DBSCAN.
To compare the performance of the models in grape berry instance segmentation, three representative images of grape bunches were selected. The analysis included the original YOLOv8n-seg, the first-stage distilled and pruned model, the second-stage distillation baseline model, and the final optimized model, with results illustrated in Figure 10. Overall, the original YOLOv8n-seg effectively performed basic segmentation for most visible berries. However, in areas characterized by dense berry arrangements, mutual occlusion, and strong interference from branches and leaves in the background, challenges such as inadequate instance separation and unstable boundaries persisted. Following the first-stage cross-architecture knowledge distillation and backbone pruning, the model exhibited more concentrated feature responses in the primary bunch region, leading to notable improvements in segmentation results in certain local areas. The introduction of second-stage same-architecture refinement distillation further enhanced the continuity of segmentation and improved the representation of local details in densely populated berry regions.
In contrast, the final optimized model demonstrated enhanced stability in instance separation performance across various test samples. Specifically, in scenarios featuring closely contacted adjacent berries, irregular berry arrangements, and complex natural backgrounds, the optimized model effectively maintained target contours and minimized segmentation confusion in localized areas. These visual outcomes align closely with the quantitative experimental findings presented earlier, indicating that the two-stage distillation and pruning strategy improved the practical performance of the lightweight instance segmentation model under the current vineyard imaging conditions.
Although the quantitative results demonstrate the overall performance of the optimized model, small berries, occluded berries, and closely adhered berries remain important visual challenges in grape berry instance segmentation. Therefore, representative visualization results were further selected to qualitatively analyze the performance and limitations of the final optimized model under small-target, occlusion, and dense-adhesion conditions.
4.5.1. Qualitative Analysis of Small-Target, Occlusion, and Dense-Adhesion Cases
All grape berries were annotated as a single semantic class in this study, so small and occluded berries could not be scored separately. The final optimized model was therefore examined qualitatively on representative small-target, occlusion, and dense-adhesion cases.
As shown in Figure 11, the final optimized model produced relatively complete masks for most visible berries under leaf occlusion, branch occlusion, and dense berry adhesion. In particular, the model maintained clear mask coverage for partially occluded berries and small berries located near the bunch edges, indicating that the two-stage distillation strategy improved the mask representation ability of the lightweight student model.
However, several failure cases were still observed. When adjacent berries had highly similar colors and extremely weak boundary contrast, the predicted masks occasionally became incomplete or merged with neighboring berries. In addition, small berries that were severely occluded by leaves, branches, or adjacent berries were sometimes missed. These results indicate that the proposed model improves the segmentation of small and occluded berries to some extent, but fine-grained quantitative evaluation remains limited because small and occluded berries were not annotated as independent categories. Future work will establish attribute-level annotations for small berries, occluded berries, and severely adhered berries to quantitatively evaluate model robustness under different visual difficulty levels.
4.5.2. Visualization of DBSCAN-Based Thinning Decision Results
To assess the effectiveness of the proposed thinning decision-making method, we analyzed the spatial distribution of berries within grape bunches using the DBSCAN density clustering algorithm. This analysis was based on berry centroid coordinates and size information obtained from instance segmentation, leading to the generation of thinning decision results, as depicted in Figure 12. Specifically, Figure 12a illustrates the distribution of berry centroids in a two-dimensional space alongside the clustering-based thinning selection results; blue points denote retained berries, while red crosses indicate berries identified for removal. Figure 12b provides a visualization of the corresponding thinning decision on the grape bunch image, with red-highlighted regions marking the targets designated for removal.
The results indicate that the proposed method effectively identifies locally overcrowded regions within grape bunches against complex natural backgrounds and generates agronomically interpretable thinning decisions based on the principle of “prioritizing the removal of small berries.” In this instance, a total of 62 valid berries were identified, of which 16 were designated as targets for removal. This finding suggests that the method successfully integrates berry spatial density features with individual size differences, thereby offering robust decision support for subsequent intelligent thinning-target recommendation.
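The decision logic described above can be sketched roughly as follows. This is not the authors' implementation: it assumes that the neighborhood radius is the coefficient α times the mean berry diameter, that a cluster counts as dense when it has at least `dense_threshold` members, and that the removal ratio is applied per dense cluster to the smallest berries first. All function and parameter names are hypothetical, and DBSCAN is written out in plain Python to keep the sketch self-contained.

```python
import math
from typing import List, Optional, Tuple

Berry = Tuple[float, float, float]  # (centroid_x, centroid_y, diameter), in pixels


def dbscan(points: List[Tuple[float, float]], eps: float, min_pts: int) -> List[int]:
    """Minimal DBSCAN. Returns one label per point; -1 marks noise."""
    n = len(points)
    labels: List[Optional[int]] = [None] * n

    def neighbors(i: int) -> List[int]:
        # The point itself is counted as part of its own neighborhood.
        return [j for j in range(n) if math.dist(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        nb = neighbors(i)
        if len(nb) < min_pts:
            labels[i] = -1  # provisionally noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in nb if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point: border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nb_j = neighbors(j)
            if len(nb_j) >= min_pts:  # j is a core point: keep expanding
                queue.extend(nb_j)
    return [lab if lab is not None else -1 for lab in labels]


def recommend_thinning(berries: List[Berry], alpha: float = 1.2, min_pts: int = 3,
                       dense_threshold: int = 6, removal_ratio: float = 0.3) -> List[int]:
    """Return indices of berries recommended for removal."""
    if not berries:
        return []
    mean_diameter = sum(b[2] for b in berries) / len(berries)
    eps = alpha * mean_diameter  # ASSUMPTION: radius scales with mean berry diameter
    labels = dbscan([(b[0], b[1]) for b in berries], eps, min_pts)
    remove: List[int] = []
    for c in sorted(set(labels) - {-1}):
        members = [i for i, lab in enumerate(labels) if lab == c]
        if len(members) < dense_threshold:
            continue  # cluster not crowded enough to thin
        k = int(len(members) * removal_ratio)
        members.sort(key=lambda i: berries[i][2])  # smallest berries first
        remove.extend(members[:k])
    return sorted(remove)


# Toy bunch: a tight 2 x 5 grid of berries plus one isolated small berry.
toy = [(10.0 * (i % 5), 10.0 * (i // 5), 10.0 + i) for i in range(10)]
toy.append((1000.0, 1000.0, 5.0))
targets = recommend_thinning(toy)
print(targets)
```

In the toy bunch, the ten gridded berries form one dense cluster and the three smallest are selected, while the isolated berry is treated as noise and left alone even though it is the smallest overall.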
4.5.3. Sensitivity Analysis of DBSCAN Thinning-Decision Parameters
To further evaluate whether the DBSCAN-based thinning decision was overly sensitive to manually selected parameters, a parameter sensitivity analysis was conducted using the 330 valid grape bunch images. Four key parameters were examined: the neighborhood coefficient α, which scales the DBSCAN neighborhood radius ε; the DBSCAN minimum-points parameter; the dense-cluster threshold; and the removal ratio. During the analysis, one parameter was varied at a time while the remaining parameters were kept at their default values. The default parameter combination was α = 1.2, minimum points = 3, dense-cluster threshold = 6, and removal ratio = 0.3. The number of dense clusters and the number of recommended thinning targets were used to evaluate the influence of parameter variation on the thinning decision.
As shown in Table 12, the DBSCAN-based thinning decision showed different levels of sensitivity to the four parameters. When α decreased from 1.2 to 1.0, the neighborhood radius became smaller, which split berry distributions into more local clusters. Although the number of dense clusters increased from 477 to 715, the total number of recommended thinning targets decreased by 41.47%, indicating that an excessively small neighborhood radius may lead to fragmented clustering and insufficient thinning recommendations. When α increased to 1.4, adjacent berries were more likely to be merged into larger clusters, and the number of thinning targets increased by 9.48%. Therefore, α = 1.2 provided a moderate neighborhood radius for identifying local dense berry regions.
The influence of the minimum-points parameter was relatively limited when it varied from 2 to 3, as both settings produced the same number of dense clusters and thinning targets. Increasing it to 4 made the density requirement more restrictive and reduced the total number of thinning targets by 16.17%. The dense-cluster threshold showed only a small influence on the thinning results within the tested range. Compared with the default value of 6, setting the threshold to 5 and 7 changed the number of thinning targets by only +0.98% and −0.72%, respectively, indicating that the thinning decision was relatively stable around the selected value.
The removal ratio directly controlled the thinning intensity. When it decreased from 0.3 to 0.2, the number of thinning targets decreased by 34.68%, whereas increasing it to 0.4 increased the number of thinning targets by 34.98%. This result indicates that the removal ratio is the most direct parameter affecting the final number of berries selected for removal. Overall, the default parameter combination (α = 1.2, minimum points = 3, dense-cluster threshold = 6, removal ratio = 0.3) produced a moderate thinning intensity and avoided overly conservative or excessive thinning recommendations under the tested conditions.
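The one-at-a-time design used in this analysis can be sketched as a small configuration generator. The tested values are those mentioned in the text; the dictionary keys and function names are hypothetical, and the relative-change helper simply reproduces the percentage arithmetic used to compare each run against the default.

```python
from typing import Dict, List, Tuple

# Default values and tested ranges as described in the text; key names are illustrative.
DEFAULTS: Dict[str, float] = {
    "alpha": 1.2, "min_pts": 3, "dense_threshold": 6, "removal_ratio": 0.3,
}
TESTED: Dict[str, List[float]] = {
    "alpha": [1.0, 1.2, 1.4],
    "min_pts": [2, 3, 4],
    "dense_threshold": [5, 6, 7],
    "removal_ratio": [0.2, 0.3, 0.4],
}


def one_at_a_time(defaults: Dict[str, float],
                  tested: Dict[str, List[float]]) -> List[Tuple[str, Dict[str, float]]]:
    """Build configurations that change exactly one parameter from its default."""
    configs = []
    for name, values in tested.items():
        for value in values:
            if value == defaults[name]:
                continue  # the default run is shared by all parameters
            cfg = dict(defaults)
            cfg[name] = value
            configs.append((name, cfg))
    return configs


def relative_change(new: float, base: float) -> float:
    """Percentage change of `new` relative to `base`."""
    return 100.0 * (new - base) / base


configs = one_at_a_time(DEFAULTS, TESTED)
print(len(configs))
# Dense-cluster counts reported for alpha = 1.2 (477) and alpha = 1.0 (715).
print(round(relative_change(715, 477), 1))
```

Four parameters with two non-default values each give eight runs in addition to the shared default run, which keeps the analysis cheap but cannot reveal interactions between parameters.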
4.5.4. Quantitative Evaluation of DBSCAN-Based Thinning Decision
To quantitatively evaluate the reliability of the DBSCAN-based thinning decision module, the 33 test images were independently annotated by three experts according to grape thinning principles. For each numbered grape berry image, each expert selected berries that should be preferentially removed based on local berry density, berry size, bunch compactness, and spatial distribution. A consensus expert annotation was then generated using a majority-voting strategy: a berry was regarded as an expert-selected thinning target if it was selected by at least two of the three experts. The DBSCAN-recommended thinning targets were compared with the consensus expert annotations using precision, recall, F1-score, and MAE.
The three experts selected 537, 536, and 537 thinning targets, respectively, indicating similar count-level judgment among experts. The majority-voting consensus annotation contained 533 thinning targets. The average pairwise F1-score and Jaccard index among the three experts were 0.834 ± 0.067 and 0.721 ± 0.093, respectively, suggesting a reasonable level of inter-expert agreement for thinning-target annotation.
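The majority-voting consensus and the pairwise agreement measures can be illustrated with plain set operations. This is a minimal sketch on made-up berry IDs; the function names are hypothetical, and how the study aggregated agreement across images (per image or pooled) is not stated here, so the averaging below is over expert pairs only.

```python
from itertools import combinations
from typing import Callable, List, Set


def majority_vote(selections: List[Set[int]], min_votes: int = 2) -> Set[int]:
    """Keep berries selected by at least `min_votes` experts."""
    votes = {}
    for selected in selections:
        for berry_id in selected:
            votes[berry_id] = votes.get(berry_id, 0) + 1
    return {berry_id for berry_id, v in votes.items() if v >= min_votes}


def jaccard(a: Set[int], b: Set[int]) -> float:
    """Intersection over union of two selections."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def f1_agreement(a: Set[int], b: Set[int]) -> float:
    """F1 (Dice) overlap between two selections."""
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 1.0


def mean_pairwise(selections: List[Set[int]],
                  metric: Callable[[Set[int], Set[int]], float]) -> float:
    """Average a symmetric agreement metric over all expert pairs."""
    pairs = list(combinations(selections, 2))
    return sum(metric(a, b) for a, b in pairs) / len(pairs)


# Made-up selections for one image (berry IDs), not data from the study.
experts = [{1, 2, 3}, {2, 3, 4}, {3, 4, 5}]
consensus = majority_vote(experts)
print(sorted(consensus))
print(round(mean_pairwise(experts, jaccard), 3))
print(round(mean_pairwise(experts, f1_agreement), 3))
```

For a single pair of selections the two measures are linked by J = F1 / (2 − F1), so the Jaccard index is always the stricter of the two.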
As shown in Table 13, the DBSCAN-based thinning decision recommended 544 berries for removal from 33 test images, while the expert consensus annotation identified 533 berries as thinning targets. Among them, 411 berries were consistently selected by both the DBSCAN-based method and the expert consensus annotation. The proposed method achieved a precision of 0.756, a recall of 0.771, and an F1-score of 0.763, with an MAE of 1.48 berries per image. These results indicate that the DBSCAN-based thinning decision showed reasonable consistency with the three-expert consensus annotation under the current test conditions and could provide preliminary thinning decision-support outputs.
The inter-expert agreement results for thinning-target annotation are summarized in Table 14.
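The overlap-based scores follow directly from the three counts given above (544 recommended, 533 consensus targets, 411 shared). The short check below recomputes them; the function name is illustrative, and pooling counts over all test images is assumed.

```python
from typing import Tuple


def precision_recall_f1(n_shared: int, n_recommended: int,
                        n_reference: int) -> Tuple[float, float, float]:
    """Overlap metrics from pooled counts.

    n_shared:      berries chosen by both the method and the expert consensus
    n_recommended: berries recommended by the method
    n_reference:   berries in the expert consensus annotation
    """
    precision = n_shared / n_recommended
    recall = n_shared / n_reference
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


p, r, f = precision_recall_f1(n_shared=411, n_recommended=544, n_reference=533)
print(f"precision={p:.3f} recall={r:.3f} f1={f:.3f}")
# prints: precision=0.756 recall=0.771 f1=0.763
```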
However, the DBSCAN-based recommendations should still not be interpreted as fully validated agronomic thinning prescriptions. Although three experts were included in the revised evaluation, the test set remained limited to 33 images, and real field thinning trials were not conducted. In addition, thinning-target selection may vary among experts because of differences in cultivar characteristics, target yield, bunch compactness, fruit maturity, and production management objectives. Therefore, further validation using larger test sets, additional cultivars, different growth stages, multi-scenario field images, inter-seasonal field data, and real field thinning trials is still required.
4.6. Discussion
The experimental results of this study demonstrate that, in the context of grape berry instance segmentation, enhancing the performance of lightweight models does not necessarily require the development of complex new modules. Instead, meaningful improvements can be achieved through the optimization of training strategies. The first-stage cross-architecture distillation and backbone pruning allowed the model to attain strong baseline performance while maintaining a lightweight structure. Subsequently, the second-stage same-architecture refinement distillation further enhanced the student model’s segmentation capabilities under high Intersection over Union (IoU) conditions.
Although the final optimized model did not achieve the highest absolute mAP50-95 among all compared models, its advantage lies in the trade-off among segmentation accuracy, inference speed, parameter size, and computational efficiency. For example, after retraining under the revised training configuration, Mask R-CNN achieved strong mask-level accuracy, but its two-stage inference pipeline, lower inference speed, and higher computational cost make it less suitable for lightweight robotic perception. Similarly, larger YOLOv8-seg models can provide competitive segmentation accuracy, but their larger parameter sizes and higher FLOPs increase the difficulty of deployment on resource-constrained robotic platforms. In contrast, the final optimized YOLOv8n-seg model retained a compact structure while achieving improved mask mAP50-95 and substantially higher inference speed than the original baseline. Therefore, the proposed method should be interpreted as a deployment-oriented optimization strategy rather than an accuracy-only model selection strategy.
Nevertheless, the ablation and pruning-ratio analyses in this study still have several limitations. Although component-level ablation experiments and a pruning-ratio sensitivity analysis were added in this revision, they were conducted under a fixed dataset split and a fixed training configuration. Four pruning ratios, namely 10%, 20%, 30%, and 40%, were evaluated under the complete two-stage distillation framework. The results showed that the 30% pruning ratio achieved the highest box mAP50-95, mask mAP50-95, precision, and recall among the tested settings, while also maintaining a relatively high inference speed. However, this result should be interpreted as the most favorable setting among the tested pruning ratios under the current experimental conditions, rather than as a globally optimal pruning configuration. Future work will further investigate finer pruning-ratio intervals, different pruning strategies, broader datasets, and teacher-order ablation experiments to improve the generalizability of the compression and distillation configuration.
In addition, repeated training experiments with three random seeds were added in this revision to evaluate the stability of the baseline and final optimized models. The results showed that the final optimized model maintained relatively stable mask mAP50-95 and across different random initialization conditions. However, the evaluation was still conducted under a fixed dataset split, and neither k-fold cross-validation nor formal statistical significance testing was performed. Therefore, the reported improvements should be interpreted as performance differences observed under the current experimental setting rather than statistically significant conclusions. Future work will include larger datasets, k-fold cross-validation, broader repeated training, and statistical significance analysis to further evaluate the stability and reliability of the proposed method.
Compared with simply introducing additional modules or loss functions, the optimization of distillation weights provided a more direct way to improve the balance between segmentation accuracy and inference efficiency in the current task.
In addition, RT-DETR-L was used only as a detection-oriented reference model in this study. Because it does not generate instance-level masks, it cannot directly replace instance segmentation models for berry-level thinning decision support. Therefore, its results were interpreted only as a reference for bounding-box detection accuracy and inference efficiency, rather than as evidence of mask-level segmentation performance.
From the perspective of task characteristics, grape berries are typically densely clustered, exhibit strong adherence at their boundaries, and are relatively small in size. Consequently, high-precision mask representation is more critical than relying solely on coarse-grained bounding-box localization. This observation elucidates why a mask-oriented distillation weight configuration can yield enhanced performance.
The DBSCAN-based thinning decision-making strategy presented in this study addresses the limitations of conventional instance segmentation methods, which can identify the location of berries but fail to determine which berries should be removed. Yang et al. demonstrated that existing grape vision techniques perform detection and counting tasks effectively; however, generating thinning operation recommendations from perception results remains a significant gap in current research [23]. Woo et al. further noted that while thinning assistance systems can aid manual management by predicting berry counts, precise screening at the single-berry level is essential for facilitating automated thinning-target recommendation [21]. By incorporating berry centroid and diameter information into spatial clustering analysis, this study converts visual perception results into actionable thinning decision criteria, thereby enhancing the model’s alignment with the practical requirements of grape thinning robots.
From a practical perspective, the proposed framework provides a bridge between berry-level visual perception and thinning-target decision support. Instance segmentation alone can identify the location and contour of grape berries, but it cannot determine which berries should be removed according to local density and berry size. The DBSCAN-based decision module partially addresses this gap by transforming segmentation outputs into preliminary thinning-target recommendations. This provides a useful intermediate decision layer for future grape thinning robots. However, the current framework remains an offline visual perception and decision-support method, and its practical use in robotic thinning still depends on further integration with three-dimensional localization, end-effector trajectory planning, and closed-loop execution control.
However, the expert-annotation-based evaluation of the DBSCAN thinning decision module remains preliminary. Although three experts were included and a majority-voting consensus annotation was used in this revision, the evaluation was still based on only 33 test images, and real field thinning trials were not conducted. In addition, thinning decisions may vary among agronomists depending on cultivar characteristics, target yield, bunch compactness, fruit maturity, and production management objectives. Future work will introduce larger test sets, additional cultivars, different growth stages, inter-seasonal field data, and real thinning trials to further validate the agronomic reliability and practical applicability of the DBSCAN-based thinning decision module.
The generalization ability of the proposed method is also limited by the current dataset. Although the dataset contained 16,461 annotated berry instances, these instances were derived from 330 valid grape bunch images collected from Shine Muscat grape bunches at the berry enlargement stage in a single vineyard using the same RGB-D camera system. Therefore, the dataset does not cover sufficient variations in grape cultivars, growth stages, production years, vineyard management conditions, canopy structures, illumination conditions, camera systems, or orchard environments. Although data augmentation was used to improve the robustness of model training, it cannot replace real external validation data. Consequently, the current results should be interpreted as preliminary evidence obtained under the specific cultivar, growth stage, vineyard, and imaging conditions of this study, rather than as conclusive evidence of general applicability across diverse grape production scenarios. Future work will expand the dataset by including different grape cultivars, different growth stages, multiple vineyards, different production seasons, and different imaging systems to further evaluate the generalization capability of the proposed method.
Moreover, although all visible berries were manually annotated at the instance level, attribute-level annotations for small berries, occluded berries, and severely adhered berries were not established. Therefore, the current study could not provide separate quantitative performance metrics for these visually challenging categories. In addition, formal inter-annotator agreement assessment was not conducted. Future work will introduce multi-annotator labeling, annotation consistency evaluation, and attribute-level labels to further improve dataset reliability and enable more detailed robustness analysis.
In addition, the inference speed reported in this study was obtained on an NVIDIA RTX 3060 Laptop GPU rather than on an embedded edge platform. Therefore, the current results only reflect the computational efficiency of the model in an offline laptop-GPU environment and cannot be directly regarded as evidence of real-time embedded deployment capability, which requires further validation on embedded hardware. More importantly, real robotic thinning experiments were not conducted in the current study. Practical robotic indicators, such as berry localization error, end-effector positioning accuracy, thinning success rate, and operation cycle time, still need to be evaluated under closed-loop field conditions. Therefore, the proposed method should be regarded as an offline RGB-based visual perception and preliminary thinning decision-support module for future grape thinning robots, rather than as a fully validated robotic thinning system.
The proposed method primarily depends on two-dimensional RGB image information for berry instance segmentation and thinning-target recommendation, and three-dimensional structural or depth information was not incorporated into the current decision module. Therefore, the recommended thinning targets cannot be directly converted into executable three-dimensional robot coordinates without additional depth sensing, multi-view reconstruction, hand-eye calibration, and coordinate transformation. This limitation may lead to decision or execution errors in cases with severe berry occlusion, overlapping berries, and complex spatial hierarchies within grape bunches. Future work will integrate RGB-D data, multi-view imaging, and multimodal perception to improve three-dimensional bunch structure understanding, berry localization accuracy, and robotic thinning execution.