1. Introduction
Muskmelon is a high-value horticultural crop valued for sweetness and aroma, with strong market demand. In China, the thick-skinned cultivar ‘Boyang No. 9’ is widely grown in protected systems due to its stable yield and superior eating quality [1]. Shandong Province, supported by advanced facility agriculture, has become a core region for standardized and industrialized production of ‘Boyang No. 9’, achieving both high output and consistent quality [2]. Aroma formation further underpins its commercial value and consumer preference [3]. Greenhouse cultivation dominates muskmelon production; however, at the ripening stage, fruits are often low-hanging and embedded in dense vines and leaves, while harvesting occurs under warm and humid conditions. These factors lead to poor working conditions, high labor intensity, and limited picking efficiency [4]. With rising labor shortages and production costs, harvesting robots are increasingly considered a practical approach to improving efficiency and sustainability in high-value horticulture [5]. Recent surveys and reviews have highlighted rapid progress in fruit and vegetable harvesting robots, while consistently emphasizing occlusion, target clustering/overlap, and constrained onboard computation as persistent barriers to reliable field deployment [6,7,8]. In practice, unsafe or failed picking attempts can damage fruits and vines and increase subsequent labor demand, making greenhouse harvesting decisions inherently risk-sensitive, especially under occlusion and clustering.
The performance of harvesting robots depends on converting perception outputs into reliable harvesting decisions (detection, localization, and harvestability). Recent surveys note that this perception-to-action conversion remains challenging in greenhouse scenes under visual clutter and limited onboard computation [9]. Machine vision has, therefore, become central to robotic harvesting [10,11]. Handcrafted-feature methods are often brittle under heavy canopy clutter and varying illumination [12], RGB–D fusion improves localization and picking-point estimation [13,14], and lightweight deep-learning models enable real-time perception on edge devices in complex horticultural environments [15,16,17,18,19,20].
Nevertheless, a detected fruit is not necessarily a safe target for manipulation. Graspable regions may be partially hidden, and collision risk and accessibility depend strongly on leaf occlusion and fruit crowding/overlap. Recent harvesting systems, therefore, increasingly incorporate active sensing and viewpoint planning [21,22,23] or 3D pose-aware perception [24,25,26] to improve downstream end-effector success rather than detection accuracy alone. This indicates a gap between detection and action: without explicit harvestability reasoning, robots may attempt to harvest fruits that are visually detected but operationally risky, resulting in collisions, low success rates, and potential fruit/vine damage.
To bridge this gap, we propose a Risk-Gated Harvestability Decision (RGHD) framework designed for agricultural edge devices. Given the limited compute budget in greenhouse platforms, RGHD adopts a multi-stage design that separates fast candidate generation from lightweight risk assessment because detector confidence alone does not explicitly reflect occlusion severity or collision-prone crowding. Specifically, a YOLO11n detector generates fruit candidates, and a ShuffleNetV2 classifier with an edge-guided spatial attention module estimates occlusion severity. The classifier takes the detected ROI as input to estimate occlusion severity from local boundary cues, which cannot be reliably inferred from box-level scores or overlap-based postprocessing alone. To avoid costly 3D segmentation, clustering risk is approximated using an efficient 2D IoU-based metric. These cues are combined in a lightweight decision gate with two configurable policies—Safety-First and Efficiency-First—to trade off throughput and operational safety. Experiments on a greenhouse dataset of ‘Boyang No. 9’ melons show that RGHD converts detections into actionable harvesting decisions, reducing unsafe attempts while maintaining real-time performance.
The main contributions are as follows:
Decoupling Perception and Decision: We introduce a modular framework that separates target detection from harvestability reasoning, enabling transparent failure analysis and trusted human–robot interaction under complex occlusion and interference.
Edge-Enhanced Occlusion Grading: We design an ROI-based classification module incorporating edge-guided spatial attention (EGSA), which significantly improves the recognition of occlusion levels by emphasizing boundary cues while maintaining low inference latency.
Tunable Risk Policies: We formulate a multi-source risk fusion mechanism that allows the system to switch between “Safety-First” and “Efficiency-First” operating modes, allowing mode selection under different crowding conditions of the greenhouse environment.
2. Materials and Methods
2.1. Framework Overview
The RGHD framework is designed to decouple visual perception from decision logic, processing single-frame RGB images to output specific harvestability commands. As illustrated in Figure 1, the pipeline operates in three sequential stages:
Candidate Generation: A lightweight detector (YOLO11n) scans the global view to localize all potential melon targets and generate initial Regions of Interest (ROIs).
Risk Perception: Each candidate ROI is analyzed individually to quantify environmental constraints. This involves a fine-grained classification network for Occlusion Severity and a geometric analyzer for Spatial overlap risks.
Gated Decision: A logical gate integrates the multi-source risk cues and selects between the “Efficiency-First” and “Safety-First” operating policies. Candidates that do not meet the selected policy are skipped and the system moves on to the next target. Importantly, a rejection is not treated as a permanent exclusion: unharvested fruits remain on the vine and can be re-assessed in later harvesting cycles. As occlusion and crowding conditions change over time—due to vine growth, leaf movement, or routine canopy management—previously rejected fruits may become harvestable.
2.2. Data Collection and Dataset Construction
Data were collected at a commercial greenhouse melon production base in Zhoucun District, Zibo City, Shandong Province, China. The target crop was the thick-skinned muskmelon cultivar ‘Boyang No. 9’. Image acquisition was conducted from late March to early April 2025 using a vivo smartphone (vivo, Dongguan, China; Model: iQOO Neo7, equipped with a 50-megapixel main CMOS sensor, f/1.88 aperture, and 23 mm equivalent focal length). Images were captured in auto-exposure and auto-white-balance modes without HDR processing, at an original resolution of 3060 × 3060 pixels, and at a shooting distance of 30–50 cm from the canopy. While commercial harvesting robots typically employ industrial vision sensors (e.g., RGB-D cameras), a high-resolution smartphone camera was selected for this initial dataset construction to cost-effectively and rapidly capture high-fidelity RGB data representing complex real-world canopy features. To ensure the dataset captured the complexity of real cultivation conditions, sampling covered varying illumination conditions (front-lighting and backlighting with dynamic shadowing) across different times of day, and included scenes with fruit overlap and dense foliage occlusion (Figure 2). A total of 945 raw RGB images were obtained and subsequently standardized to 640 × 640 pixels.
To increase data diversity, geometric augmentations (random horizontal flipping and scaling) and pixel-level augmentations (Gaussian noise and motion blur) were applied to generate 5670 images from 945 original RGB images. The dataset was then split in a group-wise manner into training, validation, and test sets containing 3972, 1128, and 570 images, respectively. Specifically, all augmented variants derived from the same original image were kept in the same subset to prevent information leakage.
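The group-wise split described above can be sketched as follows. This is a minimal illustration, not the authors' script: the file naming, the random seed, and the assumption that each original image contributes exactly six images (5670 / 945) are hypothetical. Under that assumption, the reported subset sizes correspond to 662, 188, and 95 source images.

```python
import random
from collections import defaultdict


def group_split(samples, n_train, n_val, seed=0):
    """Split (name, source_id) pairs so that every source stays in one subset.

    n_train and n_val are counts of SOURCE images, not augmented images;
    the remaining sources go to the test subset.
    """
    groups = defaultdict(list)
    for name, source_id in samples:
        groups[source_id].append(name)

    ids = sorted(groups)
    random.Random(seed).shuffle(ids)  # shuffle sources, never individual variants

    parts = {
        "train": ids[:n_train],
        "val": ids[n_train:n_train + n_val],
        "test": ids[n_train + n_val:],
    }
    return {
        split: [name for sid in members for name in groups[sid]]
        for split, members in parts.items()
    }


# Hypothetical naming: six images per original (assumption, see lead-in).
samples = [(f"img{sid:04d}_v{k}.jpg", sid) for sid in range(945) for k in range(6)]
splits = group_split(samples, n_train=662, n_val=188)
print({name: len(items) for name, items in splits.items()})
# {'train': 3972, 'val': 1128, 'test': 570}
```

Splitting at the level of source identifiers, rather than individual files, is what guarantees that no augmented variant of a training image appears in the validation or test subsets.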
2.2.1. Labeling Scheme
To support the full “localization–status–decision” pipeline, the dataset includes two types of annotations:
Detection annotations: YOLO-format bounding boxes for each fruit instance.
Occlusion levels:
Occ = 0: no occlusion; fruit boundary is complete and clearly visible;
Occ = 1: minor occlusion; fruit remains identifiable but the visible boundary is partially missing;
Occ = 2: severe occlusion; a large portion of the fruit body is occluded and the boundary is largely incomplete; harvesting is not recommended.
All annotations were produced by a single annotator and were rechecked in a second pass to ensure label consistency.
2.2.2. Definition of Fruit Overlap State and IoU Inference
Fruit overlap describes the degree of interleaving and boundary blending between fruits in image space and reflects separability and potential interference risk during harvesting. Three overlap levels (ovl) are defined as:
ovl = 0: no significant overlap;
ovl = 1: partial overlap, which may require adjusting the grasping approach or obstacle avoidance;
ovl = 2: extensive overlap with strong boundary blending; direct picking is not recommended.
The overlap status is not manually annotated; it is derived by post-processing the detection outputs. For each target bounding box $B_i$, the IoU is computed with all other boxes $B_j$ ($j \neq i$) in the same frame. The maximum IoU value is used as the overlap risk score:

$$r_i = \max_{j \neq i} \mathrm{IoU}(B_i, B_j)$$

This score is mapped to ovl using two thresholds $\tau_1$ and $\tau_2$ ($\tau_1 < \tau_2$):

$$\mathrm{ovl}_i = \begin{cases} 0, & r_i < \tau_1 \\ 1, & \tau_1 \le r_i < \tau_2 \\ 2, & r_i \ge \tau_2 \end{cases}$$

In this study, $\tau_1$ and $\tau_2$ were set empirically based on preliminary observations to separate low, moderate, and high overlap interference.
2.2.3. Dataset Statistics
Table 1 summarizes the dataset distribution across occlusion levels and overlap levels after data augmentation.
2.3. Stage I: Fruit Detection and Stable Candidate Selection
Stage I aims to generate bounding boxes for all visible melons in a single-frame RGB image and to construct per-fruit ROI candidates for subsequent analysis. Considering frequent occlusion and illumination variation in greenhouse scenes, the detector should balance detection capability with real-time performance and edge-deployment constraints. Therefore, we adopt the lightweight YOLO11n model as the front-end detector (Figure 3) [27].
For each input image, YOLO11n outputs a set of melon bounding boxes. Each bounding box is then used to extract a corresponding ROI by cropping the original image (with boundary clipping when a box extends beyond the image border). The resulting ROI images serve as candidate inputs to Stage II for occlusion-state perception and subsequent risk modeling. In this study, YOLO11n is used primarily as a box generator to provide complete melon instances for ROI construction, without modifying the detector architecture.
2.4. Stage II: Fruit State Perception
2.4.1. ROI Generation
For each detected fruit, a region of interest (ROI) is cropped from the original image. To preserve contextual cues that are informative for occlusion assessment, the detection box is expanded by 10% in both width and height before cropping, and the expanded box is clipped to the image boundaries.
The ROI is then normalized to 224 × 224 while preserving the aspect ratio: the longer side is resized to 224 pixels, and the remaining area is padded to form a square input. In our implementation, constant-value padding is used to avoid geometric distortion and ensure a consistent input size for the classifier. Representative examples of cropped ROIs under different occlusion and overlap conditions are shown in Figure 4.
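The ROI geometry described in this subsection can be sketched without any image library. The snippet below is an illustration under stated assumptions: the 10% expansion is interpreted as a 10% increase in total width and height (5% per side), padding is split symmetrically, and the function names are hypothetical. Only the box and padding arithmetic is shown; the actual resize and padding would be applied with an image library.

```python
def expand_and_clip(box, img_w, img_h, margin=0.10):
    """Expand an (x1, y1, x2, y2) box by `margin` of its size and clip to the image."""
    x1, y1, x2, y2 = box
    dw = (x2 - x1) * margin / 2.0  # assumption: margin is shared by both sides
    dh = (y2 - y1) * margin / 2.0
    return (
        max(0.0, x1 - dw),
        max(0.0, y1 - dh),
        min(float(img_w), x2 + dw),
        min(float(img_h), y2 + dh),
    )


def letterbox_geometry(roi_w, roi_h, size=224):
    """Return the resized ROI size and the (left, top, right, bottom) padding."""
    scale = size / max(roi_w, roi_h)  # the longer side maps to `size`
    new_w = max(1, round(roi_w * scale))
    new_h = max(1, round(roi_h * scale))
    pad_x, pad_y = size - new_w, size - new_h
    left, top = pad_x // 2, pad_y // 2  # assumption: symmetric padding
    return new_w, new_h, (left, top, pad_x - left, pad_y - top)


x1, y1, x2, y2 = expand_and_clip((100, 200, 300, 300), img_w=640, img_h=640)
new_w, new_h, pad = letterbox_geometry(x2 - x1, y2 - y1)
print((x1, y1, x2, y2), new_w, new_h, pad)
```

For the example box, the expanded ROI is 220 × 110 pixels, so the longer side is scaled to 224 and the shorter side to 112, leaving 56 rows of constant padding above and below.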
2.4.2. Occlusion Classification Network
Occlusion perception is formulated as a three-class classification task (occ ∈ {0, 1, 2}). To meet edge and mobile deployment constraints, we adopt ShuffleNetV2 as the backbone and introduce an Edge-Guided Spatial Attention (EGSA) module to enhance occlusion-relevant boundary cues while suppressing background texture interference [28]. During training, weighted cross-entropy (WCE) is used to address class imbalance and to reduce high-cost errors, particularly misclassifying severely occluded targets as low-risk cases.
To ensure a fair evaluation and avoid information leakage, the augmented dataset is split such that all augmented variants derived from the same original image are assigned to the same subset (train/valid/test). ROI samples inherit the split of their source images, preventing correlated samples from appearing across subsets.
2.4.3. Edge-Guided Spatial Attention (EGSA)
Occlusion-level perception in greenhouse images is challenging because discriminative cues are often concentrated around local boundaries where the fruit is partially covered by leaves. To enhance boundary-sensitive representations with minimal overhead, we introduce a lightweight Edge-Guided Spatial Attention (EGSA) module into the feature extraction stage of the baseline network.
As illustrated in Figure 5, EGSA constructs a spatial attention map by combining two complementary cues derived from an intermediate feature map: (i) a structural response map obtained by channel-mean aggregation, and (ii) an edge hint map produced by Sobel-based edge extraction followed by normalization. These two maps are concatenated and passed through a 1 × 1 convolution and a sigmoid activation to generate the spatial attention map, which is then broadcast to all channels and used to reweight the original feature map:

$$A = \sigma\left(\mathrm{Conv}_{1 \times 1}\left([M_s; M_e]\right)\right), \qquad F' = A \otimes F$$

Here, $F$ is the intermediate feature map, $M_s$ is the structural response map, $M_e$ is the normalized edge hint map, $[\cdot\,;\cdot]$ is channel-wise concatenation, $\sigma$ is the sigmoid function, and $\otimes$ denotes element-wise multiplication with broadcasting along the channel dimension. Because EGSA relies on a fixed edge operator and only one 1 × 1 convolution, it introduces minimal parameters and computation. This design suppresses background leaf-texture interference and emphasizes occlusion-relevant fruit boundaries and partially visible fruit regions, thereby improving separability among occlusion levels.
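A minimal PyTorch sketch of such a module is given below, following the textual description only. Several details are assumptions rather than the authors' implementation: the Sobel operator is applied to the channel-mean map, the edge magnitude is normalized by its per-sample maximum, and the class name, feature-map size, and `eps` constant are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class EGSA(nn.Module):
    """Edge-guided spatial attention sketch: channel mean + Sobel edge hint."""

    def __init__(self, eps: float = 1e-6):
        super().__init__()
        sobel_x = torch.tensor([[-1.0, 0.0, 1.0],
                                [-2.0, 0.0, 2.0],
                                [-1.0, 0.0, 1.0]])
        sobel_y = sobel_x.t()
        # Fixed (non-learnable) edge operator, shape (2, 1, 3, 3).
        self.register_buffer("sobel", torch.stack([sobel_x, sobel_y]).unsqueeze(1))
        # The only learnable part: one 1x1 convolution over the two cue maps.
        self.fuse = nn.Conv2d(2, 1, kernel_size=1)
        self.eps = eps

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        # (i) structural response map by channel-mean aggregation: (N, 1, H, W)
        structure = feat.mean(dim=1, keepdim=True)
        # (ii) edge hint map: Sobel gradient magnitude of the structural map
        grad = F.conv2d(structure, self.sobel, padding=1)  # (N, 2, H, W)
        edge = torch.sqrt(grad.pow(2).sum(dim=1, keepdim=True) + self.eps)
        edge = edge / (edge.amax(dim=(2, 3), keepdim=True) + self.eps)
        # Concatenate, fuse, squash to (0, 1), then broadcast over channels.
        attn = torch.sigmoid(self.fuse(torch.cat([structure, edge], dim=1)))
        return feat * attn


feat = torch.randn(2, 116, 28, 28)  # hypothetical intermediate feature map
out = EGSA()(feat)
print(out.shape)  # torch.Size([2, 116, 28, 28])
```

In this sketch the module adds only three learnable parameters (two weights and one bias of the 1 × 1 convolution), which is consistent with the low overhead described in the text.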
2.4.4. Occlusion Classification with Weighted Cross-Entropy
Due to the uneven distribution of samples across occlusion levels and the fact that misclassifying severe occlusion (occ = 2) as a low-risk class may induce overly aggressive picking decisions, we adopt a weighted cross-entropy (WCE) loss to reduce safety-critical errors. Let $p_c$ denote the predicted probability of class $c$ and $y_c$ denote the one-hot ground-truth label. The WCE loss is defined as:

$$\mathcal{L}_{\mathrm{WCE}} = -\sum_{c=1}^{C} w_c \, y_c \log p_c$$

where $C = 3$ in this study and $w_c$ is the class weight.
To avoid information leakage, class weights are computed using only the training set. Let $n_c$ be the number of training samples in class $c$. The weights $w_c$ are set from these class counts and then normalized.
This weighting strategy increases the penalty for under-represented and higher-risk categories, improving recognition of severely occluded samples while maintaining overall performance.
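As an illustration of how such weights act on the loss, the sketch below assumes inverse-frequency weights normalized to sum to the number of classes. That specific form, the class counts, and the function names are assumptions chosen for the example, not values taken from this study.

```python
import math


def class_weights(counts):
    """Inverse-frequency weights, normalized so that they sum to len(counts)."""
    inverse = [1.0 / n for n in counts]
    total = sum(inverse)
    return [len(counts) * v / total for v in inverse]


def weighted_ce(probs, label, weights):
    """Weighted cross-entropy for one sample with a one-hot label."""
    return -weights[label] * math.log(probs[label])


train_counts = [500, 300, 200]  # hypothetical counts for occ = 0, 1, 2
w = class_weights(train_counts)
print([round(v, 3) for v in w])

# The same predicted probability is penalized more for the rarer class.
print(weighted_ce([0.6, 0.3, 0.1], label=0, weights=w))
print(weighted_ce([0.1, 0.3, 0.6], label=2, weights=w))
```

With these hypothetical counts, a severe-occlusion sample predicted with probability 0.6 contributes a larger loss than a non-occluded sample predicted with the same probability, which is the intended effect of the weighting.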
2.5. Fruit Overlap Risk Estimation
In a single frame, the detector outputs a set of melon bounding box candidates $\mathcal{B} = \{B_i\}_{i=1}^{N}$. In greenhouse scenes with frequent fruit clustering, substantial overlap between candidates in image space often indicates potential spatial interference for subsequent manipulation. To enable online assessment without introducing additional learning modules, we adopt a geometry-based overlap estimator based on the Intersection over Union (IoU), as illustrated in Figure 6.

The IoU between two candidate boxes $B_i$ and $B_j$ is computed as:

$$\mathrm{IoU}(B_i, B_j) = \frac{|B_i \cap B_j|}{|B_i \cup B_j|}$$

For each candidate $B_i$, the overlap risk score is defined as the maximum IoU with any other candidate in the same frame:

$$r_i = \max_{j \neq i} \mathrm{IoU}(B_i, B_j)$$

which characterizes the most unfavorable local crowding condition around the target.

The continuous score is then mapped to a discrete overlap level $\mathrm{ovl}_i$ using two thresholds $\tau_1$ and $\tau_2$ ($\tau_1 < \tau_2$):

$$\mathrm{ovl}_i = \begin{cases} 0, & r_i < \tau_1 \\ 1, & \tau_1 \le r_i < \tau_2 \\ 2, & r_i \ge \tau_2 \end{cases}$$

where the thresholds are determined according to operational safety requirements and validation-set statistics.

In this study, $\tau_1$ and $\tau_2$ take the values used in the overlap-level definition in Section 2.2.2. The resulting overlap level is used as an input to the subsequent multi-source risk fusion and gated decision.
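The overlap estimator can be sketched in a few lines of plain Python. The threshold values `tau1 = 0.1` and `tau2 = 0.5` below are placeholders chosen only to make the example run; they are not the thresholds used in this study, and the boundary handling (strict versus non-strict inequalities) is likewise an assumption.

```python
def iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def overlap_scores(boxes):
    """Per-candidate maximum IoU with any other candidate in the same frame."""
    return [
        max((iou(box, other) for j, other in enumerate(boxes) if j != i), default=0.0)
        for i, box in enumerate(boxes)
    ]


def overlap_level(score, tau1, tau2):
    """Map a continuous overlap score to ovl in {0, 1, 2}."""
    if score < tau1:
        return 0
    if score < tau2:
        return 1
    return 2


boxes = [(0, 0, 100, 100), (50, 0, 150, 100), (300, 300, 400, 400)]
scores = overlap_scores(boxes)
levels = [overlap_level(s, tau1=0.1, tau2=0.5) for s in scores]  # placeholder thresholds
print([round(s, 3) for s in scores], levels)
# [0.333, 0.333, 0.0] [1, 1, 0]
```

A frame with a single candidate yields a score of 0 for that candidate, so an isolated fruit is always assigned ovl = 0 regardless of the thresholds.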
2.6. Risk-Gated Harvestability Decision (RGHD)
Operational definition and scope. In this work, “harvestability” is operationally defined by observable visual cues in RGB images (occlusion, 2D overlap, and relative scale) that characterize candidate ambiguity in cluttered greenhouse canopies. RGHD focuses on risk-aware candidate gating for conservative decision-making under crowding, providing adjustable policies rather than estimating execution outcomes. Accordingly, the reported unsafe acceptance/rejection rates quantify decision risk with respect to the adopted cues and gating rules. Although harvestability is defined operationally based on visual cues in this study, these criteria are consistent with practical greenhouse harvesting principles, where severe occlusion, strong fruit–fruit interference, and insufficient visible fruit area are commonly regarded as indicators of non-harvestable targets by experienced workers.
After occlusion-level prediction and overlap-risk quantification, RGHD fuses multi-source cues and outputs a harvestability decision through a rule-based gate (Figure 7). For each candidate fruit $i$, three inputs are used: (1) the occlusion level $\mathrm{occ}_i$ predicted by the ShuffleNetV2-based occlusion classifier; (2) the overlap level $\mathrm{ovl}_i$ derived from IoU-based mapping [29]; (3) a scale-related cue $s_i$ computed from the normalized bounding box area:

$$s_i = \frac{A_{\mathrm{box},i}}{A_{\mathrm{img}}}$$

where $A_{\mathrm{box},i}$ denotes the area of the candidate bounding box and $A_{\mathrm{img}}$ is the image area. A candidate is considered harvestable only when occlusion risk, spatial-interference risk, and scale constraints jointly satisfy the selected policy. Two operating modes, Safety-First and Efficiency-First, are provided to support different safety–efficiency preferences.

In this study, overlap levels are mapped from the maximum IoU score using the two thresholds $\tau_1$ and $\tau_2$, and the resulting overlap level is used in the gate to exclude candidates with severe overlap/interference. The scale threshold is used to filter extremely small candidates, which are more likely to be distant, partially visible, or unreliable for decision-making under a single-view setting.
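The structure of such a rule-based gate can be sketched as below. The per-policy limits are illustrative assumptions, since the exact rules of the two modes are not reproduced in this text: here Safety-First is assumed to accept only unoccluded fruit, and Efficiency-First is assumed to also accept minor occlusion. The 0.05 scale limit mirrors the value used in the proxy definition in Section 3.2.2, and all names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    occ: int      # predicted occlusion level, 0/1/2
    ovl: int      # IoU-derived overlap level, 0/1/2
    scale: float  # bounding-box area divided by image area


# Hypothetical policy table (assumption, see lead-in); not the study's exact rules.
POLICIES = {
    "safety_first": {"max_occ": 0, "max_ovl": 1, "min_scale": 0.05},
    "efficiency_first": {"max_occ": 1, "max_ovl": 1, "min_scale": 0.05},
}


def relative_scale(box, img_w, img_h):
    """Normalized box area s = A_box / A_img for an (x1, y1, x2, y2) box."""
    return (box[2] - box[0]) * (box[3] - box[1]) / float(img_w * img_h)


def gate(candidate, policy="safety_first"):
    """Return True (harvest) only if all three cues satisfy the policy."""
    limits = POLICIES[policy]
    return (
        candidate.occ <= limits["max_occ"]
        and candidate.ovl <= limits["max_ovl"]
        and candidate.scale >= limits["min_scale"]
    )


fruit = Candidate(occ=1, ovl=0, scale=relative_scale((100, 100, 300, 300), 640, 640))
print(gate(fruit, "safety_first"), gate(fruit, "efficiency_first"))
# False True
```

Keeping the limits in a table rather than in the decision function means that switching the operating mode, or tuning a limit, does not require any change to the perception modules.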
3. Results
3.1. Experimental Setup and Training Configuration
Experiments were conducted on a workstation equipped with an AMD Ryzen 5 5600 CPU (AMD, Santa Clara, CA, USA) and an NVIDIA GeForce RTX 4060 GPU (8 GB) (NVIDIA, Santa Clara, CA, USA). The software environment consisted of Python 3.8 and PyTorch 1.13.1 with CUDA 11.3 support.
The Stage I YOLO11n detector was trained for 200 epochs (batch size 16), and the Stage II occlusion classifier was trained for 100 epochs (batch size 16). For YOLO11n, the input resolution was set to 640 × 640 and the initial learning rate was set to 0.01. Stochastic gradient descent (SGD) with a momentum of 0.937 and a weight decay of 0.0005 was used as the optimizer. For the occlusion classifier, ROI inputs were resized to 224 × 224, and the network was optimized using the AdamW optimizer implemented in PyTorch (v1.13.1) with an initial learning rate of 5 × 10⁻⁴. The main training hyperparameters for Stage I and Stage II are summarized in Table 2.
3.2. Evaluation Indicators
3.2.1. Object Detection Evaluation Metrics
For object detection models, we employ the widely adopted mAP@0.5 and mAP@0.5:0.95 to assess detection accuracy, and Recall@0.5 to quantify the proportion of annotated fruits that are detected. Additionally, we measure inference speed (FPS) and model weight size (MB) to evaluate real-time performance and resource consumption in engineering deployment.
3.2.2. Harvestability Decision Evaluation Metrics
Harvestability gating in RGHD is formulated as a binary decision task. In this study, we evaluate gating outputs under an operational proxy definition derived from occlusion annotations and geometry-based cues (overlap and relative scale). Specifically, a proxy low-risk instance is defined as (occ ≤ 1) ∧ (ovl ≤ 1) ∧ (s ≥ 0.05); all other cases are treated as proxy high-risk.
To decouple detection failures from policy behavior, predicted candidates are first matched to annotated fruit objects using IoU ≥ 0.5, and policy statistics are computed only on the matched pairs. The system performance is evaluated using Accuracy ($Acc$), Precision ($P$), Recall ($R$), and $F_1$ score, which are calculated based on True Positives ($TP$), False Positives ($FP$), True Negatives ($TN$), and False Negatives ($FN$):

$$Acc = \frac{TP + TN}{TP + FP + TN + FN}, \quad P = \frac{TP}{TP + FP}, \quad R = \frac{TP}{TP + FN}, \quad F_1 = \frac{2PR}{P + R}$$

For multi-class occlusion evaluation, the Macro-$F_1$ score is used to mitigate class imbalance by averaging the $F_1$-scores across all $C$ classes:

$$\text{Macro-}F_1 = \frac{1}{C} \sum_{c=1}^{C} F_{1,c}$$

To evaluate the safety-efficiency trade-off, we report acceptance-related statistics alongside two customized risk-oriented rates:

$$FPR = \frac{FP}{FP + TN}, \quad FNR = \frac{FN}{FN + TP}$$

Here, $FPR$ denotes the proxy unsafe-acceptance rate (accepting proxy high-risk cases), and $FNR$ denotes the proxy missed-opportunity rate (rejecting proxy low-risk cases). Notably, these metrics represent decision-level error rates based on the operational proxies, rather than standard detector-level false positives or false negatives. Jointly reporting $FPR$/$FNR$ with $P$/$R$/$F_1$ provides a compact description of the operating characteristics under Safety-First and Efficiency-First policies.
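The decision-level metrics above follow directly from the four confusion counts, as the short sketch below shows. The counts in the example are hypothetical and serve only to exercise the formulas; "positive" here means an accept (harvest) decision on a proxy low-risk candidate.

```python
def decision_metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall, F1, and the two risk-oriented rates."""
    def ratio(num, den):
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "f1": ratio(2 * precision * recall, precision + recall),
        "fpr": ratio(fp, fp + tn),  # proxy unsafe-acceptance rate
        "fnr": ratio(fn, fn + tp),  # proxy missed-opportunity rate
    }


def macro_f1(per_class_f1):
    """Unweighted mean of per-class F1 scores."""
    return sum(per_class_f1) / len(per_class_f1)


metrics = decision_metrics(tp=80, fp=5, tn=95, fn=20)  # hypothetical counts
print({name: round(value, 3) for name, value in metrics.items()})
print(round(macro_f1([0.90, 0.80, 0.85]), 3))
```

Because FNR equals 1 − Recall by construction, the two risk-oriented rates add information mainly through FPR, which is computed over the proxy high-risk candidates only.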
3.3. Performance Comparison of Different Fruit Detection Models
To evaluate the suitability of different object detectors for greenhouse melon scenes, we compared several mainstream lightweight detectors under the same dataset split, input resolution (640 × 640), and training protocol. Detection accuracy was evaluated using mAP@0.5 and mAP@0.5:0.95, together with Precision and Recall@0.5. Inference speed (FPS) and model weight size (MB) were also reported to reflect real-time performance and deployment cost. The results are summarized in Table 3.
Among all compared detectors, YOLO11n achieved the highest mAP@0.5 (94.8%) and a high Recall@0.5 (89.9%), indicating improved detection reliability in the presence of illumination variation, occlusion, and fruit clustering. From a greenhouse harvesting perspective, higher recall is particularly important because missed detections directly translate into missed harvesting opportunities, whereas moderate over-detection can be further filtered by subsequent harvestability decision modules. From a deployment perspective, YOLO11n delivered 113.3 FPS with a compact model size of 5.24 MB, providing a favorable balance between accuracy, recall, speed, and model footprint. Therefore, YOLO11n was selected as the front-end detector for subsequent ROI construction, fruit-state perception, and risk-gated harvestability decision-making.
To verify the training stability of the selected YOLO11n model, its complete training curves are provided in Figure 8.
3.4. Occlusion Classification Network Design and Performance Comparison
Occlusion level is a key state variable affecting harvesting safety, and failing to identify severe occlusion (occ = 2) can readily lead to high-risk proxy unsafe acceptance. Therefore, we formulate fruit occlusion recognition as an ROI-based three-class classification task and conduct a systematic evaluation from three aspects: backbone selection, attention-module design, and loss-function configuration.
3.4.1. Baseline Backbone Comparison: Architecture–Accuracy–Speed Trade-Off
We first compare several lightweight backbones under the same training protocol and input size (224 × 224) and report overall accuracy (Acc), Macro-F1, class-wise recall for occ = 0/1/2, and inference latency per ROI. The results are summarized in Table 4.
Overall, ShuffleNetV2 achieves the best balance between classification performance (Acc = 85.4%, Macro-F1 = 82.8%) and latency (6.70 ms per ROI) among the evaluated lightweight backbones, and is therefore selected as the baseline for subsequent attention-module and loss-function ablation studies.
3.4.2. Effect of Attention Mechanisms on Occlusion Recognition Performance
To evaluate the effectiveness of attention mechanisms for occlusion recognition, we compared four variants under the same backbone (ShuffleNetV2) and identical training settings, with the loss function fixed to standard cross-entropy (CE): no attention (Baseline), SE, CBAM, and the proposed EGSA. The results are summarized in Table 5.
Overall, introducing attention improves Macro-F1 and increases recall for the safety-critical class (occ = 2). Among the tested modules, EGSA achieves the best overall performance (Acc = 87.5%, Macro-F1 = 85.2%, Recall (occ = 2) = 97.0%) while maintaining low inference latency (6.85 ms per ROI). Compared with SE and CBAM, EGSA provides a stronger accuracy/F1 gain with comparable or lower latency, suggesting that edge-guided spatial weighting can better emphasize occlusion-relevant boundary regions and suppress background texture interference under greenhouse conditions. These results support EGSA as an effective and lightweight attention design for edge-oriented occlusion recognition. This mechanism is illustrated using Grad-CAM heatmaps (Figure 9), generated using an in-house implementation based on PyTorch autograd (PyTorch v1.13.1) and saved with Matplotlib v3.7.5. For occluded targets, the EGSA classifier shows high activation concentrated near the fruit–leaf boundary. This suggests that the harvestability gate is influenced by boundary-related cues near fruit–leaf contact regions, rather than only global texture patterns.
3.4.3. Effect of Different Loss Functions on Occlusion Classification Performance
To address class imbalance among occlusion levels and the asymmetric cost of misclassifying safety-critical cases, we compared three loss functions for occlusion classification: standard cross-entropy (CE), weighted cross-entropy (WCE), and focal loss. All settings were kept identical except for the loss function. The results are reported in Table 6, and the convergence behavior during training is illustrated in Figure 10.
Overall, WCE achieves the best performance across the evaluated metrics. Compared with CE, WCE improves Acc from 85.4% to 86.9% and Macro-F1 from 82.8% to 84.5%, while also increasing Recall (occ = 2) from 96.2% to 97.0%. In contrast, focal loss provides a smaller gain for the severe-occlusion class and remains inferior to WCE in Acc and Macro-F1. Therefore, WCE is selected as the training loss for the occlusion classification module in subsequent experiments.
3.4.4. Ablation Study of the Occlusion Classification Model
Based on the comparative results of backbone selection, attention mechanisms, and loss functions, the final occlusion classifier is configured as ShuffleNetV2 + EGSA + WCE. EGSA incurs only a small inference overhead, with latency increasing from 6.70 to 6.85 ms per ROI, while WCE affects training only and does not increase inference latency. As shown in Table 7, the combination of EGSA and WCE yields the best overall performance, achieving an accuracy of 89.2%, a Macro-F1 of 87.8%, and the highest Recall (occ = 2) of 97.5%. This provides a more reliable occlusion-state input for downstream risk modeling and harvestability gating.
Relative to the baseline configuration (ShuffleNetV2 trained with CE and without attention), the final model improves accuracy by 3.8 percentage points (85.4% to 89.2%), Macro-F1 by 5.0 percentage points (82.8% to 87.8%), and Recall (occ = 2) by 1.3 percentage points (96.2% to 97.5%). These results suggest that EGSA and WCE provide complementary benefits for robust occlusion recognition under greenhouse conditions. In greenhouse harvesting, severely occluded fruits typically imply limited graspable area and a higher likelihood of failed or damaging attempts; improving recognition of occ = 2 therefore helps the RGHD gate reject high-risk candidates more reliably, enhancing decision safety rather than only boosting Macro-F1.
3.5. Performance Comparison of Harvestability Gating Under Different Policies
To characterize the safety–efficiency trade-off, we compared two operating policies (Safety-First and Efficiency-First) on candidates matched with IoU ≥ 0.5. All decision statistics are computed under the proxy definition in Section 3.2.2. Table 8 reports the policy operating characteristics using Precision/Recall/F1 and the two risk-oriented rates (FPR and FNR).
The two policies represent distinct operating points. The Safety-First policy yields a lower proxy unsafe-acceptance rate (FPR = 4.4%), corresponding to a more conservative acceptance behavior, while the Efficiency-First policy yields a lower proxy missed-opportunity rate (FNR = 4.7%) and higher acceptance recall (91.0%), corresponding to a more aggressive acceptance behavior. This comparison highlights an explicit and tunable safety–efficiency trade-off under the adopted proxy: Safety-First prioritizes avoiding proxy high-risk acceptance, whereas Efficiency-First prioritizes throughput at the cost of a higher chance of accepting proxy high-risk candidates (under our occlusion/overlap/scale-based proxy definition). In subsequent experiments, Safety-First is used as the default setting and Efficiency-First is reported as a reference operating mode. This comparison reflects a realistic trade-off faced in greenhouse harvesting practice: conservative strategies prioritize operational safety and crop protection, whereas aggressive strategies favor throughput at the expense of increased harvesting risk. RGHD explicitly exposes this trade-off, enabling strategy selection according to production priorities.
3.6. Consistency with Manual Judgment
To assess practical reliability, we compared RGHD decisions with manual expert judgments on 2026 test targets. Under the Safety-First policy, the system produced 953 harvest and 1073 skip decisions, with an overall agreement of 92.7%. Table 9 shows an asymmetric error pattern: 110 cases were classified as skip when experts judged harvestable, whereas 38 cases were classified as harvest when experts judged unsafe. This imbalance indicates that, under Safety-First, RGHD tends to err on the side of skipping uncertain targets, thereby reducing the likelihood of collision-prone attempts in unstructured greenhouse scenes.
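The reported agreement can be checked from the counts given in this subsection. The short computation below assumes that the 110 and 38 cases are the only disagreements and that they fall within the system's skip and harvest decisions, respectively; the variable names are illustrative.

```python
# Counts stated in the text for the Safety-First policy.
system_harvest = 953
system_skip = 1073
skip_but_expert_harvestable = 110   # system skipped, expert judged harvestable
harvest_but_expert_unsafe = 38      # system harvested, expert judged unsafe

total = system_harvest + system_skip
agree_harvest = system_harvest - harvest_but_expert_unsafe
agree_skip = system_skip - skip_but_expert_harvestable
agreement = (agree_harvest + agree_skip) / total

print(total, agree_harvest, agree_skip, round(100 * agreement, 1))
# 2026 915 963 92.7
```

The disagreements split roughly three to one in favor of unnecessary skips (110 versus 38), which is the asymmetry the Safety-First policy is designed to produce.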
3.7. Crowding-Stratified Decision Behavior
We further examine the Safety-First gate under different crowding levels using an overlap-density index computed from detection boxes. For each frame, the crowding index is defined as the mean of the per-candidate maximum IoU values, and frames are grouped into Low/Medium/High strata by tertiles of this index (i.e., the bottom, middle, and top third of frames ranked by overlap density). Table 10 summarizes the gate’s operating behavior across the three strata.

With increasing crowding, overlap becomes more frequent and the Safety-First gate exhibits more conservative acceptance, as reflected by a decreasing accept rate from Low to High crowding. This provides an interpretable, scene-consistent characterization of the gate’s behavior under clustered canopies. Figure 11 provides representative qualitative examples under different crowding levels, illustrating accepted and rejected candidates produced by the gate.
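The stratification procedure can be sketched as follows, taking the per-candidate maximum IoU values of each frame as input. The frame identifiers and score values are hypothetical, and the handling of ties and of frames without candidates (index set to 0) is an assumption made for this example.

```python
def crowding_index(max_ious):
    """Mean of the per-candidate maximum IoU values in one frame."""
    return sum(max_ious) / len(max_ious) if max_ious else 0.0


def tertile_strata(frame_index):
    """Rank frames by crowding index and label each third Low/Medium/High."""
    ranked = sorted(frame_index, key=frame_index.get)
    n = len(ranked)
    names = ("Low", "Medium", "High")
    return {fid: names[rank * 3 // n] for rank, fid in enumerate(ranked)}


# Hypothetical per-frame lists of per-candidate maximum IoU values.
frames = {
    "f1": [0.0, 0.0],
    "f2": [0.05, 0.05],
    "f3": [0.10, 0.10, 0.10],
    "f4": [0.20, 0.20],
    "f5": [0.30, 0.30, 0.30],
    "f6": [0.40, 0.40],
}
index = {fid: crowding_index(values) for fid, values in frames.items()}
strata = tertile_strata(index)
print(strata)
```

Rank-based tertiles give three strata of nearly equal size regardless of how the crowding index is distributed, which keeps the per-stratum accept rates comparable.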
3.8. System Integration and Prototype Verification
3.8.1. Validation Environment and Workflow
To verify the engineering feasibility of the proposed RGHD framework, we implemented a prototype and deployed the complete perception-to-decision pipeline on a laptop computer. The prototype takes a single RGB frame as input and processes greenhouse melon images sequentially. The end-to-end workflow includes fruit detection, ROI construction, occlusion-level recognition, overlap-risk estimation, multi-source risk fusion, and harvestability output. For verification and analysis, the prototype visualizes intermediate and final results as on-image overlays and exports the corresponding outputs (e.g., predicted labels and decision results) for subsequent error-case inspection. The system interface is shown in Figure 12.
3.8.2. Prototype Implementation and Integration
The prototype was developed in Python and integrates the perception and decision modules using PyTorch, YOLO11n based on the Ultralytics repository (v8.4.3), and OpenCV (opencv-python v4.10.0.84). The functional modules are summarized as follows:
Image input: sequential image loading and queue management;
Fruit detection: fruit localization using YOLO11n;
ROI construction: ROI cropping from detected boxes with a fixed outward margin;
Occlusion perception: occlusion-level prediction using the lightweight CNN classifier;
Overlap risk: overlap level inference from pairwise box relationships;
Gated decision: rule-based fusion of occlusion, overlap, and scale-related cues to output the harvestability decision;
Visualization and export: overlay rendering and result export for qualitative inspection and error analysis.
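The ROI construction step in the list above can be illustrated with a short sketch. It expands a detected box by a fixed margin, clips the result to the image bounds, and returns the cropped array. The function name `crop_roi`, the margin value of 8, and the choice of pixels as the margin unit are hypothetical; the text only states that a fixed outward margin is used.

```python
import numpy as np


def crop_roi(image, box, margin=8):
    """Crop a detected box from an H x W x C image with a fixed outward margin.

    `margin` is in pixels (an assumed unit); the expanded box is clipped so
    that it never extends beyond the image borders.
    """
    height, width = image.shape[:2]
    x1, y1, x2, y2 = box
    x1 = max(0, int(x1) - margin)
    y1 = max(0, int(y1) - margin)
    x2 = min(width, int(x2) + margin)
    y2 = min(height, int(y2) + margin)
    return image[y1:y2, x1:x2]


# Example: a blank 100 x 120 frame and one detection box (x1, y1, x2, y2).
frame = np.zeros((100, 120, 3), dtype=np.uint8)
roi = crop_roi(frame, (10, 20, 50, 60))
```

Clipping matters for fruits near the frame border, where an unclipped margin would produce negative indices and a silently wrong crop.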
3.8.3. Prototype Verification Results
Prototype verification shows that the system reliably completes the full “detection–occlusion perception–risk fusion–harvestability decision” workflow on the target platform and generates consistent visualization overlays and exportable results. This indicates functional completeness and integration feasibility at the prototype-validation stage and provides a practical basis for subsequent deployment and policy tuning in greenhouse melon harvesting scenarios.
4. Discussion
4.1. The Necessity of Risk-Gated Reasoning
Detection alone does not indicate that a fruit is safe or feasible to pick in greenhouse harvesting. Muskmelons are often partially occluded by vines/leaves or located in crowded clusters, which increases collision risk if actions are triggered from bounding boxes alone. RGHD, therefore, treats occlusion, crowding, and target scale as explicit risk cues and approves a picking action only when the estimated risk satisfies the Safety-First criterion [30]. This reframes perception outputs as harvestability screening aligned with manipulation constraints.
4.2. Decision Support Under Deployment Constraints
RGHD targets edge deployment and avoids computationally intensive 3D point-cloud segmentation. Occlusion risk is estimated from EGSA-enhanced boundary cues, and clustering risk is approximated using a lightweight 2D IoU proxy. The same pipeline supports “Safety-First” and “Efficiency-First” modes without retraining.
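One way to see why switching modes needs no retraining is to treat each mode as a set of thresholds applied to the same perception outputs. The sketch below is illustrative only: the `GatePolicy` fields, the integer occlusion encoding (0 = none, higher = heavier), the assumption that smaller boxes are riskier, and every threshold value are hypothetical and are not the settings used in this study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GatePolicy:
    """Thresholds for one operating mode (all values are hypothetical)."""
    max_occlusion_level: int   # highest occlusion grade still accepted
    max_overlap_iou: float     # highest IoU with a neighbouring box still accepted
    min_box_area_ratio: float  # smallest box area / image area still accepted


# Hypothetical settings: the stricter policy accepts fewer candidates.
SAFETY_FIRST = GatePolicy(max_occlusion_level=0, max_overlap_iou=0.10,
                          min_box_area_ratio=0.02)
EFFICIENCY_FIRST = GatePolicy(max_occlusion_level=1, max_overlap_iou=0.30,
                              min_box_area_ratio=0.01)


def is_harvestable(occlusion_level, max_iou, box_area_ratio, policy):
    """Accept a candidate only if every risk cue is within the policy limits."""
    return (occlusion_level <= policy.max_occlusion_level
            and max_iou <= policy.max_overlap_iou
            and box_area_ratio >= policy.min_box_area_ratio)


# The same cues can pass under one policy and fail under the other.
cues = dict(occlusion_level=1, max_iou=0.20, box_area_ratio=0.05)
accepted_safe = is_harvestable(**cues, policy=SAFETY_FIRST)
accepted_fast = is_harvestable(**cues, policy=EFFICIENCY_FIRST)
```

Because the detector and the occlusion classifier are untouched, changing the operating mode in this sketch amounts to passing a different policy object.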
We also tested a single-stage alternative that labels boxes as “pickable” or “unpickable”. In RGHD, occlusion is graded on cropped ROIs by an auxiliary CNN, leveraging leaf–fruit boundary patterns that box-level scores and post-processing do not explicitly encode. The single-stage YOLO baseline showed poorer safety-related behavior, with more errors on unpickable targets and reduced precision on pickable ones. A plausible reason is a learning trade-off: optimizing global box regression can weaken sensitivity to fine occlusion cues. Separating candidate generation from risk assessment helps keep false acceptances low (e.g., 4.4% under Safety-First), supporting collision avoidance.
4.3. Practical Deployment and System Integration
This work is presented as a risk-aware decision layer on top of detection, rather than a complete on-robot deployment study. In a practical harvesting system, RGHD can be integrated into an ROS-based pipeline as an intermediate filter within a standard “Perception–Decision–Action” loop. Specifically, YOLO11n first detects candidate fruits and outputs their bounding boxes. Each candidate is then cropped to an ROI and evaluated by the risk module (ShuffleNetV2-based occlusion grading combined with overlap- and scale-based risk cues). Only candidates classified as “Harvestable” under the selected policy are forwarded to the motion planner as target coordinates. Candidates rated “Unsafe” are skipped, and the system proceeds to the next target; they are not permanently discarded and can be re-evaluated in subsequent harvesting cycles.
In terms of runtime, the sequential detection–crop–classification pipeline introduces latency that increases with the number of detected candidates. The measured processing time is ~6.85 ms per ROI; for a typical scene containing 4–5 melons, the added delay is ~27–34 ms. Although the current evaluation was conducted on a workstation, this sub-100 ms overhead is small relative to the seconds-scale execution time of manipulator motions during greenhouse harvesting.
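The arithmetic behind these figures can be checked with a short sketch. It assumes that the added delay scales linearly with the number of ROIs (sequential processing, no batching) and uses the reported 6.85 ms per ROI; the function names and the 100 ms budget used in the second helper are illustrative.

```python
PER_ROI_MS = 6.85  # per-ROI processing time reported in the text


def added_latency_ms(num_candidates, per_roi_ms=PER_ROI_MS):
    """Added delay under the assumption of strictly sequential ROI processing."""
    return num_candidates * per_roi_ms


def max_candidates_within(budget_ms, per_roi_ms=PER_ROI_MS):
    """Largest number of ROIs whose sequential processing fits in the budget."""
    return int(budget_ms // per_roi_ms)


# Delay for scenes with 4 and 5 detected melons.
delays = {n: added_latency_ms(n) for n in (4, 5)}
# Number of ROIs that still fit under an illustrative 100 ms budget.
capacity = max_candidates_within(100.0)
```

Under the linear assumption, four and five candidates give about 27.4 ms and 34.25 ms, consistent with the range quoted above, and roughly 14 candidates fit within 100 ms.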
4.4. Limitations and Future Work
First, validation used smartphone imagery; deployment should be evaluated with robotic-grade industrial sensors to assess domain shift. Second, the 2D overlap proxy simplifies the 3D workspace; depth cues or multi-view geometry [31,32] may better capture approach/grasp constraints. Finally, the current study validates perception and decision-making offline; the next step is on-robot testing to quantify end-to-end picking success, collisions, and crop safety in greenhouse trials.
5. Conclusions
This study presents a Risk-Gated Harvestability Decision (RGHD) framework for greenhouse muskmelon harvesting under foliage occlusion and fruit crowding. RGHD separates fruit discovery from harvestability judgment. A lightweight YOLO11n detector is used to locate fruits, and an EGSA-enhanced ShuffleNetV2 classifier estimates occlusion. Harvestability is then decided with a configurable gate that combines occlusion level, overlap-based interference (IoU), and a scale cue, enabling Safety-First and Efficiency-First operating modes.
YOLO11n achieved 75.8% mAP50–95 at 113.3 FPS. Under Safety-First, the proxy unsafe-acceptance rate (FPR) decreased from 8.7% to 4.4% (a 49.4% relative reduction), while decision precision remained at 88.0%; Efficiency-First increased acceptance, reaching 91.0% recall. These results suggest that RGHD can reduce risky picking decisions and help limit fruit and vine damage in greenhouse production, and that the same idea can be applied to other crops with similar occlusion and crowding. Future work will integrate depth or multi-view cues and verify decisions with real picking outcomes.