3.1. Dataset Description and Preprocessing
This study utilizes the Fruitectives Multi-Fruit Ripeness Dataset, obtained from Roboflow Universe [37]. The dataset was selected based on criteria directly aligned with robotic harvesting requirements: multi-fruit diversity across 8 species for generalization assessment; explicit ripeness stage annotations across 4 categories aligned with harvesting decision logic; YOLO-compatible format (txt bounding boxes) for direct model training; public availability, ensuring reproducibility and community engagement; and sufficient scale (2460 images) for robust deep learning evaluation while remaining computationally manageable for comparative analysis. Unlike laboratory-based fruit datasets focusing on a single species or binary maturity classification, this dataset enables multi-class, multi-species evaluation under realistic orchard variability.
The dataset consists entirely of real-world, high-resolution photographs captured in complex orchard environments and professional agricultural settings. No part of the core training or testing data is synthetically generated via Computer Graphics (CG) or Generative Adversarial Networks (GANs). The ‘simulated stress’ mentioned in Section 5 refers to the controlled digital perturbation (e.g., masking and brightness adjustment) applied to real images to evaluate model robustness, not to the creation of synthetic image content.
To ensure the framework’s practical applicability, a portion of the test imagery was captured by our team at the Tanggang Ecological Orchard, Shuyuan Town, Pudong New Area, Shanghai. This site presents complex, real-world orchard challenges, including variable canopy density, natural backlight, and significant illumination gradients, which are essential for evaluating model robustness beyond synthetic benchmarks. This in situ collection grounds the core evaluation in raw, unstructured visual signals from actual field operations, providing a bridge between controlled stress testing and real-world applicability.
The dataset contains 2460 images with variable resolutions ranging from 480 to 2048 pixels and 8925 manually annotated instances across 32 distinct sub-categories (8 species × 4 ripeness stages). This results in an average of 3.6 targets per image, providing a high-density learning signal [38] that supports deep learning convergence in specialized domains. It covers eight fruit species, namely banana, mango, apple, cantaloupe, pear, orange, peach, and grape. The species-level image and annotation distribution is summarized in Table 1. In addition to species diversity, the dataset is annotated with four ripeness categories: Unripe (Class 0), Ripe (Class 1), Overripe (Class 2), and Rotten (Class 3). This four-stage design reflects real harvesting decisions rather than simplified binary classification, aligning with recent approaches in intelligent muskmelon ripeness detection [39]. The definitions, harvesting decisions, and annotation proportions of the four ripeness categories are reported in Table 2.
The Fruitectives dataset (2460 images) was divided into training (1476 images, 60%), validation (492 images, 20%), and test (492 images, 20%) subsets. The corresponding annotation counts are 5355 for training and 1785 each for validation and testing. The split was performed at the image level to prevent annotation leakage across subsets. The training set was used for model optimization, the validation set for hyperparameter tuning and early stopping, and the test set was reserved for final performance evaluation.
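The image-level 60/20/20 split described above can be sketched as follows. This is a minimal illustration, not the authors' script: the function name `split_image_ids` is hypothetical, and reusing the seed value 42 for the split is an assumption.

```python
import random

def split_image_ids(image_ids, fractions=(0.6, 0.2, 0.2), seed=42):
    """Shuffle image ids once and cut them into train/val/test lists.

    Splitting by image id (not by annotation) keeps all boxes of one
    image in the same subset, which avoids annotation leakage.
    """
    ids = list(image_ids)
    random.Random(seed).shuffle(ids)  # seed value is an assumption
    n_train = round(len(ids) * fractions[0])
    n_val = round(len(ids) * fractions[1])
    train = ids[:n_train]
    val = ids[n_train:n_train + n_val]
    test = ids[n_train + n_val:]  # remainder goes to the test set
    return train, val, test

train, val, test = split_image_ids(range(2460))
print(len(train), len(val), len(test))
```

With 2460 image ids, the three lists have 1476, 492, and 492 entries, matching the subset sizes reported in the text.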
Despite the diversity of the Fruitectives dataset, certain inherent biases and limitations persist. First, class imbalance is observed in annotation density; for instance, Grape contains 5.1 annotations per image compared to 3.6 for Apple, which may bias the model toward clustered features. Second, while the dataset covers eight species, it predominantly features fruits with distinct geometric shapes (e.g., spherical or oblong), potentially limiting direct generalization to irregular crops like berries or leafy vegetables. Furthermore, most images were captured under clear or partially cloudy daylight, meaning the model’s performance under extreme weather (e.g., heavy rain or fog) or nighttime artificial lighting remains to be fully validated.
Figure 1 illustrates the visual diversity and complexity of the Fruitectives dataset. In addition to covering eight fruit species and four ripeness stages, the dataset reflects real orchard conditions characterized by severe occlusion, clustered fruits, illumination variation, and small-scale targets. These challenges motivate the structural enhancements proposed in Orchard-YOLO, particularly the P2 high-resolution head and the Coordinate Attention mechanism.
To ensure consistent input quality, all images were normalized to the range [0, 1], with RGB channels aligned to ImageNet mean statistics ([0.485, 0.456, 0.406]) to facilitate stable optimization. Images were resized to 640 × 640 using letterbox padding to preserve aspect ratio while remaining compatible with the YOLO input setting. All annotations were converted to normalized YOLO bounding box format, in which center coordinates and box dimensions are expressed relative to image size. To improve robustness under orchard variability, data augmentation was applied during training [40]. Geometric transformations included random horizontal flipping (p = 0.5), small-angle rotation (±10°), and mild perspective distortion, while photometric augmentation included HSV perturbation (H ± 10%, S ± 20%, V ± 20%) and brightness/contrast adjustment (±20%). In addition, Mosaic [19] and MixUp [41] were adopted to enhance multi-object contextual learning, and DropBlock [42] and GridMask [43] were used to improve resistance to occlusion and partial visibility. This strategy follows common YOLO training practice and is particularly suitable for agricultural scenes with foliage occlusion, variable lighting, and scale diversity.
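The letterbox resize and the normalized YOLO box format can be illustrated with a short sketch. The helper names are hypothetical, and mapping the box directly onto the padded 640 × 640 canvas is an assumption made for illustration; in many training pipelines the labels are stored relative to the original image and the data loader applies the letterbox transform.

```python
def letterbox_params(width, height, target=640):
    """Return (scale, pad_x, pad_y) for an aspect-preserving resize."""
    scale = min(target / width, target / height)
    new_w, new_h = round(width * scale), round(height * scale)
    pad_x = (target - new_w) / 2  # padding added on the left (and right)
    pad_y = (target - new_h) / 2  # padding added on the top (and bottom)
    return scale, pad_x, pad_y

def to_yolo_box(x_min, y_min, x_max, y_max, width, height, target=640):
    """Map a pixel box in the original image to normalized
    (cx, cy, w, h) on the letterboxed target x target canvas."""
    scale, pad_x, pad_y = letterbox_params(width, height, target)
    cx = ((x_min + x_max) / 2 * scale + pad_x) / target
    cy = ((y_min + y_max) / 2 * scale + pad_y) / target
    w = (x_max - x_min) * scale / target
    h = (y_max - y_min) * scale / target
    return cx, cy, w, h

# A 1280 x 960 image is scaled by 0.5 to 640 x 480 and padded by 80 px
# above and below; a centered box stays centered on the canvas.
print(letterbox_params(1280, 960))
print(to_yolo_box(320, 240, 960, 720, 1280, 960))
```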
Compared to typical fruit detection studies that focus on 1–2 species, our inclusion of 8 taxonomically diverse species (ranging from pomes and stone fruits to berries and tropical fruits) represents a significant step toward a generalized agricultural vision backbone, rather than a crop-specific solution.
The Fruitectives Multi-Fruit Ripeness Dataset was selected because it supports systematic evaluation of fruit detection and maturity assessment under realistic harvesting conditions [37]. Its coverage of eight fruit species enables cross-species generalization analysis, while the four-stage labeling scheme (Unripe, Ripe, Overripe, and Rotten) is more consistent with practical harvesting logic than binary maturity classification. In addition, the dataset is natively provided in YOLO-compatible annotation format, which ensures consistency across training and evaluation and reduces preprocessing bias. Its public availability also improves reproducibility and facilitates benchmarking in future studies.
Although the dataset is moderate in size, with 2460 images and 8925 annotations, it provides a reasonable balance between diversity and computational feasibility for multi-model comparison. Moreover, the augmentation strategy described above effectively increases sample variability and mitigates overfitting across the 32 fruit-ripeness subcategories, thereby supporting robust evaluation of the proposed framework.
3.2. Orchard-YOLO Architecture Design
To address the specific challenges of agricultural environments, namely small target loss during downsampling and feature confusion under occlusion, we propose Orchard-YOLO.
As shown in Figure 2, the architecture improves upon the YOLO baseline through three key innovations.
3.2.1. High-Resolution P2 Detection Head
Conventional YOLO architectures perform detection at three pyramid levels (P3, P4, P5), corresponding to downsampling strides of 8, 16, and 32, respectively. While effective for medium and large objects, this configuration may suppress fine-grained texture information necessary for detecting small or partially visible fruits. In orchard scenes, distant fruits and early stage rot lesions often occupy less than 5% of the image area, making them susceptible to feature loss in deeper layers.
To mitigate this limitation, we introduce an additional P2 detection head with a stride of 4. This branch extracts high-resolution feature maps (160 × 160 at 640 × 640 input resolution) from earlier backbone stages and integrates them into the feature pyramid through the Path Aggregation Network (PANet). By incorporating shallow spatial features into multi-scale fusion, the network retains fine-grained texture cues that are essential for distinguishing subtle ripeness differences, particularly between Overripe and Rotten categories. The P2-head design and its role in mitigating small-fruit feature loss are illustrated in Figure 3.
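The relationship between stride and feature-map resolution can be checked with a few lines of arithmetic. The sketch below is illustrative only; the 32-pixel fruit is a hypothetical example, not a value from the dataset.

```python
def grid_sizes(input_size=640, strides=(4, 8, 16, 32)):
    """Map each pyramid level (P2..P5) to its square feature-map side."""
    # Level index is log2(stride): stride 4 -> P2, stride 32 -> P5.
    return {f"P{s.bit_length() - 1}": input_size // s for s in strides}

def cells_covered(object_px, stride):
    """Approximate number of grid cells an object spans along one side."""
    return max(1, object_px // stride)

print(grid_sizes())
# A hypothetical 32-pixel-wide fruit spans 8 cells at stride 4
# but only a single cell at stride 32.
print(cells_covered(32, 4), cells_covered(32, 32))
```

At a 640 × 640 input, the P2 level yields the 160 × 160 map stated above, while P3 to P5 yield 80, 40, and 20 cells per side.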
3.2.2. Coordinate Attention (CA) for Occlusion Handling
Dense canopy structures introduce significant background interference, where leaf textures may resemble unripe fruit coloration. Standard channel-attention mechanisms (e.g., SE blocks) emphasize inter-channel relationships but rely on global pooling that collapses spatial dimensions, thereby losing explicit positional information.
To enhance spatial discrimination, Coordinate Attention (CA) [46] modules are embedded within the neck architecture. CA decomposes channel attention into two one-dimensional encodings along the horizontal and vertical directions, enabling the network to capture long-range dependencies across the canopy while preserving precise positional information. This mechanism strengthens the response to fruit regions while suppressing surrounding foliage features, thereby reducing false positives in cluttered orchard environments. This spatially aware attention mechanism and its role in suppressing foliage interference are illustrated in Figure 4.
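The difference between global pooling and the two directional encodings can be seen on a single toy channel. The sketch below covers only the pooling step; the shared convolution and sigmoid gating of the full CA block are omitted, and the feature-map values are hypothetical.

```python
def global_pool(fmap):
    """SE-style squeeze: one scalar per channel, spatial layout is lost."""
    h, w = len(fmap), len(fmap[0])
    return sum(sum(row) for row in fmap) / (h * w)

def directional_pool(fmap):
    """CA-style squeeze: average along the width (one value per row)
    and along the height (one value per column)."""
    h, w = len(fmap), len(fmap[0])
    row_desc = [sum(row) / w for row in fmap]
    col_desc = [sum(fmap[i][j] for i in range(h)) / h for j in range(w)]
    return row_desc, col_desc

# Toy 3 x 4 channel with a single strong response at row 1, column 2.
fmap = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 4.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
rows, cols = directional_pool(fmap)
print(global_pool(fmap))  # a single number, position unknown
print(rows.index(max(rows)), cols.index(max(cols)))  # row and column recovered
```

The two one-dimensional descriptors together still identify where the response occurred, which is the positional information that global pooling discards.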
3.2.3. Ghost-Convolution Lightweight Backbone
The addition of the P2 head increases computational load. To maintain real-time performance on the Jetson Nano, we replace standard convolutions in the backbone with Ghost Modules.
The Ghost module generates half of its output feature maps (the intrinsic maps) using standard convolution and derives the other half (“ghost” features) from them using cheap linear operations. This reduces the parameter count and FLOPs by approximately 40%, creating a “YOLO-Nano” scale model that outperforms the standard “YOLO-Small” in accuracy.
We define ‘lightweight’ through three primary metrics: parameter reduction, FLOPs optimization, and deployment-ready file size. By replacing standard heavy convolutions with Ghost Modules, Orchard-YOLO reduces redundant parameter overhead by approximately 40%. The detailed lightweight optimization results, including parameter reduction, INT8 quantization, model size, and Jetson Nano inference speed, are reported later in Section 4.4. This multi-level optimization is designed to support high-speed inference on power-constrained edge devices such as the Jetson Nano.
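A rough per-layer parameter count shows where the saving comes from. The sketch below assumes a Ghost ratio of 2 (half intrinsic, half ghost maps), a 3 × 3 depthwise kernel for the cheap operation, no bias terms, and hypothetical channel sizes; none of these layer settings are taken from the paper.

```python
def standard_conv_params(c_in, c_out, k):
    """Weights of a plain k x k convolution (bias ignored)."""
    return c_in * c_out * k * k

def ghost_module_params(c_in, c_out, k, ratio=2, cheap_k=3):
    """Weights of a Ghost module: a primary convolution producing
    c_out / ratio intrinsic maps plus depthwise 'cheap' operations
    producing the remaining ghost maps."""
    intrinsic = c_out // ratio
    primary = c_in * intrinsic * k * k
    cheap = intrinsic * (ratio - 1) * cheap_k * cheap_k
    return primary + cheap

std = standard_conv_params(64, 128, 3)
ghost = ghost_module_params(64, 128, 3)
print(std, ghost, round(1 - ghost / std, 3))
```

For this single hypothetical layer the saving is close to one half. A whole-network figure such as the approximately 40% reported above would be lower than the per-layer value whenever some layers are left unchanged.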
To maintain clarity throughout this study, we define our proposed models as follows:
Orchard-YOLO (The Framework): The full-scale, feature-enhanced architecture (incorporating P2 Head and Coordinate Attention).
Orchard-YOLO-Light (The Lightweight Variant): The optimized version of Orchard-YOLO, where standard convolutions are replaced by Ghost-convolution modules to reduce parameter overhead.
Orchard-YOLO-Light-INT8 (The Deployed Version): The quantized version, obtained by applying INT8 quantization to Orchard-YOLO-Light for accelerated inference on edge devices (Jetson Nano).
The lightweight feature-generation process of Ghost convolution is illustrated in Figure 5.
3.3. Evaluation Metrics
Mean Average Precision (mAP) was adopted as the primary evaluation metric for multi-class object detection. Given the 32 detection categories (8 fruit species × 4 ripeness stages), mAP was computed as the mean of class-wise Average Precision (AP) values derived from precision–recall curves. To provide a comprehensive assessment of localization performance under different matching strictness levels, three standard variants were reported: mAP@0.5, using an IoU threshold of 0.50 for relatively lenient evaluation; mAP@0.95, using an IoU threshold of 0.95 for strict localization assessment; and mAP@0.5:0.95, averaged across IoU thresholds from 0.5 to 0.95 in increments of 0.05, following the COCO standard. In addition, Precision, Recall, and F1-score were evaluated at both per-class and overall levels. Precision represents the proportion of correctly detected fruits among all detections, while recall reflects the proportion of ground-truth fruits successfully identified. The F1-score provides a harmonic balance between precision and recall. These metrics are particularly important for robotic harvesting systems, since high precision minimizes false picking decisions, such as harvesting unripe or rotten fruits, whereas high recall reduces missed harvests and potential economic loss.
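The quantities underlying these metrics can be written compactly. The following sketch uses hypothetical helper names and example counts; it shows the IoU used for matching, the precision/recall/F1 definitions, and the ten COCO-style IoU thresholds, not a full AP computation.

```python
def iou(a, b):
    """IoU of two boxes given as (x_min, y_min, x_max, y_max)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and their harmonic mean from match counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

def coco_iou_thresholds():
    """IoU thresholds 0.50, 0.55, ..., 0.95 used for mAP@0.5:0.95."""
    return [round(0.5 + 0.05 * i, 2) for i in range(10)]

print(iou((0, 0, 10, 10), (5, 0, 15, 10)))  # overlap 50, union 150
print(precision_recall_f1(tp=80, fp=20, fn=20))
print(coco_iou_thresholds())
```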
Beyond detection accuracy, deployment feasibility in robotic harvesting platforms was assessed through computational efficiency metrics, including inference speed, latency, and model size. Inference speed (FPS) was measured on standardized hardware platforms, including an NVIDIA RTX 3080 (GPU) and a Jetson Nano (edge device), and was calculated based on average inference time under fixed batch size conditions. Latency (ms) refers to the average end-to-end processing time per image, including preprocessing, model inference, non-maximum suppression (NMS), and output generation. This metric is critical for robotic control loops that typically require decision latency below 100 ms. To better simulate real-time robot operation, latency measurements were conducted per image with batch size = 1. Model size (MB), defined as the trained weight file size, was also included because it directly affects storage requirements, transmission feasibility, and deployment on memory-constrained embedded systems.
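A per-image latency measurement with batch size 1 might be structured as in the sketch below. The `pipeline` callable is a hypothetical stand-in for the full chain (preprocessing, inference, NMS, output generation), and the warm-up count is an assumption. On a GPU, the device would also need to be synchronized before each clock reading, which is omitted here.

```python
import time

def measure_latency_ms(pipeline, frames, warmup=2):
    """Average end-to-end time per frame in milliseconds,
    processing one frame at a time (batch size = 1)."""
    for frame in frames[:warmup]:  # warm-up runs are not timed
        pipeline(frame)
    start = time.perf_counter()
    for frame in frames:
        pipeline(frame)
    elapsed = time.perf_counter() - start
    return 1000.0 * elapsed / len(frames)

def fps_from_latency(latency_ms):
    """Convert average per-frame latency to frames per second."""
    return 1000.0 / latency_ms

# Placeholder pipeline that does no real work, for illustration only.
dummy_latency = measure_latency_ms(lambda frame: frame, list(range(100)))
print(dummy_latency >= 0.0, round(fps_from_latency(22.0), 1))
```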
To further assess robustness under realistic orchard variability, modified test sets were constructed to simulate challenging environmental conditions. For occlusion robustness, fruit bounding box regions were randomly masked at 30%, 50%, and 70% levels to simulate foliage occlusion, and robustness was quantified by the relative mAP degradation compared to clean test conditions. For lighting variation, image brightness was adjusted within ±20%, ±40%, and ±50% ranges to simulate illumination changes under field conditions, and the reported metric is the average mAP across all brightness levels. For scale variation, fruits were categorized according to relative image area into small (<5%), medium (5–20%), and large (>20%) groups, and per-scale mAP together with standard deviation was reported. Through this multi-dimensional evaluation framework, the study provides a comprehensive assessment of detection accuracy, computational feasibility, and robustness under complex harvesting scenarios.
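The three stress dimensions reduce to simple per-image or per-box rules. The sketch below uses hypothetical helper names and example numbers; the treatment of the boundary cases at exactly 5% and 20% and the multiplicative brightness model are assumptions.

```python
def scale_group(box_area, image_area):
    """Assign a fruit to a scale group by its relative image area."""
    frac = box_area / image_area
    if frac < 0.05:
        return "small"
    if frac <= 0.20:
        return "medium"
    return "large"

def relative_degradation(clean_map, stressed_map):
    """Relative mAP drop of a stressed test set versus the clean one."""
    return (clean_map - stressed_map) / clean_map

def adjust_brightness(pixel, delta):
    """Scale an 8-bit pixel value by (1 + delta), clipped to 0..255."""
    return max(0, min(255, round(pixel * (1 + delta))))

print(scale_group(64 * 64, 640 * 640))      # 1% of the image area
print(relative_degradation(0.8, 0.6))        # hypothetical mAP values
print(adjust_brightness(200, 0.5), adjust_brightness(100, -0.2))
```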
3.4. Training Configuration
To ensure a strictly fair comparison and eliminate training bias, all YOLO variants (v8, v11, and v13) were trained from scratch using a unified hyperparameter suite (as detailed in Table 3) and the same computational environment (NVIDIA RTX 3080). This includes identical data augmentation strategies (Mosaic, MixUp, and GridMask), the same optimizer (SGD with momentum), and a consistent 100-epoch schedule. No pre-training weights or model-specific tuning were applied to any individual architecture, ensuring that the reported performance gains are derived solely from structural refinements.
These settings were chosen to balance convergence stability and deployment efficiency. A cosine learning rate schedule was adopted to facilitate smooth optimization, and CIoU loss [48] was employed to improve bounding box regression accuracy. Given that the Rotten and Overripe categories account for less than 30% of the dataset, inverse-frequency class weighting was applied to mitigate bias toward dominant classes. Early stopping was implemented to prevent overfitting, with validation monitored throughout training. This configuration ensures consistent and fair benchmarking across all evaluated architectures.
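Inverse-frequency weighting and a cosine schedule can be sketched as follows. The class counts, the learning-rate bounds, and the normalization of the weights to a mean of 1 are all assumptions for illustration; the values actually used are those of Table 3, which are not reproduced here.

```python
import math

def inverse_frequency_weights(counts):
    """Weight each class by total / count, rescaled to a mean of 1."""
    total = sum(counts.values())
    raw = {name: total / c for name, c in counts.items()}
    mean = sum(raw.values()) / len(raw)
    return {name: w / mean for name, w in raw.items()}

def cosine_lr(epoch, total_epochs=100, lr_max=0.01, lr_min=0.0001):
    """Cosine decay from lr_max at epoch 0 to lr_min at the last epoch."""
    cos = 0.5 * (1 + math.cos(math.pi * epoch / total_epochs))
    return lr_min + (lr_max - lr_min) * cos

# Hypothetical annotation counts per ripeness class.
weights = inverse_frequency_weights(
    {"Unripe": 400, "Ripe": 300, "Overripe": 200, "Rotten": 100}
)
print(weights)
print(cosine_lr(0), cosine_lr(50), cosine_lr(100))
```

Rarer classes receive proportionally larger weights, so a class with a quarter of the annotations of another gets four times its weight.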
Training was conducted on an NVIDIA RTX 3080 GPU (10 GB VRAM), requiring approximately 1.5–2 h per model for 100 epochs using 1476 training images. Batch processing was performed with 32 images per iteration. Due to the fast convergence of the YOLOv13 architecture and our optimized learning rate schedule, most models reached their peak performance within the first 70–80 epochs, further reducing the actual wall-clock training time. To evaluate deployment feasibility in robotic harvesting scenarios, inference experiments were conducted on an NVIDIA Jetson Nano (4 GB shared memory) [49]. TensorRT was used for runtime optimization, and INT8 quantization was applied during deployment evaluation, achieving a 3–4× inference speed improvement. All experiments were implemented using PyTorch (version 2.0+), CUDA 11.8, and the Ultralytics YOLO framework with official pretrained weights as initialization. A fixed random seed (42) was used to ensure reproducibility.
Validation was conducted every 10 epochs using 492 validation images. Performance metrics including mAP@0.5, mAP@0.5:0.95, precision, recall, and validation loss were monitored throughout training. The best-performing checkpoint was selected based on the highest validation mAP@0.5, and early stopping was triggered if no improvement was observed for 20 consecutive epochs. To ensure statistical robustness, 5-fold cross-validation was performed on the full dataset. Reported results are presented as mean ± standard deviation across folds. Paired t-tests were used for significance analysis, with p < 0.05 considered statistically significant.
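The checkpoint selection and early-stopping rule can be expressed as a small function over the validation history. This is a sketch under the stated rule (best validation mAP@0.5, patience of 20 epochs, validation every 10 epochs); the function name and the mAP values are hypothetical.

```python
def select_checkpoint(val_history, patience_epochs=20):
    """val_history: list of (epoch, map50) in increasing epoch order.

    Returns (best_epoch, best_map, stop_epoch), stopping once no
    improvement has been seen for patience_epochs epochs.
    """
    best_epoch, best_map = None, float("-inf")
    for epoch, score in val_history:
        if score > best_map:
            best_epoch, best_map = epoch, score
        elif epoch - best_epoch >= patience_epochs:
            return best_epoch, best_map, epoch  # early stop
    return best_epoch, best_map, val_history[-1][0]

# Hypothetical validation curve, evaluated every 10 epochs.
history = [(10, 0.50), (20, 0.62), (30, 0.70), (40, 0.69), (50, 0.70), (60, 0.68)]
print(select_checkpoint(history))
```

With validation every 10 epochs, a 20-epoch patience corresponds to two consecutive validation rounds without a strictly better score; in this example training stops at epoch 50 and the epoch-30 checkpoint is kept.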
3.5. Experimental Setup
To comprehensively evaluate model performance for robotic fruit harvesting, experiments were designed along three complementary dimensions. The first dimension focused on the relationship between accuracy and model complexity. Models from three YOLO generations were compared, including YOLOv8 (nano, small, medium, large), YOLOv11 (nano, small, medium, large), and YOLOv13 (nano, small, medium, large). This comparison was conducted to quantify accuracy improvements resulting from architectural evolution, allowing performance gains to be attributed to structural refinements rather than merely increased parameter capacity. The second dimension examined lightweight variants’ performance. The proposed Orchard-YOLO-Light model was evaluated against the full YOLOv13 baseline, YOLOv8-n (representative lightweight YOLO variant), and MobileNetSSD (alternative lightweight architecture). This setup was designed to determine whether Orchard-YOLO-Light demonstrates superior performance–efficiency trade-offs compared with existing lightweight solutions, particularly for embedded robotic deployment scenarios. The third dimension addressed real-world robustness. All evaluated models were tested under varying environmental conditions, including clean images (test set baseline), fruit occlusion (30%, 50%, and 70% masking), brightness variations (±20%, ±40%, ±50%), and scale differences (small, medium, and large fruits). These experiments were intended to compare model robustness under realistic orchard conditions.
To provide a comprehensive performance analysis, we evaluated different model configurations based on the specific requirements of each experimental dimension.
Ablation Studies: We utilize the Nano-scale variant to isolate the computational cost and accuracy gain of each module, as small models are more sensitive to architectural changes.
Species and Ripeness Assessment: We employ the Medium-scale variant to provide a balanced baseline for generalization, as it offers sufficient capacity for multi-class discrimination without overfitting.
Lightweighting and Deployment: We focus on the Orchard-YOLO-Light series (Base/INT8) to demonstrate the efficacy of our Ghost-convolution and quantization strategies.
Robustness Analysis: We evaluate the Large-scale variant under extreme simulated stress, as large models represent the upper bound of the architecture’s capacity to handle feature-loss in degraded conditions.
All experiments were implemented using PyTorch 2.0.0. YOLOv8 and YOLOv11 models were trained and evaluated using the Ultralytics YOLO framework, whereas YOLOv13 was implemented based on its publicly available third-party repository. Data handling and preprocessing were performed using the Roboflow Python SDK and OpenCV 4.8.0. Training monitoring and visualization were conducted using Matplotlib 3.7.1 and Weights & Biases (Wandb) 0.15.12. For cross-platform deployment compatibility, models were exported using TorchScript and ONNX 1.14.0 formats, and TensorRT 8.6.1 was used for optimized inference on edge devices.
To ensure statistical rigor, paired t-tests were conducted to compare mAP performance across model variants, with p < 0.05 considered statistically significant. For qualitative comparison of detection outcomes, McNemar’s test was applied to assess differences in prediction success rates. Bonferroni correction was used when multiple pairwise comparisons were performed, with the adjusted significance level defined as α′ = 0.05 divided by the number of comparisons. In addition to aggregate metrics, detailed error analysis was conducted. False Positive Rate (FPR) was calculated to quantify incorrect ripeness class assignments, and False Negative Rate (FNR) was used to measure missed fruit detections. The confusion matrix was analyzed to provide per-class performance breakdowns. Furthermore, per-fruit-species analysis was performed to identify species-specific detection strengths and weaknesses.
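The Bonferroni adjustment and McNemar's test are simple enough to sketch directly. The version of McNemar's test below uses the chi-square approximation with one degree of freedom and no continuity correction, which is an assumption; the discordant counts are hypothetical.

```python
import math

def bonferroni_alpha(alpha, n_comparisons):
    """Adjusted significance level: alpha divided by the number of tests."""
    return alpha / n_comparisons

def mcnemar(b, c):
    """McNemar's test on discordant pairs.

    b: cases only model A got right; c: cases only model B got right.
    Returns (statistic, p_value) from a chi-square with 1 dof.
    """
    if b + c == 0:
        return 0.0, 1.0
    stat = (b - c) ** 2 / (b + c)
    # Survival function of chi-square(1): P(Z^2 > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

alpha_adj = bonferroni_alpha(0.05, 5)
stat, p_value = mcnemar(30, 10)
print(alpha_adj, stat, p_value, p_value < alpha_adj)
```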
3.6. Deployment Considerations
The proposed detection framework is designed for direct integration into an autonomous fruit harvesting robot. As illustrated in Figure 6, Orchard-YOLO functions as the perception module within a complete perception-to-actuation pipeline. The workflow begins with RGB image acquisition from the onboard camera. The captured image is processed by the YOLOv13-based detection network to generate fruit localization and ripeness classification outputs, each comprising a bounding box, a ripeness class, and a confidence score. A confidence threshold (> 0.5) is then applied to remove unreliable detections and ensure stable downstream decision-making. Subsequently, 2D bounding box coordinates are fused with depth camera measurements to obtain 3D fruit localization. This depth-aware positioning enables accurate spatial estimation required for robotic manipulation. Based on the reconstructed 3D coordinates, the motion planning module computes collision-free trajectories [31] while considering canopy structure and obstacle constraints. The robotic arm then executes fruit picking using force-controlled gripper actuation, followed by post-harvest handling.
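The confidence filtering and depth fusion steps can be sketched with a pinhole camera model. The detection dictionary layout, the intrinsics (`fx`, `fy`, `cx`, `cy`), and the choice of sampling depth at the box center are assumptions for illustration; the paper does not specify the fusion method.

```python
def filter_detections(detections, threshold=0.5):
    """Keep detections whose confidence exceeds the threshold."""
    return [d for d in detections if d["confidence"] > threshold]

def box_center(box):
    """Pixel center (u, v) of a box given as (x_min, y_min, x_max, y_max)."""
    return (box[0] + box[2]) / 2, (box[1] + box[3]) / 2

def back_project(u, v, depth_m, fx, fy, cx, cy):
    """Pinhole back-projection of pixel (u, v) at a given depth (meters)
    to camera-frame coordinates (X, Y, Z)."""
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return x, y, depth_m

# Hypothetical detections and camera intrinsics.
detections = [
    {"box": (570, 190, 670, 290), "ripeness": "Ripe", "confidence": 0.9},
    {"box": (10, 10, 50, 50), "ripeness": "Unripe", "confidence": 0.3},
]
kept = filter_detections(detections)
u, v = box_center(kept[0]["box"])
print(len(kept), back_project(u, v, 2.0, fx=600, fy=600, cx=320, cy=240))
```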
As shown in Figure 6, this sequential pipeline ensures that visual perception, geometric reconstruction, and mechanical execution operate within a unified control framework suitable for real orchard deployment. From a real-time control perspective, robotic servo systems typically require update frequencies between 10 and 30 Hz, corresponding to decision intervals of 33–100 ms. Under these constraints, Orchard-YOLO-Light achieves approximately 22 ms per frame (45 FPS), providing sufficient latency margin for perception, depth fusion, and motion planning modules. YOLOv13-medium operates at approximately 45 ms per frame (22 FPS), which remains acceptable but leaves reduced computational headroom for complex scenes. These results demonstrate that the proposed lightweight variant satisfies the timing requirements of embedded robotic harvesting platforms.
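The timing argument can be made explicit by converting update rates to decision intervals and subtracting the perception latency. The sketch below uses the per-frame figures quoted above and treats everything else in the loop as the remaining headroom, which is a simplifying assumption.

```python
def interval_ms(rate_hz):
    """Decision interval in milliseconds for a given update rate."""
    return 1000.0 / rate_hz

def headroom_ms(rate_hz, perception_ms):
    """Time left in one control cycle after perception has run."""
    return interval_ms(rate_hz) - perception_ms

for name, latency in (("Orchard-YOLO-Light", 22.0), ("YOLOv13-medium", 45.0)):
    for rate in (10, 30):
        print(name, rate, "Hz", round(headroom_ms(rate, latency), 1), "ms")
```

Under this simple budget, a 22 ms model leaves positive headroom across the whole 10 to 30 Hz range, whereas a 45 ms model fits within the 100 ms interval at 10 Hz but exceeds the roughly 33 ms interval at 30 Hz.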
To further verify practical deployment feasibility, the final model should be evaluated on physical Jetson Nano hardware under sustained edge-device operation. Key deployment-oriented criteria include real execution speed beyond offline benchmark simulation, runtime memory consumption during continuous inference, thermal stability under prolonged use, and power consumption relevant to mobile robotic platforms. These aspects are important for determining whether the proposed detection framework can satisfy both computational and operational constraints in embedded agricultural robotics.