1. Introduction
Trellised watermelon cultivation is a production method that aims to achieve high yields and good fruit quality. It uses vertical planting, where watermelons are supported on trellis structures during growth. This arrangement can improve fruit quality and yield and can also increase land use efficiency [1]. As this cultivation method expands in China, it is contributing to agricultural development and creating a stronger demand for efficient harvesting.
At the same time, harvesting robotics is advancing, and replacing manual picking with harvesting robots is becoming an expected trend [2]. Research on intelligent harvesting equipment has also progressed [3]. For a harvesting robot to work reliably, it must first detect the fruit accurately before performing localization and picking. In trellised cultivation, detection is challenging because fruits may appear small, be partially hidden by leaves, and be affected by changing light conditions. Therefore, accurate identification of trellised watermelons is a key factor that directly influences harvesting efficiency.
Object detection driven by deep learning has become a widely used solution for visual tasks in agriculture. Current object detectors are commonly grouped into two-stage and single-stage methods. Two-stage methods, such as the R-CNN (Region-based Convolutional Neural Network) series [4,5,6], typically generate candidate regions and then refine classification and localization. Single-stage methods, such as the YOLO (You Only Look Once) series [7,8,9], predict object classes and bounding boxes in one step and are often faster. Many studies have improved these frameworks for fruit detection in complex scenes. Jianwei Yan et al. [10] improved Faster R-CNN by enhancing feature sampling and region processing, thereby increasing bounding box localization accuracy and achieving a detection accuracy of 95.53% on Rosa roxburghii fruits. Weihui Wang et al. [11] proposed an enhanced Faster R-CNN model for winter jujube defect recognition by using ResNet50 (a 50-layer Residual Network) [12] instead of VGG16 (Visual Geometry Group 16) [13], adding an SE (Squeeze-and-Excitation) attention module, introducing an FPN (Feature Pyramid Network) [14] for multi-scale fusion, and applying Soft NMS (Soft Non-Maximum Suppression) [15] to better handle overlapping objects; the model achieved an mAP@0.5 (mean Average Precision at an Intersection over Union threshold of 0.5) of 91.60%. Huijun Yin et al. [16] combined an improved YOLOv7-based detector with DeepSORT for automatic watermelon counting in drone videos. They introduced the GhostConv [17] and C2f (a faster CSP bottleneck with two convolutions) modules to reduce computation, added SimAM (Simple Attention Module) attention [18] to strengthen feature extraction, replaced CIoU with Focal EIoU (Focal Efficient Intersection over Union) to speed up convergence, and used a mask collision mechanism in DeepSORT to improve counting accuracy; the method improved precision by 2.3 percentage points and mAP by 0.3 percentage points over the baseline. Gang Ge et al. [19] proposed TOMO YOLO for tomato detection by improving feature fusion and adding an AWD detection head, achieving an mAP@0.5 of 90.6%. Xu Li et al. [20] proposed YOLO Pepper, which adds CA attention and uses DCNv2 deformable convolution to improve detection under occlusion, achieving an average detection accuracy of 93.3%, 2.8 percentage points higher than the baseline.
Despite these advances, convolutional neural networks (CNNs) mainly learn local features and may not capture enough global context. Here, global context means the overall scene information that helps relate an object to its surroundings; for example, how a watermelon relates to nearby leaves and the trellis, as well as cues from other parts of the image. When fruits are small or partly blocked, lacking such whole-image information can reduce detection accuracy. To address this problem, Transformer-based methods have been introduced into object detection [21]. Transformers use self-attention to capture global context, and they can reduce reliance on hand-crafted anchors. DETR (DEtection TRansformer), originally proposed by Carion et al. [22], is a representative Transformer-based detector that enables end-to-end prediction and reduces the need for post-processing such as non-maximum suppression (NMS). However, DETR often has high computational demands and converges slowly during training, which limits its use in complex agricultural applications. RT-DETR [23], proposed by Baidu, keeps the end-to-end design while improving efficiency and accuracy, and it has shown strong performance on multiple datasets.
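For context, the NMS post-processing step that end-to-end detectors such as DETR aim to avoid can be sketched as follows. This is a minimal greedy NMS in plain Python, assuming boxes in (x1, y1, x2, y2) format. The function names and the 0.5 threshold are illustrative and are not taken from any of the cited implementations.

```python
def box_iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes, scores, iou_threshold=0.5):
    """Return indices of boxes kept after greedy non-maximum suppression."""
    # Visit boxes from the highest score to the lowest.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with any kept box.
        if all(box_iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep
```

Because the threshold is hand-tuned, two heavily overlapping fruits can be merged into one detection, which is one motivation for detectors that predict a final set of boxes directly.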
Based on RT-DETR, this study focuses on trellised watermelon detection in complex field conditions, where many targets are small and occlusion is frequent, leading to reduced detection accuracy. To improve performance while controlling model size and computation, we propose an improved model, RT-DETR-Watermelon, designed specifically for trellised watermelon detection. The proposed method primarily addresses the visual localization of watermelons, providing reliable detection capabilities for trellised watermelon harvesting robots.
3. Results
3.1. Ablation Experiment Results and Analysis
To validate the effectiveness of the improvements made to RT-DETR, we designed and conducted ablation experiments to evaluate the model’s performance. The outcomes are presented in Table 1.
The experimental results show that the proposed model provides clear improvements across multiple evaluation metrics.
After adding the SSFF and TFE modules, precision, mAP@0.5, and mAP@[0.5:0.95] increased by 1.6, 0.1, and 4.4 percentage points, while recall decreased by only 0.7 percentage points.
After introducing the P2 detection layer, precision increased by 1.0 percentage points, recall by 0.1 percentage points, mAP@0.5 by 0.1 percentage points, and mAP@[0.5:0.95] by 3.8 percentage points.
Reducing the number of channels slightly decreased precision and mAP@0.5 by 0.5 and 0.1 percentage points, respectively. However, FLOPs, parameter count, and model size were reduced by 28.8%, 26.2%, and 26.7%, respectively, yielding a lighter model.
When the context-guided module was added independently, precision, recall, mAP@0.5, and mAP@[0.5:0.95] improved by 2.1, 0.5, 0.9, and 3.7 percentage points, respectively, compared with the baseline. At the same time, FLOPs, parameters, and model size decreased by 16.3%, 16.6%, and 16.3%.
Replacing the original loss with MPDIoU increased precision, recall, mAP@0.5, and mAP@[0.5:0.95] by 0.9, 0.1, 0.2, and 3.4 percentage points.
After combining all improvements, the final model achieved gains of 0.4, 1.8, 1.0, and 3.5 percentage points in precision, recall, mAP@0.5, and mAP@[0.5:0.95]. In addition, parameters, FLOPs, and model size decreased by 53.5%, 23.5%, and 53.2%. Overall, the proposed method improves detection performance while substantially reducing model complexity, confirming the effectiveness of the proposed modifications.
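The ablation results mix two kinds of change: accuracy metrics are compared in percentage points (absolute differences), while complexity metrics are compared as relative reductions. The short sketch below illustrates the distinction using the parameter counts and model sizes reported in Section 3.3 (19.8 M and 40.4 MB for the original RT-DETR, 9.2 M and 18.9 MB for RT-DETR-Watermelon). The baseline mAP@0.5 of 92.9% is inferred from the 1.0 percentage point gain reported there, and the function names are illustrative.

```python
def relative_reduction(before, after):
    """Relative reduction in percent, e.g. for parameters or model size."""
    return (before - after) / before * 100.0


def point_gain(before_pct, after_pct):
    """Absolute difference in percentage points between two percentages."""
    return after_pct - before_pct


param_reduction = relative_reduction(19.8, 9.2)   # parameters in millions
size_reduction = relative_reduction(40.4, 18.9)   # model size in MB
map_gain = point_gain(92.9, 93.9)                 # mAP@0.5 in percent

print(f"parameters: -{param_reduction:.1f}%")
print(f"model size: -{size_reduction:.1f}%")
print(f"mAP@0.5: +{map_gain:.1f} percentage points")
```

Under these inputs the relative reductions round to the 53.5% and 53.2% stated above.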
3.2. Stability Evaluation
To assess the stability of the model’s performance, we conducted five independent training runs on the dataset, with the results presented in Figure 8. The mean, standard deviation, and 95% confidence interval of the key metrics were computed, as reported in Table 2.
The model achieved a precision of 92.84% ± 0.51 (95% CI: [92.20%, 93.48%]) and a recall of 88.14% ± 0.43 (95% CI: [87.60%, 88.68%]). The mAP@0.5 reached 93.88% ± 0.15 (95% CI: [93.70%, 94.06%]), while the mAP@[0.5:0.95] was 73.76% ± 0.52 (95% CI: [73.12%, 74.40%]). These results indicate that the proposed method exhibits good consistency and robustness across repeated experiments.
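The text does not state how the confidence intervals were computed. One common choice for five runs, assumed here, is a Student's t interval with 4 degrees of freedom: mean ± t × s / √n, with a two-sided 95% critical value of about 2.776. The sketch below applies this to the reported precision, and the result agrees with the reported interval to within rounding of the standard deviation.

```python
import math

# Two-sided 95% critical value of Student's t with 4 degrees of freedom (n = 5 runs).
T_CRIT_95_DF4 = 2.776


def t_confidence_interval(mean, std, n, t_crit):
    """Return (low, high) of a t-based confidence interval for the mean."""
    half_width = t_crit * std / math.sqrt(n)
    return mean - half_width, mean + half_width


# Reported precision: mean 92.84, sample standard deviation 0.51, five runs.
low, high = t_confidence_interval(92.84, 0.51, 5, T_CRIT_95_DF4)
print(f"95% CI: [{low:.2f}, {high:.2f}]")
```

With only five runs, the t critical value is noticeably larger than the normal-approximation value of 1.96, so a normal-based interval would be narrower than the one reported.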
3.3. Comparative Trial
To evaluate the improved RT-DETR model, we compared it with five widely used detectors, namely YOLOv8s [32], YOLOv8n [33], SSD [34], Faster R-CNN, and the original RT-DETR, under identical experimental settings. The comparison results are summarized in Table 3.
In terms of detection accuracy, the improved model achieved an mAP@0.5 of 93.9%, exceeding YOLOv8s, YOLOv8n, SSD, Faster R-CNN, and the original RT-DETR by 0.6, 0.35, 4.9, 5.7, and 1.0 percentage points, respectively.
In terms of recall, the improved model outperformed the same five models by 2.7, 4.4, 9.1, 11.2, and 1.8 percentage points, respectively.
In terms of model lightweighting, RT-DETR-Watermelon has 9.2 M parameters and a model size of 18.9 MB, which is smaller than YOLOv8s with 11.1 M parameters and 21.4 MB, SSD with 23.7 M parameters and 94.9 MB, Faster R-CNN with 28.3 M parameters and 107.8 MB, and the original RT-DETR with 19.8 M parameters and 40.4 MB. However, YOLOv8n remains lighter, with 3.0 M parameters and a 6.0 MB model size. In terms of speed, RT-DETR-Watermelon achieves 21.2 FPS, which is faster than Faster R-CNN at 16.66 FPS, but slower than the YOLOv8 variants and RT-DETR under the same test setting.
Overall, RT-DETR-Watermelon provides a competitive balance between accuracy and compactness, achieving good detection performance while reducing the parameter count and model size compared with several baselines.
3.4. Effect of Scaling Factors
To compare RT-DETR-Watermelon models with different scaling factors, we evaluated three variants with different depth and width settings: S with a depth of 0.67 and a width of 0.75, ours with a depth of 1.0 and a width of 1.0, and L with a depth of 1.33 and a width of 1.25. The results are summarized in Table 4.
As shown in Table 4, the S variant achieves the fastest inference speed of 24.6 FPS with the lowest computational cost of 32.4 GFLOPs and the smallest model size of 15.7 MB. However, this efficiency is obtained at the expense of detection performance, especially recall, which drops to 84.8%, while precision and mAP@0.5 are 92.4% and 92.3%, respectively. When scaling the model up to L with a depth of 1.33 and a width of 1.25, the computation increases to 61.0 GFLOPs and the model size grows to 21.5 MB; meanwhile, the accuracy gain over S is limited, reaching 93.4% precision, 86.2% recall, and 93.3% mAP@0.5, and the speed decreases to 17.4 FPS. In contrast, our configuration with a depth of 1.0 and a width of 1.0 provides the best overall trade-off, delivering 93.2% precision, 88.2% recall, and the highest 93.9% mAP@0.5 with moderate computation of 43.5 GFLOPs and real-time performance of 21.2 FPS. Overall, these results indicate that the selected scaling factors offer a favorable balance between detection accuracy and computational efficiency for trellised watermelon detection.
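The text does not describe how the depth and width factors are applied inside the network. A common convention in YOLO-style model configurations, assumed here purely for illustration, multiplies the number of repeated blocks by the depth factor and the channel count by the width factor, rounding channels up to a multiple of 8. The base values (3 repeats, 256 channels) and the function names are hypothetical.

```python
import math


def scale_depth(base_repeats, depth_factor):
    """Scale the number of repeated blocks, keeping at least one."""
    return max(round(base_repeats * depth_factor), 1)


def scale_width(base_channels, width_factor, divisor=8):
    """Scale a channel count and round it up to a multiple of `divisor`."""
    return int(math.ceil(base_channels * width_factor / divisor) * divisor)


# (depth, width) factors of the three variants compared in Table 4.
variants = {"S": (0.67, 0.75), "ours": (1.0, 1.0), "L": (1.33, 1.25)}

# Hypothetical stage with 3 repeated blocks and 256 channels at factor 1.0.
scaled = {
    name: (scale_depth(3, depth), scale_width(256, width))
    for name, (depth, width) in variants.items()
}
for name, (repeats, channels) in scaled.items():
    print(f"{name}: {repeats} blocks, {channels} channels")
```

Under this convention, convolution cost grows roughly with the square of the width factor, which is consistent with the L variant costing considerably more computation than the S variant.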
3.5. Small-Object Subset Evaluation
As described in Section 2.3.2, we further evaluated both models on a dedicated small-object subset. As shown in Table 5, RT-DETR-Watermelon improves recall from 83.0% to 84.5% and mAP@0.5 from 90.1% to 90.8%, and yields a larger gain in mAP@[0.5:0.95], from 60.9% to 63.9%. Meanwhile, precision decreases slightly from 88.7% to 87.8%. These results indicate that, although the improved model is slightly less precise on small objects, it reduces missed detections and retrieves more small objects. Overall, the proposed model shows improved performance on the small-object subset.
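To see how precision can fall slightly while recall rises, recall the standard definitions: precision = TP / (TP + FP) and recall = TP / (TP + FN). The sketch below uses hypothetical counts for a set of 100 ground-truth fruits. They are chosen only to illustrate the direction of the effect and are not taken from Table 5.

```python
def precision(tp, fp):
    """Fraction of predicted boxes that match a ground-truth fruit."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Fraction of ground-truth fruits that were detected."""
    return tp / (tp + fn) if (tp + fn) else 0.0


# Hypothetical counts for 100 ground-truth fruits (illustration only).
baseline = {"tp": 83, "fp": 10, "fn": 17}
improved = {"tp": 85, "fp": 12, "fn": 15}

for name, c in (("baseline", baseline), ("improved", improved)):
    p = precision(c["tp"], c["fp"])
    r = recall(c["tp"], c["fn"])
    print(f"{name}: precision={p:.3f}, recall={r:.3f}")
```

In this illustration the improved detector finds two more fruits but also emits two more false positives, so recall increases while precision decreases.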
3.6. Occlusion Subset Evaluation
As described in Section 2.3.3, we further evaluated the robustness of the proposed method under occlusion by testing both models on the occlusion subset. The quantitative results are reported in Table 6.
Compared with the baseline RT-DETR, RT-DETR-Watermelon achieves the same precision (93.6%) while improving recall from 88.2% to 90.4%. Meanwhile, the mAP@0.5 increases from 94.2% to 94.7%. In addition, the mAP@[0.5:0.95] improves from 69.4% to 74.0%. These results demonstrate that the proposed improvements enhance detection robustness in occluded scenarios, particularly in terms of recall and localization accuracy.
3.7. Low-Light Object Evaluation
As described in Section 2.3.4, we further evaluated both models on the low-light subset to assess robustness under insufficient illumination. The quantitative results are reported in Table 7.
Compared with the baseline RT-DETR, RT-DETR-Watermelon improves precision from 92.7% to 93.6% and recall from 86.9% to 88.2%. Meanwhile, mAP@0.5 slightly increases from 93.0% to 93.1%, while mAP@[0.5:0.95] shows a substantial improvement from 67.9% to 74.4%. These results indicate that the proposed method achieves more reliable detection in low-light conditions, particularly in terms of overall localization quality under challenging illumination.
3.8. Loss Function Comparison Test
To validate the effectiveness of the MPDIoU bounding box regression loss adopted in this work, we compared it with GIoU, Inner IoU [35], and CIoU [36]. Their performance on the validation set is reported in Table 8. The corresponding convergence curves, mAP@0.5 curves, and PR curves are shown in Figure 9, Figure 10, and Figure 11, respectively. MPDIoU achieves the best precision, recall, and mAP@0.5 among the compared losses. In addition, the model trained with MPDIoU converges faster and reaches a lower final loss than the models trained with the other loss functions.
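For readers unfamiliar with MPDIoU, the sketch below shows the loss as we understand its published definition: the IoU is penalized by the squared distances between the top-left corners (d1) and the bottom-right corners (d2) of the two boxes, each normalized by w² + h² of the input image, and the loss is 1 − MPDIoU. This is a plain-Python illustration for a single box pair, not the training implementation used in this study, and the function names are illustrative.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def mpdiou_loss(pred, target, img_w, img_h):
    """MPDIoU loss for one predicted box and one ground-truth box."""
    # Squared distance between the top-left corners.
    d1_sq = (pred[0] - target[0]) ** 2 + (pred[1] - target[1]) ** 2
    # Squared distance between the bottom-right corners.
    d2_sq = (pred[2] - target[2]) ** 2 + (pred[3] - target[3]) ** 2
    # Normalize by the squared diagonal of the input image.
    norm = img_w ** 2 + img_h ** 2
    mpdiou = iou(pred, target) - d1_sq / norm - d2_sq / norm
    return 1.0 - mpdiou
```

Unlike plain IoU, the corner-distance terms remain informative when the boxes do not overlap, so the loss still distinguishes a near miss from a distant one.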
3.9. Heatmap Effect Comparison
We used Grad-CAM to generate heatmaps for the baseline and improved models. The resulting heatmaps are shown in Figure 12, where darker regions indicate stronger model attention.
As shown in Figure 12, the improved model produces darker and more concentrated responses on trellised watermelon targets. In contrast, the baseline model shows localization errors or missed detections for some fruits, accompanied by weaker and more scattered attention over the target regions. Background and noise regions exhibit lower intensity responses. In addition, YOLOv8s and YOLOv8n show limited sensitivity to small fruits, with many small targets receiving weak activation and being missed.
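The core Grad-CAM computation behind such heatmaps can be sketched in a few lines. Given the activations of a chosen feature layer and the gradients of a target score with respect to them, each channel is weighted by its spatially averaged gradient, the weighted channels are summed, and negative values are clipped. The text does not specify which layer or target score was used, so the inputs below are generic nested lists and the function name is illustrative.

```python
def grad_cam(activations, gradients):
    """Compute a normalized Grad-CAM map.

    activations, gradients: lists of K channel maps, each an H x W nested list.
    Returns an H x W nested list with values in [0, 1].
    """
    num_channels = len(activations)
    height = len(activations[0])
    width = len(activations[0][0])

    # Channel weights: global average of the gradients over spatial positions.
    weights = [
        sum(sum(row) for row in grad_map) / (height * width)
        for grad_map in gradients
    ]

    # Weighted sum over channels followed by ReLU.
    cam = [
        [
            max(0.0, sum(weights[c] * activations[c][i][j] for c in range(num_channels)))
            for j in range(width)
        ]
        for i in range(height)
    ]

    # Normalize to [0, 1] so the map can be rendered with any colormap.
    peak = max(max(row) for row in cam)
    if peak > 0:
        cam = [[value / peak for value in row] for row in cam]
    return cam
```

Whether strong responses appear dark or bright depends only on the colormap applied to the normalized map.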
3.10. Image Detection Results for Trellised Watermelon
To demonstrate the effectiveness of the improvements in complex agricultural environments, we selected representative images from the dataset that include different numbers of targets, varying degrees of occlusion, and diverse lighting conditions for comparative evaluation. The comparative results are presented in Figure 13.
In single-target scenes, both the baseline and improved models detect trellised watermelons reliably. However, in multi-target scenes, the baseline model produces false positives by confusing background structures with watermelons. In occluded scenes, it also generates duplicate boxes, assigning multiple detections to the same watermelon. In unobstructed scenes, the baseline model sometimes misses small trellised fruits due to limited small-object feature extraction, whereas the improved model detects these targets consistently.
Lighting variations further highlight the differences between the two methods. Under normal illumination, the baseline model still produces false negatives, especially for occluded fruits. In low-light scenes, it again generates duplicate boxes, while the improved model avoids this issue and remains stable.
4. Discussion
This study focuses on trellised watermelon detection in complex agricultural scenes. In such environments, heavy occlusion and large variation in fruit scale often cause missed detections, especially for small fruits. To address these challenges, we propose RT-DETR-Watermelon by enhancing multiscale feature representation and introducing lightweight context modelling.
Compared with the baseline RT-DETR, RT-DETR-Watermelon improves precision by 0.4 percentage points, recall by 1.8 percentage points, and mAP@0.5 by 1.0 percentage point. At the same time, it reduces parameters by 53.5%, FLOPs by 23.5%, and model size by 53.2%. These results indicate that the proposed design improves detection accuracy while lowering model complexity, which is beneficial for deployment under limited computational resources.
To ensure a fair evaluation, all experiments in this study, covering both the proposed model and the baseline detectors, used the same training settings and software environment. Beyond overall metrics, we report stratified results on challenging subsets to match the motivation of this work. On the occlusion subset, the gains in recall and mAP@[0.5:0.95] (2.2 and 4.6 percentage points) exceed those on the overall set (1.8 and 3.5 percentage points), and the small-object subset shows gains of similar size (1.5 and 3.0 percentage points). This suggests that the improvements are related to better handling of small targets and occlusion rather than minor overall fluctuations. We also include a low-light subset evaluation as additional evidence for challenging conditions.
We further evaluated training stability by repeating the experiments with five random seeds and reporting the mean ± standard deviation and 95% confidence intervals. The standard deviation of mAP@0.5 is 0.15 percentage points, and an independent-samples test shows a significant difference from the baseline (p < 0.05). These results suggest that the reported improvements are stable under our experimental settings.
Ablation results show that each component contributes to performance, while different modules may involve trade-offs across metrics. Therefore, the final model configuration is selected to achieve a balanced improvement across overall accuracy, subset performance, and model complexity. Finally, we note that this work focuses on 2D detection and localization and is intended to serve as an upstream perception module for subsequent robotic tasks.
4.1. Practical Significance of the Improved Model
The improvement in mAP@0.5 is only 1.0 percentage point, but its significance should be understood in the context of recent agricultural object detection research. Recent detectors, such as ELD-YOLO [37], YOLOv5-ACS [38], AAB-YOLO [39], DS-YOLO [40], and YOLOv8MSP-PD [41], were proposed to address common challenges in agricultural scenes, including fruit occlusion, overlap, complex backgrounds, illumination variation, and natural field conditions. These studies indicate that improvements in agricultural detection are often gradual rather than substantial because the task is challenging and baseline detectors already achieve relatively high performance. Therefore, the 1.0 percentage point gain achieved in this study is consistent with the level of improvement commonly reported in the recent literature and can still be considered meaningful in practice, especially for challenging agricultural detection tasks.
The practical significance of this improvement should not be evaluated only by the increase in mAP@0.5. In real agricultural applications, even small detection errors may accumulate during large-scale field operations and reduce the reliability of downstream tasks. In this study, the proposed method not only improves mAP@0.5 but also reduces model size and parameter count, making it more suitable for deployment on resource-limited agricultural devices. In addition, the 1.8 percentage point improvement in recall indicates that fewer fruits are missed in challenging scenes, such as those with occlusion, overlap, or background interference. This is important for practical applications such as fruit counting, yield estimation, and robotic harvesting, where missed detections may directly affect economic returns. Finally, the multi-seed evaluation with confidence intervals shows that the observed improvement remains consistent across repeated runs, suggesting that the gain is stable rather than caused by random variation.
4.2. Study Limitations
Several limitations should be acknowledged. The dataset was collected from a single farm in Zibo, China, and includes two cultivars, one camera device, and one trellised cultivation system. This limited diversity may introduce dataset bias and reduce generalization to other regions, varieties, devices, and cultivation practices. Although we report multi-seed stability, broader validation on additional datasets is still needed to further reduce the risk of overfitting. End-to-end latency has not been measured on a real edge device. Finally, extreme adverse conditions were not systematically covered, and image degradation may affect detection reliability in practice.
4.3. Failure Case Analysis
Although RT-DETR-Watermelon delivers consistent improvements on the overall test set, as well as on the challenging small-object, occlusion, and low-light subsets, it still fails in several complex field conditions. The most common failures are: missed detections or inaccurate boxes under extreme occlusion, where more than half of a fruit is covered or the fruit lies partly outside the image boundary; false positives caused by background objects with a round or fruit-like appearance; and missed instances that become extremely small after the input is resized to 640 × 640, where texture and shape cues are heavily degraded. Dense and overlapping fruits may also trigger duplicate boxes or merged instances, which can affect counting-oriented applications. To mitigate these issues, we will enrich the dataset with more hard samples, including heavy occlusion, backlight, and very small fruits, and employ stronger augmentations that simulate these degradations to reduce ambiguity in crowded and occluded scenes. We will also construct a more fine-grained hard-case evaluation by stratifying the current occlusion subset into 30–50% and more than 50% occlusion and the small-object subset into targets smaller than 20 pixels and targets between 20 and 30 pixels, and then report the corresponding metrics for more detailed analysis.
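The planned stratification could be organized along the lines of the sketch below. The annotation fields (`w`, `h`, `occluded`) are hypothetical, and two points are assumptions rather than details given in the text: object size is measured as the shorter side of the bounding box in pixels, and occlusion is stored as a fraction between 0 and 1.

```python
def size_bin(box_w, box_h):
    """Assign a size stratum based on the shorter box side in pixels (assumption)."""
    side = min(box_w, box_h)
    if side < 20:
        return "<20 px"
    if side <= 30:
        return "20-30 px"
    return ">30 px"


def occlusion_bin(occluded_fraction):
    """Assign an occlusion stratum from the occluded fraction of the fruit."""
    if occluded_fraction > 0.5:
        return ">50%"
    if occluded_fraction >= 0.3:
        return "30-50%"
    return "<30%"


def stratify(annotations):
    """Group annotations by (size stratum, occlusion stratum)."""
    groups = {}
    for ann in annotations:
        key = (size_bin(ann["w"], ann["h"]), occlusion_bin(ann["occluded"]))
        groups.setdefault(key, []).append(ann)
    return groups
```

Metrics can then be computed per group, which makes it easier to see whether errors concentrate in, for example, very small and heavily occluded fruits.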
4.4. Deployment Feasibility on Embedded Platforms
Although we reduce the computational cost to 43.5 GFLOPs per inference, this value can still be demanding for typical low-power embedded devices used in agriculture. Practical efficiency depends on the available accelerator, numerical precision, memory bandwidth, and software optimization. On entry-level GPU platforms like the Jetson Nano, the model may be feasible at low-to-moderate frame rates when using optimized inference and reduced precision, but the margin can be limited if higher resolution, higher FPS, or multiple perception tasks must run simultaneously. In contrast, dedicated edge AI hardware such as the Jetson Orin provides much larger computation headroom and is more suitable when strict real-time performance and long battery operation are required. We also note that agricultural mobile robots often move slowly, so perception does not always require high FPS. In practice, energy use can be further reduced by lowering input resolution, using frame skipping or event-driven inference, and applying FP16 quantization with an optimized runtime, which improves the viability of deploying a 43.5 GFLOPs model on battery-powered robots.
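Frame skipping, one of the options mentioned above, can be sketched as follows. The detector is run only on every n-th frame, and the most recent detections are reused in between. The `detect` callable and the default interval of 3 are hypothetical stand-ins, not part of the evaluated system.

```python
def run_with_frame_skipping(frames, detect, every_n=3):
    """Run `detect` on every n-th frame and reuse the last result otherwise.

    frames:  iterable of frames (any objects accepted by `detect`)
    detect:  callable returning a list of detections for one frame
    every_n: run the detector once per `every_n` frames
    """
    if every_n < 1:
        raise ValueError("every_n must be at least 1")
    results = []
    last_detections = []
    for index, frame in enumerate(frames):
        if index % every_n == 0:
            # Full inference only on selected frames.
            last_detections = detect(frame)
        # Intermediate frames reuse the most recent detections.
        results.append(last_detections)
    return results
```

With `every_n=3`, the detector runs on roughly one third of the frames, which reduces average computation accordingly at the cost of stale boxes between updates. This trade-off is more acceptable for slow-moving platforms.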
4.5. Future Work
Future work will proceed in four directions to improve both the scientific rigor and the practical usability of this work. First, we will expand the dataset to improve the model’s generalization ability by collecting data from multiple farms and regions across different seasons and lighting conditions and by including more crop cultivars. Second, we will optimize deployment efficiency for real-world use. Beyond adopting lighter backbone networks, we will explore model compression and acceleration methods, including structured pruning and knowledge distillation; test end-to-end latency and memory usage on common edge devices; analyze the trade-off between accuracy and efficiency under different computing budgets; and evaluate the model’s adaptability to changes in input resolution and real-time requirements. Third, we will move from model-level to system-level evaluation by integrating the detector into a full end-to-end harvesting pipeline. This includes combining 2D detection with depth sensing for 3D fruit localization; validating grasp planning and execution via grasp success rate, fruit damage rate, and single-fruit harvesting time; and extending perception to fruit segmentation and ripeness recognition to enable closed-loop decision-making during harvesting. Fourth, we will conduct cross-farm and cross-system field trials under different trellis structures, crop cultivars, and environmental conditions to quantitatively evaluate the model’s robustness and practical performance in real planting scenarios.
5. Conclusions
This study developed RT-DETR-Watermelon, a lightweight end-to-end detector for trellised watermelon images. The model targets three common field challenges: partial occlusion by leaves and vines, large changes in fruit scale, and a high proportion of small fruits. To address these issues while keeping the network compact, we introduced a context-guided module into the backbone, added a high-resolution P2 detection layer, employed SSFF and TFE for multi-scale feature fusion, adopted MPDIoU loss for more stable box regression, and reduced channel width in the neck and head.
On the proposed dataset, RT-DETR-Watermelon achieves 93.2% precision, 88.2% recall, and 93.9% mAP@0.5 with 43.5 GFLOPs, 9.2 M parameters, and a model size of 18.9 MB. Relative to the RT-DETR baseline, it improves recall and mAP@0.5 while reducing parameters and model size by more than half, indicating a better accuracy–efficiency balance for deployment. The improvements are also observed on small-object, occlusion, and low-light subsets, suggesting increased robustness in practical field conditions.
This study is limited by the dataset scope and by the lack of latency tests on embedded devices. Future work will expand data collection across farms, cultivars, seasons, and imaging devices, and will further optimize inference on edge hardware through practical compression and acceleration. We also plan to integrate the detector into a complete harvesting pipeline to support 3D localization and downstream operations.