Article

Target Detection of Trellised Watermelons in Complex Agricultural Scenes Based on Improved RT-DETR

1
College of Agricultural Engineering and Food Science, Shandong University of Technology, Zibo 255000, China
2
Shandong Academy of Agricultural Machinery Science, Jinan 250100, China
3
Shandong Key Laboratory of Intelligent Agricultural Equipment in Hilly and Mountainous Areas, Jinan 250100, China
4
Zibo Institute for Quality Inspection of Electromechanical and Pump Products, Zibo 255200, China
*
Author to whom correspondence should be addressed.
Horticulturae 2026, 12(3), 333; https://doi.org/10.3390/horticulturae12030333
Submission received: 26 January 2026 / Revised: 7 March 2026 / Accepted: 9 March 2026 / Published: 10 March 2026
(This article belongs to the Special Issue A New Wave of Smart and Mechanized Techniques in Horticulture)

Abstract

To address severe fruit occlusion, large variations in target scale, and frequent missed detections of small targets when recognizing trellised watermelons in complex agricultural scenes, this study proposes an improved RT-DETR-based detection model, termed RT-DETR-Watermelon. A context-guided (CG) module is embedded into the backbone network to strengthen feature representation under occlusion. A dedicated P2 detection layer is added to enhance the model’s sensitivity to small objects. A scale sequence feature fusion (SSFF) module and a triple feature encoder (TFE) module are introduced to improve the detection of targets at multiple scales. The original bounding box regression loss is replaced with the MPDIoU (Minimum Point Distance Intersection over Union) loss, which accelerates model convergence and improves localization precision. Finally, the number of channels is reduced to lower the parameter count, computational complexity, and storage size. The experimental results show that, compared with the original RT-DETR model, the proposed RT-DETR-Watermelon model increases precision, recall, and mean Average Precision (mAP@0.5) by 0.4, 1.8, and 1.0 percentage points, respectively, while reducing the number of parameters, computational cost, and model size by 53.5%, 23.5%, and 53.2%, respectively.

1. Introduction

Trellised watermelon cultivation is a production method that aims to achieve high yields and good fruit quality. It uses vertical planting, where watermelons are supported on trellis structures during growth. This arrangement can improve fruit quality and yield and can also increase land use efficiency [1]. As this cultivation method expands in China, it is contributing to agricultural development and creating a stronger demand for efficient harvesting.
At the same time, harvesting robotics is advancing, and replacing manual picking with harvesting robots is becoming an expected trend [2]. Research on intelligent harvesting equipment has also progressed [3]. For a harvesting robot to work reliably, it must first detect the fruit accurately before performing localization and picking. In trellised cultivation, detection is challenging because fruits may appear small, be partially hidden by leaves, and be affected by changing light conditions. Therefore, accurate identification of trellised watermelons is a key factor that directly influences harvesting efficiency.
Object detection driven by deep learning has become a widely used solution for visual tasks in agriculture. Current object detectors are commonly grouped into two-stage and single-stage methods. Two-stage methods, such as the R-CNN (Region-based Convolutional Neural Network) series [4,5,6], typically generate candidate regions and then refine classification and localization. Single-stage methods, such as the YOLO (You Only Look Once) series [7,8,9], predict object classes and bounding boxes in one step and are often faster. Many studies have improved these frameworks for fruit detection in complex scenes. Jianwei Yan et al. [10] improved Faster R-CNN by enhancing feature sampling and region processing, thereby increasing bounding box localization accuracy and achieving a detection accuracy of 95.53% on Rosa roxburghii fruits. Weihui Wang et al. [11] proposed an enhanced Faster R-CNN model for winter jujube defect recognition by using ResNet50 (50-layer Residual Network) [12] instead of VGG16 (Visual Geometry Group 16) [13], adding an SE (Squeeze and Excitation) attention module, introducing an FPN (Feature Pyramid Network) [14] for multi-scale fusion, and applying Soft NMS (Soft Non-Maximum Suppression) [15] to better handle overlapping objects; the model achieved an mAP@0.5 (mean Average Precision at Intersection over Union threshold of 0.5) of 91.60%. Huijun Yin et al. [16] combined an improved YOLOv7-based detector with DeepSORT for automatic watermelon counting in drone videos. They introduced the GhostConv [17] and C2f (Context Fusion) modules to reduce computation, added SimAM (Simple Attention Module) attention [18] to strengthen feature extraction, replaced CIoU with Focal EIoU (Focal Efficient Intersection over Union) to speed up convergence, and used a mask collision mechanism in DeepSORT to improve counting accuracy; the method improved precision by 2.3 percentage points and mAP by 0.3 percentage points over the baseline. Gang Ge et al. [19] proposed TOMO YOLO for tomato detection by improving feature fusion and adding an AWD detection head, achieving an mAP@0.5 of 90.6%. Xu Li et al. [20] proposed YOLO Pepper, which adds CA attention and uses DCNv2 deformable convolution to improve detection under occlusion, achieving an average detection accuracy of 93.3%, 2.8 percentage points higher than the baseline.
Despite these advances, convolutional neural networks (CNNs) mainly learn local features and may not capture enough global context. Here, global context means the overall scene information that helps relate an object to its surroundings, for example, how a watermelon relates to nearby leaves and the trellis, as well as cues from other parts of the image. When fruits are small or partly blocked, the lack of such whole-image information can reduce detection accuracy. To address this problem, Transformer-based methods have been introduced into object detection [21]. Transformers use self-attention to capture global context, and they can reduce reliance on hand-crafted anchors. DETR (DEtection TRansformer), originally proposed by Carion et al. [22], is a representative Transformer-based detector that enables end-to-end prediction and reduces the need for post-processing such as non-maximum suppression (NMS). However, DETR is computationally demanding and converges slowly during training, which limits its use in complex agricultural applications. RT-DETR [23], proposed by Baidu, keeps the end-to-end design while improving efficiency and accuracy, and it has shown strong performance on multiple datasets.
Based on RT-DETR, this study focuses on trellised watermelon detection in complex field conditions, where many targets are small and occlusion is frequent, leading to reduced detection accuracy. To improve performance while controlling model size and computation, we propose an improved model, RT-DETR-Watermelon, designed specifically for trellised watermelon detection. The proposed method primarily addresses the visual localization of watermelons, providing reliable detection capabilities for trellised watermelon harvesting robots.

2. Materials and Methods

2.1. Data Acquisition and Data Processing

The trellised watermelon image dataset used in this study was collected at an ecological farm in Zibo City, Shandong Province, China. The dataset includes two cultivars, Red Honey watermelon and Qi lin watermelon. Images were captured using a REDMI K80 Android smartphone (Xiaomi Corporation, Beijing, China) at a resolution of 3072 × 3072 pixels.
Because trellised watermelons are photographed in visually complex field environments, the dataset was designed to cover a wide range of conditions to improve model robustness and generalization. It contains 1368 images with variations in illumination, occlusion, and number of fruits. All watermelons in each image were manually annotated with bounding boxes using LabelImg v1.8.6. Representative examples are shown in Figure 1.
The dataset was split into training, testing, and validation sets in a 7:2:1 ratio. For data augmentation, we applied random contrast, noise, horizontal flipping, random brightness, and vertical flipping [12]. After augmentation, the dataset contained a total of 8208 images: 5745 in the training set, 1642 in the test set, and 821 in the validation set. The results of the data augmentation are illustrated in Figure 2.
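The five augmentation operations named above can be sketched with NumPy as follows. This is a minimal illustration, not the authors' pipeline: the function name `augment`, the parameter ranges, and the noise level are all hypothetical, since the source does not report them. The reported total of 8208 images equals 6 × 1368, which is consistent with keeping each original alongside five augmented copies, although the source does not state this explicitly.

```python
import numpy as np

def augment(image, rng):
    """Return five augmented copies of an HxWx3 uint8 image.

    All numeric ranges below are illustrative assumptions.
    """
    img = image.astype(np.float32)
    mean = img.mean()
    contrast = (img - mean) * rng.uniform(0.7, 1.3) + mean  # random contrast
    noisy = img + rng.normal(0.0, 10.0, size=img.shape)     # additive Gaussian noise
    h_flip = img[:, ::-1]                                   # horizontal flip
    bright = img + rng.uniform(-40.0, 40.0)                 # random brightness
    v_flip = img[::-1, :]                                   # vertical flip
    outputs = [contrast, noisy, h_flip, bright, v_flip]
    # Clip back to the valid pixel range and restore the integer type.
    return [np.clip(o, 0, 255).astype(np.uint8) for o in outputs]

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)
augmented = augment(image, rng)
print(len(augmented) + 1)  # the original plus five variants
```

When flips are applied to annotated images, the bounding box coordinates have to be mirrored in the same way; that step is omitted here for brevity.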

2.2. Model and Training

2.2.1. Model Improvement Based on RT-DETR

To address the challenges of trellised watermelon detection, such as severe occlusion, large variations in target scale, and a high proportion of small targets, while keeping the model lightweight and computationally efficient, we took RT-DETR [23] as the baseline and proposed an improved model termed RT-DETR-Watermelon (Figure 3). The proposed model preserves the overall RT-DETR detection framework, and the RT-DETR-Watermelon variant is obtained through the following modifications: we adopted ResNet-18 as the backbone and inserted a context-guided (CG) module [24] to strengthen feature representation under occlusion; we added a P2 small-object layer [12] to enhance feature extraction for small targets; we introduced a TFE [25] and SSFF [26] into the neck to improve multi-scale feature fusion and detection under substantial size variations; we replaced the original regression loss with MPDIoU loss [27] to provide stronger localization guidance and accelerate convergence; finally, the number of model channels was adjusted to reduce parameter count and computational complexity.

2.2.2. Context-Guided Module

In trellised watermelon detection, leaves and vines often partially cover the fruit, reducing feature visibility and detection accuracy. To improve robustness to occlusion, we introduce a context-guided module, as shown in Figure 4. The context-guided module is a lightweight feature extraction module that combines local details with wider contextual information, helping the network capture fruit cues even when the target is partially hidden. It extracts features in four steps. First, the local branch $f_{loc}$ uses a 3 × 3 standard convolution to extract basic texture and shape features. Next, the surrounding context branch $f_{sur}$ applies dilated convolution to enlarge the receptive field with limited additional cost, which helps model the relationship between the fruit and its surrounding environment and improves feature extraction for occluded targets. The joint feature integration unit $f_{joi}$ then merges local and contextual features, followed by batch normalization and a PReLU activation to stabilize training and improve convergence. Finally, the global branch $f_{glo}$ uses global average pooling to aggregate spatial information, and two fully connected layers generate channel-wise weights to select and strengthen important features. By combining local, contextual, and global cues, the context-guided module improves the model’s robustness to occlusion and background interference.
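The claim that dilated convolution enlarges the receptive field at limited cost can be checked with simple arithmetic. The sketch below uses the standard relation between kernel size and dilation; the dilation rates and the channel count of 64 are illustrative assumptions, since the source does not report the values used in the surrounding context branch.

```python
def effective_kernel(kernel_size: int, dilation: int) -> int:
    """Spatial extent covered by one dilated convolution kernel."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)

def weights_per_filter(kernel_size: int, in_channels: int) -> int:
    """Number of weights in one output filter; it does not depend on dilation."""
    return kernel_size * kernel_size * in_channels

# Hypothetical dilation rates for a 3 x 3 kernel with 64 input channels.
for dilation in (1, 2, 4):
    print(dilation, effective_kernel(3, dilation), weights_per_filter(3, 64))
```

The covered extent grows from 3 to 5 to 9 pixels while the weight count per filter stays the same, which is why the surrounding branch can see more context without adding parameters.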

2.2.3. TFE Module

To improve feature extraction for objects of different sizes, we introduced a TFE module, as shown in Figure 5. The main idea of a TFE is to fuse multi-scale features by first converting feature maps from different resolutions (large, medium, and small) into a unified medium-scale representation, and then integrating them.
Specifically, large-scale feature maps are downsampled to the medium scale using average pooling and max pooling. The combination of these two pooling operations helps retain both salient local responses and fine-grained details. Small-scale feature maps are upsampled to the medium scale using nearest neighbor interpolation, which aligns spatial sizes with low computational cost. The resized feature maps are then normalized and concatenated along the channel dimension to produce the fused output.
By aligning feature maps from different resolutions to a unified scale before fusing them, a TFE enables the network to exploit complementary multi-scale cues and improves the detection of small objects. After normalization and channel-wise concatenation, subsequent layers can learn an effective combination of information from different scales. Unlike attention mechanisms that mainly reweight features within a single scale, a TFE explicitly performs cross-scale alignment and integration. Compared with global self-attention or non-local modules that require expensive global interactions, a TFE achieves cross-scale integration with much lower overhead, making it suitable for real-time detection.
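The alignment and concatenation steps of the TFE can be sketched with NumPy as below. This is a simplified illustration under stated assumptions: the helper names are hypothetical, the two pooled results are summed (the source does not say how they are combined), the scale factor between adjacent levels is fixed at 2, and the normalization step is omitted.

```python
import numpy as np

def pool2x(x, reduce_fn):
    """Downsample a (C, H, W) map by 2 using non-overlapping 2 x 2 windows."""
    c, h, w = x.shape
    windows = x.reshape(c, h // 2, 2, w // 2, 2)
    return reduce_fn(windows, axis=(2, 4))

def upsample2x_nearest(x):
    """Upsample a (C, H, W) map by 2 with nearest neighbor interpolation."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def tfe_fuse(large, medium, small):
    """Align three maps to the medium resolution and concatenate along channels."""
    # Assumption: max-pooled and average-pooled maps are added together.
    large_down = pool2x(large, np.max) + pool2x(large, np.mean)
    small_up = upsample2x_nearest(small)
    return np.concatenate([large_down, medium, small_up], axis=0)

rng = np.random.default_rng(0)
large = rng.normal(size=(8, 80, 80))   # high-resolution map
medium = rng.normal(size=(8, 40, 40))  # reference resolution
small = rng.normal(size=(8, 20, 20))   # low-resolution map
fused = tfe_fuse(large, medium, small)
print(fused.shape)  # (24, 40, 40)
```

The output keeps the medium spatial size and triples the channel count, so the layers that follow receive all three scales at every spatial position.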

2.2.4. SSFF Module

In field conditions, trellised watermelons often appear as numerous small targets with large variations in size and frequent occlusion. These characteristics make accurate classification and localization difficult for detectors. To improve multi-scale representation in such scenes, we introduced the Scale Sequence Feature Fusion (SSFF) module, as shown in Figure 6.
SSFF fuses feature maps from multiple backbone stages. The highest-resolution feature map is used as the reference. First, 1 × 1 convolutions are applied to the lower-resolution feature maps to align their channel numbers with the reference feature map. Next, nearest neighbor interpolation upsamples these feature maps to the same spatial size. An unsqueeze operation then adds a new scale dimension, and feature maps from different stages are stacked along this dimension to form a scale sequence. Finally, the stacked features are processed by 3D convolution, followed by batch normalization and LeakyReLU activation to complete the fusion.
The SSFF module enhances detection by fusing multi-stage features, combining fine spatial details from high-resolution feature maps with semantic information from low-resolution maps. This cross-scale fusion improves the representation of small and partially occluded fruits by preserving fine boundary details while providing stronger category-level context. Unlike traditional methods, which use fixed fusion paths and may introduce feature conflicts across scales, SSFF adaptively combines features, ensuring that high-resolution details are emphasized for small targets while low-resolution features contribute reliable semantic information. As a result, SSFF improves detection accuracy and robustness, particularly in complex agricultural environments where occlusion and background clutter are common.
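The construction of the scale sequence described above can be sketched with NumPy. The sketch covers only the upsampling and stacking steps; it assumes the 1 × 1 convolutions have already aligned the channel numbers, and it leaves out the 3D convolution, batch normalization, and LeakyReLU. Function names and tensor sizes are illustrative.

```python
import numpy as np

def upsample_nearest(x, factor):
    """Nearest neighbor upsampling of a (C, H, W) map by an integer factor."""
    return x.repeat(factor, axis=1).repeat(factor, axis=2)

def build_scale_sequence(reference, lower_maps):
    """Stack the reference map and upsampled lower-resolution maps on a new scale axis."""
    height = reference.shape[1]
    aligned = [reference]
    for fmap in lower_maps:
        factor = height // fmap.shape[1]
        aligned.append(upsample_nearest(fmap, factor))
    # Each map goes from (C, H, W) to (C, 1, H, W); the maps are then joined on axis 1.
    return np.concatenate([a[:, None] for a in aligned], axis=1)

rng = np.random.default_rng(0)
p3 = rng.normal(size=(16, 80, 80))  # highest-resolution map, used as reference
p4 = rng.normal(size=(16, 40, 40))
p5 = rng.normal(size=(16, 20, 20))
sequence = build_scale_sequence(p3, [p4, p5])
print(sequence.shape)  # (16, 3, 80, 80)
```

The resulting (channels, scales, height, width) layout is what a 3D convolution expects once a batch dimension is added, so the kernel can slide across the scale axis as well as the two spatial axes.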

2.2.5. P2 Detection Layer

In trellised watermelon cultivation, trellis height increases the imaging distance, causing fruits to appear small in the captured images. The original RT-DETR model uses three detection layers (P3, P4, and P5), which produce feature maps of 80 × 80, 40 × 40, and 20 × 20 for a 640 × 640 input; this configuration has limited capability for detecting small objects. To address this limitation, we add a P2 detection layer that outputs a 160 × 160 feature map. The higher spatial resolution of P2 preserves finer details and reduces the number of missed small targets.
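The feature map sizes quoted above follow from the input size and the downsampling stride of each level. The strides of 4, 8, 16, and 32 used below are inferred from the 640 × 640 training resolution and the stated map sizes; they are not given explicitly in the source.

```python
def feature_map_size(input_size: int, stride: int) -> int:
    """Side length of the feature map produced at a given stride."""
    return input_size // stride

def cells_covered(box_side_px: float, stride: int) -> float:
    """How many feature map cells one side of a box spans."""
    return box_side_px / stride

strides = {"P2": 4, "P3": 8, "P4": 16, "P5": 32}
sizes = {name: feature_map_size(640, s) for name, s in strides.items()}
print(sizes)  # {'P2': 160, 'P3': 80, 'P4': 40, 'P5': 20}

# A 30-pixel fruit spans fewer than 4 cells on P3 but 7.5 cells on P2.
print(cells_covered(30, strides["P3"]), cells_covered(30, strides["P2"]))
```

This makes the motivation for P2 concrete: a small fruit occupies roughly twice as many cells per side on P2 as on P3, leaving more spatial detail for the detection head.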

2.2.6. Channel Count Adjustment

The improved architecture strengthens feature extraction for small, multi-scale, and partially occluded targets, thereby enhancing recognition of trellised watermelons in complex environments. However, these changes also increase the number of parameters and computational complexity. To achieve a lightweight design, we reduced the channel width of the neck and head in the original RT-DETR from 256 to 128, while keeping the backbone channels unchanged.
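A rough way to see why halving the channel width saves so much is to count the weights of a single convolution, which scale with the product of input and output channels. The sketch below is a generic calculation for one hypothetical 3 × 3 layer, not a count of the actual neck or head layers.

```python
def conv_params(kernel_size, in_channels, out_channels, bias=False):
    """Number of learnable parameters in one standard 2D convolution."""
    weights = kernel_size * kernel_size * in_channels * out_channels
    return weights + (out_channels if bias else 0)

wide = conv_params(3, 256, 256)    # 589824
narrow = conv_params(3, 128, 128)  # 147456
print(wide, narrow, wide / narrow)
```

For a layer whose input and output widths are both halved, the parameter count drops by a factor of four. The reduction for the whole model is smaller because the backbone channels are left unchanged.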

2.2.7. Loss Function Optimization

Trellised watermelons grow irregularly and vary greatly in size, which often leads to containment relationships between the predicted boxes and the actual boxes. In these cases, the GIoU loss [28] in RT-DETR can degenerate to standard IoU, weakening bounding box regression and reducing localization accuracy. To mitigate this problem, we adopted the MPDIoU loss function (Figure 7) to improve localization precision for trellised watermelon targets and to accelerate convergence during training.
MPDIoU consists of two parts: the overlap term and a normalized shift penalty term based on corner distances. The calculation process for MPDIoU is as follows:
$$\mathrm{MPDIoU} = \frac{|A \cap B|}{|A \cup B|} - \frac{d_1^2 + d_2^2}{W^2 + H^2} \tag{1}$$
$$d_1^2 + d_2^2 = \left[ (x_1^{gt} - x_1^{prd})^2 + (y_1^{gt} - y_1^{prd})^2 \right] + \left[ (x_2^{gt} - x_2^{prd})^2 + (y_2^{gt} - y_2^{prd})^2 \right] \tag{2}$$
$$L_{\mathrm{MPDIoU}} = 1 - \mathrm{MPDIoU} \tag{3}$$
In Formula (1), $A$ is the predicted box, and $B$ is the ground truth box. $|A \cap B|$ represents the intersection area of the two boxes. $|A \cup B|$ represents the union area of the two boxes. $\frac{|A \cap B|}{|A \cup B|}$ is the standard IoU term that measures the overlap between the predicted box $A$ and the ground truth box $B$. The second term $\frac{d_1^2 + d_2^2}{W^2 + H^2}$ measures the geometric shift between the two boxes using the squared distances of the top-left and bottom-right corners, and it is normalized by the squared image diagonal. Therefore, when the predicted box is slightly translated, the corner distance term changes smoothly (proportional to the squared shift), making the metric less sensitive to small shifts than IoU alone and providing more stable gradients for box regression. This is particularly helpful in containment cases, where GIoU may provide weak localization guidance. Finally, $L_{\mathrm{MPDIoU}}$ represents the MPDIoU loss function.
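The MPDIoU definition can be written out in a few lines of plain Python. The sketch below follows the formulas in this section, with boxes given as (x1, y1, x2, y2) for the top-left and bottom-right corners; the function names and the example boxes are illustrative and not taken from the paper.

```python
def iou(pred, gt):
    """Standard IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(pred[0], gt[0]), max(pred[1], gt[1])
    ix2, iy2 = min(pred[2], gt[2]), min(pred[3], gt[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_pred = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_gt = (gt[2] - gt[0]) * (gt[3] - gt[1])
    return inter / (area_pred + area_gt - inter)

def mpdiou(pred, gt, img_w, img_h):
    """IoU minus the corner distance penalty normalized by the squared image diagonal."""
    d1_sq = (gt[0] - pred[0]) ** 2 + (gt[1] - pred[1]) ** 2  # top-left corners
    d2_sq = (gt[2] - pred[2]) ** 2 + (gt[3] - pred[3]) ** 2  # bottom-right corners
    return iou(pred, gt) - (d1_sq + d2_sq) / (img_w ** 2 + img_h ** 2)

def mpdiou_loss(pred, gt, img_w, img_h):
    return 1.0 - mpdiou(pred, gt, img_w, img_h)

# Containment case: two predictions of equal size, both fully inside the ground truth.
gt = (100, 100, 300, 300)
centered = (150, 150, 250, 250)
cornered = (100, 100, 200, 200)
print(iou(centered, gt), iou(cornered, gt))
print(mpdiou_loss(centered, gt, 640, 640), mpdiou_loss(cornered, gt, 640, 640))
```

Both predictions have the same IoU, and because the ground truth box is also their enclosing box, GIoU reduces to IoU and cannot separate them. The corner distance term still yields different loss values for the two placements, which is the extra localization signal the section refers to.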

2.2.8. Experimental Environment

All experiments were conducted on a system equipped with an Intel Core i5 10200H processor and an NVIDIA GeForce RTX 3050 laptop graphics card. The software environment included Windows 11, CUDA 12.4, PyTorch 2.3.0, and Python 3.10.14.
During training, images were resized to 640 × 640 pixels. The batch size was set to 16, the model was trained for 200 epochs, and the initial learning rate was set to 0.001. The AdamW optimizer was used, with weight decay and momentum set to 0.0005 and 0.937, respectively.

2.3. Evaluation Protocol

2.3.1. Evaluation Indicators

To evaluate the performance of the detection model, we use precision (P), recall (R), mean Average Precision at an IoU threshold of 0.5 (mAP@0.5), and mean Average Precision averaged over IoU thresholds from 0.5 to 0.95 (mAP@[0.5:0.95]). In addition, model size is reported to indicate deployment feasibility. Computational complexity is quantified using the number of parameters and floating point operations (FLOPs).
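For reference, precision and recall are computed from the counts of true positives, false positives, and false negatives obtained at a given IoU threshold. The sketch below uses the standard definitions; the counts in the example are hypothetical and are not results from this study.

```python
def precision_recall(tp: int, fp: int, fn: int):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical counts: 90 correct detections, 10 false alarms, 30 missed fruits.
p, r = precision_recall(tp=90, fp=10, fn=30)
print(p, r)  # 0.9 0.75
```

Average Precision summarizes how precision changes as recall increases across confidence thresholds, and mAP averages it over classes; with a single watermelon class, mAP@0.5 equals the AP of that class at an IoU threshold of 0.5.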

2.3.2. Small Object Subset Construction

In this study, all images are resized to 640 × 640 for training and evaluation. We define small targets as watermelon instances whose ground truth bounding box width and height are both less than 30 pixels [29] in the 640 × 640 images (less than 30 × 30 px). Based on this criterion, we screened the validation set and formed a dedicated small-object subset. Specifically, we selected 108 validation images that contain small targets. For evaluation, we report precision (P), recall (R), mAP@0.5, and mAP@[0.5:0.95] on this subset. The same evaluation settings are applied to both the baseline RT-DETR and the proposed RT-DETR-Watermelon to ensure a fair comparison.
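The selection rule can be expressed as a short filter. The sketch below assumes that boxes are stored in original pixel coordinates of the 3072 × 3072 captures and are scaled to the 640 × 640 evaluation size before the 30-pixel test is applied; the annotation dictionary and file names are hypothetical.

```python
def is_small(box, img_w, img_h, target=640, limit=30):
    """True if the box is under `limit` px in both width and height after resizing.

    `box` is (x1, y1, x2, y2) in original image pixels.
    """
    width = (box[2] - box[0]) * target / img_w
    height = (box[3] - box[1]) * target / img_h
    return width < limit and height < limit

def small_object_images(annotations, img_w=3072, img_h=3072):
    """Names of images that contain at least one small target."""
    return [name for name, boxes in annotations.items()
            if any(is_small(b, img_w, img_h) for b in boxes)]

# Hypothetical annotations: image name -> list of boxes.
annotations = {
    "a.jpg": [(500, 500, 700, 700), (1000, 1000, 1100, 1120)],
    "b.jpg": [(500, 500, 700, 700)],
}
print(small_object_images(annotations))  # ['a.jpg']
```

In the example, the 100 × 120 pixel box shrinks to about 21 × 25 pixels at 640 × 640 and qualifies as small, while the 200 × 200 pixel box becomes about 42 pixels per side and does not.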

2.3.3. Occluded Object Subset Construction

To address occlusion in a reproducible manner, we constructed an occluded subset from the test set to evaluate the model’s robustness under occlusion. A target instance was considered “occluded” when the proportion of the target area that was not visible due to being covered by other objects was at least 30% [30]. If an image contained at least one instance meeting this criterion, the image was included in the occluded subset. The occluded images were manually selected according to this quantitative rule; this subset contains 158 images, and all other experimental settings, including the data split and evaluation protocol, were kept identical to the main experiment to ensure fair and consistent comparison.

2.3.4. Low Light Object Subset Construction

To evaluate the model’s robustness under low light conditions, we constructed a dedicated low light subset. This subset curates samples from the original dataset captured under a range of challenging illumination settings, including varying degrees of underexposure, dim environments, and weak lighting, thereby reflecting the lighting constraints commonly encountered in real world applications. By evaluating the model on this subset separately, we can better assess its stability when visual information is limited.

2.4. Visualization Method

Heatmap Visualization Method

A heatmap highlights the key image regions that contribute most to the model’s object detection decision-making. In this study, Gradient-weighted Class Activation Mapping (Grad CAM) [31] was adopted to generate visualization heatmaps using PyTorch (version 2.3.0). Channel-wise weights are calculated from the gradients of the final convolutional layer, then used to generate a weighted feature map, which is ultimately upsampled and projected onto the original input image.
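The weighting step described above can be sketched with NumPy, given the activations of a convolutional layer and the gradients of the detection score with respect to them. This is a minimal illustration of the standard Grad-CAM computation, including its ReLU step; obtaining the two arrays from the model (for example with hooks) and upsampling the map onto the input image are left out, and the function name is illustrative.

```python
import numpy as np

def grad_cam(activations, gradients):
    """Compute a normalized Grad-CAM map from (C, H, W) activations and gradients."""
    weights = gradients.mean(axis=(1, 2))             # one weight per channel
    cam = np.tensordot(weights, activations, axes=1)  # weighted channel sum -> (H, W)
    cam = np.maximum(cam, 0.0)                        # keep positive evidence only
    if cam.max() > 0:
        cam = cam / cam.max()                         # scale to [0, 1]
    return cam

# Tiny example with two channels on a 2 x 2 map.
activations = np.array([[[1.0, 2.0], [3.0, 4.0]],
                        [[0.0, 0.0], [0.0, 8.0]]])
gradients = np.stack([np.ones((2, 2)), -np.ones((2, 2))])
cam = grad_cam(activations, gradients)
print(cam)
```

In the example the second channel has a negative weight, so the location where it fires strongly is suppressed to zero, while the remaining locations are ranked by the first channel.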

3. Results

3.1. Ablation Experiment Results and Analysis

To validate the effectiveness of the improvements made to RT-DETR, we designed and conducted ablation experiments to evaluate the contribution of each modification. The outcomes are presented in Table 1.
The experimental results show that the proposed model provides clear improvements across multiple evaluation metrics.
After adding the SSFF and TFE modules, precision, mAP@0.5, and mAP@[0.5:0.95] increased by 1.6, 0.1, and 4.4 percentage points, while recall decreased by only 0.7 percentage points.
After introducing the P2 detection layer, precision increased by 1.0 percentage points, recall by 0.1 percentage points, mAP@0.5 by 0.1 percentage points, and mAP@[0.5:0.95] by 3.8 percentage points.
Reducing the channel number slightly decreased precision and mAP@0.5 by 0.5 and 0.1 percentage points, respectively. However, FLOPs, parameter count, and model size were reduced by 28.8%, 26.2%, and 26.7%, achieving a more lightweight model.
When the context-guided module was added independently, precision, recall, mAP@0.5, and mAP@[0.5:0.95] improved by 2.1, 0.5, 0.9, and 3.7 percentage points, respectively, compared with the baseline. At the same time, FLOPs, parameters, and model size decreased by 16.3%, 16.6%, and 16.3%.
Replacing the original loss with MPDIoU increased precision, recall, mAP@0.5, and mAP@[0.5:0.95] by 0.9, 0.1, 0.2, and 3.4 percentage points.
After combining all improvements, the final model achieved gains of 0.4, 1.8, 1.0, and 3.5 percentage points in precision, recall, mAP@0.5, and mAP@[0.5:0.95]. In addition, parameters, FLOPs, and model size decreased by 53.5%, 23.5%, and 53.2%. Overall, the proposed method improves detection performance while substantially reducing model complexity, confirming the effectiveness of the proposed modifications.

3.2. Stability Evaluation

To assess the stability of the model’s performance, we conducted five independent training runs on the dataset, with the results presented in Figure 8. The mean, standard deviation, and 95% confidence interval of the key metrics were computed, as reported in Table 2.
The model achieved a precision of 92.84% ± 0.51 (95% CI: [92.20%, 93.48%]) and a recall of 88.14% ± 0.43 (95% CI: [87.60%, 88.68%]). The mAP@0.5 reached 93.88% ± 0.15 (95% CI: [93.70%, 94.06%]), while the mAP@[0.5:0.95] was 73.76% ± 0.52 (95% CI: [73.12%, 74.40%]). These results indicate that the proposed method exhibits good consistency and robustness across repeated experiments.
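The source does not state how the confidence intervals were computed. As a plausibility check, a Student t interval for five runs (four degrees of freedom) reproduces the reported bounds to within rounding. The sketch below hard-codes the two-sided 95% critical value of 2.776 rather than depending on a statistics library; the function name is illustrative.

```python
import math

T_CRIT_DF4 = 2.776  # two-sided 95% Student t critical value for 4 degrees of freedom

def ci95(mean, std, n=5, t_crit=T_CRIT_DF4):
    """Confidence interval mean +/- t * std / sqrt(n)."""
    half_width = t_crit * std / math.sqrt(n)
    return mean - half_width, mean + half_width

# Reported mean and standard deviation of precision over five runs.
low, high = ci95(92.84, 0.51)
print(round(low, 2), round(high, 2))
```

The interval comes out within about 0.01 percentage points of the reported [92.20%, 93.48%]; the small difference is expected because the standard deviation is only given to two decimals.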

3.3. Comparative Trial

To evaluate the improved RT-DETR model, we compared it with five widely used detectors, namely YOLOv8s [32], YOLOv8n [33], SSD [34], Faster R-CNN, and the original RT-DETR, under identical experimental settings. The comparison results are summarized in Table 3.
In terms of detection accuracy, the improved model achieved an mAP@0.5 of 93.9%, exceeding YOLOv8s, YOLOv8n, SSD, Faster R-CNN, and the original RT-DETR by 0.6, 0.35, 4.9, 5.7, and 1.0 percentage points, respectively.
In terms of recall, the improved model outperformed the same five models by 2.7, 4.4, 9.1, 11.2, and 1.8 percentage points.
In terms of model lightweighting, RT-DETR-Watermelon has 9.2 M parameters and a model size of 18.9 MB, which is smaller than YOLOv8s with 11.1 M parameters and 21.4 MB, SSD with 23.7 M parameters and 94.9 MB, Faster R-CNN with 28.3 M parameters and 107.8 MB, and the original RT-DETR with 19.8 M parameters and 40.4 MB. However, YOLOv8n remains lighter, with 3.0 M parameters and a 6.0 MB model size. In terms of speed, RT-DETR-Watermelon achieves 21.2 FPS, which is faster than Faster R-CNN at 16.66 FPS, but slower than the YOLOv8 variants and RT-DETR under the same test setting.
Overall, RT-DETR-Watermelon provides a competitive balance between accuracy and compactness, achieving good detection performance while reducing the parameter count and model size compared with several baselines.
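The relative reductions reported in Section 3.1 can be recomputed from the parameter counts and model sizes listed above. The sketch below is plain arithmetic on those published values; the helper name is illustrative.

```python
def reduction_pct(baseline: float, improved: float) -> float:
    """Relative reduction of `improved` with respect to `baseline`, in percent."""
    return 100.0 * (baseline - improved) / baseline

params_cut = reduction_pct(19.8, 9.2)  # parameters in millions
size_cut = reduction_pct(40.4, 18.9)   # model size in MB
print(round(params_cut, 1), round(size_cut, 1))  # 53.5 53.2
```

Both values match the 53.5% and 53.2% reductions stated for the final model relative to the original RT-DETR.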

3.4. Effect of Scaling Factors

To provide a comparison of RT-DETR-Watermelon models with different scaling factors, we evaluated variants with different depth and width settings, namely S with a depth of 0.67 and a width of 0.75, ours with a depth of 1.0 and a width of 1.0, and L with a depth of 1.33 and a width of 1.25. The results are summarized in Table 4.
As shown in Table 4, the S variant achieves the fastest inference speed of 24.6 FPS with the lowest computational cost of 32.4 GFLOPs and the smallest model size of 15.7 MB. However, this efficiency is obtained at the expense of detection performance, especially recall, which drops to 84.8%, while precision and mAP@0.5 are 92.4% and 92.3%, respectively. When scaling the model up to L with a depth of 1.33 and a width of 1.25, the computation increases to 61.0 GFLOPs and the model size grows to 21.5 MB; meanwhile, the accuracy gain is limited, reaching 93.4% precision, 86.2% recall, and 93.3% mAP@0.5, and the speed decreases to 17.4 FPS. In contrast, our configuration with a depth of 1.0 and a width of 1.0 provides the best overall trade-off, delivering 93.2% precision, 88.2% recall, and the highest 93.9% mAP@0.5 with moderate computation of 43.5 GFLOPs and real-time performance of 21.2 FPS. Overall, these results indicate that the selected scaling factors offer a favorable balance between detection accuracy and computational efficiency for trellised watermelon detection.

3.5. Small-Object Subset Evaluation

As described in Section 2.3.2, we further evaluated both models on a dedicated small-object subset. The results are shown in Table 5.
RT-DETR-Watermelon improves recall from 83.0% to 84.5%, mAP@0.5 from 90.1% to 90.8%, and yields a larger gain on the mAP@[0.5:0.95], from 60.9% to 63.9%. Meanwhile, precision slightly decreases from 88.7% to 87.8%. The above results indicate that although the improved model shows a slight reduction in detection precision for small objects, it reduces missed detections and retrieves more small objects. Overall, the proposed model shows improved performance on the small-object subset.

3.6. Occlusion Subset Evaluation

As described in Section 2.3.3, we further evaluated the robustness of the proposed method under occlusion by testing both models on the occlusion subset. The quantitative results are reported in Table 6.
Compared with the baseline RT-DETR, RT-DETR-Watermelon achieves the same precision (93.6%) while improving recall from 88.2% to 90.4%. Meanwhile, the mAP@0.5 increases from 94.2% to 94.7%. In addition, the mAP@[0.5:0.95] improves from 69.4% to 74.0%. These results demonstrate that the proposed improvements enhance detection robustness in occluded scenarios, particularly in terms of recall and localization accuracy.

3.7. Low-Light Object Evaluation

As described in Section 2.3.4, we further evaluated both models on the low-light subset to assess robustness under insufficient illumination. The quantitative results are reported in Table 7.
Compared with the baseline RT-DETR, RT-DETR-Watermelon improves precision from 92.7% to 93.6% and recall from 86.9% to 88.2%. Meanwhile, mAP@0.5 slightly increases from 93.0% to 93.1%, while mAP@[0.5:0.95] shows a substantial improvement from 67.9% to 74.4%. These results indicate that the proposed method achieves more reliable detection in low-light conditions, particularly in terms of overall localization quality under challenging illumination.

3.8. Loss Function Comparison Test

To validate the effectiveness of the proposed MPDIoU bounding box regression loss, we compared it with GIoU, Inner IoU [35], and CIoU [36]. Their performance on the validation set is reported in Table 8, and the corresponding convergence curves are displayed in Figure 9, while the mAP@0.5 curves are shown in Figure 10 and the PR curve comparisons are presented in Figure 11. MPDIoU achieves the best precision, recall, and mAP@0.5 among the compared losses. In addition, the model trained with MPDIoU converges faster and reaches a lower final loss than the other loss functions.

3.9. Heatmap Effect Comparison

We use Grad CAM to generate heatmaps for the baseline and improved models for comparison. The resulting heatmaps are shown in Figure 12, where darker regions indicate stronger model attention.
As shown in Figure 12, the improved model produces darker and more concentrated responses on trellised watermelon targets. In contrast, the baseline model shows localization errors or missed detections for some fruits, accompanied by weaker and more scattered attention over the target regions. Background and noise regions exhibit lower intensity responses. In addition, YOLOv8s and YOLOv8n show limited sensitivity to small fruits, with many small targets receiving weak activation and being missed.

3.10. Image Detection Results for Trellised Watermelon

To demonstrate the effectiveness of the improvements in complex agricultural environments, we chose some representative images from the dataset that include different numbers of targets, varying degrees of occlusion, and diverse lighting conditions for comparative evaluation. The comparative results are presented in Figure 13.
In single-target scenes, both the baseline and improved models detect trellised watermelons reliably. However, in multi-target scenes, the baseline model produces false positives by confusing background structures with watermelons. In occluded scenes, it also generates duplicate boxes, assigning multiple detections to the same watermelon. In unobstructed scenes, the baseline model sometimes misses small trellised fruits due to limited small-object feature extraction, whereas the improved model detects these targets consistently.
Lighting variations further highlight the differences between the two methods. Under normal illumination, the baseline model still produces false negatives, especially for occluded fruits. In low-light scenes, it again generates duplicate boxes, while the improved model avoids this issue and remains stable.

4. Discussion

This study focuses on trellised watermelon detection in complex agricultural scenes. In such environments, heavy occlusion and large variations in fruit scale often cause missed detections, especially for small fruits. To address these challenges, we propose RT-DETR-Watermelon, which enhances multiscale feature representation and introduces lightweight context modelling.
Compared with the baseline RT-DETR, RT-DETR-Watermelon improves precision by 0.4 percentage points, recall by 1.8 percentage points, and mAP@0.5 by 1.0 percentage point. At the same time, it reduces parameters by 53.5%, FLOPs by 23.5%, and model size by 53.2%. These results indicate that the proposed design improves detection accuracy while lowering model complexity, which is beneficial for deployment under limited computational resources.
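The relative reductions quoted above can be checked directly from the tabulated values. The short sketch below recomputes them from the RT-DETR and RT-DETR-Watermelon rows of Table 3; the helper name is an illustrative assumption. Because the table entries are themselves rounded, the recomputed figures agree with the reported ones only to within about 0.1 percentage points.

```python
def percent_reduction(baseline, improved):
    """Relative reduction of `improved` with respect to `baseline`, in percent."""
    return (baseline - improved) / baseline * 100.0


# (baseline RT-DETR, RT-DETR-Watermelon) values as listed in Table 3.
table3 = {
    "params_M": (19.8, 9.2),
    "gflops": (56.9, 43.5),
    "size_MB": (40.4, 18.9),
}
reductions = {k: percent_reduction(b, m) for k, (b, m) in table3.items()}

# Accuracy metrics are compared as absolute differences in percentage points.
recall_gain_pp = round(88.2 - 86.4, 1)
map50_gain_pp = round(93.9 - 92.9, 1)
```

Note the distinction between the two kinds of figures: complexity is reported as a relative change (percent of the baseline), whereas precision, recall, and mAP@0.5 are reported as absolute differences in percentage points.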
To ensure a fair evaluation, the proposed model and all baseline detectors were trained and tested with the same training settings and software environment. Beyond overall metrics, we report stratified results on challenging subsets to match the motivation of this work. On the occlusion subset, recall improves by 2.2 percentage points and mAP@[0.5:0.95] by 4.6 percentage points (Table 6), compared with 1.8 and 3.4 percentage points on the overall set; on the small-object subset, recall improves by 1.5 percentage points and mAP@[0.5:0.95] by 3.0 percentage points (Table 5). These results suggest that the improvements are related to better handling of small targets and occlusion rather than minor overall fluctuations. We also include a low-light subset evaluation (Table 7) as additional evidence for challenging conditions.
We further evaluated training stability by repeating the experiments with five random seeds and reporting the mean ± standard deviation and 95% confidence intervals (Table 2). The standard deviation of mAP@0.5 is 0.15 percentage points, and an independent-samples test shows a significant difference from the baseline (p < 0.05). These results suggest that the reported improvements are stable under our experimental settings.
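As an illustration of how such intervals can be obtained from five runs, the sketch below computes a two-sided 95% interval from a mean and a sample standard deviation. The text does not state which interval construction was used; a Student-t interval with four degrees of freedom is assumed here because it reproduces the Table 2 bounds to within rounding. The function name is hypothetical.

```python
import math

# Two-sided 95% critical value of Student's t with 4 degrees of freedom (n = 5 runs).
T_CRIT_DF4 = 2.776


def t_interval(mean, sd, n=5, t_crit=T_CRIT_DF4):
    """Return (lower, upper) of mean +/- t_crit * sd / sqrt(n)."""
    half_width = t_crit * sd / math.sqrt(n)
    return mean - half_width, mean + half_width


# Mean and SD of recall over the five seeds, as reported in Table 2.
low, high = t_interval(88.14, 0.43)
```

With only five runs the t critical value is noticeably larger than the normal value of 1.96, so the interval is wider than a normal approximation would suggest.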
Ablation results show that each component contributes to performance, while different modules may involve trade-offs across metrics. Therefore, the final model configuration is selected to achieve a balanced improvement across overall accuracy, subset performance, and model complexity. Finally, we note that this work focuses on 2D detection and localization and is intended to serve as an upstream perception module for subsequent robotic tasks.

4.1. Practical Significance of the Improved Model

The improvement in mAP@0.5 is only 1.0 percentage point, but its significance should be understood in the context of recent agricultural object detection research. Recent models such as ELD-YOLO [37], YOLOv5-ACS [38], AAB-YOLO [39], DS-YOLO [40], and YOLOv8-MSP-PD [41] were proposed to address common challenges in agricultural scenes, including fruit occlusion, overlap, complex backgrounds, illumination variation, and natural field conditions. These studies indicate that improvements in agricultural detection are often incremental rather than substantial, because the task is challenging and baseline detectors already achieve relatively high performance. The 1.0 percentage point gain achieved in this study is therefore consistent with the level of improvement commonly reported in the recent literature and can still be considered meaningful in practice, especially for challenging agricultural detection tasks.
The practical significance of this improvement should not be evaluated only by the increase in mAP@0.5. In real agricultural applications, even small detection errors may accumulate during large-scale field operations and reduce the reliability of downstream tasks. In this study, the proposed method not only improves mAP@0.5 but also reduces model size and parameter count, making it more suitable for deployment on resource-limited agricultural devices. In addition, the 1.8 percentage point improvement in recall indicates that fewer fruits are missed in challenging scenes, such as those with occlusion, overlap, or background interference. This is important for practical applications such as fruit counting, yield estimation, and robotic harvesting, where missed detections may directly affect economic returns. Finally, the multi-seed evaluation with confidence intervals shows that the observed improvement remains consistent across repeated runs, suggesting that the gain is stable rather than caused by random variation.
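To make the effect of the recall gain concrete, the following sketch converts recall into an expected number of missed fruits. It assumes, purely for illustration, a pass over 1000 visible fruits, that each fruit is imaged once, and that the test-set recall values in Table 3 carry over to the field; none of these assumptions has been verified in this study.

```python
def expected_missed(total_fruits, recall_percent):
    """Expected number of undetected fruits for a given recall (in percent)."""
    return total_fruits * (1.0 - recall_percent / 100.0)


FRUITS = 1000  # hypothetical number of visible fruits in one field pass

missed_baseline = expected_missed(FRUITS, 86.4)  # RT-DETR recall, Table 3
missed_improved = expected_missed(FRUITS, 88.2)  # RT-DETR-Watermelon recall, Table 3
fewer_missed = missed_baseline - missed_improved
```

Under these assumptions the baseline would miss about 136 fruits and the improved model about 118, that is, roughly 18 fewer missed fruits per 1000.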

4.2. Study Limitations

Several limitations should be acknowledged. The dataset was collected from a single farm in Zibo, China, and includes two cultivars, one camera device, and one trellised cultivation system. This limited diversity may introduce dataset bias and reduce generalization to other regions, varieties, devices, and cultivation practices. Although we report multi-seed stability, broader validation on additional datasets is still needed to further reduce the risk of overfitting. End-to-end latency has not been measured on a real edge device. Finally, extreme adverse conditions were not systematically covered, and image degradation may affect detection reliability in practice.

4.3. Failure Case Analysis

Although RT-DETR-Watermelon delivers consistent improvements on the overall test set and on the small-object, occlusion, and low-light subsets, it still fails in several complex field conditions. The most common failures are: missed detections or inaccurate boxes under extreme occlusion, where more than half of a fruit is covered or the fruit lies partly outside the image boundary; false positives caused by background objects with a round or fruit-like appearance; and missed fruits that become extremely small after the input is resized to 640 × 640, where texture and shape cues are heavily degraded. Dense and overlapping fruits may also trigger duplicate boxes or merged instances, which can affect counting-oriented applications. To mitigate these issues, we will enrich the dataset with more hard samples, including heavy occlusion, backlight, and very small fruits, and employ stronger augmentations that simulate these degradations to reduce ambiguity in crowded and occluded scenes. We will also construct a finer-grained hard-case evaluation by stratifying the current occlusion subset into 30–50% and more than 50% occlusion and the small-object subset into targets smaller than 20 pixels and targets between 20 and 30 pixels, and then report the corresponding metrics for more detailed analysis.
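The planned stratification can be sketched as a pair of binning functions. The thresholds are those named above; everything else is an assumption made for illustration, in particular that an occluded-area fraction is available per annotation, that the pixel size refers to the shorter side of the bounding box, and the function and label names.

```python
def occlusion_stratum(occluded_fraction):
    """Map an occluded-area fraction in [0, 1] to a stratum label."""
    if occluded_fraction > 0.5:
        return ">50%"
    if occluded_fraction >= 0.3:
        return "30-50%"
    return "<30%"


def size_stratum(box_w, box_h):
    """Map a box size in pixels to a stratum label, using the shorter side."""
    side = min(box_w, box_h)
    if side < 20:
        return "<20 px"
    if side <= 30:
        return "20-30 px"
    return ">30 px"


# Hypothetical annotations: (occluded fraction, box width, box height).
samples = [(0.35, 48, 52), (0.62, 18, 25), (0.10, 28, 30)]
strata = [(occlusion_stratum(o), size_stratum(w, h)) for o, w, h in samples]
```

Metrics would then be computed separately for each stratum, so that a gain on, for example, fruits with more than 50% occlusion is not averaged away by the easier cases.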

4.4. Deployment Feasibility on Embedded Platforms

Although we reduce the computational cost to 43.5 GFLOPs per inference, this value can still be demanding for typical low-power embedded devices used in agriculture. Practical efficiency depends on the available accelerator, numerical precision, memory bandwidth, and software optimization. On entry-level GPU platforms like the Jetson Nano, the model may be feasible at low-to-moderate frame rates when using optimized inference and reduced precision, but the margin can be limited if higher resolution, higher FPS, or multiple perception tasks must run simultaneously. In contrast, dedicated edge AI hardware such as the Jetson Orin provides much larger computation headroom and is more suitable when strict real-time performance and long battery operation are required. We also note that agricultural mobile robots often move slowly, so perception does not always require high FPS. In practice, energy use can be further reduced by lowering input resolution, using frame skipping or event-driven inference, and applying FP16 quantization with an optimized runtime, which improves the viability of deploying a 43.5 GFLOPs model on battery-powered robots.
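A back-of-the-envelope sketch of these considerations is given below. It bounds the frame rate by dividing a device's sustained throughput by the per-inference cost and shows how frame skipping reduces the number of inferences. The throughput value is a placeholder and not a measured figure for any Jetson device, and the function names are hypothetical. As noted above, real latency also depends on memory bandwidth, numerical precision, and runtime optimization, so the bound is optimistic.

```python
def estimated_fps(gflops_per_frame, sustained_gflops):
    """Upper-bound frame rate if compute throughput were the only limit."""
    return sustained_gflops / gflops_per_frame


def frames_to_process(num_frames, skip):
    """Indices of frames sent to the detector when running every `skip`-th frame."""
    return list(range(0, num_frames, skip))


MODEL_GFLOPS = 43.5       # per inference, from Table 3
SUSTAINED_GFLOPS = 200.0  # hypothetical sustained throughput of an edge device

fps_bound = estimated_fps(MODEL_GFLOPS, SUSTAINED_GFLOPS)
processed = frames_to_process(num_frames=30, skip=3)
compute_saving = 1.0 - len(processed) / 30
```

With these placeholder numbers the bound is below 5 frames per second, and running the detector on every third frame removes about two thirds of the inferences, which may be acceptable for a slowly moving platform.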

4.5. Future Work

Future work will proceed in four directions to improve both the scientific rigor and the practical usability of this work. First, we will expand the dataset to improve generalization by collecting data from multiple farms and regions across different seasons and lighting conditions and by including more crop cultivars. Second, we will optimize deployment efficiency for real-world use: beyond adopting lighter backbone networks, we will explore model compression and acceleration methods, including structured pruning and knowledge distillation, test end-to-end latency and memory usage on common edge devices, analyze the trade-off between accuracy and efficiency under different computing budgets, and evaluate the model’s adaptability to changes in input resolution and real-time requirements. Third, we will move from model-level to system-level evaluation by integrating the detector into a full end-to-end harvesting pipeline. This includes combining 2D detection with depth sensing for 3D fruit localization; validating grasp planning and execution via grasp success rate, fruit damage rate, and single-fruit harvesting time; and extending perception to fruit segmentation and ripeness recognition to enable closed-loop decision-making during harvesting. Fourth, we will conduct cross-farm and cross-system field trials under different trellis structures, crop cultivars, and environmental conditions to quantitatively evaluate the model’s robustness and practical performance in real planting scenarios.

5. Conclusions

This study developed RT-DETR-Watermelon, a lightweight end-to-end detector for trellised watermelon images. The model targets three common field challenges: partial occlusion by leaves and vines, large changes in fruit scale, and a high proportion of small fruits. To address these issues while keeping the network compact, we introduced a context-guided module into the backbone, added a high-resolution P2 detection layer, employed SSFF and TFE for multi-scale feature fusion, adopted MPDIoU loss for more stable box regression, and reduced channel width in the neck and head.
On the proposed dataset, RT-DETR-Watermelon achieves 93.2% precision, 88.2% recall, and 93.9% mAP@0.5 with 43.5 GFLOPs, 9.2 M parameters, and a model size of 18.9 MB. Relative to the RT-DETR baseline, it improves recall and mAP@0.5 while reducing parameters and model size by more than half, indicating a better accuracy–efficiency balance for deployment. The improvements are also observed on small-object, occlusion, and low-light subsets, suggesting increased robustness in practical field conditions.
This study is limited by the dataset scope and by the lack of latency tests on embedded devices. Future work will expand data collection across farms, cultivars, seasons, and imaging devices, and will further optimize inference on edge hardware through practical compression and acceleration. We also plan to integrate the detector into a complete harvesting pipeline to support 3D localization and downstream operations.

Author Contributions

Methodology, W.Y., H.Q., H.Y. and G.Z.; software, W.Y., H.Q., S.W., H.Y. and Y.H.; investigation, W.Y., H.Q., S.W., Y.H. and G.Z.; data curation, W.Y., S.W., H.Y., Y.H. and G.Z.; writing—original draft preparation, W.Y. and G.Z.; writing—review and editing, W.Y. and G.Z.; funding acquisition, H.Q., S.W., H.Y. and G.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This work was financially supported by the Shandong Provincial Technology Innovation Guidance Program (YDZX2024020) and the Shandong Provincial Major Science and Technology Innovation Project (2022CXGC020701).

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Conflicts of Interest

The authors declare that there are no conflicts of interest.

References

  1. Zhao, X.; Zheng, L.; Liu, D.; Song, K.; Lu, P.; Yang, Y.; Yang, L.; Li, X.; Li, Y.; Zhang, Y.; et al. Addition of Earthworms to Continuous Cropping Soil Inhibits the Fusarium Wilt in Watermelon: Evidence Under Both Field and Pot Conditions. Horticulturae 2025, 11, 1088.
  2. Liu, C.; Gong, L.; Yuan, J.; Li, Y. Current research status and development trends of key technologies for agricultural robots. Trans. Chin. Soc. Agric. Mach. 2022, 53, 1–22+55.
  3. Lü, Z.; Zhang, X.; Zhang, L. Design of an intelligent harvesting device for watermelon. Mech. Eng. Autom. 2024, 2, 72–73.
  4. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. arXiv 2015, arXiv:1506.01497.
  5. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. arXiv 2013, arXiv:1311.2524.
  6. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. arXiv 2017, arXiv:1703.06870.
  7. Tian, Y.; Ye, Q.; Doermann, D. YOLOv12: Attention-centric real-time object detectors. arXiv 2025, arXiv:2502.12524.
  8. Lei, M.; Li, S.; Wu, Y.; Hu, H.; Zhou, Y.; Zheng, X.; Ding, G.; Du, S.; Wu, Z.; Gao, Y. YOLOv13: Real-time object detection with hypergraph-enhanced adaptive visual perception. arXiv 2025, arXiv:2506.17733.
  9. Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J.; Ding, G. YOLOv10: Real-time end-to-end object detection. arXiv 2024, arXiv:2405.14458.
  10. Yan, J.; Zhao, Y.; Zhang, L.; Su, X.; Liu, H.; Zhang, F.; Fan, W.; He, L. Improved Faster-RCNN for recognizing Rosa roxburghii fruits in natural environment. Trans. Chin. Soc. Agric. Eng. 2019, 35, 143–150.
  11. Wang, W.; Xin, Z.; Che, Q.; Zhang, J. Defect detection method for winter jujubes based on improved Faster RCNN model. Trans. Chin. Soc. Agric. Eng. 2024, 40, 283–289.
  12. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. arXiv 2015, arXiv:1512.03385.
  13. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556.
  14. Lin, T.-Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. arXiv 2016, arXiv:1612.03144.
  15. Bodla, N.; Singh, B.; Chellappa, R.; Davis, L.S. Soft-NMS—Improving object detection with one line of code. arXiv 2017, arXiv:1704.04503.
  16. Yin, H.; Wang, B.; Jing, Y.; Li, J.; Wang, P.; Quan, G.; Sun, T. Watermelon counting method in UAV aerial videos based on improved YOLOv7. Trans. Chin. Soc. Agric. Eng. 2024, 40, 124–134.
  17. Han, K.; Wang, Y.; Tian, Q.; Guo, J.; Xu, C.; Xu, C. GhostNet: More features from cheap operations. arXiv 2019, arXiv:1911.11907.
  18. Yang, L.; Zhang, R.-Y.; Li, L.; Xie, X. SimAM: A simple, parameter-free attention module for convolutional neural networks. In Proceedings of the International Conference on Machine Learning (PMLR); MLR Press: New York, NY, USA, 2021; pp. 11863–11874.
  19. Ge, G.; Yang, J.; Liu, Y.; Hu, Y.; Liu, H. Detection of tomato targets in complex agricultural scenes based on an improved YOLOv8n model. Trans. Chin. Soc. Agric. Eng. 2025, 41, 143–153.
  20. Li, X.; Liu, Q.; Kuang, M.; Pan, J.; Liu, D.; Xiang, Y.; Wu, Y.; Xie, F. Pepper fruit detection method in a natural environment based on improved YOLOX. Trans. Chin. Soc. Agric. Eng. 2024, 40, 119–126.
  21. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. arXiv 2017, arXiv:1706.03762.
  22. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. arXiv 2020, arXiv:2005.12872.
  23. Zhao, Y.; Lv, W.; Xu, S.; Wei, J.; Wang, G.; Dang, Q. DETRs beat YOLOs on real-time object detection. arXiv 2023, arXiv:2304.08069.
  24. Wu, T.; Tang, S.; Zhang, R.; Cao, J.; Zhang, Y. CGNet: A light-weight context guided network for semantic segmentation. arXiv 2018, arXiv:1811.08201.
  25. Liu, J.; Li, S. Detection method for densely distributed kiwifruit flowers based on improved YOLOv8n. Trans. Chin. Soc. Agric. Eng. 2025, 41, 172–181.
  26. Li, J.; Yang, Z.; Zheng, Q.; Qiao, J.; Tu, J. Wheat ear detection and counting method based on RT-WEDT. Trans. Chin. Soc. Agric. Eng. 2024, 40, 146–156.
  27. Ma, S.; Xu, Y. MPDIoU: A Loss for Efficient and Accurate Bounding Box Regression. arXiv 2023, arXiv:2307.07662.
  28. Rezatofighi, H.; Tsoi, N.; Gwak, J.Y.; Sadeghian, A.; Reid, I.; Savarese, S. Generalized intersection over union: A metric and a loss for bounding box regression. arXiv 2019, arXiv:1902.09630.
  29. Madan, M.; Reich, C. Strengthening Small Object Detection in Adapted RT-DETR Through Robust Enhancements. Electronics 2025, 14, 3830.
  30. Li, X.; Shi, J.; Li, Y.; Wang, C.; Sun, W.; Zhuo, Z.; Yue, X.; Ni, J.; Tan, K. Blueberry Maturity Detection in Natural Orchard Environments Using an Improved YOLOv11n Network. Agriculture 2026, 16, 60.
  31. Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-CAM: Visual explanations from deep networks via gradient-based localization. arXiv 2016, arXiv:1610.02391.
  32. Zhao, S.; Fang, C.; Hua, T.; Jiang, Y. Detecting the Maturity of Red Strawberries Using Improved YOLOv8s Model. Agriculture 2025, 15, 2263.
  33. Wang, M.; Li, F. Real-Time Accurate Apple Detection Based on Improved YOLOv8n in Complex Natural Environments. Plants 2025, 14, 365.
  34. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.-Y.; Berg, A.C. SSD: Single shot multibox detector. arXiv 2015, arXiv:1512.02325.
  35. Zhang, H.; Xu, C.; Zhang, S. Inner-IoU: More effective intersection over union loss with auxiliary bounding box. arXiv 2023, arXiv:2311.02877.
  36. Zheng, Z.; Wang, P.; Ren, D.; Liu, W.; Ye, R.; Hu, Q. Enhancing geometric factors in model learning and inference for object detection and instance segmentation. arXiv 2020, arXiv:2005.03572.
  37. Wang, X.; Huang, Y.; Wei, S.; Xu, W.; Zhu, X.; Mu, J.; Chen, X. ELD-YOLO: A Lightweight Framework for Detecting Occluded Mandarin Fruits in Plant Research. Plants 2025, 14, 1729.
  38. Liu, J.; Wang, C.; Xing, J. YOLOv5-ACS: Improved Model for Apple Detection and Positioning in Apple Forests in Complex Scenes. Forests 2023, 14, 2304.
  39. Yang, L.; Zhang, T.; Zhou, S.; Guo, J. AAB-YOLO: An Improved YOLOv11 Network for Apple Detection in Natural Environments. Agriculture 2025, 15, 836.
  40. Teng, H.; Sun, F.; Wu, H.; Lv, D.; Lv, Q.; Feng, F.; Yang, S.; Li, X. DS-YOLO: A Lightweight Strawberry Fruit Detection Algorithm. Agronomy 2025, 15, 2226.
  41. Liu, Y.; Han, X.; Zhang, H.; Liu, S.; Ma, W.; Yan, Y.; Sun, L.; Jing, L.; Wang, Y.; Wang, J. YOLOv8-MSP-PD: A Lightweight YOLOv8-Based Detection Method for Jinxiu Malus Fruit in Field Conditions. Agronomy 2025, 15, 1581.
Figure 1. Example images from the trellised watermelon dataset: (a) normal light; (b) occlusion; (c) single target; (d) low light; (e) no occlusion; and (f) multiple targets.
Figure 2. Examples of data augmentation effects for the trellised watermelon dataset: (a) original image; (b) random contrast; (c) noise; (d) horizontal flip; (e) random brightness; and (f) vertical flip.
Figure 3. Structure diagram of the improved RT-DETR model: ConvNormLayer is Convolution + Batch Normalization + ReLU activation function; MaxPool2d is the 2D max pooling operation; CG is the context-guided module; Add is element-wise addition; RepC3 is a reparameterized convolution module; SSFF is the scale sequence feature fusion module; TFE is the triple feature encoder module; Upsample is an upsampling module; AIFI is a feature interaction module; Concat represents the concatenation operation; the number in the upper-right corner of a module indicates the network layer the module belongs to, while the number in the lower-right corner shows which layers the input feature maps received by the module come from.
Figure 4. Structural diagram of the context-guided module: GAP represents global average pooling; FC represents the fully connected layer; BN represents batch normalization; PReLU represents the PReLU activation function.
Figure 5. Structural diagram of the triple feature encoder: MaxPool refers to max pooling; AvgPool represents average pooling; and Nearest represents nearest neighbor interpolation.
Figure 6. Structural diagram of the scale sequence feature fusion module: Nearest refers to the nearest neighbor interpolation method; Unsqueeze means expanding dimensions; Stack refers to the stacking operation; 3D Conv refers to 3D convolution; 3D BN represents 3D batch normalization; LeakyReLU refers to the LeakyReLU activation function; MaxPool3d refers to 3D max pooling; and Squeeze means reducing dimensions.
Figure 7. Schematic diagram of the MPDIoU loss function: W represents the image width; H represents the image height. The red box indicates the predicted box, and the yellow box indicates the ground truth box. $d_1$ represents the distance between the top-left corners of the ground truth box and the predicted box, and $d_2$ represents the distance between the bottom-right corners of the ground truth box and the predicted box. $(x_1^{gt}, y_1^{gt})$ represents the coordinates of the top-left corner of the ground truth box, $(x_2^{gt}, y_2^{gt})$ represents the coordinates of the bottom-right corner of the ground truth box, $(x_1^{prd}, y_1^{prd})$ represents the coordinates of the top-left corner of the predicted box, and $(x_2^{prd}, y_2^{prd})$ represents the coordinates of the bottom-right corner of the predicted box.
Figure 8. Performance stability across five independent training runs.
Figure 9. Comparison of loss functions.
Figure 10. mAP@0.5 curves.
Figure 11. PR curve comparisons.
Figure 11. PR curve comparisons.
Horticulturae 12 00333 g011
Figure 12. Heatmaps for selected trellised watermelon images: (a) normal light; (b) low light; (c) no occlusion; (d) occlusion; (e) single target; and (f) multiple targets.
Figure 13. Comparison of detection results before and after model improvement: (a) single target; (b) multiple targets; (c) occlusion; (d) no occlusion; (e) normal light; and (f) low light. The blue circles in the image indicate misdetected objects, while the yellow circles indicate missed objects.
Table 1. Results of ablation experiment.
| SSFF + TFE / P2 / Channel Number Adjustment / CG Module / MPDIoU | Precision (%) | Recall (%) | mAP@0.5 (%) | mAP@[0.5:0.95] (%) | GFLOPs | Params (M) | Model Size (MB) |
| ××××× | 92.8 | 86.4 | 92.9 | 70.4 | 56.9 | 19.8 | 40.4 |
| ×××× | 94.4 | 85.7 | 93 | 74.8 | 61.4 | 20.1 | 41 |
| ×××× | 93.8 | 86.5 | 93 | 74.2 | 78.1 | 18.5 | 38.1 |
| ×××× | 92.3 | 88 | 92.8 | 73.5 | 40.5 | 14.6 | 29.6 |
| ×××× | 94.9 | 86.9 | 93.8 | 74.1 | 47.6 | 16.5 | 33.8 |
| ×××× | 93.7 | 86.5 | 93.1 | 73.8 | 56.9 | 19.8 | 40.4 |
| ××× | 94.2 | 85.6 | 92.7 | 73.6 | 108.3 | 15.3 | 31.6 |
| ×× | 93.2 | 86.8 | 92.9 | 73.3 | 57.6 | 14.9 | 30.4 |
| × | 93.4 | 87.2 | 93.5 | 73.1 | 43.5 | 9.2 | 18.9 |
|  | 93.2 | 88.2 | 93.9 | 73.8 | 43.5 | 9.2 | 18.9 |
× indicates absence of the corresponding improvement, while √ indicates its use.
Table 2. Summary statistics of key metrics.
| Criteria | Mean ± SD (%) | 95% CI |
| Precision | 92.84 ± 0.51 | [92.20, 93.48] |
| Recall | 88.14 ± 0.43 | [87.60, 88.68] |
| mAP@0.5 | 93.88 ± 0.15 | [93.70, 94.06] |
| mAP@[0.5:0.95] | 73.76 ± 0.52 | [73.12, 74.40] |
Table 3. Comparative experiments.
| Models | Precision (%) | Recall (%) | mAP@0.5 (%) | mAP@[0.5:0.95] (%) | GFLOPs | Params (M) | Model Size (MB) | FPS |
| YOLOv8s | 91.6 | 85.5 | 93.3 | 73.9 | 28.4 | 11.1 | 21.4 | 66.0 |
| YOLOv8n | 91.8 | 83.8 | 93.55 | 74.03 | 8.1 | 3.0 | 6.0 | 63.0 |
| SSD | 91 | 79.1 | 89 | 67.8 | 30.4 | 23.7 | 94.9 | 45.4 |
| Faster-RCNN | 93.4 | 77 | 88.2 | 67 | 37.5 | 28.3 | 107.8 | 16.66 |
| RT-DETR | 92.8 | 86.4 | 92.9 | 70.4 | 56.9 | 19.8 | 40.4 | 24.8 |
| RT-DETR-Watermelon | 93.2 | 88.2 | 93.9 | 73.8 | 43.5 | 9.2 | 18.9 | 21.2 |
The RT-DETR and RT-DETR-Watermelon rows are identical to the baseline and final configurations in Table 1.
Table 4. Comparison of RT-DETR-Watermelon with different scaling factors.
| Models | Depth | Width | Precision (%) | Recall (%) | mAP@0.5 (%) | GFLOPs | Params (M) | Model Size (MB) | FPS |
| S | 0.67 | 0.75 | 92.4 | 84.8 | 92.3 | 32.4 | 8.2 | 15.7 | 24.6 |
| ours | 1.0 | 1.0 | 93.2 | 88.2 | 93.9 | 43.5 | 9.2 | 18.9 | 21.2 |
| L | 1.33 | 1.25 | 93.4 | 86.2 | 93.3 | 61.0 | 10.6 | 21.5 | 17.4 |
Table 5. Small-object comparative experiments.
| Model | Precision (%) | Recall (%) | mAP@0.5 (%) | mAP@[0.5:0.95] (%) |
| RT-DETR | 88.7 | 83 | 90.1 | 60.9 |
| RT-DETR-Watermelon | 87.8 | 84.5 | 90.8 | 63.9 |
Table 6. Occlusion subset comparative experiments.
| Model | Precision (%) | Recall (%) | mAP@0.5 (%) | mAP@[0.5:0.95] (%) |
| RT-DETR | 93.6 | 88.2 | 94.2 | 69.4 |
| RT-DETR-Watermelon | 93.6 | 90.4 | 94.7 | 74 |
Table 7. Low-light subset comparative experiments.
| Model | Precision (%) | Recall (%) | mAP@0.5 (%) | mAP@[0.5:0.95] (%) |
| RT-DETR | 92.7 | 86.9 | 93 | 67.9 |
| RT-DETR-Watermelon | 93.6 | 88.2 | 93.1 | 74.4 |
Table 8. Comparative analysis of loss functions.
| Loss Function | Precision (%) | Recall (%) | mAP@0.5 (%) | mAP@[0.5:0.95] (%) |
| GIoU | 92.8 | 86.4 | 92.9 | 70.4 |
| Inner IoU | 91.8 | 85.9 | 93 | 74.1 |
| CIoU | 92.9 | 84.7 | 92.6 | 73.8 |
| MPDIoU | 93.7 | 86.5 | 93.1 | 73.8 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Yan, W.; Qu, H.; Wang, S.; Yang, H.; Hao, Y.; Zhang, G. Target Detection of Trellised Watermelons in Complex Agricultural Scenes Based on Improved RT-DETR. Horticulturae 2026, 12, 333. https://doi.org/10.3390/horticulturae12030333

Chicago/Turabian Style

Yan, Weichen, Huixing Qu, Shaowei Wang, Huawei Yang, Yongbing Hao, and Guohai Zhang. 2026. "Target Detection of Trellised Watermelons in Complex Agricultural Scenes Based on Improved RT-DETR" Horticulturae 12, no. 3: 333. https://doi.org/10.3390/horticulturae12030333
