1. Introduction
Pomegranates are a nutrient-rich fruit, abundant in vitamins, minerals, and antioxidants, and are known for their health benefits, including anti-cancer, antioxidant, and immune-boosting effects [1,2]. They are widely cultivated in temperate and tropical regions around the world, particularly in Asia, Europe, and North Africa. With growing recognition of their health value, commercial cultivation of pomegranates has expanded annually [3,4]. Detecting pomegranate fruits presents multiple challenges, especially for deep learning models [5]. First, complex orchard environments, with direct sunlight and canopy shadows, produce inconsistent image quality, affecting the diversity and accuracy of training datasets. Second, the color similarity between fruits and surrounding backgrounds, such as soil, leaves, and branches, complicates accurate fruit identification, particularly in densely planted orchards [6,7]. Third, pomegranates undergo significant changes in shape and size as they mature, requiring models to adapt to different growth stages to ensure accurate recognition. Finally, sensor data are often affected by environmental factors such as dust or weather changes, further degrading detection performance [8,9]. Therefore, optimizing data acquisition, image preprocessing, and model robustness is critical for improving the accuracy and efficiency of pomegranate detection.
With the continuous development of deep learning, object detection has achieved remarkable progress in agricultural applications [10]. Fruit detection has become a crucial component of precision agriculture, especially in automated and intelligent farming systems [11]. Single-stage object detection methods have gained popularity due to their simple structure and fast inference speed, making them suitable for real-time fruit monitoring systems [12]. These methods directly regress the bounding box and class label, offering clear advantages in scenarios with limited computational resources. Wang et al. [13] developed PG-YOLO, which integrates a ShuffleNetv2 backbone and a multi-head self-attention mechanism to enhance the detection of small pomegranate fruits while reducing model complexity, providing an efficient solution for real-time pomegranate detection. Deng et al. [14] designed SE-YOLO, a lightweight tomato detection framework optimized for agricultural environments, employing explicit edge awareness and innovative network architectures to improve the detection of occluded fruits. Song et al. [15] proposed LBSR-YOLO for blueberry monitoring, integrating BSRN with YOLOv10n and incorporating LKWSConv, PConv, FPConv, and ODConv to enhance efficiency and accuracy in low-resolution images. Li et al. [16] developed a lightweight GP-DETR-based green pepper detection model, optimizing feature extraction, multi-scale detection, and near-color discrimination in complex backgrounds. You et al. [17] presented VBP-YOLO-prune, an optimized YOLOv8n model integrating V7 downsampling, BiFPN feature fusion, and an improved PIOUv2 loss function, achieving high real-time detection accuracy and deployment efficiency in complex orchards.
Traditional two-stage object detection methods divide the detection process into two steps: first generating candidate regions, then performing classification and regression [18]. Tian et al. [19] proposed an improved Faster R-CNN model, integrating parallel convolutional neural networks, a Feature Pyramid Network (FPN), and Progressive Non-Maximum Suppression (Progressive-NMS) to optimize oat spike detection and counting, achieving a 13.01% increase in mean average precision (mAP) and providing a reference for oat yield prediction. Anthony [20] designed an efficient Mask R-CNN model for instance segmentation of strawberry fruits to enhance the efficiency of automated harvesting systems; by leveraging Detectron2 and the NVIDIA TAO Toolkit for training and optimizing the model with NVIDIA TensorRT, the optimized model achieved 83.17 mAP, 25.46 FPS, and a compact size of 48.2 MB, suitable for real-time applications. Li et al. [21] proposed an improved Strawberry R-CNN model using a multi-stage network, RoIAlign, and bilinear interpolation to enhance strawberry recognition accuracy. Experimental results showed counting accuracies of 99.1% and 73.7% for mature and immature strawberries, respectively, with an average precision of 0.8733, demonstrating its suitability for automated strawberry monitoring and harvesting. While these methods achieve high detection accuracy, their computational complexity and slower inference speed demand substantial resources, limiting their application in real-time scenarios.
In fruit detection, many studies have proposed lightweight single-stage models to improve real-time performance and reduce computational demands. However, their effectiveness in real-world agricultural environments is often underexplored. Challenges such as varying illumination and dense fruit clusters can limit the performance of lightweight models, especially for pomegranate detection. In this study, we design a lightweight single-stage detection model and conduct thorough real-world validation. The model architecture is optimized to reduce computational load while maintaining high accuracy. Field tests confirm its efficiency and stability on edge devices. These experiments demonstrate that the model can perform real-time pomegranate detection effectively and operate on resource-constrained hardware, highlighting its practical applicability.
This validated lightweight design offers a practical solution for smart agriculture, especially in pomegranate detection and intelligent harvesting systems. Our findings demonstrate the real-world applicability of lightweight detection models, advancing intelligent agricultural technologies and their use in automated fruit detection and harvesting. The key contributions of this study include the successful deployment of the FSA-DETR-P model on the NVIDIA Jetson Orin Nano, enabling efficient real-time detection with limited computational resources and providing a viable solution for smart orchard automation.
3. Results and Discussion
3.1. Experimental Environment and Configuration
To ensure fair comparison and expedite model training, all experiments were conducted on an NVIDIA GeForce RTX 4090 GPU with 24 GB of memory, paired with a 12th Gen Intel Core i5-12400F CPU running at 4.00 GHz. The software environment included CUDA 11.8, Python 3.8.18, and PyTorch 2.1.2, all integrated within the Ultralytics framework, which was used to implement the improved RT-DETR architecture and manage the overall training pipeline. To prevent overfitting, an early stopping mechanism was applied with the patience parameter set to 50 epochs. The hardware configuration used for both training and testing is detailed in Table 1, and the hyperparameters employed during training are presented in Table 2.
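For reproducibility, the training entry point under this setup can be sketched as follows. This is a minimal illustration using the Ultralytics API; the model and dataset configuration filenames are hypothetical placeholders, and only the settings stated in this section (100 training epochs, early-stopping patience of 50) are taken from the paper.

```python
# Minimal training sketch with the Ultralytics framework.
# "rtdetr-r18-fsa.yaml" and "pomegranate.yaml" are hypothetical
# placeholders for the improved-model and dataset configs.
from ultralytics import RTDETR

model = RTDETR("rtdetr-r18-fsa.yaml")
model.train(
    data="pomegranate.yaml",
    epochs=100,    # training duration used in Section 3.3
    patience=50,   # early-stopping patience described above
    device=0,      # single RTX 4090
)
```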
In this study, we used several evaluation metrics to assess the performance of the proposed model: mAP50, mAP50–95, Precision, Recall, the number of parameters, and GFLOPs. Specifically, mAP (mean average precision) evaluates detection performance by averaging precision over categories at given Intersection over Union (IoU) thresholds, with mAP50 (IoU = 0.50) and mAP50–95 (averaged over IoU thresholds from 0.50 to 0.95) computed separately. Formula (1) calculates mAP, where N represents the number of categories and AP(i) denotes the average precision for category i. Precision is the ratio of correctly predicted positive samples to all samples predicted as positive, calculated using Formula (2), where TP refers to True Positives and FP to False Positives. Recall is the ratio of correctly predicted positive samples to all actual positive samples, calculated using Formula (3), with FN denoting False Negatives. The number of parameters refers to the total trainable parameters in the model, typically measured in millions (M). GFLOPs (giga floating-point operations) quantifies computational complexity during inference, calculated with Formula (4), where Total FLOPs represents the number of floating-point operations required for a single forward pass. Together, these metrics provide a comprehensive assessment of the model's accuracy and efficiency.
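For reference, Formulas (1)–(4) follow directly from the definitions above; the LaTeX below is a reconstruction from the prose rather than a copy of the typeset equations.

```latex
\mathrm{mAP} = \frac{1}{N}\sum_{i=1}^{N}\mathrm{AP}(i) \qquad (1)

\mathrm{Precision} = \frac{TP}{TP+FP} \qquad (2)

\mathrm{Recall} = \frac{TP}{TP+FN} \qquad (3)

\mathrm{GFLOPs} = \frac{\text{Total FLOPs}}{10^{9}} \qquad (4)
```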
3.2. Pruning Experiment
To verify the performance of the proposed model under structural compression, comparative experiments with different pruning strategies were conducted on the improved RT-DETR R18-based model. Table 3 lists the detection performance, number of parameters, and computational complexity (GFLOPs) of four pruning methods (L1, Group Norm, Random, and Lamp) [32,33,34]. As the experimental results show, the Lamp pruning method achieves the best overall performance: its Precision (P) and Recall (R) reach 0.851 and 0.878, respectively, while mAP50 and mAP50–95 improve to 0.928 and 0.632, all higher than those of the other pruning strategies. Meanwhile, the Lamp method effectively reduces model complexity while maintaining high detection accuracy: its parameter count falls to 13.73 M and its computational complexity to 34.6 GFLOPs, a better performance-efficiency balance than the unpruned model or the other strategies. In contrast, the L1, Group Norm, and Random pruning methods all suffer varying degrees of accuracy degradation after compression, especially a significant drop in mAP50–95, indicating deficiencies in retaining feature expression capability. Overall, the Lamp pruning strategy significantly reduces the model's parameter count and computational cost while preserving detection accuracy, proving its superiority in lightweight deployment scenarios.
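To make the Lamp criterion concrete: LAMP (layer-adaptive magnitude-based pruning) scores each weight by its squared magnitude normalized by the cumulative squared magnitude of all weights in the same layer that are at least as large, so no manual per-layer sparsity ratios are needed. The sketch below is our reconstruction of the published scoring rule, not the exact pruning pipeline used in this study.

```python
import torch

def lamp_scores(weight: torch.Tensor) -> torch.Tensor:
    """LAMP score for each weight in one layer:
    score(u) = w_u^2 / sum of w_v^2 over all v with |w_v| >= |w_u|.
    Low-scoring weights are pruned first across all layers."""
    w2 = weight.flatten() ** 2
    sorted_w2, order = torch.sort(w2, descending=True)
    denom = torch.cumsum(sorted_w2, dim=0)   # running sum over larger weights
    scores = torch.empty_like(w2)
    scores[order] = sorted_w2 / denom        # scatter back to original positions
    return scores.view_as(weight)

# Global pruning then removes the lowest-scoring weights/channels across
# layers until the target compression ratio is reached.
```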
Figure 9 shows a comparison of the number of channels in each layer of the improved RT-DETR R18-based model before pruning (base) and after pruning (prune). It can be observed that the pruning operation has a significant non-uniform impact on different layers of the model: the number of channels in some convolutional layers is drastically reduced, while that in some key feature extraction layers changes slightly. This hierarchical channel retention reflects the adaptive selection mechanism of the pruning algorithm in structural optimization—prioritizing the preservation of channels that contribute significantly to feature expression while removing redundant or low-importance feature channels.
From the overall trend, the number of channels in the pruned model is significantly reduced in most layers, especially in the middle and high-level convolutional modules and the decoding head, where the channel reduction is relatively concentrated. This indicates that the pruning process effectively reduces redundant computations in the network and achieves a significant structural compression effect. Meanwhile, some feature fusion layers (such as cross-layer connections and decoding layers) still retain a relatively large number of channels, demonstrating that the model maintains the ability to transmit key semantic information while achieving structural compression, which is conducive to preserving detection accuracy.
Combined with the experimental results, the pruned model still maintains high detection accuracy (mAP50 = 0.928, mAP50–95 = 0.632) while its parameter count and computational complexity fall to 13.7 M and 34.6 GFLOPs, respectively. This verifies the effectiveness of the pruning strategy in balancing model lightweighting and performance. Overall, the pruning scheme substantially reduces model complexity while maintaining stable detection performance, laying a solid foundation for deploying the model on edge devices.
3.3. Model Training and Comparison
Figure 10 presents the convergence process of the model. Both training and validation losses decrease consistently and stabilize after 60 epochs. The close agreement between the training and validation curves indicates good generalization and confirms that the hyperparameter settings are appropriate for the task. No gradient explosion or overfitting was observed over the 100-epoch training run.
To validate the effectiveness of the proposed improvements, we compared the confusion matrices of the baseline model and the proposed method (Figure 11). The results demonstrate a significant enhancement in the model's ability to capture targets in complex environments. Specifically, the number of True Positives (TP) increased from 4772 to 5439, indicating stronger feature extraction capability. More importantly, the proposed method achieved a decisive reduction in False Negatives (FN), dropping from 775 in the baseline to only 108, an 86.1% reduction in the miss rate, showing that the improved model effectively resolves missed detections caused by occlusion or small target size. We note that the number of False Positives (FP) increased from 1396 to 2071. This shift reflects a deliberate trade-off: the model was designed with higher sensitivity so that obscured or ambiguous targets are not overlooked. In agricultural applications, prioritizing Recall (minimizing False Negatives) is critical for accurate yield estimation and harvesting, making a moderate increase in False Positives an acceptable and robust strategy.
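These counts can be checked against Formulas (2) and (3) directly. A small worked example follows (note that confusion-matrix counts are taken at a fixed confidence threshold, so the derived values need not coincide with the curve-based metrics reported in the tables):

```python
# Worked check of the confusion-matrix analysis above.
baseline = {"TP": 4772, "FP": 1396, "FN": 775}
improved = {"TP": 5439, "FP": 2071, "FN": 108}

fn_drop = (baseline["FN"] - improved["FN"]) / baseline["FN"]
print(f"miss-rate reduction: {fn_drop:.1%}")   # -> 86.1%

for name, m in (("baseline", baseline), ("improved", improved)):
    precision = m["TP"] / (m["TP"] + m["FP"])  # Formula (2)
    recall = m["TP"] / (m["TP"] + m["FN"])     # Formula (3)
    print(f"{name}: precision={precision:.3f}, recall={recall:.3f}")
```

On these counts, recall rises from about 0.860 to 0.981 while precision falls from about 0.774 to 0.724, quantifying the recall-first trade-off discussed above.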
3.4. Ablation Experiment
To verify the impact of each improved module on pomegranate detection performance, this study takes RT-DETR R18 as the base model and gradually introduces the FasterNet feature enhancement module, the ASF (Adaptive Spatial Fusion) module, and the Slimneck lightweight neck structure. The final model is combined with a pruning strategy for performance evaluation; the experimental results are shown in Table 4.
Starting from the base model RT-DETR R18, mAP50 and mAP50–95 are 0.881 and 0.585, respectively. Introducing the FasterNet module significantly improves detection accuracy (mAP50 rises to 0.905), mainly owing to the multi-scale feature enhancement this module brings to the feature extraction stage, enabling the model to better capture key information about pomegranates of different sizes and under occlusion. Adding the ASF (Adaptive Spatial Fusion) module on this basis further improves performance (mAP50–95 rises from 0.589 to 0.620). The ASF module fuses multi-scale feature maps through adaptive weights, effectively enhancing the expressiveness of spatial features and making the model more accurate in recognizing pomegranate contours and surface textures against complex backgrounds. However, the ASF module also increases the parameter count and computational complexity (to 20.15 M and 61.4 GFLOPs). In contrast, the Slimneck structure, through lightweight feature aggregation and information compression, effectively reduces model complexity while maintaining detection performance (the parameter count falls to 19.31 M and GFLOPs to 53.6). Its ability to hold a high mAP50 (0.904) while reducing computational overhead indicates excellent feature transmission efficiency. Furthermore, combining Slimneck with the ASF module (Slimneck + ASF) raises mAP50–95 to 0.615, demonstrating a complementary effect between the two in feature fusion and information compression: the ASF module strengthens spatial information while Slimneck optimizes the information flow channels, jointly improving detection accuracy. With the full FasterNet + Slimneck + ASF combination, the model maintains low complexity (50.1 GFLOPs) while mAP50 rises to 0.921 and mAP50–95 reaches 0.625, indicating that this combination balances performance and efficiency in feature extraction and fusion.
Finally, applying the pruning strategy on this basis reduces the model's parameter count to 13.73 M and its computational complexity to 34.6 GFLOPs, while mAP50 and mAP50–95 rise further to 0.928 and 0.632, respectively. This shows that pruning removes redundant channels while retaining high-contribution features, effectively improving the model's inference efficiency and generalization ability. Section 3.2 presents the experimental analysis of the different pruning methods that led us to select the Lamp method.
Figure 12 presents the mAP50 performance curve for the RT-DETR R18 model with various combinations of improved modules in pomegranate detection tasks. The curve shows a steady improvement in detection accuracy as additional modules are integrated. Notably, after adding the FasterNet module, the mAP50 increases significantly, indicating enhanced multi-scale feature extraction. With the introduction of the ASF module, mAP50–95 further improves, demonstrating that the adaptive spatial fusion mechanism helps better integrate spatial feature information across levels. The Slimneck structure maintains high detection performance while achieving a lightweight design, proving its efficiency in feature aggregation and information transmission. When all three modules are combined, the model reaches its highest performance. After applying pruning, both the number of parameters and computational complexity are significantly reduced, yet the accuracy improves, suggesting that pruning removes redundant channels and boosts the model’s feature representation. Overall, the curve demonstrates the synergistic benefits of the proposed strategies in improving detection performance while ensuring model lightweighting.
3.5. Model Comparison Experiment
To evaluate the overall performance of the proposed improved model in the pomegranate detection task, we conducted comparative experiments with several mainstream object detection models, including different versions of the YOLO series (YOLOv5m, YOLOv8m, YOLOv11m, YOLOv12m) and the RT-DETR series (RT-DETR-R18, RT-DETR-R34, RT-DETR-L). The experimental results, presented in Table 5, show that the proposed model (Ours) outperforms the others in both detection accuracy and model complexity. Specifically, our model achieves higher detection accuracy while maintaining lower computational complexity, making it more suitable for resource-constrained environments without sacrificing performance.

From the perspective of detection performance, the YOLO series models have certain advantages in lightweight design and real-time performance, but their detection accuracy in complex backgrounds remains insufficient. Among them, YOLOv12m reaches an mAP50–95 of 0.534, slightly higher than YOLOv8m, yet still below the RT-DETR series. In contrast, the RT-DETR models, with their Transformer-based end-to-end detection structure, have stronger feature expression capability: RT-DETR-R18 reaches an mAP50–95 of 0.585, and the R34 and L versions reach 0.563 and 0.577, respectively, all exceeding the YOLO series in global modeling capability and spatial information fusion. However, although RT-DETR-R18 is superior to the YOLO series in accuracy, its parameter count and computational complexity are relatively high (19.87 M parameters and 56.9 GFLOPs), which limits its application on resource-constrained devices. By comparison, after introducing the FasterNet feature enhancement module, the ASF adaptive spatial fusion module, and the Slimneck lightweight structure, combined with pruning optimization, the improved model proposed in this paper (Ours) achieves the best detection performance: precision (P) and recall (R) reach 0.851 and 0.878, respectively, and mAP50 and mAP50–95 improve to 0.928 and 0.632. At the same time, the parameter count and computational complexity fall to 13.73 M and 34.6 GFLOPs, about 30.9% fewer parameters and 39.2% less computation than RT-DETR-R18 (Figure 13).
In summary, the proposed model achieves significant lightweighting while ensuring high-precision pomegranate detection, balancing detection performance and inference efficiency and demonstrating strong overall performance and practical potential.
Figure 13 presents a bar-chart comparison of mAP, giving a more intuitive view of the gaps in this core metric among the various models.
4. Discussion
This study focuses on pomegranate detection in complex orchard environments. Experimental results confirm that the improved model, integrated with FasterNet, ASF, Slimneck modules, and Lamp pruning, outperforms mainstream models (YOLOv5, YOLOv8, RT-DETR) in mAP50/mAP50–95 with fewer parameters and GFLOPs. It also maintains stable real-time performance (24.6 FPS) on the NVIDIA Jetson Orin Nano, addressing resource and latency constraints in agricultural automation. This section further discusses the model’s structural advantages, limitations, and future improvements.
In addition, the introduction of the pruning strategy further optimizes the network structure. Compared with traditional pruning methods such as L1, Group Norm, or random pruning, Lamp pruning is more targeted in channel selection. It can effectively remove redundant parts while retaining high-contribution feature channels, thereby significantly reducing the model’s parameter count and computational load with almost no loss in accuracy. This result indicates that after structural optimization, the model has stronger generalization ability and inference efficiency, making it particularly suitable for resource-constrained environments such as mobile devices and edge computing.
Based on a comprehensive comparison of the experimental results, the proposed model outperforms the YOLOv5, YOLOv8, and RT-DETR series models in both mAP50 and mAP50–95, with a significant reduction in parameter count and GFLOPs. This demonstrates that this study has achieved efficient model compression and structural optimization while ensuring detection accuracy, verifying the feasibility and effectiveness of the proposed improvement strategies. Structurally, the proposed model occupies a strategic position between the ultra-lightweight YOLO-n/s variants (which prioritize speed but often compromise accuracy) and the heavier YOLO-m variants. While the parameter count of FSA-DETR-P (13.73 M) is significantly lower than that of the YOLO-m series (e.g., YOLOv8m at 25.84 M), it delivers detection performance exceeding these heavier models, indicating that our model effectively bridges the gap between lightness and precision and offers a superior trade-off that maintains high accuracy at reduced complexity compared with the 'm' variants. Figure 14 presents the visualization results of pomegranate detection in complex natural environments for the YOLO series (YOLOv5m, YOLOv8m, YOLOv11m, YOLOv12m), the RT-DETR series (RT-DETR-R18, RT-DETR-R34, RT-DETR-L), and the proposed model (Ours). Overall, there are significant differences in the detection performance of the different models under occlusion, illumination changes, and background interference.
As shown in the figure, the YOLO series models often struggle with inaccurate detection box positioning or missed detections in certain scenarios. This is particularly evident in areas with leaf occlusions or similar background colors, where the models tend to blur target boundaries or miss smaller targets. On the other hand, the RT-DETR series models, which leverage the Transformer structure for global feature modeling, perform better in identifying partially occluded pomegranates. However, they still suffer from occasional false detections in complex backgrounds. In contrast, the improved model proposed in this study (Ours) demonstrates superior accuracy in detecting pomegranate targets. It achieves precise detection box placement, clear target boundaries, and almost no missed or false detections. This exceptional performance can be attributed to several enhancements: the FasterNet module’s ability to extract multi-scale features, the ASF module’s adaptive spatial fusion mechanism, and the Slimneck structure’s efficient feature aggregation. These improvements allow the model to distinguish pomegranates from complex backgrounds in vegetated environments more effectively. Additionally, the pruning strategy boosts the model’s inference efficiency, ensuring high-precision detection even under lightweight conditions.
In summary, Figure 14 intuitively verifies that the improved model proposed in this paper has stronger feature expression capability and robustness in pomegranate detection tasks and can achieve stable, accurate detection under complex illumination, occlusion, and background interference.
To assess the robustness and generalization capability of the FSA-DETR-P model, images were randomly sampled from the test set and subjected to a range of perturbations, including noise injection, desaturation, sharpening, and saturation enhancement. The detection results, shown in Figure 15, demonstrate that the model maintains strong performance under these altered conditions, validating its robustness and generalization ability. The model consistently identifies pomegranate targets accurately even under various image distortions, indicating its adaptability to diverse real-world environments.
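Perturbations of this kind can be generated in a few lines; the sketch below uses torchvision, and the strength parameters are illustrative assumptions since the exact values behind Figure 15 are not specified.

```python
import torch
from torchvision.transforms import functional as F

def perturb(img):
    """Apply the four test-time perturbations described above to a
    float image tensor (C, H, W) in [0, 1]. Factors are illustrative,
    not those used to produce Figure 15."""
    return {
        "noise": (img + 0.05 * torch.randn_like(img)).clamp(0.0, 1.0),
        "desaturated": F.adjust_saturation(img, 0.3),  # toward grayscale
        "sharpened": F.adjust_sharpness(img, 2.0),
        "saturated": F.adjust_saturation(img, 1.8),
    }
```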
The improved RT-DETR-R18 model, deployed on the NVIDIA Jetson Orin Nano, demonstrates strong adaptability and efficiency on edge devices. The Jetson Orin Nano was selected as the deployment platform because it offers an optimal balance between AI computing performance (up to 40 TOPS), power efficiency (7–15 W), and cost; this configuration represents a typical hardware setup for mobile agricultural robots, where maintaining high inference speed under limited battery and thermal constraints is critical. The model requires minimal computational resources while achieving high-precision detection of pomegranate fruits in complex orchard environments [35,36,37]. Despite challenges such as dense fruit distribution and varying lighting conditions, it maintains stable performance. With its lightweight design and optimized detection algorithm, the model processes large volumes of image data in real time, achieving a frame rate of 24.6 FPS. This makes it highly suitable for practical applications in agricultural automation, meeting the demands for both efficiency and accuracy. Applied in pomegranate orchards, the model is expected not only to improve harvesting efficiency but also to provide technical support for other intelligent orchard tasks, contributing to the advancement of smart agriculture. See Figure 16 for details.
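As one illustration of this deployment path: a common way to run Ultralytics models in real time on Jetson-class devices is to export a TensorRT engine with FP16 precision. The checkpoint and image names below are hypothetical placeholders, and the paper does not state its exact export settings, so this is a sketch rather than the authors' pipeline.

```python
from ultralytics import RTDETR

# Hypothetical checkpoint name for the pruned model.
model = RTDETR("fsa-detr-p.pt")

# Export a TensorRT engine with FP16 precision, a common choice for
# maximizing throughput on Jetson-class hardware.
model.export(format="engine", half=True)

# The exported engine can then be loaded for real-time inference.
trt_model = RTDETR("fsa-detr-p.engine")
results = trt_model.predict("orchard_frame.jpg")  # hypothetical test image
```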
Although the proposed model achieves excellent performance in pomegranate detection and lightweight deployment, it has certain limitations. First, its performance degrades under extreme weather (heavy rain, fog), as the feature extraction modules are less robust to image contrast changes and noise. Second, it lacks adaptability to pomegranates in different growth stages (e.g., green young fruits), due to insufficient training samples covering diverse growth stages. Third, real-time performance drops slightly in dense target scenarios.
Future work will address these limitations: (1) Enhance robustness to extreme weather by introducing image preprocessing modules and expanding datasets with simulated extreme weather samples. (2) Collect multi-growth stage pomegranate data and optimize the network to improve cross-stage detection ability. (3) Optimize the ASF module to enhance real-time performance in dense target scenarios.
5. Conclusions
To tackle the challenges of insufficient detection accuracy and high model complexity in traditional object detection models for pomegranate detection, this paper presents an improved lightweight detection model based on RT-DETR R18. By incorporating the FasterNet feature enhancement module, the ASF adaptive spatial fusion module, the Slimneck lightweight structure, and the Lamp pruning strategy, the model achieves notable advances in both detection performance and computational efficiency. Experimental results show that the proposed model achieves mAP50 and mAP50–95 scores of 0.928 and 0.632, respectively, a 4.7-percentage-point improvement in both metrics over the original RT-DETR R18 model. Additionally, the parameter count and computational complexity are reduced by approximately 30.9% and 39.2%, respectively. These results demonstrate that the proposed enhancements successfully reduce the model's size and complexity while maintaining high detection accuracy, underscoring its practical value and potential for broader deployment in real-world pomegranate detection tasks.
However, this study still has certain limitations. Firstly, the model is mainly trained and validated in pomegranate orchard scenes under natural lighting conditions, and its adaptability to extreme lighting, severe occlusion, or densely overlapping fruit scenarios needs further verification. Secondly, although the pruned model achieves a good balance between accuracy and efficiency, there may still be a risk of performance degradation when applied to larger-scale datasets or detection tasks for different fruit types. In addition, although the model has been deployed on the NVIDIA Jetson Orin Nano, its stability and latency under long-term, sustained field operation require further evaluation.
Future research will proceed in the following directions: firstly, further optimize the fusion mechanism between the Transformer structure and convolutional features and introduce a dynamic feature selection module to enhance the model's adaptability to complex scenes; secondly, combine knowledge distillation with multi-modal information (such as RGB-Depth or hyperspectral data) to further strengthen the model's feature expression capability; thirdly, explore embedded deployment of the model in orchard picking robots and intelligent grading systems to realize a high-precision, low-latency, deployable intelligent fruit and vegetable detection solution.