1. Introduction
Citrus, a highly valuable fruit crop globally, is packed with essential nutrients such as vitamins, carotenoids, and dietary fiber. It also contains bioactive compounds like flavonoids and limonene, making it valuable for both consumption and medicinal purposes [
1]. A key factor in determining fruit quality is its ripening stage, which is typically classified manually by field experts. However, manual classification is labor-intensive and error-prone, particularly under complex natural conditions, and it demands a high level of expertise from the operator [
2]. In recent years, to address the widening gap between the food demand driven by population growth and the shortage of agricultural labor, the application of agricultural robotics in fruit harvesting has increased significantly [
3]. Therefore, developing a lightweight, efficient, and accurate citrus ripeness detection model is crucial for enabling automated citrus harvesting.
Recently, computer vision has gained significant attention in fruit ripeness detection owing to its non-contact, high-accuracy, and fast detection advantages. Early studies mainly focused on color, texture, and morphology. For example, Malik et al. converted RGB images to the HSV color space and achieved 81.6% accuracy in tomato detection tasks [
4]. The method proposed by Ling et al. integrates AdaBoost with color analysis for tomato detection and has been successfully adopted by harvesting robots [
5]. However, traditional feature extraction methods struggle in complex environments due to noise and lighting variations. With deep learning advancements, models including LeNet [
6], AlexNet [
7], VGG [
8], GoogleNet [
9], and ResNet [
10] have further promoted the application of CNNs to complex tasks. Deep learning has shown significant advantages in agriculture and has been applied to a wide range of agricultural tasks. For instance, Fan et al. combined dark channel enhancement and YOLOv5 for strawberry ripeness identification, achieving over 90% accuracy [
11]. Wang et al. optimized Faster R-CNN to detect the ripeness of tomatoes, demonstrating remarkable performance in difficult scenarios [
12].
Within the context of smart agriculture, deep learning has shown great potential, but research on citrus ripeness detection involving complex shapes and color variations remains limited. To achieve more accurate detection in complex orchard scenes, Xu et al. proposed HPL-YOLOv4, a modified version of the YOLOv4 architecture [
13]. To enhance green citrus detection in natural conditions, Zheng et al. developed YOLO BP by introducing Bi-PANet to better integrate features across layers [
14]. However, neither approach considered the ripeness information, limiting their effectiveness. A two-stage ripeness detection framework was developed by Chen et al., combining YOLOv5 for identifying fruit regions with saliency-based analysis for ripeness classification [
15]. While insightful, its high computational cost may hinder real-time harvesting in large orchards. Moreover, the green appearance of unripe citrus fruits is often similar to the color of surrounding foliage, leading to significant missed detections. Some unripe fruits also exhibit both orange and green hues, making them visually similar to ripe fruits and thus prone to misclassification [
16]. Therefore, it is necessary to develop a lightweight and accurate detection framework that can perform robustly in complex orchard environments and can support real-time applications.
To handle the aforementioned difficulties, recent advancements in model architectures and attention mechanisms offer new opportunities. Transformer-based detectors such as DETR [
17] and RT-DETR [
18] introduce global attention, which enables better modeling of contextual relationships and improves robustness to scale and occlusion issues. RT-DETR, in particular, combines the global information processing capability of transformers with the efficient feature extraction capability of convolutional neural networks (CNNs), aiming to achieve real-time object detection while maintaining high accuracy. In terms of design, RT-DETR removes the complex candidate box generation and non-maximum suppression (NMS) steps commonly used in traditional object detection, instead adopting a direct prediction method for object bounding boxes and class labels from feature maps. This end-to-end design not only simplifies the model structure but also reduces computational resource consumption, enabling RT-DETR to operate on edge devices and meet real-time application requirements.
Meanwhile, visual saliency detection, inspired by human attention mechanisms, has shown promise in guiding neural networks toward informative regions. Despite its potential for enhancing feature extraction, the integration of saliency priors directly into object detection pipelines remains limited, particularly in agricultural scenarios such as citrus ripeness detection. Existing approaches like Chen et al.’s [
15] perform detection first and then crop individual fruit regions for separate saliency-guided classification, rather than embedding saliency cues within the detection process itself. This two-stage strategy increases computational complexity and limits real-time applicability in large-scale orchard environments. Traditional saliency algorithms such as the Itti model [
19] and FT [
20] are efficient but inadequate in complex natural scenes. Deep learning-based saliency models offer better accuracy but require large, labeled datasets and high computational costs. In comparison, traditional color-based saliency detection methods offer greater robustness and interpretability under low-resource and small-sample conditions. In light of this, this study leverages techniques from traditional saliency detection methods, such as color space transformation and grayscale mapping, to enhance the model’s overall detection effectiveness. By combining lightweight saliency priors with modern detection frameworks like RT-DETR, we can improve feature focus and model efficiency, offering a promising solution that distinguishes itself from traditional approaches.
Building upon the advancements in detection frameworks and saliency-guided feature enhancement, this paper presents an innovative citrus ripeness detection method that integrates visual saliency priors with the improved RT-DETR model, named LightSal-RTDETR. LightSal-RTDETR integrates visual saliency priors to enhance feature extraction, incorporates an improved backbone with the iRMB-cascaded block combining the cascaded group attention (CGA) mechanism and the inverted residual mobile block (iRMB), and introduces an efficient module that combines cross-stage partial networks with gated and partial convolution mechanisms (E-CSPPC) to reduce computational cost. In addition, the loss function is optimized by combining Inner-IoU and SIoU for more precise bounding box regression. Considering the absence of publicly available citrus ripeness detection datasets, we also constructed a comprehensive dataset covering a wide range of backgrounds, distances, and lighting conditions to support robust model training and evaluation.
The principal innovations and contributions of this work are described as follows:
Establishment of a citrus ripeness dataset. A citrus ripeness dataset was developed under various backgrounds, distances, and lighting conditions, resulting in greater adaptability and precision under a wide range of environments. This provides a robust foundation for training models capable of handling complex orchard scenarios.
iRMB-cascaded block. A new iRMB-cascaded block was designed by integrating the CGA mechanism with the iRMB, enhancing the backbone’s residual structure. This integration improves feature extraction efficiency while reducing computational load without compromising performance.
E-CSPPC module. A novel E-CSPPC module was introduced to minimize unnecessary computations and memory usage, enhancing the efficiency of the lightweight network and further optimizing the model’s performance under resource-constrained conditions.
Optimized loss function. The loss function was refined by combining Inner-IoU and SIoU, resulting in more accurate bounding box regression, thereby improving overall detection precision.
Visual saliency maps-based initialization. A new weight transfer strategy was applied, where model weights trained on a combination of visual saliency maps and RGB images were used to initialize the new model. This strategy facilitates more efficient extraction of key image features by the model.
2. Materials and Methods
2.1. Data Acquisition and Dataset Construction
The citrus image dataset used in this study was collected through on-site photography by research team members from orchards in Meishan City, Sichuan Province, China. The images were captured using a Canon R-10 camera (Canon Inc., Tokyo, Japan) and an OPPO Reno8 Pro+ mobile phone (Guangdong OPPO Mobile Telecommunications Corp., Ltd., Dongguan, China), covering citrus fruits at different growth stages. The dataset consists of 1912 images captured under various lighting conditions, capture distances, and levels of occlusion, with a resolution of 4096 × 3072 pixels. It includes 9144 citrus targets, of which 4287 were ripe and 4857 were unripe. All images are stored in JPEG format, which keeps file sizes manageable while preserving high image quality.
Figure 1 illustrates citrus fruits at various stages of ripeness within complex environmental settings.
The classification of ripe and unripe fruits combined acquisition time with biological characteristics: images of unripe fruits were mainly collected in October 2024, while images of ripe fruits were concentrated in December 2024. In addition, the visual characteristics of the fruits were considered for a comprehensive judgment. Unripe citrus fruits are usually green or yellow-green, indicating that they have not yet reached the optimal harvesting period. Ripe citrus fruits, characterized by an orange color indicating full ripeness, are considered to be at the ideal harvesting stage; picking at this stage does not negatively affect final product quality. Based on these characteristics, citrus images were annotated using LabelImg [
21], and the bounding box labels were saved in a YOLO-compatible text format. Unripe citrus fruits were annotated as ‘unripe_mandarin’, typically exhibiting a green to yellowish-green peel, while ripe citrus fruits were labeled as ‘mandarin’, characterized by an orange peel. During the annotation process, each citrus fruit was enclosed with a tightly fitted rectangular bounding box following its contour. In cases where fruits overlapped or were partially occluded by branches and leaves, the bounding boxes were determined by estimating the complete contours of the fruits based on their visible regions. A total of 1339, 188, and 385 images were allocated to the training, validation, and test sets, respectively, following a 7:1:2 split. This split ensures a sufficiently large training set while reserving a comparatively large test set for a more robust model evaluation. These images, collected from real citrus orchards, present challenges for ripeness detection, such as overlapping, occlusion, visual similarity to the orchard background, and dense target distribution.
2.2. Generation of Visual Saliency Maps
To highlight citrus targets and enhance ripeness differences, this study introduces a simple visual saliency map generation strategy that integrates traditional image processing techniques with task-specific adaptations. First, images are converted into the HSV domain, in which the hue, saturation, and value channels are employed to filter the citrus color range (from orange to green). A binary mask is generated by applying appropriate color thresholds to separate ripe and unripe fruit regions. To enhance regional continuity and eliminate noise, morphological operations (closing and opening) are applied to refine the mask. Next, adaptive grayscale mapping is applied to citrus regions based on their hue distribution, such that ripe orange areas appear with high brightness, unripe green areas with medium brightness, and the background with the lowest brightness. Specifically, the maximum grayscale value for ripe regions is defined as
$G_{\max}$, and the minimum grayscale value for unripe regions as $G_{\min}$. Linear mapping is performed using the following equation:

$G(H) = G_{\max} - \dfrac{H - H_{\min}}{H_{\max} - H_{\min}}\,(G_{\max} - G_{\min})$

where $H_{\min}$ and $H_{\max}$ represent the lower and upper hue limits for orange and green, respectively, and $H$ is the hue value of a citrus pixel. The resulting saliency maps effectively highlight the ripeness-related features of citrus fruits in complex environments, facilitating downstream object detection tasks. Examples of the generated visual saliency maps are shown in
Figure 2. Notably, during the initial training stage, the generated visual saliency maps are fed into the model alongside the original RGB images to jointly guide feature extraction. The obtained model weights, enriched with prior saliency knowledge, are later used for initializing the final model, as detailed in
Section 2.7.
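To make the pipeline above concrete, the following is a minimal Python/OpenCV sketch of the saliency-map generation: HSV conversion, hue-based masking, morphological refinement, and the linear hue-to-grayscale mapping given above. The hue thresholds and grayscale limits (H_MIN, H_MAX, G_MAX, G_MIN) are illustrative assumptions rather than the exact values used in this study.

```python
import cv2
import numpy as np

# Illustrative hue limits (OpenCV hue range is 0-179) and grayscale limits;
# the actual thresholds used in the paper are not specified here.
H_MIN, H_MAX = 10, 85        # lower (orange) and upper (green) hue limits
G_MAX, G_MIN = 255, 100      # grayscale for ripe (orange) and unripe (green) regions

def saliency_map(bgr):
    hsv = cv2.cvtColor(bgr, cv2.COLOR_BGR2HSV)
    hue = hsv[:, :, 0].astype(np.float32)

    # Binary mask of candidate citrus pixels within the orange-to-green hue range,
    # with minimum saturation/value to suppress dark or washed-out background.
    mask = cv2.inRange(hsv, (H_MIN, 60, 60), (H_MAX, 255, 255))

    # Morphological closing then opening to fill holes and remove small noise.
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (7, 7))
    mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)

    # Linear hue-to-grayscale mapping: orange -> bright, green -> darker.
    gray = G_MAX - (hue - H_MIN) / (H_MAX - H_MIN) * (G_MAX - G_MIN)
    gray = np.clip(gray, G_MIN, G_MAX).astype(np.uint8)

    # Background (non-citrus) pixels receive the lowest brightness.
    return np.where(mask > 0, gray, 0).astype(np.uint8)
```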
2.3. The LightSal-RTDETR Model Architecture
LightSal-RTDETR is a lightweight model designed to efficiently detect citrus ripeness in natural environments. RT-DETR simplifies the detection pipeline by removing components such as anchor generation and NMS, achieving an effective balance between accuracy and model complexity. However, in practical applications, limited hardware resources remain a challenge, highlighting the need to reduce computational load. This study proposes the LightSal-RTDETR model, which addresses these issues through a lightweight architectural design.
In LightSal-RTDETR, the CGA mechanism [
22] is integrated with the iRMB [
23] to form the iRMB-cascaded block module, which strengthens the residual structure of the backbone network. This integration improves feature extraction efficiency while reducing computational complexity. Additionally, to further minimize redundant computation, the model employs partial convolution (PConv) [
24] to refine the RepC3 module, resulting in the proposed E-CSPPC module. For the loss function, the model combines Inner-IoU [
25] with SIoU [
26], forming a new Inner-SIoU loss to enhance the accuracy of bounding box regression. Moreover, a weight transfer strategy is adopted, in which weights jointly trained on visual saliency maps and RGB images are used for model initialization. This improves feature localization and object detection performance.
Figure 3 presents the overall architecture of LightSal-RTDETR.
2.4. Improved Backbone
The ResNet18 backbone consists of a series of basic block modules, which use skip connections to introduce identity mappings, thus enhancing gradient flow and mitigating the vanishing gradient issue [
10]. To enhance feature extraction capabilities, this study integrates the CGA mechanism and the iRMB into the basic block structure. The improved backbone architecture is illustrated in
Figure 4.
CGA enhances the feature extraction process, enabling the model to better capture intricate patterns by employing a cascaded attention mechanism across multiple feature groups. Each attention head is assigned a distinct subset of feature channels, structurally partitioning the attention computations across heads to improve representation diversity and efficiency [
22]. The attention mechanism is mathematically formulated as follows:

$\widetilde{X}_{ij} = \mathrm{Attn}\!\left(X_{ij} W^{Q}_{ij},\; X_{ij} W^{K}_{ij},\; X_{ij} W^{V}_{ij}\right) \quad (2)$

$\widetilde{X}_{i+1} = \mathrm{Concat}\!\left[\widetilde{X}_{ij}\right]_{j=1:h} W^{P}_{i} \quad (3)$

$X'_{ij} = X_{ij} + \widetilde{X}_{i(j-1)}, \quad 1 < j \le h \quad (4)$

The input feature $X_i$ is divided into h segments, denoted as $X_{ij}$ ($j = 1, 2, \ldots, h$), where h is the total number of attention heads. The j-th attention head performs self-attention on its corresponding segment $X_{ij}$. Each head uses the projection matrices $W^{Q}_{ij}$, $W^{K}_{ij}$, and $W^{V}_{ij}$ to transform its segment into query, key, and value subspaces, respectively. After computing attention outputs for all heads, a linear projection $W^{P}_{i}$ is applied to the concatenated result to restore the original dimensionality. To increase representational capacity, CGA projects the Q, K, and V embeddings onto enhanced feature subspaces that encode more diverse and discriminative information. Attention maps are computed in a cascaded sequence, where the output of each head serves as input to the next, enabling progressive refinement of the feature representations.
As shown in Equation (4), $X'_{ij}$ is the sum of the j-th input split $X_{ij}$ and the output $\widetilde{X}_{i(j-1)}$ of the (j − 1)-th head (calculated by Equations (2) and (3)), which replaces $X_{ij}$ as the new input for the j-th head in the self-attention computation. After the Q projection, a token interaction layer is incorporated to help the self-attention mechanism [
27] capture local and global contextual information, further enhancing the expressiveness of features. This cascaded design, similar to group convolutions [
28], reduces floating-point operations (FLOPs) and parameters by providing each attention head with different feature splits, which increases the diversity of the attention maps. Moreover, this design increases network depth, further improving model capacity without adding extra parameters.
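The following PyTorch sketch illustrates the cascaded group attention idea described above: channels are split across heads, each head attends over its own split (Equation (2)), the concatenated outputs are projected back (Equation (3)), and each head's output is added to the next head's input split (Equation (4)). Dimensions, the token interaction layer, and other details of the original CGA implementation [22] are simplified here.

```python
import torch
import torch.nn as nn

class CascadedGroupAttentionSketch(nn.Module):
    """Simplified cascaded group attention over flattened spatial tokens."""
    def __init__(self, dim, num_heads=4, key_dim=16):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.split_dim = dim // num_heads
        self.key_dim = key_dim
        # Per-head Q/K/V projections applied to that head's channel split (Equation (2)).
        self.qkv = nn.ModuleList([
            nn.Linear(self.split_dim, key_dim * 2 + self.split_dim)
            for _ in range(num_heads)
        ])
        # Output projection applied to the concatenated heads (Equation (3)).
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                 # x: (B, N, C) tokens
        splits = x.chunk(self.num_heads, dim=-1)
        outs, carry = [], 0
        for j in range(self.num_heads):
            xj = splits[j] + carry        # cascade: add previous head's output (Equation (4))
            q, k, v = self.qkv[j](xj).split(
                [self.key_dim, self.key_dim, self.split_dim], dim=-1)
            attn = (q @ k.transpose(-2, -1)) * self.key_dim ** -0.5
            out = attn.softmax(dim=-1) @ v
            outs.append(out)
            carry = out                   # feeds the next head
        return self.proj(torch.cat(outs, dim=-1))
```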
To extract rich ripeness information from complex orchard environments while maintaining a lightweight model, we also employed the iRMB [
23], integrating it into the backbone network for feature extraction. iRMB is a hybrid network module that integrates depthwise separable convolution [
29] (3 × 3 DW-Conv) with a self-attention mechanism. A 1 × 1 convolution is employed to reduce and expand the channel dimensions, enhancing computational efficiency. The depthwise separable convolution effectively captures spatial patterns, while the attention mechanism models global feature dependencies. The architectural design of iRMB is depicted in
Figure 5. Additionally, the SE module [
30] was introduced to reweight channels based on global information and is positioned after the depthwise convolution, as depicted in
Figure 4.
The iRMB-cascaded block integrates the strengths of both the CGA and iRMB mechanisms, forming a feature extraction module that captures complex feature relationships while remaining computationally efficient, as detailed in
Figure 4. CGA applies cascaded attention across different feature groups, improving cross-group correlation modeling, while iRMB optimizes the internal information flow, enhancing the robustness and accuracy of the feature representations. Incorporating the iRMB-cascaded block into ResNet18 therefore enhances feature extraction capability and improves overall performance without increasing computational complexity.
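As a reference for the structure described above, the sketch below shows a simplified inverted residual mobile block in PyTorch: 1 × 1 expansion, 3 × 3 depthwise convolution, SE channel reweighting, 1 × 1 projection, and a residual connection. The attention branch and exact hyperparameters of the original iRMB [23] are omitted, so this is an illustrative approximation rather than the module used in LightSal-RTDETR.

```python
import torch
import torch.nn as nn

class SqueezeExcite(nn.Module):
    """Channel reweighting from globally pooled statistics (SE module)."""
    def __init__(self, ch, reduction=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(ch, ch // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(ch // reduction, ch, 1), nn.Sigmoid())

    def forward(self, x):
        return x * self.fc(x)

class IRMBSketch(nn.Module):
    """Inverted residual mobile block: expand -> depthwise conv -> SE -> project."""
    def __init__(self, ch, expand=2.0):
        super().__init__()
        hidden = int(ch * expand)
        self.block = nn.Sequential(
            nn.Conv2d(ch, hidden, 1, bias=False), nn.BatchNorm2d(hidden), nn.SiLU(),
            nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden, bias=False),  # depthwise
            nn.BatchNorm2d(hidden), nn.SiLU(),
            SqueezeExcite(hidden),
            nn.Conv2d(hidden, ch, 1, bias=False), nn.BatchNorm2d(ch))

    def forward(self, x):
        return x + self.block(x)          # residual connection
```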
2.5. Improved Efficient Hybrid Encoder
RT-DETR is composed of a backbone, an encoder, and a decoder. Its notable detection performance mainly benefits from the efficient hybrid encoder, which effectively handles multi-scale feature representation by decoupling intra-scale interactions and cross-scale fusion [
18]. Within the encoder, the re-parameterized convolution module, RepC3, is deployed to enhance feature extraction. However, despite improving feature representation, RepC3 may introduce redundant convolution operations during inference, leading to increased computational complexity. To tackle this issue, we developed the E-CSPPC module based on PConv to reduce model complexity and computation. The structure of E-CSPPC is depicted in
Figure 6.
PConv performs convolution only on a portion of the input channels, extracting spatial features while preserving the other channels unchanged [
24].
Figure 7a,b show the structures of partial and regular convolution. Compared to traditional convolution, PConv improves local feature capture and enhances recognition capability. For regular memory access, we opt for the first or last $c_p$ channels to represent the full feature map, assuming the number of channels in both the input and output feature maps is identical. Therefore, when the input shape is $h \times w \times c$ and the kernel size is $k \times k$, the FLOPs of PConv are as follows:

$\mathrm{FLOPs}_{PConv} = h \times w \times k^{2} \times c_p^{2}$

With a partial ratio of $r = c_p / c = 1/4$, the FLOPs of PConv are reduced to $1/16$ of those of a regular convolution. Additionally, PConv requires only about $1/4$ of the memory access of a regular convolution.
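A short PyTorch sketch of partial convolution follows: a regular k × k convolution is applied only to the first c_p channels, while the remaining channels are passed through unchanged. The 1/4 partial ratio used here is the typical default from [24] and is an assumption rather than a confirmed setting of this paper.

```python
import torch
import torch.nn as nn

class PConvSketch(nn.Module):
    """Partial convolution: convolve only the first c_p channels, keep the rest as-is."""
    def __init__(self, channels, ratio=0.25, kernel_size=3):
        super().__init__()
        self.cp = int(channels * ratio)            # channels that are actually convolved
        self.conv = nn.Conv2d(self.cp, self.cp, kernel_size,
                              padding=kernel_size // 2, bias=False)

    def forward(self, x):
        x1, x2 = x[:, :self.cp], x[:, self.cp:]    # split: convolved part / untouched part
        return torch.cat([self.conv(x1), x2], dim=1)

# FLOPs of the convolved part scale with c_p^2, so with ratio = 1/4 the cost is
# roughly (1/4)^2 = 1/16 of a regular convolution over all channels.
```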
Gated convolution (GatedConv) dynamically adjusts convolution outputs through a gating mechanism [
31]. Specifically, the input feature map is processed by a standard convolution to extract local features. A 1 × 1 convolutional layer subsequently produces a weight map that matches the dimensions of the input feature map. The weight map, following sigmoid activation, is multiplied by the corresponding elements in the feature map to modify its feature intensity. This mechanism allows the network to selectively retain or suppress specific regional features, enhancing feature extraction and robustness.
The P_GatedConv module, which combines GatedConv with PConv, is integrated into the E-CSPPC module. This reduces computational and memory redundancy and lowers the number of parameters, which in turn reduces model size and deployment cost on resource-constrained devices.
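The sketch below illustrates, under the description above, how a gating branch can modulate PConv features: a 1 × 1 convolution followed by a sigmoid produces a weight map that rescales the partially convolved features elementwise. The exact composition of P_GatedConv and E-CSPPC in the paper may differ from this simplified version.

```python
import torch
import torch.nn as nn

class PGatedConvSketch(nn.Module):
    """Partial convolution branch modulated by a sigmoid gate from a 1x1 convolution."""
    def __init__(self, channels, ratio=0.25):
        super().__init__()
        self.cp = int(channels * ratio)
        self.pconv = nn.Conv2d(self.cp, self.cp, 3, padding=1, bias=False)  # partial conv branch
        self.gate = nn.Sequential(nn.Conv2d(channels, channels, 1), nn.Sigmoid())

    def forward(self, x):
        x1, x2 = x[:, :self.cp], x[:, self.cp:]
        feats = torch.cat([self.pconv(x1), x2], dim=1)   # PConv output
        return feats * self.gate(x)                       # selectively retain or suppress regions
```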
2.6. The Improved Loss Function
To enhance the evaluation accuracy, Inner-SIoU is used to replace the IoU loss function in this model. During training, a smaller auxiliary boundary is used for loss computation, and a scaling factor is applied to adjust the auxiliary boundary ratio. Based on Equations (5)–(8), Inner-IoU [
25] is applied to the SIoU loss function [
26] calculation as follows:

$L_{Inner\text{-}SIoU} = L_{SIoU} + IoU - IoU^{inner}$

where $IoU^{inner}$ denotes the Inner-IoU computed on the scaled auxiliary boxes. Define $b^{gt}$ as the ground truth box and $b$ as the anchor box, with their widths and heights denoted by $w^{gt}$, $h^{gt}$ and $w$, $h$, respectively. The scaling factor ratio determines the scale of the auxiliary boundary. The SIoU loss function is composed of four components: angle, distance, shape, and IoU losses. Inner-SIoU improves the regression accuracy for high-IoU samples by optimizing the internal area of the box and introducing small-scale auxiliary boundaries for loss computation, effectively reducing the influence of low-IoU samples during training.
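For illustration, the following PyTorch sketch computes Inner-IoU by shrinking both boxes around their centers with a scaling factor (ratio) before the IoU is evaluated; the SIoU angle, distance, and shape terms are omitted. The default ratio value shown is an assumption.

```python
import torch

def inner_iou(box1, box2, ratio=0.75, eps=1e-7):
    """Inner-IoU: IoU between auxiliary boxes scaled by `ratio` around each box center.

    Boxes are (x1, y1, x2, y2) tensors of shape (..., 4).
    """
    def scaled(box):
        cx, cy = (box[..., 0] + box[..., 2]) / 2, (box[..., 1] + box[..., 3]) / 2
        w, h = (box[..., 2] - box[..., 0]) * ratio, (box[..., 3] - box[..., 1]) * ratio
        return cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2, w, h

    x1a, y1a, x2a, y2a, wa, ha = scaled(box1)
    x1b, y1b, x2b, y2b, wb, hb = scaled(box2)

    inter_w = (torch.min(x2a, x2b) - torch.max(x1a, x1b)).clamp(min=0)
    inter_h = (torch.min(y2a, y2b) - torch.max(y1a, y1b)).clamp(min=0)
    inter = inter_w * inter_h
    union = wa * ha + wb * hb - inter + eps
    return inter / union

# The combined loss described in Section 2.6 adds the gap between the standard IoU
# and Inner-IoU to the SIoU loss.
```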
2.7. Weight Transfer
This study adopts a weight initialization strategy, using model weights obtained from joint training on RGB images and visual saliency maps as the initialization parameters for the new model. Guided by saliency maps, the model can acquire richer key information during the early training stage, improving its representation of salient features. In addition, this initialization method provides better optimization guidance during model optimization, reducing focus on non-salient regions, thereby improving detection accuracy under cluttered backgrounds and occlusions. The model obtained from joint training on visual saliency maps and RGB images is shown in
Figure 8. To ensure weight compatibility, the structure of this model is essentially identical to that of LightSal-RTDETR.
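A hedged sketch of the weight-transfer step is given below: the checkpoint obtained from joint training on saliency maps and RGB images initializes LightSal-RTDETR, copying only parameters whose names and shapes match. The checkpoint file name and dictionary layout are illustrative assumptions.

```python
import torch

def transfer_saliency_pretrained_weights(model, ckpt_path="saliency_rgb_pretrain.pt"):
    """Initialize `model` with weights jointly trained on saliency maps and RGB images.

    Assumes the checkpoint stores a plain state dict; only parameters whose names
    and shapes match are copied, and the rest keep their default initialization.
    The checkpoint path is illustrative.
    """
    pretrained = torch.load(ckpt_path, map_location="cpu")
    own = model.state_dict()
    matched = {k: v for k, v in pretrained.items()
               if k in own and own[k].shape == v.shape}
    own.update(matched)
    model.load_state_dict(own)
    print(f"transferred {len(matched)}/{len(own)} tensors from {ckpt_path}")
    return model
```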
3. Results
3.1. Training Environment and Settings
To train and test the proposed model, the experiments were conducted in an Ubuntu 22.04 environment equipped with an AMD EPYC 9754 128-core CPU and an NVIDIA GeForce RTX 4090 D GPU with 24 GB of memory. The programming language was Python 3.8.10, and the deep learning framework was PyTorch 2.0.0 with CUDA 11.8.
For consistency, input images were uniformly scaled to 640 × 640 pixels throughout the training phase. The training was conducted over 150 epochs using a batch size of 16, with the AdamW optimizer configured for a learning rate of 0.0005 and a weight decay of 0.0001. To ensure objectivity, all experimental evaluations of the proposed method were carried out using identical hyperparameter settings. These hyperparameters were tuned based on validation performance to ensure good convergence and generalization. The training process was monitored using the validation mean average precision (mAP). As shown in
Figure 9a, the validation mAP increased steadily and plateaued in the later epochs, reaching a best value of 80.8% at mAP@50. The final test set mAP was 81%, which is consistent with the validation performance, suggesting that the model effectively generalizes without overfitting.
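For reproducibility, the optimizer settings reported above can be expressed as the following PyTorch sketch; the cosine learning-rate schedule shown is a common choice and an assumption, since the paper does not specify a scheduler.

```python
import torch

# Training settings reported in Section 3.1: 640x640 inputs, 150 epochs, batch size 16,
# AdamW with lr = 5e-4 and weight decay = 1e-4. The `model` object is assumed to exist.
def build_optimizer_and_scheduler(model, epochs=150):
    optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4, weight_decay=1e-4)
    # Cosine annealing over the full training run (assumed, not stated in the paper).
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
    return optimizer, scheduler
```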
3.2. Evaluation Metrics
A comparative analysis of detection performance before and after enhancement is conducted under the same experimental setup to validate the algorithm’s improvements. The performance assessment in this work is based on precision, recall, F1 score, mean average precision (mAP), total parameters, and GFLOPs. All reported precision, recall, and F1 scores are calculated using macro-averaging. For NMS, the following thresholds are used: YOLO series (confidence = 0.25, IoU = 0.7), SSD (confidence = 0.5, IoU = 0.45), and Faster R-CNN (confidence = 0.5, IoU = 0.3). These threshold values are chosen based on the default settings used in the respective models’ original implementations, ensuring consistency with standard practices in object detection tasks. In contrast, DETR-based models do not require NMS thresholds during evaluation due to their transformer-based architecture. During the prediction phase, a confidence threshold of 0.25 is applied to filter out low-confidence predictions, while all predictions are included in the evaluation phase to compute the precision-recall curve and mAP.
Precision represents the model’s effectiveness in correctly predicting positive outcomes. The formula is defined in Equation (10):

$Precision = \dfrac{TP}{TP + FP} \quad (10)$

TP denotes the correct positive predictions, whereas FP refers to incorrect ones.
Recall reflects how well the model detects positive samples. The formula is defined in Equation (11):

$Recall = \dfrac{TP}{TP + FN} \quad (11)$

FN refers to positive instances that were incorrectly classified as negative.
The F1 score offers a comprehensive evaluation of model performance by combining precision and recall, and is defined in Equation (12):

$F1 = \dfrac{2 \times Precision \times Recall}{Precision + Recall} \quad (12)$
The mean average precision (mAP) is a comprehensive performance metric for object detection tasks that accounts for both the precision and recall of the model at different thresholds. Prior to computing AP, the area under the precision–recall curve (AUC) is estimated via interpolation. In this paper, we use the 101-point interpolation method, where interpolated precision is sampled at 101 equally spaced recall thresholds between 0 and 1. Notably, Section 3.4 follows the COCO-style evaluation protocol to ensure fair comparison with lightweight DETR benchmarks, while all other experiments adopt the default mAP computation used in the YOLO framework, which is also based on 101-point interpolation. The formula for mAP is shown as Equation (13):

$mAP = \dfrac{1}{N} \sum_{i=1}^{N} AP_i \quad (13)$

$AP_i$ represents the average precision of the i-th class, and N represents the number of classes.
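The 101-point interpolation described above can be sketched as follows in NumPy: precision is first made monotonically non-increasing along recall, then sampled at 101 equally spaced recall thresholds, and mAP averages the per-class AP values (Equation (13)).

```python
import numpy as np

def average_precision_101(recalls, precisions):
    """101-point interpolated AP from arrays sorted by ascending recall."""
    # Make precision monotonically non-increasing as recall grows (precision envelope).
    precisions = np.maximum.accumulate(precisions[::-1])[::-1]
    # Sample interpolated precision at 101 equally spaced recall thresholds.
    sample_points = np.linspace(0.0, 1.0, 101)
    interp = np.zeros_like(sample_points)
    for i, r in enumerate(sample_points):
        idx = np.searchsorted(recalls, r, side="left")
        interp[i] = precisions[idx] if idx < len(precisions) else 0.0
    return interp.mean()

def mean_average_precision(per_class_pr):
    """mAP over classes, given {class: (recalls, precisions)} (Equation (13))."""
    return float(np.mean([average_precision_101(r, p) for r, p in per_class_pr.values()]))
```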
The parameter count indicates the total number of learnable parameters in a model and is typically used to evaluate model complexity.
GFLOPs measures a model’s computational cost. During inference, GFLOPs provides a useful estimate of the computational load a model imposes, making it suitable for cross-model efficiency comparisons on the same device. Fewer GFLOPs indicate higher computational efficiency.
3.3. Ablation Study
To evaluate the effectiveness of the modules introduced, an ablation study was conducted using the dataset constructed in this study. In the experiment, i_C Block represents the iRMB-cascaded block module, E-C represents the E-CSPPC module, and Inner-SIoU refers to the use of the Inner-SIoU loss. The weight_load refers to the initialization of model weights obtained from joint training with visual saliency maps and RGB images.
Table 1 presents the results of the ablation study, which comprises five groups of experiments, each adding different modules. The models are evaluated against the original RT-DETR model in terms of precision, recall, mAP, parameter count, and GFLOPs.
As shown in
Table 1, integrating the E-CSPPC module into the network leads to increases of 0.2% in mAP@50 and 2.1% in mAP@50:95, while simultaneously lowering the number of parameters and computational overhead. By combining the iRMB-cascaded block and E-CSPPC, the detection performance is further boosted, with mAP@50 increasing by 0.4% and mAP@50:95 by 1.8%. Compared to the baseline RT-DETR, this approach cuts the parameter count to 14.28 M and the computational cost to 41.8 GFLOPs, representing reductions of 28.1% and 26.5%, respectively. Furthermore, integrating the Inner-SIoU loss function leads to a 1.0% improvement in mAP@50. A comparison between Experiment 4 and Experiment 5 indicates that initializing with weights trained on both visual saliency maps and RGB images in Experiment 5 leads to improvements of 0.5% in mAP@50 and 0.9% in mAP@50:95, relative to random initialization. Finally, with all enhancements applied, LightSal-RTDETR demonstrates performance gains of 1.9% for mAP@50 and 3.4% for mAP@50:95. It also reduces the parameter count and GFLOPs by 28.1% and 26.5% compared to the baseline. Beyond improving detection accuracy, LightSal-RTDETR significantly reduces model complexity and computational cost, enhancing its suitability for real-world deployment. This demonstrates the practical value of the proposed improvements.
3.4. Comparison Experiments
The effectiveness of the proposed LightSal-RTDETR was evaluated by comparing it with a range of mainstream detectors, including SSD [
32], Faster R-CNN [
33], YOLOv5, YOLOv8, YOLOv10 [
34], YOLO11, YOLOv12 [
35], and RT-DETR networks.
Figure 9a presents the performance curves of all nine models evaluated on the proposed dataset. Faster R-CNN has significantly more parameters and higher computational complexity than the other models, while its average precision remains relatively low. Due to its limited practical value,
Figure 9b excludes Faster R-CNN and illustrates the relationship between parameter count, computational complexity, and mAP@50:95 for the remaining models. As shown, LightSal-RTDETR demonstrates a clear advantage in terms of accuracy–efficiency trade-offs.
Figure 9.
Results of comparison experiments. (a) Accuracy variation of nine object detectors, and (b) bubble chart of parameters, GFLOPs, and mAP@50:95.
Table 2 presents a comparison of different models in terms of precision, recall, mAP, model parameters, and GFLOPs. Compared with SSD, Faster R-CNN, YOLOv5, YOLOv10, YOLO11, and YOLOv12, LightSal-RTDETR achieves higher average precision by 1.6%, 5.9%, 0.6%, 4.1%, 0.4%, and 0.5%, respectively, at an IoU of 0.5. Although YOLOv8 achieves a high accuracy of 82.2%, it comes at the cost of considerably larger parameter size and computational burden compared to LightSal-RTDETR. LightSal-RTDETR surpasses the baseline RT-DETR by 1.9% in mAP@50 and 3.4% in mAP@50:95, while also delivering 1.6% higher precision and 2.7% higher recall. Overall, LightSal-RTDETR offers a favorable trade-off between accuracy and resource consumption, maintaining high detection precision with reduced parameter size and computational demands, thus supporting efficient lightweight deployment.
In addition, to further validate the effectiveness of the proposed model, we compared LightSal-RTDETR with several state-of-the-art lightweight DETR-based models, namely D-FINE [
36], LW-DETR [
37], and DEIM-D-FINE [
38]. The comparison results are shown in
Table 3. The mAP evaluation metrics used in this experiment follow the official COCO evaluation protocol. As shown in the table, LightSal-RTDETR achieves a good balance between accuracy and model complexity. Specifically, it achieves 80.5% on mAP@50 and 60.0% on mAP@50:95, both outperforming the other compared models and demonstrating superior object detection capability. Meanwhile, it maintains a low model size of only 14.28 M parameters and 41.8 GFLOPs, which are lower than those of D-FINE, LW-DETR, and DEIM-D-FINE, highlighting its lightweight advantage. These results indicate that LightSal-RTDETR not only offers better detection performance but also demonstrates improved adaptability to edge devices with limited resources.
3.5. Visualization Analysis
In natural environments, citrus ripeness detection is influenced by various factors, including uneven lighting, occlusion by leaves or branches, and overlapping fruits.
Figure 10 presents a comparison of citrus ripeness prediction generated by LightSal-RTDETR and the original RT-DETR. Various example scenarios are used to demonstrate detection effectiveness, such as under direct sunlight, in shaded conditions, with overlapping or occluded objects, and in high-density environments. The results indicate that LightSal-RTDETR demonstrates strong robustness in adapting to lighting variations and detecting overlapping fruits.
To gain deeper insights into the effectiveness of the proposed algorithm under complex orchard backgrounds in citrus ripeness detection, this study employs GradCAM++ [
39] for heatmap analysis. In the resulting heatmaps, red regions represent the model’s focus areas, with darker shades indicating higher levels of attention.
Figure 11 provides a comparative visualization of the heatmaps generated by the original and improved models. The results demonstrate that the optimized model focuses more strongly on target regions while paying less attention to non-target areas, confirming the effectiveness of the improvements.
3.6. Comparative Experiments on Backbone Network
The RT-DETR model in this study adopts ResNet-18 as the backbone network. To evaluate the effectiveness of enhancing the basic block in ResNet18 through the integration of iRMB and CGA, we conducted comparative experiments with several representative convolutional structures, as shown in
Table 4.
As shown in
Table 4, compared with the baseline model, the introduction of the iRMB-cascaded block not only significantly reduces the number of parameters and computational cost but also improves the detection accuracy. Notably, an increase of 1.1% in mAP@50 is achieved. In comparison to using iRMB alone, the iRMB-cascaded block yields a 0.6% improvement in mAP while reducing the parameter count by 1.13 M and lowering the computational complexity by 2.3 GFLOPs. Furthermore, this study also integrates PConv and DualConv [
40] into the original basic block module for comparison purposes. Although both PConv and DualConv reduce model size and computation, they lead to performance degradation, with PConv causing a 0.5% drop in mAP@50, and DualConv resulting in a 1.3% decrease in mAP@50:95, which fails to meet the accuracy requirements for citrus ripeness detection. Through a comprehensive comparison, the proposed iRMB-cascaded block demonstrates superior overall performance.
3.7. Generalization Assessment of E-CSPPC Module
To further evaluate the generalization capability of the E-CSPPC module, we conducted validation experiments using the YOLO11 model based on the YOLO architecture. In this experiment, the original C3K2 module in the neck of YOLO11 was replaced with the proposed E-CSPPC module to assess its effectiveness in feature extraction following feature fusion. The experimental results are presented in
Table 5.
As shown in the results, the performance of the E-CSPPC module varies across different model architectures. In the RT-DETR model, integrating the E-CSPPC module improved mAP@50 from 79.1% to 79.3% and mAP@50:95 from 57.1% to 59.2% while reducing the number of parameters by 1 M and computational cost by 4.9 GFLOPs. In contrast, when applied to YOLO11, the introduction of E-CSPPC slightly reduced mAP@50 from 80.6% to 80.2% and mAP@50:95 from 57.5% to 57.0%, though it still reduced parameters by 2.07 M and GFLOPs by 6.6 G.
These outcomes indicate that the performance impact of E-CSPPC is closely related to the target architecture. In our study focused on optimizing RT-DETR, integrating E-CSPPC resulted in improved accuracy and reduced computational cost. In contrast, a slight drop in detection accuracy was observed when applying E-CSPPC to YOLO11, though it still achieved notable reductions in model parameters and GFLOPs. This suggests that the practical benefits of E-CSPPC, particularly in terms of model lightweighting, depend on how well it aligns with the architectural characteristics of the base model. We acknowledge that E-CSPPC may not be universally applicable to all detection frameworks, and further evaluation on broader model families is left as future work.
3.8. Validity of Weight Transfer
To further evaluate the effectiveness of weight transfer from models jointly trained on visual saliency maps and RGB images, we conducted an experiment comparing various weight initialization strategies for citrus ripeness detection. The results are presented in
Table 6. In this experiment, both models, RT-DETR and improved RT-DETR, used random initialization, while RT-DETR (pretrained) and improved RT-DETR (pretrained) employed pretrained weights from the COCO [
41] and Objects365 [
42] datasets. It is also worth noting that improved RT-DETR and LightSal-RTDETR share an identical network architecture.
The experimental results indicate that RT-DETR (pretrained) achieves a 1.6% improvement in F1 score compared to its randomly initialized counterpart. However, mAP@50 is 0.3% lower than the random initialization model. The improved RT-DETR (pretrained) shows a 0.6% lower mAP@50 than the random initialization version of the improved RT-DETR, indicating that pretrained weights from COCO and Objects365 may not be suitable for the citrus dataset. In contrast, LightSal-RTDETR (the model with weights trained jointly using visual saliency maps and RGB images) shows significant improvements in both F1 score and mAP@50, suggesting that this transfer learning approach is more suitable for citrus ripeness detection.
3.9. Applicability
To explore the model’s generalization ability to other fruit types, we utilized the Melon Puspalebo Dataset [
43] from Roboflow. This public dataset contains melons at various ripeness stages in orchard environments, categorized into two classes: ripe and unripe. The dataset, which includes augmented images provided as part of the original collection, comprises a total of 1038 images and was split into training, validation, and test sets in a 7:2:1 ratio.
Table 7 displays the performance evaluation outcomes of LightSal-RTDETR on the dataset.
Table 7 indicates that the proposed LightSal-RTDETR model exhibits high effectiveness on this dataset. Specifically, the mAP@50 for ripe and unripe categories achieves 99.5% and 91.5%, respectively. Furthermore, the results from additional evaluation metrics reflect strong performance. These results highlight the applicability and adaptability of LightSal-RTDETR in assessing fruit ripeness across different categories.
3.10. Deployment Plan
To enable the practical application of the proposed algorithm in orchard environments, it is planned to be deployed on an intelligent fruit-picking robot. The system will be built on an embedded computing platform powered by the NVIDIA Jetson Orin Nano, which offers high computational efficiency and low power consumption, making it suitable for edge computing scenarios. The robot will be equipped with a high-resolution industrial camera (e.g., IDS uEye FA series) for image acquisition via a USB 3.0 interface and will support remote data transmission through Wi-Fi or 4G networks. To enhance system stability and robustness in complex orchard conditions such as high humidity and vibration, the camera module will adopt a shock-absorbing structural design. By integrating the proposed algorithm, the robot will be capable of autonomous orchard inspection and intelligent fruit ripeness detection, providing effective support for orchard management and harvesting operations.
4. Discussion
The experimental results indicate that the proposed model improves the accuracy of citrus ripeness detection in complex orchard environments while simultaneously reducing model parameters and computational cost. The ablation study further verifies the effectiveness of the proposed architectural enhancements. The E-CSPPC module, iRMB-cascaded block, and Inner-SIoU loss function were introduced, and the weights obtained through the hybrid training of RGB images and visual saliency maps were used for weight initialization. Compared to the baseline model, LightSal-RTDETR achieved improvements of 1.6%, 2.7%, 1.9%, and 3.4% in precision, recall, mAP@50, and mAP@50:95, respectively. Meanwhile, the model’s parameters and computational cost were reduced by 28.1% and 26.5%, respectively. These results demonstrate that the improved model effectively detects citrus maturity under complex orchard backgrounds and extracts features representing the citrus ripening stages. In the comparative experiments, we compared the improved model with mainstream object detection models and, using the COCO mAP calculation standard, also compared it with leading lightweight DETR-based models, D-FINE, LW-DETR, and DEIM-D-FINE. LightSal-RTDETR exhibited significantly lower parameter count and computational cost while maintaining high detection accuracy, indicating an optimal balance between computational load and detection performance. In the visual analysis, we visualized detection results in complex orchard environments, including various lighting conditions, overlaps, occlusions, and multiple fruits. The results demonstrated the model’s robustness in challenging orchard scenarios. The proposed method holds promise for enhancing citrus harvest quality in future automated harvesting platforms.
Despite the promising detection performance of the model, limitations in data collection still remain. Although this study includes citrus samples at various ripeness stages collected from complex orchard environments, certain data gaps are evident. For instance, the dataset lacks instances of decayed citrus fruits and images captured under extreme weather conditions. These limitations suggest potential areas for improvement. Future research should focus on gathering a more varied range of datasets to improve the model’s robustness and generalization ability.
From an architectural design perspective, LightSal-RTDETR introduces CGA to enhance feature extraction capability and improve modeling of occluded areas while incorporating PConv to reduce the model size and computational cost. However, these modules may face performance degradation under certain conditions. For instance, when the input image contains significant object occlusion or blurred edges, the local attention mechanism, due to its limited receptive field, may struggle to model global dependencies across regions, leading to decreased detection accuracy. Furthermore, in cases of high occlusion or extreme lighting conditions, partial convolutions may erroneously suppress features in the target area, negatively impacting the model’s performance. In addition, the E-CSPPC module, proposed as a lightweight enhancement block, demonstrated varying levels of effectiveness across different architectures. While it improved both accuracy and efficiency in the RT-DETR framework, it led to a slight accuracy drop in YOLO11, despite significantly reducing model complexity. This further reinforces the importance of aligning module design with architectural characteristics. Future work may explore how such modules can be better adapted to different model types.
Additionally, the proposed LightSal-RTDETR achieves an mAP of 81%, which meets the fundamental requirements for industrial deployment, although there remains room for further improvement in detection accuracy. Current ripeness detection methods are still largely at the algorithmic stage, and a gap remains between them and deployment in real-world scenarios. The application and deployment of LightSal-RTDETR in citrus orchards therefore still require further research and exploration. In the future, we will continue to develop this algorithm into a deployable system to ensure its effectiveness and reliability in real orchard environments.
5. Conclusions
To overcome the difficulties posed by complex orchard environments in citrus ripeness detection, we proposed a lightweight algorithm, LightSal-RTDETR, based on the RT-DETR architecture. Experimental evaluation confirms that the algorithm achieves 81% in mAP@50 and 60.5% in mAP@50:95. The proposed approach maintains a favorable balance between detection model complexity and precision when compared with popular object detection algorithms. LightSal-RTDETR outperforms the original RT-DETR by increasing mAP@50 by 1.9% and mAP@50:95 by 3.4%, simultaneously achieving a 28.1% reduction in parameter count and a 26.5% reduction in computational cost. These results demonstrate the feasibility of this method for citrus ripeness detection, enabling more accurate identification of ripe and unripe fruits in natural environments and offering robust technical support for the development of smart orchards. It should be noted that variations in environmental conditions and citrus cultivars may influence the model’s detection performance. Subsequent studies will aim to refine the model architecture, enrich the dataset, and enhance detection accuracy and robustness under diverse and challenging conditions, thereby contributing to the advancement of modern agricultural automation. Based on cutting-edge practices in agricultural robotics, we propose a feasible prospective deployment plan. Potential applications include automated orchard harvesting systems. These implementations are expected to significantly enhance the automation level of the citrus industry, optimize harvest timing decisions, reduce post-harvest losses, and provide data-driven support for precision agriculture, ultimately accelerating the intelligent transformation and upgrading of the citrus industry.