1. Introduction
Citrus, the world’s largest fruit crop and an important economic crop, has become one of the most widely produced, consumed, and processed fruits globally, owing to its rich nutritional components and bioactive substances [
1]. Yunnan, one of the original habitats of citrus, has developed into a core production area for early and late-maturing citrus varieties, relying on its unique climatic and geographical conditions. Since the 21st century, the local citrus planting scale has expanded rapidly, making it one of the most important citrus production bases in China [
2]. According to statistics, in 2022, the national citrus planting area reached 2.99 million hectares, and in 2023, production exceeded 64.33 million tons, with an enormous harvesting workload. At present, citrus harvesting mainly relies on manual labor, which results in low efficiency, high costs, and maturity misjudgments [
Therefore, developing accurate citrus maturity recognition technology for natural environments is of great practical significance. As a key step in fruit production, the quality of citrus picking directly affects the fruit’s quality, storage performance, and commercial value.
With the continuous advancement of smart agriculture, fruit maturity recognition has become a key technological component in orchard management and automated harvesting. Traditional manual inspection methods are not only inefficient, but also have a high degree of subjectivity in the maturity grading process. This is especially problematic in complex orchard environments, where dense foliage and frequent changes in lighting make it difficult to ensure the accuracy and consistency of recognition results, severely limiting the timeliness and standardization of fruit harvesting. Compared to other fruit types such as apples, tomatoes, and blueberries, citrus fruits pose unique challenges for visual detection tasks. First, citrus trees typically have dense canopies, leading to frequent and severe occlusions of fruits by leaves or adjacent fruits. Second, the ripening process often involves uneven color transitions, with different parts of the same fruit exhibiting varying maturity levels. Third, citrus surfaces are characterized by rough textures and prominent pores, which complicate the extraction of texture features. Additionally, their relatively large size demands higher localization precision, especially under occlusion. These characteristics, collectively, make the detection and classification of citrus fruit maturity more complex than for many other fruit types, emphasizing the need for specialized detection models. In real-world orchard environments, fruit maturity recognition is further challenged by a range of complex factors that significantly impact detection accuracy and robustness. The term “complex environments” in this study mainly refers to (1) dynamic and variable natural lighting conditions, including strong sunlight, shadows, and backlighting, which can degrade image quality; (2) frequent occlusions caused by dense foliage or overlapping fruits, which reduce target visibility; (3) highly cluttered and variable backgrounds, such as branches, soil, fruit bags, and irrigation equipment, which introduce noise into the detection process; and (4) non-uniform color development during ripening, where parts of the citrus fruit may remain green or yellow-green even when fully mature. These challenges limit the effectiveness of conventional methods that rely solely on color or morphological features, underscoring the need for robust and adaptive detection models capable of operating under real orchard conditions.
In recent years, deep learning technologies have been widely applied in agricultural visual recognition tasks, and fruit maturity recognition methods based on these technologies have gained attention, due to their high classification accuracy. However, there are still several challenges in practical deployment: (1) sample class imbalance, which can cause detection bias in models; (2) interference from complex backgrounds and fluctuating lighting, which affects detection stability; and (3) the coexistence of fruits at multiple maturity stages, which increases the difficulty of model judgment and affects the precision and accuracy of recognition.
To address these issues, this paper proposes an improved YOLOv8 object detection algorithm, ORD-YOLO, for automatic fruit recognition and maturity grading, using citrus fruits as a case study. To overcome the performance bottlenecks of existing models in complex environments, the proposed algorithm employs a differentiated data augmentation strategy to mitigate the sample imbalance problem. It also introduces four key improvement modules: the ODConv [
4] module, which enhances feature expression capabilities and improves detection robustness in complex backgrounds; the RepGFPN [
5] structure, which optimizes the feature pyramid network to accelerate information fusion and inference speed; the DynamicHead [
6] module, which guides a dynamic attention mechanism to strengthen key feature extraction; and the InnerDIoU loss function, which optimizes bounding box location accuracy. Experimental results show that ORD-YOLO performs excellently in real orchard scenarios, effectively reducing false negatives and false positives, and demonstrating strong robustness and practicality. Compared to traditional methods, this model overcomes issues such as low efficiency, high subjectivity, and significant background interference, while significantly improving the automation level of the system. It can provide reliable visual perception support for harvesting robots, promoting the continuous development of orchard management toward precision and intelligence.
The ORD-YOLO model enables fast, objective citrus ripeness detection, enhancing orchard management through automation and precision. It addresses the inefficiency and subjectivity of manual assessment, improving harvest timing and fruit quality with strong accuracy and robustness. Its real-time, lightweight design fits harvesting robots and mobile devices, cutting labor costs. By applying computer vision and deep learning, ORD-YOLO drives agricultural digitalization and supports sustainable, precision farming across various crops.
2. Materials and Methods
2.1. Related Work
Currently, three main methods are used to determine maturity: the manual judgment method, which is simple and quick but highly subjective and prone to misjudgment; the physico–chemical method, which yields accurate results but is destructive and cumbersome [
7]; and the machine vision method, which combines high efficiency, non-destructive testing, and high accuracy, making it possible to achieve intelligent recognition of fruit maturity.
Fruit maturity detection methods based on physico–chemical properties generally involve quantifying indicators such as sugar content, acidity, and the sugar–acid ratio. Josué Barragán-Iglesias et al. [
8] established a maturity prediction model by analyzing the changes in parameters such as color, TSS, and hardness during the ripening process of papayas. Shailendra Kumar et al. [
9] developed a maturity index based on physiological and biochemical changes, such as respiration, ethylene release, and sugar accumulation, which can accurately identify crop maturity and determine the optimal harvest time. Arun Kumar Gupta et al. [
10] developed a lattice electrode sensor (IDE) that detects grapefruit maturity accurately by measuring flavonoid content. While physico–chemical methods provide reliable and accurate results, they suffer from limitations such as long detection cycles, high dependence on equipment, and destructive sampling, which make them unsuitable for real-time orchard detection. Machine vision-based methods for intelligent fruit maturity recognition achieve non-destructive automated classification through image feature analysis and deep learning algorithms. The bimodal (visual–tactile) avocado grading system developed by Junchang Zhang et al. [
11] achieves precise classification but has limitations such as high equipment costs, complex operations, and low efficiency. Sitti Wetenriajeng Sidehabi et al. [
12] developed a passion fruit grading system based on RGB and a* chromatic features, which achieved 90% accuracy but risks overfitting, due to insufficient sample size. H. Gan et al. [
13] proposed a visible-light–thermal-infrared image fusion method for citrus detection, which improved early yield estimation accuracy but was susceptible to environmental interference. The lightweight MSC-YOLOv8 model designed by Tian Youwen et al. [
14] improved blueberry detection speed, but sacrificed deep feature extraction ability and was affected by sample class imbalance. Existing methods all suffer from stability issues, and there is an urgent need for the development of robust maturity detection algorithms that can adapt to the complex environment of orchards.
Zhengyang Zhong et al. [
15] proposed a lightweight and efficient mango detection model, LightYOLO, based on the Darknet53 architecture. This model not only achieves fast and accurate detection, but also significantly reduces false positives and missed detections, while lowering the number of parameters and computational cost (FLOPs). To enhance the detection of tomatoes in complex environments, Xiangyang Sun et al. [
16] developed a lightweight greenhouse tomato detection model, S-YOLO, which delivers fast detection performance with only 9.11 M parameters, outperforming many mainstream models. Jifei Zhao et al. [
17] introduced a lightweight pomegranate growth-stage detection algorithm, YOLO-Granada, based on YOLOv5. The model achieves substantial improvements in parameter count, computation load, and model size. It supports real-time detection at 8.66 images per second, offering a precise and lightweight solution for intelligent management devices in pomegranate orchards.
The main contributions of this article are as follows:
ORD-YOLO: a novel ripeness recognition model based on YOLOv8, integrating ODConv, RepGFPN, DynamicHead, and InnerDIoU to boost robustness and accuracy in complex scenes.
Citrus dataset with complex backgrounds: Includes varied lighting, occlusions, and fruit positions, mimicking real orchard conditions for challenging ripeness detection.
Ablation studies: Validate each module’s impact on precision, recall, and efficiency, proving ORD-YOLO’s performance gains.
Superior generalization and deployment potential: Outperforms baselines in tests and transfer tasks, showing promise for orchard robots and mobile devices.
2.2. Image Acquisition
The citrus experimental data used in this study were collected from the subtropical fruit planting demonstration park in Huaning County, Yuxi City, Yunnan Province, China. The research focuses on the early-maturing citrus variety “Xingjin Honey Tangerine.” Data collection took place from June 2023 to October 2024, acquiring multi-angle and multi-period image data. The system setup includes cameras inside the greenhouse (I), cameras outside the greenhouse (O), and ground probes inside the greenhouse (G), with their positions shown in
Figure 1. This system facilitated the collection of images at different time periods, ensuring diversity and coverage of the data in terms of time and lighting conditions.
The collected images were systematically organized through a web-based data management platform, creating a citrus fruit image dataset that covers a variety of typical scenarios, as shown in
Figure 2. This dataset includes images under different lighting conditions (natural light and strong light), varying fruit quantities (single fruit, multiple fruits, and overlapping fruits), occlusion situations (such as branch and leaf occlusion), and complex environments such as rainy days. In total, 3264 image samples were acquired, providing solid and reliable data support for subsequent computer vision-based citrus maturity analysis.
2.3. Citrus Ripeness Division
The color change of the citrus fruit peel is an important phenotypic indicator of its maturity. During fruit development and ripening, significant changes occur in the peel pigment metabolism, primarily manifested by the continuous degradation of chlorophyll and the gradual accumulation of carotenoids. This dynamic process results in clear, staged transitions in peel color [
18], which can be divided into three typical developmental phases:
Green Stage (Chlorophyll-Dominant Period): Chlorophyll content is high, and carotenoid synthesis has not yet significantly begun.
Yellow–Green Transition Stage (Pigment Conversion Period): Chlorophyll degradation and carotenoid accumulation reach a dynamic balance.
Orange Stage (Carotenoid-Dominant Period): Carotenoids dominate, and chlorophyll content drops to its lowest point.
Based on the above color change patterns and referencing the definition of citrus maturity from Chenglin Wang et al. [
19], this study classifies citrus fruit maturity into the following three categories: Immature Fruit: more than 80% of the fruit’s peel area is green; Semi-Mature Fruit: neither the green nor the yellow area of the peel exceeds 80% of the total peel area; and Mature Fruit: more than 80% of the fruit’s peel area is orange. Images of the fruit corresponding to each maturity stage are shown in
Figure 3.
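To make the 80% area rule above concrete, the following minimal sketch shows one way such a criterion could be evaluated automatically from a single fruit crop, using fixed HSV color ranges to estimate the green and orange peel proportions. The function name, color thresholds, and the assumption that a fruit mask is already available are illustrative only and are not part of the annotation procedure used in this study.

```python
import cv2
import numpy as np

# Illustrative HSV ranges (assumptions, not calibrated to this dataset):
# green peel vs. orange peel; other hues (e.g., yellow-green) fall into neither mask.
GREEN_LO, GREEN_HI = (35, 40, 40), (85, 255, 255)
ORANGE_LO, ORANGE_HI = (5, 60, 60), (25, 255, 255)

def ripeness_from_peel(bgr_crop: np.ndarray, fruit_mask: np.ndarray) -> str:
    """Classify a single fruit crop as IM / SM / MA from peel color area ratios.

    bgr_crop   : BGR image patch containing one fruit (e.g., a detected box).
    fruit_mask : binary mask of peel pixels inside the crop (non-fruit pixels = 0).
    """
    hsv = cv2.cvtColor(bgr_crop, cv2.COLOR_BGR2HSV)
    peel = fruit_mask > 0
    total = peel.sum()
    if total == 0:
        return "unknown"

    green = cv2.inRange(hsv, GREEN_LO, GREEN_HI)[peel].astype(bool).sum()
    orange = cv2.inRange(hsv, ORANGE_LO, ORANGE_HI)[peel].astype(bool).sum()

    green_ratio, orange_ratio = green / total, orange / total
    if green_ratio > 0.8:          # more than 80% green peel -> Immature
        return "IM"
    if orange_ratio > 0.8:         # more than 80% orange peel -> Mature
        return "MA"
    return "SM"                    # neither color dominates -> Semi-Mature
```

In practice, such thresholds would need calibration to the cameras and lighting conditions described in Section 2.2; in this study the labels were assigned manually according to the criteria above.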
2.4. Citrus Dataset Construction
This study conducted a rigorous selection of image quality during the data preprocessing phase. Because the shooting equipment covered a wide area and lighting conditions varied over time, some low-quality samples were found among the 3264 originally captured images, including blurry images and images dominated by background. After manual cropping and screening, 2597 images with clearly visible fruit from the three collection points were retained, forming the effective dataset. The distribution of images across the acquisition devices is shown in
Table 1.
To improve the model’s generalization ability and accurately identify fruit maturity in complex backgrounds, this study applied various data augmentation strategies to process the original images. These strategies included horizontal flipping, random rotation, contrast adjustment, and brightness adjustment [
20,
21].
During the data organization process, it was found that there was a significant class imbalance in the sample distribution. This phenomenon mainly arises from the actual situation in agricultural production: mature citrus fruits are harvested promptly, leading to a significantly smaller number of mature samples compared to other maturity categories. This imbalance could affect the model’s ability to learn features from the less represented classes. To address this issue, this study adopted a differentiated data augmentation scheme to balance the sample distribution, as follows: for Immature (IM) samples, the original data amount remained unchanged; for Semi-Mature (SM) samples, flipping, rotation, and contrast enhancement were applied; and for Mature (MA) samples, brightness enhancement was added on top of the previous augmentations. Example images of the data augmentation are shown in
Figure 4.
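As an illustration of the differentiated augmentation scheme described above, the sketch below applies flipping, rotation, and contrast adjustment to Semi-Mature samples and adds brightness adjustment for Mature samples, while leaving Immature samples unchanged. The augmentation factors and ranges are assumed values chosen for illustration, not the exact settings used in this study.

```python
import random
from PIL import Image, ImageEnhance

def augment_by_class(img: Image.Image, label: str) -> list[Image.Image]:
    """Class-dependent augmentation sketch: IM kept as-is; SM gets flip, rotation, and
    contrast adjustment; MA additionally gets brightness adjustment."""
    out = [img]                                   # always keep the original image
    if label == "IM":
        return out
    # Semi-Mature and Mature: horizontal flip, random rotation, contrast adjustment
    out.append(img.transpose(Image.FLIP_LEFT_RIGHT))
    out.append(img.rotate(random.uniform(-30, 30), expand=True))
    out.append(ImageEnhance.Contrast(img).enhance(random.uniform(0.8, 1.2)))
    if label == "MA":
        # Mature: additional brightness adjustment on top of the previous operations
        out.append(ImageEnhance.Brightness(img).enhance(random.uniform(0.8, 1.3)))
    return out
```

Note that the geometric operations (flipping and rotation) also require the corresponding YOLO label files to be transformed so that the bounding boxes remain aligned with the augmented images.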
After data augmentation, the total number of images in the dataset was expanded to 5586, with the specific expansion details shown in
Table 2. All samples were annotated according to the maturity classification criteria on the MakeSense platform, and the annotation format used was the YOLO standard TXT text file. To ensure consistency in annotation, all the labeling work was completed by the same researcher. The final dataset was randomly divided into training (4468 images), validation (558 images), and test (560 images) sets in an 8:1:1 ratio.
3. Model Construction
3.1. YOLOv8
YOLOv8 is a single-stage object detection algorithm developed by Ultralytics, combining high performance with multi-task support. It is suitable for tasks such as object detection, instance segmentation, and pose estimation. For detection tasks, YOLOv8 offers five versions (n, s, m, l, and x), based on the network’s depth and width, to meet various application needs. Its architecture primarily consists of the backbone network, the neck network, and the detection head, which are used for multi-scale feature extraction, feature fusion enhancement, and the prediction of object classes and locations [
22]. The processing flow is as follows: after image preprocessing, the backbone network extracts multi-scale features, the neck network completes cross-scale fusion based on the feature pyramid, and the final output—class labels and bounding boxes—is generated by the detection head. Compared to previous models (e.g., YOLOv5 [
23], YOLOv7 [
24]), YOLOv8 introduces three major improvements in its architecture:
Module Optimization: The traditional C3 module is replaced with C2f, introducing cross-layer connections and multi-branch structures to improve gradient propagation efficiency and feature reuse capability.
Enhanced Feature Fusion: The PANet structure [
25] is upgraded to PAN-FPN [
26], strengthening cross-scale connections and fusion paths, thereby improving feature expression capability.
Decoupled Detection Head: The classification and regression task branches are independent, reducing feature interference between tasks and improving detection accuracy.
3.2. Improved Model ORD-YOLO
In complex backgrounds, YOLOv8 in citrus fruit maturity detection is susceptible to factors such as strong lighting, branch and leaf occlusion, and fruit overlap, which limits its detection performance. To improve recognition accuracy, this paper proposes an improved YOLOv8 architecture, ORD-YOLO, whose structure is shown in
Figure 5. The main improvements include the following:
Replacing the standard convolution with ODConv in the backbone network to enhance feature extraction capabilities. Standard convolutions have limited adaptability to spatial and contextual variations. ODConv introduces dynamic convolution across spatial, channel, and kernel dimensions, enabling the network to adjust its receptive field based on input features. This improves the model’s ability to extract diverse textures and semantic cues in complex orchard environments, enhancing backbone representation.
Replacing the Path Aggregation Network (PANet) with RepGFPN in the neck network to improve feature fusion. PANet may lead to feature redundancy and limited deep-semantic integration. RepGFPN addresses this by using structural re-parameterization, enhancing multi-scale feature fusion and improving efficiency. It effectively combines spatial detail with semantic depth, benefiting detection of small or occluded citrus fruits.
Using a Dynamic Detection Head (DynamicHead) to enhance the perception of key features. Static detection heads can struggle with varying object appearances. DynamicHead uses dynamic convolutions to adaptively focus on task-relevant features, improving robustness to variation in shape, texture, and background noise, and enhancing detection precision in real-world orchard conditions.
Optimizing the bounding box-regression loss function to InnerDIoU to improve localization accuracy. Traditional IoU-based losses may neglect fine localization at high overlaps. InnerDIoU considers both IoU and internal structure, refining boundary regression especially for densely packed or overlapping fruits, leading to more accurate localization in complex field scenes.
Figure 5.
ORD-YOLO network structure diagram.
3.2.1. ODConv Module
In complex citrus orchard environments, fruit maturity detection relies not only on model performance, but also on computational efficiency. To enhance feature extraction capabilities and computational efficiency, this paper replaces the original convolution layers with Omni-Dimensional Dynamic Convolution (ODConv). ODConv introduces a multi-dimensional attention mechanism that learns dynamic weights in parallel across four dimensions: the number of convolution kernels, spatial dimensions, input channels, and output channels. This approach significantly outperforms CondConv [
27] and DyConv [
28], which only adjust the number of convolution kernels, and offers stronger dynamic modeling capabilities. The computation method is shown in Equation (1):

y = (α_{w1} ⊙ α_{f1} ⊙ α_{c1} ⊙ α_{s1} ⊙ W_1 + … + α_{wn} ⊙ α_{fn} ⊙ α_{cn} ⊙ α_{sn} ⊙ W_n) ∗ x    (1)

In the above formula, α_{wi} represents the branch weight attention for the convolution kernel W_i, while α_{si}, α_{ci}, and α_{fi} represent the three newly introduced attention mechanisms, computed along the spatial dimension, input-channel dimension, and output-channel dimension of the convolution kernel W_i; ⊙ denotes multiplication along the corresponding dimension of the kernel space, and ∗ denotes the convolution operation. Here, α_{si}, α_{ci}, α_{fi}, and α_{wi} are all computed using the multi-head attention module π_i(x).
In ODConv, the attention mechanism of the convolution kernel includes four dimensions:
Spatial Attention α_{si}: Assigning weights at spatial positions of the kernel.
Input Channel Attention α_{ci}: Assigning weights to the input channels of each convolution filter W_i.
Output Channel Attention α_{fi}: Assigning different weights to the output channels (filters).
Global Attention α_{wi}: Assigning global weights to the entire convolution kernel W_i.
As shown in
Figure 6, the four types of attention are multiplied in sequence across the spatial, input-channel, output-channel, and global-kernel dimensions. This differential treatment across the dimensions enhances the feature extraction capabilities by providing more expressive power during the convolution operation.
Additionally, the four attention weights of the convolution kernel W_i, namely α_{si}, α_{ci}, α_{fi}, and α_{wi}, are all computed using the multi-head attention mechanism π_i(x), as shown in
Figure 7. The specific process is as follows: the input feature x is first compressed into a vector of length c_{in} through channel-level global average pooling (GAP). It is then mapped to a lower-dimensional space via a fully connected layer (FC) with a reduction ratio of r, and after a ReLU activation, it is split into four branches. Each branch applies a convolution of the appropriate size, followed by a Softmax or Sigmoid function, to generate the normalized attention weights.
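The following minimal PyTorch sketch illustrates the four-branch attention computation described above (GAP, an FC layer with reduction ratio r, ReLU, and four heads with Sigmoid or Softmax). The class name and layer sizes are assumptions for illustration; the sketch covers only the generation of the attention weights, not the subsequent weighted combination of the n candidate kernels.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OmniAttention(nn.Module):
    """Sketch of the ODConv attention branch: GAP -> FC (reduction ratio r) -> ReLU -> four
    heads producing spatial (k*k), input-channel, output-channel, and kernel-number weights."""
    def __init__(self, c_in, c_out, k, n_kernels=4, r=16):
        super().__init__()
        hidden = max(c_in // r, 16)
        self.fc = nn.Linear(c_in, hidden)
        self.spatial_head = nn.Linear(hidden, k * k)      # alpha_s: one weight per kernel position
        self.in_ch_head = nn.Linear(hidden, c_in)         # alpha_c: one weight per input channel
        self.out_ch_head = nn.Linear(hidden, c_out)       # alpha_f: one weight per output channel
        self.kernel_head = nn.Linear(hidden, n_kernels)   # alpha_w: one weight per candidate kernel

    def forward(self, x):                                  # x: (B, C_in, H, W)
        z = F.relu(self.fc(x.mean(dim=(2, 3))))            # channel-level GAP, then FC + ReLU
        a_s = torch.sigmoid(self.spatial_head(z))
        a_c = torch.sigmoid(self.in_ch_head(z))
        a_f = torch.sigmoid(self.out_ch_head(z))
        a_w = torch.softmax(self.kernel_head(z), dim=1)    # kernel-wise attention normalized by Softmax
        return a_s, a_c, a_f, a_w
```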
3.2.2. Efficient RepGFPN Feature Pyramid
The Feature Pyramid Network (FPN) [
29] effectively alleviates the detection challenges caused by scale variations of objects through multi-scale feature fusion. However, traditional FPN and its derivative structures (such as PANet and BiFPN [
30]) suffer from issues such as high computational cost and memory usage. In contrast, the Re-parameterizable Generalized Feature Pyramid Network (RepGFPN) introduces structural re-parameterization and cross-scale dynamic fusion mechanisms. It significantly improves inference efficiency, while retaining the multi-scale feature representation capabilities. Its structure is shown in
Figure 8.
Given the significant difference in FLOPs between feature maps of different scales and the limited computational resources that make it difficult to unify channel dimensions, RepGFPN adopts a strategy of setting different channel numbers for each scale in the neck feature fusion module. By flexibly adjusting the number of channels at each scale, RepGFPN significantly improves detection accuracy.
In terms of feature interaction, RepGFPN removes the upsampling introduced by Queen-Fusion in GFPN, reducing unnecessary computational overhead. To further enhance detection performance, RepGFPN replaces the traditional 3 × 3 convolution used for feature fusion with CSPNet [
31], and optimizes it with a Re-parameterization mechanism and Efficient Layer Aggregation Network (ELAN) structure. This approach improves fusion efficiency and model performance without increasing computational cost.
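The structural re-parameterization idea underlying RepGFPN can be illustrated with a minimal sketch in which a parallel 3 × 3 and 1 × 1 branch pair used during training is folded into a single 3 × 3 convolution for inference. This is a simplified, assumed example (BatchNorm folding is omitted for brevity), not the exact fusion block used in RepGFPN.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RepConvSketch(nn.Module):
    """Train-time: parallel 3x3 and 1x1 branches. Inference-time: one fused 3x3 convolution
    producing identical outputs. A real re-parameterized block also folds BatchNorm."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.conv3 = nn.Conv2d(c_in, c_out, 3, padding=1, bias=True)
        self.conv1 = nn.Conv2d(c_in, c_out, 1, bias=True)
        self.fused = None

    def forward(self, x):
        if self.fused is not None:                  # deployed: a single convolution
            return F.relu(self.fused(x))
        return F.relu(self.conv3(x) + self.conv1(x))

    @torch.no_grad()
    def reparameterize(self):
        """Fold the 1x1 branch into the 3x3 kernel (pad the 1x1 weights to 3x3 and add)."""
        fused = nn.Conv2d(self.conv3.in_channels, self.conv3.out_channels, 3,
                          padding=1, bias=True).to(self.conv3.weight.device)
        w1_padded = F.pad(self.conv1.weight, [1, 1, 1, 1])  # (c_out, c_in, 1, 1) -> (c_out, c_in, 3, 3)
        fused.weight.copy_(self.conv3.weight + w1_padded)
        fused.bias.copy_(self.conv3.bias + self.conv1.bias)
        self.fused = fused
```

Because the fused branch reproduces the training-time outputs exactly, the multi-branch structure adds representational capacity during training without increasing inference cost.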
3.2.3. Dynamic Detector
In object detection, the detection head is responsible for converting the extracted features into the final prediction results. For citrus fruits, which have complex image features across different scenes, the detection head needs to possess superior feature representation and transformation capabilities to accurately localize the fruits and categorize their maturity. To address this challenge, this paper proposes an improved method based on Dynamic Head.
Unlike traditional detection heads, DynamicHead integrates three attention mechanisms—scale, spatial, and task-aware—into a unified self-attention structure. This design follows a hierarchical strategy, deploying specialized attention modules at the layer, spatial, and channel dimensions to enable multi-dimensional feature learning. The structure is shown in
Figure 9.
Let the feature pyramid consist of L layers of feature maps {F_i, i = 1, …, L}. After unification to a middle scale through upsampling or downsampling, a four-dimensional tensor F ∈ R^{L×H×W×C} is obtained, where L, H, W, and C represent the number of layers, height, width, and channels, respectively. To streamline the notation, define S = H × W, reducing the tensor F to a three-dimensional tensor F ∈ R^{L×S×C}. The self-attention computation in DynamicHead is defined in Equation (2):

W(F) = π_C(π_S(π_L(F) · F) · F) · F    (2)

In this equation, π_L(·), π_S(·), and π_C(·) represent the attention functions applied to the L, S, and C dimensions, respectively. The architecture employs a decomposed attention mechanism, effectively reducing the complexity introduced by high-dimensional tensor computations.
- 1.
Scale-Aware Attention π_L:
This mechanism dynamically integrates features across different scales, weighted by their semantic importance. It is computed as follows:

π_L(F) · F = σ( f( (1 / (S·C)) Σ_{S,C} F ) ) · F

In this context, f(·) refers to the linear function implemented by a 1 × 1 convolution, and σ(x) = max(0, min(1, (x + 1)/2)) is the hard-sigmoid activation function.
- 2.
Spatial-Aware Attention π_S:
This attention mechanism is employed to capture discriminative spatial features. It begins with sparse attention learning using deformable convolutions, followed by the aggregation of features at the same spatial location across layers:

π_S(F) · F = (1/L) Σ_{l=1}^{L} Σ_{k=1}^{K} w_{l,k} · F(l; p_k + Δp_k; c) · Δm_k

In this case, K refers to the number of sampling points, p_k + Δp_k represents the attention position obtained by the self-learned offset Δp_k, and Δm_k is the self-learned importance scalar at position p_k, both derived from the median-level features of F.
- 3.
Task-Aware Attention π_C:
To effectively support tasks such as bounding box regression and center point prediction, a task-aware mechanism is introduced:

π_C(F) · F = max( α¹(F) · F_c + β¹(F), α²(F) · F_c + β²(F) )

In the above equation, F_c represents the feature slice of channel c, and θ(·) = [α¹, α², β¹, β²]^T is the hyper function used to learn the activation thresholds. θ(·) is derived by applying global average pooling across the L × S dimensions, followed by two fully connected layers. This produces normalized weights, which are then passed through a shifted sigmoid function so that the output values lie within the range of [−1, 1].
In conclusion, DynamicHead is constructed by sequentially combining the three attention mechanisms π_L, π_S, and π_C, as shown in
Figure 10. This structure significantly enhances its adaptability and accuracy for multi-task detection.
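A compact sketch of the sequential attention in Equation (2) is given below for a feature tensor of shape (B, L, S, C). For brevity, it replaces the deformable sampling of π_S with a simple per-location gate and approximates the task-aware thresholding with a tanh gate, so it should be read as an illustration of the decomposition, not the exact DynamicHead implementation.

```python
import torch
import torch.nn as nn

class DynamicHeadSketch(nn.Module):
    """Simplified sketch of W(F) = pi_C(pi_S(pi_L(F).F).F).F on a tensor F of shape (B, L, S, C)."""
    def __init__(self, channels):
        super().__init__()
        self.scale_fc = nn.Linear(channels, 1)         # pi_L: one gate per pyramid level
        self.spatial_fc = nn.Linear(channels, 1)       # simplified pi_S: one gate per spatial position
        self.task_fc = nn.Linear(channels, channels)   # simplified pi_C: channel-wise gating

    def forward(self, f):                              # f: (B, L, S, C)
        # Scale-aware: hard-sigmoid gate per level from features averaged over S and C
        level_stat = f.mean(dim=2)                                     # (B, L, C)
        a_l = torch.clamp((self.scale_fc(level_stat) + 1) / 2, 0, 1)   # hard-sigmoid gate, (B, L, 1)
        f = f * a_l.unsqueeze(-1)

        # Spatial-aware (simplified): gate each spatial position, shared across levels
        a_s = torch.sigmoid(self.spatial_fc(f.mean(dim=1)))            # (B, S, 1)
        f = f * a_s.unsqueeze(1)

        # Task-aware (simplified): channel gating from globally pooled features, output in [-1, 1]
        a_c = torch.tanh(self.task_fc(f.mean(dim=(1, 2))))             # (B, C)
        return f * a_c[:, None, None, :]
```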
3.2.4. InnerDIoU Loss Function
The object detection model is optimized by comparing the differences between the predicted bounding boxes and the ground-truth boxes. The loss function, as the core mechanism for assessing prediction accuracy, plays a crucial role in determining the model’s training effectiveness and overall performance. Given the stringent requirements for model convergence speed and detection accuracy (including false positives and false negatives) in citrus fruit maturity detection, this paper introduces the InnerDIoU loss function to enhance the YOLOv8 model.
In this context, the efficiency of bounding box regression is vital for both model convergence and localization precision. The commonly used metric, IoU (Intersection over Union) [
32], is employed to measure the overlap between the predicted and ground-truth boxes. The calculation is as follows:

IoU = |B ∩ B^{gt}| / |B ∪ B^{gt}|

Here, B^{gt} = (x^{gt}, y^{gt}, w^{gt}, h^{gt}) and B = (x, y, w, h) refer to the coordinates of the ground-truth bounding box and the predicted bounding box, respectively. It is crucial to note that the IoU is applicable only when there is an overlap between the predicted and ground-truth boxes. The IoU-based loss function is expressed as

L = 1 − IoU + R(B, B^{gt})

In the above formula, R(B, B^{gt}) refers to the penalty term between B and B^{gt}.
DIoU [
33] extends IoU by adding the normalized Euclidean distance between the center points as a penalty term, which helps speed up convergence. The definition is as follows:

R_{DIoU} = ρ²(b, b^{gt}) / c²

In the above equation, b and b^{gt} represent the center points of the two bounding boxes, ρ(·) is the Euclidean distance, and c is the diagonal length of the smallest enclosing rectangle that contains both boxes. The DIoU loss function is defined as

L_{DIoU} = 1 − IoU + ρ²(b, b^{gt}) / c²
InnerDIoU, an extension of DIoU, further incorporates the overlap of the central region of the bounding box, and uses the geometric distance between the central regions as a penalty term to achieve more refined bounding-box regression optimization. While DIoU focuses on the overlap of the entire box and the distance between center points, InnerDIoU prioritizes consistency modeling of key internal regions within the box. This approach is especially effective for complex detection scenarios where citrus fruits may be occluded or overlapping. Not only does it enhance the model’s localization robustness, but it also speeds up regression convergence. The loss function of InnerDIoU and its core component, InnerIoU, can be written as

InnerIoU = |B_{inner} ∩ B_{inner}^{gt}| / |B_{inner} ∪ B_{inner}^{gt}|

L_{InnerDIoU} = 1 − InnerIoU + ρ²(b, b^{gt}) / c²

Here, B_{inner} and B_{inner}^{gt} denote the central (inner) regions of the predicted and ground-truth boxes, and InnerIoU represents the IoU between these two central areas.
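The sketch below shows one way the above quantities can be computed for axis-aligned boxes: a standard IoU, the DIoU center-distance penalty, and an InnerIoU term obtained by shrinking each box around its center by a ratio factor. The shrink ratio and the exact combination of the inner term with the penalty are assumptions consistent with the description above, not necessarily the precise formulation used in this paper.

```python
import torch

def _iou_xyxy(a, b, eps=1e-7):
    """IoU of boxes in (x1, y1, x2, y2) format; a and b have shape (N, 4)."""
    x1 = torch.max(a[:, 0], b[:, 0]); y1 = torch.max(a[:, 1], b[:, 1])
    x2 = torch.min(a[:, 2], b[:, 2]); y2 = torch.min(a[:, 3], b[:, 3])
    inter = (x2 - x1).clamp(0) * (y2 - y1).clamp(0)
    area_a = (a[:, 2] - a[:, 0]) * (a[:, 3] - a[:, 1])
    area_b = (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])
    return inter / (area_a + area_b - inter + eps)

def _shrink(box, ratio):
    """Shrink a box around its center by `ratio` to obtain the 'inner' (central) region."""
    cx, cy = (box[:, 0] + box[:, 2]) / 2, (box[:, 1] + box[:, 3]) / 2
    w, h = (box[:, 2] - box[:, 0]) * ratio, (box[:, 3] - box[:, 1]) * ratio
    return torch.stack([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2], dim=1)

def inner_diou_loss(pred, target, ratio=0.75, eps=1e-7):
    """Sketch: DIoU center-distance penalty combined with the IoU of the central regions."""
    inner_iou = _iou_xyxy(_shrink(pred, ratio), _shrink(target, ratio), eps)
    # Normalized squared center distance (the DIoU penalty term)
    pcx, pcy = (pred[:, 0] + pred[:, 2]) / 2, (pred[:, 1] + pred[:, 3]) / 2
    tcx, tcy = (target[:, 0] + target[:, 2]) / 2, (target[:, 1] + target[:, 3]) / 2
    rho2 = (pcx - tcx) ** 2 + (pcy - tcy) ** 2
    ex1 = torch.min(pred[:, 0], target[:, 0]); ey1 = torch.min(pred[:, 1], target[:, 1])
    ex2 = torch.max(pred[:, 2], target[:, 2]); ey2 = torch.max(pred[:, 3], target[:, 3])
    c2 = (ex2 - ex1) ** 2 + (ey2 - ey1) ** 2 + eps   # squared diagonal of the smallest enclosing box
    return 1.0 - inner_iou + rho2 / c2
```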
3.3. Test Environment and Configuration
The experimental setup for this study is as follows: the hardware system consists of an AMD Ryzen 9 7945HX processor (base frequency of 2.50 GHz) paired with an NVIDIA GeForce RTX 4060 GPU (8 GB VRAM), and 16 GB of RAM. The software environment is based on the 64-bit version of Windows 11, with the PyTorch 1.12.1 deep learning framework and Python 3.10.15 as the programming language. The development environment used is PyCharm 2024, and GPU acceleration is implemented using CUDA 11.6.
The model training parameters are as follows: input images are resized to 640 × 640 pixels, with a batch size of 16. The optimizer is Stochastic Gradient Descent (SGD) [
34], with an initial learning rate of 0.001 and a weight decay coefficient of 0.0005.
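For reference, the reported training configuration corresponds to a call like the following with the Ultralytics training API; the model and dataset YAML file names are placeholders, since the actual ORD-YOLO configuration files are not published here.

```python
from ultralytics import YOLO

# Placeholder model definition file for the modified architecture (assumed name).
model = YOLO("ord-yolo.yaml")

# Training settings matching the configuration reported above.
model.train(
    data="citrus_ripeness.yaml",   # dataset config (assumed filename) with IM / SM / MA classes
    imgsz=640,                     # input images resized to 640 x 640
    batch=16,
    epochs=500,
    optimizer="SGD",
    lr0=0.001,                     # initial learning rate
    weight_decay=0.0005,
    device=0,                      # single NVIDIA GPU
)
```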
3.4. Evaluation Metrics
For the citrus orchard fruit maturity detection task, it is essential to evaluate both the model’s detection accuracy and its adaptability to complex scenarios. This study establishes multiple quantitative evaluation criteria across various dimensions. Key metrics selected for measuring detection accuracy include Precision, Recall, and Mean Average Precision (mAP) [
35], which assess the model’s ability to correctly identify mature fruits, retrieve all true mature fruits, and perform across multiple categories.
Precision: the ratio of correctly identified mature fruits (True Positive, TP) to all predicted mature fruits (TP + False Positive, FP). Recall: the ability of the model to correctly detect all true mature fruits (TP relative to TP + False Negative, FN). They are defined as follows:

Precision = TP / (TP + FP), Recall = TP / (TP + FN)

Average Precision (AP): evaluates the model’s detection performance for a single class as the area under the precision–recall curve. Mean Average Precision (mAP): the mean of the AP over all categories, offering an overall measure of the model’s detection performance. The mathematical definitions of both are as follows:

AP = ∫₀¹ P(R) dR, mAP = (1/N) Σ_{i=1}^{N} AP_i

where N is the number of categories.
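As a reference for how these accuracy metrics are obtained, the sketch below computes Precision, Recall, and AP for a single class from confidence-ranked detections that have already been matched to ground truth at a fixed IoU threshold; averaging the per-class AP values gives mAP. The matching step itself is omitted, and the function is illustrative rather than the evaluation code used in this study.

```python
import numpy as np

def precision_recall_ap(confidences, is_tp, n_gt):
    """Compute Precision, Recall, and AP for one class.

    confidences : detection confidence scores
    is_tp       : 1 if the detection matched a ground-truth box (IoU above threshold), else 0
    n_gt        : total number of ground-truth boxes of this class
    """
    order = np.argsort(-np.asarray(confidences))
    tp = np.cumsum(np.asarray(is_tp)[order])
    fp = np.cumsum(1 - np.asarray(is_tp)[order])
    recall = tp / max(n_gt, 1)
    precision = tp / np.maximum(tp + fp, 1e-9)
    # AP: area under the precision-recall curve (all-point interpolation)
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([1.0], precision, [0.0]))
    p = np.maximum.accumulate(p[::-1])[::-1]        # make precision monotonically non-increasing
    ap = np.sum((r[1:] - r[:-1]) * p[1:])
    last_p = precision[-1] if len(precision) else 0.0
    last_r = recall[-1] if len(recall) else 0.0
    return last_p, last_r, ap

# mAP@50 is then the mean of the per-class AP values computed at an IoU threshold of 0.5.
```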
Regarding computational efficiency, core metrics like model weight size, parameter count (Parameters), detection speed (Speed), and floating-point operations (GFLOPs) [
36] are introduced. These accuracy and efficiency metrics together form a multidimensional evaluation framework, providing a comprehensive assessment of the model’s performance in citrus fruit maturity detection.
4. Results
4.1. Improved Model Results
To validate the effectiveness of the improved model ORD-YOLO in detecting citrus fruit maturity, this study compares its detection performance with that of YOLOv8 on the citrus dataset. After 500 training epochs, the precision, recall, and mean average precision (mAP) of ORD-YOLO are 93.83%, 91.62%, and 96.92%, respectively. In comparison, the corresponding metrics for YOLOv8 are 89.17%, 88.32%, and 93.92%, indicating that ORD-YOLO improves upon YOLOv8 by 4.66%, 3.30%, and 3.00%, respectively. Additionally, the ORD-YOLO model has a weight file size of 12.9 MB, 14.1 GFLOPs, a detection speed of 7.9 ms per image, and a parameter count of 6.2 M. To assess the time overhead of training, we recorded the training time: the average time per epoch is approximately 3.71 min, and the total training time is about 1855 min (500 epochs), indicating that the model trains efficiently, which is conducive to subsequent optimization and deployment on resource-constrained devices.
Table 3 shows its superior detection performance across different maturity categories.
Specifically, the model achieves the highest recognition accuracy for MA, reaching 93.5%; both IM and SM achieve 93.3%. In terms of recall, SM is the highest, at 92.5%, followed by IM and MA, at 91.9% and 88.7%, respectively. Regarding the mean average precision (mAP@50), both IM and SM reach 96.9%, while MA stands at 96.6%. The loss function is a core metric for evaluating the model’s detection performance, reflecting both the model’s convergence speed and detection accuracy [
37]. Choosing an appropriate loss function helps accelerate model convergence and improve detection performance.
Figure 11 shows the loss curves of ORD-YOLO and YOLOv8 under the same training configuration. ORD-YOLO exhibits faster convergence and lower final loss values for bounding box regression, object classification, and the distribution focal loss (DFL), with smoother training curves. Specifically, the bounding box loss decreases to about 0.55, a reduction of approximately 39%; the classification loss stabilizes at 0.55, a reduction of about 5%; and the distribution focal loss settles at around 1.3, a reduction of about 11%. These results demonstrate the significant advantages of ORD-YOLO in both localization and classification accuracy, confirming the effectiveness and robustness of its improvement strategy.
Additionally, to verify the generalization ability of the improved model, prediction experiments comparing ORD-YOLO and YOLOv8 were conducted on four typical scenarios in the test set, as shown in
Figure 12. The experiments demonstrate that both models can correctly identify fruit in cases of severe occlusion, but ORD-YOLO exhibits significantly higher confidence than YOLOv8. In overlapping fruit scenarios, YOLOv8 experiences repeated detections, while ORD-YOLO performs more accurately. Under shadow occlusion, ORD-YOLO still maintains high confidence. In natural-light single-fruit scenarios, YOLOv8 suffers from false detections, while ORD-YOLO shows higher confidence. Even under strong light interference, ORD-YOLO outperforms YOLOv8. Overall, the results indicate that ORD-YOLO demonstrates stronger robustness and superior detection performance in complex environments.
4.2. Ablation Experiments
This study sequentially introduces four improvements to the YOLOv8 model: (1) replacing standard convolution with ODConv; (2) replacing the feature aggregation network with RepGFPN; (3) changing the detection head to DynamicHead; and (4) optimizing the loss function to InnerDIoU. To validate the effectiveness of each module, ablation experiments with gradual additions were designed, and the results are shown in
Table 4.
After introducing ODConv, the model’s recall (R) and mean average precision at 50 (mAP@50) increased by 1.15% and 0.42%, respectively, while GFLOPs decreased by 0.9. When RepGFPN was introduced, precision (P), recall (R), and mAP@50 increased by 0.34%, 1.74%, and 0.53%, respectively. Combining ODConv and RepGFPN resulted in a further increase in P, R, and mAP@50 by 2.5%, 2.12%, and 1.76%. When RepGFPN and DynamicHead were combined, P, R, and mAP@50 increased by 1.3%, 1.54%, and 1.51%. With the joint application of all three modules, the model’s P, R, and mAP@50 improved by 2.95%, 2.06%, and 2.09%, respectively. Finally, after adding the InnerDIoU loss function, P, R, and mAP@50 were further improved, by 4.66%, 3.3%, and 3%.
Although the addition of these modules increased the model’s weight, the detection performance (P, R, mAP@50) was significantly enhanced, fully validating the effectiveness of each module in improving the model’s accuracy and robustness. Especially in terms of GFLOPs, despite the increase in model complexity after introducing ODConv, RepGFPN, and DynamicHead, GFLOPs decreased by 0.9 compared to the model with only RepGFPN and DynamicHead. Additionally, when ODConv and RepGFPN were combined, GFLOPs decreased by 1.1, compared to using RepGFPN alone. This improvement is attributed to the dynamic weight mechanism and multi-dimensional modeling capability of ODConv. The ablation experiment results fully verify the effectiveness of the improvement modules in enhancing the detection accuracy and robustness of the ORD-YOLO model.
To visually demonstrate the impact of the improvement strategies on the model’s detection performance, Grad-CAM [
38] heatmaps were used for the ablation experiment visualization. The results are shown in
Figure 13, where deeper colors and larger areas indicate higher model attention (i.e., higher confidence). The results reveal that ORD-YOLO outperforms the other models in both fruit focus ability and attention distribution. In contrast, YOLOv8 exhibits issues with missed detections and attention dispersion. ODConv, through dynamic weight adjustment, enables the model to focus on key regions; RepGFPN effectively enhances the local detail representation of the fruit, alleviating the attention dispersion problem; and DynamicHead further strengthens the response in the core fruit area, improving the model’s perceptual ability. A comprehensive analysis indicates that ORD-YOLO accurately focuses on the citrus fruit’s position, with heatmaps showing attention concentrated on the fruit even in complex backgrounds. This demonstrates stronger target recognition and attention ability, further validating the effectiveness of the improvement strategies.
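For context, heatmaps such as those in Figure 13 follow the standard Grad-CAM recipe: the activations of a chosen layer are weighted by the spatially pooled gradients of a scalar detection score. The generic sketch below illustrates that recipe with forward and backward hooks; the choice of target layer and of the scoring function is model-specific and is an assumption here, not the exact visualization code used in this study.

```python
import torch
import torch.nn.functional as F

def grad_cam(model, image, target_layer, score_fn):
    """Generic Grad-CAM sketch: weight a layer's activations by the spatially pooled gradients
    of a scalar detection score, then apply ReLU and normalize to obtain a heatmap.

    score_fn(outputs) must reduce the model outputs to a single scalar (e.g., the top
    objectness or class score of the fruit of interest); choosing it is model-specific.
    """
    activations, gradients = {}, {}

    def fwd_hook(_m, _inp, out):
        activations["a"] = out

    def bwd_hook(_m, _gin, gout):
        gradients["g"] = gout[0]

    h1 = target_layer.register_forward_hook(fwd_hook)
    h2 = target_layer.register_full_backward_hook(bwd_hook)
    try:
        outputs = model(image)                           # image: (1, 3, H, W)
        score = score_fn(outputs)
        model.zero_grad()
        score.backward()
        a, g = activations["a"], gradients["g"]          # both (1, C, h, w)
        weights = g.mean(dim=(2, 3), keepdim=True)       # GAP over spatial dims -> channel weights
        cam = F.relu((weights * a).sum(dim=1, keepdim=True))
        cam = F.interpolate(cam, size=image.shape[2:], mode="bilinear", align_corners=False)
        cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
    finally:
        h1.remove(); h2.remove()
    return cam[0, 0]                                     # (H, W) heatmap in [0, 1]
```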
4.3. Comparative Experiments
To comprehensively evaluate the performance of ORD-YOLO, this study compares it with mainstream detection algorithms (YOLOv5, YOLOv8, YOLOv10n [
39], YOLOv11 [
40], SSD [
41], and CenterNet [
42]), as shown in
Table 5. ORD-YOLO demonstrates the best overall performance in citrus fruit maturity detection. Its accuracy, recall, and mAP@50 reach 93.83%, 91.62%, and 96.92%, respectively, significantly outperforming YOLOv5, YOLOv8, YOLOv10n, and YOLOv11, as well as traditional SSD and CenterNet models. Although ORD-YOLO’s parameter count (6.2 M) and model size (12.9 MB) are slightly larger, and its detection speed (7.9 ms) is slower than some YOLO models, it offers significant advantages in detection accuracy and stability, demonstrating stronger robustness and detection capability. In contrast, while SSD has a higher recall rate (94.43%), its precision (P) and mAP@50 are both lower than ORD-YOLO’s. CenterNet, though achieving the highest accuracy (96.87%), has a relatively lower recall rate (77.54%). Overall, ORD-YOLO strikes a good balance between detection accuracy and model capacity, while maintaining fast inference speed, showing great potential for wide applications in complex environments.
To comprehensively evaluate the performance of the aforementioned models in different environments, this study visualizes the detection results on the test set, as shown in
Figure 14. The experiments cover five complex environments, and the detection performance varies accordingly. The results show that CenterNet performs the worst in all scenarios (lowest confidence) and experiences a significant decline in detection performance under complex conditions compared to the other models. SSD performs better in three scenarios: close-range single fruit, strong-light fruit, and branch-and-leaf-occluded fruit, but suffers from repeated recognition and classification errors in rainy conditions. While YOLOv5, YOLOv8, and YOLOv11 did not exhibit missed or false detections, their overall confidence was lower than that of ORD-YOLO. YOLOv10n missed detections in the strong-light fruit-overlap environment. In summary, ORD-YOLO maintains stable and high detection accuracy across complex environments, demonstrating stronger robustness. The confidence levels of YOLOv5, YOLOv8, and YOLOv11 vary significantly across the different complex environments, while SSD faces issues with repeated recognition and false detections.
5. Discussion
The ORD-YOLO citrus maturity detection model proposed in this study maintains high detection accuracy, even in complex backgrounds. By incorporating ODConv, RepGFPN, DynamicHead, and InnerDIoU modules into different structures of the YOLOv8 model, the model’s feature extraction capability for multi-scale objects is significantly enhanced. The ORD-YOLO model offers key advantages for real-world orchard management and intelligent harvesting. First, it shows strong robustness in complex environments, effectively handling lighting variations, occlusions, and cluttered backgrounds, to ensure accurate and stable ripeness recognition. Second, its high inference speed and lightweight design support real-time detection on resource-limited devices like harvesting robots and mobile terminals. Third, improved precision and recall reduce missed and false detections, minimizing human error and enhancing harvesting efficiency. Lastly, its design and optimization provide broad applicability, serving as a valuable reference for maturity detection in other fruits and vegetables and advancing agricultural automation.
Although ORD-YOLO has achieved initial success in detecting citrus fruits and classifying their maturity, and has demonstrated its effectiveness and robustness in complex backgrounds, there are still certain limitations. First, this study divides fruit maturity into three levels, which fails to fully capture the subtle changes in fruit maturity from unripe to fully ripe, limiting the model’s ability to express the continuous evolution of maturity and making it difficult to meet the demands of precision orchard management. Therefore, future research could introduce more detailed grading standards and combine multi-source information such as image time-series and spectral data to enhance the model’s ability to perceive the dynamic changes in maturity. Currently, the model has not been deployed in real-world environments or tested on edge devices, and remains in the preliminary experimental phase. In the future, we plan to gradually deploy the model in real-world settings, with a particular focus on conducting extensive testing and evaluation on edge devices.
Secondly, this study focused on the early-maturing “Xingjin Honey Tangerine” variety, which, while representative, has variety-specific characteristics in terms of fruit shape, color changes, and maturation cycles. The adaptability and generalization of the trained model to other mid- to late-maturing citrus varieties still need further verification. To improve the model’s generality and practical value, future research should expand the sample varieties and construct diverse datasets for further study. Finally, although the ORD-YOLO model has shown significant improvements in accuracy and stability compared to the baseline model, its complexity has also increased. Compared to the original model, the parameter count has increased by 3.2 M, and the model weight has increased by 6.5 MB. While this brings performance advantages, it also increases the dependence on computational resources, limiting its efficient deployment on edge devices. Future research could focus on model compression, pruning optimization, and lightweight design (such as introducing Ghost modules, MobileNet architectures, etc.) to further reduce model complexity and inference costs. This would improve real-time performance and deployment flexibility, while ensuring accuracy, and would drive the continued evolution of orchard management systems towards intelligence and low-cost solutions.
6. Conclusions
This study proposes an improved algorithm, ORD-YOLO, based on YOLOv8, for citrus fruit maturity recognition. The model achieves high accuracy and stability in complex orchard environments, showing no missed or false detections, and demonstrates strong robustness and applicability. Compared to manual inspection, it effectively addresses issues such as low efficiency, subjectivity in classification, and difficulty in identifying targets within complex backgrounds. Furthermore, it provides visual guidance for harvesting robots, enhancing automation in orchard management.
Results show ORD-YOLO significantly improves detection over YOLOv8, with faster convergence and lower losses during training. Despite slight increases in weight size and GFLOPs, precision, recall, and mAP@50 reach 93.83%, 91.62%, and 96.92%, respectively. The GFLOPs increase enhances feature learning, improving stability against complex backgrounds. Compared to SSD, CenterNet, and YOLO models, ORD-YOLO leads in all key metrics; although CenterNet has higher precision (96.87%), its recall (77.54%) and mAP@50 (95%) are lower. SSD has high recall (94.43%) but lower precision (87.69%). Compared to YOLOv8, ORD-YOLO improves precision, recall, and mAP@50 by 4.66%, 3.30%, and 3.00%. It maintains real-time speed (7.9 ms), outperforming SSD (96.15 ms) and CenterNet (41.67 ms), balancing accuracy and efficiency for agriculture. Test results confirm accurate fruit localization and classification, without false or duplicate detections. Through optimized architecture and loss functions, ORD-YOLO achieves enhanced accuracy and efficiency, suitable for precise fruit detection in complex agricultural settings, overcoming small target detection challenges inherent in complex backgrounds.
The improved citrus fruit maturity detection model, ORD-YOLO, performs excellently in complex orchard environments. It accurately identifies three maturity stages—IM, SM, and MA—with an average accuracy of 96.2%. Besides precise maturity classification, the model supports intelligent orchard management by aiding farmers in harvest decisions, reducing errors, and enhancing fruit value. This ensures precise harvesting and improved fruit quality. Its design also offers insights for maturity detection in other fruits and vegetables, advancing intelligent, automated agriculture.