1. Introduction
Citrus, the world’s largest fruit crop and an important economic crop, has become one of the most widely produced, consumed, and processed fruits globally, owing to its rich nutritional components and bioactive substances [
1]. Yunnan, one of the original habitats of citrus, has developed into a core production area for early and late-maturing citrus varieties, relying on its unique climatic and geographical conditions. Since the 21st century, the local citrus planting scale has expanded rapidly, making it one of the most important citrus production bases in China [
2]. According to statistics, in 2022, the national citrus planting area reached 2.99 million hectares, and in 2023, production exceeded 64.33 million tons, with an enormous harvesting workload. At present, citrus harvesting mainly relies on manual labor, which results in low efficiency, high costs, and maturity misjudgments [
Therefore, developing accurate citrus maturity recognition technology for natural environments is of great practical significance. As a key step in fruit production, the quality of citrus picking directly affects the fruit’s quality, storage performance, and commercial value.
With the continuous advancement of smart agriculture, fruit maturity recognition has become a key technological component in orchard management and automated harvesting. Traditional manual inspection methods are not only inefficient, but also have a high degree of subjectivity in the maturity grading process. This is especially problematic in complex orchard environments, where dense foliage and frequent changes in lighting make it difficult to ensure the accuracy and consistency of recognition results, severely limiting the timeliness and standardization of fruit harvesting. Compared to other fruit types such as apples, tomatoes, and blueberries, citrus fruits pose unique challenges for visual detection tasks. First, citrus trees typically have dense canopies, leading to frequent and severe occlusions of fruits by leaves or adjacent fruits. Second, the ripening process often involves uneven color transitions, with different parts of the same fruit exhibiting varying maturity levels. Third, citrus surfaces are characterized by rough textures and prominent pores, which complicate the extraction of texture features. Additionally, their relatively large size demands higher localization precision, especially under occlusion. These characteristics, collectively, make the detection and classification of citrus fruit maturity more complex than for many other fruit types, emphasizing the need for specialized detection models. In real-world orchard environments, fruit maturity recognition is further challenged by a range of complex factors that significantly impact detection accuracy and robustness. The term “complex environments” in this study mainly refers to (1) dynamic and variable natural lighting conditions, including strong sunlight, shadows, and backlighting, which can degrade image quality; (2) frequent occlusions caused by dense foliage or overlapping fruits, which reduce target visibility; (3) highly cluttered and variable backgrounds, such as branches, soil, fruit bags, and irrigation equipment, which introduce noise into the detection process; and (4) non-uniform color development during ripening, where parts of the citrus fruit may remain green or yellow-green even when fully mature. These challenges limit the effectiveness of conventional methods that rely solely on color or morphological features, underscoring the need for robust and adaptive detection models capable of operating under real orchard conditions.
In recent years, deep learning technologies have been widely applied in agricultural visual recognition tasks, and fruit maturity recognition methods based on these technologies have gained attention, due to their high classification accuracy. However, there are still several challenges in practical deployment: (1) sample class imbalance, which can cause detection bias in models; (2) interference from complex backgrounds and fluctuating lighting, which affects detection stability; and (3) the coexistence of fruits at multiple maturity stages, which increases the difficulty of model judgment and affects the precision and accuracy of recognition.
To address these issues, this paper proposes an improved YOLOv8 object detection algorithm, ORD-YOLO, for automatic fruit recognition and maturity grading, using citrus fruits as a case study. To overcome the performance bottlenecks of existing models in complex environments, the proposed algorithm employs a differentiated data augmentation strategy to mitigate the sample imbalance problem. It also introduces four key improvement modules: the ODConv [
4] module, which enhances feature expression capabilities and improves detection robustness in complex backgrounds; the RepGFPN [
5] structure, which optimizes the feature pyramid network to accelerate information fusion and inference speed; the DynamicHead [
6] module, which guides a dynamic attention mechanism to strengthen key feature extraction; and the InnerDIoU loss function, which optimizes bounding box location accuracy. Experimental results show that ORD-YOLO performs excellently in real orchard scenarios, effectively reducing false negatives and false positives, and demonstrating strong robustness and practicality. Compared to traditional methods, this model overcomes issues such as low efficiency, high subjectivity, and significant background interference, while significantly improving the automation level of the system. It can provide reliable visual perception support for harvesting robots, promoting the continuous development of orchard management toward precision and intelligence.
The ORD-YOLO model enables fast, objective citrus ripeness detection, enhancing orchard management through automation and precision. It addresses the inefficiency and subjectivity of manual assessment, improving harvest timing and fruit quality with strong accuracy and robustness. Its real-time, lightweight design fits harvesting robots and mobile devices, cutting labor costs. By applying computer vision and deep learning, ORD-YOLO drives agricultural digitalization and supports sustainable, precision farming across various crops.
2. Materials and Methods
2.1. Related Work
Currently, three main methods are used to determine maturity: the manual judgment method, which is simple and quick but highly subjective and prone to misjudgment; the physico–chemical method, which yields accurate results but is destructive and cumbersome [
7]; and the machine vision method, which combines high efficiency, non-destructive testing, and high accuracy, making it possible to achieve intelligent recognition of fruit maturity.
Fruit maturity detection methods based on physico–chemical properties generally involve quantifying indicators such as sugar content, acidity, and the sugar–acid ratio. Josué Barragán-Iglesias et al. [
8] established a maturity prediction model by analyzing the changes in parameters such as color, TSS, and hardness during the ripening process of papayas. Shailendra Kumar et al. [
9] developed a maturity index based on physiological and biochemical changes, such as respiration, ethylene release, and sugar accumulation, which can accurately identify crop maturity and determine the optimal harvest time. Arun Kumar Gupta et al. [
10] developed a lattice electrode sensor (IDE) that detects grapefruit maturity accurately by measuring flavonoid content. While physico–chemical methods provide reliable and accurate results, they suffer from limitations such as long detection cycles, high dependence on equipment, and destructive sampling, which make them unsuitable for real-time orchard detection. Machine vision-based methods for intelligent fruit maturity recognition achieve non-destructive automated classification through image feature analysis and deep learning algorithms. The bimodal (visual–tactile) avocado grading system developed by Junchang Zhang et al. [
11] achieves precise classification but has limitations such as high equipment costs, complex operations, and low efficiency. Sitti Wetenriajeng Sidehabi et al. [
12] developed a passion fruit grading system based on RGB and a* chromatic features, which achieved 90% accuracy but risks overfitting, due to insufficient sample size. H. Gan et al. [
13] proposed a visible-light–thermal-infrared image fusion method for citrus detection, which improved early yield estimation accuracy but was susceptible to environmental interference. The lightweight MSC-YOLOv8 model designed by Tian Youwen et al. [
14] improved blueberry detection speed, but sacrificed deep feature extraction ability and was affected by sample class imbalance. Existing methods all suffer from stability issues, and there is an urgent need for the development of robust maturity detection algorithms that can adapt to the complex environment of orchards.
Zhengyang Zhong et al. [
15] proposed a lightweight and efficient mango detection model, LightYOLO, based on the Darknet53 architecture. This model not only achieves fast and accurate detection, but also significantly reduces false positives and missed detections, while lowering the number of parameters and computational cost (FLOPs). To enhance the detection of tomatoes in complex environments, Xiangyang Sun et al. [
16] developed a lightweight greenhouse tomato detection model, S-YOLO, which delivers fast detection performance with only 9.11 M parameters, outperforming many mainstream models. Jifei Zhao et al. [
17] introduced a lightweight pomegranate growth-stage detection algorithm, YOLO-Granada, based on YOLOv5. The model achieves substantial improvements in parameter count, computation load, and model size. It supports real-time detection at 8.66 images per second, offering a precise and lightweight solution for intelligent management devices in pomegranate orchards.
The main contributions of this article are as follows:
ORD-YOLO: a novel ripeness recognition model based on YOLOv8, integrating ODConv, RepGFPN, DynamicHead, and InnerDIoU to boost robustness and accuracy in complex scenes.
Citrus dataset with complex backgrounds: Includes varied lighting, occlusions, and fruit positions, mimicking real orchard conditions for challenging ripeness detection.
Ablation studies: Validate each module’s impact on precision, recall, and efficiency, proving ORD-YOLO’s performance gains.
Superior generalization and deployment potential: Outperforms baselines in tests and transfer tasks, showing promise for orchard robots and mobile devices.
2.2. Image Acquisition
The citrus experimental data used in this study were collected from the subtropical fruit planting demonstration park in Huaning County, Yuxi City, Yunnan Province, China. The research focuses on the early-maturing citrus variety “Xingjin Honey Tangerine.” Data collection took place from June 2023 to October 2024, acquiring multi-angle and multi-period image data. The system setup includes cameras inside the greenhouse (I), cameras outside the greenhouse (O), and ground probes inside the greenhouse (G), with their positions shown in
Figure 1. This system facilitated the collection of images at different time periods, ensuring diversity and coverage of the data in terms of time and lighting conditions.
The collected images were systematically organized through a web-based data management platform, creating a citrus fruit image dataset that covers a variety of typical scenarios, as shown in
Figure 2. This dataset includes images under different lighting conditions (natural light and strong light), varying fruit quantities (single fruit, multiple fruits, and overlapping fruits), occlusion situations (such as branch and leaf occlusion), and complex environments such as rainy days. In total, 3264 image samples were acquired, providing solid and reliable data support for subsequent computer vision-based citrus maturity analysis.
2.3. Citrus Ripeness Division
The color change of the citrus fruit peel is an important phenotypic indicator of its maturity. During fruit development and ripening, significant changes occur in the peel pigment metabolism, primarily manifested by the continuous degradation of chlorophyll and the gradual accumulation of carotenoids. This dynamic process results in clear, staged transitions in peel color [
18], which can be divided into three typical developmental phases:
Green Stage (Chlorophyll-Dominant Period): Chlorophyll content is high, and carotenoid synthesis has not yet significantly begun.
Yellow–Green Transition Stage (Pigment Conversion Period): Chlorophyll degradation and carotenoid accumulation reach a dynamic balance.
Orange Stage (Carotenoid-Dominant Period): Carotenoids dominate, and chlorophyll content drops to its lowest point.
Based on the above color change patterns and referencing the definition of citrus maturity from Chenglin Wang et al. [
19], this study classifies citrus fruit maturity into the following three categories: Immature Fruit: more than 80% of the fruit’s peel area is green; Semi-Mature Fruit: neither the green nor the yellow area of the peel exceeds 80% of the total peel area; and Mature Fruit: more than 80% of the fruit’s peel area is orange. Images of the fruit corresponding to each maturity stage are shown in
Figure 3.
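To make the 80% area rule above concrete, the following minimal sketch shows one way such a criterion could be evaluated automatically from a single fruit crop, using fixed HSV color ranges to estimate the green and orange peel proportions. The function name, color thresholds, and the assumption that a fruit mask is already available are illustrative only and are not part of the annotation procedure used in this study.

```python
import cv2
import numpy as np

# Illustrative HSV ranges (assumptions, not calibrated to this dataset):
# green peel vs. orange peel; other hues (e.g., yellow-green) fall into neither mask.
GREEN_LO, GREEN_HI = (35, 40, 40), (85, 255, 255)
ORANGE_LO, ORANGE_HI = (5, 60, 60), (25, 255, 255)

def ripeness_from_peel(bgr_crop: np.ndarray, fruit_mask: np.ndarray) -> str:
    """Classify a single fruit crop as IM / SM / MA from peel color area ratios.

    bgr_crop   : BGR image patch containing one fruit (e.g., a detected box).
    fruit_mask : binary mask of peel pixels inside the crop (non-fruit pixels = 0).
    """
    hsv = cv2.cvtColor(bgr_crop, cv2.COLOR_BGR2HSV)
    peel = fruit_mask > 0
    total = peel.sum()
    if total == 0:
        return "unknown"

    green = cv2.inRange(hsv, GREEN_LO, GREEN_HI)[peel].astype(bool).sum()
    orange = cv2.inRange(hsv, ORANGE_LO, ORANGE_HI)[peel].astype(bool).sum()

    green_ratio, orange_ratio = green / total, orange / total
    if green_ratio > 0.8:          # more than 80% green peel -> Immature
        return "IM"
    if orange_ratio > 0.8:         # more than 80% orange peel -> Mature
        return "MA"
    return "SM"                    # neither color dominates -> Semi-Mature
```

In practice, such thresholds would need calibration to the cameras and lighting conditions described in Section 2.2; in this study the labels were assigned manually according to the criteria above.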
2.4. Citrus Dataset Construction
This study conducted a rigorous selection of image quality during the data preprocessing phase. Because the shooting equipment covered a wide area and lighting conditions varied over time, some low-quality samples were found among the 3264 originally captured images, including blurry images and images dominated by background. After manual cropping and screening, 2597 images with clearly visible fruit from the three collection points were retained, forming the effective dataset. The distribution of images across the acquisition devices is shown in
Table 1.
To improve the model’s generalization ability and accurately identify fruit maturity in complex backgrounds, this study applied various data augmentation strategies to process the original images. These strategies included horizontal flipping, random rotation, contrast adjustment, and brightness adjustment [
20,
21].
During the data organization process, it was found that there was a significant class imbalance in the sample distribution. This phenomenon mainly arises from the actual situation in agricultural production: mature citrus fruits are harvested promptly, leading to a significantly smaller number of mature samples compared to other maturity categories. This imbalance could affect the model’s ability to learn features from the less represented classes. To address this issue, this study adopted a differentiated data augmentation scheme to balance the sample distribution, as follows: for Immature (IM) samples, the original data amount remained unchanged; for Semi-Mature (SM) samples, flipping, rotation, and contrast enhancement were applied; and for Mature (MA) samples, brightness enhancement was added on top of the previous augmentations. Example images of the data augmentation are shown in
Figure 4.
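As an illustration of the differentiated augmentation scheme described above, the sketch below applies flipping, rotation, and contrast adjustment to Semi-Mature samples and adds brightness adjustment for Mature samples, while leaving Immature samples unchanged. The augmentation factors and ranges are assumed values chosen for illustration, not the exact settings used in this study.

```python
import random
from PIL import Image, ImageEnhance

def augment_by_class(img: Image.Image, label: str) -> list[Image.Image]:
    """Class-dependent augmentation sketch: IM kept as-is; SM gets flip, rotation, and
    contrast adjustment; MA additionally gets brightness adjustment."""
    out = [img]                                   # always keep the original image
    if label == "IM":
        return out
    # Semi-Mature and Mature: horizontal flip, random rotation, contrast adjustment
    out.append(img.transpose(Image.FLIP_LEFT_RIGHT))
    out.append(img.rotate(random.uniform(-30, 30), expand=True))
    out.append(ImageEnhance.Contrast(img).enhance(random.uniform(0.8, 1.2)))
    if label == "MA":
        # Mature: additional brightness adjustment on top of the previous operations
        out.append(ImageEnhance.Brightness(img).enhance(random.uniform(0.8, 1.3)))
    return out
```

Note that the geometric operations (flipping and rotation) also require the corresponding YOLO label files to be transformed so that the bounding boxes remain aligned with the augmented images.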
After data augmentation, the total number of images in the dataset was expanded to 5586, with the specific expansion details shown in
Table 2. All samples were annotated according to the maturity classification criteria on the MakeSense platform, and the annotation format used was the YOLO standard TXT text file. To ensure consistency in annotation, all the labeling work was completed by the same researcher. The final dataset was randomly divided into training (4468 images), validation (558 images), and test (560 images) sets in an 8:1:1 ratio.
3. Model Construction
3.1. YOLOv8
YOLOv8 is a single-stage object detection algorithm developed by Ultralytics, combining high performance with multi-task support. It is suitable for tasks such as object detection, instance segmentation, and pose estimation. For detection tasks, YOLOv8 offers five versions (n, s, m, l, and x), based on the network’s depth and width, to meet various application needs. Its architecture primarily consists of the backbone network, the neck network, and the detection head, which are used for multi-scale feature extraction, feature fusion enhancement, and the prediction of object classes and locations [
22]. The processing flow is as follows: after image preprocessing, the backbone network extracts multi-scale features, the neck network completes cross-scale fusion based on the feature pyramid, and the final output—class labels and bounding boxes—is generated by the detection head. Compared to previous models (e.g., YOLOv5 [
23], YOLOv7 [
24]), YOLOv8 introduces three major improvements in its architecture:
Module Optimization: The traditional C3 module is replaced with C2f, introducing cross-layer connections and multi-branch structures to improve gradient propagation efficiency and feature reuse capability.
Enhanced Feature Fusion: The PANet structure [
25] is upgraded to PAN-FPN [
26], strengthening cross-scale connections and fusion paths, thereby improving feature expression capability.
Decoupled Detection Head: The classification and regression task branches are independent, reducing feature interference between tasks and improving detection accuracy.
3.2. Improved Model ORD-YOLO
In complex backgrounds, YOLOv8 in citrus fruit maturity detection is susceptible to factors such as strong lighting, branch and leaf occlusion, and fruit overlap, which limits its detection performance. To improve recognition accuracy, this paper proposes an improved YOLOv8 architecture, ORD-YOLO, whose structure is shown in
Figure 5. The main improvements include the following:
Replacing the standard convolution with ODConv in the backbone network to enhance feature extraction capabilities. Standard convolutions have limited adaptability to spatial and contextual variations. ODConv introduces dynamic convolution across spatial, channel, and kernel dimensions, enabling the network to adjust its receptive field based on input features. This improves the model’s ability to extract diverse textures and semantic cues in complex orchard environments, enhancing backbone representation.
Replacing the Path Aggregation Network (PANet) with RepGFPN in the neck network to improve feature fusion. PANet may lead to feature redundancy and limited deep-semantic integration. RepGFPN addresses this by using structural re-parameterization, enhancing multi-scale feature fusion and improving efficiency. It effectively combines spatial detail with semantic depth, benefiting detection of small or occluded citrus fruits.
Using a Dynamic Detection Head (DynamicHead) to enhance the perception of key features. Static detection heads can struggle with varying object appearances. DynamicHead uses dynamic convolutions to adaptively focus on task-relevant features, improving robustness to variation in shape, texture, and background noise, and enhancing detection precision in real-world orchard conditions.
Optimizing the bounding box-regression loss function to InnerDIoU to improve localization accuracy. Traditional IoU-based losses may neglect fine localization at high overlaps. InnerDIoU considers both IoU and internal structure, refining boundary regression especially for densely packed or overlapping fruits, leading to more accurate localization in complex field scenes.
Figure 5.
ORD-YOLO network structure diagram.
3.2.1. ODConv Module
In complex citrus orchard environments, fruit maturity detection relies not only on model performance, but also on computational efficiency. To enhance feature extraction capabilities and computational efficiency, this paper replaces the original convolution layers with Omni-Dimensional Dynamic Convolution (ODConv). ODConv introduces a multi-dimensional attention mechanism that learns dynamic weights in parallel across four dimensions: the number of convolution kernels, spatial dimensions, input channels, and output channels. This approach significantly outperforms CondConv [
27] and DyConv [
28], which only adjust the number of convolution kernels, and offers stronger dynamic modeling capabilities. The computation method is shown in Equation (1):

y = (α_{w1} ⊙ α_{f1} ⊙ α_{c1} ⊙ α_{s1} ⊙ W_1 + … + α_{wn} ⊙ α_{fn} ⊙ α_{cn} ⊙ α_{sn} ⊙ W_n) ∗ x    (1)

In the above formula, α_{wi} represents the branch weight attention for the convolution kernel W_i, while α_{si}, α_{ci}, and α_{fi} represent the three newly introduced attention mechanisms, computed along the spatial dimension, input-channel dimension, and output-channel dimension of the convolution kernel W_i; ⊙ denotes multiplication along the corresponding dimension of the kernel space, and ∗ denotes the convolution operation. Here, α_{si}, α_{ci}, α_{fi}, and α_{wi} are all computed using the multi-head attention module π_i(x).
In ODConv, the attention mechanism of the convolution kernel includes four dimensions:
Spatial Attention α_{si}: Assigning weights at spatial positions of the kernel.
Input Channel Attention α_{ci}: Assigning weights to the input channels of each convolution filter W_i.
Output Channel Attention α_{fi}: Assigning different weights to the output channels (filters).
Global Attention α_{wi}: Assigning global weights to the entire convolution kernel W_i.
As shown in
Figure 6, the four types of attention are multiplied in sequence across the spatial, input-channel, output-channel, and global-kernel dimensions. This differential treatment across the dimensions enhances the feature extraction capabilities by providing more expressive power during the convolution operation.
Additionally, the four attention weights of the convolution kernel W_i, namely α_{si}, α_{ci}, α_{fi}, and α_{wi}, are all computed using the multi-head attention mechanism π_i(x), as shown in
Figure 7. The specific process is as follows: the input feature x is first compressed into a vector of length c_{in} through channel-level global average pooling (GAP). It is then mapped to a lower-dimensional space via a fully connected layer (FC) with a reduction ratio of r, and after a ReLU activation, it is split into four branches. Each branch applies a convolution of the appropriate size, followed by a Softmax or Sigmoid function, to generate the normalized attention weights.
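The following minimal PyTorch sketch illustrates the four-branch attention computation described above (GAP, an FC layer with reduction ratio r, ReLU, and four heads with Sigmoid or Softmax). The class name and layer sizes are assumptions for illustration; the sketch covers only the generation of the attention weights, not the subsequent weighted combination of the n candidate kernels.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OmniAttention(nn.Module):
    """Sketch of the ODConv attention branch: GAP -> FC (reduction ratio r) -> ReLU -> four
    heads producing spatial (k*k), input-channel, output-channel, and kernel-number weights."""
    def __init__(self, c_in, c_out, k, n_kernels=4, r=16):
        super().__init__()
        hidden = max(c_in // r, 16)
        self.fc = nn.Linear(c_in, hidden)
        self.spatial_head = nn.Linear(hidden, k * k)      # alpha_s: one weight per kernel position
        self.in_ch_head = nn.Linear(hidden, c_in)         # alpha_c: one weight per input channel
        self.out_ch_head = nn.Linear(hidden, c_out)       # alpha_f: one weight per output channel
        self.kernel_head = nn.Linear(hidden, n_kernels)   # alpha_w: one weight per candidate kernel

    def forward(self, x):                                  # x: (B, C_in, H, W)
        z = F.relu(self.fc(x.mean(dim=(2, 3))))            # channel-level GAP, then FC + ReLU
        a_s = torch.sigmoid(self.spatial_head(z))
        a_c = torch.sigmoid(self.in_ch_head(z))
        a_f = torch.sigmoid(self.out_ch_head(z))
        a_w = torch.softmax(self.kernel_head(z), dim=1)    # kernel-wise attention normalized by Softmax
        return a_s, a_c, a_f, a_w
```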
3.2.2. Efficient RepGFPN Feature Pyramid
The Feature Pyramid Network (FPN) [
29] effectively alleviates the detection challenges caused by scale variations of objects through multi-scale feature fusion. However, traditional FPN and its derivative structures (such as PANet and BiFPN [
30]) suffer from issues such as high computational cost and memory usage. In contrast, the Re-parameterizable Generalized Feature Pyramid Network (RepGFPN) introduces structural re-parameterization and cross-scale dynamic fusion mechanisms. It significantly improves inference efficiency, while retaining the multi-scale feature representation capabilities. Its structure is shown in
Figure 8.
Given the significant difference in FLOPs between feature maps of different scales and the limited computational resources that make it difficult to unify channel dimensions, RepGFPN adopts a strategy of setting different channel numbers for each scale in the neck feature fusion module. By flexibly adjusting the number of channels at each scale, RepGFPN significantly improves detection accuracy.
In terms of feature interaction, RepGFPN removes the upsampling introduced by Queen-Fusion in GFPN, reducing unnecessary computational overhead. To further enhance detection performance, RepGFPN replaces the traditional 3 × 3 convolution used for feature fusion with CSPNet [
31], and optimizes it with a Re-parameterization mechanism and Efficient Layer Aggregation Network (ELAN) structure. This approach improves fusion efficiency and model performance without increasing computational cost.
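The structural re-parameterization idea underlying RepGFPN can be illustrated with a minimal sketch in which a parallel 3 × 3 and 1 × 1 branch pair used during training is folded into a single 3 × 3 convolution for inference. This is a simplified, assumed example (BatchNorm folding is omitted for brevity), not the exact fusion block used in RepGFPN.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RepConvSketch(nn.Module):
    """Train-time: parallel 3x3 and 1x1 branches. Inference-time: one fused 3x3 convolution
    producing identical outputs. A real re-parameterized block also folds BatchNorm."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.conv3 = nn.Conv2d(c_in, c_out, 3, padding=1, bias=True)
        self.conv1 = nn.Conv2d(c_in, c_out, 1, bias=True)
        self.fused = None

    def forward(self, x):
        if self.fused is not None:                  # deployed: a single convolution
            return F.relu(self.fused(x))
        return F.relu(self.conv3(x) + self.conv1(x))

    @torch.no_grad()
    def reparameterize(self):
        """Fold the 1x1 branch into the 3x3 kernel (pad the 1x1 weights to 3x3 and add)."""
        fused = nn.Conv2d(self.conv3.in_channels, self.conv3.out_channels, 3,
                          padding=1, bias=True).to(self.conv3.weight.device)
        w1_padded = F.pad(self.conv1.weight, [1, 1, 1, 1])  # (c_out, c_in, 1, 1) -> (c_out, c_in, 3, 3)
        fused.weight.copy_(self.conv3.weight + w1_padded)
        fused.bias.copy_(self.conv3.bias + self.conv1.bias)
        self.fused = fused
```

Because the fused branch reproduces the training-time outputs exactly, the multi-branch structure adds representational capacity during training without increasing inference cost.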
3.2.3. Dynamic Detector
In object detection, the detection head is responsible for converting the extracted features into the final prediction results. For citrus fruits, which have complex image features across different scenes, the detection head needs to possess superior feature representation and transformation capabilities to accurately localize the fruits and categorize their maturity. To address this challenge, this paper proposes an improved method based on Dynamic Head.
Unlike traditional detection heads, DynamicHead integrates three attention mechanisms—scale, spatial, and task-aware—into a unified self-attention structure. This design follows a hierarchical strategy, deploying specialized attention modules at the layer, spatial, and channel dimensions to enable multi-dimensional feature learning. The structure is shown in
Figure 9.
Let the feature pyramid consist of L layers of feature maps {F_i, i = 1, …, L}. After unification to a middle scale through upsampling or downsampling, a four-dimensional tensor F ∈ R^{L×H×W×C} is obtained, where L, H, W, and C represent the number of layers, height, width, and channels, respectively. To streamline the notation, define S = H × W, reducing the tensor F to a three-dimensional tensor F ∈ R^{L×S×C}. The self-attention computation in DynamicHead is defined in Equation (2):

W(F) = π_C(π_S(π_L(F) · F) · F) · F    (2)

In this equation, π_L(·), π_S(·), and π_C(·) represent the attention functions applied to the L, S, and C dimensions, respectively. The architecture employs a decomposed attention mechanism, effectively reducing the complexity introduced by high-dimensional tensor computations.
- 1.
Scale-Aware Attention π_L:
This mechanism dynamically integrates features across different scales, weighted by their semantic importance. It is computed as follows:

π_L(F) · F = σ( f( (1 / (S·C)) Σ_{S,C} F ) ) · F

In this context, f(·) refers to the linear function implemented by a 1 × 1 convolution, and σ(x) = max(0, min(1, (x + 1)/2)) is the hard-sigmoid activation function.
- 2.
Spatial-Aware Attention π_S:
This attention mechanism is employed to capture discriminative spatial features. It begins with sparse attention learning using deformable convolutions, followed by the aggregation of features at the same spatial location across layers:

π_S(F) · F = (1/L) Σ_{l=1}^{L} Σ_{k=1}^{K} w_{l,k} · F(l; p_k + Δp_k; c) · Δm_k

In this case, K refers to the number of sampling points, p_k + Δp_k represents the attention position obtained by the self-learned offset Δp_k, and Δm_k is the self-learned importance scalar at position p_k, both derived from the median-level features of F.
- 3.
Task-Aware Attention π_C:
To effectively support tasks such as bounding box regression and center point prediction, a task-aware mechanism is introduced:

π_C(F) · F = max( α¹(F) · F_c + β¹(F), α²(F) · F_c + β²(F) )

In the above equation, F_c represents the feature slice of channel c, and θ(·) = [α¹, α², β¹, β²]^T is the hyper function used to learn the activation thresholds. θ(·) is derived by applying global average pooling across the L × S dimensions, followed by two fully connected layers. This produces normalized weights, which are then passed through a shifted sigmoid function so that the output values lie within the range of [−1, 1].
In conclusion, DynamicHead is constructed by sequentially combining the three attention mechanisms π_L, π_S, and π_C, as shown in
Figure 10. This structure significantly enhances its adaptability and accuracy for multi-task detection.
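A compact sketch of the sequential attention in Equation (2) is given below for a feature tensor of shape (B, L, S, C). For brevity, it replaces the deformable sampling of π_S with a simple per-location gate and approximates the task-aware thresholding with a tanh gate, so it should be read as an illustration of the decomposition, not the exact DynamicHead implementation.

```python
import torch
import torch.nn as nn

class DynamicHeadSketch(nn.Module):
    """Simplified sketch of W(F) = pi_C(pi_S(pi_L(F).F).F).F on a tensor F of shape (B, L, S, C)."""
    def __init__(self, channels):
        super().__init__()
        self.scale_fc = nn.Linear(channels, 1)         # pi_L: one gate per pyramid level
        self.spatial_fc = nn.Linear(channels, 1)       # simplified pi_S: one gate per spatial position
        self.task_fc = nn.Linear(channels, channels)   # simplified pi_C: channel-wise gating

    def forward(self, f):                              # f: (B, L, S, C)
        # Scale-aware: hard-sigmoid gate per level from features averaged over S and C
        level_stat = f.mean(dim=2)                                     # (B, L, C)
        a_l = torch.clamp((self.scale_fc(level_stat) + 1) / 2, 0, 1)   # hard-sigmoid gate, (B, L, 1)
        f = f * a_l.unsqueeze(-1)

        # Spatial-aware (simplified): gate each spatial position, shared across levels
        a_s = torch.sigmoid(self.spatial_fc(f.mean(dim=1)))            # (B, S, 1)
        f = f * a_s.unsqueeze(1)

        # Task-aware (simplified): channel gating from globally pooled features, output in [-1, 1]
        a_c = torch.tanh(self.task_fc(f.mean(dim=(1, 2))))             # (B, C)
        return f * a_c[:, None, None, :]
```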
3.2.4. InnerDIoU Loss Function
The object detection model is optimized by comparing the differences between the predicted bounding boxes and the ground-truth boxes. The loss function, as the core mechanism for assessing prediction accuracy, plays a crucial role in determining the model’s training effectiveness and overall performance. Given the stringent requirements for model convergence speed and detection accuracy (including false positives and false negatives) in citrus fruit maturity detection, this paper introduces the InnerDIoU loss function to enhance the YOLOv8 model.
In this context, the efficiency of bounding box regression is vital for both model convergence and localization precision. The commonly used metric, IoU (Intersection over Union) [
32], is employed to measure the overlap between the predicted and ground-truth boxes. The calculation is as follows:

IoU = |B ∩ B^{gt}| / |B ∪ B^{gt}|

Here, B^{gt} = (x^{gt}, y^{gt}, w^{gt}, h^{gt}) and B = (x, y, w, h) refer to the coordinates of the ground-truth bounding box and the predicted bounding box, respectively. It is crucial to note that the IoU is applicable only when there is an overlap between the predicted and ground-truth boxes. The IoU-based loss function is expressed as

L = 1 − IoU + R(B, B^{gt})

In the above formula, R(B, B^{gt}) refers to the penalty term between B and B^{gt}.
DIoU [
33] extends IoU by adding the normalized Euclidean distance between the center points as a penalty term, which helps speed up convergence. The definition is as follows:

R_{DIoU} = ρ²(b, b^{gt}) / c²

In the above equation, b and b^{gt} represent the center points of the two bounding boxes, ρ(·) is the Euclidean distance, and c is the diagonal length of the smallest enclosing rectangle that contains both boxes. The DIoU loss function is defined as

L_{DIoU} = 1 − IoU + ρ²(b, b^{gt}) / c²
InnerDIoU, an extension of DIoU, further incorporates the overlap of the central region of the bounding box, and uses the geometric distance between the central regions as a penalty term to achieve more refined bounding-box regression optimization. While DIoU focuses on the overlap of the entire box and the distance between center points, InnerDIoU prioritizes consistency modeling of key internal regions within the box. This approach is especially effective for complex detection scenarios where citrus fruits may be occluded or overlapping. Not only does it enhance the model’s localization robustness, but it also speeds up regression convergence. The loss function of InnerDIoU and its core component, InnerIoU, can be written as

InnerIoU = |B_{inner} ∩ B_{inner}^{gt}| / |B_{inner} ∪ B_{inner}^{gt}|

L_{InnerDIoU} = 1 − InnerIoU + ρ²(b, b^{gt}) / c²

Here, B_{inner} and B_{inner}^{gt} denote the central (inner) regions of the predicted and ground-truth boxes, and InnerIoU represents the IoU between these two central areas.
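The sketch below shows one way the above quantities can be computed for axis-aligned boxes: a standard IoU, the DIoU center-distance penalty, and an InnerIoU term obtained by shrinking each box around its center by a ratio factor. The shrink ratio and the exact combination of the inner term with the penalty are assumptions consistent with the description above, not necessarily the precise formulation used in this paper.

```python
import torch

def _iou_xyxy(a, b, eps=1e-7):
    """IoU of boxes in (x1, y1, x2, y2) format; a and b have shape (N, 4)."""
    x1 = torch.max(a[:, 0], b[:, 0]); y1 = torch.max(a[:, 1], b[:, 1])
    x2 = torch.min(a[:, 2], b[:, 2]); y2 = torch.min(a[:, 3], b[:, 3])
    inter = (x2 - x1).clamp(0) * (y2 - y1).clamp(0)
    area_a = (a[:, 2] - a[:, 0]) * (a[:, 3] - a[:, 1])
    area_b = (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])
    return inter / (area_a + area_b - inter + eps)

def _shrink(box, ratio):
    """Shrink a box around its center by `ratio` to obtain the 'inner' (central) region."""
    cx, cy = (box[:, 0] + box[:, 2]) / 2, (box[:, 1] + box[:, 3]) / 2
    w, h = (box[:, 2] - box[:, 0]) * ratio, (box[:, 3] - box[:, 1]) * ratio
    return torch.stack([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2], dim=1)

def inner_diou_loss(pred, target, ratio=0.75, eps=1e-7):
    """Sketch: DIoU center-distance penalty combined with the IoU of the central regions."""
    inner_iou = _iou_xyxy(_shrink(pred, ratio), _shrink(target, ratio), eps)
    # Normalized squared center distance (the DIoU penalty term)
    pcx, pcy = (pred[:, 0] + pred[:, 2]) / 2, (pred[:, 1] + pred[:, 3]) / 2
    tcx, tcy = (target[:, 0] + target[:, 2]) / 2, (target[:, 1] + target[:, 3]) / 2
    rho2 = (pcx - tcx) ** 2 + (pcy - tcy) ** 2
    ex1 = torch.min(pred[:, 0], target[:, 0]); ey1 = torch.min(pred[:, 1], target[:, 1])
    ex2 = torch.max(pred[:, 2], target[:, 2]); ey2 = torch.max(pred[:, 3], target[:, 3])
    c2 = (ex2 - ex1) ** 2 + (ey2 - ey1) ** 2 + eps   # squared diagonal of the smallest enclosing box
    return 1.0 - inner_iou + rho2 / c2
```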
3.3. Test Environment and Configuration
The experimental setup for this study is as follows: the hardware system consists of an AMD Ryzen 9 7945HX processor (base frequency of 2.50 GHz) paired with an NVIDIA GeForce RTX 4060 GPU (8 GB VRAM), and 16 GB of RAM. The software environment is based on the 64-bit version of Windows 11, with the PyTorch 1.12.1 deep learning framework and Python 3.10.15 as the programming language. The development environment used is PyCharm 2024, and GPU acceleration is implemented using CUDA 11.6.
The model training parameters are as follows: input images are resized to 640 × 640 pixels, with a batch size of 16. The optimizer is Stochastic Gradient Descent (SGD) [
34], with an initial learning rate of 0.001 and a weight decay coefficient of 0.0005.
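For reference, the reported training configuration corresponds to a call like the following with the Ultralytics training API; the model and dataset YAML file names are placeholders, since the actual ORD-YOLO configuration files are not published here.

```python
from ultralytics import YOLO

# Placeholder model definition file for the modified architecture (assumed name).
model = YOLO("ord-yolo.yaml")

# Training settings matching the configuration reported above.
model.train(
    data="citrus_ripeness.yaml",   # dataset config (assumed filename) with IM / SM / MA classes
    imgsz=640,                     # input images resized to 640 x 640
    batch=16,
    epochs=500,
    optimizer="SGD",
    lr0=0.001,                     # initial learning rate
    weight_decay=0.0005,
    device=0,                      # single NVIDIA GPU
)
```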
3.4. Evaluation Metrics
For the citrus orchard fruit maturity detection task, it is essential to evaluate both the model’s detection accuracy and its adaptability to complex scenarios. This study establishes multiple quantitative evaluation criteria across various dimensions. Key metrics selected for measuring detection accuracy include Precision, Recall, and Mean Average Precision (mAP) [
35], which assess the model’s ability to correctly identify mature fruits, retrieve all true mature fruits, and perform across multiple categories.
Precision: the ratio of correctly identified mature fruits (True Positive, TP) to all predicted mature fruits (TP + False Positive, FP). Recall: the ability of the model to correctly detect all true mature fruits (TP relative to TP + False Negative, FN). They are defined as follows:

Precision = TP / (TP + FP), Recall = TP / (TP + FN)

Average Precision (AP): evaluates the model’s detection performance for a single class as the area under the precision–recall curve. Mean Average Precision (mAP): the mean of the AP over all categories, offering an overall measure of the model’s detection performance. The mathematical definitions of both are as follows:

AP = ∫₀¹ P(R) dR, mAP = (1/N) Σ_{i=1}^{N} AP_i

where N is the number of categories.
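As a reference for how these accuracy metrics are obtained, the sketch below computes Precision, Recall, and AP for a single class from confidence-ranked detections that have already been matched to ground truth at a fixed IoU threshold; averaging the per-class AP values gives mAP. The matching step itself is omitted, and the function is illustrative rather than the evaluation code used in this study.

```python
import numpy as np

def precision_recall_ap(confidences, is_tp, n_gt):
    """Compute Precision, Recall, and AP for one class.

    confidences : detection confidence scores
    is_tp       : 1 if the detection matched a ground-truth box (IoU above threshold), else 0
    n_gt        : total number of ground-truth boxes of this class
    """
    order = np.argsort(-np.asarray(confidences))
    tp = np.cumsum(np.asarray(is_tp)[order])
    fp = np.cumsum(1 - np.asarray(is_tp)[order])
    recall = tp / max(n_gt, 1)
    precision = tp / np.maximum(tp + fp, 1e-9)
    # AP: area under the precision-recall curve (all-point interpolation)
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([1.0], precision, [0.0]))
    p = np.maximum.accumulate(p[::-1])[::-1]        # make precision monotonically non-increasing
    ap = np.sum((r[1:] - r[:-1]) * p[1:])
    last_p = precision[-1] if len(precision) else 0.0
    last_r = recall[-1] if len(recall) else 0.0
    return last_p, last_r, ap

# mAP@50 is then the mean of the per-class AP values computed at an IoU threshold of 0.5.
```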
Regarding computational efficiency, core metrics like model weight size, parameter count (Parameters), detection speed (Speed), and floating-point operations (GFLOPs) [
36] are introduced. These accuracy and efficiency metrics together form a multidimensional evaluation framework, providing a comprehensive assessment of the model’s performance in citrus fruit maturity detection.
4. Results
4.1. Improved Model Results
To validate the effectiveness of the improved model ORD-YOLO in detecting citrus fruit maturity, this study compares its detection performance with that of YOLOv8 on the citrus dataset. After 500 training epochs, the precision, recall, and mean average precision (mAP) of ORD-YOLO are 93.83%, 91.62%, and 96.92%, respectively. In comparison, the corresponding metrics for YOLOv8 are 89.17%, 88.32%, and 93.92%, indicating that ORD-YOLO improves upon YOLOv8 by 4.66%, 3.30%, and 3.00%, respectively. Additionally, the ORD-YOLO model has a weight file size of 12.9 MB, 14.1 GFLOPs, a detection speed of 7.9 ms per image, and a parameter count of 6.2 M. To assess the time overhead of training, we recorded the training time: the average time per epoch is approximately 3.71 min, and the total training time is about 1855 min (500 epochs), indicating that the model trains efficiently, which is conducive to subsequent optimization and deployment on resource-constrained devices.
Table 3 shows its superior detection performance across different maturity categories.
Specifically, the model achieves the highest recognition accuracy for MA, reaching 93.5%; both IM and SM achieve 93.3%. In terms of recall, SM is the highest, at 92.5%, followed by IM and MA, at 91.9% and 88.7%, respectively. Regarding the mean average precision (mAP@50), both IM and SM reach 96.9%, while MA stands at 96.6%. The loss function is a core metric for evaluating the model’s detection performance, reflecting both the model’s convergence speed and detection accuracy [
37]. Choosing an appropriate loss function helps accelerate model convergence and improve detection performance.
Figure 11 shows the loss curves of ORD-YOLO and YOLOv8 under the same training configuration. ORD-YOLO exhibits faster convergence and lower final loss values for bounding box regression, object classification, and the distribution focal loss (DFL), with smoother training curves. Specifically, the bounding box loss decreases to about 0.55, a reduction of approximately 39%; the classification loss stabilizes at 0.55, a reduction of about 5%; and the distribution focal loss settles at around 1.3, a reduction of about 11%. These results demonstrate the significant advantages of ORD-YOLO in both localization and classification accuracy, confirming the effectiveness and robustness of its improvement strategy.
Additionally, to verify the generalization ability of the improved model, prediction experiments comparing ORD-YOLO and YOLOv8 were conducted on four typical scenarios in the test set, as shown in
Figure 12. The experiments demonstrate that both models can correctly identify fruit in cases of severe occlusion, but ORD-YOLO exhibits significantly higher confidence than YOLOv8. In overlapping fruit scenarios, YOLOv8 experiences repeated detections, while ORD-YOLO performs more accurately. Under shadow occlusion, ORD-YOLO still maintains high confidence. In natural-light single-fruit scenarios, YOLOv8 suffers from false detections, while ORD-YOLO shows higher confidence. Even under strong light interference, ORD-YOLO outperforms YOLOv8. Overall, the results indicate that ORD-YOLO demonstrates stronger robustness and superior detection performance in complex environments.
4.2. Ablation Experiments
This study sequentially introduces four improvements to the YOLOv8 model: (1) replacing standard convolution with ODConv; (2) replacing the feature aggregation network with RepGFPN; (3) changing the detection head to DynamicHead; and (4) optimizing the loss function to InnerDIoU. To validate the effectiveness of each module, ablation experiments with gradual additions were designed, and the results are shown in
Table 4.
After introducing ODConv, the model’s recall (R) and mean average precision at 50 (mAP@50) increased by 1.15% and 0.42%, respectively, while GFLOPs decreased by 0.9. When RepGFPN was introduced, precision (P), recall (R), and mAP@50 increased by 0.34%, 1.74%, and 0.53%, respectively. Combining ODConv and RepGFPN resulted in a further increase in P, R, and mAP@50 by 2.5%, 2.12%, and 1.76%. When RepGFPN and DynamicHead were combined, P, R, and mAP@50 increased by 1.3%, 1.54%, and 1.51%. With the joint application of all three modules, the model’s P, R, and mAP@50 improved by 2.95%, 2.06%, and 2.09%, respectively. Finally, after adding the InnerDIoU loss function, P, R, and mAP@50 were further improved, by 4.66%, 3.3%, and 3%.
Although the addition of these modules increased the model’s weight, the detection performance (P, R, mAP@50) was significantly enhanced, fully validating the effectiveness of each module in improving the model’s accuracy and robustness. Especially in terms of GFLOPs, despite the increase in model complexity after introducing ODConv, RepGFPN, and DynamicHead, GFLOPs decreased by 0.9 compared to the model with only RepGFPN and DynamicHead. Additionally, when ODConv and RepGFPN were combined, GFLOPs decreased by 1.1, compared to using RepGFPN alone. This improvement is attributed to the dynamic weight mechanism and multi-dimensional modeling capability of ODConv. The ablation experiment results fully verify the effectiveness of the improvement modules in enhancing the detection accuracy and robustness of the ORD-YOLO model.
To visually demonstrate the impact of the improvement strategies on the model’s detection performance, Grad-CAM [
38] heatmaps were used for the ablation experiment visualization. The results are shown in
Figure 13, where deeper colors and larger areas indicate higher model attention (i.e., higher confidence). The results reveal that ORD-YOLO outperforms the other models in both fruit focus ability and attention distribution. In contrast, YOLOv8 exhibits issues with missed detections and attention dispersion. ODConv, through dynamic weight adjustment, enables the model to focus on key regions; RepGFPN effectively enhances the local detail representation of the fruit, alleviating the attention dispersion problem; and DynamicHead further strengthens the response in the core fruit area, improving the model’s perceptual ability. A comprehensive analysis indicates that ORD-YOLO accurately focuses on the citrus fruit’s position, with heatmaps showing attention concentrated on the fruit even in complex backgrounds. This demonstrates stronger target recognition and attention ability, further validating the effectiveness of the improvement strategies.
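For context, heatmaps such as those in Figure 13 follow the standard Grad-CAM recipe: the activations of a chosen layer are weighted by the spatially pooled gradients of a scalar detection score. The generic sketch below illustrates that recipe with forward and backward hooks; the choice of target layer and of the scoring function is model-specific and is an assumption here, not the exact visualization code used in this study.

```python
import torch
import torch.nn.functional as F

def grad_cam(model, image, target_layer, score_fn):
    """Generic Grad-CAM sketch: weight a layer's activations by the spatially pooled gradients
    of a scalar detection score, then apply ReLU and normalize to obtain a heatmap.

    score_fn(outputs) must reduce the model outputs to a single scalar (e.g., the top
    objectness or class score of the fruit of interest); choosing it is model-specific.
    """
    activations, gradients = {}, {}

    def fwd_hook(_m, _inp, out):
        activations["a"] = out

    def bwd_hook(_m, _gin, gout):
        gradients["g"] = gout[0]

    h1 = target_layer.register_forward_hook(fwd_hook)
    h2 = target_layer.register_full_backward_hook(bwd_hook)
    try:
        outputs = model(image)                           # image: (1, 3, H, W)
        score = score_fn(outputs)
        model.zero_grad()
        score.backward()
        a, g = activations["a"], gradients["g"]          # both (1, C, h, w)
        weights = g.mean(dim=(2, 3), keepdim=True)       # GAP over spatial dims -> channel weights
        cam = F.relu((weights * a).sum(dim=1, keepdim=True))
        cam = F.interpolate(cam, size=image.shape[2:], mode="bilinear", align_corners=False)
        cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
    finally:
        h1.remove(); h2.remove()
    return cam[0, 0]                                     # (H, W) heatmap in [0, 1]
```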
4.3. Comparative Experiments
To comprehensively evaluate the performance of ORD-YOLO, this study compares it with mainstream detection algorithms (YOLOv5, YOLOv8, YOLOv10n [
39], YOLOv11 [
40], SSD [
41], and CenterNet [
42]), as shown in
Table 5. ORD-YOLO demonstrates the best overall performance in citrus fruit maturity detection. Its accuracy, recall, and mAP@50 reach 93.83%, 91.62%, and 96.92%, respectively, significantly outperforming YOLOv5, YOLOv8, YOLOv10n, and YOLOv11, as well as traditional SSD and CenterNet models. Although ORD-YOLO’s parameter count (6.2 M) and model size (12.9 MB) are slightly larger, and its detection speed (7.9 ms) is slower than some YOLO models, it offers significant advantages in detection accuracy and stability, demonstrating stronger robustness and detection capability. In contrast, while SSD has a higher recall rate (94.43%), its precision (P) and mAP@50 are both lower than ORD-YOLO’s. CenterNet, though achieving the highest accuracy (96.87%), has a relatively lower recall rate (77.54%). Overall, ORD-YOLO strikes a good balance between detection accuracy and model capacity, while maintaining fast inference speed, showing great potential for wide applications in complex environments.
To comprehensively evaluate the performance of the aforementioned models in different environments, this study visualizes the detection results on the test set, as shown in
Figure 14. The experiments cover five complex environments, and the detection performance varies accordingly. The results show that CenterNet performs the worst in all scenarios (lowest confidence) and experiences a significant decline in detection performance under complex conditions compared to the other models. SSD performs better in three scenarios: close-range single fruit, strong-light fruit, and branch-and-leaf-occluded fruit, but suffers from repeated recognition and classification errors in rainy conditions. While YOLOv5, YOLOv8, and YOLOv11 did not exhibit missed or false detections, their overall confidence was lower than that of ORD-YOLO. YOLOv10n missed detections in the strong-light fruit-overlap environment. In summary, ORD-YOLO maintains stable and high detection accuracy across complex environments, demonstrating stronger robustness. The confidence levels of YOLOv5, YOLOv8, and YOLOv11 vary significantly across the different complex environments, while SSD faces issues with repeated recognition and false detections.
5. Discussion
The ORD-YOLO citrus maturity detection model proposed in this study maintains high detection accuracy, even in complex backgrounds. By incorporating ODConv, RepGFPN, DynamicHead, and InnerDIoU modules into different structures of the YOLOv8 model, the model’s feature extraction capability for multi-scale objects is significantly enhanced. The ORD-YOLO model offers key advantages for real-world orchard management and intelligent harvesting. First, it shows strong robustness in complex environments, effectively handling lighting variations, occlusions, and cluttered backgrounds, to ensure accurate and stable ripeness recognition. Second, its high inference speed and lightweight design support real-time detection on resource-limited devices like harvesting robots and mobile terminals. Third, improved precision and recall reduce missed and false detections, minimizing human error and enhancing harvesting efficiency. Lastly, its design and optimization provide broad applicability, serving as a valuable reference for maturity detection in other fruits and vegetables and advancing agricultural automation.
Although ORD-YOLO has achieved initial success in detecting citrus fruits and classifying their maturity, and has demonstrated its effectiveness and robustness in complex backgrounds, there are still certain limitations. First, this study divides fruit maturity into three levels, which fails to fully capture the subtle changes in fruit maturity from unripe to fully ripe, limiting the model’s ability to express the continuous evolution of maturity and making it difficult to meet the demands of precision orchard management. Therefore, future research could introduce more detailed grading standards and combine multi-source information such as image time-series and spectral data to enhance the model’s ability to perceive the dynamic changes in maturity. Currently, the model has not been deployed in real-world environments or tested on edge devices, and remains in the preliminary experimental phase. In the future, we plan to gradually deploy the model in real-world settings, with a particular focus on conducting extensive testing and evaluation on edge devices.
Secondly, this study focused on the early-maturing “Xingjin Honey Tangerine” variety, which, while representative, has variety-specific characteristics in terms of fruit shape, color changes, and maturation cycles. The adaptability and generalization of the trained model to other mid- to late-maturing citrus varieties still need further verification. To improve the model’s generality and practical value, future research should expand the sample varieties and construct diverse datasets for further study. Finally, although the ORD-YOLO model has shown significant improvements in accuracy and stability compared to the baseline model, its complexity has also increased. Compared to the original model, the parameter count has increased by 3.2 M, and the model weight has increased by 6.5 MB. While this brings performance advantages, it also increases the dependence on computational resources, limiting its efficient deployment on edge devices. Future research could focus on model compression, pruning optimization, and lightweight design (such as introducing Ghost modules, MobileNet architectures, etc.) to further reduce model complexity and inference costs. This would improve real-time performance and deployment flexibility, while ensuring accuracy, and would drive the continued evolution of orchard management systems towards intelligence and low-cost solutions.
6. Conclusions
This study proposes an improved algorithm, ORD-YOLO, based on YOLOv8, for citrus fruit maturity recognition. The model achieves high accuracy and stability in complex orchard environments, showing no missed or false detections, and demonstrates strong robustness and applicability. Compared to manual inspection, it effectively addresses issues such as low efficiency, subjectivity in classification, and difficulty in identifying targets within complex backgrounds. Furthermore, it provides visual guidance for harvesting robots, enhancing automation in orchard management.
Results show ORD-YOLO significantly improves detection over YOLOv8, with faster convergence and lower losses during training. Despite slight increases in weight size and GFLOPs, precision, recall, and mAP@50 reach 93.83%, 91.62%, and 96.92%, respectively. The GFLOPs increase enhances feature learning, improving stability against complex backgrounds. Compared to SSD, CenterNet, and YOLO models, ORD-YOLO leads in all key metrics; although CenterNet has higher precision (96.87%), its recall (77.54%) and mAP@50 (95%) are lower. SSD has high recall (94.43%) but lower precision (87.69%). Compared to YOLOv8, ORD-YOLO improves precision, recall, and mAP@50 by 4.66%, 3.30%, and 3.00%. It maintains real-time speed (7.9 ms), outperforming SSD (96.15 ms) and CenterNet (41.67 ms), balancing accuracy and efficiency for agriculture. Test results confirm accurate fruit localization and classification, without false or duplicate detections. Through optimized architecture and loss functions, ORD-YOLO achieves enhanced accuracy and efficiency, suitable for precise fruit detection in complex agricultural settings, overcoming small target detection challenges inherent in complex backgrounds.
The improved citrus fruit maturity detection model, ORD-YOLO, performs excellently in complex orchard environments. It accurately identifies three maturity stages—IM, SM, and MA—with an average accuracy of 96.2%. Besides precise maturity classification, the model supports intelligent orchard management by aiding farmers in harvest decisions, reducing errors, and enhancing fruit value. This ensures precise harvesting and improved fruit quality. Its design also offers insights for maturity detection in other fruits and vegetables, advancing intelligent, automated agriculture.