A Lightweight and High-Precision Citrus Detection Model for Unstructured Orchard Environments

Yang, Junjie; Wu, Haorong; Lv, Dong; Ma, Wei; Teng, Hao; Chen, Dehua

doi:10.3390/horticulturae12060718

by Junjie Yang ¹, Haorong Wu ¹,*, Dong Lv ², Wei Ma ³, Hao Teng ¹ and Dehua Chen ¹

¹ School of Electronic Information and Electrical Engineering, Chengdu University, Chengdu 610106, China

² Chengdu Institute of Metrology Verification and Testing, Chengdu 610000, China

³ Institute of Urban Agriculture, Chinese Academy of Agricultural Sciences, Chengdu 610200, China

* Author to whom correspondence should be addressed.

Horticulturae 2026, 12(6), 718; https://doi.org/10.3390/horticulturae12060718 (registering DOI)

Submission received: 10 April 2026 / Revised: 26 May 2026 / Accepted: 6 June 2026 / Published: 11 June 2026

(This article belongs to the Special Issue Emerging Technologies in Smart Agriculture)


Abstract

This study was conducted to address the challenges of detecting citrus fruits in complex orchard environments characterized by overlap, occlusion, and variable lighting conditions. To tackle these issues, an improved detection model named YOLO-MGP was developed based on the YOLOv8n architecture. Four key enhancements were introduced to the core components of the detection framework. First, the primary backbone network was replaced with MobileNetV3, which substantially reduced computational requirements while preserving the capability for multi-scale feature extraction. Second, a C2f-GLU module was incorporated into the neck network. By leveraging Gated Linear Units, this module strengthens the feature selection and fusion processes. Third, an additional P2 detection layer was added to improve the detection of small targets. This modification was complemented by the integration of a Coordinate Attention mechanism, which refines the distribution of feature weights across spatial and channel dimensions. Finally, the CIoU loss was replaced by PIoU to enhance the accuracy of bounding box regression, particularly for occluded and overlapping targets. Experimental results demonstrate that the YOLO-MGP model achieved a precision of 94.2%, a recall of 89.7%, and an $mAP_{50}$ of 95.7% on our custom citrus dataset. By substantially reducing the number of parameters while maintaining competitive detection performance, the proposed method offers a practical and lightweight solution for fruit detection in automated harvesting systems.

Keywords:

citrus; YOLO-MGP; YOLOv8n; PIoU; lightweight detection

1. Introduction

China consistently ranks among the world's foremost citrus producers. As a leading economic crop, citrus serves as a cornerstone of agriculture in southern China and the Yangtze River Basin. With an annual output value of nearly $28 billion, the sector directly supports ten million jobs and plays a pivotal role in driving rural revitalization [1]. However, orchard management remains heavily dependent on manual labor. This challenge is particularly acute in hilly and mountainous regions, where terrain complexity renders traditional management models inefficient, labor-intensive, and prohibitively costly. Under these constraints, current methods fail to satisfy the requirements of large-scale precision agriculture [2]. Consequently, designing a lightweight yet robust citrus detection system suited to complex orchard environments has become essential. This effort is vital for advancing industrial transformation and upgrading, while also helping to cultivate new, high-quality productive forces within the citrus sector.

In citrus orchards, variable illumination, complex foliage backgrounds, multi-scale fruit, and colour changes during ripening cause significant detection challenges, including feature extraction bias, missed detections, and false positives. Traditional object detection methods rely on hand-crafted features and expert parameter tuning, resulting in weak generalization, poor robustness, and high computational cost. Zhuang et al. [3] presented a monocular vision approach combining homomorphic filtering, watershed-based candidate localization, LBP texture features, and HIK-SVM [4,5,6], achieving a recall above 0.86 on field images but suffering from sensitivity to scale and maturity. Sun et al. [7] proposed a fluorescence-based defect inspection system using multi-light vision and SVM [8], attaining high detection accuracy under controlled nighttime conditions. Sengupta and Lee [9] integrated Circular Hough Transform, SVM texture classification, Canny edge detection, and SIFT [10,11,12] for immature citrus identification, reporting 80.4% accuracy but with many false positives, especially under occlusion and shadows. These conventional pipelines fall short of the demands for intelligent monitoring in complex orchard environments.

With rapid advances in artificial intelligence and edge hardware, deep learning-based detectors, particularly the YOLO series [13], have become central to citrus industry intelligence, enabling real-time fruit detection, maturity assessment, pest surveillance [14], and picking guidance [15].

To tackle citrus detection in challenging environments, recent studies have developed various YOLO variants. For occluded fruit, Lin et al. [16] proposed AG-YOLO, adopting NextViT and GCFM [17,18] to enhance global context and feature fusion, reaching 90.6% precision. Liao et al. [19] built YOLO-MECD with EMA, CSPPC, and MPDIoU [20,21,22], achieving 84.4% precision and a 75.6% parameter reduction. Emphasizing extreme lightweight design, Zeng et al. [23] introduced YOLO-FCAP, incorporating Pico scaling [24], FasterNetBlockWithCA, and ADown [25], delivering 92.5% $mAP$ with only 0.81 M parameters and 168 FPS. For specialized tasks, Liu et al. [26] designed YOLOv5s-BiFormer, merging BiFormer, CNN, and Transformer [27,28,29] for citrus shoot detection (86.8% $mAP$, 7.3 M parameters). Deng et al. [30] created YOLOv7-BiGS with BiFormer and GSConv [31] for variety detection (91.0% precision), and Zhou et al. [32] developed YOLOv5-NMM, adding a small-object layer, CBAM, BiFPN, and Soft-NMS [33,34,35] for maturity detection (93.2% precision, 7.2 M parameters). Despite these advances, existing models still struggle to balance lightweight architectures and robustness against complex backgrounds, often exhibiting high false-positive rates under dense occlusion, extreme lighting, and morphological similarity [36].

To improve feature extraction and accurate citrus fruit recognition while enabling model lightweighting, this study introduces YOLO-MGP, an enhanced YOLOv8n-based detection framework. The following is a summary of this study’s primary contributions:

This study establishes a citrus fruit image dataset encompassing diverse illumination conditions, varying imaging backgrounds, and different levels of occlusion, thereby providing a high-quality sample foundation for subsequent experimental validation and model training.
The C2f_GLU module is introduced to replace the standard C2f module in the neck, combining gated linear units with depthwise convolutions to achieve spatially adaptive feature recalibration with minimal overhead.
An optimized detection head configuration and loss function are adopted: a dedicated small-object detection layer is integrated with the Coordinate Attention mechanism while the deepest layer is removed for efficiency; the PIoUv2 loss is employed to improve regression accuracy for overlapping citrus fruits.
A lightweight backbone substitution (MobileNetV3) is introduced to replace CSPDarknet. Beyond merely reducing parameters and accelerating inference, MobileNetV3 integrates Squeeze-and-Excitation modules that perform dynamic channel-wise recalibration, which is later shown to confer intrinsic robustness to illumination variation. This backbone is purposefully co-designed with the C2f_GLU neck module and the optimized detection head, together forming a coherent lightweight framework where the representational capacity lost through compression is recovered by subsequent gated feature refinement and attention-based multi-scale detection, ultimately achieving a favorable accuracy-efficiency trade-off for resource-constrained edge devices.

2. Materials and Methods

2.1. Datasets Construction

The citrus dataset construction workflow is illustrated in Figure 1. Images were collected at the Huijia Qicai Tianyuan Citrus Orchard in Longquanyi District, Chengdu City, Sichuan Province, China (30°37′50.812″ N, 104°14′14.233″ E, altitude 498 m). This region features a subtropical monsoon climate with mild winters, abundant rainfall, and fertile alluvial soil, creating optimal conditions for citrus cultivation and ensuring naturally abundant fruit sets for dataset acquisition. The orchard spans approximately 200,000 m² and employs a dense planting configuration with 4 m row spacing and 1.5 m inter-plant distance. Trees were cultivated on the outer slopes of contour-dug trenches, which provided relatively unobstructed access from both sides and facilitated multi-angle image acquisition [37]. These topographic and agronomic characteristics collectively contributed to a diverse and representative sample pool encompassing various occlusion patterns, illumination angles, and fruit scales.

Data collection was conducted on 2 December 2025, during three daily intervals: 08:00–09:00 (Morning, low-angle illumination), 12:00–14:00 (Noon, strong direct light), and 17:00–18:00 (Afternoon, dim backlight). A multi-angle and multi-temporal shooting strategy was employed: operators walked along the orchard rows, capturing fruits from distances typically ranging between 0.5 m and 1.0 m, adjusting the viewpoint to include frontal, lateral, and downward perspectives. After removing approximately 100 images with severe blur or underexposure, a total of 1102 high-quality images were retained, with roughly 80% (approx. 880) from the Canon camera and 20% (approx. 220) from the iPhone. This natural device imbalance reflects the complementary roles of the two sensors: the Canon served as the primary high-fidelity acquisition tool, while the iPhone provided targeted validation for consumer-grade deployment scenarios.

To construct the main training and evaluation pipeline, the full set of 1102 images was randomly partitioned into training, validation, and test sets at an 8:1:1 ratio. Data augmentation techniques [38]—including random color shifting, Gaussian blurring, and random noise—were then applied exclusively to the training set, expanding it to 4408 images; the validation and test sets remained unaugmented throughout all experiments.

For the dedicated analysis of model performance under different lighting conditions, the 1102 annotated images were first grouped by their acquisition time window: Morning (08:00–09:00), Noon (12:00–14:00), and Afternoon (17:00–18:00). Representative samples of citrus under various natural environments and augmented conditions are illustrated in Figure 2. From each temporal group, 200 images were randomly selected, yielding three evaluation subsets of equal size. This grouping by actual capture time guarantees that each subset exclusively contains the target illumination condition, while random selection within each group avoids subjective bias. A small validation set was further randomly extracted from each subset for hyperparameter tuning, with the remaining images used for testing. Because the random selection within each temporal group naturally preserves the original device distribution, the Canon-to-iPhone ratio remains approximately 8:2 across all three subsets, mirroring the overall dataset composition. Consequently, even the smallest device-specific group (iPhone images in the Afternoon subset) contains no fewer than approximately 30–40 samples, a quantity sufficient to compute detection metrics with meaningful confidence. We acknowledge that this modest imbalance precludes a fully balanced multi-factor ANOVA; however, for the pragmatic assessment of lighting robustness in orchard environments, the per-condition precision, recall, and $mAP$ values reported herein provide reliable quantitative evidence without introducing systematic evaluation bias. For the occlusion analysis, images were categorized into three levels based on the percentage of visible fruit area: no/slight occlusion (visible area $\geq 80\%$), moderate occlusion (visible area 40–80%), and heavy occlusion (visible area $< 40\%$). Images meeting each criterion were identified from the annotated dataset to form the respective evaluation subsets.
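The three occlusion levels amount to a simple threshold lookup. The following is a minimal sketch rather than the authors' tooling: the function name `occlusion_level` is hypothetical, and assigning a visible area of exactly 40% to the moderate class is an assumption, since the text gives that boundary only as a range.

```python
def occlusion_level(visible_fraction: float) -> str:
    """Map the visible share of fruit area (0.0 to 1.0) to an occlusion level."""
    if not 0.0 <= visible_fraction <= 1.0:
        raise ValueError("visible_fraction must lie in [0, 1]")
    if visible_fraction >= 0.80:
        return "none/slight"
    if visible_fraction >= 0.40:  # assumption: exactly 40% counts as moderate
        return "moderate"
    return "heavy"
```

How the visible fraction itself is measured from the annotations is not described in the text, so it is taken here as a given input.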

2.2. YOLOv8n

Figure 3 shows the general architecture of YOLOv8n. The network consists of three main parts: a decoupled head for regression and classification, a neck for multi-scale feature fusion, and a backbone for hierarchical feature extraction. The backbone employs an enhanced CSPDarknet [39] topology consisting of CBS, C2f, and SPPF modules. The CBS module integrates a convolutional layer, batch normalization, and the SiLU activation function to perform fundamental local feature extraction. The C2f module, inspired by the ELAN [40] design in YOLOv7 and the C3 [41] module in YOLOv5, facilitates richer feature learning and improved gradient flow. The SPPF module expands the receptive field and incorporates global semantic context by performing three successive max-pooling procedures.

The neck adopts a PAN + FPN architecture, combining a top-down pathway that upsamples high-level semantics and fuses them with fine-grained spatial details, and a bottom-up pathway that propagates precise localization cues upward. The head improves detection precision by using a decoupled design that divides the regression and classification branches into an anchor-free framework.

2.3. YOLO-MGP

This study designs an enhanced detection model, termed YOLO-MGP, based on YOLOv8n, to address the specific challenges of citrus detection, namely inadequate accuracy in complex orchard environments and excessive model complexity. The architectural improvements, which integrate and adapt several existing techniques into a cohesive lightweight framework, are as follows: First, the original backbone network was replaced with the lightweight feature extraction network MobileNetV3 to reduce the computational burden and increase detection speed. Second, the C2f module in the original neck structure was replaced with a lightweight variant known as C2f_GLU. This change reduces the number of parameters while improving the model’s ability to extract, augment, and integrate features. Third, a detection head specialized for small objects with a 160 × 160 resolution was incorporated into the Head layer, which integrates the CA attention mechanism to boost small-target detection capability. Finally, the network’s bounding box regression accuracy and convergence rate were improved by using the PIoU loss function. Figure 4 depicts the structure of the proposed model.

2.3.1. MobileNetV3 Network

The native YOLOv8n backbone employs the CSPDarknet architecture, which integrates the cross-stage partial design paradigm of CSPNet [42] with the Darknet-53 hierarchical structure inherited from YOLOv5. By partitioning feature maps along the channel dimension and routing only a subset through a partial convolutional block while bypassing the remainder, CSPDarknet effectively enhances gradient propagation and mitigates the vanishing gradient problem. While this design yields strong performance on generic detection benchmarks, it exhibits two notable limitations in the context of citrus orchard monitoring. First, the reliance on standard convolutional blocks incurs substantial computational overhead as network capacity scales, rendering deployment on resource-constrained agricultural edge devices impractical. Second, the architecture exhibits constrained sensitivity to fine-grained spatial features, which is detrimental for detecting densely clustered small targets—a defining characteristic of citrus imagery where immature fruits are frequently occluded by foliage.

To address these constraints, we substitute the original CSPDarknet backbone with MobileNetV3 [43], a lightweight convolutional neural network specifically engineered for efficient mobile inference. MobileNetV3 is empirically recognized for attaining a Pareto-optimal balance between inference latency and detection accuracy on embedded processors, thereby meeting the twin demands of real-time operation and accurate small-target identification in citrus detection systems intended for field deployment. The architectural specification of MobileNetV3 is illustrated in Figure 5.

The backbone ingests an RGB citrus image of spatial dimension $640 \times 640$ and progressively abstracts it through a hierarchical cascade of inverted residual blocks. The network is organized into a sequence of stages wherein spatial resolution is successively halved by strided depthwise convolutions, while channel capacity expands to encode increasingly semantic representations. The initial stem layer employs a $3 \times 3$ convolution with stride 2, rapidly downsampling the input to $320 \times 320$ with 16 channels. Subsequent stages interleave bottleneck blocks with and without squeeze-and-excitation (SE) modules, culminating in a final $1 \times 1$ convolutional projection to a high-dimensional feature space prior to global pooling. For the detection task, we extract multi-scale feature maps from the terminal blocks of stages corresponding to output strides of $8\times$, $16\times$, and $32\times$ relative to the input, specifically layers with spatial resolutions of $80 \times 80$, $40 \times 40$, and $20 \times 20$. These hierarchical representations are then routed to the YOLOv8n neck, enabling the detection head to localize small, occluded citrus fruits on fine-grained grids while simultaneously recognizing larger, unobstructed instances on semantically richer, coarser feature maps.
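The relationship between output stride and feature-map resolution is plain integer division of the input size. The short sketch below reproduces the grid sizes quoted above; the helper name `grid_size` is hypothetical.

```python
def grid_size(input_size: int, stride: int) -> int:
    """Side length of the feature map produced at a given output stride."""
    if input_size % stride != 0:
        raise ValueError("input size must be divisible by the stride")
    return input_size // stride


# Strides 8, 16 and 32 on a 640 x 640 input, as used for the three backbone outputs.
backbone_grids = {stride: grid_size(640, stride) for stride in (8, 16, 32)}
```

By the same arithmetic, the stem layer (stride 2) yields a 320 × 320 map, and a stride of 4 yields the 160 × 160 resolution mentioned for the small-object head in Section 2.3.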

As indicated in the architectural overview, a subset of the inverted residual blocks within MobileNetV3 are augmented with Squeeze-and-Excitation (SE) modules [44]. Positioned strategically after the depthwise convolution within the expansion path—thereby operating on the representation of greatest channel dimensionality—the SE mechanism explicitly models interdependencies between channels to perform dynamic, adaptive recalibration of convolutional features. This is accomplished through three sequential operations: Squeeze, Excitation, and Scale.

Given an input feature map $X$, a standard convolution operation $F_{tr}(\cdot)$ first extracts spatial features to produce a new feature map $U \in \mathbb{R}^{H \times W \times C}$, where each channel $c$ is denoted as $u_c$:

$$u_c = F_{tr}(X). \tag{1}$$

This yields $U = [u_1, u_2, \dots, u_C]$, which serves as the input to the SE module. Then, by aggregating spatial information from the whole feature map, the Squeeze operation obtains global contextual information for every channel. Specifically, it applies global average pooling to each channel $u_c$, compressing the $H \times W$ spatial dimensions into a single scalar $z_c$:

$$z_c = F_{sq}(u_c) = \frac{1}{H \cdot W} \sum_{i=1}^{H} \sum_{j=1}^{W} u_c(i, j). \tag{2}$$

Thus, $z \in \mathbb{R}^{C}$ serves as a channel-wise descriptor that summarizes the global spatial information of each channel. Subsequently, the Excitation operation employs a lightweight gating mechanism to generate per-channel modulation weights. This is implemented as a two-layer fully-connected bottleneck network: a dimensionality-reduction layer compresses the channel count from $C$ to $C/r$, where $r$ denotes a predefined compression factor (fixed to 4 in MobileNetV3); a ReLU activation $\delta$ introduces nonlinearity; and a subsequent expansion layer restores the dimensionality to $C$, followed by a Hard-sigmoid activation $\sigma$ to produce the normalized weight vector $s \in \mathbb{R}^{C}$:

$$s = F_{ex}(z, W) = \sigma(W_2 \, \delta(W_1 z)). \tag{3}$$

Finally, the Scale operation applies these learned channel weights to the original feature map, producing the recalibrated output $\tilde{X}$ wherein informative channels are amplified and less relevant ones are suppressed:

$$\tilde{X}_c = F_{scale}(u_c, s_c) = s_c \cdot u_c. \tag{4}$$

In the above formulation, $H$ and $W$ represent the spatial height and width of the feature map; $z_c$ denotes the channel descriptor for the $c$-th channel; $W_1$ and $W_2$ refer to the weight matrices of the dimensionality-reduction and dimensionality-expansion layers, respectively; $\delta$ represents the ReLU activation function; and $\sigma$ signifies the Hard-sigmoid activation function.
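The Squeeze, Excitation, and Scale steps of Equations (2)–(4) can be traced on a tiny feature map with plain Python lists. This is an illustrative sketch under stated assumptions, not the MobileNetV3 implementation: bias terms are omitted, the helper names (`hard_sigmoid`, `matvec`, `se_block`) are hypothetical, and the Hard-sigmoid is taken in its MobileNetV3 form, ReLU6(x + 3) / 6.

```python
def hard_sigmoid(x: float) -> float:
    """Piecewise-linear sigmoid approximation: ReLU6(x + 3) / 6."""
    return min(max(x + 3.0, 0.0), 6.0) / 6.0


def matvec(weight, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weight]


def se_block(U, W1, W2):
    """Apply squeeze-and-excitation to U (C channels, each an H x W nested list).

    W1 has shape (C/r) x C and W2 has shape C x (C/r); biases are omitted.
    Returns the rescaled feature map and the channel weights s.
    """
    # Squeeze (Eq. 2): global average pooling of every channel.
    z = [sum(map(sum, u)) / (len(u) * len(u[0])) for u in U]
    # Excitation (Eq. 3): reduce, ReLU, expand, Hard-sigmoid.
    hidden = [max(h, 0.0) for h in matvec(W1, z)]
    s = [hard_sigmoid(v) for v in matvec(W2, hidden)]
    # Scale (Eq. 4): multiply each channel by its weight s_c.
    scaled = [[[s_c * value for value in row] for row in u] for u, s_c in zip(U, s)]
    return scaled, s
```

With two channels and hand-picked weights such as `W1 = [[1.0, 1.0]]` and `W2 = [[1.0], [-1.0]]`, the second channel receives a negative pre-activation and is suppressed, while the first passes through unchanged, which is the amplify-or-suppress behaviour described above.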

Beyond the SE attention mechanism, the foundational computational unit underpinning the efficiency of MobileNetV3 is the depthwise separable convolution. This process breaks down a standard convolution into two separate and sequential stages: a depthwise convolution and a pointwise convolution, as shown in Figure 6.

In a standard convolutional layer, each filter simultaneously aggregates spatial information from a local receptive field and cross-channel correlations, operating on all input channels concurrently. While effective, this coupling incurs substantial computational redundancy. Depthwise separable convolution breaks this entanglement. In order to maintain spatial locality while avoiding cross-channel interaction, the first stage, depthwise convolution, applies a single lightweight filter to each input channel separately. This yields an intermediate representation of identical channel count, wherein each channel encodes spatially filtered features specific to its corresponding input. The second stage, pointwise convolution, employs a $1 \times 1$ convolutional kernel to project this intermediate tensor onto the desired output channel dimension. This stage is solely responsible for fusing and recombining information across channels, effectively decoupling spatial filtering from feature generation. In MobileNetV3, this factorized convolution constitutes the core spatial processing unit within each inverted residual block, wherein the depthwise convolution operates on the expanded channel representation and the subsequent pointwise convolution projects it back to the bottleneck dimension.
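The saving from this factorization is easy to quantify by counting weights. The sketch below compares a standard $k \times k$ convolution with its depthwise separable counterpart; the function names and the example channel counts (64 in, 128 out) are illustrative assumptions, and bias terms are ignored.

```python
def standard_conv_params(c_in: int, c_out: int, k: int) -> int:
    """Weights in a standard k x k convolution (no bias)."""
    return k * k * c_in * c_out


def depthwise_separable_params(c_in: int, c_out: int, k: int) -> int:
    """Weights in a k x k depthwise convolution followed by a 1 x 1 pointwise one."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 projection across channels
    return depthwise + pointwise


# Hypothetical layer: 64 input channels, 128 output channels, 3 x 3 kernel.
full = standard_conv_params(64, 128, 3)
separable = depthwise_separable_params(64, 128, 3)
ratio = separable / full  # equals 1 / c_out + 1 / k**2
```

For a $3 \times 3$ kernel the ratio approaches $1/9$ as the number of output channels grows, which is where most of the parameter reduction in such backbones comes from.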

2.3.2. C2f_GLU

In YOLOv8n, the C2f module serves as the core component for feature processing, integrating functions for feature extraction, enhancement, and fusion. This structure significantly enhances the representational power of the module through feature reuse and the introduction of rich gradient flows. In the context of citrus recognition in intricate orchard environments, the standard C2f module enhances the receptive field and feature richness by stacking multiple Bottlenecks, resulting in an increase in parameters and computational demands, thereby rendering it less appropriate for mobile device deployment.

This study presents the C2f_GLU [45] module as a solution to the aforementioned problems. The architectural design is illustrated in Figure 7. This module dynamically improves the depiction of essential citrus attributes using a gated channel attention technique and integrates a streamlined bottleneck design, thereby improving feature discriminability while effectively controlling computational complexity. Its core component is the Conv_GLU [45] unit, which achieves gated channel attention modelling based on nearest-neighbour features by introducing DWConv (depthwise convolution) into the GLU gating branch.

The procedural flow is detailed below. The input feature map is processed by two parallel paths to produce the value feature $V$ and the gating weight $G$, respectively. In the value pathway, a $1 \times 1$ convolution is employed to extract basic features. Meanwhile, in the gating pathway, spatial awareness is introduced: this pathway sequentially performs a $1 \times 1$ convolution, followed by a $3 \times 3$ depthwise convolution (DWConv), and an activation function to produce a spatially adaptive attention map. Subsequently, the value feature $V$ and the gating weight $G$ are multiplied element-wise to achieve dynamic recalibration of the feature map. Finally, a $1 \times 1$ convolution is employed, after which the modulated feature $X_{gate}$ is combined with the module’s original input $X$ through a residual connection. In summary, the formulation of Conv_GLU is expressed as:

$$V = \mathrm{Conv}_{1 \times 1}^{v}(X_c) \tag{5}$$

$$G = \sigma(\mathrm{DWConv}_{3 \times 3}(\mathrm{Conv}_{1 \times 1}^{g}(X_c))) \tag{6}$$

$$X_{gate} = V * G \tag{7}$$

$$X_{out} = X + X_{gate} \tag{8}$$

$$\mathrm{ConvGLU}(x) = (X_v + b) * \sigma(\mathrm{DWConv}(X_G + b)) \tag{9}$$

The integration of the lightweight Conv_GLU unit within the C2f_GLU framework is designed to enhance the discriminability of citrus-related features through spatially adaptive gating, while maintaining a parameter-efficient and computationally economical profile.
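To make the data flow of Equations (5)–(8) concrete, the following dependency-free sketch runs a gated unit on nested lists. It is a simplified illustration, not the authors' module: biases and normalization are omitted, a logistic sigmoid is assumed for the gate activation $\sigma$ (the text only says "an activation function"), the output projection follows the prose description of a final $1 \times 1$ convolution, and all function names are hypothetical.

```python
import math


def conv1x1(x, weight):
    """Pointwise convolution. x: C_in x H x W nested lists; weight: C_out x C_in."""
    h, w = len(x[0]), len(x[0][0])
    return [[[sum(wt * x[c][i][j] for c, wt in enumerate(row)) for j in range(w)]
             for i in range(h)] for row in weight]


def dwconv3x3(x, kernels):
    """Depthwise 3 x 3 convolution with zero padding; one 3 x 3 kernel per channel."""
    h, w = len(x[0]), len(x[0][0])
    out = []
    for channel, k in zip(x, kernels):
        plane = [[0.0] * w for _ in range(h)]
        for i in range(h):
            for j in range(w):
                for di in (-1, 0, 1):
                    for dj in (-1, 0, 1):
                        ii, jj = i + di, j + dj
                        if 0 <= ii < h and 0 <= jj < w:
                            plane[i][j] += k[di + 1][dj + 1] * channel[ii][jj]
        out.append(plane)
    return out


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def conv_glu(x, w_value, w_gate, dw_kernels, w_out):
    """Gated unit: value path, spatially aware gate, projection, residual."""
    v = conv1x1(x, w_value)                                   # Eq. (5)
    g = [[[sigmoid(t) for t in row] for row in ch]            # Eq. (6)
         for ch in dwconv3x3(conv1x1(x, w_gate), dw_kernels)]
    gated = [[[a * b for a, b in zip(rv, rg)] for rv, rg in zip(cv, cg)]
             for cv, cg in zip(v, g)]                         # Eq. (7)
    proj = conv1x1(gated, w_out)                              # final 1 x 1 convolution
    return [[[a + b for a, b in zip(rx, rp)] for rx, rp in zip(cx, cp)]
            for cx, cp in zip(x, proj)]                       # Eq. (8), residual sum
```

Because the gate is computed after a depthwise $3 \times 3$ convolution, each gate value depends on a pixel's immediate neighbourhood, which is what the text means by gating based on nearest-neighbour features.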

2.3.3. CA_Detect

The intricate nature of citrus orchard landscapes leads to multi-perspective discrepancies in the obtained images, causing substantial changes in target scales, especially for the detection of small-target fruits [46]. The features of these targets are weak and are prone to being mistaken for the background, which presents a serious challenge to high-precision detection. In this study, targets in the citrus dataset were categorised into four types based on the aspect ratio of the citrus labels and images: large targets (aspect ratio $> 0.3$), medium targets ($0.3 >$ aspect ratio $> 0.1$), small targets ($0.1 >$ aspect ratio), and tiny targets ($0.02 >$ aspect ratio). The precise distribution of the labels is illustrated in Figure 8, and the statistical results are detailed in Table 1.
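The four size classes can be written as a threshold function of the label-to-image ratio. This is only a sketch of one possible reading: the text does not say how values exactly equal to 0.1 or 0.3 are assigned, and the small and tiny ranges overlap, so the code assumes that boundary values fall into the medium class and that the tiny class takes precedence below 0.02. The function name is hypothetical.

```python
def size_category(ratio: float) -> str:
    """Classify a target by the ratio of its label size to the image size."""
    if ratio < 0.0:
        raise ValueError("ratio must be non-negative")
    if ratio < 0.02:   # assumption: tiny takes precedence over small
        return "tiny"
    if ratio < 0.1:
        return "small"
    if ratio <= 0.3:   # assumption: 0.1 and 0.3 themselves count as medium
        return "medium"
    return "large"
```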

To mitigate the inherent deficiency of the baseline model in detecting small-scale objects—an issue stemming from insufficient spatial granularity in deeper feature hierarchies—this study introduces a series of architectural refinements targeting both the neck network and the detection heads. A dedicated P2 detection head, explicitly tailored for the extraction and localization of diminutive targets, is incorporated into the existing framework. This head capitalizes on the Coordinate Attention mechanism [47] to encode precise positional dependencies, while concurrently aggregating shallow feature maps that retain high-resolution structural details and fine-grained spatial cues. The fusion of these complementary representations markedly enhances the model’s capacity for multi-scale feature learning, particularly in discerning small objects that are otherwise prone to being submerged within semantically coarse deeper layers. To counteract the inevitable surge in computational burden and parameter volume engendered by the inclusion of this additional shallow prediction layer, a layer-wise compensation paradigm is adopted. Concretely, the deep P5 detection head—characterized by substantial semantic redundancy and limited contribution to fine-grained localization—is surgically excised in tandem with the introduction of the P2 layer, thereby preserving a favorable trade-off between detection fidelity and inference efficiency. The resultant streamlined network topology is depicted in Figure 9.

By incorporating a coordinate-aware feature encoding technique, the coordinate attention mechanism allows explicit modelling of positional information while maintaining the modelling capability of channel attention. This mechanism consists of two main stages: embedding of coordinate information and generation of coordinate attention [48]. The initial phase entails one-dimensional global average pooling, applied separately along the width and height directions. The resulting $z_c^h(h)$ and $z_c^w(w)$ are concatenated along the spatial dimension, as illustrated in Figure 10.

A $1 \times 1$ convolution $F_1$ is then employed for dimensionality reduction. Subsequently, the resulting feature map is split into a pair of directional features, $f^h \in \mathbb{R}^{C/r \times H}$ and $f^w \in \mathbb{R}^{C/r \times W}$, which are then mapped back to the original channel dimension using two independent $1 \times 1$ convolutions $F_h$ and $F_w$. These are normalized via a Sigmoid activation, producing the final attention weights. In the final phase, the attention weights from both directions are broadcast to the dimensions of the original feature map, and the output is generated by multiplying these weights with the input feature element-wise. The procedure is formalized as follows:

Horizontal pooling (width direction):

$$z_c^h(h) = \frac{1}{W} \sum_{i=1}^{W} x_c(h, i), \quad c = 1, 2, \dots, C \tag{10}$$

Vertical pooling (height direction):

$$z_c^w(w) = \frac{1}{H} \sum_{j=1}^{H} x_c(j, w), \quad c = 1, 2, \dots, C \tag{11}$$

$$f = \delta(F_1([z^h, z^w])), \quad f \in \mathbb{R}^{C/r \times (H + W)} \tag{12}$$

$$g^h = \sigma(F_h(f^h)), \quad g^h \in \mathbb{R}^{C \times H} \tag{13}$$

$$g^w = \sigma(F_w(f^w)), \quad g^w \in \mathbb{R}^{C \times W} \tag{14}$$

$$y_c(i, j) = x_c(i, j) \times g_c^h(i) \times g_c^w(j) \tag{15}$$

$C$, $H$, and $W$ denote the channel count, height, and width of the feature tensor, respectively. $[z^h, z^w]$ represents the concatenation of the two tensors $z^h$ and $z^w$. The symbol $r$ denotes the channel reduction ratio, $\delta$ refers to the nonlinear activation function, while $\sigma$ denotes the Sigmoid activation function.

By combining feature responses from the two spatial directions, the coordinate attention mechanism preserves positional information while capturing long-range dependencies. This method improves the representation of important regions by encoding direction-sensitive information into attention weight maps along the horizontal and vertical directions. These maps are then applied to the input features using element-wise multiplication. Compared with traditional attention mechanisms, the coordinate attention approach models channel dependencies and spatial structures simultaneously. For applications requiring accurate localisation, such as small-object detection, this method improves position awareness and expands the receptive field at a minimal computational cost.
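Equations (10)–(15) can be followed step by step in a dependency-free sketch. The code below is an illustration under stated assumptions rather than the module used in the paper: ReLU stands in for the unspecified nonlinearity $\delta$, normalization layers and biases are omitted, the $1 \times 1$ convolutions are given as plain weight matrices, and all names are hypothetical.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def conv1x1(weight, feats):
    """1 x 1 convolution on a C_in x L feature strip; weight: C_out x C_in."""
    length = len(feats[0])
    return [[sum(wt * feats[c][p] for c, wt in enumerate(row)) for p in range(length)]
            for row in weight]


def coordinate_attention(x, f1, f_h_weight, f_w_weight):
    """x: C x H x W nested lists; f1: (C/r) x C; f_h_weight, f_w_weight: C x (C/r)."""
    C, H, W = len(x), len(x[0]), len(x[0][0])
    # Eq. (10): average over the width, one value per row (C x H).
    z_h = [[sum(x[c][i]) / W for i in range(H)] for c in range(C)]
    # Eq. (11): average over the height, one value per column (C x W).
    z_w = [[sum(x[c][i][j] for i in range(H)) / H for j in range(W)] for c in range(C)]
    # Eq. (12): concatenate along the spatial axis, reduce channels, apply ReLU.
    concat = [z_h[c] + z_w[c] for c in range(C)]
    f = [[max(v, 0.0) for v in row] for row in conv1x1(f1, concat)]
    f_h = [row[:H] for row in f]
    f_w = [row[H:] for row in f]
    # Eqs. (13)-(14): restore the channel count and squash to (0, 1).
    g_h = [[sigmoid(v) for v in row] for row in conv1x1(f_h_weight, f_h)]
    g_w = [[sigmoid(v) for v in row] for row in conv1x1(f_w_weight, f_w)]
    # Eq. (15): every pixel is weighted by its row gate and its column gate.
    return [[[x[c][i][j] * g_h[c][i] * g_w[c][j] for j in range(W)]
             for i in range(H)] for c in range(C)]
```

Each output pixel is scaled by one gate indexed by its row and one indexed by its column, so the attention map stays position-aware while costing only $H + W$ gate values per channel instead of $H \times W$.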

2.3.4. PIoU Loss Function

In this study, the Powerful-IoU (PIoU) loss mechanism proposed by Liu et al. [49] was adopted to replace the default Complete-IoU (CIoU) loss employed in the YOLOv8n architecture for bounding box regression.

The PIoU loss introduces a penalty factor $P$ that adapts to the dimensions of the target box. In particular, $P$ is defined by the normalised distances between the corresponding edges of the ground-truth box and the predicted box. Unlike conventional penalty terms that depend on the minimum bounding rectangle, the denominator of $P$ relies exclusively on the width $w_{gt}$ and height $h_{gt}$ of the target box. Figure 11 shows a schematic comparison of the computational principles behind CIoU and PIoU.

The core structure of PIoU comprises two components: the standard IoU term and a regularization term incorporating the penalty factor P. Its mathematical formulation is provided below:

IoU = \frac{A \cap B}{A \cup B}

(16)

PIoU = IoU - f (P)

(17)

{Loss}_{P I o U} = 1 - PIoU = {Loss}_{I o U} + f (P)

(18)

The mean of the normalised edge distances between the target box and the prediction box is used to calculate the penalty factor P:

P = \frac{1}{4} (\frac{| d_{w 1} |}{w_{g t}} + \frac{| d_{w 2} |}{w_{g t}} + \frac{| d_{h 1} |}{h_{g t}} + \frac{| d_{h 2} |}{h_{g t}})

(19)

Here, d_{w1}, d_{w2}, d_{h1}, and d_{h2} denote the absolute distances between the corresponding edge pairs of the predicted and target boxes, as illustrated in Figure 11b; w_{gt} and h_{gt} indicate the width and height of the ground-truth bounding box.

The function f(P) serves as a gradient modulation term that adapts the penalty magnitude to the quality of the predicted box. f(P) is defined as:

f (P) = 1 - e^{- P^{2}}

(20)

This functional form ensures that the gradient of the penalty term remains bounded and varies non-monotonically with respect to P.
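A short derivation makes the boundedness and non-monotonicity explicit. It is a worked check that follows from Equation (20) alone and is not an additional result taken from the cited work:

```latex
f'(P) = 2P \, e^{-P^{2}}, \qquad
f''(P) = 2 \left( 1 - 2P^{2} \right) e^{-P^{2}}

f''(P) = 0 \;\Rightarrow\; P = \tfrac{1}{\sqrt{2}}, \qquad
f'\!\left( \tfrac{1}{\sqrt{2}} \right) = \sqrt{2} \, e^{-1/2} \approx 0.858
```

The gradient of the penalty term therefore vanishes at P = 0, rises to a maximum of about 0.858 near P ≈ 0.707, and decays towards zero for poorly aligned boxes with large P.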

This study employs PIoUv2 as the specific loss function variant. PIoUv2 extends the base PIoU formulation by incorporating a non-monotonic attention mechanism that further weights the loss contribution according to anchor box quality. The quality of an anchor box is quantified by the parameter q, which is derived from the penalty factor P:

q = e^{- P}, q \in (0, 1]

(21)

Here, q = 1 corresponds to perfect alignment between the predicted box and the target box (P = 0), while lower values of q indicate progressively poorer alignment.

The attention function u(x) is a non-monotonic weighting function. It is evaluated at x = λq, so that its effect is controlled by a single hyperparameter λ:

u (x) = 3 x \cdot e^{- x^{2}}

(22)

The complete PIoUv2 loss is then formulated as the product of the attention weight and the base PIoU loss:

\mathrm{Loss}_{PIoUv2} = u(λ q) \cdot \mathrm{Loss}_{PIoU}

(23)

In this formulation, λ is a tunable hyperparameter that regulates the intensity of the attention mechanism. Following the recommendation of Liu et al. [49], λ was set to 1.3 for all experiments conducted in this study.
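Equations (16) to (23) can be traced for a single pair of boxes with the sketch below. It is a minimal scalar illustration, not the batched implementation used for training: the corner format (x1, y1, x2, y2) and all function names are assumptions introduced here, and the sketch ignores how the attention weight is treated during back-propagation.

```python
import math


def iou(box_a, box_b):
    """Eq. (16): IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def penalty_factor(pred, gt):
    """Eq. (19): mean edge distance normalised by the ground-truth size."""
    px1, py1, px2, py2 = pred
    gx1, gy1, gx2, gy2 = gt
    w_gt, h_gt = gx2 - gx1, gy2 - gy1
    d_w1, d_w2 = abs(px1 - gx1), abs(px2 - gx2)  # left and right edges
    d_h1, d_h2 = abs(py1 - gy1), abs(py2 - gy2)  # top and bottom edges
    return 0.25 * ((d_w1 + d_w2) / w_gt + (d_h1 + d_h2) / h_gt)


def piou_loss(pred, gt):
    """Eqs. (17), (18), (20): Loss_PIoU = 1 - IoU + (1 - exp(-P^2))."""
    p = penalty_factor(pred, gt)
    return 1.0 - iou(pred, gt) + (1.0 - math.exp(-p * p))


def attention(x):
    """Eq. (22): non-monotonic weighting u(x) = 3x * exp(-x^2)."""
    return 3.0 * x * math.exp(-x * x)


def piou_v2_loss(pred, gt, lam=1.3):
    """Eqs. (21), (23): attention-weighted PIoU loss with q = exp(-P)."""
    q = math.exp(-penalty_factor(pred, gt))
    return attention(lam * q) * piou_loss(pred, gt)
```

For example, with a ground-truth box (0, 0, 10, 10) and a prediction shifted one pixel to the right, (1, 0, 11, 10), only the two vertical edges are displaced, so P = (1/10 + 1/10) / 4 = 0.05 and the IoU is 90/110.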

3. Results

3.1. Experimental Environment and Parameters

Table 2 lists the hardware and software parameters of the experimental environment. The experiments were performed on a platform running the Windows 10 operating system. Images were resized to 640 × 640 pixels before input. The batch size was 30, with 4 worker processes for data loading. The learning rate was initialized at 0.01 with a minimum of 0.0001, and the weight decay coefficient was 0.0005. The Stochastic Gradient Descent (SGD) optimizer was used with a momentum coefficient of 0.937. Training ran for 200 epochs. No pre-trained weights were used in any of the experiments. To ensure reproducibility, all experiments were conducted with a fixed random seed across all libraries.
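For readers who want to reproduce a comparable setup, the settings above can be collected into a single configuration. The sketch below assumes the Ultralytics training interface, which the text does not name explicitly; under that assumption the minimum learning rate is expressed through the ratio `lrf`, since the final rate is `lr0 * lrf`. The seed value is hypothetical because the paper does not report it.

```python
LR_INITIAL = 0.01    # initial learning rate reported above
LR_MINIMUM = 0.0001  # minimum learning rate reported above

# Keyword names follow the Ultralytics trainer (an assumption, see lead-in).
train_args = {
    "imgsz": 640,                       # input resolution 640 x 640
    "batch": 30,                        # batch size
    "workers": 4,                       # data-loading worker processes
    "epochs": 200,
    "optimizer": "SGD",
    "lr0": LR_INITIAL,
    "lrf": LR_MINIMUM / LR_INITIAL,     # final LR = lr0 * lrf
    "momentum": 0.937,
    "weight_decay": 0.0005,
    "pretrained": False,                # no pre-trained weights
    "seed": 0,                          # hypothetical value, not reported
    "deterministic": True,
}
```

Under the same assumption, training from scratch would look like `YOLO("yolov8n.yaml").train(data="citrus.yaml", **train_args)`, where `citrus.yaml` is a placeholder name for the dataset description file.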

3.2. Model Evaluation Metrics

This study employed precision (P), recall (R), and mean average precision (mAP) as metrics to evaluate object detection performance. Precision is the ratio of true positive predictions to all predictions classified as positive. Recall is the proportion of actual positive instances correctly identified by the model. Among the mAP variants, mAP_{50} is the mean average precision calculated at a fixed IoU threshold of 0.5, whereas mAP_{50-95} is the mean of the values computed at ten IoU thresholds from 0.5 to 0.95 in steps of 0.05 (i.e., 0.5, 0.55, …, 0.95). Additionally, to assess the model’s suitability for resource-limited settings, the number of parameters (Params), floating-point operations (FLOPs), and model size were also reported. The relevant mathematical expressions are presented below:

P = \frac{TP}{TP + FP}

(24)

R = \frac{TP}{TP + FN}

(25)

AP_{class} = \int_{0}^{1} P(R) \, dR

(26)

mAP_{50} = \frac{1}{N} \sum_{i = 1}^{N} AP_{class_{i}} (IoU = 0.5)

(27)

mAP_{50-95} = \frac{1}{10} \sum_{i = 0}^{9} mAP (IoU = 0.5 + 0.05 i)

(28)

In this context, TP, FP, and FN denote the numbers of true positives, false positives, and false negatives, respectively. Average Precision (AP) is used to assess the detection performance for an individual category. It is obtained by computing precision values at different recall levels and then taking the area under the precision–recall curve. N represents the total number of categories in the object detection task.
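The metric definitions in Equations (24) to (28) can be sketched as follows. The function names are illustrative, and the AP integral is approximated here with all-point interpolation over a monotone precision envelope, which is one common convention; the evaluation code used for the reported results may use a different interpolation scheme.

```python
def precision(tp, fp):
    """Eq. (24): TP / (TP + FP)."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Eq. (25): TP / (TP + FN)."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def average_precision(recalls, precisions):
    """Eq. (26): area under the precision-recall curve.

    `recalls` must be sorted in ascending order, with `precisions`
    holding the precision measured at each recall level.
    """
    mrec = [0.0] + list(recalls) + [1.0]
    mpre = [0.0] + list(precisions) + [0.0]
    # Make precision non-increasing from right to left (envelope).
    for i in range(len(mpre) - 2, -1, -1):
        mpre[i] = max(mpre[i], mpre[i + 1])
    # Sum the rectangles between consecutive recall levels.
    return sum(
        (mrec[i + 1] - mrec[i]) * mpre[i + 1] for i in range(len(mrec) - 1)
    )


def mean_ap(ap_per_class):
    """Eq. (27): mean of the per-class AP values at one IoU threshold."""
    return sum(ap_per_class) / len(ap_per_class)


def map_50_95(map_per_threshold):
    """Eq. (28): mean over the ten thresholds 0.50, 0.55, ..., 0.95."""
    if len(map_per_threshold) != 10:
        raise ValueError("expected one mAP value per IoU threshold (10)")
    return sum(map_per_threshold) / 10
```

With a single category, as in a citrus-only dataset, N = 1 and mAP_{50} reduces to the AP of that category at an IoU threshold of 0.5.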

3.3. Ablation Experiment

A series of ablation experiments was carried out to assess the effect of each improved module proposed in this study on citrus detection. Every experiment used the same citrus dataset under consistent experimental conditions. The performance metrics are presented in Table 3, while Figure 12 provides a visual comparison of the parameter counts and accuracy trends across different module combinations.

The ablation study was conducted by incrementally integrating each proposed module into the baseline YOLOv8n architecture. As shown in Table 3, substituting the backbone network with MobileNetV3 (Row 2) resulted in decreases of 1.2, 1.1, and 1.7 percentage points in Precision, Recall, and mAP_{50}, respectively, while the parameter count increased slightly from 3.01 M to 3.11 M and the FPS declined to 227. Replacing the standard C2f module with C2f_GLU (Row 3) reduced the parameter count by 7.0% to 2.80 M, though Precision and mAP_{50} dropped by 2.0 and 1.2 percentage points, with a modest FPS reduction to 225. The CA_Detect detection head (Row 4), which introduces a high-resolution P2 layer and omits the P5 layer, increased Precision by 0.3 percentage points, Recall by 1.7 percentage points, and mAP_{50} by 0.8 percentage points. Critically, it reduced the parameter count by 32.9% to 2.02 M, albeit at the cost of a more pronounced FPS drop to 107, attributable to the increased computational load of the high-resolution feature map. Nevertheless, this speed remains well above the typical real-time threshold for agricultural applications. Replacing the baseline loss function with PIoU (Row 5) increased Precision by 2.0 percentage points and mAP_{50} by 0.4 percentage points, with no change in parameter count and a negligible FPS decrease to 235.

Multi-module combinations were further evaluated to assess synergistic effects. The combination of MobileNetV3 and PIoU (Row 6) resulted in declines of 0.4, 1.1, and 1.1 percentage points in Precision, Recall, and mAP_{50}, respectively, relative to the baseline, while maintaining a high FPS of 235. The addition of C2f_GLU to this configuration (Row 7) yielded a 7.0% parameter reduction compared to the baseline, though mAP_{50} decreased by 1.3 percentage points to 92.9% and FPS dropped to 217. In contrast, integrating MobileNetV3, CA_Detect, and PIoU (Row 8) produced improvements of 3.0, 3.3, and 1.8 percentage points in Precision, Recall, and mAP_{50}, respectively, along with a 29.5% reduction in parameters and an FPS of 178, demonstrating a favorable efficiency–accuracy trade-off. Finally, the complete integration of all four modules (MobileNetV3, C2f_GLU, CA_Detect, and PIoU; Row 9) constitutes the proposed YOLO-MGP model. Relative to the baseline YOLOv8n, this configuration improved Precision by 2.8 percentage points, Recall by 2.6 percentage points, mAP_{50} by 1.5 percentage points, and mAP_{50-95} by 1.9 percentage points, while reducing the total parameter count by 32.1%, from 3.01 M to 2.04 M. The corresponding FPS of 159 is approximately 65% of the baseline speed; however, it still corresponds to a processing latency of only 6.3 ms per frame, which is more than sufficient for real-time citrus recognition in field conditions where 30–60 FPS is typically required. Thus, the YOLO-MGP model achieves a compelling balance between enhanced detection accuracy, substantial parameter reduction, and practical inference latency for edge deployment.
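The throughput and parameter figures quoted in this section follow from two simple conversions, sketched below. The helper names are illustrative and not part of the study's code.

```python
def latency_ms(fps: float) -> float:
    """Per-frame latency in milliseconds for a throughput in frames/s."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return 1000.0 / fps


def relative_reduction(baseline: float, new: float) -> float:
    """Fractional reduction of `new` relative to `baseline`."""
    return (baseline - new) / baseline


if __name__ == "__main__":
    # Values taken from the ablation discussion (Rows 9, 3, and 4).
    print(f"{latency_ms(159):.1f} ms per frame at 159 FPS")
    print(f"{100 * relative_reduction(3.01, 2.80):.1f}% fewer parameters (Row 3)")
    print(f"{100 * relative_reduction(3.01, 2.02):.1f}% fewer parameters (Row 4)")
```

By the same conversion, a 30–60 FPS requirement corresponds to a per-frame budget of roughly 16.7–33.3 ms, which puts the 6.3 ms latency in context.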

3.4. Comparative Experiments on Different Models

This study assessed the efficacy of the proposed YOLO-MGP method by comparing it with several established object detection models and recently introduced citrus-specific detection algorithms. Table 4 presents a comparative analysis of the performance of YOLOv7-tiny, YOLOv9t, YOLOv10n [50], YOLOv11n, YOLOv8n, and five recently developed citrus detection models on the dataset used in this study. Among the citrus-specific models, YOLO-CW [51] enhances the baseline with CBAM attention and Wise-IoU [52] loss; YOLO-CWG [53] further integrates CBAM, Wise-IoU v3, and a lightweight GhostNet [54] module; and YOLO-AGV [55] incorporates AIFI [56], VoVGSCSP, and GSConv modules for improved multi-scale feature fusion. YOLO-BC [57], which adopts a BiFPN neck and the CA attention mechanism, and YOLO-CC [58], which integrates the CARAFE upsampling module, are two additional citrus-specific detectors included to broaden the comparative evaluation. Figure 13 illustrates the results of the performance comparison.

Table 4 shows notable differences in performance and efficiency among the models for citrus fruit detection. YOLO-MGP demonstrated excellent performance across multiple core metrics. Its precision (P) reached 94.2%, outperforming all comparison models and exceeding the second-best model, YOLOv8n (91.4%), by 2.8 percentage points. Regarding recall (R), YOLO-MGP led with 89.7%, which was 2.6 percentage points higher than YOLOv8n (87.1%), suggesting more comprehensive target coverage while maintaining high precision. For mean average precision (mAP), YOLO-MGP achieved top scores of 95.7% for mAP_{50} and 73.9% for mAP_{50-95}. Notably, the mAP_{50-95} value represented an increase of 1.9 percentage points compared to YOLOv8n (72.0%) and a substantial increase of 5.6 percentage points compared to YOLOv10n (68.3%).

With respect to the citrus-specific detection models, YOLO-CW achieved an mAP_{50} of 92.8% and an mAP_{50-95} of 68.6%, with a parameter count of 3.10 M and a model size of 6.1 MB. While its recall of 85.9% was relatively competitive, its precision of 89.4% remained 4.8 percentage points lower than that of YOLO-MGP. YOLO-CWG exhibited improved performance with an mAP_{50} of 93.3% and an mAP_{50-95} of 69.3%; however, this improvement came with a significantly higher parameter count of 6.40 M and a model size of 12.8 MB, the largest among all assessed models, making it less suitable for deployment on resource-limited edge devices. YOLO-AGV achieved an mAP_{50} of 92.5% and an mAP_{50-95} of 68.7%, with 3.43 M parameters and a model size of 6.8 MB, placing its overall performance marginally below that of YOLO-CW and YOLO-CWG. YOLO-BC and YOLO-CC delivered comparable results, with mAP_{50} of 92.6% and 92.5%, and mAP_{50-95} of 68.1% and 67.7%, respectively. Their parameter counts (3.02 M and 3.15 M) and model sizes (6.0 MB and 6.2 MB) were similar to those of YOLO-CW, yet they still did not reach the detection accuracy of YOLO-MGP.

By contrast, YOLO-MGP surpassed all five citrus-specific models in every detection metric, while also exhibiting the smallest model size (4.4 MB) and the second-lowest parameter count (2.04 M) among all compared models. As Figure 13 shows, YOLO-MGP offers improved detection capability in terms of both accuracy and efficiency. This comparative advantage underscores the effectiveness of the architectural changes introduced in YOLO-MGP in balancing detection accuracy and model compactness for citrus fruit detection.

In terms of model complexity, YOLO-MGP also demonstrated significant advantages. Its parameter count was only 2.04 M, lower than most comparison models and only slightly above YOLOv9t (1.97 M), the model with the fewest parameters. Regarding computational cost (FLOPs), YOLO-MGP was 9.4 G, which, although higher than YOLOv10n (6.5 G) and YOLOv11n (6.3 G), was considerably lower than YOLOv7-tiny (13.2 G). Notably, the model size of YOLO-MGP was only 4.4 MB, tying with YOLOv9t for the smallest among all compared models and smaller than all other YOLO variants. These results suggest that this model is particularly advantageous for deployment on mobile and embedded devices in agricultural settings.

Figure 14 illustrates heatmaps that demonstrate the performance of various models in recognizing citrus fruits under different settings, including occlusion by branches and foliage, overlapping fruits, small target recognition, and variations in illumination conditions.

The heatmap visualizations demonstrate that YOLO-MGP exhibits a markedly more focused attention pattern compared to the other models. While YOLO-MGP concentrates its high-response activations tightly on the citrus fruit regions, the comparative models display scattered and diffuse activation maps, suggesting a less precise allocation of computational focus.

3.5. Comparative Experiments of Different Backbone Networks

To address the limitations of the original YOLOv8n backbone in capturing features from small citrus targets and to reduce model complexity, this study employed the MobileNetV3-small network as a replacement for the baseline backbone architecture. To further validate this selection, a comparative analysis was conducted against five mainstream lightweight backbone networks. The evaluation metrics are presented in Table 5.

As shown in Table 5, the baseline YOLOv8n backbone achieved a Precision of 91.4%, Recall of 87.1%, and mAP_{50} of 94.2%, with 3.01 M parameters and 8.1 GFLOPs. Substituting the backbone with MobileNetV3 resulted in decreases of 1.2, 1.1, and 1.7 percentage points in Precision, Recall, and mAP_{50}, respectively, while the parameter count remained nearly equivalent at 3.11 M and FLOPs decreased by 18.5% to 6.6 G. GhostNet and StarNet exhibited lower performance, with mAP_{50} values of 91.0% and 90.1%, respectively. ShuffleNetV2 achieved the highest mAP_{50} among the lightweight alternatives at 93.6%, though this came at the cost of increased parameters (4.06 M) and FLOPs (8.3 G). EfficientNet yielded an mAP_{50} of 92.8% with 3.45 M parameters and 7.5 GFLOPs. Compared with ShuffleNetV2, MobileNetV3 reduced the parameter count by 23.3% and FLOPs by 20.5%, while incurring an mAP_{50} reduction of only 1.1 percentage points. Relative to StarNet, which had the lowest computational cost (5.8 GFLOPs), MobileNetV3 improved mAP_{50} by 2.4 percentage points with an additional 0.8 GFLOPs. The results indicate that MobileNetV3 attains a favorable balance between detection accuracy and computational efficiency for citrus fruit detection tasks.

3.6. Comparison of Different Loss Functions

The efficacy of the PIoU loss for citrus detection was assessed by comparing it with six widely adopted loss functions within the YOLOv8n pipeline under identical training configurations. Table 6 reports the resulting performance indicators, and Figure 15 depicts the training loss trajectories for each variant.

As shown in Table 6, the baseline CIoU loss function achieved a Precision of 91.4%, Recall of 87.1%, and mAP_{50} of 94.2%. DIoU and MPDIoU yielded comparable mAP_{50} values of 94.2% and 94.3%, respectively. GIoU improved mAP_{50} by 0.3 percentage points to 94.5%, while SIoU achieved the highest Recall among all compared functions at 88.0%. EIoU attained a Precision of 93.0%, though its Recall decreased by 0.5 percentage points relative to the baseline. The PIoU loss function adopted in this study achieved the highest Precision (93.4%) and mAP_{50} (94.6%) among all evaluated functions, corresponding to gains of 2.0 and 0.4 percentage points, respectively, relative to the CIoU baseline.

Figure 15 presents the training loss curves for each function. As illustrated in Figure 15a, the bounding box regression loss of PIoU decreased more rapidly than that of the other functions, indicating faster convergence during training. The classification loss and distribution focal loss curves exhibited similar trends across all evaluated functions. Overall, PIoU performed best in terms of both convergence speed and final detection accuracy.

3.7. Comparison of Different Detection Heads

To assess the trade-off introduced by adding the P2 detection head for small-target discrimination while simultaneously pruning the P5 head to curtail parameters, comparative experiments were conducted with four detection head configurations under identical training conditions. The performance metrics are presented in Table 7.

As shown in Table 7, the baseline YOLOv8n detection head (P3 + P4 + P5) without the CA module achieved a Precision of 91.4%, Recall of 87.1%, and mAP_{50} of 94.2%, with 3.01 M parameters and 8.1 GFLOPs. When the CA module was incorporated into the same P3 + P4 + P5 structure, denoted as P3 + P4 + P5 (CA), Precision decreased by 0.9 percentage points to 90.5%, Recall dropped by 2.4 percentage points to 84.7%, and mAP_{50} declined by 1.4 percentage points to 92.8%. Adding a P2 layer to form the P2 + P3 + P4 + P5 (CA) configuration raised Precision to 92.2% (0.8 percentage points above the baseline) and mAP_{50} to 94.4%, recovering the loss observed for the P3 + P4 + P5 (CA) variant, while increasing FLOPs to 12.2 G. Removing the P5 layer to obtain the P2 + P3 + P4 (CA) configuration further improved Recall by 1.7 percentage points to 88.8% and increased mAP_{50} by 0.6 percentage points to 95.0%, while reducing the parameter count by 31.1% to 2.02 M relative to the P2 + P3 + P4 + P5 (CA) configuration. Compared with the YOLOv8n baseline (P3 + P4 + P5 without CA), the P2 + P3 + P4 (CA) configuration improved Precision by 0.3 percentage points, Recall by 1.7 percentage points, and mAP_{50} by 0.8 percentage points, while reducing parameters by 32.9% from 3.01 M to 2.02 M. These results demonstrate that incorporating the P2 layer enhances small-target detection capability, while removing the P5 layer effectively reduces model complexity without compromising overall accuracy.

3.8. Performance Under Different Lighting Conditions

To quantitatively assess the robustness of the proposed model under varying natural illumination, the 1102 annotated images were grouped by acquisition time, and 200 images were randomly selected from each temporal group—Morning, Noon, and Afternoon—yielding three evaluation subsets in which the device ratio naturally remained at approximately 8:2 (Canon EOS-800D:iPhone 13). Precision, recall, and mean average precision were computed independently for each subset. The results are summarized in Table 8.
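The subset construction described above (grouping by acquisition time, then drawing 200 images per group) can be sketched as follows. The record layout `(image_id, period)`, the function name, and the seed value are assumptions made for illustration; the paper does not describe its sampling code.

```python
import random
from collections import defaultdict


def sample_by_period(records, per_group=200, seed=0):
    """Draw `per_group` images at random from each acquisition period.

    `records` is an iterable of (image_id, period) pairs, where period
    is a label such as "Morning", "Noon", or "Afternoon".
    """
    groups = defaultdict(list)
    for image_id, period in records:
        groups[period].append(image_id)

    rng = random.Random(seed)  # fixed seed so the subsets are reproducible
    # random.sample raises ValueError if a group has fewer than per_group images.
    return {
        period: rng.sample(ids, per_group)
        for period, ids in sorted(groups.items())
    }
```

Because the device is not a sampling variable in this scheme, the device ratio in each subset is left to chance, which matches the statement that it "naturally remained" near 8:2 rather than being enforced.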

As shown in Table 8, YOLO-MGP maintained consistently high precision and recall across all three lighting scenarios. Precision varied within a narrow range from 90.8% to 92.2%, and mAP_{50} remained stable between 91.7% and 92.3%, with a maximum fluctuation of only 0.6 percentage points. The mAP_{50-95} metric declined more noticeably under afternoon conditions (64.8% vs. 69.6% at noon), suggesting that strongly directional, low-angle illumination poses additional challenges for precise bounding box regression, yet overall detection accuracy remained robust.

3.9. Performance Under Different Occlusion Conditions

To further examine the model’s detection capability under occlusion, images from the full annotated dataset were categorized into three levels according to the percentage of visible fruit area: no/slight occlusion (visible area ≥ 80%), moderate occlusion (visible area 40–80%), and heavy occlusion (visible area < 40%). Images meeting each criterion were identified to form the respective evaluation subsets. The detection metrics for each occlusion level are reported in Table 9.
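The three occlusion levels amount to a simple thresholding rule on the visible fraction of the fruit. The sketch below encodes that rule; the function name is illustrative, and the assignment of the exact boundary value 40% to the moderate level is an assumption, since the text only gives the range 40–80%.

```python
def occlusion_level(visible_fraction: float) -> str:
    """Map the visible fraction of a fruit (0.0 to 1.0) to an occlusion level."""
    if not 0.0 <= visible_fraction <= 1.0:
        raise ValueError("visible_fraction must lie in [0, 1]")
    if visible_fraction >= 0.8:
        return "no/slight"  # visible area >= 80%
    if visible_fraction >= 0.4:
        return "moderate"   # visible area 40-80% (40% itself: assumed moderate)
    return "heavy"          # visible area < 40%
```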

The quantitative results demonstrate that YOLO-MGP retains high detection accuracy across all occlusion levels. Under no or slight occlusion, the model achieved near-perfect precision and recall (98.0% and 97.7%), with an mAP_{50} of 97.7%. Under moderate occlusion, precision decreased to 94.6%, while recall remained high at 96.4% and mAP_{50} reached 98.3%. Under heavy occlusion, the model maintained a precision of 93.1% and an mAP_{50} of 94.0%; recall declined to 85.7%, and mAP_{50-95} dropped to 72.6%, reflecting the inherent difficulty of accurately localizing fruits whose visible area is severely reduced.

4. Experimental Discussion

4.1. Discussion on the Applicability of Enhanced Modules

The ablation results (Table 3) reveal a notable phenomenon: individual integration of MobileNetV3 or C2f_GLU led to performance degradation, yet their synergistic combination with the CA_Detect head and PIoU loss produced substantial gains over the baseline. This nonlinear interaction pattern warrants detailed causal analysis.

When MobileNetV3 replaced the baseline backbone in isolation (Row 2), precision and mAP_{50} dropped by 1.2 and 1.7 percentage points, respectively. This decline can be attributed to the reduced representational capacity of depthwise separable convolutions compared to standard convolutions in the original CSPDarknet backbone. Specifically, the depthwise operation limits cross-channel information flow during early-stage feature extraction, which is detrimental for citrus detection, where fine-grained color and texture cues are critical for distinguishing fruits from similarly colored foliage. However, when the CA_Detect head with its high-resolution P2 layer was introduced (Row 8), the same MobileNetV3 backbone contributed to a 3.0 percentage point precision gain over the baseline. This reversal indicates that the P2 layer’s fine-grained spatial features compensated for the backbone’s reduced cross-channel modeling capacity, effectively forming a coarse-to-fine feature hierarchy in which MobileNetV3 provides efficient semantic encoding and CA_Detect refines spatial details. This complementary pattern is consistent with observations in other lightweight detection architectures: for example, in YOLO-AGV [55], the combination of AIFI attention and VoVGSCSP modules similarly achieved synergistic improvements over individual integration, with mAP_{50} increasing by 1.3 percentage points over the baseline, albeit at the cost of a larger parameter count. In contrast, YOLO-MGP achieves a more favorable accuracy–efficiency balance by carefully pairing lightweight modules: the full model (Row 9) improves mAP_{50} by 1.5 percentage points while reducing parameters by 32.1%.

The C2f_GLU module, although causing a 1.2 percentage point mAP_{50} drop when used alone (Row 3), reduced parameters by 7.0% and contributed critically to occlusion robustness when integrated into the full model. The gating mechanism’s adaptive suppression of background foliage features is conceptually analogous to the attention mechanisms used in YOLO-CW [51] (CBAM) and YOLO-BC [57] (CA attention), yet C2f_GLU achieves this selectivity at a lower computational overhead, as evidenced by its 2.80 M parameter count compared to CBAM-augmented YOLO-CW (3.10 M) and CA-augmented YOLO-BC (3.02 M). The PIoU loss function further complements occlusion handling by providing geometry-aware gradient signals. Unlike CIoU used in the baseline and most compared models (YOLO-CW, YOLO-CC, etc.), PIoU explicitly models edge-distance penalties normalized by target dimensions. For citrus fruits with approximately circular cross-sections, this geometric prior accelerates convergence (as evidenced by the faster box_loss descent in Figure 15a) and improves localization of partially occluded fruits, where visible boundaries may be fragmented into non-contiguous segments.

A comparative analysis against the five citrus-specific models in Table 4 further substantiates the effectiveness of YOLO-MGP. YOLO-CWG [53], which integrates CBAM, Wise-IoU v3, and GhostNet, achieved a respectable mAP_{50} of 93.3% but with a parameter count of 6.40 M (over three times that of YOLO-MGP) and a model size of 12.8 MB, making it impractical for edge deployment. The high parameter count primarily stems from GhostNet’s use of additional linear operations to generate feature maps, which, while reducing FLOPs, paradoxically increases storage requirements. YOLO-MGP, by employing MobileNetV3 with SE attention and the C2f_GLU module, achieves both lower parameters (2.04 M) and higher accuracy. YOLO-AGV [55] incorporates AIFI attention and GSConv modules for multi-scale fusion; its mAP_{50} of 92.5% is 3.2 percentage points lower than that of YOLO-MGP, yet it requires 3.43 M parameters. The GSConv module’s channel shuffle operation, while efficient, may disrupt fine-grained spatial features necessary for small citrus detection, a limitation that our P2 detection head explicitly addresses. YOLO-BC [57] and YOLO-CC [58] both adopt advanced neck designs (BiFPN and CARAFE upsampling, respectively) and achieve mAP_{50} values around 92.5–92.6%. These results suggest that neck enhancements alone cannot fully compensate for backbone and detection head limitations. YOLO-MGP’s holistic design (lightweight backbone, gated feature refinement, high-resolution detection head, and geometrically adapted loss) collectively yields the best accuracy (95.7% mAP_{50}) with the smallest model size (4.4 MB) among all compared citrus-specific models. The radar chart (Figure 13) visually confirms this comprehensive advantage.

4.2. Analysis of Lighting Robustness and Sensor Influence

The quantitative results in Section 3.8 indicate that YOLO-MGP maintains stable precision and recall across morning, noon, and afternoon lighting conditions, with mAP_{50} varying by only 0.6 percentage points. This stability is particularly noteworthy when contrasted with traditional vision methods. Zhuang et al. [3] reported that their homomorphic filtering approach suffered from missed detections when fruits appeared greenish-yellow under certain lighting angles, indicating sensitivity to illumination-induced color shifts. Similarly, Lv et al. [2] noted that illumination variation was a primary failure mode for lightweight detectors in field conditions, with some models exhibiting mAP drops exceeding 10 percentage points between optimal and challenging lighting. Against this backdrop, the 0.6 percentage point fluctuation of YOLO-MGP represents a meaningful improvement in illumination robustness.

Figure 16 shows qualitative detection examples under morning, noon, and afternoon lighting conditions, visually confirming this robustness across different illumination levels. This stability can be attributed to two architectural characteristics. First, the squeeze-and-excitation (SE) modules in the MobileNetV3 backbone perform dynamic channel-wise recalibration: under strong midday light, SE modules likely amplify channels encoding texture and shape cues while suppressing those saturated by specular highlights; under dim afternoon backlight, they enhance the remaining contrast information in shadowed regions. Second, the Coordinate Attention mechanism in the CA_Detect head encodes precise positional dependencies along both axes, preserving accurate fruit boundaries even when fruit-background contrast is reduced. The effectiveness of coordinate attention in preserving spatial structure under varying illumination has been independently validated by Hou et al. in the original CA paper, and its incorporation in our detection head is a key factor in YOLO-MGP’s illumination invariance.

However, it is important to consider that the lighting-condition subsets were constructed by grouping images according to their recorded capture time (Morning, Noon, Afternoon) and then randomly sampling within each temporal group, rather than through a strict stratified factorial design with balanced sensor allocations. While this grouping guarantees that each subset exclusively contains the target illumination condition, the device ratio was not explicitly controlled and remained at approximately 8:2 (Canon EOS-800D:iPhone 13). Consequently, a minor interaction effect between illumination and sensor characteristics cannot be statistically excluded. This limitation is common in agricultural computer vision studies; for instance, many published datasets combine images from diverse sensors without controlling for device-specific effects. Future work should adopt a controlled factorial acquisition design to formally isolate the main effects of light and device. Despite this limitation, the current evaluation provides a realistic assessment applicable to multi-source orchard monitoring, where mixed consumer-grade and professional sensors are the operational norm.

4.3. Analysis of Occlusion Handling

The quantitative results in Section 3.9 demonstrate that YOLO-MGP retains an mAP_{50} of 94.0% even under heavy occlusion, with recall declining to 85.7%. To contextualize this performance, we compare it with the occlusion-handling strategies of other citrus-specific models in our benchmark. YOLO-CW employs CBAM attention to enhance occluded target features, achieving 89.4% precision and 85.9% recall on the full test set. However, CBAM’s channel–spatial attention operates globally on the feature map and lacks a dedicated gating mechanism to selectively suppress background foliage textures; under heavy occlusion, this limitation is expected to become more pronounced as foliage-induced false positives increase. YOLO-AGV utilizes multi-scale fusion via AIFI and GSConv; however, GSConv’s channel shuffle operation may disrupt the fine-grained spatial continuity required for perceiving fragmented boundaries, a limitation that likely affects its recall under heavy occlusion. YOLO-BC’s BiFPN neck provides multi-scale feature enrichment, but the absence of a dedicated occlusion-aware loss function may limit its bounding box regression accuracy when fruit boundaries are partially concealed. YOLO-MGP’s combination of C2f_GLU gating and PIoU loss directly addresses both false positive suppression (gating) and fragmented boundary regression (PIoU), yielding the highest heavy-occlusion mAP_{50} among the compared models.

The mAP_{50-95} drop from 95.7% (no occlusion) to 72.6% (heavy occlusion) warrants particular attention. This 23.1 percentage point degradation reveals that while the model successfully detects heavily occluded fruits (high mAP_{50}), the precision of bounding box localization degrades substantially when fruit boundaries are fragmented. This pattern is common across YOLO-based detectors: the strict IoU thresholds in mAP_{50-95} penalize imprecise boundary regression more severely. The PIoU loss function mitigates this issue through size-adaptive edge normalization, but the residual degradation suggests that further architectural innovations, perhaps explicit occlusion-aware feature reconstruction modules, are needed to close this localization gap.

Qualitative examination of detection examples (Figure 17) confirms that the C2f_GLU gating mechanism effectively suppresses false positives from foliage textures that visually resemble citrus peel. This discriminative capability is critical because foliage-induced false positives represent one of the most common failure modes in orchard detection, as documented by Sengupta and Lee [9], who reported excessive false positives as the primary limitation of their traditional vision approach. The deep learning-based solution presented here substantially mitigates this issue, as evidenced by the 93.1% precision maintained under heavy occlusion.

4.4. Limitations and Practical Implications

While YOLO-MGP achieves a favorable trade-off between detection accuracy and model compactness, several limitations must be acknowledged.

First, the computational budget of 9.4 GFLOPs, while moderate for desktop GPUs, may pose challenges on edge devices. In our comparisons, YOLOv10n (6.5 GFLOPs) and YOLOv11n (6.3 GFLOPs) offer lower computational requirements, albeit with correspondingly lower mAP_{50} (92.1%). The P2 detection head accounts for a significant portion of the FLOPs increase, suggesting that future lightweighting efforts could explore more efficient high-resolution feature processing, perhaps through feature pyramid distillation or neural architecture search tailored to small-target detection heads.

Second, the dataset was collected from a single orchard with a single citrus variety in December, limiting the demonstrated generalizability. Different citrus cultivars exhibit substantial variation in color (from green through yellow to deep orange), size (from kumquat to pomelo), and shape (spherical vs. oblate), all of which influence detection difficulty. The cross-dataset validation performed in related work on pomelo and kumquat datasets provides a model for the multi-variety evaluation needed to substantiate broad generalizability claims.

Third, the lighting and occlusion evaluations, while informative, were conducted on subsets that mixed two sensor types without formal statistical control. A dedicated multi-factor study with balanced sensor, illumination, and occlusion groupings would enable rigorous attribution of performance variations to specific environmental factors.

Finally, translating reliable detection into successful picking requires integration with robotic arm kinematics, grasp planning, and dynamic obstacle avoidance. The perception–performance gap between detection metrics and end-to-end picking success rates has been highlighted by several agricultural robotics studies and represents the critical next step for this line of research. The encouraging detection performance of YOLO-MGP across challenging orchard scenarios suggests that the visual perception component is ready for such system-level integration and evaluation.

5. Conclusions

This research introduces a lightweight detection model, namely YOLO-MGP, intended for efficient operation in natural orchard settings. The model addresses key challenges in citrus detection, including fruit overlap, occlusion by branches and foliage, and variable lighting conditions, which often lead to low detection accuracy and difficulties in deploying lightweight architectures. YOLO-MGP is developed by enhancing the YOLOv8n framework through modifications in four key areas: the backbone network, feature fusion module, detection head, and loss function.

Regarding model architecture, the lightweight MobileNetV3 network was first adopted as a substitute for the original CSPDarknet backbone, thereby reducing computational demands whilst preserving the capacity for multi-scale feature extraction. Subsequently, the C2f_GLU module was incorporated into the neck network to refine the selection and integration of salient features via gated linear units, effectively mitigating interference from complex contextual information. To accommodate the high proportion of small targets in the dataset, a P2 small-object detection layer integrated with a coordinate attention mechanism was introduced. Concurrently, a layer compensation strategy was employed to eliminate the redundant P5 layer, thereby enhancing sensitivity to small fruits whilst maintaining parameter efficiency. Lastly, the CIoU loss function was replaced with PIoU, whose gradient modulation mechanism contributed to improved bounding box regression accuracy for overlapping and occluded targets.
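The gating idea behind gated linear units can be sketched in a few lines. This is the generic GLU formulation (one half of the features is multiplied by the sigmoid of the other half), not the authors' C2f_GLU module, whose convolutional layout is described in Section 2.3.2. The function names and the example values are assumptions made for illustration.

```python
import math


def sigmoid(x):
    """Logistic function mapping a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def glu(features):
    """Generic gated linear unit on a flat feature vector.

    The first half of the vector holds candidate values and the second half
    holds gate pre-activations; each value is scaled by sigmoid(gate).
    """
    if len(features) % 2 != 0:
        raise ValueError("GLU expects an even number of features")
    half = len(features) // 2
    values, gates = features[:half], features[half:]
    return [v * sigmoid(g) for v, g in zip(values, gates)]


# Hypothetical example: the first value has an open gate (+10),
# the second has a closed gate (-10) and is suppressed.
gated = glu([2.0, -1.0, 10.0, -10.0])
print(gated)
```

A strongly positive gate passes its value almost unchanged, whereas a strongly negative gate drives it towards zero, which is the behaviour relied upon when suppressing responses to background textures.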

The experimental findings demonstrate that YOLO-MGP attained a precision of 94.2%, a recall of 89.7%, and an ${mAP}_{50}$ of 95.7% on the self-constructed citrus dataset, corresponding to gains of 2.8, 2.6, and 1.5 percentage points, respectively, relative to the baseline YOLOv8n. The model’s parameter count was lowered from approximately 3.0 M to 2.04 M (a reduction of around 32.1%), whilst its storage footprint was compressed to 4.4 MB, thereby achieving effective lightweighting without compromising detection performance. Evaluations across diverse scenarios revealed that the model exhibited a notable degree of adaptability to challenging conditions, including fruit overlap, occlusion by branches and foliage, illumination changes, and variations in fruit scale. Heatmap visualisation further indicated that the model concentrated more intensely on the core regions of the fruit, with feature responses exhibiting greater focus.
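The reported gains and the parameter reduction can be reproduced from the Table 4 entries for YOLOv8n and YOLO-MGP. The snippet below is a minimal arithmetic check; the values are copied from Table 4, and the variable names are assumptions made for illustration.

```python
# Values copied from Table 4 (precision, recall and mAP50 in %, parameters as raw counts).
baseline = {"precision": 91.4, "recall": 87.1, "map50": 94.2, "params": 3_005_843}
proposed = {"precision": 94.2, "recall": 89.7, "map50": 95.7, "params": 2_040_363}

# Gains in percentage points, rounded to one decimal as in the text.
gains = {
    key: round(proposed[key] - baseline[key], 1)
    for key in ("precision", "recall", "map50")
}

# Relative reduction in parameter count, in percent.
param_reduction_pct = round(
    100 * (baseline["params"] - proposed["params"]) / baseline["params"], 1
)

print(gains, param_reduction_pct)
```

The computed values agree with the gains of 2.8, 2.6, and 1.5 percentage points and with the parameter reduction of about 32.1% stated above.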

This study has several limitations. Firstly, although the newly added P2 detection layer enhanced the perception of small targets, it also increased the number of floating-point operations, which reduced inference speed to some extent. As a result, further optimisation is required to improve the model’s real-time performance on highly resource-constrained platforms. Secondly, the current dataset was collected from a single orchard, and the sample size and scene diversity are limited. The generalisability of the model to different regions and citrus varieties therefore needs to be further verified. Thirdly, the model’s detection stability under heavy occlusion and extreme lighting conditions still leaves room for improvement.

Future work will focus on three areas. First, model optimisation techniques, including channel pruning and knowledge distillation, will be employed to mitigate the computational burden introduced by the P2 layer, thereby accelerating inference and enhancing the model’s suitability for deployment on mobile platforms. Second, additional orchard datasets will be gathered, encompassing citrus images across diverse varieties, ripeness levels, and climatic conditions, with the aim of improving model generalisability. Third, more resilient feature extraction architectures will be investigated to strengthen detection robustness under severe occlusion and challenging illumination, thereby facilitating the integration of the model into agricultural automation systems.

Author Contributions

Conceptualization, J.Y. and H.W.; methodology, J.Y.; software, D.L.; validation, W.M., H.T. and D.C.; formal analysis, J.Y.; investigation, H.W.; resources, W.M.; data curation, D.L.; writing—original draft preparation, J.Y., H.W., D.L., W.M., H.T. and D.C.; writing—review and editing, J.Y., H.W., D.L., W.M., H.T. and D.C.; visualization, H.T.; supervision, D.C.; project administration, H.W.; funding acquisition, W.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Sichuan Natural Science Foundation, grant number 2026NSFSC1511.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Acknowledgments

The authors would like to thank the Huijia Qicai Tianyuan Citrus Orchard for providing the experimental site and technical support during data collection.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

YOLO	You Only Look Once
mAP	Mean Average Precision
IoU	Intersection over Union
GLU	Gated Linear Unit
CA	Coordinate Attention
SE	Squeeze-and-Excitation
FLOPs	Floating Point Operations
SGD	Stochastic Gradient Descent

References

1. Huang, Z.; Li, Z.; Yao, L.; Yuan, Y.; Hong, Z.; Huang, S.; Wang, Y.; Ye, J.; Zhang, L.; Ding, J. Geographical Distribution and Potential Distribution Prediction of Thirteen Species of Citrus L. in China. Environ. Sci. Pollut. Res. 2024, 31, 6558–6571.
2. Lv, Q.; Sun, F.; Bian, Y.; Wu, H.; Li, X.; Li, X.; Zhou, J. A Lightweight Citrus Object Detection Method in Complex Environments. Agriculture 2025, 15, 1046.
3. Zhuang, J.J.; Luo, S.M.; Hou, C.J.; Tang, Y.; et al. Detection of Orchard Citrus Fruits Using a Monocular Machine Vision-Based Method for Automatic Fruit Picking Applications. Comput. Electron. Agric. 2018, 152, 64–73.
4. Wu, L.; Zhu, L.; Weng, H.; Chen, G.; Liu, H.; Liu, Y.; Ye, D. Edge Computing-Based Computer Vision and Deep Transfer Learning for High-Throughput Assessment of Aspergillus Flavus Infection in Crop Seeds. Plant Phenomics 2026, 8, 100110.
5. Khayyat, M.M.; Zamzami, N.; Zhang, L.; Nappi, M.; Umer, M. Fuzzy-CNN: Improving Personal Human Identification Based on IRIS Recognition Using LBP Features. J. Inf. Secur. Appl. 2024, 83, 103761.
6. Çetin-Kaya, Y.; Kaya, M. A Novel Ensemble Framework for Multi-Classification of Brain Tumors Using Magnetic Resonance Imaging. Diagnostics 2024, 14, 383.
7. Sun, B.; Liu, K.; Feng, L.; Peng, H.; Yang, Z. The Surface Defects Detection of Citrus on Trees Based on a Support Vector Machine. Agronomy 2023, 13, 43.
8. S, A.; Kathirvelan, J. Computer Vision-Based Detection and Classification of Chemically Ripened Bananas and Papayas at Vendor Site through Deep Learning AI Models Using Real-Time Dataset. Results Eng. 2025, 26, 104730.
9. Sengupta, S.; Lee, W.S. Identification and Determination of the Number of Immature Green Citrus Fruit in a Canopy Under Different Ambient Light Conditions. Biosyst. Eng. 2014, 117, 51–61.
10. Mir, H.; Mehridehnavi, A. A Novel Approach for Fast Circlet Transform: Dynamic Analysis of Coefficients for Circular Shapes Quantification. Pattern Recognit. 2026, 176, 113098.
11. Wu, H.; Li, X.; Sun, F.; Huang, L.; Yang, T.; Bian, Y.; Lv, Q. An Improved Product Defect Detection Method Combining Centroid Distance and Textural Information. Electronics 2024, 13, 3798.
12. Kursun, R.; Koklu, M. Effectiveness of SIFT Features in Enhancing Watermelon Leaf Disease Classification Accuracy. In Proceedings of the 2025 International Conference on Computer Systems and Technologies (CompSysTech), Ruse, Bulgaria, 27–28 June 2025; pp. 1–6.
13. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788.
14. Zhang, X.; Xun, Y.; Chen, Y. Automated Identification of Citrus Diseases in Orchards Using Deep Learning. Biosyst. Eng. 2022, 223, 249–258.
15. Alaaudeen, K.M.; Selvarajan, S.; Manoharan, H.; Jhaveri, R.H. Intelligent Robotics Harvesting System Process for Fruits Grasping Prediction. Sci. Rep. 2024, 14, 2820.
16. Lin, Y.; Huang, Z.; Liang, Y.; Liu, Y.; Jiang, W. AG-YOLO: A Rapid Citrus Fruit Detection Algorithm with Global Context Fusion. Agriculture 2024, 14, 114.
17. Li, J.; Xia, X.; Li, W.; Li, H.; Wang, X.; Xiao, X.; Wang, R.; Zheng, M.; Pan, X. Next-ViT: Next Generation Vision Transformer for Efficient Deployment in Realistic Industrial Scenarios. arXiv 2022, arXiv:2207.05501.
18. Niu, L.; Di Paola, M.; Pirrotta, A.; Xu, W. Generalized Complex Fractional Moment for the Probabilistic Characteristic of Random Vectors. Eng. Struct. 2024, 318, 118685.
19. Yue, L. YOLO-MECD: Citrus Detection Algorithm Based on YOLOv11. Agronomy 2025, 15, 687.
20. Ma, Z.; Dong, N.; Gu, J.; Cheng, H.; Meng, Z.; Du, X. STRAW-YOLO: A Detection Method for Strawberry Fruits Targets and Key Points. Comput. Electron. Agric. 2025, 230, 109853.
21. Liu, R.M.; Su, W.H. APHS-YOLO: A Lightweight Model for Real-Time Detection and Classification of Stropharia Rugoso-Annulata. Foods 2024, 13, 1710.
22. Ma, S.; Xu, Y. MPDIoU: A Loss for Efficient and Accurate Bounding Box Regression. arXiv 2023, arXiv:2307.07662.
23. Zeng, T.; Tong, J.; Sun, X.; Liu, J.; He, X.; Guan, Z.; Liu, L.; Jiang, N.; Wan, T. YOLO-FCAP: An Improved Lightweight Object Detection Model Based on YOLOv8n for Citrus Yield Prediction in Complex Environments. Smart Agric. Technol. 2025, 12, 101590.
24. Kim, S.; Jang, J.; Lee, N.; Shin, S.; Kim, J.; Jung, K.; Park, D. Assessment of Water Reuse and Pico-Scale Hydropower Systems in High-Rise Buildings: A Water–Energy–Carbon Nexus Approach for Urban Sustainability. Appl. Energy 2025, 402, 126878.
25. Wang, C.Y.; Yeh, I.H.; Liao, H.Y.M. YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information. arXiv 2024, arXiv:2402.13616.
26. Liu, W.h.; Jian, C.; Zhang, S.a.; Cong, G.x.; Wang, H.b.; Li, J.y.; Li, X.m.; Tang, J.; Ma, Y.y.; Zheng, Y.q. Enhanced YOLOv5s with BiFormer Attention for Citrus Spring Shoot Detection: Optimized Phenological Period and Cross-Regional Application. Smart Agric. Technol. 2025, 12, 101446.
27. Zhu, L.; Wang, X.; Ke, Z.; Zhang, W.; Lau, R. BiFormer: Vision Transformer with Bi-Level Routing Attention. arXiv 2023, arXiv:2303.08810.
28. Wan, M.; Li, Y.; Ikegaya, N. Prediction of Pedestrian-Level Percentile Wind Speeds with CNN Models Using Fundamental Statistics. Build. Environ. 2026, 290, 114139.
29. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. arXiv 2023, arXiv:1706.03762.
30. Deng, F.; Chen, J.; Fu, L.; Zhong, J.; Qiaoi, W.; Luo, J.; Li, J.; Li, N. Real-Time Citrus Variety Detection in Orchards Based on Complex Scenarios of Improved YOLOv7. Front. Plant Sci. 2024, 15, 1381694.
31. Li, H.; Li, J.; Wei, H.; Liu, Z.; Zhan, Z.; Ren, Q. Slim-Neck by GSConv: A Lightweight-Design for Real-Time Detector Architectures. J. Real-Time Image Process. 2024, 21, 62.
32. Zhou, B.; Wu, K.; Chen, M. Detection of Gannan Navel Orange Ripeness in Natural Environment Based on YOLOv5-NMM. Agronomy 2024, 14, 910.
33. Zarboubi, M.; Bellout, A.; Chabaa, S.; Dliou, A. Enhancing Integrated Pest Management with IoT and YOLO-Evo: A Smart, Low-Cost Monitoring System for Sustainable Apple Farming. Results Eng. 2026, 29, 108850.
34. Zhou, J.; Sun, F.; Wu, H.; Lv, Q.; Feng, F.; Zhao, B.; Li, X. Kiwi-YOLO: A Kiwifruit Object Detection Algorithm for Complex Orchard Environments. Agronomy 2025, 15, 2424.
35. Bodla, N.; Singh, B.; Chellappa, R.; Davis, L.S. Soft-NMS – Improving Object Detection With One Line of Code. arXiv 2017, arXiv:1704.04503.
36. Teng, H.; Sun, F.; Wu, H.; Lv, D.; Lv, Q.; Feng, F.; Yang, S.; Li, X. DS-YOLO: A Lightweight Strawberry Fruit Detection Algorithm. Agronomy 2025, 15, 2226.
37. Sun, F.; Lv, Q.; Bian, Y.; He, R.; Lv, D.; Gao, L.; Wu, H.; Li, X. Grape Target Detection Method in Orchard Environment Based on Improved YOLOv7. Agronomy 2024, 15, 42.
38. Lin, J.; Hu, G.; Chen, J. Mixed Data Augmentation and Osprey Search Strategy for Enhancing YOLO in Tomato Disease, Pest, and Weed Detection. Expert Syst. Appl. 2025, 264, 125737.
39. Li, F.; Sun, T.; Dong, P.; Wang, Q.; Li, Y.; Sun, C. MSF-CSPNet: A Specially Designed Backbone Network for Faster R-CNN. IEEE Access 2024, 12, 52390–52399.
40. Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. YOLOv7: Trainable Bag-of-Freebies Sets New State-of-the-Art for Real-Time Object Detectors. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 7464–7475.
41. Sun, Y.; Zhang, D.; Guo, X.; Yang, H. Lightweight Algorithm for Apple Detection Based on an Improved YOLOv5 Model. Plants 2023, 12, 3032.
42. Wang, C.Y.; Mark Liao, H.Y.; Wu, Y.H.; Chen, P.Y.; Hsieh, J.W.; Yeh, I.H. CSPNet: A New Backbone That Can Enhance Learning Capability of CNN. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Seattle, WA, USA, 14–19 June 2020; pp. 1571–1580.
43. Howard, A.; Sandler, M.; Chu, G.; Chen, L.C.; Chen, B.; Tan, M.; Wang, W.; Zhu, Y.; Pang, R.; Vasudevan, V.; et al. Searching for MobileNetV3. arXiv 2019, arXiv:1905.02244.
44. Hu, J.; Shen, L.; Albanie, S.; Sun, G.; Wu, E. Squeeze-and-Excitation Networks. arXiv 2019, arXiv:1709.01507.
45. Shi, D. TransNeXt: Robust Foveal Visual Perception for Vision Transformers. arXiv 2024, arXiv:2311.17132.
46. Bian, Y.; Wu, H.; Sun, F.; Lv, Q.; Li, X. An Improved Lightweight Metal Sheets Surface Defect Detection Algorithm Based on YOLOv8. Clust. Comput. 2025, 28, 580.
47. Hou, Q.; Zhou, D.; Feng, J. Coordinate Attention for Efficient Mobile Network Design. arXiv 2021, arXiv:2103.02907.
48. Feng, G.; Hu, Z.; Zhang, L.; Lu, H. Encoder Fusion Network with Co-Attention Embedding for Referring Image Segmentation. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021.
49. Liu, C.; Wang, K.; Li, Q.; Zhao, F.; Zhao, K.; Ma, H. Powerful-IoU: More Straightforward and Faster Bounding Box Regression Loss with a Nonmonotonic Focusing Mechanism. Neural Netw. 2024, 170, 276–284.
50. Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J.; Ding, G. YOLOv10: Real-Time End-to-End Object Detection. arXiv 2024, arXiv:2405.14458.
51. Li, H.; Tao, X. Enhanced YOLOv8-Based Real-Time Navel Orange Detection for UAV Aerial Imaging in Complex Orchard Environments. Int. J. Cogn. Inform. Nat. Intell. 2025, 20, 1–13.
52. Tong, Z.; Chen, Y.; Xu, Z.; Yu, R. Wise-IoU: Bounding Box Regression Loss with Dynamic Focusing Mechanism. arXiv 2023, arXiv:2301.10051.
53. Yu, C.; Guo, J.; Pan, Q. CWG-YOLOv8: A Navel Orange Detection Model Based on Improved YOLOv8 in an Agricultural Environment. Opt. Mem. Neural Netw. 2026, 35, 147–155.
54. Han, K.; Wang, Y.; Tian, Q.; Guo, J.; Xu, C.; Xu, C. GhostNet: More Features from Cheap Operations. arXiv 2020, arXiv:1911.11907.
55. He, Y.; Li, Y.; Li, Z.; Song, R.; Xu, C. An Improved YOLOv8-Based Lightweight Approach for Orange Maturity Detection. J. Food Meas. Charact. 2025, 19, 4740–4754.
56. Zhao, Y.; Lv, W.; Xu, S.; Wei, J.; Wang, G.; Dang, Q.; Liu, Y.; Chen, J. DETRs Beat YOLOs on Real-Time Object Detection. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; pp. 16965–16974.
57. Liu, X.; Li, G.; Chen, W.; Liu, B.; Chen, M.; Lu, S. Detection of Dense Citrus Fruits by Combining Coordinated Attention and Cross-Scale Connection with Weighted Feature Fusion. Appl. Sci. 2022, 12, 6600.
58. Ryu, J.; Kwak, D.; Choi, S. YOLOv8 with Post-Processing for Small Object Detection Enhancement. Appl. Sci. 2025, 15, 7275.

Figure 1. Geographic location and on-site acquisition process of the citrus dataset.

Figure 2. Citrus samples under natural environments and augmented samples.

Figure 3. Structure of YOLOv8n Network.

Figure 4. Structure of the proposed YOLO-MGP network. Note: The proposed improvements include the MobileNetV3 backbone, C2f_GLU module, CA_Detect detection head, and PIoU loss function.

Figure 5. Architectural specification of the MobileNetV3 backbone. The network progressively downsamples the input image from $640 \times 640$ to $20 \times 20$ via stacked inverted residual blocks for subsequent object detection.

Figure 6. Architectural comparison between standard convolution and depthwise separable convolution. Note: (Left) Standard convolution fuses spatial and cross-channel computation in a single step. (Right) Depthwise separable convolution factorizes the operation into a spatial filtering stage via depthwise convolution and a channel projection stage via pointwise convolution, significantly reducing computational complexity.

Figure 7. Structure of C2f_GLU.

Figure 8. Bounding Box Size Distribution.

Figure 9. Schematic after adding a small object detection layer.

Figure 10. Coordinate Attention Principle Structure Diagram.

Figure 11. CIoU and PIoU Principle Comparison Diagram.

Figure 12. Ablation Experiment Comparison Chart for YOLO-MGP. Note: A is MobileNetV3, B is C2f_GLU, C is CA_Detect detection head, and D is PIoU loss function.

Figure 13. Radar Chart Comparison of Model Performance.

Figure 14. Heatmap Comparison of Models Under Different Scenarios. Note: Within the heatmap, dark blue signifies minimal network attention, while red highlights regions of peak computational focus.

Figure 15. Different Loss Function Variation Curves.

Figure 16. Qualitative detection examples under morning, noon, and afternoon lighting conditions.

Figure 17. Qualitative detection examples under no, partial, and heavy occlusion conditions.

Table 1. Size Distribution of Object Instances in the Citrus Dataset.

Dataset	Mean Width (%)	Mean Height (%)	Small (<10%)	Tiny (<2%)	Total
Number	5.29	6.05	33,504 (84.9%)	21,840 (55.4%)	39,440

Table 2. Table of Training Platform Specifications.

Computer Configuration	Parameter Information
CPU	Intel Core i5-13490F
GPU	NVIDIA GeForce RTX 5060
Memory	32 G
Operating System	Windows 10
PyTorch	2.9.0
CUDA	13.0
Programming Language	Python 3.10

Table 3. Experimental Results of Ablation Study on the YOLO-MGP Citrus Recognition Model.

A	B	C	D	P (%)	R (%)	${mAP}_{50}$ (%)	${mAP}_{50 - 95}$ (%)	Params (M)	FPS
×	×	×	×	91.4	87.1	94.2	72.0	3.01	243
✓	×	×	×	90.2	86.0	92.5	67.5	3.11	227
×	✓	×	×	89.4	86.8	93.0	68.5	2.80	225
×	×	✓	×	91.7	88.8	95.0	72.1	2.02	107
×	×	×	✓	93.4	86.9	94.6	72.1	3.01	235
✓	×	×	✓	91.0	86.0	93.1	68.2	3.11	235
✓	✓	×	✓	90.5	86.5	92.9	67.5	2.90	217
✓	×	✓	✓	94.4	90.4	96.0	74.4	2.12	178
✓	✓	✓	✓	94.2	89.7	95.7	73.9	2.04	159

Note: A: MobileNetV3; B: C2f_GLU; C: CA_Detect; D: PIoU. “✓” indicates the module is used, “×” indicates not used.

Table 4. Citrus Fruit Detection Performance of YOLO-MGP Algorithm.

Model	P (%)	R (%)	${mAP}_{50}$ (%)	${mAP}_{50 - 95}$ (%)	Params	FLOPs (G)	Size (MB)
YOLOv7t	85.4	76.9	84.3	54.0	6,014,038	13.2	12.3
YOLOv9t	91.3	84.0	92.4	67.4	1,970,979	7.6	4.4
YOLOv10n	89.4	83.8	92.1	68.3	2,265,363	6.5	5.5
YOLOv11n	90.2	84.4	92.1	67.7	2,582,347	6.3	5.2
YOLO-CW	89.4	85.9	92.8	68.6	3,097,601	8.2	6.1
YOLO-CWG	90.5	86.5	93.3	69.3	6,400,313	8.7	12.8
YOLO-AGV	90.5	84.7	92.5	68.7	3,426,787	7.6	6.8
YOLO-BC	90.6	84.5	92.6	68.1	3,022,747	8.2	6.0
YOLO-CC	90.7	84.5	92.5	67.7	3,151,459	8.5	6.2
YOLOv8n	91.4	87.1	94.2	72.0	3,005,843	8.1	6.0
YOLO-MGP	94.2	89.7	95.7	73.9	2,040,363	9.4	4.4

Table 5. Performance Comparison of Different Backbone Networks.

Backbone	P (%)	R (%)	${mAP}_{50}$ (%)	${mAP}_{50 - 95}$ (%)	Params (M)	FLOPs (G)
YOLOv8n	91.4	87.1	94.2	72.0	3.01	8.1
MobileNetV3	90.2	86.0	92.5	67.5	3.11	6.6
GhostNet	88.4	84.5	91.0	65.3	3.23	7.2
ShuffleNetV2	89.7	83.7	93.6	68.5	4.06	8.3
StarNet	87.2	84.3	90.1	63.8	2.35	5.8
EfficientNet	90.1	85.2	92.8	68.9	3.45	7.5

Table 6. Performance Comparison of Different Loss Functions.

Loss Function	P (%)	R (%)	${mAP}_{50}$ (%)	${mAP}_{50 - 95}$ (%)
CIoU	91.4	87.1	94.2	72.0
DIoU	91.8	87.8	94.2	71.9
GIoU	92.5	87.4	94.5	72.4
SIoU	91.5	88.0	94.4	72.1
EIoU	93.0	86.6	94.2	71.6
MPDIoU	92.4	87.9	94.3	72.0
PIoU	93.4	86.9	94.6	72.1

Note: All experiments were conducted using the YOLOv8n baseline with identical hyperparameters.

Table 7. Performance Comparison of Different Detection Head Configurations.

Detection Head	P (%)	R (%)	${mAP}_{50}$ (%)	${mAP}_{50 - 95}$ (%)	Params (M)	FLOPs (G)
P3 + P4 + P5 (CA)	90.5	84.7	92.8	68.4	3.02	8.1
P2 + P3 + P4 + P5 (CA)	92.2	87.1	94.4	71.7	2.93	12.2
P2 + P3 + P4 (CA)	91.7	88.8	95.0	72.1	2.02	11.5
P3 + P4 + P5	91.4	87.1	94.2	72.0	3.01	8.1

Note: All configurations were evaluated using the YOLOv8n baseline with identical training hyperparameters.

Table 8. Quantitative detection results of YOLO-MGP under different lighting conditions.

Lighting Condition	P (%)	R (%)	${mAP}_{50}$ (%)	${mAP}_{50 - 95}$ (%)
Morning	92.2	83.7	92.3	68.5
Noon	91.3	82.3	91.7	69.6
Afternoon	90.8	83.8	91.9	64.8

Table 9. Quantitative detection results of YOLO-MGP under different occlusion levels.

Occlusion Level	P (%)	R (%)	${mAP}_{50}$ (%)	${mAP}_{50 - 95}$ (%)
No Occlusion	98.0	97.7	97.7	95.7
Partial Occlusion	94.6	96.4	98.3	93.2
Heavy Occlusion	93.1	85.7	94.0	72.6
