This study proposes an enhanced detection model, termed YOLO-MGP, based on YOLOv8n, to address two specific challenges of citrus detection: inadequate accuracy in complex orchard environments and excessive model complexity. The architectural improvements, which integrate and adapt several existing techniques into a cohesive lightweight framework, are as follows. First, the original backbone network was replaced with the lightweight feature extraction network MobileNetV3 to reduce the computational burden and increase detection speed. Second, the C2f module in the original neck structure was replaced with a lightweight variant known as C2f_GLU, which reduces the number of parameters while improving the model’s ability to extract, enhance, and fuse features. Third, a detection head specialized for small objects, operating at a 160 × 160 resolution and integrating the coordinate attention (CA) mechanism, was incorporated into the Head layer to boost small-target detection capability. Finally, the PIoU loss function was adopted to improve the network’s feature extraction capacity and convergence rate.
Figure 4 depicts the structure of the proposed model.
2.3.1. MobileNetV3 Network
The native YOLOv8n backbone employs the CSPDarknet architecture, which integrates the cross-stage partial design paradigm of CSPNet [42] with the Darknet-53 hierarchical structure inherited from YOLOv5. By partitioning feature maps along the channel dimension and routing only a subset through a partial convolutional block while bypassing the remainder, CSPDarknet effectively enhances gradient propagation and mitigates the vanishing gradient problem. While this design yields strong performance on generic detection benchmarks, it exhibits two notable limitations in the context of citrus orchard monitoring. First, the reliance on standard convolutional blocks incurs substantial computational overhead as network capacity scales, rendering deployment on resource-constrained agricultural edge devices impractical. Second, the architecture exhibits constrained sensitivity to fine-grained spatial features, which is detrimental for detecting densely clustered small targets, a defining characteristic of citrus imagery where immature fruits are frequently occluded by foliage.
To address these constraints, we replace the original CSPDarknet backbone with MobileNetV3 [43], a lightweight convolutional neural network specifically engineered for efficient mobile inference. MobileNetV3 is widely recognized for its favourable balance between inference latency and accuracy on embedded processors, thereby meeting the twin demands of real-time operation and accurate small-target identification in citrus detection systems intended for field deployment. The architecture of MobileNetV3 is illustrated in Figure 5.
The backbone ingests an RGB citrus image of spatial dimension 640 × 640 and progressively abstracts it through a hierarchical cascade of inverted residual blocks. The network is organized into a sequence of stages wherein spatial resolution is successively halved by strided depthwise convolutions, while channel capacity expands to encode increasingly semantic representations. The initial stem layer employs a 3 × 3 convolution with stride 2, rapidly downsampling the input to 320 × 320 with 16 channels. Subsequent stages interleave bottleneck blocks with and without squeeze-and-excitation (SE) modules, culminating in a final convolutional projection to a high-dimensional feature space prior to global pooling. For the detection task, we extract multi-scale feature maps from the terminal blocks of stages corresponding to output strides of 8, 16, and 32 relative to the input, that is, layers with spatial resolutions of 80 × 80, 40 × 40, and 20 × 20. These hierarchical representations are then routed to the YOLOv8n neck, enabling the detection head to localize small, occluded citrus fruits on fine-grained grids while simultaneously recognizing larger, unobstructed instances on semantically richer, coarser feature maps.
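The relationship between output stride and feature-map resolution described above can be checked with a few lines of Python. This is only an illustrative sketch: the helper name `feature_map_sizes` is hypothetical, and the 640 × 640 input size is an assumption chosen to be consistent with the 160 × 160 small-object head (stride 4) mentioned earlier.

```python
def feature_map_sizes(input_size, strides):
    """Map each output stride to the (height, width) of its feature map.

    Assumes a square input whose side is exactly divisible by every stride.
    """
    sizes = {}
    for stride in strides:
        if input_size % stride != 0:
            raise ValueError(f"input size {input_size} is not divisible by stride {stride}")
        side = input_size // stride
        sizes[stride] = (side, side)
    return sizes


INPUT_SIZE = 640  # assumed input resolution, not stated explicitly in the text

sizes = feature_map_sizes(INPUT_SIZE, [2, 4, 8, 16, 32])
for stride, (height, width) in sizes.items():
    print(f"stride {stride:>2}: {height} x {width}")
```

Under this assumption, stride 2 corresponds to the stem output, stride 4 to the resolution used by the small-object head, and strides 8 to 32 to the conventional three detection scales.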
As indicated in the architectural overview, a subset of the inverted residual blocks within MobileNetV3 is augmented with Squeeze-and-Excitation (SE) modules [44]. Positioned after the depthwise convolution within the expansion path, and therefore operating on the representation of greatest channel dimensionality, the SE mechanism explicitly models interdependencies between channels to perform dynamic, adaptive recalibration of convolutional features. This is accomplished through three sequential operations: Squeeze, Excitation, and Scale.
Given an input feature map $X$, a standard convolution operation $F_{tr}$ first extracts spatial features to produce a new feature map $U \in \mathbb{R}^{H \times W \times C}$, where each channel $c$ is denoted as $u_c$:

$$u_c = v_c * X = \sum_{s} v_c^{s} * x^{s}$$

where $v_c$ is the $c$-th convolution kernel, $*$ denotes convolution, and the sum runs over the input channels $s$. This yields $U = [u_1, u_2, \ldots, u_C]$, which serves as the input to the SE module. Then, by aggregating spatial information over the whole feature map, the Squeeze operation obtains global contextual information for every channel. Specifically, it applies global average pooling to each channel $u_c$, compressing the $H \times W$ spatial dimensions into a single scalar $z_c$:

$$z_c = F_{sq}(u_c) = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} u_c(i, j)$$

Thus, $z_c$ serves as a channel-wise descriptor that summarizes the global spatial information of each channel. Subsequently, the Excitation operation employs a lightweight gating mechanism to generate per-channel modulation weights. This is implemented as a two-layer fully-connected bottleneck network: a dimensionality-reduction layer compresses the channel count from $C$ to $C/r$, where $r$ denotes a predefined compression factor (fixed to 4 in MobileNetV3); a ReLU activation $\delta$ introduces nonlinearity; and a subsequent expansion layer restores the dimensionality to $C$, followed by a Hard-sigmoid activation $\sigma$ to produce the normalized weight vector $s$:

$$s = F_{ex}(z, W) = \sigma\left(W_2 \, \delta(W_1 z)\right)$$

Finally, the Scale operation applies these learned channel weights to the original feature map, producing the recalibrated output $\tilde{X}$, wherein informative channels are amplified and less relevant ones are suppressed:

$$\tilde{x}_c = F_{scale}(u_c, s_c) = s_c \cdot u_c$$

In the above formulation, $H$ and $W$ represent the spatial height and width of the feature map; $z_c$ denotes the channel descriptor for the $c$-th channel; $W_1 \in \mathbb{R}^{\frac{C}{r} \times C}$ and $W_2 \in \mathbb{R}^{C \times \frac{C}{r}}$ refer to the weight matrices of the dimensionality-reduction and dimensionality-expansion layers, respectively; $\delta$ represents the ReLU activation function; and $\sigma$ signifies the Hard-sigmoid activation function.
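To make the Squeeze, Excitation, and Scale steps concrete, the following is a minimal pure-Python sketch on nested lists. It is not the implementation used in this study: the function names, the toy weights `w1` and `w2`, and the tiny 4-channel input are all hypothetical, and the Hard-sigmoid is written in the common ReLU6(x + 3) / 6 form.

```python
def relu(value):
    return max(value, 0.0)


def hard_sigmoid(value):
    """Piecewise-linear sigmoid approximation: ReLU6(value + 3) / 6."""
    return min(max(value + 3.0, 0.0), 6.0) / 6.0


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def se_block(u, w1, w2):
    """Apply Squeeze, Excitation, and Scale to a C x H x W feature map.

    u:  nested list of shape C x H x W
    w1: reduction weights of shape (C / r) x C
    w2: expansion weights of shape C x (C / r)
    """
    # Squeeze: global average pooling gives one scalar per channel.
    z = [sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
         for channel in u]
    # Excitation: bottleneck with ReLU, then Hard-sigmoid gating.
    hidden = [relu(h) for h in matvec(w1, z)]
    s = [hard_sigmoid(e) for e in matvec(w2, hidden)]
    # Scale: multiply every channel by its weight.
    scaled = [[[weight * value for value in row] for row in channel]
              for channel, weight in zip(u, s)]
    return scaled, z, s


# Toy example with C = 4 channels and r = 4, so the bottleneck has one unit.
u = [
    [[1.0, 1.0], [1.0, 1.0]],
    [[2.0, 2.0], [2.0, 2.0]],
    [[3.0, 3.0], [3.0, 3.0]],
    [[0.0, 2.0], [4.0, 6.0]],
]
w1 = [[0.5, 0.5, 0.5, 0.5]]          # hypothetical reduction weights
w2 = [[1.0], [-1.0], [0.0], [3.0]]   # hypothetical expansion weights

scaled, z, s = se_block(u, w1, w2)
print("z =", z)
print("s =", s)
```

In this toy case the second channel receives weight 0 and is suppressed entirely, while the third is halved, which illustrates the channel recalibration that the Scale step performs.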
Beyond the SE attention mechanism, the foundational computational unit underpinning the efficiency of MobileNetV3 is the depthwise separable convolution. This process breaks down a standard convolution into two separate and sequential stages: a depthwise convolution and a pointwise convolution, as shown in Figure 6.
In a standard convolutional layer, each filter simultaneously aggregates spatial information from a local receptive field and cross-channel correlations, operating on all input channels concurrently. While effective, this coupling incurs substantial computational redundancy. Depthwise separable convolution breaks this entanglement. To maintain spatial locality while avoiding cross-channel interaction, the first stage, depthwise convolution, applies a single lightweight filter to each input channel separately. This yields an intermediate representation of identical channel count, wherein each channel encodes spatially filtered features specific to its corresponding input. The second stage, pointwise convolution, employs a 1 × 1 convolutional kernel to project this intermediate tensor onto the desired output channel dimension. This stage is solely responsible for fusing and recombining information across channels, effectively decoupling spatial filtering from feature generation. In MobileNetV3, this factorized convolution constitutes the core spatial processing unit within each inverted residual block, wherein the depthwise convolution operates on the expanded channel representation and the subsequent pointwise convolution projects it back to the bottleneck dimension.
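The saving obtained from this factorization can be illustrated by counting weights. The sketch below ignores bias terms and normalization layers, and the channel counts (64 in, 128 out) are arbitrary illustrative values rather than figures from this study.

```python
def standard_conv_params(c_in, c_out, kernel):
    """Weights in a standard convolution: every filter spans all input channels."""
    return kernel * kernel * c_in * c_out


def depthwise_separable_params(c_in, c_out, kernel):
    """Weights in a depthwise convolution followed by a 1 x 1 pointwise convolution."""
    depthwise = kernel * kernel * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out             # 1 x 1 projection across channels
    return depthwise + pointwise


C_IN, C_OUT, KERNEL = 64, 128, 3  # illustrative values only

standard = standard_conv_params(C_IN, C_OUT, KERNEL)
separable = depthwise_separable_params(C_IN, C_OUT, KERNEL)
ratio = separable / standard

print(f"standard:  {standard}")
print(f"separable: {separable}")
print(f"ratio:     {ratio:.4f}")
```

The ratio equals 1/C_out + 1/k², so for a 3 × 3 kernel the factorized form needs roughly one ninth of the weights once the output channel count is large.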
2.3.2. C2f_GLU
In YOLOv8n, the C2f module serves as the core component for feature processing, integrating functions for feature extraction, enhancement, and fusion. This structure significantly enhances the representational power of the module through feature reuse and the introduction of rich gradient flows. In the context of citrus recognition in intricate orchard environments, the standard C2f module enhances the receptive field and feature richness by stacking multiple Bottlenecks, resulting in an increase in parameters and computational demands, thereby rendering it less appropriate for mobile device deployment.
This study presents the C2f_GLU [45] module as a solution to the aforementioned problems. The architectural design is illustrated in Figure 7. This module dynamically improves the representation of essential citrus attributes using a gated channel attention technique and integrates a streamlined bottleneck design, thereby improving feature discriminability while effectively controlling computational complexity. Its core component is the Conv_GLU [45] unit, which achieves gated channel attention modelling based on nearest-neighbour features by introducing DWConv (depthwise convolution) into the GLU gating branch.
The procedural flow is detailed below. The input feature map is processed by two parallel paths to produce the value feature $V$ and the gating weight $G$, respectively. In the value pathway, a $1 \times 1$ convolution is employed to extract basic features. Meanwhile, in the gating pathway, spatial awareness is introduced: this pathway sequentially performs a $1 \times 1$ convolution, followed by a $3 \times 3$ depthwise convolution (DWConv), and an activation function to produce a spatially adaptive attention map. Subsequently, the value feature $V$ and the gating weight $G$ are multiplied element-wise to achieve dynamic recalibration of the feature map. Finally, a $1 \times 1$ convolution is employed, after which the modulated feature $X'$ is combined with the module’s original input $X$ through a residual connection. In summary, the formulation of Conv_GLU is expressed as:

$$V = \mathrm{Conv}_{1 \times 1}(X), \quad G = \phi\left(\mathrm{DWConv}_{3 \times 3}\left(\mathrm{Conv}_{1 \times 1}(X)\right)\right), \quad X' = \mathrm{Conv}_{1 \times 1}(V \odot G), \quad Y = X + X'$$

where $\phi$ denotes the activation function, $\odot$ denotes element-wise multiplication, and $Y$ is the output of the unit.
The integration of the lightweight Conv_GLU unit within the C2f_GLU framework is designed to enhance the discriminability of citrus-related features through spatially adaptive gating, while maintaining a parameter-efficient and computationally economical profile.
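The data flow of the Conv_GLU unit described above can be sketched in pure Python on nested lists. This is an illustration under stated assumptions, not the implementation used in this study: the 1 × 1 and 3 × 3 kernel sizes, the GELU activation, zero padding, and all function and weight names are assumptions made for the example.

```python
import math


def conv1x1(x, weight):
    """Pointwise convolution. x: C_in x H x W, weight: C_out x C_in."""
    height, width = len(x[0]), len(x[0][0])
    return [[[sum(weight[o][c] * x[c][i][j] for c in range(len(x)))
              for j in range(width)] for i in range(height)]
            for o in range(len(weight))]


def dwconv3x3(x, kernels):
    """Depthwise 3 x 3 convolution with zero padding, one kernel per channel."""
    height, width = len(x[0]), len(x[0][0])
    out = []
    for channel, kernel in zip(x, kernels):
        plane = [[0.0] * width for _ in range(height)]
        for i in range(height):
            for j in range(width):
                acc = 0.0
                for di in (-1, 0, 1):
                    for dj in (-1, 0, 1):
                        ii, jj = i + di, j + dj
                        if 0 <= ii < height and 0 <= jj < width:
                            acc += kernel[di + 1][dj + 1] * channel[ii][jj]
                plane[i][j] = acc
        out.append(plane)
    return out


def gelu(value):
    """Exact GELU; the choice of activation is an assumption of this sketch."""
    return 0.5 * value * (1.0 + math.erf(value / math.sqrt(2.0)))


def conv_glu(x, w_value, w_gate, dw_kernels, w_out):
    """Value path times gated path, output projection, then residual add."""
    value = conv1x1(x, w_value)
    gate = dwconv3x3(conv1x1(x, w_gate), dw_kernels)
    gate = [[[gelu(v) for v in row] for row in ch] for ch in gate]
    modulated = [[[v * g for v, g in zip(v_row, g_row)]
                  for v_row, g_row in zip(v_ch, g_ch)]
                 for v_ch, g_ch in zip(value, gate)]
    projected = conv1x1(modulated, w_out)  # must restore the input channel count
    return [[[p + r for p, r in zip(p_row, r_row)]
             for p_row, r_row in zip(p_ch, r_ch)]
            for p_ch, r_ch in zip(projected, x)]


# Toy single-channel example with identity weights (hypothetical values).
x = [[[1.0, 0.0], [0.0, 2.0]]]
identity = [[1.0]]
center = [[[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]]

y = conv_glu(x, identity, identity, center, identity)
y_zero_proj = conv_glu(x, identity, identity, center, [[0.0]])
print(y)
```

With identity weights each output value is x + x · GELU(x), and when the output projection is zero the unit reduces to the identity through the residual connection.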
2.3.3. CA_Detect
The intricate nature of citrus orchard landscapes leads to multi-perspective discrepancies in the obtained images, causing substantial changes in target scales, especially for the detection of small-target fruits [46]. The features of these targets are weak and are prone to being mistaken for the background, which presents a serious challenge to high-precision detection. In this study, targets in the citrus dataset were categorised into four types based on the aspect ratio of the citrus labels and images: large targets (aspect ratio
), medium targets (
aspect ratio
), small targets (
aspect ratio), and tiny targets (
aspect ratio). The precise distribution of the labels is illustrated in Figure 8, and the statistical results are detailed in Table 1.
To mitigate the baseline model’s weakness in detecting small-scale objects, which stems from insufficient spatial granularity in deeper feature hierarchies, this study refines both the neck network and the detection heads. A dedicated P2 detection head, tailored to the extraction and localization of small targets, is added to the existing framework. This head uses the Coordinate Attention mechanism [47] to encode precise positional dependencies while aggregating shallow feature maps that retain high-resolution structural details and fine-grained spatial cues. Fusing these complementary representations enhances the model’s capacity for multi-scale feature learning, particularly for small objects that would otherwise be lost in the semantically coarse deeper layers. To offset the increase in computation and parameters caused by the additional shallow prediction layer, a layer-wise compensation strategy is adopted: the deep P5 detection head, which carries substantial semantic redundancy and contributes little to fine-grained localization, is removed when the P2 layer is introduced, thereby preserving a favorable trade-off between detection accuracy and inference efficiency. The resulting streamlined network topology is depicted in Figure 9.
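The effect of swapping the P5 head for a P2 head on the number of prediction locations can be estimated with a short calculation. The sketch assumes a 640 × 640 input, the conventional strides 8, 16, and 32 for P3 to P5, stride 4 for P2, and that the modified model keeps the P2, P3, and P4 heads; these are assumptions for illustration, and the parameter savings from removing P5 are not modelled here.

```python
def grid_cells(input_size, strides):
    """Number of prediction locations on each detection grid for a square input."""
    return {stride: (input_size // stride) ** 2 for stride in strides}


INPUT_SIZE = 640  # assumed input resolution

baseline = grid_cells(INPUT_SIZE, (8, 16, 32))  # P3, P4, P5
modified = grid_cells(INPUT_SIZE, (4, 8, 16))   # P2, P3, P4 (assumed head set)

baseline_total = sum(baseline.values())
modified_total = sum(modified.values())

print("baseline:", baseline, "total", baseline_total)
print("modified:", modified, "total", modified_total)
```

Under these assumptions the P2 grid alone holds 25,600 locations, whereas the removed P5 grid held only 400, which shows why the P2 head offers much finer spatial sampling for small fruits and why the cost of that extra layer has to be offset elsewhere.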
By incorporating a coordinate-aware feature encoding technique, the coordinate attention mechanism allows explicit modelling of positional information while maintaining the modelling capability of channel attention. This mechanism consists of two main stages: embedding of coordinate information and generation of coordinate attention [48]. The initial phase entails one-dimensional global average pooling, applied separately along the width and height directions. The resulting $z^h$ and $z^w$ are concatenated along the spatial dimension, as illustrated in Figure 10.
A $1 \times 1$ convolution $F_1$ is then employed for dimensionality reduction. Subsequently, the resulting feature map is split into a pair of directional features, $f^h$ and $f^w$, which are then mapped back to the original channel dimension using two independent $1 \times 1$ convolutions, $F_h$ and $F_w$. These are normalized via a Sigmoid activation, producing the final attention weights $g^h$ and $g^w$. In the final phase, the attention weights from both directions are broadcast to the original feature map dimensions. As formulated below, the output is generated by multiplying these weights with the input feature element-wise:
Horizontal pooling (width direction):

$$z_c^h(h) = \frac{1}{W} \sum_{0 \le i < W} x_c(h, i)$$

Vertical pooling (height direction):

$$z_c^w(w) = \frac{1}{H} \sum_{0 \le j < H} x_c(j, w)$$

Attention generation and reweighting:

$$f = \delta\left(F_1\left([z^h, z^w]\right)\right), \quad g^h = \sigma\left(F_h(f^h)\right), \quad g^w = \sigma\left(F_w(f^w)\right), \quad y_c(i, j) = x_c(i, j) \times g_c^h(i) \times g_c^w(j)$$

$C$, $H$, and $W$ denote the channel count, height, and width of the feature tensor, respectively. $[\cdot, \cdot]$ represents the concatenation operation of the two tensors $z^h$ and $z^w$. The symbol $r$ denotes the channel reduction ratio, so that $f \in \mathbb{R}^{\frac{C}{r} \times (H + W)}$; $\delta$ refers to the nonlinear activation function, while $\sigma$ denotes the Sigmoid activation function.
By combining feature responses from the two spatial directions, the coordinate attention mechanism preserves positional information while capturing long-range dependencies. It improves the representation of important regions by encoding direction-sensitive information into attention weight maps along the horizontal and vertical directions, which are then applied to the input features using element-wise multiplication. Compared with traditional attention mechanisms, the coordinate attention approach models channel dependencies and spatial structures simultaneously. For applications requiring accurate localisation, such as small-object detection, this method improves position awareness and expands the receptive field at a minimal computational cost.
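The two stages of coordinate attention can be traced with a small pure-Python sketch. It is illustrative only: the function and weight names are hypothetical, ReLU stands in for the nonlinear activation after the shared 1 × 1 convolution, and normalization layers are omitted.

```python
import math


def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))


def coordinate_attention(x, w_reduce, w_h, w_w):
    """Coordinate attention on a C x H x W nested list.

    w_reduce: shared 1 x 1 reduction weights, shape (C / r) x C
    w_h, w_w: per-direction 1 x 1 expansion weights, shape C x (C / r)
    """
    c, h, w = len(x), len(x[0]), len(x[0][0])
    mid = len(w_reduce)
    # Stage 1: directional average pooling (coordinate information embedding).
    z_h = [[sum(x[k][i]) / w for i in range(h)] for k in range(c)]
    z_w = [[sum(x[k][i][j] for i in range(h)) / h for j in range(w)]
           for k in range(c)]
    # Concatenate along the spatial axis: C x (H + W).
    z = [row_h + row_w for row_h, row_w in zip(z_h, z_w)]
    # Stage 2: shared reduction with ReLU (assumed), then split by direction.
    f = [[max(sum(w_reduce[m][k] * z[k][p] for k in range(c)), 0.0)
          for p in range(h + w)] for m in range(mid)]
    f_h = [row[:h] for row in f]
    f_w = [row[h:] for row in f]
    # Independent expansions followed by Sigmoid give the attention weights.
    g_h = [[sigmoid(sum(w_h[k][m] * f_h[m][i] for m in range(mid)))
            for i in range(h)] for k in range(c)]
    g_w = [[sigmoid(sum(w_w[k][m] * f_w[m][j] for m in range(mid)))
            for j in range(w)] for k in range(c)]
    # Reweight every position by its row weight and its column weight.
    y = [[[x[k][i][j] * g_h[k][i] * g_w[k][j] for j in range(w)]
          for i in range(h)] for k in range(c)]
    return y, z_h, z_w


# Toy input with C = 2, H = 2, W = 3 and hypothetical weights (r = 2).
x = [
    [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],
    [[0.0, 0.0, 0.0], [6.0, 6.0, 6.0]],
]
w_reduce = [[1.0, 1.0]]
w_h = [[0.0], [0.0]]   # zero weights make every attention value sigmoid(0) = 0.5
w_w = [[0.0], [0.0]]

y, z_h, z_w = coordinate_attention(x, w_reduce, w_h, w_w)
print("z_h =", z_h)
print("z_w =", z_w)
```

The pooled descriptors keep one value per row and one per column of each channel, which is how the mechanism retains positional information that a single global average would discard.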
2.3.4. PIoU Loss Function
In this study, the Powerful-IoU (PIoU) loss proposed by Liu et al. [49] was adopted to replace the default Complete-IoU (CIoU) loss employed in the YOLOv8n architecture for bounding box regression.
The PIoU loss introduces a penalty factor $P$ that adapts to the target box dimensions. In particular, $P$ is defined by the normalised distances between the corresponding edges of the ground-truth box and the predicted box. Unlike conventional penalty terms that depend on the minimum enclosing rectangle, the denominator of $P$ relies exclusively on the width $w_{gt}$ and height $h_{gt}$ of the target box.
Figure 11 shows a schematic comparison of the computational principles behind CIoU and PIoU.
The core structure of PIoU comprises two components: the standard IoU term and a regularization term incorporating the penalty factor $P$. Its mathematical formulation is provided below:

$$L_{PIoU} = 1 - IoU + f(P)$$

The mean of the normalised edge distances between the target box and the prediction box is used to calculate the penalty factor $P$:

$$P = \frac{1}{4}\left(\frac{d_{w1}}{w_{gt}} + \frac{d_{w2}}{w_{gt}} + \frac{d_{h1}}{h_{gt}} + \frac{d_{h2}}{h_{gt}}\right)$$

Here, $d_{w1}$, $d_{w2}$, $d_{h1}$, and $d_{h2}$ denote the absolute distances between the corresponding side pairs of the predicted and target boxes, as illustrated in Figure 11b; $w_{gt}$ and $h_{gt}$ indicate the width and height of the ground-truth bounding box.
The function $f$ serves as a gradient modulation term that adapts the penalty magnitude based on the quality of the predicted box. $f$ is defined as:

$$f(x) = 1 - e^{-x^2}$$

This functional form ensures that the gradient of the penalty term remains bounded and varies non-monotonically with respect to $P$.
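A minimal sketch of the base PIoU loss for a single pair of axis-aligned boxes is given below. It assumes the exponential form 1 − exp(−P²) for the penalty function, boxes given as (x1, y1, x2, y2) corner coordinates, and hypothetical function names; it is meant to illustrate the computation, not to reproduce the training code of this study.

```python
import math


def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def penalty_factor(pred, target):
    """Mean edge distance, normalised by the target box width and height."""
    w_gt = target[2] - target[0]
    h_gt = target[3] - target[1]
    d_w1 = abs(pred[0] - target[0])  # left edges
    d_w2 = abs(pred[2] - target[2])  # right edges
    d_h1 = abs(pred[1] - target[1])  # top edges
    d_h2 = abs(pred[3] - target[3])  # bottom edges
    return (d_w1 / w_gt + d_w2 / w_gt + d_h1 / h_gt + d_h2 / h_gt) / 4.0


def piou_loss(pred, target):
    """IoU loss plus the bounded penalty 1 - exp(-P^2)."""
    p = penalty_factor(pred, target)
    return (1.0 - iou(pred, target)) + (1.0 - math.exp(-p * p))


target = (0.0, 0.0, 10.0, 10.0)
shifted = (2.0, 0.0, 12.0, 10.0)  # same size, moved 2 units to the right

p_shifted = penalty_factor(shifted, target)
loss_perfect = piou_loss(target, target)
loss_shifted = piou_loss(shifted, target)
print(p_shifted, loss_perfect, loss_shifted)
```

Because the penalty is normalised only by the target box size, enlarging the predicted box does not shrink the penalty, which is the behaviour the text contrasts with penalties based on the enclosing rectangle.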
This study employs PIoUv2 as the specific loss function variant. PIoUv2 extends the base PIoU formulation by incorporating a non-monotonic attention mechanism that further weights the loss contribution according to anchor box quality. The quality of an anchor box is quantified by the parameter $q$, which is derived from the penalty factor $P$:

$$q = e^{-P}, \quad q \in (0, 1]$$

Here, $q = 1$ corresponds to perfect alignment between the predicted box and the target box ($P = 0$), while lower values of $q$ indicate progressively poorer alignment.
The attention function $u$ is defined as a non-monotonic weighting function controlled by a single hyperparameter $\lambda$:

$$u(\lambda q) = 3 \cdot (\lambda q) \cdot e^{-(\lambda q)^2}$$

The complete PIoUv2 loss is then formulated as the product of the attention weight and the base PIoU loss:

$$L_{PIoUv2} = u(\lambda q) \cdot L_{PIoU} = 3 \cdot (\lambda q) \cdot e^{-(\lambda q)^2} \cdot L_{PIoU}$$

In this formulation, $\lambda$ is a tunable hyperparameter that regulates the intensity of the attention mechanism. In accordance with the recommendations of Liu et al., $\lambda$ was set to 1.3 for all experiments conducted in this study.
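The attention weighting of PIoUv2 can be sketched as follows, taking the penalty factor P and the IoU of a box pair as given inputs. The exponential forms for the quality q and the attention function, as well as the function names, are assumptions of this sketch; only the value 1.3 for the hyperparameter comes from the text.

```python
import math


def anchor_quality(p):
    """Quality q in (0, 1]; q = 1 when the penalty factor P is 0."""
    return math.exp(-p)


def attention(value):
    """Non-monotonic weighting 3 * x * exp(-x^2), which peaks at x = 1 / sqrt(2)."""
    return 3.0 * value * math.exp(-value * value)


def piou_v2_loss(p, iou_value, lam=1.3):
    """Attention weight u(lam * q) multiplied by the base PIoU loss."""
    base = (1.0 - iou_value) + (1.0 - math.exp(-p * p))
    return attention(lam * anchor_quality(p)) * base


peak = attention(1.0 / math.sqrt(2.0))
loss_perfect = piou_v2_loss(0.0, 1.0)
loss_medium = piou_v2_loss(0.5, 0.5)
print(peak, loss_perfect, loss_medium)
```

Since the weight rises and then falls as its argument grows, anchors of intermediate quality receive the largest weight, while both very poor and nearly perfect anchors are down-weighted.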