1. Introduction
Citrus (
Citrus reticulata ‘Unshiu’) production represents a critical component of global agricultural economies, with China’s output reaching 64.54 million tons in 2023—a 12.3% increase from 2020 [
1]. However, this sector faces substantial constraints from labor shortages and mechanization inefficiencies. Current harvesting operations remain predominantly manual, with harvesting labor accounting for 45–60% of total production expenses. Analysis of commercial citrus operations in Hunan Province reveals that manual harvesting achieves throughput rates of only 92–120 fruits per hour per worker, with seasonal labor availability declining by 23% annually since 2018. These economic and operational pressures necessitate the development of automated harvesting systems with robust computer vision capabilities.
The fundamental challenge in automated citrus detection stems from the complex interplay of environmental and morphological factors in orchard settings. Early approaches can be categorized into two paradigms: traditional image processing methods and modern deep learning techniques. Traditional image processing-based methods [
2] relied on hand-crafted features such as color thresholding in HSV space, texture descriptors, and circular Hough transforms for shape detection. While computationally efficient, these rule-based methods struggle with generalization across varying orchard conditions [
3]. Color-based segmentation methods, though widely adopted, suffer from significant accuracy degradation under varying illumination conditions, particularly in backlight scenarios common in orchard environments [
4]. Texture-based approaches exhibit elevated false positive rates in densely foliated regions where leaf patterns interfere with fruit detection [
5]. Shape-based detection methods assuming circular fruit morphology fail to identify elliptical citrus fruits with aspect ratios deviating from spherical assumptions [
6]. Furthermore, sliding window classification approaches, while thorough in coverage, require processing numerous candidate regions per image, resulting in inference times of several seconds per frame [
7]—prohibitively slow for real-time harvesting applications. In contrast, deep learning methods, particularly convolutional neural networks, automatically learn hierarchical feature representations from data, offering superior adaptability to complex environmental variations [
8].
The advent of convolutional neural networks [
9,
10] fundamentally transformed object detection paradigms through automatic learning of hierarchical feature representations. Recent advances in the YOLO series [
11] have demonstrated particular promise for agricultural applications, achieving favorable accuracy–efficiency trade-offs through anchor-free detection and efficient backbone architectures. However, direct application of YOLOv8n to citrus detection reveals three critical deficiencies:
(1) Scale Variation Complexity: Analysis of our CitrusSet dataset reveals that citrus fruits exhibit substantial size variation, ranging from small immature fruits to large mature specimens with significant scale differences. This multi-scale challenge is well-documented in agricultural object detection [
12], where scale discrepancy directly impacts detection performance. Studies have shown that standard feature pyramid networks remain insufficient for extreme scale variation in orchard environments [
13]. Baseline YOLOv8n demonstrates elevated miss detection rates for small fruits compared to medium-sized fruits, primarily due to insufficient feature resolution at shallow network layers and inadequate multi-scale fusion mechanisms.
(2) Occlusion Robustness Deficiency: Leaf occlusion analysis across our annotated images indicates that a majority of citrus fruits experience partial occlusion, with a substantial proportion suffering heavy occlusion where more than half of the fruit surface is obscured. This occlusion challenge is prevalent in real orchard environments and significantly impacts detection performance [
14]. Standard attention mechanisms in YOLOv8n utilize spatially uniform weighting, failing to prioritize partially visible fruit regions [
15]. Consequently, detection recall significantly degrades for heavily occluded instances—a performance gap that is unacceptable for commercial harvesting operations where harvest completeness directly impacts revenue [
16].
(3) Shape-Aware Localization Inadequacy: Oranges exhibit variable shapes ranging from spherical to oblong, with the ellipsoidal morphology of many citrus varieties creating misalignment with generic rectangular bounding box regression objectives in standard IoU-based loss functions. Studies on citrus fruit morphology indicate that orange shapes vary from globose to oval [
17]. Standard CIoU and DIoU [
18] losses do not account for these fruit-specific geometric characteristics, resulting in suboptimal localization precision for elliptical fruits, which is critical for yield prediction and harvest timing optimization.
To address these limitations while maintaining real-time performance suitable for robotic deployment, this paper proposes ACDNet (Adaptive Citrus Detection Network), which introduces three novel technical contributions. Specifically, this work investigates three research questions: (RQ1) Can fruit-aware feature extraction with illumination adaptation improve detection accuracy under variable lighting while maintaining computational efficiency? (RQ2) Can content-aware multi-scale sampling effectively address the scale variation and occlusion challenges inherent in orchard environments? (RQ3) Can incorporating morphological priors into the loss function enhance localization precision for ellipsoidal fruits? We hypothesize that integrating domain-specific adaptations at the feature extraction, sampling, and optimization stages will yield superior performance compared to generic object detection frameworks. The three technical contributions are as follows:
(i) Citrus-Adaptive Feature Extraction (CAFE) Module: Building upon the observation that citrus fruits occupy only a small portion of orchard image area while consuming disproportionate computational resources in standard convolution, CAFE combines fruit-aware partial convolution with illumination-adaptive attention. The fruit-aware mechanism identifies citrus-relevant channels through learnable weights derived from color distribution statistics in HSV space, achieving significant parameter reduction. The illumination-adaptive component analyzes brightness histograms and generates dynamic feature weights, demonstrating robust performance under backlight conditions compared to static attention mechanisms.
(ii) Dynamic Multi-Scale Sampling (DMS) Operator: DMS addresses scale variation and occlusion through content-aware offset generation. By predicting sampling point offsets based on local gradient distributions, DMS adaptively concentrates sampling on fruit boundaries while suppressing background foliage interference. Compared to fixed grid sampling in standard deformable convolution [
19], DMS reduces false positives in heavily foliated regions and improves recall for small fruits.
(iii) Fruit-Shape Aware IoU (FSA-IoU) Loss: Incorporating citrus morphological priors, FSA-IoU extends standard IoU by penalizing aspect ratio deviation and rewarding ellipse-fitting accuracy. Through weighted combination of IoU, aspect ratio consistency, and ellipse overlap metrics, FSA-IoU achieves improved localization precision for elliptical fruits while maintaining computational efficiency with no additional inference overhead.
The synergistic integration of these components achieves statistically significant performance improvements (p < 0.001, Cohen’s d > 1.2) while maintaining computational efficiency suitable for edge deployment on platforms such as NVIDIA Jetson Xavier NX (21 TOPS) (NVIDIA Corporation, Santa Clara, CA, USA).
2. Data Collection and Preprocessing
2.1. Data Acquisition Protocol
The data for this study were collected from citrus plantations in Huanglianchong Village, Gaocun Town, Mayang Miao Autonomous County, Huaihua City, Hunan Province (latitude 27°52′ N, longitude 109°48′ E, elevation 280–450 m). To ensure dataset representativeness across varying environmental conditions, data acquisition was conducted during three distinct time periods: morning (07:00–09:00, low-angle illumination), midday (11:00–13:00, overhead lighting), and afternoon (15:00–17:00, backlight conditions). An iPhone 13 Pro (Apple Inc., Cupertino, CA, USA) (12 MP wide camera, f/1.5 aperture, sensor-shift optical image stabilization) was used for data collection. This device was selected for (1) its reasonable cost (USD 800), significantly lower than specialized agricultural cameras (USD 3000–8000), making it more suitable for widespread adoption; (2) its excellent color accuracy, capturing true citrus fruit colors under various lighting conditions; and (3) its high-quality imaging, with the 12 MP sensor and f/1.5 aperture ensuring image clarity.
Shooting distances were determined following these guidelines: (1) minimum distance (0.5 m) ensures clear focus while maximizing fruit size in frame; (2) maximum distance (3.0 m) corresponds to typical robotic arm working range; (3) a measuring tape was used to measure and calibrate the distance from camera to fruit trees before shooting; (4) pre-marked shooting positions at 0.5 m, 1.0 m, 1.5 m, 2.0 m, 2.5 m, and 3.0 m were set up in the orchard, and operators followed these markers to ensure distance consistency; (5) this distance range covers typical observation distances in actual robotic harvesting operations.
The sampling protocol systematically accounted for critical factors affecting detection complexity: varying light intensities, leaf occlusion levels (categorized as light <30%, moderate 30–60%, heavy >60% based on visible fruit surface area), fruit overlap patterns (isolated, partial overlap, cluster configurations with 2–5 fruits), and positional distances (0.5–3.0 m from camera, corresponding to typical robotic arm working range), as illustrated in
Figure 1.
To provide detailed visualization of the occlusion categories used in our dataset annotation,
Figure 2 presents representative examples for each occlusion level with ground truth bounding box annotations.
2.2. Data Annotation and Quality Control
A total of 4370 citrus images were initially collected and saved as JPG files with a resolution of 1920 × 1080 pixels. Following rigorous quality control procedures, blurry images, duplicate images (SSIM similarity > 0.95), and other invalid samples (underexposed with mean luminance < 20, overexposed with saturated pixels > 15%) were systematically removed, resulting in a high-quality dataset of 2887 citrus images.
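To illustrate the filtering criteria above, the following minimal sketch combines the stated thresholds (SSIM > 0.95 for duplicates, mean luminance < 20 for underexposure, >15% saturated pixels for overexposure) with a standard variance-of-Laplacian blur check; the blur threshold, function name, and use of OpenCV/scikit-image are illustrative assumptions, since the paper does not specify its blur criterion or tooling.

```python
import cv2
from skimage.metrics import structural_similarity as ssim

def is_valid_sample(img_bgr, prev_gray=None,
                    blur_thresh=100.0, dup_thresh=0.95,
                    dark_thresh=20, sat_frac_thresh=0.15):
    """Quality-control filter mirroring the criteria in Section 2.2.

    Returns (keep_flag, grayscale image); prev_gray is the last accepted frame,
    used for SSIM-based duplicate rejection.
    """
    gray = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2GRAY)
    # Blur check via variance of the Laplacian (the 100.0 threshold is illustrative).
    if cv2.Laplacian(gray, cv2.CV_64F).var() < blur_thresh:
        return False, gray
    # Underexposed: mean luminance < 20.
    if gray.mean() < dark_thresh:
        return False, gray
    # Overexposed: more than 15% of pixels saturated.
    if (gray >= 250).mean() > sat_frac_thresh:
        return False, gray
    # Near-duplicate: SSIM > 0.95 against the previously accepted image.
    if prev_gray is not None and ssim(gray, prev_gray) > dup_thresh:
        return False, gray
    return True, gray
```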
All images were annotated by two experienced annotators using the LabelImg tool in YOLO format. The visible area annotation method is as follows: (1) annotators delineated the visible portions of fruits using polygon tools; (2) the software automatically calculated the area of visible portions (pixel count); (3) for partially occluded fruits, annotators estimated the total fruit area based on the shape and size of visible portions.
2.3. Dataset Splitting and Augmentation
The data processing pipeline follows the standard practice in deep learning-based object detection [
20]. The workflow consists of three sequential stages:
(1) Dataset Splitting: The 2887 annotated images were first randomly split into training (2021 images), validation (578 images), and test (288 images) sets with a ratio of 7:2:1. This split ratio is consistent with common practices in agricultural object detection and YOLO-based models [
21]. The splitting process ensured balanced distributions across subsets in terms of lighting conditions, occlusion levels, and fruit density.
(2) Data Augmentation: Data augmentation was applied only to the training set to improve model generalization and prevent overfitting.
After data augmentation, the training set was expanded to 7535 images.
(3) Validation and Test Sets: The validation and test sets remained in their original state without any augmentation to ensure objective and fair model performance evaluation [
22]. The validation set was used for hyperparameter tuning and early stopping during training, while the test set was reserved for final model performance assessment to prevent overfitting to validation data.
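For concreteness, the splitting stage described above can be sketched as follows. The helper is illustrative only: it performs a seeded random 7:2:1 split, while the stratification by lighting conditions, occlusion levels, and fruit density mentioned in the text is omitted, and the function name is our own.

```python
import random

def split_dataset(image_paths, ratios=(0.7, 0.2, 0.1), seed=0):
    """Random 7:2:1 split into training, validation, and test subsets (Section 2.3)."""
    paths = sorted(image_paths)
    random.Random(seed).shuffle(paths)
    n_train = round(len(paths) * ratios[0])
    n_val = round(len(paths) * ratios[1])
    train = paths[:n_train]
    val = paths[n_train:n_train + n_val]
    test = paths[n_train + n_val:]
    return train, val, test

# Augmentation (mosaic, color jitter, random crop; see Section 3.5) is then applied
# to `train` only, expanding it from 2021 to 7535 images; `val` and `test` are untouched.
```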
3. ACDNet Architecture and Technical Innovations
3.1. Network Architecture Overview
The proposed ACDNet adopts an encoder–decoder architecture based on YOLOv8n with three novel components specifically designed for citrus detection challenges. The overall framework integrates our technical innovations while maintaining the efficiency advantages of the original architecture, as illustrated in
Figure 3.
3.2. Citrus-Adaptive Feature Extraction (CAFE) Module
3.2.1. Motivation and Technical Challenge Analysis
Mobile deployment in agricultural robotics demands efficient architectures that maintain detection accuracy. Traditional convolution operations [
23] apply uniform computational effort across all spatial locations, which is inefficient in agricultural scenes where fruits occupy only small portions (5–15%) of the image area while background vegetation consumes the bulk of the computation. Moreover, standard attention mechanisms [
24,
25,
26] fail to capture the unique visual characteristics of citrus fruits under the varying illumination conditions commonly found in orchard environments.
3.2.2. CAFE Module Design and Implementation
To address the dual challenges of computational inefficiency and illumination sensitivity, we propose the CAFE (Citrus-Adaptive Feature Extraction) module. This module combines fruit-aware partial convolution with illumination-adaptive attention mechanisms, as illustrated in
Figure 4.
Fruit-Aware Partial Convolution (FPConv): Unlike traditional partial convolution that applies convolution to arbitrary channel subsets, our FPConv identifies fruit-relevant channels using learnable channel importance weights derived from citrus color distribution statistics in HSV color space. The computational complexity of FPConv is as follows:
FLOPs_FPConv = h × w × k² × c_f²
where h and w denote spatial dimensions, k represents the convolution kernel size, and c_f represents the number of dynamically selected fruit-relevant channels. Compared to standard convolution, FPConv achieves a 4× reduction in computational cost.
Illumination-Adaptive Attention (IAA): To handle varying illumination conditions in orchard environments, we introduce an illumination-adaptive attention mechanism in which a dual-pathway branch analyzes brightness statistics and produces an attention map through a sigmoid activation σ, which is then applied to the features by element-wise multiplication ⊗; this dual-pathway design enables robust feature weighting across lighting conditions.
Multi-Scale Attention Integration: We integrate an Efficient Multi-Scale Attention (EMA) mechanism [
27] within the CAFE module. The EMA component generates spatial attention maps through collaborative use of 3 × 3 and 1 × 1 convolution branches, effectively capturing multi-scale spatial information while maintaining computational efficiency.
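To make the CAFE design concrete, the following PyTorch sketch combines a partial convolution over the top-ranked fruit-relevant channels with a brightness-histogram attention pathway. The class name `CAFE`, the layer sizes, the histogram bin count, and the assumption that inputs are normalized to [0, 1] are illustrative choices, not the authors' implementation.

```python
import torch
import torch.nn as nn

class CAFE(nn.Module):
    """Illustrative CAFE block: fruit-aware partial convolution plus
    illumination-adaptive attention (a sketch, not the paper's exact code)."""

    def __init__(self, channels, fruit_ratio=0.5, hist_bins=16):
        super().__init__()
        self.n_sel = max(1, int(channels * fruit_ratio))       # c_f: fruit-relevant channels
        self.channel_importance = nn.Parameter(torch.ones(channels))
        self.pconv = nn.Conv2d(self.n_sel, self.n_sel, 3, padding=1)
        self.illum_fc = nn.Sequential(                          # brightness-histogram pathway
            nn.Linear(hist_bins, channels),
            nn.ReLU(inplace=True),
            nn.Linear(channels, channels),
        )
        self.hist_bins = hist_bins

    def forward(self, x):                                       # x: (B, C, H, W), values in [0, 1]
        b, c, h, w = x.shape
        # Fruit-aware partial convolution: convolve only the top-c_f channels,
        # gated by their learned importance so the weights remain trainable.
        idx = torch.topk(self.channel_importance, self.n_sel).indices
        gate = torch.sigmoid(self.channel_importance[idx]).view(1, -1, 1, 1)
        out = x.clone()
        out[:, idx] = self.pconv(x[:, idx]) * gate
        # Illumination-adaptive attention: per-image brightness histogram -> channel weights.
        luma = x.mean(dim=1).detach()                           # histogram treated as a statistic
        hist = torch.stack([torch.histc(luma[i], bins=self.hist_bins, min=0.0, max=1.0)
                            for i in range(b)]) / (h * w)
        attn = torch.sigmoid(self.illum_fc(hist)).view(b, c, 1, 1)
        return out * attn
```

A typical call such as `CAFE(64)(torch.rand(2, 64, 80, 80))` returns a tensor of the same shape, with only half of the channels passed through the 3 × 3 convolution.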
3.3. Dynamic Multi-Scale Sampling (DMS) Operator
Standard upsampling operators [
19] treat all spatial locations equally, leading to inefficient computation in agricultural scenes. To achieve more efficient upsampling while focusing computational resources on fruit regions, we propose the Dynamic Multi-Scale Sampling (DMS) operator, as detailed in
Figure 5.
The DMS operator introduces content-aware sampling point generation and fruit-focused interpolation strategies, implemented in the three stages described below.
Fruit Probability Prediction: A lightweight sub-network predicts a pixel-wise fruit existence probability map P_fruit, which guides sampling point generation.
Content-Aware Offset Generation: Dynamic offsets are generated from the fruit probability map at upsampling factor g and spatial stride s, prioritizing sampling density in high-confidence fruit regions. The offset generation mechanism allocates 2.8× higher sampling point density in fruit regions (P_fruit > 0.7) than in background regions (P_fruit < 0.3), as validated through sampling point visualization analysis.
Fruit-Focused Interpolation: The resampling process prioritizes fruit regions while adaptively suppressing background areas. Sampling locations are drawn from a fruit-focused sampling set constructed from the regular grid G, the learned offsets O, and a fruit-aware weighting derived from P_fruit. This adaptive mechanism reduces background noise propagation by 42% (measured by signal-to-noise ratio in feature maps) compared to standard bilinear upsampling.
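The DMS pipeline can be sketched with standard PyTorch primitives. In the sketch below, the probability and offset heads are minimal stand-ins for the paper's lightweight sub-networks, and the tanh-based offset scaling is an illustrative choice rather than the published formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DMS(nn.Module):
    """Sketch of Dynamic Multi-Scale Sampling as a content-aware upsampler."""

    def __init__(self, channels, scale=2):
        super().__init__()
        self.scale = scale
        self.prob_head = nn.Conv2d(channels, 1, kernel_size=1)             # P_fruit predictor
        self.offset_head = nn.Conv2d(channels, 2 * scale * scale, 3, padding=1)

    def forward(self, x):                                                   # x: (B, C, h, w)
        b, c, h, w = x.shape
        g = self.scale
        H, W = g * h, g * w
        p_fruit = torch.sigmoid(self.prob_head(x))                          # (B, 1, h, w)
        # Offsets modulated by P_fruit, so sampling concentrates on fruit regions.
        off = torch.tanh(self.offset_head(x)) * p_fruit                     # (B, 2*g*g, h, w)
        off = F.pixel_shuffle(off, g).permute(0, 2, 3, 1)                   # (B, H, W, 2)
        # Regular grid in normalized [-1, 1] coordinates plus small learned offsets.
        ys, xs = torch.meshgrid(torch.linspace(-1, 1, H, device=x.device),
                                torch.linspace(-1, 1, W, device=x.device), indexing="ij")
        grid = torch.stack((xs, ys), dim=-1).unsqueeze(0).expand(b, H, W, 2)
        grid = grid + off / torch.tensor([W, H], dtype=x.dtype, device=x.device)
        return F.grid_sample(x, grid, mode="bilinear", align_corners=True)
```

Applied to a (B, C, h, w) feature map with scale 2, the module returns a (B, C, 2h, 2w) map resampled at content-aware locations instead of a fixed bilinear grid.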
3.4. Fruit-Shape Aware IoU (FSA-IoU) Loss Function
The standard CIoU loss function in YOLOv8 applies isotropic penalties that overlook domain-specific characteristics critical for citrus detection: the consistent ellipsoidal morphology of fruits (height-to-width ratio 0.95 ± 0.08), the directional sensitivity of localization errors for elliptical objects, and the boundary ambiguity caused by leaf occlusion affecting 72% of fruits in our dataset. To address these limitations, we propose FSA-IoU, which integrates citrus-specific geometric priors and adaptive difficulty weighting into a unified optimization objective:
L_FSA-IoU = L_IoU + λ1 L_shape + λ2 L_ratio + λ3 L_adapt
where L_IoU quantifies bounding box overlap, and λ1, λ2, λ3 are weights determined through grid search on the validation set.
The shape-weighted center regression term L_shape penalizes localization errors asymmetrically based on fruit orientation. We define directional weights w_x and w_y, which satisfy a normalization constraint and assign a higher penalty to deviations along the dominant axis. The shape loss is then computed as follows:
L_shape = [w_x (x − x_gt)² + w_y (y − y_gt)²] / c²
where (x, y) and (x_gt, y_gt) denote the predicted and ground truth box centers, and c is the diagonal of the smallest enclosing box for scale normalization.
The aspect ratio regularization term incorporates botanical knowledge through a soft constraint:
L_ratio = (ln((h_p / w_p) / r_citrus))²
where h_p and w_p are the predicted box height and width, and r_citrus is the characteristic height-to-width ratio derived from training set statistics. The logarithmic formulation ensures scale invariance and symmetric treatment of over- and underestimation.
The adaptive weighting term L_adapt implements hard example mining by amplifying the loss for difficult samples through an occlusion-aware modulation factor computed from two normalized prediction discrepancies between the predicted and ground truth boxes. For accurate predictions (small discrepancies), the factor adds minimal penalty; for poor predictions, typically associated with heavy occlusion, it increases substantially, providing stronger learning signals. The factor thus implements self-paced learning that concentrates optimization effort on challenging cases.
The hyperparameters were determined systematically: the component weights λ1, λ2, and λ3 through grid search on the validation set, selecting the configuration with the highest validation mAP@0.5 and stable training; the characteristic aspect ratio r_citrus directly from training set statistics; and the adaptive modulation parameters based on gradient stability analysis. The relative magnitudes reflect the importance hierarchy: center regression as primary guidance, aspect ratio as complementary constraint, and adaptive weighting for fine-tuning. Notably, FSA-IoU introduces zero inference overhead since loss computation occurs only during training.
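A hedged sketch of the composite loss is given below. The individual terms are illustrative reconstructions consistent with the descriptions above: the directional weights, the exact adaptive term, and the function name `fsa_iou_loss` are demonstration choices, not the published implementation, while the component weights follow the values listed in Section 3.5.

```python
import torch

def fsa_iou_loss(pred, gt, r_citrus=0.95, lam=(0.5, 0.3, 0.2), eps=1e-7):
    """Sketch of FSA-IoU; boxes are (cx, cy, w, h). Component forms are
    illustrative reconstructions of Section 3.4, not the exact implementation."""
    px, py, pw, ph = pred.unbind(-1)
    gx, gy, gw, gh = gt.unbind(-1)

    # Plain IoU term.
    iw = (torch.min(px + pw / 2, gx + gw / 2) - torch.max(px - pw / 2, gx - gw / 2)).clamp(min=0)
    ih = (torch.min(py + ph / 2, gy + gh / 2) - torch.max(py - ph / 2, gy - gh / 2)).clamp(min=0)
    inter = iw * ih
    iou = inter / (pw * ph + gw * gh - inter + eps)

    # Shape-weighted center regression, normalized by the enclosing-box diagonal.
    cw = torch.max(px + pw / 2, gx + gw / 2) - torch.min(px - pw / 2, gx - gw / 2)
    ch = torch.max(py + ph / 2, gy + gh / 2) - torch.min(py - ph / 2, gy - gh / 2)
    c2 = cw ** 2 + ch ** 2 + eps
    w_x, w_y = 0.9, 1.1                      # illustrative directional weights
    l_shape = (w_x * (px - gx) ** 2 + w_y * (py - gy) ** 2) / c2

    # Aspect-ratio regularization toward the citrus prior r_citrus (h/w ≈ 0.95).
    l_ratio = torch.log((ph + eps) / (pw + eps) / r_citrus) ** 2

    # Stand-in for the occlusion-aware adaptive weighting: poor boxes get more penalty.
    l_adapt = (1.0 - iou) * (l_shape.detach() + l_ratio.detach())

    return 1.0 - iou + lam[0] * l_shape + lam[1] * l_ratio + lam[2] * l_adapt
```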
3.5. Training Configuration and Hyperparameters
The experimental setup utilized an Intel Xeon Platinum 8352V CPU (Intel Corporation, Santa Clara, CA, USA) (2.1 GHz, 36 cores), an NVIDIA GeForce RTX 4090 GPU (NVIDIA Corporation, Santa Clara, CA, USA) (24 GB GDDR6X memory), 120 GB DDR4-3200 RAM, and Ubuntu 22.04 LTS operating system. The program was developed using PyTorch 2.0.0 based on Python 3.9, with CUDA 11.8 and cuDNN 8.6.0 for GPU acceleration.
During model training, the SGD optimizer was employed to iteratively process samples from the dataset and adjust model parameters to minimize the loss function. The training process ran for 300 epochs with a batch size of 32 (constrained by GPU memory to maintain 78% utilization without overflow). The initial learning rate and weight decay were set to 0.001 and 0.0005, respectively, based on preliminary grid search experiments. AMP (Automatic Mixed Precision) was disabled to avoid potential precision loss during training, prioritizing numerical stability over training speed.
Comprehensive Hyperparameter Configuration:
- Optimization: SGD with momentum 0.937, Nesterov acceleration enabled
- Learning rate schedule: Cosine annealing from 0.001 to 1 × 10⁻⁵
- Weight decay: 0.0005 (L2 regularization)
- Warmup strategy: Linear warmup for 3 epochs (0 → 0.001)
- Loss function weights: λ1 = 0.5, λ2 = 0.3, λ3 = 0.2 (FSA-IoU components)
- Data augmentation: Mosaic (p = 1.0), color jitter (p = 0.5), random crop (p = 0.3)
- Gradient clipping: Max norm = 10.0 to prevent exploding gradients
- Early stopping: Patience = 50 epochs (monitoring validation mAP@0.95)
To ensure reproducibility, the random seed was set to 0 for all random number generators (Python 3.9, NumPy 1.21.0, PyTorch 2.0.0, CUDA 11.8). All experiments were conducted with deterministic algorithms enabled, which slightly reduced training speed (approximately 8% slower) but guarantees bit-exact reproducibility across runs.
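The seeding and determinism settings described above correspond to standard PyTorch switches, sketched here as a small helper (the function name is ours; the CUBLAS workspace variable is required by some deterministic CUDA kernels).

```python
import os
import random
import numpy as np
import torch

def set_reproducible(seed=0):
    """Seed every RNG and enable deterministic kernels, as described in Section 3.5."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"  # needed by some deterministic CUDA ops
    torch.use_deterministic_algorithms(True)            # bit-exact runs, ~8% slower training
    torch.backends.cudnn.benchmark = False
```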
3.6. Performance Evaluation Metrics
The model is evaluated using commonly used performance metrics [
28,
29] in object detection tasks:
- mAP@0.5: mean Average Precision at IoU threshold 0.5.
- mAP@0.95: mean Average Precision at IoU threshold 0.95.
- Precision: Ratio of correctly detected fruits to total detections, computed as TP/(TP + FP).
- Recall: Ratio of correctly detected fruits to total ground truth fruits, computed as TP/(TP + FN).
- FPS: Inference speed in frames per second.
- Parameters: Total number of trainable parameters (millions).
- FLOPs: Floating-point operations (billions) for processing a single 640 × 640 input image.
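For reference, precision and recall follow directly from the matched detection counts, as in this short helper with a worked example (the counts shown are illustrative, not results from the paper).

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from matched detection counts (a detection matching a
    ground truth box at IoU >= threshold counts as a true positive)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Example: 93 correct detections, 6 false alarms, 4 missed fruits
# -> precision = 93 / 99 ≈ 0.939, recall = 93 / 97 ≈ 0.959
```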
4. Results and Analysis
4.1. Ablation Study
To rigorously assess the contribution of each component, we conducted comprehensive ablation experiments with 10 independent runs using different random seeds (0–9) to establish statistical robustness.
Table 1 presents the detailed results, while
Table 2 provides paired
t-test analysis quantifying statistical significance.
Comprehensive Analysis of Component Contributions:
CAFE Module: The CAFE module alone (+C) demonstrates substantial improvements across all metrics, increasing precision by 2.7 percentage points (from 88.7% to 91.4%), with high statistical significance. More remarkably, CAFE achieves these gains while dramatically reducing parameters by 23% (from 3.01 M to 2.31 M) and FLOPs by 19.5% (from 8.2 G to 6.6 G). When contextualized within practical deployment constraints, this parameter reduction translates to approximately 2.8 MB lower model size, which is critical for embedded systems.
DMS Operator: Contrary to initial hypotheses, the standalone DMS operator (+D) demonstrates marginal improvements in mAP@0.5 while actually decreasing recall by 2.1 percentage points compared to baseline. Error analysis indicates that DMS alone produces notably higher false negative rates in densely foliated regions, likely due to sampling point collapse toward high-confidence background regions early in training when feature representations remain weak.
However, this apparent limitation transforms into a strength when DMS is combined with CAFE. The synergistic combination achieves 92.5% recall—a remarkable 1.8 percentage point improvement over CAFE alone. This synergy demonstrates that DMS’s dynamic sampling strategy effectively leverages the enhanced feature representations provided by CAFE, focusing computational resources on fruit-relevant regions after CAFE has improved feature discrimination. Notably, the statistical analysis in
Table 2 reveals an interesting pattern: adding DMS to CAFE (C+D vs. +C) yields negligible change in mAP@0.5 (−0.1%,
p = 0.156) but significant improvement in mAP@0.95 (+1.1%,
p = 0.002). This differential effect indicates that DMS primarily enhances localization precision rather than detection recall; the stricter IoU threshold of 0.95 is more sensitive to bounding box accuracy improvements provided by content-aware sampling, while the looser 0.5 threshold already achieves near-saturation performance with CAFE alone.
FSA-IoU: The FSA-IoU loss function (+F) shows modest standalone improvements, with mAP@0.5 increasing by 0.5 percentage points and mAP@0.95 by 1.2 percentage points, validating its effectiveness for precise bounding box regression. With proper hyperparameter adjustment (reducing the affected component weight from 0.3 to 0.15 and extending the warmup to 5 epochs), FSA-IoU avoids the recall degradation observed in the original configuration while maintaining its localization accuracy improvements. When integrated with CAFE and DMS, FSA-IoU provides complementary localization benefits (+0.5 percentage points mAP@0.5) without inference overhead.
4.2. Comparison with State-of-the-Art Methods
To ensure the effectiveness and reliability of ACDNet for citrus detection, we compared our method against five state-of-the-art detectors including YOLOv5s [
30], YOLOv7-tiny [
31], RT-DETR [
32], and YOLOv8 variants [
33] across multiple evaluation metrics including accuracy, computational efficiency, and inference speed. The quantitative comparison results are presented in
Table 3.
ACDNet achieves the highest mAP@0.5 (97.5%) and mAP@0.95 (87.1%), outperforming all compared methods while maintaining the lowest computational cost (2.67 M parameters, 6.5 G FLOPs). Furthermore, ACDNet demonstrates superior training efficiency, requiring only 0.98 h and 2.3 G GPU memory for 300-epoch training, representing 15.5% time reduction and 11.5% memory savings compared to YOLOv8n baseline. Compared to the closest competitor YOLOv8s in terms of accuracy (96.4% mAP@0.5), ACDNet improves mAP@0.5 by 1.1 percentage points while requiring only 23.8% of the parameters (2.67 M vs. 11.20 M) and 22.7% of the FLOPs (6.5 G vs. 28.6 G). This demonstrates that task-specific architectural innovations can achieve better results through targeted design rather than capacity scaling.
YOLOv7-tiny achieves competitive accuracy (96.1% mAP@0.5) but requires more FLOPs than ACDNet (13.2 G vs. 6.5 G) and longer training time (1.95 h vs. 0.98 h). This efficiency gap becomes particularly critical in multi-camera harvesting systems where multiple detection streams must be processed simultaneously.
YOLOv5s, despite achieving 95.3% mAP@0.5, demonstrates 2.2 percentage point lower accuracy than ACDNet while requiring more parameters (9.12 M vs. 2.67 M), 3.69× more FLOPs (24.0 G vs. 6.5 G), and longer training time (3.28 h vs. 0.98 h). YOLOv5s produces substantially higher false positive rates in boundary regions compared to ACDNet, likely due to inadequate spatial context modeling near boundaries where fruits are partially cropped.
RTDETR-R18 demonstrates the poorest performance across all metrics (mAP@0.5: 80.1%, precision: 78.9%, recall: 78.2%). Despite utilizing a transformer-based architecture with 20.10 M parameters and 58.6 G FLOPs, RTDETR-R18 requires the longest training time (8.15 h) and highest GPU memory consumption (14.5 G), while achieving only 21.7 FPS inference speed. These results suggest that transformer architectures, while powerful for general object detection, may not be optimally suited for agricultural scenarios with limited computational resources and real-time performance requirements.
4.3. Robustness Analysis Under Different Conditions
We evaluated ACDNet’s performance across different occlusion levels to assess its handling of challenging scenarios common in orchard environments. The detailed results are shown in
Table 4.
The results demonstrate that ACDNet shows increasingly superior performance as occlusion severity increases, validating the effectiveness of our CAFE module and FSA-IoU loss function in handling challenging scenarios. Specifically, the 8.4 percentage point improvement in heavy occlusion conditions represents a 10.7% relative improvement over the baseline 78.3% mAP.
4.4. Performance Analysis by Fruit Size Category
To comprehensively evaluate ACDNet’s effectiveness across different scales, we categorized citrus fruits based on their bounding box sizes in the annotated dataset. Specifically, fruit size was measured by calculating the pixel area of each bounding box (width × height in pixels), which directly reflects the detection difficulty from a computer vision perspective. The size thresholds were determined using percentile-based analysis of the test set distribution: bounding boxes with areas below the 33rd percentile were classified as small, those between the 33rd and 67th percentiles as medium, and those above the 67th percentile as large. This classification approach ensures balanced representation across size categories while maintaining correspondence with actual fruit dimensions. The test set contains 288 images, with approximately 16% small fruits, 54% medium fruits, and 30% large fruits based on this statistical analysis. Detailed results are presented in
Table 5.
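The percentile-based size categorization described above can be reproduced with a few lines of NumPy; the helper below is an illustrative sketch with assumed names.

```python
import numpy as np

def size_categories(box_areas_px):
    """Assign small/medium/large labels using the 33rd and 67th percentiles of the
    bounding-box pixel-area distribution (Section 4.4)."""
    areas = np.asarray(box_areas_px, dtype=float)
    t_small, t_large = np.percentile(areas, [33, 67])
    labels = np.where(areas < t_small, "small",
                      np.where(areas <= t_large, "medium", "large"))
    return labels, (t_small, t_large)
```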
The results demonstrate that ACDNet achieves progressively larger improvements for smaller fruit sizes, validating our architectural innovation’s effectiveness for challenging small object detection. For small fruits, ACDNet improves recall by 6.7 percentage points and mAP@0.5 by 5.2 percentage points, directly addressing the recall degradation observed in YOLOv8n for small-scale targets mentioned in
Section 1. The improvement magnitude decreases systematically with fruit size (small: +5.2 pp in mAP@0.5, medium: +1.5 pp, large: +1.1 pp), confirming that CAFE’s multi-scale attention and DMS’s adaptive sampling provide the most substantial benefits precisely where baseline performance is weakest. Even for large fruits where baseline already achieves 97.6% mAP@0.5, ACDNet maintains consistent performance gains without degradation.
4.5. Visual Results Analysis
Figure 6 compares the detection results of the original YOLOv8n and the proposed ACDNet under various challenging scenarios representative of real-world orchard conditions.
The visual results demonstrate that ACDNet maintains strong detection capabilities under various interference factors. However, qualitative analysis of failure cases reveals remaining challenges for future work: ACDNet still struggles with extremely small fruits (<18 mm diameter, <1% of image area) and with fruits at extreme viewing angles (>75° from the camera normal), where foreshortening severely distorts appearance.
6. Conclusions
This paper presents ACDNet (Adaptive Citrus Detection Network), a novel deep learning framework specifically designed for citrus detection in complex orchard environments. The proposed method introduces three synergistic technical innovations: the Citrus-Adaptive Feature Extraction (CAFE) module combining fruit-aware partial convolution with illumination-adaptive attention, the Dynamic Multi-Scale Sampling (DMS) operator with content-aware offset generation, and the Fruit-Shape Aware IoU (FSA-IoU) loss function incorporating citrus morphological priors.
Extensive experimental validation on the CitrusSet dataset demonstrates ACDNet’s superiority over state-of-the-art methods, achieving significant improvements in detection accuracy while maintaining real-time performance suitable for robotic harvesting applications. Compared to baseline approaches, ACDNet demonstrates substantial enhancements across all evaluation metrics including precision, recall, and mean average precision at multiple IoU thresholds, while simultaneously reducing computational requirements in both model parameters and floating-point operations. This validates the effectiveness of domain-informed design in agricultural detection tasks.
The experimental results provide affirmative answers to all three research questions posed in the introduction. For RQ1, CAFE demonstrates that fruit-aware feature extraction with illumination adaptation successfully improves detection accuracy under variable lighting (substantial precision improvement under backlight conditions) while reducing computational overhead. For RQ2, DMS effectively addresses scale variation and occlusion challenges, as evidenced by substantial recall improvement for small fruits and significant mAP improvement under heavy occlusion when combined with CAFE. For RQ3, FSA-IoU enhances localization precision for ellipsoidal fruits, achieving notable improvement in localization accuracy through morphological priors without inference overhead. These findings validate our hypothesis that domain-specific adaptations at feature extraction, sampling, and optimization stages yield superior performance compared to generic object detection frameworks.
The comprehensive ablation studies with multiple independent runs validate the contribution of each proposed component, revealing strong synergistic effects. The CAFE module alone provides dominant performance gains while dramatically reducing computational overhead. The DMS operator demonstrates that content-aware sampling strategies effectively leverage enhanced feature representations when properly integrated with feature extraction improvements. The FSA-IoU loss function provides complementary localization benefits without introducing inference overhead, as loss function modifications only affect training dynamics. These findings underscore that component synergy matters more than individual component strength in complex detection systems.
The robustness analysis across varying occlusion levels demonstrates ACDNet’s increasingly superior performance as environmental complexity increases, with particularly strong improvements under heavy occlusion conditions. This robustness profile directly addresses the practical requirements of commercial harvesting operations where challenging conditions are inevitable and harvest completeness directly impacts revenue.
Future work will focus on four primary directions: extending the framework to handle extreme environmental conditions through illumination-adaptive preprocessing and dedicated training data; generalizing to multi-fruit scenarios through parameterized shape constraints and multi-domain training; optimizing for embedded deployment through quantization, operator fusion, and model distillation; and incorporating temporal consistency across video sequences through lightweight tracking and temporal aggregation. The success of ACDNet demonstrates that thoughtful integration of agricultural domain knowledge with modern deep learning architectures can yield practical solutions for automated harvesting—a paradigm applicable to broader agricultural automation challenges beyond citrus detection.