1. Introduction
Citrus (
Citrus reticulata ‘Unshiu’) production represents a critical component of global agricultural economies, with China’s output reaching 64.54 million tons in 2023—a 12.3% increase from 2020 [
1]. However, this sector faces substantial constraints from labor shortages and mechanization inefficiencies. Current harvesting operations remain predominantly manual, with harvesting labor accounting for 45–60% of total production expenses. Analysis of commercial citrus operations in Hunan Province reveals that manual harvesting achieves throughput rates of only 92–120 fruits per hour per worker, with seasonal labor availability declining by 23% annually since 2018. These economic and operational pressures necessitate the development of automated harvesting systems with robust computer vision capabilities.
The fundamental challenge in automated citrus detection stems from the complex interplay of environmental and morphological factors in orchard settings. Early approaches can be categorized into two paradigms: traditional image processing methods and modern deep learning techniques. Traditional image processing-based methods [
2] relied on hand-crafted features such as color thresholding in HSV space, texture descriptors, and circular Hough transforms for shape detection. While computationally efficient, these rule-based methods struggle with generalization across varying orchard conditions [
3]. Color-based segmentation methods, though widely adopted, suffer from significant accuracy degradation under varying illumination conditions, particularly in backlight scenarios common in orchard environments [
4]. Texture-based approaches exhibit elevated false positive rates in densely foliated regions where leaf patterns interfere with fruit detection [
5]. Shape-based detection methods assuming circular fruit morphology fail to identify elliptical citrus fruits with aspect ratios deviating from spherical assumptions [
6]. Furthermore, sliding window classification approaches, while thorough in coverage, require processing numerous candidate regions per image, resulting in inference times of several seconds per frame [
7]—prohibitively slow for real-time harvesting applications. In contrast, deep learning methods, particularly convolutional neural networks, automatically learn hierarchical feature representations from data, offering superior adaptability to complex environmental variations [
8].
The advent of convolutional neural networks [
9,
10] fundamentally transformed object detection paradigms through automatic learning of hierarchical feature representations. Recent advances in the YOLO series [
11] have demonstrated particular promise for agricultural applications, achieving favorable accuracy–efficiency trade-offs through anchor-free detection and efficient backbone architectures. However, direct application of YOLOv8n to citrus detection reveals three critical deficiencies:
(1) Scale Variation Complexity: Analysis of our CitrusSet dataset reveals that citrus fruits exhibit substantial size variation, ranging from small immature fruits to large mature specimens with significant scale differences. This multi-scale challenge is well-documented in agricultural object detection [
12], where scale discrepancy directly impacts detection performance. Studies have shown that standard feature pyramid networks remain insufficient for extreme scale variation in orchard environments [
13]. Baseline YOLOv8n demonstrates elevated miss detection rates for small fruits compared to medium-sized fruits, primarily due to insufficient feature resolution at shallow network layers and inadequate multi-scale fusion mechanisms.
(2) Occlusion Robustness Deficiency: Leaf occlusion analysis across our annotated images indicates that a majority of citrus fruits experience partial occlusion, with a substantial proportion suffering heavy occlusion where more than half of the fruit surface is obscured. This occlusion challenge is prevalent in real orchard environments and significantly impacts detection performance [
14]. Standard attention mechanisms in YOLOv8n utilize spatially uniform weighting, failing to prioritize partially visible fruit regions [
15]. Consequently, detection recall significantly degrades for heavily occluded instances—a performance gap that is unacceptable for commercial harvesting operations where harvest completeness directly impacts revenue [
16].
(3) Shape-Aware Localization Inadequacy: Oranges exhibit variable shapes ranging from spherical to oblong, with the ellipsoidal morphology of many citrus varieties creating misalignment with generic rectangular bounding box regression objectives in standard IoU-based loss functions. Studies on citrus fruit morphology indicate that orange shapes vary from globose to oval [
17]. Standard CIoU and DIoU [
18] losses do not account for these fruit-specific geometric characteristics, resulting in suboptimal localization precision for elliptical fruits, which is critical for yield prediction and harvest timing optimization.
To address these limitations while maintaining real-time performance suitable for robotic deployment, this paper proposes ACDNet (Adaptive Citrus Detection Network), which introduces three novel technical contributions. Specifically, this work investigates three research questions: (RQ1) Can fruit-aware feature extraction with illumination adaptation improve detection accuracy under variable lighting while maintaining computational efficiency? (RQ2) Can content-aware multi-scale sampling effectively address the scale variation and occlusion challenges inherent in orchard environments? (RQ3) Can incorporating morphological priors into the loss function enhance localization precision for ellipsoidal fruits? We hypothesize that integrating domain-specific adaptations at the feature extraction, sampling, and optimization stages will yield superior performance compared to generic object detection frameworks. The three technical contributions are as follows:
(i) Citrus-Adaptive Feature Extraction (CAFE) Module: Building upon the observation that citrus fruits occupy only a small portion of orchard image area while consuming disproportionate computational resources in standard convolution, CAFE combines fruit-aware partial convolution with illumination-adaptive attention. The fruit-aware mechanism identifies citrus-relevant channels through learnable weights derived from color distribution statistics in HSV space, achieving significant parameter reduction. The illumination-adaptive component analyzes brightness histograms and generates dynamic feature weights, demonstrating robust performance under backlight conditions compared to static attention mechanisms.
(ii) Dynamic Multi-Scale Sampling (DMS) Operator: DMS addresses scale variation and occlusion through content-aware offset generation. By predicting sampling point offsets based on local gradient distributions, DMS adaptively concentrates sampling on fruit boundaries while suppressing background foliage interference. Compared to fixed grid sampling in standard deformable convolution [
19], DMS reduces false positives in heavily foliated regions and improves recall for small fruits.
(iii) Fruit-Shape Aware IoU (FSA-IoU) Loss: Incorporating citrus morphological priors, FSA-IoU extends standard IoU by penalizing aspect ratio deviation and rewarding ellipse-fitting accuracy. Through weighted combination of IoU, aspect ratio consistency, and ellipse overlap metrics, FSA-IoU achieves improved localization precision for elliptical fruits while maintaining computational efficiency with no additional inference overhead.
The synergistic integration of these components achieves statistically significant performance improvements (p < 0.001, Cohen’s d > 1.2) while maintaining computational efficiency suitable for edge deployment on platforms such as NVIDIA Jetson Xavier NX (21 TOPS) (NVIDIA Corporation, Santa Clara, CA, USA).
2. Data Collection and Preprocessing
2.1. Data Acquisition Protocol
The data for this study were collected from citrus plantations in Huanglianchong Village, Gaocun Town, Mayang Miao Autonomous County, Huaihua City, Hunan Province (latitude 27°52′ N, longitude 109°48′ E, elevation 280–450 m). To ensure dataset representativeness across varying environmental conditions, data acquisition was conducted during three distinct time periods: morning (07:00–09:00, low-angle illumination), midday (11:00–13:00, overhead lighting), and afternoon (15:00–17:00, backlight conditions). An iPhone 13 Pro (Apple Inc., Cupertino, CA, USA) (12 MP wide camera, f/1.5 aperture, sensor-shift optical image stabilization) was used for data collection. This device was selected for (1) its reasonable cost (USD 800), significantly lower than specialized agricultural cameras (USD 3000–8000), making it more suitable for widespread adoption; (2) its excellent color accuracy, capturing true citrus fruit colors under various lighting conditions; and (3) its high-quality imaging, with the 12 MP sensor and f/1.5 aperture ensuring image clarity.
Shooting distances were determined following these guidelines: (1) minimum distance (0.5 m) ensures clear focus while maximizing fruit size in frame; (2) maximum distance (3.0 m) corresponds to typical robotic arm working range; (3) a measuring tape was used to measure and calibrate the distance from camera to fruit trees before shooting; (4) pre-marked shooting positions at 0.5 m, 1.0 m, 1.5 m, 2.0 m, 2.5 m, and 3.0 m were set up in the orchard, and operators followed these markers to ensure distance consistency; (5) this distance range covers typical observation distances in actual robotic harvesting operations.
The sampling protocol systematically accounted for critical factors affecting detection complexity: varying light intensities, leaf occlusion levels (categorized as light <30%, moderate 30–60%, heavy >60% based on visible fruit surface area), fruit overlap patterns (isolated, partial overlap, cluster configurations with 2–5 fruits), and positional distances (0.5–3.0 m from camera, corresponding to typical robotic arm working range), as illustrated in
Figure 1.
To provide detailed visualization of the occlusion categories used in our dataset annotation,
Figure 2 presents representative examples for each occlusion level with ground truth bounding box annotations.
2.2. Data Annotation and Quality Control
A total of 4370 citrus images were initially collected and saved as JPG files with a resolution of 1920 × 1080 pixels. Following rigorous quality control procedures, blurry images, duplicate images (SSIM similarity > 0.95), and other invalid samples (underexposed with mean luminance < 20, overexposed with saturated pixels > 15%) were systematically removed, resulting in a high-quality dataset of 2887 citrus images.
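To illustrate the filtering criteria above, the following minimal sketch combines the stated thresholds (SSIM > 0.95 for duplicates, mean luminance < 20 for underexposure, >15% saturated pixels for overexposure) with a standard variance-of-Laplacian blur check; the blur threshold, function name, and use of OpenCV/scikit-image are illustrative assumptions, since the paper does not specify its blur criterion or tooling.

```python
import cv2
from skimage.metrics import structural_similarity as ssim

def is_valid_sample(img_bgr, prev_gray=None,
                    blur_thresh=100.0, dup_thresh=0.95,
                    dark_thresh=20, sat_frac_thresh=0.15):
    """Quality-control filter mirroring the criteria in Section 2.2.

    Returns (keep_flag, grayscale image); prev_gray is the last accepted frame,
    used for SSIM-based duplicate rejection.
    """
    gray = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2GRAY)
    # Blur check via variance of the Laplacian (the 100.0 threshold is illustrative).
    if cv2.Laplacian(gray, cv2.CV_64F).var() < blur_thresh:
        return False, gray
    # Underexposed: mean luminance < 20.
    if gray.mean() < dark_thresh:
        return False, gray
    # Overexposed: more than 15% of pixels saturated.
    if (gray >= 250).mean() > sat_frac_thresh:
        return False, gray
    # Near-duplicate: SSIM > 0.95 against the previously accepted image.
    if prev_gray is not None and ssim(gray, prev_gray) > dup_thresh:
        return False, gray
    return True, gray
```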
All images were annotated by two experienced annotators using the LabelImg tool in YOLO format. The visible area annotation method is as follows: (1) annotators delineated the visible portions of fruits using polygon tools; (2) the software automatically calculated the area of visible portions (pixel count); (3) for partially occluded fruits, annotators estimated the total fruit area based on the shape and size of visible portions.
2.3. Dataset Splitting and Augmentation
The data processing pipeline follows the standard practice in deep learning-based object detection [
20]. The workflow consists of three sequential stages:
(1) Dataset Splitting: The 2887 annotated images were first randomly split into training (2021 images), validation (578 images), and test (288 images) sets with a ratio of 7:2:1. This split ratio is consistent with common practices in agricultural object detection and YOLO-based models [
21]. The splitting process ensured balanced distributions across subsets in terms of lighting conditions, occlusion levels, and fruit density.
(2) Data Augmentation: Data augmentation was applied only to the training set to improve model generalization and prevent overfitting.
After data augmentation, the training set was expanded to 7535 images.
(3) Validation and Test Sets: The validation and test sets remained in their original state without any augmentation to ensure objective and fair model performance evaluation [
22]. The validation set was used for hyperparameter tuning and early stopping during training, while the test set was reserved for final model performance assessment to prevent overfitting to validation data.
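For concreteness, the splitting stage described above can be sketched as follows. The helper is illustrative only: it performs a seeded random 7:2:1 split, while the stratification by lighting conditions, occlusion levels, and fruit density mentioned in the text is omitted, and the function name is our own.

```python
import random

def split_dataset(image_paths, ratios=(0.7, 0.2, 0.1), seed=0):
    """Random 7:2:1 split into training, validation, and test subsets (Section 2.3)."""
    paths = sorted(image_paths)
    random.Random(seed).shuffle(paths)
    n_train = round(len(paths) * ratios[0])
    n_val = round(len(paths) * ratios[1])
    train = paths[:n_train]
    val = paths[n_train:n_train + n_val]
    test = paths[n_train + n_val:]
    return train, val, test

# Augmentation (mosaic, color jitter, random crop; see Section 3.5) is then applied
# to `train` only, expanding it from 2021 to 7535 images; `val` and `test` are untouched.
```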
3. ACDNet Architecture and Technical Innovations
3.1. Network Architecture Overview
The proposed ACDNet adopts an encoder–decoder architecture based on YOLOv8n with three novel components specifically designed for citrus detection challenges. The overall framework integrates our technical innovations while maintaining the efficiency advantages of the original architecture, as illustrated in
Figure 3.
3.2. Citrus-Adaptive Feature Extraction (CAFE) Module
3.2.1. Motivation and Technical Challenge Analysis
Mobile deployment in agricultural robotics demands efficient architectures that maintain detection accuracy. Traditional convolution operations [
23] apply uniform computational effort across all spatial locations, which is inefficient in agricultural scenes where fruits occupy only small portions (5–15%) of the image area while background vegetation consumes the bulk of the computation. Moreover, standard attention mechanisms [
24,
25,
26] fail to capture the unique visual characteristics of citrus fruits under the varying illumination conditions commonly found in orchard environments.
3.2.2. CAFE Module Design and Implementation
To address the dual challenges of computational inefficiency and illumination sensitivity, we propose the CAFE (Citrus-Adaptive Feature Extraction) module. This module combines fruit-aware partial convolution with illumination-adaptive attention mechanisms, as illustrated in
Figure 4.
Fruit-Aware Partial Convolution (FPConv): Unlike traditional partial convolution that applies convolution to arbitrary channel subsets, our FPConv identifies fruit-relevant channels using learnable channel importance weights derived from citrus color distribution statistics in HSV color space. The computational complexity of FPConv is as follows:
FLOPs_FPConv = h × w × k² × c_f²
where h and w denote spatial dimensions, k represents the convolution kernel size, and c_f represents the number of dynamically selected fruit-relevant channels. Compared to standard convolution, FPConv achieves a 4× reduction in computational cost.
Illumination-Adaptive Attention (IAA): To handle varying illumination conditions in orchard environments, we introduce an illumination-adaptive attention mechanism in which a dual-pathway branch analyzes brightness statistics and produces an attention map through a sigmoid activation σ, which is then applied to the features by element-wise multiplication ⊗; this dual-pathway design enables robust feature weighting across lighting conditions.
Multi-Scale Attention Integration: We integrate an Efficient Multi-Scale Attention (EMA) mechanism [
27] within the CAFE module. The EMA component generates spatial attention maps through collaborative use of 3 × 3 and 1 × 1 convolution branches, effectively capturing multi-scale spatial information while maintaining computational efficiency.
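To make the CAFE design concrete, the following PyTorch sketch combines a partial convolution over the top-ranked fruit-relevant channels with a brightness-histogram attention pathway. The class name `CAFE`, the layer sizes, the histogram bin count, and the assumption that inputs are normalized to [0, 1] are illustrative choices, not the authors' implementation.

```python
import torch
import torch.nn as nn

class CAFE(nn.Module):
    """Illustrative CAFE block: fruit-aware partial convolution plus
    illumination-adaptive attention (a sketch, not the paper's exact code)."""

    def __init__(self, channels, fruit_ratio=0.5, hist_bins=16):
        super().__init__()
        self.n_sel = max(1, int(channels * fruit_ratio))       # c_f: fruit-relevant channels
        self.channel_importance = nn.Parameter(torch.ones(channels))
        self.pconv = nn.Conv2d(self.n_sel, self.n_sel, 3, padding=1)
        self.illum_fc = nn.Sequential(                          # brightness-histogram pathway
            nn.Linear(hist_bins, channels),
            nn.ReLU(inplace=True),
            nn.Linear(channels, channels),
        )
        self.hist_bins = hist_bins

    def forward(self, x):                                       # x: (B, C, H, W), values in [0, 1]
        b, c, h, w = x.shape
        # Fruit-aware partial convolution: convolve only the top-c_f channels,
        # gated by their learned importance so the weights remain trainable.
        idx = torch.topk(self.channel_importance, self.n_sel).indices
        gate = torch.sigmoid(self.channel_importance[idx]).view(1, -1, 1, 1)
        out = x.clone()
        out[:, idx] = self.pconv(x[:, idx]) * gate
        # Illumination-adaptive attention: per-image brightness histogram -> channel weights.
        luma = x.mean(dim=1).detach()                           # histogram treated as a statistic
        hist = torch.stack([torch.histc(luma[i], bins=self.hist_bins, min=0.0, max=1.0)
                            for i in range(b)]) / (h * w)
        attn = torch.sigmoid(self.illum_fc(hist)).view(b, c, 1, 1)
        return out * attn
```

A typical call such as `CAFE(64)(torch.rand(2, 64, 80, 80))` returns a tensor of the same shape, with only half of the channels passed through the 3 × 3 convolution.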
3.3. Dynamic Multi-Scale Sampling (DMS) Operator
Standard upsampling operators [
19] treat all spatial locations equally, leading to inefficient computation in agricultural scenes. To achieve more efficient upsampling while focusing computational resources on fruit regions, we propose the Dynamic Multi-Scale Sampling (DMS) operator, as detailed in
Figure 5.
The DMS operator introduces content-aware sampling point generation and fruit-focused interpolation strategies, implemented in the three stages described below.
Fruit Probability Prediction: A lightweight sub-network predicts a pixel-wise fruit existence probability map P_fruit, which guides sampling point generation.
Content-Aware Offset Generation: Dynamic offsets are generated from the fruit probability map at upsampling factor g and spatial stride s, prioritizing sampling density in high-confidence fruit regions. The offset generation mechanism allocates 2.8× higher sampling point density in fruit regions (P_fruit > 0.7) than in background regions (P_fruit < 0.3), as validated through sampling point visualization analysis.
Fruit-Focused Interpolation: The resampling process prioritizes fruit regions while adaptively suppressing background areas. Sampling locations are drawn from a fruit-focused sampling set constructed from the regular grid G, the learned offsets O, and a fruit-aware weighting derived from P_fruit. This adaptive mechanism reduces background noise propagation by 42% (measured by signal-to-noise ratio in feature maps) compared to standard bilinear upsampling.
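The DMS pipeline can be sketched with standard PyTorch primitives. In the sketch below, the probability and offset heads are minimal stand-ins for the paper's lightweight sub-networks, and the tanh-based offset scaling is an illustrative choice rather than the published formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DMS(nn.Module):
    """Sketch of Dynamic Multi-Scale Sampling as a content-aware upsampler."""

    def __init__(self, channels, scale=2):
        super().__init__()
        self.scale = scale
        self.prob_head = nn.Conv2d(channels, 1, kernel_size=1)             # P_fruit predictor
        self.offset_head = nn.Conv2d(channels, 2 * scale * scale, 3, padding=1)

    def forward(self, x):                                                   # x: (B, C, h, w)
        b, c, h, w = x.shape
        g = self.scale
        H, W = g * h, g * w
        p_fruit = torch.sigmoid(self.prob_head(x))                          # (B, 1, h, w)
        # Offsets modulated by P_fruit, so sampling concentrates on fruit regions.
        off = torch.tanh(self.offset_head(x)) * p_fruit                     # (B, 2*g*g, h, w)
        off = F.pixel_shuffle(off, g).permute(0, 2, 3, 1)                   # (B, H, W, 2)
        # Regular grid in normalized [-1, 1] coordinates plus small learned offsets.
        ys, xs = torch.meshgrid(torch.linspace(-1, 1, H, device=x.device),
                                torch.linspace(-1, 1, W, device=x.device), indexing="ij")
        grid = torch.stack((xs, ys), dim=-1).unsqueeze(0).expand(b, H, W, 2)
        grid = grid + off / torch.tensor([W, H], dtype=x.dtype, device=x.device)
        return F.grid_sample(x, grid, mode="bilinear", align_corners=True)
```

Applied to a (B, C, h, w) feature map with scale 2, the module returns a (B, C, 2h, 2w) map resampled at content-aware locations instead of a fixed bilinear grid.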
3.4. Fruit-Shape Aware IoU (FSA-IoU) Loss Function
The standard CIoU loss function in YOLOv8 applies isotropic penalties that overlook domain-specific characteristics critical for citrus detection: the consistent ellipsoidal morphology of fruits (height-to-width ratio 0.95 ± 0.08), the directional sensitivity of localization errors for elliptical objects, and the boundary ambiguity caused by leaf occlusion affecting 72% of fruits in our dataset. To address these limitations, we propose FSA-IoU, which integrates citrus-specific geometric priors and adaptive difficulty weighting into a unified optimization objective:
L_FSA-IoU = L_IoU + λ1 L_shape + λ2 L_ratio + λ3 L_adapt
where L_IoU quantifies bounding box overlap, and λ1, λ2, λ3 are weights determined through grid search on the validation set.
The shape-weighted center regression term L_shape penalizes localization errors asymmetrically based on fruit orientation. We define directional weights w_x and w_y, which satisfy a normalization constraint and assign a higher penalty to deviations along the dominant axis. The shape loss is then computed as follows:
L_shape = [w_x (x − x_gt)² + w_y (y − y_gt)²] / c²
where (x, y) and (x_gt, y_gt) denote the predicted and ground truth box centers, and c is the diagonal of the smallest enclosing box for scale normalization.
The aspect ratio regularization term incorporates botanical knowledge through a soft constraint:
L_ratio = (ln((h_p / w_p) / r_citrus))²
where h_p and w_p are the predicted box height and width, and r_citrus is the characteristic height-to-width ratio derived from training set statistics. The logarithmic formulation ensures scale invariance and symmetric treatment of over- and underestimation.
The adaptive weighting term L_adapt implements hard example mining by amplifying the loss for difficult samples through an occlusion-aware modulation factor computed from two normalized prediction discrepancies between the predicted and ground truth boxes. For accurate predictions (small discrepancies), the factor adds minimal penalty; for poor predictions, typically associated with heavy occlusion, it increases substantially, providing stronger learning signals. The factor thus implements self-paced learning that concentrates optimization effort on challenging cases.
The hyperparameters were determined systematically: the component weights λ1, λ2, and λ3 through grid search on the validation set, selecting the configuration with the highest validation mAP@0.5 and stable training; the characteristic aspect ratio r_citrus directly from training set statistics; and the adaptive modulation parameters based on gradient stability analysis. The relative magnitudes reflect the importance hierarchy: center regression as primary guidance, aspect ratio as complementary constraint, and adaptive weighting for fine-tuning. Notably, FSA-IoU introduces zero inference overhead since loss computation occurs only during training.
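A hedged sketch of the composite loss is given below. The individual terms are illustrative reconstructions consistent with the descriptions above: the directional weights, the exact adaptive term, and the function name `fsa_iou_loss` are demonstration choices, not the published implementation, while the component weights follow the values listed in Section 3.5.

```python
import torch

def fsa_iou_loss(pred, gt, r_citrus=0.95, lam=(0.5, 0.3, 0.2), eps=1e-7):
    """Sketch of FSA-IoU; boxes are (cx, cy, w, h). Component forms are
    illustrative reconstructions of Section 3.4, not the exact implementation."""
    px, py, pw, ph = pred.unbind(-1)
    gx, gy, gw, gh = gt.unbind(-1)

    # Plain IoU term.
    iw = (torch.min(px + pw / 2, gx + gw / 2) - torch.max(px - pw / 2, gx - gw / 2)).clamp(min=0)
    ih = (torch.min(py + ph / 2, gy + gh / 2) - torch.max(py - ph / 2, gy - gh / 2)).clamp(min=0)
    inter = iw * ih
    iou = inter / (pw * ph + gw * gh - inter + eps)

    # Shape-weighted center regression, normalized by the enclosing-box diagonal.
    cw = torch.max(px + pw / 2, gx + gw / 2) - torch.min(px - pw / 2, gx - gw / 2)
    ch = torch.max(py + ph / 2, gy + gh / 2) - torch.min(py - ph / 2, gy - gh / 2)
    c2 = cw ** 2 + ch ** 2 + eps
    w_x, w_y = 0.9, 1.1                      # illustrative directional weights
    l_shape = (w_x * (px - gx) ** 2 + w_y * (py - gy) ** 2) / c2

    # Aspect-ratio regularization toward the citrus prior r_citrus (h/w ≈ 0.95).
    l_ratio = torch.log((ph + eps) / (pw + eps) / r_citrus) ** 2

    # Stand-in for the occlusion-aware adaptive weighting: poor boxes get more penalty.
    l_adapt = (1.0 - iou) * (l_shape.detach() + l_ratio.detach())

    return 1.0 - iou + lam[0] * l_shape + lam[1] * l_ratio + lam[2] * l_adapt
```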
3.5. Training Configuration and Hyperparameters
The experimental setup utilized an Intel Xeon Platinum 8352V CPU (Intel Corporation, Santa Clara, CA, USA) (2.1 GHz, 36 cores), an NVIDIA GeForce RTX 4090 GPU (NVIDIA Corporation, Santa Clara, CA, USA) (24 GB GDDR6X memory), 120 GB DDR4-3200 RAM, and Ubuntu 22.04 LTS operating system. The program was developed using PyTorch 2.0.0 based on Python 3.9, with CUDA 11.8 and cuDNN 8.6.0 for GPU acceleration.
During model training, the SGD optimizer was employed to iteratively process samples from the dataset and adjust model parameters to minimize the loss function. The training process ran for 300 epochs with a batch size of 32 (constrained by GPU memory to maintain 78% utilization without overflow). The initial learning rate and weight decay were set to 0.001 and 0.0005, respectively, based on preliminary grid search experiments. AMP (Automatic Mixed Precision) was disabled to avoid potential precision loss during training, prioritizing numerical stability over training speed.
Comprehensive Hyperparameter Configuration:
- Optimization: SGD with momentum 0.937, Nesterov acceleration enabled
- Learning rate schedule: Cosine annealing from 0.001 to 1 × 10⁻⁵
- Weight decay: 0.0005 (L2 regularization)
- Warmup strategy: Linear warmup for 3 epochs (0 → 0.001)
- Loss function weights: λ1 = 0.5, λ2 = 0.3, λ3 = 0.2 (FSA-IoU components)
- Data augmentation: Mosaic (p = 1.0), color jitter (p = 0.5), random crop (p = 0.3)
- Gradient clipping: Max norm = 10.0 to prevent exploding gradients
- Early stopping: Patience = 50 epochs (monitoring validation mAP@0.95)
To ensure reproducibility, the random seed was set to 0 for all random number generators (Python 3.9, NumPy 1.21.0, PyTorch 2.0.0, CUDA 11.8). All experiments were conducted with deterministic algorithms enabled, which slightly reduced training speed (approximately 8% slower) but guarantees bit-exact reproducibility across runs.
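The seeding and determinism settings described above correspond to standard PyTorch switches, sketched here as a small helper (the function name is ours; the CUBLAS workspace variable is required by some deterministic CUDA kernels).

```python
import os
import random
import numpy as np
import torch

def set_reproducible(seed=0):
    """Seed every RNG and enable deterministic kernels, as described in Section 3.5."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"  # needed by some deterministic CUDA ops
    torch.use_deterministic_algorithms(True)            # bit-exact runs, ~8% slower training
    torch.backends.cudnn.benchmark = False
```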
3.6. Performance Evaluation Metrics
The model is evaluated using commonly used performance metrics [
28,
29] in object detection tasks:
- mAP@0.5: mean Average Precision at IoU threshold 0.5.
- mAP@0.95: mean Average Precision at IoU threshold 0.95.
- Precision: Ratio of correctly detected fruits to total detections, computed as TP/(TP + FP).
- Recall: Ratio of correctly detected fruits to total ground truth fruits, computed as TP/(TP + FN).
- FPS: Inference speed in frames per second.
- Parameters: Total number of trainable parameters (millions).
- FLOPs: Floating-point operations (billions) for processing a single 640 × 640 input image.
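For reference, precision and recall follow directly from the matched detection counts, as in this short helper with a worked example (the counts shown are illustrative, not results from the paper).

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from matched detection counts (a detection matching a
    ground truth box at IoU >= threshold counts as a true positive)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Example: 93 correct detections, 6 false alarms, 4 missed fruits
# -> precision = 93 / 99 ≈ 0.939, recall = 93 / 97 ≈ 0.959
```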
4. Results and Analysis
4.1. Ablation Study
To rigorously assess the contribution of each component, we conducted comprehensive ablation experiments with 10 independent runs using different random seeds (0–9) to establish statistical robustness.
Table 1 presents the detailed results, while
Table 2 provides paired
t-test analysis quantifying statistical significance.
Comprehensive Analysis of Component Contributions:
CAFE Module: The CAFE module alone (+C) demonstrates substantial improvements across all metrics, increasing precision by 2.7 percentage points (from 88.7% to 91.4%), with high statistical significance. More remarkably, CAFE achieves these gains while dramatically reducing parameters by 23% (from 3.01 M to 2.31 M) and FLOPs by 19.5% (from 8.2 G to 6.6 G). When contextualized within practical deployment constraints, this parameter reduction translates to approximately 2.8 MB lower model size, which is critical for embedded systems.
DMS Operator: Contrary to initial hypotheses, the standalone DMS operator (+D) demonstrates marginal improvements in mAP@0.5 while actually decreasing recall by 2.1 percentage points compared to baseline. Error analysis indicates that DMS alone produces notably higher false negative rates in densely foliated regions, likely due to sampling point collapse toward high-confidence background regions early in training when feature representations remain weak.
However, this apparent limitation transforms into a strength when DMS is combined with CAFE. The synergistic combination achieves 92.5% recall—a remarkable 1.8 percentage point improvement over CAFE alone. This synergy demonstrates that DMS’s dynamic sampling strategy effectively leverages the enhanced feature representations provided by CAFE, focusing computational resources on fruit-relevant regions after CAFE has improved feature discrimination. Notably, the statistical analysis in
Table 2 reveals an interesting pattern: adding DMS to CAFE (C+D vs. +C) yields negligible change in mAP@0.5 (−0.1%,
p = 0.156) but significant improvement in mAP@0.95 (+1.1%,
p = 0.002). This differential effect indicates that DMS primarily enhances localization precision rather than detection recall; the stricter IoU threshold of 0.95 is more sensitive to bounding box accuracy improvements provided by content-aware sampling, while the looser 0.5 threshold already achieves near-saturation performance with CAFE alone.
FSA-IoU: The FSA-IoU loss function (+F) shows modest standalone improvements, with mAP@0.5 increasing by 0.5 percentage points and mAP@0.95 by 1.2 percentage points, validating its effectiveness for precise bounding box regression. With proper hyperparameter adjustment (reducing the affected component weight from 0.3 to 0.15 and extending the warmup to 5 epochs), FSA-IoU avoids the recall degradation observed in the original configuration while maintaining its localization accuracy improvements. When integrated with CAFE and DMS, FSA-IoU provides complementary localization benefits (+0.5 percentage points mAP@0.5) without inference overhead.
4.2. Comparison with State-of-the-Art Methods
To ensure the effectiveness and reliability of ACDNet for citrus detection, we compared our method against five state-of-the-art detectors including YOLOv5s [
30], YOLOv7-tiny [
31], RT-DETR [
32], and YOLOv8 variants [
33] across multiple evaluation metrics including accuracy, computational efficiency, and inference speed. The quantitative comparison results are presented in
Table 3.
ACDNet achieves the highest mAP@0.5 (97.5%) and mAP@0.95 (87.1%), outperforming all compared methods while maintaining the lowest computational cost (2.67 M parameters, 6.5 G FLOPs). Furthermore, ACDNet demonstrates superior training efficiency, requiring only 0.98 h and 2.3 G GPU memory for 300-epoch training, representing 15.5% time reduction and 11.5% memory savings compared to YOLOv8n baseline. Compared to the closest competitor YOLOv8s in terms of accuracy (96.4% mAP@0.5), ACDNet improves mAP@0.5 by 1.1 percentage points while requiring only 23.8% of the parameters (2.67 M vs. 11.20 M) and 22.7% of the FLOPs (6.5 G vs. 28.6 G). This demonstrates that task-specific architectural innovations can achieve better results through targeted design rather than capacity scaling.
YOLOv7-tiny achieves competitive accuracy (96.1% mAP@0.5) but requires more FLOPs than ACDNet (13.2 G vs. 6.5 G) and longer training time (1.95 h vs. 0.98 h). This efficiency gap becomes particularly critical in multi-camera harvesting systems where multiple detection streams must be processed simultaneously.
YOLOv5s, despite achieving 95.3% mAP@0.5, demonstrates 2.2 percentage point lower accuracy than ACDNet while requiring more parameters (9.12 M vs. 2.67 M), 3.69× more FLOPs (24.0 G vs. 6.5 G), and longer training time (3.28 h vs. 0.98 h). YOLOv5s produces substantially higher false positive rates in boundary regions compared to ACDNet, likely due to inadequate spatial context modeling near boundaries where fruits are partially cropped.
RTDETR-R18 demonstrates the poorest performance across all metrics (mAP@0.5: 80.1%, precision: 78.9%, recall: 78.2%). Despite utilizing a transformer-based architecture with 20.10 M parameters and 58.6 G FLOPs, RTDETR-R18 requires the longest training time (8.15 h) and highest GPU memory consumption (14.5 G), while achieving only 21.7 FPS inference speed. These results suggest that transformer architectures, while powerful for general object detection, may not be optimally suited for agricultural scenarios with limited computational resources and real-time performance requirements.
4.3. Robustness Analysis Under Different Conditions
We evaluated ACDNet’s performance across different occlusion levels to assess its handling of challenging scenarios common in orchard environments. The detailed results are shown in
Table 4.
The results demonstrate that ACDNet shows increasingly superior performance as occlusion severity increases, validating the effectiveness of our CAFE module and FSA-IoU loss function in handling challenging scenarios. Specifically, the 8.4 percentage point improvement in heavy occlusion conditions represents a 10.7% relative improvement over the baseline 78.3% mAP.
4.4. Performance Analysis by Fruit Size Category
To comprehensively evaluate ACDNet’s effectiveness across different scales, we categorized citrus fruits based on their bounding box sizes in the annotated dataset. Specifically, fruit size was measured by calculating the pixel area of each bounding box (width × height in pixels), which directly reflects the detection difficulty from a computer vision perspective. The size thresholds were determined using percentile-based analysis of the test set distribution: bounding boxes with areas below the 33rd percentile were classified as small, those between the 33rd and 67th percentiles as medium, and those above the 67th percentile as large. This classification approach ensures balanced representation across size categories while maintaining correspondence with actual fruit dimensions. The test set contains 288 images, with approximately 16% small fruits, 54% medium fruits, and 30% large fruits based on this statistical analysis. Detailed results are presented in
Table 5.
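The percentile-based size categorization described above can be reproduced with a few lines of NumPy; the helper below is an illustrative sketch with assumed names.

```python
import numpy as np

def size_categories(box_areas_px):
    """Assign small/medium/large labels using the 33rd and 67th percentiles of the
    bounding-box pixel-area distribution (Section 4.4)."""
    areas = np.asarray(box_areas_px, dtype=float)
    t_small, t_large = np.percentile(areas, [33, 67])
    labels = np.where(areas < t_small, "small",
                      np.where(areas <= t_large, "medium", "large"))
    return labels, (t_small, t_large)
```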
The results demonstrate that ACDNet achieves progressively larger improvements for smaller fruit sizes, validating our architectural innovation’s effectiveness for challenging small object detection. For small fruits, ACDNet improves recall by 6.7 percentage points and mAP@0.5 by 5.2 percentage points, directly addressing the recall degradation observed in YOLOv8n for small-scale targets mentioned in
Section 1. The improvement magnitude decreases systematically with fruit size (small: +5.2 pp in mAP@0.5, medium: +1.5 pp, large: +1.1 pp), confirming that CAFE’s multi-scale attention and DMS’s adaptive sampling provide the most substantial benefits precisely where baseline performance is weakest. Even for large fruits where baseline already achieves 97.6% mAP@0.5, ACDNet maintains consistent performance gains without degradation.
4.5. Visual Results Analysis
Figure 6 compares the detection results of the original YOLOv8n and the proposed ACDNet under various challenging scenarios representative of real-world orchard conditions.
The visual results demonstrate that ACDNet maintains strong detection capabilities under various interference factors. However, qualitative analysis of failure cases reveals remaining challenges for future work: ACDNet still struggles with extremely small fruits (<18 mm diameter, <1% of image area) and with fruits at extreme viewing angles (>75° from the camera normal), where foreshortening severely distorts appearance.
6. Conclusions
This paper presents ACDNet (Adaptive Citrus Detection Network), a novel deep learning framework specifically designed for citrus detection in complex orchard environments. The proposed method introduces three synergistic technical innovations: the Citrus-Adaptive Feature Extraction (CAFE) module combining fruit-aware partial convolution with illumination-adaptive attention, the Dynamic Multi-Scale Sampling (DMS) operator with content-aware offset generation, and the Fruit-Shape Aware IoU (FSA-IoU) loss function incorporating citrus morphological priors.
Extensive experimental validation on the CitrusSet dataset demonstrates ACDNet’s superiority over state-of-the-art methods, achieving significant improvements in detection accuracy while maintaining real-time performance suitable for robotic harvesting applications. Compared to baseline approaches, ACDNet demonstrates substantial enhancements across all evaluation metrics including precision, recall, and mean average precision at multiple IoU thresholds, while simultaneously reducing computational requirements in both model parameters and floating-point operations. This validates the effectiveness of domain-informed design in agricultural detection tasks.
The experimental results provide affirmative answers to all three research questions posed in the introduction. For RQ1, CAFE demonstrates that fruit-aware feature extraction with illumination adaptation successfully improves detection accuracy under variable lighting (substantial precision improvement under backlight conditions) while reducing computational overhead. For RQ2, DMS effectively addresses scale variation and occlusion challenges, as evidenced by substantial recall improvement for small fruits and significant mAP improvement under heavy occlusion when combined with CAFE. For RQ3, FSA-IoU enhances localization precision for ellipsoidal fruits, achieving notable improvement in localization accuracy through morphological priors without inference overhead. These findings validate our hypothesis that domain-specific adaptations at feature extraction, sampling, and optimization stages yield superior performance compared to generic object detection frameworks.
The comprehensive ablation studies with multiple independent runs validate the contribution of each proposed component, revealing strong synergistic effects. The CAFE module alone provides dominant performance gains while dramatically reducing computational overhead. The DMS operator demonstrates that content-aware sampling strategies effectively leverage enhanced feature representations when properly integrated with feature extraction improvements. The FSA-IoU loss function provides complementary localization benefits without introducing inference overhead, as loss function modifications only affect training dynamics. These findings underscore that component synergy matters more than individual component strength in complex detection systems.
The robustness analysis across varying occlusion levels demonstrates ACDNet’s increasingly superior performance as environmental complexity increases, with particularly strong improvements under heavy occlusion conditions. This robustness profile directly addresses the practical requirements of commercial harvesting operations where challenging conditions are inevitable and harvest completeness directly impacts revenue.
Future work will focus on four primary directions: extending the framework to handle extreme environmental conditions through illumination-adaptive preprocessing and dedicated training data; generalizing to multi-fruit scenarios through parameterized shape constraints and multi-domain training; optimizing for embedded deployment through quantization, operator fusion, and model distillation; and incorporating temporal consistency across video sequences through lightweight tracking and temporal aggregation. The success of ACDNet demonstrates that thoughtful integration of agricultural domain knowledge with modern deep learning architectures can yield practical solutions for automated harvesting—a paradigm applicable to broader agricultural automation challenges beyond citrus detection.