1. Introduction
As a globally important economic crop, the apple has dominated the world in terms of planting scale and production [
1]. However, traditional picking operations are highly dependent on manual labor and face problems such as labor shortages, inefficiency, and a lack of picking and grading precision. In complex orchard environments especially, factors such as occlusion of apples, light variation, and variety diversity (e.g., the color difference between Red Fuji and Golden Marshal) pose serious challenges to the visual recognition capabilities of automated equipment. The development of modern agriculture pursues not only improved production efficiency but also pays increasing attention to its far-reaching impact on the ecological environment. This macro-level consideration of agricultural sustainability places higher demands on precision agriculture technology and automated fruit picking, namely how to improve operational efficiency while also contributing to broader agro-ecological benefits [
2]. Therefore, the development of a high-precision, lightweight, adaptable vision inspection system has become a core breakthrough in promoting the landing of apple-picking robots.
Early applications of visual inspection for agricultural fruits relied on manually extracted features based on the morphological, chromatic, and textural characteristics of the object, followed by machine learning algorithms, such as Support Vector Machines, for image-level classification. Such classical machine learning pipelines have obvious limitations: the handcrafted features are rarely reusable when fruit categories change or the orchard environment differs, and, under complex backgrounds and lighting, they generalize poorly. More recently, deep-learning feature extraction methods have been adopted for fruit-detection applications. Deep learning automates feature extraction from complex images through data-driven learning, enabling simultaneous classification and localization. Unlike conventional approaches, deep learning autonomously learns discriminative features from large-scale training data, enabling more efficient and robust detection.
In addition to visual object detection, modern agricultural sensing technologies—such as 3D imaging for plant phenotyping and spectral analysis for early stress detection—have enhanced environmental perception and plant monitoring [
3,
4]. For example, He et al. [
5] explored the possibility of using polarization information to infer plant growth states by studying the multispectral polarized bidirectional reflectance properties of plant canopies. However, for practical automation tasks like robotic harvesting, robust and efficient fruit detection remains essential. Especially in dynamic orchard environments, real-time systems must balance detection accuracy with computational efficiency. High precision is critical for tasks such as grasping and yield estimation, yet achieving it often increases model complexity. Conversely, optimizing for speed and resource use may reduce detection performance. Designing models that strike this accuracy-efficiency trade-off is thus a central challenge in agricultural computer vision. Xia et al. [
6] proposed a physically informed neural network-based approach aimed at improving the potential of deep learning models’ generalization capabilities in complex environments. Huang et al. [
7] proposed a method based on adaptive denoising and a texture decomposition attention mechanism, which exhibits strong robustness under strong light interference. Kang and Chen [
8] developed a multi-task deep neural network (DaSNet-v2) capable of simultaneously processing fruit and branch segmentation to enhance the visual perception of a robot in an orchard environment. Wang et al. [
9,
10,
11] proposed an adversarial training method based on semi-supervised variational autoencoders, which significantly improved classification accuracy on a complex dataset. In the field of camera-based 2D image detection, two dominant branches have emerged: computationally efficient single-stage networks (the YOLO family) and region-proposal-based two-stage systems. Among them, YOLO (You Only Look Once) has become the mainstream framework for agricultural robot vision systems, owing to its unified design that maintains detection accuracy while meeting real-time requirements. Earlier studies based on models such as YOLOv5 and YOLOv7 improved apple detection in orchard environments through multi-scale feature fusion. However, these models still face the following challenges: difficulty in detecting long-distance or occluded targets; redundant model parameters and high computational overheads; and insufficient environmental adaptability.
In recent years, apple target detection research has focused on two core challenges: adaptability to complex orchard environments and lightweight model design. In terms of lightweight design, Wang et al. [
12] developed a channel-pruned YOLOv5s variant that achieves 92.7% parameter reduction (retaining only 7.3% of original parameters) while preserving an 87.6% recall rate, with the compressed model size reduced to 1.4 MB. Ma et al. [
13] improved YOLOv7-tiny through the fusion of the ULSAM attention mechanism and P2BiFPN features, raising the small-target apple detection mAP to 80.4% at an inference speed of 58.6 FPS. For improved occlusion resistance, Chen et al. [
14] added a new 160 × 160 feature layer and integrated the CBAM attention mechanism into YOLOv7 to address the feature confusion caused by branch and leaf occlusion, resulting in a ripeness detection mAP of 87.1% (a 4.3% improvement over the baseline). Zhang et al. [
15] enhanced YOLOv5’s occlusion-handling capability by substituting the original GIoU loss with a CIoU loss function specifically designed for occluded targets, thereby improving detection accuracy in occlusion scenarios. Gao et al. [
16] developed SRN-YOLO as an enhanced version of YOLOv7, incorporating three key architectural improvements: (1) a specially designed SResNet module to preserve fine-grained gradient information effectively, (2) a recursive feature pyramid network (RFPN) structure to maintain feature integrity during multi-scale fusion, and (3) a novel NWD-CIoU loss function for precise bounding-box regression. This comprehensive optimization framework achieves 81.2 mAP and 71.6 mAP on two benchmark datasets, respectively. M et al. [
17] proposed YOLOv4-NLAM-CBAM to enhance region-of-interest perception by introducing the NLAM and CBAM attention mechanisms, reaching 97.2% mAP and a 91.2% F1 score. Zhang et al. [
18] enhanced the YOLOv4 model’s recognition of small objects through its hybrid GhostNet-Attention feature extractor and DWSConv-based head design. The proposed modifications yielded a performance with a 95.72% mAP score. Yan et al. [
19] embedded the SE visual attention module after the C2f backbone components to improve target-specific feature representation and introduced the Dynamic Snake convolution layer into the Neck structure to strengthen the feature-capturing ability for irregular dendritic structures; the apple-recognition precision reached 99.6%, the recall 96.8%, and the mean average precision (mAP) 98.3%, all significantly better than baseline models such as YOLOv8n and YOLOv5s in the apple detection task.
Although the above improvements have raised the detection performance of the YOLO series, limitations remain for real-time detection across wide orchard areas. For example, the increased model size and computational overhead of improved YOLOv8 variants make efficient deployment on edge devices difficult; YOLOv5 and YOLOv7 may perform poorly under complex lighting conditions such as overexposure, backlighting, and nighttime fields of view because they are trained mainly on sunny-weather image data; and YOLOv8, despite the introduction of Squeeze-and-Excitation attention mechanisms, still shows constrained effectiveness against intricate branch occlusions and suboptimal performance in complex lighting, ultimately reducing its practical applicability in real-world orchard environments.
In summary, current research on apple-target detection for automated agricultural picking still faces challenges such as low detection accuracy for small targets and for scenarios with complex lighting and occlusion; moreover, improved detection models often involve large numbers of parameters, which hinders their deployment on the edge devices commonly used in agricultural operations. To address these problems, we constructed a mixed apple dataset collected from real orchards under multiple lighting and background conditions. Utilizing this dataset, we developed an efficient and lightweight detection framework, YOLO11-ARAF, built upon the YOLO11 architecture and aimed at enhancing detection performance in complex orchard scenarios. The contributions of this paper are as follows:
(1) Based on the YOLO11 framework, an improved attention mechanism (AFGCAM) and a rotational convolution module (CARConv) were introduced to replace and enhance the original backbone and neck components. These modifications led to performance enhancements of 0.3% in Precision, 1.1% in Recall, 0.72% in mAP@50, and 2.0% in mAP@50:95 metrics. Furthermore, the enhanced model was distilled into a lightweight version, YOLO11n, which features significantly fewer parameters, a higher detection speed, and a lower computational cost—achieving high detection efficiency without compromising accuracy;
(2) The research constructed an apple imagery dataset featuring diverse environmental conditions, including varied illumination and viewing perspectives. This dataset comprises 3942 annotated images specifically designed to enhance model robustness and adaptability;
(3) Ablation studies and comparative evaluations demonstrate that YOLO11-ARAF outperforms other target detection methods in overall performance while exhibiting better adaptability in complex environmental conditions.
The paper’s organization proceeds as follows: The proposed algorithmic improvements are described in
Section 2. The experimental results and their analysis are presented in
Section 3. Concluding remarks and future work are provided in
Section 4.
2. Methodology
2.1. Overview of YOLO11-ARAF
Our work builds upon YOLO11, the official version released by Ultralytics in October 2024 [
20], as the base detection framework. As shown in
Figure 1, after apple-image data acquisition in the orchard, the overall workflow includes dataset construction, data annotation, and dataset division; raw images are then processed by the YOLO11-ARAF architecture to generate bounding-box predictions for target apples. Because the improvements increase the model's parameters and computational complexity, we employed response-based knowledge distillation to distill the YOLO11-ARAF model into YOLO11n, which improves computational efficiency while maintaining accuracy, meeting the computational requirements of real-time detection.
The selected target detection framework was YOLO11, whose network structure mainly contains three components: Backbone, Neck, and Detect. The backbone network progressively constructs multi-scale feature representations from raw pixels, preserving fine-grained details while abstracting high-level semantic content through its hierarchical architecture. The Neck fuses and enhances the features extracted by the Backbone and optimizes the detection of small and large targets through multi-scale feature fusion. The detection head transforms the multi-scale feature representations into final predictions through coordinate regression and classification. The main feature extraction capacity of the YOLO model therefore lies in the backbone and neck, and enhancing these two components leads to superior feature representation learning. We propose the AFGCAM attention module, based on the AFGCA attention mechanism, and add it to the Backbone and Neck of the YOLO11 model to improve the extraction of global and local information during feature extraction. In addition, ARConv rotational convolution is added to the Backbone, and the CARConv module is proposed to replace the C3K2 module, enhancing the model's detection performance for rotated targets.
We propose YOLO11-ARAF, a novel detection architecture uniquely designed for apple detection in complex orchard environments. This framework introduces two key innovations to the YOLO11 baseline. First, we present CARConv, a specialized convolutional block that uniquely integrates Adaptive Rotational Convolution (ARConv) by strategically replacing the Bottleneck component within YOLO11’s C3k2 modules. This targeted integration is designed to specifically enhance the model’s feature representation capabilities for apples exhibiting diverse orientations due to natural growth patterns and occlusions. Second, we introduced AFGCAM, a novel attention mechanism that significantly enhances the original Adaptive Fine-Grained Channel Attention (AFGCA) by incorporating Global Max Pooling (GMP) alongside Global Average Pooling (GAP). This unique dual-pooling strategy within AFGCAM is specifically devised to improve feature discriminability under challenging and variable orchard illumination conditions by capturing a richer set of channel statistics. The synergistic integration of our specifically designed CARConv and the novel AFGCAM module into the backbone and neck of YOLO11 results in a distinctive architecture optimized for robust apple detection.
2.2. CARConv Module
Considering that the orientation of objects such as apples and branches in real orchard scenes can vary greatly in angle within the viewfinder frame, which poses a challenge to standard object detection models such as YOLO11, we strategically integrated the ARConv module into the C3K2 convolutional layer. This convolutional improvement enhances the model’s capacity for learning discriminative representations from apples presenting different angles in the orchard, which facilitates more accurate apple detection.
When an object is rotated arbitrarily, the model may exhibit degraded performance in detecting and classifying it. To address this challenge, Pu et al. [
21] proposed the adaptive rotational convolution (ARConv) module, which aims to enhance orientation-variant object-detection capability in images. The core idea is to adaptively rotate the convolution kernel according to the object’s principal rotation angle in the input image. This is achieved by predicting the rotation angle and the routing function of the convolution kernel combination weights. The rotated convolution kernel is then combined using the predicted weights to generate the final convolution kernel applied to the input feature map. The specific convolution module is shown in
Figure 2.
The ARConv module, depicted in Figure 2, enhances rotational adaptivity. It employs a routing function that takes the input image features $x$ to predict a set of $n$ rotation angles $\{\theta_1, \ldots, \theta_n\}$ and $n$ combination weights $\{\lambda_1, \ldots, \lambda_n\}$. Each of the $n$ base convolutional kernels $W_i$ is then rotated by its corresponding angle $\theta_i$. The final output feature map $y$ is obtained by convolving the input features with a weighted sum of these $n$ rotated kernels, where the weights are the predicted $\lambda_i$. This allows the convolution to dynamically adapt to the orientation of features in the input. Full details can be found in Pu et al. [21]. The equations in Figure 2 are illustrated as follows.
The convolutional kernel rotation mechanism equation is
$$\tilde{W} = \mathrm{Rotate}(W; \theta),$$
where $W$ is the original convolutional kernel, $\theta$ is the rotation angle, and $\tilde{W}$ is the rotated convolutional kernel.
The routing function equation is
$$(\theta_1, \ldots, \theta_n; \lambda_1, \ldots, \lambda_n) = f(x),$$
where $f(\cdot)$ is the routing function, $x$ is the image feature, $\{\theta_i\}_{i=1}^{n}$ is the set of predicted rotation angles, and $\{\lambda_i\}_{i=1}^{n}$ is the set of predicted combination weights.
The adaptive rotation convolution module equation is
$$y = x * \sum_{i=1}^{n} \lambda_i \tilde{W}_i,$$
where $\lambda_i$ denotes the combination weight of the $i$-th convolutional kernel, $*$ denotes the convolution operation, and $y$ is the combined output feature.
The Bottleneck module in the C3K2 module has many advantages in feature extraction, but it also has potential shortcomings and limitations. The 1 × 1 convolution in the Bottleneck module is used to reduce the feature map's channel dimension, and this reduction compresses the amount of information in the feature map, especially if the feature map itself has a small number of channels. This compression may discard discriminative features, thereby degrading the model's detection accuracy, particularly in complex scenes or small-object recognition tasks. To compensate for these limitations of the C3K2 module, we introduced the ARConv module to replace the original Bottleneck module. By dynamically adjusting the orientation of its convolutional kernels, ARConv augments the backbone's representational learning capacity. This is well suited to detection scenarios in which targets appear at varied rotations, enabling more effective multi-angle feature representation learning. We substituted the Bottleneck module in the C3k2 layer with the ARConv module and named the result CARConv. This specific architectural modification represents a novel approach to integrating rotational adaptivity directly and deeply within YOLO11's core feature extraction blocks, rather than as a more generic add-on. By strategically embedding ARConv in this manner, our CARConv module is designed to enhance the backbone's capability to learn discriminative features for apples presenting diverse and challenging orientations in complex orchard environments. This enhanced CARConv block, capable of adaptively adjusting its convolutional kernels, therefore facilitates more accurate feature learning for such rotationally varied targets. The resulting structure of this substitution is illustrated in
Figure 3.
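To make the mechanism concrete, the following is a minimal PyTorch sketch of the adaptive rotated convolution idea that CARConv embeds in the C3k2 block. It is an illustrative sketch rather than the authors' implementation: the class and parameter names (e.g., SimpleARConv, n_kernels) are assumptions, and the per-sample loop is kept only for readability.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SimpleARConv(nn.Module):
    """Illustrative adaptive rotated convolution (after the idea in Pu et al. [21]).

    A routing head predicts n rotation angles and n combination weights from the
    input features; the n base kernels are rotated accordingly and combined into
    a single kernel that is applied to the input feature map.
    """

    def __init__(self, channels, kernel_size=3, n_kernels=4):
        super().__init__()
        self.n, self.k = n_kernels, kernel_size
        # n base kernels of shape (C_out, C_in, k, k)
        self.weight = nn.Parameter(
            torch.randn(n_kernels, channels, channels, kernel_size, kernel_size) * 0.02
        )
        # routing head: GAP -> linear -> (n angles, n combination weights)
        self.route = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(channels, 2 * n_kernels)
        )

    def _rotate(self, w, theta):
        # rotate every k x k kernel plane by angle theta via a sampling grid
        co, ci, k, _ = w.shape
        cos, sin = torch.cos(theta), torch.sin(theta)
        rot = torch.stack([
            torch.stack([cos, -sin, torch.zeros_like(cos)]),
            torch.stack([sin, cos, torch.zeros_like(cos)]),
        ]).unsqueeze(0)                                        # (1, 2, 3) affine matrix
        grid = F.affine_grid(rot, size=(1, 1, k, k), align_corners=False)
        planes = w.reshape(co * ci, 1, k, k)
        grid = grid.expand(co * ci, -1, -1, -1)
        return F.grid_sample(planes, grid, align_corners=False).reshape(co, ci, k, k)

    def forward(self, x):
        routing = self.route(x)                                # (B, 2n)
        angles = torch.tanh(routing[:, : self.n]) * 3.1416     # predicted rotation angles
        lam = torch.softmax(routing[:, self.n:], dim=1)        # combination weights
        outputs = []
        for i in range(x.size(0)):                             # per-sample adaptive kernel
            w = torch.stack([
                lam[i, j] * self._rotate(self.weight[j], angles[i, j])
                for j in range(self.n)
            ]).sum(0)
            outputs.append(F.conv2d(x[i : i + 1], w, padding=self.k // 2))
        return torch.cat(outputs, dim=0)
```

In CARConv, a block of this kind takes the place of the Bottleneck inside the C3k2 module, so the rotated-kernel combination operates directly on the backbone's intermediate feature maps.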
2.3. AFGCAM Module
In recent years, attention mechanisms such as SE and CBAM have been widely used in computer vision, especially in deep learning models, to improve the feature discriminability of network structures. The complex lighting of real orchard scenes makes it harder for the model to extract target features. Therefore, we introduce the AFGCA attention mechanism to enhance the model's feature learning capacity for lower-quality images.
The AFGCA adaptive fine-grained channel attention mechanism proposed by Sun et al. [
22] was initially applied as a design framework in the field of image defogging to improve image quality. Similarly, under the complex light interference of a real orchard environment, light that is too strong or too weak can degrade the quality of the picture captured by the camera, which in turn interferes with the model's feature representation capability. This inspired us to add AFGCA to the feature-extraction framework of the YOLO model to help it better learn the features of target apples in complex lighting and backgrounds. The architectural framework of the AFGCA attention mechanism (
Figure 4) was initially designed for the feature reconfiguration phase of the end-to-end denoising architecture, as well as being a key component in the synthetic network responsible for parameter estimation. This design strategically exploits the adaptive reconfiguration capabilities of the network by optimizing the feature transformations to effectively utilize macro-scale contextual relationships and micro-scale channel interactions.
The AFGCA module (illustrated in Figure 4, left) initiates its computational workflow by applying global average pooling (GAP) to the input feature tensor $X$ to obtain channel-wise statistical descriptors. These descriptors are then utilized to model both local and global inter-channel relationships, typically forming a channel correlation matrix. From this, adaptive channel attention weights, $w$, are derived through a dynamic fusion process involving learnable factors. Finally, these learned weights $w$ are applied channel-wise to the original input feature tensor, $X$, to produce the refined feature representation $X'$. The specific calculation formulas outlining these steps are detailed below:
$$y_n = \mathrm{GAP}(X_n) = \frac{1}{H \times W}\sum_{i=1}^{H}\sum_{j=1}^{W} X_n(i, j),$$
$$w = \sigma\big(\alpha\, w_g + (1 - \alpha)\, w_l\big), \qquad X' = w \odot X,$$
where $X \in \mathbb{R}^{C \times H \times W}$ denotes the input feature tensor with $C$ channels, height $H$, and width $W$. $y_n$ is the channel descriptor for the $n$-th channel, derived from the feature values $X_n(i, j)$ at the spatial position $(i, j)$. $\mathrm{GAP}(\cdot)$ signifies the Global Average Pooling function, yielding the channel descriptor vector $y$. Local channel information is represented by $y_l$, obtained using a band matrix with weight coefficients over $k$ neighboring channels, while $y_g$ represents global channel information. These interact to form the channel correlation matrix $M$. From $M$, global ($w_g$) and local ($w_l$) channel attention weight vectors are derived. The final combined channel attention weights $w$ are produced through dynamic fusion involving a learnable factor $\alpha$. The Sigmoid activation function is denoted by $\sigma$. Finally, $X'$ is the refined output feature tensor, resulting from the element-wise multiplication $\odot$ of $w$ with the input feature tensor $X$.
While AFGCA offers a strong foundation for channel attention, to further bolster feature learning, particularly for apples under the diverse and often suboptimal lighting conditions found in orchards, we propose AFGCAM, a novel and enhanced attention module. To preserve additional feature details, we have added a global maximum pooling (GMP) operation to the AFGCA module for optimization. Our key innovation in AFGCAM is the unique integration of a Global Max Pooling (GMP) pathway operating in parallel with the original Global Average Pooling (GAP) pathway before the channel attention weights are derived (see
Figure 4, right). The feature maps after the GMP and GAP operations are then summed, element by element, at the corresponding positions in the channel dimension. Whereas global average pooling (GAP) softens the features and retains the overall pattern, global maximum pooling (GMP) assists in identifying fine-scale and local peaks, thereby accentuating object characteristics. Compared to a single-path global maximum pooling (GMP) operation, the fusion of both pooling operations can comprehensively consider features at multiple scales, enhancing the model's capacity to detect and represent small-scale occluded object features. We have integrated the refined AFGCAM module into the YOLO11 model to strengthen its feature extraction capabilities. The internal architecture of the AFGCAM is shown in
Figure 4.
This dual-pooling strategy is a distinctive feature of AFGCAM. It is designed to capture a more comprehensive range of channel-wise statistics: GAP effectively summarizes the overall contextual features and background information, while GMP excels at identifying and preserving the most salient local features and peak activations, which correspond to important object characteristics. By fusing these complementary statistics, AFGCAM enables more robust and discriminative feature refinement, especially for visually challenging targets, such as apples under varying illumination or those that are small or partially occluded.
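A minimal PyTorch sketch of this dual-pooling channel attention is given below. It is only illustrative and assumes a simplified form of the AFGCA computations (a 1-D convolution standing in for the band-matrix local interaction and a fully connected layer for the global interaction); the class name AFGCAMSketch and all hyperparameters are ours, not the authors'.

```python
import torch
import torch.nn as nn


class AFGCAMSketch(nn.Module):
    """Illustrative AFGCAM-style channel attention with fused GAP + GMP descriptors."""

    def __init__(self, channels, k=3):
        super().__init__()
        # local cross-channel interaction over k neighbouring channels ("band matrix")
        self.local = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)
        # global cross-channel interaction over all channels
        self.global_fc = nn.Linear(channels, channels, bias=False)
        self.alpha = nn.Parameter(torch.tensor(0.5))   # learnable fusion factor
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):                              # x: (B, C, H, W)
        b, c, _, _ = x.shape
        gap = x.mean(dim=(2, 3))                       # averaged (contextual) statistics
        gmp = x.amax(dim=(2, 3))                       # peak (salient) statistics
        y = gap + gmp                                  # element-wise fusion of descriptors
        w_local = self.local(y.unsqueeze(1)).squeeze(1)
        w_global = self.global_fc(y)
        w = self.sigmoid(self.alpha * w_global + (1 - self.alpha) * w_local)
        return x * w.view(b, c, 1, 1)                  # channel-wise reweighting
```

The only change relative to a GAP-only variant is the extra `amax` descriptor and its element-wise fusion with the averaged one, which is the behavior AFGCAM adds on top of AFGCA.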
2.4. YOLO11-ARAF Network
The enhanced overall model architecture is depicted in
Figure 5. YOLO11-ARAF (YOLO11-ARConv-AFGCAM) is structured into three components: Backbone, Neck, and Head. This enhanced architecture derives its distinct advantages from the unique and synergistic interplay of our specifically designed CARConv modules and the novel AFGCAM attention mechanism. As detailed in
Section 2.2 and
Section 2.3 respectively, CARConv imbues the network with robust rotational adaptivity, while AFGCAM significantly refines feature discriminability, especially under challenging visual conditions. These modules are strategically embedded within the backbone and neck components of the YOLO11 framework. This holistic and specific configuration is engineered to cooperatively enhance the model’s capability for accurate and efficient apple detection in complex real-world orchard environments. For instance, in optimizing the Backbone and Neck parts, the introduction of our CARConv layer (replacing the original C3K2 layer) significantly improves the detection ability for apples at any direction and angle. Simultaneously, the AFGCAM module, an attention-guided feature learning component, is incorporated into both the Backbone and Neck to fuse global and local feature map information, thereby enhancing the model’s feature extraction and processing capabilities for complex scenes.
2.5. Response-Based Knowledge Distillation
To address the increased model parameters and the resulting increased computational overhead due to the introduction of the ARConv and AFGCAM modules, we used knowledge distillation [
23]. As shown in
Figure 6, the improved model with higher accuracy was used as the teacher model, while the knowledge was distilled into the smaller, faster, and more computationally efficient student model (YOLO11n). This ensured that the student model maintains higher detection accuracy while maintaining the lightweight advantage required for deployment on edge devices in agricultural environments.
We employed a response-based knowledge distillation approach that uses the teacher model's output responses as soft labels. This allowed the student model to learn from the teacher's refined responses, improving detection accuracy and generalization. During distillation, a compact student model is trained to emulate the performance of a more complex teacher model. The teacher's outputs (soft labels) contain more fine-grained information than one-hot hard labels, providing intermediate supervision that helps the student model learn more effectively. In our study, the teacher model is the YOLO11-ARAF model with the ARConv and AFGCAM modules, and its output was used to direct the training of the slimmer YOLO11n student model. The mean squared error loss was employed to compute the loss values. By minimizing a distillation loss that quantifies the discrepancy between student predictions and teacher soft labels, the student model learns to emulate the teacher's performance.
This approach retains the improved detection accuracy of the teacher model while ensuring that the student model is lightweight and efficient enough to be deployed on edge devices in the orchard. The experimental results demonstrate the efficacy of the proposed methodology, showing that the student model maintains high detection accuracy at a reduced computational cost. In the figure, $Cls^{T}$ is the classification confidence of the teacher model for each anchor: each anchor corresponds to a category score indicating the probability that the anchor belongs to a certain category. $Box^{T}$ is the bounding-box regression output of the teacher model for each anchor: each anchor point corresponds to four regression values indicating the location and size of the target. The student model's outputs $Cls^{S}$ and $Box^{S}$ are defined in the same way.
In order to calculate the object scale metric, the teacher's classification output $Cls^{T}$ is first passed through the sigmoid activation layer:
$$P^{T} = \sigma\big(Cls^{T}\big).$$
Then the Maxscore is calculated as the maximum class probability of each anchor:
$$\mathrm{Maxscore} = \max_{c} P^{T}_{c}.$$
The Maxscore is afterward reshaped so that it can broadcast over the classification and regression outputs, yielding the dynamically adjusted weight $w$ used below.
The object scale metric is used to calculate the classification and bounding-box loss, thus facilitating knowledge transfer from the teacher to the student model. The loss function weights can be adaptively modulated to prioritize high-confidence prediction areas during knowledge distillation. It is shown below.
(1) The bounding-box loss equation is
$$L_{box} = \frac{1}{B \times A}\sum_{b=1}^{B}\sum_{a=1}^{A} w_{b,a}\,\frac{1}{D}\sum_{d=1}^{D}\big(Box^{S}_{b,a,d} - Box^{T}_{b,a,d}\big)^{2},$$
where $B$ is the batch size, $A$ is the number of anchor points, $D$ is the dimension of the bounding-box regression, $Box^{T}_{b,a,d}$ denotes the teacher model's bounding-box regression output, $Box^{S}_{b,a,d}$ is the corresponding bounding-box prediction of the student model, and $w_{b,a}$ is the dynamically adjusted weight.
(2) The classification loss equation is
$$L_{cls} = \frac{1}{B \times A}\sum_{b=1}^{B}\sum_{a=1}^{A} w_{b,a}\,\frac{1}{C}\sum_{c=1}^{C}\big(P^{S}_{b,a,c} - P^{T}_{b,a,c}\big)^{2},$$
where $C$ represents the total class count, $P^{S}_{b,a,c}$ is the classification score of the student model, $P^{T}_{b,a,c}$ is the classification score of the teacher model, and $w_{b,a}$ is the dynamically adjusted weight.
The following equation is used for the calculation of the total loss:
$$L_{total} = L_{cls} + L_{box}.$$
The SGD optimizer was used to minimize losses during distillation in this study.
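The weighting and loss computation described above can be summarized in a short PyTorch sketch. It follows the equations in this section under stated assumptions about tensor shapes; the function name and shapes are illustrative, not the authors' code.

```python
import torch


def response_distillation_loss(stu_cls, stu_box, tea_cls, tea_box):
    """Response-based distillation loss sketch.

    stu_cls, tea_cls: (B, A, C) raw classification outputs per anchor point
    stu_box, tea_box: (B, A, D) bounding-box regression outputs per anchor point
    The teacher's maximum class confidence per anchor acts as the object scale
    weight w that modulates both MSE terms.
    """
    tea_prob = torch.sigmoid(tea_cls)            # sigmoid activation of teacher scores
    w = tea_prob.amax(dim=-1, keepdim=True)      # Maxscore, reshaped for broadcasting

    loss_cls = (w * (torch.sigmoid(stu_cls) - tea_prob) ** 2).mean()
    loss_box = (w * (stu_box - tea_box) ** 2).mean()
    return loss_cls + loss_box                   # total distillation loss
```

During training, this term is minimized with SGD (typically alongside the student's standard detection loss), so the soft teacher responses guide the lighter YOLO11n model.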
3. Experimental Results and Analysis
3.1. Evaluation Indicators
The following evaluation metrics were used for the YOLO11-based models:
1. Precision
Precision measures the proportion of correctly predicted positive instances among all samples predicted as positive. It reflects the model's accuracy in identifying positive targets. The formula is as follows:
$$\mathrm{Precision} = \frac{TP}{TP + FP},$$
where TP (True Positive) is the number of samples correctly identified as positive, and FP (False Positive) refers to the number of negative samples incorrectly classified as positive;
2. Recall
Recall assesses the model's ability to correctly detect all actual positive samples. It indicates how well the model captures true positives. The calculation is given by
$$\mathrm{Recall} = \frac{TP}{TP + FN},$$
where FN (False Negative) denotes the number of positive samples that the model failed to identify;
3. F1-Score
The F1-score is the harmonic mean of precision and recall, used to measure the comprehensive performance of the model. Its calculation formula is
$$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}};$$
4. Average Precision, AP
Average Precision (AP) is used to calculate the average accuracy under different categories, which demonstrates the model's robust performance across varying thresholds. For the target detection task, the AP is usually obtained through the integration of the precision-recall curve. Its calculation formula is
$$AP = \int_{0}^{1} P(R)\, dR,$$
where $R$ is the recall and $P(R)$ is the precision as a function of recall;
5. Mean Average Precision, mAP
The mean average precision is the average of the per-category AP values in a multi-category problem and indicates the model's overall detection capability across multiple object categories. Its calculation formula is
$$mAP = \frac{1}{N}\sum_{i=1}^{N} AP_i,$$
where $N$ denotes the total category count, and $AP_i$ represents the average precision for the $i$-th class;
6. FPS
FPS indicates the inference speed of the model, representing how many frames the model can process per second. It is a critical metric for assessing real-time performance. Its calculation formula is
$$FPS = \frac{1000}{\mathrm{Inference\ Time}},$$
where Inference Time refers to the time (in milliseconds) the model takes to process a single image.
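For reference, the metrics above can be computed with a few lines of Python; the numbers in the usage example are illustrative only.

```python
import numpy as np


def precision(tp, fp):
    return tp / (tp + fp)


def recall(tp, fn):
    return tp / (tp + fn)


def f1_score(p, r):
    return 2 * p * r / (p + r)


def average_precision(recall_pts, precision_pts):
    # AP as the area under the precision-recall curve
    return float(np.trapz(precision_pts, recall_pts))


def mean_average_precision(ap_per_class):
    # mAP: mean of the per-class average precisions
    return sum(ap_per_class) / len(ap_per_class)


def fps(inference_time_ms):
    # frames per second from a per-image inference time in milliseconds
    return 1000.0 / inference_time_ms


p, r = precision(tp=90, fp=10), recall(tp=90, fn=15)   # illustrative counts
print(f1_score(p, r), mean_average_precision([0.92, 0.88]), fps(16.5))
```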
For the experimental environment in this paper, the graphics card is an RTX 4070, the Python version is 3.10, the PyTorch version is 12.4, and the CUDA version is 12.6.
3.2. Dataset Construction
The dataset used in this study consists of 3942 images, as shown in
Figure 7. The initial resolutions of the collected images, including 1280 × 960 and 640 × 480 pixels, were standardized and pre-processed prior to model training. These images were collected from different orchards and agricultural environments to ensure a diverse presentation of apple targets under different conditions. The dataset consists of images of apples with different orientations, lighting conditions, and shading levels. This comprehensive dataset is essential for training and validating the proposed model.
The camera positions were all located at the reachable position of the robotic arm when the pictures were taken. The composed dataset was obtained by taking pictures in different orchards, using common varieties, different shooting angles, and different light and shading conditions. The samples of the dataset are large enough and complex enough to meet the requirements of orchard-picking conditions. In the actual orchard-picking environment, the growing conditions and distribution of apples are often not completely structured or idealized. Fruit distribution may be influenced by the natural growth pattern of trees, shading by branches and leaves, and light conditions, leading to randomness and complexity in fruit location. By collecting these images of apples with certain recognition difficulties, we aimed to improve the robustness, adaptability, and generalization ability of the model. Specifically, the model needs to be able to operate stably under complex lighting conditions, adapt to the appearance characteristics of different varieties of fruits, and accurately recognize target fruits under occlusion and background interference. This diverse dataset design not only helps the model learn more comprehensive features in the training phase, but also significantly improves the reliability and efficiency of the model in practical applications, thus meeting the actual needs of automated picking in orchards.
To ensure the model’s robustness for real-world complexities, the dataset was curated to include a broad spectrum of challenging visual conditions. As illustrated in
Figure 7, this encompasses varied illumination scenarios, including backlighting, overexposure, low-light/night, complex/dappled lighting, and frequent instances of natural occlusion by leaves, branches, and other apples. The number of images captured in the dataset under different lighting conditions is similar in proportion. The dataset also incorporates variations in object scale, with apples appearing at different distances from the camera, resulting in a range of apparent sizes. This includes instances of smaller, more distant fruits as well as larger, closer ones, preparing the model for detection across different scales commonly observed in orchard navigation and harvesting tasks.
We used the X-AnyLabeling software to label the target apples in the dataset one by one and stored the label files uniformly in YOLO format. The original images and the corresponding label files were divided into training, validation, and test sets in a ratio of 8:1:1. In the training stage, we used the stochastic gradient descent (SGD) method to optimize the learning rate. The model was trained with a batch size of 32 for a maximum of 300 iterations. All images were resized to a uniform input resolution of 640 × 640 pixels for the YOLO11 architecture, with the shorter dimension padded to maintain the aspect ratio and prevent object distortion. In addition, to enhance the variability of the dataset and improve the generality of the model, standard data augmentation techniques, including common geometric (e.g., flipping, rotation) and photometric (e.g., brightness, contrast) transformations as well as mosaic augmentation, were applied dynamically during the training phase.
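As a rough sketch of this training setup, an Ultralytics-style training call with the stated hyperparameters is shown below; the dataset YAML path and weight file are placeholders, the actual YOLO11-ARAF model would be loaded from its own configuration, and the stated 300 training iterations are interpreted here as epochs.

```python
from ultralytics import YOLO

# The dataset YAML (placeholder name) points to the 8:1:1 train/val/test split
# exported from the X-AnyLabeling annotations in YOLO format.
model = YOLO("yolo11n.pt")          # baseline weights; YOLO11-ARAF would use its own config
model.train(
    data="apple_orchard.yaml",      # placeholder dataset description file
    epochs=300,                     # maximum number of training epochs (assumption)
    batch=32,                       # batch size used in this study
    imgsz=640,                      # images resized/padded to 640 x 640
    optimizer="SGD",                # stochastic gradient descent, as described above
)
```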
3.3. Comparative Experiments
The detection results were compared with different models of the YOLO series, the RTDETR model of the end-to-end series, and the DNE model proposed by other scholars for apple target detection, as shown in
Table 1. Specifically, the YOLO series models benchmarked include several iterations such as YOLOv8, YOLOv10, our baseline YOLO11, and YOLO12, representing the ongoing advancements in this popular single-stage detector family known for its balance of speed and accuracy. In contrast, the RTDETR (Real-Time Detection Transformer) models, including the RTDETR-L and RTDETR-ResNet50 variants, exemplify end-to-end detection approaches built upon Transformer architectures, which have recently shown strong performance. Finally, DNE-YOLO [
24] represents a contemporary specialized model developed by other researchers specifically for apple detection in diverse natural environments, providing a relevant domain-specific benchmark.
It can be seen that, although YOLO11-ARAF has more parameters than YOLO11, its detection results are better than those of the other models, and its Precision, Recall, mAP@50, and mAP@50:95 were improved by 0.3%, 1.1%, 0.72%, and 2% compared to YOLO11, respectively. The Precision, Recall, mAP@50, and mAP@50:95 of each YOLO model were plotted as comparison curves, as in
Figure 8, which shows that all the indexes of YOLO11-ARAF are higher than those of the other models; in particular, the gain in mAP@50:95 is significantly larger, demonstrating enhanced robustness for apple detection in challenging orchard conditions. It is important to elaborate on the significance of the mAP@50:95 metric, which our YOLO11-ARAF model improved by 2% over the YOLO11 baseline, as shown in
Table 1. While mAP@50 primarily evaluates a model’s capability to correctly identify and broadly localize objects requiring an IoU of 0.5, the mAP@50:95 metric offers a more comprehensive and stringent assessment by averaging AP scores across a range of IoU thresholds from 0.50 to 0.95 in steps of 0.05. This means that a notable improvement in mAP@50:95, such as that achieved by YOLO11-ARAF, does not merely indicate better object detection in a general sense. Crucially, it suggests that the model exhibits enhanced performance, even when much higher localization accuracy is demanded, i.e., at stricter IoU thresholds like 0.75, 0.85, or 0.95. Therefore, the observed 2% gain in mAP@50:95 for YOLO11-ARAF signifies a more robust improvement in overall detection quality, encompassing both superior object recognition and, critically, more precise bounding box regression compared to the baseline. This enhanced localization accuracy is particularly vital for downstream applications, such as robotic grasping, in automated harvesting, where precise positioning of the detected apples is essential. While our mAP@50 also shows an improvement of 0.72%, the more substantial gain in mAP@50:95 underscores that the architectural enhancements in YOLO11-ARAF, including CARConv and AFGCAM, contribute significantly to refining the precision of object localization across a spectrum of IoU requirements, making it a more reliable model for real-world complex orchard environments.
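As a small illustration of how this stricter metric aggregates over IoU thresholds, the following helper (the function name is ours) simply averages per-threshold AP values over the 0.50-0.95 range in steps of 0.05:

```python
import numpy as np


def map_50_95(ap_at_iou):
    """mAP@50:95: mean of the AP values computed at IoU = 0.50, 0.55, ..., 0.95."""
    thresholds = np.linspace(0.50, 0.95, 10)
    assert len(ap_at_iou) == len(thresholds)
    return float(np.mean(ap_at_iou))
```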
To enhance model interpretability, we employed activation heatmaps to visualize YOLO11-ARAF’s detection process, explicitly revealing its region-specific attention patterns. As shown in
Figure 9, the visualization results show that YOLO11-ARAF locates the target apples more accurately than the other YOLO-series models. Specifically, the regions of interest of the YOLOv8, YOLOv10, and YOLO11 models are concentrated in background regions of complex branches that do not contain target apples, suggesting that these models are more susceptible to interference from complicated backgrounds, which degrades their apple-detection performance. For YOLO12, although most of the regions of interest fall accurately on the target apples, it failed to adequately attend to the low-illumination apple in the bottom-right quadrant, resulting in a missed detection. Taken together, YOLO11-ARAF produces better attention results.
To further evaluate the robustness of our proposed model, we selected two representative challenging-scene sub-validation sets from the existing validation set. The complex background sub-validation set consists of 168 images, mainly covering severe occlusion by branches and leaves as well as interference from non-fruit-tree objects. The complex lighting sub-validation set consists of 199 images selected specifically for lighting conditions (e.g., backlighting, extreme backlighting, and low light) that are common in real-world orchards. The trained YOLO11-ARAF model was evaluated on both subsets along with the main benchmark models. The detailed performance metrics, including the number of detected apple instances and mAP@50:95, are shown in
Table 2.
The results in
Table 2 show that our improved YOLO11-ARAF model not only detects a number of apple instances closer to the number of real labels under the two challenging conditions of complex backgrounds and complex lighting, but also significantly outperforms the compared baseline models on the key mAP metrics, with detection accuracy improved by 0.7% and 2% relative to the baseline model in the two complex environments, respectively. This indicates that the model offers higher detection performance in both complex environments, and particularly stronger anti-interference ability under complex lighting. This suggests that the YOLO11-ARAF model copes more effectively with interference caused by complex environmental factors (e.g., severe occlusion, object interference, and unfavorable lighting) and exhibits excellent adaptability and robustness in real orchard scenarios. Notably, a comparative analysis of these results shows that complex background conditions, which are mainly characterized by severe occlusion, usually pose a greater challenge to model detection performance than complex illumination conditions, which points to a potential direction for more targeted optimization for severe occlusion in future work.
3.4. Lightweighting Experiments
Since the improved model is larger, its parameters increased by nearly 0.1 M and its computational complexity by nearly 1 GFLOPs compared to YOLO11. From a practical application point of view, the model is generally deployed on edge devices, where a lightweight model is expected. Therefore, we adopted knowledge distillation to distill the improved model into the YOLO11n model.
To further investigate the adaptability of the knowledge captured by the enhanced YOLO11-ARAF teacher model, we also explored distilling it into student models built on other contemporary lightweight backbones. To this end, we replaced the original YOLO11 backbone with EfficientViT [
26], GhostHGNetV2 [
27], and MobileNetV4 [
28], respectively, and then applied the same distillation process.
Table 3 lists the comparative results of these experiments, where the upper part of the table shows the validation results of the YOLO series and YOLO11-ARAF models, and the lower part shows the comparative experimental results of the YOLO11 model and YOLO11-ARAF distilled to different models, respectively.
The analysis of
Table 3 shows that, while the knowledge in YOLO11-ARAF can be transferred to these different lightweight architectures, achieving the optimal balance of accuracy and efficiency depends heavily on the particular student backbone and its integration. For example, the YOLO11-ARAF-to-GhostHGNetV2 variant shows a competitive mAP@50:95 of 0.624, whereas other backbones, such as EfficientViT and MobileNetV4, when integrated into the YOLO11 framework and distilled from YOLO11-ARAF, reach mAP@50:95 scores of 0.619 and 0.625, respectively, with varying parameter counts and FPS trade-offs, as detailed in the table. These findings suggest that simply employing a different standalone lightweight backbone does not inherently guarantee superior post-distillation performance compared to a well-matched student architecture such as YOLO11n. Our primary distillation model, YOLO11-ARAF-to-YOLO11n, achieved 0.644 mAP@50:95 with good efficiency (2.56 M parameters, 76.1 FPS) and remains the most efficient lightweight configuration in our study. The distilled model has 0.1 M fewer parameters and 1 GFLOPs lower computational complexity than the improved model, and it more than doubles the FPS. Overall, the accuracy and efficiency of YOLO11-ARAF-to-YOLO11n were higher than those of the other distilled models, both in terms of parameters and computational complexity, which highlights the efficient synergistic effect of distilling knowledge from an enhanced teacher into a closely related and compatible lightweight student architecture.
FPS (Frames Per Second) is a significant index for measuring the inference speed of the model: it denotes the number of image frames the model can process per second, reflecting its real-time capability in practical applications. A higher FPS translates to faster inference and better real-time performance. The mAP@50:95 of the distilled model is maintained at 0.644, while its FPS is double that of the improved model and essentially matches the detection speed of the baseline model, indicating that the lightweight, improved model maintains high accuracy while improving efficiency.
Latency is an important metric that describes the efficiency of a model’s inference, specifically defined as the temporal interval for the model to generate output results (detection frames and categories) from receiving input data (images). Latency directly impacts the model’s responsiveness in real-time detection scenarios, especially in application scenarios that require a fast response. Latency is obtained by summing up the preprocess time, the inference time, and the postprocess time. As shown in
Table 4, the latency metrics show that the distilled YOLO11-ARAF-to-YOLO11n model achieves an inference time of 16.54 ± 3.41 ms, which is about 0.23 ms higher than that of the improved model and close to that of YOLO11 and YOLOv10, indicating that its inference is efficient and suitable for real-time applications.
The results after model distillation show that the computational speed of the network is significantly improved, but the reduction in model parameters after distillation is modest (0.1 M), because the parameters added by the new modules are themselves not particularly large. Generally speaking, higher model accuracy requires more detection structures to support feature extraction, which increases computational complexity; it is therefore difficult to achieve both high accuracy and a small model within the same architecture. In this study, since the overall parameter increase is small and the structure is not easy to lighten directly, this general, lightweight distillation method is well suited. Comparing the distilled model with other lightweight networks shows that distilling into the pre-improvement model both reduces the model size and preserves the accuracy gained by the improvements; this is, in effect, a balance between model accuracy and computational efficiency.
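For completeness, the latency figure reported above (preprocess + inference + postprocess) can be measured with a small timing helper; the three stage callables below are placeholders, not functions from the paper or any specific library.

```python
import time


def measure_latency_ms(preprocess, infer, postprocess, image, runs=50):
    """Average per-image latency (ms) over the three stages, plus the implied FPS.

    preprocess, infer, postprocess are placeholder callables for the pipeline stages.
    """
    start = time.perf_counter()
    for _ in range(runs):
        postprocess(infer(preprocess(image)))
    total_ms = (time.perf_counter() - start) * 1000.0
    latency = total_ms / runs
    return latency, 1000.0 / latency
```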
3.5. Comparative Experiments on Convolution Module
In an effort to assess the performance and contribution of the ARConv module embedded in the CARConv layer, the ARConv module was replaced with different convolutional kernels in the model, including PWConv, ShiftConv, and so on. By comparing ARConv with other convolutional modules, we could conduct a thorough assessment of the detection capabilities of the CARConv layer. The results of the comparison between ARConv and other convolutional modules are presented in
Table 5. The results indicate that ARConv surpasses the other conventional convolutional modules across all evaluation metrics, achieving a 0.5% increase in mAP@50 relative to the original model and an improvement of 1.2% in mAP@50:95. This indicates that the added ARConv enables the network to accommodate the rotational characteristics of the detected objects more effectively and to extract relevant features from images through internal adaptive tuning of the convolutional kernels, thus strengthening feature extraction across the entire backbone network and achieving more precise target localization. Specifically, owing to the growing environment and shooting angle of apples against complex backgrounds, the imaged fruit frequently appears with some degree of rotation. The rotational adaptivity of CARConv ensures that the model still recognizes rotated apples in complicated environments, thereby enhancing detection performance.
3.6. Comparative Experiments on Attention Mechanisms
To evaluate the enhanced performance of the AFGCA attentional module, we substituted AFGCA and AFGCAM with alternative attention mechanisms such as EMA, GAM, CBAM, SE, and SimAM. These five mechanisms, known for their simplicity and efficiency, have been extensively adopted in recent research. By comparing them with AFGCA, we comprehensively assessed the strengths and limitations of integrating AFGCAM into the YOLO11 model. The results of the comparison between AFGCA and other attention mechanisms are presented in
Table 6. The results show that all the evaluation indexes of AFGCAM are better than those of the other common attention mechanisms. Specifically, after adding the EMA, CBAM, SE, and SimAM attention mechanisms, the mAP@50:95 of the model was slightly improved, by 0.3%, 0.4%, 0.2%, and 0.5%, respectively, while the mAP@50 changed by no more than 0.2%. Notably, the model with the GAM attention mechanism reached 3.2 M parameters, while its mAP@50 and mAP@50:95 instead decreased by 0.1% and 0.8%, which suggests that adding an attention mechanism does not always guarantee enhanced model performance, particularly for apple-target detection under complex lighting. In contrast, adding the AFGCA and AFGCAM attention mechanisms only slightly increased the number of parameters, to 2.6 M. The AFGCA model exhibited a 0.6% and 1.1% improvement in mAP@50 and mAP@50:95, while the AFGCAM model showed enhancements of 0.7% and 1.2% in the same metrics.
AFGCA’s notable accuracy improvement stems from its ability to interact with both local and global feature map information, thereby enhancing feature representation effectively. Specifically, the extraction of local information allows the model to focus more on the features of small targets in the image; the extraction of global information enables the model to pay more attention to the position of targets in the image relative to the whole image. By fully mining and fusing the global and local information of the feature map, the model’s sensitivity to the target location is effectively enhanced, resulting in more accurate localization results. The maximum pooling operation GMP introduced in AFGCAM can fully integrate multi-scale feature information, thereby enhancing the model’s capability to characterize targets across various scales. From the results, all the accuracies of the AFGCAM attention mechanism are further improved compared to the AFGCA attention mechanism.
3.7. Ablation Experiments
The YOLO11-ARAF proposed in this study is built upon YOLO11 with two different improvements. In order to verify each enhancement module’s effectiveness in YOLO11-ARAF, we organized various combinations of the two modules and performed ablation studies on the specified dataset. The experimental outcomes are presented in
Table 7.
As indicated in
Table 7, the model with only the CARConv module saw a 0.5% and 1.2% boost in mAP@50 and mAP@50:95, respectively, and the model incorporating only AFGCA exhibited a 0.2% and 0.9% enhancement in the same metrics. The model improved by adding both the CARConv layer and AFGCA scores higher than the YOLO11-ARConv and YOLO11-AFGCAM models, and improves on the YOLO11 baseline by 0.3%, 1.1%, 0.72%, 0.6%, and 2% in P, R, F1-score, mAP@50, and mAP@50:95, respectively. This indicates that the improvement modules remain effective when combined. It is worth noting that the model including both CARConv and AFGCAM achieves a 2% improvement in mAP@50:95, alongside a notable reduction in computational complexity.
3.8. Discussion and Limitation
While the YOLO11-ARAF model achieves impressive accuracy and other metrics, its parameter size remains larger than typical convolutional models and attention mechanisms. Although knowledge distillation was used in this study for model lightweight improvement, the feature extraction part can still perform model pruning to eliminate redundant feature extraction layers to improve computational efficiency. Model pruning, as well as finding more lightweight distillation student models with different knowledge distillation algorithms, are possible future directions for improvement and extension. A limitation of this study is the use of a single apple variety in our dataset. While this allowed a focused investigation on complex lighting and environmental challenges from the robotic arm’s perspective, it may restrict the model’s generalization capabilities. Specifically, its performance could be affected when encountering other apple varieties with significantly different visual characteristics (e.g., color, shape, texture). Moreover, the model’s applicability to other fruit types is likely limited, as the learned features are inherently apple-specific. Orchard conditions varying with different crop types could also present further generalization challenges. Future research will involve collecting diverse apple samples to retrain the model and expand its feature recognition capabilities.
4. Conclusions
This study introduces an enhanced apple detection model, YOLO11-ARAF, built upon YOLO11n and designed to tackle the challenges of inaccurate detection and limited adaptability in complicated orchard settings. First, we built an apple image dataset for complex orchard environments, collecting a total of 3942 images. To enhance apple detection in complex backgrounds, the CARConv module was adopted in place of the original C3K2 module. Next, we upgraded the AFGCA module to the AFGCAM attention mechanism, which was integrated into the Backbone and Neck of the YOLO11 model. The addition of AFGCAM allowed the model to better attend to the global and local information of the feature map, thereby improving its feature extraction capability. Finally, the improved model was distilled into the YOLO11n model, boosting computational speed and efficiency while maintaining accuracy.
The experimental results show that the Precision, Recall, mAP@50, and mAP@50:95 of the YOLO11-ARAF model are 89.4%, 86%, 92.3%, and 64.4%, respectively, which are 0.3%, 1.1%, 0.72%, and 2% higher than those of YOLO11. Distilling the improved model back into the original model yields 0.1 M fewer parameters and doubled FPS, enabling fast and accurate apple detection in complex orchard environments with limited computational resources. The lightweight algorithm developed in this study can serve as a valuable reference for real-time orchard-picking robot operations within the apple-detection domain.