1. Introduction
As a globally important economic crop, the apple has dominated the world in terms of planting scale and production [
1]. However, traditional picking operations are highly dependent on manual labor and face problems such as labor shortages, inefficiency, and a lack of picking and grading precision. In complex orchard environments especially, factors such as occlusion of apples, light variation, and variety diversity (e.g., the color difference between Red Fuji and Golden Marshal) pose serious challenges to the visual recognition capabilities of automated equipment. The development of modern agriculture pursues not only improved production efficiency but also pays increasing attention to its far-reaching impact on the ecological environment. This macro-level consideration of agricultural sustainability places higher demands on precision agriculture technology and automated fruit picking, namely how to improve operational efficiency while also contributing to broader agro-ecological benefits [
2]. Therefore, the development of a high-precision, lightweight, adaptable vision inspection system has become a core breakthrough in promoting the landing of apple-picking robots.
Early applications of visual inspection for agricultural fruits relied on manually extracted features based on the morphological, chromatic, and textural characteristics of the object, followed by machine learning algorithms, such as Support Vector Machines, for image-level classification. Such classical machine learning pipelines have obvious limitations: the handcrafted features are rarely reusable when fruit categories change or the orchard environment differs, and, under complex backgrounds and lighting, they generalize poorly. More recently, deep-learning feature extraction methods have been adopted for fruit-detection applications. Deep learning automates feature extraction from complex images through data-driven learning, enabling simultaneous classification and localization. Unlike conventional approaches, deep learning autonomously learns discriminative features from large-scale training data, enabling more efficient and robust detection.
In addition to visual object detection, modern agricultural sensing technologies—such as 3D imaging for plant phenotyping and spectral analysis for early stress detection—have enhanced environmental perception and plant monitoring [
3,
4]. For example, He et al. [
5] explored the possibility of using polarization information to infer plant growth states by studying the multispectral polarized bidirectional reflectance properties of plant canopies. However, for practical automation tasks like robotic harvesting, robust and efficient fruit detection remains essential. Especially in dynamic orchard environments, real-time systems must balance detection accuracy with computational efficiency. High precision is critical for tasks such as grasping and yield estimation, yet achieving it often increases model complexity. Conversely, optimizing for speed and resource use may reduce detection performance. Designing models that strike this accuracy-efficiency trade-off is thus a central challenge in agricultural computer vision. Xia et al. [
6] proposed a physically informed neural network-based approach aimed at improving the potential of deep learning models’ generalization capabilities in complex environments. Huang et al. [
7] proposed a method based on adaptive denoising and a texture decomposition attention mechanism, which exhibits strong robustness under strong light interference. Kang and Chen [
8] developed a multi-task deep neural network (DaSNet-v2) capable of simultaneously processing fruit and branch segmentation to enhance the visual perception of a robot in an orchard environment. Wang et al. [
9,
10,
11] proposed an adversarial training method based on semi-supervised variational autoencoders, which significantly improved classification accuracy on a complex dataset. In the field of camera-based 2D image detection, two dominant branches have emerged: computationally efficient single-stage networks (the YOLO family) and region-proposal-based two-stage systems. Among them, YOLO (You Only Look Once) has become the mainstream framework for agricultural robot vision systems, owing to its unified design that maintains detection accuracy while meeting real-time requirements. Earlier studies based on models such as YOLOv5 and YOLOv7 improved apple detection in orchard environments through multi-scale feature fusion. However, these models still face the following challenges: difficulty in detecting long-distance or occluded targets; redundant model parameters and high computational overheads; and insufficient environmental adaptability.
In recent years, apple target detection research has focused on two core challenges: adaptability to complex orchard environments and lightweight model design. In terms of lightweight design, Wang et al. [
12] developed a channel-pruned YOLOv5s variant that achieves 92.7% parameter reduction (retaining only 7.3% of original parameters) while preserving an 87.6% recall rate, with the compressed model size reduced to 1.4 MB. Ma et al. [
13] improved YOLOv7-tiny through the fusion of the ULSAM attention mechanism and P2BiFPN features, raising the small-target apple detection mAP to 80.4% at an inference speed of 58.6 FPS. For improved occlusion resistance, Chen et al. [
14] added a new 160 × 160 feature layer and integrated the CBAM attention mechanism into YOLOv7 to address the feature confusion caused by branch and leaf occlusion, resulting in a ripeness detection mAP of 87.1% (a 4.3% improvement over the baseline). Zhang et al. [
15] enhanced YOLOv5’s occlusion-handling capability by substituting the original GIoU loss with a CIoU loss function specifically designed for occluded targets, thereby improving detection accuracy in occlusion scenarios. Gao et al. [
16] developed SRN-YOLO as an enhanced version of YOLOv7, incorporating three key architectural improvements: (1) a specially designed SResNet module to preserve fine-grained gradient information effectively, (2) a recursive feature pyramid network (RFPN) structure to maintain feature integrity during multi-scale fusion, and (3) a novel NWD-CIoU loss function for precise bounding-box regression. This comprehensive optimization framework achieves 81.2 mAP and 71.6 mAP on two benchmark datasets, respectively. M et al. [
17] proposed YOLOv4-NLAM-CBAM to enhance region-of-interest perception by introducing the NLAM and CBAM attention mechanisms, reaching 97.2% mAP and a 91.2% F1 score. Zhang et al. [
18] enhanced the YOLOv4 model’s recognition of small objects through its hybrid GhostNet-Attention feature extractor and DWSConv-based head design. The proposed modifications yielded a performance with a 95.72% mAP score. Yan et al. [
19] embedded the SE visual attention module after the C2f backbone components to improve target-specific feature representation and introduced the Dynamic Snake convolution layer into the Neck structure to strengthen the feature-capturing ability for irregular dendritic structures; the apple-recognition precision reached 99.6%, the recall 96.8%, and the mean average precision (mAP) 98.3%, all significantly better than baseline models such as YOLOv8n and YOLOv5s in the apple detection task.
Although the above improvements have raised the detection performance of the YOLO series, limitations remain for real-time detection across wide orchard areas. For example, the increased model size and computational overhead of improved YOLOv8 variants make efficient deployment on edge devices difficult; YOLOv5 and YOLOv7 may perform poorly under complex lighting conditions such as overexposure, backlighting, and nighttime fields of view because they are trained mainly on sunny-weather image data; and YOLOv8, despite the introduction of Squeeze-and-Excitation attention mechanisms, still shows constrained effectiveness against intricate branch occlusions and suboptimal performance in complex lighting, ultimately reducing its practical applicability in real-world orchard environments.
In summary, current research on apple-target detection for automated agricultural picking still faces challenges such as low detection accuracy for small targets and for scenarios with complex lighting and occlusion; moreover, improved detection models often involve large numbers of parameters, which hinders their deployment on the edge devices commonly used in agricultural operations. To address these problems, we constructed a mixed apple dataset collected from real orchards under multiple lighting and background conditions. Utilizing this dataset, we developed an efficient and lightweight detection framework, YOLO11-ARAF, built upon the YOLO11 architecture and aimed at enhancing detection performance in complex orchard scenarios. The contributions of this paper are as follows:
(1) Based on the YOLO11 framework, an improved attention mechanism (AFGCAM) and a rotational convolution module (CARConv) were introduced to replace and enhance the original backbone and neck components. These modifications led to performance enhancements of 0.3% in Precision, 1.1% in Recall, 0.72% in mAP@50, and 2.0% in mAP@50:95 metrics. Furthermore, the enhanced model was distilled into a lightweight version, YOLO11n, which features significantly fewer parameters, a higher detection speed, and a lower computational cost—achieving high detection efficiency without compromising accuracy;
(2) The research constructed an apple imagery dataset featuring diverse environmental conditions, including varied illumination and viewing perspectives. This dataset comprises 3942 annotated images specifically designed to enhance model robustness and adaptability;
(3) Ablation studies and comparative evaluations demonstrate that YOLO11-ARAF outperforms other target detection methods in overall performance while exhibiting better adaptability in complex environmental conditions.
The paper’s organization proceeds as follows: The proposed algorithmic improvements are described in
Section 2. The experimental results and their analysis are presented in
Section 3. Concluding remarks and future work are provided in
Section 4.
2. Methodology
2.1. Overview of YOLO11-ARAF
Our work builds upon YOLO11, the official version released by Ultralytics in October 2024 [
20], as the base detection framework. As shown in
Figure 1, after apple-image data acquisition in the orchard, the overall workflow includes dataset construction, data annotation, and dataset division; raw images are then processed by the YOLO11-ARAF architecture to generate bounding-box predictions for target apples. Because the improvements increase the model's parameters and computational complexity, we employed response-based knowledge distillation to distill the YOLO11-ARAF model into YOLO11n, which improves computational efficiency while maintaining accuracy, meeting the computational requirements of real-time detection.
The selected target detection framework was YOLO11, whose network structure mainly contains three components: Backbone, Neck, and Detect. The backbone network progressively constructs multi-scale feature representations from raw pixels, preserving fine-grained details while abstracting high-level semantic content through its hierarchical architecture. The Neck fuses and enhances the features extracted by the Backbone and optimizes the detection of small and large targets through multi-scale feature fusion. The detection head transforms the multi-scale feature representations into final predictions through coordinate regression and classification. The main feature extraction capacity of the YOLO model therefore lies in the backbone and neck, and enhancing these two components leads to superior feature representation learning. We propose the AFGCAM attention module, based on the AFGCA attention mechanism, and add it to the Backbone and Neck of the YOLO11 model to improve the extraction of global and local information during feature extraction. In addition, ARConv rotational convolution is added to the Backbone, and the CARConv module is proposed to replace the C3K2 module, enhancing the model's detection performance for rotated targets.
We propose YOLO11-ARAF, a novel detection architecture uniquely designed for apple detection in complex orchard environments. This framework introduces two key innovations to the YOLO11 baseline. First, we present CARConv, a specialized convolutional block that uniquely integrates Adaptive Rotational Convolution (ARConv) by strategically replacing the Bottleneck component within YOLO11’s C3k2 modules. This targeted integration is designed to specifically enhance the model’s feature representation capabilities for apples exhibiting diverse orientations due to natural growth patterns and occlusions. Second, we introduced AFGCAM, a novel attention mechanism that significantly enhances the original Adaptive Fine-Grained Channel Attention (AFGCA) by incorporating Global Max Pooling (GMP) alongside Global Average Pooling (GAP). This unique dual-pooling strategy within AFGCAM is specifically devised to improve feature discriminability under challenging and variable orchard illumination conditions by capturing a richer set of channel statistics. The synergistic integration of our specifically designed CARConv and the novel AFGCAM module into the backbone and neck of YOLO11 results in a distinctive architecture optimized for robust apple detection.
2.2. CARConv Module
Considering that the orientation of objects such as apples and branches in real orchard scenes can vary greatly in angle within the viewfinder frame, which poses a challenge to standard object detection models such as YOLO11, we strategically integrated the ARConv module into the C3K2 convolutional layer. This convolutional improvement enhances the model’s capacity for learning discriminative representations from apples presenting different angles in the orchard, which facilitates more accurate apple detection.
When an object is rotated arbitrarily, the model may exhibit degraded performance in detecting and classifying it. To address this challenge, Pu et al. [
21] proposed the adaptive rotational convolution (ARConv) module, which aims to enhance orientation-variant object-detection capability in images. The core idea is to adaptively rotate the convolution kernel according to the object’s principal rotation angle in the input image. This is achieved by predicting the rotation angle and the routing function of the convolution kernel combination weights. The rotated convolution kernel is then combined using the predicted weights to generate the final convolution kernel applied to the input feature map. The specific convolution module is shown in
Figure 2.
The ARConv module, depicted in Figure 2, enhances rotational adaptivity. It employs a routing function that takes the input image features $x$ to predict a set of $n$ rotation angles $\{\theta_1, \ldots, \theta_n\}$ and $n$ combination weights $\{\lambda_1, \ldots, \lambda_n\}$. Each of the $n$ base convolutional kernels $W_i$ is then rotated by its corresponding angle $\theta_i$. The final output feature map $y$ is obtained by convolving the input features with a weighted sum of these $n$ rotated kernels, where the weights are the predicted $\lambda_i$. This allows the convolution to dynamically adapt to the orientation of features in the input. Full details can be found in Pu et al. [21]. The equations in Figure 2 are illustrated as follows.
The convolutional kernel rotation mechanism equation is
$$\tilde{W} = \mathrm{Rotate}(W; \theta),$$
where $W$ is the original convolutional kernel, $\theta$ is the rotation angle, and $\tilde{W}$ is the rotated convolutional kernel.
The routing function equation is
$$(\theta_1, \ldots, \theta_n; \lambda_1, \ldots, \lambda_n) = f(x),$$
where $f(\cdot)$ is the routing function, $x$ is the image feature, $\{\theta_i\}_{i=1}^{n}$ is the set of predicted rotation angles, and $\{\lambda_i\}_{i=1}^{n}$ is the set of predicted combination weights.
The adaptive rotation convolution module equation is
$$y = x * \sum_{i=1}^{n} \lambda_i \tilde{W}_i,$$
where $\lambda_i$ denotes the combination weight of the $i$-th convolutional kernel, $*$ denotes the convolution operation, and $y$ is the combined output feature.
The Bottleneck module in the C3K2 module has many advantages in feature extraction, but it also has potential shortcomings and limitations. The 1 × 1 convolution in the Bottleneck module is used to reduce the feature map's channel dimension, and this reduction compresses the amount of information in the feature map, especially if the feature map itself has a small number of channels. This compression may discard discriminative features, thereby degrading the model's detection accuracy, particularly in complex scenes or small-object recognition tasks. To compensate for these limitations of the C3K2 module, we introduced the ARConv module to replace the original Bottleneck module. By dynamically adjusting the orientation of its convolutional kernels, ARConv augments the backbone's representational learning capacity. This is well suited to detection scenarios in which targets appear at varied rotations, enabling more effective multi-angle feature representation learning. We substituted the Bottleneck module in the C3k2 layer with the ARConv module and named the result CARConv. This specific architectural modification represents a novel approach to integrating rotational adaptivity directly and deeply within YOLO11's core feature extraction blocks, rather than as a more generic add-on. By strategically embedding ARConv in this manner, our CARConv module is designed to enhance the backbone's capability to learn discriminative features for apples presenting diverse and challenging orientations in complex orchard environments. This enhanced CARConv block, capable of adaptively adjusting its convolutional kernels, therefore facilitates more accurate feature learning for such rotationally varied targets. The resulting structure of this substitution is illustrated in
Figure 3.
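To make the mechanism concrete, the following is a minimal PyTorch sketch of the adaptive rotated convolution idea that CARConv embeds in the C3k2 block. It is an illustrative sketch rather than the authors' implementation: the class and parameter names (e.g., SimpleARConv, n_kernels) are assumptions, and the per-sample loop is kept only for readability.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SimpleARConv(nn.Module):
    """Illustrative adaptive rotated convolution (after the idea in Pu et al. [21]).

    A routing head predicts n rotation angles and n combination weights from the
    input features; the n base kernels are rotated accordingly and combined into
    a single kernel that is applied to the input feature map.
    """

    def __init__(self, channels, kernel_size=3, n_kernels=4):
        super().__init__()
        self.n, self.k = n_kernels, kernel_size
        # n base kernels of shape (C_out, C_in, k, k)
        self.weight = nn.Parameter(
            torch.randn(n_kernels, channels, channels, kernel_size, kernel_size) * 0.02
        )
        # routing head: GAP -> linear -> (n angles, n combination weights)
        self.route = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(channels, 2 * n_kernels)
        )

    def _rotate(self, w, theta):
        # rotate every k x k kernel plane by angle theta via a sampling grid
        co, ci, k, _ = w.shape
        cos, sin = torch.cos(theta), torch.sin(theta)
        rot = torch.stack([
            torch.stack([cos, -sin, torch.zeros_like(cos)]),
            torch.stack([sin, cos, torch.zeros_like(cos)]),
        ]).unsqueeze(0)                                        # (1, 2, 3) affine matrix
        grid = F.affine_grid(rot, size=(1, 1, k, k), align_corners=False)
        planes = w.reshape(co * ci, 1, k, k)
        grid = grid.expand(co * ci, -1, -1, -1)
        return F.grid_sample(planes, grid, align_corners=False).reshape(co, ci, k, k)

    def forward(self, x):
        routing = self.route(x)                                # (B, 2n)
        angles = torch.tanh(routing[:, : self.n]) * 3.1416     # predicted rotation angles
        lam = torch.softmax(routing[:, self.n:], dim=1)        # combination weights
        outputs = []
        for i in range(x.size(0)):                             # per-sample adaptive kernel
            w = torch.stack([
                lam[i, j] * self._rotate(self.weight[j], angles[i, j])
                for j in range(self.n)
            ]).sum(0)
            outputs.append(F.conv2d(x[i : i + 1], w, padding=self.k // 2))
        return torch.cat(outputs, dim=0)
```

In CARConv, a block of this kind takes the place of the Bottleneck inside the C3k2 module, so the rotated-kernel combination operates directly on the backbone's intermediate feature maps.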
2.3. AFGCAM Module
In recent years, attention mechanisms such as SE and CBAM have been widely used in computer vision, especially in deep learning models, to improve the feature discriminability of network structures. The complex lighting of real orchard scenes makes it harder for the model to extract target features. Therefore, we introduce the AFGCA attention mechanism to enhance the model's feature learning capacity for lower-quality images.
The AFGCA adaptive fine-grained channel attention mechanism proposed by Sun et al. [
22] was initially applied as a design framework in the field of image defogging to improve image quality. Similarly, under the complex light interference of a real orchard environment, light that is too strong or too weak can degrade the quality of the picture captured by the camera, which in turn interferes with the model's feature representation capability. This inspired us to add AFGCA to the feature-extraction framework of the YOLO model to help it better learn the features of target apples in complex lighting and backgrounds. The architectural framework of the AFGCA attention mechanism (
Figure 4) was initially designed for the feature reconfiguration phase of the end-to-end denoising architecture, as well as being a key component in the synthetic network responsible for parameter estimation. This design strategically exploits the adaptive reconfiguration capabilities of the network by optimizing the feature transformations to effectively utilize macro-scale contextual relationships and micro-scale channel interactions.
The AFGCA module (illustrated in Figure 4, left) initiates its computational workflow by applying global average pooling (GAP) to the input feature tensor $X$ to obtain channel-wise statistical descriptors. These descriptors are then utilized to model both local and global inter-channel relationships, typically forming a channel correlation matrix. From this, adaptive channel attention weights, $w$, are derived through a dynamic fusion process involving learnable factors. Finally, these learned weights $w$ are applied channel-wise to the original input feature tensor, $X$, to produce the refined feature representation $X'$. The specific calculation formulas outlining these steps are detailed below:
$$y_n = \mathrm{GAP}(X_n) = \frac{1}{H \times W}\sum_{i=1}^{H}\sum_{j=1}^{W} X_n(i, j),$$
$$w = \sigma\big(\alpha\, w_g + (1 - \alpha)\, w_l\big), \qquad X' = w \odot X,$$
where $X \in \mathbb{R}^{C \times H \times W}$ denotes the input feature tensor with $C$ channels, height $H$, and width $W$. $y_n$ is the channel descriptor for the $n$-th channel, derived from the feature values $X_n(i, j)$ at the spatial position $(i, j)$. $\mathrm{GAP}(\cdot)$ signifies the Global Average Pooling function, yielding the channel descriptor vector $y$. Local channel information is represented by $y_l$, obtained using a band matrix with weight coefficients over $k$ neighboring channels, while $y_g$ represents global channel information. These interact to form the channel correlation matrix $M$. From $M$, global ($w_g$) and local ($w_l$) channel attention weight vectors are derived. The final combined channel attention weights $w$ are produced through dynamic fusion involving a learnable factor $\alpha$. The Sigmoid activation function is denoted by $\sigma$. Finally, $X'$ is the refined output feature tensor, resulting from the element-wise multiplication $\odot$ of $w$ with the input feature tensor $X$.
While AFGCA offers a strong foundation for channel attention, to further bolster feature learning, particularly for apples under the diverse and often suboptimal lighting conditions found in orchards, we propose AFGCAM, a novel and enhanced attention module. To preserve additional feature details, we have added a global maximum pooling (GMP) operation to the AFGCA module for optimization. Our key innovation in AFGCAM is the unique integration of a Global Max Pooling (GMP) pathway operating in parallel with the original Global Average Pooling (GAP) pathway before the channel attention weights are derived (see
Figure 4, right). The feature maps after the GMP and GAP operations are then summed, element by element, at the corresponding positions in the channel dimension. Whereas global average pooling (GAP) softens the features and retains the overall pattern, global maximum pooling (GMP) assists in identifying fine-scale and local peaks, thereby accentuating object characteristics. Compared to a single-path global maximum pooling (GMP) operation, the fusion of both pooling operations can comprehensively consider features at multiple scales, enhancing the model's capacity to detect and represent small-scale occluded object features. We have integrated the refined AFGCAM module into the YOLO11 model to strengthen its feature extraction capabilities. The internal architecture of the AFGCAM is shown in
Figure 4.
This dual-pooling strategy is a distinctive feature of AFGCAM. It is designed to capture a more comprehensive range of channel-wise statistics: GAP effectively summarizes the overall contextual features and background information, while GMP excels at identifying and preserving the most salient local features and peak activations, which correspond to important object characteristics. By fusing these complementary statistics, AFGCAM enables more robust and discriminative feature refinement, especially for visually challenging targets, such as apples under varying illumination or those that are small or partially occluded.
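A minimal PyTorch sketch of this dual-pooling channel attention is given below. It is only illustrative and assumes a simplified form of the AFGCA computations (a 1-D convolution standing in for the band-matrix local interaction and a fully connected layer for the global interaction); the class name AFGCAMSketch and all hyperparameters are ours, not the authors'.

```python
import torch
import torch.nn as nn


class AFGCAMSketch(nn.Module):
    """Illustrative AFGCAM-style channel attention with fused GAP + GMP descriptors."""

    def __init__(self, channels, k=3):
        super().__init__()
        # local cross-channel interaction over k neighbouring channels ("band matrix")
        self.local = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)
        # global cross-channel interaction over all channels
        self.global_fc = nn.Linear(channels, channels, bias=False)
        self.alpha = nn.Parameter(torch.tensor(0.5))   # learnable fusion factor
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):                              # x: (B, C, H, W)
        b, c, _, _ = x.shape
        gap = x.mean(dim=(2, 3))                       # averaged (contextual) statistics
        gmp = x.amax(dim=(2, 3))                       # peak (salient) statistics
        y = gap + gmp                                  # element-wise fusion of descriptors
        w_local = self.local(y.unsqueeze(1)).squeeze(1)
        w_global = self.global_fc(y)
        w = self.sigmoid(self.alpha * w_global + (1 - self.alpha) * w_local)
        return x * w.view(b, c, 1, 1)                  # channel-wise reweighting
```

The only change relative to a GAP-only variant is the extra `amax` descriptor and its element-wise fusion with the averaged one, which is the behavior AFGCAM adds on top of AFGCA.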
2.4. YOLO11-ARAF Network
The enhanced overall model architecture is depicted in
Figure 5. YOLO11-ARAF (YOLO11-ARConv-AFGCAM) is structured into three components: Backbone, Neck, and Head. This enhanced architecture derives its distinct advantages from the unique and synergistic interplay of our specifically designed CARConv modules and the novel AFGCAM attention mechanism. As detailed in
Section 2.2 and
Section 2.3 respectively, CARConv imbues the network with robust rotational adaptivity, while AFGCAM significantly refines feature discriminability, especially under challenging visual conditions. These modules are strategically embedded within the backbone and neck components of the YOLO11 framework. This holistic and specific configuration is engineered to cooperatively enhance the model’s capability for accurate and efficient apple detection in complex real-world orchard environments. For instance, in optimizing the Backbone and Neck parts, the introduction of our CARConv layer (replacing the original C3K2 layer) significantly improves the detection ability for apples at any direction and angle. Simultaneously, the AFGCAM module, an attention-guided feature learning component, is incorporated into both the Backbone and Neck to fuse global and local feature map information, thereby enhancing the model’s feature extraction and processing capabilities for complex scenes.
2.5. Response-Based Knowledge Distillation
To address the increased model parameters and the resulting increased computational overhead due to the introduction of the ARConv and AFGCAM modules, we used knowledge distillation [
23]. As shown in
Figure 6, the improved model with higher accuracy was used as the teacher model, while the knowledge was distilled into the smaller, faster, and more computationally efficient student model (YOLO11n). This ensured that the student model maintains higher detection accuracy while maintaining the lightweight advantage required for deployment on edge devices in agricultural environments.
We employed a response-based knowledge distillation approach that uses the teacher model's output responses as soft labels. This allowed the student model to learn from the teacher's refined responses, improving detection accuracy and generalization. During distillation, a compact student model is trained to emulate the performance of a more complex teacher model. The teacher's outputs (soft labels) contain more fine-grained information than one-hot hard labels, providing intermediate supervision that helps the student model learn more effectively. In our study, the teacher model is the YOLO11-ARAF model with the ARConv and AFGCAM modules, and its output was used to direct the training of the slimmer YOLO11n student model. The mean squared error loss was employed to compute the loss values. By minimizing a distillation loss that quantifies the discrepancy between student predictions and teacher soft labels, the student model learns to emulate the teacher's performance.
This approach retains the improved detection accuracy of the teacher model while ensuring that the student model is lightweight and efficient enough to be deployed on edge devices in the orchard. The experimental results demonstrate the efficacy of the proposed methodology, showing that the student model maintains high detection accuracy at a reduced computational cost. In the figure, $Cls^{T}$ is the classification confidence of the teacher model for each anchor: each anchor corresponds to a category score indicating the probability that the anchor belongs to a certain category. $Box^{T}$ is the bounding-box regression output of the teacher model for each anchor: each anchor point corresponds to four regression values indicating the location and size of the target. The student model's outputs $Cls^{S}$ and $Box^{S}$ are defined in the same way.
In order to calculate the object scale metric, the teacher's classification output $Cls^{T}$ is first passed through the sigmoid activation layer:
$$P^{T} = \sigma\big(Cls^{T}\big).$$
Then the Maxscore is calculated as the maximum class probability of each anchor:
$$\mathrm{Maxscore} = \max_{c} P^{T}_{c}.$$
The Maxscore is afterward reshaped so that it can broadcast over the classification and regression outputs, yielding the dynamically adjusted weight $w$ used below.
The object scale metric is used to calculate the classification and bounding-box loss, thus facilitating knowledge transfer from the teacher to the student model. The loss function weights can be adaptively modulated to prioritize high-confidence prediction areas during knowledge distillation. It is shown below.
(1) The bounding-box loss equation is
$$L_{box} = \frac{1}{B \times A}\sum_{b=1}^{B}\sum_{a=1}^{A} w_{b,a}\,\frac{1}{D}\sum_{d=1}^{D}\big(Box^{S}_{b,a,d} - Box^{T}_{b,a,d}\big)^{2},$$
where $B$ is the batch size, $A$ is the number of anchor points, $D$ is the dimension of the bounding-box regression, $Box^{T}_{b,a,d}$ denotes the teacher model's bounding-box regression output, $Box^{S}_{b,a,d}$ is the corresponding bounding-box prediction of the student model, and $w_{b,a}$ is the dynamically adjusted weight.
(2) The classification loss equation is
$$L_{cls} = \frac{1}{B \times A}\sum_{b=1}^{B}\sum_{a=1}^{A} w_{b,a}\,\frac{1}{C}\sum_{c=1}^{C}\big(P^{S}_{b,a,c} - P^{T}_{b,a,c}\big)^{2},$$
where $C$ represents the total class count, $P^{S}_{b,a,c}$ is the classification score of the student model, $P^{T}_{b,a,c}$ is the classification score of the teacher model, and $w_{b,a}$ is the dynamically adjusted weight.
The following equation is used for the calculation of the total loss:
$$L_{total} = L_{cls} + L_{box}.$$
The SGD optimizer was used to minimize losses during distillation in this study.
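The weighting and loss computation described above can be summarized in a short PyTorch sketch. It follows the equations in this section under stated assumptions about tensor shapes; the function name and shapes are illustrative, not the authors' code.

```python
import torch


def response_distillation_loss(stu_cls, stu_box, tea_cls, tea_box):
    """Response-based distillation loss sketch.

    stu_cls, tea_cls: (B, A, C) raw classification outputs per anchor point
    stu_box, tea_box: (B, A, D) bounding-box regression outputs per anchor point
    The teacher's maximum class confidence per anchor acts as the object scale
    weight w that modulates both MSE terms.
    """
    tea_prob = torch.sigmoid(tea_cls)            # sigmoid activation of teacher scores
    w = tea_prob.amax(dim=-1, keepdim=True)      # Maxscore, reshaped for broadcasting

    loss_cls = (w * (torch.sigmoid(stu_cls) - tea_prob) ** 2).mean()
    loss_box = (w * (stu_box - tea_box) ** 2).mean()
    return loss_cls + loss_box                   # total distillation loss
```

During training, this term is minimized with SGD (typically alongside the student's standard detection loss), so the soft teacher responses guide the lighter YOLO11n model.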
3. Experimental Results and Analysis
3.1. Evaluation Indicators
The following evaluation metrics were used for the YOLO11-based models:
1. Precision
Precision measures the proportion of correctly predicted positive instances among all samples predicted as positive. It reflects the model's accuracy in identifying positive targets. The formula is as follows:
$$\mathrm{Precision} = \frac{TP}{TP + FP},$$
where TP (True Positive) is the number of samples correctly identified as positive, and FP (False Positive) refers to the number of negative samples incorrectly classified as positive;
2. Recall
Recall assesses the model's ability to correctly detect all actual positive samples. It indicates how well the model captures true positives. The calculation is given by
$$\mathrm{Recall} = \frac{TP}{TP + FN},$$
where FN (False Negative) denotes the number of positive samples that the model failed to identify;
3. F1-Score
The F1-score is the harmonic mean of precision and recall, used to measure the comprehensive performance of the model. Its calculation formula is
$$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}};$$
4. Average Precision, AP
Average Precision (AP) is used to calculate the average accuracy under different categories, which demonstrates the model's robust performance across varying thresholds. For the target detection task, the AP is usually obtained through the integration of the precision-recall curve. Its calculation formula is
$$AP = \int_{0}^{1} P(R)\, dR,$$
where $R$ is the recall and $P(R)$ is the precision as a function of recall;
5. Mean Average Precision, mAP
The mean average precision is the average of the per-category AP values in a multi-category problem and indicates the model's overall detection capability across multiple object categories. Its calculation formula is
$$mAP = \frac{1}{N}\sum_{i=1}^{N} AP_i,$$
where $N$ denotes the total category count, and $AP_i$ represents the average precision for the $i$-th class;
6. FPS
FPS indicates the inference speed of the model, representing how many frames the model can process per second. It is a critical metric for assessing real-time performance. Its calculation formula is
$$FPS = \frac{1000}{\mathrm{Inference\ Time}},$$
where Inference Time refers to the time (in milliseconds) the model takes to process a single image.
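For reference, the metrics above can be computed with a few lines of Python; the numbers in the usage example are illustrative only.

```python
import numpy as np


def precision(tp, fp):
    return tp / (tp + fp)


def recall(tp, fn):
    return tp / (tp + fn)


def f1_score(p, r):
    return 2 * p * r / (p + r)


def average_precision(recall_pts, precision_pts):
    # AP as the area under the precision-recall curve
    return float(np.trapz(precision_pts, recall_pts))


def mean_average_precision(ap_per_class):
    # mAP: mean of the per-class average precisions
    return sum(ap_per_class) / len(ap_per_class)


def fps(inference_time_ms):
    # frames per second from a per-image inference time in milliseconds
    return 1000.0 / inference_time_ms


p, r = precision(tp=90, fp=10), recall(tp=90, fn=15)   # illustrative counts
print(f1_score(p, r), mean_average_precision([0.92, 0.88]), fps(16.5))
```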
For the experimental environment in this paper, the graphics card is an RTX 4070, the Python version is 3.10, the PyTorch version is 12.4, and the CUDA version is 12.6.
3.2. Dataset Construction
The dataset used in this study consists of 3942 images, as shown in
Figure 7. The initial resolutions of the collected images, including 1280 × 960 and 640 × 480 pixels, were standardized and pre-processed prior to model training. These images were collected from different orchards and agricultural environments to ensure a diverse presentation of apple targets under different conditions. The dataset consists of images of apples with different orientations, lighting conditions, and shading levels. This comprehensive dataset is essential for training and validating the proposed model.
The camera positions were all located at the reachable position of the robotic arm when the pictures were taken. The composed dataset was obtained by taking pictures in different orchards, using common varieties, different shooting angles, and different light and shading conditions. The samples of the dataset are large enough and complex enough to meet the requirements of orchard-picking conditions. In the actual orchard-picking environment, the growing conditions and distribution of apples are often not completely structured or idealized. Fruit distribution may be influenced by the natural growth pattern of trees, shading by branches and leaves, and light conditions, leading to randomness and complexity in fruit location. By collecting these images of apples with certain recognition difficulties, we aimed to improve the robustness, adaptability, and generalization ability of the model. Specifically, the model needs to be able to operate stably under complex lighting conditions, adapt to the appearance characteristics of different varieties of fruits, and accurately recognize target fruits under occlusion and background interference. This diverse dataset design not only helps the model learn more comprehensive features in the training phase, but also significantly improves the reliability and efficiency of the model in practical applications, thus meeting the actual needs of automated picking in orchards.
To ensure the model’s robustness for real-world complexities, the dataset was curated to include a broad spectrum of challenging visual conditions. As illustrated in
Figure 7, this encompasses varied illumination scenarios, including backlighting, overexposure, low-light/night, complex/dappled lighting, and frequent instances of natural occlusion by leaves, branches, and other apples. The number of images captured in the dataset under different lighting conditions is similar in proportion. The dataset also incorporates variations in object scale, with apples appearing at different distances from the camera, resulting in a range of apparent sizes. This includes instances of smaller, more distant fruits as well as larger, closer ones, preparing the model for detection across different scales commonly observed in orchard navigation and harvesting tasks.
We used the X-AnyLabeling software to label the target apples in the dataset one by one and stored the label files uniformly in YOLO format. The original images and the corresponding label files were divided into training, validation, and test sets in a ratio of 8:1:1. In the training stage, we used the stochastic gradient descent (SGD) method to optimize the learning rate. The model was trained with a batch size of 32 for a maximum of 300 iterations. All images were resized to a uniform input resolution of 640 × 640 pixels for the YOLO11 architecture, with the shorter dimension padded to maintain the aspect ratio and prevent object distortion. In addition, to enhance the variability of the dataset and improve the generality of the model, standard data augmentation techniques, including common geometric (e.g., flipping, rotation) and photometric (e.g., brightness, contrast) transformations as well as mosaic augmentation, were applied dynamically during the training phase.
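As a rough sketch of this training setup, an Ultralytics-style training call with the stated hyperparameters is shown below; the dataset YAML path and weight file are placeholders, the actual YOLO11-ARAF model would be loaded from its own configuration, and the stated 300 training iterations are interpreted here as epochs.

```python
from ultralytics import YOLO

# The dataset YAML (placeholder name) points to the 8:1:1 train/val/test split
# exported from the X-AnyLabeling annotations in YOLO format.
model = YOLO("yolo11n.pt")          # baseline weights; YOLO11-ARAF would use its own config
model.train(
    data="apple_orchard.yaml",      # placeholder dataset description file
    epochs=300,                     # maximum number of training epochs (assumption)
    batch=32,                       # batch size used in this study
    imgsz=640,                      # images resized/padded to 640 x 640
    optimizer="SGD",                # stochastic gradient descent, as described above
)
```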
3.3. Comparative Experiments
The detection results were compared with different models of the YOLO series, the RTDETR model of the end-to-end series, and the DNE model proposed by other scholars for apple target detection, as shown in
Table 1. Specifically, the YOLO series models benchmarked include several iterations such as YOLOv8, YOLOv10, our baseline YOLO11, and YOLO12, representing the ongoing advancements in this popular single-stage detector family known for its balance of speed and accuracy. In contrast, the RTDETR (Real-Time Detection Transformer) models, including the RTDETR-L and RTDETR-ResNet50 variants, exemplify end-to-end detection approaches built upon Transformer architectures, which have recently shown strong performance. Finally, DNE-YOLO [
24] represents a contemporary specialized model developed by other researchers specifically for apple detection in diverse natural environments, providing a relevant domain-specific benchmark.
It can be seen that, although YOLO11-ARAF has more parameters than YOLO11, its detection results are better than those of the other models, and its Precision, Recall, mAP@50, and mAP@50:95 were improved by 0.3%, 1.1%, 0.72%, and 2% compared to YOLO11, respectively. The Precision, Recall, mAP@50, and mAP@50:95 of each YOLO model were plotted as comparison curves, as in
Figure 8, which shows that all the indexes of YOLO11-ARAF are higher than those of the other models; in particular, the gain in mAP@50:95 is significantly larger, demonstrating enhanced robustness for apple detection in challenging orchard conditions. It is important to elaborate on the significance of the mAP@50:95 metric, which our YOLO11-ARAF model improved by 2% over the YOLO11 baseline, as shown in
Table 1. While mAP@50 primarily evaluates a model’s capability to correctly identify and broadly localize objects requiring an IoU of 0.5, the mAP@50:95 metric offers a more comprehensive and stringent assessment by averaging AP scores across a range of IoU thresholds from 0.50 to 0.95 in steps of 0.05. This means that a notable improvement in mAP@50:95, such as that achieved by YOLO11-ARAF, does not merely indicate better object detection in a general sense. Crucially, it suggests that the model exhibits enhanced performance, even when much higher localization accuracy is demanded, i.e., at stricter IoU thresholds like 0.75, 0.85, or 0.95. Therefore, the observed 2% gain in mAP@50:95 for YOLO11-ARAF signifies a more robust improvement in overall detection quality, encompassing both superior object recognition and, critically, more precise bounding box regression compared to the baseline. This enhanced localization accuracy is particularly vital for downstream applications, such as robotic grasping, in automated harvesting, where precise positioning of the detected apples is essential. While our mAP@50 also shows an improvement of 0.72%, the more substantial gain in mAP@50:95 underscores that the architectural enhancements in YOLO11-ARAF, including CARConv and AFGCAM, contribute significantly to refining the precision of object localization across a spectrum of IoU requirements, making it a more reliable model for real-world complex orchard environments.
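As a small illustration of how this stricter metric aggregates over IoU thresholds, the following helper (the function name is ours) simply averages per-threshold AP values over the 0.50-0.95 range in steps of 0.05:

```python
import numpy as np


def map_50_95(ap_at_iou):
    """mAP@50:95: mean of the AP values computed at IoU = 0.50, 0.55, ..., 0.95."""
    thresholds = np.linspace(0.50, 0.95, 10)
    assert len(ap_at_iou) == len(thresholds)
    return float(np.mean(ap_at_iou))
```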
To enhance model interpretability, we employed activation heatmaps to visualize YOLO11-ARAF’s detection process, explicitly revealing its region-specific attention patterns. As shown in
Figure 9, the visualization results show that YOLO11-ARAF locates the target apples more accurately than the other YOLO-series models. Specifically, the regions of interest of the YOLOv8, YOLOv10, and YOLO11 models are concentrated in background regions of complex branches that do not contain target apples, suggesting that these models are more susceptible to interference from complicated backgrounds, which degrades their apple-detection performance. For YOLO12, although most of the regions of interest fall accurately on the target apples, it failed to adequately attend to the low-illumination apple in the bottom-right quadrant, resulting in a missed detection. Taken together, YOLO11-ARAF produces better attention results.
To further evaluate the robustness of our proposed model, we selected two representative challenging-scene sub-validation sets from the existing validation set. The complex background sub-validation set consists of 168 images, mainly covering severe occlusion by branches and leaves as well as interference from non-fruit-tree objects. The complex lighting sub-validation set consists of 199 images selected specifically for lighting conditions (e.g., backlighting, extreme backlighting, and low light) that are common in real-world orchards. The trained YOLO11-ARAF model was evaluated on both subsets along with the main benchmark models. The detailed performance metrics, including the number of detected apple instances and mAP@50:95, are shown in
Table 2.
The results in
Table 2 show that our improved YOLO11-ARAF model not only detects a number of apple instances closer to the number of real labels under the two challenging conditions of complex backgrounds and complex lighting, but also significantly outperforms the compared baseline models on the key mAP metrics, with detection accuracy improved by 0.7% and 2% relative to the baseline model in the two complex environments, respectively. This indicates that the model offers higher detection performance in both complex environments, and particularly stronger anti-interference ability under complex lighting. This suggests that the YOLO11-ARAF model copes more effectively with interference caused by complex environmental factors (e.g., severe occlusion, object interference, and unfavorable lighting) and exhibits excellent adaptability and robustness in real orchard scenarios. Notably, a comparative analysis of these results shows that complex background conditions, which are mainly characterized by severe occlusion, usually pose a greater challenge to model detection performance than complex illumination conditions, which points to a potential direction for more targeted optimization for severe occlusion in future work.
3.4. Lightweighting Experiments
Since the improved model is larger, its parameters increased by nearly 0.1 M and its computational complexity by nearly 1 GFLOPs compared to YOLO11. From a practical application point of view, the model is generally deployed on edge devices, where a lightweight model is expected. Therefore, we adopted knowledge distillation to distill the improved model into the YOLO11n model.
To further investigate the adaptability of the knowledge captured by the enhanced YOLO11-ARAF teacher model, we also explored distilling it into student models built on other contemporary lightweight backbones. To this end, we replaced the original YOLO11 backbone with EfficientViT [
26], GhostHGNetV2 [
27], and MobileNetV4 [
28], respectively, and then applied the same distillation process.
Table 3 lists the comparative results of these experiments, where the upper part of the table shows the validation results of the YOLO series and YOLO11-ARAF models, and the lower part shows the comparative experimental results of the YOLO11 model and YOLO11-ARAF distilled to different models, respectively.
The analysis of
Table 3 shows that, while the knowledge in YOLO11-ARAF can be transferred to these different lightweight architectures, achieving the optimal balance of accuracy and efficiency depends heavily on the particular student backbone and its integration. For example, the YOLO11-ARAF-to-GhostHGNetV2 variant shows a competitive mAP@50:95 of 0.624, whereas other backbones, such as EfficientViT and MobileNetV4, when integrated into the YOLO11 framework and distilled from YOLO11-ARAF, reach mAP@50:95 scores of 0.619 and 0.625, respectively, with varying parameter counts and FPS trade-offs, as detailed in the table. These findings suggest that simply employing a different standalone lightweight backbone does not inherently guarantee superior post-distillation performance compared to a well-matched student architecture such as YOLO11n. Our primary distillation model, YOLO11-ARAF-to-YOLO11n, achieved 0.644 mAP@50:95 with good efficiency (2.56 M parameters, 76.1 FPS) and remains the most efficient lightweight configuration in our study. The distilled model has 0.1 M fewer parameters and 1 GFLOPs lower computational complexity than the improved model, and it more than doubles the FPS. Overall, the accuracy and efficiency of YOLO11-ARAF-to-YOLO11n were higher than those of the other distilled models, both in terms of parameters and computational complexity, which highlights the efficient synergistic effect of distilling knowledge from an enhanced teacher into a closely related and compatible lightweight student architecture.
FPS (Frames Per Second) is a significant index for measuring the inference speed of the model: it denotes the number of image frames the model can process per second, reflecting its real-time capability in practical applications. A higher FPS translates to faster inference and better real-time performance. The mAP@50:95 of the distilled model is maintained at 0.644, while its FPS is double that of the improved model and essentially matches the detection speed of the baseline model, indicating that the lightweight, improved model maintains high accuracy while improving efficiency.
Latency is an important metric that describes the efficiency of a model’s inference, specifically defined as the temporal interval for the model to generate output results (detection frames and categories) from receiving input data (images). Latency directly impacts the model’s responsiveness in real-time detection scenarios, especially in application scenarios that require a fast response. Latency is obtained by summing up the preprocess time, the inference time, and the postprocess time. As shown in
Table 4, the latency metrics show that the distilled YOLO11-ARAF-to-YOLO11n model achieves an inference time of 16.54 ± 3.41 ms, which is about 0.23 ms higher than that of the improved model and close to that of YOLO11 and YOLOv10, indicating that its inference is efficient and suitable for real-time applications.
The results after model distillation show that the computational speed of the network is significantly improved, but the reduction in model parameters after distillation is modest (0.1 M), because the parameters added by the new modules are themselves not particularly large. Generally speaking, higher model accuracy requires more detection structures to support feature extraction, which increases computational complexity; it is therefore difficult to achieve both high accuracy and a small model within the same architecture. In this study, since the overall parameter increase is small and the structure is not easy to lighten directly, this general, lightweight distillation method is well suited. Comparing the distilled model with other lightweight networks shows that distilling into the pre-improvement model both reduces the model size and preserves the accuracy gained by the improvements; this is, in effect, a balance between model accuracy and computational efficiency.
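For completeness, the latency figure reported above (preprocess + inference + postprocess) can be measured with a small timing helper; the three stage callables below are placeholders, not functions from the paper or any specific library.

```python
import time


def measure_latency_ms(preprocess, infer, postprocess, image, runs=50):
    """Average per-image latency (ms) over the three stages, plus the implied FPS.

    preprocess, infer, postprocess are placeholder callables for the pipeline stages.
    """
    start = time.perf_counter()
    for _ in range(runs):
        postprocess(infer(preprocess(image)))
    total_ms = (time.perf_counter() - start) * 1000.0
    latency = total_ms / runs
    return latency, 1000.0 / latency
```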
3.5. Comparative Experiments on Convolution Module
In an effort to assess the performance and contribution of the ARConv module embedded in the CARConv layer, the ARConv module was replaced with different convolutional kernels in the model, including PWConv, ShiftConv, and so on. By comparing ARConv with other convolutional modules, we could conduct a thorough assessment of the detection capabilities of the CARConv layer. The results of the comparison between ARConv and other convolutional modules are presented in
Table 5. The results indicate that ARConv surpasses the other conventional convolutional modules across all evaluation metrics, achieving a 0.5% increase in mAP@50 relative to the original model and an improvement of 1.2% in mAP@50:95. This indicates that the added ARConv enables the network to accommodate the rotational characteristics of the detected objects more effectively and to extract relevant features from images through internal adaptive tuning of the convolutional kernels, thus strengthening feature extraction across the entire backbone network and achieving more precise target localization. Specifically, owing to the growing environment and shooting angle of apples against complex backgrounds, the imaged fruit frequently appears with some degree of rotation. The rotational adaptivity of CARConv ensures that the model still recognizes rotated apples in complicated environments, thereby enhancing detection performance.
3.6. Comparative Experiments on Attention Mechanisms
To evaluate the enhanced performance of the AFGCA attentional module, we substituted AFGCA and AFGCAM with alternative attention mechanisms such as EMA, GAM, CBAM, SE, and SimAM. These five mechanisms, known for their simplicity and efficiency, have been extensively adopted in recent research. By comparing them with AFGCA, we comprehensively assessed the strengths and limitations of integrating AFGCAM into the YOLO11 model. The results of the comparison between AFGCA and other attention mechanisms are presented in
Table 6. The results show that all the evaluation indexes of AFGCAM are better than those of the other common attention mechanisms. Specifically, after adding the EMA, CBAM, SE, and SimAM attention mechanisms, the mAP@50:95 of the model was slightly improved, by 0.3%, 0.4%, 0.2%, and 0.5%, respectively, while the mAP@50 changed by no more than 0.2%. Notably, the model with the GAM attention mechanism reached 3.2 M parameters, while its mAP@50 and mAP@50:95 instead decreased by 0.1% and 0.8%, which suggests that adding an attention mechanism does not always guarantee enhanced model performance, particularly for apple-target detection under complex lighting. In contrast, adding the AFGCA and AFGCAM attention mechanisms only slightly increased the number of parameters, to 2.6 M. The AFGCA model exhibited a 0.6% and 1.1% improvement in mAP@50 and mAP@50:95, while the AFGCAM model showed enhancements of 0.7% and 1.2% in the same metrics.
AFGCA’s notable accuracy improvement stems from its ability to interact with both local and global feature map information, thereby enhancing feature representation effectively. Specifically, the extraction of local information allows the model to focus more on the features of small targets in the image; the extraction of global information enables the model to pay more attention to the position of targets in the image relative to the whole image. By fully mining and fusing the global and local information of the feature map, the model’s sensitivity to the target location is effectively enhanced, resulting in more accurate localization results. The maximum pooling operation GMP introduced in AFGCAM can fully integrate multi-scale feature information, thereby enhancing the model’s capability to characterize targets across various scales. From the results, all the accuracies of the AFGCAM attention mechanism are further improved compared to the AFGCA attention mechanism.
3.7. Ablation Experiments
The YOLO11-ARAF proposed in this study is built upon YOLO11 with two different improvements. In order to verify each enhancement module’s effectiveness in YOLO11-ARAF, we organized various combinations of the two modules and performed ablation studies on the specified dataset. The experimental outcomes are presented in
Table 7.
As indicated in
Table 7, the model with only the CARConv module saw a 0.5% and 1.2% boost in mAP@50 and mAP@50:95, respectively, and the model incorporating only AFGCA exhibited a 0.2% and 0.9% enhancement in the same metrics. The model improved by adding both the CARConv layer and AFGCA scores higher than the YOLO11-ARConv and YOLO11-AFGCAM models, and improves on the YOLO11 baseline by 0.3%, 1.1%, 0.72%, 0.6%, and 2% in P, R, F1-score, mAP@50, and mAP@50:95, respectively. This indicates that the improvement modules remain effective when combined. It is worth noting that the model including both CARConv and AFGCAM achieves a 2% improvement in mAP@50:95, alongside a notable reduction in computational complexity.
3.8. Discussion and Limitation
While the YOLO11-ARAF model achieves impressive accuracy and other metrics, its parameter size remains larger than typical convolutional models and attention mechanisms. Although knowledge distillation was used in this study for model lightweight improvement, the feature extraction part can still perform model pruning to eliminate redundant feature extraction layers to improve computational efficiency. Model pruning, as well as finding more lightweight distillation student models with different knowledge distillation algorithms, are possible future directions for improvement and extension. A limitation of this study is the use of a single apple variety in our dataset. While this allowed a focused investigation on complex lighting and environmental challenges from the robotic arm’s perspective, it may restrict the model’s generalization capabilities. Specifically, its performance could be affected when encountering other apple varieties with significantly different visual characteristics (e.g., color, shape, texture). Moreover, the model’s applicability to other fruit types is likely limited, as the learned features are inherently apple-specific. Orchard conditions varying with different crop types could also present further generalization challenges. Future research will involve collecting diverse apple samples to retrain the model and expand its feature recognition capabilities.
4. Conclusions
This study introduces an enhanced apple detection model, YOLO11-ARAF, built upon YOLO11n and designed to tackle the challenges of inaccurate detection and limited adaptability in complicated orchard settings. First, we built an apple image dataset for complex orchard environments, collecting a total of 3942 images. To enhance apple detection in complex backgrounds, the CARConv module was adopted in place of the original C3K2 module. Next, we upgraded the AFGCA module to the AFGCAM attention mechanism, which was integrated into the Backbone and Neck of the YOLO11 model. The addition of AFGCAM allowed the model to better attend to the global and local information of the feature map, thereby improving its feature extraction capability. Finally, the improved model was distilled into the YOLO11n model, boosting computational speed and efficiency while maintaining accuracy.
The experimental results show that the Precision, Recall, mAP@50, and mAP@50:95 of the YOLO11-ARAF model are 89.4%, 86%, 92.3%, and 64.4%, respectively, which are 0.3%, 1.1%, 0.72%, and 2% higher than those of YOLO11. Distilling the improved model back into the original model yields 0.1 M fewer parameters and doubled FPS, enabling fast and accurate apple detection in complex orchard environments with limited computational resources. The lightweight algorithm developed in this study can serve as a valuable reference for real-time orchard-picking robot operations within the apple-detection domain.