1. Introduction
In modern agriculture, the apple is a widely grown fruit that is rich in nutrients, low in fat, high in carbohydrates, and contains vitamins C and E; it can be grown under a variety of conditions and has high economic value [1]. Because of the complex environment of modern apple orchards, fruit harvesting still relies largely on manual labor and faces challenges such as high labor intensity, long harvesting cycles, and low efficiency [2]. In contrast, automated harvesting technology is gradually emerging thanks to its efficiency and precision. Through advanced technologies such as image recognition, machine learning, and robotic arms, automated picking equipment can recognize and pick apples more quickly, significantly improving picking efficiency. However, apple picking in natural environments often faces the problem of fruit occlusion, i.e., apples are occluded by leaves, branches, or other fruits, which makes it difficult for picking equipment to accurately recognize the target fruit. This problem not only affects the efficiency and accuracy of picking but may also lead to unnecessary waste of resources and fruit damage. In addition, to protect the fruits from pests and birds and to reduce the amount of insecticide and fungicide residue on the fruit surface, most orchards use bagging techniques to protect the fruits [3]. Although effective, this practice poses additional difficulties for apple detection. Beyond these technical difficulties, how to develop lightweight models for resource-constrained devices (e.g., edge and mobile devices) is also a hot topic in current research. Such devices usually have limited computational resources and storage space, so maintaining detection accuracy while keeping the model lightweight has become an urgent problem [4].
With the rapid development of computer vision and deep learning, machine learning provides new solutions for apple recognition [5]. The advancement of agricultural modernization has also brought more and more automated and intelligent equipment into agricultural production, and accurate crop recognition by target detection techniques is a key step in automating harvesting. In the specific application scenario of apple harvesting, deep learning greatly improves the accuracy and efficiency of target detection. Traditional apple recognition methods are often limited by factors such as lighting conditions, fruit occlusion, and morphological diversity, making efficient and accurate recognition difficult. Deep learning-based apple recognition algorithms, built on deep convolutional neural networks (CNNs) and related models, can automatically learn the complex features of apples, including color, texture, shape, and contextual information, from large amounts of image data [2], thereby achieving accurate localization and recognition of apple fruits.
Deep learning-based target detection methods are mainly categorized into two types: two-stage algorithms and one-stage algorithms. Two-stage algorithms include R-CNN [6], Faster R-CNN [7], and so on. Methods of this type first generate a series of candidate target regions through a region proposal network and then perform classification and bounding box regression on these regions for precise localization and identification. They offer significant advantages in detection accuracy and remain robust when facing challenges such as complex backgrounds, illumination changes, or partial occlusion of the target. However, their relatively high computational complexity, which stems from generating and processing a large number of candidate regions, may limit their application in scenarios with high real-time requirements. In contrast, one-stage algorithms, such as SSD [8,9] and the YOLO series [10,11,12,13,14], take a more direct and efficient approach. They predict both the class and the location of the target in a single forward pass, without generating candidate regions, and thus achieve higher computational efficiency and faster detection speeds. By optimizing the network structure and loss function, these algorithms can significantly reduce computational cost while maintaining high detection accuracy, making them better suited to automated agricultural harvesting systems with high real-time requirements.
Yang et al. [15] addressed the challenges of apple fruit detection by improving the YOLOv7 algorithm, incorporating preprocessing for overlapping images, dataset partitioning, the introduction of the MobileOne module, enhancements to the SPPCSPS module, and the addition of an auxiliary detection head. These improvements led to a 6.9% increase in accuracy, a 10% boost in recall, and 5% and 3.8% improvements in mAP1 and mAP2, respectively. However, despite these significant results in apple detection, the performance gains came at the cost of increased computational complexity and model size, posing challenges for deployment and inference speed in practical applications. Wu et al. [16] constructed the DNE-APPLE dataset and proposed the DNE-YOLO model, achieving an accuracy of 90.7%, a recall of 88.9%, and an mAP50 of 94.3% under complex weather conditions, with a computational complexity of 25.4 GFLOPs and 10.46 M parameters. While the model performs excellently under complex weather conditions, its adaptability to other environmental factors (such as intense sunlight and shadows) may be limited. Hao et al. [17] introduced the YOLO-RD-Apple Orchard model, leveraging RGB and depth images for detecting occluded apples, achieving an AP of 93.1%, a 70% reduction in parameters, and a detection speed of 40.5 FPS. Fu et al. [18] designed an apple recognition method based on Faster R-CNN, achieving average precisions of 90.9%, 85.8%, 89.9%, and 84.8% under conditions of no occlusion, occlusion by branches, occlusion by leaves, and overlapping occlusion by fruits, respectively, with a processing time of 0.241 s per image. This method has certain advantages in handling multi-category occlusion problems but may be constrained by computational resources and inference speed. Lu et al. [19] improved the YOLOv4 model by adding the CBAM module for detecting apples at different maturity stages, achieving detection accuracies of 86.2%, 87.5%, and 92.6% for the early, mid, and harvest stages, respectively. The model excels in feature extraction and detection accuracy but may be limited by the diversity and complexity of orchard environments; moreover, its parameter count and computational complexity need further optimization to suit more resource-constrained scenarios. Zhen et al. [20] improved the YOLOv7 model by combining Partial Convolution (PConv), Efficient Channel Attention (ECA) modules, and Sparrow Search Algorithm (SSA) optimization, achieving increases of 4.15%, 0.38%, and 1.39% in accuracy, recall, and mean average precision, respectively, with reductions of 22.93% and 27.41% in parameters and computational operations. Despite achieving significant results in natural orchard apple detection, the model may still be affected by the complexity of orchard environments, and it does not consider the bagging practice common in real orchards. For bagged fruits, detection remains challenging and related research is scarce. Liu et al. [21] proposed a method to attenuate the effect of light on the identification of apple fruits in plastic bags. The method uses a watershed algorithm to segment the original image into irregular blocks based on the edge detection results of R-G grayscale images. Compared with the watershed algorithm based on gradient images, this segmentation reduces the number of blocks by 20.31% while preserving the fruit edges. A support vector machine then classifies the blocks based on the color and texture features extracted from each block. The experimental results show that the method reduces the false negative rate (FNR) from 21.71% to 4.65% and the false positive rate (FPR) from 14.53% to 3.50%, significantly improving recognition accuracy compared with pixel-classification-based image recognition. Nevertheless, with the rapid development of deep learning, and especially the advances in target detection, there is room to improve detection performance further.
In summary, although initial results have been achieved in apple detection research, several challenges remain unresolved. In particular, most existing methods focus on detecting bare apple fruits while ignoring the additional difficulty introduced by the bagging commonly used in orchards as a protective measure. The accuracy and generalization of recognition under variable environmental conditions, such as light fluctuations, object occlusion, and fruit overlap, still need to be improved [22]. In addition, current models generally have high computational complexity and large parameter scales, which hinder their deployment on resource-constrained mobile or embedded platforms and thus limit their practical application. To address these limitations, this paper proposes AAB-YOLO, a lightweight apple detection model for natural environments based on an improved YOLOv11. The main contributions of this paper are as follows:
Specialized agricultural dataset: developed a 3140-image dataset capturing real-world bagged/unbagged apples under varied shading and lighting conditions, addressing critical gaps in agricultural object detection.
Enhanced feature extraction: introduced the ADown module with optimized convolutional downsampling, preserving critical spatial information while reducing computational complexity.
Contextual scene understanding: integrated C3k2_ContextGuided module to fuse multi-scale contextual features, improving detection in complex orchard environments.
Occlusion robustness: proposed Detect_SEAM module specifically designed to handle apple occlusions in natural foliage settings.
Efficient training strategy: adopted Inner_EIou loss function to simultaneously enhance detection accuracy and training efficiency across diverse scenarios.
These integrated innovations collectively enable the accurate detection of bagged/unbagged apples under challenging conditions, demonstrating significant advancements in agricultural vision systems.
3. Experiment and Results
3.1. Experimental Configuration and Evaluation Indicators
The experiments in this paper were run on the Windows 11 operating system with 16 GB of RAM. The CPU is an Intel® Xeon® Gold 6152 with 10 cores (Intel Corporation, USA), and the GPU is an NVIDIA GeForce RTX 3090 Laptop GPU (NVIDIA Corporation, USA). The development environment is PyCharm 2023.3.4, and the models are implemented in Python 3.8 with the PyTorch deep learning framework.
In the experiments of this paper, the hyperparameters used in the training and validation process of all models are kept consistent, and the experimental parameters used in the network training process are shown in Table 1.
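For illustration only, a training run under such a configuration could be launched with the Ultralytics API roughly as sketched below; the hyperparameter values are placeholders rather than the actual settings listed in Table 1, and the dataset configuration file name is hypothetical.

    from ultralytics import YOLO

    # Baseline model; the improved AAB-YOLO would instead be built from a custom
    # model YAML that registers the ADown, C3k2_ContextGuided, and Detect_SEAM modules.
    model = YOLO("yolo11n.pt")

    # Placeholder hyperparameters -- the actual values are those listed in Table 1.
    model.train(
        data="apple_dataset.yaml",  # hypothetical dataset configuration file
        epochs=200,
        imgsz=640,
        batch=16,
        optimizer="SGD",
        lr0=0.01,
        device=0,
    )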
In order to evaluate model performance, precision, recall, mean average precision (mAP), the number of parameters (Params), GFLOPs, and the model weight file size (in MB) are selected as the evaluation metrics for the experiments.
Precision is an important metric to measure the prediction accuracy of the target detection model. It indicates the proportion of true positive samples among all instances predicted as positive by the model. The calculation formula is as follows:
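Precision = TP / (TP + FP)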
where TP, FP, and FN denote the numbers of true positive, false positive, and false negative instances, respectively.
Recall reflects the completeness of the model in detecting the target. It indicates the proportion of all actually existing positive samples that are successfully detected by the model. The formula is calculated as follows:
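Recall = TP / (TP + FN)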
AP (Average Precision) is defined as the average of precision at all different levels of recall for a single category. The formula is as follows:
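AP = ∫₀¹ P(R) dR, where P(R) denotes the precision at recall level R.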
The mAP is a key metric for measuring the performance of a target detection model in a multi-category scenario. It is obtained by averaging the average precision (AP) for each category. The calculation formula is as follows:
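mAP = (1/N) × Σ AP_i (i = 1, …, N), where N is the number of categories and AP_i is the average precision of the i-th category.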
IoU (Intersection over Union) measures the degree of overlap between the predicted box and the ground-truth box and is calculated as the area of their intersection divided by the area of their union. A prediction is considered correct when the IoU is greater than or equal to 50%. mAP50 is the mean average precision at the 50% IoU threshold and is adopted as the evaluation metric in this paper.
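Formally, IoU = Area(B_pred ∩ B_gt) / Area(B_pred ∪ B_gt), where B_pred and B_gt denote the predicted and ground-truth boxes.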
GFLOPs measures the computational complexity of the model, i.e., the number of floating-point operations (in billions) required for one forward pass. It is computed from the model's forward propagation and reflects the computational load during inference or training. A higher GFLOPs value indicates higher computational complexity and a greater demand for computational resources. The model weight file size is the storage space required to store the model weights; smaller weight files can be deployed more easily on different hardware platforms and reduce storage costs.
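As a point of reference, the parameter count and weight file size can be read directly from a PyTorch model, and the operation count for one forward pass is commonly estimated with a third-party profiler such as thop (assumed available here). The following is a minimal sketch using a small stand-in network in place of the actual detector.

    import os
    import torch
    import torch.nn as nn
    from thop import profile  # third-party profiler, assumed available

    # Stand-in network; in practice this would be the loaded detection model.
    model = nn.Sequential(
        nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.SiLU(),
        nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.SiLU(),
    )

    params = sum(p.numel() for p in model.parameters())
    macs, _ = profile(model, inputs=(torch.randn(1, 3, 640, 640),), verbose=False)
    print(f"Params: {params / 1e6:.2f} M")
    # thop reports multiply-accumulate operations for one forward pass; conventions
    # differ on whether this value is doubled when quoted as FLOPs.
    print(f"GMACs per 640x640 forward pass: {macs / 1e9:.2f}")

    torch.save(model.state_dict(), "weights.pt")
    print(f"Weight file size: {os.path.getsize('weights.pt') / 1e6:.2f} MB")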
3.2. Ablation Experiments
In order to verify the effectiveness of the improved modules proposed in this paper in complex natural scenarios, eight sets of ablation experiments were conducted under the same dataset and training conditions; the results are shown in Table 2.
The ablation experiments reveal that the original YOLOv11n model has 2.58 M parameters, 6.3 GFLOPs, and a weight file size of 5.20 MB, with precision, recall, mAP50, and mAP@50-95 of 0.948, 0.865, 0.917, and 0.813, respectively. While demonstrating strong initial performance in accuracy and recall, its parameter count and computational complexity are relatively high. Upon integrating the ADown module to replace the original downsampling layer, GFLOPs decrease by 15.9%, parameters are reduced by 24.6%, recall improves by 1.04%, and mAP@50-95 marginally increases to 0.818, indicating that lightweight modifications preserve high-precision detection while slightly enhancing mid-to-high IoU threshold detection capabilities. The C3k2_ContextGuided module, leveraging context-aware mechanisms to enhance feature extraction, reduces parameters by 15.5% and GFLOPs by 22.2%, while elevating mAP50 to 0.919. However, mAP@50-95 drops slightly to 0.808, potentially due to trade-offs between feature enhancement and lightweight design affecting high-threshold detection stability. When the Detect_SEAM module is introduced alone, GFLOPs decrease by 7.94% and parameters by 3.4%, with its spatial attention mechanism effectively suppressing background interference to boost mAP@50-95 to 0.815, although lightweighting gains are moderate.
Synergistic benefits emerge through modular combinations: Combining ADown with C3k2_ContextGuided reduces parameters by 34.11% and GFLOPs by 30.16%, maintaining mAP50 at 0.919 but lowering mAP@50-95 to 0.801, reflecting the challenges of complexity reduction on high-IoU detections. The ADown-Detect_SEAM pairing lowers GFLOPs by 23.8% while increasing precision to 0.957 and mAP@50-95 to 0.810, suggesting that attention mechanisms compensate for information loss from lightweighting. Integrating C3k2_ContextGuided with Detect_SEAM reduces parameters by 18.9% and GFLOPs by 22.2%, lifting mAP50 to 0.919 but decreasing mAP@50-95 to 0.795, indicating that while context awareness and attention improve primary metrics, strict IoU threshold adaptability requires further tuning. When all three modules are combined, recall and mAP50 improve significantly, yet mAP@50-95 reaches 0.805, highlighting inherent trade-offs between multi-modular lightweighting and high-threshold detection. Finally, adopting the Inner_EIou loss function enables the model to maintain parametric and computational efficiency while recovering mAP@50-95 to 0.801, demonstrating that loss optimization effectively balances lightweight design and high-precision detection for superior metric harmonization.
When the ADown, C3k2_ContextGuided, Detect_SEAM, and Inner_EIou modules are fused, the model achieves high-precision detection with an mAP50 of 0.921 while reducing the number of parameters by 37.6%, GFLOPs by 38.1%, and the weight file size by 38.0%, and improving recall by 1.04%. The experiments show that the improved scheme in this paper, through multi-scale context modeling and the attention mechanism, effectively balances the efficiency and precision requirements of apple detection in natural environments while significantly reducing model complexity, providing a reliable technical path for lightweight deployment.
Based on the detailed experimental results mentioned above, we further visualized the impact of integrating different modules through the generation of heatmaps, as shown in Figure 11. These heatmaps provide an intuitive understanding of how each module contributes to the model's performance, particularly in terms of feature extraction and target detection. After incorporating the ADown module, the heatmaps revealed more focused attention on critical regions, suggesting that the module effectively reduced unnecessary computations while preserving key information. This lightweight approach not only decreased the model's complexity but also enhanced its recall rate for apple detection. The introduction of the C3k2_ContextGuided module further refined the model's feature extraction capabilities. The heatmaps generated after adding this module demonstrated a more nuanced understanding of the apples' context, capturing fine details that were previously overlooked. This context-aware mechanism significantly improved the model's accuracy while maintaining its lightweight nature. Similarly, when only the Detect_SEAM module was integrated, the heatmaps exhibited an enhanced ability to suppress background interference. The spatially enhanced attention mechanism in this module allowed the model to better distinguish apples from their surroundings, leading to improved detection stability in complex environments. When combining multiple modules, such as ADown with C3k2_ContextGuided or Detect_SEAM, the heatmaps displayed a synergistic effect. The combination of these modules not only further reduced the model's complexity and computational load but also significantly improved its recognition accuracy for apples. Finally, the heatmaps generated by integrating all proposed modules (ADown, C3k2_ContextGuided, Detect_SEAM, and Inner_EIou) demonstrated the model's exceptional detection performance.
3.3. Comparative Experiments with Other Models
In order to further verify the effectiveness of the improved network model in the task of apple detection in natural environments, its experimental results are compared, under the same experimental parameters, datasets, and training strategies, with those of other mainstream target detection models, namely YOLOv5s, YOLOv6s [29], and YOLOv8n [23]. These models are selected as benchmarks because they are widely applied and recognized in the field of object detection and, to varying degrees, adopt lightweight designs, which aligns well with the research objectives of this paper. P, R, mAP@50, mAP@50-95, the number of parameters, GFLOPs, and the model weight file size were used as evaluation metrics. The experimental results are shown in Table 3.
From the experimental results, it can be seen that the YOLOv6s detection algorithm has a larger number of parameters and a larger computational load and generates a larger model weight file, so it cannot meet the lightweight real-time detection needs of this dataset. YOLOv5s differs only slightly from AAB-YOLO in the number of parameters, computational load, and model weight size, but AAB-YOLO has a clear advantage over YOLOv5s in recall and mAP50. YOLOv8n has a slightly higher number of parameters, computational load, and model weight size than AAB-YOLO, and AAB-YOLO outperforms YOLOv8n in the two key performance indicators, recall and mAP50, while also showing higher detection precision. Compared to the benchmark model YOLOv11n, the improved network outperforms the original network in most metrics. However, it is worth noting that although AAB-YOLO achieves excellent results on mAP@50, it does not perform as well as the other models on mAP@50-95. This may be because AAB-YOLO's ability to generalize to some complex scenarios is weakened in the pursuit of a lightweight design, resulting in a decrease in detection accuracy at high IoU thresholds. Future research can further explore how to improve the model's ability to generalize to complex scenes and accurately predict target bounding boxes while maintaining the lightweight advantage. Through the above analysis and a comprehensive consideration of each evaluation index, the proposed AAB-YOLO model greatly improves the detection effect compared to the other models and is more suitable for apple detection in natural environments.
Figure 12 illustrates the comparison curves of precision, recall, mAP@50, and mAP@50-95 for YOLOv11n and the improved AAB-YOLO model. As can be seen from the precision curves, the YOLOv11n model shows high precision early in training, and as the number of training epochs increases, its performance fluctuates but remains at a relatively high level overall. The AAB-YOLO model performs slightly worse than YOLOv11n at the beginning of training, but as the epochs accumulate, its performance gradually surpasses that of YOLOv11n and remains stably at a higher precision level in the later stages; the overall precision curve of AAB-YOLO is stable without obvious fluctuations, showing good generalization ability and stability. This trend is also reflected in the mAP@50 curves. Meanwhile, the recall curves show that the recall of AAB-YOLO continues to increase during training and eventually surpasses that of YOLOv11n, indicating stronger target detection capability. Notably, the comparison with YOLOv8n shows that AAB-YOLO outperforms YOLOv8n in precision, recall, and mAP@50, and the gap is especially obvious at higher epoch counts. This result further validates the overall superiority of the AAB-YOLO model for apple detection tasks in natural environments.
3.4. Visual Analysis
3.4.1. Comparison of Detection in Natural Environment
To further evaluate the performance of the AAB-YOLO algorithm, we show the detection results of YOLOv11n and the improved AAB-YOLO model on the test image dataset in Figure 13.
First of all, in terms of target recognition accuracy, both AAB-YOLO and YOLOv11 show strong recognition ability. In most detection scenarios, both can accurately recognize target objects. However, in terms of detail processing and avoiding missed detections, AAB-YOLO shows some advantages. Especially in some complex scenes, such as when the number of target objects is large, densely distributed, or there is occlusion, AAB-YOLO can more accurately recognize the target objects and reduce the occurrence of missed detection. This is reflected in the second set of comparison images, in which AAB-YOLO successfully captures the target object in the lower left corner, while YOLOv11 fails to detect it.
Secondly, the two models differ in bounding box localization accuracy. YOLOv11 is able to fit the target object more closely in some cases, with more precise bounding boxes. AAB-YOLO, on the other hand, generally covers the main part of the target object accurately, although its bounding boxes may be relatively large in some cases. It is worth noting that AAB-YOLO maintains stable localization accuracy in complex scenes and is not easily disturbed by background noise or neighboring objects, which reflects its strong robustness. Although AAB-YOLO performs excellently in many respects, there may still be room for improvement in some specific situations. For example, when target objects are very densely distributed or occlude each other, AAB-YOLO may need further optimization of its detection algorithm to improve the fit and accuracy of the bounding boxes.
3.4.2. Comparison of Detection Under Different Light
In order to comprehensively evaluate the detection performance of the AAB-YOLO model under different lighting conditions, we adjusted the brightness of the images to simulate the detection effect under different lighting environments, as shown in Figure 14.
First of all, under normal light conditions (group a), both models show excellent detection accuracy, accurately framing the target objects with bounding boxes that are highly consistent with the objects' actual positions. In this setting, the detection results of AAB-YOLO and YOLOv11 are comparable and both are satisfactory.
However, when the lighting conditions turn dim (group b), AAB-YOLO shows its advantage in detection stability. In the dim environment, AAB-YOLO is able to recognize and frame the target objects more stably, and even in complex situations with dense or occluded targets, its detection performance remains at a high level. In contrast, YOLOv11 can still detect targets in dimly lit environments, but its bounding boxes become looser and more redundant. This may be because YOLOv11 adapts relatively weakly to light changes, whereas AAB-YOLO is more robust in this regard.
Under strong light conditions (group c images), both models encountered some challenges. Although AAB-YOLO still manages to maintain good detection accuracy in most cases, its bounding box fit is slightly weaker than that of YOLOv11 in some of the images. Especially in the region with strong light reflections, YOLOv11 is able to fit the target object more closely in some cases, showing its specific adaptability in bright light conditions.
In summary, the detection performance of AAB-YOLO and YOLOv11 under different light conditions has its own characteristics. Under normal light conditions, both perform equally well; in dim environments, AAB-YOLO has higher detection stability and better adaptability to light changes; and under bright light conditions, YOLOv11 may have better adaptability in some specific scenarios.
3.4.3. Heatmap
To further validate the performance of the AAB-YOLO model on the task of detecting apples in natural environments, this paper uses the gradient-weighted class activation mapping (Grad-CAM++) method [30] for visualization and analysis, as shown in Figure 15.
From the heatmaps, it can be concluded that both the YOLOv11n and AAB-YOLO models can effectively identify and localize the target regions in the images when detecting apples in natural environments, but the thermal response of the AAB-YOLO model in the target regions is more concentrated and of higher intensity, which suggests that the model localizes the target objects more accurately. In contrast, the thermal regions of the YOLOv11 model are relatively dispersed, and part of the response spreads into the background, which may indicate localization bias. This further verifies that the AAB-YOLO model outperforms the original model in target localization accuracy, background suppression ability, and detection consistency.
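For reference, heatmaps of this kind can be produced with the open-source pytorch-grad-cam package (assumed available here). The sketch below applies Grad-CAM++ to a stand-in torchvision classifier purely to illustrate the mechanics; applying it to a YOLO detector additionally requires passing the detector's feature layers as target_layers and wrapping the model so that it exposes per-class scores.

    import cv2
    import numpy as np
    import torch
    from torchvision.models import resnet50
    from pytorch_grad_cam import GradCAMPlusPlus
    from pytorch_grad_cam.utils.image import show_cam_on_image

    # Stand-in backbone; the paper visualizes its own detection model instead.
    model = resnet50(weights="IMAGENET1K_V1").eval()
    target_layers = [model.layer4[-1]]  # last convolutional block

    rgb = cv2.cvtColor(cv2.imread("apple.jpg"), cv2.COLOR_BGR2RGB).astype(np.float32) / 255.0
    rgb = cv2.resize(rgb, (224, 224))
    input_tensor = torch.from_numpy(rgb).permute(2, 0, 1).unsqueeze(0)

    cam = GradCAMPlusPlus(model=model, target_layers=target_layers)
    heat = cam(input_tensor=input_tensor)[0]              # (H, W) activation map in [0, 1]
    overlay = show_cam_on_image(rgb, heat, use_rgb=True)  # blend heatmap with the image
    cv2.imwrite("gradcam_overlay.jpg", cv2.cvtColor(overlay, cv2.COLOR_RGB2BGR))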
4. Discussion
The AAB-YOLO framework achieves an effective balance between detection accuracy and computational efficiency for the apple recognition task in complex orchard scenarios. The experimental results show that, compared with YOLOv11, the framework improves mAP50 (from 0.917 to 0.921) and recall (+1.04%), while the number of parameters is reduced by 37.7%. However, its lightweight design leads to a slight decrease in mAP@50-95, which reflects the architecture's focus on real-time optimization and a possible compromise in dense-fruit localization accuracy. By introducing ADown downsampling and the C3k2_ContextGuided module, the framework effectively preserves multi-scale features and enhances contextual information fusion, which significantly improves the robustness of the model under light variations and partial occlusion. In particular, the Detect_SEAM module, through targeted optimization of feature extraction and contextual association modeling of occluded targets, effectively alleviates the missed detections caused by occlusion from branches and leaves and significantly enhances the recognition of apples under occluded conditions, making the detection results more suitable for the practical requirements of automated picking systems.
However, there is still room for improving the performance of the framework in extreme complex scenes. When severe occlusion, motion blur, or extreme light angles are simultaneously present, the detection accuracy drops significantly, suggesting that temporal context information (e.g., video detection frameworks) or hybrid Transformer-CNN architectures can be introduced in the future to enhance scene understanding. In addition, the recall optimization strategy for small apples may lead to an increase in false detections in fruit-dense regions, and dynamic anchor adjustment or adaptive thresholding mechanisms based on spatial density should be further explored to balance precision and recall.
From the perspective of technology translation, the low power consumption of AAB-YOLO makes it easy to be deployed in resource-constrained orchard environments, especially in small-scale farmer scenarios, showing significant advantages. Future research could focus on multimodal data fusion (e.g., combining thermal or hyperspectral imaging) to enhance environmental adaptability, developing energy-efficient compression techniques to extend the endurance of field devices, and introducing lifelong learning strategies to adapt to seasonal changes in orchards, thereby promoting the intelligence and utility of agricultural automation systems. This framework provides an important foundation for building a new generation of agricultural automation systems, effectively bridging algorithmic innovation and practical application needs.