1. Introduction
The comprehensive mechanization rate of orchards in China is only 25.88% [
1], and the mechanization rate of fruit harvesting is only around 5%. According to data from the US Department of Agriculture (USDA) database and the China Apple Industry Association, China's apple output in the 2021–2022 production season reached 45.97 million tons, accounting for 56.4% of global production. China's apple production ranks first in the world, with a national apple cultivation area of 2 million hectares [
2]. Apples are the most widely planted fruit in China, and their output accounts for one-eighth of the total fruit output [
3]. The fruit-harvesting process in orchards is still mainly manual, and the level of mechanization at this stage is lower still [
4]. Harvesting is time-consuming and labor-intensive: labor input at this stage accounts for 40% to 50% of the entire production and planting process, and in some cases as much as 67% [
5]. Picking costs can reach 50% to 70% of total production costs. Coupled with the shortage of rural labor, it is difficult to recruit workers during the picking season; picking is therefore delayed, degrading fruit quality and significantly harming farmers' incomes and the development of the industry [
6]. Zhao [
7] noted, taking apples as an example, that 55% of the world's apple production comes from China, yet the mechanization rate of apple harvesting there is less than 3%. Manual operation is inefficient: a worker can harvest only about 300 kg per day. Existing mechanical harvesting equipment often fails to locate fruits accurately, leading to missed fruits or mistakenly picked leaves, which makes it difficult to improve the efficiency and quality of robotic harvesting [
8].
Computer vision, enabled by the widespread adoption of computing hardware, is broadly applied to machine understanding and high-level analysis of visual information, including scene and object recognition, detection, object tracking, instance segmentation, pose and motion estimation [
9], object and scene modeling, and pixel restoration [
10]. Object detection is one of the challenging problems in computer vision, and with the rapid advancement of deep learning in recent years, scholars both in China and abroad have researched it extensively.
Hu et al. [
11] proposed an apple target detection and localization method combining an improved YOLOX with an RGB-D depth camera. Experiments showed that the method reaches an F1 score of 93% and an mAP@0.5 of 94.09%, with localization errors in the X, Y, and Z directions below 7 mm, 7 mm, and 5 mm, respectively. Ji et al. [
12] proposed an improved YOLOX algorithm for apple fruit detection, built on the YOLOX-tiny network and incorporating the lightweight ShuffleNetv2 backbone and the CBAM attention mechanism, with an Adaptive Spatial Feature Fusion (ASFF) module added to the PANet neck. The resulting model achieves an mAP@0.5 of 96.76%, precision of 95.62%, recall of 93.75%, and an F1 score of 0.95. Kumar et al. integrated an adaptive pooling scheme and an attribute enhancement model into the YOLOv5 architecture, introducing a loss function that yields more accurate bounding boxes [
13]. Their model detects smaller objects and improves feature quality for apples in complex backgrounds, achieving precision, recall, and F1 scores of 97%, 99%, and 98%, respectively.
Deep learning methods can extract high-dimensional features of fruits, effectively resisting the effects of lighting, overlap, and occlusion [
14,
15], and detect apple targets robustly with high recognition accuracy. However, AlexNet and VGG16 take long to train and are not easily deployed; CenterNet tends to confuse the center points of two nearby objects of the same category; and although two-stage algorithms such as Mask R-CNN offer high accuracy, their models are large and slow at inference. The YOLO family, by contrast, is easy to deploy, efficient to train, and highly accurate, meeting the requirements for real-time apple detection in unstructured orchard environments [
16,
17,
18]. However, YOLO performs suboptimally under object occlusion, overlapping instances, and small-target detection in unstructured orchards, so the original framework must be modified to identify fruit targets accurately under these challenging conditions.
Typical object detection algorithms fall into two-stage and one-stage approaches. Representative two-stage algorithms, such as R-CNN, Fast R-CNN, and Faster R-CNN, generate region proposals through heuristic methods or convolutional neural networks and then perform classification and regression. This framework requires a two-step training process: (1) training the Region Proposal Network (RPN); (2) training the core object detection network. Two-stage algorithms achieve slightly higher detection accuracy but relatively lower detection speed [
19]. Representative one-stage algorithms, such as SSD and YOLO, directly output class and location information through their backbone networks without requiring Region Proposal Networks (RPNs). Compared to two-stage approaches, these methods achieve faster detection speeds at the expense of slightly reduced accuracy [
20]. To detect and recognize apple fruits in complex orchard environments, it is necessary to select a detection network that offers fast detection, high recognition accuracy, and strong overall performance [
21]. Wang et al. proposed a multi-pose pitaya detection method based on an optimized YOLOv7 model, comparing it against YOLOv4-tiny, YOLOv5, YOLOX, SSD, Faster R-CNN, and other algorithms. YOLOv7 achieved higher inference speed and detection accuracy than the alternatives, with a precision of 82.3%, recall of 83.7%, F1 of 83.0%, and mAP of 91.9% [
22].
YOLOv7 is a typical single-stage object detection algorithm featuring high accuracy, ease of training, and fast detection. Higher detection accuracy and faster inference can be achieved without increasing computational cost, meeting the requirements of precise real-time picking [
23]. However, in detection tasks in complex orchard environments with occlusions, missed detections still occur, indicating room for improvement.
The present study used real orchard images as the underlying dataset. Drawing on the characteristics of apples, the feature extraction network, loss function, and prediction heads of YOLOv7 were enhanced. The photographs were first labeled and then used to train the model parameters, and the trained model was evaluated. The primary contributions are as follows: (1) an enhanced YOLOv7-based apple recognition model is proposed, integrating the Global Attention Mechanism (GAM), BiFPN multi-scale neck optimization, and improved prediction heads; (2) ablation and comparative experiments were conducted to clarify the contribution of each module; and (3) the optimized model was integrated into an apple harvesting system to provide practical visual recognition for robotic harvesting.
3. Experimental Results
3.1. Ablation Study
To demonstrate the impact of the model improvements on recognition performance and validate the effectiveness of each modified module, experiments were conducted on the apple dataset under identical configurations. Detection performance was analyzed by incorporating the three improvements into the YOLOv7 baseline, individually and in combination. The results are presented in
Table 2.
Compared with the baseline YOLOv7 model, incorporating the Global Attention Mechanism (GAM) into the final layer of the backbone network yields a clear performance gain: precision increases by 0.9 percentage points, recall by 0.3, mAP@0.5 by 0.9, and mAP@0.5:0.95 by 3.2.
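For clarity, the following is a minimal PyTorch sketch of a GAM-style attention block, assuming a channel-reduction ratio of 4; the layer sizes and the insertion point are illustrative rather than the exact configuration used here.

```python
import torch
import torch.nn as nn

class GAM(nn.Module):
    """Global Attention Mechanism: channel then spatial attention,
    with no max pooling, so more feature information is retained."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        hidden = channels // reduction
        # Channel attention: an MLP mixes channels on a (B, H, W, C) view.
        self.channel_mlp = nn.Sequential(
            nn.Linear(channels, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, channels),
        )
        # Spatial attention: two 7x7 convolutions, again pooling-free.
        self.spatial = nn.Sequential(
            nn.Conv2d(channels, hidden, kernel_size=7, padding=3),
            nn.BatchNorm2d(hidden), nn.ReLU(inplace=True),
            nn.Conv2d(hidden, channels, kernel_size=7, padding=3),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        attn = self.channel_mlp(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)
        x = x * torch.sigmoid(attn)
        return x * torch.sigmoid(self.spatial(x))

# Example: attach to the last backbone stage (e.g., a 20x20, 1024-channel map).
feat = torch.randn(1, 1024, 20, 20)
print(GAM(1024)(feat).shape)  # torch.Size([1, 1024, 20, 20])
```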
Replacing the original neck with the Bidirectional Feature Pyramid Network (BiFPN) raises precision by 1.4 percentage points, mAP@0.5 by 0.1, and mAP@0.5:0.95 by 0.5. However, recall drops, which we attribute to insufficient fusion between shallow and deep feature maps during multi-scale integration.
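The core of BiFPN is its fast normalized weighted fusion (Tan et al., EfficientDet), sketched below in PyTorch; the number of inputs fused per node and the convolution settings are simplifying assumptions.

```python
import torch
import torch.nn as nn

class WeightedFusion(nn.Module):
    """Fuse same-shaped feature maps with learnable, normalized weights."""
    def __init__(self, num_inputs: int, channels: int, eps: float = 1e-4):
        super().__init__()
        self.w = nn.Parameter(torch.ones(num_inputs))
        self.eps = eps
        self.conv = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels), nn.SiLU(inplace=True),
        )

    def forward(self, feats):
        w = torch.relu(self.w)        # keep fusion weights non-negative
        w = w / (w.sum() + self.eps)  # fast normalized fusion
        fused = sum(wi * f for wi, f in zip(w, feats))
        return self.conv(fused)

# Example: fuse a lateral feature with an upsampled top-down feature.
p4_lat = torch.randn(1, 256, 40, 40)
p5_up = nn.Upsample(scale_factor=2)(torch.randn(1, 256, 20, 20))
print(WeightedFusion(2, 256)((p4_lat, p5_up)).shape)  # [1, 256, 40, 40]
```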
Adding the auxiliary detection head alone improves precision by 0.3 percentage points, recall by 1.3, mAP@0.5 by 0.8, and mAP@0.5:0.95 by 3.4, markedly strengthening detection of small targets.
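A minimal sketch of such an auxiliary YOLO-style head follows; the stride-4 (160 × 160) feature scale, the anchor count, and the single-class setup are assumptions for illustration, not the exact head configuration of this paper.

```python
import torch
import torch.nn as nn

num_anchors, num_classes = 3, 1           # apples only
out_ch = num_anchors * (5 + num_classes)  # (x, y, w, h, obj) + classes

aux_head = nn.Conv2d(128, out_ch, kernel_size=1)

# A shallow, high-resolution map retains detail for distant, small fruit.
p2 = torch.randn(1, 128, 160, 160)
pred = aux_head(p2)                                   # [1, 18, 160, 160]
pred = pred.view(1, num_anchors, 5 + num_classes, 160, 160)
print(pred.shape)
```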
When GAM and BiFPN are deployed together, precision declines marginally, while recall increases by 1.3 percentage points, mAP@0.5 by 0.8, and mAP@0.5:0.95 by 4.8. Both modules enhance feature extraction, but the multi-scale fusion introduced by BiFPN admits some background interference, which slightly reduces precision.
Combining the auxiliary head with GAM yields improvements of 1.8 percentage points in precision, 0.1 in recall, 2.2 in mAP@0.5, and 2.3 in mAP@0.5:0.95.
Integrating all three components (auxiliary head + GAM + BiFPN) yields the best results: precision increases by 4.3 percentage points, recall by 1.8, mAP@0.5 by 4.4, and mAP@0.5:0.95 by 7.9. The modules are complementary: the Global Attention Mechanism (GAM) suppresses complex background interference, BiFPN enhances multi-scale feature fusion, and the auxiliary head strengthens detection of small targets.
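As a quick consistency check on these figures, the baseline values implied by the reported gains, and the corresponding F1 scores, can be back-calculated from the full-model precision and recall given in the ablation results:

```python
# Consistency check: back-calculate the YOLOv7 baseline implied by the
# reported percentage-point gains, and derive F1 from precision and recall.
def f1(p: float, r: float) -> float:
    return 2 * p * r / (p + r)

final_p, final_r = 94.7, 85.9                  # full model (ablation results)
base_p, base_r = final_p - 4.3, final_r - 1.8  # implied YOLOv7 baseline
print(f"baseline:  P={base_p:.1f}%  R={base_r:.1f}%  F1={f1(base_p, base_r):.1f}%")
print(f"improved:  P={final_p:.1f}%  R={final_r:.1f}%  F1={f1(final_p, final_r):.1f}%")
```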
To analyze the impact of GAM on feature extraction, visualization studies were conducted (see
Table 3). Adding GAM significantly strengthens the response in apple regions, indicating improved feature focusing. In Scenarios 1 and 4, background noise at the image edges disperses the features, whereas in Scenarios 2 and 3 the features concentrate sharply within apple regions, shown by the darker coloring. After incorporating BiFPN and the auxiliary head, multi-scale fusion enhances target perception and suppresses background interference in the challenging scenarios (1 and 4), enabling more precise feature focusing.
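A minimal sketch of how such activation heatmaps can be produced with a forward hook is shown below; the choice of hooked layer is hypothetical, and Grad-CAM-style class weighting is omitted for simplicity.

```python
import torch
import matplotlib.pyplot as plt

def visualize_layer(model, layer, image: torch.Tensor, out_png: str):
    """Render the channel-averaged activation of `layer` as a heatmap."""
    feats = {}
    handle = layer.register_forward_hook(
        lambda m, i, o: feats.update(out=o.detach()))
    with torch.no_grad():
        model(image)                          # forward pass fills the hook
    handle.remove()
    heat = feats["out"].mean(dim=1)[0]        # average over channels
    heat = (heat - heat.min()) / (heat.max() - heat.min() + 1e-8)
    plt.imshow(heat.cpu().numpy(), cmap="jet")  # warmer = stronger focus
    plt.axis("off")
    plt.savefig(out_png, bbox_inches="tight")
```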
3.2. Comparative Experiments of Different Models
A comparison of the training loss curves of the different models in
Figure 10 shows that our improved model converges faster in both the bounding-box loss and the object loss. As shown in
Figure 10a, the bounding-box loss of the baseline models stabilizes at around 200 epochs, whereas our model converges below 0.01 at approximately 50 epochs, significantly outperforming the other networks. This improvement stems from BiFPN's optimized feature fusion pathways and the removal of redundant network nodes, which together enhance training efficiency.
As demonstrated in
Figure 10b, our object loss achieves the lowest recorded value (≈0.04), a 20% reduction relative to the second-best performer after stabilization. These results confirm that integrating the GAM (Global Attention Mechanism), auxiliary detection heads, and BiFPN modules into the YOLOv7 architecture substantially accelerates convergence and shortens training.
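The convergence epochs quoted above can be read off a logged loss history as the first epoch after which the loss stays below a threshold; the curve below is a hypothetical stand-in, not our actual training log.

```python
# Find the first epoch after which the loss stays below a threshold.
def convergence_epoch(loss_history, threshold: float) -> int:
    for epoch in range(len(loss_history)):
        if all(l < threshold for l in loss_history[epoch:]):
            return epoch
    return -1  # never converged below the threshold

loss_history = [0.09] * 30 + [0.02] * 20 + [0.008] * 250  # hypothetical curve
print(convergence_epoch(loss_history, 0.01))  # 50
```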
As demonstrated in
Table 4, while the recall of our algorithm is marginally lower than that of YOLOX, it surpasses YOLOX on all other critical metrics. Overall, the YOLOv7-imp model outperforms all baseline algorithms in precision, mAP@0.5, and F1-score. For mAP@0.5, YOLOv7-imp surpasses the competing algorithms by 8.5, 4.2, 7.5, 10.2, 6.1, and 4.4 percentage points, respectively. These findings underscore the enhanced model's superior detection capability, with better localization accuracy and classification confidence.
3.3. Comparison Tests of Detection Performance in Different Scenarios
To validate the detection performance of the improved model across diverse scenarios, this study applied the Faster R-CNN, YOLOv5s, YOLOv7, and YOLOv7-imp algorithms to orchard apple detection under four lighting and distance conditions: close-range front lighting, close-range backlighting, long-range front lighting, and long-range backlighting. The number of missed apples was recorded for each algorithm. As shown in
Figure 11, the black circles highlight regions where the three baseline algorithms (Faster R-CNN, YOLOv5s, YOLOv7) produced missed detections (undetected apples) or misdetections (incorrect identifications), whereas YOLOv7-imp successfully identified these targets. The results demonstrate that YOLOv7-imp maintains robust detection accuracy under challenging lighting and distance variations, substantially reducing both omission and commission errors relative to the other detectors.
As demonstrated in
Figure 11 and
Table 5, in the close-range front-lighting and close-range backlighting scenarios, both Faster R-CNN and the YOLO-series algorithms miss no detections, because at close range apples occupy large, distinct pixel areas with high contrast against the background.
In the long-range front-lighting and backlighting scenarios, all networks miss detections, because the apples appear with reduced brightness and contrast against the background, making them harder to detect. Notably, YOLOv7 shows high false-negative rates (54.83% and 25.53%) for occluded or overlapping apples, owing to the indistinct boundaries between overlapping fruits and its limited capacity to represent complex backgrounds.
In both long-range scenarios, interlaced branches, leaf occlusion, and fruit overlap cause missed or false detections. The YOLOv7-imp algorithm outperforms the competing algorithms thanks to its enhanced network design and stronger feature extraction: the Global Attention Mechanism (GAM) improves feature focusing, the auxiliary detection heads improve small-target detection, and the NMD loss function improves the localization of apples in long-range scenes.
3.4. Field Harvesting Experiment
The field experiment was conducted on 30 September 2023 at Huamanyuan Apple Orchard in Huancui District, Weihai City, Shandong Province, in clear weather between 10:00 a.m. and 3:00 p.m.
The trial used Yantai Red Fuji apples cultivated in a dwarf-rootstock high-density planting system, with target apples selected at random for the harvesting tests. The system sequentially executed fruit detection and recognition, spatial positioning, and motion planning, and all operational data were recorded in real time.
During autonomous harvesting, the manipulator executed a coordinated workflow driven by its vision and motion-planning systems. The host computer processed orchard images captured by the depth camera to derive 3D coordinates in the camera coordinate system; these were then transformed into the robotic arm's base coordinate system to produce spatial positioning data, from which the arm planned and executed a trajectory to the target picking location. As shown in
Figure 12, upon arrival at the initial pose, the vision system detected ripe apples and the motion control system computed and executed optimal paths. Finally, the end-effector was positioned accurately and detached the fruit at the stem.
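A minimal sketch of this localization pipeline is given below; the camera intrinsics and the hand-eye (camera-to-base) transform are placeholder values, not the calibrated parameters of our system.

```python
import numpy as np

fx, fy, cx, cy = 615.0, 615.0, 320.0, 240.0  # hypothetical depth-camera intrinsics

def pixel_to_camera(u: float, v: float, depth_m: float) -> np.ndarray:
    """Back-project pixel (u, v) with depth (meters) into the camera frame."""
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return np.array([x, y, depth_m])

# 4x4 homogeneous camera->base transform from hand-eye calibration (placeholder).
T_base_cam = np.eye(4)
T_base_cam[:3, 3] = [0.10, 0.00, 0.45]  # camera offset on the arm, in meters

def camera_to_base(p_cam: np.ndarray) -> np.ndarray:
    return (T_base_cam @ np.append(p_cam, 1.0))[:3]

p_cam = pixel_to_camera(352, 198, 0.83)  # detected apple center + depth reading
print(camera_to_base(p_cam))             # picking target in base coordinates
```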
To validate the efficiency of the vision system within the integrated harvesting system, multiple field trials of the robotic arm and end-effector assembly were conducted in the orchard. A harvest was counted as successful only if the peduncle detached without any damage to the fruit. An analysis of 20 harvest cycles (see
Table 6) shows that the host computer's path-planning time accounted for only a minor share of the total harvesting cycle. The deployed vision model offers fast inference and high localization accuracy, meeting the real-time requirements of the harvesting system. The overall success rate of 85% demonstrates the vision system's efficacy for automated apple harvesting; the failures were mainly attributable to suboptimal path solutions at the motion-planning stage.
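The success rate and the planning-time share follow from simple tallies over the cycle log, as sketched below; the cycle records and timing values are placeholders, not the actual Table 6 data.

```python
# Tally field-trial outcomes: 17 successes in 20 cycles gives the 85% above.
cycles = [{"success": True, "plan_s": 0.4, "total_s": 12.0}] * 17 \
       + [{"success": False, "plan_s": 0.5, "total_s": 14.0}] * 3

rate = 100.0 * sum(c["success"] for c in cycles) / len(cycles)
plan_share = 100.0 * sum(c["plan_s"] for c in cycles) / sum(c["total_s"] for c in cycles)
print(f"success rate: {rate:.0f}%")                  # 85%
print(f"planning share of cycle time: {plan_share:.1f}%")
```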
4. Discussion
Currently, traditional fruit-detection methods exhibit significant limitations, as they primarily focus on the straightforward recognition of unobstructed and easily pickable apples, while struggling to cope with the unstructured conditions typical of real-world orchard environments. Although various solutions have been proposed in existing research, achieving a balance between model accuracy and recognition efficiency remains a persistent challenge, largely due to the algorithms’ sensitivity to environmental variations. Consequently, the development of a high-performance and accurate vision system for apple recognition and localization that can operate robustly in complex environments is critical for advancing agricultural automation in harvesting operations [
26].
In apple detection tasks, various studies have proposed multiple improvement approaches for complex lighting and dense scene conditions. Zhu et al. [
27] verified that the combination of YOLOv5s and the CBAM attention mechanism is superior to the original model. Chen et al. [
28] demonstrated, in apple inflorescence recognition experiments, that YOLOv7 outperforms both YOLOv5s and YOLOv5 with an integrated CA attention mechanism. Chai et al. [
8] employed the CBAM attention mechanism to sharpen the focus and expression of key features in a YOLOv8 network for cherry tomato detection. The GAM attention mechanism improves CNN performance by amplifying cross-dimensional interactions; compared with CBAM, GAM eliminates the max pooling in the spatial branch and thus retains more feature information [
24]. Yang et al. [
29] suggested replacing the backbone with MobileOne to achieve lightweight processing, yet still encountered a 23.40% false negative rate in dense orchards, indicating that relying solely on backbone compression cannot effectively address occlusion and background interference issues. In contrast, this study avoids backbone replacement by introducing a GAM attention mechanism to suppress irrelevant feature interference and enhance target focus. Previous research has validated this approach’s effectiveness: Zang et al. [
30] improved wheat ear detection accuracy by 4.95% by embedding a GAM attention module into YOLOv5s. Zhang et al. [
31] similarly observed that incorporating GAM significantly improved feature representation capabilities in complex backgrounds during crop pest and disease identification. Consistent with these efforts, our findings demonstrate that attention enhancement offers greater specificity and practicality than backbone replacement in complex orchard environments.
To address the challenges of small-target and long-distance fruit detection, this study adopted BiFPN for multi-scale feature fusion and designed an auxiliary detection head to improve the recognition of distant apples. Similar improvements have proven effective in related research: Wen et al. [
32] proposed PcMNet, which achieves lightweight, high-precision detection in orchard environments by refining the feature fusion structure. Jin et al. [
33] presented an enhanced deep learning model designed to improve the accuracy and adaptability of recognition algorithms for robotic arm-based harvesting. Zhang et al. [
34] employed a weighted bidirectional feature pyramid network (BiFPN) for cross-scale feature fusion, significantly enhancing the multi-scale detection capability of the YOLO framework, particularly for small objects. Additionally, Weng et al. [
35] introduced GAM and feature fusion strategies in tomato detection to enhance small target recognition capabilities. These comparisons further demonstrate the rationality and applicability of our approach in complex environments.
Regarding boundary localization, this research applied the NMD loss function to enhance positioning accuracy. For instance, Yang et al. [
36] replaced CIoU with Shape-IoU in the lightweight ELD fruit detector, achieving greater robustness in complex scenarios. Experimental results from this study likewise demonstrate that the NMD loss effectively reduces positioning errors, providing more precise path guidance for robotic harvesting. Additionally, the model weight file in this study is 76 MB, significantly reducing memory consumption. Similar lightweight attempts have been validated in studies such as the lightweight CCG-YOLOv5n model [
37] and YOLOv5s-BC [
38]. Fu et al. [
26] combined IoU and NWD in YOLOv10 to reduce the false-negative rate under low-density overlapping conditions, achieving an accuracy of 89.3%, which demonstrates the feasibility of NWD-style optimization. This research therefore balances engineering deployment requirements with accuracy.
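For reference, a sketch of the normalized Wasserstein distance for axis-aligned boxes (the NWD of Wang et al., as used in [26]) is given below; whether the NMD loss used in this paper takes exactly this form is an assumption of the sketch, and the constant c is dataset-dependent.

```python
import torch

def nwd(box1: torch.Tensor, box2: torch.Tensor, c: float = 12.8) -> torch.Tensor:
    """Boxes as (cx, cy, w, h). Each box is modeled as a 2D Gaussian; returns
    exp(-W2 / c), where W2 is the 2nd-order Wasserstein distance between them."""
    d_center = (box1[..., :2] - box2[..., :2]).pow(2).sum(-1)
    d_shape = ((box1[..., 2:] - box2[..., 2:]) / 2).pow(2).sum(-1)
    w2 = torch.sqrt(d_center + d_shape)
    return torch.exp(-w2 / c)

pred = torch.tensor([[100.0, 100.0, 20.0, 20.0]])
gt = torch.tensor([[104.0, 101.0, 22.0, 18.0]])
print(1 - nwd(pred, gt))  # NWD-based regression loss term
```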
It should be noted that the proposed method is primarily designed for dwarf rootstock high-density planting systems, and its generalization capability under other cultivation patterns or regional varietal conditions requires further validation. Previous studies have achieved 94% detection accuracy in high-occlusion scenarios by modeling fruit-to-fruit occlusion relationships [
38]. Semi-supervised strategies have also been employed to enhance cross-scenario recognition performance for small objects [
39]. Future research may integrate domain adaptation, occlusion modeling, or semi-supervised learning to further improve the method's universality across diverse orchard conditions. Although designed for apple detection in dense dwarf-rootstock orchards, the approach is adaptable to other agricultural environments: the cross-dimensional information fusion of GAM, combined with BiFPN's multi-scale fusion, helps the network detect objects of varying sizes, which matters wherever fruit color, fruit size, and camera distance vary. The approach can therefore be extended to a broad range of agricultural robotics applications, from orchard fruit detection to object recognition in vegetable and crop monitoring.
5. Conclusions
This study improves computational efficiency through an optimized network architecture and model scaling strategy, resulting in an enhanced YOLOv7-based model capable of addressing apple recognition challenges in complex orchard environments. To this end, an orchard apple dataset was first constructed, and model generalization and robustness were reinforced through offline data augmentation. An attention mechanism was incorporated into the backbone network, the NMD bounding box regression loss was employed to improve prediction accuracy, and an auxiliary detection head was introduced to enhance small-target recognition.
The ablation experiments demonstrate that the optimized model achieves a precision of 94.7%, a recall of 85.9%, and an mAP@0.5 of 95.9%. The combined contributions of the auxiliary detection head, the GAM attention mechanism, and the BiFPN substantially improve model performance, with gains of 4.3, 1.8, 4.4, and 7.9 percentage points in precision, recall, mAP@0.5, and mAP@0.5:0.95, respectively, relative to the baseline. These findings highlight the effectiveness of the attention mechanism in suppressing background interference, the auxiliary detection head in improving small-target detection, and the BiFPN in facilitating multi-scale feature fusion.
Comparative evaluations against SSD, Faster R-CNN, YOLOv4, YOLOv5, YOLOX, and YOLOv7 indicate that the proposed model achieves superior target recognition while maintaining real-time inference speed, improving mAP@0.5 by 8.5, 4.2, 7.5, 10.2, 6.1, and 4.4 percentage points over the respective benchmarks. In orchard apple detection scenarios with varying distances and illumination conditions, the enhanced model exhibits higher accuracy, fewer missed detections, and fewer false positives than Faster R-CNN, YOLOv5, and YOLOv7. These results confirm the method's capacity to handle the complex characteristics of orchard apples and underscore its practical applicability in agricultural and plant science applications.