Detection of Orchard Apples Using Improved YOLOv5s-GBR Model

Abstract: The key technology of automated apple harvesting is detecting apples quickly and accurately. Traditional apple detection methods are often slow and inaccurate in unstructured orchards. Therefore, this article proposes an improved YOLOv5s-GBR model for orchard apple detection under complex natural conditions. First, we collected photos of apples in their natural environments from different angles; then, we enhanced the dataset by changing the brightness, rotating the images, and adding noise. The following modules were introduced into the YOLOv5s network to improve its performance. First, the backbone network of the YOLOv5s model was replaced with the GhostNetV2 module, which lessens the computational burden of the YOLOv5s algorithm while increasing the detection speed. Second, the bi-level routing spatial attention module (BRSAM), which combines spatial attention (SA) with bi-level routing attention (BRA), was introduced. By strengthening the model's capacity to extract important features from the target, it enhances the model's generality and robustness. Lastly, the original bounding box loss function was replaced with a repulsion loss function to detect overlapping targets. The resulting model performs better in detection, especially for occluded and overlapping targets. According to the test results, the YOLOv5s-GBR model improved the average precision by 4.1% and recall by 4.0% compared with the original YOLOv5s model, reaching a detection accuracy of 98.20% at a frame rate of 101.2 fps. The improved algorithm increases the recognition accuracy by 12.7%, 10.6%, 5.9%, 2.7%, 1.9%, 0.8%, 2.6%, and 5.3% compared with YOLOv5-lite-s, YOLOv5-lite-e, YOLOv4-tiny, YOLOv5m, YOLOv5l, YOLOv8s, Faster R-CNN, and SSD, respectively. The YOLOv5s-GBR model can accurately recognize overlapping or occluded apples and can subsequently be deployed in picking robots to meet the practical demand for real-time apple detection.


Introduction
Apples are widely favored by consumers because they are rich in many vitamins, and thus are grown on a large scale by fruit growers [1]. Currently, a huge amount of labor is required for apple production activities. Yantai's climate and environmental conditions are ideal for apple growth, which is why Yantai has been identified by the Chinese Ministry of Agriculture as the center for high-quality apple cultivation and production in China. The terrain of Yantai is hilly and mountainous, characterized by "six mountains, one water and three fields". This natural geography is suitable for apple cultivation, but it also leads to unstructured orchards with irregular tree positions and different tree shapes and sizes. The complex working environment and the strong demand for seasonal work in Yantai's apple production areas have resulted in an insufficient supply of agricultural labor to complete high-intensity apple harvesting tasks in a short period of time [2,3]. In Yantai, the labor expenditures for fresh apple harvesting account for 35% to 45% of all the production costs [4]. Therefore, it is important to increase the level of automation in the apple industry, seek technological advances to reduce the dependence on labor, and promote the development of Yantai's apple industry by improving picking efficiency, reducing production costs, and increasing profitability for fruit farmers [5].
Rapid apple recognition in unstructured orchards is essential for automated apple picking, so a quick and precise apple detection technique must be established for such orchards. With the rapid development of machine learning [6] and digital image processing techniques [7] and their application in the field of recognition and target detection, target fruits can be detected quickly. Target detection algorithms can be broadly categorized into two groups. The first group consists of candidate region-based detection algorithms, which identify a set of candidate regions and then classify them. R-CNN, Fast R-CNN, and Faster R-CNN are typical examples of such algorithms. Chu et al. [8] developed a robust framework for apple detection using a suppression Mask R-CNN. Xuan et al. [9] used Faster R-CNN and an improved YOLOv3 deep learning algorithm to detect apples in natural environments. In addition, Kang and Chen [10] created a fruit detection framework for apple harvesting, consisting of a fruit detector called "LedNet" and an automatic label generation module. Linker et al. [11] proposed a technique to detect apples under natural light. This technique identifies a collection of pixels that likely contain apples based on color and smoothness to form a "seed region". The consistency between the "seed region" and the apple model can be used as an indicator of the presence or absence of apples in the region. However, when the apples are widely distributed or heavily overlapped, accurate identification becomes difficult. The second group consists of regression-based algorithms, which determine the class and location of the target by directly regressing the target using a CNN. Common algorithms include the SSD and YOLO families. Jiang et al. [12] incorporated an attention mechanism into the YOLOv4 model to better determine the features of apples at the early fruiting stage. Lu et al. [13] modified the YOLOv4 model with an embedded attention mechanism to detect the growth status of apples. Wang et al. [14] proposed a real-time apple stem and calyx recognition algorithm based on an improved YOLOv5. Wang et al. [15] developed a fast apple recognition and tracking method based on a modified version of the YOLOv5 algorithm in conjunction with the MobileNetv2 module. Solimani et al. [16] proposed a tomato plant trait detection model by combining the SE attention module with the YOLOv8 model. Ma et al. [17] combined the Ghost and ShuffleNetv2 modules with YOLOv8 to design a lightweight YOLOv8n-ShuffleNetv2-Ghost-SE model for the real-time detection of apples.
As the classic model of the YOLO series, YOLOv5 is widely used due to its excellent scalability. The architecture used by YOLOv5 is a simple convolutional neural network, whereas YOLOv8 uses multiple residual units and branches, so the network architecture of YOLOv8 is much more complex. YOLOv5 is not only capable of predicting the positions and categories of multiple anchor boxes simultaneously but also performs satisfactorily in detecting small and hidden targets under intricate field conditions [18]. Under the same hardware conditions, the detection speed and accuracy of YOLOv5 are better than those of other deep learning algorithms, which can meet the demand for real-time apple detection. Therefore, in this study, YOLOv5 was selected as the base model after comprehensively comparing the network complexity of the model, detection speed, accuracy, practical applications, and other factors. In this paper, an improved apple detection algorithm based on YOLOv5s-GBR is proposed to solve the problem of quickly and accurately detecting apples in complex, unstructured orchards. This provides a stable and efficient detection model for the design of automatic apple-picking equipment. The main improvements are as follows:
(1) The lightweight GhostNetV2 is used instead of the C3 module in the backbone network, which reduces the redundancy of the network layers, lowers the number of parameters in the model, and improves the detection accuracy and speed while reducing the computational burden.
(2) We use the adaptive sparse sampling of bi-level routing attention (BRA) to focus attention on a few key tokens, use the spatial attention (SA) module to enhance the local key information of the sparsely sampled features, and on this basis propose a novel bi-level routing spatial attention module (BRSAM). This design both extracts the key features and reduces the influence of irrelevant features, which improves the computational efficiency and detection performance, as well as the generalization ability and robustness of the model.
(3) The repulsion loss function is used instead of the original bounding box loss function to enhance the detection of occluded and overlapping targets.

Image Acquisition and Data Enhancement
To improve the generalizability of the algorithm, we built the apple dataset by traveling to an orchard located in the Muping District of Yantai City, Shandong Province, China (latitude 37.38° N, longitude 121.60° E), several times a year to collect images of apples during the growing and ripening periods. For image collection, we used an RGB camera to collect 1428 images of different apple varieties (Red Fuji, Cream Fuji, Red Star, and Yellow Marshal) at different shooting angles under different lighting conditions (sunny and cloudy days; front exposure and back exposure). This dataset includes images with changes in light, overlapping apples, branches and leaves slightly obscuring the fruits, leaves heavily obscuring the fruits, branches obscuring the apples, and branches dividing the apples. The above natural conditions are shown in Figure 1.
Agronomy 2024, 14, x FOR PEER REVIEW
To extract robust features and cope with complicated scene interference, the training photos should cover a greater variety of scenarios. However, the fruit trees and collection time restrict the number of shots that can be taken when gathering apple photos in the field. We therefore expanded the dataset using brightness modification, cropping, noise addition, random rotation, and horizontal mirroring in order to enhance the model's detection of occluded objects, as shown in Figure 2.
After the dataset was expanded to 5712 pictures, it was divided into training, validation, and test sets in a 6:2:2 ratio using random sampling. The training and validation sets were used for model training and for evaluation during training, while the test set was used to evaluate the final model's detection performance. For the same variety, the apples were classified into three grades based on the color of their surface: ripe, medium ripe, and green. The apples were also categorized into two grades, high-quality and average, based on their size and uniformity. Using the LabelImg version 1.8.6 annotation software, the apple photos were manually labeled in YOLO format. In total, 52,894 "apple" labels were obtained.
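The expansion and split described above can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: the helper names (`adjust_brightness`, `mirror_horizontal`, `split_dataset`), the grayscale nested-list image representation, the fixed seed, and the file names are all assumptions. Note that 5712 = 4 × 1428, which is consistent with each original image contributing itself plus three augmented variants, although the source does not state the per-image count.

```python
import random

def adjust_brightness(img, factor):
    """Scale pixel intensities of a grayscale image (nested lists), clamped to 0..255."""
    return [[min(255, max(0, round(p * factor))) for p in row] for row in img]

def mirror_horizontal(img):
    """Flip an image left to right."""
    return [row[::-1] for row in img]

def split_dataset(items, ratios=(6, 2, 2), seed=0):
    """Randomly split items into train/val/test subsets by integer ratios."""
    items = list(items)
    random.Random(seed).shuffle(items)
    total = sum(ratios)
    n_train = len(items) * ratios[0] // total
    n_val = len(items) * ratios[1] // total
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]  # remainder goes to the test set
    return train, val, test

# Hypothetical file names standing in for the 5712 expanded images.
names = [f"apple_{i:04d}.jpg" for i in range(5712)]
train, val, test = split_dataset(names)
print(len(train), len(val), len(test))  # prints: 3427 1142 1143
```

Because 5712 is not divisible by 10, the three subsets cannot match 6:2:2 exactly; here the rounding remainder is assigned to the test set, which is one of several reasonable choices.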

Target Detection Network Model Selection
The YOLO algorithm is currently in its eighth iteration, YOLOv8, after undergoing multiple iterations and advancements. These iterations use various configurations of convolution kernels and network feature extraction modules, and the model's size and number of parameters have gradually increased across them. It is worth noting that the convolutional neural network (CNN) architecture of YOLOv5 is not as complex as that of YOLOv8. The YOLOv5 model is faster and smaller as a result of its simplicity. It is critical to perform apple detection efficiently in complicated situations. This study therefore uses the YOLOv5s model architecture to balance the accuracy, efficiency, and model complexity of the apple object detection network.

YOLOv5s Network Architecture
The architecture of YOLOv5 is made up of several parts: an input layer, a backbone network, a neck network, and a prediction head. Input processing includes mosaic data enhancement, adaptive anchor box computation, and adaptive image scaling. The backbone network contains various feature extraction modules, including CBL (Conv + BN + SiLU, where BN denotes batch normalization and SiLU is the activation function), C3, and SPPF. At the feature map level, SPPF merges local and global features. For multi-scale feature fusion and enhanced feature extraction, the neck network primarily makes use of the FPN + PANet structure [19]. This approach improves the network's capacity to efficiently capture features at various scales. The prediction head employs the complete intersection over union (CIoU) loss [20] as the bounding box loss, while the binary cross-entropy (BCE) loss is used for the confidence loss and classification loss. This combination of loss functions enhances the YOLOv5s model's overall object detection performance.
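To make the bounding box loss concrete, the following is a minimal sketch of the CIoU loss for a single pair of boxes in the commonly used formulation (IoU, minus a normalized center-distance term, minus an aspect-ratio consistency term). The function name `ciou_loss`, the `(x1, y1, x2, y2)` box format, and the `eps` stabilizer are assumptions made for illustration; YOLOv5's own implementation is vectorized and may differ in detail.

```python
import math

def ciou_loss(box, gt, eps=1e-9):
    """CIoU loss for a predicted box and a ground-truth box, both (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    gx1, gy1, gx2, gy2 = gt
    w, h = x2 - x1, y2 - y1
    gw, gh = gx2 - gx1, gy2 - gy1

    # Intersection over union
    iw = max(0.0, min(x2, gx2) - max(x1, gx1))
    ih = max(0.0, min(y2, gy2) - max(y1, gy1))
    inter = iw * ih
    union = w * h + gw * gh - inter + eps
    iou = inter / union

    # Squared distance between box centers
    rho2 = ((x1 + x2 - gx1 - gx2) ** 2 + (y1 + y2 - gy1 - gy2) ** 2) / 4.0
    # Squared diagonal of the smallest box enclosing both boxes
    cw = max(x2, gx2) - min(x1, gx1)
    ch = max(y2, gy2) - min(y1, gy1)
    c2 = cw ** 2 + ch ** 2 + eps

    # Aspect-ratio consistency term and its trade-off weight
    v = (4 / math.pi ** 2) * (math.atan(gw / (gh + eps)) - math.atan(w / (h + eps))) ** 2
    alpha = v / (1 - iou + v + eps)

    return 1 - (iou - rho2 / c2 - alpha * v)

print(ciou_loss((0, 0, 10, 10), (0, 0, 10, 10)))    # close to 0 for identical boxes
print(ciou_loss((0, 0, 10, 10), (20, 20, 30, 30)))  # greater than 1 for disjoint boxes
```

Unlike plain IoU, the center-distance term still provides a useful signal when the boxes do not overlap, which is why the loss for disjoint boxes exceeds 1.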

Model Backbone Module Improvement
It is difficult to deploy YOLOv5 on small embedded or mobile devices because of its many parameters and stringent hardware requirements. To solve this issue, many lightweight models, such as MobileNet and GhostNet, have been applied to the YOLO detection algorithm; these use depth-wise convolution (DW) rather than conventional convolution to reduce the number of parameters. The C3 module in the original YOLOv5s network structure uses a large number of convolution operations, which makes it less effective at capturing local information inside the window region. In this study, the lightweight GhostNetV2 is used instead of the C3 module in the backbone network to reduce the redundancy in the network layers, thus reducing the number of parameters and the computation of the model [21]. GhostNetV2 improves the model's overall efficiency by striking a more balanced trade-off between inference speed and accuracy. GhostNetV2 introduces DFC attention to augment the output features of the Ghost module and to capture long-range dependencies between pixels at different spatial locations. The input features are fed into two parallel branches, the Ghost module and the DFC attention module, which extract information from different perspectives. The final output is the element-wise product of the outputs of the two branches. This information aggregation process is shown in Figure 3.
This study uses YOLOv5s as the baseline model. Because GhostNetV2 is a lightweight module, the C3 module is replaced with the C3GhostV2 module within the backbone of YOLOv5s. The improved C3GhostV2 adopts an inverted residual bottleneck structure consisting of two GhostConv modules. To minimize the extra computational effort associated with DFC attention, GhostNetV2 reduces the size of the features by down-sampling them horizontally and vertically so that all the operations in DFC can be performed on smaller features. The resulting feature map is then up-sampled to return it to its original size to match the features in the Ghost branch. The network structure of C3GhostV2 is shown in Figure 4, where a DFC attention branch runs in parallel with the first Ghost module to augment the expanded features. The enhanced features are then sent to the second Ghost module to generate the output features. This captures the long-range dependencies between pixels at different spatial locations and enhances the model's expressiveness. The combination of Ghost and DFC attention allows the C3GhostV2 module to reduce the number of parameters while improving the detection accuracy.
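The parameter saving of a Ghost module relative to a standard convolution can be illustrated with simple arithmetic. The sketch below follows the usual Ghost module description (a primary convolution produces a fraction of the output channels and cheap depth-wise operations generate the rest); the function names, the ratio `s = 2`, and the kernel sizes are assumptions for illustration, and biases, batch normalization, and the DFC attention branch are not counted.

```python
def conv_params(c_in, c_out, k):
    """Weights of a standard k x k convolution (bias ignored)."""
    return c_in * c_out * k * k

def ghost_params(c_in, c_out, k=1, d=3, s=2):
    """Weights of a Ghost module producing c_out channels (bias ignored).

    A primary k x k convolution yields c_out // s intrinsic channels; each of
    the remaining (s - 1) * (c_out // s) channels is generated by one cheap
    d x d depth-wise filter applied to an intrinsic channel.
    """
    m = c_out // s                  # intrinsic channels
    primary = c_in * m * k * k      # ordinary convolution
    cheap = (s - 1) * m * d * d     # depth-wise cheap operations
    return primary + cheap

standard = conv_params(128, 128, 3)
ghost = ghost_params(128, 128, k=3, d=3, s=2)
print(standard, ghost, round(standard / ghost, 2))  # prints: 147456 74304 1.98
```

With `s = 2`, the module needs roughly half the weights of the standard convolution, which is the kind of reduction the lightweight backbone relies on.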

Bi-Level Routing Spatial Attention Module
To improve the feature extraction ability of the model, the network should focus on the important features and suppress the unnecessary ones. In this study, we propose a novel bi-level routing spatial attention module (BRSAM) by combining the bi-level routing attention (BRA) of the BiFormer attention module with the spatial attention (SA) of the CBAM attention module. This design employs the adaptive sparse sampling of BRA to focus attention on a few critical tokens rather than down-sampling the other, non-critical tokens. Meanwhile, the SA module enhances the local key information of the sparsely sampled features. This design extracts the key features and reduces the influence of irrelevant features, which improves the computational efficiency and detection performance as well as the generalization ability and robustness of the model. We add the bi-level routing spatial attention module to the backbone network of the YOLOv5 model, which improves the detection accuracy by increasing the weights of the key target features.
Given an intermediate feature map as an input, BRSAM sequentially infers 2D bi-level routing and spatial attention maps. The whole attention process of the BRSAM is shown in Figure 5.

Here, ⊗ denotes element-wise multiplication. During multiplication, the attention values are copied (broadcast) accordingly. Each sub-attention module is described in detail below.
The structure of bi-level routing attention (BRA) is shown in Figure 6 [22]. To efficiently locate the most relevant key-value pairs for global access, the most irrelevant key-value pairs are first filtered out at the coarse-grained region level. A region-level affinity graph is constructed and then pruned by retaining only the top k connections of each node, so that each region attends only to its top k routed regions. After the regions of interest are identified, fine-grained token-to-token attention is computed within the remaining candidate regions.
Given an input feature map $X \in \mathbb{R}^{H \times W \times C}$, it is first divided into $S \times S$ non-overlapping regions such that each region contains $\frac{HW}{S^2}$ feature vectors. This step is performed by reshaping $X$ as $X^r \in \mathbb{R}^{S^2 \times \frac{HW}{S^2} \times C}$. The query, key, and value tensors $Q, K, V \in \mathbb{R}^{S^2 \times \frac{HW}{S^2} \times C}$ are then derived via linear mapping:
$$Q = X^r W^q, \quad K = X^r W^k, \quad V = X^r W^v \tag{1}$$
where $W^q, W^k, W^v \in \mathbb{R}^{C \times C}$ are projection weights for the query, key, and value, respectively.
The participating regions are found by constructing a directed graph. The region-level queries and keys $Q^r, K^r \in \mathbb{R}^{S^2 \times C}$ are derived by averaging $Q$ and $K$ within each region, respectively. We then derive the adjacency matrix of the inter-region affinity graph using matrix multiplication between $Q^r$ and the transposed $K^r$:
$$A^r = Q^r (K^r)^T \tag{2}$$
Each entry of the adjacency matrix $A^r$ measures how semantically related two regions are. Next, only the top-k connections of each region are kept to prune the affinity graph, which is indexed by the routing index matrix $I^r \in \mathbb{N}^{S^2 \times k}$, preserving the top-k connections row by row:
$$I^r = \mathrm{topkIndex}(A^r) \tag{3}$$
Hence, the $i$th row of $I^r$ contains the $k$ indices of the most relevant regions for the $i$th region.
From the region-to-region routing index matrix $I^r$, fine-grained token-to-token attention can then be applied, which attends to all the key-value pairs in the $k$ routed regions. For each query token in region $i$, the key and value tensors are gathered:
$$K^g = \mathrm{gather}(K, I^r), \quad V^g = \mathrm{gather}(V, I^r) \tag{4}$$
where $K^g$ and $V^g$ are the gathered key and value tensors. The gathered key-value pairs are then processed as follows:
$$O = \mathrm{Attention}(Q, K^g, V^g) + \mathrm{LCE}(V) \tag{5}$$
Here, the local context enhancement term $\mathrm{LCE}(V)$ is introduced [23]. The function $\mathrm{LCE}(\cdot)$ is parameterized by a depth-wise convolution with a kernel size of 5.
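The region-to-region routing step (region-level averaging, the affinity matrix, row-wise top-k selection, and gathering) can be sketched in plain Python. This is an illustration under simplifying assumptions rather than the BiFormer implementation: the projection weights are taken as identity so that the queries and keys equal the input tokens, tensors are nested lists of shape regions × tokens × channels, and the helper names `region_mean`, `routing_indices`, and `gather` are hypothetical.

```python
def region_mean(region):
    """Average the token vectors of one region (a list of C-dimensional lists)."""
    n = len(region)
    return [sum(tok[c] for tok in region) / n for c in range(len(region[0]))]

def routing_indices(Q, K, k):
    """For each region, return the indices of its top-k most related regions."""
    q_r = [region_mean(r) for r in Q]   # region-level queries
    k_r = [region_mean(r) for r in K]   # region-level keys
    # Region-level affinity: dot product between every query/key region pair
    A = [[sum(a * b for a, b in zip(qi, kj)) for kj in k_r] for qi in q_r]
    # Row-wise top-k: keep the k largest affinities of each region
    return [sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:k]
            for row in A]

def gather(T, idx_row):
    """Collect the tokens of the routed regions for one query region."""
    return [tok for j in idx_row for tok in T[j]]

# Three regions, two tokens each, two channels (toy values).
X = [
    [[1.0, 0.0], [1.0, 0.0]],
    [[0.0, 1.0], [0.0, 1.0]],
    [[2.0, 0.1], [2.0, 0.1]],
]
I = routing_indices(X, X, k=2)
print(I)                     # prints: [[2, 0], [1, 2], [2, 0]]
print(len(gather(X, I[0])))  # prints: 4
```

Each query region then attends only to the tokens gathered from its k routed regions (here 4 tokens instead of all 6), which is where the sparsity of the attention comes from.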
The structure of the spatial attention module is shown in Figure 7 [24]. Spatial attention focuses on the "where" information and complements bi-level routing attention. To compute spatial attention, average pooling and maximum pooling operations are first applied along the channel axis of the bi-level routing features, thereby highlighting the informative regions. The two pooled maps are then fused through a convolutional layer to generate an effective feature descriptor, where spatial attention primarily encodes the locations emphasized or suppressed by bi-level routing attention.
The bi-level routing module first routes attention at the region level to highlight the key features in the global information and then computes fine-grained token-to-token attention within the routed regions. The spatial attention module successively performs pooling and convolution operations on the routing feature inputs to compute the location information of key features. The two attention modules complement each other and function in sequence, thus constituting the bi-level routing spatial attention module.
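The spatial attention step described above can be illustrated with a short NumPy sketch. It is a minimal illustration under assumptions rather than the module used in this study: the function name `spatial_attention` is hypothetical, the pooled maps are assumed to be fused by a single zero-padded convolution with an odd kernel size followed by a sigmoid (as in CBAM), and the kernel weights are passed in rather than learned.

```python
import numpy as np

def spatial_attention(F, kernel):
    """F: (H, W, C) features from the routing stage.

    kernel: (ks, ks, 2) convolution weights over the stacked [avg, max] maps;
    ks is assumed to be odd. Hypothetical helper for illustration only.
    """
    # Pool along the channel axis to obtain two (H, W) descriptors.
    avg_map = F.mean(axis=2)
    max_map = F.max(axis=2)
    pooled = np.stack([avg_map, max_map], axis=2)  # (H, W, 2)

    ks = kernel.shape[0]
    pad = ks // 2
    padded = np.pad(pooled, ((pad, pad), (pad, pad), (0, 0)))

    # Plain "same" convolution that fuses the two pooled maps into one map.
    H, W, _ = F.shape
    logits = np.empty((H, W))
    for i in range(H):
        for j in range(W):
            logits[i, j] = np.sum(padded[i:i + ks, j:j + ks, :] * kernel)

    # Sigmoid gives a per-location weight in (0, 1): the "where" mask.
    mask = 1.0 / (1.0 + np.exp(-logits))
    # Re-weight every channel of the input by the spatial mask.
    return F * mask[:, :, None], mask
```

In a sequential arrangement like the one described, `F` would be the output of the bi-level routing stage reshaped back to an H × W × C map, so the mask emphasizes or suppresses locations in the already routed features.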

Repulsion Loss
In a natural orchard, apples are often clustered together and occlude each other, or they are occluded by leaves. In this study, we introduce repulsion loss to address the apple occlusion problem during detection. Repulsion loss is driven by two motivations: the attraction of the target and the repulsion of other objects around it [25]. The repulsion loss consists of three components, which are defined as follows:

$$L = L_{Attr} + \alpha \cdot L_{RepGT} + \beta \cdot L_{RepBox} \tag{6}$$

The attraction term $L_{Attr}$ requires the prediction frame to be close to the specified target, while the repulsion terms $L_{RepGT}$ and $L_{RepBox}$ require the prediction frame to be far away from the other surrounding ground truth objects and from prediction frames with different specified targets, respectively. The coefficients $\alpha$ and $\beta$ are weights that balance the auxiliary losses. A bounding box is considered a positive sample if its IoU with at least one ground truth box is ≥ 0.5, and a negative sample otherwise. The attraction term $L_{Attr}$ narrows the gap between the predicted and actual boxes; the SmoothL1 distance is used as the attraction distance, and the actual box with the largest IoU is designated as the target. The repulsion term $L_{RepGT}$ defines the neighboring non-target actual object with the maximum IoU, other than the specified target, as the repulsive object; the greater the overlap between the predicted box and this non-target actual object, the greater the penalty $L_{RepGT}$ on the bounding box regressor, thus preventing the predicted bounding box from moving towards its neighboring non-targeted objects. $L_{RepBox}$ pushes apart predicted boxes that lie close to each other but have different regression targets, which lowers the risk that boxes belonging to different apples are merged or suppressed by NMS, thus making the detector more robust on images containing many apples.
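The three terms described above can be sketched as follows. This is a simplified illustration under assumptions, not the training code of this study: all function names are hypothetical, every predicted box is treated as a positive sample, targets are assigned directly from the predicted boxes, SmoothL1 is applied to raw corner coordinates, the repulsion terms use the smoothed logarithmic penalty of the cited repulsion loss work [25], and the defaults `alpha=0.5`, `beta=0.5`, and `sigma=0.5` are placeholders rather than values reported here.

```python
import numpy as np

def box_area(b):
    return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])

def intersection(a, b):
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(0.0, w) * max(0.0, h)

def iou(a, b):
    inter = intersection(a, b)
    union = box_area(a) + box_area(b) - inter
    return inter / union if union > 0 else 0.0

def iog(pred, gt):
    # Intersection over ground-truth area.
    area = box_area(gt)
    return intersection(pred, gt) / area if area > 0 else 0.0

def smooth_l1(pred, gt):
    d = np.abs(np.asarray(pred, float) - np.asarray(gt, float))
    return float(np.sum(np.where(d < 1.0, 0.5 * d ** 2, d - 0.5)))

def smooth_ln(x, sigma=0.5):
    # Penalty that grows with overlap; linear above sigma to limit outliers.
    x = min(x, 1.0 - 1e-6)
    if x <= sigma:
        return -np.log(1.0 - x)
    return (x - sigma) / (1.0 - sigma) - np.log(1.0 - sigma)

def repulsion_loss(preds, gts, alpha=0.5, beta=0.5):
    """preds, gts: lists of [x1, y1, x2, y2]; all preds assumed positive."""
    n = len(preds)
    # Target of each prediction: the ground truth with the largest IoU.
    targets = [max(range(len(gts)), key=lambda g: iou(p, gts[g])) for p in preds]
    # Attraction: pull each prediction towards its own target.
    l_attr = sum(smooth_l1(p, gts[t]) for p, t in zip(preds, targets)) / n
    # RepGT: penalize overlap with the most-overlapping non-target ground truth.
    rep_gt = 0.0
    for p, t in zip(preds, targets):
        others = [g for i, g in enumerate(gts) if i != t]
        if others:
            g_rep = max(others, key=lambda g: iou(p, g))
            rep_gt += smooth_ln(iog(p, g_rep))
    l_rep_gt = rep_gt / n
    # RepBox: penalize overlap between predictions with different targets.
    rep_box, pairs = 0.0, 0
    for i in range(n):
        for j in range(i + 1, n):
            v = iou(preds[i], preds[j])
            if targets[i] != targets[j] and v > 0:
                rep_box += smooth_ln(v)
                pairs += 1
    l_rep_box = rep_box / max(pairs, 1)
    total = l_attr + alpha * l_rep_gt + beta * l_rep_box
    return total, (l_attr, l_rep_gt, l_rep_box)
```

With two adjacent apples, a prediction that drifts from its own target towards the neighbor receives a larger RepGT penalty, which is the behaviour the text attributes to the repulsion terms.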

Improved YOLOv5s Network Model
As illustrated in Figure 8, the enhanced YOLOv5s network structure in this work is named YOLOv5s-GBR. First, to make the model lightweight, the Ghost module is added to the backbone. Second, a bi-level routing spatial attention module (BRSAM) is added in front of the spatial pyramid pooling-fast (SPPF) layer to improve detection performance when many targets are present. Third, the original bounding box loss function is replaced with the repulsion loss to improve the detection of occluded and overlapping apples.

Training Platform
This study trains and evaluates the apple detection model using a deep learning framework built with PyTorch 1.10.1. The system configuration includes Windows 10 as the operating system, an Intel(i9)-1390K CPU, 128 GB of RAM, and an NVIDIA GeForce RTX 3090 GPU with 24 GB of memory. The input image size for the model is 640 × 640 pixels. The training parameters are defined as follows: the momentum is set to 0.937, the initial learning rate is 0.001, the decay coefficient is 0.9, the batch size is 8, and the number of iterations is 300.

Evaluation Indicators
The following metrics were used to evaluate the model's performance in detecting apples: precision (P, %), recall (R, %), average precision (AP, %), the number of parameters (Params, M), and floating point operations (FLOPs, G). The metrics are calculated as shown in Equations (7)-(9):

$$P = \frac{TP}{TP + FP} \times 100\% \tag{7}$$

$$R = \frac{TP}{TP + FN} \times 100\% \tag{8}$$

$$AP = \int_0^1 P(R)\, dR \tag{9}$$

where P is the proportion of predicted frames that are correct; R is the proportion of tagged frames that are successfully detected; TP is the number of correctly predicted frames; FP is the number of incorrectly predicted frames; FN is the number of tagged frames that were missed; and AP is the average precision for apples, i.e., the area under the precision-recall curve.
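The definitions of precision, recall, and AP above can be turned into a small worked example. This is a hedged sketch: the function names are hypothetical, precision and recall are returned as fractions rather than percentages, and AP is computed with all-point interpolation over the precision-recall curve, which is one common convention and is an assumption here since the text does not state which integration scheme was used.

```python
import numpy as np

def precision_recall(tp, fp, fn):
    # P = TP / (TP + FP), R = TP / (TP + FN), returned as fractions.
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return p, r

def average_precision(scores, is_tp, num_gt):
    """scores: detection confidences; is_tp: 1 if the detection matched a
    ground-truth box, else 0; num_gt: number of ground-truth boxes."""
    order = np.argsort(-np.asarray(scores, float))
    hits = np.asarray(is_tp, float)[order]
    tp_cum = np.cumsum(hits)
    fp_cum = np.cumsum(1.0 - hits)
    recall = tp_cum / num_gt
    precision = tp_cum / (tp_cum + fp_cum)
    # Add sentinel points, then make precision monotonically non-increasing.
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([1.0], precision, [0.0]))
    p = np.maximum.accumulate(p[::-1])[::-1]
    # Area under the interpolated precision-recall curve.
    return float(np.sum((r[1:] - r[:-1]) * p[1:]))

p, r = precision_recall(tp=8, fp=2, fn=2)
ap = average_precision([0.9, 0.8, 0.7], [1, 0, 1], num_gt=2)
```

In this toy case, 8 correct detections with 2 false positives and 2 missed apples give P = R = 0.8, and the three ranked detections give an AP of 5/6.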

Training Results
After training is complete, the results of the improved YOLOv5s-GBR model are compared with those of the original model, as shown in Figure 9. The YOLOv5s-G, YOLOv5s-B, and YOLOv5-R algorithms are combined as shown in Table 1. The recall and mAP values are high for single-class targets, as Figure 10 illustrates. Every metric suggests that the trained model satisfies the high precision requirements.

Ablation Experiments
In complicated neural networks, ablation tests are usually employed to ascertain the impact of a particular network substructure, training approach, or training parameter on the model. They are therefore essential to the design of neural network architectures. To verify the improvements in the YOLOv5s-GBR model, ablation tests were carried out: each enhancement mechanism of the new model was isolated and trained separately to see whether it had a discernible impact. The results of the ablation experiments are shown in Table 2. Group 1 gives the parameter and mAP results of the original YOLOv5s model, which has a mAP of 94.3%. In Group 2, the C3 module in the YOLOv5 target detection model was replaced with the lightweight C3GhostV2 module, resulting in a 26.6% decrease in the parameters and a 0.8% reduction in the mAP relative to the original model. Group 3 added the BRSAM to Group 2, which resulted in a 21.5% decrease in the parameters and a 2.5% rise in the mAP compared to Group 1. Group 4 instead applied the repulsion loss to Group 2, which reduced the number of parameters by 24% and improved the mAP by 2.8% compared to Group 1. In Group 5, the full YOLOv5s-GBR model reduced the number of parameters by 20.3% and improved the mAP by 4.1% compared to Group 1. The experimental results show that the improved modules raised the overall performance of the YOLOv5s-GBR model: the GhostNetV2 module effectively reduced the complexity of the model and the amount of computation, while the bi-level routing spatial attention module and repulsion loss improved the detection accuracy more significantly.

Verification of the Network Model
The test samples collected from the locations where the images were taken were fed into both the YOLOv5s model and the modified YOLOv5s-GBR model for detection. As shown in Figure 11, a comparison of the detection results before and after improvement shows that the improved YOLOv5s-GBR model achieves high detection precision and recall. The improved model can still accurately identify apple targets that are heavily occluded, and it did not misidentify backgrounds as apples when they had a similar color.
This research demonstrated the superiority of the YOLOv5s-GBR model by comparing it with other detection models, such as Faster R-CNN, SSD, and YOLOv5s. The experimental results are shown in Table 3. The same test dataset was used for all the experiments, and all the settings were kept constant. The YOLOv5s-GBR model proposed in this study improves precision by 4.1%, recall by 4.0%, and mAP by 4.1% compared with the original YOLOv5s model. Compared with other lightweight models, namely YOLOv5-lite-s, YOLOv5-lite-e, and YOLOv4-tiny, the YOLOv5s-GBR model increases precision by 9.2%, 8.1%, and 2.0%, respectively; recall by 14.1%, 12.6%, and 7.0%, respectively; and mAP by 12.7%, 10.6%, and 5.9%, respectively. Moreover, the YOLOv5s-GBR model still outperforms YOLOv5m, YOLOv5l, YOLOv8s, Faster R-CNN, and the commonly used SSD model, improving precision by 2.5%, 1.8%, 0.9%, 8.9%, and 3.1%, respectively; recall by 2.8%, 2.0%, 1.1%, 6.4%, and 9.0%, respectively; and mAP by 2.7%, 1.9%, 0.8%, 2.6%, and 5.3%, respectively. At the same time, the size of the model is only 14.3 MB and the detection frame rate reaches 101.2 FPS, which shows that the lightweight design improves the detection speed and accuracy while maintaining a small model size, facilitating subsequent deployment on embedded platforms.

Conclusions
In this paper, the YOLOv5s-GBR model is proposed for the detection of occluded apples in unstructured orchards in natural environments. Producing the dataset required consideration of factors such as lighting, overlapping fruits, foliage occlusion, and similarity between the target and background colors. As a result, the dataset covers a variety of common natural situations and is as rich as possible.
In terms of model improvement, the algorithm introduces a lightweight C3Ghost module into the backbone network instead of the C3 module to increase the operation speed. This paper also presents a new attention mechanism module called the bi-level routing spatial attention module (BRSAM) for use in the feature fusion process. It combines the spatial attention (SA) of the CBAM with the bi-level routing attention (BRA) of the BiFormer attention module. Instead of downsampling the unimportant tokens, this method uses the adaptive sparse sampling of BRA to focus on a small number of important tokens. Meanwhile, the local key information of the sparsely sampled features is enhanced by the SA module. This design extracts important features while minimizing the extraction of superfluous information, improving computational performance and efficiency. It also strengthens the model's generalization and robustness and improves the extraction of features from occluded targets. Compared with adding only the BiFormer attention module or only the CBAM to the YOLOv5s model, adding the BRSAM improves the recognition accuracy by 3.3% and 2.7% and the recall by 2.5% and 1.9%, respectively. Lastly, to improve the detection performance in both occlusion and overlapping scenarios, this research substitutes the repulsion loss function for the original bounding box loss function. This improves the detection of occluded and overlapping apples.
In the comparison experiments, the improved YOLOv5s-GBR model achieves a mAP of 98.20% at a frame rate of 101.2 fps, which is better than the mAP of popular models such as YOLOv5s, YOLOv5-lite-s, YOLOv5-lite-e, YOLOv4-tiny, YOLOv5m, YOLOv5l, YOLOv8s, Faster R-CNN, and SSD. The model therefore has great potential for picking and detection applications on mobile devices with limited computational power.
In the future, we will prune the channels of the YOLOv5s-GBR network according to the characteristics of apples to further optimize the network model and improve the detection precision and recall. Ultimately, this enhanced detection model will be deployed on an apple-picking robot to achieve autonomous detection and picking, thereby demonstrating the practical significance of this study.

Figure 1. Pictures of apples in different natural conditions.

Figure 6. Bi-level routing attention module.

Figure 9. Training set box_loss curves.

Table 1. How the algorithms are combined.

Table 2. Results of ablation experiments.

Table 3. Comparison of experimental results.