1. Introduction
The “Okubo” peach, an excellent peach variety, is widely planted in northern China. Rational fruit thinning is an effective practice in peach tree management because it enhances fruit quality and extends the lifespan of the trees. Without reasonable fruit thinning, trees weaken, new shoots shorten, fruits are small and of poor quality, and alternate bearing (heavy crops in some years followed by light crops in others) develops. Reasonable fruit thinning can reduce nutrient competition and fruit drop and improve the fruit set rate of high-quality commercial fruit [
1,
2,
3]. Peach trees are generally thinned about 40 days after flowering, when densely set fruit is thinned out; thinning should remove small fruits, double fruits, deformed fruits, fruits damaged by pests and diseases, and leafless fruits to realize the scientific management of a peach orchard.
Currently, the commonly used methods of fruit thinning are manual, mechanical, and chemical fruit thinning [4]. Manual fruit thinning relies on production experience; it is effective but labor-intensive and costly. Chemical fruit thinning is cheap, but its effect is unstable, and it easily causes excessive thinning. Mechanical fruit thinning reduces labor but still requires manual operation [5]. The accurate detection of young peaches is therefore a key step in designing intelligent fruit-thinning robots and provides technical support for their vision systems.
The use of vision technology combined with image processing and artificial intelligence has obvious advantages in detecting fruits, which can be efficiently analyzed [
6,
7]. However, because young fruits are small and similar in color to the leaves, it is challenging for traditional machine vision technology to detect them accurately. With the improvement of computing power and the rapid development of deep learning algorithms in recent years, convolutional neural networks (CNNs) have shown a strong detection ability in fruit target detection. Target detection algorithms fall into two main groups. The first is based on region proposals and includes R-CNN [8], Fast R-CNN [9], Faster R-CNN [10], and Mask R-CNN [11]. The second treats the detection of the target position as a regression problem, as in SSD [12] and YOLOv3 [13].
In recent times, many researchers have used the YOLO algorithm for detection models in fruit detection studies, such as for green apple [
14], citrus [
15], strawberry [
16], tomato [
17], and pear [
18] fruits. To realize the ripeness detection of tomato fruits, Zeng et al. [
19] replaced the YOLOv5 backbone network with MobileNetV3 and performed channel pruning on the neck network to achieve a lightweight model and an mAP of 96.9%. Nan et al. [
20] utilized the WFE-C4 module and GPF-SPP to achieve the fusion of feature information and detected a variety of dragon fruits during the picking process with an mAP of 86%. In order to implement the detection of citrus fruits at the ripening stage in an orchard, Xu et al. [
21] used GhostNet as the backbone of YOLOv4 and DWConv for reducing the parameters, and the ECA module was included to enhance the precision, and the mAP of the improved YOLOv4 for citrus reached 98.21%.
In modern orchard management, some scholars have detected fruits during the thinning period. To address missed detections of young apple fruits under occlusion, Jiang et al. [
22] introduced the fusion of a non-local attention module in YOLOv4, which was tested to achieve an mAP of 96.9% under severe occlusion conditions and was able to realize the detection of young apple fruits in complex environments. To realize the automatic detection of pear fruits during the fruit-thinning period, Lu et al. [
23] proposed ODL Net, a detection algorithm for small green pear fruits. The experimental results showed that the precision of ODL Net before and after the fruit-thinning stage reached 56.2% and 65.1%, respectively, and the recall reached 61.3% and 70.8%, respectively. To achieve apple detection during fruit thinning, Wang and He [
24] used channel pruning to prune YOLOv5s, and the improved YOLOv5s achieved a 95.8% accuracy for apple detection during fruit thinning. Hussain et al. [
25] utilized a mask R-CNN for the segmentation of green apples as well as stem segments to detect apples in the fruit-thinning stage and estimated the orientation of green fruits and stems using principal component analysis. The AP values of the model for the overall masking of green apples and stems were 83.4% and 38.9%, respectively.
YOLO-based detection models have made progress in fruit detection at the thinning stage, and their detection performance is sufficient to serve as a guide for detecting young “Okubo” peaches. Building on these models, this study optimizes the model structure, reduces the parameters and FLOPs, improves the accuracy, and addresses the large model size and the missed detection of occluded fruits. A detection method for young peaches, YOLO-PEM, is proposed to handle the similar colors of fruits and leaves, the small fruit size, and the missed detections under occlusion that affect “Okubo” peaches during fruit thinning. The primary contributions of this study are as follows:
(1) PConv was used to design the C2f_P module, which replaces all C2f modules in YOLOv8s to lighten the model and enhance the accuracy for young peaches.
(2) The EMA module was embedded into the C2f_P_1 module of YOLO-P to improve the model’s capability to extract the feature information of young peaches, thus improving the detection precision for young peaches.
(3) CIoU was replaced with the MPDIoU bounding box loss function to improve the accuracy of the bounding box and speed up the model convergence.
The YOLO-PEM has a smaller size and higher accuracy for young peach detection, which offers technical support for establishing a robotic vision system for “Okubo” peach fruit growth management and thinning.
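To illustrate why a PConv-based module can lighten the model, the following minimal sketch compares the multiply-accumulate cost of a dense convolution with that of a partial convolution that convolves only a fraction of the channels and passes the rest through unchanged. The feature-map shape and the partial ratio of 1/4 are assumptions made for illustration only (1/4 is the ratio commonly used with PConv); they are not values taken from this study.

```python
def conv_macs(h, w, k, c_in, c_out):
    """Multiply-accumulate count of a dense k x k convolution on an h x w map."""
    return h * w * k * k * c_in * c_out


def pconv_macs(h, w, k, channels, ratio=0.25):
    """PConv convolves only the first channels*ratio channels.

    The remaining channels are passed through untouched, so they add no
    multiply-accumulate operations.
    """
    c_partial = int(channels * ratio)
    return conv_macs(h, w, k, c_partial, c_partial)


# Hypothetical feature-map shape (not from the paper).
h, w, k, channels = 40, 40, 3, 256

dense = conv_macs(h, w, k, channels, channels)
partial = pconv_macs(h, w, k, channels)
ratio = partial / dense

print(f"dense conv  : {dense:,} MACs")
print(f"partial conv: {partial:,} MACs")
print(f"cost ratio  : {ratio}")
```

Under these assumptions, the partial convolution needs 1/16 of the operations of the dense one, because the cost scales with the square of the number of convolved channels. The saving for a whole network is smaller, since only part of each module is replaced.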
  3. Results and Analysis
  3.1. Effects of Different C2f_P Module Positions on Model Performance
To examine the impact of applying the C2f_P module at different positions in YOLOv8s on the detection of young “Okubo” peaches, the C2f_P module was substituted for the C2f modules of the backbone network (denoted as YOLOv8s-B), the C2f modules of the neck (denoted as YOLOv8s-N), and the C2f modules at all positions (denoted as YOLOv8s-A). The results are displayed in 
Table 3. As shown in 
Table 3, adding C2f_P at all three locations improved the AP and F1 score while reducing the number of parameters, FLOPs, and model size compared with YOLOv8s. This implies that the C2f_P module can optimize the model by minimizing unnecessary calculations and memory retrievals. Compared with the YOLOv8s model, YOLOv8s-B enhances the feature extraction ability from the backbone network by 0.68% in terms of the AP and 0.47% in terms of the F1 score. Similarly, YOLOv8s-N enhances the ability of the neck network to combine characteristics, resulting in a 0.91% increase in the average precision (AP) and a 0.12% rise in the F1 score compared with YOLOv8s. Additionally, YOLOv8s-A has a higher AP by 0.12% compared with YOLOv8s-B, while its F1 score is higher than that of YOLOv8s-N by 0.31%, illustrating its advantage in detection accuracy.
The model sizes of YOLOv8s-B, YOLOv8s-N, and YOLOv8s-A are 18.8 MB, 18.7 MB, and 16.0 MB, respectively. Among them, YOLOv8s-A is the most lightweight, with the fewest parameters and FLOPs of the three variants. Specifically, YOLOv8s-A reduces the parameters by 2.82 M, the FLOPs by 7.0 G, and the model size by 5.4 MB compared with YOLOv8s.
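As a quick check of the arithmetic, the absolute size reduction can be converted into a relative one. The sketch below uses the YOLOv8s model size of 21.4 MB reported in the conclusions of this paper; the function name is hypothetical.

```python
def relative_reduction(baseline, reduction):
    """Return the reduction as a percentage of the baseline value."""
    return 100.0 * reduction / baseline


baseline_size_mb = 21.4   # YOLOv8s model size, as reported in the conclusions
size_reduction_mb = 5.4   # reduction achieved by YOLOv8s-A

pct = relative_reduction(baseline_size_mb, size_reduction_mb)
remaining_mb = round(baseline_size_mb - size_reduction_mb, 1)

print(f"YOLOv8s-A size: {remaining_mb} MB ({pct:.2f}% smaller than YOLOv8s)")
```

The result agrees with the 16.0 MB model size of YOLOv8s-A and with the 25.23% size reduction stated in the conclusions.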
  3.2. The Effect of the Attention Mechanism on the Detection of Young Peaches
In this study, the EMA module was embedded in the FasterBlock of the C2f_P_1 module to enhance the backbone network’s ability to extract features. To verify this choice, the effects of several attention mechanisms on the detection model’s performance were tested. The model improved by the lightweight C2f_P module is denoted as YOLO-P. To assess the effect of the EMA module on YOLO-P, we also examined the effects of BAM, CA, SE, and SimAM on the model’s performance individually. The bottleneck attention module (BAM) is a channel attention mechanism that adaptively strengthens or weakens the feature responses of different channels [
35]. Coordinate attention (CA) integrates channel information and directional position information, which enables the model to accurately detect targets of interest [
36]. Squeeze-and-excitation (SE) makes the model pay more attention to the valuable channel information by learning the adaptive channel weights, thus improving the model’s performance ability [
37]. SimAM is a parameter-free 3D attention module inspired by the attention mechanism of the human brain; it speeds up the computation of the attention weights by using a closed-form solution of its energy function [
38]. The results are shown in 
Table 4.
Table 4 shows that, of the five attention mechanisms tested, the model using the EMA module achieved an AP of 90.68%, which was 0.40%, 0.37%, 0.49%, and 0.70% higher than that of BAM, CA, SE, and SimAM, respectively. Compared with YOLO-P, the precision, recall, AP, and F1 score of the model using the EMA module increased by 0.46%, 0.06%, 0.87%, and 0.24%, respectively, and the model size increased by only 0.1 MB. The precision of the model using the EMA module was higher than that of BAM, CA, and SE by 0.83%, 1.41%, and 0.69%, respectively, and its recall was higher than that of SimAM by 1.47%. With SimAM, the precision improved by 0.98% over YOLO-P and the model size did not grow, but the recall and F1 score decreased by 1.41% and 0.36%, respectively, and the detection AP decreased. Hence, in this study, the EMA module was integrated into the FasterBlock of the C2f_P module in the backbone network of the lightweight YOLO-P model to enhance the feature information extracted by the backbone and improve the accuracy of young fruit detection.
Box plots show the dispersion and distribution of a set of data. The AP changes of the five attention mechanisms (EMA, BAM, CA, SE, and SimAM) during training are plotted in Figure 8. The EMA has the highest AP during training, and the median and upper quartile of its AP are higher than those of the other four attention mechanisms. The EMA, CA, SE, and SimAM attention mechanisms exhibit outliers, which arise because the model has not yet converged in the early stage of training and therefore has a low AP. Hence, the EMA module integrated into the C2f_P_1 module of YOLO-P has superior precision in detecting young “Okubo” peaches.
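The box plot statistics discussed above (median, quartiles, and outliers) can be sketched as follows. The AP history is hypothetical and only mimics a run whose first epoch has a very low AP. The 1.5 × IQR rule used to flag outliers is the common box plot convention, which this study is assumed, but not confirmed, to follow.

```python
import statistics


def box_stats(values):
    """Return quartiles and outliers using the common 1.5 * IQR rule."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    outliers = [v for v in values if v < low_fence or v > high_fence]
    return {"q1": q1, "median": median, "q3": q3, "outliers": outliers}


# Hypothetical per-epoch AP values (%), not taken from Figure 8.
ap_history = [12.0, 70.0, 80.0, 84.0, 86.0, 88.0, 89.0, 90.0, 90.5]

stats = box_stats(ap_history)
print(stats)
```

In this example, only the first value falls below the lower fence, which corresponds to the kind of early-training outlier described above.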
  3.3. Performance Comparison of Different Loss Functions
In this study, we substituted the CIoU loss function of YOLOv8s with SIoU [
39], EIoU [
40], DIoU [
32], and MPDIoU to compare their effects on the detection of young “Okubo” peaches. The results of different loss function models are shown in 
Table 5. The loss value curves that were obtained when using various loss functions for training are displayed in 
Figure 9. As indicated in 
Table 5, among the five individual loss function models, the F1 score of the MPDIoU loss function is 86.18%, which is 0.33%, 0.27%, 0.55%, and 0.14% higher than that of CIoU, SIoU, EIoU, and DIoU, respectively. The AP of the MPDIoU loss function is 89.90%, which is 0.89%, 0.82%, 0.55%, and 0.31% higher than that of CIoU, SIoU, EIoU, and DIoU, respectively. The precision of MPDIoU is 0.01%, 0.20%, and 0.30% higher than that of CIoU, SIoU, and DIoU, and 0.24% lower than that of EIoU, but its recall is 1.18% higher. The recall of MPDIoU is 0.59%, 0.33%, and 0.02% higher than that of CIoU, SIoU, and DIoU, respectively.
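The trade-off between precision and recall noted above can be made concrete with the F1 score, the harmonic mean of precision and recall. The precision and recall values below are hypothetical and are not taken from Table 5; they only show that a model with slightly lower precision can still reach a higher F1 score when its recall is higher.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both given in the range 0..1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical models: A has higher precision, B has higher recall.
f1_a = f1_score(precision=0.90, recall=0.80)
f1_b = f1_score(precision=0.88, recall=0.84)

print(f"model A F1: {f1_a:.4f}")
print(f"model B F1: {f1_b:.4f}")
```

Because the harmonic mean penalizes an imbalance between the two terms, model B obtains the higher F1 score in this example.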
As demonstrated in 
Figure 9, the loss value fluctuates within a narrow range after 150 epochs, which indicates that the model has converged without over-fitting or under-fitting. All five loss functions (CIoU, SIoU, EIoU, DIoU, and MPDIoU) converge quickly and keep the loss within a low range. Still, the MPDIoU loss function converges to a lower loss value than the others, which improves the training effect. In short, the MPDIoU loss function offers the best overall performance for model training and the highest detection accuracy.
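For readers unfamiliar with MPDIoU, the following is a minimal sketch of the loss for a single pair of boxes, written from the published definition of MPDIoU as we understand it: the IoU is penalized by the squared distances between the two top-left corners and between the two bottom-right corners, each normalized by the squared diagonal of the input image. The function name, the box format, and the example values are assumptions made for illustration; this is not the training code used in this study.

```python
def mpdiou_loss(pred, target, img_w, img_h):
    """MPDIoU loss for two boxes given as (x1, y1, x2, y2), x1 < x2, y1 < y2."""
    px1, py1, px2, py2 = pred
    tx1, ty1, tx2, ty2 = target

    # Intersection over union of the two boxes.
    inter_w = max(0.0, min(px2, tx2) - max(px1, tx1))
    inter_h = max(0.0, min(py2, ty2) - max(py1, ty1))
    inter = inter_w * inter_h
    union = (px2 - px1) * (py2 - py1) + (tx2 - tx1) * (ty2 - ty1) - inter
    iou = inter / union

    # Squared distances between matching corners.
    d_top_left = (px1 - tx1) ** 2 + (py1 - ty1) ** 2
    d_bottom_right = (px2 - tx2) ** 2 + (py2 - ty2) ** 2

    # Normalize by the squared diagonal of the input image.
    norm = img_w ** 2 + img_h ** 2
    mpdiou = iou - d_top_left / norm - d_bottom_right / norm
    return 1.0 - mpdiou


# Hypothetical boxes on a 640 x 640 input image.
perfect = mpdiou_loss((10, 10, 50, 50), (10, 10, 50, 50), 640, 640)
shifted = mpdiou_loss((10, 10, 50, 50), (20, 20, 60, 60), 640, 640)

print(f"identical boxes: {perfect}")
print(f"shifted box    : {shifted:.4f}")
```

The loss is zero when the predicted and ground-truth boxes coincide and grows both as the overlap shrinks and as the corners drift apart, so the corner terms still provide a signal when two candidate boxes have the same IoU.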
  3.4. Ablation Test
The ablation test examined the effect of each module on the model performance by adding or removing modules and training each resulting model under the same parameters. In this study, different modules were added and combined in five sets of tests to confirm the effect of each module. Based on YOLOv8s, the C2f_P module makes the model lightweight; the EMA module improves the backbone network’s capacity to extract feature information; and the MPDIoU loss function speeds up model convergence and improves detection accuracy. YOLO-P denotes the use of the C2f_P module, YOLO-M the use of the MPDIoU bounding box loss function, YOLO-PE the use of the C2f_P and EMA modules, YOLO-PM the use of the C2f_P module and the MPDIoU bounding box loss function, and YOLO-PEM the use of all three improvements. 
Table 6 displays the ablation test results.
As shown in 
Table 6, the precision of YOLO-PEM is 1.51%, 1.90%, 1.50%, 1.46%, and 2.79% higher than that of YOLOv8s, YOLO-P, YOLO-M, YOLO-PE, and YOLO-PM, respectively. The recall of YOLO-PEM is slightly higher than that of YOLO-P, YOLO-M, YOLO-PE, and YOLO-PM, but YOLO-PEM has significantly higher precision than these models, balancing precision and recall. Compared with YOLOv8s, YOLO-P, YOLO-M, YOLO-PE, and YOLO-PM, the F1 score of the YOLO-PEM model is higher by 0.85%, 0.42%, 0.52%, 0.18%, and 0.70%, and the AP is higher by 1.85%, 1.05%, 0.96%, 0.18%, and 0.73%, respectively.
In terms of model size, YOLO-PEM is 5.3 MB smaller than YOLOv8s, with 2.81 M fewer parameters and 6.6 G fewer FLOPs. Compared with YOLO-P, YOLO-PEM raises the model size, FLOPs, and parameters by only 0.1 MB, 0.4 G, and 0.01 M, respectively, but raises the AP by 1.05%. Similarly, compared with YOLO-PM, the parameters, FLOPs, and size of YOLO-PEM increase slightly by 0.01 M, 0.4 G, and 0.1 MB, respectively, but the AP rises by 0.73%. Therefore, YOLO-PEM enhances the detection accuracy for young “Okubo” peaches.
  3.5. Detection of Young Peaches by Different Lightweight Detection Models
To confirm the effectiveness of the improved model, YOLO-PEM was compared with five popular lightweight detection models: YOLOv3-tiny, YOLOv4-tiny [
41], YOLOv5s, YOLOv6s [
42], and YOLOv7-tiny. The test results of the different models are shown in 
Table 7.
As shown in 
Table 7, the AP and F1 score of YOLO-PEM for young peach detection were 90.86% and 86.70%, respectively, higher than those of the other lightweight models. Compared with YOLOv3-tiny, YOLOv4-tiny, YOLOv5s, YOLOv6s, and YOLOv7-tiny, the AP of YOLO-PEM was higher by 6.26%, 6.01%, 2.05%, 2.12%, and 1.87%, respectively, and its F1 score was higher by 3.93%, 3.42%, 1.54%, 1.52%, and 0.68%, respectively. The number of parameters of YOLO-PEM was 0.33 M and 1.31 M lower than that of YOLOv3-tiny and YOLOv6s, respectively. Moreover, YOLO-PEM reached 196.2 frames per second (f·s⁻¹), enabling the fruit-thinning robot to detect fruits efficiently in real time during the fruit-thinning stage. Notably, YOLO-PEM was faster than YOLOv6s and close to YOLOv3-tiny, YOLOv4-tiny, YOLOv5s, and YOLOv7-tiny in speed, while achieving higher detection accuracy than all of them. These findings show that YOLO-PEM effectively balances accuracy and speed for detecting young “Okubo” peach fruits. The model’s size is 0.4 MB and 20.5 MB lower than that of YOLOv6s. The size of YOLO-PEM is slightly larger than that of YOLOv5s and YOLOv7-tiny, by 2.4 MB and 4.4 MB, respectively, but its AP is 2.05% and 1.87% higher than that of YOLOv5s and YOLOv7-tiny, respectively, which better meets the accuracy demand of young peach detection in complex peach orchard environments. Overall, YOLO-PEM offers higher detection precision than the other detection models while remaining lightweight, and it can detect young “Okubo” peaches in the complex environment of peach orchards.
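To relate the reported frame rate to a real-time budget, the frame rate can be converted into a per-frame latency. In the sketch below, the 30 frames-per-second camera is a hypothetical example, and the reported 196.2 f·s⁻¹ applies only to the hardware used in this study, so the margin on an embedded device would differ.

```python
def frame_latency_ms(fps):
    """Average processing time per frame in milliseconds."""
    return 1000.0 / fps


model_fps = 196.2    # detection speed reported for YOLO-PEM
camera_fps = 30.0    # hypothetical camera frame rate (assumption)

latency = frame_latency_ms(model_fps)
budget = frame_latency_ms(camera_fps)
keeps_up = latency < budget

print(f"latency per frame: {latency:.2f} ms")
print(f"camera budget    : {budget:.2f} ms")
print(f"keeps up with camera: {keeps_up}")
```

At the reported speed, each frame takes about 5.1 ms, well within the roughly 33 ms available per frame from the assumed camera.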
  3.6. Comparison of YOLO-PEM and YOLOv8s Detection Effects
To confirm the benefits of the YOLO-PEM model in detecting young “Okubo” peaches, its detection results on several test-set images of young peaches were compared with those of YOLOv8s. 
Figure 10 displays the results.
As displayed in 
Figure 10, in sunny conditions, the improved YOLO-PEM detected young peaches with higher precision than YOLOv8s, and YOLOv8s missed fruits under severe occlusion. In overcast conditions, YOLO-PEM showed no missed detections, while YOLOv8s did. Under shading and clustered growth, where the colors of leaves and young peaches were similar, YOLO-PEM had a higher detection accuracy for young peaches than YOLOv8s. In backlight conditions, YOLOv8s produced a false detection, identifying leaves as fruits, while YOLO-PEM did not.
In the actual operating environment, the weather is mainly sunny or cloudy, so the detection of young “Okubo” peaches was tested under both conditions. The YOLO-PEM model was applied to the test-set images of young peaches, of which 169 were taken in sunny conditions and 73 in cloudy conditions. The results for the different weather conditions are displayed in 
Table 8.
As displayed in 
Table 8, in sunny conditions, the AP and F1 of YOLO-PEM were 1.86% and 0.75% higher than those of YOLOv8s, respectively. In cloudy conditions, the AP and F1 of YOLO-PEM were 1.9% and 1.23% higher than those of YOLOv8s, respectively. In addition, the AP and F1 for young peaches were higher in cloudy conditions than in sunny conditions. In sunny conditions, the collected images had large light variations and uneven light intensity and might be overexposed, whereas in cloudy conditions, the light variation was slight, the light intensity was uniform, and the pictures were less affected by light.
To visualize more intuitively which image regions YOLO-PEM relies on when detecting young “Okubo” peaches, this study used the GradCAM++ method to generate heat maps, thereby demonstrating the advantage of YOLO-PEM. The color of each region in the heat map indicates its impact on the model’s output: the spectrum from blue to red reflects the model’s focus on various locations, with darker red indicating stronger attention and blue indicating weaker attention. The heat maps of YOLO-PEM and YOLOv8s were compared under various weather conditions, as depicted in 
Figure 11. The heat map clearly shows that the YOLO-PEM model gives much more attention to the young peach fruits in various weather conditions than YOLOv8s. Additionally, while YOLOv8s only focuses on local fruit features, the YOLO-PEM model can pay greater attention to global fruit features, thereby improving recognition accuracy and offering more advantages.
  4. Discussion
Several researchers have investigated the application of object identification algorithms to identify and detect peaches. Liu et al. [
43] introduced an improved version of YOLOv7 to detect yellow peaches in the growth phase, with favorable results. Nevertheless, the mAP and F1 score of the model for yellow peaches were 80.4% and 76%, respectively, and the model was relatively large at 51.9 MB, which suggests that it achieved neither optimal detection accuracy nor a compact size. Assunção et al. [
44] employed the MobileDet detection model to identify ripe peaches, achieving only an 88.2% detection accuracy rate. To accurately detect immature small yellow peaches in terms of their quantity and location, Xu et al. [
45] introduced an EMA-YOLO model based on YOLOv8. However, EMA-YOLO yielded a modest mAP of 84.1%, combining reduced accuracy with an increased model size. These limitations impede an effective trade-off between accuracy and size in peach fruit detection systems, potentially resulting in slow detection and imprecise fruit detection in field experiments.
In addition, Mirhaji et al. [
46] employed the YOLOv4 algorithm to identify oranges in the field at night under varying illumination levels. The model has good detection accuracy for oranges and serves as a valuable reference for nighttime recognition research. Consequently, future investigations can concentrate on capturing images of young “Okubo” peaches at night and developing a detection model tailored to such young fruits. This would lay a machine vision foundation for the all-day operation of fruit-picking robots. This study focuses on detecting young “Okubo” peach fruits at the fruit-thinning stage, which provides a reference for intelligent fruit thinning. Li et al. [
47] proposed PeachYOLO, a lightweight detection algorithm based on YOLOv8, for the detection of peach fruits in the mature stage. The improved method can offer a reference for subsequent research on the whole-stage peach picking and the identification of the ripening stage of “Okubo” peaches.
This study solely concentrated on identifying spherical fruits such as “Okubo” peaches, while other types of peaches were not considered. To enhance the comprehensiveness and robustness of the model, future studies should include images of peach fruits of diverse shapes. Additionally, further research can explore fruit counting methods for “Okubo” peaches to facilitate subsequent bagging and yield estimation. Monitoring and identifying the growth progression of “Okubo” peach fruits can give growers valuable insights, enabling them to adopt rational, science-based cultivation practices that improve both yield and quality. Finally, this study tested young fruits of “Okubo” peaches, and future research can enrich the image dataset and utilize more data feature information. The proposed lightweight model YOLO-PEM has so far been verified and tested only on laptop computers. In future research, the model can be deployed on edge devices, for instance, in the vision inspection system and vision-processing chip of a fruit-thinning robot, or on other low-end devices.
  5. Conclusions
In this study, the YOLO-PEM model was proposed for the rapid and precise detection of young “Okubo” peaches in complex orchard environments, and an image dataset of young peaches was established to train the model. To make the model lightweight, the C2f_P module was substituted for the C2f modules of the YOLOv8 neck and backbone networks. The EMA module was embedded into the C2f_P_1 module to enhance the capability of the model to extract features and improve its detection accuracy. CIoU was replaced with the MPDIoU bounding box loss function to enhance the accuracy of the bounding box for young peaches and accelerate model convergence. The efficacy of the enhanced model was confirmed through an ablation test. The primary conclusions of this study are as follows.
To accurately detect young peaches in complex orchard environments, five YOLOv8 model versions were compared and showed only slight differences in AP. The AP of YOLOv8s on the constructed dataset of young “Okubo” fruits was 89.01%, and the model size was 21.4 MB, which balanced accuracy and model size. The influence of various C2f_P module placements on model accuracy and size was studied, and the changes in model parameters, size, FLOPs, and AP were compared. The AP and F1 score of the YOLOv8s-A model, in which the C2f_P module replaces all C2f modules in YOLOv8s, were 89.81% and 86.28%, respectively. Compared with YOLOv8s, YOLOv8s-A exhibited a reduction of 25.34% in parameters, 25.23% in model size, and 24.65% in FLOPs.
The effects of various attention mechanisms on the model’s performance were compared. EMA was chosen for embedding into the C2f_P_1 module; it raised the AP by 0.87% while increasing the model size by only 0.1 MB. The effects of different bounding box loss functions on detection were also compared. The model using the MPDIoU loss function had the highest AP and F1 score, which were 0.89% and 0.33% higher, respectively, than those obtained with CIoU.
Compared with YOLOv8s, the average precision of YOLO-PEM increased by 1.85%, the F1 score increased by 0.85%, and the model size decreased by 5.3 MB, verifying the effectiveness of the proposed method. In a comparison with other lightweight models for detecting young peaches, the AP of YOLO-PEM was higher than that of YOLOv3-tiny, YOLOv4-tiny, YOLOv5s, YOLOv6s, and YOLOv7-tiny by 6.26%, 6.01%, 2.05%, 2.12%, and 1.87%, respectively. The detection speed was 196.2 frames per second (f·s⁻¹), which could meet the detection requirements in a complex orchard environment.
This study applies a deep learning object detection model to a new scenario: precisely detecting young “Okubo” peaches in a complex peach orchard environment. YOLO-PEM’s small size, quick detection speed, and high precision in detecting young “Okubo” fruits can offer technical support for the design of an “Okubo” peach fruit-thinning robot vision system and serve as a foundation for scientific orchard management.