EMA-YOLO: A Novel Target-Detection Algorithm for Immature Yellow Peach Based on YOLOv8

Accurate determination of the number and location of immature small yellow peaches is crucial for bagging, thinning, and estimating yield in modern orchards. However, traditional methods have faced challenges in accurately distinguishing immature yellow peaches due to their resemblance to leaves and susceptibility to variations in shooting angle and distance. To address these issues, we propose an improved target-detection model (EMA-YOLO) based on YOLOv8. Firstly, the sample space was enhanced algorithmically to improve the diversity of samples. Secondly, an EMA attention-mechanism module was introduced to encode global information; this module can further aggregate pixel-level features through dimensional interaction and strengthens small-target detection when combined with a 160 × 160 detection head. Finally, EIoU was utilized as the loss function to reduce missed and false detections of small yellow peaches under high-density conditions. Experimental results show that, compared with the original YOLOv8n model, the EMA-YOLO model improves mAP by 4.2%. Furthermore, compared with SSD, Objectbox, YOLOv5n, and YOLOv7n, this model's mAP was improved by 30.1%, 14.2%, 15.6%, and 7.2%, respectively. In addition, the EMA-YOLO model achieved good results under different conditions of illumination and shooting distance and significantly reduced the number of missed detections. Therefore, this method can provide technical support for the smart management of yellow-peach orchards.


Introduction
Yellow peaches are favored by consumers due to their nutrient composition and deliciousness [1]. As quality of life improves, more and more people are becoming concerned about their health, and the focus of the demand for yellow peaches has shifted from quantity to quality. As a result, more and more farmers utilize various new technologies to manage modern orchards, and accurately counting the number of immature yellow peaches is becoming crucial [2]. In a practical scenario, by utilizing statistical data regarding immature yellow peaches, growers can optimize decision-making about purchasing bagging materials and employing workers [3]. Furthermore, such data can also significantly enhance both the yield and quality of yellow peaches while optimizing orchard management.
However, due to the complexity of the orchard environment, manual estimation of the number of peaches is still the primary method used [4]. This method is inaccurate, inefficient and costly, which makes it difficult to adapt to the intelligent management of large yellow-peach orchards. In response to these challenges, researchers have conducted in-depth research. Particularly with the continuous advancement of computer vision technology, an increasing number of researchers have begun utilizing visual-detection technology for in situ fruit counting [5], with the YOLO series of algorithms among the most widely used. The main improvements of this work are as follows:
(1) Introduction of the EMA (Efficient Multi-Scale Attention) attention-mechanism module to encode global information and further aggregate pixel-level features through dimensional interaction.
(2) Combination with the 160 × 160 scale detection head to enhance small-target-detection capabilities.
(3) Employment of EIoU (Efficient Intersection over Union) as the loss function to reduce the rates of missed detections and false detections of small target yellow peaches in dense environments.
These improvements are tailored to address the specific challenges posed by detecting immature small yellow peaches in natural environments.

Materials

Data Acquisition
The yellow-peach samples were gathered from a plantation base in Daping Village, Jinggangshan City, Jiangxi Province (Figure 1). The data-collection period was from April to May in 2022 and 2023. To enhance the algorithm's versatility and facilitate subsequent UAV (unmanned aerial vehicle) operations, images were captured under different lighting conditions and at different shooting distances and angles using a mobile phone. Each image had a resolution of 4000 × 3000 pixels. To ensure result accuracy, blurry images were manually excluded, and the remaining images were evenly distributed across different scenes. A total of 1520 original images were obtained, of which 498 were long-distance images (more than 3 m), 506 were medium-distance images (1-3 m) and 516 were close-distance images (less than 1 m). The dataset was divided into a training set, a verification set and a test set at a ratio of 8:2. The images were annotated with LabelImg, and the annotation files were stored in XML (Extensible Markup Language) format. Each file contained the vertex coordinates of the yellow-peach target rectangles; the labeled instances totaled 36,872.

Data Augmentation
Data augmentation is a technique widely used in deep learning to improve generalization ability and reduce the risk of overfitting [16-19]. In this experiment, we utilized a feature-enhancement-based data-augmentation algorithm that can improve the visual impact of images [20]. The data-augmentation methods employed in this study included brightness adjustment, random rotation and flipping. The details are shown in Figure 2. As a result, 9120 images were obtained.

To eliminate the randomness of dataset partitioning and obtain reliable results, the experiment was carried out using five-fold cross-validation. In each round of training, one sub-dataset was taken as the test set and the other four were used as the training set, and the whole training process was repeated five times. Because the performance of the model differs slightly across different training sets, the test results also fluctuated slightly; the average was therefore taken as the final result. This method ensures that each sample is used for both training and testing, thus reducing the generalization error.
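For illustration, a minimal sketch of this protocol is shown below, assuming the dataset is a list of image paths and that train_and_eval is a hypothetical callback that trains the detector on one split and returns its test-set score; KFold comes from scikit-learn.

```python
from sklearn.model_selection import KFold

def run_five_fold(image_paths, train_and_eval, seed=0):
    """Five-fold protocol: each fold serves once as the test set, the other
    four folds form the training set, and the five test scores are averaged."""
    kf = KFold(n_splits=5, shuffle=True, random_state=seed)
    scores = []
    for train_idx, test_idx in kf.split(image_paths):
        train_set = [image_paths[i] for i in train_idx]
        test_set = [image_paths[i] for i in test_idx]
        scores.append(train_and_eval(train_set, test_set))  # hypothetical callback
    return sum(scores) / len(scores)  # averaged final result
```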

YOLOv8
YOLOv8 is a single-stage target-detection algorithm released by Ultralytics in 2023. It includes five versions: YOLOv8n, YOLOv8s, YOLOv8m, YOLOv8l and YOLOv8x. The number of model parameters and the computational complexity increase with the depth and width of the model, so users can select a network structure based on their specific application scenario. Notably, YOLOv8n is specifically designed to be used on embedded devices while maintaining detection speed and accuracy [21]. To allow our algorithm to run on mobile devices, we chose the YOLOv8n network structure. The structure of YOLOv8n is depicted in Figure 3. The entire network comprises a backbone for feature extraction, a neck network for feature fusion, and a detection head. The backbone section adopts a cross-stage partial network structure to reduce computational load and improve gradient flow. Additionally, it incorporates a spatial pyramid pooling module to improve spatial feature extraction. The head section employs the currently mainstream decoupled head to effectively decrease parameter count and computational complexity while enhancing the model's generalization ability and robustness. YOLOv8 also departs from previous anchor-based designs in the YOLO series that predicted anchor box position and size; instead, it utilizes an anchor-free detection method to directly predict target center points, widths and heights. The removal of anchor boxes further improves the model's detection speed and accuracy.

EMA-YOLO
The detection of immature yellow peaches presents several challenges, including small-target detection and occlusion in natural scenes. To address these issues, we propose the EMA-YOLO model, which incorporates an EMA attention-mechanism module into the YOLOv8n framework, adds an additional 160 × 160 small-target detection head, and utilizes EIoU (Efficient Intersection over Union) as the loss function to achieve a lightweight yet accurate model. The network structure of EMA-YOLO is depicted in Figure 4, with red boxes highlighting the improvements to the model.

EMA Attention Mechanism
Immature yellow peaches can easily be confused with leaves due to their color and small size, which poses a challenge for the traditional YOLOv8 detection model. However, yellow peaches typically grow on the main stem and tend to aggregate. Thanks to this characteristic, an attention-mechanism module is a potentially useful means of enhancing performance when integrated into the original YOLOv8 network.
Currently, various attention-mechanism modules are in common use; among them is the Efficient Multi-Scale Attention (EMA) [22] module, which was designed for cross-spatial learning without dimensionality reduction. EMA utilizes a grouping structure and employs cross-space learning to establish short- and long-term dependencies through a multi-scale parallel subnetwork design. It retains information from each channel through cross-dimensional interactions while grouping channel dimensions into multiple sub-features.
The spatial semantic features are evenly distributed within each feature map, and the feature grouping and multi-scale structure of EMA effectively establish short- and long-term dependence relations, ultimately contributing to improved detector performance while reducing parameter requirements and computational overhead. The structure of the EMA attention-mechanism module is shown in Figure 5. By modeling long-term dependencies, the EMA attention mechanism preserves precise location information. Because context information is integrated at various scales, the network can produce more accurate pixel-level attention for feature maps. Additionally, parallel convolution kernels provide a more robust structure that handles short- and long-term dependencies through cross-space learning: 3 × 3 and 1 × 1 convolutions are employed in parallel to incorporate more contextual information into the intermediate feature maps. EMA-YOLO leverages the output of the EMA attention-mechanism module, combining channel information and context to differentiate between occluded yellow peaches at different scales and effectively distinguish them from leaves that are similar in color to the unripe fruit.
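For concreteness, the sketch below follows the publicly circulated reference implementation of the EMA module [22] (grouped channels, a 1 × 1 coordinate-pooling branch in parallel with a 3 × 3 branch, and cross-spatial softmax re-weighting); it is an illustrative PyTorch version, not the authors' exact code.

```python
import torch
from torch import nn

class EMA(nn.Module):
    """Efficient Multi-Scale Attention (sketch after the reference implementation)."""
    def __init__(self, channels: int, groups: int = 8):
        super().__init__()
        assert channels % groups == 0
        self.groups = groups
        c = channels // groups
        self.softmax = nn.Softmax(dim=-1)
        self.agp = nn.AdaptiveAvgPool2d(1)             # global pooling for cross-space weights
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))  # pool along width  -> (B, C, H, 1)
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))  # pool along height -> (B, C, 1, W)
        self.gn = nn.GroupNorm(c, c)
        self.conv1x1 = nn.Conv2d(c, c, kernel_size=1)
        self.conv3x3 = nn.Conv2d(c, c, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, ch, h, w = x.shape
        g = x.reshape(b * self.groups, -1, h, w)        # split channels into groups
        # 1x1 branch: coordinate-style pooling along H and W
        x_h = self.pool_h(g)                            # (bg, c, h, 1)
        x_w = self.pool_w(g).permute(0, 1, 3, 2)        # (bg, c, w, 1)
        hw = self.conv1x1(torch.cat([x_h, x_w], dim=2))
        x_h, x_w = torch.split(hw, [h, w], dim=2)
        x1 = self.gn(g * x_h.sigmoid() * x_w.permute(0, 1, 3, 2).sigmoid())
        # 3x3 branch: local multi-scale context
        x2 = self.conv3x3(g)
        # cross-spatial learning: each branch re-weights the other's pixels
        w1 = self.softmax(self.agp(x1).reshape(b * self.groups, -1, 1).permute(0, 2, 1))
        w2 = self.softmax(self.agp(x2).reshape(b * self.groups, -1, 1).permute(0, 2, 1))
        y1 = x2.reshape(b * self.groups, ch // self.groups, -1)
        y2 = x1.reshape(b * self.groups, ch // self.groups, -1)
        weights = (w1 @ y1 + w2 @ y2).reshape(b * self.groups, 1, h, w)
        return (g * weights.sigmoid()).reshape(b, ch, h, w)
```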

Incorporation of the Small-Object Detection Head
The size of the peaches changes with shooting distance, a common phenomenon in practical orchard management, and this leads to significant scale differences in the detection of peaches. In images taken from a long distance, a yellow peach occupies less than 1% of the image area, resulting in a potential loss of feature information. Zhu et al. [23] addressed this kind of issue by adding a transformer detection head to their model, achieving good results in detecting densely arranged objects with dramatic scale changes during high-speed, low-altitude flight. Inspired by this study, we add a 160 × 160 small-target detection head to increase sensitivity to smaller targets. EMA-YOLO first extracts features from the sixth layer of the backbone network and uses concat splicing to fuse the shallow features extracted from the neck network with the context information extracted by the EMA attention-mechanism module. Finally, the output of layer 21 is used as the fourth, small-object detection head.
While this improvement slightly increases the computational load, it significantly enhances EMA-YOLO's ability to detect small objects by capturing more detailed feature information, effectively reducing false and missed detections of yellow peaches across different scales. The improved detection layer is shown in Figure 6.
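As a point of reference, Ultralytics distributes a stock YOLOv8 configuration with an extra P2 output at stride 4 (160 × 160 on a 640-pixel input); loading it as sketched below reproduces the four-head layout, although it lacks the EMA feature fusion described above, so it only approximates EMA-YOLO's head design.

```python
from ultralytics import YOLO

# Assumption: the stock P2 configuration shipped with Ultralytics (the "n"
# suffix selects the nano scale). This adds the 160 x 160 (stride-4) head
# but not the EMA fusion used by EMA-YOLO.
model = YOLO("yolov8n-p2.yaml")
model.info()  # four detection scales: P2/4, P3/8, P4/16, P5/32
```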


EIoU Loss Function
The original YOLOv8n algorithm uses the prediction-box loss function CIoU (Complete Intersection over Union) [24], as shown in Equation (1):

L_CIoU = 1 − IoU + ρ²(b, b^gt)/c² + αv, (1)

where IoU (intersection over union) represents the ratio of the intersection to the union of the bounding box and the real box; b and b^gt represent the center points of the bounding box and the real box; ρ is the Euclidean distance between two points; α is the penalty factor, α = v/((1 − IoU) + v); v represents the similarity of aspect ratios, v = (4/π²)(arctan(w^gt/h^gt) − arctan(w/h))²; c is the diagonal length of the minimum closure region; w and h represent the width and height of the bounding box; and w^gt and h^gt represent the width and height of the real box, respectively.
The above formula demonstrates that CIoU incorporates the aspect ratio of the bounding box as a penalty term in the bounding-box loss function, which can enhance the convergence speed of the regression of the prediction box to some extent. However, when there is a linear relationship between the width and height of the prediction box and those of the real box, the two dimensions cannot be adjusted simultaneously during regression. As a result, the penalty term loses its effectiveness in describing the regression objective accurately, potentially leading to slow convergence and inaccurate regression. To address this issue, EIoU [25] is introduced in this paper, utilizing a loss-function expression that directly penalizes w and h, as shown in Equation (4):

L_EIoU = L_IoU + L_dis + L_asp = 1 − IoU + ρ²(b, b^gt)/c² + ρ²(w, w^gt)/c_w² + ρ²(h, h^gt)/c_h², (4)

where L_IoU, L_dis and L_asp represent the losses for overlap, distance and width/height, respectively; IoU is the ratio of the intersection to the union of the predicted bounding box and the ground-truth bounding box; c, c_w and c_h denote the diagonal length, width and height of the minimum enclosing rectangle of the two boxes; ρ(b, b^gt) represents the distance between the center points of the predicted and ground-truth bounding boxes; ρ(w, w^gt) indicates the difference in width between the predicted and ground-truth bounding boxes; and ρ(h, h^gt) indicates the difference in height between them. The EIoU loss function minimizes the discrepancy in width and height between the target and the anchor. It takes into consideration the overlap area and the distance between center points, as well as the real disparities in width, height and side length. This approach enables the model to prioritize high-quality anchor boxes during regression, leading to faster convergence and improved regression accuracy while maintaining strong anti-interference capability. The introduction of EIoU significantly enhances performance in object-detection tasks, particularly for yellow-peach detection under severe occlusion.
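A minimal PyTorch sketch of Equation (4) is given below for boxes in (x1, y1, x2, y2) format; it is written directly from the definitions above rather than taken from the authors' code.

```python
import torch

def eiou_loss(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-7):
    """EIoU = (1 - IoU) + center-distance term + width/height terms."""
    # intersection area
    x1 = torch.max(pred[..., 0], target[..., 0])
    y1 = torch.max(pred[..., 1], target[..., 1])
    x2 = torch.min(pred[..., 2], target[..., 2])
    y2 = torch.min(pred[..., 3], target[..., 3])
    inter = (x2 - x1).clamp(0) * (y2 - y1).clamp(0)
    # union area and IoU
    w1, h1 = pred[..., 2] - pred[..., 0], pred[..., 3] - pred[..., 1]
    w2, h2 = target[..., 2] - target[..., 0], target[..., 3] - target[..., 1]
    union = w1 * h1 + w2 * h2 - inter + eps
    iou = inter / union
    # smallest enclosing box: diagonal, width, height
    cw = torch.max(pred[..., 2], target[..., 2]) - torch.min(pred[..., 0], target[..., 0])
    ch = torch.max(pred[..., 3], target[..., 3]) - torch.min(pred[..., 1], target[..., 1])
    c2 = cw ** 2 + ch ** 2 + eps
    # squared distance between box centers
    cx1, cy1 = (pred[..., 0] + pred[..., 2]) / 2, (pred[..., 1] + pred[..., 3]) / 2
    cx2, cy2 = (target[..., 0] + target[..., 2]) / 2, (target[..., 1] + target[..., 3]) / 2
    rho2 = (cx1 - cx2) ** 2 + (cy1 - cy2) ** 2
    # distance and width/height penalty terms of Equation (4)
    l_dis = rho2 / c2
    l_asp = (w1 - w2) ** 2 / (cw ** 2 + eps) + (h1 - h2) ** 2 / (ch ** 2 + eps)
    return 1 - iou + l_dis + l_asp
```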

Experimental Parameters
The experimental environment was configured as follows: the operating system of the experimental platform is Windows 10 (Professional edition) with an Nvidia GeForce RTX 3090 Ti graphics card (Santa Clara, CA, USA) and 24 GB of video memory. The software setup includes CUDA 11.7, Python 3.8, and the PyTorch 1.7 deep-learning framework. The training process consists of 450 epochs, with a batch size of 4 for reading image data. The initial learning rate is set to 0.0001.
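The settings above can be reproduced with a training call along the following lines (an assumed use of the Ultralytics API; the model and dataset YAML paths are hypothetical).

```python
from ultralytics import YOLO

model = YOLO("ema-yolo.yaml")     # hypothetical config with the EMA modules and P2 head
model.train(
    data="yellow_peach.yaml",     # hypothetical dataset description file
    epochs=450,                   # training schedule reported above
    batch=4,                      # batch size reported above
    lr0=0.0001,                   # initial learning rate reported above
    device=0,                     # single RTX 3090 Ti
)
```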

Evaluation Metrics
The main metrics utilized in this experiment to assess the neural network model include precision, recall, F1 score [26], the PR curve, and mean average precision. Precision indicates the proportion of correctly predicted positive samples out of all samples predicted as positive, while recall represents the proportion of correctly predicted positive samples out of all actual positive samples. The F1 score is the harmonic mean of precision and recall; generally, a higher F1 score suggests a more stable and robust model. Mean average precision (mAP) measures the combined impact of both precision and recall across all n categories [27]; thus, mAP was selected as the primary measure for model evaluation in this study.
The above-mentioned metrics are calculated as follows:

Precision = TP/(TP + FP)
Recall = TP/(TP + FN)
F1 = 2 × Precision × Recall/(Precision + Recall)
mAP = (1/n) Σᵢ APᵢ

where TP stands for true positive cases, FP for false positive cases, TN for true negative cases, and FN for false negative cases [28], and APᵢ is the area under the PR curve for category i.
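As a worked example, the counts defined above map to the metrics as follows (a straightforward sketch, not tied to any particular framework).

```python
def detection_metrics(tp: int, fp: int, fn: int):
    """Precision, recall and F1 from true-positive, false-positive
    and false-negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1

p, r, f1 = detection_metrics(8, 2, 3)  # toy counts for illustration
```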

Experimental Result
The experimental results demonstrate that the improved EMA-YOLO model achieved a precision (P) of 0.836 and a recall (R) of 0.744, with a corresponding F1 score of 0.787. The precision curves, recall curves, mAP curves and loss curves of YOLOv8 and EMA-YOLO are compared in Figure 7. It can be seen that precision, recall and mAP have all improved, and the loss function converges faster. Some of the test results from the orchard are shown in Figure 8.


Ablation Experiment
Ablation experiments were conducted on EMA-YOLO to evaluate the impact of each improvement, and the results are shown in Table 1. The findings suggest that data augmentation resulted in a 1.6% improvement in mAP, demonstrating its effectiveness in expanding the sample space and improving detection accuracy through enhanced sample diversity. Additionally, integration of the EMA attention-mechanism module resulted in a 1.1% improvement in mAP, highlighting its ability to enhance feature extraction and overall network accuracy. Furthermore, appending the detection head resulted in a 0.9% improvement in mAP, owing to the suitability of the P2 detection head for small-object detection, which mitigates the loss of small-object information as network depth increases.
Moreover, introducing the EIoU loss function resulted in a 0.6% improvement in mAP. In addition, incorporation of FocalLoss effectively addressed sample imbalance within the bounding-box regression task by prioritizing high-quality anchor boxes over those with minimal overlap with target boxes.
The results of the ablation experiments demonstrate a significant improvement in the mean average precision (mAP) of the model. In this experiment, the shallow features extracted from the neck network were fused with the context information extracted by the EMA attention-mechanism module, and mAP increased by 2% when this fusion was fed into the small-target detection head. This improvement is mainly due to the fact that our homemade yellow-peach dataset contains a large number of heavily occluded yellow peaches; when EMA is combined with the small-target detection head, the network can detect more small targets and accuracy increases. On this basis, the loss function was replaced to further improve the model, allowing it to accurately locate each yellow peach in dense environments and reduce the rates of missed and false detections under severe occlusion. The incorporation of data augmentation, the utilization of EIoU as the loss function, the addition of the attention module, and the incorporation of the detection head collectively raised the mAP of the YOLOv8 model for yellow-peach detection from 79.9% to 84.1%, achieving superior performance.

Comparison of Different Networks
To demonstrate the advantages of the EMA-YOLO model, we conducted a performance comparison with other common object-detection models. The classical regression-based object-detection network SSD [29] (with a VGG-16 backbone), as well as Objectbox [30] and the YOLO series [31], including YOLOv7-Peach [15], were included for comparison. The results are presented in Table 2. The table illustrates that the EMA-YOLO model demonstrates superior precision and recall compared to the other models, with a mAP of 84.1%. Specifically, the EMA-YOLO model shows a 1.5% improvement in precision over YOLOv8, along with a 3.6% increase in recall, indicating a reduced rate of missed detections and enhanced overall accuracy. The orchard environment presents complexities such as background leaves that are similar in color to immature peaches, occlusion due to dense fruit distribution, and numerous small targets. The improved network model addresses these issues by incorporating the EMA attention mechanism and the additional small-target detection head, yielding a significant improvement in recall. While the Single Shot Multibox Detector (SSD) boasts high precision on this dataset, its low recall limits its mAP (only 54.0%). ObjectBox, a recently developed anchor-free object-detection network, achieves a precision of 83.8% and a recall of 61.4%, yet falls short in terms of mAP, at only 69.9%.
In summary, the EMA-YOLO model successfully strikes a balance between high precision and recall rates that align with project requirements.

Comparison at Different Shooting Distances
Capturing images in a natural environment makes it nearly impossible to maintain constant camera angles and shooting distances, so it is necessary to ensure that objects of different sizes are effectively detected. Moreover, manual methods may overlook some small yellow peaches. It is therefore essential to verify the model's performance by detecting yellow peaches in images taken at different shooting distances. Figure 9 shows the detection effectiveness of YOLOv8 and EMA-YOLO.
Table 3 summarizes the results of the detection comparison. It is evident that there is minimal disparity between the YOLOv8 and EMA-YOLO models at short distances. In scenarios (a) and (b), EMA-YOLO achieved a perfect result, with no peaches missed, while YOLOv8 missed one peach in each scenario. At a moderate distance, YOLOv8 missed three peaches and EMA-YOLO missed two peaches in scenario (a); in scenario (b), YOLOv8 missed three yellow peaches compared to EMA-YOLO's single miss. For long-distance images, however, missed detections were more severe than at short and moderate distances: YOLOv8 overlooked seven yellow peaches in scenario (a) and ten in scenario (b), whereas EMA-YOLO overlooked only one or two. In summary, despite some missed detections, EMA-YOLO's miss rate is lower than that of the YOLOv8 model, demonstrating its superior performance in the detection of immature small yellow peaches.

Comparison of Different Light Intensities
When working in the orchard, it is important to consider weather conditions. For instance, images captured under strong light typically display higher contrast and more pronounced shadows and highlights, whereas images acquired under low light may show reduced contrast and potential blurring of object details. Additionally, images captured in low light may suffer from noise or blur, impacting the clarity of details, while strong light allows for clearer detail capture.
To assess the robustness of the EMA-YOLO model, we conducted tests under varying light intensities.The results are depicted in Figure 10.
Table 4 summarizes the ground truth and the numbers of peaches detected by the various models. It is apparent that, while YOLOv8 missed at most two yellow peaches in detections conducted under strong or moderate light, it failed to detect as many as seven or even ten samples under low-light conditions. In contrast, EMA-YOLO demonstrated significantly better performance, missing at most three instances in all scenarios. These findings highlight the superior detection capability of EMA-YOLO.
Our analysis also revealed that images captured under weak illumination often suffer from noise or blurring effects, consequently impacting accurate target identification. By integrating an EMA attention-mechanism module into its backbone network along with a dedicated small-target detection head, EMA-YOLO effectively amplifies feature extraction across diverse information types while prioritizing crucial features over interfering ones. Furthermore, this combination minimizes the loss of original information during network transmission, thereby enhancing focus on information relating to smaller targets.


Comparison of Different Densities
During the shooting process, it was observed that yellow peaches tend to grow in dense distributions in the natural environment. This results in occlusion between fruits and between fruits and leaves. Such occlusion hinders accurate extraction of certain characteristics of the yellow-peach fruits during target detection, leading to missed identification of partially occluded peaches. Therefore, to validate the superiority of the EMA-YOLO model, it is essential to compare its detection capability with that of the YOLOv8 model for yellow-peach targets at different densities, as shown in Figure 11.
The results are summarized in Table 5. At sparse densities, neither EMA-YOLO nor YOLOv8 missed any peaches in scenario (a). In scenario (b), the YOLOv8 model missed two objects, while the EMA-YOLO model missed none; thus, EMA-YOLO outperforms YOLOv8 in this case. With a moderately dense yellow-peach distribution, YOLOv8 missed five and six yellow peaches in the two figures, whereas EMA-YOLO missed only two in each.
In the case of extremely dense distributions of yellow peaches, there is a substantial difference between YOLOv8 and EMA-YOLO. In scenario (a), YOLOv8 detected only 145 yellow peaches; although EMA-YOLO also missed five yellow peaches, it detected 17 more than YOLOv8. In scenario (b), YOLOv8 missed 14 yellow peaches, 10 more than EMA-YOLO. These results show that with a very dense distribution of yellow peaches, YOLOv8 has a serious missed-detection problem due to severe occlusion and the small size of the fruit, and they again indicate that EMA-YOLO has better detection capability.
Furthermore, a convolutional neural network loses information over multiple passes, and occlusion in dense distributions then results in inaccurate or missed detections. The EMA attention-mechanism module enhances the ability of the EMA-YOLO model to extract information from occluded yellow peaches, prioritizing the retention of information that might otherwise be lost during layer-wise transmission. This allows more accurate detection of severely occluded yellow-peach features in densely distributed areas and consequently improves detection performance significantly under these conditions.

Comparison of Computational Load
The number of model parameters (Params) is a metric used to evaluate the spatial complexity and scale of a model, so a low parameter count is an important indicator of a lightweight model. Model computation (GFLOPs) is the number of floating-point operations performed by a model in one forward pass, usually expressed in billions of floating-point operations; it is used to evaluate the computing-resource consumption of the model. Lower computing-power requirements are better suited to devices or scenarios with limited memory or computing power. Table 6 summarizes the Params and GFLOPs of our model.
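For reference, the parameter count in Table 6 can be checked for any PyTorch model with a one-liner like the following; GFLOPs additionally require a profiler (for example the thop package), which is not shown here.

```python
import torch

def count_parameters(model: torch.nn.Module) -> float:
    """Total trainable parameters, in millions (the 'Params' column of Table 6)."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6
```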
By combining the EMA module with the small-target detection head, the network can detect more small targets and improve its accuracy. However, the introduction of the EMA module and the small-target detection head inevitably increases the parameter count and the number of model computations, and the change of loss function cannot make the model lighter. Although a certain amount of computation is sacrificed, the visualization results show obvious improvement in small-target detection and in performance under severe occlusion. In the context of increasingly abundant storage and computing resources, the accuracy of the model should be considered first.

Discussion
After the EMA attention-mechanism module was integrated and a small-target detection head was added to the YOLOv8 model, the experimental comparison clearly demonstrated improved small-target detection, as well as performance enhancement under different lighting conditions and densities. This led to a reduction in missed detections and an overall improvement in the accuracy of the object-detection algorithm. Although the introduction of the EMA module and the addition of the small-target detection head inevitably increased the number of parameters and model computations, the model shows obviously improved performance for small-target detection and under severe occlusion. The Grad-CAM method [32] is commonly used to improve the interpretability of neural network models by generating heat maps based on weight features extracted from different layers.
As shown in Figure 12, it is evident that the improved EMA-YOLO model exhibits more hierarchical red coloration for detected targets compared to the original YOLOv8 model, particularly when dealing with small yellow-peach targets against similar-colored backgrounds.
According to the above three groups of comparative experiments (Figures 9-11), for yellow-peach images taken at short and moderate distances, under strong and moderate light, and with sparse distributions, the detection results of the EMA-YOLO model contain fewer missed targets. Therefore, the EMA-YOLO model has clear advantages in the detection of yellow peaches in orchards and can basically meet the needs of agricultural detection. However, when the distribution of yellow peaches is dense and the targets are small, the detection capacity of EMA-YOLO is not ideal and more detections are missed (refer to Figure 11 and Table 5). This may be because occluded yellow peaches in dense, low-resolution distributions carry less feature information, so some features fail to be extracted. In view of this, feature extraction from the input images should be strengthened in subsequent research to reduce the information loss caused by increasing network depth and thus further improve accuracy.
The YOLOv7-Peach method proposed in reference [15] has a precision of 79.3%, a recall of 73%, and an average accuracy of 80.4%. Compared with YOLOv7-Peach, our method has a higher recall, is more suitable for counting tasks, and reduces the number of missed detections. Compared with YOLOv7 on the same dataset, our method has a 4.3% higher precision, a 1.4% higher recall, and a 3.7% higher mAP. Reference [15] does not give the relevant computational-load indices, so further comparison in terms of lightweight design is not possible.

Application of Our Method
To validate the practical applicability of the method proposed in this paper, an application system was developed based on the cloud and mobile terminals. The system is primarily built on Android technology. Users can take photos or upload local images via the mobile terminal; the images are processed in the cloud, and the results are fed back to the terminal. The specific test device was a Xiaomi phone running Android 12.0. The test-results interface is illustrated in Figure 13.

Conclusions
This paper addressed the problem of the detection of yellow peaches in the natural environment and designed the EMA-YOLO model based on YOLOv8. The model utilizes data-augmentation technology to expand the sample space and increase sample diversity. It introduces the EMA attention-mechanism module to encode global information and further aggregate pixel-level features through dimensional interaction, thereby reducing missed and false detections of small targets in densely occluded environments and improving overall accuracy. Additionally, a small-target detection head is incorporated to enhance the ability of the model to detect small targets, thus improving the detection rate. Furthermore, replacing the loss function with EIoU reduces instances of missed and false detections of small yellow-peach targets in dense scenarios. The model demonstrates clear advantages in detecting yellow peaches under various environmental conditions in peach orchards, providing valuable insights for the accurate detection of small targets against similar background colors and offering technical support for the intelligent yield-estimation management of yellow peaches.

Figure 1. The location of the plantation base.

Figure 2. Examples of images from the yellow-peach dataset: (a) random rotation; (b) change in brightness; (c) flip; (d) random rotation and change in brightness; (e) flipping and change in brightness; (f) flipping and random rotation.

Figure 4. Network structure of EMA-YOLO (red boxes represent the improvements).

Figure 5. Structure diagram of the EMA attention-mechanism module.

Figure 8. Partial test results (the markers in the images represent the detected yellow peaches).

Figure 9. Detection at different shooting distances. (a) and (b) are two different images under the same environmental conditions.

Figure 10. Detection under different light intensities. (a) and (b) are two different images under the same environmental conditions.

Figure 11. Detection at different densities. (a) and (b) are two different images under the same environmental conditions.

Figure 12. Visualization results.

Table 3. Detection results at different shooting distances.

Table 4. Detection results under different light intensities (one marker in the images represents a detected yellow peach; another represents a missed yellow peach).

Table 5. Detection results under conditions of different densities.

Table 6. Computational load comparison table.