1. Introduction
Originating from Europe and East Asia, strawberries are a delicious fruit, widely loved by consumers for their enticing aroma, tender and smooth texture, and vibrant red color [1,2]. In addition, strawberries can adapt to diverse environments, making them widely cultivated worldwide and one of the most popular fruits globally [3]. As cultivation areas continue to expand, economic benefits have significantly increased. Strawberries hold substantial economic importance across the globe and possess immense potential as an export commodity [4].
According to the latest data from the Food and Agriculture Organization of the United Nations (FAO), China is the world's largest producer of strawberries [5,6]. Its cultivation area and output have consistently led the world, accounting for approximately one-third of global production. After strawberries begin to set fruit, they typically require 20 to 30 days to ripen. Ripeness is a key factor affecting strawberry quality and flavor; harvesting too early or too late significantly reduces quality [7]. Currently, strawberry harvesting relies primarily on manual picking, making it the most labor-intensive task in strawberry cultivation. During peak seasons, rising labor costs indirectly increase the overall production expenses of strawberry farming [5]. Additionally, manual picking can easily cause surface damage and lead to misjudgments about fruit ripeness, failing to ensure strawberries are harvested at their optimal picking time.
With the rapid advancement of modern computer technology, the integration of traditional agriculture with modern computing has emerged as a significant research focus [8]. Introducing robots into strawberry harvesting not only frees up labor and reduces the workload for farmers but also enhances picking efficiency during peak ripening periods [9]. Therefore, research on automated strawberry ripeness detection in natural environments holds substantial scientific significance and practical value: it not only provides crucial technical support for strawberry production processes but also lays the foundation for developing intelligent harvesting equipment [10].
In recent years, computer vision technology has been widely applied in the agricultural sector, enabling ripeness detection for fruits such as apples, bananas, grapes, and tomatoes [11,12,13]. Numerous scholars have employed traditional image segmentation techniques to extract features such as the size, color, and contours of target objects. Cho et al. [14] selected spectral data correlated with tomato ripeness and established a ripeness classification model using support vector classification (SVC) and snapshot hyperspectral imaging. Arefi et al. [15] employed threshold analysis for background removal in the RGB color space, followed by ripe tomato extraction using the HSI, YIQ, and RGB color spaces, achieving an accuracy of 96.36%. Fadhel et al. [16] employed color thresholding and k-means clustering to identify unripe strawberries; the results showed that color thresholding outperformed k-means in accuracy, effectiveness, and implementation speed. Mo et al. [17] developed a banana ripeness discrimination model based on genetic algorithms and SVM, using manually measured banana rhombic features as ripeness indicators; the final model achieved an average prediction accuracy of 86.20%. Khojastehnazhand et al. [18] converted RGB values to HSV and used the HSV color space with a Naive Bayes classifier to classify dragon fruit ripeness, achieving an accuracy of 86.6%. Malik et al. [19] proposed a detection method based on an improved HSV color space and a watershed segmentation algorithm, using low-cost, readily available RGB cameras to help robots automatically pick ripe red tomatoes.
Traditional image processing-based maturity detection has achieved certain results but exhibits several significant drawbacks: (1) Algorithms are developed for specific crops and environments, requiring predefined parameters such as thresholds and filter kernel sizes, lacking dynamic adjustment capabilities. (2) Feature extraction parameter settings rely on subjective human experience, making it difficult to adapt to complex and variable real-world scenarios, resulting in poor universality. (3) Due to the excessive complexity of crop feature extraction, the results obtained are easily affected by environmental noise, light changes, overlap, and other surrounding disturbances. This leads to low detection efficiency and poor robustness and generalization capabilities, making it difficult to meet field detection requirements.
In recent years, deep learning technology has garnered significant attention. By leveraging neural networks to learn more abstract and advanced features from raw images, it enables automatic feature learning and extraction without human intervention [20]. This technology effectively analyzes large datasets and identifies replicable features, substantially enhancing performance in image classification, segmentation, and detection [21]. It currently represents a major research focus in the field of classification and recognition.
Zhou et al. [22] employed the YOLOv3 model to detect the maturity levels of strawberry flowers and fruits at different stages using both aerial and ground-level imaging, achieving average accuracies of 88.0% and 89.0%. Miao et al. [23] replaced the original backbone with MobileNetV3 and incorporated a Global Attention Mechanism into the feature fusion network to address the subtle maturity differences among cherry tomatoes and the frequent mutual occlusion between fruits; the improved YOLOv7 model achieved a mAP of 98.2% on the test set while reducing the model size by 55.7 MB. Qiu et al. [24] proposed an algorithm based on an improved YOLOv4 model, utilizing MobileNetV3, separable convolutions, and the SENet attention mechanism for precise grape ripeness detection and spatial localization within orchards, achieving an accuracy of 93.52%. Fan et al. [25] proposed an algorithm combining YOLOv5 with dark channel enhancement based on the three standard maturity classes: ripe, near-ripe, and unripe. Huang et al. [26] proposed a production-line mango ripeness detection method using an enhanced YOLOv8s model; by modifying the C2f module within the neck structure and incorporating a channel attention mechanism, the approach significantly improves detection efficiency. Wang et al. [27] proposed an artificial intelligence framework for locating lychee picking points and detecting obstacles, achieving 88.1% accuracy in segmenting lychee fruits from branches and an 88% success rate in identifying picking points.
In summary, although deep learning-based fruit ripeness detection has made progress, existing methods still require improvement in robustness and practical applicability [28]. Strawberry fruits exhibit the following characteristics: (1) strawberry canopies are dense, and the fruits are relatively small, often overlapping and shading each other; (2) because light conditions vary during growth, strawberry fruits do not ripen simultaneously, and even on the same plant the degree of ripeness among individual fruits is not uniform.
This study collected photographs of strawberries at different ripeness levels across diverse natural environments and enriched the dataset's feature information through several image augmentation techniques. The YOLOv8s model, which combines a small parameter count with high accuracy, was then selected as the base model. Incorporating a Global Attention Mechanism (GAM) into the backbone network enhances the extraction of key information regarding strawberry ripeness. Finally, through multiple comparative validations, this paper demonstrates the effectiveness of the improved model, ensuring its suitability for subsequent deployment on harvesting robots.
2. Materials and Methods
2.1. Building Strawberry Maturity Dataset
Deep learning methods have been successfully used to grade and identify strawberry ripeness. Beyond requiring high-quality detection models, this approach relies on ample data for model training. Commonly used public datasets such as ImageNet, Open Images, COCO, and PASCAL VOC feature diverse object categories but lack images depicting strawberries at different stages of ripeness. Therefore, this study constructed a dedicated strawberry ripeness dataset for model training.
2.1.1. Image Collection
All training data were manually collected at the No. 1 Farm Strawberry Garden in Changzhou City, Jiangsu Province. The strawberry variety was Hongyan, and samples were collected between November 2024 and March 2025. To ensure consistency between the training data and the images encountered in the harvesting robot's subsequent target recognition tasks, data were acquired with an Intel RealSense D415 camera (Intel, Santa Clara, CA, USA). A Python (v3.6) program automatically captured images every 500 milliseconds at a resolution of 640 × 640 pixels. During collection, the camera was kept 5–15 cm from the target, and lighting conditions and shooting angles were varied as much as possible. Finally, the collected images were manually screened to remove blurry images or images without strawberry fruit. Sample images are shown in Figure 1.
This dataset comprises 1500 images capturing diverse lighting conditions, including front lighting, backlighting, overcast skies, and sunny weather. The strawberry data further encompasses various growth states such as single fruits, clusters of fruits, overlapping fruits, and foliage obstruction. This comprehensive approach ensures the trained model can effectively handle complex scenarios encountered in real-world strawberry cultivation environments.
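To make the acquisition procedure concrete, the following is a minimal sketch of a timed capture loop of the kind described above, assuming the pyrealsense2 and OpenCV Python packages. Because the D415 does not stream a native 640 × 640 color mode, the sketch captures at a supported resolution and then center-crops and resizes; the 1280 × 720 stream choice, file naming, and stop condition are illustrative assumptions rather than the exact procedure used in this study.

```python
import time
import cv2
import numpy as np
import pyrealsense2 as rs

# Configure the RealSense D415 color stream (1280x720 is a supported mode).
pipeline = rs.pipeline()
config = rs.config()
config.enable_stream(rs.stream.color, 1280, 720, rs.format.bgr8, 30)
pipeline.start(config)

try:
    idx = 0
    while idx < 100:                      # stop condition is illustrative
        frames = pipeline.wait_for_frames()
        color = frames.get_color_frame()
        if not color:
            continue
        img = np.asanyarray(color.get_data())
        # Center-crop to a square, then resize to the 640 x 640 training resolution.
        h, w = img.shape[:2]
        off = (w - h) // 2
        square = img[:, off:off + h]
        sample = cv2.resize(square, (640, 640))
        cv2.imwrite(f"strawberry_{idx:04d}.jpg", sample)
        idx += 1
        time.sleep(0.5)                   # one frame every 500 ms
finally:
    pipeline.stop()
```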
2.1.2. Image Enhancement
To ensure the efficiency of the strawberry detection model, a rich dataset is required to guarantee the model’s strong generalization ability and stability. Sufficient data enhance the model’s adaptability to diverse scenarios. Furthermore, due to the complex growing environment of strawberries, factors such as lighting and shadows during photography are difficult to control. By simulating variations in natural lighting conditions, angular deviations, and imaging blur, we expanded the diversity of the strawberry ripeness dataset to enhance the model’s generalization capability in real-world scenarios. To achieve this, three data augmentation techniques were employed to enrich the features of strawberry samples within the dataset:
(1) To ensure the model can effectively recognize targets under varying lighting conditions, this study adjusted the exposure levels of captured raw images to accommodate detection requirements in both intense sunlight and dim environments.
(2) To enable the model to adapt to blurry photos caused by the camera’s inability to focus, this study employed a Gaussian blur method to simulate the out-of-focus effect.
(3) To simulate the randomness of imaging angles in natural environments, this study enhanced the dataset using image flipping techniques.
By applying the three augmentation techniques described above to each original image, the dataset was expanded from 1500 images to 6000 images (Figure 2).
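The three augmentations can be reproduced with standard OpenCV operations. The sketch below is a minimal illustration; the exposure gains, blur kernel size, and flip direction are assumed values chosen for demonstration, not the exact parameters used to build the dataset.

```python
import cv2

def adjust_exposure(img, gain):
    """Simulate over- or under-exposure by scaling pixel intensities (gain > 1 brightens)."""
    return cv2.convertScaleAbs(img, alpha=gain, beta=0)

def defocus_blur(img, ksize=9):
    """Simulate an out-of-focus camera with a Gaussian blur."""
    return cv2.GaussianBlur(img, (ksize, ksize), 0)

def flip(img, mode=1):
    """Flip horizontally (1), vertically (0), or both (-1) to vary the viewpoint."""
    return cv2.flip(img, mode)

img = cv2.imread("strawberry_0001.jpg")
augmented = [
    adjust_exposure(img, 1.5),   # intense sunlight
    adjust_exposure(img, 0.6),   # dim environment
    defocus_blur(img),           # focus failure
    flip(img),                   # random imaging angle
]
for i, aug in enumerate(augmented):
    cv2.imwrite(f"strawberry_0001_aug{i}.jpg", aug)
```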
During our research, we surveyed over a dozen strawberry farms in Jiangsu Province and found that farmers judge strawberry ripeness by the extent of red coloration covering the fruit at harvest. Through extensive sample analysis, strawberry ripeness was categorized into three stages: unripe, half-ripe, and ripe. Unripe strawberries exhibit no red coloration; half-ripe strawberries display red coverage over 30–70% of the fruit surface; and strawberries with more than 70% red coverage are classified as ripe. The LabelImg annotation tool was used to manually label each sample according to these three stages, and text labels were then generated in the YOLO format (Table 1).
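For reference, a YOLO-format label file contains one line per object: a class index followed by the box center, width, and height, all normalized by the image size. The snippet below shows this conversion; the class-index mapping (0 = unripe, 1 = half-ripe, 2 = ripe) and the example box are assumed conventions for illustration.

```python
def to_yolo_line(cls_id, xmin, ymin, xmax, ymax, img_w=640, img_h=640):
    """Convert a pixel-space box (xmin, ymin, xmax, ymax) into a YOLO label line."""
    xc = (xmin + xmax) / 2.0 / img_w
    yc = (ymin + ymax) / 2.0 / img_h
    w = (xmax - xmin) / img_w
    h = (ymax - ymin) / img_h
    return f"{cls_id} {xc:.6f} {yc:.6f} {w:.6f} {h:.6f}"

CLASSES = {"unripe": 0, "half-ripe": 1, "ripe": 2}   # hypothetical index mapping
print(to_yolo_line(CLASSES["ripe"], 120, 200, 260, 380))
# -> "2 0.296875 0.453125 0.218750 0.281250"
```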
2.2. YOLOv8s Model
As a newer iteration within the mature YOLO framework, YOLOv8 focuses on achieving an optimal balance between accuracy and speed [29]. Building upon the success of YOLOv5, YOLOv8 introduces new features and enhancements that yield significantly improved accuracy. Based on CNNs, YOLOv8 employs an end-to-end object detection approach, directly feeding images into the model and outputting bounding boxes for target locations together with category predictions [30]. YOLOv8 stands as one of the most advanced models in the YOLO series, offering both strong performance and a modern architecture. Its optimized training mechanism, mixed-precision support, and multi-format export capabilities enable flexible and efficient deployment, making it the current benchmark within the YOLO family for balancing performance and ease of use.
For different detection tasks, five YOLOv8 models of varying sizes have been designed. YOLOv8s strikes a good balance between detection speed and accuracy: compared with YOLOv8n it achieves higher detection accuracy, while outperforming YOLOv8m and YOLOv8l in detection speed [31]. This makes it well suited to robotic harvesting tasks. This study therefore selected YOLOv8s as the base model and optimized its architecture for the characteristics of strawberry fruit images.
As shown in Figure 3, the YOLOv8s network architecture consists of four components: the input module, backbone network, neck network, and detection head. The backbone network is responsible for extracting features from the input image and is primarily composed of CBS (Convolution-BN-SiLU), C2f (Cross-Stage Local Fusion), and SPPF (Spatial Pyramid Pooling-Fast) modules. The C2f module enhances the model's feature representation capabilities by optimizing feature propagation paths, significantly improving overall object detection accuracy and inference speed while maintaining efficient computational resource utilization.
The neck section employs a Rep-PAN (Reparametrized Path Aggregation Network) architecture. During training, it utilizes a multi-branch topology to extract features, effectively integrating characteristics from different levels. At inference time, structural reparameterization techniques merge these into a lightweight, computationally efficient graph, perfectly balancing model representational power with inference efficiency. The detection head receives multi-scale feature maps from Rep-PAN, introduces Dynamic Label Allocation (DLA) to reasonably adjust the distribution of positive and negative samples, and adopts an Anchor-Free strategy in the improved prediction head to reduce computational complexity.
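As a usage illustration of this end-to-end detection pipeline, the Ultralytics Python API returns bounding boxes and class predictions directly from an input image. The snippet below is a minimal sketch; the weight file and image path are placeholders, and the confidence threshold is an assumed value.

```python
from ultralytics import YOLO

model = YOLO("yolov8s.pt")                 # pretrained weights as a placeholder
results = model("strawberry_sample.jpg", imgsz=640, conf=0.25)

for box in results[0].boxes:
    cls_id = int(box.cls[0])               # predicted class index
    conf = float(box.conf[0])              # confidence score
    x1, y1, x2, y2 = box.xyxy[0].tolist()  # bounding box in pixel coordinates
    print(results[0].names[cls_id], f"{conf:.2f}", (x1, y1, x2, y2))
```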
2.3. Global Attention Mechanism
In recent years, numerous researchers have incorporated attention mechanisms into their models to improve detection accuracy. The core idea of the attention mechanism is to mimic the focusing behavior of the human visual system, enabling the model to selectively attend to different parts of the input when processing complex data, which enhances both performance and interpretability. Attention mechanisms are primarily categorized into channel-based and spatial-based approaches. Channel attention dynamically adjusts feature channel weights according to each channel's importance, allowing the network to focus on information from critical channels. Spatial attention enhances the model's focus on key regions by computing the distribution of importance across spatial dimensions.
CBAM combines channel and spatial attention mechanisms, first adjusting the channel dimension and then optimizing the spatial dimension [32]. This enables the network to focus more intently on task-relevant regions when processing images. However, the drawback of CBAM lies in its reliance on local information processing, which makes it difficult to capture global contextual information and results in poor performance in complex agricultural scenarios.
In this paper, the Global Attention Mechanism (GAM) [33] replaces CBAM. GAM amplifies global-level interaction features while minimizing information dispersion. By computing global context, it effectively captures global information within the image features, enhancing the model's focus on critical features.
Figure 4 shows the channel and spatial attention modules. The calculation is as follows:

$$F_2 = M_c(F_1) \otimes F_1, \qquad F_3 = M_s(F_2) \otimes F_2$$

where $F_1$ is the input feature, $F_2$ is the intermediate state, $F_3$ is the output feature, $M_c$ and $M_s$ denote the channel and spatial attention maps, and $\otimes$ denotes element-wise multiplication.
The channel attention submodule employs multilayer perceptrons to preserve information across channels, spatial width, and height, thereby enhancing global interaction features. The spatial attention submodule fuses spatial information through convolutional layers. GAM addresses CBAM’s limitation of insufficient global feature representation when processing images by simultaneously focusing on interactions across both channel and spatial dimensions, thereby strengthening the expression capability of global features.
The channel attention submodule preserves information across three dimensions through a three-dimensional arrangement. A two-layer multilayer perceptron (MLP) is employed to extend cross-dimensional channel-spatial dependencies. This MLP adopts an encoder–decoder architecture with a compression ratio of r (rate = 4).
In the spatial attention submodule, two convolutional layers are employed to fuse spatial features and focus on spatial information. First, a convolution with a 7 × 7 kernel reduces the number of channels to decrease computational load. Subsequently, another 7 × 7 convolution increases the channel count to maintain consistency, and the padding value is set to 3. Finally, the output passes through a Sigmoid function.
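For clarity, the following is a minimal PyTorch sketch of a GAM block consistent with the description above (channel MLP with rate = 4, two 7 × 7 convolutions with padding 3, sigmoid gating). It is an illustrative re-implementation rather than the authors' exact module, and its insertion point within the YOLOv8s backbone is not shown here.

```python
import torch
import torch.nn as nn

class GAM(nn.Module):
    """Global Attention Mechanism: channel attention via a permutation and a two-layer
    MLP, followed by spatial attention via two 7x7 convolutions."""
    def __init__(self, channels, rate=4):
        super().__init__()
        hidden = channels // rate
        # Channel attention sub-module: MLP applied over the channel dimension.
        self.channel_mlp = nn.Sequential(
            nn.Linear(channels, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, channels),
        )
        # Spatial attention sub-module: 7x7 conv reduces channels, 7x7 conv restores them.
        self.spatial = nn.Sequential(
            nn.Conv2d(channels, hidden, kernel_size=7, padding=3),
            nn.BatchNorm2d(hidden),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, channels, kernel_size=7, padding=3),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        b, c, h, w = x.shape
        # Channel attention: permute to (B, H*W, C), run the MLP, permute back.
        att = x.permute(0, 2, 3, 1).reshape(b, -1, c)
        att = self.channel_mlp(att).reshape(b, h, w, c).permute(0, 3, 1, 2)
        x = x * torch.sigmoid(att)                 # F2 = Mc(F1) * F1
        # Spatial attention on the channel-refined feature map.
        x = x * torch.sigmoid(self.spatial(x))     # F3 = Ms(F2) * F2
        return x

# Quick shape check on a typical backbone feature map.
print(GAM(256)(torch.randn(1, 256, 80, 80)).shape)  # torch.Size([1, 256, 80, 80])
```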
2.4. Fusion of GAM and YOLOv8s
As shown in Figure 5, the GAM module is integrated into the backbone network of the YOLOv8 algorithm to enhance the extraction of both global and local features. This approach preserves finer-grained details while suppressing irrelevant noise, thereby improving the detection of occluded objects. A 4 × 4-pixel detection head was added to the head section, along with a fourth output layer P2, to enhance the detection of small targets such as strawberry fruits. This design not only preserves YOLOv8's efficiency but also significantly improves the model's ability to detect small objects, achieving higher detection accuracy and robustness.
2.5. Settings and Data Description
Through stratified random sampling, the 6000 images were divided into training, validation, and test sets in an 8:1:1 ratio, strictly maintaining consistent proportions across the three categories: unripe, half-ripe, and ripe. The training set contained 4800 images covering all scenarios without extreme interference; it was used for iterative parameter training and gradient-descent optimization, with model weights saved every 10 epochs. The validation set comprised 600 images with the same scene distribution as the training set; it was used for hyperparameter optimization, overfitting monitoring, and optimal model selection, with early stopping triggered if the validation loss increased for 15 consecutive epochs. The test set contained 600 images representing extreme scenarios not covered in the training or validation sets, enabling unbiased evaluation of the model's generalization capability.
The experiments were run on a Dell T7920 workstation (Dell, USA) equipped with an Intel Xeon Gold 6248 CPU, an NVIDIA Tesla V100 GPU, 64 GB of memory, and 2 TB of NVMe SSD storage, providing sufficient computational power for efficient deep learning model training. Training parameters were set based on preliminary experimental results; the key parameter selections are outlined in Table 2.
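The split and training configuration described above can be expressed with scikit-learn and the Ultralytics trainer roughly as follows. This is a sketch under stated assumptions: the file layout and the dominant-class heuristic used for stratification are hypothetical, and all hyperparameter values other than the 8:1:1 ratio, the 15-epoch early-stopping patience, and the 10-epoch checkpoint interval are placeholders (the actual settings are those listed in Table 2).

```python
from pathlib import Path
from sklearn.model_selection import train_test_split
from ultralytics import YOLO

def dominant_class(label_file: Path) -> int:
    """Dominant ripeness class in one YOLO label file, used as the stratification key."""
    classes = [int(line.split()[0]) for line in label_file.read_text().splitlines() if line.strip()]
    return max(set(classes), key=classes.count) if classes else -1

labels = sorted(Path("labels").glob("*.txt"))          # hypothetical label directory
strata = [dominant_class(p) for p in labels]

# 8:1:1 split: hold out 20%, then split that portion evenly into validation and test.
train, rest, y_train, y_rest = train_test_split(labels, strata, test_size=0.2,
                                                stratify=strata, random_state=0)
val, test = train_test_split(rest, test_size=0.5, stratify=y_rest, random_state=0)

# Train YOLOv8s (or the GAM-modified variant) with early stopping and periodic checkpoints.
model = YOLO("yolov8s.pt")
model.train(
    data="strawberry.yaml",   # dataset config listing the train/val image folders
    imgsz=640,
    epochs=150,               # placeholder; final values follow Table 2
    batch=16,                 # placeholder
    patience=15,              # stop if validation metrics do not improve for 15 epochs
    save_period=10,           # save a checkpoint every 10 epochs
)
```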
3. Results and Discussion
The experimental results and discussion focus on the effectiveness of the YOLOv8s model enhanced with a Global Attention Mechanism (GAM) for strawberry ripeness detection. The model’s performance advantages are systematically quantified across five dimensions: standardization of experimental baseline conditions, validation of core improvement modules, cross-comparison with multiple algorithms, robustness testing in complex scenarios, and analysis of the underlying mechanisms behind the results.
3.1. Performance Enhancement of YOLOv8s Through GAM Module
To validate the effectiveness of GAM in enhancing the YOLOv8s model, this study employed a controlled variable method. The sole variable tested was “whether the GAM module is introduced,” while all other experimental conditions remained identical. We compare the convergence characteristics, detection accuracy, and feature extraction capabilities between the original YOLOv8s and the improved YOLOv8s, with a particular focus on verifying GAM’s advantages in complex scenarios.
Model convergence characteristics serve as key indicators of model stability and training efficiency. By monitoring the trends in training loss (Train-Loss), validation loss (Val-Loss), and validation-set mAP, we compared the convergence processes of the two models. The results are shown in Figure 6.
Both models entered the convergence phase after 40 epochs, but the improved YOLOv8s converged faster, with loss values below 0.3 after 30 epochs, whereas the original model required 45 epochs. The final training loss of the improved model stabilized at 0.23 ± 0.05 and the validation loss at 0.38 ± 0.02, reductions of 0.13 and 0.14, respectively, compared with the original model. On the validation set, the improved model reached 88.5% mAP at 50 epochs and stabilized at 91.5% by epoch 120, whereas the original model achieved only 86.9% mAP at epoch 120.
The core reason for this discrepancy lies in the original YOLOv8s backbone network’s reliance solely on local feature aggregation, making it prone to learning irrelevant features like leaf patterns and soil backgrounds. This leads to slow loss reduction and overfitting. In contrast, the GAM module employs a dual-branch interaction of “channel attention–spatial attention” to focus on effective features such as the red areas and contour textures of strawberry fruits. It suppresses the weights of irrelevant features, thereby accelerating convergence and enhancing generalization stability.
3.2. Performance Comparison of Different Models in Strawberry Maturity Assessment
To validate the competitiveness of the improved YOLOv8s among comparable algorithms, we selected traditional YOLO variants, attention-enhanced models, and other common detection models for comparison. A comprehensive evaluation was conducted across three dimensions: accuracy, speed, and lightweight performance. All models were retrained using the training dataset from this study to ensure consistent conditions.
The comparison models include YOLOv3, YOLOv5s, and YOLOv7s; the attention-enhanced model CBAM-YOLOv8s; and the proposed GAM-YOLOv8s. YOLOv3 serves as the baseline model for early strawberry ripeness detection, based on the Darknet53 backbone network; YOLOv5s represents lightweight models, based on CSPDarknet53; YOLOv7s represents medium-to-high-accuracy models, based on the ELAN backbone network; CBAM-YOLOv8s replaces GAM with the traditional CBAM attention mechanism, functioning as the attention mechanism comparison group.
All models were retrained on the training set from this study under consistent training conditions. After training, performance metrics were computed on the test set; the results are shown in Table 3.
The test results indicate that, among the traditional YOLO series, YOLOv3 exhibits relatively low overall performance, achieving only 79.4% mAP@0.5 at an inference speed of 28 FPS; both accuracy and real-time capability fall short. YOLOv5s improves accuracy to 83.5% and reaches 42 FPS, demonstrating better lightweight performance, though its recognition accuracy remains insufficient. YOLOv7s further improves accuracy to 85.4%, but its speed drops to 38 FPS, giving a slightly poorer balance. As the new-generation base model, YOLOv8s raises mAP@0.5 to 86.9% while significantly boosting inference speed to 55 FPS, delivering overall performance superior to its predecessors. CBAM-YOLOv8s, which adds an attention mechanism to YOLOv8s, achieves a higher accuracy of 88.4%, though its inference speed decreases slightly to 51 FPS, demonstrating that the attention mechanism contributes positively to accuracy.
The proposed GAM-YOLOv8s model achieves the best performance across all metrics, with a mAP@0.5 of 91.5%, improvements of 12.1, 8.0, 6.1, 4.6, and 3.1 percentage points over YOLOv3, YOLOv5s, YOLOv7s, YOLOv8s, and CBAM-YOLOv8s, respectively. Meanwhile, the inference speed remains at 53 FPS, only slightly lower than that of the baseline YOLOv8s. This demonstrates that the model maintains high real-time performance and lightweight advantages while significantly improving detection accuracy; its comprehensive performance makes it highly valuable for practical applications.
We further analyzed the feature heatmaps shown in Figure 7, comparing the activation focus of the base YOLOv8s (Figure 7a) and the proposed GAM-YOLOv8s (Figure 7c). By measuring the pixel intensities in the feature maps, we observed that GAM-YOLOv8s concentrated significantly more activation energy (approximately 85–90%) within the ground-truth bounding box of the strawberry, compared with the more dispersed activations of the base model. This difference in feature concentration demonstrates that GAM effectively filters background noise and directs the model's attention. This focused learning mechanism ensures more efficient and relevant gradient updates, resulting in faster convergence and lower, more stable loss values.
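The exact procedure used to compute this activation-energy fraction is not stated; one straightforward way to obtain such a figure from a heatmap and a ground-truth box is sketched below, with the heatmap and box coordinates purely illustrative.

```python
import numpy as np

def energy_in_box(heatmap: np.ndarray, box) -> float:
    """Fraction of total activation energy that falls inside a ground-truth box.

    heatmap: 2-D non-negative array upsampled to image resolution.
    box: (x1, y1, x2, y2) pixel coordinates of the ground-truth bounding box.
    """
    x1, y1, x2, y2 = box
    total = heatmap.sum()
    return float(heatmap[y1:y2, x1:x2].sum() / total) if total > 0 else 0.0

# Illustrative example: a 640 x 640 heatmap with a strawberry box at (120, 200)-(260, 380).
heatmap = np.random.rand(640, 640).astype(np.float32)
print(f"{energy_in_box(heatmap, (120, 200, 260, 380)):.2%} of the activation falls inside the box")
```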
3.3. Verification of Model Robustness Improvement Across Different Scenarios
In the natural environment of strawberry cultivation, variations in light intensity and fruit occlusion are the two primary interference factors affecting detection accuracy. To validate the robustness of the enhanced YOLOv8s model, this study designed a light-robustness test and an occlusion-robustness test to compare the improved model against the baseline YOLOv8s. For the light-robustness test, the test images were categorized into four lighting conditions: front-lit, back-lit, cloudy, and dusk low-light, with 150 images per category. The mAP comparisons between the two models are presented in Table 4.
Comparisons of mAP@0.5 under varying lighting conditions show that both the baseline YOLOv8s and the improved GAM-YOLOv8s fluctuate with light intensity, but the enhanced model demonstrates superior overall performance and greater robustness. In front-lit scenes (8000–12,000 lux), the base model achieved a mAP of 90.2%, while GAM-YOLOv8s improved to 93.5%, a 3.3 percentage point increase. In back-lit conditions (1000–3000 lux), the baseline model drops to 78.5%, while the improved model maintains 87.2%, an increase of 8.7 percentage points. Under cloudy conditions (500–1500 lux), the baseline model achieves 85.6% and the improved model 90.1%, an increase of 4.5 percentage points. In dusk low-light conditions (200–500 lux), the baseline model achieved only 72.3% mAP, close to our defined practicality threshold of 70% mAP@0.5; given the operational requirements of harvesting robots, detection accuracy below 70% would result in missed detections of ripe fruit or a high false detection rate of unripe fruit. In contrast, GAM-YOLOv8s maintained a mAP of 82.8%, well above this threshold, for a maximum improvement of 10.5 percentage points. Overall, as illumination decreased from bright to dim, the baseline model's mAP declined by 19.8%, while the improved model decreased by only 10.7%; moreover, the improvement grew as lighting conditions deteriorated.
The occlusion-robustness test divides the dataset into three scenarios based on fruit occlusion level, with 200 images per category and 67–68 samples of each maturity stage in each scenario. Strawberry leaves serve as the occluding objects, simulating canopy shading in natural cultivation. In the unobstructed scenario, fruit surfaces have no leaf coverage and complete features are visible; in the partially obstructed scenario, 30–50% of the fruit surface is covered by leaves, with partial features visible; in the severely obstructed scenario, 50–70% of the surface is covered, revealing only edge or localized features. Test results, using per-category AP and average miss rate as the core metrics, are presented in Table 5.
Table 5 shows that, as the strawberry occlusion area increases from 0% to 50–70%, both YOLOv8s and GAM-YOLOv8s exhibit declining detection accuracy (AP). However, the improved model demonstrates a significant advantage that grows more pronounced with increasing occlusion. Without occlusion, YOLOv8s achieved AP values of 89.2%, 88.5%, and 93.8% for the three strawberry categories, with an average miss rate of 3.2%. GAM-YOLOv8s improved these to 92.5%, 91.8%, and 95.6%, while reducing the miss rate to 1.8%. Under partial occlusion (30–50%), YOLOv8s’ AP dropped to 81.5–86.2% with a false negative rate rising to 8.5%, while GAM-YOLOv8s achieved an AP of 87.2–92.3% and a false negative rate of 4.2%, demonstrating a 7.1–7.4% AP improvement and a clear advantage. Under severe occlusion (50–70%), YOLOv8s’ AP dropped to 71.8–78.6% with a false negative rate of 15.6%. GAM-YOLOv8s maintained an AP of 83.5–90.2% and a false negative rate of 6.8%, achieving an AP improvement of 13.8–14%. The results demonstrate that GAM-YOLOv8s significantly mitigates performance degradation in occlusion scenarios, with the most pronounced effects observed under moderate to severe occlusion conditions.
3.4. Compared with Existing Research
In recent years, many scholars have applied various deep learning models to strawberry maturity detection (Table 6). Tao et al. [34] proposed a strawberry maturity recognition algorithm named YOLOv5s-BiCE, which replaces the upsampling algorithm with the CARAFE module to achieve multi-scale feature fusion; compared with the original YOLOv5s, YOLOv5s-BiCE achieves a 2.8% improvement in mean average precision and a 7.4% increase in accuracy. Chen et al. [35] proposed a CES-YOLOv8 network architecture that replaces some C2f modules in the YOLOv8 backbone with ConvNeXt V2 modules to enhance feature capture for strawberries at different ripeness levels, with an ECA attention mechanism further improving feature representation; experiments show that CES-YOLOv8 achieves an accuracy of 88.20%, a recall of 89.80%, a mAP50 of 92.10%, and an F1 score of 88.99% in complex environments.
Cai et al. [36] introduced an attention mechanism into the backbone network and the atrous spatial pyramid pooling module of the DeepLabV3+ network to enhance the feature information of strawberry images; experiments show that this method accurately segments strawberry images at different ripeness stages, achieving a mean pixel accuracy of 90.9% and a mean intersection-over-union of 83.05% at 7.67 frames per second. Tamrakar et al. [37] proposed a lightweight YOLOv5s-CGhostnet model to improve strawberry detection; compared with the original YOLOv5 model, the model size was reduced by 85.09% and the GFLOPS computational load by 88.5%.
In terms of detection results, our model outperforms those of Chen et al. [35] and Cai et al. [36], with mAP values 2.51% and 0.6% higher, respectively. Compared with Tao et al. [34] and Tamrakar et al. [37], our model has a slightly lower mAP, mainly because the image samples used in this paper are closer to the natural environment, with more complex backgrounds and lighting. Regarding model parameters, Tao et al. [34] and Tamrakar et al. [37] selected YOLOv5s as their base model for its lower parameter count, whereas this paper opts for YOLOv8s, which offers fewer parameters and higher accuracy; its computational speed exceeds those of Chen et al. [35] and Cai et al. [36], enabling precise strawberry maturity detection with minimal computational effort. These comparisons with previous research show that the model proposed in this paper is effective and achieves good detection performance in natural environments.