Article

Research on Lightweight Algorithm Model for Precise Recognition and Detection of Outdoor Strawberries Based on Improved YOLOv5n

1
School of Mechanical and Electrical Engineering, Guangdong Polytechnic of Industry and Commerce, Guangzhou 510510, China
2
Foshan Zhongke Innovation Research Institute of Intelligent Agriculture and Robotics, Foshan 528000, China
3
Guangdong Provincial Key Laboratory of Agricultural Artificial Intelligence, Guangzhou 510642, China
4
School of Modern Information Industry, Guangzhou College of Commerce, Guangzhou 511363, China
5
School of Information Technology & Engineering, Guangzhou College of Commerce, Guangzhou 511363, China
*
Author to whom correspondence should be addressed.
Agriculture 2025, 15(1), 90; https://doi.org/10.3390/agriculture15010090
Submission received: 11 December 2024 / Revised: 26 December 2024 / Accepted: 31 December 2024 / Published: 2 January 2025

Abstract:
When picking strawberries outdoors, factors such as changing light, occlusion by obstacles, and the small size of the detection targets lead to poor recognition accuracy and low recognition rates. An improved YOLOv5n high-precision strawberry recognition algorithm is proposed. The algorithm replaces the original YOLOv5n backbone network with FasterNet to improve the detection rate. The MobileViT attention mechanism module is added to improve the feature extraction ability for small target objects, giving the model higher detection accuracy and a smaller module size. The CBAM hybrid attention module and the C2f module are introduced to improve the feature expression ability of the neural network, enrich the gradient flow information, and improve the performance and accuracy of the model. The SPPELAN module is also added to improve the model's detection efficiency for small objects. The experimental results show that the detection accuracy of the improved model is 98.94%, the recall rate is 99.12%, the model size is 53.22 MB, and the mAP value is 99.43%. Compared with the original YOLOv5n, the detection accuracy increased by 14.68%, and the recall rate increased by 11.37%. This technology effectively accomplishes the accurate detection and identification of strawberries under complex outdoor conditions and provides a theoretical basis for accurate outdoor identification and precise picking technology.

1. Introduction

Strawberries are sweet and nutritious, making them a fruit with high economic value. China is the world's largest producer of strawberries, and as their economic value continues to increase, so does the area planted with strawberries in China. Currently, the primary harvesting method is manual picking, which is labor-intensive. At the same time, the strawberry ripening period is short, and untimely harvesting leads to rot and economic losses. Accurate identification of strawberries is the basis for precise picking and harvesting. Therefore, efficient and accurate strawberry identification is essential for intelligent strawberry harvesting [1,2].
To date, researchers in China and abroad have made notable progress on fruit identification and classification. Early methods relied mainly on the color and shape of fruits to identify and classify them. However, such methods are affected by outdoor lighting and mutual occlusion between fruits and obstacles, so their recognition accuracy is low and struggles to meet the requirements of accurate recognition. With the development of deep learning, image recognition and classification technology has improved greatly [3,4,5,6,7,8,9,10,11,12,13,14,15]. Visual methods have demonstrated broad applications across agricultural and industrial domains [16,17]. For instance, advanced imaging techniques have been developed for non-radiating subspace visualization [18], while binocular stereo-vision systems have been effectively employed for precise fruit detection and positioning in Camellia oleifera orchards [19]. Deep neural networks combined with satellite image processing have proven valuable in remote sensing applications for sustainable urban planning and development monitoring [20]. Wang et al. [21] studied the identification and detection of Solanum rostratum Dunal based on the YOLOv5 algorithm: the attention mechanism was improved and optimized, with the CBAM module replacing the original attention module, which improved the model's detection accuracy and enabled real-time detection of Solanum rostratum Dunal. Jia Xueying et al. [22] conducted in-depth research on the low efficiency and accuracy of automatic citrus detection and grading and proposed a real-time detection model for citrus surface defects. Based on the YOLOv7 model, a coordinate attention module was introduced to improve the model's attention, and static and dynamic context representations were integrated to increase its expressive ability.
The model was tested, and the overall accuracy reached 94.4%, a high detection accuracy. To improve the generalization ability of apple leaf disease recognition, Guo et al. [23] proposed an improved MobileNetV3 recognition model in which the attention module and the network's fully connected layer were optimized. Model validation was carried out using transfer learning. The results showed that the average accuracy was as high as 95.62%; the model combined high precision with a small size and could serve as a reference for identifying apple leaf diseases. Stripe rust is one of the main diseases affecting wheat yield. To improve its detection accuracy and identify the disease grade, Su Baofeng et al. [24] identified the disease using time series and a support vector machine algorithm, achieving a recognition accuracy of 83.7% and providing a reference for identifying wheat disease grades. Traditional detection algorithms struggle to detect insect pests accurately because of their high concealment and mobility. Tian et al. [25] proposed an MD-YOLO algorithm that added a denseness module and an adaptive attention module (AAM), which helped improve the expressive power of features. The algorithm combined the feature extraction path and the feature aggregation path, effectively obtaining the spatial location information of the shallow network. Its effectiveness was verified experimentally, providing a research basis for detecting small objects in farmland. Rai et al. [26] designed a YOLO-Spot detection model to improve the accuracy of weed identification; the model showed substantial improvements in accuracy and parameter efficiency and, when applied to a UAV detection system, obtained good detection results. To accurately identify and locate sick silkworms, Shi et al. [27] proposed an improved detection model based on the YOLOv5s algorithm, in which ConvNeXt large-kernel depthwise separable convolution was used to expand the receptive field, and the channel attention mechanism ECANet was added to enhance feature extraction. The test results showed an average detection accuracy of 96.46%. The environment of outdoor soybean plantations is complex, so traditional image recognition algorithms cannot accurately identify their pests. Zhu et al. [28] proposed a CBF-YOLO network model for identifying soybean pests in complex environments. The algorithm mainly comprises CSE-ELAN, Bi-PAN, and FFE modules, which improved feature extraction in the spatial and channel dimensions and gave the model more accurate recognition and positioning ability; the experimental average detection accuracy was 86.9%, 6.3% higher than the original algorithm. To improve cherry tomato recognition and positioning accuracy, Zhang et al. [29] proposed a recognition and positioning model based on the YOLOv4-LITE lightweight neural network. The MobileNet-v3 module was used to construct the network, greatly improving feature extraction and target detection speed; modifying the feature pyramid network improved the detection of small targets, and the overall model was migrated to a mobile terminal, realizing efficient and fast detection and providing a research basis for agricultural picking. Huang et al. [30] proposed a lightweight detection algorithm based on the YOLOv5s model to solve the missed detections caused by occlusion during strawberry picking; the MobileNet V3 network replaced the original backbone, and the Alpha-IoU loss function was introduced to accelerate model convergence. The experimental results showed a detection speed of 44 frames/s and a detection accuracy of 99.4%, much better than the original model and meeting fast, high-precision detection requirements. Liu et al. [31] improved the YOLOv8-Pose model to accurately identify key points of strawberry fruits and stems at the red ripening stage. The Slim-neck module and CBAM attention mechanism were added, effectively improving feature extraction for small targets, and the influence of light and other factors on model prediction was analyzed. The experimental results showed a detection accuracy above 94%, with good detection performance and robustness.
Because strawberries are small targets affected by natural light, they are difficult to identify. Building on the above research, this paper proposes an improved YOLOv5n detection model. The model improves and optimizes the backbone network and adds the MobileViT attention mechanism, CBAM attention mechanism, C2f module, and SPPELAN module. The model is not only lightweight but also significantly improves detection speed and accuracy, providing an information basis for subsequent strawberry harvesting.

2. Materials and Methods

2.1. Data Sources

The pictures in this experiment were taken in a strawberry field on a plantation in Guangdong Province. Images were collected at different times and under different lighting conditions and shooting angles to meet the diversity needs of outdoor detection. A total of 1368 strawberry images were captured and saved in JPG format. The captured images included ripe, unripe, and rotten strawberries; the three types are shown in Figure 1.

2.2. Data Processing

The original images used in this study were 3024 × 4032 pixels; such large images contain more extraneous detail, which slows model training. To meet the model's input requirements and improve training speed, the original images were preprocessed and resized to 640 × 640 pixels. This size effectively retained the basic information of the images without affecting training, providing a good foundation for later visualization. The strawberry images were annotated with Label Studio software using the labels ripe, unripe, and rotten; Figure 2 shows an annotated example. The annotations were saved in XML format. The data set was expanded to 19,600 images using data augmentation techniques. To prevent overfitting, the final data set was divided into training, validation, and test sets in the ratio 8:1:1.
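As a minimal illustration of the 8:1:1 split described above, the following Python sketch divides a list of 19,600 augmented images into training, validation, and test sets. The file names are hypothetical placeholders, not the paper's actual data.

```python
import random

def split_dataset(items, ratios=(0.8, 0.1, 0.1), seed=42):
    """Shuffle a list of samples and split it into train/val/test subsets."""
    assert abs(sum(ratios) - 1.0) < 1e-9
    items = list(items)
    random.Random(seed).shuffle(items)  # deterministic shuffle for reproducibility
    n = len(items)
    n_train = int(n * ratios[0])
    n_val = int(n * ratios[1])
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]
    return train, val, test

# 19,600 augmented images split 8:1:1 as in the paper
images = [f"strawberry_{i:05d}.jpg" for i in range(19600)]
train, val, test = split_dataset(images)
print(len(train), len(val), len(test))  # 15680 1960 1960
```

A fixed random seed keeps the split reproducible across runs, which matters when comparing ablation variants trained on the same partition.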

3. Improved YOLOv5n Strawberry Detection Algorithm

3.1. YOLOv5n Model

The YOLOv5 model has high stability and accuracy for small target detection. The YOLOv5n model [32] is a lightweight YOLOv5 variant with high detection accuracy and fast detection speed; its model size and computational cost are small, making it suitable for edge computing scenarios and providing a good foundation for later deployment on mobile terminals. However, due to the complexity of the strawberry-picking environment, background noise, and occlusion, the original YOLOv5n model may falsely detect strawberries, and its detection accuracy and speed struggle to meet actual detection requirements.

3.2. Improved YOLOv5n Model

Given the shortcomings of the YOLOv5n model, this paper proposes an improved YOLOv5n model for accurately and efficiently recognizing strawberries. The overall improvement scheme is as follows: (1) The original backbone was modified to FasterNet, (2) the MobileViT and CBAM attention mechanisms were added, (3) the C2f module and SPPELAN module were introduced, and (4) the non-maximum suppression algorithm Soft-NMS was added. The improved overall structure is shown in Figure 3.

3.2.1. Improvement and Optimization of Backbone Network

To embed images into the model, they had to be preprocessed, including resizing and standardization, and converted into a format suitable for model input. The process was as follows: (1) each image was converted to 640 × 640 pixels and normalized; (2) the image was input into the backbone network as a 4D tensor (batch_size, channels, height, width). Image features were extracted through multiple convolutional layers and gradually embedded into a higher-dimensional feature space, enabling the model to recognize objects. Because color metamerism significantly affects object classification, this study first performed color space conversion and standardization on the images. Afterwards, a convolutional layer (Conv) was used to extract color features from the processed strawberry images. Finally, the color information was introduced into the backbone network.
FasterNet [33] is an efficient neural network architecture. It introduces partial convolution (PConv) and pointwise convolution (PWConv) as its leading operators to reduce redundant computation and memory access, thereby improving spatial feature extraction capability. PConv adopts a unique strategy of performing regular convolution on only some of the input channels to extract spatial features while keeping the remaining channels unchanged. Compared with conventional convolution, the FLOPs of PConv are only 1/16 of those of conventional convolution, and its memory access requirement is also relatively small, only 1/4 of that of conventional convolution. FasterNet has four stages, each with an embedding layer (a 4 × 4 convolution with a stride of 4). Each FasterNet block contains one PConv layer and two PWConv layers for spatial downsampling and channel expansion; they are combined to form an inverted residual block, with normalization and activation layers added after the middle layer, which expands the number of channels and can reuse input features. Because PConv's floating-point operations are much lower than those of regular convolution, the FasterNet architecture can significantly improve computational speed. This study adopted the FasterNet architecture in place of the original YOLOv5n backbone to further improve detection speed and accuracy, providing a good foundation for later porting of the algorithm to mobile devices.
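The 1/16 FLOPs figure quoted for PConv follows directly from convolving only a quarter of the channels, since convolution cost scales with the product of input and output channels. A quick arithmetic check (the feature-map dimensions here are illustrative, not from the paper):

```python
def conv_flops(h, w, k, c_in, c_out):
    """Multiply-accumulate count of a k x k convolution on an h x w feature map."""
    return h * w * k * k * c_in * c_out

h, w, k, c = 56, 56, 3, 64
cp = c // 4  # PConv convolves only 1/4 of the channels (in and out)

regular = conv_flops(h, w, k, c, c)
pconv = conv_flops(h, w, k, cp, cp)
print(pconv / regular)  # 0.0625, i.e. 1/16 of a regular convolution
```

Because both the input and output channel counts shrink to c/4, the cost ratio is (1/4)² = 1/16, matching the figure cited above.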

3.2.2. Improvement and Optimization of Attention Mechanism

The original YOLOv5n model uses a SENet (squeeze-and-excitation networks) attention mechanism. This mechanism adds a global self-attention module after each convolutional layer, allowing the network to adjust the weights of each feature channel automatically, and adds a squeeze-and-excitation structure to compress the feature channel dimension and further improve detection performance. However, this module mainly considers adaptability in the spatial dimension while ignoring adaptability in the channel dimension, resulting in poor performance on high-resolution images.
To improve upon the shortcomings of the original attention mechanism, this paper adopted the MobileViT and CBAM attention mechanism modules [34,35]. MobileViT is a lightweight Transformer-based model whose primary function is image classification. MobileViT mainly comprises convolution, MV2 (the inverted residual block in MobileNetV2), a MobileViT block, global pooling, and fully connected layers. The network structure is shown in Figure 4.
Compared to traditional convolutional neural networks, MobileViT uses a lightweight attention mechanism for feature extraction, which ensures accuracy while having a faster processing speed and smaller model size. This will provide a solid foundation for future applications on mobile devices.
This study adopted the attention mechanism modules of CBAM (Convolutional Block Attention Module) and MobileViT. The CBAM module processed image feature maps using channel attention and spatial attention. Its input was a convolutional feature map, and its output was a weighted feature map. This module was built into the convolutional layer to enhance the model’s attention to important features. However, although the CBAM attention module improved the feature extraction ability of the model, it had a large number of training parameters, which increased the complexity of the model. Therefore, adding the MobileViT attention module can make the model lightweight. Two modules were used alternately to perform a mixed convolution attention mechanism on the image feature map. The input was the image feature map extracted by deep features, and the output was the feature map weighted by attention, making it lighter and more computationally efficient. The MobileViT attention module was built into the deep feature fusion part of the network to enhance the model’s perception of details and long-term dependencies.
To further reduce the impact of outdoor lighting on target recognition, the CBAM attention module was introduced. This module combines channel attention and spatial attention to improve the feature expression ability of convolutional neural networks, enabling better detection and recognition results. An attention module was placed after the input strawberry feature map to refine the extracted information, thereby enhancing the accuracy of feature extraction by the backbone network and obtaining the refined strawberry feature map. For an intermediate feature map F of dimension C × H × W (where C is the number of channels, H the height, and W the width), this attention module generates a one-dimensional channel attention map of size C × 1 × 1 and a two-dimensional spatial attention map of size 1 × H × W. During element-by-element multiplication, the attention values are broadcast: the channel attention values are replicated along the spatial dimensions, and vice versa, yielding the refined output of the CBAM module. For the spatial attention map, the input feature map is processed by average pooling and max pooling, and the two resulting maps are concatenated along the channel dimension to form a two-channel feature map. A standard 7 × 7 convolution layer is then applied to this merged feature to generate the two-dimensional spatial attention map, which passes through the activation function to produce the attention weight Ms; finally, the output is multiplied element-wise with the input to restore the size C × H × W. The weighting coefficient Ms is given in Equation (1).
M_s(F) = σ(f^{7×7}([AvgPool(F); MaxPool(F)])) = σ(f^{7×7}([F^s_avg; F^s_max]))
where F represents the feature map after the channel attention module is applied, AvgPool(F) denotes the application of global average pooling on F, MaxPool(F) denotes the global max pooling operation on F, σ refers to the Sigmoid function, and f^{7×7} indicates a convolution operation with a 7 × 7 convolutional layer.
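A minimal NumPy sketch of the spatial attention computation in Equation (1). The 7 × 7 kernel weights are random here purely for illustration; in the actual module they are learned during training, and pooling is taken across the channel axis as in the standard CBAM formulation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def spatial_attention(F, kernel):
    """Ms = sigmoid(conv7x7([AvgPool(F); MaxPool(F)])), applied to F.

    F      : feature map of shape (C, H, W), already channel-refined
    kernel : (2, 7, 7) weights of the 7x7 convolution
    """
    avg = F.mean(axis=0)            # (H, W) average pooled over channels
    mx = F.max(axis=0)              # (H, W) max pooled over channels
    stacked = np.stack([avg, mx])   # (2, H, W) two-channel merged feature
    pad = 3                         # 'same' padding for a 7x7 kernel
    padded = np.pad(stacked, ((0, 0), (pad, pad), (pad, pad)))
    H, W = avg.shape
    out = np.zeros((H, W))
    for i in range(H):
        for j in range(W):
            out[i, j] = np.sum(padded[:, i:i + 7, j:j + 7] * kernel)
    Ms = sigmoid(out)               # (H, W) spatial attention map in (0, 1)
    return F * Ms                   # weights broadcast over all channels

rng = np.random.default_rng(0)
F = rng.standard_normal((8, 16, 16))
refined = spatial_attention(F, rng.standard_normal((2, 7, 7)) * 0.1)
print(refined.shape)  # (8, 16, 16)
```

The broadcast in the final line mirrors the replication of attention values described above: a single (H, W) map rescales every channel of F.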

3.2.3. C2f Module

To further improve the accuracy of strawberry recognition and the light weight of the model, the C2f module was used to replace the C3 module in the original YOLOv5n. The C2f module has two convolutional layers that combine high-level features with information from above and below. The structure of the C2f module is shown in Figure 5. The C2f module consists of (1) a convolution layer (Conv), which receives the input feature map, generates the intermediate feature map, and is responsible for extracting the basic features of the input image; (2) a bottleneck module, where the intermediate feature map generated by the bottleneck module is split into two parts, one of which passes directly to the final Concat module and another to multiple bottleneck modules for further processing, and the bottleneck module processes the input feature map through a series of convolution, normalization, and activation operations, with the resulting feature map spliced with the part of the feature map passed directly in the Concat module; and (3) a Concat module, where the bottleneck module processes the feature map and directly transmits feature maps, which are spliced in the Concat block to realize feature fusion. The model can comprehensively utilize multiscale and multi-level information through the above operations and provide rich feature representations for subsequent detection and classification tasks.
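The split/branch/concat data flow described above can be sketched as follows. This is a simplified stand-in: the Bottleneck here is a toy function so the flow stays visible, and the initial and final 1 × 1 convolutions of the real C2f module are omitted.

```python
import numpy as np

def bottleneck(x):
    """Toy stand-in for a Bottleneck block (real blocks use conv + BN + SiLU)."""
    return x + np.tanh(x)

def c2f_forward(x, n=2):
    """C2f-style data flow: split the channels, push one half through n
    bottlenecks, and keep every intermediate output for the final concat.

    x : feature map of shape (C, H, W)
    """
    a, b = np.split(x, 2, axis=0)   # split channels into two halves
    outputs = [a, b]                # one half passes directly to the concat
    for _ in range(n):
        b = bottleneck(b)           # other half flows through n bottlenecks
        outputs.append(b)           # each stage's output is retained
    # splice everything in the Concat step: rich multi-level feature fusion
    return np.concatenate(outputs, axis=0)

x = np.ones((8, 4, 4))
y = c2f_forward(x, n=2)
print(y.shape)  # (16, 4, 4): (2 + n) * C/2 channels reach the concat
```

Keeping every bottleneck's output in the concat is what enriches the gradient flow relative to the C3 module, which discards the intermediate stages.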

3.2.4. SPPELAN Module

Because strawberries are small detection targets, the SPPELAN module was integrated into this study to further improve the model's detection accuracy and efficiency. This module combines the characteristics of SPP (spatial pyramid pooling) and ELAN (efficient layer aggregation network), improving the model's representation ability within local areas of the feature map.
The SPPELAN module consists of (1) the maximum pooling layer, which is used to perform spatial pyramid pooling, and the maximum pooling operation is performed through kernels of different sizes to capture multiscale features, and (2) a local feature aggregation network, which improves the representation ability of the model through the aggregation of local features. The SPPELAN module improves the model’s ability to detect objects of different sizes while maintaining sensitivity to important local features in the image. The network diagram is shown in Figure 6.
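A simplified NumPy sketch of the multi-scale pooling and aggregation idea: cascaded stride-1 max pools (as in YOLO-style SPP blocks) whose intermediate maps are all concatenated along the channel axis, ELAN-style. The surrounding 1 × 1 convolutions of the real module are omitted, and the kernel sizes are illustrative.

```python
import numpy as np

def maxpool_same(x, k):
    """Stride-1 max pooling with 'same' padding on a (C, H, W) feature map."""
    pad = k // 2
    p = np.pad(x, ((0, 0), (pad, pad), (pad, pad)), constant_values=-np.inf)
    C, H, W = x.shape
    out = np.empty_like(x)
    for i in range(H):
        for j in range(W):
            out[:, i, j] = p[:, i:i + k, j:j + k].max(axis=(1, 2))
    return out

def sppelan_pool(x, kernels=(5, 5, 5)):
    """Cascaded max pools; every intermediate map joins the final concat,
    so features at several receptive-field sizes are aggregated."""
    feats = [x]
    for k in kernels:
        feats.append(maxpool_same(feats[-1], k))  # each pool widens the field
    return np.concatenate(feats, axis=0)

x = np.random.default_rng(1).standard_normal((4, 8, 8))
y = sppelan_pool(x)
print(y.shape)  # (16, 8, 8): input plus three pooled scales
```

Cascading the pools is cheaper than pooling with large kernels directly, yet yields equivalent multiscale coverage, which is why SPP-style blocks favor it.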

3.2.5. Soft-NMS Module

In this study, the Soft-NMS module was used to optimize the detection boxes in target detection and improve detection accuracy and efficiency. Soft-NMS (soft non-maximum suppression) is an improved non-maximum suppression algorithm. When processing overlapping detection boxes, the traditional NMS algorithm directly retains the box with the highest score and discards the others, which may cause some real targets to be wrongly ignored. In contrast, Soft-NMS preserves more potentially correct detection boxes by assigning lower scores to partially overlapping boxes instead of discarding them outright, improving both the accuracy and the recall rate of object detection. The specific process was as follows: when detection boxes overlapped, Soft-NMS weighted and adjusted their scores based on their intersection over union (IoU) values rather than directly removing the lower-scoring boxes as traditional NMS does. This weighting strategy enabled Soft-NMS to handle overlapping detection boxes more flexibly, improving detection accuracy while maintaining detection efficiency.
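The score-decay process described above can be sketched in plain Python. This uses the common Gaussian decay variant of Soft-NMS; the paper does not specify which decay function was used, and the boxes and scores below are made-up numbers for illustration.

```python
import math

def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def soft_nms(boxes, scores, sigma=0.5, score_thresh=0.001):
    """Gaussian Soft-NMS: decay overlapping scores instead of discarding boxes."""
    dets = sorted(zip(boxes, scores), key=lambda d: -d[1])
    keep = []
    while dets:
        box, score = dets.pop(0)          # highest remaining score is kept
        keep.append((box, score))
        decayed = []
        for b, s in dets:
            s *= math.exp(-iou(box, b) ** 2 / sigma)  # weight by overlap
            if s > score_thresh:          # drop only near-zero scores
                decayed.append((b, s))
        dets = sorted(decayed, key=lambda d: -d[1])
    return keep

# two heavily overlapping strawberries plus one separate box
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
kept = soft_nms(boxes, [0.9, 0.8, 0.7])
print(len(kept))  # 3: all boxes survive, the overlapped one with a decayed score
```

Under hard NMS with a typical IoU threshold, the second box would have been removed entirely; here it survives with a reduced score, which is exactly the recall benefit described above.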

3.3. Model Training and Testing

3.3.1. Experiment Environment

This experimental model was built on the PyTorch deep learning framework. The operating system was Windows 10. The hardware configuration was a 13th Gen Intel® Core i9-13900K processor and two 24 GB NVIDIA Tesla P40 GPUs. The software configuration was CUDA 11.8 and cuDNN 9.2.0. Training parameters: following the original YOLO series models, the pre-training results showed that a learning rate and decay coefficient of 0.01 basically met the training requirements, and complete convergence was achieved after 500 training epochs.

3.3.2. Evaluation Indicators

To evaluate the accuracy of the image detection model for strawberry object detection, the precision (P), recall (R), and mean average precision (mAP) of image detection were used as the evaluation indicators for the final effect in this study. To determine whether the model was suitable for deployment on mobile devices, it was also necessary to comprehensively consider parameters such as detection speed FPS (frames per second). The F1 value, as a harmonic mean of precision and recall, could assist in evaluating model performance. Precision (P) represented the proportion of true positive samples among all detected positive samples, which could reflect the model’s ability to distinguish negative samples. Its calculation formula is shown in Equation (2).
P = TP / (TP + FP)
where TP refers to samples that are predicted positive and are actually positive, and FP refers to samples detected as positive that are actually negative.
The recall rate R represented the proportion of samples that were actually positive samples correctly predicted by the model as positive samples. The calculation formula is shown in Equation (3).
R = TP / (TP + FN)
where FN refers to the samples that were actually positive but incorrectly predicted as negative.
Mean average precision (mAP) refers to the average prediction accuracy of all categories in a dataset. In object detection tasks, it was used to evaluate the detection performance of algorithms for various types of targets. The calculation formula is shown in Equation (4).
mAP = (1/C) Σ_{c=1}^{C} ∫₀¹ P(R) dR
where C is the number of categories, and P(R) is the precision as a function of recall R.
The F1 value is the harmonic mean of precision and recall used to comprehensively evaluate classification models’ performance. The calculation formula is shown in Equation (5).
F1 = 2PR / (P + R)
where P represents precision, and R is the recall rate.
Detection speed FPS refers to the number of images the model can process per second. This indicator directly reflects the speed at which the model processes images. The calculation formula is shown in Equation (6).
FPS = 1000 / (PeTime + InferTime + NMSTime)
where PeTime is the preprocessing time of the image, ms; InferTime is the time required for network inference, ms; and NMSTime is the time for optimizing network prediction boxes, ms.
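Equations (2)–(6) can be collected into a small helper. The counts and stage times below are made-up numbers for illustration, not the paper's measurements.

```python
def detection_metrics(tp, fp, fn):
    """Precision, recall, and F1 from detection counts, Equations (2)-(5)."""
    p = tp / (tp + fp)          # Equation (2)
    r = tp / (tp + fn)          # Equation (3)
    f1 = 2 * p * r / (p + r)    # Equation (5), harmonic mean of P and R
    return p, r, f1

def fps(pe_time_ms, infer_time_ms, nms_time_ms):
    """Frames per second from per-image stage times in ms, Equation (6)."""
    return 1000.0 / (pe_time_ms + infer_time_ms + nms_time_ms)

p, r, f1 = detection_metrics(tp=90, fp=10, fn=5)
print(round(p, 3), round(r, 3), round(f1, 3))  # 0.9 0.947 0.923
print(round(fps(5.0, 50.0, 5.0), 1))           # 16.7 frames per second
```

Note that FPS combines preprocessing, inference, and NMS time per image, so a model with fast inference can still score a low FPS if its post-processing is heavy.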

4. Results and Analysis

4.1. Ablation Test Results and Analysis

The results of the ablation test are shown in Table 1. Because the FasterNet structure used in the backbone is lightweight and computationally efficient, the model became lighter, and the detection speed improved significantly. The CBAM and MobileViT attention modules significantly improved the accuracy and detection frame rate for small object detection by enhancing channel and spatial attention. At the same time, the SPPELAN module and the C2f convolution module effectively improved detection accuracy and further reduced model size through feature extraction optimization and lightweight design. When the FasterNet backbone structure was used alone, the model size was reduced by more than 15 MB compared with the baseline, indicating that this backbone structure can reduce the model's weight. The ablation experiments show that the improved model outperforms the unimproved model in detection effect, and the results indicate that the proposed algorithm has good application value.

4.2. Comparison of the Detection Effect with Different Models

To verify the detection effectiveness of the improved YOLOv5n model, several classical models, including RCNN, YOLOv3, YOLOv4, YOLOv5n, YOLOv8, and YOLOv9-e, were trained and tested in the same environment. Table 2 lists the detection results of the compared models on the test set. The experimental results show that the overall results of the improved YOLOv5n model are better than those of the other models. The mean average precision (mAP) of the improved YOLOv5n model was 99.43%, an improvement of 59.09%, 15.86%, 5.26%, 9.25%, 16.23%, and 15.58% over RCNN, YOLOv3, YOLOv4, YOLOv5n, YOLOv8, and YOLOv9-e, respectively. The recall rate of the improved YOLOv5n model was 99.12%, an increase of 75.37%, 18.39%, 15.42%, 11.37%, 19.55%, and 15.94%, respectively, over the other algorithms. The model size was 53.2 MB, which was 54.81 MB, 190.86 MB, and 86.83 MB smaller than RCNN, YOLOv4, and YOLOv9-e, respectively, but 35.80 MB, 46.48 MB, and 46.85 MB larger than YOLOv3, YOLOv5n, and YOLOv8. The detection frame rate was 16.61 frames per second, which was 56.83% and 37.39% higher than the RCNN and YOLOv9-e models but 35.94%, 38.17%, 437.75%, and 185.56% lower than the YOLOv3, YOLOv4, YOLOv5n, and YOLOv8 models, respectively. Regarding computational complexity, the improved YOLOv5n model was significantly lower than the YOLOv3, YOLOv4, YOLOv5n, and YOLOv8 models. Although the improved YOLOv5n model was not optimal in model size, detection speed, or computational complexity, it outperformed the other models in accuracy and the remaining metrics, so its overall detection performance is better. The improved YOLOv5n model can meet the practical needs of strawberry-picking detection.
Detection and recognition were conducted under different lighting conditions and occlusion environments, as shown in Figure 7. Strawberries in the field heavily occlude one another, and the light changes over time, both of which affect detection. Under the influence of light, the improved model in this paper produced no missed or repeated detections, while all the other models did, so the improved model can reduce the impact of outdoor light changes on strawberry recognition. Comparing the models in Figure 7: the RCNN detection model produced a large number of false detections and duplicate recognitions; the YOLOv3 detection model produced a large number of false detections, identifying rotten strawberries as mature ones; the YOLOv4 and YOLOv5n detection models had many missed detections; the YOLOv8 detection model produced false detections, identifying mature strawberries as bad ones; and the YOLOv9-e model identified merged mature strawberries (deformed strawberries) as multiple strawberries, resulting in recognition errors. Strawberries in different locations, including occluded strawberries, were detected by the improved YOLOv5n model with high confidence, indicating that the Soft-NMS module can effectively optimize the detection frames, removing low-confidence frames and resolving repeated detections.
The experimental results show that the improved YOLOv5n algorithm maintains high accuracy and robustness in complex environments, achieving accurate detection and meeting the requirements of accurate outdoor identification.
Due to the complex background, numerous interference factors, and the small size of the target objects, the training curves varied considerably. As shown in Figure 8, comparing the mAP, accuracy, recall, and loss curves of the different models, the traditional RCNN algorithm plateaued at an accuracy below 30%. The YOLO series algorithms performed far better thanks to their architecture and training strategy, fitting rapidly within fewer epochs and excelling in accuracy, recall, loss value, and mAP, although different algorithms improved at different rates and reached different final levels. The improved YOLOv5n model in this study converged more slowly than the other YOLO algorithms, yet its final results were significantly superior. It is therefore better suited to detection and recognition under complex outdoor conditions.
Compared with other deep learning models [1,8,11], the algorithm proposed in this study improved precision (P), mean average precision (mAP), and recall (R) but lagged in detection frame rate. Future work should further raise the detection rate and the overall efficiency of detection and recognition.
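As a side note, the F1 scores listed in Table 2 are consistent with the reported P and R values; a one-line check (our own sketch, not from the paper) confirms F1 = 2PR/(P + R):

```python
# Consistency check: harmonic mean of precision and recall, in percent.
def f1(p, r):
    return round(2 * p * r / (p + r), 2)

print(f1(98.94, 99.12))  # 99.03, matching the improved YOLOv5n row in Table 2
print(f1(84.22, 87.73))  # 85.94, matching the YOLOv5n baseline row
```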

5. Conclusions

In response to the high-precision detection requirements of strawberries under complex conditions in outdoor orchards, this study proposed an improved YOLOv5n model, which effectively enhanced various performance indicators. The main conclusions are as follows:
(1) Replacing the original backbone network with FasterNet significantly enhanced computational efficiency while maintaining accuracy. The model, which integrated CBAM, a MobileViT attention mechanism, and the SPPELAN module, enhanced the accuracy and detection rate for small targets like strawberries. Including the C2f convolution module further contributed to the model’s lightweight design. Through ablation experiments, it was evident that all indicators of the model saw significant improvements. The improved YOLOv5n model boasted a detection accuracy of 98.94%, a recall rate of 99.12%, a model size of 53.22 MB, and a mAP value of 99.43%.
(2) To further validate its effectiveness, the improved YOLOv5n model was compared with other models. According to the experimental results, its overall performance surpassed that of the other models, achieving an accuracy of 98.9%. Compared to the RCNN, YOLOv3, YOLOv4, YOLOv5n, YOLOv8, and YOLOv9-e models, accuracy improved by 71.38, 9.78, 4.73, 14.68, 14.87, and 19.36 percentage points, respectively. The experiment demonstrates that the improved YOLOv5n model effectively enhanced the accuracy of outdoor detection, satisfying the requirements for strawberry detection and recognition under complex outdoor conditions.
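The CBAM block referenced in conclusion (1) combines channel attention (global average/max pooling feeding a shared MLP) with spatial attention (channel-wise average/max maps feeding a convolution). The sketch below illustrates that data flow only: the weights are random placeholders, and the reduction ratio and the 1x1 channel mix (standing in for CBAM's 7x7 convolution) are our simplifications, not the trained module.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cbam(x, reduction=4):
    """x: feature map of shape (C, H, W); returns the re-weighted map."""
    C, H, W = x.shape
    # --- channel attention: global pooling + shared two-layer MLP ---
    w1 = rng.standard_normal((C // reduction, C)) * 0.1  # placeholder weights
    w2 = rng.standard_normal((C, C // reduction)) * 0.1
    avg = x.mean(axis=(1, 2))                    # (C,)
    mx = x.max(axis=(1, 2))                      # (C,)
    mlp = lambda v: w2 @ np.maximum(w1 @ v, 0)   # shared MLP with ReLU
    ch_att = sigmoid(mlp(avg) + mlp(mx))         # (C,) gate in (0, 1)
    x = x * ch_att[:, None, None]
    # --- spatial attention: channel-wise avg/max + a simple mix ---
    sp = np.stack([x.mean(axis=0), x.max(axis=0)])   # (2, H, W)
    w = rng.standard_normal(2) * 0.1                 # 1x1 mix (stand-in for 7x7 conv)
    sp_att = sigmoid(np.tensordot(w, sp, axes=1))    # (H, W) gate in (0, 1)
    return x * sp_att[None]

feat = rng.standard_normal((8, 4, 4))
out = cbam(feat)
print(out.shape)  # (8, 4, 4): same shape, every activation gated twice
```

Because both gates lie strictly in (0, 1), the output never exceeds the input in magnitude; the module only re-weights features, which is why it adds accuracy at modest cost.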

Author Contributions

Conceptualization, X.C. and P.Z.; methodology, X.C., Y.H., M.H. and P.Z.; software, X.C., Y.H., M.H. and Z.H.; validation, M.H., Z.H. and T.Z.; formal analysis, Z.H. and T.Z.; investigation, Z.H. and T.Z.; resources, Z.H. and X.C.; data curation, H.X.; writing—original draft preparation, X.C., Z.H. and P.Z.; writing—review and editing, Z.H. and P.Z.; visualization, Z.H., Y.H. and M.H.; supervision, H.X.; project administration, X.C. and H.X.; funding acquisition, X.C. and H.X. All authors have read and agreed to the published version of the manuscript.

Funding

This research was financially supported by the Guangdong Basic and Applied Basic Research Foundation (project No. 2022A1515110719), the 2024 Guangdong Province Ordinary University Characteristic Innovation Project (project No. 2024KTSCX396), the 2024 university-level scientific research projects of Guangdong Polytechnic of Industry and Commerce (project No. 2024-ZK-07), the 2022 university-level high-level talent project (project No.2022-gc-08), the 2023 University-Level Scientific Research Platform and Innovation Team Project (project No. 2023-TD-03), the Open Project Program of the Guangdong Provincial Key Laboratory of Agricultural Artificial Intelligence (project No. GDKL- AAI-2022002; GDKL-AAI-2022003), and the 2023 Guangdong Province Science and Technology Innovation Strategy Special Fund (project No. PDJH2023a0818). We would also like to thank the anonymous reviewers for their critical comments and suggestions for improving the manuscript.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author(s).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Wang, Y.; Yan, G.; Meng, Q.L.; Yao, T.; Han, J.F.; Zhang, B. DSE-YOLO: Detail semantics enhancement YOLO for multi-stage strawberry detection. Comput. Electron. Agric. 2022, 198, 107057. [Google Scholar] [CrossRef]
  2. Zhao, S.Y.; Liu, J.Z.; Wu, S. Multiple disease detection method for greenhouse-cultivated strawberry based on multiscale feature fusion Faster R_CNN. Comput. Electron. Agric. 2022, 199, 107176. [Google Scholar] [CrossRef]
  3. Miao, R.H.; Li, G.A.; Huang, Z.B.; Li, Z.W.; Du, H.L. Maturity Detection of Apple in Complex Orchard Environment Based on YOLO v7-ST-ASFF. Trans. Chin. Soc. Agric. Mach. 2024, 55, 219–228. (In Chinese) [Google Scholar]
  4. Miao, R.H.; Li, Z.W.; Wu, J.L. Lightweight Maturity Detection of Cherry Tomato Based on Improved YOLO v7. Trans. Chin. Soc. Agric. Mach. 2023, 54, 225–233. (In Chinese) [Google Scholar]
  5. Zhang, Z.; Zhou, J.; Jiang, Z.Z.; Han, H.Q. Lightweight Apple Recognition Method in Natural Orchard Environment Based on Improved YOLO v7 Model. Trans. Chin. Soc. Agric. Mach. 2024, 55, 231–242+262. (In Chinese) [Google Scholar]
  6. Yuan, J.; Xie, L.W.; Guo, X.; Liang, R.G.; Zhang, Y.G.; Ma, H.T. Apple Leaf Disease Detection Method Based on Improved YOLO v7. Trans. Chin. Soc. Agric. Mach. 2024, 55, 1–9. (In Chinese) [Google Scholar]
  7. Song, H.B.; Yang, H.R.; Su, X.W.; Zhou, Y.H.; Gao, X.Y.; Shang, Y.Y.; Zhang, S.J. Application of Image Enhancement Technology Based on Enlighten GAN in Apple Detection in Natural Scenes. Trans. Chin. Soc. Agric. Mach. 2024, 55, 266–279. (In Chinese) [Google Scholar]
  8. Yang, Z.Y.; Wang, X.C.; Qi, Z.H.; Wang, D.Z. Recognizing strawberry to detect key points for peduncle picking using improved YOLO v8 model. Trans. Chin. Soc. Agric. Eng. 2024, 40, 167–175. (In Chinese) [Google Scholar]
  9. Nan, Y.L.; Zhang, H.C.; Zeng, Y.; Zheng, J.Q.; Ge, Y.F. Intelligent detection of Multi-Class pitaya fruits in target picking row based on WGB-YOLO network. Comput. Electron. Agric. 2023, 208, 107780. [Google Scholar] [CrossRef]
  10. Xu, D.F.; Zhao, H.M.; Lawal, O.M.; Lu, X.Y.; Ren, R.; Zhang, S.J. An Automatic Jujube Fruit Detection and Ripeness Inspection Method in the Natural Environment. Agronomy 2023, 13, 451. [Google Scholar] [CrossRef]
  11. Du, X.Q.; Cheng, H.C.; Ma, Z.H.; Lu, W.W.; Wang, M.X.; Meng, Z.C.; Jiang, C.J.; Hong, F.W. DSW-YOLO: A detection method for ground-planted strawberry fruits under different occlusion levels. Comput. Electron. Agric. 2023, 214, 108304. [Google Scholar] [CrossRef]
  12. Wang, Y.W.; Wang, Y.J.; Zhao, J.B. MGA-YOLO: A lightweight one-stage network for apple leaf disease detection. Front. Plant Sci. 2022, 13, 927424. [Google Scholar] [CrossRef] [PubMed]
  13. Bai, Y.F.; Yu, J.Z.; Yang, S.Q.; Ning, J.F. An improved YOLO algorithm for detecting flowers and fruits on strawberry seedlings. Biosyst. Eng. 2024, 237, 1–12. [Google Scholar] [CrossRef]
  14. Wu, D.H.; Lv, S.C.; Jiang, M.; Song, H.B. Using channel pruning-based YOLO v4 deep learning algorithm for the real-time and accurate detection of apple flowers in natural environments. Comput. Electron. Agric. 2020, 178, 105742. [Google Scholar] [CrossRef]
  15. Wang, D.D.; He, D.J. Channel pruned YOLO V5s-based deep learning approach for rapid and accurate apple fruitlet detection before fruit thinning. Biosyst. Eng. 2021, 210, 271–281. [Google Scholar] [CrossRef]
  16. Wang, H.; Xu, X.; Liu, Y.; Lu, D.; Liang, B.; Tang, Y. Real-Time Defect Detection for Metal Components: A Fusion of Enhanced Canny–Devernay and YOLOv6 Algorithms. Appl. Sci. 2023, 13, 6898. [Google Scholar] [CrossRef]
  17. Tang, Y.; Qiu, J.; Zhang, Y.; Wu, D.; Cao, Y.; Zhao, K.; Zhu, L. Optimization strategies of fruit detection to overcome the challenge of unstructured background in field orchard environment: A review. Precis. Agric. 2023, 24, 1183–1219. [Google Scholar] [CrossRef]
  18. Siampour, H.; Nezhad, A.Z. Revealing the Invisible: Imaging Through Non-Radiating Subspace. J. Opt. Photonics Res. 2024, 1, 159–169. [Google Scholar] [CrossRef]
  19. Tang, Y.; Zhou, H.; Wang, H.; Zhang, Y. Fruit detection and positioning technology for a Camellia oleifera C. Abel orchard based on improved YOLOv4-tiny model and binocular stereo vision. Expert Syst. Appl. 2023, 211, 118573. [Google Scholar] [CrossRef]
  20. Jamal Jumaah, H.; Adnan Rashid, A.; Abdul Razzaq Saleh, S.; Jamal Jumaah, S. Deep Neural Remote Sensing and Sentinel-2 Satellite Image Processing of Kirkuk City, Iraq for Sustainable Prospective. J. Opt. Photonics Res. 2024. [Google Scholar] [CrossRef]
  21. Wang, Q.F.; Cheng, M.; Huang, S.; Cai, Z.J.; Zhang, J.L.; Yuan, H.B. A deep learning approach incorporating YOLO v5 and attention mechanisms for field real-time detection of the invasive weed Solanum rostratum Dunal seedlings. Comput. Electron. Agric. 2022, 199, 107194. [Google Scholar] [CrossRef]
  22. Jia, X.Y.; Zhao, C.J.; Zhou, J.; Wang, Q.Y.; Liang, X.T.; He, X.; Huang, W.Q.; Zhang, C. Online detection of citrus surface defects using improved YOLOv7 modeling. Trans. Chin. Soc. Agric. Eng. 2023, 39, 142–151. (In Chinese) [Google Scholar]
  23. Guo, H.P.; Cao, Y.Z.; Wang, C.S.; Rong, L.R.; Li, Y.; Wang, T.W.; Yang, F.Z. Recognition and application of apple defoliation disease based on transfer learning. Trans. Chin. Soc. Agric. Eng. 2024, 40, 184–192. (In Chinese) [Google Scholar]
  24. Su, B.F.; Liu, D.Z.; Chen, Q.F.; Han, D.J.; Wu, J.H. Method for the identification of wheat stripe rust resistance grade using time series vegetation index. Trans. Chin. Soc. Agric. Eng. 2024, 40, 160–170. (In Chinese) [Google Scholar]
  25. Tian, Y.N.; Wang, S.H.; Li, E.; Yang, G.D.; Liang, Z.Z.; Tan, M. MD-YOLO: Multiscale Dense YOLO for small target pest detection. Comput. Electron. Agric. 2023, 213, 108233. [Google Scholar] [CrossRef]
  26. Rai, N.; Zhang, Y.; Villamil, M.; Howatt, K.; Ostlie, M.; Sun, X. Agricultural weed identification in images and videos by integrating optimized deep learning architecture on an edge computing technology. Comput. Electron. Agric. 2024, 216, 108442. [Google Scholar] [CrossRef]
  27. Shi, H.K.; Xiao, W.F.; Zhu, S.P.; Li, L.B.; Zhang, J.F. CA-YOLOv5: Detection model for healthy and diseased silkworms in mixed conditions based on improved YOLOv5. Int. J. Agric. Biol. Eng. 2023, 16, 236–245. [Google Scholar] [CrossRef]
  28. Zhu, L.Q.; Li, X.M.; Sun, H.M.; Han, Y.P. Research on CBF-YOLO detection model for common soybean pests in complex environment. Comput. Electron. Agric. 2024, 216, 108515. [Google Scholar] [CrossRef]
  29. Zhang, F.; Chen, Z.Z.; Bao, R.F.; Zhang, C.C.; Wang, Z.H. Recognition of dense cherry tomatoes based on improved YOLOv4-LITE lightweight neural network. Trans. Chin. Soc. Agric. Eng. 2021, 37, 270–278. (In Chinese) [Google Scholar]
  30. Huang, J.C.; Zhao, X.D.; Gao, F.Z.; Wen, X.; Jin, S.Y.; Zhang, Y. Recognizing and detecting the strawberry at multi-stages using improved lightweight YOLOv5s. Trans. Chin. Soc. Agric. Eng. 2023, 39, 181–187. (In Chinese) [Google Scholar]
  31. Liu, M.C.; Chu, Z.Y.; Cui, M.S.; Yang, Q.L.; Wang, J.X.; Yang, H.W. Red Ripe Strawberry Recognition and Stem Detection Based on Improved YOLO v8—Pose. Trans. Chin. Soc. Agric. Mach. 2023, 54, 244–251. (In Chinese) [Google Scholar]
  32. Xie, R.L.; Zhu, Y.J.; Luo, J.; Qin, G.F.; Wang, D. Detection algorithm for bearing roller end surface defects based on improved YOLOv5n and image fusion. Meas. Sci. Technol. 2023, 34, 045402. [Google Scholar] [CrossRef]
  33. Chen, J.R.; Kao, S.H.; He, H.; Zhuo, W.P.; Wen, S.; Lee, C.H.; Gary Chan, S.H. Run, Don’t Walk: Chasing Higher FLOPS for Faster Neural Networks. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 1–15. [Google Scholar]
  34. Mehta, S.; Rastegari, M. MobileViT: Lightweight, General-purpose, and Mobile-friendly Vision Transformer. arXiv 2022, arXiv:2110.02178. [Google Scholar]
  35. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional Block Attention Module. In Computer Vision—ECCV 2018; Springer: Cham, Switzerland, 2018; pp. 3–19. [Google Scholar]
Figure 1. Three types of strawberry images.
Figure 2. Marking effect diagram.
Figure 3. Improved YOLOv5n network structure diagram. Note: FasterNet is the backbone network; Neck is a bottleneck structure. C represents the number of channels, H represents height, and W represents width.
Figure 4. MobileViT attention mechanism module.
Figure 5. Structure of C2f.
Figure 6. SPPELAN module network diagram.
Figure 7. Outdoor detection effect picture.
Figure 8. Different model training process curves.
Table 1. Results of ablation experiments.

| FasterNet | CBAM and MobileViT | Soft-NMS | SPPELAN | C2f | P/% | R/% | mAP50/% | mAP50–95/% | Model Size/MB | FPS/(Frame·s−1) |
|---|---|---|---|---|---|---|---|---|---|---|
| √ | √ | √ | √ | √ | 98.94 | 99.12 | 99.43 | 97.76 | 53.22 | 16.61 |
| √ | × | × | × | × | 86.64 | 84.87 | 89.75 | 74.77 | 59.33 | 46.08 |
| × | √ | × | × | × | 92.62 | 91.54 | 90.22 | 73.23 | 137.01 | 16.30 |
| × | × | √ | × | × | 84.08 | 87.72 | 90.17 | 85.64 | 74.54 | 25.74 |
| × | × | × | √ | × | 94.09 | 92.77 | 91.48 | 83.82 | 115.66 | 37.04 |
| × | × | × | × | √ | 92.48 | 89.16 | 88.23 | 80.12 | 86.45 | 42.82 |
Table 2. Test results of different algorithms in the test set.

| No. | Model | mAP50–95/% | mAP50/% | Computation/GFLOPs | F1/% | Model Size/MB | P/% | R/% | FPS/(Frame·s−1) |
|---|---|---|---|---|---|---|---|---|---|
| 1 | RCNN | 14.01 | 40.31 | 470.58 | 25.48 | 108.01 | 27.52 | 23.73 | 7.17 |
| 2 | YOLOv3 | 74.22 | 83.54 | 12.90 | 84.71 | 17.40 | 89.12 | 80.71 | 22.58 |
| 3 | YOLOv4 | 77.41 | 94.14 | 30.26 | 88.62 | 244.06 | 94.17 | 83.68 | 22.95 |
| 4 | YOLOv5n | 73.13 | 90.15 | 4.12 | 85.94 | 6.72 | 84.22 | 87.73 | 89.32 |
| 5 | YOLOv8 | 76.24 | 83.17 | 8.12 | 81.73 | 6.35 | 84.03 | 79.55 | 47.43 |
| 6 | YOLOv9-e | 77.34 | 83.82 | 240.73 | 81.31 | 140.03 | 79.54 | 83.16 | 10.40 |
| 7 | Improved YOLOv5n | 97.01 | 99.43 | 78.03 | 99.03 | 53.22 | 98.94 | 99.12 | 16.61 |

Share and Cite

MDPI and ACS Style

Cao, X.; Zhong, P.; Huang, Y.; Huang, M.; Huang, Z.; Zou, T.; Xing, H. Research on Lightweight Algorithm Model for Precise Recognition and Detection of Outdoor Strawberries Based on Improved YOLOv5n. Agriculture 2025, 15, 90. https://doi.org/10.3390/agriculture15010090
