Article

YOLO-SCA: A Lightweight Potato Bud Eye Detection Method Based on the Improved YOLOv5s Algorithm

1 College of Engineering, Shenyang Agricultural University, Shenyang 110866, China
2 College of Horticulture, Shenyang Agricultural University, Shenyang 110866, China
3 Key Laboratory of Facility Horticulture in Ministry of Education, Shenyang Agricultural University, Shenyang 110866, China
* Author to whom correspondence should be addressed.
Agriculture 2025, 15(19), 2066; https://doi.org/10.3390/agriculture15192066
Submission received: 16 August 2025 / Revised: 25 September 2025 / Accepted: 29 September 2025 / Published: 1 October 2025
(This article belongs to the Section Artificial Intelligence and Digital Agriculture)

Abstract

Bud eye identification is a critical step in the intelligent seed cutting process for potatoes. This study addresses the challenges of low detection accuracy and excessive weighted memory in existing potato bud eye detection models. It proposes an improved potato bud eye detection method based on YOLOv5s, referred to as the YOLO-SCA model, which synergistically optimizes three main modules. The improved model introduces the ShuffleNetV2 module to reconstruct the backbone network; its channel shuffling mechanism reduces the model’s weighted memory and computational load while enhancing bud eye features. Additionally, the CBAM attention mechanism is embedded at specific layers, using dual-path feature weighting (channel and spatial) to enhance sensitivity to key bud eye features in complex contexts. Then, the Alpha-IoU function replaces the CIoU function as the bounding box regression loss function; its single-parameter control mechanism and adaptive gradient amplification characteristics significantly improve the accuracy of bud eye positioning and strengthen the model’s anti-interference ability. Finally, we prune the network based on channel evaluation after sparse training, accurately removing redundant channels, significantly reducing computation and weighted memory, and achieving real-time performance. This study aims to address how potato bud eye detection models can achieve high-precision real-time detection under conditions of limited computational resources and storage space. The improved YOLO-SCA model has a size of 3.6 MB, which is 35.3% of the original model; its parameter count is 1.7 M, which is 25% of the original model; and its average accuracy is 95.3%, a 12.5% improvement over the original model. This study provides theoretical support for the development of potato bud eye recognition technology and intelligent cutting equipment.

1. Introduction

Potatoes (Solanum tuberosum) are one of the world’s five major staple crops and are rich in carbohydrates and protein [1]. China is the world’s largest potato producer. By 2025, China’s total potato cultivation area will reach 48 million mu (approximately 3.2 million hectares), with an annual output exceeding 130 million tons, accounting for 5.3% of the country’s total grain production and 28% of global production [2]. Planting is a key step in potato cultivation, and seed potato cutting is currently the main method of planting. The quality of the seed pieces therefore directly affects the yield and quality of the tubers [3], and improving the quality and efficiency of seed cutting has significant implications for agricultural production. Traditional manual seed cutting methods rely on experience to determine bud eye location, resulting in low efficiency and poor consistency. Meanwhile, mechanized blind cutting lacks precision, which can lead to seed potato waste and reduced germination rates. Currently, the core issues limiting the development of intelligent potato seed cutting technology are the accurate identification of bud eyes and the excessive weighted memory of existing identification technologies, which are poorly suited to agricultural machinery. Therefore, achieving high-precision bud eye identification and streamlining the detection model are key to promoting the development of intelligent potato seed cutting.
Currently, research in China focuses on improving traditional visual technologies, which are relatively mature but unable to meet real-time detection requirements. Overseas research focuses more on basic algorithms and cross-domain applications, but these algorithms are complex and require large amounts of memory, making them unsuitable for agricultural applications [4]. Wu et al. [5] proposed a method of integrating the analysis results of the grayscale space and color space of images to achieve accurate identification and localization of bud eye regions; they designed a bud eye cutting mechanism based on Euclidean distance and dynamic threshold segmentation. Yang et al. [6] developed a seed potato bud eye recognition method based on regional feature comparison, using the local binary pattern (LBP) algorithm to obtain target features and a support vector machine (SVM) to construct a classification model to complete regional segmentation. Lv et al. [7] performed the preprocessing of potato images based on Gabor features and proposed a bud eye recognition algorithm that can remove connected regions at the boundaries of seed potato images to complete recognition. Zhang et al. [8] improved YOLOv5s by introducing the CBAM attention mechanism and BiFPN feature fusion module, significantly improving bud eye detection accuracy, with an average accuracy of 95.2% and a recall rate increase of 17.5%. Targeting the characteristics of small bud eye targets, Huang et al. [9] replaced the YOLOv4 backbone network with GhostNetV2 and combined deep separable convolutions with the SIoU loss function to achieve single image detection in 0.148 s on a CPU device, with an average accuracy of 89.13%, which is 2.67 percentage points higher than the MobileNet series. Yang et al. [10] developed a compact model for detecting pests and diseases on tomato leaves based on the YOLOv8n network structure. They achieved parameter compression and reduced detection times while maintaining an average accuracy that decreased by only 0.3%. Li et al. [11] integrated YOLOv8 with the ECA attention mechanism to enhance feature extraction and introduced a bidirectional feature pyramid network to optimize multi-scale feature fusion, achieving a detection accuracy of 92.5% on mechanized sorting equipment and improving the accuracy of potato bud eye detection.
In summary, research on potato bud eye recognition based on deep learning has made some progress, but most current methods directly use the original YOLO model without specialized optimization for the unique biological and morphological characteristics of bud eyes [12]. Pruning strategies can efficiently compress the model’s weighted memory, but compression intensity is negatively correlated with detection accuracy: the greater the parameter reduction, the lower the detection accuracy. Because bud eyes are randomly distributed over the potato surface and their visual characteristics are highly similar to the surrounding cortical areas [13], target recognition is severely hindered. Furthermore, bud eyes are often located in depressed areas of the epidermis, and the shallow geometric shape of these depressions is prone to feature loss during deep convolution, resulting in the insufficient expression of effective signals during cross-level feature aggregation. The existing YOLO base model has not yet systematically addressed the aforementioned specificity issues related to potato seed tuber bud eyes and lacks targeted network structure adaptation schemes. Therefore, this study proposes a potato bud eye detection method based on an improved YOLO-SCA model structure, which not only significantly improves detection accuracy but also promotes lightweight model design, with the aim of constructing a lightweight network architecture suitable for high-precision potato bud eye recognition. It aims to further improve potato bud eye detection performance and lay the foundation for the promotion and application of intelligent potato detection equipment.

2. Materials and Methods

2.1. Test Equipment and Environment Parameter Settings

The image acquisition system used in this test consists of a 1.25-megapixel MER-125-30UC industrial camera made by China DaHeng Group, Inc. (Beijing, China), a Computar M1214-MP2 12 mm fixed-focus lens made by CBC Group of Japan (Tokyo, Japan), and a ring-shaped LED light strip made by Oukai Electronics (Huizhou) Co., Inc. (Huizhou, China), as shown in Figure 1. To eliminate the influence of external light, the test bench was sealed with a black light-shielding box. The conveyor belt had a constant speed of 40 mm/s, the camera had a shooting speed of 2 fps, Daheng Galaxy Viewer(x64) 2.0 was used as the image acquisition software, and MVTec HALCON 24.05 Progress was used as the image processing software.
The computer used in this experiment runs the Windows 10 (64-bit) operating system and is equipped with an Intel(R) Core(TM) i5-8250U processor operating at 1.60 GHz and an Intel integrated graphics card with 3.8 GB of video memory. The programming language used is Python 3.8, with PyTorch 1.9.0 as the deep learning framework, integrated with the CUDA 11.1 parallel computing framework. During training, the initial learning rate was set to 0.01, and a cosine decay strategy was applied in subsequent stages for gradual adjustment. The model parameters were optimized using the stochastic gradient descent (SGD) method, with a momentum coefficient of 0.937 and a weight decay rate of 0.0005. The training batch size for each iteration was fixed at 16, and the entire training cycle consisted of 200 epochs. This setting aims to effectively prevent overfitting and improve training efficiency. In subsequent comparative experiments, all other test conditions remained unchanged, and the resolution of the input images was uniformly adjusted to 640 × 640 pixels.
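The optimizer and learning-rate schedule described above can be expressed compactly in PyTorch. The following is a minimal illustrative sketch (not the authors’ released training code), assuming a single parameter group and a standard cosine annealing scheduler; YOLOv5’s actual implementation groups parameters and uses its own cosine lambda schedule.

```python
# Minimal sketch of the stated training configuration: SGD with momentum 0.937,
# weight decay 0.0005, initial learning rate 0.01 with cosine decay, 200 epochs.
import torch

def build_optimizer_and_scheduler(model, epochs=200, lr0=0.01):
    optimizer = torch.optim.SGD(
        model.parameters(),
        lr=lr0,             # initial learning rate
        momentum=0.937,     # momentum coefficient
        weight_decay=5e-4,  # weight decay rate
        nesterov=True,
    )
    # Cosine decay of the learning rate over the full training cycle
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
    return optimizer, scheduler

# Typical usage (one scheduler step per epoch, batch size 16, 640x640 inputs):
# optimizer, scheduler = build_optimizer_and_scheduler(model)
# for epoch in range(200):
#     train_one_epoch(model, loader, optimizer)
#     scheduler.step()
```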
This experiment employs a controlled lighting environment to eliminate interference from extraneous variables during the initial validation phase, ensuring an accurate assessment of algorithm effectiveness. This setup simulates the enclosed visual inspection unit environment commonly found in agricultural intelligent equipment, providing high-quality, consistent data for model training.

2.2. Production and Construction of Dataset

This experiment used the “Dutch 15” potato variety as the sample. The tubers were long-oval in shape, with smooth skin, clear bud eyes, and no signs of disease, pests, rot, or other adverse conditions. A total of 4165 images were obtained. The image data were screened, blurry images were removed, and 4000 images were finally selected. The initial phase of this study focused on validating the effectiveness of the algorithmic model; hence, only the representative main cultivar “Dutch 15” was used to construct the dataset. Subsequent bud eye detection tests were conducted on Eugene potatoes, with results indicating that varietal differences in potato characteristics had minimal impact on detection outcomes. Future research will incorporate additional potato varieties exhibiting diverse shapes, colors, and skin textures to comprehensively enhance the model’s versatility.
To avoid insufficient model training due to the small sample size of the seed potato data, data augmentation [14] was used to reprocess the collected seed potato dataset. The dataset was augmented by rotating, flipping, increasing exposure, changing brightness, and introducing noise; the core purpose was to enhance the model’s generalization ability and reduce overfitting [15]. The augmented sample dataset contains 10,000 images. After random shuffling, the images were divided into training and test portions at a ratio of 9:1; the training set contains 14,000 sample images, the validation set contains 4000 images, and the test set contains 2000 images. To ensure the independence of the experimental results evaluation, there is no overlap between the potato data used in the training phase and the test set images. Some of the images in the dataset are shown in Figure 2.
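The augmentation operations listed above can be composed from standard image transforms. Below is a minimal sketch using torchvision and NumPy; the parameter ranges (rotation angle, jitter strength, noise level) are assumptions, and in a detection setting the geometric transforms must also be applied to the bounding-box annotations, which is omitted here for brevity.

```python
# Sketch of the described augmentations: rotation, flipping,
# exposure/brightness changes, and additive noise on a PIL image.
import numpy as np
import torchvision.transforms as T
from PIL import Image

def add_gaussian_noise(img, sigma=8.0):
    """Add zero-mean Gaussian noise to a PIL image (simulated sensor noise)."""
    arr = np.asarray(img).astype(np.float32)
    arr += np.random.normal(0.0, sigma, arr.shape)
    return Image.fromarray(np.clip(arr, 0, 255).astype(np.uint8))

augment = T.Compose([
    T.RandomRotation(degrees=15),                 # small random rotations
    T.RandomHorizontalFlip(p=0.5),                # mirror flips
    T.ColorJitter(brightness=0.4, contrast=0.3),  # exposure/brightness changes
    T.Lambda(add_gaussian_noise),                 # additive noise
])

# augmented = augment(Image.open("seed_potato.jpg"))
```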
The image annotation tool LabelImg 1.8.6 was used to annotate potato bud eye targets, as shown in Figure 3, and to generate XML (Extensible Markup Language)-type annotation files. LabelImg is an open-source data annotation tool that can annotate three label formats: VOC, YOLO, and CreateML [16].

2.3. Selection of the Original Model

YOLOv5 (You Only Look Once version 5) is an algorithm for detecting one or more known objects in an image [17], achieving breakthrough progress in computational efficiency and multi-scale perception capabilities by reconstructing the single-stage detection paradigm. Compared with previous models in the YOLO series, YOLOv5 uses an enhanced feature pyramid and a multi-scale feature fusion mechanism [18] to improve the detection capabilities of small-scale objects. It uses CSP structure gradient diversion and SPPF serial pooling structure [19] to reduce the overall framework’s weighted memory. Through adaptive anchor frame calculation strategies and calculation graph operator fusion, it improves average accuracy and detection speed.
The YOLOv5 model is divided into four benchmark variants based on the differentiated configuration of the depth multiple and width factor multiple: the lightweight version YOLOv5s, the balanced version YOLOv5m, the high-performance version YOLOv5l, and the extreme accuracy version YOLOv5x.
In order to select the optimal model, this study systematically evaluated the performance of different versions of the YOLOv5 series on this dataset. Key indicators included mean average precision (mAP), model parameter count, floating-point operations (FLOPs), frames per second (FPS), and the model’s weighted memory. The comparison results are shown in Table 1.
Analyzing the experimental data, we can see that, among the four tested versions, although the YOLOv5s model has slightly lower average accuracy than the other three, its significant advantages are that it has the lowest number of parameters, the smallest weighted memory, and the lowest FLOPs requirements, while achieving the most efficient real-time feedback, which is crucial for detection in the agricultural field. Given the need to balance model complexity and detection accuracy, especially when deploying on resource-constrained embedded platforms, the model’s weighted memory becomes a key consideration. Compared with the YOLOv5x version, which has the best detection performance, YOLOv5s shows a slight decrease in accuracy, but its FLOPs, parameter count, and weighted memory are greatly reduced, while the feedback speed is significantly improved. Based on the above performance trade-offs and actual application requirements, we finally opted to use the YOLOv5s model as the base architecture for subsequent algorithm optimization work.

2.4. Summary of Improved Methods Based on YOLOv5s

When training the model using the potato bud eye dataset, it was found that the original YOLOv5s algorithm had issues with excessive memory consumption and poor recognition performance. Therefore, three improvements were made to the YOLOv5s model: replacing the original model’s backbone network with the ShuffleNetv2 module; embedding the CBAM attention mechanism in the C3 layer; and using the Alpha-IoU function as a new bounding box regression loss function in the loss calculation module. This reduced the model’s weighted memory and enhanced the bud eye features, improving the accuracy of bud eye localization. Additionally, L1 constraints are applied to the BN layers in the YOLOv5s backbone network, and structured channel pruning is performed on non-fused channels to remove redundant convolutional kernels, thereby accelerating the inference process and achieving model lightweighting. The improved YOLO-SCA model architecture is shown in Figure 4.
Here, CBRM refers to a standard convolution module, the operation sequence of which is convolution (Conv), batch normalization (BN), rectified linear unit activation (ReLU), and max pooling. SN_Block_X represents a network unit, which is composed of a downsampling module and X repeated basic units from ShuffleNetV2 cascaded together. The optimized detection model’s main architecture consists of a CSPDarknet53 backbone network, an FPN-PAN fusion neck, and a multi-scale detection head, forming the core computational graph. The neck network integrates a Bidirectional Feature Pyramid Network (BiFPN), which transmits high-level semantic information through a top-down path (Feature Pyramid Network, FPN) and fuses low-level spatial features through a bottom-up path (Path Aggregation Network, PAN) [20]:
$P_l^{out} = \mathrm{upsample}_{2\times}\left(P_{l+1}^{in}\right) + \mathrm{conv}_l\left(P_l^{in}\right)$
$N_l^{out} = \mathrm{downsample}_{2\times}\left(N_{l+1}^{in}\right) + \mathrm{conv}_l\left(P_l^{in}\right)$
In the formula, $\mathrm{upsample}_{2\times}(\cdot)$ represents upsampling, and $\mathrm{downsample}_{2\times}(\cdot)$ represents downsampling. This structure enables the bidirectional transmission of cross-level semantic-location information, generating three-scale feature maps (P3, P4, P5, with resolutions of 80 × 80, 40 × 40, and 20 × 20, respectively), which are suitable for the detection of small, medium, and large targets.
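A simplified interpretation of these two equations is shown below. This is a sketch, not the exact YOLOv5 neck (which concatenates features and passes them through C3 blocks rather than adding them); it assumes the three input maps have already been projected to a common channel count, and it reproduces only the top-down/bottom-up additive fusion pattern.

```python
# Sketch of additive top-down (FPN) and bottom-up (PAN) fusion across P3/P4/P5.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleFPNPAN(nn.Module):
    def __init__(self, channels=128):
        super().__init__()
        self.lateral = nn.ModuleList(nn.Conv2d(channels, channels, 1) for _ in range(3))

    def forward(self, p3, p4, p5):
        # Top-down path: upsample deeper maps and add lateral features
        t5 = self.lateral[2](p5)
        t4 = F.interpolate(t5, scale_factor=2, mode="nearest") + self.lateral[1](p4)
        t3 = F.interpolate(t4, scale_factor=2, mode="nearest") + self.lateral[0](p3)
        # Bottom-up path: downsample shallower maps and add again
        n3 = t3
        n4 = F.max_pool2d(n3, 2) + t4
        n5 = F.max_pool2d(n4, 2) + t5
        return n3, n4, n5   # 80x80, 40x40, 20x20 for a 640x640 input

# feats = SimpleFPNPAN()(torch.randn(1, 128, 80, 80),
#                        torch.randn(1, 128, 40, 40),
#                        torch.randn(1, 128, 20, 20))
```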
Category prediction uses cross-entropy improved by Focal Loss:
$L_{cls} = -\sum_{i=1}^{C} (1 - p_i)^{\gamma} \, y_i \log(p_i)$
where the modulating factor $(1 - p_i)^{\gamma}$, with γ > 0, increases the weight of difficult samples. γ is the focusing parameter (default value 1.5), which is used to reduce the weight of easily classified samples. The post-processing stage uses Weighted Non-Maximum Suppression to screen the final detection box through a joint decision of confidence and overlap [21].
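The focal weighting can be sketched as follows. This is a minimal illustration of the modulating factor with γ = 1.5 as stated above, written as the common sigmoid-based binary variant that also penalizes negative classes; the exact multi-class handling inside YOLOv5 is an assumption here.

```python
# Sketch of focal-loss-weighted classification: (1 - p_t)^gamma down-weights
# easy samples and up-weights hard ones.
import torch

def focal_cls_loss(pred_logits, targets, gamma=1.5, eps=1e-7):
    """pred_logits, targets: tensors of shape (N, C); targets are 0/1 labels."""
    p = torch.sigmoid(pred_logits)
    # probability assigned to the true outcome of each entry
    p_t = torch.where(targets == 1, p, 1 - p)
    weight = (1 - p_t).pow(gamma)          # focal modulating factor
    return -(weight * torch.log(p_t.clamp(min=eps))).sum(dim=1).mean()
```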
The core objective of this optimization is to resolve the conflict between high accuracy, high speed, and small size, with the aim of developing a model that is more suitable for potato seed bud eye detection. Subsequent experiments have demonstrated that the SCA three-component collaborative architecture effectively resolves the inherent conflict between model lightweighting and accuracy reduction. The ShuffleNetv2 trunk weakens feature extraction capabilities, but the CBAM mechanism compensates for the loss of accuracy, and the Alpha-IoU loss function further enhances positioning robustness. Ultimately, the three-module integrated YOLO-SCA model achieves a triple balance: lightweight weighted memory, high-precision detection results, and real-time detection process.

2.5. An Improved Potato Bud Eye Detection Model Based on YOLOv5s

2.5.1. ShuffleNetv2

The YOLOv5s framework uses the CSPDarknet53 architecture as the core backbone network for visual feature extraction. It enhances feature extraction capabilities through a hierarchical stack of deep convolutional units, but at the same time, it increases model complexity, leading to parameter inflation, increased computational complexity, and difficulties in terminal device deployment. Therefore, ShuffleNetv2 [22] is introduced into the original backbone network to reduce the model’s weighted memory and computational load.
ShuffleNetv2 is a lightweight backbone network architecture; its specific calculation mechanism is shown in Figure 5 [23]. It was proposed by the Megvii team in 2018, and its design strictly follows the hardware-aware G1–G4 guidelines [24] (memory access cost optimization, group convolution efficiency improvement, network fragmentation suppression, and element-wise operation simplification), achieving the synergistic optimization of accuracy and speed on mobile devices. The network reconstructs the feature extraction process through the core mechanisms of channel splitting, dual-branch heterogeneous processing, and channel mixing: after the input feature map is evenly divided along the channel dimension, the identity branch retains the original spatial information, while the convolution branch extracts high-order features through a cascade of “1 × 1 standard convolution → 3 × 3 depthwise separable convolution → 1 × 1 standard convolution.” Depthwise separable convolution decouples standard convolution into two independent operations, depthwise convolution and pointwise convolution, significantly reducing computational complexity from $O(H \cdot W \cdot C_{in} \cdot C_{out} \cdot K^2)$ for traditional convolution to $O(H \cdot W \cdot C_{in} \cdot K^2) + O(H \cdot W \cdot C_{in} \cdot C_{out})$. Channel mixing operations force cross-channel information exchange through periodic permutations, effectively resolving confusion between similar features such as potato bud eyes and skin blemishes. This design is particularly suitable for agricultural vision detection scenarios, where low-contrast features between bud eyes and the background require efficient discrimination.
To meet the multi-scale detection requirements for potato bud eyes, the subsampling unit uses a symmetrical dual-channel structure to enhance feature extraction capabilities. After removing channel segmentation, the dual branches perform parallel 3 × 3 depth convolution with a stride of 2 and 1 × 1 standard convolution to achieve spatial dimension reduction and channel expansion. The output features are spliced and then blended through channel mixing to fuse multi-level information. This structure only requires 146 M FLOPs to achieve 72.6% Top-1 accuracy in ImageNet pre-training, and its memory access optimization feature (G1 rule) significantly reduces resource consumption when deployed on embedded devices. When integrated into the YOLOv5s framework to replace the CSPDarknet53 backbone, the model’s parameter count is compressed to 28% of the original structure, with inference speed improved by 3.2 times, providing a hardware-compatible foundation for real-time bud detection.
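The channel split, dual-branch processing, and channel shuffle described above can be sketched as a single ShuffleNetV2 basic unit. The layer widths below are illustrative rather than the exact YOLO-SCA layout, and the downsampling unit (stride-2 dual branches without channel split) is omitted for brevity.

```python
# Sketch of a ShuffleNetV2 basic unit: channel split, identity branch,
# 1x1 conv -> 3x3 depthwise conv -> 1x1 conv branch, concat, channel shuffle.
import torch
import torch.nn as nn

def channel_shuffle(x, groups=2):
    n, c, h, w = x.size()
    x = x.view(n, groups, c // groups, h, w)  # group the channels
    x = x.transpose(1, 2).contiguous()        # permute across groups
    return x.view(n, c, h, w)                 # mixed channel order

class ShuffleV2Unit(nn.Module):
    def __init__(self, channels):
        super().__init__()
        c = channels // 2
        self.branch = nn.Sequential(
            nn.Conv2d(c, c, 1, bias=False), nn.BatchNorm2d(c), nn.ReLU(inplace=True),
            nn.Conv2d(c, c, 3, padding=1, groups=c, bias=False),  # 3x3 depthwise
            nn.BatchNorm2d(c),
            nn.Conv2d(c, c, 1, bias=False), nn.BatchNorm2d(c), nn.ReLU(inplace=True),
        )

    def forward(self, x):
        x1, x2 = x.chunk(2, dim=1)               # channel split
        out = torch.cat((x1, self.branch(x2)), dim=1)
        return channel_shuffle(out, groups=2)    # cross-group information exchange

# y = ShuffleV2Unit(116)(torch.randn(1, 116, 40, 40))
```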

2.5.2. The CBAM Attention Mechanism

To compensate for the possible decrease in accuracy caused by lightweight operations, embedding the CBAM attention mechanism at a specific level of the backbone network helps to better adapt to the distribution characteristics of specific input data and ultimately enhances the robustness of feature detection in the model under variable image scales. CBAM [25] (Convolutional Block Attention Module) is a lightweight attention module that can enhance the expressiveness of convolutional neural networks through both channel and spatial dimensions [26]. Its core lies in combining the Channel Attention Module (CAM) and Spatial Attention Module (SAM) to enhance the effectiveness of feature images. The specific mechanism is shown in Figure 6. Among them, CAM can establish dependencies between channels and generate corresponding weight vectors to emphasize important channel characteristics; SAM can capture pixel-level correlation patterns, enhance the information expression ability of the target location, and switch the scope of CAM from “channel” to “space”.
In the CAM module, global information is extracted from the input feature map $F \in \mathbb{R}^{C \times H \times W}$ (C is the number of channels, H is the height, and W is the width) through global average pooling to obtain the global feature description $F_{avg} \in \mathbb{R}^{C \times 1 \times 1}$. At the same time, the input feature map $F$ is max-pooled in the spatial dimension to obtain another global feature description $F_{max} \in \mathbb{R}^{C \times 1 \times 1}$. The two pooling results are then input into a shared multilayer perceptron (MLP) network to generate two channel attention vectors, which are added together and normalized using the sigmoid function to generate the final channel attention weight $M_c \in \mathbb{R}^{C \times 1 \times 1}$. Finally, the original feature map $F$ is multiplied element-wise by the channel attention weight $M_c$ to obtain the weighted feature map $F'$. The specific formulas are as follows:
$F_{avg}(c) = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} F(c, i, j)$
$F_{max}(c) = \max_{1 \le i \le H} \max_{1 \le j \le W} F(c, i, j)$
$MLP(x) = W_1 \cdot \mathrm{ReLU}(W_0 x)$
$M_c = \sigma\left(MLP(F_{avg}) + MLP(F_{max})\right)$
$F' = M_c \otimes F$
In the formulas, $F_{avg}(c)$ represents the average pooling result of the c-th channel; $F_{max}(c)$ represents the maximum pooling result of the c-th channel; $W_0$ and $W_1$ are learnable weight matrices; ReLU is a nonlinear activation function; $\sigma$ is the sigmoid function, ensuring that the output is between 0 and 1; and $\otimes$ represents element-wise multiplication.
In the SAM module, the channel-refined feature map $F' \in \mathbb{R}^{C \times H \times W}$ is average-pooled over the channel dimension to obtain a spatial feature map $F_{avg}^{spatial} \in \mathbb{R}^{1 \times H \times W}$. At the same time, $F'$ is max-pooled over the channel dimension to obtain another spatial feature map $F_{max}^{spatial} \in \mathbb{R}^{1 \times H \times W}$. The channel average pooling map and the channel max pooling map are then concatenated along the channel dimension to obtain a two-channel feature map $F_{concat}^{spatial} \in \mathbb{R}^{2 \times H \times W}$. This concatenated feature map is convolved with a 7 × 7 convolution kernel to obtain the spatial attention map $M_s \in \mathbb{R}^{1 \times H \times W}$. Finally, the feature map $F'$ is multiplied element-wise with the spatial attention weights $M_s$ to obtain the weighted feature map $F''$.
The overall processing flow of the CBAM mechanism is as follows: the input feature map $F$ passes through the channel attention module to generate channel attention weights $M_c$ and undergoes weighted processing, yielding the enhanced feature map $F'$; this map is then passed to the spatial attention module to generate spatial attention weights $M_s$ and undergoes weighted processing, yielding the final output feature map $F''$. The processing flow is expressed as follows:
$F'' = M_s\left(M_c(F) \otimes F\right) \otimes \left(M_c(F) \otimes F\right)$
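The equations above translate into a compact module. The following is a sketch of CBAM, assuming a channel-reduction ratio of 16 in the shared MLP (implemented here with 1 × 1 convolutions, which is functionally equivalent to fully connected layers on pooled descriptors); it is illustrative, not the exact embedding used in YOLO-SCA.

```python
# Sketch of CBAM: channel attention (shared MLP over average- and max-pooled
# descriptors) followed by spatial attention (7x7 conv over pooled channel maps).
import torch
import torch.nn as nn

class CBAM(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(              # shared MLP: W1 * ReLU(W0 * x)
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1, bias=False),
        )
        self.spatial = nn.Conv2d(2, 1, kernel_size=7, padding=3, bias=False)

    def forward(self, f):
        # Channel attention: Mc = sigmoid(MLP(F_avg) + MLP(F_max))
        avg = f.mean(dim=(2, 3), keepdim=True)
        mx = f.amax(dim=(2, 3), keepdim=True)
        mc = torch.sigmoid(self.mlp(avg) + self.mlp(mx))
        f_c = mc * f
        # Spatial attention: Ms = sigmoid(conv7x7([avg_c ; max_c]))
        s_avg = f_c.mean(dim=1, keepdim=True)
        s_max = f_c.amax(dim=1, keepdim=True)
        ms = torch.sigmoid(self.spatial(torch.cat((s_avg, s_max), dim=1)))
        return ms * f_c

# y = CBAM(256)(torch.randn(1, 256, 40, 40))
```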
The CBAM mechanism significantly enhances the sensitivity of the original model to key features and improves the focus on potato bud eyes by enhancing the effective features of the image and suppressing the ineffective features. This mechanism can effectively avoid the impact of adding the ShuffleNetv2 module on channel attention learning by suppressing background noise and focusing on bud eye morphological features.
In summary, the advantages of introducing the CBAM mechanism are that it improves the feature expression ability of the attention model by establishing two dimensions of channel and space; its design complies with lightweight standards and does not increase the model’s computing load and parameter count; and it can be inserted into any deep convolutional neural network and applied directly without complex modifications to the model.

2.5.3. Alpha-IoU Loss Function

To improve the accuracy of potato bud eye detection, the CIoU loss function in the original model was replaced with the Alpha-IoU loss function. IoU (Intersection over Union) is a standard for measuring the accuracy of detecting corresponding objects in a specific dataset [27]. It is the most widely used localization metric, and all tasks that produce a prediction range (bounding boxes) in the output can be evaluated using IoU. As a key component of the model’s performance optimization, the IoU loss function plays a decisive role in improving localization accuracy and is widely adopted to quantify the spatial overlap between predicted bounding boxes and ground-truth annotated boxes [28,29]. The specific calculation mechanism is illustrated in Figure 7. In object detection tasks, the IoU is the ratio of the intersection to the union of the predicted box and the ground-truth box.
Here, the green box represents the ground truth, $\tilde{x} = (\tilde{x}_t, \tilde{x}_b, \tilde{x}_l, \tilde{x}_r)$, and the blue box represents the prediction, $x = (x_t, x_b, x_l, x_r)$.
The YOLOv5s model originally used CIoU (Complete-IoU) as the bounding box regression loss function [30]. However, under extreme cases of bounding box overlap (when IoU approaches 1 or 0), the CIoU loss function faces the risk of mathematical expression failure: its denominator approaches 0, or the numerator and denominator become approximately equal, leading to distortion in the quantitative representation of positional deviation. This computational instability directly weakens the model’s sensitivity to localization errors. The specific calculation formula is as follows:
$L_{CIoU} = 1 - IoU(A, B) + \frac{\rho^2(A_{ctr}, B_{ctr})}{c^2} + \alpha \upsilon$
$\upsilon = \frac{4}{\pi^2} \left( \arctan\frac{\omega^{gt}}{h^{gt}} - \arctan\frac{\omega}{h} \right)^2$
$\alpha = \frac{\upsilon}{(1 - IoU) + \upsilon}$
In the formula, A and B represent the two boxes; $A_{ctr}$ and $B_{ctr}$ represent the center points of A and B, respectively; $\rho(\cdot)$ denotes the Euclidean distance between the center points; c is the diagonal length of the smallest box enclosing both A and B; and $\omega^{gt}$, $h^{gt}$ and $\omega$, $h$ are the width and height of the ground-truth and predicted boxes, respectively.
The Alpha-IoU loss function is a breakthrough in the field of bounding box regression methods for object detection [31]. Its core mechanism reconstructs the error measurement through an exponential adjustment factor α, performing a power operation on the IoU loss family. The specific calculation formula is as follows:
$L_{\alpha\text{-}IoU} = 1 - IoU^{\alpha}$
When α > 1, the gradient function $\mathrm{grad}\,L_{\alpha\text{-}IoU} = \alpha \cdot IoU^{\alpha - 1} \cdot \mathrm{grad}\,IoU$ exhibits a key characteristic: in low IoU regions (where IoU approaches 0), the gradient amplification factor increases sharply, resolving the gradient vanishing issue in traditional IoU; in high IoU regions (where IoU approaches 1), the gradient stabilizes and converges to the value of α, ensuring a smooth optimization process. This property is validated by the second derivative $\frac{\partial^2 L_{\alpha\text{-}IoU}}{\partial IoU^2} = \alpha(\alpha - 1) IoU^{\alpha - 2}$: when α > 1, the function exhibits convexity, guaranteeing convergence to the global optimum. Through experimentation, it was found that, in most cases, setting α = 3 yields the best results.
Compared to the mainstream CIoU loss function, Alpha-IoU has three significant advantages: first, a single-parameter α control mechanism replaces complex penalty terms, keeping the computational complexity at the O(1) level; second, its adaptive gradient amplification characteristic (when IoU < 0.3, the gradient is increased by 10 to 100 times) strengthens the ability to learn difficult samples; finally, its deployment flexibility meets the needs of different scenarios. Additionally, the core of the Alpha-IoU loss function lies in the integration of the adjustable parameter α, which supports dynamic calibration to adapt to diverse training objectives. Compared to existing IoU improvement methods, this loss function demonstrates superior overall performance in bounding box regression tasks, specifically manifested in systematic improvements in localization accuracy and significantly enhanced model robustness against interference.
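The loss $L = 1 - IoU^{\alpha}$ with α = 3 is straightforward to implement. The sketch below assumes axis-aligned boxes in (x1, y1, x2, y2) corner format and omits the distance/aspect penalty terms that the full Alpha-CIoU variant can additionally carry.

```python
# Minimal sketch of the Alpha-IoU bounding-box regression loss, L = 1 - IoU^alpha.
import torch

def alpha_iou_loss(pred, target, alpha=3.0, eps=1e-7):
    """pred, target: tensors of shape (N, 4) holding corner coordinates."""
    # Intersection rectangle
    x1 = torch.max(pred[:, 0], target[:, 0])
    y1 = torch.max(pred[:, 1], target[:, 1])
    x2 = torch.min(pred[:, 2], target[:, 2])
    y2 = torch.min(pred[:, 3], target[:, 3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + eps)
    # Power operation on the IoU loss family; larger alpha emphasizes hard samples
    return (1.0 - iou.pow(alpha)).mean()
```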

2.5.4. Model Pruning

To improve detection efficiency and reduce storage requirements, this study employs structured channel pruning techniques to lightweight the backbone module of the YOLOv5s detection network [32]. The feature fusion operations in YOLOv5s include two forms: Shortcut and Concat. Since the Shortcut operation requires the strict alignment of input and output channels, all convolutional layers involving this operation are excluded from the pruning scope; the Concat operation, however, is unaffected by pruning due to its channel-dimension independence. Additionally, to maintain the identity mapping capability of the residual structure, all residual modules retain their complete structure. The pruning process in this structure primarily includes three stages: sparsification training, channel pruning, and fine-tuning [33]. The specific pruning workflow is shown in Figure 8.
At the initial stage of pruning, the network parameters need to be reset and L1 regularization introduced to constrain the scaling factor γ of the batch normalization (BN) layer, thereby inducing a sparse distribution. After completing the sparsification training, the probability distribution characteristics of γ values in the network need to be calculated, and channels with lower importance need to be removed based on a preset pruning rate threshold. An iterative progressive strategy is used, and the specific calculation process is as follows:
$L = \sum_{(x, y)} l(f(x, W), y) + \lambda \sum_{\gamma \in \Gamma} g(\gamma)$
In the formula, (x, y) represents the input and target items; W represents the network weights; $\sum_{(x, y)} l(f(x, W), y)$ represents the detection loss function; $\lambda \sum_{\gamma \in \Gamma} g(\gamma)$ represents the regularization constraint on the scaling factor γ of the BN layer; and λ is the sparse regularization coefficient, for which $\lambda = 10^{-5}$ is selected for pruning in this study [34].
$\hat{z} = \frac{z_i - \mu_\beta}{\sqrt{\sigma_\beta^2 + \varepsilon}}$
$z_o = \alpha \hat{z} + \beta$
In the formula, $z_i$ and $z_o$ are the input and output of the BN layer, respectively; $\mu_\beta$ and $\sigma_\beta^2$ are the mini-batch mean and variance; and α and β are the two trainable parameters of the BN layer.
After completing sparsification training, the channels are ranked based on the absolute value of the scaling factor γ coefficient of the BN layer, and channels with low contributions whose γ values are close to zero are removed from the BN layer, thereby efficiently achieving model lightweighting [35]. Additionally, the decrease in model detection accuracy caused by pruning operations needs to be compensated for through fine-tuning. A warm-start is performed using the weight file generated by channel pruning, reusing the benchmark dataset and hyperparameter configuration, and implementing 100 rounds of compensation training to restore the model’s representation capability.
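The sparsification penalty and the γ-based channel selection described above can be sketched as follows. This is an illustrative outline, assuming the L1 penalty is simply added to the detection loss with λ = 10⁻⁵ and that a single global threshold is applied; which layers are actually prunable (e.g., excluding Shortcut-coupled convolutions, as discussed above) is model-specific and omitted here.

```python
# Sketch of BN-gamma sparsification and channel selection for structured pruning.
import torch
import torch.nn as nn

def bn_l1_penalty(model, lam=1e-5):
    """L1 regularization term on all BN scaling factors (gamma)."""
    return lam * sum(m.weight.abs().sum()
                     for m in model.modules() if isinstance(m, nn.BatchNorm2d))

def gamma_threshold(model, prune_ratio=0.5):
    """Global |gamma| threshold; channels below it are candidates for removal."""
    gammas = torch.cat([m.weight.detach().abs().flatten()
                        for m in model.modules() if isinstance(m, nn.BatchNorm2d)])
    return torch.quantile(gammas, prune_ratio)

# During sparsification training:
#   loss = detection_loss + bn_l1_penalty(model)   # then backpropagate as usual
# After training, remove channels whose |gamma| falls below
# gamma_threshold(model, prune_ratio), then fine-tune (~100 epochs) to recover accuracy.
```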

3. Results and Analysis

3.1. Evaluation Index

The classification recognition model in this experiment was evaluated using three main parameters: precision (P), recall (R), and mean average precision (mAP) [36], combined with the harmonic mean F1 and average precision (AP) as references. Precision refers to the proportion of samples correctly predicted as positive out of the total number of samples predicted as positive by the model. Recall refers to the proportion of actual positive samples that are accurately identified, i.e., the ratio of correctly predicted positive samples to the total number of true positive samples. Mean Average Precision refers to the arithmetic mean of the AP values for all target categories. The harmonic mean F1 is the harmonic mean of precision and recall, and values closer to 1 indicate better model performance. Average precision is a comprehensive measure of precision and recall, visually represented by the area under the precision–recall curve, and is used to evaluate the model’s overall performance on a specific category. The specific calculation formulas are as follows:
$P = \frac{TP}{TP + FP} \times 100\%$
$R = \frac{TP}{TP + FN} \times 100\%$
$F1 = \frac{2 \times P \times R}{P + R}$
$mAP = \frac{\sum_{i=1}^{N} AP_i}{N}$
In the formula, TP (True Positive) refers to correctly detected samples where both the predicted label and the true label are bud eyes; FP (False Positive) refers to incorrect detection results where non-bud eye samples are mistakenly identified as bud eyes; FN (False Negative) refers to missed detections where true bud eye samples are not detected by the model. N represents the total number of target categories, which is set to N = 1 in this study, i.e., mAP = AP.
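A small worked example of these definitions is given below, using made-up counts purely for illustration (they are not results from this study).

```python
# Worked example of the precision / recall / F1 definitions above.
def precision_recall_f1(tp, fp, fn):
    p = tp / (tp + fp)          # fraction of predicted bud eyes that are correct
    r = tp / (tp + fn)          # fraction of true bud eyes that were found
    f1 = 2 * p * r / (p + r)    # harmonic mean of precision and recall
    return p, r, f1

# Hypothetical counts: 900 correct detections, 60 false alarms, 80 misses
p, r, f1 = precision_recall_f1(tp=900, fp=60, fn=80)
print(f"P={p:.3f}, R={r:.3f}, F1={f1:.3f}")   # P=0.938, R=0.918, F1=0.928
```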
To assess the statistical significance of the experimental results, this study conducted five independent training runs for all key experiments using different random seeds. All reported performance metrics represent the mean ± standard deviation of the five runs. Independent-samples t-tests were employed to evaluate the significance of performance differences between models, with p < 0.05 indicating statistical significance.

3.2. Comparison of Lightweight Backbone Networks

To validate the superiority of the ShuffleNetV2 module, it was compared with two lightweight backbones: EfficientNet-Lite and NanoDet. The experimental results are shown in Table 2. The lightweight backbone networks have reduced model depth due to the reduction in the number of convolutional layers, resulting in a slight weakening of feature representation capabilities. The mAP@0.5, mAP@0.75, and mAP@0.5:0.95 metrics of all lightweight models are lower than those of the original YOLOv5s baseline, but the decrease is always controlled within the 10% threshold.
The experimental results show that the backbone network built with ShuffleNetv2 blocks performs best in terms of the balance between lightweight design and accuracy. Its parameter count is 0.68 M, a sharp decrease of 90% compared to the original model, and the model size is 1.7 MB, compressed to 16.7% of the original structure, which is better than other lightweight backbone networks of the same level.
ShuffleNetv2’s lightweight advantages and feature enhancement capabilities have been fully verified in potato bud eye detection tasks. The channel mixing mechanism improves the accuracy of distinguishing bud morphology (micro-convex structures with a diameter of 0.5–2 mm) from skin defects by 12.7% through forced cross-group information exchange, and reduces the computing load of depth-separable convolutions by 83.4%, enabling the model to run in real time at 26 fps on edge devices such as Jetson Nano. This confirms that introducing ShuffleNetv2 into the backbone network of YOLOv5s balances efficiency and discriminative power in agricultural product visual inspection.
Figure 9 visually shows the comprehensive advantages of the YOLO-SCA model in both accuracy and lightweight design. The bars represent the mAP@0.5 (left axis), showing detection accuracy. The line with points represents the parameter count (right axis), indicating model size and complexity. While the accuracy of YOLOv5s, YOLOv5m, YOLOv5l, and YOLOv5x increases progressively, the accuracy of the three lightweight models decreases. Although YOLOv5s-ShuffleNetV2 does not achieve the highest accuracy, it significantly outperforms other lightweight models while its corresponding parameter inflection point approaches the lowest value. This clearly proves that ShuffleNetV2 strikes the optimal balance between performance and efficiency.

3.3. Performance Comparison of Various Attention Mechanisms

In order to study the optimization effect of the CBAM mechanism on the YOLOv5s model, four other attention mechanisms were introduced in this experiment for comparative analysis, namely, YOLOv5s-GlobalContext, YOLOv5s-Axial Attention, YOLOv5s-SimAM, and YOLOv5s-Triplet Attention. The experimental results are shown in Table 3.
Comparative analysis reveals that the GC mechanism effectively improves recall through lightweight global pooling, but at the cost of increased computational complexity; the AA mechanism enhances accuracy by decoupling spatial dimension computations, but it sacrifices local detail information, resulting in significant fluctuations in detection accuracy for small objects. SimAM optimizes spatial modeling at minimal computational cost to improve efficiency, but it is less robust in complex backgrounds (such as soil impurity interference). The Triplet mechanism improves feature interaction capabilities through triple attention, but it has a large computational load. In summary, the CBAM mechanism has the best overall effect. With a lightweight design that increases the number of parameters by only 0.02 M, it significantly improves cross-channel feature interaction and achieves a synergistic improvement in accuracy and speed over the original YOLOv5s: mAP is improved to 99.4%, and precision is improved to 97.6%. While maintaining a computational load of 18.6 GFLOPs, it effectively balances feature representation capability and computational efficiency, providing the optimal solution for lightweight detection models.

3.4. Ablation Experiments

In order to quantitatively evaluate the independent contributions and synergistic effects of the ShuffleNetv2 lightweight backbone (S), CBAM mechanism (C), and Alpha-IoU loss function (A) on the original YOLOv5s network model, seven sets of ablation tests were conducted to verify the effectiveness of the YOLO-SCA model optimization. The test results are shown in Table 4.
The test results show that the CBAM mechanism contributes most significantly to the overall performance improvement of the original model, improving precision and recall by 0.21% and 1.7%, respectively; its adaptive channel interaction mechanism effectively enhances feature discrimination capabilities. The ShuffleNetv2 lightweight backbone achieves a 92% reduction in parameter count. The Alpha-IoU loss function optimizes bounding box regression, improving recall by 0.8% and enhancing localization accuracy.
Figure 10 shows the performance of different model configurations across key indexes. The YOLO-SCA model covers a larger area, demonstrating the comprehensive performance improvement achieved through the synergistic integration of three components.

3.5. Comparative Experiments Based on Improvements to Different Original Models

Several studies have already attempted to optimize different original models. To scientifically evaluate the adaptability and optimization effects of lightweight improvement strategies across various YOLO architectures, this study applied the same improvement operations to four foundational models: YOLOv3, YOLOv4, YOLOv5, and YOLOv8. Comparative experiments were conducted using the same seed potato dataset and testing environment, with the results presented in Table 5. This experiment aims to observe the detection performance of the same improvement measures on different base models and to quantitatively analyze the differences in the balance between accuracy and efficiency of the improved models, thereby evaluating the feasibility of different schemes for application in agricultural detection fields.
Analysis of Table 5 shows that, although the improved version based on YOLOv8 performs best on mAP, its 3.2 M parameters and 6.3 MB model size exceed the carrying capacity of most agricultural machinery. In contrast, the improved version based on YOLOv5 achieved real-time performance with only 1.7 M parameters and 3.6 MB of weighted memory, which is sufficient to meet the accuracy and efficiency requirements of potato bud detection. Therefore, YOLOv5 was selected as the base model for improvement.
Among these, the YOLO-SCA model exhibits slightly reduced robustness compared to the YOLOv5x model when dealing with bud eyes occupying an extremely small proportion of the image (typically less than 0.1% of the image area). This stems primarily from two factors: first, structural trade-offs. The lightweight backbone network inevitably possesses weaker deep feature extraction capabilities compared to the original architecture, whereas small object detection relies on deep, high-resolution features. Second, information loss due to downsampling. While enhancing efficiency, downsampling operations within the network inevitably cause gradual loss of fine spatial details for small objects, posing challenges for precise localization of minute bud eyes. This limitation implies a slight increase in false negative rates when detecting smaller bud eyes. Therefore, in practical deployment, adjusting the shooting distance to ensure the potato bud eye target occupies the main portion of the image can effectively mitigate this issue. Subsequently, this study will introduce a small object detection layer. A dedicated detection head will be added at a shallower layer within the FPN-PAN architecture to capture small objects using richer low-level detail information. Additionally, ASFF will be employed, enabling the model to automatically learn and fuse features across different scales, thereby enhancing feature representation for small objects.
Figure 11 shows the bud detection results on Dutch 15 potato varieties using models enhanced based on different base models. To investigate whether the bud detection results of this model exhibit variability across different potato varieties, additional bud detection was performed on Eugene variety potatoes. The results are shown in Figure 12, which demonstrate that potato variety does not affect the detection results.

3.6. Ablation Experiment for the α Parameter

To assess the sensitivity of the hyperparameter α in Alpha-IoU, this study systematically compared the impact of different α values (α = 2, 2.5, 3, 3.5, 4) on the performance of the YOLO-SCA model using the potato bud eye dataset.
The results in Table 6 indicate that, when α = 3, the model achieves optimal performance on all three metrics: mAP@0.5, mAP@0.75, and mAP@0.5:0.95. Both excessively low and excessively high α values lead to slight performance degradation, making α = 3 the optimal choice in this study.

4. Discussion

4.1. Limitations of Deep Learning Models

The primary comparison baselines in this study are currently limited to the YOLO series models and their lightweight variants. This selection is based on the following considerations: (1) the YOLO architecture is currently the most widely adopted framework in real-time object detection, aligning closely with this study’s lightweighting objectives; (2) compared to two-stage detectors (e.g., Faster R-CNN), keypoint-based detectors (e.g., CenterNet), or Transformer-based detectors (e.g., DETR), YOLO strikes a balance between accuracy and speed that better suits the practical demands of embedded deployment in agricultural machinery.
Although this study did not directly train and evaluate models such as Faster R-CNN and SSD, a comparative analysis can be conducted based on the existing literature. Zhang et al. [36] noted that YOLOv5s significantly outperformed Faster R-CNN in detection speed for the potato bud detection task, while achieving comparable accuracy. Lightweight YOLO variants can achieve real-time detection speeds on CPUs, a feat difficult for two-stage models to match. Therefore, the YOLO-based improvement proposed in this study demonstrates its clear potential advantages over other model types in the comprehensive evaluation dimension of “accuracy-speed-model size.” In subsequent research, we plan to conduct more comprehensive comparative experiments. We will benchmark the YOLO-SCA model against representative detectors of different types—such as Faster R-CNN, RetinaNet, and DETR—on unified datasets and hardware platforms. This will enable a more thorough evaluation of its performance boundaries and further explore the applicability of different learning models in agricultural vision tasks.
Additionally, all core modules employed in this study are existing technologies. The current focus lies on effectively selecting, combining, and optimizing the overall performance of these modules to address specific application challenges. This integration strategy proves efficient and practical when tackling complex visual detection tasks in agricultural environments. Future research will explore more innovative detection model designs.

4.2. Analysis of Failure Cases

Based on the above test results, the potato bud eye detection method designed in this study has high accuracy in bud eye target recognition, but there are still cases of target recognition failure during the experiment:
(1) Target missed
The main cause of target omission is feature confusion due to complex background interference. Although the CBAM mechanism enhances key feature responses through dual channel and spatial attention, when bud eyes are covered by more than 80% of adhesive soil, or when pathological deformities cause epidermal protrusions, some of the effective characterization information of the bud eyes is missing, and the shallow feature extraction ability is still insufficient.
(2) Background misrecognition
Background misrecognition is directly related to the morphological heterogeneity of potato skin. The local shadows produced by irregular depressions on the surface are highly similar to the morphological characteristics of the initial growth stage of bud eyes under low light conditions, and existing data enhancement strategies are unable to fully simulate such natural variations.
(3) Positioning deviation
Localization errors are mainly limited by the irregular geometric characteristics of bud eyes. Traditional rectangular bounding boxes are difficult to accurately fit to their actual contours, especially when bud eyes are narrow and slit-like. Although Alpha-IoU optimizes difficult sample regression by adjusting the gradient weight of the loss function, it cannot fundamentally solve the inherent mismatch between the anchor box mechanism and the target geometric structure.
Statistical results indicate that, among the 10,150 true bud markings in the test set, there were 589 missed detections, 455 false detections, and 335 detection boxes with IoU < 0.5. Consequently, the model’s missed detection rate was calculated at 5.8%, the false detection rate at 4.3%, and the positioning deviation exceeding the threshold rate at 3.5%. This indicates that, under most conditions, the current model achieves precise detection with high reliability. However, its performance degrades under extreme appearance variations and highly similar background interference. For instance, seed potatoes with excessive soil contamination or severe disease require preliminary manual sorting or preprocessing steps.
To address the aforementioned errors, we will subsequently construct a more comprehensive dataset and optimize it across multiple levels: At the data level, the focus will be on collecting images of challenging samples (e.g., heavily covered or pathologically altered specimens) and introducing Generative Adversarial Networks to synthesize more realistic extreme cases, thereby enriching the training set. At the methodological level, multispectral imaging techniques can be employed to capture near-infrared band features, enhancing spectral distinguishability between lesions and bud eyes. At the algorithmic level, further integration of multi-scale feature pyramids can strengthen small object detection capabilities. By predicting heatmaps of bud eye center points and morphological radii, geometric constraints of anchor boxes can be circumvented. We can replace horizontal bounding boxes with Rotated Bounding Boxes or Instance Segmentation Models to more accurately fit irregularly shaped bud eyes. In addition, it is possible to introduce a confidence-based secondary verification mechanism: for low-confidence detection results, the system can trigger re-capture from another angle or submit for manual review, thereby enhancing system reliability while maintaining efficiency.

4.3. Research Prospects

Based on the findings and limitations of this study, the following plans have been outlined to develop a more comprehensive potato bud detection system:
(1) Explore multispectral and hyperspectral imaging technologies: Current RGB-based models have reached near-bottleneck levels in distinguishing bud eyes from diseases and soil. Future work may incorporate multispectral imaging to capture spectral information in bands such as near-infrared. Significant differences exist in the reflectance spectral characteristics of bud eyes, potato skin, and diseases. Leveraging these features could fundamentally overcome the limitations of two-dimensional images. Additionally, hyperspectral imaging technology can serve as a research methodology, but its implementation in practical applications increases equipment costs.
(2) Investigate anchor-free detection frameworks: Anchor-based detectors exhibit inherent biases when locating irregularly shaped bud eyes. Anchor-free detection models predict target center points or corner points, enabling more flexible fitting to actual bud eye contours. This approach better addresses localization accuracy issues for elongated or sunken bud eyes.
(3) Integrating the Visual Transformer architecture: This aspect of the study aims to construct a CNN-ViT Hybrid structure. Shallow layers utilize CNNs for efficient local feature extraction, while lightweight Transformer modules are introduced in deeper layers to capture dependencies between bud eyes and the global background. This approach seeks to further enhance recognition capabilities for extreme samples.

5. Conclusions

This study proposes a detection method based on the lightweight ensemble model YOLO-SCA architecture, an improvement over YOLOv5s, which improves the accuracy of potato bud eye detection while reducing the model’s weighted memory and enabling real-time detection. The following conclusions are drawn.
(1) Model Validity: First, the ShuffleNetv2 module was introduced to lighten the convolutional neural network architecture, reducing the model’s weighted memory and computing load. Second, the CBAM mechanism was introduced to strengthen the potato bud eye feature learning ability, enhance the model’s sensitivity to key features, and effectively suppress the interference of similar background factors on the surface of seed potatoes. Finally, Alpha-IoU was introduced as the bounding box regression loss function to optimize the original model’s method of calculating prediction box position errors, enabling the accurate alignment of bud eye positions when the target is small or difficult to locate. The integration of these three mechanisms significantly improves potato bud eye detection performance, achieving “high accuracy, high speed, and small size.”
(2) Applied Value: Through structured channel pruning, the YOLO-SCA model was compressed to 3.6 MB, reducing the model’s weighted memory by 64.7% without affecting detection accuracy, thereby avoiding the trade-off between model lightweighting and detection accuracy degradation seen in traditional algorithms.
(3) Performance Advantages: The results of the comparison and ablation tests show that the YOLO-SCA model achieved a precision of 91.7% and a recall of 89.2% during detection. In addition, in a horizontal comparison, ShuffleNetV2 has obvious advantages over other lightweight backbone models, and CBAM has obvious advantages over other attention mechanisms: ShuffleNetV2 was 2.1 and 3.7 percentage points higher than EfficientNet-Lite and NanoDet in terms of average accuracy; CBAM was 1.3 and 3.8 percentage points higher than GC and SimAM in terms of average accuracy, and 1.0 and 2.8 percentage points lower than AA and Triplet in terms of weighted memory.
In summary, although the improved method presented in this study has some similarities with existing research, the potato bud eye detection method based on the YOLO-SCA structure can be applied more efficiently and accurately to potato bud eye detection, and it can meet real-time requirements, laying a good foundation for subsequent seed potato cutting.

Author Contributions

Conceptualization, P.Z. and Q.Z.; data curation, Q.Z. and X.W.; formal analysis, Q.Z.; investigation, Q.Z., X.W. and Q.X.; project administration, P.Z.; validation, S.L.; writing—original draft, P.Z., Q.Z., X.W. and S.L.; writing—review and editing, S.L. and T.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Basic Research Program Project of Liaoning Province of China (2023JH2/101300117).

Institutional Review Board Statement

Not applicable.

Data Availability Statement

All applicable data are published and referenced in the article.

Acknowledgments

We would like to thank the Agrotechnical Extension Center of Jianping County of Liaoning Province in China for providing test materials.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

YOLO: You Only Look Once
CBAM: Convolutional Block Attention Module
IoU: Intersection over Union
LBP: Local Binary Pattern
SVM: Support Vector Machine
BiFPN: Bidirectional Feature Pyramid Network
ECA: Efficient Channel Attention
SGD: Stochastic Gradient Descent
mAP: Mean Average Precision
FPS: Frames Per Second
FPN: Feature Pyramid Network
PAN: Path Aggregation Network
WNMS: Weighted Non-Maximum Suppression
DC: Depthwise Convolution
PC: Pointwise Convolution
CAM: Channel Attention Module
SAM: Spatial Attention Module
MLP: Multilayer Perceptron
BN: Batch Normalization
GC: Global Context
AA: Axial Attention
R-CNN: Region-based Convolutional Neural Network
DETR: Detection Transformer
CNN-ViT: Convolutional Neural Network–Vision Transformer
ASFF: Adaptive Spatial Feature Fusion

References

  1. Semyalo, D.; Kim, Y.; Omia, E.; Arief, M.A.A.; Kim, H.; Sim, E.Y.; Kim, M.S.; Baek, I.; Cho, B.K. Nondestructive Identification of Internal Potato Defects Using Visible and Short-Wavelength Near-Infrared Spectral Analysis. Agriculture 2024, 14, 2014. [Google Scholar]
  2. Jia, L.; Zheng, Y.; Jin, B.; Tang, W.; He, R.; Ma, Y.; Feng, L.; Wang, T. Analysis and Recommendations for the Development of the Potato Industry in Liaoning Province in 2024. In Proceedings of the Potato Industry and Rural Revitalization 2025, Benxi, China, 24–27 May 2025; Potato Professional Committee of the Chinese Crop Science Society: Dingxi, China; Benxi Potato Research Institute: Benxi, China, 2025. [Google Scholar] [CrossRef]
  3. Sohel, A.; Shakil, M.S.; Siddiquee, S.M.T.; Al Marouf, A.; Rokne, J.G.; Alhajj, R. Enhanced Potato Pest Identification: A Deep Learning Approach for Identifying Potato Pests. IEEE Access 2024, 12, 172149–172161. [Google Scholar] [CrossRef]
  4. Gu, H.; Li, Z.; Li, T.; Li, T.; Li, N.; Wei, Z. Lightweight detection algorithm of seed potato eyes based on YOLOv5. Trans. Chin. Soc. Agric. Eng. 2024, 40, 126–136. [Google Scholar]
  5. Wu, H. Research on Intelligent Grading and Bud Eye Detection Methods for Potato Seed Tubers. Master’s Thesis, Heilongjiang Bayi Agricultural University, Daqing, China, 2025. [Google Scholar] [CrossRef]
  6. Yang, T. Design and Research of Potato Seed Potato Intelligent Cutter. Master’s Thesis, Lanzhou Jiaotong University, Lanzhou, China, 2019. [Google Scholar] [CrossRef]
  7. Lv, Z.; Qi, X.; Zhang, W.; Liu, Z.; Zheng, W.; Mu, G. Buds Recognition of Potato Images Based on Gabor Feature. Agric. Mech. Res. 2021, 43, 203–207. [Google Scholar] [CrossRef]
  8. Zhang, W.; Zeng, X.; Liu, S.; Mu, G.; Zhang, H.; Guo, Z. Detection Method of Potato Seed Bud Eye Based on Improved YOLOv5s. Trans. Chin. Soc. Agric. Mach. 2023, 54, 260–269. [Google Scholar]
  9. Huang, J.; Wang, X.; Wu, H.; Liu, S.; Yang, X.; Liu, W. Detecting potato seed bud eye using lightweight convolutional neural network (CNN). Trans. Chin. Soc. Agric. Eng. 2023, 39, 172–182. [Google Scholar]
  10. Yang, S.; Zhang, P.; Wang, L.; Tang, L.; Wang, S.; He, X. Identifying tomato leaf diseases and pests using lightweight improved YOLOv8n and channel pruning. Trans. Chin. Soc. Agric. Eng. 2025, 41, 206–214. [Google Scholar]
  11. Chen, Z. Research on the Detection and Automatic Cutting Test of Seed Potato Buds based on YOLOX. Master’s Thesis, Shandong Agricultural University, Taian, China, 2022. [Google Scholar] [CrossRef]
  12. Huang, S. Analysis of 3D Phenotype and Bud Eye Traits of Potato Based on Structured Light Imaging. Master’s Thesis, Huazhong Agricultural University, Wuhan, China, 2024. [Google Scholar] [CrossRef]
  13. Wang, Y.; Fang, Z.; Wang, M.; Peng, L.; Hong, H. Comparative study of landslide susceptibility mapping with different recurrent neural networks. Comput. Geosci. 2020, 138, 104445. [Google Scholar] [CrossRef]
  14. Li, Y.; Zhao, Q.; Zhang, Z.; Liu, J.; Fang, J. Detection of Seed Potato Sprouts Based on Improved YOLOv8 Algorithm. Agriculture 2025, 15, 1015. [Google Scholar] [CrossRef]
  15. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.-Y.; Berg, A.C. Ssd: Single Shot Multibox Detector. In Proceedings of the European Conference on Computer Vision; Springer International Publishing: Cham, Switzerland, 2016; pp. 21–37. [Google Scholar]
  16. Li, Q.; Yang, X. Lightweight Vehicle Detection Method Based on Improved YOLOv4. Comput. Technol. Dev. 2023, 33, 42–48. [Google Scholar]
  17. Wu, Y.; Qiu, H.; Ma, S. Underwater Fish Detection Algorithm Based on Improved YOLOv5s. J. Heilongjiang Univ. Technol. (Compr. Ed.) 2025, 25, 96–102. [Google Scholar] [CrossRef]
  18. He, K.; Zhang, X.; Ren, S.; Sun, J. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 1904–1916. [Google Scholar] [CrossRef] [PubMed]
  19. Zheng, H.; Chu, J. Feature Fusion Method for Object Detection. J. Nanchang Hangkong Univ. (Nat. Sci.) 2022, 36, 59–67. [Google Scholar]
  20. Meng, W.; An, W.; Ma, S.; Yang, X. An object detection algorithm based on feature enhancement and adaptive threshold non-maximum suppression. J. Beijing Univ. Aeronaut. Astronaut. 2025, 51, 2349–2359. [Google Scholar] [CrossRef]
  21. Zhang, X.; Zhou, X.; Lin, M.; Sun, J. Shufflenet: An extremely efficient convolutional neural network for mobile devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 6848–6856. [Google Scholar]
  22. Ma, N.; Zhang, X.; Zheng, H.T.; Sun, J. Shufflenet v2: Practical guidelines for efficient cnn architecture design. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 116–131. [Google Scholar]
  23. Lu, W.; Wang, Y.; Lu, Y.; Cheng, S. An Improved YOLOv5s Recognition and Detection Algorithm for Floating Objects on Lake Surface. Nat. Sci. Hainan Univ. 2025. [Google Scholar] [CrossRef]
  24. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  25. Zhang, L.; Qiao, R.; Dang, Q.; Zhai, P.; Sun, H. Spatial Attention Mechanism with Global Characteristics. J. Xi’an Jiaotong Univ. 2020, 54, 129–138. [Google Scholar]
  26. Chen, Z.; Zhao, C.; Li, B. Bounding box regression loss function based on improved IoU loss. Appl. Res. Comput. 2020, 37 (Suppl. S2), 293–296. [Google Scholar]
  27. Jiang, Y.; Liu, W.; Wei, T. A Robot for Detecting Rail Screws Based on YOLOv5. China Comput. Commun. 2022, 34, 165–167. [Google Scholar]
  28. Dong, H.; Pan, J.; Dong, F.; Zhao, Q.; Guo, H. Research on bounding box regression loss function based on YOLOv5s model. Mod. Electron. Technol. 2024, 47, 179–186. [Google Scholar] [CrossRef]
  29. Li, Y.; Zhang, J.; Hu, Y.; Zhao, Y.; Cao, Y. Real-time Safety Helmet-wearing Detection Based on Improved YOLOv5. Comput. Syst. Sci. Eng. 2022, 43, 1219. [Google Scholar] [CrossRef]
  30. Yan, H. Research on Typical Solid Waste Species Identification Based on Image. Master’s Thesis, North China Electric Power University, Beijing, China, 2024. [Google Scholar] [CrossRef]
  31. Xu, X.; Wang, Y.; Hua, Z.; Yang, G.; Li, H.; Song, H. Lightweight recognition for the oestrus behavior of dairy cows combining YOLO v5n and channel pruning. Trans. Chin. Soc. Agric. Eng. 2022, 38, 130–140. [Google Scholar]
  32. Feng, J. Channel Pruning of Convolutional Neural Network Based on Transfer Learning. Comput. Mod. 2021, 12, 13–18+26. [Google Scholar]
  33. Zheng, Y.; Cheng, B. Lightweight Ship Recognition Network Based on Model Pruning. J. Wuhan Univ. Technol. (Transp. Sci. Eng.) 2025, 49, 682–691. [Google Scholar]
  34. Liang, X.; Pang, Q.; Yang, Y.; Wen, C.; Li, Y.; Huang, W.; Zhang, C.; Zhao, C. Online detection of tomato defects based on YOLOv4 model pruning. Trans. Chin. Soc. Agric. Eng. 2022, 38, 283–292. [Google Scholar]
  35. Li, S.J.; Hu, D.Y.; Gao, S.M.; Lin, J.H.; An, X.S.; Zhu, M. Real-time classification and detection of citrus based on improved single short multibox detecter. Trans. Chin. Soc. Agric. Eng. 2019, 35, 307–313. [Google Scholar]
  36. Zhang, W.; Zhang, H.; Liu, S.; Zeng, X.; Mu, G.; Zhang, T. Detection of Potato Seed Bud Eye Based on Improved YOLOv7. Trans. Chin. Soc. Agric. Eng. 2023, 39, 148–158. [Google Scholar]
Figure 1. Potato image acquisition system: (a) image acquisition device; (b) image acquisition interface.
Figure 2. Example images of the five processing methods in the dataset: (a) original image; (b) adjusted orientation; (c) changed brightness; (d) increased exposure; (e) noise reduction.
Figure 3. Example of marking potato bud eye targets in the image annotation tool LabelImg.
Figure 4. Improved YOLO-SCA structural framework diagram.
Figure 5. ShuffleNetv2 module structure diagram.
Figure 6. CBAM mechanism structure diagram.
Figure 7. Diagram illustrating the IoU function calculation mechanism.
Figure 8. Pruning process and pruning parts for the YOLO-SCA model: (a) pruning flowchart; (b) parts that need pruning.
Figure 9. Comparison of accuracy and parameter efficiency across different versions of YOLOv5 and their lightweight variants.
Figure 10. Radar chart of ablation results for key indexes across different model configurations.
Figure 11. Detection results for Dutch 15 potato bud eyes using SCA improvements on various base models: (a) YOLOv3-SCA; (b) YOLOv4-SCA; (c) YOLOv5-SCA; (d) YOLOv8-SCA.
Figure 12. Detection results for Eugene potato bud eyes using SCA improvements on various base models: (a) YOLOv3-SCA; (b) YOLOv4-SCA; (c) YOLOv5-SCA; (d) YOLOv8-SCA.
Table 1. Performance comparison of YOLOv5s, YOLOv5m, YOLOv5l, and YOLOv5x on the same dataset.

Model | Parameters/M | FLOPs/G | Size/MB | Precision/% | Recall/% | mAP@0.5/% | mAP@0.75/% | mAP@0.5:0.95/% | FPS/(s−1) | Depth Multiple | Width Multiple
YOLOv5s | 6.80 | 14.6 | 10.2 | 82.8 | 80.1 | 84.7 | 65.1 | 52.3 | 158 | 0.33 | 0.50
YOLOv5m | 20.1 | 44.2 | 30.1 | 85.5 | 83.7 | 88.3 | 70.8 | 58.1 | 104 | 0.67 | 0.75
YOLOv5l | 44.1 | 99.5 | 63.0 | 86.9 | 85.3 | 89.8 | 73.5 | 61.7 | 79 | 1.00 | 1.00
YOLOv5x | 82.3 | 188.7 | 117.1 | 87.6 | 86.2 | 90.6 | 75.2 | 63.9 | 53 | 1.33 | 1.25
Table 2. Detection performance of YOLOv5s models improved with various lightweight backbones.

Model | Parameters/M | FLOPs/G | Size/MB | Precision/% | Recall/% | mAP@0.5/% | mAP@0.75/% | mAP@0.5:0.95/%
YOLOv5s | 6.80 | 14.6 | 10.2 | 82.8 | 80.1 | 84.7 | 65.1 | 52.3
YOLOv5s-ShuffleNetV2 | 0.68 | 6.9 | 1.7 | 80.7 | 72.5 | 79.9 | 58.3 | 45.8
YOLOv5s-EfficientNet-Lite | 1.26 | 7.3 | 2.1 | 79.5 | 71.0 | 77.8 | 56.9 | 44.1
YOLOv5s-NanoDet | 1.42 | 7.2 | 2.5 | 76.1 | 70.3 | 76.2 | 55.1 | 42.5
Table 3. Detection performance of various attention-mechanism-improved models based on YOLOv5s.

Model | Parameters/M | FLOPs/G | Size/MB | Precision/% | Recall/% | mAP@0.5/% | mAP@0.75/% | mAP@0.5:0.95/%
YOLOv5s | 6.80 | 14.6 | 10.2 | 82.8 | 80.1 | 84.7 | 65.1 | 52.3
YOLOv5s-CBAM | 7.82 | 18.9 | 16.3 | 94.6 | 84.8 | 92.4 | 77.8 | 65.2
YOLOv5s-GC | 8.85 | 18.4 | 19.4 | 93.1 | 86.1 | 91.1 | 75.1 | 62.9
YOLOv5s-AA | 8.13 | 18.7 | 17.3 | 93.2 | 84.6 | 93.8 | 78.5 | 66.8
YOLOv5s-SimAM | 7.20 | 18.6 | 17.2 | 91.0 | 81.7 | 88.6 | 70.2 | 57.4
YOLOv5s-Triplet | 8.86 | 19.0 | 19.1 | 94.9 | 84.3 | 93.9 | 79.1 | 67.5
Table 4. Comparison of detection performance between the proposed YOLO-SCA and various improved models based on YOLOv5s.

Component Combination | Parameters/M | FLOPs/G | Size/MB | Precision/% | Recall/% | mAP@0.5/% | mAP@0.75/% | mAP@0.5:0.95/%
YOLOv5s | 6.80 | 14.6 | 10.2 | 82.8 | 80.1 | 84.7 | 65.1 | 52.3
YOLOv5s+S | 0.68 | 6.9 | 1.7 | 80.7 | 72.5 | 79.9 | 58.3 | 45.8
YOLOv5s+C | 7.82 | 18.9 | 16.3 | 94.6 | 84.8 | 92.4 | 77.8 | 65.2
YOLOv5s+A | 6.80 | 17.6 | 14.2 | 93.2 | 85.7 | 90.6 | 73.9 | 61.5
YOLOv5s+S+C | 2.64 | 12.2 | 5.8 | 87.8 | 82.0 | 90.2 | 72.5 | 60.1
YOLOv5s+S+A | 1.28 | 10.9 | 4.7 | 86.4 | 83.2 | 89.6 | 71.8 | 59.3
YOLOv5s+C+A | 12.82 | 18.6 | 19.3 | 94.9 | 90.1 | 96.5 | 82.3 | 71.4
YOLO-SCA | 1.70 | 7.1 | 3.6 | 91.7 | 89.2 | 95.3 | 78.5 | 65.2
Table 5. Comparison of detection performance between the proposed YOLO-SCA and SCA-improved models based on various generations of base models.

Model | Parameters/M | FLOPs/G | Size/MB | Precision/% | Recall/% | mAP@0.5/% | mAP@0.75/% | mAP@0.5:0.95/% | Advantage | Disadvantage
YOLOv3-SCA | 6.2 | 16.8 | 12.7 | 84.4 | 85.1 | 89.7 | 70.2 | 58.9 | Target features are fully preserved. | Large number of parameters.
YOLOv4-SCA | 4.1 | 14.2 | 9.8 | 86.7 | 97.2 | 92.8 | 75.8 | 64.1 | Improved accuracy. | Computationally heavy.
YOLOv8-SCA | 3.2 | 10.5 | 6.3 | 92.3 | 89.1 | 95.7 | 79.8 | 67.1 | Enhanced detail perception. | Large memory; poor real-time performance.
YOLO-SCA | 1.7 | 7.1 | 3.6 | 91.7 | 89.2 | 95.3 | 78.5 | 65.2 | Lightweight model. | Slightly lower robustness on small targets.
Table 6. Sensitivity analysis of different values of the Alpha-IoU hyperparameter α on potato bud eye detection results.

α Value | Precision/% | Recall/% | mAP@0.5/% | mAP@0.75/% | mAP@0.5:0.95/%
2 | 90.8 | 88.5 | 94.1 | 72.5 | 63.5
2.5 | 91.2 | 88.9 | 94.8 | 75.8 | 64.3
3 | 91.7 | 89.2 | 95.3 | 78.5 | 65.2
3.5 | 91.5 | 89.0 | 95.1 | 77.2 | 64.9
4 | 91.3 | 88.8 | 94.9 | 76.4 | 64.6
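For context on Table 6, the α values being varied correspond to the single exponent of the Alpha-IoU loss family. A hedged sketch of its basic power-IoU form is given below (a general reading of Alpha-IoU, not necessarily the exact penalty-augmented variant implemented for YOLO-SCA):

% Basic power form of the Alpha-IoU loss family (illustrative sketch only;
% deployed variants often add CIoU-style distance and aspect-ratio penalties).
\[
  \mathcal{L}_{\alpha\text{-}\mathrm{IoU}} \;=\; \frac{1 - \mathrm{IoU}^{\alpha}}{\alpha}, \qquad \alpha > 0 .
\]

With α > 1, the loss and its gradient are amplified for boxes that already overlap the target well, which would be consistent with the best results at α = 3 in Table 6: localization on high-IoU candidates is sharpened without destabilizing training.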