Although the YOLOv8 network demonstrates excellent performance in object detection accuracy and processing speed, it encounters persistent challenges in small object recognition within complex environments. Specifically, in our potato seed tuber sprout detection task, the target sprouts typically occupy only a minimal portion of the overall image. Furthermore, residual soil clods adhering to the seed tubers create additional interference for computer vision systems. These factors collectively degrade detection speed and accuracy, which therefore require further enhancement.
YOLOv8 uses traditional convolution in the backbone, which is powerful and capable of extracting complex features but is computationally intensive and generates many parameters. In this study, the convolutions in layers 2, 4, and 6 of the backbone are replaced with GhostConv, and the C2f module is replaced with C3ghost. By combining GhostConv and the C3 module to build efficient convolutional layers, the network reduces computation and storage requirements while maintaining good accuracy, making it suitable for embedded devices. The main feature-extraction module in YOLOv8 is C2f. As the number of convolutional layers increases, C2f becomes less accurate at feature extraction, and the extracted feature data are easily lost; as a result, the model is prone to detection errors or target omissions, degrading accuracy when recognizing small targets. In this study, the ECA attention mechanism is added after each C2f in the neck of YOLOv8 to adaptively adjust the weights of each channel, and BiFPN is used to improve feature fusion efficiency. The improved YOLOv8_EBG network structure is shown in Figure 6.
3.2.1. Hyper-Ghost
For sprout detection of seed potatoes, detection speed is as important as accuracy [32,33]. In this study, GhostConv and C3ghost were introduced into YOLOv8n's backbone in place of the traditional convolution and C2f to form the Hyper-Ghost module, reducing the number of model parameters without sacrificing accuracy.
The schematic diagram of GhostConv is shown in Figure 7. For an input X of size C × H × W, where C is the number of input channels and H and W are the height and width of the input feature map, GhostConv first passes X through a standard convolutional layer to generate the base feature map Y′, as shown in Equation (1):

Y′ = X * f′    (1)

where f′ is the primary convolution kernel (bias term omitted) and * denotes the convolution operation.
This convolution uses a smaller kernel and generates relatively few feature map channels, as shown in Figure 7: if the input feature map has 12 channels, this convolution step produces only 6.
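The cost saving implied by this channel split can be illustrated with a rough multiply-accumulate count. The sketch below is pure Python; the 64 × 64 feature map size, the 3 × 3 kernels, and the assumption that the cheap operation is a single depthwise kernel per base map are illustrative choices, not measurements from this study.

```python
# Toy FLOP (multiply-accumulate) comparison: standard convolution vs. GhostConv.
# All sizes below are illustrative assumptions.

def conv_flops(c_in, c_out, k, h, w):
    """MAC count of a standard k x k convolution producing c_out maps of h x w."""
    return c_in * c_out * k * k * h * w

def ghostconv_flops(c_in, c_out, k, dw_k, h, w):
    """GhostConv with ratio s = 2: half the output channels come from a
    standard convolution, the other half from cheap depthwise operations."""
    primary = c_out // 2                       # e.g. 12 output channels -> 6 base maps
    main = conv_flops(c_in, primary, k, h, w)  # ordinary convolution (Eq. 1)
    cheap = primary * dw_k * dw_k * h * w      # one depthwise kernel per base map (Eq. 2)
    return main + cheap

std = conv_flops(12, 12, 3, 64, 64)
ghost = ghostconv_flops(12, 12, 3, 3, 64, 64)
print(std, ghost)  # the Ghost variant needs roughly half the MACs here
```

With a larger ratio of ghost maps to base maps, the saving grows accordingly; ratio 2 is the common default.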
GhostConv then generates additional Ghost feature maps Y″ from the base feature map Y′ through k linear operations Ψ. The purpose of this step is to obtain a large number of features at a small computational cost. The generation of the Ghost feature maps can be expressed as shown in Equation (2):

y_{i,j} = Ψ_{i,j}(y′_i),  i = 1, …, m,  j = 1, …, k    (2)

where y′_i is the i-th base feature map and y_{i,j} is the j-th Ghost feature map generated from it.
Here, each linear operation Ψ is implemented by a lightweight operator module G; in this study, depthwise separable convolution is used.
Ultimately, the output of GhostConv is the feature map obtained by concatenating the base feature maps and the Ghost feature maps, as shown in Equation (3):

Y = Concat(Y′, Y″)    (3)

where Concat(·) denotes the feature map concatenation operation.
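Equations (1)-(3) describe a simple data flow, which can be sketched in a few lines of pure Python on channel-first nested lists. The primary convolution and the cheap operation Ψ are hypothetical stand-ins here (a channel average and a per-map scaling), chosen only to make the base/ghost split and the concatenation visible; they are not the operators used in the actual network.

```python
# Minimal sketch of the GhostConv data flow (Eqs. 1-3) on C x H x W nested lists.
# The concrete operators are illustrative stand-ins, not learned convolutions.

def primary_conv(x, out_channels):
    """Eq. (1) stand-in: produce a small set of base feature maps Y'."""
    h, w = len(x[0]), len(x[0][0])
    maps = []
    for _ in range(out_channels):
        # Stand-in "convolution": average across input channels.
        avg = [[sum(ch[i][j] for ch in x) / len(x) for j in range(w)]
               for i in range(h)]
        maps.append(avg)
    return maps

def cheap_ops(base, scale=0.5):
    """Eq. (2) stand-in: one cheap linear operation Psi per base map."""
    return [[[v * scale for v in row] for row in m] for m in base]

def ghost_conv(x, out_channels):
    """Eq. (3): concatenate base maps and ghost maps along the channel axis."""
    base = primary_conv(x, out_channels // 2)
    return base + cheap_ops(base)

x = [[[1.0, 2.0], [3.0, 4.0]] for _ in range(12)]  # 12-channel 2 x 2 input
y = ghost_conv(x, 12)
print(len(y))  # 12 output channels: 6 base + 6 ghost
```

The structure to notice is that only half the output channels pass through the expensive operation; the rest are derived from them almost for free.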
Compared with C2f, the primary advantages of C3ghost lie in its superior computational efficiency and lightweight architecture. This enhancement stems from its GhostNet-inspired design paradigm, which strategically minimizes redundant computational operations to optimize memory utilization and thereby boost overall computational performance. Furthermore, C3ghost exhibits remarkable hardware adaptability, particularly in resource-constrained environments, with its efficient and lightweight characteristics conferring significant competitive advantages.
While C2f has achieved certain progress in computational optimization, comparative analyses demonstrate that C3ghost outperforms in three critical aspects: computational efficiency, memory optimization, and cross-platform adaptability. These technical merits endow C3ghost with greater potential for deployment in resource-limited scenarios. The architectural schematic of C3ghost is presented in
Figure 8.
The GhostConv module demonstrates superior computational efficiency while preserving model accuracy through its innovative approach of minimizing redundant computations and generating virtual feature maps from existing features [
34]. In the present study, the proposed Hyper-Ghost module further enhances this capability, offering a computationally efficient solution specifically optimized for potato sprout detection.
3.2.2. ECA
In the field of deep learning, especially in visual tasks, the attention mechanism has become a key technique to enhance the perceptual ability and interpretability of models, and an effective attention strategy can significantly enhance the ability of models to focus on important features [
35,
36,
37]. The many GhostConv modules introduced in Section 3.2.1 effectively lighten the model, providing enough computational headroom to introduce a suitable attention mechanism; at the same time, an attention mechanism is an important way to compensate for the accuracy loss that GhostConv may cause.
In this study, the Efficient Channel Attention (ECA) mechanism is adopted: a lightweight channel attention model designed for convolutional neural networks that originates as an optimization of the SE attention mechanism [38]. It models the relationships between channels more efficiently, improving computational efficiency and performance while adaptively adjusting the weight of each channel. The design concept of ECA is to capture the complex dependencies between channels while minimizing the parameters and computation introduced. By integrating ECA, the accuracy degradation caused by GhostConv's reduced computation can be offset with almost no increase in computational cost.
The structure of the ECA module is shown in Figure 9. For an input X with dimensions (C, H, W), ECA first performs global average pooling to obtain a weight vector ω = {ω_1, ω_2, …, ω_C} whose dimension equals the number of input channels; each weight coefficient ω_c is computed as shown in Equation (4):

ω_c = (1 / (H × W)) Σ_{i=1..H} Σ_{j=1..W} x_c(i, j)    (4)
In Equation (4), x_c(i, j) is the pixel value of channel c at position (i, j) in the input feature map. The result of global average pooling, ω_c, is called the description vector element of channel c and expresses the importance weight of that channel.
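As a concrete illustration, the global average pooling of Equation (4) can be written in a few lines of pure Python, with nested lists standing in for tensors:

```python
# Equation (4) as code: global average pooling collapses each H x W channel
# of the input feature map into one scalar descriptor omega_c.

def global_avg_pool(x):
    """x: channel-first feature map as nested lists (C x H x W)."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in x]

x = [
    [[1.0, 1.0], [1.0, 1.0]],  # channel 0 -> descriptor 1.0
    [[0.0, 2.0], [4.0, 6.0]],  # channel 1 -> descriptor 3.0
]
print(global_avg_pool(x))  # [1.0, 3.0]
```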
In the SE mechanism, the pooled result passes through two fully connected layers and an activation to obtain the channel weights. The ECA mechanism omits these fully connected layers and instead applies a one-dimensional convolution directly to ω, capturing and learning the dependencies between the channel description vectors; new channel weight vectors are generated through backpropagation during training, as expressed in Equation (5):

ω′ = C1D_k(ω)    (5)
In Equation (5), k is the kernel size of the one-dimensional convolution. To match the number of outputs with the input channels, the convolution's stride and boundary (padding) strategy must be configured so that the processed output vector has the same number of elements as the channels of the original feature map, ensuring that every channel receives a corresponding output weight.
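The channel-preserving one-dimensional convolution described here can be sketched as follows: stride 1 with zero padding of (k − 1)/2 on each side keeps the output length equal to the number of channels. The averaging kernel below is an arbitrary example, not a learned one.

```python
# Equation (5) in miniature: a 1-D convolution slides a kernel of size k over
# the channel descriptor vector; "same" padding preserves the channel count.

def conv1d_same(omega, kernel):
    k = len(kernel)
    pad = (k - 1) // 2
    padded = [0.0] * pad + list(omega) + [0.0] * pad
    return [sum(kernel[j] * padded[i + j] for j in range(k))
            for i in range(len(omega))]

omega = [1.0, 2.0, 3.0, 4.0]          # four channel descriptors
out = conv1d_same(omega, [0.25, 0.5, 0.25])  # k = 3, stride 1
print(len(out) == len(omega))  # True: one output weight per channel
```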
For the weight vector ω′ produced by the one-dimensional convolution, an activation function is finally used to constrain the weight of each channel to a fixed interval: the ECA attention mechanism applies a sigmoid function and then feeds the normalized weights back to the input by element-wise multiplication, as shown in Equation (6):

X̃ = σ(ω′) ⊗ X    (6)

where X̃, the output adjusted by the ECA attention mechanism, carries the convolutionally processed weights on top of X; σ denotes the sigmoid function, whose smooth, non-linear mapping helps the network learn more complex patterns, allowing the attention mechanism to adjust the contribution of different channels in a subtler way and to focus on the characteristics of potato seed sprouts.
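The sigmoid gating of Equation (6) can be illustrated in pure Python; the raw weight values below are hypothetical, chosen only to show the scaling effect on each channel.

```python
# Equation (6) as code: channel weights are squashed by a sigmoid and
# multiplied back onto the input, rescaling each channel's feature map.

import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def apply_channel_attention(x, omega):
    """x: C x H x W feature map; omega: one raw weight per channel."""
    return [[[v * sigmoid(w) for v in row] for row in ch]
            for ch, w in zip(x, omega)]

x = [[[2.0, 2.0]], [[2.0, 2.0]]]              # two identical channels
y = apply_channel_attention(x, [0.0, 100.0])  # sigmoid -> 0.5 and ~1.0
print(y[0][0][0])  # channel 0 is halved, channel 1 passes almost unchanged
```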
In this study, the attention mechanism is added to both the backbone and the neck. Adding it to the backbone lets the model better recognize the channel information of the sprouts during feature extraction, improving the robustness of the overall model. Adding the attention mechanism before the first layer of the SPPF in the YOLOv8 backbone effectively improves the effectiveness and accuracy of SPPF when handling multi-scale features: ECA strengthens important features before the multi-scale pooling in SPPF, optimizing subsequent multi-scale fusion and target detection performance.
The neck of YOLOv8 improves detection accuracy for targets of different scales and sizes through multi-scale operations such as feature fusion, feature enhancement, and information transfer, while maintaining efficient real-time detection by optimizing the amount of computation. Adding the ECA attention mechanism in the neck enables the network to better handle features of different scales and enhances multi-scale target detection, which is especially important for sprout detection.
3.2.3. BiFPN
BiFPN, the Bidirectional Feature Pyramid Network, is a neural network architecture for target detection and image segmentation. In computer vision tasks, Feature Pyramid Network (FPN) is a commonly used method to capture targets at different scales by constructing feature maps at different scales [
39]. However, the traditional FPN has drawbacks such as inefficient feature fusion and insufficient information flow. BiFPN overcomes these problems by introducing a bidirectional feature fusion mechanism and weighted feature fusion. The traditional FPN is unidirectional: it passes information from high-level feature maps to low-level feature maps. BiFPN adds the reverse information path, transferring information from low-level feature maps back to high-level ones, so that information fusion between feature maps is more thorough. Moreover, in BiFPN, feature maps of different scales are assigned different weights during fusion. These weights are learnable parameters that the model adjusts automatically during training to fuse features of different scales optimally.
In this way, the model is able to make better use of the information in each feature map and improve the overall feature representation [40]. Each time features from different layers are fused, BiFPN dynamically adjusts the inter-layer fusion weights based on the learned parameters, so the network automatically learns how best to combine information from different layers according to the characteristics of the data. Through this weighted fusion, the network reduces the noise that bottom-layer features may introduce and improves the representation of higher-layer features. Suppose the feature map at layer i is P_i, which is fused with the upper-layer feature map P_{i+1} and the lower-layer feature map P_{i−1} through learnable weights. The output feature map P_i^out of a layer in BiFPN can then be expressed as shown in Equation (7):

P_i^out = (w_{i,i−1} · P_{i−1} + w_{i,i} · P_i + w_{i,i+1} · P_{i+1}) / (w_{i,i−1} + w_{i,i} + w_{i,i+1} + ε)    (7)

where the w_{i,j} are weights obtained from learning, reflecting the importance of the information flow from layer j to layer i; they are optimized on the training data, and ε is a small constant that prevents division by zero.
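The weighted fusion of Equation (7) can be sketched in pure Python in the style of BiFPN's fast normalized fusion; the epsilon value and the example weights below are illustrative assumptions, and plain 2-D lists stand in for feature maps of a common resolution.

```python
# Equation (7) as code: each incoming feature map gets a learnable
# non-negative weight; normalizing by the weight sum (plus epsilon) makes
# the fused map an approximate convex combination of its inputs.

def weighted_fusion(maps, weights, eps=1e-4):
    total = sum(weights) + eps
    h, w = len(maps[0]), len(maps[0][0])
    return [[sum(wt * m[i][j] for wt, m in zip(weights, maps)) / total
             for j in range(w)]
            for i in range(h)]

p_low = [[0.0, 0.0], [0.0, 0.0]]    # lower-layer feature map P_{i-1}
p_mid = [[2.0, 2.0], [2.0, 2.0]]    # current-layer feature map P_i
p_high = [[4.0, 4.0], [4.0, 4.0]]   # upper-layer feature map P_{i+1}
fused = weighted_fusion([p_low, p_mid, p_high], [1.0, 1.0, 2.0])
print(fused[0][0])  # close to (0 + 2 + 8) / 4 = 2.5
```

In the real network, the inputs would first be resampled to a common resolution and the weights kept non-negative (e.g. via ReLU) before normalization.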
For each layer of the feature map, BiFPN not only relies on simple up-sampling or convolution operation but also exchanges and weights the information in multiple directions to obtain an optimal fusion result, as shown in
Figure 10.