Article

A Lightweight and High-Performance YOLOv5-Based Model for Tea Shoot Detection in Field Conditions

1 School of Agricultural Engineering, Jiangsu University, Zhenjiang 212013, China
2 School of Electronic Engineering, Changzhou College of Information Technology, Changzhou 213164, China
3 School of Material Science and Engineering, Jiangsu University, Zhenjiang 212013, China
* Author to whom correspondence should be addressed.
Agronomy 2025, 15(5), 1122; https://doi.org/10.3390/agronomy15051122
Submission received: 26 March 2025 / Revised: 28 April 2025 / Accepted: 29 April 2025 / Published: 30 April 2025
(This article belongs to the Collection Advances of Agricultural Robotics in Sustainable Agriculture 4.0)

Abstract
Accurate detection of tea shoots in field conditions is a challenging task for production management and harvesting in tea plantations. Deep learning is well-suited for performing complex tasks due to its robust feature extraction capabilities. However, low-complexity models often suffer from poor detection performance, while high-complexity models are hindered by large size and high computational cost, making them unsuitable for deployment on resource-limited mobile devices. To address this issue, a lightweight and high-performance model was developed based on YOLOv5 for detecting tea shoots in field conditions. Initially, a dataset was constructed based on 1862 images of the tea canopy shoots acquired in field conditions, and the “one bud and one leaf” region in the images was labeled. Then, YOLOv5 was modified with a parallel-branch fusion downsampling block and a lightweight feature extraction block. The modified model was then further compressed using model pruning and knowledge distillation, which led to additional improvements in detection performance. Ultimately, the proposed lightweight and high-performance model for tea shoot detection achieved precision, recall, and average precision of 81.5%, 81.3%, and 87.8%, respectively, which were 0.4%, 0.6%, and 2.0% higher than the original YOLOv5. Additionally, the model size, number of parameters, and FLOPs were reduced to 8.9 MB, 4.2 M, and 15.8 G, representing decreases of 90.6%, 90.9%, and 85.3% compared to YOLOv5. Compared to other state-of-the-art detection models, the proposed model outperforms YOLOv3-SPP, YOLOv7, YOLOv8-X, and YOLOv9-E in detection performance while maintaining minimal dependency on computational and storage resources. The proposed model demonstrates the best performance in detecting tea shoots under field conditions, offering a key technology for intelligent tea production management.

1. Introduction

Tea (Camellia sinensis (L.) O. Kuntze) is an evergreen shrub and a major cash crop known for its distinctive flavor and rich secondary metabolites, which provide various health benefits [1,2,3]. According to the Food and Agriculture Organization of the United Nations, tea planting area and yield increased from 2010 to 2022. In China, the planting area expanded from 1.41 to 3.40 million hectares, while production rose from 6.32 to 14.54 million tons. Tea shoots are not only the raw material for famous and high-quality tea, but also key indicators of tea plant health and growth [4]. Accurate detection of tea shoots in field conditions is a prerequisite for efficient and high-quality operation in intelligent plucking of famous tea, and it helps producers make appropriate market plans and precise field management schemes.
Over the past two decades, numerous studies on tea shoot detection have achieved significant progress. Early studies focused on image processing techniques and mathematical models for recognition, such as threshold segmentation and watershed methods, which extracted the tea shoot region in natural scenes based on pixel-level similarity [5,6,7]. To improve detection accuracy, prior knowledge of image color, shape, and morphology was incorporated, and machine learning models were used to learn the morphological changes in tea shoots [8,9,10]. However, the light in field conditions is complex, and tea shoots are close to old leaves in color while their morphology is variable [11,12], resulting in poor performance of traditional detection methods, which rely on handcrafted feature extraction and trained classifiers. Therefore, to accurately detect tea shoots in field conditions, methods with stronger resistance to background interference and better feature extraction capabilities are needed.
Deep learning offers significant advantages over traditional machine learning methods by extracting high-dimensional features, maximizing data potential, and improving detection accuracy and robustness [13,14,15,16]. Deep learning-based target detection algorithms are categorized into two-stage and one-stage types. Two-stage algorithms, known for high precision, include RCNN [17], Faster RCNN [18], and Mask RCNN [19]. They first locate candidate regions in an image using a selective search algorithm and then classify each candidate region. One-stage algorithms integrate target localization and classification, directly generating multiple predicted bounding boxes and classification probabilities; representative examples are SSD [20] and the YOLO series [21]. Owing to their excellent performance, deep learning-based visual perception systems are widely adopted in agricultural production, including planting [22,23], field management [24,25,26], and harvesting [27]. Zhang et al. proposed a field obstacle detection method using an improved YOLOv8, integrating a binocular camera to accurately localize obstacles and support autonomous unmanned agricultural machines [28]. To enhance visual perception for tomato harvesting robots, Du et al. proposed a multi-task convolutional neural network based on YOLOv5 to detect target location and pose [29]. These studies successfully detected various objects in complex agricultural environments, significantly advancing intelligent agriculture.
The tea shoots consist of tender buds and leaves, categorized as one bud, one bud and one leaf, and one bud and two leaves based on the number of leaves. The tea shoots are small and similar in color to old leaves, making them difficult to accurately identify in field conditions. Therefore, many studies have addressed this agricultural challenge using deep learning methods. Wang et al. constructed a Mask RCNN model based on ResNet50, FPN, and RoIAlign for extracting tea shoots region [30]. Chen and Chen utilized Faster RCNN to detect regions covering “one bud and two leaves” under field conditions [31]. Li et al. employed YOLOv3 to detect the tea shoot regions of “one bud and one leaf” and identified picking points using depth information from an RGB-D camera [32]. To enhance detection performance, Xu et al. proposed a two-level fusion model combining the fast detection ability of YOLOv3 with the high precision classification capability of DenseNet201 for accurate tea shoot detection [33]. Although these methods improved tea shoot detection accuracy to some extent, the high computational cost and large storage requirements made them unsuitable for deployment on mobile platforms and terminals.
To address these challenges, the employed models should deliver superior detection performance while remaining lightweight enough for easy deployment. Currently, lightweight target detection models are mainly obtained by modifying the network architecture with lightweight structures and by compressing the network model [34]. The main strategies for lightweight modification are as follows: (1) replacing the backbone feature extraction network of the original model with lightweight convolutional neural networks, such as ShuffleNet [35], GhostNet [36], MobileNet [37], and EfficientNet [38]; (2) employing ghost convolution [39,40] or depthwise separable convolution [41,42] instead of traditional convolution in the network structure; and (3) integrating attention mechanism modules into the network structure to compensate for the decreased feature extraction capability of lightweight modules [43,44]. Although these approaches can significantly reduce model size while maintaining detection performance, their lightweighting effectiveness still needs improvement. Model pruning is a common model compression method that slims a network by removing unimportant connections [45,46]. Increasing the pruning rate can significantly reduce model size and computational cost, resulting in a very compact model. However, this approach cannot improve detection performance, which may even decrease as the pruning rate increases.
To overcome these problems, this study developed a lightweight and high-performance model for detecting tea shoots in field conditions. To this end, a feasible approach integrating lightweight modifications and model compression techniques was proposed. YOLOv5 was modified by introducing a parallel-branch fusion downsampling block and a lightweight feature extraction block to reduce information loss during downsampling and improve feature extraction capability. Model compression, combining channel pruning and knowledge distillation, was then used to further slim the modified model. Pruning removed the channels with low contribution to tea shoot detection, yielding a sparse and compact model. Knowledge learned by a larger model was transferred to this compact model through knowledge distillation, giving the model a high capability for extracting critical tea shoot features at low complexity. The proposed model is both lightweight and high-performing, making it well suited for deployment on mobile platforms with limited computational resources for intelligent famous tea plucking or tea plant phenotyping, and providing technical support for the development of the tea industry. The main contributions are as follows:
(1)
To reduce the information loss of tea shoot features in the traditional downsampling process, a parallel-branch fusion downsampling block was proposed through the combination of max-pooling and depthwise separable convolution;
(2)
A lightweight feature extraction block based on partial convolution was constructed, with an enhanced ability to extract critical features of tea shoots;
(3)
A lightweight and high-performance tea shoot detection model was proposed by combining lightweight modifications and model compression, which effectively reduces misdetection and omission.

2. Materials and Methods

The intelligent agricultural operation mobile platforms often face constraints in storage and computing capacity, making it impractical to deploy large and computationally intensive models. Therefore, a lightweight and high-performance model for detecting tea shoots was proposed to improve the detection performance, while reducing the dependence on the computational and storage resources of the deployed platforms. The detailed research procedure is shown in Figure 1, which was divided into three steps. In step 1, the images of tea shoots were acquired in the field, and the one bud and one leaf region in the image was labelled as “TS” to construct the dataset. This dataset served as the research foundation for model training and performance evaluation. In step 2, the YOLOv5 was modified using lightweight blocks. A parallel-branch fusion downsampling block (PFD) was introduced to downsample the image feature maps, and the influences of different lightweight feature extraction blocks (FC3, GC3, and DSC3) on the model’s detection performance were compared, resulting in the development of a lightweight model with improved detection performance, named L-YOLO. In step 3, L-YOLO was further optimized by model compression methods. Unimportant connections in the networks were removed through model pruning, and the effects of various pruning rates on the pruned model’s performance were analyzed to obtain a large and sparse model (PL-YOLO). By using L-YOLO as the teacher model and the pruned model as the student model, knowledge distillation was employed to recover the detection performance losses caused by model pruning. The final model (HLTS-Model) exhibits excellent detection performance with reduced model size and computational complexity, which improves deployment friendliness on mobile platforms.

2.1. Data Acquisition and Dataset Construction

Tea shoot images were acquired from the Yinchunbiya tea plantation (Figure 2a), Zhenjiang, Jiangsu province, in April 2024. A standardized production and management pattern was adopted in this tea plantation, where the tea plant was pruned in September or October every year. The selected tea plant cultivar was Zhongcha108, aged 14 years. The images were captured using a digital camera (Canon PowerShot SX30 IS, Canon, Tokyo, Japan) and a smartphone (iPhone8, Apple, Shenzhen, China), with resolutions of 4320 × 3240 pixels and 4032 × 3024 pixels, respectively. To ensure the applicability and generalization of the developed model, the images were obtained under different weather conditions (sunny and cloudy) and at various times of the day (Figure 2b). Specifically, the height of shooting was 15 cm to 30 cm above the canopy, the angle of shooting was 30° to 80°, and the time of shooting included 7:00 a.m. to 9:00 a.m., 11:00 a.m. to 2:00 p.m., and 4:00 p.m. to 6:00 p.m. As shown in Figure 2b, the tea shoots were mainly distributed on the leveled canopy surface, which is the detection target in this study. A total of 1862 images of tea canopy shoots were acquired, including 758 and 1104 images by digital camera and smartphone, respectively. Annotation was performed using LabelImg (https://github.com/tzutalin/labelImg (accessed on 15 September 2024)), with the “one bud and one leaf” as the labeled object. This annotation information was saved with XML files following the PASCAL VOC format. The dataset was randomly split into training set, validation set, and testing set in a ratio of 6:2:2, and the number of images in them was 1118, 372, and 372, respectively.
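As an illustration of how such a 6:2:2 split can be reproduced, the following sketch randomly partitions the annotated images; the directory layout, file extension, and random seed are assumptions rather than details reported in the paper.

```python
import random
from pathlib import Path

def split_dataset(image_dir: str, seed: int = 0, ratios=(0.6, 0.2, 0.2)):
    """Randomly partition annotated images into train/val/test subsets (6:2:2)."""
    images = sorted(Path(image_dir).glob("*.jpg"))   # assumed extension
    random.Random(seed).shuffle(images)
    n = len(images)
    n_train, n_val = int(ratios[0] * n), int(ratios[1] * n)
    train = images[:n_train]
    val = images[n_train:n_train + n_val]
    test = images[n_train + n_val:]
    return train, val, test

# With 1862 images, integer rounding gives a split close to the paper's
# reported 1118/372/372; the exact counts depend on the rounding convention.
```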

2.2. Overall Structure of YOLOv5

YOLOv5 was released in 2020 and has undergone continuous updates, with the latest version being v7.0 [47]. The network architecture comprises three main components: backbone, neck, and head (Figure 3). The backbone consists of the Focus layer, CSPDarkNet53, and SPPF. The Focus layer performs a slicing operation on the input image, downsampling it while retaining all information. CSPDarkNet53 is composed of a stack of CBS and C3 blocks. The CBS block combines a convolutional layer (Conv), a batch normalization layer (BN), and a SiLU activation function. The C3 block divides the input feature map into two branches: the first branch passes sequentially through a CBS block and n × Res_block, while the second branch passes through a single CBS block. The two branches are then concatenated through the Concat layer, and the features are further fused using another CBS block. This operation enhances the network's learning ability while reducing the number of parameters. The SPPF block performs serial pooling with 5 × 5 maximum pooling kernels on the input feature maps and concatenates the outputs of each node through the Concat layer to obtain feature information at different scales. The neck consists of a feature pyramid network (FPN) and a path aggregation network (PAN). The FPN fuses features through top-down up-sampling, while the PAN transmits features in a bottom-up pyramid manner, enhancing feature extraction capability. Finally, the head predicts the bounding boxes, classes, and confidence scores. Depending on the width and depth of the network, YOLOv5 is categorized into YOLOv5-N, YOLOv5-S, YOLOv5-M, YOLOv5-L, and YOLOv5-X.
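For clarity, a minimal PyTorch sketch of the CBS and C3 blocks described above is given below; it follows the Conv–BN–SiLU and two-branch Concat structure of the description, while channel widths, padding, and the residual bottleneck details are simplified assumptions rather than the exact Ultralytics implementation.

```python
import torch
import torch.nn as nn

class CBS(nn.Module):
    """Conv + BatchNorm + SiLU, as in the CBS block description."""
    def __init__(self, c_in, c_out, k=1, s=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, s, padding=k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class ResBlock(nn.Module):
    """Residual unit (Res_block) built from two CBS layers with a shortcut."""
    def __init__(self, c):
        super().__init__()
        self.cv1 = CBS(c, c, k=1)
        self.cv2 = CBS(c, c, k=3)

    def forward(self, x):
        return x + self.cv2(self.cv1(x))

class C3(nn.Module):
    """Two-branch block: CBS + n residual units in one branch, a single CBS in
    the other; outputs are concatenated and fused by another CBS."""
    def __init__(self, c_in, c_out, n=1):
        super().__init__()
        c_hid = c_out // 2
        self.branch1 = nn.Sequential(CBS(c_in, c_hid),
                                     *[ResBlock(c_hid) for _ in range(n)])
        self.branch2 = CBS(c_in, c_hid)
        self.fuse = CBS(2 * c_hid, c_out)

    def forward(self, x):
        return self.fuse(torch.cat([self.branch1(x), self.branch2(x)], dim=1))
```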

2.3. Model Lightweighting Modifications

To obtain a lightweight and high-performance model for detecting tea shoots in field conditions, this study first modified YOLOv5 and proposed a lightweight tea shoot detection model, named L-YOLO (Figure 4).

2.3.1. Parallel-Branch Fusion Downsampling Block

Convolutional neural networks commonly use convolution or pooling operations to downscale feature maps, which helps mitigate overfitting but may lead to the loss of critical information [48]. The model's representation ability can be enhanced by adding branch structures [49]. To address the reduction in effective information during downsampling while performing the necessary channel conversion, a parallel-branch fusion downsampling block (PFD) was introduced into the model's backbone and neck in two variants: PFD_1 and PFD_2. The PFD mainly consists of max-pooling (MaxPool), 1 × 1 convolution (CBS), 3 × 3 depthwise separable convolution (DBS), and concatenation (Concat) (Figure 5).
The PFD_1 used two parallel branches to process the input feature map. In one branch, the input feature map underwent max-pooling with a kernel size of 2 and a stride of 2 for scaling, followed by a 1 × 1 convolution for feature fusion. In the other branch, feature extraction was performed using a 3 × 3 depthwise separable convolution with a stride of 2. The outputs from both branches were concatenated using the Concat layer, thereby achieving both feature map scaling and channel expansion. For the PFD_2, the input feature maps were initially split into two equal parts based on the channel number. Each part then underwent operations mirroring those in PFD_1 to achieve feature map scaling without altering the number of channels.
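The following PyTorch sketch illustrates the two PFD variants as described above (max-pool plus 1 × 1 convolution in one branch, stride-2 depthwise separable convolution in the other, followed by concatenation); normalization and activation choices are assumptions, not the exact implementation.

```python
import torch
import torch.nn as nn

class DBS(nn.Module):
    """3x3 depthwise separable convolution with BN and SiLU (illustrative)."""
    def __init__(self, c_in, c_out, stride=2):
        super().__init__()
        self.dw = nn.Conv2d(c_in, c_in, 3, stride, 1, groups=c_in, bias=False)
        self.pw = nn.Conv2d(c_in, c_out, 1, 1, 0, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.pw(self.dw(x))))

class PFD1(nn.Module):
    """Downsampling with channel expansion: max-pool + 1x1 conv in one branch,
    stride-2 depthwise separable conv in the other, outputs concatenated."""
    def __init__(self, c):
        super().__init__()
        self.pool = nn.MaxPool2d(2, 2)
        self.cbs = nn.Sequential(nn.Conv2d(c, c, 1, bias=False),
                                 nn.BatchNorm2d(c), nn.SiLU())
        self.dbs = DBS(c, c, stride=2)

    def forward(self, x):
        return torch.cat([self.cbs(self.pool(x)), self.dbs(x)], dim=1)  # c -> 2c

class PFD2(nn.Module):
    """Downsampling with constant channels: split the input into two halves
    along the channel axis and apply the PFD_1-style branches to each half."""
    def __init__(self, c):
        super().__init__()
        h = c // 2
        self.pool = nn.MaxPool2d(2, 2)
        self.cbs = nn.Sequential(nn.Conv2d(h, h, 1, bias=False),
                                 nn.BatchNorm2d(h), nn.SiLU())
        self.dbs = DBS(h, h, stride=2)

    def forward(self, x):
        x1, x2 = torch.chunk(x, 2, dim=1)
        return torch.cat([self.cbs(self.pool(x1)), self.dbs(x2)], dim=1)  # c -> c
```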
For the downsampling process with an expanded number of output channels, the numbers of parameters required by a standard convolution ($\mathrm{Conv_{exp}\text{-}parameters}$) and by PFD_1 ($\mathrm{PFD\_1\text{-}parameters}$) are calculated as follows:

$$\mathrm{Conv_{exp}\text{-}parameters} = 2c \times c \times 3 \times 3$$

$$\mathrm{PFD\_1\text{-}parameters} = c \times c + c \times 3 \times 3 + c \times c$$

where c is the number of channels and 3 × 3 is the convolutional kernel size.

The ratio of $\mathrm{Conv_{exp}\text{-}parameters}$ to $\mathrm{PFD\_1\text{-}parameters}$ is as follows:

$$\frac{\mathrm{Conv_{exp}\text{-}parameters}}{\mathrm{PFD\_1\text{-}parameters}} = \frac{2c \times c \times 3 \times 3}{c \times c + c \times 3 \times 3 + c \times c} = \frac{18c}{2c + 9}$$
For the downsampling process with a constant number of output channels, the numbers of parameters required by a standard convolution ($\mathrm{Conv_{fix}\text{-}parameters}$) and by PFD_2 ($\mathrm{PFD\_2\text{-}parameters}$) are calculated as follows:

$$\mathrm{Conv_{fix}\text{-}parameters} = c \times c \times 3 \times 3$$

$$\mathrm{PFD\_2\text{-}parameters} = \frac{c}{2} \times \frac{c}{2} + \frac{c}{2} \times 3 \times 3 + \frac{c}{2} \times \frac{c}{2}$$

The ratio of $\mathrm{Conv_{fix}\text{-}parameters}$ to $\mathrm{PFD\_2\text{-}parameters}$ is as follows:

$$\frac{\mathrm{Conv_{fix}\text{-}parameters}}{\mathrm{PFD\_2\text{-}parameters}} = \frac{c \times c \times 3 \times 3}{\frac{c}{2} \times \frac{c}{2} + \frac{c}{2} \times 3 \times 3 + \frac{c}{2} \times \frac{c}{2}} = \frac{18c}{c + 9}$$
From the above equations, the PFD block effectively reduces the number of parameters in the downsampling process by integrating max-pooling, depthwise separable convolution, and channel split.
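As a quick numeric check of these ratios, the short script below evaluates the four parameter-count expressions for a couple of channel widths; it reproduces the closed-form ratios 18c/(2c + 9) and 18c/(c + 9).

```python
def conv_exp_params(c):   # standard 3x3 conv, c -> 2c channels
    return 2 * c * c * 3 * 3

def pfd1_params(c):       # 1x1 conv + depthwise 3x3 + pointwise 1x1
    return c * c + c * 3 * 3 + c * c

def conv_fix_params(c):   # standard 3x3 conv, c -> c channels
    return c * c * 3 * 3

def pfd2_params(c):       # the same branches applied to two halves of the channels
    h = c // 2
    return h * h + h * 3 * 3 + h * h

for c in (64, 256):
    print(c,
          round(conv_exp_params(c) / pfd1_params(c), 2),   # 18c / (2c + 9)
          round(conv_fix_params(c) / pfd2_params(c), 2))   # 18c / (c + 9)
# e.g. c = 64: roughly 8.4x and 15.8x fewer parameters for PFD_1 and PFD_2
```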

2.3.2. Lightweight Feature Extraction Block

Due to the high similarity of feature maps across channels, computational costs could be reduced by exploiting their redundancy. Partial convolution accelerates computation by selectively processing a portion of the input feature map, thereby avoiding the full computation of all convolution kernels and inputs [50]. Specifically, the input feature map is divided into two parts in a certain proportion. Feature extraction is performed on one part through convolution, while the other part remains unchanged. The two parts are then aggregated through a concatenation operation (Figure 6).
Compared to traditional convolution, partial convolution requires fewer floating-point operations (FLOPs). The FLOPs of traditional convolution are calculated as follows:
$$\mathrm{FLOPs}_T = k \times k \times c \times c' \times w \times h$$
The FLOPs of partial convolution are calculated as follows:
$$\mathrm{FLOPs}_P = k \times k \times rc \times rc' \times w \times h$$
where k × k is the convolution kernel size; c is the number of input feature map channels; c′ is the number of output feature map channels; w is the width of the input feature map; h is the height of the input feature map; and r is the scale factor.
The 3 × 3 convolution within the C3 block of YOLOv5 maintains a constant number of channels in the feature map. Therefore, when substituting conventional convolution with partial convolution and setting $r = 1/4$, the theoretical speedup can be calculated as follows:

$$\frac{\mathrm{FLOPs}_T}{\mathrm{FLOPs}_P} = \frac{k \times k \times c \times c' \times w \times h}{k \times k \times rc \times rc' \times w \times h} = \frac{1}{r^2} = 16$$
Hence, this study proposed a lightweight feature extraction block, FC3 (FC3_1 and FC3_2) (Figure 7), by replacing the 3 × 3 traditional convolutions in the C3 block of the original YOLOv5 with partial convolutions; FC3_1 and FC3_2 replace the C3 blocks in the backbone feature extraction network and the neck of the original YOLOv5, respectively.
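A minimal sketch of the partial convolution used inside FC3 is shown below; it follows the split–convolve–concatenate operation described above with r = 1/4, while the exact placement inside FC3_1/FC3_2 (Figure 7) is omitted.

```python
import torch
import torch.nn as nn

class PConv(nn.Module):
    """Partial convolution: apply a 3x3 convolution only to the first r*c
    channels and pass the remaining channels through unchanged."""
    def __init__(self, c, r=0.25, k=3):
        super().__init__()
        self.c_conv = max(1, int(c * r))          # channels that are convolved
        self.conv = nn.Conv2d(self.c_conv, self.c_conv, k, 1, k // 2, bias=False)

    def forward(self, x):
        x1, x2 = torch.split(x, [self.c_conv, x.size(1) - self.c_conv], dim=1)
        return torch.cat([self.conv(x1), x2], dim=1)

# Example: a 64-channel feature map convolves only 16 channels (r = 1/4),
# so this layer performs roughly 1/16 of the FLOPs of a full 3x3 convolution.
pconv = PConv(64)
y = pconv(torch.randn(1, 64, 80, 80))             # output keeps 64 channels
```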

2.4. Model Pruning and Knowledge Distillation

Large neural networks benefit from complex structures and a vast number of parameters, achieving remarkable success and high performance in real-world scenarios [51]. However, their computational complexity and significant storage requirements pose challenges for deployment in real-time applications, especially on mobile devices with limited resources. Although small models with reduced computational demands can meet the real-time requirements of mobile devices, they often exhibit poor detection performance [52]. Zhu and Gupta demonstrated that a large sparse model could achieve higher accuracy than a small dense model [53]. Thus, to develop a more efficient model for detecting tea shoots in natural environments, this study employed the layer-adaptive magnitude-based pruning (LAMP) method to compress the modified lightweight model and utilized knowledge distillation to restore the detection performance of the pruned model. LAMP approximates the impact of pruning on model distortion using the LAMP score and removes the connections with the smallest scores until the target sparsity is reached [54]. Consider a feedforward neural network of depth d with weight tensors $W^{(1)}, \ldots, W^{(d)}$. For fully connected and convolutional layers, the corresponding weight tensors are two-dimensional and four-dimensional, respectively. To define the LAMP score uniformly for both types of layers, each weight tensor is assumed to be unrolled (flattened) into a one-dimensional vector. Assuming, without loss of generality, that the weights are sorted in ascending order according to a given index map, the condition $|W[u]| \le |W[v]|$ holds whenever $u < v$, where $W[u]$ denotes the entry of W mapped to by the u-th index. The LAMP score of the u-th index of the weight tensor W is given as follows:
$$\mathrm{score}(u; W) = \frac{(W[u])^2}{\sum_{v \ge u} (W[v])^2}$$
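A minimal sketch of this per-layer score computation is given below; the flattening, ascending sort, and suffix-sum denominator follow the definition above, while the tensor-handling details are assumptions.

```python
import torch

def lamp_scores(weight: torch.Tensor) -> torch.Tensor:
    """LAMP score for every weight in one layer: W[u]^2 divided by the sum of
    W[v]^2 over all v >= u (i.e., over weights with magnitude not smaller)."""
    w2 = weight.detach().flatten().pow(2)
    sorted_w2, order = torch.sort(w2)                          # ascending |W[u]|
    # suffix sums give the denominator: sum over v >= u of W[v]^2
    suffix = torch.flip(torch.cumsum(torch.flip(sorted_w2, [0]), 0), [0])
    scores_sorted = sorted_w2 / (suffix + 1e-12)
    scores = torch.empty_like(scores_sorted)
    scores[order] = scores_sorted                              # back to original order
    return scores.view_as(weight)

# Connections with the smallest scores across layers are the first candidates
# for removal when pruning toward a target sparsity.
```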
Knowledge distillation enhances the performance of lightweight models by transferring knowledge from a large, complex neural network (the teacher model) to a small, simple neural network (the student model). This process enables the student model to achieve competitive or even superior performance compared to the teacher model. Based on the nature of the transferred knowledge, knowledge distillation is categorized into three types: response-based, feature-based, and relation-based [55]. Response-based knowledge distillation uses soft labels or probability distributions generated by the teacher model as training targets for the student model [56]. This approach allows the student model to learn the predictive patterns of the teacher model. Feature-based knowledge distillation transfers knowledge through the feature map in the middle layer of the teacher model [57]. The student model improves its detection performance by learning the features extracted from these middle layers of the teacher model. Relation-based knowledge distillation transfers knowledge through feature relationships, such as similarities between features and dependencies within the teacher model [58]. The student model learns these relationships to capture the structural information of the input data more effectively. In this study, the model obtained from lightweight modification served as the teacher model, and the pruned model was the student model.
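As an illustration, the response-based transfer used here can be written as a soft-label KL-divergence term between teacher and student outputs; the temperature, weighting, and the exact outputs being matched are assumptions rather than the paper's reported settings.

```python
import torch
import torch.nn.functional as F

def response_kd_loss(student_logits, teacher_logits, temperature=2.0):
    """Response-based distillation loss sketch (Hinton-style soft-label term).

    Both inputs are assumed to be raw class logits of matching shape."""
    t = temperature
    soft_teacher = F.softmax(teacher_logits / t, dim=-1)
    log_soft_student = F.log_softmax(student_logits / t, dim=-1)
    # KL divergence scaled by T^2 so gradients keep a comparable magnitude
    return F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * t * t
```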

2.5. Hyperparameter Setting and Experimental Platform

The PyTorch (version 1.13.1) deep learning framework was employed to train and evaluate the proposed model. The experimental platform was configured with an AMD 3700X CPU, an NVIDIA GeForce RTX3080Ti GPU with 12 GB of memory, and the Windows 10 operating system. The stochastic gradient descent optimizer was employed to optimize the model. The batch size was set to 8, the input image size was 640 × 640, and the training process was conducted over 100 epochs with a momentum value of 0.937. Additionally, the weight decay was set to 0.0005, the initial learning rate was 0.01, and the cosine annealing learning rate strategy was applied. The cosine annealing strategy gradually reduces the learning rate, which promotes better convergence and performance of the model during training.
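The optimizer and learning-rate schedule described above correspond roughly to the following PyTorch setup; the placeholder model and training loop are assumptions, since training in practice goes through the YOLOv5 codebase.

```python
import torch
from torch.optim import SGD
from torch.optim.lr_scheduler import CosineAnnealingLR

EPOCHS = 100
model = torch.nn.Conv2d(3, 16, 3)   # placeholder for the YOLOv5-based network
optimizer = SGD(model.parameters(), lr=0.01, momentum=0.937, weight_decay=0.0005)
scheduler = CosineAnnealingLR(optimizer, T_max=EPOCHS)  # cosine-annealed LR decay

for epoch in range(EPOCHS):
    # ... one training pass over 640 x 640 images with batch size 8 ...
    optimizer.step()      # stands in for the per-batch parameter updates
    scheduler.step()      # decay the learning rate once per epoch
```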

3. Results and Discussion

3.1. Detection Performance Metrics

The evaluation metrics of the model's detection performance are precision (P), recall (R), and average precision (AP). The intersection over union (IoU) threshold is set to 0.5, meaning that a prediction is deemed correct when the IoU between the ground truth box and the predicted box exceeds 0.5. The computational formulas for these evaluation metrics are as follows:
$$\mathrm{Precision} = \frac{TP}{TP + FP}$$

$$\mathrm{Recall} = \frac{TP}{TP + FN}$$

$$AP = \int_0^1 P(R)\,dR$$
where TP is the count of accurately detected tea shoot samples; FP is the count of mistakenly identified tea shoot samples; and FN is the count of missed tea shoot samples.
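The sketch below shows how these metrics can be computed from detection counts and from a precision-recall curve swept over confidence thresholds; the all-point interpolation of the PR curve is a common convention and an assumption about the exact evaluation code.

```python
import numpy as np

def precision_recall(tp, fp, fn):
    """Precision and recall from detection counts at IoU >= 0.5."""
    return tp / (tp + fp), tp / (tp + fn)

def average_precision(recall, precision):
    """AP as the area under the precision-recall curve.

    `recall` and `precision` are arrays obtained by sweeping the confidence
    threshold; the monotonic precision envelope follows common practice."""
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([1.0], precision, [0.0]))
    p = np.maximum.accumulate(p[::-1])[::-1]      # precision envelope
    idx = np.where(r[1:] != r[:-1])[0]
    return np.sum((r[idx + 1] - r[idx]) * p[idx + 1])
```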

3.2. Model Performance with Lightweighting Modifications

3.2.1. Influence of PFD and FC3 Blocks on Model Performance

To balance detection performance and model complexity, this study modified YOLOv5 by introducing the PFD and FC3 blocks (Table 1). As shown in Table 1, the model size, the number of parameters, and FLOPs were significantly reduced by integrating the PFD and FC3 blocks into YOLOv5, indicating that the modification strategy effectively reduced model complexity. However, the reduced complexity of YOLOv5-N and YOLOv5-S was accompanied by a decrease in detection performance, which may be related to the simple structure and few parameters of these models: further reducing their complexity likely weakens their feature representation ability and prevents them from capturing complex features. For the larger models (YOLOv5-M, YOLOv5-L, and YOLOv5-X), this modification strategy maintained or even improved detection performance while achieving lightweighting, indicating that the PFD and FC3 blocks effectively reduced the redundancy of the complex models, allowing them to better capture contextual information and multilevel features. Among these models, YOLOv5-L + PFD + FC3 and YOLOv5-X + PFD + FC3 exhibited similar detection performance, significantly higher than the other models, with an average precision of 87.2%. By comparison, YOLOv5-L + PFD + FC3 has a smaller model size, fewer parameters, and lower FLOPs than YOLOv5-X + PFD + FC3. Therefore, YOLOv5-L + PFD + FC3 was chosen as the lightweight detection model for tea shoots, named L-YOLO.

3.2.2. Comparison of Different Lightweight Feature Extraction Blocks

To evaluate the effectiveness of FC3, this study compared it with GC3 (replacing the conventional convolution in the C3 block of the original network with ghost convolution) and DSC3 (replacing the conventional convolution in the C3 block of the original network with depthwise separable convolution). The detection performance of PFD with FC3, GC3, and DSC3 for modifying YOLOv5-L is shown in Table 2. By introducing FC3, GC3, and DSC3, the model size, the number of parameters, and FLOPs were effectively reduced, with DSC3 performing the best. Compared to YOLOv5-L, YOLOv5-L + PFD + DSC3 reduced the model size, number of parameters, and FLOPs by 73.9%, 72.9%, and 74.7%, respectively. However, the detection performance of YOLOv5-L + PFD + DSC3 decreased with a reduction in the average precision of 0.8%. In contrast with YOLOv5-L + PFD + DSC3, YOLOv5-L + PFD + FC3 has a certain disadvantage in model lightweighting (higher model size, number of parameters, and FLOPs), but it has better detection performance with 2.2% higher average precision. The above analysis results indicated that FC3 could effectively reduce the model complexity, while improving the detection performance.
To further explore the internal process of feature extraction with different lightweight feature extraction blocks, the gradient-weighted class activation mapping method (Grad-CAM) [59] was employed to visualize the regions of interest of the various models (Figure 8). This method intuitively demonstrates whether the models learned the critical features of the tea shoots. In the attention heat map, the importance of each image position for the task of detecting tea shoots is indicated by color, with darker colors representing higher positive reactivity and greater significance to the model. Models with the various lightweight feature extraction blocks mainly focus on the tea shoot regions and pay less attention to the background. Among the compared blocks, FC3 demonstrated the strongest capability to concentrate attention on the tea shoot regions based on the learned knowledge.
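A minimal hook-based sketch of how such a Grad-CAM heat map can be produced is shown below; the choice of target layer and the use of the maximum objectness score as the backpropagated signal are assumptions, not the exact visualization pipeline used here.

```python
import torch
import torch.nn.functional as F

def grad_cam(model, image, target_layer):
    """Compute a Grad-CAM heat map for one image by hooking a target layer,
    backpropagating a detection score, and weighting the layer's activations
    by their channel-averaged gradients."""
    activations, gradients = [], []

    def fwd_hook(module, inputs, output):
        activations.append(output)

    def bwd_hook(module, grad_input, grad_output):
        gradients.append(grad_output[0])

    h_fwd = target_layer.register_forward_hook(fwd_hook)
    h_bwd = target_layer.register_full_backward_hook(bwd_hook)

    model.eval()
    preds = model(image)               # assumed shape: (1, N, 5 + num_classes)
    score = preds[..., 4].max()        # highest objectness score (assumption)
    model.zero_grad()
    score.backward()

    h_fwd.remove()
    h_bwd.remove()

    act = activations[0]                               # (1, C, H, W)
    grad = gradients[0]                                # (1, C, H, W)
    weights = grad.mean(dim=(2, 3), keepdim=True)      # channel importance
    cam = F.relu((weights * act).sum(dim=1))           # (1, H, W)
    return cam / (cam.max() + 1e-8)                    # normalize to [0, 1]
```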

3.3. Model Performance with Model Compression

Based on the above results, the model's size, number of parameters, and FLOPs were significantly reduced while maintaining detection accuracy in the modified model (L-YOLO). To further enhance the model's deployment friendliness, the LAMP method was utilized to prune L-YOLO. The detection performance, model size, number of parameters, and frames per second (FPS) of the model under different pruning rates are shown in Figure 9. As the pruning rate increased, more of the original connections were removed from the network, gradually decreasing the model size and number of parameters. When the pruning rate was less than 0.6, the detection performance of the models changed little; beyond that, further increases in the pruning rate led to a significant decrease in detection performance. Small pruning rates remove only redundant connections, while large pruning rates remove essential connections, weakening the model's ability to represent complex features [60,61]. Interestingly, the FPS of the pruned models did not increase consistently despite the reduced model complexity; instead, it oscillated, likely due to structural limitations of the network model [62]. Nonetheless, all pruned models outperformed L-YOLO in terms of FPS, with the highest FPS observed at a pruning rate of 0.4. To develop a lightweight and high-performance model for detecting tea shoots in field conditions, knowledge distillation was employed to recover the detection performance of the models with pruning rates ≥ 0.6. Different knowledge transfer types, including response-based knowledge (Res-K), feature-based knowledge (Fea-K), and relation-based knowledge (Rel-K), were analyzed for their effect on model detection performance.
Figure 10 shows that detection performance was significantly improved by knowledge distillation, with response-based knowledge distillation achieving the highest performance across the different pruning rates. In response-based distillation, the output of the teacher model is used as the learning objective for the student model, allowing the student model to effectively capture the decision information of the teacher model. In contrast, the other knowledge transfer types rely on aligning intermediate feature layers of the teacher and student models, which is challenging because pruning changes the internal structure of the student model [63]. Response-based distillation applies directly to the output layer, making it more adaptable to the pruned model's structure. The average precision of the models using response-based knowledge distillation under the various pruning rates was 87.9%, 87.8%, 86.7%, and 80.8%, reflecting improvements of 1.1%, 1.8%, 2.2%, and 5.9%, respectively. Considering both detection performance and model complexity, the model using response-based knowledge distillation with a pruning rate of 0.7 was selected as the lightweight and high-performance model for detecting tea shoots.

3.4. Ablation Experiments and Detection Results Comparison

Ablation experiments were conducted to assess the impact of specific network substructures and training strategies on model performance. This study involved three key modifications to YOLOv5-L. First, the PFD was used to replace the convolutional layers with a stride of 2 in the network for downsampling the feature maps. Then, the C3 block in the network was replaced by the FC3, a lightweight feature extraction block constructed based on partial convolution. Finally, the model compression methods combining model pruning and knowledge distillation (PK) were used to further slim the model. The performance changes resulting from these modifications were analyzed through ablation experiments (Table 3).
The PFD effectively reduced information loss during the downsampling process through its parallel structure, increasing average precision by 0.3% while reducing the model size, the number of parameters, and FLOPs by 18.6 MB, 7.9 M, and 16.9 G, respectively. Combining PFD and FC3 allowed the model to learn the key features of tea shoots more efficiently and improved detection performance, with an increase of 1.1% in average precision, while the model size, the number of parameters, and FLOPs were reduced by 29.6 MB, 15.2 M, and 37.6 G, respectively. Comparing the training processes before and after the lightweight modification, the modified model shows faster loss reduction and convergence owing to its better feature extraction capability (Figure 11). The modified model was further optimized with the model compression method, which increased average precision by 0.6% and reduced the model size, the number of parameters, and FLOPs by 37.7 MB, 18.8 M, and 37.3 G, respectively. The final lightweight and high-performance model achieved superior performance compared to YOLOv5-L, with improvements of 0.4% in precision, 0.6% in recall, and 2.0% in average precision, while the model size, number of parameters, and FLOPs were reduced by 90.6%, 90.9%, and 85.3%, respectively. These modifications lowered real-time detection speed, with FPS dropping from 58.4 to 44.5, which still meets real-time detection requirements. Overall, the modifications significantly improve the model's efficiency and resource-friendliness, making it a practical and reliable solution for detecting tea shoots. The ablation experiments demonstrated that each modification step contributed positively to the overall performance of the model. The final model, incorporating PFD, FC3, and model compression, exhibits superior detection capability and efficiency compared to the original YOLOv5-L.
The proposed lightweight and high-performance tea shoot detection model, named HLTS-Model, demonstrates significant advantages over the YOLOv5-L model. A comparative analysis of detection results between the two models is illustrated in Figure 12.
(1)
Improvement of bounding box prediction: Tea shoots often include buds and leaves with similar morphological features, complicating accurate bounding box prediction. The HLTS-Model demonstrated enhanced accuracy and confidence in bounding box predictions, effectively differentiating between these similar features (Figure 12a,b).
(2)
Reduction in omissions due to shading: Shading on tea leaves, stalks, and shoots can significantly impact the detection performance. The HLTS-Model notably reduced such omissions, providing more reliable detection even in shaded areas (Figure 12c,d).
(3)
Enhancing detection of small objects: Tea shoots vary considerably in size due to factors such as the sprouting period, location, and climatic conditions. The HLTS-Model excelled in detecting small-sized tea shoots, showcasing its effectiveness in handling significant size variations (Figure 12e,f).
(4)
Mitigation of misdetection: Misdetection often arises from the similarity between tea shoots and leaves, compounded by variable lighting conditions. For instance, the “second leaf” can resemble the tea shoots during the “one bud and two leaves” period. The HLTS-Model effectively reduced such misdetections, addressing the challenge of distinguishing between similar features under varying light conditions (Figure 12g,h).
The HLTS-Model demonstrated superior performance in various aspects of tea shoot detection compared to YOLOv5-L. It achieved higher detection accuracy and fewer missed and false detections than YOLOv5-L in challenging field conditions, including shading, large-scale size variations, and significant light changes. The model therefore represents a substantial advancement, providing a robust solution for accurately detecting tea shoots in field conditions.

3.5. Contrast Experiment

3.5.1. Comparison of Different Detection Models

To intuitively demonstrate the advantages of the proposed model in this study, the performance was compared with other state-of-the-art detection models under the same experimental platform and training parameters (in Section 2.5). The used dataset was the self-constructed tea shoot detection dataset of this study, and the image data utilized for training, validation, and performance evaluation were consistent across different models. These models include Faster RCNN, CenterNet [64], FCOS [65], YOLOv3 [66], YOLOX [67], YOLOv6 [68], TOOD [69], YOLOv7 [70], YOLOv8, and YOLOv9 [71] (Table 4), which encompass both two-stage and one-stage, as well as anchor-free and anchor-based detection methods.
The HLTS-Model achieved the highest average precision of 87.8%, outperforming YOLOv3-SPP by 2.6%, YOLOv7 by 2.7%, YOLOv8-X by 2.3%, and YOLOv9-E by 0.8%. Although FCOS achieved the highest precision of 83.7%, its average precision was 11.0% lower than that of the HLTS-Model. YOLOX-Tiny, with the highest recall of 86.7%, had an average precision 4.5% lower than that of the HLTS-Model. YOLOv8-N, with a size of 6.2 MB, 3.0 M parameters, and FLOPs of 8.1 G, was well suited for deployment on platforms with limited computational and storage resources. In comparison, the HLTS-Model, though slightly larger, remained highly efficient with a model size of 8.9 MB, 4.2 M parameters, and FLOPs of 15.8 G. The increased complexity of models such as YOLOv8-X improved performance, with average precision rising from 82.4% for YOLOv8-N to 85.5%. However, this improvement required significantly more computational resources, with the model size, the number of parameters, and FLOPs increasing to 136.7 MB, 68.1 M, and 257.4 G, respectively. YOLOv8-N achieved the highest FPS of 80.0, indicating that lower-complexity models can offer faster inference, albeit at some cost to detection performance. The HLTS-Model achieved a balanced trade-off, with an inference speed of 44.5 FPS, which is adequate for real-time applications. Compared to the recently proposed YOLOv9-E, the HLTS-Model showed better deployment friendliness and higher detection performance, while its model size, number of parameters, and FLOPs were only 6.4%, 6.1%, and 6.5% of those of YOLOv9-E. The HLTS-Model thus offered an optimal combination of high detection accuracy, reasonable model size, low computational cost, and competitive inference speed. From the confusion matrices of the different models (Figure 13), it can be observed that Faster RCNN, YOLOv3-SPP, TOOD, YOLOv8-X, and YOLOv9-E accurately recognized 73%, 85%, 71%, 86%, and 86% of the tea shoots on the test set, respectively, while incorrectly recognizing 27%, 15%, 29%, 14%, and 14% of the tea shoots as background. As depicted in Figure 12, the accurate detection of tea shoots in field conditions poses a significant challenge due to variations in light conditions, growth period, and the morphology of the tea shoots. The model proposed in this study accurately detected 90% of the tea shoots on the test set, incorrectly detecting only 10% of them as background. This indicates that the proposed model extracts the key features for recognizing tea shoots in variable natural environments more effectively, significantly reducing the influence of interfering factors.

3.5.2. Comparison with Other Existing Models

This study investigated and compared several models proposed for tea shoot detection in recent years (Table 5). The HLTS-Model has significant advantages in both detection performance and model complexity over the models developed by Zhang et al. [4] and Liu et al. [72]. Its detection performance is slightly lower than that of the model proposed by Wang et al. [73], with average precision reduced by 1.2%, but it has an obvious advantage in model complexity, with the number of parameters reduced by 93.3%. Compared to the lightweight models proposed in other studies, the HLTS-Model may have a larger model size, more parameters, or higher FLOPs, but it has the best detection performance. Overall, the proposed model balances detection performance and model complexity, allowing it to be better deployed on mobile platforms for accurate detection of tea shoots in field conditions.

3.6. Generalizability of the Modification Strategy

To verify the applicability of the model modification strategy proposed in this study to other agricultural scenarios, a comparative experiment before and after model modification was carried out on a dataset (895 images) for detecting tomatoes grown under a factory production pattern, obtained from Kaggle (https://www.kaggle.com (accessed on 23 April 2025)). The dataset division method, the experimental platform, and the training parameter settings were consistent with those in Section 2. YOLOv5-L was modified using the proposed method, and the trained model was named Tomato-YOLO. The detection performance before and after model modification is shown in Table 6. Compared with YOLOv5-L, the model size, the number of parameters, and FLOPs of Tomato-YOLO were significantly reduced, while the precision, recall, and average precision improved by 0.2%, 3.4%, and 1.2%, respectively. Comparing the detection results before and after the modification, Tomato-YOLO effectively reduces missed detections and improves the confidence of the detected targets (Figure 14).

4. Conclusions

In this study, the primary objective was to achieve accurate detection of tea shoots in field conditions and to improve the deployment friendliness of the model. Based on YOLOv5, a feasible approach combining lightweight modification and model compression was proposed to achieve excellent detection performance while reducing model complexity. A downsampling block with parallel-branch fusion and a lightweight feature extraction block based on partial convolution were proposed to reduce the information loss during downsampling and to enhance the ability to extract key features of tea shoots, respectively. Furthermore, model compression was achieved through model pruning and knowledge distillation, and the resulting precision, recall, and average precision of the proposed model for tea shoot detection were 81.5%, 81.3%, and 87.8%, respectively, representing improvements of 0.4%, 0.6%, and 2.0% compared to the original YOLOv5. The model size, number of parameters, and FLOPs were reduced to 8.9 MB, 4.2 M, and 15.8 G, decreases of 90.6%, 90.9%, and 85.3% compared to YOLOv5. Compared to other state-of-the-art detection models, the proposed model exhibits the best detection performance, with average precision 2.7%, 2.3%, and 0.8% higher than that of YOLOv7, YOLOv8-X, and YOLOv9-E, while its model size is only 11.9%, 6.5%, and 6.4% of theirs, respectively. With its high detection performance and low complexity, the proposed model can be better deployed on resource-limited mobile platforms, such as intelligent famous tea plucking and tea plant phenotyping platforms, for accurate detection of tea shoots in field conditions.
However, this approach also has limitations. The current experimental dataset is relatively small, covering a single variety with limited sample diversity. Tea plantations often adopt a multi-variety planting pattern, and there are large morphological variations among the shoots of different tea plant varieties. Therefore, the model developed on Zhongcha 108 in this study may not transfer directly to tea shoot detection for other varieties in field conditions. Future research should focus on expanding the image dataset to encompass a wide range of varieties. Additionally, integrating other types of images, such as depth and near-infrared images, could further enhance the model's ability to learn critical features of tea shoots, thereby improving detection performance and generalizability.

Author Contributions

Conceptualization: Z.Z.; data curation: Z.Z., Y.L. and Y.P.; formal analysis: Z.Z.; funding acquisition: Y.H.; investigation: Z.Z. and Y.L.; methodology: Z.Z.; project administration: Z.Z. and Y.H.; resources: Z.Z.; software: Z.Z. and M.Y.; supervision: Z.Z., Y.L. and Y.H.; validation: Z.Z., Y.L. and M.Y.; visualization: Z.Z. and M.Y.; writing—original draft: Z.Z.; writing—review and editing: Z.Z., Y.L., Y.P. and Y.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Priority Academic Program Development of Jiangsu Higher Education Institutions (PAPD-2023–87), and Changzhou Sci&Tech Program (Grant No. CJ20230011).

Data Availability Statement

The original contributions presented in the study are included in the article; further inquiries can be directed to the corresponding author.

Acknowledgments

The principal authors would like to express their sincere gratitude to the School of Agricultural Engineering, Jiangsu University, for providing essential instruments without which this work would not have been possible. We would also like to thank the anonymous reviewers for their valuable comments.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Cao, Y.; Li, H.; Sun, J.; Zhou, X.; Yao, K.; Nirere, A. Nondestructive Determination of the Total Mold Colony Count in Green Tea by Hyperspectral Imaging Technology. J. Food Process Eng. 2020, 43, e13570. [Google Scholar] [CrossRef]
  2. Yin, L.; Jayan, H.; Cai, J.; El-Seedi, H.; Guo, Z.; Zou, X. Development of a Sensitive SERS Method for Label-Free Detection of Hexavalent Chromium in Tea Using Carbimazole Redox Reaction. Foods 2023, 12, 2673. [Google Scholar] [CrossRef] [PubMed]
  3. Zhang, Z.; Yang, M.; Pan, Q.; Jin, X.; Wang, G.; Zhao, Y.; Hu, Y. Identification of Tea Plant Cultivars Based on Canopy Images Using Deep Learning Methods. Sci. Hortic. 2025, 339, 113908. [Google Scholar] [CrossRef]
  4. Zhang, Z.; Lu, Y.; Yang, M.; Wang, G.; Zhao, Y.; Hu, Y. Optimal Training Strategy for High-Performance Detection Model of Multi-Cultivar Tea Shoots Based on Deep Learning Methods. Sci. Hortic. 2024, 328, 112949. [Google Scholar] [CrossRef]
  5. Wei, J.; Chen, Y.; Jin, X.; Zheng, J.; Shi, Y.; Zhang, H. Researches on tender tea shoots identification under natural conditions. J. Tea Sci. 2012, 32, 377–381. [Google Scholar] [CrossRef]
  6. Bojie, Z.; Dong, W.; Weizhong, S.; Yu, L.; Ke, W. Research on tea bud identification technology based on HSI/HSV color transformation. In Proceedings of the 2019 6th International Conference on Information Science and Control Engineering (ICISCE), Shanghai, China, 20–22 December 2019; pp. 511–515. [Google Scholar] [CrossRef]
  7. Zhang, L.; Zou, L.; Wu, C.; Jia, J.; Chen, J. Method of Famous Tea Sprout Identification and Segmentation Based on Improved Watershed Algorithm. Comput. Electron. Agric. 2021, 184, 106108. [Google Scholar] [CrossRef]
  8. Shao, P.; Wu, M.; Wang, X.; Zhou, J.; Liu, S. Research on the Tea Bud Recognition Based on Improved K-Means Algorithm. MATEC Web Conf. 2018, 232, 03050. [Google Scholar] [CrossRef]
  9. Li, W.; Chen, R.; Gao, Y. Automatic recognition of tea bud image based on support vector machine. In Advanced Hybrid Information Processing, Proceedings of the 4th EAI International Conference, ADHIP 2020, Binzhou, China, 26–27 September 2020; Springer: Berlin/Heidelberg, Germany, 2021; pp. 279–290. [Google Scholar] [CrossRef]
  10. Wang, G.; Wang, Z.; Zhao, Y.; Zhang, Y. Tea bud recognition based on machine learning. In Proceedings of the 2022 41st Chinese Control Conference (CCC), Hefei, China, 25–27 July 2022; pp. 6533–6537. [Google Scholar] [CrossRef]
  11. Yang, H.; Chen, L.; Chen, M.; Ma, Z.; Deng, F.; Li, M.; Li, X. Tender Tea Shoots Recognition and Positioning for Picking Robot Using Improved YOLO-V3 Model. IEEE Access 2019, 7, 180998–181011. [Google Scholar] [CrossRef]
  12. Chen, C.; Lu, J.; Zhou, M.; Yi, J.; Liao, M.; Gao, Z. A YOLOv3-Based Computer Vision System for Identification of Tea Buds and the Picking Point. Comput. Electron. Agric. 2022, 198, 107116. [Google Scholar] [CrossRef]
  13. Gulzar, Y.; Ünal, Z.; Kızıldeniz, T.; Umar, M. Deep learning-based classification of alfalfa varieties: A comparative study using a custom leaf image dataset. MethodsX 2024, 13, 103051. [Google Scholar] [CrossRef]
  14. Amri, E.; Gulzar, Y.; Yeafi, A.; Jendoubi, S.; Dhawi, F.; Mir, M. Advancing automatic plant classification system in Saudi Arabia: Introducing a novel dataset and ensemble deep learning approach. Model. Earth Syst. Environ. 2024, 10, 2693–2709. [Google Scholar] [CrossRef]
  15. Zhao, S.; Peng, Y.; Liu, J.; Wu, S. Tomato Leaf Disease Diagnosis Based on Improved Convolution Neural Network by Attention Module. Agriculture 2021, 11, 651. [Google Scholar] [CrossRef]
  16. Ma, J.; Zhao, Y.; Fan, W.; Liu, J. An Improved YOLOv8 Model for Lotus Seedpod Instance Segmentation in the Lotus Pond Environment. Agronomy 2024, 14, 1325. [Google Scholar] [CrossRef]
  17. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Region-Based Convolutional Networks for Accurate Object Detection and Segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 38, 142–158. [Google Scholar] [CrossRef]
  18. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef]
  19. He, K.; Gkioxari, G.; Dollar, P.; Girshick, R. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2961–2969. [Google Scholar] [CrossRef]
  20. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.; Berg, A. SSD: Single shot multibox detector. In Computer Vision—ECCV 2016, Proceedings of the 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Springer: Berlin/Heidelberg, Germany, 2016; pp. 21–37. [Google Scholar] [CrossRef]
  21. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar] [CrossRef]
  22. Gulzar, Y. Enhancing soybean classification with modified inception model: A transfer learning approach. Emir. J. Food Agric. 2024, 36, 1–9. [Google Scholar] [CrossRef]
  23. Yan, Z.; Zhao, Y.; Luo, W.; Ding, X.; Li, K.; He, Z.; Shi, Y.; Cui, Y. Machine Vision-Based Tomato Plug Tray Missed Seeding Detection and Empty Cell Replanting. Comput. Electron. Agric. 2023, 208, 107800. [Google Scholar] [CrossRef]
  24. Gulzar, Y.; Ünal, Z. Optimizing Pear Leaf Disease Detection Through PL-DenseNet. Appl. Fruit Sci. 2025, 67, 40. [Google Scholar] [CrossRef]
  25. Seelwal, P.; Dhiman, P.; Gulzar, Y.; Kaur, A.; Wadhwa, S.; Onn, C. A systematic review of deep learning applications for rice disease diagnosis: Current trends and future directions. Front. Comput. Sci. 2024, 6, 1452961. [Google Scholar] [CrossRef]
  26. Pei, H.; Sun, Y.; Huang, H.; Zhang, W.; Sheng, J.; Zhang, Z. Weed Detection in Maize Fields by UAV Images Based on Crop Row Preprocessing and Improved YOLOv4. Agriculture 2022, 12, 975. [Google Scholar] [CrossRef]
  27. Ji, W.; Zhang, T.; Xu, B.; He, G. Apple Recognition and Picking Sequence Planning for Harvesting Robot in a Complex Environment. J. Agric. Eng. 2024, 55, 1549. [Google Scholar] [CrossRef]
  28. Zhang, Y.; Tian, K.; Huang, J.; Wang, Z.; Zhang, B.; Xie, Q. Field Obstacle Detection and Location Method Based on Binocular Vision. Agriculture 2024, 14, 1493. [Google Scholar] [CrossRef]
  29. Du, X.; Meng, Z.; Ma, Z.; Zhao, L.; Lu, W.; Cheng, H.; Wang, Y. Comprehensive Visual Information Acquisition for Tomato Picking Robot Based on Multitask Convolutional Neural Network. Biosyst. Eng. 2024, 238, 51–61. [Google Scholar] [CrossRef]
  30. Wang, T.; Zhang, K.; Zhang, W.; Wang, R.; Wan, S.; Rao, Y.; Jiang, Z.; Gu, L. Tea Picking Point Detection and Location Based on Mask-RCNN. Inf. Process. Agric. 2023, 10, 267–275. [Google Scholar] [CrossRef]
  31. Chen, Y.; Chen, S. Localizing Plucking Points of Tea Leaves Using Deep Convolutional Neural Networks. Comput. Electron. Agric. 2020, 171, 105298. [Google Scholar] [CrossRef]
  32. Li, Y.; He, L.; Jia, J.; Lv, J.; Chen, J.; Qiao, X.; Wu, C. In-Field Tea Shoot Detection and 3D Localization Using an RGB-D Camera. Comput. Electron. Agric. 2021, 185, 106149. [Google Scholar] [CrossRef]
  33. Xu, W.; Zhao, L.; Li, J.; Shang, S.; Ding, X.; Wang, T. Detection and Classification of Tea Buds Based on Deep Learning. Comput. Electron. Agric. 2022, 192, 106547. [Google Scholar] [CrossRef]
  34. Wang, C.; Huang, K.; Yao, Y.; Chen, J.; Shuai, H.; Cheng, W. Lightweight Deep Learning: An Overview. IEEE Consum. Electron. Mag. 2024, 13, 51–64. [Google Scholar] [CrossRef]
  35. Ji, W.; Pan, Y.; Xu, B.; Wang, J. A Real-Time Apple Targets Detection Method for Picking Robot Based on ShufflenetV2-YOLOX. Agriculture 2022, 12, 856. [Google Scholar] [CrossRef]
  36. Cong, P.; Feng, H.; Lv, K.; Zhou, J.; Li, S. MYOLO: A Lightweight Fresh Shiitake Mushroom Detection Model Based on YOLOv3. Agriculture 2023, 13, 392. [Google Scholar] [CrossRef]
  37. Zeng, T.; Li, S.; Song, Q.; Zhong, F.; Wei, X. Lightweight Tomato Real-Time Detection Method Based on Improved YOLO and Mobile Deployment. Comput. Electron. Agric. 2023, 205, 107625. [Google Scholar] [CrossRef]
  38. Ji, W.; Gao, X.; Xu, B.; Pan, Y.; Zhang, Z.; Zhao, D. Apple Target Recognition Method in Complex Environment Based on Improved YOLOv4. J. Food Process Eng. 2021, 44, e13866. [Google Scholar] [CrossRef]
  39. Ji, W.; Wang, J.; Xu, B.; Zhang, T. Apple Grading Based on Multi-Dimensional View Processing and Deep Learning. Foods 2023, 12, 2117. [Google Scholar] [CrossRef]
  40. Bai, Y.; Yu, J.; Yang, S.; Ning, J. An Improved YOLO Algorithm for Detecting Flowers and Fruits on Strawberry Seedlings. Biosyst. Eng. 2024, 237, 1–12. [Google Scholar] [CrossRef]
  41. Han, X.; Wang, H.; Yuan, T.; Zou, K.; Liao, Q.; Deng, K.; Zhang, Z.; Zhang, C.; Li, W. A Rapid Segmentation Method for Weed Based on CDM and ExG Index. Crop Prot. 2023, 172, 106321. [Google Scholar] [CrossRef]
  42. Li, J.; Li, J.; Zhao, X.; Su, X.; Wu, W. Lightweight Detection Networks for Tea Bud on Complex Agricultural Environment via Improved YOLO V4. Comput. Electron. Agric. 2023, 211, 107955. [Google Scholar] [CrossRef]
  43. Gui, Z.; Chen, J.; Li, Y.; Chen, Z.; Wu, C.; Dong, C. A Lightweight Tea Bud Detection Model Based on Yolov5. Comput. Electron. Agric. 2023, 205, 107636. [Google Scholar] [CrossRef]
  44. Zhang, Z.; Lu, Y.; Zhao, Y.; Pan, Q.; Jin, K.; Xu, G.; Hu, Y. TS-YOLO: An All-Day and Lightweight Tea Canopy Shoots Detection Model. Agronomy 2023, 13, 1411. [Google Scholar] [CrossRef]
  45. Li, Y.; He, L.; Jia, J.; Chen, J.; Lyu, J.; Wu, C. High-Efficiency Tea Shoot Detection Method via a Compressed Deep Learning Model. Int. J. Agric. Biol. Eng. 2022, 15, 159–166. [Google Scholar] [CrossRef]
  46. Huang, J.; Tang, A.; Chen, G.; Zhang, D.; Gao, F.; Chen, T. Mobile recognition solution of tea buds based on compact-YOLO v4 algorithm. Trans. Chin. Soc. Agric. Mach. 2023, 54, 282–290. [Google Scholar] [CrossRef]
  47. Jocher, G.; Chaurasia, A.; Stoken, A.; Borovec, J.; NanoCode012; Kwon, Y.; Michael, K.; Xie, T.; Fang, J.; Imyhxy; et al. Ultralytics/Yolov5: V7.0-YOLOv5 SOTA Realtime Instance Segmentation; Version v7.0; Zenodo: Geneva, Switzerland, 2022. [Google Scholar] [CrossRef]
  48. Lu, W.; Chen, S.; Tang, J.; Ding, C.; Luo, B. A Robust Feature Downsampling Module for Remote-Sensing Visual Tasks. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–12. [Google Scholar] [CrossRef]
  49. Ding, X.; Zhang, X.; Ma, N.; Han, J.; Ding, G.; Sun, J. RepVGG: Making VGG-style ConvNets great again. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 13728–13737. [Google Scholar] [CrossRef]
  50. Chen, J.; Kao, S.; He, H.; Zhuo, W.; Wen, S.; Lee, C.; Chan, S. Run, don’t walk: Chasing higher FLOPS for faster neural networks. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 12021–12031. [Google Scholar] [CrossRef]
  51. Tu, Z.; He, F.; Tao, D. Understanding Generalization in Recurrent Neural Networks. In Proceedings of the International Conference on Learning Representations, Virtual, 26 April–1 May 2020. [Google Scholar]
  52. Zheng, Z.; Li, J.; Qin, L. YOLO-BYTE: An Efficient Multi-Object Tracking Algorithm for Automatic Monitoring of Dairy Cows. Comput. Electron. Agric. 2023, 209, 107857. [Google Scholar] [CrossRef]
  53. Zhu, M.; Gupta, S. To prune, or not to prune: Exploring the efficacy of pruning for model compression. arXiv 2017, arXiv:1710.01878. [Google Scholar] [CrossRef]
  54. Lee, J.; Park, S.; Mo, S.; Ahn, S.; Shin, J. Layer-adaptive sparsity for the magnitude-based pruning. arXiv 2021, arXiv:2010.07611. [Google Scholar] [CrossRef]
  55. Gou, J.; Yu, B.; Maybank, S.; Tao, D. Knowledge Distillation: A Survey. Int. J. Comput. Vis. 2021, 129, 1789–1819. [Google Scholar] [CrossRef]
  56. Hinton, G.; Vinyals, O.; Dean, J. Distilling the knowledge in a neural network. arXiv 2015, arXiv:1503.02531. [Google Scholar] [CrossRef]
  57. Passban, P.; Wu, Y.; Rezagholizadeh, M.; Liu, Q. ALP-KD: Attention-Based Layer Projection for Knowledge Distillation. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 2–9 February 2021; Volume 35, pp. 13657–13665. [Google Scholar] [CrossRef]
  58. Passalis, N.; Tzelepi, M.; Tefas, A. Heterogeneous knowledge distillation using information flow modeling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 2339–2348. [Google Scholar] [CrossRef]
59. Selvaraju, R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 618–626. [Google Scholar] [CrossRef]
  60. Wang, Z.; Xu, X.; Hua, Z.; Shang, Y.; Duan, Y.; Song, H. Lightweight recognition for the oestrus behavior of dairy cows combining YOLO v5n and channel pruning. Trans. Chin. Soc. Agric. Eng. 2022, 38, 130–140. [Google Scholar] [CrossRef]
  61. Yu, G.; Cai, R.; Luo, Y.; Hou, M.; Deng, R. A-Pruning: A Lightweight Pineapple Flower Counting Network Based on Filter Pruning. Complex Intell. Syst. 2024, 10, 2047–2066. [Google Scholar] [CrossRef]
  62. Tan, K.; Tang, J.; Zhao, Z.; Wang, C.; Miao, H.; Zhang, X.; Chen, X. Efficient and Lightweight Layer-Wise in-Situ Defect Detection in Laser Powder Bed Fusion via Knowledge Distillation and Structural Re-Parameterization. Expert Syst. Appl. 2024, 255, 124628. [Google Scholar] [CrossRef]
  63. Guan, B.; Li, J. Lightweight Detection Network for Bridge Defects Based on Model Pruning and Knowledge Distillation. Structures 2024, 62, 106276. [Google Scholar] [CrossRef]
  64. Zhou, X.; Wang, D.; Krähenbühl, P. Objects as Points. arXiv 2019, arXiv:1904.07850. [Google Scholar] [CrossRef]
  65. Tian, Z.; Shen, C.; Chen, H.; He, T. FCOS: Fully convolutional one-stage object detection. arXiv 2019, arXiv:1904.01355. [Google Scholar] [CrossRef]
  66. Redmon, J.; Farhadi, A. YOLOv3: An incremental improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar] [CrossRef]
  67. Ge, Z.; Liu, S.; Wang, F.; Li, Z.; Sun, J. YOLOX: Exceeding YOLO series in 2021. arXiv 2021, arXiv:2107.08430. [Google Scholar] [CrossRef]
  68. Li, C.; Li, L.; Jiang, H.; Weng, K.; Geng, Y.; Li, L.; Ke, Z.; Li, Q.; Cheng, M.; Nie, W.; et al. YOLOv6: A single-stage object detection framework for industrial applications. arXiv 2022, arXiv:2209.02976. [Google Scholar] [CrossRef]
  69. Feng, C.; Zhong, Y.; Gao, Y.; Scott, M.; Huang, W. TOOD: Task-aligned One-stage object detection. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 3490–3499. [Google Scholar] [CrossRef]
70. Wang, C.; Bochkovskiy, A.; Liao, H. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 7464–7475. [Google Scholar] [CrossRef]
71. Wang, C.; Yeh, I.; Liao, H. YOLOv9: Learning what you want to learn using programmable gradient information. arXiv 2024, arXiv:2402.13616. [Google Scholar] [CrossRef]
  72. Liu, Z.; Zhuo, L.; Dong, C.; Li, J. YOLO-TBD: Tea Bud Detection with Triple-Branch Attention Mechanism and Self-Correction Group Convolution. Ind. Crops Prod. 2025, 226, 120607. [Google Scholar] [CrossRef]
  73. Wang, X.; Wu, Z.; Xiao, G.; Han, C.; Fang, C. YOLOv7-DWS: Tea bud recognition and detection network in multi-density environment via improved YOLOv7. Front. Plant Sci. 2025, 15, 1503033. [Google Scholar] [CrossRef]
74. Yang, D.; Huang, Z.; Zheng, C.; Chen, H.; Jiang, X. Detecting tea shoots using improved YOLOv8n. Trans. Chin. Soc. Agric. Eng. 2024, 40, 165–173. [Google Scholar] [CrossRef]
  75. Fang, W.; Chen, W. TBF-YOLOv8n: A Lightweight Tea Bud Detection Model Based on YOLOv8n Improvements. Sensors 2025, 25, 547. [Google Scholar] [CrossRef]
  76. Li, H.; Kong, M.; Shi, Y. Tea Bud Detection Model in a Real Picking Environment Based on an Improved YOLOv5. Biomimetics 2024, 9, 692. [Google Scholar] [CrossRef] [PubMed]
  77. Bai, B.; Wang, J.; Li, J.; Yu, L.; Wen, J.; Han, Y. T-YOLO: A lightweight and efficient detection model for nutrient buds in complex tea-plantation environments. J. Sci. Food Agric. 2024, 104, 5698–5711. [Google Scholar] [CrossRef] [PubMed]
Figure 1. Overall procedure of the proposed lightweight and high-performance tea shoot detection model.
Figure 2. Tea plantations and sample images from image acquisition: (a) the tea plantations; (b) examples of captured images.
Figure 3. Structure of YOLOv5.
Figure 4. Structure of L-YOLO.
Figure 5. Structure of parallel-branch fusion downsampling block.
Figure 6. Schematic of partial convolution process.
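Figure 6 illustrates partial convolution (PConv) [50], which applies a regular convolution to only a fraction of the input channels and passes the remaining channels through unchanged, reducing FLOPs and memory access; it is presumably the core operation behind the FC3 block in Figure 7. The following is a minimal PyTorch sketch of the idea, not the paper's exact implementation; the class name PartialConv and the split ratio n_div are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PartialConv(nn.Module):
    """Minimal partial-convolution sketch: a 3x3 convolution is applied to the
    first 1/n_div of the channels; the remaining channels pass through untouched."""

    def __init__(self, channels: int, n_div: int = 4):
        super().__init__()
        self.dim_conv = channels // n_div               # channels that get convolved
        self.dim_untouched = channels - self.dim_conv   # channels left as-is
        self.conv = nn.Conv2d(self.dim_conv, self.dim_conv, 3, 1, 1, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x1, x2 = torch.split(x, [self.dim_conv, self.dim_untouched], dim=1)
        x1 = self.conv(x1)                              # spatial mixing on a channel subset only
        return torch.cat((x1, x2), dim=1)               # recombine with the untouched channels

if __name__ == "__main__":
    y = PartialConv(64)(torch.randn(1, 64, 80, 80))
    print(y.shape)  # torch.Size([1, 64, 80, 80])
```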
Figure 7. Structure of FC3.
Figure 8. Grad-CAM output for different lightweight feature extraction blocks.
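Figure 8 visualizes feature attention with Grad-CAM [59]. As a generic sketch of how such heatmaps are produced (not the authors' visualization code), the function below hooks a chosen layer, backpropagates a scalar score, weights the layer's activations by the spatially averaged gradients, and normalizes the result; the choice of target layer and of the score to explain are assumptions.

```python
import torch
import torch.nn.functional as F

def grad_cam(model, image, target_layer):
    """Generic Grad-CAM sketch: weight a layer's activations by the spatially
    averaged gradients of a scalar score, ReLU, upsample, and normalize to [0, 1]."""
    acts, grads = [], []
    h1 = target_layer.register_forward_hook(lambda m, i, o: acts.append(o))
    h2 = target_layer.register_full_backward_hook(lambda m, gi, go: grads.append(go[0]))

    model.zero_grad()
    out = model(image)
    out = out[0] if isinstance(out, (tuple, list)) else out
    score = out.max()                                   # scalar to explain, e.g., top confidence
    score.backward()
    h1.remove(); h2.remove()

    a, g = acts[0], grads[0]                            # both of shape (1, C, H, W)
    weights = g.mean(dim=(2, 3), keepdim=True)          # per-channel importance
    cam = F.relu((weights * a).sum(dim=1, keepdim=True))
    cam = F.interpolate(cam, size=image.shape[-2:], mode="bilinear", align_corners=False)
    return (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
```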
Figure 9. Trend of model performance under different pruning rates.
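Figure 9 tracks model performance as the pruning rate increases. The snippet below is a minimal sketch of global L1-magnitude pruning using torch.nn.utils.prune, in the spirit of [53,54]; the pruning granularity, schedule, and fine-tuning procedure actually used in the study may differ.

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

def magnitude_prune(model: nn.Module, rate: float) -> nn.Module:
    """Sketch of global L1-magnitude pruning: zero the `rate` fraction of
    convolution weights with the smallest absolute values, then bake the
    masks into the weight tensors."""
    targets = [(m, "weight") for m in model.modules() if isinstance(m, nn.Conv2d)]
    prune.global_unstructured(targets, pruning_method=prune.L1Unstructured, amount=rate)
    for module, name in targets:
        prune.remove(module, name)   # make the pruning permanent
    return model

# Sweeping pruning rates as in Figure 9 (each pruned copy would then be
# fine-tuned and re-evaluated):
# for rate in (0.1, 0.3, 0.5, 0.7):
#     pruned = magnitude_prune(copy.deepcopy(base_model), rate)
```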
Figure 10. Performance of the model with knowledge distillation under different knowledge transfer types.
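Figure 10 compares knowledge-transfer types for distillation. For reference, the classic response-based (soft-target) loss of Hinton et al. [56] is sketched below; the temperature T and the loss weighting shown in the comment are illustrative, and feature-based transfer variants would add intermediate-layer terms instead of, or alongside, this one.

```python
import torch
import torch.nn.functional as F

def soft_target_kd_loss(student_logits: torch.Tensor,
                        teacher_logits: torch.Tensor,
                        T: float = 4.0) -> torch.Tensor:
    """Response-based distillation term: KL divergence between temperature-softened
    teacher and student distributions (Hinton et al.)."""
    p_teacher = F.softmax(teacher_logits / T, dim=-1)
    log_p_student = F.log_softmax(student_logits / T, dim=-1)
    # T^2 keeps gradient magnitudes comparable across temperatures
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (T * T)

# Illustrative total loss: detection loss plus a weighted distillation term
# loss = det_loss + 0.5 * soft_target_kd_loss(student_cls, teacher_cls, T=4.0)
```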
Figure 11. Training process of YOLOv5-L and the modified model.
Figure 12. Comparison of the detection results of YOLOv5-L and the HLTS-Model. Yellow and blue boxes represent missed and false detections, respectively. (a,c,e,g) show the detection results of YOLOv5-L, and (b,d,f,h) show the detection results of the HLTS-Model.
Figure 13. Confusion matrices of different models.
Figure 14. Comparison of the detection results of YOLOv5-L and Tomato-YOLO. Yellow boxes represent missed detections.
Table 1. Detection performance of the modified YOLOv5 models.
| Models | P (%) | R (%) | AP (%) | Model Size (MB) | Parameters (M) | FLOPs (G) |
| YOLOv5-N | 78.3 | 75.3 | 82.4 | 3.9 | 1.9 | 4.6 |
| YOLOv5-N + PFD | 77.0 | 75.4 | 81.3 | 2.9 | 1.2 | 3.2 |
| YOLOv5-N + PFD + FC3 | 76.1 | 72.2 | 78.4 | 2.3 | 0.9 | 2.4 |
| YOLOv5-S | 80.8 | 78.5 | 85.4 | 13.7 | 7.2 | 16.6 |
| YOLOv5-S + PFD | 79.3 | 78.5 | 84.6 | 10.3 | 4.9 | 11.7 |
| YOLOv5-S + PFD + FC3 | 79.6 | 75.6 | 82.9 | 7.8 | 3.7 | 8.6 |
| YOLOv5-M | 84.2 | 78.0 | 85.4 | 42.2 | 21.2 | 49.2 |
| YOLOv5-M + PFD | 82.1 | 78.2 | 85.4 | 32.9 | 16.2 | 38.7 |
| YOLOv5-M + PFD + FC3 | 80.9 | 79.2 | 85.6 | 21.8 | 10.6 | 24.6 |
| YOLOv5-L | 81.1 | 80.7 | 85.8 | 94.8 | 46.1 | 107.6 |
| YOLOv5-L + PFD | 81.9 | 79.5 | 86.1 | 76.2 | 38.2 | 90.7 |
| YOLOv5-L + PFD + FC3 | 80.9 | 81.2 | 87.2 | 46.6 | 24.2 | 53.1 |
| YOLOv5-X | 81.9 | 80.6 | 86.1 | 173.1 | 86.7 | 206.3 |
| YOLOv5-X + PFD | 83.6 | 79.4 | 86.2 | 147.1 | 73.2 | 178.2 |
| YOLOv5-X + PFD + FC3 | 81.8 | 81.2 | 87.2 | 85.5 | 42.4 | 99.8 |
Table 2. Model performance of YOLOv5-L with different lightweight feature extraction blocks.
| Models | P (%) | R (%) | AP (%) | Model Size (MB) | Parameters (M) | FLOPs (G) |
| YOLOv5-L | 81.1 | 80.7 | 85.8 | 94.8 | 46.1 | 107.6 |
| YOLOv5-L + PFD | 81.9 | 79.5 | 86.1 | 76.2 | 38.2 | 90.7 |
| YOLOv5-L + PFD + FC3 | 80.9 | 81.2 | 87.2 | 46.6 | 24.2 | 53.1 |
| YOLOv5-L + PFD + GC3 | 79.0 | 78.0 | 84.1 | 27.0 | 13.5 | 32.6 |
| YOLOv5-L + PFD + DSC3 | 79.8 | 78.7 | 85.0 | 24.7 | 12.5 | 27.2 |
Table 3. Testing results of ablation experiments.
| YOLOv5-L | PFD | FC3 | PK | P (%) | R (%) | AP (%) | Model Size (MB) | Parameters (M) | FLOPs (G) | FPS |
| √ | × | × | × | 81.1 | 80.7 | 85.8 | 94.8 | 46.1 | 107.6 | 58.4 |
| √ | √ | × | × | 81.9 | 79.5 | 86.1 | 76.2 | 38.2 | 90.7 | 54.4 |
| √ | √ | √ | × | 80.9 | 81.2 | 87.2 | 46.6 | 24.2 | 53.1 | 47.2 |
| √ | √ | √ | √ | 81.5 | 81.3 | 87.8 | 8.9 | 4.2 | 15.8 | 44.5 |
√ and × indicate that the corresponding block/method was or was not applied, respectively.
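The model size, parameter, FLOPs, and FPS columns in Tables 3 and 4 are typically obtained with a profiling routine like the sketch below, which uses the third-party thop package and simple timing on a GPU. This is an assumed measurement setup rather than the study's documented protocol, and FLOPs conventions vary (thop reports multiply–accumulate counts, which some papers double when quoting FLOPs).

```python
import time
import torch
from thop import profile  # third-party FLOPs/params counter (pip install thop); an assumed choice

def profile_detector(model, img_size=640, runs=100, device="cuda"):
    """Rough sketch: count parameters/MACs with thop and time the forward pass for FPS."""
    model = model.to(device).eval()
    x = torch.randn(1, 3, img_size, img_size, device=device)
    macs, params = profile(model, inputs=(x,), verbose=False)

    sync = torch.cuda.synchronize if device.startswith("cuda") else (lambda: None)
    with torch.no_grad():
        for _ in range(10):            # warm-up iterations
            model(x)
        sync()
        start = time.time()
        for _ in range(runs):
            model(x)
        sync()
    fps = runs / (time.time() - start)
    return params / 1e6, macs / 1e9, fps   # M parameters, G multiply-accumulates, FPS
```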
Table 4. Detection results of HLTS-Model and other state-of-the-art detection models.
| Models | P (%) | R (%) | AP (%) | Model Size (MB) | Parameters (M) | FLOPs (G) | FPS |
| Faster RCNN | 72.8 | 82.3 | 77.5 | 317.0 | 41.3 | 71.7 | 28.3 |
| CenterNet | 78.8 | 64.0 | 69.5 | 245.0 | 32.1 | 59.0 | 28.6 |
| FCOS | 83.7 | 66.2 | 76.8 | 246.0 | 32.1 | 59.0 | 30.4 |
| YOLOv3 | 80.9 | 80.0 | 85.0 | 207.7 | 103.7 | 282.2 | 43.7 |
| YOLOv3-SPP | 81.4 | 79.1 | 85.2 | 209.8 | 104.7 | 283.1 | 42.6 |
| YOLOv3-Tiny | 78.8 | 73.0 | 80.7 | 24.3 | 12.1 | 24.3 | 33.6 |
| YOLOX-N | 68.4 | 76.5 | 73.7 | 13.3 | 0.9 | 0.5 | 43.6 |
| YOLOX-Tiny | 69.0 | 86.7 | 83.3 | 60.4 | 5.0 | 7.6 | 40.7 |
| YOLOv6 | 76.3 | 77.8 | 82.2 | 8.7 | 4.2 | 11.9 | 42.7 |
| TOOD | 71.2 | 83.6 | 80.7 | 244.0 | 32.0 | 59.16 | 19.6 |
| YOLOv7 | 78.5 | 81.4 | 85.1 | 74.8 | 37.6 | 106.5 | 60.2 |
| YOLOv7-Tiny | 76.8 | 76.5 | 81.7 | 12.3 | 6.0 | 13.2 | 66.2 |
| YOLOv8-N | 76.7 | 76.8 | 82.4 | 6.2 | 3.0 | 8.1 | 80.0 |
| YOLOv8-S | 77.9 | 79.2 | 84.4 | 22.5 | 11.1 | 28.4 | 77.5 |
| YOLOv8-M | 79.9 | 79.6 | 84.9 | 52.0 | 25.8 | 78.7 | 57.8 |
| YOLOv8-L | 80.9 | 79.4 | 85.2 | 87.6 | 43.6 | 164.8 | 50.5 |
| YOLOv8-X | 79.7 | 81.3 | 85.5 | 136.7 | 68.1 | 257.4 | 43.1 |
| YOLOv9-C | 81.4 | 80.5 | 87.0 | 102.7 | 50.7 | 236.6 | 15.7 |
| YOLOv9-E | 80.0 | 80.8 | 87.0 | 139.9 | 69.4 | 244.8 | 15.5 |
| HLTS-Model | 81.5 | 81.3 | 87.8 | 8.9 | 4.2 | 15.8 | 44.5 |
Table 5. Performance comparison of the HLTS-Model with models proposed in other studies.
| Existing Study | Dataset Size (Pictures) | Detected Object | Model Size (MB) | Parameters (M) | FLOPs (G) | P (%) | R (%) | AP (%) |
| Zhang et al. [4] | 1692 | BOL | 71.3 | 32.7 | 105.1 | 87.3 | 81.2 | 87.1 |
| Li et al. [42] | 7723 | B, BOL | —— | 11.4 | 6.6 | —— | —— | 85.2 |
| Zhang et al. [44] | 2417 | BOL | 11.8 | —— | —— | 85.4 | 78.4 | 82.1 |
| Liu et al. [72] | 2576 | —— | —— | 41.27 | 167.9 | 79.3 | 82.6 | 87.0 |
| Wang et al. [73] | 945 | BOL, BTL | —— | 62.7 | —— | —— | 83.9 | 89.1 |
| Yang et al. [74] | 513 | BOL, BTL | 6.7 | —— | —— | 82.5 | 74.4 | 81.7 |
| Fang et al. [75] | 6242 | —— | —— | 2.6 | 4.5 | 87.5 | 74.4 | 85.0 |
| Li et al. [76] | 4100 | B, BOL, BTL | —— | 7.2 | 14.8 | 84.5 | 74.1 | 83.7 |
| Bai et al. [77] | 1368 | B | —— | 11.3 | 17.2 | —— | —— | 84.1 |
| Ours | 1862 | BOL | 8.9 | 4.2 | 15.8 | 81.5 | 81.3 | 87.8 |
—— represents not available; B, BOL, and BTL represent one bud, one bud and one leaf, and one bud and two leaves, respectively.
Table 6. Detection performance of YOLOv5-L and Tomato-YOLO.
| Models | P (%) | R (%) | AP (%) | Model Size (MB) | Parameters (M) | FLOPs (G) |
| YOLOv5-L | 87.9 | 79.8 | 89.1 | 94.8 | 46.1 | 107.6 |
| Tomato-YOLO | 88.1 | 83.2 | 90.3 | 7.6 | 3.6 | 15.9 |