1. Introduction
With the improvement of living standards, people’s requirements for the quality of tea have increased [1]. The surge in demand for high-quality tea has led to innovations in tea production and harvesting processes [2]. Tea picking methods can be classified into two kinds, i.e., manual picking and mechanized picking [3,4]. Although manual picking can accurately pick out the tea buds, it is inefficient, requires a large amount of manpower, and easily causes a waste of resources. Mechanized picking can improve picking efficiency and reduce costs, yet it destroys the integrity of tea shoots, and the mixing of old and broken leaves affects the quality of the tea [5]. Therefore, the picking of premium and high-quality tea still mainly relies on manual picking, and it is of great significance to study technologies for identifying tea shoots. Computer vision can be used to accurately identify young shoots and thus improve the picking quality of tea [6,7].
At present, the automatic identification of tea shoots has become key to improving the efficiency and quality of tea picking [8,9]. In early research, traditional image processing methods [10] were mainly used to identify tea shoots through color space analysis [11] and edge detection [12]. Zhao et al. analyzed the RGB components of tea bud images, segmented them with an HSV spatial transformation, and successfully extracted the tea buds by setting thresholds combined with a three-channel component method [13]. Although such traditional methods can separate tea buds from old leaves, they show poor robustness in complex scenarios such as illumination changes and occlusion by branches and leaves, making it difficult to meet practical application needs.
Tea shoot recognition methods based on deep learning can better adapt to recognition tasks in complex environments through the automatic extraction of multi-level features [14,15,16]. Sun et al. presented a detection method for key organs of tomato in complex backgrounds based on deep transfer learning [17]. Peng et al. conducted a comparative study of semantic segmentation models for identifying grapes of different varieties [18].
Meanwhile, the YOLO series of algorithms [19], with their end-to-end object detection framework, can quickly identify tea buds in complex backgrounds with high detection accuracy and strong real-time performance, and are widely applied in the field of tea bud recognition. Yang et al. improved the YOLOv3 network by introducing residual network blocks and replacing the fully connected layer with convolution, combined with the K-means clustering algorithm, to achieve end-to-end tea target recognition [20]. Ji et al. presented an apple recognition method for complex environments based on an improved YOLOv4 [21]. Zhou et al. presented an improved field obstacle detection algorithm based on YOLOv8 [22].
In terms of tea picking, there is currently an urgent need for a tender shoot recognition technology with both high accuracy and real-time performance, so as to adapt to the complex natural environment and meet the requirements of lightweight deployment on picking robots. However, existing techniques still have limitations. Traditional image processing methods rely on manual feature design, and their segmentation accuracy drops sharply when illumination is uneven or targets overlap, which makes them hard to apply in dynamic farmland scenes. Although deep learning-based instance segmentation algorithms can improve adaptability through multi-level feature extraction, their application is still limited by high computational complexity and low recognition accuracy in complex natural environments. The research gaps can be summarized as follows.
- (1) The recognition accuracy of tender tea shoots needs to be improved. The existing models have not been optimized for the features of tender tea shoots, resulting in insufficient distinction between tender shoots and similar background areas, as well as confusion during fine-grained classification. The tea plant environment is complex and changeable, and the models are prone to false positives and false negatives in conditions such as shadows and overlaps. Enhancing the feature extraction capability, especially in capturing the unique characteristics of tender tea shoots, is the key to improving recognition accuracy.
- (2) The combination of object detection and segmentation algorithms affects the real-time performance of the system. The existing methods (such as Faster R-CNN [17] and DeepLabV3+ [18]) that integrate object detection and semantic segmentation for identifying tea buds are effective but computationally intensive, which impacts real-time performance. Therefore, how to enhance the stability and real-time performance of the system while maintaining high segmentation accuracy and optimizing computational efficiency is also a key research focus in the current task of identifying tea buds.
To address these research gaps, this study proposes the instance segmentation algorithm YOLOv8-TEA, which contains three main parts, i.e., the backbone feature extraction network, the feature fusion network, and the high-precision instance segmentation head. Firstly, in the feature extraction network, the MobileViT Block (MVB) is innovatively combined with the C2PSA module to replace some C2f modules in the original feature extraction network, making full use of the advantages of the convolutional neural network (CNN) and the Transformer. The ability to capture local texture features and global context information is enhanced, and the feature extraction effect of the model for tender tea shoots in complex environments is improved. Secondly, in the feature fusion network, the learnable dynamic upsampling method DySample and the CoTAttention module are introduced to replace the traditional sampling method, which effectively solves the problem of detail loss in traditional methods and improves segmentation accuracy in scenes with branch and leaf occlusion. Finally, the segmentation head is reconstructed with depthwise separable convolution to reduce the parameter count and computation and meet the real-time requirements of the picking robot.
Through the above improvements, this study not only significantly improves segmentation accuracy, but also effectively reduces the number of model parameters and lowers the GFLOPs to 52.7, addressing the shortcomings of existing studies in accuracy and real-time performance and providing new ideas and methods for the development of tender tea shoot recognition technology.
The main contents of this paper are as follows:
Section 2 describes the related research work of tender tea shoot recognition, instance segmentation algorithms, and the contribution of this study.
Section 3 introduces the proposed YOLOv8-TEA method in detail.
Section 4 discusses ablation experiments and comparative experimental results. Finally, suggestions for future research are made.
3. Methods
3.1. Overall Model Structure of YOLOv8-TEA
The overall process of YOLOv8-TEA is shown in Figure 1a. After an image is input, image features are first extracted by the feature extraction network augmented with the MVB and C2PSA modules. Secondly, the image features are fused by the feature fusion network with the DySample module and the CoTAttention mechanism. Then, the feature map is processed by the segmentation head that incorporates the DWConv module.
The overall framework of the proposed instance segmentation model YOLOv8-TEA is shown in Figure 1b, which mainly contains three parts: the feature extraction backbone network (module A), the feature fusion network (module B), and the segmentation head (module C). In the feature extraction stage, the MobileViT Block (MVB) is used to replace part of the C2f modules in the original feature extraction network, and the C2PSA module is added in the last layer. The feature fusion part uses the dynamic upsampler DySample to replace the traditional upsampling method, while the CoTAttention attention mechanism is introduced into the fusion process. The segmentation head inherits the design idea of YOLACT: it processes the input feature map through independent detection and segmentation branches, combines the bounding boxes and segmentation masks in the post-processing stage, and finally outputs the complete instance segmentation result.
3.2. The Improved Feature Extraction Network Based on MobileViT
As one of the basic networks of the YOLO series, Darknet53 ensures the feature extraction capability of the network by introducing more convolutional layers and residual blocks. The original feature extraction network of the YOLOv8-seg model, CSPDarknet, introduces the CSPNet and SPPF structures on the basis of Darknet53, which ensures both efficient computation and strong feature extraction ability. CSPNet [45] is composed of two main branches: the backbone branch, which realizes deep feature extraction through deep convolution within Dense Blocks, and the shortcut branch, which directly passes part of the input features to the final fusion layer; this cross-stage feature distribution optimizes gradient information transfer and reduces computational cost. Meanwhile, SPPF uses recursive pooling instead of parallel multi-scale pooling, which significantly reduces the computational effort while maintaining a large receptive field.
In this study, based on the original feature extraction network CSPDarknet, some C2f modules are replaced with the MobileViT Block (MVB) [46], which combines the advantages of the convolutional neural network (CNN) and the Transformer, and the C2PSA module is added after the Spatial Pyramid Pooling-Fast (SPPF) layer to combine convolution with the attention mechanism and further improve the feature extraction ability, as shown in Figure 2. MobileViT is a model architecture that combines the advantages of convolutional neural networks (CNN) and Vision Transformers (ViT). It is able to capture long-distance dependency information through the self-attention mechanism and thus extract image features more effectively. In contrast, the traditional C2f block may be slightly insufficient in the richness of feature extraction and the ability to capture complex semantic information.
Replacing specific C2f blocks with MobileViT introduces a more powerful feature extraction mechanism, which helps the model better understand the objects in the image, improves the representation of different objects and scenes, and thus improves segmentation accuracy. Replacing other C2f blocks with MobileViT, however, could cause training instability: changing the model structure may break the original training balance, especially when fine-tuning a pre-trained model, and may require more careful hyperparameter tuning to avoid problems such as gradient explosion or vanishing. Replacing all C2f blocks with MobileViT may also lead to a substantial increase in computational resource requirements. Because of the relatively high computational complexity of MobileViT, a total replacement would dramatically increase the number of parameters and the computation of the model, greatly raising the hardware requirements and potentially preventing real-time operation on resource-constrained devices.
The originally used C2f module in YOLOv8-seg mainly consists of four parts: a 1 × 1 convolution for dimensionality reduction; a split module that divides the convolution output into two parts, one directly entering the final concatenation and the other entering the DarknetBottleneck for computation; multiple DarknetBottleneck modules; and a 1 × 1 convolution to adjust the number of channels of the concatenated features. While this structure improves the efficiency of feature reuse, it also tends to cause a lack of global information interaction and the loss of small object details due to its heavy reliance on DarknetBottleneck.
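To make this structure concrete, the following minimal PyTorch sketch implements a C2f-style block; the channel sizes, the number of bottlenecks, and the class names are illustrative assumptions rather than the exact Ultralytics implementation.

```python
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    """Simplified DarknetBottleneck: two 3x3 convs with an optional residual connection."""
    def __init__(self, channels, shortcut=True):
        super().__init__()
        self.conv1 = nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1, bias=False),
                                   nn.BatchNorm2d(channels), nn.SiLU())
        self.conv2 = nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1, bias=False),
                                   nn.BatchNorm2d(channels), nn.SiLU())
        self.shortcut = shortcut

    def forward(self, x):
        y = self.conv2(self.conv1(x))
        return x + y if self.shortcut else y

class C2fSketch(nn.Module):
    """C2f-style block: 1x1 conv -> split -> n bottlenecks -> concat -> 1x1 conv."""
    def __init__(self, c_in, c_out, n=2):
        super().__init__()
        self.hidden = c_out // 2
        self.reduce = nn.Conv2d(c_in, 2 * self.hidden, 1, bias=False)
        self.blocks = nn.ModuleList([Bottleneck(self.hidden) for _ in range(n)])
        self.fuse = nn.Conv2d((2 + n) * self.hidden, c_out, 1, bias=False)

    def forward(self, x):
        a, b = self.reduce(x).chunk(2, dim=1)   # split: part a bypasses the bottlenecks
        outs = [a, b]
        for block in self.blocks:               # part b passes through the bottleneck chain
            b = block(b)
            outs.append(b)
        return self.fuse(torch.cat(outs, dim=1))

if __name__ == "__main__":
    print(C2fSketch(64, 64)(torch.randn(1, 64, 80, 80)).shape)  # torch.Size([1, 64, 80, 80])
```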
The MVB module combines the advantages of a convolutional neural network (CNN) and a Transformer to capture both the local features and the global context information of an image, as shown in Figure 3. The MVB module fuses convolutional features and Transformer features: the convolutional layers efficiently extract local features, while the Transformer processes global context information through the self-attention mechanism. The two are finally fused through residual connections to generate feature representations that contain both local details and global dependencies, so that the model can better understand the structure and relationships in the image, thereby improving detection and segmentation performance.
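As a rough illustration of this local-global fusion, the sketch below combines a convolutional local branch with a Transformer encoder over flattened spatial tokens and merges the two with a fusion convolution. It is a simplified stand-in, assuming a plain Transformer over all tokens instead of MobileViT's patch unfolding, with dim, depth, and heads chosen arbitrarily.

```python
import torch
import torch.nn as nn

class MVBSketch(nn.Module):
    """Simplified MobileViT-style block: conv for local texture, Transformer for global context."""
    def __init__(self, channels, dim=64, depth=2, heads=4):
        super().__init__()
        # Local representation: 3x3 conv followed by a 1x1 projection into the Transformer dimension.
        self.local = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels), nn.SiLU(),
            nn.Conv2d(channels, dim, 1))
        # Global representation: Transformer encoder over flattened H*W tokens.
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=2 * dim, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=depth)
        self.project = nn.Conv2d(dim, channels, 1)
        # Fusion: concatenate the input with the globally enriched features, then a 3x3 conv.
        self.fuse = nn.Sequential(
            nn.Conv2d(2 * channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels), nn.SiLU())

    def forward(self, x):
        b, _, h, w = x.shape
        y = self.local(x)                                # local features (B, dim, H, W)
        tokens = y.flatten(2).transpose(1, 2)            # (B, H*W, dim)
        tokens = self.transformer(tokens)                # global self-attention over all positions
        y = tokens.transpose(1, 2).reshape(b, -1, h, w)  # back to a feature map
        y = self.project(y)
        return self.fuse(torch.cat([x, y], dim=1))       # residual-style fusion with the input

if __name__ == "__main__":
    print(MVBSketch(96)(torch.randn(1, 96, 40, 40)).shape)  # torch.Size([1, 96, 40, 40])
```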
The C2PSA module follows the approach of improving model performance through a partial self-attention mechanism in YOLOv10: it introduces an improved multi-head attention mechanism into the C2f structure, replaces the Darknet Bottleneck layers, and discards the intermediate-layer outputs, realizing an effective combination of convolutional blocks and the attention mechanism, as shown in Figure 4. The C2PSA module first receives the input feature map and converts the number of channels from c1 to 2 × c using a 1 × 1 convolution, where c is the number of hidden channels calculated from c1 and the expansion ratio e. Secondly, the output is divided into two parts, a and b, by the Split function, each with c channels. Output b is then processed by multiple PSABlock modules.
As shown in Figure 5, the PSABlock module incorporates a multi-head attention mechanism and a feedforward neural network (FFN). It first processes the input features through the attention mechanism and then further enhances them via the FFN layer. Finally, it decides whether to apply a shortcut connection by adding the input and output, maintaining information flow and mitigating issues such as gradient vanishing. The processed features (B) and the original features (A) are combined through channel concatenation to form a new feature map. A 1 × 1 convolution is applied to the concatenated feature map to adjust the channel dimension from 2 × c back to the input channel number c1, after which the features are output to the next layer.
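A compact sketch of this split-attend-concatenate flow is given below. The attention block uses PyTorch's MultiheadAttention plus a small FFN as a stand-in for the PSABlock, and the channel bookkeeping (c1 → 2c → split → concatenate → c1) follows the description above rather than the exact Ultralytics code.

```python
import torch
import torch.nn as nn

class PSABlockSketch(nn.Module):
    """Multi-head self-attention + FFN with optional shortcuts, applied to flattened tokens."""
    def __init__(self, channels, heads=4, shortcut=True):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(channels, 2 * channels), nn.SiLU(),
                                 nn.Linear(2 * channels, channels))
        self.shortcut = shortcut

    def forward(self, x):
        b, c, h, w = x.shape
        t = x.flatten(2).transpose(1, 2)                 # (B, H*W, C) token sequence
        t = t + self.attn(t, t, t, need_weights=False)[0]
        t = t + self.ffn(t) if self.shortcut else self.ffn(t)
        return t.transpose(1, 2).reshape(b, c, h, w)

class C2PSASketch(nn.Module):
    """1x1 conv to 2c channels -> split into (a, b) -> PSABlocks on b -> concat -> 1x1 conv to c1."""
    def __init__(self, c1, e=0.5, n=1):
        super().__init__()
        c = int(c1 * e)                                  # hidden channels derived from c1 and e
        self.expand = nn.Conv2d(c1, 2 * c, 1, bias=False)
        self.blocks = nn.Sequential(*[PSABlockSketch(c) for _ in range(n)])
        self.squeeze = nn.Conv2d(2 * c, c1, 1, bias=False)

    def forward(self, x):
        a, b = self.expand(x).chunk(2, dim=1)            # part a bypasses attention, part b is processed
        b = self.blocks(b)
        return self.squeeze(torch.cat([a, b], dim=1))

if __name__ == "__main__":
    print(C2PSASketch(128)(torch.randn(1, 128, 20, 20)).shape)  # torch.Size([1, 128, 20, 20])
```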
3.3. An Improved Feature Fusion Network Integrating Dysample and CoTAttention
FPN is a classical feature fusion network that fuses the multi-scale feature maps output by different layers of the feature extraction network through a top-down structure with lateral connections. Specifically, FPN realizes the effective integration of multi-scale features by combining feature maps of different scales: detailed information obtained from the lower layers of the feature extraction network is upsampled along the top-down path and fused layer by layer to obtain high-level feature maps containing rich semantic information. Building on the fusion strategy of FPN, PAFPN proposes a more efficient secondary fusion strategy: it adds a bottom-up path to the top-down structure of FPN and further integrates the features of each layer. The bottom-up path conveys more detailed information and spatial location features, while the top-down path conveys more semantic features. Through this bidirectional feature transfer, PAFPN can integrate multi-scale features more comprehensively, improve feature representation, and capture finer details.
Although PAFPN has achieved good results in image feature fusion, it adopts nearest-neighbor interpolation, which has no feature learning ability, in the upsampling process; this directly enlarges the feature map by “copying” adjacent pixel values and limits the learning ability of the feature fusion network to a certain extent. In order to improve the learning ability of PAFPN, the traditional upsampling method is replaced by the learnable dynamic upsampling method DySample, and the CoTAttention module is added, as shown in Figure 6.
DySample replaces the traditional upsampling method by introducing a dynamic range factor and optimizing the initial sampling positions, thereby generating high-quality upsampled feature maps more flexibly and efficiently [47]. Unlike traditional fixed-grid sampling, DySample adopts a dynamic sampling strategy, enabling the upsampling process to adapt to the feature content and better capture feature details, which enhances reconstruction accuracy.
First, DySample generates a sampling set through a sampling point generator, as shown in Figure 7. The input feature map is sampled at specific points selected by a grid sampling function, and values are extracted from them to achieve resampling. In this process, the input features, upsampled features, generated offsets, and original grid are denoted by X, X′, o, and g, respectively. In the sampling point generator, the sampling set is the sum of the generated offsets and the positions of the original grid, and it has two different implementations, as shown in Figure 8. One is a static range factor version that generates offsets through a linear layer; the other is a dynamic range factor version that first generates a dynamic range factor and then modulates the offsets, where σ is a sigmoid activation function used to dynamically adjust the sampling range factor. Unlike the traditional method, DySample optimizes the initial position distribution through bilinear initialization, which effectively avoids the uneven sampling positions that may occur in traditional methods. In order to better handle boundaries, reduce the overlap of sampling locations, and avoid artifacts, DySample dynamically adjusts the offset range to ensure that the theoretical marginal condition is met between overlapping and non-overlapping sampling locations.
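The point-sampling idea can be sketched roughly as follows: a 1 × 1 convolution predicts per-pixel offsets, which are added to a regular grid and used with grid_sample to resample the input. This is a minimal sketch assuming a fixed range factor; the static/dynamic range-factor variants and the exact bilinear offset initialization of DySample are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DySampleSketch(nn.Module):
    """Rough dynamic upsampler: a 1x1 conv predicts sampling-point offsets, grid_sample resamples."""
    def __init__(self, channels, scale=2, range_factor=0.25):
        super().__init__()
        self.scale = scale
        self.range_factor = range_factor
        # Predict (dx, dy) for every output pixel; pixel_shuffle rearranges them onto the output grid.
        self.offset = nn.Conv2d(channels, 2 * scale * scale, 1)
        nn.init.zeros_(self.offset.weight)          # start close to plain bilinear upsampling
        nn.init.zeros_(self.offset.bias)

    def forward(self, x):
        b, _, h, w = x.shape
        sh, sw = h * self.scale, w * self.scale
        # Offsets in normalized [-1, 1] grid units, bounded by the range factor.
        off = F.pixel_shuffle(self.offset(x), self.scale) * self.range_factor  # (B, 2, sH, sW)
        # Base sampling grid covering the input in normalized coordinates.
        ys = torch.linspace(-1, 1, sh, device=x.device)
        xs = torch.linspace(-1, 1, sw, device=x.device)
        gy, gx = torch.meshgrid(ys, xs, indexing="ij")
        grid = torch.stack((gx, gy), dim=-1).unsqueeze(0).expand(b, -1, -1, -1)  # (B, sH, sW, 2)
        grid = grid + off.permute(0, 2, 3, 1)        # add learned offsets to the regular grid
        return F.grid_sample(x, grid, mode="bilinear", align_corners=True)

if __name__ == "__main__":
    up = DySampleSketch(64)
    print(up(torch.randn(1, 64, 40, 40)).shape)      # torch.Size([1, 64, 80, 80])
```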
The traditional self-attention module dynamically learns the interaction between each element in the sequence by computing the relationships among the Query, Key, and Value, thereby capturing long-range dependencies, enabling parallelization, and providing strong context modeling capabilities. However, because it must store the attention matrix and intermediate results, it incurs relatively high memory consumption when processing high-resolution images or long texts. In contrast, CoTAttention combines convolution operations with the attention mechanism. It generates Keys and Values through convolution and then weights the Values via the attention mechanism to enhance the representativeness of the output feature map, allowing the model to capture richer context information across different positions. The structure is shown in Figure 9.
Let the input be X ∈ R^(H×W×C), and define the Query, Key, and Value as Q = X, K = X, and V = XWv, respectively. First, to provide more context information for each key, a k × k group convolution is applied to all adjacent keys within a k × k grid to obtain the key K1 ∈ R^(H×W×C), which represents the static context relationship between adjacent keys and is regarded as the static context representation of the input X. Then, K1 and the Query are concatenated, and a context attention matrix A is calculated through two consecutive 1 × 1 convolutions:

A = [K1, Q] Wθ Wδ

Here, Wθ is equipped with a ReLU activation function, while Wδ has no activation function. Next, all Values are aggregated according to the contextual attention matrix A to obtain the final dynamic context K2:

K2 = V ⊛ A

where ⊛ denotes the local aggregation operation. Finally, the weighted features are restored to the same size as the input, and the static context K1 and dynamic context K2 are fused through the self-attention mechanism to obtain the final output Y.
The CoTAttention mechanism realizes the fusion of local and global information by combining the convolution operation with the attention mechanism, so that the features at each position can be adjusted according to the contextual information of other positions, improving the representation ability of the model while maintaining high computational efficiency.
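A simplified PyTorch sketch of this contextual attention is shown below: a k × k group convolution produces the static context K1, two 1 × 1 convolutions turn [K1, Q] into an attention map that reweights the Values to give the dynamic context K2, and the two contexts are summed as a lightweight stand-in for the attention-based fusion of the original CoT design. The group count and kernel size are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CoTAttentionSketch(nn.Module):
    """Simplified Contextual Transformer attention: static context (group conv) + dynamic context."""
    def __init__(self, channels, kernel_size=3):
        super().__init__()
        # Static context K1: k x k group convolution over neighbouring keys.
        self.key_embed = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size, padding=kernel_size // 2,
                      groups=4, bias=False),
            nn.BatchNorm2d(channels), nn.SiLU())
        # Value embedding V = X Wv.
        self.value_embed = nn.Conv2d(channels, channels, 1, bias=False)
        # Two consecutive 1x1 convolutions on [K1, Q]: W_theta (with ReLU) then W_delta.
        self.attn = nn.Sequential(
            nn.Conv2d(2 * channels, channels, 1, bias=False),
            nn.BatchNorm2d(channels), nn.ReLU(),
            nn.Conv2d(channels, channels, 1))

    def forward(self, x):
        k1 = self.key_embed(x)                     # static context between adjacent keys
        v = self.value_embed(x)
        a = self.attn(torch.cat([k1, x], dim=1))   # context attention from [K1, Q], with Q = X
        k2 = torch.softmax(a, dim=1) * v           # dynamic context: attention-weighted values
        return k1 + k2                             # fuse static and dynamic context

if __name__ == "__main__":
    print(CoTAttentionSketch(64)(torch.randn(1, 64, 40, 40)).shape)  # torch.Size([1, 64, 40, 40])
```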
3.4. Improved Segmentation Head Combined with DWConv
The YOLOv8-seg segmentation head inherits the architecture of the YOLACT segmentation head and introduces a segmentation branch alongside the object detection branch [48]. The detection branch comprises a regression head, responsible for predicting the coordinates and dimensions of object bounding boxes, and a classification head, tasked with predicting object categories. The regression head utilizes two convolutional layers and an output layer to generate bounding-box parameters, primarily the distribution of center coordinates and size metrics for each pixel. The classification head adopts a parallel architecture, employing two depthwise separable convolutional layers (DWConv) to compute category probabilities for each pixel, as illustrated in Figure 10.
The branch-splitting approach transforms the instance segmentation problem into a weighted combination problem by introducing mask prototypes and a coefficient generation mechanism. Here, mask prototypes serve as the basic templates for generating segmentation masks, representing universal, fixed templates of different parts or shapes of the target objects. Meanwhile, each layer of the input feature map generates mask coefficients through convolutional layers; these coefficients control how the mask prototypes are combined with the feature maps of the input image and determine the segmentation result for each target. The segmentation mask of each target is obtained by multiplying the mask prototypes by the corresponding mask coefficients and performing a weighted combination. The mask coefficients determine the weight of each prototype, while the prototypes define the shape of the mask. The combination method is as follows:

mask_i = Σ_k (mc_{i,k} × proto_k)

where mask_i is the segmentation mask of target i, mc_{i,k} is the k-th mask coefficient corresponding to target i, proto_k is the k-th mask prototype, and the final mask is the weighted sum of these prototypes.
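The weighted combination of prototypes and coefficients can be written directly in code. The sketch below assembles a per-instance soft mask as a linear combination of prototype maps followed by a sigmoid, mirroring the formula above; the tensor shapes and the 0.5 threshold are illustrative.

```python
import torch

def assemble_masks(protos: torch.Tensor, coeffs: torch.Tensor) -> torch.Tensor:
    """Combine mask prototypes with per-instance coefficients (YOLACT-style).

    protos: (K, H, W)  - K prototype masks shared by all instances
    coeffs: (N, K)     - K mask coefficients for each of the N detected instances
    returns: (N, H, W) - one soft mask per instance
    """
    k, h, w = protos.shape
    # mask_i = sigmoid( sum_k coeffs[i, k] * protos[k] )
    masks = coeffs @ protos.reshape(k, h * w)      # weighted sum of prototypes
    return masks.reshape(-1, h, w).sigmoid()

if __name__ == "__main__":
    protos = torch.randn(32, 160, 160)             # e.g., 32 prototypes at reduced resolution
    coeffs = torch.randn(5, 32)                    # coefficients for 5 detected tea shoots
    masks = assemble_masks(protos, coeffs)
    binary = masks > 0.5                           # threshold to obtain final instance masks
    print(masks.shape, binary.dtype)               # torch.Size([5, 160, 160]) torch.bool
```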
4. Experiment
4.1. Datasets and Evaluation Metrics
The experimental images were collected from the Juyuanchun tea garden in Yizheng, Yangzhou, Jiangsu Province, with the target variety being “Longjing 34”. During the research, RGB images of the tender shoots of tea plants were captured from multiple angles in April and August 2023 using the rear camera of an iPhone 13 Pro. To ensure the generalization and robustness of the model, the impact of different shooting angles and methods on the subsequent extraction of tea bud picking points was studied, and the diversity of the collected samples was ensured. The image collection covered different lighting scenarios such as early morning, noon, and evening, and samples were collected from vertical, obliquely downward, backlit, and front-lit angles. Data augmentation [49] was performed on all samples, which not only enriched the samples and improved the accuracy of the algorithm but also reproduced the real shooting environment of machine picking. The images were stored in JPG format. To prevent dataset quality from interfering with the detection performance of the model, the images were visually evaluated: images with blurred targets, without tender tea shoots, or with severely occluded tender shoots were discarded, while images with clear outlines and visible picking points were retained.
In order to improve the generalization ability and robustness of the model, a variety of data augmentation techniques were used during data preprocessing so that the model can adapt to changes in illumination, angle, and background, thereby improving the accuracy and stability of picking point detection.
Firstly, in terms of geometric transformation, the original images were rotated randomly to simulate different shooting angles and improve the adaptability of the model to rotation. At the same time, horizontal and vertical flips were used to enhance the directional diversity of the data and prevent the model from overfitting to direction-specific features. In addition, random cropping retained the main feature areas while increasing sample variation, so that the model can still effectively identify the picking points of tender tea shoots when local information changes.
Secondly, in terms of illumination and color adjustment, the brightness of the images was adjusted to adapt to the lighting conditions of different periods such as early morning, noon, and evening, ensuring that the model maintains good robustness in different lighting environments. At the same time, contrast and saturation were adjusted to optimize the visual difference between the young shoots and the background, improving the adaptability of the model under various light conditions. In addition, color shifting was used to simulate changes in the environmental light source and further improve the stability of the model in complex natural environments.
Finally, in order to enhance the adaptability of the model to occlusion, random occlusion was applied to simulate cases where tender tea shoots are partially covered by leaves or external objects, so that the model can maintain high recognition accuracy when local information is missing.
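Most of these augmentations can be expressed in a few lines with torchvision, as in the sketch below; all probability and magnitude values are illustrative rather than the paper's settings, and random erasing stands in for the random occlusion step. For instance segmentation, the geometric transforms would in practice also need to be applied to the masks and boxes (e.g., with a mask-aware library), which this image-only sketch does not cover.

```python
import torchvision.transforms as T

# Illustrative augmentation pipeline; parameter values are examples, not the paper's settings.
train_transforms = T.Compose([
    T.RandomRotation(degrees=15),                            # simulate different shooting angles
    T.RandomHorizontalFlip(p=0.5),                           # directional diversity
    T.RandomVerticalFlip(p=0.2),
    T.RandomResizedCrop(640, scale=(0.7, 1.0)),              # keep the main feature area, vary samples
    T.ColorJitter(brightness=0.3, contrast=0.3,
                  saturation=0.3, hue=0.05),                 # lighting and colour-shift variation
    T.ToTensor(),
    T.RandomErasing(p=0.3, scale=(0.02, 0.1)),               # random occlusion by leaves/objects
])
```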
To effectively harvest tea buds, tea picking robots usually need to identify the picking points on the tender shoots of tea plants. Based on the growth conditions of tea plants and the national standard GB/T 18650-2008 [50], tender shoots consisting of a single bud or one bud with one leaf are defined as the picking targets. During the data annotation process, Labelme was used to precisely annotate the collected tea plant images to ensure that the subsequent model can accurately learn the feature information of the key areas. The annotation tool interface and label information are shown in Figure 11. In order to improve the generalization ability of the model, the instance segmentation dataset was split in a ratio of 7:2:1, with 6230 images in the training set, 1780 images in the validation set, and 890 images in the test set.
To evaluate the performance of this method, the mean average precision (mAP) is utilized to assess the detection accuracy of the model. We use AP and mAP to measure the model’s performance. The relevant calculation formulas are as follows:

P = TP / (TP + FP)
R = TP / (TP + FN)
AP = ∫₀¹ P(R) dR
mAP = (1/N) Σ AP_i

where P is the precision, R is the recall, TP is the number of positive samples correctly predicted by the model, FP is the number of negative samples wrongly predicted as positive, and FN is the number of positive samples that the model fails to detect.
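For completeness, the following minimal NumPy sketch computes precision, recall, and an AP value from a precision-recall curve. It follows the standard definitions above and is not the exact COCO evaluation code; the example counts and curve values are arbitrary.

```python
import numpy as np

def precision_recall(tp: int, fp: int, fn: int):
    """Standard precision/recall from true positives, false positives, and false negatives."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return p, r

def average_precision(precisions: np.ndarray, recalls: np.ndarray) -> float:
    """Area under the precision-recall curve (AP) via simple numerical integration."""
    order = np.argsort(recalls)
    return float(np.trapz(precisions[order], recalls[order]))

if __name__ == "__main__":
    print(precision_recall(tp=80, fp=10, fn=20))            # (0.888..., 0.8)
    p = np.array([1.0, 0.9, 0.75, 0.6])
    r = np.array([0.2, 0.4, 0.6, 0.8])
    print(round(average_precision(p, r), 3))                # approximate AP for this example curve
```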
4.2. Experimental Environment and Training Parameter Settings
The YOLOv8-TEA model was built using Pytorch 1.12 and trained and tested on a computer equipped with an AMD Ryzen 5 5600 6-Core Processor as the CPU and an NVIDIA GeForce RTX 3070 as the GPU. The development environment was specifically Python 3.8, CUDA 11.3, and CuDNN 8.3.2.
Training is organized in epochs, with a total of 300 epochs and a batch size of 16. The Stochastic Gradient Descent (SGD) optimizer is used, with a learning rate of 0.0025, a momentum of 0.9, and a weight decay of 0.0001. A stepwise learning rate decay strategy is employed, which includes a warm-up stage and several stepwise decay stages [51]. The warm-up stage uses a linear strategy with 500 iterations and a warm-up ratio of 1/3. The milestone list for stepwise decay is [48, 192, 264]. All input images are cropped to a size of 640 × 640.
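The optimizer and schedule described above can be reproduced roughly with standard PyTorch components, as in the sketch below; the model and data loop are placeholders, the decay factor gamma is an assumed value (not reported in the text), and the warm-up is implemented with LinearLR for brevity rather than the exact iteration-based warm-up of the training framework.

```python
import torch.nn as nn
from torch.optim import SGD
from torch.optim.lr_scheduler import LinearLR, MultiStepLR

model = nn.Conv2d(3, 16, 3)              # placeholder for the YOLOv8-TEA network
optimizer = SGD(model.parameters(), lr=0.0025, momentum=0.9, weight_decay=0.0001)

# Linear warm-up from lr/3 to lr over the first 500 iterations (stepped once per iteration).
warmup = LinearLR(optimizer, start_factor=1 / 3, total_iters=500)
# Stepwise decay at epochs 48, 192, and 264 (stepped once per epoch); gamma=0.1 is assumed.
step_decay = MultiStepLR(optimizer, milestones=[48, 192, 264], gamma=0.1)

global_iter = 0
for epoch in range(300):
    for _batch in range(100):            # stand-in for iterating the real data loader
        optimizer.zero_grad()
        # forward pass and loss.backward() would go here, then:
        optimizer.step()
        if global_iter < 500:
            warmup.step()                # iteration-based warm-up
        global_iter += 1
    step_decay.step()                    # epoch-based step decay
```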
4.3. Ablation Experiments
To verify the effectiveness of the different improved modules in the model, seven groups of experiments were designed, as shown in Table 1, to test whether to replace some C2f modules in the backbone feature extraction network with MVB modules, whether to add the C2PSA module to the backbone feature extraction network, whether to replace the ordinary upsampling method in the feature fusion network with dynamic upsampling, whether to add the CoTAttention mechanism to the feature fusion network, and whether to insert depthwise separable convolution into the detection branch of the segmentation head. The precision (P-B), recall (R-B), and mean average precision (mAP-B) of the detection boxes, the precision (P-M), recall (R-M), and mean average precision (mAP-M) of the segmentation masks, and the computational cost (GFLOPs) of the model were evaluated on the test set. The specific results are shown in Table 2.
According to the experimental data in Table 2, after replacing the C2f modules in the backbone network with MVB modules, the computational cost dropped significantly from 111.7 to 51.5 GFLOPs, while the mAP50 for detection and segmentation improved by 1.3% and 1.2%, respectively. Adding the C2PSA module increased the mAP50 for detection and segmentation by 1.4% and 2.2%, respectively. Replacing the ordinary upsampling with dynamic upsampling improved the mAP50 for detection and segmentation by 2.1% and 2.2%, respectively. Introducing the CoTAttention mechanism increased the mAP50 for detection and segmentation by 1.2% and 2.1%, respectively. Replacing the ordinary convolution in the segmentation head with depthwise separable convolution improved the mAP50 for detection and segmentation by 1.8% and 1.3%, respectively. The model YOLOv8-TEA, which integrates all of the above improvements, reduced the computational cost by 53% while improving the mAP50 for detection and segmentation by 2.9% and 2.2%, respectively, compared with the original YOLOv8m-seg, fully verifying the effectiveness of the improvements.
As shown in Figure 12, the mAP (Box) and mAP (Mask) of YOLOv8m-seg and YOLOv8-TEA both steadily increased with the number of iterations and eventually converged. In the early stage of training, the accuracy difference between the two models was small, but as training continued, the performance gap gradually widened, and the improved model demonstrated higher accuracy in both detection and segmentation tasks. Finally, the recognition and segmentation performance of the model on images was further tested. As shown in Figure 13, the proposed instance segmentation model YOLOv8-TEA can accurately recognize and segment tea shoot images under different lighting conditions and with different numbers of tea shoots.
4.4. Comparative Experiments
To further evaluate the recognition and segmentation performance of the algorithm on the tea shoot dataset, YOLOv5m-seg, YOLOv7m-seg, YOLOv11m-seg, YOLACT, and SOLO were selected for comparison. Because of the essential differences in bounding-box dependency and mask prediction mechanisms among the methods, mAP (Mask) more accurately measures the performance of the segmentation task, avoiding interference from detection-box accuracy and thus ensuring the fairness and rationality of the comparison. In this study, the COCO-standard [52] mAP calculation method is used: mAP50 uses an IoU threshold of 0.5 as the judgment criterion and reflects the performance of the model under a relaxed overlap requirement, while mAP50-95 is the average precision calculated over IoU thresholds from 0.5 to 0.95 (with a step size of 0.05) and comprehensively evaluates the robustness of the model at different levels of stringency. Although mAP50 (IoU = 0.5) intuitively reflects the practicality of the model in common scenarios, the introduction of mAP50-95 (IoU = 0.5:0.95) further verifies the potential of the model in high-precision localization tasks.
As shown in Table 3, the proposed model YOLOv8-TEA performs best in terms of segmentation accuracy. Compared with YOLOv11m-seg, YOLOv7m-seg, YOLOv5m-seg, YOLACT, and SOLO, its mAP50 has increased by 4.3%, 10.4%, 15.4%, 25.7%, and 28.9%, respectively, and its mAP50-95 has increased by 12.0%, 17.0%, 22.5%, 29.1%, and 34.8%, respectively. Additionally, the computational complexity of YOLOv8-TEA is only 52.7 GFLOPs, significantly lower than the 129.3 GFLOPs of YOLOv11m-seg and the 153.5 GFLOPs of SOLO, further verifying its efficiency. The experimental results show that YOLOv8-TEA significantly reduces the computational cost while maintaining high accuracy, demonstrating stronger practical application value. At the same time, the inference speed of YOLOv8-TEA is 74.1 FPS, which is significantly higher than the industrial real-time standard (≥30 FPS). The high FPS of YOLOv8-TEA not only meets the requirements of real-time operation but also shows significant advantages in computing speed and adaptation to actual scenes. These results strongly demonstrate the model’s robustness and practical deployment potential.
Figure 14 shows the ground truth and the predicted results of the model. In Figure 14b, boxes of different colors represent the different types of buds and leaves predicted by the model. The verification process not only visually compares the model predictions with the real tea images in Figure 14a, but also strictly tests the fit between the predictions and the ground truth through quantitative indicators, which effectively enhances the credibility of the conclusions about model performance. To display the segmentation effect of each algorithm more intuitively, Figure 15 shows the segmentation results of each method on the same test image, where poor segmentation results are circled in red boxes. All of these models exhibit missed detections, with SOLO suffering the most severe misses. At the same time, YOLOv5m-seg and YOLOv7m-seg still produce incomplete segmentation of tea shoot boundaries, YOLACT struggles to distinguish the segmentation boundary between some background regions and tea shoots, and YOLOv11m-seg has low segmentation accuracy for small shoots. In summary, YOLOv8-TEA not only leads in overall segmentation accuracy but is also more accurate in edge detail processing.
5. Conclusions and Discussions
In this study, the instance segmentation algorithm YOLOv8-TEA is proposed by improving the YOLOv8-seg model: the MobileViT Block replaces part of the C2f modules in the original feature extraction network, and the C2PSA module is added in the last layer to enhance the feature extraction ability. The lightweight dynamic upsampler DySample, which simplifies the upsampling process through point sampling, replaces the original upsampling method, reducing computational complexity and improving inference speed. By adding the CoTAttention module, the advantages of context mining and self-attention learning are exploited simultaneously to improve the representation ability of the deep network. In the segmentation branch, the instance mask generation method of YOLACT is inherited, and depthwise separable convolution is introduced into the classification head to further improve segmentation accuracy and speed. Secondly, an instance segmentation dataset of tea shoots in natural scenes is constructed, including images of tea shoots under different growth states, backgrounds, and lighting conditions. To improve the robustness and generalization ability of the model, data augmentation is performed on the dataset to ensure that it can effectively cope with the various complex changes in actual picking scenes. Finally, the improved algorithm YOLOv8-TEA is experimentally verified using metrics such as P, R, mAP50, and GFLOPs. The results show that the mAP50 (Box) and mAP50 (Mask) of the improved model reach 86.9% and 86.8%, respectively, which are 2.9% and 2.2% higher than those of the base model. Comparative experiments with other instance segmentation algorithms further verify the effectiveness of the model.
While our research primarily focuses on the agricultural domain, specifically tea picking, we recognize that the proposed methods and techniques have potential for wider applications. For example, in forestry management, monitoring the growth of trees is very important; our method can be used to identify and monitor tree shoots and new branches, helping foresters better understand the growth status and health of forests. In industrial vision and defect detection, e.g., for millimeter-scale defects such as bearing roller cracks and circuit board solder joints, the model’s sensitivity to edge details (CoTAttention enhances local feature responses) can enable accurate segmentation of defect contours, while the lightweight design (52.7 GFLOPs) meets the high-speed inspection requirements of production lines (≥45 FPS) and supports real-time defect classification and localization. In the inspection of reflective surfaces such as glass and metal sheets, the global context modeling capability of the MVB module can suppress interference from uneven illumination, and dynamic upsampling can avoid misjudgment of defect boundaries caused by interpolation ambiguity, providing technical support for high-precision quality control.
Although YOLOv8-TEA shows high accuracy and efficiency in the task of tea shoot recognition, limitations remain. First, although the model performs well on the experimental dataset (multiple periods from morning to dusk, multiple angles including backlit and front-lit views), its robustness to extreme weather (such as heavy rain and fog) and severe occlusion (branch coverage > 80%) still needs to be improved. Second, the ability to generalize across varieties is limited: the current model is based on a dataset of a single tea variety, “Longjing 34”, and does not cover the morphological differences of other tea varieties (such as Biluochun and Tieguanyin). Finally, there is the problem of missed detection of small targets. For shoots smaller than 20 × 20 pixels (such as early bud tips), boundary blurring of the segmentation mask is significant, and the missed detection rate is 8.2% higher than that of medium and large targets, which is directly related to the minimum receptive field (20 × 20) design of the feature fusion network.
Future research may focus on solving the occlusion problem of picking points for tender tea shoots and improving the generalization ability of the model. Multi-modal data fusion, such as combining infrared information and multi-view images, can be considered to further improve the algorithm’s ability to segment objects in occluded scenes.