1. Introduction
Tea, as one of the most significant agricultural products globally, occupies a crucial position in world economic and cultural development [1,2]. According to the latest statistical data from Statista [3] and the International Tea Council [4], global tea consumption continues to expand, with annual consumption reaching up to 7.3 billion kilograms. Notably, China, as the world's largest tea producer and exporter, achieved tea export volumes of 374,100 metric tons in 2024.
Tea buds, as the core raw material for tea production, primarily develop at the apex of tea branches and at leaf axils [5]. Characterized by their delicate morphology and soft texture, they contain abundant bioactive compounds, including tea polyphenols, amino acids, and caffeine [6]. As a result, they play a central role in determining tea quality. In tea plantation management, the accurate detection and prediction of tea buds are critical for yield estimation, enabling producers to make informed, data-driven management decisions [7,8,9]. Moreover, due to their fragile epidermal structure, tea buds are particularly vulnerable to pest infestation [10]. Integrating computer vision technologies into tea bud detection could support pest monitoring and proactively mitigate potential economic losses. Traditional practices in tea bud harvesting and classification have relied heavily on manual labor. However, the ongoing expansion of tea cultivation, coupled with rising labor shortages in agriculture, increasingly impedes the scalability and industrial standardization of the tea industry [11,12]. Thus, precise and automated tea bud recognition has become a vital technology for improving yield prediction, pest control, and mechanized harvesting.
In the domain of intelligent tea bud recognition, existing approaches are broadly categorized into two paradigms: traditional image processing and deep learning. Early efforts in the former leveraged low-level visual features. For example, Wang et al. developed a region-growing algorithm based on chromatic and contour features [13]; Wu et al. enhanced detection accuracy through LAB color space analysis combined with K-means clustering [14]; Tang et al. proposed a G-component-based tender leaf screening method coupled with an optimized Otsu segmentation strategy [15]. Additionally, Karunasena's group integrated HOG descriptors with support vector machine classifiers to construct effective traditional models [16]. Wang et al. [17] implemented tea bud extraction using a method combining K-means clustering and image morphological processing. While these methods offered initial successes, they struggled in real-world scenarios. Field conditions—such as dynamic lighting, diverse angles, and meteorological variability—introduce significant variability in tea bud appearance. Compounding this, the high visual similarity between tea buds and surrounding foliage, along with their small size and complex background clutter, renders color- and shape-based segmentation inadequate for robust detection.
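For concreteness, the following minimal Python sketch illustrates the general LAB + K-means segmentation recipe that this family of methods follows (the input path, the choice of three clusters, and the greenest-center selection rule are illustrative assumptions, not the exact pipeline of [14]):

```python
import cv2
import numpy as np

# Illustrative LAB + K-means bud segmentation sketch (assumed parameters).
img = cv2.imread("tea_field.jpg")                     # hypothetical input image
lab = cv2.cvtColor(img, cv2.COLOR_BGR2LAB)
pixels = lab.reshape(-1, 3).astype(np.float32)

# Cluster pixels into three color groups (e.g., buds, mature leaves, soil).
criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 20, 1.0)
_, labels, centers = cv2.kmeans(pixels, 3, None, criteria, 5,
                                cv2.KMEANS_PP_CENTERS)

# Assume the bud cluster has the greenest center: the lowest a* value in
# OpenCV's 0-255 encoding of the a* channel.
bud_cluster = int(np.argmin(centers[:, 1]))
mask = (labels.reshape(lab.shape[:2]) == bud_cluster).astype(np.uint8) * 255

# Morphological opening removes speckle noise, as in classic pipelines.
kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)
cv2.imwrite("bud_mask.png", mask)
```

Such hand-crafted rules make the brittleness noted above tangible: a change in illumination shifts every cluster center, so the selected "bud" cluster can silently drift onto foliage.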
With the advancement of computer vision, deep learning has become the prevailing methodology for tea bud detection. State-of-the-art (SOTA) object detection models now dominate this space, offering significant improvements in estimating tea shoot quantities and locating their positions. However, in practical tea garden environments, tea buds are typically small, densely packed, and frequently occluded by leaves or branches. Environmental variables such as lighting and background complexity further hinder detection reliability. To address these constraints, several enhanced detection frameworks have been proposed, for instance, the Tea-YOLOv8s model, which incorporates deformable convolutions, attention mechanisms, and spatial pyramid pooling to enhance feature representation and robustness [18]. By strengthening the model's feature learning capability for complex-shaped targets, this approach significantly improves the accuracy and reliability of tea shoot detection. Building upon this, Liu et al. developed YOLO-TBD, integrating shallow feature maps with the Path Aggregation Feature Pyramid Network architecture and replacing standard convolutions with self-calibrating group convolutions [19]. Chen et al. [9] introduced a selective kernel attention mechanism and a multi-feature fusion module to capture local and global dependencies, achieving comprehensive and distinctive feature representation. To address multi-scale detection challenges, Lu et al. employed a dynamic head mechanism along with a lightweight C3Ghost module and -CIoU loss function to handle scale variation and speed–accuracy tradeoffs [20]. The RT-DETR-Tea [21] model effectively optimizes deep features through a cascaded group attention mechanism and improves RepC3 using UniRepLKNet, enabling the efficient fusion of tea bud feature information.
However, in application scenarios such as real-time tea shoot monitoring for robotic harvesting, these complex models often fall short. Their high computational demands are incompatible with embedded systems and mobile terminals, where memory and processing resources are limited [22,23,24,25]. To address this, two key strategies have emerged: architectural lightweighting and post-training model compression.
On the one hand, lightweight architectural designs aim to reduce model complexity. For instance, Wang et al. reconstructed the YOLOv5 backbone by integrating lightweight Xception and ShuffleNetV2 modules, supplemented with reverse attention and receptive field blocks to enhance representational efficiency [26]. Similarly, Gui's team employed Ghost convolutions and bottleneck attention modules to reduce computational load, further compressing the model via knowledge distillation and pruning [27]. On the other hand, post hoc methods such as knowledge distillation and pruning provide additional compression. Knowledge distillation transfers knowledge from a teacher network to a smaller student network, while model pruning algorithms structurally eliminate redundant parameters. Both significantly reduce computational complexity and storage requirements while preserving model performance. For example, Zheng et al. introduced the Reconstructed Feature and Dual Distillation framework, leveraging spatial attention-based masking and dual distillation to optimize feature learning in constrained environments [28].
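As a point of reference for the distillation side, the classic soft-target loss can be written in a few lines of PyTorch. This generic sketch assumes a classification-style output and a temperature of 4; it is not the RFDD method of [28]:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      T: float = 4.0) -> torch.Tensor:
    """Soft-target knowledge distillation: pull the student's softened
    class distribution toward the teacher's via KL divergence."""
    log_p_student = F.log_softmax(student_logits / T, dim=-1)
    p_teacher = F.softmax(teacher_logits / T, dim=-1)
    # The T^2 factor keeps this term's gradient magnitude comparable to
    # the hard-label loss it is usually combined with.
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * T * T
```

In practice this term is added to the ordinary task loss with a weighting coefficient, so the student is supervised by both the labels and the teacher's soft predictions.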
This study proposes two core hypotheses: (i) an effective feature fusion architecture enhances the detectability of small tea bud targets; and (ii) analyzing layerwise contributions and then pruning redundant structures reduces computational costs while improving accuracy.
Against this backdrop, we propose TeaBudNet, a novel lightweight detection framework tailored for small object tea bud recognition under complex natural conditions. Our key contributions include the following:
Weight-FPN and P2 Module Design: We present a multi-scale feature fusion architecture that integrates weighted feature propagation and high-resolution shallow features. This significantly enhances the detection accuracy for small and multi-scale tea buds, achieving a +5.0% gain in mAP@50 compared to existing SOTA methods.
Group–Taylor Pruning Strategy: A novel non-global pruning method grounded in Taylor expansion is introduced to structurally compress the model while preserving performance. This approach achieves a 50% reduction in parameter count and a 3% decrease in computational cost, alongside a 0.6% improvement in accuracy.
Real-World Deployment and Dataset Contribution: We construct a benchmark multi-category tea bud dataset and demonstrate real-time deployment capabilities on Huawei Atlas 200I DK A2 edge devices. Field trials confirm the system's practical utility for real-time tea bud monitoring in operational plantations.
2. Materials and Methods
The small size of tea buds and their high similarity to the background pose significant challenges for detection. Furthermore, the computational constraints of edge devices in outdoor environments impose stringent requirements on model size and computational load. To address these issues, this study proposes a novel multi-scale feature fusion framework that enhances the model's capacity to discriminate subtle tea bud features across spatial hierarchies. Additionally, model pruning techniques not only substantially reduce computational overhead but also improve model accuracy. The overall research framework is illustrated in Figure 1.
2.1. Data Processing and Dataset Production
The data collection for this study was conducted from March to October 2024 at Wuben Tea Plantation in Mingshan District, Ya'an City, Sichuan Province, with the specific locations illustrated in Figure 2. To comprehensively capture tea bud information, a multi-source visual acquisition system was employed. This included both fixed and mobile imaging devices: dome-type Changhong CH001 cameras (Sichuan Changhong Electronic Holding Group Co., Ltd., Mianyang, China) were used for stationary capture, while mobile data were collected using smartphones and tablets from multiple brands, thereby ensuring broad device compatibility and heterogeneous image inputs.
In terms of imaging protocol, fixed cameras were installed with downward tilt angles to cover tea plantation plots at two coverage scales (1–2 m² and 5–10 m²). These systems were programmed for automated hourly image capture. In contrast, mobile data collection involved manual photography, with close-up images of tea buds captured every 1–2 weeks. Randomized angles and distances were applied to increase sample variability and mimic real-world visual conditions. This study focused on two representative tea cultivars, namely, Fuding Dabaicha and Sanhua 1951, with collection cycles spanning three tea growing seasons—spring (early March to late April), summer (late May to late July), and autumn (early September to early October).
To maximize data diversity and ecological validity, image acquisition was conducted under a variety of weather conditions (rainy, sunny, and cloudy days) and during multiple time slots between 7:00 and 18:00 daily. This spatiotemporally comprehensive strategy resulted in a robust tea bud image database with continuous phenological coverage across growth cycles.
To facilitate precise model training, rigorous manual annotation of the collected images was performed. Tea bud instances were labeled and saved in plain text format compatible with object detection frameworks. The annotated dataset was subsequently split into training, validation, and test subsets using a 60:20:20 ratio. A detailed summary of the dataset is presented in Table 1.
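A minimal script for such a split might look as follows; the directory layout, the .jpg extension, and the fixed random seed are illustrative assumptions, and each YOLO-style .txt label file is assumed to share its image's base name:

```python
import random
import shutil
from pathlib import Path

random.seed(0)                                   # reproducible shuffling
images = sorted(Path("dataset/images").glob("*.jpg"))
random.shuffle(images)

# 60:20:20 split into training, validation, and test subsets.
n = len(images)
splits = {"train": images[: int(0.6 * n)],
          "val":   images[int(0.6 * n): int(0.8 * n)],
          "test":  images[int(0.8 * n):]}

for name, files in splits.items():
    img_dir = Path("dataset") / name / "images"
    lbl_dir = Path("dataset") / name / "labels"
    img_dir.mkdir(parents=True, exist_ok=True)
    lbl_dir.mkdir(parents=True, exist_ok=True)
    for img in files:
        shutil.copy(img, img_dir / img.name)
        label = Path("dataset/labels") / f"{img.stem}.txt"
        if label.exists():                       # frames without buds have no boxes
            shutil.copy(label, lbl_dir / label.name)
```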
2.2. Deep Learning-Based Algorithm for Tea Bud Detection
2.2.1. TeaBudNet
Real-time object detection has been widely applied in fields such as industrial quality inspection, medical image analysis, and intelligent crop monitoring. The YOLO series, leveraging the unique advantages of its single-stage detection framework, has become a technical benchmark in this domain [29,30]. However, traditional YOLO models primarily rely on Convolutional Neural Networks (CNNs) to build their feature extraction systems. While they excel at balancing speed and accuracy, their inherently local receptive fields limit their capability to model long-range dependencies and integrate global contextual information. Therefore, to better harness the efficacy of attention mechanisms in practical applications, YOLOv12 innovatively proposes an Area Attention mechanism and a Residual Efficient Layer Aggregation Network (R-ELAN) [31]. These effectively mitigate two major technical bottlenecks: high computational complexity and low memory access efficiency. In the tea bud detection task addressed in this paper, this model demonstrates excellent performance across all metrics and is selected as the baseline model for further improvement.
To enhance the detection accuracy of multi-scale tea bud objects while accommodating lightweight deployment requirements, we propose TeaBudNet, a refined detection model built upon YOLOv12. A detailed structural comparison with the baseline model is presented in Figure 3. To address the limitations of conventional YOLO-based detectors in detecting tiny tea buds, TeaBudNet abandons the conventional feature concatenation operations in YOLOv12's neck structure and instead introduces a multi-scale feature fusion module based on a dynamic weighting mechanism (Weight-FPN). Furthermore, an additional P2-level detection head is integrated into the original three-level detection head architecture of YOLOv12, thereby enhancing the model's recognition and localization capabilities for small-scale tea buds. TeaBudNet not only improves detection accuracy but also significantly reduces model size, providing a more precise and lightweight solution for tea bud identification.
Addressing the issues of fine-grained feature loss and inadequate cross-scale modeling, which are common in micro-tea bud detection under natural conditions, this study proposes a systematic architectural optimization strategy. First, to mitigate information degradation during cross-layer feature fusion in traditional Feature Pyramid Networks (FPNs), a multi-scale feature fusion module based on a dynamic weighting mechanism (Weight-FPN) is designed. This module employs a bidirectional cross-scale connection structure and replaces the conventional concatenation operation with a non-linear fusion mechanism. Using 1×1 convolutional kernels for channel alignment and a spatial attention-guided weighting unit, it enables adaptive, multi-level feature integration. Second, recognizing the difficulty of detecting small targets with conventional detection head structures, we introduce a P2 detection layer extension strategy. By reusing the down-sampled feature map from Stage 2 of the backbone network, this layer captures the high-resolution shallow features essential for accurate small target localization. Furthermore, to support lightweight deployment on edge devices, a hierarchical group pruning optimization framework is proposed. This method clusters convolutional kernels via a channel correlation matrix, enabling a structured inter-group importance evaluation. Dynamic L1 regularization promotes intra-group sparsity, while a detection head sensitivity feedback mechanism adaptively tunes pruning thresholds. Collectively, these enhancements allow TeaBudNet to significantly reduce computational cost and model size while outperforming existing SOTA models in tea bud detection accuracy after fine-tuning.
2.2.2. Weight-FPN
To address the challenges of multi-scale tea bud detection, we adopt a cross-scale feature fusion framework based on the Bidirectional Feature Pyramid Network (BiFPN) and extend it through the design of an enhanced Weight-FPN architecture. This formulation enables us to integrate semantically rich features from deeper layers with geometrically detailed features from shallow layers while dynamically balancing their contributions. Specifically, given an input multi-scale feature set $\vec{P}^{in} = (P^{in}_1, P^{in}_2, \ldots)$, where $P^{in}_i$ denotes the input feature at the $i$-th level, the core objective of feature fusion lies in constructing an effective transformation function $f$ to generate an optimized feature set $\vec{P}^{out} = f(\vec{P}^{in})$.
The Feature Pyramid Network (FPN) pioneered a top-down lateral connection mechanism to enable hierarchical feature fusion, laying the groundwork for many modern object detectors. Building on this, the BiFPN introduced bottom-up pathways and assigned a learnable weight $w_i$ to each input, facilitating adaptive, bidirectional feature interactions across scales. As illustrated in its architectural diagram, the fused output at each node is formulated as follows:

$$O = \sum_{i} \frac{w_i}{\epsilon + \sum_{j} w_j} \cdot I_i \quad (1)$$

where $I_i$ is the $i$-th input feature, $w_i \geq 0$ is its learnable fusion weight (kept non-negative by a ReLU), and $\epsilon$ is a small constant that stabilizes the normalization.
In small tea shoot detection, while the high-resolution shallow-layer features preserve the geometric details of tea shoots (e.g., apex curvature and leaf margin demarcation), their limited semantic abstraction capability results in low inter-class discriminability and susceptibility to noise interference. Conversely, deeper features convey a richer semantic context due to hierarchical abstraction but at the cost of spatial resolution, often reducing small objects to minimal or ambiguous activations.
The core optimization of the Weight-FPN structure lies in the introduction of the fusion module, whose detailed architecture is illustrated in Figure 4. This module replaces traditional concatenation with a learnable fusion mechanism that integrates semantic and spatial information from different layers. Specifically, it accepts the feature map from the preceding fusion stage and supplementary features from other scales. Prior to fusion, 1×1 convolutional layers are used for channel dimension alignment. This design enables the model to simultaneously integrate the semantic information from higher-level features and the spatial detail information from lower-level features, thereby significantly enhancing the model's performance on multi-scale tea bud detection tasks.
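The following PyTorch sketch shows our reading of one such fusion node: 1×1 convolutions align channels, inputs are resized to a common resolution, and a learnable non-negative weight vector blends them in the style of BiFPN's fast normalized fusion (Equation (1)). The module name, the nearest-neighbor resizing, and the post-fusion 3×3 convolution are assumptions for illustration rather than the exact TeaBudNet implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightedFusion(nn.Module):
    """Sketch of a Weight-FPN-style node: align, weight, and blend inputs."""
    def __init__(self, in_channels, out_channels, eps=1e-4):
        super().__init__()
        # 1x1 convolutions align every input to a shared channel width.
        self.align = nn.ModuleList(
            [nn.Conv2d(c, out_channels, kernel_size=1) for c in in_channels])
        self.w = nn.Parameter(torch.ones(len(in_channels)))  # one weight per input
        self.eps = eps
        self.post = nn.Sequential(
            nn.Conv2d(out_channels, out_channels, 3, padding=1),
            nn.BatchNorm2d(out_channels),
            nn.SiLU())

    def forward(self, feats):
        # Resize every input to the spatial size of the first one.
        size = feats[0].shape[-2:]
        aligned = [conv(F.interpolate(f, size=size, mode="nearest"))
                   for conv, f in zip(self.align, feats)]
        w = F.relu(self.w)                  # keep fusion weights non-negative
        w = w / (w.sum() + self.eps)        # fast normalized fusion
        fused = sum(wi * fi for wi, fi in zip(w, aligned))
        return self.post(fused)
```

For example, `WeightedFusion(in_channels=[64, 128], out_channels=64)` applied to a stride-4 P2 map and a stride-8 P3 map returns a P2-resolution feature in which the network has learned how much each source should contribute, in place of plain concatenation.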
2.2.3. Model Pruning
In the practical deployment of tea bud detection models, model lightweighting, inference efficiency optimization, and generalization performance enhancement constitute the core technical requirements. Given the computational and memory limitations of edge computing terminals (such as UAV-mounted vision systems, portable intelligent terminals, and distributed vision sensor nodes), we adopt an efficient structured pruning strategy to reduce model complexity. Pruning addresses the parameter redundancy intrinsic to deep neural networks: it systematically removes less contributive weights, channels, or neurons according to structured or unstructured importance criteria, thereby significantly reducing computational complexity and storage overhead while largely preserving model performance.
To train the network via pruning in a way that minimizes the error $E$, the optimization objective is defined as follows: given the parameter set $W$ of a neural network and a dataset $D$ with inputs $x$ and outputs $y$, the optimization objective can be expressed as

$$\min_{W} E(D, W), \qquad E(D, W) = \frac{1}{|D|} \sum_{(x, y) \in D} E(y \mid x, W) \quad (2)$$

Here, $E(D, W)$ denotes the error over the entire dataset $D$, while $E(y \mid x, W)$ represents the error of the output $y$ given input $x$ and network parameters $W$. Pruning seeks the sparsest parameter set that keeps this error low, thereby reducing the network's complexity while largely preserving its performance.
Conventional pruning techniques typically rely on heuristic criteria such as weight magnitude, which may not reliably correlate with a neuron's true contribution. Additionally, these methods often lack support for complex architectures like skip connections, making it difficult for them to accommodate the increasingly diverse architectural requirements of modern deep neural networks [32]. To address these issues, building upon the TeaBudNet architecture, this study proposes a pruning algorithm that uses Taylor expansion for importance estimation. The algorithm estimates importance by computing a first-order Taylor approximation of the loss function. Furthermore, by incorporating gating units after batch normalization layers, it achieves a unified quantification of importance scales across different network layers, thereby effectively overcoming the inherent drawbacks of traditional pruning methods.
To identify the neurons that contribute least to the network, this study quantifies the importance of each neuron. Mathematically, the importance $\mathcal{I}_m$ of neuron $m$, with associated parameter $w_m$, is defined as the squared change in loss upon its removal:

$$\mathcal{I}_m = \left( E(D, W) - E(D, W \mid w_m = 0) \right)^2 \quad (3)$$
Therefore, $E$ can be regarded as a function solely of $w_m$, denoted by $E(w_m)$. In this case, for the univariate function $E(w_m)$, its second-order Taylor expansion in the vicinity of the expansion point $w_m$ can be approximately expressed as

$$E(0) \approx E(w_m) - \frac{\partial E}{\partial w_m} w_m + \frac{1}{2} \frac{\partial^2 E}{\partial w_m^2} w_m^2 \quad (4)$$
Substituting this second-order expansion into Equation (3), we obtain

$$\mathcal{I}_m \approx \left( \frac{\partial E}{\partial w_m} w_m - \frac{1}{2} \frac{\partial^2 E}{\partial w_m^2} w_m^2 \right)^2 \quad (5)$$
Since the second-order expansion retains only terms up to $w_m^2$, all terms of the third order and higher are neglected. Moreover, because evaluating the Hessian term is computationally expensive, only the first-order term is retained in practice; through this Taylor approximation in the parameter neighborhood, an efficient computational formula is derived:

$$\mathcal{I}_m \approx \left( \frac{\partial E}{\partial w_m} w_m \right)^2 \quad (6)$$
The gradient is obtained directly from the intermediate results of backpropagation, requiring no additional computation.
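In code, Equation (6) reduces to an elementwise product and a per-filter reduction over gradients that an ordinary backward pass has already produced. The sketch below scores the output channels of a convolution; aggregating each filter's weights into a single score is our assumed granularity for structured pruning:

```python
import torch

def channel_importance(conv: torch.nn.Conv2d) -> torch.Tensor:
    """First-order Taylor importance per output filter: (sum dE/dw * w)^2."""
    g = conv.weight.grad                            # filled by loss.backward()
    contrib = (g * conv.weight).sum(dim=(1, 2, 3))  # one term per filter
    return contrib.pow(2).detach()                  # Equation (6), per channel
```

After each `loss.backward()`, calling `channel_importance` yields one score per filter; averaging the scores over several mini-batches gives the accumulated estimates used for ranking.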
To address the scale inconsistency in cross-layer pruning (particularly with batch normalization layers and skip connections), we embed a gating variable $z_m$, initialized to 1, at the output of each neuron targeted for pruning, making its importance directly equivalent to the group contribution:

$$\mathcal{I}_m = \left( \frac{\partial E}{\partial z_m} z_m \right)^2 = \left( \frac{\partial E}{\partial z_m} \right)^2 \quad (7)$$
This approach avoids layerwise sensitivity analysis and supports a globally consistent evaluation across layers, including those with skip connections.
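A minimal sketch of this gating idea follows: an identity-initialized per-channel gate placed after a BN layer leaves the forward pass unchanged, while its gradient yields a dimensionless score that is comparable across layers. The class name and its usage are illustrative assumptions:

```python
import torch
import torch.nn as nn

class ChannelGate(nn.Module):
    """Identity gate z (initialized to 1) inserted after a BN layer."""
    def __init__(self, channels: int):
        super().__init__()
        self.z = nn.Parameter(torch.ones(channels))

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (N, C, H, W)
        return x * self.z.view(1, -1, 1, 1)

    def importance(self) -> torch.Tensor:
        # With z = 1, (dE/dz * z)^2 reduces to (dE/dz)^2: Equation (7).
        return self.z.grad.pow(2).detach()
```

Placed as `nn.Sequential(conv, bn, ChannelGate(bn.num_features), act)`, the gate scores of channels coupled by skip connections can simply be summed, which is what makes a single, globally consistent ranking across layers possible.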
Building on this principle, the proposed pruning algorithm employs a progressive removal strategy to maintain training stability. This strategy integrates gradient accumulation with stepwise pruning, followed by dynamic fine-tuning after each pruning step. Consequently, it effectively achieves the model compression objective while preserving model accuracy. The specific pruning process is described in Figure 5.
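The runnable toy below illustrates this progressive schedule on a dummy task: score filters with the first-order Taylor criterion, mask the weakest few, fine-tune briefly, and repeat. Masking stands in for true structural removal, and the model, data, step count, and per-step quota are all illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
model = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
                      nn.Conv2d(32, 10, 3, padding=1))
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
x, y = torch.randn(8, 3, 32, 32), torch.randn(8, 10, 32, 32)  # dummy task

conv = model[0]
pruned = torch.zeros(32, dtype=torch.bool)        # mask of removed filters

def apply_mask():
    with torch.no_grad():                         # keep masked filters at zero
        conv.weight[pruned] = 0
        conv.bias[pruned] = 0

for step in range(3):                             # three progressive steps
    opt.zero_grad()
    F.mse_loss(model(x), y).backward()            # gradients for scoring
    scores = (conv.weight.grad * conv.weight).sum(dim=(1, 2, 3)).pow(2).detach()
    scores[pruned] = float("inf")                 # never re-rank removed filters
    pruned[scores.argsort()[:4]] = True           # drop 4 more filters per step
    apply_mask()
    for _ in range(10):                           # short recovery fine-tuning
        opt.zero_grad()
        F.mse_loss(model(x), y).backward()
        opt.step()
        apply_mask()
```

In the real pipeline, the masked filters and their dependent channels would be physically removed, and the recovery fine-tuning would run on the tea bud training set.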
2.3. System Deployment
The Tea Bud Intelligent Monitoring System comprises the following core components: a Huawei Atlas 200I DK A2 (Huawei Technologies Co., Ltd., Shenzhen, China) development board serving as the master control unit for data processing and system coordination; a 12-megapixel color USB industrial high-definition camera for capturing images of tea leaves in the plantation; a 220 V high-efficiency power supply system ensuring continuous and stable operation in outdoor environments; and a 5G outdoor network module enabling high-speed, stable data transmission. All components are integrated and mounted onto a robust vertical pole and horizontal arm steel structure designed to withstand outdoor environmental conditions (Figure 6).
The monitoring device is installed on a 3 m high vertical pole with a horizontal arm supporting the camera at 2 m above the canopy. Cable routing is optimized with exit ports and universal joints for flexibility.
The base of the device is securely anchored using a robust steel ground cage and a heavy flange plate, ensuring safety and stability under adverse weather conditions. Key control equipment is protected in a stainless steel box at the base, ensuring electromagnetic isolation and environmental shielding.
Following successful validation through laboratory testing, the Tea Bud Intelligent Monitoring System has been deployed in a tea plantation located in a southwestern province of China. The units are strategically positioned at multiple key monitoring points within the plantation and perform uninterrupted 24-hour data acquisition. A total of three units have been deployed in this instance, each covering an area of approximately 20 square meters.