1. Introduction
Tea has become one of the most popular beverages worldwide due to its unique flavor and high nutritional value. According to statistics from 2020, worldwide tea output amounted to about 6.269 million tons, of which China contributed the largest share, reaching about 2.986 million tons [1,2]. With the ongoing consumption upgrade and industrial modernization, there is an increasing demand for standardized raw-material harvesting, bud integrity, and yield estimation in tea production. High-grade teas typically require the plucking of “one bud” or “one bud and one leaf” [3,4]. Therefore, accurate bud recognition, counting, and localization form the foundation for quality assurance and automated harvesting. In recent years, machine vision and intelligent harvesting have received growing attention in tea garden management and automated plucking. While traditional manual harvesting can maintain quality, it is labor-intensive, inefficient, and subject to seasonal labor shortages, which has driven research into mechanized and intelligent harvesting. Some recent studies have explored the integration of edge devices, such as unmanned aerial vehicles (UAVs), with tea-target detection, moving toward an integrated “detection–localization–harvesting” workflow that enables both tea garden inspection and large-scale automated operations [5].
With the rapid advancement of computer vision and artificial intelligence, deep learning techniques have increasingly been applied to problems in agriculture. Leveraging these algorithms, tea-harvesting drones and robotic systems can detect and classify tea leaves more accurately, enhancing both harvesting precision and operational efficiency [6,7]. Region-based Convolutional Neural Networks (R-CNN) [8] were among the first models introduced for object recognition. Subsequently, approaches such as You Only Look Once (YOLO) [9–16], the Single Shot MultiBox Detector (SSD) [17], and their improved variants have been widely adopted in agricultural applications, achieving high accuracy and robustness in tasks including pest and disease detection, weed identification, and crop yield estimation. Sa et al. [18] proposed a fruit detection approach that integrates Faster R-CNN with multispectral imagery, achieving accurate apple detection in orchard environments. Bargoti and Underwood [19] employed Faster R-CNN to detect multiple types of fruit, including apples, mangoes, and oranges, demonstrating strong generalization under natural illumination variation and occlusion. Wang et al. [20] introduced a blueberry maturity recognition method that combines an improved I-MSRCR image enhancement algorithm with a lightweight YOLO-BLBE model, improving both recognition accuracy and detection efficiency for fruits at different maturity stages in complex natural environments. Xiao et al. [21] adopted a hybrid framework that integrates a Transformer model from natural language processing with deep learning techniques to classify apples at different ripeness levels, facilitating the fusion of multimodal data and offering greater flexibility in representation learning and modeling. Appe et al. [22] improved the YOLOv5 architecture by incorporating the Convolutional Block Attention Module (CBAM) for automatic multi-class tomato classification, achieving an average accuracy of 88.1%. Compared with these agricultural tasks, however, tea bud detection poses more unique and complex challenges due to small, densely distributed targets and the high visual similarity between buds and surrounding leaves [23].
To address the unique challenges in tea bud detection, researchers have customized advanced object detection architectures and proposed various specialized models tailored for this task. Yang et al. [23] introduced the RFA-YOLOv8 model, which incorporates the Receptive Field Coordinate Attention Convolution (RFCAConv) module and an improved SPPFCSPC multi-scale feature extraction structure on the YOLOv8 framework, achieving 84.1% mAP@0.5 and 58.7% mAP@[0.5:0.95] on a self-constructed tea bud dataset. Yang et al. [24] proposed a tea bud recognition algorithm based on an improved YOLOv3, which optimizes the network architecture through an image pyramid mechanism, significantly enhancing detection accuracy and robustness under varying poses and occlusion conditions. Wang M. et al. [25] introduced Tea-YOLOv5s, an enhanced YOLOv5s model incorporating ASPP for multi-scale feature extraction, BiFPN for efficient feature fusion, and CBAM for attention refinement; on the tea-shoots dataset, the model outperformed the original YOLOv5s, improving average precision and recall by 4.0 and 0.5 percentage points, respectively. Gui et al. [26] presented YOLO-Tea, which combines a multi-scale convolutional attention module (MCBAM), jointly optimizes anchor boxes using k-means clustering and a genetic algorithm, and employs EIoU loss with Soft-NMS, ultimately achieving 95.2% mean accuracy on the tea bud detection task. Chen et al. [27] proposed RT-DETR-Tea, which introduces cascaded grouped attention, the GD-Tea multi-scale fusion mechanism, and the DRBC3 module, attaining 96.1% accuracy and 79.7% mAP@[0.5:0.95] in multi-variety unstructured tea garden scenarios.
Although these deep learning models have demonstrated significant advantages in tea bud detection, their high computational demands and inference energy consumption hinder deployment on energy-constrained mobile devices. Spiking Neural Networks (SNNs), by simulating the spike-based signaling mechanism of biological neurons, perform computation and transmission only when information changes, thereby substantially reducing energy consumption and making them well suited to energy-limited edge devices and mobile platforms [28]. In recent years, several studies have applied SNNs to object detection, achieving a reasonable balance between energy efficiency and performance. EMS-YOLO [29] was the first SNN model to perform object detection with a direct training strategy, employing surrogate gradients for end-to-end training. The SpikSSD model [30] further introduced a Spike-Based Feature Fusion Module (SBFM) to enhance detection of multi-scale objects. SpikeYOLO [31] proposed a detection framework combining an integer-valued training strategy with spike-driven inference, achieving performance close to that of Artificial Neural Networks (ANNs).
Improving the performance of Spiking Neural Networks in object detection while preserving their inherent energy efficiency has become a major challenge for practical SNN deployment. Researchers have sought more effective encoding schemes that convert static images into spatiotemporal spike sequences suitable for SNN inputs. Rate coding (Van Rullen et al., 2001) [32] represents pixel or feature intensity through spike firing frequency, where higher luminance corresponds to higher neuronal firing rates within a fixed time window. Temporal coding (Comsa et al., 2020) [33] conveys input strength or feature importance via spike timing rather than frequency, encoding information in the temporal position of spikes. Phase coding (Kim et al., 2018) [34] encodes input intensity using spike phase positions within a periodic reference signal (e.g., oscillatory waves). However, these encoding schemes fail to reproduce the intrinsic temporal dynamics of human vision, which is the fundamental inspiration for SNNs. The conventional LIF neuron remains the classic spiking neuron model, mimicking membrane-potential integration and leakage to achieve bio-inspired information processing. Nevertheless, the standard LIF model uses fixed parameters and has a limited dynamic range, making it difficult to accommodate multi-scale features and nonlinear temporal variations in complex visual tasks. To address this, researchers have proposed several enhanced neuron models, including PLIF [35], I-LIF [31], and GLIF [36] neurons, which further improve object detection performance in SNNs. Attention modules allow a network to concentrate on the most informative features while suppressing irrelevant or redundant responses; integrating them into SNN architectures can also reduce spike firing triggered by non-target information, improving detection accuracy while lowering energy consumption. Yao et al. (2021) [37] attached the Squeeze-and-Excitation (SE) module [38] to the temporal input dimension of SNNs, but this method improves performance only on small datasets with shallow networks. Yao et al. [39] extended CBAM [40] into a multi-dimensional attention mechanism and injected it into SNN architectures, revealing the potential of deep SNNs as a general-purpose backbone for diverse applications. However, challenges remain when integrating attention modules into SNNs intended for neuromorphic hardware that supports only sparse addition operations, because computing attention scores typically requires multiplication units to dynamically generate attention weights, which can break the spike-driven nature of SNNs.
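As a concrete illustration of the first of these schemes, rate coding can be sketched as an independent Bernoulli draw per time step, with the normalized pixel intensity as the firing probability. This is a generic sketch; the image size, window length, and normalization are illustrative:

```python
import numpy as np

def rate_encode(image, T, rng):
    """Rate coding: each normalized intensity in [0, 1] is the per-step
    Bernoulli firing probability, giving a binary (T, H, W) spike train."""
    p = np.clip(image, 0.0, 1.0)
    return (rng.random((T,) + p.shape) < p).astype(np.float32)

rng = np.random.default_rng(42)
img = rng.random((16, 16))                 # stand-in for a normalized grayscale patch
spikes = rate_encode(img, T=100, rng=rng)
recon = spikes.mean(axis=0)                # firing rate approximates intensity
print(np.abs(recon - img).mean())          # small reconstruction error
```

Averaging the spike train over the window recovers an approximation of the input, which is why rate coding trades latency (longer windows) for fidelity and carries no intrinsic temporal structure beyond the firing frequency.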
Recently, the SpikeYOLO model proposed by Luo et al. [31] achieved a major breakthrough in applying Spiking Neural Networks (SNNs) to object detection, demonstrating the potential to surpass traditional Artificial Neural Network (ANN) models in detection performance while preserving the inherent low-energy advantages of SNNs, thus offering dual benefits in performance and energy efficiency. Inspired by this, this paper takes SpikeYOLO as the base model and proposes an improved object detection method, GAE-SpikeYOLO, which aims to enable precise tea bud detection and recognition at low energy cost. Specifically, the proposed method integrates Gated Attention Coding (GAC) and the Temporal-Channel-Spatial Attention (TCSA) module to enhance detection of tea buds in complex environments. In addition, EIoU is employed as the bounding box regression loss to achieve faster convergence and more accurate bounding box localization. This method provides a new solution for deploying object detection algorithms on energy-constrained mobile devices while enabling high-precision tea bud detection.
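The EIoU loss mentioned above follows the standard formulation: 1 − IoU plus a normalized center-distance penalty and decoupled width and height penalties measured against the smallest enclosing box. A minimal sketch, in which the (x1, y1, x2, y2) box format and the epsilon are illustrative choices:

```python
def eiou_loss(box1, box2, eps=1e-7):
    """EIoU for axis-aligned boxes in (x1, y1, x2, y2) format:
    1 - IoU, plus center-distance and decoupled width/height penalties."""
    # Intersection and union.
    iw = max(min(box1[2], box2[2]) - max(box1[0], box2[0]), 0.0)
    ih = max(min(box1[3], box2[3]) - max(box1[1], box2[1]), 0.0)
    inter = iw * ih
    w1, h1 = box1[2] - box1[0], box1[3] - box1[1]
    w2, h2 = box2[2] - box2[0], box2[3] - box2[1]
    union = w1 * h1 + w2 * h2 - inter
    iou = inter / (union + eps)
    # Smallest enclosing box.
    cw = max(box1[2], box2[2]) - min(box1[0], box2[0])
    ch = max(box1[3], box2[3]) - min(box1[1], box2[1])
    # Squared distance between box centers.
    d2 = (((box1[0] + box1[2]) - (box2[0] + box2[2])) ** 2
          + ((box1[1] + box1[3]) - (box2[1] + box2[3])) ** 2) / 4.0
    return (1.0 - iou
            + d2 / (cw ** 2 + ch ** 2 + eps)     # center-distance term
            + (w1 - w2) ** 2 / (cw ** 2 + eps)   # decoupled width term
            + (h1 - h2) ** 2 / (ch ** 2 + eps))  # decoupled height term

print(eiou_loss((0.0, 0.0, 2.0, 2.0), (0.0, 0.0, 2.0, 2.0)))  # ~0 for identical boxes
```

Penalizing width and height differences separately, rather than through an aspect-ratio term as in CIoU, gives a direct gradient toward the target size, which is helpful for small targets such as tea buds.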
The main contributions of this paper are summarized as follows:
- (1) This paper presents the first study applying Spiking Neural Networks (SNNs) to tea bud object detection.
- (2) This paper proposes an improved object detection model, GAE-SpikeYOLO, which integrates Gated Attention Coding, Temporal-Channel-Spatial Attention, and the EIoU loss into SpikeYOLO. The proposed model outperforms the original on tea bud detection in complex environments, improving energy efficiency, mAP@0.5, and mAP@[0.5:0.95].
- (3) The dataset used in this paper consists of 7842 tea bud images covering diverse shooting perspectives and complex tea garden scenarios, enabling a comprehensive evaluation of the detection performance of GAE-SpikeYOLO under natural conditions.
4. Discussion
Accurate and automated detection and localization of tea tree buds is of significant importance for the development of intelligent tea bud harvesting systems. In current agricultural production scenarios such as tea plantations, bud picking and maturity assessment still rely primarily on human expertise, which is not only time-consuming and labor-intensive but also prone to subjective error [53]. To reduce human involvement and improve operational efficiency, researchers have recently proposed various computer vision-based crop detection methods, which have shown promising results in tasks such as fruit and vegetable maturity recognition and disease identification [54–57]. However, most existing mainstream methods are based on Artificial Neural Networks (ANNs), whose heavy reliance on floating-point computation makes them costly to deploy on energy-constrained edge devices in terms of both computation and power consumption [58,59]. This limitation is particularly critical in applications requiring real-time inference, such as unmanned aerial vehicle (UAV) inspection and field mobile robots, where conventional ANN models struggle to maintain detection accuracy while controlling energy consumption.
Based on the issues outlined above, this study explores the feasibility of applying Spiking Neural Networks (SNNs) to tea bud detection in natural scenes and proposes an improved detection model, GAE-SpikeYOLO, which integrates the GAC module, the TCSA mechanism, and the EIoU loss function into the original SpikeYOLO model. To evaluate its effectiveness in practical scenarios, a tea bud detection dataset containing 7842 images was used, covering multiple weather conditions, diverse shooting angles, and complex tea plantation environments. On this basis, ten baselines, comprising seven mainstream object detection models, two lightweight detection models, and one Spiking Neural Network model, were systematically compared with the proposed GAE-SpikeYOLO through both quantitative and qualitative evaluations. The experimental results indicate that, under identical datasets, training hyperparameters, and hardware platforms, the baseline SNN model already achieves a favorable balance between energy consumption and detection performance, and the proposed GAE-SpikeYOLO further improves multiple metrics, including mAP, IoU, Precision, Recall, and energy consumption. Specifically, GAE-SpikeYOLO attains a Precision of 83.0%, a Recall of 72.1%, a mAP@0.5 of 81.0%, and a mAP@[0.5:0.95] of 60.4%, with an energy consumption of 49.4 mJ. Compared with the original SpikeYOLO model, this corresponds to improvements of 1.4%, 1.6%, 2.0%, and 3.3% in Precision, Recall, mAP@0.5, and mAP@[0.5:0.95], respectively, and a 24.3% reduction in energy consumption. These results demonstrate that the proposed model substantially reduces energy consumption while maintaining high detection accuracy, making it well suited to energy-constrained edge devices and providing a promising solution for efficient tea bud detection in complex natural environments.
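As an arithmetic sanity check on these figures (assuming the 24.3% reduction is measured relative to the original SpikeYOLO baseline), the implied baseline energy can be back-computed:

```python
# Reported figures from the experiments (assumption: the 24.3% energy
# reduction is relative to the original SpikeYOLO baseline).
improved_energy_mj = 49.4
relative_reduction = 0.243
baseline = improved_energy_mj / (1.0 - relative_reduction)
print(round(baseline, 1))  # implied SpikeYOLO energy, in mJ
```

This implies a baseline of roughly 65 mJ per inference for the unmodified SpikeYOLO, consistent with the reported relative reduction.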
Based on qualitative analysis and ablation experiments, the proposed model demonstrates a clear improvement over SpikeYOLO in detecting small tea bud targets, while exhibiting enhanced robustness under challenging conditions such as occlusion and illumination variation. These performance gains can be attributed to the joint optimization of input encoding, deep feature attention, and bounding box regression. Specifically, the GAC module enhances the representation of partially occluded and low-level tea bud features by selectively amplifying informative spatiotemporal responses and suppressing background-induced spike activity. Moreover, the introduced TCSA module reinforces high-level semantic consistency by adaptively emphasizing semantically relevant regions, thereby reducing false detections caused by leaf occlusion, specular highlights, and background interference. In addition, replacing the CIoU loss with EIoU improves localization stability through decoupled width-height regression, which is advantageous for small and occluded targets. Consequently, the proposed model achieves more accurate and reliable tea bud detection across varying levels of occlusion and illumination conditions.
Compared with existing object detection methods for tea buds [23–27], this study explores how Spiking Neural Networks (SNNs) perform on the tea bud detection task. The improved model not only exhibits significant energy-efficiency advantages but also achieves clear improvements in detection accuracy over the original model, effectively addressing the insufficient detection performance commonly observed when conventional SNNs are applied in practice.
Although this study has achieved promising experimental results, several issues warrant further investigation. Compared with ANN models, the training of SNN models is more sensitive to gradient estimation and parameter updates, which may still lead to false detections under natural disturbances such as extreme lighting conditions.
In future work, we aim to design an attention module specifically for Spiking Neural Networks that completely eliminates floating-point operations while dynamically computing attention weights, further reducing overall energy consumption and improving deployability in energy-constrained environments [44]. We also plan to deploy the proposed model on edge hardware platforms with neuromorphic, brain-like characteristics to evaluate its real-time performance, power consumption, and operational stability in actual tea plantation environments. In addition, taking advantage of the inherent energy efficiency of SNNs, we will explore the scalability of this approach to other crop detection tasks, including apples, bananas, and various other fruits and vegetables.
5. Conclusions
This study focuses on the task of tea bud detection and investigates energy-efficient object detection for energy-constrained environments. To meet the energy-efficient deployment requirements of edge detection devices while maintaining detection performance, the Spiking Neural Network model SpikeYOLO, which has demonstrated strong performance in object detection, was selected as the baseline. On this basis, an energy-efficient object detection model, GAE-SpikeYOLO, was proposed. Comparison with nine mainstream object detection models, including YOLOv6 through YOLOv12, YOLOv8m-MobileNetV4, and YOLOv8s, clarified the advantages and limitations of SpikeYOLO and guided the subsequent improvements. The introduction of Gated Attention Coding (GAC) and the Temporal-Channel-Spatial Attention (TCSA) module significantly enhanced the model’s performance while reducing overall energy consumption, and replacing the bounding box loss function with EIoU brought further gains in detection performance. On the tea bud detection dataset, the proposed model achieved a Precision of 83.0%, a Recall of 72.1%, a mAP@0.5 of 81.0%, a mAP@[0.5:0.95] of 60.4%, and an energy consumption of 49.4 mJ. Compared with the original SpikeYOLO model, this represents improvements of 1.4%, 1.6%, 2.0%, and 3.3% in Precision, Recall, mAP@0.5, and mAP@[0.5:0.95], respectively, with a 24.3% decrease in energy consumption. In addition, ablation experiments demonstrated that each optimization module in the proposed model is effective. Finally, experiments under identical conditions evaluated the model at different time steps; the results identify the time-step setting at which the proposed model achieves the best balance between detection performance and energy efficiency.
In summary, the method presented in this study provides a promising, energy-efficient solution for intelligent tea bud detection in precision agriculture.